yyhhyy's blog

yyhhyy

建立菲律宾台风损害模型

2023-07-27

建立菲律宾台风损害模型-2023/7/27

客观性

使用过去 20 年 12 场台风的数据,介绍了菲律宾台风造成的建筑物损坏统计模型的开发过程。

构建损坏模型本质是为了预测全国每个城市受损房屋的比例,使用一组与灾害相关的指标(例如风速)和社会经济指标(例如贫困发生率或墙体建筑质量)作为预测因子。

建立损坏模型:模型用于预测城市中受损房屋的比例。

数据来源

损坏模型是根据过去几十年台风造成的建筑物损坏的观测结果建立的(510 Global)。该数据集包含 1,638 个观测值,对应全国 1,034 个城市以及 12 个台风的影响。它有 34 列,对应于台风过后调查的数据以及每个城市对应的其他现场和社会经济指标。

  • 损坏指标:台风造成的损坏以每个城市部分或完全损坏(分别为 PD[Partly damaged (abs.)] 或 CD[Completely damaged (abs.)])的房屋数量来衡量。这些指标也以每个城市房屋总数的百分比形式给出(分别为 pPD[Partly damaged (rel.)] 和 pCD[Completely damaged (rel.)])。其他列对应于根据这两个基本测量计算出的不同指标。

  • 灾害指标:与台风事件相关的变量。

    • Wind speed -- 风速[公里/小时]

    • Distance to typhoon -- 到台风的距离[公里]:台风轨迹到城市质心的最短距离。

    • rainfall -- 降雨量 [毫米]

    • distance_first_impact -- 距首次影响的距离 [km]:台风登陆点到城市质心的距离。

    • Experience -- 经历[-]:本次台风之前,该市所属地区平均抵御台风的次数。

  • 场地和暴露指标:与场地地形特征和市社会经济指标相关的变量。这些仅取决于市政府,而不取决于台风(假设它们在时间上相对恒定)。

    • ruggedness_stdev -- 崎岖度标准[-]:市内​​地形崎岖度的标准偏差。

    • Ruggedness -- 崎岖度 [-]:市内​​平均地形崎岖度。

    • Elevation -- 海拔 [m]:市内的平均海拔。

    • Slope -- 坡度[度]:市内的平均地形坡度。

    • slope_stdev -- 坡度标准[-]:市内​​地形坡度的标准偏差。

    • Population -- 人口数量

    • Population density -- 人口密度[居住区/平方公里]

    • land_area -- 土地面积

    • Poverty incidence -- 贫困发生率

    • % skilled Agriculture/Forestry/Fishermen -- 熟练农业/林业/渔民比例

    • % strong roof type -- 强屋面类型比例

    • % strong wall type -- 强墙型比例

数据预览

# 导入原始数据
data = pd.read_csv(r'data/All.csv')
# 查看台风名字及对应数量
data[['disaster_name']].value_counts()
disaster_name
Haima            301
Rammasun         294
Koppu            203
Bopha            175
Haiyan           162
Melor            105
Kalmaegi          92
Nock-Ten          82
Hagupit           81
Sarika            57
Utor              54
Goni              32
dtype: int64
# 输出摘要统计信息
data.describe()
       Completely damaged (abs.)  Partly damaged (abs.)  \
count                1638.000000            1638.000000   
mean                  582.935897            1138.982295   
std                  1553.413663            2370.653848   
min                     1.000000               1.000000   
25%                     1.000000              14.000000   
50%                    16.000000             179.500000   
75%                   269.500000            1411.500000   
max                 14132.000000           46553.000000   
       Total damaged houses (abs.)  total_damage_houses_0p25weight  \
count                  1638.000000                     1638.000000   
mean                   1721.918193                      867.681471   
std                    3496.164509                     1952.399073   
min                       2.000000                        1.250000   
25%                      19.000000                        7.000000   
50%                     225.000000                       75.500000   
75%                    1962.250000                      694.562500   
max                   58823.000000                    23908.250000   
       Partly damaged (rel.)  Completely damaged (rel.)  \
count            1638.000000                1638.000000   
mean               11.860162                   6.103325   
std                17.152123                  14.385213   
min                 0.000972                   0.001168   
25%                 0.197830                   0.025812   
50%                 2.587918                   0.223214   
75%                18.698752                   3.026831   
max               115.343099                  95.607311   
       Total damaged houses (rel.)  total_damage_houses_0p25weight_perc  \
count                  1638.000000                          1638.000000   
mean                     17.963487                             9.068365   
std                      26.756541                            16.703810   
min                       0.002647                             0.001654   
25%                       0.277203                             0.105430   
50%                       3.259474                             1.104780   
75%                      27.268523                             9.262609   
max                     129.974141                            98.544514   
       ratio_comp_part  Total # of houses   Wind speed  Distance to typhoon  \
count      1638.000000        1638.000000  1638.000000          1638.000000   
mean          0.974340       11767.550218    84.682571            74.899818   
std           3.676039       15013.119760    35.797212            70.071947   
min           0.011246         324.250000    40.000000             0.122506   
25%           0.242304        4700.749997    59.999527            24.399295   
50%           0.579407        8017.625004    80.000000            50.303957   
75%           0.972011       13664.437605   102.693927           104.957121   
max          88.201339      194096.500200   298.464302           347.292212   
          rainfall  distance_first_impact        Slope    Elevation  \
count  1638.000000            1638.000000  1638.000000  1638.000000   
mean    323.920596             197.212283     8.186324   258.570181   
std     155.502152             118.607868     5.716850   321.773277   
min      45.000000               4.357767     0.456908     2.516287   
25%     244.900000             110.800300     3.935659    59.706514   
50%     303.507123             175.939736     7.164828   138.017489   
75%     376.500000             263.415087    11.331911   305.599518   
max    1964.000000            1223.383698    25.229471  1899.717311   
       ruggedness_stdev   Ruggedness  slope_stdev  \
count       1638.000000  1638.000000  1638.000000   
mean          28.602452    41.832893     6.011024   
std           15.179823    28.395457     3.061597   
min            1.629265     3.497199     0.286051   
25%           18.188352    21.022102     3.935674   
50%           29.283881    36.834113     6.375094   
75%           39.950044    57.243654     8.481826   
max           86.703939   128.965009    13.024484   
       Predicted damage class (1-5)     Population    land_area  \
count                           0.0    1638.000000  1638.000000   
mean                            NaN   47065.645910   215.933543   
std                             NaN   60040.329748   226.523048   
min                             NaN    1291.000000     1.856500   
25%                             NaN   18807.000000    80.480600   
50%                             NaN   32108.000000   146.901100   
75%                             NaN   54658.500000   264.230700   
max                             NaN  776382.000000  2428.147100   
       Population density  Poverty incidence  % strong roof type  \
count         1638.000000        1638.000000         1638.000000   
mean           452.178975           0.257932            0.834480   
std           1250.825754           0.150841            0.173800   
min              4.498900           0.008500            0.236160   
25%            115.491400           0.136300            0.740387   
50%            229.109300           0.224950            0.915456   
75%            434.300650           0.358525            0.963024   
max          34384.322200           0.847600            1.000000   
       % strong wall type  % skilled Agriculture/Forestry/Fishermen  \
count         1638.000000                               1617.000000   # 该特征数据缺失
mean             0.589403                                  0.107656   
std              0.206134                                  0.054270   
min              0.053748                                  0.001049   
25%              0.430316                                  0.068022   
50%              0.590947                                  0.103011   
75%              0.776601                                  0.138443   
max              0.952283                                  0.340328   
        Experience                   Bicol region  
count  1637.000000 # 该特征数据丢失  1638.000000  
mean      2.100000                     0.117216  
std       0.870465                     0.321776  
min       1.000000                     0.000000  
25%       1.277778                     0.000000  
50%       2.034483                     0.000000  
75%       2.800000                     0.000000  
max       4.280000                     1.000000  

数据清洗

(1)缺失值检验

# 查看多少空值
data.isna().sum()
# 删除缺失值
data.dropna(inplace=True, subset=['% skilled Agriculture/Forestry/Fishermen', 'Experience'])
disaster_type                                  0
disaster_name                                  0
pcode                                          0
Completely damaged (abs.)                      0
Partly damaged (abs.)                          0
Total damaged houses (abs.)                    0
total_damage_houses_0p25weight                 0
Partly damaged (rel.)                          0
Completely damaged (rel.)                      0
Total damaged houses (rel.)                    0
total_damage_houses_0p25weight_perc            0
ratio_comp_part                                0
Total # of houses                              0
Wind speed                                     0
Distance to typhoon                            0
rainfall                                       0
distance_first_impact                          0
Slope                                          0
Elevation                                      0
ruggedness_stdev                               0
Ruggedness                                     0
slope_stdev                                    0
Predicted damage class (1-5)                1638  # 该属性不做处理因为后续不选择该属性作为特征
Population                                     0
land_area                                      0
Population density                             0
Poverty incidence                              0
% strong roof type                             0
% strong wall type                             0
% skilled Agriculture/Forestry/Fishermen      21
Region                                         0
prov                                           0
Experience                                     1
Bicol region                                   0
dtype: int64

(2)选择特征

根据数据探索分析,此处选择有用特征,剔除无用特征。

# 特征选择
data = data[['disaster_name', 'Total damaged houses (rel.)', 'Wind speed', 'Distance to typhoon', 'rainfall',
             'distance_first_impact', 'Experience',
             'ruggedness_stdev', 'Ruggedness', 'Elevation', 'Slope', 'slope_stdev', 'Population', 'Population density',
             'land_area', 'Poverty incidence', '% skilled Agriculture/Forestry/Fishermen',
             '% strong roof type', '% strong wall type']]

(3)异常值检验

没检查异常值检验前的数据概况:

# 有用特征
features = ['Total damaged houses (rel.)', 'Wind speed', 'Distance to typhoon', 'rainfall',
            'distance_first_impact', 'Experience', 'ruggedness_stdev', 'Ruggedness', 'Elevation',
            'Slope', 'slope_stdev', 'Population', 'Population density', 'land_area', 'Poverty incidence',
            '% skilled Agriculture/Forestry/Fishermen', '% strong roof type', '% strong wall type']
# 直方图
# 显示在同一张画布
def outliers_hist_all(num, df, cols=5, save=False, save_path=None, botton=None, left=None):
    plt.rcParams["font.sans-serif"] = ["SimHei"]  # 设置字体
    plt.rcParams["axes.unicode_minus"] = False  # 该语句解决图像中的“-”负号的乱码问题
    fig = plt.figure(figsize=(16, 16))
    fig.text(0.5, 0.04, botton, ha='center', fontsize=28)
    fig.text(0.04, 0.5, left, va='center', rotation='vertical', fontsize=18)
    rows = int(np.ceil(num / cols))
    for i in range(num):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.hist(df[df.columns[i]], bins=30)
        ax.set_title(df.columns[i])
    if save == True:
        plt.savefig(save_path)
    plt.show()
outliers_hist_all(len(features), data[features], botton='value', left='count')

# 箱线图检验
plt.figure(figsize=(8, 8))
data.boxplot(column=data[features].columns.tolist(), showmeans=True, meanline=True)
plt.xticks(rotation=90)  # 旋转x轴标签,以便更好地显示特征名称
plt.show()

初始柱状图.png

初始箱线图.png

检查并删除异常值。

# 计算每个特征的平均值和标准差
mean = data.iloc[:, 1:].mean()
std = data.iloc[:, 1:].std()
# 定义异常值的阈值(平均值加减3倍标准差)
threshold = 3 * std
# 检查并根据3σ准则删除异常值
data = data[~((data < mean - threshold) | (data > mean + threshold)).any(axis=1)]

检查异常值并检验后的数据概况:

剔除异常值后柱状图.png

剔除异常值后箱线图.png

可以看出,根据三西格玛剔除异常值后直方图的整体分布更加聚集,箱线图中的异常点也减少一定比例。

将目标值缩放到0-1区间中

# 将房屋损坏率转换为0-1
data['Total damaged houses (rel.)'] = data['Total damaged houses (rel.)'] / 100

(4)相关性检验

data_corr = data[features].corr()  # 相关系数矩阵
fig, ax = plt.subplots(figsize=(12, 12))
sns.heatmap(data_corr, annot=True, ax=ax)
plt.show()
data_corr['Total damaged houses (rel.)'].sort_values()

热力图.png

Distance to typhoon                        -0.435464
% strong wall type                         -0.304187
distance_first_impact                      -0.241009
Experience                                 -0.236378
% strong roof type                         -0.206935
Elevation                                  -0.162279
Population density                         -0.085511
Population                                 -0.065211
Slope                                      -0.057425
Ruggedness                                 -0.052857
land_area                                  -0.023486
slope_stdev                                 0.001225
ruggedness_stdev                            0.001823
% skilled Agriculture/Forestry/Fishermen    0.031725
rainfall                                    0.229307
Poverty incidence                           0.270775
Wind speed                                  0.722516
Total damaged houses (rel.)                 1.000000
Name: Total damaged houses (rel.), dtype: float64

(5)多特征关系

myplot.png

将筛选出来的特征与预测变量进行展示特征之间关系,可见变量之间存在一定程度上的相关性。

数据准备

变量变换

使用Box-Cox变换来对潜在的损坏预测变量进行测试,并确定是否需要对变量进行幂律变换,以相对于链接变换的损坏比例标准化变量。

# 对预测变量使用逻辑回归转换 expit(x) = 1 / (1 + exp(-x))
link_fn_td = expit(data['Total damaged houses (rel.)'])


# 将每个特征的最大值 转换成离它最近的半整数
def power_est(bC):
    l_est = bC[bC == max(bC)].iloc[0]
    return round(l_est / 0.5) * 0.5


# 选择需要的特征
var_names = data.columns.tolist()
var_names = [var for var in var_names if
             var not in ["disaster_type", "disaster_name", "pcode", "Total damaged houses (rel.)", "Region", "prov"]]

# 创建一个全为空的dataframe
power_df = pd.DataFrame(np.nan, index=[0], columns=var_names)


# 使用 Box-Cox 变换来测试每个潜在的损坏预测变量,以确定是否需要对变量进行幂律变换,以相对于链接变换的损坏比例标准化变量。
for i, var_i in enumerate(var_names):
    power_df.iloc[0, i] = power_est(boxcox(link_fn_td - np.min(link_fn_td) * 1.01 + 0.01, data[var_i]))

td_data = data.copy()
   Wind speed  Distance to typhoon  rainfall  distance_first_impact  \
0         0.0                  0.0       0.0                    0.0   
   Experience  ruggedness_stdev  Ruggedness  Elevation  Slope  slope_stdev  \
0         0.0               0.0         0.0        0.0    0.0          0.0   
   Population  Population density  land_area  Poverty incidence  \
0         0.0                 0.0        0.0               -1.0   
   % skilled Agriculture/Forestry/Fishermen  % strong roof type  \
0                                      -1.5                -1.0   
   % strong wall type  
0                -1.0  

应用获得的变换:

# 根据power_df中的值对原始数据进行了不同的转换操作,以使得数据符合特定的分布
for i, var_i in enumerate(var_names):
    if power_df.loc[0, var_i] == 0:
        # 通过对数转换(np.log)
        # 可以将数据转换为更接近正态分布的形式,因为对数转换可以压缩较大的值,拉伸较小的值,使数据更加对称和稳定。
        td_data[var_i] = np.log(td_data[var_i])
    else:
        # 通过幂律变换(td_data[var_i] ** power_df.loc[0, var_i])
        # 可以将数据转换为更接近幂律分布的形式,因为幂律变换可以改变数据的分布形状,使其符合幂律分布的特征。
        td_data[var_i] = td_data[var_i] ** power_df.loc[0, var_i]

数据缩放

将除了目标列的特征列标准化

# 标准化
cols_mean_td = td_data[var_names].mean()
cols_sd_td = td_data[var_names].std()
scaled_data_td = (td_data[var_names] - cols_mean_td) / cols_sd_td
td_data[var_names] = scaled_data_td

数据准备后的数据对其以直方图展示:

可见,增提分布特征已经越来越明显。因此,可认为所做的数据变换是有效变换。

探索性数据分析

探索数据以确定预测变量与建筑物损坏比例之间可能的关系。

房屋受损比例:

# 选择有数值的列
numeric_data = td_data.select_dtypes(include='number')

# 数据重塑 将数据从宽格式(wide format)转换为长格式(long format)
melted_data = pd.melt(numeric_data, id_vars=['Total damaged houses (rel.)'],
                      var_name='var', value_name='value')
# 使用seaborn绘制散点图
plt.figure(figsize=(12, 8))
sns.scatterplot(data=melted_data, x='value', y='Total damaged houses (rel.)', hue='var')
plt.title('Scatter plot for each variable')
plt.xlabel('Value')
plt.ylabel('Total damaged houses (rel.)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
features_sc = ['Wind speed', 'Distance to typhoon', 'rainfall',
               'distance_first_impact', 'Experience', 'ruggedness_stdev', 'Ruggedness', 'Elevation',
               'Slope', 'slope_stdev', 'Population', 'Population density', 'land_area', 'Poverty incidence',
               '% skilled Agriculture/Forestry/Fishermen', '% strong roof type', '% strong wall type']
# 散点图
# 创建一个包含所有特征对应的子图标题的列表
titles = [feature for feature in features_sc]
# 计算子图的行数和列数
num_features = len(features_sc)
rows = int(num_features / 4) + 1
cols = min(num_features, 4)
# 创建一个新的图形,并设置图形的大小
fig = plt.figure(figsize=(15, 15))
# 遍历每个特征,创建对应的子图
for i, feature in enumerate(features_sc):
    # 在图形中添加子图,并指定子图的位置
    ax = fig.add_subplot(rows, cols, i + 1)
    # 绘制散点图
    ax.scatter(td_data[feature], expit(td_data['Total damaged houses (rel.)']))
    # 设置子图标题
    ax.set_title(titles[i])
    # 设置x轴和y轴标签
    ax.set_xlabel(feature)
    ax.set_ylabel('Total damaged houses (rel.)')
# 调整子图之间的间距
fig.tight_layout()
# 显示图形
plt.show()

以散点图,可以看出,在一定程度上,经过我们的数据变换使得数据更加集中,这样使得原始数据将会更加有意义。

对于这个变换而言,风速根据其大小可以看到独特的行为。虽然对于较低的风速,可以观察到不少分散且显着的正斜率,但对于较高的风速(例如高于平均水平),这种损害的增量似乎饱和。

其他变量(例如海拔和降雨量)也存在类似的趋势,但由于风速是主要预测变量,因此分析仅限于此。

数据集

损坏房屋的比例:

# 处理过后的数据集展示
td_data.sample(5)
     disaster_name  Total damaged houses (rel.)  Wind speed  \
279        Hagupit                     0.043671   -1.619854   
1610          Utor                     0.166409    0.871446   
107          Bopha                     0.027567    0.202984   
625         Haiyan                     0.777924    2.111691   
1455      Rammasun                     0.117990   -0.640615   
      Distance to typhoon  rainfall  distance_first_impact  Experience  \
279              1.369567 -3.911680               0.510022   -1.149703   
1610            -0.655048  1.501795              -1.785016    1.055731   
107              0.167086  1.160104              -1.089411   -1.545198   
625             -2.225993  0.266232               0.776345   -1.092881   
1455             0.470196 -0.169615              -0.988571    0.460752   
      ruggedness_stdev  Ruggedness  Elevation     Slope  slope_stdev  \
279           0.412075    0.237446  -0.849878  0.107418     0.264466   
1610          0.554183    0.460424   0.871855  0.487643     0.587852   
107           0.947218    1.091489   1.514961  1.032755     0.884493   
625          -0.100165   -0.369386  -0.775195 -0.345895    -0.073127   
1455         -0.072546    0.203466  -0.599278  0.213241     0.015068   
      Population  Population density  land_area  Poverty incidence  \
279    -2.246590           -0.959423  -0.947639          -0.780200   
1610   -0.141595           -0.978254   1.002742          -0.028983   
107     1.307730           -0.582448   1.872402          -0.688154   
625     0.315079            0.381172  -0.152591          -0.440168   
1455   -1.148631            0.322721  -1.425962          -0.815481   
      % skilled Agriculture/Forestry/Fishermen  % strong roof type  \
279                                  -0.315185            1.120590   
1610                                 -0.290428           -0.550206   
107                                  -0.078037            0.167487   
625                                  -0.196568           -0.581216   
1455                                 -0.317290            2.392163   
      % strong wall type  
279             1.268849  
1610           -0.326936  
107             0.723035  
625             0.085402  
1455            0.139717  

检查多重共线性

使用广义线性模型(Generalized Linear Model)来拟合一个多分类的逻辑回归模型,以预测数据集中的'Total damaged houses (rel.)'列的值。

同时检查多重共线性,通过计算方差膨胀因子(VIF)来判断是否存在共线性问题,并在存在共线性问题时(VIF>10 )进行变量的逐步剔除。

# 使用Generalized Linear Model(广义线性模型)来拟合一个多分类的逻辑回归模型,
# 以预测'td_data'中的'Total damaged houses (rel.)'列的值。
td_full = sm.GLM(td_data['Total damaged houses (rel.)'], td_data[var_names], family=Binomial(link=logit())).fit()
td_data.shape
# 检查多重共线性
temp_model = td_full
temp_vif = [variance_inflation_factor(td_data[var_names].values, i) for i in range(len(var_names))]
temp_var = ["dummy"]
temp_data = td_data.copy()

while max(temp_vif) > 10:
    var_remove = temp_data.columns[2:][temp_vif.index(max(temp_vif))]
    temp_var.append(var_remove)
    temp_data = temp_data.drop(columns=[var_remove])
    temp_model = sm.GLM(temp_data['Total damaged houses (rel.)'], temp_data[temp_data.columns[2:]],
                        family=Binomial(link=logit())).fit()
    temp_vif = [variance_inflation_factor(temp_data[temp_data.columns[2:]].values, i) for i in
                range(len(temp_data.columns[2:]))]

temp_vif_with_names = list(zip(temp_data.columns[2:], temp_vif))
pd_temp_vif_with_names = pd.DataFrame(temp_vif_with_names)
pd_temp_vif_with_names.columns = ['Features', 'VIF']
pd_temp_vif_with_names
                                    Features       VIF
0                                 Wind speed  4.024396
1                        Distance to typhoon  3.210826
2                                   rainfall  1.316407
3                      distance_first_impact  1.466195
4                                 Experience  1.509735
5                                  Elevation  3.483213
6                                slope_stdev  3.030725
7                                 Population  1.889770
8                                  land_area  2.336313
9                          Poverty incidence  1.739308
10  % skilled Agriculture/Forestry/Fishermen  1.291850
11                        % strong roof type  1.403770
12                        % strong wall type  1.482686
temp_var
['dummy', 'Population density', 'Slope', 'ruggedness_stdev', 'Ruggedness']
# 更新特征
td_col_to_use = temp_data.columns
td_data = td_data[td_data.columns.intersection(td_col_to_use)]

BorutaShap特征选择

X = td_data[['Wind speed', 'Distance to typhoon', 'rainfall', 'distance_first_impact',
             'Experience', 'Elevation', 'slope_stdev', 'Population', 'land_area',
             'Poverty incidence', '% skilled Agriculture/Forestry/Fishermen',
             '% strong roof type', '% strong wall type']]

y = td_data['Total damaged houses (rel.)']
Feature_Selector = BorutaShap(importance_measure='shap', classification=False)
Feature_Selector.fit(X=X, y=y, n_trials=100, random_state=0)
# 采用shap算法,经过100轮计算后得到的各个特征信息如下:
# 9 attributes confirmed important: ['Experience', 'distance_first_impact', 'Wind speed', 'Poverty incidence', 'Elevation', 'Population', '% strong roof type', 'Distance to typhoon', '% strong wall type']
# 3 attributes confirmed unimportant: ['rainfall', 'land_area', 'slope_stdev']
# 1 tentative attributes remains: ['% skilled Agriculture/Forestry/Fishermen']

数据预处理 最终数据集描述信息:

td_data.describe()
       Total damaged houses (rel.)    Wind speed  Distance to typhoon  \
count                  1405.000000  1.405000e+03         1.405000e+03   
mean                      0.180416  1.011449e-16        -2.124042e-16   
std                       0.258533  1.000000e+00         1.000000e+00   
min                       0.000047 -1.619854e+00        -4.906482e+00   
25%                       0.003059 -6.406151e-01        -4.960408e-01   
50%                       0.034921  5.416582e-02         1.040757e-01   
75%                       0.285269  6.854903e-01         7.271722e-01   
max                       0.951642  2.143221e+00         1.557193e+00   
           rainfall  distance_first_impact    Experience     Elevation  \
count  1.405000e+03           1.405000e+03  1.405000e+03  1.405000e+03   
mean   7.333003e-16          -5.714685e-16  8.091590e-17  2.022897e-17   
std    1.000000e+00           1.000000e+00  1.000000e+00  1.000000e+00   
min   -4.277711e+00          -5.104474e+00 -1.545198e+00 -3.294853e+00   
25%   -4.636419e-01          -5.679346e-01 -9.648785e-01 -6.177726e-01   
50%    9.712316e-02           1.653390e-01  1.362759e-01  6.635912e-02   
75%    6.366046e-01           7.170544e-01  8.923924e-01  7.161769e-01   
max    2.501303e+00           1.766611e+00  1.896988e+00  1.972304e+00   
        slope_stdev    Population     land_area  Poverty incidence  \
count  1.405000e+03  1.405000e+03  1.405000e+03       1.405000e+03   
mean  -8.597314e-17 -5.664113e-16 -1.062021e-16      -4.045795e-17   
std    1.000000e+00  1.000000e+00  1.000000e+00       1.000000e+00   
min   -3.116919e+00 -4.054597e+00 -3.458290e+00      -1.001795e+00   
25%   -1.697487e-01 -6.073934e-01 -6.682982e-01      -6.623434e-01   
50%    3.434245e-01  4.849719e-02  2.027343e-03      -2.661507e-01   
75%    6.702839e-01  6.905744e-01  6.590482e-01       3.871621e-01   
max    1.184968e+00  2.551172e+00  2.195896e+00       9.986928e+00   
       % skilled Agriculture/Forestry/Fishermen  % strong roof type  \
count                               1405.000000        1.405000e+03   
mean                                   0.000000        3.843505e-16   
std                                    1.000000        1.000000e+00   
min                                   -0.365098       -7.386823e-01   
25%                                   -0.263353       -6.261493e-01   
50%                                   -0.187080       -4.621668e-01   
75%                                    0.004608        2.504821e-01   
max                                   29.045222        5.186857e+00   
       % strong wall type  
count        1.405000e+03  
mean         8.091590e-17  
std          1.000000e+00  
min         -7.760501e-01  
25%         -5.753445e-01  
50%         -2.728524e-01  
75%          1.895579e-01  
max          1.396548e+01  

筛选过特征后的数据直方图: