Datawhale & Tianchi Used-Car Price Prediction: Task 3 Feature Engineering

This content is licensed under CC 4.0 BY-SA.

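This article walks through the feature-engineering stage of a used-car price prediction competition: outlier handling, feature standardization, data binning, missing-value handling, feature construction, and feature selection, all aimed at improving the model's predictive accuracy.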

Feature Engineering

1 Introduction

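Feature engineering is a key step in data mining. It covers outlier handling, feature normalization and standardization, data binning, missing-value handling, and more.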

2 Code Reference

2.1 Outlier Handling

Load the data

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter

%matplotlib inline
path = './datalab/231784/'
Traindata = pd.read_csv(path+'used_car_train_20200313.csv', sep=' ')
Testdata = pd.read_csv(path+'used_car_testA_20200313.csv', sep=' ')

Inspect Traindata and Testdata

Traindata.head()


Traindata.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
       'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
       'v_13', 'v_14'],
      dtype='object')

Testdata.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
       'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
       'seller', 'offerType', 'creatDate', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4',
       'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13',
       'v_14'],
      dtype='object')

Wrap the outlier handling in a function so it can be reused

def outliers_proc(data, col_name, scale=3):
    """
    Remove outliers from one column using the box-plot (IQR) rule.
    :param data: pandas DataFrame (assumed to have a default RangeIndex)
    :param col_name: name of the column to clean
    :param scale: multiple of the IQR used as the whisker length (default 3)
    :return: a copy of data with the outlier rows dropped and the index reset
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers with the box-plot rule.
        :param data_ser: pandas Series
        :param box_scale: multiple of the IQR used as the whisker length
        :return: (low mask, high mask), (lower bound, upper bound)
        """
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)
    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    # Positions of all rows outside either bound
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index)
    data_n.reset_index(drop=True, inplace=True)
    # Note: shape[0] is the number of remaining rows, despite the wording of the message
    print("Now column number is: {}".format(data_n.shape[0]))
    # Summarize the values below the lower bound
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    # Summarize the values above the upper bound
    index_up = np.arange(data_series.shape[0])[rule[1]]
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())
    
    # Box plots before (left) and after (right) removing the outliers
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n

Apply the outlier handling to Traindata

Traindata = outliers_proc(Traindata, 'power', scale=3)

Delete number is: 963
Now column number is: 149037
Description of data less than the lower bound is:
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: power, dtype: float64
Description of data larger than the upper bound is:
count 963.000000
mean 846.836968
std 1929.418081
min 376.000000
25% 400.000000
50% 436.000000
75% 514.000000
max 19312.000000
Name: power, dtype: float64

Box-Cox transform

from scipy.stats import boxcox

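# Box-Cox needs strictly positive input, so replace missing and zero power values
# with a small positive constant (in the power column only)
Traindata['power'] = Traindata['power'].fillna(0).astype(float)
Traindata.loc[Traindata['power'] == 0, 'power'] = 1e-5
print(Traindata['power'].value_counts())
# With lmbda left unset, boxcox returns (transformed values, fitted lambda)
boxcox_transformeddata, fitted_lambda = boxcox(Traindata['power'])
fig, ax = plt.subplots(1, 2, figsize=(10, 7))

sns.boxplot(data=Traindata['power'], ax=ax[0])

sns.boxplot(data=boxcox_transformeddata, ax=ax[1])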

0.00001 12829
75.00000 9593
150.00000 6495
60.00000 6374
140.00000 5963
101.00000 5537
116.00000 5177
90.00000 4890
170.00000 4791
105.00000 4457
125.00000 2956
136.00000 2813
163.00000 2746
102.00000 2714
143.00000 2435
131.00000 2325
122.00000 2313
54.00000 2293
110.00000 2064
109.00000 2049
50.00000 1751
80.00000 1734
177.00000 1725
120.00000 1660
58.00000 1597
69.00000 1485
115.00000 1316
95.00000 1249
184.00000 1231
68.00000 1207
…
266.00000 2
375.00000 2
23.00000 2
38.00000 2
293.00000 1
336.00000 1
32.00000 1
35.00000 1
332.00000 1
14.00000 1
297.00000 1
282.00000 1
368.00000 1
26.00000 1
352.00000 1
9.00000 1
153.00000 1
229.00000 1
346.00000 1
36.00000 1
366.00000 1
308.00000 1
365.00000 1
19.00000 1
202.00000 1
358.00000 1
319.00000 1
221.00000 1
316.00000 1
348.00000 1
Name: power, Length: 352, dtype: int64

A long tail can be truncated with a log transform, or by capping the outliers at the upper bound of the box plot.
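A minimal sketch of both options on a made-up `power` series. The values and the helper name `clip_to_upper_whisker` are illustrative assumptions, not part of the competition code; the upper bound follows the same Q3 + scale × IQR rule as `outliers_proc`.

```python
import numpy as np
import pandas as pd

def clip_to_upper_whisker(ser, scale=3):
    """Cap values above Q3 + scale * IQR at that bound (hypothetical helper)."""
    q1, q3 = ser.quantile(0.25), ser.quantile(0.75)
    upper = q3 + scale * (q3 - q1)
    return ser.clip(upper=upper), upper

# Made-up values with one extreme outlier
power = pd.Series([60.0, 75.0, 90.0, 110.0, 150.0, 19312.0])

clipped, upper = clip_to_upper_whisker(power)
logged = np.log1p(power)   # log(1 + x), also defined for power == 0

print(upper)
print(clipped.max(), logged.max())
```

Unlike dropping rows, both variants keep every sample, so they can be applied to the test set in the same way.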

2.2 Feature Standardization / Normalization

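# Concatenate the training and test sets so features can be built on both at once
Traindata['train']=1
Testdata['train']=0
data = pd.concat([Traindata, Testdata], ignore_index=True)
# Usage time: data['creatDate'] - data['regDate'], i.e. how long the car has been in use.
# Price generally falls as usage time grows.
# Some dates in the data are malformed, so errors='coerce' is needed to turn them into NaT
data['used_time'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') - 
                            pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days

# Check the missing values: about 15k samples have a problematic date. We can drop them or leave them.
# Dropping is not recommended here, because they are too large a share of the samples (about 7.5%).
# Leaving them is fine: tree models such as XGBoost handle missing values natively.
data['used_time'].isnull().sum()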

15072
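To illustrate what `errors='coerce'` does here, the following sketch uses three made-up rows (not taken from the competition data). The second `regDate` has month "00", so it cannot be parsed and becomes NaT, which in turn leaves `used_time` missing for that row.

```python
import pandas as pd

# Made-up rows: the second regDate has month "00", which is not a valid date
df = pd.DataFrame({
    "regDate":   [20040402, 20070009, 20120103],
    "creatDate": [20160404, 20160309, 20160313],
})

reg = pd.to_datetime(df["regDate"], format="%Y%m%d", errors="coerce")
created = pd.to_datetime(df["creatDate"], format="%Y%m%d", errors="coerce")
df["used_time"] = (created - reg).dt.days

print(df["used_time"])
print(df["used_time"].isnull().sum())
```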

# Extract city information from the region (postal) code, which amounts to adding prior knowledge
data['city'] = data['regionCode'].apply(lambda x : str(x)[:-3])
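A small sketch of what the slice `str(x)[:-3]` produces, using made-up region codes. Note that a code with three or fewer digits yields an empty string, which may need separate handling.

```python
import pandas as pd

# Made-up region codes of different lengths
region_code = pd.Series([1046, 4366, 435, 12])

# Drop the last three characters and keep the leading part as the "city"
city = region_code.apply(lambda x: str(x)[:-3])
print(city.tolist())
```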

# Compute sales statistics per brand; statistics for other features can be computed the same way
# The statistics must be computed from the train data only
Traingb = Traindata.groupby("brand")
allinfo = {}
for kind, kinddata in Traingb:
    info = {}
    # Ignore rows without a positive price
    kinddata = kinddata[kinddata['price'] > 0]
    info['brand_amount'] = len(kinddata)
    info['brand_price_max'] = kinddata.price.max()
    info['brand_price_median'] = kinddata.price.median()
    info['brand_price_min'] = kinddata.price.min()
    info['brand_price_sum'] = kinddata.price.sum()
    info['brand_price_std'] = kinddata.price.std()
    # The +1 in the denominator avoids division by zero for a brand with no valid rows
    info['brand_price_average'] = round(kinddata.price.sum() / (len(kinddata) + 1), 2)
    allinfo[kind] = info
brandfe = pd.DataFrame(allinfo).T.reset_index().rename(columns={"index": "brand"})
data = data.merge(brandfe, how='left', on='brand')
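The loop above could also be written with `groupby(...).agg(...)`. The sketch below is one possible equivalent on a made-up frame (the column values and the name `brand_fe` are assumptions for illustration); it keeps the same `sum / (count + 1)` average.

```python
import pandas as pd

# Made-up training rows; the real Traindata has many more columns
train = pd.DataFrame({
    "brand": [0, 0, 1, 1, 1],
    "price": [1000, 3000, 500, 0, 1500],
})

valid = train[train["price"] > 0]
brand_fe = (
    valid.groupby("brand")["price"]
    .agg(brand_amount="count",
         brand_price_max="max",
         brand_price_median="median",
         brand_price_min="min",
         brand_price_sum="sum",
         brand_price_std="std")
    .reset_index()
)
# Same smoothed mean as the loop: sum / (count + 1)
brand_fe["brand_price_average"] = (
    brand_fe["brand_price_sum"] / (brand_fe["brand_amount"] + 1)
).round(2)

print(brand_fe)
```

The result can be merged back with `data.merge(brand_fe, how='left', on='brand')` exactly as before.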

2.3 Data Binning

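# Data binning, using power as the example
# Note: pd.cut leaves missing values, and values outside the bin edges, as NaN in power_bin
# Why bin at all? There are several reasons:
# 1. After discretization, sparse-vector inner products are faster to compute, and the results are easy to store and extend;
# 2. Discretized features are more robust to outliers: with age>30 mapped to 1 and otherwise 0, an age of 200 does not disturb the model much;
# 3. LR is a generalized linear model with limited expressive power; after discretization each bin gets its own weight, which introduces non-linearity and improves the fit;
# 4. Discretized features can be crossed, turning M+N variables into M*N variables, which adds further non-linearity and expressive power;
# 5. The model is more stable after discretization: with age ranges, a user does not change bin just because they got one year older

# There are other reasons too; LightGBM, for example, added histogram-style binning when improving on XGBoost, which is said to help generalization

bins = [i*10 for i in range(31)]
data['power_bin'] = pd.cut(data['power'], bins, labels=False)
data[['power_bin', 'power']].head()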
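A minimal sketch of how `pd.cut` assigns bins with these edges, on made-up values. The intervals are right-closed by default, so 10 falls into bin 0, while 0, values above 300, and NaN are left as NaN.

```python
import numpy as np
import pandas as pd

bins = [i * 10 for i in range(31)]          # edges 0, 10, ..., 300
power = pd.Series([0, 5, 10, 11, 300, 301, np.nan])

# labels=False returns the integer bin index instead of an Interval
power_bin = pd.cut(power, bins, labels=False)
print(pd.DataFrame({"power": power, "power_bin": power_bin}))
```

If zero should count as a valid value, `include_lowest=True` is one way to pull it into the first bin.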

2.4 Missing-Value Handling

# Drop the columns that are no longer needed
data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)
print(data.shape)
data.columns

(199037, 39)

Index(['SaleID', 'bodyType', 'brand', 'fuelType', 'gearbox', 'kilometer',
       'model', 'name', 'notRepairedDamage', 'offerType', 'power', 'price',
       'seller', 'train', 'v_0', 'v_1', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14',
       'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'used_time',
       'city', 'brand_amount', 'brand_price_average', 'brand_price_max',
       'brand_price_median', 'brand_price_min', 'brand_price_std',
       'brand_price_sum', 'power_bin'],
      dtype='object')

# The data in its current form can already be fed to tree models, so export it
data.to_csv('data_for_tree.csv', index=False)

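2.5 Feature Construction

# We can build a second feature set for models such as LR and NN
# They are built separately because different models place different requirements on the data
# Distribution of power:
data['power'].plot.hist()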

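# The train set has already had its outliers removed, yet the distribution still looks this odd
# because of the power outliers in the test set.
# So it would have been better not to drop the power outliers in train, and to truncate the long tail instead
Traindata['power'].plot.hist()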

# Apply a log transform, then min-max normalize
from sklearn import preprocessing
minmaxscaler = preprocessing.MinMaxScaler()  # instantiated but unused; the scaling below is done by hand
data['power'] = np.log(data['power'] + 1) 
data['power'] = ((data['power'] - np.min(data['power'])) / (np.max(data['power']) - np.min(data['power'])))
data['power'].plot.hist()
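To show why the log transform comes before the scaling, here is a small sketch on made-up `power` values (the helper name `min_max` is an assumption for illustration). Without the log, one extreme value squeezes almost everything else toward 0.

```python
import numpy as np
import pandas as pd

def min_max(x):
    """Scale a Series to [0, 1]; NaN entries stay NaN."""
    return (x - x.min()) / (x.max() - x.min())

power = pd.Series([0.0, 75.0, 150.0, 600.0, 19312.0])   # made-up values

scaled_raw = min_max(power)              # scaling only
scaled_log = min_max(np.log1p(power))    # log(1 + x), then scaling

print(pd.DataFrame({"power": power, "raw": scaled_raw, "log": scaled_log}))
```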

# kilometer looks fairly regular; it has probably been binned already
data['kilometer'].plot.hist()

# Min-max normalize it directly
data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) / 
                        (np.max(data['kilometer']) - np.min(data['kilometer'])))
data['kilometer'].plot.hist()

# We also have the brand statistics built earlier:
# 'brand_amount', 'brand_price_average', 'brand_price_max',
# 'brand_price_median', 'brand_price_min', 'brand_price_std',
# 'brand_price_sum'
# They are not analyzed one by one here; each is min-max scaled directly.
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

for col in ['brand_amount', 'brand_price_average', 'brand_price_max',
            'brand_price_median', 'brand_price_min', 'brand_price_std',
            'brand_price_sum']:
    data[col] = max_min(data[col])

# One-hot encode the categorical features
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
                                     'gearbox', 'notRepairedDamage', 'power_bin'])
print(data.shape)
data.columns

(199037, 370)

Index(['SaleID', 'kilometer', 'name', 'offerType', 'power', 'price', 'seller',
       'train', 'v_0', 'v_1',
       ...
       'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
       'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
       'power_bin_28.0', 'power_bin_29.0'],
      dtype='object', length=370)
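A minimal sketch of what `pd.get_dummies` does to one categorical column, on a made-up frame. By default a missing category gets no column of its own, so that row is all zeros; `dummy_na=True` adds an explicit indicator.

```python
import numpy as np
import pandas as pd

# Made-up rows; the third gearbox value is missing
df = pd.DataFrame({
    "gearbox": [0.0, 1.0, np.nan],
    "price": [1000, 2000, 1500],
})

encoded = pd.get_dummies(df, columns=["gearbox"])
encoded_na = pd.get_dummies(df, columns=["gearbox"], dummy_na=True)

print(encoded.columns.tolist())
print(encoded_na.columns.tolist())
```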

# This version of the data can be used for LR
data.to_csv('data_for_lr.csv', index=False)

2.6 Feature Selection

1. Filter methods

# Correlation analysis
print(data['power'].corr(data['price'], method='spearman'))
print(data['kilometer'].corr(data['price'], method='spearman'))
print(data['brand_amount'].corr(data['price'], method='spearman'))
print(data['brand_price_average'].corr(data['price'], method='spearman'))
print(data['brand_price_max'].corr(data['price'], method='spearman'))
print(data['brand_price_median'].corr(data['price'], method='spearman'))

0.572828519605
-0.408256970162
0.0581566100256
0.383490957606
0.259066833881
0.386910423934
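As a reminder of what `method='spearman'` measures, the sketch below computes it by hand as the Pearson correlation of the ranks, on synthetic data (the variables `x` and `y` are made up). A monotonic but nonlinear relation gives a Spearman value of 1 even though the Pearson value is lower.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(1, 10, size=200))
y = np.exp(x)            # monotonic in x, but strongly nonlinear

pearson = x.corr(y)                       # default method is Pearson
spearman = x.rank().corr(y.rank())        # Pearson correlation of the ranks

print(pearson, spearman)
```

This is why rank correlation is a reasonable filter for skewed targets such as price.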
Or look at the correlations directly as a heatmap

# price is included so that the heatmap actually shows each feature's correlation with it
data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average', 
                     'brand_price_max', 'brand_price_median', 'price']]
correlation = data_numeric.corr()

f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=0.8)

2. Wrapper methods

!pip install mlxtend

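from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
           k_features=10,
           forward=True,
           floating=False,
           scoring = 'r2',
           cv = 0)  # cv=0 disables cross-validation, so R^2 is measured on the fitting data
# Fit on the training rows only, because price is missing for the test rows.
# Keep numeric and boolean columns; LinearRegression cannot take strings such as 'city'.
train = data[data['train'] == 1]
x = train.drop(['price'], axis=1).select_dtypes(include=['number', 'bool'])
x = x.fillna(0)
y = train['price']
sfs.fit(x, y)
sfs.k_feature_names_ 
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()
plt.show()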
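To make the idea behind forward sequential selection concrete, here is a minimal NumPy-only sketch on synthetic data. It is not mlxtend's implementation: the helper names `r2_lstsq` and `forward_select` are hypothetical, the score is plain in-sample R² from least squares, and there is no cross-validation.

```python
import numpy as np

def r2_lstsq(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def forward_select(X, y, k):
    """Greedy forward selection: repeatedly add the column that raises R^2 most."""
    selected = []
    while len(selected) < k:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        scores = {j: r2_lstsq(X[:, selected + [j]], y) for j in remaining}
        selected.append(max(scores, key=scores.get))
    return selected

# Synthetic data: only columns 2 and 4 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=500)

print(forward_select(X, y, k=2))
```

Each round refits the model once per remaining feature, which is why wrapper methods become expensive on wide data such as the 370-column frame above.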

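Feature construction is part of feature engineering; its purpose is to make the data more expressive.
In some competitions the features are anonymized, so we do not know how they relate to one another. In that case we can only work on the features themselves: binning, groupby, agg and similar operations to compute feature statistics, further transforms such as log and exp, arithmetic on several features (such as the usage time computed above), polynomial combinations, and then selection. Anonymity limits much of what can be done with the features, although using an NN to extract features sometimes works surprisingly well.

When the meaning of the features is known (non-anonymous), especially in industrial competitions, more meaningful features are built from signal processing, frequency-domain extraction, kurtosis, skewness and so on; this is feature construction informed by domain background. The same holds in recommender systems: click-through-rate statistics of various kinds, statistics per time period, statistics combined with user attributes, and so on. This kind of feature construction usually requires a deep analysis of the underlying business logic or physical principles in order to find the "magic" features.

Feature engineering is also tied to the model. That is why binning and feature normalization are done for LR and NN, and why the effect of a feature treatment and the importance of a feature usually have to be verified through the model.

In short, feature engineering is easy to get started with but very hard to master.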

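Most of the content above comes from:
By: 阿泽
PS: graduate student in computer science at Fudan University
Zhihu: 阿泽 https://www.zhihu.com/people/is-aze (knowledge notes aimed mainly at beginners)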
