This chapter covers:
The next chapter, "Feature Engineering - 3. Feature Derivation [2/2]", will cover:
Analyzing derived features
Time-series feature derivation
Borrowing ideas from text processing
Feature derivation on the test set
Target encoding
Following the discussion in the previous chapter, "Feature Engineering - 1. Feature Preprocessing", we prepare the data as follows; the rest of this article builds on this.
Sample code:

```python
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 5000)

t = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

category_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
                 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
                 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
target = 'Churn'
ID_col = 'customerID'

t['TotalCharges'] = t['TotalCharges'].apply(lambda x: x if x != ' ' else np.nan).astype(float)
t['MonthlyCharges'] = t['MonthlyCharges'].astype(float)
t['TotalCharges'] = t['TotalCharges'].fillna(0)
t['Churn'].replace(to_replace='Yes', value=1, inplace=True)
t['Churn'].replace(to_replace='No', value=0, inplace=True)

features = t.drop(columns=[ID_col, target]).copy()
labels = t['Churn'].copy()

print(features.head(5))
print('-' * 10)
print(labels)
```

Output:

```
   gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies        Contract PaperlessBilling              PaymentMethod  MonthlyCharges  TotalCharges
0  Female              0     Yes         No       1           No  No phone service             DSL             No          Yes               No          No          No              No  Month-to-month              Yes           Electronic check           29.85         29.85
1    Male              0      No         No      34          Yes                No             DSL            Yes           No              Yes          No          No              No        One year               No               Mailed check           56.95       1889.50
2    Male              0      No         No       2          Yes                No             DSL            Yes          Yes               No          No          No              No  Month-to-month              Yes               Mailed check           53.85        108.15
3    Male              0      No         No      45           No  No phone service             DSL            Yes           No              Yes         Yes          No              No        One year               No  Bank transfer (automatic)           42.30       1840.75
4  Female              0      No         No       2          Yes                No     Fiber optic             No           No               No          No          No              No  Month-to-month              Yes           Electronic check           70.70        151.65
----------
0       0
1       0
2       1
3       0
4       1
       ..
7038    0
7039    0
7040    0
7041    1
7042    0
Name: Churn, Length: 7043, dtype: int64
```
Overview
Definition
Feature derivation, also called feature creation, builds new features from the data we already have.
In the broad sense, feature derivation falls into two classes:
Deriving new features from the dataset alone, without the labels.
For example, extracting the year and month in the time-field section of the previous chapter.
This is called unsupervised feature derivation.
Deriving new features from both the dataset and the labels.
For example, binning a continuous feature according to a decision tree fit in the previous chapter.
This is called supervised feature derivation.
In the narrow sense, "feature derivation" refers to unsupervised derivation only; supervised derivation is also known as target encoding.
Essence
Whether we are processing a time field or binning a continuous feature with a decision tree, we are not creating information. We are rearranging information that is already there, so that patterns buried deep in the data surface more readily.
That is the essence of feature derivation: rearranging existing information.
Methods
There are many derivation methods and many ways to classify them.
One split is whether the method ignores or leans on the business context:
Ignoring the business context
Use simple brute-force techniques to create a huge pool of features, then select the useful ones for modeling.
Features created this way generally lack business meaning and interpretability.
With modern compute, however, this approach is very efficient.
Leaning on the business context
Hand-craft features one by one.
Features created this way generally carry strong business meaning and interpretability.
But the process is slow, because the analysis and screening are manual.
Experience
Besides the variety of methods, feature derivation has another property: the number of possible features is unlimited.
(Even with a single feature we can derive endlessly: its square, its cube, its normalized value, the square of the normalized value, the normalization of the square, and so on.)
But not every derived feature helps the model train better, and too many features badly hurt modeling efficiency.
So in practice we need some judgment about when to derive features and in which direction.
When, then, and in which direction?
There is no definitive answer.
Only scattered bits of theory exist here (polynomial derivation, for instance, borrows the idea behind kernel functions); feature derivation as a whole has no complete, systematic theory.
More often than not, the choice of derivation direction comes down to experience.
Single-Feature Derivation
Data Recoding
We covered data recoding in the previous chapter.
Recoding categorical fields:
Ordinal encoding
One-Hot encoding
Recoding continuous fields:
Standardization (Z-score)
Normalization (0-1 scaling)
Binning (discretization)
Equal-width binning
Equal-frequency binning
Clustering-based binning
Supervised binning
Log transformation
The "Practical Notes" sections of the previous chapter also discuss which direction to take with these.
Polynomial
Polynomial derivation builds new features as polynomials of existing ones.
With only one feature, what does a polynomial look like?
For a single feature, polynomial derivation simply means taking its square, cube, and higher powers.
Derivation Process
Single-feature polynomial derivation (original feature X1; derived features X1^2 through X1^5):

| ID | X1 | X1^2 | X1^3 | X1^4 | X1^5 |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 4 | 8 | 16 | 32 |
| 3 | 4 | 16 | 64 | 256 | 1024 |
| 4 | 1 | 1 | 1 | 1 | 1 |
| 5 | 3 | 9 | 27 | 81 | 243 |
Implementation
We use PolynomialFeatures from sklearn.
Sample code:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([0, 1, 2, 4, 1, 3]).reshape(-1, 1)
print(x)

x = PolynomialFeatures(degree=5).fit_transform(x)
print(x)
```

Output:

```
[[0]
 [1]
 [2]
 [4]
 [1]
 [3]]
[[1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00]
 [1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00]
 [1.000e+00 2.000e+00 4.000e+00 8.000e+00 1.600e+01 3.200e+01]
 [1.000e+00 4.000e+00 1.600e+01 6.400e+01 2.560e+02 1.024e+03]
 [1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00]
 [1.000e+00 3.000e+00 9.000e+00 2.700e+01 8.100e+01 2.430e+02]]
```
Notes:
include_bias: whether to include the bias column (the 0th power); defaults to True.
degree: the highest power.
Aside: one of the inputs above is 0, and 0^0 is reported as 1. In mathematics the value of 0^0 is actually debated: some conventions take 0^0 = 1, some take 0^0 = 0, and others leave 0^0 undefined.
This article follows the convention 0^0 = 1.
Practical Notes
Single-feature polynomial derivation is not limited to continuous features; in some cases it also works on ordinal categorical features. For example, if 1 means "minor", 2 means "severe" and 3 means "critical", squaring maps them to 1, 4 and 9, widening the gaps between levels.
Two-feature (or multi-feature) polynomial derivation usually works better than the single-feature version.
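The ordinal squaring described above can be sketched in a few lines; the severity column here is a made-up ordinal feature, not part of the Telco dataset:

```python
import pandas as pd

# Hypothetical ordinal feature: 1 = "minor", 2 = "severe", 3 = "critical"
df = pd.DataFrame({'severity': [1, 2, 3, 2, 1]})

# Squaring stretches the gaps between adjacent levels (1 -> 1, 2 -> 4, 3 -> 9),
# letting a linear model weight the higher severities more heavily
df['severity_sq'] = df['severity'] ** 2
```

The transformation preserves the ordering of the levels while changing their relative spacing.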
Two-Feature Derivation
Two-feature derivation combines two features into new ones.
Arithmetic Operations
Derivation Process
Two-feature arithmetic derivation (original features X1, X2; derived features X1-X2, X1+X2, X1*X2, X1/X2):

| ID | X1 | X2 | X1-X2 | X1+X2 | X1*X2 | X1/X2 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | -1 | 2 | 0 | -1 | -1 |
| 2 | 2 | 3 | -1 | 5 | 6 | 0.666666667 |
| 3 | 4 | 5 | -1 | 9 | 20 | 0.8 |
| 4 | 6 | -2 | 8 | 4 | -12 | -3 |
| 5 | 8 | 1 | 7 | 9 | 8 | 8 |

Implementation
Sample code:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
print(df)

df['A+B'] = df['A'] + df['B']
df['A-B'] = df['A'] - df['B']
df['A/B'] = df['A'] / df['B']
df['A*B'] = df['A'] * df['B']
print(df)
```

Output:

```
   A  B
0  1  5
1  2  6
2  3  7
3  4  8
   A  B  A+B  A-B       A/B  A*B
0  1  5    6   -4  0.200000    5
1  2  6    8   -4  0.333333   12
2  3  7   10   -4  0.428571   21
3  4  8   12   -4  0.500000   32
```
Practical Notes
Arithmetic derivation is worth considering in two situations:
Creating supplementary fields with clear business meaning.
For example, total charges divided by tenure gives the user's average monthly spend; monthly spend divided by the number of purchased services gives the average price per service.
Using arithmetic derivation in a second round of derivation, as a supplement to the information already derived.
"Second-round derivation" means deriving again on top of features produced by a first round.
Some references call this "second-order" or "secondary" feature derivation, which is easy to confuse with the order (degree) of terms in polynomial derivation. To keep the two apart, this article says "second-round feature derivation".
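As a sketch of the first scenario, average monthly spend from the Telco column names (a hypothetical three-row slice; the epsilon term is an assumption added to avoid dividing by tenure = 0):

```python
import pandas as pd

# Hypothetical slice reusing the Telco column names
df = pd.DataFrame({'tenure': [1, 34, 2],
                   'TotalCharges': [29.85, 1889.50, 108.15]})

# Average monthly spend; the small epsilon guards against tenure == 0
df['avg_monthly_charges'] = df['TotalCharges'] / (df['tenure'] + 1e-5)
```

The derived column has an obvious business reading, which also makes the model easier to explain afterwards.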
Cross Combination
Derivation Process
Categorical features are crossed with each other to derive new features.
Two-feature cross-combination derivation (original features SeniorCitizen, Dependents; one derived dummy feature per combination):

| ID | SeniorCitizen | Dependents | (0,0) | (0,1) | (1,0) | (1,1) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 1 | 0 |
Implementation
We take SeniorCitizen, Partner and Dependents as examples and cross them pairwise.
Naming the New Features

```python
col_names = ['SeniorCitizen', 'Partner', 'Dependents']

for col_index, col_name in enumerate(col_names):
    for col_sub_index in range(col_index + 1, len(col_names)):
        new_names = col_name + '_' + col_names[col_sub_index]
        print(new_names)
```

Output:

```
SeniorCitizen_Partner
SeniorCitizen_Dependents
Partner_Dependents
```
Generating the New Features

```python
col_names = ['SeniorCitizen', 'Partner', 'Dependents']

col_names_new_l = []
features_new_l = []

for col_index, col_name in enumerate(col_names):
    for col_sub_index in range(col_index + 1, len(col_names)):
        new_names = col_name + '_' + col_names[col_sub_index]
        col_names_new_l.append(new_names)
        new_df = pd.Series(data=features[col_name].astype('str') + '_' + features[col_names[col_sub_index]].astype('str'),
                           name=col_name)
        features_new_l.append(new_df)

print(col_names_new_l)
print(features_new_l)
```

Output:

```
['SeniorCitizen_Partner', 'SeniorCitizen_Dependents', 'Partner_Dependents']
[0       0_Yes
1        0_No
2        0_No
3        0_No
4        0_No
        ...  
7038    0_Yes
7039    0_Yes
7040    0_Yes
7041    1_Yes
7042     0_No
Name: SeniorCitizen, Length: 7043, dtype: object, 0       0_No
1       0_No
2       0_No
3       0_No
4       0_No
       ...  
7038   0_Yes
7039   0_Yes
7040   0_Yes
7041    1_No
7042    0_No
Name: SeniorCitizen, Length: 7043, dtype: object, 0      Yes_No
1       No_No
2       No_No
3       No_No
4       No_No
        ...  
7038  Yes_Yes
7039  Yes_Yes
7040  Yes_Yes
7041   Yes_No
7042    No_No
Name: Partner, Length: 7043, dtype: object]
```
Concatenating the New Features

```python
features_new = pd.concat(features_new_l, axis=1)
features_new.columns = col_names_new_l
print(features_new)
```

Output:

```
     SeniorCitizen_Partner SeniorCitizen_Dependents Partner_Dependents
0                    0_Yes                     0_No             Yes_No
1                     0_No                     0_No              No_No
2                     0_No                     0_No              No_No
3                     0_No                     0_No              No_No
4                     0_No                     0_No              No_No
...                    ...                      ...                ...
7038                 0_Yes                    0_Yes            Yes_Yes
7039                 0_Yes                    0_Yes            Yes_Yes
7040                 0_Yes                    0_Yes            Yes_Yes
7041                 1_Yes                     1_No             Yes_No
7042                  0_No                     0_No              No_No

[7043 rows x 3 columns]
```

That is, we now have 3 new categorical features, each with 4 possible values.
One-Hot Encoding
Above we created 3 categorical features, each with 4 possible values. Next we One-Hot encode them.
Sample code:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit_transform(features_new)

features_new_af = pd.DataFrame(enc.fit_transform(features_new).toarray(),
                               columns=cate_col_name(enc, col_names_new_l, skip_binary=True))
print(features_new_af.head(5))
```

Output:

```
   SeniorCitizen_Partner_0_No  SeniorCitizen_Partner_0_Yes  SeniorCitizen_Partner_1_No  SeniorCitizen_Partner_1_Yes  SeniorCitizen_Dependents_0_No  SeniorCitizen_Dependents_0_Yes  SeniorCitizen_Dependents_1_No  SeniorCitizen_Dependents_1_Yes  Partner_Dependents_No_No  Partner_Dependents_No_Yes  Partner_Dependents_Yes_No  Partner_Dependents_Yes_Yes
0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
1  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
2  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
3  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
4  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
```

cate_col_name is the helper function defined in the previous chapter.
This completes the pairwise cross derivation of the given features.
Function Wrapper

```python
def binary_cross_combination(col_names: list, features: pd.DataFrame, one_hot: bool = True):
    """
    Pairwise cross-combination derivation for categorical features.

    :param col_names: names of the columns to cross
    :param features: original dataset
    :param one_hot: whether to One-Hot encode the result
    :return: the derived features and their column names
    """
    col_names_new_l = []
    features_new_l = []

    features = features[col_names]

    for col_index, col_name in enumerate(col_names):
        for col_sub_index in range(col_index + 1, len(col_names)):
            new_names = col_name + '_' + col_names[col_sub_index]
            col_names_new_l.append(new_names)
            new_df = pd.Series(
                data=features[col_name].astype('str') + '_' + features[col_names[col_sub_index]].astype('str'),
                name=col_name)
            features_new_l.append(new_df)

    features_new = pd.concat(features_new_l, axis=1)
    features_new.columns = col_names_new_l
    col_names_new = col_names_new_l

    if one_hot:
        enc = OneHotEncoder()
        enc.fit_transform(features_new)
        col_names_new = cate_col_name(enc, col_names_new_l, skip_binary=True)
        features_new = pd.DataFrame(enc.fit_transform(features_new).toarray(), columns=col_names_new)

    return features_new, col_names_new
```

Sample code:

```python
col_names = ['SeniorCitizen', 'Partner', 'Dependents']
features_new, col_names_new = binary_cross_combination(col_names, features)
print(features_new.head(5))
print(col_names_new)
```

Output:

```
   SeniorCitizen_Partner_0_No  SeniorCitizen_Partner_0_Yes  SeniorCitizen_Partner_1_No  SeniorCitizen_Partner_1_Yes  SeniorCitizen_Dependents_0_No  SeniorCitizen_Dependents_0_Yes  SeniorCitizen_Dependents_1_No  SeniorCitizen_Dependents_1_Yes  Partner_Dependents_No_No  Partner_Dependents_No_Yes  Partner_Dependents_Yes_No  Partner_Dependents_Yes_Yes
0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
1  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
2  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
3  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
4  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
['SeniorCitizen_Partner_0_No', 'SeniorCitizen_Partner_0_Yes', 'SeniorCitizen_Partner_1_No', 'SeniorCitizen_Partner_1_Yes', 'SeniorCitizen_Dependents_0_No', 'SeniorCitizen_Dependents_0_Yes', 'SeniorCitizen_Dependents_1_No', 'SeniorCitizen_Dependents_1_Yes', 'Partner_Dependents_No_No', 'Partner_Dependents_No_Yes', 'Partner_Dependents_Yes_No', 'Partner_Dependents_Yes_Yes']
```
To join the result back to the original data, a plain concat() is enough. Sample code:

```python
df_temp = pd.concat([features, features_new], axis=1)
print(df_temp.head(5))
```

Output:

```
   gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies        Contract PaperlessBilling              PaymentMethod  MonthlyCharges  TotalCharges  SeniorCitizen_Partner_0_No  SeniorCitizen_Partner_0_Yes  SeniorCitizen_Partner_1_No  SeniorCitizen_Partner_1_Yes  SeniorCitizen_Dependents_0_No  SeniorCitizen_Dependents_0_Yes  SeniorCitizen_Dependents_1_No  SeniorCitizen_Dependents_1_Yes  Partner_Dependents_No_No  Partner_Dependents_No_Yes  Partner_Dependents_Yes_No  Partner_Dependents_Yes_Yes
0  Female  0  Yes  No   1   No  No phone service          DSL   No  Yes   No   No  No  No  Month-to-month  Yes           Electronic check  29.85    29.85  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
1    Male  0   No  No  34  Yes                No          DSL  Yes   No  Yes   No  No  No        One year   No               Mailed check  56.95  1889.50  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
2    Male  0   No  No   2  Yes                No          DSL  Yes  Yes   No   No  No  No  Month-to-month  Yes               Mailed check  53.85   108.15  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
3    Male  0   No  No  45   No  No phone service          DSL  Yes   No  Yes  Yes  No  No        One year   No  Bank transfer (automatic)  42.30  1840.75  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
4  Female  0   No  No   2  Yes                No  Fiber optic   No   No   No   No  No  No  Month-to-month  Yes           Electronic check  70.70   151.65  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
```
Practical Notes
Crossing two categorical features is the most common derivation method; it greatly enriches how the dataset's information is presented, and should usually be considered first.
If the categorical features have many levels, the number of combinations grows multiplicatively (the product of the level counts), and the derived feature matrix becomes very sparse, providing little useful signal for training.
There is no precise rule for how many new features is appropriate; it depends on the case.
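The multiplicative growth is easy to quantify: crossing two features with m and n levels yields up to m * n combinations. A sketch with two hypothetical level lists:

```python
from itertools import product

# Two hypothetical categorical features with 10 and 12 levels each
levels_a = ['a%d' % i for i in range(10)]
levels_b = ['b%d' % i for i in range(12)]

# Their cross-combination can produce up to 10 * 12 = 120 dummy columns
combos = list(product(levels_a, levels_b))
```

Crossing a third such feature into the mix would multiply the count again, which is why level counts matter before crossing.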
Grouped Statistics
Derivation Process
Group the data by feature A, then compute statistics of feature B within each group.
Two-feature grouped-statistics derivation (MonthlyCharges statistics within each tenure group):

| ID | tenure | MonthlyCharges | mean | min | max |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 3 | 5 | 3 | 7 |
| 2 | 2 | 10 | 11 | 10 | 12 |
| 3 | 3 | 15 | 15 | 15 | 15 |
| 4 | 2 | 12 | 11 | 10 | 12 |
| 5 | 1 | 7 | 5 | 3 | 7 |
Implementation
We take tenure, SeniorCitizen and MonthlyCharges as examples.
First extract the target fields into their own frame. Sample code:

```python
col_names = ['tenure', 'SeniorCitizen', 'MonthlyCharges']
features_temp = features[col_names]
print(features_temp.head(5))
```

Output:

```
   tenure  SeniorCitizen  MonthlyCharges
0       1              0           29.85
1      34              0           56.95
2       2              0           53.85
3      45              0           42.30
4       2              0           70.70
```
groupby Usage
groupby partitions the data into groups, after which various aggregations can be applied.
Computing Means
Group by tenure and compute the mean of the other features. Sample code:

```python
print(features_temp.groupby('tenure').mean())
```

Output:

```
        SeniorCitizen  MonthlyCharges
tenure                               
0            0.000000       41.418182
1            0.140294       50.485808
2            0.180672       57.206303
3            0.125000       58.015000
4            0.147727       57.432670
...               ...             ...
68           0.130000       73.321000
69           0.136842       70.823158
70           0.142857       76.378992
71           0.182353       73.735588
72           0.154696       80.695856

[73 rows x 2 columns]
```

Computing Standard Deviations
Group by tenure and compute the standard deviation of the other features. Sample code:

```python
print(features_temp.groupby('tenure').std())
```

Output:

```
        SeniorCitizen  MonthlyCharges
tenure                               
0            0.000000       23.831484
1            0.347575       24.714198
2            0.385557       25.180714
3            0.331549       26.783798
4            0.355842       26.362647
...               ...             ...
68           0.337998       30.267744
69           0.345504       33.730068
70           0.351407       30.993483
71           0.387276       32.711860
72           0.362115       31.956764

[73 rows x 2 columns]
```
Grouping by Two Features
Group by tenure and SeniorCitizen and compute the mean of the other features.
Sample code:

```python
print(features_temp.groupby(['tenure', 'SeniorCitizen']).mean())
```

Output:

```
                      MonthlyCharges
tenure SeniorCitizen                
0      0                   41.418182
1      0                   48.329127
       1                   63.701744
2      0                   54.951538
       1                   67.431395
...                              ...
70     1                   87.017647
71     0                   72.497482
       1                   79.287097
72     0                   78.687745
       1                   91.668750

[145 rows x 1 columns]
```
agg Usage
agg applies a specific aggregation (mean, sum, etc.) to each column of the grouped data.
Sample code:

```python
aggs = {}
aggs['SeniorCitizen'] = ['max', 'min']
aggs['MonthlyCharges'] = 'mean'
print(features_temp.groupby('tenure').agg(aggs).head(5))
```

Output:

```
       SeniorCitizen     MonthlyCharges
                 max min           mean
tenure                                 
0                  0   0      41.418182
1                  1   0      50.485808
2                  1   0      57.206303
3                  1   0      58.015000
4                  1   0      57.432670
```
Assembling the agg Dict
Build the agg dict programmatically. Sample code:

```python
col_names_sub = ['SeniorCitizen', 'MonthlyCharges']

aggs = {}
for col in col_names_sub:
    aggs[col] = ['mean', 'min', 'max']
print(aggs)
```

Output:

```
{'SeniorCitizen': ['mean', 'min', 'max'], 'MonthlyCharges': ['mean', 'min', 'max']}
```
Naming the New Features
Sample code:

```python
cols = ['tenure']
for key in aggs.keys():
    cols.extend([key + '_' + 'tenure' + '_' + stat for stat in aggs[key]])
print(cols)
```

Output:

```
['tenure', 'SeniorCitizen_tenure_mean', 'SeniorCitizen_tenure_min', 'SeniorCitizen_tenure_max', 'MonthlyCharges_tenure_mean', 'MonthlyCharges_tenure_min', 'MonthlyCharges_tenure_max']
```

Note: [key + '_' + 'tenure' + '_' + stat for stat in aggs[key]] is a list comprehension; for the MonthlyCharges key it evaluates to ['MonthlyCharges_tenure_mean', 'MonthlyCharges_tenure_min', 'MonthlyCharges_tenure_max']. Those elements must be added one by one, which is why we use extend rather than the more usual append.
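The difference between extend and append can be seen directly on a toy list:

```python
cols = ['tenure']
new_names = ['MonthlyCharges_tenure_mean', 'MonthlyCharges_tenure_min', 'MonthlyCharges_tenure_max']

# extend() (equivalently, list concatenation) adds the elements one by one
extended = cols + new_names

# append() would nest the whole list as a single element
appended = cols.copy()
appended.append(new_names)
```

extended is a flat four-element list, while appended has only two elements, the second being the entire list, which would not work as a column index.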
Generating the New Features
Sample code:

```python
col_names = ['tenure', 'SeniorCitizen', 'MonthlyCharges']
features_temp = features[col_names]

col_names_sub = ['SeniorCitizen', 'MonthlyCharges']

features_new = features_temp.groupby('tenure').agg(aggs).reset_index()
print(features_new.head(5))
```

Output:

```
   tenure SeniorCitizen         MonthlyCharges               
             mean    min max           mean    min     max
0       0  0.000000    0   0      41.418182  19.70   80.85
1       1  0.140294    0   1      50.485808  18.80  102.45
2       2  0.180672    0   1      57.206303  18.75  104.40
3       3  0.125000    0   1      58.015000  18.80  107.95
4       4  0.147727    0   1      57.432670  18.85  105.65
```
Assembling the New Features
Sample code:

```python
features_new.columns = cols
print(features_new.head(5))
```

Output:

```
   tenure  SeniorCitizen_tenure_mean  SeniorCitizen_tenure_min  SeniorCitizen_tenure_max  MonthlyCharges_tenure_mean  MonthlyCharges_tenure_min  MonthlyCharges_tenure_max
0       0                   0.000000                         0                         0                   41.418182                      19.70                      80.85
1       1                   0.140294                         0                         1                   50.485808                      18.80                     102.45
2       2                   0.180672                         0                         1                   57.206303                      18.75                     104.40
3       3                   0.125000                         0                         1                   58.015000                      18.80                     107.95
4       4                   0.147727                         0                         1                   57.432670                      18.85                     105.65
```
Joining Back to the Original Data
Join on tenure. Sample code:

```python
print(features.head(5))
print(features_new.head(5))

df_temp = pd.merge(features, features_new, how='left', on='tenure')
print(df_temp.head(5))
```

Output (first the features head and the features_new head as above, then the merged frame):

```
   gender  SeniorCitizen Partner Dependents  tenure PhoneService     MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies        Contract PaperlessBilling              PaymentMethod  MonthlyCharges  TotalCharges
0  Female              0     Yes         No       1           No  No phone service             DSL             No          Yes               No          No          No              No  Month-to-month              Yes           Electronic check           29.85         29.85
1    Male              0      No         No      34          Yes                No             DSL            Yes           No              Yes          No          No              No        One year               No               Mailed check           56.95       1889.50
2    Male              0      No         No       2          Yes                No             DSL            Yes          Yes               No          No          No              No  Month-to-month              Yes               Mailed check           53.85        108.15
3    Male              0      No         No      45           No  No phone service             DSL            Yes           No              Yes         Yes          No              No        One year               No  Bank transfer (automatic)           42.30       1840.75
4  Female              0      No         No       2          Yes                No     Fiber optic             No           No               No          No          No              No  Month-to-month              Yes           Electronic check           70.70        151.65
   tenure  SeniorCitizen_tenure_mean  SeniorCitizen_tenure_min  SeniorCitizen_tenure_max  MonthlyCharges_tenure_mean  MonthlyCharges_tenure_min  MonthlyCharges_tenure_max
0       0                   0.000000                         0                         0                   41.418182                      19.70                      80.85
1       1                   0.140294                         0                         1                   50.485808                      18.80                     102.45
2       2                   0.180672                         0                         1                   57.206303                      18.75                     104.40
3       3                   0.125000                         0                         1                   58.015000                      18.80                     107.95
4       4                   0.147727                         0                         1                   57.432670                      18.85                     105.65
   gender  SeniorCitizen Partner Dependents  tenure ...  MonthlyCharges  TotalCharges  SeniorCitizen_tenure_mean  SeniorCitizen_tenure_min  SeniorCitizen_tenure_max  MonthlyCharges_tenure_mean  MonthlyCharges_tenure_min  MonthlyCharges_tenure_max
0  Female  0  Yes  No   1   No  No phone service          DSL   No  Yes   No   No  No  No  Month-to-month  Yes           Electronic check  29.85    29.85  0.140294  0  1  50.485808  18.80  102.45
1    Male  0   No  No  34  Yes                No          DSL  Yes   No  Yes   No  No  No        One year   No               Mailed check  56.95  1889.50  0.184615  0  1  69.644615  19.60  116.25
2    Male  0   No  No   2  Yes                No          DSL  Yes  Yes   No   No  No  No  Month-to-month  Yes               Mailed check  53.85   108.15  0.180672  0  1  57.206303  18.75  104.40
3    Male  0   No  No  45   No  No phone service          DSL  Yes   No  Yes  Yes  No  No        One year   No  Bank transfer (automatic)  42.30  1840.75  0.196721  0  1  71.245902  18.85  115.65
4  Female  0   No  No   2  Yes                No  Fiber optic   No   No   No   No  No  No  Month-to-month  Yes           Electronic check  70.70   151.65  0.180672  0  1  57.206303  18.75  104.40
```
Common Statistics
Categories
mean: average
var: variance
max: maximum
min: minimum
median: median
count: count
quantile: quantile
nunique: number of distinct values
skew: skewness of the distribution (negative means left-skewed, positive means right-skewed)
For categorical features, every statistic except skew applies. (For a One-Hot encoded feature, for instance, the mean is simply the proportion of 1s.)
For continuous features, every statistic except nunique applies.
Computation
Apart from quantile, all of these are computed the same way. Sample code:

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2, 3, 2, 5, 1, 6],
              [0, 0, 0, 1, 1, 1, 0]])
df = pd.DataFrame(a.T, columns=['x1', 'x2'])
print(df)

aggs = {'x1': ['mean', 'var', 'max', 'min', 'median', 'count', 'nunique', 'skew']}
print(aggs)

df = df.groupby('x2').agg(aggs).reset_index()
print(df)
```

Output:

```
   x1  x2
0   1   0
1   2   0
2   3   0
3   2   1
4   5   1
5   1   1
6   6   0
{'x1': ['mean', 'var', 'max', 'min', 'median', 'count', 'nunique', 'skew']}
  x2        x1                                                        
        mean       var max min median count nunique      skew
0  0  3.000000  4.666667   6   1    2.5     4       4  1.190340
1  1  2.666667  4.333333   5   1    2.0     3       3  1.293343
```

For quantile, the usual picks are the upper and lower quartiles. Sample code:

```python
import pandas as pd


def q1(x):
    """Lower quartile (25th percentile)."""
    return x.quantile(0.25)


def q2(x):
    """Upper quartile (75th percentile)."""
    return x.quantile(0.75)


d1 = pd.DataFrame({'x1': [3, 2, 4, 4, 2, 2],
                   'x2': [0, 1, 1, 0, 0, 0]})
print(d1)

aggs = {'x1': [q1, q2]}
d2 = d1.groupby('x2').agg(aggs).reset_index()
print(d2)

d2.columns = ['x2', 'x1_x2_q1', 'x1_x2_q2']
print(d2)
```

Output:

```
   x1  x2
0   3   0
1   2   1
2   4   1
3   4   0
4   2   0
5   2   0
  x2   x1      
       q1    q2
0  0  2.0  3.25
1  1  2.5  3.50
   x2  x1_x2_q1  x1_x2_q2
0   0       2.0      3.25
1   1       2.5      3.50
```
A Note on the Mode
One familiar statistic is missing from the list above: the mode.
For continuous features the mode means little, and there may be no single mode at all (or every value may be one). The same can happen with categorical features, and for One-Hot encoded features the mean already conveys more than the mode would. When several values tie for the mode, we cannot keep them all as derived features, since that complicates downstream processing and modeling; and whether we keep one of them or their average, information may be distorted, for example when the data clusters at both extremes.
So, in general, we do not derive the mode.
If some scenario genuinely calls for it, the mode can be computed as follows.
Group by a and take the mode of b. Sample code:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'b': [1, 1, 2, 3, 1, 2, 2, 3, 3]})
print(df)

# count occurrences of each (a, b) pair
t1 = df.groupby(['a', 'b']).size().reset_index()
t1.columns = ['a', 'b', 'count']

# the highest count within each group
t2 = t1.groupby(['a'])['count'].max().reset_index()
t2.columns = ['a', 'max_count']

# keep the rows that reach the maximum count, i.e. the modes
t1 = t1.merge(t2, on=['a'], how='left')
t1 = t1[t1['count'] == t1['max_count']]

# average the tied modes
t1 = t1.groupby(['a'])['b'].mean().reset_index()
print(t1)
```

Output:

```
   a  b
0  A  1
1  A  1
2  A  2
3  A  3
4  B  1
5  B  2
6  B  2
7  B  3
8  B  3
   a    b
0  A  1.0
1  B  2.5
```

There are other ways to compute the mode, such as df.groupby('a').agg(lambda x: np.mean(x.mode())).reset_index(), but the method above runs fastest.
To take the largest or smallest of several tied modes instead of their average, just replace mean accordingly.
Function Wrapper

```python
def binary_group_statistics(key_col: str,
                            features: pd.DataFrame,
                            col_num: list = None,
                            col_cat: list = None,
                            num_stat: list = ['mean', 'var', 'max', 'min', 'skew', 'median'],
                            cat_stat: list = ['mean', 'var', 'max', 'min', 'median', 'count', 'nunique'],
                            quantile: bool = True):
    """
    Two-feature grouped-statistics derivation.

    :param key_col: the key feature to group by
    :param features: original dataset
    :param col_num: continuous features to aggregate
    :param col_cat: categorical features to aggregate
    :param num_stat: statistics for continuous features
    :param cat_stat: statistics for categorical features
    :param quantile: whether to compute quartiles
    :return: the derived features and their column names
    """
    # at least one of col_num / col_cat must be provided
    if col_num is not None:
        aggs_num = {}
        col_names = col_num

        for col in col_num:
            aggs_num[col] = num_stat

        cols_num = [key_col]
        for key in aggs_num.keys():
            cols_num.extend([key + '_' + key_col + '_' + stat for stat in aggs_num[key]])

        features_num_new = features[col_num + [key_col]].groupby(key_col).agg(aggs_num).reset_index()
        features_num_new.columns = cols_num

        if col_cat is not None:
            aggs_cat = {}
            col_names = col_num + col_cat

            for col in col_cat:
                aggs_cat[col] = cat_stat

            cols_cat = [key_col]
            for key in aggs_cat.keys():
                cols_cat.extend([key + '_' + key_col + '_' + stat for stat in aggs_cat[key]])

            features_cat_new = features[col_cat + [key_col]].groupby(key_col).agg(aggs_cat).reset_index()
            features_cat_new.columns = cols_cat

            df_temp = pd.merge(features_num_new, features_cat_new, how='left', on=key_col)
            features_new = pd.merge(features[key_col], df_temp, how='left', on=key_col)
            # drop duplicated columns (the result must be assigned back to take effect)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]
            col_names_new = cols_num + cols_cat
            col_names_new.remove(key_col)
            col_names_new.remove(key_col)
        else:
            features_new = pd.merge(features[key_col], features_num_new, how='left', on=key_col)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]
            col_names_new = cols_num
            col_names_new.remove(key_col)
    else:
        if col_cat is not None:
            aggs_cat = {}
            col_names = col_cat

            for col in col_cat:
                aggs_cat[col] = cat_stat

            cols_cat = [key_col]
            for key in aggs_cat.keys():
                cols_cat.extend([key + '_' + key_col + '_' + stat for stat in aggs_cat[key]])

            features_cat_new = features[col_cat + [key_col]].groupby(key_col).agg(aggs_cat).reset_index()
            features_cat_new.columns = cols_cat

            features_new = pd.merge(features[key_col], features_cat_new, how='left', on=key_col)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]
            col_names_new = cols_cat
            col_names_new.remove(key_col)

    if quantile:
        def q1(x):
            """Lower quartile."""
            return x.quantile(0.25)

        def q2(x):
            """Upper quartile."""
            return x.quantile(0.75)

        agg_name = {}
        for col in col_names:
            agg_name[col] = ['q1', 'q2']

        cols = [key_col]
        for key in agg_name.keys():
            cols.extend([key + '_' + key_col + '_' + stat for stat in agg_name[key]])

        aggs = {}
        for col in col_names:
            aggs[col] = [q1, q2]

        features_temp = features[col_names + [key_col]].groupby(key_col).agg(aggs).reset_index()
        features_temp.columns = cols

        features_new = pd.merge(features_new, features_temp, how='left', on=key_col)
        features_new = features_new.loc[:, ~features_new.columns.duplicated()]
        col_names_new = col_names_new + cols
        col_names_new.remove(key_col)

    features_new.drop([key_col], axis=1, inplace=True)

    return features_new, col_names_new
```
Sample code:

```python
col_num = ['MonthlyCharges']
col_cat = ['SeniorCitizen']
key_col = 'tenure'

df, col = binary_group_statistics(key_col, features, col_num, col_cat)
print(df.head(5))
print(col)
```

Output:

```
   MonthlyCharges_tenure_mean  MonthlyCharges_tenure_var  MonthlyCharges_tenure_max  MonthlyCharges_tenure_min  MonthlyCharges_tenure_skew  MonthlyCharges_tenure_median  SeniorCitizen_tenure_mean  SeniorCitizen_tenure_var  SeniorCitizen_tenure_max  SeniorCitizen_tenure_min  SeniorCitizen_tenure_median  SeniorCitizen_tenure_count  SeniorCitizen_tenure_nunique  MonthlyCharges_tenure_q1  MonthlyCharges_tenure_q2  SeniorCitizen_tenure_q1  SeniorCitizen_tenure_q2
0  50.485808  610.791587  102.45  18.80   0.092226  49.750  0.140294  0.120808  1  0  0.0  613  2  20.9  71.35  0.0  0.0
1  69.644615  866.891416  116.25  19.60  -0.384323  73.950  0.184615  0.152885  1  0  0.0   65  2  50.2  94.25  0.0  0.0
2  57.206303  634.068336  104.40  18.75  -0.220111  61.075  0.180672  0.148654  1  0  0.0  238  2  34.8  79.15  0.0  0.0
3  71.245902  938.289858  115.65  18.85  -0.481543  81.000  0.196721  0.160656  1  0  0.0   61  2  50.9  96.75  0.0  0.0
4  57.206303  634.068336  104.40  18.75  -0.220111  61.075  0.180672  0.148654  1  0  0.0  238  2  34.8  79.15  0.0  0.0
['MonthlyCharges_tenure_mean', 'MonthlyCharges_tenure_var', 'MonthlyCharges_tenure_max', 'MonthlyCharges_tenure_min', 'MonthlyCharges_tenure_skew', 'MonthlyCharges_tenure_median', 'SeniorCitizen_tenure_mean', 'SeniorCitizen_tenure_var', 'SeniorCitizen_tenure_max', 'SeniorCitizen_tenure_min', 'SeniorCitizen_tenure_median', 'SeniorCitizen_tenure_count', 'SeniorCitizen_tenure_nunique', 'MonthlyCharges_tenure_q1', 'MonthlyCharges_tenure_q2', 'SeniorCitizen_tenure_q1', 'SeniorCitizen_tenure_q2']
```
About features_new.loc[:, ~features_new.columns.duplicated()]:
features_new.columns.duplicated() returns a boolean array marking whether each column name repeats an earlier one. ~ negates it elementwise, turning True into False and vice versa. loc selects data by label, and [:, ~features_new.columns.duplicated()] keeps every row but only the columns whose names are not duplicates. In short, the expression selects features_new without its duplicated columns. (Note that the result must be assigned back to features_new for the deduplication to take effect.)
About pd.merge(features[key_col], features_cat_new, how='left', on=key_col):
The left table is features[key_col] rather than the whole of features because the function should return a derived-feature matrix without the original feature columns, so the left side of the merge keeps only the key column.
A small example makes this clearer: suppose d1 is the original data and d2 holds the derived statistics. Sample code:

```python
import pandas as pd

d1 = pd.DataFrame({'tenure': [1, 2, 1, 3, 2, 3],
                   'x1': [2, 5, 1, 2, 6, 1]})
print(d1)

d2 = pd.DataFrame({'tenure': [1, 2, 3],
                   'stat': [1, 7, 4]})
print(d2)

pd.merge(d1['tenure'], d2, how='left', on='tenure')
```

Output:

```
   tenure  x1
0       1   2
1       2   5
2       1   1
3       3   2
4       2   6
5       3   1
   tenure  stat
0       1     1
1       2     7
2       3     4
```
Practical Notes
During grouped-statistics derivation, some statistics can come out empty (NaN); it is worth checking the result afterwards.
(For detection methods, see the "Missing value detection" part of "Missing value handling" in "Feature Engineering - 1. Feature Preprocessing".)
The grouping key must be a categorical feature (or a continuous feature with a fixed set of values), preferably one with a fairly large number of levels.
If the key has only a few levels, the derived feature matrix contains many repeated rows and offers little discrimination.
The features being aggregated can be categorical or continuous.
When computing statistics, do not feel bound by the traditional pairing of statistics and feature types:
a continuous feature can take statistics normally reserved for categorical features, and vice versa.
For categorical features, every statistic except skew applies.
For continuous features, every statistic except nunique applies.
Features derived by grouped statistics can go through another round of arithmetic derivation.
For example, subtracting the group mean from each user's monthly charges tells us how far each user sits from the average spending level.
(This is called statistic evolution; more on it below.)
For quantiles, the upper and lower quartiles are the usual choices, but less common ones such as the 10th or 90th percentile are perfectly valid too.
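The deviation-from-group-mean idea above can be sketched with groupby(...).transform, which broadcasts a group statistic back onto every row (a hypothetical four-row frame reusing the Telco column names):

```python
import pandas as pd

df = pd.DataFrame({'tenure': [1, 1, 2, 2],
                   'MonthlyCharges': [30.0, 50.0, 60.0, 80.0]})

# transform('mean') returns one value per row, aligned with the original index
df['MonthlyCharges_tenure_mean'] = df.groupby('tenure')['MonthlyCharges'].transform('mean')

# second-round arithmetic: each user's deviation from their group's average
df['diff_from_group_mean'] = df['MonthlyCharges'] - df['MonthlyCharges_tenure_mean']
```

transform avoids the separate agg-then-merge round trip when only row-aligned statistics are needed.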
Polynomial
Derivation Process
Two-feature second-order polynomial derivation:

| X1 | X2 | X1^2 | X2^2 | X1*X2 |
| --- | --- | --- | --- | --- |
| 1 | 2 | 1 | 4 | 2 |
| 2 | 3 | 4 | 9 | 6 |
| 3 | 4 | 9 | 16 | 12 |

Two-feature third-order polynomial derivation (second-order and third-order terms):

| X1 | X2 | X1^2 | X1*X2 | X2^2 | X1^3 | X1^2*X2 | X1*X2^2 | X2^3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 1 | 2 | 4 | 1 | 2 | 4 | 8 |
| 2 | 3 | 4 | 6 | 9 | 8 | 12 | 18 | 27 |
| 3 | 4 | 9 | 12 | 16 | 27 | 36 | 48 | 64 |
Implementation
Again with PolynomialFeatures. Sample code:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({'X1': [1, 2, 3], 'X2': [2, 3, 4]})
print(df)

df = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False).fit_transform(df)
print(df)
```

Output:

```
   X1  X2
0   1   2
1   2   3
2   3   4
[[ 1.  2.  1.  2.  4.]
 [ 2.  3.  4.  6.  9.]
 [ 3.  4.  9. 12. 16.]]
```

Note: there are 5 columns: X1, X2, X1 squared, X1 times X2, and X2 squared.
We met PolynomialFeatures earlier, along with two of its parameters:
include_bias: whether to include the bias column (the 0th power); defaults to True.
degree: the highest power.
Here a third parameter appears, interaction_only: whether to return only the interaction terms between features. It defaults to False; if True, only interaction terms are produced.
Naming the New Features
PolynomialFeatures orders the output by decreasing power of the first feature and increasing power of the second:
second order: X1^2 * X2^0, X1^1 * X2^1, X1^0 * X2^2;
third order: X1^3 * X2^0, X1^2 * X2^1, X1^1 * X2^2, X1^0 * X2^3.
A two-feature third-order derivation contains both the second-order and the third-order terms.
Sample code:

```python
col_names = ['X1', 'X2']
degree = 3

col_names_l = []
for deg in range(2, degree + 1):
    for i in range(deg + 1):
        col_temp = col_names[0] + '**' + str(deg - i) + '*' + col_names[1] + '**' + str(i)
        col_names_l.append(col_temp)
print(col_names_l)
```

Output:

```
['X1**2*X2**0', 'X1**1*X2**1', 'X1**0*X2**2', 'X1**3*X2**0', 'X1**2*X2**1', 'X1**1*X2**2', 'X1**0*X2**3']
```
Assembling the New Features
Sample code:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 5000)

col_names = ['X1', 'X2']
degree = 3

col_names_l = list(col_names)  # copy, so the original list is not mutated
for deg in range(2, degree + 1):
    for i in range(deg + 1):
        col_temp = col_names[0] + '**' + str(deg - i) + '*' + col_names[1] + '**' + str(i)
        col_names_l.append(col_temp)
print(col_names_l)

df = pd.DataFrame({'X1': [1, 2, 3], 'X2': [2, 3, 4]})
print(df)

df = PolynomialFeatures(degree=3, include_bias=False).fit_transform(df)
print(pd.DataFrame(data=df, columns=col_names_l))
```

Output:

```
['X1', 'X2', 'X1**2*X2**0', 'X1**1*X2**1', 'X1**0*X2**2', 'X1**3*X2**0', 'X1**2*X2**1', 'X1**1*X2**2', 'X1**0*X2**3']
   X1  X2
0   1   2
1   2   3
2   3   4
    X1   X2  X1**2*X2**0  X1**1*X2**1  X1**0*X2**2  X1**3*X2**0  X1**2*X2**1  X1**1*X2**2  X1**0*X2**3
0  1.0  2.0          1.0          2.0          4.0          1.0          2.0          4.0          8.0
1  2.0  3.0          4.0          6.0          9.0          8.0         12.0         18.0         27.0
2  3.0  4.0          9.0         12.0         16.0         27.0         36.0         48.0         64.0
```
函数封装
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 def binary_polynomial_features (col_names: list, degree: int, features: pd.DataFrame) : """ 连续特征两特征多项式衍生函数 :param col_names: 参与交叉衍生的列名称 :param degree: 多项式最高阶 :param features: 原始数据集 :return:交叉衍生后的新特征和新列名称 """ col_names_new_l = [] features_new_l = [] features = features[col_names] for col_index, col_name in enumerate(col_names): for col_sub_index in range(col_index + 1 , len(col_names)): col_temp = [col_name] + [col_names[col_sub_index]] array_new_temp = PolynomialFeatures(degree=degree, include_bias=False ).fit_transform(features[col_temp]) features_new_l.append(pd.DataFrame(array_new_temp[:, 2 :])) for deg in range(2 , degree + 1 ): for i in range(deg + 1 ): col_name_temp = col_temp[0 ] + '**' + str(deg - i) + '*' + col_temp[1 ] + '**' + str(i) col_names_new_l.append(col_name_temp) features_new = pd.concat(features_new_l, axis=1 ) features_new.columns = col_names_new_l col_names_new = col_names_new_l return features_new, col_names_new
示例代码:
col_names = ['X1', 'X2']
degree = 3

df = pd.DataFrame({'X1': [1, 2, 3], 'X2': [2, 3, 4]})
print(df)

features_new, col_names_new = binary_polynomial_features(col_names=col_names, degree=degree, features=df)
print(features_new)
print(col_names_new)
运行结果:
1 2 3 4 5 6 7 8 9 X1 X2 0 1 2 1 2 3 2 3 4 X1**2*X2**0 X1**1*X2**1 X1**0*X2**2 X1**3*X2**0 X1**2*X2**1 X1**1*X2**2 X1**0*X2**3 0 1.0 2.0 4.0 1.0 2.0 4.0 8.0 1 4.0 6.0 9.0 8.0 12.0 18.0 27.0 2 9.0 12.0 16.0 27.0 36.0 48.0 64.0 ['X1**2*X2**0', 'X1**1*X2**1', 'X1**0*X2**2', 'X1**3*X2**0', 'X1**2*X2**1', 'X1**1*X2**2', 'X1**0*X2**3']
工作经验
一般情况下,双特征的多项式衍生会比单特征多项式衍生更有效果。
一般情况下,双特征多项式衍生只适用于两个连续特征之间,一个连续特征一个离散特征或者两个离散特征进行多项式衍生意义不大。
一般情况下,不要随意组合连续特征进行多项式衍生,而应有针对性地选择我们认为非常重要的特征来进行多项式衍生。
(其思路与四则运算的特征衍生思路类似,强化重要特征的表现形式。)
伴随着多项式阶数的增加,各列数值也会呈现指数级递增(或递减),因此一般只会衍生到3阶左右,极少数情况会衍生5-10阶。
伴随着多项式阶数的增加,也需要配合一些手段来消除数值爆炸或者衰减所造成的影响,例如再进行一轮归一化操作。
(关于归一化,可以参考《特征工程-1.特征预处理》 的"连续字段重编码"的"归一化(0-1标准化)"部分。)
统计演变
在上文"双特征衍生"的"分组统计"一节的"工作经验"部分,我们提到过:对于分组统计衍生之后的特征,还可以再进行一轮四则运算特征衍生,这就是统计演变特征的范畴。
衍生特征
分类
统计演变特征,整体上可以分为两类:
原始特征与分组统计特征进行交叉衍生
流量平滑特征
黄金组合特征
组内归一化特征
分组统计特征之间进行交叉衍生
Gap特征
数据倾斜
变异系数
流量平滑特征
流量平滑特征:用原始的数值特征除以统计均值或者中位数。
df['B_div_A_B_mean'] = df['B'] / (df['A_B_mean'] + 1e-5)
df['B_div_A_B_median'] = df['B'] / (df['A_B_median'] + 1e-5)
我们先衍生出一个数值特征。
在这里,我们以tenure作为分组依据,对月均消费金额进行分组统计。示例代码:
col_num = ['MonthlyCharges']
key_col = 'tenure'
df, col = binary_group_statistics(key_col, features, col_num)
print(df.head(5))
运行结果:
1 2 3 4 5 6 MonthlyCharges_tenure_mean MonthlyCharges_tenure_var MonthlyCharges_tenure_max MonthlyCharges_tenure_min MonthlyCharges_tenure_skew MonthlyCharges_tenure_median MonthlyCharges_tenure_q1 MonthlyCharges_tenure_q2 0 50.485808 610.791587 102.45 18.80 0.092226 49.750 20.9 71.35 1 69.644615 866.891416 116.25 19.60 -0.384323 73.950 50.2 94.25 2 57.206303 634.068336 104.40 18.75 -0.220111 61.075 34.8 79.15 3 71.245902 938.289858 115.65 18.85 -0.481543 81.000 50.9 96.75 4 57.206303 634.068336 104.40 18.75 -0.220111 61.075 34.8 79.15
然后,我们用原始的特征除以我们上一轮得到的均值。示例代码:
print(features['MonthlyCharges'])
print(features['MonthlyCharges'] / (df['MonthlyCharges_tenure_mean'] + 1e-5))
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 0 29.85 1 56.95 2 53.85 3 42.30 4 70.70 ... 7038 84.80 7039 103.20 7040 29.60 7041 74.40 7042 105.65 Name: MonthlyCharges, Length: 7043, dtype: float64 0 0.591255 1 0.817723 2 0.941330 3 0.593718 4 1.235878 ... 7038 1.382377 7039 1.278876 7040 0.506219 7041 1.295430 7042 1.388971 Length: 7043, dtype: float64
解释说明:为了避免分母为零的情况,在分母上加了一个很小的数。
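补充一个示意写法:除了先做分组统计再拼接回原始数据,pandas的groupby().transform()可以一步得到与每一行对齐的组内均值(其中A、B为虚构的列名,仅作演示):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2],
                   'B': [10.0, 30.0, 5.0, 10.0, 15.0]})

# transform('mean') 直接返回与原始行对齐的组内均值,无需再 merge
group_mean = df.groupby('A')['B'].transform('mean')
df['B_div_A_B_mean'] = df['B'] / (group_mean + 1e-5)
print(df)
```

与先groupby再merge的写法结果一致,只是省去了拼接步骤,适合快速衍生单个统计演变特征。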
有些资料认为,流量平滑特征是一种时间序列特征,主要作用是对时间序列数据进行平滑处理,消除噪声和偶然波动。
示例代码:
import pandas as pd

data = {
    'time': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
    'count': [10, 20, 30, 20, 10]
}
df = pd.DataFrame(data)
print(df)

df['time'] = pd.to_datetime(df['time'])
df['count_smooth'] = df['count'].rolling(window=3, center=True).mean()
print(df)
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 time count 0 2021-01-01 10 1 2021-01-02 20 2 2021-01-03 30 3 2021-01-04 20 4 2021-01-05 10 time count count_smooth 0 2021-01-01 10 NaN 1 2021-01-02 20 20.000000 2 2021-01-03 30 23.333333 3 2021-01-04 20 20.000000 4 2021-01-05 10 NaN
通过rolling()和mean()函数,对数值进行流量平滑处理。window=3表示窗口大小为3;center=True表示窗口以当前点为中心。即,以每个点为中心,取其自身及左右各1个点(共3个点)计算平均值,得到流量平滑的结果。
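上面结果中首尾两行为NaN,因为窗口在边界处不完整。如果希望边界处也有平滑值,可以加上min_periods参数(示意写法):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 20, 10])

# min_periods=1:窗口内只要有 1 个有效值就计算,首尾不再是 NaN
smooth = s.rolling(window=3, center=True, min_periods=1).mean()
print(smooth.tolist())
```

此时首行取前两个值的均值(15.0),尾行取后两个值的均值(15.0),中间各行与之前的结果相同。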
黄金组合特征
黄金组合特征,是指四个特征的组合。
A的原始数值。
B的原始数值。
基于A进行分组关于B的均值。
B减去(基于A进行分组关于B的均值)。
在有些资料中,会说黄金组合特征是三个特征,这是把"A的原始数值"和"B的原始数值"两项合并看成了一项。
最后一个特征的表示如下:
df['B_minus_A_B_mean'] = df['B'] - df['A_B_mean']
我们上文提到的"我们用分组统计衍生之后的月度消费金额减去均值,可以知道每一位用户与消费平均水平的差异。",就是黄金组合特征中的最后一个。
在本文中,tenure和MonthlyCharges已经有了,MonthlyCharges_tenure_mean也有了,只差最后一项。
示例代码:
print(features['MonthlyCharges'] - df['MonthlyCharges_tenure_mean'])
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 0 -20.635808 1 -12.694615 2 -3.356303 3 -28.945902 4 13.493697 ... 7038 23.456383 7039 22.504144 7040 -28.872727 7041 16.967330 7042 29.586517 Length: 7043, dtype: float64
组合如下,示例代码:
print(pd.concat(
    [features[['tenure', 'MonthlyCharges']],
     df['MonthlyCharges_tenure_mean'],
     features['MonthlyCharges'] - df['MonthlyCharges_tenure_mean']],
    axis=1))
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 tenure MonthlyCharges MonthlyCharges_tenure_mean 0 0 1 29.85 50.485808 -20.635808 1 34 56.95 69.644615 -12.694615 2 2 53.85 57.206303 -3.356303 3 45 42.30 71.245902 -28.945902 4 2 70.70 57.206303 13.493697 ... ... ... ... ... 7038 24 84.80 61.343617 23.456383 7039 72 103.20 80.695856 22.504144 7040 11 29.60 58.472727 -28.872727 7041 4 74.40 57.432670 16.967330 7042 66 105.65 76.063483 29.586517 [7043 rows x 4 columns]
组内归一化特征
组内归一化特征:
df['B_norm_A_B'] = (df['B'] - df['A_B_mean']) / (df['A_B_std'] + 1e-9)
(注意,这里的列名要与黄金组合特征的B_minus_A_B_mean区分开,避免相互覆盖。)
例如,我们用MonthlyCharges减去MonthlyCharges_tenure_mean,再除以组内标准差(此前统计的是方差MonthlyCharges_tenure_var,因此代码中用np.sqrt取标准差)。示例代码:
print((features['MonthlyCharges'] - df['MonthlyCharges_tenure_mean']) / (np.sqrt(df['MonthlyCharges_tenure_var']) + 1e-5))
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 0 -0.834977 1 -0.431159 2 -0.133289 3 -0.944971 4 0.535874 ... 7038 0.827460 7039 0.704206 7040 -1.059064 7041 0.643612 7042 0.998235 Length: 7043, dtype: float64
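组内归一化同样可以借助groupby().transform()一步实现,即按组计算z-score(其中A、B为虚构的列名,仅作示意):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1.0, 2.0, 3.0, 10.0, 20.0]})

# 组内均值与标准差都可以通过 transform 得到,结果与原始行对齐
mean = df.groupby('A')['B'].transform('mean')
std = df.groupby('A')['B'].transform('std')
df['B_norm_A'] = (df['B'] - mean) / (std + 1e-9)
print(df)
```

经过组内归一化后,每个组内部的均值约为0、标准差约为1,不同组的数值被拉到同一尺度上。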
Gap特征
Gap特征:分组统计后的四分位距。
关于四分位距的计算,我们在《特征工程-1.特征预处理》中有过讨论:四分位距 = 上四分位数(75%)- 下四分位数(25%)。
首先,我们做分组统计,计算上四分位数和下四分位数。示例代码:
def q1(x):
    """
    下四分位数
    """
    return x.quantile(0.25)


def q2(x):
    """
    上四分位数
    """
    return x.quantile(0.75)


aggs = {'MonthlyCharges': [q1, q2]}
features_temp = features.groupby('tenure').agg(aggs).reset_index()
features_temp.columns = ['tenure', 'MonthlyCharges_tenure_q1', 'MonthlyCharges_tenure_q2']
print(features_temp.head(5))
运行结果:
1 2 3 4 5 6 tenure MonthlyCharges_tenure_q1 MonthlyCharges_tenure_q2 0 0 20.125 58.9750 1 1 20.900 71.3500 2 2 34.800 79.1500 3 3 29.875 79.2875 4 4 29.525 79.3375
然后,计算分位距。示例代码:
features_temp['MonthlyCharges_tenure_q2-q1'] = features_temp['MonthlyCharges_tenure_q2'] - features_temp['MonthlyCharges_tenure_q1']
print(features_temp.head(5))
运行结果:
1 2 3 4 5 6 tenure MonthlyCharges_tenure_q1 MonthlyCharges_tenure_q2 MonthlyCharges_tenure_q2-q1 0 0 20.125 58.9750 38.8500 1 1 20.900 71.3500 50.4500 2 2 34.800 79.1500 44.3500 3 3 29.875 79.2875 49.4125 4 4 29.525 79.3375 49.8125
数据倾斜
当均值大于中位数时,认为数据正倾斜;当均值小于中位数时,认为数据负倾斜。
衡量均值和中位数的大小关系,有两种方法:做差(均值减中位数)和做商(均值除以中位数)。
示例代码:
col_num = ['MonthlyCharges']
key_col = 'tenure'
df, col = binary_group_statistics(key_col, features, col_num)

# 做差
print(df['MonthlyCharges_tenure_mean'] - df['MonthlyCharges_tenure_median'])
# 做商
print(df['MonthlyCharges_tenure_mean'] / (df['MonthlyCharges_tenure_median'] + 1e-5))
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 0 0.735808 1 -4.305385 2 -3.868697 3 -9.754098 4 -3.868697 ... 7038 1.943617 7039 -8.379144 7040 -2.777273 7041 0.457670 7042 -4.486517 Length: 7043, dtype: float64 0 1.014790 1 0.941780 2 0.936656 3 0.879579 4 0.936656 ... 7038 1.032721 7039 0.905931 7040 0.954657 7041 1.008033 7042 0.944301 Length: 7043, dtype: float64
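可以用一组右偏(正倾斜)的小数据验证"均值大于中位数对应正倾斜"这一结论(数据为虚构,仅作示意):

```python
import pandas as pd

# 右侧带长尾的数据:大部分值较小,个别值很大
s = pd.Series([1, 2, 2, 3, 12])

print(s.mean())    # 4.0
print(s.median())  # 2.0
print(s.skew())    # 偏度为正

# 均值 > 中位数:做差为正、做商大于 1,对应正倾斜
```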
变异系数
变异系数:标准差除以均值。
变异系数越大,说明数据的相对离散程度越高。
示例代码:
print(np.sqrt(df['MonthlyCharges_tenure_var']) / (df['MonthlyCharges_tenure_mean'] + 1e-10))
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 0 0.489528 1 0.422761 2 0.440174 3 0.429941 4 0.440174 ... 7038 0.462109 7039 0.396015 7040 0.466243 7041 0.459018 7042 0.389659 Length: 7043, dtype: float64
争议
流量平滑特征
有些资料记录的流量平滑特征如下:
df['B_div_A_B_mean'] = df['A'] / (df['A_B_mean'] + 1e-5)
df['B_div_A_B_median'] = df['A'] / (df['A_B_median'] + 1e-5)
注意,是df['A'],不是df['B']。
根据这种说法,在本文的案例中是
features['tenure'] / (df['MonthlyCharges_tenure_mean'] + 1e-5)
我不否认这种流量平滑特征,在某些模型,某些场景下,效果可能更好。
但是,我认为流量平滑特征应该如下:
df['B_div_A_B_mean'] = df['B'] / (df['A_B_mean'] + 1e-5)
df['B_div_A_B_median'] = df['B'] / (df['A_B_median'] + 1e-5)
在本文的案例中应该是
features['MonthlyCharges'] / (df['MonthlyCharges_tenure_mean'] + 1e-5)
黄金组合特征
对于黄金组合特征的最后一个特征,有些资料的记录是A减去(基于A进行分组关于B的均值)。注意,是A减去,不是B减去。
根据这种说法,在本文的案例中是
features['tenure'] - df['MonthlyCharges_tenure_mean']
我不否认这种黄金组合特征,在某些模型,某些场景下,效果可能更好。
但是,我认为黄金组合特征应该是B减去(基于A进行分组关于B的均值)。
在本文的案例中应该是
features['MonthlyCharges'] - df['MonthlyCharges_tenure_mean']
组内归一化特征
有些资料记录的组内归一化特征是(df['A'] - df['A_B_mean']) / (df['A_B_std'] + 1e-9)。注意,是df['A'],不是df['B']。
根据这种说法,在本文的案例中是
(features['tenure'] - df['MonthlyCharges_tenure_mean']) / (np.sqrt(df['MonthlyCharges_tenure_var']) + 1e-5)
我不否认这种组内归一化特征,在某些模型,某些场景下,效果可能更好。
但是,我认为组内归一化特征应该是(df['B'] - df['A_B_mean']) / (df['A_B_std'] + 1e-9)。
在本文的案例中应该是
print((features['MonthlyCharges'] - df['MonthlyCharges_tenure_mean']) / (np.sqrt(df['MonthlyCharges_tenure_var']) + 1e-5))
函数封装
def group_statistics_extension(col_names: list, key_col: str, features: pd.DataFrame):
    """
    双特征分组统计二阶特征衍生函数

    :param col_names: 参与衍生的特征
    :param key_col: 分组参考的关键特征
    :param features: 原始数据集
    :return: 交叉衍生后的新特征和新列名称
    """

    def q1(x):
        """
        下四分位数
        """
        return x.quantile(0.25)

    def q2(x):
        """
        上四分位数
        """
        return x.quantile(0.75)

    # 一阶特征衍生:先定义用于生成列名称的aggs
    aggs = {}
    for col in col_names:
        aggs[col] = ['mean', 'var', 'median', 'q1', 'q2']
    cols = [key_col]
    for key in aggs.keys():
        cols.extend([key + '_' + key_col + '_' + stat for stat in aggs[key]])

    # 再定义真正执行计算的aggs
    aggs = {}
    for col in col_names:
        aggs[col] = ['mean', 'var', 'median', q1, q2]

    features_new = features[col_names + [key_col]].groupby(key_col).agg(aggs).reset_index()
    features_new.columns = cols

    # 拼接回原始数据
    col_name_temp = [key_col]
    col_name_temp.extend(col_names)
    features_new = pd.merge(features[col_name_temp], features_new, how='left', on=key_col)
    features_new = features_new.loc[:, ~features_new.columns.duplicated()]
    col_names_new = cols
    col_names_new.remove(key_col)
    col1 = col_names_new.copy()

    # 二阶特征衍生
    # 流量平滑特征
    for col_temp in col_names:
        col = col_temp + '_' + key_col + '_' + 'mean'
        features_new[col_temp + '_dive1_' + col] = features_new[col_temp] / (features_new[col] + 1e-5)
        col_names_new.append(col_temp + '_dive1_' + col)
        col = col_temp + '_' + key_col + '_' + 'median'
        features_new[col_temp + '_dive2_' + col] = features_new[col_temp] / (features_new[col] + 1e-5)
        col_names_new.append(col_temp + '_dive2_' + col)

    # 黄金组合特征
    for col_temp in col_names:
        col = col_temp + '_' + key_col + '_' + 'mean'
        features_new[col_temp + '_minus1_' + col] = features_new[col_temp] - features_new[col]
        col_names_new.append(col_temp + '_minus1_' + col)
        # 注:minus2与minus1的计算相同,与上文运行结果中两列数值一致
        features_new[col_temp + '_minus2_' + col] = features_new[col_temp] - features_new[col]
        col_names_new.append(col_temp + '_minus2_' + col)

    # 组内归一化特征
    for col_temp in col_names:
        col_mean = col_temp + '_' + key_col + '_' + 'mean'
        col_var = col_temp + '_' + key_col + '_' + 'var'
        features_new[col_temp + '_norm_' + key_col] = (features_new[col_temp] - features_new[col_mean]) / (
                np.sqrt(features_new[col_var]) + 1e-5)
        col_names_new.append(col_temp + '_norm_' + key_col)

    # Gap特征
    for col_temp in col_names:
        col_q1 = col_temp + '_' + key_col + '_' + 'q1'
        col_q2 = col_temp + '_' + key_col + '_' + 'q2'
        features_new[col_temp + '_gap_' + key_col] = features_new[col_q2] - features_new[col_q1]
        col_names_new.append(col_temp + '_gap_' + key_col)

    # 数据倾斜特征
    for col_temp in col_names:
        col_mean = col_temp + '_' + key_col + '_' + 'mean'
        col_median = col_temp + '_' + key_col + '_' + 'median'
        features_new[col_temp + '_mag1_' + key_col] = features_new[col_median] - features_new[col_mean]
        col_names_new.append(col_temp + '_mag1_' + key_col)
        features_new[col_temp + '_mag2_' + key_col] = features_new[col_median] / (features_new[col_mean] + 1e-5)
        col_names_new.append(col_temp + '_mag2_' + key_col)

    # 变异系数
    for col_temp in col_names:
        col_mean = col_temp + '_' + key_col + '_' + 'mean'
        col_var = col_temp + '_' + key_col + '_' + 'var'
        features_new[col_temp + '_cv_' + key_col] = np.sqrt(features_new[col_var]) / (features_new[col_mean] + 1e-5)
        col_names_new.append(col_temp + '_cv_' + key_col)

    # 删除分组键和一阶统计列,只保留原始列与二阶衍生特征
    features_new.drop([key_col], axis=1, inplace=True)
    features_new.drop(col1, axis=1, inplace=True)
    col_names_new = list(features_new.columns)

    return features_new, col_names_new
示例代码:
key_col = 'tenure'
col_names = ['MonthlyCharges', 'SeniorCitizen']

features_new, col_names_new = group_statistics_extension(col_names, key_col, features)
print(features_new.head(5))
print(col_names_new)
运行结果:
1 2 3 4 5 6 7 MonthlyCharges SeniorCitizen MonthlyCharges_dive1_MonthlyCharges_tenure_mean MonthlyCharges_dive2_MonthlyCharges_tenure_median SeniorCitizen_dive1_SeniorCitizen_tenure_mean SeniorCitizen_dive2_SeniorCitizen_tenure_median MonthlyCharges_minus1_MonthlyCharges_tenure_mean MonthlyCharges_minus2_MonthlyCharges_tenure_mean SeniorCitizen_minus1_SeniorCitizen_tenure_mean SeniorCitizen_minus2_SeniorCitizen_tenure_mean MonthlyCharges_norm_tenure SeniorCitizen_norm_tenure MonthlyCharges_gap_tenure SeniorCitizen_gap_tenure MonthlyCharges_mag1_tenure MonthlyCharges_mag2_tenure SeniorCitizen_mag1_tenure SeniorCitizen_mag2_tenure MonthlyCharges_cv_tenure SeniorCitizen_cv_tenure 0 29.85 0 0.591255 0.600000 0.0 0.0 -20.635808 -20.635808 -0.140294 -0.140294 -0.834977 -0.403624 50.45 0.0 -0.735808 0.985425 -0.140294 0.0 0.489528 2.477306 1 56.95 0 0.817723 0.770115 0.0 0.0 -12.694615 -12.694615 -0.184615 -0.184615 -0.431159 -0.472144 44.05 0.0 4.305385 1.061819 -0.184615 0.0 0.422761 2.117827 2 53.85 0 0.941330 0.881703 0.0 0.0 -3.356303 -3.356303 -0.180672 -0.180672 -0.133289 -0.468588 44.35 0.0 3.868697 1.067627 -0.180672 0.0 0.440174 2.133896 3 42.30 0 0.593718 0.522222 0.0 0.0 -28.945902 -28.945902 -0.196721 -0.196721 -0.944971 -0.490786 45.85 0.0 9.754098 1.136907 -0.196721 0.0 0.429941 2.037392 4 70.70 0 1.235878 1.157593 0.0 0.0 13.493697 13.493697 -0.180672 -0.180672 0.535874 -0.468588 44.35 0.0 3.868697 1.067627 -0.180672 0.0 0.440174 2.133896 ['MonthlyCharges', 'SeniorCitizen', 'MonthlyCharges_dive1_MonthlyCharges_tenure_mean', 'MonthlyCharges_dive2_MonthlyCharges_tenure_median', 'SeniorCitizen_dive1_SeniorCitizen_tenure_mean', 'SeniorCitizen_dive2_SeniorCitizen_tenure_median', 'MonthlyCharges_minus1_MonthlyCharges_tenure_mean', 'MonthlyCharges_minus2_MonthlyCharges_tenure_mean', 'SeniorCitizen_minus1_SeniorCitizen_tenure_mean', 'SeniorCitizen_minus2_SeniorCitizen_tenure_mean', 'MonthlyCharges_norm_tenure', 'SeniorCitizen_norm_tenure', 
'MonthlyCharges_gap_tenure', 'SeniorCitizen_gap_tenure', 'MonthlyCharges_mag1_tenure', 'MonthlyCharges_mag2_tenure', 'SeniorCitizen_mag1_tenure', 'SeniorCitizen_mag2_tenure', 'MonthlyCharges_cv_tenure', 'SeniorCitizen_cv_tenure']
工作经验
与"分组统计"部分对于分位数的经验一样,一般是关注上四分位数和下四分位数,但也完全可以用一分位或九分位等这种不常见的分位数。
做第二轮,或者做更多轮的特征衍生,往往伴随着严重的信息衰减,而且需要消耗巨大的计算资源。性价比较低,一般不建议在广泛特征基础上进行大量第二轮或更多轮的特征衍生的尝试。
但是,在长期的实践过程中,本文论述的统计衍生特征,能取得一定的效果。
多特征衍生
四则运算
衍生过程
多特征四则运算特征衍生过程(原始特征:ID、X1、X2、X3;衍生特征:X1+X2+X3、X1*X2*X3):

ID  X1  X2  X3  X1+X2+X3  X1*X2*X3
 1   1  -1   3         3        -3
 2   1   0   1         2         0
 3   4   2   2         8        16
 4   2   1   4         7         8
 5   3   0   0         3         0
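上表的衍生过程,用pandas按列做四则运算即可实现(示意代码,数据与上表一致):

```python
import pandas as pd

df = pd.DataFrame({'X1': [1, 1, 4, 2, 3],
                   'X2': [-1, 0, 2, 1, 0],
                   'X3': [3, 1, 2, 4, 0]})

# 多列相加、相乘,得到衍生特征
df['X1+X2+X3'] = df[['X1', 'X2', 'X3']].sum(axis=1)
df['X1*X2*X3'] = df[['X1', 'X2', 'X3']].prod(axis=1)
print(df)
```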
工作经验
与双特征的四则运算衍生相同,在以下两种场景,可以考虑使用四则运算的方式进行特征衍生:
创建具有明显业务含义的补充字段
在进行第二轮特征衍生的时候,采用四则运算的方式进行特征衍生,作为信息的一种补充。
此外,对于多特征的特征衍生,除了四则运算外,其他衍生方法往往伴随非常严重的信息衰减,通常不会优先考虑。
交叉组合
衍生过程
多特征交叉组合特征衍生过程(原始特征:ID、X1、X2、X3;衍生特征:(X1,X2,X3)交叉组合后的独热列):

ID  X1  X2  X3  (0,0,0)  (0,0,1)  (0,1,0)  (1,0,0)  (0,1,1)  (1,0,1)  (1,1,0)  (1,1,1)
 1   0   1   1        0        0        0        0        1        0        0        0
 2   1   1   0        0        0        0        0        0        0        1        0
 3   1   1   0        0        0        0        0        0        0        1        0
 4   0   0   1        0        1        0        0        0        0        0        0
 5   1   0   1        0        0        0        0        0        1        0        0
实现方法
以SeniorCitizen、Partner和Dependents三个特征的交叉组合过程为例。
新特征的名称
示例代码:
col_names = ['SeniorCitizen', 'Partner', 'Dependents']
col_names_new = col_names[0] + '_' + col_names[1] + '_' + col_names[2]
print(col_names_new)
运行结果:
1 SeniorCitizen_Partner_Dependents
也可以通过join()实现,示例代码:
print('_'.join(col_names))
运行结果:
1 SeniorCitizen_Partner_Dependents
新特征的生成
示例代码:
features_new = features[col_names[0]].astype('str')
for col in col_names[1:]:
    features_new = features_new + '_' + features[col].astype('str')

features_new = pd.DataFrame(features_new)
features_new.columns = [col_names_new]
print(features_new.head(5))
运行结果:
1 2 3 4 5 6 SeniorCitizen_Partner_Dependents 0 0_Yes_No 1 0_No_No 2 0_No_No 3 0_No_No 4 0_No_No
One-Hot编码
示例代码:
# OneHotEncoder需从sklearn.preprocessing导入;cate_col_name为前文封装的列名生成函数
enc = OneHotEncoder()
enc.fit_transform(features_new)

features_new_af = pd.DataFrame(data=enc.fit_transform(features_new).toarray(),
                               columns=cate_col_name(enc, [col_names_new], skip_binary=True))
print(features_new_af.head(5))
print(features_new_af.shape)
运行结果:
1 2 3 4 5 6 7 SeniorCitizen_Partner_Dependents_0_No_No SeniorCitizen_Partner_Dependents_0_No_Yes SeniorCitizen_Partner_Dependents_0_Yes_No SeniorCitizen_Partner_Dependents_0_Yes_Yes SeniorCitizen_Partner_Dependents_1_No_No SeniorCitizen_Partner_Dependents_1_No_Yes SeniorCitizen_Partner_Dependents_1_Yes_No SeniorCitizen_Partner_Dependents_1_Yes_Yes 0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 (7043, 8)
函数封装
def multi_cross_combination(col_names: list, features: pd.DataFrame, one_hot: bool = True):
    """
    多特征组合交叉衍生

    :param col_names: 参与交叉衍生的列名称
    :param features: 原始数据集
    :param one_hot: 是否进行独热编码
    :return: 交叉衍生后的新特征和新列名称
    """
    # 新特征的名称
    col_names_new = '_'.join([str(i) for i in col_names])

    # 新特征的生成
    features_new = features[col_names[0]].astype('str')
    for col in col_names[1:]:
        features_new = features_new + '_' + features[col].astype('str')
    features_new = pd.DataFrame(features_new, columns=[col_names_new])

    # 是否进行独热编码
    if one_hot:
        enc = OneHotEncoder()
        enc.fit_transform(features_new)
        col_names_new = cate_col_name(enc, [col_names_new], skip_binary=True)
        features_new = pd.DataFrame(enc.fit_transform(features_new).toarray(), columns=col_names_new)

    return features_new, col_names_new
示例代码:
col_names = ['SeniorCitizen', 'Partner', 'Dependents']

features_new, col_names_new = multi_cross_combination(col_names, features)
print(features_new)
print(col_names_new)
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 SeniorCitizen_Partner_Dependents_0_No_No SeniorCitizen_Partner_Dependents_0_No_Yes SeniorCitizen_Partner_Dependents_0_Yes_No SeniorCitizen_Partner_Dependents_0_Yes_Yes SeniorCitizen_Partner_Dependents_1_No_No SeniorCitizen_Partner_Dependents_1_No_Yes SeniorCitizen_Partner_Dependents_1_Yes_No SeniorCitizen_Partner_Dependents_1_Yes_Yes 0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... ... ... ... ... ... ... ... ... 7038 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 7039 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 7040 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 7041 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 7042 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [7043 rows x 8 columns] ['SeniorCitizen_Partner_Dependents_0_No_No', 'SeniorCitizen_Partner_Dependents_0_No_Yes', 'SeniorCitizen_Partner_Dependents_0_Yes_No', 'SeniorCitizen_Partner_Dependents_0_Yes_Yes', 'SeniorCitizen_Partner_Dependents_1_No_No', 'SeniorCitizen_Partner_Dependents_1_No_Yes', 'SeniorCitizen_Partner_Dependents_1_Yes_No', 'SeniorCitizen_Partner_Dependents_1_Yes_Yes']
工作经验
多特征的交叉组合衍生往往伴随着指数级的特征数量的增长。
如果特征矩阵过于稀疏(有较多零值),表示相同规模数据情况下包含了较少信息,而这也将极大程度影响后续建模过程。
一般情况下,如果有多个特征要进行交叉组合衍生,优先考虑两两组合进行交叉组合衍生。
只有在人工判断是极为重要的特征情况下,才考虑对其进行三个甚至更多的特征进行交叉组合衍生。
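在决定是否做多特征交叉组合之前,可以先估算组合后类别数的上限(各特征取值数的乘积),提前判断衍生结果是否会过于稀疏(示意代码,数据为虚构的小例子):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'SeniorCitizen': [0, 1, 0, 1],
                   'Partner': ['Yes', 'No', 'No', 'Yes'],
                   'Dependents': ['No', 'No', 'Yes', 'Yes']})

# 组合后类别数的上限 = 各特征取值数的乘积
upper_bound = int(np.prod([df[c].nunique() for c in df.columns]))
print(upper_bound)  # 2 * 2 * 2 = 8
```

若该上限相对样本量已经很大(例如达到样本量的百分之几),独热编码后的矩阵会非常稀疏,此时应慎重考虑是否衍生。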
分组统计
衍生过程
多特征分组统计特征衍生过程(衍生特征为按tenure和SeniorCitizen交叉组合分组后,对MonthlyCharges的统计结果):

ID  tenure  SeniorCitizen  MonthlyCharges  mean  max  min
 1       1              1               3     3    3    3
 2       2              1              12     8   12    4
 3       2              1               4     8   12    4
 4       1              0              11     9   11    7
 5       1              0               7     9   11    7
实现方法
核心要点
关于多特征分组统计的实现方法,我们在上文双特征分组统计的"groupby用法示例"部分已有讨论:在groupby方法的参数中传入包含多个列名的列表,即可实现多列交叉组合后的聚合。
组装agg
示例代码:
col_names_sub = ['MonthlyCharges', 'TotalCharges']

aggs = {}
for col in col_names_sub:
    aggs[col] = ['mean', 'min', 'max']
print(aggs)
运行结果:
1 {'MonthlyCharges': ['mean', 'min', 'max'], 'TotalCharges': ['mean', 'min', 'max']}
新特征的名称
示例代码:
key_col = ['tenure', 'SeniorCitizen']
cols = key_col.copy()

col_temp = cols[0]
for i in key_col[1:]:
    col_temp = col_temp + '_' + i

print(key_col)
print(col_temp)

for key in aggs.keys():
    cols.extend([key + '_' + col_temp + '_' + stat for stat in aggs[key]])
print(cols)
运行结果:
1 2 3 ['tenure', 'SeniorCitizen'] tenure_SeniorCitizen ['tenure', 'SeniorCitizen', 'MonthlyCharges_tenure_SeniorCitizen_mean', 'MonthlyCharges_tenure_SeniorCitizen_min', 'MonthlyCharges_tenure_SeniorCitizen_max', 'TotalCharges_tenure_SeniorCitizen_mean', 'TotalCharges_tenure_SeniorCitizen_min', 'TotalCharges_tenure_SeniorCitizen_max']
新特征的生成和组装
示例代码:
features_new = features.groupby(key_col).agg(aggs).reset_index()
features_new.columns = cols
print(features_new.head())
运行结果:
1 2 3 4 5 6 tenure SeniorCitizen MonthlyCharges_tenure_SeniorCitizen_mean MonthlyCharges_tenure_SeniorCitizen_min MonthlyCharges_tenure_SeniorCitizen_max TotalCharges_tenure_SeniorCitizen_mean TotalCharges_tenure_SeniorCitizen_min TotalCharges_tenure_SeniorCitizen_max 0 0 0 41.418182 19.70 80.85 0.000000 0.00 0.00 1 1 0 48.329127 18.80 102.45 48.329127 18.80 102.45 2 1 1 63.701744 19.45 100.80 63.701744 19.45 100.80 3 2 0 54.951538 18.75 104.40 109.696923 27.55 242.80 4 2 1 67.431395 19.20 95.50 135.353488 37.20 196.75
和原始数据进行拼接
核心要点
在双特征的衍生中,我们只根据一个特征进行分组统计,只需将该特征作为merge方法的关联条件即可;但现在是根据两个特征进行分组衍生。一个方法是,在原始数据集上创建tenure和SeniorCitizen的交叉组合特征,然后再进行拼接。这里巧妙地借助了交叉组合函数来实现该过程。
原数据集进行关键列的交叉组合
示例代码:
key_col = ['tenure', 'SeniorCitizen']

features_key1, col1 = multi_cross_combination(key_col, features, one_hot=False)
print(features_key1)
print(col1)
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 tenure_SeniorCitizen 0 1_0 1 34_0 2 2_0 3 45_0 4 2_0 ... ... 7038 24_0 7039 72_0 7040 11_0 7041 4_1 7042 66_0 [7043 rows x 1 columns] tenure_SeniorCitizen
新特征进行交叉组合
示例代码:
key_col = ['tenure', 'SeniorCitizen']
features_key2, col2 = multi_cross_combination(key_col, features_new, one_hot=False)
print(features_key2)
print(col2)
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 tenure_SeniorCitizen 0 0_0 1 1_0 2 1_1 3 2_0 4 2_1 .. ... 140 70_1 141 71_0 142 71_1 143 72_0 144 72_1 [145 rows x 1 columns] tenure_SeniorCitizen
数据集的拼接
示例代码:
features_new_af = pd.concat([features_key2, features_new], axis=1)
print(features_new_af)

features_new_m = pd.merge(features_key1, features_new_af, how='left', on=col_temp)
print(features_new_m)
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 tenure_SeniorCitizen tenure SeniorCitizen MonthlyCharges_tenure_SeniorCitizen_mean MonthlyCharges_tenure_SeniorCitizen_min MonthlyCharges_tenure_SeniorCitizen_max TotalCharges_tenure_SeniorCitizen_mean TotalCharges_tenure_SeniorCitizen_min TotalCharges_tenure_SeniorCitizen_max 0 0_0 0 0 41.418182 19.70 80.85 0.000000 0.00 0.00 1 1_0 1 0 48.329127 18.80 102.45 48.329127 18.80 102.45 2 1_1 1 1 63.701744 19.45 100.80 63.701744 19.45 100.80 3 2_0 2 0 54.951538 18.75 104.40 109.696923 27.55 242.80 4 2_1 2 1 67.431395 19.20 95.50 135.353488 37.20 196.75 .. ... ... ... ... ... ... ... ... ... 140 70_1 70 1 87.017647 19.55 113.65 6093.635294 1462.05 7987.60 141 71_0 71 0 72.497482 19.10 118.65 5148.248561 1301.10 8564.75 142 71_1 71 1 79.287097 19.80 115.05 5643.648387 1388.45 8016.60 143 72_0 72 0 78.687745 19.30 118.75 5668.428595 1304.80 8684.80 144 72_1 72 1 91.668750 24.30 117.35 6599.391964 1778.70 8443.70 [145 rows x 9 columns] tenure_SeniorCitizen tenure SeniorCitizen MonthlyCharges_tenure_SeniorCitizen_mean MonthlyCharges_tenure_SeniorCitizen_min MonthlyCharges_tenure_SeniorCitizen_max TotalCharges_tenure_SeniorCitizen_mean TotalCharges_tenure_SeniorCitizen_min TotalCharges_tenure_SeniorCitizen_max 0 1_0 1 0 48.329127 18.80 102.45 48.329127 18.80 102.45 1 34_0 34 0 66.715094 19.60 116.25 2261.290566 635.90 3946.90 2 2_0 2 0 54.951538 18.75 104.40 109.696923 27.55 242.80 3 45_0 45 0 69.703061 18.85 115.65 3148.641837 867.30 5125.50 4 2_0 2 0 54.951538 18.75 104.40 109.696923 27.55 242.80 ... ... ... ... ... ... ... ... ... ... 7038 24_0 24 0 58.976923 19.55 104.65 1419.914103 439.75 2542.45 7039 72_0 72 0 78.687745 19.30 118.75 5668.428595 1304.80 8684.80 7040 11_0 11 0 54.326220 19.25 102.00 603.943902 180.30 1222.05 7041 4_1 4 1 71.798077 20.05 99.80 290.625000 91.45 442.85 7042 66_0 66 0 69.363433 19.35 115.80 4581.785075 1240.80 7942.15 [7043 rows x 9 columns]
提取新特征
示例代码:
print(features_new_m.iloc[:, len(key_col) + 1:])
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 MonthlyCharges_tenure_SeniorCitizen_mean MonthlyCharges_tenure_SeniorCitizen_min MonthlyCharges_tenure_SeniorCitizen_max TotalCharges_tenure_SeniorCitizen_mean TotalCharges_tenure_SeniorCitizen_min TotalCharges_tenure_SeniorCitizen_max 0 48.329127 18.80 102.45 48.329127 18.80 102.45 1 66.715094 19.60 116.25 2261.290566 635.90 3946.90 2 54.951538 18.75 104.40 109.696923 27.55 242.80 3 69.703061 18.85 115.65 3148.641837 867.30 5125.50 4 54.951538 18.75 104.40 109.696923 27.55 242.80 ... ... ... ... ... ... ... 7038 58.976923 19.55 104.65 1419.914103 439.75 2542.45 7039 78.687745 19.30 118.75 5668.428595 1304.80 8684.80 7040 54.326220 19.25 102.00 603.943902 180.30 1222.05 7041 71.798077 20.05 99.80 290.625000 91.45 442.85 7042 69.363433 19.35 115.80 4581.785075 1240.80 7942.15 [7043 rows x 6 columns]
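补充一点:pd.merge的on参数本身也支持直接传入多个列名的列表,按多列进行关联,无需先拼接出单一的关联列(示意代码,数据为虚构的小例子):

```python
import pandas as pd

features = pd.DataFrame({'tenure': [1, 1, 2],
                         'SeniorCitizen': [0, 1, 0],
                         'MonthlyCharges': [30.0, 60.0, 50.0]})
stats = features.groupby(['tenure', 'SeniorCitizen'])['MonthlyCharges'].mean().reset_index(name='MC_mean')

# on 直接传列名列表,pandas 会按多列联合匹配
merged = pd.merge(features, stats, how='left', on=['tenure', 'SeniorCitizen'])
print(merged)
```

交叉组合关联列的写法也有其价值:拼出的组合键本身就是一个可复用的衍生特征。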
函数封装
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 def multi_group_statistics (key_col: list, features: pd.DataFrame, col_num: list = None, col_cat: list = None, num_stat: list = ['mean' , 'var' , 'max' , 'min' , 'skew' , 'median' ], cat_stat: list = ['mean' , 'var' , 'max' , 'min' , 'median' , 'count' , 'nunique' ], quantile: bool = True) : """ 多特征分组统计特征衍生函数 :param key_col: 分组参考的关键特征 :param features: 原始数据集 :param col_num: 参与衍生的连续型特征 :param col_cat: 参与衍生的离散型特征 :param num_stat: 连续特征分组统计指标 :param cat_stat: 离散特征分组统计指标 :param quantile: 是否计算分位数 :return:交叉衍生后的新特征和新特征的名称 """ features_key1, col1 = multi_cross_combination(key_col, features, one_hot=False ) if col_num is not None : aggs_num = {} col_names = col_num for col in col_num: aggs_num[col] = num_stat cols_num = key_col.copy() for key in aggs_num.keys(): cols_num.extend([key + '_' + col1 + '_' + stat for stat in aggs_num[key]]) features_num_new = features[col_num + key_col].groupby(key_col).agg(aggs_num).reset_index() features_num_new.columns = cols_num features_key2, col2 = multi_cross_combination(key_col, features_num_new, one_hot=False ) features_num_new = pd.concat([features_key2, features_num_new], axis=1 ) if col_cat is not None : aggs_cat = {} col_names = col_num + col_cat for col in col_cat: aggs_cat[col] = cat_stat cols_cat = key_col.copy() for key in aggs_cat.keys(): cols_cat.extend([key + '_' + col1 + '_' + stat for stat in aggs_cat[key]]) features_cat_new = features[col_cat + key_col].groupby(key_col).agg(aggs_cat).reset_index() features_cat_new.columns = 
cols_cat features_key3, col3 = multi_cross_combination(key_col, features_cat_new, one_hot=False ) features_cat_new = pd.concat([features_key3, features_cat_new], axis=1 ) df_temp = pd.concat([features_num_new, features_cat_new], axis=1 ) df_temp = df_temp.loc[:, ~df_temp.columns.duplicated()] features_new = pd.merge(features_key1, df_temp, how='left' , on=col1) else : features_new = pd.merge(features_key1, features_num_new, how='left' , on=col1) features_new = features_new.loc[:, ~features_new.columns.duplicated()] else : if col_cat is not None : aggs_cat = {} col_names = col_cat for col in col_cat: aggs_cat[col] = cat_stat cols_cat = key_col.copy() for key in aggs_cat.keys(): cols_cat.extend([key + '_' + col1 + '_' + stat for stat in aggs_cat[key]]) features_cat_new = features[col_cat + key_col].groupby(key_col).agg(aggs_cat).reset_index() features_cat_new.columns = cols_cat features_new = pd.merge(features_key1, features_cat_new, how='left' , on=col1) features_new = features_new.loc[:, ~features_new.columns.duplicated()] if quantile: def q1 (x) : """ 下四分位数 """ return x.quantile(0.25 ) def q2 (x) : """ 上四分位数 """ return x.quantile(0.75 ) agg_name = {} for col in col_names: agg_name[col] = ['q1' , 'q2' ] cols = key_col.copy() for key in agg_name.keys(): cols.extend([key + '_' + col1 + '_' + stat for stat in agg_name[key]]) aggs = {} for col in col_names: aggs[col] = [q1, q2] features_temp = features[col_names + key_col].groupby(key_col).agg(aggs).reset_index() features_temp.columns = cols features_new.drop(key_col, axis=1 , inplace=True ) features_key4, col4 = multi_cross_combination(key_col, features_temp, one_hot=False ) features_temp = pd.concat([features_key4, features_temp], axis=1 ) features_new = pd.merge(features_new, features_temp, how='left' , on=col1) features_new = features_new.loc[:, ~features_new.columns.duplicated()] features_new.drop(key_col + [col1], axis=1 , inplace=True ) col_names_new = list(features_new.columns) return features_new, col_names_new
为了便于演示,我们选择相关的列。示例代码:
cols = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges', 'Partner']
features_oe = features[cols]
print(features_oe.head(5))
运行结果:
1 2 3 4 5 6 tenure SeniorCitizen MonthlyCharges TotalCharges Partner 0 1 0 29.85 29.85 Yes 1 34 0 56.95 1889.50 No 2 2 0 53.85 108.15 No 3 45 0 42.30 1840.75 No 4 2 0 70.70 151.65 No
上文的Partner特征取值都是Yes和No这类字符串,我们对其进行有序编码。示例代码:
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(dtype=int)
features_oe['Partner'] = enc.fit_transform(pd.DataFrame(features_oe['Partner']))
print(features_oe.head(5))
运行结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 tenure SeniorCitizen MonthlyCharges TotalCharges Partner 0 1 0 29.85 29.85 1 1 34 0 56.95 1889.50 0 2 2 0 53.85 108.15 0 3 45 0 42.30 1840.75 0 4 2 0 70.70 151.65 0 SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy features_oe['Partner'] = enc.fit_transform(pd.DataFrame(features_oe['Partner']))
解释说明:输出中出现SettingWithCopyWarning,是因为features_oe是features的切片(视图),在其上直接赋值会触发该警告;在创建features_oe时改用features[cols].copy()即可避免。
调用我们封装的函数,示例代码:
# 补充定义本段此前未给出的参数
key_col = ['tenure', 'SeniorCitizen']
col_num = ['MonthlyCharges', 'TotalCharges']
col_cat = ['Partner']

features_new, col_names_new = multi_group_statistics(key_col, features_oe, col_num, col_cat)
print(features_new)
print(col_names_new)
运行结果:
   MonthlyCharges_tenure_SeniorCitizen_mean  MonthlyCharges_tenure_SeniorCitizen_var  MonthlyCharges_tenure_SeniorCitizen_max  MonthlyCharges_tenure_SeniorCitizen_min  MonthlyCharges_tenure_SeniorCitizen_skew  MonthlyCharges_tenure_SeniorCitizen_median  TotalCharges_tenure_SeniorCitizen_mean  TotalCharges_tenure_SeniorCitizen_var  TotalCharges_tenure_SeniorCitizen_max  TotalCharges_tenure_SeniorCitizen_min  TotalCharges_tenure_SeniorCitizen_skew  TotalCharges_tenure_SeniorCitizen_median  Partner_tenure_SeniorCitizen_mean  Partner_tenure_SeniorCitizen_var  Partner_tenure_SeniorCitizen_max  Partner_tenure_SeniorCitizen_min  Partner_tenure_SeniorCitizen_median  Partner_tenure_SeniorCitizen_count  Partner_tenure_SeniorCitizen_nunique  MonthlyCharges_tenure_SeniorCitizen_q1  MonthlyCharges_tenure_SeniorCitizen_q2  TotalCharges_tenure_SeniorCitizen_q1  TotalCharges_tenure_SeniorCitizen_q2  Partner_tenure_SeniorCitizen_q1  Partner_tenure_SeniorCitizen_q2
0     48.329127   607.729345  102.45  18.80  0.227341  45.850    48.329127  6.077293e+02   102.45    18.80  0.227341    45.850  0.144213  0.123650  1  0  0.0  527  2  20.4500   70.5500    20.4500    70.5500  0.0  0.00
1     66.715094   966.247845  116.25  19.60 -0.198674  67.650  2261.290566  1.120146e+06  3946.90   635.90 -0.193152  2339.300  0.566038  0.250363  1  0  1.0   53  2  40.5500   90.0500  1325.8500  3097.0000  0.0  1.00
2     54.951538   652.720616  104.40  18.75 -0.099830  55.050   109.696923  2.734438e+03   242.80    27.55 -0.012722   113.550  0.210256  0.166905  1  0  0.0  195  2  24.8750   76.1500    56.8750   151.7000  0.0  0.00
3     69.703061  1008.318063  115.65  18.85 -0.363734  78.800  3148.641837  2.079569e+06  5125.50   867.30 -0.347404  3541.100  0.612245  0.242347  1  0  1.0   49  2  50.2500   95.9500  2221.5500  4442.7500  0.0  1.00
4     54.951538   652.720616  104.40  18.75 -0.099830  55.050   109.696923  2.734438e+03   242.80    27.55 -0.012722   113.550  0.210256  0.166905  1  0  0.0  195  2  24.8750   76.1500    56.8750   151.7000  0.0  0.00
...          ...          ...     ...    ...       ...     ...          ...           ...      ...      ...       ...       ...       ...       ... .. ..  ...  ... ..      ...       ...        ...        ...  ...   ...
7038  58.976923   794.152642  104.65  19.55 -0.101257  56.300  1419.914103  4.618224e+05  2542.45   439.75 -0.085949  1374.475  0.500000  0.253247  1  0  0.5   78  2  24.8125   84.6875   606.2000  2025.0375  0.0  1.00
7039  78.687745  1059.373251  118.75  19.30 -0.703365  86.650  5668.428595  5.539662e+06  8684.80  1304.80 -0.687450  6242.275  0.859477  0.121172  1  0  1.0  306  2  63.7250  107.1625  4491.1375  7658.0750  1.0  1.00
7040  54.326220   700.932853  102.00  19.25  0.039721  54.825   603.943902  8.878321e+04  1222.05   180.30  0.080272   613.000  0.329268  0.223577  1  0  0.0   82  2  25.0500   75.1750   275.6875   837.0125  0.0  1.00
7041  71.798077   498.089096   99.80  20.05 -1.000718  74.925   290.625000  9.093339e+03   442.85    91.45 -0.612205   305.150  0.269231  0.204615  1  0  0.0   26  2  69.5625   87.7875   264.5875   347.2875  0.0  0.75
7042  69.363433   880.913832  115.80  19.35 -0.366816  68.750  4581.785075  3.845390e+06  7942.15  1240.80 -0.354953  4508.650  0.567164  0.249209  1  0  1.0   67  2  52.6000   91.9250  3415.1250  6103.4250  0.0  1.00
[7043 rows x 25 columns]
['MonthlyCharges_tenure_SeniorCitizen_mean', 'MonthlyCharges_tenure_SeniorCitizen_var', 'MonthlyCharges_tenure_SeniorCitizen_max', 'MonthlyCharges_tenure_SeniorCitizen_min', 'MonthlyCharges_tenure_SeniorCitizen_skew', 'MonthlyCharges_tenure_SeniorCitizen_median', 'TotalCharges_tenure_SeniorCitizen_mean', 'TotalCharges_tenure_SeniorCitizen_var', 'TotalCharges_tenure_SeniorCitizen_max', 'TotalCharges_tenure_SeniorCitizen_min', 'TotalCharges_tenure_SeniorCitizen_skew', 'TotalCharges_tenure_SeniorCitizen_median', 'Partner_tenure_SeniorCitizen_mean', 'Partner_tenure_SeniorCitizen_var', 'Partner_tenure_SeniorCitizen_max', 'Partner_tenure_SeniorCitizen_min', 'Partner_tenure_SeniorCitizen_median', 'Partner_tenure_SeniorCitizen_count', 'Partner_tenure_SeniorCitizen_nunique', 'MonthlyCharges_tenure_SeniorCitizen_q1', 'MonthlyCharges_tenure_SeniorCitizen_q2', 'TotalCharges_tenure_SeniorCitizen_q1', 'TotalCharges_tenure_SeniorCitizen_q2', 'Partner_tenure_SeniorCitizen_q1', 'Partner_tenure_SeniorCitizen_q2']
The statement features_oe['Partner'] = enc.fit_transform(pd.DataFrame(features_oe['Partner'])) above raises a warning.
Adding the line features_oe = features_oe.copy() right before it makes the warning go away.
features_oe = features_oe.copy()
features_oe['Partner'] = enc.fit_transform(pd.DataFrame(features_oe['Partner']))
For the underlying reason, see the "chained indexing" section of 《用Python分析数据的方法和技巧:3.pandas》.
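As a minimal sketch of the chained-indexing problem (the DataFrame and values here are made up for illustration): a slice of a DataFrame may still be linked to the original object, so assigning into it triggers SettingWithCopyWarning; an explicit copy() severs that link.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# A filtered slice may be a view of df; assigning into it can trigger
# SettingWithCopyWarning because pandas cannot tell whether you meant
# to modify df or an independent object.
sub = df[df['a'] > 1]

# An explicit copy() makes sub an independent object, so the
# assignment below is unambiguous and warning-free.
sub = sub.copy()
sub['a'] = sub['a'] * 10

print(sub['a'].tolist())  # the modified copy
print(df['a'].tolist())   # the original df is untouched
```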
Practical experience
Judging from the results, multi-feature group-by statistical derivation does present the dataset's information at a finer granularity.
Within a limited scope, therefore, multi-feature group-by statistical derivation can achieve better results.
Note the qualifier "within a limited scope": this finer granularity is not unconditionally better.
The more features participate in the cross, the more groups are produced; on the same dataset, more groups means fewer samples per group, and when a group holds too few samples, its statistics are no longer representative.
As a rule of thumb, only for features judged extremely important is it worth crossing two or three features for grouping.
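One practical way to apply the "limited scope" advice is to inspect group sizes before trusting the derived statistics. A sketch with made-up data and an illustrative min_samples threshold (in the article's setting this would be the Telco dataset):

```python
import pandas as pd

# Made-up sample data standing in for the real dataset
df = pd.DataFrame({
    'tenure': [1, 1, 2, 2, 2, 3],
    'SeniorCitizen': [0, 1, 0, 0, 1, 1],
    'MonthlyCharges': [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
})

# Size of each cross-group: tiny groups yield unreliable statistics
group_sizes = df.groupby(['tenure', 'SeniorCitizen']).size()
print(group_sizes)

min_samples = 2  # illustrative threshold, tune for your data
small = group_sizes[group_sizes < min_samples]
print(f'{len(small)} of {len(group_sizes)} groups have fewer than {min_samples} samples')
```

If most groups fall below the threshold, the cross is too fine-grained and fewer features should participate.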
Polynomials
Derivation process
Multi-feature second-order polynomial derivation (original features on the left, second-order derived features on the right):

ID  X1  X2  X3 |  X1^2  X1*X2  X1*X3  X2^2  X2*X3  X3^2
 1   1   1   2 |     1      1      2     1      2     4
 2   3   1   5 |     9      3     15     1      5    25
 3   5  -1  -2 |    25     -5    -10     1      2     4
 4   2   3  -3 |     4      6     -6     9     -9     9
 5   1   2   1 |     1      2      1     4      2     1
Multi-feature third-order polynomial derivation (second-order derived features omitted):

ID  X1  X2  X3 | X1^3  X1^2*X2  X1^2*X3  X1*X2^2  X1*X2*X3  X1*X3^2  X2^3  X2^2*X3  X2*X3^2  X3^3
 1   1   1   2 |    1        1        2        1         2        4     1        2        4     8
 2   3   1   5 |   27        9       45        3        15       75     1        5       25   125
 3   5  -1  -2 |  125      -25      -50        5        10       20    -1       -2       -4    -8
 4   2   3  -3 |    8       12      -12       18       -18       18    27      -27       27   -27
 5   1   2   1 |    1        2        1        4         2        1     8        4        2     1
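The third-order row for ID 3 (X1=5, X2=-1, X3=-2) can be checked directly:

```python
# Recompute the third-order terms for X1=5, X2=-1, X3=-2
x1, x2, x3 = 5, -1, -2
terms = [x1**3, x1**2 * x2, x1**2 * x3, x1 * x2**2, x1 * x2 * x3,
         x1 * x3**2, x2**3, x2**2 * x3, x2 * x3**2, x3**3]
print(terms)  # → [125, -25, -50, 5, 10, 20, -1, -2, -4, -8]
```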
Implementation notes
PolynomialFeatures
Multi-feature polynomial derivation is also implemented with PolynomialFeatures: passing it a DataFrame with multiple features yields the multi-feature polynomial derivation.
Sample code:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({'X1': [1, 3, 5], 'X2': [1, 1, -1], 'X3': [2, 5, -2]})
print(df)
df = PolynomialFeatures(degree=2, include_bias=False).fit_transform(df)
print(df)
Output:
   X1  X2  X3
0   1   1   2
1   3   1   5
2   5  -1  -2
[[  1.   1.   2.   1.   1.   2.   1.   2.   4.]
 [  3.   1.   5.   9.   3.  15.   1.   5.  25.]
 [  5.  -1.  -2.  25.  -5. -10.   1.   2.   4.]]
Names of the new features
The columns of multi-feature polynomial derivation are arranged as follows:
First come the first powers of each feature (that is, the original features themselves).
Then the combinations follow, ordered by descending powers of $X_1$, then $X_2$, then $X_3$.
For example, for the three features $X_1$, $X_2$, $X_3$, the degree-3 derived columns are ordered:
$X_1^3 * X_2^0 * X_3^0$
$X_1^2 * X_2^1 * X_3^0$, $X_1^2 * X_2^0 * X_3^1$
$X_1^1 * X_2^2 * X_3^0$, $X_1^1 * X_2^1 * X_3^1$, $X_1^1 * X_2^0 * X_3^2$
$X_1^0 * X_2^3 * X_3^0$, $X_1^0 * X_2^2 * X_3^1$, $X_1^0 * X_2^1 * X_3^2$, $X_1^0 * X_2^0 * X_3^3$
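This ordering rule can be reproduced with itertools.product; exponent_order below is an illustrative helper for the demonstration, not part of the article's own code:

```python
from itertools import product

def exponent_order(n_features: int, degree: int):
    """Exponent tuples in the column order described above:
    grouped by total degree, and within each degree ordered by
    descending powers of X1, then X2, then X3, ..."""
    out = []
    for d in range(2, degree + 1):
        # Descending ranges enumerated with product() give the
        # descending lexicographic order; keep tuples summing to d.
        for exps in product(range(d, -1, -1), repeat=n_features):
            if sum(exps) == d:
                out.append(exps)
    return out

print(exponent_order(3, 3))
```

The first entry is (2, 0, 0), i.e. $X_1^2$, and the degree-3 block starts with (3, 0, 0), matching the listing above.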
The concrete code for generating the new feature names is shown below.
Function encapsulation
import numpy as np
import pandas as pd
from itertools import product
from sklearn.preprocessing import PolynomialFeatures


def multi_polynomial_features(col_names: list, degree: int, features: pd.DataFrame):
    """
    Multi-feature polynomial derivation for continuous features.
    :param col_names: names of the columns participating in the cross derivation
    :param degree: highest polynomial degree
    :param features: original dataset
    :return: the derived new features and their column names
    """
    col_names_new_l = []
    n = len(col_names)
    features = features[col_names]

    # Derive the polynomial features; drop the first n columns,
    # which are just the original features themselves.
    array_new_temp = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(features)
    array_new_temp = array_new_temp[:, n:]

    # Build the new column names: for each degree deg, enumerate the
    # exponent tuples in descending lexicographic order and keep those
    # summing to deg, matching PolynomialFeatures' column order.
    deg = 2
    while deg <= degree:
        m = 1
        a1 = range(deg, -1, -1)
        a2 = []
        while m < n:
            a1 = list(product(a1, range(deg, -1, -1)))
            if m > 1:
                # Flatten the nested tuples produced by repeated product() calls
                for i in a1:
                    i_temp = list(i[0])
                    i_temp.append(i[1])
                    a2.append(i_temp)
                a1 = a2.copy()
                a2 = []
            m += 1
        a1 = np.array(a1)
        a3 = a1[a1.sum(1) == deg]
        for i in a3:
            # e.g. exponents (2, 0, 1) -> 'X1_X2_X3_201'
            col_names_new_l.append('_'.join(col_names) + '_' + ''.join([str(x) for x in i]))
        deg += 1

    features_new = pd.DataFrame(array_new_temp, columns=col_names_new_l)
    col_names_new = col_names_new_l
    return features_new, col_names_new
Sample code:
df = pd.DataFrame({'X1': [1, 3, 5], 'X2': [1, 1, -1], 'X3': [2, 5, -2]})
cols = ['X1', 'X2', 'X3']
print(df)
features_new, col_names_new = multi_polynomial_features(cols, 4, df)
print(features_new)
print(col_names_new)
Output:
   X1  X2  X3
0   1   1   2
1   3   1   5
2   5  -1  -2
   X1_X2_X3_200  X1_X2_X3_110  X1_X2_X3_101  X1_X2_X3_020  X1_X2_X3_011  X1_X2_X3_002  X1_X2_X3_300  X1_X2_X3_210  X1_X2_X3_201  X1_X2_X3_120  X1_X2_X3_111  X1_X2_X3_102  X1_X2_X3_030  X1_X2_X3_021  X1_X2_X3_012  X1_X2_X3_003  X1_X2_X3_400  X1_X2_X3_310  X1_X2_X3_301  X1_X2_X3_220  X1_X2_X3_211  X1_X2_X3_202  X1_X2_X3_130  X1_X2_X3_121  X1_X2_X3_112  X1_X2_X3_103  X1_X2_X3_040  X1_X2_X3_031  X1_X2_X3_022  X1_X2_X3_013  X1_X2_X3_004
0           1.0           1.0           2.0           1.0           2.0           4.0           1.0           1.0           2.0           1.0           2.0           4.0           1.0           2.0           4.0           8.0           1.0           1.0           2.0           1.0           2.0           4.0           1.0           2.0           4.0           8.0           1.0           2.0           4.0           8.0          16.0
1           9.0           3.0          15.0           1.0           5.0          25.0          27.0           9.0          45.0           3.0          15.0          75.0           1.0           5.0          25.0         125.0          81.0          27.0         135.0           9.0          45.0         225.0           3.0          15.0          75.0         375.0           1.0           5.0          25.0         125.0         625.0
2          25.0          -5.0         -10.0           1.0           2.0           4.0         125.0         -25.0         -50.0           5.0          10.0          20.0          -1.0          -2.0          -4.0          -8.0         625.0        -125.0        -250.0          25.0          50.0         100.0          -5.0         -10.0         -20.0         -40.0           1.0           2.0           4.0           8.0          16.0
['X1_X2_X3_200', 'X1_X2_X3_110', 'X1_X2_X3_101', 'X1_X2_X3_020', 'X1_X2_X3_011', 'X1_X2_X3_002', 'X1_X2_X3_300', 'X1_X2_X3_210', 'X1_X2_X3_201', 'X1_X2_X3_120', 'X1_X2_X3_111', 'X1_X2_X3_102', 'X1_X2_X3_030', 'X1_X2_X3_021', 'X1_X2_X3_012', 'X1_X2_X3_003', 'X1_X2_X3_400', 'X1_X2_X3_310', 'X1_X2_X3_301', 'X1_X2_X3_220', 'X1_X2_X3_211', 'X1_X2_X3_202', 'X1_X2_X3_130', 'X1_X2_X3_121', 'X1_X2_X3_112', 'X1_X2_X3_103', 'X1_X2_X3_040', 'X1_X2_X3_031', 'X1_X2_X3_022', 'X1_X2_X3_013', 'X1_X2_X3_004']
Practical experience
Cross products of multiple features can improve the expressiveness of the feature set.
However, as the number of participating features and the polynomial degree grow, the number of derived features increases exponentially, and their values become very unstable: they easily shrink toward zero or explode to very large numbers.
As a rule of thumb, consider third- or even fourth-order multi-feature polynomial derivation only for features judged extremely important.
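The exponential growth is easy to quantify: with n features and maximum degree d, PolynomialFeatures (bias column excluded) produces C(n+d, d) − 1 columns in total. A quick sketch (n_poly_features is an illustrative helper):

```python
from math import comb

# Total polynomial feature columns (bias excluded) for n features, max degree d
def n_poly_features(n: int, d: int) -> int:
    return comb(n + d, d) - 1

for n in (3, 5, 10):
    print(n, [n_poly_features(n, d) for d in (2, 3, 4)])
```

For n=3, d=2 this gives 9 columns, matching the earlier degree-2 output; at n=10, d=4 it already reaches 1000 columns, which illustrates why this derivation should be reserved for a handful of important features.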