
Programming notes. Published: 2022-07-03. Source site: 大佬教程 code.js-code.com

This article, collected by 大佬教程, covers [Feature Engineering] encoding and is shared here for reference.

Reference: An Overview of Encoding Techniques | Kaggle

Method 1: Label encoding

Assign each category an integer label, so that the categories are mapped onto the natural numbers.

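import pandas as pd
from sklearn.preprocessing import LabelEncoder

# X is assumed to be a pandas DataFrame of raw features
Train = pd.DataFrame()
label = LabelEncoder()
for c in X.columns:
    if X[c].dtype == 'object':
        # fit_transform refits the encoder on every string column
        Train[c] = label.fit_transform(X[c])
    else:
        # non-string columns are copied unchanged
        Train[c] = X[c]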
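
To make the mapping concrete, here is a minimal sketch on a toy list (the `colors` data is made up for illustration and is not from the original notebook). `LabelEncoder` sorts the distinct values and numbers them in that order, so the integers carry no meaning beyond identity.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, only for illustration
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted distinct values; the position is the label
mapping = {name: idx for idx, name in enumerate(encoder.classes_)}
print(mapping)
print(list(codes))

# inverse_transform recovers the original strings
restored = list(encoder.inverse_transform(codes))
```

Because the labels are ordered integers, a linear model may read an ordering into them that the categories do not have; tree-based models are usually less sensitive to this.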

Method 2: One hot encoding

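Also known as one-hot coding: each category corresponds to one position in the feature vector, and a 1 in that position indicates that the sample belongs to that category.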

You can use either pd.get_dummies() or OneHotEncoder from sklearn.preprocessing.

from sklearn.preprocessing import OneHotEncoder

one = OneHotEncoder()
one.fit(X)
# transform returns a SciPy sparse matrix by default
Train = one.transform(X)
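
The text above also mentions pd.get_dummies(). A minimal sketch of that route, on a made-up one-column DataFrame (the `toy` frame and its `color` column are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One indicator column per distinct value, named <column>_<value>
dummies = pd.get_dummies(toy, columns=["color"])
print(dummies.columns.tolist())

# Every row has exactly one active indicator
row_sums = dummies.astype(int).sum(axis=1).tolist()
```

Unlike OneHotEncoder, get_dummies returns a dense DataFrame and does not remember the categories it saw, so it is mostly convenient when train and test data are encoded together.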

Method 3: Feature Hashing / Hashing Trick

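A "one hot encoding style" encoding that maps the data into a sparse matrix with a fixed number of dimensions; a hash function is used to reduce the dimensionality.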

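from sklearn.feature_extraction import FeatureHasher

X_Train_hash = X.copy()
for c in X.columns:
    # the hasher works on string tokens, so cast every column to str
    X_Train_hash[c] = X[c].astype('str')
# each row is treated as a list of string tokens
hashing = FeatureHasher(input_type='string')
Train = hashing.transform(X_Train_hash.values)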
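
A small sketch of how the output width is controlled, assuming a made-up set of rows. Prefixing each value with its column name (`color=red`) is an assumption added here, not something the original code does; it keeps equal values from different columns from hashing to the same token.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical toy rows; each row is a list of "column=value" tokens
rows = [
    ["color=red", "size=S"],
    ["color=green", "size=M"],
    ["color=red", "size=M"],
]

# n_features fixes the number of output columns regardless of cardinality
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform(rows)
print(hashed.shape)

# The hasher is stateless: the same row always hashes to the same vector
same_again = hasher.transform([rows[0]]).toarray()
```

A smaller `n_features` saves memory but makes collisions between different tokens more likely, so the value is a trade-off to tune.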

Method 4: Encoding categories with dataset statistics

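The goal is to give the model a lower-dimensional representation of each category, in which similar categories have similar representations. The simplest approach is to replace each category with the number of times it appears in the dataset, i.e. to use its frequency of occurrence as its embedding.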

import numpy as np

X_Train_stat = X.copy()
for c in X_Train_stat.columns:
    if X_Train_stat[c].dtype == 'object':
        # number of occurrences of each category
        counts = X_Train_stat[c].value_counts()
        # small random jitter so categories with equal counts stay distinct
        counts = counts + np.random.rand(len(counts)) / 1000
        # replace each category with its (jittered) count
        X_Train_stat[c] = X_Train_stat[c].map(counts)
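
As a worked example of count encoding, here is a minimal sketch on a made-up column (the `toy` frame and the `city` column are assumptions for illustration). It leaves out the random jitter so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"city": ["a", "b", "a", "c", "a", "b"]})

# value_counts gives the occurrences per category; map looks each row up
counts = toy["city"].value_counts()
toy["city_count"] = toy["city"].map(counts)
print(toy)

encoded = toy["city_count"].tolist()
```

Here "a" appears three times, "b" twice and "c" once, so the column becomes 3, 2, 3, 1, 3, 2.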

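For cyclic features such as the day of the month or the day of the week, a common approach is to use sine and cosine to map them into a two-dimensional space. This relies on their periodic nature and is similar to dividing a circle into equal segments.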

X_Train_cyclic = X.copy()
columns = ['day', 'month']
for col in columns:
    X_Train_cyclic[col+'_sin'] = np.sin((2*np.pi*X_Train_cyclic[col])/max(X_Train_cyclic[col]))
    X_Train_cyclic[col+'_cos'] = np.cos((2*np.pi*X_Train_cyclic[col])/max(X_Train_cyclic[col]))
X_Train_cyclic = X_Train_cyclic.drop(columns, axis=1)
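
To illustrate why the sine/cosine pair suits cyclic features, here is a minimal sketch. The helper `encode_cyclic` is a hypothetical name introduced here, and the period of 12 for months is assumed; it matches dividing by the column maximum when months run from 1 to 12.

```python
import numpy as np

def encode_cyclic(values, period):
    """Map cyclic values onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# December, January, February, June
dec, jan, feb, jun = encode_cyclic([12, 1, 2, 6], period=12)

# Neighbouring months are equally far apart, including across the year end
d_dec_jan = np.linalg.norm(dec - jan)
d_jan_feb = np.linalg.norm(jan - feb)
# Months half a year apart sit on opposite sides of the circle
d_dec_jun = np.linalg.norm(dec - jun)
print(d_dec_jan, d_jan_feb, d_dec_jun)
```

With the raw integers, December (12) and January (1) look eleven units apart; on the circle they are as close as any other pair of adjacent months.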

# Target (mean) encoding: replace each category with the mean of the target for that category
X_target = df_Train.copy()
X_target['day'] = X_target['day'].astype('object')
X_target['month'] = X_target['month'].astype('object')
for col in X_target.columns:
    if X_target[col].dtype == 'object':
        target = dict(X_target.groupby(col)['target'].agg('sum') / X_target.groupby(col)['target'].agg('count'))
        X_target[col] = X_target[col].replace(target).values

from sklearn.model_selection import KFold

X['target'] = y
cols = X.drop(['target', 'id'], axis=1).columns

# (the original notebook timed the cell below with the %%time magic)
X_fold = X.copy()
X_fold[['ord_0', 'day', 'month']] = X_fold[['ord_0', 'day', 'month']].astype('object')
X_fold[['bin_3', 'bin_4']] = X_fold[['bin_3', 'bin_4']].replace({'Y': 1, 'N': 0, 'T': 1, 'F': 0})
# random_state only applies when shuffle=True, so it is omitted here
kf = KFold(n_splits=5, shuffle=False)
for Train_ind, val_ind in kf.split(X):
    for col in cols:
        if X_fold[col].dtype == 'object':
            # category means computed on the training folds only
            replaced = dict(X.iloc[Train_ind][[col, 'target']].groupby(col)['target'].mean())
            # assumes X has a default RangeIndex, so positions equal labels
            X_fold.loc[val_ind, col] = X_fold.iloc[val_ind][col].replace(replaced).values
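
The K-fold variant above can be traced by hand on a tiny example. This is a sketch under assumptions: the `toy` frame, its `cat` and `target` columns, and the choice of three folds are all made up for illustration. Each row is encoded with category means computed only from the other folds, so a row's own target never leaks into its feature.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "cat":    ["a", "a", "b", "b", "a", "b"],
    "target": [1, 0, 1, 1, 1, 0],
})
toy["cat_te"] = float("nan")

kf = KFold(n_splits=3, shuffle=False)
for train_idx, val_idx in kf.split(toy):
    # Category means from the training folds only
    means = toy.iloc[train_idx].groupby("cat")["target"].mean()
    # Encode the held-out rows with those means
    toy.loc[toy.index[val_idx], "cat_te"] = (
        toy["cat"].iloc[val_idx].map(means).to_numpy()
    )

encoded = toy["cat_te"].tolist()
print(toy)
```

A category that never appears in the training folds would come out as NaN here; filling such gaps with the global target mean is one common choice, though the original code does not do this.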
