In this article, 云朵君 walks through how, in data mining, combining several machine learning models can achieve very good results. Let's take a look together; if you have a better approach, feel free to share it.
Ensemble Learning

Ensemble learning is currently one of the most popular machine learning approaches. It is not a single learning algorithm in itself; instead, it builds and combines multiple learners to complete a learning task, which is what we often describe as "drawing on the strengths of many". Ensemble learning can be applied to classification, regression, feature selection, anomaly detection, and more; you can find it in virtually every area of machine learning. This article summarizes its underlying principles.
Combination Strategies in Ensemble Learning

In the sections above we focused mainly on the learners themselves and only mentioned combination strategies in passing; this section summarizes the combination strategies used in ensemble learning. We assume the T weak learners we have obtained are h_1, h_2, ..., h_T.
Averaging
For numerical regression problems, the most common combination strategy is averaging: the outputs of the weak learners are averaged to obtain the final predicted output.
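To make this concrete, here is a minimal sketch (not from the original article) of simple and weighted averaging; the prediction arrays and weights below are made up purely for illustration:

import numpy as np

# Hypothetical predictions from T = 3 weak regressors on the same three samples
preds = np.array([
    [2.9, 3.1, 3.0],   # h_1
    [3.2, 2.8, 3.1],   # h_2
    [3.0, 3.0, 2.9],   # h_3
])

# Simple average: every learner gets the same weight 1/T
simple_avg = preds.mean(axis=0)

# Weighted average: non-negative weights summing to 1, e.g. based on validation error
weights = np.array([0.5, 0.3, 0.2])
weighted_avg = np.average(preds, axis=0, weights=weights)

print(simple_avg, weighted_avg)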
Voting

For classification problems, the usual combination strategy is voting. Suppose the set of possible classes is {c_1, c_2, ..., c_K}. For any sample x to be predicted, the outputs of our T weak learners are h_1(x), h_2(x), ..., h_T(x).
The simplest form is plurality voting, i.e. "the minority yields to the majority": among the T weak learners' predictions for sample x, the class c_i with the most votes becomes the final class. If more than one class ties for the highest vote count, one of them is chosen at random.
A slightly stricter form is majority voting, i.e. the winning class must receive more than half of the votes. On top of plurality voting, the top class must not only receive the most votes but also more than half of all votes; otherwise the ensemble refuses to make a prediction.
More sophisticated still is weighted voting: as with weighted averaging, each weak learner's vote is multiplied by a weight, the weighted votes for each class are summed, and the class with the largest total becomes the final class.
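The three voting rules can be sketched in a few lines of Python; the votes and learner weights below are hypothetical:

from collections import Counter

# Hypothetical class predictions from T = 5 weak learners for one sample x
votes = ['cat', 'dog', 'cat', 'bird', 'cat']
weights = [0.1, 0.3, 0.2, 0.1, 0.3]   # hypothetical learner weights

# Plurality voting: the class with the most votes wins
# (a full implementation would break ties at random)
counts = Counter(votes)
plurality = counts.most_common(1)[0][0]

# Majority voting: the top class must also win more than half of all votes
top_class, top_count = counts.most_common(1)[0]
majority = top_class if top_count > len(votes) / 2 else None   # None = refuse to predict

# Weighted voting: sum each learner's weight behind its predicted class
weighted_counts = {}
for cls, w in zip(votes, weights):
    weighted_counts[cls] = weighted_counts.get(cls, 0.0) + w
weighted = max(weighted_counts, key=weighted_counts.get)

print(plurality, majority, weighted)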
Learning-Based Combination (Stacking)

The two strategies above simply average or vote over the weak learners' outputs. They are easy to apply, but the resulting learning error can be relatively large, which is where the learning-based approach comes in. Its representative method is stacking. When stacking is used as the combination strategy, instead of applying simple logic to the weak learners' outputs, we add another layer of learning: the weak learners' predictions on the training set become the input features, the original training targets remain the output, and a new learner is trained on them to produce the final result.
In this setting the weak learners are called first-level (base) learners, and the combining learner is called the second-level (meta) learner. For the test set, we first run the base learners to obtain the meta-learner's input, then run the meta-learner to obtain the final prediction.
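scikit-learn ships this pattern as StackingClassifier; the following is a minimal sketch on synthetic data (an illustration only, not the models used later in this article):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# First-level (base) learners
base_learners = [('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                 ('knn', KNeighborsClassifier(n_neighbors=15))]

# Second-level (meta) learner, trained on the base learners' cross-validated predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_tr, y_tr)
print('Stacking accuracy:', stack.score(X_te, y_te))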
Hill Climbing

What is the hill climbing algorithm? The Hill Climbing Algorithm is one of the commonly used algorithms for optimization problems. The process is analogous to climbing a mountain: by repeatedly moving "uphill", the search gradually approaches the peak, i.e. the optimal solution.
The basic idea of hill climbing: starting from an initial solution, the algorithm repeatedly makes small adjustments to the current solution, improving it a little each time, until some stopping condition is met. Note that the solution selected from the candidate set is not guaranteed to be globally optimal; in general only a local optimum is guaranteed.
Hill climbing flow

The generic hill climbing procedure is as follows:
If any neighboring solution is better than the current solution, take the best such neighbor as the new current solution; otherwise, stop and return the current solution.

Problems with hill climbing

Hill climbing has a clear weakness when it comes to finding the global optimum: it easily gets stuck in a local optimum, so in practice some strategy is usually needed to avoid this.
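Before turning to the improvement strategies, here is a minimal generic sketch of the basic hill climbing loop on a one-dimensional objective; the objective function and step size are arbitrary choices for illustration:

import numpy as np

def objective(x):
    # Hypothetical objective with several local maxima
    return np.sin(3 * x) + 0.5 * np.cos(5 * x) - 0.1 * x ** 2

def hill_climb(x0, step=0.05, max_iter=1000):
    x, fx = x0, objective(x0)
    for _ in range(max_iter):
        # Evaluate the neighbors of the current solution
        neighbors = [x - step, x + step]
        values = [objective(n) for n in neighbors]
        best = int(np.argmax(values))
        if values[best] <= fx:
            break   # no neighbor is better: stop and return the current (local) optimum
        x, fx = neighbors[best], values[best]
    return x, fx

print(hill_climb(x0=0.0))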
Improvement strategies

To reduce the risk of getting trapped in a local optimum, the following improvement strategies can be used:
1. Random-restart hill climbing
The Random-restart Hill Climbing Algorithm runs hill climbing multiple times, each time from a different initial solution, which improves the chance of finding the global optimum.
2. Simulated annealing

The Simulated Annealing Algorithm brings the idea of the physical "annealing" process into search: it randomly perturbs the solution space, which allows the search to jump out of local optima.
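As a minimal sketch (not from the original article), the core of simulated annealing is its acceptance rule: a worse solution is accepted with probability exp(-Δ/T), and the temperature T is gradually lowered:

import math
import random

def accept(delta, temperature):
    # Always accept improvements; accept worse moves with probability exp(-delta / T)
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / temperature)

def simulated_annealing(objective, x0, neighbor, t0=1.0, cooling=0.95, steps=1000):
    x, fx = x0, objective(x0)
    t = t0
    for _ in range(steps):
        candidate = neighbor(x)
        f_candidate = objective(candidate)
        if accept(f_candidate - fx, t):   # minimization: delta > 0 means a worse candidate
            x, fx = candidate, f_candidate
        t *= cooling                      # cool the temperature down
    return x, fx

# Hypothetical usage: minimize (x - 2)^2 with Gaussian perturbations
best = simulated_annealing(lambda x: (x - 2) ** 2, x0=10.0,
                           neighbor=lambda x: x + random.gauss(0, 0.5))
print(best)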
3. Genetic algorithm

The Genetic Algorithm mimics survival of the fittest in nature: it repeatedly produces a new generation of candidate solutions and selects among them, thereby performing global optimization.
Data

First we read the data. (Since this article focuses on the ensembling strategy, the data itself is not the point; any dataset will do. If you need the dataset used here, contact 云朵君 via the backend of @公众号:数据STUDIO to get it for free.)
# Imports used throughout this article
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(f"train:{train.shape}, test:{test.shape}")
train:(101763, 23), test:(67842, 22)
train.describe()
train.info()
RangeIndex: 101763 entries, 0 to 101762
Data columns (total 23 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   id                 101763 non-null  int64
 1   loc                101763 non-null  float64
 2   v(g)               101763 non-null  float64
 3   ev(g)              101763 non-null  float64
 4   iv(g)              101763 non-null  float64
 5   n                  101763 non-null  float64
 6   v                  101763 non-null  float64
 7   l                  101763 non-null  float64
 8   d                  101763 non-null  float64
 9   i                  101763 non-null  float64
 10  e                  101763 non-null  float64
 11  b                  101763 non-null  float64
 12  t                  101763 non-null  float64
 13  lOCode             101763 non-null  int64
 14  lOComment          101763 non-null  int64
 15  lOBlank            101763 non-null  int64
 16  locCodeAndComment  101763 non-null  int64
 17  uniq_Op            101763 non-null  float64
 18  uniq_Opnd          101763 non-null  float64
 19  total_Op           101763 non-null  float64
 20  total_Opnd         101763 non-null  float64
 21  branchCount        101763 non-null  float64
 22  defects            101763 non-null  bool
dtypes: bool(1), float64(17), int64(5)
memory usage: 17.2 MB
"训练"和 "测试"数据集中都没有缺失值。由于这是一个合成数据集,我们将检查是否有重复数据。
print(f"如果我们从训练数据集中移除 id,维度为 {train.drop(columns = ['id' ], axis = 1 ).drop_duplicates().shape} " ) print(f"如果从训练数据集中删除 id 和缺陷,维度为 {train.drop(columns = ['id' , 'defects' ], axis = 1 ).drop_duplicates().shape} " ) print(f"如果从训练数据集中删除 id、分支计数和缺陷,维度为 {train.drop(columns = ['id' , 'branchCount' , 'defects' ], axis = 1 ).drop_duplicates().shape} " )
After dropping id from the train set, the deduplicated shape is (101763, 22)
After dropping id and defects, the deduplicated shape is (101763, 21)
After dropping id, branchCount and defects, the deduplicated shape is (101685, 20)
从上面我们可以看到,"训练" 数据集中有 78 个准重复观测值。现在,让我们来看看 "测试" 数据集。
Exploratory Data Analysis

train['defects'].value_counts(normalize=True).plot(kind='bar', color=['steelblue', 'orange'])
plt.ylabel('Percentage');
The bar chart above shows that the data is imbalanced (~77% False, ~23% True). Next, we explore potential correlations between the input features.
corr_mat = train.drop(columns=['id', 'defects']).corr()
data_mask = np.triu(np.ones_like(corr_mat, dtype=bool))
cmap = sns.diverging_palette(100, 7, s=75, l=40, n=20, center='light', as_cmap=True)

f, ax = plt.subplots(figsize=(18, 13), dpi=150)
sns.heatmap(corr_mat, annot=True, cmap=cmap, fmt='.2f', center=0,
            annot_kws={'size': 12}, mask=data_mask).set_title('Correlation between input features');
We can observe the following:

- The correlation between branchCount and v(g) is 97%.
- The correlation between total_Opnd and total_Op is 96%.

Based on the correlation heatmap above, we explore dimensionality reduction with PCA. Note that all input features are right-skewed, so we first apply a log transform to the features and then run PCA.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.decomposition import PCA

colnames = train.drop(columns=['id', 'defects']).columns.tolist()

pca_md = Pipeline([('log-tran', ColumnTransformer([('log', FunctionTransformer(np.log1p), colnames)])),
                   ('stand', StandardScaler()),
                   ('pca', PCA())]).fit(train[colnames])
pca_md
plt.figure(figsize=(10, 8), dpi=150)
ax = sns.lineplot(x=[i for i in range(1, 22)],
                  y=np.cumsum(pca_md['pca'].explained_variance_ratio_),
                  color='steelblue', markers=True)
ax.set_xlabel('Number of components')
ax.set_ylabel('Cumulative explained variance')
ax.set_xticks([i for i in range(1, 22)]);
From the plot above we can see that the first 10 components explain more than 99% of the variance in the data. Next, let's visualize these components and check whether there are patterns we can exploit.
pca_10 = Pipeline([('log-tran', ColumnTransformer([('log', FunctionTransformer(np.log1p), colnames)])),
                   ('stand', StandardScaler()),
                   ('pca', PCA(n_components=10))]).fit_transform(train[colnames])
pca_10 = pd.DataFrame(pca_10, columns=[str('PCA_') + str(i) for i in range(1, 11)])
pca_10['defects'] = train['defects'].map({False: 0, True: 1})

sns.pairplot(data=pca_10, hue='defects', corner=True)
从上述 "PCA" 配对图中,我们可以观察到以下几点:
有些样本("defects = True")很难预测。例如,如果我们仔细观察 PCA_1
和 PCA_2
图,在左下角,有一些蓝色样本("defects = True"),而该区域的大多数样本都是红色的("defects = False")。 接下来,我们运行 "k-means",看看是否有有趣的模式可以利用。
from sklearn.cluster import KMeans
from tqdm import tqdm

inertias = list()
for i in tqdm(range(2, 21)):
    kmeans_md = Pipeline([('log-tran', ColumnTransformer([('log', FunctionTransformer(np.log1p), colnames)])),
                          ('stand', StandardScaler()),
                          ('kmeans', KMeans(n_clusters=i, n_init=20, random_state=42))]).fit(train[colnames])
    inertias.append(kmeans_md['kmeans'].inertia_)

plt.figure(figsize=(10, 8), dpi=150)
ax = sns.lineplot(x=[i for i in range(2, 21)], y=inertias, color='steelblue')
plt.xlabel('Number of Clusters')
plt.ylabel('Cluster Inertia')
# add_name(ax)  # watermark helper defined elsewhere in the original notebook
Judging from the inertia curve above, 5 clusters looks like an appropriate number for this dataset (based on the elbow method). Next, we look at the proportion of defects in each cluster.
kmeans = Pipeline([('log-tran', ColumnTransformer([('log', FunctionTransformer(np.log1p), colnames)])),
                   ('stand', StandardScaler()),
                   ('kmeans', KMeans(n_clusters=5, n_init=20, random_state=42))]).fit(train[colnames])
train['cluster'] = kmeans['kmeans'].labels_

print('Proportion of defects in each cluster:\n')
print(train.groupby('cluster')['defects'].mean())
Proportion of defects in each cluster:

cluster
0    0.109429
1    0.471916
2    0.641147
3    0.077083
4    0.301682
Name: defects, dtype: float64
We can observe the following:

- Cluster 2 has the highest proportion of defects (about 64%).
- Cluster 1 has the second-highest proportion (about 47%).
- Cluster 3 has the lowest proportion (about 8%).

Next, we briefly explore a few bivariate relationships, as shown below.
fig, axes = plt.subplots(1, 2, figsize=(20, 8), dpi=150)
ax1 = sns.scatterplot(ax=axes[0], data=train, x='uniq_Op', y='uniq_Opnd', hue='defects')
ax2 = sns.scatterplot(ax=axes[1], data=train, x='total_Op', y='total_Opnd', hue='defects')
From the charts above we can see the following:
Next, we count the number of unique values in each input feature, as shown below.
train.drop(columns=['id', 'defects']).nunique()
loc                   378
v(g)                  106
ev(g)                  71
iv(g)                  84
n                     836
v                    4515
l                      55
d                    3360
i                    5171
e                    8729
b                     315
t                    8608
lOCode                298
lOComment              91
lOBlank                94
locCodeAndComment      29
uniq_Op                70
uniq_Opnd             176
total_Op              623
total_Opnd            485
branchCount           144
cluster                 5
dtype: int64
可以看出,"locCodeAndComment" 是唯一值最少的特征;它只有 29 个唯一值。
Modeling

We ensemble the following six models, and the result is impressive.
We start by building some standard models without any feature engineering or hyperparameter optimization (HPO). First, we define the input features and the target feature.
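The exact definitions of X, Y and test_cv are not shown in the source; a minimal sketch, assuming defects is the target, id is dropped, and the cluster column from the EDA step is left out, might look like this:

# Hypothetical feature / target definitions; adjust to match your own preprocessing
X = train.drop(columns=['id', 'defects', 'cluster'])
Y = train['defects'].astype(int)

# test_cv is the matrix the fold models predict on for the final submission
test_cv = test.drop(columns=['id'])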
Defining the hill climbing strategy

from sklearn.metrics import roc_auc_score

def hill_climbing(x, y, x_test):

    # Evaluate the oof predictions of each model
    scores = {}
    for col in x.columns:
        scores[col] = roc_auc_score(y, x[col])

    # Sort the model scores from best to worst
    scores = {k: v for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)}

    # Sort oof_df and test_preds accordingly
    x = x[list(scores.keys())]
    x_test = x_test[list(scores.keys())]

    STOP = False
    current_best_ensemble = x.iloc[:, 0]
    current_best_test_preds = x_test.iloc[:, 0]
    MODELS = x.iloc[:, 1:]
    weight_range = np.arange(-0.5, 0.51, 0.01)
    history = [roc_auc_score(y, current_best_ensemble)]
    j = 0

    # Greedily add the model/weight pair that most improves the CV score
    while not STOP:
        j += 1
        potential_new_best_cv_score = roc_auc_score(y, current_best_ensemble)
        k_best, wgt_best = None, None
        for k in MODELS:
            for wgt in weight_range:
                potential_ensemble = (1 - wgt) * current_best_ensemble + wgt * MODELS[k]
                cv_score = roc_auc_score(y, potential_ensemble)
                if cv_score > potential_new_best_cv_score:
                    potential_new_best_cv_score = cv_score
                    k_best, wgt_best = k, wgt

        if k_best is not None:
            current_best_ensemble = (1 - wgt_best) * current_best_ensemble + wgt_best * MODELS[k_best]
            current_best_test_preds = (1 - wgt_best) * current_best_test_preds + wgt_best * x_test[k_best]
            MODELS.drop(k_best, axis=1, inplace=True)
            if MODELS.shape[1] == 0:
                STOP = True
            history.append(potential_new_best_cv_score)
        else:
            STOP = True

    hill_ens_pred_1 = current_best_ensemble
    hill_ens_pred_2 = current_best_test_preds

    return [hill_ens_pred_1, hill_ens_pred_2]
Next, we build several standard models with 10-fold cross-validation.
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              HistGradientBoostingClassifier)
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

ens_cv_scores, ens_preds = list(), list()
hill_ens_cv_scores, hill_ens_preds = list(), list()

sk = RepeatedStratifiedKFold(n_splits=10, n_repeats=1, random_state=42)
for i, (train_idx, test_idx) in enumerate(sk.split(X, Y)):

    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    Y_train, Y_test = Y.iloc[train_idx], Y.iloc[test_idx]

    print('----------------------------------------------------------')

    ########
    ## RF ##
    ########

    RF_md = RandomForestClassifier(n_estimators=500,
                                   max_depth=7,
                                   min_samples_split=15,
                                   min_samples_leaf=10).fit(X_train, Y_train)

    RF_pred = RF_md.predict_proba(X_test)[:, 1]
    RF_score = roc_auc_score(Y_test, RF_pred)
    print('Fold', i, '==> RF oof ROC-AUC score is ==>', RF_score)

    RF_pred_test = RF_md.predict_proba(test_cv)[:, 1]

    #################
    ## Extra Trees ##
    #################

    ET_md = ExtraTreesClassifier(n_estimators=500,
                                 max_depth=7,
                                 min_samples_split=15,
                                 min_samples_leaf=10).fit(X_train, Y_train)

    ET_pred = ET_md.predict_proba(X_test)[:, 1]
    ET_score = roc_auc_score(Y_test, ET_pred)
    print('Fold', i, '==> ET oof ROC-AUC score is ==>', ET_score)

    ET_pred_test = ET_md.predict_proba(test_cv)[:, 1]

    ##########################
    ## HistGradientBoosting ##
    ##########################

    hist_md = HistGradientBoostingClassifier(l2_regularization=0.01,
                                             early_stopping=False,
                                             learning_rate=0.01,
                                             max_iter=500,
                                             max_depth=5,
                                             max_bins=255,
                                             min_samples_leaf=15,
                                             max_leaf_nodes=10).fit(X_train, Y_train)

    hist_pred = hist_md.predict_proba(X_test)[:, 1]
    hist_score = roc_auc_score(Y_test, hist_pred)
    print('Fold', i, '==> Hist oof ROC-AUC score is ==>', hist_score)

    hist_pred_test = hist_md.predict_proba(test_cv)[:, 1]

    ##########
    ## LGBM ##
    ##########

    LGBM_md = LGBMClassifier(objective='binary',
                             n_estimators=500,
                             max_depth=7,
                             learning_rate=0.01,
                             num_leaves=20,
                             reg_alpha=3,
                             reg_lambda=3,
                             subsample=0.7,
                             colsample_bytree=0.7).fit(X_train, Y_train)

    lgb_pred = LGBM_md.predict_proba(X_test)[:, 1]
    lgb_score = roc_auc_score(Y_test, lgb_pred)
    print('Fold', i, '==> LGBM oof ROC-AUC score is ==>', lgb_score)

    lgb_pred_test = LGBM_md.predict_proba(test_cv)[:, 1]

    #########
    ## XGB ##
    #########

    XGB_md = XGBClassifier(objective='binary:logistic',
                           tree_method='hist',
                           colsample_bytree=0.7,
                           gamma=2,
                           learning_rate=0.01,
                           max_depth=7,
                           min_child_weight=10,
                           n_estimators=500,
                           subsample=0.7).fit(X_train, Y_train)

    xgb_pred = XGB_md.predict_proba(X_test)[:, 1]
    xgb_score = roc_auc_score(Y_test, xgb_pred)
    print('Fold', i, '==> XGB oof ROC-AUC score is ==>', xgb_score)

    xgb_pred_test = XGB_md.predict_proba(test_cv)[:, 1]

    ##############
    ## CatBoost ##
    ##############

    Cat_md = CatBoostClassifier(loss_function='Logloss',
                                iterations=500,
                                learning_rate=0.01,
                                depth=7,
                                random_strength=0.5,
                                bagging_temperature=0.7,
                                border_count=30,
                                l2_leaf_reg=5,
                                verbose=False,
                                task_type='CPU').fit(X_train, Y_train)

    cat_pred = Cat_md.predict_proba(X_test)[:, 1]
    cat_score = roc_auc_score(Y_test, cat_pred)
    print('Fold', i, '==> CatBoost oof ROC-AUC score is ==>', cat_score)

    cat_pred_test = Cat_md.predict_proba(test_cv)[:, 1]
    ##############
    ## Ensemble ##
    ##############

    ens_pred_1 = (RF_pred + ET_pred + hist_pred + lgb_pred + xgb_pred + cat_pred) / 6
    ens_pred_2 = (RF_pred_test + ET_pred_test + hist_pred_test + lgb_pred_test + xgb_pred_test + cat_pred_test) / 6

    ens_score_fold = roc_auc_score(Y_test, ens_pred_1)
    ens_cv_scores.append(ens_score_fold)
    ens_preds.append(ens_pred_2)

    print('Fold', i, '==> Average Ensemble oof ROC-AUC score is ==>', ens_score_fold)

    ############################
    ## Hill Climbing Ensemble ##
    ############################

    x = pd.DataFrame({'RF': RF_pred,
                      'ET': ET_pred,
                      'Hist': hist_pred,
                      'LGBM': lgb_pred,
                      'XGB': xgb_pred,
                      'Cat': cat_pred})
    y = Y_test

    x_test = pd.DataFrame({'RF': RF_pred_test,
                           'ET': ET_pred_test,
                           'Hist': hist_pred_test,
                           'LGBM': lgb_pred_test,
                           'XGB': xgb_pred_test,
                           'Cat': cat_pred_test})

    hill_results = hill_climbing(x, y, x_test)

    hill_ens_score_fold = roc_auc_score(y, hill_results[0])
    hill_ens_cv_scores.append(hill_ens_score_fold)
    hill_ens_preds.append(hill_results[1])

    print('Fold', i, '==> Hill Climbing Ensemble oof ROC-AUC score is ==>', hill_ens_score_fold)
print('The mean 10-fold ROC-AUC score of the average ensemble is', np.mean(ens_cv_scores))
print('The mean 10-fold ROC-AUC score of the hill climbing ensemble is', np.mean(hill_ens_cv_scores))
The mean 10-fold ROC-AUC score of the average ensemble is 0.7917392082505299
The mean 10-fold ROC-AUC score of the hill climbing ensemble is 0.7932600401008065
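The hill climbing ensemble edges out the plain average. To turn the per-fold test predictions into a single submission, one common approach (an assumption on our part, this step is not shown in the source) is to average them across folds:

# Hypothetical submission step: average the per-fold hill climbing test predictions
final_test_pred = np.mean(np.column_stack(hill_ens_preds), axis=1)

submission = pd.DataFrame({'id': test['id'], 'defects': final_test_pred})
submission.to_csv('submission.csv', index=False)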
Closing Remarks

Finally, here is an engineering-style implementation of the hill climbing strategy that you can adapt to your own needs. If you find it useful, consider giving it a like.
import pandas as pd
import numpy as np
import os
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import minmax_scale


"""
Initializes an empty ensemble
"""
def init_hillclimb():
    best_ensemble = {}
    for label in LABELS:
        best_ensemble[label] = []
    best_score = {}
    for label in LABELS:
        best_score[label] = 0
    return best_ensemble, best_score


"""
Scores average AUC for an ensemble per label
"""
def score_ensemble(ensemble, label):
    blend_preds = np.zeros(len(train))
    for model in ensemble:
        blend_preds += minmax_scale(files[model][label])
    blend_preds = blend_preds / len(subnums)
    score = roc_auc_score(train[label], blend_preds)
    return score


"""
Finds the optimal model to add next to an ensemble
"""
def find_best_improvement(ensemble, label):
    best_score = 0
    best_ensemble = []
    for i in range(0, len(files)):
        ensemble = ensemble + [i]
        score = score_ensemble(ensemble, label)
        if score > best_score:
            best_score = score
            best_ensemble = ensemble
        ensemble = ensemble[:-1]
    return best_ensemble, best_score


"""
Performs a step for each label
"""
def climb(best_ensemble, best_score):
    for label in LABELS:
        best_ensemble[label], best_score[label] = find_best_improvement(best_ensemble[label], label)
    return best_ensemble, best_score


"""
Gets optimal blending weights after hillclimb
"""
def get_optimal_weights(best_ensemble):
    weights = {}
    for label in LABELS:
        weights[label] = {}
        for num in set(best_ensemble[label]):
            weights[label][num] = best_ensemble[label].count(num) / len(best_ensemble[label])
    return weights


"""
Constructs a pandas dataframe using the optimal blending weights
"""
def get_optimal_blend(optimal_weights):
    sub = pd.read_csv(os.path.join(DATA_PATH, "sample_submission.csv"))
    blend = sub.copy()
    for label in LABELS:
        print(label)
        for key in optimal_weights[label]:
            blend[label] += optimal_weights[label][key] * minmax_scale(get_sub_file(key)[label])
            print(optimal_weights[label][key], filenames[key])
        blend[label] = minmax_scale(blend[label])
    return blend


def get_sub_file(num):
    filename = filenames[num]
    filename = filename.replace("oof", "sub")
    return pd.read_csv(os.path.join(SUB_PATH, filename))


if __name__ == "__main__":

    DATA_PATH = "../input/jigsaw-toxic-comment-classification-challenge/"
    SUB_PATH = "../input/oofmodels/trained-models/"

    train = pd.read_csv(os.path.join(DATA_PATH, "train.csv")).fillna(' ')
    test = pd.read_csv(os.path.join(DATA_PATH, "test.csv")).fillna(' ')
    LABELS = train.columns[2:]

    # Get submissions and out-of-fold predictions
    subnums = [29, 51, 52]
    filenames = ["oof" + str(num) + ".csv" for num in subnums]
    files = [pd.read_csv(os.path.join(SUB_PATH, filename)) for filename in filenames]

    best_ensemble, best_score = init_hillclimb()

    # Run hillclimb
    for i in range(25):
        print("-------------")
        print("Step", i)
        best_ensemble, best_score = climb(best_ensemble, best_score)
        print("Best ensemble:")
        print(best_ensemble)
        print("Best score:")
        print(best_score)
        print("AUC:", np.mean([best_score[label] for label in LABELS]))

    # Get optimal weights
    opt_w = get_optimal_weights(best_ensemble)
    print("Optimal weights:")
    print(opt_w)

    # Construct the blend
    blend = get_optimal_blend(opt_w)

    # Submit
    blend.to_csv("hillclimb.csv", index=False)