Model Fusion with Stacking in Practice

I won't go over the theory behind stacking in detail here, since other blog posts already explain it well; instead, here is the classic diagram:
[Figure: the classic two-level stacking diagram]

Let's jump straight into the complete program (download the dataset from the link given later, then run it directly):

# Load in our libraries
import pandas as pd
import numpy as np
import re
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')
# Going to use these 5 base models for the stacking
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.cross_validation import KFold  # NOTE: removed in scikit-learn 0.20; newer versions use sklearn.model_selection
'''
--------------Feature Exploration, Engineering and Cleaning------------------
'''
# Load in the train and test datasets
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
# Store our passenger ID for easy access
PassengerId = test['PassengerId']
# Feature Engineering
full_data = [train, test]
# Some features of my own that I have added in
# Gives the length of the name
train['Name_length'] = train['Name'].apply(len)
test['Name_length'] = test['Name'].apply(len)
# Feature that tells whether a passenger had a cabin on the Titanic
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
# Feature engineering steps taken from Sina
# Create new feature FamilySize as a combination of SibSp and Parch
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# Create new feature IsAlone from FamilySize
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
# Remove all NULLS in the Embarked column
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
# Remove all NULLS in the Fare column and create a new feature CategoricalFare
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)
# Create a New feature CategoricalAge
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    # Cast the bounds to int (newer NumPy rejects float bounds in randint)
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    # Use .loc rather than chained indexing so the assignment is reliable
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
train['CategoricalAge'] = pd.cut(train['Age'], 5)
# Define function to extract titles from passenger names
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)  # raw string avoids an invalid-escape warning
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
# Create a new feature Title, containing the titles of passenger names
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
# Group all non-common titles into one single grouping "Rare"
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
for dataset in full_data:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)
    # Mapping titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
    # Mapping Fare
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    # Mapping Age
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
# Feature selection
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis = 1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)
test  = test.drop(drop_elements, axis = 1)
# Visualisations omitted here
'''
----------------------Ensembling & Stacking models---------------------
'''
# Helpers via Python Classes
# Some useful parameters which will come in handy later on
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0  # for reproducibility
NFOLDS = 5  # set folds for out-of-fold prediction
kf = KFold(ntrain, n_folds=NFOLDS, random_state=SEED)
# Class to extend the Sklearn classifier
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        print(self.clf.fit(x, y).feature_importances_)
# Class to extend the XGBoost classifier (omitted; XGBoost is used directly at the second level)
# Out-of-Fold Predictions
def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))
    for i, (train_index, test_index) in enumerate(kf):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]
        clf.train(x_tr, y_tr)
        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
# Put in our parameters for said classifiers
# Random Forest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    'warm_start': True,
    # 'max_features': 0.2,
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'verbose': 0
}
# Extra Trees Parameters
et_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    # 'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}
# AdaBoost parameters
ada_params = {
    'n_estimators': 500,
    'learning_rate': 0.75
}
# Gradient Boosting parameters
gb_params = {
    'n_estimators': 500,
    # 'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}
# Support Vector Classifier parameters
svc_params = {
    'kernel': 'linear',
    'C': 0.025
}
# Create 5 objects that represent our 5 models
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)
# Create Numpy arrays of train, test and target ( Survived) dataframes to feed into our models
y_train = train['Survived'].ravel()
train = train.drop(['Survived'], axis=1)
x_train = train.values # Creates an array of the train data
x_test = test.values # Creates an array of the test data
# Create our OOF train and test predictions. These base results will be used as new features
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier
print("Training is complete")
# Feature importance analysis omitted here
# Second-Level Predictions from the First-level Output
# Correlation heatmap of the second-level training set omitted here
# First-level output as new features
x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)
# Second level learning model via XGBoost
gbm = xgb.XGBClassifier(
    # learning_rate=0.02,
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    # gamma=1,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=-1,
    scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)
# Producing the Submission file
# Generate Submission File
StackingSubmission = pd.DataFrame({'PassengerId': PassengerId,
                                   'Survived': predictions})
StackingSubmission.to_csv("StackingSubmission.csv", index=False)

Dataset download link: https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python/data

Now let's look at the key parts of the program. First, what does KFold actually do?

from sklearn.cross_validation import KFold

kf = KFold(10000, n_folds=5, random_state=0)
print(kf)
for (train_index, test_index) in kf:
    print(train_index)
    print(test_index)
    print("----------------")

Output:

sklearn.cross_validation.KFold(n=10000, n_folds=5, shuffle=False, random_state=0)
[2000 2001 2002 ... 9997 9998 9999]
[   0    1    2 ... 1997 1998 1999]
----------------
[   0    1    2 ... 9997 9998 9999]
[2000 2001 2002 ... 3997 3998 3999]
----------------
[   0    1    2 ... 9997 9998 9999]
[4000 4001 4002 ... 5997 5998 5999]
----------------
[   0    1    2 ... 9997 9998 9999]
[6000 6001 6002 ... 7997 7998 7999]
----------------
[   0    1    2 ... 7997 7998 7999]
[8000 8001 8002 ... 9997 9998 9999]
----------------
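
A quick note: the sklearn.cross_validation module used above was removed in scikit-learn 0.20. Here is a minimal sketch of the same demo against the modern sklearn.model_selection API (this assumes scikit-learn >= 0.20 and is not part of the original program):

from sklearn.model_selection import KFold
import numpy as np

X = np.zeros((10000, 1))  # only the row count matters for the split
kf = KFold(n_splits=5, shuffle=False)
for train_index, test_index in kf.split(X):
    print(train_index)
    print(test_index)
    print("----------------")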

As the test program above shows, with n_folds=5 and n=10000, KFold produces 5 splits, which the for loop iterates over: train [2000-9999] with test [0-1999]; train [0-1999, 4000-9999] with test [2000-3999]; train [0-3999, 6000-9999] with test [4000-5999]; train [0-5999, 8000-9999] with test [6000-7999]; and train [0-7999] with test [8000-9999]. These arrays are later used as row indices to split the training set into two parts: the longer part (e.g. [2000-9999]) is used for training, while the shorter part (e.g. [0-1999]) acts as a held-out fold whose predictions are collected. See the get_oof function below for exactly how this is used.

def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))
    for i, (train_index, test_index) in enumerate(kf):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]
        clf.train(x_tr, y_tr)
        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

In this function, clf is one of the base models, and kf comes from kf = KFold(ntrain, n_folds=NFOLDS, random_state=SEED), where ntrain is the number of rows in the training set. That line partitions the training-set indices into n_folds parts, so get_oof runs n_folds rounds of training. In each round, most of x_train is used for fitting while the held-out fold is predicted; these per-fold predictions are stitched together into oof_train, which ends up covering every training row exactly once. The fitted model also predicts x_test in every round, filling one row of oof_test_skf; those n_folds test-set predictions are then averaged to produce oof_test.
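
To make the out-of-fold idea concrete, here is a toy sketch (a hypothetical 6-row dataset, 3 folds, and a trivial "model" that just predicts the training-fold mean — purely illustrative, not part of the original program):

import numpy as np
from sklearn.model_selection import KFold  # modern API, used here for illustration only

X = np.arange(6).reshape(-1, 1).astype(float)
y = np.array([0., 0., 1., 1., 0., 1.])
oof = np.zeros(6)
for train_idx, test_idx in KFold(n_splits=3).split(X):
    # "Train" on 4 rows, "predict" the 2 held-out rows with the fold mean:
    oof[test_idx] = y[train_idx].mean()
print(oof)  # every entry was produced by a "model" that never saw that row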

et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier
print("Training is complete")
# First-level output as new features
x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)

The predictions from each base model are concatenated into a new training set and a new test set. Picture it this way: the x_train and x_test above are now a 5-column training set and a 5-column test set. Each column of the new x_train holds one base model's out-of-fold predictions on the original training set, and each column of the new x_test holds that model's fold-averaged predictions on the original test set. (It's a bit of a mouthful; re-read the get_oof function and it will click.)
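
As a quick sanity check, the shapes line up as follows (891 training rows and 418 test rows are the standard Kaggle Titanic sizes):

print(x_train.shape)  # (891, 5): one column of OOF predictions per base model
print(x_test.shape)   # (418, 5): one column of fold-averaged test predictions per base model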

# Second level learning model via XGBoost
gbm = xgb.XGBClassifier(
    # learning_rate=0.02,
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    # gamma=1,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=-1,
    scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)

Here x_train is the new (stacked) training set, y_train is still the label column of the original training set, and x_test is the new test set. The final predictions array is the output of the stacked ensemble.
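
Incidentally, scikit-learn 0.22+ ships a built-in StackingClassifier that automates this whole pattern. Below is a minimal sketch (it assumes scikit-learn >= 0.22 and is not equivalent to the manual scheme above in every detail — for instance, it refits the base models on the full training set before predicting the test set, instead of averaging per-fold test predictions):

from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=500, max_depth=6, random_state=0)),
        ('et', ExtraTreesClassifier(n_estimators=500, max_depth=8, random_state=0)),
        ('ada', AdaBoostClassifier(n_estimators=500, learning_rate=0.75, random_state=0)),
        ('gb', GradientBoostingClassifier(n_estimators=500, max_depth=5, random_state=0)),
        ('svc', SVC(kernel='linear', C=0.025, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # any meta-model works; XGBClassifier could be used too
    cv=5)
stack.fit(x_train, y_train)  # here x_train/y_train would be the ORIGINAL features and labels
predictions = stack.predict(x_test)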

Reference: https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python/notebook — very well written and worth reading.
