Model Fusion with Stacking in Practice



I won't re-explain the theory behind stacking here; plenty of blog posts already cover it well. The classic diagram is still worth keeping in mind:
[Figure: the classic two-level stacking diagram]

Let's go straight to the complete program (download the dataset from the link given at the end, and the script runs as-is):

# Load in our libraries
import pandas as pd
import numpy as np
import re
import xgboost as xgb

import warnings
warnings.filterwarnings('ignore')

# Going to use these 5 base models for the stacking
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

'''
--------------Feature Exploration, Engineering and Cleaning------------------
'''
# Load in the train and test datasets
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

# Store our passenger ID for easy access
PassengerId = test['PassengerId']

# Feature Engineering
full_data = [train, test]

# Some features of my own that I have added in
# Gives the length of the name
train['Name_length'] = train['Name'].apply(len)
test['Name_length'] = test['Name'].apply(len)
# Feature that tells whether a passenger had a cabin on the Titanic
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

# Feature engineering steps taken from Sina
# Create new feature FamilySize as a combination of SibSp and Parch
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# Create new feature IsAlone from FamilySize
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
# Remove all NULLS in the Embarked column
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
# Remove all NULLS in the Fare column and create a new feature CategoricalFare
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)
# Create a New feature CategoricalAge
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
train['CategoricalAge'] = pd.cut(train['Age'], 5)


# Define function to extract titles from passenger names
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""


# Create a new feature Title, containing the titles of passenger names
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
# Group all non-common titles into one single grouping "Rare"
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

for dataset in full_data:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)

    # Mapping titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

    # Mapping Fare
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

    # Mapping Age
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4

# Feature selection
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis = 1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)
test  = test.drop(drop_elements, axis = 1)

# Visualisations omitted here

'''
----------------------Ensembling & Stacking models---------------------
'''
# Helpers via Python Classes

# Some useful parameters which will come in handy later on
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0  # for reproducibility
NFOLDS = 5  # set folds for out-of-fold prediction
kf = KFold(n_splits=NFOLDS, shuffle=False)  # sequential folds, matching the old cross_validation behaviour


# Class to extend the Sklearn classifier
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        print(self.clf.fit(x, y).feature_importances_)

# Class to extend the XGBoost classifier (omitted here; not needed below)

# Out-of-Fold Predictions
def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.train(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

# Put in our parameters for said classifiers
# Random Forest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    'warm_start': True,
    # 'max_features': 0.2,
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'verbose': 0
}

# Extra Trees Parameters
et_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    # 'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}

# AdaBoost parameters
ada_params = {
    'n_estimators': 500,
    'learning_rate' : 0.75
}

# Gradient Boosting parameters
gb_params = {
    'n_estimators': 500,
    # 'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

# Support Vector Classifier parameters
svc_params = {
    'kernel': 'linear',
    'C': 0.025
}

# Create 5 objects that represent our 5 models
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

# Create Numpy arrays of train, test and target (Survived) dataframes to feed into our models
y_train = train['Survived'].ravel()
train = train.drop(['Survived'], axis=1)
x_train = train.values # Creates an array of the train data
x_test = test.values # Creates an array of the test data

# Create our OOF train and test predictions. These base results will be used as new features
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier

print("Training is complete")

# Feature importance analysis omitted here

# Second-Level Predictions from the First-level Output
# Correlation heatmap of the second-level training set omitted here
# First-level output as new features
x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)

# Second level learning model via XGBoost
gbm = xgb.XGBClassifier(
    # learning_rate=0.02,
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    # gamma=1,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    n_jobs=-1,  # 'nthread' is deprecated in recent xgboost releases
    scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)

# Producing the Submission file
# Generate Submission File
StackingSubmission = pd.DataFrame({ 'PassengerId': PassengerId,
                            'Survived': predictions })
StackingSubmission.to_csv("StackingSubmission.csv", index=False)

Dataset download: https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python/data

Now for the key parts of the program. First, what does KFold actually do?

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)
print(kf)
for train_index, test_index in kf.split(np.arange(10000)):
    print(train_index)
    print(test_index)
    print("----------------")

Output (the first line is KFold's repr; its exact form varies across scikit-learn versions):

KFold(n_splits=5, random_state=None, shuffle=False)
[2000 2001 2002 ... 9997 9998 9999]
[   0    1    2 ... 1997 1998 1999]
----------------
[   0    1    2 ... 9997 9998 9999]
[2000 2001 2002 ... 3997 3998 3999]
----------------
[   0    1    2 ... 9997 9998 9999]
[4000 4001 4002 ... 5997 5998 5999]
----------------
[   0    1    2 ... 9997 9998 9999]
[6000 6001 6002 ... 7997 7998 7999]
----------------
[   0    1    2 ... 7997 7998 7999]
[8000 8001 8002 ... 9997 9998 9999]
----------------

As the test snippet above shows, with 10,000 samples and n_splits=5, KFold yields five (train_index, test_index) pairs that the for loop walks through in turn: [2000-9999] / [0-1999], [0-1999, 4000-9999] / [2000-3999], [0-3999, 6000-9999] / [4000-5999], [0-5999, 8000-9999] / [6000-7999], and [0-7999] / [8000-9999]. These index arrays are later used to split the training set into two parts: the larger part (e.g. rows 2000-9999) is used to fit the model, and the held-out part (e.g. rows 0-1999) acts as a temporary test set whose predictions are collected. Exactly how this is used is shown in the get_oof function below.

def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.train(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

In the code above, clf is one of the base learners wrapped by SklearnHelper, and kf is the splitter created earlier with kf = KFold(n_splits=NFOLDS, shuffle=False); calling kf.split(x_train) partitions the training-set row indices (ntrain of them) into NFOLDS folds, so get_oof performs NFOLDS rounds of training. In each round the model is fitted on all folds of x_train except one; the held-out fold is then predicted and written into the matching slice of oof_train, so after all rounds every training row has exactly one out-of-fold prediction. In the same round the model also predicts the full x_test, and the NFOLDS sets of test predictions are finally averaged into oof_test.
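To see the out-of-fold mechanics in isolation, here is a tiny self-contained sketch (not part of the original script): 12 samples, 3 sequential folds, and a trivial stand-in "model" that just predicts the mean of the targets it was trained on. Every sample ends up with a prediction made by a model that never saw it, which is exactly what oof_train contains.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(-1, 1)
y = np.arange(12, dtype=float)
oof = np.zeros(12)

for train_idx, test_idx in KFold(n_splits=3, shuffle=False).split(X):
    pred = y[train_idx].mean()   # stand-in "model": predict the training-fold mean
    oof[test_idx] = pred         # fill the held-out slice of the OOF array

print(oof)
# samples 0-3  were held out of fold 1 -> prediction 7.5 (mean of targets 4..11)
# samples 4-7  were held out of fold 2 -> prediction 5.5 (mean of targets 0..3 and 8..11)
# samples 8-11 were held out of fold 3 -> prediction 3.5 (mean of targets 0..7)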

et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier

print("Training is complete")

# First-level output as new features
x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)

The predictions produced by each algorithm are then concatenated into a new training set and a new test set, each with 5 columns: the new training set x_train holds the out-of-fold predictions that each of the 5 algorithms made on the original training set, and the new test set x_test holds the (fold-averaged) predictions that each of the 5 algorithms made on the original test set. (It sounds convoluted; reading get_oof again makes it click.) A small inspection sketch follows below.
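To make the five-column picture concrete, here is a hedged inspection sketch; it assumes the script above has already computed the five pairs of OOF arrays, and the DataFrame column names are only illustrative.

base_predictions_train = pd.DataFrame({
    'ExtraTrees':    et_oof_train.ravel(),
    'RandomForest':  rf_oof_train.ravel(),
    'AdaBoost':      ada_oof_train.ravel(),
    'GradientBoost': gb_oof_train.ravel(),
    'SVC':           svc_oof_train.ravel(),
})
print(base_predictions_train.head())   # one column of first-level predictions per base model
print(x_train.shape, x_test.shape)     # (891, 5) and (418, 5) for the Titanic data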

# Second level learning model via XGBoost
gbm = xgb.XGBClassifier(
    # learning_rate=0.02,
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    # gamma=1,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    n_jobs=-1,  # 'nthread' is deprecated in recent xgboost releases
    scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)

Here x_train is the new meta-feature training set, y_train is still the label column of the original training set, and x_test is the new meta-feature test set. The resulting predictions array is the final output of the stacked model.
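As a rough sanity check (not part of the original notebook), you can cross-validate the same second-level XGBoost configuration on the five meta-feature columns before writing the submission file; treat the score as indicative only, since the meta-features depend on the fold split and on the random Age fills done earlier.

from sklearn.model_selection import cross_val_score

meta_model = xgb.XGBClassifier(
    n_estimators=2000, max_depth=4, min_child_weight=2, gamma=0.9,
    subsample=0.8, colsample_bytree=0.8, objective='binary:logistic', n_jobs=-1)
# 5-fold accuracy of the meta-model on the stacked first-level predictions
scores = cross_val_score(meta_model, x_train, y_train, cv=5, scoring='accuracy')
print(scores.mean())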

Reference: https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python/notebook — very well written and worth reading.
