
Introduction to Model Stacking (with example and code in Python)


Some call it Stacking, others Blending, or even Stacked Generalization. They are all the same thing:
a kind of ensemble learning.

Traditional Ensemble Learning

Here, we have multiple models trying to fit a training dataset to approximate the target. Since each model produces its own output, we need some combining mechanism to merge the results. This can be done through voting (majority wins), weighted voting, averaging the results, and so on. This is traditional ensemble learning.
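As a quick illustration (not part of the original notebook), here is a minimal sketch of traditional ensembling by simple averaging. It uses synthetic data from scikit-learn rather than the article's dataset, and the two model choices are illustrative only:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [RandomForestRegressor(n_estimators=100, random_state=0), Ridge(alpha=1.0)]
preds = []
for m in models:
    m.fit(X_train, y_train)           # every model fits the same training data
    preds.append(m.predict(X_test))   # and makes its own prediction

ensemble_pred = np.mean(preds, axis=0)  # combine by simple averaging

For classification, a majority vote or a weighted average of class probabilities plays the same role as the averaging step here.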

Stacking

Stacking, also called meta ensembling, is a model ensembling technique used to combine information from multiple predictive models and produce a new one. Often the stacked model (also called the 2nd level model) will outperform each of the individual models, because it learns where each base learner performs well and where it performs poorly. Stacking is therefore most useful when the base learners are different from one another.

In stacking, the combining mechanism is that the output of the classifiers (Level 1 classifiers) is used as training data for another classifier (Level 2 classifier) to approximate the same target function. Basically, we just let the new classifier figure out the combining mechanism.
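Before walking through the full notebook below, here is a minimal, self-contained sketch of that idea on synthetic data (the model choices and names such as X_fs / X_ss are illustrative only, not the article's data): Level 1 models are trained on one half of the data, and their predictions on the other half become extra features for the Level 2 model.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
half = len(X) // 2
X_fs, y_fs = X[:half], y[:half]        # first half: train the Level 1 learners
X_ss, y_ss = X[half:], y[half:]        # second half: train the Level 2 learner

level1 = [RandomForestRegressor(n_estimators=100, random_state=0),
          ExtraTreesRegressor(n_estimators=100, random_state=0)]
for m in level1:
    m.fit(X_fs, y_fs)

# Level 1 predictions on the second half become the "meta" features
meta = np.column_stack([m.predict(X_ss) for m in level1])
X_ss_w_meta = np.concatenate([X_ss, meta], axis=1)

# The Level 2 learner is left to figure out the combining mechanism
level2 = Ridge(alpha=1.0)
level2.fit(X_ss_w_meta, y_ss)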

Here is a problem where I used a stacking model.

Strategy I followed:

• The goal was to build a model to predict the purchase amount of customers against various products, which would help the business create personalized offers for customers on different products.

• Since the data was large, containing approximately half a million rows, a stacking strategy was applied: a combination of 4 XGBoost models, 3 ExtraTreesRegressor models, and 3 RandomForestRegressor models.

• All these submodels were combined using numpy's concatenate() and vstack() to form a final training dataset on which a new XGBoost model was trained. The same stacking trick was applied to the test data, where that new XGBoost model was used to predict (a toy sketch of this reshaping follows below).
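For intuition only (a toy sketch with made-up values, not the actual data), this is roughly how vstack() and concatenate() turn a set of submodel predictions into extra columns of a training matrix:

import numpy as np

# Toy predictions from three submodels over 5 rows (illustrative values only)
p1, p2, p3 = np.ones(5), np.full(5, 2.0), np.full(5, 3.0)
X = np.zeros((5, 4))                          # stand-in for the original features

meta = np.vstack((p1, p2, p3)).T              # shape (5, 3): one column per submodel
X_w_meta = np.concatenate((X, meta), axis=1)  # shape (5, 7): features + meta columns
print(X_w_meta.shape)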

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
In [2]:
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train.head()
Out[2]:
User_ID Product_ID Gender Age Occupation City_Category Stay_In_Current_City_Years Marital_Status Product_Category_1 Product_Category_2 Product_Category_3 Purchase
0 1000001 P00069042 F 0-17 10 A 2 0 3 NaN NaN 8370
1 1000001 P00248942 F 0-17 10 A 2 0 1 6.0 14.0 15200
2 1000001 P00087842 F 0-17 10 A 2 0 12 NaN NaN 1422
3 1000001 P00085442 F 0-17 10 A 2 0 12 14.0 NaN 1057
4 1000002 P00285442 M 55+ 16 C 4+ 0 8 NaN NaN 7969

Concatenating the test and train data:

In [3]:
frames = [train, test]
input = pd.concat(frames)

print (input.shape)
print (test.shape)
Out[3]:
(783667, 12)
(233599, 11)
In [4]:
#Fill all missing values with a large number, 999.
input.fillna(999, inplace=True)
In [5]:
input.head()
Out[5]:
Age City_Category Gender Marital_Status Occupation Product_Category_1 Product_Category_2 Product_Category_3 Product_ID Purchase Stay_In_Current_City_Years User_ID
0 0-17 A F 0 10 3 999.0 999.0 P00069042 8370.0 2 1000001
1 0-17 A F 0 10 1 6.0 14.0 P00248942 15200.0 2 1000001
2 0-17 A F 0 10 12 999.0 999.0 P00087842 1422.0 2 1000001
3 0-17 A F 0 10 12 14.0 999.0 P00085442 1057.0 2 1000001
4 55+ C M 0 16 8 999.0 999.0 P00285442 7969.0 4+ 1000002
In [6]:
target = input.Purchase
In [7]:
target = np.array(target)
In [8]:
input.drop(["Purchase"], axis=1, inplace=True)
In [9]:
#Convert all the columns to string 
input = input.applymap(str)
input.dtypes
Out[9]:
Age                           object
City_Category                 object
Gender                        object
Marital_Status                object
Occupation                    object
Product_Category_1            object
Product_Category_2            object
Product_Category_3            object
Product_ID                    object
Stay_In_Current_City_Years    object
User_ID                       object
dtype: object
In [10]:
# Have a copy of the pandas dataframe. Will be useful later on
input_pd = input.copy()
In [11]:
#Convert categorical variables to numeric using LabelEncoder

input = np.array(input)

for i in range(input.shape[1]):
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(input[:,i]))
    input[:, i] = lbl.transform(input[:, i])
In [12]:
input = input.astype(int)
In [13]:
submission=pd.read_csv('Sample_Submission_Tm9Lura.csv')

Applying the XGBoost model:

i) The parameter “min_child_weight” is used to control over-fitting. Higher values prevent the model from learning relations that are highly specific to the particular sample selected for a tree. It defines the minimum sum of instance weights of all observations required in a child.

ii) The parameter “subsample” denotes the fraction of observations to be randomly sampled for each tree. Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to underfitting. Typical values are 0.5-1.

iii) The parameter “colsample_bytree” denotes the fraction of columns to be randomly sampled for each tree.

iv) The parameter “silent” is 1 so that no running messages are printed.

v) The parameter “nthread” sets the number of cores used for parallel processing.

vi) The parameter “objective” is reg:linear here (linear regression).

vii) The parameter “eta” is analogous to the learning rate and makes the model more robust by shrinking the weights at each step.

viii) The parameter “eval_metric” is rmse here.

ix) The parameter “seed” can be used for generating reproducible results and for parameter tuning.

x) The parameter “max_depth” is used to control over-fitting, since a higher depth allows the model to learn relations that are very specific to a particular sample.

In [14]:
params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 6
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0

plst = list(params.items())
num_rounds = 3000
In [15]:
xgtrain = xgb.DMatrix(input[:train.shape[0],:], label=target[:train.shape[0]])
watchlist = [(xgtrain, 'train')]
model_1_xgboost = xgb.train(plst, xgtrain, num_rounds)
In [16]:
model_1_predict = model_1_xgboost.predict(xgb.DMatrix(input[train.shape[0]:,:]))
model_1_predict[model_1_predict<0] = 25
submission.Purchase = model_1_predict
submission.User_ID=test.User_ID
submission.Product_ID=test.Product_ID
submission.to_csv("sub1.csv", index=False)
In [17]:
submission.head()
Out[17]:
User_ID Product_ID Purchase
0 1000004 P00128942 15122.963867
1 1000009 P00113442 10431.894531
2 1000010 P00288442 6955.890137
3 1000010 P00145342 3446.657227
4 1000011 P00053842 968.134460
In [18]:
train.shape[0]
Out[18]:
550068

Stacking starts

Preparing the data for the stacking model:
The training set is divided into two equal halves, train_fs and train_ss.
The first level models are trained on train_fs and create meta features on train_ss, which are then fed into a second level model.

In [19]:
# Shuffle the training row indices and split them into two disjoint halves
perm = np.random.permutation(train.shape[0])
first_stage_rows  = perm[:train.shape[0] // 2]
second_stage_rows = perm[train.shape[0] // 2:]
In [20]:
train_np   = input[:train.shape[0], :]
target_np  = target[:train.shape[0]]
train_fs   = train_np[first_stage_rows, :]
target_fs  = target_np[first_stage_rows]
train_ss   = train_np[second_stage_rows, :]
target_ss  = target_np[second_stage_rows]
In [21]:
print (train_fs.shape, target_fs.shape, train_ss.shape, target_ss.shape)
(275034, 11) (275034,) (275034, 11) (275034,)

Building different models and training them on the first half of the dataset (train_fs)

In [22]:
xgtrain = xgb.DMatrix(train_fs, label=target_fs)
watchlist = [(xgtrain, 'train')]

# Model 1: 6/3000

params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 6
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0

plst = list(params.items())
num_rounds = 3000

model_1 = xgb.train(plst, xgtrain, num_rounds)

# Model 2: 8/1420

params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 8
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0

plst = list(params.items())
num_rounds = 1420

model_2 = xgb.train(plst, xgtrain, num_rounds)

# Model 3: 10/1200

params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 10
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0

plst = list(params.items())
num_rounds = 1200

model_3 = xgb.train(plst, xgtrain, num_rounds)

# Model 4: 12/800

params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 12
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0

plst = list(params.items())
num_rounds = 800

model_4 = xgb.train(plst, xgtrain, num_rounds)

Here come another 3 models, also trained on the first half of the dataset (train_fs)

In [23]:
# This set of models will be ExtraTrees

# Model 5: 8/1450

model_5 = ExtraTreesRegressor(n_estimators=1450, max_depth=8, min_samples_split=10, min_samples_leaf=10, oob_score=True, n_jobs=6, random_state=123, verbose=1, bootstrap=True)
model_5.fit(train_fs, target_fs)

# Model 6: 6/3000

model_6 = ExtraTreesRegressor(n_estimators=3000, max_depth=6, min_samples_split=10, min_samples_leaf=10, oob_score=True, n_jobs=6, random_state=123, verbose=1, bootstrap=True)
model_6.fit(train_fs, target_fs)

# Model 7: 12/800

model_7 = ExtraTreesRegressor(n_estimators=800, max_depth=12, min_samples_split=10, min_samples_leaf=10, oob_score=True, n_jobs=6, random_state=123, bootstrap=True)
model_7.fit(train_fs, target_fs)

3 more models, making a total of 10 base learners, all trained on the first half of the dataset (train_fs)

In [24]:
# This set of models will be RandomForest

# Model 8: 6/3000
model_8 = RandomForestRegressor(n_estimators=3000, max_depth=6, oob_score=True, n_jobs=6, random_state=123, min_samples_split=10, min_samples_leaf=10)
model_8.fit(train_fs, target_fs)

# Model 9: 8/1500
model_9 = RandomForestRegressor(n_estimators=1500, max_depth=8, oob_score=True, n_jobs=6, random_state=123, min_samples_split=10, min_samples_leaf=10)
model_9.fit(train_fs, target_fs)

# Model 10: 12/800
model_10 = RandomForestRegressor(n_estimators=800, max_depth=12, oob_score=True, n_jobs=6, random_state=123, min_samples_split=10, min_samples_leaf=10)
model_10.fit(train_fs, target_fs)

Finally, each of those 10 independent first level learners is used to predict on the second half of the dataset (train_ss), one by one.

In [25]:
model_1_predict = model_1.predict(xgb.DMatrix(train_ss))
model_2_predict = model_2.predict(xgb.DMatrix(train_ss))
model_3_predict = model_3.predict(xgb.DMatrix(train_ss))
model_4_predict = model_4.predict(xgb.DMatrix(train_ss))
model_5_predict = model_5.predict(train_ss)
model_6_predict = model_6.predict(train_ss)
model_7_predict = model_7.predict(train_ss)
model_8_predict = model_8.predict(train_ss)
model_9_predict = model_9.predict(train_ss)
model_10_predict = model_10.predict(train_ss)

The output of these 1st level learners is concatenated with train_ss to form the final training data, train_ss_w_meta

In [26]:
train_ss_w_meta = np.concatenate((train_ss, np.vstack((model_1_predict, model_2_predict, model_3_predict, model_4_predict, model_5_predict, model_6_predict, model_7_predict, model_8_predict, model_9_predict, model_10_predict)).T), axis=1)

Training the new classifier, which is now the 2nd level classifier, on the newly created training data ‘train_ss_w_meta’

In [27]:
params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 8
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0

plst = list(params.items())
num_rounds = 1400
In [28]:
xgtrain = xgb.DMatrix(train_ss_w_meta, label=target_ss)
watchlist = [(xgtrain, 'train')]
model_ss_xgboost = xgb.train(plst, xgtrain, num_rounds)

Applying the 10 models to the test data in a similar manner

In [29]:
model_1_predict = model_1.predict(xgb.DMatrix(input[train.shape[0]:, :]))
model_2_predict = model_2.predict(xgb.DMatrix(input[train.shape[0]:, :]))
model_3_predict = model_3.predict(xgb.DMatrix(input[train.shape[0]:, :]))
model_4_predict = model_4.predict(xgb.DMatrix(input[train.shape[0]:, :]))
model_5_predict = model_5.predict(input[train.shape[0]:, :])
model_6_predict = model_6.predict(input[train.shape[0]:, :])
model_7_predict = model_7.predict(input[train.shape[0]:, :])
model_8_predict = model_8.predict(input[train.shape[0]:, :])
model_9_predict = model_9.predict(input[train.shape[0]:, :])
model_10_predict = model_10.predict(input[train.shape[0]:, :])

test_ss_w_meta = np.concatenate((input[train.shape[0]:, :], np.vstack((model_1_predict, model_2_predict, model_3_predict, model_4_predict, model_5_predict, model_6_predict, model_7_predict, model_8_predict, model_9_predict, model_10_predict)).T), axis=1)

So far the stacking model was built on the training data, but the final prediction has to be made on the test data. The Level 1 base learners, trained on the first half of the training data (train_fs), are therefore used to predict on the test data, and a new test set ‘test_ss_w_meta’ is obtained by concatenating the test features with the outputs of the Level 1 learners. The 2nd level model (here XGBoost), which was trained on the concatenated training data (train_ss_w_meta), is then used to predict on this concatenated test data (test_ss_w_meta).

In [30]:
model_ss_predict = model_ss_xgboost.predict(xgb.DMatrix(test_ss_w_meta))
submission.Purchase = model_ss_predict
submission.to_csv("sub2.csv", index=False)

Stacking in action

To conclude, let's talk about how, when, why, and where we should use stacking in the real world.
Personally, I have seen numerous people using stacking on Kaggle competitions and other platforms.
In reality, stacking produces a small gain with a lot of added complexity. Stacking is really effective when you have a group of people building predictive models for one common problem. A single set of folds is decided, and then every team member builds their own model over those folds. The models can then be combined with a stacking script. This is great because it prevents team members from stepping on each other's toes.
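As a rough illustration of that workflow (my own sketch, not code from the article), each member can generate out-of-fold predictions over the agreed folds; those columns are then stacked side by side as meta features for a second level model, just as in the notebook above. The data here is synthetic and the model choice is arbitrary:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)   # the fold split everyone agrees on

def out_of_fold_predictions(model, X, y, kf):
    """Each team member runs this with their own model over the shared folds."""
    oof = np.zeros(len(y))
    for train_idx, valid_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict(X[valid_idx])
    return oof

# One member's out-of-fold column; every member's column can then be stacked
# side by side and fed to a second level model.
oof_rf = out_of_fold_predictions(RandomForestRegressor(n_estimators=100, random_state=0), X, y, kf)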

Feel free to comment with any doubts. Happy Learning! 😀


Shwet Prakash

A typical Indian computer engineer, a 3rd-year student at IIITDM Kancheepuram who is insanely inclined towards Data Science. Having executed some projects in this field, I still believe there is much more to learn and adapt. A story writer and teller to some extent, who loves watching Test matches more than T20s.

https://github.com/Architectshwet
https://www.linkedin.com/in/shwet-prakash-902b54111/