Some call it stacking, others blending or stacked generalization. The names are often used interchangeably for the same idea.

It is a kind of ensemble learning.

# Traditional Ensemble Learning

Here, multiple models are fit to a training dataset to approximate the target. Since each model produces its own output, we need a mechanism to combine the results. This can be done through voting (majority wins), weighted voting, averaging the results, and so on. This is traditional ensemble learning.
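To make the combining mechanisms concrete, here is a minimal sketch of averaging, weighted averaging, and majority voting. All prediction values and weights are made up for illustration; in practice they would come from fitted models and, for example, validation scores.

```python
import numpy as np

# Hypothetical predictions from three regressors for four samples
# (rows = models, columns = samples); the numbers are made up.
reg_preds = np.array([
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 18.0, 33.0, 41.0],
    [11.0, 22.0, 27.0, 39.0],
])

# Simple averaging: every model gets the same say.
simple_avg = reg_preds.mean(axis=0)

# Weighted averaging: the weights are assumed here, not learned.
weights = np.array([0.5, 0.3, 0.2])
weighted_avg = np.average(reg_preds, axis=0, weights=weights)

# Majority voting for classifiers: hypothetical class labels (0 or 1).
clf_preds = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])
# For each sample, count the votes per class and keep the most common label.
majority_vote = np.array([np.bincount(col).argmax() for col in clf_preds.T])

print(simple_avg)
print(weighted_avg)
print(majority_vote)
```

In all three cases the combining rule is fixed by hand, which is the part that stacking replaces with a learned model.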

# Stacking

Stacking, also called meta ensembling, is a model ensembling technique that combines information from multiple predictive models to produce a new one. The stacked model (also called the 2nd level model) often outperforms each of the individual models, because it can learn where each base learner performs well and where it performs poorly. For this reason, stacking is most effective when the base learners differ from one another.

In stacking, the combining mechanism is itself learned: the outputs of the Level 1 models are used as training data for another model (the Level 2 model) that approximates the same target function. Basically, we let the new model figure out the combining mechanism.
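The two-level idea can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the data comes from scikit-learn's `make_regression`, the second-level model is a plain `LinearRegression` rather than the XGBoost model used later in this post, and the second level sees only the base predictions, not the original features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression data (assumption: stands in for a real dataset).
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)

# Split the rows into two disjoint halves.
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
first, second = perm[:200], perm[200:]

# Level 1: base learners are fit on the first half only.
base_models = [
    RandomForestRegressor(n_estimators=50, random_state=0),
    ExtraTreesRegressor(n_estimators=50, random_state=0),
]
for model in base_models:
    model.fit(X[first], y[first])

# Their predictions on the second half become the meta features.
meta_features = np.column_stack([m.predict(X[second]) for m in base_models])

# Level 2: the meta-learner learns how to combine the base predictions.
meta_model = LinearRegression()
meta_model.fit(meta_features, y[second])

print(meta_features.shape)
print(meta_model.coef_)
```

The learned coefficients play the role that hand-picked weights play in weighted averaging.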

## The problem where I used a stacking model

### Strategy I followed:

• Build a model to predict the purchase amount of customers against various products, which will help the company create personalized offers for customers against different products.

• Since the data was large, containing approximately half a million rows, I applied a stacking strategy that combined 4 XGBoost models, 3 ExtraTreesRegressor models, and 3 RandomForestRegressor models.

• The outputs of all these submodels were combined using numpy concatenate() and vstack() to form a final training dataset on which a new XGBoost model was trained. The same procedure was applied to the test data, on which that new XGBoost model made the final predictions.

```
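import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import linear_model
from sklearn import preprocessing
# ExtraTreesRegressor and RandomForestRegressor are used by the first-level models
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error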
```

```
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train.head()
```

Concatenating the test and train data:

```
frames = [train, test]
input = pd.concat(frames)
print (input.shape)
print (test.shape)
```

```
#Fill all missing values with a large number, 999.
input.fillna(999, inplace=True)
```

```
input.head()
```

```
target = input.Purchase
```

```
target = np.array(target)
```

```
input.drop(["Purchase"], axis=1, inplace=True)
```

```
#Convert all the columns to string
input = input.applymap(str)
input.dtypes
```

```
# Have a copy of the pandas dataframe. Will be useful later on
input_pd = input.copy()
```

```
#Convert categorical variables to numeric using LabelEncoder
input = np.array(input)
for i in range(input.shape[1]):
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(input[:, i]))
    input[:, i] = lbl.transform(input[:, i])
```
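As a small illustration of what the encoding step does to a single column, here is a sketch on a made-up column of strings (the values are hypothetical, including the "999" placeholder for a missing entry).

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical categorical column, already converted to strings.
column = np.array(["M", "F", "F", "M", "999"])

lbl = preprocessing.LabelEncoder()
lbl.fit(column)

# classes_ holds the sorted unique values; transform maps each value to its index.
encoded = lbl.transform(column)

print(lbl.classes_)
print(encoded)
```

Because `LabelEncoder.transform` raises an error on values it did not see during `fit`, fitting on the concatenated train and test data ensures every category has a code.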

```
input = input.astype(int)
```

```
submission=pd.read_csv('Sample_Submission_Tm9Lura.csv')
```

Applying the XGBoost model:

i) Parameter “min_child_weight” is used to control over-fitting. It defines the minimum sum of weights of all observations required in a child. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.

ii) Parameter “subsample” denotes the fraction of observations to be randomly sampled for each tree. Lower values make the algorithm more conservative and help prevent overfitting, but values that are too small might lead to underfitting. Typical values: 0.5-1.

iii) Parameter “colsample_bytree” denotes the fraction of columns to be randomly sampled for each tree.

iv) Parameter “silent” is 1 so that no running messages will be printed. (Recent XGBoost releases use “verbosity” instead.)

v) Parameter “nthread” is used for parallel processing; it sets the number of threads, typically the number of cores in the system.

vi) Parameter “objective” is reg:linear here, i.e. regression with squared error. (Recent XGBoost releases name it reg:squarederror.)

vii) Parameter “eta” is analogous to the learning rate and makes the model more robust by shrinking the weights at each step.

viii) Parameter “eval_metric” is rmse here.

ix) Parameter “seed” can be used for generating reproducible results and also for parameter tuning.

x) Parameter “max_depth” is used to control over-fitting, as a higher depth will allow the model to learn relations very specific to a particular sample.
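For reference, the rmse metric named in item viii is the square root of the mean squared difference between actual and predicted values. A minimal NumPy sketch, with made-up numbers and a hypothetical helper name `rmse`:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up purchase amounts and predictions, for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 320.0]
print(rmse(actual, predicted))
```

Large errors are penalized more than small ones because the differences are squared before averaging.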

```
params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 6
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0
plst = list(params.items())
num_rounds = 3000
```

```
xgtrain = xgb.DMatrix(input[:train.shape[0],:], label=target[:train.shape[0]])
watchlist = [(xgtrain, 'train')]
model_1_xgboost = xgb.train(plst, xgtrain, num_rounds)
```

```
model_1_predict = model_1_xgboost.predict(xgb.DMatrix(input[train.shape[0]:,:]))
model_1_predict[model_1_predict<0] = 25
submission.Purchase = model_1_predict
submission.User_ID=test.User_ID
submission.Product_ID=test.Product_ID
submission.to_csv("sub1.csv", index=False)
```

```
submission.head()
```

```
train.shape[0]
```

# Stacking starts

Preparing data for the stacking model:

The training dataset is divided into two equal parts: train_fs and train_ss.

The first-level models are trained on train_fs, and their predictions on train_ss become the meta features that are fed into a second-level model.

```
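# Shuffle the row indices once and cut them into two disjoint halves
shuffled_rows = np.random.permutation(train.shape[0])
first_stage_rows = shuffled_rows[:train.shape[0] // 2]
second_stage_rows = shuffled_rows[train.shape[0] // 2:]
```

```
train_np = input[:train.shape[0], :]
target_np = target[:train.shape[0]]
# First half: used to fit the first-level models
train_fs = train_np[first_stage_rows, :]
target_fs = target_np[first_stage_rows]
# Second half: every row that was not used in the first stage
train_ss = train_np[second_stage_rows, :]
target_ss = target_np[second_stage_rows]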
```

```
print (train_fs.shape, target_fs.shape, train_ss.shape, target_ss.shape)
```
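A quick check on a toy array shows how a permutation-based split yields two disjoint halves, and why negating an index array is not a way to select the remaining rows. The array and indices below are made up for illustration.

```python
import numpy as np

n_rows = 10
data = np.arange(n_rows)  # toy stand-in for the training rows

# Permutation-based split: every row lands in exactly one half.
rng = np.random.default_rng(0)
perm = rng.permutation(n_rows)
first_half, second_half = perm[: n_rows // 2], perm[n_rows // 2 :]
assert set(first_half.tolist()).isdisjoint(second_half.tolist())
assert len(first_half) + len(second_half) == n_rows

# Negating an index array does NOT select the complement:
# -idx is read as "count from the end", and index 0 stays 0.
idx = np.array([0, 1, 2])
print(data[idx])    # [0 1 2]
print(data[-idx])   # [0 9 8]

# A boolean mask is one way to express "all rows except idx".
mask = np.ones(n_rows, dtype=bool)
mask[idx] = False
print(data[mask])   # [3 4 5 6 7 8 9]
```

Keeping the two halves disjoint matters because the second-level model should be trained on predictions for rows the first-level models never saw.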

## Building different models and training them on the first half of the dataset (train_fs)

```
xgtrain = xgb.DMatrix(train_fs, label=target_fs)
watchlist = [(xgtrain, 'train')]
# Model 1: 6/3000
params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 6
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0
plst = list(params.items())
num_rounds = 3000
model_1 = xgb.train(plst, xgtrain, num_rounds)
# Model 2: 8/1420
params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 8
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0
plst = list(params.items())
num_rounds = 1420
model_2 = xgb.train(plst, xgtrain, num_rounds)
# Model 3: 10/1200
params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 10
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0
plst = list(params.items())
num_rounds = 1200
model_3 = xgb.train(plst, xgtrain, num_rounds)
# Model 4: 12/800
params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 12
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0
plst = list(params.items())
num_rounds = 800
model_4 = xgb.train(plst, xgtrain, num_rounds)
```
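The four parameter blocks above differ only in max_depth and num_rounds, so the repetition could be folded into a small helper. The sketch below is one possible refactoring; `make_xgb_params` is a hypothetical name, and the training call is left as a comment so that the snippet runs without the data.

```python
def make_xgb_params(max_depth):
    """Return the shared parameter dict with only max_depth varied.

    Hypothetical helper; the values mirror the blocks above.
    """
    return {
        "min_child_weight": 10,
        "subsample": 0.7,
        "colsample_bytree": 0.7,
        "scale_pos_weight": 0.8,
        "silent": 1,
        "max_depth": max_depth,
        "nthread": 6,
        "objective": "reg:linear",
        "eta": 0.1,
        "base_score": 1800,
        "eval_metric": "rmse",
        "seed": 0,
    }

# (max_depth, num_rounds) pairs taken from the "Model N: depth/rounds" comments.
configs = [(6, 3000), (8, 1420), (10, 1200), (12, 800)]

for depth, num_rounds in configs:
    params = make_xgb_params(depth)
    # model = xgb.train(params, xgtrain, num_rounds)  # as in the blocks above
    print(depth, num_rounds, params["eta"])
```

Pairing deeper trees with fewer boosting rounds, as these configurations do, keeps the individual models different from one another.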

## Three more models trained on the first half of the dataset (train_fs)

```
# This set of models will be ExtraTrees
# Model 5: 8/1450
model_5 = ExtraTreesRegressor(n_estimators=1450, max_depth=8, min_samples_split=10, min_samples_leaf=10, oob_score=True, n_jobs=6, random_state=123, verbose=1, bootstrap=True)
model_5.fit(train_fs, target_fs)
# Model 6: 6/3000
model_6 = ExtraTreesRegressor(n_estimators=3000, max_depth=6, min_samples_split=10, min_samples_leaf=10, oob_score=True, n_jobs=6, random_state=123, verbose=1, bootstrap=True)
model_6.fit(train_fs, target_fs)
# Model 7: 12/800
model_7 = ExtraTreesRegressor(n_estimators=800, max_depth=12, min_samples_split=10, min_samples_leaf=10, oob_score=True, n_jobs=6, random_state=123, bootstrap=True)
model_7.fit(train_fs, target_fs)
```

## Three more models, for a total of 10 base learners fit on the first half of the dataset (train_fs)

```
# This set of models will be RandomForest
# Model 8: 6/3000
model_8 = RandomForestRegressor(n_estimators=3000, max_depth=6, oob_score=True, n_jobs=6, random_state=123, min_samples_split=10, min_samples_leaf=10)
model_8.fit(train_fs, target_fs)
# Model 9: 8/1500
model_9 = RandomForestRegressor(n_estimators=1500, max_depth=8, oob_score=True, n_jobs=6, random_state=123, min_samples_split=10, min_samples_leaf=10)
model_9.fit(train_fs, target_fs)
# Model 10: 12/800
model_10 = RandomForestRegressor(n_estimators=800, max_depth=12, oob_score=True, n_jobs=6, random_state=123, min_samples_split=10, min_samples_leaf=10)
model_10.fit(train_fs, target_fs)
```

## Finally, using the 10 independent first-level learners to predict on the second half of the dataset, train_ss, one by one

```
model_1_predict = model_1.predict(xgb.DMatrix(train_ss))
model_2_predict = model_2.predict(xgb.DMatrix(train_ss))
model_3_predict = model_3.predict(xgb.DMatrix(train_ss))
model_4_predict = model_4.predict(xgb.DMatrix(train_ss))
model_5_predict = model_5.predict(train_ss)
model_6_predict = model_6.predict(train_ss)
model_7_predict = model_7.predict(train_ss)
model_8_predict = model_8.predict(train_ss)
model_9_predict = model_9.predict(train_ss)
model_10_predict = model_10.predict(train_ss)
```

## The outputs of these first-level models are concatenated to form the final training data, train_ss_w_meta

```
train_ss_w_meta = np.concatenate((train_ss, np.vstack((model_1_predict, model_2_predict, model_3_predict, model_4_predict, model_5_predict, model_6_predict, model_7_predict, model_8_predict, model_9_predict, model_10_predict)).T), axis=1)
```
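The shapes involved in this one-liner are easier to follow on a toy example. The arrays below are made-up stand-ins: 5 rows with 3 original features, and two model outputs instead of ten.

```python
import numpy as np

# Toy stand-ins: 5 rows with 3 original features, and 2 model outputs.
features = np.zeros((5, 3))
pred_a = np.arange(5, dtype=float)        # shape (5,)
pred_b = np.arange(5, dtype=float) * 10   # shape (5,)

stacked = np.vstack((pred_a, pred_b))     # shape (2, 5): one row per model
meta = stacked.T                          # shape (5, 2): one column per model
with_meta = np.concatenate((features, meta), axis=1)  # shape (5, 5)

print(stacked.shape, meta.shape, with_meta.shape)
```

The transpose is what turns "one row per model" into "one column per model", so that each training row gains one extra feature per first-level model.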

## Training the new 2nd level model on the newly created training data ‘train_ss_w_meta’

```
params = {}
params["min_child_weight"] = 10
params["subsample"] = 0.7
params["colsample_bytree"] = 0.7
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 8
params["nthread"] = 6
#params["gamma"] = 1
params["objective"] = "reg:linear"
params["eta"] = 0.1
params["base_score"] = 1800
params["eval_metric"] = "rmse"
params["seed"] = 0
plst = list(params.items())
num_rounds = 1400
```

```
xgtrain = xgb.DMatrix(train_ss_w_meta, label=target_ss)
watchlist = [(xgtrain, 'train')]
model_ss_xgboost = xgb.train(plst, xgtrain, num_rounds)
```

## Applying the 10 models to the test data in the same manner

```
model_1_predict = model_1.predict(xgb.DMatrix(input[train.shape[0]:, :]))
model_2_predict = model_2.predict(xgb.DMatrix(input[train.shape[0]:, :]))
model_3_predict = model_3.predict(xgb.DMatrix(input[train.shape[0]:, :]))
model_4_predict = model_4.predict(xgb.DMatrix(input[train.shape[0]:, :]))
model_5_predict = model_5.predict(input[train.shape[0]:, :])
model_6_predict = model_6.predict(input[train.shape[0]:, :])
model_7_predict = model_7.predict(input[train.shape[0]:, :])
model_8_predict = model_8.predict(input[train.shape[0]:, :])
model_9_predict = model_9.predict(input[train.shape[0]:, :])
model_10_predict = model_10.predict(input[train.shape[0]:, :])
test_ss_w_meta = np.concatenate((input[train.shape[0]:, :], np.vstack((model_1_predict, model_2_predict, model_3_predict, model_4_predict, model_5_predict, model_6_predict, model_7_predict, model_8_predict, model_9_predict, model_10_predict)).T), axis=1)
```

So far, the stacking model has been built on the training data only, while the final prediction has to be made on the test data. The Level 1 base learners, trained on half of the training data (train_fs), therefore predict on the test data, and concatenating their outputs to the test features gives a new test set, ‘test_ss_w_meta’. The 2nd level model (here XGBoost), already trained on the concatenated training data (train_ss_w_meta), then predicts on this concatenated test data (test_ss_w_meta).
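Since the same augmentation is applied to both the second-stage training rows and the test rows, it can be expressed as one function, which also guarantees that the meta columns appear in the same order in both sets. The sketch below uses hypothetical names (`ConstantModel`, `add_meta_features`) and dummy models; it assumes every model exposes `predict(X)` on a NumPy array, so the XGBoost boosters used in this post would need a thin wrapper that builds a DMatrix first.

```python
import numpy as np

class ConstantModel:
    """Hypothetical stand-in for a fitted model: always predicts `value`."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(X.shape[0], self.value, dtype=float)

def add_meta_features(X, models):
    """Append one prediction column per model, in list order."""
    preds = np.column_stack([m.predict(X) for m in models])
    return np.concatenate((X, preds), axis=1)

models = [ConstantModel(1.0), ConstantModel(2.0)]
train_part = np.zeros((4, 3))   # stand-in for the second-stage training rows
test_part = np.zeros((6, 3))    # stand-in for the test rows

train_with_meta = add_meta_features(train_part, models)
test_with_meta = add_meta_features(test_part, models)
print(train_with_meta.shape, test_with_meta.shape)
```

Both outputs have the same number of columns, which is what the second-level model requires at prediction time.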

## The final XGBoost model, trained on the concatenated training data, now predicts on the concatenated (stacked) test data

```
model_ss_predict = model_ss_xgboost.predict(xgb.DMatrix(test_ss_w_meta))
submission.Purchase = model_ss_predict
submission.to_csv("sub2.csv", index=False)
```

# Stacking in action

To conclude, let’s talk about how, when, why, and where we should use stacking in the real world.

Personally, I have seen numerous people use stacking in Kaggle competitions and on other platforms.

In practice, stacking produces a small gain at the cost of a lot of added complexity. Stacking is really effective when a group of people is building predictive models for one common problem. A single set of folds is decided on, and every team member builds their own model over those folds. The models can then be combined using a stacking script. This is great because it prevents team members from stepping on each other’s toes.
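The fold-based workflow can be sketched with scikit-learn's `KFold` and `cross_val_predict`. Note that this is a different variant from the half-and-half split used in the example above: here every training row receives an out-of-fold prediction. The data is synthetic, and the two models are placeholders for whatever each team member would contribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

# One fold definition shared by everyone (assumption: 5 shuffled folds).
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# Each "team member" contributes a model; these two are placeholders.
member_models = {
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Out-of-fold predictions: each row is predicted by a model that never saw it.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=folds)
    for model in member_models.values()
])

# The stacking script fits a second-level model on the out-of-fold columns.
meta_model = Ridge(alpha=1.0).fit(oof, y)
print(oof.shape)
```

Because the folds are fixed up front, each member can produce their column independently and hand it to the stacking script.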

Feel free to comment with any doubts. Happy Learning! 😀

### Shwet Prakash

https://github.com/Architectshwet

https://www.linkedin.com/in/shwet-prakash-902b54111/

#### Latest posts by Shwet Prakash (see all)

- Introduction to Model Stacking (with example and codes in Python) - September 28, 2017