Lecture 5. Ensemble Learning

Lecture 5. Ensemble Learning#

Crowd intelligence

Joaquin Vanschoren

Ensemble learning#

If different models make different mistakes, can we simply average the predictions?
Voting Classifier: gives every model a vote on the class label
- Hard vote: majority class wins (class order breaks ties)
- Soft vote: sum class probabilities $p_{m,c}$ over $M$ models: $\underset{c}{\operatorname{argmax}} \sum_{m=1}^{M} w_c p_{m,c}$
- Classes can get different weights $w_c$ (default: $w_c=1$)

Show code cell source Hide code cell source

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

# Toy data
X, y = make_moons(noise=.2, random_state=18) # carefully picked random state for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Plot grid
x_lin = np.linspace(X_train[:, 0].min() - .5, X_train[:, 0].max() + .5, 100)
y_lin = np.linspace(X_train[:, 1].min() - .5, X_train[:, 1].max() + .5, 100)
x_grid, y_grid = np.meshgrid(x_lin, y_lin)
X_grid = np.c_[x_grid.ravel(), y_grid.ravel()]
models = [LogisticRegression(C=100),
          DecisionTreeClassifier(max_depth=3, random_state=0),
          KNeighborsClassifier(n_neighbors=1),
          KNeighborsClassifier(n_neighbors=30)]

@interact
def combine_voters(model1=models, model2=models):
    # Voting Classifier and components
    voting = VotingClassifier([('model1', model1),('model2', model2)],voting='soft')
    voting.fit(X_train, y_train)

    # transform produces individual probabilities
    y_probs =  voting.transform(X_grid)

    fig, axes = plt.subplots(1, 3, subplot_kw={'xticks': (()), 'yticks': (())}, figsize=(11*fig_scale, 3*fig_scale))
    scores = [voting.estimators_[0].score(X_test, y_test),
             voting.estimators_[1].score(X_test, y_test),
             voting.score(X_test, y_test)]
    titles = [model1.__class__.__name__, model2.__class__.__name__, 'VotingClassifier']
    for prob, score, title, ax in zip([y_probs[:, 1], y_probs[:, 3], y_probs[:, 1::2].sum(axis=1)], scores, titles, axes.ravel()):
        ax.contourf(x_grid, y_grid, prob.reshape(x_grid.shape), alpha=.4, cmap='bwr')
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='bwr', s=7*fig_scale)
        ax.set_title(title + f" \n acc={score:.2f}", pad=0, fontsize=9)

Why does this work?
- Different models may be good at different ‘parts’ of data (even if they underfit)
- Individual mistakes can be ‘averaged out’ (especially if models overfit)
Which models should be combined?
Bias-variance analysis teaches us that we have two options:
- If model underfits (high bias, low variance): combine with other low-variance models
  - Need to be different: ‘experts’ on different parts of the data
  - Bias reduction. Can be done with Boosting
- If model overfits (low bias, high variance): combine with other low-bias models
  - Need to be different: individual mistakes must be different
  - Variance reduction. Can be done with Bagging
Models must be uncorrelated but good enough (otherwise the ensemble is worse)
We can also learn how to combine the predictions of different models: Stacking

Decision trees (recap)#

Representation: Tree that splits data points into leaves based on tests
Evaluation (loss): Heuristic for purity of leaves (Gini index, entropy,…)
Optimization: Recursive, heuristic greedy search (Hunt’s algorithm)
- Consider all splits (thresholds) between adjacent data points, for every feature
- Choose the one that yields the purest leafs, repeat

Evaluation (loss function for classification)#

Every leaf predicts a class probability $\hat{p}_c$ = the relative frequency of class $c$
Leaf impurity measures (splitting criteria) for $L$ leafs, leaf $l$ has data $X_l$:
- Gini-Index: $Gini(X_{l}) = \sum_{c\neq c'} \hat{p}_c \hat{p}_{c'}$
- Entropy (more expensive): $E(X_{l}) = -\sum_{c\neq c'} \hat{p}_c \log_{2}\hat{p}_c$
- Best split maximizes information gain (idem for Gini index) $$ Gain(X,X_i) = E(X) - \sum_{l=1}^L \frac{|X_{i=l}|}{|X_{i}|} E(X_{i=l}) $$

../_images/73be214f16833ce3602dc48279e6c50d386add986fafce42805a345fcc94a641.png

Regression trees#

Every leaf predicts the mean target value $\mu$ of all points in that leaf
Choose the split that minimizes squared error of the leaves: $\sum_{x_{i} \in L} (y_i - \mu)^2$
Yields non-smooth step-wise predictions, cannot extrapolate

../_images/cd8bde9c8cf44002581850df79b89623a3db7d366839138ae30977e5fddd0342.png

Impurity/Entropy-based feature importance#

We can measure the importance of features (to the model) based on
- Which features we split on
- How high up in the tree we split on them (first splits ar emore important)

../_images/73501516999db54e1593188f9d057d9572d4f6d81714dd683c4fce9efd2c2680.png

Under- and overfitting#

We can easily control the (maximum) depth of the trees as a hyperparameter
Bias-variance analysis:
- Shallow trees have high bias but very low variance (underfitting)
- Deep trees have high variance but low bias (overfitting)
Because we can easily control their complexity, they are ideal for ensembling
- Deep trees: keep low bias, reduce variance with Bagging
- Shallow trees: keep low variance, reduce bias with Boosting

Show code cell source Hide code cell source

from sklearn.model_selection import ShuffleSplit, train_test_split

# Bias-Variance Computation 
def compute_bias_variance(clf, X, y):
    # Bootstraps
    n_repeat = 40 # 40 is on the low side to get a good estimate. 100 is better.
    shuffle_split = ShuffleSplit(test_size=0.33, n_splits=n_repeat, random_state=0)

    # Store sample predictions
    y_all_pred = [[] for _ in range(len(y))]

    # Train classifier on each bootstrap and score predictions
    for i, (train_index, test_index) in enumerate(shuffle_split.split(X)):
        # Train and predict
        clf.fit(X[train_index], y[train_index])
        y_pred = clf.predict(X[test_index])

        # Store predictions
        for j,index in enumerate(test_index):
            y_all_pred[index].append(y_pred[j])

    # Compute bias, variance, error
    bias_sq = sum([ (1 - x.count(y[i])/len(x))**2 * len(x)/n_repeat 
                for i,x in enumerate(y_all_pred)])
    var = sum([((1 - ((x.count(0)/len(x))**2 + (x.count(1)/len(x))**2))/2) * len(x)/n_repeat
               for i,x in enumerate(y_all_pred)])
    error = sum([ (1 - x.count(y[i])/len(x)) * len(x)/n_repeat 
            for i,x in enumerate(y_all_pred)])

    return np.sqrt(bias_sq), var, error

def plot_bias_variance(clf, X, y):
    bias_scores = []
    var_scores = []
    err_scores = []
    max_depth= range(2,11)

    for i in max_depth:
        b,v,e = compute_bias_variance(clf.set_params(random_state=0,max_depth=i),X,y)
        bias_scores.append(b)
        var_scores.append(v)
        err_scores.append(e)

    plt.figure(figsize=(8*fig_scale,3*fig_scale))
    plt.suptitle(clf.__class__.__name__, fontsize=9)
    plt.plot(max_depth, var_scores,label ="variance", lw=2*fig_scale )
    plt.plot(max_depth, np.square(bias_scores),label ="bias^2", lw=2*fig_scale )
    plt.plot(max_depth, err_scores,label ="error", lw=2*fig_scale)
    plt.xlabel("max_depth", fontsize=7)
    plt.legend(loc="best", fontsize=7)
    plt.show()

dt = DecisionTreeClassifier()
plot_bias_variance(dt, cancer.data, cancer.target)

../_images/a66fa2c1744aeaf341f6261a77a4b7bfcd0c594300ce417bdd0dadac187c1d14.png

Bagging (Bootstrap Aggregating)#

Obtain different models by training the same model on different training samples
- Reduce overfitting by averaging out individual predictions (variance reduction)
In practice: take $I$ bootstrap samples of your data, train a model on each bootstrap
- Higher $I$: more models, more smoothing (but slower training and prediction)
Base models should be unstable: different training samples yield different models
- E.g. very deep decision trees, or even randomized decision trees
- Deep Neural Networks can also benefit from bagging (deep ensembles)
Prediction by averaging predictions of base models
- Soft voting for classification (possibly weighted)
- Mean value for regression
Can produce uncertainty estimates as well
- By combining class probabilities of individual models (or variances for regression)

Random Forests#

Uses randomized trees to make models even less correlated (more unstable)
- At every split, only consider max_features features, randomly selected
Extremely randomized trees: considers 1 random threshold for random set of features (faster)

Effect on bias and variance#

Increasing the number of models (trees) decreases variance (less overfitting)
Bias is mostly unaffected, but will increase if the forest becomes too large (oversmoothing)

Show code cell source Hide code cell source

from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
cancer = load_breast_cancer()

# Faster version of plot_bias_variance that uses warm-starting
def plot_bias_variance_rf(model, X, y, warm_start=False):
    bias_scores = []
    var_scores = []
    err_scores = []

    # Bootstraps
    n_repeat = 40 # 40 is on the low side to get a good estimate. 100 is better.
    shuffle_split = ShuffleSplit(test_size=0.33, n_splits=n_repeat, random_state=0)
    
    # Ensemble sizes
    n_estimators = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

    # Store sample predictions. One per n_estimators
    # n_estimators : [predictions]
    predictions = {}
    for nr_trees in n_estimators:
        predictions[nr_trees] = [[] for _ in range(len(y))]

    # Train classifier on each bootstrap and score predictions
    for i, (train_index, test_index) in enumerate(shuffle_split.split(X)):
                
        # Initialize 
        clf = model(random_state=0)
        if model.__class__.__name__ == 'RandomForestClassifier':
            clf.n_jobs = -1
        if model.__class__.__name__ != 'AdaBoostClassifier':
            clf.warm_start = warm_start
        
        prev_n_estimators = 0
    
        # Train incrementally        
        for nr_trees in n_estimators:
            if model.__class__.__name__ == 'HistGradientBoostingClassifier':
                clf.max_iter = nr_trees
            else:
                clf.n_estimators = nr_trees
            
            # Fit and predict
            clf.fit(X[train_index], y[train_index])
            y_pred = clf.predict(X[test_index])
            for j,index in enumerate(test_index):
                predictions[nr_trees][index].append(y_pred[j])
    
    for nr_trees in n_estimators:
        # Compute bias, variance, error
        bias_sq = sum([ (1 - x.count(y[i])/len(x))**2 * len(x)/n_repeat 
                       for i,x in enumerate(predictions[nr_trees])])
        var = sum([((1 - ((x.count(0)/len(x))**2 + (x.count(1)/len(x))**2))/2) * len(x)/n_repeat 
                   for i,x in enumerate(predictions[nr_trees])])
        error = sum([ (1 - x.count(y[i])/len(x)) * len(x)/n_repeat
                     for i,x in enumerate(predictions[nr_trees])])

        bias_scores.append(bias_sq)
        var_scores.append(var)
        err_scores.append(error)

    plt.figure(figsize=(8*fig_scale,3*fig_scale))
    plt.suptitle(clf.__class__.__name__, fontsize=9)
    plt.plot(n_estimators, var_scores,label = "variance", lw=2*fig_scale )
    plt.plot(n_estimators, bias_scores,label = "bias^2", lw=2*fig_scale )
    plt.plot(n_estimators, err_scores,label = "error", lw=2*fig_scale  )
    plt.xscale('log',base=2)
    plt.xlabel("n_estimators", fontsize=9)
    plt.legend(loc="best", fontsize=7)
    plt.show()

plot_bias_variance_rf(RandomForestClassifier, cancer.data, cancer.target, warm_start=True)

../_images/21d92b98b3251823152b22881a8adc42aedb9b7b06783b0c32c1fa4e0f6372a7.png

In practice#

Different implementations can be used. E.g. in scikit-learn:
- BaggingClassifier: Choose your own base model and sampling procedure
- RandomForestClassifier: Default implementation, many options
- ExtraTreesClassifier: Uses extremely randomized trees
Most important parameters:
- n_estimators (>100, higher is better, but diminishing returns)
  - Will start to underfit (bias error component increases slightly)
- max_features
  - Defaults: $sqrt(p)$ for classification, $log2(p)$ for regression
  - Set smaller to reduce space/time requirements
- parameters of trees, e.g. max_depth, min_samples_split,…
  - Prepruning useful to reduce model size, but don’t overdo it
Easy to parallelize (set n_jobs to -1)
Fix random_state (bootstrap samples) for reproducibility

Out-of-bag error#

RandomForests don’t need cross-validation: you can use the out-of-bag (OOB) error
For each tree grown, about 33% of samples are out-of-bag (OOB)
- Remember which are OOB samples for every model, do voting over these
OOB error estimates are great to speed up model selection
- As good as CV estimates, althought slightly pessimistic
In scikit-learn: oob_error = 1 - clf.oob_score_

Show code cell source Hide code cell source

from collections import OrderedDict
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

RANDOM_STATE = 123

# Generate a binary classification dataset.
X, y = make_classification(n_samples=500, n_features=25,
                           n_clusters_per_class=1, n_informative=15,
                           random_state=RANDOM_STATE)

# NOTE: Setting the `warm_start` construction parameter to `True` disables
# support for parallelized ensembles but is necessary for tracking the OOB
# error trajectory during training.
ensemble_clfs = [
    ("RandomForestClassifier, max_features='sqrt'",
        RandomForestClassifier(warm_start=True, oob_score=True,
                               max_features="sqrt", n_jobs=-1,
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features='log2'",
        RandomForestClassifier(warm_start=True, max_features='log2',
                               oob_score=True, n_jobs=-1,
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features=None",
        RandomForestClassifier(warm_start=True, max_features=None,
                               oob_score=True, n_jobs=-1,
                               random_state=RANDOM_STATE))
]

# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

# Range of `n_estimators` values to explore.
min_estimators = 15
max_estimators = 175

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(X, y)

        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

# Generate the "OOB error rate" vs. "n_estimators" plot.
plt.figure(figsize=(8*fig_scale,3*fig_scale))
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label, lw=2*fig_scale)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators", fontsize=7)
plt.ylabel("OOB error rate", fontsize=7)
plt.legend(loc="upper right", fontsize=7)
plt.show()

../_images/73ee8502249d59ed9a88c1136771e2bf57bc45ceaa718e87dc77b17752d82caf.png

Feature importance#

RandomForests provide more reliable feature importances, based on many alternative hypotheses (trees)

../_images/5add678d6754df42492724ca63ce9b2ea1a47996dfaa7482ead60336319c98d5.png

Other tips#

Model calibration
- RandomForests are poorly calibrated.
- Calibrate afterwards (e.g. isotonic regression) if you aim to use probabilities
Warm starting
- Given an ensemble trained for $I$ iterations, you can simply add more models later
- You warm start from the existing model instead of re-starting from scratch
- Can be useful to train models on new, closely related data
  - Not ideal if the data batches change over time (concept drift)
  - Boosting is more robust against this (see later)

Strength and weaknesses#

RandomForest are among most widely used algorithms:
- Don’t require a lot of tuning
- Typically very accurate
- Handles heterogeneous features well (trees)
- Implictly selects most relevant features
Downsides:
- less interpretable, slower to train (but parallellizable)
- don’t work well on high dimensional sparse data (e.g. text)

Adaptive Boosting (AdaBoost)#

Obtain different models by reweighting the training data every iteration
- Reduce underfitting by focusing on the ‘hard’ training examples
Increase weights of instances misclassified by the ensemble, and vice versa
Base models should be simple so that different instance weights lead to different models
- Underfitting models: decision stumps (or very shallow trees)
- Each is an ‘expert’ on some parts of the data
Additive model: Predictions at iteration $I$ are sum of base model predictions
- In Adaboost, also the models each get a unique weight $w_i$ $$f_I(\mathbf{x}) = \sum_{i=1}^I w_i g_i(\mathbf{x})$$
Adaboost minimizes exponential loss. For instance-weighted error $\varepsilon$: $$\mathcal{L}_{Exp} = \sum_{n=1}^N e^{\varepsilon(f_I(\mathbf{x}))}$$
By deriving $\frac{\partial \mathcal{L}}{\partial w_i}$ you can find that optimal $w_{i} = \frac{1}{2}\log(\frac{1-\varepsilon}{\varepsilon})$

AdaBoost algorithm#

Initialize sample weights: $s_{n,0} = \frac{1}{N}$
Build a model (e.g. decision stumps) using these sample weights
Give the model a weight $w_i$ related to its weighted error rate $\varepsilon$ $$w_{i} = \lambda\log(\frac{1-\varepsilon}{\varepsilon})$$
- Good trees get more weight than bad trees
- Logit function maps error $\varepsilon$ from [0,1] to weight in [-Inf,Inf] (use small minimum error)
- Learning rate $\lambda$ (shrinkage) decreases impact of individual classifiers
  - Small updates are often better but requires more iterations
Update the sample weights
- Increase weight of incorrectly predicted samples: $s_{n,i+1} = s_{n,i}e^{w_i}$
- Decrease weight of correctly predicted samples: $s_{n,i+1} = s_{n,i}e^{-w_i}$
- Normalize weights to add up to 1
Repeat for $I$ iterations

AdaBoost variants#

Discrete Adaboost: error rate $\varepsilon$ is simply the error rate (1-Accuracy)
Real Adaboost: $\varepsilon$ is based on predicted class probabilities $\hat{p}_c$ (better)
AdaBoost for regression: $\varepsilon$ is either linear ($|y_i-\hat{y}_i|$), squared ($(y_i-\hat{y}_i)^2$), or exponential loss
GentleBoost: adds a bound on model weights $w_i$
LogitBoost: Minimizes logistic loss instead of exponential loss $$\mathcal{L}_{Logistic} = \sum_{n=1}^N log(1+e^{\varepsilon(f_I(\mathbf{x}))})$$

Adaboost in action#

Size of the samples represents sample weight
Background shows the latest tree’s predictions

Show code cell source Hide code cell source

from matplotlib.colors import ListedColormap
from sklearn.tree import DecisionTreeClassifier
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from sklearn.preprocessing import normalize

# Code adapted from https://xavierbourretsicotte.github.io/AdaBoost.html
def AdaBoost_scratch(X,y, M=10, learning_rate = 0.5):
    #Initialization of utility variables
    N = len(y)
    estimator_list, y_predict_list, estimator_error_list, estimator_weight_list, sample_weight_list = [],[],[],[],[]

    #Initialize the sample weights
    sample_weight = np.ones(N) / N
    sample_weight_list.append(sample_weight.copy())

    #For m = 1 to M
    for m in range(M):   

        #Fit a classifier
        estimator = DecisionTreeClassifier(max_depth = 1, max_leaf_nodes=2)
        estimator.fit(X, y, sample_weight=sample_weight)
        y_predict = estimator.predict(X)

        #Misclassifications
        incorrect = (y_predict != y)

        #Estimator error
        estimator_error = np.mean( np.average(incorrect, weights=sample_weight, axis=0))
        
        #Boost estimator weights
        estimator_weight =  learning_rate * np.log((1. - estimator_error) / estimator_error)

        #Boost sample weights
        sample_weight *= np.exp(estimator_weight * incorrect * ((sample_weight > 0) | (estimator_weight < 0)))
        sample_weight *= np.exp(-estimator_weight * np.invert(incorrect * ((sample_weight > 0) | (estimator_weight < 0))))
        sample_weight /= np.linalg.norm(sample_weight)
        
        #Save iteration values
        estimator_list.append(estimator)
        y_predict_list.append(y_predict.copy())
        estimator_error_list.append(estimator_error.copy())
        estimator_weight_list.append(estimator_weight.copy())
        sample_weight_list.append(sample_weight.copy())
        
    #Convert to np array for convenience   
    estimator_list = np.asarray(estimator_list)
    y_predict_list = np.asarray(y_predict_list)
    estimator_error_list = np.asarray(estimator_error_list)
    estimator_weight_list = np.asarray(estimator_weight_list)
    sample_weight_list = np.asarray(sample_weight_list)

    #Predictions
    preds = (np.array([np.sign((y_predict_list[:,point] * estimator_weight_list).sum()) for point in range(N)]))
    #print('Accuracy = ', (preds == y).sum() / N) 
    
    return estimator_list, estimator_weight_list, sample_weight_list, estimator_error_list

def plot_decision_boundary(classifier, X, y, N = 10, scatter_weights = np.ones(len(y)) , ax = None, title=None ):
    '''Utility function to plot decision boundary and scatter plot of data'''
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1

    # Get current axis and plot
    if ax is None:
        ax = plt.gca()
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax.scatter(X[:,0],X[:,1], c = y, cmap = cm_bright, s = scatter_weights * 1000, edgecolors='none')
    ax.set_xticks(())
    ax.set_yticks(())
    if title:
        ax.set_title(title, pad=1)
    
    # Plot classifier background
    if classifier is not None:
        xx, yy = np.meshgrid( np.linspace(x_min, x_max, N), np.linspace(y_min, y_max, N))
        
        #Check what methods are available
        if hasattr(classifier, "decision_function"):
            zz = np.array( [classifier.decision_function(np.array([xi,yi]).reshape(1,-1)) for  xi, yi in zip(np.ravel(xx), np.ravel(yy)) ] )
        elif hasattr(classifier, "predict_proba"):
            zz = np.array( [classifier.predict_proba(np.array([xi,yi]).reshape(1,-1))[:,1] for  xi, yi in zip(np.ravel(xx), np.ravel(yy)) ] )
        else:
            zz = np.array( [classifier(np.array([xi,yi]).reshape(1,-1)) for  xi, yi in zip(np.ravel(xx), np.ravel(yy)) ] )

        # reshape result and plot
        Z = zz.reshape(xx.shape)
    
        ax.contourf(xx, yy, Z, 2, cmap='RdBu', alpha=.5, levels=[0,0.5,1])
        #ax.contour(xx, yy, Z, 2, cmap='RdBu', levels=[0,0.5,1])


from sklearn.datasets import make_circles
Xa, ya = make_circles(n_samples=400, noise=0.15, factor=0.5, random_state=1)
cm_bright = ListedColormap(['#FF0000', '#0000FF'])

estimator_list, estimator_weight_list, sample_weight_list, estimator_error_list = AdaBoost_scratch(Xa, ya, M=60, learning_rate = 0.5)
current_ax = None
weight_scale = 1

@interact
def plot_adaboost(iteration=(0,60,1)):
    if iteration == 0:
        s_weights = (sample_weight_list[0,:] / sample_weight_list[0,:].sum() ) * weight_scale
        plot_decision_boundary(None, Xa, ya, N = 20, scatter_weights =s_weights)
    else:
        s_weights = (sample_weight_list[iteration,:] / sample_weight_list[iteration,:].sum() ) * weight_scale
        title = "Base model {}, error: {:.2f}, weight: {:.2f}".format(
            iteration,estimator_error_list[iteration-1],estimator_weight_list[iteration-1])
        plot_decision_boundary(estimator_list[iteration-1], Xa, ya, N = 20, scatter_weights =s_weights, ax=current_ax, title=title )
        

Examples#

../_images/9afa9ed5b2628c75ac0dfdd775fdeae059cac80f6ab04e9614108992c813759f.png

Bias-Variance analysis#

AdaBoost reduces bias (and a little variance)
- Boosting is a bias reduction technique
Boosting too much will eventually increase variance

../_images/75c21f73e5c9173630239f5c18ba7e68da950dcccab7605772880066d43479a5.png

Gradient Boosting#

Ensemble of models, each fixing the remaining mistakes of the previous ones
- Each iteration, the task is to predict the residual error of the ensemble
Additive model: Predictions at iteration $I$ are sum of base model predictions
- Learning rate (or shrinkage ) $\eta$: small updates work better (reduces variance) $$f_I(\mathbf{x}) = g_0(\mathbf{x}) + \sum_{i=1}^I \eta \cdot g_i(\mathbf{x}) = f_{I-1}(\mathbf{x}) + \eta \cdot g_I(\mathbf{x})$$
The pseudo-residuals $r_i$ are computed according to differentiable loss function
- E.g. least squares loss for regression and log loss for classification
- Gradient descent: predictions get updated step by step until convergence $$g_i(\mathbf{x}) \approx r_{i} = - \frac{\partial \mathcal{L}(y_i,f_{i-1}(x_i))}{\partial f_{i-1}(x_i)}$$
Base models $g_i$ should be low variance, but flexible enough to predict residuals accurately
- E.g. decision trees of depth 2-5

Gradient Boosting Trees (Regression)#

Base models are regression trees, loss function is square loss: $\mathcal{L} = \frac{1}{2}(y_i - \hat{y}_i)^2$
The pseudo-residuals are simply the prediction errors for every sample: $$r_i = -\frac{\partial \mathcal{L}}{\partial \hat{y}} = -2 * \frac{1}{2}(y_i - \hat{y}_i) * (-1) = y_i - \hat{y}_i$$
Initial model $g_0$ simply predicts the mean of $y$
For iteration $m=1..M$:
- For all samples i=1..n, compute pseudo-residuals $r_i = y_i - \hat{y}_i$
- Fit a new regression tree model $g_m(\mathbf{x})$ to $r_{i}$
  - In $g_m(\mathbf{x})$, each leaf predicts the mean of all its values
- Update ensemble predictions $\hat{y} = g_0(\mathbf{x}) + \sum_{m=1}^M \eta \cdot g_m(\mathbf{x})$
Early stopping (optional): stop when performance on validation set does not improve for $nr$ iterations

Gradient Boosting Regression in action#

Residuals quickly drop to (near) zero

GradientBoosting Algorithm (Classification)#

Base models are regression trees, predict probability of positive class $p$
- For multi-class problems, train one tree per class
Use (binary) log loss, with true class $y_i \in {0,1}$: $\mathcal{L_{log}} = - \sum_{i=1}^{N} \big[ y_i log(p_i) + (1-y_i) log(1-p_i) \big] $
The pseudo-residuals are simply the difference between true class and predicted $p$: $$\frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{\partial \mathcal{L}}{\partial log(p_i)} = y_i - p_i$$
Initial model $g_0$ predicts $p = log(\frac{\#positives}{\#negatives})$
For iteration $m=1..M$:
- For all samples i=1..n, compute pseudo-residuals $r_i = y_i - p_i$
- Fit a new regression tree model $g_m(\mathbf{x})$ to $r_{i}$
  - In $g_m(\mathbf{x})$, each leaf predicts $\frac{\sum_{i} r_i}{\sum_{i} p_i(1-p_i)}$
- Update ensemble predictions $\hat{y} = g_0(\mathbf{x}) + \sum_{m=1}^M \eta \cdot g_m(\mathbf{x})$
Early stopping (optional): stop when performance on validation set does not improve for $nr$ iterations

Gradient Boosting Classification in action#

Size of the samples represents the residual weights: most quickly drop to (near) zero

Examples#

../_images/121fd3a8e57e33abb49334f11066c9eaf44f3e65bdf30aa11878bcf34b83428b.png

Bias-variance analysis#

Gradient Boosting is very effective at reducing bias error
Boosting too much will eventually increase variance

../_images/9253c5dcdb8547d3f69fd360b641a942ff64c1e6b695275a9d34dcc4a09e2bc1.png

Feature importance#

Gradient Boosting also provide feature importances, based on many trees
Compared to RandomForests, the trees are smaller, hence more features have zero importance

../_images/a7ae3e39fd9c60fe97fc31d44b88dd77bf1f5a2176d55c3ae0c7347a6233f587.png

Gradient Boosting: strengths and weaknesses#

Among the most powerful and widely used models
Work well on heterogeneous features and different scales
Typically better than random forests, but requires more tuning, longer training
Does not work well on high-dimensional sparse data

Main hyperparameters:

n_estimators: Higher is better, but will start to overfit
learning_rate: Lower rates mean more trees are needed to get more complex models
- Set n_estimators as high as possible, then tune learning_rate
- Or, choose a learning_rate and use early stopping to avoid overfitting
max_depth: typically kept low (<5), reduce when overfitting
max_features: can also be tuned, similar to random forests
n_iter_no_change: early stopping: algorithm stops if improvement is less than a certain tolerance tol for more than n_iter_no_change iterations.

Extreme Gradient Boosting (XGBoost)#

Faster version of gradient boosting: allows more iterations on larger datasets
Normal regression trees: split to minimize squared loss of leaf predictions
- XGBoost trees only fit residuals: split so that residuals in leaf are more similar
Don’t evaluate every split point, only $q$ quantiles per feature (binning)
- $q$ is hyperparameter (sketch_eps, default 0.03)
For large datasets, XGBoost uses approximate quantiles
- Can be parallelized (multicore) by chunking the data and combining histograms of data
- For classification, the quantiles are weighted by $p(1-p)$
Gradient descent sped up by using the second derivative of the loss function
Strong regularization by pre-pruning the trees
Column and row are randomly subsampled when computing splits
Support for out-of-core computation (data compression in RAM, sharding,…)

XGBoost in practice#

Not part of scikit-learn, but HistGradientBoostingClassifier is similar
- binning, multicore,…
The xgboost python package is sklearn-compatible
- Install separately, conda install -c conda-forge xgboost
- Allows learning curve plotting and warm-starting
Further reading:

LightGBM#

Another fast boosting technique

Uses gradient-based sampling
- use all instances with large gradients/residuals (e.g. 10% largest)
- randomly sample instances with small gradients, ignore the rest
- intuition: samples with small gradients are already well-trained.
- requires adapted information gain criterion
Does smarter encoding of categorical features

CatBoost#

Another fast boosting technique

Optimized for categorical variables
- Uses bagged and smoothed version of target encoding
Uses symmetric trees: same split for all nodes on a given level
- Can be much faster
Allows monotonicity constraints for numeric features
- Model must be be a non-decreasing function of these features
Lots of tooling (e.g. GPU training)

Stacking#

Choose $M$ different base-models, generate predictions
Stacker (meta-model) learns mapping between predictions and correct label
- Can also be repeated: multi-level stacking
- Popular stackers: linear models (fast) and gradient boosting (accurate)
Cascade stacking: adds base-model predictions as extra features
Models need to be sufficiently different, be experts at different parts of the data
Can be very accurate, but also very slow to predict

Other ensembling techniques#

Hyper-ensembles: same basic model but with different hyperparameter settings
- Can combine overfitted and underfitted models
Deep ensembles: ensembles of deep learning models
Bayes optimal classifier: ensemble of all possible models (largely theoretic)
Bayesian model averaging: weighted average of probabilistic models, weighted by their posterior probabilities
Cross-validation selection: does internal cross-validation to select best of $M$ models
Any combination of different ensembling techniques

Algorithm overview#

Name	Representation	Loss function	Optimization	Regularization
Classification trees	Decision tree	Entropy / Gini index	Hunt’s algorithm	Tree depth,…
Regression trees	Decision tree	Square loss	Hunt’s algorithm	Tree depth,…
RandomForest	Ensemble of randomized trees	Entropy / Gini / Square	(Bagging)	Number/depth of trees,…
AdaBoost	Ensemble of stumps	Exponential loss	Greedy search	Number/depth of trees,…
GradientBoostingRegression	Ensemble of regression trees	Square loss	Gradient descent	Number/depth of trees,…
GradientBoostingClassification	Ensemble of regression trees	Log loss	Gradient descent	Number/depth of trees,…
XGBoost, LightGBM, CatBoost	Ensemble of XGBoost trees	Square/log loss	2nd order gradients	Number/depth of trees,…
Stacking	Ensemble of heterogeneous models	/	/	Number of models,…

Summary#

Ensembles of voting classifiers improve performance
- Which models to choose? Consider bias-variance tradeoffs!
Bagging / RandomForest is a variance-reduction technique
- Build many high-variance (overfitting) models on random data samples
  - The more different the models, the better
- Aggregation (soft voting) over many models reduces variance
  - Diminishing returns, over-smoothing may increase bias error
- Parallellizes easily, doesn’t require much tuning
Boosting is a bias-reduction technique
- Build low-variance models that correct each other’s mistakes
  - By reweighting misclassified samples: AdaBoost
  - By predicting the residual error: Gradient Boosting
- Additive models: predictions are sum of base-model predictions
  - Can drive the error to zero, but risk overfitting
- Doesn’t parallelize easily. Slower to train, much faster to predict.
  - XGBoost,LightGBM,… are fast and offer some parallellization
Stacking: learn how to combine base-model predictions
- Base-models still have to be sufficiently different

Lecture 5. Ensemble Learning

Contents

Lecture 5. Ensemble Learning#

Ensemble learning#

Decision trees (recap)#

Evaluation (loss function for classification)#

Regression trees#

Impurity/Entropy-based feature importance#

Under- and overfitting#

Bagging (Bootstrap Aggregating)#

Random Forests#

Effect on bias and variance#

In practice#

Out-of-bag error#

Feature importance#

Other tips#

Strength and weaknesses#

Adaptive Boosting (AdaBoost)#

AdaBoost algorithm#

AdaBoost variants#

Adaboost in action#

Examples#

Bias-Variance analysis#

Gradient Boosting#

Gradient Boosting Trees (Regression)#

Gradient Boosting Regression in action#

GradientBoosting Algorithm (Classification)#

Gradient Boosting Classification in action#

Examples#

Bias-variance analysis#

Feature importance#

Gradient Boosting: strengths and weaknesses#

Extreme Gradient Boosting (XGBoost)#

XGBoost in practice#

LightGBM#

CatBoost#

Stacking#

Other ensembling techniques#

Algorithm overview#

Summary#