Lecture 5. Ensemble Learning#

Crowd intelligence

Joaquin Vanschoren

Hide code cell source
# Auto-setup when running on Google Colab
import os
if 'google.colab' in str(get_ipython()) and not os.path.exists('/content/master'):
    !git clone -q https://github.com/ML-course/master.git /content/master
    !pip --quiet install -r /content/master/requirements_colab.txt
    %cd master/notebooks

# Global imports and settings
%matplotlib inline
from preamble import *
interactive = True # Set to True for interactive plots
if interactive:
    fig_scale = 0.9
    plt.rcParams.update(print_config)
else: # For printing
    fig_scale = 0.3
    plt.rcParams.update(print_config)

Ensemble learning#

  • If different models make different mistakes, can we simply average the predictions?

  • Voting Classifier: gives every model a vote on the class label

    • Hard vote: majority class wins (class order breaks ties)

    • Soft vote: sum the class probabilities \(p_{m,c}\) over all \(M\) models: \(\underset{c}{\operatorname{argmax}} \sum_{m=1}^{M} w_m p_{m,c}\)

    • Models can get different weights \(w_m\) (default: \(w_m=1\))
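A minimal sketch of hard vs. soft voting with scikit-learn's VotingClassifier (the dataset and base models below are illustrative choices, not the ones used in the interactive demo that follows):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy data and base models
Xv, yv = make_moons(noise=0.3, random_state=0)
Xv_tr, Xv_te, yv_tr, yv_te = train_test_split(Xv, yv, stratify=yv, random_state=0)
estimators = [('lr', LogisticRegression()),
              ('dt', DecisionTreeClassifier(max_depth=3, random_state=0)),
              ('knn', KNeighborsClassifier(n_neighbors=5))]

# Hard voting: majority class wins; soft voting: average the predicted class probabilities
hard = VotingClassifier(estimators, voting='hard').fit(Xv_tr, yv_tr)
soft = VotingClassifier(estimators, voting='soft').fit(Xv_tr, yv_tr)
print("hard voting accuracy:", hard.score(Xv_te, yv_te))
print("soft voting accuracy:", soft.score(Xv_te, yv_te))
```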

Hide code cell source
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

# Toy data
X, y = make_moons(noise=.2, random_state=18) # carefully picked random state for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Plot grid
x_lin = np.linspace(X_train[:, 0].min() - .5, X_train[:, 0].max() + .5, 100)
y_lin = np.linspace(X_train[:, 1].min() - .5, X_train[:, 1].max() + .5, 100)
x_grid, y_grid = np.meshgrid(x_lin, y_lin)
X_grid = np.c_[x_grid.ravel(), y_grid.ravel()]
models = [LogisticRegression(C=100),
          DecisionTreeClassifier(max_depth=3, random_state=0),
          KNeighborsClassifier(n_neighbors=1),
          KNeighborsClassifier(n_neighbors=30)]

@interact
def combine_voters(model1=models, model2=models):
    # Voting Classifier and components
    voting = VotingClassifier([('model1', model1),('model2', model2)],voting='soft')
    voting.fit(X_train, y_train)

    # transform produces individual probabilities
    y_probs =  voting.transform(X_grid)

    fig, axes = plt.subplots(1, 3, subplot_kw={'xticks': (()), 'yticks': (())}, figsize=(11*fig_scale, 3*fig_scale))
    scores = [voting.estimators_[0].score(X_test, y_test),
             voting.estimators_[1].score(X_test, y_test),
             voting.score(X_test, y_test)]
    titles = [model1.__class__.__name__, model2.__class__.__name__, 'VotingClassifier']
    for prob, score, title, ax in zip([y_probs[:, 1], y_probs[:, 3], y_probs[:, 1::2].sum(axis=1)], scores, titles, axes.ravel()):
        ax.contourf(x_grid, y_grid, prob.reshape(x_grid.shape), alpha=.4, cmap='bwr')
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='bwr', s=7*fig_scale)
        ax.set_title(title + f" \n acc={score:.2f}", pad=0, fontsize=9)
Hide code cell source
if not interactive:
    combine_voters(models[0],models[1])
  • Why does this work?

    • Different models may be good at different ‘parts’ of data (even if they underfit)

    • Individual mistakes can be ‘averaged out’ (especially if models overfit)

  • Which models should be combined?

  • Bias-variance analysis teaches us that we have two options:

    • If model underfits (high bias, low variance): combine with other low-variance models

      • Need to be different: ‘experts’ on different parts of the data

      • Bias reduction. Can be done with Boosting

    • If model overfits (low bias, high variance): combine with other low-bias models

      • Need to be different: individual mistakes must be different

      • Variance reduction. Can be done with Bagging

  • Models must be uncorrelated but good enough (otherwise the ensemble is worse)

  • We can also learn how to combine the predictions of different models: Stacking

Decision trees (recap)#

  • Representation: Tree that splits data points into leaves based on tests

  • Evaluation (loss): Heuristic for purity of leaves (Gini index, entropy,…)

  • Optimization: Recursive, heuristic greedy search (Hunt’s algorithm)

    • Consider all splits (thresholds) between adjacent data points, for every feature

    • Choose the split that yields the purest leaves, then repeat
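As a quick refresher (using the iris dataset purely for illustration), a fitted tree's splits can be inspected directly with export_text:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree and print the learned splits (one threshold test per internal node)
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```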

Hide code cell source
import graphviz

@interact
def plot_depth(depth=(1,5,1)):
    X, y = make_moons(noise=.2, random_state=18) # carefully picked random state for illustration
    fig, ax = plt.subplots(1, 2, figsize=(12*fig_scale, 4*fig_scale),
                           subplot_kw={'xticks': (), 'yticks': ()})

    tree = mglearn.plots.plot_tree(X, y, max_depth=depth)
    ax[0].imshow(mglearn.plots.tree_image(tree))
    ax[0].set_axis_off()
Hide code cell source
if not interactive:
    plot_depth(depth=3)

Evaluation (loss function for classification)#

  • Every leaf predicts a class probability \(\hat{p}_c\) = the relative frequency of class \(c\)

  • Leaf impurity measures (splitting criteria) for \(L\) leaves, where leaf \(l\) has data \(X_l\):

    • Gini-Index: \(Gini(X_{l}) = \sum_{c\neq c'} \hat{p}_c \hat{p}_{c'}\)

    • Entropy (more expensive): \(E(X_{l}) = -\sum_{c} \hat{p}_c \log_{2}\hat{p}_c\)

    • The best split maximizes the information gain (likewise for the Gini index): $$Gain(X,X_i) = E(X) - \sum_{l=1}^L \frac{|X_{i=l}|}{|X|} E(X_{i=l})$$
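To make the gain formula concrete, here is a small, purely illustrative NumPy sketch that computes the entropy of a set of labels and the information gain of a candidate split:

```python
import numpy as np

def leaf_entropy(y):
    # Empirical class probabilities in this leaf
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y, left_mask):
    # Parent entropy minus the size-weighted entropy of the two child leaves
    n = len(y)
    children = (left_mask.sum() / n) * leaf_entropy(y[left_mask]) \
             + ((~left_mask).sum() / n) * leaf_entropy(y[~left_mask])
    return leaf_entropy(y) - children

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(information_gain(y, x < 3.5))  # perfect split: gain equals the parent entropy (~0.95)
```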

Hide code cell source
def gini(p):
   return (p)*(1 - (p)) + (1 - p)*(1 - (1-p))

def entropy(p):
   return - p*np.log2(p) - (1 - p)*np.log2((1 - p))

def classification_error(p):
   return 1 - np.max([p, 1 - p])

x = np.arange(0.0, 1.0, 0.01)
ent = [entropy(p) if p != 0 else None for p in x]
scaled_ent = [e*0.5 if e else None for e in ent]
c_err = [classification_error(i) for i in x]

fig = plt.figure(figsize=(5*fig_scale, 2.5*fig_scale))
ax = plt.subplot(111)

for j, lab, ls, c, in zip(
      [ent, scaled_ent, gini(x), c_err],
      ['Entropy', 'Entropy (scaled)', 'Gini Impurity', 'Misclassification Error'],
      ['-', '-', '--', '-.'],
      ['lightgray', 'red', 'green', 'blue']):
   line = ax.plot(x, j, label=lab, linestyle=ls, lw=2*fig_scale, color=c)

ax.legend(loc='upper left', ncol=1, fancybox=True, shadow=False)
ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

plt.ylim([0, 1.1])
plt.xlabel('p(j=1)',labelpad=0)
plt.ylabel('Impurity Index')
plt.show()

Regression trees#

  • Every leaf predicts the mean target value \(\mu\) of all points in that leaf

  • Choose the split that minimizes squared error of the leaves: \(\sum_{x_{i} \in L} (y_i - \mu)^2\)

  • Yields non-smooth step-wise predictions, cannot extrapolate

Hide code cell source
from sklearn.tree import DecisionTreeRegressor

def plot_decision_tree_regression(regr_1, regr_2):
    # Create a random dataset
    rng = np.random.RandomState(5)
    X = np.sort(7 * rng.rand(80, 1), axis=0)
    y = np.sin(X).ravel()
    y[::5] += 3 * (0.5 - rng.rand(16))
    split = 65

    # Fit regression model of first 60 points
    regr_1.fit(X[:split], y[:split])
    regr_2.fit(X[:split], y[:split])

    # Predict
    X_test = np.arange(0.0, 7.0, 0.01)[:, np.newaxis]
    y_1 = regr_1.predict(X_test)
    y_2 = regr_2.predict(X_test)

    # Plot the results
    plt.figure(figsize=(8*fig_scale,5*fig_scale))
    plt.scatter(X[:split], y[:split], c="darkorange", label="training data")
    plt.scatter(X[split:], y[split:], c="blue", label="test data")
    plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2*fig_scale)
    plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2*fig_scale)
    plt.xlabel("data", fontsize=9)
    plt.ylabel("target", fontsize=9)
    plt.title("Decision Tree Regression", fontsize=9)
    plt.legend()
    plt.show()

regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)

plot_decision_tree_regression(regr_1,regr_2)

Impurity/Entropy-based feature importance#

  • We can measure the importance of features (to the model) based on

    • Which features we split on

    • How high up in the tree we split on them (earlier splits are more important)

Hide code cell source
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
Xc_train, Xc_test, yc_train, yc_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0).fit(Xc_train, yc_train)

def plot_feature_importances_cancer(model):
    n_features = cancer.data.shape[1]
    plt.figure(figsize=(7*fig_scale,5.4*fig_scale))
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names, fontsize=7)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)
plot_feature_importances_cancer(tree)

Under- and overfitting#

  • We can easily control the (maximum) depth of the trees as a hyperparameter

  • Bias-variance analysis:

    • Shallow trees have high bias but very low variance (underfitting)

    • Deep trees have high variance but low bias (overfitting)

  • Because we can easily control their complexity, they are ideal for ensembling

    • Deep trees: keep low bias, reduce variance with Bagging

    • Shallow trees: keep low variance, reduce bias with Boosting

Hide code cell source
from sklearn.model_selection import ShuffleSplit, train_test_split

# Bias-Variance Computation 
def compute_bias_variance(clf, X, y):
    # Bootstraps
    n_repeat = 40 # 40 is on the low side to get a good estimate. 100 is better.
    shuffle_split = ShuffleSplit(test_size=0.33, n_splits=n_repeat, random_state=0)

    # Store sample predictions
    y_all_pred = [[] for _ in range(len(y))]

    # Train classifier on each bootstrap and score predictions
    for i, (train_index, test_index) in enumerate(shuffle_split.split(X)):
        # Train and predict
        clf.fit(X[train_index], y[train_index])
        y_pred = clf.predict(X[test_index])

        # Store predictions
        for j,index in enumerate(test_index):
            y_all_pred[index].append(y_pred[j])

    # Compute bias, variance, error
    bias_sq = sum([ (1 - x.count(y[i])/len(x))**2 * len(x)/n_repeat 
                for i,x in enumerate(y_all_pred)])
    var = sum([((1 - ((x.count(0)/len(x))**2 + (x.count(1)/len(x))**2))/2) * len(x)/n_repeat
               for i,x in enumerate(y_all_pred)])
    error = sum([ (1 - x.count(y[i])/len(x)) * len(x)/n_repeat 
            for i,x in enumerate(y_all_pred)])

    return np.sqrt(bias_sq), var, error

def plot_bias_variance(clf, X, y):
    bias_scores = []
    var_scores = []
    err_scores = []
    max_depth= range(2,11)

    for i in max_depth:
        b,v,e = compute_bias_variance(clf.set_params(random_state=0,max_depth=i),X,y)
        bias_scores.append(b)
        var_scores.append(v)
        err_scores.append(e)

    plt.figure(figsize=(8*fig_scale,3*fig_scale))
    plt.suptitle(clf.__class__.__name__, fontsize=9)
    plt.plot(max_depth, var_scores,label ="variance", lw=2*fig_scale )
    plt.plot(max_depth, np.square(bias_scores),label ="bias^2", lw=2*fig_scale )
    plt.plot(max_depth, err_scores,label ="error", lw=2*fig_scale)
    plt.xlabel("max_depth", fontsize=7)
    plt.legend(loc="best", fontsize=7)
    plt.show()

dt = DecisionTreeClassifier()
plot_bias_variance(dt, cancer.data, cancer.target)

Bagging (Bootstrap Aggregating)#

  • Obtain different models by training the same model on different training samples

    • Reduce overfitting by averaging out individual predictions (variance reduction)

  • In practice: take \(I\) bootstrap samples of your data, train a model on each bootstrap

    • Higher \(I\): more models, more smoothing (but slower training and prediction)

  • Base models should be unstable: different training samples yield different models

    • E.g. very deep decision trees, or even randomized decision trees

    • Deep Neural Networks can also benefit from bagging (deep ensembles)

  • Prediction by averaging predictions of base models

    • Soft voting for classification (possibly weighted)

    • Mean value for regression

  • Can produce uncertainty estimates as well

    • By combining class probabilities of individual models (or variances for regression)
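A minimal sketch of bagging deep trees with scikit-learn's BaggingClassifier (dataset and settings are illustrative); predict_proba averages the class probabilities of the individual trees, which also gives a rough uncertainty estimate:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xb, yb = load_breast_cancer(return_X_y=True)
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(Xb, yb, stratify=yb, random_state=0)

# 100 deep (unpruned, high-variance) trees, each trained on its own bootstrap sample
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        max_samples=1.0, bootstrap=True, n_jobs=-1,
                        random_state=0).fit(Xb_tr, yb_tr)

print("accuracy:", bag.score(Xb_te, yb_te))
print("averaged class probabilities (first test point):", bag.predict_proba(Xb_te[:1]))
```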

Random Forests#

  • Uses randomized trees to make models even less correlated (more unstable)

    • At every split, only consider max_features features, randomly selected

  • Extremely randomized trees: consider one random threshold for each of a random set of features (faster)

Hide code cell source
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

models=[RandomForestClassifier(n_estimators=5, random_state=7, n_jobs=-1),ExtraTreesClassifier(n_estimators=5, random_state=2, n_jobs=-1)]

@interact
def run_forest_run(model=models):
    forest = model.fit(X_train, y_train) 
    fig, axes = plt.subplots(2, 3, figsize=(12*fig_scale, 6*fig_scale))
    for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
        ax.set_title("Tree {}".format(i), pad=0, fontsize=9)
        mglearn.plots.plot_tree_partition(X_train, y_train, tree, ax=ax)

    mglearn.plots.plot_2d_separator(forest, X_train, fill=True, ax=axes[-1, -1],
                                    alpha=.4)
    axes[-1, -1].set_title(model.__class__.__name__, pad=0, fontsize=9)
    axes[-1, -1].set_xticks(())
    axes[-1, -1].set_yticks(())
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train, s=10*fig_scale);
Hide code cell source
if not interactive:
    run_forest_run(model=models[0])

Effect on bias and variance#

  • Increasing the number of models (trees) decreases variance (less overfitting)

  • Bias is mostly unaffected, but will increase if the forest becomes too large (oversmoothing)

Hide code cell source
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
cancer = load_breast_cancer()

# Faster version of plot_bias_variance that uses warm-starting
def plot_bias_variance_rf(model, X, y, warm_start=False):
    bias_scores = []
    var_scores = []
    err_scores = []

    # Bootstraps
    n_repeat = 40 # 40 is on the low side to get a good estimate. 100 is better.
    shuffle_split = ShuffleSplit(test_size=0.33, n_splits=n_repeat, random_state=0)
    
    # Ensemble sizes
    n_estimators = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

    # Store sample predictions. One per n_estimators
    # n_estimators : [predictions]
    predictions = {}
    for nr_trees in n_estimators:
        predictions[nr_trees] = [[] for _ in range(len(y))]

    # Train classifier on each bootstrap and score predictions
    for i, (train_index, test_index) in enumerate(shuffle_split.split(X)):
                
        # Initialize 
        clf = model(random_state=0)
        if model.__name__ == 'RandomForestClassifier':
            clf.n_jobs = -1
        if model.__name__ != 'AdaBoostClassifier':
            clf.warm_start = warm_start
        
        prev_n_estimators = 0
    
        # Train incrementally        
        for nr_trees in n_estimators:
            if model.__name__ == 'HistGradientBoostingClassifier':
                clf.max_iter = nr_trees
            else:
                clf.n_estimators = nr_trees
            
            # Fit and predict
            clf.fit(X[train_index], y[train_index])
            y_pred = clf.predict(X[test_index])
            for j,index in enumerate(test_index):
                predictions[nr_trees][index].append(y_pred[j])
    
    for nr_trees in n_estimators:
        # Compute bias, variance, error
        bias_sq = sum([ (1 - x.count(y[i])/len(x))**2 * len(x)/n_repeat 
                       for i,x in enumerate(predictions[nr_trees])])
        var = sum([((1 - ((x.count(0)/len(x))**2 + (x.count(1)/len(x))**2))/2) * len(x)/n_repeat 
                   for i,x in enumerate(predictions[nr_trees])])
        error = sum([ (1 - x.count(y[i])/len(x)) * len(x)/n_repeat
                     for i,x in enumerate(predictions[nr_trees])])

        bias_scores.append(bias_sq)
        var_scores.append(var)
        err_scores.append(error)

    plt.figure(figsize=(8*fig_scale,3*fig_scale))
    plt.suptitle(clf.__class__.__name__, fontsize=9)
    plt.plot(n_estimators, var_scores,label = "variance", lw=2*fig_scale )
    plt.plot(n_estimators, bias_scores,label = "bias^2", lw=2*fig_scale )
    plt.plot(n_estimators, err_scores,label = "error", lw=2*fig_scale  )
    plt.xscale('log',base=2)
    plt.xlabel("n_estimators", fontsize=9)
    plt.legend(loc="best", fontsize=7)
    plt.show()
plot_bias_variance_rf(RandomForestClassifier, cancer.data, cancer.target, warm_start=True)

In practice#

  • Different implementations can be used. E.g. in scikit-learn:

    • BaggingClassifier: Choose your own base model and sampling procedure

    • RandomForestClassifier: Default implementation, many options

    • ExtraTreesClassifier: Uses extremely randomized trees

  • Most important parameters:

    • n_estimators (>100, higher is better, but diminishing returns)

      • Will start to underfit (bias error component increases slightly)

    • max_features

      • Defaults: \(\sqrt{p}\) for classification, all \(p\) features for regression (scikit-learn)

      • Set smaller to reduce space/time requirements

    • parameters of trees, e.g. max_depth, min_samples_split,…

      • Prepruning useful to reduce model size, but don’t overdo it

  • Easy to parallelize (set n_jobs to -1)

  • Fix random_state (bootstrap samples) for reproducibility
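Putting these parameters together, a typical setup looks like the sketch below (dataset and values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Xr, yr = load_breast_cancer(return_X_y=True)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, stratify=yr, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,     # more trees help, with diminishing returns
    max_features='sqrt',  # random feature subset considered at every split
    max_depth=None,       # deep trees keep bias low; bagging reduces the variance
    n_jobs=-1,            # parallelize over all cores
    random_state=0,       # fix the bootstrap samples for reproducibility
).fit(Xr_tr, yr_tr)

print("test accuracy:", rf.score(Xr_te, yr_te))
```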

Out-of-bag error#

  • RandomForests don’t need cross-validation: you can use the out-of-bag (OOB) error

  • For each tree grown, about 37% of the samples are out-of-bag (OOB)

    • Remember which are OOB samples for every model, do voting over these

  • OOB error estimates are great to speed up model selection

    • As good as CV estimates, although slightly pessimistic

  • In scikit-learn: oob_error = 1 - clf.oob_score_

Hide code cell source
from collections import OrderedDict
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

RANDOM_STATE = 123

# Generate a binary classification dataset.
X, y = make_classification(n_samples=500, n_features=25,
                           n_clusters_per_class=1, n_informative=15,
                           random_state=RANDOM_STATE)

# NOTE: Setting the `warm_start` construction parameter to `True` disables
# support for parallelized ensembles but is necessary for tracking the OOB
# error trajectory during training.
ensemble_clfs = [
    ("RandomForestClassifier, max_features='sqrt'",
        RandomForestClassifier(warm_start=True, oob_score=True,
                               max_features="sqrt", n_jobs=-1,
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features='log2'",
        RandomForestClassifier(warm_start=True, max_features='log2',
                               oob_score=True, n_jobs=-1,
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features=None",
        RandomForestClassifier(warm_start=True, max_features=None,
                               oob_score=True, n_jobs=-1,
                               random_state=RANDOM_STATE))
]

# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

# Range of `n_estimators` values to explore.
min_estimators = 15
max_estimators = 175

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(X, y)

        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

# Generate the "OOB error rate" vs. "n_estimators" plot.
plt.figure(figsize=(8*fig_scale,3*fig_scale))
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label, lw=2*fig_scale)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators", fontsize=7)
plt.ylabel("OOB error rate", fontsize=7)
plt.legend(loc="upper right", fontsize=7)
plt.show()

Feature importance#

  • RandomForests provide more reliable feature importances, based on many alternative hypotheses (trees)

Hide code cell source
forest = RandomForestClassifier(random_state=0, n_estimators=512, n_jobs=-1)
forest.fit(Xc_train, yc_train)
plot_feature_importances_cancer(forest)

Other tips#

  • Model calibration

    • RandomForests are poorly calibrated.

    • Calibrate afterwards (e.g. isotonic regression) if you aim to use probabilities

  • Warm starting

    • Given an ensemble trained for \(I\) iterations, you can simply add more models later

    • You warm start from the existing model instead of re-starting from scratch

    • Can be useful to train models on new, closely related data

      • Not ideal if the data batches change over time (concept drift)

      • Boosting is more robust against this (see later)
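Both tips translate to a few lines of scikit-learn. The sketch below (illustrative dataset and settings) calibrates a forest's probabilities with isotonic regression and then grows an already trained forest via warm starting:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Xw, yw = load_breast_cancer(return_X_y=True)
Xw_tr, Xw_te, yw_tr, yw_te = train_test_split(Xw, yw, stratify=yw, random_state=0)

# Calibrate the forest's probabilities with isotonic regression (internal cross-validation)
calibrated = CalibratedClassifierCV(RandomForestClassifier(n_estimators=100, random_state=0),
                                    method='isotonic', cv=5).fit(Xw_tr, yw_tr)
print("calibrated probabilities:", calibrated.predict_proba(Xw_te[:2]))

# Warm starting: add 100 more trees to an existing forest without retraining the first 100
rf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0).fit(Xw_tr, yw_tr)
rf.set_params(n_estimators=200)
rf.fit(Xw_tr, yw_tr)  # only the 100 new trees are trained
print("forest now has", len(rf.estimators_), "trees")
```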

Strengths and weaknesses#

  • RandomForests are among the most widely used algorithms:

    • Don’t require a lot of tuning

    • Typically very accurate

    • Handle heterogeneous features well (trees)

    • Implicitly select the most relevant features

  • Downsides:

    • Less interpretable, slower to train (but parallelizable)

    • Don’t work well on high-dimensional sparse data (e.g. text)

Adaptive Boosting (AdaBoost)#

  • Obtain different models by reweighting the training data every iteration

    • Reduce underfitting by focusing on the ‘hard’ training examples

  • Increase weights of instances misclassified by the ensemble, and vice versa

  • Base models should be simple so that different instance weights lead to different models

    • Underfitting models: decision stumps (or very shallow trees)

    • Each is an ‘expert’ on some parts of the data

  • Additive model: Predictions at iteration \(I\) are sum of base model predictions

    • In AdaBoost, each model \(g_i\) also gets its own weight \(w_i\): $$f_I(\mathbf{x}) = \sum_{i=1}^I w_i g_i(\mathbf{x})$$

  • AdaBoost minimizes the exponential loss. For instance-weighted error \(\varepsilon\): $$\mathcal{L}_{Exp} = \sum_{n=1}^N e^{\varepsilon(f_I(\mathbf{x}))}$$

  • By setting \(\frac{\partial \mathcal{L}}{\partial w_i} = 0\) you can find that the optimal \(w_{i} = \frac{1}{2}\log(\frac{1-\varepsilon}{\varepsilon})\)

AdaBoost algorithm#

  • Initialize sample weights: \(s_{n,0} = \frac{1}{N}\)

  • Build a model (e.g. decision stumps) using these sample weights

  • Give the model a weight \(w_i\) related to its weighted error rate \(\varepsilon\): $$w_{i} = \lambda\log\left(\frac{1-\varepsilon}{\varepsilon}\right)$$

    • Good trees get more weight than bad trees

    • The logit function maps the error \(\varepsilon\) from \([0,1]\) to a weight in \((-\infty,\infty)\) (use a small minimum error to avoid division by zero)

    • Learning rate \(\lambda\) (shrinkage) decreases impact of individual classifiers

      • Small updates are often better but require more iterations

  • Update the sample weights

    • Increase weight of incorrectly predicted samples: \(s_{n,i+1} = s_{n,i}e^{w_i}\)

    • Decrease weight of correctly predicted samples: \(s_{n,i+1} = s_{n,i}e^{-w_i}\)

    • Normalize weights to add up to 1

  • Repeat for \(I\) iterations
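To make these formulas concrete, the snippet below (purely illustrative numbers) prints the model weight and the resulting sample-weight update factors for a few weighted error rates:

```python
import numpy as np

learning_rate = 0.5  # the shrinkage lambda
for eps in [0.05, 0.25, 0.45]:
    w = learning_rate * np.log((1 - eps) / eps)   # model weight: good models get larger weights
    up, down = np.exp(w), np.exp(-w)              # update factors for wrong / correct samples
    print(f"error={eps:.2f}: model weight={w:.2f}, "
          f"misclassified weights x{up:.2f}, correct weights x{down:.2f}")
```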

AdaBoost variants#

  • Discrete AdaBoost: \(\varepsilon\) is simply the misclassification rate (1 − accuracy)

  • Real AdaBoost: \(\varepsilon\) is based on predicted class probabilities \(\hat{p}_c\) (better)

  • AdaBoost for regression: \(\varepsilon\) is either linear (\(|y_i-\hat{y}_i|\)), squared (\((y_i-\hat{y}_i)^2\)), or exponential loss

  • GentleBoost: adds a bound on model weights \(w_i\)

  • LogitBoost: minimizes logistic loss instead of exponential loss: $$\mathcal{L}_{Logistic} = \sum_{n=1}^N \log(1+e^{\varepsilon(f_I(\mathbf{x}))})$$

Adaboost in action#

  • Point size represents the sample weight

  • Background shows the latest tree’s predictions

Hide code cell source
from matplotlib.colors import ListedColormap
from sklearn.tree import DecisionTreeClassifier
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from sklearn.preprocessing import normalize

# Code adapted from https://xavierbourretsicotte.github.io/AdaBoost.html
def AdaBoost_scratch(X,y, M=10, learning_rate = 0.5):
    #Initialization of utility variables
    N = len(y)
    estimator_list, y_predict_list, estimator_error_list, estimator_weight_list, sample_weight_list = [],[],[],[],[]

    #Initialize the sample weights
    sample_weight = np.ones(N) / N
    sample_weight_list.append(sample_weight.copy())

    #For m = 1 to M
    for m in range(M):   

        #Fit a classifier
        estimator = DecisionTreeClassifier(max_depth = 1, max_leaf_nodes=2)
        estimator.fit(X, y, sample_weight=sample_weight)
        y_predict = estimator.predict(X)

        #Misclassifications
        incorrect = (y_predict != y)

        #Estimator error
        estimator_error = np.mean( np.average(incorrect, weights=sample_weight, axis=0))
        
        #Boost estimator weights
        estimator_weight =  learning_rate * np.log((1. - estimator_error) / estimator_error)

        #Boost sample weights
        sample_weight *= np.exp(estimator_weight * incorrect * ((sample_weight > 0) | (estimator_weight < 0)))
        sample_weight *= np.exp(-estimator_weight * np.invert(incorrect * ((sample_weight > 0) | (estimator_weight < 0))))
        sample_weight /= np.linalg.norm(sample_weight)
        
        #Save iteration values
        estimator_list.append(estimator)
        y_predict_list.append(y_predict.copy())
        estimator_error_list.append(estimator_error.copy())
        estimator_weight_list.append(estimator_weight.copy())
        sample_weight_list.append(sample_weight.copy())
        
    #Convert to np array for convenience   
    estimator_list = np.asarray(estimator_list)
    y_predict_list = np.asarray(y_predict_list)
    estimator_error_list = np.asarray(estimator_error_list)
    estimator_weight_list = np.asarray(estimator_weight_list)
    sample_weight_list = np.asarray(sample_weight_list)

    #Predictions
    preds = (np.array([np.sign((y_predict_list[:,point] * estimator_weight_list).sum()) for point in range(N)]))
    #print('Accuracy = ', (preds == y).sum() / N) 
    
    return estimator_list, estimator_weight_list, sample_weight_list, estimator_error_list

def plot_decision_boundary(classifier, X, y, N = 10, scatter_weights = np.ones(len(y)) , ax = None, title=None ):
    '''Utility function to plot decision boundary and scatter plot of data'''
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1

    # Get current axis and plot
    if ax is None:
        ax = plt.gca()
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax.scatter(X[:,0],X[:,1], c = y, cmap = cm_bright, s = scatter_weights * 1000, edgecolors='none')
    ax.set_xticks(())
    ax.set_yticks(())
    if title:
        ax.set_title(title, pad=1)
    
    # Plot classifier background
    if classifier is not None:
        xx, yy = np.meshgrid( np.linspace(x_min, x_max, N), np.linspace(y_min, y_max, N))
        
        #Check what methods are available
        if hasattr(classifier, "decision_function"):
            zz = np.array( [classifier.decision_function(np.array([xi,yi]).reshape(1,-1)) for  xi, yi in zip(np.ravel(xx), np.ravel(yy)) ] )
        elif hasattr(classifier, "predict_proba"):
            zz = np.array( [classifier.predict_proba(np.array([xi,yi]).reshape(1,-1))[:,1] for  xi, yi in zip(np.ravel(xx), np.ravel(yy)) ] )
        else:
            zz = np.array( [classifier(np.array([xi,yi]).reshape(1,-1)) for  xi, yi in zip(np.ravel(xx), np.ravel(yy)) ] )

        # reshape result and plot
        Z = zz.reshape(xx.shape)
    
        ax.contourf(xx, yy, Z, 2, cmap='RdBu', alpha=.5, levels=[0,0.5,1])
        #ax.contour(xx, yy, Z, 2, cmap='RdBu', levels=[0,0.5,1])


from sklearn.datasets import make_circles
Xa, ya = make_circles(n_samples=400, noise=0.15, factor=0.5, random_state=1)
cm_bright = ListedColormap(['#FF0000', '#0000FF'])

estimator_list, estimator_weight_list, sample_weight_list, estimator_error_list = AdaBoost_scratch(Xa, ya, M=60, learning_rate = 0.5)
current_ax = None
weight_scale = 1

@interact
def plot_adaboost(iteration=(0,60,1)):
    if iteration == 0:
        s_weights = (sample_weight_list[0,:] / sample_weight_list[0,:].sum() ) * weight_scale
        plot_decision_boundary(None, Xa, ya, N = 20, scatter_weights =s_weights)
    else:
        s_weights = (sample_weight_list[iteration,:] / sample_weight_list[iteration,:].sum() ) * weight_scale
        title = "Base model {}, error: {:.2f}, weight: {:.2f}".format(
            iteration,estimator_error_list[iteration-1],estimator_weight_list[iteration-1])
        plot_decision_boundary(estimator_list[iteration-1], Xa, ya, N = 20, scatter_weights =s_weights, ax=current_ax, title=title )
        
Hide code cell source
if not interactive:
    fig, axes = plt.subplots(2, 2, subplot_kw={'xticks': (()), 'yticks': (())}, figsize=(11*fig_scale, 6*fig_scale))
    weight_scale = 0.5
    for iteration, ax in zip([1, 5, 37, 59],axes.flatten()):
        current_ax = ax
        plot_adaboost(iteration)

Examples#

Hide code cell source
from sklearn.ensemble import AdaBoostClassifier
names = ["AdaBoost 1 tree", "AdaBoost 3 trees", "AdaBoost 100 trees"]

classifiers = [
    AdaBoostClassifier(n_estimators=1, random_state=0, learning_rate=0.5),
    AdaBoostClassifier(n_estimators=3, random_state=0, learning_rate=0.5),
    AdaBoostClassifier(n_estimators=100, random_state=0, learning_rate=0.5)
    ]

mglearn.plots.plot_classifiers(names, classifiers, figuresize=(6*fig_scale,3*fig_scale))  

Bias-Variance analysis#

  • AdaBoost reduces bias (and a little variance)

    • Boosting is a bias reduction technique

  • Boosting too much will eventually increase variance

Hide code cell source
plot_bias_variance_rf(AdaBoostClassifier, cancer.data, cancer.target, warm_start=False)

Gradient Boosting#

  • Ensemble of models, each fixing the remaining mistakes of the previous ones

    • Each iteration, the task is to predict the residual error of the ensemble

  • Additive model: Predictions at iteration \(I\) are sum of base model predictions

    • Learning rate (or shrinkage) \(\eta\): small updates work better (reduces variance): $$f_I(\mathbf{x}) = g_0(\mathbf{x}) + \sum_{i=1}^I \eta \cdot g_i(\mathbf{x}) = f_{I-1}(\mathbf{x}) + \eta \cdot g_I(\mathbf{x})$$

  • The pseudo-residuals \(r_i\) are computed according to a differentiable loss function

    • E.g. least squares loss for regression and log loss for classification

    • Gradient descent: predictions get updated step by step until convergence: $$g_i(\mathbf{x}) \approx r_{i} = - \frac{\partial \mathcal{L}(y_i,f_{i-1}(x_i))}{\partial f_{i-1}(x_i)}$$

  • Base models \(g_i\) should be low variance, but flexible enough to predict residuals accurately

    • E.g. decision trees of depth 2-5

Gradient Boosting Trees (Regression)#

  • Base models are regression trees, loss function is square loss: \(\mathcal{L} = \frac{1}{2}(y_i - \hat{y}_i)^2\)

  • The pseudo-residuals are simply the prediction errors for every sample: $$r_i = -\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = -2 \cdot \frac{1}{2}(y_i - \hat{y}_i) \cdot (-1) = y_i - \hat{y}_i$$

  • Initial model \(g_0\) simply predicts the mean of \(y\)

  • For iteration \(m=1..M\):

    • For all samples i=1..n, compute pseudo-residuals \(r_i = y_i - \hat{y}_i\)

    • Fit a new regression tree model \(g_m(\mathbf{x})\) to \(r_{i}\)

      • In \(g_m(\mathbf{x})\), each leaf predicts the mean of all its values

    • Update ensemble predictions \(\hat{y} = g_0(\mathbf{x}) + \sum_{m=1}^M \eta \cdot g_m(\mathbf{x})\)

  • Early stopping (optional): stop when performance on a validation set does not improve for a given number of iterations
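These steps fit in a few lines. Below is a minimal from-scratch sketch on toy data (illustrative settings; real implementations add early stopping, subsampling, etc.):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1D regression data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = X.ravel() ** 3 + rng.normal(scale=2, size=100)

eta, n_iter = 0.3, 50
pred = np.full(len(y), y.mean())        # g_0: predict the mean of y
trees = []
for m in range(n_iter):
    residuals = y - pred                # pseudo-residuals (prediction errors)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    pred += eta * tree.predict(X)       # additive update with shrinkage

print(f"training MSE after {n_iter} trees: {np.mean((y - pred) ** 2):.3f}")
```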

Gradient Boosting Regression in action#

  • Residuals quickly drop to (near) zero

Hide code cell source
# Example adapted from Andreas Mueller
from sklearn.ensemble import GradientBoostingRegressor

# Make some toy data
def make_poly(n_samples=100):
    rnd = np.random.RandomState(42)
    x = rnd.uniform(-3, 3, size=n_samples)
    y_no_noise = (x) ** 3
    y = (y_no_noise + rnd.normal(scale=3, size=len(x))) / 2
    return x.reshape(-1, 1), y
Xp, yp = make_poly()

# Train gradient booster and get predictions
Xp_train, Xp_test, yp_train, yp_test = train_test_split(Xp, yp, random_state=0)
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=61, learning_rate=.3, random_state=0).fit(Xp_train, yp_train)
gbrt.score(Xp_test, yp_test)

line = np.linspace(Xp.min(), Xp.max(), 1000)
preds = list(gbrt.staged_predict(line[:, np.newaxis]))
preds_train = [np.zeros(len(yp_train))] + list(gbrt.staged_predict(Xp_train))

# Plot
def plot_gradient_boosting_step(step, axes):
    axes[0].plot(Xp_train[:, 0], yp_train - preds_train[step], 'o', alpha=0.5, markersize=10*fig_scale)
    axes[0].plot(line, gbrt.estimators_[step, 0].predict(line[:, np.newaxis]), linestyle='-', lw=3*fig_scale)
    axes[0].plot(line, [0]*len(line), c='k', linestyle='-', lw=1*fig_scale)
    axes[1].plot(Xp_train[:, 0], yp_train, 'o',  alpha=0.5, markersize=10*fig_scale)
    axes[1].plot(line, preds[step], linestyle='-', lw=3*fig_scale)
    axes[1].vlines(Xp_train[:, 0], yp_train, preds_train[step+1])

    axes[0].set_title("Residual prediction step {}".format(step + 1), fontsize=9)
    axes[1].set_title("Total prediction step {}".format(step + 1), fontsize=9)
    axes[0].set_ylim(yp.min(), yp.max())
    axes[1].set_ylim(yp.min(), yp.max())
    plt.tight_layout();

@interact
def plot_gradient_boosting(step = (0, 60, 1)):
    fig, axes = plt.subplots(1, 2, subplot_kw={'xticks': (()), 'yticks': (())}, figsize=(10*fig_scale, 4*fig_scale))
    plot_gradient_boosting_step(step, axes)
Hide code cell source
if not interactive:
    fig, all_axes = plt.subplots(3, 2, subplot_kw={'xticks': (()), 'yticks': (())}, figsize=(10*fig_scale, 5*fig_scale))
    for i, s in enumerate([0,3,9]):
        axes = all_axes[i,:]
        plot_gradient_boosting_step(s, axes)

GradientBoosting Algorithm (Classification)#

  • Base models are regression trees, predict probability of positive class \(p\)

    • For multi-class problems, train one tree per class

  • Use (binary) log loss, with true class \(y_i \in \{0,1\}\): \(\mathcal{L}_{log} = - \sum_{i=1}^{N} \big[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \big]\)

  • The pseudo-residuals are simply the difference between the true class and the predicted probability (with \(\hat{y}_i\) the predicted log-odds): $$r_i = -\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = y_i - p_i$$

  • Initial model \(g_0\) predicts the log-odds \(\log(\frac{\#positives}{\#negatives})\)

  • For iteration \(m=1..M\):

    • For all samples i=1..n, compute pseudo-residuals \(r_i = y_i - p_i\)

    • Fit a new regression tree model \(g_m(\mathbf{x})\) to \(r_{i}\)

      • In \(g_m(\mathbf{x})\), each leaf predicts \(\frac{\sum_{i} r_i}{\sum_{i} p_i(1-p_i)}\)

    • Update ensemble predictions \(\hat{y} = g_0(\mathbf{x}) + \sum_{m=1}^M \eta \cdot g_m(\mathbf{x})\)

  • Early stopping (optional): stop when performance on a validation set does not improve for a given number of iterations
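A minimal from-scratch sketch of these steps on toy data follows. Note one simplification: a plain regression tree predicts the mean residual per leaf rather than the exact leaf value \(\frac{\sum_i r_i}{\sum_i p_i(1-p_i)}\), which is close enough to illustrate the idea:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeRegressor

X, y = make_moons(noise=0.2, random_state=0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

eta, n_iter = 0.3, 100
f = np.full(len(y), np.log(y.mean() / (1 - y.mean())))   # g_0: initial log-odds
for m in range(n_iter):
    p = sigmoid(f)
    residuals = y - p                                     # pseudo-residuals
    # Regression tree fit to the residuals (mean residual per leaf, see note above)
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    f += eta * tree.predict(X)                            # additive update with shrinkage

print("training accuracy:", ((sigmoid(f) > 0.5) == y).mean())
```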

Gradient Boosting Classification in action#

  • Point size represents the residual magnitude: most residuals quickly drop to (near) zero

Hide code cell source
from sklearn.ensemble import GradientBoostingClassifier

Xa_train, Xa_test, ya_train, ya_test = train_test_split(Xa, ya, random_state=0)
gbct = GradientBoostingClassifier(max_depth=2, n_estimators=60, learning_rate=.3, random_state=0).fit(Xa_train, ya_train)
gbct.score(Xa_test, ya_test)
preds_train_cl = [np.zeros(len(ya_train))] + list(gbct.staged_predict_proba(Xa_train))
current_gb_ax = None
weight_scale = 1

def plot_gb_decision_boundary(gbmodel, step, X, y, N = 10, scatter_weights = np.ones(len(y)) , ax = None, title = None ):
    '''Utility function to plot decision boundary and scatter plot of data'''
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1

    # Get current axis and plot
    if ax is None:
        ax = plt.gca()
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax.scatter(X[:,0],X[:,1], c = y, cmap = cm_bright, s = scatter_weights * 40, edgecolors='none')
    ax.set_xticks(())
    ax.set_yticks(())
    if title:
        ax.set_title(title, pad='0.5')
    
    # Plot classifier background
    if gbmodel is not None:
        xx, yy = np.meshgrid( np.linspace(x_min, x_max, N), np.linspace(y_min, y_max, N))
        zz = np.array( [list(gbmodel.staged_predict_proba(np.array([xi,yi]).reshape(1,-1)))[step][:,1] for  xi, yi in zip(np.ravel(xx), np.ravel(yy)) ] )
        Z = zz.reshape(xx.shape)
        ax.contourf(xx, yy, Z, 2, cmap='RdBu', alpha=.5, levels=[0,0.5,1])


@interact
def plot_gboost(iteration=(1,60,1)):
    pseudo_residuals = np.abs(ya_train - preds_train_cl[iteration][:,1])
    title = "Base model {}, error: {:.2f}".format(iteration,np.sum(pseudo_residuals))
    plot_gb_decision_boundary(gbct, (iteration-1), Xa_train, ya_train, N = 20, scatter_weights =pseudo_residuals * weight_scale, ax=current_gb_ax, title=title ) 
        
Hide code cell source
if not interactive:
    fig, axes = plt.subplots(2, 2, subplot_kw={'xticks': (()), 'yticks': (())}, figsize=(10*fig_scale, 6*fig_scale))
    weight_scale = 0.3
    for iteration, ax in zip([1, 5, 17, 59],axes.flatten()):
        current_gb_ax = ax
        plot_gboost(iteration)

Examples#

Hide code cell source
names = ["GradientBoosting 1 tree", "GradientBoosting 3 trees", "GradientBoosting 100 trees"]

classifiers = [
    GradientBoostingClassifier(n_estimators=1, random_state=0, learning_rate=0.5),
    GradientBoostingClassifier(n_estimators=3, random_state=0, learning_rate=0.5),
    GradientBoostingClassifier(n_estimators=100, random_state=0, learning_rate=0.5)
    ]

mglearn.plots.plot_classifiers(names, classifiers, figuresize=(6*fig_scale,3*fig_scale))  

Bias-variance analysis#

  • Gradient Boosting is very effective at reducing bias error

  • Boosting too much will eventually increase variance

Hide code cell source
# Note: I tried if HistGradientBoostingClassifier is faster. It's not.
# We're training many small models here and the thread spawning likely causes too much overhead
plot_bias_variance_rf(GradientBoostingClassifier, cancer.data, cancer.target, warm_start=True)

Feature importance#

  • Gradient Boosting also provides feature importances, based on many trees

  • Compared to RandomForests, the trees are smaller, hence more features have zero importance

Hide code cell source
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(Xc_train, yc_train)
plot_feature_importances_cancer(gbrt)

Gradient Boosting: strengths and weaknesses#

  • Among the most powerful and widely used models

  • Work well on heterogeneous features and different scales

  • Typically better than random forests, but require more tuning and longer training

  • Do not work well on high-dimensional sparse data

Main hyperparameters:

  • n_estimators: Higher is better, but will start to overfit

  • learning_rate: Lower rates mean more trees are needed to get more complex models

    • Set n_estimators as high as possible, then tune learning_rate

    • Or, choose a learning_rate and use early stopping to avoid overfitting

  • max_depth: typically kept low (<5), reduce when overfitting

  • max_features: can also be tuned, similar to random forests

  • n_iter_no_change (early stopping): training stops if the improvement is smaller than the tolerance tol for n_iter_no_change consecutive iterations
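For example, with a generous tree budget and early stopping (illustrative dataset and values):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Xg, yg = load_breast_cancer(return_X_y=True)
Xg_tr, Xg_te, yg_tr, yg_te = train_test_split(Xg, yg, stratify=yg, random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=1000,         # upper bound; early stopping decides how many are actually used
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.1,   # internal validation set for early stopping
    n_iter_no_change=10,       # stop after 10 iterations without a tol improvement
    tol=1e-4,
    random_state=0,
).fit(Xg_tr, yg_tr)

print("trees actually trained:", gbc.n_estimators_)
print("test accuracy:", gbc.score(Xg_te, yg_te))
```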

Extreme Gradient Boosting (XGBoost)#

  • Faster version of gradient boosting: allows more iterations on larger datasets

  • Normal regression trees: split to minimize squared loss of leaf predictions

    • XGBoost trees only fit residuals: split so that residuals in leaf are more similar

  • Don’t evaluate every split point, only \(q\) quantiles per feature (binning)

    • \(q\) is a hyperparameter (sketch_eps, default 0.03)

  • For large datasets, XGBoost uses approximate quantiles

    • Can be parallelized (multicore) by chunking the data and combining histograms of data

    • For classification, the quantiles are weighted by \(p(1-p)\)

  • Gradient descent sped up by using the second derivative of the loss function

  • Strong regularization by pre-pruning the trees

  • Columns and rows are randomly subsampled when computing splits

  • Support for out-of-core computation (data compression in RAM, sharding,…)

XGBoost in practice#

  • Not part of scikit-learn, but HistGradientBoostingClassifier is similar

    • binning, multicore,…

  • The xgboost python package is sklearn-compatible

    • Install separately, conda install -c conda-forge xgboost

    • Allows learning curve plotting and warm-starting

  • Further reading:
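A short usage sketch of both options mentioned above (illustrative settings; the xgboost lines assume the package is installed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

Xh, yh = load_breast_cancer(return_X_y=True)
Xh_tr, Xh_te, yh_tr, yh_te = train_test_split(Xh, yh, stratify=yh, random_state=0)

# scikit-learn's histogram-based (binned) gradient boosting, with early stopping
hgb = HistGradientBoostingClassifier(max_iter=500, learning_rate=0.1,
                                     early_stopping=True, random_state=0).fit(Xh_tr, yh_tr)
print("HistGradientBoosting accuracy:", hgb.score(Xh_te, yh_te))

# The xgboost package offers a scikit-learn compatible wrapper
from xgboost import XGBClassifier
xgb = XGBClassifier(n_estimators=500, learning_rate=0.1, max_depth=3,
                    subsample=0.8, colsample_bytree=0.8,
                    tree_method='hist', random_state=0).fit(Xh_tr, yh_tr)
print("XGBoost accuracy:", xgb.score(Xh_te, yh_te))
```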

LightGBM#

Another fast boosting technique

  • Uses gradient-based sampling

    • use all instances with large gradients/residuals (e.g. 10% largest)

    • randomly sample instances with small gradients, ignore the rest

    • intuition: samples with small gradients are already modeled well

    • requires adapted information gain criterion

  • Does smarter encoding of categorical features

CatBoost#

Another fast boosting technique

  • Optimized for categorical variables

    • Uses bagged and smoothed version of target encoding

  • Uses symmetric trees: same split for all nodes on a given level

    • Can be much faster

  • Allows monotonicity constraints for numeric features

    • Model must be a non-decreasing function of these features

  • Lots of tooling (e.g. GPU training)
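If the lightgbm and catboost packages are installed, basic usage looks roughly like the sketch below (illustrative settings on a purely numeric dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

Xc, yc = load_breast_cancer(return_X_y=True)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, stratify=yc, random_state=0)

# LightGBM: gradient-based sampling and histogram binning happen under the hood
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(n_estimators=500, learning_rate=0.1, random_state=0).fit(Xc_tr, yc_tr)
print("LightGBM accuracy:", lgbm.score(Xc_te, yc_te))

# CatBoost: categorical feature indices would be passed via cat_features (none here)
from catboost import CatBoostClassifier
cat = CatBoostClassifier(iterations=500, learning_rate=0.1, depth=6,
                         verbose=0, random_seed=0).fit(Xc_tr, yc_tr)
print("CatBoost accuracy:", cat.score(Xc_te, yc_te))
```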

Stacking#

  • Choose \(M\) different base-models, generate predictions

  • Stacker (meta-model) learns mapping between predictions and correct label

    • Can also be repeated: multi-level stacking

    • Popular stackers: linear models (fast) and gradient boosting (accurate)

  • Cascade stacking: adds base-model predictions as extra features

  • Models need to be sufficiently different, be experts at different parts of the data

  • Can be very accurate, but also very slow to predict
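In scikit-learn this corresponds to StackingClassifier; setting passthrough=True gives cascade stacking by also feeding the original features to the meta-model (base models and settings below are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Xs, ys = load_breast_cancer(return_X_y=True)
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(Xs, ys, stratify=ys, random_state=0)

base_models = [('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
               ('gb', GradientBoostingClassifier(random_state=0)),
               ('knn', KNeighborsClassifier())]

# The meta-model is trained on out-of-fold predictions of the base models (internal CV)
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           passthrough=True,    # cascade stacking: also pass the original features
                           cv=5).fit(Xs_tr, ys_tr)
print("stacking accuracy:", stack.score(Xs_te, ys_te))
```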


Other ensembling techniques#

  • Hyper-ensembles: same basic model but with different hyperparameter settings

    • Can combine overfitted and underfitted models

  • Deep ensembles: ensembles of deep learning models

  • Bayes optimal classifier: ensemble of all possible models (largely theoretic)

  • Bayesian model averaging: weighted average of probabilistic models, weighted by their posterior probabilities

  • Cross-validation selection: does internal cross-validation to select best of \(M\) models

  • Any combination of different ensembling techniques

Hide code cell source
%%HTML
<style>
td {font-size: 20px}
th {font-size: 20px}
.rendered_html table, .rendered_html td, .rendered_html th {
    font-size: 20px;
}
</style>

Algorithm overview#

| Name | Representation | Loss function | Optimization | Regularization |
|---|---|---|---|---|
| Classification trees | Decision tree | Entropy / Gini index | Hunt’s algorithm | Tree depth,… |
| Regression trees | Decision tree | Square loss | Hunt’s algorithm | Tree depth,… |
| RandomForest | Ensemble of randomized trees | Entropy / Gini / Square | (Bagging) | Number/depth of trees,… |
| AdaBoost | Ensemble of stumps | Exponential loss | Greedy search | Number/depth of trees,… |
| GradientBoostingRegression | Ensemble of regression trees | Square loss | Gradient descent | Number/depth of trees,… |
| GradientBoostingClassification | Ensemble of regression trees | Log loss | Gradient descent | Number/depth of trees,… |
| XGBoost, LightGBM, CatBoost | Ensemble of XGBoost trees | Square/log loss | 2nd order gradients | Number/depth of trees,… |
| Stacking | Ensemble of heterogeneous models | / | / | Number of models,… |

Summary#

  • Ensembles of voting classifiers improve performance

    • Which models to choose? Consider bias-variance tradeoffs!

  • Bagging / RandomForest is a variance-reduction technique

    • Build many high-variance (overfitting) models on random data samples

      • The more different the models, the better

    • Aggregation (soft voting) over many models reduces variance

      • Diminishing returns, over-smoothing may increase bias error

    • Parallelizes easily, doesn’t require much tuning

  • Boosting is a bias-reduction technique

    • Build low-variance models that correct each other’s mistakes

      • By reweighting misclassified samples: AdaBoost

      • By predicting the residual error: Gradient Boosting

    • Additive models: predictions are sum of base-model predictions

      • Can drive the error to zero, but risk overfitting

    • Doesn’t parallelize easily. Slower to train, much faster to predict.

      • XGBoost, LightGBM,… are fast and offer some parallelization

  • Stacking: learn how to combine base-model predictions

    • Base-models still have to be sufficiently different