Recap: Decision Trees

Recap: Decision Trees#

Representation: Assume we can represent the concept we want to learn with a decision tree
- Repeatedly split the data based on one feature at a time
- Note: Oblique trees can split on combinations of features
Evaluation (loss): One tree is better that another tree according to some heuristic
- Classification: Instances in a leaf are mostly of the same class (pure leafs)
- Regression: Instances in a leaf have values close to each other
Optimization: Recursive, heuristic greedy search (Hunt’s algorithm)
- Make first split based on the heuristic
- In each branch, repeat splitting in the same way

# Auto-setup when running on Google Colab
import os
if 'google.colab' in str(get_ipython()) and not os.path.exists('/content/master'):
    !git clone -q https://github.com/ML-course/master.git /content/master
    !pip install -rq master/requirements_colab.txt
    %cd master/notebooks

# Global imports and settings
%matplotlib inline
from preamble import *
interactive = True # Set to True for interactive plots
if interactive:
    fig_scale = 1.5

Installation note#

To plot the decision trees in this notebook, you’ll need to install the graphviz executable:

OS X: use homebrew: brew install graphviz
Ubuntu/debian: use apt-get: apt-get install graphviz.
Installing graphviz on Windows can be tricky and using conda / anaconda is recommended.

And the python package: pip install graphviz

Decision Tree classification#

Where would you make the first split?

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
plt.figure(figsize=(12*fig_scale,4*fig_scale))
ax = plt.gca()
mglearn.tools.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
ax.set_xticks(())
ax.set_yticks(());

../_images/79abb5c953e78869e70e1441f5d2ecde9d22930d86cd3ed3c1d1f605c8604aa1.png

import graphviz
def plot_depth(depth):
    fig, ax = plt.subplots(1, 2, figsize=(12*fig_scale, 4*fig_scale),
                           subplot_kw={'xticks': (), 'yticks': ()})

    tree = mglearn.plots.plot_tree(X, y, max_depth=depth)
    ax[0].imshow(mglearn.plots.tree_image(tree))
    ax[0].set_axis_off()
plot_depth(1)

../_images/7122bd764faba206dedf5e8abb634329e96662f9024fe34f2643b9591709e22e.png

plot_depth(2)

../_images/b89f707a982d47fc62ad10318482deab62069862ebe686a0ce84c43972c2562d.png

plot_depth(9)

../_images/81b74fe273d4c156bd0afd0df6165918b0375f6d4271a468a0af4fccb7016efe.png

Heuristics#

We start from a dataset of $n$ points $D = \{(x_i,y_i)\}_{i=1}^{n}$ where $y_i$ is one of $k$ classes
Consider splits between adjacent data point of different class, for every variable
After splitting, each leaf will have $\hat{p}_k$ = the relative frequency of class $k$

We can define several impurity measures:

Misclassification Error (leads to larger trees): $ 1 - \underset{k}{\operatorname{argmax}} \hat{p}_{k} $
Gini-Index: $ \sum_{k\neq k'} \hat{p}_k \hat{p}_{k'} = \sum_{k=1}^K \hat{p}_k(1-\hat{p}_k) $
Sum up the heuristics per leaf, weighted by the number of examples in each leaf

\[ \sum_{l=1}^L \frac{|X_{i=l}|}{|X_{i}|} Gini(X_{i=l}) \]

Visualization: the plots on the right show the class distribution of the ‘top’ and ‘bottom’ leaf, respectively.

from __future__ import print_function
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

def misclassification_error(leaf1_distr, leaf2_distr, leaf1_size, leaf2_size):
    total = leaf1_size + leaf2_size
    return leaf1_size/total * (1 - max(leaf1_distr)) + leaf2_size/total * (1 - max(leaf2_distr))

def gini_index(leaf1_distr, leaf2_distr, leaf1_size, leaf2_size):
    total = leaf1_size + leaf2_size
    return leaf1_size/total * np.array([leaf1_distr[k]*(1-leaf1_distr[k]) for k in range(0,2)]).sum() + \
            leaf2_size/total * np.array([leaf2_distr[k]*(1-leaf2_distr[k]) for k in range(0,2)]).sum()

@interact
def plot_heuristics(split=(-0.5,1.5,0.05)):
    
    fig, ax = plt.subplots(1, 2, figsize=(12*fig_scale, 4*fig_scale),
                           subplot_kw={'xticks': (), 'yticks': ()})
    mglearn.tools.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax[0])
    ax[0].plot([min(X[:, 0]),max(X[:, 0])],[split,split])

    ind = np.arange(2)
    width = 0.35
    top, bottom = y[np.where(X[:, 1]>split)], y[np.where(X[:, 1]<=split)]
    top_0, top_1 = (top == 0).sum()/len(top), (top == 1).sum()/len(top)
    bottom_0, bottom_1 = (bottom == 0).sum()/len(bottom), (bottom == 1).sum()/len(bottom)
    ax[1].barh(ind, [bottom_1,top_1], width, color='r')
    for i, v in enumerate([bottom_1,top_1]):
        ax[1].text(0, i, 'p_1={:.2f}'.format(v), color='black', fontweight='bold')
    ax[1].barh(ind+width, [bottom_0,top_0], width, color='royalblue')
    for i, v in enumerate([bottom_0,top_0]):
        ax[1].text(0, i + width, 'p_0={:.2f}'.format(v), color='black', fontweight='bold')
    error = misclassification_error([top_0,top_1],[bottom_0,bottom_1],len(top),len(bottom))
    gini = gini_index([top_0,top_1],[bottom_0,bottom_1],len(top),len(bottom))
    ax[1].set_title("Misclass. Error:{:.2f}, Gini:{:.2f}".format(error, gini))

if not interactive:
    plot_heuristics(split=0.6)

Entropy (of the class attribute) measures unpredictability of the data:
- How likely will random example have class k? $$ E(X) = -\sum_{k=1}^K \hat{p}_k \log_{2}\hat{p}_k $$
Information Gain (a.k.a. Kullback–Leibler divergence) measures how much entropy has reduced by splitting on attribute $X_i$: $$ G(X,X_i) = E(X) - \sum_{l=1}^L \frac{|X_{i=l}|}{|X_{i}|} E(X_{i=l}) $$

with $X$ = the training set, $l$ a specific leaf after splitting on feature $X_i$, $X_{i=l}$ is the set of examples in leaf $l$: $\{x \in X | X_i \in l\}$

Heuristics visualized (binary class)

Note that gini != entropy/2

def gini(p):
   return (p)*(1 - (p)) + (1 - p)*(1 - (1-p))

def entropy(p):
   return - p*np.log2(p) - (1 - p)*np.log2((1 - p))

def classification_error(p):
   return 1 - np.max([p, 1 - p])

x = np.arange(0.0, 1.0, 0.01)
ent = [entropy(p) if p != 0 else None for p in x]
scaled_ent = [e*0.5 if e else None for e in ent]
c_err = [classification_error(i) for i in x]

fig = plt.figure(figsize=(10*fig_scale,6*fig_scale))
ax = plt.subplot(111)

for j, lab, ls, c, in zip(
      [ent, scaled_ent, gini(x), c_err],
      ['Entropy', 'Entropy (scaled)', 'Gini Impurity', 'Misclassification Error'],
      ['-', '-', '--', '-.'],
      ['lightgray', 'red', 'green', 'blue']):
   line = ax.plot(x, j, label=lab, linestyle=ls, lw=1, color=c)

ax.legend(loc='upper left', bbox_to_anchor=(0.01, 0.85),
         ncol=1, fancybox=True, shadow=False)
ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')

plt.ylim([0, 1.1])
plt.xlabel('p(j=1)')
plt.ylabel('Impurity Index')
plt.show()

../_images/5e1d906f58256a7c6c8a1d3065554317a4f2c7739438763bf178fedc005729ce.png

%%HTML
<style>
td {font-size: 20px}
th {font-size: 20px}
.rendered_html table, .rendered_html td, .rendered_html th {
    font-size: 20px;
}
</style>

Example#

Compute information gain for a dataset with categorical features:

Ex.	1	2	3	4	5	6
a1	T	T	T	F	F	F
a2	T	T	F	F	T	T
class	+	+	-	+	-	-

\[E(X)?\]

\[G(X,X_{a2})?\]

\[G(X,X_{a1})?\]

Ex.	1	2	3	4	5	6
a1	T	T	T	F	F	F
a2	T	T	F	F	T	T
class	+	+	-	+	-	-

$E(X)$ = $-(\frac{1}{2}*log_2(\frac{1}{2})+\frac{1}{2}*log_2(\frac{1}{2})) = 1$ (classes have equal probabilities)
$G(X, X_{a2})$ = 0 (after split, classes still have equal probabilities, entropy stays 1)

Ex.	1	2	3	4	5	6
a1	T	T	T	F	F	F
a2	T	T	F	F	T	T
class	+	+	-	+	-	-

\[ E(X) = -\sum_{k=1}^K \hat{p}_k \log\hat{p}_k \quad , \quad G(X,X_i) = E(X) - \sum_{v=1}^V \frac{|X_{i=v}|}{|X_{i}|} E(X_{i=v}) \]

\[E(X_{a1=T}) = - \frac{2}{3} \log_{2}(\frac{2}{3}) - \frac{1}{3} \log_{2}(\frac{1}{3}) = 0.9183 \quad (= E(X_{a1=F}))\]

\[G(X, X_{a1}) = 1 - \frac{1}{2} 0.9183 - \frac{1}{2} 0.9183 = 0.0817 \]

hence we split on a1

Heuristics in scikit-learn#

The splitting criterion can be set with the criterion option in DecisionTreeClassifier

gini (default): gini impurity index
entropy: information gain

Best value depends on dataset, as well as other hyperparameters

from sklearn.tree import DecisionTreeClassifier
names = ["Decision tree - gini", "Decision tree - entropy"]

classifiers = [
    DecisionTreeClassifier(),
    DecisionTreeClassifier(criterion="entropy")
    ]

mglearn.plots.plot_classifiers(names, classifiers, figuresize=(12*fig_scale,6*fig_scale))

../_images/bf476c9535c6d0b221f1d6b09a57cee7cbacb577641841267d91a20d8e6ae206.png

Handling many-valued features#

What happens when a categorical feature has (almost) as many values as examples?

Information Gain will select it

One approach: use Gain Ratio instead (not available scikit-learn): $$ GainRatio(X,X_i) = \frac{Gain(X,X_i)}{SplitInfo(X,X_i)}$$ SplitInfo(X,X_i) = - \sum_{v=1}^V \frac{|X_{i=v}|}{|X|} log_{2} \frac{|X_{i=v}|}{|X|} $$

where $X_{i=v}$ is the subset of examples for which feature $X_i$ has value v.

SplitInfo will be big if $X_i$ fragments the data into many small subsets, resulting in a smaller Gain Ratio.

Overfitting: Controlling complexity of Decision Trees#

Decision trees can very easily overfit the data. Regularization strategies:

Pre-pruning: stop creation of new leafs at some point
- Limiting the depth of the tree, or the number of leafs
- Requiring a minimal leaf size (number of instances) to allow a split
Post-pruning: build full tree, then prune (join) leafs
- Reduced error pruning: evaluate against held-out data
- Many other strategies exist.
- scikit-learn supports none of them (yet)

Effect of pre-pruning:

Shallow trees tend to underfit (high bias)
Deep trees tend to overfit (high variance)

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
train_score, test_score = [],[]
for depth in range(1,10):
    tree = DecisionTreeClassifier(random_state=0, max_depth=depth).fit(X_train, y_train)
    train_score.append(tree.score(X_train, y_train))
    test_score.append(tree.score(X_test, y_test))
plt.figure(figsize=(10*fig_scale,6*fig_scale))
plt.plot(range(1,10), train_score, label="train_score", linewidth=4)
plt.plot(range(1,10), test_score, label="test_score", linewidth=4)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)

<matplotlib.legend.Legend at 0x29c5f2580>

../_images/9622dc9d39ac9b5047c7781922129c1dc5509168f6fe01aaa2560693de2e691d.png

Decision Trees are easy to interpet

Visualize and find the path that most data takes

# Creates a .dot file
from sklearn.tree import export_graphviz
export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"], 
                feature_names=cancer.feature_names, impurity=False, filled=True)
# Open and display
import graphviz
with open("tree.dot") as f:
    dot_graph = f.read()
display(graphviz.Source(dot_graph))

../_images/9e10752e46ddc1cfaa1bc0b269fa172c3938cbab7f718de525cc7cf84263c226.svg

DecisionTreeClassifier also returns feature importances

tree.feature_importances_

In [0,1], sum up to 1
High values for features selected early (near the root)

# Feature importances sum up to 1
print("Feature importances:\n{}".format(tree.feature_importances_))

Feature importances:
[0.    0.008 0.    0.    0.009 0.    0.008 0.    0.    0.    0.01  0.046
 0.    0.002 0.002 0.    0.    0.    0.    0.007 0.695 0.054 0.    0.014
 0.    0.    0.017 0.117 0.011 0.   ]

def plot_feature_importances_cancer(model):
    n_features = cancer.data.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)

plt.rcParams.update({'font.size': 8*fig_scale})
plt.rcParams["figure.figsize"] = (12*fig_scale,6*fig_scale)
plot_feature_importances_cancer(tree)

../_images/2ff700e1c1351debfcadd4cbee11b7c48067e2a231f045f9e95b40606b1f55f3.png

Decision tree regression#

Heuristic: Minimal quadratic distance
Consider splits at every data point for every variable (or halfway between)
Dividing the data on $X_j$ at splitpoint $s$ leads to the following half-spaces:

\[ R_1(j, s) = { X : X_j \leq s} \quad and \quad R_2(j, s) = { X : X_j > s} \]

The best split, with predicted value $c_i$ (mean of all values in the leaf) and actual value $y_i$:

\[ \min_{j,s} \left(\min_{c_1} \sum_{x_{i} \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_{i} \in R_2(j,s)} (y_i - c_2)^2 \right) \]

Assuming that the tree predicts $y_i$ as the average of all $x_i$ in the leaf:

\[ \hat{c}_1 = \text{avg}(y_i | x_{i} \in R_1(j,s)) \quad and \quad \hat{c}_2 = \text{avg}(y_i | x_{i} \in R_2(j,s)) \]

with $x_i$ being the i-th example in the data, with target value $y_i$

In scikit-learn#

Regression is done with DecisionTreeRegressor

def plot_decision_tree_regression(regr_1, regr_2):
    # Create a random dataset
    rng = np.random.RandomState(1)
    Xr = np.sort(5 * rng.rand(80, 1), axis=0)
    yr = np.sin(Xr).ravel()
    yr[::5] += 3 * (0.5 - rng.rand(16))

    # Fit regression model
    regr_1.fit(Xr, yr)
    regr_2.fit(Xr, yr)

    # Predict
    Xr_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
    y_1 = regr_1.predict(Xr_test)
    y_2 = regr_2.predict(Xr_test)

    # Plot the results
    plt.figure(figsize=(8,6))
    plt.scatter(Xr, yr, c="darkorange", label="data")
    plt.plot(Xr_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
    plt.plot(Xr_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
    plt.xlabel("data")
    plt.ylabel("target")
    plt.title("Decision Tree Regression")
    plt.legend()
    plt.show()

from sklearn.tree import DecisionTreeRegressor
plt.rcParams["figure.figsize"] = (12*fig_scale,6*fig_scale)
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)

plot_decision_tree_regression(regr_1,regr_2)

../_images/25dc394c00bdc1e3520bd1a7942e844335debd1a7ed5436a206048b756ad48ed.png

Note that decision trees do not extrapolate well.

The leafs return the same mean value no matter how far the new data point lies from the training examples.
Example on the ram_price forecasting dataset

ram_prices = pd.read_csv('../data/ram_price.csv')

plt.semilogy(ram_prices.date, ram_prices.price)
plt.xlabel("Year")
plt.ylabel("Price in $/Mbyte");

../_images/886b68ced330c9a5e5b4657a3a30a57773cef8a1160c94086b080bddedfb992c.png

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Use historical data to forecast prices after the year 2000
data_train = ram_prices[ram_prices.date < 2000]
data_test = ram_prices[ram_prices.date >= 2000]

# predict prices based on date:
Xl_train = data_train.date[:, np.newaxis]
# we use a log-transform to get a simpler relationship of data to target
yl_train = np.log(data_train.price)

tree = DecisionTreeRegressor().fit(Xl_train, yl_train)
linear_reg = LinearRegression().fit(Xl_train, yl_train)

# predict on all data
X_all = ram_prices.date[:, np.newaxis]

pred_tree = tree.predict(X_all)
pred_lr = linear_reg.predict(X_all)

# undo log-transform
price_tree = np.exp(pred_tree)
price_lr = np.exp(pred_lr)

plt.rcParams['lines.linewidth'] = 2
plt.semilogy(data_train.date, data_train.price, label="Training data")
plt.semilogy(data_test.date, data_test.price, label="Test data")
plt.semilogy(ram_prices.date, price_tree, label="Tree prediction")
plt.semilogy(ram_prices.date, price_lr, label="Linear prediction")
plt.legend();

../_images/671fc3260a76e47712bc4a6b6e29c79ec2a5c5410d3dced2332fb72168b8013a.png

Model trees#

Instead of predicting a single value per leaf (e.g. mean value for regression), you can build a model on all the points remaining in a leaf
- E.g. a linear regression model
Can learn more complex concepts, extrapolates better. Overfits easily.

Strengths, weaknesses and parameters#

Decision trees:

Work well with features on completely different scales, or a mix of binary and continuous features
- Does not require normalization
Interpretable, easily visualized
Tend to overfit easily.

Pre-pruning: regularize by:

Setting a low max_depth, max_leaf_nodes
Setting a higher min_samples_leaf (default=1)

On under- and overfitting#

Let’s study which types of errors are made by decision trees
Deep trees have high variance but low bias
- What if we built many deep trees and average them out to reduce variance?
Shallow trees have high bias but very low variance
- What if we could correct the systematic mistakes to reduce bias?

from sklearn.model_selection import ShuffleSplit, train_test_split

# Bias-Variance Computation 
def compute_bias_variance(clf, X, y):
    # Bootstraps
    n_repeat = 40 # 40 is on the low side to get a good estimate. 100 is better.
    shuffle_split = ShuffleSplit(test_size=0.33, n_splits=n_repeat, random_state=0)

    # Store sample predictions
    y_all_pred = [[] for _ in range(len(y))]

    # Train classifier on each bootstrap and score predictions
    for i, (train_index, test_index) in enumerate(shuffle_split.split(X)):
        # Train and predict
        clf.fit(X[train_index], y[train_index])
        y_pred = clf.predict(X[test_index])

        # Store predictions
        for j,index in enumerate(test_index):
            y_all_pred[index].append(y_pred[j])

    # Compute bias, variance, error
    bias_sq = sum([ (1 - x.count(y[i])/len(x))**2 * len(x)/n_repeat 
                for i,x in enumerate(y_all_pred)])
    var = sum([((1 - ((x.count(0)/len(x))**2 + (x.count(1)/len(x))**2))/2) * len(x)/n_repeat
               for i,x in enumerate(y_all_pred)])
    error = sum([ (1 - x.count(y[i])/len(x)) * len(x)/n_repeat 
            for i,x in enumerate(y_all_pred)])

    return np.sqrt(bias_sq), var, error

def plot_bias_variance(clf, X, y):
    bias_scores = []
    var_scores = []
    err_scores = []
    max_depth= range(2,11)

    for i in max_depth:
        b,v,e = compute_bias_variance(clf.set_params(random_state=0,max_depth=i),X,y)
        bias_scores.append(b)
        var_scores.append(v)
        err_scores.append(e)

    plt.figure(figsize=(10*fig_scale,5*fig_scale))
    plt.rcParams.update({'font.size': 12})
    plt.suptitle(clf.__class__.__name__)
    plt.plot(max_depth, var_scores,label ="variance" )
    plt.plot(max_depth, np.square(bias_scores),label ="bias^2")
    plt.plot(max_depth, err_scores,label ="error" )
    plt.xlabel("max_depth")
    plt.legend(loc="best")
    plt.show()

cancer = load_breast_cancer()
Xc, yc = cancer.data, cancer.target
dt = DecisionTreeClassifier()
plot_bias_variance(dt, Xc, yc)

../_images/898513c722ceabba61a419f8422084ef84bfec8c813d8109539e35f987a041b5.png

Algorithm overview#

Name	Representation	Loss function	Optimization	Regularization
Classification trees	Decision tree	Information Gain (KL div.) / Gini index	Hunt’s algorithm	Tree depth,…
Regression trees	Decision tree	Min. quadratic distance	Hunt’s algorithm	Tree depth,…
Model trees	Decision tree + other models in leafs	As above + used model’s loss	Hunt’s algorithm + used model’s optimization	Tree depth,…