Lecture 3: Model Selection

Can I trust you?

Joaquin Vanschoren, Eindhoven University of Technology

Evaluation

  • To know whether we can trust our method or system, we need to evaluate it.
  • If you cannot measure it, you cannot improve it.
  • Model selection: choose between different models in a data-driven way.
  • Convince others that your work is meaningful
    • Peers, leadership, clients, yourself(!)
  • Keep evaluating relentlessly, adapt to changes

Designing Machine Learning systems

  • Just running your favourite algorithm is usually not a great way to start
  • Consider the problem at large
    • Do you want to understand phenomena or do black box modelling?
    • How to define and measure success? Are there costs involved?
    • Do you have the right data? How can you make it better?
  • Build prototypes early on to evaluate the above.
  • Analyze your model's mistakes
    • Should you collect more data, or different kinds of data?
    • Should the task be reformulated?
    • Often a higher payoff than endless finetuning
  • Technical debt: creation-maintenance trade-off
    • Very complex machine learning systems are hard/impossible to put into practice
    • See 'Machine Learning: The High Interest Credit Card of Technical Debt'


Real world evaluations

  • Evaluate predictions, but also how outcomes improve because of them
  • Feedback loops: predictions are fed into the inputs, e.g. as new data, invalidating models
    • Example: a medical recommendation model may change future measurements
  • The signal your model found may just be an artifact of your biased data
    • When possible, try to interpret what your model has learned
    • See 'Why Should I Trust You?' by Marco Ribeiro et al.
  • Adversarial situations (e.g. spam filtering) can subvert your predictions
  • Do A/B testing (or bandit testing) to evaluate algorithms in the wild

A/B and bandit testing

  • Test a single innovation (or choose between two models)
  • Have most users use the old system, divert small group to new system
  • Evaluate and compare performance
  • Bandit testing: smaller time intervals, direct more users to currently winning system
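
As a rough illustration (not part of any real system), a minimal epsilon-greedy sketch with two hypothetical models and made-up conversion rates: most simulated traffic gradually shifts to the better-performing model.

import numpy as np

rng = np.random.RandomState(0)
true_rates = {"old": 0.10, "new": 0.12}   # assumed, unknown to the tester
successes = {"old": 0, "new": 0}
trials = {"old": 0, "new": 0}
eps = 0.1                                 # exploration rate

for _ in range(10000):
    if rng.rand() < eps or min(trials.values()) == 0:
        arm = rng.choice(["old", "new"])  # explore: pick a random model
    else:                                 # exploit: best observed success rate
        arm = max(trials, key=lambda a: successes[a] / trials[a])
    trials[arm] += 1
    successes[arm] += rng.rand() < true_rates[arm]  # simulated user response

print({a: round(successes[a] / trials[a], 3) for a in trials})
print(trials)  # most traffic ends up on the better model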


Performance estimation techniques

  • We do not have access to future observations
  • Always evaluate models as if they are predicting the future
  • Set aside data for objective evaluation
    • How?

The holdout (simple train-test split)

  • Randomly split data (and corresponding labels) into training and test set (e.g. 75%-25%)
  • Train (fit) a model on the training data, score on the test data
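
A minimal holdout sketch in scikit-learn; the iris dataset and logistic regression are just example choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
# hold out 25% of the data (the default) as the test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
model = LogisticRegression().fit(X_train, y_train)  # fit only on training data
print("Test set score: {:.2f}".format(model.score(X_test, y_test)))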
In [2]:
import numpy as np
import matplotlib.pyplot as plt
import mglearn
from sklearn.model_selection import (TimeSeriesSplit, KFold, ShuffleSplit, train_test_split,
                                     StratifiedKFold, GroupShuffleSplit,
                                     GroupKFold, StratifiedShuffleSplit)
from matplotlib.patches import Patch
np.random.seed(1338)
cmap_data = plt.cm.brg
cmap_group = plt.cm.Paired
cmap_cv = plt.cm.coolwarm
n_splits = 4

# Generate the class/group data
n_points = 100
X = np.random.randn(n_points, 10)

percentiles_classes = [.1, .3, .6]
y = np.hstack([[ii] * int(100 * perc)
               for ii, perc in enumerate(percentiles_classes)])

# Unevenly sized groups (sizes drawn from a multinomial distribution)
rng = np.random.RandomState(42)
group_prior = rng.dirichlet([2]*10)
groups = np.repeat(np.arange(10), rng.multinomial(100, group_prior))
In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
fig = plt.figure()
ax = fig.add_subplot(111)
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train,
                         markers='o', ax=ax)
mglearn.discrete_scatter(X_test[:, 0], X_test[:, 1], y_test,
                         markers='^', ax=ax)
ax.legend(["Train class 0", "Train class 1", "Train class 2", "Test class 0",
                "Test class 1", "Test class 2"], ncol=6,  loc=(-0.1, 1.1));

K-fold Cross-validation

  • Each random split can yield very different models (and scores)
    • e.g. all easy (or hard) examples could end up in the test set
  • Split data (randomly) into k equal-sized parts, called folds
    • Create k splits, each time using a different fold as the test set
  • Compute k evaluation scores, aggregate afterwards (e.g. take the mean)
  • Examine the score variance to see how sensitive (unstable) models are
  • Reduces sampling bias by testing on every point exactly once
  • Large k gives better estimates (more training data), but is expensive
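
A minimal sketch with cross_val_score; the breast cancer dataset, logistic regression, and k=5 are example choices:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_bc, y_bc = load_breast_cancer(return_X_y=True)
# 5 folds -> 5 scores, one per held-out fold
scores = cross_val_score(LogisticRegression(), X_bc, y_bc,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("Scores:", scores)
print("Mean: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))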
In [4]:
def plot_cv_indices(cv, X, y, group, ax, lw=2, show_groups=False, s=700, legend=True):
    """Create a sample plot for indices of a cross-validation object."""
    n_splits = cv.get_n_splits(X, y, group)

    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        # Visualize the results
        ax.scatter([n_splits - ii - 1] * len(indices), range(len(indices)),
                   c=indices, marker='_', lw=lw, cmap=cmap_cv,
                   vmin=-.2, vmax=1.2, s=s)

    # Plot the data classes and groups at the end
    ax.scatter([-1] * len(X), range(len(X)), 
               c=y, marker='_', lw=lw, cmap=cmap_data, s=s)
    yticklabels = ['class'] + list(range(1, n_splits + 1))
    
    if show_groups:
        ax.scatter([-2] * len(X), range(len(X)), 
                   c=group, marker='_', lw=lw, cmap=cmap_group, s=s)
        yticklabels.insert(0, 'group')

    # Formatting
    ax.set(xticks=np.arange(-1 - show_groups, n_splits), xticklabels=yticklabels,
            ylabel='Sample index', xlabel="CV iteration",
            xlim=[-1.5 - show_groups, n_splits+.2], ylim=[-6, 100])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)
    if legend:
        ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.2))],
                  ['Testing set', 'Training set'], loc=(1.02, .8))
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.set_yticks(())
    return ax
In [5]:
fig, ax = plt.subplots(figsize=(6, 3.5))
cv = KFold(5)
plot_cv_indices(cv, X, y, groups, ax, s=700);

Stratified K-Fold cross-validation

  • If the data is unbalanced, some classes have many fewer samples
  • Likely that some classes are not present in the test set
  • Stratification: proportions between classes are conserved in each fold
    • Order examples per class
    • Separate the samples of each class in k sets (strata)
    • Combine corresponding strata into folds
In [6]:
fig, ax = plt.subplots(figsize=(6, 3.9))
cv = StratifiedKFold(5)
plot_cv_indices(cv, X, y, groups, ax, s=700)
ax.set_ylim((-6, 100));

Can you explain this result?

kfold = KFold(n_splits=3)
print("Cross-validation scores KFold(n_splits=3):\n{}".format(
      cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
In [7]:
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression()
kfold = KFold(n_splits=3)
print("Cross-validation scores KFold(n_splits=3):\n{}".format(
      cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
Cross-validation scores KFold(n_splits=3):
[0. 0. 0.]

Leave-One-Out cross-validation

  • k fold cross-validation with k equal to the number of samples
  • Completely unbiased (in terms of data splits), but computationally expensive
  • But: the error estimate generalizes less well to unseen data
    • The training sets are highly correlated (they overlap almost completely)
    • It can overfit to the particular dataset used for evaluation
    • A different sample of the data can yield quite different results
  • Recommended only for small datasets
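
A minimal sketch with LeaveOneOut (iris and logistic regression are example choices; note that one model is fitted per left-out sample):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = load_iris()
# one model fit (and one 0/1 score) per left-out sample
scores = cross_val_score(LogisticRegression(), iris.data, iris.target,
                         cv=LeaveOneOut())
print("Number of evaluations:", len(scores))
print("Mean accuracy: {:.3f}".format(scores.mean()))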

Shuffle-Split cross-validation

  • Additionally shuffles the data (only do this when the data is not ordered)
  • Randomly samples a fixed number of points (train_size) as the training set
  • Can also use a smaller test_size: handy for very large datasets
In [8]:
fig, ax = plt.subplots(figsize=(6, 4))
cv = ShuffleSplit(8, test_size=.2)
plot_cv_indices(cv, X, y, groups, ax, s=700)
ax.set_ylim((-6, 100))
ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.2))],
          ['Testing set', 'Training set'], loc=(.95, .8));

Note: this is related to bootstrapping:

  • Sample n (total number of samples) data points, with replacement, as training set (the bootstrap)
  • Use the unsampled (out-of-bootstrap) samples as the test set
  • Repeat n_iter times, obtaining n_iter scores
  • Not supported in scikit-learn, only approximated
    • Use Shuffle-Split with train_size=0.66, test_size=0.34
    • On average, a bootstrap sample contains about 63.2% ($\approx 1 - 1/e$) of the unique data points
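
A manual sketch of one out-of-bootstrap evaluation round, using sklearn.utils.resample (this is our own illustration, not a built-in scikit-learn procedure):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X_bc, y_bc = load_breast_cancer(return_X_y=True)
n = len(X_bc)
# sample n points with replacement as the bootstrap (training set)
boot_idx = resample(np.arange(n), replace=True, n_samples=n, random_state=0)
oob_idx = np.setdiff1d(np.arange(n), boot_idx)  # out-of-bootstrap samples
model = LogisticRegression().fit(X_bc[boot_idx], y_bc[boot_idx])
print("Fraction of unique points in bootstrap: {:.2f}".format(len(np.unique(boot_idx)) / n))
print("Out-of-bootstrap score: {:.3f}".format(model.score(X_bc[oob_idx], y_bc[oob_idx])))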

Repeated cross-validation

  • Cross-validation is still biased in that the initial split can be made in many ways
  • Repeated, or n-times-k-fold cross-validation:
    • Shuffle data randomly, do k-fold cross-validation
    • Repeat n times, yields n times k scores
  • Unbiased, very robust, but n times more expensive
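
A minimal sketch with RepeatedStratifiedKFold (the dataset and the 3x5 setup are example choices):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X_bc, y_bc = load_breast_cancer(return_X_y=True)
# 3 repeats of 5-fold cross-validation -> 15 scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(), X_bc, y_bc, cv=cv)
print("{} scores, mean {:.3f} +/- {:.3f}".format(len(scores), scores.mean(), scores.std()))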
In [9]:
from sklearn.model_selection import RepeatedStratifiedKFold
from matplotlib.patches import Rectangle
fig, ax = plt.subplots(figsize=(10, 3.9))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3)
plot_cv_indices(cv, X, y, groups, ax, lw=2, s=400, legend=False)
ax.set_ylim((-6, 102))
xticklabels = ["class"] + [f"{repeat}x{split}" for repeat in range(1, 4) for split in range(1, 6)]
ax.set_xticklabels(xticklabels)
for i in range(3):
    rect = Rectangle((-.5 + i * 5, -2.), 5, 103, edgecolor='k', facecolor='none')
    ax.add_artist(rect)

Cross-validation with groups

  • Sometimes the data contains inherent groups:
    • Multiple samples from same patient, images from same person,...
  • With normal cross-validation, data from the same person may end up in the training and test set
  • We want to measure how well the model generalizes to other people
  • Make sure that data from one person are in either the train or test set
    • This is called grouping or blocking
    • Leave-one-subject-out cross-validation: test set for each subject
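
A minimal sketch with GroupKFold; the data and the 'four samples per patient' grouping are made up for illustration:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# toy data: 25 'patients', 4 samples each
X_toy, y_toy = make_blobs(n_samples=100, centers=2, random_state=0)
groups_toy = np.repeat(np.arange(25), 4)
# all samples of a patient end up in either the training or the test fold
scores = cross_val_score(LogisticRegression(), X_toy, y_toy,
                         groups=groups_toy, cv=GroupKFold(n_splits=5))
print("Group-wise CV scores:", scores)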
In [10]:
fig, ax = plt.subplots(figsize=(6, 3.9))
cv = GroupKFold(5)
plot_cv_indices(cv, X, y, groups, ax, s=700, show_groups=True)
ax.set_ylim((-6, 100));

Time series

When the data is ordered, random test sets are not a good idea

In [11]:
import pandas as pd
approval = pd.read_csv("https://projects.fivethirtyeight.com/trump-approval-data/approval_topline.csv")
adults = approval.groupby("subgroup").get_group('Adults')
adults = adults.set_index('modeldate')[::-1]
adults.approve_estimate.plot()
ax = plt.gca()
plt.rcParams["figure.figsize"] = [8,4]
ax.set_xlabel("")
xlim, ylim = ax.get_xlim(), ax.get_ylim()
for i in range(20):
    rect = Rectangle((np.random.randint(0, xlim[1]), ylim[0]), 10, ylim[1]-ylim[0], facecolor='#FFAAAA')
    ax.add_artist(rect)
plt.title("Presidential approval estimates by fivethirtyeight")
plt.legend([rect], ['Random Test Set'] );

Time series

  • Test-then-train (prequential evaluation)
    • Every new sample is evaluated once, then added to the training set
    • Can also be done in batches (of n samples at a time)
  • TimeSeriesSplit
    • In the kth split, the first k folds are used as the train set and the (k+1)th fold as the test set
    • Can also be done with a maximum training set size: more robust against concept drift
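
A minimal sketch showing which indices TimeSeriesSplit produces (12 toy samples, 3 splits):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 ordered samples, 3 splits: always train on the past, test on the next block
X_toy = np.arange(12).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_toy):
    print("train:", train_idx, "test:", test_idx)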
In [12]:
from sklearn.utils import shuffle
fig, ax = plt.subplots(figsize=(6, 3.9))
cv = TimeSeriesSplit(5, max_train_size=20)
plot_cv_indices(cv, X, shuffle(y), groups, ax, s=700, lw=2)
ax.set_ylim((-6, 100))
ax.set_title("TimeSeriesSplit(5, max_train_size=20)");

Choosing a performance estimation procedure

No strict rules, only guidelines:

  • Always use stratification for classification (sklearn does this by default)
  • Use holdout for very large datasets (e.g. >1,000,000 examples)
    • Or when learners don't always converge (e.g. deep learning)
  • Choose k depending on dataset size and resources
    • Use leave-one-out for small datasets (e.g. <500 examples)
    • Use cross-validation otherwise
      • Most popular (and theoretically sound): 10-fold CV
      • Literature suggests 5x2-fold CV is better
  • Use grouping or leave-one-subject-out for grouped data
  • Use test-then-train (prequential) evaluation or TimeSeriesSplit for time series

Evaluation Metrics and scoring

Keep the end-goal in mind

Evaluation vs Optimization

  • Each algorithm optimizes a given objective function (on the training data)
    • E.g. remember the L2 loss in Ridge regression $$\mathcal{L}_{ridge} = \sum_{i}(y_i - \sum_{j} x_{i,j}w_j)^2 + \alpha \sum_{j} w_j^2$$
  • The choice of function is limited by what can be efficiently optimized
    • E.g. gradient descent requires a differentiable loss function
  • However, we evaluate the resulting model with a score that makes sense in the real world
    • E.g. percentage of correct predictions (on a test set)
  • We also tune the algorithm's hyperparameters to maximize that score

Binary classification

  • We have a positive and a negative class
  • Two different kinds of errors:
    • False Positive (type I error): model predicts positive while the true label is negative
    • False Negative (type II error): model predicts negative while the true label is positive
  • They are not always equally important
    • Which side do you want to err on for a medical test?

Confusion matrices

  • We can represent all predictions (correct and incorrect) in a confusion matrix
    • n by n array (n is the number of classes)
    • Rows correspond to true classes, columns to predicted classes
    • Each entry counts how often a sample that belongs to the class corresponding to the row was classified as the class corresponding to the column.
    • For binary classification, we label these true negative (TN), true positive (TP), false negative (FN), false positive (FP)
In [13]:
mglearn.plots.plot_binary_confusion_matrix()

Predictive accuracy

  • Accuracy is one of the measures we can compute based on the confusion matrix:
\begin{equation} \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \end{equation}
  • In sklearn: use confusion_matrix and accuracy_score from sklearn.metrics.
  • Accuracy is also the default evaluation measure for classification
In [14]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

print("confusion_matrix(y_test, y_pred): \n", confusion_matrix(y_test, y_pred))
print("accuracy_score(y_test, y_pred): ", accuracy_score(y_test, y_pred))
print("model.score(X_test, y_test): ", lr.score(X_test, y_test))
confusion_matrix(y_test, y_pred): 
 [[48  5]
 [ 5 85]]
accuracy_score(y_test, y_pred):  0.9300699300699301
model.score(X_test, y_test):  0.9300699300699301

The problem with accuracy: imbalanced datasets

  • The type of error plays an even larger role if the dataset is imbalanced
    • One class is much more frequent than the other, e.g. credit fraud
    • Is a 99.99% accuracy good enough?
  • Are these three models really equally good?
In [17]:
plot_measure(accuracy_score)

Precision is used when the goal is to limit FPs

  • Clinical trials: you only want to test drugs that really work
  • Search engines: you want to avoid bad search results
\begin{equation} \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \end{equation}
In [18]:
from sklearn.metrics import precision_score
plot_measure(precision_score)

Recall is used when the goal is to limit FNs

  • Cancer diagnosis: you don't want to miss a serious disease
  • Search engines: You don't want to omit important hits
  • Also known as sensitivity, hit rate, true positive rate (TPR)
\begin{equation} \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \end{equation}
In [19]:
from sklearn.metrics import recall_score
plot_measure(recall_score)

Comparison

F1-score or F1-measure trades off precision and recall:

\begin{equation} \text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \end{equation}
In [20]:
from sklearn.metrics import f1_score
plot_measure(f1_score)

Averaging scores per class

  • Study the scores by class (in scikit-learn: classification_report)
    • One class is viewed as positive, the other(s) as negative
    • Support: number of samples in each class
    • Last line: weighted average over the classes (weighted by number of samples in each class)
  • Averaging for scoring measure R across C classes (also for multiclass):
    • micro: count total number of TP, FP, TN, FN
    • macro $$\frac{1}{|C|} \sum_{c \in C} R(y_c,\hat{y_c})$$
    • weighted ($w_c$: ratio of examples of class $c$) $$\sum_{c \in C} w_c R(y_c,\hat{y_c})$$
In [21]:
from sklearn.metrics import classification_report

def report(y_pred):
    print(classification_report(y_true, y_pred))
    fig, ax = plt.subplots(figsize=(2, 2))
    plt.rcParams['figure.dpi'] = 100 # Use 300 for PDF, 100 for slides
    plot_confusion_matrix(confusion_matrix(y_true, y_pred), cmap='gray_r', ax=ax,
                          xticklabels=["N", "P"], yticklabels=["N", "P"], xtickrotation=0, vmin=0, vmax=100)
    plt.tight_layout();
report(y_pred_1)
              precision    recall  f1-score   support

           0       0.90      1.00      0.95        90
           1       0.00      0.00      0.00        10

    accuracy                           0.90       100
   macro avg       0.45      0.50      0.47       100
weighted avg       0.81      0.90      0.85       100

In [22]:
report(y_pred_2)
              precision    recall  f1-score   support

           0       1.00      0.89      0.94        90
           1       0.50      1.00      0.67        10

    accuracy                           0.90       100
   macro avg       0.75      0.94      0.80       100
weighted avg       0.95      0.90      0.91       100

In [23]:
report(y_pred_3)
              precision    recall  f1-score   support

           0       0.94      0.94      0.94        90
           1       0.50      0.50      0.50        10

    accuracy                           0.90       100
   macro avg       0.72      0.72      0.72       100
weighted avg       0.90      0.90      0.90       100

Uncertainty estimates from classifiers

  • Classifiers can often provide uncertainty estimates of predictions.
  • Remember that linear models actually return a numeric value.
    • When $\hat{y}<0$, predict class -1, otherwise predict class +1 $$\hat{y} = w_0 * x_0 + w_1 * x_1 + ... + w_p * x_p + b $$
  • In practice, you are often interested in how certain a classifier is about each class prediction (e.g. cancer treatments).

Scikit-learn offers two functions. Most classifiers implement at least one of them, but not always both.

  • decision_function: returns floating point value for each sample
  • predict_proba: return probability for each class

The Decision Function

In the binary classification case, the return value of decision_function encodes how strongly the model believes a data point to belong to the “positive” class.

  • Positive values indicate a preference for the "positive" class
  • Negative values indicate a preference for the "negative" (other) class
In [24]:
# create and split a synthetic dataset
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
X, y = make_blobs(centers=2, cluster_std=2.5, random_state=8)

# we rename the classes "blue" and "red"
y_named = np.array(["blue", "red"])[y]

# we can call train test split with arbitrary many arrays
# all will be split in a consistent manner
X_train, X_test, y_train_named, y_test_named, y_train, y_test = \
    train_test_split(X, y_named, y, random_state=0)

# build the logistic regression model
lr = LogisticRegression()
lr.fit(X_train, y_train_named)

# get the decision function
dec = lr.decision_function(X_test)

mglearn.plots.plot_2d_separator(lr, X)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y);
plt.rcParams['figure.dpi'] = 90

for i, v in enumerate(dec):
    plt.annotate("{:.2f}".format(v), (X_test[i,0],X_test[i,1]),
                 textcoords="offset points", xytext=(0,7))
  • The range of decision_function can be arbitrary, and depends on the data and the model parameters. This makes it sometimes hard to interpret.
  • We can visualize the decision function as follows: the actual decision boundary is shown on the left, and the values of the decision function are color-coded on the right.
  • Note how the test examples are labeled depending on the decision function.
In [25]:
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
    
mglearn.tools.plot_2d_separator(lr, X, ax=axes[0], alpha=.4,
                                fill=True, cm=mglearn.cm2)
scores_image = mglearn.tools.plot_2d_scores(lr, X, ax=axes[1],
                                            alpha=.4, cm=mglearn.ReBl)

for ax in axes:
    # plot training and test points
    mglearn.discrete_scatter(X_test[:, 0], X_test[:, 1], y_test,
                             markers='^', ax=ax)
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train,
                             markers='o', ax=ax)
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
cbar = plt.colorbar(scores_image, ax=axes.tolist())
cbar.set_alpha(1)
cbar.draw_all()
axes[0].legend(["Test class 0", "Test class 1", "Train class 0",
                "Train class 1"], ncol=4, loc=(.1, 1.1));  

Predicting probabilities

The output of predict_proba is a probability for each class, with one column per class. They sum up to 1.

In [26]:
print("Shape of probabilities: {}".format(lr.predict_proba(X_test).shape))
# show the first few entries of predict_proba
print("Predicted probabilities:\n{}".format(
      lr.predict_proba(X_test[:6])))
Shape of probabilities: (25, 2)
Predicted probabilities:
[[0.232 0.768]
 [0.002 0.998]
 [0.    1.   ]
 [0.003 0.997]
 [0.001 0.999]
 [1.    0.   ]]

We can visualize them again. Note that the gradient looks different now.

In [27]:
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
    
mglearn.tools.plot_2d_separator(
    lr, X, ax=axes[0], alpha=.4, fill=True, cm=mglearn.cm2)
scores_image = mglearn.tools.plot_2d_scores(
    lr, X, ax=axes[1], alpha=.5, cm=mglearn.ReBl, function='predict_proba')

for ax in axes:
    # plot training and test points
    mglearn.discrete_scatter(X_test[:, 0], X_test[:, 1], y_test,
                             markers='^', ax=ax)
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train,
                             markers='o', ax=ax)
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
# don't want a transparent colorbar
cbar = plt.colorbar(scores_image, ax=axes.tolist())
cbar.set_alpha(1)
cbar.draw_all()
axes[0].legend(["Test class 0", "Test class 1", "Train class 0",
                "Train class 1"], ncol=4, loc=(.1, 1.1));

Interpreting probabilities

  • The class with the highest probability is predicted.
  • How well the uncertainty actually reflects uncertainty in the data depends on the model and the parameters.
    • An overfitted model tends to make more certain predictions, even if they might be wrong.
    • A model with less complexity usually has more uncertainty in its predictions.
  • A model is called calibrated if the reported uncertainty actually matches how correct it is — A prediction made with 70% certainty would be correct 70% of the time.
    • LogisticRegression returns well calibrated predictions by default as it directly optimizes log-loss
    • Linear SVMs are not well calibrated. They are biased towards points close to the decision boundary.
  • Calibration techniques can calibrate models in post-processing.

Model calibration

In [28]:
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import calibration_curve

from sklearn.datasets import fetch_covtype
from sklearn.utils import check_array

def load_data(dtype=np.float32, order='C', random_state=13):
    ######################################################################
    # Load covertype dataset (downloads it from the web, might take a bit)
    # TODO: Use OpenML version
    data = fetch_covtype(download_if_missing=True, shuffle=True,
                         random_state=random_state)
    X = check_array(data['data'], dtype=dtype, order=order)
    # make it a binary classification problem
    y = (data['target'] != 1).astype(int)

    # Create train-test split (as [Joachims, 2006])
    n_train = 522911
    X_train = X[:n_train]
    y_train = y[:n_train]
    X_test = X[n_train:]
    y_test = y[n_train:]

    # Standardize first 10 features (the numerical ones)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    mean[10:] = 0.0
    std[10:] = 1.0
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = load_data()

# subsample training set by a factor of 10:
X_train = X_train[::10]
y_train = y_train[::10]

#probs = lr.predict_proba(X_test)[:, 1]
#prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=5)

from sklearn.linear_model import LogisticRegressionCV
lr = LogisticRegressionCV().fit(X_train, y_train)

def plot_calibration_curve(y_true, y_prob, n_bins=5, ax=None, hist=True, normalize=False):
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins, normalize=normalize)
    if ax is None:
        ax = plt.gca()
    if hist:
        ax.hist(y_prob, weights=np.ones_like(y_prob) / len(y_prob), alpha=.4,
               bins=np.maximum(10, n_bins))
    ax.plot([0, 1], [0, 1], ':', c='k')
    curve = ax.plot(prob_pred, prob_true, marker="o")

    ax.set_xlabel("predicted probability")
    ax.set_ylabel("fraction of positive samples")

    ax.set(aspect='equal')
    return curve

fig, axes = plt.subplots(1, 3, figsize=(8, 8))
for ax, clf in zip(axes, [LogisticRegressionCV(), DecisionTreeClassifier(),
                          RandomForestClassifier()]):
        # use predict_proba if the estimator has it
        scores = clf.fit(X_train, y_train).predict_proba(X_test)[:, 1]
        plot_calibration_curve(y_test, scores, n_bins=20, ax=ax)
        ax.set_title(clf.__class__.__name__)

plt.tight_layout();

Model calibration

  • Build another model that maps the classifier's scores to better probabilities!
  • A 1D model (or more for multi-class): $$f_{calib}(s(x)) \approx p(y)$$

  • $s(x)$ is the score given by the model (e.g. the decision function or predicted probability)

  • This can also work with models that don't provide probabilities at all. We need a model for $f_{calib}$ and must decide which data to train it on.
  • Training it on the training set itself causes overfitting
  • Training it with (internal) cross-validation is slower, but safer

Platt Scaling

  • Use a logistic sigmoid for $f_{calib}$: $$f_{platt}=\frac{1}{1+\exp(-w \cdot s(x)-b)}$$

  • Basically learning a 1d logistic regression (+ some tricks)

  • Works well for SVMs

Isotonic regression

  • A very flexible way to specify $f_{calib}$
  • Learns arbitrary monotonically increasing step-functions in 1d.
  • Groups data into constant parts, steps in between.
  • Optimum monotone function on training data (wrt MSE)
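
Both approaches are available in scikit-learn through CalibratedClassifierCV; a minimal sketch, with the dataset and base model chosen only for illustration:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=0)

# Platt scaling (method='sigmoid') or isotonic regression (method='isotonic'),
# fitted on internal cross-validation folds of the training data
for method in ['sigmoid', 'isotonic']:
    calibrated = CalibratedClassifierCV(LinearSVC(), method=method, cv=3)
    calibrated.fit(X_tr, y_tr)
    print(method, calibrated.predict_proba(X_te)[:2])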


Taking uncertainty into account

  • Remember that many classifiers actually return a probability per class
    • We can retrieve it with decision_function and predict_proba
  • For binary classification, we threshold at 0 for decision_function and 0.5 for predict_proba by default
  • However, depending on the evaluation measure, you may want to threshold differently to fit your goals
    • For instance, when a FP is much worse than a FN
    • This is called threshold calibration
  • Imagine that we want to avoid misclassifying a positive (red) point
  • Points within decision boundary (black line) are classified positive
  • Lowering the decision threshold (bottom figure): fewer FN, more FP
In [30]:
plt.rcParams['figure.dpi'] = 80 
mglearn.plots.plot_decision_threshold()
  • Studying the classification report, we see that lowering the threshold yields:
    • higher recall for class 1 (we risk more FPs in exchange for more TP)
    • lower precision for class 1
  • We can often trade off precision for recall
In [31]:
plt.rcParams['figure.dpi'] = 100 
print("Threshold 0")
print(classification_report(y_test, svc.predict(X_test)))
Threshold 0
              precision    recall  f1-score   support

           0       0.91      0.96      0.93        96
           1       0.67      0.47      0.55        17

    accuracy                           0.88       113
   macro avg       0.79      0.71      0.74       113
weighted avg       0.87      0.88      0.88       113

In [32]:
print("Threshold -0.8")
y_pred_lower_threshold = svc.decision_function(X_test) > -.8
print(classification_report(y_test, y_pred_lower_threshold))  
Threshold -0.8
              precision    recall  f1-score   support

           0       0.98      0.92      0.95        96
           1       0.65      0.88      0.75        17

    accuracy                           0.91       113
   macro avg       0.81      0.90      0.85       113
weighted avg       0.93      0.91      0.92       113

Precision-Recall curves

  • As we've seen, you can trade off precision for recall by changing the decision threshold
  • The best trade-off depends on your application, driven by real-world goals.
    • You can have arbitrarily high recall, but you often want reasonable precision, too.
  • Plotting precision against recall for all possible thresholds yields a precision-recall curve
  • It helps answer multiple questions:
    • Threshold calibration: what's the best achievable precision-recall tradeoff?
    • How much more precision can I gain without losing too much recall?
    • Which models offer the best trade-offs?
In [33]:
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(
    y_test, svc.decision_function(X_test)) 
  • The default threshold (threshold zero) gives a certain precision and recall
    • Lower the threshold to gain higher recall (move right)
    • Increase the threshold to gain higher precision (move left)
  • The curve is often jagged: increasing the threshold leaves fewer and fewer positive predictions, so precision ($\frac{TP}{TP+FP}$) can change dramatically
  • The closer the curve stays to the upper-right corner, the better
  • Here, it is possible to still get a precision of 0.55 with recall 0.9
In [34]:
# create a similar dataset as before, but with more samples
# to get a smoother curve
X, y = make_blobs(n_samples=(4000, 500), centers=2, cluster_std=[7.0, 2],
                  random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(gamma=.05).fit(X_train, y_train)

precision, recall, thresholds = precision_recall_curve(
    y_test, svc.decision_function(X_test))
# find threshold closest to zero:
close_zero = np.argmin(np.abs(thresholds))
plt.plot(recall[close_zero], precision[close_zero], 'o', markersize=10,
         label="threshold zero", fillstyle="none", c='k', mew=2)

plt.plot(recall, precision, label="precision recall curve")
plt.ylabel("Precision")
plt.xlabel("Recall")
plt.legend(loc="best");

Model selection

  • Different classifiers offer different trade-offs
  • RandomForest (in red) performs better at the extremes, SVM better in center
  • In applications we may only care about a specific region (e.g. very high recall)
In [35]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0, max_features=2)
rf.fit(X_train, y_train)

# RandomForestClassifier has predict_proba, but not decision_function
# Only pass probabilities for the positive class
precision_rf, recall_rf, thresholds_rf = precision_recall_curve(
    y_test, rf.predict_proba(X_test)[:, 1])

plt.plot(recall, precision, label="svc")

plt.plot(recall[close_zero], precision[close_zero], 'o', markersize=10,
         label="threshold zero svc", fillstyle="none", c='k', mew=2)

plt.plot(recall_rf, precision_rf, label="rf")

close_default_rf = np.argmin(np.abs(thresholds_rf - 0.5))
plt.plot( recall_rf[close_default_rf], precision_rf[close_default_rf], '^', c='k',
         markersize=10, label="threshold 0.5 rf", fillstyle="none", mew=2)
plt.ylabel("Precision")
plt.xlabel("Recall")
plt.legend(loc="best");

AUPRC

  • The area under the precision-recall curve (AUPRC) is often used as a general evaluation measure (scikit-learn's average_precision_score computes a closely related summary)
  • This is a good general measure, but also hides the subtleties we saw in the curve
In [36]:
from sklearn.metrics import average_precision_score
ap_rf = average_precision_score(y_test, rf.predict_proba(X_test)[:, 1])
ap_svc = average_precision_score(y_test, svc.decision_function(X_test))
print("Average precision of random forest: {:.3f}".format(ap_rf))
print("Average precision of svc: {:.3f}".format(ap_svc))
Average precision of random forest: 0.660
Average precision of svc: 0.666

Receiver Operating Characteristics (ROC)

  • We can also trade off recall (or true positive rate) $\textit{TPR}= \frac{TP}{TP + FN}$ with false positive rate $\textit{FPR} = \frac{FP}{FP + TN}$
  • Varying the decision threshold yields the ROC curve
    • Lower the threshold to gain more recall (move right)
    • Increase the threshold to reduce FPs (move left)
In [37]:
plt.rcParams['savefig.dpi'] = 100 # Use 300 for PDF, 100 for slides
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, svc.decision_function(X_test))

plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# find threshold closest to zero:
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
         label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4);

Visualization

  • Horizontal axis represents the decision function. Every predicted point is on it.
  • The blue probability density shows the actual negative points. The red one is for the positive points.
  • Vertical line is the decision threshold: every point to the left is predicted negative (TN or FN) and vice versa (TP or FP).
  • Increasing the threshold yields fewer FPs and TPs: the point on the ROC curve moves down and to the left


Model selection

  • Again, we can compare multiple models by looking at the ROC curves
  • We can calibrate the threshold depending on whether we need high recall or low FPR
  • We can select between algorithms (or hyperparameters) depending on the involved costs
In [38]:
from sklearn.metrics import roc_curve
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])

plt.plot(fpr, tpr, label="ROC Curve SVC")
plt.plot(fpr_rf, tpr_rf, label="ROC Curve RF")

plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
         label="threshold zero SVC", fillstyle="none", c='k', mew=2)
close_default_rf = np.argmin(np.abs(thresholds_rf - 0.5))
plt.plot(fpr_rf[close_default_rf], tpr_rf[close_default_rf], '^', markersize=10,
         label="threshold 0.5 RF", fillstyle="none", c='k', mew=2)

plt.legend(loc=4);

Calculating costs

  • A certain amount of FP and FN can be translated to a cost: $$\text{total cost} = \text{FPR} * cost_{FP} + (1-\text{TPR}) * cost_{FN}$$
  • This yields different isometrics (lines of equal cost) in ROC space
  • The optimal threshold is the point on the ROC curve where the cost is minimal
In [39]:
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

# Cost function
def cost(fpr, tpr, cost_FN, cost_FP):
    return fpr * cost_FP + (1 - tpr) * cost_FN;

@interact
def plot_isometrics(c_FN=(1,10.0,1.0), c_FP=(1,10.0,1.0)):
    plt.rcParams['savefig.dpi'] = 100 # Use 300 for PDF, 100 for slides
    from sklearn.metrics import roc_curve
    fpr, tpr, thresholds = roc_curve(y_test, svc.decision_function(X_test))

    # get minimum
    costs = [cost(fpr[x],tpr[x],c_FN,c_FP) for x in range(len(thresholds))]
    min_cost = np.min(costs)
    min_thres = np.argmin(costs)

    # plot contours
    x = np.arange(0.0, 1.1, 0.1)
    y = np.arange(0.0, 1.1, 0.1)
    Xp, Yp = np.meshgrid(x, y)
    costs = [cost(f, t, c_FN, c_FP) for f, t in zip(Xp,Yp)]

    plt.plot(fpr, tpr, label="ROC Curve")
    levels = np.linspace(np.array(costs).min(), np.array(costs).max(), 10)
    levels = np.sort(np.append(levels, min_cost))
    CS = plt.contour(Xp, Yp, costs, levels)
    plt.clabel(CS, inline=1, fontsize=10)

    plt.xlabel("FPR")
    plt.ylabel("TPR (recall)")
    # find threshold closest to zero:
    plt.plot(fpr[min_thres], tpr[min_thres], 'o', markersize=4,
             label="optimal", fillstyle="none", c='k', mew=2)
    plt.legend(loc=4);
    plt.title("Isometrics, cost_FN: {}, cost_FP: {}".format(c_FN, c_FP))
    plt.show()
In [40]:
for cFP in [1, 10]:
    plot_isometrics(1,cFP)

Area under the ROC curve

  • A useful summary measure is the area under the ROC curve (AUROC or AUC)
  • Key benefit: robust under class imbalance (unlike accuracy)
    • Random guessing always yields TPR=FPR
    • All points are on the diagonal line, hence an AUC of 0.5
In [41]:
from sklearn.metrics import roc_auc_score
from sklearn.dummy import DummyClassifier
rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
svc_auc = roc_auc_score(y_test, svc.decision_function(X_test))
dummy = DummyClassifier().fit(X_train, y_train)
dummy_auc = roc_auc_score(y_test, dummy.predict_proba(X_test)[:, 1])
print("AUC for Random Forest: {:.3f}".format(rf_auc))
print("AUC for SVC: {:.3f}".format(svc_auc))
print("AUC for dummy classifier: {:.3f}".format(dummy_auc))
AUC for Random Forest: 0.937
AUC for SVC: 0.916
AUC for dummy classifier: 0.498

Example: unbalanced dataset (10% positive, 90% negative):

  • 3 models: overfitting ($\gamma=1$), good ($\gamma=0.01$), underfitting ($\gamma=10^{-5}$)
  • ACC is the same (we might be random guessing), AUC is more informative
In [42]:
from sklearn.datasets import load_digits

digits = load_digits()
y = digits.target == 9

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, y, random_state=0)

plt.figure()  

for gamma in [1, 0.01, 0.00001]:
    svc = SVC(gamma=gamma).fit(X_train, y_train)
    accuracy = svc.score(X_test, y_test)
    auc = roc_auc_score(y_test, svc.decision_function(X_test))
    fpr, tpr, _ = roc_curve(y_test , svc.decision_function(X_test))
    print("gamma = {:.1e}  ACC = {:.2f}  AUC = {:.4f}".format(
            gamma, accuracy, auc))
    plt.plot(fpr, tpr, label="gamma={:.1e}".format(gamma), lw=2)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.xlim(-0.01, 1)
plt.ylim(0, 1.02)
plt.legend(loc="best");
gamma = 1.0e+00  ACC = 0.90  AUC = 0.5000
gamma = 1.0e-02  ACC = 0.90  AUC = 0.9995
gamma = 1.0e-05  ACC = 0.90  AUC = 0.9882

Multiclass classification

Common technique: one-vs-rest approach:

  • A binary model is learned for each class vs. all other classes
  • Creates as many binary models as there are classes
In [43]:
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(random_state=42)
linear_svm = LinearSVC().fit(X, y)

mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
line = np.linspace(-15, 15)
for coef, intercept, color in zip(linear_svm.coef_, linear_svm.intercept_,
                                  mglearn.cm3.colors):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c=color)
plt.ylim(-10, 15)
plt.xlim(-10, 8)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(['Class 0', 'Class 1', 'Class 2', 'Line class 0', 'Line class 1',
            'Line class 2'], loc=(1.01, 0.3));

Every binary classifier makes a prediction

  • The confidence (decision score) of that prediction is the confidence in that class
  • The class with the highest decision score (>0) wins
  • Decision boundaries visualized below
In [44]:
mglearn.plots.plot_2d_classification(linear_svm, X, fill=True, alpha=.7)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
line = np.linspace(-15, 15)
for coef, intercept, color in zip(linear_svm.coef_, linear_svm.intercept_,
                                  mglearn.cm3.colors):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c=color)
plt.legend(['Class 0', 'Class 1', 'Class 2', 'Line class 0', 'Line class 1',
            'Line class 2'], loc=(1.01, 0.3))
plt.xlabel("Feature 0")
plt.ylabel("Feature 1");

Uncertainty in multi-class classification

  • decision_function and predict_proba also work in the multiclass setting
  • always have shape (n_samples, n_classes)
  • Example on the Iris dataset, which has 3 classes:
In [45]:
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

lr2 = LogisticRegression()
lr2 = lr2.fit(X_train, y_train)

print("Decision function:\n{}".format(lr2.decision_function(X_test)[:6, :]))
# show the first few entries of predict_proba
print("Predicted probabilities:\n{}".format(lr2.predict_proba(X_test)[:6]))
Decision function:
[[ -3.035   2.294   0.741]
 [  5.919   3.091  -9.01 ]
 [-10.052   1.875   8.177]
 [ -2.733   2.036   0.697]
 [ -3.737   2.476   1.262]
 [  6.036   3.035  -9.07 ]]
Predicted probabilities:
[[0.004 0.822 0.174]
 [0.944 0.056 0.   ]
 [0.    0.002 0.998]
 [0.007 0.787 0.206]
 [0.002 0.77  0.229]
 [0.953 0.047 0.   ]]

Multi-class metrics

  • Multiclass metrics are derived from binary metrics, averaged over all classes
  • Example: handwritten digit recognition (MNIST)
In [46]:
from sklearn.metrics import accuracy_score
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
pred = lr.predict(X_test)
In [47]:
scores_image = mglearn.tools.heatmap(
    confusion_matrix(y_test, pred), xlabel='Predicted label',
    ylabel='True label', xticklabels=digits.target_names,
    yticklabels=digits.target_names, cmap=plt.cm.gray_r, fmt="%d")    
plt.title("Confusion matrix")
plt.gca().invert_yaxis()

Precision, recall, F1-score now yield 10 per-class scores

In [48]:
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        37
           1       0.89      0.93      0.91        43
           2       0.98      0.91      0.94        44
           3       0.91      0.96      0.93        45
           4       0.97      0.97      0.97        38
           5       0.98      0.96      0.97        48
           6       1.00      0.98      0.99        52
           7       0.98      0.96      0.97        48
           8       0.91      0.90      0.91        48
           9       0.90      0.96      0.93        47

    accuracy                           0.95       450
   macro avg       0.95      0.95      0.95       450
weighted avg       0.95      0.95      0.95       450

Different ways to compute average

  • macro-averaging: computes the unweighted mean of the per-class scores: $\frac{1}{n}\sum_{i=1}^{n}score_i$

    • Use when you care about each class equally much
  • weighted averaging: scores are weighted by the relative size of the classes (support): $\sum_{i=1}^{n}w_i \, score_i$, with $w_i$ the fraction of samples in class $i$

    • Use when data is imbalanced
  • micro-averaging: computes the total number of FP, FN, TP over all classes, then computes scores using these counts: $recall = \frac{\sum_{i=1}^{n}TP_i}{\sum_{i=1}^{n}TP_i + \sum_{i=1}^{n}FN_i}$

    • Use when you care about each sample equally much
In [49]:
print("Micro average f1 score: {:.3f}".format(f1_score(y_test, pred, average="micro")))
print("Weighted average f1 score: {:.3f}".format(f1_score(y_test, pred, average="weighted")))
print("Macro average f1 score: {:.3f}".format(f1_score(y_test, pred, average="macro")))
Micro average f1 score: 0.951
Weighted average f1 score: 0.951
Macro average f1 score: 0.952

Multi-class ROC

  • To use AUC in a multi-class setting, you need to choose between micro- and macro-averaged TPR and FPR
  • Depends on the application: is every class equally important?
    • scikit-learn currently doesn't offer a default option
In [50]:
from itertools import cycle

from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from scipy import interp
from sklearn.metrics import roc_auc_score

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute AUC
mean_tpr /= n_classes

fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)

plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)

colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

Regression metrics

Most commonly used are

  • (root) mean squared error: $\frac{\sum_{i}(y_{pred_i}-y_{actual_i})^2}{n}$
  • mean absolute error: $\frac{\sum_{i}|y_{pred_i}-y_{actual_i}|}{n}$
    • Less sensitive to outliers and large errors


  • R squared (r2): $1 - \frac{\color{blue}{\sum_{i}(y_{pred_i}-y_{actual_i})^2}}{\color{red}{\sum_{i}(y_{mean}-y_{actual_i})^2}}$
    • Ratio of variation explained by the model / total variation
    • Between 0 and 1, but negative if the model is worse than just predicting the mean
    • Easier to interpret (higher is better).
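
A minimal sketch computing these three metrics with scikit-learn; the dataset and model mirror the plotting example below but are otherwise arbitrary choices:

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X_b, y_b = load_boston(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_b, y_b, random_state=0)
pred = Ridge().fit(X_tr, y_tr).predict(X_te)
print("MSE: {:.2f}".format(mean_squared_error(y_te, pred)))
print("MAE: {:.2f}".format(mean_absolute_error(y_te, pred)))
print("R^2: {:.2f}".format(r2_score(y_te, pred)))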


Visualizing errors

  • Prediction plot (left): predicted vs actual target values
  • Residual plot (right): residuals vs actual target values
    • Over- and underpredictions can be given different costs
In [51]:
from sklearn.linear_model import Ridge
from sklearn.datasets import load_boston
boston = load_boston()

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target)

pred = Ridge(normalize=True).fit(X_train, y_train).predict(X_test)

plt.subplot(1, 2, 1)
plt.gca().set_aspect("equal")
plt.plot([10, 50], [10, 50], '--', c='k')
plt.plot(y_test, pred, 'o', alpha=.5)
plt.ylabel("predicted")
plt.xlabel("true");

plt.subplot(1, 2, 2)
plt.gca().set_aspect("equal")
plt.plot([10, 50], [0,0], '--', c='k')
plt.plot(y_test, y_test - pred, 'o', alpha=.5)
plt.xlabel("true")
plt.ylabel("true - predicted")
plt.tight_layout();

Other considerations

  • There exist techniques to correct label imbalance (see next lecture)
    • Undersample the majority class, or oversample the minority class
    • SMOTE (Synthetic Minority Oversampling TEchnique) adds artificial training points by interpolating existing minority class points
      • Think twice before creating 'artificial' training data
  • Cost-sensitive classification (not in sklearn)
    • Cost matrix: a confusion matrix with a cost associated with every possible type of error
    • Some algorithms allow optimizing on these costs instead of their usual loss function
    • Meta-cost: builds ensemble of models by relabeling training sets to match a given cost matrix
      • Black-box: can make any algorithm cost sensitive (but slower and less accurate)
  • There are many more metrics to choose from
    • Balanced accuracy: accuracy where each sample is weighted according to the inverse prevalence of its true class
      • Identical to macro-averaged recall
    • Cohen's Kappa: accuracy, taking into account the possibility of predicting the right class by chance
      • 1: perfect prediction, 0: random prediction, negative: worse than random
      • With $p_0$ = accuracy, and $p_e$ = accuracy of random classifier: $$\kappa = \frac{p_o - p_e}{1 - p_e}$$
    • Matthews correlation coefficient: another measure that can be used on imbalanced data
      • 1: perfect prediction, 0: random prediction, -1: inverse prediction $$MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$
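
A minimal sketch computing these alternative metrics on a small, made-up imbalanced example:

import numpy as np
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             matthews_corrcoef)

# made-up imbalanced example: 90 negatives, 10 positives;
# the classifier finds only half of the positives
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 95 + [1] * 5)
print("Balanced accuracy: {:.2f}".format(balanced_accuracy_score(y_true, y_pred)))
print("Cohen's kappa:     {:.2f}".format(cohen_kappa_score(y_true, y_pred)))
print("MCC:               {:.2f}".format(matthews_corrcoef(y_true, y_pred)))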

Bias-Variance decomposition

  • When we repeat evaluation procedures multiple times, we can distinguish two sources of errors:
    • Bias: systematic error (independent of the training sample). The classifier always gets certain points wrong
    • Variance: error due to variability of the model with respect to the training sample. The classifier predicts some points accurately on some training sets, but inaccurately on others.
  • There is also an intrinsic (noise) error, but there's nothing we can do about that.
  • Bias is associated with underfitting, and variance with overfitting
  • Bias-variance trade-off: you can often exchange variance for bias through regularization (and vice versa)
    • The challenge is to find the right trade-off (minimizing total error)
  • Useful to understand how to tune or adapt learning algorithm


Computing bias-variance

  • Take 100 or more bootstraps (or shuffle-splits)
  • Regression: for each data point x:
    • $bias(x)^2 = (x_{true} - mean(x_{predicted}))^2$
    • $variance(x) = var(x_{predicted})$
  • Classification: for each data point x:
    • $bias(x)$ = misclassification ratio
    • $variance(x) = (1 - (P(class_1)^2 + P(class_2)^2))/2$
      • $P(class_i)$ is ratio of class $i$ predictions
  • Total bias: $\sum_{x} bias(x)^2 * w_x$, with $w_x$ the ratio of x occurring in the test sets
  • Total variance: $\sum_{x} variance(x) * w_x$

Bias and variance reduction

  • Bagging (RandomForests) is a variance-reduction technique
  • Boosting is a bias-reduction technique
In [52]:
from sklearn.model_selection import ShuffleSplit, train_test_split

# Bias-Variance Computation 
def compute_bias_variance(clf, X, y):
    # Bootstraps
    n_repeat = 40 # 40 is on the low side to get a good estimate. 100 is better.
    shuffle_split = ShuffleSplit(test_size=0.33, n_splits=n_repeat, random_state=0)

    # Store sample predictions
    y_all_pred = [[] for _ in range(len(y))]

    # Train classifier on each bootstrap and score predictions
    for i, (train_index, test_index) in enumerate(shuffle_split.split(X)):
        # Train and predict
        clf.fit(X[train_index], y[train_index])
        y_pred = clf.predict(X[test_index])

        # Store predictions
        for j,index in enumerate(test_index):
            y_all_pred[index].append(y_pred[j])

    # Compute bias, variance, error
    bias_sq = sum([ (1 - x.count(y[i])/len(x))**2 * len(x)/n_repeat 
                for i,x in enumerate(y_all_pred)])
    var = sum([((1 - ((x.count(0)/len(x))**2 + (x.count(1)/len(x))**2))/2) * len(x)/n_repeat
               for i,x in enumerate(y_all_pred)])
    error = sum([ (1 - x.count(y[i])/len(x)) * len(x)/n_repeat 
            for i,x in enumerate(y_all_pred)])

    return np.sqrt(bias_sq), var, error

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
cancer = load_breast_cancer()

def plot_bias_variance_rf(clf, X, y):
    bias_scores = []
    var_scores = []
    err_scores = []
    n_estimators= [1, 2, 4, 8, 16, 32, 64, 128, 256]

    for i in n_estimators:
        b,v,e = compute_bias_variance(clf.set_params(random_state=0,n_estimators=i),X,y)
        bias_scores.append(b)
        var_scores.append(v)
        err_scores.append(e)

    plt.figure(figsize=(5,2))
    plt.rcParams.update({'font.size': 12})
    plt.suptitle(clf.__class__.__name__)
    plt.plot(n_estimators, var_scores,label ="variance", lw=2 )
    plt.plot(n_estimators, np.square(bias_scores),label ="bias^2", lw=2 )
    plt.plot(n_estimators, err_scores,label ="error", lw=2 )
    plt.xscale('log',basex=2)
    plt.xlabel("n_estimators")
    plt.legend(loc="best")
    plt.show()

X, y = cancer.data, cancer.target
rf = RandomForestClassifier(random_state=0, n_estimators=512, n_jobs=-1)
plot_bias_variance_rf(rf, X, y)

X, y = cancer.data, cancer.target
ab = AdaBoostClassifier(random_state=0, n_estimators=512)
plot_bias_variance_rf(ab, X, y)

Bias-variance and overfitting


  • High bias means that you are likely underfitting
    • Do less regularization
    • Use a more flexible/complex model (another algorithm)
    • Use a bias-reduction technique (e.g. boosting)
  • High variance means that you are likely overfitting
    • Use more regularization
    • Get more data
    • Use a simpler model (another algorithm)
    • Use a variance-reduction techniques (e.g. bagging)

Bias-Variance Flowchart (Andrew Ng, Coursera)

Hyperparameter tuning

We can basically use any optimization technique to optimize hyperparameters:

  • Grid search
  • Random search

More advanced techniques (see lecture 7):

  • Local search
  • Racing algorithms
  • Model-based optimization (see later)
  • Multi-armed bandits
  • Genetic algorithms
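
Grid and random search (listed above) are available in scikit-learn as GridSearchCV and RandomizedSearchCV; a minimal sketch with an illustrative parameter grid for an SVM:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

iris = load_iris()
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

# grid search tries all 36 combinations; random search samples 10 of them
grid = GridSearchCV(SVC(), param_grid, cv=5).fit(iris.data, iris.target)
rand = RandomizedSearchCV(SVC(), param_grid, n_iter=10, cv=5,
                          random_state=0).fit(iris.data, iris.target)
print("Grid search:  ", grid.best_params_, "{:.3f}".format(grid.best_score_))
print("Random search:", rand.best_params_, "{:.3f}".format(rand.best_score_))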

Overfitting on the test set

  • Simply taking the best performing hyperparameters yields optimistic results
  • We've already used the test data to evaluate each hyperparameter setting!
    • Information 'leaks' from test set into the final model
  • Set aside part of the training data to evaluate the hyperparameter settings
    • Select best hyperparameters on validation set
    • Rebuild the model on the training+validation set
    • Evaluate optimal model on the test set
In [53]:
mglearn.plots.plot_threefold_split()
In [54]:
from sklearn.svm import SVC
# split data into train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
# split train+validation set into training and validation set
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=1)
print("Size of training set: {}   size of validation set: {}   size of test set:"
      " {}\n".format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))

best_score = 0

for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters
        # train an SVC
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # evaluate the SVC on the validation set
        score = svm.score(X_valid, y_valid)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}

# rebuild a model on the combined training and validation set,
# and evaluate it on the test set
svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print("Best score on validation set: {:.2f}".format(best_score))
print("Best parameters: ", best_parameters)
print("Test set score with best parameters: {:.2f}".format(test_score))
Size of training set: 84   size of validation set: 28   size of test set: 38

Best score on validation set: 0.96
Best parameters:  {'C': 10, 'gamma': 0.001}
Test set score with best parameters: 0.92

Grid-search with cross-validation

  • The way that we split the data into training, validation, and test set may have a large influence on estimated performance
  • We need to use cross-validation again (e.g. 3-fold CV), instead of a single split
In [55]:
plt.rcParams['figure.dpi'] = 70 # Avoid overlapping boxes
mglearn.plots.plot_grid_search_overview()

Nested cross-validation

  • Note that we are still using a single split to create the outer test set
  • We can also use cross-validation here
  • Nested cross-validation:
    • Outer loop: split data in training and test sets
    • Inner loop: run grid search, splitting the training data into train and validation sets
  • Result is just a list of scores
    • There will be multiple optimized models and hyperparameter settings (not returned)
  • To apply on future data, we need to train GridSearchCV on all data again
scores = cross_val_score(GridSearchCV(SVC(), param_grid, cv=5),
                         iris.data, iris.target, cv=5)
In [57]:
from sklearn.model_selection import GridSearchCV, cross_val_score
# same parameter grid as in the manual search above
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
scores = cross_val_score(GridSearchCV(SVC(), param_grid, cv=5),
                         iris.data, iris.target, cv=5)
print("Cross-validation scores: ", scores)
print("Mean cross-validation score: ", scores.mean())