Lecture 4: Model Selection#

Can I trust you?

Joaquin Vanschoren

Hide code cell source
# Auto-setup when running on Google Colab
import os
if 'google.colab' in str(get_ipython()) and not os.path.exists('/content/master'):
    !git clone -q https://github.com/ML-course/master.git /content/master
    !pip --quiet install -r /content/master/requirements_colab.txt
    %cd master/notebooks

# Global imports and settings
%matplotlib inline
from preamble import *
interactive = True # Set to True for interactive plots
if interactive:
    fig_scale = 0.9
    plt.rcParams.update(print_config)
else:
    fig_scale = 1.25

Evaluation#

  • To know whether we can trust our method or system, we need to evaluate it.

  • Model selection: choose between different models in a data-driven way.

    • If you cannot measure it, you cannot improve it.

  • Convince others that your work is meaningful

    • Peers, leadership, clients, yourself(!)

  • When possible, try to interpret what your model has learned

    • The signal your model found may just be an artifact of your biased data

    • See ‘Why Should I Trust You?’ by Marco Ribeiro et al.


Designing Machine Learning systems#

  • Just running your favourite algorithm is usually not a great way to start

  • Consider the problem: How to measure success? Are there costs involved?

    • Do you want to understand phenomena or do black box modelling?

  • Analyze your model’s mistakes. Don’t just finetune endlessly.

    • Build early prototypes. Should you collect more data, or different kinds of data?

    • Should the task be reformulated?

  • Overly complex machine learning systems are hard to maintain

    • See ‘Machine Learning: The High Interest Credit Card of Technical Debt’


Real world evaluations#

  • Evaluate predictions, but also how outcomes improve because of them

  • Beware of feedback loops: predictions can influence future input data

    • Medical recommendations, spam filtering, trading algorithms,…

  • Evaluate algorithms in the wild.

    • A/B testing: split users in groups, test different models in parallel

    • Bandit testing: gradually direct more users to the winning system


Performance estimation techniques#

  • Always evaluate models as if they are predicting future data

  • We do not have access to future data, so we pretend that some data is hidden

  • Simplest way: the holdout (simple train-test split)

    • Randomly split data (and corresponding labels) into training and test set (e.g. 75%-25%)

    • Train (fit) a model on the training data, score on the test data
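
Below is a minimal sketch of the holdout procedure in scikit-learn, using the iris data purely for illustration (the variable names are not part of the lecture code):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
# Randomly hold out 25% of the data as a test set (train_test_split's default)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # fit on the training data
print("Test set accuracy: {:.2f}".format(clf.score(X_te, y_te)))  # score on the test data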

Hide code cell source
from sklearn.model_selection import (TimeSeriesSplit, KFold, ShuffleSplit, train_test_split,
                                     StratifiedKFold, GroupShuffleSplit,
                                     GroupKFold, StratifiedShuffleSplit)
from matplotlib.patches import Patch
np.random.seed(1338)
cmap_data = plt.cm.brg
cmap_group = plt.cm.Paired
cmap_cv = plt.cm.coolwarm
n_splits = 4

# Generate the class/group data
n_points = 100
X = np.random.randn(100, 10)

percentiles_classes = [.1, .3, .6]
y = np.hstack([[ii] * int(100 * perc)
               for ii, perc in enumerate(percentiles_classes)])

# Evenly spaced groups repeated once
rng = np.random.RandomState(42)
group_prior = rng.dirichlet([2]*10)
rng.multinomial(100, group_prior)
groups = np.repeat(np.arange(10), rng.multinomial(100, group_prior))
Hide code cell source
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
fig, ax = plt.subplots(figsize=(8*fig_scale, 3*fig_scale))
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train,
                         markers='o', ax=ax)
mglearn.discrete_scatter(X_test[:, 0], X_test[:, 1], y_test,
                         markers='^', ax=ax)
ax.legend(["Train class 0", "Train class 1", "Train class 2", "Test class 0",
                "Test class 1", "Test class 2"], ncol=6,  loc=(-0.1, 1.1));
../_images/81bf44d915aa084852f67ae33214c71ed6cecb8b0022255ecab931dab9cc8a79.png

K-fold Cross-validation#

  • Each random split can yield very different models (and scores)

    • e.g. all easy (or hard) examples could end up in the test set

  • Split data into k equal-sized parts, called folds

    • Create k splits, each time using a different fold as the test set

  • Compute k evaluation scores, aggregate afterwards (e.g. take the mean)

  • Examine the score variance to see how sensitive (unstable) models are

  • Large k gives better estimates (more training data), but is expensive
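
A minimal sketch of k-fold cross-validation with cross_val_score, aggregating the k scores afterwards (the dataset and model here are chosen just for illustration):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())
# 5 folds, shuffled once up front so each fold mixes easy and hard examples
scores = cross_val_score(model, X_bc, y_bc, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("Scores per fold:", np.round(scores, 3))
print("Mean: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))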

Hide code cell source
def plot_cv_indices(cv, X, y, group, ax, lw=2, show_groups=False, s=700, legend=True):
    """Create a sample plot for indices of a cross-validation object."""
    n_splits = cv.get_n_splits(X, y, group)

    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        # Visualize the results
        ax.scatter([n_splits - ii - 1] * len(indices), range(len(indices)),
                   c=indices, marker='_', lw=lw, cmap=cmap_cv,
                   vmin=-.2, vmax=1.2, s=s)

    # Plot the data classes and groups at the end
    ax.scatter([-1] * len(X), range(len(X)), 
               c=y, marker='_', lw=lw, cmap=cmap_data, s=s)
    yticklabels = ['class'] + list(range(1, n_splits + 1))
    
    if show_groups:
        ax.scatter([-2] * len(X), range(len(X)), 
                   c=group, marker='_', lw=lw, cmap=cmap_group, s=s)
        yticklabels.insert(0, 'group')

    # Formatting
    ax.set(xticks=np.arange(-1 - show_groups, n_splits), xticklabels=yticklabels,
            ylabel='Sample index', xlabel="CV iteration",
            xlim=[-1.5 - show_groups, n_splits+.2], ylim=[-6, 100])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)
    if legend:
        ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.2))],
                  ['Testing set', 'Training set'], loc=(1.02, .8))
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.set_yticks(())
    return ax
Hide code cell source
fig, ax = plt.subplots(figsize=(6*fig_scale, 3*fig_scale))
cv = KFold(5)
plot_cv_indices(cv, X, y, groups, ax, s=700);
../_images/2c8cd672a562f876164b588a2e11cf8f08d1ccb03f4c9506a53da37e88128b02.png

Can you explain this result?

kfold = KFold(n_splits=3)
cross_val_score(logreg, iris.data, iris.target, cv=kfold)
Hide code cell source
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression()
kfold = KFold(n_splits=3)
print("Cross-validation scores KFold(n_splits=3):\n{}".format(
      cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
Cross-validation scores KFold(n_splits=3):
[0. 0. 0.]
Hide code cell source
fig, ax = plt.subplots(figsize=(6*fig_scale, 3*fig_scale))
plot_cv_indices(kfold, iris.data, iris.target, iris.target, ax, s=700)
ax.set_ylim((-6, 150));
../_images/38ada49c7737f96f5711365f84e20a863aabfba3493490c3dc33a7521f7691fb.png

Stratified K-Fold cross-validation#

  • If the data is imbalanced, some classes have only a few samples

  • Likely that some classes are not present in the test set

  • Stratification: proportions between classes are conserved in each fold

    • Order examples per class

    • Separate the samples of each class in k sets (strata)

    • Combine corresponding strata into folds
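
A sketch of the fix for the all-zero scores above, reusing the iris data and logreg model from the previous cell: StratifiedKFold keeps all three classes present (in equal proportions) in every fold.

from sklearn.model_selection import StratifiedKFold, cross_val_score

skfold = StratifiedKFold(n_splits=3)  # each fold preserves the 1/3-1/3-1/3 class balance of iris
print("Cross-validation scores StratifiedKFold(n_splits=3):\n{}".format(
      cross_val_score(logreg, iris.data, iris.target, cv=skfold)))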

Hide code cell source
fig, ax = plt.subplots(figsize=(6*fig_scale, 3*fig_scale))
cv = StratifiedKFold(5)
plot_cv_indices(cv, X, y, groups, ax, s=700)
ax.set_ylim((-6, 100));
../_images/297e3b2f9ebb072ab26c65a73548d98df198323561d014155a206934de5b5e82.png

Leave-One-Out cross-validation#

  • k fold cross-validation with k equal to the number of samples

  • Completely unbiased (in terms of data splits), but computationally expensive

  • The estimate actually generalizes less well to unseen data

    • The training sets are strongly correlated (they overlap heavily)

    • It overfits on the single dataset used for the entire evaluation

    • A different sample of the data can yield quite different results

  • Recommended only for small datasets
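
A sketch of leave-one-out evaluation with scikit-learn's LeaveOneOut splitter, again reusing the iris data and logreg model from above:

from sklearn.model_selection import LeaveOneOut, cross_val_score

loo = LeaveOneOut()  # as many folds as there are samples: 150 for iris
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
print("Number of folds:", len(scores))
print("Mean accuracy: {:.3f}".format(scores.mean()))  # each individual score is 0 or 1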

Hide code cell source
fig, ax = plt.subplots(figsize=(20*fig_scale, 4*fig_scale))
cv = KFold(33) # True leave-one-out would use 100 folds (one per sample) here; 33 folds visualizes better.
plot_cv_indices(cv, X, y, groups, ax, s=700)
ax.set_ylim((-6, 100));
../_images/6aca0e60a7cfd6e16a16c941bbfb76c0a5792032a71d1d6cd1212baec3ea5fa7.png

Shuffle-Split cross-validation#

  • Shuffles the data, then randomly samples train_size points as the training set

  • Can also use a smaller test_size, which is handy for very large datasets

  • Never use if the data is ordered (e.g. time series)
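
A sketch with ShuffleSplit, reusing the iris data: 10 random splits that each use 50% of the data for training and 20% for testing (the remaining 30% is simply left out of that iteration).

from sklearn.model_selection import ShuffleSplit, cross_val_score

ss = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.2, random_state=0)
scores = cross_val_score(logreg, iris.data, iris.target, cv=ss)
print("Mean accuracy over 10 random splits: {:.3f}".format(scores.mean()))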

Hide code cell source
fig, ax = plt.subplots(figsize=(6*fig_scale, 3*fig_scale))
cv = ShuffleSplit(8, test_size=.2)
plot_cv_indices(cv, X, y, groups, ax, n_splits, s=700)
ax.set_ylim((-6, 100))
ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.2))],
          ['Testing set', 'Training set'], loc=(.95, .8));
../_images/0ad33c143d5fea912f2292d0be0c16cbcee21203c14562bb479b417f65eec8c7.png

The Bootstrap#

  • Sample n (dataset size) data points, with replacement, as training set (the bootstrap)

    • On average, a bootstrap contains about 63.2% of the distinct data points (the rest of the sample are duplicates)

  • Use the unsampled (out-of-bootstrap) samples as the test set

  • Repeat \(k\) times to obtain \(k\) scores

  • Similar to Shuffle-Split with train_size=0.632, test_size=0.368, except that the bootstrap training set contains duplicates
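
A quick check of the ~63.2% figure: the probability that a given point appears at least once in a bootstrap of size n is \(1 - (1 - 1/n)^n\), which approaches \(1 - 1/e\).

import numpy as np

n = 100
print("Fraction of distinct points for n=100: {:.3f}".format(1 - (1 - 1/n)**n))  # ~0.634
print("Limit 1 - 1/e: {:.3f}".format(1 - 1/np.e))                                # ~0.632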

Hide code cell source
from sklearn.utils import resample

# Toy implementation of bootstrapping
class Bootstrap:
    def __init__(self, nr):
        self.nr = nr
    
    def get_n_splits(self, X, y, groups=None):
        return self.nr
    
    def split(self, X, y, groups=None):
        indices = range(len(X))
        splits = []
        for i in range(self.nr):
            train = resample(indices, replace=True, n_samples=len(X), random_state=i)
            test = list(set(indices) - set(train))
            splits.append((train, test))
        return splits
            
fig, ax = plt.subplots(figsize=(6*fig_scale, 3*fig_scale))
cv = Bootstrap(8)
plot_cv_indices(cv, X, y, groups, ax, n_splits, s=700)
ax.set_ylim((-6, 100))
ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.2))],
          ['Testing set', 'Training set'], loc=(.95, .8));
../_images/e649259157a7caaec6f3395e49a9c90389c85afcc25b38b876a35ec8f31498b2.png

Repeated cross-validation#

  • A single run of cross-validation still depends on how the data happens to be split into folds

  • Repeated, or n-times-k-fold cross-validation:

    • Shuffle data randomly, do k-fold cross-validation

    • Repeat n times, yields n times k scores

  • Greatly reduces the dependence on any single split (very robust), but n times more expensive
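
A sketch with RepeatedStratifiedKFold, reusing the iris data and logreg model: 5 folds repeated 3 times yields 15 scores.

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(logreg, iris.data, iris.target, cv=rkf)
print("{} scores, mean {:.3f} +/- {:.3f}".format(len(scores), scores.mean(), scores.std()))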

Hide code cell source
from sklearn.model_selection import RepeatedStratifiedKFold
from matplotlib.patches import Rectangle
fig, ax = plt.subplots(figsize=(10*fig_scale, 3*fig_scale))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3)
plot_cv_indices(cv, X, y, groups, ax, lw=2, s=400, legend=False)
ax.set_ylim((-6, 102))
xticklabels = ["class"] + [f"{repeat}x{split}" for repeat in range(1, 4) for split in range(1, 6)]
ax.set_xticklabels(xticklabels)
for i in range(3):
    rect = Rectangle((-.5 + i * 5, -2.), 5, 103, edgecolor='k', facecolor='none')
    ax.add_artist(rect)
../_images/5cdc7cf7f25562f4b12fd86cdbd9f4d4a7f4de7f6be5dbcb75dd802392cac07f.png

Cross-validation with groups#

  • Sometimes the data contains inherent groups:

    • Multiple samples from same patient, images from same person,…

  • Data from the same person may end up in the training and test set

  • We want to measure how well the model generalizes to other people

  • Make sure that all data from one person ends up in either the training set or the test set, never both

    • This is called grouping or blocking

    • Leave-one-subject-out cross-validation: test set for each subject/group
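
A sketch of grouped cross-validation on the synthetic X, y and groups arrays generated earlier: the group labels (e.g. patient IDs) are passed to cross_val_score, and GroupKFold keeps each group entirely on one side of every split.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

gkf = GroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, groups=groups, cv=gkf)
print("Scores with grouped splits:", np.round(scores, 3))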

Hide code cell source
fig, ax = plt.subplots(figsize=(6*fig_scale, 3*fig_scale))
cv = GroupKFold(5)
plot_cv_indices(cv, X, y, groups, ax, s=700, show_groups=True)
ax.set_ylim((-6, 100));
../_images/be3d9381a2476f4bcd819a1eb901d2d67472245a1e93b889ffbc5617e23f6a99.png

Time series#

When the data is ordered, random test sets are not a good idea

Hide code cell source
import pandas as pd
approval = pd.read_csv("https://projects.fivethirtyeight.com/trump-approval-data/approval_topline.csv")
adults = approval.groupby("subgroup").get_group('Adults')
adults = adults.set_index('modeldate')[::-1]
adults.approve_estimate.plot()
ax = plt.gca()
plt.rcParams["figure.figsize"] = (12*fig_scale,6*fig_scale)
ax.set_xlabel("")
xlim, ylim = ax.get_xlim(), ax.get_ylim()
for i in range(20):
    rect = Rectangle((np.random.randint(0, xlim[1]), ylim[0]), 10, ylim[1]-ylim[0], facecolor='#FFAAAA')
    ax.add_artist(rect)
plt.title("Presidential approval estimates by fivethirtyeight")
plt.legend([rect], ['Random Test Set'] );
../_images/862cf71645ae19a0ad48daaa159e78c37953bfd7e1ae6c8c4958a28a2b22d209.png

Test-then-train (prequential evaluation)#

  • Every new sample is evaluated only once, then added to the training set

    • Can also be done in batches (of n samples at a time)

  • TimeSeriesSplit

    • In the kth split, the first k folds form the training set and the (k+1)th fold is the test set

    • Often, a maximum training set size (or window) is used

      • more robust against concept drift (change in data over time)
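
A sketch of TimeSeriesSplit on a small ordered toy series: each split trains only on past samples (at most max_train_size of them) and tests on the block that follows.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(30)  # 30 time-ordered samples
tscv = TimeSeriesSplit(n_splits=4, max_train_size=10)
for i, (train_idx, test_idx) in enumerate(tscv.split(t)):
    print("Split {}: train {}-{}, test {}-{}".format(
        i, train_idx[0], train_idx[-1], test_idx[0], test_idx[-1]))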

Hide code cell source
from sklearn.utils import shuffle
fig, ax = plt.subplots(figsize=(6*fig_scale, 3*fig_scale))
cv = TimeSeriesSplit(5, max_train_size=20)
plot_cv_indices(cv, X, shuffle(y), groups, ax, s=700, lw=2)
ax.set_ylim((-6, 100))
ax.set_title("TimeSeriesSplit(5, max_train_size=20)");
../_images/00c5088311c787fa9437fee225ff036b6123edc6ef0d89ba8a5cc30e051babef.png

Choosing a performance estimation procedure#

No strict rules, only guidelines:

  • Always use stratification for classification (sklearn does this by default)

  • Use holdout for very large datasets (e.g. >1,000,000 examples)

    • Or when learners don’t always converge (e.g. deep learning)

  • Choose k depending on dataset size and resources

    • Use leave-one-out for very small datasets (e.g. <100 examples)

    • Use cross-validation otherwise

      • Most popular (and theoretically sound): 10-fold CV

      • Literature suggests 5x2-fold CV is better

  • Use grouping or leave-one-subject-out for grouped data

  • Use test-then-train (prequential) evaluation for time series

Evaluation Metrics for Classification#

Evaluation vs Optimization#

  • Each algorithm optimizes a given objective function (on the training data)

    • E.g. remember L2 loss in Ridge regression $\(\mathcal{L}_{Ridge} = \sum_{n=1}^{N} (y_n-(\mathbf{w}\mathbf{x_n} + w_0))^2 + \alpha \sum_{i=0}^{p} w_i^2\)$

  • The choice of function is limited by what can be efficiently optimized

  • However, we evaluate the resulting model with a score that makes sense in the real world

    • Percentage of correct predictions (on a test set)

    • The actual cost of mistakes (e.g. in money, time, lives,…)

  • We also tune the algorithm’s hyperparameters to maximize that score

Binary classification#

  • We have a positive and a negative class

  • 2 different kinds of errors:

    • False Positive (type I error): model predicts positive while true label is negative

    • False Negative (type II error): model predicts negative while true label is positive

  • They are not always equally important

    • Which side do you want to err on for a medical test?


Confusion matrices#

  • We can represent all predictions (correct and incorrect) in a confusion matrix

    • n by n array (n is the number of classes)

    • Rows correspond to true classes, columns to predicted classes

    • Count how often samples belonging to a class C are classified as C or any other class.

    • For binary classification, we label these true negative (TN), true positive (TP), false negative (FN), false positive (FP)

|            | Predicted Neg | Predicted Pos |
|------------|---------------|---------------|
| Actual Neg | TN            | FP            |
| Actual Pos | FN            | TP            |

Hide code cell source
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

print("confusion_matrix(y_test, y_pred): \n", confusion_matrix(y_test, y_pred))
confusion_matrix(y_test, y_pred): 
 [[48  5]
 [ 5 85]]

Predictive accuracy#

  • Accuracy can be computed based on the confusion matrix

  • Not useful if the dataset is very imbalanced

    • E.g. credit card fraud: is 99.99% accuracy good enough?

\begin{equation} \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \end{equation}

  • 3 models: very different predictions, same accuracy:

def plot_confusion_matrix(values, xlabel="predicted labels", ylabel="true labels", xticklabels=None,
                          yticklabels=None, cmap=None, vmin=None, vmax=None, ax=None,
                          fmt="{:.2f}", xtickrotation=45, norm=None, fsize=8):
    
    if ax is None:
        ax = plt.gca()
    ax.figure.set_size_inches(6*fig_scale, 6*fig_scale)
    values = np.array(values)  # Ensure 'values' is a numpy array for consistent handling
    img = ax.pcolormesh(values, cmap=cmap, vmin=vmin, vmax=vmax, norm=norm)
    ax.set_xlabel(xlabel, fontsize=8)
    ax.set_ylabel(ylabel, fontsize=8)

    # Setting the tick labels
    ax.set_xticks(np.arange(values.shape[1]) + 0.5, minor=False)
    ax.set_yticks(np.arange(values.shape[0]) + 0.5, minor=False)
    ax.set_xticklabels(xticklabels or [], minor=False, rotation=xtickrotation, fontsize=8)
    ax.set_yticklabels(yticklabels or [], minor=False, fontsize=8)

    # Loop over data dimensions and create text annotations.
    for i in range(values.shape[0]):
        for j in range(values.shape[1]):
            ax.text(j + 0.5, i + 0.5, fmt.format(values[i, j]),
                    ha="center", va="center", color="white" if values[i, j] > vmax/2 else "black", fontsize=8)

    ax.set_aspect('equal')  # Optional: set aspect ratio to be equal
    ax.invert_yaxis()
    return ax
Hide code cell source
# Artificial 90-10 imbalanced target
y_true = np.zeros(100, dtype=int)
y_true[:10] = 1
y_pred_1 = np.zeros(100, dtype=int)
y_pred_2 = y_true.copy()
y_pred_2[10:20] = 1
y_pred_3 = y_true.copy()
y_pred_3[5:15] = 1 - y_pred_3[5:15]

def plot_measure(measure):
    fig, axes = plt.subplots(1, 3)
    for i, (ax, y_pred) in enumerate(zip(axes, [y_pred_1, y_pred_2, y_pred_3])):
        plot_confusion_matrix(confusion_matrix(y_true, y_pred), cmap='gray_r', ax=ax,
                              xticklabels=["N", "P"], yticklabels=["N", "P"], xtickrotation=0, vmin=0, vmax=100)
        ax.set_title("{}: {:.2f}".format(measure.__name__,measure(y_true, y_pred)), fontsize=8*fig_scale)

    plt.tight_layout()
Hide code cell source
plot_measure(accuracy_score)
../_images/fac8728555478bd3144d8427440af001837bc8682fb497495343418cdbff4421.png

Precision#

  • Use when the goal is to limit FPs

    • Clinical trials: you only want to test drugs that really work

    • Search engines: you want to avoid bad search results

\begin{equation} \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \end{equation}

Hide code cell source
from sklearn.metrics import precision_score
plot_measure(precision_score)
../_images/5bc3951e25d4334669951b8b799791be1efd3d957640c5524d29737d17a3133a.png

Recall#

  • Use when the goal is to limit FNs

    • Cancer diagnosis: you don’t want to miss a serious disease

    • Search engines: You don’t want to omit important hits

  • Also known as sensitivity, hit rate, or true positive rate (TPR)

\begin{equation} \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \end{equation}

Hide code cell source
from sklearn.metrics import recall_score
plot_measure(recall_score)
../_images/3cac4a582d3dd01a8bc5b0676da74622079a8a846bc51e74ee8da834bf467b56.png

Comparison

F1-score#

  • Trades off precision and recall:

\begin{equation} \text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \end{equation}

Hide code cell source
from sklearn.metrics import f1_score
plot_measure(f1_score)
../_images/0bb92fcc6caf8a8a85ffd7fe30d99a7476294e13e417f4e3542b4ca8a832b397.png

Classification measure Zoo

https://en.wikipedia.org/wiki/Precision_and_recall

Multi-class classification#

  • Train one model per class: that class is viewed as positive, the other(s) as negative, then average the scores (see the sketch after this list)

    • micro-averaging: count total TP, FP, TN, FN (every sample equally important)

      • micro-precision, micro-recall, micro-F1, accuracy are all the same $\(\text{Precision:} \frac{\sum_{c=1}^C\text{TP}_c}{\sum_{c=1}^C\text{TP}_c + \sum_{c=1}^C\text{FP}_c} \xrightarrow{c=2} \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}\)$

    • macro-averaging: average of scores \(R(y_c,\hat{y_c})\) obtained on each class

      • Preferable for imbalanced classes (if all classes are equally important)

      • macro-averaged recall is also called balanced accuracy $\(\frac{1}{C} \sum_{c=1}^C R(y_c,\hat{y_c})\)$

    • weighted averaging (\(w_c\): ratio of examples of class \(c\), aka support): \(\sum_{c=1}^C w_c R(y_c,\hat{y_c})\)
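
A minimal sketch comparing the averaging strategies on a tiny hand-made imbalanced 3-class problem (the labels below are invented for illustration):

from sklearn.metrics import recall_score

y_true_mc = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]  # imbalanced: 6-3-1
y_pred_mc = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]
print("micro-averaged recall (= accuracy):", recall_score(y_true_mc, y_pred_mc, average='micro'))
print("macro-averaged recall (balanced accuracy):", recall_score(y_true_mc, y_pred_mc, average='macro'))
print("weighted-averaged recall:", recall_score(y_true_mc, y_pred_mc, average='weighted'))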

Hide code cell source
from sklearn.metrics import classification_report

def report(y_pred):
    fig = plt.figure()
    ax = plt.subplot(111)
    plot_confusion_matrix(confusion_matrix(y_true, y_pred), cmap='gray_r', ax=ax,
                          xticklabels=["N", "P"], yticklabels=["N", "P"], xtickrotation=0, vmin=0, vmax=100, fsize=2.5)
    ax.figure.set_size_inches(2*fig_scale, 2*fig_scale)
    plt.gcf().text(1.1, 0.2, classification_report(y_true, y_pred), fontsize=8, fontname="Courier")

    plt.tight_layout()
report(y_pred_1)
report(y_pred_2)
report(y_pred_3)
../_images/6d14c120d944214df54db39e43bcad07eb517bd7ddd3446f753b1a38c0a2056c.png ../_images/acd9e1619b5939d9d31cfd85a92300ad741ca74aa96a09f67cd78dd9b54a5294.png ../_images/517a25bcec5525795b79d1a0151ff3e3c7eabc0bbf13fb9c1ebbae0677cdfab7.png

Other useful classification metrics#

  • Cohen’s Kappa

    • Measures ‘agreement’ between different models (aka inter-rater agreement)

    • To evaluate a single model, compare it against a model that does random guessing

      • Similar to accuracy, but taking into account the possibility of predicting the right class by chance

    • Can be weighted: different misclassifications given different weights

    • 1: perfect prediction, 0: random prediction, negative: worse than random

    • With \(p_0\) = accuracy, and \(p_e\) = accuracy of a random classifier: $\(\kappa = \frac{p_0 - p_e}{1 - p_e}\)$

  • Matthews correlation coefficient

    • Corrects for imbalanced data, alternative for balanced accuracy or AUROC

    • 1: perfect prediction, 0: random prediction, -1: inverse prediction $\(MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}\)$
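
Both metrics are available in scikit-learn; a sketch on the three artificial 90-10 predictions defined earlier (y_pred_1, y_pred_2, y_pred_3):

from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

for name, y_p in [("model 1", y_pred_1), ("model 2", y_pred_2), ("model 3", y_pred_3)]:
    print("{}:  kappa {:.2f},  MCC {:.2f}".format(
        name, cohen_kappa_score(y_true, y_p), matthews_corrcoef(y_true, y_p)))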

Probabilistic evaluation#

  • Classifiers can often provide uncertainty estimates of predictions.

  • Remember that linear models actually return a numeric value.

    • When \(\hat{y}<0\), predict class -1, otherwise predict class +1 $\(\hat{y} = w_0 * x_0 + w_1 * x_1 + ... + w_p * x_p + b \)$

  • In practice, you are often interested in how certain a classifier is about each class prediction (e.g. cancer treatments).

  • Most learning methods can return at least one measure of confidence in their predictions.

    • Decision function: floating point value for each sample (higher: more confident)

    • Probability: estimated probability for each class

The decision function#

In the binary classification case, the return value of the decision function encodes how strongly the model believes a data point belongs to the “positive” class.

  • Positive values indicate preference for the positive class.

  • The range can be arbitrary, and can be affected by hyperparameters. Hard to interpret.
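
A quick look at both interfaces, reusing the logistic regression model lr trained on the breast cancer data above:

# Decision values: arbitrary scale, the sign determines the predicted class
print("decision_function:\n", lr.decision_function(X_test[:4]).round(2))
# Probabilities: one column per class, each row sums to 1, thresholded at 0.5 by default
print("predict_proba:\n", lr.predict_proba(X_test[:4]).round(3))
print("predict:\n", lr.predict(X_test[:4]))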

Hide code cell source
# create and split a synthetic dataset
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
Xs, ys = make_blobs(centers=2, cluster_std=2.5, random_state=8)

# we rename the classes "blue" and "red"
ys_named = np.array(["blue", "red"])[ys]

# we can call train test split with arbitrary many arrays
# all will be split in a consistent manner
Xs_train, Xs_test, ys_train_named, ys_test_named, ys_train, ys_test = \
    train_test_split(Xs, ys_named, ys, random_state=0)

# build the logistic regression model
lr = LogisticRegression()
lr.fit(Xs_train, ys_train_named)

fig, axes = plt.subplots(1, 2, figsize=(10*fig_scale, 3.5*fig_scale))
    
mglearn.tools.plot_2d_separator(lr, Xs, ax=axes[0], alpha=.4,
                                fill=False, cm=mglearn.cm2)
scores_image = mglearn.tools.plot_2d_scores(lr, Xs, ax=axes[1],
                                            alpha=.4, cm=mglearn.ReBl)

for ax in axes:
    # plot training and test points
    mglearn.discrete_scatter(Xs_test[:, 0], Xs_test[:, 1], ys_test,
                             markers='^', ax=ax, s=7*fig_scale)
    mglearn.discrete_scatter(Xs_train[:, 0], Xs_train[:, 1], ys_train,
                             markers='o', ax=ax, s=7*fig_scale)
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
cbar = plt.colorbar(scores_image, ax=axes.tolist())
cbar.set_alpha(1)
cbar.ax.tick_params(labelsize=8*fig_scale)
axes[0].legend(["Test class 0", "Test class 1", "Train class 0",
                "Train class 1"], ncol=4, loc=(.1, 1.1));  
../_images/66b57710eec4c55f10717d2d550b8ae9498541b841132b527aaf6f0ae51fc50b.png

Predicting probabilities#

Some models can also return a probability for each class with every prediction. These sum up to 1. We can visualize them again. Note that the gradient looks different now.

Hide code cell source
fig, axes = plt.subplots(1, 2, figsize=(10*fig_scale, 3.5*fig_scale))
    
mglearn.tools.plot_2d_separator(
    lr, Xs, ax=axes[0], alpha=.4, fill=False, cm=mglearn.cm2)
scores_image = mglearn.tools.plot_2d_scores(
    lr, Xs, ax=axes[1], alpha=.5, cm=mglearn.ReBl, function='predict_proba')

for ax in axes:
    # plot training and test points
    mglearn.discrete_scatter(Xs_test[:, 0], Xs_test[:, 1], ys_test,
                             markers='^', ax=ax, s=7*fig_scale)
    mglearn.discrete_scatter(Xs_train[:, 0], Xs_train[:, 1], ys_train,
                             markers='o', ax=ax, s=7*fig_scale)
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
# don't want a transparent colorbar
cbar = plt.colorbar(scores_image, ax=axes.tolist())
cbar.set_alpha(1)
cbar.ax.tick_params(labelsize=8*fig_scale)
axes[0].legend(["Test class 0", "Test class 1", "Train class 0",
                "Train class 1"], ncol=4, loc=(.1, 1.1));
../_images/4cf48f0dbd487a718665a9001017ea8f9ff74a2b37b506538d0c4139065e748b.png

Threshold calibration#

  • By default, we threshold at 0 for decision_function and 0.5 for predict_proba

  • Depending on the application, you may want to threshold differently

    • Lower threshold yields fewer FN (better recall), more FP (worse precision), and vice-versa
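
A sketch of re-thresholding by hand on the breast cancer data from before (X_train, X_test, y_train, y_test): instead of predict, which uses 0.5, threshold predict_proba yourself and watch precision and recall move in opposite directions.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]
for thr in [0.25, 0.5, 0.75]:
    y_thr = (proba >= thr).astype(int)
    print("threshold {:.2f}: precision {:.2f}, recall {:.2f}".format(
        thr, precision_score(y_test, y_thr), recall_score(y_test, y_thr)))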

Hide code cell source
from mglearn.datasets import make_blobs
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from mglearn.tools import plot_2d_separator, plot_2d_scores, cm, discrete_scatter
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

Xs, ys = make_blobs(n_samples=(400, 50), centers=2, cluster_std=[7.0, 2],
                    random_state=22)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(Xs, ys, stratify=ys, random_state=0)
svc1 = SVC(gamma=.04).fit(Xs_train, ys_train)
    
    
@interact
def plot_decision_threshold(threshold=(-1.2,1.3,0.1)):
    fig, axes = plt.subplots(1, 2, figsize=(8*fig_scale, 2.2*fig_scale), subplot_kw={'xticks': (), 'yticks': ()})    
    line = np.linspace(Xs_train.min(), Xs_train.max(), 100)

    axes[0].set_title("decision with threshold {:.2f}".format(threshold))
    discrete_scatter(Xs_train[:, 0], Xs_train[:, 1], ys_train, ax=axes[0], s=7*fig_scale)
    discrete_scatter(Xs_test[:, 0], Xs_test[:, 1], ys_test, ax=axes[0], markers='^', s=7*fig_scale)

    plot_2d_scores(svc1, Xs_train, function="decision_function", alpha=.7,
                   ax=axes[0], cm=mglearn.ReBl)
    plot_2d_separator(svc1, Xs_train, linewidth=3, ax=axes[0], threshold=threshold)
    axes[0].set_xlim(Xs_train[:, 0].min(), Xs_train[:, 0].max())
    axes[0].plot(line, np.array([10 for i in range(len(line))]), 'k:', linewidth=2)

    axes[1].set_title("cross-section with threshold {:.2f}".format(threshold))
    axes[1].plot(line, svc1.decision_function(np.c_[line, 10 * np.ones(100)]), c='k')
    dec = svc1.decision_function(np.c_[line, 10 * np.ones(100)])
    contour = (dec > threshold).reshape(1, -1).repeat(10, axis=0)
    axes[1].contourf(line, np.linspace(-1.5, 1.5, 10), contour, linewidth=2, alpha=0.4, cmap=cm)
    discrete_scatter(Xs_test[:, 0], Xs_test[:, 1]*0, ys_test, ax=axes[1], markers='^', s=7*fig_scale)

    axes[1].plot(line, np.array([threshold for i in range(len(line))]), 'r:', linewidth=3)
    axes[1].tick_params(labelsize=8*fig_scale)
    
    axes[0].set_xlim(Xs_train[:, 0].min(), Xs_train[:, 0].max())
    axes[1].set_xlim(Xs_train[:, 0].min(), Xs_train[:, 0].max())
    axes[1].set_ylim(-1.5, 1.5)
    axes[1].set_xticks(())
    axes[1].set_yticks(np.arange(-1.5, 1.5, 0.5))
    axes[1].yaxis.tick_right()
    axes[1].set_ylabel("Decision value")
    
    y_pred = svc1.decision_function(Xs_test)
    y_pred = y_pred > threshold
    axes[1].text(Xs_train.min()+1,1.2,"Precision: {:.4f}".format(precision_score(ys_test,y_pred)), size=7*fig_scale)
    axes[1].text(Xs_train.min()+1,0.9,"Recall: {:.4f}".format(recall_score(ys_test,y_pred)), size=7*fig_scale)
    plt.tight_layout();
Hide code cell source
if not interactive:
    plot_decision_threshold(0)
    plot_decision_threshold(-0.9)

Precision-Recall curve#

  • The best trade-off between precision and recall depends on your application

    • You can have arbitrarily high recall, but you often want reasonable precision, too.

  • Plotting precision against recall for all possible thresholds yields a precision-recall curve

    • Change the threshold until you find a sweet spot in the precision-recall trade-off

    • Often jagged at high thresholds, when there are few positive examples left

Hide code cell source
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

# create a similar dataset as before, but with more samples
# to get a smoother curve
Xp, yp = make_blobs(n_samples=(4000, 500), centers=2, cluster_std=[7.0, 2], random_state=22)
Xp_train, Xp_test, yp_train, yp_test = train_test_split(Xp, yp, random_state=0)
svc2 = SVC(gamma=.05).fit(Xp_train, yp_train)
rf2 = RandomForestClassifier(n_estimators=100).fit(Xp_train, yp_train)

@interact
def plot_PR_curve(threshold=(-3.19,1.4,0.1), model=[svc2, rf2]):
    if hasattr(model, "predict_proba"):
        precision, recall, thresholds = precision_recall_curve(
            yp_test, model.predict_proba(Xp_test)[:, 1])
    else:
        precision, recall, thresholds = precision_recall_curve(
            yp_test, model.decision_function(Xp_test))
    # find existing threshold closest to zero
    close_zero = np.argmin(np.abs(thresholds))
    plt.figure(figsize=(10*fig_scale,4*fig_scale))
    plt.plot(recall[close_zero], precision[close_zero], 'o', markersize=10,
             label="threshold zero", fillstyle="none", c='k', mew=2)
    plt.plot(recall, precision, lw=2, label="precision recall curve")

    if hasattr(model, "predict_proba"):
        yp_pred = model.predict_proba(Xp_test)[:, 1] > threshold
    else:
        yp_pred = model.decision_function(Xp_test) > threshold
    plt.plot(recall_score(yp_test,yp_pred), precision_score(yp_test,yp_pred), 'o', markersize=10, label="threshold {:.2f}".format(threshold))

    plt.ylabel("Precision")
    plt.xlabel("Recall")
    plt.legend(loc="best", prop={"size":10});
Hide code cell source
if not interactive:
    plot_PR_curve(threshold=-0.99,model=svc2)

Model selection#

  • Some models can achieve trade-offs that others can’t

  • Your application may require very high recall (or very high precision)

    • Choose the model that offers the best trade-off, given your application

  • The area under the PR curve (AUPRC) summarizes performance across all thresholds: the larger the area, the better the model overall
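
In scikit-learn, average_precision_score summarizes the precision-recall curve into a single number (closely related to the area under the curve); a sketch comparing the two models used above:

from sklearn.metrics import average_precision_score

ap_svc = average_precision_score(yp_test, svc2.decision_function(Xp_test))
ap_rf = average_precision_score(yp_test, rf2.predict_proba(Xp_test)[:, 1])
print("Average precision SVC: {:.3f}, Random Forest: {:.3f}".format(ap_svc, ap_rf))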

Hide code cell source
from sklearn.metrics import auc
colors=['b','r','g','y']

def plot_PR_curves(models):
    
    plt.figure(figsize=(10*fig_scale,4*fig_scale))
    for i, model in enumerate(models):
        if hasattr(model, "predict_proba"):
            precision, recall, thresholds = precision_recall_curve(
                yp_test, model.predict_proba(Xp_test)[:, 1])
            close_zero = np.argmin(np.abs(thresholds-0.5))
        else:
            precision, recall, thresholds = precision_recall_curve(
                yp_test, model.decision_function(Xp_test))
            close_zero = np.argmin(np.abs(thresholds))
          
        plt.plot(recall, precision, lw=2, c=colors[i], label="PR curve {}, Area: {:.4f}".format(model, auc(recall, precision))) 
        plt.plot(recall[close_zero], precision[close_zero], 'o', markersize=10,
                 fillstyle="none", c=colors[i], mew=2, label="Default threshold {}".format(model))
        plt.ylabel("Precision")
        plt.xlabel("Recall")
        plt.legend(loc="lower left", prop={"size":9});
        
#svc2 = SVC(gamma=0.01).fit(X_train, y_train)
plot_PR_curves([svc2, rf2])
../_images/1c9a494fd897278c717d57031408b9648c12dc5fdd82c1ca2d55ad83628374d3.png

Hyperparameter effects#

Of course, hyperparameters affect predictions and hence also the shape of the curve

Hide code cell source
svc3 = SVC(gamma=0.01).fit(Xp_train, yp_train)
svc4 = SVC(gamma=1).fit(Xp_train, yp_train)
plot_PR_curves([svc3, svc2, svc4])
../_images/e08bba811d5b4bf50c2f5aaa9bc05254bb89a2a7fff82cb9f3cf1363e5d809c1.png

Receiver Operating Characteristics (ROC)#

  • Trade off true positive rate \(\textit{TPR}= \frac{TP}{TP + FN}\) with false positive rate \(\textit{FPR} = \frac{FP}{FP + TN}\)

  • Plotting TPR against FPR for all possible thresholds yields a Receiver Operating Characteristics curve

    • Change the threshold until you find a sweet spot in the TPR-FPR trade-off

    • Lower thresholds yield higher TPR (recall), higher FPR, and vice versa
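
A sketch computing the ROC curve and the area under it, reusing svc2 and the test data from the precision-recall example:

from sklearn.metrics import roc_curve, roc_auc_score

dec_scores = svc2.decision_function(Xp_test)
fpr, tpr, thresholds = roc_curve(yp_test, dec_scores)  # one (FPR, TPR) point per threshold
print("Number of thresholds evaluated:", len(thresholds))
print("AUC: {:.3f}".format(roc_auc_score(yp_test, dec_scores)))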

Hide code cell source
from sklearn.metrics import roc_curve

@interact
def plot_ROC_curve(threshold=(-3.19,1.4,0.1), model=[svc2, rf2]):
    if hasattr(model, "predict_proba"):
        fpr, tpr, thresholds = roc_curve(yp_test, model.predict_proba(Xp_test)[:, 1])
    else:
        fpr, tpr, thresholds = roc_curve(yp_test, model.decision_function(Xp_test))
    # find existing threshold closest to zero
    close_zero = np.argmin(np.abs(thresholds))
    plt.figure(figsize=(10*fig_scale,4*fig_scale))
    plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
    plt.plot(fpr, tpr, lw=2, label="ROC curve")
    
    closest = np.argmin(np.abs(thresholds-threshold))
    plt.plot(fpr[closest], tpr[closest], 'o', markersize=10, label="threshold {:.2f}".format(threshold))
    
    plt.ylabel("TPR (recall)")
    plt.xlabel("FPR")
    plt.legend(loc="best", prop={"size":10});
Hide code cell source
if not interactive:
    plot_ROC_curve(threshold=-0.99,model=svc2)

Visualization#

  • Histograms show the number of points with a certain decision value (for each class)

  • \(\textit{TPR}= \frac{\color{red}{TP}}{\color{red}{TP} + \color{magenta}{FN}}\) can be seen from the positive predictions (top histogram)

  • \(\textit{FPR} = \frac{\color{cyan}{FP}}{\color{cyan}{FP} + \color{blue}{TN}}\) can be seen from the negative predictions (bottom histogram)

Hide code cell source
# More data for a smoother curve
Xb, yb = make_blobs(n_samples=(4000, 4000), centers=2, cluster_std=[3, 3], random_state=7)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, random_state=0)
svc_roc = SVC(C=2).fit(Xb_train, yb_train)
probs_roc = svc_roc.decision_function(Xb_test)

@interact
def plot_roc_threshold(threshold=(-2,2,0.1)):
    fig = plt.figure(constrained_layout=True, figsize=(10*fig_scale,4*fig_scale))
    axes = []
    gs = fig.add_gridspec(2, 2)
    axes.append(fig.add_subplot(gs[0, :-1]))
    axes.append(fig.add_subplot(gs[1, :-1]))
    axes.append(fig.add_subplot(gs[:, 1]))
    
    n=50 # number of histogram bins
    color=['b','r']
    color_fill=['b','c','m','r']
    labels=['TN','FP','FN','TP']
    
    # Histograms
    for label in range(2):
        ps = probs_roc[yb_test == label] # get prediction for given label
        p, x = np.histogram(ps, bins=n) # bin it into n bins
        x = x[:-1] + (x[1] - x[0])/2 # convert bin edges to center
        axes[1-label].plot(x, p, c=color[label], lw=2)
        axes[1-label].fill_between(x[x<threshold], p[x<threshold], -5, facecolor=color_fill[2*label], label='{}: {}'.format(labels[2*label],np.sum(p[x<threshold])))
        axes[1-label].fill_between(x[x>=threshold], p[x>threshold], -5, facecolor=color_fill[2*label+1], label='{}: {}'.format(labels[2*label+1],np.sum(p[x>=threshold])))
        axes[1-label].set_title('Histogram of decision values for points with class {}'.format(label), fontsize=12*fig_scale)
        axes[1-label].legend(prop={"size":10})
        
    #ROC curve
    fpr, tpr, thresholds = roc_curve(yb_test, svc_roc.decision_function(Xb_test))
    axes[2].plot(fpr, tpr, lw=2, label="ROC curve", c='k')
    closest = np.argmin(np.abs(thresholds-threshold))
    axes[2].plot(fpr[closest], tpr[closest], 'o', markersize=10, label="threshold {:.2f}".format(threshold))
    
    axes[2].set_title('ROC curve', fontsize=12*fig_scale)
    axes[2].set_xlabel("FPR")
    axes[2].set_ylabel("TPR")
    axes[2].legend(prop={"size":10})
Hide code cell source
if not interactive:
    plot_roc_threshold(threshold=-0.99)

Model selection#

  • Again, some models can achieve trade-offs that others can’t

  • Your application may require minimizing FPR (few FPs) or maximizing TPR (few FNs)

  • The area under the ROC curve (AUROC or AUC) summarizes performance across all thresholds: the larger, the better the model overall

    • Frequently used for evaluating models on imbalanced data

    • Random guessing (TPR=FPR) or predicting majority class (TPR=FPR=1): 0.5 AUC

Hide code cell source
from sklearn.metrics import auc
from sklearn.dummy import DummyClassifier

def plot_ROC_curves(models):
    fig = plt.figure(figsize=(10*fig_scale,4*fig_scale))
    for i, model in enumerate(models):
        if hasattr(model, "predict_proba"):
            fpr, tpr, thresholds = roc_curve(
                yb_test, model.predict_proba(Xb_test)[:, 1])
            close_zero = np.argmin(np.abs(thresholds-0.5))
        else:
            fpr, tpr, thresholds = roc_curve(
                yb_test, model.decision_function(Xb_test))
            close_zero = np.argmin(np.abs(thresholds))
        plt.plot(fpr, tpr, lw=2, c=colors[i], label="ROC curve {}, Area: {:.4f}".format(model, auc(fpr, tpr))) 
        plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
                 fillstyle="none", c=colors[i], mew=2) #label="Default threshold {}".format(model)
        plt.ylabel("TPR (recall)")
        plt.xlabel("FPR")
        plt.legend(loc="lower right",prop={"size":10});
        
svc = SVC(gamma=.1).fit(Xb_train, yb_train)
rf = RandomForestClassifier(n_estimators=100).fit(Xb_train, yb_train)
dc = DummyClassifier(strategy='most_frequent').fit(Xb_train, yb_train)
dc2 = DummyClassifier(strategy='uniform').fit(Xb_train, yb_train)
plot_ROC_curves([dc, dc2, svc, rf])
../_images/304763d12a92a7a0a09124f15420e97295ef4ae9b46490360719b777b3d7b4b5.png

Multi-class AUROC (or AUPRC)#

  • We again need to choose between micro- or macro averaging TPR and FPR.

    • Micro-average if every sample is equally important (irrespective of class)

    • Macro-average if every class is equally important, especially for imbalanced data
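
roc_auc_score can compute this directly for multi-class problems through its multi_class and average arguments (only macro and weighted averaging are supported there); a small, self-contained sketch on the iris data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

Xm, ym = load_iris(return_X_y=True)
Xm_tr, Xm_te, ym_tr, ym_te = train_test_split(Xm, ym, stratify=ym, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(Xm_tr, ym_tr).predict_proba(Xm_te)
print("macro one-vs-rest AUC: {:.3f}".format(roc_auc_score(ym_te, proba, multi_class='ovr', average='macro')))
print("weighted one-vs-rest AUC: {:.3f}".format(roc_auc_score(ym_te, proba, multi_class='ovr', average='weighted')))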

Hide code cell source
from itertools import cycle

from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from numpy import interp
from sklearn.metrics import roc_auc_score

# 3 class imbalanced data
Xi, yi = make_blobs(n_samples=(800, 500, 60), centers=3, cluster_std=[7.0, 2, 3.0], random_state=22)
sizes = [800, 500, 60]

# Binarize the output
yi = label_binarize(yi, classes=[0, 1, 2])
n_classes = yi.shape[1]
Xi_train, Xi_test, yi_train, yi_test = train_test_split(Xi, yi, test_size=.5, random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(SVC(probability=True))
y_score = classifier.fit(Xi_train, yi_train).decision_function(Xi_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(yi_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(yi_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute AUC
mean_tpr /= n_classes

fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure(figsize=(10*fig_scale,4*fig_scale))

plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)

plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)

colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2, linestyle='-',
             label='ROC curve of class {} (size: {}) (area = {:0.2f})'
             ''.format(i, sizes[i], roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Extension of ROC to multi-class')
plt.legend(loc="lower right", prop={"size":10})
plt.show()
../_images/8f95a0c5f427cbe55c246f06406ea33ad396f47164e41ce5709f8cde9f1edda9.png

Model calibration#

  • For some models, the predicted uncertainty does not reflect the actual uncertainty

    • If a model is 90% sure that samples are positive, is it also 90% accurate on these?

  • A model is called calibrated if the reported uncertainty actually matches how correct it is

    • Overfitted models also tend to be over-confident

    • LogisticRegression models are well calibrated since they learn probabilities

    • SVMs are not well calibrated. Biased towards points close to the decision boundary.

Hide code cell source
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, accuracy_score

def load_data():
    X, y = make_classification(n_samples=100000, n_features=20, random_state=0)
    train_samples = 2000  # Samples used for training the models
    X_train = X[:train_samples]
    X_test = X[train_samples:]
    y_train = y[:train_samples]
    y_test = y[train_samples:]

    return X_train, X_test, y_train, y_test

Xc_train, Xc_test, yc_train, yc_test = load_data()

def plot_calibration_curve(y_true, y_prob, n_bins=5, ax=None, hist=True, normalize=False):
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    if ax is None:
        ax = plt.gca()
    if hist:
        ax.hist(y_prob, weights=np.ones_like(y_prob) / len(y_prob), alpha=.4,
               bins=np.maximum(10, n_bins))
    ax.plot([0, 1], [0, 1], ':', c='k')
    curve = ax.plot(prob_pred, prob_true, marker="o")

    ax.set_xlabel("predicted probability")
    ax.set_ylabel("fraction of pos. samples")
    ax.set(aspect='equal')
    return curve

# Plot calibration curves for `models`, optionally show a calibrator run on a calibratee
def plot_calibration_comparison(models, calibrator=None, calibratee=None): 
    def get_probabilities(clf, X):
        if hasattr(clf, "predict_proba"): # Use probabilities if classifier has predict_proba
            prob_pos = clf.predict_proba(X)[:, 1]
        else:  # Otherwise, use decision function and scale
            prob_pos = clf.decision_function(X)
            prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
        return prob_pos
    
    nr_plots = len(models)
    if calibrator:
        nr_plots += 1
    fig, axes = plt.subplots(1, nr_plots, figsize=(3*nr_plots*fig_scale, 4*nr_plots*fig_scale))
    for ax, clf in zip(axes[:len(models)], models):
            clf.fit(Xc_train, yc_train)
            prob_pos = get_probabilities(clf,Xc_test)           
            bs = brier_score_loss(yc_test,prob_pos)
            plot_calibration_curve(yc_test, prob_pos, n_bins=20, ax=ax)
            ax.set_title(clf.__class__.__name__, fontsize=10*fig_scale)
            ax.text(0,0.95,"Brier score: {:.3f}".format(bs), size=10*fig_scale)
    if calibrator:
        calibratee.fit(Xc_train, yc_train)
        # We're visualizing the trained calibrator, hence let it predict the training data
        prob_pos = get_probabilities(calibratee, Xc_train) # get uncalibrated predictions
        y_sort = [x for _,x in sorted(zip(prob_pos,yc_train))] # sort for nicer plots
        prob_pos.sort()
        cal_prob = calibrator.fit(prob_pos, y_sort).predict(prob_pos) # fit calibrator
        axes[-1].scatter(prob_pos,y_sort, s=2)
        axes[-1].scatter(prob_pos,cal_prob, s=2)
        axes[-1].plot(prob_pos,cal_prob)
        axes[-1].set_title("Calibrator: {}".format(calibrator.__class__.__name__), fontsize=10*fig_scale)
        axes[-1].set_xlabel("predicted probability")
        axes[-1].set_ylabel("outcome")
        axes[-1].set(aspect='equal')
    plt.tight_layout()
    
plot_calibration_comparison([LogisticRegression(), SVC()])   
../_images/ad2d27b913c130e808889825ae6983cd25c7366382b0b10e2ba59d2ac8ea908e.png

Brier score#

  • You may want to select models based on how accurate the class confidences are.

  • The Brier score loss: squared loss between predicted probability \(\hat{p}\) and actual outcome \(y\)

    • Lower is better $\(\mathcal{L}_{Brier} = \frac{1}{n}\sum_{i=1}^n (\hat{p}_i - y_i)^2\)$

Hide code cell source
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
XC_train, XC_test, yC_train, yC_test = train_test_split(cancer.data, cancer.target, test_size=.5, random_state=0)

# LogReg
logreg = LogisticRegression().fit(XC_train, yC_train)
probs = logreg.predict_proba(XC_test)[:,1]
print("Logistic Regression Brier score loss: {:.4f}".format(brier_score_loss(yC_test,probs)))

# SVM: scale decision functions
svc = SVC().fit(XC_train, yC_train)
prob_pos = svc.decision_function(XC_test)
prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
print("SVM Brier score loss: {:.4f}".format(brier_score_loss(yC_test,prob_pos)))
Logistic Regression Brier score loss: 0.0322
SVM Brier score loss: 0.0795

Model calibration techniques#

  • We can post-process trained models to make them more calibrated.

  • Fit a regression model (a calibrator) to map the model’s outcomes \(f(x)\) to a calibrated probability in [0,1]

    • \(f(x)\) returns the decision values or probability estimates

    • \(f_{calib}\) is fitted on the training data to map these to the correct outcome

      • Often an internal cross-validation with a few folds is used

    • Multi-class models require one calibrator per class

\[f_{calib}(f(x))≈p(y)\]
Platt Scaling#
  • Calibrator is a logistic (sigmoid) function:

    • Learn the weight \(w_1\) and bias \(w_0\) from data $\(f_{platt}=\frac{1}{1+\exp(-w_1 f(x) - w_0)}\)$

Hide code cell source
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Wrapped LogisticRegression to get sigmoid predictions
class Sigmoid():
    model = LogisticRegression()
    
    def fit(self, X, y):
        self.model.fit(X.reshape(-1, 1),y)
        return self

    def predict(self, X):
        return self.model.predict_proba(X.reshape(-1, 1))[:, 1]
        
svm = SVC()
svm_platt = CalibratedClassifierCV(svm, cv=2, method='sigmoid')
plot_calibration_comparison([svm, svm_platt],Sigmoid(),svm)
../_images/90328200a3ab223adc224f0cfcb9f4df94725e68fa5c64f6f3913b36c85f81bf.png
Isotonic regression#
  • Maps input \(x_i\) to an output \(\hat{y}_i\) so that \(\hat{y}_i\) increases monotonically with \(x_i\) and minimizes the squared loss \(\sum_{i=1}^n (y_i-\hat{y}_i)^2\)

    • Predictions are made by interpolating the predicted \(\hat{y}_i\)

  • Fit to minimize the loss between the uncalibrated predictions \(f(x)\) and the actual labels

  • Corrects any monotonic distortion, but tends to overfit on small samples

Hide code cell source
from sklearn.isotonic import IsotonicRegression

model = SVC()
iso = CalibratedClassifierCV(model, cv=2, method='isotonic')
plot_calibration_comparison([model, iso],IsotonicRegression(),model)
../_images/f8580b7b5cd51553de62142a5980faa25dfbf50119b5a8a6af81712f5e1a0f48.png

Cost-sensitive classification (dealing with imbalance)#

  • In the real world, different kinds of misclassification can have different costs

    • Misclassifying certain classes can be more costly than others

    • Misclassifying certain samples can be more costly than others

  • Cost-sensitive resampling: resample (or reweight) the data to represent real-world expectations

    • oversample minority classes (or undersample majority) to ‘correct’ imbalance

    • increase weight of misclassified samples (e.g. in boosting)

    • decrease weight of misclassified (noisy) samples (e.g. in model compression)

Class weighting#

  • If some classes are more important than others, we can give them more weight

    • E.g. for imbalanced data, we can give more weight to minority classes

  • Most classification models can include it in their loss function and optimize for it

    • E.g. Logistic regression: add a class weight \(w_c\) in the log loss function $\(\mathcal{L_{log}}(\mathbf{w}) = - \sum_{c=1}^{C} \color{red}{w_c} \sum_{n=1}^{N} p_{n,c} log(q_{n,c}) \)$
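
Most scikit-learn classifiers expose this through the class_weight parameter (either explicit per-class weights or 'balanced', which weights classes inversely to their frequency); a small, self-contained sketch on artificially imbalanced data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

Xw, yw = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)  # 95%-5% imbalance
Xw_tr, Xw_te, yw_tr, yw_te = train_test_split(Xw, yw, stratify=yw, random_state=0)
for cw in [None, 'balanced', {0: 1, 1: 10}]:
    clf = LogisticRegression(class_weight=cw, max_iter=1000).fit(Xw_tr, yw_tr)
    print("class_weight={}: recall on the minority class {:.2f}".format(
        cw, recall_score(yw_te, clf.predict(Xw_te))))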

Hide code cell source
def plot_decision_function(classifier, X, y, sample_weight, axis, title):
    # plot the decision function
    xx, yy = np.meshgrid(np.linspace(np.min(X[:,0])-1, np.max(X[:,0])+1, 500), np.linspace(np.min(X[:,1])-1, np.max(X[:,1])+1, 500))
    Z = classifier.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # plot the line, the points, and the nearest vectors to the plane
    axis.contourf(xx, yy, Z, alpha=0.75, cmap=plt.cm.bone)
    axis.scatter(X[:, 0], X[:, 1], c=y, s=100 * sample_weight, alpha=0.9,
                 cmap=plt.cm.bone, edgecolors='black')

    axis.axis('off')
    axis.set_title(title)
    
def plot_class_weights():
    X, y = make_blobs(n_samples=(50, 10), centers=2, cluster_std=[7.0, 2], random_state=4)
    
    # fit the models
    clf_weights = SVC(gamma=0.1, C=0.1, class_weight={1: 10}).fit(X, y)
    clf_no_weights = SVC(gamma=0.1, C=0.1).fit(X, y)

    fig, axes = plt.subplots(1, 2, figsize=(9*fig_scale, 4*fig_scale))
    plot_decision_function(clf_no_weights, X, y, 1, axes[0],
                           "Constant weights")
    plot_decision_function(clf_weights, X, y, 1, axes[1],
                           "Modified class weights")
plot_class_weights()
../_images/709b6822d6e8a3fffea34eeac294b4598250fe1b4337e80e23536b9192dd11bc.png

Instance weighting#

  • If some training instances are important to get right, we can give them more weight

    • E.g. when some examples are from groups underrepresented in the data

  • These are passed during training (fit), and included in the loss function

    • E.g. Logistic regression: add an instance weight \(w_n\) in the log loss function $\(\mathcal{L_{log}}(\mathbf{w}) = - \sum_{c=1}^{C} \sum_{n=1}^{N} \color{red}{w_n} p_{n,c} log(q_{n,c}) \)$

Hide code cell source
# Example from https://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html
def plot_instance_weights():
    np.random.seed(0)
    X = np.r_[np.random.randn(10, 2) + [1, 1], np.random.randn(10, 2)]
    y = [1] * 10 + [-1] * 10
    sample_weight_last_ten = abs(np.random.randn(len(X)))
    sample_weight_constant = np.ones(len(X))
    # and bigger weights to some outliers
    sample_weight_last_ten[15:] *= 5
    sample_weight_last_ten[9] *= 15

    # for reference, first fit without sample weights

    # fit the model
    clf_weights = SVC(gamma=1)
    clf_weights.fit(X, y, sample_weight=sample_weight_last_ten)

    clf_no_weights = SVC(gamma=1)
    clf_no_weights.fit(X, y)

    fig, axes = plt.subplots(1, 2, figsize=(9*fig_scale, 4*fig_scale))
    plot_decision_function(clf_no_weights, X, y, sample_weight_constant, axes[0],
                           "Constant weights")
    plot_decision_function(clf_weights, X, y, sample_weight_last_ten, axes[1],
                           "Modified instance weights")
plot_instance_weights()   
../_images/0dc5f750f1bce587136d07ca535c84897239f589fa06743752a09391be161e68.png

Cost-sensitive algorithms#

  • Cost-sensitive algorithms

    • If misclassification cost of some classes is higher, we can give them higher weights

    • Some support cost matrix \(C\): costs \(c_{i,j}\) for every possible type of error

  • Cost-sensitive ensembles: convert cost-insensitive classifiers into cost-sensitive ones

    • MetaCost: Build a model (ensemble) to learn the class probabilities \(P(j|x)\)

      • Relabel training data to minimize expected cost: \(\underset{i}{\operatorname{argmin}} \sum_j P_j(x) c_{i,j}\)

      • Accuracy may decrease but cost decreases as well.

    • AdaCost: Boosting with reweighting instances to reduce costs

Tuning the decision threshold#

  • If every FP or FN has a certain cost, we can compute the total (expected) cost for a given model: $\(\text{total cost} = \text{FPR} * cost_{FP} * (1-ratio_{pos}) + (1-\text{TPR}) * cost_{FN} * ratio_{pos}\)$

  • This yields different isometrics (lines of equal cost) in ROC space

  • Optimal threshold is the point on the ROC curve where cost is minimal (line search)

Hide code cell source
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

# Cost function
def cost(fpr, tpr, cost_FN, cost_FP, ratio_P):
    # Expected cost: the FP rate is weighted by the fraction of negatives, the FN rate by the fraction of positives
    return fpr * cost_FP * (1 - ratio_P) + (1 - tpr) * ratio_P * cost_FN

@interact
def plot_isometrics(cost_FN=(1,10.0,1.0), cost_FP=(1,10.0,1.0)):
    from sklearn.metrics import roc_curve
    fpr, tpr, thresholds = roc_curve(yb_test, svc_roc.decision_function(Xb_test))

    # get minimum
    ratio_P = len(yb_test[yb_test==1]) / len(yb_test)
    costs = [cost(fpr[x],tpr[x],cost_FN,cost_FP, ratio_P) for x in range(len(thresholds))]
    min_cost = np.min(costs)
    min_thres = np.argmin(costs)

    # plot contours
    x = np.arange(0.0, 1.1, 0.1)
    y = np.arange(0.0, 1.1, 0.1)
    XX, YY = np.meshgrid(x, y)
    costs = [cost(f, t, cost_FN, cost_FP, ratio_P) for f, t in zip(XX,YY)]

    if interactive:
        fig, axes = plt.subplots(1, 1, figsize=(9*fig_scale, 3*fig_scale))
    else:
        fig, axes = plt.subplots(1, 1, figsize=(6*fig_scale, 1.8*fig_scale))
    plt.plot(fpr, tpr, label="ROC Curve", lw=2)
    levels = np.linspace(np.array(costs).min(), np.array(costs).max(), 10)
    levels = np.sort(np.append(levels, min_cost))
    CS = plt.contour(XX, YY, costs, levels)
    plt.clabel(CS, inline=1, fontsize=10)

    plt.xlabel("FPR")
    plt.ylabel("TPR (recall)")
    # find threshold closest to zero:
    plt.plot(fpr[min_thres], tpr[min_thres], 'o', markersize=4,
             label="optimal", fillstyle="none", c='k', mew=2)
    plt.legend(loc=4, prop={"size":10});
    plt.title("Isometrics, cost_FN: {}, cost_FP: {}".format(cost_FN, cost_FP), fontsize=10*fig_scale)
    plt.tight_layout()
    plt.show()
Hide code cell source
if not interactive:
    plot_isometrics(1,1)
    plot_isometrics(1,9)

Regression metrics#

The most commonly used regression metrics are (a short scikit-learn sketch follows the list):

  • mean squared error: \(\frac{\sum_{i}(y_{pred_i}-y_{actual_i})^2}{n}\)

    • root mean squared error (RMSE) often used as well

  • mean absolute error: \(\frac{\sum_{i}|y_{pred_i}-y_{actual_i}|}{n}\)

    • Less sensitive to outliers and large errors

ml
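
A minimal sketch of these metrics with scikit-learn, on made-up values purely for illustration (RMSE is taken as the square root of MSE to avoid version-specific arguments):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse  = mean_squared_error(y_true, y_pred)    # mean of the squared errors
rmse = np.sqrt(mse)                          # same units as the target
mae  = mean_absolute_error(y_true, y_pred)   # less sensitive to large errors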

R squared#

  • \(R^2 = 1 - \frac{\color{blue}{\sum_{i}(y_{pred_i}-y_{actual_i})^2}}{\color{red}{\sum_{i}(y_{mean}-y_{actual_i})^2}}\)

    • Ratio of the variation explained by the model to the total variation

    • Typically between 0 and 1, but can be negative if the model is worse than always predicting the mean

    • Easier to interpret (higher is better); a small numeric check follows below

ml
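
A small check of both cases with scikit-learn's r2_score, on made-up numbers chosen for illustration:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # mean = 2.5
good   = np.array([1.1, 2.1, 2.9, 3.8])   # close to the true values
bad    = np.array([4.0, 1.0, 4.0, 1.0])   # worse than predicting the mean

print(r2_score(y_true, good))   # close to 1
print(r2_score(y_true, bad))    # negative (here -3.0)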

Visualizing regression errors#

  • Prediction plot (left): predicted vs actual target values

  • Residual plot (right): residuals vs actual target values

    • Over- and underpredictions can be given different costs

Hide code cell source
from sklearn.linear_model import Ridge
from sklearn.datasets import fetch_openml
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
boston = fetch_openml(name="boston", as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target)
ridge_pipe = make_pipeline(StandardScaler(),Ridge())

pred = ridge_pipe.fit(X_train, y_train).predict(X_test)

plt.subplot(1, 2, 1)
plt.gca().set_aspect("equal")
plt.plot([10, 50], [10, 50], '--', c='k')
plt.plot(y_test, pred, 'o', alpha=.5, markersize=7*fig_scale)
plt.ylabel("predicted", fontsize=10*fig_scale)
plt.xlabel("true", fontsize=10*fig_scale);

plt.subplot(1, 2, 2)
plt.gca().set_aspect("equal")
plt.plot([10, 50], [0,0], '--', c='k')
plt.plot(y_test, y_test - pred, 'o', alpha=.5, markersize=7*fig_scale)
plt.xlabel("true", fontsize=10*fig_scale)
plt.ylabel("true - predicted", fontsize=10*fig_scale)
plt.tight_layout();

Bias-Variance decomposition#

  • Evaluate the same algorithm multiple times on different random samples of the data

  • Two types of errors can be observed:

    • Bias error: systematic error, independent of the training sample

      • These points are predicted (equally) wrong every time

    • Variance error: error due to variability of the model w.r.t. the training sample

      • These points are sometimes predicted accurately, sometimes inaccurately

ml

Computing bias and variance error#

  • Take 100 or more bootstraps (or shuffle-splits)

  • Regression: for each data point \(x\) (see the numeric sketch after this list):

    • \(bias(x)^2 = (y_{true}(x) - \overline{y}_{pred}(x))^2\), with \(\overline{y}_{pred}(x)\) the mean prediction over all repeats

    • \(variance(x) = var(y_{pred}(x))\), the variance of the predictions over all repeats

  • Classification: for each data point x:

    • \(bias(x)\) = misclassification ratio

    • \(variance(x) = (1 - (P(class_1)^2 + P(class_2)^2))/2\)

      • \(P(class_i)\) is ratio of class \(i\) predictions

  • Total bias: \(\sum_{x} bias(x)^2 \cdot w_x\), where \(w_x\) is the fraction of times \(x\) occurs in the test sets

  • Total variance: \(\sum_{x} variance(x) \cdot w_x\)
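
For the regression case, a minimal numpy sketch of these per-point quantities, using made-up predictions for a single test point (the hidden cell below implements the classification case):

import numpy as np

y_true = 3.0                                    # true target value for one test point x
y_preds = np.array([2.4, 3.1, 2.6, 3.3, 2.6])   # predictions for x from 5 repeated splits

bias_sq  = (y_true - y_preds.mean())**2   # squared distance to the mean prediction (0.04)
variance = y_preds.var()                  # spread of the predictions (0.116)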

Hide code cell source
from sklearn.model_selection import ShuffleSplit, train_test_split, GridSearchCV

# Bias-Variance Computation 
def compute_bias_variance(clf, X, y):
    # Repeated shuffle-splits (used here in place of bootstraps)
    n_repeat = 40 # 40 is on the low side to get a good estimate. 100 is better.
    shuffle_split = ShuffleSplit(test_size=0.33, n_splits=n_repeat, random_state=0)

    # Store sample predictions
    y_all_pred = [[] for _ in range(len(y))]

    # Train the classifier on each split and store its test-set predictions
    for i, (train_index, test_index) in enumerate(shuffle_split.split(X)):
        # Train and predict
        clf.fit(X[train_index], y[train_index])
        y_pred = clf.predict(X[test_index])

        # Store predictions
        for j,index in enumerate(test_index):
            y_all_pred[index].append(y_pred[j])

    # Compute bias, variance, error
    bias_sq = sum([ (1 - x.count(y[i])/len(x))**2 * len(x)/n_repeat 
                for i,x in enumerate(y_all_pred)])
    var = sum([((1 - ((x.count(0)/len(x))**2 + (x.count(1)/len(x))**2))/2) * len(x)/n_repeat
               for i,x in enumerate(y_all_pred)])
    error = sum([ (1 - x.count(y[i])/len(x)) * len(x)/n_repeat 
            for i,x in enumerate(y_all_pred)])

    return np.sqrt(bias_sq), var, error
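
A usage sketch of the helper above, assuming the breast-cancer data (also used later in this section) and a decision tree chosen purely for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
bias, variance, error = compute_bias_variance(DecisionTreeClassifier(random_state=0), X_bc, y_bc)
print(f"bias: {bias:.3f}, variance: {variance:.3f}, error: {error:.3f}")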

Bias and variance, underfitting and overfitting#

  • High variance means that you are likely overfitting

    • Use more regularization or use a simpler model

  • High bias means that you are likely underfitting

    • Do less regularization or use a more flexible/complex model

  • Ensembling techniques (see later) reduce bias or variance directly

    • Bagging (e.g. RandomForests) reduces variance, Boosting reduces bias

ml

Understanding under- and overfitting#

  • Regularization reduces variance error (increases stability of predictions)

    • But too much increases bias error (inability to learn ‘harder’ points)

  • High regularization (left side): Underfitting, high bias error, low variance error

    • High training error and high test error

  • Low regularization (right side): Overfitting, low bias error, high variance error

    • Low training error and higher test error

Hide code cell source
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
cancer = load_breast_cancer()

def plot_bias_variance(clf, param, X, y, ax):
    bias_scores = []
    var_scores = []
    err_scores = []
    name, vals = next(iter(param.items()))

    for i in vals:
        b,v,e = compute_bias_variance(clf.set_params(**{name:i}),X,y)
        bias_scores.append(b)
        var_scores.append(v)
        err_scores.append(e)

    ax.set_title(clf.__class__.__name__)
    ax.plot(vals, var_scores,label ="variance", lw=2 )
    ax.plot(vals, bias_scores,label ="bias", lw=2 )
    ax.set_xscale('log')
    ax.set_xlabel(name)
    ax.legend(loc="best")
    
def plot_train_test(clf, param, X, y, ax):
    gs = GridSearchCV(clf, param, cv=5, return_train_score=True).fit(X,y)
    name, vals = next(iter(param.items()))

    ax.set_title(clf.__class__.__name__)
    ax.plot(vals, (1-gs.cv_results_['mean_train_score']),label ="train error", lw=2 )
    ax.plot(vals, (1-gs.cv_results_['mean_test_score']),label ="test error", lw=2 )
    ax.set_xscale('log')
    ax.set_xlabel(name)
    ax.legend(loc="best")
    
fig, axes = plt.subplots(2, 2, figsize=(6*fig_scale, 4*fig_scale))
X, y = cancer.data, cancer.target
svm = SVC(gamma=2e-4, random_state=0)
param = {'C': [1e-4, 1e-2, 1, 1e2, 1e4]}

plot_bias_variance(svm, param, X, y, axes[0,1])
plot_train_test(svm, param, X, y, axes[1,1]) 

lr = LogisticRegression(random_state=0)
param = {'C': [1e-2, 1, 1e2, 1e4, 1e6]}  # logarithmic grid of C values
plot_bias_variance(lr, param, X, y, axes[0,0])
plot_train_test(lr, param, X, y, axes[1,0])
plt.tight_layout()

Summary Flowchart (by Andrew Ng)

ml

Hyperparameter tuning#

  • There exists a huge range of techniques to tune hyperparameters. The simplest:

    • Grid search: Choose a range of values for every hyperparameter, try every combination

      • Doesn’t scale to many hyperparameters (combinatorial explosion)

    • Random search: Sample random values for every hyperparameter, repeat \(n\) times

      • Scales better, especially when some hyperparameters matter less than others

  • Many more advanced techniques exist, see the lecture on Automated Machine Learning; a minimal grid/random search sketch follows below
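
A minimal sketch of both approaches with scikit-learn, tuning an SVM on synthetic data; the parameter ranges and distributions are illustrative only:

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X_hp, y_hp = make_classification(random_state=0)

# Grid search: try every combination of the listed values (3 x 3 = 9 configurations)
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}, cv=5).fit(X_hp, y_hp)

# Random search: sample n_iter configurations from the given distributions
rand = RandomizedSearchCV(SVC(), {'C': loguniform(1e-2, 1e2), 'gamma': loguniform(1e-3, 1)},
                          n_iter=20, cv=5, random_state=0).fit(X_hp, y_hp)
print(grid.best_params_, rand.best_params_)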

ml

  • First, split the data into training and test sets (outer split)

  • Split up the training data again (inner cross-validation)

    • Generate hyperparameter configurations (e.g. random/grid search)

    • Evaluate all configurations on all inner splits, select the best one (on average)

  • Retrain the best configuration on the full training set, then evaluate it on the held-out test data

Hide code cell source
if interactive:
    plt.rcParams['figure.dpi'] = 75
mglearn.plots.plot_grid_search_overview()
plt.rcParams.update(print_config)
Hide code cell source
from scipy.stats.distributions import expon

Nested cross-validation#

  • Simplest approach: single outer split and single inner split (shown below)

  • Risk of over-tuning hyperparameters on specific train-test split

    • Only recommended for very large datasets

  • Nested cross-validation:

    • Outer loop: split full dataset in \(k_1\) training and test splits

    • Inner loop: split training data into \(k_2\) train and validation sets

  • This yields \(k_1\) scores for \(k_1\) possibly different hyperparameter settings

    • Average score is the expected performance of the tuned model

  • To use the model in practice, retune on the entire dataset

from sklearn.model_selection import cross_val_score, RandomizedSearchCV
hps = {'C': expon(scale=100), 'gamma': expon(scale=.1)}
# Inner loop: 3-fold random search; outer loop: 5-fold cross-validation
scores = cross_val_score(RandomizedSearchCV(SVC(), hps, cv=3), X, y, cv=5)
Hide code cell source
mglearn.plots.plot_threefold_split()

Summary#

  • Split the data into training and test sets according to the application

    • Holdout only for large datasets, cross-validation for smaller ones

    • For classification, always use stratification

    • Grouped or ordered data requires special splitting

  • Choose a metric that fits your application

    • E.g. precision to avoid false positives, recall to avoid false negatives

  • Calibrate the decision threshold to fit your application

    • ROC curves or Precision-Recall curves can help to find a good tradeoff

  • If possible, include the actual or relative costs of misclassifications

    • Class weighting, instance weighting, ROC isometrics can help

    • Be careful with imbalanced or unrepresentative datasets

  • When using the predicted probabilities in applications, calibrate the models

  • Always tune the most important hyperparameters

    • Manual tuning: Use insight and train-test scores for guidance

    • Hyperparameter optimization: be careful not to over-tune