Linear models make a prediction using a linear function of the input features $X$
$$f_{\mathbf{w}}(\mathbf{x}) = \sum_{i=1}^{p} w_i \cdot x_i + w_{0}$$

Learn $\mathbf{w}$ from $X$, given a loss function $\mathcal{L}$:

$$\underset{\mathbf{w}}{\operatorname{argmin}} \mathcal{L}(f_\mathbf{w}(X))$$
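For example, with the squared error as the loss, the optimal weights have a closed-form solution. A minimal NumPy sketch on synthetic 1D data (the data and variable names are illustrative, not from the lecture):

```python
import numpy as np

# Synthetic 1D data: y is roughly a linear function of x plus noise
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=50)
y = 0.5 * x + rng.normal(scale=0.1, size=50)

# Append a constant column so the bias w_0 is learned jointly with w_1
X = np.column_stack([x, np.ones_like(x)])

# Closed-form least squares: argmin_w ||Xw - y||^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("w_1: %f  w_0: %f" % (w[0], w[1]))
```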
In two dimensions (image by A. Karpathy).
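The same squared-error loss can also be minimized iteratively with gradient descent, where the learning rate controls the step size of each update. A minimal sketch on synthetic data (the loop and names are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=50)
y = 0.5 * x + rng.normal(scale=0.1, size=50)

w1, w0 = 0.0, 0.0   # initial weights
learn_rate = 0.2    # step size of each gradient update

for _ in range(200):
    y_pred = w1 * x + w0
    # Gradients of the mean squared error with respect to w1 and w0
    grad_w1 = 2 * np.mean((y_pred - y) * x)
    grad_w0 = 2 * np.mean(y_pred - y)
    w1 -= learn_rate * grad_w1
    w0 -= learn_rate * grad_w0

print("w_1: %.3f  w_0: %.3f" % (w1, w0))
```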
Linear regression is available in `sklearn.linear_model`. We'll evaluate it on the Boston Housing dataset. `LinearRegression` uses the closed form solution, while `SGDRegressor` with `loss='squared_loss'` uses Stochastic Gradient Descent.

from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_train, y_train)
Weights (coefficients): [ -412.711 -52.243 -131.899 -12.004 -15.511 28.716 54.704 -49.535 26.582 37.062 -11.828 -18.058 -19.525 12.203 2980.781 1500.843 114.187 -16.97 40.961 -24.264 57.616 1278.121 -2239.869 222.825 -2.182 42.996 -13.398 -19.389 -2.575 -81.013 9.66 4.914 -0.812 -7.647 33.784 -11.446 68.508 -17.375 42.813 1.14 ] Bias (intercept): 30.93456367364078
Training set score (R^2): 0.95 Test set score (R^2): 0.61
from sklearn.linear_model import Ridge
lr = Ridge().fit(X_train, y_train)
Weights (coefficients): [-1.414 -1.557 -1.465 -0.127 -0.079 8.332 0.255 -4.941 3.899 -1.059 -1.584 1.051 -4.012 0.334 0.004 -0.849 0.745 -1.431 -1.63 -1.405 -0.045 -1.746 -1.467 -1.332 -1.692 -0.506 2.622 -2.092 0.195 -0.275 5.113 -1.671 -0.098 0.634 -0.61 0.04 -1.277 -2.913 3.395 0.792] Bias (intercept): 21.39052595861006
Training set score: 0.89 Test set score: 0.75
Test set score is higher and training set score lower: less overfitting!
Analyze what happens to the weights:
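A minimal sketch of that analysis, refitting `Ridge` for several values of `alpha` and inspecting the coefficient magnitudes (synthetic data stands in for the Boston Housing split used above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data as a stand-in for the train split used above
X_train, y_train = make_regression(n_samples=100, n_features=40, noise=10,
                                   random_state=0)

for alpha in [0.01, 0.1, 1, 10, 100]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    coefs = np.abs(ridge.coef_)
    print("alpha=%6.2f  max |w|=%8.2f  mean |w|=%7.2f"
          % (alpha, coefs.max(), coefs.mean()))
```

Larger `alpha` shrinks the weights toward zero.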
The L1 term is not differentiable, but it is convex, so we can compute a subgradient:
For $|w_i|$, the subgradient is $\partial_{w_i} |w_i| = \begin{cases}-1 & w_i<0\\ [-1,1] & w_i=0 \\ 1 & w_i>0 \end{cases}$
Subdifferential $\partial(f+g) = \partial f + \partial g$ if $f$ and $g$ are both convex
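This is what lets coordinate descent (used for L1-regularized least squares, i.e. Lasso in the overview table below) drive many weights exactly to zero. A minimal sketch showing the resulting sparsity (synthetic data; parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data with only a few truly informative features
X, y = make_regression(n_samples=100, n_features=40, n_informative=5,
                       noise=10, random_state=0)

for alpha in [0.01, 0.1, 1, 10]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print("alpha=%5.2f  non-zero weights: %2d / %d" % (alpha, n_nonzero, X.shape[1]))
```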
Least Squares Loss + L1 or L2
Other loss functions are used for regression as well, e.g. the $\epsilon$-insensitive loss of Support Vector Regression (`SVR` in sklearn); these can also be optimized with stochastic gradient descent (`SGDRegressor` in sklearn).

Linear models for classification aim to find a hyperplane that separates the examples of each class.
For binary classification (2 classes), we aim to fit the following decision function:

$$\hat{y} = w_1 \cdot x_1 + w_2 \cdot x_2 + \ldots + w_p \cdot x_p + w_0$$

When $\hat{y}<0$, predict class -1; otherwise, predict class +1.
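A minimal sketch of this decision rule (the weights and input values are made up for illustration):

```python
import numpy as np

w = np.array([0.5, -1.2, 0.8])    # w_1 ... w_p (illustrative values)
w0 = -0.1                         # bias term
x = np.array([1.0, 0.3, -0.5])    # one input example

y_hat = np.dot(w, x) + w0
prediction = -1 if y_hat < 0 else +1
print("y_hat = %.2f -> class %+d" % (y_hat, prediction))
```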
There are many algorithms for linear classification, differing in loss function, regularization technique, and optimization method.

The most common techniques are covered below.
Models that return class probabilities can use cross-entropy loss $$\mathcal{L}_{log}(\mathbf{w}) = \sum_{n=1}^{N} H(p_n,q_n) = - \sum_{n=1}^{N} \sum_{c=1}^{C} p_{n,c} \log(q_{n,c})$$
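A minimal sketch of this loss for two examples and three classes, where `p` holds the one-hot true labels and `q` the predicted class probabilities (the values are illustrative):

```python
import numpy as np

# One-hot true labels p and predicted probabilities q (N=2 examples, C=3 classes)
p = np.array([[1, 0, 0],
              [0, 0, 1]])
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

# Cross-entropy: - sum over examples and classes of p * log(q)
loss = -np.sum(p * np.log(q))
print(loss)  # -log(0.7) - log(0.6) ≈ 0.87
```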
The log loss can be optimized with coordinate descent (`liblinear` in sklearn; can't run in parallel) or quasi-Newton methods (`lbfgs`), among others.

Logistic regression is available in `sklearn.linear_model`:

- The `C` hyperparameter is the inverse regularization strength: $C=\alpha^{-1}$
- `penalty`: type of regularization: L1, L2 (default), Elastic-Net, or None
- `solver`: newton-cg, lbfgs (default), liblinear, sag, saga

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1).fit(X_train, y_train)
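A short, self-contained usage sketch of the same API (synthetic data stands in for the dataset used above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data as a stand-in for the real dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression(C=1).fit(X_train, y_train)
print("Coefficients (w):", lr.coef_)
print("Intercept (w_0):", lr.intercept_)
print("Test accuracy:", lr.score(X_test, y_test))
print("Class probabilities:", lr.predict_proba(X_test[:3]))
```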
The dual formulation expresses the weights in terms of dual coefficients $a_i$:

$$ \mathbf{w} = \sum_{i=1}^{n} a_i y_i \mathbf{x_i} $$

so that

$$ a_i \geq 0 \quad \text{and} \quad \sum_{i=1}^{l} a_i y_i = 0 $$

This computes the dual coefficients directly. A number $l$ of these are non-zero (sparseness).
Can be solved with quadratic programming, e.g. Sequential Minimal Optimization (SMO)
Example result. The circled samples are support vectors, together with their coefficients.
Hence, the SVM model is completely defined by the support vectors and their dual coefficients (weights)
Knowing the dual coefficients $a_i$, we can find the weights $w$ for the maximal margin separating hyperplane:
$$ \mathbf{w} = \sum_{i=1}^{l} a_i y_i \mathbf{x_i} $$
Remember, we will classify a new point $\mathbf{u}$ by looking at the sign of:
$$f(\mathbf{u}) = \mathbf{w}\cdot\mathbf{u}+w_0 = \sum_{i=1}^{l} a_i y_i \, \mathbf{x_i}\cdot\mathbf{u}+w_0$$
Weighted k-nearest neighbor is a generalization of the k-nearest neighbor classifier. It classifies points by evaluating:
$$f(\mathbf{u}) = \sum_{i=1}^{k} a_i y_i \, \operatorname{dist}(\mathbf{x_i}, \mathbf{u})^{-1}$$
Hence, SVMs predict in much the same way as k-NN, only using the support vectors instead of the $k$ nearest neighbors, and an inner product with $\mathbf{u}$ instead of an inverse distance.
Same for non-linearly separable data
Large C values can lead to overfitting (e.g. fitting noise), small values can lead to underfitting
Linear SVMs in scikit-learn:

- `svm.LinearSVC`: faster for large datasets. Returns `coef_` ($\mathbf{w}$) and `intercept_` ($w_0$).
- `svm.SVC` with `kernel='linear'`: allows kernelization (see later). Also returns `support_vectors_` (the support vectors) and the `dual_coef_` $a_i$.
- `svm.LinearSVR` and `svm.SVR` are variants for regression.

from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)
print("Support vectors:", clf.support_vectors_[:])
print("Coefficients:", clf.dual_coef_[:])
Support vectors: [[-1.021 0.241] [-0.467 -0.531] [ 0.951 0.58 ]] Coefficients: [[-0.048 -0.569 0.617]]
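As a sanity check of $\mathbf{w} = \sum_{i} a_i y_i \mathbf{x_i}$: for a linear kernel, sklearn stores the products $a_i y_i$ in `dual_coef_`, so the primal weights can be recovered from the support vectors. A self-contained sketch on synthetic blobs (the data is illustrative):

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Two linearly separable blobs as a stand-in for the toy data above
X, Y = make_blobs(n_samples=40, centers=2, random_state=6)

clf = svm.SVC(kernel='linear').fit(X, Y)

# Reconstruct w from the dual coefficients (a_i * y_i) and support vectors
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("Reconstructed w:", w_from_dual)
print("coef_ reported by sklearn:", clf.coef_)  # should be numerically identical
```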
`SGDClassifier` can be used with any of these loss functions and is good for large datasets (see the sketch after the overview table below).

Name | Representation | Loss function | Optimization | Regularization |
---|---|---|---|---|
Least squares | Linear function (R) | SSE | CFS or SGD | None |
Ridge | Linear function (R) | SSE + L2 | CFS or SGD | L2 strength ($\alpha$) |
Lasso | Linear function (R) | SSE + L1 | Coordinate descent | L1 strength ($\alpha$) |
Elastic-Net | Linear function (R) | SSE + L1 + L2 | Coordinate descent | $\alpha$, L1 ratio ($\rho$) |
SGDRegressor | Linear function (R) | SSE, Huber, $\epsilon$-ins,... + L1/L2 | SGD | L1/L2, $\alpha$ |
Logistic regression | Linear function (C) | Log + L1/L2 | SGD, coordinate descent,... | L1/L2, $\alpha$ |
Ridge classification | Linear function (C) | SSE + L2 | CFS or SGD | L2 strength ($\alpha$) |
Linear SVM | Support Vectors | Hinge(1) | Quadratic programming or SGD | Cost (C) |
Least Squares SVM | Support Vectors | Squared Hinge | Linear equations or SGD | Cost (C) |
Perceptron | Linear function (C) | Hinge(0) | SGD | None |
SGDClassifier | Linear function (C) | Log, (Sq.) Hinge, Mod. Huber,... | SGD | L1/L2, $\alpha$ |
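A minimal `SGDClassifier` sketch on synthetic data; `loss='hinge'` (a linear SVM) is just one of the loss options listed in the table:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; in practice SGDClassifier shines on much larger datasets
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hinge loss trains a linear SVM; other losses from the table can be passed instead
sgd = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4, random_state=0)
sgd.fit(X_train, y_train)
print("Test accuracy:", sgd.score(X_test, y_test))
```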