Linear models: $\hat{y} = \mathbf{w} \cdot \mathbf{x} + w_0 = \sum_{i=1}^{p} w_i x_i + w_0 = w_0 + w_1 x_1 + ... + w_p x_p$
When a linear model cannot fit the data well, we can add non-linear transformations of the features
Weights: [0.418]
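As a minimal sketch (on synthetic 1D data, so the fitted weight will not match the value printed above), a linear model and its weights can be obtained with scikit-learn like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data; the original notebook uses its own dataset
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))            # one input feature x_1
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

lin = LinearRegression().fit(X, y)
print("Weights:", lin.coef_, "Intercept:", lin.intercept_)   # w_1 and w_0
```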
`PolynomialFeatures`
| | x0 | x0^2 | x0^3 | x0^4 | x0^5 | x0^6 | x0^7 | x0^8 | x0^9 | x0^10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.752759 | 0.566647 | -0.426548 | 0.321088 | -0.241702 | 0.181944 | -0.136960 | 0.103098 | -0.077608 | 0.058420 |
| 1 | 2.704286 | 7.313162 | 19.776880 | 53.482337 | 144.631526 | 391.124988 | 1057.713767 | 2860.360362 | 7735.232021 | 20918.278410 |
| 2 | 1.391964 | 1.937563 | 2.697017 | 3.754150 | 5.225640 | 7.273901 | 10.125005 | 14.093639 | 19.617834 | 27.307312 |
| 3 | 0.591951 | 0.350406 | 0.207423 | 0.122784 | 0.072682 | 0.043024 | 0.025468 | 0.015076 | 0.008924 | 0.005283 |
| 4 | -2.063888 | 4.259634 | -8.791409 | 18.144485 | -37.448187 | 77.288869 | -159.515582 | 329.222321 | -679.478050 | 1402.366700 |
Weights: [ 0.643 0.297 -0.69 -0.264 0.41 0.096 -0.076 -0.014 0.004 0.001]
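A sketch of how such an expansion and fit could be reproduced (the dataset here is hypothetical, so the table entries and weights above will differ):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

# Expand x0 into x0, x0^2, ..., x0^10 (the bias column is left to the model's intercept)
poly = PolynomialFeatures(degree=10, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out())      # ['x0', 'x0^2', ..., 'x0^10']

lin = LinearRegression().fit(X_poly, y)
print("Weights:", lin.coef_.round(3))
```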
Ridge has a closed-form solution which we can compute with linear algebra: $$w^{*} = (X^{T}X + \alpha I)^{-1} X^T Y$$
Since X has $n$ rows (examples) and $d$ columns (features), $X^{T}X$ has dimensionality $d \times d$
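A sketch of the closed-form computation with plain numpy (on a small hypothetical design matrix), solving the $d \times d$ system instead of explicitly inverting it:

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.0, 0.7, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=n)

alpha = 1.0
# w* = (X^T X + alpha I)^{-1} X^T y, where X^T X is d x d
w_star = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
print(w_star.round(3))
```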
After the feature map $\Phi$, we get $$w^{*} = (\Phi(X)^{T}\Phi(X) + \alpha I)^{-1} \Phi(X)^T Y$$
Since $\Phi$ increases $d$ a lot, the $d \times d$ matrix $\Phi(X)^{T}\Phi(X)$ becomes huge, and solving the system becomes very expensive
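To see how quickly this blows up, here is a small sketch counting the number of features a polynomial expansion produces for $p = 10$ original features (the count, including the bias term, is $\binom{p + degree}{degree}$):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 10))                     # 10 original features
for degree in [2, 3, 5, 10]:
    poly = PolynomialFeatures(degree=degree).fit(X)
    print(degree, poly.n_output_features_)
# degree 10 already yields 184756 features, so Phi(X)^T Phi(X) would be 184756 x 184756
```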
We can add a new feature by taking the square of the values of feature1
Now we can fit a linear model in this expanded, higher-dimensional feature space
As a function of the original features, the decision boundary is now a polynomial as well: the model predicts the positive class whenever $$w_0 + w_1 x_1 + w_2 x_2 + w_3 x_2^2 > 0$$
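A minimal sketch of this idea on hypothetical data where the second feature only matters through its square; a linear model fit on the three columns $[x_1, x_2, x_2^2]$ separates classes that are not linearly separable in the original two features:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical 2D data: the true boundary is x_2^2 + x_1 = 4, a parabola in (x_1, x_2)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = (X[:, 1] ** 2 + X[:, 0] > 4).astype(int)

# Append the square of feature 1 as a third feature: [x_1, x_2, x_2^2]
X_new = np.hstack([X, X[:, 1:] ** 2])

clf = LinearSVC(C=10, max_iter=10000).fit(X_new, y)
print("accuracy:", clf.score(X_new, y))
print("w:", clf.coef_.round(3), "w0:", clf.intercept_.round(3))
```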
A (Mercer) kernel is any function $k: X \times X \rightarrow \mathbb{R}$ with these properties: it is symmetric, $k(\mathbf{x}_1,\mathbf{x}_2) = k(\mathbf{x}_2,\mathbf{x}_1)$, and positive semi-definite, meaning the kernel matrix below never has negative eigenvalues.
The kernel matrix (or Gram matrix) for $n$ points $x_1, \ldots, x_n \in X$ is defined as the $n \times n$ matrix $K$ with entries $$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$$
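A small sketch computing a Gram matrix explicitly (here with the polynomial kernel from `sklearn.metrics.pairwise` as an example) and checking the Mercer properties:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.RandomState(0)
X = rng.normal(size=(5, 3))                       # 5 points with 3 features

K = polynomial_kernel(X, degree=3)                # K[i, j] = k(x_i, x_j)
print(K.shape)                                    # (5, 5)
print(np.allclose(K, K.T))                        # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))    # positive semi-definite
```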
The Radial Basis Function (RBF) feature map is based on the Taylor series expansion of $e^x$ (written here for 1D input): $$\Phi(x) = e^{-x^2/2\gamma^2} \Big[ 1, \sqrt{\frac{1}{1!\gamma^2}}x,\sqrt{\frac{1}{2!\gamma^4}}x^2,\sqrt{\frac{1}{3!\gamma^6}}x^3,\ldots\Big]^T$$
RBF (or Gaussian) kernel with kernel parameter $\gamma \geq 0$ (larger $\gamma$ means a narrower, more peaked kernel):
$$k_{RBF}(\mathbf{x_1},\mathbf{x_2}) = \exp(-\gamma ||\mathbf{x_1} - \mathbf{x_2}||^2)$$
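A numerical sketch in 1D showing that a truncated version of the (infinite) feature map above already reproduces the kernel value. Note the parameterization: the feature map is written with a length scale ($e^{-x^2/2\gamma^2}$), so the matching kernel is $\exp(-(x_1-x_2)^2/(2\gamma^2))$, whereas scikit-learn's `gamma` multiplies the squared distance directly as in the formula just above:

```python
import numpy as np
from math import factorial

def phi(x, gamma=1.0, terms=20):
    """Truncated RBF feature map from the Taylor expansion above (1D input)."""
    coeffs = np.array([np.sqrt(1.0 / (factorial(k) * gamma ** (2 * k))) * x ** k
                       for k in range(terms)])
    return np.exp(-x ** 2 / (2 * gamma ** 2)) * coeffs

x1, x2, gamma = 0.7, -0.3, 1.0
approx = phi(x1, gamma) @ phi(x2, gamma)            # inner product in feature space
exact = np.exp(-(x1 - x2) ** 2 / (2 * gamma ** 2))  # kernel value
print(approx, exact)                                # nearly identical
```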
Kernel overview
Scikit-learn offers kernelized models, for example:
- `SVC` for classification with a range of kernels
- `SVR` for regression

The Representer Theorem states that if we have a loss function $\mathcal{L}'$ that consists of an arbitrary loss on the training predictions plus a regularization term which is a non-decreasing function of $\|\mathbf{w}\|$ (ridge regression and SVMs are examples of this form).
Then the weights $\mathbf{w}$ can be described as a linear combination of the training samples: $$\mathbf{w} = \sum_{i=1}^{n} a_i y_i \mathbf{x}_i$$ (or $\sum_{i=1}^{n} a_i y_i \Phi(\mathbf{x}_i)$ after a feature map $\Phi$), so predictions only require inner products with the training samples.
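This is what scikit-learn's kernelized estimators store internally. As a sketch on a hypothetical toy dataset: for `SVC`, `dual_coef_` holds the products $a_i y_i$ for the support vectors, and predictions are built from kernel evaluations against those samples (the same kernels are available for regression via `SVR`):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

X, y = make_circles(noise=0.1, factor=0.4, random_state=0)

svc = SVC(kernel='rbf', gamma=2, C=1).fit(X, y)
print(svc.n_support_)          # number of support vectors per class
print(svc.dual_coef_.shape)    # one coefficient a_i * y_i per support vector
print(svc.score(X, y))
```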
`KernelRidge` for kernelized ridge regression
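A minimal sketch (hypothetical 1D data, RBF kernel) showing that `KernelRidge` also exposes one dual coefficient per training sample, in line with the Representer Theorem:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

kr = KernelRidge(kernel='rbf', gamma=0.5, alpha=1.0).fit(X, y)
print(kr.score(X, y))            # R^2 on the training data
print(kr.dual_coef_.shape)       # (60,): one dual coefficient per training sample
```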