XKCD, Randall Munroe
Bayes' rule: the rule for updating the probability of a hypothesis $c$ given data $x$
$P(c|x)$ is the posterior probability of class $c$ given data $x$.
$P(c)$ is the prior probability of class $c$: what you believed before you saw the data $x$
$P(x|c)$ is the likelihood of data point $x$ given that the class is $c$ (computed from your dataset)
$P(x)$ is the prior probability of the data (marginal likelihood): the probability of observing $x$ regardless of the class (obtained by summing $P(x|c)\,P(c)$ over all classes)
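Putting these together, Bayes' theorem reads: $$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$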
What's the probability that your friend will play golf if the weather is sunny?
GaussianNB: Computes mean $\mu_c$ and standard deviation $\sigma_c$ of the feature values per class: $p(x=v \mid c)=\frac{1}{\sqrt{2\pi\sigma^2_c}}\,e^{ -\frac{(v-\mu_c)^2}{2\sigma^2_c} }$
We can now make predictions using Bayes' theorem: $p(c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid c) \ p(c)}{p(\mathbf{x})}$
Other Naive Bayes classifiers use different likelihood distributions for $P(x|c)$, e.g. BernoulliNB (binary features) and MultinomialNB (count features).
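A minimal sketch of GaussianNB in scikit-learn (the single temperature feature and labels below are hypothetical, not the golf dataset from the slides):
import numpy as np
from sklearn.naive_bayes import GaussianNB
X = np.array([[25.], [28.], [30.], [15.], [12.], [18.]])   # hypothetical temperatures
y = np.array([1, 1, 1, 0, 0, 0])                           # 1 = play golf, 0 = don't play
gnb = GaussianNB().fit(X, y)           # estimates mu_c and sigma_c per class
print(gnb.theta_)                      # per-class feature means
print(gnb.predict_proba([[26.]]))      # posterior P(c | x) via Bayes' theorem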
Linear regression (recap):
$$y = f(\mathbf{x}_i) = \mathbf{x}_i\mathbf{w} + b$$
For one input feature: $$y = w_1 \cdot x_1 + b \cdot 1$$
We can solve this via linear algebra (closed form solution): $\mathbf{w}^{*} = (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\mathbf{y}$
# Solve the normal equations (X^T X) w = X^T y
w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
$\mathbf{X}$ is our data matrix with a $x_0=1$ column to represent the bias $b$:
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^\top \\ \mathbf{x}_2^\top \\ \vdots \\ \mathbf{x}_N^\top \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}$$
We learned: $y = w_1 x + w_0 = -0.013 x + 28.895$
We can fit a 2nd degree polynomial by using a basis expansion (adding more basis functions):
$$\mathbf{\Phi} = \left[ \mathbf{1} \quad \mathbf{x} \quad \mathbf{x}^2\right]$$
We can also kernelize the model and learn a dual coefficient per data point.
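A minimal NumPy sketch of this 2nd-degree basis expansion, reusing the closed-form solution from above (the data is hypothetical):
import numpy as np
x = np.linspace(0, 1, 20)                              # hypothetical 1-D inputs
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)  # noisy targets
Phi = np.column_stack([np.ones_like(x), x, x**2])      # Phi = [1, x, x^2]
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)            # closed-form least squares
y_hat = Phi @ w                                        # 2nd-degree polynomial fit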
We now have an uncertainty estimate for our predictions, but it is the same for every prediction.
In the Bayesian approach, we assume a prior (Gaussian) distribution for the parameters, $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha \mathbf{I})$:
I.e., each weight $w_i$ is drawn from a Gaussian density with mean 0 and variance $\alpha$: $$w_i \sim \mathcal{N}(0,\alpha)$$
We can sample from the prior distribution to see what form we are imposing on the functions a priori (before seeing any data).
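A minimal sketch of such prior sampling, assuming the 2nd-degree polynomial basis from above (the number of samples and the value of alpha are arbitrary):
import numpy as np
alpha = 0.5                                            # prior variance of the weights
x = np.linspace(-1, 1, 100)
Phi = np.column_stack([np.ones_like(x), x, x**2])      # basis functions, as above
for _ in range(5):
    w = np.sqrt(alpha) * np.random.randn(Phi.shape[1]) # w_i ~ N(0, alpha)
    f = Phi @ w                                        # one random function drawn from the prior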
We assume that our data is Gaussian distributed: $P(y \mid \mathbf{x}_{test}, \mathbf{X}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
Example with learned mean $[m,m]$ and covariance $\begin{bmatrix} \alpha & \beta \\ \beta & \alpha \end{bmatrix}$
Repeat for 40 dimensions, with $\boldsymbol{\Phi}$ the polynomial transform:
We normally add Gaussian noise to obtain our observations: $$\mathbf{y} = \mathbf{f} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$$
The RBF (Gaussian) covariance function is defined as $$k(\mathbf{x}, \mathbf{x}^\prime) = \alpha \exp\left(-\frac{\left\Vert\mathbf{x} - \mathbf{x}^\prime\right\Vert^2}{2l^2}\right)$$
where $\left\Vert\mathbf{x} - \mathbf{x}^\prime\right\Vert^2$ is the squared distance between the two input vectors $$ \left\Vert\mathbf{x} - \mathbf{x}^\prime\right\Vert^2 = (\mathbf{x} - \mathbf{x}^\prime)^\top (\mathbf{x} - \mathbf{x}^\prime) $$
and the lengthscale parameter $l$ controls the smoothness of the function and $\alpha$ the vertical variation.
Now the influence of a point decreases smoothly but exponentially with distance.
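A minimal NumPy sketch of this covariance function (rbf_kernel is an illustrative helper, not from the original notebook; the inputs are hypothetical):
import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, l=1.0):
    # pairwise squared distances ||x - x'||^2 between rows of X1 and X2
    sq_dist = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return alpha * np.exp(-0.5 * sq_dist / l**2)

X = np.linspace(-3, 3, 50)[:, None]
K = rbf_kernel(X, X)        # covariance matrix of the function values at X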
Assuming that $P(\mathbf{y}|\mathbf{X})$ is a Gaussian density with a covariance given by the kernel matrix $\mathbf{K}$ (plus the noise term $\sigma^2\mathbf{I}$), the model likelihood becomes: $$ P(\mathbf{y}|\mathbf{X}) = \frac{1}{(2\pi)^{\frac{n}{2}}|\mathbf{K}+\sigma^2\mathbf{I}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}\mathbf{y}^\top \left(\mathbf{K}+\sigma^2 \mathbf{I}\right)^{-1}\mathbf{y}\right) $$
Hence, the negative log likelihood (the objective function, up to a constant) is given by: $$ E(\boldsymbol{\theta}) = \frac{1}{2} \log |\mathbf{K}+\sigma^2\mathbf{I}| + \frac{1}{2} \mathbf{y}^\top \left(\mathbf{K} + \sigma^2\mathbf{I}\right)^{-1}\mathbf{y} $$
The model parameters (e.g. noise variance $\sigma^2$) and the kernel parameters (e.g. lengthscale, variance) can be embedded in the covariance function and learned from data.
Good news: This loss function can be optimized using linear algebra (Cholesky Decomposition)
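A minimal sketch of evaluating this objective with a Cholesky decomposition (K is a kernel matrix, e.g. from the rbf_kernel sketch above, y the observation vector, sigma_sq a hypothetical noise variance):
import numpy as np

def gp_neg_log_likelihood(K, y, sigma_sq):
    n = len(y)
    L = np.linalg.cholesky(K + sigma_sq * np.eye(n))     # K + sigma^2 I = L L^T
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))      # (K + sigma^2 I)^{-1} y
    # 0.5 * log|K + sigma^2 I| equals sum(log(diag(L)))
    return np.sum(np.log(np.diag(L))) + 0.5 * y @ a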
The model makes predictions for $\mathbf{f}$ that are unaffected by future values of $\mathbf{f}^*$.
If we think of $\mathbf{f}^*$ as test points, we can still write down a joint probability density over the training observations, $\mathbf{f}$ and the test observations, $\mathbf{f}^*$.
This joint probability density will be Gaussian, with a covariance matrix given by our kernel function, $k(\mathbf{x}_i, \mathbf{x}_j)$. $$ \begin{bmatrix}\mathbf{f} \\ \mathbf{f}^*\end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K} & \mathbf{K}_\ast \\ \mathbf{K}_\ast^\top & \mathbf{K}_{\ast,\ast}\end{bmatrix}\right) $$
where $\mathbf{K}$ is the kernel matrix computed between all the training points,
$\mathbf{K}_\ast$ is the kernel matrix computed between the training points and the test points,
$\mathbf{K}_{\ast,\ast}$ is the kernel matrix computed between all the test points and themselves.
Finally, we need to define conditional distributions to answer particular questions of interest
We will need the conditional density for making predictions. $$ \mathbf{f}^* | \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_f,\mathbf{C}_f) $$ with a mean given by $ \boldsymbol{\mu}_f = \mathbf{K}_*^\top \left[\mathbf{K} + \sigma^2 \mathbf{I}\right]^{-1} \mathbf{y} $
and a covariance given by $ \mathbf{C}_f = \mathbf{K}_{*,*} - \mathbf{K}_*^\top \left[\mathbf{K} + \sigma^2 \mathbf{I}\right]^{-1} \mathbf{K}_\ast. $
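A minimal NumPy sketch of these predictive equations, reusing the rbf_kernel helper from above (the data and noise variance are hypothetical):
import numpy as np
X = np.linspace(-3, 3, 10)[:, None]                    # hypothetical training inputs
y = np.sin(X).ravel() + 0.1 * np.random.randn(10)      # noisy training targets
X_test = np.linspace(-3, 3, 100)[:, None]              # test inputs
sigma_sq = 0.01                                        # noise variance
K = rbf_kernel(X, X)                                   # K
K_s = rbf_kernel(X, X_test)                            # K_*
K_ss = rbf_kernel(X_test, X_test)                      # K_**
Kn = K + sigma_sq * np.eye(len(X))                     # K + sigma^2 I
mu_f = K_s.T @ np.linalg.solve(Kn, y)                  # predictive mean
C_f = K_ss - K_s.T @ np.linalg.solve(Kn, K_s)          # predictive covariance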
Remember that our prediction is a Gaussian distribution defined by a mean and a variance: $P(\mathbf{y}|\mathbf{x}_{test}, \mathbf{X}) = \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$
The values on the diagonal of the covariance matrix give us the variances, so we can simply plot the mean and a 95% confidence interval
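Continuing the prediction sketch above:
std_f = np.sqrt(np.diag(C_f))                          # standard deviation per test point
lower, upper = mu_f - 1.96 * std_f, mu_f + 1.96 * std_f   # 95% confidence interval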
GPyRegression
Generate a kernel first
Variance and lengthscale are optional, default = 1
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
Other kernels:
GPy.kern.BasisFuncKernel?
Build model:
m = GPy.models.GPRegression(X,Y,kernel)
The Matérn kernel is a generalization of the RBF kernel that can scale between the RBF and the Exponential kernel.
Build the untrained GP. The shaded region corresponds to ~95% confidence intervals (i.e. +/- 2 standard deviations)
Train the model (optimize the parameters): maximize the likelihood of the data.
Best to optimize with a few restarts: the optimizer may converge to a high-noise solution. The optimizer is then restarted with a few random initializations of the parameter values.
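For example, with the GPy model m built above (the number of restarts is arbitrary):
m.optimize(messages=True)                   # maximize the log likelihood once
m.optimize_restarts(num_restarts=10)        # restart from random initializations, keep the best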
You can also show results in 2D
We can plot 2D slices using the fixed_inputs argument to the plot function. fixed_inputs is a list of tuples containing which of the inputs to fix, and to which value.
GaussianProcessRegressor
kernel: kernel specifying the covariance function of the GP
alpha: regularization parameter
n_restarts_optimizer: number of restarts of the optimizer
y_pred, sigma = gp.predict(x, return_std=True)
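A minimal scikit-learn sketch with these parameters (toy 1-D data and arbitrary settings, for illustration only):
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 20)[:, None]                    # hypothetical training data
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              alpha=1e-2, n_restarts_optimizer=10)
gp.fit(X, y)
x_new = np.linspace(0, 10, 100)[:, None]
y_pred, sigma = gp.predict(x_new, return_std=True)     # predictive mean and standard deviation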
Example
Example with noisy data
Advantages:
Disadvantages:
Shahriari et al. Taking the Human Out of the Loop: A Review of Bayesian Optimization
In 2 dimensions: