Lecture 8. Neural Networks¶

How to train your neurons

Joaquin Vanschoren

Overview¶

  • Neural architectures
  • Training neural nets
    • Forward pass: Tensor operations
    • Backward pass: Backpropagation
  • Neural network design:
    • Activation functions
    • Weight initialization
    • Optimizers
  • Neural networks in practice
  • Model selection
    • Early stopping
    • Memorization capacity and information bottleneck
    • L1/L2 regularization
    • Dropout
    • Batch normalization

Linear models as a building block¶

  • Logistic regression, drawn in a different, neuro-inspired, way
    • Linear model: inner product ($z$) of input vector $\mathbf{x}$ and weight vector $\mathbf{w}$, plus bias $w_0$
    • Logistic (or sigmoid) function maps the output to a probability in [0,1]
    • Uses log loss (cross-entropy) and gradient descent to learn the weights
$$\hat{y}(\mathbf{x}) = \text{sigmoid}(z) = \text{sigmoid}(w_0 + \mathbf{w}\mathbf{x}) = \text{sigmoid}(w_0 + w_1 * x_1 + w_2 * x_2 +... + w_p * x_p)$$

Basic Architecture¶

  • Add one (or more) hidden layers $h$ with $k$ nodes (or units, cells, neurons)
    • Every 'neuron' is a tiny function; the network as a whole is an arbitrarily complex function
    • Weights $w_{i,j}$ between node $i$ and node $j$ form a weight matrix $\mathbf{W}^{(l)}$ per layer $l$
  • Every neuron weights the inputs $\mathbf{x}$ and passes the result through a non-linear activation function
    • Activation functions ($f,g$) can be different per layer, output $\mathbf{a}$ is called activation $$\color{blue}{h(\mathbf{x})} = \color{blue}{\mathbf{a}} = f(\mathbf{z}) = f(\mathbf{W}^{(1)} \color{green}{\mathbf{x}}+\mathbf{w}^{(1)}_0) \quad \quad \color{red}{o(\mathbf{x})} = g(\mathbf{W}^{(2)} \color{blue}{\mathbf{a}}+\mathbf{w}^{(2)}_0)$$
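  • A minimal NumPy sketch of this computation (illustrative sizes, sigmoid for both $f$ and $g$):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input vector (4 features)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # hidden layer: weight matrix and bias
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output layer: weight matrix and bias

a = sigmoid(W1 @ x + b1)   # hidden activation h(x) = f(W1 x + b1)
o = sigmoid(W2 @ a + b2)   # network output o(x) = g(W2 a + b2)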

More layers¶

  • Add more layers, and more nodes per layer, to make the model more complex
    • For simplicity, we don't draw the biases (but remember that they are there)
  • In dense (fully-connected) layers, every node is connected to all nodes of the previous layer
  • The output layer can also have multiple nodes (e.g. 1 per class in multi-class classification)

Why layers?¶

  • Each layer acts as a filter and learns a new representation of the data
    • Subsequent layers can learn iterative refinements
    • Easier than learning a complex relationship in one go
  • Example: for image input, each layer yields new (filtered) images
    • Can learn multiple mappings at once: weight tensor $\mathit{W}$ yields activation tensor $\mathit{A}$
    • From low-level patterns (edges, end-points, ...) to combinations thereof
    • Each neuron 'lights up' if certain patterns occur in the input


Other architectures¶

  • There exist MANY types of networks for many different tasks
  • Convolutional nets for image data, Recurrent nets for sequential data,...
  • Also used to learn representations (embeddings), generate new images, text,...


Training Neural Nets¶

  • Design the architecture, choose activation functions (e.g. sigmoids)
  • Choose a way to initialize the weights (e.g. random initialization)
  • Choose a loss function (e.g. log loss) to measure how well the model fits training data
  • Choose an optimizer (typically an SGD variant) to update the weights


Mini-batch Stochastic Gradient Descent (recap)¶

  1. Draw a batch of batch_size training data $\mathbf{X}$ and $\mathbf{y}$
  2. Forward pass : pass $\mathbf{X}$ through the network to yield predictions $\mathbf{\hat{y}}$
  3. Compute the loss $\mathcal{L}$ (mismatch between $\mathbf{\hat{y}}$ and $\mathbf{y}$)
  4. Backward pass : Compute the gradient of the loss with regard to every weight
    • Backpropagate the gradients through all the layers
  5. Update $W$: $W_{(i+1)} = W_{(i)} - \eta \frac{\partial \mathcal{L}(x, W_{(i)})}{\partial W}$

Repeat until n passes (epochs) are made through the entire training set
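A self-contained sketch of this loop for the simplest case (logistic regression with log loss); layer-by-layer backpropagation follows later:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sgd_logreg(X, y, eta=0.1, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):                               # repeat for n epochs
        idx = rng.permutation(len(X))
        for s in range(0, len(X), batch_size):
            B = idx[s:s + batch_size]                     # 1. draw a batch
            y_hat = sigmoid(X[B] @ w + b)                 # 2. forward pass
            error = y_hat - y[B]                          # 3./4. gradient of the log loss w.r.t. z
            grad_w, grad_b = X[B].T @ error / len(B), error.mean()
            w, b = w - eta * grad_w, b - eta * grad_b     # 5. update the weights
    return w, b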


Forward pass¶

  • We can naturally represent the data as tensors
    • Numerical n-dimensional array (with n axes)
    • 2D tensor: matrix (samples, features)
    • 3D tensor: time series (samples, timesteps, features)
    • 4D tensor: color images (samples, height, width, channels)
    • 5D tensor: video (samples, frames, height, width, channels)
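  • A quick NumPy illustration (made-up sizes):
import numpy as np

X_tabular = np.zeros((100, 20))             # 2D: (samples, features)
X_series  = np.zeros((100, 50, 20))         # 3D: (samples, timesteps, features)
X_images  = np.zeros((100, 28, 28, 3))      # 4D: (samples, height, width, channels)
X_video   = np.zeros((100, 16, 64, 64, 3))  # 5D: (samples, frames, height, width, channels)
print(X_images.ndim, X_images.shape)        # 4 (100, 28, 28, 3)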


Tensor operations¶

  • The operations that the network performs on the data can be reduced to a series of tensor operations
    • These are also much easier to run on GPUs
  • A dense layer with sigmoid activation, input tensor $\mathbf{X}$, weight tensor $\mathbf{W}$, bias $\mathbf{b}$:
y = sigmoid(np.dot(X, W) + b)
  • Tensor dot product for 2D inputs ($a$ samples, $b$ features, $c$ hidden nodes)


Element-wise operations¶

  • Activation functions and addition are element-wise operations:
def sigmoid(x):
  return 1/(1 + np.exp(-x)) 

def add(x, y):
  return x + y
  • Note: if y has a lower dimension than x, it will be broadcasted: axes are added to match the dimensionality, and y is repeated along the new axes
>>> np.array([[1,2],[3,4]]) + np.array([10,20])
array([[11, 22],
       [13, 24]])

Backward pass (backpropagation)¶

  • For the last layer, compute the gradient of the loss function $\mathcal{L}$ w.r.t. all weights of that layer $l$
$$\nabla \mathcal{L} = \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial w_{0,0}} & \ldots & \frac{\partial \mathcal{L}}{\partial w_{0,l}} \\ \vdots & \ddots & \vdots \\ \frac{\partial \mathcal{L}}{\partial w_{k,0}} & \ldots & \frac{\partial \mathcal{L}}{\partial w_{k,l}} \end{bmatrix} \\[15pt]$$
  • Sum up the gradients for all $\mathbf{x}_j$ in minibatch: $\sum_{j} \frac{\partial \mathcal{L}(\mathbf{x}_j,y_j)}{\partial W^{(l)}}$
  • Update all weights in a layer at once (with learning rate $\eta$): $W_{(i+1)}^{(l)} = W_{(i)}^{(l)} - \eta \sum_{j} \frac{\partial \mathcal{L}(\mathbf{x}_j,y_j)}{\partial W_{(i)}^{(l)}}$
  • Repeat for the previous layer, iterating backwards through the network (most efficient: reuses results, avoids redundant calculations)


Backpropagation (example)¶

  • Imagine feeding a single data point, output is $\hat{y} = g(z) = g(w_0 + w_1 * a_1 + w_2 * a_2 +... + w_p * a_p)$
  • Decrease loss by updating weights:
    • Update the weights of last layer to maximize improvement: $w_{i,(new)} = w_{i} - \frac{\partial \mathcal{L}}{\partial w_i} * \eta$
    • To compute the gradient $\frac{\partial \mathcal{L}}{\partial w_i}$ we need the chain rule: $\frac{d}{dx}f(g(x)) = f'(g(x)) * g'(x)$ $$\frac{\partial \mathcal{L}}{\partial w_i} = \color{red}{\frac{\partial \mathcal{L}}{\partial g}} \color{blue}{\frac{\partial \mathcal{g}}{\partial z_0}} \color{green}{\frac{\partial \mathcal{z_0}}{\partial w_i}}$$
  • E.g., with $\mathcal{L} = \frac{1}{2}(y-\hat{y})^2$ and sigmoid $\sigma$: $\frac{\partial \mathcal{L}}{\partial w_i} = \color{red}{(\hat{y} - y)} * \color{blue}{\sigma'(z_0)} * \color{green}{a_i}$

Backpropagation (2)¶

  • Another way to decrease the loss $\mathcal{L}$ is to update the activations $a_i$
    • To update $a_i = f(z_i)$, we need to update the weights of the previous layer
    • We want to nudge $a_i$ in the right direction by updating $w_{i,j}$: $$\frac{\partial \mathcal{L}}{\partial w_{i,j}} = \frac{\partial \mathcal{L}}{\partial a_i} \frac{\partial a_i}{\partial z_i} \frac{\partial \mathcal{z_i}}{\partial w_{i,j}} = \left( \frac{\partial \mathcal{L}}{\partial g} \frac{\partial \mathcal{g}}{\partial z_0} \frac{\partial \mathcal{z_0}}{\partial a_i} \right) \frac{\partial a_i}{\partial z_i} \frac{\partial \mathcal{z_i}}{\partial w_{i,j}}$$
    • We know $\frac{\partial \mathcal{L}}{\partial g}$ and $\frac{\partial \mathcal{g}}{\partial z_0}$ from the previous step, $\frac{\partial \mathcal{z_0}}{\partial a_i} = w_i$, $\frac{\partial a_i}{\partial z_i} = f'$ and $\frac{\partial \mathcal{z_i}}{\partial w_{i,j}} = x_j$

Backpropagation (3)¶

  • With multiple output nodes, $\mathcal{L}$ is the sum of all per-output (per-class) losses
    • $\frac{\partial \mathcal{L}}{\partial a_i}$ is the sum of the gradients for every output
  • Per layer, sum up gradients for every point $\mathbf{x}$ in the batch: $\sum_{j} \frac{\partial \mathcal{L}(\mathbf{x}_j,y_j)}{\partial W}$
  • Update all weights of every layer $l$
    • $W_{(i+1)}^{(l)} = W_{(i)}^{(l)} - \eta \sum_{j} \frac{\partial \mathcal{L}(\mathbf{x}_j,y_j)}{\partial W_{(i)}^{(l)}}$
  • Repeat with a new batch of data until loss converges
  • Nice animation of the entire process

Backpropagation (summary)¶

  • The network output $a_o$ is defined by the weights $W^{(o)}$ and biases $\mathbf{b}^{(o)}$ of the output layer, and
  • the activations of a hidden layer $h_1$ with activation function $a_{h_1}$, weights $W^{(1)}$ and biases $\mathbf{b}^{(1)}$:
$$\color{red}{a_o(\mathbf{x})} = \color{red}{a_o(\mathbf{z_0})} = \color{red}{a_o(W^{(o)}} \color{blue}{a_{h_1}(z_{h_1})} \color{red}{+ \mathbf{b}^{(o)})} = \color{red}{a_o(W^{(o)}} \color{blue}{a_{h_1}(W^{(1)} \color{green}{\mathbf{x}} + \mathbf{b}^{(1)})} \color{red}{+ \mathbf{b}^{(o)})} $$
  • Minimize the loss by SGD. For layer $l$, compute $\frac{\partial \mathcal{L}(a_o(x))}{\partial W_l}$ and $\frac{\partial \mathcal{L}(a_o(x))}{\partial b_{l,i}}$ using the chain rule
  • Decomposes into gradient of layer above, gradient of activation function, gradient of layer input:
$$\frac{\partial \mathcal{L}(a_o)}{\partial W^{(1)}} = \color{red}{\frac{\partial \mathcal{L}(a_o)}{\partial a_{h_1}}} \color{blue}{\frac{\partial a_{h_1}}{\partial z_{h_1}}} \color{green}{\frac{\partial z_{h_1}}{\partial W^{(1)}}} = \left( \color{red}{\frac{\partial \mathcal{L}(a_o)}{\partial a_o}} \color{blue}{\frac{\partial a_o}{\partial z_o}} \color{green}{\frac{\partial z_o}{\partial a_{h_1}}}\right) \color{blue}{\frac{\partial a_{h_1}}{\partial z_{h_1}}} \color{green}{\frac{\partial z_{h_1}}{\partial W^{(1)}}} $$
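  • A minimal single-example sketch of this decomposition for one hidden layer (sigmoid activations, squared loss, illustrative sizes):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), np.array([1.0])
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # hidden layer
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output layer

# Forward pass (keep intermediate values: they are reused in the backward pass)
z1 = W1 @ x + b1;   a1 = sigmoid(z1)
z_o = W2 @ a1 + b2; a_o = sigmoid(z_o)

# Backward pass: chain rule, layer by layer (L = 0.5 * (y - a_o)^2)
delta_o = (a_o - y) * a_o * (1 - a_o)        # dL/da_o * da_o/dz_o
dW2, db2 = np.outer(delta_o, a1), delta_o    # dz_o/dW2 = a1
delta_1 = (W2.T @ delta_o) * a1 * (1 - a1)   # dz_o/da1 = W2, then da1/dz1
dW1, db1 = np.outer(delta_1, x), delta_1     # dz1/dW1 = x

# Gradient descent step with learning rate eta
eta = 0.1
W1, b1, W2, b2 = W1 - eta * dW1, b1 - eta * db1, W2 - eta * dW2, b2 - eta * db2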


Activation functions for hidden layers¶

  • Sigmoid: $f(z) = \frac{1}{1+e^{-z}}$
  • Tanh: $f(z) = \frac{2}{1+e^{-2z}} - 1$
    • Activations around 0 are better for gradient descent convergence
  • Rectified Linear (ReLU): $f(z) = max(0,z)$
    • Less smooth, but much faster (note: not differentiable at 0)
  • Leaky ReLU: $f(z) = \begin{cases} 0.01z & z<0 \\ z & otherwise \end{cases}$
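  • The same functions in NumPy (all element-wise):
import numpy as np

def sigmoid(z):    return 1 / (1 + np.exp(-z))
def tanh(z):       return 2 / (1 + np.exp(-2 * z)) - 1      # same as np.tanh(z)
def relu(z):       return np.maximum(0, z)
def leaky_relu(z): return np.where(z < 0, 0.01 * z, z)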

Effect of activation functions on the gradient¶

  • During gradient descent, the gradient depends on the activation function $a_{h}$: $\frac{\partial \mathcal{L}(a_o)}{\partial W^{(l)}} = \color{red}{\frac{\partial \mathcal{L}(a_o)}{\partial a_{h_l}}} \color{blue}{\frac{\partial a_{h_l}}{\partial z_{h_l}}} \color{green}{\frac{\partial z_{h_l}}{\partial W^{(l)}}}$
  • If derivative of the activation function $\color{blue}{\frac{\partial a_{h_l}}{\partial z_{h_l}}}$ is 0, the weights $w_i$ are not updated
    • Moreover, the gradients of previous layers will be reduced (vanishing gradient)
  • sigmoid, tanh: gradient is very small for large inputs: slow updates
  • With ReLU, $\color{blue}{\frac{\partial a_{h_l}}{\partial z_{h_l}}} = 1$ if $z>0$, hence better against vanishing gradients
    • Problem: for very negative inputs, the gradient is 0 and may never recover (dying ReLU)
    • Leaky ReLU has a small (0.01) gradient there to allow recovery
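  • A quick numerical check of these derivatives (sigmoid and tanh vanish for large $|z|$, ReLU does not for $z>0$):
import numpy as np

z = np.array([-10., 0., 10.])
sigm = 1 / (1 + np.exp(-z))
print(sigm * (1 - sigm))        # sigmoid': [4.5e-05, 0.25, 4.5e-05]
print(1 - np.tanh(z)**2)        # tanh':    [8.2e-09, 1.0,  8.2e-09]
print((z > 0).astype(float))    # relu':    [0.0, 0.0, 1.0] (0 for negative inputs: dying ReLU)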

ReLU vs Tanh¶

  • What is the effect of using non-smooth activation functions?
    • ReLU produces piecewise-linear boundaries, but allows deeper networks
    • Tanh produces smoother decision boundaries, but is slower

Activation functions for output layer¶

  • sigmoid converts output to probability in [0,1]
    • For binary classification
  • softmax converts all outputs (aka 'logits') to probabilities that sum up to 1
    • For multi-class classification ($k$ classes)
    • Can cause over-confident models. If so, smooth the labels: $y_{smooth} = (1-\alpha)y + \frac{\alpha}{k}$ $$\text{softmax}(\mathbf{x},i) = \frac{e^{x_i}}{\sum_{j=1}^k e^{x_j}}$$
  • For regression, don't use any activation function on the output, so the model can predict any target value
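  • A minimal sketch of softmax (with the usual subtract-the-max trick for numerical stability) and label smoothing:
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())            # subtracting max(x) avoids overflow; result is unchanged
    return e / e.sum()

def smooth_labels(y_onehot, alpha=0.1):
    k = y_onehot.shape[-1]
    return (1 - alpha) * y_onehot + alpha / k

print(softmax(np.array([2.0, 1.0, 0.1])))            # probabilities summing to 1
print(smooth_labels(np.array([0., 0., 1.]), 0.1))    # [0.033 0.033 0.933]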

Weight initialization¶

  • Initializing weights to 0 is bad: all neurons in a layer get identical gradients and updates (symmetry is never broken)
  • Too small random weights shrink activations to 0 along the layers (vanishing gradient)
  • Too large random weights multiply along layers (exploding gradient, zig-zagging)
  • Ideal: small random weights + variance of input and output gradients remains the same
    • Glorot/Xavier initialization (for tanh): randomly sample from $N(0,\sigma), \sigma = \sqrt{\frac{2}{\text{fan\_in} + \text{fan\_out}}}$
      • fan_in: number of input units, fan_out: number of output units
    • He initialization (for ReLU): randomly sample from $N(0,\sigma), \sigma = \sqrt{\frac{2}{\text{fan\_in}}}$
    • Uniform sampling (instead of $N(0,\sigma)$) can work better for deeper networks (helps against vanishing gradients)
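  • A NumPy sketch of both initializations (weight matrices stored as (fan_in, fan_out), as in Keras):
import numpy as np
rng = np.random.default_rng(0)

def glorot_normal(fan_in, fan_out):   # for tanh layers
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):       # for ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W1 = he_normal(28 * 28, 512)          # e.g. the first hidden layer of the network used later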

Weight initialization: transfer learning¶

  • Instead of starting from scratch, start from weights previously learned from similar tasks
    • This is, to a large extent, how humans learn so fast
  • Transfer learning: learn weights on task T, transfer them to new network
    • Weights can be frozen, or finetuned to the new data
  • Only works if the previous task is 'similar' enough
    • Meta-learning: learn a good initialization across many related tasks
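  • A hypothetical Keras sketch (the file name, layer split, and new task size are assumptions, not a fixed recipe):
from tensorflow.keras import models, layers

pretrained = models.load_model('task_T_model.h5')       # weights learned earlier on task T (assumed file)
for layer in pretrained.layers[:-1]:
    layer.trainable = False                              # freeze the transferred layers (leave True to finetune)

new_model = models.Sequential(pretrained.layers[:-1])    # drop the old output layer
new_model.add(layers.Dense(5, activation='softmax'))     # new output layer for the new task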


Optimizers¶

SGD with learning rate schedules¶

  • Using a constant learning rate $\eta$ for weight updates $\mathbf{w}_{(s+1)} = \mathbf{w}_s-\eta\nabla \mathcal{L}(\mathbf{w}_s)$ is not ideal
  • Learning rate decay/annealing with decay rate $k$
    • E.g. exponential ($\eta_{s+1} = \eta_{s} e^{-ks}$), inverse-time ($\eta_{s+1} = \frac{\eta_{0}}{1+ks}$),...
  • Cyclical learning rates
    • Cycle the learning rate from small to large and back: hopefully it stays in a 'good' region long enough before diverging
    • Warm restarts: aggressive decay + reset to initial learning rate
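  • A small sketch of these schedules (closed-form versions; the triangular cycle is one simple choice):
import numpy as np

def exponential_decay(eta0, k, s):           # eta_s = eta_0 * exp(-k s)
    return eta0 * np.exp(-k * s)

def inverse_time_decay(eta0, k, s):          # eta_s = eta_0 / (1 + k s)
    return eta0 / (1 + k * s)

def triangular_cycle(eta_min, eta_max, s, cycle=50):
    phase = abs((s % cycle) / cycle * 2 - 1)             # goes 1 -> 0 -> 1 within every cycle
    return eta_min + (eta_max - eta_min) * (1 - phase)

etas = [exponential_decay(0.1, 0.05, s) for s in range(100)]   # learning rate per step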

Momentum¶

  • Imagine a ball rolling downhill: accumulates momentum, doesn't exactly follow steepest descent
    • Reduces oscillation and follows the larger, consistent gradients of the loss surface
  • Adds a velocity vector $\mathbf{v}$ with momentum $\gamma$ (e.g. 0.9, or increase from $\gamma=0.5$ to $\gamma=0.99$) $$\mathbf{w}_{(s+1)} = \mathbf{w}_{(s)} + \mathbf{v}_{(s)} \qquad \text{with} \qquad \color{blue}{\mathbf{v}_{(s)}} = \color{green}{\gamma \mathbf{v}_{(s-1)}} - \color{red}{\eta \nabla \mathcal{L}(\mathbf{w}_{(s)})}$$
  • Nesterov momentum: Look where momentum step would bring you, compute gradient there
    • Responds faster (and reduces momentum) when the gradient changes $$\color{blue}{\mathbf{v}_{(s)}} = \color{green}{\gamma \mathbf{v}_{(s-1)}} - \color{red}{\eta \nabla \mathcal{L}(\mathbf{w}_{(s)} + \gamma \mathbf{v}_{(s-1)})}$$
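  • A sketch of one (Nesterov) momentum step; grad_fn stands in for the gradient of the loss:
import numpy as np

def momentum_step(w, v, grad_fn, eta=0.01, gamma=0.9, nesterov=False):
    lookahead = w + gamma * v if nesterov else w     # Nesterov: evaluate the gradient after the momentum step
    v = gamma * v - eta * grad_fn(lookahead)         # v_s = gamma * v_{s-1} - eta * grad L
    return w + v, v                                  # w_{s+1} = w_s + v_s

# Example on L(w) = 0.5 * ||w||^2, whose gradient is simply w
w, v = np.array([2.0, -3.0]), np.zeros(2)
for _ in range(100):
    w, v = momentum_step(w, v, grad_fn=lambda w: w, eta=0.1, nesterov=True)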

Momentum in practice¶

(Interactive demo: compare optimizers with and without momentum over a number of iterations.)

Adaptive gradients¶

  • 'Correct' the learning rate for each $w_i$ based on specific local conditions (layer depth, fan-in,...)
  • Adagrad: scale $\eta$ according to the squared sum of previous gradients $G_{i,(s)} = \sum_{t=1}^s \nabla \mathcal{L}(w_{i,(t)})^2$
    • Update rule for $w_i$. Usually $\epsilon=10^{-7}$ (avoids division by 0), $\eta=0.001$. $$w_{i,(s+1)} = w_{i,(s)} - \frac{\eta}{\sqrt{G_{i,(s)}+\epsilon}} \nabla \mathcal{L}(w_{i,(s)})$$
  • RMSProp: use moving average of squared gradients $m_{i,(s)} = \gamma m_{i,(s-1)} + (1-\gamma) \nabla \mathcal{L}(w_{i,(s)})^2$
    • Avoids that gradients dwindle to 0 as $G_{i,(s)}$ grows. Usually $\gamma=0.9, \eta=0.001$ $$w_{i,(s+1)} = w_{i,(s)}- \frac{\eta}{\sqrt{m_{i,(s)}+\epsilon}} \nabla \mathcal{L}(w_{i,(s)})$$
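  • Both update rules as element-wise NumPy sketches (grad is the current gradient for the weights):
import numpy as np

def adagrad_step(w, G, grad, eta=0.001, eps=1e-7):
    G = G + grad**2                              # running sum of squared gradients
    return w - eta / np.sqrt(G + eps) * grad, G

def rmsprop_step(w, m, grad, eta=0.001, gamma=0.9, eps=1e-7):
    m = gamma * m + (1 - gamma) * grad**2        # moving average of squared gradients
    return w - eta / np.sqrt(m + eps) * grad, m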

Adam (Adaptive moment estimation)¶

  • Adam: RMSProp + momentum. Adds moving average for gradients as well ($\gamma_2$ = momentum):

    • Adds a bias correction to avoid small initial gradients: $\hat{m}_{i,(s)} = \frac{m_{i,(s)}}{1-\gamma^s}$ and $\hat{g}_{i,(s)} = \frac{g_{i,(s)}}{1-\gamma_2^s}$ $$g_{i,(s)} = \gamma_2 g_{i,(s-1)} + (1-\gamma_2) \nabla \mathcal{L}(w_{i,(s)})$$ $$w_{i,(s+1)} = w_{i,(s)}- \frac{\eta}{\sqrt{\hat{m}_{i,(s)}+\epsilon}} \hat{g}_{i,(s)}$$
  • Adamax: Idem, but use max() instead of moving average: $u_{i,(s)} = \max(\gamma u_{i,(s-1)}, |\nabla \mathcal{L}(w_{i,(s)})|)$ $$w_{i,(s+1)} = w_{i,(s)}- \frac{\eta}{u_{i,(s)}} \hat{g}_{i,(s)}$$
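  • A sketch of one Adam step in the notation above ($m$: squared-gradient average, $g$: gradient average, $s$: step count starting at 1):
import numpy as np

def adam_step(w, m, g, grad, s, eta=0.001, gamma=0.9, gamma2=0.9, eps=1e-7):
    m = gamma * m + (1 - gamma) * grad**2     # RMSProp part
    g = gamma2 * g + (1 - gamma2) * grad      # momentum part
    m_hat = m / (1 - gamma**s)                # bias corrections
    g_hat = g / (1 - gamma2**s)
    return w - eta / np.sqrt(m_hat + eps) * g_hat, m, g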


SGD Optimizer Zoo¶

  • RMSProp often works well, but do try alternatives. For even more optimizers, see here.

Neural networks in practice¶

  • There are many practical courses on training neural nets. E.g.:
    • With TensorFlow: https://www.tensorflow.org/resources/learn-ml
    • With PyTorch: fast.ai course, https://pytorch.org/tutorials/
  • Here, we'll use Keras, a general API for building neural networks
    • Default API for TensorFlow, also has backends for CNTK, Theano
  • Focus on key design decisions, evaluation, and regularization
  • Running example: Fashion-MNIST
    • 28x28 pixel images of 10 classes of fashion items
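  • Fashion-MNIST ships with Keras, so we can load it directly:
from tensorflow.keras.datasets import fashion_mnist

(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
print(X_train.shape, y_train.shape)   # (60000, 28, 28) (60000,)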

Building the network¶

  • We first build a simple sequential model (no branches)
  • Input layer ('input_shape'): a flat vector of 28*28=784 nodes
    • We'll see how to properly deal with images later
  • Two dense hidden layers: 512 nodes each, ReLU activation
    • Glorot weight initialization is the Keras default; here we explicitly use He initialization (better suited for ReLU)
  • Output layer: 10 nodes (for 10 classes) and softmax activation
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', kernel_initializer='he_normal', input_shape=(28 * 28,)))
network.add(layers.Dense(512, activation='relu', kernel_initializer='he_normal'))
network.add(layers.Dense(10, activation='softmax'))

Model summary¶

  • Lots of parameters (weights and biases) to learn!
    • hidden layer 1 : (28 * 28 + 1) * 512 = 401920
    • hidden layer 2 : (512 + 1) * 512 = 262656
    • output layer: (512 + 1) * 10 = 5130
network.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 512)               401920    
                                                                 
 dense_1 (Dense)             (None, 512)               262656    
                                                                 
 dense_2 (Dense)             (None, 10)                5130      
                                                                 
=================================================================
Total params: 669,706
Trainable params: 669,706
Non-trainable params: 0
_________________________________________________________________

Choosing loss, optimizer, metrics¶

  • Loss function
    • Cross-entropy (log loss) for multi-class classification ($y_{true}$ is one-hot encoded)
    • Use binary crossentropy for binary problems (single output node)
    • Use sparse categorical crossentropy if $y_{true}$ is label-encoded (1,2,3,...)
  • Optimizer
    • Any of the optimizers we discussed before. RMSprop usually works well.
  • Metrics
    • To monitor performance during training and testing, e.g. accuracy
# Shorthand
network.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# Detailed
network.compile(loss=CategoricalCrossentropy(label_smoothing=0.01),
                optimizer=RMSprop(learning_rate=0.001, momentum=0.0),
                metrics=[CategoricalAccuracy()])

Preprocessing: Normalization, Reshaping, Encoding¶

  • Always normalize (standardize or min-max) the inputs. Mean should be close to 0.
    • Avoid that some inputs overpower others
    • Speed up convergence
      • Gradients of activation functions $\frac{\partial a_{h}}{\partial z_{h}}$ are (near) 0 for large inputs
      • If some gradients become much larger than others, SGD will start zig-zagging
  • Reshape the data to fit the shape of the input layer, e.g. (n, 28*28) or (n, 28,28)
    • Tensor with instances in first dimension, rest must match the input layer
  • In multi-class classification, every class is an output node, so one-hot-encode the labels
    • e.g. class '4' becomes [0,0,0,0,1,0,0,0,0,0]
X = X.astype('float32') / 255
X = X.reshape((60000, 28 * 28))
y = to_categorical(y)

Choosing training hyperparameters¶

  • Number of epochs: enough to allow convergence
    • Too much: model starts overfitting (or just wastes time)
  • Batch size: small batches (e.g. 32, 64,... samples) often preferred
    • 'Noisy' gradient updates (from small batches) make overfitting less likely
      • Larger batches generalize less well ('generalization gap')
    • Requires less memory (especially in GPUs)
    • Large batches do speed up training, may converge in fewer epochs
  • Batch size interacts with learning rate
    • Instead of shrinking the learning rate you can increase batch size
history = network.fit(X_train, y_train, epochs=3, batch_size=32);
Epoch 1/3
1875/1875 [==============================] - 24s 13ms/step - loss: 0.4331 - accuracy: 0.8529
Epoch 2/3
1875/1875 [==============================] - 25s 13ms/step - loss: 0.4242 - accuracy: 0.8568
Epoch 3/3
1875/1875 [==============================] - 26s 14ms/step - loss: 0.4183 - accuracy: 0.8573

Predictions and evaluations¶

We can now call predict to generate predictions, and evaluate the trained model on the entire test set

network.predict(X_test)
test_loss, test_acc = network.evaluate(X_test, y_test)
[0.0240177 0.0001167 0.4472437 0.0056629 0.057807  0.000094  0.4632739
 0.0000267 0.0017463 0.0000112]
313/313 [==============================] - 2s 7ms/step - loss: 0.3845 - accuracy: 0.8636
Test accuracy: 0.8636000156402588

Model selection¶

  • How many epochs do we need for training?
  • Train the neural net and track the loss on a validation set after every epoch
    • You can add a callback to the fit function to get info on every epoch (see the sketch below)
  • Best model after a few epochs, then starts overfitting
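  • fit already records per-epoch training and validation metrics in the returned History object; a minimal sketch, holding out 20% of the training data:
history = network.fit(X_train, y_train, epochs=20, batch_size=32,
                      validation_split=0.2, verbose=0)
print(history.history['loss'][-1], history.history['val_loss'][-1])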

Early stopping¶

  • Stop training when the validation loss (or validation accuracy) no longer improves
  • Loss can be bumpy: use a moving average or wait for $k$ steps without improvement
earlystop = callbacks.EarlyStopping(monitor='val_loss', patience=3)
network.fit(X_train, y_train, epochs=25, batch_size=512, validation_split=0.2, callbacks=[earlystop])

Regularization and memorization capacity¶

  • The number of learnable parameters is called the model capacity
  • A model with more parameters has a higher memorization capacity
    • Too high capacity causes overfitting, too low causes underfitting
    • In the extreme, the training set can be 'memorized' in the weights
  • Smaller models are forced to learn a compressed representation that generalizes better
    • Find the sweet spot: e.g. start with few parameters and increase them until overfitting starts
  • Example: 256 nodes in first layer, 32 nodes in second layer, similar performance

Information bottleneck¶

  • If a layer is too narrow, it will lose information that can never be recovered by subsequent layers
  • Information bottleneck theory defines a bound on the capacity of the network
  • Imagine that you need to learn 10 outputs (e.g. classes) and your hidden layer has 2 nodes
    • This is like trying to learn 10 hyperplanes from a 2-dimensional representation
  • Example: bottleneck of 2 nodes, no overfitting, much higher training loss

Weight regularization (weight decay)¶

  • As we did many times before, we can also add weight regularization to our loss function
  • L1 regularization: leads to sparse networks with many weights that are 0
  • L2 regularization: leads to many very small weights
network = models.Sequential()
network.add(layers.Dense(256, activation='relu', kernel_regularizer=regularizers.l2(0.001), input_shape=(28 * 28,)))
network.add(layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.001)))

Dropout¶

  • Every iteration, randomly set a number of activations $a_i$ to 0
  • Dropout rate : fraction of the outputs that are zeroed-out (e.g. 0.1 - 0.5)
  • Idea: break up spurious patterns that the network learns by accident (co-adaptations)
  • At test time, nothing is dropped out, but the outputs are scaled down by the retention probability (1 - dropout rate)
    • This compensates for the fact that more units are active than during training

Dropout layers¶

  • Dropout is usually implemented as a special layer
network = models.Sequential()
network.add(layers.Dense(256, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dropout(0.5))
network.add(layers.Dense(32, activation='relu'))
network.add(layers.Dropout(0.5))
network.add(layers.Dense(10, activation='softmax'))

Batch Normalization¶

  • We've seen that scaling the input is important, but what if layer activations become very large?
    • Same problems, starting deeper in the network
  • Batch normalization: normalize the activations of the previous layer within each batch
    • Within a batch, set the mean activation close to 0 and the standard deviation close to 1
      • Across batches, use an exponential moving average of the batch-wise mean and variance
    • Allows deeper networks that are less prone to vanishing or exploding gradients
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.BatchNormalization())
network.add(layers.Dropout(0.5))
network.add(layers.Dense(256, activation='relu'))
network.add(layers.BatchNormalization())
network.add(layers.Dropout(0.5))
network.add(layers.Dense(64, activation='relu'))
network.add(layers.BatchNormalization())
network.add(layers.Dropout(0.5))
network.add(layers.Dense(32, activation='relu'))
network.add(layers.BatchNormalization())
network.add(layers.Dropout(0.5))

Tuning multiple hyperparameters¶

  • You can wrap Keras models as scikit-learn models and use any tuning technique
  • Keras also has built-in RandomSearch (and HyperBand and BayesianOptimization - see later)
def make_model(hp):
    model = models.Sequential()
    model.add(layers.Dense(units=hp.Int('units', min_value=32, max_value=512, step=32),
                           activation='relu', input_shape=(28 * 28,)))
    model.add(layers.Dense(10, activation='softmax'))
    model.compile(optimizer=Adam(hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Wrapping as a scikit-learn estimator (the build function then takes plain keyword arguments)
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
clf = KerasClassifier(make_model)
grid = GridSearchCV(clf, param_grid=param_grid, cv=3)

# Keras Tuner's built-in RandomSearch uses the hp-based build function directly
from kerastuner.tuners import RandomSearch
tuner = RandomSearch(make_model, objective='val_accuracy', max_trials=5)
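  • To actually run the search (assuming the preprocessed Fashion-MNIST data from before):
tuner.search(X_train, y_train, validation_split=0.2, epochs=5)
best_model = tuner.get_best_models(num_models=1)[0]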

Summary¶

  • Neural architectures
  • Training neural nets
    • Forward pass: Tensor operations
    • Backward pass: Backpropagation
  • Neural network design:
    • Activation functions
    • Weight initialization
    • Optimizers
  • Neural networks in practice
  • Model selection
    • Early stopping
    • Memorization capacity and information bottleneck
    • L1/L2 regularization
    • Dropout
    • Batch normalization