Batch normalization (batchnorm)

Deep Learning

Introduction

Batch normalization CITE[szegedy-2015], commonly abbreviated to batchnorm, is an adaptive reparametrization strategy that facilitates the training of very deep neural networks. It was originally proposed as a mechanism to limit the impact of parameter updates in one layer of the network on the layers that come before or after it. It works by normalizing the activations of a layer, computed over a training minibatch, to zero mean and unit variance.

Although not the original intention, it also has the side-effect of regularizing the training of the neural network by restricting the minibatch activations and consequently the gradient updates. It is a particularly effective strategy that is now a default recommendation in most deep neural networks, especially those employing convolutional layers.

Prerequisites

To understand BatchNorm, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Motivation

Deep neural networks are composites of several functions, in some cases represented as layers of the network. These composite functions are trained by gradient-based optimization strategies, the gradients being computed by backpropagation. The training proceeds through iterations, typically over mini-batches of training data, each time updating the parameters with functions of computed gradients.

But there is a key caveat: the gradients are partial derivatives. Each gradient suggests the update to one parameter that will reduce the loss, assuming all the other parameters of the model remain the same as before. Neural networks are never trained in a way that satisfies this assumption: after each mini-batch, we update all the parameters in the model simultaneously. This means that the update to some parameters may be futile, depending on the magnitude of the updates to other related parameters of the model.

One alternative to tackle this issue is to use higher order derivatives. Higher order derivatives incorporate relationships between parameters. For example, second-order derivatives will inform the optimization algorithm of pairwise interactions. But very deep neural networks are complicated composites. Higher-order derivatives, although better than first-order derivatives at handling inter-parameter relationships, will still not overcome the challenge of dealing with updates over deep composites.

We need something better to coordinate updates across many layers. And that solution is offered by batchnorm. But first, let's understand this challenge mathematically.

The mathematical motivation

Let's consider a simple illustrative example. Consider a deep neural network with no nonlinear activations and multiple layers, each with a single unit. From the input \( x \) to the output \( \yhat \), the composite transformation is straightforward: \( \yhat = x w_1 w_2 \ldots w_l \), where \( w_i \) denotes the weight of the \( i\)-th layer, with \( i=1 \) being the input layer. In terms of activations, the activation of the \( i\)-th layer is \( h_i = w_i h_{i-1} \), with \( h_0 = x \).

With gradient-based optimization, each parameter will get updated by its gradient \( g_i \) as \( w_i \leftarrow w_i - \eta g_i \), where \( \eta \) is the learning rate of the gradient-based approach.
As a result, the new output with the updated model will be \( \dash{\yhat} = x (w_1-\eta g_1) (w_2-\eta g_2) \ldots (w_l - \eta g_l) \). Is this desirable? Let's find out.

Treating the neural network as a function of its parameters and input, we can write the output as \( \yhat = f_x(\vw) \), where \( \vw = [w_1,\ldots,w_l] \). By changing the parameters, we change the output for the same input. The first-order Taylor series approximation tells us that

$$ f_x(\dash{\vw}) \approx f_x(\vw) + \nabla_{\vw}\left(f_x(\vw)\right)^T (\dash{\vw} - \vw) $$

If \( \dash{\vw} \) denotes the updated values of the weights, then we can compute the difference between the new output \( \dash{\yhat} \) and the output before update, \( \yhat \). That is

\begin{aligned} \dash{\yhat} - \yhat &= f_x(\dash{\vw}) - f_x(\vw) \\\\ &\approx f_x(\vw) + \nabla_{\vw}\left(f_x(\vw)\right)^T (\dash{\vw} - \vw) - f_x(\vw) \\\\ &= \nabla_{\vw}\left(f_x(\vw)\right)^T (\dash{\vw} - \vw) \\\\ &= \vg^T (-\eta \vg) \\\\ &= -\eta\vg^T \vg \end{aligned}

where we have used the representation \( \vg = [g_1,\ldots,g_l] \) to denote the vector of gradients, that is, \( \nabla_{\vw}\left(f_x(\vw)\right) \), together with the update \( \dash{\vw} - \vw = -\eta \vg \). Thus, with a gradient-based update of \( -\eta \vg \), we expect the new output to differ from the output before the update, \( \yhat \), by \( -\eta \vg^T \vg \).

But we saw earlier that

\begin{aligned} \dash{\yhat} - \yhat &= x (w_1-\eta g_1) (w_2-\eta g_2) \ldots (w_l - \eta g_l) - x w_1 w_2 \ldots w_l \end{aligned}

Clearly, the update will not have exactly the intended effect: expanding this product yields terms of every order in \( \eta \), involving interactions between the gradients and weights of all the layers, not just the first-order term predicted by the Taylor approximation.
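
To make this concrete, here is a worked two-layer instance, \( l = 2 \). It is an illustration added for clarity, assuming as above that \( \vg \) is the gradient of the output with respect to the weights, so that \( g_1 = x w_2 \) and \( g_2 = x w_1 \). Expanding the product gives

$$ x (w_1-\eta g_1)(w_2-\eta g_2) - x w_1 w_2 = -\eta x \left(g_1 w_2 + g_2 w_1\right) + \eta^2 x g_1 g_2 $$

Since \( x w_2 = g_1 \) and \( x w_1 = g_2 \), the first term equals \( -\eta (g_1^2 + g_2^2) \), which is exactly the first-order Taylor term, of magnitude \( \eta \vg^T \vg \). The remaining term, \( \eta^2 x g_1 g_2 \), is a second-order interaction between the two updates that the first-order approximation ignores. With \( l \) layers, the expansion contains such interaction terms up to order \( l \).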

This challenge is worsened by increasing depths in the network, as more inter-parameter interactions come into play.
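
The effect of depth can be checked numerically. The following is a minimal sketch in plain Python, not part of the original derivation: the function names, the initial weight of 1.2 in every layer, the input \( x = 1 \), and the learning rate \( \eta = 0.01 \) are all assumptions chosen for illustration. It compares the first-order predicted change in the output with the actual change after one simultaneous update of all weights.

```python
from math import prod

def output(x, w):
    """Output of the linear chain y = x * w_1 * ... * w_l."""
    return x * prod(w)

def gradient(x, w):
    """Partial derivative of the output with respect to each weight."""
    return [x * prod(w[:i] + w[i + 1:]) for i in range(len(w))]

def compare(x, w, eta):
    """Return (first-order predicted change, actual change) for one update."""
    g = gradient(x, w)
    predicted = -eta * sum(gi * gi for gi in g)      # first-order Taylor term
    w_new = [wi - eta * gi for wi, gi in zip(w, g)]  # update every weight at once
    actual = output(x, w_new) - output(x, w)
    return predicted, actual

for depth in (2, 5, 10, 20):
    w = [1.2] * depth  # assumed initial weights, identical in every layer
    predicted, actual = compare(x=1.0, w=w, eta=0.01)  # assumed x and eta
    print(f"depth={depth:2d}  predicted={predicted:+.4f}  actual={actual:+.4f}")
```

With these assumed values, the two numbers nearly agree for the two-layer chain, and the relative gap between them widens as the depth grows.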

Batchnorm: The mechanics

Batchnorm works by normalizing the output of each layer for each mini-batch; hence the name.

Suppose \( \mH \) denotes the activations of a layer on a mini-batch, with the activations of each input example from the mini-batch arranged as one row of \( \mH \).

We first compute a vector of means, \( \vmu \), of these activations per unit in that layer. This is the average over the \( m \) rows of the activation matrix \( \mH \).

$$ \vmu = \frac{1}{m} \sum_{i} \mH_{i,:} $$

Then, we compute the vector of standard deviations, \( \vsigma \), of these activations per unit, again over the \( m \) rows of the activation matrix \( \mH \).

$$ \vsigma = \sqrt{ \delta + \frac{1}{m} \sum_i \left(\mH_{i,:} - \vmu\right)^2} $$

where \( \delta \) is a small positive term, say \( \delta = 10^{-10} \), that ensures we do not have a zero standard deviation, and the square and the square root are applied elementwise.

With these batch-level statistics, we can now normalize the activations \( \mH \) by standardizing all columns to have zero mean and unit variance. This can be achieved by subtracting the mean \( \vmu \) from each row, and then dividing the row by the vector of standard deviations.

$$ \dash{\mH_{i,:}} = \frac{\mH_{i,:} - \vmu}{\vsigma}, ~~~~\forall i=1,\ldots,m $$

Note that the small positive value of \( \delta \) in our calculation of \( \vsigma \) will ensure that the divisor is not zero.
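
As a small worked example with hypothetical values, take a mini-batch of \( m = 2 \) examples and a layer with two units, with activations \( \mH_{1,:} = [1, 2] \) and \( \mH_{2,:} = [3, 6] \). The batch statistics are

$$ \vmu = \frac{1}{2}\left([1, 2] + [3, 6]\right) = [2, 4], ~~~~ \vsigma = \sqrt{\delta + \frac{1}{2}\left([-1, -2]^2 + [1, 2]^2\right)} \approx [1, 2] $$

and the normalized activations are

$$ \dash{\mH_{1,:}} \approx \frac{[1, 2] - [2, 4]}{[1, 2]} = [-1, -1], ~~~~ \dash{\mH_{2,:}} \approx \frac{[3, 6] - [2, 4]}{[1, 2]} = [1, 1] $$

The approximation signs account only for the negligible contribution of \( \delta \). Each column of the result has zero mean and unit variance, even though the two units started out on different scales.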

That's it.
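
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the function name `batchnorm`, the list-of-rows representation of \( \mH \), and the sample activations are all hypothetical, and practical implementations typically operate on tensors and also track running statistics for use at inference time.

```python
from math import sqrt

def batchnorm(H, delta=1e-10):
    """Normalize each column (unit) of H to zero mean and unit variance.

    H is a list of m rows, one row of activations per mini-batch example.
    Returns the normalized activations with the per-unit mean and std.
    """
    m = len(H)
    columns = list(zip(*H))                 # one tuple of m values per unit
    mu = [sum(col) / m for col in columns]  # per-unit mean over the batch
    sigma = [
        sqrt(delta + sum((h - mu_j) ** 2 for h in col) / m)
        for col, mu_j in zip(columns, mu)
    ]                                       # per-unit standard deviation
    H_norm = [
        [(h - mu_j) / s_j for h, mu_j, s_j in zip(row, mu, sigma)]
        for row in H
    ]
    return H_norm, mu, sigma

# Hypothetical activations: m = 2 examples (rows), 2 units (columns).
H = [[1.0, 2.0], [3.0, 6.0]]
H_norm, mu, sigma = batchnorm(H)
print(mu)      # per-unit means
print(sigma)   # per-unit standard deviations
print(H_norm)  # each column now has approximately zero mean and unit variance
```

Note that the statistics are computed per column, that is, per unit across the examples of the mini-batch, and not per row.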

Maintaining expressive power with batchnorm

As discussed earlier, batchnorm limits the expressive power of the neural network by restricting the activations and, consequently, the parameter updates. This is the result of forcing the standard deviation of the activations to 1 and the mean to 0. Maybe this is too restrictive. On the other hand, leaving the standard deviation as \( \vsigma \) and the mean as \( \vmu \) is too flexible. How about a learnable middle ground?

We could allow the standard deviation to be \( \gamma \) and the mean to be \( \beta \), and let these parameters \( \gamma, \beta \) be learnable. This reparametrization strategy is quite popular. It replaces the normalized activation matrix \( \dash{\mH} \) with the reparametrized matrix \( \gamma \dash{\mH} + \beta \).

It may seem counterintuitive to first remove the mean \( \vmu \) of the activations and then reintroduce it as \( \beta \), a learnable parameter. But the mean of the activations, \( \vmu \), is the result of complicated interactions between many layers of the network. The parameter \( \beta \), in contrast, is a single directly learnable quantity, offering more control and being easier to learn with gradient descent.
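
The reparametrization \( \gamma \dash{\mH} + \beta \) can be sketched as follows. This is an illustrative fragment under stated assumptions: the function name `scale_and_shift` and the parameter values are hypothetical, and \( \gamma \) and \( \beta \) are treated as scalars as in the text above, whereas practical implementations commonly learn one pair per unit.

```python
def scale_and_shift(H_norm, gamma, beta):
    """Apply the learnable reparametrization gamma * H' + beta elementwise."""
    return [[gamma * h + beta for h in row] for row in H_norm]

# An already-normalized mini-batch (assumed): each column has mean 0, std 1.
H_norm = [[-1.0, -1.0], [1.0, 1.0]]
out = scale_and_shift(H_norm, gamma=2.0, beta=0.5)  # assumed parameter values
print(out)  # → [[-1.5, -1.5], [2.5, 2.5]]
```

Each column of the result has mean 0.5 and standard deviation 2, that is, exactly \( \beta \) and \( \gamma \). During training, these two values would be updated by gradient descent along with the weights.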
