# Backpropagation

## Introduction

Backpropagation, often called backprop, is the gradient-based optimization approach for training neural networks. It follows from applying the chain-rule of calculus to minimizing the loss function for deep learning models. ## Prerequisites

To understand backpropagation, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

## Gradient-based optimization: a quick recap

Now, we know how most machine learning models are trained — by fitting parameters to minimize a loss function. And a preferred minimization approach is the gradient-based approach that works as follows:

1. Initialize parameters $\Theta$ to randomly chosen values.
2. Calculate the output $o$ using $\Theta$ as the model parameters
3. Compute the loss $\ell(o,y)$ with respect to the desired targets $y$
4. Compute gradient of the loss with respect to the parameters $\Theta$, say $\nabla_{\Theta}^{\ell(o,y)}$
5. Update parameters with the negative gradient and suitable learning rate $\eta$ as $\Theta \leftarrow \Theta - \eta \nabla_{\Theta}^{\ell(o,y)}$
6. Go back to step 2 and iterate till convergence is achieved in the loss.

Thus, to support gradient-based optimization, we need two key ingredients: the loss on an example and the gradient of the loss with respect to each parameter of the model. In neural networks, the loss is computed by the forward propagation phase and the gradient is computed by the backward propagation phase. Let's study these phases in detail next.

## The chain rule of calculus: quick recap

The chain rule of calculus is a principled way to compute the derivatives of functions composed of other functions by fusing together derivatives of the underlying functions in a chain-like order.

Consider a real-valued scalar variable $p \in \real$. Let $f: \real \to \real$, $g: \real \to \real$, and $h: \real \to \real$ all be functions that map from a real number to a real number. We can create a composite function by chaining these functions as follows:

\begin{aligned} q &= f(p) \\\\ r &= g(q) = g(f(p)) \\\\ s &= h(r) = h(g(f(p))) \end{aligned}

How do we calculate the derivative of $s$ with respect to $p$? Using the chain rule of calculus! Here's how it works.

$$\frac{ds}{dp} = \frac{ds}{dr} \frac{dr}{dq}\frac{dq}{dp}$$

As simple as that.

## Chain rule: Multivariate version

The chain rule also applies to multivariate variables in the context of partial derivatives. Consider the multivariate variables $\vp \in \real^P$, $\vq \in \real^Q$, $\vr \in \real^R$ and $\vs \in \real^S$. Again, consider our functions with these updated domains $f: \real^P \to \real^Q$, $g: \real^Q \to \real^R$, and $h: \real^R \to \real^S$, so that

\begin{aligned} \vq &= f(\vp) \\\\ \vr &= g(\vq) = g(f(\vp)) \\\\ \vs &= h(\vr) = h(g(f(\vp))) \end{aligned}

And the chain rule for partial derivatives works as follows:

$$\frac{\partial \vs}{\partial p_i} = \sum_{j} \frac{\partial \vs}{\partial r_j}\frac{\partial r_j}{\partial p_i}$$

If we denote the gradient as $\nabla_{\vp}^{\vs} = [\frac{\partial \vs}{\partial p_1}, \ldots, \frac{\partial \vs}{\partial p_P}]$, we can succinctly rewrite the above multivariate chain rule of calculus as

$$\nabla_{\vp}^{\vs} = \left( \frac{\partial \vr}{\partial \vp} \right)^T \nabla_{\vr}^{\vs}$$

## Forward propagation: From input to the loss

Neural networks can be described as conducting a forward propagation of information from the inputs to the outputs. For example, imagine a multilayer perceptron (MLP) with an input layer $\sH_0$, followed by $L$ hidden layers $\sH_1, \sH_2, \ldots, \sH_L$, followed by an output layer $\sO$. Except the input layer, each layer is described by a set of parameters — the weights and biases — as $\sH_i = \set{\mW_i, \vb_i}$, for all $i = 0, 1, \ldots, L$. The final output layer is also governed by the weight and bias parameters $\sO = \set{\mW_o, b_o}$.

The output of each hidden layer is an affine transformation followed by an activation function $\phi$.

$$\vh_l = \phi \left(\mW_l^T \vh_{l-1} + b_l \right)$$

where, $\vh_0 = \vx$.

Working backwards from the output, we can write output as a composite function of the input.

\begin{aligned} o &= \vw_o^T \vh_L + b_o \\\\ &= \vw_o^T \phi \left(\mW_L^T \vh_{L-1} + \vb_L \right) + b_o \\\\ &= \vw_o^T \phi \left(\mW_L^T \phi\left(\mW_{L-1}^T\vh_{L-2} + \vb_{L-1} \right) + \vb_L \right) + b_o \\\\ &\vdots \\\\ &= \vw_o^T \phi \left(\mW_L^T \ldots \phi\left(\mW_1^T\vx + \vb_1 \right) \ldots + \vb_L \right) + b_o \\\\ \end{aligned}

Thus, the output is a composite function passing through multiple transformations from the input to the output.

The composite output $o$ from the previous section is used to compute the loss function of the model — a function comparing the predicted output $o$ to the true desired output $y$, $\ell(y,o)$.

As we explained in the recap of gradient-based optimization, to minimize the loss, we discover parameters by iteratively updating them in the direction of negative gradient of the loss. Since the neural network is a multivariate model, the gradient will involve partial derivatives of the loss with respect to each parameter of the model. In our example case of the MLP, the parameters include the following matrices and vectors: $\mW_1, \ldots, \mW_L$, $\vb_1, \ldots, \vb_L$, $\mW_o$ and $b_o$.

We know from our review in an earlier section that the recipe to calculate partial derivatives of composite functions involves the chain rule of calculus.

In the context of the MLP, this implies the following relations.

\begin{aligned} \nabla_{o}^{\ell(o,y)} &= \frac{\partial \ell(o,y)}{\partial o} \\\\ \nabla_{\vh_L}^{\ell(o,y)} &= \left(\frac{\partial o}{\partial \vh_L}\right)^T \nabla_{o}^{\ell(o,y)} \\\\ \nabla_{\vh_l}^{\ell(o,y)} &= \left(\frac{\partial \vh_{l+1}}{\partial \vh_l}\right)^T \nabla_{\vh_{l+1}}^{\ell(o,y)},~~~~\forall l=1,\ldots,L-1 \label{eqn:output-gradients} \end{aligned}

Knowing these partial derivatives with respect to layer-level outputs $o, \vh_1, \ldots, \vh_L$, the parameter-level gradients can be calculated as follows.

\begin{aligned} \nabla_{\mW_o}^{\ell(o,y)} &= \left(\frac{\partial o}{\partial \mW_o}\right)^T \nabla_{o}^{\ell(o,y)} \\\\ \nabla_{\mW_L}^{\ell(o,y)} &= \left(\frac{\partial \vh_L}{\partial \mW_L}\right)^T \nabla_{\vh_L}^{\ell(o,y)} \\\\ \nabla_{\mW_l}^{\ell(o,y)} &= \left(\frac{\partial \vh_l}{\partial \mW_l}\right)^T \nabla_{\vh_l}^{\ell(o,y)},~~~~\forall l=1,\ldots,L-1 \label{eqn:parameter-gradients} \end{aligned}

The gradients with respect to the biases $\vb_1, \ldots, \vb_L$, and $b_o$ are similarly defined and have been omitted for brevity.

Thus, for gradient-based optimization of multi-layered neural networks, the math is clear and straightforward. How about the computation?

## Efficiently computing the multilayered gradient

Now that we know the mathematical formulations of the gradients, how do we calculate them? Should we start with calculating the parameters closest to the output, $\nabla_{\mW_o}^{\ell(o,y)}$, or the parameters closest to the input, $\nabla_{\mW_1}^{\ell(o,y)}$?

The nature of the mathematical formulae for gradients in Equation \eqref{eqn:parameter-gradients} and Equation \eqref{eqn:output-gradients} has the answer for doing this efficiently. Note that the gradient $\nabla_{\mW_o}^{\ell(o,y)}$ depends on the gradient $\nabla_{o}^{\ell(o,y)}$, but not the other way around.

For brevity, let's denote this directional dependence relation with an arrow, so that $A \Leftarrow B$ represents that $B$ feeds into $A$. In other words, $A$ requires the computation of $B$ first. Using this notation, we can write the computational dependence as follows:

\begin{aligned} \nabla_{\mW_o}^{\ell(o,y)} &\Leftarrow \nabla_{o}^{\ell(o,y)} \\\\ \nabla_{\mW_l}^{\ell(o,y)} &\Leftarrow \nabla_{\vh_l}^{\ell(o,y)},~~~~\forall l = 1,\ldots,L\\\\ \nabla_{\vh_L}^{\ell(o,y)} &\Leftarrow \nabla_{o}^{\ell(o,y)} \\\\ \nabla_{\vh_l}^{\ell(o,y)} &\Leftarrow \nabla_{\vh_{l+1}}^{\ell(o,y)} ,~~~~\forall l = 1,\ldots,L-1\\\\ \end{aligned}

It can be observed that the gradients at the layers near the input are dependent on the gradients of layers that are further away from the input. This dependence clearly implies an efficient gradient computation strategy. We have to start with computing the gradients of the parameters that are closest to the output. Then move backwards from the outermost layer to the input layer when computing the gradients of the hidden layers.

This backward flow of gradient computation is known as backpropagation or succinctly, backprop. It is a computationally efficient way of computing the gradients of multilayered neural networks.

## An iteration with the backpropagation algorithm

Neural network training relies on repeated application of the backpropagation algorithm to compute gradients of the loss with respect to model parameters. These gradients are then used to update the parameters of the model using the gradient-based optimization strategy.

In this section, let's describe the steps involved in a single such iteration. For the following steps, we will use a fully connected multilayer perceptron (MLP) as an example neural network. That being said, the strategy is generally applicable to other networks with minor architecture-specific refinements.

We will use the same MLP that we used in our motivation of the backpropagation approach. Consider a MLP with an input layer $\sH_0$, followed by $L$ hidden layers $\sH_1, \sH_2, \ldots, \sH_L$, followed by an output layer $\sO$. Except the input layer, each layer is described by a set of parameters — the weights and biases — as $\sH_l = \set{\mW_l, \vb_l}$, for all $l = 1, \ldots, L$. The final output layer is also governed by the weight and bias parameters $\sO = \set{\mW_o, b_o}$. We will use the element-wise activation function $\phi$ in our model for each hidden layer.

Gradient computation occurs in two phases. In the first phase, we compute the loss of the model using the current parameter values. In the second phase, we compute the actual gradients of the loss with respect to the model parameters.

#### The forward propagation phase

1. Compute the output of the first hidden layer as a function of the input $\vx$ $$\vh_1 = \phi \left(\mW_1^T \vx + b_1 \right)$$
2. For each hidden layer, compute the output of that layer as a function of the output of the previous layer as $$\vh_l = \phi \left(\mW_l^T \vh_{l-1} + b_l \right),~~~~\forall l = 2,\ldots,L$$
3. Finally, compute the output of the neural network as $$o = \vw_o^T \vh_L + b_o$$
4. Now compute the loss of the model as the error of the predicted output $o$ compared to the desired output $y$ as $\ell(o,y)$.

The forward pass is now complete.

#### The backward propagation phase

1. Compute gradient of loss with respect to the output and store it for use in steps 2 and 3. $$\nabla_{o}^{\ell(o,y)} = \frac{\partial \ell(o,y)}{\partial o}$$
2. Compute gradients of the loss with respect to the parameters of the output layer. \begin{aligned} \nabla_{\mW_o}^{\ell(o,y)} &= \left(\frac{\partial o}{\partial \mW_o}\right)^T \nabla_{o}^{\ell(o,y)} \\\\ \nabla_{b_o}^{\ell(o,y)} &= \left(\frac{\partial o}{\partial b_o}\right)^T \nabla_{o}^{\ell(o,y)} \end{aligned}
3. Compute gradient of the output of the outermost hidden layer $\sH_L$ with respect to the loss and store it for use in step 4 and 5. $$\nabla_{\vh_L}^{\ell(o,y)} = \left(\frac{\partial o}{\partial \vh_L}\right)^T \nabla_{o}^{\ell(o,y)}$$
4. Compute gradients of the loss with respect to the parameters of the outermost hidden layer $\sH_L$ \begin{aligned} \nabla_{\mW_L}^{\ell(o,y)} &= \left(\frac{\partial \vh_L}{\partial \mW_L}\right)^T \nabla_{\vh_L}^{\ell(o,y)} \\\\ \nabla_{\vb_L}^{\ell(o,y)} &= \left(\frac{\partial \vh_L}{\partial \vb_L}\right)^T \nabla_{\vh_L}^{\ell(o,y)} \\\\ \end{aligned}
5. Let $l = L-1$. Compute and store gradients of outputs and parameters of that layer using gradients of layer $l+1$ as follows. \begin{aligned} \nabla_{\vh_l}^{\ell(o,y)} &= \left(\frac{\partial \vh_{l+1}}{\partial \vh_l}\right)^T \nabla_{\vh_{l+1}}^{\ell(o,y)},~~~~\forall l=1,\ldots,L-1 \\\\ \nabla_{\mW_l}^{\ell(o,y)} &= \left(\frac{\partial \vh_l}{\partial \mW_l}\right)^T \nabla_{\vh_l}^{\ell(o,y)},~~~~\forall l=1,\ldots,L-1 \\\\ \nabla_{\vb_l}^{\ell(o,y)} &= \left(\frac{\partial \vh_l}{\partial \vb_l}\right)^T \nabla_{\vh_l}^{\ell(o,y)},~~~~\forall l=1,\ldots,L-1 \\\\ \end{aligned}
6. If $l = 1$, terminate. Else, set $l = l - 1$. Go back to step 5.

The backward pass is now complete and the gradients have been computed. Phew!