Introduction
Backpropagation, often called backprop, is the algorithm used to compute gradients for training neural networks with gradient-based optimization. It follows from applying the chain rule of calculus to the loss function of a deep learning model.
To understand backpropagation, we recommend familiarity with the basics of neural network models and gradient-based optimization. Get acquainted with those concepts first before proceeding.
Now, we know how most machine learning models are trained: by fitting parameters to minimize a loss function. A preferred minimization approach is gradient-based optimization, which works as follows:

1. Initialize the model parameters, typically at random.
2. Compute the loss of the model on the training examples.
3. Compute the gradient of the loss with respect to the model parameters.
4. Update the parameters by taking a small step in the direction of the negative gradient.
5. Repeat steps 2 through 4 until the loss stops improving.
Thus, to support gradient-based optimization, we need two key ingredients: the loss on an example and the gradient of the loss with respect to each parameter of the model. In neural networks, the loss is computed by the forward propagation phase and the gradient is computed by the backward propagation phase. Let's study these phases in detail next.
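To make these two ingredients concrete, here is a minimal sketch of gradient-based optimization on a toy one-parameter model. The model, the training example, the learning rate, and the iteration count are all illustrative assumptions, not something specified above.

```python
# Toy model: predict o = w * x; loss is the squared error on one example.
x, y = 2.0, 6.0   # a single training example (illustrative)
w = 0.0           # initial parameter value
lr = 0.1          # learning rate (illustrative)

for step in range(50):
    o = w * x                  # ingredient 1: compute the prediction and the loss
    loss = 0.5 * (o - y) ** 2
    grad = (o - y) * x         # ingredient 2: gradient of the loss w.r.t. w
    w = w - lr * grad          # step in the direction of the negative gradient

print(w)  # converges towards y / x = 3.0
```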
The chain rule of calculus is a principled way to compute the derivative of a function composed of other functions, by multiplying together the derivatives of the constituent functions in a chain-like order.
Consider a real-valued scalar variable \( p \in \real \). Let \( f: \real \to \real \), \( g: \real \to \real \), and \( h: \real \to \real \) all be functions that map from a real number to a real number. We can create a composite function by chaining these functions as follows:
\begin{aligned} q &= f(p) \\\\ r &= g(q) = g(f(p)) \\\\ s &= h(r) = h(g(f(p))) \end{aligned}
How do we calculate the derivative of \( s \) with respect to \( p \)? Using the chain rule of calculus! Here's how it works.
$$ \frac{ds}{dp} = \frac{ds}{dr} \frac{dr}{dq}\frac{dq}{dp} $$
As simple as that.
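To see the chain rule in action, here is a small sketch with concrete, entirely illustrative choices \( f(p) = p^2 \), \( g(q) = \sin(q) \), and \( h(r) = e^r \), checked against a finite-difference estimate.

```python
import math

# Illustrative composition: q = p**2, r = sin(q), s = exp(r)
p = 0.7
q = p ** 2
r = math.sin(q)
s = math.exp(r)

# Chain rule: ds/dp = ds/dr * dr/dq * dq/dp
ds_dr = math.exp(r)   # h'(r)
dr_dq = math.cos(q)   # g'(q)
dq_dp = 2 * p         # f'(p)
ds_dp = ds_dr * dr_dq * dq_dp

# Finite-difference check: the two printed values should closely agree
eps = 1e-6
s_eps = math.exp(math.sin((p + eps) ** 2))
print(ds_dp, (s_eps - s) / eps)
```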
The chain rule also applies to multivariate variables in the context of partial derivatives. Consider the multivariate variables \( \vp \in \real^P \), \( \vq \in \real^Q \), \( \vr \in \real^R \) and \( \vs \in \real^S \). Again, consider our functions with these updated domains \( f: \real^P \to \real^Q \), \( g: \real^Q \to \real^R \), and \( h: \real^R \to \real^S \), so that
\begin{aligned} \vq &= f(\vp) \\\\ \vr &= g(\vq) = g(f(\vp)) \\\\ \vs &= h(\vr) = h(g(f(\vp))) \end{aligned}
And the chain rule for partial derivatives works as follows:
$$ \frac{\partial \vs}{\partial p_i} = \sum_{j} \frac{\partial \vs}{\partial r_j}\frac{\partial r_j}{\partial p_i} $$
If we denote the gradient as \( \nabla_{\vp}^{\vs} = [\frac{\partial \vs}{\partial p_1}, \ldots, \frac{\partial \vs}{\partial p_P}] \), we can succinctly rewrite the above multivariate chain rule of calculus as
$$ \nabla_{\vp}^{\vs} = \left( \frac{\partial \vr}{\partial \vp} \right)^T \nabla_{\vr}^{\vs} $$
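The Jacobian-transpose form above can be verified numerically. In the sketch below, the dimensions and functions are illustrative choices: \( f \) is a linear map (so its Jacobian is the matrix itself), \( g \) is a linear map applied after an element-wise tanh, and \( h \) is scalar-valued so that \( \nabla_{\vr}^{s} \) is an ordinary gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: p in R^3, q in R^4, r in R^2, s a scalar.
A = rng.normal(size=(4, 3))   # q = f(p) = A @ p, Jacobian dq/dp = A
B = rng.normal(size=(2, 4))   # r = g(q) = B @ tanh(q)

def forward(p):
    q = A @ p
    r = B @ np.tanh(q)
    s = 0.5 * np.sum(r ** 2)  # s = h(r), a scalar
    return q, r, s

p = rng.normal(size=3)
q, r, s = forward(p)

# Gradient of s w.r.t. r, pulled back through the transposed Jacobians:
grad_r = r                             # d/dr of 0.5 * ||r||^2
J_r_q = B * (1 - np.tanh(q) ** 2)      # dr/dq: B times diag(tanh'(q))
J_q_p = A                              # dq/dp
grad_p = J_q_p.T @ (J_r_q.T @ grad_r)  # (dr/dp)^T grad_r, applied in stages

# Finite-difference check of the first coordinate
eps = 1e-6
e0 = np.zeros(3)
e0[0] = eps
print(grad_p[0], (forward(p + e0)[2] - s) / eps)
```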
Neural networks can be described as conducting a forward propagation of information from the inputs to the outputs. For example, imagine a multilayer perceptron (MLP) with an input layer \( \sH_0 \), followed by \( L \) hidden layers \( \sH_1, \sH_2, \ldots, \sH_L \), followed by an output layer \( \sO \). Except for the input layer, each layer is described by a set of parameters, the weights and biases, as \( \sH_l = \set{\mW_l, \vb_l} \), for all \( l = 1, \ldots, L \). The final output layer is also governed by weight and bias parameters \( \sO = \set{\vw_o, b_o} \).
The output of each hidden layer is an affine transformation followed by an activation function \( \phi \).
$$ \vh_l = \phi \left(\mW_l^T \vh_{l-1} + \vb_l \right) $$
where \( \vh_0 = \vx \) is the input.
Working backwards from the output, we can write the output as a composite function of the input.
\begin{aligned} o &= \vw_o^T \vh_L + b_o \\\\ &= \vw_o^T \phi \left(\mW_L^T \vh_{L-1} + \vb_L \right) + b_o \\\\ &= \vw_o^T \phi \left(\mW_L^T \phi\left(\mW_{L-1}^T\vh_{L-2} + \vb_{L-1} \right) + \vb_L \right) + b_o \\\\ &\vdots \\\\ &= \vw_o^T \phi \left(\mW_L^T \ldots \phi\left(\mW_1^T\vx + \vb_1 \right) \ldots + \vb_L \right) + b_o \\\\ \end{aligned}
Thus, the output is a composite function of the input, built by chaining multiple transformations.
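The forward pass translates almost line-for-line into code. Below is a minimal sketch in NumPy; the layer sizes, the tanh activation (standing in for \( \phi \)), and the random initialization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: input dim 4, two hidden layers of width 5, scalar output.
sizes = [4, 5, 5]
Ws = [rng.normal(scale=0.5, size=(sizes[l], sizes[l + 1])) for l in range(2)]
bs = [np.zeros(sizes[l + 1]) for l in range(2)]
w_o = rng.normal(scale=0.5, size=5)   # output weights
b_o = 0.0                             # output bias

def forward(x):
    h = x                              # h_0 = x
    for W, b in zip(Ws, bs):
        h = np.tanh(W.T @ h + b)       # h_l = phi(W_l^T h_{l-1} + b_l)
    return w_o @ h + b_o               # o = w_o^T h_L + b_o

print(forward(rng.normal(size=4)))
```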
The composite output \( o \) from the previous section is used to compute the loss function of the model — a function comparing the predicted output \( o \) to the true desired output \( y \), \( \ell(y,o) \).
As we explained in the recap of gradient-based optimization, to minimize the loss, we discover parameters by iteratively updating them in the direction of the negative gradient of the loss. Since the neural network is a multivariate model, the gradient involves partial derivatives of the loss with respect to each parameter of the model. In our example MLP, the parameters include the following matrices and vectors: \( \mW_1, \ldots, \mW_L \), \( \vb_1, \ldots, \vb_L \), \( \vw_o \) and \( b_o \).
We know from our review in an earlier section that the recipe to calculate partial derivatives of composite functions involves the chain rule of calculus.
In the context of the MLP, this implies the following relations.
\begin{aligned} \nabla_{o}^{\ell(o,y)} &= \frac{\partial \ell(o,y)}{\partial o} \\\\ \nabla_{\vh_L}^{\ell(o,y)} &= \left(\frac{\partial o}{\partial \vh_L}\right)^T \nabla_{o}^{\ell(o,y)} \\\\ \nabla_{\vh_l}^{\ell(o,y)} &= \left(\frac{\partial \vh_{l+1}}{\partial \vh_l}\right)^T \nabla_{\vh_{l+1}}^{\ell(o,y)},~~~~\forall l=1,\ldots,L-1 \label{eqn:output-gradients} \end{aligned}
Knowing these partial derivatives with respect to layer-level outputs \(o, \vh_1, \ldots, \vh_L \), the parameter-level gradients can be calculated as follows.
\begin{aligned} \nabla_{\vw_o}^{\ell(o,y)} &= \left(\frac{\partial o}{\partial \vw_o}\right)^T \nabla_{o}^{\ell(o,y)} \\\\ \nabla_{\mW_L}^{\ell(o,y)} &= \left(\frac{\partial \vh_L}{\partial \mW_L}\right)^T \nabla_{\vh_L}^{\ell(o,y)} \\\\ \nabla_{\mW_l}^{\ell(o,y)} &= \left(\frac{\partial \vh_l}{\partial \mW_l}\right)^T \nabla_{\vh_l}^{\ell(o,y)},~~~~\forall l=1,\ldots,L-1 \label{eqn:parameter-gradients} \end{aligned}
The gradients with respect to the biases \( \vb_1, \ldots, \vb_L \), and \( b_o \) are similarly defined and have been omitted for brevity.
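These recursions also translate directly into code. Below is a sketch of the backward pass for the tanh MLP from the forward-pass sketch above, assuming a squared-error loss \( \ell(y,o) = \frac{1}{2}(o - y)^2 \) (our illustrative choice, since the text leaves \( \ell \) generic); it also computes the bias gradients that the text omits.

```python
def backward(x, y):
    # Forward pass, storing each h_l; for tanh, phi'(a) = 1 - tanh(a)**2 = 1 - h**2.
    hs = [x]
    for W, b in zip(Ws, bs):
        hs.append(np.tanh(W.T @ hs[-1] + b))
    o = w_o @ hs[-1] + b_o

    grad_o = o - y                # dl/do for the squared-error loss
    grad_w_o = hs[-1] * grad_o    # gradient w.r.t. the output weights w_o
    grad_b_o = grad_o             # gradient w.r.t. the output bias b_o
    grad_h = w_o * grad_o         # (do/dh_L)^T dl/do, the start of the recursion

    grads_W, grads_b = [], []
    for l in range(len(Ws) - 1, -1, -1):
        delta = (1 - hs[l + 1] ** 2) * grad_h    # phi'(a) * gradient w.r.t. this layer's output
        grads_W.append(np.outer(hs[l], delta))   # gradient w.r.t. W of this layer
        grads_b.append(delta)                    # gradient w.r.t. b of this layer
        grad_h = Ws[l] @ delta                   # propagate the gradient one layer down
    grads_W.reverse()                            # reorder from the first layer to the last
    grads_b.reverse()
    return grads_W, grads_b, grad_w_o, grad_b_o
```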
Thus, for gradient-based optimization of multi-layered neural networks, the math is clear and straightforward. How about the computation?
Now that we know the mathematical formulations of the gradients, how do we calculate them? Should we start with calculating the gradients of the parameters closest to the output, \( \nabla_{\vw_o}^{\ell(o,y)} \), or those of the parameters closest to the input, \( \nabla_{\mW_1}^{\ell(o,y)} \)?
The structure of the gradient formulae in Equation \eqref{eqn:parameter-gradients} and Equation \eqref{eqn:output-gradients} answers this question. Note that the gradient \( \nabla_{\vw_o}^{\ell(o,y)} \) depends on the gradient \( \nabla_{o}^{\ell(o,y)} \), but not the other way around.
For brevity, let's denote this directional dependence relation with an arrow, so that \( A \Leftarrow B \) represents that \( B \) feeds into \( A \). In other words, \( A \) requires the computation of \( B \) first. Using this notation, we can write the computational dependence as follows:
\begin{aligned} \nabla_{\vw_o}^{\ell(o,y)} &\Leftarrow \nabla_{o}^{\ell(o,y)} \\\\ \nabla_{\mW_l}^{\ell(o,y)} &\Leftarrow \nabla_{\vh_l}^{\ell(o,y)},~~~~\forall l = 1,\ldots,L\\\\ \nabla_{\vh_L}^{\ell(o,y)} &\Leftarrow \nabla_{o}^{\ell(o,y)} \\\\ \nabla_{\vh_l}^{\ell(o,y)} &\Leftarrow \nabla_{\vh_{l+1}}^{\ell(o,y)} ,~~~~\forall l = 1,\ldots,L-1\\\\ \end{aligned}
Observe that the gradients at layers near the input depend on the gradients at layers farther from the input. This dependence implies an efficient computation strategy: start with the gradients of the parameters closest to the output, then move backwards, layer by layer, towards the input layer.
This backward flow of gradient computation is known as backpropagation or, succinctly, backprop. It is a computationally efficient way of computing the gradients of multilayered neural networks.
Neural network training relies on repeated application of the backpropagation algorithm to compute gradients of the loss with respect to model parameters. These gradients are then used to update the parameters of the model using the gradient-based optimization strategy.
In this section, let's describe the steps involved in a single such iteration. For the following steps, we will use a fully connected multilayer perceptron (MLP) as an example neural network. That being said, the strategy is generally applicable to other networks with minor architecture-specific refinements.
We will use the same MLP that we used in our motivation of the backpropagation approach. Consider an MLP with an input layer \( \sH_0 \), followed by \( L \) hidden layers \( \sH_1, \sH_2, \ldots, \sH_L \), followed by an output layer \( \sO \). Except for the input layer, each layer is described by a set of parameters, the weights and biases, as \( \sH_l = \set{\mW_l, \vb_l} \), for all \( l = 1, \ldots, L \). The final output layer is also governed by the weight and bias parameters \( \sO = \set{\vw_o, b_o} \). Each hidden layer applies the element-wise activation function \( \phi \).
Gradient computation occurs in two phases. In the first phase, the forward pass, we compute the loss of the model using the current parameter values. In the second phase, the backward pass, we compute the gradients of the loss with respect to the model parameters.
In the forward pass, we start from the input \( \vh_0 = \vx \) and compute each hidden layer in turn as \( \vh_l = \phi \left( \mW_l^T \vh_{l-1} + \vb_l \right) \) for \( l = 1, \ldots, L \), followed by the output \( o = \vw_o^T \vh_L + b_o \), and finally the loss \( \ell(y,o) \). The forward pass is now complete.
In the backward pass, we work in the opposite direction: starting from \( \nabla_{o}^{\ell(o,y)} \), we propagate the layer-level gradients \( \nabla_{\vh_L}^{\ell(o,y)}, \ldots, \nabla_{\vh_1}^{\ell(o,y)} \) backwards as in Equation \eqref{eqn:output-gradients}, computing the parameter-level gradients along the way as in Equation \eqref{eqn:parameter-gradients}. The backward pass is now complete and the gradients have been computed. Phew!
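Putting the two phases together, a single training iteration is a forward pass, a backward pass, and a parameter update. A minimal sketch, continuing the tanh MLP code from the earlier sketches (the learning rate, training example, and iteration count are illustrative assumptions):

```python
lr = 0.05                    # learning rate (illustrative)
x = rng.normal(size=4)       # a single training example (illustrative)
y = 1.0

for step in range(200):
    grads_W, grads_b, grad_w_o, grad_b_o = backward(x, y)
    for l in range(len(Ws)):
        Ws[l] -= lr * grads_W[l]   # update each layer's weights and biases
        bs[l] -= lr * grads_b[l]
    w_o -= lr * grad_w_o           # update the output layer's parameters
    b_o -= lr * grad_b_o

print(forward(x))  # should approach y = 1.0
```

In practice, training loops over batches of examples and frameworks compute these gradients via automatic differentiation, but the flow of computation is exactly the one described above.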