Momentum based gradient descent

Optimization

Momentum is a strategy to improve the convergence of gradient descent based approaches to unconstrained optimization. It works by incorporating a fraction of the update factor from the previous iteration as an additional term into the current iteration's update. This results in rapid progress on steep regions of the optimization function and some prevention of oscillation around the optimum.

In this article, we will study the concept of momentum is detail through interactive and intuitive examples.

Prerequisites

To understand this article, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Gradient descent: A quick recap

Gradient descent is an iterative approach to unconstrained optimization. The gradient descent algorithm proceeds as follows:

  1. Start from a suitable point \( \vx \)
  2. Apply the following update to \( \vx \) till convergence in the function value or until a maximum number of iterations have been completed. For the \( k\)-th iteration, do

$$ \vx_k \leftarrow \vx_{k-1} - \alpha \nabla_{\vx_k} $$

where, \( \nabla_{\vx} \) denotes the gradient of the objective function \( f: \real^\ndim \to \real \) at the point \( \vx \).

Thus, in each iteration, the gradient descent approach moves the point in the direction of the negative gradient; hence the name.

A challenge with gradient descent

We have presented earlier the importance of choosing a good learning rate and the role of dampening factor. A low value of the learning rate leads to slow convergence towards the optimum. A high value leads to oscillations, or in the worst case, divergence away from the optimum! The learning rate needs to be just right.

A strategy to achieve faster convergence while avoiding oscillations is that of dampening. With dampening, we introduce a real-valued scalar variable \( \delta \in (0,1) \) as a multiplier on the learning rate. Thus, at iteration \( k \), the learning rate is a fraction of the initial learning rate \( \alpha_0 \).

$$ \alpha_{k} = \delta \alpha_{k-1} = \delta^k \alpha_0 $$

We saw the dramatic impact such a dampening can have in preventing oscillations.

Dampening reduces the learning rate with each iteration, with a fixed pre-defined factor. But, how about slowing up as necessary? And, what about speeding up when needed? What if we could do both — with a single factor?

That's where momentum comes in. But first, velocity.

Velocity

As explained earlier, the update equation in the \( k\)-th iteration of gradient descent is

$$ \vx_k \leftarrow \vx_{k-1} - \alpha_k \nabla_{\vx_k} $$

We are updating the existing iterate \( \vx_{k-1} \), by subtracting the component \( \alpha \nabla_{\vx_k} \). This component is the distance moved by the variable in one iteration. Thus, intuitively, it is the velocity \( \vv_k \) of the \( k\)-th iteration, so that

$$ \vv_k = \alpha_k \nabla_{\vx_k} $$

and, the gradient descent update at iteration \( k \) becomes

$$ \vx_k \leftarrow \vx_{k-1} - \vv_k $$

Momentum

To continue to make rapid progress towards the optimum, a simple improvement to vanilla gradient descent might look like this: use the velocity from the previous iteration to inform the current iteration. Use a scalar multiplier, \( \beta \) to control the amount of information from the previous iteration into the current.

Thus, the velocity of the \( k+1 \)-th iteration becomes

$$ \vv_{k+1} = \beta \vv_k + \alpha_{k+1} \nabla_{\vx_k} $$

If we consider \( \beta \) as the mass and \( \vv_k \) as the velocity, then the term \( \beta \vv_k \) is \( \text{mass} \times \text{velocity} \), a well-studied concept in Physics known as momentum. And the intuition remains the same too! If you are moving faster, the momentum keeps you moving faster. If you abruptly go in the opposite direction, the momentum makes you slow down.

Owing the momentum, gradient descent is able to rapidly make progress towards the optimum. If it overshoots and goes beyond the minimum, its step size is automatically reduced due to the negating effects of the previous gradient — an informed dampening factor, leading to reduced oscillations.

Momentum demo: \( \min x^2 \)

Let's see momentum-based gradient descent in action with a simple univariate function \( f(x) = x^2 \), where \( x \in \real \).

Note that the function has a global minimum at \( x = 0 \). The goal of the gradient descent method is to discover this point of least function value, starting at any arbitrary point.

In this accompanying demo, change the starting point by dragging the orange circle and see the trajectory of gradient descent method towards the global minimum. Also study the effect of the learning rate on the convergence rate.

For added flexibility, we have also provided a dampening factor on the learning rate that reduces the learning rate at each iteration. Our expectation is that with momentum correctly set up, you may not need dampening on the learning rate.

In this interactive, try to study the following effects:

  • With \( \delta = 1 \) (no dampening) and \( \gamma = 0\) (no momentum), note that as the learning rate is increased past 0.5, the iterations start oscillating around the optimum. For very large values of the learning rate, the iterations diverge away from the optimum, a very undesirable outcome!
  • With \( \delta = 1 \) (no dampening) and \( \delta > 0 \), the oscillations tend to move quickly towards the optimum. In this particular case, we need only a little momentum to achieve the desired effect. But intuitively, the amount of momentum required is proportional to the starting value of the learning rate. In fact, with large enough momentum, even large values of \( \alpha_0 \) are tamed down and guided towards the optimum, thereby avoiding divergence. All this without fiddling with the dampening factor. Such is the power of momentum.

Momentum demo : \( \min x^4 - 8x^2 \)

Let's see the impact of momentum on gradient descent in action on a nonconvex function with multiple minima and a single maximum.

$$ f(x) = x^4 - 8x^2 $$

We are particularly interested in observing the effect of the momentum factor \( \gamma \) and the initial learning rate \( \alpha_0 \). A good combination of \( \gamma \) and \( \alpha_0 \) ensures that gradient descent is able to find the minima.

Again, for added flexibility, we have also provided a dampening factor on the learning rate that reduces the learning rate at each iteration. Our expectation is that with momentum correctly set up, you may not need dampening on the learning rate.

In this demo, particularly note the how the oscillations quickly get pulled back due to the lingering effects of the velocity from the previous iteration.

Momentum demo: Himmelblau's function

We introduced Himmelblau's function in our article on multivariate functions in calculus.
It has 4 local minima (highlighted in green) and 1 maximum (highlighted in blue).

Turns out, Himmelblaus, in spite of its bump, is actually quite easy for gradient descent with momentum. Unlike Newton's method, the trajectory does not go over the bump, but rather quickly seeks out the nearest minimum.

The technological impact of momentum

This seemingly simple idea of momentum lead to a variety of improvements for gradient descent approaches such as

Many of these approaches have becomes ubiquitous in deep learning libraries such as PyTorch and Tensorflow. Some of them are the default optimization algorithms for training deep neural networks and other machine learning models.

Where to next?

Expand your knowledge of optimization approaches with our detailed interactive articles on this subject.

To brush up on calculus, explore our articles on topics in calculus and multivariate calculus. Or start with strong mathematical foundations.

Already an optimization expert? Check out comprehensive courses on machine learning or deep learning.

Please support us

Help us create more engaging and effective content and keep it free of paywalls and advertisements!

Let's connect

Please share your comments, questions, encouragement, and feedback.