Choosing a good learning rate

Optimization

Gradient descent, stochastic gradient descent, as well as quasi-Newton methods such as BFGS, all rely on a learning rate or step length to choose the next point in their iterations towards the optimal solution. We observed in the interactive demos for each of these methods that a good learning rate is really important for their success.

In this article, we will study some strategies for choosing a good learning rate.

Prerequisites

To understand this article, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Gradually decreasing learning rate

A simple strategy that is quite effective in practice is to gradually decrease the learning rate.

Such gradual decrease in the learning rate can be achieved with a multiplicative dampening factor. Suppose, \( \gamma \in (0,1) \) is the dampening factor. Then, the learning rate \( \alpha \) at iteration \( k + 1 \) is

$$ \alpha_{k+1} = \gamma \alpha_k = \gamma^{k+1} \alpha_0 $$

where, \( \alpha_0 \) is the initial learning rate.

Since, \( 0 < \gamma < 1 \), the learning rate will reduce with every iteration.

Check out the effect of such a dampening factor in the next few demos.

In the next interactive demo try to notice the effect of a gradually decreasing learning rate. With no dampening there ( \( \gamma = 1 \) \), there could be significant oscillations. A low \( \gamma < 0.5 \) leads to very slow progress, which may not even reach the minimum. We just need some dampening, but not too much and usually \( \gamma > 0.9 \) works in most cases.

Dampening demo : \( \min x^4 - 8x^2 \)

Let's see gradient descent in action on a nonconvex function with multiple minima and a single maximum.

$$ f(x) = x^4 - 8x^2 $$

We are particularly interested in observing the effect of the dampening factor \( \gamma \) and the initial learning rate \( \alpha_0 \). A good combination of \( \gamma \) and \( \alpha_0 \) ensures that gradient descent is able to find the minima.

Dampening demo: Rosenbrock's function

We introduced Rosenbrock's function in our article on multivariate functions in calculus. Rosenbrock is a very hard problem for many optimization algorithms.

Notice in this next demo that gradient descent can quickly find the valley in the center, but the convergence towards the optimal (green point) is extremely slow, as the iterates oscillate on the two sides of the valley. This can be guided better by choosing appropriate learning rate and dampening factor.

Intelligent learning rate

Given the importance of the learning rate, is there a good learning rate? Can we find it analytically? Let's try to design some conditions for describing a good learning rate.

First and foremost, given that we are performing minimization, we definitely want the function to decrease sufficiently at each iteration. Sufficient in this case means that the decrease should be at least proportional to the gradient step taken. It shouldn't be the case that we take a big step and the decrease is miniscule or vice versa.

So, for some constant \( c_1 \), such that \( 0 < c_1 < 1 \), we wish that

\begin{equation} f(\vx + \alpha \vp) \le f(\vx) + c_1 \alpha \nabla f(\vx)^T \vp \label{eqn:armijo-condition} \end{equation}

This first condition, which ensures there is sufficient decrease in function value with a step is known as the Armijo condition.

Now, let's consider another criteria. The slope or gradient of the function at the new point should also be larger (less negative) than the one before it. This means, for some constant \( c_2 \), such that \( 0 < c_2 < 1 \), we wish that

\begin{equation} \nabla f(\vx + \alpha \vp)^T \vp \ge c_2 \nabla f(\vx)^T \vp \label{eqn:curvature-condition} \end{equation}

The second one, which ensures that the curvature is flattening out over time is known as curvature condition.

These two conditions for a desirable step length are known as Wolfe conditions. It is usually the case that \( c_1 < c_2 \).

Wolfe conditions can be a good strategy to identify a suitable learning rate at any point. At each iteration, we can select a value of the learning rate the satisfies both the conditions. If none of the conditions can be satisfied, we just use a default learning rate to move to another point and try finding a satisfactory learning rate at the next iteration.

We use this strategy in the next two demos applying gradient descent to multivariate problems.

Wolfe conditions demo: Rosenbrock's function

Let's try gradient descent on Rosenbrock's function again, this time with dynamically chosen learning rate that satisfies Wolfe conditions.

Wolfe conditions demo : Himmelblau's function

We introduced Himmelblau's function in our article on multivariate functions in calculus. It has 4 local minima (highlighted in green) and 1 maximum (highlighted in blue).

Turns out, Himmelblaus, with intelligently chosen learning rate at each iteration, is actually quite easy for gradient descent, in spite of the maximum in the middle. Unlike Newton's method, the trajectory does not go over the bump, but rather quickly seeks out the nearest minimum.

Where to next?

Expand your knowledge of optimization approaches with our detailed interactive articles on this subject.

To brush up on calculus, explore our articles on topics in calculus and multivariate calculus. Or start with strong mathematical foundations.

Already an optimization expert? Check out comprehensive courses on machine learning or deep learning.

Please share

Let your friends, followers, and colleagues know about this resource you discovered.

Subscribe for article updates

Stay up to date with new material for free.