Gradient descent is an approach for unconstrained optimization.

In this article, we will motivate the formulation for gradient descent and provide interactive demos over multiple univariate and multivariate functions to show it in action.

This learning module has many interactive demos. It is easier to work with them on a larger screen.
Bookmark and revisit if you are currently on a small screen device.

Gradient descent is an approach for unconstrained optimization.

In this article, we will motivate the formulation for gradient descent and provide interactive demos over multiple univariate and multivariate functions to show it in action.

To understand this article, we recommend familiarity with the concepts in

- Mathematical notation
- Functions in mathematics
- Derivatives
- Higher-order derivatives
- Taylor's Theorem for function approximation
- Identifying minima, maxima, and saddle points
- Unconstrained optimization
- Newton's method

Follow the above links to first get acquainted with the corresponding concepts.

In this article, we will consider an objective function \( f : \real^\ndim \to \real \) that we wish to minimize. We express the unconstrained optimization problem as,

\begin{aligned} \min_{\vx} \text{ }& f(\vx) \\ \end{aligned}

Gradient descent is an iterative approach. Starting from an arbitrary point, we will progressively discover points with lower function values till we converge to a point with minimum function value.

The next few sections will outline the strategy used by gradient descent to discover the next point \( \va + \vp \), from the current point \( \va \), on its journey to the optimal solution.

Owing to the Taylor's Theorem, for a twice differentiable function \( f: \real^\ndim \to \ndim \), the value of the function at a point \(\va+\vp \) can be written in terms of the function value at a point \( \va \) as follows.

\begin{equation} f(\va+\vp) = f(\va) + \nabla_{\va}^T \vp + \frac{1}{2} \vp^T \mH_{\va} \vp \label{eqn:second-order-taylor-approx} \end{equation}

Here, \( \nabla_{\va} \) and \( \mH_{\va} \) denote the gradient and Hessian of the function, evaluated at the point \( \va \).

To choose the next point \( \va + \vp \), we need to discover a suitable value of \( \vp \). Since \( \va \) is already known (the current point), this is a function in \( \vp \). At its local minimizer, \( \star{\vp} \), the gradient with respect to \( \vp \) will be zero.

\begin{aligned} \doh{f(\va + \star{\vp})}{\vp} = 0 \\\\ \label{eqn:first-order-condition-on-p} \end{aligned}

Thus, we now have a system of equations to solve for the value of \( \vp \) to choose our next point \( \va + \vp \).

We studied that Newton's method solves the equations above in its entirey. When the Hessian is difficult to compute, an alternative is a quasi-Newton method such as BFGS that locally approximate the Hessian.

Another alternative is to completely ignore the term with the Hessian!

Re-writing the function approximation from the previous section using Taylor's Theorem by ignoring the Hessian term, we get.

\begin{equation} f(\va+\vp) = f(\va) + \nabla_{\va}^T \vp \label{eqn:first-order-taylor-approx} \end{equation}

This formulation only involves the gradient term. And the approach that utilizes such first-order approximation to discover the next point is known as **gradient descent**.

For minimization, we wish to move from an iterate \( \va \) to \( \va + \vp \) such that \( f(\va + \vp) < f(\va) \).

From the first order Taylor approximation, this requires that

\begin{aligned} & f(\va + \vp) < f(\va) \\ \implies& f(\va) + \nabla_{\va}^T \vp < f(\va) \\ \implies& \nabla_{\va}^T \vp < 0 \label{eqn:first-order-decrease-criteria} \end{aligned}

The easiest way of ensuring this is to choose \( \vp = -\nabla_{\va} \). Easy to verify that this might work.

\begin{aligned} \nabla_{\va}^T \vp = - \nabla_{\va}^T \nabla_{\va} < 0 \end{aligned}

because, \(\nabla_{\va}^T \nabla_{\va}\) is a square and hence always positive. Great!

By setting \( \vp = -\nabla_{\va} \), we are moving in the direction opposite (negative) of the gradient.
Therefore, this method is called **gradient descent**.

We saw that the update in the Newton's method was \( \vp = -\inv{\mH_{\va}} \nabla_{\va} \). There was a *scaling factor* \( \inv{\mH_\va} \) — the inverse of the Hessian.

We can introduce a similar scaling factor for gradient descent as well — a positive scalar \( \alpha \). The new update, with the scaling factor, in the case of gradient descent is

$$ \vp = -\alpha \nabla_{\va} $$

Because \( \alpha \) controls the size of the step from \( \va \) to \( \va + \vp \), it is known as the **step length** or **learning rate**.

We will see the effect of the learning rate on the performance of gradient descent in the upcoming interactive demos.

Thus, the gradient descent proceeds as follows:

- Start from a suitable point \( \vx \)
- Apply the following update to \( \vx \) till convergence in the function value or until a maximum number of iterations have been completed: \( \vx \leftarrow \vx - \alpha \nabla_{\vx} \).

See the similarity to the Newton's method? Indeed, the Newton's method is a special form of this structure with a step length equal to the inverse of the Hessian \( \alpha_{\text{Newton}} = \inv{\mH_{\vx}} \).

Let's see gradient descent in action with a simple univariate function \( f(x) = x^2 \), where \( x \in \real \).

Note that the function has a global minimum at \( x = 0 \). The goal of the gradient descent method is to discover this point of least function value, starting at any arbitrary point.

In this accompanying demo, change the starting point by dragging the orange circle and see the trajectory of gradient descent method towards the global minimum. Also study the effect of the learning rate on the convergence rate.

Remember how quickly Newton's method minimized the convex quadratic objective \( x^2 \)? Well, note that gradient descent is not that fast, in terms of the number of steps.

However, computationally, it may still be faster for some problems, as it does not require the computation of the Hessian inverse, a major disadvantage of the Newton's method.

Also, note the impact of the step size or learning rate \( \alpha \) on the performance. For this particular problem, the Newton's method suggests a step size of \( 0.5 \). So, note that a step size of 0.5 with gradient descent will reach the minimum in one step. Any learning rate lower than 0.5 leads to slower convergence. Any learning rate higher than 0.5 leads to oscillations around the minimum till it finally lands at the minimum.

In fact, try the learning rate \( \alpha = 1 \) for this function. The iterations just keep oscillating.

Clearly, the learning rate is a crucial parameter of the gradient descent approach. We will look at practical strategies to choose a good value of the learning rate in the upcoming sections.

Let's see gradient descent in action on a nonconvex function with multiple minima and a single maximum.

$$ f(x) = x^4 - 8x^2 $$

Note that the learning rate plays a huge role in where the gradient descent solution ends up. Also, note that due to the bumpy nature of the function, we need a much slower learning rate than what we used in the previous demo. If not for a low learning rate, the iterations oscillate across the minima.

We introduced Rosenbrock's function in our article on multivariate functions in calculus. Rosenbrock is a very hard problem for many optimization algorithms. Notice in this next demo that gradient descent can quickly find the valley in the center, but the convergence towards the optimal (green point) is extremely slow, as the iterates oscillate on the two sides of the valley.

We introduced Himmelblau's function in our article on multivariate functions in calculus. It has 4 local minima (highlighted in green) and 1 maximum (highlighted in blue).

Turns out, Himmelblaus, in spite of its bump, is actually quite easy for gradient descent. Unlike Newton's method, the trajectory does not go over the bump, but rather quickly seeks out the nearest minimum.

In all the previous demos, we observed that gradient descent is heavily dependent on the learning rate parameter. The functions we studied were easy to visualize, but the ones that we encounter in practice are much harder, with many more bumps and local minima.

How do we choose a good learning rate? Check out our comprehensive interactive article on choosing a good learning rate where we talk about strategies such as dampening and Wolfe conditions.

Expand your knowledge of optimization approaches with our detailed interactive articles on this subject.

To brush up on calculus, explore our articles on topics in calculus and multivariate calculus. Or start with strong mathematical foundations.

Already an optimization expert? Check out comprehensive courses on machine learning or deep learning.

Help us create more engaging and effective content and keep it free of paywalls and advertisements!

Please share your comments, questions, encouragement, and feedback.