Interactive tutorial on gradient descent

This learning module has many interactive demos. It is easier to work with them on a larger screen. Bookmark and revisit if you are currently on a small screen device.

\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} \newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? }}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)

        Gradient descent
        Optimization
      

Gradient descent is an approach for unconstrained optimization.

In this article, we will motivate the formulation for gradient descent and provide interactive demos over multiple univariate and multivariate functions to show it in action.

Prerequisites

To understand this article, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Problem statement

In this article, we will consider an objective function $ f : \real^\ndim \to \real $ that we wish to minimize. We express the unconstrained optimization problem as,

\begin{aligned} \min_{\vx} \text{ }& f(\vx) \\ \end{aligned}

Gradient descent is an iterative approach. Starting from an arbitrary point, we will progressively discover points with lower function values till we converge to a point with minimum function value.

The next few sections will outline the strategy used by gradient descent to discover the next point $ \va + \vp $, from the current point $ \va $, on its journey to the optimal solution.

Taylor's theorem and optimization: A recap

Owing to the Taylor's Theorem, for a twice differentiable function $ f: \real^\ndim \to \ndim $, the value of the function at a point $\va+\vp $ can be written in terms of the function value at a point $ \va $ as follows.

\begin{equation} f(\va+\vp) = f(\va) + \nabla_{\va}^T \vp + \frac{1}{2} \vp^T \mH_{\va} \vp \label{eqn:second-order-taylor-approx} \end{equation}

Here, $ \nabla_{\va} $ and $ \mH_{\va} $ denote the gradient and Hessian of the function, evaluated at the point $ \va $.

To choose the next point $ \va + \vp $, we need to discover a suitable value of $ \vp $. Since $ \va $ is already known (the current point), this is a function in $ \vp $. At its local minimizer, $ \star{\vp} $, the gradient with respect to $ \vp $ will be zero.

\begin{aligned} \doh{f(\va + \star{\vp})}{\vp} = 0 \\\\ \label{eqn:first-order-condition-on-p} \end{aligned}

Thus, we now have a system of equations to solve for the value of $ \vp $ to choose our next point $ \va + \vp $.

Ignoring the Hessian term

We studied that Newton's method solves the equations above in its entirey. When the Hessian is difficult to compute, an alternative is a quasi-Newton method such as BFGS that locally approximate the Hessian.

Another alternative is to completely ignore the term with the Hessian!

Re-writing the function approximation from the previous section using Taylor's Theorem by ignoring the Hessian term, we get.

\begin{equation} f(\va+\vp) = f(\va) + \nabla_{\va}^T \vp \label{eqn:first-order-taylor-approx} \end{equation}

This formulation only involves the gradient term. And the approach that utilizes such first-order approximation to discover the next point is known as gradient descent.

Gradient descent

For minimization, we wish to move from an iterate $ \va $ to $ \va + \vp $ such that $ f(\va + \vp) < f(\va) $.

From the first order Taylor approximation, this requires that

\begin{aligned} & f(\va + \vp) < f(\va) \\ \implies& f(\va) + \nabla_{\va}^T \vp < f(\va) \\ \implies& \nabla_{\va}^T \vp < 0 \label{eqn:first-order-decrease-criteria} \end{aligned}

The easiest way of ensuring this is to choose $ \vp = -\nabla_{\va} $. Easy to verify that this might work.

\begin{aligned} \nabla_{\va}^T \vp = - \nabla_{\va}^T \nabla_{\va} < 0 \end{aligned}

because, $\nabla_{\va}^T \nabla_{\va}$ is a square and hence always positive. Great!

By setting $ \vp = -\nabla_{\va} $, we are moving in the direction opposite (negative) of the gradient. Therefore, this method is called gradient descent.

Learning rate

We saw that the update in the Newton's method was $ \vp = -\inv{\mH_{\va}} \nabla_{\va} $. There was a scaling factor $ \inv{\mH_\va} $ — the inverse of the Hessian.

We can introduce a similar scaling factor for gradient descent as well — a positive scalar $ \alpha $. The new update, with the scaling factor, in the case of gradient descent is

$$ \vp = -\alpha \nabla_{\va} $$

Because $ \alpha $ controls the size of the step from $ \va $ to $ \va + \vp $, it is known as the step length or learning rate.

We will see the effect of the learning rate on the performance of gradient descent in the upcoming interactive demos.

Gradient descent: The algorithm

Thus, the gradient descent proceeds as follows:

Start from a suitable point $ \vx $
Apply the following update to $ \vx $ till convergence in the function value or until a maximum number of iterations have been completed: $ \vx \leftarrow \vx - \alpha \nabla_{\vx} $.

See the similarity to the Newton's method? Indeed, the Newton's method is a special form of this structure with a step length equal to the inverse of the Hessian $ \alpha_{\text{Newton}} = \inv{\mH_{\vx}} $.

Gradient descent demo: $ \min x^2 $

Let's see gradient descent in action with a simple univariate function $ f(x) = x^2 $, where $ x \in \real $.

Note that the function has a global minimum at $ x = 0 $. The goal of the gradient descent method is to discover this point of least function value, starting at any arbitrary point.

In this accompanying demo, change the starting point by dragging the orange circle and see the trajectory of gradient descent method towards the global minimum. Also study the effect of the learning rate on the convergence rate.

Gradient descent analysis: $ \min x^2 $

Remember how quickly Newton's method minimized the convex quadratic objective $ x^2 $? Well, note that gradient descent is not that fast, in terms of the number of steps.

However, computationally, it may still be faster for some problems, as it does not require the computation of the Hessian inverse, a major disadvantage of the Newton's method.

Also, note the impact of the step size or learning rate $ \alpha $ on the performance. For this particular problem, the Newton's method suggests a step size of $ 0.5 $. So, note that a step size of 0.5 with gradient descent will reach the minimum in one step. Any learning rate lower than 0.5 leads to slower convergence. Any learning rate higher than 0.5 leads to oscillations around the minimum till it finally lands at the minimum.

In fact, try the learning rate $ \alpha = 1 $ for this function. The iterations just keep oscillating.

Clearly, the learning rate is a crucial parameter of the gradient descent approach. We will look at practical strategies to choose a good value of the learning rate in the upcoming sections.

Gradient descent demo: $ \min x^4 - 8x^2 $

Let's see gradient descent in action on a nonconvex function with multiple minima and a single maximum.

$$ f(x) = x^4 - 8x^2 $$

Note that the learning rate plays a huge role in where the gradient descent solution ends up. Also, note that due to the bumpy nature of the function, we need a much slower learning rate than what we used in the previous demo. If not for a low learning rate, the iterations oscillate across the minima.

Gradient descent demo: Rosenbrock's function

We introduced Rosenbrock's function in our article on multivariate functions in calculus. Rosenbrock is a very hard problem for many optimization algorithms. Notice in this next demo that gradient descent can quickly find the valley in the center, but the convergence towards the optimal (green point) is extremely slow, as the iterates oscillate on the two sides of the valley.

Gradient descent demo: Himmelblau's function

We introduced Himmelblau's function in our article on multivariate functions in calculus. It has 4 local minima (highlighted in green) and 1 maximum (highlighted in blue).

Turns out, Himmelblaus, in spite of its bump, is actually quite easy for gradient descent. Unlike Newton's method, the trajectory does not go over the bump, but rather quickly seeks out the nearest minimum.

A good learning rate

In all the previous demos, we observed that gradient descent is heavily dependent on the learning rate parameter. The functions we studied were easy to visualize, but the ones that we encounter in practice are much harder, with many more bumps and local minima.

How do we choose a good learning rate? Check out our comprehensive interactive article on choosing a good learning rate where we talk about strategies such as dampening and Wolfe conditions.

Where to next?

Expand your knowledge of optimization approaches with our detailed interactive articles on this subject.

To brush up on calculus, explore our articles on topics in calculus and multivariate calculus. Or start with strong mathematical foundations.

Already an optimization expert? Check out comprehensive courses on machine learning or deep learning.

Please support us

Help us create more engaging and effective content and keep it free of paywalls and advertisements!

Please donate

Let's connect

Please share your comments, questions, encouragement, and feedback.

Gradient descent

Optimization

Prerequisites

Problem statement

Taylor's theorem and optimization: A recap

Ignoring the Hessian term

Gradient descent

Learning rate

Gradient descent: The algorithm

Gradient descent demo: \( \min x^2 \)

Gradient descent analysis: \( \min x^2 \)

Gradient descent demo: \( \min x^4 - 8x^2 \)

Gradient descent demo: Rosenbrock's function

Gradient descent demo: Himmelblau's function

A good learning rate

Where to next?

Please support us

Let's connect