Stochastic gradient descent

Optimization

Stochastic gradient descent (SGD) is an approach for unconstrained optimization and the workhorse of optimization for machine learning. It is used as a faster alternative for training support vector machines and is the preferred optimization routine for deep learning models.

In this article, we will motivate the formulation for stochastic gradient descent and provide interactive demos over multiple univariate and multivariate functions to show it in action.

Prerequisites

To understand this article, we recommend familiarity with the concepts covered in the prerequisite articles.

Follow the corresponding links to first get acquainted with those concepts.

Stochastic gradient descent (SGD)

SGD is a variant of gradient descent that works by parts. If the objective function can be decomposed into several terms, then SGD proceeds by updating the iterates based on the gradient of a single term at a time, chosen in a round-robin, sequential, or, more generally, stochastic fashion.

A popular machine learning loss function is a summation of the form \( \sum_{i} (x-a_i)^2 \). Such a sum of squares is a great fit for stochastic gradient descent. The \(i\)-th term is a quadratic of the form \( (x - a_i)^2 \) and its derivative is \( 2(x - a_i) \).

For such decomposable functions, the SGD approach performs the following steps till convergence.

  1. In iteration \( k \), pick a term \( i_k \) at random.
  2. Update \( x_{k+1} = x_k - \alpha_k 2(x_k - a_{i_k}) \).

Notice that each update is akin to a partial gradient descent step. For comparison, if we were doing full gradient descent, we would perform the following update in each iteration.

$$ x_{k+1} = x_k - 2 \alpha_k \sum_{i} (x_k - a_{i}) $$
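To make the contrast concrete, here is a minimal Python sketch of both update rules on a small sum of squares. The values \( a_i \), the learning rate, and the starting point are illustrative choices, not values used in the demos below.

```python
import random

# Illustrative data: the objective is sum_i (x - a_i)^2, minimized at the mean of the a_i.
a = [-2.0, -1.0, 0.0, 1.0, 2.0]
alpha = 0.05   # constant learning rate (illustrative)
x_sgd = 4.0    # SGD iterate
x_gd = 4.0     # full gradient descent iterate

for k in range(100):
    # SGD: follow the gradient of a single randomly chosen term, 2 (x - a_i).
    i = random.randrange(len(a))
    x_sgd = x_sgd - alpha * 2.0 * (x_sgd - a[i])

    # Full gradient descent: follow the gradient of the entire sum, 2 * sum_i (x - a_i).
    x_gd = x_gd - alpha * 2.0 * sum(x_gd - ai for ai in a)

print(x_sgd, x_gd)  # both approach the minimizer (the mean of the a_i, here 0.0), but
                    # without dampening the SGD iterate keeps hovering around it
```

Note that each SGD iteration touches a single term, so its per-iteration cost does not grow with the number of terms, whereas the full gradient step must sum over all of them.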

Thus, SGD is a great choice for optimizing functions with a large number of terms. It will be evident in the upcoming examples that such a process leads to a zig-zag path to the optimum, as each iteration works with partial information.

SGD demo: \( \min \sum_{a=-2}^2 (x+a)^2 \)

Let's look at an easily decomposable convex function.

$$ f(x) = \sum_{a=-2}^2 (x+a)^2 $$

There are several terms in the function. With SGD, we will randomly pick a term at each iteration and make partial progress according to the gradient of that term. Change the starting point by dragging the orange circle. Change the initial learning rate and the dampening factor using the corresponding sliders. To understand the dampening factor, see our comprehensive interactive article on choosing a good learning rate.

It turns out that, for this sum of squares, SGD is a great choice. Note that, compared to gradient descent, SGD goes back and forth in its iterations. Moreover, choosing appropriate values for the learning rate and dampening factor is extremely important.
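For reference, a dampened SGD loop for this demo might look like the following sketch. The initial learning rate, dampening factor, and starting point are illustrative stand-ins for the slider and drag settings.

```python
import random

terms = [-2, -1, 0, 1, 2]   # f(x) = sum over a in terms of (x + a)^2, minimized at x = 0
alpha = 0.4                 # initial learning rate (illustrative)
damp = 0.9                  # dampening factor applied after every iteration (illustrative)
x = 3.0                     # starting point (the orange circle in the demo)

for k in range(50):
    a = random.choice(terms)          # pick one term at random
    x = x - alpha * 2.0 * (x + a)     # partial step using only that term's gradient 2 (x + a)
    alpha *= damp                     # dampen the learning rate for the next iteration

print(x)  # ends up near the minimizer x = 0; how close depends on the chosen schedule
```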

SGD demo: \( \min x^4 - 8x^2 \)

For the purpose of demonstrating SGD, we will decompose this nonconvex function into the two terms \( x^4 \) and \( -8x^2 \). Note how a high value of the learning rate, with no dampening, can send the iterations off the chart.

Also, note that starting near the optimal point does not guarantee quick convergence. Because of the stochastic nature of the algorithm, each jump follows the gradient of a single term, which may disagree with the gradient of all the terms taken together.
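The sketch below shows this decomposition in Python with illustrative settings; the guard against runaway iterates mimics the iterations going off the chart.

```python
import random

# f(x) = x^4 - 8x^2, decomposed into the two terms x^4 and -8x^2.
# Their individual derivatives are 4x^3 and -16x, respectively.
grads = [lambda x: 4.0 * x ** 3, lambda x: -16.0 * x]

def sgd(x, alpha, damp, iters=500):
    for _ in range(iters):
        g = random.choice(grads)   # pick one term's derivative at random
        x = x - alpha * g(x)       # partial gradient step
        alpha *= damp              # dampen the learning rate
        if abs(x) > 1e6:           # the iterates have gone off the chart
            return float('inf')
    return x

print(sgd(x=0.5, alpha=0.005, damp=0.99))  # tends to settle near the minimum at x = 2 for this start
print(sgd(x=0.5, alpha=0.5, damp=1.0))     # a large, undampened rate sends x off the chart
```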

SGD demo: Rosenbrock's function

We have introduced Rosenbrock's function in the module on Calculus. The global minimum of the function lies at \( x = 1, y = 1 \). Our objective is the following.

$$ \min_{x,y} (1-x)^2 + 100 (y-x^2)^2 $$

For the purpose of demonstrating SGD, we have decomposed the Rosenbrock function into its two summands, specifically, \( (1-x)^2 \) and \( 100(y - x^2)^2 \) to show the effect of using partial gradients with SGD in each iteration.
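A minimal Python sketch of these per-summand updates is shown below. The learning rate, dampening factor, starting point, and iteration count are illustrative choices, not the demo's defaults.

```python
import random

# Rosenbrock: f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, split into its two summands.
def grad_term1(x, y):
    # gradient of (1 - x)^2
    return (-2.0 * (1.0 - x), 0.0)

def grad_term2(x, y):
    # gradient of 100 (y - x^2)^2
    return (-400.0 * x * (y - x * x), 200.0 * (y - x * x))

x, y = -1.0, 1.5            # starting point (illustrative)
alpha, damp = 1e-3, 0.999   # small learning rate, mild dampening (illustrative)

for k in range(5000):
    gx, gy = random.choice([grad_term1, grad_term2])(x, y)  # gradient of one summand only
    x, y = x - alpha * gx, y - alpha * gy
    alpha *= damp

print(x, y)  # typically lands in the valley y ≈ x^2 yet stalls far from the minimum (1, 1)
```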

We saw in the article on gradient descent that Rosenbrock is particularly hard for gradient descent. It is even harder for SGD, which only works with partial gradients in each step.

Finding the valley takes longer than with gradient descent, and there is very little chance of finding the optimum if the learning rate is steadily dampened.

Although SGD is the preferred approach for deep learning and is the workhorse of most modern machine learning libraries, this point needs to be emphasized: a lack of progress across iterations does not imply convergence to a solution. It could mean that SGD is simply stuck.

SGD is easy to use and implement. It works on many problems of interest. But use it cautiously and be aware of its shortcomings.

SGD demo: Himmelblau's function

So far, most of our example functions had a single minimum. Such functions, albeit easy to understand, are rare in machine learning.

Let us study a function with multiple minima. We have introduced Himmelblau's function in the module on Calculus.

$$ f(x,y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2 $$

Himmelblau's function is nonconvex with 4 minima (marked in green) and 1 local maximum (marked in blue).

In this demo, we have decomposed Himmelblau's function into its two summands, \( (x^2 + y - 11)^2 \) and \( (x+y^2-7)^2 \) for the sake of demonstrating SGD with partial gradients in each step.
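As with the Rosenbrock demo, a minimal sketch of SGD over these two summands might look like the following. The starting point, learning rate, and dampening factor are again illustrative.

```python
import random

# Himmelblau: f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2, split into its two summands.
def grad_term1(x, y):
    # gradient of (x^2 + y - 11)^2
    r = x * x + y - 11.0
    return (4.0 * x * r, 2.0 * r)

def grad_term2(x, y):
    # gradient of (x + y^2 - 7)^2
    r = x + y * y - 7.0
    return (2.0 * r, 4.0 * y * r)

x, y = 0.0, 0.0             # starting point (illustrative)
alpha, damp = 0.01, 0.99    # learning rate and dampening factor (illustrative)

for k in range(500):
    gx, gy = random.choice([grad_term1, grad_term2])(x, y)  # gradient of one summand only
    x, y = x - alpha * gx, y - alpha * gy
    alpha *= damp

print(x, y)  # with a suitable schedule, tends to settle near one of the four minima
             # (from this start, usually the one near (3, 2))
```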

SGD converges for Himmelblau's function if the learning rate is just right and dampening is used to slow it down with each iteration to avoid oscillations. But, as expected, it takes a more serpentine route to reach a minimum. Interestingly, if the parameters are set appropriately, SGD converges to the closest minimum.

However, if the parameters are not chosen suitably, then SGD may not converge to any optimum. SGD relies on good learning rates and dampening factors even more than gradient descent does, an important point to remember when implementing SGD-based solutions. Owing to these factors, there is increased interest in variants of SGD that rely on fewer parameters or just work out-of-the-box. We will investigate these in an upcoming section specifically dedicated to SGD and its popular variants.

Where to next?

Expand your knowledge of optimization approaches with our detailed interactive articles on this subject.

To brush up on calculus, explore our articles on topics in calculus and multivariate calculus. Or start with strong mathematical foundations.

Already an optimization expert? Check out comprehensive courses on machine learning or deep learning.
