This learning module has many interactive demos. It is easier to work with them on a larger screen.
Bookmark and revisit if you are currently on a small screen device.

\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \newcommand{\R}{\mathbb{R}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\mA}{\mathbf{A}} \newcommand{\mB}{\mathbf{B}} \newcommand{\mH}{\mathbf{H}} \newcommand{\mC}{\mathbf{C}} \newcommand{\va}{\mathbf{a}} \newcommand{\vb}{\mathbf{b}} \newcommand{\vc}{\mathbf{c}} \newcommand{\vp}{\mathbf{p}} \newcommand{\star}[1]{#1^*} \newcommand{\neighborhood}{\mathcal{N}} \newcommand{\vxstar}{\mathbf{x}^*} \newcommand{\vlambdastar}{\mathbf{\lambda}^*} \newcommand{\vx}{\mathbf{x}} \newcommand{\vy}{\mathbf{y}} \newcommand{\meye}{\mathbf{I}} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\mQ}{\mathbf{Q}} \newcommand{\mU}{\mathbf{U}} \newcommand{\mV}{\mathbf{V}} \newcommand{\mD}{\mathbf{D}} \newcommand{\mLambda}{\mathbf{\Lambda}} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \newcommand{\sX}{\mathcal{X}} \newcommand{\sY}{\mathcal{Y}} \newcommand{\mX}{\mathbf{X}} \newcommand{\mY}{\mathbf{Y}} \renewcommand{\dox}[1]{\frac{\partial}{\partial x} #1} \renewcommand{\doy}[1]{\frac{\partial}{\partial y} #1} \renewcommand{\doxx}[1]{\frac{\partial^2}{\partial x^2} #1} \renewcommand{\doyy}[1]{\frac{\partial^2}{\partial y^2} #1} \renewcommand{\doxy}[1]{\frac{\partial^2}{\partial x \partial y} #1} \renewcommand{\doyx}[1]{\frac{\partial^2}{\partial y \partial x} #1} \renewcommand{\dovx}[1]{\frac{\partial}{\partial \vx} #1} \renewcommand{\dovar}[2]{\frac{\partial #2}{\partial #1}} \newcommand{\gradient}{\nabla} \newcommand{\hessian}{\mathbf{H}} \newcommand{\implies}{\Rightarrow} \newcommand{\equalities}{\mathcal{E}} \newcommand{\inequalities}{\mathcal{I}} \newcommand{\lagrange}{\mathcal{L}}\)

The importance of optimization cannot be overstated in the context of Machine Learning. Most, if not all, of Machine Learning involves *fitting* a model to data, a process that involves optimization.

Optimization is a broad research area with many open challenges. This article does not intend to cover all of optimization theory or algorithms. Instead, we will focus on concepts and algorithms that are essential to understanding and implementing machine learning.

Before venturing into this article, it is **strongly recommended** that you are well familiar with the concepts introduced in our previous modules on **Linear Algebra** and **Calculus** for Machine Learning.

Optimization is an approach to minimize or maximize a function of variables. For example,

$$\begin{aligned} \min_{\vx} \text{ }& f(\vx) \text{ subject to } &\\ & g(\vx) = \mathbf{a} \\ & h(\vx) \ge \mathbf{b} \\ \end{aligned}$$

The above statement describes the task of finding a value of a variable \( \vx \) where the function \( f(\vx) \) attains a minimum value. The function being minimized, \(f: \R^n \to \R \), is known as the **objective function**.

Note that \( \max_{\vx} f(\vx) \) has the same solution as minimizing the negated objective function \( \min_{\vx} -f(\vx) \). So, it is common practice in optimization to *standardize* a problem as a minimization problem. We will use optimization to mean minimization, unless otherwise explicitly specified.

The statements involving \( g(\vx) \) and \( h(\vx) \) are known as **constraints** which require the variable \( \vx \) to satisfy certain conditions. The set of all \( \vx \) that satisfy the constraints is known as the **feasible set**.

The value of the variable that attains the minimum value of the function is known as the **solution** and is usually denoted as \( \vxstar \) .

If the function value at \( \vxstar \) is the lowest among all elements of the feasible set, then the solution is a **global minimizer**. If it is the lowest only among the feasible points that lie within a neighborhood of \( \vxstar \), then the solution is a **local minimizer**.

A problem devoid of constraints is **unconstrained**, otherwise it is a **constrained** optimization problem.

We start with understanding methods for unconstrained optimization first.

Here are some essential facts that we already know from the module on Calculus.

- At the stationary points, the gradient of a function is zero.
- At a local minimum, the Hessian is positive semidefinite; a positive definite Hessian at a stationary point guarantees a local minimum.
- At a local maximum, the Hessian is negative semidefinite; a negative definite Hessian guarantees a local maximum.
- At other stationary points, such as saddle points, the Hessian is indefinite.

Equipped with this knowledge about gradients and Hessians, we can solve an unconstrained optimization problem as follows:

- Identify points where the gradient is zero
- Filter out the ones where the Hessian is not positive definite
- Among the remaining points, find the point with the least function value. That is the solution.
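The three steps above can be sketched for a quadratic function \( f(\vx) = \frac{1}{2} \vx^T \mA \vx - \vb^T \vx \), whose gradient is \( \mA\vx - \vb \) and whose Hessian is \( \mA \) (the matrix and vector below are illustrative choices):

```python
import numpy as np

# A sketch of the three steps for the quadratic f(x) = 1/2 x^T A x - b^T x
# (the matrix A and vector b below are illustrative).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

# Step 1: the gradient is A x - b, which is zero where A x = b.
x_star = np.linalg.solve(A, b)

# Step 2: the Hessian of f is A; keep the point only if A is positive definite.
assert np.all(np.linalg.eigvalsh(A) > 0)

# Step 3: with a single stationary point and a positive definite Hessian,
# x_star is the minimizer.
print(x_star)  # [0.2 0.4]
```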

Mathematically expressed, these rules of thumb look like this at the local minimum \( \vxstar \) for the optimization problem \( \min_{\vx} f(\vx) \), where \( f: \R^n \to \R \).

- \( f(\vxstar) \le f(\vx)\) , for all \(\vx \in \neighborhood \), where \( \neighborhood \) is the immediate neighborhood of \( \vxstar \).
- \( \nabla f(\vxstar) = 0 \)
- \( \nabla^2 f(\vxstar) \) is positive definite, that is, \( \vy^T \nabla^2 f(\vxstar) \vy > 0 \), for all nonzero \( \vy \in \R^n \).

The first condition above is the definition of a local minimizer in a neighborhood.

Condition 2 above is known as the **first-order necessary condition** for a local minimizer, as it involves only first-order derivatives. But we know that it is not sufficient, since a point satisfying it could be a minimizer, a maximizer, or a saddle point.

Conditions 2 and 3 above are together known as the **second-order sufficient conditions** for a local minimizer, as they involve the second-order derivatives as well. Any point satisfying both these conditions is guaranteed to be a local minimizer.

Practical optimization problems seldom lend themselves to straightforward solutions that can be identified with the steps outlined earlier. There are many practical challenges:

- The function being optimized may not be differentiable.
- It may be infeasible to compute the gradient or the Hessian.
- Even when they can be computed, it may be infeasible to solve the resulting equations directly to identify the extrema.

Thus, optimization approaches typically utilize an intuitive iterative approach: *Start from an arbitrary point and keep moving towards a point with lower function value, until no further reduction in function value is possible.*

**So, the key question is: from a given point, how do we identify a point with a lower function value?**

It all depends on being able to model the function value at any new point \( f(\vx_{i+1}) \) based on the function value at the current point \( f(\vx_i) \). This is the realm of **Taylor's Theorem**.

We introduced Taylor's Theorem in our previous module on Calculus for Machine Learning. As you will soon realize, Taylor's Theorem is by far the most important concept for understanding the theory and reasoning behind optimization algorithms.

Owing to Taylor's Theorem, for a twice-differentiable function \( f(\vx) \), the value of the function at a point \(\vx+\vp \) can be approximated as

\begin{equation} f(\vx+\vp) \approx f(\vx) + \gradient_{\vx}^T \vp + \frac{1}{2} \vp^T \hessian_{\vx} \vp \label{eqn:second-order-taylor-approx} \end{equation}

Here, \( \gradient_{\vx} \) and \( \hessian_{\vx} \) denote the gradient and Hessian of the function with respect to \( \vx \) , respectively, evaluated at the point \( \vx \).

From the current iterate \( \vx \), this is a function in \( \vp \). At its local minimizer \( \star{\vp} \), the gradient with respect to \( \vp \) will be zero.

\begin{equation} \frac{\partial f(\vx + \star{\vp})}{\partial \vp} = 0 \label{eqn:first-order-condition-on-p} \end{equation}

Methods for unconstrained optimization differ in ways of arriving at a suitable \( \vp \).

**Newton's method** calculates \( \vp \) by solving the above equation in its entirety. **Gradient descent** ignores the second-order term. **Quasi-Newton methods** and **inexact Newton methods** approximate the second-order term.

We will delve deeper into these methods next.

Newton's method arrives at the next iterate \( \vx + \vp \) from the current iterate \( \vx \) by solving Equation \eqref{eqn:first-order-condition-on-p} on the second-order Taylor approximation of the function from Equation \eqref{eqn:second-order-taylor-approx}.

$$\begin{aligned} & \frac{\partial f(\vx + \star{\vp})}{\partial \vp} = 0 \\ \Rightarrow \text{ } & \gradient_\vx + \hessian_\vx \vp = 0 \\ \Rightarrow \text{ } & \vp = -\hessian_{\vx}^{-1} \gradient_\vx \\ \end{aligned} $$

Thus, Newton's method proceeds as follows:

- Start from a suitable point \( \vx \)
- Apply the following update to \( \vx \) till a local minimizer has been reached: \( \vx \leftarrow \vx - \hessian_{\vx}^{-1} \gradient_{\vx} \).
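The steps above can be sketched in code, here applied to Rosenbrock's function (which appears later in this module), with its gradient and Hessian coded by hand; the starting point and iteration count are illustrative choices:

```python
import numpy as np

# A sketch of Newton's iteration on Rosenbrock's function
# f(x, y) = (1 - x)^2 + 100 (y - x^2)^2.
def grad(v):
    x, y = v
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def hess(v):
    x, y = v
    return np.array([[2 - 400 * y + 1200 * x**2, -400 * x],
                     [-400 * x, 200.0]])

def newton(v, iters=20):
    for _ in range(iters):
        # Solve H s = gradient, then update v <- v - s, which is
        # exactly v - H^{-1} gradient without forming the inverse.
        v = v - np.linalg.solve(hess(v), grad(v))
    return v

v_min = newton(np.array([1.2, 1.2]))
print(v_min)  # converges to the minimizer [1, 1]
```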

Quite straightforward! But what if the function is not twice differentiable? Then we cannot calculate \( \hessian_{\vx} \), let alone \( \hessian_{\vx}^{-1} \).

The **Gradient descent** and **quasi-Newton methods** are alternative approaches that do not require the computation of the Hessian of the function. We present them next.

No matter where you start for this convex quadratic objective function, Newton's method finds the minimum in exactly 1 step.

In fact, this is a property worth noting: **Newton's method always finds the minimum of a convex quadratic objective function in exactly one step**. And why not? The second-order Taylor series expansion perfectly models the function and Newton's method finds the exact minimum of that second-order model in each step.

So, let's go beyond quadratic and beyond convex.

This is clearly a nonconvex objective function with two minima and one local maximum, the bump in the middle.

First, note that Newton's method, without any restrictions on its movement, does not discriminate between a minimum and a maximum. It simply moves to a nearby stationary point. So, if you start very close to the bump in the center, you will land at the local maximum.

Second, note that the iterates jump around before they reach a stationary point, unlike our previous example, a consequence of the objective function being nonconvex.

Also worth noting is that starting near a particular local minimum does not guarantee that the algorithm will converge to that local minimum. It may actually go over the bump and end up in another local minimum altogether.

Of course, Newton's method can be applied to multivariate functions.

We have introduced Rosenbrock's function in the module on Calculus. The global minimum of the function lies at \( x = 1, y = 1 \).

Note that unless you start close enough to the minimum, Newton's method may not find it within 10 iterations. This is a well-known shortcoming of Newton's method: iterations should begin at a point that is likely close to the minimum.

Himmelblau's function is nonconvex with 4 minima and 1 local maximum.

As expected, Newton's method seeks out any of these stationary points and does not necessarily reach the closest one.

The gradient descent method works with the first-order Taylor approximation of the function being minimized.

\begin{equation} f(\vx+\vp) \approx f(\vx) + \gradient_{\vx}^T \vp \label{eqn:first-order-taylor-approx} \end{equation}

For minimization, we wish to move from an iterate \( \vx \) to \( \vx + \vp \) such that \( f(\vx + \vp) < f(\vx) \).

From the first order Taylor approximation, this requires that

\begin{equation} \gradient_{\vx}^T \vp < 0 \label{eqn:first-order-decrease-criteria} \end{equation}

How can we choose a \( \vp \) to ensure this? How about \( \vp = -\alpha \gradient_{\vx} \), for some scalar \( \alpha > 0 \)?

It is easy to verify that \( \gradient_{\vx}^T \vp = -\alpha \gradient_{\vx}^T \gradient_{\vx} < 0 \), since \(\gradient_{\vx}^T \gradient_{\vx}\) is a sum of squares and hence positive whenever the gradient is nonzero. Great!

Because we are moving in the direction opposite (negative) of the gradient, this method is called **Gradient descent**.

Also, because \( \alpha \) controls the size of the step from \( \vx \) to \( \vx + \vp \), it is known as the **step length** or **learning rate**.

Thus, the gradient descent proceeds as follows:

- Start from a suitable point \( \vx \)
- Apply the following update to \( \vx \) till a local minimizer has been reached: \( \vx \leftarrow \vx - \alpha \gradient_{\vx} \).
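The procedure above can be sketched on the univariate quadratic \( f(x) = x^2 \), whose gradient is \( 2x \) (the learning rate and starting point below are illustrative choices):

```python
# A minimal gradient-descent sketch for f(x) = x^2, whose gradient is 2x.
def gradient_descent(x, alpha=0.1, iters=100):
    for _ in range(iters):
        x = x - alpha * 2 * x  # x <- x - alpha * gradient at x
    return x

x_min = gradient_descent(5.0)
print(x_min)  # approaches the minimizer x = 0
```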

See the similarity to Newton's method? Indeed, Newton's method is a special form of this update, with the scalar step length replaced by the inverse of the Hessian: \( \alpha_{\text{Newton}} = \hessian_{\vx}^{-1} \).

Remember how quickly Newton's method minimized the convex quadratic objective \( x^2 \)? Well, note that gradient descent is not that fast, in terms of the number of steps.

However, computationally, it may still be faster for some problems, as it does not require the computation of the Hessian inverse, a major disadvantage of the Newton's method.

Also, note the impact of the step size, or learning rate, \( \alpha \) on performance. For this particular problem, Newton's method suggests a step size of \( 0.5 \), so gradient descent with a step size of 0.5 reaches the minimum in one step. Any learning rate lower than 0.5 leads to slower convergence. Any learning rate higher than 0.5 leads to oscillations around the minimum until the iterates finally land at the minimum.

In fact, try the learning rate \( \alpha = 1 \) for this function. The iterations just keep oscillating.

Clearly, the learning rate is a crucial parameter of the gradient descent approach. We will look at practical strategies to choose a good value of the learning rate in the upcoming sections.

A simple strategy that is quite effective in practice is to gradually decrease the learning rate.

So, if \( 0 < \gamma < 1 \) is the dampening factor, then the learning rate \( \alpha \) at iteration \( k + 1 \) is

$$ \alpha_{k+1} = \gamma \alpha_k $$
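This schedule can be sketched on \( f(x) = x^2 \) (the starting rate, dampening factor, and starting point below are illustrative). Even with a starting rate of \( \alpha = 1 \), which by itself would oscillate forever, the dampened iterations settle at the minimum:

```python
# A sketch of gradient descent on f(x) = x^2 with a dampened learning rate.
def damped_gradient_descent(x, alpha=1.0, gamma=0.9, iters=50):
    for _ in range(iters):
        x = x - alpha * 2 * x   # gradient step on f(x) = x^2
        alpha = gamma * alpha   # alpha_{k+1} = gamma * alpha_k
    return x

# alpha = 1.0 with no dampening would bounce between +5 and -5 forever;
# with gamma = 0.9 the iterates settle at the minimizer x = 0.
x_min = damped_gradient_descent(5.0)
```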

In the next example, try to notice the effect of a gradually decreasing learning rate. With no dampening, there can be significant oscillations. \( \gamma < 0.5 \) leads to very slow progress and may not even reach the minimum. We just need some dampening, but not too much: usually \( \gamma > 0.9 \) works in most cases.

Given the importance of the learning rate, is there a good learning rate? Can we find it analytically?

First and foremost, given that we are performing minimization, we definitely want the function to decrease sufficiently at each iteration. Sufficient in this case means that the decrease should be at least proportional to the gradient step taken. So, for some constant \( 0 < c_1 < 1 \), we wish that

\begin{equation} f(\vx + \alpha \vp) \le f(\vx) + c_1 \alpha \nabla f(\vx)^T \vp \label{eqn:armijo-condition} \end{equation}

Moreover, the slope or gradient of the function at the new point should also be larger (less negative) than the one before it. This means, for some constant \( 0 < c_2 < 1 \), we wish that

\begin{equation} \nabla f(\vx + \alpha \vp)^T \vp \ge c_2 \nabla f(\vx)^T \vp \label{eqn:curvature-condition} \end{equation}

These two conditions for a desirable step length are known as **Wolfe conditions**. The first one, which ensures there is sufficient decrease is also known as **Armijo condition**. The second one, which ensures that the curvature is flattening out over time is known as **curvature condition**. It is usually the case that \( c_1 < c_2 \).

Wolfe conditions can be a good strategy for identifying a suitable learning rate at any point. At each iteration, we select a value of the learning rate that satisfies both conditions. If no such value can be found, we just use a default learning rate to move to another point and try finding a satisfactory learning rate at the next iteration.
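This strategy can be sketched as follows; the constants \( c_1 \) and \( c_2 \), the candidate step lengths, and the fallback value are illustrative choices, not the demo's exact settings:

```python
import numpy as np

# A sketch of selecting a step length that satisfies both Wolfe conditions.
def wolfe_step(f, grad, x, p, c1=1e-4, c2=0.9,
               candidates=(1.0, 0.5, 0.25, 0.1, 0.05)):
    g0 = grad(x) @ p  # directional derivative; negative for a descent direction
    for alpha in candidates:
        armijo = f(x + alpha * p) <= f(x) + c1 * alpha * g0     # sufficient decrease
        curvature = grad(x + alpha * p) @ p >= c2 * g0          # flattening slope
        if armijo and curvature:
            return alpha
    return candidates[-1]  # fall back to a default step length

f = lambda v: float(v @ v)   # f(x) = ||x||^2
grad = lambda v: 2 * v
x = np.array([2.0, 1.0])
p = -grad(x)                 # steepest-descent direction
alpha = wolfe_step(f, grad, x, p)
print(alpha)  # 0.5
```

For this quadratic, the full step \( \alpha = 1 \) overshoots and fails the Armijo test, while \( \alpha = 0.5 \) jumps straight to the minimum and satisfies both conditions.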

We use this strategy in the next two demos applying gradient descent to multivariate problems.

Rosenbrock is a very hard problem for many optimization algorithms. Notice that gradient descent can quickly find the valley in the center, but the convergence towards the optimal is extremely slow, as the iterates oscillate on the two sides of the valley.

It turns out that Himmelblau's function, in spite of its bump, is actually quite easy for gradient descent. With the learning rate selected dynamically based on the Wolfe conditions, the algorithm converges towards the solution rather quickly. Moreover, unlike Newton's method, the trajectory does not go over the bump, but rather quickly seeks out the nearest minimum.

Newton's method works with the second-order Taylor approximation and gradient descent works with the first order terms. Can there be a middle ground?

What if we could model the second-order terms without explicitly calculating the Hessian? Then the method can *pretend* to be Newton's method, without actually being one. Such methods are called **quasi-Newton** methods, which literally means *seemingly and apparently but not really, Newton*.

So, essentially, we want to find \( \vp \) such that

\begin{equation} f(\vx+\vp) = f(\vx) + \gradient_{\vx}^T \vp + \frac{1}{2} \vp^T \mB_{\vx} \vp \label{eqn:quasi-newton-taylor} \end{equation}

where, \( \mB \in \R^{n \times n} \) is some matrix for modeling second-order terms in the vicinity of the current iterate \( \vx \).

If we can find a good way to estimate \( \mB \), then a quasi-Newton method is just a restatement of Newton's method.

- Start from a suitable point \( \vx \)
- Apply the following update to \( \vx \) till a local minimizer has been reached: \( \vx \leftarrow \vx - \mB_{\vx}^{-1} \gradient_{\vx} \).

The answer lies in Calculus.

Let's denote our second-order approximation as a function of \( \vp \) at iteration \( k \) as \( m_k(\vp) \).

\begin{equation} m_k(\vp) = f(\vx_k) + \gradient_{\vx_k}^T \vp + \frac{1}{2} \vp^T \mB_k \vp \end{equation}

A good second-order approximation will lead to consistent gradients across two consecutive iterations. Suppose the iterates were \( \vx_k \) and \( \vx_{k+1} = \vx_k + \alpha_k \vp_k \) across two adjacent iterations \( k \) and \( k + 1 \), respectively. Then, matching the gradients across the two iterations, we get

\begin{aligned} & \frac{\partial m_k(0)}{\partial \vp} = \frac{\partial m_{k+1}(-\alpha_k \vp_k)}{\partial \vp} \\ & \implies \gradient_k = \gradient_{k+1} - \alpha_k\mB_{k+1}\vp_k \\ & \implies \mB_{k+1} \alpha_k \vp_k = \gradient_{k+1} - \gradient_k \\ & \implies \mB_{k+1} (\vx_{k+1} - \vx_k) = \gradient_{k+1} - \gradient_k \\ \end{aligned}

This beautiful equation, known as the **secant equation**, is akin to calculating the second-order derivative as the rate of change of the first-order derivative; a finite-difference approximation. The secant method was discovered over 3000 years before Newton's method, for the task of finding roots of an equation!

Note that the equation \( \mB_{k+1} (\vx_{k+1} - \vx_k) = \gradient_{k+1} - \gradient_k \) has \( n^2 \) unknowns and only \( n \) equations. Many matrices will satisfy such a relationship. We can afford to impose constraints to select a more desirable \( \mB_{k+1} \)!

We will impose two desirable conditions on \( \mB_{k+1} \)

- It should be close to our current approximation \( \mB_k \), so that the function is not changing drastically with each iteration.
- It should be symmetric, \( \mB_{k+1} = \mB_{k+1}^T \), just like a true Hessian of a twice differentiable function.

Thus, we can find a suitable \( \mB_{k+1} \) by solving this intermediate minimization problem

\begin{aligned} \min_{\mB} & ||\mB - \mB_k|| \\ \text{subject to } & \\ & \mB = \mB^T, \\ & \mB(\vx_{k+1} - \vx_k) = \gradient_{k+1} - \gradient_k \\ \end{aligned}

The objective function ensures proximity to the current approximation \( \mB_k \), while the two constraints enforce symmetry and the secant equation.

The various quasi-Newton methods differ in the kind of norm they use for \( ||\mB - \mB_k|| \). The original algorithm, proposed by Davidon in 1959, uses a weighted Frobenius norm.

The popular **BFGS algorithm** (named after its inventors: Broyden, Fletcher, Goldfarb, and Shanno) uses a similar norm, but works directly with the inverse of \( \mB \), since that is what the update rule needs anyway. Restating the above problem in terms of the inverse \( \mH_k = \mB_k^{-1} \)

\begin{aligned} \min_{\mH} & ||\mH - \mH_k|| \\ \text{subject to } & \\ & \mH = \mH^T, \\ & \mH(\gradient_{k+1} - \gradient_k) = x_{k+1} - x_k \\ \end{aligned}

This simplifies the update rule for BFGS to \( \vx_{k+1} = \vx_k - \mH_k \gradient_k \). No inverse computation needed, ever!

There are approaches to compute a closed form solution to these intermediate minimization problems, for example, using the Sherman-Morrison-Woodbury formula, but that is out of scope for this article.
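For the curious, here is a sketch of the resulting iteration, using the standard closed-form BFGS update of the inverse-Hessian approximation (quoted without derivation); the test function \( f(\vx) = \vx^T\vx \), its gradient \( 2\vx \), and the starting point are illustrative choices:

```python
import numpy as np

# A sketch of BFGS with the closed-form inverse-Hessian update
# H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T, rho = 1 / (y^T s).
def bfgs(grad, x, iters=10, alpha=1.0):
    n = x.size
    H = np.eye(n)                    # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(iters):
        p = -H @ g                   # quasi-Newton direction
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g  # the secant pair
        if abs(y @ s) < 1e-12:       # converged, or update undefined
            break
        rho = 1.0 / (y @ s)
        I = np.eye(n)
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)   # closed-form BFGS update of H
        x, g = x_new, g_new
    return x

x_min = bfgs(lambda v: 2 * v, np.array([3.0, -2.0]))
print(x_min)  # converges to the minimizer [0, 0]
```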

Although we can choose the learning rate \( \alpha \) dynamically based on the Wolfe conditions, we have purposely set it to 1. This will let us see how quickly the quasi-Newton method models the true second-order behavior. We have set the initial value of \( \mB \) to the identity matrix.

Notice in the demo that no matter where you start, the algorithm converges in 2 steps.

The first step makes a straight jump to the opposite side due to the identity matrix and unit learning rate. However, within that step, the matrix \( \mB \) approximates the second-order model of \( x^2 \) quite well.

Let's do the math, which is significantly simplified in the univariate case. If you start from the point \( x_0 \), then the next point is \( x_1 = x_0 - \alpha \frac{\nabla_0}{b_0} = x_0 - \frac{2x_0}{1} = -x_0 \), where \( b \) denotes the univariate version of \( \mB \), which is initially set to identity (1 in the univariate case).

In the next step, we will be solving for \( b_1 \) as follows:

\begin{aligned} & b_1(x_1 - x_0) = \nabla_1 - \nabla_0 \\ & \Rightarrow b_1 = \frac{2x_1 - 2x_0}{x_1 - x_0} \\ & \Rightarrow b_1 = 2 \\ \end{aligned}

Thus, within the first step, the BFGS method has figured out the same multiplier as Newton's method, and it converges in exactly one step thereafter.
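The arithmetic above can be checked in a few lines; the variable names mirror the text and the starting point is an illustrative choice:

```python
# A numerical check of the worked example: one-dimensional BFGS on f(x) = x^2
# with b_0 = 1 and alpha = 1.
grad = lambda x: 2 * x  # derivative of x^2
x0, b0 = 3.0, 1.0

x1 = x0 - grad(x0) / b0                  # first step: jumps to -x0
b1 = (grad(x1) - grad(x0)) / (x1 - x0)   # secant equation recovers b_1 = 2
x2 = x1 - grad(x1) / b1                  # second step lands exactly on the minimum
print(x1, b1, x2)  # -3.0 2.0 0.0
```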

For this nonconvex objective function, again the trajectory of the BFGS algorithm is quite similar to that of Newton's method. Depending on how close you start to a minimum, you either end up at a minimum or the maximum.

As in the previous example, the first jump of the quasi-Newton method is rather erratic, far from the vicinity of the solution. That is expected, given that we start from an arbitrary second-order approximation of the function. But as soon as the quasi-Newton approach starts closely approximating the true Hessian, the iterations start behaving like Newton's method.

Rosenbrock continues to be hard for BFGS. Once in the valley, the convergence towards the minimum is really slow.

The performance of BFGS on Himmelblau's function is similar to that of Newton's method. As long as the iterations start close enough to an optimum, convergence is rather quick. Otherwise, the trajectory is rather erratic while the BFGS method approximates this bumpy function's Hessian.

Newton's method, quasi-Newton methods, and the gradient descent method all have large-scale counterparts.

**Inexact Newton methods** work by directly approximating the product \( \hessian^{-1} \gradient \), for example with iterative linear solvers, instead of computing the Hessian and its inverse separately. This has the benefit of significant memory savings, since the Hessian is never stored explicitly.

**L-BFGS**, the limited-memory version of the quasi-Newton BFGS method, works on a similar principle. Instead of storing the full approximate inverse Hessian, it keeps only a short history of recent update vectors, thereby significantly reducing the memory footprint and the computations involved.

**Stochastic gradient descent (SGD)** scales up the gradient descent approach by decomposing a large-scale optimization problem into mini problems that can be optimized in sequence to arrive at the final solution. This is one of the most general and easily implemented strategies for optimization. SGD and its variants are the weapon of choice for machine learning and deep learning frameworks such as Theano, PyTorch, and TensorFlow. We will investigate SGD and its variants in more detail in the next few sections.

Stochastic gradient descent is the workhorse of recent machine learning approaches. It is used as a faster alternative for training support vector machines and is the preferred optimization routine for deep learning approaches.

It is a variant of gradient descent that works by parts. If the objective function can be decomposed into several terms, then SGD proceeds by updating the iterates based on the gradient of a single term at a time, in a round-robin, sequential, or, more generally, stochastic fashion.

A popular machine learning loss function is a summation of the form \( \sum_{i} (x-a_i)^2 \). Such a sum of squares is a great fit for stochastic gradient descent. The \(i\)-th term is a quadratic of the form \( (x - a_i)^2 \) and its derivative is \( 2(x - a_i) \).

For such a function, the SGD approach performs the following steps till convergence.

- In iteration \( k \), pick a term \( i_k \) at random.
- Update \( x_{k+1} = x_k - \alpha_k 2(x_k - a_{i_k}) \).
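These steps can be sketched as follows, with dampening applied to the learning rate; the data, starting point, and rates below are illustrative choices, and the minimizer of the full sum is the mean of the \( a_i \):

```python
import random

# A sketch of SGD on the loss sum_i (x - a_i)^2; the i-th term's gradient
# is 2 (x - a_i).
a = [1.0, 2.0, 3.0, 4.0]         # the full-sum minimizer is mean(a) = 2.5
x, alpha, gamma = 10.0, 0.4, 0.99

random.seed(0)                   # fixed seed for reproducibility
for k in range(500):
    i = random.randrange(len(a))     # pick a term i_k at random
    x = x - alpha * 2 * (x - a[i])   # step along that single term's gradient
    alpha = gamma * alpha            # dampen the learning rate

print(x)  # wanders near the minimizer 2.5
```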

Notice that each update is akin to a partial gradient descent. For the sake of analogy, if we were doing a full gradient descent, we would be performing the following update.

$$ x_{k+1} = x_k - 2 \alpha_k \sum_{i} (x_k - a_{i}) $$

Thus, SGD is a great choice for optimizing functions with a large number of terms, each of which can be presented sequentially. It will be evident in the upcoming examples that such a process leads to a zig-zag path to the optimum, as each iteration works with partial information.

For this sum of squares, SGD is a great choice. Note that compared to gradient descent, SGD goes back and forth in its iterations. Moreover, choosing an appropriate value of the learning rate and dampening factor is extremely important.

For the sake of SGD, we will decompose this nonconvex function into the two terms \( x^4 \) and \( 8x^2 \). Note how a high value of learning rate, with no dampening, can send the iterations off the chart.

Also, note that starting near the optimal point does not guarantee quick convergence. This is because the stochastic nature of the algorithm results in random jumps that may not agree with the gradient of all the terms taken together.

We have decomposed the Rosenbrock function into its two summands, specifically, \( (1-x)^2 \) and \( 100(y - x^2)^2 \) to demonstrate the effect of using partial gradients with SGD in each iteration.

Rosenbrock was hard for gradient descent. It is even harder for SGD, which only works with partial gradients in each step.

Finding the valley takes longer than with gradient descent, and there is very little chance of finding the optimum if the learning rate is steadily dampened.

Although SGD is the preferred approach for deep learning and the workhorse of most modern machine learning libraries, this point needs to be emphasized: lack of progress of the iterations does not imply convergence to a solution. It could mean that SGD is just stuck. SGD is easy to use and implement, and it works on many problems of interest. But use it cautiously and be aware of its shortcomings.

We have decomposed Himmelblau's function into its two summands, \( (x^2 + y - 11)^2 \) and \( (x+y^2-7)^2 \), for the sake of demonstrating SGD with partial gradients in each step.

SGD converges for Himmelblau's function if the learning rate is just right and dampening is used to slow it down with each iteration to avoid oscillations. But, as expected, it takes a more serpentine route to reach the minima. Interestingly, if the parameters are set appropriately, SGD converges to the closest minimum.

However, if the parameters are not chosen suitably, then SGD may not converge to any optimum. SGD relies on good learning rates and dampening factors more than gradient descent does, an important point to remember when implementing SGD-based solutions. Owing to these factors, there is increased interest in variants of SGD that rely on fewer parameters or just work out of the box. We will investigate these in an upcoming section specifically dedicated to SGD and its popular variants.

Optimization is a broad field, much older than machine learning. There are numerous good resources; we list a few important ones.

- Numerical Optimization, a book by Prof. Nocedal and Prof. Wright, is a very comprehensive resource on optimization algorithms. An older version is freely downloadable in PDF format.
- Leon Bottou, in collaboration with Prof. Nocedal has written a comprehensive paper on optimization methods for large-scale machine learning
- The SciPy package offers several optimization routines in the scipy.optimize module. For SGD implementations, check out the popular machine learning frameworks, such as TensorFlow's GradientDescentOptimizer, PyTorch's optim.SGD, and Theano's SGD.
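As a taste of the SciPy routines, here is a sketch of minimizing Rosenbrock's function with SciPy's BFGS implementation; the starting point is an illustrative choice, and the gradient is estimated numerically since we do not pass one:

```python
import numpy as np
from scipy.optimize import minimize

# A sketch of using scipy.optimize.minimize with method="BFGS"
# on Rosenbrock's function.
rosenbrock = lambda v: (1 - v[0])**2 + 100 * (v[1] - v[0]**2)**2
result = minimize(rosenbrock, x0=np.array([-1.5, 2.0]), method="BFGS")
print(result.x)  # close to the global minimum [1, 1]
```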

- Do you know what optimization is?
- Do you understand constraints, feasibility, global vs local minima, and the objective function?
- Can you derive Newton's method, gradient descent, and the BFGS approach to unconstrained optimization?
- Do you understand the large-scale variants of these approaches, especially stochastic gradient descent (SGD)?
- Do you understand the importance of learning rate and dampening factor?
- Do you know the Wolfe conditions? Can you describe the Armijo and the curvature conditions?
- Can you discuss the suitability of each optimization approach for a given problem?
- Can you identify problems where SGD might be a good fit?