Newton's method

Optimization

Newton's method is an approach for unconstrained optimization.

In this article, we will motivate the formulation of this approach and provide interactive demos over multiple univariate and multivariate functions to show it in action.

Prerequisites

To understand this article, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Problem statement

In this article, we will consider an objective function \( f : \real^\ndim \to \real \) that we wish to minimize. We express the unconstrained optimization problem as,

\begin{aligned} \min_{\vx} \text{ }& f(\vx) \\ \end{aligned}

Newton's method is an iterative approach. Starting from an arbitrary point, we will progressively discover points with lower function values till we converge to a point with minimum function value.

The next few sections will outline the strategy used by Newton's method to discover the next point \( \va + \vp \), from the current point \( \va \) on its journey to the optimal solution.

Taylor's theorem and optimization: A recap

Owing to the Taylor's Theorem, for a twice differentiable function \( f: \real^\ndim \to \ndim \), the value of the function at a point \(\va+\vp \) can be written in terms of the function value at a point \( \va \) as follows.

\begin{equation} f(\va+\vp) = f(\va) + \nabla_{\va}^T \vp + \frac{1}{2} \vp^T \mH_{\va} \vp \label{eqn:second-order-taylor-approx} \end{equation}

Here, \( \nabla_{\va} \) and \( \mH_{\va} \) denote the gradient and Hessian of the function, evaluated at the point \( \va \).

To choose the next point \( \va + \vp \), we need to discover a suitable value of \( \vp \). Since \( \va \) is already known (the current point), this is a function in \( \vp \). At its local minimizer, \( \star{\vp} \), the gradient with respect to \( \vp \) will be zero.

\begin{aligned} \doh{f(\va + \star{\vp})}{\vp} = 0 \\\\ \label{eqn:first-order-condition-on-p} \end{aligned}

Thus, we now have a system of equations to solve for the value of \( \vp \) to choose our next point \( \va + \vp \).

Newton's method

The Newton's method arrives at the next iterate \( \va + \vp \) from the current iterate \( \va \) by solving the above equation.

$$\begin{aligned} & \frac{\partial f(\va + \star{\vp})}{\partial \vp} = 0 \\ \implies & \frac{\partial}{\partial \vp} \left( f(\va) + \nabla_{\va}^T \vp + \frac{1}{2} \vp^T \mH_{\va} \vp \right) = 0 \\ \implies & \nabla_{\va} + \mH_{\va} \vp = 0 \\ \implies & \vp = -\inv{\mH_{\va}} \nabla_{\va} \\ \end{aligned} $$

Thus, Newton's method proceeds as follows:

  1. Start from a suitable point \( \va = \va_0 \)
  2. Apply the following update to \( \va \) till a local minimizer has been reached: \( \va \leftarrow \va - \mH_{\va}^{-1} \nabla_{\va} \).

Quite straightforward!

Newton's method demo: \( \min x^2 \)

Let's see Newton's method in action with a simple univariate function \( f(x) = x^2 \), where \( x \in \real \).

Note that the function has a global minimum at \( x = 0 \). The goal of the Newton's method is to discover this point of least function value, starting at any arbitrary point.

In this accompanying demo, change the starting point by dragging the orange circle and see the trajectory of Newton's method towards the global minimum.

Newton's method analysis: \( \min x^2 \)

No matter where you start for this convex quadratic objective function, Newton's method finds the minimum in exactly 1 step.

In fact, this is a property worth noting: Newton's method always finds the minimum of a convex quadratic objective function in exactly one step. And why not? The second-order Taylor series expansion perfectly models the function and Newton's method finds the exact minimum of that second-order model in each step.

So, let's go beyond quadratic and beyond convex.

Newton's method demo: \( \min x^4 - 8x^2 \)

Let's see Newton's method on a nonconvex function with multiple minima and a single maximum.

$$ f(x) = x^4 - 8x^2 $$

Newton's method analysis: \( \min x^4 - 8x^2 \)

This is clearly a nonconvex objective fuction with two minima and one local maximum, the bump in the middle.

First, note that the Newton's method, without any restrictions on its movement, does not discriminate between the minimum and the maximum. It just gets to a nearby stationary point. So, if you start very close to the bump in the center, you will land at the local maximum.

Second, note that the iterates jump around before they reach a stationary point, unlike our previous example, a consequence of the objective function being nonconvex.

Also worth nothing is that starting near a particular local minimum does not guarantee that the algorithm will converge to that local minimum. It may actually go over the bump and land up in another local minimum altogether.

Newton's method demo: Rosenbrock's function

Of course, Newton's method can be applied to multivariate functions.

We have introduced Rosenbrock's function in the module on Calculus.

$$ \min_{x,y} (1-x)^2 + 100 (y-x^2)^2 $$

The global minimum of the function lies at \( x = 1, y = 1 \).

Note that unless you start close enough to the minimum, the Newton's method may not find it within 10 iterations. This is a well-known shortcoming of the Newton's method. Iterations should begin with a point that is likely close to the minimum.

Newton's method demo: Himmelblau's function

So far, our function had only a single minimum. Such functions, albeit easy to understand, are rare in machine learning. Let us study a function with multiple minima. We have introduced Himmelblau's function in the module on Calculus.

$$ f(x,y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2 $$

Himmelblau's function is nonconvex with 4 minima (marked in green) and 1 local maximum (marked in blue).

As expected, Newton's method seeks out one of these stationary points. It does not necessarily reach the closest stationary point.

Challenge of the Hessian

Newton's method seems quite straightforward.

But! What if the function is not twice differentiable? Or what if the Hessian is computationally infeasible.

In these cases, we cannot calculate \( \mH_{\vx} \) , let alone it's inverse, \( \inv{\mH_{\vx}} \).

In such scenarios, we use alternative approaches such as

  • approximate the Hessian using quasi-Newton methods such as BFGS
  • ignore the Hessian term altogether using gradient descent

Where to next?

Expand your knowledge of optimization approaches with our detailed interactive articles on this subject.

To brush up on calculus, explore our articles on topics in calculus and multivariate calculus. Or start with strong mathematical foundations.

Already an optimization expert? Check out comprehensive courses on machine learning or deep learning.

Please share

Let your friends, followers, and colleagues know about this resource you discovered.

Subscribe for article updates

Stay up to date with new material for free.