Unconstrained optimization

Optimization

A problem devoid of constraints is, well, an unconstrained optimization problem. Much of modern machine learning and deep learning depends on formulating and solving an unconstrained optimization problem, by incorporating constraints as additional elements of the loss with suitable penalties.

In this article, we will give a broad intuition on solving unconstrained optimization problems.

Prerequisites

To understand this article, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Problem statement

Unconstrained optimization problems are easy to express as they are devoid of constraints.

In this article, we will consider an objective function \( f : \real^\ndim \to \real \) that we wish to minimize. We express this as,

\begin{aligned} \min_{\vx} \text{ }& f(\vx) \\ \end{aligned}

If we are interested in the value of \( \vx \) that achieves the minimum, we shall write this as

\begin{aligned} \star{\vx} = \argmin_{\vx} \text{ }& f(\vx) \\ \end{aligned}

In machine learning, we are usually interested in this latter formulation. In that context, \( \star{\vx} \) denotes the parameters of the model being fit to the training data, and \( f(\vx) \) denotes the loss that we wish to minimize, as a function of the parameters.

Calculus recap: Maxima and minima

Here are some essential facts about function maxima, minima, and saddle points that we already know from calculus.

  • At the stationary points, the gradient of a function is zero.
  • At a local minimum, the Hessian is positive definite.
  • At a local maximum, the Hessian is negative definite.
  • At other stationary points, such as saddle points, the Hessian is indefinite.

Thus, to find a local minimum, we need to find points where the gradient is zero and the Hessian is positive definite.

If we can enumerate all such local minimizers, then the point with lowest function value among the local minimizers, is the global minimizer.

Conditions for optimality

The conditions for a point \( \star{\vx} \) to be a local minimizer can be mathematically expressed as

  1. \( f(\star{\vx}) \le f(\vx)\) , for all \(\vx \in \mathcal{N}_{\star{vx}} \), where \( \mathcal{N}_{\star{\vx}} \) is the immediate neighborhood of \( \star{\vx} \).
  2. \( \nabla f(\star{\vx} ) = 0 \)
  3. \( \nabla^2 f(\star{\vx}) \) is positive definite. In other words, \( \vy^T \nabla^2 f(\star{\vx}) \vy > 0 \), for all \( \vy \in \real^n \).

The first condition above is just the definition of a local minimizer in a neighborhood.

The condition 2 above is known as the first-order necessary condition for a local minimizer as it involves only the first-order derivatives. Any minimizer is a stationary point. Therefore it must satisfy this condition. Hence it is deemed a necessary condition.
But, it is not a sufficient condition, since it merely identifies a stationary point. Such a point could be a minimizer, maximizer, or a saddle point.

The conditions 2 and 3 above are together known as second-order sufficient conditions for a local minimizer as they involve the second-order derivatives as well. Any point satisfying both these conditions is guaranteed to be a local minimizer.

Need for an iterative process

Equipped with this knowledge about gradients and Hessians, we can solve an unconstrained optimization problem as follows:

  1. Identify points where the gradient is zero
  2. Filter out the ones where the Hessian is not positive definite
  3. Among the remaining points, find the point with the least function value. That is the solution.

Practical optimization problems seldom lend themselves to straightforward solutions that can be identified with such steps. There are many practical challenges:

  • Function being optimized may not be differentiable!
  • It may be infeasible to compute the gradient or the Hessian
  • Even if they may be computed, it may be infeasible to directly solve them to identify extrema.

Thus, optimization approaches typically utilize an intuitive iterative approach:

Start from an arbitrary point and keep moving towards a point with lower function value, until no further reduction in function value is possible.

Choosing the next point

Optimization is an iterative process, progressing through points with lower function values, until convergence. The key question is

From a given point, how do we identify a point with a lower function value?

It all depends on being able to model the function value at any new point \( f(\vx_{i+1}) \) based on the function value at the current point \( f(\vx_i) \). This is the realm of Taylor's Theorem, a strong tool for function approximation that we introduced in our module on calculus.

Taylor's Theorem is by far the most important concept to understand the theory and reasoning behind optimization algorithms, as you will soon realize.

Taylor's Theorem and optimization

Owing to the Taylor's Theorem, for a twice differentiable function \( f(\vx) \), the value of the function at a point \(\va+\vp \) can be written in terms of the function value at a point \( \va \) as follows.

\begin{equation} f(\va+\vp) = f(\va) + \nabla_{\va}^T \vp + \frac{1}{2} \vp^T \mH_{\va} \vp \label{eqn:second-order-taylor-approx} \end{equation}

Here, \( \nabla_{\va} \) and \( \mH_{\va} \) denote the gradient and Hessian of the function, evaluated at the point \( \va \).

To choose the next point \( \va + \vp \), we need to discover a suitable value of \( \vp \). Since \( \va \) is already known (the current point), this is a function in \( \vp \). At its local minimizer, \( \star{\vp} \), the gradient with respect to \( \vp \) will be zero.

\begin{aligned} \doh{f(\va + \star{\vp})}{\vp} = 0 \\\\ \label{eqn:first-order-condition-on-p} \end{aligned}

Thus, we now have a system of equations to solve for the value of \( \vp \) to choose our next point \( \va + \vp \).

\begin{aligned} \frac{\partial}{\partial \vp} \left( f(\va) + \nabla_{\va}^T \vp + \frac{1}{2} \vp^T \mH_{\va} \vp \right) = 0 \end{aligned}

Approaches

Methods for unconstrained optimization differ in ways of arriving at a suitable \( \vp \) from the above equation.

  • Newton's method calculates \( \vp \) by solving the above equation in its entirety.
  • quasi-Newton methods such as the BFGS method, approximate the second-order term. This may be necessary if the Hessian of the function is computationally infeasible.
  • Gradient descent ignores the second-order term. They only work with the gradient term. Hence the name.

These methods are further adapted for scalability with extensions such as inexact Newton's method, L-BFGS, and stochastic gradient descent (SGD).

We know from the Taylor's Theorem demo that including more terms in the expansion leads to better function approximation. Therefore, gradient descent is usually slower to converge than Newton's method. However, we often deal with complicated objective functions, with hard to compute Hessians. Hence, gradient descent, particularly SGD, is the preferred weapon of choice in modern machine learning and deep learning.

Follow the links in this section to study each of these approaches in details and build intuition through interactive problems.

Gradient descent demo: \( \min x^2 \)

Let's see gradient descent in action on a simple univariate function \( f(x) = x^2 \), where \( x \in \real \).

Note that the function has a global minimum at \( x = 0 \). The goal of the gradient descent method is to discover this point of least function value, starting at any arbitrary point.

In this accompanying demo, change the starting point by dragging the orange circle and see the trajectory of gradient descent method towards the global minimum. Also study the effect of the learning rate on the convergence rate.

To know more about gradient descent and learning rate, head on to our comprehensive articles on these topics in optimization.

Gradient descent demo: Himmelblau's function

As another example, let's apply gradient descent to a multivariate function with multiple stationary points.

We introduced Himmelblau's function in our article on multivariate functions in calculus. It has 4 local minima (highlighted in green) and 1 maximum (highlighted in blue).

Turns out, Himmelblau's, in spite of its bump, is actually quite easy for gradient descent.

Large scale optimization

Newton's method, quasi-Newton method and the gradient descent method, all have their large-scale counterparts.

Inexact Newton methods work by directly calculating the product \( \inv{\mH} \nabla \) instead of computing them separately. This has the benefit of significant savings in terms of memory, since the Hessian is never stored explicitly and the product is a vector.

L-BFGS, the limited-memory version of the quasi-Newton BFGS method works on a similar principle. In addition to working with an approximated Hessian inverse, as BFGS, it also assumes a sparse Hessian, thereby significantly reducing the memory footprint and computations involved.

Stochastic gradient descent (SGD), scales up the gradient descent approach by decomposing the large scale optimization problem into mini problems that could be optimized in a sequence to arrive at the final solution. This is by far the most general and easily implemented strategies for optimization. SGD and its variants are the weapon of choice for machine learning and deep learning frameworks such as Theano, PyTorch, and Tensorflow. We will investigate SGD and its variants in more detail in the next few sections.

Where to next?

Expand your knowledge of optimization approaches with our detailed interactive articles on this subject.

To brush up on calculus, explore our articles on topics in calculus and multivariate calculus. Or start with strong mathematical foundations.

Already an optimization expert? Check out comprehensive courses on machine learning or deep learning.

Please share

Let your friends, followers, and colleagues know about this resource you discovered.

Let's connect

Please share your comments, questions, encouragement, and feedback.