Identifying maxima, minima, and saddle points with multivariate derivatives


\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} 
\newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? 
}}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)


Much of machine learning is built around loss functions and their optimization. To understand optimization, we first need to build intuition about maxima, minima, and so-called saddle points.

In this article, through interactive visualization on several example functions, we will build such an understanding.

Prerequisites

To understand this article on identifying maxima, minima, and saddle points, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Taylor's Theorem for multivariate functions

Taylor's theorem, introduced earlier, also applies to multivariate functions.

For a bivariate function, $ f: \real^2 \to \real $, the Taylor series expansion around the point $ (a,b) $, truncated after the second-order terms, is

$$ f(x+a, y+b) \approx f(a,b) + x\dox{f}\bigg\rvert_{a,b} + y\doy{f}\bigg\rvert_{a,b} + \frac{1}{2}\left( x^2 \doxx{f}\bigg\rvert_{a,b} + xy\doxy{f}\bigg\rvert_{a,b} + xy \doyx{f}\bigg\rvert_{a,b} + y^2 \doyy{f}\bigg\rvert_{a,b} \right) $$

Here, $ g\bigg\rvert_{a,b} $ means that the expression $ g $ (in our case, a partial derivative) is evaluated at the point $ (a,b) $, and $ (x,y) $ is a small displacement away from that point.
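As a rough numerical illustration of this expansion, the sketch below compares a function with its second-order Taylor approximation for shrinking displacements. The function $ f(x,y) = x^3 + xy^2 $, the expansion point $ (1,2) $, and the helper names `taylor2` and `taylor_error` are hypothetical choices made only for this example; they do not come from the article.

```python
def f(x, y):
    # Hypothetical example function, chosen only for illustration.
    return x**3 + x * y**2


def taylor2(a, b, x, y):
    """Second-order Taylor approximation of f(a + x, b + y) around (a, b)."""
    # Analytic partial derivatives of the example function at (a, b).
    fx = 3 * a**2 + b**2
    fy = 2 * a * b
    fxx = 6 * a
    fxy = 2 * b  # the mixed partials are equal for this smooth function
    fyy = 2 * a
    second_order = x**2 * fxx + x * y * fxy + x * y * fxy + y**2 * fyy
    return f(a, b) + x * fx + y * fy + 0.5 * second_order


def taylor_error(a, b, x, y):
    """Absolute gap between the function and its approximation."""
    return abs(f(a + x, b + y) - taylor2(a, b, x, y))


if __name__ == "__main__":
    for step in (0.1, 0.01, 0.001):
        print(step, taylor_error(1.0, 2.0, step, step))
```

For this cubic function the leftover error consists only of third-order terms, so it should shrink by roughly a factor of 1000 each time the displacement shrinks by a factor of 10.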

We will use this approximation, based on Taylor's theorem, to identify the conditions for maxima, minima, and saddle points.

Conditions for minimum or minima

Suppose the function has a minimum at some point $ (a,b) $.

Since a minimum is a critical point, this means the gradient of the function is zero at $ (a,b) $. Therefore, $\dox{f}\bigg\rvert_{a,b} = 0 $ and $ \doy{f}\bigg\rvert_{a,b} = 0 $.

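For $ (a,b) $ to be a strict local minimum, we need $ f(x+a,y+b) > f(a,b) $ for all sufficiently small nonzero displacements $ (x,y) $. With the first-order terms equal to zero, the Taylor series expansion above then requires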

$$ x^2 \doxx{f}\bigg\rvert_{a,b} + xy\doxy{f}\bigg\rvert_{a,b} + xy \doyx{f}\bigg\rvert_{a,b} + y^2 \doyy{f}\bigg\rvert_{a,b} > 0 $$

If $ \vx $ denotes the vector with the two coordinates $ (x,y) $, then $ \vx = [x,y] $. Using this notation, we can write the above relationship as

$$ \vx^T \mH\bigg\rvert_{a,b} \vx > 0 $$

where $ \mH\bigg\rvert_{a,b} $ is the Hessian matrix, that is, the matrix of second-order partial derivatives, evaluated at $ (a,b) $.
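To see why the two forms agree, treat $ \vx $ as a column vector and write out the product. The arrangement of the Hessian entries below follows the usual convention, which is assumed here rather than taken from the article.

$$ \vx^T \mH\bigg\rvert_{a,b} \vx = \begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} \doxx{f} & \doxy{f} \\ \doyx{f} & \doyy{f} \end{bmatrix}_{a,b} \begin{bmatrix} x \\ y \end{bmatrix} = x^2 \doxx{f}\bigg\rvert_{a,b} + xy\doxy{f}\bigg\rvert_{a,b} + xy \doyx{f}\bigg\rvert_{a,b} + y^2 \doyy{f}\bigg\rvert_{a,b} $$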

But wait! If the inequality $ \vx^T \mH\bigg\rvert_{a,b} \vx > 0 $ holds for every nonzero $ \vx $, then $ \mH\bigg\rvert_{a,b} $ is, by definition, positive definite!

Thus, a function has a minimum at a point $ (a,b) $ if

Gradient at $ (a,b) $ is zero
Hessian at $ (a,b) $ is positive definite

Conditions for maximum or maxima of a function

We can arrive at these conditions using the same approach as before.

Suppose the function has a maximum at some point $ (c,d) $.

Since a maximum is a critical point, this means the gradient of the function is zero at $ (c,d) $. Therefore, $\dox{f}\bigg\rvert_{c,d} = 0 $ and $ \doy{f}\bigg\rvert_{c,d} = 0 $.

For $ (c,d) $ to be a strict local maximum, we need $ f(x+c,y+d) < f(c,d) $ for all sufficiently small nonzero displacements $ (x,y) $.

From the Taylor series expansion above, it also means,

$$ x^2 \doxx{f}\bigg\rvert_{c,d} + xy\doxy{f}\bigg\rvert_{c,d} + xy \doyx{f}\bigg\rvert_{c,d} + y^2 \doyy{f}\bigg\rvert_{c,d} < 0 $$

Using the same vector notation and Hessian notation as before, we note that the above inequality is equivalent to

$$ \vx^T \mH\bigg\rvert_{c,d} \vx < 0 $$

This means, $ \mH\bigg\rvert_{c,d} $ should be negative definite at the maximum.

Thus, a function has a maximum at a point $ (c,d) $ if

Gradient at $ (c,d) $ is zero
Hessian at $ (c,d) $ is negative definite

Conditions for saddle point

What if the gradient of the function is zero at a point, but the Hessian is indefinite? The point is then a critical point, but it is neither a maximum nor a minimum. Such a point is a saddle point.

To summarize, a function has a saddle point at a point $ (c,d) $ if

Gradient at $ (c,d) $ is zero
Hessian at $ (c,d) $ is indefinite

A summary of the conditions

Just as a quick reference, here we summarize the 3 conditions.

If the gradient is zero at a point $ (a,b) $, that is, $ \nabla f\bigg\rvert_{a,b} = 0 $, then the function has a

minimum at $ (a,b) $, if its Hessian $ \mathbf{H}\bigg\rvert_{a,b} $ at that point is positive definite.
maximum at $ (a,b) $, if its Hessian $ \mathbf{H}\bigg\rvert_{a,b} $ at that point is negative definite.
saddle point at $(a,b)$, if its Hessian $ \mathbf{H}\bigg\rvert_{a,b} $ at that point is indefinite.
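The three conditions above can be sketched as a small classifier for bivariate functions. This is a minimal illustration rather than a definitive implementation: the function names and the tolerance `tol` are hypothetical, the Hessian is assumed symmetric, and the caller is assumed to have already checked that the gradient is zero at the point. The eigenvalues come from the closed-form expression for a symmetric 2x2 matrix. A semi-definite Hessian, a case the summary above does not cover, is reported as inconclusive.

```python
import math


def eigenvalues_2x2(fxx, fxy, fyy):
    """Eigenvalues (smallest, largest) of the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    mean = (fxx + fyy) / 2.0
    radius = math.hypot((fxx - fyy) / 2.0, fxy)
    return mean - radius, mean + radius


def classify_critical_point(fxx, fxy, fyy, tol=1e-12):
    """Classify a critical point from its Hessian entries.

    Assumes the gradient is already known to be zero at the point.
    """
    lo, hi = eigenvalues_2x2(fxx, fxy, fyy)
    if lo > tol:
        return "minimum"        # all eigenvalues positive: positive definite
    if hi < -tol:
        return "maximum"        # all eigenvalues negative: negative definite
    if lo < -tol and hi > tol:
        return "saddle point"   # mixed signs: indefinite
    return "inconclusive"       # a zero eigenvalue: semi-definite


if __name__ == "__main__":
    print(classify_critical_point(2.0, 0.0, 2.0))
    print(classify_critical_point(-2.0, 0.0, -2.0))
    print(classify_critical_point(2.0, 0.0, -2.0))
```

Checking eigenvalue signs is one of several equivalent ways to test definiteness; it is used here because it extends directly to more than two variables.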

This is the key takeaway of this article.

To distinguish among maxima, minima, and saddle points, investigate the definiteness of the Hessian.

Gradient and Hessian of Multivariate Bowl

In the accompanying demo, we have plotted the gradient of the multivariate bowl function as follows. At each point on the 2-dimensional plane,

The arrows show the direction of the gradient at that point.
The color intensity shows the magnitude, the $L_2$-norm, of the gradient at that point.

Vector direction and norm were introduced earlier in our article on vector geometry in linear algebra.

We have also superimposed a red arrow indicating the gradient at the highlighted point. The arrow represents the gradient vector $ [\dox{f}, \doy{f}] $, with its origin at the point $ (x,y) $. Notice that it always points in the direction in which the function increases most rapidly, and that it is longer where the function changes more steeply.

For the symmetric multivariate bowl function, the gradient plot conveys the following information.

The function increases in all directions from the center.
The rate of growth of the function is symmetric around the center.
The function grows more rapidly as we move away from the center.

We have shown the Hessian information in a separate panel as follows: At every point on the 2-dimensional plane,

The $+$ or $-$ indicates whether the Hessian is positive semi-definite (PSD) or negative semi-definite (NSD).
The Hessian plot has 3 colors based on definiteness of the Hessian at that point: Blue (negative definite), Green (positive definite), and Red (indefinite).

For the multivariate bowl, the Hessian is positive semi-definite and, in fact, positive definite everywhere in its domain. (Do you remember how to verify this?)

It has exactly one point where the gradient is zero, $ [0 \text{ } 0] $. And because the Hessian is positive definite, that point must be a minimum.
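The observations about the bowl can be reproduced numerically. The sketch below assumes the bowl is $ f(x,y) = x^2 + y^2 $; the article does not state the formula, so this is a guess at the simplest symmetric bowl. The gradient and Hessian are approximated with central finite differences, a generic numerical technique rather than whatever the demo itself uses, and all function names and step sizes are hypothetical.

```python
import math


def bowl(x, y):
    # Assumed form of the symmetric bowl; not taken from the article.
    return x**2 + y**2


def gradient(f, x, y, h=1e-5):
    """Central-difference approximation of (df/dx, df/dy)."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy


def hessian(f, x, y, h=1e-4):
    """Central-difference approximation of (fxx, fxy, fyy)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy


def gradient_norm(f, x, y):
    """L2-norm of the gradient, the quantity shown as color intensity."""
    return math.hypot(*gradient(f, x, y))


if __name__ == "__main__":
    print("gradient at the center:", gradient(bowl, 0.0, 0.0))
    print("gradient norm at (1, 0):", gradient_norm(bowl, 1.0, 0.0))
    print("gradient norm at (2, 0):", gradient_norm(bowl, 2.0, 0.0))
    print("Hessian entries at (1, 1):", hessian(bowl, 1.0, 1.0))
```

Under this assumed formula, the Hessian is approximately twice the identity matrix at every point. Both of its eigenvalues are then positive, which is one way to verify positive definiteness.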

This is much easier than inspecting charts and slices of the function in various directions.

Multivariate bowl: gradient and Hessian demo

Gradient and Hessian: Rosenbrock's function

Appearances can be deceiving. Note that in the case of Rosenbrock's function, one might incorrectly assume that there is a maximum somewhere along the $y$-axis.

But studying the definiteness of the Hessian suggests otherwise. The Hessian is indefinite in that region. The minimum occurs in a region where the Hessian is positive definite.
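A quick check of this claim, assuming the standard form of Rosenbrock's function, $ f(x,y) = (1-x)^2 + 100(y-x^2)^2 $. The article does not state which parameters its demo uses, so the constants here are an assumption, as are the function names. Definiteness is tested with the determinant-based second-derivative test for 2x2 symmetric matrices.

```python
def rosenbrock_gradient(x, y):
    """Analytic gradient of (1 - x)^2 + 100 (y - x^2)^2 (assumed parameters)."""
    dfdx = -2 * (1 - x) - 400 * x * (y - x**2)
    dfdy = 200 * (y - x**2)
    return dfdx, dfdy


def rosenbrock_hessian(x, y):
    """Analytic Hessian entries (fxx, fxy, fyy) of the same function."""
    fxx = 2 - 400 * (y - x**2) + 800 * x**2
    fxy = -400 * x
    fyy = 200
    return fxx, fxy, fyy


def definiteness(fxx, fxy, fyy):
    """Second-derivative test for the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    det = fxx * fyy - fxy**2
    if det > 0:
        # A positive determinant means both eigenvalues share the sign of fxx.
        return "positive definite" if fxx > 0 else "negative definite"
    if det < 0:
        return "indefinite"
    return "semi-definite"


if __name__ == "__main__":
    # A point on the y-axis, where a maximum might be suspected.
    print((0, 1), rosenbrock_gradient(0, 1), definiteness(*rosenbrock_hessian(0, 1)))
    # The known minimum of the standard Rosenbrock function.
    print((1, 1), rosenbrock_gradient(1, 1), definiteness(*rosenbrock_hessian(1, 1)))
```

At $ (0,1) $ the gradient is nonzero, so the point is not even a critical point, and the Hessian there is indefinite. At $ (1,1) $ the gradient vanishes and the Hessian is positive definite.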

Rosenbrock's function: gradient and Hessian demo

Gradient and Hessian: Himmelblau's function

And now comes the interesting part.

You will note that for Himmelblau's function, there is a patch in the center where the Hessian is negative definite. The local maximum lies in that region. This is analogous to the negative second-order derivative requirement for local maxima in the case of univariate functions.

At the four critical points where the Hessian is positive definite, the function has minima.
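The sketch below reproduces a coarse text version of the Hessian definiteness plot, assuming the standard form of Himmelblau's function, $ f(x,y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2 $. The grid range, the symbols, and the function names are hypothetical choices for this illustration.

```python
import math


def gradient(x, y):
    """Analytic gradient of (x^2 + y - 11)^2 + (x + y^2 - 7)^2."""
    dfdx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    dfdy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return dfdx, dfdy


def hessian(x, y):
    """Analytic Hessian entries (fxx, fxy, fyy)."""
    fxx = 12 * x**2 + 4 * y - 42
    fxy = 4 * x + 4 * y
    fyy = 12 * y**2 + 4 * x - 26
    return fxx, fxy, fyy


def definiteness(fxx, fxy, fyy):
    """Definiteness from the closed-form eigenvalues of a symmetric 2x2 matrix."""
    mean = (fxx + fyy) / 2
    radius = math.hypot((fxx - fyy) / 2, fxy)
    lo, hi = mean - radius, mean + radius
    if lo > 0:
        return "positive definite"
    if hi < 0:
        return "negative definite"
    if lo < 0 < hi:
        return "indefinite"
    return "semi-definite"


SYMBOLS = {"positive definite": "+", "negative definite": "-",
           "indefinite": "o", "semi-definite": "?"}

if __name__ == "__main__":
    # One character per integer grid point, with y decreasing down the page.
    for y in range(5, -6, -1):
        print("".join(SYMBOLS[definiteness(*hessian(x, y))] for x in range(-5, 6)))
    print("gradient at (3, 2):", gradient(3, 2))
```

The point $ (3,2) $ is one of the four minima, and its gradient is exactly zero. The origin lies inside the negative definite patch, but it is not itself the local maximum, because the gradient there is nonzero.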

Himmelblau's function: gradient and Hessian demo

A function with a saddle

And finally, here's an example of a function with a saddle point.

$$ f(x,y) = 3x^2y + y^3 - 3x^2 - 3y^2 + 2 $$

From the gradient plot, you can see 4 critical points.

Note the Hessian definiteness plot to identify their nature.

Local minimum at $ x = 0, y = 2 $: Hessian positive definite and gradient is zero.
Local maximum at $ x = 0, y = 0 $: Hessian negative definite and gradient is zero.
Saddle points at $ x = -1, y = 1 $ and $ x = 1, y = 1 $: Hessian indefinite, and gradient is zero.
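The four critical points listed above can be verified directly from the formula for $ f $. In this sketch the partial derivatives were worked out by hand from $ f(x,y) = 3x^2y + y^3 - 3x^2 - 3y^2 + 2 $; the function names are hypothetical.

```python
def gradient(x, y):
    """Analytic gradient of 3x^2 y + y^3 - 3x^2 - 3y^2 + 2."""
    dfdx = 6 * x * y - 6 * x
    dfdy = 3 * x**2 + 3 * y**2 - 6 * y
    return dfdx, dfdy


def hessian(x, y):
    """Analytic Hessian entries (fxx, fxy, fyy)."""
    fxx = 6 * y - 6
    fxy = 6 * x
    fyy = 6 * y - 6
    return fxx, fxy, fyy


def classify(x, y):
    """Classify (x, y) using the gradient and the determinant of the Hessian."""
    if gradient(x, y) != (0, 0):
        return "not a critical point"
    fxx, fxy, fyy = hessian(x, y)
    det = fxx * fyy - fxy**2
    if det > 0:
        return "local minimum" if fxx > 0 else "local maximum"
    if det < 0:
        return "saddle point"
    return "inconclusive"


if __name__ == "__main__":
    for point in [(0, 2), (0, 0), (-1, 1), (1, 1)]:
        print(point, gradient(*point), hessian(*point), classify(*point))
```

Setting the gradient to zero gives $ 6x(y-1) = 0 $ and $ 3x^2 + 3y^2 - 6y = 0 $. Taking $ x = 0 $ yields $ y = 0 $ or $ y = 2 $, and taking $ y = 1 $ yields $ x = \pm 1 $, which accounts for all four critical points.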

Notice how the function plot is visually deceiving in terms of where the maximum, minimum, and saddle point lie. But the gradient and Hessian never deceive.

Always remember: Gradient helps in identifying critical points. Hessian helps in distinguishing between minima, maxima, and saddle points.

Saddle function: gradient and Hessian demo

Where to next?

Now that you understand multivariate calculus, understand the strategies for computing derivatives in programmatic implementations. Or choose some other topic from mathematical foundations.

Already a calculus expert? Check out comprehensive courses on machine learning or deep learning.
