Foundations of Machine Learning

Our previous learning module dealt with Linear Algebra for Machine Learning.

Much of Machine Learning is built around the idea of loss functions and optimizing them. To understand loss functions, you first need to understand Calculus, especially multivariate calculus.

The goal of this article is to provide a comprehensive overview of concepts from multivariate calculus and to build an intuition for them.


Although anyone who has completed a basic course in Maths knows what a function is, it is a good idea to review this basic concept and associated terminology, just in case you need a refresher.

A function is a relationship that defines how one quantity depends on another. A function takes an input from a set and maps it to an output from another set. The input set is known as the domain and the output set is known as the codomain or target set of the function. There are many ways to denote functions in machine learning literature, or in Mathematics. Here are the most popular ones.

$$\begin{align} & f: \sX \rightarrow \sY \\ & y = f(x), x \in \sX, y \in \sY \\ & f: x \mapsto y, x \in \sX, y \in \sY \\ \end{align}$$

In these, \( \sX \) is the domain, \( \sY \) is the codomain, \( x \) is the input variable and \( y \) is the output variable.

Note that the codomain provides a set of possible values for the output of the function. The function may not actually realize all possible values in the codomain. For example, consider the square function \( y = x^2 \) defined over the set of natural numbers, \( x \in \natural \). In this case, the domain and the codomain are both \( \natural \). However, for the input sequence \( 1, 2, 3, 4, \ldots \), the output sequence is \( 1, 4, 9, 16, \ldots \). The natural numbers \( 2, 3, 5, 6, 7, \ldots \) are missing from the outputs of the function. In other words, not all natural numbers are realized by the square function. This restricted set of output values that are actually generated by the function is known as the image of the function. Remember this subtle difference between the image and the codomain of a function. Succinctly, \( \text{image of } f \subseteq \text{codomain of } f \). Note that some texts use range interchangeably to mean either the codomain or the image, rendering it ambiguous. To be precise, it is better to stick with codomain and image and avoid using range, which might be misinterpreted.
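
To make the distinction concrete, here is a small Python sketch (illustrative only; the names are our own) that enumerates the image of the square function over a finite slice of the naturals:

```python
# Image vs codomain: the square function over a slice of the naturals.
# The codomain is (conceptually) all naturals, but the image contains
# only the perfect squares actually produced by the function.
domain = range(1, 11)             # inputs 1..10
image = {x**2 for x in domain}    # outputs actually realized

print(sorted(image))              # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
print(2 in image)                 # False: 2 is in the codomain, not the image
```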

Example functions

Quadratic: \( x^2 \)
Hinge loss: \( \text{max}(0,-x) \)
Sine: \( \sin x \)
Hyperbolic tangent: \( \tanh x \)


The sigmoid function is a popular function in machine learning. It is defined as follows

$$ \begin{align} & \sigma(x) = \frac{1}{1 + \exp^{-x}} \\ \end{align}$$

Crazy people like us, the machine learners, often wonder: what is the maximum and minimum value of \( \sigma(x) \) for \( x \in \real \)?

We try various values of \( x \), starting with our favorite number, \( 0 \). We find that \( \sigma(0) = 0.5 \). For increasing inputs \( x = 1, 2, \ldots, 7, 8, 9, \ldots \), we get the outputs \(0.73105857, 0.8807970, \ldots, 0.9990889, 0.999664649, 0.99987660, \ldots \).

On decreasing the values of \( x \) from \( 0 \), we see the opposite trend. So, for \( x = -1, -2, \ldots, -6, -7, -8, \ldots \), we get the outputs \( 0.2689414213, 0.11920292, \ldots, 0.0024726231, 0.0009110511, 0.000335350, \ldots \).
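
The numbers above are easy to reproduce. A minimal Python sketch of the sigmoid (the helper name is our own choice):

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))     # 0.5
print(sigmoid(1))     # ~0.73105857
print(sigmoid(-1))    # ~0.26894142
print(sigmoid(9))     # ~0.99987660, creeping towards 1
print(sigmoid(-9))    # ~0.00012339, creeping towards 0
```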

So, we observe a few things. For positive increasing values of \( x \), the outputs are increasing. Moreover, they are getting increasingly closer to 1.0, but never quite reach it. For negative decreasing values of \( x \), the outputs are decreasing and approaching 0.0, but never quite reach it.

How do we denote these concepts mathematically? With the limits notation.

$$ \lim_{x \to \infty} \sigma(x) = 1 $$ $$ \lim_{x \to -\infty} \sigma(x) = 0 $$

We read this to mean that as the input \( x \) moves closer to \( \infty \), the output of the function \( \sigma(x) \) approaches 1. Whether it actually takes on the value 1 depends on what it means for \( x \) to take on the value \( \infty \).

Limits are an important mathematical concept regarding the behavior of a function near a point. So, \( \lim_{x \to p} f(x) = L \), means that the output of the function \( f(x) \) gets closer to \( L \) as the input \( x \) gets closer to \( p \).

Function limits

Natural logarithm: \( \lim_{x \to 0} \log x = -\infty \)
Exponential: \( \lim_{x \to -\infty} \exp^x = 0 \)
Sigmoid:  \( \lim_{x \to \infty} \frac{1}{1 + \exp^{-x}}= 1 \)
Logistic loss: \( \lim_{x \to \infty} \frac{1}{\log 2} \log(1 + \exp^{-x}) = 0\)


Let's take another popular function in machine learning. The Heaviside step function or the unit step function is defined as follows

$$ H(x) = \begin{cases} 1, \text{ if } x \ge 0 \\ 0, \text{ otherwise } \\ \end{cases} $$

So, we know that \( H(0) = 1 \). We can also write this in the limits notation introduced earlier as

$$ \lim_{x \to 0} H(x) = 1 $$

Or, can we? Are we coming closer to zero from the positive side or the negative side? If we are approaching from the positive side, then sure, we can claim that to be 1. But, if we are visiting zero from the negative side then a point very very very very close to zero will have a \(H(x) = 0 \). Mathematically, if \( 0^+ \) and \( 0^{-} \) denote the positive and negative side of zero respectively, then we can denote this as follows

$$ \lim_{x \to 0^+} H(x) = 1 $$ $$ \lim_{x \to 0^-} H(x) = 0 $$

Such limits, which depend on the direction of approach, are known as, well, one-sided limits.
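
We can probe the two one-sided limits numerically. A small sketch, with the Heaviside function written as a plain Python function:

```python
def H(x):
    # Heaviside step: 1 if x >= 0, else 0
    return 1.0 if x >= 0 else 0.0

# Approach 0 from the right (0+) and from the left (0-).
for h in (0.1, 1e-3, 1e-9):
    print(H(h), H(-h))   # always 1.0 from the right, 0.0 from the left
```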

In the case of the Heaviside function, we note that the two one-sided limits do not match. This means we cannot really write \( \lim_{x \to 0} H(x) \) without specifying the direction of approach. That means the limit \( \lim_{x \to 0 } H(x) \) does not exist!

It is then said that the function \( H(x) \) is discontinuous at the point 0. For any other point in the domain, the limit exists. Thus, one says that the function \( H(x) \) is continuous on the negative interval and on the positive interval, but not continuous at zero.

In general, for any function \( f(x) \), if \( \lim_{x \to p^+} f(x) \neq \lim_{x \to p^-} f(x) \), then it is said that the limit does not exist at \( p \).

A function \( f(x) \) is said to be continuous at a point \( p \), if the limit \( \lim_{x \to p} f(x) \) exists, and that \( \lim_{x \to p} f(x) = f(p) \).

A function \( f(x) \) is said to be continuous within an interval \( [a,b] \), if it is continuous at every point within the interval \( [a,b] \).

So, is sigmoid a continuous function on \( \real \)? Figure that one out.

Discontinuous functions

Points of discontinuity highlighted in red

Reciprocal \( 1/x \)
Heaviside step function
Tan: \(\tan x \)
\( f(x) = \frac{x^2}{x - 1} \)

Rate of change

Zoom in close enough and most continuous functions will look piecewise linear: made by joining line segments together. Zoom into a discontinuity, and it remains a discontinuity; a gap that just keeps getting wider.

So, for a continuous function, we should be able to predict the upcoming value if we know the orientation of that line segment. Suppose our line segment extends from \( a \) to \( b \) and we have two points \( x \in [a,b] \) and \( (x+h) \in [a,b] \). If \( m \) is the slope of the line segment, basic line geometry tells us that

$$\begin{align} & f(x + h) = f(x) + m(x+h - x) \\ & \implies m = \frac{f(x+h) - f(x)}{h} \\ \end{align}$$

Note that the key is zooming in enough that the function behaves like a line segment in the magnified viewport. In other words, \( h \) should be minuscule. According to the limits concept introduced earlier, \( h \to 0 \). So, we can compute the slope of a function, denoted as \( f'(x) \) or \( \frac{df(x)}{dx} \), or succinctly as \( \frac{df}{dx} \), at a point as

$$ f'(x) = \frac{df(x)}{dx} = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} $$

This slope of the function is known as the derivative. It is one of the most important mathematical concepts required to optimize machine learning models on data.
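
The limit definition suggests a direct numerical approximation: pick a small but finite \( h \) and evaluate the difference quotient. A sketch (the test function and the step size are our own choices):

```python
def numerical_derivative(f, x, h=1e-6):
    # Forward-difference approximation of the limit definition of f'(x).
    return (f(x + h) - f(x)) / h

# f(x) = x^2 has derivative f'(x) = 2x, so f'(3) should be close to 6.
print(numerical_derivative(lambda x: x**2, 3.0))   # ~6.000001
```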

We have shown two of the popular ways to denote derivatives. Here's a list of symbols that you might find in machine learning literature, named after their originators. They have been ordered according to their popularity in machine learning literature.

  • Leibniz's notation : \( \frac{df}{dx} \)
  • Lagrange's notation: \( f' \)
  • Newton's notation : \( \dot{f} \)
  • Euler's notation : \( D_x f(x) \)

We will restrict usage to the Leibniz and Lagrange notations, using the Lagrange notation as a succinct representation when \( x \) is clear from the context and the Leibniz notation otherwise.

The magnitude of \( f'(x) \) tells us how fast the function is changing. The sign of \( f'(x) \) tells us the direction of change. Positive \( f'(x) \) means the function \( f(x) \) increases at \( x \) if we slightly increase the value of \( x \). Negative \( f'(x) \) implies the inverse relationship; the function \( f(x) \) is decreasing at \( x \).

If \( f'(x) = 0 \), then there is no change in \( f(x) \) at \( x \). The function is effectively constant at \( x \). Such points of zero derivative are known as critical points or stationary points of the function.

We noted earlier that a limit does not always exist. So, if the limit in the above equation exists, then we can compute the derivative. The functions for which that limit exists are known as differentiable functions. Conversely, the remaining functions are known as non-differentiable.

We already made the case that discontinuous functions are not differentiable at their point(s) of discontinuity. But are all continuous functions differentiable? Find out next.

The geometry of derivatives

Derivatives provide us an easy way to intuitively understand function behavior, even without plotting them.

The upcoming sections will demonstrate the relationship of functions and their derivatives. Here's how to understand them.

  • Each demo has two plots, the function \( f(x) \) and its derivative \( f'(x) \)
  • The same point is highlighted on both the plots. You can move this point by dragging the orange circle on the \(X\)-axis. This will help you better inspect the derivative.
  • At the point of interest \( x \), we have drawn a line segment that passes through \( (x, f(x)) \), and has slope equal to \( f'(x) \). Notice that this line is always tangent to the function, if \( f'(x) \) exists.
  • The tangent line has an arrow at its head, attached in the direction of increasing \( x \). Thus, the arrow denotes the direction of change in the function \( f(x) \) as \( x \) increases. An arrow pointing downward means the function decreases as \( x \) increases. An upward arrow means the function increases.

Derivative example: \( x^2 \)

\( f(x) = x^2 \)
\( f'(x) = \frac{d}{dx} x^2 = 2x\)

Understanding derivative of \( x^2 \)

Let's start with the example of the simple quadratic function \( f(x) = x^2 \). The derivative is \( f'(x) = 2x \). Look at the accompanying charts to understand this relationship.

By inspection, we already knew that \( f(x) \) is always positive. But now we know one more thing. The slope of the function \( f(x) \) doubles with \( x \). This means, the function keeps becoming steeper as we move away from \( x = 0 \).

Moreover, it doesn't matter if we are moving towards the positive or negative side from \( x = 0 \), the rapid increase in the function is the same in both directions. This means, the function must be a symmetric function around \( x = 0 \).

One other thing to notice is that the minimum of the function is achieved when the derivative is \( 0 \). The function flattens out near its minimum, so its derivative must be zero at that point. But the function also flattens near a maximum. We will see in one of the later examples how we deal with that.

In machine learning, derivatives are mostly used in fitting models by optimizing a loss function. We will focus on this aspect of derivatives in the rest of the discussion.

Here's a thought exercise: Can you identify whether \( x^2 - 1 \) and \( (x-1)^2 \) are symmetric functions? Are they both symmetric around \( x = 0 \) or around \( x = 1 \)? Can you visualize their shapes without actually plotting them? Can you identify their minimum value and when they attain them?

Derivative example: Sigmoid

Sigmoid \( \sigma(x) \)
\( \sigma'(x) = \frac{d}{dx} \sigma(x) = \sigma(x) (1 - \sigma(x)) \)

Understanding derivative of the Sigmoid

The derivative of Sigmoid is quite interesting and often computed in optimization routines.

$$ \sigma'(x) = \sigma(x) (1 - \sigma(x)) $$

We already knew that \( \sigma(x) \ge 0, \forall x \in \real \). From limits, we also knew that \( \lim_{x \to \infty} \sigma(x) = 1 \) and \( \lim_{x \to -\infty} \sigma(x) = 0 \).

From the derivative, we now know that the function has the steepest growth around \( x = 0 \), where \( \sigma(0) = 0.5 \). It grows or drops very slowly as we move away from \( x = 0 \), on either side, because if \( \sigma(x) \) is high, then \( 1 - \sigma(x) \) will be low, reducing the overall slope and leading to slow growth. Thus, the function flattens out as \( x \to \infty \) and \( x \to -\infty \).
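
This analytic form is cheap to compute once \( \sigma(x) \) is known, which is one reason it is popular in optimization code. A sketch (our own helper names) comparing it against a central finite-difference estimate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Analytic derivative: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The analytic and numerical derivatives should agree closely,
# and the slope should peak at x = 0, where it equals 0.25.
for x in (-4.0, 0.0, 4.0):
    h = 1e-6
    fd = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(x, sigmoid_grad(x), fd)
```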

This is important to understand. The magnitude of the derivative tells how fast or slow a function is growing or decreasing. And as we saw in the previous example, the sign of the derivative indicates the direction of change of the function with respect to its input. Positive derivative implies function increasing with increasing \( x \) and negative derivative indicates a decreasing function.

We can also identify that the function is symmetric about \( x = 0 \), but the symmetric portion is flipped. (Think!)

In the case of the quadratic function \( f(x) = x^2 \), we observed that the zero derivative indicated the minimum of the function. In the case of the Sigmoid, zero is never actually attained, but you will note that \( \sigma'(x) \to 0\) as \( x \to \infty \) or \( x \to -\infty \). So, again we note that the derivative tends to zero near flat regions.

But just by looking at the derivative, we cannot tell whether the extremum reached is a minimum or a maximum. In both cases the derivative is zero or close to zero. We will address this challenge soon.

Thought experiment: The hyperbolic tangent \( \tanh(x) = \frac{\exp^{x} - \exp^{-x}}{ \exp^{x} + \exp^{-x}} \) is a related function, sometimes used as an alternative activation function to the Sigmoid. What can you say about its nature, without even plotting the function?

Derivative example: \( \log x \)

\( f(x) = \log x\)
\( f'(x) = \frac{1}{x} \)

Understanding derivative of \( \log x \)

The logarithm function \( f(x) = \log x \) appears very commonly in machine learning. It is a continuous function, only defined for \( x > 0 \). The derivative is also defined for \( x > 0 \). It is given by

$$ \frac{d}{dx} \log x = \frac{1}{x} $$

We have encountered the reciprocal function earlier in our discussion on limits. Since we are only dealing with the positive values of \( x \), we know that

$$ \lim_{x \to 0^+} \frac{1}{x} = \infty $$

We also know that

$$ \lim_{x \to \infty} \frac{1}{x} = 0 $$

This means, the slope of the function rises rapidly as we move closer to zero. It also means that the function flattens out as we move closer to \( \infty \).

Derivative example: \( x^3 \)

\( f(x) = x^3 \)
\( f'(x) = 3x^2 \)

Derivative of \( x^3 \)

For \( f(x) = x^3 \), the derivative is \( f'(x) = 3x^2 \). Note that at \( x = 0 \), \(f'(x) = 0 \). But wait!

As you can see from the chart, at \( x = 0 \), the function does not have a minimum or a maximum, local or global. Such points, where \( f'(x) = 0 \) but which are neither minima nor maxima, are known as inflection points.

Note that we cannot identify whether a critical point is a minimum, maximum, or an inflection just from \( f'(x) \).

Maxima, minima, and inflection points

We know now that \( f'(x) = 0 \) implies that the function has either a minimum, maximum, or an inflection point at \( x \).

If you have played with the interactive demos enough, you will notice the behavior of the derivative in the vicinity of such critical points.

Imagine three points, \( p, q, \text{ and } r\) in the domain of \( f(x) \). These are arranged such that \( f'(q) = 0 \) and \( p < q \), and \( r > q \).

Four cases are possible on the derivatives at \( p \) and \( r \).

  • \( f'(p) < 0, f'(r) > 0 \)
  • \( f'(p) > 0, f'(r) < 0 \)
  • \( f'(p) < 0, f'(r) < 0 \)
  • \( f'(p) > 0, f'(r) > 0 \)

We know that a negative derivative implies a decreasing function and a positive derivative implies an increasing function on \( x \).

If \( f'(p) < 0 \) and \( f'(r) > 0 \), then it means that the function was decreasing before it reached \( q \) and increased past \( q \). Thus, \( q \) must be a local minimum.

If \( f'(p) > 0 \) and \( f'(r) < 0 \), then it means that the function was increasing before it reached \( q \) and then started decreasing after \( q \). Thus, \( q \) must be a local maximum.

If \( f'(p) > 0 \) and \( f'(r) > 0 \), then the function was always increasing. The rate of change slowed down at \( q \) and then continued rising again. It just flattened momentarily at \( q \). So there cannot be a minimum or maximum at \( q \). Hence \( q \) must be an inflection point.

Similar reasoning can be applied for the case where \( f'(p) < 0 \) and \( f'(r) < 0 \).
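
The case analysis above can be turned into a small numerical classifier. This is only a sketch under the stated assumptions; the probe distance `delta` and the helper name are our own:

```python
def classify_critical_point(df, q, delta=1e-3):
    # df is the derivative function; q is a point where df(q) is ~0.
    # Probe the sign of the derivative just before and just after q.
    left, right = df(q - delta), df(q + delta)
    if left < 0 < right:
        return "local minimum"
    if left > 0 > right:
        return "local maximum"
    return "inflection point"   # derivative keeps its sign on both sides

# f(x) = x^2  -> f'(x) = 2x:    minimum at 0
# f(x) = x^3  -> f'(x) = 3x^2:  inflection at 0
print(classify_critical_point(lambda x: 2 * x, 0.0))      # local minimum
print(classify_critical_point(lambda x: 3 * x**2, 0.0))   # inflection point
```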

Derivative example: \( |x| \)

Points of non-differentiability highlighted in red

\( f(x) = |x|\)
\( f'(x) = \text{sgn}(x) \)

Understanding derivative of \( |x| \)

The absolute value function \( f(x) = |x| \) appears very commonly in machine learning. It is defined as follows

$$ f(x) = |x| = \begin{cases} x, \text{ if } x \ge 0 \\ -x, \text{ otherwise } \end{cases} $$

It is a continuous function. So, \( \lim_{x \to 0} f(x) \) exists. (Verify this. Do it. You will be happier.)

But for the derivative, the mere existence of \( \lim_{x \to 0} f(x) \) is not sufficient.

We want the limit \( \lim_{h \to 0} \frac{|x+h| - |x|}{h} \) to exist. Does that exist?

It exists everywhere, except at \( x = 0 \). (Verify this. And give yourself a pat on the back when you do.)

However, in such cases, we can still differentiate on the portions that are differentiable. These derivatives are

$$ f'(x) = \begin{cases} 1, \text{ if } x > 0 \\ -1, \text{ if } x < 0 \\ \text{Not differentiable if } x = 0 \end{cases}$$

Thus, the slope of the function stays constant on either side of \( x = 0 \). At \( x = 0 \), the function is not differentiable.

In optimization, such situations are dealt with by introducing the concept of the subgradient, also known as the subdifferential or subderivative.

A full discussion of subgradients is not in scope here, so we will make a brief comment. Intuitively, if the function \( f(x) \) is not differentiable at the point \( x_0 \), a subderivative is any line that passes through the point \( (x_0, f(x_0)) \) and either touches or stays entirely below (or, for concave functions, above) the graph of the function \( f(x) \).


So, there are continuous functions that are not differentiable everywhere. What should we call them? Let's try to understand these functions geometrically first.

The entire idea of the derivative relies on being able to zoom into a function to arrive at a line segment. What if no matter how much you magnify, you cannot arrive at a line segment?

We saw that this happens for a gap; a discontinuity. It will also happen when, no matter how much you zoom, the function looks as if two straight line segments, with different slopes, are meeting at a point. In such a scenario, the equation of the slope, presented earlier, does not hold anymore. If the slope cannot be calculated, then the derivative cannot be calculated either.

This is possible for a continuous function, as can be seen for our example of \( f(x) = |x| \). In that sense, the function \( f(x) = |x| \), is not smooth at \( x = 0 \).

Thus, a function is smooth at a point if it is continuous and differentiable at that point. A function is smooth within an interval if it is smooth at every point in that interval.

So, an easy way to remember these ideas is this.

  • if the limit of a function exists at a point and equals the function's value there, then it is continuous at that point
  • if the derivative of a function exists at a point, then it is smooth at that point

Thus, smoothness has a stronger requirement than continuity. If a function is smooth at a point, then it is definitely also continuous at that point. Continuity does not imply smoothness.

Continuous but non-smooth functions

Points of non-differentiability highlighted in red on the X-axis

ReLU: \( \text{max}(0,x) \)
Absolute: \( |x| \)
Cube root: \( f(x) = x^{1/3} \)
Square reciprocal: \(1/x^2\)

Rules for computing derivatives

Some easy rules apply when dealing with functions of functions. A brief discussion follows here.

  • The Sum rule helps in calculating the derivative of the sum of two functions, if they are differentiable.

$$ \frac{d (f(x) + g(x))}{dx} = \frac{df(x)}{dx} + \frac{dg(x)}{dx} $$

  • The Product rule helps in calculating the derivative of the product of two functions, if they are differentiable.

$$ \frac{d (f(x) g(x))}{dx} = g(x) \frac{df(x)}{dx} + f(x) \frac{dg(x)}{dx} $$

  • The Quotient rule is used to calculate the derivative of the ratio of two differentiable functions.

$$ \frac{d \left(\frac{f(x)}{g(x)} \right)}{dx} = \frac{\frac{df(x)}{dx} g(x) - f(x) \frac{dg(x)}{dx}}{g(x)^2} $$

  • The Chain rule is used to compute the derivative of a function of a function. If we use the symbol \( y = g(x) \), and if \( f'(y) \) and \( g'(x) \) exist, then we can write

$$ \frac{d (f(g(x)))}{dx} = \frac{df(y)}{dy} \frac{dy}{dx}$$

More generally, for a chain of composed functions \( f_1(f_2(\ldots f_n(x))) \), one can write

$$ \frac{d\, f_1(f_2(\ldots f_n(x)))}{dx} = \frac{df_1}{df_2} \frac{df_2}{df_3} \cdots \frac{df_n}{dx} $$

These rules are the building blocks for computing derivatives of complicated functions from the derivatives of well-known simpler functions.
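
As a sanity check, these rules can be verified numerically with a central finite difference. The test point and the pair of functions below are arbitrary choices of ours:

```python
import math

def fd(f, x, h=1e-6):
    # Central-difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3
f, df = math.sin, math.cos                   # f' = cos
g, dg = (lambda t: t**2), (lambda t: 2 * t)  # g' = 2t

# Product rule: (f g)' = g f' + f g'
print(abs(fd(lambda t: f(t) * g(t), x) - (g(x) * df(x) + f(x) * dg(x))) < 1e-6)

# Chain rule: (f o g)' = f'(g(x)) g'(x)
print(abs(fd(lambda t: f(g(t)), x) - df(g(x)) * dg(x)) < 1e-6)
```

Both checks print `True`: the analytic rules agree with the numerical slope up to the finite-difference error.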

Derivative of a derivative

The derivative \( f'(x) \) is itself a function. So we might be able to differentiate it further. A derivative of a derivative is called a second-order derivative. By that logic, the first derivative of the function is called the first-order derivative of the function. The second-order derivative is denoted as

$$ f''(x) = \frac{d^2 f(x)}{dx^2} = \frac{d^2 f}{dx^2} = \frac{d}{dx} \left(\frac{df}{dx}\right) $$

If the derivative of a function does not exist, then obviously, its second-order derivative does not make any sense. If the derivative of a function exists, then the second-order derivative is subject to the same constraints: it exists at a point if and only if the first-order derivative is smooth at that point.

But what does the second-order derivative actually mean?

We defined earlier that the derivative of a function at a point provides us the rate of change of that function at that point. So a second-order derivative at a point tells us the rate of change of the first-order derivative at that point.

  • Positive second order derivative \( \implies \) slope of function is increasing.
  • Negative second order derivative \( \implies \) slope of function is decreasing.
  • Zero second order derivative \( \implies \) slope of function is constant.

Note, that the above statements are about the slope of the function, not the function itself!

And we do not have to stop at the second-order derivative. We can continue taking derivatives of derivatives as long as they exist. Such higher-order derivatives are denoted as \( \frac{d^nf}{dx^n} \).
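
Numerically, a second-order derivative can be approximated by differencing the first-order differences, which collapses into the standard central second-difference formula. A sketch, with a step size of our own choosing:

```python
def second_derivative(f, x, h=1e-4):
    # Central second-difference approximation of f''(x).
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

# f(x) = x^3 has f''(x) = 6x, so f''(2) should be close to 12.
print(second_derivative(lambda x: x**3, 2.0))   # ~12.0
```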

The existence of an \( n \)-th order derivative implies that the \((n-1)\)-th order derivative is continuous. So, the smoothness of a function is measured in terms of number of derivatives it has which are continuous.

  • The class \( C^0 \) includes all continuous functions
  • The class \( C^1 \) includes all continuously differentiable (differentiable and the derivative is also continuous) functions

In general, the class \( C^n \) includes all functions whose derivative is in the class \( C^{n-1} \).

And a function which has derivatives of all orders, everywhere in its domain, belongs to the class \( C^\infty \). Such a function is known as a smooth function.

Second-order derivative of \( x^2 \)

\( f(x) = x^2 \)
\( f'(x) = 2x \)
\( f''(x) = 2 \)

Second-order derivative of \( x^2 \)

For the quadratic function \( f(x) = x^2 \), the derivatives are \( f'(x) = 2x \) and \( f''(x) = 2 \). The second order derivative is a constant. It means the slope of the slope of \( x^2 \) does not vary as a function of \( x \).

But most importantly, it is positive. And that is the most important information for us from an optimization perspective.

But first, let's imagine our function was \( g(x) = -x^2 \). Since \( g(x) \) is the negative of \( f(x) \), it must be non-positive for all inputs and achieve a maximum of zero at \( x = 0 \).

The derivatives would be \( g'(x) = -2x \) and \( g''(x) = -2 \).

In both cases, we know that the functions \( f(x) \) and \( g(x) \) would attain their minimum and maximum, respectively, at \( x = 0 \). But it is their second-order derivative that tells us which one it is.

If \( f'(a) = 0 \) and \( f''(a) > 0 \), then \( f(x) \) attains a local minimum at \( x = a \). This is easy to understand. \( f'(a) = 0 \) means that the function has flattened out there. \( f''(a) > 0 \) means that the function starts increasing on both sides of \( x = a \). Hence, a minimum.

If \( g'(a) = 0 \) and \( g''(a) < 0 \), then \( g(x) \) attains a local maximum at \( x = a \). Analogous to before, \( g'(a) = 0 \) means that the function has flattened out there. \( g''(a) < 0 \) means that the function falls off on both sides of \( x = a \). Hence, a maximum.
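
This is the classic second-derivative test. A minimal sketch of it in Python (the helper names are ours):

```python
def second_derivative_test(df, d2f, a):
    # Assumes a is a critical point, i.e. df(a) ~ 0.
    assert abs(df(a)) < 1e-9
    if d2f(a) > 0:
        return "local minimum"
    if d2f(a) < 0:
        return "local maximum"
    return "inconclusive"   # d2f(a) = 0: need more information

# f(x) = x^2:  f' = 2x,  f'' = 2   -> minimum at x = 0
# g(x) = -x^2: g' = -2x, g'' = -2  -> maximum at x = 0
print(second_derivative_test(lambda x: 2 * x, lambda x: 2.0, 0.0))    # local minimum
print(second_derivative_test(lambda x: -2 * x, lambda x: -2.0, 0.0))  # local maximum
```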

We will elucidate these concepts on the next set of charts.

Second-order derivative of \( x^4 - 8x^2 \)

\( g(x) = x^4 - 8x^2 \)
\( g'(x) = 4x^3 - 16x \)
\( g''(x) = 12x^2 - 16 \)

Second-order derivative of \( x^4 - 8x^2 \)

We have concocted this function to highlight our observation about the nature of the second order derivative.

For the function \( g(x) = x^4 - 8x^2 \), the derivatives are \( g'(x) = 4x^3 - 16x \) and \( g''(x) = 12x^2 - 16 \).

Notice that the derivative \( g'(x) \) is zero at 3 locations in the chart, namely \( x \in \{-2, 0, 2 \} \). The function attains either a local minimum or a local maximum at each of these points. But note that the maximum at \( x = 0 \) is only local, not global. So, remember that the derivative being zero does not imply a global maximum or minimum. Only a local one.

Now note the second-order derivative. At the locations \( x \in \{-2, 2\} \), the second-order derivative is positive. At the point \( x = 0 \), \( g''(0) < 0 \).

This means that the local extremum achieved by the function at \( x \in \{-2,2\} \) is a local minimum. The critical point at \( x = 0 \) is a local maximum.

First-order derivatives help you identify local extrema. Second-order derivatives help you in distinguishing between maxima and minima from among those extrema.

One more interesting thing to note here. Try dragging the orange circle between the local minimum at \( x = -2 \) to the local maximum at \( x = 0 \). Notice that the tangent to the function goes from being under the function to over the function. Such a point of change can be detected by change in the sign of the second derivative.

Such points, when the second-order derivative changes sign from positive to negative, or vice-versa, are known as inflection points. So

  • \( f'(x) = 0 \implies \) critical point
  • \( f''(x) = 0 \), with a sign change across it \( \implies \) inflection point

Second-order derivative example: \( \sin x \)

\( h(x) = \sin x \)
\( h'(x) = \cos x \)
\( h''(x) = -\sin x \)

Second-order derivative of \( \sin x \)

Many machine learners naively assume that if they have a function to optimize, then it must have a maximum or a minimum. That may not be the case as this next example shows.

\( h(x) = \sin x \) is a cyclic function. Its derivatives are \( h'(x) = \cos x \) and \( h''(x) = -\sin x \), both cyclic too. They both cyclically vary in the range \( [-1,1] \), implying many local minima and many local maxima.

So, even a second order derivative does not tell you whether you have reached a global extremum. Only local minima or maxima can be detected and should be treated as such.

Second-order derivative example: \( x \sin x \)

\( k(x) = x \sin x \)
\( k'(x) = \sin x + x\cos x \)
\( k''(x) = 2\cos x - x\sin x\)

Second-order derivative of \( x \sin x \)

The function \( k(x) = x \sin x \) oscillates, just like the sine function we saw before. But note that the cycles keep getting amplified as we move farther from \( x = 0 \).

The derivatives \( k'(x) = \sin x + x \cos x \) and \( k''(x) = 2 \cos x - x \sin x \) are also cyclic like \( k(x) \).

There are plenty of local minima and maxima when \( k'(x) = 0 \). You can also distinguish between them by checking if \( k''(x) > 0 \) or \( k''(x) < 0 \).

But it all does not mean anything from an optimization perspective.

Just because you arrived at a zero first-order derivative and just because you have an appropriate sign on the second-order derivative does not mean much.

You may be doing great in identifying local extrema, but that does not imply anything about the global stage. Be humble.

Area under a curve

Switching tracks a bit here. Instead of breaking functions, as we did with derivatives, let's aggregate over them.

The area of a rectangle is simple. If \( w \) is the width and \( h \) is the height, then the area of a rectangle is \( wh \). How do we compute the area of complicated shapes?

Again, using the area of a rectangle as our building block! Here's how. Suppose you wish to find the area between the x-axis and the function \( f(x) \) within the boundary \(x \in [a,b] \).

  • Starting at \( x_0 = a \), you compute the function \( f(x_0) \).
  • Assume that the function is constant within \( w \) distance from \( x = a\).
  • This way, you have created a rectangle of width \( w \) and height \( f(x_0) \), with area \( f(x_0)w \).
  • Take a tiny step from \( x=a \) towards \( x=b \), with \( x_1 = x_0 + w \) and create a new rectangle of area \( f(x_1)w \).
  • Create all such rectangles that can be fit within the function between \( x = a \) to \( x = b \).
  • Sum up their area and you get your crude estimation of the area under the curve.
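
The steps above amount to a left-endpoint Riemann sum. A sketch, using \( f(x) = \exp^x \) on \( [0, 1] \), where the exact area is \( e - 1 \):

```python
import math

def riemann_area(f, a, b, w):
    # Left-endpoint Riemann sum with rectangles of width w.
    total, x = 0.0, a
    while x < b:
        width = min(w, b - x)   # the last rectangle may be narrower
        total += f(x) * width
        x += width
    return total

exact = math.e - 1.0
for w in (0.1, 0.01, 0.001):
    approx = riemann_area(math.exp, 0.0, 1.0, w)
    print(w, approx, exact - approx)   # the error shrinks with w
```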

Try out the next demo to understand what we mean. You can vary the size of \( w \) using the slider. You can also vary the bounds \( a \) and \( b \) by dragging the circles. Check out how the rectangles are constructed. Also, observe the approximation error as you change \( w \).

Area approximation demo

Approximate area under the curve for the function \( f(x) = \exp^x \). Drag the slider to change the width \( w \). Drag the green and orange circles to modify the lower and upper limits of the region of interest.

True area: \( A \)


Estimated area: \( \hat{A} \)


Approximation error: \( A - \hat{A} \)



In the previous demo, you must have noticed that the area approximation improves as you lower the value of \( w \), the width of the rectangles. Extrapolating, you can imagine that if \( w \) were minuscule (not possible to display in an interactive demo), then our area calculation would be even better. Basically, we want \( w \to 0 \).

This strategy of estimating complex quantities as a sum over simpler elements is one way of looking at the mathematical concept of integration. The area under the curve for the function \( f(x) \) between the bounds \( [a,b] \) is represented as a sum over products \( f(x)dx \) as

$$ \int_a^b f(x)dx $$

This quantity is known as a definite integral, because of the known bounds \([a,b] \). Without those, \( \int f(x)dx \) is known as an indefinite integral.

In this next demo, try to understand that the area under the curve is really the space enclosed between the function \( y = f(x) \) and the line \( y = 0 \), within the bounds \( x = a \) and \( x = b \) on the x-axis. So, if the function is below the x-axis, then it is actually area over the curve.

Integrals as area approximation: Example 2

Area under and over the curve for function \( f(x) = \sin x \). Drag the slider to change the width of the approximating rectangles. Drag the orange and green circles to change the bounds on the x-axis. The demo displays the true area \( A \), the estimated area \( \hat{A} \), and the approximation error \( A - \hat{A} \).


Relationship between integrals and derivatives

Suppose, \( A(x) \) denotes the area under the function \( f(x) \), between 0 and \( x \). Then, based on the approximation strategy we discussed earlier, it should be the case that

$$ A(x+w) - A(x) = f(x)w + \epsilon $$

This means that if we extend the region by a width \( w \) beyond \( x \), we are essentially adding one more rectangle of area \( f(x)w \), along with some approximation error \( \epsilon \).

Re-arranging the terms, we get

$$ f(x) = \frac{A(x+w) - A(x)}{w} - \frac{\epsilon}{w} $$

As we saw in the previous demo, the approximation error \( \epsilon \to 0 \) as \( w \to 0 \). In fact, \( \epsilon \) shrinks faster than \( w \) itself, so \( \epsilon/w \) also vanishes, and taking the limit we get

$$ f(x) = \lim_{w \to 0} \frac{A(x+w) - A(x)}{w} $$

Now you will observe that this is exactly the definition of the derivative of a differentiable function, \( A(x) \). Thus, what we have shown is that

$$ f(x) = \frac{d A(x)}{dx} = A'(x) $$

And, if that is the case, then this must also be true

$$ A(x) = \int A'(x)dx $$

These statements, relating the integral and the derivative, constitute a very important result in mathematics known as the Fundamental Theorem of Calculus. Thus, to compute integrals, you just need to know the formulae for derivatives.

And it is for this reason that the indefinite integral is also known as the anti-derivative.
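The theorem is easy to sanity-check numerically: build the area function \( A(x) \) by summing rectangles, then differentiate it with a finite difference and compare against \( f \). A rough sketch (our own helper, using \( f(x) = e^x \)):

```python
import math

def area(f, x, w=1e-4):
    """Crude rectangle-sum estimate of the area under f between 0 and x."""
    total, t = 0.0, 0.0
    while t < x:
        total += f(t) * min(w, x - t)
        t += w
    return total

# The Fundamental Theorem of Calculus promises A'(x) = f(x).
# Check it at x = 1 for f = exp, where f(1) = e.
h = 1e-3
derivative_of_area = (area(math.exp, 1.0 + h) - area(math.exp, 1.0)) / h
```

The finite-difference derivative of the accumulated area lands close to \( e \approx 2.718 \), as the theorem predicts.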

Function approximation

Speaking of approximations, here's what we already know, written in a slightly different way.

$$ f(x) = f(a) + \int_a^x f'(t)dt $$

This can be refined by further decomposing the integral.

$$ f(x) = f(a) + \int_a^x \left(f'(a) + \int_a^t f''(p)dp \right) dt $$

This can be re-written as

$$ f(x) = f(a) + f'(a) \int_a^x dt + \int_a^x \int_a^t f''(p)dp dt $$

And \( \int_a^x dt = (x - a) \). So, we have

$$ f(x) = f(a) + (x-a) f'(a) + \int_a^x \int_a^t f''(p)dp dt $$

Continue substituting \( f^{(n)}(p) \) with expansions involving higher-order derivatives, and you end up with the following relationship for a \(k\)-times differentiable function (exact up to a remainder term that shrinks as \( x \to a \)).

$$ f(x) = f(a) + (x-a) f'(a) + (x-a)^2 \frac{f''(a)}{2!} + (x-a)^3 \frac{f'''(a)}{3!} + \ldots + (x-a)^k\frac{f^{(k)}(a)}{k!} $$

This relationship is a famous result in calculus known as Taylor's Theorem.

Taylor's theorem is a handy way to approximate a function at a point \( x \), if we can readily estimate its value and those of its derivatives at some other point \( a \) in its domain. It appears in quite a few derivations in optimization and machine learning.
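As a quick illustration, take \( f = \exp \) and \( a = 0 \): every derivative of \( e^x \) is \( e^x \), so \( f^{(n)}(0) = 1 \) and the Taylor polynomial reduces to \( \sum_{n=0}^{k} x^n/n! \). A sketch:

```python
import math

def taylor_exp(x, k):
    """k-th order Taylor approximation of e^x around a = 0,
    using the fact that every derivative of exp at 0 equals 1."""
    return sum(x**n / math.factorial(n) for n in range(k + 1))

# A 10th-order polynomial already matches e^1 to about 8 digits.
approx = taylor_exp(1.0, 10)
```

Each extra term tightens the approximation, which is exactly why Taylor expansions are so useful for local approximations in optimization.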

Multivariate functions

All the functions and examples we have worked with so far took a single value as input. Such functions, which take a scalar (a single value) as input, are called univariate functions. For example, a function mapping a real number to another real number, \( f: \real \rightarrow \real \), is a univariate function.

A function that generates an output based on multiple input values is known as a multivariate function. For example, \( f: \real^n \rightarrow \real \) takes \( n \) real inputs and generates a real output.

The multivariate bowl

We already saw the univariate bowl \( f(x) = x^2 \). Time to understand its multivariate analogue in two dimensions, the multivariate bowl \( f(x,y) = x^2 + y^2 \).

The multivariate bowl function is similar to many loss functions in machine learning that involve quadratic terms. Understanding its topology is crucial for building intuition about any optimization performed on such functions.

In the next interactive demo, we have plotted the multivariate bowl in three panels.

The larger panel shows the contour plot of the function. This means, the color band is indicative of the value of \( f(x,y) \). The variable \( x \) is plotted on the horizontal X-axis. The variable \( y \) is plotted on the vertical Y-axis.

The remaining two panels show a slice of the function with the value along one axis held fixed. This is like looking at the function from the side. For example, \( f(x,y) \) with \( y \) held constant means the viewer is standing below the chart, looking along the Y-axis at a slice on which \( y \) is constant.

The slice locations along both axes can be changed by moving the interactive orange circle on the contour plot. The slices can be imagined as vertical cuts made into the function along the two dotted lines; one along each axis. The location of the orange circle is also highlighted on the slice plots as a red dot.

For this particular chart, note that the function achieves a minimum at \( x = 0, y = 0 \). The function increases in value along all other directions. By looking at the bands on the contour plot, note that the increase in the function is slower near the center, and gets quite steep as you move away from \( x = 0, y = 0 \). This can also be verified from the slice plots.

Finally, note that no matter where you move the slices, the nature of the function remains the same: a univariate bowl.

Multivariate bowl demo

A multivariate bowl function \( f(x,y) = x^2 + y^2 \). The two slice panels show \( f(x,y) \) with \( y \) held fixed and with \( x \) held fixed.

Rosenbrock's function

Of course, multivariate functions are seldom symmetric. Let's look at a popular asymmetric function, the Rosenbrock function \( f(x,y) = (a-x)^2 + b(y - x^2)^2 \). In our implementation, we have assigned \( a = 1 \) and \( b = 100 \), a very common setting for this function. The Rosenbrock function is a standard test case for studying the performance of optimization methods.

Check out the next interactive demo to build intuition about the Rosenbrock function.

Note the parabolic valley, the white band of low values in the center. The global minimum, which happens at \( x = a, y = a^2 \), lies inside this valley.

It can be seen from the contour plot that the rate of change of the function is very different along the two dimensions. Moving towards a lower area in one dimension does not imply you are actually making progress towards the global minimum, because the rate of change along the other dimension is so different.
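As a quick sanity check (with hand-derived partial derivatives; the function names are our own), the function is zero and the gradient vanishes at the global minimum \( (a, a^2) = (1, 1) \):

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock function with the common a=1, b=100 setting."""
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    # Partial derivatives worked out by hand:
    #   df/dx = -2(a - x) - 4 b x (y - x^2)
    #   df/dy =  2 b (y - x^2)
    dx = -2 * (a - x) - 4 * b * x * (y - x**2)
    dy = 2 * b * (y - x**2)
    return dx, dy

gx, gy = rosenbrock_grad(1.0, 1.0)   # both partials vanish at (1, 1)
```

Away from the valley, `rosenbrock_grad` returns components of very different magnitudes, which is exactly the asymmetry visible in the contour plot.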

Rosenbrock's function demo

Rosenbrock function \( f(x,y) = (1-x)^2 + 100(y - x^2)^2 \). The two slice panels show \( f(x,y) \) with \( y \) held fixed and with \( x \) held fixed.

Himmelblau's function

So far, our functions had only a single minimum. Such functions, albeit easy to understand, are rare in machine learning. Let's understand Himmelblau's function \( f(x,y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2 \).

From the contour plot, note that the function has four local minima and one local maximum. The local minima, where \( f(x,y) = 0 \), occur at the following points (coordinates truncated in precision):

  • \( x = 3.0, y = 2.0 \)
  • \( x = -2.80, y = 3.13 \)
  • \( x = -3.78, y = -3.28 \)
  • \( x = 3.58, y = -1.85 \)

The local maximum, \( f(x,y) = 181.617 \), occurs at \( x = -0.271 \) and \( y = -0.923 \). This is somewhere in the center of the 4 local minima. Note that the function keeps increasing along the edges of the graph, so this is only a local maximum.
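These claims are easy to verify by plugging the coordinates back into the function. Because the listed coordinates are truncated, the values come out merely close to zero, except at \( (3, 2) \), where the minimum is exact. A sketch:

```python
def himmelblau(x, y):
    """Himmelblau's function: four local minima and one local maximum."""
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

# The four (truncated) minima listed above.
minima = [(3.0, 2.0), (-2.80, 3.13), (-3.78, -3.28), (3.58, -1.85)]
values = [himmelblau(x, y) for x, y in minima]
```

Evaluating the truncated coordinates gives values on the order of \(10^{-3}\) or smaller, consistent with the four minima at \( f(x,y) = 0 \).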

Try out the next interactive demo to understand the topology of Himmelblau's function. Notice the small bump, the local maximum, on both the slice plots. Also notice the small puddles on both slice plots. Notice how the two puddles flatten out for \( x = 3 \) on the slice \( f(3,y) \), as that slice ignores the local maximum.

Himmelblau's function demo

Himmelblau's function: \( f(x,y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2 \). The two slice panels show \( f(x,y) \) with \( y \) held fixed and with \( x \) held fixed.

A twisted function

In univariate functions, there are only three kinds of critical points: minima, maxima, and inflection points. In multivariate functions, there is an interplay between different dimensions, resulting in another kind: a saddle point.

In the next function \( f(x,y) = 3x^2y + y^3 - 3x^2 - 3y^2 + 2 \), check out what happens at the following points.

  • \( x = 0, y = 0 \)
  • \( x = 0, y = 2 \)
  • \( x = -1, y = 1 \)
  • \( x = 1, y = 1 \)

The first two are local maxima and minima respectively. The last two are saddle points.

Move the orange circle to the saddle points to observe the function behavior around those points.

The slice \( f(x,1) \) is flat; a constant. In fact, \(f(x,1) = 0, \forall x \in \real \).

The slice \( f(-1,y) \) is interesting however. It is similar to the cube function \( f(x) = x^3 \) that we introduced earlier for inflection points. Indeed, there is an inflection point on the slice \(f(-1,y) \) at \( y = 1 \).

Similarly, there is another inflection point on the slice \( f(1,y) \), at \( y = 1 \).

Just like in the case of univariate functions, distinguishing between the various critical points (maxima, minima, inflection, and saddle points) relies on the use of second-order derivatives.

We will present this shortly, after an introduction to multivariate calculus.

Saddle demo

A function with a saddle \( f(x,y) = 3x^2y + y^3 - 3x^2 - 3y^2 + 2 \). The two slice panels show \( f(x,y) \) with \( y \) held fixed and with \( x \) held fixed.

Multivariate calculus

The concepts of limits, derivatives, continuity, and smoothness are all applicable to multivariate functions, and they mean exactly the same thing. Only the definitions have to be adapted.

So, the limit of a multivariate function \( f(x_1,\ldots,x_n) \) at a multivariate point \((a_1,\ldots,a_n)\) is defined as

$$ \lim_{\substack{x_1 \to a_1 \\ \vdots \\ x_n \to a_n}} f(x_1,\ldots,x_n) $$

Limit of a multivariate function may not exist, or the multivariate function may be discontinuous, at a point \( \va = [a_1, \ldots, a_n] \), for the same reasons that we discussed in the univariate case.

If a function is continuous at a point, then the derivative of the multivariate function \(f'(\vx) \) means the same thing, the slope of the function. But there is an interesting twist to the story here.

Being a multivariate function, the rate of change needs to be qualified with the dimension along which we wish to measure it. For a function \( f(x_1,\ldots,x_n) \), are we measuring the change along \( x_1 \), or \( x_2 \), or \( x_n \)? This is crucial because changing each variable may have a different effect on the function.

To measure the rate of change for the \( k \)-th variable \( x_k \), here's what we should do

  • hold all the remaining variables constant
  • compute the derivative of the resulting univariate function in \( x_k \) only with respect to \( x_k \)

So, in effect, we are only computing a part of the derivative of that function. The part that is only affected by \( x_k \). Naturally, such a derivative is known as partial derivative of \( f(x_1,\ldots,x_n) \) with respect to \( x_k \). It is denoted as \( \frac{\partial f}{\partial x_k} \). It is computed exactly the same way we would compute a derivative of a univariate function.

$$ \frac{\partial f(\vx)}{\partial x_k} = \lim_{h \to 0} \frac{f(x_1,\ldots,x_k + h,\ldots, x_n) - f(x_1,\ldots,x_k, \ldots, x_n)}{h} $$

If there are \( n \) input variables to a function \( f(x_1,\ldots,x_n) \), then one can try to compute \( n \) partial derivatives of the function \(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \); one with respect to each input variable. This collection of partial derivatives, a generalization of the derivative to the multivariate case, is known as the gradient of the function.
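The recipe above, perturb one coordinate while holding the rest constant, translates directly into a numerical sketch (the helper below is our own illustration, using simple forward differences with a small step \( h \)):

```python
def gradient(f, point, h=1e-6):
    """Numerically estimate the gradient of f at a point: for each
    coordinate, bump it by h while holding the others constant."""
    grads = []
    for k in range(len(point)):
        bumped = list(point)
        bumped[k] += h
        grads.append((f(bumped) - f(point)) / h)
    return grads

# Gradient of the bowl f(x, y) = x^2 + y^2 at (1, 2) is (2x, 2y) = (2, 4).
g = gradient(lambda p: p[0]**2 + p[1]**2, [1.0, 2.0])
```

Each entry of the returned list is one partial derivative, so the list as a whole is the (approximate) gradient.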

And if one can calculate \( n \) partial derivatives of a multivariate function, then one can also calculate partial derivatives of those partial derivatives. So, the first-order partial derivative \( \frac{\partial f}{\partial x_k} \) will lead to \( n \) second-order partial derivatives

$$ \frac{\partial^2 f}{\partial x_j \partial x_k} = \frac{\partial}{\partial x_j} \left( \frac{\partial f}{\partial x_k} \right), \quad \forall j \in \{1,\ldots,n\} $$

Note that there are \( n \) first-order partial derivatives. Thus, there will be \( n^2 \) second-order partial derivatives.

Are second-order partial derivatives interchangeable in order? Can we take the derivative first with respect to \( x_j \) and then with respect to \( x_k \)? Schwarz's theorem (also known as Clairaut's theorem or Young's theorem) states that if a function has continuous second-order partial derivatives at a point \( (a_1,\ldots,a_n) \), then the partial derivatives are interchangeable there. So, only under the conditions of Schwarz's theorem

$$ \frac{\partial^2 f}{\partial x_k \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_k} $$

A concise notation for multivariate calculus

Explicitly writing each variable can become quite cumbersome.

An astute reader will notice that we can use the vector notation introduced earlier to arrive at a concise notation for multivariate functions, limits, and partial derivatives.

A multivariate function on \( n \) input variables \( f(x_1,\ldots,x_n) \) can be written in a vector form by encapsulating the input variables into a vector \( \vx = [x_1,\ldots,x_n] \). So, \( f(x_1,\ldots,x_n) = f(\vx) \)

The limit of a multivariate function \( f(\vx) \) can then be defined as

$$ \lim_{\vx \to \va} f(\vx) $$

Note that \( \vx \in \real^n \) is a vector, so \( \va \in \real^n \) is also a vector of the same dimensionality, \( \va = [a_1,\ldots,a_n] \).

The gradient, being a collection of partial derivatives, can be denoted as a vector with one partial derivative per input variable.

$$ \nabla f(\vx) = \begin{bmatrix} \frac{\partial f(\vx)}{\partial x_1}, \ldots, \frac{\partial f(\vx)}{\partial x_n} \end{bmatrix} $$

The second order partial derivatives, all the \( n^2 \) of them, can be denoted by a matrix \( \mathbf{H} \)

$$ \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \ldots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \ldots & \frac{\partial^2 f}{\partial x_n^2} \\ \end{bmatrix} $$

This matrix of second-order partial derivatives of a function is called the Hessian of the function. The \(i,j\)-th element of the Hessian is defined as

$$ \mathbf{H}_{i,j} = \frac{\partial^2 f}{\partial x_j \partial x_i} $$

Under the conditions of Schwarz's theorem described earlier, \( \mathbf{H}_{i,j} = \mathbf{H}_{j,i} \). This means that if a function has continuous second-order partial derivatives, then its Hessian is symmetric. But there are functions for which the Hessian is not symmetric. So beware!

Detecting minima, maxima, and saddle points

Taylor's theorem introduced earlier is also applicable to multivariate functions. For a bivariate function, the second-order approximation is

$$ f(x+a, y+b) = f(a,b) + x\dox{f}\bigg\rvert_{a,b} + y\doy{f}\bigg\rvert_{a,b} + \frac{1}{2}\left( x^2 \doxx{f}\bigg\rvert_{a,b} + xy\doxy{f}\bigg\rvert_{a,b} + xy \doyx{f}\bigg\rvert_{a,b} + y^2 \doyy{f}\bigg\rvert_{a,b} \right) $$

Here, \( g\bigg\rvert_{a,b} \) means that the derivative is evaluated at the point \( (a,b) \).

Suppose, the function has a minimum at \( (a,b) \).

This means the gradient is zero. So, \(\dox{f}\bigg\rvert_{a,b} = 0 \) and \( \doy{f}\bigg\rvert_{a,b} = 0 \).

Because \( f(x+a,y+b) > f(a,b) \) for all sufficiently small nonzero \( (x,y) \) when \( (a,b) \) is a minimum, the Taylor series expansion above also implies

$$ x^2 \doxx{f}\bigg\rvert_{a,b} + xy\doxy{f}\bigg\rvert_{a,b} + xy \doyx{f}\bigg\rvert_{a,b} + y^2 \doyy{f}\bigg\rvert_{a,b} > 0 $$

Conversely, if the function had a maximum at \( (a,b) \), then, $$ x^2 \doxx{f}\bigg\rvert_{a,b} + xy\doxy{f}\bigg\rvert_{a,b} + xy \doyx{f}\bigg\rvert_{a,b} + y^2 \doyy{f}\bigg\rvert_{a,b} < 0 $$

And finally, if the above two conditions are not met, then the function has neither a minimum nor a maximum at \( (a,b) \). So, \((a,b) \) must be a saddle point.

In the matrix notation, we can also write the expression above as

$$ \begin{bmatrix}x & y\end{bmatrix} \begin{bmatrix}\doxx{f}\bigg\rvert_{a,b} & \doxy{f}\bigg\rvert_{a,b} \\ \doyx{f}\bigg\rvert_{a,b} & \doyy{f}\bigg\rvert_{a,b} \end{bmatrix} \begin{bmatrix}x \\ y \end{bmatrix} $$

where

$$ \vx = \begin{bmatrix} x \\ y \end{bmatrix} $$

and the matrix of second-order partial derivatives evaluated at \( (a,b) \) is the Hessian

$$ \mathbf{H}\bigg\rvert_{a,b} =\begin{bmatrix}\doxx{f}\bigg\rvert_{a,b} & \doxy{f}\bigg\rvert_{a,b} \\ \doyx{f}\bigg\rvert_{a,b} & \doyy{f}\bigg\rvert_{a,b} \end{bmatrix}$$

So, what we are effectively saying is that, if \( \dox{f}\bigg\rvert_{a,b} = 0 \) and \( \doy{f}\bigg\rvert_{a,b} = 0 \), then the function has a

  • minimum at \( (a,b) \), if \( \vx^T \mathbf{H}\bigg\rvert_{a,b} \vx > 0 \)
  • maximum at \( (a,b) \), if \( \vx^T \mathbf{H}\bigg\rvert_{a,b} \vx < 0 \)
  • saddle point at \((a,b)\), otherwise.

But wait! Notice that the first two relationships are definitions of positive and negative definite matrices respectively.

So, if \( \nabla\bigg\rvert_{a,b} = 0 \), then the function has a

  • minimum at \( (a,b) \), if its Hessian \( \mathbf{H}\bigg\rvert_{a,b} \) at that point is positive definite.
  • maximum at \( (a,b) \), if its Hessian \( \mathbf{H}\bigg\rvert_{a,b} \) at that point is negative definite.
  • saddle point at \((a,b)\), if its Hessian \( \mathbf{H}\bigg\rvert_{a,b} \) at that point is indefinite.

This is very important. To distinguish among maxima, minima, and saddle points, investigate the definiteness of the Hessian.
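For a bivariate function, this check reduces to the classic second-derivative test on the \( 2 \times 2 \) Hessian: a positive determinant with a positive top-left entry means positive definite, a positive determinant with a negative top-left entry means negative definite, and a negative determinant means indefinite. A minimal sketch (the function name is our own):

```python
def classify_2x2(h, tol=1e-8):
    """Second-derivative test: classify a critical point of a bivariate
    function from its symmetric 2x2 Hessian h (a list of two rows)."""
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    if det > tol and h[0][0] > 0:
        return "minimum"       # Hessian positive definite
    if det > tol and h[0][0] < 0:
        return "maximum"       # Hessian negative definite
    if det < -tol:
        return "saddle"        # Hessian indefinite
    return "inconclusive"      # semi-definite: the test is silent
```

For example, the Hessian \([[2,0],[0,2]]\) of the bowl classifies as a minimum, while \([[0,6],[6,0]]\) has a negative determinant and classifies as a saddle.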

Gradient and Hessian of Multivariate Bowl

In the accompanying demo, we have plotted this gradient as follows. At each point on the 2-dimensional plane,

  • The arrows show the direction of the gradient at that point.
  • The color intensity shows the magnitude, the \(L_2\)-norm, of the gradient at that point.

Vector direction and norm were introduced earlier in our article on linear algebra for machine learning.

We have also superimposed a red arrow indicating the gradient at the highlighted point. The arrow represents the vector \( [\dox{f} \doy{f}] \), with origin at the point \( (x,y) \). Notice that it always points in the direction in which the function increases, and it is longer where the function changes steeply.

For the symmetric multivariate bowl function, the gradient plot conveys the following information.

  • The function increases in all directions from the center.
  • The rate of growth of the function is symmetric around the center.
  • The function grows more rapidly as we move away from the center.

We have shown the Hessian information in a separate panel as follows: At every point on the 2-dimensional plane,

  • The \(+\) or \(-\) indicates if the Hessian is positive semi-definite (PSD) or negative semi-definite (NSD).
  • The Hessian plot has 3 colors based on definiteness of the Hessian at that point: Blue (negative definite), Green (positive definite), and Red (indefinite).

For the multivariate bowl, the Hessian is positive semi-definite, in fact positive definite, everywhere in its domain. (Do you remember how to verify this?)

It has exactly one point where the gradient is zero, \( [0 \text{ } 0] \). And because the Hessian is positive definite, that point must be a minimum.

This is much easier than looking at all the charts and slices in various directions.

Gradient and Hessian: Multivariate bowl

A multivariate bowl function \( f(x,y) = x^2 + y^2 \)
Gradient direction and magnitude
Hessian definiteness

Gradient and Hessian: Rosenbrock's function

Appearances can be deceiving. Note that in the case of Rosenbrock's function, one might incorrectly assume that there is a maximum somewhere along the \(y\)-axis.

But studying the definiteness of the Hessian suggests otherwise. The Hessian is indefinite in that region. The minimum occurs in the region where the Hessian is positive definite.

Gradient and Hessian: Rosenbrock's function

Gradient and Hessian of Rosenbrock function \( f(x,y) = (1-x)^2 + 100(y - x^2)^2 \)
Gradient direction and magnitude
Hessian definiteness


Gradient and Hessian: Himmelblau's function

And now comes the interesting part.

You will note that for Himmelblau's function, there is a patch in the center where the Hessian is negative definite. That is the region of the local maximum. This is analogous to the negative second-order derivative requirement for local maxima in the case of univariate functions.

At the 4 local minima, the Hessian is positive definite, confirming that they are indeed minima.

Gradient and Hessian: Himmelblau's function

Gradient and Hessian of Himmelblau's function: \( f(x,y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2 \)
Gradient direction and magnitude
Hessian definiteness

A function with a saddle

And finally, here's an example of a function with a saddle point.

$$ f(x,y) = 3x^2y + y^3 - 3x^2 - 3y^2 + 2 $$

From the gradient plot, you can see 4 critical points.

Note the Hessian definiteness plot to identify their nature.

  • Local minimum at \( x = 0, y = 2 \): Hessian positive definite and gradient is zero.
  • Local maximum at \( x = 0, y = 0 \): Hessian negative definite and gradient is zero.
  • Saddle points at \( x = -1, y = 1 \) and \( x = 1, y = 1 \): Hessian indefinite and gradient is zero.

Notice how the function plot is visually deceiving in terms of where the maximum, minimum, and saddle point lie. But the gradient and Hessian never deceive.

Always remember: Gradient helps in identifying critical points. Hessian helps in distinguishing between minima, maxima, and saddle points.
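A quick way to confirm the classification above is to work out the Hessian of this function by hand (\( \doxx{f} = 6y - 6 \), \( \doyy{f} = 6y - 6 \), \( \doxy{f} = 6x \)) and test its definiteness at each critical point. A sketch with our own helper names:

```python
def hessian(x, y):
    """Analytic Hessian of f(x, y) = 3x^2 y + y^3 - 3x^2 - 3y^2 + 2,
    computed by hand: f_xx = 6y - 6, f_yy = 6y - 6, f_xy = 6x."""
    return [[6 * y - 6, 6 * x],
            [6 * x, 6 * y - 6]]

def definiteness(h):
    """Determinant test for a symmetric 2x2 matrix."""
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    if det > 0:
        return "positive definite" if h[0][0] > 0 else "negative definite"
    return "indefinite" if det < 0 else "semi-definite"
```

At \( (0,2) \) the Hessian is \([[6,0],[0,6]]\) (positive definite, a minimum); at \( (0,0) \) it is \([[-6,0],[0,-6]]\) (negative definite, a maximum); at \( (\pm 1, 1) \) it is \([[0,\pm 6],[\pm 6,0]]\), which is indefinite, so those points are saddles.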

Gradient and Hessian: Saddle example

A function with a saddle \( f(x,y) = 3x^2y + y^3 - 3x^2 - 3y^2 + 2 \)
Gradient direction and magnitude
Hessian definiteness

Computing derivatives

Given the importance of derivatives in machine learning, it is crucial that you know how to compute them. There are multiple strategies.

Symbolic differentiation utilizes tables of well-known simple derivatives to solve composite ones. This is akin to what we are taught in high school and is crucial to understand if you are developing proofs and theorems. Although there is software for symbolic differentiation, the approach leads to complicated expressions and inefficient code.

Numerical differentiation is an alternative that computes derivatives from their definition. So, to compute a derivative of \( f(x) \), numerical differentiation approximates the following limit with a small, finite \( h \)

$$ f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} $$

Numerical differentiation can be implemented quickly, but suffers from floating-point precision errors: making \( h \) very small leads to cancellation and round-off errors. These problems worsen in multivariate settings.
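The trade-off is easy to demonstrate: as \( h \) shrinks, truncation error falls, but past a point the subtraction of nearly equal floats destroys precision. A sketch (the values of \( h \) are chosen for illustration):

```python
import math

def forward_diff(f, x, h):
    """One-sided numerical derivative, straight from the definition."""
    return (f(x + h) - f(x)) / h

# True derivative of exp at x = 1 is e itself.
# A moderate h suffers truncation error; a tiny h suffers
# catastrophic cancellation between nearly equal floats.
errors = {h: abs(forward_diff(math.exp, 1.0, h) - math.e)
          for h in (1e-1, 1e-8, 1e-15)}
```

The mid-sized step lands near the sweet spot; the large step is dominated by truncation error, and the tiny step by round-off.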

Moreover, higher order derivatives are particularly challenging with symbolic and numerical differentiation.

Automatic differentiation is the weapon of choice of machine learning platforms such as TensorFlow and PyTorch. It works by deconstructing an expression into its computational graph, consisting of basic operations and elementary functions with well-known derivatives, and then applying the chain rule.

Suppose we wish to compute the derivative of the sigmoid function \( \sigma(wx) = \frac{1}{1 + e^{-wx}} \) with respect to \( w \).

First we will deconstruct the sigmoid function by variable substitution as follows:

$$\begin{aligned} \sigma(wx) & = \frac{1}{1 + e^{-wx}} \\ & = \frac{1}{1 + e^{-y_1}}, \quad y_1 = wx, \quad \dovar{w}{y_1} = x \\ & = \frac{1}{1 + e^{y_2}}, \quad y_2 = -y_1, \quad \dovar{y_1}{y_2} = -1 \\ & = \frac{1}{1 + y_3}, \quad y_3 = e^{y_2}, \quad \dovar{y_2}{y_3} = e^{y_2} \\ & = \frac{1}{y_4}, \quad y_4 = 1 + y_3, \quad \dovar{y_3}{y_4} = 1 \\ & = y_5, \quad y_5 = \frac{1}{y_4}, \quad \dovar{y_4}{y_5} = -\frac{1}{y_4^2} \end{aligned}$$

By chain rule,

$$ \dovar{w}{\sigma} = \dovar{y_5}{\sigma} \dovar{y_4}{y_5} \dovar{y_3}{y_4} \dovar{y_2}{y_3} \dovar{y_1}{y_2} \dovar{w}{y_1} $$

Note that we are not interested in the algebraic expression of the derivative. We are interested in its numerical value. So, we can work with the above expression from right to left, substituting computed values, not symbols, at each derivative of interest.

$$ \dovar{w}{\sigma} = (-\frac{1}{y_4^2}) \cdot 1 \cdot \exp^{y_2} \cdot -1 \cdot x $$

Here, we have used \( \cdot \) to show the constituents of the chain rule calculations. It should not be confused with dot product.

If we were computing this for a specific value of \( w \), we could complete the computation in a single pass from inputs to outputs. Such an approach is known as forward accumulation. Traversing the chain in the opposite direction, from outputs to inputs, is known as reverse accumulation. Reverse accumulation is preferred for multivariate derivatives where the number of outputs is smaller than the number of inputs, as it avoids duplicate calculations.
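To make forward accumulation concrete, here is a toy dual-number sketch (the `Dual` class and helpers are our own illustration, not the API of any real framework): each value carries its derivative with respect to \( w \), and the arithmetic operators apply the chain rule step by step.

```python
import math

class Dual:
    """A dual number (value, derivative) for forward-mode autodiff.
    Arithmetic operations propagate derivatives via the chain rule."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__
    def __rtruediv__(self, other):  # constant / Dual, d(c/u) = -c u'/u^2
        return Dual(other / self.val, -other * self.dot / self.val**2)

def exp(d):
    # d(e^u) = e^u * u'
    return Dual(math.exp(d.val), math.exp(d.val) * d.dot)

def sigmoid(w, x):
    # Seed dw/dw = 1 so the result carries d(sigma)/dw.
    w = Dual(w, 1.0)
    return 1.0 / (1.0 + exp(-1.0 * w * x))

out = sigmoid(0.5, 2.0)
# Analytic check: d sigma(wx)/dw = sigma(wx) * (1 - sigma(wx)) * x
s = 1.0 / (1.0 + math.exp(-0.5 * 2.0))
```

A single forward pass through `sigmoid` yields both the value (`out.val`) and the derivative with respect to \( w \) (`out.dot`), matching the analytic formula \( \sigma(wx)(1-\sigma(wx))x \).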

Further reading and resources

Here are some additional resources on calculus to supplement the material presented here.

  • Paul's math notes from the Lamar University are an invaluable and comprehensive resource for Calculus in general, not just Machine Learning.
  • Khan Academy has a free course on Differential Calculus
  • Popular machine learning frameworks provide APIs for computing derivatives. Automatic differentiation is available as an API from PyTorch and TensorFlow. A symbolic differentiation API is provided by Theano. For numpy-based derivatives, try out HIPS/autograd.

Next article: Unconstrained optimization

With a sound understanding of Linear Algebra and Calculus we delve into unconstrained optimization for machine learning.

Supplemented with numerous interactive demos, the article is self-contained and builds intuition about everything you need to know about unconstrained optimization, from a bit of theory to practical insights.

  • What do function, its image, domain, codomain, and target set mean?
  • What are limits and one-sided limits?
  • When does a limit exist?
  • Do you understand continuity?
  • Do you understand differentiable functions and smoothness?
  • Do you understand derivatives and second-order derivatives?
  • What is integration?
  • Do you know the Taylor's theorem and its role in approximation?
  • Do you understand partial derivatives?
  • Do you understand critical points, maxima, minima, inflection points, and saddle points?
  • Do you understand the role of Hessian and its definiteness?
  • What are the methods of computing derivatives automatically, and how do they differ?