If the rate of change of a function is the derivative, then how do we characterize the rate of change of the derivative itself? In this article, we explore this concept — the higher-order derivatives.
The derivative \( f'(x) \) is itself a function, so we may be able to differentiate it further. A derivative of a derivative is called a second-order derivative. By the same logic, the ordinary derivative of the function is called its first-order derivative. The second-order derivative is denoted as
$$ f''(x) = \frac{d^2 f(x)}{dx^2} = \frac{d^2 f}{dx^2} = \frac{d}{dx} \left(\frac{df}{dx}\right) $$
If the derivative of a function does not exist at a point, then its second-order derivative does not make sense there either. If the derivative does exist, the same constraints apply one level up: the second-order derivative exists at a point if and only if the first-order derivative is differentiable at that point.
And we do not have to stop at the second-order derivative. We can continue taking derivatives of derivatives as long as they exist. Such higher-order derivatives are denoted as \( \frac{d^nf}{dx^n} \).
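As a small worked example (the cubic is our own illustrative choice, not a function from this article), repeatedly differentiating \( f(x) = x^3 \) gives

$$ f'(x) = 3x^2, \qquad f''(x) = 6x, \qquad \frac{d^3 f}{dx^3} = 6, \qquad \frac{d^4 f}{dx^4} = 0 $$

after which every further derivative is zero.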
But what does the second-order derivative actually mean?
We defined earlier that the derivative of a function at a point provides us the rate of change of that function at that point. So a second-order derivative at a point tells us the rate of change of the first-order derivative at that point.
Note that the above statement is about the slope of the function, not the function itself!
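A second-order derivative can also be estimated numerically, which makes the "rate of change of the slope" reading concrete. The sketch below is a minimal illustration using central finite differences. The helper names `first_derivative` and `second_derivative`, the step sizes, and the cubic test function are our own assumptions for illustration, not something defined in this article.

```python
def first_derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def second_derivative(f, x, h=1e-3):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


def cube(x):
    # Illustrative test function: f(x) = x^3, so f'(x) = 3x^2 and f''(x) = 6x.
    return x ** 3


slope_at_1 = first_derivative(cube, 1.0)       # close to 3 * 1^2 = 3
curvature_at_1 = second_derivative(cube, 1.0)  # close to 6 * 1 = 6
print(slope_at_1, curvature_at_1)
```

The step size is a trade-off: too large and the estimate is biased, too small and floating-point rounding dominates, so the defaults here should be treated as starting points only.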
Let us try to understand second-order derivatives in the context of some functions, starting with \( f(x) = x^2 \).
For the quadratic function \( f(x) = x^2 \), the derivatives are \( f'(x) = 2x \) and \( f''(x) = 2 \). The second-order derivative is a constant, which means the slope of the slope of \( x^2 \) does not vary with \( x \).
More importantly, it is positive, and that sign is the key piece of information for us from an optimization perspective.
But first, let's imagine our function was \( g(x) = -x^2 \). Since \( g(x) \) is the negative of \( f(x) \), it is never positive and achieves its maximum of zero at \( x = 0 \).
The derivatives would be \( g'(x) = -2x \) and \( g''(x) = -2 \).
In both cases, we know that the functions \( f(x) \) and \( g(x) \) attain their minimum and maximum, respectively, at \( x = 0 \). But it is their second-order derivatives that tell us which one it is.
If \( f'(a) = 0 \) and \( f''(a) > 0 \), then \( f(x) \) attains a local minimum at \( x = a \). This is easy to understand. \( f'(a) = 0 \) means that the function has flattened out there. \( f''(a) > 0 \) means that the slope is increasing through zero, so the function rises on both sides of \( x = a \). Hence a minimum.
If \( g'(a) = 0 \) and \( g''(a) < 0 \), then \( g(x) \) attains a local maximum at \( x = a \). Analogously, \( g'(a) = 0 \) means that the function has flattened out there. \( g''(a) < 0 \) means that the slope is decreasing through zero, so the function falls off on both sides of \( x = a \). Hence a maximum.
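The two rules above can be sketched as a small helper. This is a minimal illustration, not a robust routine: the function name `classify_critical_point` and the tolerance `tol` are hypothetical choices of ours, and the "inconclusive" branch covers the case where the second-order derivative is zero, which the rules above do not address.

```python
def classify_critical_point(d1, d2, a, tol=1e-9):
    """Apply the second-derivative test at x = a.

    d1 and d2 are callables returning f'(x) and f''(x).
    """
    if abs(d1(a)) > tol:
        return "not a critical point"
    curvature = d2(a)
    if curvature > tol:
        return "local minimum"
    if curvature < -tol:
        return "local maximum"
    # Second derivative is (numerically) zero: the test cannot decide.
    return "inconclusive"


# f(x) = x^2 and g(x) = -x^2 from the text, with their derivatives written by hand.
print(classify_critical_point(lambda x: 2 * x, lambda x: 2.0, 0.0))
print(classify_critical_point(lambda x: -2 * x, lambda x: -2.0, 0.0))
```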
We will elucidate these concepts on the next set of charts.
Let us try to understand second-order derivatives in the context of another function, \( g(x) = x^4 - 8x^2 \).
We have concocted this function to highlight our observation about the nature of the second-order derivative.
For the function \( g(x) = x^4 - 8x^2 \), the derivatives are \( g'(x) = 4x^3 - 16x \) and \( g''(x) = 12x^2 - 16 \).
Notice that the derivative \( g'(x) \) is zero at three locations, namely \( x \in \{-2, 0, 2 \} \). The function attains either a local minimum or a local maximum at each of these points. But note that the maximum at \( x = 0 \) is only local, not global: the function grows without bound as \( |x| \) increases. So, remember that a zero derivative does not by itself imply a global maximum or minimum. Only a local one.
Now note the second-order derivative. At the locations \( x \in \{-2, 2\} \), the second-order derivative is positive. At the point \( x = 0 \), it is negative: \( g''(0) < 0 \).
This means that the local extrema at \( x \in \{-2,2\} \) are local minima, while the critical point at \( x = 0 \) is a local maximum.
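Plugging the three critical points into \( g''(x) = 12x^2 - 16 \) makes the signs explicit:

$$ g''(\pm 2) = 12 \cdot 4 - 16 = 32 > 0, \qquad g''(0) = 12 \cdot 0 - 16 = -16 < 0 $$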
First-order derivatives help you identify local extrema. Second-order derivatives help you in distinguishing between maxima and minima from among those extrema.
One more interesting thing to note here. Follow the curve from the local minimum at \( x = -2 \) to the local maximum at \( x = 0 \). Notice that the tangent to the function goes from lying under the function to lying over it. Such a point of change can be detected by a change in the sign of the second derivative.
Such points, where the second-order derivative changes sign from positive to negative, or vice versa, are known as inflection points.
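For the quartic example, setting \( 12x^2 - 16 = 0 \) gives inflection points at \( x = \pm 2/\sqrt{3} \approx \pm 1.155 \). The sketch below recovers the same points numerically by scanning for sign changes of the second-order derivative and refining each one with bisection. The helper names, the scan interval \( [-3, 3] \), and the grid resolution are illustrative assumptions of ours.

```python
import math


def g2(x):
    # Second-order derivative of x^4 - 8x^2.
    return 12 * x ** 2 - 16


def bisect_root(fn, lo, hi, iters=60):
    """Refine a root of fn in [lo, hi]; assumes fn changes sign on the interval."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (fn(lo) < 0) == (fn(mid) < 0):
            lo = mid  # sign change lies in the upper half
        else:
            hi = mid  # sign change lies in the lower half
    return (lo + hi) / 2


def sign_change_points(fn, start, stop, steps=600):
    """Scan a uniform grid and return the points where fn changes sign."""
    width = (stop - start) / steps
    roots = []
    for i in range(steps):
        lo = start + i * width
        hi = lo + width
        if (fn(lo) < 0) != (fn(hi) < 0):
            roots.append(bisect_root(fn, lo, hi))
    return roots


inflections = sign_change_points(g2, -3.0, 3.0)
print(inflections, 2 / math.sqrt(3))
```

A grid scan like this can miss sign changes that happen twice within a single grid cell, so the resolution has to be chosen with the function in mind.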
Let us try to understand second-order derivatives in the context of yet another function, a much more wavy one: \( h(x) = \sin x \).
Many machine learners naively assume that if they have a function to optimize, then it must have a maximum or a minimum. That may not be the case as this next example shows.
\( h(x) = \sin x \) is a periodic function. Its derivatives are \( h'(x) = \cos x \) and \( h''(x) = -\sin x \), both periodic too. They both oscillate within the range \( [-1,1] \), and since \( h'(x) \) crosses zero again and again, the function has many local minima and many local maxima.
So, even a second order derivative does not tell you whether you have reached a global extremum. Only local minima or maxima can be detected and should be treated as such.
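Concretely, the critical points of the sine function and the sign of the second-order derivative at each of them work out as follows, for any integer \( n \):

$$ h'(x) = \cos x = 0 \iff x = \frac{\pi}{2} + n\pi, \qquad h''\left(\frac{\pi}{2} + n\pi\right) = -\sin\left(\frac{\pi}{2} + n\pi\right) = (-1)^{n+1} $$

So every even \( n \) gives a local maximum and every odd \( n \) gives a local minimum, repeating forever.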
Now let's try a function that is not only wavy, but whose waves also change in amplitude as a function of the input: \( k(x) = x \sin x \).
The function \( k(x) = x \sin x \) oscillates, just like the sine function we saw before. But note that the oscillations keep getting amplified as we move farther from \( x = 0 \).
The derivatives \( k'(x) = \sin x + x \cos x \) and \( k''(x) = 2 \cos x - x \sin x \) also oscillate like \( k(x) \).
There are plenty of local minima and maxima where \( k'(x) = 0 \). You can also distinguish between them by checking if \( k''(x) > 0 \) or \( k''(x) < 0 \).
But none of this means much from a global optimization perspective.
Arriving at a zero first-order derivative with an appropriate sign on the second-order derivative does not mean you have found the best point overall.
You may be doing great in identifying local extrema, but that does not imply anything about the global stage. Be humble.
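To see this on \( k(x) = x \sin x \), the sketch below locates the critical points on an interval, classifies each with the sign of \( k''(x) \), and records the function value there. It is a rough illustration only: the interval \( [0.5, 20] \), the grid resolution, and all helper names are our own assumptions.

```python
import math


def k(x):
    return x * math.sin(x)


def k1(x):
    # First-order derivative of x * sin(x).
    return math.sin(x) + x * math.cos(x)


def k2(x):
    # Second-order derivative of x * sin(x).
    return 2 * math.cos(x) - x * math.sin(x)


def bisect_root(fn, lo, hi, iters=60):
    """Refine a root of fn in [lo, hi]; assumes fn changes sign on the interval."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (fn(lo) < 0) == (fn(mid) < 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def local_extrema(start, stop, steps=2000):
    """Return (x, kind, k(x)) for each sign change of k' found on a uniform grid."""
    width = (stop - start) / steps
    found = []
    for i in range(steps):
        lo = start + i * width
        hi = lo + width
        if (k1(lo) < 0) != (k1(hi) < 0):
            x = bisect_root(k1, lo, hi)
            kind = "min" if k2(x) > 0 else "max"
            found.append((x, kind, k(x)))
    return found


extrema = local_extrema(0.5, 20.0)
for x, kind, value in extrema:
    print(f"x = {x:8.4f}  local {kind}  k(x) = {value:9.4f}")
```

Each successive extremum is more extreme than the last, so whichever one a search stops at, widening the interval reveals a better one.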
In a previous article, we explored the idea of smoothness.
The existence of an \( n \)-th order derivative implies that the \((n-1)\)-th order derivative is continuous. So, the smoothness of a function is measured in terms of the number of continuous derivatives it has.
In general, the class \( C^n \) includes all functions whose derivative is in the class \( C^{n-1} \), with \( C^0 \) being the class of continuous functions.
And a function that has derivatives of all orders everywhere in its domain belongs to the class \( C^\infty \). Such a function is known as a smooth function.
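A standard example of a function that can be differentiated once but not twice (our own addition, not taken from the earlier article) is \( f(x) = x|x| \):

$$ f(x) = x|x|, \qquad f'(x) = 2|x|, \qquad f''(x) = \begin{cases} 2 & x > 0 \\ -2 & x < 0 \end{cases} $$

The first-order derivative is continuous everywhere, but \( f''(0) \) does not exist, so this function belongs to \( C^1 \) but not to \( C^2 \).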
Now that you are an expert in derivatives, explore the counterpart to derivatives — integrals.