Given the importance of derivatives in machine learning and optimization, it is crucial that you know how to compute them.
In this article, we cover multiple strategies.
To understand the techniques for automatically computing derivatives, we recommend familiarity with the concepts in the linked articles above.
Follow those links first to get acquainted with the corresponding concepts.
Symbolic differentiation uses tables of well-known derivatives of elementary functions, together with rules such as the product and chain rules, to derive closed-form derivatives of composite expressions.
This is akin to what we are taught in high school and is crucial to understand if you are developing proofs and theorems. Although software for symbolic differentiation exists, the approach tends to produce large, unwieldy expressions (so-called expression swell) and inefficient code.
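As a concrete illustration, here is a minimal sketch of symbolic differentiation, assuming SymPy as the tool (the article itself does not prescribe one) and an arbitrary composite function of our choosing:

```python
# A minimal sketch of symbolic differentiation, assuming SymPy as the tool
# (the article itself does not prescribe one).
import sympy as sp

x = sp.symbols('x')
f = sp.exp(sp.sin(x**2))             # an arbitrary composite expression

df = sp.diff(f, x)                   # closed-form derivative via known rules
print(df)                            # e.g. 2*x*exp(sin(x**2))*cos(x**2)

# Nesting the expression a few times hints at "expression swell": the symbolic
# derivative grows quickly, one reason the approach yields complicated
# expressions and inefficient code.
g = f
for _ in range(3):
    g = sp.exp(sp.sin(g**2))
print(sp.count_ops(sp.diff(g, x)))   # operation count of the nested derivative
```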
Numerical differentiation is an alternative that computes derivatives directly from their definition. The derivative of \( f(x) \) is defined by the limit
$$ f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} $$
Numerical differentiation can be implemented quickly, but it suffers from floating-point precision issues: the limit cannot be evaluated exactly on a computer, so \( h \) must be small but finite, and making it too small leads to cancellation and round-off errors. These problems worsen in multivariate settings.
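To make the trade-off concrete, the following sketch (our own illustration, using \( \tanh \) as an arbitrary test function) approximates a derivative with a forward difference and shows the error first shrinking and then growing as \( h \) decreases:

```python
import numpy as np

def numerical_derivative(f, x, h):
    # Forward-difference approximation of f'(x); the true limit h -> 0
    # cannot be taken on a computer, so h is small but finite.
    return (f(x + h) - f(x)) / h

f = np.tanh                    # arbitrary smooth function; f'(x) = 1 - tanh(x)**2
x0 = 0.5
exact = 1 - np.tanh(x0) ** 2

for h in (1e-2, 1e-6, 1e-10, 1e-14):
    approx = numerical_derivative(f, x0, h)
    print(f"h={h:.0e}  error={abs(approx - exact):.2e}")
# The error first shrinks as h decreases (truncation error) and then grows
# again (round-off and cancellation), illustrating the precision trade-off.
```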
Moreover, higher-order derivatives are particularly challenging for both symbolic and numerical differentiation.
Automatic differentiation is the weapon of choice of machine learning frameworks such as TensorFlow and PyTorch. It works by decomposing an expression into a computational graph of basic operations and elementary functions with well-known derivatives, and then applying the chain rule.
Suppose we wish to compute the derivative of the sigmoid function \( \sigma(wx) = \frac{1}{1 + e^{-wx}} \) w.r.t. \( w \).
First we will deconstruct the sigmoid function by variable substitution as follows:
$$\begin{aligned} \sigma(wx) & = \frac{1}{1 + e^{-wx}} \\ & = \frac{1}{1 + e^{-y_1}}, \quad y_1 = wx, \quad \doh{w}{y_1} = x \\ & = \frac{1}{1 + e^{y_2}}, \quad y_2 = -y_1, \quad \doh{y_1}{y_2} = -1 \\ & = \frac{1}{1 + y_3}, \quad y_3 = e^{y_2}, \quad \doh{y_2}{y_3} = e^{y_2} \\ & = \frac{1}{y_4}, \quad y_4 = 1 + y_3, \quad \doh{y_3}{y_4} = 1 \\ & = y_5, \quad y_5 = \frac{1}{y_4}, \quad \doh{y_4}{y_5} = -\frac{1}{y_4^2} \end{aligned}$$
By chain rule,
$$ \doh{w}{\sigma} = \doh{y_5}{\sigma} \doh{y_4}{y_5} \doh{y_3}{y_4} \doh{y_2}{y_3} \doh{y_1}{y_2} \doh{w}{y_1} $$
Note that we are not interested in the algebraic expression of the derivative; we are interested in its numerical value. So we can work through the above expression from right to left, substituting the numerical values computed at the point of interest rather than symbolic expressions.
$$ \doh{w}{\sigma} = 1 \cdot \left(-\frac{1}{y_4^2}\right) \cdot 1 \cdot e^{y_2} \cdot (-1) \cdot x $$
Here, we have used \( \cdot \) to show the constituents of the chain rule calculation; it should not be confused with a dot product.
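To make this concrete, here is a small sketch (our own illustration, with arbitrary values \( w = 0.5 \) and \( x = 2 \)) that evaluates the intermediate variables of the graph and then accumulates the chain-rule product from right to left:

```python
from math import exp

# Arbitrary values chosen for illustration.
w, x = 0.5, 2.0

# Forward pass: evaluate the intermediate variables of the computational graph.
y1 = w * x          # y1 = w*x
y2 = -y1            # y2 = -y1
y3 = exp(y2)        # y3 = e^{y2}
y4 = 1 + y3         # y4 = 1 + y3
y5 = 1 / y4         # y5 = sigma(w*x)

# Chain rule, accumulated right to left: each step multiplies in one local
# derivative, starting from d(y1)/dw = x.
d = x               # d(y1)/dw
d *= -1             # * d(y2)/dy1
d *= exp(y2)        # * d(y3)/dy2
d *= 1              # * d(y4)/dy3
d *= -1 / y4 ** 2   # * d(y5)/dy4

# Check against the closed-form derivative x * sigma * (1 - sigma).
print(d, x * y5 * (1 - y5))
```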
If we were computing this for a specific value of \( w \), we could easily have completed the whole calculation in one pass, evaluating each intermediate value and its local derivative as we go. Such an approach is known as forward accumulation. Traversing the chain in the opposite direction, from the output back toward the inputs, is known as reverse accumulation. The reverse approach is preferred over forward accumulation for multivariate derivatives when the function has fewer outputs than inputs, as reverse accumulation avoids duplicate calculations.
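For comparison, the same derivative can be obtained with reverse-mode automatic differentiation in PyTorch, one of the frameworks mentioned above; the values of \( w \) and \( x \) below are arbitrary and match the manual sketch:

```python
import torch

x = torch.tensor(2.0)
w = torch.tensor(0.5, requires_grad=True)

sigma = torch.sigmoid(w * x)   # forward pass builds the computational graph
sigma.backward()               # reverse accumulation through the graph

print(w.grad)                  # d(sigma)/dw, same value as the manual sketch
```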
Here are some additional resources on calculus to supplement the material presented here.
Check out our other comprehensive articles on topics in calculus and multivariate calculus.
Already a calculus expert? Check out comprehensive courses on machine learning or deep learning.
Help us create more engaging and effective content and keep it free of paywalls and advertisements!
Please share your comments, questions, encouragement, and feedback.