The mathematical motivation
Let's consider a simple illustrative example: a deep neural network with no nonlinear activations and multiple layers, each with a single unit.
From the input \( x \) to the output \( \yhat \), the composite transformation is straightforward: \( \yhat = x w_1 w_2 \ldots w_l \), where \( w_i \) denotes the weight of the \( i \)-th layer, with \( i=1 \) being the layer closest to the input.
In terms of activations, the activation of the \( i \)-th layer is \( h_i = w_i h_{i-1} \), with \( h_0 = x \).
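To make this concrete, here is a minimal sketch of the forward pass through such a single-unit linear chain, in plain Python with NumPy (the function and variable names are my own, chosen purely for illustration).

```python
import numpy as np

def forward(x, w):
    """Forward pass through a chain of single-unit linear layers.

    x : scalar input
    w : array of per-layer weights [w_1, ..., w_l]
    Returns the activations h_1, ..., h_l; the last one is y_hat.
    """
    h = x
    activations = []
    for w_i in w:
        h = w_i * h              # h_i = w_i * h_{i-1}, with h_0 = x
        activations.append(h)
    return activations

w = np.array([0.9, 1.1, 0.8, 1.2])   # arbitrary weights for a 4-layer chain
x = 2.0
print(forward(x, w)[-1])             # equals x * np.prod(w)
```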
With gradient-based optimization, each parameter will get updated by its gradient \( g_i \) as \( w_i \leftarrow w_i - \eta g_i \), where \( \eta \) is the learning rate of the gradient-based approach.
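Since the derivation below treats \( \vg \) as the gradient of the output itself with respect to the weights, it helps to write out what each \( g_i \) is for this chain:
$$ g_i = \frac{\partial \yhat}{\partial w_i} = x \prod_{j \neq i} w_j $$
so every \( g_i \) is itself a product of the input and all the other weights.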
As a result, the new output with the updated model will be \( \dash{\yhat} = x (w_1-\eta g_1) (w_2-\eta g_2) \ldots (w_l - \eta g_l) \). Is this desirable? Let's find out.
Treating the neural network as a function of its parameters and input, we can write the output as \( \yhat = f_x(\vw) \), where \( \vw = [w_1,\ldots,w_l] \). By changing the parameters, we change the output for the same input.
If \( \dash{\vw} \) denotes the updated values of the weights, the first-order Taylor series approximation about \( \vw \) tells us that
$$ f_x(\dash{\vw}) \approx f_x(\vw) + \nabla_{\vw}\left(f_x(\vw)\right)^T (\dash{\vw} - \vw) $$
This lets us compute the difference between the new output \( \dash{\yhat} = f_x(\dash{\vw}) \) and the output before the update, \( \yhat = f_x(\vw) \).
That is
\begin{aligned}
\dash{\yhat} - \yhat &= f_x(\dash{\vw}) - f_x(\vw) \\\\
&\approx f_x(\vw) + \nabla_{\vw}\left(f_x(\vw)\right)^T (\dash{\vw} - \vw) - f_x(\vw) \\\\
&= \nabla_{\vw}\left(f_x(\vw)\right)^T (\dash{\vw} - \vw) \\\\
&= \nabla_{\vw}\left(f_x(\vw)\right)^T (-\eta \vg) \\\\
&= -\eta \vg^T \vg \\\\
\end{aligned}
where we have used \( \vg = [g_1,\ldots,g_l] \) to denote the vector of gradients, that is, \( \vg = \nabla_{\vw}\left(f_x(\vw)\right) \).
Thus, after the gradient-based update \( \vw \leftarrow \vw - \eta \vg \), we expect the new output to differ from the pre-update output \( \yhat \) by \( -\eta \vg^T \vg \).
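As a quick sanity check of this first-order prediction, here is a small numerical sketch (plain NumPy, with arbitrarily chosen values; the helpers `output` and `grad` are my own, not from any library) comparing the actual change in the output against \( -\eta \vg^T \vg \) for a small learning rate.

```python
import numpy as np

def output(x, w):
    # y_hat = x * w_1 * w_2 * ... * w_l
    return x * np.prod(w)

def grad(x, w):
    # g_i = d y_hat / d w_i = x * prod_{j != i} w_j
    return np.array([x * np.prod(np.delete(w, i)) for i in range(len(w))])

x, eta = 2.0, 1e-3
w = np.array([0.9, 1.1, 0.8, 1.2])

g = grad(x, w)
actual = output(x, w - eta * g) - output(x, w)   # true change in y_hat
predicted = -eta * g @ g                          # first-order Taylor prediction

print(actual, predicted)   # nearly identical for a small eta
```

For a learning rate this small the two numbers agree closely; the interesting question is what happens when \( \eta \) is not tiny and the chain is deep.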
But we saw earlier that
\begin{aligned}
\dash{\yhat} - \yhat &= x (w_1-\eta g_1) (w_2-\eta g_2) \ldots (w_l - \eta g_l) - x w_1 w_2 \ldots w_l
\end{aligned}
Clearly, the actual update does not produce exactly this first-order change: expanding the product introduces interactions of every order between the gradients and the weights, not just the \( -\eta \vg^T \vg \) term.
This challenge is worsened as the network grows deeper, since more inter-parameter interactions come into play.
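To see the depth effect numerically, one can repeat the same comparison for chains of increasing length. The sketch below uses the same assumptions as before (arbitrary weights near 1, and a deliberately not-so-small learning rate); for these values the relative gap between the actual change and the first-order prediction should grow with depth.

```python
import numpy as np

def output(x, w):
    return x * np.prod(w)

def grad(x, w):
    # g_i = d y_hat / d w_i = x * prod_{j != i} w_j
    return np.array([x * np.prod(np.delete(w, i)) for i in range(len(w))])

x, eta = 1.0, 0.1
rng = np.random.default_rng(0)

for depth in (2, 4, 8, 16):
    w = rng.uniform(0.9, 1.1, size=depth)   # weights near 1 so products stay bounded
    g = grad(x, w)
    actual = output(x, w - eta * g) - output(x, w)
    predicted = -eta * g @ g
    gap = abs(actual - predicted) / abs(predicted)
    print(f"depth={depth:2d}  relative gap={gap:.3f}")
```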