Unfolding and unrolling
In an RNN, the same RNN cell (the same set of parameters) is applied at each time step.
This is evident from the definition of the function \( f \) given earlier.
\begin{aligned}
\vh^{(t)} &= f(\vh^{(t-1)}, \vxtt; \mTheta) \\\\
&= f(f(\vh^{(t-2)}, \vx^{(t-1)}; \mTheta), \vxtt; \mTheta) \\\\
&= f(f(\ldots f(\vh^{(0)}, \vxtone; \mTheta) \ldots, \vx^{(t-1)}; \mTheta), \vxtt; \mTheta)
\end{aligned}
Writing a recurrence relation this way to account for the full chain of input variables is known as unfolding the equation.
Unfolding shows that the hidden state at any time \( t \) is a function of the parameters \( \mTheta \) and the entire input sequence up to time \( t \), that is, \( \seq{\vxtone,\ldots,\vxtt} \).
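To make the weight sharing concrete, here is a minimal NumPy sketch of unfolding; the tanh cell, the function names (`rnn_cell`, `unfold`), and the dimensions are illustrative assumptions, not anything fixed by the text.

```python
import numpy as np

def rnn_cell(h_prev, x_t, theta):
    """One application of f(h, x; Theta): the same parameters at every time step."""
    W_h, W_x, b = theta
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def unfold(h0, xs, theta):
    """Repeatedly apply the same cell: h^(t) = f(h^(t-1), x^(t); Theta)."""
    h = h0
    for x_t in xs:           # xs holds x^(1), ..., x^(t)
        h = rnn_cell(h, x_t, theta)
    return h                 # h^(t) depends on Theta and the entire sequence

# Illustrative sizes: hidden size 4, input size 3, sequence length 5.
rng = np.random.default_rng(0)
theta = (0.1 * rng.normal(size=(4, 4)), 0.1 * rng.normal(size=(4, 3)), np.zeros(4))
xs = [rng.normal(size=3) for _ in range(5)]
h_t = unfold(np.zeros(4), xs, theta)
```

Note that `theta` is created once and reused at every step; that is the weight sharing the equations above express.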
To train an RNN, we have to infer \( \mTheta \).
This means that, during the forward pass, we have to compute all the hidden states \( \vh^{(1)}, \ldots, \vh^{(\tau)} \) before we can derive the gradients of the model parameters.
In other words, we unroll the computational graph to calculate the hidden state at each time step and store it for the purposes of computing the gradients.
Then the gradients are computed using backpropagation, starting at the loss and moving sequentially backwards in time, from the gradient with respect to \( \vh^{(\tau)} \) to the gradient with respect to \( \vh^{(1)} \).
This approach is known as backpropagation through time (BPTT).
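The sketch below illustrates BPTT for a simple case: a tanh cell with a squared loss on the final hidden state. Both of those choices, and the names `forward` and `bptt`, are assumptions made for illustration only. The forward pass stores every hidden state, and the backward pass reuses them while walking from \( \vh^{(\tau)} \) back to \( \vh^{(1)} \).

```python
import numpy as np

def forward(h0, xs, theta):
    """Unrolled forward pass: store every hidden state for use in BPTT."""
    W_h, W_x, b = theta
    hs = [h0]
    for x_t in xs:
        hs.append(np.tanh(W_h @ hs[-1] + W_x @ x_t + b))
    return hs                                 # hs[t] is h^(t); hs[0] is the initial state

def bptt(hs, xs, target, theta):
    """Backpropagation through time for L = 0.5 * ||h^(tau) - target||^2 (illustrative loss)."""
    W_h, W_x, b = theta
    dW_h, dW_x, db = np.zeros_like(W_h), np.zeros_like(W_x), np.zeros_like(b)
    dh = hs[-1] - target                      # dL/dh^(tau)
    for t in range(len(xs), 0, -1):           # t = tau, ..., 1
        da = dh * (1.0 - hs[t] ** 2)          # through the tanh nonlinearity
        dW_h += np.outer(da, hs[t - 1])       # same parameters accumulate gradient at every step
        dW_x += np.outer(da, xs[t - 1])
        db += da
        dh = W_h.T @ da                       # pass the gradient back to h^(t-1)
    return dW_h, dW_x, db

# Illustrative usage with hidden size 4, input size 3, sequence length 6.
rng = np.random.default_rng(0)
theta = (0.1 * rng.normal(size=(4, 4)), 0.1 * rng.normal(size=(4, 3)), np.zeros(4))
xs = [rng.normal(size=3) for _ in range(6)]
hs = forward(np.zeros(4), xs, theta)
grads = bptt(hs, xs, np.ones(4), theta)
```

Automatic-differentiation frameworks perform this backward sweep for you, but the sequential dependence between time steps remains.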
BPTT is quite computationally expensive.
The need for BPTT arises from using the previous hidden state as context when computing the current one.
This is not amenable to parallel computation and requires sequentially computing the hidden states all the way from \( \vh^{(1)} \) to \( \vh^{(\tau)} \).
This sequential computation can be avoided in some cases using techniques such as teacher forcing: when the recurrence runs from the output at one time step to the hidden state at the next, feeding the ground-truth output instead of the model's own prediction decouples the time steps during training.
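As a rough illustration, the sketch below assumes a hypothetical model whose recurrence runs through the output rather than the hidden state, which is the setting where teacher forcing removes the sequential dependence; the cell, the shapes, and the names (`step`, `teacher_forced_states`) are illustrative assumptions.

```python
import numpy as np

def step(y_prev, x_t, theta):
    """Cell for a model whose recurrence goes through the *output*:
    h^(t) depends on the previous output, not the previous hidden state."""
    W_y, W_x, b, V = theta
    h_t = np.tanh(W_y @ y_prev + W_x @ x_t + b)
    return h_t, V @ h_t                       # hidden state and predicted output

def teacher_forced_states(ys_true, xs, theta):
    """Feed the ground-truth y^(t-1) instead of the model's own prediction.
    Each step then depends only on observed data, so the steps need not be
    computed one after another."""
    return [step(y_prev, x_t, theta) for y_prev, x_t in zip(ys_true, xs)]

# Illustrative usage: hidden size 4, input size 3, output size 2, length 5.
rng = np.random.default_rng(0)
theta = (0.1 * rng.normal(size=(4, 2)), 0.1 * rng.normal(size=(4, 3)),
         np.zeros(4), 0.1 * rng.normal(size=(2, 4)))
xs = [rng.normal(size=3) for _ in range(5)]
ys_true = [np.zeros(2)] + [rng.normal(size=2) for _ in range(4)]  # y^(0), ..., y^(4)
states_and_preds = teacher_forced_states(ys_true, xs, theta)
```

At test time the ground-truth outputs are not available, so the model's own predictions must be fed back in, and the computation becomes sequential again.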