# Recurrent neural networks

## Introduction

Recurrent neural networks (RNN) are a deep learning strategy for modeling sequential data. Till the advent of attention models, RNNs were the default recommendation for working with sequential data. A deep feedforward model for a sequence may need specific parameters for each element of the sequence. Moreover, it may not be able to generalize to sequences of variable lengths. Recurrent neural networks apply the same weights for each element of the sequence, significantly reducing the number of parameters and allowing the model to generalize to variable length sequences. Owing to this design, RNNs also generalize to structured data beyond sequential data, such as spatial or graphical data.

## Prerequisites

To understand RNNs, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

## Motivation

We have already covered deep feedforward networks or MLP and their special variant, the convolutional neural networks. Both these models offer great predictive capabilities on a variety of tasks. For sequential data, there are challenges though.

Consider tasks involving sequential data such as speech recognition, video captioning, or natural language processing. In such tasks, input elements form a sequence. For some tasks, outputs may also be a sequence of elements. For example, all the example tasks mentioned above involve word sequences.

• Speech recognition involves converting a sequence of audio signals to a sequence of words.
• Video captioning involves converting a sequence of video frames to a sequence of words.
• Natural language processing tasks such as question answering involve addressing a question (a sequence of words) into an answer (another sequence of words).

The order of elements in the sequence is likely useful for addressing such tasks. For example, "The man is holding the cat" is quite different from "The cat is holding the man". So, merely using a bag-of-words strategy, that does not respect word ordering, with some MLP is unlikely to result in good predictive performance. We need some way of respecting the element or word order.

One naive solution could be to use an MLP such that the input layer has as many units as the number of elements in the input sequence. Will that work? Not really. It will not work with variable length sequences. Moreover, it will require that input elements appear in the same order always. This is definitely not true about natural language. The same sentence can be expressed in multiple ways with some reordering of terms without changing its meaning. Without an exhaustive dataset containing all possible orderings with the same meaning, it will be infeasible to learn a model that works.

Need something better.

## Intuition

A successful strategy for dealing with sequential data should include the following:

• Ability to model variable length sequences without requiring additional parameters that depend on the sequence length.
• Capacity to respect element ordering.

Recurrent neural networks (RNN) are purpose designed for these criteria. In an RNN, there are element-level models that are strung together to form the overall predictive model. All elements in the sequence share the same parameters for the element-level model. This means, the overall number of parameters in the model does not change is independent of the sequence length. Moreover, since the same parameters are applied to each element, it does not matter where the element appears in the sequence, it gets processed the same way, but still retains its order among the outputs of the element-level models.

To respect element ordering, each element-level model takes as input the corresponding element as well as its context. For context, RNNs typically use the output of the element-level model of the previous element in that sequence. Intuitively, each element-level model is expected to capture the contextual information of the sequence so far and use the current element to provide an output that can be used as context for the next element in the sequence.

Progressing this way, after the final element of the sequence has been processed, the information about the sequence can be encoded into the output of the final element-level model. This encoded information can then be used to predict a category or a real-valued variable. It can even be decoded for sequential output tasks such as speech recognition, machine translation, or video-captioning by applying another RNN for the decode operation.

Now let us understand the details.

## Problem setting


In a sequence modeling task, inputs appear as a sequence of elements $\mX = \seq{\vxtone,\ldots,\vxtT}$. Each element of the sequence, $\vxtt \in \real^N$, is a vector consisting of $N$ features, $\vxtt = [x_1^{(t)}, x_2^{(t)}, \ldots, x_N^{(t)}]$.

In classification, the goal of the predictive model is to identify the class that generated a particular sequence. In other words, we attempt to infer the function $f$ that takes as input a sequence $\mX$ and predicts the class $y \in \set{C_1,\ldots,C_\nclass}$ as $y = f(\mX)$. In sequential output tasks, the target variable $y$ is also a sequence $\vy = \seq{y^{(1)},\ldots,y^{(\dash{\tau})}}$.

## A recurrence relation

In mathematics, a recurrence relation is an equation that defines an element of a sequence as a function of previous elements of the same sequence. For example, the sequence of natural numbers follows a simple recurrence relation

$$a^{(t)} = a^{(t-1)} + 1 = f(a^{(t-1)})$$

In the case of the series of natural numbers, the function $f: \natural \to \natural$ is simply $f(x) = x + 1$. When we define the first natural number, $a^{(1)} = 1$, we describe the entire sequence of natural numbers using this recurrence relation. Factorials, geometric series, arithmetic progressions, and the Fibonacci series can all be written as recurrence relations.

A dynamical system is a state-based system such that state sequence can be written as a recurrence relation.

$$\vs^{(t)} = f(\vs^{(t-1)}; \mTheta)$$

where, $\mTheta$ is the set of parameters of the model and $\vs^{(t)}$ denotes the state at time step $t$.

## The "recurrent" in RNN

Recurrent networks are modeled after such recurrence relations and hence they are called, recurrent neural networks. The states in the recurrent net are the encodings of the hidden units, typically denoted as vector $\vh^{(t)}$ with its associated time step. Just like the dynamical system, the state sequence is dependent on the previous states — the context. Additionally, the state sequence is also affected by the input at each time step. Thus, the recurrent relation for the recurrent neural network is

$$\vh^{(t)} = f(\vh^{(t-1)}, \vxtt; \mTheta)$$

Once the state of the system has been inferred this way, the output at any time step can be defined as a function of that state. as

$$\vo^{(t)} = g(\vh^{(t)}; \dash{\mTheta})$$

where, $\dash{\mTheta}$ is a different set of parameters, specifically trained for the output variable.

Note that the functions $f$ and $g$ are applied at each time step. Collectively, these functions, applied at each time step, are known an RNN cell, a specialized repeatable unit in the network.

Let's study the nature of the RNN cell next.

## The RNN cell

A typical recipe in neural networks for defining functions is this: To get an output multiply a weight vector to the input vector, add in some bias, and then apply the activation function to allow the modeling of nonlinearity in the output.

To infer the current state, a simple version of the function $f$ is no different, as this definition shows.

\begin{align} \vh^{(t)} &= f(\vh^{(t-1)}, \vxtt; \mTheta) \\\\ &= \text{tanh}\left( \mW \vh^{(t-1)} + \mU \vxtt + \vb \right) \end{align}

Here, the parameters $\mTheta$ include $\mW, \mU,$ and $\vb$. The parameters $\mW$ and $\mU$ are weight matrices and $\vb$ is the bias vector, because we typically wish to represent states as multidimensional vectors. Traditionally, the hyperbolic tangent activation function, $\text{tanh}$, was used to introduce nonlinearity, but other functions may be used.

Similarly, the output of the RNN cell can be calculated as

\begin{align} \vo^{(t)} &= g(\vh^{(t)}; \dash{\mTheta}) \\\\ &= \mV \vh^{(t)} + \vc \end{align}

where, $\mV$ and $\vc$ denote the weight and bias, (the parameters $\dash{\mTheta}$ of the output function $g$. Again $\mV$ is a matrix and $\vc$ is a vector to enable multidimensional outputs.

The RNN cell defined here is as simple as it gets. Several variations such as Long short-term memory (LSTM) and Gated recurrent units (GRU) have been studied for improving the learning capacity of RNN networks for long sequences.

## Predictive model

The RNN then consists of chain of such RNN cells to form a comprehensive model of sequential inputs. The predictive model of the overall RNN depends on the nature of the predictive task. We highlight a few here.

#### Classifying the entire sequence

For sequence classification task, the goal is to assign the entire sequence to one of many classes. In other words, we have to predict $\yhat \in \set{C_1, \ldots, C_\nclass}$ for an input sequence $\mX = \seq{\vxtone,\ldots,\vxtT}$.

A simple recipe with RNN involves taking a softmax of the final output $\votT$ of the sequence.

$$\yhat = \text{softmax}\left( \votT \right)$$

#### Classifying each element of the sequence

For some tasks, such as parts of speech identification, we need to predict the category of each element of the input sequence. In other words, there is a prediction $\yhat^{(t)}$ at each time step $t$. Similar to the previous approach, we apply a softmax to the output $\vott$ of the RNN cell at time step $t$ to get the prediction at that time step.

$$\yhat^{(t)} = \text{softmax}\left( \vott \right)$$

## Training

Training involves inferring the parameters of the model that best fit the available training data. In the case of the RNN cell we described, the parameters to be trained are the weight matrices $\mV, \mU, \mW$ and the bias vectors $\vb$ and $\vc$. For sequence modeling tasks, the training data is a collection of sequences of inputs and the corresponding outputs (or output sequences).

As with all neural networks, training an RNN follows the usual process. First we define a task-dependent loss for the predictions of the model. Subject to this loss, we utilize a gradient-based optimization strategy such as stochastic gradient descent (SGD) or its variant to fit the model parameters to the available training data. The gradients are computed using backpropagation.

The recipe for training is standard. But a major practical hack is needed due to the sequential nature of the data and corresponding model.

## Unfolding and unrolling

In an RNN, the same RNN-cell (the same set of parameters) is applied at each time step. This is evident in the nature of the function $f$ we defined earlier.

\begin{aligned} \vh^{(t)} &= f(\vh^{(t-1)}, \vxtt; \mTheta) \\\\ &= f(f(\vh^{(t-2)}, \vx^{(t-1)}; \mTheta), \vxtt; \mTheta) \\\\ &= f( \ldots f(\vh^{0}, \vx^{1}; \mTheta), \ldots \vxtt; \mTheta) \end{aligned}

Writing a recurrence relation this way to account for the full chain of input variables is known as unfolding the equation. Unfolding shows the hidden state at any time $t$ as a function of the parameters $\mTheta$ and the entire sequence up to the point $t$, that is, $\seq{\vxtone,\ldots,\vxtt}$.

To train an RNN, we have to infer $\mTheta$. This means, during the forward pass, we have to compute through all the hidden states $\vh^{(1)}, \ldots, \vh^{(\tau)}$ to be able to derive the gradients of the parameters of the model. In other words, we unroll the computational graph to calculate the hidden state at each time step and store it for the purposes of computing the gradients. Then, the gradients are computed using backpropagation, starting at the loss, and then sequentially going backwards in time from the gradient of the hidden variables $\vh^{(\tau)}$ to the gradient of $\vh^{(1)}$. This approach is known as backpropagation through time (BPTT).

BPTT is quite computationally expensive. The need for BPTT arises from using the previous hidden state as context to predict the current state. This is not amenable to parallel computation and requires sequentially computing the hidden states all the way from $\vh^{(1)}$ to $\vh^{(\tau)}$. This sequential computation can be avoided in some cases using techniques such as teacher forcing.

## Challenge of training RNNs

In the previous section, we showed the unfolding of the RNN recurrence relation. We present a simplified expansion here, where we have ignored the bias $\vb$, the multiplier to the input $\mU$ and activation function, to focus on an example problematic term.

\begin{aligned} \vh^{(t)} &= f( \ldots f(\vh^{0}, \vx^{1}; \mTheta), \ldots \vxtt; \mTheta) \\\\ &= \mW^T \left( \ldots \mW^T \vh^{(0)} \ldots \right) \\\ &= \left( \mW^t\right)^T \vh^{(0)} \\\\ &= \mQ^T \mLambda^t \mQ \vh^{(0)} \end{aligned}

where, we have assumed that $\mW$ is amenable to the eigendecomposition $\mW = \mQ \mLambda \mQ^T$.

Thus, repeated application of the same parameter weights through the entire sequence leads to the eigenvalues $\mLambda$ of the parameters to be raised to $t$, the length of the sequence. This means, eigenvalues that are less than 1 are driven to zero and eigenvalues that are greater than 1 explode as the length of the sequence increases. Consequently, it is a challenge to arrive at suitable parameters for modeling longer length sequences.

Note that this problem is very specific to RNNs. It is the result of applying the same weights, the RNN cell, for all time steps. If they were different, then we would not have this challenge of long-term dependencies. But that would lead to other challenges: too many model parameters and the inability to model variable length sequences.

One way to address this challenge is through the use of specialized cells, known as gated units, such as long short-term memory (LSTM) and gated recurrent units (GRU).