Long short-term memory (LSTM)

Introduction

Long short-term memory (LSTM) cells are specialized RNN cells designed to overcome the challenge of long-term dependencies in RNNs, allowing the network to retain information across longer sequences. They belong to a family of units known as gated units, which mitigate the problem of vanishing or exploding gradients.

LSTMs are among the most widely used cells for implementing RNNs. Owing to their effectiveness, they have been applied to a variety of sequence modeling problems in a variety of application domains such as video, audio, natural language processing, time-series modeling, and geo-spatial modeling.

Prerequisites

To understand LSTMs, we recommend familiarity with recurrent neural networks (RNNs), including the vanishing and exploding gradient problem covered in our comprehensive article on RNNs.

Intuition

As we studied earlier in our comprehensive article on RNNs, one key challenge specific to RNNs is that of vanishing or exploding gradients. This problem arises due to the repeated application of the same parameters, through the same RNN cell, at each time step.

Choosing to apply different parameters at each time step can overcome this problem, but it introduces new ones: having to learn too many parameters and being unable to generalize to variable-length sequences. Is there a middle ground? Can we vary the effective parameters at each time step, while still generalizing to variable-length sequences and keeping the overall number of learnable parameters constant? Gated RNN cells such as LSTM and GRU offer this alternative.

Gated cells have internal variables known as gates. The value of a gate at a time step depends on the input at that time step and the previous state. The gate value is then multiplied with other variables of interest to modulate them. For example, an input gate controls which values of the input at a time step affect the state at that time step, while a state gate, also known as a forget gate, controls how much the previous state affects the current state. An LSTM includes an input gate, a forget gate, and an output gate.

The net effect of such gates is that the same weight parameters do not get multiplied throughout the sequence. The gates affect them at each time step, thereby avoiding the vanishing or exploding gradient problem we observed with ungated RNN cells.

RNN: A refresher on notation

In a sequence modeling task, inputs appear as a sequence of elements \( \mX = \seq{\vx^{(1)},\ldots,\vx^{(\tau)}} \). Each element of the sequence, \( \vx^{(t)} \in \real^N \), is a vector consisting of \( N \) features, \(\vx^{(t)} = [x_1^{(t)}, x_2^{(t)}, \ldots, x_N^{(t)}] \).

RNNs work on the principle of applying the same unit, a cell with the same parameters, at each time step to infer the state of the RNN. This results in applying the same recurrent function \( f \) at every time step.

A typical recipe in neural networks for defining functions is this: to get an output, multiply the input vector by a weight matrix, add a bias, and then apply an activation function to allow the modeling of nonlinearity in the output. To infer the current state, a simple version of the function \( f \) is no different, as this definition shows.

\begin{align} \vh^{(t)} &= f(\vh^{(t-1)}, \vx^{(t)}; \mTheta) \\\\ &= \text{tanh}\left( \mW \vh^{(t-1)} + \mU \vx^{(t)} + \vb \right) \end{align}

Here, the parameters \( \mTheta \) include \( \mW, \mU, \) and \( \vb \). The parameters \( \mW \) and \( \mU \) are weight matrices and \( \vb \) is the bias vector, because we typically wish to represent states as multidimensional vectors.

Similarly, the output of the RNN cell can be calculated as a function of its current state.

\begin{align} \vo^{(t)} &= g(\vh^{(t)}; \dash{\mTheta}) \\\\ &= \mV \vh^{(t)} + \vc \end{align}

where \( \mV \) and \( \vc \) denote the weight and bias, respectively (the parameters \( \dash{\mTheta} \) of the output function \( g \)). Again, \( \mV \) is a matrix and \( \vc \) is a vector to enable multidimensional outputs.
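The state and output equations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names (`rnn_step`, `rnn_output`, `rnn_forward`), the sizes, and the random initialization are all assumptions chosen for the example.

```python
import numpy as np

def rnn_step(h_prev, x_t, W, U, b):
    """One recurrent update: h_t = tanh(W h_prev + U x_t + b)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

def rnn_output(h_t, V, c):
    """Linear readout of the state: o_t = V h_t + c."""
    return V @ h_t + c

def rnn_forward(X, h0, W, U, b, V, c):
    """Apply the same cell, with the same parameters, to every element of X."""
    h = h0
    states, outputs = [], []
    for x_t in X:
        h = rnn_step(h, x_t, W, U, b)
        states.append(h)
        outputs.append(rnn_output(h, V, c))
    return np.stack(states), np.stack(outputs)

# Hypothetical sizes: N = 3 input features, 4 state units, 2 outputs, tau = 5 steps.
rng = np.random.default_rng(0)
N, H, K, tau = 3, 4, 2, 5
W = rng.normal(scale=0.5, size=(H, H))
U = rng.normal(scale=0.5, size=(H, N))
b = np.zeros(H)
V = rng.normal(scale=0.5, size=(K, H))
c = np.zeros(K)
X = rng.normal(size=(tau, N))

states, outputs = rnn_forward(X, np.zeros(H), W, U, b, V, c)
print(states.shape, outputs.shape)  # (5, 4) (5, 2)
```

Note that the loop reuses the same `W`, `U`, and `b` at every step, which is exactly the property that lets the cell handle sequences of any length.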

The gating mechanism

To refresh your memory on the challenge of modeling long sequences with RNNs, here's the unrolled equation that we studied in our comprehensive article on RNNs.

\begin{aligned} \vh^{(t)} &= f( \ldots f(\vh^{(0)}, \vx^{(1)}; \mTheta), \ldots \vx^{(t)}; \mTheta) \\\\ &= \mW \left( \ldots \mW \vh^{(0)} \ldots \right) \\\\ &= \mW^t \vh^{(0)} \\\\ \end{aligned}

We present a simplified expansion here, where we have ignored the bias \( \vb \), the input weight \( \mU \), and the activation function, to focus on one problematic term: the repeated multiplication of the weight parameter \( \mW \) by itself. Gating will help us avoid this repetitive multiplication of the same term. Here's how.
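To see why the repeated multiplication is a problem, the following sketch applies a fixed weight matrix to an initial state many times and tracks the norm of the result. The matrices here are deliberately simple, scaled identity matrices picked for illustration, and the helper name `repeated_product_norm` is hypothetical.

```python
import numpy as np

def repeated_product_norm(W, h0, steps):
    """Norm of the state after applying W once per time step, `steps` times."""
    h = h0
    for _ in range(steps):
        h = W @ h
    return np.linalg.norm(h)

h0 = np.ones(4)
W_small = 0.5 * np.eye(4)  # every eigenvalue is 0.5
W_large = 1.5 * np.eye(4)  # every eigenvalue is 1.5

for steps in (1, 10, 50):
    print(steps,
          repeated_product_norm(W_small, h0, steps),
          repeated_product_norm(W_large, h0, steps))
```

With eigenvalues below 1 the norm shrinks toward zero as the number of steps grows, and with eigenvalues above 1 it grows without bound. The same behavior carries over to the gradients that flow back through these products.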

Consider a new internal variable \( g^{(t)} \) that is a function of the input \( \vx^{(t)} \) at time step \( t \) and the previous state \( \vh^{(t-1)} \).

\begin{aligned} g^{(t)} &= \sigma \left( \mW_g \vh^{(t-1)} + \mU_g \vx^{(t)} + \vb_g \right) \\\\ \end{aligned}

The subscript \( g \) on \( \mW_g, \vb_g \) and \( \mU_g \) indicates that these are gate-specific parameters, different from the other cell parameters \( \mW, \vb \) and \( \mU \).

We now modify the state variable by multiplication with the gating variable.

\begin{aligned} \vh^{(t)} &= f(\vh^{(t-1)}, \vx^{(t)}; \mTheta)g^{(t)} \\\\ \end{aligned}

This new formulation no longer makes \( \vh^{(t)} \) depend on \( \vh^{(t-1)} \) through the same fixed multiplication at every step. The gate \( g^{(t)} \) takes values between 0 and 1, so depending on its value, the previous state and the input affect the new state more or less strongly. Thus, by introducing gating variables, we avoid repetitive multiplication by the same weight parameters alone, which helps prevent the vanishing and exploding gradient problems.
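A minimal sketch of this gated update follows. It assumes the gate is a vector with one entry per state unit that is applied element-wise, and it uses the tanh recurrent function \( f \) from the refresher above. The function name `gated_step` and the random parameters are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; squashes each entry into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(h_prev, x_t, params):
    """Compute f(h_prev, x_t) and scale it element-wise by the gate."""
    W, U, b, W_g, U_g, b_g = params
    gate = sigmoid(W_g @ h_prev + U_g @ x_t + b_g)   # gate-specific parameters
    candidate = np.tanh(W @ h_prev + U @ x_t + b)    # the ungated update f
    return candidate * gate, gate

# Hypothetical sizes: 4 state units and 3 input features.
rng = np.random.default_rng(0)
H, N = 4, 3
shapes = [(H, H), (H, N), (H,), (H, H), (H, N), (H,)]
params = tuple(rng.normal(scale=0.5, size=s) for s in shapes)

h_prev = rng.normal(size=H)
x_t = rng.normal(size=N)
h_t, gate = gated_step(h_prev, x_t, params)
print("gate: ", gate)
print("state:", h_t)
```

Because the gate is recomputed from the current input and the previous state, the factor that scales the update differs from one time step to the next, even though the learnable parameters stay the same.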

The LSTM cell

The LSTM cell leverages the gating mechanism everywhere possible. It uses a specialized gate for each of its major components — the input, the state, and the output.

We list below the calculations for each of the gate variables — the input gate \( \vi^{(t)} \), a gate on the state (forget gate) \( \vs^{(t)} \) and an output gate \( \vg^{(t)} \). They all have the same calculation recipe.

\begin{aligned} \vi^{(t)} &= \sigma \left( \mW_i \vh^{(t-1)} + \mU_i \vx^{(t)} + \vb_i \right) \\\\ \vs^{(t)} &= \sigma \left( \mW_s \vh^{(t-1)} + \mU_s \vx^{(t)} + \vb_s \right) \\\\ \vg^{(t)} &= \sigma \left( \mW_g \vh^{(t-1)} + \mU_g \vx^{(t)} + \vb_g \right) \\\\ \end{aligned}

In all these equations, the subscripts \(i, s, \) and \( g \) on the weight matrices and biases indicate that these parameters are specific to the corresponding gate. The \( \sigma \) denotes the sigmoid function.

With these gates, the calculation of the hidden state \( \vh^{(t)} \) and the output vector of the cell \( \vo^{(t)} \) proceeds as follows:

\begin{aligned} \vh^{(t)} &= \vh^{(t-1)} \hadamard \vs^{(t)} + \vi^{(t)} \hadamard \sigma \left( \mW \vh^{(t-1)} + \mU \vx^{(t)} + \vb \right) \\\\ \vo^{(t)} &= \text{tanh} \left( \vh^{(t)} \right) \hadamard \vg^{(t)} \\\\ \end{aligned}

where \( \hadamard \) denotes the Hadamard (element-wise) product of its operands.
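The LSTM update described in this section can be sketched as follows, with each gate given its own weight matrices and bias. The helper names (`init_params`, `affine`, `lstm_step`), the dictionary keys, and the sizes are assumptions made for this example. The code follows the equations as written in this article, so it should be read as an illustration of the gating recipe rather than as the exact formulation used by any particular library.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; squashes each entry into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(names, H, N, rng):
    """Hypothetical helper: one (W, U, b) triple per name."""
    return {k: (rng.normal(scale=0.5, size=(H, H)),
                rng.normal(scale=0.5, size=(H, N)),
                np.zeros(H)) for k in names}

def affine(p, h_prev, x_t):
    """Shared recipe: W h_prev + U x_t + b."""
    W, U, b = p
    return W @ h_prev + U @ x_t + b

def lstm_step(h_prev, x_t, params):
    i = sigmoid(affine(params["i"], h_prev, x_t))  # input gate
    s = sigmoid(affine(params["s"], h_prev, x_t))  # forget (state) gate
    g = sigmoid(affine(params["g"], h_prev, x_t))  # output gate
    candidate = sigmoid(affine(params["cell"], h_prev, x_t))
    h = h_prev * s + i * candidate   # gated mix of old state and new candidate
    o = np.tanh(h) * g               # gated output
    return h, o

# Hypothetical sizes: 4 state units, 3 input features, 5 time steps.
rng = np.random.default_rng(0)
H, N, tau = 4, 3, 5
params = init_params(("i", "s", "g", "cell"), H, N, rng)
X = rng.normal(size=(tau, N))

h = np.zeros(H)
for x_t in X:
    h, o = lstm_step(h, x_t, params)
print(h.shape, o.shape)  # (4,) (4,)
```

All three gates are computed with the same `affine` recipe and differ only in the parameters they use, which mirrors the statement that they share one calculation recipe.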

LSTM variants

Several LSTM variants have been studied in recent times. Some of these variants involve the removal of one of the gates.

One notable variant with a different architecture than an LSTM is the gated recurrent unit (GRU). It has two gates — the update gate \( \vu^{(t)} \) and the reset gate \( \vr^{(t)} \), calculated as follows.

\begin{aligned} \vu^{(t)} &= \sigma \left( \mW_u \vh^{(t-1)} + \mU_u \vx^{(t)} + \vb_u \right) \\\\ \vr^{(t)} &= \sigma \left( \mW_r \vh^{(t-1)} + \mU_r \vx^{(t)} + \vb_r \right) \\\\ \end{aligned}

With these gates, the hidden state \( \vh^{(t)} \) is calculated as

\begin{aligned} \vh^{(t)} &= \vh^{(t-1)} \hadamard \vu^{(t)} + \left(1 - \vu^{(t)}\right) \hadamard \sigma \left( \mW \left( \vr^{(t)} \hadamard \vh^{(t-1)} \right) + \mU \vx^{(t)} + \vb \right) \\\\ \end{aligned}

Since a single update gate plays the role of both the forget gate and the input gate of the LSTM, the GRU cell has fewer parameters to train.
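A GRU step can be sketched in the same style. This example assumes, as with the LSTM gates, that the gates at a time step are computed from the previous state and the current input and are applied within that same step. The helper names and sizes are hypothetical, and the parameter count at the end simply tallies the entries of the weight matrices and biases.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; squashes each entry into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(names, H, N, rng):
    """Hypothetical helper: one (W, U, b) triple per name."""
    return {k: (rng.normal(scale=0.5, size=(H, H)),
                rng.normal(scale=0.5, size=(H, N)),
                np.zeros(H)) for k in names}

def count_params(params):
    """Total number of scalar parameters across all triples."""
    return sum(W.size + U.size + b.size for W, U, b in params.values())

def gru_step(h_prev, x_t, params):
    W_u, U_u, b_u = params["u"]
    W_r, U_r, b_r = params["r"]
    W, U, b = params["cell"]
    u = sigmoid(W_u @ h_prev + U_u @ x_t + b_u)            # update gate
    r = sigmoid(W_r @ h_prev + U_r @ x_t + b_r)            # reset gate
    candidate = sigmoid(W @ (r * h_prev) + U @ x_t + b)    # reset gate masks the old state
    return h_prev * u + (1.0 - u) * candidate              # one gate balances old and new

# Hypothetical sizes: 4 state units and 3 input features.
rng = np.random.default_rng(0)
H, N = 4, 3
gru_params = init_params(("u", "r", "cell"), H, N, rng)
lstm_params = init_params(("i", "s", "g", "cell"), H, N, rng)

h = gru_step(np.zeros(H), rng.normal(size=N), gru_params)
print(h.shape)                                             # (4,)
print(count_params(gru_params), count_params(lstm_params)) # 96 128
```

With these sizes, the GRU needs three parameter triples where the LSTM sketch needs four, which is where the reduction in trainable parameters comes from.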

That being said, research has shown that such modifications and variants of LSTM cells are not beneficial enough to the performance of the overall network to warrant trials with all such variations. The default recommendation is to use LSTM cells for sequence modeling tasks involving RNNs.
