Sequence-to-sequence (seq2seq) models

Deep Learning


Many machine learning tasks involve the transformation of one sequence into another. For example, machine translation transforms text in one language into text in another language. In deep learning, such tasks are typically modeled using sequence-to-sequence models, also known as seq2seq models.


To understand seq2seq models, we recommend familiarity with recurrent neural networks (RNNs) and their gated variants, such as LSTM and GRU.

Problem setting

In sequence-to-sequence learning, the goal of the predictive model is to transform an input sequence into an output sequence.

Consider an input sequence of \( \tau \) tokens \( \seq{\vx^{(1)}, \ldots, \vx^{(\tau)}} \), where each token \( \vx^{(t)} \in \real^N \) is a vector of \( N \) features, \( \vx^{(t)} = [x_1, x_2, \ldots, x_N] \).

The model needs to map this sequence to an output sequence \( \seq{\vy^{(1)}, \ldots, \vy^{(\dash{\tau})}} \), where the output length \( \dash{\tau} \) need not match the input length \( \tau \).


Seq2seq models typically consist of two subcomponents — an encoder and a decoder.

The encoder transforms an input sequence into a sequence of latent representations known as encodings. The sequence of encodings is compressed into a fixed-size vector known as the context vector. The decoder then uses the context vector to generate the target output sequence.

The encoder and decoder networks are commonly implemented as recurrent neural networks (RNNs) with specialized cells such as LSTM or GRU. For example, the encodings could be the hidden states of the LSTM cells in the encoder RNN, and the target output sequence could be the sequence of outputs of the LSTM cells in the decoder RNN.

The encoder formulation

The encoder encodes a \(\tau\)-length sequence of input tokens \( \seq{\vx^{ (1)}, \ldots, \vx^{(\tau)}} \) into a sequence of codes (vectors) \( \seq{\ve^{(1)},\ldots,\ve^{(\tau)}} \) such that each subsequent encoding is a function of the previous encoding and the input at that step.

$$ \ve^{(t)} = f_e(\ve^{(t-1)}, \vx^{(t)}), ~~\forall t=1,\ldots,\tau $$

where \( f_e \) is the encoding function learned by the encoder and \( \ve^{(0)} \) is an initial encoding, typically a vector of zeros. For example, the encoder can be an LSTM, as used by CITE[sutskever-2014].
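To make the recurrence concrete, here is a minimal sketch in NumPy. A plain tanh cell stands in for the LSTM cell of the formulation above; the parameter names `W`, `U`, `b` and all dimensions are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: N-dimensional inputs, H-dimensional encodings.
N, H, tau = 4, 8, 5

# Assumed parameters of the encoding function f_e. A plain tanh cell
# stands in for an LSTM cell; the recurrence has the same shape either way.
W = rng.normal(scale=0.1, size=(H, H))  # acts on the previous encoding
U = rng.normal(scale=0.1, size=(H, N))  # acts on the current input
b = np.zeros(H)

def f_e(e_prev, x_t):
    """One encoder step: e^{(t)} = f_e(e^{(t-1)}, x^{(t)})."""
    return np.tanh(W @ e_prev + U @ x_t + b)

xs = rng.normal(size=(tau, N))  # a toy tau-length input sequence
e = np.zeros(H)                 # e^{(0)}: the initial encoding
encodings = []
for t in range(tau):
    e = f_e(e, xs[t])
    encodings.append(e)

print(len(encodings), encodings[-1].shape)
```

Each encoding depends on the entire prefix of the input sequence through the previous encoding, which is what lets the final encoding summarize the whole input.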

The context vector

The encodings are passed on to the decoder for output sequence generation in the form of a context vector. The context vector is some function \( f_c \) of the sequence of encodings,

$$ \vc = f_c\left(\set{\ve^{(1)},\ldots,\ve^{(\tau)}}\right) $$

For example, a simple strategy is to use the final encoding \( \ve^{(\tau)} \) as the context vector, as used in CITE[sutskever-2014].
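The choice of \( f_c \) is a design decision. The NumPy sketch below, using toy encodings with hypothetical dimensions, contrasts the final-encoding strategy with a simple alternative, averaging the encodings; both yield a fixed-size vector regardless of the input length.

```python
import numpy as np

# A toy sequence of tau encodings, each of size H, standing in for the
# encoder's output.
tau, H = 5, 8
encodings = np.random.default_rng(1).normal(size=(tau, H))

# The strategy of Sutskever et al.: the context is the final encoding.
c_last = encodings[-1]

# An alternative f_c: average the encodings into a fixed-size summary.
c_mean = encodings.mean(axis=0)

print(c_last.shape, c_mean.shape)  # both are fixed-size vectors
```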

The decoder

The decoder utilizes the context vector \( \vc \) to generate the output sequence \( \seq{\vy^{(1)}, \ldots, \vy^{(\dash{\tau})}} \). The decoder achieves this transformation by learning the following function.

\begin{equation} \vy^{(t)} = g_d(\vc, \vy^{(t-1)}, \vd^{(t)}), ~~\forall t=1,\ldots,\dash{\tau} \label{eqn:decoder-output} \end{equation}

where \( g_d \) is the decoding function learned by the model and \( \vd^{(t)} \) is the internal state of the decoder. The decoder state itself follows a recurrence relation over the previous state,

\begin{equation} \vd^{(t)} = f_d(\vc, \vy^{(t-1)}, \vd^{(t-1)}), ~~\forall t=1,\ldots,\dash{\tau} \label{eqn:decoder-state} \end{equation}

For example, the decoder could be modeled as an LSTM-based RNN, and \( \vd^{(t)} \) could be the hidden state of such an RNN.

The length of the output sequence, \( \dash{\tau} \), can be different from the length of the input sequence, \( \tau \).
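The two decoder equations can be sketched as a loop. This is a hedged NumPy illustration: a tanh cell stands in for an LSTM, every parameter name and dimension is an assumption, and the initial output \( \vy^{(0)} \) is taken as a zero vector standing in for a start-of-sequence token.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative dimensions: context size H, output size M, state size D,
# and an output length tau_out that need not match the input length.
H, M, D, tau_out = 8, 3, 6, 4

# Assumed parameters of the state recurrence f_d and output function g_d.
W_d = rng.normal(scale=0.1, size=(D, D))  # previous state
U_d = rng.normal(scale=0.1, size=(D, M))  # previous output
V_d = rng.normal(scale=0.1, size=(D, H))  # context
W_y = rng.normal(scale=0.1, size=(M, D))  # state -> output
U_y = rng.normal(scale=0.1, size=(M, H))  # context -> output
Y_y = rng.normal(scale=0.1, size=(M, M))  # previous output -> output

def f_d(c, y_prev, d_prev):
    """State recurrence: d^{(t)} = f_d(c, y^{(t-1)}, d^{(t-1)})."""
    return np.tanh(W_d @ d_prev + U_d @ y_prev + V_d @ c)

def g_d(c, y_prev, d_t):
    """Output function: y^{(t)} = g_d(c, y^{(t-1)}, d^{(t)})."""
    return np.tanh(W_y @ d_t + U_y @ c + Y_y @ y_prev)

c = rng.normal(size=H)  # context vector, as produced by an encoder
y = np.zeros(M)         # y^{(0)}: stands in for a start-of-sequence token
d = np.zeros(D)         # initial decoder state
outputs = []
for t in range(tau_out):
    d = f_d(c, y, d)    # update the state from the previous state
    y = g_d(c, y, d)    # emit the next output token
    outputs.append(y)

print(len(outputs), outputs[0].shape)
```

Note that each emitted output is fed back in at the next step, which is how the decoder conditions later outputs on earlier ones.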


The seq2seq model is trained using supervised examples: tuples of input and target sequences that are used to jointly fit the encoder and decoder models.

Training involves jointly learning the parameters of the encoder and decoder networks, and follows the standard recipe for neural networks. First, we define a task-dependent loss on the predictions of the model. We then minimize this loss using a gradient-based optimization strategy, such as stochastic gradient descent (SGD) or one of its variants, to fit the model parameters to the available training data. The gradients are computed using backpropagation.
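The training recipe can be illustrated end to end on a toy problem. The sketch below deliberately simplifies: a linear encoder, the final encoding as context, a stateless linear decoder, and numerical (finite-difference) gradients in place of backpropagation, so the whole loop fits in a few lines. All names and data are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy supervised pair: a length-3 input sequence mapped to a length-2
# target sequence (fabricated data, for illustration only).
N, H, M = 2, 4, 2
tau, tau_out = 3, 2
xs = rng.normal(size=(tau, N))
ys = rng.normal(size=(tau_out, M))

def forward(params, xs):
    U_e, W_y = params
    e = np.zeros(H)
    for x in xs:                  # linear encoder recurrence
        e = 0.5 * e + U_e @ x
    c = e                         # context = final encoding
    return [W_y @ c for _ in range(tau_out)]  # stateless linear decoder

def loss(params, xs, ys):
    """Task-dependent loss: squared error over the output sequence."""
    preds = forward(params, xs)
    return sum(np.sum((p - y) ** 2) for p, y in zip(preds, ys))

# Encoder and decoder parameters, fit jointly.
params = [rng.normal(scale=0.1, size=(H, N)),   # encoder U_e
          rng.normal(scale=0.1, size=(M, H))]   # decoder W_y

def numerical_grad(params, i, eps=1e-5):
    """Finite-difference gradient of the loss w.r.t. params[i]."""
    g = np.zeros_like(params[i])
    for idx in np.ndindex(*params[i].shape):
        params[i][idx] += eps
        hi = loss(params, xs, ys)
        params[i][idx] -= 2 * eps
        lo = loss(params, xs, ys)
        params[i][idx] += eps
        g[idx] = (hi - lo) / (2 * eps)
    return g

before = loss(params, xs, ys)
for _ in range(50):               # plain gradient descent steps
    grads = [numerical_grad(params, i) for i in range(len(params))]
    for p, g in zip(params, grads):
        p -= 0.05 * g
after = loss(params, xs, ys)
print(f"loss: {before:.4f} -> {after:.4f}")
```

In practice, of course, one would use automatic differentiation (backpropagation through time) rather than finite differences, and LSTM cells rather than linear maps.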
