Sequence-to-sequence models: A recap
Many machine learning tasks involve the transformation of one sequence into another.
For example, machine translation involves transforming text in one language into text in another language.
A popular deep learning strategy for modeling sequence-to-sequence tasks is the sequence-to-sequence (seq2seq) model.
Seq2seq models consist of two subcomponents — an encoder network and a decoder network.
The encoder encodes a \(\tau\)-length sequence of input tokens \( \seq{\vx^{ (1)}, \ldots, \vx^{(\tau)}} \) into a sequence of codes (vectors) \( \seq{\ve^{(1)},\ldots,\ve^{(\tau)}} \) such that each subsequent encoding is a function of the previous encoding and the input at that step.
$$ \ve^{(t)} = f_e(\ve^{(t-1)}, \vx^{(t)}), ~~\forall t=1,\ldots,\tau $$
where \( f_e \) is the encoding function learned by the encoder.
For example, the encoder can be an LSTM, as used by CITE[sutskever-2014].
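To make the encoder recurrence concrete, the sketch below implements \( f_e \) with an LSTM cell in PyTorch; the dimensions, tensor names, and the choice of PyTorch itself are illustrative assumptions rather than part of the model described above.
\begin{verbatim}
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not taken from the text).
input_dim, hidden_dim, tau = 16, 32, 10

encoder_cell = nn.LSTMCell(input_dim, hidden_dim)  # plays the role of f_e

x = torch.randn(tau, 1, input_dim)   # input sequence x^(1), ..., x^(tau)
h = torch.zeros(1, hidden_dim)       # e^(0): initial encoding
c = torch.zeros(1, hidden_dim)       # LSTM cell state (extra bookkeeping)

encodings = []
for t in range(tau):
    # e^(t) = f_e(e^(t-1), x^(t)); the LSTM also carries its cell state c
    h, c = encoder_cell(x[t], (h, c))
    encodings.append(h)
\end{verbatim}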
The sequence of encodings is summarized into a context vector, which is passed to the decoder for output sequence generation.
The context vector is some function \( f_c \) of the sequence of encodings,
$$ \vc = f_c(\set{\ve^{(1)},\ldots,\ve^{(\tau)}}) $$
For example, a simple strategy is to use the final encoding \( \ve^{(\tau)} \) as the context vector, as done in CITE[sutskever-2014].
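In code, this choice of \( f_c \) reduces to selecting the last element of the encoding sequence; the helper below is a hypothetical illustration, not the API of any particular library.
\begin{verbatim}
def final_encoding_context(encodings):
    # f_c: use the final encoding e^(tau) as the context vector c.
    return encodings[-1]
\end{verbatim}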
The decoder utilizes the context vector \( \vc \) to generate the output sequence \( \seq{\vy^{(1)}, \ldots, \vy^{(\dash{\tau})}} \).
The decoder achieves this transformation by learning the following function.
\begin{equation}
\vy^{(t)} = g_d(\vc, \vy^{(t-1)}, \vd^{(t)}), ~~\forall t=1,\ldots,\dash{\tau}
\label{eqn:decoder-output}
\end{equation}
where \( g_d \) is the decoding function learned by the model and \( \vd^{(t)} \) is the internal state of the decoder.
The decoder state itself follows a recurrence relation involving the previous state,
\begin{equation}
\vd^{(t)} = f_d(\vc, \vy^{(t-1)}, \vd^{(t-1)}), ~~\forall t=1,\ldots,\dash{\tau}
\label{eqn:decoder-state}
\end{equation}
For example, the decoder could be modeled as an LSTM-based RNN, with \( \vd^{(t)} \) as the hidden state of that RNN.
The length of the output sequence, \( \dash{\tau} \), may differ from the length of the input sequence, \( \tau \).
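The decoder recurrence can be sketched in the same illustrative style. Here the context vector initializes the decoder state and is also concatenated with the previous output, which is one common choice among several; the linear readout standing in for \( g_d \) is likewise an assumption of the sketch.
\begin{verbatim}
import torch
import torch.nn as nn

# Illustrative dimensions and a stand-in context vector (assumptions).
hidden_dim, output_dim, tau_prime = 32, 16, 12
context = torch.randn(1, hidden_dim)     # c, as produced by an encoder

decoder_cell = nn.LSTMCell(output_dim + hidden_dim, hidden_dim)  # learns f_d
readout = nn.Linear(hidden_dim, output_dim)                      # completes g_d

y_prev = torch.zeros(1, output_dim)      # y^(0): start-of-sequence placeholder
d = context                              # initialize the decoder state from c
cell_state = torch.zeros(1, hidden_dim)  # LSTM cell state

outputs = []
for t in range(tau_prime):
    # d^(t) = f_d(c, y^(t-1), d^(t-1)); c enters via concatenation with y^(t-1)
    d, cell_state = decoder_cell(torch.cat([y_prev, context], dim=1),
                                 (d, cell_state))
    # y^(t) = g_d(c, y^(t-1), d^(t)), here a linear readout of d^(t)
    y_prev = readout(d)
    outputs.append(y_prev)
\end{verbatim}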
The encoder and decoder networks are commonly implemented as recurrent neural networks (RNNs), with specialized cells such as LSTM or GRU.
The seq2seq model is trained on supervised examples, i.e., pairs of input and target sequences, to jointly fit the encoder and decoder networks.
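A minimal joint training step might look like the sketch below, which assumes full-sequence LSTM modules, teacher forcing on the decoder inputs, and a regression-style loss purely for illustration; a translation system would instead use token embeddings and a cross-entropy loss over the output vocabulary.
\begin{verbatim}
import torch
import torch.nn as nn

# Illustrative dimensions and random data (assumptions for the sketch).
input_dim, hidden_dim, output_dim = 16, 32, 16
tau, tau_prime = 10, 12

encoder = nn.LSTM(input_dim, hidden_dim)       # encoder RNN
decoder = nn.LSTM(output_dim, hidden_dim)      # decoder RNN
readout = nn.Linear(hidden_dim, output_dim)
params = (list(encoder.parameters()) + list(decoder.parameters())
          + list(readout.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)  # fits both networks jointly

x = torch.randn(tau, 1, input_dim)             # input sequence
y = torch.randn(tau_prime, 1, output_dim)      # target sequence

_, (h, c) = encoder(x)                         # final encoding acts as context
# Teacher forcing: feed the shifted target sequence as the decoder input.
decoder_in = torch.cat([torch.zeros(1, 1, output_dim), y[:-1]], dim=0)
d, _ = decoder(decoder_in, (h, c))             # decoder states d^(1), ..., d^(tau')
loss = nn.functional.mse_loss(readout(d), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
\end{verbatim}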