Deep Learning


The Transformer CITE[vaswani-2017] is a supervised learning model for sequential tasks that does not use a recurrent neural network (RNN) architecture. It achieves this feat by utilizing the attention mechanism in a deep feedforward network instead of the RNN-based encoder and decoder of the traditional seq2seq model. Experiments with the Transformer demonstrated that attention was by itself a very powerful mechanism for working with sequential data. The recurrence offered by the RNN was not even necessary to harness the power of attention. These results are reported in the now well-cited and aptly named seminal paper on the Transformer, "Attention Is All You Need" CITE[vaswani-2017].

The Transformer architecture is one of the coolest ideas of this decade in machine learning. Before the advent of the Transformer, the default recommendation for sequential tasks was to use RNNs. Over time, the Transformer architecture has become an effective and efficient replacement for RNN-based models in a variety of domains involving sequential data, such as natural language processing, speech, and video-related tasks. It is a precursor to some of the most popular natural language processing models, such as BERT and GPT.


To understand the Transformer model, we recommend familiarity with sequence-to-sequence (seq2seq) models and the attention mechanism. Both are briefly recapped in the next section.

seq2seq + attention: A recap

A popular RNN-based strategy for sequence-to-sequence tasks is the sequence-to-sequence (seq2seq) model. Seq2seq models consist of two subcomponents — an encoder network and a decoder network.

The encoder encodes an input sequence to a sequence of codes or encodings. Each encoding is a fixed length vector.

The sequence of encodings is then passed on to the decoder as a context vector for output sequence generation. The context vector is some function of the sequence of encodings. In the case of attention models, the context vector is dynamically calculated for each step of the output to customize the context based on the decoder state at that step.

The decoder utilizes the context vector to generate the target output sequence. The length of the output sequence could be different from the length of the input sequence.

The encoder and decoder networks are commonly implemented as recurrent neural networks (RNNs), with specialized cells such as LSTM or GRU. For example, the encodings are the hidden states of the LSTM cells used in the encoder network. The seq2seq model is trained using supervised examples, that is, tuples of input and target sequences, to jointly fit the encoder and decoder models.

The challenge of seq2seq models

Although attention was a significant improvement over traditional seq2seq models, attention-based seq2seq models still required sequential processing of the input, because each RNN step depends on the hidden state of the previous step. This ruled out improving computational efficiency by parallelizing over the input sequence.

Coupled with the already challenging art of training RNNs in general, which requires unrolling the network through time and backpropagating through long chains of steps, it was essential to look for alternatives for sequence-to-sequence modeling that did not utilize RNN-based architectures.

The Transformer was proposed to overcome these challenges. The Transformer processes all input tokens of the sequence at the same time and calculates attention weights between them. This allows for parallel processing and the ability to train on very large datasets — the primary strength of pre-trained models such as BERT and GPT.

Transformer: Intuition

Just like the seq2seq model and the RNN-based attention model, the Transformer is also an encoder-decoder architecture. The encoder and decoder perform similar functions as before: the encoder converts the input sequence to codes, and the decoder transforms the codes to outputs, utilizing attention to focus on the relevant codes from the sequence of encodings. But the implementation of the encoder, the decoder, and attention is different from their RNN-based counterparts. In the Transformer, the encoder and decoder are both implemented as stacks of identical layers.


The encoder is a feedforward network consisting of a stack of \( L_e \) identical layers, each composed of two sub-layers. The first sub-layer is an attention layer known as the multi-head self-attention layer. The second sub-layer is a position-wise feed forward network (FFN) layer. Each sub-layer has residual connections around it, followed by layer-normalization.
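The layer structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the two sub-layers are passed in as plain callables, and the learned gain and bias of layer normalization are left out for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position (row) to zero mean and unit variance.

    The learned gain and bias of layer normalization are omitted here.
    """
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def apply_sublayer(x, sublayer):
    """Residual connection around `sublayer`, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

def encoder_layer(x, self_attention, ffn):
    """One encoder layer: self-attention sub-layer, then position-wise FFN sub-layer."""
    x = apply_sublayer(x, self_attention)
    return apply_sublayer(x, ffn)

def encoder(x, layers):
    """Stack of L_e layers; `layers` is a list of (self_attention, ffn) pairs."""
    for self_attention, ffn in layers:
        x = encoder_layer(x, self_attention, ffn)
    return x
```

Because every sub-layer maps a sequence of vectors to a sequence of vectors of the same shape, the residual addition is always well defined and the layers can be stacked freely.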

The multi-head self-attention layer is composed of several parallel layers known as self-attention layers. The self-attention mechanism relates input tokens and their positions within the same input sequence. Such parallel stacking of several self-attention layers achieves more expressiveness as opposed to a single attention formulation. The particular form of attention used in the Transformer is known as the scaled dot-product attention.

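The position-wise FFN applies the same small feedforward network independently to the representation at each position of the sequence. It does not by itself carry any information about token order. Positional information is instead injected by adding a positional encoding to the content embedding of each token before the first layer. This enables the model to utilize positional information of each token in the input sequence without the use of an RNN-based model.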


The decoder is a feedforward network consisting of a stack of \( L_d \) identical layers, each composed of three sub-layers. The first sub-layer is a multi-head self-attention layer for the outputs. The second sub-layer is a multi-head attention layer over the encodings from the encoder (encoder-decoder attention): its queries come from the decoder, while its keys and values are the encoder outputs. The final sub-layer is the position-wise FFN, similar to the one used in the encoder. Similar to the encoder, there are residual connections around each of these sub-layers, followed by layer-normalization.

The decoder also uses restrictions to ensure that the prediction for a particular position in the sequence depends only on the previous elements of the sequence and not the subsequent ones. This is achieved by offsetting the output embeddings by one position and by masking the self-attention in the decoder to prevent it from using knowledge of subsequent positions.

In the original proposal for the Transformer architecture, \( L_e = L_d = 6 \) CITE[vaswani-2017]. In the following sections, we will cover each of these novel concepts (position-wise FFN, self-attention, multi-head self-attention, scaled dot-product attention) in detail.

Scaled dot-product attention

As explained in our comprehensive article on attention, the attention mechanism is implemented by a specialized function known as the attention function. The attention function takes an input query (the decoder state) and input key-value pairs (the sequence of encodings from the encoder) and returns a weighted combination of the values (the context vector), where the weights are the attention scores between the query and the keys.

In the case of the Transformer, the specialized attention function is known as scaled dot-product attention, which is a scaled version of the dot-product attention. It is computed as

$$ \text{Attention}(\mQ, \mK, \mV) = \text{softmax}\left( \frac{\mQ\mK^T}{\sqrt{d_k}} \right) \mV $$

The only difference from the dot-product attention described before is the division by the scaling factor \( \sqrt{d_k} \), where \(d_k\) is the dimensionality of the keys. Dot products tend to grow in magnitude with \( d_k \): if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance \( d_k \). Without scaling, such large dot products may push the softmax function into regions of extremely small gradients CITE[vaswani-2017].
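The formula above translates almost directly into NumPy. The following is a minimal sketch for a single sequence, without batching or masking; the function names and the toy shapes are our own choices for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (n_queries, n_keys)
    weights = softmax(scores)          # each row sums to 1
    return weights @ V, weights        # context vectors: (n_queries, d_v)

# Toy example: 3 queries attending over 5 key-value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
context, weights = scaled_dot_product_attention(Q, K, V)
print(context.shape, weights.shape)
```

Each row of `weights` is a probability distribution over the keys, and each row of `context` is the corresponding weighted average of the rows of `V`.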

Multi-head attention

If attention is so good, why use just one attention function? Why not multiple? Attention from multiple perspectives may allow the model to jointly utilize information from different representation subspaces at different positions. This is the primary motivation behind multi-head attention.

A multi-head attention block runs several attention functions in parallel on linear projections of the same queries, keys, and values. It then concatenates the results and further projects them to arrive at a single output, just like regular attention would.

Consider input matrices for the queries, keys, and values to be \( \mQ, \mK, \mV \), respectively.

Creating a single head

For creating a single head, we linearly project each of the inputs by multiplying with a parameter matrix. For example, to create the \( h\)-th head, the query matrix \( \mQ \) is projected to \( \mQ\mW_{qh} \) with the parameter \( \mW_{qh} \).

Thus, the formulation for the \( h \)-th head is just the attention function applied to the \( h \)-th linear projections of the queries, keys, and values.

$$ \text{head}_h = \text{Attention}\left(\mQ\mW_{qh}, \mK\mW_{kh}, \mV\mW_{vh}\right) $$

Combining heads to create a multi-head

With all heads computed in parallel, they are first concatenated and then multiplied with a multi-head output parameter matrix \( \mW_o \) to arrive at the multi-head formulation.

$$ \text{MultiHead}(\mQ, \mK, \mV) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) \mW_o $$

where \( H \) is the number of heads in the multi-head attention. The Transformer paper uses \( H = 8 \) CITE[vaswani-2017]. A single multi-head attention module will have the parameters \( \mW_o \) and \( \mW_{qh}, \mW_{kh}, \mW_{vh} \) for all \( h=1, \ldots, H \).
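The two steps, creating the heads and combining them, can be sketched as follows. This is an illustrative NumPy version with random, untrained parameters. The model width of 64 and the per-head width of 64 / 8 = 8 are assumptions made only to keep the example small; they are not values taken from the text above.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention with a row-wise softmax."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v are lists holding one projection matrix per head."""
    heads = [
        attention(Q @ W_qh, K @ W_kh, V @ W_vh)
        for W_qh, W_kh, W_vh in zip(W_q, W_k, W_v)
    ]
    # Concatenate the heads along the feature axis, then project with W_o.
    return np.concatenate(heads, axis=-1) @ W_o

# H = 8 heads; d_model and d_head are illustrative assumptions.
H, d_model = 8, 64
d_head = d_model // H
rng = np.random.default_rng(0)
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(H)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(H)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(H)]
W_o = rng.normal(size=(H * d_head, d_model))

X = rng.normal(size=(10, d_model))   # a sequence of 10 positions
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
print(out.shape)
```

Passing the same matrix `X` as queries, keys, and values is only one possible use of the block. The heads are computed in a Python loop here for readability; practical implementations usually fuse them into a single batched matrix product.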


The multi-head attention described in the previous section is used in 3 different ways in the Transformer model. The first use is the same as that of the traditional seq2seq model with attention — to enable the decoder to focus selectively on all positions of encodings from the encoder.

Additionally, the Transformer introduces a concept of self-attention — a multi-head attention from the previous layer to the next layer of the same module, encoder or decoder.

Encoder self-attention

As we have explained in the Transformer overview, the encoder has several identical layers. There is a multi-head attention block between adjoining layers. This block enables an encoder layer to attend to all positions in the previous encoder layer.

For encoder self-attention, all inputs (queries, keys, and values) to the multi-head attention of an encoder layer are the output of the previous encoder layer.

Decoder self-attention

The decoder also has several identical layers and there is a multi-head attention block between adjoining layers. Unlike the encoder, the decoder is prohibited from looking at subsequent positions of the output. Each position can only utilize queries, keys, and values from itself and the previous positions. Therefore, the self-attention in the case of the decoder is known as a masked multi-head attention. The masking merely hides information of subsequent positions from visibility to a decoder position. The masking works by setting all disallowed values (those corresponding to subsequent positions of the sequence) in the input of the softmax to \( -\infty \), so that they receive zero attention weight.

With this mask in place, for decoder self-attention, the inputs are sourced in the same way as that of encoder self-attention.
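The effect of the mask is easiest to see in code. Below is a minimal single-head sketch in NumPy, without the learned projections, in which each position may attend to itself and to earlier positions, as in the original paper. The helper names are hypothetical.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix; True where position i must not see position j > i."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_self_attention(X):
    """Single-head self-attention on X of shape (n, d), hiding subsequent positions."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    scores = np.where(causal_mask(n), -np.inf, scores)   # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                             # exp(-inf) evaluates to 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
out, weights = masked_self_attention(X)
print(np.round(weights, 2))   # entries above the diagonal are zero
```

Since the masked scores become exactly zero after the softmax, changing a later position of the sequence cannot alter the output at any earlier position.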

Position-wise feed-forward network

Following the multi-head self-attention, each layer of the encoder has another sub-layer — the position-wise feed-forward network (PFFN), a fully connected two-layer network, with a ReLU activation function on the hidden layer.

Let \( \mW_{i1}, \vb_{i1} \) and \( \mW_{i2}, \vb_{i2} \) denote the weights and biases of the first (hidden) and second (output) level of the \( i\)-th layer PFFN. Then, the output of the \( i \)-th PFFN on an input \( \vx \) is

$$ \text{PFFN}_i(\vx) = \max(0, \vx\mW_{i1} + \vb_{i1})\mW_{i2} + \vb_{i2} $$

The same \(\text{PFFN}_i\) is applied to all positions (\( t = 1, \ldots, \tau \)) of the input in the same manner. This means, the output of the \(\text{PFFN}_i\) on an input \( \seq{\vx^{(1)}, \ldots, \vx^{(\tau)}} \) of length \( \tau \) is the encoded sequence of the same length as follows

$$ \seq{\vx^{(1)}, \ldots, \vx^{(\tau)}} \overset{\text{PFFN}_i}{\Rightarrow} \seq{\text{PFFN}_i(\vx^{(1)}), \ldots, \text{PFFN}_i(\vx^{(\tau)})} $$

Thus, there is parameter sharing among all positions of the input. This is intuitively similar to convolutions with kernel size 1.
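To make the parameter sharing concrete, here is a small NumPy sketch of one PFFN. The sizes and the random weights are assumptions for illustration only. It applies the network once to the whole sequence as a matrix and once position by position, then compares the two results.

```python
import numpy as np

def pffn(X, W1, b1, W2, b2):
    """Two-layer network with ReLU; the same weights are used for every position."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU on the hidden layer
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
tau, d_model, d_hidden = 6, 8, 32           # illustrative sizes (assumptions)
W1, b1 = rng.normal(size=(d_model, d_hidden)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), rng.normal(size=d_model)
X = rng.normal(size=(tau, d_model))         # one row per position

batched = pffn(X, W1, b1, W2, b2)                             # all positions at once
one_by_one = np.stack([pffn(x, W1, b1, W2, b2) for x in X])   # position by position
print(np.allclose(batched, one_by_one))
```

The matrix form is what makes the sub-layer cheap to parallelize: no position has to wait for the result of another.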

Each encoder layer and decoder layer has its own PFFN with its own parameters. There are \( L_e \) PFFNs in the encoder, and \( L_d \) PFFNs in the decoder.

Positional encoding

The Transformer does not use an RNN-architecture. But it needs a way to preserve the relative positions of tokens in the input and output, while still supporting variable length inputs.

The Transformer achieves these objectives with the use of encodings on the position of the input. The Transformer paper for machine translation uses sinusoidal functions for computing the positional encoding CITE[vaswani-2017]. In the original Transformer, for position \( p \) and dimension pair index \( i \) of a \( d_m \)-dimensional input embedding, the positional encoding \( \text{PE} \) uses a sine for the even dimensions and a cosine for the odd dimensions:

$$ \text{PE}(p, 2i) = \text{sin}\left(p/10000^{2i/d_m}\right) $$

$$ \text{PE}(p, 2i+1) = \text{cos}\left(p/10000^{2i/d_m}\right) $$

Nevertheless, the positional encoding could be any function of the position of the token.

Before being passed as input to the encoder or decoder stacks, the positional encoding is added to the input and output token embeddings.
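A minimal NumPy sketch of the sinusoidal encoding follows. It uses the convention of the original paper, with sines on the even dimensions and cosines on the odd dimensions, and assumes an even embedding size. The function name and the toy embeddings are our own.

```python
import numpy as np

def positional_encoding(num_positions, d_m):
    """Sinusoidal encodings of shape (num_positions, d_m); d_m is assumed even."""
    positions = np.arange(num_positions)[:, None]     # (num_positions, 1)
    pair_index = np.arange(d_m // 2)[None, :]         # i = 0, ..., d_m/2 - 1
    angles = positions / np.power(10000.0, 2 * pair_index / d_m)
    pe = np.zeros((num_positions, d_m))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# The encoding is simply added to the token embeddings (random stand-ins here).
token_embeddings = np.random.default_rng(0).normal(size=(10, 16))
inputs = token_embeddings + positional_encoding(10, 16)
```

The encoding depends only on the position and the embedding size, so it can be computed for sequences of any length without learning additional parameters.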


In spite of its multiple encoder and decoder layers and the sub-layers of multi-head attention and position-wise feed-forward networks, in the end, the Transformer is a deep feedforward network (DFFN). It is trained using the standard training recipe for a DFFN, or any other deep neural network.

First, we define a task-dependent loss for the predictions of the model. We then utilize a gradient-based optimization strategy, such as stochastic gradient descent (SGD) or one of its variants, to fit the model parameters to the available training data by minimizing this loss. The gradients are computed using backpropagation.
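The recipe can be illustrated on a deliberately tiny stand-in. The sketch below is not a Transformer: it assumes fixed random vectors `H` in place of the final decoder outputs and trains only a hypothetical output projection `W` with a cross-entropy loss, using a hand-derived gradient. It also takes plain gradient steps on the full toy batch, whereas SGD would sample a mini-batch at each step. In a real model, the gradients for all layers would come from backpropagation, typically via an automatic differentiation framework.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(W, H, targets):
    """Mean negative log-likelihood of the target tokens."""
    probs = softmax(H @ W)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

def gradient_step(W, H, targets, lr=0.1):
    """One gradient descent step on W."""
    n = len(targets)
    grad_logits = softmax(H @ W)
    grad_logits[np.arange(n), targets] -= 1.0   # d(loss)/d(logits) = (probs - one_hot) / n
    grad_W = H.T @ grad_logits / n
    return W - lr * grad_W

rng = np.random.default_rng(0)
H = rng.normal(size=(20, 8))              # stand-in for 20 decoder output vectors
targets = rng.integers(0, 5, size=20)     # target token ids from a vocabulary of 5
W = np.zeros((8, 5))

loss_before = cross_entropy(W, H, targets)
for _ in range(50):
    W = gradient_step(W, H, targets)
loss_after = cross_entropy(W, H, targets)
print(loss_before, loss_after)
```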
