Introduction
The Transformer CITE[vaswani-2017] is a supervised learning model for sequential tasks that does not use a recurrent neural network (RNN) architecture. It achieves this feat by cleverly using the attention mechanism within a deep feedforward network, rather than the traditional RNN-based encoder-decoder seq2seq model. Experiments with the Transformer demonstrated that attention is, by itself, a very powerful mechanism for working with sequential data; the recurrence offered by an RNN is not necessary to harness the power of attention. These results are reported in the now well-cited and aptly named seminal paper on the Transformer, "Attention Is All You Need" CITE[vaswani-2017].
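To make the central mechanism concrete, below is a minimal NumPy sketch of the scaled dot-product attention operation described in CITE[vaswani-2017]. The function name and toy dimensions are illustrative only; the full Transformer adds learned query/key/value projections, multiple attention heads, positional encodings, and feedforward sublayers on top of this core operation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (illustrative sketch).

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    """
    d_k = K.shape[-1]
    # Similarity of every query to every key, scaled to keep the softmax stable.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension yields the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Toy self-attention example: 4 tokens, 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```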
The Transformer architecture is one of the coolest ideas of this decade in machine learning. Before the advent of the Transformer, the default recommendation for sequential tasks was an RNN. Over time, the Transformer architecture has become an effective and efficient replacement for RNN-based models in a variety of domains involving sequential data, such as natural language processing, speech, and video-related tasks. It is a precursor to some of the most popular natural language processing models, such as BERT and GPT.