Transformer
A Transformer is a neural network architecture that uses a self-attention mechanism to compute the relationships among all tokens in an input sequence in parallel. Introduced in Google's 2017 paper "Attention Is All You Need," it serves as the foundation for modern large language models (LLMs) such as GPT, Claude, and BERT.
- The Transformer is a neural network architecture that drops recurrence (RNNs) and convolution (CNNs) and processes sequences using attention alone.
- At its heart is self-attention, where every word in a sentence references every other word simultaneously to compute contextual meaning.
- It was first introduced in Google's 2017 paper "Attention Is All You Need" (arXiv:1706.03762).
- Because it does not process tokens one after another, it enables massively parallel training on GPUs and TPUs, and this scalability opened the door to today's LLM era.
- Most modern large language models, including GPT, Claude, and BERT, are built on the Transformer.
What Is a Transformer?
A Transformer is a neural network architecture that uses a self-attention mechanism to compute the relationships among all tokens in an input sequence in a single pass. First introduced in the 2017 paper "Attention Is All You Need" by Google researchers, it marked a major turning point by discarding the recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that had dominated sequence modeling and instead handling language with attention alone.
Earlier RNN-based models read words one at a time, front to back, which made it hard to learn relationships between distant words and limited how much of the computation could run in parallel. The Transformer instead looks at an entire sentence at once and relates every pair of words in a single attention operation. This design made large-scale parallel training on GPUs and TPUs possible, and that scalability is the foundation on which large language models (LLMs) like GPT, Claude, and BERT emerged.
Core Components
Self-Attention
Self-attention compares each word in a sentence against every other word, scores how relevant they are to one another, and then rebuilds each word's representation according to those scores. For instance, in the sentence "The animal didn't cross the street because it was too tired," self-attention lets the model link "it" directly to "animal" when it processes the word "it." The decisive difference from RNNs is that self-attention resolves such a relationship in a single operation, no matter how far apart the words sit.
The attention computation relies on three vectors derived from each word embedding by multiplying it with learned weight matrices: the Query, Key, and Value. Each query is compared with every key by a dot product; in the original paper the resulting scores are divided by the square root of the key dimension and passed through a softmax so that they become weights summing to 1. Those weights are then used to take a weighted sum of the value vectors, yielding a new, context-aware representation of the word.
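The Query/Key/Value computation can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the helper names (`matmul`, `softmax`, `self_attention`), the toy embeddings, and the identity projection matrices are all assumptions chosen to keep the arithmetic easy to follow, and the scaling by the square root of the key dimension follows the scaled dot-product form described in the original paper.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence.

    x:             token embeddings, shape (seq_len, d_model)
    w_q, w_k, w_v: projection matrices, shape (d_model, d_k)
    Returns (output, weights).
    """
    q = matmul(x, w_q)  # queries
    k = matmul(x, w_k)  # keys
    v = matmul(x, w_v)  # values
    scale = math.sqrt(len(k[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # similarity of every query to every key
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights  # weighted sum of the value vectors

# Hypothetical toy input: three tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weight matrices
output, weights = self_attention(x, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, including itself. In a trained model the three projection matrices are learned, which is what lets a word such as "it" put most of its weight on "animal."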
Multi-Head Attention and Positional Encoding
Rather than performing attention just once, the Transformer runs several attention "heads" in parallel, each with its own set of weight matrices (multi-head attention). This lets the model capture different kinds of connections at once, such as syntactic and semantic relationships. However, because the model processes words in parallel rather than in sequence, attention by itself has no information about word order; to make up for this, positional encoding adds a vector carrying position information to each word embedding.
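Both ideas can be illustrated together in a short sketch. The position vectors below use the sinusoidal scheme from the original paper, which is only one option (learned position embeddings are a common alternative). The function names, the toy embeddings, the two hand-written heads, and the identity output projection are assumptions made for illustration, not part of any particular library.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            angle = pos / (10000 ** (2 * (j // 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """heads: one (w_q, w_k, w_v) tuple per head; w_o: output projection."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for (w_q, w_k, w_v) in heads
    ]
    # Concatenate every head's output vector for each token, then project.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)

# Hypothetical toy input: tokens 0 and 2 are the same "word" (same embedding).
embeddings = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
pe = positional_encoding(len(embeddings), 4)
x = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, pe)]

# Two illustrative heads: one reads dimensions 0-1, the other dimensions 2-3.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half,) * 3, (second_half,) * 3]
w_o = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

output = multi_head_attention(x, heads, w_o)
print(x[0] != x[2])  # identical words now differ because of their positions
```

Without the added position vectors, tokens 0 and 2 would be indistinguishable to attention. In a real model each head's matrices are learned, so the heads specialize by training rather than by a hand-picked split of dimensions as in this sketch.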
Encoder and Decoder
The Transformer in the original paper consisted of an encoder and a decoder, each built from a stack of identical layers (six apiece in the paper). The encoder turns the input sequence into a context-aware representation, and the decoder uses that representation to generate the output sequence. Later applications split this structure apart, adapting it to their goals: BERT uses only the encoder, while the GPT family uses only the decoder.
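One practical difference between the two halves is which positions each token may attend to. The text above does not spell this out, so the sketch below is an assumption based on the masked self-attention of the original paper's decoder: an encoder-style layer lets every token see the whole sequence, while a decoder-style layer hides later positions so that generation can proceed left to right. The function name `attention_weights` and the all-zero toy scores are hypothetical.

```python
import math

def softmax(row):
    peak = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row-wise softmax over raw attention scores.

    With causal=True, position t may only attend to positions <= t;
    later positions are set to -inf so they receive zero weight.
    """
    weights = []
    for t, row in enumerate(scores):
        if causal:
            row = [s if j <= t else -math.inf for j, s in enumerate(row)]
        weights.append(softmax(row))
    return weights

# Hypothetical raw scores for a 3-token sequence (all equal for clarity).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]

encoder_style = attention_weights(scores, causal=False)
decoder_style = attention_weights(scores, causal=True)

print(encoder_style[0])  # the first token attends to all three positions
print(decoder_style[0])  # the first token can attend only to itself
```

With equal scores, the encoder-style rows spread weight evenly over all tokens, while the decoder-style rows spread it only over the current token and those before it.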
Significance and Evidence
The paper "Attention Is All You Need" (Vaswani et al., 2017) showed experimentally that the Transformer not only produced higher-quality machine translation than prior models but also trained faster and parallelized more easily. Concretely, it scored 28.4 BLEU on WMT 2014 English-to-German translation, improving on the previous state of the art by more than 2 BLEU, and reached 41.8 BLEU on English-to-French translation, the best result for a single model at the time.
That same year, the Google AI blog (Jakob Uszkoreit, August 31, 2017) explained that, unlike RNNs and CNNs, the Transformer relates every word in a sentence using a constant number of operations and can speed up training by up to an order of magnitude. This training efficiency and scalability fed directly into the trend of performance improving as models grew larger, and the Transformer has since become the core architecture underpinning conversational AI and generative search.