The fifteen pages
that quietly rewrote
artificial intelligence.

In the summer of 2017, eight researchers at Google published Attention Is All You Need. It was unfashionably short, refreshingly direct, and turned out to be the foundational document of the modern AI era. Scroll to read it.

Scroll to begin

012017 · Google Brain & Google Research

Attention Is All You Need.

In June 2017, eight researchers published a fifteen-page paper with an audacious title and an even more audacious claim: the recurrent neural networks that powered nearly every translation system in the world were no longer needed.

In their place, the authors proposed a single, surprisingly simple idea — let every word in a sentence look at every other word, all at once. They called the architecture the Transformer.

"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."

02Chapter 01 — The bottleneck

Language, one word at a time.

For years, the state of the art in machine translation was the recurrent neural network. RNNs read a sentence the way you'd read a ticker tape — word by word, left to right, carrying a hidden memory forward at every step.

That sequential nature is also their curse. To compute the state at position t, the network must first finish positions 1 through t-1. You cannot parallelize a chain. As sentences grow longer, training slows and the signal connecting distant words has to travel through every word in between.

03Chapter 02 — The idea

What if every word could see every other word?

Self-attention discards the chain entirely. For each word, the model asks a question — a query — and compares it against keys produced by every other word in the sentence. The closer the match, the more that word contributes to the answer.

The result: any two positions in a sentence are exactly one operation apart. A pronoun on the right can look directly at its antecedent on the left without the signal having to crawl through everything in between.

"An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors."

04Chapter 03 — The architecture

Six layers, eight heads, one elegant stack.

The Transformer is an encoder–decoder built from N = 6 identical layers on each side. Inside every layer sit two ideas: multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection and layer normalization.

Multi-head attention runs the attention computation eight times in parallel, each head learning to look for a different kind of relationship — syntax in one, coreference in another, long-range agreement in a third.

05Chapter 04 — Why it works

Collapsing the distance between words.

In Table 1, the authors compare the maximum path length a signal must travel between any two positions. For a recurrent layer it grows linearly with sentence length, O(n). For a convolutional stack it grows logarithmically, O(logₖ n).

For self-attention, that path length is a flat O(1). It does not matter whether two words are three tokens apart or three hundred — they are always one hop away from each other.

06Chapter 05 — The results

A new state of the art, by a wide margin.

On the WMT 2014 English-to-German benchmark, the big Transformer scored 28.4 BLEU — more than 2 BLEU above the previous best, including ensembles of multiple models.

On English-to-French it set a new single-model state of the art of 41.8 BLEU. Each colored point below is a competing system from the literature. The Transformer sits up and to the left: better quality, dramatically less compute.

07Chapter 06 — The cost

Three and a half days, eight GPUs.

The base Transformer trained for just 12 hours on eight NVIDIA P100 GPUs. The big model — the one that broke the records — trained for 3.5 days on the same hardware.

Competing ensembles required one to two orders of magnitude more floating-point operations to reach lower scores. Attention wasn't just better. It was cheaper.

"A small fraction of the training costs of the best models from the literature."

08Epilogue

The paper that ate machine learning.

The authors closed with a quiet sentence: they were excited about the future of attention-based models, and planned to apply them to problems beyond text — images, audio, video.

Within five years, the Transformer would underpin BERT, GPT, T5, DALL·E, AlphaFold, and Stable Diffusion. Almost every model you have heard of in the last decade traces a direct line back to these fifteen pages.

The Transformer · 2017

Eight researchers. Fifteen pages. A decade of AI.

Vaswani et al. · NIPS 2017 · arXiv 1706.03762