
Attention Is All You Need

Darrell Thomas


Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (Google Brain / Google Research / University of Toronto)
Published: June 2017 | arxiv.org/abs/1706.03762


The Big Idea

Before this paper, AI read text like a game of telephone. Systems called recurrent neural networks (RNNs) processed words one at a time, left to right, passing a summary forward at each step. By word 500, the memory of word 3 was mostly gone. Slow and forgetful.

This paper asked: what if the AI could see the entire sentence at once? Let every word look at every other word simultaneously and figure out which ones matter most. That mechanism is called self-attention, and the architecture built on it is the Transformer.

A Transformer trained in 3.5 days on 8 GPUs beat every existing translation system — including ones that used 50 times more computing power. Every major AI you use today — ChatGPT, Claude, Gemini, LLaMA — descends from this 15-page paper. It didn’t just improve translation; it took over text, images, audio, protein folding, and robotics. The title “Attention Is All You Need” sounded bold in 2017. By 2020 it was just a fact.


How It Works

Attention: Which Words Matter?

Consider: “The animal didn’t cross the street because it was too tired.”

What does “it” refer to? The animal — not the street. You know this because “tired” connects to “animal.” Streets don’t get tired.

Now swap one word: “The animal didn’t cross the street because it was too wide.”

Now “it” means the street. Same structure, completely different meaning, all because of one word’s relationship to another.

This is what attention solves. For each word, the model asks: which other words in this sentence help me understand this one? And it asks this question for every word, about every other word, all at the same time.

Queries, Keys, and Values

The paper describes the mechanism using three concepts: Queries, Keys, and Values.

Imagine walking into a library with a question (your Query). Every book has a label on its spine (its Key). You scan every label and compare it to your question — some are relevant, most aren’t. Then you read the contents (the Values) of the relevant books, giving more weight to the ones whose labels best match your question.

That’s what the Transformer does for every word in a sentence. Each word becomes three things: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“here’s my information if you pick me”). Each word’s Query gets compared to every other word’s Key, the scores are converted into percentages (a step called softmax), and the final output is a blend of everyone’s Values, with the most relevant words contributing the most.

All of this is basic math — multiplication and addition — done on every word at the same time. No word waits for any other word. That’s why it’s fast, and why GPUs, which are built for parallel work, can tear through it.
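Here is that idea in miniature. The sketch below is an illustration in NumPy, not the paper’s actual implementation: the function names, the toy sizes (three words, four numbers each), and the random vectors are all ours, and a real model would first produce each word’s Query, Key, and Value from learned transformations of the word vectors.

    import numpy as np

    def softmax(scores):
        # Turn raw scores into percentages that sum to 1 along each row.
        exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return exp / exp.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        # Compare every word's Query to every word's Key in one multiplication,
        # scale by sqrt(d) to keep the scores in a stable range (the "scaled"
        # in the paper's scaled dot-product attention), then blend the Values.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)   # words-by-words relevance grid
        weights = softmax(scores)       # each row now sums to 1
        return weights @ V              # weighted blend of everyone's Values

    # Toy self-attention: 3 "words", each a 4-number vector.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 4))
    out = attention(X, X, X)            # in self-attention, Q, K, V share one source
    print(out.shape)                    # (3, 4): one blended vector per word

Notice there is no loop over words: the single multiplication Q @ K.T scores every pair at once, which is exactly what makes the mechanism parallel.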

Multi-Head Attention

One round of attention catches one type of relationship. But language has many relationships happening at once. In our animal sentence, “it” connects to “animal,” “cross” connects to “street,” and “didn’t” connects to “tired” for cause and effect.

Multi-head attention is like sending eight people to the library with eight different questions about the same sentence. One person tracks “who does what to whom,” another tracks “which words describe which,” another follows pronouns. Each runs attention independently, then they pool their findings. The model figures out this division of labor entirely on its own — nobody programs which head does what.
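A rough sketch of that division of labor, reusing attention() from the sketch above. The random projection matrices here are stand-ins for weights a real model would learn; the sizes and names are illustrative, not from the paper.

    def multi_head_attention(X, num_heads=8):
        # Each head gets its own small Query/Key/Value projections, so each
        # can specialize in a different kind of relationship.
        rng = np.random.default_rng(1)
        d_model = X.shape[-1]
        d_head = d_model // num_heads
        heads = []
        for _ in range(num_heads):
            Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
            heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
        # Pool the findings: concatenate all heads, then mix with one final matrix.
        Wo = rng.normal(size=(d_model, d_model))
        return np.concatenate(heads, axis=-1) @ Wo

    X = np.random.default_rng(2).normal(size=(3, 16))  # 3 words, 16 numbers each
    print(multi_head_attention(X).shape)               # (3, 16)

Each head works in a smaller space (here 16 ÷ 8 = 2 numbers per head), so running eight heads costs about the same as running one full-size head.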

The Rest of the Architecture

A few more pieces complete the picture:

  • Encoder and Decoder. The Transformer has two halves. The Encoder reads and understands the input (like reading an English sentence). The Decoder generates the output (like writing the French translation), referring back to the Encoder’s understanding at every step. Think of a human translator reading first, then writing.

  • Positional encoding. Since attention sees all words at once, it has no natural sense of word order: “dog bites man” and “man bites dog” would look the same. The fix: tag each word with its position before processing, so the model knows what came first (the paper’s sine-and-cosine tagging scheme is sketched after this list).

  • Residual connections. At each step, the original input is added back to the output — like keeping a photocopy of your essay before each round of edits. This prevents information from getting distorted through many layers and makes deep networks possible to train.
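The positional-encoding trick is concrete enough to show. The sketch below follows the paper’s sine-and-cosine formula; the toy sizes are ours. Each position gets waves at different frequencies, giving every position a unique, smoothly varying tag that is simply added to the word vectors.

    import numpy as np

    def positional_encoding(num_positions, d_model):
        pos = np.arange(num_positions)[:, None]        # positions 0, 1, 2, ...
        i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
        angles = pos / np.power(10000.0, i / d_model)  # early dims oscillate fastest
        pe = np.zeros((num_positions, d_model))
        pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
        pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
        return pe

    X = np.zeros((3, 16))               # 3 toy word vectors
    X = X + positional_encoding(3, 16)  # tag each word with its position

    # A residual connection is just as simple in code: add the original input
    # back to a sublayer's output, e.g. X = X + multi_head_attention(X).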


Results

The Transformer beat every existing translation system while using roughly 50 times less computing power. The systems it beat weren’t even single models; they were ensembles of multiple models combined. The big Transformer trained in 3.5 days. A smaller version trained in 12 hours.

It also matched purpose-built systems at English constituency parsing (mapping out a sentence’s grammatical structure) with minimal changes, proof that it was general-purpose from day one.


What Came Next

The design has two known tradeoffs. First, attention cost grows with the square of input length: doubling the input quadruples the work, which is fine for sentences but expensive for long documents (and why work like Mamba exists). Second, the original encoder-decoder layout was later simplified; most modern AI (GPT, Claude) uses just the decoder half. The paper introduced the core ideas; the community found the best arrangements.


Key Takeaways

  1. One change made all the difference. Letting every word see every other word at once, instead of processing one at a time, made AI faster to train, more accurate, and capable of things the old systems never could have done.

  2. The library analogy is the whole game. Walk in with a question, scan every label, read the most relevant books. That’s attention. Every major AI architecture since 2017 is a variation on this idea.

  3. Eight researchers, 8 GPUs, 15 pages. The most influential architecture in AI history wasn’t a billion-dollar moonshot. It was a clean idea, executed well. Everything built since rests on this foundation.

  4. Want to go deeper? Stephen Welch at Welch Labs and Sebastian Raschka’s Build a Large Language Model (From Scratch) are the best resources for understanding the mechanics hands-on.


Summary prepared by Darrell Thomas, 2026