Attention Is All You Need is famous as a paper. It is also a snapshot of an unusual moment when eight researchers across Google Brain and Google Research, all under thirty-five, briefly collaborated and produced the architecture that runs the modern world.
The transformer paper has been written about thousands of times for its technical content. This deck covers it as a piece of history. The room it came out of, the eight people in the room, what they each contributed, why they all left Google, and where each of them went next.
Google Brain was founded in 2011 by Jeff Dean, Andrew Ng, and Greg Corrado as part of Google X. By 2016 it was a roughly 500-person research organisation in Mountain View, with a sister team at Google Brain Zürich and a close but often-uneasy sibling relationship with the fancier Google DeepMind in London (the two were separate research orgs from DeepMind's 2014 acquisition onwards; deck 07 covers DeepMind).
Brain in this period had four things in unusual combination:
An LSTM is sequential by construction: you cannot start step t+1 until step t has finished. On a TPU pod that means most of the chips sit idle. The transformer's original attraction was that it parallelised across all positions in a sequence during training. That it also generalised better was, by several accounts, a genuine surprise.
The author list, in order: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. A footnote on the paper notes that all authors contributed equally and that the listing order was random. That is not a polite fiction: several of the eight have publicly described it as a genuine team effort with no clear lead.
That said, there is a rough division of labour the authors have described in various interviews:
Indian, did his PhD at USC under Daniel Marcu. The "first author" only by the accident of the random ordering, but heavily involved in the experimental work and the writing. Calm and methodical. Co-founded Adept in 2022 with Niki Parmar; later left to found Essential AI.
Brilliant, idiosyncratic, the highest-paid IC at Google for years. Joined as employee number ~200, was central to AdSense and several other major systems before drifting into ML. Pushed multi-head attention as a generalisation, did much of the actual implementation. Founded Character.AI in 2021 (returned to Google with Daniel De Freitas in a 2024 reverse-acqui-hire that paid out over $2 B). Now back at DeepMind working on Gemini.
Indian. Did her master's at USC in the same lab as Vaswani and joined Brain shortly after him. First author on the Image Transformer follow-up work. The first woman to be a major architect of a frontier-LM-defining paper.
German; son of Hans Uszkoreit, a famous computational linguist at Saarbrücken. Multi-lingual, articulate, often credited with the "self-attention is all you need" framing — reportedly pushed the team to drop the recurrence completely. Now applies transformers to mRNA design at Inceptive.
Welsh. Did much of the implementation work in Tensor2Tensor, including the production-quality attention layers everyone subsequently used. Soft-spoken; one of the longest to remain at Google after the paper. Now in Tokyo running Sakana AI with David Ha.
Canadian. The youngest author by a margin — a 21-year-old University of Toronto undergraduate intern at Brain when the paper was written. Went on to do his DPhil at Oxford under Yarin Gal, then co-founded Cohere with Ivan Zhang and Nick Frosst. CEO of Cohere today.
Polish; formerly a theoretical computer scientist at the University of Warsaw and Université Paris Diderot. Also a co-author of the One Model To Learn Them All paper. The only one of the eight to move to OpenAI; reportedly central to o1 / o3 training infrastructure.
Ukrainian. Left Google in 2017, before the paper's impact had fully registered, to co-found NEAR Protocol, a layer-1 blockchain. Has spoken candidly about how the original NEAR mission was AI-related and pivoted to crypto when funding made it convenient. Returned closer to AI work in recent years.
None of the eight had a deep statistical-NLP background. They were a mix of ML engineers, theoretical computer scientists, an undergraduate, and people who had done other things at Google for years. The paper is in some sense an outsider paper to traditional NLP, written by people whose tools were TPUs and Tensor2Tensor rather than Penn Treebank and CoNLL.
The paper is structured as a single proposal: build sequence-to-sequence models out of attention layers and feed-forward layers only, with no recurrence. Everything else is in service of making that work.
An encoder stack of N = 6 identical layers; a decoder stack of N = 6 identical layers. Each encoder layer has multi-head self-attention plus a position-wise feed-forward network. Each decoder layer has masked multi-head self-attention, then cross-attention into the encoder output, then position-wise FFN. Residual connections and layer norm everywhere. Sinusoidal positional encodings added to the input embeddings.
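A minimal NumPy sketch of the sinusoidal positional encodings mentioned above, following the paper's formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name and shapes here are illustrative, not the paper's Tensor2Tensor code:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...).
    Assumes an even d_model (512 in the base model)."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # frequency falls with dimension
    angles = positions * angle_rates                         # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings before the first layer, roughly:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```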
Q, K, V are the queries, keys and values — in self-attention all derived from the same input by separate linear projections. The √dk scaling is a small but important detail; without it, large dot products push softmax into a saturated regime and gradients vanish. Shazeer's contribution.
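For concreteness, a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√dk)·V; the shapes and mask convention are illustrative, not the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)."""
    d_k = Q.shape[-1]
    # Without the sqrt(d_k) divisor, large dot products saturate the softmax.
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)             # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # (seq_q, d_v)
```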
The paper's most underrated technical contribution. Rather than one attention computation, run h = 8 in parallel, each with a smaller key dimension, and concatenate the results. This lets a single layer attend to multiple kinds of relationship simultaneously: e.g. one head tracks syntactic dependencies, another long-range coreference, another simple positional offsets. Slide 04 expands on this.
| Task | Best prior (2016–17) | Transformer base | Transformer big |
|---|---|---|---|
| WMT 2014 EN→DE | 26.30 BLEU (GNMT + RL ensemble) | 27.3 | 28.4 |
| WMT 2014 EN→FR | 40.56 BLEU (MoE single model) | 38.1 | 41.8 |
| Training cost | ~weeks on a GPU cluster | 12 h on 8 P100 | 3.5 days on 8 P100 |
The training-cost line, easy to miss, is what turned heads inside Google. The transformer was state of the art on translation and trained an order of magnitude faster than the GNMT system it replaced.
The headline of the paper is that self-attention works. But self-attention had been published already — Cheng et al 2016, Lin et al 2017. What had not been published was multi-head self-attention with the specific design used in the transformer. This is the invention that holds up best in retrospect.
A single attention head computes one weighted sum per position. It can model one type of relationship at a time. Capacity is bounded by the rank of the attention matrix.
h = 8 parallel heads with smaller per-head dimension dk = dmodel/h (64 in the base model) let each head specialise. Total parameter count stays roughly the same as a single full-width head, and total FLOPs are essentially identical. A free expressivity gain.
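A self-contained NumPy sketch of the head split (weight shapes and names are illustrative; the actual implementation lived in Tensor2Tensor). The point is that projecting to h heads of width d_model/h uses the same number of parameters as one head of width d_model:

```python
import numpy as np

def _attention(Q, K, V):
    # Scaled dot-product attention, as in the earlier sketch.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(x, W_q, W_k, W_v, W_o, h=8):
    """x: (seq, d_model). Each W_* is (d_model, d_model), so the parameter
    count matches a single full-width head; the width is just split h ways."""
    seq, d_model = x.shape
    d_k = d_model // h                          # 512 / 8 = 64 in the base model
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = [_attention(Q[:, i*d_k:(i+1)*d_k],
                        K[:, i*d_k:(i+1)*d_k],
                        V[:, i*d_k:(i+1)*d_k]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_o  # concat heads, then final projection
```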
Visualisation work in 2018–19 (Clark et al, Voita et al) showed individual heads tracking interpretable phenomena: the verb of a clause, the subject of a relative pronoun, identical-token attention, position-offset attention. Most heads are redundant; some are specialised; the architecture is over-parameterised in a useful way.
The team's own framing at the time was modest: a faster, cleaner translation architecture. Almost none of the implications visible by 2020 (decoder-only language modelling at 175 B parameters, in-context learning, scaling laws, RLHF) were anticipated.
The transformer paper is famously short on novel components and long on careful integration. Honest accounting:
| Component | Status | Source |
|---|---|---|
| Attention | Borrowed | Bahdanau / Cho / Bengio 2014; Cheng et al 2016 (self-attention) |
| Encoder-decoder pattern | Borrowed | Sutskever / Cho 2014 |
| Residual connections | Borrowed | He et al, ResNet 2015 |
| Layer normalisation | Borrowed | Ba, Kiros, Hinton 2016 |
| Adam | Borrowed | Kingma & Ba 2014 |
| Dropout | Borrowed | Srivastava et al 2014 |
| Positional encodings | Borrowed (learned form) | Gehring et al, ConvS2S 2017; the sinusoidal form is new here |
| Scaled dot-product attention | New here | Vaswani et al 2017 |
| Multi-head attention | New here | Vaswani et al 2017 |
| Removing recurrence entirely | New here | Vaswani et al 2017 |
| Pre-LN variants | Later | Xiong et al 2020 (Post-LN was original) |
The paper is to NLP roughly what the iPhone was to mobile phones: very few of the components were genuinely new (touchscreen, mobile internet, browser, music player and camera all existed), but the integration was so much better than anything previously available that the result felt like a step change.
The paper landed on arXiv on 12 June 2017 (paper id 1706.03762). It was accepted to NeurIPS 2017 in Long Beach, where it was a poster, not an oral. The poster session was busy but not extraordinary.
A few people understood immediately. Most did not.
The arc of the paper's citation count is steep but not immediate: a few hundred cites in 2018, low thousands in 2019, then exponential growth. By 2024 it was the most-cited machine-learning paper of its publication-year cohort and one of the ten most cited papers in computer science overall.
In 2018, two papers used the transformer to train large language models with self-supervised pretraining and showed it generalised dramatically. They came out within four months of each other. They picked opposite halves of the architecture, and that choice mattered.
BERT was strictly better at understanding tasks circa 2019. GPT-2 (Feb 2019) and especially GPT-3 (May 2020) showed that scale + decoder-only + autoregressive pretraining generalised across a remarkably wide range of tasks, including understanding tasks, simply by prompting. By 2022 the encoder-only line had largely been folded into less-glamorous production work; the decoder-only line was the frontier.
Decoder-only is a modelling choice with a hidden architectural payoff: it lets you do in-context learning. A bidirectional encoder cannot meaningfully be prompted — there is no causal direction in which "the prompt comes first and the answer follows." Once GPT-3 demonstrated that prompting worked, the encoder-only branch lost most of its strategic value. Deck 05 picks this story up.
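A toy NumPy illustration (not from any of these papers) of the causal mask behind this: each position may attend only to itself and earlier positions, so a prompt placed at the start of the sequence conditions everything generated after it.

```python
import numpy as np

seq_len = 6                               # e.g. 4 prompt tokens followed by 2 generated tokens
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Row t lists the positions token t may attend to: positions 0..t only.
# In a decoder-only model the prompt occupies the first positions, so every
# later (generated) token can condition on it; a bidirectional encoder has no
# such ordering, which is why it cannot be prompted in the same way.
print(causal_mask.astype(int))
```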
Once BERT and GPT-1 had shown the recipe, the field moved with unusual speed. Within two and a half years you had:
| Year | Result | Lab | Why it mattered |
|---|---|---|---|
| 2019 Feb | GPT-2 (1.5 B) | OpenAI | Coherent paragraph-level generation. Staged release sets norms about disclosure. |
| 2019 Jul | RoBERTa | FAIR | BERT trained for longer with more data — comfortably beats BERT. Lesson: training matters more than architectural cleverness. |
| 2019 Oct | T5 (Raffel et al) | Google Brain | Encoder-decoder transformer at 11 B params. Unifies all NLP as text-to-text. C4 dataset. |
| 2019 Nov | BART | FAIR | Denoising encoder-decoder. Strong on summarisation. |
| 2020 Jan | Scaling Laws (Kaplan et al) | OpenAI | The empirical foundation of "just scale it". See deck 05. |
| 2020 May | GPT-3 (175 B) | OpenAI | In-context learning. Few-shot generalisation across hundreds of tasks. The tipping point. |
| 2021 Jan | Switch Transformer (1.6 T) | Google Brain | Mixture of Experts at scale. Foreshadowing GPT-4 and DeepSeek. |
For most of the 2010s a research group could lead the field for two or three years on a single architectural insight. After 2017 the cycle compressed dramatically: a published recipe became a product within months, an architectural variation became obsolete within a year, and any lab that did not have its own large pretrained model in 2020 was behind.
By mid-2023 all eight authors of the transformer paper had left Google. Some were already gone before GPT-3 was published. This is not the usual pattern at Google, where senior ICs often stay for a decade or more, and Brain was famously a fun place to work. (Shazeer is the only one who has since returned, via the 2024 Character.AI deal; see the diaspora table below.)
Each author has talked about their own reasons. Three patterns recur:
The Brain culture was publish, don't ship. Several authors explicitly wanted to ship products built on this architecture, which inside Google meant going through Search or Cloud (a multi-year programme), and outside meant founding a company.
Google paid well but in salary and RSUs, not company-defining equity. The 2018–2022 founding wave (OpenAI, Anthropic, Cohere, Adept, Character, Inflection, Mistral, Sakana, Inceptive, NEAR) gave researchers double-digit-percent ownership of named, well-funded entities.
Several authors have hinted at frustration that Google did not capitalise on its own breakthrough quickly. The ChatGPT moment (Nov 2022) is widely framed inside Google as a "we built the technology and someone else shipped it" moment. By then most of the transformer team was elsewhere.
The eight authors collectively founded or joined the leadership of seven different AI companies. In aggregate those companies have raised on the order of $20 B and employ more than 5,000 people.
| Author | First post-Google destination | Status (2026) |
|---|---|---|
| Vaswani | Adept (co-founder) | Essential AI (co-founder, after leaving Adept) |
| Shazeer | Character.AI (co-founder) | Google DeepMind (returned 2024 in Character reverse-acqui-hire) |
| Parmar | Adept (co-founder) | Essential AI (co-founder) |
| Uszkoreit | Inceptive (co-founder) | Inceptive — transformers for mRNA design |
| Jones | Sakana AI (co-founder, Tokyo, 2023) | Sakana AI |
| Gomez | Cohere (co-founder, CEO) | Cohere — one of the larger non-FrontierLab companies |
| Kaiser | OpenAI (research) | OpenAI — reportedly central to o-series |
| Polosukhin | NEAR Protocol | NEAR — pivoted back toward AI infrastructure |
Three of the eight (Vaswani, Parmar, and indirectly Shazeer) have been on at least two different AI start-up adventures since the paper. The fluidity of the AI labour market means high-leverage people often go through several founding rounds in five years — something almost unheard of in software in earlier eras.
The paper has held up remarkably well. The base architecture is essentially unchanged in 2026 — Llama, Qwen, GPT-5 and Claude 4 all use a recognisable Vaswani transformer with a few additions (RoPE positional encodings, RMSNorm, gated FFN variants like SwiGLU, MoE in some cases). The things the paper got wrong or did not anticipate:
The conclusion section ends: "We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video." Every word of that has come true.