LLM History Series — Presentation 03

The Transformer Paper — Eight Authors at Google

Attention Is All You Need is famous as a paper. It is also a snapshot of an unusual moment when eight researchers across Google Brain and Google Research briefly collaborated and produced the architecture that runs the modern world.

[Title-slide graphic: the eight authors and a timeline running Bahdanau attention → self-attention papers (2016) → Brain prototype (early 2017) → arXiv (Jun 2017) → NeurIPS (Dec 2017) → BERT & GPT-1 (2018)]
00

What This Deck Covers

The transformer paper has been written about thousands of times for its technical content. This deck covers it as a piece of history: the room it came out of, the eight people in that room, what each of them contributed, why they all left Google, and where each went next.

01

The Setting — Google Brain in 2016–17

Google Brain was founded in 2011 by Jeff Dean, Andrew Ng, and Greg Corrado as part of Google X. By 2016 it was a roughly 500-person research organisation in Mountain View, with a sister team in Zürich and an often-uneasy sibling relationship with the much smaller, much fancier DeepMind in London (the two were parallel research orgs from Google's 2014 acquisition of DeepMind onwards; deck 07 covers DeepMind).

Brain in this period had four things in unusual combination:

What Brain had

  • Compute. First-generation TPUs in 2015, v2 pods in early 2017. Nobody else in the world had this.
  • People. Hinton (joined 2013), Bengio's students, Schmidhuber's students, the JHU/CMU/Stanford pipeline, plus a growing cohort of European researchers.
  • Publication culture. Almost everything got published. Tensor2Tensor was open-source.
  • Loose research direction. No single boss told the team what to work on.

The transformer team's specific situation

  • Working on Tensor2Tensor — an attempt at a unified seq2seq framework.
  • Trying to make NMT (Google Translate's bread and butter) train faster.
  • Frustrated with the wall-clock cost of LSTM training, which did not parallelise well across a TPU pod.
  • Aware of the Bahdanau attention paper, the Cheng et al 2016 self-attention paper, and the Lin et al 2017 structured self-attention paper.

The crucial constraint

An LSTM is sequential by construction — you cannot start step t+1 until step t has finished. On a TPU pod that means most of the chips sit idle. The transformer's attraction was, originally, that it parallelised. The fact that it generalised better was, by several accounts, a genuine surprise.
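
To make the constraint concrete, here is a minimal NumPy sketch with toy shapes and illustrative variable names (this is not the Tensor2Tensor code): the recurrent update has to be applied one timestep at a time, while the attention computation for all positions is a handful of matrix multiplies that an accelerator can run at once.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                      # toy sequence length and model width
x = rng.standard_normal((T, d))  # one input sequence

# Recurrent layer: step t+1 cannot begin until step t has produced h,
# so the loop below is inherently serial.
W_x, W_h = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h, states = np.zeros(d), []
for t in range(T):
    h = np.tanh(x[t] @ W_x + h @ W_h)
    states.append(h)

# Self-attention layer: every position is computed in one batch of
# matrix multiplies, with no step-to-step dependency to wait on.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
weights = np.exp(Q @ K.T / np.sqrt(d))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                # (T, d), all positions at once
```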

02

The Eight Authors

The author list, in order: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. A footnote on the paper notes that all authors contributed equally and that the listing order was random. That is not a polite fiction — several of the eight have publicly described it as a genuine team effort with no clear lead.

That said, there is a rough division of labour the authors have described in various interviews:

AV

Ashish Vaswani — first author

USC PhD → Google Brain → Adept → Essential AI (founder)

Indian; did his PhD at USC under Daniel Marcu. First author only by the luck of the random ordering, but heavily involved in the experimental work and the writing. Calm and methodical. Co-founded Adept in 2022 with Niki Parmar; later left to found Essential AI.

NS

Noam Shazeer

Google (1999–2021) → Character.AI co-founder → Google DeepMind (2024–)

Brilliant, idiosyncratic, and reportedly the highest-paid IC at Google for years. Joined as roughly employee number 200, was central to AdSense and several other major systems before moving fully into ML. Pushed multi-head attention as a generalisation and did much of the actual implementation. Founded Character.AI in 2021, then returned to Google with Daniel De Freitas in a 2024 reverse-acqui-hire that reportedly paid out over $2 B. Now back at DeepMind working on Gemini.

NP

Niki Parmar

USC → Google Brain → Adept (co-founder) → Essential AI (co-founder)

Indian. Did her master's at USC in the same lab as Vaswani and joined Brain shortly after him. First author of the Image Transformer follow-up paper (2018). One of very few women among the principal architects of a paper that defined the LLM era.

JU

Jakob Uszkoreit

Google Brain → Inceptive (co-founder, 2021)

German; son of Hans Uszkoreit, a well-known computational linguist at Saarbrücken. Multilingual, articulate, and widely credited with pushing the conviction that attention alone could do the job — reportedly the one who argued for dropping recurrence completely. Now applies transformers to mRNA design at Inceptive.

LJ

Llion Jones

Google Brain → Sakana AI (co-founder, Tokyo, 2023)

Welsh. Did much of the implementation work in Tensor2Tensor, including the production-quality attention layers everyone subsequently used. Soft-spoken; the last of the eight to leave Google, in 2023. Now in Tokyo running Sakana AI with David Ha.

AG

Aidan N. Gomez

Google Brain intern (2017) → Oxford → Cohere (co-founder, 2019)

Canadian. The youngest author by a margin — a 21-year-old University of Toronto undergraduate intern at Brain when the paper was written. Went on to do his DPhil at Oxford under Yarin Gal, then co-founded Cohere with Ivan Zhang and Nick Frosst. CEO of Cohere today.

ŁK

Łukasz Kaiser

Google Brain → OpenAI (2021–)

Polish; formerly a theoretical computer scientist at the University of Warsaw and Université Paris Diderot. Also a co-author of the One Model To Learn Them All paper. The only one of the eight to join OpenAI (in 2021); reportedly central to o1 / o3 training infrastructure.

IP

Illia Polosukhin

Google → NEAR Protocol (co-founder, 2017)

Ukrainian. Left Google in 2017, before the paper's impact had registered publicly, to co-found NEAR Protocol, a layer-1 blockchain. Has spoken candidly about how the original NEAR mission was AI-related and pivoted to crypto when funding made it convenient. Has moved back closer to AI work in recent years.

An unusual fact about the team

Few of the eight came to the project with a deep statistical-NLP background. They were a mix of ML engineers, theoretical computer scientists, an undergraduate, and people who had done other things at Google for years. The paper is in some sense an outsider paper to traditional NLP, written by people whose tools were TPUs and Tensor2Tensor rather than Penn Treebank and CoNLL.

03

What the Paper Actually Says

The paper is structured as a single proposal: build sequence-to-sequence models out of attention layers and feed-forward layers only, with no recurrence. Everything else is in service of making that work.

The architecture in one paragraph

An encoder stack of N = 6 identical layers; a decoder stack of N = 6 identical layers. Each encoder layer has multi-head self-attention plus a position-wise feed-forward network. Each decoder layer has masked multi-head self-attention, then cross-attention into the encoder output, then position-wise FFN. Residual connections and layer norm everywhere. Sinusoidal positional encodings added to the input embeddings.

Scaled dot-product attention

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

Q, K, V are the queries, keys and values; in self-attention all three are derived from the same input by separate linear projections. The √d_k scaling is a small but important detail: without it, large dot products push the softmax into a saturated regime and gradients vanish. The scaling was reportedly Shazeer's contribution.
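
A minimal NumPy rendering of that formula, with illustrative shapes (the paper's own implementation lives in Tensor2Tensor; this is just a sketch):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (T_q, T_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # (T_q, d_v)

# Self-attention: Q, K and V all come from the same input x
# via separate linear projections.
rng = np.random.default_rng(0)
T, d_model = 5, 8
x = rng.standard_normal((T, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)  # (5, 8)
```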

Multi-head attention

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

The paper's most underrated technical contribution. Rather than one attention computation, do h = 8 in parallel with smaller key dimension, and concatenate. This lets a single layer attend to multiple kinds of relationship simultaneously — e.g. one head tracks syntactic dependencies, another long-range coreference, another simple positional offsets. Slide 04 expands on this.
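
A sketch of the multi-head version, reusing scaled_dot_product_attention from the previous block (again illustrative, not the paper's code): each head's projections are taken as column slices of one large projection matrix, which is why the total parameter count works out the same as a single big head.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    """Run h smaller attentions side by side, concatenate, project with W^O."""
    T, d_model = x.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        cols = slice(i * d_k, (i + 1) * d_k)   # head i's slice of each projection
        heads.append(scaled_dot_product_attention(
            x @ W_q[:, cols], x @ W_k[:, cols], x @ W_v[:, cols]))
    return np.concatenate(heads, axis=-1) @ W_o   # (T, d_model)

rng = np.random.default_rng(0)
T, d_model, h = 5, 8, 2
x = rng.standard_normal((T, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
y = multi_head_attention(x, W_q, W_k, W_v, W_o, h)  # (5, 8)
```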

Headline numbers

Task | Best prior (2016–17) | Transformer base | Transformer big
WMT 2014 EN→DE | 26.30 BLEU (GNMT + attention ensemble) | 27.3 | 28.4
WMT 2014 EN→FR | 40.56 BLEU | 38.1 | 41.0
Training cost | ~weeks on a GPU cluster | 12 h on 8 P100 GPUs | 3.5 days on 8 P100 GPUs

The training-cost line, easy to miss, is what turned heads inside Google. The transformer was state of the art on translation and trained an order of magnitude faster than the GNMT system it replaced.

04

Multi-Head Attention as the Real Invention

The headline of the paper is self-attention works. But self-attention had been published already — Cheng et al 2016, Lin et al 2017. What had not been published was multi-head self-attention with the specific design used in the transformer. This is the invention that holds up best in retrospect.

Why one head is not enough

A single attention head computes one weighted sum per position. It can model one type of relationship at a time; its capacity is bounded by the rank of the attention matrix.

Why h heads helps

h = 8 parallel heads with a smaller per-head dimension d_k = d_model / h let each head specialise. Total parameters stay roughly the same as a single big head, and total FLOPs are identical: a free expressivity gain.
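
A back-of-envelope check of that parameter claim, using the base model's dimensions (d_model = 512, h = 8) and ignoring the shared output projection W^O:

```python
d_model, h = 512, 8
d_k = d_model // h                        # 64 per head in the base model

# One big head: a d_model x d_model projection each for Q, K and V.
single_head_params = 3 * d_model * d_model

# h small heads: a d_model x d_k projection each for Q, K and V, per head.
multi_head_params = 3 * h * (d_model * d_k)

assert single_head_params == multi_head_params == 786_432
```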

What the heads actually do

Visualisation work in 2018–19 (Clark et al, Voita et al) showed individual heads tracking interpretable phenomena: the verb of a clause, the subject of a relative pronoun, identical-token attention, position-offset attention. Most heads are redundant; some are specialised; the architecture is over-parameterised in a useful way.

We thought maybe we'd publish this and people would use it instead of LSTMs for translation. We were not thinking about anything beyond that. — Aidan Gomez, in interviews around the paper's fifth anniversary (2022)

The team's own framing at the time was modest: a faster, cleaner translation architecture. Almost none of the implications visible by 2020 (decoder-only language modelling at 175 B parameters, in-context learning, scaling laws, RLHF) were anticipated.

05

What Was New, What Was Borrowed

The transformer paper is famously short on novel components and long on careful integration. Honest accounting:

Component | Status | Source
Attention | Borrowed | Bahdanau / Cho / Bengio 2014; Cheng et al 2016 (self-attention)
Encoder-decoder pattern | Borrowed | Sutskever / Cho 2014
Residual connections | Borrowed | He et al, ResNet 2015
Layer normalisation | Borrowed | Ba, Kiros, Hinton 2016
Adam | Borrowed | Kingma & Ba 2014
Dropout | Borrowed | Srivastava et al 2014
Positional encodings (the idea) | Borrowed | Gehring et al, ConvS2S 2017 (used learned positions; the sinusoidal form was new here)
Scaled dot-product attention | New here | Vaswani et al 2017
Multi-head attention | New here | Vaswani et al 2017
Removing recurrence entirely | New here | Vaswani et al 2017
Pre-LN variants | Later | Xiong et al 2020 (Post-LN was the original)

A useful framing

The paper is to NLP roughly what the iPhone was to mobile phones: very few of the components were genuinely new (touchscreen, mobile internet, browser, music player and camera all existed), but the integration was so much better than anything previously available that the result felt like a step change.

06

The Reception — December 2017

The paper landed on arXiv on 12 June 2017 (paper id 1706.03762). It was accepted to NeurIPS 2017 in Long Beach, where it was a poster, not an oral. The poster session was busy but not extraordinary.

A few people understood immediately. Most did not.

Who got it immediately

  • Alec Radford, OpenAI — began work on what became GPT-1 within months.
  • Jacob Devlin, Google AI Language — began work on what became BERT.
  • Colin Raffel, Brain — began work on what became T5.
  • Andrej Karpathy — an early, vocal advocate of the architecture; later wrote minGPT as a minimal teaching implementation of the decoder-only design.

Who didn't

  • Most of the academic NLP community for a year — the 2018 ACL programme was still roughly 80% LSTM-based work.
  • DeepMind, which spent 2017–2019 on RL before catching up on language.
  • Microsoft Research, which had a strong NMT group focused on related but distinct architectures.
  • Most of Google outside the immediate team. The transformer was an internal one-off for the better part of a year.

The arc of the paper's citation count is steep but not immediate: a few hundred citations in 2018, low thousands in 2019, then exponential growth. By 2024 it was the most-cited machine-learning paper of its publication-year cohort and one of the ten most-cited papers in computer science overall.

07

The 2018 Fork — BERT vs GPT-1

In 2018, two papers used the transformer to train large language models with self-supervised pretraining and showed it generalised dramatically. They came out within four months of each other. They picked opposite halves of the architecture, and that choice mattered.

BERT — Devlin, Chang, Lee, Toutanova; Google AI; Oct 2018

  • Encoder-only. Bidirectional — every token attends to every other.
  • Trained with masked language modelling (MLM): mask 15% of tokens, predict them.
  • Fine-tune the entire model on each downstream task.
  • Dominates GLUE / SQuAD overnight. Becomes the workhorse of enterprise NLP.

GPT-1 — Radford, Narasimhan, Salimans, Sutskever; OpenAI; Jun 2018

  • Decoder-only. Autoregressive — each token attends only to previous tokens.
  • Trained with next-token prediction (contrasted with BERT's masking in the toy sketch after this list).
  • Fine-tune for downstream tasks (initially) but the architecture is generative.
  • 117 M parameters. Modest at the time. Sets the template for everything OpenAI does next.
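
To make the two training objectives concrete, a toy sketch (purely illustrative: no real tokeniser, no special tokens, and BERT's 80/10/10 replacement scheme is left out):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# BERT-style masked language modelling: hide roughly 15% of positions and
# predict the hidden tokens using the full bidirectional context.
masked = rng.random(len(tokens)) < 0.15
mlm_inputs  = ["[MASK]" if m else t for t, m in zip(tokens, masked)]
mlm_targets = [t if m else None for t, m in zip(tokens, masked)]

# GPT-style next-token prediction: every position predicts the token to its
# right, and may only look at the tokens to its left (causal masking).
lm_inputs  = tokens[:-1]   # ["the", "cat", "sat", "on", "the"]
lm_targets = tokens[1:]    # ["cat", "sat", "on", "the", "mat"]
```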

Why decoder-only won the long run

BERT was strictly better at understanding tasks circa 2019. GPT-2 (Feb 2019) and especially GPT-3 (May 2020) showed that scale + decoder-only + autoregressive pretraining generalised to every task, including understanding tasks, simply by prompting. By 2022 the encoder-only line had largely been folded into less-glamorous production work; the decoder-only line was the frontier.

A subtle point about the decoder choice

Decoder-only is a modelling choice with a hidden architectural payoff: it lets you do in-context learning. A bidirectional encoder cannot meaningfully be prompted — there is no causal direction in which "the prompt comes first and the answer follows." Once GPT-3 demonstrated that prompting worked, the encoder-only branch lost most of its strategic value. Deck 05 picks this story up.
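
A minimal sketch of that causal structure (NumPy, toy shapes): the mask below is what lets a decoder-only model treat everything already in the sequence, including the prompt, as context for what comes next.

```python
import numpy as np

T = 5                                    # toy sequence length
scores = np.random.default_rng(0).standard_normal((T, T))

# Causal mask: -inf above the diagonal, so position i attends only to
# positions 0..i. Prompt tokens simply occupy the early positions.
causal_mask = np.triu(np.full((T, T), -np.inf), k=1)
weights = np.exp(scores + causal_mask)
weights /= weights.sum(axis=-1, keepdims=True)
assert np.allclose(np.tril(weights), weights)   # nothing attends to the future

# A bidirectional encoder applies no such mask: every position sees every
# other, so there is no "prompt first, answer after" direction to exploit.
```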

08

2019–2020 — Acceleration

Once BERT and GPT-1 had shown the recipe, the field moved with unusual speed. Within 24 months you had:

Year | Result | Lab | Why it mattered
2019 Feb | GPT-2 (1.5 B) | OpenAI | Coherent paragraph-level generation. Staged release sets norms about disclosure.
2019 Jul | RoBERTa | FAIR | BERT trained for longer with more data — comfortably beats BERT. Lesson: training matters more than architectural cleverness.
2019 Oct | T5 (Raffel et al) | Google Brain | Encoder-decoder transformer at 11 B params. Unifies all NLP as text-to-text. C4 dataset.
2019 Nov | BART | FAIR | Denoising encoder-decoder. Strong on summarisation.
2020 Jan | Scaling Laws (Kaplan et al) | OpenAI | The empirical foundation of "just scale it". See deck 05.
2020 May | GPT-3 (175 B) | OpenAI | In-context learning. Few-shot generalisation across hundreds of tasks. The tipping point.
2021 Jan | Switch Transformer (1.6 T) | Google Brain | Mixture of Experts at scale. Foreshadowing GPT-4 and DeepSeek.

The reason the speed felt unusual

For most of the 2010s a research group could lead the field for two or three years on a single architectural insight. After 2017 the cycle compressed dramatically: a published recipe became a product within months, an architectural variation became obsolete within a year, and any lab that did not have its own large pretrained model in 2020 was behind.

09

Why They All Left Google

By late 2023 all eight authors of the transformer paper had left Google. Several were already gone before GPT-3 was published. This is not the usual pattern at Google, where senior ICs routinely stay for a decade or more, and Brain was famously a fun place to work. (Shazeer is the only one who has since returned, via the 2024 Character.AI deal; see the diaspora table below.)

Each author has talked about their own reasons. Three patterns recur:

1. Risk-tolerance mismatch

The Brain culture was publish, don't ship. Several authors explicitly wanted to ship products built on this architecture, which inside Google meant going through Search or Cloud (a multi-year programme), and outside meant founding a company.

2. Compensation / equity

Google paid well but in salary and RSUs, not company-defining equity. The 2018–2022 founding wave (OpenAI, Anthropic, Cohere, Adept, Character, Inflection, Mistral, Sakana, Inceptive, NEAR) gave researchers double-digit-percent ownership of named, well-funded entities.

3. Strategic frustration

Several authors have hinted at frustration that Google did not capitalise on its own breakthrough quickly. The ChatGPT moment (Nov 2022) is widely framed inside Google as a "we built the technology and someone else shipped it" moment. By then most of the transformer team was elsewhere.

We had everything we needed to ship something like ChatGPT in 2019. We didn't. — Reportedly said by several Brain alumni in interviews after 2022; the sentiment is widespread.

10

The Diaspora — Where the Eight Went

The eight authors collectively founded or joined the leadership of seven different AI companies. In aggregate those companies have raised on the order of $20 B and employ more than 5,000 people.

Author | First stop after Google | Status (2026)
Vaswani | Adept (co-founder) | Essential AI (founder, after leaving Adept)
Shazeer | Character.AI (co-founder) | Google DeepMind (returned 2024 in the Character reverse-acqui-hire)
Parmar | Adept (co-founder) | Essential AI (co-founder)
Uszkoreit | Inceptive (co-founder) | Inceptive — transformers for mRNA design
Jones | Sakana AI (co-founder, Tokyo) | Sakana AI
Gomez | Cohere (co-founder, CEO) | Cohere — one of the larger non-frontier-lab companies
Kaiser | OpenAI (research) | OpenAI — reportedly central to the o-series
Polosukhin | NEAR Protocol | NEAR — pivoted back toward AI infrastructure

A pattern

Three of the eight (Vaswani, Parmar and, if you count the return to Google, Shazeer) have made at least two major moves since the paper. The fluidity of the AI labour market means high-leverage people often go through several founding rounds in five years — something almost unheard of in software in earlier eras.

11

What the Paper Got Wrong (and What It Underestimated)

The paper has held up remarkably well. The base architecture is essentially unchanged in 2026 — Llama, Qwen, GPT-5 and Claude 4 all use a recognisable Vaswani transformer with a few additions (RoPE positional encodings, RMSNorm, gated FFN variants like SwiGLU, MoE in some cases). The things the paper got wrong or did not anticipate:

What it got wrong

  • Sinusoidal positional encodings — superseded by learned embeddings, then by RoPE (rotary encodings), which handle long contexts better.
  • Post-LN — the original places LayerNorm after the residual addition. Pre-LN is far more stable at depth (see the sketch after this list).
  • ReLU FFN — superseded by GELU, SwiGLU and other gated variants.
  • Discarded alternatives — the paper considers and sets aside options, such as learned positional embeddings, that later models returned to.
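
A minimal sketch of the Post-LN vs Pre-LN difference flagged above (simplified: layer norm without learned scale and bias, and a stand-in sublayer instead of real attention or an FFN):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original 2017 transformer: normalise after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Later variant (Xiong et al 2020): normalise the sublayer's input and
    # leave the residual path untouched; far more stable in deep stacks.
    return x + sublayer(layer_norm(x))

x = np.random.default_rng(0).standard_normal((4, 8))
ffn = lambda h: np.maximum(h, 0.0)       # stand-in sublayer (a ReLU "FFN")
out_post = post_ln_block(x, ffn)
out_pre = pre_ln_block(x, ffn)
```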

What it underestimated

  • How much it would scale. The paper trains models with at most 213 M parameters. They believed scaling would help; nobody guessed three orders of magnitude.
  • How dominant decoder-only would become. The paper assumes encoder-decoder is the natural default. By 2022 decoder-only had won.
  • How many problems would reduce to language. Vision, audio, code, robotics, protein structure, genomics, music — a transformer of some flavour now dominates each one.
  • The economic implications. It was a research paper. By 2024 the architectures it founded were sustaining a $200 B+ industry.

The paper's most underrated claim

The conclusion section ends: "We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video." Every word of that has come true.

12

Cheat Sheet

The paper in five bullets

  • Drop recurrence; build seq2seq from attention + FFN only.
  • Multi-head scaled dot-product attention is the key novelty.
  • Trains an order of magnitude faster than LSTM-based NMT.
  • State of the art on WMT EN-DE / EN-FR.
  • Code released as part of Tensor2Tensor.

The eight authors

  • Vaswani → Adept → Essential AI
  • Shazeer → Character → back to Google DeepMind
  • Parmar → Adept → Essential AI
  • Uszkoreit → Inceptive (mRNA)
  • Jones → Sakana AI (Tokyo)
  • Gomez → Cohere
  • Kaiser → OpenAI
  • Polosukhin → NEAR Protocol

The 2018 fork

  • Encoder-only → BERT (Google AI). Workhorse for understanding tasks.
  • Decoder-only → GPT-1 (OpenAI). Generative; wins the long run.
  • Encoder-decoder → T5, BART. Strong on translation/summarisation.

What's next in the series

  • 04 — the university labs that produced these people.
  • 05 — OpenAI, where decoder-only became the frontier.
  • 07 — Google DeepMind, where the paper came from but the product never quite did.