An introductory map of a young, dense, still-moving field. The four eras of language modelling, the people thread that runs through all of them, and a guide to the nine decks that follow.
This is the introduction to a ten-deck history of large language models. Its job is to put a frame around the rest of the series — the eras, the people, the lab geography — so each later deck can dive into one slice of the picture without re-explaining the whole.
The field is small, young, and unusually personal. Most of the technology you use today — transformers, RLHF, scaling laws, instruction tuning — was invented by a few hundred researchers across about a dozen labs over twelve years. If you know the names you can read the field; if you do not, half the references in any modern paper are opaque.
Three things make the history more useful than just trivia:
- Decisions made in 2017–2020 (decoder-only, RLHF, the API-not-weights stance) shape what frontier models look like in 2026. Knowing why those choices were made tells you which are load-bearing and which are accidents.
- A surprising fraction of LLM research traces to three advisors — Hinton, Bengio, LeCun — through their direct students. Reading a paper is much faster when you can place its authors in that family tree.
- Frontier labs are often tiny — OpenAI was 100 people when it shipped GPT-3, Anthropic 7 when it incorporated. The personalities of the founders explain choices that look mysterious from the outside.
You can teach the technology of LLMs without any history. But you cannot make sense of why the field looks the way it does — why there are exactly three US frontier labs, why Meta open-weights and OpenAI does not, why China caught up in eighteen months, why every Anthropic paper has an interpretability appendix — without it.
Language modelling has moved through four overlapping eras since Shannon. The boundaries are fuzzy — statistical methods kept improving long after neural ones existed, and pre-transformer neural NLP is still in production at companies that never updated. But the centres of gravity are clear.
| Era | Dominant model | Centre of gravity | Killer demo |
|---|---|---|---|
| Symbolic / Statistical | n-grams, HMMs, IBM Models 1–5, log-linear | IBM T.J. Watson Research, Bell Labs, JHU CLSP, Microsoft Research | Google Translate (statistical, 2006) |
| Neural pre-Transformer | NPLM, RNN-LM, LSTM, seq2seq + Bahdanau attention | Toronto, Mila/Montreal, NYU, Stanford | Google Translate (neural, 2016) |
| Transformer | BERT, GPT-2/3, T5, encoder & decoder variants | Google Brain → OpenAI → many | GPT-3 in-context learning (2020) |
| Scaled Frontier | Decoder-only + RLHF + tools + RL on chains-of-thought | OpenAI, Anthropic, Google DeepMind, Meta, DeepSeek, xAI | ChatGPT (Nov 2022) → o1 (2024) / R1 (Jan 2025) |
The first era is 55 years long; the most recent is barely four. Compute, data, and money have compressed the tempo. The expectation among people in the field is that this compression continues — deck 10 explores what it might mean.
The era opens with Claude Shannon's 1948 A Mathematical Theory of Communication, which contained the first n-gram experiments — using English text to estimate letter and word probabilities, then sampling from those estimates to produce approximations to English. It closes with Yoshua Bengio's 2003 neural language model, although statistical methods kept winning benchmarks for years afterward.
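Shannon's experiment is easy to reproduce. Below is a minimal sketch in that spirit: a character-bigram model estimated from a toy corpus, then sampled. The corpus and names are illustrative, not Shannon's.

```python
import random
from collections import Counter, defaultdict

# Estimate P(next char | current char) from raw text, Shannon-style.
corpus = "the cat sat on the mat and the dog sat on the log"

counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1  # how often does b follow a?

def sample_next(ch):
    """Draw the next character in proportion to observed bigram counts."""
    following = counts[ch]
    return random.choices(list(following), weights=following.values())[0]

ch, out = "t", ["t"]
for _ in range(40):
    ch = sample_next(ch)
    out.append(ch)

# A "second-order approximation" to the toy corpus's English.
print("".join(out))
```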
Noam Chomsky's 1957 Syntactic Structures framed language as rule-governed and discrete. That line ran for decades through Charniak, Marcus (the Penn Treebank), and Manning's early work. Strong on structure, brittle on coverage. Lost to data.
Frederick Jelinek at IBM's T.J. Watson Research Center (and later JHU): "Every time I fire a linguist, the performance of the speech recogniser goes up." The IBM Models 1–5 for word alignment (Brown, Della Pietra, Mercer, 1993) are the foundation of statistical MT.
Neural networks were not new in 2003 — LeCun had MNIST in 1998, Hochreiter and Schmidhuber had LSTM in 1997. But the moment a neural net beat a statistical n-gram on language was Bengio's A Neural Probabilistic Language Model (2003).
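The architecture itself is small by modern standards: embed the previous n−1 words, concatenate, and predict the next word through one hidden layer. A minimal PyTorch sketch follows; the sizes are illustrative, and the paper's direct embedding-to-output connection is omitted.

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    """Sketch of Bengio et al. (2003): a feed-forward next-word predictor."""
    def __init__(self, vocab_size=10_000, context=3, dim=60, hidden=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)      # the shared word-feature matrix C
        self.hidden = nn.Linear(context * dim, hidden)  # H
        self.out = nn.Linear(hidden, vocab_size)        # U

    def forward(self, ctx):                 # ctx: (batch, context) word ids
        x = self.embed(ctx).flatten(1)      # concatenate the context embeddings
        h = torch.tanh(self.hidden(x))
        return self.out(h)                  # logits over the next word

model = NPLM()
ids = torch.randint(0, 10_000, (2, 3))      # dummy batch: two 3-word contexts
print(model(ids).shape)                     # torch.Size([2, 10000])
```

Train it with cross-entropy on next-word prediction and the rows of the embedding matrix become learned word vectors — which is why the table below credits this paper with the birth of word embeddings.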
Over the next fourteen years, neural language models and neural NLP slowly displaced statistical methods. The pace picked up dramatically after 2012's AlexNet showed deep learning worked at scale.
| Year | Result | Why it mattered |
|---|---|---|
| 2003 | Bengio NPLM | First neural net to beat n-grams on perplexity. Word embeddings born. |
| 2010 | Mikolov RNN-LM | Recurrent net + simple unrolling matches large n-gram models on speech recognition rescoring. |
| 2013 | word2vec (Mikolov et al, Google) | Word embeddings as a standalone artefact. king − man + woman ≈ queen (see the sketch after this table). |
| 2014 | Sutskever / Vinyals / Le seq2seq + Cho GRU | Encoder-decoder pattern; LSTM as default. Translation becomes a neural pipeline. |
| 2014 | Bahdanau / Cho / Bengio attention | Soft alignment between encoder and decoder. The seed crystal of the transformer. |
| 2017 | Transformer (Vaswani et al) | Throw away the recurrence. Attention is all you need. |
| 2018 | ELMo / contextual embeddings (Peters et al) | Embeddings stop being a single vector per word; they become per-context. Still LSTM-based: the last big pre-transformer result. |
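The analogy in the 2013 row is one line of code to check. A hedged sketch using gensim's downloader; "glove-wiki-gigaword-100" is one of its bundled pretrained sets (my choice — the original Google News word2vec vectors behave similarly but are a much larger download).

```python
import gensim.downloader as api

# Load a bundled set of pretrained word vectors (~130 MB on first run).
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman: the nearest remaining vector should be "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.78...)] with these particular vectors
```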
Notice the geographic story: Toronto, Mila/Montreal, NYU, and Google Brain dominate. This is the era when university labs were genuinely the centre of frontier work. That ends in 2018.
The transformer paper itself is deck 03 of this series. What matters here is what happened in the five years after it. Two camps formed almost immediately:
The encoder camp (BERT): Google AI, Oct 2018. Devlin, Chang, Lee, Toutanova. Pretrain on masked language modelling, fine-tune for everything else. Dominated NLP benchmarks 2018–2020. The pattern most enterprise NLP still uses.
The decoder camp (GPT): OpenAI, Jun 2018 (GPT-1) → Feb 2019 (GPT-2) → May 2020 (GPT-3). Radford, Wu and Child on the models; Kaplan's scaling laws (2020) made the case for going bigger. Generative, scaled, zero-shot. The pattern that wins.
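The split between the camps is visible at the model surface. A minimal illustration using Hugging Face transformers pipelines with the public bert-base-uncased and gpt2 checkpoints (my choice of stand-ins for the two camps): the encoder fills a blank using context from both sides; the decoder continues a prefix left to right.

```python
from transformers import pipeline

# Camp 1: masked language modelling. BERT sees the whole sentence
# and predicts the blanked-out token.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The capital of France is [MASK].")[0]["token_str"])  # typically "paris"

# Camp 2: autoregressive generation. GPT-2 sees only the prefix
# and predicts what comes next, one token at a time.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5)[0]["generated_text"])
```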
Almost all of the scientific work was published openly. Papers, code, often weights. The change from open to closed happens quite cleanly with GPT-4 in early 2023, right at the boundary between Era 3 and Era 4.
The era opens with ChatGPT (30 November 2022) and is defined by a single fact: frontier models are now serious products with serious revenue, and the labs that build them act accordingly. The defining changes:
The geography flips again. Era 2 was university labs; Era 3 was a handful of US industry labs; Era 4 is multipolar — six US labs, a major British arm of Google, four to six serious Chinese labs, and an open-weight ecosystem orbiting Meta, Mistral, Qwen and DeepSeek.
DeepSeek R1 (January 2025) was released with open weights and a paper, matched OpenAI o1 on most benchmarks, and reportedly cost a tiny fraction of a US frontier run. It is the moment the question "is the frontier still mostly American?" stopped being rhetorical. Deck 09 is dedicated to this.
The intellectual lineage of modern LLMs runs through three people. They shared the 2018 Turing Award and have been called the Godfathers of Deep Learning ever since. Their direct and indirect students built almost all the labs you have heard of.
Geoffrey Hinton: backprop (with Rumelhart and Williams, 1986), Boltzmann machines, dropout, capsule networks. Direct supervisor of Sutskever, Salakhutdinov, Krizhevsky and Mnih; Graves came through his lab as a postdoc. Resigned from Google in 2023 specifically so he could speak freely about AI risk.
Yoshua Bengio: the neural language model (2003), attention with Bahdanau and Cho (2014), GANs (with Goodfellow as his student). Mila is the largest deep-learning academic group in the world. He now devotes most of his time to AI safety.
Yann LeCun: convolutional networks (1989), MNIST, the lead voice for open-weight research and a vocal critic of pure-LLM AGI roadmaps. The JEPA / world-model proposal is his counter-programme. As Meta's chief AI scientist, he made Llama possible.
Roughly: if a paper is from Toronto, Google Brain (pre-2023) or DeepMind, expect a Hinton trace. If from Mila, NYU pre-2010, or Meta AI, expect Bengio or LeCun. If from OpenAI or Anthropic, expect ex-Brain ex-DeepMind people whose advisors trace to one of the three. Deck 04 maps this in detail.
For an outsider the field looks crowded. From inside it is dramatically smaller than that — perhaps fifteen labs that genuinely matter, clustered around a handful of cities.
The Bay Area: OpenAI (Mission), Anthropic (SoMa), Google DeepMind (Mountain View, plus the old Brain campus), Meta AI (Menlo Park), xAI (Palo Alto/SF), Stanford NLP, Berkeley AI Research.
Europe: Google DeepMind (King's Cross, the original), Mistral (Paris), Meta AI Paris (FAIR), UK AISI, the Oxford / Cambridge groups, Reka.
China: DeepSeek (Hangzhou, via High-Flyer), Qwen / Alibaba DAMO (Hangzhou), Moonshot / Zhipu (Beijing), MiniMax (Shanghai), Baidu (Beijing), Tencent (Shenzhen), 01.AI (Beijing).
Canada: Vector Institute (Toronto, Hinton-aligned), Mila (Montreal, Bengio-aligned), Cohere (Toronto, Aidan Gomez), and the Element AI legacy.
The US East Coast and Seattle: NYU (LeCun), MIT, Princeton (Narayanan, Chen), Boston University, Allen AI (Seattle), Microsoft Research (Redmond).
Pittsburgh: CMU (Salakhutdinov, Mitchell, Cohen, Singh), the second-most-cited NLP lab in the US after Stanford.
Frontier-lab founders tend to locate near their academic origins or their target talent pool rather than near customers. Anthropic is in San Francisco because its founders left OpenAI and lived there. DeepMind stayed in London because Demis Hassabis lives there. xAI clusters in SF for the same reason as everyone else: that is where the senior individual-contributor pool is.
Across all four eras there are a few persistent through-lines — arguments and tensions that keep recurring, even when the technology underneath has changed completely.
Rules vs statistics: this argument is older than this story. Chomsky vs Jelinek in the 1980s; LeCun vs Marcus today. Each generation has a version of "is intelligence built on rules or on statistics?", and each generation the statistical side wins another notch. We are not done.
Open vs closed: BBN published. IBM published. Google published BERT and the transformer. OpenAI published GPT-2 and GPT-3. Then GPT-4 was a black box. Then Llama was open. Then DeepSeek R1. The default keeps moving and the question is genuinely contested.
Safety and existential risk: the argument in its modern form barely predates Bostrom's 2014 Superintelligence. By 2016 it was a founding rationale for OpenAI. By 2021 it caused the split that produced Anthropic. By 2023 it produced the OpenAI board crisis. The argument is still live and is one of the few that genuinely constrains what gets built.
Each of these threads comes back in every later deck. The OpenAI deck (05) is partly the story of the safety thread breaking. The Anthropic deck (06) is the story of it holding. The Meta deck (08) carries the open-vs-closed thread. The future-directions deck (10) is essentially a forecast of which way each thread bends next.
The 70-year arc as a single scrollable list. The colour stripe in each row indicates which era the event sits in (red — statistical; amber — neural pre-Transformer; violet — transformer; green — scaled frontier).
The remaining nine decks split into three groups: the technical spine, the lab profiles, and the forecast.
The technical content of LLMs is in the rest of the LLMs hub — not here. Transformer Architecture, Modern Architectures, Fine Tuning, Reasoning all dive into the engineering. This series provides the context for those decks, not a replacement.