LLM History Series — Presentation 01

The History of LLMs — Four Eras, Two Hundred People

An introductory map of a young, dense, still-moving field. The four eras of language modelling, the people thread that runs through all of them, and a guide to the nine decks that follow.

[Title graphic: a timeline from 1948 to the present marking the four eras — Symbolic / statistical (1948–2003), Neural pre-transformer (2003–2017), Transformer (2017–2022), Scaled frontier (2022–) — with Hinton, Bengio, LeCun and the frontier labs called out.]
00

What This Deck Covers

This is the introduction to a ten-deck history of large language models. Its job is to put a frame around the rest of the series — the eras, the people, the lab geography — so each later deck can dive into one slice of the picture without re-explaining the whole.

01

Why LLM History is Worth Knowing

The field is small, young, and unusually personal. Most of the technology you use today — transformers, RLHF, scaling laws, instruction tuning — was invented by a few hundred researchers across about a dozen labs over twelve years. If you know the names you can read the field; if you do not, half the references in any modern paper are opaque.

Three things make the history more useful than just trivia:

1. Path dependence

Decisions made in 2017–2020 (decoder-only, RLHF, the API-not-weights stance) shape what frontier models look like in 2026. Knowing why those choices were made tells you which are load-bearing and which are accidents.

2. Lineage

A surprising fraction of LLM research traces to three advisors — Hinton, Bengio, LeCun — through their direct students. Reading a paper is much faster when you can place its authors in that family tree.

3. Personality

Frontier labs start small — OpenAI was about 100 people when it shipped GPT-3, and Anthropic was 7 when it incorporated. The personalities of the founders explain choices that look mysterious from the outside.

A useful frame

You can teach the technology of LLMs without any history. But you cannot make sense of why the field looks the way it does — why there are exactly three US frontier labs, why Meta open-weights and OpenAI does not, why China caught up in eighteen months, why every Anthropic paper has an interpretability appendix — without it.

02

The Four Eras at a Glance

Language modelling has moved through four overlapping eras since Shannon. The boundaries are fuzzy — statistical methods kept improving long after neural ones existed, and pre-transformer neural NLP is still in production at companies that never updated. But the centres of gravity are clear.

Symbolic / Statistical · 1948–2003
Dominant model: n-grams, HMMs, IBM Models 1–5, log-linear models. Centre of gravity: IBM (Watson Research), Bell Labs, JHU CLSP, Microsoft Research. Killer demo: Google Translate (statistical, 2006).

Neural pre-Transformer · 2003–2017
Dominant model: NPLM, RNN-LM, LSTM, seq2seq + Bahdanau attention. Centre of gravity: Toronto, Mila/Montreal, NYU, Stanford. Killer demo: Google Translate (neural, 2016).

Transformer · 2017–2022
Dominant model: BERT, GPT-2/3, T5, encoder & decoder variants. Centre of gravity: Google Brain → OpenAI → many. Killer demo: GPT-3 in-context learning (2020).

Scaled Frontier · 2022–
Dominant model: decoder-only + RLHF + tools + RL on chains of thought. Centre of gravity: OpenAI, Anthropic, Google DeepMind, Meta, DeepSeek, xAI. Killer demo: ChatGPT (Nov 2022) → o1/R1 (2024–25).
A note on the lengths

The first era is 55 years long; the most recent is barely four. Compute, data, and money have compressed the tempo. The expectation among people in the field is that this compression continues — deck 10 explores what it might mean.

03

Era 1 — Symbolic & Statistical NLP (1948–2003)

The era opens with Claude Shannon's 1948 A Mathematical Theory of Communication, which contained the first n-gram experiments — using English text to estimate letter and word probabilities. It closes with Yoshua Bengio's 2003 neural language model, although statistical methods kept winning benchmarks for years afterward.
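Shannon's experiment is small enough to reproduce in a few lines. A minimal sketch (the toy corpus and the sampling loop are illustrative, not Shannon's original procedure): count character bigrams, then generate text whose letter statistics resemble English.

```python
# Shannon-style character bigram model (illustrative sketch):
# estimate P(next char | current char) from a corpus, then sample from it.
import random
from collections import Counter, defaultdict

corpus = "the theory of communication treats text as a stochastic process"

counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1          # count each bigram a -> b

def sample_next(ch):
    """Draw the next character in proportion to observed bigram counts."""
    options = counts[ch]
    total = sum(options.values())
    r = random.uniform(0, total)
    for nxt, c in options.items():
        r -= c
        if r <= 0:
            return nxt
    return " "                  # unseen context: fall back to a space

ch, out = "t", ["t"]
for _ in range(40):
    ch = sample_next(ch)
    out.append(ch)
print("".join(out))             # gibberish, but with English-like statistics
```

Larger n (trigrams, 4-grams) produces eerily more English-looking output — the observation the whole statistical era was built on.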

Two parallel programmes

Chomskyan / symbolic

Noam Chomsky's 1957 Syntactic Structures framed language as rule-governed and discrete. Ran for decades through Charniak, Marcus (Penn Treebank), Manning's early work. Strong on structure, brittle on coverage. Lost to data.

IBM / statistical

Frederick Jelinek at IBM Watson (and later JHU): "every time I fire a linguist, the performance of the speech recogniser goes up". The IBM Models 1–5 for word alignment (Brown, Della Pietra, Mercer, 1993) are the foundation of statistical MT.

"Whenever I fire a linguist our system performance improves." — Frederick Jelinek, IBM, c. 1985 (frequently quoted in many forms; he later softened it)

[Interactive list: the five things you need to know from this era.]

04

Era 2 — Neural Language Models (2003–2017)

Neural networks were not new in 2003 — LeCun had MNIST in 1998, Hochreiter and Schmidhuber had LSTM in 1997. But the moment a neural net beat a statistical n-gram on language was Bengio's A Neural Probabilistic Language Model (2003).
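To make the shape of the 2003 model concrete, here is a minimal numpy sketch of its forward pass (dimensions and initialisation are invented for illustration; this is the architecture's outline, not the paper's code): each context word is looked up in a shared embedding table, the vectors are concatenated, passed through a tanh layer, and a softmax gives next-word probabilities. Perplexity — the metric on which it beat n-grams — is the exponential of the average negative log of those probabilities on held-out text.

```python
# Sketch of the Bengio-2003 NPLM forward pass:
# previous words -> shared embeddings -> tanh hidden layer -> softmax.
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n_ctx = 1000, 30, 50, 3      # vocab, embed dim, hidden dim, context size

C = rng.normal(0, 0.1, (V, d))        # shared word-embedding table (the novelty)
H = rng.normal(0, 0.1, (h, n_ctx * d))
U = rng.normal(0, 0.1, (V, h))

def next_word_probs(context_ids):
    x = C[context_ids].reshape(-1)    # look up and concatenate embeddings
    hidden = np.tanh(H @ x)           # tanh hidden layer
    logits = U @ hidden
    e = np.exp(logits - logits.max()) # numerically stable softmax
    return e / e.sum()

p = next_word_probs([17, 4, 256])
print(p.shape, round(p.sum(), 6))     # (1000,) 1.0
```

The embedding table C is the part that outlived the model: it is the direct ancestor of word2vec and of every embedding layer since.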

For the next fourteen years, neural language models and neural NLP more broadly displaced statistical methods, slowly at first. The pace picked up dramatically after AlexNet showed in 2012 that deep learning worked at scale.

Year — Result — Why it mattered

2003 — Bengio NPLM — First neural net to beat n-grams on perplexity. Word embeddings born.
2010 — Mikolov RNN-LM — Recurrent net + simple unrolling matches large n-gram models on speech-recognition rescoring.
2013 — word2vec (Mikolov et al, Google) — Word embeddings as a standalone artefact. king − man + woman ≈ queen (see the sketch after this table).
2014 — seq2seq (Sutskever / Vinyals / Le) + GRU (Cho) — Encoder-decoder pattern; LSTM as default. Translation becomes a neural pipeline.
2014 — Attention (Bahdanau / Cho / Bengio) — Soft alignment between encoder and decoder. The seed crystal of the transformer.
2017 — Transformer (Vaswani et al) — Throw away the recurrence. Attention is all you need.
2018 — ELMo (Peters et al) — Embeddings stop being a single vector per word; they become per-context. The recurrent line's last big result, published just after the transformer.
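The analogy in the word2vec row reduces to vector arithmetic plus a nearest-neighbour search. A toy sketch (the 3-d vectors are made up for illustration; real embeddings are learned and typically 100–300 dimensional):

```python
# king - man + woman ≈ queen, as cosine nearest-neighbour search
# over hand-made toy vectors.
import numpy as np

vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vec["king"] - vec["man"] + vec["woman"]
best = max(vec, key=lambda w: cosine(vec[w], target))
print(best)   # queen
```

That such regularities fell out of a model trained only to predict nearby words was the surprise of 2013.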

Notice the geographic story: Toronto, Mila/Montreal, NYU, and Google Brain dominate. This is the era when university labs were genuinely the centre of frontier work. That ends in 2018.

05

Era 3 — The Transformer Era (2017–2022)

The transformer paper itself is deck 03 of this series. What matters here is what happened in the five years after it. Two camps formed almost immediately:

BERT camp — encoder, masked LM

Google AI, Oct 2018. Devlin, Chang, Lee, Toutanova. Pretrain on masked language modelling, fine-tune for everything else. Dominated NLP benchmarks 2018–2020. The pattern most enterprise NLP still uses.

GPT camp — decoder, autoregressive

OpenAI, Jun 2018 (GPT-1) → Feb 2019 (GPT-2) → May 2020 (GPT-3). Radford, Wu, Child; later Brown (GPT-3) and Kaplan (scaling laws). Generative, scaled, zero-shot. The pattern that wins.
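The structural split between the camps is easiest to see as attention masks applied inside scaled dot-product attention. A schematic sketch on random numbers (neither model's implementation):

```python
# Encoder vs decoder attention, schematically.
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                               # sequence length, head dimension
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)             # scaled dot-product scores

def softmax_rows(m):
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# BERT-style encoder: every position attends to every other (bidirectional).
enc_weights = softmax_rows(scores)

# GPT-style decoder: mask the future, so position t attends to <= t only.
causal = np.tril(np.ones((T, T)))
dec_weights = softmax_rows(np.where(causal == 1, scores, -np.inf))

print(np.triu(dec_weights, k=1).sum())    # 0.0 — no attention to the future

# The training objectives differ the same way:
#   BERT: replace some inputs with [MASK], predict the originals from both sides.
#   GPT:  predict token t+1 from tokens 1..t, at every position.
```

The causal mask is why the GPT camp's models generate text natively and the BERT camp's do not — and generation, it turned out, was the product.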

[Interactive list: the three discoveries that defined the era.]

The strange thing about this era

Almost all of the scientific work was published openly. Papers, code, often weights. The change from open to closed happens quite cleanly with GPT-4 in March 2023, just on the far side of the ChatGPT boundary between Era 3 and Era 4.

06

Era 4 — Scaled Frontier & Reasoning (2022–)

The era opens with ChatGPT (30 November 2022) and is defined by a single fact: frontier models are now serious products with serious revenue, and the labs that build them act accordingly. The defining changes:

What changed for the labs

  • Closed weights become the norm at the frontier.
  • Frontier training runs cost $50–500 M.
  • Each lab has dedicated alignment / safety teams.
  • Government engagement is now part of the job.

What changed for the models

  • Mixture-of-Experts becomes default at the frontier (see the routing sketch after this list).
  • Long context (1M+ tokens at Gemini, Claude).
  • Multimodal (vision, audio, video) is table stakes.
  • Test-time compute — o1 (Sep 2024), R1 (Jan 2025).
  • Agents — Computer Use (Oct 2024), Operator, Project Mariner.
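The Mixture-of-Experts item reduces to a routing trick: a small router picks k of n expert networks per token, so parameter count grows without growing per-token compute. A minimal sketch (dimensions, the router, and the toy experts are all invented for illustration, not any lab's implementation):

```python
# Top-k mixture-of-experts routing, schematically.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2

W_router = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy "FFN" experts

def moe_forward(x):
    scores = W_router @ x
    top = np.argsort(scores)[-k:]                 # pick the k best experts
    e = np.exp(scores[top] - scores[top].max())
    gates = e / e.sum()                           # softmax over the chosen k
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

y = moe_forward(rng.normal(size=d))
print(y.shape)   # (16,) — same output shape, ~k/n_experts of the expert compute
```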

The geography flips again. Era 2 was university labs; Era 3 was a handful of US industry labs; Era 4 is multipolar — six US labs, a major British arm of Google, four to six serious Chinese labs, and an open-weight ecosystem orbiting Meta, Mistral, Qwen and DeepSeek.

The R1 moment — January 2025

DeepSeek R1 was released with open weights and a paper, matched OpenAI o1 on most benchmarks, and reportedly cost a tiny fraction of a US frontier run. It is the moment the question "is the frontier still mostly American?" stopped being rhetorical. Deck 09 is dedicated to this.

07

The People Thread — The Three Godfathers

The intellectual lineage of modern LLMs runs through three people. They shared the 2018 Turing Award and have been called the Godfathers of Deep Learning ever since. Their direct and indirect students built almost all the labs you have heard of.


Geoffrey Hinton

Toronto, Google Brain (2013–2023), Vector Institute

Backprop (with Rumelhart and Williams, 1986), Boltzmann machines, dropout, capsule networks. Direct supervisor of Sutskever, Salakhutdinov, Krizhevsky and Mnih; Graves came through his group as a postdoc. Resigned from Google in 2023 specifically so he could speak freely about AI risk.


Yoshua Bengio

Université de Montréal, MILA founder

Neural language model 2003, attention with Bahdanau and Cho 2014, GANs (with Goodfellow as student). MILA is the largest deep-learning academic group in the world. Now devotes most of his time to AI safety.


Yann LeCun

NYU, Meta AI / FAIR Chief Scientist

Convolutional networks (1989), MNIST, the lead voice for open-weight research and a vocal critic of pure-LLM AGI roadmaps. JEPA / world-model proposal is his counter-programme. Made Llama possible.

Why this matters for reading papers

Roughly: if a paper is from Toronto, Google Brain (pre-2023) or DeepMind, expect a Hinton trace. If from Mila, NYU pre-2010, or Meta AI, expect Bengio or LeCun. If from OpenAI or Anthropic, expect ex-Brain ex-DeepMind people whose advisors trace to one of the three. Deck 04 maps this in detail.

08

The Lab Geography

For an outsider the field looks crowded. From inside it is dramatically smaller than that — perhaps fifteen labs that genuinely matter, clustered around a handful of cities.

San Francisco Bay Area

OpenAI (Mission), Anthropic (SoMa), Google DeepMind (Mountain View, plus the old Brain campus), Meta AI (Menlo Park), xAI (Palo Alto/SF), Stanford NLP, Berkeley AI Research.

London & Europe

Google DeepMind (King's Cross, the original), Mistral (Paris), Meta AI Paris (FAIR), UK AISI, the Oxford / Cambridge groups, Reka.

China

DeepSeek (Hangzhou via High-Flyer), Qwen / Alibaba DAMO (Hangzhou), Moonshot / Zhipu / MiniMax (Beijing), Baidu (Beijing), Tencent (Shenzhen), 01.AI (Beijing).

Toronto / Montreal

Vector Institute (Toronto, Hinton-aligned), MILA (Montreal, Bengio-aligned), Cohere (Toronto, Aidan Gomez), Element AI legacy.

NYC / Boston / Seattle

NYU (LeCun), MIT, Princeton (Narayanan, Chen), Allen AI, Microsoft Research (Redmond), Boston University.

Pittsburgh

CMU (Salakhutdinov, Mitchell, Cohen, Singh), the second-most-cited NLP lab in the US after Stanford.

A pattern

The frontier lab founders tend to physically locate near their academic origins or their target talent pool rather than near customers. Anthropic is in San Francisco because its founders left OpenAI and lived there. DeepMind stayed in London because Demis Hassabis lives there. xAI clusters in SF for the same reason as everyone else: that is where the senior IC pool is.

09

Three Threads Running Through It All

Across all four eras there are a few persistent through-lines — arguments and tensions that keep recurring, even when the technology underneath has changed completely.

Thread 1 — Statistics vs symbols

It is older than this story. Chomsky vs Jelinek in the 1980s; LeCun vs Marcus today. Each generation has a version of "is intelligence built on rules or on statistics?", and in each generation the statistical side wins another notch. We are not done.

Thread 2 — Open vs closed

BBN published. IBM published. Google published BERT and the transformer. OpenAI published GPT-2 and GPT-3. Then GPT-4 was a black box. Then Llama was open. Then DeepSeek R1. The default keeps moving and the question is genuinely contested.

Thread 3 — Capability vs safety

The capability-versus-safety framing barely existed before Bostrom's 2014 Superintelligence. By late 2015 it was part of OpenAI's founding rationale. By 2021 it caused the split that produced Anthropic. By 2023 it produced the OpenAI board crisis. The argument is still live and is one of the few that genuinely constrains what gets built.

Why these three

Each of them comes back in every later deck. The OpenAI deck (05) is partly the story of the safety thread breaking. The Anthropic deck (06) is the story of it holding. The Meta deck (08) is the open vs closed thread. The future-directions deck (10) is essentially a forecast of which way each thread bends next.

10

Interactive Timeline

The full arc, 1948 to today, as a single scrollable list. The colour stripe in each row indicates which era the event sits in (red — statistical; amber — neural pre-transformer; violet — transformer; green — scaled frontier).

1948
Shannon — A Mathematical Theory of Communication. First n-gram experiments on English.
1957
Chomsky — Syntactic Structures. Symbolic linguistics for the next 40 years.
1986
Rumelhart, Hinton, Williams — backprop paper. Parallel Distributed Processing.
1989
LeCun — LeNet, convolutional networks for handwritten digits at Bell Labs.
1993
Brown et al, IBM — IBM Models 1–5 of statistical machine translation.
1997
Hochreiter & Schmidhuber — LSTM.
2003
Bengio et al — A Neural Probabilistic Language Model. Era 2 begins.
2010
DeepMind founded by Hassabis, Suleyman and Legg in London.
2011
Google Brain founded by Jeff Dean, Andrew Ng and Greg Corrado.
2012
AlexNet — Krizhevsky, Sutskever, Hinton. Wins ImageNet by ~10 points. Deep learning becomes mainstream.
2013
word2vec — Mikolov, Chen, Corrado, Dean (Google).
2014
Seq2seq (Sutskever / Vinyals / Le); Bahdanau attention; GANs (Goodfellow). DeepMind acquired by Google for ~$500 M.
2015
OpenAI founded — Altman, Brockman, Sutskever, Musk, Karpathy.
2016
AlphaGo beats Lee Sedol on TPU v1. Google Translate switches to neural.
2017
"Attention Is All You Need" — Vaswani et al, NeurIPS. Era 3 begins.
2018
BERT (Google) and GPT-1 (OpenAI).
2019
GPT-2 staged release; T5; OpenAI takes Microsoft's $1B.
2020
GPT-3 (175 B). Scaling Laws (Kaplan et al). In-context learning is real.
2021
Anthropic founded by Amodei + 6 OpenAI alumni. Codex / Copilot launches.
2022
InstructGPT + Chinchilla. ChatGPT ships 30 November. Era 4 begins.
2023
GPT-4, Llama (and Llama-2), Claude, Gemini 1. OpenAI board crisis (Nov). Brain & DeepMind merge.
2024
Claude 3 Opus, GPT-4o, o1, Llama-3, Grok-2, Computer Use.
2025
DeepSeek R1 (Jan), GPT-5, Claude 4, Gemini 2.5, agents become standard.
2026
Multipolar frontier. Open-weight Chinese models compete on capability. The story is still being written.
11

How to Read the Rest of This Series

The remaining nine decks split into three groups: the technical spine, the lab profiles, and the forecast.

Forecast — reads alone

Will date faster than the others; deliberately so.

A note on what this series does not cover

The technical content of LLMs is in the rest of the LLMs hub — not here. Transformer Architecture, Modern Architectures, Fine Tuning, Reasoning all dive into the engineering. This series provides the context for those decks, not a replacement.

12

Cheat Sheet

Four eras

  • Symbolic / statistical — Shannon to early-2000s. Chomsky vs IBM.
  • Neural pre-Transformer — Bengio 2003 to Vaswani 2017.
  • Transformer — 2017 to ChatGPT.
  • Scaled frontier — ChatGPT onwards. Closed weights, expensive runs, agents.

Three godfathers

  • Hinton — Toronto / Brain. Backprop, students everywhere.
  • Bengio — MILA. Neural LM, attention.
  • LeCun — NYU / Meta. CNNs, JEPA, open-weights advocate.

Three through-lines

  • Statistics vs symbols.
  • Open vs closed.
  • Capability vs safety.

Six big numbers

  • 1948 — Shannon.
  • 2003 — Bengio NPLM.
  • 2017 — Transformer.
  • 2020 — GPT-3 (175 B).
  • 2022 — ChatGPT.
  • 2025 — DeepSeek R1.