LLM History Series — Presentation 01

The History of LLMs — Four Eras, Two Hundred People

An introductory map of a young, dense, still-moving field. The four eras of language modelling, the people thread that runs through all of them, and a guide to the nine decks that follow.

[Title graphic: a timeline from 1948 to the present marking the four eras — Symbolic / statistical (1948–2003), Neural pre-transformer (2003–2017), Transformer (2017–2022), Scaled frontier (2022–) — with Hinton, Bengio, LeCun and the frontier labs called out.]
00

What This Deck Covers

This is the introduction to a ten-deck history of large language models. Its job is to put a frame around the rest of the series — the eras, the people, the lab geography — so each later deck can dive into one slice of the picture without re-explaining the whole.

01

Why LLM History is Worth Knowing

The field is small, young, and unusually personal. Most of the technology you use today — transformers, RLHF, scaling laws, instruction tuning — was invented by a few hundred researchers across about a dozen labs over twelve years. If you know the names you can read the field; if you do not, half the references in any modern paper are opaque.

Three things make the history more useful than just trivia:

1. Path dependence

Decisions made in 2017–2020 (decoder-only, RLHF, the API-not-weights stance) shape what frontier models look like in 2026. Knowing why those choices were made tells you which are load-bearing and which are accidents.

2. Lineage

A surprising fraction of LLM research traces to three advisors — Hinton, Bengio, LeCun — through their direct students. Reading a paper is much faster when you can place its authors in that family tree.

3. Personality

Frontier labs start small — OpenAI was about 100 people when it shipped GPT-3, and Anthropic was 7 when it incorporated. The personalities of the founders explain choices that look mysterious from the outside.

A useful frame

You can teach the technology of LLMs without any history. But you cannot make sense of why the field looks the way it does — why there are exactly three US frontier labs, why Meta open-weights and OpenAI does not, why China caught up in eighteen months, why every Anthropic paper has an interpretability appendix — without it.

02

The Four Eras at a Glance

Language modelling has moved through four overlapping eras since Shannon. The boundaries are fuzzy — statistical methods kept improving long after neural ones existed, and pre-transformer neural NLP is still in production at companies that never updated. But the centres of gravity are clear.

Symbolic / Statistical · 1948–2003
Dominant model: n-grams, HMMs, IBM Models 1–5, log-linear models. Centre of gravity: IBM (Watson Research), Bell Labs, JHU CLSP, Microsoft Research. Killer demo: Google Translate (statistical, 2006).

Neural pre-Transformer · 2003–2017
Dominant model: NPLM, RNN-LM, LSTM, seq2seq + Bahdanau attention. Centre of gravity: Toronto, Mila/Montreal, NYU, Stanford. Killer demo: Google Translate (neural, 2016).

Transformer · 2017–2022
Dominant model: BERT, GPT-2/3, T5, encoder & decoder variants. Centre of gravity: Google Brain → OpenAI → many. Killer demo: GPT-3 in-context learning (2020).

Scaled Frontier · 2022–
Dominant model: decoder-only + RLHF + tools + RL on chains of thought. Centre of gravity: OpenAI, Anthropic, Google DeepMind, Meta, DeepSeek, xAI. Killer demo: ChatGPT (Nov 2022) → o1/R1 (2024–25).
A note on the lengths

The first era is 55 years long; the most recent is barely four. Compute, data, and money have compressed the tempo. The expectation among people in the field is that this compression continues — deck 10 explores what it might mean.

03

Era 1 — Symbolic & Statistical NLP (1948–2003)

The era opens with Claude Shannon's 1948 A Mathematical Theory of Communication, which contained the first n-gram experiments — using English text to estimate letter and word probabilities. It closes with Yoshua Bengio's 2003 neural language model, although statistical methods kept winning benchmarks for years afterward.
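Shannon's experiment is small enough to reproduce in a few lines. A minimal sketch (the toy corpus and the sampling loop are illustrative, not Shannon's original procedure): count character bigrams, then generate text whose letter statistics resemble English.

```python
# Shannon-style character bigram model (illustrative sketch):
# estimate P(next char | current char) from a corpus, then sample from it.
import random
from collections import Counter, defaultdict

corpus = "the theory of communication treats text as a stochastic process"

counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1          # count each bigram a -> b

def sample_next(ch):
    """Draw the next character in proportion to observed bigram counts."""
    options = counts[ch]
    total = sum(options.values())
    r = random.uniform(0, total)
    for nxt, c in options.items():
        r -= c
        if r <= 0:
            return nxt
    return " "                  # unseen context: fall back to a space

ch, out = "t", ["t"]
for _ in range(40):
    ch = sample_next(ch)
    out.append(ch)
print("".join(out))             # gibberish, but with English-like statistics
```

Larger n (trigrams, 4-grams) produces eerily more English-looking output — the observation the whole statistical era was built on.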

Two parallel programmes

Chomskyan / symbolic

Noam Chomsky's 1957 Syntactic Structures framed language as rule-governed and discrete. Ran for decades through Charniak, Marcus (Penn Treebank), Manning's early work. Strong on structure, brittle on coverage. Lost to data.

IBM / statistical

Frederick Jelinek at IBM Watson (and later JHU): "every time I fire a linguist, the performance of the speech recogniser goes up". The IBM Models 1–5 for word alignment (Brown, Della Pietra, Mercer, 1993) are the foundation of statistical MT.

"Whenever I fire a linguist our system performance improves." — Frederick Jelinek, IBM, c. 1985 (frequently quoted in many forms; he later softened it)

[Interactive list: the five things you need to know from this era.]

04

Era 2 — Neural Language Models (2003–2017)

Neural networks were not new in 2003 — LeCun had MNIST in 1998, Hochreiter and Schmidhuber had LSTM in 1997. But the moment a neural net beat a statistical n-gram on language was Bengio's A Neural Probabilistic Language Model (2003).
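To make the shape of the 2003 model concrete, here is a minimal numpy sketch of its forward pass (dimensions and initialisation are invented for illustration; this is the architecture's outline, not the paper's code): each context word is looked up in a shared embedding table, the vectors are concatenated, passed through a tanh layer, and a softmax gives next-word probabilities. Perplexity — the metric on which it beat n-grams — is the exponential of the average negative log of those probabilities on held-out text.

```python
# Sketch of the Bengio-2003 NPLM forward pass:
# previous words -> shared embeddings -> tanh hidden layer -> softmax.
import numpy as np

rng = np.random.default_rng(0)
V, d, h, n_ctx = 1000, 30, 50, 3      # vocab, embed dim, hidden dim, context size

C = rng.normal(0, 0.1, (V, d))        # shared word-embedding table (the novelty)
H = rng.normal(0, 0.1, (h, n_ctx * d))
U = rng.normal(0, 0.1, (V, h))

def next_word_probs(context_ids):
    x = C[context_ids].reshape(-1)    # look up and concatenate embeddings
    hidden = np.tanh(H @ x)           # tanh hidden layer
    logits = U @ hidden
    e = np.exp(logits - logits.max()) # numerically stable softmax
    return e / e.sum()

p = next_word_probs([17, 4, 256])
print(p.shape, round(p.sum(), 6))     # (1000,) 1.0
```

The embedding table C is the part that outlived the model: it is the direct ancestor of word2vec and of every embedding layer since.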

For the next fourteen years, neural language models and neural NLP more broadly displaced statistical methods, slowly at first. The pace picked up dramatically after AlexNet showed in 2012 that deep learning worked at scale.

Year — Result — Why it mattered

2003 — Bengio NPLM — First neural net to beat n-grams on perplexity. Word embeddings born.
2010 — Mikolov RNN-LM — Recurrent net + simple unrolling matches large n-gram models on speech-recognition rescoring.
2013 — word2vec (Mikolov et al, Google) — Word embeddings as a standalone artefact. king − man + woman ≈ queen (see the sketch after this table).
2014 — seq2seq (Sutskever / Vinyals / Le) + GRU (Cho) — Encoder-decoder pattern; LSTM as default. Translation becomes a neural pipeline.
2014 — Attention (Bahdanau / Cho / Bengio) — Soft alignment between encoder and decoder. The seed crystal of the transformer.
2017 — Transformer (Vaswani et al) — Throw away the recurrence. Attention is all you need.
2018 — ELMo (Peters et al) — Embeddings stop being a single vector per word; they become per-context. The recurrent line's last big result, published just after the transformer.
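The analogy in the word2vec row reduces to vector arithmetic plus a nearest-neighbour search. A toy sketch (the 3-d vectors are made up for illustration; real embeddings are learned and typically 100–300 dimensional):

```python
# king - man + woman ≈ queen, as cosine nearest-neighbour search
# over hand-made toy vectors.
import numpy as np

vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vec["king"] - vec["man"] + vec["woman"]
best = max(vec, key=lambda w: cosine(vec[w], target))
print(best)   # queen
```

That such regularities fell out of a model trained only to predict nearby words was the surprise of 2013.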

Notice the geographic story: Toronto, Mila/Montreal, NYU, and Google Brain dominate. This is the era when university labs were genuinely the centre of frontier work. That ends in 2018.

05

Era 3 — The Transformer Era (2017–2022)

The transformer paper itself is deck 03 of this series. What matters here is what happened in the five years after it. Two camps formed almost immediately:

BERT camp — encoder, masked LM

Google AI, Oct 2018. Devlin, Chang, Lee, Toutanova. Pretrain on masked language modelling, fine-tune for everything else. Dominated NLP benchmarks 2018–2020. The pattern most enterprise NLP still uses.

GPT camp — decoder, autoregressive

OpenAI, Jun 2018 (GPT-1) → Feb 2019 (GPT-2) → May 2020 (GPT-3). Radford, Wu, Child; later Brown (GPT-3) and Kaplan (scaling laws). Generative, scaled, zero-shot. The pattern that wins.
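The structural split between the camps is easiest to see as attention masks applied inside scaled dot-product attention. A schematic sketch on random numbers (neither model's implementation):

```python
# Encoder vs decoder attention, schematically.
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                               # sequence length, head dimension
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)             # scaled dot-product scores

def softmax_rows(m):
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# BERT-style encoder: every position attends to every other (bidirectional).
enc_weights = softmax_rows(scores)

# GPT-style decoder: mask the future, so position t attends to <= t only.
causal = np.tril(np.ones((T, T)))
dec_weights = softmax_rows(np.where(causal == 1, scores, -np.inf))

print(np.triu(dec_weights, k=1).sum())    # 0.0 — no attention to the future

# The training objectives differ the same way:
#   BERT: replace some inputs with [MASK], predict the originals from both sides.
#   GPT:  predict token t+1 from tokens 1..t, at every position.
```

The causal mask is why the GPT camp's models generate text natively and the BERT camp's do not — and generation, it turned out, was the product.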

[Interactive list: the three discoveries that defined the era.]

The strange thing about this era

Almost all of the scientific work was published openly. Papers, code, often weights. The change from open to closed happens quite cleanly with GPT-4 in March 2023, just on the far side of the ChatGPT boundary between Era 3 and Era 4.

06

Era 4 — Scaled Frontier & Reasoning (2022–)

The era opens with ChatGPT (30 November 2022) and is defined by a single fact: frontier models are now serious products with serious revenue, and the labs that build them act accordingly. The defining changes:

What changed for the labs

  • Closed weights become the norm at the frontier.
  • Frontier training runs cost $50–500 M.
  • Each lab has dedicated alignment / safety teams.
  • Government engagement is now part of the job.

What changed for the models

  • Mixture-of-Experts becomes default at the frontier (see the routing sketch after this list).
  • Long context (1M+ tokens at Gemini, Claude).
  • Multimodal (vision, audio, video) is table stakes.
  • Test-time compute — o1 (Sep 2024), R1 (Jan 2025).
  • Agents — Computer Use (Oct 2024), Operator, Project Mariner.
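The Mixture-of-Experts item reduces to a routing trick: a small router picks k of n expert networks per token, so parameter count grows without growing per-token compute. A minimal sketch (dimensions, the router, and the toy experts are all invented for illustration, not any lab's implementation):

```python
# Top-k mixture-of-experts routing, schematically.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2

W_router = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy "FFN" experts

def moe_forward(x):
    scores = W_router @ x
    top = np.argsort(scores)[-k:]                 # pick the k best experts
    e = np.exp(scores[top] - scores[top].max())
    gates = e / e.sum()                           # softmax over the chosen k
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

y = moe_forward(rng.normal(size=d))
print(y.shape)   # (16,) — same output shape, ~k/n_experts of the expert compute
```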

The geography flips again. Era 2 was university labs; Era 3 was a handful of US industry labs; Era 4 is multipolar — six US labs, a major British arm of Google, four to six serious Chinese labs, and an open-weight ecosystem orbiting Meta, Mistral, Qwen and DeepSeek.

The R1 moment — January 2025

DeepSeek R1 was released with open weights and a paper, matched OpenAI o1 on most benchmarks, and reportedly cost a tiny fraction of a US frontier run. It is the moment the question "is the frontier still mostly American?" stopped being rhetorical. Deck 09 is dedicated to this.

07

The People Thread — The Three Godfathers

The intellectual lineage of modern LLMs runs through three people. They shared the 2018 Turing Award and have been called the Godfathers of Deep Learning ever since. Their direct and indirect students built almost all the labs you have heard of.


Geoffrey Hinton

Toronto, Google Brain (2013–2023), Vector Institute

Backprop (with Rumelhart and Williams, 1986), Boltzmann machines, dropout, capsule networks. Direct supervisor of Sutskever, Salakhutdinov, Krizhevsky and Mnih; Graves came through his group as a postdoc. Resigned from Google in 2023 specifically so he could speak freely about AI risk.


Yoshua Bengio

Université de Montréal, MILA founder

Neural language model 2003, attention with Bahdanau and Cho 2014, GANs (with Goodfellow as student). MILA is the largest deep-learning academic group in the world. Now devotes most of his time to AI safety.


Yann LeCun

NYU, Meta AI / FAIR Chief Scientist

Convolutional networks (1989), MNIST, the lead voice for open-weight research and a vocal critic of pure-LLM AGI roadmaps. JEPA / world-model proposal is his counter-programme. Made Llama possible.

Why this matters for reading papers

Roughly: if a paper is from Toronto, Google Brain (pre-2023) or DeepMind, expect a Hinton trace. If from Mila, NYU pre-2010, or Meta AI, expect Bengio or LeCun. If from OpenAI or Anthropic, expect ex-Brain ex-DeepMind people whose advisors trace to one of the three. Deck 04 maps this in detail.

08

The Lab Geography

For an outsider the field looks crowded. From inside it is dramatically smaller than that — perhaps fifteen labs that genuinely matter, clustered around a handful of cities.

San Francisco Bay Area

OpenAI (Mission), Anthropic (SoMa), Google DeepMind (Mountain View, plus the old Brain campus), Meta AI (Menlo Park), xAI (Palo Alto/SF), Stanford NLP, Berkeley AI Research.

London & Europe

Google DeepMind (King's Cross, the original), Mistral (Paris), Meta AI Paris (FAIR), UK AISI, the Oxford / Cambridge groups, Reka.

China

DeepSeek (Hangzhou via High-Flyer), Qwen / Alibaba DAMO (Hangzhou), Moonshot / Zhipu / MiniMax (Beijing), Baidu (Beijing), Tencent (Shenzhen), 01.AI (Beijing).

Toronto / Montreal

Vector Institute (Toronto, Hinton-aligned), MILA (Montreal, Bengio-aligned), Cohere (Toronto, Aidan Gomez), Element AI legacy.

NYC / Boston / Seattle

NYU (LeCun), MIT, Princeton (Narayanan, Chen), Allen AI, Microsoft Research (Redmond), Boston University.

Pittsburgh

CMU (Salakhutdinov, Mitchell, Cohen, Singh), the second-most-cited NLP lab in the US after Stanford.

A pattern

The frontier lab founders tend to physically locate near their academic origins or their target talent pool rather than near customers. Anthropic is in San Francisco because its founders left OpenAI and lived there. DeepMind stayed in London because Demis Hassabis lives there. xAI clusters in SF for the same reason as everyone else: that is where the senior IC pool is.

09

Three Threads Running Through It All

Across all four eras there are a few persistent through-lines — arguments and tensions that keep recurring, even when the technology underneath has changed completely.

Thread 1 — Statistics vs symbols

It is older than this story. Chomsky vs Jelinek in the 1980s; LeCun vs Marcus today. Each generation has a version of "is intelligence built on rules or on statistics?", and in each generation the statistical side wins another notch. We are not done.

Thread 2 — Open vs closed

BBN published. IBM published. Google published BERT and the transformer. OpenAI published GPT-2 and GPT-3. Then GPT-4 was a black box. Then Llama was open. Then DeepSeek R1. The default keeps moving and the question is genuinely contested.

Thread 3 — Capability vs safety

The capability-versus-safety framing barely existed before Bostrom's 2014 Superintelligence. By late 2015 it was part of OpenAI's founding rationale. By 2021 it caused the split that produced Anthropic. By 2023 it produced the OpenAI board crisis. The argument is still live and is one of the few that genuinely constrains what gets built.

Why these three

Each of them comes back in every later deck. The OpenAI deck (05) is partly the story of the safety thread breaking. The Anthropic deck (06) is the story of it holding. The Meta deck (08) is the open vs closed thread. The future-directions deck (10) is essentially a forecast of which way each thread bends next.

10

Interactive Timeline

The full arc, 1948 to today, as a single scrollable list. The colour stripe in each row indicates which era the event sits in (red — statistical; amber — neural pre-transformer; violet — transformer; green — scaled frontier).

1948
Shannon — A Mathematical Theory of Communication. First n-gram experiments on English.
1957
Chomsky — Syntactic Structures. Symbolic linguistics for the next 40 years.
1986
Rumelhart, Hinton, Williams — backprop paper. Parallel Distributed Processing.
1989
LeCun — LeNet, convolutional networks for handwritten digits at Bell Labs.
1993
Brown et al, IBM — IBM Models 1–5 of statistical machine translation.
1997
Hochreiter & Schmidhuber — LSTM.
2003
Bengio et al — A Neural Probabilistic Language Model. Era 2 begins.
2010
DeepMind founded by Hassabis, Suleyman and Legg in London.
2011
Google Brain founded by Jeff Dean, Andrew Ng and Greg Corrado.
2012
AlexNet — Krizhevsky, Sutskever, Hinton. Wins ImageNet by ~10 points. Deep learning becomes mainstream.
2013
word2vec — Mikolov, Chen, Corrado, Dean (Google).
2014
Seq2seq (Sutskever / Vinyals / Le); Bahdanau attention; GANs (Goodfellow). DeepMind acquired by Google for ~$500 M.
2015
OpenAI founded — Altman, Brockman, Sutskever, Musk, Karpathy.
2016
AlphaGo beats Lee Sedol on TPU v1. Google Translate switches to neural.
2017
"Attention Is All You Need" — Vaswani et al, NeurIPS. Era 3 begins.
2018
BERT (Google) and GPT-1 (OpenAI).
2019
GPT-2 staged release; T5; OpenAI takes Microsoft's $1B.
2020
GPT-3 (175 B). Scaling Laws (Kaplan et al). In-context learning is real.
2021
Anthropic founded by Amodei + 6 OpenAI alumni. Codex / Copilot launches.
2022
InstructGPT + Chinchilla. ChatGPT ships 30 November. Era 4 begins.
2023
GPT-4, Llama (and Llama-2), Claude, Gemini 1. OpenAI board crisis (Nov). Brain & DeepMind merge.
2024
Claude 3 Opus, GPT-4o, o1, Llama-3, Grok-2, Computer Use.
2025
DeepSeek R1 (Jan), GPT-5, Claude 4, Gemini 2.5, agents become standard.
2026
Multipolar frontier. Open-weight Chinese models compete on capability. The story is still being written.
11

How to Read the Rest of This Series

The remaining nine decks split into three groups: the technical spine, the lab profiles, and the forecast.

Forecast — reads alone

Will date faster than the others; deliberately so.

A note on what this series does not cover

The technical content of LLMs is in the rest of the LLMs hub — not here. Transformer Architecture, Modern Architectures, Fine Tuning, Reasoning all dive into the engineering. This series provides the context for those decks, not a replacement.

12

Cheat Sheet

Four eras

  • Symbolic / statistical — Shannon to early-2000s. Chomsky vs IBM.
  • Neural pre-Transformer — Bengio 2003 to Vaswani 2017.
  • Transformer — 2017 to ChatGPT.
  • Scaled frontier — ChatGPT onwards. Closed weights, expensive runs, agents.

Three godfathers

  • Hinton — Toronto / Brain. Backprop, students everywhere.
  • Bengio — MILA. Neural LM, attention.
  • LeCun — NYU / Meta. CNNs, JEPA, open-weights advocate.

Three through-lines

  • Statistics vs symbols.
  • Open vs closed.
  • Capability vs safety.

Six big numbers

  • 1948 — Shannon.
  • 2003 — Bengio NPLM.
  • 2017 — Transformer.
  • 2020 — GPT-3 (175 B).
  • 2022 — ChatGPT.
  • 2025 — DeepSeek R1.