LLM History Series — Presentation 02

Pre-Transformer NLP — Shannon to Bahdanau (1948–2016)

The 70-year run-up to the transformer paper. Information theory, the long Chomsky-vs-IBM argument, neural language models, word embeddings, sequence-to-sequence learning, and the 2014 attention paper that contained the seed of everything to come.

Timeline: Shannon → Jelinek → IBM Models → Bengio → Mikolov → Sutskever → Cho → Bahdanau
n-grams · IBM stat MT · NPLM (2003) · word2vec (2013) · seq2seq (2014) · attention (2014) · transformer (2017)
00

What This Deck Covers

Everything that happened to language modelling between Shannon's information theory and the 2017 transformer paper. Each idea here ends up reused inside the transformer or its descendants. The deck pays equal attention to the technical content and the people who produced it.

01

Shannon — Information Theory and Language (1948)

The story starts in 1948 at Bell Labs with Claude Shannon's A Mathematical Theory of Communication. It is famously a paper about communication channels, but tucked inside it is what is plausibly the first language model: a chain of n-gram experiments using printed English.

Shannon counted letter pairs and triples in "Jefferson the Virginian" and similar sources, then sampled from those distributions to generate gibberish that nonetheless began to look like English the more context he conditioned on. He drew an unmistakable conclusion:

It appears, then, that a sufficiently complex stochastic process will give a satisfactory representation of a discrete information source. — Claude Shannon, A Mathematical Theory of Communication, 1948 § 4

That sentence is the founding charter of statistical NLP. Its second half — a sufficiently complex stochastic process will give a satisfactory representation — is, almost word for word, the bet that ChatGPT is built on.

Things Shannon defined that we still use

  • Entropy H(X) = −Σ p(x) log p(x). The cross-entropy loss used to train every modern LM is this quantity, measured against the model's distribution.
  • Cross-entropy and perplexity. Perplexity = 2^H. Still the headline LM metric.
  • n-gram as a unit of approximation.
  • Channel coding — relevant to the noisy-channel framing of MT and ASR.
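Shannon's experiment is easy to reproduce. A minimal sketch (the toy corpus and all names are ours, not Shannon's): estimate bigram probabilities by counting, compute the corpus's cross-entropy and perplexity under its own model, then sample from it the way Shannon sampled from his letter tables:

```python
import math
import random
from collections import Counter

# Tiny stand-in for Shannon's book-counting experiment.
corpus = "the cat sat on the mat the cat ate the rat"
tokens = corpus.split()

bigrams = Counter(zip(tokens, tokens[1:]))   # counts of (w1, w2)
unigrams = Counter(tokens[:-1])              # counts of w1 as a context

def p(w2, w1):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# Cross-entropy in bits per token, and perplexity = 2^H.
H = -sum(math.log2(p(w2, w1)) for w1, w2 in zip(tokens, tokens[1:])) / (len(tokens) - 1)
ppl = 2 ** H

def sample(start, n, rng=random.Random(0)):
    """Generate n words by repeatedly sampling P(. | previous word)."""
    out = [start]
    for _ in range(n):
        nxt = [(w2, c) for (w1, w2), c in bigrams.items() if w1 == out[-1]]
        if not nxt:                  # dead end: the corpus's final word
            break
        words, counts = zip(*nxt)
        out.append(rng.choices(words, weights=counts)[0])
    return " ".join(out)

print(f"H = {H:.2f} bits/token, perplexity = {ppl:.2f}")
print(sample("the", 8))
```

More context (trigrams, 4-grams) drives the cross-entropy down and the samples toward English, which is exactly the effect Shannon reported.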

Shannon the person

Worked at Bell Labs alongside the transistor team. Built juggling machines, a chess-playing automaton, and rode a unicycle through the corridors. Quiet, mathematically rigorous, largely uninterested in the academic-prestige game. The closest historical analogue to a modern Brain or DeepMind senior IC: small output, enormous influence.

02

The Symbolic Side — Chomsky's Programme

Nine years after Shannon, Noam Chomsky's Syntactic Structures (1957) proposed the opposite framing: language is not a probabilistic process, it is a system of formal rules. Two years later his review of Skinner's Verbal Behavior, published in Language, demolished behaviourism; between them the two works founded modern generative linguistics.

Colorless green ideas sleep furiously. — Chomsky's most famous example. Grammatical, semantically nonsensical — intended as a refutation of statistical accounts.

Chomsky's argument was that statistical co-occurrence cannot distinguish that sentence from furiously sleep ideas green colorless, but any English speaker instantly can; therefore meaning and grammaticality must come from structure. It was a knockdown argument for sixty years and is now, in light of LLMs, treated more as an open question.

The symbolic programme that followed

A note on Chomsky and LLMs

Chomsky himself, into his nineties, has remained a vocal sceptic of LLMs — arguing they are "high-tech plagiarism" and miss the point of language. His New York Times op-ed of 8 March 2023 makes the case explicitly. Many in the field disagree, but his framing of what language is remains the cleanest articulation of the symbolic side of the argument.

03

The Statistical Counter — Jelinek and IBM

The counter-programme to Chomsky did not come from a university linguistics department. It came from IBM T. J. Watson Research in Yorktown Heights, where a Czech-born electrical engineer named Frederick Jelinek was leading a speech-recognition group.

FJ

Frederick Jelinek (1932–2010)

IBM Watson 1972–1993, Johns Hopkins CLSP 1993–2010

Survived the Holocaust as a child in Prague, emigrated to the US in 1949, did his PhD at MIT under Robert Fano and (effectively) Shannon. At IBM ran the speech-recognition group that produced the noisy channel framing of ASR and MT. Founded the JHU CLSP, which became the strongest US lab for statistical NLP.

The IBM philosophy in one paragraph

Linguistic theory does not constrain a recogniser; data does. Build a language model from a large corpus of text, build an acoustic model from a large corpus of recorded speech, combine them with Bayes' rule, decode with Viterbi, ship.

P(words | audio) ∝ P(audio | words) · P(words)

P(words) is the language model. P(audio | words) is the acoustic model. The Jelinek group spent twenty years making each of those better, and almost every improvement turned out to come from more data or better statistical estimation, not from imported linguistic structure. Jelinek's famous remark:

Every time I fire a linguist the performance of the speech recognizer goes up. — Frederick Jelinek, ~1985 (frequently softened in retellings; he later said he meant the structure linguists wanted to impose was wrong, not that linguists should not be hired)
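The noisy-channel recipe above can be sketched in a few lines. Every candidate transcript and probability below is a made-up illustration, not real ASR output; the point is only that Bayes' rule lets a strong language model rescue a weaker acoustic score:

```python
import math

# Hypothetical candidate transcripts for one utterance.
# acoustic[w] stands in for P(audio | words); lm[w] for P(words).
acoustic = {"recognize speech": 0.20,
            "wreck a nice beach": 0.25,
            "recognise peach": 0.05}
lm       = {"recognize speech": 0.010,
            "wreck a nice beach": 0.0001,
            "recognise peach": 0.0005}

def score(words):
    # Bayes' rule, dropping the constant P(audio):
    # P(words | audio) ∝ P(audio | words) · P(words)
    return math.log(acoustic[words]) + math.log(lm[words])

best = max(acoustic, key=score)
print(best)  # prints "recognize speech": the LM outvotes the acoustic model
```

The acoustically best-scoring candidate loses because the language model assigns it tiny prior probability. That division of labour is the whole IBM philosophy in code.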
04

IBM Models 1–5 and Statistical MT (1990–1993)

Jelinek's group attacked machine translation with the same noisy-channel framing. The results — IBM Models 1–5, published by Peter Brown, Vincent Della Pietra, Stephen Della Pietra and Bob Mercer in 1993 — are the foundation of all subsequent statistical and neural MT.

P(English | French) ∝ P(French | English) · P(English)

The five models, ladder of complexity

Model   | Adds                                   | Why it matters today
Model 1 | Lexical translation, uniform alignment | The seed of cross-lingual correspondence; ignores word order entirely.
Model 2 | Position-dependent alignment           | First explicit position bias — an ancestor of positional embeddings.
Model 3 | Fertility (one word → many)            | Counts how many target words a source word generates. Multi-token alignment.
Model 4 | Distortion model                       | Word reordering as a learned distribution over offsets.
Model 5 | Avoids overlap deficiencies            | Cleaner generative story for alignment; rarely used in practice.

The key reusable concept is the latent alignment. Words on each side line up to words on the other; that lining-up is hidden, learned by EM, and is reusable in any sequence-to-sequence problem. Soft attention in 2014 is a relaxation of exactly this idea.
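A toy IBM Model 1, with the alignment latent and learned by EM, fits in a page. The three sentence pairs are invented for illustration; `t[f][e]` approximates P(f | e), initialised uniformly:

```python
from collections import defaultdict

# Tiny parallel corpus (English ↔ "foreign"); the word alignment is latent.
pairs = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"],  ["le", "livre"]),
    (["a", "book"],    ["un", "livre"]),
]

f_vocab = {f for _, fs in pairs for f in fs}
t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(f_vocab)))

for _ in range(50):  # EM iterations
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    # E-step: expected alignment counts under the current t.
    for es, fs in pairs:
        for f in fs:
            z = sum(t[f][e] for e in es)
            for e in es:
                c = t[f][e] / z
                count[f][e] += c
                total[e] += c
    # M-step: renormalise the expected counts into probabilities.
    # (Pairs that never co-occur keep their stale uniform value; fine for a sketch.)
    for f in count:
        for e in count[f]:
            t[f][e] = count[f][e] / total[e]

print(round(t["livre"]["book"], 3))  # 'livre' has latched onto 'book'
```

No alignment is ever observed; EM discovers that "livre" pairs with "book" purely because the co-occurrence statistics leave no better explanation. Soft attention replaces these expected counts with differentiable weights.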

Where the IBM people went

Brown and Mercer left IBM Research and went to Renaissance Technologies — Jim Simons's hedge fund — where they reused the same statistical machinery to predict markets. Mercer became famous (and controversial) as a Republican mega-donor in the Trump era. The Della Pietra brothers also went to Renaissance. Statistical NLP literally walked out of academia into finance for fifteen years.

05

The Neural Sleeper — Hinton's Family Tree

While IBM was running statistical NLP, a quieter and much smaller programme in distributed representations was running through Hinton's group in Toronto and a handful of fellow travellers (Schmidhuber in Munich, LeCun at Bell Labs, Bengio in Montreal).

It was unfashionable. It was so unfashionable that the field ran an unofficial blacklist on neural-net papers at NeurIPS for much of the 1990s and 2000s. Hinton, in interviews, has said that his strategy was simply to keep training PhD students until the field came around. It eventually did.

The 1980s & 90s prep work

  • Backprop — Rumelhart, Hinton, Williams 1986.
  • Boltzmann machines & RBMs — Hinton with Sejnowski.
  • LeNet — LeCun, 1989, at Bell Labs. CNNs for handwriting.
  • LSTM — Hochreiter & Schmidhuber, 1997.
  • Distributed representations — Hinton, 1986, the philosophical case.

The students who matter

  • Yoshua Bengio — Hinton's postdoc, then Mila.
  • Ilya Sutskever — Hinton PhD, AlexNet co-author.
  • Ruslan Salakhutdinov — Hinton PhD, CMU.
  • Alex Krizhevsky — Hinton PhD, AlexNet first author.
  • Alex Graves — LSTM, CTC, neural Turing machines.
  • Volodymyr Mnih — DQN at DeepMind.

I needed to keep students stupid enough to work on it. — Geoffrey Hinton, paraphrasing himself in multiple interviews on why neural-net research persisted through its unfashionable decades.

The Hinton-trained students are about as concentrated a contributor pool as the field has. Almost every senior IC at Brain, DeepMind, OpenAI, and Anthropic in 2026 either trained under Hinton, trained under one of his students, or works alongside someone who did.

06

Bengio's NPLM (2003) — The Hinge

The hinge between the statistical era and the neural era is a single paper: Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin, A Neural Probabilistic Language Model, JMLR 2003.

It does two things at once: it models the next-word distribution with a feed-forward neural network trained end-to-end by gradient descent, and, as a by-product of that training, it learns a dense real-valued vector for every word in the vocabulary.

P(wt | wt−1, …, wt−n+1) = softmax( W·tanh(U·[e(wt−1); …; e(wt−n+1)]) )

That second point matters more than the first. Word embeddings — learned, dense, real-valued representations of words — are introduced into NLP here. Every later development from word2vec to BERT to GPT-5 keeps this idea, in some form, at the input layer.

Why this paper is the hinge

The NPLM was reasonably competitive on perplexity in 2003 but too slow to train at the corpus sizes that mattered for speech and MT. So statistical n-gram models won, in production, for another decade. But the architecture — embeddings + neural net + softmax — became the blueprint everything else followed. The transformer is, at the input and output layers, recognisably a Bengio NPLM.
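The blueprint can be written down directly. A minimal numpy sketch with illustrative shapes and random, untrained weights; the point is the data flow (embed, concatenate, tanh, softmax), not the numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, h = 10, 8, 3, 16   # vocab size, embedding dim, context length, hidden units

# Parameters of a minimal Bengio-style NPLM (untrained stand-ins).
E = rng.normal(size=(V, d))        # word embedding table — the lasting idea
U = rng.normal(size=(h, n * d))    # context → hidden
W = rng.normal(size=(V, h))        # hidden → vocabulary logits

def nplm_probs(context_ids):
    """P(w_t | w_{t-1}, ..., w_{t-n+1}) over the whole vocabulary."""
    x = np.concatenate([E[i] for i in context_ids])   # [e(w_{t-1}); ...]
    hidden = np.tanh(U @ x)
    logits = W @ hidden
    exp = np.exp(logits - logits.max())               # numerically stable softmax
    return exp / exp.sum()

probs = nplm_probs([1, 4, 7])
print(probs.shape, probs.sum())
```

Swap the tanh layer for a stack of attention blocks and this is, at the input and output ends, still how a transformer LM is wired.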

YB

Yoshua Bengio

Université de Montréal, MILA founder

French-born, Canadian. Did his PhD at McGill, postdoc with Hinton in Toronto, then Bell Labs with LeCun, then settled in Montreal in 1993 and stayed. Built MILA into the largest deep-learning academic group in the world. His students — Goodfellow, Bahdanau, Cho (postdoc), Mensch — appear repeatedly in the rest of this deck and the rest of this series. Quietly intense, ferociously prolific. Now devotes most of his time to AI safety, and writes publicly about catastrophic risk.

07

Word Embeddings — word2vec and GloVe (2013–2014)

For ten years after the NPLM, learned word embeddings were a research curiosity. Then in 2013 a small Google Brain team led by Tomas Mikolov released word2vec — a pair of dramatically simplified architectures (CBOW and skip-gram) that produced high-quality embeddings in a fraction of the compute.

What made word2vec famous

Not the architecture — the analogies. The vectors had compositional structure that nobody had quite expected:

v(king) − v(man) + v(woman) ≈ v(queen)
v(Paris) − v(France) + v(Italy) ≈ v(Rome)

The Distributed Representations of Words and Phrases paper (NIPS 2013) showed dozens of these. The result went viral. It made plain to a wider audience that semantics was, in some operational sense, geometric.
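The analogy trick is nothing more than vector arithmetic plus a nearest-neighbour search under cosine similarity. The 2-d vectors below are hand-made for illustration (real word2vec vectors are learned and ~300-dimensional); following the standard evaluation convention, the three input words are excluded from the search:

```python
import numpy as np

# Hand-crafted toy vectors, chosen so the analogy works by construction.
vec = {
    "king":  np.array([0.9, 0.8]), "queen": np.array([0.1, 0.8]),
    "man":   np.array([0.9, 0.1]), "woman": np.array([0.1, 0.1]),
    "apple": np.array([0.0, -0.9]),   # a distractor
}

def nearest(target, exclude):
    """Word whose vector has highest cosine similarity to target."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vec if w not in exclude),
               key=lambda w: cos(vec[w], target))

# v(king) − v(man) + v(woman) ≈ v(queen)
t = vec["king"] - vec["man"] + vec["woman"]
print(nearest(t, exclude={"king", "man", "woman"}))  # prints "queen"
```

In real embedding spaces the relation is only approximate, but it holds often enough across thousands of analogy pairs to be striking.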

word2vec — Mikolov et al, Google

  • CBOW predicts a word from its context.
  • Skip-gram predicts context from a word.
  • Negative sampling and hierarchical softmax made it fast.
  • Training 300-dimensional embeddings on a 100-billion-word corpus became a day's work.
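Skip-gram with negative sampling reduces to a handful of logistic-regression updates per (centre, context) pair. A sketch with made-up indices and hyperparameters: repeated updates push the true pair's score toward 1 and the sampled negatives' scores toward 0:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 16                                    # toy vocab and embedding sizes
W_in = rng.normal(0, 0.1, (V, d))                # centre-word vectors
W_out = rng.normal(0, 0.1, (V, d))               # context-word vectors
sigmoid = lambda x: 1 / (1 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.1):
    """One skip-gram negative-sampling update."""
    v = W_in[center]
    grad_v = np.zeros(d)
    # Label 1 for the observed context word, 0 for sampled negatives.
    for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[c]
        g = sigmoid(v @ u) - label               # d(loss)/d(score)
        grad_v += g * u
        W_out[c] -= lr * g * v
    W_in[center] -= lr * grad_v

before = sigmoid(W_in[3] @ W_out[7])
for _ in range(50):
    sgns_step(center=3, context=7, negatives=[11, 29])
after = sigmoid(W_in[3] @ W_out[7])
print(before, "->", after)
```

The full softmax over the vocabulary never appears; that replacement of a V-way normalisation with a few binary classifications is exactly what made word2vec fast.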

GloVe — Pennington, Socher, Manning, Stanford 2014

  • Factorise a global word-co-occurrence matrix.
  • Theoretically cleaner: explicit objective tied to PMI.
  • Open-sourced pre-trained vectors that became the default in NLP for years.
  • Manning's group at Stanford NLP becomes a major force from here.
TM

Tomas Mikolov

Brno University → Microsoft Research → Google Brain → Facebook AI → CIIRC

Czech, did his PhD on RNN-LMs in Brno in the late 2000s — pre-GPU, on small CPU clusters. Joined Google Brain in 2012 and shipped word2vec in 2013. Reportedly intense and quiet, with a strong scepticism of hype. Now back in Prague at the Czech Institute of Informatics, Robotics and Cybernetics.

08

RNNs, LSTMs and the Encoder-Decoder Era (2014–2016)

While embeddings were spreading, the sequence model story was running in parallel. The key sequence of papers:

Year | Paper | Authors | Contribution
2010 | RNN-LM | Mikolov et al, Brno | RNN trained as language model, beats large n-grams on speech rescoring.
2014 | Sequence to Sequence Learning with Neural Networks | Sutskever, Vinyals, Le, Google | Encoder LSTM → decoder LSTM. End-to-end MT. Reversing the input gives a ~5 BLEU gain.
2014 | Learning Phrase Representations using RNN Encoder-Decoder | Cho, van Merriënboer, Bahdanau, Bengio et al, Mila | Same idea, independently. Introduces the GRU.
2014 | Neural Machine Translation by Jointly Learning to Align and Translate | Bahdanau, Cho, Bengio, Mila | Attention. See next slide.
2015 | Effective Approaches to Attention-based NMT | Luong, Pham, Manning, Stanford | Local vs global attention; clarifications.
2016 | Google's Neural Machine Translation System | Wu, Schuster, Chen, Le, … Google | Production-deployed NMT, displaces phrase-based MT inside Google Translate.

The two architectures and the sleeper LSTM

RNN-LMs and LSTM-LMs are autoregressive sequence models with state. They can in principle handle arbitrary-length context, but in practice vanishing gradients meant they could not. LSTM (Hochreiter & Schmidhuber, 1997) was the engineering fix — gating that let gradients flow over hundreds of timesteps.

LSTMs sat dormant for fifteen years (Schmidhuber's lab in Munich and one or two others used them) before becoming standard around 2013. The encoder-decoder pattern of 2014 is essentially: stack an LSTM as a reader, stack another as a writer, train end-to-end. Sutskever and Cho independently arrived at this within months.

The bottleneck problem

The Sutskever / Cho seq2seq compressed the entire source sentence into a fixed-length vector. For translations longer than ~30 words, performance fell off a cliff. This was the problem attention was invented to solve.

09

Bahdanau Attention — The Seed Crystal (2014)

The single most important paper in this entire deck is arguably the one with the most modest title: Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau, Cho and Bengio, ICLR 2015 (preprinted on arXiv 1 September 2014).

The paper does one thing. Instead of compressing the source sentence into a single vector, it lets the decoder, at each output step, look back over the encoder's hidden states and weigh them by relevance. The weights are produced by a small feed-forward net — this is an alignment model in the IBM-Models sense, but soft and learned by gradient descent.

eij = a(si−1, hj)
αij = exp(eij) / Σk exp(eik)
ci = Σj αij hj

If you squint, the transformer's scaled-dot-product attention is the same equation with a different scoring function and applied also within a sequence (self-attention), not just across. The Vaswani paper of 2017 generalises this and removes the recurrent backbone, but the seed is here.
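The equations are short enough to run. A numpy sketch with random stand-in weights; the additive scoring net follows the paper's a(s, h) = v · tanh(Wa·s + Ua·h) form, and every shape below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, dh = 5, 8                    # source length, hidden size

h = rng.normal(size=(T, dh))    # encoder hidden states h_1 .. h_T
s = rng.normal(size=(dh,))      # previous decoder state s_{i-1}

# Additive ("Bahdanau") scoring network.
Wa, Ua = rng.normal(size=(dh, dh)), rng.normal(size=(dh, dh))
v = rng.normal(size=(dh,))

e = np.array([v @ np.tanh(Wa @ s + Ua @ hj) for hj in h])  # scores e_ij
alpha = np.exp(e - e.max()); alpha /= alpha.sum()          # α_ij: a soft alignment
c = alpha @ h                                              # context c_i = Σ_j α_ij h_j

print(alpha.round(3), c.shape)
```

Replace the tanh scorer with a scaled dot product and let the queries, keys, and values come from the same sequence, and this loop becomes transformer self-attention.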

DB

Dzmitry Bahdanau

Mila PhD (Bengio) → Element AI / ServiceNow Research → Mila faculty

Belarusian; did his PhD with Bengio at Mila and was the first author on the attention paper as a young PhD student. The paper was nearly rejected by ICLR reviewers who found it incremental. Bahdanau has spoken openly in interviews about how much of the credit really belongs to Cho (who supervised him directly during the work) and Bengio (the senior author). A clean example of how a famously influential paper can come from a junior author plus their advisor.

KC

Kyunghyun Cho

Mila postdoc → NYU faculty → Genentech (research director, on leave from NYU)

Korean. Pioneered the GRU (a simpler LSTM), co-author on the attention paper, and one of the most prolific NLP researchers of the 2010s. Has the unusual property of being well-respected on both the empirical and theoretical sides of NLP. Currently splits time between drug discovery (Genentech) and academic work at NYU.

Why attention was the seed crystal

By 2015 it was clear that adding attention to a seq2seq model dramatically improved translation, especially of long sentences. By 2016 every NMT system had attention. The natural follow-up question — do we even need the recurrent backbone, or is the attention doing all the work? — was sitting on the table for two years before the Vaswani team picked it up. Deck 03 covers what they did with it.

10

The People Behind These Ideas

A short Who's Who of the figures who appear repeatedly across this deck and turn up later in the series. None of these are public figures in the way that Altman or Hassabis became; they are the deep-bench technical researchers who built the foundations.

CS

Claude Shannon (1916–2001)

Bell Labs, MIT

Information theorist. The closest thing computer science has to Newton. Built juggling robots and rode a unicycle. Almost everything in this deck rests on his entropy framing.

FJ

Frederick Jelinek (1932–2010)

IBM, JHU CLSP

The most influential statistical-NLP figure of the late 20th century. Trained dozens of statistical-NLP researchers at JHU's Center for Language and Speech Processing.

GH

Geoffrey Hinton

Toronto, Google Brain (2013–2023)

Backprop, distributed representations, the unbroken thread of neural-net research through the 1990s wilderness. Resigned from Google in 2023 to speak freely about AI risk.

YB

Yoshua Bengio

Mila, Université de Montréal

NPLM 2003, attention 2014. The most prolific advisor in deep learning. Now AI-safety focused.

YL

Yann LeCun

Bell Labs → NYU → Meta AI Chief Scientist

CNNs (1989), MNIST, the FAIR / Meta open-weight stance. Public sceptic of the LLM-as-AGI roadmap; advocates JEPA / world models. Loud, opinionated, scientifically generous.

JS

Jürgen Schmidhuber

IDSIA Lugano (Switzerland)

LSTM (1997, with Hochreiter), meta-learning, early generative models. Famously argues that he and his students (Sepp Hochreiter, Felix Gers, Alex Graves) anticipated most of modern deep learning. The argument is partially correct and entirely characteristic.

CDM

Christopher Manning

Stanford NLP, Stanford HAI

Australian. Ran Stanford NLP through its modern dominant period. GloVe, dependency parsing, the Stanford CoreNLP toolkit, generations of students and group alumni — Socher, Pennington, Luong. Calm, methodical, deeply influential.

11

What Made the Transformer Inevitable

By the start of 2017 every major piece of the transformer existed. Looking back, the Vaswani paper feels less like an invention and more like a careful re-arrangement of parts that had been lying on the workbench for several years.

Inputs

  • Bengio NPLM 2003 — learned word embeddings.
  • word2vec / GloVe 2013–14 — high-quality embeddings at scale.
  • Positional encodings — learned position embeddings were already in use (e.g. in ConvS2S, Gehring et al 2017) before the transformer's sinusoidal variant.

Sequence model

  • seq2seq 2014 (Sutskever, Cho).
  • Bahdanau attention 2014.
  • Self-attention — Cheng, Dong, Lapata 2016 (intra-sentence attention); Lin et al 2017 (structured self-attention).
  • Multi-head — the small invention of the 2017 paper.

Training

  • Adam optimiser — Kingma & Ba 2014.
  • Layer normalisation — Ba, Kiros, Hinton 2016.
  • Dropout — Srivastava et al 2014.
  • Residual connections — He et al 2015 (ResNet).
Why nobody had done it yet

Two reasons. One: the Bahdanau-style attention was always layered on top of a recurrent backbone — it was an addition, not a replacement. The instinct to remove the RNN was unusual. Two: the parallel-friendliness of attention only matters if you have GPU/TPU clusters big enough to take advantage of it. The transformer paper was, in a sense, waiting for the 2016–17 generation of TPU pods at Google Brain to make its training run feasible. Deck 03 picks the story up there.

12

Cheat Sheet

The five hinges

  • 1948 Shannon — n-grams, entropy, perplexity.
  • 1993 IBM Models — latent alignment, noisy channel.
  • 2003 Bengio NPLM — embeddings + neural net.
  • 2013 word2vec — semantic geometry of embeddings.
  • 2014 Bahdanau attention — the seed crystal of the transformer.

The five people

  • Shannon — the framing.
  • Jelinek — the statistical-NLP school.
  • Hinton — the neural sleeper, students everywhere.
  • Bengio — the hinge papers.
  • Manning — the leading academic NLP group.

The two arguments

  • Symbolic vs statistical (Chomsky vs Jelinek).
  • Recurrent vs everything-else (which attention 2014 began to settle).

What's next in the series

  • 03 — The 2017 transformer paper itself, the eight authors, what BERT and GPT did with it.
  • 04 — The university labs where most of these ideas were born.