The 70-year run-up to the transformer paper. Information theory, the long Chomsky-vs-IBM argument, neural language models, word embeddings, sequence-to-sequence learning, and the 2014 attention paper that contained the seed of everything to come.
Everything that happened to language modelling between Shannon's information theory and the 2017 transformer paper. Each idea here ends up reused inside the transformer or its descendants. The deck pays equal attention to the technical content and the people who produced it.
The story starts in 1948 at Bell Labs with Claude Shannon's A Mathematical Theory of Communication. It is famously a paper about communication channels, but tucked inside it is what is plausibly the first language model: a chain of n-gram experiments using printed English.
Shannon counted letter pairs and triples in "Jefferson the Virginian" and similar sources, then sampled from those distributions to generate gibberish that nonetheless began to look like English the more context he conditioned on. He drew an unmistakable conclusion:

> "It appears then that a sufficiently complex stochastic process will give a satisfactory representation of a discrete source."
That sentence is the founding charter of statistical NLP. Its second half — a sufficiently complex stochastic process will give a satisfactory representation — is, almost word for word, the bet that ChatGPT is built on.
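The experiment is easy to reproduce. Below is a minimal sketch in Python of the letter-level version (the file name and code are illustrative; Shannon did the counting by hand): count letter trigrams in a corpus and sample from the conditional distributions.

```python
# Character-level n-gram experiment: count trigrams, then sample.
# More context (higher order) makes the output look more like English.
import random
from collections import Counter, defaultdict

ORDER = 3  # trigrams: condition on the previous two characters

def train_char_ngram(text, order=ORDER):
    counts = defaultdict(Counter)
    for i in range(len(text) - order + 1):
        context, nxt = text[i:i + order - 1], text[i + order - 1]
        counts[context][nxt] += 1
    return counts

def sample(counts, length=300, seed="th", order=ORDER):
    out = seed
    for _ in range(length):
        options = counts.get(out[-(order - 1):]) or counts[seed]  # fall back on unseen context
        chars, weights = zip(*options.items())
        out += random.choices(chars, weights=weights)[0]
    return out

corpus = open("some_english_text.txt").read().lower()  # any large plain-text file
print(sample(train_char_ngram(corpus)))
```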
Worked at Bell Labs alongside the transistor team. Built juggling machines and a chess-playing automaton, and rode a unicycle through the corridors. Quiet, mathematically rigorous, largely uninterested in the academic-prestige game. The closest historical analogue to a modern Brain or DeepMind senior IC: small output, enormous influence.
Nine years after Shannon, Noam Chomsky's Syntactic Structures (1957) proposed the opposite framing: language is not a probabilistic process, it is a system of formal rules. Two years later his review of Skinner's Verbal Behavior in the journal Language demolished behaviourism; between them, the book and the review founded modern generative linguistics.
Chomsky's argument was that statistical co-occurrence cannot distinguish "Colorless green ideas sleep furiously" from "furiously sleep ideas green colorless", but any English speaker instantly can; therefore meaning and grammaticality must come from structure. It stood as a knockdown argument for sixty years and is now, in light of LLMs, treated more as an open question.
Chomsky himself, into his nineties, has remained a vocal sceptic of LLMs — arguing they are "high-tech plagiarism" and miss the point of language. His New York Times op-ed of 8 March 2023 makes the case explicitly. Many in the field disagree, but his framing of what language is remains the cleanest articulation of the symbolic side of the argument.
The counter-programme to Chomsky did not come from a university linguistics department. It came from IBM T. J. Watson Research in Yorktown Heights, where a Czech-born electrical engineer named Frederick Jelinek was leading a speech-recognition group.
Survived the Holocaust as a child in Prague, emigrated to the US in 1949, did his PhD at MIT under Robert Fano and (effectively) Shannon. At IBM ran the speech-recognition group that produced the noisy channel framing of ASR and MT. Founded the JHU CLSP, which became the strongest US lab for statistical NLP.
Linguistic theory does not constrain a recogniser; data does. Build a language model from a large corpus of text, build an acoustic model from a large corpus of recorded speech, combine them with Bayes' rule, decode with Viterbi, ship.
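In symbols (the standard modern rendering of the noisy-channel recipe, not a quotation from the IBM papers):

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        = \arg\max_{W} \underbrace{P(A \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

P(A) does not depend on the candidate word sequence W, so it drops out of the argmax; Viterbi decoding (or beam search) does the search over W.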
P(words) is the language model. P(audio | words) is the acoustic model. The Jelinek group spent twenty years making each of those better, and almost every improvement turned out to come from more data or better statistical estimation, not from imported linguistic structure. Jelinek's famous remark, as it is usually quoted:

> "Every time I fire a linguist, the performance of the speech recognizer goes up."
Jelinek's group attacked machine translation with the same noisy-channel framing. The results — IBM Models 1–5, published by Peter Brown, Stephen Della Pietra, Vincent Della Pietra and Robert Mercer in 1993 — are the foundation of all subsequent statistical and neural MT.
| Model | Adds | Why it matters today |
|---|---|---|
| Model 1 | Lexical translation, uniform alignment | The seed of cross-lingual correspondence; ignores word order entirely. |
| Model 2 | Position-dependent alignment | First explicit position bias — an ancestor of positional embeddings. |
| Model 3 | Fertility (one word → many) | Counts how many target words a source word generates. Multi-token alignment. |
| Model 4 | Distortion model | Word reordering as a learned distribution over offsets. |
| Model 5 | Non-deficient alignment (no probability mass on overlapping positions) | Cleaner generative story for alignment; rarely used in practice. |
The key reusable concept is the latent alignment. Words on each side line up to words on the other; that lining-up is hidden, learned by EM, and is reusable in any sequence-to-sequence problem. Soft attention in 2014 is a relaxation of exactly this idea.
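A toy sketch of the Model 1 EM loop (the sentence pairs and variable names are illustrative; the real models add NULL words, fertility and distortion):

```python
# IBM Model 1 on a toy corpus: the alignment of each target word to a source
# word is the hidden variable, learned by EM.
from collections import defaultdict

pairs = [("the house".split(), "das haus".split()),
         ("the book".split(), "das buch".split()),
         ("a book".split(), "ein buch".split())]

t = defaultdict(lambda: 0.25)            # t[(f, e)] = P(target word f | source word e), uniform start

for _ in range(10):                      # a few EM iterations
    count = defaultdict(float)           # E-step: expected alignment counts
    total = defaultdict(float)
    for eng, ger in pairs:
        for f in ger:                    # posterior over which English word generated f
            norm = sum(t[(f, e)] for e in eng)
            for e in eng:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    for (f, e), c in count.items():      # M-step: re-estimate t(f | e)
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))    # climbs towards 1.0 as EM sharpens the alignment
```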
Brown and Mercer left IBM Research and went to Renaissance Technologies — Jim Simons's hedge fund — where they reused the same statistical machinery to predict markets. Mercer became famous (and controversial) as a Republican mega-donor in the Trump era. The Della Pietra brothers also went to Renaissance. A core of statistical NLP's talent walked out of research and into finance for fifteen years.
While IBM was running statistical NLP, a quieter and much smaller programme in distributed representations was running through Hinton's group in Toronto and a handful of fellow travellers (Schmidhuber in Munich, LeCun at Bell Labs, Bengio in Montreal).
It was unfashionable. It was so unfashionable that the field ran an unofficial blacklist on neural-net papers at NeurIPS for much of the 1990s and 2000s. Hinton, in interviews, has said that his strategy was simply to keep training PhD students until the field came around. It eventually did.
The Hinton-trained students are about as concentrated a contributor pool as the field has. Almost every senior IC at Brain, DeepMind, OpenAI, and Anthropic in 2026 either trained under Hinton, trained under one of his students, or works alongside someone who did.
The hinge between the statistical era and the neural era is a single paper: Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin, A Neural Probabilistic Language Model, JMLR 2003.
It does two things at once:

1. It models next-word probability with a neural network trained by gradient descent, attacking the curse of dimensionality that limited n-gram models to short contexts.
2. As a side effect of that training, it learns a dense, real-valued vector for every word in the vocabulary.
That second point matters more than the first. Word embeddings — learned, dense, real-valued representations of words — are introduced into NLP here. Every later development from word2vec to BERT to GPT-5 keeps this idea, in some form, at the input layer.
The NPLM was reasonably competitive on perplexity in 2003 but too slow to train at the corpus sizes that mattered for speech and MT. So statistical n-gram models won, in production, for another decade. But the architecture — embeddings + neural net + softmax — became the blueprint everything else followed. The transformer is, at the input and output layers, recognisably a Bengio NPLM.
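A schematic sketch of that blueprint in PyTorch, under the assumption that a modern autodiff framework is acceptable shorthand for what Bengio's group hand-coded (the original also had direct input-to-output connections, omitted here):

```python
# Bengio-style NPLM: embed the previous n-1 words, concatenate, tanh hidden
# layer, softmax over the vocabulary.
import torch
import torch.nn as nn

class NPLM(nn.Module):
    def __init__(self, vocab_size, context_len=4, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)          # the learned word vectors
        self.hidden = nn.Linear(context_len * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)              # scores for every possible next word

    def forward(self, context_ids):                           # (batch, context_len)
        e = self.emb(context_ids).flatten(start_dim=1)        # concatenate the context embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                                    # logits; softmax happens in the loss

model = NPLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (32, 4)))             # a random batch of 4-word contexts
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10_000, (32,)))
loss.backward()
```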
French-born, Canadian. Did his PhD at McGill, postdoc at MIT with Michael Jordan, then Bell Labs with LeCun, then settled in Montreal in 1993 and stayed. Built MILA into the largest deep-learning academic group in the world. His students — Goodfellow, Bahdanau, Cho (postdoc), Mensch — appear repeatedly in the rest of this deck and the rest of this series. Quietly intense, ferociously prolific. Now devotes most of his time to AI safety, and writes publicly about catastrophic risk.
For ten years after the NPLM, learned word embeddings were a research curiosity. Then in 2013 a small Google Brain team led by Tomas Mikolov released word2vec — a pair of dramatically simplified architectures (CBOW and skip-gram) that produced high-quality embeddings in a fraction of the compute.
Not the architecture — the analogies. The vectors had compositional structure that nobody had quite expected: vec(king) − vec(man) + vec(woman) landed nearest to vec(queen), and vec(Paris) − vec(France) + vec(Italy) nearest to vec(Rome).
The Distributed Representations of Words and Phrases paper (NIPS 2013) showed dozens of these. The result went viral. It made plain to a wider audience that semantics was, in some operational sense, geometric.
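The arithmetic is a couple of lines to try, assuming gensim and its hosted copy of the Google News vectors are available (the model identifier below is gensim's, not the paper's):

```python
# Word-vector analogy arithmetic with pretrained word2vec vectors.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")     # large download on first use

# vec(king) - vec(man) + vec(woman): the famous result lands nearest to "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# country-capital analogies of the kind shown in the papers
print(wv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))
```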
Czech, did his PhD on RNN-LMs in Brno in the late 2000s — pre-GPU, on small CPU clusters. Joined Google Brain in 2012 and shipped word2vec in 2013. Reportedly intense and quiet, with a strong scepticism of hype. Now back in Prague at the Czech Institute of Informatics, Robotics and Cybernetics. Word2vec was his first first-author paper after leaving Brno.
While embeddings were spreading, the sequence model story was running in parallel. The key sequence of papers:
| Year | Paper | Authors | Contribution |
|---|---|---|---|
| 2010 | RNN-LM | Mikolov et al, Brno | RNN trained as language model, beats large n-grams on speech rescoring. |
| 2014 | Sequence to Sequence Learning with Neural Networks | Sutskever, Vinyals, Le, Google | Encoder LSTM → decoder LSTM. End-to-end MT. Reverse the input for a 5 BLEU gain. |
| 2014 | Learning Phrase Representations using RNN Encoder-Decoder | Cho, van Merriënboer, Bahdanau, Bengio et al, Mila | Same idea, independently. Introduces the GRU. |
| 2014 | Neural Machine Translation by Jointly Learning to Align and Translate | Bahdanau, Cho, Bengio, Mila | Attention. See next slide. |
| 2015 | Effective Approaches to Attention-based NMT | Luong, Pham, Manning, Stanford | Local vs global attention; clarifications. |
| 2016 | Google's Neural Machine Translation System | Wu, Schuster, Chen, Le, … Google | Production-deployed NMT, displaces phrase-based MT inside Google Translate. |
RNN-LMs and LSTM-LMs are autoregressive sequence models with state. In principle they can handle arbitrary-length context, but in practice vanishing gradients meant plain RNNs could not. The LSTM (Hochreiter & Schmidhuber, 1997) was the engineering fix: gating that lets gradients flow over hundreds of timesteps.
LSTMs sat dormant for fifteen years (Schmidhuber's lab in Munich and one or two others used them) before becoming standard around 2013. The encoder-decoder pattern of 2014 is essentially: stack an LSTM as a reader, stack another as a writer, train end-to-end. Sutskever and Cho independently arrived at this within months.
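A schematic PyTorch sketch of that pattern (shapes and names are illustrative, not either group's code):

```python
# Encoder-decoder seq2seq: one LSTM reads the source, its final state
# initialises a second LSTM that writes the target one token at a time.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))   # the whole source squeezed into (h, c)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)                         # next-token logits at every target position

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.randint(0, 8000, (2, 12)), torch.randint(0, 8000, (2, 9)))
print(logits.shape)                                      # torch.Size([2, 9, 8000])
```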
The Sutskever / Cho seq2seq compressed the entire source sentence into a fixed-length vector. For translations longer than ~30 words, performance fell off a cliff. This was the problem attention was invented to solve.
The single most important paper in this entire deck is arguably the one with the most modest title: Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau, Cho and Bengio, ICLR 2015 (preprinted on arXiv 1 September 2014).
The paper does one thing. Instead of compressing the source sentence into a single vector, it lets the decoder, at each output step, look back over the encoder's hidden states and weigh them by relevance. The weights are produced by a small feed-forward net — this is an alignment model in the IBM-Models sense, but soft and learned by gradient descent.
If you squint, the transformer's scaled-dot-product attention is the same equation with a different scoring function, applied within a sequence (self-attention) as well as across sequences. The Vaswani paper of 2017 generalises this and removes the recurrent backbone, but the seed is here.
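A sketch of one decoder step of Bahdanau-style additive attention (dimensions and variable names are illustrative):

```python
# Additive attention: score every encoder state against the decoder state with
# a small feed-forward net, softmax the scores, take the weighted sum.
import torch
import torch.nn as nn

dim = 256
W_s, W_h = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
v = nn.Linear(dim, 1, bias=False)                 # the "small feed-forward net"

enc_states = torch.randn(1, 12, dim)              # encoder hidden states, one per source token
dec_state = torch.randn(1, dim)                   # decoder state before emitting the next word

scores = v(torch.tanh(W_s(dec_state).unsqueeze(1) + W_h(enc_states)))  # (1, 12, 1)
weights = torch.softmax(scores, dim=1)            # soft alignment over source positions
context = (weights * enc_states).sum(dim=1)       # (1, dim): what the decoder attends to this step

# Scaled dot-product attention (the transformer's version) keeps this structure and
# swaps the scoring function for a dot product between query and keys, divided by sqrt(dim).
```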
Belarusian; did his PhD with Bengio at Mila and was the first author on the attention paper as a young PhD student. The paper was nearly rejected by ICLR reviewers who found it incremental. Bahdanau has spoken openly in interviews about how much of the credit really belongs to Cho (who supervised him directly during the work) and Bengio (the senior author). A clean example of how a famously influential paper can come from a junior author plus their advisor.
Korean. Pioneered the GRU (a simpler alternative to the LSTM), co-author on the attention paper, and one of the most prolific NLP researchers of the 2010s. Has the unusual property of being well respected on both the empirical and theoretical sides of NLP. Currently splits his time between drug discovery (Genentech) and academic work at NYU.
By 2015 it was clear that adding attention to a seq2seq model dramatically improved translation, especially of long sentences. By 2016 every NMT system had attention. The natural follow-up question — do we even need the recurrent backbone, or is the attention doing all the work? — was sitting on the table for two years before the Vaswani team picked it up. Deck 03 covers what they did with it.
A short Who's Who of the figures who appear repeatedly across this deck and turn up later in the series. None of these are public figures in the way that Altman or Hassabis became; they are the deep-bench technical researchers who built the foundations.
Information theorist. The closest thing computer science has to Newton. Built juggling robots and rode a unicycle. Almost everything in this deck rests on his entropy framing.
The most influential statistical-NLP figure of the late 20th century. Trained dozens of statistical-NLP researchers at JHU's Center for Language and Speech Processing.
Backprop, distributed representations, the unbroken thread of neural-net research through the 1990s wilderness. Resigned from Google in 2023 to speak about AI risk.
NPLM 2003, attention 2014. The most prolific advisor in deep learning. Now AI-safety focused.
CNNs (1989), MNIST, the FAIR / Meta open-weight stance. Public sceptic of the LLM-as-AGI roadmap; advocates JEPA / world models. Loud, opinionated, scientifically generous.
LSTM (1997, with Hochreiter), neural Turing machines, generative models. Famously argues that he and his students (Sepp Hochreiter, Felix Gers, Alex Graves) anticipated most of modern deep learning. The argument is partially correct and entirely characteristic.
Australian. Ran Stanford NLP through its modern dominant period. GloVe, dependency parsing, the Stanford CoreNLP toolkit, and generations of students and collaborators — Socher, Pennington, Karpathy, Liang. Calm, methodical, deeply influential.
By the start of 2017 every major piece of the transformer existed. Looking back, the Vaswani paper feels less like an invention and more like a careful re-arrangement of parts that had been lying on the workbench for several years.
Two reasons. One: the Bahdanau-style attention was always layered on top of a recurrent backbone — it was an addition, not a replacement. The instinct to remove the RNN was unusual. Two: the parallel-friendliness of attention only matters if you have GPU/TPU clusters big enough to take advantage of it. The transformer paper was, in a sense, waiting for the 2016–17 generation of TPU pods at Google Brain to make its training run feasible. Deck 03 picks the story up there.