A custom AI ASIC was a strange thing for a search company to build in 2013. The story of how Google did it in 15 months — and the architects who knew, from forty years of computer-architecture history, that it would work.
The TPU is one of the small handful of clean-sheet datacenter ASICs to ever reach genuine industrial scale. Understanding how it got there is half history, half computer-architecture lineage. We'll cover both.
The TPU project does not start with a chip designer. It starts with a back-of-the-napkin calculation by Jeff Dean — Google's Senior Fellow, Google Brain co-founder, and the engineer most responsible for the architecture of Google search itself.
Speech recognition had just made the jump from Hidden Markov Models to deep neural networks. Voice search worked, and was about to become a default Android feature. Dean ran the obvious extrapolation: if users spoke to their phones for even a few minutes a day, serving the resulting DNN inference on the existing CPU fleet would require roughly doubling Google's datacenter footprint.
"Double the datacenters" is not a feature request. It is a structural problem. Google's existing fleet of CPU-and-GPU machines was already planet-scale. You cannot solve it with another procurement cycle; you have to change the per-query economics of inference.
That is the moment a custom ASIC stops being exotic and starts being the cheapest option on the table.
The TPU isn't justified by training cost. It's justified by inference cost at fleet scale. Training is a fixed expense per model; inference is paid per query, every day, forever. The voice projection was the first time anyone at Google did the accounting at fleet scale and concluded that the answer was a new kind of silicon.
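To make the shape of that accounting concrete, here is a purely illustrative back-of-the-napkin sketch in the spirit of the projection. Every number below is a hypothetical placeholder, not a Google figure; what matters is the structure: a per-second inference cost multiplied by usage, paid every single day, set against the sustained throughput of general-purpose servers.

```python
# Purely illustrative napkin math in the spirit of the voice-search projection.
# Every number below is a hypothetical placeholder, not a published Google figure.

users = 100e6                      # hypothetical daily voice-search users
speech_seconds_per_user = 3 * 60   # hypothetical: a few minutes of speech per user per day
ops_per_second_of_audio = 10e9     # hypothetical DNN inference cost per second of audio

daily_inference_ops = users * speech_seconds_per_user * ops_per_second_of_audio

sustained_ops_per_server = 100e9   # hypothetical sustained CPU throughput on this workload
seconds_per_day = 86_400

extra_servers = daily_inference_ops / (sustained_ops_per_server * seconds_per_day)
print(f"extra servers needed for one feature: {extra_servers:,.0f}")
```

Whatever placeholder values you plug in, the structure stays the same: the inference term scales with users and usage while the training term does not, which is why fleet-scale accounting pointed at inference silicon.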
In 2013 the conventional answer to "Google needs more ML throughput" was either more Xeon servers or more NVIDIA GPUs. Both options were available. Both were rejected. Why?
If you remove everything from a GPU that isn't doing 8-bit multiply-accumulate, you free a startling amount of die for actual MACs. Google's projection said an ASIC could deliver 30× better perf/W than a contemporary GPU on the inference workloads Google actually ran — and the ISCA 2017 paper's measured numbers (15–30× perf, 30–80× perf/W vs Haswell and K80) bore that out.
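The arithmetic being kept is worth seeing concretely. Below is a generic sketch of 8-bit inference arithmetic (symmetric per-tensor quantisation, int8 multiplies accumulated in int32, one rescale per output). It is the textbook scheme, not a description of TPU v1's exact quantisation recipe, and the helper names are mine.

```python
import numpy as np

def quantize(x, scale):
    """Map float values to int8 with a simple symmetric scale."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
activations = rng.standard_normal((1, 256)).astype(np.float32)
weights = rng.standard_normal((256, 256)).astype(np.float32)

a_scale = np.abs(activations).max() / 127
w_scale = np.abs(weights).max() / 127

a_q = quantize(activations, a_scale)
w_q = quantize(weights, w_scale)

# int8 x int8 products accumulated in int32: the only operation the MAC
# array has to support in hardware.
acc = a_q.astype(np.int32) @ w_q.astype(np.int32)

# One rescale per output, outside the MAC array.
result = acc.astype(np.float32) * (a_scale * w_scale)
reference = activations @ weights
print("max abs error:", np.abs(result - reference).max())
```

The hardware only ever needs the int8-times-int8-into-int32 step in the middle; everything else is cheap bookkeeping, which is why stripping a die down to MACs buys so much.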
This is the moment the term domain-specific architecture (DSA) starts to matter at hyperscale. Hennessy and Patterson's Turing-lecture paper "A New Golden Age for Computer Architecture" (the 2017 Turing Award; lecture delivered in 2018) uses the TPU as the canonical example: when Moore's law slows and Dennard scaling ends, the way out is specialised silicon for the workloads that dominate your fleet. Google's fleet was dominated by inference of a small set of model topologies. That's the ideal DSA target.
You cannot understand the TPU without understanding the person who designed it. Norm Jouppi's career is a forty-year survey of every memory-hierarchy and microarchitecture lesson ever learned in industry.
PhD Stanford 1984 under John Hennessy; one of the original architects on the Stanford MIPS RISC project that defined modern processor design. Joined Google in 2011; tech-led the TPU programme from inception in 2013.
MIPS was the first commercial processor built on the explicit principle that the compiler, not the hardware, should schedule instructions. No out-of-order, no aggressive branch predictor — just a simple pipeline and a smart compiler. That philosophy was unfashionable for two decades while Intel made out-of-order superscalar work brilliantly. It became fashionable again the moment dark silicon and power constraints made hardware-side dynamic scheduling look prohibitively expensive for throughput silicon. The TPU is a MIPS-philosophy machine. The XLA compiler does the scheduling that an Intel core does in hardware.
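A minimal way to see the "compiler schedules, hardware executes" philosophy from the software side is JAX, whose jit path hands the whole traced computation to XLA for ahead-of-time fusion and scheduling. A small sketch, assuming a recent JAX install; the same program compiles for CPU, GPU or TPU backends.

```python
import jax
import jax.numpy as jnp

def layer(x, w, b):
    # One dense layer; XLA sees the whole graph, not one instruction at a time.
    return jax.nn.relu(x @ w + b)

x = jnp.ones((8, 256))
w = jnp.ones((256, 256))
b = jnp.ones((256,))

# Ahead-of-time: trace, lower to XLA, compile, then execute the static plan.
compiled = jax.jit(layer).lower(x, w, b).compile()
print(compiled(x, w, b).shape)   # (8, 256)
```

The analogy is loose, but it is the same bet: decide the schedule once in software rather than rediscovering it every cycle in hardware.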
From 1984 to 1996, Jouppi was at Digital's Western Research Lab. His papers from that era include some of the most cited microarchitecture work ever written, among them victim caches and stream buffers (ISCA 1990) and the CACTI cache-modelling tool.
From 1996 to 2011 he was at Compaq/HP Labs, ending as an HP Fellow running the Intelligent Infrastructure Lab. Eckert–Mauchly Award 2015 (the IEEE/ACM career-achievement prize for computer architecture) for "contributions to the design and analysis of high performance processors and computer storage systems".
Three habits show up directly in the TPU v1 design: (1) compiler does the scheduling (no OoO, no branch predictor, ~12-instruction CISC ISA); (2) quantitative memory-hierarchy analysis (the ISCA 2017 paper has a roofline plot that is pure CACTI-era thinking); (3) conservative, deeply-thought silicon choices — Jouppi's reputation is for shipping; the TPU went from spec to datacenter in 15 months.
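The roofline habit is easy to reproduce. The sketch below uses ballpark TPU v1 figures, roughly 92 TOPS of peak INT8 throughput and on the order of 34 GB/s of off-chip weight bandwidth, plus the crude approximation that an MLP layer serving batch B performs about 2·B operations per weight byte fetched; treat every number as illustrative rather than a restatement of the paper's plot.

```python
# A toy roofline in the spirit of the ISCA 2017 plot. Ballpark TPU v1 figures;
# treat both constants as illustrative.

peak_ops_per_s = 92e12          # peak 8-bit MAC throughput, ops/s
mem_bw_bytes_per_s = 34e9       # off-chip weight bandwidth, bytes/s

def attainable(ops_per_byte):
    """Roofline: min(peak compute, bandwidth * operational intensity)."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * ops_per_byte)

ridge = peak_ops_per_s / mem_bw_bytes_per_s
print(f"ridge point: {ridge:.0f} ops per byte fetched from memory")

# An MLP layer serving batch B reuses each fetched weight byte ~B times
# (one multiply and one add per example), so intensity grows with batch size.
for batch in (8, 64, 512, 4096):
    intensity = 2 * batch
    print(f"batch {batch:5d}: {attainable(intensity) / 1e12:6.1f} TOPS attainable")
```

At the small batches production serving actually used, the attainable throughput sits far below the ridge point, which is the memory-bandwidth lesson that pushed v2 onto HBM.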
The ISCA 2017 TPU paper has seventy-five authors. A custom ASIC at this scale is a team sport. A few names worth knowing:
David Patterson: Berkeley RISC pioneer; co-author of Computer Architecture: A Quantitative Approach; 2017 Turing Award (with Hennessy). Joined Google specifically because of the TPU programme. Co-author on the ISCA 2017, CACM 2020 (v2/v3), and ISCA 2023 (v4) papers. Public face of the "domain-specific architecture" framing.
Jeff Dean: the voice-search projection was his. Co-founded Google Brain (2011) with Andrew Ng and Greg Corrado; pushed for and underwrote the TPU programme internally. Co-author on ISCA 2017. More recently: lead author on the Pathways system that orchestrates pod-spanning training.
Cliff Young: co-author on all three landmark TPU papers (2017, 2020, 2023). Earlier work at Bell Labs and on D. E. Shaw Research's Anton molecular-dynamics ASIC, another high-profile clean-sheet datacenter chip. The Anton experience shows up in the TPU's deterministic, statically-scheduled approach.
Jonathan Ross: one of the engineers on the original TPU team. Left in 2016 to found Groq, whose deterministic-VLIW LPU is, very deliberately, an attempt to take the TPU's static-scheduling philosophy further still. Half of the post-2016 ML-accelerator startup wave is, in some sense, alumni of this team.
And the systems engineers: the leads on board design, cooling and system integration. The names that do not headline papers but design the boards, the SerDes, the OCS optical switch (Palomar, ISCA 2023), and the Jupiter datacenter network the pods sit in.
Quoc Le, the early Google Brain researcher famous for the "neural cat detector" work, is sometimes credited with the TPU. He is not on any of the TPU papers and was not on the hardware team; he is on the ML-research side that motivated the chip. Worth being precise: the TPU is a hardware-team artefact, prompted by a Brain-team need.
The most surprising number in the entire TPU story is the schedule. The May 2016 Cloud blog post by Jouppi, written immediately after the public announcement, gives the only numbers Google has ever published on the design cycle:
Jouppi's team starts the architecture work after the voice-search projection lands.
~12 months of design. A 28 nm tape-out at TSMC; very deliberate process choice (mature, cheap, predictable).
22 days from "tested silicon" to a card running production inference workloads in a Google datacenter.
By late 2015 TPUs are running RankBrain, Translate, Photos and Street View at meaningful scale. None of this is public yet.
The Lee Sedol match in Seoul (9–15 March) is served on TPU v1.
Sundar Pichai keynote at Google I/O; Jouppi's blog post the same day.
For comparison: a typical big-vendor AI accelerator takes closer to 3–4 years from spec to general availability. NVIDIA's Volta announcement (May 2017) traces back to architecture work started around 2013. The TPU did the same job, and shipped to a production datacenter, in less than half the time.
Three structural choices made it work. (1) Mature node: 28 nm in 2014 was three years old — tooling and yields were boring and predictable. (2) Minimal ISA: ~12 CISC instructions, no caches, no branch predictor. There simply isn't much to verify. (3) Drop-in form factor: a PCIe Gen3 ×16 card. No new server, no new datacenter, no new power infrastructure — just slot it next to the existing fleet.
The first the world heard of the TPU was not from a chip launch but from a Go match. AlphaGo's 4–1 victory over Lee Sedol in Seoul in March 2016 was, for most observers, the first time deep learning seemed obviously transformative. The chip running it was a TPU v1.
The Lee Sedol match mattered for the TPU programme in a way that no benchmark plot could. It gave Google leadership a public-facing proof that custom silicon for ML was a strategic capability, not a research curiosity. Every TPU generation since traces, in budget terms, back to that match.
On 18 May 2016 at Google I/O, Sundar Pichai announced the TPU on stage. Jouppi's accompanying Cloud blog post, published the same day, is still the cleanest first-principles description of the chip in public. It led with one specific number, roughly an order of magnitude better performance per watt on Google's ML workloads, and said almost nothing about the architecture, the process node, or the measured benchmarks behind it.
Those details would arrive thirteen months later in the ISCA 2017 paper — the first peer-reviewed disclosure of a hyperscaler's production AI ASIC.
Google's previous datacenter custom silicon — the network-switch ASICs of Jupiter, the storage-controller chips — was never publicised. The TPU was. Two reasons: (1) recruiting — chip architects want to know their work will be seen; an entire generation of ML-hardware talent moved to Google after this announcement; (2) cloud strategy — Google Cloud was being framed as the place to run TensorFlow, and the TPU was the differentiator. The chip is now sold by the hour as Cloud TPU.
If you only read three TPU papers, read these three. Together they cover the entire programme.
| Paper | Venue | Year | What it covers |
|---|---|---|---|
| In-Datacenter Performance Analysis of a Tensor Processing Unit. Jouppi, Young, Patil, Patterson … (75 authors) | ISCA | 2017 | The TPU v1 disclosure paper. 28 nm, 256×256 INT8 systolic, 92 TOPS, 28 W typical, 40 W max. Roofline analysis showing v1 was memory-bandwidth-bound — the lesson that motivated HBM in v2. |
| A Domain-Specific Supercomputer for Training Deep Neural Networks. Jouppi, Yoon, Kurian, Li, Patil, Laudon, Young, Patterson | CACM | 2020 | TPU v2 and v3. Why training needed a new chip (no FP, no gradients in v1). bfloat16, two TensorCores per chip, 2D-torus pod, ICI custom interconnect. |
| TPU v4: An Optically Reconfigurable Supercomputer for ML with Hardware Support for Embeddings. Jouppi, Kurian, Li, … Patterson | ISCA | 2023 | The 4096-chip v4 pod. 7 nm, 3D torus, the Palomar 3D-MEMS optical circuit switch, SparseCore for embeddings. The "ML supercomputer" framing. |
Google publishes the TPU. AWS does not publish Trainium / Inferentia in any comparable depth. Microsoft has not published Maia. Meta's MTIA has a paper but nothing like the level of detail. This is partly a Patterson legacy — an academic culture inside Google's hardware org — and partly recruiting. The papers are also how the rest of the industry has learned that a custom AI ASIC at hyperscale actually works.
What started as a one-off inference chip in 2015 has become a continuous decade-long silicon programme. The cadence has been roughly one new chip every 18–24 months, and from v5 onward, two SKUs per generation.
| Generation | Year | Role | Key innovation |
|---|---|---|---|
| v1 | 2015 | Inference | 256×256 INT8 systolic; 92 TOPS at 40 W |
| v2 | 2017 | Training | bfloat16, HBM, 2D torus pod, two TensorCores per chip |
| v3 | 2018 | Training (refresh) | 4 MXUs/chip, 32 GiB HBM, liquid cooling |
| v4 | 2020 (announced 2021) | Training flagship | 7 nm, 3D torus, Palomar OCS, SparseCore, CMEM |
| v4i | 2020 | Inference | Single-core variant of v4 for fleet inference |
| v5e | Aug 2023 | Cost-optimised | The "e-class" inference / small-training chip |
| v5p | Dec 2023 | Training flagship | 95 GiB HBM, 8,960-chip pod, Gemini 1/1.5 training |
| Trillium (v6e) | May 2024 / GA Dec 2024 | Cost-optimised | 4.7× v5e compute, 3rd-gen SparseCore |
| Ironwood (v7 / TPU7x) | Apr 2025 / GA Nov 2025 | Inference flagship | 4.6 PFLOPS FP8, 192 GiB HBM3e, 9,216-chip pod |
The product fork into e-class (efficient) and p-class (performance) SKUs is itself a strategic decision. It loosely mirrors NVIDIA's split between consumer (RTX) and datacenter (HGX) lines, but with far less silicon divergence: same architecture team, same software stack, two SKUs per generation.
The "TPU" you read about today is a continuous lineage of decisions: 28 nm INT8 inference (v1) → 16 nm bf16 training (v2/v3) → 7 nm pod-scale training with optical interconnect (v4) → FP8 inference at 192 GiB per chip (v7). Every generation is a response to what the previous one couldn't do. The shape of the chip changes; the philosophy — statically scheduled, compiler-driven, scratchpad memory, systolic core — does not.
The TPU is sometimes presented as a clean-sheet design. It is not. It is a synthesis of three decades of architecture research that happened to find its first hyperscale application in deep learning.
For decades after Kung & Leiserson there was no real workload that wanted a 65,536-MAC systolic array. Signal processing was too small, scientific computing too irregular. Deep learning — the moment it became a sequence of dense matmuls on small batches of low-precision integers — was the first workload in history that fit the shape. The TPU is the chip those architecture lineages were waiting for.
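For a concrete feel for what "the shape" means, here is a toy, cycle-level simulation of a weight-stationary systolic matmul in the Kung & Leiserson style: weights pinned in place, activations streaming in from the left on a time skew, partial sums marching down the columns. It illustrates the idea only, not the TPU's actual pipeline; deck 03 walks through the real thing.

```python
import numpy as np

def systolic_matmul(x, w):
    """Toy cycle-level simulation of a weight-stationary systolic array.

    PE (i, j) permanently holds w[i, j]. Activations stream in from the left
    edge (one array row per input feature, skewed in time); partial sums flow
    downward; finished dot products leave the bottom edge.
    """
    m_rows, n = x.shape
    assert w.shape == (n, n)

    h = np.zeros((n, n))          # activation at each PE's left input
    v = np.zeros((n, n))          # partial sum at each PE's top input
    out = np.zeros((m_rows, n))

    for t in range(m_rows + 2 * n - 2):          # fill + compute + drain
        # Feed the left edge: array row i sees x[t - i, i] at cycle t.
        for i in range(n):
            m = t - i
            h[i, 0] = x[m, i] if 0 <= m < m_rows else 0.0

        # Every PE performs exactly one multiply-accumulate this cycle.
        psum = v + h * w

        # The result for input row m leaves the bottom of column j
        # at cycle t = m + (n - 1) + j.
        for j in range(n):
            m = t - (n - 1) - j
            if 0 <= m < m_rows:
                out[m, j] = psum[n - 1, j]

        # Register transfer: partial sums move down, activations move right.
        v[1:] = psum[:-1]
        v[0] = 0.0
        h[:, 1:] = h[:, :-1].copy()

    return out

rng = np.random.default_rng(0)
x, w = rng.standard_normal((5, 8)), rng.standard_normal((8, 8))
assert np.allclose(systolic_matmul(x, w), x @ w)
```

The property that matters is visible in the loop body: every processing element does exactly one multiply-accumulate per cycle, and no value ever travels further than one neighbour, which is why the design spends die area on arithmetic rather than on wires and control.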
Hennessy and Patterson's 2018 Turing lecture calls this the "new golden age for computer architecture": once Dennard scaling ends, you can no longer get free perf/W from the process; you have to pay for it with specialisation. The TPU is the canonical example. The cost is portability — a TPU only runs ML workloads — and the payoff is one to two orders of magnitude in perf/W on those workloads. Every hyperscaler now has its own version of this argument and its own chip.
Click a milestone to see what was happening at Google, in the broader ML world, and on the TPU programme that month. Use this to anchor your mental model: most LLM headlines you remember have a TPU silicon decision sitting underneath them by 6–18 months.
Deck 02 — Generations Overview walks the spec sheet for every TPU from v1 to Ironwood; deck 03 — Systolic Arrays opens up the 1978 Kung & Leiserson lineage in detail; deck 04 — Inside TPU v1 takes the chip from this story apart slide by slide.