A custom AI ASIC was a strange thing for a search company to build in 2013. The story of how Google did it in 15 months — and the architects who knew, from forty years of computer-architecture history, that it would work.
The TPU is one of the small handful of clean-sheet datacenter ASICs to ever reach genuine industrial scale. Understanding how it got there is half history, half computer-architecture lineage. We'll cover both.
The TPU project does not start with a chip designer. It starts with a back-of-the-napkin calculation by Jeff Dean — Google's Senior Fellow, Google Brain co-founder, and the engineer most responsible for the architecture of Google search itself.
Speech recognition had just made the jump from Hidden Markov Models to deep neural networks. Voice search worked, and was about to become a default Android feature. Dean ran the obvious extrapolation: if users spoke to their phones for even a few minutes a day, serving the resulting DNN inference on the existing CPU fleet would require roughly doubling Google's datacenter footprint.
"Double the datacenters" is not a feature request. It is a structural problem. Google's existing fleet of CPU-and-GPU machines was already planet-scale. You cannot solve it with another procurement cycle; you have to change the per-query economics of inference.
That is the moment a custom ASIC stops being exotic and starts being the cheapest option on the table.
The TPU isn't justified by training cost. It's justified by inference cost at fleet scale. Training is a fixed expense per model; inference is paid per query, every day, forever. The voice projection was the first time anyone at Google did the accounting at fleet scale and concluded that the answer was a new kind of silicon.
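To make the shape of that accounting concrete, here is a purely illustrative back-of-the-napkin sketch in the spirit of the projection. Every number below is a hypothetical placeholder, not a Google figure; what matters is the structure: a per-second inference cost multiplied by usage, paid every single day, set against the sustained throughput of general-purpose servers.

```python
# Purely illustrative napkin math in the spirit of the voice-search projection.
# Every number below is a hypothetical placeholder, not a published Google figure.

users = 100e6                      # hypothetical daily voice-search users
speech_seconds_per_user = 3 * 60   # hypothetical: a few minutes of speech per user per day
ops_per_second_of_audio = 10e9     # hypothetical DNN inference cost per second of audio

daily_inference_ops = users * speech_seconds_per_user * ops_per_second_of_audio

sustained_ops_per_server = 100e9   # hypothetical sustained CPU throughput on this workload
seconds_per_day = 86_400

extra_servers = daily_inference_ops / (sustained_ops_per_server * seconds_per_day)
print(f"extra servers needed for one feature: {extra_servers:,.0f}")
```

Whatever placeholder values you plug in, the structure stays the same: the inference term scales with users and usage while the training term does not, which is why fleet-scale accounting pointed at inference silicon.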
In 2013 the conventional answer to "Google needs more ML throughput" was either more Xeon servers or more NVIDIA GPUs. Both options were available. Both were rejected. Why?
If you remove everything from a GPU that isn't doing 8-bit multiply-accumulate, you free a startling amount of die for actual MACs. Google's projection said an ASIC could deliver 30× better perf/W than a contemporary GPU on the inference workloads Google actually ran — and the ISCA 2017 paper's measured numbers (15–30× perf, 30–80× perf/W vs Haswell and K80) bore that out.
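The arithmetic being kept is worth seeing concretely. Below is a generic sketch of 8-bit inference arithmetic (symmetric per-tensor quantisation, int8 multiplies accumulated in int32, one rescale per output). It is the textbook scheme, not a description of TPU v1's exact quantisation recipe, and the helper names are mine.

```python
import numpy as np

def quantize(x, scale):
    """Map float values to int8 with a simple symmetric scale."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
activations = rng.standard_normal((1, 256)).astype(np.float32)
weights = rng.standard_normal((256, 256)).astype(np.float32)

a_scale = np.abs(activations).max() / 127
w_scale = np.abs(weights).max() / 127

a_q = quantize(activations, a_scale)
w_q = quantize(weights, w_scale)

# int8 x int8 products accumulated in int32: the only operation the MAC
# array has to support in hardware.
acc = a_q.astype(np.int32) @ w_q.astype(np.int32)

# One rescale per output, outside the MAC array.
result = acc.astype(np.float32) * (a_scale * w_scale)
reference = activations @ weights
print("max abs error:", np.abs(result - reference).max())
```

The hardware only ever needs the int8-times-int8-into-int32 step in the middle; everything else is cheap bookkeeping, which is why stripping a die down to MACs buys so much.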
This is the moment the term domain-specific architecture (DSA) starts to matter at hyperscale. Hennessy and Patterson's Turing-lecture paper "A New Golden Age for Computer Architecture" (the 2017 Turing Award; lecture delivered in 2018) uses the TPU as the canonical example: when Moore's law slows and Dennard scaling ends, the way out is specialised silicon for the workloads that dominate your fleet. Google's fleet was dominated by inference of a small set of model topologies. That's the ideal DSA target.
You cannot understand the TPU without understanding the person who designed it. Norm Jouppi's career is a forty-year survey of every memory-hierarchy and microarchitecture lesson ever learned in industry.
PhD Stanford 1984 under John Hennessy; one of the original architects on the Stanford MIPS RISC project that defined modern processor design. Joined Google in 2011; tech-led the TPU programme from inception in 2013.
MIPS was the first commercial processor built on the explicit principle that the compiler, not the hardware, should schedule instructions. No out-of-order, no aggressive branch predictor — just a simple pipeline and a smart compiler. That philosophy was unfashionable for two decades while Intel made out-of-order superscalar work brilliantly. It became fashionable again the moment dark silicon and power constraints made hardware-side dynamic scheduling look prohibitively expensive for throughput silicon. The TPU is a MIPS-philosophy machine. The XLA compiler does the scheduling that an Intel core does in hardware.
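A minimal way to see the "compiler schedules, hardware executes" philosophy from the software side is JAX, whose jit path hands the whole traced computation to XLA for ahead-of-time fusion and scheduling. A small sketch, assuming a recent JAX install; the same program compiles for CPU, GPU or TPU backends.

```python
import jax
import jax.numpy as jnp

def layer(x, w, b):
    # One dense layer; XLA sees the whole graph, not one instruction at a time.
    return jax.nn.relu(x @ w + b)

x = jnp.ones((8, 256))
w = jnp.ones((256, 256))
b = jnp.ones((256,))

# Ahead-of-time: trace, lower to XLA, compile, then execute the static plan.
compiled = jax.jit(layer).lower(x, w, b).compile()
print(compiled(x, w, b).shape)   # (8, 256)
```

The analogy is loose, but it is the same bet: decide the schedule once in software rather than rediscovering it every cycle in hardware.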
From 1984 to 1996, Jouppi was at Digital's Western Research Lab. His papers from that era include some of the most cited microarchitecture work ever written, among them victim caches and stream buffers (ISCA 1990) and the CACTI cache-modelling tool.
From 1996 to 2011 he was at Compaq/HP Labs, ending as an HP Fellow running the Intelligent Infrastructure Lab. Eckert–Mauchly Award 2015 (the IEEE/ACM career-achievement prize for computer architecture) for "contributions to the design and analysis of high performance processors and computer storage systems".
Three habits show up directly in the TPU v1 design: (1) compiler does the scheduling (no OoO, no branch predictor, ~12-instruction CISC ISA); (2) quantitative memory-hierarchy analysis (the ISCA 2017 paper has a roofline plot that is pure CACTI-era thinking); (3) conservative, deeply-thought silicon choices — Jouppi's reputation is for shipping; the TPU went from spec to datacenter in 15 months.
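The roofline habit is easy to reproduce. The sketch below uses ballpark TPU v1 figures, roughly 92 TOPS of peak INT8 throughput and on the order of 34 GB/s of off-chip weight bandwidth, plus the crude approximation that an MLP layer serving batch B performs about 2·B operations per weight byte fetched; treat every number as illustrative rather than a restatement of the paper's plot.

```python
# A toy roofline in the spirit of the ISCA 2017 plot. Ballpark TPU v1 figures;
# treat both constants as illustrative.

peak_ops_per_s = 92e12          # peak 8-bit MAC throughput, ops/s
mem_bw_bytes_per_s = 34e9       # off-chip weight bandwidth, bytes/s

def attainable(ops_per_byte):
    """Roofline: min(peak compute, bandwidth * operational intensity)."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * ops_per_byte)

ridge = peak_ops_per_s / mem_bw_bytes_per_s
print(f"ridge point: {ridge:.0f} ops per byte fetched from memory")

# An MLP layer serving batch B reuses each fetched weight byte ~B times
# (one multiply and one add per example), so intensity grows with batch size.
for batch in (8, 64, 512, 4096):
    intensity = 2 * batch
    print(f"batch {batch:5d}: {attainable(intensity) / 1e12:6.1f} TOPS attainable")
```

At the small batches production serving actually used, the attainable throughput sits far below the ridge point, which is the memory-bandwidth lesson that pushed v2 onto HBM.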
The ISCA 2017 TPU paper has seventy-five authors. A custom ASIC at this scale is a team sport. A few names worth knowing:
David Patterson: Berkeley RISC pioneer; co-author of Computer Architecture: A Quantitative Approach; 2017 Turing Award (with Hennessy). Joined Google specifically because of the TPU programme. Co-author on the ISCA 2017, CACM 2020 (v2/v3), and ISCA 2023 (v4) papers. Public face of the "domain-specific architecture" framing.
Jeff Dean: the voice-search projection was his. Co-founded Google Brain (2011) with Andrew Ng and Greg Corrado; pushed for and underwrote the TPU programme internally. Co-author on ISCA 2017. More recently: lead author on the Pathways system that orchestrates pod-spanning training.
Cliff Young: co-author on all three landmark TPU papers (2017, 2020, 2023). Earlier work at Bell Labs and on D. E. Shaw Research's Anton molecular-dynamics ASIC, another high-profile clean-sheet datacenter chip. The Anton experience shows up in the TPU's deterministic, statically-scheduled approach.
Jonathan Ross: one of the engineers on the original TPU team. Left in 2016 to found Groq, whose deterministic-VLIW LPU is, very deliberately, an attempt to take the TPU's static-scheduling philosophy further still. Half of the post-2016 ML-accelerator startup wave is, in some sense, alumni of this team.
And the systems engineers: the leads on board design, cooling and system integration. The names that do not headline papers but design the boards, the SerDes, the OCS optical switch (Palomar, ISCA 2023), and the Jupiter datacenter network the pods sit in.
Quoc Le, the early Google Brain researcher famous for the "neural cat detector" work, is sometimes credited with the TPU. He is not on any of the TPU papers and was not on the hardware team; he is on the ML-research side that motivated the chip. Worth being precise: the TPU is a hardware-team artefact, prompted by a Brain-team need.
The most surprising number in the entire TPU story is the schedule. The May 2016 Cloud blog post by Jouppi, written immediately after the public announcement, gives the only numbers Google has ever published on the design cycle:
Jouppi's team starts the architecture work after the voice-search projection lands.
~12 months of design. A 28 nm tape-out at TSMC; very deliberate process choice (mature, cheap, predictable).
22 days from "tested silicon" to a card running production inference workloads in a Google datacenter.
By late 2015 TPUs are running RankBrain, Translate, Photos and Street View at meaningful scale. None of this is public yet.
The Lee Sedol match in Seoul (9–15 March) is served on TPU v1.
Sundar Pichai keynote at Google I/O; Jouppi's blog post the same day.
For comparison: a typical big-vendor AI accelerator takes closer to 3–4 years from spec to general availability. NVIDIA's Volta announcement (May 2017) traces back to architecture work started around 2013. The TPU did the same job, and shipped to a production datacenter, in less than half the time.
Three structural choices made it work. (1) Mature node: 28 nm in 2014 was three years old — tooling and yields were boring and predictable. (2) Minimal ISA: ~12 CISC instructions, no caches, no branch predictor. There simply isn't much to verify. (3) Drop-in form factor: a PCIe Gen3 ×16 card. No new server, no new datacenter, no new power infrastructure — just slot it next to the existing fleet.
The first the world heard of the TPU was not from a chip launch but from a Go match. AlphaGo's 4–1 victory over Lee Sedol in Seoul in March 2016 was, for most observers, the first time deep learning seemed obviously transformative. The chip running it was a TPU v1.
The Lee Sedol match mattered for the TPU programme in a way that no benchmark plot could. It gave Google leadership a public-facing proof that custom silicon for ML was a strategic capability, not a research curiosity. Every TPU generation since traces, in budget terms, back to that match.
On 18 May 2016 at Google I/O, Sundar Pichai announced the TPU on stage. Jouppi's accompanying Cloud blog post, published the same day, is still the cleanest first-principles description of the chip in public. It led with one specific number, roughly an order of magnitude better performance per watt on Google's ML workloads, and said almost nothing about the architecture, the process node, or the measured benchmarks behind it.
Those details would arrive thirteen months later in the ISCA 2017 paper — the first peer-reviewed disclosure of a hyperscaler's production AI ASIC.
Google's previous datacenter custom silicon — the network-switch ASICs of Jupiter, the storage-controller chips — was never publicised. The TPU was. Two reasons: (1) recruiting — chip architects want to know their work will be seen; an entire generation of ML-hardware talent moved to Google after this announcement; (2) cloud strategy — Google Cloud was being framed as the place to run TensorFlow, and the TPU was the differentiator. The chip is now sold by the hour as Cloud TPU.
If you only read three TPU papers, read these three. Together they cover the entire programme.
| Paper | Venue | Year | What it covers |
|---|---|---|---|
| In-Datacenter Performance Analysis of a Tensor Processing Unit. Jouppi, Young, Patil, Patterson … (75 authors) | ISCA | 2017 | The TPU v1 disclosure paper. 28 nm, 256×256 INT8 systolic, 92 TOPS, 28 W typical, 40 W max. Roofline analysis showing v1 was memory-bandwidth-bound — the lesson that motivated HBM in v2. |
| A Domain-Specific Supercomputer for Training Deep Neural Networks. Jouppi, Yoon, Kurian, Li, Patil, Laudon, Young, Patterson | CACM | 2020 | TPU v2 and v3. Why training needed a new chip (no FP, no gradients in v1). bfloat16, two TensorCores per chip, 2D-torus pod, ICI custom interconnect. |
| TPU v4: An Optically Reconfigurable Supercomputer for ML with Hardware Support for Embeddings. Jouppi, Kurian, Li, … Patterson | ISCA | 2023 | The 4096-chip v4 pod. 7 nm, 3D torus, the Palomar 3D-MEMS optical circuit switch, SparseCore for embeddings. The "ML supercomputer" framing. |
Google publishes the TPU. AWS does not publish Trainium / Inferentia in any comparable depth. Microsoft has not published Maia. Meta's MTIA has a paper but nothing like the level of detail. This is partly a Patterson legacy — an academic culture inside Google's hardware org — and partly recruiting. The papers are also how the rest of the industry has learned that a custom AI ASIC at hyperscale actually works.
What started as a one-off inference chip in 2015 has become a continuous decade-long silicon programme. The cadence has been roughly one new chip every 18–24 months, and from v5 onward, two SKUs per generation.
| Generation | Year | Role | Key innovation |
|---|---|---|---|
| v1 | 2015 | Inference | 256×256 INT8 systolic; 92 TOPS at 40 W |
| v2 | 2017 | Training | bfloat16, HBM, 2D torus pod, two TensorCores per chip |
| v3 | 2018 | Training (refresh) | 4 MXUs/chip, 32 GiB HBM, liquid cooling |
| v4 | 2020 (announced 2021) | Training flagship | 7 nm, 3D torus, Palomar OCS, SparseCore, CMEM |
| v4i | 2020 | Inference | Single-core variant of v4 for fleet inference |
| v5e | Aug 2023 | Cost-optimised | The "e-class" inference / small-training chip |
| v5p | Dec 2023 | Training flagship | 95 GiB HBM, 8,960-chip pod, Gemini 1/1.5 training |
| Trillium (v6e) | May 2024 / GA Dec 2024 | Cost-optimised | 4.7× v5e compute, 3rd-gen SparseCore |
| Ironwood (v7 / TPU7x) | Apr 2025 / GA Nov 2025 | Inference flagship | 4.6 PFLOPS FP8, 192 GiB HBM3e, 9,216-chip pod |
The product fork into e-class (efficient) and p-class (performance) SKUs is itself a strategic decision. It loosely mirrors NVIDIA's split between consumer (RTX) and datacenter (HGX) lines, but with far less silicon divergence: same architecture team, same software stack, two SKUs per generation.
The "TPU" you read about today is a continuous lineage of decisions: 28 nm INT8 inference (v1) → 16 nm bf16 training (v2/v3) → 7 nm pod-scale training with optical interconnect (v4) → FP8 inference at 192 GiB per chip (v7). Every generation is a response to what the previous one couldn't do. The shape of the chip changes; the philosophy — statically scheduled, compiler-driven, scratchpad memory, systolic core — does not.
The TPU is sometimes presented as a clean-sheet design. It is not. It is a synthesis of three decades of architecture research that happened to find its first hyperscale application in deep learning.
For decades after Kung & Leiserson there was no real workload that wanted a 65,536-MAC systolic array. Signal processing was too small, scientific computing too irregular. Deep learning — the moment it became a sequence of dense matmuls on small batches of low-precision integers — was the first workload in history that fit the shape. The TPU is the chip those architecture lineages were waiting for.
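For a concrete feel for what "the shape" means, here is a toy, cycle-level simulation of a weight-stationary systolic matmul in the Kung & Leiserson style: weights pinned in place, activations streaming in from the left on a time skew, partial sums marching down the columns. It illustrates the idea only, not the TPU's actual pipeline; deck 03 walks through the real thing.

```python
import numpy as np

def systolic_matmul(x, w):
    """Toy cycle-level simulation of a weight-stationary systolic array.

    PE (i, j) permanently holds w[i, j]. Activations stream in from the left
    edge (one array row per input feature, skewed in time); partial sums flow
    downward; finished dot products leave the bottom edge.
    """
    m_rows, n = x.shape
    assert w.shape == (n, n)

    h = np.zeros((n, n))          # activation at each PE's left input
    v = np.zeros((n, n))          # partial sum at each PE's top input
    out = np.zeros((m_rows, n))

    for t in range(m_rows + 2 * n - 2):          # fill + compute + drain
        # Feed the left edge: array row i sees x[t - i, i] at cycle t.
        for i in range(n):
            m = t - i
            h[i, 0] = x[m, i] if 0 <= m < m_rows else 0.0

        # Every PE performs exactly one multiply-accumulate this cycle.
        psum = v + h * w

        # The result for input row m leaves the bottom of column j
        # at cycle t = m + (n - 1) + j.
        for j in range(n):
            m = t - (n - 1) - j
            if 0 <= m < m_rows:
                out[m, j] = psum[n - 1, j]

        # Register transfer: partial sums move down, activations move right.
        v[1:] = psum[:-1]
        v[0] = 0.0
        h[:, 1:] = h[:, :-1].copy()

    return out

rng = np.random.default_rng(0)
x, w = rng.standard_normal((5, 8)), rng.standard_normal((8, 8))
assert np.allclose(systolic_matmul(x, w), x @ w)
```

The property that matters is visible in the loop body: every processing element does exactly one multiply-accumulate per cycle, and no value ever travels further than one neighbour, which is why the design spends die area on arithmetic rather than on wires and control.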
Hennessy and Patterson's 2018 Turing lecture calls this the "new golden age for computer architecture": once Dennard scaling ends, you can no longer get free perf/W from the process; you have to pay for it with specialisation. The TPU is the canonical example. The cost is portability — a TPU only runs ML workloads — and the payoff is one to two orders of magnitude in perf/W on those workloads. Every hyperscaler now has its own version of this argument and its own chip.
Click a milestone to see what was happening at Google, in the broader ML world, and on the TPU programme that month. Use this to anchor your mental model: most LLM headlines you remember have a TPU silicon decision sitting underneath them by 6–18 months.
Deck 02 — Generations Overview walks the spec sheet for every TPU from v1 to Ironwood; deck 03 — Systolic Arrays opens up the 1978 Kung & Leiserson lineage in detail; deck 04 — Inside TPU v1 takes the chip from this story apart slide by slide.