Google TPUs Series — Presentation 02

Ten Years of TPUs — v1 to Ironwood

Every TPU generation from the 28 nm 2015 inference chip to the 9,216-chip Ironwood pod — specs, role, what it unlocked, and the e-class / p-class fork.

v1 (2015) v2 (2017) v3 (2018) v4 (2020) v5e (2023) v5p (2023) Trillium (2024) Ironwood (2025)
00

Topics We'll Cover

This deck is the spec-sheet companion to deck 01. It walks every chip and pod, anchors the numbers in primary Google sources, and shows the per-generation jumps as a single chart you can use as a desk reference.

01

The Whole Lineage At A Glance

Every Google TPU ever shipped, on one row. Numbers are peak vendor figures from Google Cloud documentation and the relevant launch blog or paper.

Gen | Year | Process | Per-chip peak | HBM | Pod size | Topology | Role
v1 | 2015 | 28 nm | 92 TOPS INT8 | (8 GiB DDR3) | 1 chip / PCIe card | n/a | Inference
v2 | 2017 | 16 nm | 45 TFLOPS bf16 | 16 GiB | 256 chips | 2D torus | Training
v3 | 2018 | 16 nm | 123 TFLOPS bf16 | 32 GiB | 1,024 chips | 2D torus | Training
v4 | 2020 (announced 2021) | 7 nm | 275 TFLOPS bf16 | 32 GiB | 4,096 chips | 3D torus + OCS | Training
v4i | 2020 | 7 nm | 138 TFLOPS bf16 | 8 GiB | 1 chip | n/a | Inference
v5e | 2023 | (undisclosed) | 197 TFLOPS bf16 / 393 TOPS INT8 | 16 GiB | 256 chips | 2D torus | Cost-optimised
v5p | 2023 | (undisclosed) | 459 TFLOPS bf16 / 918 TOPS INT8 | 95 GiB | 8,960 chips | 3D torus + OCS | Training flagship
Trillium (v6e) | 2024 | (undisclosed) | 918 TFLOPS bf16 / 1.8 POPS INT8 | 32 GiB | 256 chips | 2D torus | Cost-optimised
Ironwood (v7) | 2025 | (undisclosed) | 4.6 PFLOPS FP8 / 2.3 PFLOPS bf16 | 192 GiB HBM3e | 9,216 chips | 3D torus + OCS | Inference flagship

The two-track shape from v5 onward

02

v1 — The 2015 Inference Chip

The v1 disclosure (ISCA 2017) is still the most quoted custom-AI-ASIC paper ever written. The chip itself is strikingly simple.

Silicon

  • TSMC 28 nm.
  • Die size ≤ 331 mm² (per the ISCA paper).
  • Single 256×256 8-bit MAC systolic array = 65,536 MACs.
  • Clock 700 MHz → 92 TOPS INT8 (quick arithmetic check below).
  • Multiplies INT8, accumulates INT32.
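
A quick sanity check on the 92 TOPS headline, derived only from the bullets above (a sketch; the paper rounds 91.75 up to 92):

```python
# Peak INT8 throughput of the v1 systolic array.
macs_per_cycle = 256 * 256        # one 8-bit MAC per array cell
ops_per_mac = 2                   # a MAC counts as multiply + add
clock_hz = 700e6
peak_tops = macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(f"{peak_tops:.2f} TOPS INT8")   # 91.75 -> the "92 TOPS" headline
```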

Memory & I/O

  • 24 MiB on-chip Unified Buffer (activations).
  • 4 MiB accumulators, software-managed.
  • 8 GiB dual-channel DDR3-2133 at ~34 GB/s — the bandwidth wall the paper makes famous.
  • PCIe Gen3 ×16 host link.
  • ~40 W measured busy power (75 W TDP in the paper). Drop-in to existing servers.

What it ran in 2015–2017

RankBrain (ranking signal in Search), Google Translate's Neural Machine Translation, Google Photos labelling, Street View text recognition, Now/Assistant speech, and the Lee Sedol AlphaGo match. Inference only — no FP, no gradients.

The lesson v2 had to fix

The ISCA 2017 roofline shows v1 was severely memory-bandwidth-bound on most workloads — 34 GB/s of DDR3 simply cannot feed 92 TOPS of compute. Every subsequent TPU has had HBM. That bandwidth wall is the single most important thing v1 taught the team, and it shapes every chip after.
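
A back-of-the-envelope roofline check makes the wall concrete. This is a sketch, not the paper's exact methodology; the "64 ops/byte" figure assumes an INT8 layer that reuses each weight byte across a batch of 32, an illustrative number only:

```python
peak_ops = 92e12   # INT8 ops/s from the systolic array
mem_bw = 34e9      # DDR3 bytes/s

# Arithmetic intensity needed to stay compute-bound (the roofline ridge point).
ridge = peak_ops / mem_bw
print(f"{ridge:.0f} INT8 ops per DRAM byte needed to reach peak")      # ~2706

# A weight-streaming layer at batch 32 does roughly 2 * 32 = 64 ops per weight byte.
achievable = min(peak_ops, 64 * mem_bw)
print(f"memory-bound ceiling at 64 ops/byte: {achievable / 1e12:.1f} TOPS")  # ~2.2
```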

03

v2 & v3 — Training Arrives

v1 was inference-only. By 2016 it was clear that the bigger fleet bottleneck was actually training large neural-translation and language models. v2 was a clean redesign: floating point, HBM, an interconnect that let chips talk to each other, gradients.

v2 (2017)

  • 16 nm. Two TensorCores per chip.
  • Each TensorCore: one 128×128 MXU, vector unit, scalar unit.
  • bfloat16 introduced — Google Brain's invention: FP32's 8-bit exponent paired with a 7-bit mantissa.
  • FP32 accumulation.
  • 16 GiB HBM at ~600 GB/s.
  • 45 TFLOPS bf16 per chip.
  • 4 chips per board; 256-chip pod as 16×16 2D torus → ~11.5 PFLOPS.
  • Custom ICI (Inter-Chip Interconnect) connecting each chip to 4 nearest neighbours.
  • Air-cooled.

v3 (2018)

  • Same 16 nm node — an incremental refresh of v2.
  • Doubled MXUs — 4 MXUs/chip (2 per TensorCore).
  • 32 GiB HBM at ~900 GB/s.
  • 123 TFLOPS bf16 per chip.
  • 1,024-chip pod as 32×32 2D torus → >100 PFLOPS, ~32 TiB aggregate HBM.
  • Liquid cooling — first TPU generation to require it (and the first liquid-cooled accelerator at hyperscale).
  • Used to train BERT-large, T5, early MUM, much of Search ranking.

bfloat16 — Google's most influential numeric

The format was designed inside Google Brain specifically so that mixed-precision training of large neural nets would be numerically stable without rescaling: keep FP32's 8-bit exponent (so dynamic range is identical), drop the bottom 16 bits of the mantissa (so memory and bandwidth halve, multipliers shrink). NVIDIA shipped bf16 in tensor cores starting Ampere (2020); Intel and ARM followed. Every modern AI accelerator now supports bf16 because of TPU v2.
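
A minimal sketch of what the format keeps and drops, using numpy; real hardware converters typically round to nearest-even rather than truncate:

```python
import numpy as np

def fp32_to_bf16_truncate(x):
    """Keep the sign, the full 8-bit exponent and the top 7 mantissa bits of FP32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)   # zero the low 16 mantissa bits

x = np.array([3.14159265], dtype=np.float32)
print(fp32_to_bf16_truncate(x))   # [3.140625] -- FP32's dynamic range, ~3 decimal digits of precision
```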

04

v4 & v4i — OCS, SparseCore, the Pod Era

v4 is the generation where the TPU stops being a chip and becomes a supercomputer. It was deployed in Google datacenters from 2020, announced at I/O 2021, and finally given a paper in 2023 (the long delay is a hyperscaler tell — you don't disclose the supercomputer that's training your next model).

v4 (training)

  • TSMC 7 nm.
  • Two TensorCores; 4 MXUs each (8 MXUs/chip total).
  • bf16 + INT8.
  • 275 TFLOPS bf16 per chip.
  • 32 GiB HBM at 1.2 TB/s.
  • ~170 W per chip.
  • 6 ICI links per chip → 3D torus.
  • 4,096-chip pod = 64 blocks of 64 chips, ~1.1 ExaFLOPS bf16.
  • Palomar OCS — 3D-MEMS optical circuit switches, 136 ports each, 48 OCS units per pod, <5% of system cost.
  • SparseCore — embedding-lookup accelerator on-die for recommendation models.
  • Trained PaLM 540B on 6,144 chips for 50 days.

v4i (inference)

  • Same 7 nm process, ISCA 2021 paper.
  • One TensorCore per chip (vs v4's two) — tighter inference cost envelope.
  • 138 TFLOPS bf16; INT8 also supported.
  • 8 GiB HBM2 at ~614 GB/s.
  • 175 W; air-cooled (vs v4's liquid).
  • Introduced CMEM — a ~128 MiB on-chip cache that carries forward into every later TPU.
  • Powers Search, YouTube, Ads, Assistant, and later Bard / Gemini inference.

Why OCS matters

Optical Circuit Switching gives v4 something no GPU pod has: topology that can be reconfigured per job, in milliseconds. A bad cube can be routed around without hardware swap; a 2,048-chip slice can be allocated as 4×4×128 or 8×16×16 depending on what your model wants. Deck 10 takes Palomar apart in detail.
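
A toy enumeration of the torus shapes OCS could wire up for a 2,048-chip slice out of 4×4×4 building cubes. Illustrative only — the real scheduler has placement and wiring constraints this ignores:

```python
def slice_shapes(num_chips, cube=4):
    """Logical 3D torus shapes (chips per axis) built from cube x cube x cube blocks."""
    blocks = num_chips // cube**3
    shapes = []
    for a in range(1, blocks + 1):
        if blocks % a:
            continue
        for b in range(a, blocks // a + 1):
            if (blocks // a) % b:
                continue
            c = blocks // (a * b)
            if c >= b:
                shapes.append((a * cube, b * cube, c * cube))
    return shapes

print(slice_shapes(2048))
# [(4, 4, 128), (4, 8, 64), (4, 16, 32), (8, 8, 32), (8, 16, 16)]
```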

05

v5e & v5p — The Two-Track Fork

By v5 the product strategy is explicit: ship two SKUs every generation. The "e" is for efficiency — cost-optimised inference and small-scale training. The "p" is for performance — the chip Google trains its next frontier model on.

v5e (Aug 2023)

  • One TensorCore, 4 MXUs, vector + scalar units.
  • 197 TFLOPS bf16, 393 TOPS INT8 per chip.
  • 16 GiB HBM at 819 GB/s.
  • 400 GB/s ICI bidirectional, 4 ports, 2D torus.
  • 256-chip pod → ~50.6 PFLOPS.
  • Targeted at fleet inference and modest fine-tuning.
  • Marketed as ~2.5× perf/$ vs v4 for inference.

v5p (Dec 2023)

  • Two TensorCores, 4 MXUs each, plus 4 SparseCores.
  • 459 TFLOPS bf16, 918 TOPS INT8 per chip.
  • 95 GiB HBM at 2.76 TB/s — roughly 3.4× the bandwidth of v5e.
  • ICI: 4,800 Gbps per chip (600 GB/s).
  • 3D torus + OCS, 8,960-chip pod; max single-job slice 6,144 chips.
  • Used to train Gemini 1.0 and Gemini 1.5.
  • Pod aggregate: ~4.1 ExaFLOPS bf16, ~850 TiB aggregate HBM.

The Multislice innovation

v5 is also the generation that ships Multislice as a first-class feature. A single training job can span multiple pods (potentially across datacenters), with high-bandwidth ICI inside each slice and the Jupiter datacenter network between slices. PaLM trained on two TPU v4 pods via Pathways; Gemini training on v5p uses the same pattern at higher scale.
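
A minimal JAX sketch of the same pattern — a fast axis inside a slice, a slow axis across slices. The axis names, mesh shape and shardings here are assumptions for illustration, not Google's production configuration, and it presumes a multi-slice TPU environment is already initialised (jax.experimental.mesh_utils also offers create_hybrid_device_mesh for building a DCN × ICI mesh directly):

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Lay out all visible devices as (slices, chips-per-slice); 2 slices assumed here.
n_dev = len(jax.devices())
devices = mesh_utils.create_device_mesh((2, n_dev // 2))
mesh = Mesh(devices, axis_names=("dcn", "ici"))

x = jnp.ones((256, 8192))     # activations
w = jnp.ones((8192, 8192))    # weights
# Batch is data-parallel across slices (slow DCN hops); weights are model-parallel
# within a slice (fast ICI hops). XLA/GSPMD inserts the collectives for each fabric.
x = jax.device_put(x, NamedSharding(mesh, P("dcn", None)))
w = jax.device_put(w, NamedSharding(mesh, P(None, "ici")))
y = jax.jit(jnp.dot)(x, w)
print(y.shape, y.sharding)
```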

06

Trillium (v6e)

Announced at Google I/O May 2024, GA December 2024. Trillium is e-class only — no v6p exists; the next p-class is Ironwood / v7.

Per-chip

  • 918 TFLOPS bf16 (4.66× v5e's 197 TFLOPS — the marketing "4.7×" headline).
  • 1,836 TOPS INT8.
  • 32 GiB HBM at 1,640 GB/s — exactly 2× v5e in capacity and bandwidth.
  • 800 GB/s ICI bidirectional — 2× v5e.
  • 3rd-generation SparseCore — 2× embedding throughput, 5× on DLRM-DCNv2.

Pod & multipod

  • 256-chip pod, 2D torus.
  • Multipod over Google's Jupiter datacenter fabric — up to ~100,000 Trillium chips in one optical-network domain, 13 Pb/s bisection bandwidth (vendor figure).
  • 99% scaling efficiency at 12-pod (3,072 chips), 94% at 24-pod (6,144 chips) — vendor figures from the GA blog.
  • Used to train Gemini 2.0.
Why "Trillium" got a name

From v6 onwards, Google has been naming TPUs after plants — Trillium (a woodland flower), Ironwood (a tree). The numbered scheme is still used internally and in docs ("v6e", "v7" / "TPU7x"), but the public face has been rebranded. This mirrors NVIDIA's scientist naming (Pascal, Volta, Hopper) — a memorable name is good cloud marketing.

07

Ironwood (v7) — The Inference Flagship

Announced at Google Cloud Next on 9 April 2025; GA in November 2025. Google calls it "the first TPU built for the age of inference" — v4i (2020) already targeted inference, but Ironwood is the first inference-focused chip at flagship scale, and the first that explicitly leads on FP8.

Compute

  • 4,614 TFLOPS FP8 (= 4.6 PFLOPS) per chip.
  • 2,307 TFLOPS bf16 — exactly half the FP8 rate, since FP8 runs at twice the bf16 throughput.
  • ~5× Trillium's headline peak (4,614 FP8 vs 918 bf16); ~2.5× comparing bf16 to bf16.

Memory

  • 192 GiB HBM3e per chip — 6× Trillium capacity.
  • 7.37 TB/s bandwidth — 4.5× Trillium.
  • The chip Gemini 2.5 / 3 inference runs on.

Interconnect

  • 1.2 TB/s ICI bidirectional per chip (= 9.6 Tbps).
  • 1.5× Trillium ICI bandwidth.
  • 3D torus + OCS (p-class topology).

Pod scale

  • 9,216-chip pod, 3D torus + OCS.
  • ~42.5 ExaFLOPS FP8 aggregate (9,216 × 4.6 PFLOPS per chip).
  • ~1.7 PiB aggregate HBM3e (9,216 × 192 GiB).

A 192 GiB inference chip changes the picture

For perspective: NVIDIA's H200 has 141 GiB HBM3e per GPU; B200 has 192 GiB per dual-die GPU package. Ironwood matches B200 capacity at the chip level — on a chip explicitly tuned for inference rather than training. A single chip can hold a 70B-parameter model in BF16 with KV cache to spare. A 256-chip pod can hold the largest open Gemini-class checkpoints with massive batch-size headroom.
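
Rough capacity arithmetic behind that claim — a sketch assuming bf16 weights and bf16 KV cache with no runtime overheads; the layer/head counts are assumptions for a generic ~70B dense model, not a specific Google checkpoint:

```python
GIB = 2**30
hbm_bytes = 192 * GIB
weight_bytes = 70e9 * 2                        # 70B parameters at 2 bytes (bf16)
free_bytes = hbm_bytes - weight_bytes
print(f"weights ~{weight_bytes / GIB:.0f} GiB, ~{free_bytes / GIB:.0f} GiB left for KV cache")

# bf16 KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes.
layers, kv_heads, head_dim = 80, 8, 128        # assumed model shape (GQA-style)
kv_per_token = 2 * layers * kv_heads * head_dim * 2
print(f"~{free_bytes / kv_per_token / 1e3:.0f}k tokens of KV cache fit in the remainder")
```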

08

Per-Chip Compute Jump Chart

Plotting peak per-chip throughput on a log scale. Note how the jumps come in three places: v1→v2 (FP arrives), v3→v4 (process node + 4 MXUs), and v6→v7 (FP8 introduced and scaled). This is the curve that says "every new model class arrives 6–18 months after a TPU jump".

[Chart] Peak per-chip throughput, log scale, each generation's headline numeric (PFLOPS): v1 0.092 (INT8) · v2 0.045 (bf16) · v3 0.123 (bf16) · v4 0.275 (bf16) · v4i 0.138 (bf16) · v5e 0.197 (bf16) · v5p 0.459 (bf16) · v6e 0.918 (bf16) · v7 2.307 (bf16) · v7 4.614 (FP8). Series split: p-class / training vs e-class / inference.

Numbers are in PFLOPS (or the PFLOPS-equivalent rate for INT8 / FP8).

09

Per-Chip HBM Jump Chart

Memory capacity is often the binding constraint for LLM serving and large-batch training. The TPU per-chip HBM curve is striking — flat from v2 to v4, then a step in v5p, and a near-vertical jump in Ironwood.

[Chart] Per-chip HBM capacity (GiB, linear scale): v1 0 (DDR3 only) · v2 16 · v3 32 · v4 32 · v4i 8 · v5e 16 · v5p 95 · v6e 32 · v7 192.

The v6e regression (32 GiB) is a deliberate cost choice — e-class chips don't need v5p's 95 GiB because they're not training frontier dense models. Ironwood at 192 GiB resets the inference picture: a single chip can hold a 70B model with the KV cache for a long-context conversation.

10

Pod Size Over Time

The other curve worth tracking: how many chips an ICI-coherent pod holds. This is the size of the largest "single machine" you can train on.

Generation | Pod size (chips) | Topology | Aggregate compute (peak) | Aggregate HBM
v1 | 1 | n/a (PCIe card) | 92 TOPS INT8 | 8 GiB DDR3
v2 | 256 | 16×16 2D torus | ~11.5 PFLOPS bf16 | 4 TiB
v3 | 1,024 | 32×32 2D torus | ~126 PFLOPS bf16 | 32 TiB
v4 | 4,096 | 3D torus + OCS (4×4×4 arrangement of 64-chip cubes, i.e. 16×16×16) | ~1.1 ExaFLOPS bf16 | 128 TiB
v5e | 256 | 16×16 2D torus | ~50 PFLOPS bf16 | 4 TiB
v5p | 8,960 | 3D torus + OCS | ~4.1 ExaFLOPS bf16 | ~850 TiB
Trillium (v6e) | 256 | 2D torus (multipod to ~100k chips via Jupiter) | ~235 PFLOPS bf16 | 8 TiB
Ironwood (v7) | 9,216 | 3D torus + OCS | ~42.5 ExaFLOPS FP8 | ~1.7 PiB HBM3e

The 9,216-chip Ironwood pod is the largest single-vendor scale-up domain yet built — two orders of magnitude more chips than an NVIDIA NVL72 rack (72 GPUs) — and it is stitched together by ICI rather than InfiniBand at the fabric level.

11

Interactive Generation Explorer

Pick two TPU generations to compare per chip. Useful for quick sanity-checks — "is Trillium really 4.7× v5e?", "how much compute does Ironwood gain from FP8 alone?".

Reading the ratio honestly

The compute ratio uses each chip's headline numeric — INT8 vs bf16 vs FP8. For a true throughput comparison you have to pick one numeric: e.g. v5p's 459 TFLOPS bf16 vs Ironwood's 2,307 TFLOPS bf16 is a clean 5×; saying Ironwood is 10× v5p only works if you compare FP8 against bf16, which is not like-for-like. Vendor headlines play this game often — check the units.
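
A tiny helper that keeps the comparison honest — per-chip headline numbers from this deck's lineage table, and a ratio function that refuses to mix numerics unless explicitly asked (names and structure here are purely illustrative):

```python
# (peak TFLOPS or TOPS, numeric format) per chip, from the lineage table.
SPECS = {
    "v5e": (197, "bf16"),
    "v5p": (459, "bf16"),
    "v6e": (918, "bf16"),
    "v7": (2307, "bf16"),
    "v7-fp8": (4614, "fp8"),
}

def ratio(a, b, allow_mixed=False):
    (pa, fa), (pb, fb) = SPECS[a], SPECS[b]
    if fa != fb and not allow_mixed:
        raise ValueError(f"{fa} vs {fb}: pick one numeric before comparing")
    return pa / pb

print(f"{ratio('v6e', 'v5e'):.2f}x")                       # 4.66x -> the "4.7x" headline
print(f"{ratio('v7', 'v5p'):.2f}x")                        # 5.03x, bf16 to bf16
print(f"{ratio('v7-fp8', 'v5p', allow_mixed=True):.1f}x")  # 10.1x, but FP8 vs bf16
```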

12

Cheat Sheet

Read next

Deck 03 — Systolic Arrays opens up the matmul engine inside the chip. Decks 04–08 walk each generation in detail. Deck 09 is on memory and numerics. Deck 10 is on ICI and OCS.