Google TPUs Series — Presentation 02

Ten Years of TPUs — v1 to Ironwood

Every TPU generation from the 28 nm 2015 inference chip to the 9,216-chip Ironwood pod — specs, role, what it unlocked, and the e-class / p-class fork.

v1 (2015) v2 (2017) v3 (2018) v4 (2020) v5e (2023) v5p (2023) Trillium (2024) Ironwood (2025)
00

Topics We'll Cover

This deck is the spec-sheet companion to deck 01. It walks every chip and pod, anchors the numbers in primary Google sources, and shows the per-generation jumps as a single chart you can use as a desk reference.

01

The Whole Lineage At A Glance

Every Google TPU ever shipped, on one row. Numbers are peak vendor figures from Google Cloud documentation and the relevant launch blog or paper.

Gen | Year | Process | Per-chip peak | HBM | Pod size | Topology | Role
v1 | 2015 | 28 nm | 92 TOPS INT8 | (8 GiB DDR3) | 1 chip / PCIe card | n/a | Inference
v2 | 2017 | 16 nm | 45 TFLOPS bf16 | 16 GiB | 256 chips | 2D torus | Training
v3 | 2018 | 16 nm | 123 TFLOPS bf16 | 32 GiB | 1,024 chips | 2D torus | Training
v4 | 2020 (announced 2021) | 7 nm | 275 TFLOPS bf16 | 32 GiB | 4,096 chips | 3D torus + OCS | Training
v4i | 2020 | 7 nm | 138 TFLOPS bf16 | 8 GiB | 1 chip | n/a | Inference
v5e | 2023 | (undisclosed) | 197 TFLOPS bf16 / 393 TOPS INT8 | 16 GiB | 256 chips | 2D torus | Cost-optimised
v5p | 2023 | (undisclosed) | 459 TFLOPS bf16 / 918 TOPS INT8 | 95 GiB | 8,960 chips | 3D torus + OCS | Training flagship
Trillium (v6e) | 2024 | (undisclosed) | 918 TFLOPS bf16 / 1.8 POPS INT8 | 32 GiB | 256 chips | 2D torus | Cost-optimised
Ironwood (v7) | 2025 | (undisclosed) | 4.6 PFLOPS FP8 / 2.3 PFLOPS bf16 | 192 GiB HBM3e | 9,216 chips | 3D torus + OCS | Inference flagship

The two-track shape from v5 onward

02

v1 — The 2015 Inference Chip

The v1 disclosure (ISCA 2017) is still the most quoted custom-AI-ASIC paper ever written. The chip itself is strikingly simple.

Silicon

  • TSMC 28 nm.
  • Die size ≤ 331 mm² (per the ISCA paper).
  • Single 256×256 8-bit MAC systolic array = 65,536 MACs.
  • Clock 700 MHz → 92 TOPS INT8 (quick arithmetic check below).
  • Multiplies INT8, accumulates INT32.
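
A quick sanity check on the 92 TOPS headline, derived only from the bullets above (a sketch; the paper rounds 91.75 up to 92):

```python
# Peak INT8 throughput of the v1 systolic array.
macs_per_cycle = 256 * 256        # one 8-bit MAC per array cell
ops_per_mac = 2                   # a MAC counts as multiply + add
clock_hz = 700e6
peak_tops = macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(f"{peak_tops:.2f} TOPS INT8")   # 91.75 -> the "92 TOPS" headline
```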

Memory & I/O

  • 24 MiB on-chip Unified Buffer (activations).
  • 4 MiB accumulators, software-managed.
  • 8 GiB dual-channel DDR3-2133 at ~34 GB/s — the bandwidth wall the paper makes famous.
  • PCIe Gen3 ×16 host link.
  • ~40 W measured busy power (75 W TDP in the paper). Drop-in to existing servers.

What it ran in 2015–2017

RankBrain (ranking signal in Search), Google Translate's Neural Machine Translation, Google Photos labelling, Street View text recognition, Now/Assistant speech, and the Lee Sedol AlphaGo match. Inference only — no FP, no gradients.

The lesson v2 had to fix

The ISCA 2017 roofline shows v1 was severely memory-bandwidth-bound on most workloads — 34 GB/s of DDR3 simply cannot feed 92 TOPS of compute. Every subsequent TPU has had HBM. That bandwidth wall is the single most important thing v1 taught the team, and it shapes every chip after.
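
A back-of-the-envelope roofline check makes the wall concrete. This is a sketch, not the paper's exact methodology; the "64 ops/byte" figure assumes an INT8 layer that reuses each weight byte across a batch of 32, an illustrative number only:

```python
peak_ops = 92e12   # INT8 ops/s from the systolic array
mem_bw = 34e9      # DDR3 bytes/s

# Arithmetic intensity needed to stay compute-bound (the roofline ridge point).
ridge = peak_ops / mem_bw
print(f"{ridge:.0f} INT8 ops per DRAM byte needed to reach peak")      # ~2706

# A weight-streaming layer at batch 32 does roughly 2 * 32 = 64 ops per weight byte.
achievable = min(peak_ops, 64 * mem_bw)
print(f"memory-bound ceiling at 64 ops/byte: {achievable / 1e12:.1f} TOPS")  # ~2.2
```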

03

v2 & v3 — Training Arrives

v1 was inference-only. By 2016 it was clear that the bigger fleet bottleneck was actually training large neural-translation and language models. v2 was a clean redesign: floating point, HBM, an interconnect that let chips talk to each other, gradients.

v2 (2017)

  • 16 nm. Two TensorCores per chip.
  • Each TensorCore: one 128×128 MXU, vector unit, scalar unit.
  • bfloat16 introduced — Google Brain's invention: FP32's 8-bit exponent paired with a 7-bit mantissa.
  • FP32 accumulation.
  • 16 GiB HBM at ~600 GB/s.
  • 45 TFLOPS bf16 per chip.
  • 4 chips per board; 256-chip pod as 16×16 2D torus → ~11.5 PFLOPS.
  • Custom ICI (Inter-Chip Interconnect) connecting each chip to 4 nearest neighbours.
  • Air-cooled.

v3 (2018)

  • Same 16 nm node — an incremental refresh of v2.
  • Doubled MXUs — 4 MXUs/chip (2 per TensorCore).
  • 32 GiB HBM at ~900 GB/s.
  • 123 TFLOPS bf16 per chip.
  • 1,024-chip pod as 32×32 2D torus → >100 PFLOPS, ~32 TiB aggregate HBM.
  • Liquid cooling — first TPU generation to require it (and the first liquid-cooled accelerator at hyperscale).
  • Used to train BERT-large, T5, early MUM, much of Search ranking.

bfloat16 — Google's most influential numeric

The format was designed inside Google Brain specifically so that mixed-precision training of large neural nets would be numerically stable without rescaling: keep FP32's 8-bit exponent (so dynamic range is identical), drop the bottom 16 bits of the mantissa (so memory and bandwidth halve, multipliers shrink). NVIDIA shipped bf16 in tensor cores starting Ampere (2020); Intel and ARM followed. Every modern AI accelerator now supports bf16 because of TPU v2.
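
A minimal sketch of what the format keeps and drops, using numpy; real hardware converters typically round to nearest-even rather than truncate:

```python
import numpy as np

def fp32_to_bf16_truncate(x):
    """Keep the sign, the full 8-bit exponent and the top 7 mantissa bits of FP32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)   # zero the low 16 mantissa bits

x = np.array([3.14159265], dtype=np.float32)
print(fp32_to_bf16_truncate(x))   # [3.140625] -- FP32's dynamic range, ~3 decimal digits of precision
```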

04

v4 & v4i — OCS, SparseCore, the Pod Era

v4 is the generation where the TPU stops being a chip and becomes a supercomputer. It was deployed in Google datacenters from 2020, announced at I/O 2021, and finally given a paper in 2023 (the long delay is a hyperscaler tell — you don't disclose the supercomputer that's training your next model).

v4 (training)

  • TSMC 7 nm.
  • Two TensorCores; 4 MXUs each (8 MXUs/chip total).
  • bf16 + INT8.
  • 275 TFLOPS bf16 per chip.
  • 32 GiB HBM at 1.2 TB/s.
  • ~170 W per chip.
  • 6 ICI links per chip → 3D torus.
  • 4,096-chip pod = 64 blocks of 64 chips, ~1.1 ExaFLOPS bf16.
  • Palomar OCS — 3D-MEMS optical circuit switches, 136 ports each, 48 OCS units per pod, <5% of system cost.
  • SparseCore — embedding-lookup accelerator on-die for recommendation models.
  • Trained PaLM 540B on 6,144 chips for 50 days.

v4i (inference)

  • Same 7 nm process, ISCA 2021 paper.
  • One TensorCore per chip (vs v4's two) — tighter inference cost envelope.
  • 138 TFLOPS bf16; INT8 also supported.
  • 8 GiB HBM2 at ~614 GB/s.
  • 175 W; air-cooled (vs v4's liquid).
  • Introduced CMEM — a ~128 MiB on-chip cache that carries forward into every later TPU.
  • Powers Search, YouTube, Ads, Assistant, and later Bard / Gemini inference.

Why OCS matters

Optical Circuit Switching gives v4 something no GPU pod has: topology that can be reconfigured per job, in milliseconds. A bad cube can be routed around without hardware swap; a 2,048-chip slice can be allocated as 4×4×128 or 8×16×16 depending on what your model wants. Deck 10 takes Palomar apart in detail.
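
A toy enumeration of the torus shapes OCS could wire up for a 2,048-chip slice out of 4×4×4 building cubes. Illustrative only — the real scheduler has placement and wiring constraints this ignores:

```python
def slice_shapes(num_chips, cube=4):
    """Logical 3D torus shapes (chips per axis) built from cube x cube x cube blocks."""
    blocks = num_chips // cube**3
    shapes = []
    for a in range(1, blocks + 1):
        if blocks % a:
            continue
        for b in range(a, blocks // a + 1):
            if (blocks // a) % b:
                continue
            c = blocks // (a * b)
            if c >= b:
                shapes.append((a * cube, b * cube, c * cube))
    return shapes

print(slice_shapes(2048))
# [(4, 4, 128), (4, 8, 64), (4, 16, 32), (8, 8, 32), (8, 16, 16)]
```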

05

v5e & v5p — The Two-Track Fork

By v5 the product strategy is explicit: ship two SKUs every generation. The "e" is for efficiency — cost-optimised inference and small-scale training. The "p" is for performance — the chip Google trains its next frontier model on.

v5e (Aug 2023)

  • One TensorCore, 4 MXUs, vector + scalar units.
  • 197 TFLOPS bf16, 393 TOPS INT8 per chip.
  • 16 GiB HBM at 819 GB/s.
  • 400 GB/s ICI bidirectional, 4 ports, 2D torus.
  • 256-chip pod → ~50.6 PFLOPS.
  • Targeted at fleet inference and modest fine-tuning.
  • Marketed as ~2.5× perf/$ vs v4 for inference.

v5p (Dec 2023)

  • Two TensorCores, 4 MXUs each, plus 4 SparseCores.
  • 459 TFLOPS bf16, 918 TOPS INT8 per chip.
  • 95 GiB HBM at 2.76 TB/s — roughly 3.4× the bandwidth of v5e.
  • ICI: 4,800 Gbps per chip (600 GB/s).
  • 3D torus + OCS, 8,960-chip pod; max single-job slice 6,144 chips.
  • Used to train Gemini 1.0 and Gemini 1.5.
  • Pod aggregate: ~4.1 ExaFLOPS bf16, ~850 TiB aggregate HBM.

The Multislice innovation

v5 is also the generation that ships Multislice as a first-class feature. A single training job can span multiple pods (potentially across datacenters), with high-bandwidth ICI inside each slice and the Jupiter datacenter network between slices. PaLM trained on two TPU v4 pods via Pathways; Gemini training on v5p uses the same pattern at higher scale.
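
A minimal JAX sketch of the same pattern — a fast axis inside a slice, a slow axis across slices. The axis names, mesh shape and shardings here are assumptions for illustration, not Google's production configuration, and it presumes a multi-slice TPU environment is already initialised (jax.experimental.mesh_utils also offers create_hybrid_device_mesh for building a DCN × ICI mesh directly):

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Lay out all visible devices as (slices, chips-per-slice); 2 slices assumed here.
n_dev = len(jax.devices())
devices = mesh_utils.create_device_mesh((2, n_dev // 2))
mesh = Mesh(devices, axis_names=("dcn", "ici"))

x = jnp.ones((256, 8192))     # activations
w = jnp.ones((8192, 8192))    # weights
# Batch is data-parallel across slices (slow DCN hops); weights are model-parallel
# within a slice (fast ICI hops). XLA/GSPMD inserts the collectives for each fabric.
x = jax.device_put(x, NamedSharding(mesh, P("dcn", None)))
w = jax.device_put(w, NamedSharding(mesh, P(None, "ici")))
y = jax.jit(jnp.dot)(x, w)
print(y.shape, y.sharding)
```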

06

Trillium (v6e)

Announced at Google I/O May 2024, GA December 2024. Trillium is e-class only — no v6p exists; the next p-class is Ironwood / v7.

Per-chip

  • 918 TFLOPS bf16 (4.66× v5e's 197 TFLOPS — the marketing "4.7×" headline).
  • 1,836 TOPS INT8.
  • 32 GiB HBM at 1,640 GB/s — exactly 2× v5e in capacity and bandwidth.
  • 800 GB/s ICI bidirectional — 2× v5e.
  • 3rd-generation SparseCore — 2× embedding throughput, 5× on DLRM-DCNv2.

Pod & multipod

  • 256-chip pod, 2D torus.
  • Multipod over Google's Jupiter datacenter fabric — up to ~100,000 Trillium chips in one optical-network domain, 13 Pb/s bisection bandwidth (vendor figure).
  • 99% scaling efficiency at 12-pod (3,072 chips), 94% at 24-pod (6,144 chips) — vendor figures from the GA blog.
  • Used to train Gemini 2.0.
Why "Trillium" got a name

From v6 onwards, Google has been naming TPUs after plants — Trillium (a woodland flower), Ironwood (a tree). The numbered scheme is still used internally and in docs ("v6e", "v7" / "TPU7x"), but the public face has been rebranded. This mirrors NVIDIA's scientist naming (Pascal, Volta, Hopper) — a memorable name is good cloud marketing.

07

Ironwood (v7) — The Inference Flagship

Announced at Google Cloud Next on 9 April 2025; GA in November 2025. Google calls it "the first TPU built for the age of inference" — v4i (2020) already targeted inference, but Ironwood is the first inference-focused chip at flagship scale, and the first that explicitly leads on FP8.

Compute

  • 4,614 TFLOPS FP8 (= 4.6 PFLOPS) per chip.
  • 2,307 TFLOPS bf16 — exactly half the FP8 rate, since FP8 runs at twice the bf16 throughput.
  • ~5× Trillium's headline peak (4,614 FP8 vs 918 bf16); ~2.5× comparing bf16 to bf16.

Memory

  • 192 GiB HBM3e per chip — 6× Trillium capacity.
  • 7.37 TB/s bandwidth — 4.5× Trillium.
  • The chip Gemini 2.5 / 3 inference runs on.

Interconnect

  • 1.2 TB/s ICI bidirectional per chip (= 9.6 Tbps).
  • 1.5× Trillium ICI bandwidth.
  • 3D torus + OCS (p-class topology).

Pod scale

  • 9,216-chip pod, 3D torus + OCS.
  • ~42.5 ExaFLOPS FP8 aggregate (9,216 × 4.6 PFLOPS per chip).
  • ~1.7 PiB aggregate HBM3e (9,216 × 192 GiB).

A 192 GiB inference chip changes the picture

For perspective: NVIDIA's H200 has 141 GiB HBM3e per GPU; B200 has 192 GiB per dual-die GPU package. Ironwood matches B200 capacity at the chip level — on a chip explicitly tuned for inference rather than training. A single chip can hold a 70B-parameter model in BF16 with KV cache to spare. A 256-chip pod can hold the largest open Gemini-class checkpoints with massive batch-size headroom.
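
Rough capacity arithmetic behind that claim — a sketch assuming bf16 weights and bf16 KV cache with no runtime overheads; the layer/head counts are assumptions for a generic ~70B dense model, not a specific Google checkpoint:

```python
GIB = 2**30
hbm_bytes = 192 * GIB
weight_bytes = 70e9 * 2                        # 70B parameters at 2 bytes (bf16)
free_bytes = hbm_bytes - weight_bytes
print(f"weights ~{weight_bytes / GIB:.0f} GiB, ~{free_bytes / GIB:.0f} GiB left for KV cache")

# bf16 KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes.
layers, kv_heads, head_dim = 80, 8, 128        # assumed model shape (GQA-style)
kv_per_token = 2 * layers * kv_heads * head_dim * 2
print(f"~{free_bytes / kv_per_token / 1e3:.0f}k tokens of KV cache fit in the remainder")
```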

08

Per-Chip Compute Jump Chart

Plotting peak per-chip throughput on a log scale. Note how the jumps come in three places: v1→v2 (FP arrives), v3→v4 (process node + 4 MXUs), and v6→v7 (FP8 introduced and scaled). This is the curve that says "every new model class arrives 6–18 months after a TPU jump".

[Chart] Peak per-chip throughput, log scale, each generation's headline numeric (PFLOPS): v1 0.092 (INT8) · v2 0.045 (bf16) · v3 0.123 (bf16) · v4 0.275 (bf16) · v4i 0.138 (bf16) · v5e 0.197 (bf16) · v5p 0.459 (bf16) · v6e 0.918 (bf16) · v7 2.307 (bf16) · v7 4.614 (FP8). Series split: p-class / training vs e-class / inference.

Numbers are in PFLOPS (or the PFLOPS-equivalent rate for INT8 / FP8).

09

Per-Chip HBM Jump Chart

Memory capacity is often the binding constraint for LLM serving and large-batch training. The TPU per-chip HBM curve is striking — flat from v2 to v4, then a step in v5p, and a near-vertical jump in Ironwood.

[Chart] Per-chip HBM capacity (GiB, linear scale): v1 0 (DDR3 only) · v2 16 · v3 32 · v4 32 · v4i 8 · v5e 16 · v5p 95 · v6e 32 · v7 192.

The v6e regression (32 GiB) is a deliberate cost choice — e-class chips don't need v5p's 95 GiB because they're not training frontier dense models. Ironwood at 192 GiB resets the inference picture: a single chip can hold a 70B model with the KV cache for a long-context conversation.

10

Pod Size Over Time

The other curve worth tracking: how many chips an ICI-coherent pod holds. This is the size of the largest "single machine" you can train on.

Generation | Pod size (chips) | Topology | Aggregate compute (peak) | Aggregate HBM
v1 | 1 | n/a (PCIe card) | 92 TOPS INT8 | 8 GiB DDR3
v2 | 256 | 16×16 2D torus | ~11.5 PFLOPS bf16 | 4 TiB
v3 | 1,024 | 32×32 2D torus | ~126 PFLOPS bf16 | 32 TiB
v4 | 4,096 | 3D torus + OCS (4×4×4 arrangement of 64-chip cubes, i.e. 16×16×16) | ~1.1 ExaFLOPS bf16 | 128 TiB
v5e | 256 | 16×16 2D torus | ~50 PFLOPS bf16 | 4 TiB
v5p | 8,960 | 3D torus + OCS | ~4.1 ExaFLOPS bf16 | ~850 TiB
Trillium (v6e) | 256 | 2D torus (multipod to ~100k chips via Jupiter) | ~235 PFLOPS bf16 | 8 TiB
Ironwood (v7) | 9,216 | 3D torus + OCS | ~42.5 ExaFLOPS FP8 | ~1.7 PiB HBM3e

The 9,216-chip Ironwood pod is the largest single-vendor scale-up domain yet built — two orders of magnitude more chips than an NVIDIA NVL72 rack (72 GPUs) — and it is stitched together by ICI rather than InfiniBand at the fabric level.

11

Interactive Generation Explorer

Pick two TPU generations to compare per chip. Useful for quick sanity-checks — "is Trillium really 4.7× v5e?", "how much compute does Ironwood gain from FP8 alone?".

Reading the ratio honestly

The compute ratio uses each chip's headline numeric — INT8 vs bf16 vs FP8. For a true throughput comparison you have to pick one numeric: e.g. v5p's 459 TFLOPS bf16 vs Ironwood's 2,307 TFLOPS bf16 is a clean 5×; saying Ironwood is 10× v5p only works if you compare FP8 against bf16, which is not like-for-like. Vendor headlines play this game often — check the units.
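
A tiny helper that keeps the comparison honest — per-chip headline numbers from this deck's lineage table, and a ratio function that refuses to mix numerics unless explicitly asked (names and structure here are purely illustrative):

```python
# (peak TFLOPS or TOPS, numeric format) per chip, from the lineage table.
SPECS = {
    "v5e": (197, "bf16"),
    "v5p": (459, "bf16"),
    "v6e": (918, "bf16"),
    "v7": (2307, "bf16"),
    "v7-fp8": (4614, "fp8"),
}

def ratio(a, b, allow_mixed=False):
    (pa, fa), (pb, fb) = SPECS[a], SPECS[b]
    if fa != fb and not allow_mixed:
        raise ValueError(f"{fa} vs {fb}: pick one numeric before comparing")
    return pa / pb

print(f"{ratio('v6e', 'v5e'):.2f}x")                       # 4.66x -> the "4.7x" headline
print(f"{ratio('v7', 'v5p'):.2f}x")                        # 5.03x, bf16 to bf16
print(f"{ratio('v7-fp8', 'v5p', allow_mixed=True):.1f}x")  # 10.1x, but FP8 vs bf16
```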

12

Cheat Sheet

Read next

Deck 03 — Systolic Arrays opens up the matmul engine inside the chip. Decks 04–08 walk each generation in detail. Deck 09 is on memory and numerics. Deck 10 is on ICI and OCS.