Google TPUs Series — Presentation 08

Trillium & Ironwood — v6e & v7

2024–2025: Trillium delivers a 4.7× jump in e-class compute. Ironwood (TPU v7) is Google's first chip "for the age of inference": 4.6 PFLOPS FP8, 192 GiB HBM3e, 9,216-chip pods.

Trillium (v6e): May 2024 · Ironwood (v7): Apr 2025 · FP8 · HBM3e · SparseCore G3 · Gemini 2.x / 3
Timeline: v5e / v5p → Trillium (4.7×) → Ironwood (+FP8, 192 GiB HBM3e, 9,216-chip pod, 42.5 ExaFLOPS)
00

Topics We'll Cover

01

Trillium — The 4.7× Jump

Announced at Google I/O on 14 May 2024 as the "6th-generation TPU"; rebranded with the public name Trillium (the woodland flower). General availability landed on 11 December 2024.

Per-chip vs v5e

  • 918 TFLOPS bf16 (4.66× v5e's 197 TFLOPS) — the "4.7×" headline.
  • 1,836 TOPS INT8.
  • 32 GiB HBM at 1,640 GB/s — exactly 2× v5e in capacity and bandwidth.
  • 800 GB/s ICI bidirectional — 2× v5e ICI.
  • Same 1-TensorCore architecture as v5e (e-class).

What got bigger to make it 4.7×

  • Process node shrink (Google has not officially said; widely reported as a TSMC 5/4nm-class step from v5e's node).
  • Larger MXU array per TensorCore.
  • Higher clock at lower voltage.
  • Deeper VMEM hierarchy with more software-managed buffer.
  • 3rd-generation SparseCore (more on the next slide).
A 4.7× "tick" is unusually large

For comparison: NVIDIA's H100 to B200 was around 2.5× on the equivalent numeric. Trillium's 4.7× comes partly from Google having held the v5e architecture longer than originally planned (v5e shipped Aug 2023, roughly 16 months before Trillium's GA) and partly from the e-class line being the easier place to spend transistor budget on raw compute: you don't have to also pay for an OCS-attachable 3D-torus subsystem.

02

SparseCore Generation 3

Trillium's 3rd-generation SparseCore is roughly 2× the embedding throughput of the v5p SparseCore, and per Google's GA blog delivers 5× on DLRM-DCNv2 — a recommendation-system benchmark.

What changed in SparseCore G3

Why MoE makes SparseCore load-bearing

MoE inference looks structurally identical to recommendation embedding lookup: each token selects a small subset of expert weight blocks and gathers them from HBM. SparseCore's hardware path is precisely what that needs. As MoE has become the dominant frontier-model architecture (Mixtral, GPT-4-class, Gemini 2.x), SparseCore has moved from "useful for ranking" to "load-bearing for LLMs". You cannot serve a sparse MoE efficiently without something like a SparseCore.
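To make the structural similarity concrete, here is a minimal JAX sketch of top-k expert routing followed by a gather of expert weight blocks. Everything here (route_and_gather, the shapes, the expert count) is illustrative rather than any Google implementation; the point is that the gather indexed by per-token expert IDs is the same sparse-lookup pattern SparseCore accelerates for embeddings.

```python
import jax
import jax.numpy as jnp

def route_and_gather(x, router_w, expert_w, k=2):
    """Top-k MoE routing + expert-weight gather (illustrative only).

    x:         [tokens, d_model]           activations
    router_w:  [d_model, n_experts]        router projection
    expert_w:  [n_experts, d_model, d_ff]  expert weight blocks sitting in HBM
    """
    logits = x @ router_w                          # [tokens, n_experts]
    weights, idx = jax.lax.top_k(logits, k)        # each token picks k experts
    weights = jax.nn.softmax(weights, axis=-1)
    # The embedding-lookup-like step: gather the selected expert blocks per token.
    # On TPU this sparse gather/scatter is what SparseCore offloads from the MXU path.
    gathered = expert_w[idx]                       # [tokens, k, d_model, d_ff]
    h = jnp.einsum("td,tkdf->tkf", x, gathered)    # apply each selected expert
    return jnp.einsum("tk,tkf->tf", weights, h)    # weighted combine of expert outputs

tokens, d_model, d_ff, n_experts = 8, 64, 256, 16
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (tokens, d_model), dtype=jnp.bfloat16)
router_w = jax.random.normal(key, (d_model, n_experts), dtype=jnp.bfloat16)
expert_w = jax.random.normal(key, (n_experts, d_model, d_ff), dtype=jnp.bfloat16)
print(route_and_gather(x, router_w, expert_w).shape)   # (8, 256)
```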

03

Trillium's 256-Chip Pod & Multipod Scaling

Like v5e, Trillium ships a 256-chip 2D-torus pod — the canonical e-class shape. The big change is the multipod story: Google explicitly markets Trillium pods as composable into a 100,000-chip Jupiter-network domain.

In-pod (ICI)

  • 256 chips, 16×16 2D torus.
  • 800 GB/s ICI bidirectional per chip.
  • Pod aggregate: ~235 PFLOPS bf16, 8 TiB HBM.
  • 2× v5e on every per-pod metric.

Multi-pod (Jupiter DCN)

  • Up to ~100,000 Trillium chips in one optical-network domain.
  • 13 Pb/s bisection bandwidth (vendor figure).
  • 99% scaling efficiency at 12-pod (3,072 chips) on Google's published benchmarks.
  • 94% scaling efficiency at 24-pod (6,144 chips).
  • Used to train Gemini 2.0.

Why 100,000 chips on e-class, when v5p has 8,960 in one pod?

An ICI-coherent pod and a multipod cluster are different things. ICI is sub-microsecond, custom SerDes, hardware all-reduce. Multipod over Jupiter is microseconds-to-milliseconds, optical-circuit-switched, with all-reduce in software. You can train at much larger scale on the multipod, but the collective patterns are coarser-grained. v5p's 8,960-chip ICI domain is for fine-grained tensor-parallel training; Trillium's 100k-chip multipod is for data-parallel and pipeline-parallel scaling.
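A hedged JAX sketch of how the two domains are typically expressed in software: mesh_utils.create_hybrid_device_mesh builds a device mesh whose inner axes live on fast ICI (fine-grained model parallelism) and whose outer axis crosses the slower DCN/multipod network (data parallelism). The axis names and shapes below are illustrative, not a published Gemini configuration, and the snippet assumes a matching multi-slice TPU environment.

```python
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes a multi-slice TPU environment whose device count matches the
# product of the two shapes below (here 4 slices x 256 chips = 1,024 devices).
ici_mesh_shape = (1, 16, 16)   # axes inside one ICI-coherent slice (16x16 torus)
dcn_mesh_shape = (4, 1, 1)     # axis that crosses slices over the datacenter network

devices = mesh_utils.create_hybrid_device_mesh(ici_mesh_shape, dcn_mesh_shape)
mesh = Mesh(devices, axis_names=("data", "fsdp", "model"))

# Fine-grained tensor parallelism ("fsdp", "model") stays on sub-microsecond ICI;
# the coarse "data" axis crosses pods, so its all-reduces ride the slower DCN.
weights = NamedSharding(mesh, P("fsdp", "model"))   # shard a [d_in, d_out] weight
batches = NamedSharding(mesh, P("data", None))      # shard the batch across pods
```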

04

Why No "v6p"?

This is a question that confused everyone in 2024. v5 had two SKUs (e and p); v6 only has Trillium / v6e. There is no v6p. The next p-class chip is Ironwood / v7.

The product-cadence answer

p-class TPUs sit in datacenters for years; v5p deployed in late 2023 will still be doing useful work in 2027. Refreshing it every 18 months is wasteful capex. The cadence on p-class is closer to ~24–30 months.

The technology answer

HBM3e wasn't ready in volume in 2024. The 192 GiB / 7.4 TB/s per-chip memory profile that Ironwood (2025) actually ships requires HBM3e. Putting v6p out in 2024 would have meant another HBM2e chip with marginal capacity gain.

The strategic answer

Google's Gemini training was already saturating v5p pods in mid-2024 with very high utilisation. The right next p-class chip is one with a major capacity and FP8 jump — not a 1.5× refresh.

The result is a deliberate cadence: e-class refreshes more often (v5e → Trillium → eventually v8e), p-class refreshes less often (v5p → Ironwood → eventually v8p). This pattern will probably continue. Treat "Trillium = v6e" and "Ironwood = v7p" as the start of two interleaved series.

05

Ironwood — The Inference Flagship

Announced at Google Cloud Next on 9 April 2025, with general availability following in November 2025. Internally TPU v7 / TPU7x; publicly named Ironwood. Marketed as Google's "first TPU built for the age of inference".

  • FP8 peak / chip: 4.6 PFLOPS
  • bf16 / chip: 2.31 PFLOPS
  • HBM / chip: 192 GiB
  • HBM bandwidth / chip: 7.37 TB/s
  • ICI / chip (bidirectional): 1.2 TB/s
  • Max pod size: 9,216 chips
  • Pod FP8 peak: 42.5 EFLOPS
  • Per-chip TDP: ~600 W

Headline framings (from Google's announcement)

Important caveat: many of these headlines compare Ironwood's FP8 against Trillium's bf16, which builds in a 2× numeric advantage. The honest like-for-like comparison is bf16-to-bf16: Ironwood is ~2.5× Trillium. The headline ~5× comes from combining the FP8 switch with the capacity and bandwidth gains, which is genuinely transformative for inference.
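A quick arithmetic check of those two ratios, using only the per-chip figures quoted in this deck:

```python
trillium_bf16 = 918e12    # FLOPS per chip (Trillium, bf16)
ironwood_bf16 = 2.31e15   # FLOPS per chip (Ironwood, bf16)
ironwood_fp8  = 4.6e15    # FLOPS per chip (Ironwood, FP8)

print(ironwood_fp8 / trillium_bf16)   # ~5.0x: the marketing framing (FP8 vs bf16)
print(ironwood_bf16 / trillium_bf16)  # ~2.5x: the like-for-like bf16 comparison
```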

06

Per-Chip Numbers In Context

Where Ironwood sits relative to NVIDIA's contemporary inference / training silicon (May 2026):

Chip | Year | Headline numeric | HBM | HBM BW | TDP
TPU v5p | Dec 2023 | 459 TFLOPS bf16 | 95 GiB | 2.76 TB/s | ?
NVIDIA H100 SXM | 2022 | 989 TFLOPS bf16 dense / 3,958 TFLOPS FP8 | 80 GiB | 3.35 TB/s | 700 W
NVIDIA H200 | 2024 | 989 TFLOPS bf16 / 3,958 TFLOPS FP8 | 141 GiB | 4.8 TB/s | 700 W
Trillium (v6e) | Dec 2024 | 918 TFLOPS bf16 | 32 GiB | 1.64 TB/s | ~280 W
NVIDIA B200 | 2024 (dual-die) | 2.25 PFLOPS bf16 / 9 PFLOPS FP4 | 192 GiB | 8 TB/s | 1,000 W
Ironwood (v7) | Apr 2025 | 2.31 PFLOPS bf16 / 4.6 PFLOPS FP8 | 192 GiB | 7.37 TB/s | ~600 W

Reading the table

07

192 GiB HBM3e — Why It Matters

Memory capacity is the binding constraint for modern inference. Ironwood's 192 GiB per chip is the single most important spec on the chip.

What fits on one chip

  • 70B model in bf16 (140 GB) + 50 GB of KV cache → comfortable fit.
  • 180B model in FP8 (180 GB) + 12 GB of KV cache → fits.
  • ~700B-parameter MoE in FP8 (~700 GB of expert weights), with experts sharded across the chips → fits on one 4-chip tile (see the footprint sketch after this list).
  • Long-context Gemini 2.5 with 2M-token KV cache at FP8 quantisation → one chip.
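The bullets above are simple bytes-per-parameter arithmetic; a rough calculator, with illustrative configurations (decimal GB for weights, binary GiB for the HBM budget):

```python
HBM_PER_CHIP_GIB = 192

def model_gb(params_b, bytes_per_param):
    """Weight footprint in GB for params_b billion parameters."""
    return params_b * bytes_per_param      # 1B params x 1 byte = 1 GB (decimal)

def fits(weights_gb, kv_cache_gb, chips=1):
    budget_gb = chips * HBM_PER_CHIP_GIB * 1.073741824   # GiB -> GB
    return (weights_gb + kv_cache_gb) <= budget_gb

print(fits(model_gb(70, 2), 50))            # 70B bf16 + 50 GB KV cache      -> True
print(fits(model_gb(180, 1), 12))           # 180B FP8 + 12 GB KV cache      -> True
print(fits(model_gb(700, 1), 0, chips=4))   # ~700B MoE in FP8, 4-chip tile  -> True
```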

The bandwidth side

  • 7.37 TB/s of HBM bandwidth is enough to stream a 70B-parameter model in bf16 in under 20 ms, faster than any meaningful per-token deadline (worked through in the sketch after this list).
  • Effective bandwidth scales with on-chip (CMEM) hit rate; KV-cache hits can push effective bandwidth to roughly 2× the raw figure.
  • For decode-bound workloads (chatbot inference), bandwidth matters more than peak FLOPS, and 7.4 TB/s is in the same league as B200.
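A back-of-the-envelope version of the decode-bound argument: for batch-1 decode, every weight byte (plus the KV cache) is streamed from HBM once per generated token, so bandwidth sets a hard floor on per-token latency. The bandwidth figure is the deck's; the model sizes are illustrative.

```python
HBM_BW_TBPS = 7.37   # Ironwood HBM3e bandwidth per chip, TB/s

def decode_ms_per_token(weight_gb, kv_cache_gb=0.0):
    """Bandwidth-bound lower limit on per-token decode latency (batch 1)."""
    bytes_streamed_gb = weight_gb + kv_cache_gb          # read once per generated token
    return bytes_streamed_gb / (HBM_BW_TBPS * 1000) * 1000   # GB / (GB/s) -> ms

print(decode_ms_per_token(140))       # 70B in bf16:            ~19 ms/token
print(decode_ms_per_token(180, 20))   # 180B FP8 + 20 GB of KV: ~27 ms/token
```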

HBM3e details

HBM3e is the JEDEC standard finalised in early 2024.

The package design at this HBM scale is extremely difficult: CoWoS-style 2.5D advanced packaging, careful thermal management, and per-stack ECC. This is the kind of system engineering Google has been quietly accumulating since v3.

08

FP8 Becomes the Headline Numeric

Ironwood is the first TPU to lead with FP8 as its headline numeric. v4 / v5 / Trillium all advertised primarily in bf16. The shift mirrors NVIDIA's H100 (Hopper, 2022) and B200 (Blackwell, 2024).

Why FP8 at this generation, not earlier?

1. Numerical maturity

FP8 wasn't a stable training format until ~2023: it required loss-scaling, careful range tracking, and per-tensor scale factors. By 2025 the techniques were proven.

2. Inference-first chip

FP8 inference quantisation has been mainstream since 2023. Ironwood is the first TPU explicitly designed for this case — a chip whose primary metric is "tokens/sec on a frontier model in production".

3. The doubling lever

Halving precision doubles FLOPS-per-MAC. Going from bf16 to FP8 is the cleanest possible 2× on peak performance with a known acceptable accuracy cost — cleaner than die-area or clock increases at this point in the curve.

Ironwood's FP8 path uses the standard OCP FP8 formats (E4M3 for activations, E5M2 for gradients on the rare occasions training uses them) with per-block scaling. The bf16 path is preserved for cases where FP8 is too aggressive.
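As a rough illustration of per-block FP8 scaling, here is a minimal JAX sketch using the OCP E4M3 dtype (jnp.float8_e4m3fn). The block size and helper names are ours, not Ironwood's actual quantisation recipe.

```python
import jax.numpy as jnp

E4M3_MAX = 448.0   # largest finite value representable in OCP FP8 E4M3

def quantize_blockwise(x, block=32):
    """Quantise a [rows, cols] bf16 tensor to FP8 E4M3 with one scale per block of columns."""
    rows, cols = x.shape
    xb = x.reshape(rows, cols // block, block)
    scale = jnp.max(jnp.abs(xb), axis=-1, keepdims=True) / E4M3_MAX   # per-block scale
    q = (xb / scale).astype(jnp.float8_e4m3fn)
    return q, scale

def dequantize_blockwise(q, scale):
    return (q.astype(jnp.bfloat16) * scale).reshape(q.shape[0], -1)

x = jnp.linspace(-3.0, 3.0, 64 * 128, dtype=jnp.bfloat16).reshape(64, 128)
q, s = quantize_blockwise(x)
err = jnp.mean(jnp.abs(dequantize_blockwise(q, s) - x))
print(q.dtype, err)   # float8_e4m3fn, small mean absolute error
```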

09

The 9,216-Chip Pod

Ironwood pods come in two sizes: 256 chips (smaller deployments) and 9,216 chips (max). The 9,216-chip pod is the largest single ICI-coherent compute domain ever built.

Ironwood Pod (max configuration); the aggregates follow from the per-chip figures (see the check after this list):

  • 9,216 chips · 3D torus + Palomar-class OCS
  • ~1.77 PB HBM3e · ~67 PB/s aggregate HBM bandwidth
  • ~42.5 ExaFLOPS FP8 peak
  • ~5.5 MW pod power · liquid-cooled
  • OCS layer · per-job topology reconfiguration
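A quick multiplication in decimal (vendor-style) units, assuming 192 GB per chip as Google quotes:

```python
chips = 9_216

fp8_pflops_per_chip = 4.6
hbm_gb_per_chip = 192          # vendor figure, decimal GB
hbm_bw_tbps_per_chip = 7.37
watts_per_chip = 600

print(chips * fp8_pflops_per_chip / 1_000)    # ~42.4 EFLOPS FP8 peak (Google rounds to 42.5)
print(chips * hbm_gb_per_chip / 1_000_000)    # ~1.77 PB of HBM3e
print(chips * hbm_bw_tbps_per_chip / 1_000)   # ~67.9 PB/s aggregate HBM bandwidth
print(chips * watts_per_chip / 1_000_000)     # ~5.5 MW of chip power (before cooling overhead)
```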

What you can fit in one pod

10

Workloads — Gemini 2.0 / 2.5

Model | Year | Hardware | Notes
Gemini 2.0 | Dec 2024 | Trillium (v6e), multipod | Trained on Trillium; Google's GA blog is explicit about this. The first frontier training run on the 100k-chip Jupiter multipod.
Gemini 2.5 Pro | Mar 2025 | Trillium training; Ironwood inference | Trained while Ironwood was still pre-GA; inference moved to Ironwood for the long-context (2M-token) variants.
Gemini 2.5 Flash | 2025 | v5e & Trillium for inference | Distilled, smaller, runs comfortably on the e-class fleet.
Gemini 3 (rumoured) | 2025–26 | Ironwood | Trade-press indications; Google hasn't confirmed details. The 9,216-chip pod and 192 GiB HBM are sized for this generation.
YouTube / Search ranking | 2024–26 | Trillium fleet | SparseCore G3 shines here; the 5× DLRM-DCNv2 throughput translates directly.

The "age of inference" pitch, decoded

Google's Ironwood marketing leans on "age of inference" because, by 2025, inference cost at scale is the binding constraint on monetising AI. A frontier training run now costs on the order of $100M; ongoing inference for a popular consumer product runs on the order of $1B per year. Ironwood's design centre (high HBM capacity, lower TDP, FP8 throughput, a big pod) is shaped by that economic shift, not by training requirements.

11

The Inference-First Design Philosophy

What does "designed for inference" actually mean architecturally? Five concrete things in Ironwood:

Memory-led, not compute-led

Ironwood's chip-area budget went to HBM3e I/O, more on-die SRAM, and a wider interconnect — not to a 4× larger MXU. Inference is bandwidth-bound; throwing more MACs at a chip you can't feed is wasted silicon.
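One way to make "bandwidth-bound" concrete is the roofline ridge point: the arithmetic intensity (FLOPs per HBM byte) a workload needs before compute, rather than bandwidth, becomes the limit. The peak and bandwidth figures are the deck's; the decode intensity is a rough assumed value.

```python
# Roofline ridge point: FLOPs per HBM byte needed before the chip is compute-bound.
peak_fp8_flops = 4.6e15      # Ironwood per-chip FP8 peak
hbm_bw_bytes = 7.37e12       # Ironwood per-chip HBM bandwidth, bytes/s

ridge = peak_fp8_flops / hbm_bw_bytes
print(ridge)                 # ~624 FLOPs/byte to keep the MXU fully busy

# Batch-1 decode reads every (FP8) weight byte once and does ~2 FLOPs with it,
# so its arithmetic intensity is roughly 2 FLOPs/byte: hundreds of times below the ridge.
decode_intensity = 2.0
print(ridge / decode_intensity)   # how far decode sits from being compute-bound
```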

FP8 native

FP8 is the production inference numeric for everyone in 2025. v5p had bf16 + INT8; Ironwood adds FP8 as a first-class path through the MXU, which was a non-trivial silicon change.

Lower TDP, higher density

~600 W vs B200's 1000 W. For a fleet running 24/7, 40% lower power per chip translates to ~30% lower datacenter cost per token. Ironwood is more "perf/W" than "peak perf".

SparseCore for MoE routing

3rd-gen+ SparseCore handles MoE expert lookup. Inference of an MoE model is dominated by the routing-then-gather step; Ironwood does it in dedicated silicon.

Why "first" inference TPU is contested

v4i (2021) was inference-tuned. v5e (2023) was cost-optimised for inference. The marketing claim "first TPU built for the age of inference" is best read as "first flagship-scale TPU built primarily for inference": Ironwood is the first inference-first chip that is also bigger than its training-first sibling. v5p and Ironwood ship 8,960- and 9,216-chip pods respectively; the near-identical pod sizes suggest the inference-vs-training distinction is now mostly numeric (FP8 vs bf16) and capacity (192 GiB vs 95 GiB).

12

Cheat Sheet

Read next

Deck 09 — Memory & Numerics goes deeper on HBM evolution, VMEM/CMEM, and the bf16 / INT8 / FP8 numeric story. Deck 10 — ICI & OCS explains how Ironwood's 9,216-chip pod is wired up.