Google TPUs Series — Presentation 08

Trillium & Ironwood — v6e & v7

2024–2025: Trillium delivers a 4.7× jump in e-class compute. Ironwood (TPU v7) is Google's first chip "for the age of inference": 4.6 PFLOPS FP8, 192 GiB HBM3e, 9,216-chip pods.

Trillium (v6e): May 2024 · Ironwood (v7): Apr 2025 · FP8 · HBM3e · SparseCore G3 · Gemini 2.x / 3
Timeline: v5e / v5p → Trillium (4.7×) → Ironwood (+FP8, 192 GiB HBM3e, 9,216-chip pod, 42.5 ExaFLOPS)
00

Topics We'll Cover

01

Trillium — The 4.7× Jump

Announced at Google I/O on 14 May 2024 as the "6th-generation TPU"; rebranded with the public name Trillium (the woodland flower). General availability landed on 11 December 2024.

Per-chip vs v5e

  • 918 TFLOPS bf16 (4.66× v5e's 197 TFLOPS) — the "4.7×" headline.
  • 1,836 TOPS INT8.
  • 32 GiB HBM at 1,640 GB/s — exactly 2× v5e in capacity and bandwidth.
  • 800 GB/s ICI bidirectional — 2× v5e ICI.
  • Same 1-TensorCore architecture as v5e (e-class).

What got bigger to make it 4.7×

  • Process node shrink (Google has not officially said; widely reported as a TSMC 5/4nm-class step from v5e's node).
  • Larger MXU array per TensorCore.
  • Higher clock at lower voltage.
  • Deeper VMEM hierarchy with more software-managed buffer.
  • 3rd-generation SparseCore (more on the next slide).
A 4.7× "tick" is unusually large

For comparison: NVIDIA's H100 to B200 was around 2.5× on the equivalent numeric. Trillium's 4.7× comes partly from Google having held the v5e architecture longer than originally planned (v5e shipped Aug 2023, roughly 16 months before Trillium's GA) and partly from the e-class line being the easier place to spend transistor budget on raw compute: you don't have to also pay for an OCS-attachable 3D-torus subsystem.

02

SparseCore Generation 3

Trillium's 3rd-generation SparseCore is roughly 2× the embedding throughput of the v5p SparseCore, and per Google's GA blog delivers 5× on DLRM-DCNv2 — a recommendation-system benchmark.

What changed in SparseCore G3

Why MoE makes SparseCore load-bearing

MoE inference looks structurally identical to recommendation embedding lookup: each token selects a small subset of expert weight blocks and gathers them from HBM. SparseCore's hardware path is precisely what that needs. As MoE has become the dominant frontier-model architecture (Mixtral, GPT-4-class, Gemini 2.x), SparseCore has moved from "useful for ranking" to "load-bearing for LLMs". You cannot serve a sparse MoE efficiently without something like a SparseCore.
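To make the structural similarity concrete, here is a minimal JAX sketch of top-k expert routing followed by a gather of expert weight blocks. Everything here (route_and_gather, the shapes, the expert count) is illustrative rather than any Google implementation; the point is that the gather indexed by per-token expert IDs is the same sparse-lookup pattern SparseCore accelerates for embeddings.

```python
import jax
import jax.numpy as jnp

def route_and_gather(x, router_w, expert_w, k=2):
    """Top-k MoE routing + expert-weight gather (illustrative only).

    x:         [tokens, d_model]           activations
    router_w:  [d_model, n_experts]        router projection
    expert_w:  [n_experts, d_model, d_ff]  expert weight blocks sitting in HBM
    """
    logits = x @ router_w                          # [tokens, n_experts]
    weights, idx = jax.lax.top_k(logits, k)        # each token picks k experts
    weights = jax.nn.softmax(weights, axis=-1)
    # The embedding-lookup-like step: gather the selected expert blocks per token.
    # On TPU this sparse gather/scatter is what SparseCore offloads from the MXU path.
    gathered = expert_w[idx]                       # [tokens, k, d_model, d_ff]
    h = jnp.einsum("td,tkdf->tkf", x, gathered)    # apply each selected expert
    return jnp.einsum("tk,tkf->tf", weights, h)    # weighted combine of expert outputs

tokens, d_model, d_ff, n_experts = 8, 64, 256, 16
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (tokens, d_model), dtype=jnp.bfloat16)
router_w = jax.random.normal(key, (d_model, n_experts), dtype=jnp.bfloat16)
expert_w = jax.random.normal(key, (n_experts, d_model, d_ff), dtype=jnp.bfloat16)
print(route_and_gather(x, router_w, expert_w).shape)   # (8, 256)
```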

03

Trillium's 256-Chip Pod & Multipod Scaling

Like v5e, Trillium ships a 256-chip 2D-torus pod — the canonical e-class shape. The big change is the multipod story: Google explicitly markets Trillium pods as composable into a 100,000-chip Jupiter-network domain.

In-pod (ICI)

  • 256 chips, 16×16 2D torus.
  • 800 GB/s ICI bidirectional per chip.
  • Pod aggregate: ~235 PFLOPS bf16, 8 TiB HBM.
  • 2× v5e on every per-pod metric.

Multi-pod (Jupiter DCN)

  • Up to ~100,000 Trillium chips in one optical-network domain.
  • 13 Pb/s bisection bandwidth (vendor figure).
  • 99% scaling efficiency at 12-pod (3,072 chips) on Google's published benchmarks.
  • 94% scaling efficiency at 24-pod (6,144 chips).
  • Used to train Gemini 2.0.

Why 100,000 chips on e-class, when v5p has 8,960 in one pod?

An ICI-coherent pod and a multipod cluster are different things. ICI is sub-microsecond, custom SerDes, hardware all-reduce. Multipod over Jupiter is microseconds-to-milliseconds, optical-circuit-switched, with all-reduce in software. You can train at much larger scale on the multipod, but the collective patterns are coarser-grained. v5p's 8,960-chip ICI domain is for fine-grained tensor-parallel training; Trillium's 100k-chip multipod is for data-parallel and pipeline-parallel scaling.
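A hedged JAX sketch of how the two domains are typically expressed in software: mesh_utils.create_hybrid_device_mesh builds a device mesh whose inner axes live on fast ICI (fine-grained model parallelism) and whose outer axis crosses the slower DCN/multipod network (data parallelism). The axis names and shapes below are illustrative, not a published Gemini configuration, and the snippet assumes a matching multi-slice TPU environment.

```python
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes a multi-slice TPU environment whose device count matches the
# product of the two shapes below (here 4 slices x 256 chips = 1,024 devices).
ici_mesh_shape = (1, 16, 16)   # axes inside one ICI-coherent slice (16x16 torus)
dcn_mesh_shape = (4, 1, 1)     # axis that crosses slices over the datacenter network

devices = mesh_utils.create_hybrid_device_mesh(ici_mesh_shape, dcn_mesh_shape)
mesh = Mesh(devices, axis_names=("data", "fsdp", "model"))

# Fine-grained tensor parallelism ("fsdp", "model") stays on sub-microsecond ICI;
# the coarse "data" axis crosses pods, so its all-reduces ride the slower DCN.
weights = NamedSharding(mesh, P("fsdp", "model"))   # shard a [d_in, d_out] weight
batches = NamedSharding(mesh, P("data", None))      # shard the batch across pods
```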

04

Why No "v6p"?

This is a question that confused everyone in 2024. v5 had two SKUs (e and p); v6 only has Trillium / v6e. There is no v6p. The next p-class chip is Ironwood / v7.

The product-cadence answer

p-class TPUs sit in datacenters for years; v5p deployed in late 2023 will still be doing useful work in 2027. Refreshing it every 18 months is wasteful capex. The cadence on p-class is closer to ~24–30 months.

The technology answer

HBM3e wasn't ready in volume in 2024. The 192 GiB / 7.4 TB/s per-chip memory profile that Ironwood (2025) actually ships requires HBM3e. Putting v6p out in 2024 would have meant another HBM2e chip with marginal capacity gain.

The strategic answer

Google's Gemini training was already saturating v5p pods in mid-2024 with very high utilisation. The right next p-class chip is one with a major capacity and FP8 jump — not a 1.5× refresh.

The result is a deliberate cadence: e-class refreshes more often (v5e → Trillium → eventually v8e), p-class refreshes less often (v5p → Ironwood → eventually v8p). This pattern will probably continue. Treat "Trillium = v6e" and "Ironwood = v7p" as the start of two interleaved series.

05

Ironwood — The Inference Flagship

Announced at Google Cloud Next on 9 April 2025, with general availability following in November 2025. Internally TPU v7 / TPU7x; publicly named Ironwood. Marketed as Google's "first TPU built for the age of inference".

  • FP8 peak / chip: 4.6 PFLOPS
  • bf16 / chip: 2.31 PFLOPS
  • HBM / chip: 192 GiB
  • HBM bandwidth / chip: 7.37 TB/s
  • ICI / chip (bidirectional): 1.2 TB/s
  • Max pod size: 9,216 chips
  • Pod FP8 peak: 42.5 EFLOPS
  • Per-chip TDP: ~600 W

Headline framings (from Google's announcement)

Important caveat: many of these headlines compare Ironwood's FP8 against Trillium's bf16, which builds in a 2× numeric advantage. The honest like-for-like comparison is bf16-to-bf16: Ironwood is ~2.5× Trillium. The headline ~5× comes from combining the FP8 switch with the capacity and bandwidth gains, which is genuinely transformative for inference.
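A quick arithmetic check of those two ratios, using only the per-chip figures quoted in this deck:

```python
trillium_bf16 = 918e12    # FLOPS per chip (Trillium, bf16)
ironwood_bf16 = 2.31e15   # FLOPS per chip (Ironwood, bf16)
ironwood_fp8  = 4.6e15    # FLOPS per chip (Ironwood, FP8)

print(ironwood_fp8 / trillium_bf16)   # ~5.0x: the marketing framing (FP8 vs bf16)
print(ironwood_bf16 / trillium_bf16)  # ~2.5x: the like-for-like bf16 comparison
```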

06

Per-Chip Numbers In Context

Where Ironwood sits relative to NVIDIA's contemporary inference / training silicon (May 2026):

Chip | Year | Headline numeric | HBM | HBM BW | TDP
TPU v5p | Dec 2023 | 459 TFLOPS bf16 | 95 GiB | 2.76 TB/s | ?
NVIDIA H100 SXM | 2022 | 989 TFLOPS bf16 dense / 3,958 TFLOPS FP8 | 80 GiB | 3.35 TB/s | 700 W
NVIDIA H200 | 2024 | 989 TFLOPS bf16 / 3,958 TFLOPS FP8 | 141 GiB | 4.8 TB/s | 700 W
Trillium (v6e) | Dec 2024 | 918 TFLOPS bf16 | 32 GiB | 1.64 TB/s | ~280 W
NVIDIA B200 | 2024 (dual-die) | 2.25 PFLOPS bf16 / 9 PFLOPS FP4 | 192 GiB | 8 TB/s | 1,000 W
Ironwood (v7) | Apr 2025 | 2.31 PFLOPS bf16 / 4.6 PFLOPS FP8 | 192 GiB | 7.37 TB/s | ~600 W

Reading the table

07

192 GiB HBM3e — Why It Matters

Memory capacity is the binding constraint for modern inference. Ironwood's 192 GiB per chip is the single most important spec on the chip.

What fits on one chip

  • 70B model in bf16 (140 GB) + 50 GB of KV cache → comfortable fit.
  • 180B model in FP8 (180 GB) + 12 GB of KV cache → fits.
  • ~700B-parameter MoE in FP8 (~700 GB of expert weights), with experts sharded across the chips → fits on one 4-chip tile (see the footprint sketch after this list).
  • Long-context Gemini 2.5 with 2M-token KV cache at FP8 quantisation → one chip.
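The bullets above are simple bytes-per-parameter arithmetic; a rough calculator, with illustrative configurations (decimal GB for weights, binary GiB for the HBM budget):

```python
HBM_PER_CHIP_GIB = 192

def model_gb(params_b, bytes_per_param):
    """Weight footprint in GB for params_b billion parameters."""
    return params_b * bytes_per_param      # 1B params x 1 byte = 1 GB (decimal)

def fits(weights_gb, kv_cache_gb, chips=1):
    budget_gb = chips * HBM_PER_CHIP_GIB * 1.073741824   # GiB -> GB
    return (weights_gb + kv_cache_gb) <= budget_gb

print(fits(model_gb(70, 2), 50))            # 70B bf16 + 50 GB KV cache      -> True
print(fits(model_gb(180, 1), 12))           # 180B FP8 + 12 GB KV cache      -> True
print(fits(model_gb(700, 1), 0, chips=4))   # ~700B MoE in FP8, 4-chip tile  -> True
```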

The bandwidth side

  • 7.37 TB/s of HBM bandwidth is enough to stream a 70B-parameter model in bf16 in under 20 ms, faster than any meaningful per-token deadline (worked through in the sketch after this list).
  • Effective bandwidth scales with on-chip (CMEM) hit rate; KV-cache hits can push effective bandwidth to roughly 2× the raw figure.
  • For decode-bound workloads (chatbot inference), bandwidth matters more than peak FLOPS, and 7.4 TB/s is in the same league as B200.
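A back-of-the-envelope version of the decode-bound argument: for batch-1 decode, every weight byte (plus the KV cache) is streamed from HBM once per generated token, so bandwidth sets a hard floor on per-token latency. The bandwidth figure is the deck's; the model sizes are illustrative.

```python
HBM_BW_TBPS = 7.37   # Ironwood HBM3e bandwidth per chip, TB/s

def decode_ms_per_token(weight_gb, kv_cache_gb=0.0):
    """Bandwidth-bound lower limit on per-token decode latency (batch 1)."""
    bytes_streamed_gb = weight_gb + kv_cache_gb          # read once per generated token
    return bytes_streamed_gb / (HBM_BW_TBPS * 1000) * 1000   # GB / (GB/s) -> ms

print(decode_ms_per_token(140))       # 70B in bf16:            ~19 ms/token
print(decode_ms_per_token(180, 20))   # 180B FP8 + 20 GB of KV: ~27 ms/token
```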

HBM3e details

HBM3e is the JEDEC standard finalised in early 2024.

The package design at this HBM scale is extremely difficult: CoWoS-style 2.5D advanced packaging, careful thermal management, and per-stack ECC. This is the kind of system engineering Google has been quietly accumulating since v3.

08

FP8 Becomes the Headline Numeric

Ironwood is the first TPU to lead with FP8 as its headline numeric. v4 / v5 / Trillium all advertised primarily in bf16. The shift mirrors NVIDIA's H100 (Hopper, 2022) and B200 (Blackwell, 2024).

Why FP8 at this generation, not earlier?

1. Numerical maturity

FP8 wasn't a stable training format until ~2023: it required loss-scaling, careful range tracking, and per-tensor scale factors. By 2025 the techniques were proven.

2. Inference-first chip

FP8 inference quantisation has been mainstream since 2023. Ironwood is the first TPU explicitly designed for this case — a chip whose primary metric is "tokens/sec on a frontier model in production".

3. The doubling lever

Halving precision doubles FLOPS-per-MAC. Going from bf16 to FP8 is the cleanest possible 2× on peak performance with a known acceptable accuracy cost — cleaner than die-area or clock increases at this point in the curve.

Ironwood's FP8 path uses the standard OCP FP8 formats (E4M3 for activations, E5M2 for gradients on the rare occasions training uses them) with per-block scaling. The bf16 path is preserved for cases where FP8 is too aggressive.
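As a rough illustration of per-block FP8 scaling, here is a minimal JAX sketch using the OCP E4M3 dtype (jnp.float8_e4m3fn). The block size and helper names are ours, not Ironwood's actual quantisation recipe.

```python
import jax.numpy as jnp

E4M3_MAX = 448.0   # largest finite value representable in OCP FP8 E4M3

def quantize_blockwise(x, block=32):
    """Quantise a [rows, cols] bf16 tensor to FP8 E4M3 with one scale per block of columns."""
    rows, cols = x.shape
    xb = x.reshape(rows, cols // block, block)
    scale = jnp.max(jnp.abs(xb), axis=-1, keepdims=True) / E4M3_MAX   # per-block scale
    q = (xb / scale).astype(jnp.float8_e4m3fn)
    return q, scale

def dequantize_blockwise(q, scale):
    return (q.astype(jnp.bfloat16) * scale).reshape(q.shape[0], -1)

x = jnp.linspace(-3.0, 3.0, 64 * 128, dtype=jnp.bfloat16).reshape(64, 128)
q, s = quantize_blockwise(x)
err = jnp.mean(jnp.abs(dequantize_blockwise(q, s) - x))
print(q.dtype, err)   # float8_e4m3fn, small mean absolute error
```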

09

The 9,216-Chip Pod

Ironwood pods come in two sizes: 256 chips (smaller deployments) and 9,216 chips (max). The 9,216-chip pod is the largest single ICI-coherent compute domain ever built.

Ironwood Pod (max configuration); the aggregates follow from the per-chip figures (see the check after this list):

  • 9,216 chips · 3D torus + Palomar-class OCS
  • ~1.77 PB HBM3e · ~67 PB/s aggregate HBM bandwidth
  • ~42.5 ExaFLOPS FP8 peak
  • ~5.5 MW pod power · liquid-cooled
  • OCS layer · per-job topology reconfiguration
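A quick multiplication in decimal (vendor-style) units, assuming 192 GB per chip as Google quotes:

```python
chips = 9_216

fp8_pflops_per_chip = 4.6
hbm_gb_per_chip = 192          # vendor figure, decimal GB
hbm_bw_tbps_per_chip = 7.37
watts_per_chip = 600

print(chips * fp8_pflops_per_chip / 1_000)    # ~42.4 EFLOPS FP8 peak (Google rounds to 42.5)
print(chips * hbm_gb_per_chip / 1_000_000)    # ~1.77 PB of HBM3e
print(chips * hbm_bw_tbps_per_chip / 1_000)   # ~67.9 PB/s aggregate HBM bandwidth
print(chips * watts_per_chip / 1_000_000)     # ~5.5 MW of chip power (before cooling overhead)
```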

What you can fit in one pod

10

Workloads — Gemini 2.0 / 2.5

Model | Year | Hardware | Notes
Gemini 2.0 | Dec 2024 | Trillium (v6e), multipod | Trained on Trillium; Google's GA blog is explicit about this. The first frontier training run on the 100k-chip Jupiter multipod.
Gemini 2.5 Pro | Mar 2025 | Trillium training; Ironwood inference | Trained while Ironwood was still pre-GA; inference moved to Ironwood for the long-context (2M-token) variants.
Gemini 2.5 Flash | 2025 | v5e & Trillium for inference | Distilled, smaller, runs comfortably on the e-class fleet.
Gemini 3 (rumoured) | 2025–26 | Ironwood | Trade-press indications; Google hasn't confirmed details. The 9,216-chip pod and 192 GiB HBM are sized for this generation.
YouTube / Search ranking | 2024–26 | Trillium fleet | SparseCore G3 shines here; the 5× DLRM-DCNv2 throughput translates directly.

The "age of inference" pitch, decoded

Google's Ironwood marketing leans on "age of inference" because, by 2025, inference cost at scale is the binding constraint on monetising AI. A frontier training run now costs on the order of $100M; ongoing inference for a popular consumer product runs on the order of $1B per year. Ironwood's design centre (high HBM capacity, lower TDP, FP8 throughput, a big pod) is shaped by that economic shift, not by training requirements.

11

The Inference-First Design Philosophy

What does "designed for inference" actually mean architecturally? Five concrete things in Ironwood:

Memory-led, not compute-led

Ironwood's chip-area budget went to HBM3e I/O, more on-die SRAM, and a wider interconnect — not to a 4× larger MXU. Inference is bandwidth-bound; throwing more MACs at a chip you can't feed is wasted silicon.
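One way to make "bandwidth-bound" concrete is the roofline ridge point: the arithmetic intensity (FLOPs per HBM byte) a workload needs before compute, rather than bandwidth, becomes the limit. The peak and bandwidth figures are the deck's; the decode intensity is a rough assumed value.

```python
# Roofline ridge point: FLOPs per HBM byte needed before the chip is compute-bound.
peak_fp8_flops = 4.6e15      # Ironwood per-chip FP8 peak
hbm_bw_bytes = 7.37e12       # Ironwood per-chip HBM bandwidth, bytes/s

ridge = peak_fp8_flops / hbm_bw_bytes
print(ridge)                 # ~624 FLOPs/byte to keep the MXU fully busy

# Batch-1 decode reads every (FP8) weight byte once and does ~2 FLOPs with it,
# so its arithmetic intensity is roughly 2 FLOPs/byte: hundreds of times below the ridge.
decode_intensity = 2.0
print(ridge / decode_intensity)   # how far decode sits from being compute-bound
```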

FP8 native

FP8 is the production inference numeric for everyone in 2025. v5p had bf16 + INT8; Ironwood adds FP8 as a first-class path through the MXU, which was a non-trivial silicon change.

Lower TDP, higher density

~600 W vs B200's 1000 W. For a fleet running 24/7, 40% lower power per chip translates to ~30% lower datacenter cost per token. Ironwood is more "perf/W" than "peak perf".

SparseCore for MoE routing

3rd-gen+ SparseCore handles MoE expert lookup. Inference of an MoE model is dominated by the routing-then-gather step; Ironwood does it in dedicated silicon.

Why "first" inference TPU is contested

v4i (2021) was inference-tuned. v5e (2023) was cost-optimised for inference. The marketing claim "first TPU built for the age of inference" is best read as "first flagship-scale TPU built primarily for inference": Ironwood is the first inference-first chip that is also bigger than its training-first sibling. v5p and Ironwood ship 8,960- and 9,216-chip pods respectively; the near-identical pod sizes suggest the inference-vs-training distinction is now mostly numeric (FP8 vs bf16) and capacity (192 GiB vs 95 GiB).

12

Cheat Sheet

Read next

Deck 09 — Memory & Numerics goes deeper on HBM evolution, VMEM/CMEM, and the bf16 / INT8 / FP8 numeric story. Deck 10 — ICI & OCS explains how Ironwood's 9,216-chip pod is wired up.