2024–2025: Trillium lifts e-class compute 4.7×. Ironwood (TPU v7) is Google's first chip "for the age of inference" — 4.6 PFLOPS FP8, 192 GiB HBM3e, 9,216-chip pods.
Announced at Google I/O on 14 May 2024 as the "6th-generation TPU"; rebranded with the public name Trillium (the woodland flower). General availability landed on 11 December 2024.
For comparison: NVIDIA's H100-to-B200 jump was roughly 2.3× on the equivalent bf16 numeric. Trillium's 4.7× comes partly from Google having held the v5e architecture longer than originally planned (v5e shipped Aug 2023; ~16 months to GA Trillium) and partly from the e-class line being the easier place to spend transistor budget on raw compute — you don't also have to pay for an OCS-attachable 3D-torus subsystem.
Trillium's 3rd-generation SparseCore is roughly 2× the embedding throughput of the v5p SparseCore, and per Google's GA blog delivers 5× on DLRM-DCNv2 — a recommendation-system benchmark.
MoE inference looks structurally identical to recommendation embedding lookup: each token selects a small subset of expert weight blocks and gathers them from HBM. SparseCore's hardware path is precisely what that needs. As MoE has become the dominant frontier-model architecture (Mixtral, GPT-4-class, Gemini 2.x), SparseCore has moved from "useful for ranking" to "load-bearing for LLMs". You cannot serve a sparse MoE efficiently without something like a SparseCore.
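The routing-then-gather pattern is easy to see in code. Below is a toy numpy sketch of top-k MoE dispatch — not Google's kernel and not how SparseCore implements it in silicon, just the access pattern the hardware accelerates: each token scores all experts, keeps the top k, and touches only those experts' weights.

```python
import numpy as np

def moe_gather(tokens, router_w, expert_w, k=2):
    """Toy top-k MoE layer: route each token to k of E experts, then
    gather and apply only those experts' weight blocks — the sparse
    lookup pattern SparseCore handles in hardware."""
    logits = tokens @ router_w                       # [T, E] routing scores
    topk = np.argsort(logits, axis=-1)[:, -k:]       # [T, k] chosen experts
    # Softmax over just the chosen experts' logits.
    g = np.take_along_axis(logits, topk, axis=-1)
    g = np.exp(g - g.max(-1, keepdims=True))
    g /= g.sum(-1, keepdims=True)
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for j in range(k):
            e = topk[t, j]                           # gather expert e's weights
            out[t] += g[t, j] * (tokens[t] @ expert_w[e])
    return out

rng = np.random.default_rng(0)
T, D, E = 4, 8, 16                                   # tokens, model dim, experts
y = moe_gather(rng.normal(size=(T, D)),
               rng.normal(size=(D, E)),
               rng.normal(size=(E, D, D)))
print(y.shape)                                       # (4, 8)
```

Each token here touched only 2 of 16 expert weight blocks — that ratio of weights-touched to weights-resident is exactly why the gather step, not the matmul, dominates MoE serving cost.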
Like v5e, Trillium ships a 256-chip 2D-torus pod — the canonical e-class shape. The big change is the multipod story: Google explicitly markets Trillium pods as composable into a 100,000-chip Jupiter-network domain.
An ICI-coherent pod and a multipod cluster are different things. ICI is sub-microsecond, custom SerDes, hardware all-reduce. Multipod over Jupiter is microseconds-to-milliseconds, optical-circuit-switched, with all-reduce in software. You can train at much larger scale on the multipod, but the collective patterns are coarser-grained. v5p's 8,960-chip ICI domain is for fine-grained tensor-parallel training; Trillium's 100k-chip multipod is for data-parallel and pipeline-parallel scaling.
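A toy cost model makes the regime difference concrete. The link bandwidths and hop latencies below are illustrative assumptions, not published specs; the point is only that for the small messages of fine-grained tensor parallelism, per-hop latency dominates, and the multipod fabric pays it at a much higher rate.

```python
def ring_allreduce_s(msg_bytes, n, link_bw_Bps, hop_lat_s):
    """Classic ring all-reduce cost model: move ~2*(n-1)/n of the data
    once over a link, plus (n-1) latency hops."""
    return 2 * (n - 1) / n * msg_bytes / link_bw_Bps + (n - 1) * hop_lat_s

MB = 1e6
# Hypothetical figures for illustration only:
# ICI-like link: 100 GB/s, ~1 us/hop; DCN-like path: 12.5 GB/s, ~10 us/hop.
ici = ring_allreduce_s(1 * MB, 64, 100e9, 1e-6)
dcn = ring_allreduce_s(1 * MB, 64, 12.5e9, 10e-6)
print(f"1 MB all-reduce over 64 chips: ICI {ici*1e6:.0f} us, DCN {dcn*1e6:.0f} us")
```

With these (made-up but order-of-magnitude-plausible) numbers the DCN path is roughly 10× slower, and the gap widens as messages shrink — which is why tensor parallelism stays inside the ICI domain and only coarse data/pipeline parallelism crosses pods.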
Where is v6p? This question confused everyone in 2024. v5 had two SKUs (e and p); v6 has only Trillium / v6e — there is no v6p. The next p-class chip is Ironwood / v7.
p-class TPUs sit in datacenters for years; v5p deployed in late 2023 will still be doing useful work in 2027. Refreshing it every 18 months is wasteful capex. The cadence on p-class is closer to ~24–30 months.
HBM3e wasn't ready in volume in 2024. The 192 GiB / 7.4 TB/s per-chip memory profile that Ironwood (2025) actually ships requires HBM3e. Putting v6p out in 2024 would have meant another HBM2e chip with marginal capacity gain.
Google's Gemini training was already saturating v5p pods in mid-2024 with very high utilisation. The right next p-class chip is one with a major capacity and FP8 jump — not a 1.5× refresh.
The result is a deliberate cadence: e-class refreshes more often (v5e → Trillium → eventually v8e), p-class refreshes less often (v5p → Ironwood → eventually v8p). This pattern will probably continue. Treat "Trillium = v6e" and "Ironwood = v7p" as the start of two interleaved series.
Announced at Google Cloud Next on 9 April 2025, with general availability following in November 2025. Internally TPU v7 / TPU7x; publicly named Ironwood. Marketed as Google's "first TPU built for the age of inference".
Important caveat: many of the headline "5× Trillium" comparisons pit Ironwood's FP8 against Trillium's bf16, which builds in an unfair 2× numeric advantage. The honest comparison is bf16-to-bf16: Ironwood is ~2.5× Trillium on bf16. The 5× comes from the FP8 + capacity + bandwidth combination, which is genuinely transformative for inference.
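Both ratios fall straight out of the per-chip peak numbers in the table below:

```python
# Peak per-chip throughput, PFLOPS (from the spec table):
trillium_bf16 = 0.918
ironwood_bf16 = 2.31
ironwood_fp8  = 4.6

print(round(ironwood_fp8 / trillium_bf16, 1))   # 5.0 — the marketing headline
print(round(ironwood_bf16 / trillium_bf16, 1))  # 2.5 — like-for-like bf16
```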
Where Ironwood sits relative to NVIDIA's contemporary inference / training silicon (as of May 2026):
| Chip | Year | Headline numeric | HBM | HBM BW | TDP |
|---|---|---|---|---|---|
| TPU v5p | Dec 2023 | 459 TFLOPS bf16 | 95 GiB | 2.76 TB/s | ? |
| NVIDIA H100 SXM | 2022 | 989 TFLOPS bf16 / 1,979 TFLOPS FP8 (dense) | 80 GiB | 3.35 TB/s | 700 W |
| NVIDIA H200 | 2024 | 989 TFLOPS bf16 / 1,979 FP8 (dense) | 141 GiB | 4.8 TB/s | 700 W |
| Trillium (v6e) | Dec 2024 | 918 TFLOPS bf16 | 32 GiB | 1.64 TB/s | ~280 W |
| NVIDIA B200 | 2024 (dual-die) | 2.25 PFLOPS bf16 / 9 PFLOPS FP4 | 192 GiB | 8 TB/s | 1000 W |
| Ironwood (v7) | Apr 2025 | 2.31 PFLOPS bf16 / 4.6 PFLOPS FP8 | 192 GiB | 7.37 TB/s | ~600 W |
Memory capacity is the binding constraint for modern inference. Ironwood's 192 GiB per chip is the single most important spec on the chip.
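The arithmetic is blunt. A back-of-envelope sizing (illustrative only — the 400B-parameter model and 100 GiB KV-cache budget below are assumed, and real deployments add activation memory and parallelism overheads) shows why per-chip capacity dominates serving cost:

```python
import math

GIB = 2**30

def chips_needed(params_billion, bytes_per_param, kv_cache_gib, hbm_gib):
    """Minimum chips to hold weights + KV cache. Ignores activations
    and sharding overhead — a floor, not a deployment plan."""
    weight_gib = params_billion * 1e9 * bytes_per_param / GIB
    return math.ceil((weight_gib + kv_cache_gib) / hbm_gib)

# Hypothetical 400B-parameter model served in FP8 (1 byte/param),
# with a 100 GiB KV-cache budget:
print(chips_needed(400, 1, 100, 192))   # Ironwood, 192 GiB -> 3
print(chips_needed(400, 1, 100, 32))    # Trillium, 32 GiB  -> 15
```

Fewer chips per replica means fewer cross-chip hops per token and more replicas per pod — capacity compounds into both latency and throughput.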
HBM3e is the extension of the JEDEC HBM3 standard, finalised in early 2024. Key parameters: roughly 9.2–9.6 Gb/s per pin over a 1,024-bit stack interface (~1.2 TB/s peak per stack) and stack capacities of 24 GiB and up — Ironwood's 192 GiB / 7.37 TB/s profile is consistent with eight 24 GiB stacks.
Whatever the exact stack configuration, the package design at this HBM scale is extremely difficult — CoWoS-style 2.5D advanced packaging, careful thermal management, and per-stack ECC. This is the kind of system engineering Google has been quietly accumulating since v3.
Ironwood is the first TPU to lead with FP8 as its headline numeric. v4 / v5 / Trillium all advertised primarily in bf16. The shift mirrors NVIDIA's H100 (Hopper, 2022) and B200 (Blackwell, 2024).
FP8 wasn't a stable training format until ~2023 — required loss-scaling, careful range tracking, and per-tensor scale factors. By 2025 the techniques were proven.
FP8 inference quantisation has been mainstream since 2023. Ironwood is the first TPU explicitly designed for this case — a chip whose primary metric is "tokens/sec on a frontier model in production".
Halving operand width roughly doubles MACs per unit of die area and halves the bytes moved per operand. Going from bf16 to FP8 is the cleanest possible 2× on peak throughput with a known, acceptable accuracy cost — cleaner than die-area or clock increases at this point in the curve.
Ironwood's FP8 path uses the standard OCP FP8 formats (E4M3 for activations, E5M2 for gradients on the rare occasions training uses them) with per-block scaling. The bf16 path is preserved for cases where FP8 is too aggressive.
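Per-block scaling is the piece that makes E4M3's narrow range workable. The sketch below is a rough numpy emulation of the idea (it rounds to a 3-mantissa-bit grid and skips E4M3's denormal and NaN handling; the block size and scaling policy are illustrative, not Ironwood's actual datapath): each block shares one scale chosen so its largest value maps to E4M3's max finite value, 448.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite E4M3 value (OCP FP8 spec)

def quantize_e4m3_blocks(x, block=32):
    """Per-block FP8-style fake-quantisation: one shared scale per
    block, values rounded to an E4M3-like 3-mantissa-bit grid."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0, 1.0, scale)         # avoid div-by-zero blocks
    y = x / scale                                    # now |y| <= 448
    exp = np.floor(np.log2(np.maximum(np.abs(y), 2.0**-9)))
    step = 2.0 ** (exp - 3)                          # grid spacing: 3 mantissa bits
    q = np.round(y / step) * step                    # round to the FP8-like grid
    return q * scale, scale

x = np.random.default_rng(1).normal(size=(4, 32)).astype(np.float32)
xq, s = quantize_e4m3_blocks(x)
rel_err = np.abs(xq.reshape(4, 32) - x).max() / np.abs(x).max()
print(rel_err < 0.1)                                 # True
```

The per-block scale is why outliers in one block don't crush the resolution of every other block — the same reason production FP8 schemes scale per tensor, per channel, or per block rather than globally.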
Ironwood pods come in two sizes: 256 chips (smaller deployments) and 9,216 chips (max). The 9,216-chip pod is the largest single ICI-coherent compute domain ever built.
| Model | Year | Hardware | Notes |
|---|---|---|---|
| Gemini 2.0 | Dec 2024 | Trillium (v6e), multipod | Trained on Trillium — Google's GA blog explicit on this. The first frontier training run on the 100k-chip Jupiter multipod. |
| Gemini 2.5 Pro | Mar 2025 | Trillium training; Ironwood inference | Trained when Ironwood was still pre-GA; inference moved to Ironwood for the long-context (2M token) variants. |
| Gemini 2.5 Flash | 2025 | v5e & Trillium for inference | Distilled, smaller, runs comfortably on the e-class fleet. |
| Gemini 3 (rumoured) | 2025–26 | Ironwood | Trade-press indications; Google hasn't confirmed details. The 9,216-chip pod and 192 GiB HBM are sized for this generation. |
| YouTube / Search ranking | 2024–26 | Trillium fleet | SparseCore G3 shines here; 5× DLRM-DCNv2 throughput translates directly. |
Google's Ironwood marketing leans on "age of inference" because, by 2025, inference-cost-at-scale is the binding constraint for monetising AI. A frontier training run is now on the order of $100M; ongoing inference for a popular consumer product is on the order of $1B/year. Ironwood's design centre — high HBM, low TDP, FP8 throughput, big pod — is shaped by that economic shift, not by training requirements.
What does "designed for inference" actually mean architecturally? Four concrete things in Ironwood:
Ironwood's chip-area budget went to HBM3e I/O, more on-die SRAM, and a wider interconnect — not to a 4× larger MXU. Inference is bandwidth-bound; throwing more MACs at a chip you can't feed is wasted silicon.
FP8 is the production inference numeric for everyone in 2025. v5p had bf16 + INT8; Ironwood adds FP8 as a first-class path through the MXU, which was a non-trivial silicon change.
~600 W vs B200's 1,000 W. For a fleet running 24/7 at comparable per-chip throughput, 40% lower power per chip translates into roughly 30% lower datacenter cost per token once cooling and power provisioning are counted. Ironwood is tuned for perf/W more than peak perf.
3rd-gen+ SparseCore handles MoE expert lookup. Inference of an MoE model is dominated by the routing-then-gather step; Ironwood does it in dedicated silicon.
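The perf/W point above reduces to a one-line model. Electricity price, PUE, and the equal-throughput assumption below are all illustrative, and this counts energy only (no capex or utilisation effects):

```python
def cost_per_mtok(chip_watts, tok_per_s, usd_per_kwh=0.08, pue=1.1):
    """Electricity-only $ per million tokens for one chip."""
    joules_per_tok = chip_watts * pue / tok_per_s
    return joules_per_tok * 1e6 / 3.6e6 * usd_per_kwh   # J -> kWh -> $

# Same assumed throughput (1,000 tok/s), different power envelopes:
ironwood_cost = cost_per_mtok(600, 1000)
b200_cost = cost_per_mtok(1000, 1000)
print(round(ironwood_cost / b200_cost, 2))   # 0.6 — the TDP ratio carries through
```

At equal throughput the energy cost per token tracks the TDP ratio exactly; in practice utilisation, batch size, and cooling design move the real number, but the direction is fixed by the 600 W envelope.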
v4i (2021) was inference-tuned. v5e (2023) was cost-optimised for inference. The marketing claim "first TPU built for the age of inference" is best read as "first flagship-scale TPU built primarily for inference" — Ironwood is the first inference-first chip that is also bigger than its training-first sibling. v5p tops out at an 8,960-chip pod and Ironwood at 9,216; the near-identical pod scale suggests the inference-vs-training distinction is now mostly numerics (FP8 vs bf16) and capacity (192 GiB vs 95 GiB).
Deck 09 — Memory & Numerics goes deeper on HBM evolution, VMEM/CMEM, and the bf16 / INT8 / FP8 numeric story. Deck 10 — ICI & OCS explains how Ironwood's 9,216-chip pod is wired up.