Every TPU generation from the 28 nm 2015 inference chip to the 9,216-chip Ironwood pod — specs, role, what it unlocked, and the e-class / p-class fork.
This deck is the spec-sheet companion to deck 01. It walks every chip and pod, anchors the numbers in primary Google sources, and shows the per-generation jumps as a single chart you can use as a desk reference.
Every Google TPU ever shipped, one row each. Numbers are peak vendor figures from Google Cloud documentation and the relevant launch blog or paper.
| Gen | Year | Process | Per-chip peak | HBM | Pod size | Topology | Role |
|---|---|---|---|---|---|---|---|
| v1 | 2015 | 28 nm | 92 TOPS INT8 | — (8 GiB DDR3) | 1 chip / PCIe card | n/a | Inference |
| v2 | 2017 | 16 nm | 45 TFLOPS bf16 | 16 GiB | 256 chips | 2D torus | Training |
| v3 | 2018 | 16 nm | 123 TFLOPS bf16 | 32 GiB | 1,024 chips | 2D torus | Training |
| v4 | 2020 (announced 2021) | 7 nm | 275 TFLOPS bf16 | 32 GiB | 4,096 chips | 3D torus + OCS | Training |
| v4i | 2020 | 7 nm | 138 TFLOPS bf16 | 8 GiB | 1 chip | n/a | Inference |
| v5e | 2023 | (undisclosed) | 197 TFLOPS bf16 / 394 TOPS INT8 | 16 GiB | 256 chips | 2D torus | Cost-optimised |
| v5p | 2023 | (undisclosed) | 459 TFLOPS bf16 / 918 TOPS INT8 | 95 GiB | 8,960 chips | 3D torus + OCS | Training flagship |
| Trillium (v6e) | 2024 | (undisclosed) | 918 TFLOPS bf16 / 1.8 POPS INT8 | 32 GiB | 256 chips | 2D torus | Cost-optimised |
| Ironwood (v7) | 2025 | (undisclosed) | 4.6 PFLOPS FP8 / 2.3 PFLOPS bf16 | 192 GiB HBM3e | 9,216 chips | 3D torus + OCS | Inference flagship |
The v1 disclosure (ISCA 2017) is still the most cited custom-AI-ASIC paper ever written. The chip itself is strikingly simple.
v1 served RankBrain (a ranking signal in Search), Google Translate's neural machine translation, Google Photos labelling, Street View text recognition, Now/Assistant speech recognition, and the Lee Sedol AlphaGo match. Inference only: no floating point, no gradients.
The ISCA 2017 roofline shows v1 was severely memory-bandwidth-bound on most workloads — 34 GB/s of DDR3 simply cannot feed 92 TOPS of compute. Every subsequent TPU has had HBM. That bandwidth wall is the single most important thing v1 taught the team, and it shapes every chip after.
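A back-of-the-envelope roofline makes the imbalance concrete. The sketch below uses only the two peak figures quoted above; the exact ridge point depends on how you count operations.

```python
# Roofline ridge point for TPU v1, from the two vendor peak figures above.
peak_ops = 92e12   # 92 TOPS INT8
mem_bw   = 34e9    # 34 GB/s DDR3

ridge = peak_ops / mem_bw
print(f"needs ~{ridge:,.0f} ops per byte of memory traffic to be compute-bound")
# ~2,706 ops/byte. The MLP and LSTM layers profiled in the ISCA 2017 paper sit far
# below that, so v1 mostly waits on DDR3 instead of multiplying.
```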
v1 was inference-only. By 2016 it was clear that the bigger fleet bottleneck was actually training large neural-translation and language models. v2 was a clean redesign: floating point, HBM, an interconnect that let chips talk to each other, and the ability to compute gradients.
The format was designed inside Google Brain specifically so that mixed-precision training of large neural nets would be numerically stable without rescaling: keep FP32's 8-bit exponent (so dynamic range is identical), drop the bottom 16 bits of the mantissa (so memory and bandwidth halve, multipliers shrink). NVIDIA shipped bf16 in tensor cores starting Ampere (2020); Intel and ARM followed. Every modern AI accelerator now supports bf16 because of TPU v2.
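A minimal sketch of what the format change means, using a NumPy bit trick (truncation only; real conversions round to nearest even):

```python
import numpy as np

def fp32_to_bf16(x):
    """bf16 is fp32 with the low 16 mantissa bits dropped: the 8-bit exponent
    (and hence the dynamic range) is untouched; precision falls to ~3 decimal digits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF_0000)).view(np.float32)

print(fp32_to_bf16(3.14159265))  # 3.140625
print(fp32_to_bf16(1.0e38))      # ~1e38, still in range; fp16 would overflow here
```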
v4 is the generation where the TPU stops being a chip and becomes a supercomputer. It was deployed in Google datacenters from 2020, announced at I/O 2021, and finally given a paper in 2023 (the long delay is a hyperscaler tell — you don't disclose the supercomputer that's training your next model).
Optical Circuit Switching gives v4 something no GPU pod has: topology that can be reconfigured per job, in milliseconds. A bad cube can be routed around without hardware swap; a 2,048-chip slice can be allocated as 4×4×128 or 8×16×16 depending on what your model wants. Deck 10 takes Palomar apart in detail.
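A toy sketch of why the slice shape matters. The two shapes are the ones quoted above; the bisection count is the usual torus cut argument, not a Google-published metric.

```python
# Two 3D shapes a 2,048-chip v4 slice can take, and a rough bisection proxy:
# cutting a torus across its longest axis severs 2 * (product of the other dims) links.
def bisection_links(x, y, z):
    longest = max(x, y, z)
    return 2 * (x * y * z) // longest

for shape in [(4, 4, 128), (8, 16, 16)]:
    x, y, z = shape
    assert x * y * z == 2048
    print(f"{x}x{y}x{z}: ~{bisection_links(x, y, z)} bisection links")

# 4x4x128 -> ~32 links: long and thin, suits pipeline-parallel traffic.
# 8x16x16 -> ~256 links: closer to a cube, better for all-reduce-heavy jobs.
```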
By v5 the product strategy is explicit: ship two SKUs every generation. The "e" is for efficiency — cost-optimised inference and small-scale training. The "p" is for performance — the chip Google trains its next frontier model on.
v5 is also the generation that ships Multislice as a first-class feature. A single training job can span multiple pods (potentially across datacenters), with high-bandwidth ICI inside each slice and the Jupiter datacenter network between slices. PaLM trained on two TPU v4 pods via Pathways; Gemini training on v5p uses the same pattern at higher scale.
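As a sketch of how this looks from JAX: the snippet below assumes a Multislice TPU environment and uses `mesh_utils.create_hybrid_device_mesh`, the helper MaxText-style codebases use to build ICI-inside / DCN-across meshes. Axis sizes are illustrative, not a real PaLM or Gemini config, and it will not run on a single host.

```python
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Two 256-chip slices: model parallelism over ICI inside each slice,
# data parallelism over DCN (Jupiter) between the slices.
devices = mesh_utils.create_hybrid_device_mesh(
    mesh_shape=(1, 256),    # per-slice (ICI) axes: (data, model)
    dcn_mesh_shape=(2, 1),  # across-slice (DCN) axes: 2-way data parallel
)
mesh = Mesh(devices, axis_names=("data", "model"))
batch_sharding = NamedSharding(mesh, PartitionSpec("data", None))  # shard the batch across slices
```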
Announced at Google I/O May 2024, GA December 2024. Trillium is e-class only — no v6p exists; the next p-class is Ironwood / v7.
From v6 onwards, Google has been naming TPUs after plants: Trillium (a woodland flower) and Ironwood (a tree). The numbered scheme is still used internally and in docs ("v6e", "v7" / "TPU7x") but the public face has been rebranded. This mirrors NVIDIA's habit of naming architectures after famous scientists (Pascal, Volta, Hopper): a memorable name is good cloud marketing.
Announced at Google Cloud Next on 9 April 2025, GA in November 2025. Google calls it "the first TPU built for the age of inference". v4i (2020) already targeted inference, but Ironwood is the first inference-focused chip at flagship scale, and the first that explicitly leads on FP8.
For perspective: NVIDIA's H200 has 141 GB of HBM3e per GPU; B200 has 192 GB per dual-die GPU package. Ironwood matches B200 capacity at the chip level, on a chip explicitly tuned for inference rather than training. A single chip can hold a 70B-parameter model in bf16 with KV cache to spare. A 256-chip pod can hold the largest open Gemini-class checkpoints with massive batch-size headroom.
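The capacity claim is easy to sanity-check. The arithmetic below assumes a Llama-70B-like layout (80 layers, grouped-query attention with 8 KV heads of dimension 128); those are illustrative assumptions, not Ironwood or Gemini specifics.

```python
GIB = 2**30

weights_gib = 70e9 * 2 / GIB   # 70B params at 2 bytes (bf16) -> ~130 GiB

# KV cache per token: K and V, per layer, per KV head, head_dim values, bf16.
layers, kv_heads, head_dim = 80, 8, 128               # assumed 70B-class config
kv_per_token = layers * 2 * kv_heads * head_dim * 2   # bytes -> ~320 KiB

leftover_gib = 192 - weights_gib
print(f"weights ~{weights_gib:.0f} GiB, ~{leftover_gib:.0f} GiB left of 192 GiB")
print(f"~{leftover_gib * GIB / kv_per_token / 1e3:.0f}k tokens of KV cache")  # ~200k
```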
Plotting peak per-chip throughput on a log scale. Note how the jumps come in three places: v1→v2 (FP arrives), v3→v4 (process node + 4 MXUs), and v6→v7 (FP8 introduced and scaled). This is the curve that says "every new model class arrives 6–18 months after a TPU jump".
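The chart regenerates directly from the spec table. The sketch below uses each generation's headline numeric, mixing INT8/bf16/FP8 exactly as the vendor figures do; see the caveat at the end of this deck.

```python
import matplotlib.pyplot as plt

gens  = ["v1", "v2", "v3", "v4", "v5e", "v5p", "v6e", "v7"]
peaks = [0.092, 0.045, 0.123, 0.275, 0.197, 0.459, 0.918, 4.614]  # PFLOPS(-equivalent)

plt.plot(gens, peaks, marker="o")
plt.yscale("log")
plt.ylabel("Peak per-chip throughput (PFLOPS, log scale)")
plt.title("TPU per-chip peak by generation")
plt.show()
```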
Numbers are PFLOPS (or PFLOPS-equivalent for INT8/FP8). Two key observations:
- Memory capacity is often the binding constraint for LLM serving and large-batch training. The TPU per-chip HBM curve is striking: flat from v2 to v4, then a step in v5p, and a near-vertical jump in Ironwood.
- The v6e regression (32 GiB) is a deliberate cost choice; e-class chips don't need v5p's 95 GiB because they're not training frontier dense models. Ironwood at 192 GiB resets the inference picture: a single chip can hold a 70B model with the KV cache for a long-context conversation.
The other curve worth tracking: how many chips an ICI-coherent pod holds. This is the size of the largest "single machine" you can train on.
| Generation | Pod size (chips) | Topology | Aggregate compute (peak) | Aggregate HBM |
|---|---|---|---|---|
| v1 | 1 | n/a (PCIe card) | 92 TOPS INT8 | 8 GiB DDR3 |
| v2 | 256 | 16×16 2D torus | ~11.5 PFLOPS bf16 | 4 TiB |
| v3 | 1,024 | 32×32 2D torus | ~126 PFLOPS bf16 | 32 TiB |
| v4 | 4,096 | 64 cubes of 4×4×4, 3D torus via OCS | ~1.1 ExaFLOPS bf16 | 128 TiB |
| v5e | 256 | 16×16 2D torus | ~50 PFLOPS bf16 | 4 TiB |
| v5p | 8,960 | 3D torus + OCS | ~4.1 ExaFLOPS bf16 | ~830 TiB |
| Trillium (v6e) | 256 | 2D torus (multipod up to ~100k via Jupiter) | ~235 PFLOPS bf16 | 8 TiB |
| Ironwood (v7) | 9,216 | 3D torus + OCS | ~42.5 ExaFLOPS FP8 | ~1.7 PiB HBM3e |
The 9,216-chip Ironwood pod is the largest single-vendor scale-up domain ever built: two orders of magnitude more chips than the 72-GPU NVLink domain of an NVIDIA NVL72 rack, and connected by ICI rather than InfiniBand at the fabric level.
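The aggregate columns are just the per-chip figures multiplied out; a quick check of the two flagship rows, using the per-chip numbers from the spec table at the top of this deck:

```python
GIB, TIB, PIB = 2**30, 2**40, 2**50

# v5p pod: 8,960 chips x 459 TFLOPS bf16 and 95 GiB HBM each.
print(8960 * 459 / 1e6, "EFLOPS bf16")        # ~4.11
print(8960 * 95 * GIB / TIB, "TiB HBM")       # ~831

# Ironwood pod: 9,216 chips x 4,614 TFLOPS FP8 and 192 GiB HBM3e each.
print(9216 * 4614 / 1e6, "EFLOPS FP8")        # ~42.5
print(9216 * 192 * GIB / PIB, "PiB HBM3e")    # ~1.69
```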
Pick two TPU generations to compare per chip. Useful for quick sanity-checks — "is Trillium really 4.7× v5e?", "how much compute does Ironwood gain from FP8 alone?".
The compute ratio uses each chip's headline numeric, and those differ across generations (INT8 vs bf16 vs FP8). To do a true throughput comparison you have to pick one numeric: e.g. v5p's 459 TFLOPS bf16 vs Ironwood's 2,307 TFLOPS bf16 is a clean 5×; saying Ironwood is 10× v5p only works if you compare FP8 to bf16, which is unfair. Vendor headlines play this game often, so check the units.
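A minimal version of that comparison as a script. The dict is just this deck's table transcribed; the chip keys are informal labels, not official SKU names.

```python
# Per-chip peak throughput in TFLOPS, by numeric format, from the spec table.
SPECS = {
    "v5e":      {"bf16": 197,  "int8": 394},
    "v5p":      {"bf16": 459,  "int8": 918},
    "v6e":      {"bf16": 918,  "int8": 1836},
    "ironwood": {"bf16": 2307, "fp8": 4614},
}

def ratio(a, b, fmt):
    """Throughput of chip b over chip a in one numeric format -- apples to apples."""
    return SPECS[b][fmt] / SPECS[a][fmt]

print(f"Trillium vs v5e, bf16: {ratio('v5e', 'v6e', 'bf16'):.1f}x")        # ~4.7x
print(f"Ironwood vs v5p, bf16: {ratio('v5p', 'ironwood', 'bf16'):.1f}x")   # ~5.0x
# The "10x" headline compares Ironwood FP8 against v5p bf16 -- different numerics.
print(f"Ironwood FP8 vs v5p bf16: {SPECS['ironwood']['fp8'] / SPECS['v5p']['bf16']:.1f}x")
```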
Deck 03 — Systolic Arrays opens up the matmul engine inside the chip. Decks 04–08 walk each generation in detail. Deck 09 is on memory and numerics. Deck 10 is on ICI and OCS.