2023 is the year Google admits the TPU is now two products. v5e is cheap and good enough; v5p is the chip Gemini trains on. That split defines every TPU generation since.
Through v4, Google shipped one TPU SKU per generation, plus an inference variant (v4i) that arrived a year later. By 2023 that pattern was breaking down for three reasons.
Cloud TPU customers were split. Some wanted the cheapest chip per dollar for fine-tuning and inference. Others wanted the absolute biggest pod for training a frontier model. One SKU couldn't serve both well.
v4 was expensive: 32 GiB HBM, two TensorCores, OCS-connected pod. For someone running a 7B-parameter inference fleet, ~70% of that silicon is dark. v5e cuts price by simplifying.
By mid-2023 Google's frontier model team needed pods bigger than 4096 chips. v5p ships with an 8,960-chip pod — more than double v4 — specifically to enable Gemini-class runs.
"e-class" = cheaper than v4i ever was. "p-class" = bigger than v4 ever was. Same architecture team, same software stack, two SKUs per generation from this point onward.
NVIDIA does this too: GeForce RTX (consumer / inference) and HGX H100 (datacenter / training) are the same architecture in dramatically different packages. The TPU split is at smaller scale — one team, two SKUs per gen, no consumer line — but the strategic logic is the same.
Announced August 2023 at Google Cloud Next. Marketed as offering ~2.5× perf-per-dollar over v4 for inference, and ~2× for fine-tuning.
v5e is also the first TPU with explicit Multislice support — you can compose multiple v5e pods into a larger virtual training cluster over the Jupiter datacenter network. We'll come to that on slide 07.
Announced December 2023, four months after v5e. v5p is the chip Google trains its own frontier models on; its pod is the largest Google had shipped to that point.
The single most-asked question about v5p is "why 95 GiB and not 96?" Two facts: (1) HBM stacks come in standard sizes (16, 24 GiB per stack); v5p has six stacks × 16 GiB = 96 GiB physical. (2) Some capacity is reserved for ECC / sparing / firmware. The user-visible figure that Google publishes is 95 GiB. Treat 95 and 96 as the same number with different accounting.
For comparison: v4 had 32 GiB; H100 SXM has 80 GiB; H200 has 141 GiB. v5p sits between H100 and H200 on capacity but gives you a lot more chips per pod — 8,960 vs the few thousand you'd realistically scale to with NVLink + InfiniBand. The TPU pod's economic moat is per-pod aggregate HBM, not per-chip HBM.
| | v4 (2020) | v5e (Aug 2023) | v5p (Dec 2023) |
|---|---|---|---|
| TensorCores | 2 | 1 | 2 |
| MXUs per TensorCore | 4 | 4 | 4 |
| SparseCores | 1 | 1 | 4 |
| Per-chip bf16 | 275 TFLOPS | 197 TFLOPS | 459 TFLOPS |
| Per-chip INT8 | — | 394 TOPS | 918 TOPS |
| HBM | 32 GiB @ 1.2 TB/s | 16 GiB @ 819 GB/s | 95 GiB @ 2.76 TB/s |
| ICI per chip (bidir) | ~300 GB/s | 400 GB/s | 1.2 TB/s |
| Topology | 3D torus + OCS | 2D torus | 3D torus + OCS |
| Pod size | 4,096 | 256 | 8,960 |
| Pod bf16 | ~1.13 EFLOPS | ~50.4 PFLOPS | ~4.11 EFLOPS |
| Pod HBM | ~128 TiB | ~4 TiB | ~850 TiB |
| Cooling | Liquid | Liquid | Liquid |
| Role | Training | Cost-optimised | Training flagship |
Two observations. First, v5p is 1.7× v4 per chip on bf16 but the pod is 2.2× bigger, so pod-level v5p is ~3.6× v4. Second, v5e is ~70% of v4 per chip at a much lower price point — the perf-per-dollar improvement is real but it isn't a faster chip; it's a cheaper one.
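A quick back-of-envelope check of those ratios, using only the figures in the table above (illustrative arithmetic, nothing more):

```python
# Sanity-check the per-chip and per-pod ratios quoted above,
# using the bf16 and pod-size figures from the spec table.
v4_tflops,  v4_chips  = 275, 4096
v5p_tflops, v5p_chips = 459, 8960
v5e_tflops            = 197

print(v5p_tflops / v4_tflops)                              # ~1.67x per chip
print((v5p_tflops * v5p_chips) / (v4_tflops * v4_chips))   # ~3.65x per pod
print(v5e_tflops / v4_tflops)                              # ~0.72x, v5e vs v4 per chip
```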
The largest scale-up domain Google had ever built at the time it shipped (Dec 2023). The topology is the v4 architecture taken to its logical extreme.
A dense 1T-parameter model in bf16 weighs 2 TB. With ~850 TiB aggregate HBM, a v5p pod can hold ~400 such models, or one such model with all optimiser state plus generous headroom for activation memory in backprop. v5p is the first TPU pod where pod-aggregate HBM stops being a binding constraint for any LLM Google trains; the constraint moves to compute time and orchestration.
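The arithmetic behind that claim, as a sketch. The training-state layout below (bf16 weights, fp32 master copy, two fp32 Adam moments) is an assumed recipe, not Gemini's actual one:

```python
# Rough memory budget for a dense 1T-parameter model on a full v5p pod.
# The optimiser layout is an assumption (Adam-style mixed precision).
params        = 1.0e12
pod_hbm_bytes = 8960 * 95 * 2**30          # pod-aggregate HBM, 95 GiB per chip

weights_bf16  = params * 2                  # ~2 TB of bf16 weights
train_state   = params * (2 + 4 + 4 + 4)    # bf16 weights + fp32 master + 2 Adam moments

print(f"bf16 weight copies that fit in the pod: {pod_hbm_bytes / weights_bf16:.0f}")
print(f"pod HBM consumed by full training state: {train_state / pod_hbm_bytes:.1%}")
```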
v5e's pod is much smaller, much cheaper, and intentionally simpler. 256 chips, 16×16 2D torus, no OCS within a pod.
A v5e pod can be sliced into smaller jobs at fixed shapes — 1x1, 2x2, 2x4, 4x4, 4x8, 8x8, 8x16, 16x16. Each job gets a sub-torus. Cloud TPU schedules these as queued resources — you ask for a shape, you wait until one is free.
v5e's value proposition isn't "best at any one thing"; it's "good enough at most things, much cheaper". For inference of a 70B model: 4 chips are enough. For LoRA fine-tuning of Gemma 27B: 64 chips are enough. For pre-training a 7B model from scratch: 256 chips are enough. The v5e pod is sized to those use cases.
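What a slice looks like from inside a JAX program: a minimal sketch, assuming the job landed on a 4x8 v5e slice (32 chips). The axis names and the weight shape are illustrative choices, not fixed by the platform.

```python
# Minimal sketch: map a v5e 4x8 sub-torus onto a 2D JAX mesh and shard
# a weight matrix across both axes. Assumes this process sees 32 devices.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices()).reshape(4, 8)   # mirror the 4x8 slice shape
mesh = Mesh(devices, axis_names=("x", "y"))

weights = jnp.zeros((8192, 8192))
sharded = jax.device_put(weights, NamedSharding(mesh, P("x", "y")))
print(sharded.sharding)
```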
Multislice is a Cloud TPU feature, announced at Google Cloud Next August 2023, that lets a single training job span multiple slices, potentially across multiple pods, with high-bandwidth ICI inside each slice and the Jupiter DCN between slices.
The orchestration layer that makes Multislice usable. Pathways (Barham, Dean, Ghemawat et al., MLSys 2022) is Google's single-controller distributed-dataflow runtime for ML — one Python process drives compilation, scheduling, and execution across thousands of chips.
Pathways underpinned the PaLM training run on v4 in 2022, scaled to multi-pod with v5p and Gemini in 2023–24, and is now the standard orchestration story for any TPU job above ~1,000 chips.
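At the JAX level, a Multislice job usually shows up as a hybrid device mesh: fast ICI axes inside each slice, a slower DCN axis across slices. A minimal sketch, assuming two 256-chip v5e slices; the axis names and mesh shapes are illustrative.

```python
# Minimal Multislice sketch: data-parallel across two v5e slices over the
# DCN, model-parallel within each slice over ICI. Assumes the job was
# launched with two 16x16 (256-chip) slices.
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

ici_mesh_shape = (1, 16, 16)   # per-slice layout: 256 chips on the fast ICI
dcn_mesh_shape = (2, 1, 1)     # two slices joined over the datacenter network

devices = mesh_utils.create_hybrid_device_mesh(ici_mesh_shape, dcn_mesh_shape)
mesh = Mesh(devices, axis_names=("data", "fsdp", "model"))

# Batch split over the slow DCN axis; parameters split over the ICI axes.
batch_sharding = NamedSharding(mesh, P("data", None))
```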
The flagship workload for v5p: Gemini. Google's Gemini 1.0 paper (Dec 2023) explicitly credits TPU v4 and v5e for training; Gemini 1.5 Pro (Feb 2024) is widely understood to be a v5p training job, with Pathways orchestration.
| Model | Year | Hardware | Notes |
|---|---|---|---|
| Gemini 1.0 Ultra / Pro / Nano | Dec 2023 | TPU v4 + v5e | Trained across multiple pods via Pathways. v5e for the smaller variants (Nano) and inference; v4 for pretraining heavy lifting. |
| Gemini 1.5 Pro (1M context) | Feb 2024 | TPU v5p | The first v5p showcase. The 1M-token context is enabled by v5p's 95 GiB per chip and high HBM bandwidth. |
| Gemini 1.5 Flash | May 2024 | TPU v5p / v5e mix | Distillation from 1.5 Pro; serving on v5e for cost. |
| Bard → Gemini App inference | 2023–24 | v5e fleet | The user-facing chatbot's inference fleet migrated from v4i to v5e through 2023. |
Long-context inference is dominated by KV cache. A 1M-token context for a Gemini-class model is hundreds of GB of KV. v5p's 95 GiB per chip and 2.76 TB/s HBM bandwidth let a single instance hold and stream that KV; smaller-HBM chips force you to split it across more chips, paying ICI all-reduce on every step.
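To put numbers on that: a rough KV-cache estimate for a hypothetical Gemini-class decoder. The layer count, KV-head count, and head dimension below are assumptions, not published figures.

```python
# Rough KV-cache sizing for a 1M-token context. Model shape is assumed.
layers    = 80          # assumed decoder depth
kv_heads  = 16          # assumed KV heads (GQA)
head_dim  = 128         # assumed head dimension
seq_len   = 1_000_000   # 1M-token context
dtype_b   = 2           # bf16 K and V entries

kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * dtype_b   # K and V
print(f"KV cache: {kv_bytes / 1e9:.0f} GB")                                     # ~655 GB
print(f"v5p chips to hold it (95 GiB each, KV alone): {kv_bytes / (95 * 2**30):.1f}")
```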
v5e and v5p are sold via Google Cloud as both on-demand and reservation pricing. Public list prices (US-central, on-demand, May 2026) are roughly:
| SKU | Per chip-hour | Per pod-hour (full pod) | Notes |
|---|---|---|---|
| v5e (1 chip) | ~$1.20 | ~$307 (256 chips) | By far the cheapest TPU. |
| v5p (1 chip) | ~$4.20 | ~$37,600 (8,960 chips) | Roughly 3.5× v5e per chip. |
| Trillium v6e | ~$2.70 | ~$691 (256 chips) | 4.7× v5e perf at ~2.3× the price. |
| Ironwood v7 | (reserved-only) | (reserved-only) | Capacity sold via long-term commitments at GA in late 2025. |
Disclaimer: prices change. Treat the table as illustrative; check the current Cloud TPU pricing page for live numbers.
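As a rough cross-check of the price/perf claims, here is paper FLOPS per dollar, combining the spec table's peak bf16 numbers with the list prices above. The v6e per-chip figure is derived from the "4.7× v5e perf" multiple, and none of this is measured workload throughput.

```python
# Peak bf16 TFLOPS per on-demand dollar, from the spec and pricing tables.
# Paper-FLOPS accounting only, not measured throughput per dollar.
skus = {
    "v5e": (197,       1.20),   # (bf16 TFLOPS per chip, $ per chip-hour)
    "v5p": (459,       4.20),
    "v6e": (197 * 4.7, 2.70),   # derived from the "4.7x v5e perf" figure
}
for name, (tflops, price) in skus.items():
    print(f"{name}: {tflops / price:.0f} peak TFLOPS per chip-hour dollar")
# v5e ~164, v5p ~109, v6e ~343: v5p buys pod scale, not cheap FLOPS.
```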
v5 is the chip that defines the modern TPU product strategy. Trillium and Ironwood are refinements of the v5e and v5p archetypes — they don't change the strategy, they execute on it.
Deck 08 — Trillium & Ironwood covers the 2024–2025 chips that take this same strategy further. Deck 11 — Software Stack covers Pathways, GSPMD, JAX sharding in detail.