Google TPUs Series — Presentation 07

TPU v5e & v5p — The Two-Track Fork

2023 is the year Google admits the TPU is now two products. v5e is cheap and good enough; v5p is the chip Gemini trains on. The split that defines every TPU since.

[Title diagram: v4 / v4i forks into v5e (efficient, "e-class", Aug 2023) and v5p (performance, "p-class", Dec 2023); v5p brings the 8,960-chip pod, Multislice, Pathways on Cloud, and Gemini 1.0/1.5 training.]
00

Topics We'll Cover

01

Why Google Forked the Product Line

Until v4, Google shipped one TPU SKU per generation, plus an inference variant (v4i) that arrived a year later. By 2023 that pattern was breaking down for three reasons.

1. Cloud TPU customer mix

Cloud TPU customers were split. Some wanted the cheapest chip per dollar for fine-tuning and inference. Others wanted the absolute biggest pod for training a frontier model. One SKU couldn't serve both well.

2. The cost gap

v4 was expensive: 32 GiB HBM, two TensorCores, OCS-connected pod. For someone running a 7B-parameter inference fleet, ~70% of that silicon is dark. v5e cuts price by simplifying.

3. Gemini's appetite

By mid-2023 Google's frontier model team needed pods bigger than 4096 chips. v5p ships with an 8,960-chip pod — more than double v4 — specifically to enable Gemini-class runs.

The product strategy in one slogan

"e-class" = cheaper than v4i ever was. "p-class" = bigger than v4 ever was. Same architecture team, same software stack, two SKUs per generation from this point onward.

Mirroring NVIDIA

NVIDIA does this too: GeForce RTX (consumer / inference) and HGX H100 (datacenter / training) are the same architecture in dramatically different packages. The TPU split is at smaller scale — one team, two SKUs per gen, no consumer line — but the strategic logic is the same.

02

v5e — The Efficient Chip

Announced August 2023 at Google Cloud Next. Marketed as offering ~2.5× perf-per-dollar over v4 for inference, and ~2× for fine-tuning.

Per-chip

  • 1 TensorCore with 4 MXUs, vector unit, scalar unit.
  • 197 TFLOPS bf16; 393 TOPS INT8.
  • 16 GiB HBM at 819 GB/s.
  • 4 ICI ports, 400 GB/s bidirectional aggregate.
  • SparseCore on-die.
  • Process node not officially disclosed (widely reported as 5nm-class).

Pod

  • 256-chip pod, 16×16 2D torus.
  • Pod aggregate: ~50.4 PFLOPS bf16, 4 TiB aggregate HBM.
  • No OCS within a pod — static torus topology.
  • Multipod uses Jupiter DCN (not ICI) for cross-pod traffic.

What v5e is for

Cost-optimised inference, fine-tuning, and small training runs (slide 06 puts numbers on these). v5e is also the first TPU with explicit Multislice support — you can compose multiple v5e pods into a larger virtual training cluster over the Jupiter datacenter network. We'll come to that on slide 07.

03

v5p — The Frontier Training Chip

Announced December 2023, four months after v5e. v5p is the chip Google trains its own frontier models on, and at launch its pod was the largest Google had ever shipped.

Per-chip

  • 2 TensorCores, each with 4 MXUs.
  • 4 SparseCores per chip.
  • 459 TFLOPS bf16; 918 TOPS INT8.
  • 95 GiB HBM at 2.76 TB/s — roughly 3.4× v5e's bandwidth.
  • 6 ICI ports for 3D-torus connectivity.
  • 4,800 Gbps ICI per chip = 600 GB/s per direction, ~1.2 TB/s bidirectional.

Pod

  • 8,960-chip pod.
  • 3D torus + OCS — same family as v4 but bigger.
  • Pod aggregate: ~4.1 ExaFLOPS bf16; ~850 TiB aggregate HBM.
  • Max single-job slice = 96 cubes = 6,144 chips.

The 95 GiB HBM number

The single most-asked question about v5p is "why 95 GiB and not 96?" Two facts: (1) HBM stacks come in standard sizes (16, 24 GiB per stack); v5p has six stacks × 16 GiB = 96 GiB physical. (2) Some capacity is reserved for ECC / sparing / firmware. The user-visible figure that Google publishes is 95 GiB. Treat 95 and 96 as the same number with different accounting.

A 95-GiB-per-chip HBM budget

For comparison: v4 had 32 GiB; H100 SXM has 80 GiB; H200 has 141 GiB. v5p sits between H100 and H200 on capacity but gives you a lot more chips per pod — 8,960 vs the few thousand you'd realistically scale-up with NVLink+IB. The TPU pod's economic moat is per-pod aggregate HBM, not per-chip HBM.

04

Per-Chip Spec Comparison

Spec                  | v4 (2020)          | v5e (Aug 2023)     | v5p (Dec 2023)
TensorCores           | 2                  | 1                  | 2
MXUs per TensorCore   | 4                  | 4                  | 4
SparseCores per chip  | 1                  | 1                  | 4
Per-chip bf16         | 275 TFLOPS         | 197 TFLOPS         | 459 TFLOPS
Per-chip INT8         | n/a                | 393 TOPS           | 918 TOPS
HBM                   | 32 GiB @ 1.2 TB/s  | 16 GiB @ 819 GB/s  | 95 GiB @ 2.76 TB/s
ICI per chip (bidir)  | ~300 GB/s          | 400 GB/s           | 1.2 TB/s
Topology              | 3D torus + OCS     | 2D torus           | 3D torus + OCS
Pod size (chips)      | 4,096              | 256                | 8,960
Pod bf16              | ~1.13 EFLOPS       | ~50.4 PFLOPS       | ~4.11 EFLOPS
Pod HBM               | ~128 TiB           | ~4 TiB             | ~850 TiB
Cooling               | Liquid             | Liquid             | Liquid
Role                  | Training           | Cost-optimised     | Training flagship

Two observations. First, v5p is 1.7× v4 per chip on bf16 but the pod is 2.2× bigger, so pod-level v5p is ~3.6× v4. Second, v5e is ~70% of v4 per chip at a much lower price point — the perf-per-dollar improvement is real but it isn't a faster chip; it's a cheaper one.
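A quick way to sanity-check those two observations is to multiply the table's numbers out; everything below comes straight from the spec table.

```python
# Worked arithmetic behind the two observations (numbers from the spec table above).
v4_chip, v5e_chip, v5p_chip = 275e12, 197e12, 459e12   # bf16 FLOPS per chip
v4_pod, v5p_pod = 4096, 8960                            # chips per pod

per_chip_gain  = v5p_chip / v4_chip             # ~1.67x per chip
pod_size_gain  = v5p_pod / v4_pod               # ~2.19x more chips
pod_level_gain = per_chip_gain * pod_size_gain  # ~3.65x at pod level (~4.11 EF vs ~1.13 EF)
v5e_vs_v4      = v5e_chip / v4_chip             # ~0.72 -> about 70% of v4 per chip
```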

05

The 8,960-Chip v5p Pod

The largest scale-up domain Google had ever built at the time it shipped (Dec 2023). The topology is the v4 architecture taken to its logical extreme.

Pod organisation

  • Chips arranged as a 3D torus.
  • Cubes connected through Palomar-class OCS (8,960 chips means a much larger OCS infrastructure than v4's 48 switches).
  • Maximum schedulable slice for one job: 6,144 chips (96 cubes of 64).
  • Twisted-torus topology supported per-job.

Pod aggregate numbers

  • ~4.1 ExaFLOPS bf16 peak.
  • ~850 TiB aggregate HBM.
  • ~10 PB/s aggregate ICI bandwidth.
  • ~5 MW typical pod power.
  • Liquid-cooled, 100+ racks.

What you can fit in this

A dense 1T-parameter model in bf16 weighs 2 TB. With 850 TiB aggregate HBM, a v5p pod can hold ~400 such models, or one such model with all optimiser state plus arbitrarily large activation memory for backprop. v5p is the first TPU pod where pod-aggregate HBM stops being a binding constraint for any LLM Google trains; the constraint moves to compute time and orchestration.
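A back-of-envelope version of that claim, in code. The optimiser layout below (an Adam-style fp32 master copy plus two fp32 moments) is an assumption for illustration, not a statement about how Gemini is actually trained.

```python
# Memory budget for a dense 1T-parameter model on a v5p pod (illustrative).
params = 1e12                              # dense 1T parameters
weights_bytes = params * 2                 # bf16 weights -> 2 TB
adam_bytes = params * (4 + 4 + 4)          # assumed fp32 master + m + v -> 12 TB
pod_hbm_bytes = 8960 * 95 * 2**30          # pod-aggregate HBM

print(f"weights: {weights_bytes / 1e12:.1f} TB")                               # 2.0 TB
print(f"weights + optimiser state: {(weights_bytes + adam_bytes) / 1e12:.1f} TB")  # 14.0 TB
print(f"pod HBM / weights: {pod_hbm_bytes / weights_bytes:.0f}x")              # several hundred copies
```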

06

v5e's 256-Chip 2D-Torus Pod

v5e's pod is much smaller, much cheaper, and intentionally simpler. 256 chips, 16×16 2D torus, no OCS within a pod.

Why no OCS?

  • OCS infrastructure is a non-trivial cost — 48+ switches per pod.
  • For 256 chips, the OCS reconfigurability isn't worth it: a 16×16 torus is small enough that you don't need to slice it dynamically.
  • Faults are handled at the multi-pod level (route around a faulty pod), not within a pod.

Why 2D not 3D?

  • 4 ICI ports per chip is enough for a 16×16 torus.
  • Cuts ICI silicon and packaging cost ~33% vs 6-port 3D.
  • For inference batches and small training runs, 2D-torus all-reduce is plenty.
  • Multipod scaling uses Jupiter DCN, not extending the torus.

The slice configurations

A v5e pod can be sliced into smaller jobs at fixed shapes — 1x1, 2x2, 2x4, 4x4, 4x8, 8x8, 8x16, 16x16. Each job gets a sub-torus. Cloud TPU schedules these as queued resources — you ask for a shape, you wait until one is free.
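As a sketch of what a slice looks like from software, here is how a 4x8 v5e slice maps onto a JAX device mesh. It assumes the code runs inside a v5e-32 slice (32 chips), and the axis names are illustrative.

```python
# Minimal JAX sketch: a 4x8 v5e slice seen as a 2D device mesh (assumes 32 chips).
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_device_mesh((4, 8))        # matches the slice's 4x8 sub-torus
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard a matrix across both axes of the slice.
x = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P("data", "model")))
```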

The "good enough" thesis

v5e's value proposition isn't "best at any one thing"; it's "good enough at most things, much cheaper". For inference of a 70B model: 4 chips are enough. For LoRA fine-tuning of Gemma 27B: 64 chips are enough. For pre-training a 7B model from scratch: 256 chips are enough. The v5e pod is sized to those use cases.
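Rough weight-memory arithmetic behind the "4 chips for a 70B model" figure. The 4-bit quantisation assumption is mine; the slide does not state a precision.

```python
# Does a 70B model's weights fit in a 4-chip v5e slice? (int4 weights assumed.)
params = 70e9
weight_bytes = params * 0.5                 # assumed 4-bit quantised weights -> 35 GB
slice_hbm_bytes = 4 * 16 * 2**30            # 4 chips x 16 GiB -> ~68.7 GB

headroom = slice_hbm_bytes - weight_bytes   # ~34 GB left for KV cache and activations
print(f"weights {weight_bytes / 1e9:.0f} GB, slice HBM {slice_hbm_bytes / 1e9:.1f} GB, "
      f"headroom {headroom / 1e9:.0f} GB")
```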

07

Multislice — Crossing The Pod Boundary

Multislice is a Cloud TPU feature, announced at Google Cloud Next August 2023, that lets a single training job span multiple slices, potentially across multiple pods, with high-bandwidth ICI inside each slice and the Jupiter DCN between slices.

[Diagram: a Multislice job spanning v5p Pods A, B, and C (8,960 chips each). Slices 1–4 of 1,024 chips each sit inside the pods' internal 3D torus + OCS; ICI carries traffic inside each slice and the Jupiter DCN carries it between slices. One program, one dataflow graph, asynchronous compilation across slices.]

How it differs from data-parallel-only

Multislice is not simply "launch N independent data-parallel replicas and all-reduce gradients". The job is a single program with a single dataflow graph, compiled asynchronously across slices; in practice the bandwidth-hungry axes (tensor/model parallelism) stay on ICI inside a slice, while the bandwidth-light data-parallel axis runs over the Jupiter DCN between slices. The sketch below makes that hierarchy concrete.
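A hedged JAX sketch of that hierarchy, using jax.experimental.mesh_utils.create_hybrid_device_mesh. It assumes a 4-slice job with 256 chips per slice; the axis names and shapes are illustrative choices, not a Multislice requirement.

```python
# Multislice-style hybrid mesh: "data" axis spans slices over DCN,
# "model" axis spans chips within a slice over ICI. (4 slices x 256 chips assumed.)
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_hybrid_device_mesh(
    mesh_shape=(1, 256),     # per-slice (ICI) mesh
    dcn_mesh_shape=(4, 1),   # across-slice (DCN) mesh
)
mesh = Mesh(devices, axis_names=("data", "model"))

# Keep bandwidth-hungry sharding inside a slice, bandwidth-light sharding across slices.
weight_sharding = NamedSharding(mesh, P(None, "model"))   # tensor-parallel over ICI
batch_sharding  = NamedSharding(mesh, P("data", None))    # data-parallel over DCN
```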

08

Pathways on Cloud

The orchestration layer that makes Multislice usable. Pathways (Barham, Dean, Ghemawat et al., MLSys 2022) is Google's single-controller distributed-dataflow runtime for ML — one Python process drives compilation, scheduling, and execution across thousands of chips.

What Pathways does

  • Single Python client; one logical program.
  • Asynchronous compilation: each shard is compiled independently and held in a cache.
  • Dataflow graph that spans pods, with explicit barriers and collective ops.
  • Dynamic re-sharding: a job can change shape mid-run if a pod fails.
  • JAX is the production frontend; PyTorch via PyTorch/XLA is also supported (a minimal JAX sketch follows this list).
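A minimal sketch of the programming model this adds up to: one Python client, one jitted program, sharded over a global mesh. This is ordinary JAX rather than a Pathways-specific API, and the model and mesh sizes are illustrative.

```python
# Single-controller style: one client builds one program; the runtime fans it out.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(mesh_utils.create_device_mesh((jax.device_count(),)), axis_names=("data",))

@jax.jit
def step(w, batch):
    # One logical program; the compiler inserts per-chip execution and collectives.
    grads = jax.grad(lambda w, b: jnp.mean((b @ w) ** 2))(w, batch)
    return w - 1e-3 * grads

w = jax.device_put(jnp.zeros((1024, 1024)), NamedSharding(mesh, P()))            # replicated
batch = jax.device_put(jnp.ones((8192, 1024)), NamedSharding(mesh, P("data")))   # sharded on batch dim
w = step(w, batch)
```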

Where it differs from "MPI but for TPUs"

  • MPI: every rank runs the same program independently and synchronises.
  • Pathways: one program, the controller dispatches to thousands of workers asynchronously.
  • You can debug, profile, and trace from the single client.
  • It works well for heterogeneous pipelines — the kind that arise in MoE and reinforcement learning.

Pathways underpinned the PaLM training run on v4 in 2022; it scaled to multi-pod with v5p and Gemini in 2023–24; and is now the standard orchestration story for any TPU job above ~1,000 chips.

09

Workloads — Gemini 1.0 / 1.5

The flagship workload for v5p: Gemini. Google's Gemini 1.0 paper (Dec 2023) explicitly credits TPU v4 and v5e for training; Gemini 1.5 Pro (Feb 2024) is widely understood to be a v5p training job, with Pathways orchestration.

Model                          | Year     | Hardware          | Notes
Gemini 1.0 Ultra / Pro / Nano  | Dec 2023 | TPU v4 + v5e      | Trained across multiple pods via Pathways; v5e for the smaller variants (Nano) and inference, v4 for the pretraining heavy lifting.
Gemini 1.5 Pro (1M context)    | Feb 2024 | TPU v5p           | The first v5p showcase. The 1M-token context is enabled by v5p's 95 GiB per chip and high HBM bandwidth.
Gemini 1.5 Flash               | May 2024 | TPU v5p / v5e mix | Distillation from 1.5 Pro; serving on v5e for cost.
Bard → Gemini App inference    | 2023–24  | v5e fleet         | The user-facing chatbot's inference fleet migrated from v4i to v5e through 2023.
Why v5p for 1M context

Long-context inference is dominated by KV cache. A 1M-token context for a Gemini-class model is hundreds of GB of KV. v5p's 95 GiB per chip and 2.76 TB/s HBM bandwidth let a single instance hold and stream that KV; smaller-HBM chips force you to split it across more chips, paying ICI all-reduce on every step.
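To make "hundreds of GB of KV" concrete, here is the standard KV-cache sizing formula with hypothetical model dimensions (Gemini's are not public).

```python
# KV cache = 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes/element.
layers, kv_heads, head_dim = 80, 16, 128     # hypothetical Gemini-class dimensions
tokens = 1_000_000                           # 1M-token context
bytes_per_el = 2                             # bf16 cache

kv_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_el
print(f"KV cache: {kv_bytes / 1e9:.0f} GB")                      # ~655 GB for these assumptions
print(f"v5p HBM-equivalents: {kv_bytes / (95 * 2**30):.1f}")     # ~6.4 chips' worth of 95 GiB
```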

10

Pricing & Availability

v5e and v5p are sold through Google Cloud with both on-demand and reservation pricing. Public list prices (US-central, on-demand, May 2026) are roughly:

SKU            | Per chip-hour   | Per pod-hour (full pod)   | Notes
v5e (1 chip)   | ~$1.20          | ~$307 (256 chips)         | By far the cheapest TPU.
v5p (1 chip)   | ~$4.20          | ~$37,600 (8,960 chips)    | Roughly 3.5× v5e per chip.
Trillium v6e   | ~$2.70          | ~$691 (256 chips)         | 4.7× v5e perf at ~2.3× the price.
Ironwood v7    | (reserved-only) | (reserved-only)           | Capacity sold via long-term commitments at GA in late 2025.

Provisioning

v5e and v5p slices are requested as Cloud TPU queued resources (the v5e shapes on slide 06 are one example), either on-demand or against a reservation; Ironwood capacity is reservation-only, as the table notes.

Disclaimer: prices change. Treat the table as illustrative; check the current Cloud TPU pricing page for live numbers.

11

What v5 Got Right And Wrong

Got right

  • The product fork itself. Two SKUs serve customers much better than one.
  • v5p's 95 GiB HBM — the right capacity for Gemini-class context lengths.
  • Multislice and Pathways on Cloud — cross-pod jobs become routine, not heroic.
  • v5e's 2D-torus simplification — right tradeoff for the e-class price point.
  • SparseCore ×4 on v5p — reflects MoE workloads becoming first-class.

What v6 / v7 then changed

  • v5e's 197 TFLOPS bf16 was roughly a fifth of NVIDIA H100's (~989 TFLOPS dense). Trillium closes most of the gap (918 TFLOPS bf16) at the same form factor.
  • v5p's 459 TFLOPS bf16 was competitive but didn't have FP8. Ironwood adds FP8 at 4.6 PFLOPS — 10× on the headline number.
  • HBM3e wasn't ready for v5; Ironwood ships with 192 GiB HBM3e per chip in 2025.
  • The ICI bandwidth on v5e (400 GB/s) was already a bottleneck for some workloads; Trillium doubles it.

v5 is the chip that defines the modern TPU product strategy. Trillium and Ironwood are refinements of the v5e and v5p archetypes — they don't change the strategy, they execute on it.

12

Cheat Sheet

Read next

Deck 08 — Trillium & Ironwood covers the 2024–2025 chips that take this same strategy further. Deck 11 — Software Stack covers Pathways, GSPMD, JAX sharding in detail.