Google TPUs Series — Presentation 07

TPU v5e & v5p — The Two-Track Fork

2023 is the year Google admits the TPU is now two products. v5e is cheap and good enough; v5p is the chip Gemini trains on. The split that defines every TPU since.

[Title diagram: v4 / v4i forks into v5e (efficient, "e-class", Aug 2023) and v5p (performance, "p-class", Dec 2023); v5p brings the 8,960-chip pod, Multislice, Pathways on Cloud, and Gemini 1.0/1.5 training.]
00

Topics We'll Cover

01

Why Google Forked the Product Line

Until v4, Google shipped one TPU SKU per generation, plus an inference variant (v4i) that arrived a year later. By 2023 that pattern was breaking down for three reasons.

1. Cloud TPU customer mix

Cloud TPU customers were split. Some wanted the cheapest chip per dollar for fine-tuning and inference. Others wanted the absolute biggest pod for training a frontier model. One SKU couldn't serve both well.

2. The cost gap

v4 was expensive: 32 GiB HBM, two TensorCores, OCS-connected pod. For someone running a 7B-parameter inference fleet, ~70% of that silicon is dark. v5e cuts price by simplifying.

3. Gemini's appetite

By mid-2023 Google's frontier model team needed pods bigger than 4096 chips. v5p ships with an 8,960-chip pod — more than double v4 — specifically to enable Gemini-class runs.

The product strategy in one slogan

"e-class" = cheaper than v4i ever was. "p-class" = bigger than v4 ever was. Same architecture team, same software stack, two SKUs per generation from this point onward.

Mirroring NVIDIA

NVIDIA does this too: GeForce RTX (consumer / inference) and HGX H100 (datacenter / training) are the same architecture in dramatically different packages. The TPU split is at smaller scale — one team, two SKUs per gen, no consumer line — but the strategic logic is the same.

02

v5e — The Efficient Chip

Announced August 2023 at Google Cloud Next. Marketed as offering ~2.5× perf-per-dollar over v4 for inference, and ~2× for fine-tuning.

Per-chip

  • 1 TensorCore with 4 MXUs, vector unit, scalar unit.
  • 197 TFLOPS bf16; 393 TOPS INT8.
  • 16 GiB HBM at 819 GB/s.
  • 4 ICI ports, 400 GB/s bidirectional aggregate.
  • SparseCore on-die.
  • Process node not officially disclosed (widely reported as 5nm-class).

Pod

  • 256-chip pod, 16×16 2D torus.
  • Pod aggregate: ~50.4 PFLOPS bf16, 4 TiB aggregate HBM.
  • No OCS within a pod — static torus topology.
  • Multipod uses Jupiter DCN (not ICI) for cross-pod traffic.

What v5e is for

Cost-optimised inference, fine-tuning, and small training runs (slide 06 puts numbers on these). v5e is also the first TPU with explicit Multislice support — you can compose multiple v5e pods into a larger virtual training cluster over the Jupiter datacenter network. We'll come to that on slide 07.

03

v5p — The Frontier Training Chip

Announced December 2023, four months after v5e. v5p is the chip Google trains its own frontier models on, and at launch its pod was the largest Google had ever shipped.

Per-chip

  • 2 TensorCores, each with 4 MXUs.
  • 4 SparseCores per chip.
  • 459 TFLOPS bf16; 918 TOPS INT8.
  • 95 GiB HBM at 2.76 TB/s — roughly 3.4× v5e's bandwidth.
  • 6 ICI ports for 3D-torus connectivity.
  • 4,800 Gbps ICI per chip = 600 GB/s per direction, ~1.2 TB/s bidirectional.

Pod

  • 8,960-chip pod.
  • 3D torus + OCS — same family as v4 but bigger.
  • Pod aggregate: ~4.1 ExaFLOPS bf16; ~850 TiB aggregate HBM.
  • Max single-job slice = 96 cubes = 6,144 chips.

The 95 GiB HBM number

The single most-asked question about v5p is "why 95 GiB and not 96?" Two facts: (1) HBM stacks come in standard sizes (16, 24 GiB per stack); v5p has six stacks × 16 GiB = 96 GiB physical. (2) Some capacity is reserved for ECC / sparing / firmware. The user-visible figure that Google publishes is 95 GiB. Treat 95 and 96 as the same number with different accounting.

A 95-GiB-per-chip HBM budget

For comparison: v4 had 32 GiB; H100 SXM has 80 GiB; H200 has 141 GiB. v5p sits between H100 and H200 on capacity but gives you a lot more chips per pod — 8,960 vs the few thousand you'd realistically scale-up with NVLink+IB. The TPU pod's economic moat is per-pod aggregate HBM, not per-chip HBM.

04

Per-Chip Spec Comparison

Spec                  | v4 (2020)          | v5e (Aug 2023)     | v5p (Dec 2023)
TensorCores           | 2                  | 1                  | 2
MXUs per TensorCore   | 4                  | 4                  | 4
SparseCores per chip  | 1                  | 1                  | 4
Per-chip bf16         | 275 TFLOPS         | 197 TFLOPS         | 459 TFLOPS
Per-chip INT8         | n/a                | 393 TOPS           | 918 TOPS
HBM                   | 32 GiB @ 1.2 TB/s  | 16 GiB @ 819 GB/s  | 95 GiB @ 2.76 TB/s
ICI per chip (bidir)  | ~300 GB/s          | 400 GB/s           | 1.2 TB/s
Topology              | 3D torus + OCS     | 2D torus           | 3D torus + OCS
Pod size (chips)      | 4,096              | 256                | 8,960
Pod bf16              | ~1.13 EFLOPS       | ~50.4 PFLOPS       | ~4.11 EFLOPS
Pod HBM               | ~128 TiB           | ~4 TiB             | ~850 TiB
Cooling               | Liquid             | Liquid             | Liquid
Role                  | Training           | Cost-optimised     | Training flagship

Two observations. First, v5p is 1.7× v4 per chip on bf16 but the pod is 2.2× bigger, so pod-level v5p is ~3.6× v4. Second, v5e is ~70% of v4 per chip at a much lower price point — the perf-per-dollar improvement is real but it isn't a faster chip; it's a cheaper one.
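A quick way to sanity-check those two observations is to multiply the table's numbers out; everything below comes straight from the spec table.

```python
# Worked arithmetic behind the two observations (numbers from the spec table above).
v4_chip, v5e_chip, v5p_chip = 275e12, 197e12, 459e12   # bf16 FLOPS per chip
v4_pod, v5p_pod = 4096, 8960                            # chips per pod

per_chip_gain  = v5p_chip / v4_chip             # ~1.67x per chip
pod_size_gain  = v5p_pod / v4_pod               # ~2.19x more chips
pod_level_gain = per_chip_gain * pod_size_gain  # ~3.65x at pod level (~4.11 EF vs ~1.13 EF)
v5e_vs_v4      = v5e_chip / v4_chip             # ~0.72 -> about 70% of v4 per chip
```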

05

The 8,960-Chip v5p Pod

The largest scale-up domain Google had ever built at the time it shipped (Dec 2023). The topology is the v4 architecture taken to its logical extreme.

Pod organisation

  • Chips arranged as a 3D torus.
  • Cubes connected through Palomar-class OCS (8,960 chips means a much larger OCS infrastructure than v4's 48 switches).
  • Maximum schedulable slice for one job: 6,144 chips (96 cubes of 64).
  • Twisted-torus topology supported per-job.

Pod aggregate numbers

  • ~4.1 ExaFLOPS bf16 peak.
  • ~850 TiB aggregate HBM.
  • ~10 PB/s aggregate ICI bandwidth.
  • ~5 MW typical pod power.
  • Liquid-cooled, 100+ racks.

What you can fit in this

A dense 1T-parameter model in bf16 weighs 2 TB. With 850 TiB aggregate HBM, a v5p pod can hold ~400 such models, or one such model with all optimiser state plus arbitrarily large activation memory for backprop. v5p is the first TPU pod where pod-aggregate HBM stops being a binding constraint for any LLM Google trains; the constraint moves to compute time and orchestration.
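A back-of-envelope version of that claim, in code. The optimiser layout below (an Adam-style fp32 master copy plus two fp32 moments) is an assumption for illustration, not a statement about how Gemini is actually trained.

```python
# Memory budget for a dense 1T-parameter model on a v5p pod (illustrative).
params = 1e12                              # dense 1T parameters
weights_bytes = params * 2                 # bf16 weights -> 2 TB
adam_bytes = params * (4 + 4 + 4)          # assumed fp32 master + m + v -> 12 TB
pod_hbm_bytes = 8960 * 95 * 2**30          # pod-aggregate HBM

print(f"weights: {weights_bytes / 1e12:.1f} TB")                               # 2.0 TB
print(f"weights + optimiser state: {(weights_bytes + adam_bytes) / 1e12:.1f} TB")  # 14.0 TB
print(f"pod HBM / weights: {pod_hbm_bytes / weights_bytes:.0f}x")              # several hundred copies
```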

06

v5e's 256-Chip 2D-Torus Pod

v5e's pod is much smaller, much cheaper, and intentionally simpler. 256 chips, 16×16 2D torus, no OCS within a pod.

Why no OCS?

  • OCS infrastructure is a non-trivial cost — 48+ switches per pod.
  • For 256 chips, the OCS reconfigurability isn't worth it: a 16×16 torus is small enough that you don't need to slice it dynamically.
  • Faults are handled at the multi-pod level (route around a faulty pod), not within a pod.

Why 2D not 3D?

  • 4 ICI ports per chip is enough for a 16×16 torus.
  • Cuts ICI silicon and packaging cost ~33% vs 6-port 3D.
  • For inference batches and small training runs, 2D-torus all-reduce is plenty.
  • Multipod scaling uses Jupiter DCN, not extending the torus.

The slice configurations

A v5e pod can be sliced into smaller jobs at fixed shapes — 1x1, 2x2, 2x4, 4x4, 4x8, 8x8, 8x16, 16x16. Each job gets a sub-torus. Cloud TPU schedules these as queued resources — you ask for a shape, you wait until one is free.
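As a sketch of what a slice looks like from software, here is how a 4x8 v5e slice maps onto a JAX device mesh. It assumes the code runs inside a v5e-32 slice (32 chips), and the axis names are illustrative.

```python
# Minimal JAX sketch: a 4x8 v5e slice seen as a 2D device mesh (assumes 32 chips).
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_device_mesh((4, 8))        # matches the slice's 4x8 sub-torus
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard a matrix across both axes of the slice.
x = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P("data", "model")))
```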

The "good enough" thesis

v5e's value proposition isn't "best at any one thing"; it's "good enough at most things, much cheaper". For inference of a 70B model: 4 chips are enough. For LoRA fine-tuning of Gemma 27B: 64 chips are enough. For pre-training a 7B model from scratch: 256 chips are enough. The v5e pod is sized to those use cases.
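Rough weight-memory arithmetic behind the "4 chips for a 70B model" figure. The 4-bit quantisation assumption is mine; the slide does not state a precision.

```python
# Does a 70B model's weights fit in a 4-chip v5e slice? (int4 weights assumed.)
params = 70e9
weight_bytes = params * 0.5                 # assumed 4-bit quantised weights -> 35 GB
slice_hbm_bytes = 4 * 16 * 2**30            # 4 chips x 16 GiB -> ~68.7 GB

headroom = slice_hbm_bytes - weight_bytes   # ~34 GB left for KV cache and activations
print(f"weights {weight_bytes / 1e9:.0f} GB, slice HBM {slice_hbm_bytes / 1e9:.1f} GB, "
      f"headroom {headroom / 1e9:.0f} GB")
```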

07

Multislice — Crossing The Pod Boundary

Multislice is a Cloud TPU feature, announced at Google Cloud Next August 2023, that lets a single training job span multiple slices, potentially across multiple pods, with high-bandwidth ICI inside each slice and the Jupiter DCN between slices.

[Diagram: a Multislice job spanning v5p Pods A, B, and C (8,960 chips each). Slices 1–4 of 1,024 chips each sit inside the pods' internal 3D torus + OCS; ICI carries traffic inside each slice and the Jupiter DCN carries it between slices. One program, one dataflow graph, asynchronous compilation across slices.]

How it differs from data-parallel-only

Multislice is not simply "launch N independent data-parallel replicas and all-reduce gradients". The job is a single program with a single dataflow graph, compiled asynchronously across slices; in practice the bandwidth-hungry axes (tensor/model parallelism) stay on ICI inside a slice, while the bandwidth-light data-parallel axis runs over the Jupiter DCN between slices. The sketch below makes that hierarchy concrete.
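A hedged JAX sketch of that hierarchy, using jax.experimental.mesh_utils.create_hybrid_device_mesh. It assumes a 4-slice job with 256 chips per slice; the axis names and shapes are illustrative choices, not a Multislice requirement.

```python
# Multislice-style hybrid mesh: "data" axis spans slices over DCN,
# "model" axis spans chips within a slice over ICI. (4 slices x 256 chips assumed.)
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_hybrid_device_mesh(
    mesh_shape=(1, 256),     # per-slice (ICI) mesh
    dcn_mesh_shape=(4, 1),   # across-slice (DCN) mesh
)
mesh = Mesh(devices, axis_names=("data", "model"))

# Keep bandwidth-hungry sharding inside a slice, bandwidth-light sharding across slices.
weight_sharding = NamedSharding(mesh, P(None, "model"))   # tensor-parallel over ICI
batch_sharding  = NamedSharding(mesh, P("data", None))    # data-parallel over DCN
```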

08

Pathways on Cloud

The orchestration layer that makes Multislice usable. Pathways (Barham, Dean, Ghemawat et al., MLSys 2022) is Google's single-controller distributed-dataflow runtime for ML — one Python process drives compilation, scheduling, and execution across thousands of chips.

What Pathways does

  • Single Python client; one logical program.
  • Asynchronous compilation: each shard is compiled independently and held in a cache.
  • Dataflow graph that spans pods, with explicit barriers and collective ops.
  • Dynamic re-sharding: a job can change shape mid-run if a pod fails.
  • JAX is the production frontend; PyTorch via PyTorch/XLA is also supported (a minimal JAX sketch follows this list).
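A minimal sketch of the programming model this adds up to: one Python client, one jitted program, sharded over a global mesh. This is ordinary JAX rather than a Pathways-specific API, and the model and mesh sizes are illustrative.

```python
# Single-controller style: one client builds one program; the runtime fans it out.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(mesh_utils.create_device_mesh((jax.device_count(),)), axis_names=("data",))

@jax.jit
def step(w, batch):
    # One logical program; the compiler inserts per-chip execution and collectives.
    grads = jax.grad(lambda w, b: jnp.mean((b @ w) ** 2))(w, batch)
    return w - 1e-3 * grads

w = jax.device_put(jnp.zeros((1024, 1024)), NamedSharding(mesh, P()))            # replicated
batch = jax.device_put(jnp.ones((8192, 1024)), NamedSharding(mesh, P("data")))   # sharded on batch dim
w = step(w, batch)
```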

Where it differs from "MPI but for TPUs"

  • MPI: every rank runs the same program independently and synchronises.
  • Pathways: one program, the controller dispatches to thousands of workers asynchronously.
  • You can debug, profile, and trace from the single client.
  • It works well for heterogeneous pipelines — the kind that arise in MoE and reinforcement learning.

Pathways underpinned the PaLM training run on v4 in 2022; it scaled to multi-pod with v5p and Gemini in 2023–24; and is now the standard orchestration story for any TPU job above ~1,000 chips.

09

Workloads — Gemini 1.0 / 1.5

The flagship workload for v5p: Gemini. Google's Gemini 1.0 paper (Dec 2023) explicitly credits TPU v4 and v5e for training; Gemini 1.5 Pro (Feb 2024) is widely understood to be a v5p training job, with Pathways orchestration.

Model                          | Year     | Hardware          | Notes
Gemini 1.0 Ultra / Pro / Nano  | Dec 2023 | TPU v4 + v5e      | Trained across multiple pods via Pathways; v5e for the smaller variants (Nano) and inference, v4 for the pretraining heavy lifting.
Gemini 1.5 Pro (1M context)    | Feb 2024 | TPU v5p           | The first v5p showcase. The 1M-token context is enabled by v5p's 95 GiB per chip and high HBM bandwidth.
Gemini 1.5 Flash               | May 2024 | TPU v5p / v5e mix | Distillation from 1.5 Pro; serving on v5e for cost.
Bard → Gemini App inference    | 2023–24  | v5e fleet         | The user-facing chatbot's inference fleet migrated from v4i to v5e through 2023.
Why v5p for 1M context

Long-context inference is dominated by KV cache. A 1M-token context for a Gemini-class model is hundreds of GB of KV. v5p's 95 GiB per chip and 2.76 TB/s HBM bandwidth let a single instance hold and stream that KV; smaller-HBM chips force you to split it across more chips, paying ICI all-reduce on every step.
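To make "hundreds of GB of KV" concrete, here is the standard KV-cache sizing formula with hypothetical model dimensions (Gemini's are not public).

```python
# KV cache = 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes/element.
layers, kv_heads, head_dim = 80, 16, 128     # hypothetical Gemini-class dimensions
tokens = 1_000_000                           # 1M-token context
bytes_per_el = 2                             # bf16 cache

kv_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_el
print(f"KV cache: {kv_bytes / 1e9:.0f} GB")                      # ~655 GB for these assumptions
print(f"v5p HBM-equivalents: {kv_bytes / (95 * 2**30):.1f}")     # ~6.4 chips' worth of 95 GiB
```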

10

Pricing & Availability

v5e and v5p are sold through Google Cloud with both on-demand and reservation pricing. Public list prices (US-central, on-demand, May 2026) are roughly:

SKU            | Per chip-hour   | Per pod-hour (full pod)   | Notes
v5e (1 chip)   | ~$1.20          | ~$307 (256 chips)         | By far the cheapest TPU.
v5p (1 chip)   | ~$4.20          | ~$37,600 (8,960 chips)    | Roughly 3.5× v5e per chip.
Trillium v6e   | ~$2.70          | ~$691 (256 chips)         | 4.7× v5e perf at ~2.3× the price.
Ironwood v7    | (reserved-only) | (reserved-only)           | Capacity sold via long-term commitments at GA in late 2025.

Provisioning

v5e and v5p slices are requested as Cloud TPU queued resources (the v5e shapes on slide 06 are one example), either on-demand or against a reservation; Ironwood capacity is reservation-only, as the table notes.

Disclaimer: prices change. Treat the table as illustrative; check the current Cloud TPU pricing page for live numbers.

11

What v5 Got Right And Wrong

Got right

  • The product fork itself. Two SKUs serve customers much better than one.
  • v5p's 95 GiB HBM — the right capacity for Gemini-class context lengths.
  • Multislice and Pathways on Cloud — cross-pod jobs become routine, not heroic.
  • v5e's 2D-torus simplification — right tradeoff for the e-class price point.
  • SparseCore ×4 on v5p — reflects MoE workloads becoming first-class.

What v6 / v7 then changed

  • v5e's 197 TFLOPS bf16 was roughly a fifth of NVIDIA H100's (~989 TFLOPS dense). Trillium closes most of the gap (918 TFLOPS bf16) at the same form factor.
  • v5p's 459 TFLOPS bf16 was competitive but didn't have FP8. Ironwood adds FP8 at 4.6 PFLOPS — 10× on the headline number.
  • HBM3e wasn't ready for v5; Ironwood ships with 192 GiB HBM3e per chip in 2025.
  • The ICI bandwidth on v5e (400 GB/s) was already a bottleneck for some workloads; Trillium doubles it.

v5 is the chip that defines the modern TPU product strategy. Trillium and Ironwood are refinements of the v5e and v5p archetypes — they don't change the strategy, they execute on it.

12

Cheat Sheet

Read next

Deck 08 — Trillium & Ironwood covers the 2024–2025 chips that take this same strategy further. Deck 11 — Software Stack covers Pathways, GSPMD, JAX sharding in detail.