Google TPUs Series — Presentation 05

TPU v2 & v3 — The Training Era Begins

2017–2018: HBM arrives, bfloat16 is invented, the chip grows two TensorCores, the pod becomes a 2D torus, and Google has to rebuild its datacenter cooling.

v2 (2017) · v3 (2018) · bf16 · HBM · 2D torus · ICI · liquid cooling

v1 lessons + HBM + bfloat16 + ICI + vector unit → v2 (256-chip pod) → v3 (1024-chip pod, liquid-cooled)
00

Topics We'll Cover

01

Why v1 Couldn't Train

By 2016 the bottleneck at Google had moved. Inference was solved — v1 was running tens of thousands of inferences per second per chip. The new constraint was training time: weeks on big GPU clusters for each new model rev. The team's question shifted from "what does an inference ASIC look like?" to "what does a training supercomputer look like?".

What v1 lacked for training

  • No floating point. Gradients have a wide dynamic range; INT8 won't accumulate them stably (see the sketch after this list).
  • No vector unit. Optimisers (SGD, Adam, RMSProp) need elementwise ops, masks, scales. v1 ran these on the host.
  • No chip-to-chip link. Training a model bigger than one chip requires gradient all-reduce; via PCIe-and-host that's hopeless.
  • 34 GB/s of DDR3. Not enough to feed a chip and spill activations for backprop.
  • Inference precision. Even if the previous problems vanished, INT8 forward passes throw away too much information for gradient back-propagation.
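A toy sketch of the INT8 point, assuming a quantisation scale sized for values in [-1, 1] (the scale and the gradient magnitudes are illustrative, not measured from any model):

    import jax.numpy as jnp

    # One INT8 step when the representable range is [-1, 1].
    scale = 1.0 / 127

    for g in (1e-1, 1e-3, 1e-5):
        q = jnp.clip(jnp.round(g / scale), -128, 127).astype(jnp.int8)
        print(g, "->", float(q) * scale)   # 1e-3 and 1e-5 both come back as 0.0

Any update smaller than half a quantisation step rounds to zero, so small-but-real gradient contributions simply vanish.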

What v2 needs to add

  • A floating-point matmul format that holds gradients well.
  • A real per-chip vector unit.
  • HBM (much more bandwidth and capacity).
  • A custom inter-chip interconnect that bypasses the host.
  • A pod organisation so that thousands of chips can act as one machine.
  • An accumulator strategy that keeps long matmuls numerically stable.

Each of those features is a major silicon and system-design change. v2 is not a v1 refresh; it is a clean redesign with the same DNA.

02

The v2 Block Diagram

v2 is two-cores-on-a-chip. Each TensorCore has its own MXU, vector unit, scalar unit, and slice of HBM. They run independently and can synchronise via the on-chip network.

[Block diagram] TensorCore 0 and TensorCore 1, each with: MXU (128×128, bf16, 22.5 TFLOPS), vector unit (elementwise / reduce), scalar unit (control flow / addressing), and a VMEM scratchpad. Per chip: HBM stack 0 and HBM stack 1, 8 GiB @ 300 GB/s each. ICI: 4 bidirectional links to the north / south / east / west neighbours, forming the 2D torus.

One chip = two TensorCores; each TensorCore has compute, scratchpad, and its own slice of HBM. ICI sits at the bottom and connects to four neighbours.
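The two-cores-per-chip split is visible from software. A minimal check, assuming a Cloud TPU v2-8 or v3-8 host with JAX installed (the exact printed strings vary by JAX version):

    import jax

    # A v2-8 / v3-8 board has 4 chips x 2 TensorCores = 8 JAX devices.
    print(jax.device_count())        # 8
    for d in jax.devices():
        print(d.platform, d.id)      # 'tpu' 0..7, one device per TensorCore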

03

bfloat16 — Google Brain's Numeric

bfloat16 is the most quietly influential numeric format in modern computing. It was invented inside Google Brain specifically so that mixed-precision training would just work, and is now native in every modern AI chip.

FP32

  • 1 sign + 8 exp + 23 mantissa.
  • Range ~10^±38.
  • Standard since IEEE 754-1985.

FP16 (IEEE half)

  • 1 sign + 5 exp + 10 mantissa.
  • Range only ~10^±5.
  • Underflows / overflows on gradients without loss-scaling.

bfloat16 (Google)

  • 1 sign + 8 exp + 7 mantissa.
  • Range identical to FP32 (~10^±38).
  • Conversion to/from FP32 is just truncation.
  • Works without loss-scaling on most networks.

Why bf16's range matters more than its precision

Gradients have an enormous dynamic range — some weights see updates of 10^-7, some of 10^-1. FP16's 5-bit exponent can't span that. bf16 keeps FP32's exponent and just throws away the bottom 16 bits of the mantissa — giving up precision (where neural nets are robust) to keep range (where they aren't). The tradeoff turns out to be exactly the right one.
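A quick way to see the tradeoff with JAX's dtype conversions on the host (the values are illustrative; exact printed digits depend on rounding):

    import jax.numpy as jnp

    tiny = jnp.float32(1e-8)            # a small gradient component
    print(tiny.astype(jnp.float16))     # 0.0    -- underflows (fp16 min subnormal ~6e-8)
    print(tiny.astype(jnp.bfloat16))    # ~1e-08 -- range preserved, mantissa truncated

    big = jnp.float32(7e4)              # a large logit / gradient spike
    print(big.astype(jnp.float16))      # inf    -- fp16 tops out at 65504
    print(big.astype(jnp.bfloat16))     # 70144  -- comfortably inside bf16's range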

How the TPU MXU uses it

The MXU takes bf16 inputs (weights and activations) and accumulates every partial product in FP32, so long dot products stay numerically stable; results are written back in bf16 or FP32 as the program requires. This is the accumulator strategy flagged earlier as a training prerequisite.
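In JAX this policy shows up as a request for an FP32 accumulator on a bf16 matmul. A minimal sketch, not the compiler's actual lowering:

    import jax
    import jax.numpy as jnp

    a = jnp.ones((128, 128), dtype=jnp.bfloat16)
    b = jnp.ones((128, 128), dtype=jnp.bfloat16)

    # bf16 operands, FP32 accumulation: the MXU contract described above.
    c = jax.lax.dot_general(a, b, dimension_numbers=(((1,), (0,)), ((), ())),
                            preferred_element_type=jnp.float32)
    print(c.dtype)   # float32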

A standard born from one chip

bf16 went from a Google-internal format in 2017 to a de facto industry standard by 2020. NVIDIA, Intel (with the Cooper Lake AVX-512 BF16 instructions), Arm (with the BFloat16 extension in Armv8.6-A), AMD — all support it natively. The TPU forced a numeric format on the rest of the industry.

04

Two TensorCores Per Chip

v2 is the first TPU with multiple cores on one die. NVIDIA's similarly named "Tensor Cores" are unrelated — in TPU language a TensorCore is a complete sub-chip with its own compute, scratchpad, and HBM partition.

What's in a TensorCore

  • One MXU — 128×128 bf16 weight-stationary systolic.
  • One vector unit — SIMD-style elementwise on bf16 / FP32, with reductions.
  • One scalar unit — integer arithmetic, addresses, loop counters.
  • VMEM — 32 MiB (v2) software-managed scratchpad for activations.
  • ~22.5 TFLOPS bf16 per TensorCore.

Why two, not four (yet)

  • Die-area budget at 16 nm. Two cores fit comfortably on a ~625 mm² die.
  • Two HBM stacks per chip — one per core — matches package routing.
  • Easy parallelism story for the compiler: model-parallel inside chip, data-parallel across chips.
  • v3 doubles the MXU count per core (4 MXUs/chip) instead of doubling cores — same compute uplift, less floorplan churn.

The vector unit

v2's vector unit is the first part of the chip that's not for matmul. Softmax, layer-norm, optimiser updates (Adam moment estimates, learning-rate schedules), gradient clipping, masking — all of these now run on-chip. The host CPU shrinks back to a control-plane role.
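As an illustration of the kind of work that moved on-chip, here is a stripped-down Adam step (no bias correction; the hyperparameters are just placeholders). It is entirely elementwise, so it lands on the vector unit rather than the MXU:

    import jax
    import jax.numpy as jnp

    @jax.jit
    def adam_step(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # Moment estimates, scaling, and the update are all elementwise ops.
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        return p - lr * m / (jnp.sqrt(v) + eps), m, v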

The scalar unit

Every TensorCore has one. It manages instruction-stream sequencing, address computation, masks, and control flow that doesn't make sense to vectorise. It is also the core's "VLIW issue logic" — the static schedule emitted by XLA is consumed by the scalar unit and dispatched to the MXU and vector unit each cycle.

05

HBM Arrives

The single biggest change vs v1. v2 ships with two HBM stacks per chip — one per TensorCore.

                       | v1 (DDR3)            | v2 (HBM)                  | v3 (HBM)
Capacity               | 8 GiB                | 16 GiB                    | 32 GiB
Bandwidth              | ~34 GB/s             | ~600 GB/s                 | ~900 GB/s
Stacks                 | 2 channels DDR3-2133 | 2 HBM stacks (1 per core) | 2 HBM stacks
Bandwidth uplift vs v1 | 1.0×                 | ~17.6×                    | ~26.5×

What HBM bandwidth unlocks

Activations can be spilled to HBM and re-read during the backward pass, weights stream into the MXU fast enough to keep it busy, and the memory-bound elementwise work (softmax, layer-norm, optimiser updates) no longer stalls behind a 34 GB/s interface. A rough roofline calculation is sketched below.
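A back-of-envelope roofline using the per-chip numbers quoted in this deck (the matmul shape is arbitrary, picked only to illustrate):

    peak_flops = 45e12     # bf16 FLOP/s per v2 chip
    hbm_bw     = 600e9     # bytes/s, v2 HBM
    ddr3_bw    = 34e9      # bytes/s, v1 DDR3

    # A bf16 matmul of (B, K) x (K, N) does 2*B*K*N FLOPs and moves
    # roughly 2 bytes per element for each operand and the result.
    B, K, N = 256, 4096, 4096
    flops   = 2 * B * K * N
    bytes_  = 2 * (B * K + K * N + B * N)

    print(flops / bytes_)        # ~228 FLOP/byte of arithmetic intensity
    print(peak_flops / hbm_bw)   # 75.0  -> 228 > 75: compute-bound behind HBM
    print(peak_flops / ddr3_bw)  # ~1324 -> 228 < 1324: memory-bound behind DDR3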

Cost reality

HBM is, per gigabyte, the most expensive memory in volume production. In 2017 it cost roughly 6–10× as much per GB as GDDR5. Putting it on every TPU permanently changed the chip's cost structure. Every TPU since has reflected the trade: more die area for HBM I/O than for any other interface, and more HBM stacks per chip with each generation.

06

ICI — The Inter-Chip Interconnect

The other transformative addition. ICI is Google's custom chip-to-chip link, sitting outside the PCIe path. It is the thing that turns a chip into a pod.

ICI v2 specs

Four bidirectional links per chip, one per torus neighbour, wired directly chip to chip: no PCIe hop, no host involvement, and no external switch in the path.

Why a custom link?

InfiniBand cost

Per-port silicon (HCA, switch ASIC) is a real cost adder. PCIe-attached HCAs add latency and contend with host traffic.

Ethernet latency

RoCE has improved but switch-traversal latency is still 1–2 μs. Inside a torus you want sub-100 ns per hop.

NVLink unavailable

NVLink is NVIDIA proprietary and only between NVIDIA chips. Even if Google had wanted it, the scale-up to 256 / 1024 chips wasn't there in 2017.

ICI is the defining component of a TPU pod. NVIDIA later builds NVLink-Switch / NVL72 to compete; AMD's Infinity Fabric is its parallel. As of 2026 ICI is in its 7th generation.

07

The 2D Torus Pod

v2 ships with a fixed pod shape: 256 chips arranged in a 16×16 2D torus, 4 chips per board, 64 boards per pod.

[Figure] v2 — 16×16 2D torus (6×6 subset shown); dashed links represent the torus wrap-around.

Why a 2D torus?

Ring all-reduce maps directly onto it: every chip talks only to its four immediate neighbours, all links are short and identical, and the wrap-around edges make every row and column a closed ring, so gradient reductions never need a central switch. Bisection bandwidth grows with the size of the machine rather than with the port count of a switch.
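The wrap-around links are just modular arithmetic on the chip coordinates. A toy sketch:

    # Neighbours of chip (x, y) in a k x k 2D torus; the modulo is the wrap-around link.
    def torus_neighbours(x, y, k=16):
        return [((x + 1) % k, y), ((x - 1) % k, y),
                (x, (y + 1) % k), (x, (y - 1) % k)]

    print(torus_neighbours(15, 0))   # [(0, 0), (14, 0), (15, 1), (15, 15)]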

The pod aggregate at v2 is ~11.5 PFLOPS bf16, with 4 TiB of HBM. At v3 the pod is 1024 chips (32×32 torus) and aggregates to over 100 PFLOPS bf16.
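The gradient all-reduce the torus is built for is the same collective JAX exposes through pmap. A minimal data-parallel sketch (the loss and learning rate are placeholders; on a v2/v3 host each replica is one TensorCore):

    import functools
    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)

    # One replica per device; pmean is the gradient all-reduce that the
    # 2D torus turns into neighbour-to-neighbour ring traffic over ICI.
    @functools.partial(jax.pmap, axis_name="batch")
    def train_step(w, x, y):
        g = jax.grad(loss)(w, x, y)
        g = jax.lax.pmean(g, axis_name="batch")
        return w - 1e-3 * g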

08

From v2 to v3 — The "Tick" Generation

v3 is the same 16 nm node as v2, but almost everything inside is scaled up: two MXUs per TensorCore (instead of one), 32 GiB of HBM (instead of 16), a faster clock, and a 1024-chip pod (instead of 256).

                     | v2                 | v3
Process              | 16 nm              | 16 nm
TensorCores per chip | 2                  | 2
MXUs per TensorCore  | 1                  | 2
MXU dimensions       | 128×128            | 128×128
Per-chip bf16        | 45 TFLOPS          | 123 TFLOPS
HBM capacity         | 16 GiB             | 32 GiB
HBM bandwidth        | ~600 GB/s          | ~900 GB/s
Pod size             | 256 chips (16×16)  | 1024 chips (32×32)
Cooling              | Air                | Liquid
Pod aggregate bf16   | ~11.5 PFLOPS       | ~126 PFLOPS

Why doubling MXUs per core, instead of clock or PE count?

Doubling the systolic array from 128×128 to 256×256 would quadruple the PE count and require redoing the entire weight-FIFO and activation-stream paths. Going from 1 to 2 MXUs per core is a copy-and-paste operation in the floorplan. It also lets the compiler issue two matmuls in parallel within a TensorCore — useful for attention's QKᵀ and softmax(QKᵀ)·V matmuls, which can run side by side.

v3's per-chip bf16 (123 TFLOPS) is ~2.7× v2's (45 TFLOPS) — doubled MXUs plus a higher clock plus better issue rates. At pod level the jump is even larger because the pod itself grew 4× in chip count.
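A sanity check on those per-chip numbers, assuming the commonly cited clock speeds (~700 MHz for v2, ~940 MHz for v3; the clocks are not stated in this deck):

    def chip_bf16_flops(mxus_per_core, clock_hz, cores=2, dim=128):
        # Each MXU retires dim*dim multiply-accumulates (2 FLOPs) per cycle.
        return cores * mxus_per_core * (dim * dim * 2) * clock_hz

    print(chip_bf16_flops(1, 700e6) / 1e12)   # ~45.9  TFLOPS (v2)
    print(chip_bf16_flops(2, 940e6) / 1e12)   # ~123.2 TFLOPS (v3)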

09

Liquid Cooling — The First Time

v3 was the first liquid-cooled accelerator deployed at hyperscale. Per-chip TDP estimates put v3 at 200–250 W; four chips per board and 32 boards per rack push rack-level power dissipation past anything air cooling would tolerate.
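The rack-level arithmetic under those assumptions (board and rack counts as above; the TDP figures are estimates, not official):

    chips_per_board, boards_per_rack = 4, 32
    for tdp_w in (200, 250):
        rack_kw = chips_per_board * boards_per_rack * tdp_w / 1000
        print(tdp_w, "W/chip ->", rack_kw, "kW of TPU power per rack")
    # 25-32 kW of accelerator power alone, before hosts, networking and conversion
    # losses -- well past what a typical air-cooled rack of the era was built for.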

The cooling stack

Coolant is piped to a cold plate that sits directly on each chip package (v3 board photos show the loop of tubing running across all four chips), and the heated liquid is carried out of the rack to heat exchangers.

This was a major datacenter operational change. Google's existing fleet was air-cooled; v3 required a parallel liquid-coolant infrastructure that did not exist before. By 2018 several Google datacenters were dedicated TPU sites built specifically around the new cooling.

Why this matters in 2025

Every modern AI accelerator is now liquid-cooled at the rack level — NVIDIA NVL72, AMD MI300X clusters, AWS Trainium2 racks. Google was first by 5+ years. The expertise transfers up the stack: when 9,216-chip Ironwood pods land in 2025, each dissipating roughly 5.5 MW, the cooling story is a refinement, not a redesign.

10

Workloads — BERT, T5, GNMT

The 2017–2020 wave of large NLP models is, almost without exception, a TPU v2/v3 story.

Model                                    | Year     | Hardware                  | Notes
GNMT (Google Neural Machine Translation) | 2016–17  | v1 inference, v2 training | The migration from RNN-LSTM seq2seq to Transformer happens over the v2/v3 era.
BERT-Base / Large (Devlin et al.)        | Oct 2018 | v3                        | Pre-trained on Cloud TPU pods; about 4 days per pre-training run for BERT-Large.
T5 (Raffel et al.)                       | Oct 2019 | v3                        | 11B parameters at the largest. Pre-trained on 1024-chip v3 pods.
Meena / LaMDA                            | 2020–21  | v3 / early v4             | Conversational models leading toward Bard/Gemini.
MUM                                      | 2021     | v4                        | Multimodal; trained as v4 came online.

External usage

v2 and v3 were the first TPUs sold as a product. Cloud TPU became publicly available in February 2018 (v2) and March 2019 (v3). The TensorFlow Research Cloud programme gave free TPU access to academic researchers; whole sub-fields of language and vision research happened on these chips. If you read a 2018–2020 paper that says "trained on Cloud TPU", it was almost certainly a v2 or v3 device or pod.

11

What v2/v3 Got Right And Wrong

Got right

  • bfloat16 as the production training numeric.
  • Two-TensorCore-per-chip layout.
  • HBM at the right scale.
  • ICI as a custom chip-to-chip link.
  • The 2D torus pod — a fixed, predictable, all-reduce-friendly topology.
  • Cloud TPU as a rentable product.

What v4 had to fix

  • 2D torus diameter. A 32×32 torus has diameter 32; for ~5,000-chip jobs this hurts. v4 moves to 3D + OCS.
  • Embedding lookups are slow on the vector unit. v4 adds SparseCore.
  • Inflexible slicing. v2/v3 slices are fixed, physically contiguous blocks of the torus; v4 lets you reserve sub-pod slices via OCS, composed from whichever blocks are free.
  • Hard pod failure mode. v3 has no facility for routing around a faulty board. v4's OCS lets the system bypass any single block.
  • 16 nm. 7 nm is overdue by 2020.

The v4 paper (ISCA 2023) therefore reads largely as a list of the places where v3 had started to hurt for the team running PaLM-class workloads on it. Each item gets its own architectural fix.

12

Cheat Sheet

  • v2 (2017): 2 TensorCores per chip, 1 MXU each, 45 TFLOPS bf16, 16 GiB HBM @ ~600 GB/s, 256-chip 16×16 torus pod, air-cooled.
  • v3 (2018): 2 TensorCores per chip, 2 MXUs each, 123 TFLOPS bf16, 32 GiB HBM @ ~900 GB/s, 1024-chip 32×32 torus pod, liquid-cooled.
  • bfloat16: 1 sign + 8 exponent + 7 mantissa; FP32's range with half the bits, no loss-scaling needed for most networks.
  • ICI: 4 bidirectional links per chip, host-free chip-to-chip, 2D torus.
  • Fixed topology, no SparseCore, 16 nm: the list v4 has to fix.

Read next

Deck 06 — v4, OCS & SparseCore covers the chip that turns the TPU pod into a true machine-learning supercomputer.