Google TPUs Series — Presentation 05

TPU v2 & v3 — The Training Era Begins

2017–2018: HBM arrives, bfloat16 is invented, the chip grows two TensorCores, the pod becomes a 2D torus, and Google has to rebuild its datacenter cooling.

v2 (2017) · v3 (2018) · bf16 · HBM · 2D torus · ICI · liquid cooling

v1 lessons + HBM + bfloat16 + ICI + vector unit → v2 (256-chip pod) → v3 (1024-chip pod, liquid-cooled)
00

Topics We'll Cover

01

Why v1 Couldn't Train

By 2016 the bottleneck at Google had moved. Inference was solved — v1 was running tens of thousands of inferences per second per chip. The new constraint was training time: weeks on big GPU clusters for each new model rev. The team's question shifted from "what does an inference ASIC look like?" to "what does a training supercomputer look like?".

What v1 lacked for training

  • No floating point. Gradients have a wide dynamic range; INT8 won't accumulate them stably (see the sketch after this list).
  • No vector unit. Optimisers (SGD, Adam, RMSProp) need elementwise ops, masks, scales. v1 ran these on the host.
  • No chip-to-chip link. Training a model bigger than one chip requires gradient all-reduce; via PCIe-and-host that's hopeless.
  • 34 GB/s of DDR3. Not enough to feed a chip and spill activations for backprop.
  • Inference precision. Even if the previous problems vanished, INT8 forward passes throw away too much information for gradient back-propagation.
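A toy sketch of the INT8 point, assuming a quantisation scale sized for values in [-1, 1] (the scale and the gradient magnitudes are illustrative, not measured from any model):

    import jax.numpy as jnp

    # One INT8 step when the representable range is [-1, 1].
    scale = 1.0 / 127

    for g in (1e-1, 1e-3, 1e-5):
        q = jnp.clip(jnp.round(g / scale), -128, 127).astype(jnp.int8)
        print(g, "->", float(q) * scale)   # 1e-3 and 1e-5 both come back as 0.0

Any update smaller than half a quantisation step rounds to zero, so small-but-real gradient contributions simply vanish.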

What v2 needs to add

  • A floating-point matmul format that holds gradients well.
  • A real per-chip vector unit.
  • HBM (much more bandwidth and capacity).
  • A custom inter-chip interconnect that bypasses the host.
  • A pod organisation so that thousands of chips can act as one machine.
  • An accumulator strategy that keeps long matmuls numerically stable.

Each of those features is a major silicon and system-design change. v2 is not a v1 refresh; it is a clean redesign with the same DNA.

02

The v2 Block Diagram

v2 is two-cores-on-a-chip. Each TensorCore has its own MXU, vector unit, scalar unit, and slice of HBM. They run independently and can synchronise via the on-chip network.

[Block diagram] TensorCore 0 and TensorCore 1, each with: MXU (128×128, bf16, 22.5 TFLOPS), vector unit (elementwise / reduce), scalar unit (control flow / addressing), and a VMEM scratchpad. Per chip: HBM stack 0 and HBM stack 1, 8 GiB @ 300 GB/s each. ICI: 4 bidirectional links to the north / south / east / west neighbours, forming the 2D torus.

One chip = two TensorCores; each TensorCore has compute, scratchpad, and its own slice of HBM. ICI sits at the bottom and connects to four neighbours.
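The two-cores-per-chip split is visible from software. A minimal check, assuming a Cloud TPU v2-8 or v3-8 host with JAX installed (the exact printed strings vary by JAX version):

    import jax

    # A v2-8 / v3-8 board has 4 chips x 2 TensorCores = 8 JAX devices.
    print(jax.device_count())        # 8
    for d in jax.devices():
        print(d.platform, d.id)      # 'tpu' 0..7, one device per TensorCore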

03

bfloat16 — Google Brain's Numeric

bfloat16 is the most quietly influential numeric format in modern computing. It was invented inside Google Brain specifically so that mixed-precision training would just work, and is now native in every modern AI chip.

FP32

  • 1 sign + 8 exp + 23 mantissa.
  • Range ~10^±38.
  • Standard since IEEE 754-1985.

FP16 (IEEE half)

  • 1 sign + 5 exp + 10 mantissa.
  • Range only ~10^±5.
  • Underflows / overflows on gradients without loss-scaling.

bfloat16 (Google)

  • 1 sign + 8 exp + 7 mantissa.
  • Range identical to FP32 (~10^±38).
  • Conversion to/from FP32 is just truncation.
  • Works without loss-scaling on most networks.

Why bf16's range matters more than its precision

Gradients have an enormous dynamic range — some weights see updates of 10^-7, some of 10^-1. FP16's 5-bit exponent can't span that. bf16 keeps FP32's exponent and just throws away the bottom 16 bits of the mantissa — giving up precision (where neural nets are robust) to keep range (where they aren't). The tradeoff turns out to be exactly the right one.
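A quick way to see the tradeoff with JAX's dtype conversions on the host (the values are illustrative; exact printed digits depend on rounding):

    import jax.numpy as jnp

    tiny = jnp.float32(1e-8)            # a small gradient component
    print(tiny.astype(jnp.float16))     # 0.0    -- underflows (fp16 min subnormal ~6e-8)
    print(tiny.astype(jnp.bfloat16))    # ~1e-08 -- range preserved, mantissa truncated

    big = jnp.float32(7e4)              # a large logit / gradient spike
    print(big.astype(jnp.float16))      # inf    -- fp16 tops out at 65504
    print(big.astype(jnp.bfloat16))     # 70144  -- comfortably inside bf16's range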

How the TPU MXU uses it

The MXU takes bf16 inputs (weights and activations) and accumulates every partial product in FP32, so long dot products stay numerically stable; results are written back in bf16 or FP32 as the program requires. This is the accumulator strategy flagged earlier as a training prerequisite.
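In JAX this policy shows up as a request for an FP32 accumulator on a bf16 matmul. A minimal sketch, not the compiler's actual lowering:

    import jax
    import jax.numpy as jnp

    a = jnp.ones((128, 128), dtype=jnp.bfloat16)
    b = jnp.ones((128, 128), dtype=jnp.bfloat16)

    # bf16 operands, FP32 accumulation: the MXU contract described above.
    c = jax.lax.dot_general(a, b, dimension_numbers=(((1,), (0,)), ((), ())),
                            preferred_element_type=jnp.float32)
    print(c.dtype)   # float32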

A standard born from one chip

bf16 went from a Google-internal format in 2017 to a de facto industry standard by 2020. NVIDIA, Intel (with the Cooper Lake AVX-512 BF16 instructions), Arm (with the BFloat16 extension in Armv8.6-A), AMD — all support it natively. The TPU forced a numeric format on the rest of the industry.

04

Two TensorCores Per Chip

v2 is the first TPU with multiple cores on one die. NVIDIA's similarly named "Tensor Cores" are unrelated — in TPU language a TensorCore is a complete sub-chip with its own compute, scratchpad, and HBM partition.

What's in a TensorCore

  • One MXU — 128×128 bf16 weight-stationary systolic.
  • One vector unit — SIMD-style elementwise on bf16 / FP32, with reductions.
  • One scalar unit — integer arithmetic, addresses, loop counters.
  • VMEM — 32 MiB (v2) software-managed scratchpad for activations.
  • ~22.5 TFLOPS bf16 per TensorCore.

Why two, not four (yet)

  • Die-area budget at 16 nm. Two cores fit comfortably on a ~625 mm² die.
  • Two HBM stacks per chip — one per core — matches package routing.
  • Easy parallelism story for the compiler: model-parallel inside chip, data-parallel across chips.
  • v3 doubles the MXU count per core (4 MXUs/chip) instead of doubling cores — same compute uplift, less floorplan churn.

The vector unit

v2's vector unit is the first part of the chip that's not for matmul. Softmax, layer-norm, optimiser updates (Adam moment estimates, learning-rate schedules), gradient clipping, masking — all of these now run on-chip. The host CPU shrinks back to a control-plane role.
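As an illustration of the kind of work that moved on-chip, here is a stripped-down Adam step (no bias correction; the hyperparameters are just placeholders). It is entirely elementwise, so it lands on the vector unit rather than the MXU:

    import jax
    import jax.numpy as jnp

    @jax.jit
    def adam_step(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # Moment estimates, scaling, and the update are all elementwise ops.
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        return p - lr * m / (jnp.sqrt(v) + eps), m, v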

The scalar unit

Every TensorCore has one. It manages instruction-stream sequencing, address computation, masks, and control flow that doesn't make sense to vectorise. It is also the core's "VLIW issue logic" — the static schedule emitted by XLA is consumed by the scalar unit and dispatched to the MXU and vector unit each cycle.

05

HBM Arrives

The single biggest change vs v1. v2 ships with two HBM stacks per chip — one per TensorCore.

                       | v1 (DDR3)            | v2 (HBM)                  | v3 (HBM)
Capacity               | 8 GiB                | 16 GiB                    | 32 GiB
Bandwidth              | ~34 GB/s             | ~600 GB/s                 | ~900 GB/s
Stacks                 | 2 channels DDR3-2133 | 2 HBM stacks (1 per core) | 2 HBM stacks
Bandwidth uplift vs v1 | 1.0×                 | ~17.6×                    | ~26.5×

What HBM bandwidth unlocks

Activations can be spilled to HBM and re-read during the backward pass, weights stream into the MXU fast enough to keep it busy, and the memory-bound elementwise work (softmax, layer-norm, optimiser updates) no longer stalls behind a 34 GB/s interface. A rough roofline calculation is sketched below.
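A back-of-envelope roofline using the per-chip numbers quoted in this deck (the matmul shape is arbitrary, picked only to illustrate):

    peak_flops = 45e12     # bf16 FLOP/s per v2 chip
    hbm_bw     = 600e9     # bytes/s, v2 HBM
    ddr3_bw    = 34e9      # bytes/s, v1 DDR3

    # A bf16 matmul of (B, K) x (K, N) does 2*B*K*N FLOPs and moves
    # roughly 2 bytes per element for each operand and the result.
    B, K, N = 256, 4096, 4096
    flops   = 2 * B * K * N
    bytes_  = 2 * (B * K + K * N + B * N)

    print(flops / bytes_)        # ~228 FLOP/byte of arithmetic intensity
    print(peak_flops / hbm_bw)   # 75.0  -> 228 > 75: compute-bound behind HBM
    print(peak_flops / ddr3_bw)  # ~1324 -> 228 < 1324: memory-bound behind DDR3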

Cost reality

HBM is, per gigabyte, the most expensive memory in volume production. In 2017 it cost roughly 6–10× as much per GB as GDDR5. Putting it on every TPU permanently changed the chip's cost structure. Every TPU since has reflected the trade: more die area for HBM I/O than for any other interface, and more HBM stacks per chip with each generation.

06

ICI — The Inter-Chip Interconnect

The other transformative addition. ICI is Google's custom chip-to-chip link, sitting outside the PCIe path. It is the thing that turns a chip into a pod.

ICI v2 specs

Four bidirectional links per chip, one per torus neighbour, wired directly chip to chip: no PCIe hop, no host involvement, and no external switch in the path.

Why a custom link?

InfiniBand cost

Per-port silicon (HCA, switch ASIC) is a real cost adder. PCIe-attached HCAs add latency and contend with host traffic.

Ethernet latency

RoCE has improved but switch-traversal latency is still 1–2 μs. Inside a torus you want sub-100 ns per hop.

NVLink unavailable

NVLink is NVIDIA proprietary and only between NVIDIA chips. Even if Google had wanted it, the scale-up to 256 / 1024 chips wasn't there in 2017.

ICI is the defining component of a TPU pod. NVIDIA later builds NVLink-Switch / NVL72 to compete; AMD's Infinity Fabric is its parallel. As of 2026 ICI is in its 7th generation.

07

The 2D Torus Pod

v2 ships with a fixed pod shape: 256 chips arranged in a 16×16 2D torus, 4 chips per board, 64 boards per pod.

[Figure] v2 — 16×16 2D torus (6×6 subset shown); dashed links represent the torus wrap-around.

Why a 2D torus?

Ring all-reduce maps directly onto it: every chip talks only to its four immediate neighbours, all links are short and identical, and the wrap-around edges make every row and column a closed ring, so gradient reductions never need a central switch. Bisection bandwidth grows with the size of the machine rather than with the port count of a switch.
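The wrap-around links are just modular arithmetic on the chip coordinates. A toy sketch:

    # Neighbours of chip (x, y) in a k x k 2D torus; the modulo is the wrap-around link.
    def torus_neighbours(x, y, k=16):
        return [((x + 1) % k, y), ((x - 1) % k, y),
                (x, (y + 1) % k), (x, (y - 1) % k)]

    print(torus_neighbours(15, 0))   # [(0, 0), (14, 0), (15, 1), (15, 15)]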

The pod aggregate at v2 is ~11.5 PFLOPS bf16, with 4 TiB of HBM. At v3 the pod is 1024 chips (32×32 torus) and aggregates to over 100 PFLOPS bf16.
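The gradient all-reduce the torus is built for is the same collective JAX exposes through pmap. A minimal data-parallel sketch (the loss and learning rate are placeholders; on a v2/v3 host each replica is one TensorCore):

    import functools
    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)

    # One replica per device; pmean is the gradient all-reduce that the
    # 2D torus turns into neighbour-to-neighbour ring traffic over ICI.
    @functools.partial(jax.pmap, axis_name="batch")
    def train_step(w, x, y):
        g = jax.grad(loss)(w, x, y)
        g = jax.lax.pmean(g, axis_name="batch")
        return w - 1e-3 * g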

08

From v2 to v3 — The "Tick" Generation

v3 is the same 16 nm node as v2, but almost everything inside is scaled up: two MXUs per TensorCore (instead of one), 32 GiB of HBM (instead of 16), a faster clock, and a 1024-chip pod (instead of 256).

                     | v2                 | v3
Process              | 16 nm              | 16 nm
TensorCores per chip | 2                  | 2
MXUs per TensorCore  | 1                  | 2
MXU dimensions       | 128×128            | 128×128
Per-chip bf16        | 45 TFLOPS          | 123 TFLOPS
HBM capacity         | 16 GiB             | 32 GiB
HBM bandwidth        | ~600 GB/s          | ~900 GB/s
Pod size             | 256 chips (16×16)  | 1024 chips (32×32)
Cooling              | Air                | Liquid
Pod aggregate bf16   | ~11.5 PFLOPS       | ~126 PFLOPS

Why doubling MXUs per core, instead of clock or PE count?

Doubling the systolic array from 128×128 to 256×256 would quadruple the PE count and require redoing the entire weight-FIFO and activation-stream paths. Going from 1 to 2 MXUs per core is a copy-and-paste operation in the floorplan. It also lets the compiler issue two matmuls in parallel within a TensorCore — useful for attention's QKᵀ and softmax(QKᵀ)·V matmuls, which can run side by side.

v3's per-chip bf16 (123 TFLOPS) is ~2.7× v2's (45 TFLOPS) — doubled MXUs plus a higher clock plus better issue rates. At pod level the jump is even larger because the pod itself grew 4× in chip count.
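A sanity check on those per-chip numbers, assuming the commonly cited clock speeds (~700 MHz for v2, ~940 MHz for v3; the clocks are not stated in this deck):

    def chip_bf16_flops(mxus_per_core, clock_hz, cores=2, dim=128):
        # Each MXU retires dim*dim multiply-accumulates (2 FLOPs) per cycle.
        return cores * mxus_per_core * (dim * dim * 2) * clock_hz

    print(chip_bf16_flops(1, 700e6) / 1e12)   # ~45.9  TFLOPS (v2)
    print(chip_bf16_flops(2, 940e6) / 1e12)   # ~123.2 TFLOPS (v3)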

09

Liquid Cooling — The First Time

v3 was the first liquid-cooled accelerator deployed at hyperscale. Per-chip TDP estimates put v3 at 200–250 W; four chips per board and 32 boards per rack push rack-level power dissipation past anything air cooling would tolerate.
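The rack-level arithmetic under those assumptions (board and rack counts as above; the TDP figures are estimates, not official):

    chips_per_board, boards_per_rack = 4, 32
    for tdp_w in (200, 250):
        rack_kw = chips_per_board * boards_per_rack * tdp_w / 1000
        print(tdp_w, "W/chip ->", rack_kw, "kW of TPU power per rack")
    # 25-32 kW of accelerator power alone, before hosts, networking and conversion
    # losses -- well past what a typical air-cooled rack of the era was built for.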

The cooling stack

Coolant is piped to a cold plate that sits directly on each chip package (v3 board photos show the loop of tubing running across all four chips), and the heated liquid is carried out of the rack to heat exchangers.

This was a major datacenter operational change. Google's existing fleet was air-cooled; v3 required a parallel liquid-coolant infrastructure that did not exist before. By 2018 several Google datacenters were dedicated TPU sites built specifically around the new cooling.

Why this matters in 2025

Every modern AI accelerator is now liquid-cooled at the rack level — NVIDIA NVL72, AMD MI300X clusters, AWS Trainium2 racks. Google was first by 5+ years. The expertise transfers up the stack: when 9,216-chip Ironwood pods land in 2025, each dissipating roughly 5.5 MW, the cooling story is a refinement, not a redesign.

10

Workloads — BERT, T5, GNMT

The 2017–2020 wave of large NLP models is, almost without exception, a TPU v2/v3 story.

Model                                    | Year     | Hardware                  | Notes
GNMT (Google Neural Machine Translation) | 2016–17  | v1 inference, v2 training | The migration from RNN-LSTM seq2seq to Transformer happens over the v2/v3 era.
BERT-Base / Large (Devlin et al.)        | Oct 2018 | v3                        | Pre-trained on Cloud TPU pods; about 4 days per pre-training run for BERT-Large.
T5 (Raffel et al.)                       | Oct 2019 | v3                        | 11B parameters at the largest. Pre-trained on 1024-chip v3 pods.
Meena / LaMDA                            | 2020–21  | v3 / early v4             | Conversational models leading toward Bard/Gemini.
MUM                                      | 2021     | v4                        | Multimodal; trained as v4 came online.

External usage

v2 and v3 were the first TPUs sold as a product. Cloud TPU became publicly available in February 2018 (v2) and March 2019 (v3). The TensorFlow Research Cloud programme gave free TPU access to academic researchers; whole sub-fields of language and vision research happened on these chips. If you read a 2018–2020 paper that says "trained on Cloud TPU", it was almost certainly a v2 or v3 device or pod.

11

What v2/v3 Got Right And Wrong

Got right

  • bfloat16 as the production training numeric.
  • Two-TensorCore-per-chip layout.
  • HBM at the right scale.
  • ICI as a custom chip-to-chip link.
  • The 2D torus pod — a fixed, predictable, all-reduce-friendly topology.
  • Cloud TPU as a rentable product.

What v4 had to fix

  • 2D torus diameter. A 32×32 torus has diameter 32; for ~5,000-chip jobs this hurts. v4 moves to 3D + OCS.
  • Embedding lookups are slow on the vector unit. v4 adds SparseCore.
  • Inflexible slicing. v2/v3 slices are fixed, physically contiguous blocks of the torus; v4 lets you reserve sub-pod slices via OCS, composed from whichever blocks are free.
  • Hard pod failure mode. v3 has no facility for routing around a faulty board. v4's OCS lets the system bypass any single block.
  • 16 nm. 7 nm is overdue by 2020.

The v4 paper (ISCA 2023) therefore reads largely as a list of the places where v3 had started to hurt for the team running PaLM-class workloads on it. Each item gets its own architectural fix.

12

Cheat Sheet

  • v2 (2017): 2 TensorCores per chip, 1 MXU each, 45 TFLOPS bf16, 16 GiB HBM @ ~600 GB/s, 256-chip 16×16 torus pod, air-cooled.
  • v3 (2018): 2 TensorCores per chip, 2 MXUs each, 123 TFLOPS bf16, 32 GiB HBM @ ~900 GB/s, 1024-chip 32×32 torus pod, liquid-cooled.
  • bfloat16: 1 sign + 8 exponent + 7 mantissa; FP32's range with half the bits, no loss-scaling needed for most networks.
  • ICI: 4 bidirectional links per chip, host-free chip-to-chip, 2D torus.
  • Fixed topology, no SparseCore, 16 nm: the list v4 has to fix.

Read next

Deck 06 — v4, OCS & SparseCore covers the chip that turns the TPU pod into a true machine-learning supercomputer.