Google TPUs Series — Presentation 09

Memory Hierarchy & Numerics

Registers → VMEM → CMEM → HBM → remote HBM via ICI. INT8 → bf16 → FP8. The two axes that determine every TPU's effective performance.

[Title-slide diagram: PE registers → accumulator SRAM → VMEM → CMEM → HBM (HBM2 / 2e / 3 / 3e) → remote HBM over ICI, annotated with the INT8 / bf16 / FP8 format progression]
00

Topics We'll Cover

01

The Six-Tier TPU Memory Hierarchy

From smallest and fastest to largest and slowest. Numbers are roughly Ironwood-class.

| Tier | What it is | Latency |
|---|---|---|
| PE regs | Per-PE registers (KiBs) | ~1 cycle |
| Accum | Accumulator SRAM (~MB) | 1–2 cycles |
| VMEM | VMEM scratchpad (16–64 MiB per TensorCore) | ~5 cycles |
| CMEM | CMEM cache (128 MiB+, shared) | ~20 cycles |
| HBM | HBM3e — 192 GiB at 7.37 TB/s | ~250 ns |
| Remote | Remote HBM via ICI (1.2 TB/s bidirectional) | ~2 μs + transfer time |

The two boundaries that matter most

Everything else — PE registers, accumulator, CMEM — is plumbing. The two cliffs, VMEM ↔ HBM within a chip and local HBM ↔ remote HBM over ICI, are where compiler intelligence matters.

02

VMEM — Software-Managed Vector Memory

VMEM is the per-TensorCore software-managed scratchpad. Direct descendant of v1's Unified Buffer; not a cache.

What lives in VMEM

  • Activation tiles being fed into the MXU.
  • Output tiles drained from the MXU before writeback.
  • Working sets for vector-unit operations (softmax, layer-norm, residual adds).
  • Small operands the compiler can keep resident across multiple operations.

What's not in VMEM

  • Weights — weights live in HBM and stream through the weight FIFO.
  • Anything the compiler can't fit. If a tile is too big, the compiler spills to HBM (or now CMEM).
  • No tags, no replacement policy. Every byte's residency is statically planned.

Capacities by generation

VMEM size has not grown as fast as HBM — SRAM density doesn't scale like DRAM, and area is precious. The compiler is expected to manage residency with finer tile sizes; CMEM picks up the rest.

The Pallas API view

If you write low-level TPU kernels in Pallas, VMEM is the main thing you allocate explicitly. BlockSpec(block_shape=..., index_map=...) tells the compiler exactly which HBM region to bring into VMEM for each pipeline step; pltpu.emit_pipeline handles the double-buffering. This is where the TPU's "software does the scheduling" philosophy is most visible at the application layer.
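A minimal sketch of that view, assuming a recent JAX with jax.experimental.pallas (the kernel, tile sizes, and shapes here are illustrative, not from this deck): each BlockSpec names the VMEM tile of an HBM-resident array for one grid step, and pallas_call pipelines the HBM ↔ VMEM copies around the kernel body.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # x_ref, y_ref, o_ref are the VMEM-resident tiles for the current grid step.
    o_ref[...] = x_ref[...] + y_ref[...]

def add_tiled(x, y, block_rows=512):
    # Each grid step stages one (block_rows, cols) slice from HBM into VMEM.
    spec = pl.BlockSpec(block_shape=(block_rows, x.shape[1]),
                        index_map=lambda i: (i, 0))
    return pl.pallas_call(
        add_kernel,
        grid=(x.shape[0] // block_rows,),
        in_specs=[spec, spec],
        out_specs=spec,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.ones((4096, 256), dtype=jnp.float32)
print(add_tiled(x, x).shape)
```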

03

CMEM — The First TPU Cache

CMEM was introduced with v4i (ISCA 2021) and has persisted through every later TPU. It is shared between the two TensorCores on a chip, sits between VMEM and HBM, and is the first TPU memory level with explicit cache-like semantics.

What CMEM is

  • Roughly 128 MiB on-die SRAM (v4 / v4i specifics; later chips similar order).
  • Shared by both TensorCores on the chip.
  • XLA controls residency; behaves cache-like in that lookups can hit or miss.
  • HBM bandwidth amplifier — effective memory bandwidth roughly doubles for workloads with good locality.

What CMEM is not

  • Not a fully transparent cache (no automatic LRU; XLA decides what stays).
  • Not coherent with anything off-chip.
  • Not a uniform replacement for HBM — capacity is small.

What it buys at the workload level

  • KV cache hits on long-context inference.
  • Activation re-use for backprop in training.
  • Fused-attention staging for FlashAttention-style kernels.
  • Embedding-row caches for repeated lookups in the same sequence.

Why CMEM and not "just bigger VMEM"? VMEM is per-TensorCore; CMEM is shared. Shared SRAM is more area-efficient when both cores reuse the same data (typical for attention's QKV tiles). Splitting the levels gives the compiler two knobs — tightly-pipelined VMEM tiles vs whole-step CMEM working sets.

The architectural trend

From v1's pure scratchpad model to v4's scratchpad-plus-cache, the TPU's memory hierarchy has gradually become more cache-like. NVIDIA went the other way at the same time — H100's TMA introduced more explicit DMA control, more scratchpad-like behaviour. The two architectures are converging on the same shape from opposite directions.

04

HBM Generations Across TPU Generations

| TPU | HBM standard | Per-chip capacity | Per-chip bandwidth | Stacks |
|---|---|---|---|---|
| v1 (2015) | none (DDR3) | 8 GiB DDR3 | ~34 GB/s | 2× DDR3 channels |
| v2 (2017) | HBM | 16 GiB | ~600 GB/s | 2 stacks |
| v3 (2018) | HBM2 | 32 GiB | ~900 GB/s | 2 stacks |
| v4 (2020) | HBM2 | 32 GiB | 1.2 TB/s | 2 stacks |
| v4i (2020) | HBM2 | 8 GiB | ~614 GB/s | 1 stack |
| v5e (2023) | HBM2e (likely) | 16 GiB | 819 GB/s | 1–2 stacks |
| v5p (2023) | HBM3 (likely) | 95 GiB | 2.76 TB/s | 6 stacks |
| Trillium (2024) | HBM3 (likely) | 32 GiB | 1.64 TB/s | 2 stacks |
| Ironwood (2025) | HBM3e | 192 GiB | 7.37 TB/s | 6–8 stacks |

The bandwidth curve in one ratio

From v1 (34 GB/s) to Ironwood (7.37 TB/s) is ~217× bandwidth growth in a decade — vastly faster than Moore's law on transistor count. Most of the per-chip performance growth in TPUs has come from feeding the chip faster, not from making the chip do more arithmetic per cycle.

The HBM stack mechanics

05

Remote HBM and the ICI Bandwidth Cliff

Below HBM in the hierarchy is "the HBM on a different chip in the same pod, reached via ICI". Critical for any matmul that exceeds one-chip capacity.

| Path | Bandwidth | Latency |
|---|---|---|
| Local HBM (Ironwood) | 7.37 TB/s | ~250 ns |
| Remote HBM via ICI (Ironwood) | 1.2 TB/s bidirectional | ~2 μs + serialisation |
| Remote HBM via Jupiter DCN (Trillium multi-pod) | 13 Pb/s pod bisection; ~Tbps per link | ~10–100 μs |

The compiler's job

Keep each TensorCore's operands in local HBM wherever possible, and schedule the unavoidable cross-chip collectives so they overlap with compute. This is the same job the GPU compiler stack does, with different cliffs (NVLink for nearby, InfiniBand for far). The cliffs are sharper on TPU because of the larger pod size and the fact that ICI is custom silicon — sub-microsecond at small messages, but with a hard tier change to Jupiter at the pod boundary.

A pod is one machine

Within a TPU pod, software treats the pod as a single coherent memory space — you can jax.device_put(x, sharding=NamedSharding(mesh, ...)) and any chip can access any other chip's HBM. The compiler will turn that into ICI all-gather / all-reduce / reduce-scatter calls automatically. You don't write the cross-chip transfers explicitly — but you do feel them in your tile-size choices.
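A minimal sketch of that programming model, under an assumed 1-D mesh over whatever chips are visible (shapes and the axis name are illustrative): the matmul is written as if on one machine, and XLA derives the ICI collectives from the shardings.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("chips",))   # 1-D mesh over the slice

x = jnp.ones((8192, 4096), dtype=jnp.bfloat16)   # activations
w = jnp.ones((4096, 4096), dtype=jnp.bfloat16)   # weights

# Rows of x are split across each chip's HBM; w is replicated on every chip.
x = jax.device_put(x, NamedSharding(mesh, P("chips", None)))
w = jax.device_put(w, NamedSharding(mesh, P()))

# Single-machine syntax; the compiler inserts any all-gather / reduce-scatter
# over ICI that this sharding combination requires.
y = jax.jit(lambda a, b: a @ b)(x, w)
print(y.sharding)
```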

06

INT8 — Where v1 Started

v1's only numeric format: 8-bit signed integers. Multipliers produced 16-bit products; accumulators were 32-bit. Inference only — no gradients, no FP arithmetic anywhere on chip.
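A minimal sketch of that integer pipeline (symmetric per-tensor scales and random data, purely illustrative rather than v1's actual quantisation recipe): int8 × int8 products accumulate in int32, and the wide result is re-quantised back to int8.

```python
import jax
import jax.numpy as jnp

def quantize_int8(x):
    # Symmetric per-tensor scale: map the max magnitude onto ±127.
    scale = jnp.max(jnp.abs(x)) / 127.0
    q = jnp.clip(jnp.round(x / scale), -127, 127).astype(jnp.int8)
    return q, scale

a, a_scale = quantize_int8(jax.random.normal(jax.random.key(0), (64, 256)))
b, b_scale = quantize_int8(jax.random.normal(jax.random.key(1), (256, 64)))

# 256 products of magnitude up to 127 * 127 ≈ 2^14 each: int32 cannot overflow here.
acc = jnp.matmul(a.astype(jnp.int32), b.astype(jnp.int32))

# Re-quantise the wide accumulator down to int8 for the next layer.
out, out_scale = quantize_int8(acc.astype(jnp.float32) * a_scale * b_scale)
```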

Why INT8 worked for inference in 2015

Why INT8 didn't work for training

v1's INT8-only choice is the cleanest possible "inference accelerator" decision. v2 is the moment Google admits that training also needs to live on this chip, which forces FP arithmetic.

07

bfloat16 — The Format That Won

FP32 (IEEE 754 single)

S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM

1 sign + 8 exp + 23 mantissa. Range ~10^±38. The standard since 1985.

FP16 (IEEE half)

S EEEEE MMMMMMMMMM

1+5+10. Range ~10^±5. Underflows on small gradients.

bfloat16 (Google Brain)

S EEEEEEEE MMMMMMM

1+8+7. Same range as FP32 (~10^±38). Conversion to/from FP32 is just truncation.
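A minimal sketch of why the conversion is cheap (pure bit manipulation; values are illustrative, and hardware casts typically round-to-nearest rather than truncate): drop the low 16 bits of the FP32 pattern and what remains is a valid bfloat16 with the same sign and exponent.

```python
import jax.numpy as jnp
from jax import lax

x = jnp.array([3.1415927, 1.0e-30, 1234.5678], dtype=jnp.float32)

bits = lax.bitcast_convert_type(x, jnp.uint32)        # raw FP32 bit patterns
truncated = bits & jnp.uint32(0xFFFF0000)             # keep sign + 8 exp + top 7 mantissa bits
bf16_by_truncation = lax.bitcast_convert_type(truncated, jnp.float32)

bf16_by_cast = x.astype(jnp.bfloat16).astype(jnp.float32)   # library cast, for comparison
print(x, bf16_by_truncation, bf16_by_cast)
```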

Why bfloat16 is the format of the modern AI era

bf16 first appeared in TPU v2 in 2017. NVIDIA added it to Ampere tensor cores in 2020. Intel shipped AVX-512 BF16 instructions in Cooper Lake (also 2020). ARM's BFloat16 extension landed in ARMv8.6. By 2026 every major AI accelerator and CPU supports it natively. It is the most successful Google-originated numeric format in computing history.

08

FP8 — OCP E4M3 / E5M2

FP8 is the third generation of "halve precision, double FLOPS" on TPUs. Native on Ironwood; supported but less prominent on v5p. The format is the OCP (Open Compute Project) standard, jointly defined by NVIDIA, Intel, ARM, AMD, and Google.

E4M3 — activations / weights

S EEEE MMM
  • 1 sign + 4 exp + 3 mantissa.
  • Range ~10^±1.5 (after the special-value asymmetry).
  • Better precision; used for forward-pass tensors.

E5M2 — gradients

S EEEEE MM
  • 1 sign + 5 exp + 2 mantissa.
  • Range ~10^±5.
  • Wider range; used for gradients in mixed-precision training (a rarer need on TPU, since training stays in bf16).

Per-block scaling

FP8's narrow range alone is too tight for most tensors. The standard practice is block scaling: a block of values (typically 32 or 128 elements) shares an FP32 scale factor — each block is divided by its scale before the cast to FP8 and multiplied back by it on use, with the scale recomputed per block. This is the same pattern NVIDIA uses with MXFP8 / NVFP8, and the same family Blackwell uses for its FP4 (MXFP4) format.
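A minimal sketch of per-block scaling, assuming 32-element blocks and a scale chosen so each block's maximum maps to E4M3's largest finite value (~448); the block size and epsilon clamp are illustrative choices, not a spec.

```python
import jax.numpy as jnp

E4M3_MAX = 448.0   # largest finite E4M3 value

def quantize_e4m3_blocks(x, block=32):
    xb = x.reshape(-1, block)
    # One FP32 scale per block; the epsilon avoids dividing by zero on all-zero blocks.
    scale = jnp.maximum(jnp.max(jnp.abs(xb), axis=1, keepdims=True), 1e-12) / E4M3_MAX
    q = (xb / scale).astype(jnp.float8_e4m3fn)   # narrow values, one scale per block
    return q, scale.astype(jnp.float32)

def dequantize(q, scale):
    return (q.astype(jnp.float32) * scale).reshape(-1)

x = jnp.linspace(-3.0, 3.0, 128, dtype=jnp.float32)
q, s = quantize_e4m3_blocks(x)
print(jnp.max(jnp.abs(dequantize(q, s) - x)))    # worst-case round-trip error
```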

Why TPUs went for FP8 specifically (not int8 doubled)

09

Accumulator Widths & Numerical Stability

An MXU multiplies two narrow-format operands and accumulates into a wider format. Accumulator width is the chip's secret weapon for numerical stability.

| Generation | Multiply | Accumulate | Output |
|---|---|---|---|
| v1 | INT8 × INT8 | INT32 | INT8 (re-quantised) |
| v2 / v3 / v4 / v5 | bf16 × bf16 | FP32 | bf16 (or FP32 if requested) |
| v5p / v5e (INT8 path) | INT8 × INT8 | INT32 | bf16 / INT8 |
| Ironwood (FP8 path) | FP8 × FP8 (with per-block scale) | FP32 | bf16 / FP8 |

Why FP32 accumulate matters

A 256-deep matmul with bf16 inputs stacks up 256 partial-product additions. Accumulate in bf16 (7-bit stored mantissa) and every addition rounds the running sum to roughly three decimal digits; as the sum grows, individual partial products fall below its rounding threshold and vanish, so the error compounds with contraction depth. FP32 (23-bit stored mantissa) gives ~16 extra bits of headroom — enough that the same matmul produces a result indistinguishable from FP32-throughout.

This is why "bf16 matmul" on a TPU is in fact "bf16 multiply, FP32 accumulate". The advertised throughput is the multiply rate; the accumulator silicon is more expensive and is sized to keep the array stable.
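A minimal sketch of that argument, simulating a 256-deep contraction with the running sum kept in bf16 versus FP32 (data and depth are illustrative). In JAX the wide accumulator is requested explicitly with preferred_element_type on jnp.dot / jnp.matmul.

```python
import jax
import jax.numpy as jnp

a = jax.random.normal(jax.random.key(0), (256,)).astype(jnp.bfloat16)
b = jax.random.normal(jax.random.key(1), (256,)).astype(jnp.bfloat16)

# bf16 × bf16 products are exact in FP32 (at most 16 significand bits fit in 24).
products = a.astype(jnp.float32) * b.astype(jnp.float32)

acc_fp32 = jnp.sum(products)                 # FP32 running sum

def step(carry, p):
    # Simulated narrow accumulator: the running sum is rounded to bf16 every addition.
    return (carry + p).astype(jnp.bfloat16), None

acc_bf16, _ = jax.lax.scan(step, jnp.zeros((), dtype=jnp.bfloat16),
                           products.astype(jnp.bfloat16))

print(acc_fp32, acc_bf16.astype(jnp.float32))

# The production form: bf16 inputs, FP32 accumulation and output.
# y = jnp.dot(a, b, preferred_element_type=jnp.float32)
```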

A trade no one talks about

FP8 with FP32 accumulate is the same engineering choice. NVIDIA does this; AMD does this; Google does this. The reason bf16 / FP8 / INT8 advertised TFLOPS are 2× / 4× / 4× the FP32 TFLOPS is partly because the accumulator stays at FP32 width — you save area on the multiplier, not the adder. This is also why advertised FP4 numbers are even more advertising-flavoured than usual: the accumulator hasn't shrunk.

10

The "Halve Precision, Double FLOPS" Pattern

Every TPU generation has used this lever at least once. The pattern:

FP32 (v1-era CPU) → bf16 (v2, 2×) → INT8 (v5e, 2×) → FP8 (Ironwood, 2×) → FP4? (future, 2×) — each step doubles peak ops/cycle on the same MAC area.

This is the post-Dennard-scaling answer. You can't double FLOPS by doubling transistors any more — transistors are too expensive and dies are reticle-limited. You double FLOPS by halving the format. Each step costs you accuracy; the price has been worth paying so far.

11

Why TPUs Have No FP4 (Yet)

NVIDIA's Blackwell (B200) added FP4 (MXFP4) in 2024, claiming 2× FP8 throughput on inference. Ironwood (2025) does not ship FP4. Why?

The skeptical view

  • FP4 is so narrow (3 bits of magnitude) that almost any model needs aggressive per-block scaling and PTQ-style calibration.
  • Quality regressions on long-context generation have been observed.
  • Ironwood's HBM3e capacity (192 GiB) is large enough that most inference workloads don't need FP4 just to fit a model in a single chip's HBM.
  • FP8 + bigger HBM may be a better point on the quality / throughput curve than FP4 + tight HBM.

The likely-future view

  • NVIDIA's MXFP4 is the OCP-blessed format; if it becomes the inference standard, TPUs will follow.
  • Future Trillium-successor or Ironwood-successor will plausibly ship FP4 native; the design lever is just too obvious to skip permanently.
  • NVFP4 (NVIDIA's variant with 16-element blocks and a two-level FP8/FP32 scale) gives most of FP4's bandwidth advantage with fewer accuracy regressions; OCP MXFP4 (32-element blocks, E8M0 scale) is the more conservative version.

The TPU programme has historically been selective on numerics — bf16, its own invention, ran years ahead of the rest of the industry, but FP8 and FP4 have each lagged NVIDIA by a generation. That's a deliberate Jouppi-era trade: ship known-good silicon, let the format mature in the wild, then add it.

12

Cheat Sheet

Read next

Deck 10 — ICI & OCS covers the cross-chip side of the hierarchy — how remote HBM actually gets to your local matmul.