Google TPUs Series — Presentation 04

Inside TPU v1 — The 2015 Inference Chip

A 28 nm PCIe card with 65,536 8-bit MACs, 24 MiB of on-chip SRAM, 8 GiB of DDR3, and barely a dozen instructions. The chip that proved a custom AI ASIC could ship.

28 nm · 256×256 MAC · INT8 · DDR3 · PCIe Gen3 · 40 W · ISCA 2017
[Hero diagram: Host CPU, PCIe Gen3 ×16, DDR3 weights, Weight FIFO, 256×256 MMU, Accumulators, Activation, Unified Buffer]
00

Topics We'll Cover

01

A Block Diagram on One Slide

From Jouppi et al. ISCA 2017, the chip's logical structure is unusually clean. Five blocks plus a control unit.

  • Host CPU: issues instructions over PCIe Gen3 ×16.
  • DDR3 ×2: 8 GiB of off-chip weight memory, 34 GB/s.
  • Weight FIFO: streams weight tiles from DDR3 into the MMU.
  • Unified Buffer: 24 MiB of on-chip SRAM for activations.
  • 256 × 256 Matrix Multiply Unit: 65,536 INT8 MACs at 700 MHz, 92 TOPS peak, weight-stationary systolic array.
  • Accumulators: 4 MiB of INT32 results.
  • Activation unit: ReLU, pool, norm; writes back to the Unified Buffer.
  • Control / instruction buffer: host-issued instructions.

The whole machine is a streaming pipeline: host CPU dispatches a CISC instruction over PCIe, weights are pulled from DDR3 through a FIFO into the MMU, activations come from the on-chip Unified Buffer, the systolic array does the matmul, results land in the 4 MiB accumulator SRAM, the activation pipeline applies ReLU/pool/normalize, and the result writes back into the Unified Buffer for the next layer.

02

The 256×256 Matrix Multiply Unit

The MMU is the chip's centre of gravity — the part where almost all the silicon and almost all the power live.

What's in each PE

  • One INT8 multiplier (8×8 → 16-bit product).
  • One 16-bit (or 32-bit) adder.
  • One register holding the resident weight.
  • One register for the activation arriving from the left.
  • One register for the partial sum descending from above.

The 65,536 PE budget

  • 256 × 256 = 65,536 PEs.
  • One INT8 multiply & one 32-bit add per PE per cycle.
  • 700 MHz clock → 92 TOPS peak (the arithmetic is checked below).
  • ~28 W idle, ~40 W measured at full load (75 W TDP) chip-wide; the array dominates the power budget.
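
The headline number is just those three bullets multiplied together; a quick check (Python, illustrative arithmetic only):

peak-throughput check (Python, illustrative)
# One multiply plus one add per PE per cycle counts as 2 ops.
pes = 256 * 256                    # 65,536 processing elements
ops_per_cycle = pes * 2            # MAC = 2 ops
clock_hz = 700e6                   # 700 MHz
peak_tops = ops_per_cycle * clock_hz / 1e12
print(peak_tops)                   # 91.75, quoted as ~92 TOPS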

Weight-stationary loading

Before each matmul the host issues a READ_WEIGHTS instruction. The Weight FIFO stages 256×256 tiles from DDR3, and the array's weight registers are double-buffered so the next tile can shift in while the current one is in use. Activations then flow in from the left, read out of the Unified Buffer and skewed by row to align with the systolic schedule. Partial sums propagate top-to-bottom and emerge into the accumulators.
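
A toy software model of that dataflow makes it concrete. The sketch below (Python, illustrative; a 2×3 array of PEs rather than 256×256, and no pipelined weight loads) keeps one resident weight per PE, streams activations in from the left with the row skew described above, and pushes partial sums down each column:

weight-stationary systolic array, toy model (Python, illustrative)
def systolic_matmul(a, w):
    """out[m][n] = sum_k a[m][k] * w[k][n], computed PE-by-PE, cycle-by-cycle."""
    M, K, N = len(a), len(w), len(w[0])
    act  = [[0] * N for _ in range(K)]   # per-PE activation register
    psum = [[0] * N for _ in range(K)]   # per-PE partial-sum register
    out  = [[0] * N for _ in range(M)]
    for t in range(M + K + N):           # enough cycles to fill and drain the array
        # Visit PEs bottom-right first so each one still sees last cycle's registers.
        for k in reversed(range(K)):
            for n in reversed(range(N)):
                if n == 0:
                    m = t - k            # row skew: a[m][k] enters row k at cycle m + k
                    a_in = a[m][k] if 0 <= m < M else 0
                else:
                    a_in = act[k][n - 1] # activation handed over from the left neighbour
                p_in = psum[k - 1][n] if k > 0 else 0   # partial sum arriving from above
                psum[k][n] = p_in + w[k][n] * a_in      # the resident weight never moves
                act[k][n] = a_in
                if k == K - 1:           # bottom row: finished sums drain to accumulators
                    m = t - (K - 1) - n
                    if 0 <= m < M:
                        out[m][n] = psum[k][n]
    return out

a = [[1, 2], [3, 4], [5, 6]]             # 3x2 activation block
w = [[7, 8, 9], [10, 11, 12]]            # 2x3 resident weight tile
print(systolic_matmul(a, w))             # [[27, 30, 33], [61, 68, 75], [95, 106, 117]]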

The clock-rate choice

700 MHz looks low next to a 2014 Xeon (3+ GHz). It's deliberate: 65,536 multipliers all switching simultaneously is a power and clock-distribution problem, not a transistor problem. Doubling the clock would have more than doubled power (a faster clock generally needs a higher voltage) without doubling delivered throughput once thermal limits hit. The TPU trades clock for parallelism — the same total work per second, done by more units at lower frequency and voltage, which is where the CV²f savings come from.
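
The scaling argument in one small sketch (Python; the voltage bump and capacitance numbers are illustrative assumptions, not measured TPU figures):

dynamic-power comparison (Python, illustrative)
# Dynamic power ~ C * V^2 * f. Two ways to double throughput:
def power(c, v, f):
    return c * v * v * f

base    = power(c=1.0, v=0.90, f=700e6)            # baseline array
clocked = power(c=1.0, v=0.90 * 1.15, f=1400e6)    # 2x clock, assume ~15% voltage bump
widened = power(c=2.0, v=0.90, f=700e6)            # 2x MACs at the same V and f

print(clocked / base)   # ~2.6x power for 2x the work
print(widened / base)   # ~2.0x power for 2x the work, with no harder clock distribution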

03

The 24 MiB Unified Buffer

The Unified Buffer is the chip's working set for activations. It is software-visible and software-managed — not a cache. Every read and write is explicit in the program.

Why a scratchpad and not a cache?

Cache costs you don't want

  • Tag arrays burn area and power.
  • Replacement state machines.
  • Unpredictable miss-rate variation.
  • Coherence protocols if multi-port.

Scratchpad costs you do want

  • Just SRAM cells — ~50% denser.
  • Compiler-controlled, deterministic.
  • Predictable latency (worst-case = average).
  • Easy to reason about for tile-size selection.

The scratchpad model only works if you have a compiler that can plan the working set ahead of time. ML graphs — with their statically known shapes — are exactly the workload where this is tractable. Try the same approach on a database query and you'd lose; on a transformer it's a massive win.
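
A toy version of that planning step (Python; the layer shapes, INT8 activations and the double-buffering split are illustrative assumptions, not XLA's actual tiling algorithm):

unified-buffer tile planning (Python, illustrative)
UB_BYTES = 24 * 1024 * 1024

def batch_tile(batch, feat_in, feat_out, bytes_per_elem=1, double_buffer=True):
    """Largest batch slice whose input + output activations fit in the buffer."""
    per_example = (feat_in + feat_out) * bytes_per_elem
    budget = UB_BYTES // (2 if double_buffer else 1)   # keep room for the next layer
    return max(1, min(batch, budget // per_example))

print(batch_tile(1024, 2048, 4096))     # 1024: the whole batch fits, no tiling needed
print(batch_tile(1024, 65536, 65536))   # 96: process 96 examples per pass

Because every shape is known at compile time, this decision is made once, offline, with no misses or evictions to reason about at runtime.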

04

Accumulators and Activation Pipeline

Once the MMU drains its outputs, two more dedicated units finish the layer:

4 MiB Accumulator SRAM

  • INT32 accumulators, 4096 rows of 256 elements each (Jouppi et al.).
  • Sized to hold a few full output tiles concurrently.
  • Allows partial-sum accumulation across multiple tile passes — essential for matmuls where the contraction dimension exceeds 256 (sketched below).
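
A sketch of what that accumulation looks like at the tile level (Python, illustrative; mmu_pass here is a plain loop standing in for one 256-wide pass through the systolic array):

partial-sum accumulation across tile passes (Python, illustrative)
TILE_K = 256

def mmu_pass(a_tile, w_tile):
    # Stand-in for one pass through the array: a_tile is M x k, w_tile is k x N, k <= 256.
    M, k, N = len(a_tile), len(w_tile), len(w_tile[0])
    return [[sum(a_tile[m][i] * w_tile[i][n] for i in range(k)) for n in range(N)]
            for m in range(M)]

def matmul_large_k(a, w):
    M, K, N = len(a), len(w), len(w[0])
    acc = [[0] * N for _ in range(M)]           # the INT32 accumulator tile
    for k0 in range(0, K, TILE_K):              # one array pass per 256-wide slice of K
        a_tile = [row[k0:k0 + TILE_K] for row in a]
        w_tile = w[k0:k0 + TILE_K]
        part = mmu_pass(a_tile, w_tile)
        for m in range(M):
            for n in range(N):
                acc[m][n] += part[m][n]         # add into the accumulator, don't overwrite
    return acc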

Activation pipeline

  • Hardwired ReLU, sigmoid, tanh, max-pool, average-pool, normalisation.
  • All of these are cheap and fixed-function — the chip pays nothing for what it doesn't support, and the compiler emits one of them at the end of each layer.
  • Writes back into the Unified Buffer for the next layer's input.

What's notably missing

No softmax, no attention, no division, no general-purpose vector unit. v1 was designed for the 2014 inference workload, which was almost entirely convolutions plus dense layers plus ReLU plus pooling. Anything fancier ran on the host CPU, which sometimes meant pulling tensors back across PCIe — expensive, and one of the first things v2 fixed by adding a real vector unit per TensorCore.

A specifically pre-Transformer chip

The Transformer paper (Vaswani et al.) did not appear until June 2017, after v1 was already in production. v1 has no native softmax, no efficient broadcast for attention scores, and no fast small-matmul path. That cost nothing for the workloads of its era (AlphaGo's policy and value networks, like most 2015–16 production models, are convnets), but it was already obvious by 2016 that the next chip would need different building blocks.

05

Why DDR3 and Not HBM

The single most second-guessed decision in v1's design. With hindsight everyone knows v1 was bandwidth-bound. The choice was made for three real reasons:

HBM was new in 2014

  • JEDEC HBM1 standard finalised October 2013.
  • First HBM products (AMD Fiji, NVIDIA P100) shipped 2015–16.
  • HBM controllers were expensive and immature; risky for a 2014 design.

v1 only stored weights

  • Inference: weights are read-only, loaded once, then re-used.
  • Activations live in the on-chip Unified Buffer.
  • So the off-chip bandwidth requirement seemed modest at design time: stream the weight set once per inference batch (a quick back-of-the-envelope follows).
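
The back-of-the-envelope that made it look modest (Python; the model size and batch are illustrative assumptions, not production numbers):

weight-streaming estimate (Python, illustrative)
weights_bytes = 100e6            # a ~100M-parameter model at one byte per INT8 weight (assumed)
ddr3_bw = 34e9                   # bytes/s
batch = 256                      # inferences sharing one pass over the weights (assumed)

pass_time = weights_bytes / ddr3_bw
print(pass_time * 1e3)           # ~2.9 ms to stream the whole weight set once
print(pass_time / batch * 1e6)   # ~11.5 us per inference, amortised over the batch

The catch, which section 08 makes explicit, is that latency targets cap the batch size in production serving, so the amortisation is rarely that generous.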

The shipping discipline

  • The team had 15 months from design start to hardware running in a datacenter.
  • A 28 nm chip with a DDR3 controller was a known-buildable thing.
  • Trade off optimal for shippable.

The actual numbers

Why Jouppi calls this the chip's most important lesson

The ISCA 2017 paper's roofline plot shows v1 sitting far below its compute roof on most workloads, pinned by DRAM bandwidth. Convnets were close to compute-bound; LSTMs were 5× below peak; embedding-heavy models were 10× below peak. The "what should v2 do?" question almost answered itself: add HBM. Every TPU since has had it.

06

The Brilliantly Minimal ISA

v1's instruction set is roughly twelve CISC instructions. The host CPU sends them over PCIe; the chip executes them in order. There are no branches, no condition codes, no exceptions, no virtual memory, and no caches.

TPU v1 — representative instruction set (illustrative; mnemonics follow ISCA 2017)
// Bring weight tile from DDR3 through the FIFO into the MMU
READ_WEIGHTS  <dram_addr>, <tile_shape>

// Bring activations from host memory into the Unified Buffer
READ_HOST_MEMORY <host_addr>, <ub_addr>, <size>

// Run the matmul: stream activations, drain to accumulators
MATRIX_MULTIPLY  <ub_in>, <acc_dst>, <rows>, <cols>

// Apply non-linearity / pool / norm and write back to Unified Buffer
ACTIVATE         <acc_src>, <ub_dst>, <op = relu | pool | norm>

// Send a tensor back to the host over PCIe
WRITE_HOST_MEMORY <ub_addr>, <host_addr>, <size>
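
What "the host does the control flow" means in practice: a hypothetical host-side driver loop for one dense layer, built only from the instructions above (Python; the TpuQueue class and its issue method are invented for illustration, not a real Google API):

host-side driver sketch (Python, names invented for illustration)
class TpuQueue:
    def issue(self, op, **args):
        print(op, args)                  # a real driver would push this over PCIe

def run_dense_layer(tpu, n_weight_tiles):
    tpu.issue("READ_HOST_MEMORY", host_addr=0x1000, ub_addr=0, size=65536)
    for t in range(n_weight_tiles):      # the host, not the chip, owns this loop
        tpu.issue("READ_WEIGHTS", dram_addr=t * 65536, tile_shape=(256, 256))
        tpu.issue("MATRIX_MULTIPLY", ub_in=0, acc_dst=0, rows=256, cols=256)
    tpu.issue("ACTIVATE", acc_src=0, ub_dst=65536, op="relu")
    tpu.issue("WRITE_HOST_MEMORY", ub_addr=65536, host_addr=0x2000, size=65536)

run_dense_layer(TpuQueue(), n_weight_tiles=4)   # no branch or loop ever reaches the chip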

What's not in the ISA

The Jouppi-MIPS lineage in plain sight

This is RISC discipline taken to its endpoint. The MIPS lesson was "compiler does the scheduling, hardware does the work". v1 takes it a step further: the host CPU does the control flow, the TPU just does the work. Every instruction is a coarse, well-typed action. The chip never has to make a runtime decision — which is why it could ship in 15 months with confidence.

07

PCIe Gen3 As The System Bus

v1 is a PCIe Gen3 ×16 card. The host link runs at ~16 GB/s in each direction, of which ~12.5 GB/s is usable bandwidth.

What flows over PCIe

Why this works for inference

An inference query is: 1 small input tensor in → many matmuls on-chip → 1 small output tensor out. PCIe is fine for the small tensors at the boundary. The on-chip work dominates by orders of magnitude. An H100 GPU faces the same bandwidth picture and reaches the same conclusion — the host bus is for control and I/O, not for the inner loop.
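
Rough numbers behind that claim (Python; the tensor sizes and achieved throughput are illustrative assumptions for a single CNN-style query):

boundary traffic vs on-chip work (Python, illustrative)
pcie_bw = 12.5e9          # usable bytes/s each way
in_bytes = 224 * 224 * 3  # one image-sized input tensor (assumed)
out_bytes = 1000          # a small logits vector (assumed)
model_ops = 3e9           # matmul ops per inference for a mid-2010s CNN (assumed)
achieved = 20e12          # ops/s actually delivered, well under the 92 TOPS peak (assumed)

print((in_bytes + out_bytes) / pcie_bw * 1e6)   # ~12 us crossing PCIe
print(model_ops / achieved * 1e6)               # ~150 us of on-chip matmul

Batching widens the gap further: the boundary transfer grows linearly while the on-chip work stays the dominant term.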

Why this stops working for training

For training, gradients need to be all-reduced across a cluster of chips. If the only inter-chip path is PCIe-to-PCIe through the host CPU, you spend more time on the host than on the chip. v2 fixes this with ICI — a custom chip-to-chip interconnect that bypasses the host entirely. That single change is what turns the TPU from an inference card into a training supercomputer.

08

The Roofline — Why v1 Was Memory-Bound

The most-quoted figure from the ISCA 2017 paper is the roofline plot: operational intensity (ops per byte) on the X axis, achieved performance (ops per second) on the Y axis. Peak compute is a horizontal roof at 92 TOPS; the memory roof rises from the origin with slope 34 GB/s.

[Roofline, TPU v1 (log-log, schematic): compute roof at 92 TOPS; DDR3 memory roof at 34 GB/s; workload points at LSTM0 ≈ 14 ops/byte, LSTM1 ≈ 22, MLP ≈ 85, CNN0 ≈ 620, CNN1 ≈ 2,900 (compute-bound).]

The plot is schematic; the workload labels follow Jouppi et al. CNN1 (which is convolution-heavy, like AlexNet) sits at the kink — achieving close to peak. LSTMs sit far below the memory roof — bandwidth-starved. Half the production workloads in 2016 looked more like the LSTMs than the CNNs.
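
The model behind the plot is one line of arithmetic (Python; the roofs are v1's published numbers, the evaluation points are the schematic's labels):

roofline arithmetic (Python)
PEAK_OPS = 92e12          # compute roof, ops/s
DDR3_BW = 34e9            # memory roof slope, bytes/s

def attainable(ops_per_byte):
    return min(PEAK_OPS, DDR3_BW * ops_per_byte)

print(PEAK_OPS / DDR3_BW)         # ~2706 ops/byte: the kink where the two roofs meet
print(attainable(2900) / 1e12)    # 92.0: a CNN1-like intensity clears the kink, compute-bound
print(attainable(22) / 1e12)      # 0.75: the ceiling DDR3 imposes at an LSTM-like intensity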

The lesson that designs every later TPU

v1's cleanest architectural lesson, repeated explicitly in the CACM 2020 paper for v2/v3: your peak FLOPS number is irrelevant if your bandwidth roof is below it for the workloads you actually run. Every later TPU pays disproportionately more for memory than for compute — HBM2 in v2/v3, HBM2 + CMEM in v4/v4i, HBM3e in Ironwood at 7.4 TB/s per chip. The compute is easy; feeding it is hard.

09

Workloads in 2015–2017

What v1 actually ran in production. From the ISCA paper plus subsequent disclosures:

System | Model class | Why TPU
RankBrain (Search) | Deep neural net for query interpretation | Latency-sensitive; 10s of billions of queries/day; perf/W matters more than peak.
Google Translate (NMT) | Encoder-decoder LSTM (later GNMT) | Large model parameters; bandwidth-bound on v1; still cheaper than CPU.
Google Photos | CNN classifier / labeller | Compute-bound — near-ideal v1 workload (CNN1 in the roofline).
Street View | CNN OCR | Compute-bound batch inference at huge scale.
Voice / Now / Assistant | Acoustic model + language model | Latency-sensitive, real-time.
AlphaGo | Policy + value CNNs | The headline workload — Lee Sedol match, March 2016.

The fleet picture

By the time of the May 2016 announcement, v1s were running in tens of thousands in Google datacenters — an actual production fleet, not a benchmark. The ISCA 2017 paper presents measured performance per workload across the production fleet; that's why the numbers are quoted across very specific named models and not synthetic benchmarks.

Why this matters for the rest of the industry

v1 is the first time a custom AI ASIC was demonstrated at hyperscale, with measured wins, on real workloads. It is the existence-proof that justifies every other AI chip startup's pitch deck since 2016. The right comparison is "v1 vs Haswell + K80 in production" — not "v1 vs anything in a benchmark".

10

What v1 Got Wrong (And v2 Fixed)

How quickly the lessons turn into silicon

v1 ships to datacenter mid-2015. v2 announced May 2017 — about 22 months later. Every architectural fix above is in v2. This is the cadence the TPU programme has held ever since: ship, learn, fix in the next chip, ship again.

11

What v1 Got Right (And v7 Still Has)

The continuity argument

If you walked an engineer from the v1 team into a 2025 TPU bring-up lab, almost everything would be familiar in form: a weight FIFO, a systolic MMU, a vector unit, a software-managed buffer, a host-issued instruction stream, a roofline plot pinned to a wall. The dimensions are different by orders of magnitude; the ideas are not.

12

Cheat Sheet

Read next

Deck 05 — v2 & v3, the Training Era walks through the chips that fixed v1's biggest limitations: HBM, bf16, the 2D torus pod, ICI.