Google TPUs Series — Presentation 03

Inside the Systolic Array — The Matmul Engine

From Kung & Leiserson's 1978 paper to the 256×256 MAC grid at the heart of every TPU. Why this 47-year-old idea is the reason Google's chips work.

Timeline: Mead & Conway VLSI → Kung-Leiserson 1978 → "Why Systolic?" 1982 → CMU Warp / iWarp → 1990s decline → TPU v1 revival 2015
00

Topics We'll Cover

01

What's a Systolic Array?

A systolic array is a 2D grid of identical, simple processing elements (PEs) that pump data through themselves like blood through a heart — data moves, computation happens at each beat, and the result emerges from the boundary. The name is Kung's; the analogy is medical.

A systolic system consists of a set of interconnected cells, each capable of performing some simple operation. Information in a systolic system flows between cells in a pipelined fashion, and communication with the outside world occurs only at the boundary cells. — H.T. Kung, "Why Systolic Architectures?", IEEE Computer, January 1982

Three defining properties

  • A regular grid of simple, identical cells — no instruction fetch, no control divergence.
  • Data flows rhythmically between neighbouring cells on every clock, computing as it moves.
  • Communication with the outside world happens only at the boundary cells.

The compute-density argument

If you remove caches, branch predictors, instruction decoders, and replace all that with a tile of multiply-accumulate units that fire every cycle, you get an enormous arithmetic-intensity advantage per unit of silicon. TPU v1 fits 65,536 MAC units in a 28 nm die smaller than a contemporary mid-range CPU.

02

Kung & Leiserson, 1978

The paper that names the field is "Systolic Arrays (for VLSI)" by H.T. Kung and Charles Leiserson at Carnegie Mellon, circulated as CMU technical report CMU-CS-79-103 (1979) but written and presented in 1978 at the Sparse Matrix Symposium. A condensed version appears as Section 8.3 of Mead & Conway's Introduction to VLSI Systems (Addison-Wesley, 1980), the textbook that taught the world how to design chips.

Why 1978 was the right year

VLSI had just made it cheap to replicate one simple cell thousands of times on a die, and Mead & Conway's methodology was teaching a generation of designers how to exploit exactly that kind of regular, replicated structure.

The Kung & Leiserson paper, in one diagram

Their Figure 1 shows a 1D linear array of MAC units computing y = Ax for a banded matrix A. Each PE holds one element of A. The vector x streams in from the left, partial sums of y stream out to the right, and on every clock each PE computes partial += a * x and forwards x. The 2D extension to dense matmul is one paragraph and one figure.
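The per-PE rule can be made concrete with a cycle-accurate toy simulation. This is an illustrative sketch, not code from the paper: it uses the counter-flowing 1D arrangement (x and partial sums travel in opposite directions, with operands injected every other cycle so each partial sum meets exactly the x values it needs), and a constant-diagonal (Toeplitz) band so that each PE holds a single fixed coefficient:

```python
# Toy cycle-accurate 1D systolic array: K PEs, each holding one stationary
# coefficient w[k]. x values flow right-to-left, partial sums flow
# left-to-right; injecting operands every other cycle (the classic
# half-utilisation counter-flow scheme) makes the right pairs meet,
# computing the correlation y[n] = sum_k w[k] * x[n+k] (zero-padded).
def systolic_1d(w, x):
    K, N = len(w), len(x)
    x_pipe = [0.0] * K          # x value sitting at each PE this cycle
    y_pipe = [0.0] * K          # partial sum sitting at each PE this cycle
    out = []
    for t in range(2 * (N - 1) + 2 * K - 1):
        # shift: x moves left (towards PE 0), y moves right (towards PE K-1);
        # fresh x enters at the right on even cycles, bubbles in between
        x_pipe = x_pipe[1:] + [x[t // 2] if t % 2 == 0 and t // 2 < N else 0.0]
        y_pipe = [0.0] + y_pipe[:-1]       # fresh partial sums enter as zero
        # every PE fires its multiply-accumulate on the same clock edge
        for k in range(K):
            y_pipe[k] += w[k] * x_pipe[k]
        # a finished sum leaves the right edge every other cycle
        if t >= 2 * K - 2 and (t - 2 * K + 2) % 2 == 0:
            out.append(y_pipe[-1])
    return out[:N]
```

Note how the only global signal is the clock — each PE sees nothing but its two neighbours, which is precisely what made the scheme attractive for VLSI.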

The paper closes with what would, decades later, become the TPU's design philosophy: "these algorithms can be implemented as special-purpose hardware that achieves much higher performance than general-purpose computers".

03

"Why Systolic Architectures?" — Kung 1982

Four years later, Kung published a single-author review in IEEE Computer that became the field's most-cited paper. Its three contributions:

1. The pump analogy

Naming it. "Systole" is the heart-pump phase that pushes blood through the body. The architecture's defining feature is that data moves on every clock and computes along the way.

2. The I/O-bound argument

If you have N PEs you need an I/O bandwidth proportional to N to keep them busy — unless you re-use each operand many times. Systolic arrays are the canonical answer: each input value visits O(N) PEs.

3. The algorithm taxonomy

Tables of which kernels map to which structure. 1D linear arrays for FIR filters and 1D convolution; 2D meshes for matmul and 2D convolution; hex arrays for transitive closure; trees for sorting. Each row of the table is a future ASIC.

The argument in one inequality

Kung's quantitative core: a systolic array of side N processing M elements gives you O(NM) compute work for O(M) I/O — arithmetic intensity of N. For a 256×256 array doing a matmul, that's 256 ops per byte of input bandwidth. The DRAM you have can keep up. This is the same arithmetic-intensity argument that justifies tensor cores today — it just took 35 years for a workload to need it badly enough to fund the silicon.
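The numbers above can be checked in a few lines. A sketch with TPU v1-like parameters (the formulas are the document's arithmetic-intensity argument; the function names are ours):

```python
# Back-of-envelope for Kung's argument: an n-wide weight-stationary array
# consumes one activation byte per row per cycle but fires n*n MACs per
# cycle, so required feed bandwidth grows with n, not n*n.
def feed_bandwidth_gbps(n, clock_ghz):
    # bytes/cycle (one per row) * cycles/ns = GB/s of activation feed
    return n * clock_ghz

def peak_tops(n, clock_ghz, ops_per_mac=2):
    # n*n PEs, each doing one multiply + one accumulate per cycle
    return n * n * clock_ghz * ops_per_mac / 1000.0

# TPU v1-like numbers: a 256-wide array at 0.7 GHz
print(peak_tops(256, 0.7))           # ~91.75 TOPS, the "92 TOPS" headline
print(feed_bandwidth_gbps(256, 0.7)) # ~179 GB/s keeps 65,536 MACs busy
```

The point of the comparison: ~92 TOPS of compute needs only ~179 GB/s of operand feed, because every byte is re-used 256 times inside the array.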

A direct line to the TPU

The 2017 ISCA TPU paper cites Kung 1982 explicitly, and Jouppi has said in talks that the v1 design started from a literal copy of the Kung-Leiserson 2D matrix-multiply array, scaled to 256×256 to fit a 28 nm die budget. The chip is, structurally, a 1978 idea on a 2014 process.

04

CMU Warp & Intel iWarp

The first generation of systolic-array hardware, built by Kung's group at CMU, ran from 1985 to the early 1990s. Three machines:

Machine | Year | Cells | Performance | Industrial partner
WW-Warp (Wire-Wrap) | 1985–86 | 2-cell prototype, then 10 | ~10 MFLOPS/cell | GE, Honeywell
PC-Warp (Printed Circuit) | 1987–88 | 10 cells | ~10 MFLOPS/cell | GE
iWarp | 1990–93 | up to ~64 cells per array | 20 MFLOPS/cell @ 20 MHz | Intel

Warp's three architectural lessons

The iWarp project shipped about 39 systems before Intel exited in 1993. Commercially small, intellectually large — the entire engineering DNA of TPU v1 traces through Warp.

Patterson on Warp → TPU

David Patterson has said in talks that the TPU programme treated the Warp/iWarp work as the canonical reference for "what a systolic-array supercomputer actually looks like at engineering scale" — including all the unglamorous parts (compiler, I/O, memory hierarchy, system integration) that the original 1978 paper did not need to address.

05

The 1990s Decline

By 1996, systolic arrays were considered an architectural dead end. Three reasons converged:

The DRAM bandwidth wall

An N×N array of MACs needs O(N) bytes per cycle to keep busy. By the early 90s memory bandwidth was lagging compute exponentially — the gap eventually called the "memory wall" in Wulf-McKee 1995. Systolic arrays starved.

Microprocessor exponential

Pentium-class CPUs riding Dennard scaling were getting a free 1.5×/year on single-thread performance. A custom systolic ASIC that took two years to design was beaten by the next general-purpose Pentium before it shipped.

Killer application missing

Outside specialised signal processing, no civilian workload needed 256×256 dense matmul. Without an application, no one funded the silicon. MasPar, Thinking Machines, Cray's MPP unit all failed between 1993 and 1996.

The hibernation, 1996–2014

For nearly two decades systolic arrays survived only in narrow niches: Fourier-transform accelerators, beamforming ASICs in radar, video-compression chips, and a few academic exercises. They were taught in computer-architecture textbooks as a historical milestone, not as a current technique.

And then deep learning happened. By 2014 the dominant compute pattern in industrial ML was large dense matrix multiplications on small batches of low-precision integers — exactly the workload Kung had been trying to find since 1978. The TPU is what happens when a 36-year-old idea finally meets the workload it was waiting for.

06

The Three Dataflows

For the matmul C = A × B, the question is which of the three tensors — weights, activations, or partial outputs — stays pinned in the PEs while the others move. This is the field's main taxonomy.

Weight-Stationary (WS)

Pre-load the weight matrix A into the array; weights stay put. Activations B stream in skewed; partial sums propagate down (or right). Final outputs emerge at one boundary.

Used by: TPU MXU, NVDLA, most modern ML accelerators.

Wins: weights re-used across many activations — ideal when weights are large and shared.

Output-Stationary (OS)

Pin a partial accumulator in each PE. Both A and B stream through; each PE keeps adding to its own output. When done, drain accumulators.

Used by: early DSP arrays, some FPGA matmul cores.

Wins: no accumulator data movement; minimum write bandwidth at the boundary.
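The OS schedule has a tidy closed form: with row i of A skewed by i cycles and column j of B skewed by j, the pair meeting at PE (i, j) on cycle t is (A[i][k], B[k][j]) with k = t − i − j. A toy sketch of that schedule (ours, not any specific chip's implementation):

```python
# Output-stationary toy schedule: A streams right, B streams down, both
# skewed; each PE multiplies whatever pair passes through it and adds the
# product to its own resident accumulator, which ends up holding C[i][j].
def output_stationary_matmul(A, B):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]      # one private accumulator per PE
    for t in range(M + N + K - 2):         # run until the last PE finishes
        for i in range(M):
            for j in range(N):
                k = t - i - j              # operand pair at PE (i, j) now
                if 0 <= k < K:             # outside this window: fill/drain
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Nothing ever moves the accumulators — the drain at the end is the only output traffic, which is exactly the "minimum write bandwidth" win.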

Row-Stationary (RS)

MIT Eyeriss (ISCA 2016, Chen / Emer / Sze). Each PE holds a row of filter weights; input rows move diagonally and partial sums move vertically between PEs. Hand-tuned to maximise data re-use across all three operand types at once.

Used by: Eyeriss; not adopted by industry chips at scale.

Wins: 1.4–2.5× energy improvement on AlexNet vs WS/OS in the original paper.

Why the choice matters

The dataflow choice is the single biggest architectural decision in any systolic accelerator. It determines bandwidth requirements at every level of the memory hierarchy, the layout of operands in SRAM, the optimal tile size, the latency to first output, and the pipeline-fill cost. Most public TPU descriptions get this right (weight-stationary); some popular write-ups call it output-stationary, which is incorrect — the TPU's MXU loads weights in advance and streams activations.

07

The TPU's Choice — Weight-Stationary

The TPU MXU pre-loads weights into the array before each tile of work, then streams activations. Why?

Workload-driven argument

  • In LLM inference the weight tensor is much larger than any single activation tile and is re-used across the whole batch and the whole sequence.
  • Loading a 128×128 weight tile once and reusing it for 128 activation rows = 128× reuse on the most expensive operand.
  • Activations are smaller, change frequently, and stream cleanly from VMEM.

Engineering argument

  • Weights live in HBM. HBM bandwidth is precious. Minimising weight bandwidth is the dominant optimisation.
  • Weight-stationary needs only one weight tile transfer per N activation tiles.
  • Output-stationary would stream both operand matrices through the array every cycle — roughly double the operand SRAM bandwidth for no compute gain.

The mechanics, in one sentence

For each tile: (1) DMA a 128×128 weight tile from HBM into the MXU through the weight FIFO, (2) feed activations from VMEM into the array's left edge skewed by row, (3) partial sums propagate top-to-bottom and accumulate in the bottom row, (4) drain results into accumulators or back to VMEM.
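Steps (2)–(4) can be simulated with explicit per-cycle register shifts. This is a toy sketch of the weight-stationary dataflow (our code, not MXU RTL; sizes shrunk for readability), computing Y = X · W for an n×n resident weight tile:

```python
# Weight-stationary toy simulation: W[i][j] is pre-loaded into PE (i, j).
# Activation row v enters the left edge skewed (element i at cycle v + i)
# and shifts right; partial sums shift down, each picking up one MAC per
# PE, and drain from the bottom row.
def ws_systolic_matmul(X, W):
    m, n = len(X), len(W)
    x_reg = [[0.0] * n for _ in range(n)]   # activation register per PE
    p_reg = [[0.0] * n for _ in range(n)]   # partial sum leaving each PE
    Y = [[0.0] * n for _ in range(m)]
    for t in range(m + 2 * n - 2):
        # 1. activations shift one PE right; skewed entry at the left edge
        for i in range(n):
            for j in range(n - 1, 0, -1):
                x_reg[i][j] = x_reg[i][j - 1]
            v = t - i                       # activation row entering row i now
            x_reg[i][0] = X[v][i] if 0 <= v < m else 0.0
        # 2. partial sums shift one PE down, accumulating their PE's MAC
        for j in range(n):
            for i in range(n - 1, 0, -1):
                p_reg[i][j] = p_reg[i - 1][j] + W[i][j] * x_reg[i][j]
            p_reg[0][j] = W[0][j] * x_reg[0][j]
        # 3. finished sums drain from the bottom row, one per column
        for j in range(n):
            v = t - j - n + 1
            if 0 <= v < m:
                Y[v][j] = p_reg[n - 1][j]
    return Y
```

Note that the "schedule" is nothing but the two skews (v + i on entry, v + j + n − 1 on drain) — everything in between is the wiring.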

Why "MXU" is not a synonym for "systolic array"

Internally the TPU MXU is a systolic array of MAC PEs. But there are several MXUs per chip (one per TensorCore in v2, two in v3, four in v4 and later), each fed by a vector unit and a scalar unit, with explicit accumulators and a weight FIFO. The systolic core is the compute; the rest of the MXU is the plumbing that keeps it fed. Most of the engineering effort in a TPU goes into the plumbing.

08

How a 256×256 Array Maps a Matmul

For TPU v1's single 256×256 array, computing C[M×N] = A[M×K] × B[K×N] with M = N = K = 256:

  1. Cycle 0: The weight matrix A (or for inference convention, the weight tile of B) is already loaded into the 65,536 PEs — one weight per PE.
  2. Cycles 0–255: Activations enter from the left, skewed: row 0 enters at cycle 0, row 1 at cycle 1, …, row 255 at cycle 255 — so once the skew is established, one new activation element enters each row per cycle.
  3. Cycles 0–510: A diagonal wavefront of valid compute sweeps the array. At cycle t the PEs on the anti-diagonal i+j = t are doing useful multiply-accumulate work; all others are still in fill or already drained.
  4. Cycles 255–510: Partial sums propagate downwards through the array, accumulating one term at a time. The output emerges from the bottom row, one column per cycle.
  5. Total latency: M + N + K − 2 = 766 cycles to fully process one 256×256×256 matmul on a 256×256 array. For this single tile, fill and drain account for roughly two-thirds of those cycles — the overhead only shrinks to a tiny fraction when long contractions or back-to-back tiles keep the array full.

The wavefront pattern

The visualisation on the next slide shows this concretely. The PEs that are firing at any one cycle form a moving anti-diagonal — a wavefront sweeping from the top-left to the bottom-right.

Why this is hard to write yourself

If you tried to schedule this on a CPU you'd have a giant nested loop, with explicit shift registers, careful boundary handling, and probably an off-by-one error somewhere on the skew. On a systolic array the structure of the chip is the schedule — the only thing the compiler emits is "feed these tiles, in this order, then read these accumulators". This is the property that makes XLA tractable; lowering an HLO matmul to TPU code is largely a tile-size choice, not a scheduling problem.

09

Interactive: Wavefront Visualiser

An 8×8 systolic array (smaller than the TPU's 128×128 MXU for visibility). Press Step to advance one cycle at a time and watch the diagonal wavefront move. The number in each cell is its current accumulator value as integer multiplies fire.

[Interactive widget: 8×8 weight-stationary array. Activations stream in from the left, skewed; partial sums propagate down and drain from the bottom row. Counters show cycle, PEs firing, PEs done, and utilisation.]
Yellow PE values are accumulating; green PEs have drained. The wavefront is the diagonal band of yellow. Notice that even on this small array, peak utilisation (at the centre of the wavefront) is reached only briefly — averaged over a single small matmul, utilisation sits around 50% until fill and drain are amortised across more work. This is why TPUs love big batches.
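The shape of those counters can be reproduced analytically: PE (i, j) fires for activation row v at cycle v + i + j, so the firing set each cycle is a band of anti-diagonals. A simple model of that schedule (ours — exact percentages depend on what the widget counts as "done"):

```python
# Count firing PEs per cycle for an n x n weight-stationary array streaming
# m activation rows: PE (i, j) fires for row v exactly at cycle v + i + j.
def utilisation(n, m):
    total_cycles = m + 2 * n - 2
    firing = []
    for t in range(total_cycles):
        count = sum(1 for i in range(n) for j in range(n)
                    for v in range(m) if i + j == t - v)
        firing.append(count)
    avg = sum(firing) / (n * n * total_cycles)   # fraction of peak, averaged
    return firing, avg

_, avg_short = utilisation(8, 8)    # one small matmul: fill/drain dominate
_, avg_long = utilisation(8, 64)    # stream 8x more rows: overhead amortised
print(f"{avg_short:.0%} vs {avg_long:.0%}")   # 36% vs 82% under this model
```

Streaming more activation rows through the same resident weights is the whole trick: the fill and drain diagonals are paid once, not per row.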

10

Throughput & Utilisation

The TPU v1's 256×256 array running at 700 MHz, with each PE doing one INT8 multiply and one INT32 accumulate per cycle:

PEs: 65,536 · Clock: 700 MHz · Ops/PE/cycle: 2 · Peak: ~92 TOPS

65,536 × 700 MHz × 2 ops/MAC = 91.75 TOPS, which matches the "92 TOPS" headline figure.

Pipeline-fill cost

For a single matmul of K contract dimension on a 256×256 array, useful cycles = K; fill+drain = ~512 cycles. So utilisation is approximately K / (K + 512). At K = 1024 this is 67%; at K = 4096 it's 89%; at K = 16384 it's 97%. This is the arithmetic-intensity argument that makes TPUs prefer large batches and long contraction dimensions.
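The utilisation curve above is one line of arithmetic; a quick check of the quoted figures (function name ours):

```python
# Utilisation of a 256-wide weight-stationary array as a function of the
# contraction length K: useful cycles = K, fill + drain ~ 2 * 256 = 512.
def mxu_utilisation(k, fill_drain=512):
    return k / (k + fill_drain)

for k in (1024, 4096, 16384):
    print(k, f"{mxu_utilisation(k):.0%}")   # 67%, 89%, 97%
```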

Why this scales

If you double the array (e.g. v2/v3 added a second TensorCore with its own MXUs), peak doubles. If you double the clock, peak doubles. If you double the precision (e.g. v7's FP8 vs bf16), the same MAC unit can do twice as many ops. Every TPU generation has used at least two of these levers.

The compiler's job

XLA's most important pass on a TPU target is tile-size selection: pick the largest tile that fits in VMEM and divides evenly into the activation and weight shapes, so that pipeline fill becomes a small percentage of total work. This is the main determinant of measured TFLOPS — you can be at 95% of peak on long contractions and 30% on short ones, all on the same chip.
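A cartoon of that pass — all sizes and candidate tiles here are hypothetical, and XLA's real pass also weighs padding, fusion, and layout:

```python
# Toy tile-size selection: pick the largest square tile that divides the
# matmul shapes evenly and whose working set (one weight tile + one
# activation tile + one output tile) fits a VMEM budget. Illustrative only.
def pick_tile(m, n, k, vmem_bytes, elem_bytes=2):
    best = None
    for t in (32, 64, 128, 256, 512):      # hypothetical candidate tiles
        if m % t or n % t or k % t:        # must divide all three dims
            continue
        working_set = 3 * t * t * elem_bytes   # weight + act + out tiles
        if working_set <= vmem_bytes:
            best = t                       # keep the largest that fits
    return best

# e.g. a 1024^3 bf16 matmul: a 16 MiB budget takes the 512 tile,
# a 256 KiB budget forces the tile down to 128.
```

The bigger the tile, the longer the effective contraction per weight load — which is exactly how the compiler pushes utilisation up the K / (K + 512) curve.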

11

Comparison with Other Accelerators

The TPU is the most prominent systolic-array accelerator but not the only one in the modern landscape. The other major designs make different choices — and most are not systolic.

Architecture | Compute primitive | Scheduling | Memory model
Google TPU | Systolic 2D MAC array (128×128 MXU on v2+, 256×256 on v1) | Static, compiler-issued (XLA HLO) | Software-managed VMEM/CMEM scratchpad + HBM
NVIDIA Tensor Cores | Per-SM MMA blocks (e.g. wgmma m64n256k16 on Hopper); internals undisclosed | Dynamic warp scheduler chooses per cycle | L1/L2 cache + register file + shared memory + HBM
Cerebras WSE-2/3 | Wafer-scale 2D mesh of independent PEs with SwarmX fabric | Spatial dataflow; static placement | 40 GiB on-wafer SRAM, no off-chip DRAM
Groq LPU | Deterministic VLIW streaming pipeline (TSP) | Static, deterministic, no caches at all | On-chip SRAM only; predictable cycle latency
SambaNova RDU | Reconfigurable Dataflow Architecture: pattern compute units + memory units | Compiler-mapped spatial dataflow | Tiered SRAM/HBM, mesh interconnect
Intel Gaudi 3 | MME (matrix multiply engine), systolic-style | Static + dynamic mix | HBM + SRAM scratchpad
Honest claim about NVIDIA tensor cores

NVIDIA does not publish tensor-core internals, and microbenchmarks have argued both ways. The safe statement: at the programmer-visible level they are MMA blocks (matrix-multiply-accumulate primitives, e.g. wmma.mma.sync.m16n16k16 on Volta+, wgmma on Hopper). They might be implemented as small systolic arrays, FMA-tree reducers, or a hybrid. The behavioural difference from a TPU MXU is large regardless: an SM has four small MMA blocks, while a TPU TensorCore has a handful of very large MXUs. The scheduling philosophy (warp-dynamic vs compiler-static) is the much bigger architectural divergence.

The lineage of "alumni" chips

Several modern non-TPU accelerators are explicit philosophical descendants of the TPU. Groq was founded by ex-TPU engineer Jonathan Ross; its deterministic philosophy is the TPU's static scheduling taken to its logical extreme. Tenstorrent's Tensix cores, Cerebras's PE meshes, even AWS Trainium — all of these companies cite the TPU papers in their whitepapers. The TPU is the chip whose ideas are now everywhere.

12

Cheat Sheet

Read next

Deck 04 — Inside TPU v1 takes apart the chip from this story slide by slide. Deck 09 — Memory & Numerics covers VMEM/CMEM and accumulator widths. Deck 11 — Software Stack covers XLA's role as the systolic-schedule compiler.