Google TPUs Series — Presentation 04

Inside TPU v1 — The 2015 Inference Chip

A 28 nm PCIe card with 65,536 8-bit MACs, 24 MiB of on-chip SRAM, 8 GiB of DDR3, and barely a dozen instructions. The chip that proved a custom AI ASIC could ship.

28 nm · 256×256 MAC · INT8 · DDR3 · PCIe Gen3 · 40 W · ISCA 2017
[Hero diagram: Host CPU, PCIe Gen3 ×16, DDR3 weights, Weight FIFO, 256×256 MMU, Accumulators, Activation, Unified Buffer]
00

Topics We'll Cover

01

A Block Diagram on One Slide

From Jouppi et al. ISCA 2017, the chip's logical structure is unusually clean. Five blocks plus a control unit.

  • Host CPU: issues instructions over PCIe Gen3 ×16.
  • DDR3 ×2: 8 GiB of off-chip weight memory, 34 GB/s.
  • Weight FIFO: streams weight tiles from DDR3 into the MMU.
  • Unified Buffer: 24 MiB of on-chip SRAM for activations.
  • 256 × 256 Matrix Multiply Unit: 65,536 INT8 MACs at 700 MHz, 92 TOPS peak, weight-stationary systolic array.
  • Accumulators: 4 MiB of INT32 results.
  • Activation unit: ReLU, pool, norm; writes back to the Unified Buffer.
  • Control / instruction buffer: host-issued instructions.

The whole machine is a streaming pipeline: host CPU dispatches a CISC instruction over PCIe, weights are pulled from DDR3 through a FIFO into the MMU, activations come from the on-chip Unified Buffer, the systolic array does the matmul, results land in the 4 MiB accumulator SRAM, the activation pipeline applies ReLU/pool/normalize, and the result writes back into the Unified Buffer for the next layer.

02

The 256×256 Matrix Multiply Unit

The MMU is the chip's centre of gravity — the part where almost all the silicon and almost all the power live.

What's in each PE

  • One INT8 multiplier (8×8 → 16-bit product).
  • One 16-bit (or 32-bit) adder.
  • One register holding the resident weight.
  • One register for the activation arriving from the left.
  • One register for the partial sum descending from above.

The 65,536 PE budget

  • 256 × 256 = 65,536 PEs.
  • One INT8 multiply & one 32-bit add per PE per cycle.
  • 700 MHz clock → 92 TOPS peak (the arithmetic is checked below).
  • ~28 W idle, ~40 W measured at full load (75 W TDP) chip-wide; the array dominates the power budget.
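
The headline number is just those three bullets multiplied together; a quick check (Python, illustrative arithmetic only):

peak-throughput check (Python, illustrative)
# One multiply plus one add per PE per cycle counts as 2 ops.
pes = 256 * 256                    # 65,536 processing elements
ops_per_cycle = pes * 2            # MAC = 2 ops
clock_hz = 700e6                   # 700 MHz
peak_tops = ops_per_cycle * clock_hz / 1e12
print(peak_tops)                   # 91.75, quoted as ~92 TOPS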

Weight-stationary loading

Before each matmul the host issues a READ_WEIGHTS instruction. The Weight FIFO stages 256×256 tiles from DDR3, and the array's weight registers are double-buffered so the next tile can shift in while the current one is in use. Activations then flow in from the left, read out of the Unified Buffer and skewed by row to align with the systolic schedule. Partial sums propagate top-to-bottom and emerge into the accumulators.
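
A toy software model of that dataflow makes it concrete. The sketch below (Python, illustrative; a 2×3 array of PEs rather than 256×256, and no pipelined weight loads) keeps one resident weight per PE, streams activations in from the left with the row skew described above, and pushes partial sums down each column:

weight-stationary systolic array, toy model (Python, illustrative)
def systolic_matmul(a, w):
    """out[m][n] = sum_k a[m][k] * w[k][n], computed PE-by-PE, cycle-by-cycle."""
    M, K, N = len(a), len(w), len(w[0])
    act  = [[0] * N for _ in range(K)]   # per-PE activation register
    psum = [[0] * N for _ in range(K)]   # per-PE partial-sum register
    out  = [[0] * N for _ in range(M)]
    for t in range(M + K + N):           # enough cycles to fill and drain the array
        # Visit PEs bottom-right first so each one still sees last cycle's registers.
        for k in reversed(range(K)):
            for n in reversed(range(N)):
                if n == 0:
                    m = t - k            # row skew: a[m][k] enters row k at cycle m + k
                    a_in = a[m][k] if 0 <= m < M else 0
                else:
                    a_in = act[k][n - 1] # activation handed over from the left neighbour
                p_in = psum[k - 1][n] if k > 0 else 0   # partial sum arriving from above
                psum[k][n] = p_in + w[k][n] * a_in      # the resident weight never moves
                act[k][n] = a_in
                if k == K - 1:           # bottom row: finished sums drain to accumulators
                    m = t - (K - 1) - n
                    if 0 <= m < M:
                        out[m][n] = psum[k][n]
    return out

a = [[1, 2], [3, 4], [5, 6]]             # 3x2 activation block
w = [[7, 8, 9], [10, 11, 12]]            # 2x3 resident weight tile
print(systolic_matmul(a, w))             # [[27, 30, 33], [61, 68, 75], [95, 106, 117]]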

The clock-rate choice

700 MHz looks low next to a 2014 Xeon (3+ GHz). It's deliberate: 65,536 multipliers all switching simultaneously is a power and clock-distribution problem, not a transistor problem. Doubling the clock would have more than doubled power (a faster clock generally needs a higher voltage) without doubling delivered throughput once thermal limits hit. The TPU trades clock for parallelism — the same total work per second, done by more units at lower frequency and voltage, which is where the CV²f savings come from.
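
The scaling argument in one small sketch (Python; the voltage bump and capacitance numbers are illustrative assumptions, not measured TPU figures):

dynamic-power comparison (Python, illustrative)
# Dynamic power ~ C * V^2 * f. Two ways to double throughput:
def power(c, v, f):
    return c * v * v * f

base    = power(c=1.0, v=0.90, f=700e6)            # baseline array
clocked = power(c=1.0, v=0.90 * 1.15, f=1400e6)    # 2x clock, assume ~15% voltage bump
widened = power(c=2.0, v=0.90, f=700e6)            # 2x MACs at the same V and f

print(clocked / base)   # ~2.6x power for 2x the work
print(widened / base)   # ~2.0x power for 2x the work, with no harder clock distribution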

03

The 24 MiB Unified Buffer

The Unified Buffer is the chip's working set for activations. It is software-visible and software-managed — not a cache. Every read and write is explicit in the program.

Why a scratchpad and not a cache?

Cache costs you don't want

  • Tag arrays burn area and power.
  • Replacement state machines.
  • Unpredictable miss-rate variation.
  • Coherence protocols if multi-port.

Scratchpad costs you do want

  • Just SRAM cells — ~50% denser.
  • Compiler-controlled, deterministic.
  • Predictable latency (worst-case = average).
  • Easy to reason about for tile-size selection.

The scratchpad model only works if you have a compiler that can plan the working set ahead of time. ML graphs — with their statically known shapes — are exactly the workload where this is tractable. Try the same approach on a database query and you'd lose; on a transformer it's a massive win.
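
A toy version of that planning step (Python; the layer shapes, INT8 activations and the double-buffering split are illustrative assumptions, not XLA's actual tiling algorithm):

unified-buffer tile planning (Python, illustrative)
UB_BYTES = 24 * 1024 * 1024

def batch_tile(batch, feat_in, feat_out, bytes_per_elem=1, double_buffer=True):
    """Largest batch slice whose input + output activations fit in the buffer."""
    per_example = (feat_in + feat_out) * bytes_per_elem
    budget = UB_BYTES // (2 if double_buffer else 1)   # keep room for the next layer
    return max(1, min(batch, budget // per_example))

print(batch_tile(1024, 2048, 4096))     # 1024: the whole batch fits, no tiling needed
print(batch_tile(1024, 65536, 65536))   # 96: process 96 examples per pass

Because every shape is known at compile time, this decision is made once, offline, with no misses or evictions to reason about at runtime.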

04

Accumulators and Activation Pipeline

Once the MMU drains its outputs, two more dedicated units finish the layer:

4 MiB Accumulator SRAM

  • INT32 accumulators, 4096 rows of 256 elements each (Jouppi et al.).
  • Sized to hold a few full output tiles concurrently.
  • Allows partial-sum accumulation across multiple tile passes — essential for matmuls where the contraction dimension exceeds 256 (sketched below).
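
A sketch of what that accumulation looks like at the tile level (Python, illustrative; mmu_pass here is a plain loop standing in for one 256-wide pass through the systolic array):

partial-sum accumulation across tile passes (Python, illustrative)
TILE_K = 256

def mmu_pass(a_tile, w_tile):
    # Stand-in for one pass through the array: a_tile is M x k, w_tile is k x N, k <= 256.
    M, k, N = len(a_tile), len(w_tile), len(w_tile[0])
    return [[sum(a_tile[m][i] * w_tile[i][n] for i in range(k)) for n in range(N)]
            for m in range(M)]

def matmul_large_k(a, w):
    M, K, N = len(a), len(w), len(w[0])
    acc = [[0] * N for _ in range(M)]           # the INT32 accumulator tile
    for k0 in range(0, K, TILE_K):              # one array pass per 256-wide slice of K
        a_tile = [row[k0:k0 + TILE_K] for row in a]
        w_tile = w[k0:k0 + TILE_K]
        part = mmu_pass(a_tile, w_tile)
        for m in range(M):
            for n in range(N):
                acc[m][n] += part[m][n]         # add into the accumulator, don't overwrite
    return acc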

Activation pipeline

  • Hardwired ReLU, sigmoid, tanh, max-pool, average-pool, normalisation.
  • All of these are cheap and fixed-function — the chip pays nothing for what it doesn't support, and the compiler emits one of them at the end of each layer.
  • Writes back into the Unified Buffer for the next layer's input.

What's notably missing

No softmax, no attention, no division, no general-purpose vector unit. v1 was designed for the 2014 inference workload, which was almost entirely convolutions plus dense layers plus ReLU plus pooling. Anything fancier ran on the host CPU, which sometimes meant pulling tensors back across PCIe — expensive, and one of the first things v2 fixed by adding a real vector unit per TensorCore.

A specifically pre-Transformer chip

The Transformer paper (Vaswani et al.) did not appear until June 2017, after v1 was already in production. v1 has no native softmax, no efficient broadcast for attention scores, and no fast small-matmul path. That cost nothing for the workloads of its era (AlphaGo's policy and value networks, like most 2015–16 production models, are convnets), but it was already obvious by 2016 that the next chip would need different building blocks.

05

Why DDR3 and Not HBM

The single most second-guessed decision in v1's design. With hindsight everyone knows v1 was bandwidth-bound. The choice was made for three real reasons:

HBM was new in 2014

  • JEDEC HBM1 standard finalised October 2013.
  • First HBM products (AMD Fiji, NVIDIA P100) shipped 2015–16.
  • HBM controllers were expensive and immature; risky for a 2014 design.

v1 only stored weights

  • Inference: weights are read-only, loaded once, then re-used.
  • Activations live in the on-chip Unified Buffer.
  • So the off-chip bandwidth requirement seemed modest at design time: stream the weight set once per inference batch (a quick back-of-the-envelope follows).
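
The back-of-the-envelope that made it look modest (Python; the model size and batch are illustrative assumptions, not production numbers):

weight-streaming estimate (Python, illustrative)
weights_bytes = 100e6            # a ~100M-parameter model at one byte per INT8 weight (assumed)
ddr3_bw = 34e9                   # bytes/s
batch = 256                      # inferences sharing one pass over the weights (assumed)

pass_time = weights_bytes / ddr3_bw
print(pass_time * 1e3)           # ~2.9 ms to stream the whole weight set once
print(pass_time / batch * 1e6)   # ~11.5 us per inference, amortised over the batch

The catch, which section 08 makes explicit, is that latency targets cap the batch size in production serving, so the amortisation is rarely that generous.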

The shipping discipline

  • The team had 15 months from design start to hardware running in a datacenter.
  • A 28 nm chip with a DDR3 controller was a known-buildable thing.
  • Trade off optimal for shippable.

The actual numbers

Why Jouppi calls this the chip's most important lesson

The ISCA 2017 paper's roofline plot shows v1 sitting far below its compute roof on most workloads, pinned by DRAM bandwidth. Convnets were close to compute-bound; LSTMs were 5× below peak; embedding-heavy models were 10× below peak. The "what should v2 do?" question almost answered itself: add HBM. Every TPU since has had it.

06

The Brilliantly Minimal ISA

v1's instruction set is roughly twelve CISC instructions. The host CPU sends them over PCIe; the chip executes them in order. There are no branches, no condition codes, no exceptions, no virtual memory, and no caches.

TPU v1 — representative instruction set (illustrative; mnemonics follow ISCA 2017)
// Bring weight tile from DDR3 through the FIFO into the MMU
READ_WEIGHTS  <dram_addr>, <tile_shape>

// Bring activations from host memory into the Unified Buffer
READ_HOST_MEMORY <host_addr>, <ub_addr>, <size>

// Run the matmul: stream activations, drain to accumulators
MATRIX_MULTIPLY  <ub_in>, <acc_dst>, <rows>, <cols>

// Apply non-linearity / pool / norm and write back to Unified Buffer
ACTIVATE         <acc_src>, <ub_dst>, <op = relu | pool | norm>

// Send a tensor back to the host over PCIe
WRITE_HOST_MEMORY <ub_addr>, <host_addr>, <size>
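
What "the host does the control flow" means in practice: a hypothetical host-side driver loop for one dense layer, built only from the instructions above (Python; the TpuQueue class and its issue method are invented for illustration, not a real Google API):

host-side driver sketch (Python, names invented for illustration)
class TpuQueue:
    def issue(self, op, **args):
        print(op, args)                  # a real driver would push this over PCIe

def run_dense_layer(tpu, n_weight_tiles):
    tpu.issue("READ_HOST_MEMORY", host_addr=0x1000, ub_addr=0, size=65536)
    for t in range(n_weight_tiles):      # the host, not the chip, owns this loop
        tpu.issue("READ_WEIGHTS", dram_addr=t * 65536, tile_shape=(256, 256))
        tpu.issue("MATRIX_MULTIPLY", ub_in=0, acc_dst=0, rows=256, cols=256)
    tpu.issue("ACTIVATE", acc_src=0, ub_dst=65536, op="relu")
    tpu.issue("WRITE_HOST_MEMORY", ub_addr=65536, host_addr=0x2000, size=65536)

run_dense_layer(TpuQueue(), n_weight_tiles=4)   # no branch or loop ever reaches the chip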

What's not in the ISA

The Jouppi-MIPS lineage in plain sight

This is RISC discipline taken to its endpoint. The MIPS lesson was "compiler does the scheduling, hardware does the work". v1 takes it a step further: the host CPU does the control flow, the TPU just does the work. Every instruction is a coarse, well-typed action. The chip never has to make a runtime decision — which is why it could ship in 15 months with confidence.

07

PCIe Gen3 As The System Bus

v1 is a PCIe Gen3 ×16 card. The host link runs at ~16 GB/s in each direction, of which ~12.5 GB/s is usable bandwidth.

What flows over PCIe

Why this works for inference

An inference query is: 1 small input tensor in → many matmuls on-chip → 1 small output tensor out. PCIe is fine for the small tensors at the boundary. The on-chip work dominates by orders of magnitude. An H100 GPU faces the same bandwidth picture and reaches the same conclusion — the host bus is for control and I/O, not for the inner loop.
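
Rough numbers behind that claim (Python; the tensor sizes and achieved throughput are illustrative assumptions for a single CNN-style query):

boundary traffic vs on-chip work (Python, illustrative)
pcie_bw = 12.5e9          # usable bytes/s each way
in_bytes = 224 * 224 * 3  # one image-sized input tensor (assumed)
out_bytes = 1000          # a small logits vector (assumed)
model_ops = 3e9           # matmul ops per inference for a mid-2010s CNN (assumed)
achieved = 20e12          # ops/s actually delivered, well under the 92 TOPS peak (assumed)

print((in_bytes + out_bytes) / pcie_bw * 1e6)   # ~12 us crossing PCIe
print(model_ops / achieved * 1e6)               # ~150 us of on-chip matmul

Batching widens the gap further: the boundary transfer grows linearly while the on-chip work stays the dominant term.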

Why this stops working for training

For training, gradients need to be all-reduced across a cluster of chips. If the only inter-chip path is PCIe-to-PCIe through the host CPU, you spend more time on the host than on the chip. v2 fixes this with ICI — a custom chip-to-chip interconnect that bypasses the host entirely. That single change is what turns the TPU from an inference card into a training supercomputer.

08

The Roofline — Why v1 Was Memory-Bound

The most-quoted figure from the ISCA 2017 paper is the roofline plot: operational intensity (ops per byte) on the X axis, achieved performance (ops per second) on the Y axis. Peak compute is a horizontal roof at 92 TOPS; the memory roof rises from the origin with slope 34 GB/s.

[Roofline, TPU v1 (log-log, schematic): compute roof at 92 TOPS; DDR3 memory roof at 34 GB/s; workload points at LSTM0 ≈ 14 ops/byte, LSTM1 ≈ 22, MLP ≈ 85, CNN0 ≈ 620, CNN1 ≈ 2,900 (compute-bound).]

The plot is schematic; the workload labels follow Jouppi et al. CNN1 (which is convolution-heavy, like AlexNet) sits at the kink — achieving close to peak. LSTMs sit far below the memory roof — bandwidth-starved. Half the production workloads in 2016 looked more like the LSTMs than the CNNs.
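
The model behind the plot is one line of arithmetic (Python; the roofs are v1's published numbers, the evaluation points are the schematic's labels):

roofline arithmetic (Python)
PEAK_OPS = 92e12          # compute roof, ops/s
DDR3_BW = 34e9            # memory roof slope, bytes/s

def attainable(ops_per_byte):
    return min(PEAK_OPS, DDR3_BW * ops_per_byte)

print(PEAK_OPS / DDR3_BW)         # ~2706 ops/byte: the kink where the two roofs meet
print(attainable(2900) / 1e12)    # 92.0: a CNN1-like intensity clears the kink, compute-bound
print(attainable(22) / 1e12)      # 0.75: the ceiling DDR3 imposes at an LSTM-like intensity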

The lesson that designs every later TPU

v1's cleanest architectural lesson, repeated explicitly in the CACM 2020 paper for v2/v3: your peak FLOPS number is irrelevant if your bandwidth roof is below it for the workloads you actually run. Every later TPU pays disproportionately more for memory than for compute — HBM2 in v2/v3, HBM2 + CMEM in v4/v4i, HBM3e in Ironwood at 7.4 TB/s per chip. The compute is easy; feeding it is hard.

09

Workloads in 2015–2017

What v1 actually ran in production. From the ISCA paper plus subsequent disclosures:

System | Model class | Why TPU
RankBrain (Search) | Deep neural net for query interpretation | Latency-sensitive; 10s of billions of queries/day; perf/W matters more than peak.
Google Translate (NMT) | Encoder-decoder LSTM (later GNMT) | Large model parameters; bandwidth-bound on v1; still cheaper than CPU.
Google Photos | CNN classifier / labeller | Compute-bound — near-ideal v1 workload (CNN1 in the roofline).
Street View | CNN OCR | Compute-bound batch inference at huge scale.
Voice / Now / Assistant | Acoustic model + language model | Latency-sensitive, real-time.
AlphaGo | Policy + value CNNs | The headline workload — Lee Sedol match, March 2016.

The fleet picture

By the time of the May 2016 announcement, v1s were running in tens of thousands in Google datacenters — an actual production fleet, not a benchmark. The ISCA 2017 paper presents measured performance per workload across the production fleet; that's why the numbers are quoted across very specific named models and not synthetic benchmarks.

Why this matters for the rest of the industry

v1 is the first time a custom AI ASIC was demonstrated at hyperscale, with measured wins, on real workloads. It is the existence-proof that justifies every other AI chip startup's pitch deck since 2016. The right comparison is "v1 vs Haswell + K80 in production" — not "v1 vs anything in a benchmark".

10

What v1 Got Wrong (And v2 Fixed)

How quickly the lessons turn into silicon

v1 ships to datacenter mid-2015. v2 announced May 2017 — about 22 months later. Every architectural fix above is in v2. This is the cadence the TPU programme has held ever since: ship, learn, fix in the next chip, ship again.

11

What v1 Got Right (And v7 Still Has)

The continuity argument

If you walked an engineer from the v1 team into a 2025 TPU bring-up lab, almost everything would be familiar in form: a weight FIFO, a systolic MMU, a vector unit, a software-managed buffer, a host-issued instruction stream, a roofline plot pinned to a wall. The dimensions are different by orders of magnitude; the ideas are not.

12

Cheat Sheet

Read next

Deck 05 — v2 & v3, the Training Era walks through the chips that fixed v1's biggest limitations: HBM, bf16, the 2D torus pod, ICI.