A 28 nm PCIe card with 65,536 8-bit MACs, 24 MiB of on-chip SRAM, 8 GiB of DDR3, and almost no instructions. The chip that proved a custom AI ASIC could ship.
From Jouppi et al. ISCA 2017, the chip's logical structure is unusually clean. Five blocks plus a control unit.
The whole machine is a streaming pipeline: host CPU dispatches a CISC instruction over PCIe, weights are pulled from DDR3 through a FIFO into the MMU, activations come from the on-chip Unified Buffer, the systolic array does the matmul, results land in the 4 MiB accumulator SRAM, the activation pipeline applies ReLU/pool/normalize, and the result writes back into the Unified Buffer for the next layer.
The MMU is the chip's centre of gravity — the part where almost all the silicon and almost all the power live.
Before each matmul the host issues a READ_WEIGHTS instruction. The Weight FIFO stages a 256×256 tile from DDR3, and once a tile is fully staged it is committed into the array's weight registers. Activations then stream in from the Unified Buffer along the array's left edge, skewed by row to align with the systolic schedule. Partial sums propagate top-to-bottom and emerge into the accumulators.
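A minimal sketch of the weight-stationary dataflow, in plain Python with toy sizes (the real array is 256×256 and keeps many activation vectors in flight; everything below is illustrative, not the actual microarchitecture): weights sit still in the cells, each activation enters its row one cycle later than the row above, and partial sums ripple downward into the accumulators.

```python
# Toy weight-stationary systolic array. Weights stay put in the cells,
# activations enter from the left skewed by one cycle per row, partial sums
# flow down a column and drain into the "accumulators".
N = 4
W = [[(i * 7 + j * 3) % 5 - 2 for j in range(N)] for i in range(N)]  # preloaded weight tile
a = [(2 * i) % 3 - 1 for i in range(N)]                              # one activation vector

act  = [[0] * N for _ in range(N)]   # activation register in each cell (moves right)
psum = [[0] * N for _ in range(N)]   # partial-sum register in each cell (moves down)
out  = [None] * N                    # what drains out of the bottom of each column

for t in range(2 * N - 1):
    new_act  = [[0] * N for _ in range(N)]
    new_psum = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            # Left edge: activation a[i] enters row i at cycle i (the skew).
            a_in = (a[i] if t == i else 0) if j == 0 else act[i][j - 1]
            p_in = 0 if i == 0 else psum[i - 1][j]
            new_act[i][j]  = a_in                    # pass activation to the right
            new_psum[i][j] = p_in + W[i][j] * a_in   # MAC, pass partial sum down
    act, psum = new_act, new_psum
    for j in range(N):
        if t == (N - 1) + j:                         # column j completes on this cycle
            out[j] = psum[N - 1][j]

# Same answer as an ordinary matrix-vector product.
assert out == [sum(a[i] * W[i][j] for i in range(N)) for j in range(N)]
```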
700 MHz looks low next to a 2014 Xeon (3+ GHz). It's deliberate: 65,536 multipliers all switching simultaneously is a power and clock-distribution problem, not a transistor problem. Doubling the clock would have doubled power without proportionally doubling throughput once thermal limits hit. The TPU trades clock for parallelism — same total work, lower V²f.
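A rough first-order model makes the trade explicit (these are generic CMOS scaling terms, not published TPU numbers):

$$
P_{\text{dyn}} \approx \alpha \,(N_{\text{MAC}}\, C_{\text{cell}})\, V^{2} f,
\qquad
\text{throughput} \propto N_{\text{MAC}}\, f .
$$

At fixed throughput the product $N_{\text{MAC}} f$ is constant, so $P_{\text{dyn}} \propto V^{2}$: a wide, slow array can run at a lower supply voltage than a narrow, fast one, which is exactly the lower-$V^2 f$ trade described above.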
The Unified Buffer is the chip's working set for activations. It is software-visible and software-managed — not a cache. Every read and write is explicit in the program.
The scratchpad model only works if you have a compiler that can plan the working set ahead of time. ML graphs — with their statically known shapes — are exactly the workload where this is tractable. Try the same approach on a database query and you'd lose; on a transformer it's a massive win.
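A hedged sketch of why static shapes make this tractable: with every tensor size known at compile time, a toy planner can pin Unified Buffer offsets before the program runs and fail loudly if a layer's working set does not fit. All names and sizes below are invented for illustration; this is not the real TPU/XLA toolchain.

```python
# Toy ahead-of-time planner for a software-managed buffer. With static shapes,
# every Unified Buffer offset is decided at compile time; there is no cache to miss.
UB_BYTES = 24 * 1024 * 1024                       # 24 MiB Unified Buffer

layers = [                                        # (name, activation bytes in, bytes out)
    ("conv1", 1_204_224, 802_816),
    ("conv2",   802_816, 401_408),
    ("fc1",     401_408,   4_096),
]

plan = {}
for idx, (name, in_bytes, out_bytes) in enumerate(layers):
    if in_bytes + out_bytes > UB_BYTES:
        raise ValueError(f"{name}: working set exceeds the Unified Buffer")
    # Ping-pong: even layers read at the bottom and write at the top, odd layers
    # the reverse, so layer k+1 reads exactly where layer k wrote.
    in_off  = 0 if idx % 2 == 0 else UB_BYTES - in_bytes
    out_off = UB_BYTES - out_bytes if idx % 2 == 0 else 0
    plan[name] = {"in_offset": in_off, "out_offset": out_off}

print(plan)   # offsets fixed before the program runs
```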
Once the MMU drains its outputs, two more dedicated units finish the layer: the Activation unit applies the non-linearity (ReLU and friends), and the Normalize/Pool unit handles pooling and normalization before the result is written back to the Unified Buffer.
No softmax, no attention, no division, no general-purpose vector unit. v1 was designed for the 2014 inference workload, which was almost entirely convolutions plus dense layers plus ReLU plus pooling. Anything fancier ran on the host CPU, which sometimes meant pulling tensors back across PCIe — expensive, and one of the first things v2 fixed by adding a real vector unit per TensorCore.
The Transformer paper (Vaswani et al.) did not appear until June 2017, after v1 was already in production. v1 has no native softmax, no efficient broadcast for attention scores, and no fast small-matmul path. That gap did not hurt the workloads v1 was built for (AlphaGo's policy/value networks are convnets), but it was already obvious by 2016 that the next chip would need different building blocks.
Using DDR3 for the weight memory is the single most second-guessed decision in v1's design. With hindsight everyone knows v1 was bandwidth-bound; at the time the choice looked reasonable, and the roofline data shows exactly what it cost:
The ISCA 2017 paper's roofline plot shows v1 sitting far below its compute roof on most workloads, pinned by DRAM bandwidth. Convnets were close to compute-bound; LSTMs were 5× below peak; embedding-heavy models were 10× below peak. The "what should v2 do?" question almost answered itself: add HBM. Every TPU since has had it.
v1's instruction set is roughly a dozen CISC instructions. The host CPU sends them over PCIe; the chip executes them in order. There are no branches, no condition codes, no exceptions, no virtual memory, and no caches.
// Bring weight tile from DDR3 through the FIFO into the MMU
READ_WEIGHTS <dram_addr>, <tile_shape>
// Bring activations from host memory into the Unified Buffer
READ_HOST_MEMORY <host_addr>, <ub_addr>, <size>
// Run the matmul: stream activations, drain to accumulators
MATRIX_MULTIPLY <ub_in>, <acc_dst>, <rows>, <cols>
// Apply non-linearity / pool / norm and write back to Unified Buffer
ACTIVATE <acc_src>, <ub_dst>, <op = relu | pool | norm>
// Send a tensor back to the host over PCIe
WRITE_HOST_MEMORY <ub_addr>, <host_addr>, <size>
This is RISC discipline taken to its endpoint. The MIPS lesson was "compiler does the scheduling, hardware does the work". v1 takes it a step further: the host CPU does the control flow, the TPU just does the work. Every instruction is a coarse, well-typed action. The chip never has to make a runtime decision — which is part of why it could ship in 15 months with confidence.
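A hedged sketch of what "the host does the control flow" means in practice: a host-side routine lowers one dense layer into the five instructions above and pushes them over PCIe in program order. The function, offsets, and shapes are invented for illustration, not the real driver API.

```python
# Illustrative host-side sequencing of one dense layer on the v1 ISA.
# enqueue() stands in for "write one CISC instruction into the PCIe command
# queue"; the offsets and shapes are made up for the example.
def run_dense_layer(enqueue, weights_dram, x_host, y_host, rows, cols):
    UB_IN, UB_OUT, ACC = 0x0000, 0x4000, 0x0           # invented buffer offsets

    enqueue("READ_WEIGHTS",      weights_dram, (rows, cols))   # DDR3 -> FIFO -> MMU
    enqueue("READ_HOST_MEMORY",  x_host, UB_IN, rows)          # host -> Unified Buffer
    enqueue("MATRIX_MULTIPLY",   UB_IN, ACC, rows, cols)       # stream through the array
    enqueue("ACTIVATE",          ACC, UB_OUT, "relu")          # accumulators -> UB
    enqueue("WRITE_HOST_MEMORY", UB_OUT, y_host, cols)         # UB -> host (final layer only)

# The chip sees a straight-line instruction stream: no branches, no speculation.
# Loops over layers, tiles and batches all live up here, on the host.
run_dense_layer(lambda *ins: print(*ins), 0x8000_0000, 0x1000, 0x2000, 256, 256)
```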
v1 is a PCIe Gen3 ×16 card. The host link runs at ~16 GB/s in each direction, of which ~12.5 GB/s is usable bandwidth.
An inference query is: 1 small input tensor in → many matmuls on-chip → 1 small output tensor out. PCIe is fine for the small tensors at the boundary. The on-chip work dominates by orders of magnitude. An H100 GPU faces the same bandwidth picture and reaches the same conclusion — the host bus is for control and I/O, not for the inner loop.
For training, gradients need to be all-reduced across a cluster of chips. If the only inter-chip path is PCIe-to-PCIe through the host CPU, you spend more time on the host than on the chip. v2 fixes this with ICI — a custom chip-to-chip interconnect that bypasses the host entirely. That single change is what turns the TPU from an inference card into a training supercomputer.
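Rough arithmetic behind both claims, using the usable-bandwidth figure above and round-number tensor sizes that are assumptions rather than measurements:

```python
# Back-of-envelope: why PCIe is fine for inference I/O but painful for training.
PCIE_BPS = 12.5e9                        # usable host-link bandwidth, bytes/s

# Inference: one query's boundary tensors are tiny relative to the link.
infer_in, infer_out = 100e3, 10e3        # ~100 KB in, ~10 KB out (assumed sizes)
print((infer_in + infer_out) / PCIE_BPS * 1e6, "us of PCIe per query")        # ~9 us

# Training: every step moves the full gradient off and back onto each chip.
params     = 100e6                       # assume a 100M-parameter model
grad_bytes = params * 4                  # fp32 gradients
print(2 * grad_bytes / PCIE_BPS * 1e3, "ms of PCIe per step per chip")        # ~64 ms
```

Tens of microseconds of PCIe per query disappears behind the on-chip matmuls; tens of milliseconds per training step, serialized through the host, does not.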
The most-quoted figure from the ISCA 2017 paper is the roofline plot. The recipe: plot operational intensity (ops per byte) on the X axis and achieved performance (ops per second) on the Y axis. The compute roof is a horizontal line at 92 TOPS; the memory roof rises from the origin with slope 34 GB/s.
The plot is schematic; the workload labels follow Jouppi et al. CNN1 (which is convolution-heavy, like AlexNet) sits at the kink — achieving close to peak. LSTMs sit far below the memory roof — bandwidth-starved. Half the production workloads in 2016 looked more like the LSTMs than the CNNs.
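The roofline itself is one line of arithmetic. Here it is with the numbers quoted above; the two operational intensities are placeholders to show where points land, not the paper's measured values.

```python
# Attainable performance = min(compute roof, bandwidth x operational intensity).
PEAK_OPS = 92e12          # 92 TOPS of 8-bit ops
BW       = 34e9           # 34 GB/s of DDR3 weight bandwidth

def roofline(ops_per_byte):
    return min(PEAK_OPS, BW * ops_per_byte)

ridge = PEAK_OPS / BW     # ~2700 ops/byte: the kink where the two roofs meet
for name, intensity in [("cnn-like", 2500), ("lstm-like", 200)]:  # assumed intensities
    print(name, roofline(intensity) / 1e12, "TOPS attainable")
# cnn-like  -> 85.0 TOPS (just under the compute roof)
# lstm-like ->  6.8 TOPS (pinned far below it by DRAM bandwidth)
```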
v1's cleanest architectural lesson, repeated explicitly in the CACM 2020 paper for v2/v3: your peak FLOPS number is irrelevant if your bandwidth roof is below it for the workloads you actually run. Every later TPU pays disproportionately more for memory than for compute — HBM2 in v2/v3, HBM2 + CMEM in v4/v4i, HBM3e in Ironwood at 7.4 TB/s per chip. The compute is easy; feeding it is hard.
What v1 actually ran in production. From the ISCA paper plus subsequent disclosures:
| System | Model class | Why TPU |
|---|---|---|
| RankBrain (Search) | Deep neural net for query interpretation | Latency-sensitive; 10s of billions of queries/day; perf/W matters more than peak. |
| Google Translate (NMT) | Encoder-decoder LSTM (later GNMT) | Large model parameters; bandwidth-bound on v1; still cheaper than CPU. |
| Google Photos | CNN classifier / labeller | Compute-bound — near-ideal v1 workload (CNN1 in the roofline). |
| Street View | CNN OCR | Compute-bound batch inference at huge scale. |
| Voice / Now / Assistant | Acoustic model + language model | Latency-sensitive, real-time. |
| AlphaGo | Policy + value CNNs | The headline workload — Lee Sedol match, March 2016. |
By the time of the May 2016 announcement, v1 cards were running by the tens of thousands in Google datacenters — an actual production fleet, not a benchmark rig. The ISCA 2017 paper reports measured per-workload performance across that production fleet, which is why the numbers are quoted for specific named models rather than synthetic benchmarks.
v1 is the first time a custom AI ASIC was demonstrated at hyperscale, with measured wins, on real workloads. It is the existence-proof that justifies every other AI chip startup's pitch deck since 2016. The right comparison is "v1 vs Haswell + K80 in production" — not "v1 vs anything in a benchmark".
v1 shipped to datacenters in mid-2015; v2 was announced in May 2017, about 22 months later. Every architectural fix above is in v2. This is the cadence the TPU programme has held ever since: ship, learn, fix in the next chip, ship again.
If you walked an engineer from the v1 team into a 2025 TPU bring-up lab, almost everything would be familiar in form: a weight FIFO, a systolic MMU, a vector unit, a software-managed buffer, a host-issued instruction stream, a roofline plot pinned to a wall. The dimensions are different by orders of magnitude; the ideas are not.
Deck 05 — v2 & v3, the Training Era walks through the chips that fix v1's biggest limitations: HBM, bf16, the 2D-torus pod, and ICI.