2017–2018: HBM arrives, bfloat16 is invented, the chip grows two TensorCores, the pod becomes a 2D torus, and Google has to rebuild its datacenter cooling.
By 2016 the bottleneck at Google had moved. Inference was solved — v1 was running tens of thousands of inferences per second per chip. The new constraint was training time: weeks on big GPU clusters for each new model rev. The team's question shifted from "what does an inference ASIC look like?" to "what does a training supercomputer look like?".
Each of those features is a major silicon and system-design change. v2 is not a v1 refresh; it is a clean redesign with the same DNA.
v2 is two-cores-on-a-chip. Each TensorCore has its own MXU, vector unit, scalar unit, and slice of HBM. They run independently and can synchronise via the on-chip network.
One chip = two TensorCores; each TensorCore has compute, scratchpad, and its own slice of HBM. ICI sits at the bottom and connects to four neighbours.
bfloat16 is the most quietly influential numeric format in modern computing. It was invented inside Google Brain specifically so that mixed-precision training would just work, and is now native in every modern AI chip.
Gradients have an enormous dynamic range — some weights see updates of 10⁻⁷, some of 10⁻¹. FP16's 5-bit exponent can't span that. bf16 keeps FP32's exponent and just throws away the bottom 16 bits of the mantissa — giving up precision (where neural nets are robust) to keep range (where they aren't). The tradeoff turns out to be exactly the right one.
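A quick way to see the trade concretely, using the bfloat16 type exposed by `jax.numpy` (the values are illustrative, not TPU-specific):

```python
import jax.numpy as jnp

# Range: bf16 keeps FP32's 8-bit exponent; FP16 only has 5 exponent bits.
print(jnp.finfo(jnp.bfloat16).max)   # ~3.39e38, same order as FP32
print(jnp.finfo(jnp.float16).max)    # 65504.0

# A tiny gradient-sized value survives in bf16 but flushes to zero in fp16.
print(jnp.bfloat16(1e-8))            # ~1e-08
print(jnp.float16(1e-8))             # 0.0

# The price: only 7 mantissa bits, so roughly 2-3 decimal digits of precision.
print(jnp.bfloat16(1.0) + jnp.bfloat16(0.003))   # rounds back down to 1
```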
bf16 went from a Google-internal format in 2017 to a de facto industry standard within a few years. NVIDIA, Intel (with their Cooper Lake AVX-512 BF16 instructions), Arm (with the BFloat16 extension in Armv8.6-A), AMD — all support it. The TPU forced a numeric standard on the rest of the industry.
v2 is the first TPU with multiple cores on one die. The "TensorCore" terminology that appears later in NVIDIA's marketing is unrelated — in TPU language a TensorCore is a complete sub-chip with its own compute, scratchpad, and HBM partition.
v2's vector unit is the first part of the chip that's not for matmul. Softmax, layer-norm, optimiser updates (Adam moment estimates, learning-rate schedules), gradient clipping, masking — all of these now run on-chip. The host CPU shrinks back to a control-plane role.
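A minimal JAX sketch of what "all on-chip" means in practice: one jitted step containing the matmul, the softmax, and a momentum-style optimiser update. The shapes are made up, and the MXU / vector-unit comments describe the intended mapping; XLA decides the actual one.

```python
import jax
import jax.numpy as jnp

@jax.jit  # one compiled program; no per-op round trips to the host
def step(w, m, x, y, lr=1e-3, beta=0.9):
    def loss_fn(w):
        logits = x @ w                      # matmul: the MXU's job
        logp = jax.nn.log_softmax(logits)   # softmax family: vector unit
        return -jnp.mean(jnp.sum(y * logp, axis=-1))
    g = jax.grad(loss_fn)(w)
    m = beta * m + (1.0 - beta) * g         # optimiser moment update: vector unit
    return w - lr * m, m

w = jnp.zeros((16, 4)); m = jnp.zeros_like(w)
x = jnp.ones((8, 16)); y = jax.nn.one_hot(jnp.zeros(8, dtype=jnp.int32), 4)
w, m = step(w, m, x, y)
```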
Every TensorCore has one. It manages instruction-stream sequencing, address computation, masks, and control flow that doesn't make sense to vectorise. It is also the core's "VLIW issue logic": the static schedule emitted by XLA is consumed by the scalar unit and dispatched to the MXU and vector unit each cycle.
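If you want to peek at that compiled program, JAX's ahead-of-time API will dump it as text. This shows the optimised module that feeds the backend's static code generation, not the scalar unit's final instruction stream, and it works on any backend:

```python
import jax
import jax.numpy as jnp

def f(x, w):
    return jax.nn.relu(x @ w)

lowered = jax.jit(f).lower(jnp.ones((8, 16)), jnp.ones((16, 4)))
print(lowered.compile().as_text()[:500])   # start of the compiled module listing
```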
The single biggest change vs v1. v2 ships with two HBM stacks per chip — one per TensorCore.
| | v1 (DDR3) | v2 (HBM) | v3 (HBM) |
|---|---|---|---|
| Capacity | 8 GiB | 16 GiB | 32 GiB |
| Bandwidth | ~34 GB/s | ~600 GB/s | ~900 GB/s |
| Organisation | 2 channels DDR3-2133 | 2 stacks HBM (1 per core) | 2 stacks HBM |
| Bandwidth uplift vs v1 | 1.0× | ~17.6× | ~26.5× |
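A hedged back-of-envelope using only the numbers in the table above: the uplift ratios, plus how long streaming the entire HBM once would take at peak bandwidth.

```python
# Numbers copied from the table; GiB converted to GB for the streaming time.
gens = {"v1": (8, 34.0), "v2": (16, 600.0), "v3": (32, 900.0)}   # (GiB capacity, GB/s)
base_bw = gens["v1"][1]
for name, (cap_gib, bw) in gens.items():
    stream_ms = cap_gib * 1.074 / bw * 1e3
    print(f"{name}: {bw / base_bw:5.1f}x bandwidth vs v1, "
          f"~{stream_ms:.0f} ms to stream all of HBM once")
```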
HBM is, per gigabyte, the most expensive memory in volume production. In 2017 it was 6–10× the price per GB of GDDR5. Putting it on every TPU permanently changed the chip's cost structure. Every TPU since has reflected the trade: more die area devoted to HBM I/O than to any other interface, and more HBM stacks per chip with each generation.
The other transformative addition. ICI is Google's custom chip-to-chip link, sitting outside the PCIe path. It is the thing that turns a chip into a pod.
InfiniBand's per-port silicon (HCAs, switch ASICs) is a real cost adder, and PCIe-attached HCAs add latency and contend with host traffic.
RoCE has improved but switch-traversal latency is still 1–2 μs. Inside a torus you want sub-100 ns per hop.
NVLink is NVIDIA proprietary and only between NVIDIA chips. Even if Google had wanted it, the scale-up to 256 / 1024 chips wasn't there in 2017.
ICI is the defining component of a TPU pod. NVIDIA later builds NVLink-Switch / NVL72 to compete; AMD's Infinity Fabric is its parallel. As of 2026 ICI is in its 7th generation.
v2 ships with a fixed pod shape: 256 chips arranged in a 16×16 2D torus, 4 chips per board, 64 boards per pod.
The pod aggregate at v2 is ~11.5 PFLOPS bf16, with 4 TiB of HBM. At v3 the pod is 1024 chips (32×32 torus) and aggregates to over 100 PFLOPS bf16.
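A rough sketch of why per-hop latency matters at these sizes, using the worst-case hop count of an n×n torus and the per-hop figures quoted earlier (~100 ns assumed for ICI, ~1.5 µs assumed for a switched fabric):

```python
# Worst-case path length in an n x n torus: n//2 hops in each dimension (wraparound links).
def torus_diameter(n):
    return 2 * (n // 2)

ICI_HOP_NS, SWITCH_HOP_NS = 100, 1500   # assumed per-hop latencies
for n, name in [(16, "v2 pod"), (32, "v3 pod")]:
    hops = torus_diameter(n)
    print(f"{name} ({n}x{n}): worst case {hops} hops -> "
          f"~{hops * ICI_HOP_NS} ns on ICI vs ~{hops * SWITCH_HOP_NS / 1000:.0f} us switched")
```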
v3 is the same 16 nm node as v2 but everything inside is doubled. Two MXUs per TensorCore (instead of one), 32 GiB HBM (instead of 16), 1024-chip pod (instead of 256), faster clock.
| | v2 | v3 |
|---|---|---|
| Process | 16 nm | 16 nm |
| TensorCores per chip | 2 | 2 |
| MXUs per TensorCore | 1 | 2 |
| MXU dim | 128×128 | 128×128 |
| Per-chip bf16 | 45 TFLOPS | 123 TFLOPS |
| HBM | 16 GiB | 32 GiB |
| HBM BW | ~600 GB/s | ~900 GB/s |
| Pod size | 256 chips (16×16) | 1024 chips (32×32) |
| Cooling | Air | Liquid |
| Pod aggregate | ~11.5 PFLOPS | ~126 PFLOPS |
Doubling the systolic array from 128×128 to 256×256 would quadruple the PE count and require redoing the entire weight-FIFO and activation-stream paths. Going from 1 to 2 MXUs per core is a copy-and-paste operation in the floorplan. It also lets the compiler issue two matmuls in parallel within a TensorCore — useful for attention's QKᵀ and softmax(·)V operations, which can run side by side.
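A minimal single-head attention sketch that labels the two matmuls in question. The shapes are made up, and the MXU / vector-unit comments describe the intended mapping rather than anything the code enforces:

```python
import jax
import jax.numpy as jnp

def attention(q, k, v):
    scores = q @ k.T / jnp.sqrt(q.shape[-1])   # matmul #1: Q Kᵀ   -> MXU
    weights = jax.nn.softmax(scores, axis=-1)  # softmax          -> vector unit
    return weights @ v                         # matmul #2: (·) V -> MXU

q = k = v = jnp.ones((128, 64), dtype=jnp.bfloat16)   # made-up shapes
out = jax.jit(attention)(q, k, v)
```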
v3's per-chip bf16 (123 TFLOPS) is 2.7× v2's (45 TFLOPS) — doubled MXUs plus a higher clock plus better issue rates. Pod-level the jump is even larger because the pod itself grew 4× in chip count.
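Checking both claims against the table's own numbers:

```python
v2_chip, v3_chip = 45, 123            # TFLOPS bf16 per chip, from the table
v2_pod, v3_pod = 256, 1024            # chips per pod
print(v3_chip / v2_chip)                         # ~2.7x per chip
print(v3_chip * v3_pod / (v2_chip * v2_pod))     # ~10.9x per pod
```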
v3 was the first liquid-cooled accelerator deployed at hyperscale. Per-chip TDP estimates put v3 at 200–250 W; multiply that across a densely packed 1,024-chip pod and rack-level power dissipation moves past anything air cooling would tolerate.
This was a major datacenter operational change. Google's existing fleet was air-cooled; v3 required a parallel liquid-coolant infrastructure that did not exist before. By 2018 several Google datacenters were dedicated TPU sites built specifically around the new cooling.
Every modern AI accelerator is now liquid-cooled at the rack level — NVIDIA NVL72, AMD MI300X clusters, AWS Trainium2 racks. Google was first by 5+ years. The expertise transfers up the stack: when Ironwood pods land in 2025 with 9,216 chips dissipating ~5.5 MW each, the cooling story is a refinement, not a redesign.
The 2017–2020 wave of large NLP models is, almost without exception, a TPU v2/v3 story.
| Model | Year | Hardware | Notes |
|---|---|---|---|
| GNMT (Google Neural Machine Translation) | 2016–17 | v1 inference, v2 training | The migration from RNN-LSTM seq2seq to Transformer happens over the v2/v3 era. |
| BERT-Base / Large (Devlin et al.) | Oct 2018 | v3 | Trained on a v3 pod — the paper credits "Cloud TPU v3 Pod". 4 days for BERT-Large. |
| T5 (Raffel et al.) | Oct 2019 | v3 | 11B parameters at the largest. Pre-trained on 1024-chip v3 pods. |
| Meena / LaMDA | 2020–21 | v3 / early v4 | Conversational models leading toward Bard/Gemini. |
| MUM | 2021 | v4 | Multimodal; trained as v4 came online. |
v2 and v3 were the first TPUs sold as a product. Cloud TPU went GA in February 2018 (v2) and March 2019 (v3). The TensorFlow Research Cloud programme gave free TPU access to academic researchers; whole sub-fields of language and vision research happened on these chips. If you read a 2018–2020 paper that says "trained on Cloud TPU", it's a v2 or v3 pod.
The v4 paper (ISCA 2023) reads, as a result, like a list of the things that had started to hurt on v3 for teams running PaLM-class workloads. Each item gets its own architectural fix.
Deck 06 — v4, OCS & SparseCore covers the chip that turns the TPU pod into a true machine-learning supercomputer.