NVIDIA GPU Architectures Series — Presentation 23

HBM — Inside the Stacks That Feed the Tensor Cores

Eight DRAM dies vertically stacked above a logic die, connected by 1024+ TSVs, sitting less than a centimetre from the GPU across a silicon interposer. We walk through the HBM internals — channels, banks, refresh, ECC modes, the PHY, and CoWoS packaging — that decide whether you actually see the spec-sheet bandwidth.

HBM2e · HBM3 · HBM3e · HBM4 · TSV · Channel · Bank · Refresh · ECC · PHY · CoWoS · Interposer
00

Topics We'll Cover

HBM is the most expensive component on a modern datacenter GPU and the one that most often decides whether you're compute-bound or memory-bound. Here's how it actually works — from cell to interposer.

01

Why HBM Exists — HBM vs GDDR

Both approaches reach roughly the same total throughput. They differ in how they get there — bus width vs frequency — and that single choice ripples through power, area, capacity, and packaging cost.

GDDR6X / GDDR7 (consumer)

A row of 8–12 discrete DRAM packages on the GPU PCB, each with its own 32-bit interface (organised as two independent 16-bit channels). Frequency does the heavy lifting: GDDR6X runs 21–24 Gbps/pin with PAM4 signalling, GDDR7 32–40 Gbps/pin with PAM3.

  • Aggregate bus per chip: 32-bit
  • Per-card bus: 192–512 bits
  • Power: ~7 pJ/bit
  • Capacity: 16–32 GB (consumer-class)
  • Packaging: standard FCBGA on PCB — cheap

HBM2e / HBM3 / HBM3e (datacenter)

A vertical stack of 8–12 DRAM dies sitting on a silicon interposer right next to the GPU die. Width does the heavy lifting: 1024 bits per stack, frequency only ~9 Gb/s/pin on HBM3e.

  • Aggregate bus per stack: 1024-bit
  • Per-GPU bus: 5120–8192 bits
  • Power: ~3 pJ/bit
  • Capacity: 80–192 GB (H100, B200)
  • Packaging: CoWoS silicon interposer — expensive
Same throughput, different shape

An RTX 5090 (GDDR7, 512-bit, ~1.8 TB/s) and an H100 SXM5 (5 HBM3 stacks, ~3.4 TB/s) live on opposite ends of the same trade. GDDR keeps every transistor on the GPU die fed by signalling fast over a narrow bus; HBM gives up frequency and instead opens a fire-hose-wide bus across centimetres of silicon interposer. HBM wins on bandwidth-per-watt and bandwidth-per-volume; GDDR wins on cost-per-GB and yield.

8× GDDR @ 32-bit:  256 bits × 24 Gb/s ≈ 768 GB/s
vs
1× HBM3 stack:    1024 bits × 6.4 Gb/s ≈ 819 GB/s
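The arithmetic behind both figures is the same one-liner — bus width in bits times per-pin rate in Gb/s, divided by 8. A quick sketch to check the comparison above:

```python
def bandwidth_gbps(bus_bits: int, pin_gbps: float) -> float:
    """Peak bandwidth in GB/s: bus width (bits) x per-pin rate (Gb/s) / 8."""
    return bus_bits * pin_gbps / 8

# 8 GDDR-class chips at 32 bits each, 24 Gb/s/pin
gddr = bandwidth_gbps(8 * 32, 24.0)   # 768.0 GB/s
# one HBM3 stack: 1024 bits, 6.4 Gb/s/pin
hbm = bandwidth_gbps(1024, 6.4)       # 819.2 GB/s
print(gddr, hbm)
```

Same order of magnitude, opposite corners of the width-vs-frequency trade.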

The HBM trade is paid in silicon interposer area. The 2.5D substrate carrying GPU + HBM stacks is itself a wafer-fabbed silicon die, with thousands of fine-pitch traces routed at near-on-die density. Yield falls with area, the reticle limit (~830 mm²) caps how big it can be, and supplier capacity (TSMC CoWoS) is currently the binding constraint on HBM-class GPU production.

02

The Stack — Anatomy of HBM3e

One HBM3e stack at a glance: 8 (or 12) DRAM dies stacked above a single base/buffer die, total stack height roughly 720 µm, footprint about 8 × 11 mm. Vertical wiring is done with through-silicon vias — tungsten or copper pillars that pass through the silicon of every die above the bottom.

[Figure: HBM3e stack cross-section — eight 24 Gb DRAM dies (die 7 at the top down to die 0) above a base/buffer die (PHY, ECC, refresh FSM, repair); µ-bumps at ~55 µm pitch down to the silicon interposer (TSV grid + RDL routing), which also carries the GPU die; everything sits on the organic package substrate. 1024+ TSVs, total stack height ~720 µm. Vertical scale exaggerated — horizontal not to scale.]

Layer-by-layer responsibilities

Notice the asymmetry: the buffer die does real work; the DRAM dies above it are mostly passive arrays with their TSVs ganged together. Every channel's command stream lands on the buffer die, is decoded there, and is then broadcast up the TSVs to the right rank/die.

03

Channels & Pseudo-Channels

The 1024-bit bus is not a monolith — it's 16 independently scheduled channels of 64 bits each. Each channel has its own command/address bus, its own banks, and its own row buffer. The memory controller can issue 16 different reads/writes to the same stack on the same cycle.

[Diagram: one HBM3 stack, 1024-bit total bus → 16 independently scheduled channels (CH0–CH15, 64 b each) → each 64 b channel split into two 32-bit pseudo-channels (PC0a/PC0b, PC1a/PC1b, …) — 32 pseudo-channels per stack in total.]

Pseudo-channel details

HBM2 already had pseudo-channels (the JEDEC term has been around since 2016) but HBM3 sharpens the design: each 64-bit channel is split into two 32-bit pseudo-channels that share the data bus by time-slot but have independent command streams. So a small 32-byte transaction on PC0a can issue while PC0b is doing something else — doubled effective concurrency for fine-grained transfers.

Without pseudo-channels

One 64-bit channel = one address per cycle. A 32-byte read wastes half the bus — the burst is 64 B minimum (BL=8 × 64 b).

With pseudo-channels (HBM3)

Two 32-bit pseudo-channels under one physical channel. A 32-byte read uses BL=8 × 32 b on one half — full-bus utilisation. PC0a and PC0b can issue to different rows/banks simultaneously.
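The utilisation difference falls out of the burst size. A minimal sketch (burst bytes = bus bits / 8 × burst length; a request always occupies at least one full burst):

```python
def burst_bytes(channel_bits: int, burst_length: int) -> int:
    # one burst moves (channel_bits / 8) bytes per beat, for burst_length beats
    return channel_bits // 8 * burst_length

def utilisation(request_bytes: int, channel_bits: int, burst_length: int) -> float:
    # fraction of the burst that carries requested data
    return min(1.0, request_bytes / burst_bytes(channel_bits, burst_length))

# 32 B request on the full 64-bit channel (BL=8): half the burst is wasted
print(utilisation(32, 64, 8))   # 0.5
# same request on a 32-bit pseudo-channel (BL=8): full utilisation
print(utilisation(32, 32, 8))   # 1.0
```

This is the whole pseudo-channel argument in two numbers: small transactions waste half the bus on a 64-bit channel and none of it on a 32-bit pseudo-channel.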

Why this matters for GPUs: Tensor Cores stream large, contiguous tiles — pseudo-channels don't help much there. But KV-cache reads in attention (small, scattered) and indirect lookups (sparse MoE expert gathers) do hit pseudo-channels well. HBM3e refines pseudo-channel arbitration further; HBM4 widens the stack to 2048 bits but keeps the pseudo-channel concept.

A common confusion

HBM "channel" is not the same as a CUDA "memory channel" or an Ada/Hopper "memory partition". A Hopper H100 has 5 HBM stacks, 16 channels each = 80 HBM channels; CUDA exposes them as 12 memory partitions / 80 sub-controllers depending on which level of the stack you're inspecting. They're independent concepts at different abstraction layers.

04

Banks, Bank Groups, Rows, and Columns

Inside a single HBM channel, the addressing hierarchy is the same shape as DDR/LPDDR — bank groups → banks → rows → columns — just denser. Every memory operation is a sequence of activate, read/write column, and precharge.

[Diagram: addressing inside one HBM3 channel — 16 bank groups (BG0–BG15), 2 banks per bank group, 32 banks per channel total; each bank has ~64 K rows with an 8 KB row (page) width; column accesses are BL=8 with a 32 b beat on a pseudo-channel, i.e. 32 B / 64 B per access.]

Timing, simplified

Parameter | HBM3 typical | What it means
tRC       | ~32 ns       | Activate-to-activate, same bank — the floor for changing rows.
tRCD      | ~14 ns       | Activate to first column read — "row open" latency.
tRP       | ~14 ns       | Precharge to next activate — closing a row before opening another.
tCCDL     | ~3 ns        | Column-to-column delay within a bank group.
tCCDS     | ~2 ns        | Column-to-column delay across bank groups (faster).
tFAW      | ~16 ns       | Four-activation window — no more than four row activations this close together.
tREFI     | ~3.9 µs      | Refresh interval — one refresh command per bank must issue inside this window.
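The tFAW row explains most of the gap between streaming and random bandwidth. A first-order bound, using only the numbers in the table above (it ignores everything except the four-activations-per-window limit, so treat it as a ceiling, not a prediction):

```python
def random_access_cap_gbs(bytes_per_activate: int, tfaw_ns: float, channels: int) -> float:
    """Upper bound on bandwidth when every access opens a new row:
    at most 4 activates per tFAW window, per channel."""
    per_channel = bytes_per_activate * 4 / (tfaw_ns * 1e-9)   # bytes/s
    return per_channel * channels / 1e9

# 64 B read per newly opened row, tFAW = 16 ns, 16 channels per stack
print(random_access_cap_gbs(64, 16.0, 16))   # 256.0 GB/s vs ~819 GB/s streaming peak
```

Stream whole 8 KB rows per activate and tFAW never binds; read 64 B per activate and the same stack tops out at under a third of its peak.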

Row hits, bank-group parallelism, conflicts

The memory controller's life consists of three modes:

  • Row hit — the requested row is already open in the row buffer; only a column access (paced by tCCD) is needed. Cheapest.
  • Row miss — the bank is precharged with no row open; pay tRCD to activate, then read.
  • Row conflict — a different row is open in the bank; pay tRP to precharge plus tRCD to activate. Most expensive, and exactly what hostile strides trigger.

Modern GPU memory controllers are deep, out-of-order, and very aware of bank state. They reorder pending requests to maximise row hits, schedule refreshes during natural gaps, and prefer cross-bank-group accesses to hide latency. NVIDIA's MC subsystem (sometimes called the FBPA / FBIO partitions in Hopper docs) is one of the most carefully tuned blocks on the chip — and the reason "the same DRAM" performs differently on different GPUs.

Practitioner takeaway

If your kernel access pattern is strided with a stride that hashes to one bank, you'll bank-conflict yourself into the ground regardless of total bandwidth. CUDA's memory hashing across channels mostly hides this for tile-friendly workloads, but pathological strides (powers of 2 close to 64 KB, KV-cache layouts that don't tile across heads) still bite. ncu --set full shows DRAM read-throughput vs request count — that ratio tells you whether you're serving rows or thrashing them.
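A toy model makes the stride pathology concrete. The bank map below is hypothetical (real GPUs XOR-hash address bits, and the exact mapping is undocumented), but the failure shape is the same: a large power-of-2 stride revisits one bank while a line-sized stride spreads across all of them.

```python
def toy_bank(addr: int, banks: int = 32, line: int = 256) -> int:
    """Hypothetical bank map: low address bits select the bank.
    Real GPU controllers add XOR hashing on top, which softens but
    doesn't eliminate power-of-2 pathologies."""
    return (addr // line) % banks

def banks_touched(stride: int, n: int = 64) -> int:
    # how many distinct banks a strided sweep of n accesses hits
    return len({toy_bank(i * stride) for i in range(n)})

print(banks_touched(64 * 1024))   # 1  -- 64 KiB stride: every access, same bank
print(banks_touched(256))         # 32 -- line-sized stride: all banks in play
```

One bank means one row buffer serialising everything behind tRC; thirty-two banks means the controller can pipeline activates and keep the bus full.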

05

HBM Generation Numbers

The headline numbers per JEDEC spec. The "per pin" rate is the per-data-line transfer rate; multiply by 1024 (or 2048 for HBM4) and divide by 8 to get GB/s/stack.

Gen   | Year  | Gb/s/pin | Bus/stack | BW/stack       | Capacity/stack | Channels
HBM   | 2015  | 1.0      | 1024 b    | 128 GB/s       | 1–4 GB         | 8 (128 b each)
HBM2  | 2016  | 2.0–2.4  | 1024 b    | 256–307 GB/s   | 4–8 GB         | 8 (128 b) + 16 PCs
HBM2e | 2019  | 3.2–3.6  | 1024 b    | 410–460 GB/s   | 8–16 GB        | 8 (128 b) + 16 PCs
HBM3  | 2022  | 6.4      | 1024 b    | 819 GB/s       | 16–24 GB       | 16 (64 b) + 32 PCs
HBM3e | 2024  | 9.2–9.6  | 1024 b    | 1.18–1.23 TB/s | 24–36 GB       | 16 (64 b) + 32 PCs
HBM4  | 2026+ | ~8       | 2048 b    | ~2 TB/s        | 36–48 GB       | 32 (64 b)

How NVIDIA datacenter GPUs stack up

GPU             | HBM gen | Stacks | Capacity | Aggregate BW
V100 SXM2       | HBM2    | 4      | 16/32 GB | 900 GB/s
A100 SXM4 80GB  | HBM2e   | 5      | 80 GB    | 2.04 TB/s
H100 SXM5       | HBM3    | 5      | 80 GB    | 3.35 TB/s
H200 SXM5       | HBM3e   | 6      | 141 GB   | 4.8 TB/s
B100 / B200     | HBM3e   | 8      | 192 GB   | ~8 TB/s
GB200 (per GPU) | HBM3e   | 8      | 192 GB   | ~8 TB/s

Two things to notice. First, the bus stayed fixed at 1024 bits from HBM through HBM3e, so peak BW per stack simply tracks the per-pin rate — roughly doubling each generation, ~10× in a decade. The channel restructuring (8 × 128 b → 16 × 64 b in HBM3) and pseudo-channels didn't raise the peak at all; they raised concurrency, which is what lets real workloads approach it. HBM4's leap is structural again: it doubles the bus from 1024 to 2048 bits and lowers the per-pin rate slightly to keep signal integrity in check.
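One wrinkle the generation table can't show: products rarely run their HBM at the JEDEC maximum pin rate. Back-solving from shipped specs makes this visible (a sketch; it assumes the full 1024-bit bus per stack is active):

```python
def pin_rate_gbps(total_tbs: float, stacks: int, bus_bits: int = 1024) -> float:
    """Back-solve the per-pin rate a product actually runs its HBM at."""
    per_stack_gbs = total_tbs * 1000 / stacks   # GB/s per stack
    return per_stack_gbs * 8 / bus_bits

# H100 SXM5: 3.35 TB/s over 5 HBM3 stacks -> ~5.2 Gb/s/pin (JEDEC HBM3 max: 6.4)
print(round(pin_rate_gbps(3.35, 5), 2))
# H200: 4.8 TB/s over 6 HBM3e stacks -> ~6.25 Gb/s/pin (JEDEC HBM3e max: 9.6)
print(round(pin_rate_gbps(4.8, 6), 2))
```

Both flagship parts leave per-pin headroom on the table — thermals and power, not the DRAM spec, set the shipped clock.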

Second, the stack count per GPU grows along with the package. Five stacks fit around an H100 die on CoWoS-S. Eight stacks need a bigger interposer — CoWoS-L with bridge dies on Blackwell B200. HBM4's bigger bus drives even more interposer area; expect Rubin-class GPUs to use CoWoS-L with multiple bridge dies.


Same DRAM technology in H200 and B200; B200 just has more stacks (8 vs 6) on a bigger CoWoS-L interposer with the chip-to-chip NV-HBI link between the two dies of the dual-die package.

06

ECC — The Three Levels

Datacenter HBM has multiple error-correction layers stacked on top of each other. They protect against different failure modes and have very different bandwidth costs.

1. On-die ECC

Always on in HBM3 and HBM3e. Each DRAM die has internal ECC that transparently corrects single-bit errors inside an array, scrubbing them before the data leaves the die.

  • Cost: zero externally visible bandwidth or capacity
  • Hidden: extra cells inside each row, internal SECDED logic on the die
  • You cannot turn it off — it's part of the die

2. Side-band ECC

Classic GDDR/HBM ECC mode. The DRAM controller sends an extra ECC code over a separate ECC channel (extra DQ pins) alongside the data.

  • Cost: ~6% bandwidth + 6% capacity
  • Catches: bus errors, multi-bit faults that escaped on-die ECC
  • Default ON for datacenter (H100/H200/B200), OFF for consumer (RTX)

3. In-line ECC

HBM3 spec adds the option to embed ECC inside the data channel rather than on a separate one — saves dedicated ECC pins.

  • Cost: lower than side-band but variable
  • Trickier scheduling: must align ECC bursts with data bursts
  • Used selectively; some Blackwell configurations rely on it

What you see in nvidia-smi

$ nvidia-smi -q -d ECC
==============NVSMI LOG==============

GPU 00000000:01:00.0
    Product Name                          : NVIDIA H100 80GB HBM3
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 3     # scrubbed by side-band
            DRAM Uncorrectable            : 0
        Aggregate (since boot)
            DRAM Correctable              : 12
            DRAM Uncorrectable            : 0

"DRAM Correctable" counts errors caught by side-band ECC. A handful per day on a healthy GPU is normal — cosmic rays and thermal noise. Hundreds-per-hour or any uncorrectable count means a row is going bad; the driver will eventually retire that page (page-retirement / row-remapping).

Disabling ECC on inference-only hosts

Reclaiming the ~6% capacity
# list current state
$ nvidia-smi -q -d ECC | grep -A1 "ECC Mode"

# turn it off, reboot to apply
$ sudo nvidia-smi -e 0
$ sudo reboot

# after reboot, an H100 80GB reports ~85.5 GiB usable instead of ~80 GiB
$ nvidia-smi --query-gpu=memory.total --format=csv
memory.total [MiB]
87559 MiB
When to actually do this

For training — never. A single bit-flip on a gradient propagates through every parameter and you'll never find it. For serving inference — usually fine; a single-bit flip in a weight or KV-cache value usually changes a logit by O(1e-6) and gets averaged out. The 6% extra capacity is enough to fit a 70B q4_K_M model where it wouldn't otherwise. Datacenter operators typically leave ECC on anyway because the support burden of "weird sporadic numerical issues" is more painful than 6% of VRAM.

07

Refresh — The Tax You Always Pay

DRAM cells are leaky capacitors. A 1-bit cell at the 1β node holds maybe a few femto-coulombs of charge, leaking it through the access transistor and the substrate over time. Read it after the charge has drained — you read 0 instead of 1.

The fix is to refresh every row periodically — activate it (which puts the data in the row buffer, sense-amps detect it cleanly) and write it back. JEDEC mandates a full refresh of every row every 32 ms (HBM3) or 64 ms (older). Above 85 °C the interval halves: leakage doubles, so refresh frequency must double.

How refresh interacts with throughput

  • What the controller wants: 1024 b/cycle of useful data.
  • Reality: ~1 refresh per bank every ~3.9 µs, each taking ~350 ns.
  • An all-bank refresh would block the channel for ~350 ns — ~9% of cycles.
  • Fine-Granularity Refresh (FGR): refresh per bank group, overlapped with accesses to other banks.
  • Practical refresh tax: ~2–3% of bandwidth on cool kit, ~5–7% above 85 °C.
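The bullet-point numbers fall out of one ratio. A sketch, using the tRFC/tREFI values from the timing table; the `overlap` factor is a simplified stand-in for how much of the refresh FGR manages to hide behind other banks:

```python
def refresh_tax(t_rfc_ns: float = 350.0, t_refi_us: float = 3.9,
                hot: bool = False, overlap: float = 0.0) -> float:
    """Fraction of channel cycles lost to refresh.
    Above 85C tREFI halves (refresh rate doubles); `overlap` is the
    fraction of refresh time hidden by fine-granularity scheduling."""
    t_refi_ns = t_refi_us * 1000 / (2 if hot else 1)
    return (t_rfc_ns / t_refi_ns) * (1 - overlap)

print(round(refresh_tax(), 3))                        # ~0.09: all-bank, nothing hidden
print(round(refresh_tax(overlap=0.7), 3))             # ~0.027: FGR on cool kit
print(round(refresh_tax(hot=True, overlap=0.7), 3))   # ~0.054: FGR above 85C
```

The three printed values reproduce the ~9% naive figure, the ~2–3% practical tax, and the ~5–7% hot tax quoted above.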

FGR modes: the HBM3 spec defines all-bank refresh (REFab — simple, but stalls the whole channel for the duration) and per-bank refresh (REFpb — one bank at a time, hidden behind traffic to the other 31). Controllers run per-bank mode almost exclusively; all-bank survives mostly for initialisation and low-power states.

Temperature, throttling, and refresh

Each HBM stack reports its die temperatures back to the GPU. The GPU's firmware is responsible for both (a) cooling the GPU die enough that the HBM beside it doesn't cook, and (b) doubling the refresh rate above 85 °C to keep data integrity. Around 95 °C the GPU starts thermal throttling clocks — partly to protect the HBM.

Cool (≤75 °C)

  • 1x refresh, 32 ms interval
  • Refresh tax ~2%
  • Full per-pin rate
  • No throttle

Warm (75–85 °C)

  • 1x refresh still
  • Refresh tax ~3%
  • Full clocks
  • Watch the trend

Hot (85–95 °C)

  • 2x refresh auto-engaged
  • Refresh tax 5–7%
  • HBM PHY may lower frequency
  • Effective BW down 10–15%

Critical (≥95 °C)

  • Throttling engaged
  • GPU clocks step down hard
  • HBM clocks step down
  • Sustained: rejected by ops as a hardware fault
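The four regimes above map cleanly to a lookup you might use in fleet monitoring. A sketch — the boundary temperatures are the approximate ones from this slide, not values from any NVIDIA API:

```python
def hbm_thermal_state(mem_temp_c: float) -> dict:
    """Classify an HBM memory temperature into the operating regimes above.
    Boundaries (~75/85/95 C) are approximate and vary by product."""
    if mem_temp_c >= 95:
        return {"state": "critical", "refresh_x": 2, "throttled": True}
    if mem_temp_c >= 85:
        return {"state": "hot", "refresh_x": 2, "throttled": False}
    if mem_temp_c >= 75:
        return {"state": "warm", "refresh_x": 1, "throttled": False}
    return {"state": "cool", "refresh_x": 1, "throttled": False}

print(hbm_thermal_state(70)["state"])   # cool
print(hbm_thermal_state(88))            # hot: 2x refresh, not yet throttled
```

In practice you'd feed this the memory temperature from nvidia-smi and alert on any sustained "hot" reading — by "critical" you've already lost bandwidth.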
Operational signal

If nvidia-smi -q -d TEMPERATURE shows Memory Current persistently above 85 °C while the GPU die ("GPU Current") sits at 70 °C, your HBM cooling is the bottleneck — airflow through the side of the SXM module rather than through the heatsink fins. This is a packaging/thermal issue, not a compute one. SXM and OAM cards are designed for front-to-back airflow with no obstructions; the moment you put one in a 2U chassis with a bend in the duct, HBM temps rise faster than die temps.

08

The HBM PHY

"PHY" is the physical-layer block on the GPU side that drives the HBM bus — the analog and mixed-signal logic that turns digital command/address/data words into millivolts on micro-bumps and back. It's massive.

[Diagram: HBM3e PHY block on the GPU side — ~24 mm² per stack interface. Sub-blocks: command/address (SDR, ~1.5 Gb/s); data SerDes (DDR, 9.2–9.6 Gb/s/pin); training/equalisation (DFE, CTLE, deskew); ECC encode/decode (side-band + in-line); CRC engine with retry on bus errors; lane repair / row remap (spare lanes, soft-repair); I/O driver array — 1024 data + ~200 C/A lines, × 5–8 stacks per GPU; termination (ODT) plus VREF, VPP, VDDQ regulation.]

Why it's so big

A Hopper GH100 spends roughly 120 mm² of its ~814 mm² die on the five HBM3 PHYs — ~15% of total die area. The reasons are visible in the block diagram: over a thousand data lanes per stack each need analog drivers, receivers, and per-lane training/deskew logic; the CRC, ECC, and lane-repair machinery is replicated per channel; and the micro-bump array itself sets a floor on footprint.

IP licensing

NVIDIA, AMD, Intel, and the cloud-silicon shops all license HBM PHY IP from Synopsys, Cadence, or Rambus rather than designing in-house. The PHY is too analog-heavy and too gen-specific (HBM3 vs HBM3e vs HBM4) for in-house teams to economically maintain. Synopsys' DesignWare HBM3e PHY is the most-used building block in industry; if you're working on a chiplet-class accelerator and HBM is on your BoM, expect to integrate it.

The downside: PHY IP is shared, so PHY-level innovation is bounded. NVIDIA's edge over AMD on HBM bandwidth utilisation comes mostly from memory-controller design (FBPA scheduling, hashing, prefetching) rather than the PHY itself. If two GPUs use the same Synopsys PHY behind different controllers, they'll deliver very different effective BW — same raw bus, different schedulers.

Why HBM4's bus doubling matters here

Going from 1024 to 2048 bits per stack in HBM4 doubles the PHY pin count. At HBM3e PHY scaling (~24 mm² per stack), HBM4 PHYs would need ~48 mm² each — multiplied by 6–8 stacks, that's roughly 290–380 mm², a third to nearly half of a reticle-limited die. The way out: HBM4 lowers the per-pin rate slightly (~8 Gb/s instead of 9.6) and uses CoWoS-L bridges to keep the GPU die smaller. But the PHY area floor is now the practical limit on how many HBM4 stacks you can attach — not the interposer area itself.
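The area claim is a straight proportional scaling, worth making explicit. A sketch under the stated assumption that PHY area scales linearly with pin count from the ~24 mm²-per-1024-bit HBM3e figure:

```python
def phy_area_mm2(stacks: int, bus_bits: int, mm2_per_1024b: float = 24.0) -> float:
    """First-order PHY area estimate: linear in pin count, anchored to
    the ~24 mm2 per 1024-bit HBM3e stack interface quoted above."""
    return stacks * mm2_per_1024b * bus_bits / 1024

print(phy_area_mm2(5, 1024))   # 120.0 mm2 -- Hopper's five HBM3 PHYs
print(phy_area_mm2(8, 2048))   # 384.0 mm2 -- eight HBM4 stacks, naive scaling
```

Against an ~830 mm² reticle, the second number is why per-pin rate came down and bridges came in for HBM4.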

09

CoWoS Packaging — How It's Actually Built

"CoWoS" is TSMC's Chip-on-Wafer-on-Substrate — the 2.5D packaging family that puts GPU + HBM stacks side-by-side on a silicon interposer, then mounts the whole thing on an organic substrate. NVIDIA datacenter GPUs from V100 onward all use it.

[Figure: CoWoS-L cross-section, Blackwell-class GPU + 8 HBM stacks — from top: integrated heat spreader / lid (Cu-Ni) with TIM (indium / liquid metal); HBM stacks flanking GPU die A (~830 mm²) and GPU die B, joined by NV-HBI; silicon interposer with LSI bridges (Local Si Interconnect); organic package substrate (build-up layers + core); BGA ball grid mounting to the OAM / SXM5 board. Vertical scale heavily exaggerated.]

The build sequence

  1. Start with a 12-inch silicon wafer. Etch the interposer's RDL (re-distribution layers) and TSV array onto it.
  2. Place GPU die + HBM stacks face-down onto the interposer using micro-bump bonding (~55 µm pitch). Underfill with epoxy.
  3. Mount the assembled interposer + chips onto an organic substrate using C4 bumps (~150 µm pitch). This is the "Wafer-on-Substrate" step.
  4. Add the integrated heat spreader (lid) with thermal interface material on top of the dies.
  5. Solder BGA balls on the underside of the substrate for mounting on the OAM/SXM card.

CoWoS variants

CoWoS-S

Original silicon-interposer flow. Single passive interposer carrying GPU + ~5 HBM stacks. Reticle limit ~830 mm² on the interposer.

  • V100, A100, H100, H200
  • 5–6 HBM stacks max comfortable
  • Mature, high-volume

CoWoS-L

Adds LSI bridges — small silicon interconnect dies embedded in an organic-RDL substrate. Lets the package exceed the reticle limit by tiling multiple bridges.

  • Blackwell B100/B200, GB200
  • Two GPU dies + 8 HBM stacks
  • Higher cost, lower yield

CoWoS-R

Replaces the silicon interposer with a fine-pitch organic redistribution layer. Lower bandwidth density; used for cheaper accelerators that don't need full HBM density.

  • Niche; not used by NVIDIA datacenter GPUs
  • Better yield, lower BW

Reticle limit and Blackwell's response

Lithography reticles cap the largest single die a stepper can pattern at ~830 mm². H100's GH100 die is right up against this limit at 814 mm². To go bigger, you have two choices: stitch reticles (slow, exotic, used for Cerebras WSE) or split the design across two dies and connect them with a bridge.

Blackwell B200 takes the second route: two GPU dies side-by-side, each ~830 mm², connected by NVIDIA's NV-HBI (NVLink-High Bandwidth Interface) at ~10 TB/s through CoWoS-L bridge dies. To system software, B200 looks like one GPU; physically, it's a chiplet pair. This is what made the 8-stack HBM3e configuration possible.

The CoWoS bottleneck

Through 2024–2026, TSMC's CoWoS capacity — not wafer fab, not HBM supply — has been the binding constraint on H100/H200/B200 production. Each GPU consumes one slot of CoWoS-S or CoWoS-L flow, which has long process times and dedicated equipment. NVIDIA has taken the lion's share of this capacity since 2023, and AMD's MI300 competes for the same line. Watch CoWoS capacity announcements as the leading indicator for datacenter GPU availability, more so than node-shrink schedules.

10

What You Actually Get vs Spec Sheet

The spec-sheet number is a peak: 1024 b × pin rate / 8 per stack — up to ~1.18 TB/s at HBM3e's full 9.2 Gb/s/pin. H200 actually runs its six stacks at ~6.25 Gb/s/pin, for 4.8 TB/s aggregate. The number you measure with cudaMemcpy or in a kernel is always lower still — here's why.

Bandwidth tax breakdown

Spec peak        | 100% — 4.8 TB/s (H200 spec)
− refresh        | ~97% — refresh tax 2–3% when cool
− side-band ECC  | ~91% — ECC overhead ~6%
− thermal        | ~86% — warm operation, ~5%
Streaming        | ~82% achieved — ~3.9 TB/s (good)
Strided 128 B    | ~65% achieved — ~3.1 TB/s (ok)
Random 32 B      | ~30% achieved — ~1.4 TB/s (bad)

Worked example: H200 BF16 weight streaming vs random KV

BF16 weight streaming (good case)

Decode-step weight read for a 70B BF16 model, vLLM tile-friendly layout, ECC on, GPU at 75 °C.

  • Spec: 4.8 TB/s
  • Refresh tax: -2.5%
  • ECC side-band: -6%
  • Bus utilisation (sequential, large bursts): ~95%
  • Achieved: ~4.4 TB/s
  • 140 GB / 4.4 TB/s ≈ 32 ms / pass → ~31 tok/s on 70B BF16

Random 32 B KV-cache reads (bad case)

Long-context attention with non-coalesced KV layout, 8K sequence, 16 heads, scattered head/token order.

  • Spec: 4.8 TB/s
  • Refresh + ECC: same -8.5%
  • Random 32 B: bus utilisation ~30%
  • Hot operation (hot HBM): -10%
  • Achieved: ~1.4 TB/s
  • 3× slower on the exact same DRAM, for the same number of bytes requested
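Both worked examples reduce to one division: decode is weight-bandwidth-bound, so tokens/s is just achieved bandwidth over bytes read per token. A sketch under the same simplifications as above (batch 1, every token re-reads all weights, no compute overlap):

```python
def decode_tokens_per_s(model_gb: float, achieved_tbs: float) -> float:
    """Bandwidth-bound decode rate: each token streams the full weights once."""
    return achieved_tbs * 1000 / model_gb   # (GB/s) / GB = tokens/s

print(round(decode_tokens_per_s(140, 4.4), 1))   # streaming case, 70B BF16
print(round(decode_tokens_per_s(140, 1.4), 1))   # scattered-KV case
```

~31 vs ~10 tokens/s — the factor-of-three gap from the two columns above, reproduced from first principles.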
The lesson

Cache-friendly access patterns are not a nice-to-have on HBM — they're the difference between "uses the GPU" and "uses 30% of it". A single bad KV-cache layout can reduce a 70B-on-H200 from 30 tok/s to 10 tok/s without you ever seeing a single error. This is why FlashAttention and PagedAttention exist. They're not "small kernel tricks", they're HBM utilisation rescue jobs.

Tools to measure it: ncu --set full for DRAM throughput vs request-count counters, nvidia-smi dmon -s mu for live memory utilisation and temperature, and NVIDIA's nvbandwidth microbenchmark for the peak your particular host can actually reach.

11

Interactive: HBM Bandwidth Estimator

Pick a GPU, an access pattern, ECC and thermal state. The estimator applies the taxes from the previous slide and reports the bandwidth you can plausibly expect to measure, plus advice on what's costing you.

[Interactive estimator — outputs: spec BW, achieved BW, efficiency, refresh tax, ECC overhead, thermal de-rate. Choose a configuration to see the estimate.]

Method: spec BW × (1 − refresh%) × (1 − ECC%) × (1 − thermal%) × pattern_efficiency. Pattern efficiency is the bus utilisation factor for that access shape: ~95% sequential, ~70% strided 128 B, ~55% random 128 B, ~30% random 32 B. These are realistic ranges from ncu measurements on Hopper-class kit; numbers vary ±5% across kernels.
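The estimator's method is simple enough to carry in a few lines. A sketch implementing exactly the formula above, with the stated pattern-efficiency factors (the ±5% caveat applies to every output):

```python
# Bus-utilisation factors for each access shape, per the method above
PATTERN_EFF = {"sequential": 0.95, "strided_128B": 0.70,
               "random_128B": 0.55, "random_32B": 0.30}

def estimate_bw_tbs(spec_tbs: float, pattern: str, refresh: float = 0.025,
                    ecc: float = 0.06, thermal: float = 0.0) -> float:
    """spec BW x (1-refresh) x (1-ECC) x (1-thermal) x pattern efficiency."""
    return (spec_tbs * (1 - refresh) * (1 - ecc)
            * (1 - thermal) * PATTERN_EFF[pattern])

# H200 (4.8 TB/s spec), ECC on, warm, streaming -> the "~82%" row
print(round(estimate_bw_tbs(4.8, "sequential", thermal=0.05), 2))     # ~3.97 TB/s
# same GPU, random 32 B, hot HBM -> the bad case
print(round(estimate_bw_tbs(4.8, "random_32B", thermal=0.10), 2))     # ~1.19 TB/s
```

The two outputs land on the ~3.9 TB/s streaming and ~1.4 TB/s random figures from the tax-breakdown slide, which is the whole point of the estimator.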

Quick sanity numbers

If you take one thing away

HBM's spec-sheet TB/s is achievable, but only when the kernel cooperates. Refresh and ECC together cost ~8% on every GPU, every workload, no avoiding them. Thermal can cost another ~10% if your server's airflow is bad. The remaining 80% of the bus is yours — but only if your access pattern can fill it. The whole "memory wall" conversation in 2026 is really a conversation about how much of that 80% your code actually claims.