Eight DRAM dies vertically stacked above a logic die, connected to the GPU by 1024+ TSVs through a silicon interposer, all within a centimetre of the compute silicon. This is a walk through the HBM internals — channels, banks, refresh, ECC modes, the PHY, and CoWoS packaging — that decide whether you actually see the spec-sheet bandwidth.
HBM is the most expensive component on a modern datacenter GPU and the one that most often decides whether you're compute-bound or memory-bound. Here's how it actually works — from cell to interposer.
GDDR and HBM reach roughly the same total throughput. They differ in how they get there — bus width vs frequency — and that single choice ripples through power, area, capacity, and packaging cost.
**GDDR:** a row of 8–12 discrete DRAM packages on the GPU PCB, each with its own 32-bit interface (organised as two 16-bit channels). Frequency does the heavy lifting: GDDR6X runs 21–24 Gb/s/pin with PAM4 signalling, GDDR7 32–40 Gb/s/pin with PAM3.
**HBM:** a vertical stack of 8–12 DRAM dies sitting on a silicon interposer right next to the GPU die. Width does the heavy lifting: 1024 bits per stack, at only ~9 Gb/s/pin on HBM3e.
An RTX 5090 (GDDR7, 512-bit, ~1.8 TB/s) and an H100 SXM5 (5 HBM3 stacks, ~3.4 TB/s) live on opposite ends of the same trade. GDDR keeps every transistor on the GPU die fed by signalling fast over a narrow bus; HBM gives up frequency and instead opens a fire-hose-wide bus across a few millimetres of silicon interposer. HBM wins on bandwidth-per-watt and bandwidth-per-volume; GDDR wins on cost-per-GB and yield.
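To make the trade concrete, here's the peak-bandwidth arithmetic for both boards as a quick sketch — pin rates are approximate published figures, not guaranteed values:

```python
def peak_bw_gbs(bus_bits: int, pin_rate_gbps: float) -> float:
    """Peak DRAM bandwidth in GB/s: bus width (bits) x per-pin rate (Gb/s) / 8."""
    return bus_bits * pin_rate_gbps / 8

# RTX 5090: narrow-ish bus, very fast pins (GDDR7 at ~28 Gb/s/pin)
print(peak_bw_gbs(512, 28.0))       # 1792.0 -> ~1.8 TB/s
# H100 SXM5: very wide bus, slow pins (5 stacks x 1024 b, ~5.2 Gb/s/pin)
print(peak_bw_gbs(5 * 1024, 5.2))   # 3328.0 -> ~3.4 TB/s
```

Same formula, opposite inputs: the 5090 multiplies a small width by a large rate, the H100 a huge width by a modest rate.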
The HBM trade is paid in silicon interposer area. The 2.5D substrate carrying GPU + HBM stacks is itself a wafer-fabbed silicon die, with thousands of fine-pitch traces routed at near-on-die density. Yield falls with area, the reticle limit (~830 mm²) caps how big it can be, and supplier capacity (TSMC CoWoS) is currently the binding constraint on HBM-class GPU production.
One HBM3e stack at a glance: 8 (or 12) DRAM dies stacked above a single base/buffer die, total stack height roughly 720 µm, footprint about 8 × 11 mm. Vertical wiring is done with through-silicon vias — tungsten or copper pillars that pass through the silicon of every die above the bottom.
Notice the asymmetry: the buffer die does the real work; the DRAM dies above it are mostly passive arrays with their TSVs ganged together. Every channel's command stream lands on the buffer die, gets decoded, then fans out up the TSVs to the target rank/die.
The 1024-bit bus is not a monolith — it's 16 independently scheduled channels of 64 bits each. Each channel has its own command/address bus, its own banks, and its own row buffer. The memory controller can issue 16 different reads/writes to the same stack on the same cycle.
HBM2 already had pseudo-channels (the JEDEC term has been around since 2016) but HBM3 sharpens the design: each 64-bit channel is split into two 32-bit pseudo-channels that share the data bus by time-slot but have independent command streams. So a small 32-byte transaction on PC0a can issue while PC0b is doing something else — doubled effective concurrency for fine-grained transfers.
**Plain channel:** one 64-bit channel = one address per cycle. A 32-byte read wastes half the bus — the minimum burst is 64 B (BL=8 × 64 b).
**Pseudo-channels:** two 32-bit pseudo-channels under one physical channel. A 32-byte read is BL=8 × 32 b on one half — full bus utilisation. PC0a and PC0b can issue to different rows/banks simultaneously (see the sketch below).
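The burst arithmetic behind those two cases, as a minimal sketch:

```python
BURST_LENGTH = 8  # BL=8 on HBM3

def bytes_per_burst(bus_bits: int) -> int:
    """Minimum bytes one burst moves on a bus of the given width."""
    return bus_bits * BURST_LENGTH // 8

def bus_utilisation(request_bytes: int, bus_bits: int) -> float:
    """Fraction of the burst the request actually wanted."""
    return min(1.0, request_bytes / bytes_per_burst(bus_bits))

print(bytes_per_burst(64), bus_utilisation(32, 64))  # 64 B minimum -> 0.5 for a 32 B read
print(bytes_per_burst(32), bus_utilisation(32, 32))  # 32 B on a pseudo-channel -> 1.0
```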
Why this matters for GPUs: Tensor Cores stream large, contiguous tiles — pseudo-channels don't help much there. But KV-cache reads in attention (small, scattered) and indirect lookups (sparse MoE expert gathers) do hit pseudo-channels well. HBM3e refines pseudo-channel arbitration further; HBM4 widens the stack to 2048 bits but keeps the pseudo-channel concept.
HBM "channel" is not the same as a CUDA "memory channel" or an Ada/Hopper "memory partition". A Hopper H100 has 5 HBM stacks, 16 channels each = 80 HBM channels; CUDA exposes them as 12 memory partitions / 80 sub-controllers depending on which level of the stack you're inspecting. They're independent concepts at different abstraction layers.
Inside a single HBM channel, the addressing hierarchy is the same shape as DDR/LPDDR — banks → bank groups → rows → columns — just denser. Every memory operation is a sequence of activate, read/write column, and precharge.
| Parameter | HBM3 typical | What it means |
|---|---|---|
| tRC | ~32 ns | Activate-to-activate, same bank — the floor for changing rows. |
| tRCD | ~14 ns | Activate to first column read — "row open" latency. |
| tRP | ~14 ns | Precharge to next activate — closing the row before opening another. |
| tCCDL | ~3 ns | Column-to-column delay within a bank group. |
| tCCDS | ~2 ns | Column-to-column delay across bank groups (faster). |
| tFAW | ~16 ns | Four-activation window — no more than four activates may issue this close together. |
| tREFI | ~3.9 µs | Refresh interval — one refresh command must issue per bank inside this window. |
The memory controller's life consists of three modes:

- **Row hit** — the target row is already open; column accesses issue back-to-back at tCCDL / tCCDS. Streaming sequential workloads are essentially all row hits.
- **Row miss, different bank** — activate a row in another bank so that bank's tRC overlaps with another bank's column reads. This is the heart of bank-group parallelism.
- **Row conflict** — the target bank has the wrong row open: precharge, then activate, a tRP + tRCD ≈ ~28 ns stall. This is what kills random small reads (sketched below).

Modern GPU memory controllers are deep, out-of-order, and very aware of bank state. They reorder pending requests to maximise row hits, schedule refreshes during natural gaps, and prefer cross-bank-group accesses to hide latency. NVIDIA's MC subsystem (sometimes called the FBPA / FBIO partitions in Hopper docs) is one of the most carefully tuned blocks on the chip — and the reason "the same DRAM" performs differently on different GPUs.
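A toy model of what those three modes cost, using the approximate HBM3 timings from the table — real controllers overlap these latencies across banks, so treat this as a per-bank worst case, not a throughput prediction:

```python
# Approximate HBM3 timings from the table above (ns)
tRCD, tRP, tCCDS = 14.0, 14.0, 2.0
BYTES_PER_ACCESS = 64  # one BL=8 burst on a 64-bit channel

def effective_gbs(latency_ns: float) -> float:
    """Bytes per nanosecond == GB/s for one bank issuing back-to-back."""
    return BYTES_PER_ACCESS / latency_ns

print(f"row hit:      {effective_gbs(tCCDS):5.1f} GB/s")       # ~32 GB/s
print(f"row conflict: {effective_gbs(tRP + tRCD):5.1f} GB/s")  # ~2.3 GB/s, ~14x worse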
If your kernel access pattern is strided with a stride that hashes to one bank, you'll bank-conflict yourself into the ground regardless of total bandwidth. CUDA's memory hashing across channels mostly hides this for tile-friendly workloads, but pathological strides (powers of 2 close to 64 KB, KV-cache layouts that don't tile across heads) still bite. ncu --set full shows DRAM read-throughput vs request count — that ratio tells you whether you're serving rows or thrashing them.
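The real channel/bank hash is undocumented, but a deliberately simplified, hypothetical mapping shows the failure mode: with a naive bit-field hash, a power-of-2 stride collapses every access onto one channel. Real GPUs XOR many address bits together precisely to make this rarer, which is why only pathological strides still bite:

```python
def fake_channel_of(addr: int, n_channels: int = 16) -> int:
    """HYPOTHETICAL mapping, not any real GPU's hash: channel = address
    bits [8:12], i.e. 256 B interleave granularity across 16 channels."""
    return (addr >> 8) % n_channels

for stride in (256, 4096):
    used = {fake_channel_of(i * stride) for i in range(1024)}
    print(f"stride {stride:>4} B touches {len(used)}/16 channels")
# stride  256 B touches 16/16 channels
# stride 4096 B touches  1/16 channels  <- every access queues on one channel
```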
The headline numbers per JEDEC spec. The "per pin" rate is the per-data-line transfer rate; multiply by 1024 (or 2048 for HBM4) and divide by 8 to get GB/s/stack — the sketch after the table works this through.
| Gen | Year | Gb/s/pin | Bus / stack | BW / stack | Capacity / stack | Channels |
|---|---|---|---|---|---|---|
| HBM | 2015 | 1.0 | 1024 b | 128 GB/s | 1–4 GB | 8 (128 b each) |
| HBM2 | 2016 | 2.0–2.4 | 1024 b | 256–307 GB/s | 4–8 GB | 16 (64 b) |
| HBM2e | 2019 | 3.2–3.6 | 1024 b | 410–460 GB/s | 8–16 GB | 16 (64 b) |
| HBM3 | 2022 | 6.4 | 1024 b | 819 GB/s | 16–24 GB | 16 (64 b) + 32 PCs |
| HBM3e | 2024 | 9.2–9.6 | 1024 b | 1.18–1.23 TB/s | 24–36 GB | 16 (64 b) + 32 PCs |
| HBM4 | 2026+ | ~8 | 2048 b | ~2 TB/s | 36–48 GB | 32 (64 b) |
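The bandwidth-per-stack column is pure arithmetic from the other two; as a sanity check:

```python
GENS = {  # per-pin Gb/s (top of range), bus bits per stack — from the table
    "HBM":   (1.0, 1024),
    "HBM2":  (2.4, 1024),
    "HBM2e": (3.6, 1024),
    "HBM3":  (6.4, 1024),
    "HBM3e": (9.6, 1024),
    "HBM4":  (8.0, 2048),
}

for gen, (pin_gbps, bus_bits) in GENS.items():
    print(f"{gen:6s} {pin_gbps * bus_bits / 8:6.0f} GB/s per stack")
# HBM3e -> 1229 GB/s (~1.23 TB/s); HBM4 -> 2048 GB/s (~2 TB/s) despite slower pins
```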
| GPU | HBM gen | Stacks | Capacity | Aggregate BW |
|---|---|---|---|---|
| V100 SXM2 | HBM2 | 4 | 16/32 GB | 900 GB/s |
| A100 SXM4 80GB | HBM2e | 5 | 80 GB | 2.04 TB/s |
| H100 SXM5 | HBM3 | 5 | 80 GB | 3.35 TB/s |
| H200 SXM5 | HBM3e | 6 | 141 GB | 4.8 TB/s |
| B100 / B200 | HBM3e | 8 | 192 GB | ~8 TB/s |
| GB200 (per GPU) | HBM3e | 8 | 192 GB | ~8 TB/s |
Two things to notice. First, bandwidth per stack grew roughly 10× in a decade, and nearly all of it came from the per-pin rate (1.0 → 9.6 Gb/s) — the bus stayed 1024 bits wide. Going from 8 to 16 channels (HBM2) and adding pseudo-channels (HBM3) bought concurrency for small transfers, not raw bandwidth. HBM4's leap is structural: it doubles the bus from 1024 to 2048 bits and lowers the per-pin rate slightly to keep signal integrity in check.
Second, the stack count per GPU grows along with the package. Five stacks fit around an H100 die on CoWoS-S. Eight stacks need a bigger interposer — CoWoS-L with bridge dies on Blackwell B200. HBM4's bigger bus drives even more interposer area; expect Rubin-class GPUs to use CoWoS-L with multiple bridge dies.
Same DRAM technology in H200 and B200; B200 just has more stacks (8 vs 6) on a bigger CoWoS-L assembly, with the chip-to-chip NV-HBI link between the two dies of the dual-die package.
Datacenter HBM has multiple error-correction layers stacked on top of each other. They protect against different failure modes and have very different bandwidth costs.
**On-die ECC.** Always on in HBM3 and HBM3e. Each DRAM die has internal ECC that transparently corrects single-bit errors inside an array, scrubbing them before the data leaves the die.
**Side-band ECC.** The classic GDDR/HBM ECC mode. The memory controller sends an extra ECC code over a separate ECC channel (extra DQ pins) alongside the data.
**Inline ECC.** The HBM3 spec adds the option to embed ECC inside the data channel rather than on a separate one — saves dedicated ECC pins at the cost of some data bandwidth.
```
$ nvidia-smi -q -d ECC

==============NVSMI LOG==============

GPU 00000000:01:00.0
    Product Name              : NVIDIA H100 80GB HBM3
    ECC Mode
        Current               : Enabled
        Pending               : Enabled
    ECC Errors
        Volatile
            SRAM Correctable  : 0
            SRAM Uncorrectable: 0
            DRAM Correctable  : 3     # scrubbed by side-band ECC
            DRAM Uncorrectable: 0
        Aggregate (since boot)
            DRAM Correctable  : 12
            DRAM Uncorrectable: 0
```
"DRAM Correctable" counts errors caught by side-band ECC. A handful per day on a healthy GPU is normal — cosmic rays and thermal noise. Hundreds-per-hour or any uncorrectable count means a row is going bad; the driver will eventually retire that page (page-retirement / row-remapping).
```bash
# list current state
$ nvidia-smi -q -d ECC | grep -A1 "ECC Mode"

# turn it off, reboot to apply
$ sudo nvidia-smi -e 0
$ sudo reboot

# after reboot, an H100 80GB reports ~85.5 GiB usable instead of ~80 GiB
$ nvidia-smi --query-gpu=memory.total --format=csv
memory.total [MiB]
87559 MiB
```
For training — never. A single bit-flip on a gradient propagates through every parameter and you'll never find it. For serving inference — usually fine; a single-bit flip in a weight or KV-cache value usually changes a logit by O(1e-6) and gets averaged out. The 6% extra capacity is enough to fit a 70B q4_K_M model where it wouldn't otherwise. Datacenter operators typically leave ECC on anyway because the support burden of "weird sporadic numerical issues" is more painful than 6% of VRAM.
DRAM cells are leaky capacitors. A cell at the 1β node stores maybe a few femtocoulombs of charge, and leaks it through the access transistor and the substrate over time. Read it after the charge has drained and you read a 0 instead of a 1.
The fix is to refresh every row periodically — activate it (which puts the data in the row buffer, sense-amps detect it cleanly) and write it back. JEDEC mandates a full refresh of every row every 32 ms (HBM3) or 64 ms (older). Above 85 °C the interval halves: leakage doubles, so refresh frequency must double.
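The bandwidth cost of refresh falls out of two numbers: tREFI from the timing table, and tRFC — the time one refresh occupies a bank. tRFC isn't in the table above, so the ~260 ns below is an assumed HBM3-class value, not a spec figure:

```python
tREFI_ns = 3900.0  # refresh interval from the timing table (~3.9 us)
tRFC_ns = 260.0    # ASSUMED: time one refresh occupies the bank (not in the table)

def refresh_overhead(hot: bool = False) -> float:
    """Fraction of time a bank is unavailable due to refresh.
    Above 85 C the refresh interval halves, doubling the overhead."""
    t_refi = tREFI_ns / 2 if hot else tREFI_ns
    return tRFC_ns / t_refi

print(f"below 85 C: {refresh_overhead():.1%}")          # ~6.7%
print(f"above 85 C: {refresh_overhead(hot=True):.1%}")  # ~13.3%
```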
Each HBM stack reports its die temperatures back to the GPU. The GPU's firmware is responsible for both (a) cooling the GPU die enough that the HBM beside it doesn't cook, and (b) doubling the refresh rate above 85 °C to keep data integrity. Around 95 °C the GPU starts thermal throttling clocks — partly to protect the HBM.
If nvidia-smi -q -d TEMPERATURE shows Memory Current persistently above 85 °C while the GPU die ("GPU Current") sits at 70 °C, your HBM cooling is the bottleneck — the airflow path over the HBM side of the SXM module, not the heatsink fins over the die. This is a packaging/thermal issue, not a compute one. SXM and OAM cards are designed for front-to-back airflow with no obstructions; the moment you put one in a 2U chassis with a bend in the duct, HBM temps rise faster than die temps.
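A minimal check you can drop into node monitoring — assuming your driver exposes the temperature.memory query field (recent datacenter GPUs do):

```python
import subprocess

def hbm_and_die_temps() -> tuple[int, int]:
    """Read HBM and die temperature via nvidia-smi's CSV query interface."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.memory,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    # first line = first GPU; extend as needed for multi-GPU nodes
    mem_c, gpu_c = (int(v) for v in out.strip().splitlines()[0].split(", "))
    return mem_c, gpu_c

mem_c, gpu_c = hbm_and_die_temps()
if mem_c > 85:
    print(f"HBM at {mem_c} C (die {gpu_c} C) -- refresh rate doubled, check airflow")
```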
"PHY" is the physical-layer block on the GPU side that drives the HBM bus — the analog and mixed-signal logic that turns digital command/address/data words into millivolts on micro-bumps and back. It's massive.
A Hopper GH100 spends roughly 120 mm² of its ~814 mm² die on the five HBM3 PHYs — ~15% of total die area. The area goes to per-pin transmit/receive circuits, clock distribution, and the training/calibration logic behind the thousands of micro-bump connections each stack needs.
NVIDIA, AMD, Intel, and the cloud-silicon shops largely license HBM PHY IP from Synopsys, Cadence, or Rambus rather than designing it in-house. The PHY is too analog-heavy and too generation-specific (HBM3 vs HBM3e vs HBM4) for most in-house teams to economically maintain. Synopsys' DesignWare HBM3e PHY is among the most-used building blocks in the industry; if you're working on a chiplet-class accelerator and HBM is on your BoM, expect to integrate it.
The downside: PHY IP is shared, so PHY-level innovation is bounded. NVIDIA's edge over AMD on HBM bandwidth utilisation comes mostly from memory-controller design (FBPA scheduling, hashing, prefetching) rather than the PHY itself. If two GPUs use the same Synopsys PHY behind different controllers, they'll deliver very different effective BW — same raw bus, different schedulers.
Going from 1024 to 2048 bits per stack in HBM4 doubles the PHY pin count. At HBM3e PHY scaling (~24 mm² per stack), HBM4 PHYs would need ~48 mm² each — multiplied by 6–8 stacks, that's 300–400 mm², approaching half a reticle. The way out: HBM4 lowers the per-pin rate slightly (~8 Gb/s instead of 9.6) and uses CoWoS-L bridges to keep the GPU die smaller. But PHY area is now the practical limit on how many HBM4 stacks you can attach — not the interposer area itself.
"CoWoS" is TSMC's Chip-on-Wafer-on-Substrate — the 2.5D packaging family that puts GPU + HBM stacks side-by-side on a silicon interposer, then mounts the whole thing on an organic substrate. NVIDIA datacenter GPUs from V100 onward all use it.
**CoWoS-S.** The original silicon-interposer flow. A single passive interposer carries the GPU + ~5 HBM stacks. Reticle limit ~830 mm² on the interposer.
**CoWoS-L.** Adds LSI bridges — small silicon interconnect dies embedded in an organic-RDL substrate. Lets the package exceed the reticle limit by tiling multiple bridges.
**CoWoS-R.** Replaces the silicon interposer with a fine-pitch organic redistribution layer. Lower bandwidth density; used for cheaper accelerators that don't need full HBM density.
Lithography reticles cap the largest single die a stepper can pattern at ~830 mm². H100's GH100 die is right up against this limit at 814 mm². To go bigger, you have two choices: stitch reticles (slow, exotic, used for Cerebras WSE) or split the design across two dies and connect them with a bridge.
Blackwell B200 takes the second route: two GPU dies side-by-side, each ~830 mm², connected by NVIDIA's NV-HBI (NVLink-High Bandwidth Interface) at ~10 TB/s through CoWoS-L bridge dies. To system software, B200 looks like one GPU; physically, it's a chiplet pair. This is what made the 8-stack HBM3e configuration possible.
Through 2024–2026, TSMC's CoWoS capacity — not wafer fabs, not HBM supply — has been the binding constraint on H100/H200/B200 production. Each GPU consumes a slot in the CoWoS-S or CoWoS-L flow, which has long process times and dedicated equipment. NVIDIA has taken the lion's share of this capacity since 2023; AMD's MI300 competes for the same lines. Watch CoWoS capacity announcements as the leading indicator for datacenter GPU availability — more so than node-shrink schedules.
The spec-sheet number is a peak: 1024 b × 9.2 Gb/s/pin / 8 ≈ 1.18 TB/s per stack at the JEDEC limit; H200 runs its six HBM3e stacks at a lower pin rate for the quoted 4.8 TB/s aggregate. The number you measure with cudaMemcpy or in a kernel is always lower still — here's why.
**Best case** — decode-step weight read for a 70B BF16 model, vLLM tile-friendly layout, ECC on, GPU at 75 °C.
**Worst case** — long-context attention with a non-coalesced KV layout: 8K sequence, 16 heads, scattered head/token order.
Cache-friendly access patterns are not a nice-to-have on HBM — they're the difference between "uses the GPU" and "uses 30% of it". A single bad KV-cache layout can reduce a 70B-on-H200 from 30 tok/s to 10 tok/s without you ever seeing a single error. This is why FlashAttention and PagedAttention exist. They're not "small kernel tricks", they're HBM utilisation rescue jobs.
How to measure what you actually get:

- **Nsight Compute (ncu)** — --set full reports DRAM throughput, sector reads/writes, L2 hit rate. dram__bytes_read.sum and dram__throughput.avg.pct_of_peak_sustained_elapsed are the load-bearing metrics.
- **DCGM (dcgm-exporter)** — production telemetry. DCGM_FI_PROF_DRAM_ACTIVE is the percentage of cycles DRAM was busy doing real work. Above 75% is good for serving; under 30% means you're compute-bound or your kernels are bad.
- **Microbenchmarks** — babelStream or NVIDIA's nvbandwidth for sanity-checking peak achievable bandwidth on your specific board.

Pick a GPU, an access pattern, and an ECC and thermal state. The estimator applies the taxes from the previous section and reports the bandwidth you can plausibly expect to measure, plus advice on what's costing you.
Method: spec BW × (1 − refresh%) × (1 − ECC%) × (1 − thermal%) × pattern_efficiency. Pattern efficiency is the bus utilisation factor for that access shape: ~95% sequential, ~70% strided 128 B, ~55% random 128 B, ~30% random 32 B. These are realistic ranges from ncu measurements on Hopper-class kit; numbers vary ±5% across kernels.
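The same method as a runnable sketch — spec bandwidths and pattern efficiencies are the figures quoted in this section; the 4%/4% split of the combined ~8% refresh + ECC tax is my assumption:

```python
PATTERN_EFF = {        # bus-utilisation factors quoted above
    "sequential":   0.95,
    "strided_128B": 0.70,
    "random_128B":  0.55,
    "random_32B":   0.30,
}

def expected_bw_tbs(spec_tbs: float, pattern: str,
                    refresh_tax: float = 0.04,  # refresh + ECC ~8% combined;
                    ecc_tax: float = 0.04,      # the 4%/4% split is assumed
                    thermal_tax: float = 0.0) -> float:
    """Spec BW x (1-refresh) x (1-ECC) x (1-thermal) x pattern efficiency."""
    return (spec_tbs * (1 - refresh_tax) * (1 - ecc_tax)
            * (1 - thermal_tax) * PATTERN_EFF[pattern])

# H200 (4.8 TB/s spec): streaming vs scattered 32 B reads
print(f"{expected_bw_tbs(4.8, 'sequential'):.1f} TB/s")  # ~4.2 TB/s
print(f"{expected_bw_tbs(4.8, 'random_32B'):.1f} TB/s")  # ~1.3 TB/s
```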
HBM's spec-sheet TB/s is achievable, but only when the kernel cooperates. Refresh and ECC together cost ~8% on every GPU, every workload, no avoiding them. Thermal can cost another ~10% if your server's airflow is bad. The remaining 80% of the bus is yours — but only if your access pattern can fill it. The whole "memory wall" conversation in 2026 is really a conversation about how much of that 80% your code actually claims.