Eight DRAM dies vertically stacked above a logic die, connected to the GPU by 1024+ TSVs through a silicon interposer, all within a centimetre of the compute silicon. This is a walk through the HBM internals — channels, banks, refresh, ECC modes, the PHY, and CoWoS packaging — that decide whether you actually see the spec-sheet bandwidth.
HBM is the most expensive component on a modern datacenter GPU and the one that most often decides whether you're compute-bound or memory-bound. Here's how it actually works — from cell to interposer.
GDDR and HBM reach roughly the same total throughput. They differ in how they get there — bus width vs frequency — and that single choice ripples through power, area, capacity, and packaging cost.
**GDDR:** a row of 8–12 discrete DRAM packages on the GPU PCB, each with its own 32-bit interface (organised as two 16-bit channels). Frequency does the heavy lifting: GDDR6X runs 21–24 Gb/s/pin with PAM4 signalling, GDDR7 32–40 Gb/s/pin with PAM3.
**HBM:** a vertical stack of 8–12 DRAM dies sitting on a silicon interposer right next to the GPU die. Width does the heavy lifting: 1024 bits per stack, at only ~9 Gb/s/pin on HBM3e.
An RTX 5090 (GDDR7, 512-bit, ~1.8 TB/s) and an H100 SXM5 (5 HBM3 stacks, ~3.4 TB/s) live on opposite ends of the same trade. GDDR keeps every transistor on the GPU die fed by signalling fast over a narrow bus; HBM gives up frequency and instead opens a fire-hose-wide bus across a few millimetres of silicon interposer. HBM wins on bandwidth-per-watt and bandwidth-per-volume; GDDR wins on cost-per-GB and yield.
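To make the trade concrete, here's the peak-bandwidth arithmetic for both boards as a quick sketch — pin rates are approximate published figures, not guaranteed values:

```python
def peak_bw_gbs(bus_bits: int, pin_rate_gbps: float) -> float:
    """Peak DRAM bandwidth in GB/s: bus width (bits) x per-pin rate (Gb/s) / 8."""
    return bus_bits * pin_rate_gbps / 8

# RTX 5090: narrow-ish bus, very fast pins (GDDR7 at ~28 Gb/s/pin)
print(peak_bw_gbs(512, 28.0))       # 1792.0 -> ~1.8 TB/s
# H100 SXM5: very wide bus, slow pins (5 stacks x 1024 b, ~5.2 Gb/s/pin)
print(peak_bw_gbs(5 * 1024, 5.2))   # 3328.0 -> ~3.4 TB/s
```

Same formula, opposite inputs: the 5090 multiplies a small width by a large rate, the H100 a huge width by a modest rate.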
The HBM trade is paid in silicon interposer area. The 2.5D substrate carrying GPU + HBM stacks is itself a wafer-fabbed silicon die, with thousands of fine-pitch traces routed at near-on-die density. Yield falls with area, the reticle limit (~830 mm²) caps how big it can be, and supplier capacity (TSMC CoWoS) is currently the binding constraint on HBM-class GPU production.
One HBM3e stack at a glance: 8 (or 12) DRAM dies stacked above a single base/buffer die, total stack height roughly 720 µm, footprint about 8 × 11 mm. Vertical wiring is done with through-silicon vias — tungsten or copper pillars that pass through the silicon of every die above the bottom.
Notice the asymmetry: the buffer die does the real work; the DRAM dies above it are mostly passive arrays with their TSVs ganged together. Every channel's command stream lands on the buffer die, gets decoded, then fans out up the TSVs to the target rank/die.
The 1024-bit bus is not a monolith — it's 16 independently scheduled channels of 64 bits each. Each channel has its own command/address bus, its own banks, and its own row buffer. The memory controller can issue 16 different reads/writes to the same stack on the same cycle.
HBM2 already had pseudo-channels (the JEDEC term has been around since 2016) but HBM3 sharpens the design: each 64-bit channel is split into two 32-bit pseudo-channels that share the data bus by time-slot but have independent command streams. So a small 32-byte transaction on PC0a can issue while PC0b is doing something else — doubled effective concurrency for fine-grained transfers.
**Plain channel:** one 64-bit channel = one address per cycle. A 32-byte read wastes half the bus — the minimum burst is 64 B (BL=8 × 64 b).
**Pseudo-channels:** two 32-bit pseudo-channels under one physical channel. A 32-byte read is BL=8 × 32 b on one half — full bus utilisation. PC0a and PC0b can issue to different rows/banks simultaneously (see the sketch below).
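The burst arithmetic behind those two cases, as a minimal sketch:

```python
BURST_LENGTH = 8  # BL=8 on HBM3

def bytes_per_burst(bus_bits: int) -> int:
    """Minimum bytes one burst moves on a bus of the given width."""
    return bus_bits * BURST_LENGTH // 8

def bus_utilisation(request_bytes: int, bus_bits: int) -> float:
    """Fraction of the burst the request actually wanted."""
    return min(1.0, request_bytes / bytes_per_burst(bus_bits))

print(bytes_per_burst(64), bus_utilisation(32, 64))  # 64 B minimum -> 0.5 for a 32 B read
print(bytes_per_burst(32), bus_utilisation(32, 32))  # 32 B on a pseudo-channel -> 1.0
```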
Why this matters for GPUs: Tensor Cores stream large, contiguous tiles — pseudo-channels don't help much there. But KV-cache reads in attention (small, scattered) and indirect lookups (sparse MoE expert gathers) do hit pseudo-channels well. HBM3e refines pseudo-channel arbitration further; HBM4 widens the stack to 2048 bits but keeps the pseudo-channel concept.
HBM "channel" is not the same as a CUDA "memory channel" or an Ada/Hopper "memory partition". A Hopper H100 has 5 HBM stacks, 16 channels each = 80 HBM channels; CUDA exposes them as 12 memory partitions / 80 sub-controllers depending on which level of the stack you're inspecting. They're independent concepts at different abstraction layers.
Inside a single HBM channel, the addressing hierarchy is the same shape as DDR/LPDDR — banks → bank groups → rows → columns — just denser. Every memory operation is a sequence of activate, read/write column, and precharge.
| Parameter | HBM3 typical | What it means |
|---|---|---|
| tRC | ~32 ns | Activate-to-activate, same bank — the floor for changing rows. |
| tRCD | ~14 ns | Activate to first column read — "row open" latency. |
| tRP | ~14 ns | Precharge to next activate — closing the row before opening another. |
| tCCDL | ~3 ns | Column-to-column delay within a bank group. |
| tCCDS | ~2 ns | Column-to-column delay across bank groups (faster). |
| tFAW | ~16 ns | Four-activation window — no more than four activates may issue this close together. |
| tREFI | ~3.9 µs | Refresh interval — one refresh command must issue per bank inside this window. |
The memory controller's life consists of three modes:

- **Row hit** — the target row is already open; column accesses issue back-to-back at tCCDL / tCCDS. Streaming sequential workloads are essentially all row hits.
- **Row miss, different bank** — activate a row in another bank so that bank's tRC overlaps with another bank's column reads. This is the heart of bank-group parallelism.
- **Row conflict** — the target bank has the wrong row open: precharge, then activate, a tRP + tRCD ≈ ~28 ns stall. This is what kills random small reads (sketched below).

Modern GPU memory controllers are deep, out-of-order, and very aware of bank state. They reorder pending requests to maximise row hits, schedule refreshes during natural gaps, and prefer cross-bank-group accesses to hide latency. NVIDIA's MC subsystem (sometimes called the FBPA / FBIO partitions in Hopper docs) is one of the most carefully tuned blocks on the chip — and the reason "the same DRAM" performs differently on different GPUs.
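A toy model of what those three modes cost, using the approximate HBM3 timings from the table — real controllers overlap these latencies across banks, so treat this as a per-bank worst case, not a throughput prediction:

```python
# Approximate HBM3 timings from the table above (ns)
tRCD, tRP, tCCDS = 14.0, 14.0, 2.0
BYTES_PER_ACCESS = 64  # one BL=8 burst on a 64-bit channel

def effective_gbs(latency_ns: float) -> float:
    """Bytes per nanosecond == GB/s for one bank issuing back-to-back."""
    return BYTES_PER_ACCESS / latency_ns

print(f"row hit:      {effective_gbs(tCCDS):5.1f} GB/s")       # ~32 GB/s
print(f"row conflict: {effective_gbs(tRP + tRCD):5.1f} GB/s")  # ~2.3 GB/s, ~14x worse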
If your kernel access pattern is strided with a stride that hashes to one bank, you'll bank-conflict yourself into the ground regardless of total bandwidth. CUDA's memory hashing across channels mostly hides this for tile-friendly workloads, but pathological strides (powers of 2 close to 64 KB, KV-cache layouts that don't tile across heads) still bite. ncu --set full shows DRAM read-throughput vs request count — that ratio tells you whether you're serving rows or thrashing them.
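The real channel/bank hash is undocumented, but a deliberately simplified, hypothetical mapping shows the failure mode: with a naive bit-field hash, a power-of-2 stride collapses every access onto one channel. Real GPUs XOR many address bits together precisely to make this rarer, which is why only pathological strides still bite:

```python
def fake_channel_of(addr: int, n_channels: int = 16) -> int:
    """HYPOTHETICAL mapping, not any real GPU's hash: channel = address
    bits [8:12], i.e. 256 B interleave granularity across 16 channels."""
    return (addr >> 8) % n_channels

for stride in (256, 4096):
    used = {fake_channel_of(i * stride) for i in range(1024)}
    print(f"stride {stride:>4} B touches {len(used)}/16 channels")
# stride  256 B touches 16/16 channels
# stride 4096 B touches  1/16 channels  <- every access queues on one channel
```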
The headline numbers per JEDEC spec. The "per pin" rate is the per-data-line transfer rate; multiply by 1024 (or 2048 for HBM4) and divide by 8 to get GB/s/stack — the sketch after the table works this through.
| Gen | Year | Gb/s/pin | Bus / stack | BW / stack | Capacity / stack | Channels |
|---|---|---|---|---|---|---|
| HBM | 2015 | 1.0 | 1024 b | 128 GB/s | 1–4 GB | 8 (128 b each) |
| HBM2 | 2016 | 2.0–2.4 | 1024 b | 256–307 GB/s | 4–8 GB | 16 (64 b) |
| HBM2e | 2019 | 3.2–3.6 | 1024 b | 410–460 GB/s | 8–16 GB | 16 (64 b) |
| HBM3 | 2022 | 6.4 | 1024 b | 819 GB/s | 16–24 GB | 16 (64 b) + 32 PCs |
| HBM3e | 2024 | 9.2–9.6 | 1024 b | 1.18–1.23 TB/s | 24–36 GB | 16 (64 b) + 32 PCs |
| HBM4 | 2026+ | ~8 | 2048 b | ~2 TB/s | 36–48 GB | 32 (64 b) |
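The bandwidth-per-stack column is pure arithmetic from the other two; as a sanity check:

```python
GENS = {  # per-pin Gb/s (top of range), bus bits per stack — from the table
    "HBM":   (1.0, 1024),
    "HBM2":  (2.4, 1024),
    "HBM2e": (3.6, 1024),
    "HBM3":  (6.4, 1024),
    "HBM3e": (9.6, 1024),
    "HBM4":  (8.0, 2048),
}

for gen, (pin_gbps, bus_bits) in GENS.items():
    print(f"{gen:6s} {pin_gbps * bus_bits / 8:6.0f} GB/s per stack")
# HBM3e -> 1229 GB/s (~1.23 TB/s); HBM4 -> 2048 GB/s (~2 TB/s) despite slower pins
```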
| GPU | HBM gen | Stacks | Capacity | Aggregate BW |
|---|---|---|---|---|
| V100 SXM2 | HBM2 | 4 | 16/32 GB | 900 GB/s |
| A100 SXM4 80GB | HBM2e | 5 | 80 GB | 2.04 TB/s |
| H100 SXM5 | HBM3 | 5 | 80 GB | 3.35 TB/s |
| H200 SXM5 | HBM3e | 6 | 141 GB | 4.8 TB/s |
| B100 / B200 | HBM3e | 8 | 192 GB | ~8 TB/s |
| GB200 (per GPU) | HBM3e | 8 | 192 GB | ~8 TB/s |
Two things to notice. First, bandwidth per stack grew roughly 10× in a decade, and nearly all of it came from the per-pin rate (1.0 → 9.6 Gb/s) — the bus stayed 1024 bits wide. Going from 8 to 16 channels (HBM2) and adding pseudo-channels (HBM3) bought concurrency for small transfers, not raw bandwidth. HBM4's leap is structural: it doubles the bus from 1024 to 2048 bits and lowers the per-pin rate slightly to keep signal integrity in check.
Second, the stack count per GPU grows along with the package. Five stacks fit around an H100 die on CoWoS-S. Eight stacks need a bigger interposer — CoWoS-L with bridge dies on Blackwell B200. HBM4's bigger bus drives even more interposer area; expect Rubin-class GPUs to use CoWoS-L with multiple bridge dies.
Same DRAM technology in H200 and B200; B200 just has more stacks (8 vs 6) on a bigger CoWoS-L assembly, with the chip-to-chip NV-HBI link between the two dies of the dual-die package.
Datacenter HBM has multiple error-correction layers stacked on top of each other. They protect against different failure modes and have very different bandwidth costs.
**On-die ECC.** Always on in HBM3 and HBM3e. Each DRAM die has internal ECC that transparently corrects single-bit errors inside an array, scrubbing them before the data leaves the die.
**Side-band ECC.** The classic GDDR/HBM ECC mode. The memory controller sends an extra ECC code over a separate ECC channel (extra DQ pins) alongside the data.
**Inline ECC.** The HBM3 spec adds the option to embed ECC inside the data channel rather than on a separate one — saves dedicated ECC pins at the cost of some data bandwidth.
```
$ nvidia-smi -q -d ECC

==============NVSMI LOG==============

GPU 00000000:01:00.0
    Product Name              : NVIDIA H100 80GB HBM3
    ECC Mode
        Current               : Enabled
        Pending               : Enabled
    ECC Errors
        Volatile
            SRAM Correctable  : 0
            SRAM Uncorrectable: 0
            DRAM Correctable  : 3     # scrubbed by side-band ECC
            DRAM Uncorrectable: 0
        Aggregate (since boot)
            DRAM Correctable  : 12
            DRAM Uncorrectable: 0
```
"DRAM Correctable" counts errors caught by side-band ECC. A handful per day on a healthy GPU is normal — cosmic rays and thermal noise. Hundreds-per-hour or any uncorrectable count means a row is going bad; the driver will eventually retire that page (page-retirement / row-remapping).
```bash
# list current state
$ nvidia-smi -q -d ECC | grep -A1 "ECC Mode"

# turn it off, reboot to apply
$ sudo nvidia-smi -e 0
$ sudo reboot

# after reboot, an H100 80GB reports ~85.5 GiB usable instead of ~80 GiB
$ nvidia-smi --query-gpu=memory.total --format=csv
memory.total [MiB]
87559 MiB
```
For training — never. A single bit-flip on a gradient propagates through every parameter and you'll never find it. For serving inference — usually fine; a single-bit flip in a weight or KV-cache value usually changes a logit by O(1e-6) and gets averaged out. The 6% extra capacity is enough to fit a 70B q4_K_M model where it wouldn't otherwise. Datacenter operators typically leave ECC on anyway because the support burden of "weird sporadic numerical issues" is more painful than 6% of VRAM.
DRAM cells are leaky capacitors. A cell at the 1β node stores maybe a few femtocoulombs of charge, and leaks it through the access transistor and the substrate over time. Read it after the charge has drained and you read a 0 instead of a 1.
The fix is to refresh every row periodically — activate it (which puts the data in the row buffer, sense-amps detect it cleanly) and write it back. JEDEC mandates a full refresh of every row every 32 ms (HBM3) or 64 ms (older). Above 85 °C the interval halves: leakage doubles, so refresh frequency must double.
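The bandwidth cost of refresh falls out of two numbers: tREFI from the timing table, and tRFC — the time one refresh occupies a bank. tRFC isn't in the table above, so the ~260 ns below is an assumed HBM3-class value, not a spec figure:

```python
tREFI_ns = 3900.0  # refresh interval from the timing table (~3.9 us)
tRFC_ns = 260.0    # ASSUMED: time one refresh occupies the bank (not in the table)

def refresh_overhead(hot: bool = False) -> float:
    """Fraction of time a bank is unavailable due to refresh.
    Above 85 C the refresh interval halves, doubling the overhead."""
    t_refi = tREFI_ns / 2 if hot else tREFI_ns
    return tRFC_ns / t_refi

print(f"below 85 C: {refresh_overhead():.1%}")          # ~6.7%
print(f"above 85 C: {refresh_overhead(hot=True):.1%}")  # ~13.3%
```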
Each HBM stack reports its die temperatures back to the GPU. The GPU's firmware is responsible for both (a) cooling the GPU die enough that the HBM beside it doesn't cook, and (b) doubling the refresh rate above 85 °C to keep data integrity. Around 95 °C the GPU starts thermal throttling clocks — partly to protect the HBM.
If nvidia-smi -q -d TEMPERATURE shows Memory Current persistently above 85 °C while the GPU die ("GPU Current") sits at 70 °C, your HBM cooling is the bottleneck — the airflow path over the HBM side of the SXM module, not the heatsink fins over the die. This is a packaging/thermal issue, not a compute one. SXM and OAM cards are designed for front-to-back airflow with no obstructions; the moment you put one in a 2U chassis with a bend in the duct, HBM temps rise faster than die temps.
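A minimal check you can drop into node monitoring — assuming your driver exposes the temperature.memory query field (recent datacenter GPUs do):

```python
import subprocess

def hbm_and_die_temps() -> tuple[int, int]:
    """Read HBM and die temperature via nvidia-smi's CSV query interface."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.memory,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    # first line = first GPU; extend as needed for multi-GPU nodes
    mem_c, gpu_c = (int(v) for v in out.strip().splitlines()[0].split(", "))
    return mem_c, gpu_c

mem_c, gpu_c = hbm_and_die_temps()
if mem_c > 85:
    print(f"HBM at {mem_c} C (die {gpu_c} C) -- refresh rate doubled, check airflow")
```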
"PHY" is the physical-layer block on the GPU side that drives the HBM bus — the analog and mixed-signal logic that turns digital command/address/data words into millivolts on micro-bumps and back. It's massive.
A Hopper GH100 spends roughly 120 mm² of its ~814 mm² die on the five HBM3 PHYs — ~15% of total die area. The area goes to per-pin transmit/receive circuits, clock distribution, and the training/calibration logic behind the thousands of micro-bump connections each stack needs.
NVIDIA, AMD, Intel, and the cloud-silicon shops largely license HBM PHY IP from Synopsys, Cadence, or Rambus rather than designing it in-house. The PHY is too analog-heavy and too generation-specific (HBM3 vs HBM3e vs HBM4) for most in-house teams to economically maintain. Synopsys' DesignWare HBM3e PHY is among the most-used building blocks in the industry; if you're working on a chiplet-class accelerator and HBM is on your BoM, expect to integrate it.
The downside: PHY IP is shared, so PHY-level innovation is bounded. NVIDIA's edge over AMD on HBM bandwidth utilisation comes mostly from memory-controller design (FBPA scheduling, hashing, prefetching) rather than the PHY itself. If two GPUs use the same Synopsys PHY behind different controllers, they'll deliver very different effective BW — same raw bus, different schedulers.
Going from 1024 to 2048 bits per stack in HBM4 doubles the PHY pin count. At HBM3e PHY scaling (~24 mm² per stack), HBM4 PHYs would need ~48 mm² each — multiplied by 6–8 stacks, that's 300–400 mm², approaching half a reticle. The way out: HBM4 lowers the per-pin rate slightly (~8 Gb/s instead of 9.6) and uses CoWoS-L bridges to keep the GPU die smaller. But PHY area is now the practical limit on how many HBM4 stacks you can attach — not the interposer area itself.
"CoWoS" is TSMC's Chip-on-Wafer-on-Substrate — the 2.5D packaging family that puts GPU + HBM stacks side-by-side on a silicon interposer, then mounts the whole thing on an organic substrate. NVIDIA datacenter GPUs from V100 onward all use it.
**CoWoS-S.** The original silicon-interposer flow. A single passive interposer carries the GPU + ~5 HBM stacks. Reticle limit ~830 mm² on the interposer.
**CoWoS-L.** Adds LSI bridges — small silicon interconnect dies embedded in an organic-RDL substrate. Lets the package exceed the reticle limit by tiling multiple bridges.
**CoWoS-R.** Replaces the silicon interposer with a fine-pitch organic redistribution layer. Lower bandwidth density; used for cheaper accelerators that don't need full HBM density.
Lithography reticles cap the largest single die a stepper can pattern at ~830 mm². H100's GH100 die is right up against this limit at 814 mm². To go bigger, you have two choices: stitch reticles (slow, exotic, used for Cerebras WSE) or split the design across two dies and connect them with a bridge.
Blackwell B200 takes the second route: two GPU dies side-by-side, each ~830 mm², connected by NVIDIA's NV-HBI (NVLink-High Bandwidth Interface) at ~10 TB/s through CoWoS-L bridge dies. To system software, B200 looks like one GPU; physically, it's a chiplet pair. This is what made the 8-stack HBM3e configuration possible.
Through 2024–2026, TSMC's CoWoS capacity — not wafer fabs, not HBM supply — has been the binding constraint on H100/H200/B200 production. Each GPU consumes a slot in the CoWoS-S or CoWoS-L flow, which has long process times and dedicated equipment. NVIDIA has taken the lion's share of this capacity since 2023; AMD's MI300 competes for the same lines. Watch CoWoS capacity announcements as the leading indicator for datacenter GPU availability — more so than node-shrink schedules.
The spec-sheet number is a peak: 1024 b × 9.2 Gb/s/pin / 8 ≈ 1.18 TB/s per stack at the JEDEC limit; H200 runs its six HBM3e stacks at a lower pin rate for the quoted 4.8 TB/s aggregate. The number you measure with cudaMemcpy or in a kernel is always lower still — here's why.
**Best case** — decode-step weight read for a 70B BF16 model, vLLM tile-friendly layout, ECC on, GPU at 75 °C.
**Worst case** — long-context attention with a non-coalesced KV layout: 8K sequence, 16 heads, scattered head/token order.
Cache-friendly access patterns are not a nice-to-have on HBM — they're the difference between "uses the GPU" and "uses 30% of it". A single bad KV-cache layout can reduce a 70B-on-H200 from 30 tok/s to 10 tok/s without you ever seeing a single error. This is why FlashAttention and PagedAttention exist. They're not "small kernel tricks", they're HBM utilisation rescue jobs.
How to measure what you actually get:

- **Nsight Compute (ncu)** — --set full reports DRAM throughput, sector reads/writes, L2 hit rate. dram__bytes_read.sum and dram__throughput.avg.pct_of_peak_sustained_elapsed are the load-bearing metrics.
- **DCGM (dcgm-exporter)** — production telemetry. DCGM_FI_PROF_DRAM_ACTIVE is the percentage of cycles DRAM was busy doing real work. Above 75% is good for serving; under 30% means you're compute-bound or your kernels are bad.
- **Microbenchmarks** — babelStream or NVIDIA's nvbandwidth for sanity-checking peak achievable bandwidth on your specific board.

Pick a GPU, an access pattern, and an ECC and thermal state. The estimator applies the taxes from the previous section and reports the bandwidth you can plausibly expect to measure, plus advice on what's costing you.
Method: spec BW × (1 − refresh%) × (1 − ECC%) × (1 − thermal%) × pattern_efficiency. Pattern efficiency is the bus utilisation factor for that access shape: ~95% sequential, ~70% strided 128 B, ~55% random 128 B, ~30% random 32 B. These are realistic ranges from ncu measurements on Hopper-class kit; numbers vary ±5% across kernels.
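The same method as a runnable sketch — spec bandwidths and pattern efficiencies are the figures quoted in this section; the 4%/4% split of the combined ~8% refresh + ECC tax is my assumption:

```python
PATTERN_EFF = {        # bus-utilisation factors quoted above
    "sequential":   0.95,
    "strided_128B": 0.70,
    "random_128B":  0.55,
    "random_32B":   0.30,
}

def expected_bw_tbs(spec_tbs: float, pattern: str,
                    refresh_tax: float = 0.04,  # refresh + ECC ~8% combined;
                    ecc_tax: float = 0.04,      # the 4%/4% split is assumed
                    thermal_tax: float = 0.0) -> float:
    """Spec BW x (1-refresh) x (1-ECC) x (1-thermal) x pattern efficiency."""
    return (spec_tbs * (1 - refresh_tax) * (1 - ecc_tax)
            * (1 - thermal_tax) * PATTERN_EFF[pattern])

# H200 (4.8 TB/s spec): streaming vs scattered 32 B reads
print(f"{expected_bw_tbs(4.8, 'sequential'):.1f} TB/s")  # ~4.2 TB/s
print(f"{expected_bw_tbs(4.8, 'random_32B'):.1f} TB/s")  # ~1.3 TB/s
```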
HBM's spec-sheet TB/s is achievable, but only when the kernel cooperates. Refresh and ECC together cost ~8% on every GPU, every workload, no avoiding them. Thermal can cost another ~10% if your server's airflow is bad. The remaining 80% of the bus is yours — but only if your access pattern can fill it. The whole "memory wall" conversation in 2026 is really a conversation about how much of that 80% your code actually claims.