NVIDIA GPU Architectures Series — Presentation 31

Inside Hopper — FP8, TMA, Thread-Block Clusters, NVLink 4

A low-level deep dive into the GH100 die that trained GPT-4-class frontier models. 4th-generation tensor cores with native FP8 (E4M3 / E5M2) at 1979 dense TFLOPS on the H100 SXM5, warp-group MMA (WGMMA), the Tensor Memory Accelerator (TMA), distributed shared memory across thread-block clusters, DPX instructions, HBM3, NVLink 4 at 900 GB/s, and the H200 / GH200 variants.

Tags: GH100 · H100 · H200 · GH200 · TSMC 4N · HBM3 · HBM3e · FP8 · WGMMA · TMA · DSMEM · Cluster · NVLink 4
[Cover diagram: GH100 → GPC → SM → 4th-gen tensor cores / FP8 → TMA → cluster → HBM3/3e → NVLink 4]
00

Topics We'll Cover

01

Hopper at a Glance

Hopper was announced in March 2022 and shipped in volume through 2023. It is a single-die, datacenter-only family — like Volta, there is no consumer Hopper. The architecture targets high-end LLM training and inference. Variants:

SKU | SMs (active) | HBM | BW | NVLink | TDP
H100 SXM5 80 GB | 132 | 80 GB HBM3 | 3.35 TB/s | 18× NVLink 4 = 900 GB/s | 700 W
H100 PCIe 80 GB | 114 | 80 GB HBM3 | 2.0 TB/s | NVLink bridge (2-card) = 600 GB/s | 350 W
H100 NVL 94 GB | 132 | 94 GB HBM3 | 3.9 TB/s | 2-card NVLink = 600 GB/s | 2× 400 W
H200 SXM5 141 GB | 132 | 141 GB HBM3e | 4.8 TB/s | 900 GB/s | 700 W
GH200 480 / 624 | 132 | 96 / 144 GB HBM3 / HBM3e + 480 GB LPDDR5x | 4.0 / 4.8 TB/s + C2C 900 GB/s | 900 GB/s | 1000 W (whole package)
02

GH100 Die — Block Diagram

Chip
GH100 — 814 mm², 80 B transistors, TSMC 4N
GPCs
GPC0 · GPC1 · GPC2 · GPC3 · GPC4 · GPC5 · GPC6 · GPC7
SMs / GPC
9 TPCs × 2 SMs × 8 GPCs = 144 SMs (132 enabled on H100)
L2
60 MB on the full die; 50 MB enabled on H100 (two partitions of 25 MB)
Memory
6 HBM3 sites, one disabled on H100 80 GB → 5 active stacks, 5120-bit bus
IO
PCIe 5.0 x16 · 18× NVLink 4

Density: 80 B / 814 mm² = 98.3 M/mm² — less dense than AD102 (125.5) on the same node because GH100 spends area on FP64, full TC throughput, and HBM3 PHYs.

03

The Hopper SM — Cluster-Aware

Per-SM compute

  • 4 partitions, 1 warp scheduler each
  • 128 FP32 cores (32/partition)
  • 64 INT32 cores (16/partition)
  • 64 FP64 cores (16/partition) — FP64 at 1:2 of FP32, the same ratio as GA100 (and double GA100's per-SM count)
  • 4 tensor cores (4th gen) per SM
  • 32 LD/ST (8/partition), 16 SFU
  • TMA unit (one per SM, shared across partitions)
  • No RT cores

Per-SM memory

  • 256 KB register file
  • 256 KB L1 + shared (configurable split; up from 192 KB on GA100)
  • DSMEM — SMs in same cluster directly read each other's shared memory
  • Async transactions — mbarrier-based fences for TMA / WGMMA
04

4th-Generation Tensor Cores — FP8 Native

Hopper's tensor cores add FP8 in two formats: E4M3 (1 sign + 4 exp + 3 mantissa, favours mantissa, used for activations and weights after scaling) and E5M2 (1+5+2, favours range, used for gradients). Throughput is 2× BF16.
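To make the precision/range trade-off concrete, here is a minimal host-side sketch using the conversion types from CUDA's cuda_fp8.h header (CUDA 11.8+); the printed values are what round-tripping π through each format produces:

FP8 round-trip sketch (illustrative)
#include <cuda_fp8.h>   // __nv_fp8_e4m3 / __nv_fp8_e5m2 conversion types
#include <cstdio>

int main() {
    float x = 3.14159f;
    __nv_fp8_e4m3 a(x);   // 3 mantissa bits: step 0.25 in [2,4); max ≈ 448
    __nv_fp8_e5m2 b(x);   // 2 mantissa bits: step 0.5  in [2,4); max = 57344
    printf("E4M3: %.4f\n", (float)a);   // prints 3.2500
    printf("E5M2: %.4f\n", (float)b);   // prints 3.0000
    return 0;
}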

Operation | H100 SXM5 (1980 MHz) | H200 SXM5 (1980 MHz) | H100 PCIe (1755 MHz)
FP32 FMA | 67 TFLOPS | 67 TFLOPS | 51 TFLOPS
FP64 FMA | 34 TFLOPS | 34 TFLOPS | 26 TFLOPS
Tensor TF32 | 495 TFLOPS | 495 TFLOPS | 378 TFLOPS
Tensor BF16 / FP16 | 989 TFLOPS | 989 TFLOPS | 756 TFLOPS
Tensor BF16 sparse | 1979 TFLOPS | 1979 TFLOPS | 1513 TFLOPS
Tensor FP8 | 1979 TFLOPS | 1979 TFLOPS | 1513 TFLOPS
Tensor FP8 sparse | 3958 TFLOPS | 3958 TFLOPS | 3026 TFLOPS

Software: the Transformer Engine library (open source, PyTorch + JAX bindings) handles per-tensor FP8 scaling automatically: it records each tensor's recent absolute maxima (the amax history) and applies a scale derived from them on the next forward pass — "delayed scaling".
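The mechanism is simple enough to sketch. Below is an illustrative C++ model of the delayed-scaling bookkeeping — not Transformer Engine's actual code; kHistory plays the role of TE's amax_history_len and 448 is the E4M3 maximum:

Delayed-scaling sketch (illustrative)
#include <algorithm>   // std::max_element

struct DelayedScale {
    static constexpr int   kHistory    = 16;      // amax history length
    static constexpr float kFp8MaxE4M3 = 448.0f;  // largest E4M3 normal
    float history[kHistory] = {};                 // amaxes from past steps
    int   step = 0;

    // Scale chosen *before* this step runs, from past maxima only —
    // quantisation never waits on a reduction over the current tensor.
    float scale() const {
        float amax = *std::max_element(history, history + kHistory);
        return amax > 0.0f ? kFp8MaxE4M3 / amax : 1.0f;
    }
    // After the pass finishes, record the amax actually observed.
    void record(float observed_amax) {
        history[step++ % kHistory] = observed_amax;
    }
};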

05

WGMMA — Warp-Group Matrix Multiply

Hopper introduces a new MMA shape: warp-group MMA, operating on 64×N×K tiles where N can be up to 256 and K depends on datatype (16 for BF16/FP16, 32 for FP8). Four warps — a warp group, 128 threads — cooperate; the instruction is asynchronous, returning immediately and signalling completion via mbarrier.

PTX wgmma instruction (Hopper)
wgmma.fence.sync.aligned;            // order prior smem writes vs the MMA
wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16
    {d0, d1, ..., d63},              // 64 FP32 accumulator regs per thread
    desc-a,                          // shared memory descriptor for A tile
    desc-b,                          // shared memory descriptor for B tile
    scale-d, imm-scale-a, imm-scale-b, imm-trans-a, imm-trans-b;
wgmma.commit_group.sync.aligned;     // close the in-flight wgmma group
wgmma.wait_group.sync.aligned 0;     // wait for all groups to complete

Why this matters: pre-Hopper MMA shapes were small (16×8×16) so each thread had to issue many instructions to fill an output tile. WGMMA does it in one instruction over four warps, freeing the warp schedulers for other work. Used by every tuned cuBLAS / cuDNN / Flash-Attn 3 / CUTLASS 3.x kernel on Hopper.

06

TMA — The Tensor Memory Accelerator

One TMA unit per SM. The host sets up a tensor-map descriptor (base pointer, multi-dimensional shape, strides, swizzle pattern, out-of-bounds handling); the kernel then submits a single cp.async.bulk.tensor PTX instruction. The TMA hardware transfers a tile between L2/HBM and shared memory while the SM keeps executing other warps.
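Host-side, the descriptor is built with the CUDA 12 driver call cuTensorMapEncodeTiled and passed to the kernel (typically as a __grid_constant__ parameter). A hedged sketch for a row-major 2-D FP16 tensor tiled into 128×64 boxes — dimensions and swizzle choice are illustrative:

Tensor-map setup sketch (host, driver API)
#include <cuda.h>       // cuTensorMapEncodeTiled (CUDA 12+)
#include <cstdint>

CUtensorMap make_tile_map(void* gmem, uint64_t rows, uint64_t cols) {
    CUtensorMap tmap;
    uint64_t global_dim[2]     = {cols, rows};   // innermost dimension first
    uint64_t global_strides[1] = {cols * 2};     // row pitch in bytes (16B-aligned)
    uint32_t box_dim[2]        = {64, 128};      // shared-memory tile shape
    uint32_t elem_strides[2]   = {1, 1};         // dense tile
    cuTensorMapEncodeTiled(&tmap,
        CU_TENSOR_MAP_DATA_TYPE_FLOAT16, /*rank=*/2, gmem,
        global_dim, global_strides, box_dim, elem_strides,
        CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_128B,
        CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return tmap;
}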

CUDA C++ TMA bulk copy
#include <cuda/barrier>    // cuda::barrier + init(); compile with -arch=sm_90
#include <cuda_fp16.h>     // half

// ... inside the kernel:
__shared__ __align__(16) half tile[128][64];
__shared__ cuda::barrier<cuda::thread_scope_block> bar;

if (threadIdx.x == 0) init(&bar, blockDim.x);   // one thread initialises
__syncthreads();                                // make bar visible to all

if (threadIdx.x == 0) cuda::memcpy_async(       // one thread launches the
    tile, gmem_ptr_with_descriptor,             // bulk copy; TMA does the rest
    cuda::aligned_size_t<16>{128 * 64 * sizeof(half)}, bar);
bar.arrive_and_wait();   // every thread blocks here until the copy lands

TMA is not just a memcpy — it understands the layout the WGMMA wants. Without TMA, every thread issued global loads, consuming registers as staging buffers. With TMA, kernels become almost-pure compute on shared-memory tiles.

07

Thread-Block Clusters & DSMEM

New level of the CUDA hierarchy: Grid → Cluster → Block → Warp → Thread. A cluster is up to 16 cooperating thread blocks, all scheduled on the same GPC. They share a cluster-wide barrier and can directly read each other's shared memory: Distributed Shared Memory (DSMEM).

Why clusters

Working tiles for a 128k-context FlashAttention or a 1024×1024 GEMM exceed any single block's shared memory. Clusters let multiple blocks pool their shared memory at L1 latency, no global memory round-trip.

Programming

Launch with the __cluster_dims__(X, Y, Z) kernel attribute (or at runtime via cudaLaunchKernelEx with a cluster-dimension launch attribute). Inside the kernel, cooperative_groups::this_cluster() returns a cluster_group whose map_shared_rank() yields a pointer into a peer block's shared memory — see the sketch below. CUTLASS 3.x and cuDNN 9 attention rely heavily on clusters.

Implementation detail: each SM still has its own 256 KB L1+shared budget, but the cluster as a whole can address up to 16 × 256 KB = 4 MB of pooled shared memory over the GPC's SM-to-SM network — slower than local shared memory, far faster than a global-memory round-trip.
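A minimal DSMEM sketch using cooperative groups — the kernel name and sizes are illustrative; launch with 256 threads per block and an even number of blocks:

Cluster + DSMEM sketch (illustrative)
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two-block cluster: each block fills its own shared tile, then reads
// its peer's tile over DSMEM — no global-memory round-trip.
__global__ void __cluster_dims__(2, 1, 1) dsmem_exchange(float* out) {
    __shared__ float partial[256];
    cg::cluster_group cluster = cg::this_cluster();

    partial[threadIdx.x] = blockIdx.x * 1000.0f + threadIdx.x;
    cluster.sync();                             // all blocks' smem now valid

    unsigned peer = cluster.block_rank() ^ 1u;  // the other block (0 <-> 1)
    float* peer_partial = cluster.map_shared_rank(partial, peer);
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        partial[threadIdx.x] + peer_partial[threadIdx.x];
    cluster.sync();                             // keep smem alive for the peer
}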

08

DPX — Dynamic Programming Acceleration

Hopper added DPX instructions: hardware-accelerated min/max followed by a fused add, designed for the recurrence patterns in dynamic programming. Smith-Waterman gap-affine alignment in genomics gets a 7× speedup; beam search in NLP and Floyd-Warshall in routing also benefit.
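For instance, one inner cell of the gap-affine recurrence maps onto the CUDA 12 DPX intrinsics as below — a sketch with illustrative names; __viaddmax_s32(a, b, c) computes max(a + b, c) and __vimax3_s32 is a three-way max:

Smith-Waterman cell with DPX (sketch)
// H(i,j) = max(0, H(i-1,j-1) + s(i,j), E(i,j), F(i,j)) — one DP cell
__device__ int sw_cell(int h_diag, int sub_score, int e, int f) {
    int best = __viaddmax_s32(h_diag, sub_score, e);  // max(h_diag + s, e)
    return __vimax3_s32(best, f, 0);                  // max(best, f, 0)
}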

Marginal for LLM training proper, but headline-relevant for biotech / scientific HPC and certain RL tree searches.

09

Process, Voltage, Form Factors

GH100 is on TSMC 4N — the same NVIDIA-customised N4 variant as Ada. At 814 mm² it sits right at the reticle limit, so perfect dies are rare and NVIDIA bins aggressively: H100 enables 132 of 144 SMs, H800 also enables 132 (export-restricted via throttled NVLink), and H20 enables far fewer at lower clocks for the China market.

Form factor | TDP | Notes
SXM5 (HGX baseboard) | 700 W | Highest bin, 132 SMs, full NVLink 4 fabric.
PCIe 80 GB | 350 W | 114 SMs (cut), lower clocks; NVLink bridge for 2-card setups.
NVL 94 GB (paired) | 2× 400 W | Two cards bridged; 188 GB total. Inference focus.
H800 SXM5 | 700 W | NVLink throttled 900 → 400 GB/s for export rules.
H20 | ~400 W | Far fewer SMs, lower clocks; full 96 GB HBM3 (counter-intuitively, a strong inference card for Chinese cloud).

Voltage rails: VDD core (~0.85 V at boost), VDDQ-HBM3 (1.1 V), VDDIO-NVLink (~0.75 V). SXM5 carries no aux power connector — the HGX baseboard supplies 700 W through the SXM5 socket itself. The PCIe variant uses a 16-pin PCIe CEM5 (12VHPWR) connector.

10

Memory — HBM3, HBM3e, GH200 LPDDR5x

H100 was the first GPU to ship HBM3. H200 then upgraded to HBM3e in mid-2024.

SKU | Stack type | Stacks | Capacity | Pin rate | BW
H100 SXM5 80 GB | HBM3, 16 GB stacks | 6 sites, 5 active | 80 GB | 5.23 Gbps | 3.35 TB/s
H100 PCIe 80 GB | HBM3, 16 GB stacks | 5 | 80 GB | 3.13 Gbps | 2.0 TB/s
H100 NVL 94 GB | HBM3, 16 GB stacks | 6 | 94 GB | 5.13 Gbps | 3.9 TB/s
H200 SXM5 141 GB | HBM3e, 24 GB stacks | 6 | 141 GB | 6.4 Gbps | 4.8 TB/s
GH200 480 GB | HBM3 96 GB + Grace LPDDR5x | 6 + Grace | 96 + 480 = 576 GB | — | HBM 4.0 TB/s + LPDDR5x 480 GB/s + C2C 900 GB/s coherent
GH200 624 GB | HBM3e 144 GB + Grace LPDDR5x | 6 + Grace | 144 + 480 = 624 GB | — | HBM 4.8 TB/s + LPDDR5x 480 GB/s + C2C 900 GB/s
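Sanity check on the bus arithmetic: H100 SXM5 = 5 stacks × 1024 bits × 5.23 Gbps ÷ 8 = 3.35 TB/s; H100 NVL = 6 × 1024 × 5.13 ÷ 8 ≈ 3.9 TB/s.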

L2 cache: 60 MB on the full GH100 die, 50 MB enabled on H100, organised as two 25 MB partitions. SECDED ECC plus on-die HBM ECC is mandatory.

11

NVLink 4 & NVSwitch 3 (HGX H100)

NVLink 4 keeps NVLink 3's 25 GB/s per direction per link but raises the link count from 12 to 18: 18 links × 50 GB/s bidirectional = 900 GB/s aggregate per GPU. Signalling: 100 Gbps PAM4 per lane, 2 lanes per link, so 2 × 100 Gbps = 200 Gbps = 25 GB/s per direction.

Property | NVLink 3 (Ampere) | NVLink 4 (Hopper)
Per-lane rate | 50 Gbps NRZ | 100 Gbps PAM4
Lanes per link | 4 | 2
Per-link BW (1-dir) | 25 GB/s | 25 GB/s
Links per GPU | 12 | 18
Aggregate per GPU | 600 GB/s | 900 GB/s

HGX H100: 8× H100 SXM5 + 4× NVSwitch 3.0 chips (NVSwitch 3 has 64 NVLink 4 ports: 64 × 50 GB/s = 3.2 TB/s aggregate per chip). Any GPU pair communicates at the full 900 GB/s. NVLink-C2C on GH200 connects the Grace CPU to the Hopper GPU at 900 GB/s coherent — the same protocol family, but in-package. Cross-node traffic uses ConnectX-7 NDR 400 Gb/s InfiniBand (8 NICs per HGX H100).

12

Interactive: Hopper SKU Picker

[Interactive SKU picker: compares SMs, FP8 and BF16 tensor TFLOPS, FP64 TFLOPS, VRAM, bandwidth (TB/s), TDP, and NVLink across the Hopper SKUs above.]