NVIDIA GPU Architectures Series — Presentation 31

Inside Hopper — FP8, TMA, Thread-Block Clusters, NVLink 4

A low-level deep dive into the GH100 die that trained GPT-4-class frontier models. 4th-generation tensor cores with native FP8 (E4M3 / E5M2) at 1979 dense TFLOPS on the H100 SXM5, warp-group MMA (WGMMA), the Tensor Memory Accelerator (TMA), distributed shared memory across thread-block clusters, DPX instructions, HBM3, NVLink 4 at 900 GB/s, and the H200 / GH200 variants.

Tags: GH100 · H100 · H200 · GH200 · TSMC 4N · HBM3 · HBM3e · FP8 · WGMMA · TMA · DSMEM · Cluster · NVLink 4
[Cover diagram: GH100 → GPC → SM → 4th-gen tensor cores / FP8 → TMA → cluster → HBM3/3e → NVLink 4]
00

Topics We'll Cover

01

Hopper at a Glance

Hopper was announced in March 2022 and shipped in volume through 2023. It is a single-die, datacenter-only family — like Volta, there is no consumer Hopper. The architecture targets high-end LLM training and inference. Variants:

SKU | SMs (active) | HBM | BW | NVLink | TDP
H100 SXM5 80 GB | 132 | 80 GB HBM3 | 3.35 TB/s | 18× NVLink 4 = 900 GB/s | 700 W
H100 PCIe 80 GB | 114 | 80 GB HBM3 | 2.0 TB/s | NVLink bridge (2-card) = 600 GB/s | 350 W
H100 NVL 94 GB | 132 | 94 GB HBM3 | 3.9 TB/s | 2-card NVLink = 600 GB/s | 2× 400 W
H200 SXM5 141 GB | 132 | 141 GB HBM3e | 4.8 TB/s | 900 GB/s | 700 W
GH200 480 / 624 | 132 | 96 / 144 GB HBM3 / HBM3e + 480 GB LPDDR5x | 4.0 / 4.8 TB/s + C2C 900 GB/s | 900 GB/s | 1000 W (whole package)
02

GH100 Die — Block Diagram

Chip
GH100 — 814 mm², 80 B transistors, TSMC 4N
GPCs
GPC0 · GPC1 · GPC2 · GPC3 · GPC4 · GPC5 · GPC6 · GPC7
SMs / GPC
9 TPCs × 2 SMs × 8 GPCs = 144 SMs (132 enabled on H100)
L2
60 MB on the full die; 50 MB enabled on H100 (two partitions of 25 MB)
Memory
6 HBM3 sites, one disabled on H100 80 GB → 5 active stacks, 5120-bit bus
IO
PCIe 5.0 x16 · 18× NVLink 4

Density: 80 B / 814 mm² = 98.3 M/mm² — less dense than AD102 (125.5) on the same node because GH100 spends area on FP64, full TC throughput, and HBM3 PHYs.

03

The Hopper SM — Cluster-Aware

Per-SM compute

  • 4 partitions, 1 warp scheduler each
  • 128 FP32 cores (32/partition)
  • 64 INT32 cores (16/partition)
  • 64 FP64 cores (16/partition) — FP64 at 1:2 of FP32, the same ratio as GA100 (and double GA100's per-SM count)
  • 4 tensor cores (4th gen) per SM
  • 32 LD/ST (8/partition), 16 SFU
  • TMA unit (one per SM, shared across partitions)
  • No RT cores

Per-SM memory

  • 256 KB register file
  • 256 KB L1 + shared (configurable split; up from 192 KB on GA100)
  • DSMEM — SMs in same cluster directly read each other's shared memory
  • Async transactions — mbarrier-based fences for TMA / WGMMA
04

4th-Generation Tensor Cores — FP8 Native

Hopper's tensor cores add FP8 in two formats: E4M3 (1 sign + 4 exp + 3 mantissa, favours mantissa, used for activations and weights after scaling) and E5M2 (1+5+2, favours range, used for gradients). Throughput is 2× BF16.
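To make the precision/range trade-off concrete, here is a minimal host-side sketch using the conversion types from CUDA's cuda_fp8.h header (CUDA 11.8+); the printed values are what round-tripping π through each format produces:

FP8 round-trip sketch (illustrative)
#include <cuda_fp8.h>   // __nv_fp8_e4m3 / __nv_fp8_e5m2 conversion types
#include <cstdio>

int main() {
    float x = 3.14159f;
    __nv_fp8_e4m3 a(x);   // 3 mantissa bits: step 0.25 in [2,4); max ≈ 448
    __nv_fp8_e5m2 b(x);   // 2 mantissa bits: step 0.5  in [2,4); max = 57344
    printf("E4M3: %.4f\n", (float)a);   // prints 3.2500
    printf("E5M2: %.4f\n", (float)b);   // prints 3.0000
    return 0;
}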

Operation | H100 SXM5 (1980 MHz) | H200 SXM5 (1980 MHz) | H100 PCIe (1755 MHz)
FP32 FMA | 67 TFLOPS | 67 TFLOPS | 51 TFLOPS
FP64 FMA | 34 TFLOPS | 34 TFLOPS | 26 TFLOPS
Tensor TF32 | 495 TFLOPS | 495 TFLOPS | 378 TFLOPS
Tensor BF16 / FP16 | 989 TFLOPS | 989 TFLOPS | 756 TFLOPS
Tensor BF16 sparse | 1979 TFLOPS | 1979 TFLOPS | 1513 TFLOPS
Tensor FP8 | 1979 TFLOPS | 1979 TFLOPS | 1513 TFLOPS
Tensor FP8 sparse | 3958 TFLOPS | 3958 TFLOPS | 3026 TFLOPS

Software: the Transformer Engine library (open source, PyTorch + JAX bindings) handles per-tensor FP8 scaling automatically: it records each tensor's recent absolute maxima (the amax history) and applies a scale derived from them on the next forward pass — "delayed scaling".
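The mechanism is simple enough to sketch. Below is an illustrative C++ model of the delayed-scaling bookkeeping — not Transformer Engine's actual code; kHistory plays the role of TE's amax_history_len and 448 is the E4M3 maximum:

Delayed-scaling sketch (illustrative)
#include <algorithm>   // std::max_element

struct DelayedScale {
    static constexpr int   kHistory    = 16;      // amax history length
    static constexpr float kFp8MaxE4M3 = 448.0f;  // largest E4M3 normal
    float history[kHistory] = {};                 // amaxes from past steps
    int   step = 0;

    // Scale chosen *before* this step runs, from past maxima only —
    // quantisation never waits on a reduction over the current tensor.
    float scale() const {
        float amax = *std::max_element(history, history + kHistory);
        return amax > 0.0f ? kFp8MaxE4M3 / amax : 1.0f;
    }
    // After the pass finishes, record the amax actually observed.
    void record(float observed_amax) {
        history[step++ % kHistory] = observed_amax;
    }
};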

05

WGMMA — Warp-Group Matrix Multiply

Hopper introduces a new MMA shape: warp-group MMA, operating on 64×N×K tiles where N can be up to 256 and K depends on datatype (16 for BF16/FP16, 32 for FP8). Four warps — a warp group, 128 threads — cooperate; the instruction is asynchronous, returning immediately and signalling completion via mbarrier.

PTX wgmma instruction (Hopper)
wgmma.fence.sync.aligned;            // order prior smem writes vs the MMA
wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16
    {d0, d1, ..., d63},              // 64 FP32 accumulator regs per thread
    desc-a,                          // shared memory descriptor for A tile
    desc-b,                          // shared memory descriptor for B tile
    scale-d, imm-scale-a, imm-scale-b, imm-trans-a, imm-trans-b;
wgmma.commit_group.sync.aligned;     // close the in-flight wgmma group
wgmma.wait_group.sync.aligned 0;     // wait for all groups to complete

Why this matters: pre-Hopper MMA shapes were small (16×8×16) so each thread had to issue many instructions to fill an output tile. WGMMA does it in one instruction over four warps, freeing the warp schedulers for other work. Used by every tuned cuBLAS / cuDNN / Flash-Attn 3 / CUTLASS 3.x kernel on Hopper.

06

TMA — The Tensor Memory Accelerator

One TMA unit per SM. The host sets up a tensor-map descriptor (base pointer, multi-dimensional shape, strides, swizzle pattern, out-of-bounds handling); the kernel then submits a single cp.async.bulk.tensor PTX instruction. The TMA hardware transfers a tile between L2/HBM and shared memory while the SM keeps executing other warps.
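Host-side, the descriptor is built with the CUDA 12 driver call cuTensorMapEncodeTiled and passed to the kernel (typically as a __grid_constant__ parameter). A hedged sketch for a row-major 2-D FP16 tensor tiled into 128×64 boxes — dimensions and swizzle choice are illustrative:

Tensor-map setup sketch (host, driver API)
#include <cuda.h>       // cuTensorMapEncodeTiled (CUDA 12+)
#include <cstdint>

CUtensorMap make_tile_map(void* gmem, uint64_t rows, uint64_t cols) {
    CUtensorMap tmap;
    uint64_t global_dim[2]     = {cols, rows};   // innermost dimension first
    uint64_t global_strides[1] = {cols * 2};     // row pitch in bytes (16B-aligned)
    uint32_t box_dim[2]        = {64, 128};      // shared-memory tile shape
    uint32_t elem_strides[2]   = {1, 1};         // dense tile
    cuTensorMapEncodeTiled(&tmap,
        CU_TENSOR_MAP_DATA_TYPE_FLOAT16, /*rank=*/2, gmem,
        global_dim, global_strides, box_dim, elem_strides,
        CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_128B,
        CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return tmap;
}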

CUDA C++ TMA bulk copy
#include <cuda/barrier>    // cuda::barrier + init(); compile with -arch=sm_90
#include <cuda_fp16.h>     // half

// ... inside the kernel:
__shared__ __align__(16) half tile[128][64];
__shared__ cuda::barrier<cuda::thread_scope_block> bar;

if (threadIdx.x == 0) init(&bar, blockDim.x);   // one thread initialises
__syncthreads();                                // make bar visible to all

if (threadIdx.x == 0) cuda::memcpy_async(       // one thread launches the
    tile, gmem_ptr_with_descriptor,             // bulk copy; TMA does the rest
    cuda::aligned_size_t<16>{128 * 64 * sizeof(half)}, bar);
bar.arrive_and_wait();   // every thread blocks here until the copy lands

TMA is not just a memcpy — it understands the layout the WGMMA wants. Without TMA, every thread issued global loads, consuming registers as staging buffers. With TMA, kernels become almost-pure compute on shared-memory tiles.

07

Thread-Block Clusters & DSMEM

New level of the CUDA hierarchy: Grid → Cluster → Block → Warp → Thread. A cluster is up to 16 cooperating thread blocks, all scheduled on the same GPC. They share a cluster-wide barrier and can directly read each other's shared memory: Distributed Shared Memory (DSMEM).

Why clusters

Working tiles for a 128k-context FlashAttention or a 1024×1024 GEMM exceed any single block's shared memory. Clusters let multiple blocks pool their shared memory at L1 latency, no global memory round-trip.

Programming

Launch with the __cluster_dims__(X, Y, Z) kernel attribute (or at runtime via cudaLaunchKernelEx with a cluster-dimension launch attribute). Inside the kernel, cooperative_groups::this_cluster() returns a cluster_group whose map_shared_rank() yields a pointer into a peer block's shared memory — see the sketch below. CUTLASS 3.x and cuDNN 9 attention rely heavily on clusters.

Implementation detail: each SM still has its own 256 KB L1+shared budget, but the cluster as a whole can address up to 16 × 256 KB = 4 MB of pooled shared memory over the GPC's SM-to-SM network — slower than local shared memory, far faster than a global-memory round-trip.
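A minimal DSMEM sketch using cooperative groups — the kernel name and sizes are illustrative; launch with 256 threads per block and an even number of blocks:

Cluster + DSMEM sketch (illustrative)
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two-block cluster: each block fills its own shared tile, then reads
// its peer's tile over DSMEM — no global-memory round-trip.
__global__ void __cluster_dims__(2, 1, 1) dsmem_exchange(float* out) {
    __shared__ float partial[256];
    cg::cluster_group cluster = cg::this_cluster();

    partial[threadIdx.x] = blockIdx.x * 1000.0f + threadIdx.x;
    cluster.sync();                             // all blocks' smem now valid

    unsigned peer = cluster.block_rank() ^ 1u;  // the other block (0 <-> 1)
    float* peer_partial = cluster.map_shared_rank(partial, peer);
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        partial[threadIdx.x] + peer_partial[threadIdx.x];
    cluster.sync();                             // keep smem alive for the peer
}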

08

DPX — Dynamic Programming Acceleration

Hopper added DPX instructions: hardware-accelerated min/max followed by a fused add, designed for the recurrence patterns in dynamic programming. Smith-Waterman gap-affine alignment in genomics gets a 7× speedup; beam search in NLP and Floyd-Warshall in routing also benefit.
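For instance, one inner cell of the gap-affine recurrence maps onto the CUDA 12 DPX intrinsics as below — a sketch with illustrative names; __viaddmax_s32(a, b, c) computes max(a + b, c) and __vimax3_s32 is a three-way max:

Smith-Waterman cell with DPX (sketch)
// H(i,j) = max(0, H(i-1,j-1) + s(i,j), E(i,j), F(i,j)) — one DP cell
__device__ int sw_cell(int h_diag, int sub_score, int e, int f) {
    int best = __viaddmax_s32(h_diag, sub_score, e);  // max(h_diag + s, e)
    return __vimax3_s32(best, f, 0);                  // max(best, f, 0)
}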

Marginal for LLM training proper, but headline-relevant for biotech / scientific HPC and certain RL tree searches.

09

Process, Voltage, Form Factors

GH100 is on TSMC 4N — the same NVIDIA-customised N4 variant as Ada. At 814 mm² it sits right at the reticle limit, so perfect dies are rare and NVIDIA bins aggressively: H100 enables 132 of 144 SMs, H800 also enables 132 (export-restricted via throttled NVLink), and H20 enables far fewer at lower clocks for the China market.

Form factor | TDP | Notes
SXM5 (HGX baseboard) | 700 W | Highest bin, 132 SMs, full NVLink 4 fabric.
PCIe 80 GB | 350 W | 114 SMs (cut), lower clocks; NVLink bridge for 2-card setups.
NVL 94 GB (paired) | 2× 400 W | Two cards bridged; 188 GB total. Inference focus.
H800 SXM5 | 700 W | NVLink throttled 900 → 400 GB/s for export rules.
H20 | ~400 W | Far fewer SMs, lower clocks; full 96 GB HBM3 (counter-intuitively, a strong inference card for Chinese cloud).

Voltage rails: VDD core (~0.85 V at boost), VDDQ-HBM3 (1.1 V), VDDIO-NVLink (~0.75 V). SXM5 carries no aux power connector — the HGX baseboard supplies 700 W through the SXM5 socket itself. The PCIe variant uses a 16-pin PCIe CEM5 (12VHPWR) connector.

10

Memory — HBM3, HBM3e, GH200 LPDDR5x

H100 was the first GPU to ship HBM3. H200 then upgraded to HBM3e in mid-2024.

SKU | Stack type | Stacks | Capacity | Pin rate | BW
H100 SXM5 80 GB | HBM3, 16 GB stacks | 6 sites, 5 active | 80 GB | 5.23 Gbps | 3.35 TB/s
H100 PCIe 80 GB | HBM3, 16 GB stacks | 5 | 80 GB | 3.13 Gbps | 2.0 TB/s
H100 NVL 94 GB | HBM3, 16 GB stacks | 6 | 94 GB | 5.13 Gbps | 3.9 TB/s
H200 SXM5 141 GB | HBM3e, 24 GB stacks | 6 | 141 GB | 6.4 Gbps | 4.8 TB/s
GH200 480 GB | HBM3 96 GB + Grace LPDDR5x | 6 + Grace | 96 + 480 = 576 GB | — | HBM 4.0 TB/s + LPDDR5x 480 GB/s + C2C 900 GB/s coherent
GH200 624 GB | HBM3e 144 GB + Grace LPDDR5x | 6 + Grace | 144 + 480 = 624 GB | — | HBM 4.8 TB/s + LPDDR5x 480 GB/s + C2C 900 GB/s
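Sanity check on the bus arithmetic: H100 SXM5 = 5 stacks × 1024 bits × 5.23 Gbps ÷ 8 = 3.35 TB/s; H100 NVL = 6 × 1024 × 5.13 ÷ 8 ≈ 3.9 TB/s.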

L2 cache: 60 MB on the full GH100 die, 50 MB enabled on H100, organised as two 25 MB partitions. SECDED ECC plus on-die HBM ECC is mandatory.

11

NVLink 4 & NVSwitch 3 (HGX H100)

NVLink 4 keeps NVLink 3's 25 GB/s per direction per link but raises the link count from 12 to 18: 18 links × 50 GB/s bidirectional = 900 GB/s aggregate per GPU. Signalling: 100 Gbps PAM4 per lane, 2 lanes per link, so 2 × 100 Gbps = 200 Gbps = 25 GB/s per direction.

Property | NVLink 3 (Ampere) | NVLink 4 (Hopper)
Per-lane rate | 50 Gbps NRZ | 100 Gbps PAM4
Lanes per link | 4 | 2
Per-link BW (1-dir) | 25 GB/s | 25 GB/s
Links per GPU | 12 | 18
Aggregate per GPU | 600 GB/s | 900 GB/s

HGX H100: 8× H100 SXM5 + 4× NVSwitch 3.0 chips (NVSwitch 3 has 64 NVLink 4 ports: 64 × 50 GB/s = 3.2 TB/s aggregate per chip). Any GPU pair communicates at the full 900 GB/s. NVLink-C2C on GH200 connects the Grace CPU to the Hopper GPU at 900 GB/s coherent — the same protocol family, but in-package. Cross-node traffic uses ConnectX-7 NDR 400 Gb/s InfiniBand (8 NICs per HGX H100).

12

Interactive: Hopper SKU Picker

[Interactive SKU picker: compares SMs, FP8 and BF16 tensor TFLOPS, FP64 TFLOPS, VRAM, bandwidth (TB/s), TDP, and NVLink across the Hopper SKUs above.]