A low-level deep dive into the GH100 die that trained GPT-4-class frontier models. 4th-generation tensor cores with native FP8 (E4M3 / E5M2) at 1979 TFLOPS per H100 SXM5, the warp-group MMA, the Tensor Memory Accelerator, distributed shared memory across thread-block clusters, DPX, HBM3, NVLink 4 at 900 GB/s, and the H200 / GH200 variants.
Hopper was announced in March 2022, with volume shipments through 2023. Single-die family, datacenter only — like Volta, no consumer parts. The architecture's job is high-end LLM training and inference. Variants:
| SKU | SMs (active) | HBM | BW | NVLink | TDP |
|---|---|---|---|---|---|
| H100 SXM5 80 GB | 132 | 80 GB HBM3 | 3.35 TB/s | 18× NVLink 4 = 900 GB/s | 700 W |
| H100 PCIe 80 GB | 114 | 80 GB HBM2e | 2.0 TB/s | NVLink bridge (2-card) = 600 GB/s | 350 W |
| H100 NVL 94 GB | 132 | 94 GB HBM3 | 3.9 TB/s | 2-card NVLink = 600 GB/s | 2× 400 W |
| H200 SXM5 141 GB | 132 | 141 GB HBM3e | 4.8 TB/s | 900 GB/s | 700 W |
| GH200 480 / 624 | 132 | 96 GB HBM3 / 144 GB HBM3e + 480 GB LPDDR5x | 4.0 / 4.8 TB/s + C2C 900 GB/s | 900 GB/s | 1000 W (whole package) |
Density: 80 B transistors / 814 mm² = 98.3 M/mm² — less dense than AD102 (125.5 M/mm²) on the same node, because GH100 spends area on FP64, full-rate tensor-core datapaths, and HBM3 PHYs.
Hopper's tensor cores add FP8 in two formats: E4M3 (1 sign + 4 exp + 3 mantissa, favours mantissa, used for activations and weights after scaling) and E5M2 (1+5+2, favours range, used for gradients). Throughput is 2× BF16.
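To make the range/precision trade-off concrete, here is a minimal sketch using the FP8 types from CUDA's cuda_fp8.h header (CUDA 11.8+, compile with nvcc). The specific values are illustrative; the types and conversions are the real CUDA ones:

```
#include <cuda_fp8.h>
#include <cstdio>

int main() {
    // E4M3: more mantissa bits, max normal value ±448  -> activations / weights.
    // E5M2: more exponent bits, max normal value ±57344 -> gradients.
    float x = 300.0f, g = 3000.0f;

    __nv_fp8_e4m3 a = __nv_fp8_e4m3(x);   // rounds to a nearby E4M3 value
    __nv_fp8_e5m2 b = __nv_fp8_e5m2(g);   // E4M3 would saturate here; E5M2 still has range

    printf("E4M3(300)  -> %f\n", float(a));
    printf("E5M2(3000) -> %f\n", float(b));
    return 0;
}
```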
| Operation | H100 SXM5 (1980 MHz) | H200 SXM5 (1980 MHz) | H100 PCIe (1755 MHz) |
|---|---|---|---|
| FP32 FMA | 67 TFLOPS | 67 TFLOPS | 51 TFLOPS |
| FP64 FMA | 34 TFLOPS | 34 TFLOPS | 26 TFLOPS |
| Tensor TF32 | 495 TFLOPS | 495 TFLOPS | 378 TFLOPS |
| Tensor BF16 / FP16 | 989 TFLOPS | 989 TFLOPS | 756 TFLOPS |
| Tensor BF16 sparse | 1979 TFLOPS | 1979 TFLOPS | 1513 TFLOPS |
| Tensor FP8 | 1979 TFLOPS | 1979 TFLOPS | 1513 TFLOPS |
| Tensor FP8 sparse | 3958 TFLOPS | 3958 TFLOPS | 3026 TFLOPS |
Software: the Transformer Engine library (open source, PyTorch + JAX bindings) handles per-tensor FP8 scaling automatically. It records a history of each tensor's recent absolute maxima (amax) and uses it to pick the scale factor applied on the next iteration — "delayed scaling".
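A rough sketch of the bookkeeping behind delayed scaling (not Transformer Engine's code, just the idea): keep a rolling per-tensor amax history and derive the next step's scale from its maximum. The names, history length, and missing safety margin are all illustrative:

```
#include <cfloat>
#include <algorithm>

// Illustrative delayed-scaling state for one tensor.
struct FP8ScalingState {
    static constexpr int   kHistory = 16;      // amax history length (assumption)
    static constexpr float kFp8Max  = 448.0f;  // E4M3 max normal value
    float amax_history[kHistory] = {};
    int   step = 0;

    // Record the absolute maximum observed while running the current iteration.
    void record(float amax) { amax_history[step++ % kHistory] = amax; }

    // Scale for the *next* iteration: map the recent worst-case amax onto the
    // top of the E4M3 range (a real recipe would also apply a margin).
    float next_scale() const {
        float worst = FLT_MIN;
        for (float a : amax_history) worst = std::max(worst, a);
        return kFp8Max / worst;
    }
};
```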
Hopper introduces a new MMA granularity: warp-group MMA (WGMMA), operating on 64×N×K tiles where N can be up to 256 and K depends on the data type (16 for BF16/FP16, 32 for FP8/INT8, 8 for TF32). Four warps (a warp group, 128 threads) cooperate; the instruction is asynchronous — it returns immediately, and completion is tracked with wgmma.commit_group / wgmma.wait_group.
```
// One warp group (4 warps, 128 threads) computes a 64x128x16 BF16 MMA with
// FP32 accumulation; A and B come from shared memory via descriptors.
wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16
    {d0, d1, ..., d63},   // 64 FP32 accumulator registers per thread
    desc-a,               // shared-memory descriptor for the A tile
    desc-b,               // shared-memory descriptor for the B tile
    scale-d, imm-scale-a, imm-scale-b, imm-trans-a, imm-trans-b;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;
```
Why this matters: pre-Hopper MMA shapes were small (16×8×16), so each warp had to issue a long stream of mma.sync instructions to cover an output tile. WGMMA covers it with one instruction across four warps, freeing the warp schedulers and issue slots for other work. Every tuned cuBLAS / cuDNN / FlashAttention-3 / CUTLASS 3.x kernel on Hopper uses it.
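Concretely: covering a 64×128 FP32 accumulator tile with warp-level mma.sync.m16n8k16 takes (64/16) × (128/8) = 64 instructions per 16-deep K step, spread across the four warps; wgmma.mma_async.m64n128k16 does the same work as a single asynchronous instruction.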
One TMA unit per SM. The host builds a tensor-map descriptor (base pointer, multi-dimensional shape, strides, swizzle pattern, out-of-bounds handling) and the kernel submits a single cp.async.bulk.tensor PTX instruction against it. The TMA hardware transfers a tile from L2/HBM to shared memory while the SM continues executing other warps.
```
// Simplified pattern: one thread launches an asynchronous bulk copy into shared
// memory and the whole block waits on a shared-memory barrier (#include <cuda/barrier>).
__shared__ alignas(16) half tile[128][64];
__shared__ cuda::barrier<cuda::thread_scope_block> bar;
if (threadIdx.x == 0) init(&bar, blockDim.x);   // one thread initialises the barrier
__syncthreads();                                // make it visible to the whole block
if (threadIdx.x == 0)
    cuda::memcpy_async(&tile[0][0], gmem_tile,  // gmem_tile: source tile in global memory
                       cuda::aligned_size_t<16>{128 * 64 * sizeof(half)}, bar);
bar.arrive_and_wait();                          // blocks here until the copy has landed
```
TMA is not just a memcpy — the descriptor encodes the swizzled shared-memory layout the WGMMA consumers expect. Without TMA, every thread issues its own global loads and burns registers as staging buffers; with TMA, kernels become almost-pure compute on shared-memory tiles.
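For the real tensor-map path, the descriptor is typically built on the host with the CUDA 12 driver API and passed to the kernel. A hedged host-side sketch: make_tile_map, the 1024×1024 FP16 matrix, and the 128×64 tile are assumptions made to match the shared-memory tile above; cuTensorMapEncodeTiled and the enum values are the driver API. Error handling omitted:

```
#include <cuda.h>        // driver API: cuTensorMapEncodeTiled (CUDA 12+)
#include <cuda_fp16.h>

CUtensorMap make_tile_map(void* gmem_base) {
    CUtensorMap tmap;
    // Dimension 0 is the fastest-varying (contiguous) one.
    cuuint64_t global_dim[2]    = {1024, 1024};             // {cols, rows}
    cuuint64_t global_stride[1] = {1024 * sizeof(__half)};  // row pitch in bytes (rank-1 entries)
    cuuint32_t box_dim[2]       = {64, 128};                // tile: 64 cols x 128 rows
    cuuint32_t elem_stride[2]   = {1, 1};                   // dense tile

    cuTensorMapEncodeTiled(
        &tmap,
        CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
        2,                                   // rank
        gmem_base,                           // base pointer in global memory
        global_dim, global_stride,
        box_dim, elem_stride,
        CU_TENSOR_MAP_INTERLEAVE_NONE,
        CU_TENSOR_MAP_SWIZZLE_128B,          // 128-byte swizzle, matches WGMMA shared-memory layouts
        CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
        CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);  // out-of-bounds handling mode
    return tmap;   // passed to the kernel, which issues cp.async.bulk.tensor against it
}
```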
New level of the CUDA hierarchy: Grid → Cluster → Block → Warp → Thread. A cluster is up to 16 cooperating thread blocks, all scheduled on the same GPC. They share a cluster-wide barrier and can directly read each other's shared memory: Distributed Shared Memory (DSMEM).
Working tiles for a 128k-context FlashAttention or a 1024×1024 GEMM exceed any single block's shared memory. Clusters let multiple blocks pool their shared memory at L1 latency, no global memory round-trip.
Launch with the __cluster_dims__(2, 2, 1) kernel attribute, or at runtime via cudaLaunchKernelEx with a cluster-dimension launch attribute. In device code, cooperative_groups::this_cluster() returns a cluster_group whose map_shared_rank() gives a pointer into another block's shared memory (see the sketch below). CUTLASS 3.x and cuDNN 9 attention kernels rely heavily on clusters.
Implementation detail: each SM has a 256 KB combined L1 + shared memory array, of which up to 228 KB can be configured as shared memory, so a 16-block cluster can address roughly 16 × 228 KB ≈ 3.6 MB of pooled shared memory over the SM-to-SM network inside the GPC, far cheaper than a global-memory round trip.
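A minimal DSMEM sketch using the cooperative groups cluster API (the kernel and variable names are illustrative): two blocks in a cluster each fill their own shared-memory tile, then read their neighbour's tile directly through map_shared_rank().

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two blocks per cluster; each block fills its own shared tile, then reads its
// neighbour's tile straight out of the neighbour's shared memory (DSMEM).
// Compile for sm_90; launch with a grid that is a multiple of the cluster size,
// e.g. exchange_tiles<<<dim3(2), dim3(256)>>>(out);
__global__ void __cluster_dims__(2, 1, 1) exchange_tiles(float* out) {
    __shared__ float tile[256];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned my_rank = cluster.block_rank();              // 0 or 1 within the cluster

    tile[threadIdx.x] = my_rank * 1000.0f + threadIdx.x;  // fill local shared memory
    cluster.sync();                                       // every block has finished writing

    float* peer = cluster.map_shared_rank(tile, my_rank ^ 1);  // neighbour's copy of `tile`
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer[threadIdx.x];
    cluster.sync();                                       // keep peers resident until reads complete
}
```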
Hopper added DPX instructions: fused add-then-max/min operations on up to three operands (with optional clamping at zero), designed for the recurrence patterns in dynamic programming. Smith-Waterman gap-affine alignment in genomics gets up to a 7× speedup; beam search in NLP and Floyd-Warshall in routing also benefit.
Marginal for LLM training proper, but headline-relevant for biotech / scientific HPC and certain RL tree searches.
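The intrinsics are exposed through the CUDA math API (CUDA 12+): __viaddmax_s32(a, b, c) computes max(a + b, c), and __vimax3_s32_relu takes the max of three values clamped at zero; on sm_90 they map to single DPX instructions. A hedged sketch of one Smith-Waterman-style cell update, with illustrative function and variable names:

```
// Sketch of a Smith-Waterman cell update using DPX intrinsics (not a complete aligner).
__device__ int sw_cell(int diag, int up, int left,
                       int substitution, int gap_penalty) {
    int e = up   - gap_penalty;                      // extend a gap in the query
    int f = left - gap_penalty;                      // extend a gap in the reference
    int h = __viaddmax_s32(diag, substitution, e);   // max(diag + substitution, e): one DPX op
    return __vimax3_s32_relu(h, f, 0);               // max(h, f, 0): clamped three-way max
}
```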
GH100 is on TSMC 4N — the same NVIDIA-customised TSMC process as Ada. At 814 mm² it sits close to the reticle limit, and fully-enabled dies are impractical to yield, so NVIDIA bins aggressively: H100 enables 132 of 144 SMs, H800 also enables 132 but throttles NVLink for export rules, and H20 enables far fewer SMs at lower clocks for the China market.
| Form factor | TDP | Notes |
|---|---|---|
| SXM5 (HGX baseboard) | 700 W | Highest binned, 132 SMs, full NVLink 4 fabric. |
| PCIe 80 GB | 350 W | 114 SMs (cut), lower clocks; NVLink bridge for 2-card. |
| NVL 94 GB (paired) | 2× 400 W | Two cards bridged; 188 GB total. Inference focus. |
| H800 SXM5 | 700 W | NVLink throttled 900→400 GB/s for export rules. |
| H20 | ~400 W | Far fewer SMs, slower clocks; full 96 GB HBM3 (counter-intuitively, a great inference card for Chinese cloud). |
Voltage rails: VDD core (~0.85 V at boost), VDDQ for HBM3 (1.1 V), NVLink I/O (~0.75 V). SXM5 carries no auxiliary power connector — the HGX baseboard delivers the full 700 W through the SXM5 socket itself. The PCIe variant uses a PCIe 5.0 16-pin (12VHPWR) power connector.
H100 was the first GPU to ship HBM3. H200 then upgraded to HBM3e in mid-2024.
| SKU | Stack type | Stacks | Capacity | Pin rate | BW |
|---|---|---|---|---|---|
| H100 SXM5 80 GB | HBM3 16 GB stacks | 5 (one disabled) | 80 GB | 5.23 Gbps | 3.35 TB/s |
| H100 PCIe 80 GB | HBM2e 16 GB stacks | 5 | 80 GB | 3.13 Gbps | 2.0 TB/s |
| H100 NVL 94 GB | HBM3 16 GB stacks | 6 (94 GB usable) | 94 GB | 5.13 Gbps | 3.9 TB/s |
| H200 SXM5 141 GB | HBM3e 24 GB stacks | 6 | 141 GB | 6.4 Gbps | 4.8 TB/s |
| GH200 480 GB | HBM3 96 GB + Grace LPDDR5x 480 GB | 6 + Grace | 576 GB total | ~5.2 Gbps (HBM3) | 4.0 TB/s HBM + 480 GB/s LPDDR5x; C2C 900 GB/s coherent |
| GH200 624 GB | HBM3e 144 GB + Grace LPDDR5x 480 GB | 6 + Grace | 624 GB total | 6.4 Gbps (HBM3e) | 4.8 TB/s HBM + 480 GB/s LPDDR5x; C2C 900 GB/s |
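Sanity check on the pin-rate column: each HBM3/HBM3e stack has a 1024-bit interface, so the H100 SXM5 row works out to 5 stacks × 1024 pins × 5.23 Gbps / 8 = 3.35 TB/s, and the PCIe row to 5 × 1024 × 3.13 / 8 = 2.0 TB/s.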
L2 cache: the full GH100 die carries 60 MB, of which 50 MB is enabled on H100, organised as two 25 MB partitions. SECDED ECC plus HBM3's on-die ECC, always enabled.
NVLink 4 keeps NVLink 3's 25 GB/s per direction per link but raises the link count from 12 to 18, for 900 GB/s of aggregate bidirectional bandwidth per GPU. Signalling moves from NRZ to PAM4: 100 Gbps per lane, two lanes per direction per link.
| Property | NVLink 3 (Ampere) | NVLink 4 (Hopper) |
|---|---|---|
| Per-lane rate | 50 Gbps NRZ | 100 Gbps PAM4 |
| Lanes per link (per direction) | 4 | 2 |
| Per-link BW (1-dir) | 25 GB/s | 25 GB/s |
| Links per GPU | 12 | 18 |
| Aggregate per GPU (bidirectional) | 600 GB/s | 900 GB/s |
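Worked out: 2 lanes × 100 Gbps = 200 Gbps = 25 GB/s per direction per link, i.e. 50 GB/s bidirectional; 18 links × 50 GB/s = 900 GB/s per GPU. The same arithmetic on Ampere's 12 links gives 600 GB/s.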
HGX H100: 8× H100 SXM5 + 4× NVSwitch 3.0 chips (each third-gen NVSwitch has 64 NVLink 4 ports and 3.2 TB/s of full-duplex bandwidth). Any GPU pair can communicate at the full 900 GB/s. NVLink-C2C on GH200 connects the Grace CPU to the Hopper GPU at 900 GB/s coherent — same protocol family, but on-module rather than over cables. Cross-node traffic uses ConnectX-7 NDR 400 Gb/s InfiniBand (8 NICs per HGX H100 system).