NVIDIA GPU Architectures Series — Presentation 22

Warp Scheduling & SIMT — How the SM Actually Issues Instructions

Inside an SM, four warp schedulers each issue one (or, on some architectures, two) instructions per cycle to whichever of their warps is ready. We walk through warp issue rules, ILP, instruction latencies, divergence and re-convergence, predication, occupancy maths, and Volta's independent thread scheduling — the things that decide whether your kernel runs at peak.

00

Topics We'll Cover

The micro-architecture of warp issue: how the SM picks one warp out of dozens each cycle, why the picks decide your kernel's runtime, and the mental models that turn occupancy and divergence from mysteries into arithmetic.

01

The Warp — 32 Threads, One Program Counter (Mostly)

A warp is the unit of scheduling on an NVIDIA SM: 32 threads grouped together, historically sharing a single program counter and executing the same instruction in lockstep. This is the “Single-Instruction, Multiple-Thread” (SIMT) model — the SM fetches one instruction and broadcasts it to all 32 lanes, each holding a different thread's register state.

Volta (2017) changed the deal: it gave each thread its own PC and call stack so that re-convergence after divergence works automatically and starvation-free. But warps still issue at warp granularity for performance; divergent warps are slower, not broken.

[Diagram: one warp instruction, FFMA R4, R0, R1, R2 at PC 0x1820, broadcast to lanes 0..31. All-active issue: every lane executes and writes back (solid green). Predicated issue, branch taken by 18 lanes: the other lanes idle for this issue, but the slot is still spent (hatched grey).]

SIMT vs SIMD — the precise difference

SIMD exposes the vector width in the ISA: one thread issues an instruction over the packed lanes of a vector register, and the programmer or compiler handles masking and loop remainders explicitly. SIMT hides the width behind threads: each of the 32 lanes is a full thread with its own registers and addresses (and, since Volta, its own PC), and the hardware handles divergence with masks, so scalar-looking per-thread code maps onto the wide machine.

Why 32?

Thirty-two has been the warp size on every NVIDIA architecture from Tesla (2006) to Blackwell (2025). It is wide enough to amortise control overhead and hit one cache line per warp on a 128-byte coalesced load (4 bytes × 32 = 128 B), and narrow enough to keep the register file and operand network manageable. Changing it would break every CUDA binary ever shipped, so it is effectively a constant of the universe.
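To see the 128-byte arithmetic in code, a minimal sketch (kernel name ours): consecutive lanes load consecutive 4-byte floats, so each warp's load maps onto exactly one cache line.

__global__ void coalesced_copy(const float* __restrict__ in,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Lanes 0..31 of a warp read 32 consecutive floats: 4 B × 32 = 128 B,
    // one cache line, so the warp's LDG coalesces to a single transaction.
    if (i < n) out[i] = in[i];
}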

02

SM Partitions and Schedulers

From Volta onwards an SM is split into 4 partitions (also called “sub-cores” or “processing blocks”). Each partition is essentially its own little SIMT machine, with one warp scheduler that handles a private pool of warps and feeds private functional units.

Per partition (Ampere SM)

  • 1 warp scheduler & dispatcher
  • 16 FP32 + 16 INT32 cores
  • 1 tensor core (3rd / 4th gen on Ampere/Hopper)
  • 1 SFU (sin, exp, rcp, rsqrt)
  • 1 LD/ST unit set
  • 16 K × 32-bit register file slice (= 64 KB)
  • L0 instruction cache

Per SM totals

  • 4 partitions → 4 schedulers → 4 instructions/SM/cycle
  • 64 FP32 + 64 INT32 cores (Ampere; Ada/Hopper similar; GA10x/AD102 dual-issue FP32 doubles FP32 to 128/SM)
  • 4 tensor cores
  • 4 SFUs, 4 LD/ST groups
  • 64 K × 32-bit (= 256 KB) register file
  • Shared L1 / SMEM (128–228 KB) and L1-I cache
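All of these budgets are queryable at run time; a minimal sketch using the standard cudaDeviceProp fields (maxBlocksPerMultiProcessor needs CUDA 11+):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("SMs: %d, warp size: %d\n", p.multiProcessorCount, p.warpSize);
    printf("regs/SM: %d (= %d KB)\n", p.regsPerMultiprocessor,
           p.regsPerMultiprocessor * 4 / 1024);
    printf("SMEM/SM: %zu KB, max threads/SM: %d, max blocks/SM: %d\n",
           p.sharedMemPerMultiprocessor / 1024,
           p.maxThreadsPerMultiProcessor, p.maxBlocksPerMultiProcessor);
    return 0;
}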

One Ampere SM partition, drawn out

[Diagram: one SM partition of four. The L0 instruction cache holds fetched lines for this partition's warps and feeds the warp scheduler & dispatcher (16 resident warps max; picks 1 ready warp per cycle, dispatches up to 1 instruction, with dual-issue FP32 on some architectures). The register-file slice (16 K × 32-bit = 64 KB) supplies 3 source operands per cycle through the operand-collector network to: 16 FP32 cores (FFMA, FADD, FMUL), 16 INT32 cores (IADD3, IMAD, ISETP), 1 tensor core (HMMA / WGMMA), 1 SFU (sin/exp/rcp), and the LD/ST unit (SMEM / L1 / global). Shared with the rest of the SM: L1/SMEM (128–228 KB), L1-I, MIO crossbar.]

[Diagram: issue rate, drawn at SM and GPU scope.]

Why four partitions?

Volta's V100 introduced the 4-partition design so each scheduler could feed a tensor core without blocking the other three. It also halved the operand-collector network width compared with Pascal's two big schedulers, which made the register file faster to read and easier to lay out at high frequency. Ampere, Ada, Hopper and Blackwell all kept the four-partition layout; only the unit counts inside change.

03

The Issue Rule

Each cycle every scheduler looks at its pool of resident warps (up to 16 per partition, so up to 64 per SM at full Volta+ occupancy) and picks one that is ready. “Ready” is a conjunction of conditions; every condition is a hardware scoreboard or pipeline status bit.

The readiness checks, and what each means:

  • Instruction in L0-I: the next instruction at this warp's PC is fetched and decoded; if not, an i-cache-miss stall.
  • Source registers ready: no outstanding writes to any source operand (RAW hazards tracked by per-warp scoreboard bits).
  • Functional unit free: the dispatch port for the required pipe (FP32, INT, FMA, MUFU, MIO, tensor) is not busy this cycle.
  • Predicate decoded: if predicated (@P0), the predicate has been read; even all-false predicated instructions consume an issue slot.
  • No barrier wait: BAR.SYNC, BAR.ARV, __syncthreads(), named barriers and async-copy barriers all park the warp.
  • No memory dependency: if the next instruction reads a register fed by an outstanding LDG/LDS, the warp waits for the load's scoreboard release.

If more than one warp is ready, the scheduler picks one. The exact policy is undocumented and varies by architecture; it has been described as a loose round-robin biased by warp age and recent issue history. The practical invariant is that no ready warp starves, yet a warp that keeps having ready instructions can still issue on consecutive cycles, which is what makes the ILP tactic of section 05 work.

Nsight Compute — warp-stall reasons sorted by frequency
# Captured on a memory-bound GEMM kernel, A100, fp16
ncu --set full --section WarpStateStats ./gemm

Section: Warp State Statistics
 
Stall — Long Scoreboard          27.4 cycles/instr   # LDG miss waiting on L2/HBM
Stall — Wait                     14.2 cycles/instr   # dep on a fixed-latency op (FFMA)
Stall — Short Scoreboard         11.8 cycles/instr   # LDS waiting on shared mem
Stall — Math Pipe Throttle        4.1 cycles/instr   # FFMA pipe saturated
Stall — Drain                     3.6 cycles/instr   # EXIT or BRA flushing
Stall — MIO Throttle              2.9 cycles/instr   # shared-mem bank conflicts
Stall — IMC Miss                  1.3 cycles/instr   # constant-cache miss
Stall — Branch Resolving          0.7 cycles/instr   # indirect / divergent branch
Stall — Tex Throttle              0.2 cycles/instr   # tex pipe full
Stall — Membar                    0.0 cycles/instr   # fence in flight
 
# Selected (eligible) warp count: 1.6 / cycle  — only 1.6 warps were ready,
# so “Issued” instructions / cycle were limited even though 4 schedulers exist.
The eligible-warp number is the one to read

Nsight Compute reports both active warps (resident) and eligible warps (ready to issue this cycle). Active >> eligible means there are warps but they're all stalled — classic memory-bound kernel. Active ≈ eligible at < 4 means low occupancy — classic register-pressure or small-grid kernel. You want eligible ≥ 4 sustained: at that point the schedulers are saturated and arch peaks become reachable.
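If you just want those two counters without a full profile, a direct metric query is enough. A sketch, assuming a recent Nsight Compute (metric names vary by version; ncu --query-metrics lists what your install supports):

# average eligible vs active warps per scheduler over the kernel
ncu --metrics smsp__warps_eligible.avg.per_cycle_active,smsp__warps_active.avg.per_cycle_active ./gemm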

04

Instruction Latencies (Approximate)

Latency in this context means “how many cycles after issue can a dependent instruction issue?”. Throughput is “how many of the same instruction can issue per cycle?”. The SM hides latency by issuing other warps' instructions in the gap.

Op, latency, and notes:

  • FFMA / IMAD / IADD3: ~4 cycles. Back-to-back FFMAs from independent warps issue every cycle; from a single warp you need ILP ≥ 4.
  • LDS (shared load): ~30 cycles. Bank-conflict-free case; with conflicts, multiply by the worst bank-serialisation factor.
  • LDG (global, L1 hit): ~30–90 cycles. Hits in the SM's L1 data cache.
  • LDG (global, L2 hit): ~200–300 cycles. Misses L1 but hits the L2 shared across all SMs (40–50 MB on Hopper, 60+ on Blackwell).
  • LDG (global, HBM): ~400–700 cycles. Full miss, all the way to DRAM; the number that drives every occupancy decision.
  • HMMA / WGMMA: 4–8 cycles to issue. The instruction issues quickly, but the tensor-core matmul streams over many cycles; back-to-back HMMA from one warp keeps the unit fed.
  • ATOM.E (global atomic): ~600 cycles plus contention. Conflicting addresses serialise; uncontended atomics executed at the L2 atomic units are much faster.
  • MUFU.SIN / MUFU.EXP / MUFU.RCP: ~12–30 cycles. SFU pipe; 4× lower throughput than FFMA.
  • BAR.SYNC (__syncthreads): variable. All warps in the block must reach the barrier; the cost is the arrival skew.

The latency-hiding rule of thumb

To hide an op of latency L cycles when its throughput is 1/cycle, you need L independent in-flight instances. With one warp giving one instance per cycle (back-to-back), that is L resident warps with at least one ready instruction each, or fewer warps with ILP ≥ L.
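A worked instance with the table's numbers (ballpark figures, assuming one issue per cycle):

  FFMA:      L ≈ 4   → 4 independent FFMAs in flight: 4 warps at ILP 1, or 1 warp at ILP 4
  LDS:       L ≈ 30  → ~30 in-flight accesses per partition to cover shared-memory latency
  LDG (HBM): L ≈ 600 → beyond what warp count alone provides (16 warps/partition max);
                       hence wide loads, tiling and prefetch so each warp carries many loads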

Bandwidth, not latency, sets the ceiling

Little's law: bytes in flight = bandwidth × latency. On H100 with 3.35 TB/s HBM and ~600-cycle load latency at 1.83 GHz, the minimum in-flight data to saturate bandwidth is roughly 1.1 MB at all times. Divide by your access size (often 16 B/thread × 32 = 512 B/warp) and you get the in-flight warp-load target. Below it you are leaving HBM idle.
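Spelled out with those H100 numbers (all approximate):

  bytes in flight  = 3.35e12 B/s × (600 cycles ÷ 1.83e9 Hz) ≈ 1.1 MB
  warp-level loads = 1.1 MB ÷ 512 B/warp ≈ 2,150 in flight GPU-wide
  per SM (132 SMs on H100 SXM) ≈ 16 outstanding warp loads each, sustained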

05

Occupancy & ILP — Two Ways to Hide Latency

The scheduler keeps the SM busy by always finding something ready to issue. There are two complementary ways to give it work:

High occupancy (TLP)

Many resident warps (high thread-level parallelism). Each warp may have only one in-flight instruction at a time, but with 16 warps per partition the scheduler always has someone ready. Classic CUDA wisdom from Pascal/Volta.

  • Pro: simple, robust.
  • Con: pressures the register file (more warps → fewer regs/thread); pressures shared memory (more blocks → less SMEM/block).

High ILP

Few warps, but each warp has multiple independent instructions in flight. The scheduler can issue from the same warp on consecutive cycles. Volta+ encourages this: a wider register file and LDG.E.128 (16 B/thread) let one warp keep a partition busy.

  • Pro: more registers per thread → bigger tiles → better arithmetic intensity.
  • Con: requires unrolling and explicit prefetch in source.

Both tactics are valid, and modern CUTLASS / cuBLASLt kernels lean heavily on ILP rather than chasing 100% occupancy. Volta-and-later kernels often deliver near-peak FLOPS at 25–50% occupancy because they pack four independent FFMAs and two prefetches into each warp's loop body.

Two implementations of the same SAXPY-like kernel

a) High occupancy, low ILP — classic “one element per thread” style
__global__ void saxpy_high_occ(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
// Launch: 256 threads/block, 32 regs/thread → 8 blocks/SM = 64 warps/SM = 100% occ.
// One LDG, one FFMA, one STG per thread. Per-warp ILP = 1.
b) Low occupancy, high ILP — Volta-tuned “1 warp does 4 elements per iter”
__global__ void saxpy_high_ilp(float a, const float4* x4, float4* y4, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;
    // 128-bit load: one LDG.E.128 brings 16 bytes/thread → 512 B/warp aligned
    float4 xv = x4[i];
    float4 yv = y4[i];
    yv.x = a * xv.x + yv.x;
    yv.y = a * xv.y + yv.y;
    yv.z = a * xv.z + yv.z;
    yv.w = a * xv.w + yv.w;   // 4 independent FFMAs per warp per loop iter
    y4[i] = yv;
}
// Launch: 128 threads/block, 128 regs/thread (big tiles) → 4 blocks/SM = 16 warps/SM = 25% occ.
// Per-warp ILP = 4 independent FFMAs; one wide LDG.E.128 per thread keeps the
// memory pipe full — achieves 95+% of memory-bandwidth peak.
The Volta inflection

Pre-Volta the dominant guidance was “chase occupancy.” Volta's combination of a big register file (up to 255 regs/thread addressable), wide 128-bit loads, and independent thread scheduling made deep-ILP kernels practical. From Volta onwards, the highest-throughput kernels are typically 25–50% occupancy with deep ILP, not 100% with one element per thread. CUTLASS, cuBLAS and FlashAttention all run that way.

06

Divergence & Re-convergence

When threads in a warp take different paths through an if/else, the warp diverges. The hardware tracks per-lane activity with an active mask; the warp issues the if-branch with one mask, then the else-branch with the inverse, both serialised. After the branches re-meet at the immediate post-dominator, they reconverge and the active mask returns to all-ones.
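To watch the active mask do this, a minimal sketch (kernel name ours): each half of a 32-thread warp reports the mask it executes under, then both halves reconverge.

#include <cstdio>

__global__ void show_masks()
{
    int lane = threadIdx.x % 32;
    if (lane < 16) {
        unsigned m = __activemask();   // typically 0x0000ffff here
        if (lane == 0) printf("if-path mask   = %08x\n", m);
    } else {
        unsigned m = __activemask();   // typically 0xffff0000 here
        if (lane == 16) printf("else-path mask = %08x\n", m);
    }
    __syncwarp();                      // explicit reconvergence point
    if (lane == 0) printf("reconverged    = %08x\n", __activemask());
}
// Launch as show_masks<<<1, 32>>>(). "Typically" because ITS does not
// guarantee that all 16 lanes of a side arrive at __activemask() together.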

Pre-Volta (Pascal and earlier)

One PC per warp. The compiler emits SSY (set sync) to mark the IPDOM and .S-suffixed branches that pop the convergence stack. Lanes on the not-taken path are idle but cannot run anything else; they wait. Reconvergence is guaranteed at the IPDOM, but individual lanes can starve: a masked-off lane makes no progress until its siblings reach the reconvergence point.

Consequence: a lane holding a lock cannot release it while sibling lanes spin waiting. Inter-thread mutual exclusion within a warp is fundamentally broken.

Volta and later (ITS)

Each thread has its own PC and its own call stack. The compiler emits BSSY/BSYNC SASS pairs which set up a re-convergence point but do not force lockstep. The scheduler may interleave the two divergent halves of the warp at instruction granularity. Lanes can make independent forward progress; mutual exclusion within a warp is now possible.

[Diagram: time on the x-axis; pre-Volta vs Volta+ schedule of a divergent warp. Pre-Volta: the whole if-path runs under mask 0xffff0000 while the else lanes idle, then the else-path under mask 0x0000ffff, then an SSY pop reconverges. Volta+ ITS: the scheduler interleaves if[0], else[0], if[1], else[1], … and reconverges at BSYNC.]

Divergent warps still pay the lane-count cost — the aggregate work isn't reduced by ITS. ITS only buys two things: forward progress for individual threads (locks!), and freedom for the scheduler to interleave the two halves to hide latency better. Crucial gotcha: code that assumed lockstep (warp shuffles without an explicit mask) broke on Volta+.
The lockstep assumption is dead

Pre-Volta, programmers casually assumed that all 32 lanes execute the same instruction at the same cycle — so a __shfl(x, lane) “just worked” because every active lane was at the same PC. Volta+ may interleave divergent halves, so an unmasked __shfl can read garbage from a lane that hasn't reached the same instruction yet. The fix is the _sync family: __shfl_sync(0xffffffff, x, lane). The mask both selects participating lanes and forces them to converge before the operation.
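The masked intrinsics compose into the classic warp primitives. A minimal sketch of a full-warp sum (function name ours), assuming all 32 lanes are active at the call:

__device__ float warp_sum(float v)
{
    // Butterfly (XOR) reduction: after 5 rounds every lane holds the total.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;
}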

07

Predication vs Branching

The compiler has two ways to implement an if in a warp:

The ptxas heuristic: short bodies (1–3 instructions per side) get predicated; long ones get a real branch, because predication wastes too many issue slots and hurts instruction-cache density. You can nudge it either way: ?: expressions for tiny bodies tend to predicate, #pragma unroll exposes short bodies to the heuristic, and structuring loops so the condition is warp-uniform invites a cheap uniform branch.

Cost model, predication vs divergent branching:

  • Issue slots spent, uniform condition: predication pays for both sides; a branch pays only for the taken side.
  • Issue slots spent, divergent condition: both pay for both sides; the branch adds branch overhead on top.
  • Front-end I-cache pressure: predication inlines both bodies in full; a branch keeps the code compact.
  • Best for: predication suits 1–3-instruction bodies; branches suit large bodies and often-uniform conditions.

Worked SASS — ptxas chose predication for a tiny if/else

CUDA source
__device__ unsigned choose(unsigned a, unsigned b)
{
    return (a > b) ? a + b : a + 0x10u;
}
Resulting SASS (sm_80)
// R0 = a, R1 = b
ISETP.GT.U32.AND P0, PT, R0, R1, PT      // P0 = (a > b)
@P0  IADD3 R2, R0, R1, RZ                // if-true:  R2 = a + b
@!P0 IADD3 R2, R0, 0x10, RZ              // if-false: R2 = a + 0x10
// no BRA emitted — both IADD3s issue, predicates filter their commit

Both IADD3 instructions issue in every lane. Lanes where P0 is false discard the first IADD3's result; lanes where P0 is true discard the second's. The cost is two issue slots but no divergence overhead. When the condition is warp-uniform a real branch would skip one side entirely, so for bodies of about three instructions the two strategies are roughly break-even.

Loop-invariant predicates

If the predicate is loop-invariant and uniform across the warp, ptxas may hoist it: a BRA.UNI at the top of the loop selects the right body — same code, no per-iteration predicate cost. You can help it along with __builtin_assume, or by writing two specialised kernels and selecting at launch time, as in the sketch below.
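A sketch of the specialise-and-select-at-launch idea (names ours): making the condition a template parameter turns the per-iteration predicate into a compile-time constant that ptxas deletes outright.

template <bool kAddBias>
__global__ void scale(float* y, const float* x, float bias, float c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (kAddBias) y[i] = c * x[i] + bias;   // dead code in the <false> build
    else          y[i] = c * x[i];          // dead code in the <true> build
}

// Host side, selecting the specialisation at launch time:
// (addBias ? scale<true> : scale<false>)<<<grid, block>>>(y, x, bias, c, n);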

08

Independent Thread Scheduling (Volta+)

Lockstep warp execution made critical sections impossible to write safely within a warp. A thread holding a lock could not release it while sibling lanes spun waiting for it: every lane was pinned to the same PC, so the winner could never advance to the release. Several decades of multi-threaded textbook code simply did not port to pre-Volta GPUs.

Volta's promise: per-thread forward progress

A spinlock that works on Volta+ but not on Pascal
__device__ int mutex = 0;

__device__ void acquire() {
    while (atomicCAS(&mutex, 0, 1) != 0) { /* spin */ }
    __threadfence();    // acquire: make the previous holder's writes visible
}
__device__ void release() {
    __threadfence();    // release: publish this thread's critical-section writes
    atomicExch(&mutex, 0);
}

__global__ void k(float* shared_acc, float v) {
    acquire();
    *shared_acc += v;       // critical section
    release();
    // On Pascal: lanes that lost the CAS spin forever because the winning lane
    // is at the same PC and cannot make progress to the release.
    // On Volta+ ITS: the winner can advance to release, the loser then makes
    // progress on its own retry, and the warp converges naturally.
}

The cost: explicit warp synchronisation now required

Code that relied on lockstep, especially warp-level intrinsics, broke. Volta+ introduced the _sync family with a 32-bit mask argument: the mask names participating lanes and forces them to converge before the op.

Old vs new warp shuffle
// Pascal era — relied on implicit lockstep
float v = __shfl(x, lane);                       // deprecated since CUDA 9.0

// Volta+ era — explicit mask, explicit convergence
float v = __shfl_sync(0xffffffff, x, lane);     // all 32 lanes participate
unsigned mask = __activemask();                 // query who is here right now
unsigned vote = __ballot_sync(mask, predicate); // bitmask of lanes voting yes
int sum = __reduce_add_sync(mask, ival);        // integer only; cc ≥ 8.0 (Ampere+), not Volta
Old (deprecated) form, the new _sync form, and why the change:

  • __shfl(x, lane) → __shfl_sync(mask, x, lane): forces convergence of the named lanes before reading.
  • __ballot(p) → __ballot_sync(mask, p): who-voted-yes is only meaningful with a defined participant set.
  • __any(p) / __all(p) → __any_sync / __all_sync: same reasoning.
  • (implicit lockstep) → __syncwarp(mask): explicit warp-level barrier; cheap, no impact on the block barrier.
If your old kernel suddenly returns garbage on Hopper

The single most common Volta-onwards porting bug is calling a non-_sync warp intrinsic. The compiler will warn; do not ignore the warning. Either pass 0xffffffff (all lanes) or, if you genuinely have a divergent context, use __activemask() to capture the real participating set.
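The mechanical fix, sketched (variable and function names ours): name the participants explicitly instead of assuming lockstep.

// before: pre-Volta code, undefined under divergence on Volta+
//   if (__any(err)) recover();
// after: capture the real participant set at this point in the code
unsigned mask = __activemask();
if (__any_sync(mask, err)) recover();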

09

Occupancy Maths — Worked Out

Occupancy = (resident warps per SM) / (max warps per SM). On Volta and later the cap is 64 warps/SM (= 2048 threads). The actual resident count is the minimum of three independent limits.

Resource, per-SM budget (Volta+), and per-block consumption:

  • Registers: budget 64 K × 32-bit = 256 KB; a block consumes threads × regs/thread.
  • Shared memory: budget 100–228 KB (configurable split with L1); a block consumes the SMEM it declares.
  • Warps: budget 64; a block consumes ceil(threads / 32).
  • Blocks: budget 32 (Volta+; 16 on some architectures); each block consumes 1.

Worked example A — register-bound

Illustrative numbers: 256 threads/block at 128 regs/thread → 32 K regs/block. The 64 K-register file fits 2 blocks → 16 warps/SM = 25% occupancy. Registers bind first.

Worked example B — warp-cap-bound

256 threads/block at 24 regs/thread, no SMEM: registers would allow 10 blocks, but 8 blocks × 8 warps already hits the 64-warp cap → 8 resident blocks, 100% occupancy. The warp cap binds.

Worked example C — SMEM-bound

256 threads/block using 48 KB of SMEM on a 100 KB budget → 2 blocks/SM → 16 warps = 25% occupancy, even though registers and the warp cap would allow far more.
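The same min-of-limits rule as a helper, a sketch with the Volta+ budgets from the table hard-coded (names ours; register-allocation granularity is ignored — the runtime API below accounts for it):

int blocks_per_sm(int threads, int regs_per_thread, int smem_bytes)
{
    int by_regs   = 65536 / (threads * regs_per_thread);          // 64 K registers
    int by_smem   = smem_bytes ? (100 * 1024) / smem_bytes : 32;  // 100 KB SMEM
    int warps     = (threads + 31) / 32;
    int by_warps  = 64 / warps;                                   // 64-warp cap
    int by_blocks = 32;                                           // block cap
    int b = by_regs;
    if (by_smem   < b) b = by_smem;
    if (by_warps  < b) b = by_warps;
    if (by_blocks < b) b = by_blocks;
    return b;    // occupancy = b * warps / 64.0
}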

Asking the runtime instead of doing the maths by hand
int blocks = 0, threads_per_block = 256;
size_t shared_bytes = 48 * 1024;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocks,                       // out
    my_kernel,                     // kernel function pointer
    threads_per_block,             // per-block thread count
    shared_bytes);                 // dynamic SMEM

// Or pick a configuration that maximises occupancy:
int minGrid = 0, optBlock = 0;
cudaOccupancyMaxPotentialBlockSize(&minGrid, &optBlock,
    my_kernel, /*dyn smem bytes*/0, /*max block size*/1024);

// Same per-SM query, with a flags argument
// (e.g. cudaOccupancyDisableCachingOverride):
int active = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags(
    &active, my_kernel, threads_per_block, shared_bytes,
    cudaOccupancyDefault);
Higher occupancy is not always faster

Beyond the point where the schedulers are saturated, more warps don't help — and they cost you registers/thread, which forces wider tiles to spill to L1, which costs bandwidth. Many production kernels (CUTLASS GEMM tiles, FlashAttention) target 25–50% occupancy on purpose. Use Nsight Compute to read the “eligible warps per cycle” metric: if it is ≥ 4 sustained, occupancy is enough.

10

Common Stall Patterns and Cures

Nsight Compute names ten or so warp-stall reasons. Most kernels show one or two dominant ones; once you know which, the cure is usually mechanical.

Stall reason, symptom, and cure:

  • Long Scoreboard (memory dependency). Symptom: waiting on an outstanding LDG / LDG.E / LDS to retire. Cure: bigger tiles to reuse loaded data; cp.async + cp.async.commit double-buffering; raise occupancy if it is low; check coalescing.
  • Wait (execution dependency). Symptom: a fixed-latency op (FFMA, IMAD, HMMA) feeds the next instruction and there is not enough independent work. Cure: unroll loops for ILP; reduce regs/thread to fit more warps; interleave independent accumulators.
  • Barrier (__syncthreads). Symptom: one warp waiting for siblings to reach the barrier. Cure: use cuda::barrier and named barriers (Hopper); move heavy load/compute to cp.async; reduce SMEM-traffic skew.
  • MIO Throttle. Symptom: the memory I/O subsystem (SMEM bank conflicts, atomics, surface ops) is saturated. Cure: pad shared arrays to break bank conflicts; coalesce; replace atomics with reductions.
  • Dispatch Stall. Symptom: the warp has an instruction ready but the dispatcher holds it back, typically because the dominant pipe is oversubscribed (e.g. back-to-back FFMAs). Cure: mix instruction types (FP + INT + LDS); increase warp count; schedule independent accumulators.
  • Tex Throttle. Symptom: the texture pipe is saturated by TEX/TLD ops. Cure: use plain LDG/LDS when texture filtering isn't actually needed.
  • Math Pipe Throttle (FP / DP / SFU). Symptom: the corresponding pipe is full every cycle. Cure: stagger instruction types; for SFU-heavy code, substitute polynomial approximations or reorder.
  • Drain. Symptom: the warp is flushing at EXIT or after a branch. Cure: usually benign; large blocks tail more, so consider persistent kernels for very small grids.
  • Branch Resolving. Symptom: an indirect or divergent branch is waiting on resolution. Cure: make branches uniform; precompute targets; predicate when bodies are short.
Triage script

If Long Scoreboard dominates > 30% of cycles, you are memory-bound — tile bigger and async-copy. If Wait dominates, you are ILP-bound — unroll. If MIO Throttle dominates, your shared-memory access pattern has bank conflicts — pad arrays. If Math Pipe Throttle dominates, you are doing well — you are at the arch's compute peak. The other reasons rarely dominate in well-formed code.

11

Interactive: Issue-Rate Predictor

Pick a GPU class and your block size / register pressure / shared memory. The predictor computes the binding limit and estimates achievable IPC plus a likely dominant stall.

[Interactive widget. Inputs: block size (e.g. 256 threads), regs/thread (e.g. 64), dynamic SMEM (e.g. 0 KB). Outputs: blocks/SM, warps/SM, occupancy, estimated IPC, SM-wide IPC, likely dominant stall.]