Inside an SM, four warp schedulers each issue one (or two) instructions per cycle to whichever warp is ready. We walk through warp issue rules, ILP, instruction latencies, divergence and convergence, predication, occupancy maths, and Volta's independent thread scheduling — the details that decide whether your kernel runs at peak.
The micro-architecture of warp issue: how the SM picks one warp out of dozens each cycle, why the picks decide your kernel's runtime, and the mental models that turn occupancy and divergence from mysteries into arithmetic.
A warp is the unit of scheduling on an NVIDIA SM: 32 threads grouped together, historically sharing a single program counter and executing the same instruction in lockstep. This is the “Single-Instruction, Multiple-Thread” (SIMT) model — the SM fetches one instruction and broadcasts it to all 32 lanes, each holding a different thread's register state.
Volta (2017) changed the deal: it gave each thread its own PC and call stack so that re-convergence after divergence works automatically and starvation-free. But warps still issue at warp granularity for performance; divergent warps are slower, not broken.
Thirty-two has been the warp size on every NVIDIA architecture from Tesla (2006) to Blackwell (2025). It is wide enough to amortise control overhead and hit one cache line per warp on a 128-byte coalesced load (4 bytes × 32 = 128 B), and narrow enough to keep the register file and operand network manageable. Changing it would break every CUDA binary ever shipped, so it is effectively a constant of the universe.
From Volta onwards an SM is split into 4 partitions (also called “sub-cores” or “processing blocks”). Each partition is essentially its own little SIMT machine, with one warp scheduler that handles a private pool of warps and feeds private functional units.
Volta's V100 introduced the 4-partition design so each scheduler could feed a tensor core without blocking the other three. It also halved the operand-collector network width compared with Pascal's two big schedulers, which made the register file faster to read and easier to lay out at high frequency. Ampere, Ada, Hopper and Blackwell all kept the four-partition layout; only the unit counts inside change.
Each cycle every scheduler looks at its pool of resident warps (up to 16 per partition, so up to 64 per SM at full Volta+ occupancy) and picks one that is ready. “Ready” is a conjunction of conditions; every condition is a hardware scoreboard or pipeline status bit.
| Readiness check | What it means |
|---|---|
| Instruction in L0-I | The next instruction for this warp's PC is fetched and decoded; if not, the warp stalls on an i-cache miss. |
| Source registers ready | No outstanding writes to any source operand (RAW hazard tracked by per-warp scoreboard bits). |
| Functional unit free | The dispatch port for the required pipe (FP32, INT, FMA, MUFU, MIO, tensor) is not busy this cycle. |
| Predicate decoded | If predicated (@P0), the predicate is read; even all-false predicated instructions consume an issue slot. |
| No barrier wait | BAR.SYNC, BAR.ARV, __syncthreads(), named barriers, and async-copy barriers all park the warp until released. |
| No memory dependency | If the next instruction reads a register fed by an outstanding LDG/LDS, wait for the load's scoreboard release. |
If more than one warp is ready, the scheduler picks one. The exact policy is undocumented and varies by architecture; it has been described as a loose round-robin biased by warp age and recent issue history. The key invariant is fairness: a warp that just issued is not unconditionally preferred next cycle, so no ready warp starves.
```
# Captured on a memory-bound GEMM kernel, A100, fp16
ncu --set full --section WarpStateStats ./gemm
Section: Warp State Statistics
Stall — Long Scoreboard 27.4 cycles/instr # LDG miss waiting on L2/HBM
Stall — Wait 14.2 cycles/instr # dep on a fixed-latency op (FFMA)
Stall — Short Scoreboard 11.8 cycles/instr # LDS waiting on shared mem
Stall — Math Pipe Throttle 4.1 cycles/instr # FFMA pipe saturated
Stall — Drain 3.6 cycles/instr # EXIT or BRA flushing
Stall — MIO Throttle 2.9 cycles/instr # shared-mem bank conflicts
Stall — IMC Miss 1.3 cycles/instr # constant-cache miss
Stall — Branch Resolving 0.7 cycles/instr # indirect / divergent branch
Stall — Tex Throttle 0.2 cycles/instr # tex pipe full
Stall — Membar 0.0 cycles/instr # fence in flight
# Selected (eligible) warp count: 1.6 / cycle — only 1.6 warps were ready,
# so “Issued” instructions / cycle were limited even though 4 schedulers exist.
```
Nsight Compute reports both active warps (resident) and eligible warps (ready to issue this cycle). Active >> eligible means there are warps but they're all stalled — classic memory-bound kernel. Active ≈ eligible at < 4 means low occupancy — classic register-pressure or small-grid kernel. You want eligible ≥ 4 sustained: at that point the schedulers are saturated and arch peaks become reachable.
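To pull just those two counters without a full profile, a hedged sketch — the metric names below are as spelled in recent Nsight Compute releases; verify against `ncu --query-metrics` on your version:

```
# Active vs eligible warps per scheduler, per cycle (names may vary by ncu version)
ncu --metrics smsp__warps_active.avg.per_cycle_active,smsp__warps_eligible.avg.per_cycle_active ./gemm
```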
Latency in this context means “how many cycles after issue can a dependent instruction issue?”. Throughput is “how many of the same instruction can issue per cycle?”. The SM hides latency by issuing other warps' instructions in the gap.
| Op | Latency | Notes |
|---|---|---|
| FFMA / IMAD / IADD3 | ~4 cycles | Back-to-back FFMAs from independent warps issue every cycle; from one warp you need ILP ≥ 4. |
| LDS (shared load) | ~30 cycles | Bank-conflict-free; with conflicts, multiply by the worst bank-serialisation factor. |
| LDG (global, L1 hit) | ~30–90 cycles | Hit in the SM-local L1 data cache (unified with shared memory). |
| LDG (global, L2 hit) | ~200–300 cycles | Misses L1 but hits the SM-shared L2 (40–50 MB on Hopper, 60+ on Blackwell). |
| LDG (global, HBM) | ~400–700 cycles | Full miss, all the way to DRAM. The number that drives every occupancy decision. |
| HMMA / WGMMA | 4–8 cycles to issue | The instruction issues quickly but the tensor-core matmul streams over many cycles; back-to-back HMMAs from one warp keep the unit fed. |
| ATOM.E (global atomic) | ~600 cycles + contention | Conflicting atomics serialise; uncontended ones execute in the L2 atomic units and complete faster. |
| MUFU.SIN, MUFU.EXP, MUFU.RCP | ~12–30 cycles | SFU pipe; 4× lower throughput than FFMA. |
| BAR.SYNC (__syncthreads) | variable | All warps in the block must reach the barrier; the cost is the arrival skew. |
To hide an op of latency L cycles when its throughput is 1/cycle, you need L independent in-flight instances. With one warp giving one instance per cycle (back-to-back), that is L resident warps with at least one ready instruction each, or fewer warps with ILP ≥ L.
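A minimal sketch of the ILP route (illustrative names; assumes n is a multiple of 4): four independent accumulator chains give ILP = 4, enough to cover FFMA's ~4-cycle latency from a single warp:

```cuda
// Four independent FFMA chains per thread: each fmaf depends only on its
// own accumulator from 4 issues ago, so the ~4-cycle latency is hidden
// even with a single warp issuing back-to-back.
__device__ float dot4(const float* a, const float* b, int n) {
    float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
    for (int k = 0; k < n; k += 4) {        // n assumed divisible by 4
        acc0 = fmaf(a[k + 0], b[k + 0], acc0);
        acc1 = fmaf(a[k + 1], b[k + 1], acc1);
        acc2 = fmaf(a[k + 2], b[k + 2], acc2);
        acc3 = fmaf(a[k + 3], b[k + 3], acc3);
    }
    return (acc0 + acc1) + (acc2 + acc3);   // combine the chains once at the end
}
```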
The practical levers are wide loads (LDG.E.128), asynchronous copies (cp.async), wider tiles, and software pipelining of compute over loads. For the memory side the target falls out of Little's law: HBM-load latency × required bandwidth = bytes that must be in flight. On H100 with 3.35 TB/s HBM and ~600-cycle load latency at 1.83 GHz, that is roughly bandwidth × latency ≈ 1.1 MB. Divide by your access size (often 16 B/thread × 32 = 512 B/warp) and you get the target number of in-flight warp-loads. Below it you are leaving HBM idle.
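The same arithmetic as a self-contained check; all constants are the H100 assumptions from the paragraph above:

```cpp
#include <cstdio>

int main() {
    // Assumptions from the text (H100-class numbers):
    const double bw_Bps    = 3.35e12;             // HBM bandwidth, bytes/s
    const double clk_Hz    = 1.83e9;              // SM clock
    const double lat_s     = 600.0 / clk_Hz;      // ~600-cycle load latency, seconds
    const double in_flight = bw_Bps * lat_s;      // bytes that must be outstanding
    const double per_warp  = 16.0 * 32.0;         // 16 B/thread x 32 lanes = 512 B/warp
    std::printf("in-flight bytes:      %.2f MB\n", in_flight / 1e6);      // ~1.1 MB
    std::printf("warp-loads in flight: %.0f\n",    in_flight / per_warp); // ~2100 GPU-wide
    return 0;
}
```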
The scheduler keeps the SM busy by always finding something ready to issue. There are two complementary ways to give it work:
- Thread-level parallelism (TLP): many resident warps. Each warp may have only one in-flight instruction at a time, but with 16 warps per partition the scheduler always has someone ready. Classic CUDA wisdom from the Pascal/Volta era.
- Instruction-level parallelism (ILP): few warps, but each with multiple independent instructions in flight. The scheduler can issue from the same warp on consecutive cycles. Volta+ encourages this: a large register file and LDG.E.128 (16 B/thread) let one warp keep a partition busy.
Both tactics are valid, and modern CUTLASS / cuBLASLt kernels lean heavily on ILP rather than chasing 100% occupancy. Volta-and-later kernels often deliver near-peak FLOPS at 25–50% occupancy because they pack four independent FFMAs and two prefetches into each warp's loop body.
```cuda
__global__ void saxpy_high_occ(float a, const float* x, float* y, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) y[i] = a * x[i] + y[i];
}
// Launch: 256 threads/block, 32 regs/thread → 8 blocks/SM = 64 warps/SM = 100% occ.
// One LDG, one FFMA, one STG per thread. Per-warp ILP = 1.
```
```cuda
__global__ void saxpy_high_ilp(float a, const float4* x4, float4* y4, int n4)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= n4) return;
// 128-bit load: one LDG.E.128 brings 16 bytes/thread → 512 B/warp aligned
float4 xv = x4[i];
float4 yv = y4[i];
yv.x = a * xv.x + yv.x;
yv.y = a * xv.y + yv.y;
yv.z = a * xv.z + yv.z;
yv.w = a * xv.w + yv.w; // 4 independent FFMAs per warp per loop iter
y4[i] = yv;
}
// Launch: 128 threads/block, 64 regs/thread, deliberately capped (e.g. via
// grid sizing / __launch_bounds__) at 4 blocks/SM = 16 warps/SM = 25% occ.
// Per-warp ILP = 4 independent FFMAs behind one wide LDG — enough latency
// hiding to reach 95%+ of memory peak.
```
Pre-Volta the dominant guidance was “chase occupancy.” Volta's generous register budget (up to 255 regs/thread), four-partition SM, and 128-bit loads shifted the balance: from Volta onwards, the highest-throughput kernels are typically 25–50% occupancy with deep ILP, not 100% with one element per thread. CUTLASS, cuBLAS, and FlashAttention all run that way.
When threads in a warp take different paths through an if/else, the warp diverges. The hardware tracks per-lane activity with an active mask; the warp issues the if-branch with one mask, then the else-branch with the inverse, both serialised. After the branches re-meet at the immediate post-dominator, they reconverge and the active mask returns to all-ones.
Pre-Volta: one PC per warp. The compiler emits SSY (set synchronisation point) to mark the IPDOM and .S-suffixed branches that pop the convergence stack. Lanes on the not-taken side are idle and cannot run anything else — they wait. Reconvergence is guaranteed at the IPDOM, but starvation is possible: a lane that needs a sibling lane parked on the other side of the branch to make progress can wait forever.
Consequence: a lane holding a lock cannot release it while sibling lanes spin waiting. Inter-thread mutual exclusion within a warp is fundamentally broken.
Volta+ independent thread scheduling: each thread has its own PC and its own call stack. The compiler emits BSSY/BSYNC SASS pairs, which set up a reconvergence point but do not force lockstep. The scheduler may interleave the two divergent halves of the warp at instruction granularity. Lanes can make independent forward progress; mutual exclusion within a warp is now possible.
Pre-Volta, programmers casually assumed that all 32 lanes execute the same instruction at the same cycle — so a __shfl(x, lane) “just worked” because every active lane was at the same PC. Volta+ may interleave divergent halves, so an unmasked __shfl can read garbage from a lane that hasn't reached the same instruction yet. The fix is the _sync family: __shfl_sync(0xffffffff, x, lane). The mask both selects participating lanes and forces them to converge before the operation.
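A minimal sketch of the pattern done right — a butterfly reduction where the full-warp mask both names the participants and forces convergence at every step:

```cuda
// All 32 lanes participate; the 0xffffffff mask makes each shuffle a
// convergence point, so the exchange is well-defined even under ITS.
__device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, offset);
    return v;   // every lane ends up holding the warp-wide sum
}
```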
The compiler has two ways to implement an if in a warp:

1. Predication: emit both sides inline and let a predicate register (@P0) mask which lanes commit results. No branch instruction and no divergence machinery, but every lane spends issue slots on both sides.
2. A real branch: BRA. If the lanes agree (uniform branch), only one side runs. If they disagree, the warp diverges and both sides run sequentially with masks — same cost as predication, plus branch overhead.

The ptxas heuristic: short branches (1–3 instructions per side) get predicated; long branches get a real branch, because predication wastes too many issue slots and can hurt instruction-cache density. You can nudge it one way or the other with #pragma unroll, by using ?: for short branches, and by structuring loops so the compiler sees them as uniform.
| Cost model | Predication | Branching (divergent) |
|---|---|---|
| Issue slots spent (uniform case) | both sides | only the taken side |
| Issue slots spent (divergent case) | both sides | both sides + branch overhead |
| Front-end I-cache pressure | full inline expansion | compact code |
| Best for | 1–3 instruction bodies | large bodies, often-uniform conditions |
```cuda
__device__ unsigned choose(unsigned a, unsigned b)
{
return (a > b) ? a + b : a + 0x10u;
}
```
```
// R0 = a, R1 = b
ISETP.GT.AND.U32 P0, PT, R0, R1, PT // P0 = (a > b)
@P0 IADD3 R2, R0, R1, RZ // if-true: R2 = a + b
@!P0 IADD3 R2, R0, 0x10, RZ // if-false: R2 = a + 0x10
// no BRA emitted — both IADD3s issue, predicates filter their commit
```
Both IADD3 instructions issue for every lane. Lanes where P0 is false discard the result of the first; lanes where P0 is true discard the result of the second. The cost is two issue slots but no divergence overhead. For bodies of about three instructions it is roughly break-even with a real branch when the condition is fully divergent.
If the predicate is loop-invariant and uniform across the warp, ptxas may hoist it: a BRA.UNI at the top of the loop selects the right body. Same code, no per-iteration predicate cost. Achieved by using __builtin_assume or by writing two specialised kernels and selecting at launch time.
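A minimal sketch of that last option, with illustrative names: the condition becomes a template parameter, so each specialisation compiles with no predicate and no branch at all:

```cuda
// The bool is a compile-time constant per specialisation: ptxas eliminates
// the dead side entirely, leaving no @P0 and no BRA in either kernel.
template <bool AddBias>
__global__ void scale(float* __restrict__ y, const float* __restrict__ x,
                      float bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = AddBias ? 2.0f * x[i] + bias
                       : 2.0f * x[i];
}
// Select the specialisation at launch time:
//   if (add_bias) scale<true ><<<grid, block>>>(y, x, b, n);
//   else          scale<false><<<grid, block>>>(y, x, b, n);
```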
Lockstep warp execution made intra-warp critical sections impossible to write safely. A lane holding a lock could not release it while sibling lanes spun trying to acquire it: every lane was pinned to the same PC, so the winner could never advance to the release. Several decades of multi-threaded textbook code simply did not port to pre-Volta GPUs.
```cuda
__device__ int mutex = 0;
__device__ void acquire() {
while (atomicCAS(&mutex, 0, 1) != 0) { /* spin */ }
}
__device__ void release() { atomicExch(&mutex, 0); }
__global__ void k(float* shared_acc, float v) {
acquire();
*shared_acc += v; // critical section
release();
// On Pascal: lanes that lost the CAS spin forever because the winning lane
// is at the same PC and cannot make progress to the release.
// On Volta+ ITS: the winner can advance to release, the loser then makes
// progress on its own retry, and the warp converges naturally.
}
```
Code that relied on lockstep, especially warp-level intrinsics, broke. Volta+ introduced the _sync family with a 32-bit mask argument: the mask names participating lanes and forces them to converge before the op.
```cuda
// Pascal era — relied on implicit lockstep
float v = __shfl(x, lane); // deprecated, removed in CUDA 9+
// Volta+ era — explicit mask, explicit convergence
float v = __shfl_sync(0xffffffff, x, lane); // all 32 lanes participate
unsigned mask = __activemask(); // query who is here right now
int b = __ballot_sync(mask, predicate);
int sum = __reduce_add_sync(mask, x); // HW warp reduction, sm_80+ (Ampere)
```
| Old (deprecated) | New _sync form | Why the change |
|---|---|---|
| __shfl(x, lane) | __shfl_sync(mask, x, lane) | Force convergence on the named lanes before reading. |
| __ballot(p) | __ballot_sync(mask, p) | Same: who-voted-yes is only meaningful with a defined participant set. |
| __any(p) / __all(p) | __any_sync / __all_sync | Same. |
| (implicit) | __syncwarp(mask) | Explicit warp-level barrier; cheap, no impact on block barrier. |
The single most common Volta-onwards porting bug is calling a non-_sync warp intrinsic. The compiler will warn; do not ignore the warning. Either pass 0xffffffff (all lanes) or, if you genuinely have a divergent context, use __activemask() to capture the real participating set.
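When the divergence point is known in advance, a sketch of an even safer pattern: vote with __ballot_sync while the warp is still converged, then diverge. This side-steps the known caveat that __activemask() can under-report peers in a transiently split warp:

```cuda
__device__ int sum_where(int v, bool take) {
    // Vote while all 32 lanes are still converged...
    unsigned mask = __ballot_sync(0xffffffffu, take);
    int s = 0;
    if (take) {
        // ...then exactly the lanes named in `mask` reach this _sync op.
        s = __reduce_add_sync(mask, v);   // HW warp reduction, sm_80+
    }
    return s;   // participating lanes hold the sum; the rest hold 0
}
```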
Occupancy = (resident warps per SM) / (max warps per SM). On Volta, GA100 Ampere, and Hopper the cap is 64 warps/SM (= 2048 threads); Turing and Ada cap lower. The actual resident count is the minimum of four independent limits.
| Resource | Per-SM budget (Volta+) | Per-block consumption |
|---|---|---|
| Registers | 64 K × 32-bit = 256 KB | threads × regs/thread |
| Shared memory | 100–228 KB (config split with L1) | SMEM declared by the block |
| Warps | 64 | ceil(threads/32) |
| Blocks | 32 (Volta+, 16 on some) | 1 |
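A worked instance of taking that minimum, for a hypothetical kernel with 256 threads/block, 48 regs/thread, and 32 KB of dynamic SMEM on a 100 KB-SMEM part:

```
Registers:  65,536 regs / (256 × 48) = 5 blocks   (allocation granularity may round down)
Shared mem: 100 KB / 32 KB           = 3 blocks   ← binding limit
Warps:      64 / (256 / 32)          = 8 blocks
Blocks:     32                       = 32 blocks
Resident = min(5, 3, 8, 32) = 3 blocks = 24 warps → 24 / 64 = 37.5% occupancy
```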
```cpp
int blocks = 0, threads_per_block = 256;
size_t shared_bytes = 48 * 1024;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&blocks, // out
my_kernel, // kernel function pointer
threads_per_block, // per-block thread count
shared_bytes); // dynamic SMEM
// Or pick a configuration that maximises occupancy:
int minGrid = 0, optBlock = 0;
cudaOccupancyMaxPotentialBlockSize(&minGrid, &optBlock,
my_kernel, /*dyn smem fn*/0, /*max block*/1024);
// The flags variant of the same occupancy calculation (e.g. with
// cudaOccupancyDisableCachingOverride); it does not measure achieved occupancy:
int active = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags(
&active, my_kernel, threads_per_block, shared_bytes,
    cudaOccupancyDefault);
```
Beyond the point where the schedulers are saturated, more warps don't help — and they cost registers per thread, which can force register spills (local-memory traffic through L1), which costs bandwidth. Many production kernels (CUTLASS GEMM tiles, FlashAttention) target 25–50% occupancy on purpose. Use Nsight Compute to read the “eligible warps per cycle” metric: if it is ≥ 4 sustained, occupancy is enough.
Nsight Compute names ten or so warp-stall reasons. Most kernels show one or two dominant ones; once you know which, the cure is usually mechanical.
| Stall reason | Symptom | Cure |
|---|---|---|
| Long Scoreboard (memory dependency) | Waiting on an outstanding LDG / LDG.E / LDS to retire. | Bigger tiles to reuse loaded data; cp.async double-buffering; raise occupancy if low; check coalescing. |
| Wait (execution dependency) | Fixed-latency op (FFMA, IMAD, HMMA) feeding the next instruction; not enough independent work. | Unroll loops for ILP; reduce regs/thread to fit more warps; interleave independent accumulators. |
| Barrier (__syncthreads) | One warp waiting for siblings to reach the barrier. | Use cuda::barrier + named barriers (Hopper); move heavy loads to cp.async; reduce SMEM-traffic skew. |
| MIO Throttle | Memory I/O subsystem (SMEM bank conflicts, atomics, surface ops) saturated. | Pad shared arrays to break bank conflicts; coalesce; replace atomics with reductions. |
| IMC Throttle (issue/math throttle) | Warp scheduler can't issue more of the dominant pipe; back-to-back FFMAs from the same warp. | Mix instruction types (FP + INT + LDS); increase warp count; schedule independent accumulators. |
| Tex Throttle | Texture pipe saturated by TEX / TLD ops. | Use plain LDG/LDS when texture filtering isn't actually needed. |
| Math Pipe Throttle (FP / DP / SFU) | The corresponding pipe is full every cycle. | Stagger instruction types; for SFU-heavy code, replace with polynomial approximations or reorder. |
| Drain | EXIT or a branch-resolving flush at the end of a warp. | Usually benign; larger blocks tail longer. Consider persistent kernels for very small grids. |
| Branch Resolving | Indirect or divergent branch waiting on the resolution unit. | Make branches uniform; precompute targets; predicate when bodies are short. |
If Long Scoreboard dominates > 30% of cycles, you are memory-bound — tile bigger and async-copy. If Wait dominates, you are ILP-bound — unroll. If MIO Throttle dominates, your shared-memory access pattern has bank conflicts — pad arrays. If Math Pipe Throttle dominates, you are doing well — you are at the arch's compute peak. The other reasons rarely dominate in well-formed code.
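For the MIO Throttle row, the padding cure is one line; a minimal sketch on a 32×32 shared-memory tile:

```cuda
// A 32-float row maps exactly onto the 32 banks, so reading a column
// (tile[threadIdx.x][c]) sends all 32 lanes to one bank: a 32-way
// conflict, surfacing as MIO Throttle / Short Scoreboard.
__shared__ float tile[32][32 + 1];   // 33-float rows skew the mapping:
// bank(tile[x][c]) = (33·x + c) mod 32 = (x + c) mod 32 — distinct per lane.
```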
Pick a GPU class and your block size / register pressure / shared memory. The predictor computes the binding limit and estimates achievable IPC plus a likely dominant stall.