NVIDIA GPU Architectures Series — Presentation 22

Warp Scheduling & SIMT — How the SM Actually Issues Instructions

Inside an SM, four warp schedulers each issue one (or, on some architectures, two) instructions per cycle to whichever of their warps is ready. We walk through warp issue rules, ILP, instruction latencies, divergence and re-convergence, predication, occupancy maths, and Volta's independent thread scheduling — the things that decide whether your kernel runs at peak.

00

Topics We'll Cover

The micro-architecture of warp issue: how the SM picks one warp out of dozens each cycle, why the picks decide your kernel's runtime, and the mental models that turn occupancy and divergence from mysteries into arithmetic.

01

The Warp — 32 Threads, One Program Counter (Mostly)

A warp is the unit of scheduling on an NVIDIA SM: 32 threads grouped together, historically sharing a single program counter and executing the same instruction in lockstep. This is the “Single-Instruction, Multiple-Thread” (SIMT) model — the SM fetches one instruction and broadcasts it to all 32 lanes, each holding a different thread's register state.

Volta (2017) changed the deal: it gave each thread its own PC and call stack so that re-convergence after divergence works automatically and starvation-free. But warps still issue at warp granularity for performance; divergent warps are slower, not broken.

[Diagram: one warp instruction, FFMA R4, R0, R1, R2 at PC 0x1820, broadcast to lanes 0..31. All-active issue: every lane executes and writes back (solid green). Predicated issue, branch taken by 18 lanes: the other lanes idle for this issue, but the slot is still spent (hatched grey).]

SIMT vs SIMD — the precise difference

SIMD exposes the vector width in the ISA: one thread issues an instruction over the packed lanes of a vector register, and the programmer or compiler handles masking and loop remainders explicitly. SIMT hides the width behind threads: each of the 32 lanes is a full thread with its own registers and addresses (and, since Volta, its own PC), and the hardware handles divergence with masks, so scalar-looking per-thread code maps onto the wide machine.

Why 32?

Thirty-two has been the warp size on every NVIDIA architecture from Tesla (2006) to Blackwell (2025). It is wide enough to amortise control overhead and hit one cache line per warp on a 128-byte coalesced load (4 bytes × 32 = 128 B), and narrow enough to keep the register file and operand network manageable. Changing it would break every CUDA binary ever shipped, so it is effectively a constant of the universe.
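To see the 128-byte arithmetic in code, a minimal sketch (kernel name ours): consecutive lanes load consecutive 4-byte floats, so each warp's load maps onto exactly one cache line.

__global__ void coalesced_copy(const float* __restrict__ in,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Lanes 0..31 of a warp read 32 consecutive floats: 4 B × 32 = 128 B,
    // one cache line, so the warp's LDG coalesces to a single transaction.
    if (i < n) out[i] = in[i];
}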

02

SM Partitions and Schedulers

From Volta onwards an SM is split into 4 partitions (also called “sub-cores” or “processing blocks”). Each partition is essentially its own little SIMT machine, with one warp scheduler that handles a private pool of warps and feeds private functional units.

Per partition (Ampere SM)

  • 1 warp scheduler & dispatcher
  • 16 FP32 + 16 INT32 cores
  • 1 tensor core (3rd / 4th gen on Ampere/Hopper)
  • 1 SFU (sin, exp, rcp, rsqrt)
  • 1 LD/ST unit set
  • 16 K × 32-bit register file slice (= 64 KB)
  • L0 instruction cache

Per SM totals

  • 4 partitions → 4 schedulers → 4 instructions/SM/cycle
  • 64 FP32 + 64 INT32 cores (Ampere; Ada/Hopper similar; GA10x/AD102 dual-issue FP32 doubles FP32 to 128/SM)
  • 4 tensor cores
  • 4 SFUs, 4 LD/ST groups
  • 64 K × 32-bit (= 256 KB) register file
  • Shared L1 / SMEM (128–228 KB) and L1-I cache
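All of these budgets are queryable at run time; a minimal sketch using the standard cudaDeviceProp fields (maxBlocksPerMultiProcessor needs CUDA 11+):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("SMs: %d, warp size: %d\n", p.multiProcessorCount, p.warpSize);
    printf("regs/SM: %d (= %d KB)\n", p.regsPerMultiprocessor,
           p.regsPerMultiprocessor * 4 / 1024);
    printf("SMEM/SM: %zu KB, max threads/SM: %d, max blocks/SM: %d\n",
           p.sharedMemPerMultiprocessor / 1024,
           p.maxThreadsPerMultiProcessor, p.maxBlocksPerMultiProcessor);
    return 0;
}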

One Ampere SM partition, drawn out

[Diagram: one SM partition of four. The L0 instruction cache holds fetched lines for this partition's warps and feeds the warp scheduler & dispatcher (16 resident warps max; picks 1 ready warp per cycle, dispatches up to 1 instruction, with dual-issue FP32 on some architectures). The register-file slice (16 K × 32-bit = 64 KB) supplies 3 source operands per cycle through the operand-collector network to: 16 FP32 cores (FFMA, FADD, FMUL), 16 INT32 cores (IADD3, IMAD, ISETP), 1 tensor core (HMMA / WGMMA), 1 SFU (sin/exp/rcp), and the LD/ST unit (SMEM / L1 / global). Shared with the rest of the SM: L1/SMEM (128–228 KB), L1-I, MIO crossbar.]

[Diagram: issue rate, drawn at SM and GPU scope.]

Why four partitions?

Volta's V100 introduced the 4-partition design so each scheduler could feed a tensor core without blocking the other three. It also halved the operand-collector network width compared with Pascal's two big schedulers, which made the register file faster to read and easier to lay out at high frequency. Ampere, Ada, Hopper and Blackwell all kept the four-partition layout; only the unit counts inside change.

03

The Issue Rule

Each cycle every scheduler looks at its pool of resident warps (up to 16 per partition, so up to 64 per SM at full Volta+ occupancy) and picks one that is ready. “Ready” is a conjunction of conditions; every condition is a hardware scoreboard or pipeline status bit.

The readiness checks, and what each means:

  • Instruction in L0-I: the next instruction at this warp's PC is fetched and decoded; if not, an i-cache-miss stall.
  • Source registers ready: no outstanding writes to any source operand (RAW hazards tracked by per-warp scoreboard bits).
  • Functional unit free: the dispatch port for the required pipe (FP32, INT, FMA, MUFU, MIO, tensor) is not busy this cycle.
  • Predicate decoded: if predicated (@P0), the predicate has been read; even all-false predicated instructions consume an issue slot.
  • No barrier wait: BAR.SYNC, BAR.ARV, __syncthreads(), named barriers and async-copy barriers all park the warp.
  • No memory dependency: if the next instruction reads a register fed by an outstanding LDG/LDS, the warp waits for the load's scoreboard release.

If more than one warp is ready, the scheduler picks one. The exact policy is undocumented and varies by architecture; it has been described as a loose round-robin biased by warp age and recent issue history. The practical invariant is that no ready warp starves, yet a warp that keeps having ready instructions can still issue on consecutive cycles, which is what makes the ILP tactic of section 05 work.

Nsight Compute — warp-stall reasons sorted by frequency
# Captured on a memory-bound GEMM kernel, A100, fp16
ncu --set full --section WarpStateStats ./gemm

Section: Warp State Statistics
 
Stall — Long Scoreboard          27.4 cycles/instr   # LDG miss waiting on L2/HBM
Stall — Wait                     14.2 cycles/instr   # dep on a fixed-latency op (FFMA)
Stall — Short Scoreboard         11.8 cycles/instr   # LDS waiting on shared mem
Stall — Math Pipe Throttle        4.1 cycles/instr   # FFMA pipe saturated
Stall — Drain                     3.6 cycles/instr   # EXIT or BRA flushing
Stall — MIO Throttle              2.9 cycles/instr   # shared-mem bank conflicts
Stall — IMC Miss                  1.3 cycles/instr   # constant-cache miss
Stall — Branch Resolving          0.7 cycles/instr   # indirect / divergent branch
Stall — Tex Throttle              0.2 cycles/instr   # tex pipe full
Stall — Membar                    0.0 cycles/instr   # fence in flight
 
# Selected (eligible) warp count: 1.6 / cycle  — only 1.6 warps were ready,
# so “Issued” instructions / cycle were limited even though 4 schedulers exist.
The eligible-warp number is the one to read

Nsight Compute reports both active warps (resident) and eligible warps (ready to issue this cycle). Active >> eligible means there are warps but they're all stalled — classic memory-bound kernel. Active ≈ eligible at < 4 means low occupancy — classic register-pressure or small-grid kernel. You want eligible ≥ 4 sustained: at that point the schedulers are saturated and arch peaks become reachable.
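If you just want those two counters without a full profile, a direct metric query is enough. A sketch, assuming a recent Nsight Compute (metric names vary by version; ncu --query-metrics lists what your install supports):

# average eligible vs active warps per scheduler over the kernel
ncu --metrics smsp__warps_eligible.avg.per_cycle_active,smsp__warps_active.avg.per_cycle_active ./gemm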

04

Instruction Latencies (Approximate)

Latency in this context means “how many cycles after issue can a dependent instruction issue?”. Throughput is “how many of the same instruction can issue per cycle?”. The SM hides latency by issuing other warps' instructions in the gap.

Op, latency, and notes:

  • FFMA / IMAD / IADD3: ~4 cycles. Back-to-back FFMAs from independent warps issue every cycle; from a single warp you need ILP ≥ 4.
  • LDS (shared load): ~30 cycles. Bank-conflict-free case; with conflicts, multiply by the worst bank-serialisation factor.
  • LDG (global, L1 hit): ~30–90 cycles. Hits in the SM's L1 data cache.
  • LDG (global, L2 hit): ~200–300 cycles. Misses L1 but hits the L2 shared across all SMs (40–50 MB on Hopper, 60+ on Blackwell).
  • LDG (global, HBM): ~400–700 cycles. Full miss, all the way to DRAM; the number that drives every occupancy decision.
  • HMMA / WGMMA: 4–8 cycles to issue. The instruction issues quickly, but the tensor-core matmul streams over many cycles; back-to-back HMMA from one warp keeps the unit fed.
  • ATOM.E (global atomic): ~600 cycles plus contention. Conflicting addresses serialise; uncontended atomics executed at the L2 atomic units are much faster.
  • MUFU.SIN / MUFU.EXP / MUFU.RCP: ~12–30 cycles. SFU pipe; 4× lower throughput than FFMA.
  • BAR.SYNC (__syncthreads): variable. All warps in the block must reach the barrier; the cost is the arrival skew.

The latency-hiding rule of thumb

To hide an op of latency L cycles when its throughput is 1/cycle, you need L independent in-flight instances. With one warp giving one instance per cycle (back-to-back), that is L resident warps with at least one ready instruction each, or fewer warps with ILP ≥ L.
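A worked instance with the table's numbers (ballpark figures, assuming one issue per cycle):

  FFMA:      L ≈ 4   → 4 independent FFMAs in flight: 4 warps at ILP 1, or 1 warp at ILP 4
  LDS:       L ≈ 30  → ~30 in-flight accesses per partition to cover shared-memory latency
  LDG (HBM): L ≈ 600 → beyond what warp count alone provides (16 warps/partition max);
                       hence wide loads, tiling and prefetch so each warp carries many loads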

Bandwidth, not latency, sets the ceiling

Little's law: bytes in flight = bandwidth × latency. On H100 with 3.35 TB/s HBM and ~600-cycle load latency at 1.83 GHz, the minimum in-flight data to saturate bandwidth is roughly 1.1 MB at all times. Divide by your access size (often 16 B/thread × 32 = 512 B/warp) and you get the in-flight warp-load target. Below it you are leaving HBM idle.
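Spelled out with those H100 numbers (all approximate):

  bytes in flight  = 3.35e12 B/s × (600 cycles ÷ 1.83e9 Hz) ≈ 1.1 MB
  warp-level loads = 1.1 MB ÷ 512 B/warp ≈ 2,150 in flight GPU-wide
  per SM (132 SMs on H100 SXM) ≈ 16 outstanding warp loads each, sustained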

05

Occupancy & ILP — Two Ways to Hide Latency

The scheduler keeps the SM busy by always finding something ready to issue. There are two complementary ways to give it work:

High occupancy (TLP)

Many resident warps (high thread-level parallelism). Each warp may have only one in-flight instruction at a time, but with 16 warps per partition the scheduler always has someone ready. Classic CUDA wisdom from Pascal/Volta.

  • Pro: simple, robust.
  • Con: pressures the register file (more warps → fewer regs/thread); pressures shared memory (more blocks → less SMEM/block).

High ILP

Few warps, but each warp has multiple independent instructions in flight. The scheduler can issue from the same warp on consecutive cycles. Volta+ encourages this: a wider register file and LDG.E.128 (16 B/thread) let one warp keep a partition busy.

  • Pro: more registers per thread → bigger tiles → better arithmetic intensity.
  • Con: requires unrolling and explicit prefetch in source.

Both tactics are valid, and modern CUTLASS / cuBLASLt kernels lean heavily on ILP rather than chasing 100% occupancy. Volta-and-later kernels often deliver near-peak FLOPS at 25–50% occupancy because they pack four independent FFMAs and two prefetches into each warp's loop body.

Two implementations of the same SAXPY-like kernel

a) High occupancy, low ILP — classic “one element per thread” style
__global__ void saxpy_high_occ(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
// Launch: 256 threads/block, 32 regs/thread → 8 blocks/SM = 64 warps/SM = 100% occ.
// One LDG, one FFMA, one STG per thread. Per-warp ILP = 1.
b) Low occupancy, high ILP — Volta-tuned “1 warp does 4 elements per iter”
__global__ void saxpy_high_ilp(float a, const float4* x4, float4* y4, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;
    // 128-bit load: one LDG.E.128 brings 16 bytes/thread → 512 B/warp aligned
    float4 xv = x4[i];
    float4 yv = y4[i];
    yv.x = a * xv.x + yv.x;
    yv.y = a * xv.y + yv.y;
    yv.z = a * xv.z + yv.z;
    yv.w = a * xv.w + yv.w;   // 4 independent FFMAs per warp per loop iter
    y4[i] = yv;
}
// Launch: 128 threads/block, 128 regs/thread (big tiles) → 4 blocks/SM = 16 warps/SM = 25% occ.
// Per-warp ILP = 4 independent FFMAs; one wide LDG.E.128 per thread keeps the
// memory pipe full — achieves 95+% of memory-bandwidth peak.
The Volta inflection

Pre-Volta the dominant guidance was “chase occupancy.” Volta's combination of a big register file (up to 255 regs/thread addressable), wide 128-bit loads, and independent thread scheduling made deep-ILP kernels practical. From Volta onwards, the highest-throughput kernels are typically 25–50% occupancy with deep ILP, not 100% with one element per thread. CUTLASS, cuBLAS and FlashAttention all run that way.

06

Divergence & Re-convergence

When threads in a warp take different paths through an if/else, the warp diverges. The hardware tracks per-lane activity with an active mask; the warp issues the if-branch with one mask, then the else-branch with the inverse, both serialised. After the branches re-meet at the immediate post-dominator, they reconverge and the active mask returns to all-ones.
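To watch the active mask do this, a minimal sketch (kernel name ours): each half of a 32-thread warp reports the mask it executes under, then both halves reconverge.

#include <cstdio>

__global__ void show_masks()
{
    int lane = threadIdx.x % 32;
    if (lane < 16) {
        unsigned m = __activemask();   // typically 0x0000ffff here
        if (lane == 0) printf("if-path mask   = %08x\n", m);
    } else {
        unsigned m = __activemask();   // typically 0xffff0000 here
        if (lane == 16) printf("else-path mask = %08x\n", m);
    }
    __syncwarp();                      // explicit reconvergence point
    if (lane == 0) printf("reconverged    = %08x\n", __activemask());
}
// Launch as show_masks<<<1, 32>>>(). "Typically" because ITS does not
// guarantee that all 16 lanes of a side arrive at __activemask() together.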

Pre-Volta (Pascal and earlier)

One PC per warp. The compiler emits SSY (set sync) to mark the IPDOM and .S-suffixed branches that pop the convergence stack. Lanes on the not-taken path are idle but cannot run anything else; they wait. Reconvergence is guaranteed at the IPDOM, but individual lanes can starve: a masked-off lane makes no progress until its siblings reach the reconvergence point.

Consequence: a lane holding a lock cannot release it while sibling lanes spin waiting. Inter-thread mutual exclusion within a warp is fundamentally broken.

Volta and later (ITS)

Each thread has its own PC and its own call stack. The compiler emits BSSY/BSYNC SASS pairs which set up a re-convergence point but do not force lockstep. The scheduler may interleave the two divergent halves of the warp at instruction granularity. Lanes can make independent forward progress; mutual exclusion within a warp is now possible.

[Diagram: time on the x-axis; pre-Volta vs Volta+ schedule of a divergent warp. Pre-Volta: the whole if-path runs under mask 0xffff0000 while the else lanes idle, then the else-path under mask 0x0000ffff, then an SSY pop reconverges. Volta+ ITS: the scheduler interleaves if[0], else[0], if[1], else[1], … and reconverges at BSYNC.]

Divergent warps still pay the lane-count cost — the aggregate work isn't reduced by ITS. ITS only buys two things: forward progress for individual threads (locks!), and freedom for the scheduler to interleave the two halves to hide latency better. Crucial gotcha: code that assumed lockstep (warp shuffles without an explicit mask) broke on Volta+.
The lockstep assumption is dead

Pre-Volta, programmers casually assumed that all 32 lanes execute the same instruction at the same cycle — so a __shfl(x, lane) “just worked” because every active lane was at the same PC. Volta+ may interleave divergent halves, so an unmasked __shfl can read garbage from a lane that hasn't reached the same instruction yet. The fix is the _sync family: __shfl_sync(0xffffffff, x, lane). The mask both selects participating lanes and forces them to converge before the operation.
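The masked intrinsics compose into the classic warp primitives. A minimal sketch of a full-warp sum (function name ours), assuming all 32 lanes are active at the call:

__device__ float warp_sum(float v)
{
    // Butterfly (XOR) reduction: after 5 rounds every lane holds the total.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, offset);
    return v;
}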

07

Predication vs Branching

The compiler has two ways to implement an if in a warp:

The ptxas heuristic: short bodies (1–3 instructions per side) get predicated; long ones get a real branch, because predication wastes too many issue slots and hurts instruction-cache density. You can nudge it either way: ?: expressions for tiny bodies tend to predicate, #pragma unroll exposes short bodies to the heuristic, and structuring loops so the condition is warp-uniform invites a cheap uniform branch.

Cost model, predication vs divergent branching:

  • Issue slots spent, uniform condition: predication pays for both sides; a branch pays only for the taken side.
  • Issue slots spent, divergent condition: both pay for both sides; the branch adds branch overhead on top.
  • Front-end I-cache pressure: predication inlines both bodies in full; a branch keeps the code compact.
  • Best for: predication suits 1–3-instruction bodies; branches suit large bodies and often-uniform conditions.

Worked SASS — ptxas chose predication for a tiny if/else

CUDA source
__device__ unsigned choose(unsigned a, unsigned b)
{
    return (a > b) ? a + b : a + 0x10u;
}
Resulting SASS (sm_80)
// R0 = a, R1 = b
ISETP.GT.U32.AND P0, PT, R0, R1, PT      // P0 = (a > b)
@P0  IADD3 R2, R0, R1, RZ                // if-true:  R2 = a + b
@!P0 IADD3 R2, R0, 0x10, RZ              // if-false: R2 = a + 0x10
// no BRA emitted — both IADD3s issue, predicates filter their commit

Both IADD3 instructions issue in every lane. Lanes where P0 is false discard the first IADD3's result; lanes where P0 is true discard the second's. The cost is two issue slots but no divergence overhead. When the condition is warp-uniform a real branch would skip one side entirely, so for bodies of about three instructions the two strategies are roughly break-even.

Loop-invariant predicates

If the predicate is loop-invariant and uniform across the warp, ptxas may hoist it: a BRA.UNI at the top of the loop selects the right body — same code, no per-iteration predicate cost. You can help it along with __builtin_assume, or by writing two specialised kernels and selecting at launch time, as in the sketch below.
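A sketch of the specialise-and-select-at-launch idea (names ours): making the condition a template parameter turns the per-iteration predicate into a compile-time constant that ptxas deletes outright.

template <bool kAddBias>
__global__ void scale(float* y, const float* x, float bias, float c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (kAddBias) y[i] = c * x[i] + bias;   // dead code in the <false> build
    else          y[i] = c * x[i];          // dead code in the <true> build
}

// Host side, selecting the specialisation at launch time:
// (addBias ? scale<true> : scale<false>)<<<grid, block>>>(y, x, bias, c, n);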

08

Independent Thread Scheduling (Volta+)

Lockstep warp execution made critical sections impossible to write safely within a warp. A thread holding a lock could not release it while sibling lanes spun waiting for it: every lane was pinned to the same PC, so the winner could never advance to the release. Several decades of multi-threaded textbook code simply did not port to pre-Volta GPUs.

Volta's promise: per-thread forward progress

A spinlock that works on Volta+ but not on Pascal
__device__ int mutex = 0;

__device__ void acquire() {
    while (atomicCAS(&mutex, 0, 1) != 0) { /* spin */ }
    __threadfence();    // acquire: make the previous holder's writes visible
}
__device__ void release() {
    __threadfence();    // release: publish this thread's critical-section writes
    atomicExch(&mutex, 0);
}

__global__ void k(float* shared_acc, float v) {
    acquire();
    *shared_acc += v;       // critical section
    release();
    // On Pascal: lanes that lost the CAS spin forever because the winning lane
    // is at the same PC and cannot make progress to the release.
    // On Volta+ ITS: the winner can advance to release, the loser then makes
    // progress on its own retry, and the warp converges naturally.
}

The cost: explicit warp synchronisation now required

Code that relied on lockstep, especially warp-level intrinsics, broke. Volta+ introduced the _sync family with a 32-bit mask argument: the mask names participating lanes and forces them to converge before the op.

Old vs new warp shuffle
// Pascal era — relied on implicit lockstep
float v = __shfl(x, lane);                       // deprecated since CUDA 9.0

// Volta+ era — explicit mask, explicit convergence
float v = __shfl_sync(0xffffffff, x, lane);     // all 32 lanes participate
unsigned mask = __activemask();                 // query who is here right now
unsigned vote = __ballot_sync(mask, predicate); // bitmask of lanes voting yes
int sum = __reduce_add_sync(mask, ival);        // integer only; cc ≥ 8.0 (Ampere+), not Volta
Old (deprecated) form, the new _sync form, and why the change:

  • __shfl(x, lane) → __shfl_sync(mask, x, lane): forces convergence of the named lanes before reading.
  • __ballot(p) → __ballot_sync(mask, p): who-voted-yes is only meaningful with a defined participant set.
  • __any(p) / __all(p) → __any_sync / __all_sync: same reasoning.
  • (implicit lockstep) → __syncwarp(mask): explicit warp-level barrier; cheap, no impact on the block barrier.
If your old kernel suddenly returns garbage on Hopper

The single most common Volta-onwards porting bug is calling a non-_sync warp intrinsic. The compiler will warn; do not ignore the warning. Either pass 0xffffffff (all lanes) or, if you genuinely have a divergent context, use __activemask() to capture the real participating set.
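The mechanical fix, sketched (variable and function names ours): name the participants explicitly instead of assuming lockstep.

// before: pre-Volta code, undefined under divergence on Volta+
//   if (__any(err)) recover();
// after: capture the real participant set at this point in the code
unsigned mask = __activemask();
if (__any_sync(mask, err)) recover();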

09

Occupancy Maths — Worked Out

Occupancy = (resident warps per SM) / (max warps per SM). On Volta and later the cap is 64 warps/SM (= 2048 threads). The actual resident count is the minimum of three independent limits.

Resource, per-SM budget (Volta+), and per-block consumption:

  • Registers: budget 64 K × 32-bit = 256 KB; a block consumes threads × regs/thread.
  • Shared memory: budget 100–228 KB (configurable split with L1); a block consumes the SMEM it declares.
  • Warps: budget 64; a block consumes ceil(threads / 32).
  • Blocks: budget 32 (Volta+; 16 on some architectures); each block consumes 1.

Worked example A — register-bound

Illustrative numbers: 256 threads/block at 128 regs/thread → 32 K regs/block. The 64 K-register file fits 2 blocks → 16 warps/SM = 25% occupancy. Registers bind first.

Worked example B — warp-cap-bound

256 threads/block at 24 regs/thread, no SMEM: registers would allow 10 blocks, but 8 blocks × 8 warps already hits the 64-warp cap → 8 resident blocks, 100% occupancy. The warp cap binds.

Worked example C — SMEM-bound

256 threads/block using 48 KB of SMEM on a 100 KB budget → 2 blocks/SM → 16 warps = 25% occupancy, even though registers and the warp cap would allow far more.
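The same min-of-limits rule as a helper, a sketch with the Volta+ budgets from the table hard-coded (names ours; register-allocation granularity is ignored — the runtime API below accounts for it):

int blocks_per_sm(int threads, int regs_per_thread, int smem_bytes)
{
    int by_regs   = 65536 / (threads * regs_per_thread);          // 64 K registers
    int by_smem   = smem_bytes ? (100 * 1024) / smem_bytes : 32;  // 100 KB SMEM
    int warps     = (threads + 31) / 32;
    int by_warps  = 64 / warps;                                   // 64-warp cap
    int by_blocks = 32;                                           // block cap
    int b = by_regs;
    if (by_smem   < b) b = by_smem;
    if (by_warps  < b) b = by_warps;
    if (by_blocks < b) b = by_blocks;
    return b;    // occupancy = b * warps / 64.0
}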

Asking the runtime instead of doing the maths by hand
int blocks = 0, threads_per_block = 256;
size_t shared_bytes = 48 * 1024;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocks,                       // out
    my_kernel,                     // kernel function pointer
    threads_per_block,             // per-block thread count
    shared_bytes);                 // dynamic SMEM

// Or pick a configuration that maximises occupancy:
int minGrid = 0, optBlock = 0;
cudaOccupancyMaxPotentialBlockSize(&minGrid, &optBlock,
    my_kernel, /*dyn smem bytes*/0, /*max block size*/1024);

// Same per-SM query, with a flags argument
// (e.g. cudaOccupancyDisableCachingOverride):
int active = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags(
    &active, my_kernel, threads_per_block, shared_bytes,
    cudaOccupancyDefault);
Higher occupancy is not always faster

Beyond the point where the schedulers are saturated, more warps don't help — and they cost you registers/thread, which forces wider tiles to spill to L1, which costs bandwidth. Many production kernels (CUTLASS GEMM tiles, FlashAttention) target 25–50% occupancy on purpose. Use Nsight Compute to read the “eligible warps per cycle” metric: if it is ≥ 4 sustained, occupancy is enough.

10

Common Stall Patterns and Cures

Nsight Compute names ten or so warp-stall reasons. Most kernels show one or two dominant ones; once you know which, the cure is usually mechanical.

Stall reason, symptom, and cure:

  • Long Scoreboard (memory dependency). Symptom: waiting on an outstanding LDG / LDG.E / LDS to retire. Cure: bigger tiles to reuse loaded data; cp.async + cp.async.commit double-buffering; raise occupancy if it is low; check coalescing.
  • Wait (execution dependency). Symptom: a fixed-latency op (FFMA, IMAD, HMMA) feeds the next instruction and there is not enough independent work. Cure: unroll loops for ILP; reduce regs/thread to fit more warps; interleave independent accumulators.
  • Barrier (__syncthreads). Symptom: one warp waiting for siblings to reach the barrier. Cure: use cuda::barrier and named barriers (Hopper); move heavy load/compute to cp.async; reduce SMEM-traffic skew.
  • MIO Throttle. Symptom: the memory I/O subsystem (SMEM bank conflicts, atomics, surface ops) is saturated. Cure: pad shared arrays to break bank conflicts; coalesce; replace atomics with reductions.
  • Dispatch Stall. Symptom: the warp has an instruction ready but the dispatcher holds it back, typically because the dominant pipe is oversubscribed (e.g. back-to-back FFMAs). Cure: mix instruction types (FP + INT + LDS); increase warp count; schedule independent accumulators.
  • Tex Throttle. Symptom: the texture pipe is saturated by TEX/TLD ops. Cure: use plain LDG/LDS when texture filtering isn't actually needed.
  • Math Pipe Throttle (FP / DP / SFU). Symptom: the corresponding pipe is full every cycle. Cure: stagger instruction types; for SFU-heavy code, substitute polynomial approximations or reorder.
  • Drain. Symptom: the warp is flushing at EXIT or after a branch. Cure: usually benign; large blocks tail more, so consider persistent kernels for very small grids.
  • Branch Resolving. Symptom: an indirect or divergent branch is waiting on resolution. Cure: make branches uniform; precompute targets; predicate when bodies are short.
Triage script

If Long Scoreboard dominates > 30% of cycles, you are memory-bound — tile bigger and async-copy. If Wait dominates, you are ILP-bound — unroll. If MIO Throttle dominates, your shared-memory access pattern has bank conflicts — pad arrays. If Math Pipe Throttle dominates, you are doing well — you are at the arch's compute peak. The other reasons rarely dominate in well-formed code.

11

Interactive: Issue-Rate Predictor

Pick a GPU class and your block size / register pressure / shared memory. The predictor computes the binding limit and estimates achievable IPC plus a likely dominant stall.

[Interactive widget. Inputs: block size (e.g. 256 threads), regs/thread (e.g. 64), dynamic SMEM (e.g. 0 KB). Outputs: blocks/SM, warps/SM, occupancy, estimated IPC, SM-wide IPC, likely dominant stall.]