Overlapping compute and data transfer, CUDA streams, events, concurrency patterns, and the default stream trap.
This tutorial explores how to unlock GPU concurrency by overlapping computation with data transfers using CUDA streams and events.
Prerequisites: Tutorials 01–04 (GPU Architecture, First Kernel, Memory Model, Thread Synchronisation). You should be comfortable launching kernels and managing device memory.
When you call CUDA runtime functions without specifying a stream, everything goes into stream 0 (the default stream). This means every operation waits for the previous one to finish — even when the hardware could run them concurrently.
Consider this typical pattern: copy data to device, run a kernel, copy results back. On stream 0, each step blocks until the previous one completes.
cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice); // blocks until done
myKernel<<<grid, block>>>(d_in, d_out); // waits for memcpy
cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost); // waits for kernel
Modern GPUs have separate copy engines and compute engines. They can physically transfer data and run kernels at the same time. But the default stream serialises everything, leaving hardware idle. The solution: explicit CUDA streams.
A CUDA stream is a sequence of operations that execute in order on the GPU. Operations in different streams can run concurrently — the GPU scheduler interleaves them on the available hardware engines.
cudaStream_t stream1, stream2;
// Create streams
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Issue work to different streams
myKernel<<<grid, block, 0, stream1>>>(d_a); // 4th param = stream
myKernel<<<grid, block, 0, stream2>>>(d_b); // runs concurrently!
// Synchronise
cudaStreamSynchronize(stream1); // block host until stream1 done
cudaStreamSynchronize(stream2);
// Clean up
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
| Rule | Details |
|---|---|
| Within a stream | Operations execute in issue order (FIFO) |
| Across streams | No ordering guarantee — may overlap |
| Default stream | Stream 0 synchronises with all other streams (legacy behaviour) |
| Kernel launch | 4th argument in <<<grid, block, sharedMem, stream>>> |
| Async memcpy | Use cudaMemcpyAsync() — requires pinned memory |
cudaMemcpyAsync returns immediately to the host — the copy happens in the background. But the source/destination must be pinned (page-locked) memory allocated with cudaMallocHost(), not regular malloc().
The real power of streams: run a kernel on one dataset while simultaneously transferring the next dataset. This requires pinned memory and async memcpy.
Requirements: allocate host buffers with cudaMallocHost() or cudaHostAlloc(), and use cudaMemcpyAsync() instead of cudaMemcpy().
#include <cstdio>
#include <cuda_runtime.h>
// Simple kernel: scale each element by 2
__global__ void scaleKernel(float *data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
data[idx] *= 2.0f;
}
}
int main() {
const int N = 1 << 20; // 1M elements per chunk
const size_t size = N * sizeof(float);
// Allocate pinned host memory (required for async transfers)
float *h_A, *h_B;
cudaMallocHost(&h_A, size);
cudaMallocHost(&h_B, size);
// Initialise host data
for (int i = 0; i < N; i++) {
h_A[i] = 1.0f;
h_B[i] = 2.0f;
}
// Allocate device memory
float *d_A, *d_B;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
// Create two streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
int threads = 256;
int blocks = (N + threads - 1) / threads;
// Issue operations to stream 1
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream1);
scaleKernel<<<blocks, threads, 0, stream1>>>(d_A, N);
cudaMemcpyAsync(h_A, d_A, size, cudaMemcpyDeviceToHost, stream1);
// Issue operations to stream 2 (can overlap with stream 1)
cudaMemcpyAsync(d_B, h_B, size, cudaMemcpyHostToDevice, stream2);
scaleKernel<<<blocks, threads, 0, stream2>>>(d_B, N);
cudaMemcpyAsync(h_B, d_B, size, cudaMemcpyDeviceToHost, stream2);
// Wait for both streams to complete
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
// Verify results
printf("h_A[0] = %.1f (expected 2.0)\n", h_A[0]);
printf("h_B[0] = %.1f (expected 4.0)\n", h_B[0]);
// Cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
cudaFree(d_A);
cudaFree(d_B);
cudaFreeHost(h_A);
cudaFreeHost(h_B);
return 0;
}
The issue order matters. If you issue all of stream 1's work first, then all of stream 2's work, the GPU can start stream 2's H→D copy while stream 1's kernel is still running. For best overlap, interleave operations: copy1, copy2, kernel1, kernel2, etc. (See Slide 05 for the optimal pattern.)
CUDA events are lightweight markers you insert into a stream. They serve two purposes: precise GPU timing and inter-stream synchronisation.
| Function | Purpose |
|---|---|
| cudaEventCreate(&event) | Create an event object |
| cudaEventRecord(event, stream) | Record the event in a stream (marks a point in time) |
| cudaEventSynchronize(event) | Block the host until the event has been reached |
| cudaEventElapsedTime(&ms, start, stop) | Measure time between two events (milliseconds) |
| cudaStreamWaitEvent(stream, event) | Make a stream wait until an event in another stream completes |
| cudaEventDestroy(event) | Destroy the event object |
CUDA events give you accurate GPU-side timing, unaffected by host-side noise, kernel launch overhead, or OS scheduling.
#include <cstdio>
#include <cuda_runtime.h>
__global__ void vectorAdd(const float *a, const float *b,
float *c, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N)
c[idx] = a[idx] + b[idx];
}
int main() {
const int N = 1 << 22; // 4M elements
const size_t size = N * sizeof(float);
// Host arrays
float *h_a, *h_b, *h_c;
cudaMallocHost(&h_a, size);
cudaMallocHost(&h_b, size);
cudaMallocHost(&h_c, size);
for (int i = 0; i < N; i++) {
h_a[i] = 1.0f;
h_b[i] = 2.0f;
}
// Device arrays
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
// Create events for timing
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Copy data to device
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
int threads = 256;
int blocks = (N + threads - 1) / threads;
// Record start event, launch kernel, record stop event
cudaEventRecord(start);
vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
cudaEventRecord(stop);
// Wait for stop event to complete
cudaEventSynchronize(stop);
// Calculate elapsed time
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel execution time: %.3f ms\n", milliseconds);
// Copy result back and verify
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
printf("h_c[0] = %.1f (expected 3.0)\n", h_c[0]);
// Cleanup
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
cudaFreeHost(h_a);
cudaFreeHost(h_b);
cudaFreeHost(h_c);
return 0;
}
Use cudaStreamWaitEvent() to create dependencies between streams without blocking the host.
cudaEvent_t dataReady;
cudaEventCreate(&dataReady);
// Stream 1 does the copy, records when done
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(dataReady, stream1);
// Stream 2 waits for the event, then computes
cudaStreamWaitEvent(stream2, dataReady);
processKernel<<<grid, block, 0, stream2>>>(d_data);
cudaStreamWaitEvent only blocks the GPU stream, not the host. The host can keep issuing work to other streams. This is how you build complex dependency graphs while keeping the GPU busy.
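As a sketch of such a dependency graph, two producer streams can each record an event, and a third stream can wait on both before combining the results. This fragment assumes the streams, buffers, and combineKernel have been set up elsewhere; the names are illustrative:

```
cudaEvent_t aReady, bReady;
cudaEventCreate(&aReady);
cudaEventCreate(&bReady);

// Two producer streams transfer independent inputs
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(aReady, stream1);
cudaMemcpyAsync(d_b, h_b, size, cudaMemcpyHostToDevice, stream2);
cudaEventRecord(bReady, stream2);

// A consumer stream waits on both events, then combines the inputs.
// The host is never blocked; it could keep issuing work here.
cudaStreamWaitEvent(stream3, aReady);
cudaStreamWaitEvent(stream3, bReady);
combineKernel<<<grid, block, 0, stream3>>>(d_a, d_b, d_out);
```

Both copies can proceed concurrently, and the combine kernel starts as soon as the later of the two finishes.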
The most important stream pattern: split your data into chunks and pipeline them. While chunk N is being computed, chunk N+1 is being transferred. This maximises hardware utilisation.
#include <cstdio>
#include <cuda_runtime.h>
__global__ void squareKernel(float *data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
float val = data[idx];
data[idx] = val * val;
}
}
int main() {
const int TOTAL = 1 << 22; // 4M elements total
const int NUM_STREAMS = 4;
const int CHUNK = TOTAL / NUM_STREAMS;
const size_t chunkBytes = CHUNK * sizeof(float);
// Pinned host memory
float *h_data;
cudaMallocHost(&h_data, TOTAL * sizeof(float));
for (int i = 0; i < TOTAL; i++)
h_data[i] = 3.0f;
// Device memory
float *d_data;
cudaMalloc(&d_data, TOTAL * sizeof(float));
// Create streams and events for timing
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++)
cudaStreamCreate(&streams[i]);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
int threads = 256;
int blocks = (CHUNK + threads - 1) / threads;
// ── Pipelined execution ──
cudaEventRecord(start);
for (int i = 0; i < NUM_STREAMS; i++) {
int offset = i * CHUNK;
// Stage 1: Copy chunk to device
cudaMemcpyAsync(d_data + offset, h_data + offset,
chunkBytes, cudaMemcpyHostToDevice, streams[i]);
// Stage 2: Compute on chunk
squareKernel<<<blocks, threads, 0, streams[i]>>>(
d_data + offset, CHUNK);
// Stage 3: Copy results back
cudaMemcpyAsync(h_data + offset, d_data + offset,
chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Pipelined execution: %.3f ms\n", ms);
// ── Sequential baseline for comparison ──
for (int i = 0; i < TOTAL; i++)
h_data[i] = 3.0f;
cudaEventRecord(start);
cudaMemcpy(d_data, h_data, TOTAL * sizeof(float),
cudaMemcpyHostToDevice);
squareKernel<<<(TOTAL + 255) / 256, 256>>>(d_data, TOTAL);
cudaMemcpy(h_data, d_data, TOTAL * sizeof(float),
cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("Sequential execution: %.3f ms\n", ms);
// Verify
printf("h_data[0] = %.1f (expected 9.0)\n", h_data[0]);
// Cleanup
for (int i = 0; i < NUM_STREAMS; i++)
cudaStreamDestroy(streams[i]);
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(d_data);
cudaFreeHost(h_data);
return 0;
}
For best overlap on GPUs with a single copy engine, interleave the issue order: all H→D copies first, then all kernels, then all D→H copies. However, the loop-per-stream pattern shown above works well on GPUs with dual copy engines (most modern GPUs) and is simpler to write and maintain.
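Under that single-copy-engine assumption, the pipelined loop from the example above can be reorganised into three passes, reusing the same names (d_data, h_data, streams, CHUNK, chunkBytes, blocks, threads):

```
// Breadth-first issue order: all H->D copies, then all kernels,
// then all D->H copies. A single copy engine processes copies in
// issue order, so this prevents a later chunk's upload from
// queuing behind an earlier chunk's download.
for (int i = 0; i < NUM_STREAMS; i++)
    cudaMemcpyAsync(d_data + i * CHUNK, h_data + i * CHUNK,
                    chunkBytes, cudaMemcpyHostToDevice, streams[i]);
for (int i = 0; i < NUM_STREAMS; i++)
    squareKernel<<<blocks, threads, 0, streams[i]>>>(
        d_data + i * CHUNK, CHUNK);
for (int i = 0; i < NUM_STREAMS; i++)
    cudaMemcpyAsync(h_data + i * CHUNK, d_data + i * CHUNK,
                    chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
```

Per-stream ordering is unchanged (each chunk is still copied in, computed, and copied out in order); only the host-side issue order differs.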
The null stream (stream 0) has special synchronisation behaviour that can silently kill your concurrency. Understanding it is critical.
The fix: compile with --default-stream per-thread.
# Legacy behaviour (default)
nvcc -o myapp myapp.cu
# Per-thread default stream
nvcc --default-stream per-thread -o myapp myapp.cu
# Alternatively, define before including CUDA headers:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>
cudaStream_t s1;
cudaStreamCreate(&s1);
kernelA<<<grid, block, 0, s1>>>(); // issued to stream s1
kernelB<<<grid, block>>>(); // issued to stream 0 (null stream)
kernelC<<<grid, block, 0, s1>>>(); // issued to stream s1
// With LEGACY null stream:
// kernelA starts
// kernelB waits for kernelA to finish (stream 0 syncs)
// kernelC waits for kernelB to finish (stream 0 syncs)
// Result: ALL THREE are serialised!
//
// With PER-THREAD default stream:
// kernelA and kernelB can overlap
// kernelC waits only for kernelA (same stream s1)
Always use explicit streams for production code. Never rely on the null stream for concurrent work. If you use libraries that issue to the default stream, compile with --default-stream per-thread to avoid accidental serialisation.
You can also create streams that do not synchronise with the null stream:
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
// This stream will NOT synchronise with the null stream
Put your understanding of streams and async execution to the test with these hands-on challenges.
Implement a pipelined vector addition that splits two input arrays into N_STREAMS chunks, copies each chunk to the device, computes the sum, and copies the result back — all using separate streams.
Use cudaMallocHost for all host allocations and cudaMemcpyAsync for all transfers. Using CUDA events, measure and compare the pipelined and sequential execution times.
Create a scenario where stream 2 must wait for stream 1 to produce a result before it can start computing:
Use cudaEventRecord + cudaStreamWaitEvent for the dependency.
Use nsys profile ./your_app (Nsight Systems) to visually inspect your timeline. You'll see whether streams actually overlap or if accidental null-stream usage is serialising them.
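A typical invocation looks like this (the report name is illustrative):

```
# Capture a timeline and print summary tables when the run finishes
nsys profile --stats=true -o streams_report ./your_app
```

Open the resulting report in the Nsight Systems GUI to see each stream as a separate row on the GPU timeline.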
cudaMemcpyAsync with pinned memory enables concurrent copy and compute. The --default-stream per-thread flag eliminates null-stream serialisation for multi-threaded host code. Async transfers require cudaMallocHost: without pinned memory, cudaMemcpyAsync silently falls back to synchronous behaviour.
Use Nsight Systems to verify overlap. The GPU timeline visualisation immediately shows whether your streams are actually concurrent.
CUDA Libraries — leverage cuBLAS, cuFFT, cuRAND, and Thrust to accelerate common operations without writing custom kernels. Learn when to use a library vs rolling your own.