CUDA Programming Series — Tutorial 08

Streams & Async Execution

Overlapping compute and data transfer, CUDA streams, events, concurrency patterns, and the default stream trap.

00

Topics We'll Cover

This tutorial explores how to unlock GPU concurrency by overlapping computation with data transfers using CUDA streams and events.

Prerequisites

Tutorials 01–04 (GPU Architecture, First Kernel, Memory Model, Thread Synchronisation). You should be comfortable launching kernels and managing device memory.

01

The Default Stream Trap

When you call CUDA runtime functions without specifying a stream, everything goes into stream 0 (the default stream). This means every operation waits for the previous one to finish — even when the hardware could run them concurrently.

Everything Is Serialised

Consider this typical pattern: copy data to device, run a kernel, copy results back. On stream 0, each step blocks until the previous one completes.

serialised_on_default_stream.cu
cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);   // blocks until done
myKernel<<<grid, block>>>(d_in, d_out);                 // waits for memcpy
cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost); // waits for kernel

Sequential Timeline on Stream 0

Stream 0 — All Operations Serialised

Stream 0:  [H→D Copy] [Kernel] [D→H Copy]
           ←——— Total time = H2D + Kernel + D2H ———→

The compute engine sits idle during both copies, and the copy engines sit idle during the kernel.
The Problem

Modern GPUs have separate copy engines and compute engines. They can physically transfer data and run kernels at the same time. But the default stream serialises everything, leaving hardware idle. The solution: explicit CUDA streams.
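You can check what your own device supports by querying its properties. A minimal sketch using the standard cudaDeviceProp fields:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // asyncEngineCount: 0 = no copy/compute overlap,
    // 1 = one copy engine, 2 = independent H→D and D→H engines
    printf("Device: %s\n", prop.name);
    printf("Copy engines: %d\n", prop.asyncEngineCount);
    printf("Concurrent kernels: %s\n",
           prop.concurrentKernels ? "yes" : "no");
    return 0;
}
```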

02

CUDA Streams

A CUDA stream is a sequence of operations that execute in order on the GPU. Operations in different streams can run concurrently — the GPU scheduler interleaves them on the available hardware engines.

Stream Lifecycle

cudaStreamCreate → Issue Operations → cudaStreamSynchronize → cudaStreamDestroy

API Essentials

stream_basics.cu
cudaStream_t stream1, stream2;

// Create streams
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Issue work to different streams
myKernel<<<grid, block, 0, stream1>>>(d_a);  // 4th param = stream
myKernel<<<grid, block, 0, stream2>>>(d_b);  // runs concurrently!

// Synchronise
cudaStreamSynchronize(stream1);  // block host until stream1 done
cudaStreamSynchronize(stream2);

// Clean up
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

Key Rules

Rule             Details
Within a stream  Operations execute in issue order (FIFO)
Across streams   No ordering guarantee — may overlap
Default stream   Stream 0 synchronises with all other streams (legacy behaviour)
Kernel launch    4th argument in <<<grid, block, sharedMem, stream>>>
Async memcpy     Use cudaMemcpyAsync() — requires pinned memory

Important

cudaMemcpyAsync returns immediately to the host — the copy happens in the background. But the source/destination must be pinned (page-locked) memory allocated with cudaMallocHost(), not regular malloc().
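A minimal sketch of the pattern: allocate pinned memory, issue the async copy, and synchronise before touching the results. With a plain malloc buffer the same call would behave synchronously.

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t size = (1 << 20) * sizeof(float);
    float *h_pinned, *d_buf;
    cudaStream_t stream;

    cudaMallocHost(&h_pinned, size);   // page-locked: the DMA engine can read it directly
    cudaMalloc(&d_buf, size);
    cudaStreamCreate(&stream);

    // Returns immediately; the copy proceeds in the background on `stream`
    cudaMemcpyAsync(d_buf, h_pinned, size, cudaMemcpyHostToDevice, stream);

    // ... the host is free to do other work here ...

    cudaStreamSynchronize(stream);     // wait before reusing either buffer
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);            // pinned memory has its own free call
    return 0;
}
```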

03

Overlapping Compute & Transfer

The real power of streams: run a kernel on one dataset while simultaneously transferring the next dataset. This requires pinned memory and async memcpy.

Requirements for Overlap

  • Pinned host memory allocated with cudaMallocHost()
  • cudaMemcpyAsync() instead of blocking cudaMemcpy()
  • Two or more explicit (non-default) streams
  • A device with at least one async copy engine (asyncEngineCount ≥ 1)

Serialised vs Overlapped Execution

Without Streams — Everything Sequential

Stream 0:  [H→D A] [Kernel A] [D→H A] [H→D B] [Kernel B] [D→H B]

With Streams — Compute Overlaps Transfer

Stream 1:  [H→D A] [Kernel A] [D→H A]
Stream 2:           [H→D B] [Kernel B] [D→H B]

Overlap saves time: H→D B runs during Kernel A.

Complete Example — Two-Stream Overlap

stream_overlap.cu — compilable: nvcc -o stream_overlap stream_overlap.cu
#include <cstdio>
#include <cuda_runtime.h>

// Simple kernel: scale each element by 2
__global__ void scaleKernel(float *data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        data[idx] *= 2.0f;
    }
}

int main() {
    const int N = 1 << 20;  // 1M elements per chunk
    const size_t size = N * sizeof(float);

    // Allocate pinned host memory (required for async transfers)
    float *h_A, *h_B;
    cudaMallocHost(&h_A, size);
    cudaMallocHost(&h_B, size);

    // Initialise host data
    for (int i = 0; i < N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    // Allocate device memory
    float *d_A, *d_B;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);

    // Create two streams
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    int threads = 256;
    int blocks = (N + threads - 1) / threads;

    // Issue operations to stream 1
    cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream1);
    scaleKernel<<<blocks, threads, 0, stream1>>>(d_A, N);
    cudaMemcpyAsync(h_A, d_A, size, cudaMemcpyDeviceToHost, stream1);

    // Issue operations to stream 2 (can overlap with stream 1)
    cudaMemcpyAsync(d_B, h_B, size, cudaMemcpyHostToDevice, stream2);
    scaleKernel<<<blocks, threads, 0, stream2>>>(d_B, N);
    cudaMemcpyAsync(h_B, d_B, size, cudaMemcpyDeviceToHost, stream2);

    // Wait for both streams to complete
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    // Verify results
    printf("h_A[0] = %.1f (expected 2.0)\n", h_A[0]);
    printf("h_B[0] = %.1f (expected 4.0)\n", h_B[0]);

    // Cleanup
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFreeHost(h_A);
    cudaFreeHost(h_B);

    return 0;
}
Key Insight

The issue order matters. If you issue all of stream 1's work first, then all of stream 2's work, the GPU can start stream 2's H→D copy while stream 1's kernel is still running. For best overlap, interleave operations: copy1, copy2, kernel1, kernel2, etc. (See Slide 05 for the optimal pattern.)

04

CUDA Events

CUDA events are lightweight markers you insert into a stream. They serve two purposes: precise GPU timing and inter-stream synchronisation.

Event API

Function                                 Purpose
cudaEventCreate(&event)                  Create an event object
cudaEventRecord(event, stream)           Record the event in a stream (marks a point in time)
cudaEventSynchronize(event)              Block the host until the event has been reached
cudaEventElapsedTime(&ms, start, stop)   Measure time between two events (milliseconds)
cudaStreamWaitEvent(stream, event, 0)    Make a stream wait until an event recorded in another stream completes
cudaEventDestroy(event)                  Destroy the event object

Timing with Events

CUDA events give you accurate GPU-side timing, unaffected by host-side noise, kernel launch overhead, or OS scheduling.

Record Start → Kernel Executes → Record Stop → cudaEventElapsedTime

Complete Timing Example

event_timing.cu — compilable: nvcc -o event_timing event_timing.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *a, const float *b,
                           float *c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        c[idx] = a[idx] + b[idx];
}

int main() {
    const int N = 1 << 22;  // 4M elements
    const size_t size = N * sizeof(float);

    // Host arrays
    float *h_a, *h_b, *h_c;
    cudaMallocHost(&h_a, size);
    cudaMallocHost(&h_b, size);
    cudaMallocHost(&h_c, size);

    for (int i = 0; i < N; i++) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Device arrays
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Create events for timing
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Copy data to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (N + threads - 1) / threads;

    // Record start event, launch kernel, record stop event
    cudaEventRecord(start);
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
    cudaEventRecord(stop);

    // Wait for stop event to complete
    cudaEventSynchronize(stop);

    // Calculate elapsed time
    float milliseconds = 0.0f;
    cudaEventElapsedTime(&milliseconds, start, stop);
    printf("Kernel execution time: %.3f ms\n", milliseconds);

    // Copy result back and verify
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    printf("h_c[0] = %.1f (expected 3.0)\n", h_c[0]);

    // Cleanup
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFreeHost(h_c);

    return 0;
}

Inter-Stream Synchronisation

Use cudaStreamWaitEvent() to create dependencies between streams without blocking the host.

stream_wait_event.cu
cudaEvent_t dataReady;
cudaEventCreate(&dataReady);

// Stream 1 does the copy, records when done
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(dataReady, stream1);

// Stream 2 waits for the event, then computes
cudaStreamWaitEvent(stream2, dataReady, 0);  // third argument (flags) must be 0
processKernel<<<grid, block, 0, stream2>>>(d_data);
Why Not Just Use One Stream?

cudaStreamWaitEvent only blocks the GPU stream, not the host. The host can keep issuing work to other streams. This is how you build complex dependency graphs while keeping the GPU busy.
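As a sketch of a slightly larger graph (the kernel, stream, and buffer names here are hypothetical): two producer streams each record an event, and a consumer stream waits on both before combining their results.

```cuda
// Assumes producerA, producerB, combine kernels and streamA/B/C already exist
cudaEvent_t aDone, bDone;
cudaEventCreate(&aDone);
cudaEventCreate(&bDone);

producerA<<<grid, block, 0, streamA>>>(d_a);
cudaEventRecord(aDone, streamA);

producerB<<<grid, block, 0, streamB>>>(d_b);
cudaEventRecord(bDone, streamB);

// streamC proceeds only after both producers have finished;
// the host never blocks while building this graph
cudaStreamWaitEvent(streamC, aDone, 0);
cudaStreamWaitEvent(streamC, bDone, 0);
combine<<<grid, block, 0, streamC>>>(d_a, d_b, d_out);
```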

05

Practical Pattern — Pipelined Processing

The most important stream pattern: split your data into chunks and pipeline them. While chunk N is being computed, chunk N+1 is being transferred. This maximises hardware utilisation.

Pipeline Strategy

Split Data → Copy Chunk i → Compute Chunk i → Copy Back Chunk i → Next i

Pipelined Timeline (4 Chunks, 4 Streams)

Each Stream Handles One Chunk — Stages Overlap

Stream 0:  [H→D] [Compute] [D→H]
Stream 1:        [H→D] [Compute] [D→H]
Stream 2:              [H→D] [Compute] [D→H]
Stream 3:                    [H→D] [Compute] [D→H]

H→D, Compute, and D→H from different chunks run simultaneously.

Complete Pipelined Example

pipelined_processing.cu — compilable: nvcc -o pipelined pipelined_processing.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void squareKernel(float *data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float val = data[idx];
        data[idx] = val * val;
    }
}

int main() {
    const int TOTAL = 1 << 22;      // 4M elements total
    const int NUM_STREAMS = 4;
    const int CHUNK = TOTAL / NUM_STREAMS;
    const size_t chunkBytes = CHUNK * sizeof(float);

    // Pinned host memory
    float *h_data;
    cudaMallocHost(&h_data, TOTAL * sizeof(float));
    for (int i = 0; i < TOTAL; i++)
        h_data[i] = 3.0f;

    // Device memory
    float *d_data;
    cudaMalloc(&d_data, TOTAL * sizeof(float));

    // Create streams and events for timing
    cudaStream_t streams[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; i++)
        cudaStreamCreate(&streams[i]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (CHUNK + threads - 1) / threads;

    // ── Pipelined execution ──
    cudaEventRecord(start);

    for (int i = 0; i < NUM_STREAMS; i++) {
        int offset = i * CHUNK;
        // Stage 1: Copy chunk to device
        cudaMemcpyAsync(d_data + offset, h_data + offset,
                        chunkBytes, cudaMemcpyHostToDevice, streams[i]);
        // Stage 2: Compute on chunk
        squareKernel<<<blocks, threads, 0, streams[i]>>>(
            d_data + offset, CHUNK);
        // Stage 3: Copy results back
        cudaMemcpyAsync(h_data + offset, d_data + offset,
                        chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
    }

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Pipelined execution: %.3f ms\n", ms);

    // ── Sequential baseline for comparison ──
    for (int i = 0; i < TOTAL; i++)
        h_data[i] = 3.0f;

    cudaEventRecord(start);

    cudaMemcpy(d_data, h_data, TOTAL * sizeof(float),
               cudaMemcpyHostToDevice);
    squareKernel<<<(TOTAL + 255) / 256, 256>>>(d_data, TOTAL);
    cudaMemcpy(h_data, d_data, TOTAL * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);
    printf("Sequential execution: %.3f ms\n", ms);

    // Verify
    printf("h_data[0] = %.1f (expected 9.0)\n", h_data[0]);

    // Cleanup
    for (int i = 0; i < NUM_STREAMS; i++)
        cudaStreamDestroy(streams[i]);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    cudaFreeHost(h_data);

    return 0;
}
Optimal Issue Order

For best overlap on GPUs with a single copy engine, issue work breadth-first: all H→D copies first, then all kernels, then all D→H copies. The depth-first loop-per-stream pattern shown above works well on GPUs with dual copy engines (most modern GPUs) and is simpler to write and maintain.
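A breadth-first variant of the pipeline loop would look like this sketch. It reuses the buffers, streams, and kernel from pipelined_processing.cu; only the issue order changes.

```cuda
// Breadth-first issue: all H→D copies, then all kernels, then all D→H copies
for (int i = 0; i < NUM_STREAMS; i++)
    cudaMemcpyAsync(d_data + i * CHUNK, h_data + i * CHUNK,
                    chunkBytes, cudaMemcpyHostToDevice, streams[i]);

for (int i = 0; i < NUM_STREAMS; i++)
    squareKernel<<<blocks, threads, 0, streams[i]>>>(d_data + i * CHUNK, CHUNK);

for (int i = 0; i < NUM_STREAMS; i++)
    cudaMemcpyAsync(h_data + i * CHUNK, d_data + i * CHUNK,
                    chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
```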

06

The Null Stream & --default-stream per-thread

The null stream (stream 0) has special synchronisation behaviour that can silently kill your concurrency. Understanding it is critical.

Legacy vs Per-Thread Default Stream

Legacy (Default)

  • Stream 0 is a synchronising stream
  • An operation on stream 0 waits for all prior work on all streams
  • All streams wait for stream 0 operations to complete
  • Effectively serialises everything

Per-Thread (--default-stream per-thread)

  • Each host thread gets its own default stream
  • The per-thread default stream does not synchronise with other streams
  • Enables concurrency even without explicit stream creation
  • Behaves like a regular (non-null) stream

Compilation Flags

compilation options
# Legacy behaviour (default)
nvcc -o myapp myapp.cu

# Per-thread default stream
nvcc --default-stream per-thread -o myapp myapp.cu

# Alternatively, define before including CUDA headers:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>

The Trap in Practice

null_stream_trap.cu
cudaStream_t s1;
cudaStreamCreate(&s1);

kernelA<<<grid, block, 0, s1>>>();   // issued to stream s1
kernelB<<<grid, block>>>();            // issued to stream 0 (null stream)
kernelC<<<grid, block, 0, s1>>>();   // issued to stream s1

// With LEGACY null stream:
//   kernelA starts
//   kernelB waits for kernelA to finish (stream 0 syncs)
//   kernelC waits for kernelB to finish (stream 0 syncs)
//   Result: ALL THREE are serialised!
//
// With PER-THREAD default stream:
//   kernelA and kernelB can overlap
//   kernelC waits only for kernelA (same stream s1)
Best Practice

Always use explicit streams for production code. Never rely on the null stream for concurrent work. If you use libraries that issue to the default stream, compile with --default-stream per-thread to avoid accidental serialisation.

Non-Blocking Streams

You can also create streams that do not synchronise with the null stream:

non_blocking_stream.cu
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
// This stream will NOT synchronise with the null stream
07

Exercises

Put your understanding of streams and async execution to the test with these hands-on challenges.

Exercise 1: Pipelined Vector Addition

Goal

Implement a pipelined vector addition that splits two input arrays into N_STREAMS chunks, copies each chunk to the device, computes the sum, and copies the result back — all using separate streams.

Exercise 2: Measure the Speedup

Goal

Using CUDA events, measure and compare:

  • The sequential baseline (one stream, blocking copies)
  • The pipelined version from Exercise 1 with 2, 4, and 8 streams

Exercise 3: Inter-Stream Dependencies

Goal

Create a scenario where stream 2 must wait for stream 1 to produce a result before it can start computing:

  • Stream 1 copies input data, runs a producer kernel, and records an event
  • Stream 2 calls cudaStreamWaitEvent() on that event before launching its consumer kernel
  • Verify in a profiler that stream 2 really stalls until the event fires

Profiling Tip

Use nsys profile ./your_app (Nsight Systems) to visually inspect your timeline. You'll see whether streams actually overlap or if accidental null-stream usage is serialising them.

08

Summary & Next Steps

What We Covered

  • The default stream trap and why stream 0 serialises everything
  • Creating, using, and destroying explicit CUDA streams
  • Overlapping kernels with async transfers using pinned memory
  • Timing and inter-stream dependencies with CUDA events
  • Pipelined chunk processing across multiple streams
  • Legacy vs per-thread default stream semantics

Key Takeaways

Always Use Pinned Memory

Async transfers require cudaMallocHost. Without pinned memory, cudaMemcpyAsync silently falls back to synchronous behaviour.
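If a buffer already exists (for example, it comes from another library), you can pin it in place instead of reallocating. A sketch using cudaHostRegister, with d_buf, stream, and size assumed to exist:

```cuda
float *h_buf = (float *)malloc(size);                     // ordinary pageable allocation
cudaHostRegister(h_buf, size, cudaHostRegisterDefault);   // pin the pages in place

// Now the copy can be truly asynchronous
cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream);

// ... later, after synchronising the stream ...
cudaHostUnregister(h_buf);                                // unpin before freeing
free(h_buf);
```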

Profile Your Streams

Use Nsight Systems to verify overlap. The GPU timeline visualisation immediately shows whether your streams are actually concurrent.

Next Tutorial

Up Next — Tutorial 09

CUDA Libraries — leverage cuBLAS, cuFFT, cuRAND, and Thrust to accelerate common operations without writing custom kernels. Learn when to use a library vs rolling your own.