Overlapping compute and data transfer, CUDA streams, events, concurrency patterns, and the default stream trap.
This tutorial explores how to unlock GPU concurrency by overlapping computation with data transfers using CUDA streams and events.
Prerequisites: Tutorials 01–04 (GPU Architecture, First Kernel, Memory Model, Thread Synchronisation). You should be comfortable launching kernels and managing device memory.
When you call CUDA runtime functions without specifying a stream, everything goes into stream 0 (the default stream). This means every operation waits for the previous one to finish — even when the hardware could run them concurrently.
Consider this typical pattern: copy data to device, run a kernel, copy results back. On stream 0, each step blocks until the previous one completes.
cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice); // blocks until done
myKernel<<<grid, block>>>(d_in, d_out); // waits for memcpy
cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost); // waits for kernel
Modern GPUs have separate copy engines and compute engines. They can physically transfer data and run kernels at the same time. But the default stream serialises everything, leaving hardware idle. The solution: explicit CUDA streams.
A CUDA stream is a sequence of operations that execute in order on the GPU. Operations in different streams can run concurrently — the GPU scheduler interleaves them on the available hardware engines.
cudaStream_t stream1, stream2;
// Create streams
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Issue work to different streams
myKernel<<<grid, block, 0, stream1>>>(d_a); // 4th param = stream
myKernel<<<grid, block, 0, stream2>>>(d_b); // runs concurrently!
// Synchronise
cudaStreamSynchronize(stream1); // block host until stream1 done
cudaStreamSynchronize(stream2);
// Clean up
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
| Rule | Details |
|---|---|
| Within a stream | Operations execute in issue order (FIFO) |
| Across streams | No ordering guarantee — may overlap |
| Default stream | Stream 0 synchronises with all other streams (legacy behaviour) |
| Kernel launch | 4th argument in <<<grid, block, sharedMem, stream>>> |
| Async memcpy | Use cudaMemcpyAsync() — requires pinned memory |
cudaMemcpyAsync returns immediately to the host — the copy happens in the background. But the source/destination must be pinned (page-locked) memory allocated with cudaMallocHost(), not regular malloc().
The real power of streams: run a kernel on one dataset while simultaneously transferring the next dataset. This requires pinned memory and async memcpy.
Requirements: allocate host buffers with cudaMallocHost() or cudaHostAlloc(), and use cudaMemcpyAsync() instead of cudaMemcpy().
#include <cstdio>
#include <cuda_runtime.h>
// Simple kernel: scale each element by 2
__global__ void scaleKernel(float *data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
data[idx] *= 2.0f;
}
}
int main() {
const int N = 1 << 20; // 1M elements per chunk
const size_t size = N * sizeof(float);
// Allocate pinned host memory (required for async transfers)
float *h_A, *h_B;
cudaMallocHost(&h_A, size);
cudaMallocHost(&h_B, size);
// Initialise host data
for (int i = 0; i < N; i++) {
h_A[i] = 1.0f;
h_B[i] = 2.0f;
}
// Allocate device memory
float *d_A, *d_B;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
// Create two streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
int threads = 256;
int blocks = (N + threads - 1) / threads;
// Issue operations to stream 1
cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream1);
scaleKernel<<<blocks, threads, 0, stream1>>>(d_A, N);
cudaMemcpyAsync(h_A, d_A, size, cudaMemcpyDeviceToHost, stream1);
// Issue operations to stream 2 (can overlap with stream 1)
cudaMemcpyAsync(d_B, h_B, size, cudaMemcpyHostToDevice, stream2);
scaleKernel<<<blocks, threads, 0, stream2>>>(d_B, N);
cudaMemcpyAsync(h_B, d_B, size, cudaMemcpyDeviceToHost, stream2);
// Wait for both streams to complete
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
// Verify results
printf("h_A[0] = %.1f (expected 2.0)\n", h_A[0]);
printf("h_B[0] = %.1f (expected 4.0)\n", h_B[0]);
// Cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
cudaFree(d_A);
cudaFree(d_B);
cudaFreeHost(h_A);
cudaFreeHost(h_B);
return 0;
}
The issue order matters. If you issue all of stream 1's work first, then all of stream 2's work, the GPU can start stream 2's H→D copy while stream 1's kernel is still running. For best overlap, interleave operations: copy1, copy2, kernel1, kernel2, etc. (See Slide 05 for the optimal pattern.)
CUDA events are lightweight markers you insert into a stream. They serve two purposes: precise GPU timing and inter-stream synchronisation.
| Function | Purpose |
|---|---|
| cudaEventCreate(&event) | Create an event object |
| cudaEventRecord(event, stream) | Record the event in a stream (marks a point in time) |
| cudaEventSynchronize(event) | Block the host until the event has been reached |
| cudaEventElapsedTime(&ms, start, stop) | Measure time between two events (milliseconds) |
| cudaStreamWaitEvent(stream, event) | Make a stream wait until an event in another stream completes |
| cudaEventDestroy(event) | Destroy the event object |
CUDA events give you accurate GPU-side timing, unaffected by host-side noise, kernel launch overhead, or OS scheduling.
#include <cstdio>
#include <cuda_runtime.h>
__global__ void vectorAdd(const float *a, const float *b,
float *c, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N)
c[idx] = a[idx] + b[idx];
}
int main() {
const int N = 1 << 22; // 4M elements
const size_t size = N * sizeof(float);
// Host arrays
float *h_a, *h_b, *h_c;
cudaMallocHost(&h_a, size);
cudaMallocHost(&h_b, size);
cudaMallocHost(&h_c, size);
for (int i = 0; i < N; i++) {
h_a[i] = 1.0f;
h_b[i] = 2.0f;
}
// Device arrays
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
// Create events for timing
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Copy data to device
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
int threads = 256;
int blocks = (N + threads - 1) / threads;
// Record start event, launch kernel, record stop event
cudaEventRecord(start);
vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
cudaEventRecord(stop);
// Wait for stop event to complete
cudaEventSynchronize(stop);
// Calculate elapsed time
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel execution time: %.3f ms\n", milliseconds);
// Copy result back and verify
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
printf("h_c[0] = %.1f (expected 3.0)\n", h_c[0]);
// Cleanup
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
cudaFreeHost(h_a);
cudaFreeHost(h_b);
cudaFreeHost(h_c);
return 0;
}
Use cudaStreamWaitEvent() to create dependencies between streams without blocking the host.
cudaEvent_t dataReady;
cudaEventCreate(&dataReady);
// Stream 1 does the copy, records when done
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(dataReady, stream1);
// Stream 2 waits for the event, then computes
cudaStreamWaitEvent(stream2, dataReady);
processKernel<<<grid, block, 0, stream2>>>(d_data);
cudaStreamWaitEvent only blocks the GPU stream, not the host. The host can keep issuing work to other streams. This is how you build complex dependency graphs while keeping the GPU busy.
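As a sketch of such a dependency graph, two producer streams can each record an event, and a third stream can wait on both before combining the results. This fragment assumes the streams, buffers, and combineKernel have been set up elsewhere; the names are illustrative:

```
cudaEvent_t aReady, bReady;
cudaEventCreate(&aReady);
cudaEventCreate(&bReady);

// Two producer streams transfer independent inputs
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(aReady, stream1);
cudaMemcpyAsync(d_b, h_b, size, cudaMemcpyHostToDevice, stream2);
cudaEventRecord(bReady, stream2);

// A consumer stream waits on both events, then combines the inputs.
// The host is never blocked; it could keep issuing work here.
cudaStreamWaitEvent(stream3, aReady);
cudaStreamWaitEvent(stream3, bReady);
combineKernel<<<grid, block, 0, stream3>>>(d_a, d_b, d_out);
```

Both copies can proceed concurrently, and the combine kernel starts as soon as the later of the two finishes.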
The most important stream pattern: split your data into chunks and pipeline them. While chunk N is being computed, chunk N+1 is being transferred. This maximises hardware utilisation.
#include <cstdio>
#include <cuda_runtime.h>
__global__ void squareKernel(float *data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
float val = data[idx];
data[idx] = val * val;
}
}
int main() {
const int TOTAL = 1 << 22; // 4M elements total
const int NUM_STREAMS = 4;
const int CHUNK = TOTAL / NUM_STREAMS;
const size_t chunkBytes = CHUNK * sizeof(float);
// Pinned host memory
float *h_data;
cudaMallocHost(&h_data, TOTAL * sizeof(float));
for (int i = 0; i < TOTAL; i++)
h_data[i] = 3.0f;
// Device memory
float *d_data;
cudaMalloc(&d_data, TOTAL * sizeof(float));
// Create streams and events for timing
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++)
cudaStreamCreate(&streams[i]);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
int threads = 256;
int blocks = (CHUNK + threads - 1) / threads;
// ── Pipelined execution ──
cudaEventRecord(start);
for (int i = 0; i < NUM_STREAMS; i++) {
int offset = i * CHUNK;
// Stage 1: Copy chunk to device
cudaMemcpyAsync(d_data + offset, h_data + offset,
chunkBytes, cudaMemcpyHostToDevice, streams[i]);
// Stage 2: Compute on chunk
squareKernel<<<blocks, threads, 0, streams[i]>>>(
d_data + offset, CHUNK);
// Stage 3: Copy results back
cudaMemcpyAsync(h_data + offset, d_data + offset,
chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Pipelined execution: %.3f ms\n", ms);
// ── Sequential baseline for comparison ──
for (int i = 0; i < TOTAL; i++)
h_data[i] = 3.0f;
cudaEventRecord(start);
cudaMemcpy(d_data, h_data, TOTAL * sizeof(float),
cudaMemcpyHostToDevice);
squareKernel<<<(TOTAL + 255) / 256, 256>>>(d_data, TOTAL);
cudaMemcpy(h_data, d_data, TOTAL * sizeof(float),
cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("Sequential execution: %.3f ms\n", ms);
// Verify
printf("h_data[0] = %.1f (expected 9.0)\n", h_data[0]);
// Cleanup
for (int i = 0; i < NUM_STREAMS; i++)
cudaStreamDestroy(streams[i]);
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(d_data);
cudaFreeHost(h_data);
return 0;
}
For best overlap on GPUs with a single copy engine, interleave the issue order: all H→D copies first, then all kernels, then all D→H copies. However, the loop-per-stream pattern shown above works well on GPUs with dual copy engines (most modern GPUs) and is simpler to write and maintain.
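Under that single-copy-engine assumption, the pipelined loop from the example above can be reorganised into three passes, reusing the same names (d_data, h_data, streams, CHUNK, chunkBytes, blocks, threads):

```
// Breadth-first issue order: all H->D copies, then all kernels,
// then all D->H copies. A single copy engine processes copies in
// issue order, so this prevents a later chunk's upload from
// queuing behind an earlier chunk's download.
for (int i = 0; i < NUM_STREAMS; i++)
    cudaMemcpyAsync(d_data + i * CHUNK, h_data + i * CHUNK,
                    chunkBytes, cudaMemcpyHostToDevice, streams[i]);
for (int i = 0; i < NUM_STREAMS; i++)
    squareKernel<<<blocks, threads, 0, streams[i]>>>(
        d_data + i * CHUNK, CHUNK);
for (int i = 0; i < NUM_STREAMS; i++)
    cudaMemcpyAsync(h_data + i * CHUNK, d_data + i * CHUNK,
                    chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
```

Per-stream ordering is unchanged (each chunk is still copied in, computed, and copied out in order); only the host-side issue order differs.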
The null stream (stream 0) has special synchronisation behaviour that can silently kill your concurrency. Understanding it is critical.
The fix: compile with --default-stream per-thread.
# Legacy behaviour (default)
nvcc -o myapp myapp.cu
# Per-thread default stream
nvcc --default-stream per-thread -o myapp myapp.cu
# Alternatively, define before including CUDA headers:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>
cudaStream_t s1;
cudaStreamCreate(&s1);
kernelA<<<grid, block, 0, s1>>>(); // issued to stream s1
kernelB<<<grid, block>>>(); // issued to stream 0 (null stream)
kernelC<<<grid, block, 0, s1>>>(); // issued to stream s1
// With LEGACY null stream:
// kernelA starts
// kernelB waits for kernelA to finish (stream 0 syncs)
// kernelC waits for kernelB to finish (stream 0 syncs)
// Result: ALL THREE are serialised!
//
// With PER-THREAD default stream:
// kernelA and kernelB can overlap
// kernelC waits only for kernelA (same stream s1)
Always use explicit streams for production code. Never rely on the null stream for concurrent work. If you use libraries that issue to the default stream, compile with --default-stream per-thread to avoid accidental serialisation.
You can also create streams that do not synchronise with the null stream:
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
// This stream will NOT synchronise with the null stream
Put your understanding of streams and async execution to the test with these hands-on challenges.
Implement a pipelined vector addition that splits two input arrays into N_STREAMS chunks, copies each chunk to the device, computes the sum, and copies the result back — all using separate streams.
Use cudaMallocHost for all host allocations and cudaMemcpyAsync for all transfers. Using CUDA events, measure and compare the pipelined and sequential execution times.
Create a scenario where stream 2 must wait for stream 1 to produce a result before it can start computing:
Use cudaEventRecord + cudaStreamWaitEvent for the dependency.
Use nsys profile ./your_app (Nsight Systems) to visually inspect your timeline. You'll see whether streams actually overlap or if accidental null-stream usage is serialising them.
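A typical invocation looks like this (the report name is illustrative):

```
# Capture a timeline and print summary tables when the run finishes
nsys profile --stats=true -o streams_report ./your_app
```

Open the resulting report in the Nsight Systems GUI to see each stream as a separate row on the GPU timeline.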
cudaMemcpyAsync with pinned memory enables concurrent copy and compute. The --default-stream per-thread flag eliminates null-stream serialisation for multi-threaded host code. Async transfers require cudaMallocHost: without pinned memory, cudaMemcpyAsync silently falls back to synchronous behaviour.
Use Nsight Systems to verify overlap. The GPU timeline visualisation immediately shows whether your streams are actually concurrent.
CUDA Libraries — leverage cuBLAS, cuFFT, cuRAND, and Thrust to accelerate common operations without writing custom kernels. Learn when to use a library vs rolling your own.