CUDA Programming Series — Tutorial 09

Libraries & Ecosystem

cuBLAS, cuDNN, Thrust, cuRAND, cuFFT — when to write custom kernels vs use optimised libraries, with complete compilable examples.

00

Topics We'll Cover

This tutorial surveys the CUDA library ecosystem. You'll learn when to reach for a library vs write a custom kernel, and see complete working examples of the most important libraries.

Prerequisites

Tutorials 01–06 (GPU architecture, kernels, memory, synchronisation, streams, and performance). You should be comfortable writing and launching CUDA kernels.

01

When to Write Custom Kernels vs Use Libraries

The most important decision in CUDA development: should you write a custom kernel or use an existing library? Getting this wrong wastes weeks of engineering time.

Decision Flowchart

Is your operation a standard primitive?
(GEMM, FFT, conv, sort, reduce, RNG)
  YES → Use a library (cuBLAS, cuDNN, Thrust, cuFFT)
  NO  → Can you fuse multiple ops to reduce memory traffic?
          YES → Write a custom kernel (fused ops, unique access patterns)
          NO  → Compose library calls (chain with CUDA streams)

When to Write Custom Kernels

Custom Kernel

  • Kernel fusion — combine multiple operations to reduce memory round-trips
  • Unique access patterns — irregular data structures, sparse matrices
  • Domain-specific logic — physics simulations, custom loss functions
  • Extreme tuning — squeezing last 10% of performance for your specific hardware

Use a Library

  • Standard BLAS ops — matrix multiply, triangular solve
  • FFT, convolution — well-studied algorithms with HW-specific tuning
  • Sort, scan, reduce — parallel primitives with tricky edge cases
  • DNN layers — convolution, pooling, normalisation

Rule of Thumb

If NVIDIA has a library for it, use the library first. Profile it. Only write a custom kernel if profiling reveals the library is a bottleneck for your specific workload, or if you need to fuse multiple operations together.

02

cuBLAS — GPU-Accelerated Linear Algebra

cuBLAS is the CUDA implementation of the Basic Linear Algebra Subprograms (BLAS) standard. Its crown jewel is GEMM — General Matrix Multiply — which is the single most performance-critical operation in deep learning.

Key API Patterns

GEMM: C = αAB + βC

Matrix A (M × K)  ×  Matrix B (K × N)  →  Matrix C (M × N)

Complete cuBLAS Matrix Multiply

cublas_matmul.cu — compile: nvcc -o cublas_matmul cublas_matmul.cu -lcublas
#include <cstdio>
#include <cstdlib>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Error checking macro
#define CHECK_CUDA(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while(0)

#define CHECK_CUBLAS(call) do { \
    cublasStatus_t stat = call; \
    if (stat != CUBLAS_STATUS_SUCCESS) { \
        fprintf(stderr, "cuBLAS error at %s:%d: %d\n", \
                __FILE__, __LINE__, stat); \
        exit(1); \
    } \
} while(0)

int main() {
    // Matrix dimensions: C(M×N) = A(M×K) × B(K×N)
    const int M = 1024, K = 512, N = 2048;
    const float alpha = 1.0f, beta = 0.0f;

    // Host allocation
    size_t sizeA = M * K * sizeof(float);
    size_t sizeB = K * N * sizeof(float);
    size_t sizeC = M * N * sizeof(float);

    float *h_A = (float*)malloc(sizeA);
    float *h_B = (float*)malloc(sizeB);
    float *h_C = (float*)malloc(sizeC);

    // Initialise with random values
    for (int i = 0; i < M * K; i++) h_A[i] = (float)rand() / RAND_MAX;
    for (int i = 0; i < K * N; i++) h_B[i] = (float)rand() / RAND_MAX;

    // Device allocation
    float *d_A, *d_B, *d_C;
    CHECK_CUDA(cudaMalloc(&d_A, sizeA));
    CHECK_CUDA(cudaMalloc(&d_B, sizeB));
    CHECK_CUDA(cudaMalloc(&d_C, sizeC));

    // Copy to device
    CHECK_CUDA(cudaMemcpy(d_A, h_A, sizeA, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(d_B, h_B, sizeB, cudaMemcpyHostToDevice));

    // Create cuBLAS handle
    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));

    // Perform GEMM: C = alpha * A * B + beta * C
    // cuBLAS uses column-major, so we compute B^T * A^T = (AB)^T
    // which gives us the row-major result we want
    CHECK_CUBLAS(cublasSgemm(handle,
        CUBLAS_OP_N, CUBLAS_OP_N,  // no transpose
        N, M, K,                    // dimensions (col-major order)
        &alpha,
        d_B, N,                     // B is N×K in col-major = K×N row-major
        d_A, K,                     // A is K×M in col-major = M×K row-major
        &beta,
        d_C, N));                   // C is N×M in col-major = M×N row-major

    // Copy result back
    CHECK_CUDA(cudaMemcpy(h_C, d_C, sizeC, cudaMemcpyDeviceToHost));

    // Verify a single element: C[0][0] = dot(A[0][:], B[:][0])
    float expected = 0.0f;
    for (int i = 0; i < K; i++)
        expected += h_A[i] * h_B[i * N];
    printf("C[0][0] = %.4f (expected %.4f)\n", h_C[0], expected);
    printf("GEMM complete: (%d×%d) × (%d×%d) = (%d×%d)\n", M, K, K, N, M, N);

    // Cleanup
    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}

Column-Major Trick

cuBLAS inherits Fortran's column-major layout. To multiply row-major matrices without transposing, swap A and B and swap M and N: compute B * A in cuBLAS column-major, which yields A * B in row-major. This avoids explicit transposes.

03

cuDNN — Deep Neural Network Primitives

cuDNN provides GPU-accelerated implementations of the core operations in neural networks. It is the engine behind PyTorch, TensorFlow, and virtually every deep learning framework.

What cuDNN Provides

Core Operations

  • Convolution — forward, backward data, backward filter
  • Pooling — max, average, adaptive
  • Normalisation — batch norm, layer norm, group norm
  • Activation — ReLU, sigmoid, tanh, GELU

Advanced Features

  • Attention — multi-head attention (Flash Attention)
  • RNNs — LSTM, GRU with persistent kernels
  • Tensor Cores — automatic FP16/TF32 acceleration
  • Graph API — fuse multiple ops into a single kernel

cuDNN API Pattern

Every cuDNN operation follows the same pattern: create descriptors, find the best algorithm, allocate workspace, execute.

1. Create handle & descriptors
2. Find best algorithm (cudnnGetConvolutionForwardAlgorithm_v7)
3. Query workspace size & cudaMalloc workspace
4. Execute operation
5. Destroy descriptors & handle

Workspace Allocation Pattern

cuDNN convolution pattern (pseudocode)
// 1. Create descriptors for input, filter, output, convolution
//    (cudnnCreate*Descriptor and cudnnSet*Descriptor calls omitted here)
cudnnTensorDescriptor_t      inputDesc, outputDesc;
cudnnFilterDescriptor_t      filterDesc;
cudnnConvolutionDescriptor_t convDesc;

// 2. Let cuDNN pick the fastest algorithm for this configuration
cudnnConvolutionFwdAlgoPerf_t perfResults[8];
int returnedAlgos;
cudnnGetConvolutionForwardAlgorithm_v7(handle,
    inputDesc, filterDesc, convDesc, outputDesc,
    8, &returnedAlgos, perfResults);

// 3. Allocate workspace for the chosen algorithm
size_t workspaceSize = perfResults[0].memory;
void *d_workspace;
cudaMalloc(&d_workspace, workspaceSize);

// 4. Execute the convolution
cudnnConvolutionForward(handle,
    &alpha, inputDesc, d_input,
    filterDesc, d_filter,
    convDesc, perfResults[0].algo,
    d_workspace, workspaceSize,
    &beta, outputDesc, d_output);

Why Frameworks Use cuDNN

Under the Hood

When you call torch.nn.Conv2d in PyTorch, it dispatches to cuDNN. When you run tf.nn.conv2d in TensorFlow, it also dispatches to cuDNN. Writing your own convolution that beats cuDNN is extremely difficult — NVIDIA tunes it per-architecture with assembly-level micro-optimisations and Tensor Core scheduling.

04

Thrust — STL-Like Parallel Algorithms

Thrust is CUDA's answer to the C++ Standard Template Library. It provides parallel versions of sort, reduce, transform, scan, and more — all without writing a single kernel.

Key Features

Common Thrust Operations

sort · reduce · transform · scan · copy_if · unique · merge

Complete Thrust Example

thrust_demo.cu — compile: nvcc -o thrust_demo thrust_demo.cu
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>

// Custom functor: square each element
struct square_op {
    __host__ __device__
    float operator()(float x) const {
        return x * x;
    }
};

int main() {
    const int N = 1000000;

    // ── 1. Create and fill a device vector ──
    thrust::device_vector<float> d_vec(N);
    thrust::sequence(d_vec.begin(), d_vec.end(), 1.0f);
    // d_vec = {1, 2, 3, ..., 1000000}

    // ── 2. Transform: square every element ──
    thrust::device_vector<float> d_squared(N);
    thrust::transform(d_vec.begin(), d_vec.end(),
                       d_squared.begin(), square_op());
    printf("squared[0] = %.0f, squared[4] = %.0f\n",
           (float)d_squared[0], (float)d_squared[4]);
    // Output: squared[0] = 1, squared[4] = 25

    // ── 3. Reduce: sum of squares ──
    float sum = thrust::reduce(d_squared.begin(), d_squared.end(),
                               0.0f, thrust::plus<float>());
    printf("Sum of squares 1..%d = %.0f\n", N, sum);

    // ── 4. Sort in descending order ──
    thrust::sort(d_vec.begin(), d_vec.end(), thrust::greater<float>());
    printf("After descending sort: first = %.0f, last = %.0f\n",
           (float)d_vec[0], (float)d_vec[N - 1]);
    // Output: first = 1000000, last = 1

    // ── 5. Copy back to host ──
    thrust::host_vector<float> h_vec = d_vec;
    printf("Copied %d elements to host. h_vec[0] = %.0f\n",
           (int)h_vec.size(), h_vec[0]);

    return 0;
}

Performance Note

Thrust's sort uses radix sort on GPUs and typically outperforms hand-written comparison sorts. For 100M elements, thrust::sort is within 5% of the theoretical peak memory bandwidth — very hard to beat with a custom kernel.

05

cuRAND & cuFFT

Two more essential CUDA libraries: random number generation and Fast Fourier Transforms.

cuRAND — GPU Random Number Generation

cuRAND generates pseudorandom and quasirandom numbers on the GPU. It supports multiple generator types (XORWOW, Philox, MRG32k3a, Sobol) and distributions (uniform, normal, log-normal, Poisson).

curand_demo.cu — compile: nvcc -o curand_demo curand_demo.cu -lcurand
#include <cstdio>
#include <curand.h>
#include <cuda_runtime.h>

int main() {
    const int N = 1000000;
    float *d_data, *h_data;

    cudaMalloc(&d_data, N * sizeof(float));
    h_data = (float*)malloc(N * sizeof(float));

    // Create generator
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 42ULL);

    // Generate 1M uniform floats in [0, 1)
    curandGenerateUniform(gen, d_data, N);

    // Copy back and check
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    float sum = 0.0f;
    for (int i = 0; i < N; i++) sum += h_data[i];
    printf("Mean of %d uniform samples: %f (expected ~0.5)\n",
           N, sum / N);

    // Generate normal distribution (mean=0, stddev=1)
    curandGenerateNormal(gen, d_data, N, 0.0f, 1.0f);
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    sum = 0.0f;
    for (int i = 0; i < N; i++) sum += h_data[i];
    printf("Mean of %d normal samples: %f (expected ~0.0)\n",
           N, sum / N);

    // Cleanup
    curandDestroyGenerator(gen);
    cudaFree(d_data);
    free(h_data);
    return 0;
}

cuFFT — Fast Fourier Transforms

cuFFT implements 1D, 2D, and 3D FFTs on the GPU. It handles real-to-complex, complex-to-real, and complex-to-complex transforms with automatic plan optimisation.

cufft_demo.cu — compile: nvcc -o cufft_demo cufft_demo.cu -lcufft
#include <cstdio>
#include <cufft.h>
#include <cuda_runtime.h>
#include <cmath>

int main() {
    const int N = 1024;

    // Host: create a signal with two frequencies
    cufftComplex *h_signal = (cufftComplex*)malloc(N * sizeof(cufftComplex));
    for (int i = 0; i < N; i++) {
        float t = (float)i / N;
        h_signal[i].x = sinf(2.0f * M_PI * 50.0f * t)   // 50 Hz
                       + sinf(2.0f * M_PI * 120.0f * t);  // 120 Hz
        h_signal[i].y = 0.0f;  // imaginary part
    }

    // Device allocation
    cufftComplex *d_signal;
    cudaMalloc(&d_signal, N * sizeof(cufftComplex));
    cudaMemcpy(d_signal, h_signal, N * sizeof(cufftComplex),
               cudaMemcpyHostToDevice);

    // Create FFT plan and execute
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    // Copy back and find peak frequencies
    cudaMemcpy(h_signal, d_signal, N * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);

    printf("Peak frequencies (magnitude > N/4):\n");
    for (int i = 0; i < N / 2; i++) {
        float mag = sqrtf(h_signal[i].x * h_signal[i].x
                        + h_signal[i].y * h_signal[i].y);
        if (mag > N / 4.0f)
            printf("  Bin %d (freq ~%d Hz): magnitude = %.1f\n",
                   i, i, mag);
    }

    // Cleanup
    cufftDestroy(plan);
    cudaFree(d_signal);
    free(h_signal);
    return 0;
}

Performance

cuFFT automatically selects the Cooley-Tukey, Bluestein, or mixed-radix algorithm depending on the transform size. For power-of-two sizes, it achieves near-peak memory bandwidth. It also supports multi-GPU transforms via cufftXtMakePlanMany.

06

CUDA Math Libraries Cheat Sheet

A quick reference for the most important CUDA libraries and when to reach for each one.

Library   | Purpose                        | Key Functions                                   | Typical Use Case
cuBLAS    | Linear algebra (BLAS 1/2/3)    | cublasSgemm, cublasDgemv                        | Matrix multiply, linear layers in DNNs
cuDNN     | Deep neural network primitives | cudnnConvolutionForward, cudnnBatchNormForward  | Training & inference in DL frameworks
Thrust    | Parallel STL algorithms        | thrust::sort, thrust::reduce, thrust::transform | Sorting, filtering, aggregation on GPU
cuRAND    | Random number generation       | curandGenerateUniform, curandGenerateNormal     | Monte Carlo, initialisation, dropout
cuFFT     | Fast Fourier Transforms        | cufftExecC2C, cufftExecR2C                      | Signal processing, spectral analysis
cuSPARSE  | Sparse matrix operations       | cusparseSpMV, cusparseSpGEMM                    | Graph analytics, sparse neural networks
cuSOLVER  | Dense & sparse solvers         | cusolverDnSgetrf, cusolverSpScsrlsvqr           | LU factorisation, eigenvalue decomposition
NCCL      | Multi-GPU communication        | ncclAllReduce, ncclBroadcast                    | Distributed training across GPUs/nodes
cuTENSOR  | Tensor contractions            | cutensorContraction, cutensorReduction          | Quantum chemistry, tensor networks
NPP       | Image/signal processing        | nppiResize, nppiFilter                          | Image preprocessing pipelines

Linking

Each library requires its own link flag: -lcublas, -lcurand, -lcufft, -lcusparse, -lcusolver. Thrust and CUB are header-only and require no link flags. cuDNN is a separate download from developer.nvidia.com.

07

Integrating with Python

Most CUDA development today happens through Python. Here are the main ways to use CUDA from Python, ordered from easiest to adopt to most control.

CuPy

  • Drop-in NumPy replacement on GPU
  • import cupy as cp
  • Supports custom CUDA kernels via RawKernel
  • Uses cuBLAS, cuFFT, cuRAND under the hood

Numba CUDA

  • Write CUDA kernels in Python syntax
  • @cuda.jit decorator
  • Access to shared memory, synchronisation
  • Good for prototyping custom kernels

PyCUDA

  • Thin Python wrapper over CUDA Driver API
  • Inline CUDA C code as strings
  • Full control, full complexity
  • Compile-at-runtime with caching

PyTorch CUDA Extensions

  • Write C++/CUDA, call from PyTorch
  • Integrates with autograd for training
  • Used by FlashAttention, xFormers, etc.
  • torch.utils.cpp_extension

PyTorch Custom CUDA Extension

The most common path for production CUDA in ML: write a kernel in CUDA C++, expose it through PyTorch's extension mechanism.

setup.py — build system for a PyTorch CUDA extension
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name="my_cuda_ops",
    ext_modules=[
        CUDAExtension(
            "my_cuda_ops",
            ["my_ops.cpp", "my_kernel.cu"],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)

Usage in Python after building
import torch
import my_cuda_ops

x = torch.randn(1024, 1024, device="cuda")
result = my_cuda_ops.my_custom_op(x)  # calls your CUDA kernel

Recommendation

Start with CuPy or PyTorch for prototyping. When you identify a performance bottleneck, write a custom CUDA kernel and expose it via PyTorch's cpp_extension. This is the path FlashAttention, Triton-compiled kernels, and most production ML inference code follows.

08

NVIDIA NIM & TensorRT

Once your model is trained, CUDA libraries power the inference stack. TensorRT and NIM sit at the top of the deployment pipeline.

TensorRT — Inference Optimisation

TensorRT is NVIDIA's inference optimiser and runtime. It takes a trained model and produces an engine optimised for your specific GPU.

Trained Model (PyTorch / ONNX / TensorFlow)
  → TensorRT Optimiser
      • Layer & tensor fusion
      • Precision calibration (FP32 → FP16 / INT8)
      • Kernel auto-tuning (profiled per-layer on the target GPU)
  → Optimised Engine (.engine / .plan file)


NVIDIA NIM — Model Deployment

NIM (NVIDIA Inference Microservices) packages TensorRT-optimised models as containerised microservices with a standard API. It bridges the gap between a CUDA-optimised engine and a production HTTP endpoint.

Custom CUDA Kernel → cuBLAS / cuDNN → TensorRT Engine → Triton Inference Server → NIM Container

When CUDA Meets Production

Understanding the full stack matters: custom CUDA kernels feed into libraries (cuBLAS, cuDNN), which feed into optimisers (TensorRT), which feed into serving infrastructure (Triton, NIM). Each layer adds performance optimisations that compound.

09

Summary & Series Conclusion

The CUDA ecosystem is vast, but the decision framework is simple: use libraries for standard operations, write custom kernels when you need fusion or unique access patterns.

Key Takeaways

  • Reach for a library first — cuBLAS, cuDNN, Thrust, cuFFT, and cuRAND cover the standard primitives
  • Write custom kernels only for fusion, unusual access patterns, or domain-specific logic
  • Profile before replacing a library call — only optimise a measured bottleneck
  • In production, kernels and libraries feed TensorRT, Triton, and NIM, and their optimisations compound

The Full CUDA Landscape

Your Application
  ↓
cuBLAS · cuDNN · Thrust · cuFFT · cuRAND
  ↓
CUDA Runtime API
  ↓
CUDA Driver & GPU Hardware

Next Tutorial

Up Next — Tutorial 10

Hardware Deep Dive — GPU microarchitecture details, Tensor Cores, memory subsystem internals, and how to map your understanding of CUDA software to real silicon.