cuBLAS, cuDNN, Thrust, cuRAND, cuFFT — when to write custom kernels vs use optimised libraries, with complete compilable examples.
This tutorial surveys the CUDA library ecosystem. You'll learn when to reach for a library vs write a custom kernel, and see complete working examples of the most important libraries.
Tutorials 01–06 (GPU architecture, kernels, memory, synchronisation, streams, and performance). You should be comfortable writing and launching CUDA kernels.
The most important decision in CUDA development: should you write a custom kernel or use an existing library? Getting this wrong wastes weeks of engineering time.
If NVIDIA has a library for it, use the library first. Profile it. Only write a custom kernel if profiling reveals the library is a bottleneck for your specific workload, or if you need to fuse multiple operations together.
cuBLAS is the CUDA implementation of the Basic Linear Algebra Subprograms (BLAS) standard. Its crown jewel is GEMM — General Matrix Multiply — which is the single most performance-critical operation in deep learning.
Create the cublasHandle_t once and reuse it across calls — handle creation is expensive. For extra throughput on Tensor Core GPUs, compute types such as CUBLAS_COMPUTE_32F_FAST_16F (used with cublasGemmEx) trade a small amount of precision for large speedups.
#include <cstdio>
#include <cstdlib>
#include <cublas_v2.h>
#include <cuda_runtime.h>
// Error checking macro
#define CHECK_CUDA(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while(0)

#define CHECK_CUBLAS(call) do { \
    cublasStatus_t stat = call; \
    if (stat != CUBLAS_STATUS_SUCCESS) { \
        fprintf(stderr, "cuBLAS error at %s:%d: %d\n", \
                __FILE__, __LINE__, stat); \
        exit(1); \
    } \
} while(0)

int main() {
    // Matrix dimensions: C(M×N) = A(M×K) × B(K×N)
    const int M = 1024, K = 512, N = 2048;
    const float alpha = 1.0f, beta = 0.0f;

    // Host allocation
    size_t sizeA = M * K * sizeof(float);
    size_t sizeB = K * N * sizeof(float);
    size_t sizeC = M * N * sizeof(float);
    float *h_A = (float*)malloc(sizeA);
    float *h_B = (float*)malloc(sizeB);
    float *h_C = (float*)malloc(sizeC);

    // Initialise with random values
    for (int i = 0; i < M * K; i++) h_A[i] = (float)rand() / RAND_MAX;
    for (int i = 0; i < K * N; i++) h_B[i] = (float)rand() / RAND_MAX;

    // Device allocation
    float *d_A, *d_B, *d_C;
    CHECK_CUDA(cudaMalloc(&d_A, sizeA));
    CHECK_CUDA(cudaMalloc(&d_B, sizeB));
    CHECK_CUDA(cudaMalloc(&d_C, sizeC));

    // Copy to device
    CHECK_CUDA(cudaMemcpy(d_A, h_A, sizeA, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(d_B, h_B, sizeB, cudaMemcpyHostToDevice));

    // Create cuBLAS handle
    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));

    // Perform GEMM: C = alpha * A * B + beta * C
    // cuBLAS uses column-major, so we compute B^T * A^T = (AB)^T,
    // which gives us the row-major result we want
    CHECK_CUBLAS(cublasSgemm(handle,
        CUBLAS_OP_N, CUBLAS_OP_N,  // no transpose
        N, M, K,                   // dimensions (col-major order)
        &alpha,
        d_B, N,                    // B is N×K in col-major = K×N row-major
        d_A, K,                    // A is K×M in col-major = M×K row-major
        &beta,
        d_C, N));                  // C is N×M in col-major = M×N row-major

    // Copy result back
    CHECK_CUDA(cudaMemcpy(h_C, d_C, sizeC, cudaMemcpyDeviceToHost));

    // Verify a single element: C[0][0] = dot(A[0][:], B[:][0])
    float expected = 0.0f;
    for (int i = 0; i < K; i++)
        expected += h_A[i] * h_B[i * N];
    printf("C[0][0] = %.4f (expected %.4f)\n", h_C[0], expected);
    printf("GEMM complete: (%d×%d) × (%d×%d) = (%d×%d)\n", M, K, K, N, M, N);

    // Cleanup
    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
cuBLAS inherits Fortran's column-major layout. To multiply row-major matrices without transposing, swap A and B and swap M and N: compute B * A in cuBLAS column-major, which yields A * B in row-major. This avoids explicit transposes.
cuDNN provides GPU-accelerated implementations of the core operations in neural networks. It is the engine behind PyTorch, TensorFlow, and virtually every deep learning framework.
Every cuDNN operation follows the same pattern: create descriptors, find the best algorithm, allocate workspace, execute.
// 1. Create descriptors for input, filter, output, convolution
cudnnTensorDescriptor_t inputDesc, outputDesc;
cudnnFilterDescriptor_t filterDesc;
cudnnConvolutionDescriptor_t convDesc;
cudnnCreateTensorDescriptor(&inputDesc);
cudnnCreateTensorDescriptor(&outputDesc);
cudnnCreateFilterDescriptor(&filterDesc);
cudnnCreateConvolutionDescriptor(&convDesc);
// (then set each descriptor's shape, e.g. with cudnnSetTensor4dDescriptor)

// 2. Let cuDNN pick the fastest algorithm for this configuration
cudnnConvolutionFwdAlgoPerf_t perfResults[8];
int returnedAlgos;
cudnnGetConvolutionForwardAlgorithm_v7(handle,
    inputDesc, filterDesc, convDesc, outputDesc,
    8, &returnedAlgos, perfResults);

// 3. Allocate workspace for the chosen algorithm
size_t workspaceSize = perfResults[0].memory;
void *d_workspace;
cudaMalloc(&d_workspace, workspaceSize);

// 4. Execute the convolution
cudnnConvolutionForward(handle,
    &alpha, inputDesc, d_input,
    filterDesc, d_filter,
    convDesc, perfResults[0].algo,
    d_workspace, workspaceSize,
    &beta, outputDesc, d_output);
When you call torch.nn.Conv2d in PyTorch, it dispatches to cuDNN. When you run tf.nn.conv2d in TensorFlow, it also dispatches to cuDNN. Writing your own convolution that beats cuDNN is extremely difficult — NVIDIA tunes it per-architecture with assembly-level micro-optimisations and Tensor Core scheduling.
Thrust is CUDA's answer to the C++ Standard Template Library. It provides parallel versions of sort, reduce, transform, scan, and more — all without writing a single kernel.
thrust::device_vector<T> manages GPU memory automatically — allocation in its constructor, deallocation in its destructor. Swap device_vector for host_vector and the same algorithms run on the CPU.
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
// Custom functor: square each element
// Custom functor: square each element
struct square_op {
    __host__ __device__
    float operator()(float x) const {
        return x * x;
    }
};

int main() {
    const int N = 1000000;

    // ── 1. Create and fill a device vector ──
    thrust::device_vector<float> d_vec(N);
    thrust::sequence(d_vec.begin(), d_vec.end(), 1.0f);
    // d_vec = {1, 2, 3, ..., 1000000}

    // ── 2. Transform: square every element ──
    thrust::device_vector<float> d_squared(N);
    thrust::transform(d_vec.begin(), d_vec.end(),
                      d_squared.begin(), square_op());
    printf("squared[0] = %.0f, squared[4] = %.0f\n",
           (float)d_squared[0], (float)d_squared[4]);
    // Output: squared[0] = 1, squared[4] = 25

    // ── 3. Reduce: sum of squares ──
    // (float accumulation is inexact at this magnitude; use a double
    // initial value and accumulator when accuracy matters)
    float sum = thrust::reduce(d_squared.begin(), d_squared.end(),
                               0.0f, thrust::plus<float>());
    printf("Sum of squares 1..%d = %.0f\n", N, sum);

    // ── 4. Sort in descending order ──
    thrust::sort(d_vec.begin(), d_vec.end(), thrust::greater<float>());
    printf("After descending sort: first = %.0f, last = %.0f\n",
           (float)d_vec[0], (float)d_vec[N - 1]);
    // Output: first = 1000000, last = 1

    // ── 5. Copy back to host ──
    thrust::host_vector<float> h_vec = d_vec;
    printf("Copied %d elements to host. h_vec[0] = %.0f\n",
           (int)h_vec.size(), h_vec[0]);
    return 0;
}
For primitive types with the default comparator, Thrust's sort dispatches to a highly tuned radix sort (via CUB), which typically outperforms hand-written comparison sorts. For large inputs such as 100M elements it runs close to the GPU's memory-bandwidth limit — very hard to beat with a custom kernel.
Two more essential CUDA libraries: random number generation and Fast Fourier Transforms.
cuRAND generates pseudorandom and quasirandom numbers on the GPU. It supports multiple generator types (XORWOW, Philox, MRG32k3a, Sobol) and distributions (uniform, normal, log-normal, Poisson).
#include <cstdio>
#include <curand.h>
#include <cuda_runtime.h>
int main() {
    const int N = 1000000;
    float *d_data, *h_data;
    cudaMalloc(&d_data, N * sizeof(float));
    h_data = (float*)malloc(N * sizeof(float));

    // Create generator
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 42ULL);

    // Generate 1M uniform floats in (0, 1]
    curandGenerateUniform(gen, d_data, N);

    // Copy back and check
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int i = 0; i < N; i++) sum += h_data[i];
    printf("Mean of %d uniform samples: %f (expected ~0.5)\n",
           N, sum / N);

    // Generate normal distribution (mean=0, stddev=1)
    curandGenerateNormal(gen, d_data, N, 0.0f, 1.0f);
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    sum = 0.0f;
    for (int i = 0; i < N; i++) sum += h_data[i];
    printf("Mean of %d normal samples: %f (expected ~0.0)\n",
           N, sum / N);

    // Cleanup
    curandDestroyGenerator(gen);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
cuFFT implements 1D, 2D, and 3D FFTs on the GPU. It handles real-to-complex, complex-to-real, and complex-to-complex transforms with automatic plan optimisation.
#include <cstdio>
#include <cufft.h>
#include <cuda_runtime.h>
#include <cmath>
int main() {
    const int N = 1024;

    // Host: create a signal with two frequencies
    cufftComplex *h_signal = (cufftComplex*)malloc(N * sizeof(cufftComplex));
    for (int i = 0; i < N; i++) {
        float t = (float)i / N;
        h_signal[i].x = sinf(2.0f * M_PI * 50.0f * t)    // 50 Hz
                      + sinf(2.0f * M_PI * 120.0f * t);  // 120 Hz
        h_signal[i].y = 0.0f;  // imaginary part
    }

    // Device allocation
    cufftComplex *d_signal;
    cudaMalloc(&d_signal, N * sizeof(cufftComplex));
    cudaMemcpy(d_signal, h_signal, N * sizeof(cufftComplex),
               cudaMemcpyHostToDevice);

    // Create FFT plan and execute
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    // Copy back and find peak frequencies
    cudaMemcpy(h_signal, d_signal, N * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);
    printf("Peak frequencies (magnitude > N/4):\n");
    for (int i = 0; i < N / 2; i++) {
        float mag = sqrtf(h_signal[i].x * h_signal[i].x
                        + h_signal[i].y * h_signal[i].y);
        if (mag > N / 4.0f)
            printf("  Bin %d (freq ~%d Hz): magnitude = %.1f\n",
                   i, i, mag);
    }

    // Cleanup
    cufftDestroy(plan);
    cudaFree(d_signal);
    free(h_signal);
    return 0;
}
cuFFT automatically selects the Cooley-Tukey, Bluestein, or mixed-radix algorithm depending on the transform size. For power-of-two sizes, it achieves near-peak memory bandwidth. It also supports multi-GPU transforms via cufftXtMakePlanMany.
A quick reference for the most important CUDA libraries and when to reach for each one.
| Library | Purpose | Key Functions | Typical Use Case |
|---|---|---|---|
| cuBLAS | Linear algebra (BLAS 1/2/3) | cublasSgemm, cublasDgemv | Matrix multiply, linear layers in DNNs |
| cuDNN | Deep neural network primitives | cudnnConvolutionForward, cudnnBatchNormalizationForwardInference | Training & inference in DL frameworks |
| Thrust | Parallel STL algorithms | thrust::sort, thrust::reduce, thrust::transform | Sorting, filtering, aggregation on GPU |
| cuRAND | Random number generation | curandGenerateUniform, curandGenerateNormal | Monte Carlo, initialisation, dropout |
| cuFFT | Fast Fourier Transforms | cufftExecC2C, cufftExecR2C | Signal processing, spectral analysis |
| cuSPARSE | Sparse matrix operations | cusparseSpMV, cusparseSpGEMM | Graph analytics, sparse neural networks |
| cuSOLVER | Dense & sparse solvers | cusolverDnSgetrf, cusolverSpScsrlsvqr | LU factorisation, eigenvalue decomposition |
| NCCL | Multi-GPU communication | ncclAllReduce, ncclBroadcast | Distributed training across GPUs/nodes |
| cuTENSOR | Tensor contractions | cutensorContraction, cutensorReduction | Quantum chemistry, tensor networks |
| NPP | Image/signal processing | nppiResize, nppiFilter | Image preprocessing pipelines |
Each library requires its own link flag: -lcublas, -lcurand, -lcufft, -lcusparse, -lcusolver. Thrust and CUB are header-only and require no link flags. cuDNN is a separate download from developer.nvidia.com.
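As a concrete illustration (the source file names here are invented), building the examples from this tutorial with nvcc would look roughly like:

```shell
# One link flag per library; Thrust and CUB need none (header-only).
nvcc gemm_cublas.cu -lcublas -o gemm_cublas
nvcc thrust_demo.cu -o thrust_demo
nvcc rand_curand.cu -lcurand -o rand_curand
nvcc fft_cufft.cu -lcufft -o fft_cufft
nvcc conv_cudnn.cu -lcudnn -o conv_cudnn   # requires cuDNN installed
```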
Most CUDA development today happens through Python. Here are the main ways to use CUDA from Python, from easiest to most control.
CuPy (import cupy as cp) is a drop-in NumPy replacement, with cp.RawKernel for embedding raw CUDA C; Numba provides the @cuda.jit decorator for writing kernels in Python syntax; and torch.utils.cpp_extension binds hand-written CUDA C++ into PyTorch. That last route is the most common path for production CUDA in ML: write a kernel in CUDA C++, then expose it through PyTorch's extension mechanism.
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name="my_cuda_ops",
    ext_modules=[
        CUDAExtension(
            "my_cuda_ops",
            ["my_ops.cpp", "my_kernel.cu"],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
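The setup script references a C++ binding file. A minimal sketch of what my_ops.cpp might contain — the op name my_custom_op and the launcher my_custom_op_cuda are illustrative placeholders, not part of any real API:

```cpp
#include <torch/extension.h>

// Declared here, defined in my_kernel.cu, where it launches the actual
// CUDA kernel. (Hypothetical signature for illustration.)
torch::Tensor my_custom_op_cuda(torch::Tensor input);

// Host-side wrapper: validate the input, then dispatch to the kernel.
torch::Tensor my_custom_op(torch::Tensor input) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
    return my_custom_op_cuda(input);
}

// Expose the op to Python under the module name from setup.py.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("my_custom_op", &my_custom_op, "My custom CUDA op");
}
```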
import torch
import my_cuda_ops
x = torch.randn(1024, 1024, device="cuda")
result = my_cuda_ops.my_custom_op(x) # calls your CUDA kernel
Start with CuPy or PyTorch for prototyping. When you identify a performance bottleneck, write a custom CUDA kernel and expose it via PyTorch's cpp_extension. This is the path that FlashAttention, Triton-compiled kernels, and most production ML inference code follow.
Once your model is trained, CUDA libraries power the inference stack. TensorRT and NIM sit at the top of the deployment pipeline.
TensorRT is NVIDIA's inference optimiser and runtime. It takes a trained model and produces an engine optimised for your specific GPU.
NIM (NVIDIA Inference Microservices) packages TensorRT-optimised models as containerised microservices with a standard API. It bridges the gap between a CUDA-optimised engine and a production HTTP endpoint.
Understanding the full stack matters: custom CUDA kernels feed into libraries (cuBLAS, cuDNN), which feed into optimisers (TensorRT), which feed into serving infrastructure (Triton, NIM). Each layer adds performance optimisations that compound.
The CUDA ecosystem is vast, but the decision framework is simple: use libraries for standard operations, write custom kernels when you need fusion or unique access patterns.
Hardware Deep Dive — GPU microarchitecture details, Tensor Cores, memory subsystem internals, and how to map your understanding of CUDA software to real silicon.