cuBLAS, cuDNN, Thrust, cuRAND, cuFFT — when to write custom kernels vs use optimised libraries, with complete compilable examples.
This tutorial surveys the CUDA library ecosystem. You'll learn when to reach for a library vs write a custom kernel, and see complete working examples of the most important libraries.
Tutorials 01–06 (GPU architecture, kernels, memory, synchronisation, streams, and performance). You should be comfortable writing and launching CUDA kernels.
The most important decision in CUDA development: should you write a custom kernel or use an existing library? Getting this wrong wastes weeks of engineering time.
If NVIDIA has a library for it, use the library first. Profile it. Only write a custom kernel if profiling reveals the library is a bottleneck for your specific workload, or if you need to fuse multiple operations together.
cuBLAS is the CUDA implementation of the Basic Linear Algebra Subprograms (BLAS) standard. Its crown jewel is GEMM — General Matrix Multiply — which is the single most performance-critical operation in deep learning.
Create the cublasHandle_t once and reuse it across calls — handle creation is expensive. For extra throughput on Tensor Core GPUs, compute types such as CUBLAS_COMPUTE_32F_FAST_16F (used with cublasGemmEx) trade a small amount of precision for large speedups.
#include <cstdio>
#include <cstdlib>
#include <cublas_v2.h>
#include <cuda_runtime.h>
// Error checking macro
#define CHECK_CUDA(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while(0)

#define CHECK_CUBLAS(call) do { \
    cublasStatus_t stat = call; \
    if (stat != CUBLAS_STATUS_SUCCESS) { \
        fprintf(stderr, "cuBLAS error at %s:%d: %d\n", \
                __FILE__, __LINE__, stat); \
        exit(1); \
    } \
} while(0)

int main() {
    // Matrix dimensions: C(M×N) = A(M×K) × B(K×N)
    const int M = 1024, K = 512, N = 2048;
    const float alpha = 1.0f, beta = 0.0f;

    // Host allocation
    size_t sizeA = M * K * sizeof(float);
    size_t sizeB = K * N * sizeof(float);
    size_t sizeC = M * N * sizeof(float);
    float *h_A = (float*)malloc(sizeA);
    float *h_B = (float*)malloc(sizeB);
    float *h_C = (float*)malloc(sizeC);

    // Initialise with random values
    for (int i = 0; i < M * K; i++) h_A[i] = (float)rand() / RAND_MAX;
    for (int i = 0; i < K * N; i++) h_B[i] = (float)rand() / RAND_MAX;

    // Device allocation
    float *d_A, *d_B, *d_C;
    CHECK_CUDA(cudaMalloc(&d_A, sizeA));
    CHECK_CUDA(cudaMalloc(&d_B, sizeB));
    CHECK_CUDA(cudaMalloc(&d_C, sizeC));

    // Copy to device
    CHECK_CUDA(cudaMemcpy(d_A, h_A, sizeA, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(d_B, h_B, sizeB, cudaMemcpyHostToDevice));

    // Create cuBLAS handle
    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));

    // Perform GEMM: C = alpha * A * B + beta * C
    // cuBLAS uses column-major, so we compute B^T * A^T = (AB)^T,
    // which gives us the row-major result we want
    CHECK_CUBLAS(cublasSgemm(handle,
        CUBLAS_OP_N, CUBLAS_OP_N,  // no transpose
        N, M, K,                   // dimensions (col-major order)
        &alpha,
        d_B, N,                    // B is N×K in col-major = K×N row-major
        d_A, K,                    // A is K×M in col-major = M×K row-major
        &beta,
        d_C, N));                  // C is N×M in col-major = M×N row-major

    // Copy result back
    CHECK_CUDA(cudaMemcpy(h_C, d_C, sizeC, cudaMemcpyDeviceToHost));

    // Verify a single element: C[0][0] = dot(A[0][:], B[:][0])
    float expected = 0.0f;
    for (int i = 0; i < K; i++)
        expected += h_A[i] * h_B[i * N];
    printf("C[0][0] = %.4f (expected %.4f)\n", h_C[0], expected);
    printf("GEMM complete: (%d×%d) × (%d×%d) = (%d×%d)\n", M, K, K, N, M, N);

    // Cleanup
    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
cuBLAS inherits Fortran's column-major layout. To multiply row-major matrices without transposing, swap A and B and swap M and N: compute B * A in cuBLAS column-major, which yields A * B in row-major. This avoids explicit transposes.
cuDNN provides GPU-accelerated implementations of the core operations in neural networks. It is the engine behind PyTorch, TensorFlow, and virtually every deep learning framework.
Every cuDNN operation follows the same pattern: create descriptors, find the best algorithm, allocate workspace, execute.
// 1. Create descriptors for input, filter, output, convolution
cudnnTensorDescriptor_t inputDesc, outputDesc;
cudnnFilterDescriptor_t filterDesc;
cudnnConvolutionDescriptor_t convDesc;
cudnnCreateTensorDescriptor(&inputDesc);
cudnnCreateTensorDescriptor(&outputDesc);
cudnnCreateFilterDescriptor(&filterDesc);
cudnnCreateConvolutionDescriptor(&convDesc);
// (then set each descriptor's shape, e.g. with cudnnSetTensor4dDescriptor)

// 2. Let cuDNN pick the fastest algorithm for this configuration
cudnnConvolutionFwdAlgoPerf_t perfResults[8];
int returnedAlgos;
cudnnGetConvolutionForwardAlgorithm_v7(handle,
    inputDesc, filterDesc, convDesc, outputDesc,
    8, &returnedAlgos, perfResults);

// 3. Allocate workspace for the chosen algorithm
size_t workspaceSize = perfResults[0].memory;
void *d_workspace;
cudaMalloc(&d_workspace, workspaceSize);

// 4. Execute the convolution
cudnnConvolutionForward(handle,
    &alpha, inputDesc, d_input,
    filterDesc, d_filter,
    convDesc, perfResults[0].algo,
    d_workspace, workspaceSize,
    &beta, outputDesc, d_output);
When you call torch.nn.Conv2d in PyTorch, it dispatches to cuDNN. When you run tf.nn.conv2d in TensorFlow, it also dispatches to cuDNN. Writing your own convolution that beats cuDNN is extremely difficult — NVIDIA tunes it per-architecture with assembly-level micro-optimisations and Tensor Core scheduling.
Thrust is CUDA's answer to the C++ Standard Template Library. It provides parallel versions of sort, reduce, transform, scan, and more — all without writing a single kernel.
thrust::device_vector<T> manages GPU memory automatically — allocation in its constructor, deallocation in its destructor. Swap device_vector for host_vector and the same algorithms run on the CPU.
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
// Custom functor: square each element
// Custom functor: square each element
struct square_op {
    __host__ __device__
    float operator()(float x) const {
        return x * x;
    }
};

int main() {
    const int N = 1000000;

    // ── 1. Create and fill a device vector ──
    thrust::device_vector<float> d_vec(N);
    thrust::sequence(d_vec.begin(), d_vec.end(), 1.0f);
    // d_vec = {1, 2, 3, ..., 1000000}

    // ── 2. Transform: square every element ──
    thrust::device_vector<float> d_squared(N);
    thrust::transform(d_vec.begin(), d_vec.end(),
                      d_squared.begin(), square_op());
    printf("squared[0] = %.0f, squared[4] = %.0f\n",
           (float)d_squared[0], (float)d_squared[4]);
    // Output: squared[0] = 1, squared[4] = 25

    // ── 3. Reduce: sum of squares ──
    // (float accumulation is inexact at this magnitude; use a double
    // initial value and accumulator when accuracy matters)
    float sum = thrust::reduce(d_squared.begin(), d_squared.end(),
                               0.0f, thrust::plus<float>());
    printf("Sum of squares 1..%d = %.0f\n", N, sum);

    // ── 4. Sort in descending order ──
    thrust::sort(d_vec.begin(), d_vec.end(), thrust::greater<float>());
    printf("After descending sort: first = %.0f, last = %.0f\n",
           (float)d_vec[0], (float)d_vec[N - 1]);
    // Output: first = 1000000, last = 1

    // ── 5. Copy back to host ──
    thrust::host_vector<float> h_vec = d_vec;
    printf("Copied %d elements to host. h_vec[0] = %.0f\n",
           (int)h_vec.size(), h_vec[0]);
    return 0;
}
For primitive types with the default comparator, Thrust's sort dispatches to a highly tuned radix sort (via CUB), which typically outperforms hand-written comparison sorts. For large inputs such as 100M elements it runs close to the GPU's memory-bandwidth limit — very hard to beat with a custom kernel.
Two more essential CUDA libraries: random number generation and Fast Fourier Transforms.
cuRAND generates pseudorandom and quasirandom numbers on the GPU. It supports multiple generator types (XORWOW, Philox, MRG32k3a, Sobol) and distributions (uniform, normal, log-normal, Poisson).
#include <cstdio>
#include <curand.h>
#include <cuda_runtime.h>
int main() {
    const int N = 1000000;
    float *d_data, *h_data;
    cudaMalloc(&d_data, N * sizeof(float));
    h_data = (float*)malloc(N * sizeof(float));

    // Create generator
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 42ULL);

    // Generate 1M uniform floats in (0, 1]
    curandGenerateUniform(gen, d_data, N);

    // Copy back and check
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int i = 0; i < N; i++) sum += h_data[i];
    printf("Mean of %d uniform samples: %f (expected ~0.5)\n",
           N, sum / N);

    // Generate normal distribution (mean=0, stddev=1)
    curandGenerateNormal(gen, d_data, N, 0.0f, 1.0f);
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    sum = 0.0f;
    for (int i = 0; i < N; i++) sum += h_data[i];
    printf("Mean of %d normal samples: %f (expected ~0.0)\n",
           N, sum / N);

    // Cleanup
    curandDestroyGenerator(gen);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
cuFFT implements 1D, 2D, and 3D FFTs on the GPU. It handles real-to-complex, complex-to-real, and complex-to-complex transforms with automatic plan optimisation.
#include <cstdio>
#include <cufft.h>
#include <cuda_runtime.h>
#include <cmath>
int main() {
    const int N = 1024;

    // Host: create a signal with two frequencies
    cufftComplex *h_signal = (cufftComplex*)malloc(N * sizeof(cufftComplex));
    for (int i = 0; i < N; i++) {
        float t = (float)i / N;
        h_signal[i].x = sinf(2.0f * M_PI * 50.0f * t)    // 50 Hz
                      + sinf(2.0f * M_PI * 120.0f * t);  // 120 Hz
        h_signal[i].y = 0.0f;  // imaginary part
    }

    // Device allocation
    cufftComplex *d_signal;
    cudaMalloc(&d_signal, N * sizeof(cufftComplex));
    cudaMemcpy(d_signal, h_signal, N * sizeof(cufftComplex),
               cudaMemcpyHostToDevice);

    // Create FFT plan and execute
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    // Copy back and find peak frequencies
    cudaMemcpy(h_signal, d_signal, N * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);
    printf("Peak frequencies (magnitude > N/4):\n");
    for (int i = 0; i < N / 2; i++) {
        float mag = sqrtf(h_signal[i].x * h_signal[i].x
                        + h_signal[i].y * h_signal[i].y);
        if (mag > N / 4.0f)
            printf("  Bin %d (freq ~%d Hz): magnitude = %.1f\n",
                   i, i, mag);
    }

    // Cleanup
    cufftDestroy(plan);
    cudaFree(d_signal);
    free(h_signal);
    return 0;
}
cuFFT automatically selects the Cooley-Tukey, Bluestein, or mixed-radix algorithm depending on the transform size. For power-of-two sizes, it achieves near-peak memory bandwidth. It also supports multi-GPU transforms via cufftXtMakePlanMany.
A quick reference for the most important CUDA libraries and when to reach for each one.
| Library | Purpose | Key Functions | Typical Use Case |
|---|---|---|---|
| cuBLAS | Linear algebra (BLAS 1/2/3) | cublasSgemm, cublasDgemv | Matrix multiply, linear layers in DNNs |
| cuDNN | Deep neural network primitives | cudnnConvolutionForward, cudnnBatchNormalizationForwardInference | Training & inference in DL frameworks |
| Thrust | Parallel STL algorithms | thrust::sort, thrust::reduce, thrust::transform | Sorting, filtering, aggregation on GPU |
| cuRAND | Random number generation | curandGenerateUniform, curandGenerateNormal | Monte Carlo, initialisation, dropout |
| cuFFT | Fast Fourier Transforms | cufftExecC2C, cufftExecR2C | Signal processing, spectral analysis |
| cuSPARSE | Sparse matrix operations | cusparseSpMV, cusparseSpGEMM | Graph analytics, sparse neural networks |
| cuSOLVER | Dense & sparse solvers | cusolverDnSgetrf, cusolverSpScsrlsvqr | LU factorisation, eigenvalue decomposition |
| NCCL | Multi-GPU communication | ncclAllReduce, ncclBroadcast | Distributed training across GPUs/nodes |
| cuTENSOR | Tensor contractions | cutensorContraction, cutensorReduction | Quantum chemistry, tensor networks |
| NPP | Image/signal processing | nppiResize, nppiFilter | Image preprocessing pipelines |
Each library requires its own link flag: -lcublas, -lcurand, -lcufft, -lcusparse, -lcusolver. Thrust and CUB are header-only and require no link flags. cuDNN is a separate download from developer.nvidia.com.
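As a concrete illustration (the source file names here are invented), building the examples from this tutorial with nvcc would look roughly like:

```shell
# One link flag per library; Thrust and CUB need none (header-only).
nvcc gemm_cublas.cu -lcublas -o gemm_cublas
nvcc thrust_demo.cu -o thrust_demo
nvcc rand_curand.cu -lcurand -o rand_curand
nvcc fft_cufft.cu -lcufft -o fft_cufft
nvcc conv_cudnn.cu -lcudnn -o conv_cudnn   # requires cuDNN installed
```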
Most CUDA development today happens through Python. Here are the main ways to use CUDA from Python, from easiest to most control.
CuPy (import cupy as cp) is a drop-in NumPy replacement, with cp.RawKernel for embedding raw CUDA C; Numba provides the @cuda.jit decorator for writing kernels in Python syntax; and torch.utils.cpp_extension binds hand-written CUDA C++ into PyTorch. That last route is the most common path for production CUDA in ML: write a kernel in CUDA C++, then expose it through PyTorch's extension mechanism.
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name="my_cuda_ops",
    ext_modules=[
        CUDAExtension(
            "my_cuda_ops",
            ["my_ops.cpp", "my_kernel.cu"],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
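The setup script references a C++ binding file. A minimal sketch of what my_ops.cpp might contain — the op name my_custom_op and the launcher my_custom_op_cuda are illustrative placeholders, not part of any real API:

```cpp
#include <torch/extension.h>

// Declared here, defined in my_kernel.cu, where it launches the actual
// CUDA kernel. (Hypothetical signature for illustration.)
torch::Tensor my_custom_op_cuda(torch::Tensor input);

// Host-side wrapper: validate the input, then dispatch to the kernel.
torch::Tensor my_custom_op(torch::Tensor input) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
    return my_custom_op_cuda(input);
}

// Expose the op to Python under the module name from setup.py.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("my_custom_op", &my_custom_op, "My custom CUDA op");
}
```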
import torch
import my_cuda_ops
x = torch.randn(1024, 1024, device="cuda")
result = my_cuda_ops.my_custom_op(x) # calls your CUDA kernel
Start with CuPy or PyTorch for prototyping. When you identify a performance bottleneck, write a custom CUDA kernel and expose it via PyTorch's cpp_extension. This is the path that FlashAttention, Triton-compiled kernels, and most production ML inference code follow.
Once your model is trained, CUDA libraries power the inference stack. TensorRT and NIM sit at the top of the deployment pipeline.
TensorRT is NVIDIA's inference optimiser and runtime. It takes a trained model and produces an engine optimised for your specific GPU.
NIM (NVIDIA Inference Microservices) packages TensorRT-optimised models as containerised microservices with a standard API. It bridges the gap between a CUDA-optimised engine and a production HTTP endpoint.
Understanding the full stack matters: custom CUDA kernels feed into libraries (cuBLAS, cuDNN), which feed into optimisers (TensorRT), which feed into serving infrastructure (Triton, NIM). Each layer adds performance optimisations that compound.
The CUDA ecosystem is vast, but the decision framework is simple: use libraries for standard operations, write custom kernels when you need fusion or unique access patterns.
Hardware Deep Dive — GPU microarchitecture details, Tensor Cores, memory subsystem internals, and how to map your understanding of CUDA software to real silicon.