Practical buying guide — DGX Spark, consumer GPUs, cloud instances, Google Colab — cost, capability, and honest recommendations.
This is a hardware buying guide, not a coding tutorial. We'll walk through every tier of CUDA-capable hardware — from free cloud notebooks to workstation-class machines — with honest assessments of cost, capability, and who each option is actually for.
Prices are approximate as of early 2026 and will vary by region and market conditions. GPU prices on the used market fluctuate significantly.
The barrier to entry for CUDA programming is lower than most people think. You need exactly three things:
Any NVIDIA GPU with CUDA support — compute capability 5.0 or higher. This includes every GeForce card from the GTX 900 series onward.
The CUDA Toolkit, a free download from NVIDIA. It includes the nvcc compiler, runtime libraries, and profiling tools like Nsight.
A text editor and a terminal. VS Code with the CUDA extension works well. No IDE required.
CUDA compute capability determines which features your GPU supports. For this tutorial series, 5.0+ covers everything through Tutorial 09. Unified memory (cudaMallocManaged) works best on 6.0+.
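If you're not sure what compute capability your card has, you can query it with the runtime API. A minimal sketch (compile with `nvcc`):

```cuda
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major and prop.minor form the compute capability,
        // e.g. 8.6 for an Ampere-generation RTX 3060
        printf("GPU %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

The same numbers are also listed on NVIDIA's website, but querying them directly confirms what your installed driver actually reports.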
You do not need an expensive GPU to learn CUDA. A five-year-old GTX 1650 can run every kernel in Tutorials 01–06. The hardware only starts to matter when you're doing real workloads — large matrix multiplications, ML training, or multi-GPU programming.
The fastest way to start writing CUDA code without installing anything or spending any money.
In a Colab notebook, you can compile CUDA with the `%%cuda` magic cell (via `nvcc4jupyter`), or save a source file with `%%writefile` and compile it manually with `!nvcc -o out kernel.cu`. The magic cell is the simplest route:

```python
# Install the CUDA plugin for Jupyter
!pip install nvcc4jupyter
%load_ext nvcc4jupyter
```

```cpp
%%cuda
#include <cstdio>

__global__ void hello() {
    printf("Hello from thread %d\n", threadIdx.x);
}

int main() {
    hello<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```
**Limitations:** `nsys` and `ncu` are not available, so you're limited to basic timing with CUDA events.

**Best for:** getting started immediately, working through Tutorials 01–08, and learning from any machine with a browser.

**Not ideal for:** profiling-heavy tutorials, persistent projects, or anything requiring long uninterrupted sessions.
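Event-based timing is still enough for the early tutorials. A minimal sketch, using a toy kernel (`busywork` is a name invented here for illustration):

```cuda
#include <cstdio>

__global__ void busywork(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    busywork<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

Events measure time on the GPU itself, so they give more honest kernel timings than wrapping the launch in host-side clocks.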
A local GPU is the most practical option for serious CUDA learning. Full control over your environment, real profiling tools, and no session limits.
The RTX 3060 12 GB is the best bang-for-buck CUDA learning card. It has more VRAM than the RTX 3070 (which only has 8 GB), enough CUDA cores for meaningful parallel workloads, Ampere architecture with compute capability 8.6, and can be found used for around £250.
| GPU | VRAM | CUDA Cores | CC | Approx Price | Suitability |
|---|---|---|---|---|---|
| GTX 1650 | 4 GB | 896 | 7.5 | ~£80 | Tutorials 01–06 only |
| GTX 1080 Ti | 11 GB | 3,584 | 6.1 | ~£150 | Full series, no Tensor Cores |
| RTX 3050 | 8 GB | 2,560 | 8.6 | ~£150 | Most tutorials, VRAM limited |
| RTX 3060 12 GB | 12 GB | 3,584 | 8.6 | ~£250 | Best value — full series |
| RTX 4060 | 8 GB | 3,072 | 8.9 | ~£280 | Newer arch, less VRAM |
If you're buying one GPU specifically for CUDA learning, the used RTX 3060 12 GB at ~£250 is the clear winner. It covers every tutorial in this series with room to spare, and gives you access to full profiling with Nsight Systems and Nsight Compute.
If you want to go beyond tutorials and do real ML training, inference, or large-scale CUDA projects, the Ada Lovelace generation offers a significant step up.
For ML workloads, VRAM matters more than raw CUDA core count. A model that doesn't fit in VRAM won't run — no matter how many cores you have.
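You can check how much VRAM is actually free at runtime with `cudaMemGetInfo`; a minimal sketch:

```cuda
#include <cstdio>

int main() {
    size_t freeBytes = 0, totalBytes = 0;
    // Reports memory on the currently selected device
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("VRAM: %.1f GB free of %.1f GB total\n",
           freeBytes / 1e9, totalBytes / 1e9);
    return 0;
}
```

This is worth running before a big allocation: the desktop compositor and other processes can easily hold a gigabyte or more of an 8 GB card.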
The RTX 4090 (24 GB) is one of the most powerful consumer GPUs available. If you're combining CUDA learning with ML development, it's the sweet spot. But for pure CUDA learning, it's more than you need — the RTX 3060 covers the tutorials just fine.
NVIDIA's compact AI workstation, built on the GB10 Grace Blackwell Superchip. It's a desktop-sized machine that combines an ARM64 CPU and a Blackwell GPU in a unified memory architecture.
| Spec | DGX Spark | Comparison |
|---|---|---|
| Chip | GB10 Grace Blackwell Superchip | Custom NVIDIA SoC |
| CUDA Cores | 6,144 | Similar to RTX 5070 |
| Memory | 128 GB unified (LPDDR5x) | 5× the RTX 4090's 24 GB |
| AI Performance | 1 PFLOP FP4 sparse | Blackwell Tensor Cores |
| CPU | 20-core ARM64 (Grace) | Not x86 — ARM architecture |
| OS | Ubuntu Linux (pre-installed) | Full CUDA toolkit included |
| Networking | 200GbE QSFP56 (ConnectX-8) | Can link two units |
| Price | $4,699 / ~£3,800 | 15× a used RTX 3060 |
Unlike discrete GPUs where CPU and GPU have separate memory pools, the DGX Spark shares 128 GB between CPU and GPU. This makes cudaMallocManaged genuinely efficient — no PCIe transfers, just coherent access to the same physical memory.
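The programming model is the same `cudaMallocManaged` you'd use on a discrete GPU; the Spark just removes the hidden migration cost. A minimal sketch:

```cuda
#include <cstdio>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *x;
    // One allocation visible to both CPU and GPU. On a discrete card the
    // runtime migrates pages over PCIe; on unified-memory hardware like
    // the Spark, both processors address the same physical DRAM.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // CPU writes directly

    scale<<<(n + 255) / 256, 256>>>(x, n);     // GPU uses the same pointer
    cudaDeviceSynchronize();

    printf("x[0] = %.1f\n", x[0]);             // CPU reads the result back
    cudaFree(x);
    return 0;
}
```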
The DGX Spark is overkill for pure CUDA learning. A £250 used RTX 3060 covers every tutorial in this series. Where the Spark shines is as a combined CUDA learning machine and AI development workstation — if you also want to run and fine-tune large language models locally, prototype edge AI applications, or learn multi-GPU programming without renting cloud time. It's an excellent machine for the right person: someone who wants a single device that handles both CUDA education and serious AI development. It's not the right purchase for someone who only wants to learn CUDA basics on a budget.
Two DGX Spark units can be linked via their built-in 200GbE QSFP56 ports (ConnectX-8 SuperNIC) to create a compact multi-GPU development cluster.
A linked pair lets you practise the core multi-GPU techniques: `cudaMemcpyPeer`, peer-to-peer access, and GPU-to-GPU data transfers.

A two-Spark cluster costs ~£7,600. For CUDA learning, this is firmly in "if money is no object" territory. Multi-GPU programming can also be learned on two consumer GPUs in a single desktop (e.g. two RTX 3060s for ~£500), or via cloud instances. The Spark cluster is more relevant for teams doing serious AI development who want a compact, always-on lab environment.
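As a sketch of what peer-to-peer code looks like, assuming two GPUs at device indices 0 and 1:

```cuda
#include <cstdio>

int main() {
    int canAccess = 0;
    // Check whether GPU 0 can address GPU 1's memory directly
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) {
        printf("no peer-to-peer path between GPU 0 and GPU 1\n");
        return 0;
    }

    const size_t bytes = 1 << 20;
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Direct device-to-device copy: no staging through host memory
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```

The same code runs unchanged on two consumer GPUs in one desktop, which is exactly why that's the budget route to learning multi-GPU programming.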
Cloud GPUs make sense for occasional heavy workloads — large-scale training runs, experimenting with A100/H100-class hardware, or when you don't want to invest in local hardware.
| Provider | GPU Options | VRAM | Approx Cost/hr | Notes |
|---|---|---|---|---|
| Google Colab Pro | T4, A100 | 16–40 GB | £9/month flat | Easiest; not guaranteed A100 |
| AWS (p3/p4/p5) | V100, A100, H100 | 16–80 GB | £2–£30/hr | Enterprise-grade; complex setup |
| GCP | T4, L4, A100, H100 | 16–80 GB | £1–£25/hr | Good Jupyter integration |
| Lambda Labs | A100, H100 | 40–80 GB | £1–£3/hr | Best value for A100/H100 |
| Vast.ai | Various (marketplace) | 8–80 GB | £0.20–£3/hr | Cheapest; variable reliability |
| RunPod | A100, H100, RTX 4090 | 24–80 GB | £0.40–£4/hr | Good developer experience |
A single A100 instance at £2.50/hr running 8 hours/day for a month costs ~£600 — enough to buy a used RTX 4070 Ti outright. Cloud is economical for occasional use, not daily development.
Matching your needs and budget to the right hardware.
| Tier | Hardware | Cost | Covers |
|---|---|---|---|
| Free | Google Colab (T4) | £0 | Tutorials 01–08, with session limits |
| Budget | Used RTX 3060 12 GB | ~£250 | All tutorials, full profiling |
| Enthusiast | RTX 4070 Ti / 4090 | £600–£1,600 | All tutorials + real ML training |
| Workstation | DGX Spark | ~£3,800 | All tutorials + LLM dev + edge AI |
| Cloud | Lambda Labs / Vast.ai (A100) | £1–3/hr | Heavy training, pay-as-you-go |
**Just learning CUDA on a budget?** → Google Colab (free) or a used RTX 3060 (~£250). Either will cover the full tutorial series. The RTX 3060 adds full profiling and no session limits.

**Combining CUDA with real ML training?** → RTX 4070 Ti (16 GB, ~£700) or RTX 4090 (24 GB, ~£1,500). Enough VRAM for real model training alongside CUDA learning.

**Want to develop LLMs locally too?** → DGX Spark (~£3,800). The 128 GB unified memory lets you run and fine-tune large models locally without cloud dependency. Also excellent for exploring unified memory programming.

**Only occasional heavy workloads?** → Cloud A100/H100 instances. On-demand access to top-tier hardware without capital expenditure. Use Lambda Labs or RunPod for best developer experience.
Begin with Google Colab to see if CUDA programming clicks for you. Once you're committed, a local GPU gives you the full development experience — profiling, persistent files, no session limits.
A £250 card you use every day beats a £4,000 workstation gathering dust. Match the hardware to your actual workflow, not your aspirations.
You've reached the end of the CUDA Programming Series. Over 10 tutorials, we've covered GPU architecture, kernel programming, memory hierarchies, synchronisation, streams, profiling, and now the hardware to run it all on. The next step is yours — pick a project, pick a GPU, and start building.