Parameter-efficient fine-tuning, the post-training stack, and the VRAM trade-offs that decide which method fits which GPU. Cert-focused synthesis of LoRA, QLoRA, DoRA, RLHF, DPO, and Constitutional AI.
A cert-focused tour of the post-training stack — supervised fine-tuning, parameter-efficient adapters, preference optimisation, and how each method trades VRAM for capability.
| Exam domain | Weight | What it tests here |
|---|---|---|
| Experimentation | 22% | Method-selection scenarios — given a problem (small labelled dataset, limited compute, latency targets), pick fine-tune vs RAG vs prompt and justify. PEFT covered conceptually. |
| Fine-Tuning | 13% | Numerical depth — LoRA parameter count from rank/target_modules, QLoRA quantisation impact, RLHF three-model VRAM, DPO derivation, multi-LoRA serving. |
Fine-tuning decisions interact with Data Preparation (9%) on dataset construction, with Safety (5%) on refusal training, and with Model Optimisation (17%) on adapter merging vs multi-LoRA serving choices.
A pretrained base model predicts next tokens; it does not produce helpful or safe outputs. Post-training shapes the model. Modern stacks compose four stages in sequence.
Not every model goes through all four. A domain-specialist may do SFT only. A safety-critical assistant adds RLHF then CAI. Each stage is independently optional but the order matters: preference tuning needs an SFT-style policy to start from.
| Method | VRAM (rough) | When to choose |
|---|---|---|
| Full SFT (BF16) | ~16 bytes/param | Large dataset, <1B model on consumer GPU, or you have multi-GPU and need maximum capability shift |
| LoRA (BF16 base) | ~2 bytes/param + adapters | Default for most tasks. Adapter switching, multi-tenant serving, predictable VRAM. |
| QLoRA (NF4 base) | ~0.6 bytes/param + adapters | Largest model that fits. RTX 4000 Ada (20 GB) hosts 13B comfortably, 30B with activation checkpointing. |
| DoRA | Like LoRA + small overhead | When LoRA underperforms and you want closer to full-FT quality without the memory. |
| IA³ | Tiny (3 vectors per layer) | Strong inductive bias, very small adapter, less expressive than LoRA on hard tasks. |
Default to LoRA. Fall back to QLoRA when memory binds. Reach for DoRA when LoRA accuracy disappoints. Reach for full FT when capability shift is large and you have the compute. Reach for IA³ when you must minimise adapter size for serving.
Supervised fine-tuning trains the base model on instruction/response pairs with a chat template applied. The loss is computed on the response tokens only — loss masking on the prompt side avoids learning to predict the user.
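A minimal sketch of response-only loss masking, assuming a Hugging Face-style tokenizer and the usual convention that label −100 is ignored by the cross-entropy loss (the helper name and field layout are illustrative):

```python
import torch

def build_sft_example(tokenizer, prompt: str, response: str, ignore_index: int = -100):
    """Tokenise one instruction/response pair and mask the prompt tokens out of the loss."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    # Mask the prompt: the model is never trained to predict the user's turn.
    labels = [ignore_index] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]

    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.ones(len(input_ids), dtype=torch.long),
    }
```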
Axolotl, TRL (Hugging Face), Unsloth (memory-optimised), torchtune, NeMo Framework. All implement the same fundamentals; they differ on ergonomics, speed, and which methods they expose first.
Hu et al. (2021): instead of updating the full weight matrix W, freeze it and learn a low-rank update. The forward pass becomes h = W₀x + BAx, where W₀ is the frozen pretrained weight, A is r×d and B is d×r, with r << d.
For d=4096 and r=8: full update has 16.8M params, LoRA update has 65k — a 256× reduction. The base weights are untouched, so multiple LoRA adapters can be swapped on the same base at inference time.
Initialisation: A is Gaussian, B is zero. At training start, BAx = 0, so the model starts identical to the base. Training shifts the residual.
| Hyperparameter | Typical value | What it controls |
|---|---|---|
| r (rank) | 4–64 | Capacity of the adapter. Higher r = more expressive, more VRAM. r=8 is a strong default. |
| alpha | 16–32 | Scaling factor. Effective contribution = (alpha/r) · BAx. Convention: alpha = 2r. |
| target_modules | q_proj, v_proj (typical) | Which weight matrices get adapters. Q+V is the LoRA-paper recommendation; adding K, O, FFN raises capacity and cost. |
| dropout | 0.05–0.1 | Regularisation on the adapter activation. Higher for small datasets. |
| bias | none / lora_only / all | Whether to also train the bias terms; usually 'none'. |
Setting alpha=r looks neutral but actually halves the contribution because alpha/r=1 instead of the more typical 2. Worth checking when porting configs across frameworks.
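A minimal LoRA linear layer in plain PyTorch — a sketch rather than the peft implementation — showing the Gaussian/zero initialisation and the alpha/r scaling from the table above:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze W0 (and its bias)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init -> BAx = 0 at start
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) * B A x
        return self.base(x) + self.dropout(x) @ self.A.T @ self.B.T * self.scaling
```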
For full coverage of variants and the math, see FT_02_LoRA_and_PEFT_Variants.
Dettmers et al. (2023). Push the base weights to 4-bit while keeping the adapter in BF16. Three pieces: the NF4 (4-bit NormalFloat) data type for the frozen base weights, double quantisation of the quantisation constants, and paged optimisers to absorb memory spikes during training.
QLoRA fine-tuned a 65B Llama on a single 48 GB GPU. The same arithmetic scales across sizes: 7B fits in 8 GB, 13B in 16 GB, 70B in 48–80 GB.
Forward pass: dequantise NF4 weights to BF16 on the fly per layer, then standard LoRA forward. Backward pass updates only the adapter; the NF4 weights are never updated.
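The standard bitsandbytes + peft recipe for this, as a sketch (model id and hyperparameters are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# NF4 base weights with double quantisation; compute in BF16 during the forward pass.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# BF16 LoRA adapter on top of the frozen NF4 base.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```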
Liu et al. (2024). Decomposes the pretrained weight into a magnitude vector and a direction matrix: LoRA learns the direction update, while a separate learned scalar per output channel adjusts the magnitude. Empirically closes much of the gap to full fine-tuning at LoRA-like cost. Selected as an ICML 2024 oral.
Liu et al. (2022). Three small learned vectors per layer: scale K, scale V, scale FFN intermediate. ~10× fewer parameters than LoRA at similar performance on a range of tasks, but less expressive on harder problems.
Choose DoRA for tasks where LoRA underperforms relative to full FT and you have the parameter budget for a small magnitude vector per channel. Choose IA³ when adapter size is the binding constraint — serving thousands of customer adapters, or where storage/transfer of adapters dominates cost. A config sketch for both follows.
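A sketch of both in peft (model id and module names assume a Llama-style architecture; use_dora requires a recent peft version):

```python
from transformers import AutoModelForCausalLM
from peft import IA3Config, LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative

# DoRA in peft is a flag on LoraConfig.
dora_config = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    use_dora=True, task_type="CAUSAL_LM",
)

# IA³ learns three small scaling vectors per layer: K, V, and the FFN intermediate.
ia3_config = IA3Config(
    target_modules=["k_proj", "v_proj", "down_proj"],   # assumed Llama-style module names
    feedforward_modules=["down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, ia3_config)   # or dora_config
```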
Reinforcement Learning from Human Feedback. Three models live concurrently in memory: the policy (the SFT model being optimised), a frozen reference (initial SFT model), and the reward model.
The KL penalty is what anchors the policy: without it, the policy gradient pushes the model towards extreme outputs that maximise reward but no longer resemble fluent text. With the penalty term β · KL(π_policy ‖ π_ref) in the objective, the policy stays close to the reference distribution. Tuning β is the central RLHF hyperparameter.
Three model copies plus PPO actor-critic state: RLHF is memory-heavy. A 7B model at full precision needs ~80 GB; QLoRA RLHF (e.g. via TRL's PPOTrainer) brings it into single-GPU range.
Rafailov et al. (2023). Direct Preference Optimisation collapses the RLHF pipeline into a single supervised loss on preference pairs — no separate reward model, no PPO, no KL penalty as a separate term.
DPO derives an analytical solution: the optimal policy under a Bradley-Terry reward model has a closed-form relationship between the policy log-probabilities and an implicit reward. The DPO loss directly optimises this without learning the reward function explicitly.
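A sketch of that loss on a batch of preference pairs in plain PyTorch, assuming response log-probabilities have already been summed per sequence (it mirrors the loss in the DPO paper, not TRL's exact implementation):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO loss: -log sigmoid(beta * margin of implicit rewards).

    Each argument is the summed log-probability of the chosen / rejected response
    under the policy or the frozen reference model (tensors of shape [batch]).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)        # implicit reward, chosen
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)  # implicit reward, rejected
    # Push the chosen implicit reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```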
| Method | What it adds over DPO |
|---|---|
| IPO | Fixes preference saturation: when one option is overwhelmingly preferred, DPO over-optimises; IPO regularises. |
| KTO | Kahneman-Tversky loss. Works on unpaired data — just labels of "good" or "bad", no pairwise comparison needed. |
| ORPO | Reference-free. No KL term, no separate ref model copy. SFT and preference combined. |
| GRPO | Group-relative policy optimisation. Used in DeepSeek-R1 reasoning training. Compares groups of samples rather than pairs. |
For depth see FT_04_DPO_and_Cousins.
Bai et al. (2022). Anthropic's response to the human-labelling bottleneck. Replace human preference labels with AI-generated labels following a written constitution.
The constitution is the lever: a list of principles ("be helpful, be harmless, be honest, refuse to assist with X categories of request"). Editing the constitution updates the entire pipeline without re-collecting human labels.
RLAIF generalises CAI: any AI labeller, not necessarily one following a constitution. Often a stronger external model (e.g. Claude or GPT-4) labels preferences for a smaller open model. Cuts labelling cost ~10× vs human at comparable quality on many tasks.
Per parameter, BF16 SFT with AdamW: 2 (weights) + 4 (FP32 gradients) + 8 (optimiser state, two FP32 moments) = 14 bytes, plus activations. Add an FP32 master copy of the weights and you reach the ~16 bytes/param quoted in the method table above.
For a 7B model: 7B × 14 = 98 GB just for parameters/grads/optimiser. Activations on top. Single GPU: not feasible.
Base model frozen in BF16 (2 bytes/param) + adapter in BF16 with optimiser state. For r=8, target Q+V on a 7B (32 layers, d=4096):
# Per-layer Q+V LoRA params:
# 2 matrices (Q, V) x 2 sub-matrices (A, B) x d x r
params_per_layer = 2 * 2 * 4096 * 8               # = 131,072
total_layers = 32
adapter_params = total_layers * params_per_layer  # ≈ 4.2M
Adapter VRAM is dwarfed by the base. Total: ~14 GB for 7B LoRA in BF16 — fits on RTX 4000 Ada (20 GB) with room for activations.
NF4 base = 0.5 bytes/param + adapter as above. For 7B: ~5 GB for the model. Even on RTX 3080 (10 GB) feasible with gradient checkpointing.
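The back-of-envelope numbers above collapse into a small estimator; the bytes-per-parameter figures are the rough conventions used in this section, not measurements:

```python
def estimate_train_vram_gb(n_params: float, method: str) -> float:
    """Rough training VRAM for weights/grads/optimiser only (activations excluded)."""
    bytes_per_param = {
        "full_bf16": 14.0,  # 2 weights + 4 FP32 grads + 8 Adam state (see above)
        "lora_bf16": 2.0,   # frozen BF16 base; adapter + its optimiser state are negligible
        "qlora_nf4": 0.6,   # NF4 base + quantisation constants; adapter negligible
    }[method]
    return n_params * bytes_per_param / 1e9

for method in ("full_bf16", "lora_bf16", "qlora_nf4"):
    print(method, round(estimate_train_vram_gb(7e9, method), 1), "GB")
# full_bf16 ≈ 98 GB, lora_bf16 ≈ 14 GB, qlora_nf4 ≈ 4.2 GB
# (the QLoRA figure lands nearer ~5 GB once non-quantised layers are counted)
```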
One base model, many adapters served concurrently. The serving question: do you merge each adapter into the base before serving, or keep them separate?
Run peft's merge_and_unload() → one specialised model per adapter. Same inference speed as the base. One full copy of weights per adapter, so storage and memory cost scale with the number of adapters.
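The merge path, sketched with peft (model id and paths are illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative id
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")           # illustrative path

# Fold B·A into W0: the result is a plain model with base-model inference speed.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged-model")
```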
Adapter applied at runtime per request. One copy of base weights, many small adapters. Frameworks: S-LoRA, Punica. Heterogeneous batching: different requests use different adapters in the same batch.
S-LoRA (Sheng et al. 2023) demonstrates serving thousands of LoRA adapters on a single GPU with negligible overhead vs the base. Key tricks: custom CUDA kernels for heterogeneous batching of variable-rank adapters, plus unified paging of adapter weights alongside the KV cache.
TensorRT-LLM and vLLM both support multi-LoRA inference; the cert may probe the trade-off (merge for predictable latency, multi-LoRA for memory and tenant flexibility).
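A multi-LoRA serving sketch with vLLM (adapter name, id, and paths are illustrative; the exact LoRA options depend on the installed vLLM version):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One copy of the base weights; adapters are attached per request.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=8)

params = SamplingParams(max_tokens=128)
out = llm.generate(
    "Summarise this support ticket: ...",
    params,
    # (name, integer id, local path) — requests in the same batch may use different adapters
    lora_request=LoRARequest("tenant-a", 1, "/adapters/tenant-a"),
)
```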
| Tool | Strengths | Best for |
|---|---|---|
| Axolotl | YAML configs, broad method coverage, HF-native | Reproducible runs across many methods |
| TRL (HF) | Trainer classes for SFT, DPO, PPO; integrates with HF stack | Pipeline-by-pipeline experimentation |
| Unsloth | Memory-optimised; fastest single-GPU SFT/LoRA | Resource-constrained training, prototyping |
| torchtune | PyTorch-native, recipe-driven, official Meta tool | Stay closer to PyTorch idioms |
| NeMo Framework / NeMo RL | NVIDIA's training stack, multi-node, BF16/FP8 | Multi-node clusters, NVIDIA-native deployments |
For Brendan's hardware (single RTX 3080 or RTX 4000 Ada), Unsloth or TRL are the best entry points. For the exam, recognise that NVIDIA's first-party tool is now NeMo RL (renamed from NeMo Aligner).