PEFT and Fine-Tuning — NVIDIA GenAI Cert Prep

00

Topics in This Deck

A cert-focused tour of the post-training stack — supervised fine-tuning, parameter-efficient adapters, preference optimisation, and how each method trades VRAM for capability.

Cert Framing — Exam Domains
The Post-Training Stack
Full FT vs PEFT — Decision Tree
SFT Essentials
LoRA — The Math
LoRA Hyperparameters
QLoRA — NF4 and Paged Optimisers
DoRA and IA³
RLHF — Bradley-Terry, PPO, KL
DPO and Cousins
Constitutional AI and RLAIF
VRAM Budgeting — Worked Example
Multi-LoRA Serving
Tooling Landscape
Likely Exam Angles
Cross-References and Further Reading

01

Cert Framing — Exam Domains

NCA-GENL Associate

Experimentation

22%

Method-selection scenarios — given a problem (small labelled dataset, limited compute, latency targets), pick fine-tune vs RAG vs prompt and justify. PEFT covered conceptually.

NCP-GENL Professional

Fine-Tuning

13%

Numerical depth — LoRA parameter count from rank/target_modules, QLoRA quantisation impact, RLHF three-model VRAM, DPO derivation, multi-LoRA serving.

Cross-domain bleed

Fine-tuning decisions interact with Data Preparation (9%) on dataset construction, with Safety (5%) on refusal training, and with Model Optimisation (17%) on adapter merging vs multi-LoRA serving choices.

02

The Post-Training Stack

A pretrained base model predicts next tokens; it does not produce helpful or safe outputs. Post-training shapes the model. Modern stacks compose four stages in sequence:

Not every model goes through all four. A domain-specialist may do SFT only. A safety-critical assistant adds RLHF then CAI. Each stage is independently optional but the order matters: preference tuning needs an SFT-style policy to start from.

03

Full FT vs PEFT — Decision Tree

Method	VRAM (rough)	When to choose
Full SFT (BF16)	~16 bytes/param	Large dataset, <1B model on consumer GPU, or you have multi-GPU and need maximum capability shift
LoRA (BF16 base)	~3 bytes/param + adapters	Default for most tasks. Adapter switching, multi-tenant serving, predictable VRAM.
QLoRA (NF4 base)	~0.6 bytes/param + adapters	Largest model that fits. RTX 4000 Ada (20 GB) hosts 13B comfortably, 30B with activations checkpointing.
DoRA	Like LoRA + small overhead	When LoRA underperforms and you want closer to full-FT quality without the memory.
IA³	Tiny (3 vectors per layer)	Strong inductive bias, very small adapter, less expressive than LoRA on hard tasks.

Rule of thumb

Default to LoRA. Fall back to QLoRA when memory binds. Reach for DoRA when LoRA accuracy disappoints. Reach for full FT when capability shift is large and you have the compute. Reach for IA³ when you must minimise adapter size for serving.

04

SFT Essentials

Supervised fine-tuning trains the base model on instruction/response pairs with a chat template applied. The loss is computed on the response tokens only — loss masking on the prompt side avoids learning to predict the user.

The hyperparameters that matter

Learning rate — typically 1e-5 to 5e-5 for full FT, 1e-4 to 3e-4 for LoRA. Too high collapses the base, too low under-trains.
Batch size and gradient accumulation — effective batch 32-128 typical; accumulation lets a 20 GB GPU emulate large batches.
Sample packing — concatenate variable-length examples up to max sequence length; cuts wasted padding by 30-60% on instruction data.
Chat template — must match the base model's training template (Llama-3, Qwen2.5, ChatML); mismatched template silently degrades performance.

Tooling

Axolotl, TRL (Hugging Face), Unsloth (memory-optimised), torchtune, NeMo Framework. All implement the same fundamentals; differ on ergonomics, speed, and which methods they expose first.

05

LoRA — The Math

Hu et al. (2021): instead of updating the full weight matrix W, freeze it and learn a low-rank update. The forward pass becomes h = W₀x + BAx, where W₀ is the frozen pretrained weight, A is d×r and B is r×d, with r << d.

For d=4096 and r=8: full update has 16.8M params, LoRA update has 65k — a 256× reduction. The base weights are untouched, so multiple LoRA adapters can be swapped on the same base at inference time.

Initialisation: A is Gaussian, B is zero. At training start, BAx = 0, so the model starts identical to the base. Training shifts the residual.

06

LoRA Hyperparameters

Hyperparameter	Typical value	What it controls
`r` (rank)	4–64	Capacity of the adapter. Higher r = more expressive, more VRAM. r=8 is a strong default.
`alpha`	16–32	Scaling factor. Effective contribution = (alpha/r) · BAx. Convention: alpha = 2r.
`target_modules`	q_proj, v_proj (typical)	Which weight matrices get adapters. Q+V is the LoRA-paper recommendation; adding K, O, FFN raises capacity and cost.
`dropout`	0.05–0.1	Regularisation on the adapter activation. Higher for small datasets.
`bias`	none / lora_only / all	Whether to also train the bias terms; usually 'none'.

Common mistake

Setting alpha=r looks neutral but actually halves the contribution because alpha/r=1 instead of the more typical 2. Worth checking when porting configs across frameworks.

For full coverage of variants and the math, see FT_02_LoRA_and_PEFT_Variants.

07

QLoRA — NF4 and Paged Optimisers

Dettmers et al. (2023). Push the base weights to 4-bit while keeping the adapter in BF16. Three pieces:

NF4 (NormalFloat 4-bit). A 4-bit float format optimised for normally-distributed weights — 16 levels chosen to be the optimal quantisation points for a normal distribution. Better than naive INT4 for transformer weights.
Double quantisation. The per-block scaling factors are themselves quantised, saving an additional ~0.4 bits/param.
Paged optimisers. Optimiser state pages between GPU and CPU memory on the fly. Prevents OOM during gradient peaks.

Headline result

QLoRA fine-tuned 65B Llama on a single 48 GB GPU. The same approach scales down: 7B fits in 8 GB, 13B in 16 GB, 70B in 48-80 GB.

Forward pass: dequantise NF4 weights to BF16 on the fly per layer, then standard LoRA forward. Backward pass updates only the adapter; the NF4 weights are never updated.

08

DoRA and IA³

DoRA — Weight-Decomposed LoRA

Liu et al. (2024). Decomposes the weight update into magnitude and direction. LoRA learns the direction; a separate scalar per output channel learns the magnitude. Empirically closes much of the gap to full fine-tuning at LoRA-like cost. Selected ICML 2024 Oral.

IA³ — Infused Adapter by Inhibiting and Amplifying

Liu et al. (2022). Three small learned vectors per layer: scale K, scale V, scale FFN intermediate. ~10× fewer parameters than LoRA at similar performance on a range of tasks, but less expressive on harder problems.

When DoRA wins

Tasks where LoRA underperforms relative to full FT and you have the parameter budget for a small magnitude vector per channel.

When IA³ wins

Adapter size is the binding constraint — serving thousands of customer adapters, or where storage/transfer of adapters dominates cost.

09

RLHF — Bradley-Terry, PPO, KL

Reinforcement Learning from Human Feedback. Three models live concurrently in memory: the policy (the SFT model being optimised), a frozen reference (initial SFT model), and the reward model.

Pipeline

Reward model training. Take a dataset of preferred/rejected response pairs. Train a reward head on top of the SFT model under the Bradley-Terry assumption: P(y₁ preferred over y₂) = σ(r(y₁) - r(y₂)).
Policy optimisation with PPO. Sample responses from the policy, score with the reward model, update the policy to maximise reward, with a KL penalty against the reference to prevent reward hacking.

The KL penalty

Without it, the policy gradient pushes the model towards extreme outputs that maximise reward but no longer resemble fluent text. With KL = β · KL(π‒policy ‖ π‒ref), the policy stays close to the reference distribution. Tuning β is the central RLHF hyperparameter.

VRAM note

Three model copies plus PPO actor-critic state: RLHF is memory-heavy. A 7B model at full precision needs ~80 GB; QLoRA RLHF (e.g. via TRL's PPOTrainer) brings it into single-GPU range.

10

DPO and Cousins

Rafailov et al. (2023). Direct Preference Optimisation collapses the RLHF pipeline into a single supervised loss on preference pairs — no separate reward model, no PPO, no KL penalty as a separate term.

The implicit reward

DPO derives an analytical solution: the optimal policy under a Bradley-Terry reward model has a closed-form relationship between the policy log-probabilities and an implicit reward. The DPO loss directly optimises this without learning the reward function explicitly.

Method	What it adds over DPO
IPO	Fixes preference saturation: when one option is overwhelmingly preferred, DPO over-optimises; IPO regularises.
KTO	Kahneman-Tversky loss. Works on unpaired data — just labels of "good" or "bad", no pairwise comparison needed.
ORPO	Reference-free. No KL term, no separate ref model copy. SFT and preference combined.
GRPO	Group-relative policy optimisation. Used in DeepSeek-R1 reasoning training. Compares groups of samples rather than pairs.

For depth see FT_04_DPO_and_Cousins.

11

Constitutional AI and RLAIF

Bai et al. (2022). Anthropic's response to the human-labelling bottleneck. Replace human preference labels with AI-generated labels following a written constitution.

Two-stage pipeline

SL-CAI (supervised). Generate responses, ask the model to critique them against the constitution, ask it to revise. Train on the revisions.
RL-CAI (reinforcement). Generate response pairs, ask the model to pick which better follows the constitution, train a preference model on those AI labels, run RLHF/DPO with the AI-labelled preferences.

The constitution is the lever: a list of principles ("be helpful, be harmless, be honest, refuse to assist with X categories of request"). Editing the constitution updates the entire pipeline without re-collecting human labels.

RLAIF

Generalises CAI: any AI labeller, not necessarily one following a constitution. Often a stronger external model (e.g. Claude or GPT-4 labelling for a smaller open model). Cuts labelling cost ~10× vs human at comparable quality on many tasks.

12

VRAM Budgeting — Worked Example

Full SFT memory

Per parameter, BF16 SFT with AdamW: 2 (weights) + 4 (gradient FP32) + 8 (optimiser state, two FP32 momenta) = 14 bytes, plus activations.

For a 7B model: 7B × 14 = 98 GB just for parameters/grads/optimiser. Activations on top. Single GPU: not feasible.

LoRA memory

Base model frozen in BF16 (2 bytes/param) + adapter in BF16 with optimiser state. For r=8, target Q+V on a 7B (32 layers, d=4096):

LoRA adapter param count

# Per-layer Q+V LoRA params:
# 2 matrices (Q, V) x 2 sub-matrices (A, B) x d * r
params_per_layer = 2 * 2 * 4096 * 8  # = 131072
total_layers = 32
adapter_params = 32 * 131072  # = ~4.2M

Adapter VRAM is dwarfed by the base. Total: ~14 GB for 7B LoRA in BF16 — fits on RTX 4000 Ada (20 GB) with room for activations.

QLoRA memory

NF4 base = 0.5 bytes/param + adapter as above. For 7B: ~5 GB for the model. Even on RTX 3080 (10 GB) feasible with gradient checkpointing.

13

Multi-LoRA Serving

One base model, many adapters served concurrently. The serving question: do you merge each adapter into the base before serving, or keep them separate?

Merged adapter

Run peft.merge_and_unload → one specialised model. Same inference speed as the base. One copy of weights per adapter — cost scales with N.

Multi-LoRA inference

Adapter applied at runtime per request. One copy of base weights, many small adapters. Frameworks: S-LoRA, Punica. Heterogeneous batching: different requests use different adapters in the same batch.

S-LoRA (Sheng et al. 2023) demonstrates serving thousands of LoRA adapters on a single GPU with negligible overhead vs the base. Key trick: a custom CUDA kernel (Multi-rank Block-Group Multiplication) that handles variable-rank adapters in one batch.

TensorRT-LLM and vLLM both support multi-LoRA inference; the cert may probe the trade-off (merge for predictable latency, multi-LoRA for memory and tenant flexibility).

14

Tooling Landscape

Tool	Strengths	Best for
Axolotl	YAML configs, broad method coverage, HF-native	Reproducible runs across many methods
TRL (HF)	Trainer classes for SFT, DPO, PPO; integrates with HF stack	Pipeline-by-pipeline experimentation
Unsloth	Memory-optimised; fastest single-GPU SFT/LoRA	Resource-constrained training, prototyping
torchtune	PyTorch-native, recipe-driven, official Meta tool	Stay closer to PyTorch idioms
NeMo Framework / NeMo RL	NVIDIA's training stack, multi-node, BF16/FP8	Multi-node clusters, NVIDIA-native deployments

For Brendan's hardware (single RTX 3080 or RTX 4000 Ada), Unsloth or TRL are the best entry points. For exam, recognise that NVIDIA's first-party tool is now NeMo RL (renamed from NeMo Aligner).

15

Likely Exam Angles

VRAM math. Given a model size and a method (full FT / LoRA / QLoRA), pick the smallest GPU that fits. The 14-bytes-per-param formula is essential.
LoRA hyperparameter trade-offs. What does increasing r do; what does the alpha/r ratio control; which target_modules give the best capacity-cost ratio.
QLoRA mechanism. What does NF4 quantise (base, not adapter); double quantisation; why paged optimisers help.
RLHF vs DPO. Which model copies are needed in each; what's eliminated by going to DPO; reward hacking failure mode.
Constitutional AI. Two-stage pipeline; the role of the constitution; CAI vs RLAIF distinction.
Multi-LoRA serving. Merge vs runtime adapter; when each wins; the S-LoRA contribution.

16

Cross-References and Further Reading

Portfolio repos (depth treatment)

FT_01_SFT_Pipeline — chat templates, loss masking, sample packing, VRAM budgeting.
FT_02_LoRA_and_PEFT_Variants — LoRA math, QLoRA, DoRA, IA³, multi-LoRA serving.
FT_03_RLHF_and_PPO — Bradley-Terry reward modelling, PPO, KL, reward hacking.
FT_04_DPO_and_Cousins — DPO derivation, IPO, KTO, ORPO, GRPO.
FT_05_Constitutional_AI_and_RLAIF — CAI two-stage pipeline, RLAIF.
LLM_Hub_Fine_Tuning — hub overview.

Cert-prep repo resources

notes/06_fine_tuning_and_peft.md — the notes this deck is built from.
exercises/03_lora_finetune_minimal/ — hands-on LoRA on Qwen2.5-0.5B.

Primary literature

Hu et al. (2021), LoRA: arxiv.org/abs/2106.09685
Dettmers et al. (2023), QLoRA: arxiv.org/abs/2305.14314
Liu et al. (2024), DoRA: arxiv.org/abs/2402.09353
Rafailov et al. (2023), DPO: arxiv.org/abs/2305.18290
Bai et al. (2022), Constitutional AI: arxiv.org/abs/2212.08073