NVIDIA GenAI Cert Prep — Presentation 03

PEFT and Fine-Tuning

Parameter-efficient fine-tuning, the post-training stack, and the VRAM trade-offs that decide which method fits which GPU. Cert-focused synthesis of LoRA, QLoRA, DoRA, RLHF, DPO, and Constitutional AI.

00

Topics in This Deck

A cert-focused tour of the post-training stack — supervised fine-tuning, parameter-efficient adapters, preference optimisation, and how each method trades VRAM for capability.

01

Cert Framing — Exam Domains

NCA-GENL Associate

Experimentation

22%

Method-selection scenarios — given a problem (small labelled dataset, limited compute, latency targets), pick fine-tune vs RAG vs prompt and justify. PEFT covered conceptually.

NCP-GENL Professional

Fine-Tuning

13%

Numerical depth — LoRA parameter count from rank/target_modules, QLoRA quantisation impact, RLHF three-model VRAM, DPO derivation, multi-LoRA serving.

Cross-domain bleed

Fine-tuning decisions interact with Data Preparation (9%) on dataset construction, with Safety (5%) on refusal training, and with Model Optimisation (17%) on adapter merging vs multi-LoRA serving choices.

02

The Post-Training Stack

A pretrained base model predicts next tokens; it does not produce helpful or safe outputs. Post-training shapes the model. Modern stacks compose four stages in sequence:

Pretrain (trillions of tokens) → SFT (instruction pairs) → RLHF or DPO (preference pairs) → CAI / RLAIF (self-critique)

Not every model goes through all four. A domain-specialist may do SFT only. A safety-critical assistant adds RLHF then CAI. Each stage is independently optional but the order matters: preference tuning needs an SFT-style policy to start from.

03

Full FT vs PEFT — Decision Tree

Method | VRAM (rough) | When to choose
Full SFT (BF16) | ~16 bytes/param | Large dataset, <1B model on consumer GPU, or you have multi-GPU and need maximum capability shift
LoRA (BF16 base) | ~3 bytes/param + adapters | Default for most tasks. Adapter switching, multi-tenant serving, predictable VRAM.
QLoRA (NF4 base) | ~0.6 bytes/param + adapters | Largest model that fits. RTX 4000 Ada (20 GB) hosts 13B comfortably, 30B with activation checkpointing.
DoRA | Like LoRA + small overhead | When LoRA underperforms and you want closer to full-FT quality without the memory.
IA³ | Tiny (3 vectors per layer) | Strong inductive bias, very small adapter, less expressive than LoRA on hard tasks.
Rule of thumb

Default to LoRA. Fall back to QLoRA when memory binds. Reach for DoRA when LoRA accuracy disappoints. Reach for full FT when capability shift is large and you have the compute. Reach for IA³ when you must minimise adapter size for serving.

04

SFT Essentials

Supervised fine-tuning trains the base model on instruction/response pairs with a chat template applied. The loss is computed on the response tokens only — loss masking on the prompt side avoids learning to predict the user.
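Loss masking is usually implemented by copying the input IDs into the labels and overwriting the prompt positions with the ignore index. A minimal sketch, assuming the Hugging Face convention that label -100 is excluded from the cross-entropy loss (the token IDs here are illustrative):

```python
IGNORE_INDEX = -100  # HF convention: positions labelled -100 contribute no loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids to labels, masking the prompt so loss covers only the response."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy sequence: 4 prompt tokens followed by 3 response tokens
labels = mask_prompt_labels([11, 22, 33, 44, 55, 66, 77], prompt_len=4)
# labels -> [-100, -100, -100, -100, 55, 66, 77]
```

Trainers such as TRL's SFTTrainer handle this masking internally when given a chat template; the sketch just shows what that preprocessing produces.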

The hyperparameters that matter

Tooling

Axolotl, TRL (Hugging Face), Unsloth (memory-optimised), torchtune, NeMo Framework. All implement the same fundamentals; differ on ergonomics, speed, and which methods they expose first.

05

LoRA — The Math

Hu et al. (2021): instead of updating the full weight matrix W, freeze it and learn a low-rank update. The forward pass becomes h = W₀x + BAx, where W₀ is the frozen pretrained weight, A is r×d and B is d×r, with r << d.

W₀ (d × d, frozen) + B (d × r) · A (r × d, trainable) — trainable params: 2·d·r vs full: d²; ratio: 2r/d

For d=4096 and r=8: full update has 16.8M params, LoRA update has 65k — a 256× reduction. The base weights are untouched, so multiple LoRA adapters can be swapped on the same base at inference time.

Initialisation: A is Gaussian, B is zero. At training start, BAx = 0, so the model starts identical to the base. Training shifts the residual.
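The zero-init property can be checked directly. A plain-Python sketch with toy dimensions (d=3, r=1; real models use d ~ 4096, r ~ 8) showing that with B zeroed the LoRA forward reproduces the base model exactly at step zero:

```python
def matvec(M, v):
    """Matrix-vector product over nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

d, r = 3, 1
W0 = [[1.0, 2.0, 0.5], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]  # frozen base (d x d)
A = [[0.3, -0.7, 0.2]]           # r x d, stands in for the Gaussian init
B = [[0.0], [0.0], [0.0]]        # d x r, zero init => B(Ax) == 0

x = [1.0, -1.0, 2.0]
h_base = matvec(W0, x)
h_lora = [hb + ba for hb, ba in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

assert h_lora == h_base  # identical to the base model at training start
```

Only after gradient updates move B away from zero does the adapter contribute a residual.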

06

LoRA Hyperparameters

Hyperparameter | Typical value | What it controls
r (rank) | 4–64 | Capacity of the adapter. Higher r = more expressive, more VRAM. r=8 is a strong default.
alpha | 16–32 | Scaling factor. Effective contribution = (alpha/r) · BAx. Convention: alpha = 2r.
target_modules | q_proj, v_proj (typical) | Which weight matrices get adapters. Q+V is the LoRA-paper recommendation; adding K, O, FFN raises capacity and cost.
dropout | 0.05–0.1 | Regularisation on the adapter activation. Higher for small datasets.
bias | none / lora_only / all | Whether to also train the bias terms; usually 'none'.
Common mistake

Setting alpha=r looks neutral but actually halves the contribution because alpha/r=1 instead of the more typical 2. Worth checking when porting configs across frameworks.
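The scaling convention is easy to sanity-check numerically. A toy helper (lora_scale is illustrative, not a peft API):

```python
def lora_scale(alpha, r):
    """Effective multiplier applied to the BAx contribution: alpha / r."""
    return alpha / r

assert lora_scale(16, 8) == 2.0  # alpha = 2r: the common convention
assert lora_scale(8, 8) == 1.0   # alpha = r: contribution halved vs the convention
```

When porting a config between frameworks, compare alpha/r rather than alpha alone; identical alpha values at different ranks produce different effective scales.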

For full coverage of variants and the math, see FT_02_LoRA_and_PEFT_Variants.

07

QLoRA — NF4 and Paged Optimisers

Dettmers et al. (2023). Push the base weights to 4-bit while keeping the adapter in BF16. Three pieces: NF4 (4-bit NormalFloat), a quantisation data type matched to the roughly normal distribution of pretrained weights; double quantisation, which quantises the per-block quantisation constants themselves; and paged optimisers, which spill optimiser state to CPU RAM during memory spikes.

Headline result

QLoRA fine-tuned 65B Llama on a single 48 GB GPU. The same approach scales down: 7B fits in 8 GB, 13B in 16 GB, 70B in 48-80 GB.

Forward pass: dequantise NF4 weights to BF16 on the fly per layer, then standard LoRA forward. Backward pass updates only the adapter; the NF4 weights are never updated.
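In the Hugging Face stack this setup is typically expressed as a quantisation config passed at model load time. A sketch, assuming the bitsandbytes integration in transformers (exact defaults vary by version):

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantise the quantisation constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantise to BF16 for the forward pass
)
# Used as: AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
# then wrap with a peft LoraConfig so only the BF16 adapters receive gradients.
```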

08

DoRA and IA³

DoRA — Weight-Decomposed LoRA

Liu et al. (2024). Decomposes the weight update into magnitude and direction. LoRA learns the direction; a separate scalar per output channel learns the magnitude. Empirically closes much of the gap to full fine-tuning at LoRA-like cost. Selected ICML 2024 Oral.

IA³ — Infused Adapter by Inhibiting and Amplifying Inner Activations

Liu et al. (2022). Three small learned vectors per layer: scale K, scale V, scale FFN intermediate. ~10× fewer parameters than LoRA at similar performance on a range of tasks, but less expressive on harder problems.

When DoRA wins

Tasks where LoRA underperforms relative to full FT and you have the parameter budget for a small magnitude vector per channel.

When IA³ wins

Adapter size is the binding constraint — serving thousands of customer adapters, or where storage/transfer of adapters dominates cost.

09

RLHF — Bradley-Terry, PPO, KL

Reinforcement Learning from Human Feedback. Three models live concurrently in memory: the policy (the SFT model being optimised), a frozen reference (initial SFT model), and the reward model.

Pipeline

  1. Reward model training. Take a dataset of preferred/rejected response pairs. Train a reward head on top of the SFT model under the Bradley-Terry assumption: P(y₁ preferred over y₂) = σ(r(y₁) - r(y₂)).
  2. Policy optimisation with PPO. Sample responses from the policy, score with the reward model, update the policy to maximise reward, with a KL penalty against the reference to prevent reward hacking.
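Step 1 reduces to a logistic loss on reward differences. A minimal sketch of the Bradley-Terry objective for one preference pair:

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the preferred response under Bradley-Terry:
    -log sigma(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# A reward model that already ranks correctly incurs small loss; reversed, large loss
assert bt_loss(2.0, -1.0) < bt_loss(-1.0, 2.0)
assert abs(bt_loss(0.0, 0.0) - math.log(2.0)) < 1e-12  # a tie costs exactly log 2
```

Training the reward head just minimises this loss averaged over the preference dataset.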

The KL penalty

Without it, the policy gradient pushes the model towards extreme outputs that maximise reward but no longer resemble fluent text. With penalty = β · KL(π_policy ‖ π_ref), the policy stays close to the reference distribution. Tuning β is the central RLHF hyperparameter.
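Per token, the penalty is commonly implemented from log-probabilities: the difference logp_policy − logp_ref is a per-token KL estimate, subtracted from a sequence-level reward usually credited to the final token. A sketch with made-up numbers:

```python
def shaped_rewards(logp_policy, logp_ref, reward, beta=0.1):
    """Per-token KL estimate (logp_policy - logp_ref), scaled by beta and
    subtracted; the reward-model score lands on the last token."""
    kl = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    shaped = [-beta * k for k in kl]
    shaped[-1] += reward
    return shaped

r = shaped_rewards([-1.0, -2.0, -0.5], [-1.2, -1.5, -0.5], reward=1.0, beta=0.1)
# Tokens where the policy drifts above the reference (kl > 0) are penalised
```

Larger β pulls the policy back harder; smaller β lets it chase reward further from the reference.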

VRAM note

Three model copies plus PPO actor-critic state: RLHF is memory-heavy. A 7B model at full precision needs ~80 GB; QLoRA RLHF (e.g. via TRL's PPOTrainer) brings it into single-GPU range.

10

DPO and Cousins

Rafailov et al. (2023). Direct Preference Optimisation collapses the RLHF pipeline into a single supervised loss on preference pairs — no separate reward model, no PPO, no KL penalty as a separate term.

The implicit reward

DPO derives an analytical solution: the optimal policy under a Bradley-Terry reward model has a closed-form relationship between the policy log-probabilities and an implicit reward. The DPO loss directly optimises this without learning the reward function explicitly.
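The resulting objective is just a logistic loss on implicit-reward margins. A minimal sketch, where each argument is the summed log-probability of a whole response under the policy or the frozen reference:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    for one (chosen, rejected) preference pair."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen response more strongly than the reference -> lower loss
assert dpo_loss(-10.0, -20.0, -15.0, -15.0) < dpo_loss(-20.0, -10.0, -15.0, -15.0)
```

No reward model is sampled or trained; β plays the same role as the RLHF KL coefficient, controlling how far the policy may drift from the reference.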

Method | What it adds over DPO
IPO | Fixes preference saturation: when one option is overwhelmingly preferred, DPO over-optimises; IPO regularises.
KTO | Kahneman-Tversky loss. Works on unpaired data — just labels of "good" or "bad", no pairwise comparison needed.
ORPO | Reference-free. No KL term, no separate ref-model copy. SFT and preference combined.
GRPO | Group-relative policy optimisation. Used in DeepSeek-R1 reasoning training. Compares groups of samples rather than pairs.

For depth see FT_04_DPO_and_Cousins.

11

Constitutional AI and RLAIF

Bai et al. (2022). Anthropic's response to the human-labelling bottleneck. Replace human preference labels with AI-generated labels following a written constitution.

Two-stage pipeline

  1. SL-CAI (supervised). Generate responses, ask the model to critique them against the constitution, ask it to revise. Train on the revisions.
  2. RL-CAI (reinforcement). Generate response pairs, ask the model to pick which better follows the constitution, train a preference model on those AI labels, run RLHF/DPO with the AI-labelled preferences.

The constitution is the lever: a list of principles ("be helpful, be harmless, be honest, refuse to assist with X categories of request"). Editing the constitution updates the entire pipeline without re-collecting human labels.

RLAIF

Generalises CAI: any AI labeller, not necessarily one following a constitution. Often a stronger external model (e.g. Claude or GPT-4 labelling for a smaller open model). Cuts labelling cost ~10× vs human at comparable quality on many tasks.

12

VRAM Budgeting — Worked Example

Full SFT memory

Per parameter, BF16 mixed-precision SFT with AdamW: 2 (BF16 weights) + 2 (BF16 gradients) + 4 (FP32 master weights) + 8 (optimiser state, two FP32 moments) = 16 bytes, plus activations.

For a 7B model: 7B × 16 = 112 GB just for parameters/grads/optimiser. Activations on top. Single GPU: not feasible.
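The arithmetic as a one-line helper (a toy function; the exact bytes-per-param figure depends on which of gradients and master weights are kept in FP32, so two common accountings are shown):

```python
def full_ft_gb(n_params, bytes_per_param):
    """Training footprint in decimal GB, activations excluded."""
    return n_params * bytes_per_param / 1e9

# 7B model under two common accountings:
assert full_ft_gb(7e9, 14) == 98.0   # FP32 grads, no separate master copy
assert full_ft_gb(7e9, 16) == 112.0  # BF16 grads + FP32 master weights
```

Either way, the figure is several multiples of any single consumer GPU.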

LoRA memory

Base model frozen in BF16 (2 bytes/param) + adapter in BF16 with optimiser state. For r=8, target Q+V on a 7B (32 layers, d=4096):

LoRA adapter param count
# Per-layer Q+V LoRA params:
# 2 matrices (Q, V) x 2 sub-matrices (A, B) x d x r
d, r = 4096, 8
params_per_layer = 2 * 2 * d * r                  # = 131,072
total_layers = 32
adapter_params = total_layers * params_per_layer  # = 4,194,304 (~4.2M)

Adapter VRAM is dwarfed by the base. Total: ~14 GB for 7B LoRA in BF16 — fits on RTX 4000 Ada (20 GB) with room for activations.

QLoRA memory

NF4 base ≈ 0.5 bytes/param (closer to 0.6 once quantisation constants are counted) + adapter as above. For 7B: ~5 GB for the model. Even on RTX 3080 (10 GB) feasible with gradient checkpointing.

13

Multi-LoRA Serving

One base model, many adapters served concurrently. The serving question: do you merge each adapter into the base before serving, or keep them separate?

Merged adapter

Call merge_and_unload() on the PeftModel → one specialised model. Same inference speed as the base. One copy of weights per adapter — cost scales with N.

Multi-LoRA inference

Adapter applied at runtime per request. One copy of base weights, many small adapters. Frameworks: S-LoRA, Punica. Heterogeneous batching: different requests use different adapters in the same batch.

S-LoRA (Sheng et al. 2023) demonstrates serving thousands of LoRA adapters on a single GPU with negligible overhead vs the base. Key tricks: unified paging of adapter weights alongside KV-cache blocks, plus custom CUDA kernels that batch matrix multiplications across adapters of different ranks in one pass.

TensorRT-LLM and vLLM both support multi-LoRA inference; the cert may probe the trade-off (merge for predictable latency, multi-LoRA for memory and tenant flexibility).
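The equivalence behind the merge option can be verified numerically: folding BA into the base weight gives the same outputs as applying the adapter at runtime. A plain-Python sketch with toy dimensions:

```python
def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

W0 = [[1.0, 0.5], [0.25, 2.0]]   # base weight (d x d), d = 2
B = [[0.1], [-0.3]]              # d x r, r = 1
A = [[0.4, 0.8]]                 # r x d
x = [1.0, 2.0]

# Option 1: merge once, then serve at base-model speed
BA = matmul(B, A)
W_merged = [[w + delta for w, delta in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]
h_merged = matvec(W_merged, x)

# Option 2: keep the adapter separate, apply it per request
h_runtime = [h + d_ for h, d_ in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

assert all(abs(a - b) < 1e-12 for a, b in zip(h_merged, h_runtime))
```

The outputs match; the trade-off is purely operational — one weight copy per adapter versus a shared base plus small per-request adapter work.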

14

Tooling Landscape

Tool | Strengths | Best for
Axolotl | YAML configs, broad method coverage, HF-native | Reproducible runs across many methods
TRL (HF) | Trainer classes for SFT, DPO, PPO; integrates with HF stack | Pipeline-by-pipeline experimentation
Unsloth | Memory-optimised; fastest single-GPU SFT/LoRA | Resource-constrained training, prototyping
torchtune | PyTorch-native, recipe-driven, official Meta tool | Staying closer to PyTorch idioms
NeMo Framework / NeMo RL | NVIDIA's training stack, multi-node, BF16/FP8 | Multi-node clusters, NVIDIA-native deployments

For Brendan's hardware (single RTX 3080 or RTX 4000 Ada), Unsloth or TRL are the best entry points. For the exam, recognise that NVIDIA's first-party tool is now NeMo RL (renamed from NeMo Aligner).

15

Likely Exam Angles

16

Cross-References and Further Reading

Portfolio repos (depth treatment)

Cert-prep repo resources

Primary literature