Fine-tuning accounts for 13% of the NCP-GENL Professional exam and is a component of the 22% Experimentation domain on the NCA-GENL Associate exam. Questions are concrete: VRAM budgets, rank/alpha interaction, which components DPO eliminates, what LoRA actually does to the weight matrices. The portfolio repos below carry the full technical treatment; this file provides the cert-calibrated synthesis, with hardware constraints specific to the RTX 3080 (10 GB) and RTX 4000 Ada (20 GB) used in this portfolio.
Pretraining produces a model capable of predicting text; the post-training pipeline turns it into one that follows instructions, stays on-topic, and behaves safely: supervised fine-tuning (SFT) on instruction data, then preference tuning (RLHF or DPO) on human preference pairs.
Each stage builds on the previous. SFT-only models follow instructions adequately; adding preference tuning substantially improves helpfulness and safety. Alignment detail is in notes/04_alignment_and_trustworthy_ai.md; RLHF/PPO depth is in FT_03_RLHF_and_PPO; DPO depth is in FT_04_DPO_and_Cousins.
Instruction datasets. SFT requires (prompt, completion) pairs in a consistent chat template — typically ChatML or Llama-3’s <|begin_of_text|> format. Dataset quality dominates quantity: a few thousand high-quality examples outperform millions of low-quality ones. Common sources include Alpaca, OpenHermes, SlimOrca, and purpose-built domain datasets.
Chat templates. Modern models use structured templates to delineate system, user, and assistant turns. The tokeniser applies the template; training must use the same template the model was pretrained or instruct-tuned with, or the model will not learn the correct format.
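In practice the tokeniser's `apply_chat_template` method does this rendering; as an illustrative sketch of the layout ChatML produces (the `apply_chatml` helper here is ours, not a library API):

```python
# Illustrative ChatML rendering: each turn is wrapped in
# <|im_start|>role ... <|im_end|> markers. Helper name is hypothetical.

def apply_chatml(messages: list[dict]) -> str:
    """Render a list of {"role", "content"} turns as a ChatML string."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    return "".join(out)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does LoRA do?"},
]
print(apply_chatml(messages))
```

Training with a template other than the one baked into the tokeniser produces exactly the mismatch the paragraph above warns about.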
Loss masking. During SFT, cross-entropy loss is computed only on the assistant turns, not on the prompt. Without masking, the model wastes capacity fitting the user-input distribution it should already handle.
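A minimal sketch of the masking step, assuming PyTorch's cross-entropy `ignore_index` convention of `-100` (token ids are toy values):

```python
# SFT loss masking: label positions covering the prompt are set to -100,
# the ignore_index used by PyTorch cross-entropy, so loss is computed
# only on the assistant tokens.

IGNORE_INDEX = -100

def build_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids, masking the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

input_ids = [101, 7592, 2088, 2003, 1037, 3231, 102]  # prompt + completion
labels = build_labels(input_ids, prompt_len=3)
print(labels)  # first three positions masked
```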
Sample packing. Multiple short examples are concatenated into a single sequence of length equal to the model’s maximum context. This eliminates padding waste and can double effective training throughput. Attention masks must prevent cross-example attention.
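The boundary constraint can be sketched as a block-diagonal causal mask built from per-example lengths (pure-Python illustration; real trainers build this on-device):

```python
# Packing sketch: examples are concatenated into one sequence, and the
# attention mask is block-diagonal + causal so tokens never attend
# across example boundaries.

def packed_attention_mask(lengths: list[int]) -> list[list[int]]:
    total = sum(lengths)
    mask = [[0] * total for _ in range(total)]
    start = 0
    for n in lengths:
        for i in range(start, start + n):
            for j in range(start, i + 1):  # causal within the example
                mask[i][j] = 1
        start += n
    return mask

mask = packed_attention_mask([2, 3])  # two short examples, one sequence
```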
VRAM for SFT. A rough estimate for full-precision (bf16) SFT:
\[\text{VRAM} \approx \underbrace{2M_p}_{\text{params (bf16)}} + \underbrace{2M_p}_{\text{gradients (bf16)}} + \underbrace{12M_p}_{\text{AdamW states (fp32)}} + \text{activations}\]where $M_p$ is the parameter count. The terms are: bf16 parameters (2 bytes), bf16 gradients (2 bytes), and AdamW optimiser state (an fp32 master copy of the weights plus two fp32 moments, 4 + 4 + 4 = 12 bytes). This gives approximately 16 bytes per parameter for full SFT with AdamW:
| Model size | Full SFT (AdamW, bf16) |
|---|---|
| 1B | ~16 GB |
| 3B | ~48 GB |
| 7B | ~112 GB |
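The 16-bytes-per-parameter rule behind the table can be sketched as a one-liner (activations excluded; function name is ours):

```python
# Back-of-envelope VRAM for full bf16 SFT with AdamW:
# ~16 bytes/param = 2 (params) + 2 (grads) + 12 (fp32 master + two moments).

def sft_vram_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Estimate optimiser-inclusive VRAM in GB, ignoring activations."""
    return n_params * bytes_per_param / 1e9

for billions in (1, 3, 7):
    print(f"{billions}B: ~{sft_vram_gb(billions * 1e9):.0f} GB")
```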
Full technical depth: FT_01_SFT_Pipeline.
Full-parameter fine-tuning updates every weight in the model. It achieves the best task-specific performance ceiling but requires VRAM for parameters + gradients + optimiser states, and produces a full model checkpoint for each task.
PEFT (Parameter-Efficient Fine-Tuning) freezes most of the model and trains only a small set of additional or modified parameters. The adapter is orders of magnitude smaller than the base model, can be stored and swapped cheaply, and dramatically reduces VRAM during training.
When to use full fine-tuning: small models (1B–3B) where VRAM is not the bottleneck; tasks where the full parameter expressiveness matters; when merging is not a priority. When to use PEFT: large models; multi-task scenarios with one base model and many adapters; resource-constrained hardware.
LoRA (Hu et al., 2021) freezes the pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ and injects a trainable low-rank decomposition alongside it:
\[h = W_0 x + \Delta W x = W_0 x + B A x\]where $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ with rank $r \ll \min(d, k)$. At initialisation, $A$ receives random Gaussian values and $B$ is zeroed, so $\Delta W = 0$ and training starts from the pretrained output.
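A minimal numerical sketch of this forward pass (NumPy, toy dimensions), verifying that the zero-init of $B$ makes training start at the pretrained output:

```python
import numpy as np

# LoRA forward pass: frozen W0 plus a low-rank update BA.
# B is zero-initialised, so Delta W = 0 at the start of training.

rng = np.random.default_rng(0)
d, k, r = 8, 8, 2                       # output dim, input dim, rank
W0 = rng.standard_normal((d, k))        # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01  # trainable, Gaussian init
B = np.zeros((d, r))                    # trainable, zero init

x = rng.standard_normal(k)
h = W0 @ x + B @ (A @ x)                # h = W0 x + B A x

assert np.allclose(h, W0 @ x)           # identical to pretrained output
```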
Key hyperparameters.
- Rank $r$: the dimension of the low-rank update; controls adapter capacity (typical values 8–64).
- Alpha $\alpha$: a scaling factor; the update is applied as $\frac{\alpha}{r} B A x$, so increasing $r$ without increasing $\alpha$ shrinks the effective update.
- Dropout: applied to the LoRA path for regularisation.
- Target modules: which weight matrices receive adapters (commonly the attention projections; extending to the MLP projections raises both quality and cost).
LoRA reduces trainable parameters in a $d \times d$ matrix by roughly $\frac{2rd}{d^2} = \frac{2r}{d}$ — for a 4096-dimensional model with $r = 16$, this is less than 1% of the original weight count. The original paper reports up to 10,000× fewer trainable parameters versus full fine-tuning of GPT-3.
QLoRA (Dettmers et al., 2023) applies LoRA training on top of a 4-bit quantised base model, reducing VRAM further: the frozen base weights are stored in 4-bit NormalFloat (NF4), the quantisation constants are themselves quantised (double quantisation), and paged optimisers absorb memory spikes.
QLoRA enables fine-tuning 65B parameter models on a single 48 GB GPU. On consumer hardware, it makes 7B–13B fine-tuning accessible on 20 GB VRAM.
DoRA (Liu et al., 2024, ICML Oral) decomposes the pretrained weight $W_0$ into magnitude ($m$) and direction ($V$) components and applies LoRA only to the directional component:
\[W = m \cdot \frac{V + \Delta V}{\|V + \Delta V\|_c}\]where $\Delta V = BA$ is the low-rank update and $\|\cdot\|_c$ is the column-wise norm. The magnitude vector $m$ is trained as a separate parameter. This separation mirrors how pretrained models change during full fine-tuning (direction changes dominate early in training; magnitude adjusts later) and consistently outperforms LoRA at equivalent rank on LLaMA, LLaVA, and VL-BART benchmarks, with no additional inference overhead after merging.
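A toy NumPy sketch of the decomposition (dimensions and initialisation are illustrative): with $B$ zero-initialised and $m$ set to the column norms of $W_0$, the adapted weight starts exactly at $W_0$.

```python
import numpy as np

# DoRA sketch: magnitude m (one scalar per column) times the normalised
# direction V + BA. At init, Delta V = 0 and m equals the column norms
# of W0, so the adapted weight reproduces W0 exactly.

rng = np.random.default_rng(1)
d, k, r = 6, 4, 2
W0 = rng.standard_normal((d, k))
m = np.linalg.norm(W0, axis=0)           # per-column magnitude, trainable
V = W0.copy()                            # frozen direction component
A = rng.standard_normal((r, k)) * 0.01
B = np.zeros((d, r))                     # zero init, so Delta V = 0

adapted = V + B @ A
W = m * adapted / np.linalg.norm(adapted, axis=0)

assert np.allclose(W, W0)                # identical to W0 at initialisation
```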
IA³ modifies activations rather than weight matrices by element-wise rescaling: it learns a small vector of scale factors applied to the keys, values, and feed-forward activations. Trainable parameter count is an order of magnitude smaller than LoRA ($\sim 0.01\%$ of base model parameters). Best suited to scenarios with very limited data or extremely tight VRAM budgets; typically underperforms LoRA on complex tasks.
Full technical depth: FT_02_LoRA_and_PEFT_Variants.
Merging. Because $W = W_0 + BA$ is exact, a LoRA adapter can be fused into the base model weights after training. The merged model is indistinguishable from a fully fine-tuned model at inference and requires no changes to the serving stack. Merging multiple adapters (linear or TIES merging) produces a single model that combines capabilities from multiple fine-tuning runs.
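A toy check that the fusion is exact (NumPy sketch with the $\frac{\alpha}{r}$ scaling; in the PEFT library this is what `merge_and_unload` performs):

```python
import numpy as np

# Adapter merging: because the update is additive, folding the scaled BA
# into W0 once after training reproduces the adapted forward pass exactly,
# with no extra matmul at inference time.

rng = np.random.default_rng(2)
d, k, r, alpha = 8, 8, 4, 8
W0 = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))
scale = alpha / r

W_merged = W0 + scale * (B @ A)          # one-time fusion

x = rng.standard_normal(k)
h_adapter = W0 @ x + scale * (B @ (A @ x))
h_merged = W_merged @ x
assert np.allclose(h_adapter, h_merged)  # identical outputs
```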
Multi-LoRA serving. When a single base model must serve many tasks simultaneously — each with its own LoRA adapter — loading and unloading adapters per request is impractical. S-LoRA and Punica are systems designed to batch requests across different LoRA adapters efficiently, keeping the base model weights resident on GPU and managing the adapter weights as a separate pool. This is the standard architecture for SaaS fine-tuning platforms. Coverage: FT_02_LoRA_and_PEFT_Variants.
RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preference pairs and then optimises the language model against that reward using PPO, with a KL penalty to the SFT reference model to prevent reward hacking.
DPO (Direct Preference Optimisation) eliminates the reward model and PPO by reparameterising the RLHF objective in closed form — the language model’s own log-probabilities serve as an implicit reward. DPO trains with a simple binary cross-entropy loss over preference pairs and is substantially more stable.
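For one preference pair the loss can be sketched in a few lines (pure Python; the log-probability values are toy numbers, and the function name is ours):

```python
import math

# DPO loss for a single preference pair. Inputs are summed log-probs of
# the chosen (y_w) and rejected (y_l) completions under the policy and
# the frozen SFT reference; beta scales the implicit reward.

def dpo_loss(pi_w: float, pi_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy prefers the chosen answer more than the reference does,
# the margin is positive and the loss drops below log(2).
loss = dpo_loss(pi_w=-10.0, pi_l=-12.0, ref_w=-11.0, ref_l=-11.0)
```

Note that evaluating the loss needs log-probs from both the policy and the reference model, which is why DPO budgets for two model copies in the VRAM table below.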
This file provides context and VRAM framing; see notes/04_alignment_and_trustworthy_ai.md for the alignment synthesis, FT_03_RLHF_and_PPO for the PPO actor-critic detail, and FT_04_DPO_and_Cousins for DPO, IPO, KTO, ORPO, and GRPO.
The hardware in this portfolio is the RTX 3080 (10 GB) and RTX 4000 Ada (20 GB). The following table gives realistic guidance; numbers assume bf16 activations, no tensor parallelism, and standard AdamW/paged-AdamW.
| Configuration | 10 GB (RTX 3080) | 20 GB (RTX 4000 Ada) |
|---|---|---|
| 1B full SFT | Tight (fits with small batch) | Comfortable |
| 3B full SFT | Not feasible | Tight with gradient checkpointing |
| 7B full SFT | Not feasible | Not feasible |
| 1B QLoRA | Comfortable | Comfortable |
| 3B QLoRA | Comfortable | Comfortable |
| 7B QLoRA | Tight (~9–10 GB with r=16) | Comfortable |
| 13B QLoRA | Not feasible (>12 GB) | Tight (~18–20 GB) |
| 7B LoRA (bf16 base) | Not feasible (base alone ~14 GB) | Not feasible |
| DPO on 7B QLoRA | Tight (two model copies) | Feasible |
Key rules of thumb.
- Full SFT with AdamW costs ~16 bytes per parameter before activations; 10–20 GB cards top out around 1B–3B.
- 4-bit quantisation shrinks the base weights to ~0.5 GB per billion parameters, which is what lets 7B QLoRA fit in ~10 GB.
- DPO keeps two models resident (policy + frozen reference), roughly doubling the weight budget.
- Gradient checkpointing trades compute for activation memory when a configuration is marked tight.
| Tool | Primary use | Notes |
|---|---|---|
| TRL (Hugging Face) | SFT, DPO, PPO, GRPO | Standard reference implementation; SFTTrainer, DPOTrainer |
| Axolotl | SFT and PEFT with config files | Wraps TRL; supports QLoRA, DoRA, sample packing, chat templates; well-maintained |
| Unsloth | Fast QLoRA / LoRA | Kernel-level optimisations for attention and MLP; claims 2–5× speedup; works on consumer GPUs |
| torchtune | PyTorch-native SFT and PEFT | Meta’s official fine-tuning library; minimal dependencies; good for understanding internals |
| NeMo Aligner | Full RLHF, DPO, SteerLM on NVIDIA stack | Requires NeMo ecosystem; targets multi-GPU / Slurm deployments; not suited to single-GPU |
| LLaMA-Factory | SFT, LoRA, QLoRA with web UI | Broad model and method support; good for rapid experimentation |
For single-GPU work on the RTX 3080 or RTX 4000 Ada, Axolotl or Unsloth are the practical defaults. TRL is the right choice when you need to understand or modify the training loop directly. NeMo Aligner is out of scope without a multi-GPU cluster.