The decoder-only transformer underlies every major production LLM. This deck maps the architecture to cert exam domains, walks the token-to-logit pipeline, and flags the distractor traps the NCA and NCP papers use most often.
A cert-focused tour of the decoder-only transformer — from the token pipeline through to modern variants and the exact exam angles the NCA and NCP papers probe.
Core ML and AI Knowledge
30%
Largest single domain. Tests foundational understanding of how transformers work — attention mechanics, scaling, trade-offs. Questions are scenario-based: "which change would most reduce memory at long context?"
LLM Architecture
6%
Smaller domain but more precise. Tests exact mechanisms: the 1/√dₖ scaling rationale, pre-norm vs post-norm, GQA vs MQA memory savings, RoPE vs ALiBi mechanism difference.
The 30% Core ML domain is the single largest lever on your NCA score. Understanding transformer internals also builds the mental model needed for RAG (22%), fine-tuning (22%), and inference optimisation — the examiners assume the architecture is your baseline.
Each step is differentiable. During training the full forward pass runs in one shot; during generation the model runs one token at a time in autoregressive fashion, reusing the KV cache from prior steps.
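A minimal sketch of that generation-time loop, assuming a hypothetical `model` callable that returns logits plus an updated KV cache (real serving stacks wrap exactly this pattern):

```python
import torch

@torch.no_grad()
def greedy_decode(model, prompt_ids, max_new_tokens, eos_id):
    # model is a hypothetical callable: (token_ids, past_kv) -> (logits, past_kv)
    past_kv = None                    # K/V projections cached for all past positions
    ids = prompt_ids                  # (1, prompt_len)
    next_input = ids                  # first step processes the whole prompt in one shot
    for _ in range(max_new_tokens):
        logits, past_kv = model(next_input, past_kv)              # logits: (1, len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick from last position
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:
            break
        next_input = next_id          # only the new token is fed; the rest comes from the cache
    return ids
```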
Byte-Pair Encoding (BPE) is the dominant tokenisation algorithm in production LLMs (GPT series, LLaMA, Mistral). It builds a vocabulary of subword units by iteratively merging the most frequent adjacent pair of symbols (initially raw bytes) in the training corpus.
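A toy sketch of the training-side merge loop (word frequencies and byte-level pre-tokenisation are omitted; production tokenisers such as tiktoken or SentencePiece add many optimisations on top):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: words is a list of symbol tuples, e.g. [('l','o','w'), ('l','o','w','e','r')]."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:                                   # count every adjacent symbol pair
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                  # most frequent adjacent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:                                   # replace the pair with the merged symbol
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(w[i]); i += 1
            new_words.append(tuple(out))
        words = new_words
    return merges
```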
| Property | Value / Range | Why it matters |
|---|---|---|
| Vocabulary size | 32 000–128 256 | Larger vocab = shorter sequences; embedding matrix grows |
| Special tokens | <bos>, <eos>, <pad>, <unk> | Must match exactly at fine-tuning time |
| Tokeniser algorithm | BPE, SentencePiece, Unigram | Different tokenisers → different token counts for same text |
| Worst-case tokenisation | 1 char = 1 token | Code and languages with rare scripts inflate token counts |
The tokeniser is separate from the model weights. A mismatch between the tokeniser used during training and the one used at inference produces garbage — this is a common distractor in fine-tuning questions.
Token embeddings are a trainable matrix E ∈ ℝ^(V × d_model). Each row is the dense vector representation of one vocabulary token. The forward pass is a simple lookup: given token index i, the embedding is the i-th row of E.
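In code the lookup really is row indexing; a minimal PyTorch sketch (V = 32 000 and d_model = 4096 are assumed values):

```python
import torch

V, d_model = 32_000, 4096
E = torch.nn.Embedding(V, d_model)           # trainable matrix E in R^(V x d_model)

token_ids = torch.tensor([[17, 3022, 911]])  # (batch=1, seq=3) indices from the tokeniser
x = E(token_ids)                             # (1, 3, 4096): row i of E for each index i
assert torch.equal(x[0, 0], E.weight[17])    # the "lookup" is just row selection
```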
Attention is permutation-equivariant — shuffle the input and the output shuffles identically. Positional information must be injected explicitly. Three dominant strategies:
| Method | Where applied | Mechanism | Length extrapolation | Used in |
|---|---|---|---|---|
| Absolute sinusoidal | Added to embeddings | Fixed sin/cos waves, no learned params | Poor in practice | Original Transformer |
| Learned absolute | Added to embeddings | Trainable pos. embedding matrix | None beyond max length | GPT-2, BERT |
| RoPE | Rotates Q and K vectors | Relative position via complex rotation; dot product depends only on i−j | Good (YaRN/ABF) | LLaMA, Mistral, Qwen |
| ALiBi | Bias added to attention scores (after dot product) | Linear penalty −m·|i−j| per head | Very good | MPT, BLOOM |
RoPE modifies Q and K vectors before the dot product via rotation. ALiBi adds a distance-based linear bias after the dot product but before softmax. This distinction is the most common NCP exam question on positional encoding.
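A minimal RoPE sketch using the split-halves pairing convention (one of several in use); production code precomputes cos/sin caches and adds context-extension scaling such as YaRN:

```python
import torch

def rope(x, positions, base=10_000.0):
    """Rotate channel pairs of x (seq, d) by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # per-pair frequencies
    angles = positions[:, None].float() * freqs[None, :]                # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q, k = torch.randn(8, 128), torch.randn(8, 128)   # (seq=8, d_k=128)
pos = torch.arange(8)
q_rot, k_rot = rope(q, pos), rope(k, pos)         # applied to Q and K only, before the dot product
# q_rot @ k_rot.T depends only on the relative offset i - j, not on absolute positions.
```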
The central computation in every decoder block. Given input X ∈ ℝ^(seq × d_model), three linear projections produce queries, keys, and values:
Q = X W_Q -- W_Q ∈ ℝ^(d_model × d_k)
K = X W_K -- W_K ∈ ℝ^(d_model × d_k)
V = X W_V -- W_V ∈ ℝ^(d_model × d_v)
Attention(Q, K, V) = softmax( Q Kᵀ / √d_k + M ) · V
-- M is the causal mask: -inf for j > i, 0 otherwise
-- Scaling by 1/√d_k prevents softmax saturation at large d_k
When vectors of dimension dₖ are initialised with unit variance, their dot product has variance dₖ. Without scaling, large dₖ values push dot products to large magnitudes, driving softmax towards a near-one-hot distribution — gradients vanish. Dividing by √dₖ restores unit variance at the softmax input.
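A quick numerical check of the variance argument (unit-variance random vectors, d_k = 128):

```python
import numpy as np

d_k = 128
rng = np.random.default_rng(0)
q = rng.standard_normal((10_000, d_k))    # unit-variance query vectors
k = rng.standard_normal((10_000, d_k))    # unit-variance key vectors

scores = (q * k).sum(axis=1)              # 10 000 raw dot products
print(scores.var())                       # ~ d_k ~ 128
print((scores / np.sqrt(d_k)).var())      # ~ 1 after scaling by 1/sqrt(d_k)
```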
The decoder is autoregressive — position i may only attend to positions ≤ i. The upper-triangular mask sets logits for future positions to −∞; after softmax these become 0. This is implemented as a single triangular tensor added to the attention scores before softmax.
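A minimal single-head implementation combining the scaling and the mask (PyTorch 2.x exposes the same computation fused as F.scaled_dot_product_attention(..., is_causal=True)):

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.
    q, k: (seq, d_k); v: (seq, d_v). Sketch only: no batching, dropout, or KV cache."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5            # (seq, seq), scaled by 1/sqrt(d_k)
    seq = scores.shape[-1]
    future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))       # -inf wherever j > i
    weights = F.softmax(scores, dim=-1)                       # future positions become exactly 0
    return weights @ v                                        # (seq, d_v)
```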
The 1/√dₖ scaling is for variance stability at the softmax input, not for normalising the output magnitude of the projection. Exam distractors frequently conflate these.
Rather than one large attention computation, the model runs h parallel attention heads, each with d_k = d_model / h. The outputs are concatenated and projected:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W_O
headᵢ = Attention( Q W_Qᵢ, K W_Kᵢ, V W_Vᵢ )
Multiple heads allow the model to attend to different aspects of the context simultaneously — different syntactic roles, different semantic relations — within the same layer. Splitting into heads does not change the parameter count: the attention block holds ≈ 4 × d_model² weights across the Q, K, V, and O projections regardless of h.
| Variant | Query heads | Key heads | Value heads | Notes |
|---|---|---|---|---|
| MHA (multi-head) | h | h | h | Baseline. Large KV cache. |
| GQA (grouped-query) | h | g (g < h) | g (g < h) | h/g queries share each KV group. KV cache shrinks by h/g. Used in LLaMA 3, Mistral. |
| MQA (multi-query) | h | 1 | 1 | Extreme case. Maximum KV saving; slight quality cost. |
GQA reduces the KV cache memory footprint by a factor of h/g without reducing query expressiveness. This is the primary architectural lever for fitting longer context on consumer GPUs.
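A sketch of the grouping mechanism; setting n_kv_heads = n_heads recovers standard MHA and n_kv_heads = 1 gives MQA. The repeated K/V tensors are materialised here only for clarity; the cache itself stores just the n_kv_heads copies:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_heads, n_kv_heads):
    """Grouped-query attention sketch. q: (seq, n_heads*d_k); k, v: (seq, n_kv_heads*d_k)."""
    seq, d_k = q.shape[0], q.shape[1] // n_heads
    q = q.view(seq, n_heads, d_k).transpose(0, 1)             # (n_heads,    seq, d_k)
    k = k.view(seq, n_kv_heads, d_k).transpose(0, 1)          # (n_kv_heads, seq, d_k)
    v = v.view(seq, n_kv_heads, d_k).transpose(0, 1)
    group = n_heads // n_kv_heads                             # h/g query heads share one KV head
    k = k.repeat_interleave(group, dim=0)                     # expand KV heads to match query heads
    v = v.repeat_interleave(group, dim=0)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(0, 1).reshape(seq, n_heads * d_k)    # concat heads back together
```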
The FFN occupies roughly two-thirds of the total parameter count in a standard decoder-only model. It operates independently on each token position — no cross-token interaction.
FFN(x) = W₂ · ReLU( W₁ x + b₁ ) + b₂
d_ff = 4 × d_model -- Transformer base: 2048 = 4 × 512
FFN(x) = W₃ · ( W₁ x ⊙ SiLU( W₂ x ) )   -- ⊙ = element-wise (gated) product
d_ff ≈ (8/3) × d_model -- keeps param count ≈ 4× ReLU FFN
-- SiLU(x) = x · sigmoid(x) [smooth gating]
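A LLaMA-style gated FFN sketch matching the formula above (attribute names are illustrative, not any library's):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """down( up(x) * SiLU(gate(x)) ), no biases."""
    def __init__(self, d_model=4096, d_ff=11008):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)       # W1
        self.gate = nn.Linear(d_model, d_ff, bias=False)     # W2
        self.down = nn.Linear(d_ff, d_model, bias=False)     # W3
    def forward(self, x):
        return self.down(self.up(x) * F.silu(self.gate(x)))  # element-wise gated product

ffn = SwiGLUFFN()
print(sum(p.numel() for p in ffn.parameters()))              # 135,266,304 ≈ 135 M per block
```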
For a single SwiGLU FFN block (LLaMA-style, no biases):
params_ffn = 3 × d_model × d_ff -- gate, up, and down projection matrices
= 3 × 4096 × 11008 -- LLaMA-2 7B (d_ff = 11008 ≈ 8/3 × 4096)
≈ 135 M parameters per FFN block
× 32 blocks ≈ 4.3 B params from FFN alone (roughly two-thirds of the 6.7 B total)
SwiGLU consistently outperforms ReLU and GELU at the same FLOP budget in ablations. The gating mechanism allows the network to selectively suppress or amplify features. The (8/3) factor is derived to keep total FLOPs comparable to a 4× ReLU FFN at identical d_model.
During autoregressive generation, token t requires attention against all previous tokens. The KV cache stores the K and V projections for all past positions, so only the new token's Q, K, V need computing at each step.
KV_bytes = 2 × N_layers × N_kv_heads × d_k × seq_len × bytes_per_element
-- 7B model example (LLaMA-2 style, FP16):
-- N_layers = 32, N_kv_heads = 32, d_k = 128, seq_len = 4096, bytes = 2
KV_bytes = 2 × 32 × 32 × 128 × 4096 × 2 = 2,147,483,648 B ≈ 2 GB
-- With GQA (N_kv_heads = 8, as in LLaMA-3 8B):
KV_bytes = 2 × 32 × 8 × 128 × 4096 × 2 ≈ 0.5 GB -- 4× saving
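The same arithmetic as a small helper, reproducing the figures above:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_k, seq_len, bytes_per_elem=2, batch=1):
    # factor 2 = one K tensor plus one V tensor per layer
    return 2 * n_layers * n_kv_heads * d_k * seq_len * bytes_per_elem * batch

print(kv_cache_bytes(32, 32, 128, 4096) / 2**30)    # 2.0  GiB  (LLaMA-2 7B style, FP16)
print(kv_cache_bytes(32, 8, 128, 4096) / 2**30)     # 0.5  GiB  (GQA, 8 KV heads)
print(kv_cache_bytes(32, 32, 128, 32_768) / 2**30)  # 16.0 GiB  (32K context, 32 KV heads)
```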
At sequence length 32 768 (32K context), the KV cache for a 7B model with 32 KV heads reaches 16 GB in FP16 — exceeding the total VRAM of an RTX 4000 Ada (20 GB) once model weights are also loaded. GQA and 8-bit KV quantisation are the primary mitigations.
| Technique | KV memory reduction | Quality impact |
|---|---|---|
| GQA (h=32 → 8) | 4× | Negligible |
| MQA (h=32 → 1) | 32× | Small quality loss |
| KV quantisation (FP16 → INT8) | 2× | Very small |
| Paged attention (vLLM) | Removes fragmentation waste | None |
KV cache quantisation and paged attention are covered in detail in NVIDIA_GPU_19_TensorRT_LLM and cheatsheets/quantisation_and_kv_cache.md.
Always select the highest-probability token: argmax over logits. Deterministic. Fast. Prone to degenerate repetition on open-ended generation tasks.
Sample from the probability distribution. Temperature τ divides logits before softmax — τ<1 sharpens, τ>1 flattens. Top-p (nucleus) samples from the smallest set of tokens summing to p. Top-k restricts to the k highest-probability tokens.
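A sketch applying temperature, top-k, and top-p filtering to a single logits vector (the thresholds shown are illustrative defaults, not canonical values):

```python
import numpy as np

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature   # tau < 1 sharpens, tau > 1 flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                               # tokens by descending probability
    keep = order[:top_k]                                          # top-k cut
    cum = np.cumsum(probs[keep])
    keep = keep[: np.searchsorted(cum, top_p) + 1]                # smallest set with mass >= top_p
    p = probs[keep] / probs[keep].sum()                           # renormalise over survivors
    return int(rng.choice(keep, p=p))
```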
A small draft model generates candidate tokens cheaply; the large target model verifies them in parallel. Accepted tokens are kept; the first rejected token is resampled. Speedup 2–4× on aligned draft/target pairs, with exact output distribution preserved.
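A heavily simplified sketch of the accept/reject loop, assuming hypothetical `draft_sample`, `draft_probs_fn`, and `target_probs_fn` interfaces; real implementations score all draft tokens in a single batched target forward pass:

```python
import numpy as np

def speculative_step(draft_sample, draft_probs_fn, target_probs_fn, prefix, gamma=4):
    """One round: the draft proposes gamma tokens, the target keeps a verified prefix.
    *_probs_fn(context) -> probability vector over the vocabulary (hypothetical interface)."""
    rng = np.random.default_rng()
    ctx, proposed = list(prefix), []
    for _ in range(gamma):                         # cheap draft model proposes gamma tokens
        tok = draft_sample(ctx)
        proposed.append(tok); ctx.append(tok)

    ctx, accepted = list(prefix), []
    for tok in proposed:
        p = target_probs_fn(ctx)                   # large target model (batched in practice)
        q = draft_probs_fn(ctx)                    # draft distribution at the same position
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok); ctx.append(tok)  # accept: keep the draft token
        else:
            residual = np.maximum(p - q, 0.0)      # reject: resample from the corrected residual,
            residual /= residual.sum()             # which preserves the exact target distribution
            accepted.append(int(rng.choice(len(p), p=residual)))
            break                                  # everything after the first rejection is discarded
    return accepted
```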
Full treatment of sampling parameters and speculative decoding in cheatsheets/sampling_and_decoding.md.
| Architecture | Self-attention mask | Cross-attention | Typical use | Examples |
|---|---|---|---|---|
| Encoder-only | Bidirectional (no causal mask) | None | Classification, embedding, NER, sentence similarity | BERT, RoBERTa, E5, BGE |
| Encoder-decoder | Encoder: bidirectional; Decoder: causal | Decoder attends to encoder output | Translation, summarisation, seq2seq | T5, BART, mT5 |
| Decoder-only | Causal (autoregressive) | None (or prefix attention) | Language modelling, instruction following, generation | GPT-4, LLaMA, Mistral, Gemma |
The original Transformer (Vaswani et al., 2017) was encoder-decoder for machine translation. The shift to decoder-only at scale reflects empirical evidence that next-token prediction on a diverse corpus is a more scalable pretraining objective than seq2seq on task-specific data.
Encoder-only models are not autoregressive and cannot generate text in the decoder-only sense. They are used for dense representations (embeddings for RAG retrieval, sentence classification). A question asking "which architecture is best for semantic search?" points to encoder-only (bi-encoder). "Which for instruction following?" points to decoder-only.
Replaces the dense FFN with E expert FFNs. A gating network selects top-K experts per token. Total parameters grow but active parameters (FLOPs per token) stay constant.
Canonical examples: Switch Transformer, Mixtral 8×7B, DeepSeek-V2.
Trade-offs: load balancing (expert collapse), communication overhead in multi-GPU deployments, expert routing as a source of training instability.
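A routing sketch with dense per-expert compute for clarity (production kernels dispatch tokens sparsely and add an auxiliary load-balancing loss; the module layout is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Top-K routed MoE FFN: every token runs through only top_k of n_experts expert FFNs."""
    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = self.router(x)                             # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)      # top-K experts per token
        weights = F.softmax(weights, dim=-1)                # renormalise over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = idx[:, slot] == e                  # tokens sent to expert e in this slot
                if routed.any():
                    out[routed] += weights[routed, slot, None] * expert(x[routed])
        return out
```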
Selective state-space models replace attention with a recurrent formulation. Time complexity O(n) vs O(n²) for attention.
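The recurrence itself, as a minimal (non-selective) linear scan; selective SSMs such as Mamba additionally make A, B, C functions of the input:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """h_t = A h_{t-1} + B x_t ;  y_t = C h_t  -- one pass over the sequence, O(n)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                    # x: (seq, d_in)
        h = A @ h + B @ x_t          # a fixed-size state summarises the whole history
        ys.append(C @ h)
    return np.stack(ys)              # (seq, d_out)
```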
Weakness: less effective at sharp content-based retrieval over long context compared to attention. Hybrids (Jamba, Zamba) address this.
Interleave attention layers (global, content-based) with SSM or convolution layers (cheap local processing). Field is active.
Decoder-only transformers with RoPE and GQA remain the majority of production deployments as of 2026.
Full technical treatment of MoE architecture, routing algorithms, and training instabilities in LLM_Hub_Modern_Architectures. MoE VRAM and routing in Arch_01_MoE.
The following are the six most frequently examined points drawn from notes/02_transformer_architecture.md and community NCA/NCP reports:
DPO eliminates the reward model but keeps the reference model. LoRA adds zero inference overhead after merging. S-LoRA is for multi-adapter serving, not single-adapter speed. These cross into the fine-tuning domain but are rooted in the same architecture understanding.