NVIDIA GenAI Cert Prep — Presentation 01

Transformer Architecture
Decoder-Only Deep Dive

The decoder-only transformer underlies every major production LLM. This deck maps the architecture to cert exam domains, walks the token-to-logit pipeline, and flags the distractor traps the NCA and NCP papers use most often.

NCA Core ML 30% · NCP LLM Arch 6% · Attention · RoPE · KV Cache · MoE · Decoder-Only
Pipeline: Tokens → Embeddings → Pos. Enc. → N× Decoder Block → Unembed → Logits
00

Topics in This Deck

A cert-focused tour of the decoder-only transformer — from the token pipeline through to modern variants and the exact exam angles the NCA and NCP papers probe.

01

Cert Framing — Which Domains, What Weighting

NCA-GENL Associate

Core ML and AI Knowledge

30%

Largest single domain. Tests foundational understanding of how transformers work — attention mechanics, scaling, trade-offs. Questions are scenario-based: "which change would most reduce memory at long context?"

NCP-GENL Professional

LLM Architecture

6%

Smaller domain but more precise. Tests exact mechanisms: the 1/√dₖ scaling rationale, pre-norm vs post-norm, GQA vs MQA memory savings, RoPE vs ALiBi mechanism difference.

Why this matters for the NCA

The 30% Core ML domain is the single largest lever on your NCA score. Understanding transformer internals also builds the mental model needed for RAG (22%), fine-tuning (22%), and inference optimisation — the examiners assume the architecture is your baseline.

02

From Token to Logit — The Full Pipeline

Tokens (integer IDs) → Embeddings (V × dₚ) → Pos. Enc. (RoPE/ALiBi) → N× Decoder Block (Attn + FFN) → Unembed (dₚ → V) → Logits (ℝ^V)

Each step is differentiable. During training the full forward pass runs in one shot; during generation the model runs one token at a time in autoregressive fashion, reusing the KV cache from prior steps.

03

Tokenisation — BPE Essentials

Byte-Pair Encoding (BPE) is the dominant tokenisation algorithm in production LLMs (GPT series, LLaMA, Mistral). It builds a vocabulary of subword units by iteratively merging the most frequent adjacent byte pair in the training corpus.

How BPE works

  1. Initialise the vocabulary with all individual bytes (256 entries for byte-level BPE).
  2. Count all adjacent symbol pairs in the corpus. Merge the most frequent pair into a new token.
  3. Repeat until vocabulary reaches target size (e.g., 32 000 or 128 000 merges).
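The merge loop above can be sketched in a few lines of Python. This is a toy trainer for illustration; production tokenisers (tiktoken, SentencePiece) add byte-level pre-tokenisation, special-token handling, and deterministic tie-breaking.

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    words = [list(w) for w in corpus]      # each word as single-char symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_words = []
        for w in words:                    # apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

merges = bpe_train(["lower", "lowest", "low", "low"], num_merges=3)
# frequent "low" drives the first merges: ('l','o'), then ('lo','w')
```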
Property | Value / Range | Why it matters
Vocabulary size | 32 000–128 256 | Larger vocab = shorter sequences; embedding matrix grows
Special tokens | <bos>, <eos>, <pad>, <unk> | Must match exactly at fine-tuning time
Tokeniser algorithm | BPE, SentencePiece, Unigram | Different tokenisers → different token counts for the same text
Worst-case token | 1 char = 1 token | Code and languages with rare scripts inflate token count
Exam angle

The tokeniser is separate from the model weights. A mismatch between the tokeniser used during training and the one used at inference produces garbage — this is a common distractor in fine-tuning questions.

04

Embeddings — The Lookup Table

Token embeddings are a trainable matrix E ∈ ℝV×dₚ. Each row is the dense vector representation of one vocabulary token. The forward pass is a simple lookup: given token index i, the embedding is the i-th row of E.

E ∈ ℝ^(V×dₚ): row i of E is the embedding eᵢ ∈ ℝ^dₚ of token i.
Weight tying: the unembedding matrix Wᵤ = Eᵀ (dₚ × V) reuses the input embedding matrix, transposed, saving V × dₚ parameters.
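A minimal NumPy sketch of the lookup and of weight tying, with V and dₚ shrunk to illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model = 1000, 64                        # illustrative sizes

E = rng.normal(0, 0.02, size=(V, d_model))   # embedding matrix, V x d_model

token_ids = np.array([5, 17, 5])
x = E[token_ids]                             # lookup = row selection, (3, d_model)

# Weight tying: the unembedding reuses E transposed, so logits come from
# hidden @ E.T and no second V x d_model matrix is stored.
hidden = x                                   # stand-in for final hidden states
logits = hidden @ E.T                        # shape (3, V)
```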
05

Positional Encoding — Absolute, RoPE, ALiBi

Attention is permutation-equivariant — shuffle the input and the output shuffles identically. Positional information must be injected explicitly. Three dominant strategies:

Method | Where applied | Mechanism | Length extrapolation | Used in
Absolute sinusoidal | Added to embeddings | Fixed sin/cos waves, no learned params | Poor in practice | Original Transformer
Learned absolute | Added to embeddings | Trainable pos. embedding matrix | None beyond max length | GPT-2, BERT
RoPE | Rotates Q and K vectors | Relative position via rotation; dot product depends only on i−j | Good (YaRN/ABF) | LLaMA, Mistral, Qwen
ALiBi | Bias added to attention scores (after dot product) | Linear penalty −m·|i−j| per head | Very good | MPT, BLOOM
RoPE illustrated as a 2D rotation: q at position m becomes R(mθ)q, k at position n becomes R(nθ)k. Dot-product property: (R(mθ)q)ᵀ(R(nθ)k) = qᵀR((n−m)θ)k, so the score depends only on the relative displacement m−n, not on the absolute positions.
Exam distinction

RoPE modifies Q and K vectors before the dot product via rotation. ALiBi adds a distance-based linear bias after the dot product but before softmax. This distinction is the most common NCP exam question on positional encoding.
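The RoPE property can be checked numerically on a single 2D frequency pair (θ = 0.1 is an arbitrary illustrative frequency; real RoPE applies a rotation like this to every pair of dimensions):

```python
import numpy as np

def rotate(v, pos, theta=0.1):
    """Apply the 2D RoPE rotation R(pos*theta) to one frequency pair."""
    a = pos * theta
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ v

q = np.array([1.0, 0.5])
k = np.array([0.3, -0.2])

# Scores at positions (7, 3) and (17, 13): same relative offset m - n = 4.
s1 = rotate(q, 7) @ rotate(k, 3)
s2 = rotate(q, 17) @ rotate(k, 13)
assert np.isclose(s1, s2)          # depends only on the offset, not m or n
```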

06

Attention — Q, K, V, Scale, and Mask

The central computation in every decoder block. Given input X ∈ ℝseq×dₚ, three linear projections produce queries, keys, and values:

Scaled Dot-Product Attention
Q = X W_Q   -- W_Q ∈ ℝ^(d_model × d_k)
K = X W_K   -- W_K ∈ ℝ^(d_model × d_k)
V = X W_V   -- W_V ∈ ℝ^(d_model × d_v)

Attention(Q, K, V) = softmax( Q Kᵀ / √d_k  +  M ) · V

-- M is the causal mask: -inf for j > i, 0 otherwise
-- Scaling by 1/√d_k prevents softmax saturation at large d_k
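The formula above transcribes directly into runnable NumPy (single head, illustrative shapes; a sketch, not an optimised kernel):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention with a causal mask."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq, seq)
    mask = np.triu(np.full(scores.shape, -np.inf), k=1)  # -inf where j > i
    probs = softmax(scores + mask)                       # future weights -> 0
    return probs @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_Q, W_K, W_V = (rng.normal(size=(16, 8)) for _ in range(3))
out = causal_attention(X, W_Q, W_K, W_V)

# Position 0 can only attend to itself, so its output is exactly V[0]:
assert np.allclose(out[0], (X @ W_V)[0])
```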

Why 1/√dₖ?

When vectors of dimension dₖ are initialised with unit variance, their dot product has variance dₖ. Without scaling, large dₖ values push dot products to large magnitudes, driving softmax towards a near-one-hot distribution — gradients vanish. Dividing by √dₖ restores unit variance at the softmax input.
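The variance claim is easy to verify empirically (d_k = 512 chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(10_000, d_k))   # unit-variance vectors
k = rng.normal(size=(10_000, d_k))

dots = (q * k).sum(axis=1)           # 10k raw dot products
scaled = dots / np.sqrt(d_k)

# Raw dot products have variance ~ d_k; scaling restores ~unit variance.
assert abs(dots.var() / d_k - 1) < 0.1
assert abs(scaled.var() - 1) < 0.1
```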

Causal masking

The decoder is autoregressive — position i may only attend to positions ≤ i. The upper-triangular mask sets logits for future positions to −∞; after softmax these become 0. This is implemented as a single triangular tensor added to the attention scores before softmax.

Common distractor

The 1/√dₖ scaling is for variance stability at the softmax input, not for normalising the output magnitude of the projection. Exam distractors frequently conflate these.

07

Multi-Head Attention and GQA

Rather than one large attention computation, the model runs h parallel attention heads, each with dₖ = dₚ/h. The outputs are concatenated and projected:

Multi-Head Attention
MultiHead(Q, K, V) = Concat(head₁, ..., head_h) W_O

headᵢ = Attention( Q W_Qᵢ, K W_Kᵢ, V W_Vᵢ )

Multiple heads allow the model to attend to different aspects of the context simultaneously — different syntactic roles, different semantic relations — within the same layer. The total attention parameter count stays dₚ² per projection (4 × dₚ² across Q, K, V, and the output projection), independent of the number of heads.
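The head split is just a reshape, which a short NumPy check makes concrete (illustrative sizes):

```python
import numpy as np

d_model, seq, h = 64, 10, 8
d_k = d_model // h
rng = np.random.default_rng(1)

X = rng.normal(size=(seq, d_model))
W_Q = rng.normal(size=(d_model, d_model))   # one fused projection for all heads

# Splitting into heads is a reshape: the parameter count (d_model^2 per
# projection) is identical for any h that divides d_model.
Q = (X @ W_Q).reshape(seq, h, d_k).transpose(1, 0, 2)   # (h, seq, d_k)
assert W_Q.size == d_model ** 2
```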

Grouped-Query Attention (GQA)

MHA

h query heads
h key heads
h value heads

Baseline. Large KV cache.

GQA

h query heads
g key heads (g < h)
g value heads

h/g queries share each KV group. KV cache shrinks by h/g. Used in LLaMA 3, Mistral.

MQA

h query heads
1 key head
1 value head

Extreme case. Maximum KV saving; slight quality cost.

Cert relevance

GQA reduces the KV cache memory footprint proportionally to h/g without reducing the query expressiveness. This is the primary architectural lever for fitting longer context on consumer GPUs.
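The h/g saving can be verified with a quick NumPy sketch (LLaMA-3-8B-style head counts; sequence length illustrative):

```python
import numpy as np

h, g, seq, d_k = 32, 8, 4096, 128            # 32 query heads, 8 KV heads

rng = np.random.default_rng(0)
K_gqa = rng.normal(size=(g, seq, d_k)).astype(np.float16)  # stored KV heads

# Each group of h // g = 4 query heads shares one KV head at attention time:
K_full = np.repeat(K_gqa, h // g, axis=0)    # broadcast to (h, seq, d_k)

cache_mha = h * seq * d_k * 2                # bytes if all h heads were stored
cache_gqa = K_gqa.nbytes                     # only g heads actually stored
assert cache_mha // cache_gqa == h // g      # 4x smaller KV cache
```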

08

The Decoder Block — Pre-Norm and Residuals

Decoder Block (Pre-Norm), following the residual stream:

x → RMSNorm/LayerNorm → Masked Multi-Head Attn → (+ residual) → RMSNorm/LayerNorm → Feed-Forward (SwiGLU) → (+ residual)

Pre-norm applies normalisation before each sub-layer rather than after it, which keeps gradients stable in deep stacks and is the standard choice in modern LLMs.
09

Feed-Forward Network — SwiGLU vs ReLU

The FFN occupies roughly two-thirds of the total parameter count in a standard decoder-only model. It operates independently on each token position — no cross-token interaction.

Standard FFN (original Transformer)
FFN(x) = W₂ · ReLU( W₁ x + b₁ ) + b₂

d_ff = 4 × d_model  -- Transformer base: 2048 = 4 × 512
SwiGLU FFN (LLaMA-style)
FFN(x) = ( SiLU( W₁ x ) ⊙ ( W₂ x ) ) W₃   -- ⊙ = elementwise product; W₃ is the down projection

d_ff ≈ (8/3) × d_model  -- three matrices at (8/3)× ≈ same params as two at 4× (8 d_model² each)
-- SiLU(x) = x · sigmoid(x)  [smooth gating]

Parameter count formula

For a single FFN layer: the standard ReLU FFN has two weight matrices (2 × d_model × d_ff); SwiGLU has three (3 × d_model × d_ff).

Formula
params_ffn = 3 × d_model × d_ff  -- SwiGLU: gate, up, down matrices (no biases in LLaMA)
           = 3 × 4096 × 11008   -- LLaMA-2 7B (d_ff = 11008 ≈ 8/3 × 4096)
           ≈ 135 M parameters per FFN block
× 32 blocks ≈ 4.3 B params from FFN alone
Why SwiGLU?

SwiGLU consistently outperforms ReLU and GELU at the same FLOP budget in ablations. The gating mechanism allows the network to selectively suppress or amplify features. The (8/3) factor is derived to keep total FLOPs comparable to a 4× ReLU FFN at identical dₚ.
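The SwiGLU forward pass in NumPy (matrix names follow LLaMA's gate/up/down convention; sizes are illustrative, with d_ff ≈ (8/3) × d_model):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))          # SiLU(x) = x * sigmoid(x)

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU FFN sketch (no biases): the SiLU-gated path multiplies
    the up projection elementwise before the down projection."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 64, 172                    # 172 ~ (8/3) * 64
x = rng.normal(size=(5, d_model))
W_gate = rng.normal(size=(d_model, d_ff)) * 0.05
W_up = rng.normal(size=(d_model, d_ff)) * 0.05
W_down = rng.normal(size=(d_ff, d_model)) * 0.05

y = swiglu_ffn(x, W_gate, W_up, W_down)
# Three matrices: params = 3 * d_model * d_ff
assert W_gate.size + W_up.size + W_down.size == 3 * d_model * d_ff
```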

10

KV Cache — Memory Cost at Long Context

During autoregressive generation, token t requires attention against all previous tokens. The KV cache stores the K and V projections for all past positions, so only the new token's Q, K, V need computing at each step.

KV Cache memory formula
KV_bytes = 2 × N_layers × N_kv_heads × d_k × seq_len × bytes_per_element

-- 7B model example (LLaMA-2 style, FP16):
--   N_layers = 32, N_kv_heads = 32, d_k = 128, seq_len = 4096, bytes = 2
KV_bytes = 2 × 32 × 32 × 128 × 4096 × 2 = 2,147,483,648 B ≈ 2 GB

-- With GQA (N_kv_heads = 8, as in LLaMA-3 8B):
KV_bytes = 2 × 32 × 8 × 128 × 4096 × 2 ≈ 0.5 GB  -- 4× saving

At sequence length 32 768 (32K context), the KV cache for a 7B model with 32 KV heads reaches 16 GB in FP16 — exceeding the total VRAM of an RTX 4000 Ada (20 GB) once model weights are also loaded. GQA and 8-bit KV quantisation are the primary mitigations.
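The formula above as a small helper function (a sketch for one sequence; real servers also multiply by batch size):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_k, seq_len, bytes_per_elem=2):
    """KV cache size per sequence; the leading 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * d_k * seq_len * bytes_per_elem

# LLaMA-2-7B-style, FP16, 4K context:
mha = kv_cache_bytes(32, 32, 128, 4096)   # full MHA cache
gqa = kv_cache_bytes(32, 8, 128, 4096)    # GQA with 8 KV heads

assert mha == 2 * 1024**3                 # exactly 2 GiB
assert mha // gqa == 4                    # GQA saves h/g = 4x
```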

Technique | KV memory reduction | Quality impact
GQA (h=32 → 8) | 4× | Negligible
MQA (h=32 → 1) | 32× | Small quality loss
KV quantisation (FP16 → INT8) | 2× | Very small
Paged attention (vLLM) | Removes fragmentation waste | None
Cross-reference

KV cache quantisation and paged attention are covered in detail in NVIDIA_GPU_19_TensorRT_LLM and cheatsheets/quantisation_and_kv_cache.md.

11

Decoding Strategies — Greedy, Sampling, Speculative

Greedy Decoding

Always select the highest-probability token: argmax over logits. Deterministic. Fast. Prone to degenerate repetition on open-ended generation tasks.

Sampling (Temperature, Top-p, Top-k)

Sample from the probability distribution. Temperature τ divides logits before softmax — τ<1 sharpens, τ>1 flattens. Top-p (nucleus) samples from the smallest set of tokens summing to p. Top-k restricts to the k highest-probability tokens.
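A sketch of the sampling pipeline in NumPy (the function name and argument layout are illustrative, not any library's API):

```python
import numpy as np

def sample_logits(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    """Temperature / top-k / top-p (nucleus) sampling over one logits vector."""
    rng = np.random.default_rng(seed)
    z = np.asarray(logits, dtype=np.float64) / temperature  # tau<1 sharpens
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    keep = np.ones_like(probs, dtype=bool)
    if top_k is not None:
        keep &= probs >= np.sort(probs)[-top_k]             # k best tokens
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        nucleus = order[: int(np.searchsorted(cum, top_p) + 1)]
        mask = np.zeros_like(keep)
        mask[nucleus] = True            # smallest set with mass >= p
        keep &= mask
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

assert sample_logits([2.0, 1.0, 0.1], top_k=1) == 0         # greedy when k=1
assert sample_logits([2.0, 1.0, 0.1], temperature=0.7, top_k=2) in (0, 1)
```

Greedy decoding is the top_k=1 special case: the argmax token is always selected.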

Speculative Decoding

A small draft model generates candidate tokens cheaply; the large target model verifies them in parallel. Accepted tokens are kept; the first rejected token is resampled. Speedup 2–4× on aligned draft/target pairs, with exact output distribution preserved.

Cross-reference

Full treatment of sampling parameters and speculative decoding in cheatsheets/sampling_and_decoding.md.

Key distinctions for the exam

12

Architecture Families — Decoder, Encoder-Decoder, Encoder-Only

Architecture | Self-attention mask | Cross-attention | Typical use | Examples
Encoder-only | Bidirectional (no causal mask) | None | Classification, embedding, NER, sentence similarity | BERT, RoBERTa, E5, BGE
Encoder-decoder | Encoder: bidirectional; decoder: causal | Decoder attends to encoder output | Translation, summarisation, seq2seq | T5, BART, mT5
Decoder-only | Causal (autoregressive) | None (or prefix attention) | Language modelling, instruction following, generation | GPT-4, LLaMA, Mistral, Gemma

The original Transformer (Vaswani et al., 2017) was encoder-decoder for machine translation. The shift to decoder-only at scale reflects empirical evidence that next-token prediction on a diverse corpus is a more scalable pretraining objective than seq2seq on task-specific data.

Exam angle

Encoder-only models are not autoregressive and cannot generate text in the decoder-only sense. They are used for dense representations (embeddings for RAG retrieval, sentence classification). A question asking "which architecture is best for semantic search?" points to encoder-only (bi-encoder). "Which for instruction following?" points to decoder-only.

13

Modern Variants — MoE, Mamba, Hybrids

Mixture of Experts (MoE)

Replaces the dense FFN with E expert FFNs. A gating network selects top-K experts per token. Total parameters grow but active parameters (FLOPs per token) stay constant.

Canonical examples: Switch Transformer, Mixtral 8×7B, DeepSeek-V2.

Trade-offs: load balancing (expert collapse), communication overhead in multi-GPU deployments, expert routing as a source of training instability.
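A toy top-K routing sketch (Mixtral-style renormalised softmax over the chosen experts; expert FFNs reduced to single matrices for brevity):

```python
import numpy as np

def moe_ffn(x, gate_W, experts, top_k=2):
    """Per token: softmax the router logits, keep top_k experts,
    and mix their outputs with renormalised weights."""
    logits = x @ gate_W                              # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # chosen expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        weights = np.exp(chosen - chosen.max())
        weights /= weights.sum()                     # softmax over chosen only
        for w, e in zip(weights, top[t]):
            out[t] += w * experts[e](x[t])           # only top_k experts run
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda v, W=rng.normal(size=(d, d)) * 0.1: v @ W
           for _ in range(n_experts)]                # one tiny "FFN" per expert
x = rng.normal(size=(4, d))
gate_W = rng.normal(size=(d, n_experts))
y = moe_ffn(x, gate_W, experts)                      # FLOPs scale with top_k, not n_experts
```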

Mamba / SSMs

Selective state-space models replace attention with a recurrent formulation. Time complexity O(n) vs O(n²) for attention.

Weakness: less effective at sharp content-based retrieval over long context compared to attention. Hybrids (Jamba, Zamba) address this.

Hybrid Models

Interleave attention layers (global, content-based) with SSM or convolution layers (cheap local processing). Field is active.

Decoder-only transformers with RoPE and GQA remain the majority of production deployments as of 2026.

Cross-reference

Full technical treatment of MoE architecture, routing algorithms, and training instabilities in LLM_Hub_Modern_Architectures. MoE VRAM and routing in Arch_01_MoE.

14

Likely Exam Angles

The following are the six most frequently examined points drawn from notes/02_transformer_architecture.md and community NCA/NCP reports:

Also worth noting

DPO eliminates the reward model but keeps the reference model. LoRA adds zero inference overhead after merging. S-LoRA is for multi-adapter serving, not single-adapter speed. These cross into the fine-tuning domain but are rooted in the same architecture understanding.

15

Cross-References and Further Reading

Portfolio repos (depth treatment)

Cert-prep repo resources

Primary literature