The decoder-only transformer underlies every major production LLM. This deck maps the architecture to cert exam domains, walks the token-to-logit pipeline, and flags the distractor traps the NCA and NCP papers use most often.
A cert-focused tour of the decoder-only transformer — from the token pipeline through to modern variants and the exact exam angles the NCA and NCP papers probe.
Core ML and AI Knowledge
30%
Largest single domain. Tests foundational understanding of how transformers work — attention mechanics, scaling, trade-offs. Questions are scenario-based: "which change would most reduce memory at long context?"
LLM Architecture
6%
Smaller domain but more precise. Tests exact mechanisms: the 1/√dₖ scaling rationale, pre-norm vs post-norm, GQA vs MQA memory savings, RoPE vs ALiBi mechanism difference.
The 30% Core ML domain is the single largest lever on your NCA score. Understanding transformer internals also builds the mental model needed for RAG (22%), fine-tuning (22%), and inference optimisation — the examiners assume the architecture is your baseline.
Each step is differentiable. During training the full forward pass runs in one shot; during generation the model runs one token at a time in autoregressive fashion, reusing the KV cache from prior steps.
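A minimal sketch of that generation-time loop, assuming a hypothetical `model` callable that returns logits plus an updated KV cache (real serving stacks wrap exactly this pattern):

```python
import torch

@torch.no_grad()
def greedy_decode(model, prompt_ids, max_new_tokens, eos_id):
    # model is a hypothetical callable: (token_ids, past_kv) -> (logits, past_kv)
    past_kv = None                    # K/V projections cached for all past positions
    ids = prompt_ids                  # (1, prompt_len)
    next_input = ids                  # first step processes the whole prompt in one shot
    for _ in range(max_new_tokens):
        logits, past_kv = model(next_input, past_kv)              # logits: (1, len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick from last position
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:
            break
        next_input = next_id          # only the new token is fed; the rest comes from the cache
    return ids
```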
Byte-Pair Encoding (BPE) is the dominant tokenisation algorithm in production LLMs (GPT series, LLaMA, Mistral). It builds a vocabulary of subword units by iteratively merging the most frequent adjacent pair of symbols (initially raw bytes) in the training corpus.
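A toy sketch of the training-side merge loop (word frequencies and byte-level pre-tokenisation are omitted; production tokenisers such as tiktoken or SentencePiece add many optimisations on top):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: words is a list of symbol tuples, e.g. [('l','o','w'), ('l','o','w','e','r')]."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:                                   # count every adjacent symbol pair
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                  # most frequent adjacent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:                                   # replace the pair with the merged symbol
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(w[i]); i += 1
            new_words.append(tuple(out))
        words = new_words
    return merges
```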
| Property | Value / Range | Why it matters |
|---|---|---|
| Vocabulary size | 32 000–128 256 | Larger vocab = shorter sequences; embedding matrix grows |
| Special tokens | <bos>, <eos>, <pad>, <unk> | Must match exactly at fine-tuning time |
| Tokeniser algorithm | BPE, SentencePiece, Unigram | Different tokenisers → different token counts for same text |
| Worst-case tokenisation | 1 char = 1 token | Code and languages with rare scripts inflate token counts |
The tokeniser is separate from the model weights. A mismatch between the tokeniser used during training and the one used at inference produces garbage — this is a common distractor in fine-tuning questions.
Token embeddings are a trainable matrix E ∈ ℝ^(V × d_model). Each row is the dense vector representation of one vocabulary token. The forward pass is a simple lookup: given token index i, the embedding is the i-th row of E.
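In code the lookup really is row indexing; a minimal PyTorch sketch (V = 32 000 and d_model = 4096 are assumed values):

```python
import torch

V, d_model = 32_000, 4096
E = torch.nn.Embedding(V, d_model)           # trainable matrix E in R^(V x d_model)

token_ids = torch.tensor([[17, 3022, 911]])  # (batch=1, seq=3) indices from the tokeniser
x = E(token_ids)                             # (1, 3, 4096): row i of E for each index i
assert torch.equal(x[0, 0], E.weight[17])    # the "lookup" is just row selection
```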
Attention is permutation-equivariant — shuffle the input and the output shuffles identically. Positional information must be injected explicitly. Three dominant strategies:
| Method | Where applied | Mechanism | Length extrapolation | Used in |
|---|---|---|---|---|
| Absolute sinusoidal | Added to embeddings | Fixed sin/cos waves, no learned params | Poor in practice | Original Transformer |
| Learned absolute | Added to embeddings | Trainable pos. embedding matrix | None beyond max length | GPT-2, BERT |
| RoPE | Rotates Q and K vectors | Relative position via complex rotation; dot product depends only on i−j | Good (YaRN/ABF) | LLaMA, Mistral, Qwen |
| ALiBi | Bias added to attention scores (after dot product) | Linear penalty −m·|i−j| per head | Very good | MPT, BLOOM |
RoPE modifies Q and K vectors before the dot product via rotation. ALiBi adds a distance-based linear bias after the dot product but before softmax. This distinction is the most common NCP exam question on positional encoding.
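A minimal RoPE sketch using the split-halves pairing convention (one of several in use); production code precomputes cos/sin caches and adds context-extension scaling such as YaRN:

```python
import torch

def rope(x, positions, base=10_000.0):
    """Rotate channel pairs of x (seq, d) by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # per-pair frequencies
    angles = positions[:, None].float() * freqs[None, :]                # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q, k = torch.randn(8, 128), torch.randn(8, 128)   # (seq=8, d_k=128)
pos = torch.arange(8)
q_rot, k_rot = rope(q, pos), rope(k, pos)         # applied to Q and K only, before the dot product
# q_rot @ k_rot.T depends only on the relative offset i - j, not on absolute positions.
```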
The central computation in every decoder block. Given input X ∈ ℝ^(seq × d_model), three linear projections produce queries, keys, and values:
Q = X W_Q -- W_Q ∈ ℝ^(d_model × d_k)
K = X W_K -- W_K ∈ ℝ^(d_model × d_k)
V = X W_V -- W_V ∈ ℝ^(d_model × d_v)
Attention(Q, K, V) = softmax( Q Kᵀ / √d_k + M ) · V
-- M is the causal mask: -inf for j > i, 0 otherwise
-- Scaling by 1/√d_k prevents softmax saturation at large d_k
When vectors of dimension dₖ are initialised with unit variance, their dot product has variance dₖ. Without scaling, large dₖ values push dot products to large magnitudes, driving softmax towards a near-one-hot distribution — gradients vanish. Dividing by √dₖ restores unit variance at the softmax input.
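A quick numerical check of the variance argument (unit-variance random vectors, d_k = 128):

```python
import numpy as np

d_k = 128
rng = np.random.default_rng(0)
q = rng.standard_normal((10_000, d_k))    # unit-variance query vectors
k = rng.standard_normal((10_000, d_k))    # unit-variance key vectors

scores = (q * k).sum(axis=1)              # 10 000 raw dot products
print(scores.var())                       # ~ d_k ~ 128
print((scores / np.sqrt(d_k)).var())      # ~ 1 after scaling by 1/sqrt(d_k)
```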
The decoder is autoregressive — position i may only attend to positions ≤ i. The upper-triangular mask sets logits for future positions to −∞; after softmax these become 0. This is implemented as a single triangular tensor added to the attention scores before softmax.
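A minimal single-head implementation combining the scaling and the mask (PyTorch 2.x exposes the same computation fused as F.scaled_dot_product_attention(..., is_causal=True)):

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.
    q, k: (seq, d_k); v: (seq, d_v). Sketch only: no batching, dropout, or KV cache."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5            # (seq, seq), scaled by 1/sqrt(d_k)
    seq = scores.shape[-1]
    future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))       # -inf wherever j > i
    weights = F.softmax(scores, dim=-1)                       # future positions become exactly 0
    return weights @ v                                        # (seq, d_v)
```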
The 1/√dₖ scaling is for variance stability at the softmax input, not for normalising the output magnitude of the projection. Exam distractors frequently conflate these.
Rather than one large attention computation, the model runs h parallel attention heads, each with d_k = d_model / h. The outputs are concatenated and projected:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W_O
headᵢ = Attention( Q W_Qᵢ, K W_Kᵢ, V W_Vᵢ )
Multiple heads allow the model to attend to different aspects of the context simultaneously — different syntactic roles, different semantic relations — within the same layer. Splitting into heads does not change the parameter count: the attention block holds ≈ 4 × d_model² weights across the Q, K, V, and O projections regardless of h.
| Variant | Query heads | Key heads | Value heads | Notes |
|---|---|---|---|---|
| MHA (multi-head) | h | h | h | Baseline. Large KV cache. |
| GQA (grouped-query) | h | g (g < h) | g (g < h) | h/g queries share each KV group. KV cache shrinks by h/g. Used in LLaMA 3, Mistral. |
| MQA (multi-query) | h | 1 | 1 | Extreme case. Maximum KV saving; slight quality cost. |
GQA reduces the KV cache memory footprint by a factor of h/g without reducing query expressiveness. This is the primary architectural lever for fitting longer context on consumer GPUs.
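A sketch of the grouping mechanism; setting n_kv_heads = n_heads recovers standard MHA and n_kv_heads = 1 gives MQA. The repeated K/V tensors are materialised here only for clarity; the cache itself stores just the n_kv_heads copies:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_heads, n_kv_heads):
    """Grouped-query attention sketch. q: (seq, n_heads*d_k); k, v: (seq, n_kv_heads*d_k)."""
    seq, d_k = q.shape[0], q.shape[1] // n_heads
    q = q.view(seq, n_heads, d_k).transpose(0, 1)             # (n_heads,    seq, d_k)
    k = k.view(seq, n_kv_heads, d_k).transpose(0, 1)          # (n_kv_heads, seq, d_k)
    v = v.view(seq, n_kv_heads, d_k).transpose(0, 1)
    group = n_heads // n_kv_heads                             # h/g query heads share one KV head
    k = k.repeat_interleave(group, dim=0)                     # expand KV heads to match query heads
    v = v.repeat_interleave(group, dim=0)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(0, 1).reshape(seq, n_heads * d_k)    # concat heads back together
```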
The FFN occupies roughly two-thirds of the total parameter count in a standard decoder-only model. It operates independently on each token position — no cross-token interaction.
FFN(x) = W₂ · ReLU( W₁ x + b₁ ) + b₂
d_ff = 4 × d_model -- Transformer base: 2048 = 4 × 512
FFN(x) = W₃ · ( W₁ x ⊙ SiLU( W₂ x ) )   -- ⊙ = element-wise (gated) product
d_ff ≈ (8/3) × d_model -- keeps param count ≈ 4× ReLU FFN
-- SiLU(x) = x · sigmoid(x) [smooth gating]
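A LLaMA-style gated FFN sketch matching the formula above (attribute names are illustrative, not any library's):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """down( up(x) * SiLU(gate(x)) ), no biases."""
    def __init__(self, d_model=4096, d_ff=11008):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)       # W1
        self.gate = nn.Linear(d_model, d_ff, bias=False)     # W2
        self.down = nn.Linear(d_ff, d_model, bias=False)     # W3
    def forward(self, x):
        return self.down(self.up(x) * F.silu(self.gate(x)))  # element-wise gated product

ffn = SwiGLUFFN()
print(sum(p.numel() for p in ffn.parameters()))              # 135,266,304 ≈ 135 M per block
```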
For a single SwiGLU FFN block (LLaMA-style, no biases):
params_ffn = 3 × d_model × d_ff -- gate, up, and down projection matrices
= 3 × 4096 × 11008 -- LLaMA-2 7B (d_ff = 11008 ≈ 8/3 × 4096)
≈ 135 M parameters per FFN block
× 32 blocks ≈ 4.3 B params from FFN alone (roughly two-thirds of the 6.7 B total)
SwiGLU consistently outperforms ReLU and GELU at the same FLOP budget in ablations. The gating mechanism allows the network to selectively suppress or amplify features. The (8/3) factor is derived to keep total FLOPs comparable to a 4× ReLU FFN at identical d_model.
During autoregressive generation, token t requires attention against all previous tokens. The KV cache stores the K and V projections for all past positions, so only the new token's Q, K, V need computing at each step.
KV_bytes = 2 × N_layers × N_kv_heads × d_k × seq_len × bytes_per_element
-- 7B model example (LLaMA-2 style, FP16):
-- N_layers = 32, N_kv_heads = 32, d_k = 128, seq_len = 4096, bytes = 2
KV_bytes = 2 × 32 × 32 × 128 × 4096 × 2 = 2,147,483,648 B ≈ 2 GB
-- With GQA (N_kv_heads = 8, as in LLaMA-3 8B):
KV_bytes = 2 × 32 × 8 × 128 × 4096 × 2 ≈ 0.5 GB -- 4× saving
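The same arithmetic as a small helper, reproducing the figures above:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_k, seq_len, bytes_per_elem=2, batch=1):
    # factor 2 = one K tensor plus one V tensor per layer
    return 2 * n_layers * n_kv_heads * d_k * seq_len * bytes_per_elem * batch

print(kv_cache_bytes(32, 32, 128, 4096) / 2**30)    # 2.0  GiB  (LLaMA-2 7B style, FP16)
print(kv_cache_bytes(32, 8, 128, 4096) / 2**30)     # 0.5  GiB  (GQA, 8 KV heads)
print(kv_cache_bytes(32, 32, 128, 32_768) / 2**30)  # 16.0 GiB  (32K context, 32 KV heads)
```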
At sequence length 32 768 (32K context), the KV cache for a 7B model with 32 KV heads reaches 16 GB in FP16 — exceeding the total VRAM of an RTX 4000 Ada (20 GB) once model weights are also loaded. GQA and 8-bit KV quantisation are the primary mitigations.
| Technique | KV memory reduction | Quality impact |
|---|---|---|
| GQA (h=32 → 8) | 4× | Negligible |
| MQA (h=32 → 1) | 32× | Small quality loss |
| KV quantisation (FP16 → INT8) | 2× | Very small |
| Paged attention (vLLM) | Removes fragmentation waste | None |
KV cache quantisation and paged attention are covered in detail in NVIDIA_GPU_19_TensorRT_LLM and cheatsheets/quantisation_and_kv_cache.md.
Always select the highest-probability token: argmax over logits. Deterministic. Fast. Prone to degenerate repetition on open-ended generation tasks.
Sample from the probability distribution. Temperature τ divides logits before softmax — τ<1 sharpens, τ>1 flattens. Top-p (nucleus) samples from the smallest set of tokens summing to p. Top-k restricts to the k highest-probability tokens.
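A sketch applying temperature, top-k, and top-p filtering to a single logits vector (the thresholds shown are illustrative defaults, not canonical values):

```python
import numpy as np

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature   # tau < 1 sharpens, tau > 1 flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                               # tokens by descending probability
    keep = order[:top_k]                                          # top-k cut
    cum = np.cumsum(probs[keep])
    keep = keep[: np.searchsorted(cum, top_p) + 1]                # smallest set with mass >= top_p
    p = probs[keep] / probs[keep].sum()                           # renormalise over survivors
    return int(rng.choice(keep, p=p))
```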
A small draft model generates candidate tokens cheaply; the large target model verifies them in parallel. Accepted tokens are kept; the first rejected token is resampled. Speedup 2–4× on aligned draft/target pairs, with exact output distribution preserved.
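A heavily simplified sketch of the accept/reject loop, assuming hypothetical `draft_sample`, `draft_probs_fn`, and `target_probs_fn` interfaces; real implementations score all draft tokens in a single batched target forward pass:

```python
import numpy as np

def speculative_step(draft_sample, draft_probs_fn, target_probs_fn, prefix, gamma=4):
    """One round: the draft proposes gamma tokens, the target keeps a verified prefix.
    *_probs_fn(context) -> probability vector over the vocabulary (hypothetical interface)."""
    rng = np.random.default_rng()
    ctx, proposed = list(prefix), []
    for _ in range(gamma):                         # cheap draft model proposes gamma tokens
        tok = draft_sample(ctx)
        proposed.append(tok); ctx.append(tok)

    ctx, accepted = list(prefix), []
    for tok in proposed:
        p = target_probs_fn(ctx)                   # large target model (batched in practice)
        q = draft_probs_fn(ctx)                    # draft distribution at the same position
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok); ctx.append(tok)  # accept: keep the draft token
        else:
            residual = np.maximum(p - q, 0.0)      # reject: resample from the corrected residual,
            residual /= residual.sum()             # which preserves the exact target distribution
            accepted.append(int(rng.choice(len(p), p=residual)))
            break                                  # everything after the first rejection is discarded
    return accepted
```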
Full treatment of sampling parameters and speculative decoding in cheatsheets/sampling_and_decoding.md.
| Architecture | Self-attention mask | Cross-attention | Typical use | Examples |
|---|---|---|---|---|
| Encoder-only | Bidirectional (no causal mask) | None | Classification, embedding, NER, sentence similarity | BERT, RoBERTa, E5, BGE |
| Encoder-decoder | Encoder: bidirectional; Decoder: causal | Decoder attends to encoder output | Translation, summarisation, seq2seq | T5, BART, mT5 |
| Decoder-only | Causal (autoregressive) | None (or prefix attention) | Language modelling, instruction following, generation | GPT-4, LLaMA, Mistral, Gemma |
The original Transformer (Vaswani et al., 2017) was encoder-decoder for machine translation. The shift to decoder-only at scale reflects empirical evidence that next-token prediction on a diverse corpus is a more scalable pretraining objective than seq2seq on task-specific data.
Encoder-only models are not autoregressive and cannot generate text in the decoder-only sense. They are used for dense representations (embeddings for RAG retrieval, sentence classification). A question asking "which architecture is best for semantic search?" points to encoder-only (bi-encoder). "Which for instruction following?" points to decoder-only.
Replaces the dense FFN with E expert FFNs. A gating network selects top-K experts per token. Total parameters grow but active parameters (FLOPs per token) stay constant.
Canonical examples: Switch Transformer, Mixtral 8×7B, DeepSeek-V2.
Trade-offs: load balancing (expert collapse), communication overhead in multi-GPU deployments, expert routing as a source of training instability.
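A routing sketch with dense per-expert compute for clarity (production kernels dispatch tokens sparsely and add an auxiliary load-balancing loss; the module layout is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Top-K routed MoE FFN: every token runs through only top_k of n_experts expert FFNs."""
    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = self.router(x)                             # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)      # top-K experts per token
        weights = F.softmax(weights, dim=-1)                # renormalise over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = idx[:, slot] == e                  # tokens sent to expert e in this slot
                if routed.any():
                    out[routed] += weights[routed, slot, None] * expert(x[routed])
        return out
```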
Selective state-space models replace attention with a recurrent formulation. Time complexity O(n) vs O(n²) for attention.
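The recurrence itself, as a minimal (non-selective) linear scan; selective SSMs such as Mamba additionally make A, B, C functions of the input:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """h_t = A h_{t-1} + B x_t ;  y_t = C h_t  -- one pass over the sequence, O(n)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                    # x: (seq, d_in)
        h = A @ h + B @ x_t          # a fixed-size state summarises the whole history
        ys.append(C @ h)
    return np.stack(ys)              # (seq, d_out)
```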
Weakness: less effective at sharp content-based retrieval over long context compared to attention. Hybrids (Jamba, Zamba) address this.
Interleave attention layers (global, content-based) with SSM or convolution layers (cheap local processing). Field is active.
Decoder-only transformers with RoPE and GQA remain the majority of production deployments as of 2026.
Full technical treatment of MoE architecture, routing algorithms, and training instabilities in LLM_Hub_Modern_Architectures. MoE VRAM and routing in Arch_01_MoE.
The following are the six most frequently examined points drawn from notes/02_transformer_architecture.md and community NCA/NCP reports:
DPO eliminates the reward model but keeps the reference model. LoRA adds zero inference overhead after merging. S-LoRA is for multi-adapter serving, not single-adapter speed. These cross into the fine-tuning domain but are rooted in the same architecture understanding.