Inside the Transformer Decoder

A visual, step-by-step walkthrough of every operation that transforms raw text into language model predictions — from tokenization to sampling.


The Big Picture

A decoder-only transformer (like GPT, LLaMA, and Claude's architecture) is an autoregressive language model. It takes a sequence of tokens and predicts the next token, one at a time. The entire model is a deep stack of repeated blocks, each containing self-attention and feed-forward layers.

Here's the full pipeline at a glance:

Notation we'll use: We denote the model dimension as d_model (e.g., 768), number of attention heads as h (e.g., 12), head dimension as d_k = d_model / h (e.g., 64), vocabulary size as V (e.g., 50,257), and sequence length as n.

The key insight: the decoder block (attention → add & norm → FFN → add & norm) is repeated N times — typically 12 to 96+ layers. Each layer refines the representation. The final output is projected to vocabulary-sized logits and sampled to produce the next token.

Input token IDs → token embedding + positional encoding → [repeated × N layers: masked multi-head self-attention → add & LayerNorm (residual) → feed-forward network (FFN) → add & LayerNorm (residual)] → final LayerNorm → linear → softmax → next token

Tokenization

The model can't read characters — it operates on integer token IDs. A tokenizer (like BPE — Byte-Pair Encoding) splits input text into subword tokens and maps each to an integer ID from a fixed vocabulary.

How BPE Works

BPE starts with individual bytes/characters and iteratively merges the most frequent adjacent pair into a new token. After training, we get a merge table: an ordered list of pair-merge rules. At inference, the tokenizer applies these rules greedily to split text into known subwords.

Input text:
"The cat sat on the mat"
Key detail: Common words like "the" get a single token, but rare or long words get split: "unbelievable" → ["un", "believ", "able"]. Spaces are often part of the token (e.g., " the" with a leading space is one token). The resulting vocabulary size V is typically 32K–128K tokens.
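To make the greedy merge procedure concrete, here is a minimal sketch. The merge table and the word split are invented for illustration; real BPE implementations learn merges from corpus statistics, work at the byte level, and pick the highest-priority applicable merge at each step, which this per-rule sweep approximates.

```python
def bpe_tokenize(word, merges):
    """Greedily apply an ordered list of pair-merge rules to one word."""
    parts = list(word)                      # start from individual characters
    for a, b in merges:                     # rules in the order they were learned
        i = 0
        while i < len(parts) - 1:
            if parts[i] == a and parts[i + 1] == b:
                parts[i:i + 2] = [a + b]    # merge the adjacent pair in place
            else:
                i += 1
    return parts

# Hypothetical merge table (a real one is learned during tokenizer training):
merges = [("u", "n"), ("a", "b"), ("l", "e"), ("ab", "le")]
print(bpe_tokenize("unable", merges))  # → ['un', 'able']
```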

The Token Embedding Table

After tokenization, each integer ID is used as an index into the embedding matrix W_E of shape V × d_model. This is a simple lookup — row i of the matrix is the embedding vector for token i. No computation is needed; it's a table lookup.

Token Embeddings

Each token ID indexes into a learned embedding table W_E ∈ ℝ^(V×d). The result is a dense vector of dimension d_model for each token, transforming the discrete token ID into a continuous representation that the model can do linear algebra on.

x_i = W_E[token_id_i]    →    shape: d_model
Symbol breakdown:
x_i — the resulting embedding vector for the i-th token in the sequence
W_E — the embedding weight matrix of shape V × d_model (learned during training)
token_id_i — the integer ID of the i-th token, used as a row index into W_E
[ ] — table lookup (indexing): select the row at that index, no multiplication needed
d_model — the dimensionality of each embedding vector (e.g. 768)

For a sequence of n tokens, we stack these into a matrix X ∈ ℝ^(n×d). Each row is a token's embedding vector.

For example, the token IDs [464, 3797, 3332, 319, 262] for "The", " cat", " sat", " on", " the" each select one row of W_E [V × d_model] (e.g. row 464 might be [0.12, −0.34, 0.71, 0.02, …, −0.15]); stacking the five selected rows gives X [n × d_model], one embedding vector per token.

These embedding vectors are learned parameters — they are initialized randomly and updated through training via backpropagation. Semantically similar tokens end up with similar vectors.
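As code, the lookup is a single indexing operation. A tiny sketch with made-up sizes and random (untrained) values; real tables have V ≈ 50K rows of learned d_model ≈ 768+ dimensional vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model = 10, 4
W_E = rng.normal(size=(V, d_model))   # embedding table, one row per token ID

token_ids = np.array([4, 7, 3])       # hypothetical IDs for a 3-token sequence
X = W_E[token_ids]                    # pure row indexing; no matrix multiply
print(X.shape)                        # (3, 4): n × d_model
```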

The Embedding Space

The embedding table doesn't just assign arbitrary numbers to tokens — through training, it organizes tokens into a high-dimensional geometric space where distance and direction encode meaning. This space is the foundation everything else builds on.

What the Dimensions Represent

Each of the d_model dimensions in an embedding vector doesn't have a single interpretable meaning. Instead, meaning is distributed across many dimensions simultaneously. A single dimension might partially encode part-of-speech, formality, topic, sentiment, and dozens of other features all at once. No one dimension is "the noun dimension" — it's the pattern across all dimensions that creates the representation.

Why so many dimensions? With d_model = 768, each token lives in a 768-dimensional space. This high dimensionality gives the model enormous capacity to represent fine-grained distinctions. In lower dimensions, different meanings would be forced to overlap; more dimensions means more room to separate concepts.

Similarity = Dot Product

The model measures how related two tokens are by computing their dot product (or equivalently, cosine similarity). If two embedding vectors point in similar directions, the model considers those tokens related. This is the same operation used later in attention (Q·K), so the geometry of embeddings directly shapes how attention flows.

similarity(a, b) = (a · b) / (‖a‖ · ‖b‖) = cos(θ)
Symbol breakdown:
a, b — two embedding vectors being compared
a · b — dot product: Σ aᵢbᵢ (sum of element-wise products)
‖a‖ — the L2 norm (magnitude) of vector a: √(Σ aᵢ²)
cos(θ) — the cosine of the angle between the two vectors: +1 = identical direction, 0 = orthogonal (unrelated), −1 = opposite
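A minimal implementation of the formula above, on toy 2-D vectors for intuition (real embeddings are learned and high-dimensional):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(θ) between two vectors: (a · b) / (‖a‖ · ‖b‖)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   # 1.0: same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))   # 0.0: orthogonal
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0: opposite
```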

Linear Substructure

Famously, well-trained embedding spaces exhibit linear relationships: the direction from "king" to "queen" is approximately the same as the direction from "man" to "woman". These consistent offset vectors mean the model can reason by analogy — semantic relationships are encoded as geometric translations.

W_E["king"] − W_E["man"] + W_E["woman"] ≈ W_E["queen"]
Symbol breakdown:
W_E["king"] — the embedding vector for "king", looked up from the embedding table
− — vector subtraction (element-wise): extracts the direction from "man" to "king"
+ — vector addition (element-wise): applies that direction starting from "woman"
≈ — approximately equal: the resulting vector lands near "queen" in embedding space

These relationships emerge automatically from training — the model is never told about gender or royalty. It discovers these regularities from predicting the next token across billions of training examples.

The Residual Stream

The initial embedding vector enters what's called the residual stream — a conceptual pathway that runs through the entire model. Each layer (attention, FFN) reads from this stream and adds to it. The embedding is never overwritten; it accumulates information as it passes through layers. By the final layer, the vector at each position has been enriched with contextual information from the entire sequence.

embed ("raw meaning") → + attn L1 → + ffn L1 → + attn L2 → + ffn L2 → … → output ("rich context"). Each + is a residual connection: stream = stream + sub_layer(stream). The vector is enriched at each layer, never overwritten.

Positional Encoding

Self-attention is permutation-invariant — it doesn't know the order of tokens. We must inject position information. There are two main approaches:

Absolute Positional Embeddings (GPT-2 style)

A second learned embedding table W_P ∈ ℝ^(n_max × d) maps each position index to a vector. The position embedding is simply added element-wise to the token embedding:

h⁰_i = W_E[token_i] + W_P[i]
Symbol breakdown:
h⁰_i — the initial hidden state for position i (superscript 0 = input to layer 0)
W_E[token_i] — the token embedding vector looked up from the embedding table
W_P[i] — the positional embedding vector for position i, looked up from a second learned table of shape n_max × d_model
+ — element-wise addition: the token meaning and position information are combined into a single vector
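The combination is one lookup per table plus an element-wise add. A sketch with toy sizes and random tables (in practice both W_E and W_P are learned):

```python
import numpy as np

rng = np.random.default_rng(1)
V, n_max, d = 10, 8, 4                # toy sizes for illustration
W_E = rng.normal(size=(V, d))         # token embedding table
W_P = rng.normal(size=(n_max, d))     # positional embedding table

token_ids = np.array([4, 7, 3])
positions = np.arange(len(token_ids))        # 0, 1, 2
h0 = W_E[token_ids] + W_P[positions]         # h⁰_i = W_E[token_i] + W_P[i]
print(h0.shape)                              # (3, 4)
```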

Rotary Position Embeddings (RoPE — LLaMA style)

RoPE doesn't add to embeddings. Instead, it rotates query and key vectors in attention by an angle proportional to their position. For a pair of dimensions (2j, 2j+1) at position m:

RoPE(x, m)_{2j} = x_{2j} cos(mθ_j) − x_{2j+1} sin(mθ_j)
RoPE(x, m)_{2j+1} = x_{2j} sin(mθ_j) + x_{2j+1} cos(mθ_j)
Symbol breakdown:
x — a query or key vector to be position-encoded
m — the absolute position index of this token in the sequence (0, 1, 2, …)
j — the dimension pair index: dimensions are processed in pairs (2j, 2j+1)
x_{2j}, x_{2j+1} — the two scalar values from dimension pair j
θ_j — the rotation frequency for pair j: θ_j = 10000^(−2j/d) — higher dimensions rotate slower
mθ_j — the rotation angle: position × frequency. Further-apart positions get larger angle differences
cos, sin — the rotation is a standard 2D rotation matrix applied to each dimension pair

where θ_j = 10000^(−2j/d). This encodes relative positions: the dot product between two RoPE-rotated vectors depends only on their distance, not absolute positions.

Why rotation? When computing q_m · k_n, the rotation angles combine as cos((m−n)θ), so the attention score depends on the relative distance (m − n) between tokens, not their absolute positions. This gives better generalization to unseen sequence lengths.
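The per-pair rotation can be sketched directly from the formulas above. This loop version is for clarity only; real implementations vectorize it and precompute the cos/sin tables:

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate one query/key vector x by position m, pair by pair."""
    d = x.shape[-1]
    out = x.astype(float).copy()
    for j in range(d // 2):
        theta = base ** (-2.0 * j / d)        # θ_j: higher pairs rotate slower
        c, s = np.cos(m * theta), np.sin(m * theta)
        x0, x1 = x[2 * j], x[2 * j + 1]
        out[2 * j] = x0 * c - x1 * s          # 2D rotation of pair (2j, 2j+1)
        out[2 * j + 1] = x0 * s + x1 * c
    return out

q, k = np.random.default_rng(2).normal(size=(2, 8))
s1 = rope(q, 5) @ rope(k, 3)    # absolute positions (5, 3): distance 2
s2 = rope(q, 9) @ rope(k, 7)    # absolute positions (9, 7): distance 2
print(np.isclose(s1, s2))       # True: the score depends only on the distance
```

The final check illustrates the relative-position property: the two scores use different absolute positions but the same distance, so they match.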

Masked Multi-Head Self-Attention

This is the core operation. Attention lets each token look at all previous tokens and decide how much to "attend" to each one. It's the mechanism by which context flows through the sequence.

Step 1: Compute Q, K, V projections

For each attention head i, we have three learned weight matrices: W^Q_i, W^K_i, W^V_i each of shape d_model × d_k. We project the input:

Q_i = X · W^Q_i     K_i = X · W^K_i     V_i = X · W^V_i
each of shape [n × d_k]
Symbol breakdown:
X — the input matrix of shape n × d_model (all token hidden states stacked as rows)
W^Q_i, W^K_i, W^V_i — learned weight matrices for head i, each of shape d_model × d_k
Q_i — Query matrix for head i: "what is each token looking for?"
K_i — Key matrix for head i: "what does each token advertise?"
V_i — Value matrix for head i: "what content does each token carry?"
· — matrix multiplication: each token's d_model-dim vector is projected down to d_k dimensions
d_k — head dimension = d_model / h (e.g. 768/12 = 64)
Q, K, V intuition: Think of it like a search engine. The Query is "what am I looking for?", the Key is "what do I contain?", and the Value is "what information do I give if selected?" Attention scores are Q·K similarity; the output is a weighted sum of V.

Step 2: Compute Attention Scores

The raw attention scores are the dot product of queries and keys, scaled by √d_k to prevent the softmax from saturating for large dimensions:

scores = (Q · Kᵀ) / √d_k     shape: n × n
Symbol breakdown:
Q — Query matrix of shape n × d_k (one row per token)
Kᵀ — Key matrix transposed to shape d_k × n (the ᵀ means transpose: swap rows and columns)
Q · Kᵀ — matrix multiply producing an n × n matrix where entry (i, j) is the dot product of query_i and key_j
√d_k — square root of the head dimension (e.g. √64 = 8), used to scale down the dot products
/ — element-wise division: every score is divided by √d_k to prevent softmax saturation

Step 3: Apply the Causal Mask

This is what makes it a decoder. We apply a mask that sets all positions where j > i (future tokens) to −∞. After softmax, these become 0 — each token can only attend to itself and previous tokens:

Visually, the n × n score matrix goes through three stages: the raw scores before masking, the scores after masking (−∞ applied above the diagonal), and the attention weights after softmax (each row summing to 1, with zeros in the masked positions).

Step 4: Weighted Sum of Values

Multiply the attention weights by the value vectors to get the output of this head:

head_i = softmax(mask(Q_i · K_iᵀ / √d_k)) · V_i     shape: n × d_k
Symbol breakdown:
head_i — the output of attention head i, shape n × d_k
Q_i · K_iᵀ / √d_k — scaled attention scores (see above)
mask() — applies the causal mask: sets future positions (j > i) to −∞
softmax() — converts each row of scores into a probability distribution (sums to 1), with −∞ entries becoming 0
· V_i — matrix multiply with Values: each token's output is a weighted average of all value vectors, weighted by attention probabilities

Step 5: Concatenate Heads & Project

All h heads are concatenated along the last dimension and multiplied by an output projection matrix W^O ∈ ℝ^(d_model × d_model):

MultiHead(X) = Concat(head_1, ..., head_h) · W^O
Symbol breakdown:
head_1, ..., head_h — outputs from all h attention heads, each of shape n × d_k
Concat() — concatenate along the last dimension: h heads of d_k → one tensor of d_model (since h × d_k = d_model)
W^O — output projection matrix of shape d_model × d_model (learned), mixes information across heads
· — matrix multiplication: final output has shape n × d_model, same as the input
X [n × d] → multiply by W_Q, W_K, W_V to get Q, K, V → split into h heads → per head: Q · Kᵀ / √d_k → mask + softmax → attention weights × V → concat all heads → × W^O → output

Multi-Head Intuition

Each head can learn to attend to different things — one head might focus on the previous word, another on the subject of the sentence, another on syntactic structure. The concatenation combines these different "views" into a unified representation.
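Steps 1–5 fit in a short NumPy sketch with random (untrained) weights. For clarity each head is computed in a loop rather than batched, and the per-head projections are taken as column slices of full-width matrices:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_multihead_attention(X, Wq, Wk, Wv, Wo, h):
    n, d_model = X.shape
    d_k = d_model // h
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i
    heads = []
    for i in range(h):
        cols = slice(i * d_k, (i + 1) * d_k)       # this head's slice
        Q, K, V = X @ Wq[:, cols], X @ Wk[:, cols], X @ Wv[:, cols]
        scores = Q @ K.T / np.sqrt(d_k)            # step 2: scaled dot products
        scores[future] = -np.inf                   # step 3: causal mask
        heads.append(softmax(scores) @ V)          # step 4: weighted sum of values
    return np.concatenate(heads, axis=-1) @ Wo     # step 5: concat + project

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = [0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = causal_multihead_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (5, 16): same shape as the input
```

A quick sanity check of causality: perturbing the last token leaves the outputs at all earlier positions unchanged, because the mask hides it from them.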

Feed-Forward Network (FFN)

After attention, each token's representation passes (independently) through a two-layer feed-forward network. This is where much of the model's "knowledge" is stored — the weight matrices act as a key-value memory.

Standard FFN (GPT-2)

FFN(x) = GELU(x · W_1 + b_1) · W_2 + b_2
Symbol breakdown:
x — the input vector for a single token, shape d_model
W_1 — first weight matrix, shape d × 4d (expands the dimension by 4×)
b_1 — first bias vector, shape 4d (added after the first projection)
GELU() — activation function applied element-wise: GELU(z) = z · Φ(z) where Φ is the standard normal CDF
W_2 — second weight matrix, shape 4d × d (projects back down to model dimension)
b_2 — second bias vector, shape d
· — matrix multiplication (not element-wise)

Where W_1 ∈ ℝ^(d × 4d) expands the dimension by 4×, and W_2 ∈ ℝ^(4d × d) projects back down. The inner dimension 4d is called the intermediate size.

SwiGLU FFN (LLaMA, modern models)

Modern models use a gated variant called SwiGLU, which uses three weight matrices:

FFN_SwiGLU(x) = (Swish(x · W_gate) ⊙ (x · W_up)) · W_down
Symbol breakdown:
x — the input vector for a single token, shape d_model
W_gate — gating weight matrix, shape d × d_ff (produces the gate signal)
W_up — up-projection weight matrix, shape d × d_ff (produces the value signal)
W_down — down-projection weight matrix, shape d_ff × d (projects back to model dimension)
Swish(z) — activation function: z · σ(z) where σ is the sigmoid function 1/(1 + e^(−z))
⊙ — element-wise (Hadamard) multiplication: the gate output controls how much of the up-projection passes through
· — matrix multiplication

where ⊙ is element-wise multiplication. The gate controls information flow, improving training stability and performance. The Swish activation is Swish(x) = x · σ(x).

x [d = 768] → × W₁ [d → 4d = 3072] → GELU (a smooth ReLU) → × W₂ [4d → d = 768] → output [d]

GELU Activation Function

GELU (Gaussian Error Linear Unit) is a smooth approximation to ReLU. It's defined as GELU(x) = x · Φ(x) where Φ is the cumulative distribution function of the standard normal distribution. Unlike ReLU which has a hard cutoff at 0, GELU smoothly transitions, allowing small negative values through — which helps gradient flow during training.
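A minimal NumPy version of the standard FFN, using the exact GELU via the error function. Sizes and weights are illustrative, not from a trained model:

```python
import numpy as np
from math import erf, sqrt

def gelu(z):
    """Exact GELU: z · Φ(z), with Φ the standard normal CDF (via erf)."""
    return z * 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))

def ffn(x, W1, b1, W2, b2):
    """GPT-2-style FFN for one token vector: expand 4×, GELU, project back."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d = 8                                          # toy d_model; real models use 768+
W1, b1 = 0.1 * rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = 0.1 * rng.normal(size=(4 * d, d)), np.zeros(d)
x = rng.normal(size=d)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (8,): the FFN preserves the model dimension
```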

Layer Normalization & Residual Connections

Layer Normalization

LayerNorm normalizes each token's hidden state independently across the feature dimension. For a vector x ∈ ℝ^d:

LayerNorm(x) = γ ⊙ ((x − μ) / √(σ² + ε)) + β
Symbol breakdown:
x — the input vector (one token's hidden state, dimension d_model)
μ (mu) — the mean of all elements in x:  μ = (1/d) Σ xᵢ
σ² (sigma squared) — the variance of all elements in x:  σ² = (1/d) Σ (xᵢ − μ)²
ε (epsilon) — a tiny constant (e.g. 10⁻⁵) added inside the square root to prevent division by zero
γ (gamma) — a learned per-dimension scale vector of size d, initialized to ones
β (beta) — a learned per-dimension shift (bias) vector of size d, initialized to zeros
⊙ — element-wise (Hadamard) multiplication: each dimension is scaled independently by its corresponding γ value

Where μ and σ² are the mean and variance computed across the d_model dimensions of that single token, γ and β are learned scale and shift parameters (each of size d), and ε is a small constant (e.g., 10⁻⁵) for numerical stability.

Worked example for x = [2.0, −1.0, 0.5, 3.0, −0.5]:
① μ = (2.0 + (−1.0) + 0.5 + 3.0 + (−0.5)) / 5 = 0.8
② σ² = Σ(xᵢ − μ)² / 5 = (1.44 + 3.24 + 0.09 + 4.84 + 1.69) / 5 = 2.26
③ x̂ = (x − μ) / √(σ² + ε) = [0.80, −1.20, −0.20, 1.46, −0.86]
④ output = γ ⊙ x̂ + β (γ and β are learned per-dimension)
Each dimension is standardized to mean ≈ 0 and variance ≈ 1, then rescaled.

RMSNorm (Modern Variant)

LLaMA and many modern models use RMSNorm instead, which skips the mean subtraction and uses the root mean square:

RMSNorm(x) = γ ⊙ (x / √(mean(x²) + ε))
Symbol breakdown:
x — the input vector (one token's hidden state, dimension d_model)
x² — each element of x squared (element-wise)
mean(x²) — the average of all squared elements: (1/d) Σ xᵢ² (the root-mean-square statistic, without mean-centering)
ε (epsilon) — small stability constant (e.g. 10⁻⁵) to prevent division by zero
√ — the square root applied to the whole denominator
γ (gamma) — learned per-dimension scale vector of size d, initialized to ones
⊙ — element-wise multiplication: each dimension is scaled by its own γ value
Note: unlike LayerNorm, RMSNorm has no β (bias) and no mean subtraction — just scale

This is simpler (no bias β, no mean subtraction) and empirically works just as well, with less compute.
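Both normalizations fit in a few lines of NumPy, applied here to the same example vector as the LayerNorm walkthrough:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu, var = x.mean(), x.var()                   # statistics over the d_model dims
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    return gamma * x / np.sqrt(np.mean(x ** 2) + eps)   # no mean, no bias

x = np.array([2.0, -1.0, 0.5, 3.0, -0.5])
print(np.round(layer_norm(x, np.ones(5), np.zeros(5)), 2))
print(np.round(rms_norm(x, np.ones(5)), 2))
```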

Residual Connections

Every sub-layer (attention and FFN) is wrapped in a residual connection. The output of each sub-layer is added to its input before normalization:

output = LayerNorm(x + SubLayer(x))     (Post-Norm)
output = x + SubLayer(LayerNorm(x))     (Pre-Norm, more common)
Symbol breakdown:
x — the input to the sub-layer (a token's hidden state vector)
SubLayer() — either the attention block or the FFN block
x + SubLayer(x) — the residual connection: the sub-layer's output is added to its own input, creating a skip connection
LayerNorm() — layer normalization (see above)
Post-Norm: normalize after adding the residual — the original Transformer design
Pre-Norm: normalize before the sub-layer, then add the residual — more stable for very deep models, used by GPT-2+, LLaMA, etc.
Why residuals? They create "shortcuts" for gradients to flow backward through the network during training. Without them, very deep networks (96+ layers) would suffer from vanishing gradients. The residual stream also means each layer makes a small additive update to the representation, rather than fully overwriting it.
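The Pre-Norm wiring can be sketched as follows. The two sub-layers here are hypothetical stand-ins (simple linear maps), since the point is only the stream = stream + sub_layer(norm(stream)) pattern:

```python
import numpy as np

def rms_norm(x, eps=1e-5):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, attn_fn, ffn_fn):
    x = x + attn_fn(rms_norm(x))   # stream = stream + attn(norm(stream))
    x = x + ffn_fn(rms_norm(x))    # stream = stream + ffn(norm(stream))
    return x

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4))              # stand-in sub-layer weights
x = rng.normal(size=(3, 4))
y = pre_norm_block(x, lambda z: z @ W, lambda z: z @ W)
print(y.shape)  # (3, 4): the stream keeps its shape through every block
```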

Output Projection & Softmax

After all N decoder layers and a final LayerNorm, we have a hidden state h ∈ ℝ^(n × d). To predict the next token, we need to map this back to a probability distribution over the vocabulary.

The Unembedding (LM Head)

The final hidden state of the last token in the sequence is projected to vocabulary-sized logits by multiplying with the unembedding matrix. Many models tie weights, reusing the embedding matrix W_E:

logits = h_last · W_Eᵀ     shape: V
Symbol breakdown:
h_last — the hidden state of the last token in the sequence after all N decoder layers, shape d_model
W_E — the token embedding matrix of shape V × d_model (the same one used to embed input tokens, when weights are tied)
W_Eᵀ — W_E transposed to shape d_model × V (the ᵀ means transpose: swap rows and columns)
· — matrix multiplication: the dot product of the hidden state with each token's embedding produces one score per vocabulary entry
logits — the resulting vector of size V, with one raw unnormalized score per token in the vocabulary

Each logit value represents the model's raw (unnormalized) score for how likely each token in the vocabulary is to come next.

Temperature-Scaled Softmax

To convert logits to probabilities, we apply softmax with a temperature parameter T:

P(token_i) = exp(logit_i / T) / Σ_j exp(logit_j / T)
Symbol breakdown:
P(token_i) — the predicted probability of token i being the next token (between 0 and 1, all probabilities sum to 1)
logit_i — the raw score for token i from the output projection (can be any real number)
T — temperature: a positive scalar that controls distribution sharpness (T=1 is standard, T<1 sharper, T>1 flatter)
exp() — the exponential function e^(…), which maps any real number to a positive value
Σ_j — summation over all V tokens in the vocabulary (the normalizing denominator that ensures probabilities sum to 1)
Temperature intuition: T=1.0 is the standard softmax. T < 1 makes the distribution "sharper" (more confident, more deterministic). T > 1 makes it "flatter" (more uniform, more creative/random). T → 0 becomes argmax (greedy). T → ∞ becomes uniform random.
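The formula in code, with a standard max-subtraction for numerical stability (it cancels in the ratio, so the result is unchanged):

```python
import numpy as np

def temperature_softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()           # stability shift; does not change the output
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
for T in (0.5, 1.0, 2.0):     # sharper, standard, flatter
    print(T, np.round(temperature_softmax(logits, T), 3))
```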

Token Sampling & Generation

Once we have the probability distribution, we need to select the next token. There are several strategies:

Greedy Decoding

Simply pick the token with the highest probability: next_token = argmax(P). Fast but repetitive and boring — it always picks the same path.

Top-k Sampling

Keep only the top k most probable tokens, zero out the rest, renormalize, then sample from this truncated distribution. Typical k values: 40–100.

Top-p (Nucleus) Sampling

Keep the smallest set of tokens whose cumulative probability exceeds p. This adapts to the shape of the distribution — when the model is confident, fewer tokens are kept; when uncertain, more tokens are considered.
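Both truncation strategies as minimal NumPy sketches; `top_k_filter` and `top_p_filter` are hypothetical helper names for illustration, not from any particular library:

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    out = np.zeros_like(probs)
    idx = np.argsort(probs)[-k:]            # indices of the top-k entries
    out[idx] = probs[idx]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]         # most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1    # how many tokens to keep
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(top_k_filter(probs, 2))    # top two kept and renormalized: 0.625, 0.375
print(top_p_filter(probs, 0.8))  # keeps {0.5, 0.3}: cumulative mass reaches 0.8
```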

The KV Cache

A critical optimization: during autoregressive generation, we don't recompute attention for all previous tokens. Instead, we cache the Key and Value matrices from all previous positions. At each new step, we only compute Q, K, V for the new token, append K and V to the cache, and compute attention against the full cached K and V.

Step 1: process "The"; the cache holds K₁ V₁. Step 2: process " cat", computing Q, K, V only for this new token; the cache holds K₁ V₁ K₂ V₂, and Q₂ attends to [K₁, K₂]. Step 3: process " sat"; the cache holds K₁ V₁ K₂ V₂ K₃ V₃, and Q₃ attends to [K₁, K₂, K₃].
Without the KV cache, each step costs O(n²); with it, O(n). The cache trades memory for compute. KV cache memory per layer: 2 × n × d_k × h × sizeof(dtype) = 2 × n × d_model × bytes_per_param. For a 7B model at 4K context, that's roughly 1 GB of KV cache.

The Autoregressive Loop

Putting it all together, generation works as a loop:

// Autoregressive generation pseudocode
function generate(prompt_tokens, max_new_tokens):
  tokens = prompt_tokens
  kv_cache = empty

  for step in 1..max_new_tokens:
    // First pass: run the model on the whole prompt (prefill).
    // Later passes: only the newest token; the KV cache covers the rest.
    input = tokens if kv_cache is empty else tokens[-1:]
    logits, kv_cache = model(input, kv_cache)

    // Apply temperature
    logits = logits / temperature

    // Apply top-p filtering
    probs = softmax(logits)
    probs = top_p_filter(probs, p=0.9)

    // Sample next token
    next_token = sample(probs)

    // Append and continue
    tokens.append(next_token)

    if next_token == EOS_TOKEN: break

  return tokens

Complete Parameter Count

For a model with N layers, model dimension d_model, h heads, and vocabulary size V:
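The arithmetic can be sketched as follows. The breakdown (4d² for the attention projections, 8d² for the FFN, V·d for the tied embedding) is the standard weights-only estimate; biases, LayerNorm parameters, and learned position embeddings add a small correction:

```python
def param_count(N, d, V):
    """Approximate weights-only parameter count for a GPT-2-style model."""
    attn = 4 * d * d          # W_Q, W_K, W_V, W_O per layer
    ffn = 2 * d * 4 * d       # W_1 (d → 4d) and W_2 (4d → d) per layer
    embed = V * d             # token embedding table (tied with the LM head)
    return N * (attn + ffn) + embed

# GPT-2 small: N=12, d=768, V=50257
print(f"{param_count(12, 768, 50257) / 1e6:.0f}M")  # → 124M, matching GPT-2 small
```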

Prefill vs. Decode: The Two Phases of Inference

When you send a prompt to an LLM and it starts generating, the inference engine doesn't process every token the same way. The computation splits into two fundamentally different phases with very different performance characteristics.

Phase 1: Prefill (also called "prompt processing" or "context encoding"): all prompt tokens ("The cat sat on the …") go through a full transformer forward pass in parallel, populating the KV cache for every position. This phase is compute-bound (large matrix multiplies) with high GPU utilization (large batches).
Phase 2: Decode (also called "autoregressive generation"): one new token at a time (" mat") runs through the transformer with the KV cache to produce the next token, which is fed back as input. This phase is memory-bound (reading the KV cache) with low GPU utilization (tiny batches).

Phase 1: Prefill (Prompt Processing)

When you send a prompt — say, 500 tokens — the model processes all of them at once in a single forward pass. This is possible because attention between prompt tokens doesn't depend on generation order; we already have all the tokens. The causal mask still applies (token 300 can't attend to token 301), but the computation for all positions happens simultaneously via matrix multiplication.

During prefill, the model computes Q, K, V for every prompt token, runs attention, runs the FFN, and — critically — stores all K and V vectors into the KV cache. This is the expensive setup step.

Prefill is compute-bound. The main bottleneck is the raw arithmetic of matrix multiplications. GPUs are excellent at this — you're multiplying large matrices, and the compute units stay busy. Prefill throughput is measured in tokens per second and scales well with batch size.

The Transition

Prefill produces two things: (1) the KV cache containing K and V for all prompt positions across all layers, and (2) the logits for the last prompt token, from which the first generated token is sampled. From this point, the system switches to decode mode.

Phase 2: Decode (Token Generation)

Now the model generates tokens one at a time. For each new token, the forward pass is much smaller:

For each new token generated:
1. Embed the single new token → 1 vector of shape [1 × d]
2. Compute Q, K, V for just this one position → [1 × d_k] each
3. Append the new K, V to the KV cache → the cache grows by 1 row
4. Compute attention: the new Q against the entire cached K → [1 × (n+t)] scores
5. Take the weighted sum over the entire cached V using those attention weights
6. Run through the FFN, project to logits, and sample → 1 new token
7. Feed that token back as input → repeat from step 1
Decode is memory-bound. The arithmetic is tiny (one token's worth of matrix multiplies), but you must read the entire KV cache from GPU memory at every step. For a 7B model with 8K context, the KV cache can be several gigabytes. The GPU spends most of its time loading data, not computing. This is why generating 100 tokens takes much longer than processing 100 prompt tokens.

Performance Characteristics

Bottleneck: compute (FLOPS) for prefill; memory bandwidth (GB/s) for decode.
Tokens per step: all prompt tokens (hundreds or thousands) for prefill; 1 token for decode.
Matrix sizes: [n × d] × [d × d] (large, efficient) for prefill; [1 × d] × [d × d] (thin, inefficient) for decode.
GPU utilization: ~85% for prefill; ~5–15% for decode.
Latency feel: "time to first token" (TTFT) for prefill; "tokens per second" (TPS) for decode.
KV cache: prefill writes the cache (building it up); decode reads it (and it grows each step).
Optimizations: chunked prefill and Flash Attention for prefill; continuous batching and speculative decoding for decode.

Why This Matters

The prefill/decode split has major implications for how inference systems are designed:

Time to First Token (TTFT) is dominated by prefill. A long prompt means a longer wait before the model starts responding. This is why you see a pause before text starts streaming — the system is crunching through your entire prompt in parallel.

Tokens Per Second (TPS) is determined by decode speed. Each generated token requires reading the entire KV cache, so generation speed degrades as the output gets longer and the cache grows. A 4K-token response is slower per-token at the end than at the beginning.

Practical example: You send a 2,000-token prompt to a 70B model. Prefill processes all 2,000 tokens in ~0.8 seconds (parallel matrix multiply). Then decode generates at ~30 tokens/second (sequential, memory-bound). A 200-token response takes ~6.7 additional seconds. The user waits 0.8s for the first token, then sees text stream for ~6.7s.

Optimizations for Each Phase

For prefill: Flash Attention fuses the Q·K, mask, softmax, and ×V operations into a single GPU kernel, avoiding materializing the full n×n attention matrix in memory. Chunked prefill splits very long prompts into chunks to limit peak memory. Tensor parallelism splits weight matrices across GPUs.

For decode: Continuous batching groups decode steps from multiple requests to improve GPU utilization (instead of wasting capacity on a [1×d] multiply, batch 64 users' single-token decodes into a [64×d] multiply). Speculative decoding uses a small "draft" model to predict several tokens ahead, then verifies them in parallel with the big model — if the guesses are right (they often are), you get multiple tokens for the cost of one step. GQA/MQA (Grouped/Multi-Query Attention) reduce KV cache size by sharing keys and values across heads.

A small draft model (fast, approximate) guesses 4 tokens ahead: "the" ✓ "fluffy" ✓ "cat" ✓ "jumped" ✗. The big model verifies all 4 in parallel: "the" ✓ "fluffy" ✓ "cat" ✓ "sat" ✓, accepting the first 3 and correcting the 4th. Result: "the fluffy cat sat" was generated in 1 decode step instead of 4.

Chunked Prefill

For very long prompts (e.g. 100K tokens), the full n×n attention matrix won't fit in GPU memory. Chunked prefill processes the prompt in chunks (say, 8K tokens at a time), building the KV cache incrementally. Each chunk attends to all prior cached tokens plus itself. This bounds peak memory at the cost of slightly more compute.

KV Cache Memory Revisited

The KV cache is often the limiting factor for maximum context length and batch size. Its memory grows linearly with sequence length, number of layers, and model dimension:

KV_memory = 2 × N × n × d_model × bytes_per_param
Symbol breakdown:
2 — one set for Keys, one for Values
N — number of decoder layers (each layer has its own KV cache)
n — current sequence length (prompt + generated tokens so far)
d_model — model hidden dimension (or d_k × h for all heads combined)
bytes_per_param — 2 for FP16/BF16, 1 for INT8, 0.5 for INT4 quantized cache
Concrete numbers: A 70B model (N=80, d=8192) serving an 8K context in FP16 uses 2 × 80 × 8192 × 8192 × 2 = ~20 GB per request just for the KV cache. This is why quantized KV caches (INT8, INT4) and techniques like GQA (which reduce the number of unique K/V heads) are critical for production deployment.
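The formula above in code, reproducing the back-of-the-envelope figure for the assumed 70B-class dimensions (N = 80, d_model = 8192, FP16):

```python
def kv_cache_bytes(N, n, d_model, bytes_per_param=2):
    """2 = one K set + one V set per layer; grows linearly with n."""
    return 2 * N * n * d_model * bytes_per_param

gib = kv_cache_bytes(N=80, n=8192, d_model=8192) / 2**30
print(f"{gib:.1f} GiB")  # → 20.0 GiB per request, just for the cache
```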

Glossary

Quick-reference for every term and symbol used in this guide.

Core Dimensions & Notation

d_model (or d)
The hidden dimension of the model — the width of every token's representation vector as it flows through the network. Typical values: 768 (GPT-2 small), 4096 (LLaMA-7B), 12288 (GPT-3 175B).
n
Sequence length — the number of tokens currently in the context window. Bounded by the model's maximum context length (e.g. 2048, 4096, 128K).
V
Vocabulary size — the total number of unique tokens the model knows. Typically 32K–128K for modern models.
N
Number of decoder layers (blocks) stacked in the model. Ranges from 12 (small) to 96+ (large).
h
Number of attention heads per layer. Each head operates on a sub-dimension d_k = d_model / h.
d_k
Head dimension — the dimensionality each attention head works in. Equal to d_model / h. Typically 64 or 128.
d_ff
Feed-forward intermediate dimension. Usually 4 × d_model in standard FFN, or ~2.67 × d_model with SwiGLU (to match parameter count).

Tokenization

Token
The atomic unit the model processes — a subword, word, or character fragment. Text is split into tokens before being fed to the model.
BPE (Byte-Pair Encoding)
A tokenization algorithm that builds a vocabulary by iteratively merging the most frequent adjacent byte/character pairs. Used by GPT, LLaMA, and most modern LLMs.
Token ID
An integer index (0 to V−1) that uniquely identifies a token in the vocabulary. The model only ever sees these integers, not text.
EOS Token
End-of-Sequence token — a special token ID signaling the model to stop generating. When sampled, the autoregressive loop terminates.

Embeddings & Positions

Embedding (W_E)
A learned lookup table of shape V × d_model. Each row is a dense vector representing one token. Converts discrete token IDs into continuous vectors.
Positional Encoding
A mechanism to inject token ordering information. Without it, attention is permutation-invariant and cannot distinguish "cat sat" from "sat cat".
RoPE (Rotary Position Embedding)
A positional encoding method that rotates Q and K vectors by position-dependent angles. Encodes relative positions, enabling better length generalization than absolute embeddings.
Weight Tying
Reusing the embedding matrix W_E as the output unembedding matrix (transposed). Reduces parameters and often improves performance.

Attention

Self-Attention
A mechanism where each token computes a weighted sum of all other tokens' representations. The weights are determined by learned Query-Key compatibility. This is how context flows between tokens.
Query (Q)
"What am I looking for?" — a learned linear projection of each token, used to probe other tokens for relevant information.
Key (K)
"What do I contain?" — a learned linear projection that advertises each token's content. Dot-producted with queries to compute attention scores.
Value (V)
"What information do I give?" — a learned linear projection carrying the actual content to be aggregated. Weighted by attention scores to form the output.
Attention Score
The dot product Q · Kᵀ / √d_k between a query and key vector. Higher scores mean more "attention" — that token's value will have more influence.
Causal Mask
A triangular mask that sets attention scores for future positions to −∞. Ensures each token can only attend to itself and previous tokens. This is what makes the model autoregressive.
Multi-Head Attention
Running h parallel attention operations, each with its own Q/K/V projections over a sub-dimension. Allows the model to attend to different aspects simultaneously.
W^O (Output Projection)
A matrix of shape d_model × d_model applied after concatenating all head outputs, mixing information across heads.
Scaled Dot-Product
Division by √d_k before softmax. Without scaling, dot products in high dimensions grow large, pushing softmax into saturation (near-zero gradients).
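The attention terms above fit together in a few lines; a minimal NumPy sketch of masked scaled dot-product attention for a single head:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked scaled dot-product attention for one head.
    Q, K, V: arrays of shape (n, d_k)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                  # scaled attention scores
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                           # causal mask: hide the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 64))
out = causal_attention(Q, K, V)
# Position 0 can only attend to itself, so its output equals V[0]
assert np.allclose(out[0], V[0])
```

Multi-head attention runs h copies of this in parallel on d_k-sized slices, concatenates the outputs, and applies W^O.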

Feed-Forward Network

FFN (Feed-Forward Network)
A two-layer MLP applied independently to each token position. Expands the dimension (typically 4×), applies a nonlinearity, then projects back. Stores much of the model's factual "knowledge".
GELU (Gaussian Error Linear Unit)
Activation function defined as x · Φ(x). A smooth version of ReLU that allows small negative values through, improving gradient flow. Standard in GPT-family models.
SwiGLU
A gated FFN variant using three projections: (Swish(xW_gate) ⊙ xW_up) W_down. The gate controls information flow. Used by LLaMA, PaLM, and most modern models.
Swish / SiLU
Activation function x · σ(x) where σ is the sigmoid. Self-gated and smooth. Used inside SwiGLU.
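The FFN variants above differ only in the activation and gating; a NumPy sketch (using the common tanh approximation of GELU):

```python
import numpy as np

def silu(x):
    """Swish/SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def gelu(x):
    """GELU x * Phi(x), via the widely used tanh approximation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W_up, W_down):
    """Standard two-layer FFN: expand, nonlinearity, project back."""
    return gelu(x @ W_up) @ W_down

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN: (Swish(x W_gate) ⊙ x W_up) W_down."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

# Shapes: d_model=8 in, d_ff=32 hidden, d_model=8 out (toy sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = swiglu_ffn(x, rng.normal(size=(8, 32)),
                 rng.normal(size=(8, 32)), rng.normal(size=(32, 8)))
assert out.shape == (4, 8)
```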

Normalization & Residuals

LayerNorm
Normalizes each token's vector to mean ≈ 0 and variance ≈ 1 across the feature dimension, then applies learned scale (γ) and shift (β). Stabilizes training of deep networks.
RMSNorm
Simplified LayerNorm that skips mean-centering and uses the root mean square for normalization. No bias parameter. Slightly faster, works equally well.
Residual Connection
Adding the input of a sub-layer to its output: x + SubLayer(x). Creates a "shortcut" for gradients and lets each layer make incremental updates rather than overwriting the representation.
Pre-Norm vs Post-Norm
Pre-Norm: x + SubLayer(Norm(x)) — normalize before the sub-layer. More stable for deep models. Post-Norm: Norm(x + SubLayer(x)) — original Transformer style. Harder to train at scale.
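A minimal sketch of both normalizations, showing exactly what RMSNorm drops relative to LayerNorm:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token across the feature dim, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: no mean-centering, no bias; divide by root mean square."""
    rms = np.sqrt((x**2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.default_rng(0).normal(size=(4, 768))
y = layer_norm(x, gamma=1.0, beta=0.0)
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)   # mean ≈ 0 per token
```

In a pre-norm block the usage is `x + sublayer(norm(x))`; the residual stream itself is never normalized until the final LayerNorm.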

Output & Generation

Logits
The raw, unnormalized scores output by the final linear layer — one value per vocabulary token. Converted to probabilities via softmax.
Softmax
Converts a vector of logits into a probability distribution: P(i) = exp(z_i) / Σ_j exp(z_j). All outputs are in [0, 1] and sum to 1.
Temperature (T)
A scalar dividing logits before softmax. T < 1 → sharper/more confident distribution. T > 1 → flatter/more random. T = 1 → standard softmax.
Greedy Decoding
Always selecting the highest-probability token: argmax(P). Deterministic but repetitive.
Top-k Sampling
Restricting sampling to the k highest-probability tokens, zeroing out the rest, then renormalizing and sampling. Fixed candidate count.
Top-p / Nucleus Sampling
Keeping the smallest token set whose cumulative probability ≥ p. Adaptive: keeps fewer tokens when confident, more when uncertain.
KV Cache
A memory optimization for autoregressive generation. Stores the Key and Value tensors from all previous tokens so they don't need to be recomputed at each generation step. Trades memory for compute.
Autoregressive
Generating one token at a time, feeding each new token back as input to predict the next. The causal mask ensures the model never "sees the future".
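The sampling strategies above compose into one small function; a sketch that applies temperature, then optional top-k and top-p filtering:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Temperature scaling, then optional top-k / top-p filtering, then sample."""
    rng = rng or np.random.default_rng()
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    order = np.argsort(probs)[::-1]             # token IDs, most probable first
    if top_k is not None:
        probs[order[top_k:]] = 0.0              # keep only the k best
    if top_p is not None:
        cum = np.cumsum(probs[order])
        cut = np.searchsorted(cum, top_p) + 1   # smallest set with mass >= p
        probs[order[cut:]] = 0.0
    probs /= probs.sum()                        # renormalize survivors
    return int(rng.choice(len(probs), p=probs))

# top_k=1 degenerates to greedy decoding: always the argmax
assert sample([1.0, 2.0, 10.0], top_k=1) == 2
```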

Embedding Space

Embedding Space
The high-dimensional geometric space in which token embedding vectors live. Distances and directions in this space encode semantic relationships between tokens.
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them: cos(θ) = (a · b) / (‖a‖ · ‖b‖). Ranges from −1 (opposite) to +1 (identical direction). Used to compare token embeddings.
Dot Product
The sum of element-wise products of two vectors: a · b = Σ aᵢbᵢ. Measures both similarity of direction and magnitude. The core operation in attention score computation.
Linear Substructure
The phenomenon where semantic relationships are encoded as consistent vector offsets (e.g., king − man + woman ≈ queen). Emerges from training, not explicit programming.
Residual Stream
A conceptual view of the model's hidden state as a continuous pathway. Each sub-layer (attention, FFN) reads from and adds to this stream via residual connections. The representation is enriched incrementally, never overwritten.
Distributed Representation
Meaning is spread across many dimensions simultaneously — no single dimension encodes one concept. A token's semantics arise from the pattern across all dimensions together.
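Cosine similarity and the dot product defined above are one-liners; a sketch with toy vectors (real embeddings have hundreds to thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(θ) = (a · b) / (‖a‖ ‖b‖), in [-1, 1]."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
assert np.isclose(cosine_similarity(a, 2 * a), 1.0)   # same direction, any scale
assert np.isclose(cosine_similarity(a, -a), -1.0)     # opposite direction
```

This scale-invariance is why cosine similarity, rather than the raw dot product, is the usual choice for comparing token embeddings.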

Inference: Prefill & Decode

Prefill
The first phase of inference where all prompt tokens are processed in parallel in a single forward pass. Builds the KV cache. Compute-bound (limited by FLOPS). Determines Time to First Token (TTFT).
Decode
The second phase where tokens are generated one at a time autoregressively. Each step reads the full KV cache. Memory-bandwidth-bound. Determines Tokens Per Second (TPS).
TTFT (Time to First Token)
The latency from sending a request to receiving the first generated token. Dominated by the prefill phase — longer prompts mean higher TTFT.
TPS (Tokens Per Second)
The rate at which tokens are generated during the decode phase. Determined by how fast the system can read the KV cache and run one token through the model.
Compute-Bound
A workload limited by arithmetic throughput (FLOPS). The GPU's compute units are the bottleneck. Prefill is compute-bound because it involves large matrix multiplications.
Memory-Bound
A workload limited by memory read/write bandwidth (GB/s). The GPU spends more time loading data than computing. Decode is memory-bound because it must read the entire KV cache for each tiny single-token computation.
Flash Attention
An optimized attention implementation that fuses Q·K, mask, softmax, and ×V into a single GPU kernel, avoiding materialization of the full n×n attention matrix. Reduces memory from O(n²) to O(n).
Continuous Batching
A serving optimization that dynamically groups decode steps from multiple user requests into a single batch, improving GPU utilization during the memory-bound decode phase.
Speculative Decoding
An optimization where a small "draft" model guesses several tokens ahead, which the large model then verifies in parallel. Accepted guesses yield multiple tokens per decode step.
Chunked Prefill
Processing a long prompt in smaller chunks to bound peak memory usage. Each chunk builds its portion of the KV cache while attending to all previously cached tokens.
GQA (Grouped-Query Attention)
A variant where multiple query heads share the same K/V head, reducing KV cache size (e.g. 32 query heads sharing 8 KV heads = 4× reduction). Used by LLaMA 2 70B, Mistral, and others.

Training Concepts

Parameters / Weights
The learned numerical values in the model's matrices (embeddings, Q/K/V projections, FFN layers, LayerNorm scales). Adjusted during training via backpropagation.
Backpropagation
The algorithm that computes gradients of the loss with respect to every parameter by propagating error backwards through the network. Used with an optimizer (e.g. Adam) to update weights.
Cross-Entropy Loss
The training objective: −log P(correct_token) averaged over all positions. Minimizing this makes the model assign higher probability to the right next token.
Perplexity
Exponential of the average cross-entropy loss: exp(loss). Intuitively, how many tokens the model is "equally confused between" at each step. Lower is better.
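The loss and perplexity definitions above can be verified directly; note how a model "equally confused between" k tokens lands at perplexity exactly k:

```python
import numpy as np

def cross_entropy(probs, targets):
    """-log P(correct token), averaged over positions.
    probs: (n_positions, V) probabilities; targets: (n_positions,) token IDs."""
    picked = probs[np.arange(len(targets)), targets]
    return -np.log(picked).mean()

def perplexity(loss):
    return np.exp(loss)

# Uniform over a 4-token vocabulary → perplexity 4
probs = np.full((3, 4), 0.25)
loss = cross_entropy(probs, np.array([0, 1, 2]))
assert np.isclose(perplexity(loss), 4.0)
```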