Orthonormal bases are the cleanest way to do anything. QR is how you build them. And RoPE — the position encoding inside Llama, Qwen, GPT-NeoX, Mistral and most modern transformers — is just a stack of tiny 2×2 rotation matrices.
A set $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ is orthonormal if
$$\langle \mathbf{q}_i, \mathbf{q}_j \rangle = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}.$$
Each is a unit vector and they're pairwise perpendicular. A square matrix $Q \in \mathbb{R}^{n\times n}$ whose columns are orthonormal is called orthogonal:
$$Q^\top Q = Q Q^\top = I, \qquad Q^{-1} = Q^\top.$$
Orthogonal matrices are the maps that don't deform anything — they rotate (and possibly reflect) but never stretch. They are the cleanest possible linear maps.
An orthogonal $Q$ satisfies $\sigma_1 = \sigma_n = 1$ — condition number 1. Multiplying by $Q$ doesn't amplify or attenuate any direction. Stack 100 orthogonal layers and the signal magnitude stays the same; gradients neither vanish nor explode through this part. This is the entire reason orthogonal weight initialisation (slide 06) is a respected baseline.
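A ten-line numpy check of that claim (a sketch; the depth of 100 and the width of 64 are arbitrary choices): compose random orthogonal matrices, obtained as the Q factor of Gaussian matrices, and watch the norm stay put.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 100

x = rng.standard_normal(d)
y = x.copy()
for _ in range(depth):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal layer
    y = Q @ y

print(np.linalg.norm(x), np.linalg.norm(y))   # equal up to floating-point error
```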
Given an arbitrary linearly independent $\{\mathbf{a}_1, \ldots, \mathbf{a}_n\}$, the Gram-Schmidt process produces an orthonormal $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ that spans the same space, by repeatedly subtracting the projection onto the already-built basis:
$$\tilde{\mathbf{q}}_k = \mathbf{a}_k - \sum_{j=1}^{k-1} \langle \mathbf{a}_k, \mathbf{q}_j\rangle\, \mathbf{q}_j, \qquad \mathbf{q}_k = \frac{\tilde{\mathbf{q}}_k}{\|\tilde{\mathbf{q}}_k\|}.$$
In words: take $\mathbf{a}_k$, kill its components along all the existing $\mathbf{q}_j$, then normalise.
By construction, each $\mathbf{a}_k$ lies in the span of $\mathbf{q}_1, \ldots, \mathbf{q}_k$ — so $\mathbf{a}_k = R_{1k}\mathbf{q}_1 + R_{2k}\mathbf{q}_2 + \cdots + R_{kk}\mathbf{q}_k$, where $R_{jk} = \langle \mathbf{a}_k, \mathbf{q}_j\rangle$ for $j < k$ and $R_{kk} = \|\tilde{\mathbf{q}}_k\|$. Stacking columns:
$$A = QR$$
with $Q$ orthonormal and $R$ upper triangular. The QR factorisation falls out as a side effect of orthogonalising the columns.
Classical Gram-Schmidt is unstable in floating point: small errors in early $\mathbf{q}_j$ contaminate later ones, and the resulting $Q$ can be far from orthogonal. The fix is Modified Gram-Schmidt (subtract one projection at a time, updating $\mathbf{a}_k$ in place) — better but still not great. Production libraries use Householder or Givens (next slide).
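Both variants fit in a dozen lines of numpy. A minimal sketch (the helper gram_schmidt and the Vandermonde test matrix are illustrative choices, not library code): classical projects the original column against every $\mathbf{q}_j$, modified projects the running residual, and on an ill-conditioned input the classical variant typically loses noticeably more orthogonality.

```python
import numpy as np

def gram_schmidt(A, modified=True):
    """Return Q, R with A = Q @ R; columns of A assumed linearly independent."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        v = A[:, k].copy()
        for j in range(k):
            # classical: project the original column; modified: project the residual
            R[j, k] = Q[:, j] @ (v if modified else A[:, k])
            v = v - R[j, k] * Q[:, j]
        R[k, k] = np.linalg.norm(v)
        Q[:, k] = v / R[k, k]
    return Q, R

A = np.vander(np.linspace(0, 1, 12), 8)   # ill-conditioned test matrix
for mod in (False, True):
    Q, R = gram_schmidt(A, modified=mod)
    print("modified =", mod, " ||Q^T Q - I|| =", np.linalg.norm(Q.T @ Q - np.eye(8)))
```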
For any $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, there exist $Q \in \mathbb{R}^{m \times n}$ with orthonormal columns and $R \in \mathbb{R}^{n \times n}$ upper triangular such that
$$A = QR.$$
$R$ is upper triangular — the same property that makes back-substitution easy. This is what makes QR useful: solving $A\mathbf{x} = \mathbf{b}$ becomes "compute $\mathbf{c} = Q^\top \mathbf{b}$, then back-substitute $R\mathbf{x} = \mathbf{c}$" — one orthogonal multiplication and one triangular solve.
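A minimal sketch of that recipe, using a tall least-squares system (the square case is identical); it assumes scipy is available for the triangular solve:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))        # tall: a least-squares problem
b = rng.standard_normal(200)

Q, R = np.linalg.qr(A)                   # reduced QR: Q is 200x5, R is 5x5
c = Q.T @ b                              # one orthogonal multiplication
x = solve_triangular(R, c)               # one back-substitution

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```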
numpy.linalg.eig relies on QR iteration under the hood, and torch.nn.init.orthogonal_ builds its weights from a QR factorisation. Production QR uses one of two methods, both stable.
For each column, choose a unit vector $\mathbf{v}$ such that the reflection $H = I - 2\mathbf{v}\mathbf{v}^\top$ zeros out everything below the diagonal in that column. Apply $n$ such reflections; product of reflectors is $Q^\top$, and the modified $A$ is $R$.
Each reflector is the identity plus a rank-1 update, so it is cheap to apply. This is the standard dense QR factorisation routine, LAPACK's geqrf.
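A minimal Householder QR sketch in numpy (helper names are mine, no pivoting, full column rank assumed): $R$ is built in place and $Q$ is never formed, only the reflector vectors are kept.

```python
import numpy as np

def householder_qr(A):
    """Return R and the list of Householder vectors; Q stays implicit."""
    R = A.astype(float).copy()
    m, n = R.shape
    vs = []
    for k in range(n):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])    # choose sign to avoid cancellation
        v /= np.linalg.norm(v)                          # assumes full column rank
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])   # apply H = I - 2 v v^T in place
        vs.append(v)
    return R, vs

def apply_qT(vs, b):
    """Compute Q^T b from the stored reflectors: O(mn) instead of forming Q."""
    b = b.astype(float).copy()
    for k, v in enumerate(vs):
        b[k:] -= 2.0 * v * (v @ b[k:])
    return b

A = np.random.default_rng(1).standard_normal((6, 4))
R, vs = householder_qr(A)
print(np.allclose(R, np.triu(R)))                                   # R is upper triangular
print(np.allclose(apply_qT(vs, A @ np.ones(4)), R @ np.ones(4)))    # Q^T (A x) = R x
```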
A Givens rotation is a 2×2 rotation embedded in a larger identity, rotating two coordinates at a time to zero one entry. A full QR produces $Q$ as a product of $\binom{n}{2}$ Givens rotations.
Slower for dense $A$ but parallelisable, and ideal when $A$ is already nearly triangular (sparse, banded). Used inside the iterative QR-eigenvalue algorithm.
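One Givens rotation, as a sketch (a full Givens QR just sweeps this over every subdiagonal entry):

```python
import numpy as np

def givens(a, b):
    """Return c, s such that [[c, -s], [s, c]]^T @ [a, b] = [r, 0]."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

A = np.array([[3.0, 1.0],
              [4.0, 2.0]])
c, s = givens(A[0, 0], A[1, 0])
G = np.array([[c, -s],
              [s,  c]])          # rotation acting on rows 0 and 1
print(G.T @ A)                   # entry (1, 0) is now exactly 0; (0, 0) becomes 5
```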
You almost never store an orthogonal $Q$ explicitly — you store the recipe to apply $Q$ (the list of Householder vectors or Givens angles). Applying it is then $O(n^2)$ instead of $O(n^3)$. Same trick is used inside the FFT, Hadamard transform, butterfly factorisation. Structured orthogonal matrices are everywhere in fast algorithms.
The RoPE positional encoding is a giant block-diagonal orthogonal matrix — a product of $d/2$ independent 2×2 Givens rotations applied to consecutive pairs of dimensions. Like a Householder QR, it's never stored explicitly: each rotation is parameterised by a scalar angle that depends on token position. Slide 08 makes this concrete.
The condition number of $A$ is
$$\kappa(A) = \frac{\sigma_1}{\sigma_n} = \|A\|_2 \, \|A^{-1}\|_2.$$
It measures how much $A$ amplifies the worst-case relative error in $\mathbf{x}$ when computing $A\mathbf{x}$. $\kappa = 1$ is perfect (orthogonal matrices); $\kappa = \infty$ means singular.
Anywhere in numerical work where you can replace an arbitrary linear map with an orthogonal one, do it. Cholesky is $L L^\top$ with $L$ lower triangular but not orthogonal — so its $\kappa$ matters. QR is $Q R$ with $\kappa(Q) = 1$, isolating all conditioning in $R$ — you don't get free conditioning, but you get a clean separation.
Random Gaussian initialisations of an $m \times n$ matrix (with $m \ge n$) have $\kappa(W) \approx (1+\sqrt{n/m})/(1 - \sqrt{n/m})$: close to 1 for very tall matrices, but growing without bound as the shape approaches square. Adam-trained matrices in the middle of training tend to have worse $\kappa$ than at init. RMSNorm (deck 11) and weight decay both implicitly regularise toward better-conditioned weights.
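A quick empirical check of both the formula and the QR separation from the previous paragraph (a sketch; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for m, n in [(4000, 1000), (2000, 1000), (1100, 1000)]:
    W = rng.standard_normal((m, n))
    est = (1 + np.sqrt(n / m)) / (1 - np.sqrt(n / m))
    Q, R = np.linalg.qr(W)                    # reduced QR of the random matrix
    print(f"{m}x{n}: cond(W)={np.linalg.cond(W):7.2f}  estimate={est:7.2f}  "
          f"cond(Q)={np.linalg.cond(Q):.4f}  cond(R)={np.linalg.cond(R):7.2f}")
```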
Saxe, McClelland & Ganguli (ICLR 2014) showed that initialising deep linear networks with random orthogonal matrices keeps the singular spectrum at exactly 1 throughout the network at $t = 0$. Signal neither vanishes nor explodes during the first forward pass; gradients flow uniformly.
For a weight $W \in \mathbb{R}^{m \times n}$ with $m \le n$ (wider-than-tall): sample a random Gaussian matrix, orthonormalise its rows via QR (folding in the signs of $\mathrm{diag}(R)$ so the result is uniformly distributed over the orthogonal group), and scale by a gain.
For taller-than-wide matrices, transpose. The gain factor compensates for the variance change introduced by the following nonlinearity: $\sqrt{2}$ for ReLU networks (compensating the half-rectifier), 1 for linear, $\approx 5/3$ for tanh. PyTorch's nn.init.orthogonal_(W, gain=...) implements exactly this.
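Roughly what nn.init.orthogonal_ does, written as a numpy sketch (the real PyTorch routine also handles tensors with more than two dimensions by flattening the trailing ones; orthogonal_init here is an illustrative helper, not the library function):

```python
import numpy as np

def orthogonal_init(m, n, gain=1.0, seed=0):
    """QR-of-Gaussian orthogonal init for an m x n weight matrix."""
    rng = np.random.default_rng(seed)
    tall = m >= n
    G = rng.standard_normal((m, n) if tall else (n, m))
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))       # sign fix: uniform over the orthogonal group
    return gain * (Q if tall else Q.T)

W = orthogonal_init(256, 512, gain=np.sqrt(2.0))   # e.g. a ReLU layer, wider than tall
print(np.allclose(W @ W.T, 2.0 * np.eye(256)))     # rows orthonormal, scaled by gain^2
```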
Xavier/LeCun-style variance scaling: std $\sim 1/\sqrt{n_{in}}$. Cheap, no QR. Works for shallow nets.
He/Kaiming init: std $\sim \sqrt{2/n_{in}}$, tuned for ReLU. Default for CNNs.
Orthogonal init: best for very deep linear networks; comparable for nonlinear ones. Sometimes used, with care, for RNN recurrent matrices to delay gradient explosion.
For modern transformers, orthogonal init is rarely the default — the residual stream and LayerNorm together control signal scale well enough that Xavier-uniform / He-normal are fine. But it's a useful tool when chasing very deep architectures with residual connections turned off (rare in practice).
The 2D rotation by angle $\theta$:
$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$
Orthogonal ($R_\theta^\top R_\theta = I$), determinant $+1$ (no reflection), eigenvalues $e^{\pm i\theta}$ over $\mathbb{C}$.
The composition rule $R_\alpha R_\beta = R_{\alpha + \beta}$, together with $R_\theta^\top = R_{-\theta}$, is the entire reason RoPE works for attention.
Identify $\mathbb{R}^2$ with $\mathbb{C}$ via $(x, y) \leftrightarrow x + iy$. Then $R_\theta$ acts as multiplication by $e^{i\theta}$. The composition rule $R_\alpha R_\beta = R_{\alpha + \beta}$ is just $e^{i\alpha} e^{i\beta} = e^{i(\alpha + \beta)}$. And inner product invariance becomes: $\langle e^{i\theta} z_1, e^{i\theta} z_2\rangle = \mathrm{Re}(\overline{e^{i\theta} z_1}\, e^{i\theta} z_2) = \mathrm{Re}(\bar{z}_1 z_2) = \langle z_1, z_2 \rangle$.
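A three-line numerical check of the correspondence (a sketch): rotating the real pair with $R_\theta$ matches multiplying the complex number by $e^{i\theta}$, and composition adds angles.

```python
import numpy as np

def R(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

a, b = 0.3, 1.1
x = np.array([2.0, -1.0])
z = complex(x[0], x[1])

w = np.exp(1j * a) * z                           # rotate as a complex multiplication
print(np.allclose(R(a) @ x, [w.real, w.imag]))   # same as the 2x2 matrix
print(np.allclose(R(a) @ R(b), R(a + b)))        # composition adds angles
```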
Su et al. (2021) introduced Rotary Position Embedding (RoPE), now the dominant position encoding in Llama, Qwen, Mistral, GPT-NeoX, Falcon, DeepSeek and most modern LLMs. It encodes position by rotating the query and key vectors before the attention dot product.
For a $d$-dimensional vector $\mathbf{x}$ at position $m$, group the $d$ coordinates into $d/2$ pairs $(x_{2i-1}, x_{2i})$. Apply a 2×2 rotation to each pair, with angle $m\theta_i$:
$$\begin{pmatrix} x'_{2i-1} \\ x'_{2i} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i-1} \\ x_{2i} \end{pmatrix}.$$
The rotation frequency $\theta_i$ varies with the pair index:
$$\theta_i = \mathrm{base}^{-2(i-1)/d}, \quad \text{typically with}\ \mathrm{base} = 10000.$$
So the first pair rotates fast (high frequency), the last pair slowly (low frequency). Together the $d/2$ pairs form a frequency comb — every pair sees position differently.
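A minimal RoPE sketch in numpy, pairing adjacent dimensions exactly as in the equation above (production implementations, e.g. Llama's, typically use a half-split pairing and fused torch ops, but the arithmetic is the same; the rope helper is illustrative):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate the last dimension of x by (scalar) position pos, pairing dims (0,1), (2,3), ..."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # frequency per pair
    ang = pos * theta                                # angles m * theta_i
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]              # the pairs (x_{2i-1}, x_{2i})
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(0).standard_normal(8)
print(np.allclose(rope(q, 0), q))    # position 0: identity
print(rope(q, 5))                    # each pair rotated by angle 5 * theta_i
```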
If $\mathbf{q}$ at position $m$ is rotated to $R_m \mathbf{q}$ and $\mathbf{k}$ at position $n$ is rotated to $R_n \mathbf{k}$, the attention score is
$$\langle R_m \mathbf{q}, R_n \mathbf{k}\rangle = \mathbf{q}^\top R_m^\top R_n \mathbf{k} = \mathbf{q}^\top R_{n - m} \mathbf{k}.$$
(Using $R_m^\top R_n = R_{-m} R_n = R_{n-m}$ block-by-block.) The score depends only on the relative position $n - m$ — never on absolute positions. This is the property that makes RoPE work: the attention pattern between two tokens is a function of their distance, not where they sit in the sequence.
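And a direct numerical check of the relative-position property, building the block-diagonal $R_m$ explicitly (a sketch; $d$, the positions and the base are arbitrary):

```python
import numpy as np

def rope_matrix(m, d, base=10000.0):
    """Block-diagonal RoPE rotation R_m: d/2 independent 2x2 rotations."""
    R = np.zeros((d, d))
    for i in range(d // 2):
        t = m * base ** (-2.0 * i / d)
        R[2*i:2*i+2, 2*i:2*i+2] = [[np.cos(t), -np.sin(t)],
                                   [np.sin(t),  np.cos(t)]]
    return R

d, m, n = 8, 7, 19
rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal(d)

lhs = (rope_matrix(m, d) @ q) @ (rope_matrix(n, d) @ k)
rhs = q @ (rope_matrix(n - m, d) @ k)
print(np.isclose(lhs, rhs))   # True: the score depends only on n - m
```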
Applied to $\mathbf{q}$ and $\mathbf{k}$ only, after the QKV projection and before the dot product. The values $\mathbf{v}$ are not rotated. With MLA (deck 07), RoPE is applied only to a small "decoupled" slice of the queries and keys that bypasses the latent compression, a workaround that keeps the compressed KV cache compatible with rotary positions.
Sinusoidal encoding (the original Transformer): add a fixed sin/cos position vector to the embeddings. Absolute, not relative; degrades for sequences longer than seen at train time. Foundational but historical.
Learned absolute embeddings: a learned vector per position. Used by GPT-2, BERT. Hard cap at the longest sequence seen in training; doesn't extrapolate at all.
RoPE: a rotation per dimension-pair, angle linear in position. Inherently relative. Extrapolates (somewhat) to longer sequences, especially with frequency interpolation tricks (PI, NTK, YaRN).
Plain RoPE trained on 4k tokens degrades quickly past 4k: the low-frequency pairs rotate so slowly that positions beyond the training length produce angles they have never seen. Three families of fix: position interpolation (PI) rescales positions so the longer context maps back into the trained angle range; NTK-aware scaling enlarges the base so the slow pairs interpolate while the fast pairs stay nearly untouched; YaRN combines a per-frequency interpolation ramp with a temperature correction on the attention logits.
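A sketch of the first two knobs (the values of $d$, the scale and the position are arbitrary; the $d/(d-2)$ exponent is one commonly used choice, derived by forcing the slowest pair to interpolate by exactly the scale factor; YaRN's per-frequency ramp and attention temperature are not shown):

```python
import numpy as np

def rope_freqs(d, base):
    return base ** (-2.0 * np.arange(d // 2) / d)

d, base, scale = 128, 10000.0, 4.0     # e.g. stretching a 4k-trained model to 16k
m = 16000                              # a position far beyond the training range

# Position interpolation (PI): squash positions back into the trained range.
pi_angles = (m / scale) * rope_freqs(d, base)

# NTK-aware: keep positions, enlarge the base so the slow pairs interpolate
# while the fastest pair is left almost untouched.
ntk_angles = m * rope_freqs(d, base * scale ** (d / (d - 2)))

print("PI : fastest pair", pi_angles[0],  " slowest pair", pi_angles[-1])
print("NTK: fastest pair", ntk_angles[0], " slowest pair", ntk_angles[-1])
```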
The key facts: (1) the score depends only on relative position, so the attention pattern transfers across sequence lengths; (2) every pair gets a different frequency, so positional information is encoded in a frequency comb that survives composition with weight matrices; (3) implementing it is two element-wise multiplications and a small permute — almost free. The combination is hard to beat.
The widget shows a $d = 8$-dimensional vector being RoPE-rotated as a function of token position. Each pair of dimensions traces a circle at its own frequency — high-frequency pairs spin fast, low-frequency pairs almost stand still. The bottom plot shows how the attention dot product between position-0 and position-$m$ evolves with $m$.
Top: each circle shows one of the 4 dimension-pairs. The pink dot is its current rotated position; the violet path traces history. Bottom: $\mathbf{q}_0^\top R_m \mathbf{k}$ for a fixed $\mathbf{q} = \mathbf{k}$ as a function of $m$ — the attention pattern is a sum of cosines at the comb of frequencies.
Deck 09 — Gradients, Jacobians & Backprop picks up the chain rule from a linear-algebra angle: every backward pass in a neural net is a Jacobian-vector product, and writing forward and backward as matrix algebra shows why autograd works.