Linear Algebra for AI — Presentation 08

Orthogonality, QR & RoPE

Orthonormal bases are the cleanest way to do anything. QR is how you build them. And RoPE — the position encoding inside Llama, Qwen, GPT-NeoX, Mistral and most modern transformers — is just a stack of tiny 2×2 rotation matrices.

Gram-Schmidt QR factorisation orthogonal matrices RoPE for transformers
00

Topics We'll Cover

01

Orthonormal Sets & Orthogonal Matrices

A set $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ is orthonormal if

$$\langle \mathbf{q}_i, \mathbf{q}_j \rangle = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}.$$

Each is a unit vector and they're pairwise perpendicular. A square matrix $Q \in \mathbb{R}^{n\times n}$ whose columns are orthonormal is called orthogonal:

$$Q^\top Q = Q Q^\top = I, \qquad Q^{-1} = Q^\top.$$

Three properties of orthogonal matrices

  1. Norm preservation: $\|Q\mathbf{x}\| = \|\mathbf{x}\|$ for every $\mathbf{x}$.
  2. Inner-product preservation: $\langle Q\mathbf{x}, Q\mathbf{y}\rangle = \langle \mathbf{x}, \mathbf{y}\rangle$ — angles survive too.
  3. $\det Q = \pm 1$: $+1$ for pure rotations, $-1$ when a reflection is involved.

Orthogonal matrices are the maps that don't deform anything — they rotate (and possibly reflect) but never stretch. They are the cleanest possible linear maps.

Why ML loves orthogonal matrices

An orthogonal $Q$ satisfies $\sigma_1 = \sigma_n = 1$ — condition number 1. Multiplying by $Q$ doesn't amplify or attenuate any direction. Stack 100 orthogonal layers and the signal magnitude stays the same; gradients neither vanish nor explode through this part. This is the entire reason orthogonal weight initialisation (slide 06) is a respected baseline.
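A quick numpy sanity check of both claims — an orthogonal matrix built as the Q factor of a Gaussian sample, pushed through a hypothetical 100-layer stack:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 64x64 orthogonal matrix: the Q factor of a Gaussian sample.
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))

x = rng.standard_normal(64)
y = x.copy()
for _ in range(100):            # push the signal through 100 orthogonal "layers"
    y = Q @ y

# Norm is preserved (up to float rounding): no vanishing, no explosion.
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
# sigma_1 = sigma_n = 1, so the condition number is 1.
assert abs(np.linalg.cond(Q) - 1.0) < 1e-6
```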

02

Gram-Schmidt Orthogonalisation

Given an arbitrary linearly independent $\{\mathbf{a}_1, \ldots, \mathbf{a}_n\}$, the Gram-Schmidt process produces an orthonormal $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ that spans the same space, by repeatedly subtracting the projection onto the already-built basis:

$$\tilde{\mathbf{q}}_k = \mathbf{a}_k - \sum_{j=1}^{k-1} \langle \mathbf{a}_k, \mathbf{q}_j\rangle\, \mathbf{q}_j, \qquad \mathbf{q}_k = \frac{\tilde{\mathbf{q}}_k}{\|\tilde{\mathbf{q}}_k\|}.$$

In words: take $\mathbf{a}_k$, kill its components along all the existing $\mathbf{q}_j$, then normalise.

Tells you the QR factorisation for free

By construction, each $\mathbf{a}_k$ lies in the span of $\mathbf{q}_1, \ldots, \mathbf{q}_k$ — so $\mathbf{a}_k = R_{1k}\mathbf{q}_1 + R_{2k}\mathbf{q}_2 + \cdots + R_{kk}\mathbf{q}_k$, where $R_{jk} = \langle \mathbf{a}_k, \mathbf{q}_j\rangle$ for $j < k$ and $R_{kk} = \|\tilde{\mathbf{q}}_k\|$. Stacking columns:

$$A = QR$$

with $Q$ orthonormal and $R$ upper triangular. The QR factorisation falls out as a side effect of orthogonalising the columns.
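The whole construction fits in a few lines of numpy — a sketch of the classical variant, fine for well-conditioned inputs:

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt on the columns of A; returns Q (orthonormal columns)
    and R (upper triangular) with A = QR."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        v = A[:, k].copy()
        for j in range(k):
            R[j, k] = Q[:, j] @ A[:, k]   # R_{jk} = <a_k, q_j>
            v -= R[j, k] * Q[:, j]        # kill the component along q_j
        R[k, k] = np.linalg.norm(v)       # R_{kk} = ||q~_k||
        Q[:, k] = v / R[k, k]             # normalise
    return Q, R

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
Q, R = gram_schmidt_qr(A)
assert np.allclose(A, Q @ R)              # the factorisation falls out for free
assert np.allclose(Q.T @ Q, np.eye(3))    # columns are orthonormal
assert np.allclose(R, np.triu(R))         # R is upper triangular
```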

Numerical caveat

Classical Gram-Schmidt is unstable in floating point: small errors in early $\mathbf{q}_j$ contaminate later ones, and the resulting $Q$ can be far from orthogonal. The fix is Modified Gram-Schmidt (subtract one projection at a time, updating $\mathbf{a}_k$ in place) — better but still not great. Production libraries use Householder or Givens (next slide).
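The instability is easy to demonstrate — a minimal comparison on a 10×10 Hilbert matrix ($\kappa \approx 10^{13}$), assuming float64 throughout:

```python
import numpy as np

def cgs(A):
    # Classical GS: always projects against the ORIGINAL column a_k.
    Q = np.zeros_like(A)
    for k in range(A.shape[1]):
        v = A[:, k] - Q[:, :k] @ (Q[:, :k].T @ A[:, k])
        Q[:, k] = v / np.linalg.norm(v)
    return Q

def mgs(A):
    # Modified GS: subtract one projection at a time, updating v in place.
    Q = np.zeros_like(A)
    for k in range(A.shape[1]):
        v = A[:, k].copy()
        for j in range(k):
            v -= (Q[:, j] @ v) * Q[:, j]   # uses the UPDATED v, not a_k
        Q[:, k] = v / np.linalg.norm(v)
    return Q

n = 10
H = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(n))  # Hilbert matrix

err = lambda Q: np.linalg.norm(Q.T @ Q - np.eye(n))      # loss of orthogonality
# Modified GS keeps Q far closer to orthogonal on this ill-conditioned input.
assert err(cgs(H)) > err(mgs(H))
```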

03

QR Factorisation

For any $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, there exist $Q \in \mathbb{R}^{m \times n}$ with orthonormal columns and $R \in \mathbb{R}^{n \times n}$ upper triangular such that

$$A = QR.$$

$R$ is upper triangular — the same property that makes back-substitution easy. This is what makes QR useful: solving $A\mathbf{x} = \mathbf{b}$ becomes "compute $\mathbf{c} = Q^\top \mathbf{b}$, then back-substitute $R\mathbf{x} = \mathbf{c}$" — one orthogonal multiplication and one triangular solve.
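The two-step solve, sketched with numpy's built-in QR and the back-substitution written out by hand:

```python
import numpy as np

def solve_qr(A, b):
    """Solve Ax = b via A = QR: one orthogonal multiply, one triangular solve."""
    Q, R = np.linalg.qr(A)           # reduced QR: Q is m x n, R is n x n
    c = Q.T @ b                      # orthogonal step: c = Q^T b
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):   # back-substitution on Rx = c
        x[i] = (c[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)
x = solve_qr(A, b)
assert np.allclose(A @ x, b)
```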

Where QR shows up in ML

  - Least-squares regression: solving $\min_\mathbf{x} \|A\mathbf{x} - \mathbf{b}\|$ via $A = QR$ is far more stable than forming the normal equations.
  - Orthogonal weight initialisation (slide 06): $W$ is the $Q$ factor of a Gaussian sample.
  - Eigenvalue computation: the iterative QR algorithm, built from the Householder/Givens machinery of slide 04.

04

Householder & Givens — the Real Algorithms

Production QR uses one of two methods, both stable.

Householder reflectors

For each column, choose a unit vector $\mathbf{v}$ such that the reflection $H = I - 2\mathbf{v}\mathbf{v}^\top$ zeros out everything below the diagonal in that column. Apply $n$ such reflections; product of reflectors is $Q^\top$, and the modified $A$ is $R$.

Each reflector is the identity plus a rank-1 update — cheap to apply. This is the standard dense QR routine, LAPACK's geqrf.
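A minimal Householder QR in numpy — a teaching sketch, not the blocked LAPACK implementation:

```python
import numpy as np

def householder_qr(A):
    """QR via Householder reflectors H = I - 2vv^T, one per column."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.sign(x[0]) * np.linalg.norm(x)    # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        # Apply H to the trailing submatrix: column k maps to a multiple of e1,
        # zeroing everything below the diagonal in that column.
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        Q[:, k:]  -= 2.0 * np.outer(Q[:, k:] @ v, v)  # accumulate Q = H_1 ... H_n
    return Q, R

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 4))
Q, R = householder_qr(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(5))
assert np.allclose(R, np.triu(R))
```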

Givens rotations

A Givens rotation is a 2×2 rotation embedded in a larger identity, rotating two coordinates at a time to zero a single entry. For dense square $A$, $Q$ emerges as a product of $\binom{n}{2}$ Givens rotations.

Slower for dense $A$ but parallelisable, and ideal when $A$ is already nearly triangular (sparse, banded). Used inside the iterative QR-eigenvalue algorithm.
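A single Givens step, sketched in numpy — one rotation chosen to zero one entry:

```python
import numpy as np

def givens(a, b):
    """Return (c, s) with c^2 + s^2 = 1 such that the rotation by these
    coefficients maps (a, b) to (r, 0), r = hypot(a, b)."""
    r = np.hypot(a, b)
    return a / r, b / r

A = np.array([[3.0, 1.0],
              [4.0, 2.0]])
c, s = givens(A[0, 0], A[1, 0])
G = np.array([[ c, s],
              [-s, c]])      # transpose of the rotation aligning column 0 with e1
# One rotation zeros the (1, 0) entry: hypot(3, 4) = 5.
assert np.allclose(G @ A[:, 0], [5.0, 0.0])
```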

Why "$Q$ as a product of small things"

You almost never store an orthogonal $Q$ explicitly — you store the recipe to apply $Q$ (the list of Householder vectors or Givens angles). Forming the dense $Q$ would cost $O(n^3)$; applying the stored factors to a vector costs only $O(n^2)$, and for specially structured factorisations it drops further — the FFT, the Hadamard transform and butterfly factorisations apply an orthogonal map in $O(n \log n)$. Structured orthogonal matrices are everywhere in fast algorithms.

Why this matters for RoPE

The RoPE positional encoding is a giant block-diagonal orthogonal matrix — a product of $d/2$ independent 2×2 Givens rotations applied to consecutive pairs of dimensions. Like a Householder QR, it's never stored explicitly: each rotation is parameterised by a scalar angle that depends on token position. Slide 08 makes this concrete.

05

Condition Number, Stability, Why $Q$ is Cheap

The condition number of $A$ is

$$\kappa(A) = \frac{\sigma_1}{\sigma_n} = \|A\|_2 \, \|A^{-1}\|_2.$$

It measures how much a relative error in the input can be amplified in the output — both when computing $A\mathbf{x}$ and when solving $A\mathbf{x} = \mathbf{b}$. $\kappa = 1$ is perfect (orthogonal matrices); $\kappa = \infty$ means singular.

Why orthogonal $Q$ is the cheapest

Anywhere in numerical work where you can replace an arbitrary linear map with an orthogonal one, do it. Cholesky is $L L^\top$ with $L$ lower triangular but not orthogonal — so its $\kappa$ matters. QR is $Q R$ with $\kappa(Q) = 1$, isolating all conditioning in $R$ — you don't get free conditioning, but you get a clean separation.
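The separation is easy to see numerically — a sketch using a synthetic matrix with known condition number $10^6$:

```python
import numpy as np

rng = np.random.default_rng(4)
# Build A = U diag(s) V^T with singular values from 1 down to 1e-6.
U, _ = np.linalg.qr(rng.standard_normal((50, 50)))
V, _ = np.linalg.qr(rng.standard_normal((50, 50)))
s = np.logspace(0, -6, 50)
A = U @ np.diag(s) @ V.T

Q, R = np.linalg.qr(A)
# The orthogonal factor is perfectly conditioned ...
assert np.isclose(np.linalg.cond(Q), 1.0)
# ... and ALL of A's conditioning lives in R.
assert np.isclose(np.linalg.cond(R), np.linalg.cond(A), rtol=1e-3)
```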

The conditioning of weight matrices in transformers

Random Gaussian $m \times n$ initialisations (with $m > n$) have $\kappa(W) \approx (1+\sqrt{n/m})/(1 - \sqrt{n/m})$ — close to 1 for very tall matrices, but diverging as the shape approaches square. Adam-trained matrices in the middle of training tend to have worse $\kappa$ than at init. RMSNorm (deck 11) and weight decay both implicitly regularise toward better-conditioned weights.

06

Orthogonal Initialisation

Saxe, McClelland & Ganguli (ICLR 2014) showed that initialising deep linear networks with random orthogonal matrices keeps the singular spectrum at exactly 1 throughout the network at $t = 0$. Signal neither vanishes nor explodes during the first forward pass; gradients flow uniformly.

The recipe

For a weight $W \in \mathbb{R}^{m \times n}$ with $m \ge n$ (taller-than-wide):

  1. Sample $G \sim \mathcal{N}(0, 1)^{m \times n}$.
  2. Compute the reduced QR: $G = QR$ with $Q \in \mathbb{R}^{m \times n}$ having orthonormal columns.
  3. Set $W = \mathrm{gain} \cdot Q$.

For wider-than-tall matrices, apply the recipe to the transpose and transpose back. The "gain" factor compensates for the variance of subsequent activations: $\sqrt{2}$ for ReLU networks (compensating the half-rectifier), 1 for linear, $\approx 5/3$ for tanh. PyTorch's nn.init.orthogonal_(W, gain=...) implements exactly this.
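The recipe in numpy — a sketch mirroring what torch.nn.init.orthogonal_ does (`orthogonal_init` is a hypothetical name; the sign fix is the standard trick for making the result uniformly distributed over orthogonal matrices):

```python
import numpy as np

def orthogonal_init(m, n, gain=1.0, seed=0):
    """Orthogonal initialisation sketch: QR of a Gaussian sample, scaled by gain."""
    rng = np.random.default_rng(seed)
    # Work in the taller orientation so QR yields orthonormal COLUMNS.
    G = rng.standard_normal((max(m, n), min(m, n)))
    Q, R = np.linalg.qr(G)
    Q *= np.sign(np.diag(R))     # sign fix: uniform (Haar) distribution
    if m < n:
        Q = Q.T                  # wider-than-tall: orthonormal rows instead
    return gain * Q

W = orthogonal_init(4, 8, gain=np.sqrt(2))     # ReLU gain
# Rows are orthogonal with squared norm gain^2 = 2.
assert np.allclose(W @ W.T, 2 * np.eye(4))
```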

How it compares to alternatives

Xavier / Glorot

Variance scaling: std $\sqrt{2/(n_{in} + n_{out})}$. Cheap, no QR. Works for shallow nets.

He / Kaiming

$\sim \sqrt{2/n_{in}}$ — tuned for ReLU. Default for CNNs.

Orthogonal

Best for very deep linear networks; comparable for nonlinear. Sometimes used with care for RNN recurrent matrices to delay gradient explosion.

For modern transformers, orthogonal init is rarely the default — the residual stream and LayerNorm together control signal scale well enough that Xavier-uniform / He-normal are fine. But it's a useful tool when chasing very deep architectures with residual connections turned off (rare in practice).

07

Rotation Matrices in 2D

The 2D rotation by angle $\theta$:

$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$

Orthogonal ($R_\theta^\top R_\theta = I$), determinant $+1$ (no reflection), eigenvalues $e^{\pm i\theta}$ over $\mathbb{C}$.

Two key properties

  1. Composition: $R_\alpha R_\beta = R_{\alpha + \beta}$ — angles add.
  2. Relative inner products: $\langle R_\alpha \mathbf{x}, R_\beta \mathbf{y}\rangle = \langle \mathbf{x}, R_{\beta - \alpha}\, \mathbf{y}\rangle$ — the dot product sees only the angle difference.

The second property is the entire reason RoPE works for attention.

The complex-number reading

Identify $\mathbb{R}^2$ with $\mathbb{C}$ via $(x, y) \leftrightarrow x + iy$. Then $R_\theta$ acts as multiplication by $e^{i\theta}$. The composition rule $R_\alpha R_\beta = R_{\alpha + \beta}$ is just $e^{i\alpha} e^{i\beta} = e^{i(\alpha + \beta)}$. And inner product invariance becomes: $\langle e^{i\theta} z_1, e^{i\theta} z_2\rangle = \mathrm{Re}(\overline{e^{i\theta} z_1}\, e^{i\theta} z_2) = \mathrm{Re}(\bar{z}_1 z_2) = \langle z_1, z_2 \rangle$.
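All three facts — angle addition, the complex-multiplication reading, and inner-product invariance — check out numerically:

```python
import numpy as np

def R(theta):
    """2D rotation matrix by angle theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

a, b = 0.7, 1.9

# Composition: R_a R_b = R_{a+b} (angles add).
assert np.allclose(R(a) @ R(b), R(a + b))

# Complex reading: R_a acts on (x, y) <-> x + iy as multiplication by e^{ia}.
z = 2.0 + 3.0j
rotated = R(a) @ np.array([z.real, z.imag])
w = np.exp(1j * a) * z
assert np.allclose(rotated, [w.real, w.imag])

# Inner products are invariant under a shared rotation.
x, y = np.array([1.0, 2.0]), np.array([-0.5, 4.0])
assert np.isclose((R(a) @ x) @ (R(a) @ y), x @ y)
```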

08

RoPE — Rotary Position Embeddings

Su et al. (2021) introduced Rotary Position Embedding (RoPE), now the dominant position encoding in Llama, Qwen, Mistral, GPT-NeoX, Falcon, DeepSeek and most modern LLMs. It encodes position by rotating the query and key vectors before the attention dot product.

The construction, dim by dim

For a $d$-dimensional vector $\mathbf{x}$ at position $m$, group the $d$ coordinates into $d/2$ pairs $(x_{2i-1}, x_{2i})$. Apply a 2×2 rotation to each pair, with angle $m\theta_i$:

$$\begin{pmatrix} x'_{2i-1} \\ x'_{2i} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i-1} \\ x_{2i} \end{pmatrix}.$$

The rotation frequency $\theta_i$ varies with the pair index:

$$\theta_i = \mathrm{base}^{-2(i-1)/d}, \quad \text{typically with}\ \mathrm{base} = 10000.$$

So the first pair rotates fast (high frequency), the last pair slowly (low frequency). Together the $d/2$ pairs form a frequency comb — every pair sees position differently.
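The construction in numpy — a sketch assuming the interleaved pair layout described above (`rope` is a hypothetical helper name):

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate a d-dim vector x to its position-m RoPE image (block-rotation form)."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)     # the frequency comb theta_i
    angles = m * theta                               # angle grows linearly with position
    x1, x2 = x[0::2], x[1::2]                        # the (x_{2i-1}, x_{2i}) pairs
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out

x = np.arange(1.0, 9.0)          # d = 8 -> 4 pairs, 4 frequencies
y = rope(x, m=5)
# A stack of rotations is orthogonal: the norm is unchanged.
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
```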

The dot product becomes relative

If $\mathbf{q}$ at position $m$ is rotated to $R_m \mathbf{q}$ and $\mathbf{k}$ at position $n$ is rotated to $R_n \mathbf{k}$, the attention score is

$$\langle R_m \mathbf{q}, R_n \mathbf{k}\rangle = \mathbf{q}^\top R_m^\top R_n \mathbf{k} = \mathbf{q}^\top R_{n - m} \mathbf{k}.$$

(Using $R_m^\top R_n = R_{-m} R_n = R_{n-m}$ block-by-block.) The score depends only on the relative position $n - m$ — never on absolute positions. This is the property that makes RoPE work: the attention pattern between two tokens is a function of their distance, not where they sit in the sequence.
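A numerical check of the relative-position property, building the block-diagonal $R_m$ explicitly (`rope_matrix` is a hypothetical helper name):

```python
import numpy as np

def rope_matrix(m, d=8, base=10000.0):
    """Block-diagonal RoPE rotation R_m: d/2 independent 2x2 rotations."""
    theta = base ** (-2 * np.arange(d // 2) / d)
    Rm = np.zeros((d, d))
    for i, t in enumerate(m * theta):
        Rm[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[np.cos(t), -np.sin(t)],
                                                [np.sin(t),  np.cos(t)]]
    return Rm

rng = np.random.default_rng(5)
q, k = rng.standard_normal(8), rng.standard_normal(8)
score = lambda m, n: (rope_matrix(m) @ q) @ (rope_matrix(n) @ k)

# The attention score is shift-invariant: only n - m matters.
assert np.isclose(score(3, 10), score(103, 110))
# And R_m^T R_n = R_{n-m}, block by block.
assert np.allclose(rope_matrix(3).T @ rope_matrix(10), rope_matrix(7))
```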

Where in the block does RoPE live?

Applied to $\mathbf{q}$ and $\mathbf{k}$ only, after the QKV projection and before the dot product. The values $\mathbf{v}$ are not rotated. With MLA (deck 07), RoPE is applied to a small "decoupled" portion of the key vector after up-projection from the latent — clever workaround to keep MLA compatible with rotary positions.

09

Why RoPE Beats Sinusoidal & Learned Positional

Sinusoidal (Vaswani 2017)

Add a fixed sin/cos position vector to embeddings. Absolute, not relative; degrades for sequences longer than seen at train time. Foundational but historical.

Learned positional embeddings

A learned vector per absolute position. Used by GPT-2, BERT. Hard cap at the longest sequence seen in training; doesn't extrapolate at all.

RoPE

Rotation per pair, angle linear in position. Inherently relative. Extrapolates (somewhat) to longer sequences, especially with frequency interpolation tricks (PI, NTK, YaRN).

Length extrapolation

Plain RoPE trained on 4k tokens degrades quickly past 4k — the high-frequency pairs see angles outside their training distribution. Three families of fix:

  - Position Interpolation (PI): rescale positions so the new, longer range maps back into the trained 4k window.
  - NTK-aware scaling: increase the base so low-frequency pairs stretch while the highest frequencies stay near their trained behaviour.
  - YaRN: interpolate per frequency band, combining both ideas with an attention temperature correction.

Why RoPE wins overall

The key facts: (1) the score depends only on relative position, so the attention pattern transfers across sequence lengths; (2) every pair gets a different frequency, so positional information is encoded in a frequency comb that survives composition with weight matrices; (3) implementing it is two element-wise multiplications and a small permute — almost free. The combination is hard to beat.
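Point (3) in code — a sketch of the element-wise form for the interleaved layout used above (production kernels often use a half-split layout instead; `rope_elementwise` is a hypothetical name):

```python
import numpy as np

def rope_elementwise(x, m, base=10000.0):
    """RoPE as two element-wise multiplies plus a pair-swap permute."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    cos = np.repeat(np.cos(m * theta), 2)     # [c0, c0, c1, c1, ...]
    sin = np.repeat(np.sin(m * theta), 2)
    swapped = np.empty_like(x)                # the "small permute": (x1, x2) -> (-x2, x1)
    swapped[0::2] = -x[1::2]
    swapped[1::2] =  x[0::2]
    return x * cos + swapped * sin            # two multiplies, one add

x = np.arange(1.0, 9.0)
y = rope_elementwise(x, m=7)
# Still a pure rotation: the norm is preserved.
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
```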

10

Interactive: RoPE Visualiser

The widget shows a $d = 8$-dimensional vector being RoPE-rotated as a function of token position. Each pair of dimensions traces a circle at its own frequency — high-frequency pairs spin fast, low-frequency pairs almost stand still. The bottom plot shows how the attention dot product between position-0 and position-$m$ evolves with $m$.


Top: each circle shows one of the 4 dimension-pairs. The pink dot is its current rotated position; the violet path traces history. Bottom: $\mathbf{q}_0^\top R_m \mathbf{k}$ for a fixed $\mathbf{q} = \mathbf{k}$ as a function of $m$ — the attention pattern is a sum of cosines at the comb of frequencies.

11

Cheat Sheet

Read next

Deck 09 — Gradients, Jacobians & Backprop picks up the chain rule from a linear-algebra angle: every backward pass in a neural net is a Jacobian-vector product, and writing forward and backward as matrix algebra shows why autograd works.