Orthonormal bases are the cleanest way to do anything. QR is how you build them. And RoPE — the position encoding inside Llama, Qwen, GPT-NeoX, Mistral and most modern transformers — is just a stack of tiny 2×2 rotation matrices.
A set $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ is orthonormal if
$$\langle \mathbf{q}_i, \mathbf{q}_j \rangle = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}.$$
Each is a unit vector and they're pairwise perpendicular. A square matrix $Q \in \mathbb{R}^{n\times n}$ whose columns are orthonormal is called orthogonal:
$$Q^\top Q = Q Q^\top = I, \qquad Q^{-1} = Q^\top.$$
Orthogonal matrices are the maps that don't deform anything — they rotate (and possibly reflect) but never stretch. They are the cleanest possible linear maps.
An orthogonal $Q$ satisfies $\sigma_1 = \sigma_n = 1$ — condition number 1. Multiplying by $Q$ doesn't amplify or attenuate any direction. Stack 100 orthogonal layers and the signal magnitude stays the same; gradients neither vanish nor explode through this part. This is the entire reason orthogonal weight initialisation (slide 06) is a respected baseline.
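A ten-line numpy check of that claim (a sketch; the depth of 100 and the width of 64 are arbitrary choices): compose random orthogonal matrices, obtained as the Q factor of Gaussian matrices, and watch the norm stay put.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 100

x = rng.standard_normal(d)
y = x.copy()
for _ in range(depth):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal layer
    y = Q @ y

print(np.linalg.norm(x), np.linalg.norm(y))   # equal up to floating-point error
```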
Given an arbitrary linearly independent $\{\mathbf{a}_1, \ldots, \mathbf{a}_n\}$, the Gram-Schmidt process produces an orthonormal $\{\mathbf{q}_1, \ldots, \mathbf{q}_n\}$ that spans the same space, by repeatedly subtracting the projection onto the already-built basis:
$$\tilde{\mathbf{q}}_k = \mathbf{a}_k - \sum_{j=1}^{k-1} \langle \mathbf{a}_k, \mathbf{q}_j\rangle\, \mathbf{q}_j, \qquad \mathbf{q}_k = \frac{\tilde{\mathbf{q}}_k}{\|\tilde{\mathbf{q}}_k\|}.$$
In words: take $\mathbf{a}_k$, kill its components along all the existing $\mathbf{q}_j$, then normalise.
By construction, each $\mathbf{a}_k$ lies in the span of $\mathbf{q}_1, \ldots, \mathbf{q}_k$ — so $\mathbf{a}_k = R_{1k}\mathbf{q}_1 + R_{2k}\mathbf{q}_2 + \cdots + R_{kk}\mathbf{q}_k$, where $R_{jk} = \langle \mathbf{a}_k, \mathbf{q}_j\rangle$ for $j < k$ and $R_{kk} = \|\tilde{\mathbf{q}}_k\|$. Stacking columns:
$$A = QR$$
with $Q$ orthonormal and $R$ upper triangular. The QR factorisation falls out as a side effect of orthogonalising the columns.
Classical Gram-Schmidt is unstable in floating point: small errors in early $\mathbf{q}_j$ contaminate later ones, and the resulting $Q$ can be far from orthogonal. The fix is Modified Gram-Schmidt (subtract one projection at a time, updating $\mathbf{a}_k$ in place) — better but still not great. Production libraries use Householder or Givens (next slide).
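Both variants fit in a dozen lines of numpy. A minimal sketch (the helper gram_schmidt and the Vandermonde test matrix are illustrative choices, not library code): classical projects the original column against every $\mathbf{q}_j$, modified projects the running residual, and on an ill-conditioned input the classical variant typically loses noticeably more orthogonality.

```python
import numpy as np

def gram_schmidt(A, modified=True):
    """Return Q, R with A = Q @ R; columns of A assumed linearly independent."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        v = A[:, k].copy()
        for j in range(k):
            # classical: project the original column; modified: project the residual
            R[j, k] = Q[:, j] @ (v if modified else A[:, k])
            v = v - R[j, k] * Q[:, j]
        R[k, k] = np.linalg.norm(v)
        Q[:, k] = v / R[k, k]
    return Q, R

A = np.vander(np.linspace(0, 1, 12), 8)   # ill-conditioned test matrix
for mod in (False, True):
    Q, R = gram_schmidt(A, modified=mod)
    print("modified =", mod, " ||Q^T Q - I|| =", np.linalg.norm(Q.T @ Q - np.eye(8)))
```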
For any $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, there exist $Q \in \mathbb{R}^{m \times n}$ with orthonormal columns and $R \in \mathbb{R}^{n \times n}$ upper triangular such that
$$A = QR.$$
$R$ is upper triangular — the same property that makes back-substitution easy. This is what makes QR useful: solving $A\mathbf{x} = \mathbf{b}$ becomes "compute $\mathbf{c} = Q^\top \mathbf{b}$, then back-substitute $R\mathbf{x} = \mathbf{c}$" — one orthogonal multiplication and one triangular solve.
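A minimal sketch of that recipe, using a tall least-squares system (the square case is identical); it assumes scipy is available for the triangular solve:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))        # tall: a least-squares problem
b = rng.standard_normal(200)

Q, R = np.linalg.qr(A)                   # reduced QR: Q is 200x5, R is 5x5
c = Q.T @ b                              # one orthogonal multiplication
x = solve_triangular(R, c)               # one back-substitution

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```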
numpy.linalg.eig relies on QR iteration under the hood, and torch.nn.init.orthogonal_ builds its weights from a QR factorisation. Production QR uses one of two methods, both stable.
For each column, choose a unit vector $\mathbf{v}$ such that the reflection $H = I - 2\mathbf{v}\mathbf{v}^\top$ zeros out everything below the diagonal in that column. Apply $n$ such reflections; product of reflectors is $Q^\top$, and the modified $A$ is $R$.
Each reflector is the identity plus a rank-1 update, so it is cheap to apply. This is the standard dense QR factorisation routine, LAPACK's geqrf.
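A minimal Householder QR sketch in numpy (helper names are mine, no pivoting, full column rank assumed): $R$ is built in place and $Q$ is never formed, only the reflector vectors are kept.

```python
import numpy as np

def householder_qr(A):
    """Return R and the list of Householder vectors; Q stays implicit."""
    R = A.astype(float).copy()
    m, n = R.shape
    vs = []
    for k in range(n):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])    # choose sign to avoid cancellation
        v /= np.linalg.norm(v)                          # assumes full column rank
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])   # apply H = I - 2 v v^T in place
        vs.append(v)
    return R, vs

def apply_qT(vs, b):
    """Compute Q^T b from the stored reflectors: O(mn) instead of forming Q."""
    b = b.astype(float).copy()
    for k, v in enumerate(vs):
        b[k:] -= 2.0 * v * (v @ b[k:])
    return b

A = np.random.default_rng(1).standard_normal((6, 4))
R, vs = householder_qr(A)
print(np.allclose(R, np.triu(R)))                                   # R is upper triangular
print(np.allclose(apply_qT(vs, A @ np.ones(4)), R @ np.ones(4)))    # Q^T (A x) = R x
```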
A Givens rotation is a 2×2 rotation embedded in a larger identity, rotating two coordinates at a time to zero one entry. A full QR produces $Q$ as a product of $\binom{n}{2}$ Givens rotations.
Slower for dense $A$ but parallelisable, and ideal when $A$ is already nearly triangular (sparse, banded). Used inside the iterative QR-eigenvalue algorithm.
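One Givens rotation, as a sketch (a full Givens QR just sweeps this over every subdiagonal entry):

```python
import numpy as np

def givens(a, b):
    """Return c, s such that [[c, -s], [s, c]]^T @ [a, b] = [r, 0]."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

A = np.array([[3.0, 1.0],
              [4.0, 2.0]])
c, s = givens(A[0, 0], A[1, 0])
G = np.array([[c, -s],
              [s,  c]])          # rotation acting on rows 0 and 1
print(G.T @ A)                   # entry (1, 0) is now exactly 0; (0, 0) becomes 5
```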
You almost never store an orthogonal $Q$ explicitly — you store the recipe to apply $Q$ (the list of Householder vectors or Givens angles). Applying it is then $O(n^2)$ instead of $O(n^3)$. Same trick is used inside the FFT, Hadamard transform, butterfly factorisation. Structured orthogonal matrices are everywhere in fast algorithms.
The RoPE positional encoding is a giant block-diagonal orthogonal matrix — a product of $d/2$ independent 2×2 Givens rotations applied to consecutive pairs of dimensions. Like a Householder QR, it's never stored explicitly: each rotation is parameterised by a scalar angle that depends on token position. Slide 08 makes this concrete.
The condition number of $A$ is
$$\kappa(A) = \frac{\sigma_1}{\sigma_n} = \|A\|_2 \, \|A^{-1}\|_2.$$
It measures how much $A$ amplifies the worst-case relative error in $\mathbf{x}$ when computing $A\mathbf{x}$. $\kappa = 1$ is perfect (orthogonal matrices); $\kappa = \infty$ means singular.
Anywhere in numerical work where you can replace an arbitrary linear map with an orthogonal one, do it. Cholesky is $L L^\top$ with $L$ lower triangular but not orthogonal — so its $\kappa$ matters. QR is $Q R$ with $\kappa(Q) = 1$, isolating all conditioning in $R$ — you don't get free conditioning, but you get a clean separation.
Random Gaussian initialisations of an $m \times n$ matrix (with $m \ge n$) have $\kappa(W) \approx (1+\sqrt{n/m})/(1 - \sqrt{n/m})$: close to 1 for very tall matrices, but growing without bound as the shape approaches square. Adam-trained matrices in the middle of training tend to have worse $\kappa$ than at init. RMSNorm (deck 11) and weight decay both implicitly regularise toward better-conditioned weights.
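A quick empirical check of both the formula and the QR separation from the previous paragraph (a sketch; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for m, n in [(4000, 1000), (2000, 1000), (1100, 1000)]:
    W = rng.standard_normal((m, n))
    est = (1 + np.sqrt(n / m)) / (1 - np.sqrt(n / m))
    Q, R = np.linalg.qr(W)                    # reduced QR of the random matrix
    print(f"{m}x{n}: cond(W)={np.linalg.cond(W):7.2f}  estimate={est:7.2f}  "
          f"cond(Q)={np.linalg.cond(Q):.4f}  cond(R)={np.linalg.cond(R):7.2f}")
```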
Saxe, McClelland & Ganguli (ICLR 2014) showed that initialising deep linear networks with random orthogonal matrices keeps the singular spectrum at exactly 1 throughout the network at $t = 0$. Signal neither vanishes nor explodes during the first forward pass; gradients flow uniformly.
For a weight $W \in \mathbb{R}^{m \times n}$ with $m \le n$ (wider-than-tall): sample a random Gaussian matrix, orthonormalise its rows via QR (folding in the signs of $\mathrm{diag}(R)$ so the result is uniformly distributed over the orthogonal group), and scale by a gain.
For taller-than-wide matrices, transpose. The gain factor compensates for the variance change introduced by the following nonlinearity: $\sqrt{2}$ for ReLU networks (compensating the half-rectifier), 1 for linear, $\approx 5/3$ for tanh. PyTorch's nn.init.orthogonal_(W, gain=...) implements exactly this.
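Roughly what nn.init.orthogonal_ does, written as a numpy sketch (the real PyTorch routine also handles tensors with more than two dimensions by flattening the trailing ones; orthogonal_init here is an illustrative helper, not the library function):

```python
import numpy as np

def orthogonal_init(m, n, gain=1.0, seed=0):
    """QR-of-Gaussian orthogonal init for an m x n weight matrix."""
    rng = np.random.default_rng(seed)
    tall = m >= n
    G = rng.standard_normal((m, n) if tall else (n, m))
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))       # sign fix: uniform over the orthogonal group
    return gain * (Q if tall else Q.T)

W = orthogonal_init(256, 512, gain=np.sqrt(2.0))   # e.g. a ReLU layer, wider than tall
print(np.allclose(W @ W.T, 2.0 * np.eye(256)))     # rows orthonormal, scaled by gain^2
```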
Xavier/LeCun-style variance scaling: std $\sim 1/\sqrt{n_{in}}$. Cheap, no QR. Works for shallow nets.
He/Kaiming init: std $\sim \sqrt{2/n_{in}}$, tuned for ReLU. Default for CNNs.
Orthogonal init: best for very deep linear networks; comparable for nonlinear ones. Sometimes used, with care, for RNN recurrent matrices to delay gradient explosion.
For modern transformers, orthogonal init is rarely the default — the residual stream and LayerNorm together control signal scale well enough that Xavier-uniform / He-normal are fine. But it's a useful tool when chasing very deep architectures with residual connections turned off (rare in practice).
The 2D rotation by angle $\theta$:
$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$
Orthogonal ($R_\theta^\top R_\theta = I$), determinant $+1$ (no reflection), eigenvalues $e^{\pm i\theta}$ over $\mathbb{C}$.
The composition rule $R_\alpha R_\beta = R_{\alpha + \beta}$, together with $R_\theta^\top = R_{-\theta}$, is the entire reason RoPE works for attention.
Identify $\mathbb{R}^2$ with $\mathbb{C}$ via $(x, y) \leftrightarrow x + iy$. Then $R_\theta$ acts as multiplication by $e^{i\theta}$. The composition rule $R_\alpha R_\beta = R_{\alpha + \beta}$ is just $e^{i\alpha} e^{i\beta} = e^{i(\alpha + \beta)}$. And inner product invariance becomes: $\langle e^{i\theta} z_1, e^{i\theta} z_2\rangle = \mathrm{Re}(\overline{e^{i\theta} z_1}\, e^{i\theta} z_2) = \mathrm{Re}(\bar{z}_1 z_2) = \langle z_1, z_2 \rangle$.
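A three-line numerical check of the correspondence (a sketch): rotating the real pair with $R_\theta$ matches multiplying the complex number by $e^{i\theta}$, and composition adds angles.

```python
import numpy as np

def R(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

a, b = 0.3, 1.1
x = np.array([2.0, -1.0])
z = complex(x[0], x[1])

w = np.exp(1j * a) * z                           # rotate as a complex multiplication
print(np.allclose(R(a) @ x, [w.real, w.imag]))   # same as the 2x2 matrix
print(np.allclose(R(a) @ R(b), R(a + b)))        # composition adds angles
```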
Su et al. (2021) introduced Rotary Position Embedding (RoPE), now the dominant position encoding in Llama, Qwen, Mistral, GPT-NeoX, Falcon, DeepSeek and most modern LLMs. It encodes position by rotating the query and key vectors before the attention dot product.
For a $d$-dimensional vector $\mathbf{x}$ at position $m$, group the $d$ coordinates into $d/2$ pairs $(x_{2i-1}, x_{2i})$. Apply a 2×2 rotation to each pair, with angle $m\theta_i$:
$$\begin{pmatrix} x'_{2i-1} \\ x'_{2i} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i-1} \\ x_{2i} \end{pmatrix}.$$
The rotation frequency $\theta_i$ varies with the pair index:
$$\theta_i = \mathrm{base}^{-2(i-1)/d}, \quad \text{typically with}\ \mathrm{base} = 10000.$$
So the first pair rotates fast (high frequency), the last pair slowly (low frequency). Together the $d/2$ pairs form a frequency comb — every pair sees position differently.
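A minimal RoPE sketch in numpy, pairing adjacent dimensions exactly as in the equation above (production implementations, e.g. Llama's, typically use a half-split pairing and fused torch ops, but the arithmetic is the same; the rope helper is illustrative):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate the last dimension of x by (scalar) position pos, pairing dims (0,1), (2,3), ..."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # frequency per pair
    ang = pos * theta                                # angles m * theta_i
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]              # the pairs (x_{2i-1}, x_{2i})
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(0).standard_normal(8)
print(np.allclose(rope(q, 0), q))    # position 0: identity
print(rope(q, 5))                    # each pair rotated by angle 5 * theta_i
```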
If $\mathbf{q}$ at position $m$ is rotated to $R_m \mathbf{q}$ and $\mathbf{k}$ at position $n$ is rotated to $R_n \mathbf{k}$, the attention score is
$$\langle R_m \mathbf{q}, R_n \mathbf{k}\rangle = \mathbf{q}^\top R_m^\top R_n \mathbf{k} = \mathbf{q}^\top R_{n - m} \mathbf{k}.$$
(Using $R_m^\top R_n = R_{-m} R_n = R_{n-m}$ block-by-block.) The score depends only on the relative position $n - m$ — never on absolute positions. This is the property that makes RoPE work: the attention pattern between two tokens is a function of their distance, not where they sit in the sequence.
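And a direct numerical check of the relative-position property, building the block-diagonal $R_m$ explicitly (a sketch; $d$, the positions and the base are arbitrary):

```python
import numpy as np

def rope_matrix(m, d, base=10000.0):
    """Block-diagonal RoPE rotation R_m: d/2 independent 2x2 rotations."""
    R = np.zeros((d, d))
    for i in range(d // 2):
        t = m * base ** (-2.0 * i / d)
        R[2*i:2*i+2, 2*i:2*i+2] = [[np.cos(t), -np.sin(t)],
                                   [np.sin(t),  np.cos(t)]]
    return R

d, m, n = 8, 7, 19
rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal(d)

lhs = (rope_matrix(m, d) @ q) @ (rope_matrix(n, d) @ k)
rhs = q @ (rope_matrix(n - m, d) @ k)
print(np.isclose(lhs, rhs))   # True: the score depends only on n - m
```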
Applied to $\mathbf{q}$ and $\mathbf{k}$ only, after the QKV projection and before the dot product. The values $\mathbf{v}$ are not rotated. With MLA (deck 07), RoPE is applied only to a small "decoupled" slice of the queries and keys that bypasses the latent compression, a workaround that keeps the compressed KV cache compatible with rotary positions.
Sinusoidal encoding (the original Transformer): add a fixed sin/cos position vector to the embeddings. Absolute, not relative; degrades for sequences longer than seen at train time. Foundational but historical.
Learned absolute embeddings: a learned vector per position. Used by GPT-2, BERT. Hard cap at the longest sequence seen in training; doesn't extrapolate at all.
RoPE: a rotation per dimension-pair, angle linear in position. Inherently relative. Extrapolates (somewhat) to longer sequences, especially with frequency interpolation tricks (PI, NTK, YaRN).
Plain RoPE trained on 4k tokens degrades quickly past 4k: the low-frequency pairs rotate so slowly that positions beyond the training length produce angles they have never seen. Three families of fix: position interpolation (PI) rescales positions so the longer context maps back into the trained angle range; NTK-aware scaling enlarges the base so the slow pairs interpolate while the fast pairs stay nearly untouched; YaRN combines a per-frequency interpolation ramp with a temperature correction on the attention logits.
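A sketch of the first two knobs (the values of $d$, the scale and the position are arbitrary; the $d/(d-2)$ exponent is one commonly used choice, derived by forcing the slowest pair to interpolate by exactly the scale factor; YaRN's per-frequency ramp and attention temperature are not shown):

```python
import numpy as np

def rope_freqs(d, base):
    return base ** (-2.0 * np.arange(d // 2) / d)

d, base, scale = 128, 10000.0, 4.0     # e.g. stretching a 4k-trained model to 16k
m = 16000                              # a position far beyond the training range

# Position interpolation (PI): squash positions back into the trained range.
pi_angles = (m / scale) * rope_freqs(d, base)

# NTK-aware: keep positions, enlarge the base so the slow pairs interpolate
# while the fastest pair is left almost untouched.
ntk_angles = m * rope_freqs(d, base * scale ** (d / (d - 2)))

print("PI : fastest pair", pi_angles[0],  " slowest pair", pi_angles[-1])
print("NTK: fastest pair", ntk_angles[0], " slowest pair", ntk_angles[-1])
```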
The key facts: (1) the score depends only on relative position, so the attention pattern transfers across sequence lengths; (2) every pair gets a different frequency, so positional information is encoded in a frequency comb that survives composition with weight matrices; (3) implementing it is two element-wise multiplications and a small permute — almost free. The combination is hard to beat.
The widget shows a $d = 8$-dimensional vector being RoPE-rotated as a function of token position. Each pair of dimensions traces a circle at its own frequency — high-frequency pairs spin fast, low-frequency pairs almost stand still. The bottom plot shows how the attention dot product between position-0 and position-$m$ evolves with $m$.
Top: each circle shows one of the 4 dimension-pairs. The pink dot is its current rotated position; the violet path traces history. Bottom: $\mathbf{q}_0^\top R_m \mathbf{k}$ for a fixed $\mathbf{q} = \mathbf{k}$ as a function of $m$ — the attention pattern is a sum of cosines at the comb of frequencies.
Deck 09 — Gradients, Jacobians & Backprop picks up the chain rule from a linear-algebra angle: every backward pass in a neural net is a Jacobian-vector product, and writing forward and backward as matrix algebra shows why autograd works.