Length, angle and perpendicular — the geometry that turns every "find the closest" question in machine learning into a single drop of a perpendicular.
A norm on a real vector space $V$ is a function $\|\cdot\| : V \to \mathbb{R}_{\ge 0}$ satisfying
$\|\mathbf{v}\|_1 = \sum_i |v_i|$.
Unit ball: diamond. Encourages sparsity (lasso).
$\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$.
Unit ball: circle. Smooth, rotationally invariant, what an inner product induces.
$\|\mathbf{v}\|_\infty = \max_i |v_i|$.
Unit ball: square. Useful for adversarial perturbations and robust ML.
For finite dimensions all norms are equivalent — they differ by at most a constant factor. They induce the same notion of convergence, but very different notions of "small": minimising under $\ell^1$ picks a sparse vector; under $\ell^2$ picks a smooth one. ML uses both.
An inner product on $V$ is a map $\langle\cdot,\cdot\rangle : V\times V \to \mathbb{R}$ that is
Every inner product produces a norm by $\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$. Not every norm comes from an inner product: $\ell^1$ and $\ell^\infty$ don't.
The inner product is the only object needed for projection, angle, and orthogonality. The kernel trick (deck 12) generalises by letting $\langle\mathbf{u},\mathbf{v}\rangle_k = k(\mathbf{u},\mathbf{v})$ — an inner product without ever writing down the corresponding vectors.
$$\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}.$$
$$d(\mathbf{u},\mathbf{v}) = \|\mathbf{u}-\mathbf{v}\|.$$
$$\cos\theta = \frac{\langle\mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|\,\|\mathbf{v}\|} \quad \in [-1, 1].$$
The fact that this ratio is in $[-1,1]$ is the Cauchy–Schwarz inequality:
$$|\langle\mathbf{u},\mathbf{v}\rangle| \le \|\mathbf{u}\|\,\|\mathbf{v}\|.$$
Equality holds iff $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent. This single inequality powers most convergence proofs in optimisation and the regret bounds in online learning.
$\mathbf{u}\perp\mathbf{v}$ iff $\langle\mathbf{u},\mathbf{v}\rangle=0$. When $\mathbf{u}\perp\mathbf{v}$, Pythagoras applies: $\|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$.
The fact above is the basis of "cosine similarity" between embedding vectors: $\cos\theta(\mathbf{u},\mathbf{v}) \in [-1,1]$ measures how aligned the directions are, independent of magnitude. Two embeddings can be a thousand units apart in $\ell^2$ and still have $\cos\theta=1$.
A basis $(\mathbf{b}_1,\dots,\mathbf{b}_n)$ is orthonormal if $\langle\mathbf{b}_i,\mathbf{b}_j\rangle = \delta_{ij}$ (1 if $i=j$, 0 otherwise). In coordinates relative to an orthonormal basis, the inner product collapses to a dot product and a vector's coordinates are just its inner products with the basis vectors:
$$\mathbf{v} = \sum_{i=1}^n \langle \mathbf{v}, \mathbf{b}_i\rangle\, \mathbf{b}_i.$$
Given any basis $(\mathbf{v}_1, \dots, \mathbf{v}_n)$, produce an orthonormal one:
$$\mathbf{u}_k = \mathbf{v}_k - \sum_{j=1}^{k-1} \langle\mathbf{v}_k, \mathbf{b}_j\rangle\,\mathbf{b}_j, \qquad \mathbf{b}_k = \frac{\mathbf{u}_k}{\|\mathbf{u}_k\|}.$$
Each step strips off the component of $\mathbf{v}_k$ that already lies in the span of the previous basis vectors, then normalises. Stack the $\mathbf{b}_k$'s as columns and you get the $Q$ of the $QR$ factorisation (deck 04).
"Classical" Gram–Schmidt is numerically unstable for nearly-parallel vectors. The modified version subtracts each projection one at a time and accumulates much less round-off error; production libraries use modified GS or Householder reflections.
For a subspace $U\subseteq V$, the orthogonal complement is
$$U^\perp = \{\mathbf{w}\in V : \langle\mathbf{w},\mathbf{u}\rangle=0 \text{ for all } \mathbf{u}\in U\}.$$
Three facts:
For any matrix $A$:
$$\mathrm{Row}(A)^\perp = \mathrm{Nul}(A), \qquad \mathrm{Col}(A)^\perp = \mathrm{Nul}(A^\top).$$
The four fundamental subspaces are therefore two orthogonal pairs sitting inside $\mathbb{R}^n$ and $\mathbb{R}^m$. This is the geometric content of the rank-nullity theorem.
Drag the tip of $\mathbf{v}$ (purple). The projection onto the green line $L = \mathrm{span}(\mathbf{u})$ is computed live, and the residual is drawn as a dashed amber segment. Notice three things:
The formula on the line:
$$\pi_L(\mathbf{v}) = \frac{\langle\mathbf{v},\mathbf{u}\rangle}{\langle\mathbf{u},\mathbf{u}\rangle}\,\mathbf{u}.$$
For a $k$-dimensional subspace $U = \mathrm{Col}(B)$ where $B\in\mathbb{R}^{D\times k}$ has linearly independent columns, the orthogonal projection of $\mathbf{v}$ onto $U$ is
$$\pi_U(\mathbf{v}) = B(B^\top B)^{-1} B^\top \mathbf{v}.$$
If the columns of $B$ are orthonormal the formula simplifies to $BB^\top \mathbf{v}$ — one matrix-vector multiply.
$\pi_U(\mathbf{v})$ is the unique element of $U$ closest to $\mathbf{v}$:
$$\pi_U(\mathbf{v}) = \arg\min_{\mathbf{u}\in U}\|\mathbf{v}-\mathbf{u}\|.$$
Proof: the residual $\mathbf{v}-\pi_U(\mathbf{v})$ lies in $U^\perp$; Pythagoras gives $\|\mathbf{v}-\mathbf{u}\|^2 = \|\mathbf{v}-\pi_U(\mathbf{v})\|^2 + \|\pi_U(\mathbf{v})-\mathbf{u}\|^2$, minimised at $\mathbf{u}=\pi_U(\mathbf{v})$.
The least-squares solution $\hat{\boldsymbol\theta} = (\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}$ in deck 09 has the formula $\Phi\hat{\boldsymbol\theta} = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}$ — exactly the projection of $\mathbf{y}$ onto $\mathrm{Col}(\Phi)$. The "$\Phi^\top$" on the right and the "$(\Phi^\top\Phi)^{-1}$" in the middle exist because $\Phi$ may not have orthonormal columns.
A rotation is a linear map that preserves both lengths and orientation. In matrix form: $R\in\mathbb{R}^{D\times D}$ is a rotation iff $R^\top R = I$ and $\det R = +1$. The set of all such $R$ forms a group called $SO(D)$.
$$R(\theta) = \begin{bmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \phantom{-}\cos\theta\end{bmatrix}.$$
$R(\theta_1)R(\theta_2)=R(\theta_1+\theta_2)$. A 2D rotation is uniquely determined by one angle.
Three angles (Euler angles, or roll/pitch/yaw), or equivalently a unit axis and a single angle (Rodrigues), or a unit quaternion.
If $R^\top R = I$ but $\det R = -1$, $R$ is a reflection. Together they form the orthogonal group $O(D)$.
Rotary Position Embeddings (used by Llama, Gemma, DeepSeek) rotate pairs of dimensions in $\mathbb{R}^d$ by an angle that depends on position. The position-dependent rotation matrix is block-diagonal with $d/2$ blocks each of the form above. The whole thing is a rotation.
The abstract definition of inner product means we can put one on spaces of functions. The space $L^2([a,b])$ of square-integrable functions on $[a,b]$ comes with
$$\langle f, g\rangle = \int_a^b f(x)\, g(x)\, dx.$$
This makes "two functions are orthogonal" a sensible statement. Three uses in ML:
$\{1, \sin nx, \cos nx\}$ is an orthogonal basis of $L^2([0,2\pi])$. Coefficients = inner products.
Orthogonal polynomials (Legendre, Hermite) give numerically stable feature maps for regression.
Mercer's theorem: a kernel $k(\mathbf{x},\mathbf{x}')$ is the inner product of feature maps in an RKHS. The SVM in deck 12 only ever uses $k$, never the features themselves.
The same theorems — Pythagoras, Cauchy–Schwarz, projection minimises distance — hold word-for-word in these function spaces. That's the power of the abstract definition: every theorem proved in this chapter applies everywhere.
| Concept | Used by | Where |
|---|---|---|
| $\ell^2$ norm | Ridge regression | Regulariser $\|\boldsymbol\theta\|^2$ shrinks toward zero |
| $\ell^1$ norm | Sparse models | Lasso selects features |
| Inner product | SVM (ch. 12) | Replaced by kernel $k(\mathbf{x},\mathbf{x}')$ |
| Cosine similarity | Embeddings, retrieval | Length-invariant similarity |
| Cauchy–Schwarz | Convergence proofs | SGD, perceptron, regret bounds |
| Orthonormal basis | PCA (ch. 10) | Principal components are an orthonormal basis |
| Orthogonal projection | Linear regression (ch. 9) | $\hat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto $\mathrm{Col}(\Phi)$ |
| Distance to hyperplane | SVM (ch. 12) | Margin = $\frac{|\mathbf{w}^\top\mathbf{x}+b|}{\|\mathbf{w}\|}$ |
| Rotation | PCA, decoders | Whitening / RoPE / orthogonal weight init |
| Quantity | Formula |
|---|---|
| length | $\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$ |
| distance | $d(\mathbf{u},\mathbf{v}) = \|\mathbf{u}-\mathbf{v}\|$ |
| angle | $\cos\theta = \frac{\langle\mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$ |
| Cauchy–Schwarz | $|\langle\mathbf{u},\mathbf{v}\rangle| \le \|\mathbf{u}\|\,\|\mathbf{v}\|$ |
| projection onto a line | $\pi_L(\mathbf{v})=\frac{\langle\mathbf{v},\mathbf{u}\rangle}{\langle\mathbf{u},\mathbf{u}\rangle}\mathbf{u}$ |
| projection onto $\mathrm{Col}(B)$ | $B(B^\top B)^{-1}B^\top \mathbf{v}$ |
| orthonormal projection | $BB^\top \mathbf{v}$ (when $B^\top B=I$) |
| residual is perpendicular | $\mathbf{v}-\pi_U(\mathbf{v})\in U^\perp$ |
| rotation in $\mathbb{R}^2$ | $\begin{bmatrix}\cos\theta & -\sin\theta\\\sin\theta & \cos\theta\end{bmatrix}$ |
| $SO(D)$ | $R^\top R=I,\;\det R=+1$ |
Deck 04 takes the same geometry one step further: when does a matrix have a complete orthonormal basis of eigenvectors? The answer (the spectral theorem) and its generalisation to rectangular matrices (the SVD) sit at the heart of PCA and low-rank approximation.