Mathematics for Machine Learning — Deck 03

Analytic Geometry

Length, angle and perpendicular — the geometry that turns every "find the closest" question in machine learning into a single drop of a perpendicular.

norm inner product angle orthogonality projection
00

Topics We'll Cover

01

Norms and the Unit Ball

A norm on a real vector space $V$ is a function $\|\cdot\| : V \to \mathbb{R}_{\ge 0}$ satisfying

  1. Positive definiteness: $\|\mathbf{v}\|=0 \iff \mathbf{v}=\mathbf{0}$;
  2. Absolute homogeneity: $\|\alpha\mathbf{v}\| = |\alpha|\,\|\mathbf{v}\|$;
  3. Triangle inequality: $\|\mathbf{u}+\mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|$.

$\ell^1$ — "Manhattan"

$\|\mathbf{v}\|_1 = \sum_i |v_i|$.

Unit ball: diamond. Encourages sparsity (lasso).

$\ell^2$ — "Euclidean"

$\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$.

Unit ball: circle. Smooth, rotationally invariant, what an inner product induces.

$\ell^\infty$ — "Chebyshev"

$\|\mathbf{v}\|_\infty = \max_i |v_i|$.

Unit ball: square. Useful for adversarial perturbations and robust ML.

In finite dimensions all norms are equivalent: any two bound each other up to constant factors, so they induce the same notion of convergence. They give very different notions of "small", though: minimising under $\ell^1$ favours a sparse vector, while minimising under $\ell^2$ spreads the weight over many small entries. ML uses both.
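As a quick numerical check of the three norms and of one standard chain of equivalence bounds on $\mathbb{R}^n$, here is a minimal sketch. The vector `v` is an arbitrary illustration, not taken from the deck.

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0, 1.0])   # arbitrary example vector
n = v.size

l1 = np.linalg.norm(v, ord=1)          # sum of absolute values
l2 = np.linalg.norm(v, ord=2)          # Euclidean length
linf = np.linalg.norm(v, ord=np.inf)   # largest absolute entry

# Equivalence of norms on R^n: each norm is sandwiched by multiples of another.
chain_holds = linf <= l2 <= l1 <= np.sqrt(n) * l2 <= n * linf

print(l1, l2, linf, chain_holds)
```

The constants ($\sqrt{n}$ and $n$) depend on the dimension, which is why the equivalence is a finite-dimensional statement.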

02

Inner Products — The Abstract Definition

An inner product on $V$ is a map $\langle\cdot,\cdot\rangle : V\times V \to \mathbb{R}$ that is

  1. Bilinear: $\langle\alpha\mathbf{u}+\beta\mathbf{v},\mathbf{w}\rangle = \alpha\langle\mathbf{u},\mathbf{w}\rangle + \beta\langle\mathbf{v},\mathbf{w}\rangle$ and linear in the second slot too;
  2. Symmetric: $\langle\mathbf{u},\mathbf{v}\rangle = \langle\mathbf{v},\mathbf{u}\rangle$;
  3. Positive definite: $\langle\mathbf{v},\mathbf{v}\rangle \ge 0$, with equality iff $\mathbf{v}=\mathbf{0}$.

Examples

The induced norm

Every inner product produces a norm by $\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$. Not every norm comes from an inner product: $\ell^1$ and $\ell^\infty$ don't.
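One way to see that $\ell^1$ and $\ell^\infty$ are not induced by any inner product is the parallelogram law, $\|\mathbf{u}+\mathbf{v}\|^2 + \|\mathbf{u}-\mathbf{v}\|^2 = 2\|\mathbf{u}\|^2 + 2\|\mathbf{v}\|^2$, which every induced norm must satisfy. The sketch below measures the violation for the standard basis vectors of $\mathbb{R}^2$; the helper name `parallelogram_gap` is made up for illustration.

```python
import numpy as np

def parallelogram_gap(u, v, p):
    """||u+v||^2 + ||u-v||^2 - 2||u||^2 - 2||v||^2 for the l^p norm."""
    norm = lambda x: np.linalg.norm(x, ord=p)
    return float(norm(u + v) ** 2 + norm(u - v) ** 2
                 - 2 * norm(u) ** 2 - 2 * norm(v) ** 2)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

gap_l2 = parallelogram_gap(u, v, 2)         # zero up to round-off
gap_l1 = parallelogram_gap(u, v, 1)         # nonzero: the law fails
gap_linf = parallelogram_gap(u, v, np.inf)  # nonzero: the law fails

print(gap_l2, gap_l1, gap_linf)
```

A single pair of vectors with a nonzero gap is enough to rule out an inner product behind that norm.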

Why this matters for ML

The inner product is the only object needed for projection, angle, and orthogonality. The kernel trick (deck 12) generalises by letting $\langle\mathbf{u},\mathbf{v}\rangle_k = k(\mathbf{u},\mathbf{v})$ — an inner product without ever writing down the corresponding vectors.
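To make the kernel idea concrete, the following sketch uses the homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$, chosen here purely as an illustration (the deck does not name a specific kernel at this point). It checks that evaluating $k$ directly agrees with the dot product of explicit feature vectors; the function names are hypothetical.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2."""
    return float(np.dot(x, z) ** 2)

def poly2_features(x):
    """Explicit feature map matching poly2_kernel, for x in R^2."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_value = poly2_kernel(x, z)                                    # no features built
explicit = float(np.dot(poly2_features(x), poly2_features(z)))  # features built

print(k_value, explicit)
```

Both numbers agree, but the kernel evaluation never forms the three-dimensional feature vectors.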

03

Lengths, Distances, Angles, Cauchy–Schwarz

Length

$$\|\mathbf{v}\| = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}.$$

Distance

$$d(\mathbf{u},\mathbf{v}) = \|\mathbf{u}-\mathbf{v}\|.$$

Angle

$$\cos\theta = \frac{\langle\mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|\,\|\mathbf{v}\|} \quad \in [-1, 1].$$

The fact that this ratio is in $[-1,1]$ is the Cauchy–Schwarz inequality:

$$|\langle\mathbf{u},\mathbf{v}\rangle| \le \|\mathbf{u}\|\,\|\mathbf{v}\|.$$

Equality holds iff $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent. This single inequality powers most convergence proofs in optimisation and the regret bounds in online learning.

Orthogonality

$\mathbf{u}\perp\mathbf{v}$ iff $\langle\mathbf{u},\mathbf{v}\rangle=0$. When $\mathbf{u}\perp\mathbf{v}$, Pythagoras applies: $\|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$.
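The Pythagoras identity follows in one line from bilinearity and symmetry of the inner product:

$$\|\mathbf{u}+\mathbf{v}\|^2 = \langle\mathbf{u}+\mathbf{v},\mathbf{u}+\mathbf{v}\rangle = \|\mathbf{u}\|^2 + 2\langle\mathbf{u},\mathbf{v}\rangle + \|\mathbf{v}\|^2,$$

and the cross term $2\langle\mathbf{u},\mathbf{v}\rangle$ vanishes exactly when $\mathbf{u}\perp\mathbf{v}$.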

Cosine similarity in ML

The angle formula is the basis of "cosine similarity" between embedding vectors: $\cos\theta(\mathbf{u},\mathbf{v}) \in [-1,1]$ measures how aligned the directions are, independent of magnitude. Two embeddings can be a thousand units apart in $\ell^2$ and still have $\cos\theta=1$.
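A small sketch of the point about magnitude, assuming the standard dot product on $\mathbb{R}^2$. The vectors are made up: `v` points the same way as `u` but is 1001 times longer, so the two are 1000 units apart.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) between two nonzero vectors under the standard dot product."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([0.6, 0.8])           # unit length
v = 1001.0 * u                     # same direction, much longer
w = np.array([-0.8, 0.6])          # perpendicular to u

distance = float(np.linalg.norm(u - v))
cos_same = cosine_similarity(u, v)
cos_orth = cosine_similarity(u, w)

print(distance, cos_same, cos_orth)
```

The function divides by both norms, so it is undefined for the zero vector; real code would need to guard that case.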

04

Orthonormal Bases & Gram–Schmidt

A basis $(\mathbf{b}_1,\dots,\mathbf{b}_n)$ is orthonormal if $\langle\mathbf{b}_i,\mathbf{b}_j\rangle = \delta_{ij}$ (1 if $i=j$, 0 otherwise). In coordinates relative to an orthonormal basis, the inner product collapses to a dot product and a vector's coordinates are just its inner products with the basis vectors:

$$\mathbf{v} = \sum_{i=1}^n \langle \mathbf{v}, \mathbf{b}_i\rangle\, \mathbf{b}_i.$$

Gram–Schmidt

Given any basis $(\mathbf{v}_1, \dots, \mathbf{v}_n)$, produce an orthonormal one:

$$\mathbf{u}_k = \mathbf{v}_k - \sum_{j=1}^{k-1} \langle\mathbf{v}_k, \mathbf{b}_j\rangle\,\mathbf{b}_j, \qquad \mathbf{b}_k = \frac{\mathbf{u}_k}{\|\mathbf{u}_k\|}.$$

Each step strips off the component of $\mathbf{v}_k$ that already lies in the span of the previous basis vectors, then normalises. Stack the $\mathbf{b}_k$'s as columns and you get the $Q$ of the $QR$ factorisation (deck 04).
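The procedure can be sketched in a few lines of NumPy. This version subtracts each projection from the running residual rather than from the original $\mathbf{v}_k$; in exact arithmetic that matches the formula above, and it tends to behave better in floating point. It assumes the standard dot product and linearly independent columns, and the function name is made up for illustration.

```python
import numpy as np

def modified_gram_schmidt(V):
    """Orthonormalise the columns of V (assumed linearly independent)."""
    V = np.array(V, dtype=float)          # work on a copy
    Q = np.zeros_like(V)
    for k in range(V.shape[1]):
        u = V[:, k].copy()
        # Strip off the component along each earlier basis vector in turn,
        # always using the updated residual u.
        for j in range(k):
            u -= np.dot(Q[:, j], u) * Q[:, j]
        Q[:, k] = u / np.linalg.norm(u)   # normalise
    return Q

V = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Q = modified_gram_schmidt(V)
R = Q.T @ V                               # upper triangular, so V = Q R

print(np.round(Q.T @ Q, 12))
```

The printed matrix should be the identity up to round-off, and `R` being upper triangular reflects that each $\mathbf{v}_k$ lies in the span of the first $k$ basis vectors.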

Numerical caveat

"Classical" Gram–Schmidt loses orthogonality in floating point when the input vectors are nearly parallel. The modified version subtracts each projection from the running residual rather than from the original $\mathbf{v}_k$, which accumulates much less round-off error; production libraries use modified GS or Householder reflections.

05

Orthogonal Complement

For a subspace $U\subseteq V$, the orthogonal complement is

$$U^\perp = \{\mathbf{w}\in V : \langle\mathbf{w},\mathbf{u}\rangle=0 \text{ for all } \mathbf{u}\in U\}.$$

Three facts:

  1. $U^\perp$ is a subspace.
  2. $\dim U + \dim U^\perp = \dim V$.
  3. $V = U \oplus U^\perp$ — every $\mathbf{v}\in V$ has a unique decomposition $\mathbf{v} = \mathbf{u} + \mathbf{w}$ with $\mathbf{u}\in U$, $\mathbf{w}\in U^\perp$.

Tying back to deck 02

For any matrix $A\in\mathbb{R}^{m\times n}$:

$$\mathrm{Row}(A)^\perp = \mathrm{Nul}(A), \qquad \mathrm{Col}(A)^\perp = \mathrm{Nul}(A^\top).$$

The four fundamental subspaces are therefore two orthogonal pairs: $\mathrm{Row}(A)$ and $\mathrm{Nul}(A)$ inside $\mathbb{R}^n$, $\mathrm{Col}(A)$ and $\mathrm{Nul}(A^\top)$ inside $\mathbb{R}^m$. This is the geometric content of the rank-nullity theorem.
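The relation $\mathrm{Row}(A)^\perp = \mathrm{Nul}(A)$ can be checked numerically. The sketch below reads an orthonormal basis of the null space off the SVD; the example matrix and the rank tolerance `1e-10` are arbitrary choices for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])     # second row is twice the first: rank 1

# With the full SVD, the rows of Vt beyond the rank span Nul(A).
U, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))       # tolerance is an assumption
null_basis = Vt[rank:].T            # columns form an orthonormal basis of Nul(A)

# Every row of A is orthogonal to every null-space basis vector.
row_dot_null = A @ null_basis

print(rank, null_basis.shape[1], np.round(row_dot_null, 12))
```

Here the rank and the null-space dimension add up to the number of columns of `A`, matching $\dim U + \dim U^\perp = \dim V$ with $U = \mathrm{Row}(A)$.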

06

Interactive: Orthogonal Projection

Drag the tip of $\mathbf{v}$ (purple). The projection onto the green line $L = \mathrm{span}(\mathbf{u})$ is computed live, and the residual is drawn as a dashed amber segment. Notice three things:

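[Interactive figure: drag the purple tip of $\mathbf{v}$; the green line is $\mathrm{span}(\mathbf{u})$, where $\mathbf{u}$ is the green vector. Live readouts: $\|\mathbf{v}\|$, $\|\pi_L(\mathbf{v})\|$, $\|\mathbf{v}-\pi_L(\mathbf{v})\|$, $\cos\theta$.]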

The formula for the projection onto the line:

$$\pi_L(\mathbf{v}) = \frac{\langle\mathbf{v},\mathbf{u}\rangle}{\langle\mathbf{u},\mathbf{u}\rangle}\,\mathbf{u}.$$
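The line-projection formula translates directly into code. This is a minimal sketch under the standard dot product, with made-up vectors and a hypothetical helper name; it also checks that the residual is perpendicular to the line and that Pythagoras holds for the split.

```python
import numpy as np

def project_onto_line(v, u):
    """Orthogonal projection of v onto span(u); u must be nonzero."""
    return (np.dot(v, u) / np.dot(u, u)) * u

u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

p = project_onto_line(v, u)     # component of v along the line
r = v - p                       # residual

residual_dot_u = float(np.dot(r, u))               # 0: residual is perpendicular
pythagoras_gap = float(np.dot(v, v) - np.dot(p, p) - np.dot(r, r))

print(p, r, residual_dot_u, pythagoras_gap)
```

Note that `u` does not need to be a unit vector: the division by $\langle\mathbf{u},\mathbf{u}\rangle$ takes care of the scaling.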

07

Projection onto a Subspace

For a $k$-dimensional subspace $U = \mathrm{Col}(B)$ where $B\in\mathbb{R}^{D\times k}$ has linearly independent columns, the orthogonal projection of $\mathbf{v}$ onto $U$ is

$$\pi_U(\mathbf{v}) = B(B^\top B)^{-1} B^\top \mathbf{v}.$$

If the columns of $B$ are orthonormal the formula simplifies to $BB^\top \mathbf{v}$ — one matrix-vector multiply.

The minimising property

$\pi_U(\mathbf{v})$ is the unique element of $U$ closest to $\mathbf{v}$:

$$\pi_U(\mathbf{v}) = \arg\min_{\mathbf{u}\in U}\|\mathbf{v}-\mathbf{u}\|.$$

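Proof: the residual $\mathbf{v}-\pi_U(\mathbf{v})$ lies in $U^\perp$, while $\pi_U(\mathbf{v})-\mathbf{u}$ lies in $U$ for every $\mathbf{u}\in U$, so the two are orthogonal. Pythagoras then gives $\|\mathbf{v}-\mathbf{u}\|^2 = \|\mathbf{v}-\pi_U(\mathbf{v})\|^2 + \|\pi_U(\mathbf{v})-\mathbf{u}\|^2$, which is minimised exactly at $\mathbf{u}=\pi_U(\mathbf{v})$.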
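A sketch of the subspace projection, assuming the standard dot product and a made-up $3\times 2$ matrix `B`. It solves the normal equations instead of forming $(B^\top B)^{-1}$ explicitly, and cross-checks against the orthonormal-basis formula using the $Q$ factor from `np.linalg.qr`; the helper name is hypothetical.

```python
import numpy as np

def project_onto_columns(B, v):
    """Orthogonal projection of v onto Col(B); B needs independent columns."""
    # Solve (B^T B) c = B^T v rather than inverting B^T B.
    coeffs = np.linalg.solve(B.T @ B, B.T @ v)
    return B @ coeffs

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
v = np.array([1.0, 2.0, 3.0])

p = project_onto_columns(B, v)
residual = v - p                    # should lie in Col(B)^perp

# Same projection through an orthonormal basis Q of Col(B): p = Q Q^T v.
Q, _ = np.linalg.qr(B)
p_qr = Q @ (Q.T @ v)

# Any other point of Col(B) is at least as far from v.
other = B @ np.array([1.0, 1.0])
is_closest = np.linalg.norm(v - p) <= np.linalg.norm(v - other)

print(p, B.T @ residual, is_closest)
```

Solving the linear system is usually preferred over an explicit inverse, though for badly conditioned `B` a QR-based route is the safer choice.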

Linear regression preview

The least-squares solution $\hat{\boldsymbol\theta} = (\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}$ in deck 09 gives the fitted values $\Phi\hat{\boldsymbol\theta} = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}$, which is exactly the projection of $\mathbf{y}$ onto $\mathrm{Col}(\Phi)$. The "$(\Phi^\top\Phi)^{-1}$" in the middle is there because $\Phi$ may not have orthonormal columns; if it did, $\Phi^\top\Phi = I$ and the formula would reduce to $\Phi\Phi^\top\mathbf{y}$.
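To connect the regression formula with projection, the sketch below fits a line to synthetic data (generated here for illustration only, with a fixed seed) and compares the fitted values from `np.linalg.lstsq` with the projection of $\mathbf{y}$ onto $\mathrm{Col}(\Phi)$.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(6.0)
Phi = np.column_stack([np.ones_like(t), t])            # intercept + one feature
y = 2.0 + 0.5 * t + rng.normal(scale=0.1, size=t.size)  # synthetic targets

# Least-squares fit.
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ theta_hat

# Projector onto Col(Phi): H = Phi (Phi^T Phi)^{-1} Phi^T.
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)

same_fit = np.allclose(H @ y, y_hat)                 # fitted values = projection
idempotent = np.allclose(H @ H, H)                   # projecting twice changes nothing
residual_orthogonal = np.allclose(Phi.T @ (y - y_hat), 0.0)

print(same_fit, idempotent, residual_orthogonal)
```

Forming `H` explicitly is only sensible for small problems like this one; it is a $6\times 6$ matrix here but grows with the square of the number of data points.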

08

Rotations

A rotation is a linear map that preserves both lengths and orientation. In matrix form: $R\in\mathbb{R}^{D\times D}$ is a rotation iff $R^\top R = I$ and $\det R = +1$. The set of all such $R$ forms a group called $SO(D)$.

In $\mathbb{R}^2$

$$R(\theta) = \begin{bmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \phantom{-}\cos\theta\end{bmatrix}.$$

$R(\theta_1)R(\theta_2)=R(\theta_1+\theta_2)$. A 2D rotation is uniquely determined by one angle.

In $\mathbb{R}^3$

Three angles (Euler angles, or roll/pitch/yaw), or equivalently a unit axis and a single angle (Rodrigues), or a unit quaternion.

Orthogonal but not a rotation

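If $R^\top R = I$ but $\det R = -1$, $R$ reverses orientation: it is a reflection, possibly composed with a rotation. Rotations and these orientation-reversing maps together form the orthogonal group $O(D)$.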
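The defining properties of $SO(2)$, and the contrast with a reflection, are easy to verify numerically. A minimal sketch with arbitrary angles:

```python
import numpy as np

def rotation_2d(theta):
    """Counter-clockwise rotation of the plane by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

a, b = 0.3, 1.1                     # arbitrary angles
R = rotation_2d(a)

is_orthogonal = np.allclose(R.T @ R, np.eye(2))
det_R = float(np.linalg.det(R))     # +1 for a rotation
angles_add = np.allclose(rotation_2d(a) @ rotation_2d(b), rotation_2d(a + b))

# Reflection across the x-axis: orthogonal, but orientation-reversing.
F = np.array([[1.0,  0.0],
              [0.0, -1.0]])
F_is_orthogonal = np.allclose(F.T @ F, np.eye(2))
det_F = float(np.linalg.det(F))     # -1

print(is_orthogonal, det_R, angles_add, F_is_orthogonal, det_F)
```

Both matrices preserve lengths; only the sign of the determinant tells them apart.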

RoPE in transformers

Rotary Position Embeddings (used by Llama, Gemma, DeepSeek) rotate pairs of dimensions in $\mathbb{R}^d$ by an angle that depends on position. The position-dependent rotation matrix is block-diagonal with $d/2$ blocks each of the form above. The whole thing is a rotation.
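The block-diagonal structure can be sketched as follows. Two details are assumptions not stated in this deck: the frequency schedule `base ** (-2i/d)` with `base = 10000` (the usual choice in the RoPE literature) and the pairing of adjacent dimensions; real implementations may lay out the pairs differently. The function name is made up.

```python
import numpy as np

def rope_matrix(position, d, base=10000.0):
    """Block-diagonal rotation for one position; d must be even.

    Frequencies base**(-2i/d) and adjacent-pair layout are assumptions.
    """
    R = np.zeros((d, d))
    for i in range(d // 2):
        angle = position * base ** (-2.0 * i / d)
        c, s = np.cos(angle), np.sin(angle)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

d = 8
R2 = rope_matrix(2, d)
R5 = rope_matrix(5, d)

is_orthogonal = np.allclose(R5.T @ R5, np.eye(d))
det_R5 = float(np.linalg.det(R5))                      # +1: the whole map is a rotation
relative_only = np.allclose(R2.T @ R5, rope_matrix(3, d))

print(is_orthogonal, det_R5, relative_only)
```

The last check shows why this is useful for attention: $\langle R_2\mathbf{k}, R_5\mathbf{q}\rangle = \mathbf{k}^\top R_2^\top R_5\,\mathbf{q}$ depends only on the offset $5-2=3$, because rotation angles add within each block.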

09

Inner Products on Function Spaces

The abstract definition of inner product means we can put one on spaces of functions. The space $L^2([a,b])$ of square-integrable functions on $[a,b]$ comes with

$$\langle f, g\rangle = \int_a^b f(x)\, g(x)\, dx.$$

This makes "two functions are orthogonal" a sensible statement. Three uses in ML:

Fourier

$\{1, \sin nx, \cos nx : n \ge 1\}$ is an orthogonal basis of $L^2([0,2\pi])$. Fourier coefficients are inner products with the basis functions, divided by their squared norms.

Polynomial features

Orthogonal polynomials (Legendre, Hermite) give numerically stable feature maps for regression.

Kernel methods

Mercer's theorem: a symmetric positive semi-definite kernel $k(\mathbf{x},\mathbf{x}')$ is the inner product of feature maps in an RKHS. The SVM in deck 12 only ever uses $k$, never the features themselves.

The same theorems (Pythagoras, Cauchy–Schwarz, projection minimises distance) carry over to these function spaces. That's the power of the abstract definition: a theorem proved from the inner product axioms alone applies in every inner product space.
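For the Fourier example, the $L^2([0,2\pi])$ inner product can be approximated by a Riemann sum on a uniform grid. The sketch below checks orthogonality of two basis functions and recovers a coefficient by projection; the grid size and the test function `f` are arbitrary choices made for illustration.

```python
import numpy as np

N = 4096                                        # grid size (arbitrary)
x = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
dx = 2.0 * np.pi / N

def inner(f, g):
    """Riemann-sum approximation of the integral of f*g over [0, 2*pi]."""
    return float(np.sum(f(x) * g(x)) * dx)

sin2 = lambda t: np.sin(2 * t)
cos3 = lambda t: np.cos(3 * t)
f = lambda t: 1.0 + 3.0 * np.sin(2 * t)         # made-up test function

ortho = inner(sin2, cos3)                       # ~0: orthogonal
norm_sq_sin2 = inner(sin2, sin2)                # ~pi: orthogonal, not orthonormal
coeff_sin2 = inner(f, sin2) / norm_sq_sin2      # recovers the 3 in front of sin(2x)

print(ortho, norm_sq_sin2, coeff_sin2)
```

The coefficient is computed with the same line-projection formula used for vectors, $\langle f, g\rangle / \langle g, g\rangle$, with functions in place of arrows.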

10

Where This Lands in Part II

| Concept | Used by | Where |
|---|---|---|
| $\ell^2$ norm | Ridge regression | Regulariser $\lVert\boldsymbol\theta\rVert^2$ shrinks toward zero |
| $\ell^1$ norm | Sparse models | Lasso selects features |
| Inner product | SVM (ch. 12) | Replaced by kernel $k(\mathbf{x},\mathbf{x}')$ |
| Cosine similarity | Embeddings, retrieval | Length-invariant similarity |
| Cauchy–Schwarz | Convergence proofs | SGD, perceptron, regret bounds |
| Orthonormal basis | PCA (ch. 10) | Principal components are an orthonormal basis |
| Orthogonal projection | Linear regression (ch. 9) | $\hat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto $\mathrm{Col}(\Phi)$ |
| Distance to hyperplane | SVM (ch. 12) | Margin = $\frac{\lvert\mathbf{w}^\top\mathbf{x}+b\rvert}{\lVert\mathbf{w}\rVert}$ |
| Rotation | PCA, decoders | Whitening / RoPE / orthogonal weight init |

11

Cheat Sheet

| Quantity | Formula |
|---|---|
| length | $\lVert\mathbf{v}\rVert = \sqrt{\langle\mathbf{v},\mathbf{v}\rangle}$ |
| distance | $d(\mathbf{u},\mathbf{v}) = \lVert\mathbf{u}-\mathbf{v}\rVert$ |
| angle | $\cos\theta = \frac{\langle\mathbf{u},\mathbf{v}\rangle}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}$ |
| Cauchy–Schwarz | $\lvert\langle\mathbf{u},\mathbf{v}\rangle\rvert \le \lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert$ |
| projection onto a line | $\pi_L(\mathbf{v})=\frac{\langle\mathbf{v},\mathbf{u}\rangle}{\langle\mathbf{u},\mathbf{u}\rangle}\mathbf{u}$ |
| projection onto $\mathrm{Col}(B)$ | $B(B^\top B)^{-1}B^\top \mathbf{v}$ |
| orthonormal projection | $BB^\top \mathbf{v}$ (when $B^\top B=I$) |
| residual is perpendicular | $\mathbf{v}-\pi_U(\mathbf{v})\in U^\perp$ |
| rotation in $\mathbb{R}^2$ | $\begin{bmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{bmatrix}$ |
| $SO(D)$ | $R^\top R=I,\;\det R=+1$ |

Up next

Deck 04 takes the same geometry one step further: when does a matrix have a complete orthonormal basis of eigenvectors? The answer (the spectral theorem) and its generalisation to rectangular matrices (the SVD) sit at the heart of PCA and low-rank approximation.