Length, angle, and orthogonality — the geometric language behind cosine similarity, attention scores, the 1/√d scale, and why random high-dimensional vectors are almost orthogonal.
The dot product on $\mathbb{R}^n$ has three equivalent definitions; the elegance is that they all give the same number.
$$\langle \mathbf{u}, \mathbf{v} \rangle = \sum_{i=1}^n u_i v_i.$$
The component-wise sum of products. The thing the GPU computes.
$$\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top \mathbf{v}.$$
One row times one column. The matmul atom.
$$\langle \mathbf{u}, \mathbf{v} \rangle = \|\mathbf{u}\|\,\|\mathbf{v}\|\,\cos\theta.$$
Length-times-length-times-cosine-of-angle. The picture in $\mathbb{R}^2$.
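A quick NumPy check (not part of the deck's widgets): in $\mathbb{R}^2$ the angle can be measured independently with `arctan2`, so all three definitions can be compared without circularity.

```python
import numpy as np

u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])

d1 = np.sum(u * v)      # definition 1: component-wise sum of products
d2 = u @ v              # definition 2: row times column

theta = np.arctan2(v[1], v[0]) - np.arctan2(u[1], u[0])     # angle measured directly
d3 = np.linalg.norm(u) * np.linalg.norm(v) * np.cos(theta)  # definition 3

print(d1, d2, d3)       # all three print 5.0 (up to float error)
```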
An inner product on a real vector space $V$ is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying three properties: symmetry, $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$; linearity in each argument; and positive-definiteness, $\langle \mathbf{v}, \mathbf{v} \rangle \ge 0$ with equality iff $\mathbf{v} = \mathbf{0}$.
Any function satisfying these defines a "geometry" on the space — a notion of length and angle. The standard dot product on $\mathbb{R}^n$ is one example, the Mahalanobis inner product (slide 06) is another.
A norm $\|\cdot\| : V \to [0, \infty)$ satisfies three properties: $\|\mathbf{v}\| = 0 \iff \mathbf{v} = \mathbf{0}$; $\|\alpha \mathbf{v}\| = |\alpha|\,\|\mathbf{v}\|$; and $\|\mathbf{u} + \mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|$ (the triangle inequality).
Every inner product gives a norm: $\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}$. But not every norm comes from an inner product (of the $L^p$ family, only $L^2$ does).
| Norm | Formula | Where it shows up in ML |
|---|---|---|
| $L^1$ (taxicab) | $\sum_i |v_i|$ | Lasso regularisation, sparse-coding objective, total-variation losses. |
| $L^2$ (Euclidean) | $\sqrt{\sum_i v_i^2}$ | Default everywhere — weight decay, MSE, gradient norms, embedding distance, attention scaling. |
| $L^\infty$ (max) | $\max_i |v_i|$ | Adversarial $\varepsilon$-balls, max-perturbation attack budgets, PGD attacks. |
| $L^p$ ($1 \le p < \infty$) | $(\sum_i |v_i|^p)^{1/p}$ | Less common in ML; appears in regularisation theory and analysis of generalisation bounds. |
| Frobenius (matrix) | $\sqrt{\sum_{ij} A_{ij}^2}$ | Treats a matrix as a flat vector. Standard ML matrix norm; the Eckart-Young theorem (deck 07) is in Frobenius norm. |
| Spectral (matrix) | $\sigma_1(A)$ (largest singular value) | Lipschitz constant of a linear layer; used in spectral normalisation (Miyato et al. 2018) for GAN stability. |
$L^1$, $L^2$ and $L^\infty$ unit balls in $\mathbb{R}^2$ are a square rotated 45° (a diamond), a circle, and an axis-aligned square. The corners of the $L^1$ ball sit on the axes — this is the geometric reason Lasso encourages sparse solutions: the optimum tends to land at a corner where some coordinates are exactly zero.
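For reference, a short sketch of the table's norms in NumPy (the vector and matrix below are arbitrary examples):

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
A = np.array([[2.0, 0.0], [1.0, 3.0]])

l1   = np.sum(np.abs(v))                  # L1 (taxicab): 7
l2   = np.linalg.norm(v)                  # L2 (Euclidean): 5
linf = np.max(np.abs(v))                  # L-infinity (max): 4
lp   = np.sum(np.abs(v) ** 3) ** (1 / 3)  # general L^p, here p = 3

fro  = np.linalg.norm(A, ord="fro")       # Frobenius: sqrt of the sum of squared entries
spec = np.linalg.norm(A, ord=2)           # spectral: largest singular value of A
```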
The most-cited inequality in linear algebra:
$$|\langle \mathbf{u}, \mathbf{v} \rangle| \le \|\mathbf{u}\|\,\|\mathbf{v}\|.$$
Equality holds iff $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent (parallel, or one of them zero). Reading: the absolute dot product is bounded by the product of the lengths — equivalently, $|\cos\theta| \le 1$.
Assume $\mathbf{v} \neq \mathbf{0}$ (otherwise both sides are zero). For any $t \in \mathbb{R}$, $0 \le \|\mathbf{u} - t\mathbf{v}\|^2 = \|\mathbf{u}\|^2 - 2t \langle \mathbf{u}, \mathbf{v} \rangle + t^2 \|\mathbf{v}\|^2$. Choose $t = \langle \mathbf{u}, \mathbf{v} \rangle / \|\mathbf{v}\|^2$ to minimise the RHS; the minimum is $\|\mathbf{u}\|^2 - \langle \mathbf{u}, \mathbf{v} \rangle^2 / \|\mathbf{v}\|^2 \ge 0$. Rearranging gives Cauchy-Schwarz.
$$\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + 2\langle \mathbf{u}, \mathbf{v} \rangle + \|\mathbf{v}\|^2 \le \|\mathbf{u}\|^2 + 2\|\mathbf{u}\|\|\mathbf{v}\| + \|\mathbf{v}\|^2 = (\|\mathbf{u}\| + \|\mathbf{v}\|)^2.$$
Cauchy-Schwarz is doing the work in the middle.
Cauchy-Schwarz bounds everything built from dot products. Attention scores live in $[-\|\mathbf{q}\|\|\mathbf{k}\|, +\|\mathbf{q}\|\|\mathbf{k}\|]$. A normalisation step ($\|\mathbf{q}\| = \|\mathbf{k}\| = 1$) tightens this to $[-1, 1]$. The Lipschitz-bounded layers used in GAN stabilisation lean on Cauchy-Schwarz to get worst-case behaviour. The attention $1/\sqrt{d}$ scaling depends on what Cauchy-Schwarz doesn't tell you — the typical case — which we cover on slide 08.
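A two-line numerical check of both claims, with random stand-ins for $\mathbf{q}$ and $\mathbf{k}$ (nothing here is a real attention layer):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Cauchy-Schwarz: |q.k| <= ||q|| ||k||
assert abs(q @ k) <= np.linalg.norm(q) * np.linalg.norm(k)

# After normalisation the score is pinned to [-1, 1].
q_hat = q / np.linalg.norm(q)
k_hat = k / np.linalg.norm(k)
assert -1.0 <= q_hat @ k_hat <= 1.0
```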
The cosine similarity of $\mathbf{u}, \mathbf{v}$ is the cosine of the angle between them:
$$\mathrm{cos\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{u}\|\,\|\mathbf{v}\|}.$$
Range $[-1, 1]$ (by Cauchy-Schwarz). $+1$ = same direction, $-1$ = opposite, $0$ = orthogonal. This is the standard similarity score for embeddings in retrieval, RAG, deduplication, and most metric-learning losses.
For embeddings, direction is meaningful but length is often a side-effect of token frequency, training instability, or the length of an input sequence. A short query and a long document can have similar topical content but very different L2 norms; cosine ignores that. Equivalently, cosine is the dot product after L2-normalising both vectors.
If $\hat{\mathbf{u}}, \hat{\mathbf{v}}$ are unit vectors,
$$\|\hat{\mathbf{u}} - \hat{\mathbf{v}}\|^2 = 2 - 2\,\mathrm{cos\_sim}(\hat{\mathbf{u}}, \hat{\mathbf{v}}).$$
Squared Euclidean distance on the unit sphere is a monotone function of cosine similarity. So nearest-neighbour search by cosine and by L2 distance give identical orderings (after normalisation) — a fact that vector-DB indexes (FAISS, ScaNN, HNSW) exploit.
OpenAI text-embedding-3, Cohere embed-v3, Voyage and most modern embedders return L2-normalised vectors. Cosine similarity reduces to a plain dot product, which is faster and SIMD-friendly. Always check whether your embedder returns normalised vectors before reaching for cos_sim — if it does, `v1 @ v2` is enough.
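A minimal sketch of that equivalence, plus the unit-sphere identity above, using random stand-in embeddings (the dimension 768 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
u, v = rng.standard_normal(768), rng.standard_normal(768)

cos_sim = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

# Cosine similarity = plain dot product after L2 normalisation.
assert np.isclose(cos_sim, u_hat @ v_hat)

# Squared distance on the unit sphere = 2 - 2 * cosine similarity.
assert np.isclose(np.sum((u_hat - v_hat) ** 2), 2 - 2 * cos_sim)
```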
Two vectors are orthogonal when $\langle \mathbf{u}, \mathbf{v} \rangle = 0$. A set $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$ is orthonormal when its members are unit vectors and pairwise orthogonal — the simplest possible basis.
The closest point in a subspace $U$ to a given vector $\mathbf{v}$ is the orthogonal projection of $\mathbf{v}$ onto $U$ — the unique $\hat{\mathbf{v}} \in U$ with $\mathbf{v} - \hat{\mathbf{v}} \perp U$. This is the foundation of least-squares, of the FFN's projection structure (deck 05) and of every "best low-rank approximation" theorem.
Deck 05 is entirely about projections. This slide is the bridge: projection is geometry built out of inner products, and that's all it is.
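A minimal sketch of the closest-point property: project onto the column span of an arbitrary matrix $A$ with a least-squares solve, then check the residual is orthogonal to the subspace (the sizes here are made up).

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 2))   # columns span a 2-D subspace U of R^5
v = rng.standard_normal(5)

coeffs, *_ = np.linalg.lstsq(A, v, rcond=None)
v_hat = A @ coeffs                # closest point in U to v

# The residual v - v_hat is orthogonal to every column of A, hence to all of U.
assert np.allclose(A.T @ (v - v_hat), 0)
```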
The standard dot product on $\mathbb{R}^n$ is one inner product. Many others exist. The general form on $\mathbb{R}^n$ is
$$\langle \mathbf{u}, \mathbf{v} \rangle_M = \mathbf{u}^\top M \mathbf{v}$$
for any symmetric positive-definite $M$. The standard one is $M = I$.
Taking $M = \Sigma^{-1}$, with $\Sigma$ the covariance of the data, gives the Mahalanobis distance
$$d_{\Sigma}(\mathbf{u}, \mathbf{v}) = \sqrt{(\mathbf{u} - \mathbf{v})^\top \Sigma^{-1} (\mathbf{u} - \mathbf{v})}.$$
Distances are measured in standard deviations along the principal axes of the data; whitening makes Mahalanobis equal to Euclidean.
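A sketch with synthetic data showing both facts: the Mahalanobis distance computed from $\Sigma^{-1}$, and its agreement with plain Euclidean distance after whitening by $\Sigma^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(4)
# Correlated synthetic data: standard Gaussian pushed through a mixing matrix.
X = rng.standard_normal((500, 3)) @ np.array([[2.0, 0.3, 0.0],
                                              [0.0, 1.0, 0.5],
                                              [0.0, 0.0, 0.2]])
Sigma = np.cov(X, rowvar=False)

u, v = X[0], X[1]
diff = u - v
d_mahal = np.sqrt(diff @ np.linalg.solve(Sigma, diff))

# Whitening: multiply by Sigma^(-1/2); then ordinary L2 distance agrees.
evals, evecs = np.linalg.eigh(Sigma)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
assert np.isclose(d_mahal, np.linalg.norm(W @ u - W @ v))
```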
For a list of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$, the Gram matrix has entries $G_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle$. Symmetric, positive semi-definite, captures all pairwise geometry. The kernel matrix in kernel methods is exactly a Gram matrix in feature space.
Define a feature map $\phi : \mathcal{X} \to V$ and an inner product on $V$. Many ML algorithms (SVM, kernel ridge, GP) only ever access $\phi(\mathbf{x}_i)$ through inner products $\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$, so you can use them without ever computing $\phi$ explicitly — just a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$ that returns the inner product directly. This is what made SVMs powerful in the 90s and is the precise mathematical content of "attention is approximate kernel evaluation".
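A textbook-sized sketch of the trick (the quadratic feature map and kernel below are standard examples, not anything from this deck): the Gram matrix built from explicit features matches the one built from the kernel alone.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the polynomial kernel k(x, y) = (x . y)^2 in 2-D.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel(x, y):
    return (x @ y) ** 2

rng = np.random.default_rng(5)
X = rng.standard_normal((4, 2))

# Gram matrix in feature space vs. the same numbers from the kernel, no phi needed.
G_explicit = np.array([[phi(a) @ phi(b) for b in X] for a in X])
G_kernel   = np.array([[kernel(a, b)    for b in X] for a in X])

assert np.allclose(G_explicit, G_kernel)
```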
A foundational fact about high dimensions:
Let $\mathbf{u}, \mathbf{v}$ be two independent vectors with i.i.d. $\mathcal{N}(0, 1)$ entries in $\mathbb{R}^d$. Then
$$\mathbb{E}[\langle \mathbf{u}, \mathbf{v} \rangle] = 0, \quad \mathrm{Var}[\langle \mathbf{u}, \mathbf{v} \rangle] = d.$$
So the dot product has standard deviation $\sqrt{d}$, and after normalisation by lengths (each $\approx \sqrt{d}$), the cosine similarity has standard deviation $1/\sqrt{d}$.
For $d = 4096$, $1/\sqrt{d} \approx 0.0156$. Two random Gaussian vectors in $\mathbb{R}^{4096}$ have cosine similarity within $\pm 0.05$ of zero with overwhelming probability — they are almost orthogonal.
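An empirical check of the concentration claim (the sample count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
d, trials = 4096, 1000

U = rng.standard_normal((trials, d))
V = rng.standard_normal((trials, d))
cos = np.sum(U * V, axis=1) / (np.linalg.norm(U, axis=1) * np.linalg.norm(V, axis=1))

print(cos.std(), 1 / np.sqrt(d))    # both around 0.0156
print(np.mean(np.abs(cos) < 0.05))  # ~0.999: almost every pair is within ±0.05 of orthogonal
```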
The scaled dot-product attention score:
$$\mathrm{score}(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^\top \mathbf{k}}{\sqrt{d}}.$$
Why $\sqrt{d}$ and not $d$ or $d^2$? It's the previous slide.
Suppose $\mathbf{q}, \mathbf{k}$ have entries with mean 0 and variance 1. Then
$$\mathbf{q}^\top \mathbf{k} = \sum_{i=1}^d q_i k_i, \qquad \mathrm{Var}[\mathbf{q}^\top \mathbf{k}] = d.$$
So the un-scaled dot product has standard deviation $\sqrt{d}$. Without the scale, the magnitudes that go into softmax grow with $d$ — for $d = 64$ they're already on the order of $\pm 8$. Softmax saturates: all probability mass concentrates on the single largest argument, gradients through everything else go to zero, learning stalls.
Dividing by $\sqrt{d}$ keeps the score's standard deviation at $\approx 1$ regardless of head dimension. Softmax stays in its responsive range; gradients flow through all keys, not just the winner.
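A sketch contrasting softmax over raw and scaled scores, with random vectors standing in for a trained layer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(7)
d, n_keys = 64, 16
q = rng.standard_normal(d)
K = rng.standard_normal((n_keys, d))

raw    = K @ q              # scores with std ~ sqrt(d) = 8
scaled = raw / np.sqrt(d)   # scores with std ~ 1

print(softmax(raw).max())     # typically near 1: one key hogs almost all the mass
print(softmax(scaled).max())  # noticeably flatter: mass spread across many keys
```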
Train two attention layers, one with $1/\sqrt{d}$, one without. The unscaled one stalls inside the first thousand steps for $d \ge 64$ — the score range is too wide for softmax to be useful. The Vaswani et al. 2017 paper (Section 3.2.1) gives this as the explicit motivation. The whole choice is downstream of "random vectors are almost orthogonal but not exactly — the leftover dot product grows like $\sqrt{d}$".
Drag $\mathbf{u}$ and $\mathbf{v}$. The widget shows their L2 lengths, dot product, cosine similarity (= $\cos\theta$), and Euclidean distance. Notice that as $\mathbf{u}$ rotates around $\mathbf{v}$, the dot product traces a cosine curve.
Deck 05 — Projections, Up & Down takes the geometry of inner products and uses it to project. We'll build the projection-matrix view of an FFN's up- and down-projections, the structural reason transformer MLPs go up by 4× and back down.