The tiniest object in machine learning is the vector. Get this right and every later piece — embeddings, attention, projections — falls into place.
Throughout the series we will write a vector as a column of numbers. But the same object has three useful interpretations, and a working ML engineer flips between them dozens of times in a day.
An arrow from the origin to a point. Length and direction matter; position in the plane doesn't (it always starts at 0). This is the picture for $\mathbb{R}^2$ and $\mathbb{R}^3$.
$\mathbf{v} = (v_1, v_2, \ldots, v_n)$. The picture that scales: a 4096-dimensional vector is just a list of 4096 floats. Most of ML lives here.
An element of an abstract vector space — a thing you can add to other things and scale by numbers, with no commitment to what the things "are". This view is what lets us put words, images and audio in the same room.
Modern ML moves between all three. An embedding is an arrow when you draw a t-SNE plot; it's a tuple when you call torch.matmul; and it's a point in an abstract space when you reason about norm, distance, projection, similarity.
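A tiny NumPy sketch of the same object wearing all three hats; the numbers and the matrix W below are made-up examples, not anything from a real model.

```python
import numpy as np

v = np.array([3.0, 4.0])                     # tuple view: just a list of floats
print(np.linalg.norm(v))                     # arrow view: its length is 5.0

u = np.array([4.0, 3.0])
cos_sim = v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
print(cos_sim)                               # similarity between two arrows

W = np.array([[0.0, -1.0],
              [1.0,  0.0]])                  # a 90-degree rotation
print(W @ v)                                 # abstract view: something a linear map can act on
```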
Bold lowercase letters $\mathbf{v}, \mathbf{x}, \mathbf{w}$ are vectors, written as columns. Capital letters $A, B, W$ are matrices. Greek lowercase $\alpha, \beta, \lambda$ are scalars. We work over the real numbers $\mathbb{R}$ except where noted — complex vectors appear only briefly in the eigen / SVD decks.
Two operations and one rule. Together they generate all of linear algebra.
$$\mathbf{u} + \mathbf{v} = (u_1 + v_1,\, u_2 + v_2,\, \ldots,\, u_n + v_n)$$
Geometrically: place the tail of $\mathbf{v}$ at the head of $\mathbf{u}$; the resultant is the arrow from the origin to the new tip. Commutative, associative, has identity $\mathbf{0}$ and inverses $-\mathbf{v}$.
$$\alpha \mathbf{v} = (\alpha v_1,\, \alpha v_2,\, \ldots,\, \alpha v_n)$$
Geometrically: stretch $\mathbf{v}$ by the factor $\alpha$ (and flip if $\alpha < 0$). Direction is preserved or reversed; never rotated to anything else.
The two operations distribute: $\alpha(\mathbf{u}+\mathbf{v}) = \alpha\mathbf{u} + \alpha\mathbf{v}$ and $(\alpha+\beta)\mathbf{v} = \alpha\mathbf{v}+\beta\mathbf{v}$. This is the one rule: from it, the rest of linear algebra follows.
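In code both operations are elementwise, and the distributive laws can be checked numerically; a minimal NumPy sketch:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
alpha, beta = 2.0, -0.5

print(u + v)                 # vector addition: elementwise
print(alpha * v)             # scalar multiplication: stretch every component

# the two distributive laws, checked numerically
assert np.allclose(alpha * (u + v), alpha * u + alpha * v)
assert np.allclose((alpha + beta) * v, alpha * v + beta * v)
```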
Every operation inside an attention block — QKV projection, attention-weighted average, output projection, residual add — is built from exactly these two operations. The non-linearity (softmax, GELU) is intentionally a thin sandwich between long stretches of pure linear algebra. Understand vectors and matrices and you understand 95% of the FLOPs.
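As a rough sanity check of that claim, here is a stripped-down single-head attention forward pass in NumPy. Everything except the softmax is built from the two operations above (via matrix multiplication); the sizes and random weights are made-up stand-ins, and a real implementation adds masking, multiple heads, and normalisation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# toy sizes, not any real model's
seq, d_model, d_head = 4, 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(seq, d_model))          # one row per token
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # QKV projections: sums of scaled columns
A = softmax(Q @ K.T / np.sqrt(d_head))       # the one non-linearity in the block
out = (A @ V) @ W_O + X                      # weighted average, output projection, residual add
print(out.shape)                             # (4, 8)
```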
A vector space is any set $V$ together with two operations — addition $+ : V \times V \to V$ and scalar multiplication $\cdot : \mathbb{R} \times V \to V$ — that satisfy eight axioms: associativity and commutativity of addition, an additive identity $\mathbf{0}$, additive inverses, and four scalar-multiplication axioms (associativity, identity 1, and the two distributive laws).
If you can add two things and scale them by real numbers, and the obvious algebraic identities hold, you have a vector space. The axioms are deliberately weak so that very different objects qualify.
| Vector space | "Vector" is… | Why it matters in ML |
|---|---|---|
| $\mathbb{R}^n$ | An $n$-tuple of reals | Every embedding, activation, weight column. |
| $\mathbb{R}^{m\times n}$ | An $m\times n$ matrix of reals | Weight matrices, attention scores, batched activations — flattened, an element of $\mathbb{R}^{mn}$. |
| Polynomials of degree $\le n$ | A polynomial $p(t) = a_0 + a_1 t + \cdots + a_n t^n$ | Spline bases, kernel features, classical positional encodings. |
| Continuous functions $C([0,1])$ | A function $f : [0,1] \to \mathbb{R}$ | Idealised embedding spaces; Fourier theory; the right setting for "infinite-dimensional" attention kernels. |
| Random variables (with finite variance) | A random variable | Inner product = covariance; orthogonality = uncorrelated. Useful for whitening, PCA, denoising-diffusion analysis. |
The eight axioms guarantee that any theorem you prove from them alone holds in every example. Define "basis" once and you have basis for $\mathbb{R}^n$, basis for polynomials (the monomials $1, t, t^2, \ldots$), basis for the Fourier function space (sines and cosines), and basis for the embedding space of a transformer — all at once. This is why vocabulary like "basis", "span" and "rank" earns attention out of proportion to its modest definitions: it's leverage that pays off across many domains.
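For a concrete taste of that leverage, here is the polynomial example in coordinates: once you fix the monomial basis, adding and scaling polynomials is just adding and scaling vectors in $\mathbb{R}^{n+1}$ (a small NumPy sketch):

```python
import numpy as np

# p(t) = 1 + 2t + 3t^2 and q(t) = 5 - t, as coordinates in the monomial basis (1, t, t^2)
p = np.array([1.0, 2.0, 3.0])
q = np.array([5.0, -1.0, 0.0])

r = 2.0 * p + q                      # the polynomial 2p(t) + q(t), computed as a vector

def evaluate(coeffs, t):
    """Evaluate a polynomial given its monomial-basis coordinates."""
    return sum(c * t**k for k, c in enumerate(coeffs))

t = 1.7
assert np.isclose(evaluate(r, t), 2.0 * evaluate(p, t) + evaluate(q, t))
```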
A linear combination of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ with coefficients $\alpha_1, \ldots, \alpha_k$ is
$$\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k.$$
The span of a set of vectors is the set of all linear combinations of them:
$$\mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) = \{\,\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k : \alpha_i \in \mathbb{R}\,\}.$$
Span is the answer to the question: what set of points can I reach by mixing these vectors? The answer is always a subspace through the origin (a line, plane, hyperplane, or the whole space).
The span of the rows of a weight matrix $W$ is the range of the linear map $\mathbf{x} \mapsto W^\top\mathbf{x}$. Equivalently, the span of the columns of $W$ is the set of outputs of the layer $\mathbf{x} \mapsto W\mathbf{x}$. A low-rank $W$ has a small-dimensional column span — only a thin slice of output space is reachable. This is the foundation of LoRA (deck 07).
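A sketch of the low-rank picture, assuming a hypothetical weight matrix built LoRA-style as a product of two thin factors: no matter how many inputs you push through it, the outputs never escape an $r$-dimensional slice of output space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4
A = rng.normal(size=(d_out, r))
B = rng.normal(size=(r, d_in))
W = A @ B                                    # a rank-r weight matrix, LoRA-style

X = rng.normal(size=(d_in, 1000))            # 1000 random inputs, one per column
Y = W @ X                                    # the layer's outputs

print(np.linalg.matrix_rank(W))              # 4: the column span of W is 4-dimensional
print(np.linalg.matrix_rank(Y))              # 4: every output lives in that same thin slice
```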
A set $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is linearly independent if the only way to write the zero vector as a linear combination is with all coefficients zero:
$$\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0} \iff \alpha_1 = \cdots = \alpha_k = 0.$$
Otherwise the set is linearly dependent: some non-trivial combination of them is zero, which is the same as saying at least one of them is a linear combination of the others. That one can be dropped without losing any span.
Real data lies, approximately, on low-dimensional surfaces inside very high-dimensional spaces. The pixels of an MNIST digit are nominally a 784-dimensional vector, but the set of all real handwritten digits forms a much smaller-dimensional manifold. The columns of the matrix you'd build by stacking 60,000 MNIST digits are highly linearly dependent — this is exactly what PCA (deck 06) exploits, and exactly why a low-rank LoRA adapter (deck 07) is enough.
Are $(1,2)$, $(2,4)$, $(3,5)$ linearly independent in $\mathbb{R}^2$? No: three vectors in a 2-dimensional space must be linearly dependent. (A pair of them can already be dependent on its own, e.g. $(1,2)$ and $(2,4)$ are scalar multiples.) The maximum size of a linearly independent set in $\mathbb{R}^n$ is $n$.
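The same check done numerically: stack the vectors as columns and ask for the rank; a rank smaller than the number of vectors means the set is dependent.

```python
import numpy as np

V = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0]])              # columns are (1,2), (2,4), (3,5)

print(np.linalg.matrix_rank(V))              # 2 < 3 columns: the set is linearly dependent
print(np.linalg.matrix_rank(V[:, :2]))       # 1: (1,2) and (2,4) are already dependent (scalar multiples)
print(np.linalg.matrix_rank(V[:, [0, 2]]))   # 2: (1,2) and (3,5) are independent
```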
A basis of a vector space $V$ is a set $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$ that is linearly independent and spans $V$.
Equivalently: every $\mathbf{v} \in V$ can be written uniquely as $\mathbf{v} = c_1\mathbf{e}_1 + \cdots + c_n\mathbf{e}_n$. The numbers $c_i$ are the coordinates of $\mathbf{v}$ in the basis.
Every basis of $V$ has the same number of elements. That number is the dimension of $V$, written $\dim V$. (Proof sketch: if you had a basis of size $m$ and another of size $n > m$, you'd have $n$ linearly independent vectors expressed as combinations of only $m$ vectors; but any $n > m$ vectors lying in the span of $m$ vectors must be linearly dependent (the Steinitz exchange lemma), a contradiction.)
$\mathbf{e}_1 = (1, 0, \ldots, 0)$, $\mathbf{e}_2 = (0, 1, 0, \ldots, 0)$, …, $\mathbf{e}_n = (0, \ldots, 0, 1)$. Each is a 1 in one slot, zeros everywhere else. Coordinates in this basis are just the components.
Monomial basis: $1, t, t^2, \ldots, t^n$. Dimension $n+1$. Coordinates are the polynomial coefficients.
The hidden state in Llama-3-70B is an 8192-dimensional vector. Its "basis" is the columns of whatever last writes into the residual stream — output projections from attention heads, output projections from FFNs. Mechanistic interpretability is, in part, the study of which directions in this 8192-dim space carry which information — an attempt to recover a meaningful basis from the standard one the GPU happens to use.
A vector exists independently of any basis — it's the arrow / abstract point. Its coordinates, on the other hand, depend on the basis you chose. Switching basis is a fundamental operation in ML: PCA, whitening, ICA and the rotations inside RoPE are all changes of basis.
If $B = (\mathbf{b}_1, \ldots, \mathbf{b}_n)$ is the matrix whose columns are a new basis, and $\mathbf{v}$ has new-basis coordinates $\mathbf{c}$, then the standard-basis coordinates of $\mathbf{v}$ are
$$\mathbf{v} = B \mathbf{c}.$$
Equivalently, given $\mathbf{v}$ in the standard basis, its new-basis coordinates are $\mathbf{c} = B^{-1} \mathbf{v}$.
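In code, finding the new-basis coordinates is a linear solve (no need to form $B^{-1}$ explicitly); a small sketch with a made-up basis:

```python
import numpy as np

B = np.array([[1.0, 1.0],
              [0.0, 1.0]])                   # columns b1, b2 form the new basis
v = np.array([3.0, 2.0])                     # standard-basis coordinates

c = np.linalg.solve(B, v)                    # new-basis coordinates, without forming B^{-1}
assert np.allclose(B @ c, v)                 # v = c1*b1 + c2*b2
print(c)                                     # [1. 2.]
```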
A linear map looks completely different in different bases. A 2D rotation matrix is
$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
in the standard basis. Change to a basis aligned with the eigenvectors and the same map becomes diagonal (it's a complex diagonal $\mathrm{diag}(e^{i\theta}, e^{-i\theta})$). Same map, different appearance — the choice of basis is the storyteller. Eigendecomposition (deck 06) and SVD (deck 07) are precisely the search for the basis that makes a matrix as simple as possible.
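You can watch this happen numerically: np.linalg.eig on a rotation matrix returns the complex eigenvalues $e^{\pm i\theta}$, and conjugating by the eigenvector basis makes the map diagonal. A short sketch:

```python
import numpy as np

theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

eigvals, P = np.linalg.eig(R)                # columns of P are the eigenvector basis
print(eigvals)                               # approximately exp(+0.3i) and exp(-0.3i)

D = np.linalg.inv(P) @ R @ P                 # the same map, expressed in the eigenvector basis
print(np.round(D, 6))                        # diagonal: diag(e^{i*theta}, e^{-i*theta})
```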
An embedding lookup is a coordinate mapping: token id $i \mapsto \mathbf{e}_i$ in some chosen basis $E$. Any rotation $R$ of the embedding space gives an equivalent model with weights $RE$, $W_Q R^\top$, $W_K R^\top$, \ldots: maps that write into the residual stream pick up $R$ on the left, maps that read from it pick up $R^\top$ on the right, and the two cancel. This equivalence is used in mechanistic interpretability to argue that "directions, not neurons" are the meaningful unit.
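A quick numerical check of that equivalence, with made-up shapes: rotate the residual stream and compensate in the weights that read from it, and the queries (hence the attention scores) are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4
x = rng.normal(size=d_model)                 # a residual-stream vector (column convention)
W_Q = rng.normal(size=(d_head, d_model))

M = rng.normal(size=(d_model, d_model))
R, _ = np.linalg.qr(M)                       # a random orthogonal matrix (rotation up to reflection)

x_rot = R @ x                                # rotate the residual stream
W_Q_rot = W_Q @ R.T                          # compensate in the weights that read from it

assert np.allclose(W_Q @ x, W_Q_rot @ x_rot) # queries, and so attention scores, are unchanged
```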
A subspace $U$ of $V$ is a non-empty subset that is itself a vector space — closed under addition and scaling. Equivalently, $U$ is a subspace iff for all $\mathbf{u}, \mathbf{u}' \in U$ and $\alpha \in \mathbb{R}$: $\mathbf{u} + \mathbf{u}' \in U$ and $\alpha \mathbf{u} \in U$.
From these two it follows that $\mathbf{0} \in U$ (take $\alpha = 0$). Every subspace passes through the origin.
In $\mathbb{R}^3$ the only subspaces are $\{\mathbf{0}\}$, lines through $\mathbf{0}$, planes through $\mathbf{0}$, and all of $\mathbb{R}^3$. A line that doesn't pass through the origin is not a subspace — the zero vector isn't on it.
For a matrix $A : \mathbb{R}^n \to \mathbb{R}^m$, the range $\{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$ and the null space $\{\mathbf{x} : A\mathbf{x} = \mathbf{0}\} \subseteq \mathbb{R}^n$ are both subspaces. Deck 02 develops these.
An affine subspace is a subspace shifted by a vector: $\mathbf{p} + U$ for some fixed $\mathbf{p}$ and subspace $U$. A line through $(0,0,1)$ in the direction $(1,1,1)$ is affine but not a subspace. ML loss-landscape geometry has both: the set of solutions to $W\mathbf{x} = \mathbf{b}$ is affine, the kernel $\{\mathbf{x} : W\mathbf{x} = \mathbf{0}\}$ is its parallel linear subspace through the origin.
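A numerical sketch of that last point, with a made-up underdetermined system: a particular solution plus any null-space vector still solves $W\mathbf{x} = \mathbf{b}$, so the solution set is an affine translate of the null space and does not contain the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                  # underdetermined: 3 equations, 5 unknowns
b = rng.normal(size=3)

x_p, *_ = np.linalg.lstsq(W, b, rcond=None)  # one particular solution
null_basis = np.linalg.svd(W)[2][3:].T       # basis for the null space (2-dimensional here)

n = null_basis @ rng.normal(size=2)          # an arbitrary null-space vector
assert np.allclose(W @ (x_p + n), b)         # still a solution: the solution set is x_p + null(W)
assert not np.allclose(W @ np.zeros(5), b)   # 0 is not a solution: affine, not a linear subspace
```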
Drag the heads of $\mathbf{u}$ and $\mathbf{v}$. The playground shows their sum, the linear combination $\alpha\mathbf{u} + \beta\mathbf{v}$, and the span (a line if $\mathbf{u}$ and $\mathbf{v}$ are parallel, the whole plane otherwise).
The pink arrow is the linear combination $\alpha\mathbf{u} + \beta\mathbf{v}$. Sweep both sliders and notice how its tip can reach any point in the plane — that's a 2D span. Click "Make parallel" and notice how the span collapses to a single line (highlighted): now no setting of $\alpha, \beta$ can reach off-line points. This is what linear dependence looks like.
The intuition you just built in 2D scales, with one important caveat: the geometry of high-dimensional spaces is deeply counter-intuitive. In particular, two independent random vectors in a high-dimensional space are, with overwhelming probability, almost orthogonal.
Modern LLM embedding spaces are 2k–16k dimensional. The "almost-orthogonal" phenomenon is a feature: it means a $d$-dimensional space has vastly more than $d$ approximately-orthogonal directions you can pack into it. Anthropic's Toy Models of Superposition (Elhage et al., 2022) and follow-up work show that LLMs exploit this to encode many more "features" than they have neurons by using nearly-orthogonal directions.
How many vectors can you pack into $\mathbb{R}^{4096}$ such that each pair makes an angle of at least 80°? With strict orthogonality you'd be limited to 4096. Allowing $\cos\theta \le 0.17$ (i.e. at least 80° apart) lets you pack exponentially many: a Johnson-Lindenstrauss-style bound gives roughly $\exp(\Theta(\varepsilon^2 d))$ vectors. For $d = 4096$ and $\varepsilon = 0.17$ that's something like $10^{50}$. This is the slack that superposition exploits.
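You can see the raw phenomenon directly: independent random unit vectors in high dimension are nearly orthogonal with overwhelming probability. A quick NumPy check (the numbers are illustrative, not a proof of the bound):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4096, 200
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # n random unit vectors in R^d

cos = V @ V.T                                    # all pairwise cosines
off_diag = cos[~np.eye(n, dtype=bool)]
print(np.abs(off_diag).max())                    # typically ~0.06: every pair is within a few degrees of 90°
```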
Deck 02 — Matrices as Linear Maps turns the static "tuple of numbers" picture into a moving one: a matrix is the rule that takes a vector to another vector, and matmul is composition of rules.