The tiniest object in machine learning is the vector. Get this right and every later piece — embeddings, attention, projections — falls into place.
Throughout the series we will write a vector as a column of numbers. But the same object has three useful interpretations, and a working ML engineer flips between them dozens of times in a day.
An arrow from the origin to a point. Length and direction matter; position in the plane doesn't (it always starts at 0). This is the picture for $\mathbb{R}^2$ and $\mathbb{R}^3$.
$\mathbf{v} = (v_1, v_2, \ldots, v_n)$. The picture that scales: a 4096-dimensional vector is just a list of 4096 floats. Most of ML lives here.
An element of an abstract vector space — a thing you can add to other things and scale by numbers, with no commitment to what the things "are". This view is what lets us put words, images and audio in the same room.
Modern ML moves between all three. An embedding is an arrow when you draw a t-SNE plot; it's a tuple when you call torch.matmul; and it's a point in an abstract space when you reason about norm, distance, projection, similarity.
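A tiny NumPy sketch of the same object wearing all three hats; the numbers and the matrix W below are made-up examples, not anything from a real model.

```python
import numpy as np

v = np.array([3.0, 4.0])                     # tuple view: just a list of floats
print(np.linalg.norm(v))                     # arrow view: its length is 5.0

u = np.array([4.0, 3.0])
cos_sim = v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
print(cos_sim)                               # similarity between two arrows

W = np.array([[0.0, -1.0],
              [1.0,  0.0]])                  # a 90-degree rotation
print(W @ v)                                 # abstract view: something a linear map can act on
```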
Bold lowercase letters $\mathbf{v}, \mathbf{x}, \mathbf{w}$ are vectors, written as columns. Capital letters $A, B, W$ are matrices. Greek lowercase $\alpha, \beta, \lambda$ are scalars. We work over the real numbers $\mathbb{R}$ except where noted — complex vectors appear only briefly in the eigen / SVD decks.
Two operations and one rule. Together they generate all of linear algebra.
$$\mathbf{u} + \mathbf{v} = (u_1 + v_1,\, u_2 + v_2,\, \ldots,\, u_n + v_n)$$
Geometrically: place the tail of $\mathbf{v}$ at the head of $\mathbf{u}$; the resultant is the arrow from the origin to the new tip. Commutative, associative, has identity $\mathbf{0}$ and inverses $-\mathbf{v}$.
$$\alpha \mathbf{v} = (\alpha v_1,\, \alpha v_2,\, \ldots,\, \alpha v_n)$$
Geometrically: stretch $\mathbf{v}$ by the factor $\alpha$ (and flip if $\alpha < 0$). Direction is preserved or reversed; never rotated to anything else.
The two operations distribute: $\alpha(\mathbf{u}+\mathbf{v}) = \alpha\mathbf{u} + \alpha\mathbf{v}$ and $(\alpha+\beta)\mathbf{v} = \alpha\mathbf{v}+\beta\mathbf{v}$. This is the one rule: from it, the rest of linear algebra follows.
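In code both operations are elementwise, and the distributive laws can be checked numerically; a minimal NumPy sketch:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
alpha, beta = 2.0, -0.5

print(u + v)                 # vector addition: elementwise
print(alpha * v)             # scalar multiplication: stretch every component

# the two distributive laws, checked numerically
assert np.allclose(alpha * (u + v), alpha * u + alpha * v)
assert np.allclose((alpha + beta) * v, alpha * v + beta * v)
```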
Every operation inside an attention block — QKV projection, attention-weighted average, output projection, residual add — is built from exactly these two operations. The non-linearity (softmax, GELU) is intentionally a thin sandwich between long stretches of pure linear algebra. Understand vectors and matrices and you understand 95% of the FLOPs.
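As a rough sanity check of that claim, here is a stripped-down single-head attention forward pass in NumPy. Everything except the softmax is built from the two operations above (via matrix multiplication); the sizes and random weights are made-up stand-ins, and a real implementation adds masking, multiple heads, and normalisation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# toy sizes, not any real model's
seq, d_model, d_head = 4, 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(seq, d_model))          # one row per token
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # QKV projections: sums of scaled columns
A = softmax(Q @ K.T / np.sqrt(d_head))       # the one non-linearity in the block
out = (A @ V) @ W_O + X                      # weighted average, output projection, residual add
print(out.shape)                             # (4, 8)
```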
A vector space is any set $V$ together with two operations — addition $+ : V \times V \to V$ and scalar multiplication $\cdot : \mathbb{R} \times V \to V$ — that satisfy eight axioms: associativity and commutativity of addition, an additive identity $\mathbf{0}$, additive inverses, and four scalar-multiplication axioms (associativity, identity 1, and the two distributive laws).
If you can add two things and scale them by real numbers, and the obvious algebraic identities hold, you have a vector space. The axioms are deliberately weak so that very different objects qualify.
| Vector space | "Vector" is… | Why it matters in ML |
|---|---|---|
| $\mathbb{R}^n$ | An $n$-tuple of reals | Every embedding, activation, weight column. |
| $\mathbb{R}^{m\times n}$ | An $m\times n$ matrix of reals | Weight matrices, attention scores, batched activations — flattened, an element of $\mathbb{R}^{mn}$. |
| Polynomials of degree $\le n$ | A polynomial $p(t) = a_0 + a_1 t + \cdots + a_n t^n$ | Spline bases, kernel features, classical positional encodings. |
| Continuous functions $C([0,1])$ | A function $f : [0,1] \to \mathbb{R}$ | Idealised embedding spaces; Fourier theory; the right setting for "infinite-dimensional" attention kernels. |
| Random variables (with finite variance) | A random variable | Inner product = covariance; orthogonality = uncorrelated. Useful for whitening, PCA, denoising-diffusion analysis. |
The eight axioms guarantee that any theorem you prove from them alone holds in every example. Define "basis" once and you have basis for $\mathbb{R}^n$, basis for polynomials (the monomials $1, t, t^2, \ldots$), basis for the Fourier function space (sines and cosines), and basis for the embedding space of a transformer — all at once. This is why vocabulary like "basis", "span" and "rank" earns attention out of proportion to its modest definitions: it's leverage that pays off across many domains.
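For a concrete taste of that leverage, here is the polynomial example in coordinates: once you fix the monomial basis, adding and scaling polynomials is just adding and scaling vectors in $\mathbb{R}^{n+1}$ (a small NumPy sketch):

```python
import numpy as np

# p(t) = 1 + 2t + 3t^2 and q(t) = 5 - t, as coordinates in the monomial basis (1, t, t^2)
p = np.array([1.0, 2.0, 3.0])
q = np.array([5.0, -1.0, 0.0])

r = 2.0 * p + q                      # the polynomial 2p(t) + q(t), computed as a vector

def evaluate(coeffs, t):
    """Evaluate a polynomial given its monomial-basis coordinates."""
    return sum(c * t**k for k, c in enumerate(coeffs))

t = 1.7
assert np.isclose(evaluate(r, t), 2.0 * evaluate(p, t) + evaluate(q, t))
```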
A linear combination of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ with coefficients $\alpha_1, \ldots, \alpha_k$ is
$$\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k.$$
The span of a set of vectors is the set of all linear combinations of them:
$$\mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) = \{\,\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k : \alpha_i \in \mathbb{R}\,\}.$$
Span is the answer to the question: what set of points can I reach by mixing these vectors? The answer is always a subspace through the origin (a line, plane, hyperplane, or the whole space).
The span of the rows of a weight matrix $W$ is the range of the linear map $\mathbf{x} \mapsto W^\top\mathbf{x}$. Equivalently, the span of the columns of $W$ is the set of outputs of the layer $\mathbf{x} \mapsto W\mathbf{x}$. A low-rank $W$ has a small-dimensional column span — only a thin slice of output space is reachable. This is the foundation of LoRA (deck 07).
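A sketch of the low-rank picture, assuming a hypothetical weight matrix built LoRA-style as a product of two thin factors: no matter how many inputs you push through it, the outputs never escape an $r$-dimensional slice of output space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4
A = rng.normal(size=(d_out, r))
B = rng.normal(size=(r, d_in))
W = A @ B                                    # a rank-r weight matrix, LoRA-style

X = rng.normal(size=(d_in, 1000))            # 1000 random inputs, one per column
Y = W @ X                                    # the layer's outputs

print(np.linalg.matrix_rank(W))              # 4: the column span of W is 4-dimensional
print(np.linalg.matrix_rank(Y))              # 4: every output lives in that same thin slice
```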
A set $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is linearly independent if the only way to write the zero vector as a linear combination is with all coefficients zero:
$$\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0} \iff \alpha_1 = \cdots = \alpha_k = 0.$$
Otherwise the set is linearly dependent: some non-trivial combination of them is zero, which is the same as saying at least one of them is a linear combination of the others. That one can be dropped without losing any span.
Real data lies, approximately, on low-dimensional surfaces inside very high-dimensional spaces. The pixels of an MNIST digit are nominally a 784-dimensional vector, but the set of all real handwritten digits forms a much smaller-dimensional manifold. The columns of the matrix you'd build by stacking 60,000 MNIST digits are highly linearly dependent — this is exactly what PCA (deck 06) exploits, and exactly why a low-rank LoRA adapter (deck 07) is enough.
Are $(1,2)$, $(2,4)$, $(3,5)$ linearly independent in $\mathbb{R}^2$? No: three vectors in a 2-dimensional space must be linearly dependent. (A pair of them can already be dependent on its own, e.g. $(1,2)$ and $(2,4)$ are scalar multiples.) The maximum size of a linearly independent set in $\mathbb{R}^n$ is $n$.
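The same check done numerically: stack the vectors as columns and ask for the rank; a rank smaller than the number of vectors means the set is dependent.

```python
import numpy as np

V = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0]])              # columns are (1,2), (2,4), (3,5)

print(np.linalg.matrix_rank(V))              # 2 < 3 columns: the set is linearly dependent
print(np.linalg.matrix_rank(V[:, :2]))       # 1: (1,2) and (2,4) are already dependent (scalar multiples)
print(np.linalg.matrix_rank(V[:, [0, 2]]))   # 2: (1,2) and (3,5) are independent
```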
A basis of a vector space $V$ is a set $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$ that is linearly independent and spans $V$.
Equivalently: every $\mathbf{v} \in V$ can be written uniquely as $\mathbf{v} = c_1\mathbf{e}_1 + \cdots + c_n\mathbf{e}_n$. The numbers $c_i$ are the coordinates of $\mathbf{v}$ in the basis.
Every basis of $V$ has the same number of elements. That number is the dimension of $V$, written $\dim V$. (Proof sketch: if you had a basis of size $m$ and another of size $n > m$, you'd have $n$ linearly independent vectors expressed as combinations of only $m$ vectors; but any $n > m$ vectors lying in the span of $m$ vectors must be linearly dependent (the Steinitz exchange lemma), a contradiction.)
$\mathbf{e}_1 = (1, 0, \ldots, 0)$, $\mathbf{e}_2 = (0, 1, 0, \ldots, 0)$, …, $\mathbf{e}_n = (0, \ldots, 0, 1)$. Each is a 1 in one slot, zeros everywhere else. Coordinates in this basis are just the components.
Monomial basis: $1, t, t^2, \ldots, t^n$. Dimension $n+1$. Coordinates are the polynomial coefficients.
The hidden state in Llama-3-70B is an 8192-dimensional vector. Its "basis" is the columns of whatever last writes into the residual stream — output projections from attention heads, output projections from FFNs. Mechanistic interpretability is, in part, the study of which directions in this 8192-dim space carry which information — an attempt to recover a meaningful basis from the standard one the GPU happens to use.
A vector exists independently of any basis — it's the arrow / abstract point. Its coordinates, on the other hand, depend on the basis you chose. Switching basis is a fundamental operation in ML: PCA, whitening, ICA and the rotations inside RoPE are all changes of basis.
If $B = (\mathbf{b}_1, \ldots, \mathbf{b}_n)$ is the matrix whose columns are a new basis, and $\mathbf{v}$ has new-basis coordinates $\mathbf{c}$, then the standard-basis coordinates of $\mathbf{v}$ are
$$\mathbf{v} = B \mathbf{c}.$$
Equivalently, given $\mathbf{v}$ in the standard basis, its new-basis coordinates are $\mathbf{c} = B^{-1} \mathbf{v}$.
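In code, finding the new-basis coordinates is a linear solve (no need to form $B^{-1}$ explicitly); a small sketch with a made-up basis:

```python
import numpy as np

B = np.array([[1.0, 1.0],
              [0.0, 1.0]])                   # columns b1, b2 form the new basis
v = np.array([3.0, 2.0])                     # standard-basis coordinates

c = np.linalg.solve(B, v)                    # new-basis coordinates, without forming B^{-1}
assert np.allclose(B @ c, v)                 # v = c1*b1 + c2*b2
print(c)                                     # [1. 2.]
```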
A linear map looks completely different in different bases. A 2D rotation matrix is
$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
in the standard basis. Change to a basis aligned with the eigenvectors and the same map becomes diagonal (it's a complex diagonal $\mathrm{diag}(e^{i\theta}, e^{-i\theta})$). Same map, different appearance — the choice of basis is the storyteller. Eigendecomposition (deck 06) and SVD (deck 07) are precisely the search for the basis that makes a matrix as simple as possible.
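You can watch this happen numerically: np.linalg.eig on a rotation matrix returns the complex eigenvalues $e^{\pm i\theta}$, and conjugating by the eigenvector basis makes the map diagonal. A short sketch:

```python
import numpy as np

theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

eigvals, P = np.linalg.eig(R)                # columns of P are the eigenvector basis
print(eigvals)                               # approximately exp(+0.3i) and exp(-0.3i)

D = np.linalg.inv(P) @ R @ P                 # the same map, expressed in the eigenvector basis
print(np.round(D, 6))                        # diagonal: diag(e^{i*theta}, e^{-i*theta})
```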
An embedding lookup is a coordinate mapping: token id $i \mapsto \mathbf{e}_i$ in some chosen basis $E$. Any rotation $R$ of the embedding space gives an equivalent model with weights $RE$, $W_Q R^\top$, $W_K R^\top$, \ldots: maps that write into the residual stream pick up $R$ on the left, maps that read from it pick up $R^\top$ on the right, and the two cancel. This equivalence is used in mechanistic interpretability to argue that "directions, not neurons" are the meaningful unit.
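A quick numerical check of that equivalence, with made-up shapes: rotate the residual stream and compensate in the weights that read from it, and the queries (hence the attention scores) are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4
x = rng.normal(size=d_model)                 # a residual-stream vector (column convention)
W_Q = rng.normal(size=(d_head, d_model))

M = rng.normal(size=(d_model, d_model))
R, _ = np.linalg.qr(M)                       # a random orthogonal matrix (rotation up to reflection)

x_rot = R @ x                                # rotate the residual stream
W_Q_rot = W_Q @ R.T                          # compensate in the weights that read from it

assert np.allclose(W_Q @ x, W_Q_rot @ x_rot) # queries, and so attention scores, are unchanged
```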
A subspace $U$ of $V$ is a non-empty subset that is itself a vector space — closed under addition and scaling. Equivalently, $U$ is a subspace iff for all $\mathbf{u}, \mathbf{u}' \in U$ and $\alpha \in \mathbb{R}$: $\mathbf{u} + \mathbf{u}' \in U$ and $\alpha \mathbf{u} \in U$.
From these two it follows that $\mathbf{0} \in U$ (take $\alpha = 0$). Every subspace passes through the origin.
In $\mathbb{R}^3$ the only subspaces are $\{\mathbf{0}\}$, lines through $\mathbf{0}$, planes through $\mathbf{0}$, and all of $\mathbb{R}^3$. A line that doesn't pass through the origin is not a subspace — the zero vector isn't on it.
For a matrix $A : \mathbb{R}^n \to \mathbb{R}^m$, the range $\{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$ and the null space $\{\mathbf{x} : A\mathbf{x} = \mathbf{0}\} \subseteq \mathbb{R}^n$ are both subspaces. Deck 02 develops these.
An affine subspace is a subspace shifted by a vector: $\mathbf{p} + U$ for some fixed $\mathbf{p}$ and subspace $U$. A line through $(0,0,1)$ in the direction $(1,1,1)$ is affine but not a subspace. ML loss-landscape geometry has both: the set of solutions to $W\mathbf{x} = \mathbf{b}$ is affine, the kernel $\{\mathbf{x} : W\mathbf{x} = \mathbf{0}\}$ is its parallel linear subspace through the origin.
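A numerical sketch of that last point, with a made-up underdetermined system: a particular solution plus any null-space vector still solves $W\mathbf{x} = \mathbf{b}$, so the solution set is an affine translate of the null space and does not contain the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                  # underdetermined: 3 equations, 5 unknowns
b = rng.normal(size=3)

x_p, *_ = np.linalg.lstsq(W, b, rcond=None)  # one particular solution
null_basis = np.linalg.svd(W)[2][3:].T       # basis for the null space (2-dimensional here)

n = null_basis @ rng.normal(size=2)          # an arbitrary null-space vector
assert np.allclose(W @ (x_p + n), b)         # still a solution: the solution set is x_p + null(W)
assert not np.allclose(W @ np.zeros(5), b)   # 0 is not a solution: affine, not a linear subspace
```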
Drag the heads of $\mathbf{u}$ and $\mathbf{v}$. The playground shows their sum, the linear combination $\alpha\mathbf{u} + \beta\mathbf{v}$, and the span (a line if $\mathbf{u}$ and $\mathbf{v}$ are parallel, the whole plane otherwise).
The pink arrow is the linear combination $\alpha\mathbf{u} + \beta\mathbf{v}$. Sweep both sliders and notice how its tip can reach any point in the plane — that's a 2D span. Click "Make parallel" and notice how the span collapses to a single line (highlighted): now no setting of $\alpha, \beta$ can reach off-line points. This is what linear dependence looks like.
The intuition you just built in 2D scales, with one important caveat: the geometry of high-dimensional spaces is deeply counter-intuitive. In particular, two independent random vectors in a high-dimensional space are, with overwhelming probability, almost orthogonal.
Modern LLM embedding spaces are 2k–16k dimensional. The "almost-orthogonal" phenomenon is a feature: it means a $d$-dimensional space has vastly more than $d$ approximately-orthogonal directions you can pack into it. Anthropic's Toy Models of Superposition (Elhage et al., 2022) and follow-up work show that LLMs exploit this to encode many more "features" than they have neurons by using nearly-orthogonal directions.
How many vectors can you pack into $\mathbb{R}^{4096}$ such that each pair makes an angle of at least 80°? With strict orthogonality you'd be limited to 4096. Allowing $\cos\theta \le 0.17$ (i.e. at least 80° apart) lets you pack exponentially many: a Johnson-Lindenstrauss-style bound gives roughly $\exp(\Theta(\varepsilon^2 d))$ vectors. For $d = 4096$ and $\varepsilon = 0.17$ that's something like $10^{50}$. This is the slack that superposition exploits.
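You can see the raw phenomenon directly: independent random unit vectors in high dimension are nearly orthogonal with overwhelming probability. A quick NumPy check (the numbers are illustrative, not a proof of the bound):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4096, 200
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # n random unit vectors in R^d

cos = V @ V.T                                    # all pairwise cosines
off_diag = cos[~np.eye(n, dtype=bool)]
print(np.abs(off_diag).max())                    # typically ~0.06: every pair is within a few degrees of 90°
```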
Deck 02 — Matrices as Linear Maps turns the static "tuple of numbers" picture into a moving one: a matrix is the rule that takes a vector to another vector, and matmul is composition of rules.