Linear Algebra for AI 02 — Matrices as Linear Maps

00

Topics We'll Cover

A Matrix is a Function
The Column Picture
The Row Picture
Range & Null Space
Rank
The Four Fundamental Subspaces
Composition is Multiplication
Inverse, Pseudoinverse, Square & Otherwise
Interactive: 2D Linear Map Visualiser
Inside an MLP Layer
Cheat Sheet

01

A Matrix is a Function

A linear map $T : V \to W$ is a function that respects addition and scaling:

$$T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}), \qquad T(\alpha \mathbf{u}) = \alpha T(\mathbf{u}).$$

Once you fix bases for $V$ and $W$, every linear map is given by a matrix. The map $T : \mathbb{R}^n \to \mathbb{R}^m$ that sends $\mathbf{x} \mapsto A\mathbf{x}$ is linear; conversely, every linear map can be written this way for some unique $A$.

How $A$ is determined by $T$

The columns of $A$ are the images of the standard basis vectors. If $\mathbf{e}_j = (0, \ldots, 1, \ldots, 0)$ is the $j$-th standard basis vector, then

$$A = \big[\, T(\mathbf{e}_1) \mid T(\mathbf{e}_2) \mid \cdots \mid T(\mathbf{e}_n) \,\big].$$

This little formula is worth memorising. To find a matrix, push the basis vectors through the map and write the results as columns. Done.

Two-sentence summary of all of linear algebra

Linear maps and matrices are the same thing. Multiplication of matrices is composition of maps.

02

The Column Picture

The product $A\mathbf{x}$ has two equivalent definitions; the column picture is the one that makes most ML intuitions work.

$$A\mathbf{x} = \begin{pmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \\ | & | & & | \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n.$$

That is, $A\mathbf{x}$ is a linear combination of the columns of $A$, weighted by the entries of $\mathbf{x}$. The matrix doesn't operate on $\mathbf{x}$ entry-by-entry; it asks each column to chip in proportional to one component of $\mathbf{x}$, then sums.

What this gives you for free

Range = column span. The set of outputs $\{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}$ is exactly $\mathrm{span}(\mathbf{a}_1, \ldots, \mathbf{a}_n)$.
Solvability. The equation $A\mathbf{x} = \mathbf{b}$ has a solution iff $\mathbf{b}$ is in the column span. No "if and only if"-mystery: it's literally about whether $\mathbf{b}$ is reachable as a combination of columns.
Embedding lookups. If $E$ is an embedding matrix and $\mathbf{x}$ is a one-hot vector, then $E\mathbf{x}$ is just the column of $E$ at the index of the 1. This is what the embedding layer does.

03

The Row Picture

Equivalently, the $i$-th component of $A\mathbf{x}$ is the dot product of the $i$-th row of $A$ with $\mathbf{x}$:

$$(A\mathbf{x})_i = \sum_j A_{ij} x_j = \langle \mathbf{r}_i,\, \mathbf{x} \rangle.$$

Each row is a "feature detector". The dot product says how much $\mathbf{x}$ aligns with that detector. The output is a vector of detector responses, one per row.

Two pictures, same answer

Column picture

"What output do I get if I take this much of column 1, this much of column 2, …?" Output is reconstructed from a recipe of basis pieces.

Row picture

"How much does my input look like row 1? Like row 2?" Output is a similarity score against each row.

The row picture is what feels right when you're computing attention scores ($Q K^\top$ is "row of $Q$ dotted with row of $K$") or matching an embedding against vocabulary logits (the unembedding matrix's rows are the per-token detectors). The column picture is what feels right when you're updating residual streams (each output column adds a direction to the next layer's input). You need both.

04

Range & Null Space

Every matrix $A \in \mathbb{R}^{m \times n}$ defines two subspaces of immediate interest.

Range (column space) $\subseteq \mathbb{R}^m$

$$\mathrm{Range}(A) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} = \mathrm{span}(\text{columns}).$$

The set of outputs the map can produce. A subspace of the output space.

Null space (kernel) $\subseteq \mathbb{R}^n$

$$\mathrm{Null}(A) = \{\mathbf{x} : A\mathbf{x} = \mathbf{0}\}.$$

The set of inputs that get squashed to zero. A subspace of the input space.

The rank-nullity theorem

For any $A \in \mathbb{R}^{m\times n}$,

$$\dim \mathrm{Range}(A) + \dim \mathrm{Null}(A) = n.$$

"Range plus null equals input dimension." Information that goes in either survives (in the range) or is destroyed (in the null space) — nothing else.

An ML-flavoured example

A down-projection $W \in \mathbb{R}^{1024 \times 4096}$ has $n = 4096$ inputs and at most $\dim\mathrm{Range}(W) = 1024$ output dimensions, so $\dim\mathrm{Null}(W) \ge 4096 - 1024 = 3072$. Three quarters of the input directions are guaranteed to be in the null space — their information is irretrievably lost. This is why the FFN's down-projection is a real bottleneck on what can survive each block, and why residual connections matter so much (deck 11).

05

Rank

The rank of $A$ is $\dim \mathrm{Range}(A)$ — the number of linearly independent columns. It also equals the number of linearly independent rows (a non-trivial theorem: row rank = column rank).

For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) \le \min(m, n)$. When equality holds, $A$ is full rank; otherwise rank-deficient.

Shape	Full rank means	Geometric meaning
Tall ($m > n$)	$\mathrm{rank} = n$	Map is injective: distinct inputs give distinct outputs. $\mathrm{Null}(A) = \{\mathbf{0}\}$.
Wide ($m < n$)	$\mathrm{rank} = m$	Map is surjective: every $\mathbf{b} \in \mathbb{R}^m$ has a preimage. $\mathrm{Range}(A) = \mathbb{R}^m$.
Square ($m = n$)	$\mathrm{rank} = n$	Map is bijective: $A$ is invertible. $\det A \neq 0$.
Any shape	$\mathrm{rank} = r < \min(m,n)$	Output lives on an $r$-dimensional slice of $\mathbb{R}^m$. Most directions are unreachable.

Why low rank is a feature, not a bug

For a real weight matrix in a trained transformer, full rank is the default but the effective rank (the number of singular values above a small threshold) is often much smaller. This is what LoRA exploits: the update learnt during fine-tuning is approximately rank-8 to rank-32, even when the full weight is rank-4096. We meet the singular value spectrum properly in deck 07.

06

The Four Fundamental Subspaces

Any matrix $A \in \mathbb{R}^{m \times n}$ defines four subspaces, two in each space, that decompose both spaces orthogonally.

In $\mathbb{R}^n$ (input space)

Row space $\mathrm{Row}(A) = \mathrm{Range}(A^\top)$. Dimension $r$.
Null space $\mathrm{Null}(A)$. Dimension $n - r$.
$\mathrm{Row}(A) \perp \mathrm{Null}(A)$ and they together span $\mathbb{R}^n$.

In $\mathbb{R}^m$ (output space)

Column space $\mathrm{Range}(A)$. Dimension $r$.
Left null space $\mathrm{Null}(A^\top)$. Dimension $m - r$.
$\mathrm{Range}(A) \perp \mathrm{Null}(A^\top)$ and they together span $\mathbb{R}^m$.

The map $A$ takes the row space bijectively to the column space (an $r$-to-$r$ isomorphism), kills the null space, and never reaches the left null space. This is the structure that the SVD makes precise.

Where this shows up

Least-squares regression solves $A\mathbf{x} = \mathbf{b}$ when $\mathbf{b}$ is not in the column space, by orthogonally projecting $\mathbf{b}$ onto $\mathrm{Range}(A)$. The component of $\mathbf{b}$ that lives in the left null space is, by construction, the residual that least-squares cannot fit. We see this again in deck 05 (projections) and deck 07 (the SVD-based pseudoinverse).

07

Composition is Multiplication

If $A$ is the matrix of $T : \mathbb{R}^n \to \mathbb{R}^m$ and $B$ is the matrix of $S : \mathbb{R}^p \to \mathbb{R}^n$, then the matrix of the composition $T \circ S : \mathbb{R}^p \to \mathbb{R}^m$ is the matrix product $AB$.

This is the entire reason matrix multiplication is defined the way it is. The associativity, the $(i,k)$-entry formula, the non-commutativity — they all fall out of the demand that $(AB)\mathbf{x} = A(B\mathbf{x})$.

Three things the composition view explains immediately

Why dimensions must match. $A$ takes $n$-vectors to $m$-vectors, $B$ takes $p$-vectors to $n$-vectors. Compose only if $A$'s input matches $B$'s output. The "inner dimensions match" rule for matmul is exactly this.
Why composition isn't commutative. Rotate then stretch is not the same as stretch then rotate. Composing functions in different orders gives different functions; the matrices don't commute.
Why a deep network is "deeply non-linear" only between layers. $A_3 (A_2 (A_1 \mathbf{x}))$ collapses to $(A_3 A_2 A_1) \mathbf{x}$ without any non-linearity in between. The non-linear activations between layers are what stop layers from squashing into one.

A deep linear network is one matrix

Stack 100 linear layers $W_1, W_2, \ldots, W_{100}$ with no non-linearities and you've trained a single matrix $W = W_{100} \cdots W_1$. The gradients flow differently — you can have a rich training dynamic on a deep linear network — but the function class is exactly the same as a single matmul. (See Saxe, McClelland & Ganguli, ICLR 2014, for the dynamics.) Non-linearities are the entire reason we stack layers.

08

Inverse, Pseudoinverse, Square & Otherwise

For square full-rank $A$, there's a unique $A^{-1}$ such that $A A^{-1} = A^{-1} A = I$. For everything else, the right tool is the pseudoinverse $A^+$, which always exists and is unique.

Inverse $A^{-1}$ (square, full rank)

Exists iff $\det A \neq 0$ iff columns are linearly independent.
Solves $A\mathbf{x} = \mathbf{b}$ uniquely as $\mathbf{x} = A^{-1}\mathbf{b}$.
In practice we don't compute $A^{-1}$; we factor $A$ (LU, QR, Cholesky) and back-substitute. Computing $A^{-1}$ explicitly is numerically unstable and wasteful.

Pseudoinverse $A^+$ (any shape)

If $A$ has SVD $A = U\Sigma V^\top$ then $A^+ = V \Sigma^+ U^\top$, where $\Sigma^+$ inverts the non-zero singular values (deck 07).
For tall $A$: $A^+ = (A^\top A)^{-1} A^\top$ — the least-squares solver.
For wide $A$: $A^+ = A^\top (A A^\top)^{-1}$ — the minimum-norm solver.

Engineering note

Numerical libraries never recommend literally inverting a matrix in production. numpy.linalg.solve(A, b) uses LU with partial pivoting; scipy.linalg.lstsq uses SVD. Asking for $A^{-1}$ is a code smell. The frameworks underlying ML training use these factorisations under the hood for the few places they're needed (e.g. natural-gradient methods, pre-conditioners).

09

Interactive: 2D Linear Map Visualiser

The grid below shows a unit square (and grid) before and after applying the matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$. The two coloured arrows are the columns of $A$ — the images of $(1,0)$ and $(0,1)$. Determinant = signed area of the parallelogram they span. Drag the sliders.

a (row 1, col 1)1.00

b (row 1, col 2)0.00

c (row 2, col 1)0.00

d (row 2, col 2)1.00

det(A)

1.00

rank(A)

2

column 1

(1, 0)

column 2

(0, 1)

Notice: the columns of $A$ are exactly where the basis vectors $\mathbf{e}_1$ and $\mathbf{e}_2$ go. Ratio of new area to old area is $|\det A|$. When you click "Rank 1", the unit square collapses to a line segment and $\det = 0$ — the map's range is one-dimensional and the null space is the orthogonal direction.

10

Inside an MLP Layer

A single MLP layer in PyTorch:

class MLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_hidden)
        self.W2 = nn.Linear(d_hidden, d_out)
    def forward(self, x):
        return self.W2(F.gelu(self.W1(x)))

In linear-algebra notation this is

$$\mathbf{y} = W_2 \,\sigma(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2,$$

where $W_1 \in \mathbb{R}^{d_h \times d_{in}}$ and $W_2 \in \mathbb{R}^{d_{out} \times d_h}$. This is two linear maps with a non-linearity sandwiched between them. Without the $\sigma$, the whole layer collapses to one linear map $W_2 W_1$.

What each piece does, geometrically

$W_1\mathbf{x}$ projects the input into the hidden space. Each row of $W_1$ is a feature detector; each entry of $W_1\mathbf{x} + \mathbf{b}_1$ is a feature score.
$\sigma(\cdot)$ applies a non-linear gate: GELU keeps activations roughly proportional to their score when positive, zeroes out (or smoothly suppresses) negative scores.
$W_2$ writes the gated features back into the output space. Each column of $W_2$ is a "what to add to the residual stream when this hidden neuron fires" direction.

Decks 03 (matmul) and 05 (projection) make the up/down projection picture in transformer FFNs precise. By the time we reach deck 11 you'll be able to read this code as: down-project to feature space, gate, write back.

11

Cheat Sheet

A matrix $A$ is a linear map. Its columns are the images of the standard basis vectors.
Column picture: $A\mathbf{x}$ is a linear combination of $A$'s columns. Row picture: $A\mathbf{x}$ is the vector of dot products of $A$'s rows with $\mathbf{x}$.
$\mathrm{Range}(A)$ = column span. $\mathrm{Null}(A)$ = inputs $A$ kills. Rank-nullity: $\dim\mathrm{Range} + \dim\mathrm{Null} = n$.
Rank = dimension of range = number of linearly independent columns = number of linearly independent rows. Bounded by $\min(m,n)$.
The four fundamental subspaces orthogonally decompose input space ($\mathrm{Row} \oplus \mathrm{Null}$) and output space ($\mathrm{Col} \oplus \mathrm{LeftNull}$).
Composition is multiplication. $(AB)\mathbf{x} = A(B\mathbf{x})$. Inner dimensions must match. Multiplication isn't commutative because composing functions in different orders gives different functions.
Pseudoinverse $A^+$ replaces $A^{-1}$ for non-square or rank-deficient maps. For tall full-rank $A$: $A^+ = (A^\top A)^{-1}A^\top$, the least-squares solver.