Linear Algebra for AI — Presentation 02

Matrices as Linear Maps

Stop reading a matrix as a grid of numbers. Start reading it as a function — a rule that turns one vector into another. Every linear layer of every neural network is exactly this object.

Keywords: linear map · column picture · row picture · range · null space · rank

One object, four readings: a grid of numbers → the rule $\mathbf{x} \mapsto A\mathbf{x}$ → a linear function → an MLP layer
00

Topics We'll Cover

01

A Matrix is a Function

A linear map $T : V \to W$ is a function that respects addition and scaling:

$$T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}), \qquad T(\alpha \mathbf{u}) = \alpha T(\mathbf{u}).$$

Once you fix bases for $V$ and $W$, every linear map is given by a matrix. The map $T : \mathbb{R}^n \to \mathbb{R}^m$ that sends $\mathbf{x} \mapsto A\mathbf{x}$ is linear; conversely, every linear map can be written this way for some unique $A$.

How $A$ is determined by $T$

The columns of $A$ are the images of the standard basis vectors. If $\mathbf{e}_j = (0, \ldots, 1, \ldots, 0)$ is the $j$-th standard basis vector, then

$$A = \big[\, T(\mathbf{e}_1) \mid T(\mathbf{e}_2) \mid \cdots \mid T(\mathbf{e}_n) \,\big].$$

This little formula is worth memorising. To find a matrix, push the basis vectors through the map and write the results as columns. Done.
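
A quick NumPy check of that recipe. The map here is a 90° anticlockwise rotation of the plane, chosen purely for illustration:

import numpy as np

# The map T: rotate the plane by 90 degrees anticlockwise.
def T(v):
    x, y = v
    return np.array([-y, x])

# Push the standard basis vectors through T, write the results as columns.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
A = np.column_stack([T(e1), T(e2)])      # [[0, -1], [1, 0]]

# The matrix now reproduces the map on every vector.
v = np.array([3.0, 2.0])
assert np.allclose(A @ v, T(v))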

Two-sentence summary of all of linear algebra

Linear maps and matrices are the same thing. Multiplication of matrices is composition of maps.

02

The Column Picture

The product $A\mathbf{x}$ has two equivalent definitions; the column picture is the one that makes most ML intuitions work.

$$A\mathbf{x} = \begin{pmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \\ | & | & & | \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n.$$

That is, $A\mathbf{x}$ is a linear combination of the columns of $A$, weighted by the entries of $\mathbf{x}$. The matrix doesn't operate on $\mathbf{x}$ entry-by-entry; it asks each column to chip in proportional to one component of $\mathbf{x}$, then sums.
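
A two-line NumPy check of the column picture (the numbers are arbitrary):

import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])          # columns a1, a2, a3
x = np.array([4.0, -1.0, 2.0])

# A @ x is "x1 times column 1, plus x2 times column 2, plus x3 times column 3".
by_columns = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]
assert np.allclose(A @ x, by_columns)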

What this gives you for free

03

The Row Picture

Equivalently, the $i$-th component of $A\mathbf{x}$ is the dot product of the $i$-th row of $A$ with $\mathbf{x}$:

$$(A\mathbf{x})_i = \sum_j A_{ij} x_j = \langle \mathbf{r}_i,\, \mathbf{x} \rangle.$$

Each row is a "feature detector". The dot product says how much $\mathbf{x}$ aligns with that detector. The output is a vector of detector responses, one per row.
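
And the row picture, on the same kind of toy example: each output entry is one row dotted with the input.

import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])          # rows r1, r2
x = np.array([4.0, -1.0, 2.0])

# Component i of A @ x is the dot product of row i with x.
by_rows = np.array([A[i] @ x for i in range(A.shape[0])])
assert np.allclose(A @ x, by_rows)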

Two pictures, same answer

Column picture

"What output do I get if I take this much of column 1, this much of column 2, …?" Output is reconstructed from a recipe of basis pieces.

Row picture

"How much does my input look like row 1? Like row 2?" Output is a similarity score against each row.

The row picture is what feels right when you're computing attention scores ($Q K^\top$ is "row of $Q$ dotted with row of $K$") or matching an embedding against vocabulary logits (the unembedding matrix's rows are the per-token detectors). The column picture is what feels right when you're updating residual streams (each output column adds a direction to the next layer's input). You need both.
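
A minimal sketch of that row-picture reading of attention scores, with made-up toy shapes (4 query tokens, 5 key tokens, head dimension 8). It only shows where the dot products live; it is not a real attention implementation:

import torch

n_q, n_k, d_head = 4, 5, 8           # toy sizes, chosen arbitrarily
Q = torch.randn(n_q, d_head)         # one row per query token
K = torch.randn(n_k, d_head)         # one row per key token

scores = Q @ K.T                     # (n_q, n_k) grid of similarities
# Entry (i, j) is row i of Q dotted with row j of K.
assert torch.allclose(scores[2, 3], Q[2] @ K[3])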

04

Range & Null Space

Every matrix $A \in \mathbb{R}^{m \times n}$ defines two subspaces of immediate interest.

Range (column space) $\subseteq \mathbb{R}^m$

$$\mathrm{Range}(A) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} = \mathrm{span}(\text{columns}).$$

The set of outputs the map can produce. A subspace of the output space.

Null space (kernel) $\subseteq \mathbb{R}^n$

$$\mathrm{Null}(A) = \{\mathbf{x} : A\mathbf{x} = \mathbf{0}\}.$$

The set of inputs that get squashed to zero. A subspace of the input space.

The rank-nullity theorem

For any $A \in \mathbb{R}^{m\times n}$,

$$\dim \mathrm{Range}(A) + \dim \mathrm{Null}(A) = n.$$

"Range plus null equals input dimension." Information that goes in either survives (in the range) or is destroyed (in the null space) — nothing else.

An ML-flavoured example

A down-projection $W \in \mathbb{R}^{1024 \times 4096}$ has $n = 4096$ inputs and at most $\dim\mathrm{Range}(W) = 1024$ output dimensions, so $\dim\mathrm{Null}(W) \ge 4096 - 1024 = 3072$. Three quarters of the input directions are guaranteed to be in the null space — their information is irretrievably lost. This is why the FFN's down-projection is a real bottleneck on what can survive each block, and why residual connections matter so much (deck 11).
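
The same bookkeeping at toy scale, with NumPy and SciPy (shapes shrunk from 1024×4096 to 4×16 so it runs instantly):

import numpy as np
from scipy.linalg import null_space

W = np.random.randn(4, 16)            # a toy "down-projection": 16 inputs, 4 outputs

rank = np.linalg.matrix_rank(W)       # dim Range(W), at most 4
nullity = null_space(W).shape[1]      # dim Null(W)
assert rank + nullity == W.shape[1]   # rank-nullity: 4 + 12 == 16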

05

Rank

The rank of $A$ is $\dim \mathrm{Range}(A)$ — the number of linearly independent columns. It also equals the number of linearly independent rows (a non-trivial theorem: row rank = column rank).

For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) \le \min(m, n)$. When equality holds, $A$ is full rank; otherwise rank-deficient.

Shape | Rank condition | Geometric meaning
Tall ($m > n$) | $\mathrm{rank} = n$ (full) | Map is injective: distinct inputs give distinct outputs. $\mathrm{Null}(A) = \{\mathbf{0}\}$.
Wide ($m < n$) | $\mathrm{rank} = m$ (full) | Map is surjective: every $\mathbf{b} \in \mathbb{R}^m$ has a preimage. $\mathrm{Range}(A) = \mathbb{R}^m$.
Square ($m = n$) | $\mathrm{rank} = n$ (full) | Map is bijective: $A$ is invertible. $\det A \neq 0$.
Any shape | $\mathrm{rank} = r < \min(m,n)$ (deficient) | Output lives on an $r$-dimensional slice of $\mathbb{R}^m$. Most directions are unreachable.

Why low rank is a feature, not a bug

For a real weight matrix in a trained transformer, full rank is the default, but the effective rank (the number of singular values above a small threshold) is often much smaller. This is what LoRA exploits: the update learnt during fine-tuning can be approximated well by a matrix of rank 8 to 32, even when the weight itself has rank 4096. We meet the singular value spectrum properly in deck 07.
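
A sketch of what "effective rank" means operationally. The matrix below is synthetic, with a hand-picked fast-decaying spectrum standing in for a trained weight (real spectra vary):

import numpy as np

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((256, 256)))
V, _ = np.linalg.qr(rng.standard_normal((256, 256)))
s = np.exp(-np.arange(256) / 10.0)                    # fast-decaying singular values
W = U @ np.diag(s) @ V.T

print(np.linalg.matrix_rank(W))                                  # 256: technically full rank
print(int((np.linalg.svd(W, compute_uv=False) > 1e-2).sum()))    # 47: effective rank at threshold 1e-2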

06

The Four Fundamental Subspaces

Any matrix $A \in \mathbb{R}^{m \times n}$ defines four subspaces, two in each space, that decompose both spaces orthogonally.

In $\mathbb{R}^n$ (input space)

  • Row space $\mathrm{Row}(A) = \mathrm{Range}(A^\top)$. Dimension $r$.
  • Null space $\mathrm{Null}(A)$. Dimension $n - r$.
  • $\mathrm{Row}(A) \perp \mathrm{Null}(A)$ and they together span $\mathbb{R}^n$.

In $\mathbb{R}^m$ (output space)

  • Column space $\mathrm{Range}(A)$. Dimension $r$.
  • Left null space $\mathrm{Null}(A^\top)$. Dimension $m - r$.
  • $\mathrm{Range}(A) \perp \mathrm{Null}(A^\top)$ and they together span $\mathbb{R}^m$.

The map $A$ takes the row space bijectively to the column space (an $r$-to-$r$ isomorphism), kills the null space, and never reaches the left null space. This is the structure that the SVD makes precise.
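
A quick numerical confirmation, using SciPy's null_space on a deliberately rank-deficient matrix (the 5×7, rank-3 shape is arbitrary):

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))   # 5x7 with rank 3

N = null_space(A)          # orthonormal basis of Null(A),   shape (7, 4)
LN = null_space(A.T)       # orthonormal basis of Null(A^T), shape (5, 2)

# Every row of A is orthogonal to every null-space direction ...
assert np.allclose(A @ N, 0)
# ... and every column of A is orthogonal to every left-null-space direction.
assert np.allclose(LN.T @ A, 0)
# Dimensions: 3 + 4 = 7 in the input space, 3 + 2 = 5 in the output space.
assert N.shape[1] == 7 - 3 and LN.shape[1] == 5 - 3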

Where this shows up

Least-squares regression solves $A\mathbf{x} = \mathbf{b}$ when $\mathbf{b}$ is not in the column space, by orthogonally projecting $\mathbf{b}$ onto $\mathrm{Range}(A)$. The component of $\mathbf{b}$ that lives in the left null space is, by construction, the residual that least-squares cannot fit. We see this again in deck 05 (projections) and deck 07 (the SVD-based pseudoinverse).
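
The same picture in NumPy (random data; nothing is special about the sizes):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 3))     # tall: 50 equations, 3 unknowns
b = rng.standard_normal(50)          # generically NOT in Range(A)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
residual = b - A @ x_hat

# The residual is the part of b in the left null space: it is orthogonal to
# every column of A, so no choice of x can shrink it further.
assert np.allclose(A.T @ residual, 0)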

07

Composition is Multiplication

If $A$ is the matrix of $T : \mathbb{R}^n \to \mathbb{R}^m$ and $B$ is the matrix of $S : \mathbb{R}^p \to \mathbb{R}^n$, then the matrix of the composition $T \circ S : \mathbb{R}^p \to \mathbb{R}^m$ is the matrix product $AB$.

This is the entire reason matrix multiplication is defined the way it is. The associativity, the $(i,k)$-entry formula, the non-commutativity — they all fall out of the demand that $(AB)\mathbf{x} = A(B\mathbf{x})$.
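
The defining identity, checked on random matrices of compatible (arbitrarily chosen) shapes:

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))   # matrix of T : R^6 -> R^4
B = rng.standard_normal((6, 5))   # matrix of S : R^5 -> R^6
x = rng.standard_normal(5)

# Applying S then T is the same as applying the single matrix AB.
assert np.allclose(A @ (B @ x), (A @ B) @ x)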

Three things the composition view explains immediately

A deep linear network is one matrix

Stack 100 linear layers $W_1, W_2, \ldots, W_{100}$ with no non-linearities and you've trained a single matrix $W = W_{100} \cdots W_1$. The gradients flow differently — you can have a rich training dynamic on a deep linear network — but the function class is exactly the same as a single matmul. (See Saxe, McClelland & Ganguli, ICLR 2014, for the dynamics.) Non-linearities are the entire reason we stack layers.
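
A PyTorch sketch of the collapse, with 5 layers instead of 100 and biases switched off to keep the algebra literal:

import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
layers = [nn.Linear(d, d, bias=False) for _ in range(5)]
deep = nn.Sequential(*layers)             # no non-linearities anywhere

# Multiply the weights together in application order: W5 ... W2 W1.
W = torch.eye(d)
for layer in layers:
    W = layer.weight @ W

x = torch.randn(d)
assert torch.allclose(deep(x), W @ x, atol=1e-5)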

08

Inverse, Pseudoinverse, Square & Otherwise

For square full-rank $A$, there's a unique $A^{-1}$ such that $A A^{-1} = A^{-1} A = I$. For everything else, the right tool is the pseudoinverse $A^+$, which always exists and is unique.

Inverse $A^{-1}$ (square, full rank)

  • Exists iff $\det A \neq 0$ iff columns are linearly independent.
  • Solves $A\mathbf{x} = \mathbf{b}$ uniquely as $\mathbf{x} = A^{-1}\mathbf{b}$.
  • In practice we don't compute $A^{-1}$; we factor $A$ (LU, QR, Cholesky) and back-substitute. Computing $A^{-1}$ explicitly is numerically unstable and wasteful.

Pseudoinverse $A^+$ (any shape)

  • If $A$ has SVD $A = U\Sigma V^\top$ then $A^+ = V \Sigma^+ U^\top$, where $\Sigma^+$ inverts the non-zero singular values (deck 07).
  • For tall $A$ with full column rank: $A^+ = (A^\top A)^{-1} A^\top$ — the least-squares solver.
  • For wide $A$ with full row rank: $A^+ = A^\top (A A^\top)^{-1}$ — the minimum-norm solver. (Both closed forms are checked numerically below.)
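
Both closed forms agree with NumPy's SVD-based pinv whenever the relevant Gram matrix is invertible; random matrices of these shapes are full rank almost surely:

import numpy as np

rng = np.random.default_rng(4)
tall = rng.standard_normal((8, 3))    # full column rank (almost surely)
wide = rng.standard_normal((3, 8))    # full row rank (almost surely)

assert np.allclose(np.linalg.pinv(tall), np.linalg.inv(tall.T @ tall) @ tall.T)
assert np.allclose(np.linalg.pinv(wide), wide.T @ np.linalg.inv(wide @ wide.T))
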
Engineering note

In numerical practice you almost never invert a matrix explicitly in production code. numpy.linalg.solve(A, b) uses LU with partial pivoting; scipy.linalg.lstsq uses the SVD. Asking for $A^{-1}$ is a code smell. The frameworks underlying ML training use these factorisations under the hood in the few places they're needed (e.g. natural-gradient methods, preconditioners).
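
A minimal illustration of the code-smell point: solve the system without ever forming the inverse.

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

x_solve = np.linalg.solve(A, b)      # LU with partial pivoting: preferred
x_inv = np.linalg.inv(A) @ b         # forms A^{-1} explicitly: avoid

# Same answer on this well-conditioned example, but solve is cheaper and
# numerically safer, especially as conditioning worsens.
assert np.allclose(x_solve, x_inv)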

09

Interactive: 2D Linear Map Visualiser

The grid below shows a unit square (and grid) before and after applying the matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$. The two coloured arrows are the columns of $A$ — the images of $(1,0)$ and $(0,1)$. Determinant = signed area of the parallelogram they span. Drag the sliders.

(Slider defaults: $a = 1$, $b = 0$, $c = 0$, $d = 1$, the identity map, giving $\det(A) = 1$, $\mathrm{rank}(A) = 2$, and columns $(1, 0)$ and $(0, 1)$.)

Notice: the columns of $A$ are exactly where the basis vectors $\mathbf{e}_1$ and $\mathbf{e}_2$ go. Ratio of new area to old area is $|\det A|$. When you click "Rank 1", the unit square collapses to a line segment and $\det = 0$ — the map's range is one-dimensional and the null space is the orthogonal direction.
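
The same facts, computed rather than dragged, for two hand-picked 2×2 matrices (the second has its second column equal to twice its first, so it is rank 1 by construction):

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 1.5]])
print(np.linalg.det(A))              # 3.0: the unit square's area is scaled by 3
print(np.linalg.matrix_rank(A))      # 2: full rank

B = np.array([[1.0, 2.0],
              [0.5, 1.0]])           # column 2 = 2 * column 1
print(np.linalg.det(B))              # 0.0: the square collapses to a segment
print(np.linalg.matrix_rank(B))      # 1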

10

Inside an MLP Layer

A single MLP layer in PyTorch:

import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_hidden)   # first linear map (plus bias b1)
        self.W2 = nn.Linear(d_hidden, d_out)  # second linear map (plus bias b2)

    def forward(self, x):
        # linear map, point-wise non-linearity, linear map
        return self.W2(F.gelu(self.W1(x)))

In linear-algebra notation this is

$$\mathbf{y} = W_2 \,\sigma(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2,$$

where $W_1 \in \mathbb{R}^{d_h \times d_{in}}$ and $W_2 \in \mathbb{R}^{d_{out} \times d_h}$. This is two linear maps (plus biases) with a non-linearity sandwiched between them. Without the $\sigma$, the whole layer collapses to a single affine map whose matrix is $W_2 W_1$.

What each piece does, geometrically

Decks 03 (matmul) and 05 (projection) make the up/down-projection picture in transformer FFNs precise. By the time we reach deck 11 you'll be able to read this code as: up-project into a wider feature space, gate with the non-linearity, write back down into the residual stream.

11

Cheat Sheet

Read next

Deck 03 — Matrix Multiplication Deep-Dive takes the composition picture and unfolds matmul into its four equivalent forms (row×col, col-comb, row-comb, outer-sum). It also begins the GEMM / arithmetic-intensity story that connects to the GPU and TPU hubs.