Stop reading a matrix as a grid of numbers. Start reading it as a function — a rule that turns one vector into another. Every layer of every neural network is exactly this object.
A linear map $T : V \to W$ is a function that respects addition and scaling:
$$T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}), \qquad T(\alpha \mathbf{u}) = \alpha T(\mathbf{u}).$$
Once you fix bases for $V$ and $W$, every linear map is given by a matrix. The map $T : \mathbb{R}^n \to \mathbb{R}^m$ that sends $\mathbf{x} \mapsto A\mathbf{x}$ is linear; conversely, every linear map can be written this way for some unique $A$.
The columns of $A$ are the images of the standard basis vectors. If $\mathbf{e}_j = (0, \ldots, 1, \ldots, 0)$ is the $j$-th standard basis vector, then
$$A = \big[\, T(\mathbf{e}_1) \mid T(\mathbf{e}_2) \mid \cdots \mid T(\mathbf{e}_n) \,\big].$$
This little formula is worth memorising. To find a matrix, push the basis vectors through the map and write the results as columns. Done.
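A quick numerical check of the recipe, as a minimal sketch: the map `T` below is an arbitrary linear example, and we rebuild its matrix column by column from the images of the standard basis vectors.

```python
import numpy as np

# An arbitrary linear map on R^2, used only to illustrate the recipe.
def T(x):
    return np.array([2.0 * x[0] - x[1],
                     x[0] + 3.0 * x[1]])

# Push each standard basis vector through T and stack the results as columns.
A = np.column_stack([T(e) for e in np.eye(2)])

x = np.array([1.5, -0.5])
assert np.allclose(A @ x, T(x))   # the matrix reproduces the map on any input
```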
Linear maps and matrices are the same thing. Multiplication of matrices is composition of maps.
The product $A\mathbf{x}$ has two equivalent definitions; the column picture is the one that makes most ML intuitions work.
$$A\mathbf{x} = \begin{pmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \\ | & | & & | \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n.$$
That is, $A\mathbf{x}$ is a linear combination of the columns of $A$, weighted by the entries of $\mathbf{x}$. The matrix doesn't operate on $\mathbf{x}$ entry-by-entry; it asks each column to chip in proportional to one component of $\mathbf{x}$, then sums.
Equivalently, the $i$-th component of $A\mathbf{x}$ is the dot product of the $i$-th row of $A$ with $\mathbf{x}$:
$$(A\mathbf{x})_i = \sum_j A_{ij} x_j = \langle \mathbf{r}_i,\, \mathbf{x} \rangle.$$
Each row is a "feature detector". The dot product says how much $\mathbf{x}$ aligns with that detector. The output is a vector of detector responses, one per row.
"What output do I get if I take this much of column 1, this much of column 2, …?" Output is reconstructed from a recipe of basis pieces.
"How much does my input look like row 1? Like row 2?" Output is a similarity score against each row.
The row picture is what feels right when you're computing attention scores ($Q K^\top$ is "row of $Q$ dotted with row of $K$") or matching an embedding against vocabulary logits (the unembedding matrix's rows are the per-token detectors). The column picture is what feels right when you're updating residual streams (each output column adds a direction to the next layer's input). You need both.
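Both pictures are easy to check numerically. A minimal sketch with arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

# Column picture: a weighted sum of the columns of A.
col_picture = sum(x[j] * A[:, j] for j in range(A.shape[1]))

# Row picture: one dot product per row of A.
row_picture = np.array([A[i, :] @ x for i in range(A.shape[0])])

assert np.allclose(col_picture, A @ x)
assert np.allclose(row_picture, A @ x)
```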
Every matrix $A \in \mathbb{R}^{m \times n}$ defines two subspaces of immediate interest.
$$\mathrm{Range}(A) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} = \mathrm{span}(\text{columns}).$$
The set of outputs the map can produce. A subspace of the output space.
$$\mathrm{Null}(A) = \{\mathbf{x} : A\mathbf{x} = \mathbf{0}\}.$$
The set of inputs that get squashed to zero. A subspace of the input space.
For any $A \in \mathbb{R}^{m\times n}$,
$$\dim \mathrm{Range}(A) + \dim \mathrm{Null}(A) = n.$$
"Range plus null equals input dimension." Information that goes in either survives (in the range) or is destroyed (in the null space) — nothing else.
A down-projection $W \in \mathbb{R}^{1024 \times 4096}$ has $n = 4096$ inputs and at most $\dim\mathrm{Range}(W) = 1024$ output dimensions, so $\dim\mathrm{Null}(W) \ge 4096 - 1024 = 3072$. Three quarters of the input directions are guaranteed to be in the null space — their information is irretrievably lost. This is why the FFN's down-projection is a real bottleneck on what can survive each block, and why residual connections matter so much (deck 11).
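To see the bound numerically, here is a sketch with a random Gaussian matrix standing in for a trained down-projection (a random matrix is full rank with probability 1, so the nullity hits the lower bound exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096))   # stand-in for a 4096 -> 1024 down-projection

rank = np.linalg.matrix_rank(W)
nullity = W.shape[1] - rank             # rank-nullity: dim Null = n - dim Range
print(rank, nullity)                    # 1024, 3072
```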
The rank of $A$ is $\dim \mathrm{Range}(A)$ — the number of linearly independent columns. It also equals the number of linearly independent rows (a non-trivial theorem: row rank = column rank).
For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) \le \min(m, n)$. When equality holds, $A$ is full rank; otherwise rank-deficient.
| Shape | Full rank means | Geometric meaning |
|---|---|---|
| Tall ($m > n$) | $\mathrm{rank} = n$ | Map is injective: distinct inputs give distinct outputs. $\mathrm{Null}(A) = \{\mathbf{0}\}$. |
| Wide ($m < n$) | $\mathrm{rank} = m$ | Map is surjective: every $\mathbf{b} \in \mathbb{R}^m$ has a preimage. $\mathrm{Range}(A) = \mathbb{R}^m$. |
| Square ($m = n$) | $\mathrm{rank} = n$ | Map is bijective: $A$ is invertible. $\det A \neq 0$. |
| Any shape | $\mathrm{rank} = r < \min(m,n)$ | Output lives on an $r$-dimensional slice of $\mathbb{R}^m$. Most directions are unreachable. |
For a real weight matrix in a trained transformer, full rank is the default, but the effective rank (the number of singular values above a small threshold) is often much smaller. This is what LoRA exploits: the update learnt during fine-tuning is well approximated by a rank-8 to rank-32 matrix, even when the full weight is rank-4096. We meet the singular value spectrum properly in deck 07.
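A sketch of the effective-rank computation, with a synthetic low-rank-plus-noise matrix standing in for a trained weight and an arbitrary 1% threshold:

```python
import torch

torch.manual_seed(0)
d = 512
W = torch.randn(d, 16) @ torch.randn(16, d) + 1e-4 * torch.randn(d, d)

S = torch.linalg.svdvals(W)                      # singular values, largest first
full_rank = torch.linalg.matrix_rank(W)          # counts everything above float noise
effective_rank = (S > 0.01 * S[0]).sum()         # counts what actually carries signal
print(full_rank.item(), effective_rank.item())   # ~512 vs ~16
```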
Any matrix $A \in \mathbb{R}^{m \times n}$ defines four subspaces, two in each space, that decompose both spaces orthogonally: the row space $\mathrm{Range}(A^\top)$ and the null space $\mathrm{Null}(A)$ in the input space $\mathbb{R}^n$, and the column space $\mathrm{Range}(A)$ and the left null space $\mathrm{Null}(A^\top)$ in the output space $\mathbb{R}^m$. In each space the pair are orthogonal complements.
The map $A$ takes the row space bijectively to the column space (an $r$-to-$r$ isomorphism), kills the null space, and never reaches the left null space. This is the structure that the SVD makes precise.
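A sketch of the decomposition on a small rank-deficient example, using `scipy.linalg.orth` and `scipy.linalg.null_space` to get orthonormal bases for each subspace:

```python
import numpy as np
from scipy.linalg import orth, null_space

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 7))   # 5x7, rank 2

col_space = orth(A)          # in R^5, dimension r
left_null = null_space(A.T)  # in R^5, dimension m - r
row_space = orth(A.T)        # in R^7, dimension r
null_sp   = null_space(A)    # in R^7, dimension n - r

assert col_space.shape[1] + left_null.shape[1] == A.shape[0]    # R^5 splits as r + (m - r)
assert row_space.shape[1] + null_sp.shape[1] == A.shape[1]      # R^7 splits as r + (n - r)
assert np.allclose(col_space.T @ left_null, 0)                  # column space ⟂ left null space
assert np.allclose(row_space.T @ null_sp, 0)                    # row space ⟂ null space
```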
Least-squares regression solves $A\mathbf{x} = \mathbf{b}$ when $\mathbf{b}$ is not in the column space, by orthogonally projecting $\mathbf{b}$ onto $\mathrm{Range}(A)$. The component of $\mathbf{b}$ that lives in the left null space is, by construction, the residual that least-squares cannot fit. We see this again in deck 05 (projections) and deck 07 (the SVD-based pseudoinverse).
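A sketch of that orthogonality with arbitrary data: the least-squares residual is orthogonal to every column of $A$, i.e. it lies in the left null space.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))   # tall, so b is generally not in Range(A)
b = rng.standard_normal(10)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
residual = b - A @ x_hat

assert np.allclose(A.T @ residual, 0, atol=1e-10)   # residual ⟂ every column of A
```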
If $A$ is the matrix of $T : \mathbb{R}^n \to \mathbb{R}^m$ and $B$ is the matrix of $S : \mathbb{R}^p \to \mathbb{R}^n$, then the matrix of the composition $T \circ S : \mathbb{R}^p \to \mathbb{R}^m$ is the matrix product $AB$.
This is the entire reason matrix multiplication is defined the way it is. The associativity, the $(i,k)$-entry formula, the non-commutativity — they all fall out of the demand that $(AB)\mathbf{x} = A(B\mathbf{x})$.
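A sketch of the claim: build the matrix of $T \circ S$ by pushing the basis vectors of $\mathbb{R}^p$ through the composition, and it coincides with $AB$. The matrices are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 3, 5
A = rng.standard_normal((m, n))   # matrix of T : R^n -> R^m
B = rng.standard_normal((n, p))   # matrix of S : R^p -> R^n

# Columns of the composition's matrix are (T∘S)(e_j) = A(B e_j).
composition = np.column_stack([A @ (B @ e) for e in np.eye(p)])
assert np.allclose(composition, A @ B)
```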
Stack 100 linear layers $W_1, W_2, \ldots, W_{100}$ with no non-linearities and you've trained a single matrix $W = W_{100} \cdots W_1$. The gradients flow differently — you can have a rich training dynamic on a deep linear network — but the function class is exactly the same as a single matmul. (See Saxe, McClelland & Ganguli, ICLR 2014, for the dynamics.) Non-linearities are the entire reason we stack layers.
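A sketch of the collapse in PyTorch, with bias-free layers and a shallow depth chosen only to keep float32 rounding small (the algebra is identical at depth 100):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, depth = 8, 5
layers = [nn.Linear(d, d, bias=False) for _ in range(depth)]
deep = nn.Sequential(*layers)

# Collapse the stack into one matrix W = W_depth ... W_1
# (remember nn.Linear computes x @ weight.T).
W = torch.eye(d)
for layer in layers:
    W = layer.weight @ W

x = torch.randn(3, d)
assert torch.allclose(deep(x), x @ W.T, atol=1e-5)
```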
For square full-rank $A$, there's a unique $A^{-1}$ such that $A A^{-1} = A^{-1} A = I$. For everything else, the right tool is the pseudoinverse $A^+$, which always exists and is unique.
Good numerical practice never forms $A^{-1}$ explicitly just to solve a system: `numpy.linalg.solve(A, b)` uses LU with partial pivoting; `scipy.linalg.lstsq` uses the SVD. Asking for $A^{-1}$ is a code smell. The frameworks underlying ML training use these factorisations under the hood for the few places they're needed (e.g. natural-gradient methods, preconditioners).
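A sketch of the practical alternatives on arbitrary example data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

x_solve = np.linalg.solve(A, b)    # LU with partial pivoting; no explicit inverse formed
x_inv = np.linalg.inv(A) @ b       # works, but slower and less accurate in general
assert np.allclose(x_solve, x_inv)

M = rng.standard_normal((100, 30))             # tall: no exact solution in general
x_ls, *_ = np.linalg.lstsq(M, b, rcond=None)   # SVD-based least squares
x_pinv = np.linalg.pinv(M) @ b                 # pseudoinverse gives the same answer
assert np.allclose(x_ls, x_pinv)
```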
The grid below shows a unit square (and grid) before and after applying the matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$. The two coloured arrows are the columns of $A$ — the images of $(1,0)$ and $(0,1)$. Determinant = signed area of the parallelogram they span. Drag the sliders.
Notice: the columns of $A$ are exactly where the basis vectors $\mathbf{e}_1$ and $\mathbf{e}_2$ go. Ratio of new area to old area is $|\det A|$. When you click "Rank 1", the unit square collapses to a line segment and $\det = 0$ — the map's range is one-dimensional and the null space is the orthogonal direction.
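The area claim is easy to verify off-screen too. A sketch with arbitrary "slider" values, using the shoelace formula for polygon area:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.5, 1.5]])

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)   # unit square corners
image = square @ A.T                                               # where the corners land

def shoelace_area(pts):
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

ratio = shoelace_area(image) / shoelace_area(square)
assert np.isclose(ratio, abs(np.linalg.det(A)))   # ratio of areas equals |det A|
```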
A single MLP layer in PyTorch:
```python
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_hidden)
        self.W2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.W2(F.gelu(self.W1(x)))
```
In linear-algebra notation this is
$$\mathbf{y} = W_2 \,\sigma(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2,$$
where $W_1 \in \mathbb{R}^{d_h \times d_{in}}$ and $W_2 \in \mathbb{R}^{d_{out} \times d_h}$. This is two linear maps with a non-linearity sandwiched between them. Without the $\sigma$, the whole layer collapses to one linear map $W_2 W_1$.
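One detail worth checking once: `nn.Linear(d_in, d_h)` stores its weight with shape `(d_h, d_in)` and computes $W\mathbf{x} + \mathbf{b}$ (as `x @ weight.T + bias`), so the code really does match the notation above. A minimal sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

d_in, d_h = 16, 64
W1 = nn.Linear(d_in, d_h)

assert W1.weight.shape == (d_h, d_in)   # W1 ∈ R^{d_h × d_in}

x = torch.randn(d_in)
assert torch.allclose(W1(x), W1.weight @ x + W1.bias, atol=1e-6)
```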
Decks 03 (matmul) and 05 (projection) make the up/down projection picture in transformer FFNs precise. By the time we reach deck 11 you'll be able to read this code as: up-project into feature space, gate, write back down.
Deck 03 — Matrix Multiplication Deep-Dive takes the composition picture and unfolds matmul into its four equivalent forms (row×col, col-comb, row-comb, outer-sum). It also begins the GEMM / arithmetic-intensity story that connects to the GPU and TPU hubs.