Mathematics for Machine Learning — Deck 05

Vector Calculus

Derivatives, Jacobians, the chain rule on a DAG, and the multivariate Taylor series — the language every training loop reads.

gradientJacobian chain rulebackprop autodiffTaylor
$f'(x)$ $\nabla f$ Jacobian $J$ chain rule backprop
00

Topics We'll Cover

01

Univariate Differentiation, Re-read

The derivative of $f:\mathbb{R}\to\mathbb{R}$ at $x_0$ is

$$f'(x_0) = \lim_{h\to 0}\frac{f(x_0+h)-f(x_0)}{h}.$$

The more useful re-reading: $f'(x_0)$ is the unique number such that

$$f(x_0 + h) = f(x_0) + f'(x_0)\,h + o(h).$$

So the derivative is the best linear approximation of $f$ near $x_0$. Every later definition in this deck generalises that sentence, not the limit.

The rules, re-derived

RuleStatement
linearity$(\alpha f + \beta g)' = \alpha f' + \beta g'$
product$(fg)' = f'g + fg'$
quotient$(f/g)' = (f'g - fg')/g^2$
chain$(g\circ f)'(x) = g'(f(x))\cdot f'(x)$

The chain rule is the centre of this deck. Everything else generalises by replacing real numbers with vectors and ordinary multiplication with matrix multiplication.

02

Partial Derivatives and the Gradient

For $f:\mathbb{R}^n\to\mathbb{R}$, the partial derivative with respect to $x_i$ is the ordinary derivative obtained by holding all other variables fixed:

$$\frac{\partial f}{\partial x_i}(\mathbf{x}) = \lim_{h\to 0}\frac{f(\mathbf{x} + h\mathbf{e}_i) - f(\mathbf{x})}{h}.$$

The gradient stacks them into a row vector (book convention):

$$\nabla_{\mathbf{x}} f(\mathbf{x}) \;=\; \frac{\partial f}{\partial \mathbf{x}} \;=\; \left[\,\frac{\partial f}{\partial x_1},\; \dots,\; \frac{\partial f}{\partial x_n}\,\right].$$

Linear approximation

$$f(\mathbf{x}+\mathbf{h}) = f(\mathbf{x}) + \nabla f(\mathbf{x})\,\mathbf{h} + o(\|\mathbf{h}\|).$$

The gradient is the row vector that, multiplied by any direction $\mathbf{h}$, gives the rate of change of $f$ along $\mathbf{h}$.

Direction of steepest ascent

Among unit-length $\mathbf{h}$, the one maximising $\nabla f \,\mathbf{h}$ is $\mathbf{h} = \nabla f^\top / \|\nabla f\|$. So the gradient (as a column) points uphill, with magnitude equal to the steepness. Gradient descent walks in the opposite direction.

03

The Jacobian

For $\mathbf{f}:\mathbb{R}^n\to\mathbb{R}^m$ with components $f_i$, the Jacobian is the $m\times n$ matrix of partial derivatives:

$$J_{\mathbf{f}}(\mathbf{x}) \;=\; \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \;=\; \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}.$$

Row $i$ is $\nabla f_i$. So the gradient of a scalar function is just the Jacobian when $m=1$.

The linear approximation generalises

$$\mathbf{f}(\mathbf{x}+\mathbf{h}) = \mathbf{f}(\mathbf{x}) + J_{\mathbf{f}}(\mathbf{x})\,\mathbf{h} + o(\|\mathbf{h}\|).$$

Three useful Jacobians

$\mathbf{f}(\mathbf{x})$$\partial \mathbf{f}/\partial \mathbf{x}$
$A\mathbf{x}$$A$
$\mathbf{x}\odot\mathbf{x}$ (element-wise square)$\mathrm{diag}(2\mathbf{x})$
$\sigma(\mathbf{x})$ (sigmoid, element-wise)$\mathrm{diag}(\sigma(\mathbf{x})\odot(1-\sigma(\mathbf{x})))$
04

Gradients of Matrix Expressions

ML constantly differentiates scalars with respect to matrices. The shape convention is:

$$\frac{\partial f}{\partial A} \in \mathbb{R}^{m\times n} \quad \text{has entries } \left[\frac{\partial f}{\partial A}\right]_{ij} = \frac{\partial f}{\partial A_{ij}}.$$

Six identities you will use often

$f$$\partial f/\partial \mathbf{x}$
$\mathbf{a}^\top\mathbf{x}$$\mathbf{a}^\top$
$\mathbf{x}^\top A\mathbf{x}$$\mathbf{x}^\top(A + A^\top)$ — $2A\mathbf{x}^\top$ when $A$ symmetric
$\|A\mathbf{x}-\mathbf{b}\|^2$$2(A\mathbf{x}-\mathbf{b})^\top A$
$\log\det A$ ($A\succ 0$)$(A^{-1})^\top$, w.r.t. $A$
$\mathrm{tr}(AB)$$B^\top$, w.r.t. $A$
$\sigma(\mathbf{x})$ (softmax)$\mathrm{diag}(\boldsymbol\sigma) - \boldsymbol\sigma\boldsymbol\sigma^\top$, where $\boldsymbol\sigma = \sigma(\mathbf{x})$
A trick the book uses

To compute $\partial f/\partial A$ where $f$ is a scalar, write $df = \mathrm{tr}(M^\top dA)$ for some matrix $M$. Then $\partial f/\partial A = M$. The trace identity $\mathrm{tr}(P) = \mathrm{tr}(P^\top)$ and the cyclic property let you re-shuffle terms until they fit this pattern.

05

The Chain Rule as Matrix Product

If $\mathbf{g}:\mathbb{R}^n\to\mathbb{R}^m$ and $\mathbf{f}:\mathbb{R}^m\to\mathbb{R}^k$, then the composition $\mathbf{h}=\mathbf{f}\circ\mathbf{g}:\mathbb{R}^n\to\mathbb{R}^k$ has Jacobian

$$J_{\mathbf{h}}(\mathbf{x}) = J_{\mathbf{f}}(\mathbf{g}(\mathbf{x}))\, J_{\mathbf{g}}(\mathbf{x}).$$

Multivariate chain rule = matrix product. The single fact every line of backprop applies.

Chains of three or more

For $\mathbf{h} = \mathbf{f}_L\circ\mathbf{f}_{L-1}\circ\dots\circ\mathbf{f}_1$,

$$J_{\mathbf{h}} = J_{\mathbf{f}_L}\cdot J_{\mathbf{f}_{L-1}} \cdots J_{\mathbf{f}_1}.$$

An $L$-layer neural network's Jacobian is a product of $L$ Jacobians — one per layer. The order matters; this is exactly composition of linear maps from deck 02.

Why scalar loss makes life easier

Training loss is a scalar, so $J_{\mathbf{f}_L}$ is a row vector ($1\times m$). Computing $J_{\mathbf{h}}$ right-to-left lets every intermediate stay a row vector — the cost of one matrix-vector product per layer. That is reverse-mode autodiff. Computing left-to-right would require materialising full Jacobians.

06

Backpropagation on a DAG

Real neural networks aren't strict compositions — they're directed acyclic graphs with shared subexpressions (residual connections, attention's many heads, etc.). The chain rule generalises seamlessly: for any node $z$,

$$\frac{\partial \mathcal{L}}{\partial z} = \sum_{w \text{ child of } z} \frac{\partial \mathcal{L}}{\partial w}\,\frac{\partial w}{\partial z}.$$

Each summand says "the loss feels $z$ partly via this child $w$". The algorithm walks the DAG in reverse topological order and accumulates these contributions; every intermediate gets visited once.

Forward pass

Visit nodes in topological order. Compute each node's value and save the local Jacobian (or enough of it — e.g. just $\sigma(x)$ to later compute $\sigma'(x)$).

Backward pass

Visit in reverse. At each node, multiply incoming "upstream gradient" by the saved local Jacobian and route it to parents. Each edge is traversed once.

Memory grows with the number of forward activations; time grows linearly with the size of the DAG. Gradient checkpointing trades time for memory by re-computing activations on demand.

07

Forward vs Reverse Mode Autodiff

Forward (JVP)Reverse (VJP)
computes$J_{\mathbf{f}}\,\mathbf{v}$ — "one column of $J$"$\mathbf{u}^\top J_{\mathbf{f}}$ — "one row of $J$"
cheap for$n$ small (few inputs)$m$ small (few outputs — the loss is 1)
memorynone beyond live activationsall forward activations stored
used byJAX jvp, sensitivities, second-orderPyTorch backward, every NN training step

To compute the gradient of a scalar loss in a deep net, one VJP costs ~the same as the forward pass; one JVP per input variable would cost $n$ forward passes. For modern LLMs $n$ is in the hundreds of billions and the loss is one number — reverse mode wins decisively.

Implementation hint

Each "primitive" in PyTorch / JAX / TensorFlow comes with a pair of rules: a forward function $\mathbf{y}=\mathbf{f}(\mathbf{x})$ and a VJP rule $(\mathbf{u}, \mathbf{x}, \mathbf{y}) \mapsto \mathbf{u}^\top J_{\mathbf{f}}(\mathbf{x})$. Composing primitives is what gives you a differentiable program.

08

Hessian and Multivariate Taylor

The Hessian $H_f(\mathbf{x})$ of a scalar $f:\mathbb{R}^n\to\mathbb{R}$ is the $n\times n$ matrix of second partials:

$$[H_f]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}.$$

If $f$ is $C^2$, $H_f$ is symmetric (Schwarz's theorem).

Multivariate Taylor at order 2

$$f(\mathbf{x}+\mathbf{h}) = f(\mathbf{x}) + \nabla f(\mathbf{x})\,\mathbf{h} + \tfrac12\,\mathbf{h}^\top H_f(\mathbf{x})\,\mathbf{h} + o(\|\mathbf{h}\|^2).$$

The first two terms give the linear (affine) model used by gradient descent; the third gives Newton's method.

Classification of stationary points

At a stationary point ($\nabla f = 0$): the Hessian's signature tells you the type.

In high dimensions, almost every stationary point of a non-convex loss is a saddle — this is one reason SGD works at all.

09

Interactive: Gradient Descent on a 2D Surface

Pick a loss surface and a starting point. Step size and momentum are adjustable. The path is drawn over a contour plot; the right gauge shows the loss vs iteration. Click the canvas to set the starting point.

0.050
0.00
iteration
0
loss
|∇L|

Try the ravine surface with a big learning rate — the path bounces between the walls and barely makes progress along the floor. Bump the momentum up and it should stop oscillating. That's why Adam exists: it estimates per-coordinate curvature on the fly.

10

Where This Lands in Part II

ConceptUsed byWhere
Gradient of mean-squared errorLinear regression (ch. 9)Setting $\nabla R = 0$ gives the normal equation
$\partial \mathbf{x}^\top \Sigma^{-1}\mathbf{x}/\partial\mathbf{x}$Gaussian (ch. 6), GMM (ch. 11)Maximising likelihoods
$\partial \log\det A/\partial A$GMM (ch. 11)EM update for the covariance
BackpropagationAny deep modelComputes $\nabla_\theta \mathcal{L}$
Hessian, NewtonSVM (ch. 12), GLMQuadratic-programming solver, IRLS
Taylor expansionOptimisation (ch. 7)Gradient descent = first-order; Newton = second-order
11

Cheat Sheet

QuantityFormula
gradient$\nabla f = [\partial f/\partial x_1, \dots, \partial f/\partial x_n]$ (row)
Jacobian$[J_{\mathbf{f}}]_{ij} = \partial f_i/\partial x_j$, shape $m\times n$
chain rule$J_{\mathbf{f}\circ\mathbf{g}} = J_\mathbf{f}\,J_\mathbf{g}$
$\nabla(\mathbf{a}^\top\mathbf{x})$$\mathbf{a}^\top$
$\nabla(\mathbf{x}^\top A\mathbf{x})$$\mathbf{x}^\top(A+A^\top)$
$\nabla\|A\mathbf{x}-\mathbf{b}\|^2$$2(A\mathbf{x}-\mathbf{b})^\top A$
$\nabla \log\det A$$(A^{-1})^\top$
Hessian$[H_f]_{ij} = \partial^2 f/(\partial x_i\partial x_j)$
2nd-order Taylor$f(\mathbf{x}+\mathbf{h}) \approx f + \nabla f \cdot \mathbf{h} + \tfrac12 \mathbf{h}^\top H_f \mathbf{h}$
steepest descent$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)^\top$