Gradient descent, Lagrange multipliers, convexity and the dual — the optimisation toolkit Part II needs.
The chapter studies
$$\min_{\mathbf{x}\in\mathbb{R}^d}\; f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \le 0,\; h_j(\mathbf{x}) = 0,$$
where $f, g_i, h_j$ are differentiable. Three cases of practical interest:
If $f$ is convex and the constraint set is convex, the problem is a convex programme — every local minimum is global, and powerful solvers exist (interior-point, projected gradient, ADMM…).
The iteration:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\nabla f(\mathbf{x}_t)^\top.$$
If $f$ is $L$-smooth ($\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\| \le L\|\mathbf{x}-\mathbf{y}\|$) and $\eta = 1/L$:
$$f(\mathbf{x}_{t+1}) \le f(\mathbf{x}_t) - \frac{1}{2L}\|\nabla f(\mathbf{x}_t)\|^2.$$
So $f$ strictly decreases; for convex $f$ the rate is $\mathcal{O}(1/t)$.
If $f$ is also $m$-strongly convex (the Hessian satisfies $H_f \succeq mI$), the rate becomes linear:
$$f(\mathbf{x}_t) - f^\star \le \left(1 - \frac{m}{L}\right)^t (f(\mathbf{x}_0) - f^\star).$$
The ratio $\kappa = L/m$ is the condition number; it controls how many iterations you need. The book's example: $f(\mathbf{x}) = \tfrac12\mathbf{x}^\top A\mathbf{x}$ has $L=\lambda_{\max}(A)$ and $m=\lambda_{\min}(A)$.
| Choice | Recipe | Notes |
|---|---|---|
| Fixed | $\eta_t = \eta$ | Simple. Needs $\eta \le 1/L$ for guarantee. Underused in deep learning. |
| Diminishing | $\eta_t = c/t$ or $c/\sqrt{t}$ | For stochastic gradients. The $1/\sqrt{t}$ schedule is what gives SGD's classical $\mathcal{O}(1/\sqrt{t})$ rate. |
| Backtracking line search | Halve $\eta$ until Armijo: $f(\mathbf{x}-\eta\mathbf{g}) \le f(\mathbf{x}) - \alpha\eta\|\mathbf{g}\|^2$ | No prior knowledge of $L$; widely used outside deep learning. |
| Adam / AdamW | per-coordinate $\eta$, adaptive from history | What every transformer trainer uses; not analysed in the book but implements the same idea. |
Heavy ball / momentum:
$$\mathbf{v}_{t+1} = \beta\mathbf{v}_t + \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\mathbf{v}_{t+1}.$$
Smooths out oscillations along sharp directions and accumulates progress along flat ones.
Nesterov:
$$\mathbf{y}_t = \mathbf{x}_t + \beta(\mathbf{x}_t-\mathbf{x}_{t-1}),\qquad \mathbf{x}_{t+1} = \mathbf{y}_t - \eta\nabla f(\mathbf{y}_t).$$
Look ahead before taking the gradient. Achieves the optimal rate $\mathcal{O}(1/t^2)$ for smooth convex $f$ — provably the best first-order rate.
Plain GD: $\mathcal{O}(1/t)$. Nesterov: $\mathcal{O}(1/t^2)$. Stochastic GD: $\mathcal{O}(1/\sqrt{t})$. Strongly-convex GD: linear ($e^{-t/\kappa}$).
For $\min f(\mathbf{x})$ subject to $h(\mathbf{x})=0$, form the Lagrangian
$$\mathcal{L}(\mathbf{x}, \lambda) = f(\mathbf{x}) - \lambda\, h(\mathbf{x}).$$
At a stationary point:
$$\nabla_{\mathbf{x}}\mathcal{L} = 0 \;\Longleftrightarrow\; \nabla f(\mathbf{x}^\star) = \lambda^\star\,\nabla h(\mathbf{x}^\star), \quad h(\mathbf{x}^\star) = 0.$$
Geometrically: the gradient of $f$ must be parallel to the gradient of $h$ at the optimum. Otherwise you could move along the constraint surface and decrease $f$ further.
For multiple equality constraints, sum: $\mathcal L = f - \sum_j \lambda_j h_j$ and $\nabla f = \sum_j \lambda_j \nabla h_j$ at the optimum.
For $\min f(\mathbf{x})$ s.t. $g_i(\mathbf{x})\le 0$, $h_j(\mathbf{x})=0$, the Karush–Kuhn–Tucker conditions are necessary at any local optimum (under regularity).
Complementary slackness is the key one: either constraint $i$ is active ($g_i=0$, multiplier $\alpha_i>0$) or it's inactive ($g_i<0$, multiplier $\alpha_i=0$). The SVM in deck 12 reads off "support vectors" from exactly this rule.
Minimise $f(\mathbf{x}) = (x_1-c_1)^2 + (x_2-c_2)^2$ on the constraint $h(\mathbf{x}) = x_1^2 + x_2^2 - r^2 = 0$ (the circle of radius $r$). Drag the centre of $f$; change $r$. The optimum is plotted — notice $\nabla f$ and $\nabla h$ aligned at the optimum.
The Lagrangian condition reads $2(\mathbf{x}^\star - \mathbf{c}) = \lambda^\star \cdot 2\mathbf{x}^\star$, i.e. $\mathbf{x}^\star$ is collinear with $\mathbf{c}$. The optimum is $\mathbf{x}^\star = r\,\mathbf{c}/\|\mathbf{c}\|$ (project $\mathbf{c}$ onto the circle).
$C\subseteq\mathbb{R}^d$ is convex if $\alpha\mathbf{x}+(1-\alpha)\mathbf{y}\in C$ for all $\mathbf{x},\mathbf{y}\in C$, $\alpha\in[0,1]$. Examples: half-spaces, balls, the PSD cone, the simplex.
$f:C\to\mathbb{R}$ on a convex domain $C$ is convex if
$$f(\alpha\mathbf{x}+(1-\alpha)\mathbf{y}) \le \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y}).$$
Equivalent (assume differentiability):
Every local minimum is a global minimum. Many algorithms have global convergence guarantees on convex problems. Most "classical" ML — linear/logistic regression, SVMs, lasso/ridge — is convex; modern neural networks are not.
$$\min\;\mathbf{c}^\top\mathbf{x} \quad \text{s.t.}\quad A\mathbf{x} \le \mathbf{b}.$$
Feasible region is a polyhedron; the optimum is at a vertex. Solved by simplex (combinatorial) or interior-point (polynomial-time).
$$\min\;\tfrac12\mathbf{x}^\top Q\mathbf{x} + \mathbf{c}^\top\mathbf{x} \quad \text{s.t.}\quad A\mathbf{x} \le \mathbf{b}.$$
If $Q\succeq 0$, it's convex. The SVM (deck 12) is exactly a QP with this shape; ridge regression is a QP with no constraints.
Many ML losses are quadratic or piecewise-linear; many constraints are linear. The intersection — QP — is solvable to global optimality, fast, and has well-understood software.
For the constrained problem with Lagrangian $\mathcal{L}(\mathbf{x},\boldsymbol\alpha,\boldsymbol\lambda) = f(\mathbf{x}) + \sum_i\alpha_i g_i(\mathbf{x}) + \sum_j\lambda_j h_j(\mathbf{x})$, define
$$\mathcal{D}(\boldsymbol\alpha,\boldsymbol\lambda) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x},\boldsymbol\alpha,\boldsymbol\lambda).$$
$\mathcal D$ is the dual function; it is always concave (even if $f$ isn't convex). The dual problem:
$$\max_{\boldsymbol\alpha\ge 0,\boldsymbol\lambda}\;\mathcal D(\boldsymbol\alpha,\boldsymbol\lambda).$$
The primal SVM is a QP in $d+1$ variables; the dual is a QP in $N$ variables (number of training points). For small $d$ and large $N$ the primal is faster; with kernels the dual is the natural place to insert $k(\mathbf{x}_i,\mathbf{x}_j)$ without ever computing the feature map.
| Concept | Used by | Where |
|---|---|---|
| Gradient descent | Any deep model | Training loop |
| Stationarity | Linear regression (ch. 9) | $\nabla R(\boldsymbol\theta) = 0$ gives the normal equation |
| Lagrange multipliers | PCA (ch. 10) | Maximise variance subject to unit-norm constraint |
| KKT & complementary slackness | SVM (ch. 12) | Support vectors = active inequality constraints |
| Lagrange dual | SVM (ch. 12) | Dual SVM is a QP in $N$ vars allowing kernels |
| Convexity | Most of Part II | Why each algorithm's optimum is unique & tractable |
| EM | GMM (ch. 11) | Coordinate ascent on a lower bound — another convex device |
| Idea | Statement |
|---|---|
| GD step | $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)^\top$ |
| Smoothness $L$ | $\eta \le 1/L$ ensures monotonic decrease |
| Condition number | $\kappa = L/m$ (smoothness / strong convexity) |
| Lagrangian | $\mathcal L = f + \sum\alpha_i g_i + \sum\lambda_j h_j$ |
| Stationarity | $\nabla f = \sum\alpha_i\nabla g_i + \sum\lambda_j\nabla h_j$ |
| Comp. slackness | $\alpha_i g_i = 0$ |
| Convexity (1st-order) | $f(\mathbf y)\ge f(\mathbf x) + \nabla f(\mathbf x)(\mathbf y - \mathbf x)$ |
| Strong duality | holds for convex + Slater's condition |
| QP form | $\min \tfrac12 \mathbf x^\top Q\mathbf x + \mathbf c^\top \mathbf x$ s.t. $A\mathbf x\le\mathbf b$ |