Mathematics for Machine Learning — Deck 12

Classification with SVMs

The maximum-margin hyperplane, the dual quadratic programme, support vectors, and the kernel trick — all four ideas in one demo.

hyperplanemax-margin slackdual QP support vectorskernel
00

Topics We'll Cover

01

The Binary Classification Setup

Data $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$ with $\mathbf{x}_n\in\mathbb{R}^D$ and $y_n\in\{-1, +1\}$. We want a classifier of the form

$$\hat y(\mathbf{x}) = \mathrm{sign}\bigl(\mathbf{w}^\top\mathbf{x} + b\bigr).$$

The set $\{\mathbf{x} : \mathbf{w}^\top\mathbf{x}+b = 0\}$ is an affine hyperplane (deck 02). When the classes are linearly separable, infinitely many hyperplanes work; SVMs pick the one with the largest margin.

Why $\pm 1$ labels?

With $y_n\in\{-1,+1\}$, "correct classification" is exactly $y_n(\mathbf{w}^\top\mathbf{x}_n+b) > 0$. The product trick (instead of $\{0,1\}$) is what makes the whole SVM algebra clean.

02

Geometric Margin

The (signed) distance from $\mathbf{x}_n$ to the hyperplane $\mathbf{w}^\top\mathbf{x}+b=0$ is

$$\frac{\mathbf{w}^\top\mathbf{x}_n+b}{\|\mathbf{w}\|} \;\in\; \mathbb{R}.$$

Multiplying by $y_n$ makes correct classifications positive. The geometric margin of the training set is

$$\gamma = \min_n \;\frac{y_n(\mathbf{w}^\top\mathbf{x}_n+b)}{\|\mathbf{w}\|}.$$

By rescaling $(\mathbf{w}, b)$ we can assume the closest points satisfy $y_n(\mathbf{w}^\top\mathbf{x}_n + b) = 1$ — the canonical form. Then $\gamma = 1/\|\mathbf{w}\|$ and "maximise margin" $\iff$ "minimise $\|\mathbf{w}\|$".

03

Primal Hard-Margin SVM

$$\min_{\mathbf{w}, b}\;\tfrac12\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1,\;\; n=1,\dots,N.$$

This is a convex quadratic programme (deck 07): quadratic objective, linear inequality constraints. The objective $\tfrac12\|\mathbf{w}\|^2$ is strictly convex; the optimum is unique.

What can go wrong

The fix is the soft-margin SVM — slide 04.

04

Soft-Margin SVM & the Hinge Loss

Allow violations $\xi_n \ge 0$ and penalise them:

$$\min_{\mathbf{w},b,\boldsymbol\xi}\;\tfrac12\|\mathbf{w}\|^2 + C\sum_n \xi_n \quad \text{s.t.}\quad y_n(\mathbf{w}^\top\mathbf{x}_n+b)\ge 1-\xi_n,\;\;\xi_n\ge 0.$$

$C>0$ is a hyperparameter: large $C$ means few violations (closer to hard-margin), small $C$ means a wider margin even at the cost of misclassifying training points.

Equivalent loss form

At the optimum, $\xi_n = \max\bigl(0, 1 - y_n(\mathbf{w}^\top\mathbf{x}_n+b)\bigr)$. So the SVM solves

$$\min_{\mathbf{w},b}\;\sum_n \underbrace{\max\bigl(0, 1 - y_n(\mathbf{w}^\top\mathbf{x}_n+b)\bigr)}_{\text{hinge loss}} + \frac{1}{2C}\|\mathbf{w}\|^2.$$

This is the canonical "loss + L2 regularisation" form — same shape as ridge regression, with the hinge instead of squared loss.

05

The Lagrange Dual

Form the Lagrangian (deck 07) with multipliers $\alpha_n \ge 0$ for the inequality constraints and $\mu_n\ge 0$ for $\xi_n\ge 0$:

$$\mathcal L = \tfrac12\|\mathbf{w}\|^2 + C\sum_n \xi_n - \sum_n \alpha_n[y_n(\mathbf{w}^\top\mathbf{x}_n+b) - 1 + \xi_n] - \sum_n \mu_n \xi_n.$$

Setting $\nabla_{\mathbf{w}} = 0$, $\partial_b = 0$, $\partial_{\xi_n} = 0$:

$$\mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n, \qquad \sum_n \alpha_n y_n = 0, \qquad \alpha_n + \mu_n = C.$$

Substituting back yields the dual problem:

$$\max_{\boldsymbol\alpha}\;\sum_n \alpha_n - \tfrac12 \sum_{m,n} \alpha_m\alpha_n y_m y_n \mathbf{x}_m^\top\mathbf{x}_n \quad \text{s.t.}\;\; 0\le\alpha_n\le C,\;\sum_n \alpha_n y_n = 0.$$

Convex QP in $N$ variables; strong duality (deck 07) holds because the primal is convex and Slater's condition is satisfied.

06

Support Vectors from KKT

Complementary slackness (deck 07) gives, at the optimum:

From $\mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n$, the prediction is

$$\hat y(\mathbf{x}) = \mathrm{sign}\Bigl(\sum_n \alpha_n y_n\,\mathbf{x}_n^\top\mathbf{x} + b\Bigr).$$

Only support vectors (those with $\alpha_n > 0$) appear in the sum. The SVM's representation is sparse in the training set.

07

The Kernel Trick

In every formula above, $\mathbf{x}$ only ever appears as an inner product $\mathbf{x}_m^\top\mathbf{x}_n$. Replace it with any positive-definite kernel $k(\mathbf{x}_m, \mathbf{x}_n)$:

$$\max_{\boldsymbol\alpha}\;\sum_n \alpha_n - \tfrac12 \sum_{m,n} \alpha_m\alpha_n y_m y_n\,k(\mathbf{x}_m, \mathbf{x}_n).$$

The classifier becomes

$$\hat y(\mathbf{x}) = \mathrm{sign}\Bigl(\sum_{n\in\mathrm{SV}}\alpha_n y_n\,k(\mathbf{x}_n, \mathbf{x}) + b\Bigr).$$

Mercer's theorem says a kernel $k$ has the form $k(\mathbf{x},\mathbf{x}') = \boldsymbol\phi(\mathbf{x})^\top\boldsymbol\phi(\mathbf{x}')$ for some (possibly infinite-dimensional) feature map $\boldsymbol\phi$, as long as the kernel matrix $K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$ is positive semi-definite for all data sets. The SVM never computes $\boldsymbol\phi$; it only needs $K$.

Why this is one of the most influential ideas in ML

It lets a linear algorithm operate in an arbitrarily complicated feature space at the cost of a kernel evaluation per pair of points — no representation of the feature space is ever materialised. Gaussian processes (kernel ridge regression), kernel PCA and kernel logistic regression are the same trick applied to different objectives.

08

Three Common Kernels

NameDefinitionWhat it does
Linear$k(\mathbf x,\mathbf x') = \mathbf x^\top\mathbf x'$Identical to the un-kernelised SVM; baseline.
Polynomial$k(\mathbf x,\mathbf x') = (\mathbf x^\top\mathbf x' + c)^p$Decision boundary is a degree-$p$ polynomial. Be careful with $p>3$ — conditioning suffers.
RBF / Gaussian$k(\mathbf x,\mathbf x') = e^{-\|\mathbf x-\mathbf x'\|^2 / (2\sigma^2)}$Infinite-dimensional feature space; classifier is a smooth bump per support vector. The most popular default.

The RBF bandwidth $\sigma$

Cross-validation (deck 08) picks the sweet spot.

09

Interactive: 2D Kernel SVM

Hold the green or pink buttons to add points of each class. Choose a kernel and play with $C$ and (for RBF) $\sigma$. The decision boundary, the margin and the support vectors update live. The dual is solved by projected coordinate ascent under the box constraints.

3.16
0.60
N
0
support vectors
training err

Click outside the canvas to deactivate. Try the spiral / XOR pattern with linear vs RBF kernels: linear can't separate it, RBF can — the kernel trick at work.

10

Solving the QP — SMO and Friends

The dual QP has $N$ variables, with one equality constraint and box constraints $0\le\alpha_n\le C$. General-purpose QP solvers are $\mathcal O(N^3)$ — too slow for $N$ in the tens of thousands.

Sequential Minimal Optimisation (Platt, 1998)

  1. Pick two variables $(\alpha_i, \alpha_j)$ to update; freeze the rest.
  2. The 2-variable sub-problem has an analytical solution under the linear equality constraint and box constraints.
  3. Repeat until KKT violations vanish.

SMO is what libsvm and sklearn run by default. For very large data, coordinate descent on the primal (LibLinear) is faster — gives up some accuracy in exchange for speed.

Caching the kernel matrix

The biggest cost is $K\in\mathbb{R}^{N\times N}$. Production SVMs use working-set methods and cache only the rows for active variables. For $N$ in the millions you stop using kernel SVMs and switch to deep networks or kernel approximations (Nyström, random features).

11

Cheat Sheet

ObjectFormula
Decision function$\hat y(\mathbf x) = \mathrm{sign}(\mathbf w^\top\mathbf x + b)$
Geometric margin$\gamma = 1/\|\mathbf w\|$ (canonical form)
Hard-margin primal$\min \tfrac12\|\mathbf w\|^2$ s.t. $y_n(\mathbf w^\top\mathbf x_n+b)\ge 1$
Soft-margin primal$\min \tfrac12\|\mathbf w\|^2 + C\sum\xi_n$ s.t. margin−slack
Hinge loss$\max(0, 1 - y(\mathbf w^\top\mathbf x + b))$
Dual QP$\max\sum\alpha_n - \tfrac12\sum\alpha_m\alpha_n y_m y_n k(\mathbf x_m, \mathbf x_n)$
Constraint$0\le\alpha_n\le C$, $\sum\alpha_n y_n = 0$
Weights from dual$\mathbf w = \sum_n \alpha_n y_n \mathbf x_n$
Kernel classifier$\hat y(\mathbf x) = \mathrm{sign}(\sum_{\text{SV}} \alpha_n y_n k(\mathbf x_n, \mathbf x) + b)$
RBF kernel$\exp(-\|\mathbf x-\mathbf x'\|^2/(2\sigma^2))$
The end of the book

You've worked through every chapter. Part I built the language: vectors, geometry, matrix decompositions, calculus, probability and optimisation. Part II turned each of those tools into an algorithm: regression as a projection, PCA as eigenvectors, GMM as EM on a latent model, SVM as a kernel-aware quadratic programme. The same toolkit reappears, restructured, in every model that came after — right up to today's transformers.