MML 12 — Classification with SVMs

00

Topics We'll Cover

The binary classification setup
Geometric margin
Primal hard-margin SVM
Soft-margin SVM & the hinge loss
The Lagrange dual
Support vectors from KKT
The kernel trick
Three common kernels
Interactive: 2D kernel SVM
Solving the QP — SMO and friends
Cheat sheet

01

The Binary Classification Setup

Data $\{(\mathbf{x}_n, y_n)\}_{n=1}^N$ with $\mathbf{x}_n\in\mathbb{R}^D$ and $y_n\in\{-1, +1\}$. We want a classifier of the form

$$\hat y(\mathbf{x}) = \mathrm{sign}\bigl(\mathbf{w}^\top\mathbf{x} + b\bigr).$$

The set $\{\mathbf{x} : \mathbf{w}^\top\mathbf{x}+b = 0\}$ is an affine hyperplane (deck 02). When the classes are linearly separable, infinitely many hyperplanes work; SVMs pick the one with the largest margin.

Why $\pm 1$ labels?

With $y_n\in\{-1,+1\}$, "correct classification" is exactly $y_n(\mathbf{w}^\top\mathbf{x}_n+b) > 0$. The product trick (instead of $\{0,1\}$) is what makes the whole SVM algebra clean.

02

Geometric Margin

The (signed) distance from $\mathbf{x}_n$ to the hyperplane $\mathbf{w}^\top\mathbf{x}+b=0$ is

$$\frac{\mathbf{w}^\top\mathbf{x}_n+b}{\|\mathbf{w}\|} \;\in\; \mathbb{R}.$$

Multiplying by $y_n$ makes correct classifications positive. The geometric margin of the training set is

$$\gamma = \min_n \;\frac{y_n(\mathbf{w}^\top\mathbf{x}_n+b)}{\|\mathbf{w}\|}.$$

By rescaling $(\mathbf{w}, b)$ we can assume the closest points satisfy $y_n(\mathbf{w}^\top\mathbf{x}_n + b) = 1$ — the canonical form. Then $\gamma = 1/\|\mathbf{w}\|$ and "maximise margin" $\iff$ "minimise $\|\mathbf{w}\|$".

03

Primal Hard-Margin SVM

$$\min_{\mathbf{w}, b}\;\tfrac12\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1,\;\; n=1,\dots,N.$$

This is a convex quadratic programme (deck 07): quadratic objective, linear inequality constraints. The objective $\tfrac12\|\mathbf{w}\|^2$ is strictly convex; the optimum is unique.

What can go wrong

If the classes are not linearly separable, no $(\mathbf{w},b)$ satisfies every constraint — the problem is infeasible.
If they are barely separable, a single noisy point can push $\mathbf{w}$ to absurd values.

The fix is the soft-margin SVM — slide 04.

04

Soft-Margin SVM & the Hinge Loss

Allow violations $\xi_n \ge 0$ and penalise them:

$$\min_{\mathbf{w},b,\boldsymbol\xi}\;\tfrac12\|\mathbf{w}\|^2 + C\sum_n \xi_n \quad \text{s.t.}\quad y_n(\mathbf{w}^\top\mathbf{x}_n+b)\ge 1-\xi_n,\;\;\xi_n\ge 0.$$

$C>0$ is a hyperparameter: large $C$ means few violations (closer to hard-margin), small $C$ means a wider margin even at the cost of misclassifying training points.

Equivalent loss form

At the optimum, $\xi_n = \max\bigl(0, 1 - y_n(\mathbf{w}^\top\mathbf{x}_n+b)\bigr)$. So the SVM solves

$$\min_{\mathbf{w},b}\;\sum_n \underbrace{\max\bigl(0, 1 - y_n(\mathbf{w}^\top\mathbf{x}_n+b)\bigr)}_{\text{hinge loss}} + \frac{1}{2C}\|\mathbf{w}\|^2.$$

This is the canonical "loss + L2 regularisation" form — same shape as ridge regression, with the hinge instead of squared loss.

05

The Lagrange Dual

Form the Lagrangian (deck 07) with multipliers $\alpha_n \ge 0$ for the inequality constraints and $\mu_n\ge 0$ for $\xi_n\ge 0$:

$$\mathcal L = \tfrac12\|\mathbf{w}\|^2 + C\sum_n \xi_n - \sum_n \alpha_n[y_n(\mathbf{w}^\top\mathbf{x}_n+b) - 1 + \xi_n] - \sum_n \mu_n \xi_n.$$

Setting $\nabla_{\mathbf{w}} = 0$, $\partial_b = 0$, $\partial_{\xi_n} = 0$:

$$\mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n, \qquad \sum_n \alpha_n y_n = 0, \qquad \alpha_n + \mu_n = C.$$

Substituting back yields the dual problem:

$$\max_{\boldsymbol\alpha}\;\sum_n \alpha_n - \tfrac12 \sum_{m,n} \alpha_m\alpha_n y_m y_n \mathbf{x}_m^\top\mathbf{x}_n \quad \text{s.t.}\;\; 0\le\alpha_n\le C,\;\sum_n \alpha_n y_n = 0.$$

Convex QP in $N$ variables; strong duality (deck 07) holds because the primal is convex and Slater's condition is satisfied.

06

Support Vectors from KKT

Complementary slackness (deck 07) gives, at the optimum:

$\alpha_n = 0$: point is outside the margin, classified correctly. Doesn't influence $\mathbf{w}$.
$0 < \alpha_n < C$: point lies exactly on the margin; $y_n(\mathbf{w}^\top\mathbf{x}_n+b) = 1$. These are the support vectors on the margin; they're enough to recover $b$.
$\alpha_n = C$: point is inside the margin or misclassified; $\xi_n > 0$. These are bound support vectors.

From $\mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n$, the prediction is

$$\hat y(\mathbf{x}) = \mathrm{sign}\Bigl(\sum_n \alpha_n y_n\,\mathbf{x}_n^\top\mathbf{x} + b\Bigr).$$

Only support vectors (those with $\alpha_n > 0$) appear in the sum. The SVM's representation is sparse in the training set.

07

The Kernel Trick

In every formula above, $\mathbf{x}$ only ever appears as an inner product $\mathbf{x}_m^\top\mathbf{x}_n$. Replace it with any positive-definite kernel $k(\mathbf{x}_m, \mathbf{x}_n)$:

$$\max_{\boldsymbol\alpha}\;\sum_n \alpha_n - \tfrac12 \sum_{m,n} \alpha_m\alpha_n y_m y_n\,k(\mathbf{x}_m, \mathbf{x}_n).$$

The classifier becomes

$$\hat y(\mathbf{x}) = \mathrm{sign}\Bigl(\sum_{n\in\mathrm{SV}}\alpha_n y_n\,k(\mathbf{x}_n, \mathbf{x}) + b\Bigr).$$

Mercer's theorem says a kernel $k$ has the form $k(\mathbf{x},\mathbf{x}') = \boldsymbol\phi(\mathbf{x})^\top\boldsymbol\phi(\mathbf{x}')$ for some (possibly infinite-dimensional) feature map $\boldsymbol\phi$, as long as the kernel matrix $K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$ is positive semi-definite for all data sets. The SVM never computes $\boldsymbol\phi$; it only needs $K$.

Why this is one of the most influential ideas in ML

It lets a linear algorithm operate in an arbitrarily complicated feature space at the cost of a kernel evaluation per pair of points — no representation of the feature space is ever materialised. Gaussian processes (kernel ridge regression), kernel PCA and kernel logistic regression are the same trick applied to different objectives.

08

Three Common Kernels

Name	Definition	What it does
Linear	$k(\mathbf x,\mathbf x') = \mathbf x^\top\mathbf x'$	Identical to the un-kernelised SVM; baseline.
Polynomial	$k(\mathbf x,\mathbf x') = (\mathbf x^\top\mathbf x' + c)^p$	Decision boundary is a degree-$p$ polynomial. Be careful with $p>3$ — conditioning suffers.
RBF / Gaussian	$k(\mathbf x,\mathbf x') = e^{-\\|\mathbf x-\mathbf x'\\|^2 / (2\sigma^2)}$	Infinite-dimensional feature space; classifier is a smooth bump per support vector. The most popular default.

The RBF bandwidth $\sigma$

Tiny $\sigma$: each support vector influences only its immediate neighbourhood — the boundary fragments into bumps; overfits.
Huge $\sigma$: every pair of points has $k \approx 1$; the kernel matrix is rank-1 and the classifier is almost linear.

Cross-validation (deck 08) picks the sweet spot.

09

Interactive: 2D Kernel SVM

Hold the green or pink buttons to add points of each class. Choose a kernel and play with $C$ and (for RBF) $\sigma$. The decision boundary, the margin and the support vectors update live. The dual is solved by projected coordinate ascent under the box constraints.

kernel

C (regularisation)

3.16

RBF σ

0.60

N

0

support vectors

—

training err

—

Click outside the canvas to deactivate. Try the spiral / XOR pattern with linear vs RBF kernels: linear can't separate it, RBF can — the kernel trick at work.

10

Solving the QP — SMO and Friends

The dual QP has $N$ variables, with one equality constraint and box constraints $0\le\alpha_n\le C$. General-purpose QP solvers are $\mathcal O(N^3)$ — too slow for $N$ in the tens of thousands.

Sequential Minimal Optimisation (Platt, 1998)

Pick two variables $(\alpha_i, \alpha_j)$ to update; freeze the rest.
The 2-variable sub-problem has an analytical solution under the linear equality constraint and box constraints.
Repeat until KKT violations vanish.

SMO is what libsvm and sklearn run by default. For very large data, coordinate descent on the primal (LibLinear) is faster — gives up some accuracy in exchange for speed.

Caching the kernel matrix

The biggest cost is $K\in\mathbb{R}^{N\times N}$. Production SVMs use working-set methods and cache only the rows for active variables. For $N$ in the millions you stop using kernel SVMs and switch to deep networks or kernel approximations (Nyström, random features).

11

Cheat Sheet

Object	Formula
Decision function	$\hat y(\mathbf x) = \mathrm{sign}(\mathbf w^\top\mathbf x + b)$
Geometric margin	$\gamma = 1/\\|\mathbf w\\|$ (canonical form)
Hard-margin primal	$\min \tfrac12\\|\mathbf w\\|^2$ s.t. $y_n(\mathbf w^\top\mathbf x_n+b)\ge 1$
Soft-margin primal	$\min \tfrac12\\|\mathbf w\\|^2 + C\sum\xi_n$ s.t. margin−slack
Hinge loss	$\max(0, 1 - y(\mathbf w^\top\mathbf x + b))$
Dual QP	$\max\sum\alpha_n - \tfrac12\sum\alpha_m\alpha_n y_m y_n k(\mathbf x_m, \mathbf x_n)$
Constraint	$0\le\alpha_n\le C$, $\sum\alpha_n y_n = 0$
Weights from dual	$\mathbf w = \sum_n \alpha_n y_n \mathbf x_n$
Kernel classifier	$\hat y(\mathbf x) = \mathrm{sign}(\sum_{\text{SV}} \alpha_n y_n k(\mathbf x_n, \mathbf x) + b)$
RBF kernel	$\exp(-\\|\mathbf x-\mathbf x'\\|^2/(2\sigma^2))$

The end of the book

You've worked through every chapter. Part I built the language: vectors, geometry, matrix decompositions, calculus, probability and optimisation. Part II turned each of those tools into an algorithm: regression as a projection, PCA as eigenvectors, GMM as EM on a latent model, SVM as a kernel-aware quadratic programme. The same toolkit reappears, restructured, in every model that came after — right up to today's transformers.