Support Vector Machines
Linear SVM
- Classification setting
- Find the maximum-margin hyperplane that can separate the data
- Best hyperplane is the one that maximizes the margin
- Margin: the distance from the hyperplane to the closest data points of either class
- Hyperplane: $H : Wx + b = 0$
- Distance of a point $x$ to the hyperplane $H$:
- $d = \frac{|Wx + b|}{||W||}$
- Margin is defined by the point closest to the hyperplane
- $\gamma(W,b) = \min_{x \in D} \frac{|Wx + b|}{||W||}$
- Margin is scale invariant
- SVM wants to maximize this margin
- For margin to be maximized, hyperplane must lie right in the middle of the two classes
- Otherwise the hyperplane could be shifted towards the class that is further away, which would increase the margin further
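- As a quick illustration, a minimal numpy sketch (the hyperplane $W$, $b$ and the toy points are made up) computing point-to-hyperplane distances, the resulting margin, and the scale-invariance property:

```python
import numpy as np

# Hypothetical hyperplane W·x + b = 0 and a few 2-D points (made up for illustration)
W = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 3.0]])

# Distance of each point to the hyperplane: |W·x + b| / ||W||
distances = np.abs(X @ W + b) / np.linalg.norm(W)

# The margin is the distance of the closest point
margin = distances.min()
print(distances, margin)

# Scale invariance: scaling (W, b) by any c != 0 leaves the distances unchanged
scaled = np.abs(X @ (3 * W) + 3 * b) / np.linalg.norm(3 * W)
assert np.allclose(distances, scaled)
```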
- Mathematics
- Binary Classification
- $y_i \in \{+1, -1\}$
- Need to find a separating hyperplane such that
- $(Wx_i + b) > 0 \quad \forall i: y_i = +1$
- $(Wx_i + b) < 0 \quad \forall i: y_i = -1$
- Equivalently, $y_i(Wx_i + b) > 0$
- SVM posits that the best hyperplane is the one that maximizes the margin
- Margin acts as buffer which can lead to better generalization
- Objective
- $\max_{W,b} \gamma(W,b) \quad \text{subject to} \quad y_i(Wx_i + b) > 0$
- $\max_{W,b} \min_{x \in D} \frac{|Wx + b|}{||W||} \quad \text{subject to} \quad y_i(Wx_i + b) > 0$
- A max-min optimization problem
- Simplification
- The margin is scale invariant, so the scale of $(W, b)$ can be fixed freely
- Add the constraint $\min_{x \in D} |Wx + b| = 1$, so the closest point lies at distance $\frac{1}{||W||}$
- Updated objective
- $\max \frac{1}{||W||} \quad \text{subject to} \quad y_i(Wx_i + b) \ge 0, \; \min_{x \in D} |Wx + b| = 1$
- $\min ||W||^2 \quad \text{subject to} \quad y_i(Wx_i + b) \ge 0, \; \min_{x \in D} |Wx + b| = 1$
- Combining the constraints
- $y_i(Wx_i + b) \ge 0, \; \min_{x \in D} |Wx + b| = 1 \implies y_i(Wx_i + b) \ge 1$
- The relaxation is tight because the objective minimizes $||W||$, so at the optimum the closest point satisfies $y_i(Wx_i + b) = 1$
- Final objective
- $\min ||W||^2 \quad \text{subject to} \quad y_i(Wx_i + b) \ge 1$
- Quadratic optimization problem
- Can be solved efficiently with standard quadratic programming solvers, unlike closed-form linear regression, which requires inverting a large matrix
- Gives a unique solution unlike perceptron
- At the optimal solution, some training points will lie exactly on the margin
- $y_i(Wx_i + b) = 1$
- These points are called support vectors
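- A small sketch of the hard-margin case, assuming scikit-learn is available: a linear `SVC` with a very large C approximates the hard-margin SVM on a made-up separable toy set, and the support vectors satisfy $y_i(Wx_i + b) \approx 1$:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

W, b = clf.coef_[0], clf.intercept_[0]
print("margin width:", 2 / np.linalg.norm(W))

# The support vectors are the points with y_i (W·x_i + b) ≈ 1
print(y[clf.support_] * (clf.support_vectors_ @ W + b))
```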
- Soft Constraints
- What if the optimization problem is infeasible?
- No solution exists
- Add relaxations i.e. allow for some misclassification
- Original: $y_i(Wx_i + b) \ge 1$
- Relaxed: $y_i(Wx_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$
- $\xi_i = \begin{cases} 1 - y_i(Wx_i + b), & \text{if } y_i(Wx_i + b) < 1 \\ 0, & \text{otherwise} \end{cases}$
- Hinge Loss $\xi_i = \max (1 - y_i(Wx_i + b), 0)$
- Objective: $\min ||W||^2 + C \sum_i \max (1 - y_i(Wx_i + b), 0)$
- C is the regularization parameter that controls the trade-off between maximizing the margin and minimizing the slack
- A high value of C leaves less tolerance for errors
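- A minimal sketch of the soft-margin objective on made-up data, showing how the hinge-loss term and C interact (the weights and points are arbitrary, not an optimum):

```python
import numpy as np

def soft_margin_objective(W, b, X, y, C):
    """||W||^2 + C * sum of slacks, with slack xi_i = max(1 - y_i(W·x_i + b), 0)."""
    margins = y * (X @ W + b)
    xi = np.maximum(1 - margins, 0)      # hinge loss per point
    return W @ W + C * xi.sum()

# Made-up data and parameters, purely to show the computation
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.5]])
y = np.array([1, -1, 1])
W, b = np.array([0.5, -0.5]), 0.1

print(soft_margin_objective(W, b, X, y, C=1.0))
print(soft_margin_objective(W, b, X, y, C=100.0))   # larger C penalizes slack more heavily
```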
- Duality
- Primal problem is hard to solve
- Convert the problem to a dual, which is easier to solve and provides a near-optimal solution to the primal
- The gap in optimality that arises in this process is the duality gap
- The dual is formed via Lagrange multipliers; when strong duality holds, the duality gap is zero
- Converting the above soft-margin SVM to its dual via Lagrange multipliers gives
- $\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to} \quad 0 \le \alpha_i \le C, \; \sum_i \alpha_i y_i = 0$
- The $\alpha_i$ are the Lagrange multipliers
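- A small numpy sketch evaluating the dual objective for made-up multipliers, just to show the Gram-matrix structure; a real solver would maximize over $\alpha$ subject to the dual constraints:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """sum(alpha) - 1/2 * sum_ij alpha_i alpha_j y_i y_j x_i·x_j."""
    G = (X @ X.T) * np.outer(y, y)       # matrix of y_i y_j x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ G @ alpha

# Made-up data and multipliers; a real solver maximizes over alpha
# subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.5]])
y = np.array([1, -1, 1])
alpha = np.array([0.2, 0.5, 0.3])
print(dual_objective(alpha, X, y))
```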
- Kernelization
- Say the points are not separable in the original, lower-dimensional space
- Transform them via kernels to project them to a higher dimension
- The points may become separable in the higher dimension
- Non-linear feature transformation
- Solve non-linear problems via Linear SVM
- Polynomial Kernel
- $K(x_i, x_j) = (x_i^T x_j + c)^d$
- The $d$ refers to the degree of the polynomial
- Example: two 1-D points $a$ and $b$ transformed via a second-order polynomial kernel
- $K(a,b) = (ab + 1)^2 = a^2 b^2 + 2ab + 1 = (a^2, \sqrt{2}a, 1) \cdot (b^2, \sqrt{2}b, 1)$
- The kernel calculates the similarity between the points in the higher-dimensional space
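- A quick numeric check of the 1-D example above: the kernel value $(ab+1)^2$ matches the dot product of the explicit second-order feature maps:

```python
import numpy as np

# K(a, b) = (ab + 1)^2 equals the dot product of the explicit
# second-order feature maps phi(x) = (x^2, sqrt(2)*x, 1)
a, b = 2.0, 3.0
phi = lambda x: np.array([x ** 2, np.sqrt(2) * x, 1.0])

print((a * b + 1) ** 2)    # kernel value computed in the original 1-D space
print(phi(a) @ phi(b))     # identical value computed in the expanded space
```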
- RBF Kernel
- $K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)$
- The larger the distance between two observations, the lower their similarity
- The radial kernel determines how much influence each observation has on classifying new data points
- Transforms points to an infinite dimension space
- The Taylor expansion of the exponential term shows that the RBF kernel corresponds to a polynomial feature map with infinitely many dimensions
- Example: two 1-D points $a$ and $b$ transformed via the RBF kernel with $\gamma = \frac{1}{2}$
- $K(a,b) = e^{-\frac{1}{2}(a-b)^2} = e^{-\frac{a^2 + b^2}{2}} \left(1, \sqrt{\tfrac{1}{1!}}\,a, \sqrt{\tfrac{1}{2!}}\,a^2, \dots\right) \cdot \left(1, \sqrt{\tfrac{1}{1!}}\,b, \sqrt{\tfrac{1}{2!}}\,b^2, \dots\right)$
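- A sketch of the same idea numerically, assuming $\gamma = \frac{1}{2}$ for the illustration: a truncated version of the infinite feature map already reproduces the RBF kernel value closely:

```python
import numpy as np
from math import exp, factorial

# RBF with gamma = 1/2: K(a, b) = exp(-(a - b)^2 / 2); the Taylor expansion of exp(ab)
# gives the feature map phi(x) = exp(-x^2 / 2) * (1, x/sqrt(1!), x^2/sqrt(2!), ...)
def phi(x, terms=20):
    return exp(-x ** 2 / 2) * np.array([x ** k / np.sqrt(factorial(k)) for k in range(terms)])

a, b = 0.8, 1.5
print(exp(-(a - b) ** 2 / 2))   # exact RBF kernel value
print(phi(a) @ phi(b))          # truncated infinite-dimensional dot product, already very close
```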
- Kernel Trick
- Transforming the original dataset via kernels explicitly and then training an SVM is expensive
- Instead, replace the dot products of input vectors in the dual with dot products of the mapped features
- $x_i^T x_j \implies \phi(x_i)^T \phi(x_j) = K(x_i, x_j)$
- Kernels are chosen so that $K(x_i, x_j)$ can be computed directly from $x_i$ and $x_j$, without ever computing $\phi$ explicitly
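- A small scikit-learn sketch of the kernel trick on made-up data: training on a precomputed RBF Gram matrix should give the same model (up to numerical tolerance) as letting `SVC` evaluate the kernel internally, so $\phi(x)$ is never formed explicitly:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Made-up toy data: inner cluster vs outer points
X = np.array([[0.0, 0.2], [0.3, -0.1], [-0.2, 0.1],
              [1.5, 1.2], [-1.4, 1.1], [1.3, -1.5]])
y = np.array([0, 0, 0, 1, 1, 1])

gamma = 0.5
# Kernel trick: the SVM only ever needs the Gram matrix K(x_i, x_j), never phi(x) itself
K = rbf_kernel(X, X, gamma=gamma)
clf_gram = SVC(kernel="precomputed").fit(K, y)

# Same model obtained by letting SVC evaluate the RBF kernel internally
clf_rbf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
print(np.allclose(clf_gram.dual_coef_, clf_rbf.dual_coef_))
```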
- SVM For Regression
- Margins should cover all data points (Hard) or most data points (Soft)
- The boundary now lies in the middle of the margins
- This boundary acts as the regression model that estimates the target values
- The objective is to minimize the distance of the points to the boundary
- Hard SVM is sensitive to outliers
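- A minimal scikit-learn `SVR` sketch on made-up 1-D data; `epsilon` sets the width of the margin around the regression function and C penalizes points falling outside it:

```python
import numpy as np
from sklearn.svm import SVR

# Made-up 1-D regression data
X = np.linspace(0, 5, 40).reshape(-1, 1)
y = np.sin(X).ravel()

# epsilon sets the width of the margin (tube) around the regression function:
# points inside it incur no loss; C penalizes points that fall outside
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(model.predict([[2.5]]))
```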
Kernel Selection
Choosing the right kernel:
- Linear kernel: $K(x_i, x_j) = x_i^T x_j$
- Efficient for high-dimensional data
- Works well when number of features exceeds number of samples
- Simplest kernel with fewest hyperparameters
- Polynomial kernel: $K(x_i, x_j) = (x_i^T x_j + c)^d$
- Good for normalized training data
- Degree d controls flexibility (higher d = more complex decision boundary)
- Can capture feature interactions
- RBF/Gaussian kernel: $K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)$
- Most commonly used non-linear kernel
- Works well for most datasets
- Gamma parameter controls influence radius (higher gamma = more complex boundary)
- Sigmoid kernel: $K(x_i, x_j) = \tanh(\alpha x_i^T x_j + c)$
- Similar to neural networks (hyperbolic tangent activation)
- Less commonly used in practice
Cross-validation should be used to select the optimal kernel and hyperparameters
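A possible sketch of this, assuming scikit-learn and its built-in iris dataset: compare candidate kernels with 5-fold cross-validation (scaling features first, since SVMs are sensitive to feature scales):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Compare candidate kernels with 5-fold cross-validation
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{kernel}: {scores.mean():.3f}")
```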
SVM Hyperparameter Tuning
- C parameter (regularization strength):
- Controls trade-off between maximizing margin and minimizing training error
- Smaller C: Wider margin, more regularization, potential underfitting
- Larger C: Narrower margin, less regularization, potential overfitting
- Gamma parameter (for RBF kernel):
- Controls influence radius of support vectors
- Smaller gamma: Larger radius, smoother decision boundary
- Larger gamma: Smaller radius, more complex decision boundary
- Practical suggestions:
- Start with RBF kernel, grid search over C and gamma
- Try logarithmic scale for both C and gamma (e.g., 0.001, 0.01, 0.1, 1, 10, 100)
- Use cross-validation to evaluate performance
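A possible sketch of this tuning loop, assuming scikit-learn and its built-in breast-cancer dataset: a logarithmic grid over C and gamma for an RBF-kernel SVM, evaluated with 5-fold cross-validation:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Logarithmic grid over C and gamma for an RBF-kernel SVM, scored with 5-fold CV
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {
    "svc__C": np.logspace(-3, 2, 6),       # 0.001, 0.01, ..., 100
    "svc__gamma": np.logspace(-3, 2, 6),
}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```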