Kernel Methods

Kernel methods are a powerful family of techniques that enable us to work in high-dimensional (even infinite-dimensional!) feature spaces without explicitly computing the transformations. This chapter covers density estimation, classification using kernels, and radial basis functions.

The Big Picture

Many real-world patterns are non-linear. Kernel methods handle this by:

  1. Implicitly mapping data to a higher-dimensional space where patterns become linear
  2. Using local information — nearby points matter more than distant ones
  3. Avoiding the cost of computing explicitly in that high-dimensional space through the "kernel trick"

Kernel Density Estimation

Before classifying data, we often need to estimate the underlying probability distribution. Kernel density estimation (KDE) is a non-parametric way to do this.

The Problem

Given a random sample $x_1, x_2, \ldots, x_N$ drawn from an unknown density $f_X(x)$, how do we estimate what $f_X$ looks like?

A First Attempt: Histograms

The simplest approach is a histogram: divide the space into bins and count how many points fall in each bin. But histograms are bumpy and depend heavily on bin placement.

Parzen (Kernel) Density Estimation

A smoother approach: for any point $x_0$, estimate the density by looking at how many training points are nearby:

$$\hat{f}_X(x_0) = \frac{\#\{x_i \in \text{neighborhood of } x_0\}}{N\lambda}$$

Where λ is the bandwidth — the width of the neighborhood.

Problem: This is still bumpy at the neighborhood boundaries.
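
Before smoothing things out, here is a minimal NumPy sketch of this counting estimator, assuming the neighborhood is a window of width λ centered at $x_0$ (the function name is illustrative):

```python
import numpy as np

def parzen_window_estimate(x0, samples, lam):
    """Hard-window estimate: count samples within a window of width lam
    centered at x0, then divide by N * lam."""
    in_window = np.abs(samples - x0) <= lam / 2.0
    return in_window.sum() / (len(samples) * lam)

# Quick check against a known density: the standard normal at 0 is about 0.40
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
print(parzen_window_estimate(0.0, samples, lam=0.5))
```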

Gaussian Kernels for Smooth Estimates

Instead of hard boundaries, use a smooth kernel that gives more weight to closer points:

$$\hat{f}_X(x_0) = \frac{1}{N} \sum_{i=1}^N K_\lambda(x_i, x_0)$$

Where the Gaussian kernel is:

$$K_\lambda(x_i, x_0) = \frac{1}{\lambda\sqrt{2\pi}}\exp\left(-\frac{|x_i - x_0|^2}{2\lambda^2}\right)$$

Intuition: We're placing a small "bump" (Gaussian) at each data point, then adding them all up. The result is a smooth density estimate.
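
This sum of bumps is easy to write directly. A minimal NumPy sketch (the function name is illustrative, not from any library):

```python
import numpy as np

def gaussian_kde(x0, samples, lam):
    """Average of Gaussian bumps of width lam, one centered at each sample."""
    z = (x0 - samples) / lam
    bumps = np.exp(-0.5 * z**2) / (lam * np.sqrt(2 * np.pi))
    return bumps.mean()

# Bimodal data: the estimate recovers two smooth peaks near -2 and 3
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])
grid = np.linspace(-5, 5, 201)
density = np.array([gaussian_kde(x, samples, lam=0.3) for x in grid])
```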

The Effect of Bandwidth

  • Small λ: Peaks around each data point, very bumpy (overfitting)
  • Large λ: Very smooth, may miss important features (underfitting)
  • Just right λ: Captures the true density shape

Choosing bandwidth is similar to choosing model complexity — cross-validation helps!
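
One common recipe is to pick the bandwidth that maximizes the held-out log-likelihood. A sketch using scikit-learn (assuming it is available; the grid of candidate bandwidths is arbitrary):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 1))  # KernelDensity expects a 2-D array

# KernelDensity.score returns the total log-likelihood, so GridSearchCV
# selects the bandwidth with the best cross-validated log-likelihood.
search = GridSearchCV(KernelDensity(kernel="gaussian"),
                      {"bandwidth": np.logspace(-1, 0.5, 20)}, cv=5)
search.fit(X)
print(search.best_params_["bandwidth"])
```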

Mathematical View: Convolution

The kernel density estimate can be viewed as a convolution of the empirical distribution (spikes at each data point) with a Gaussian kernel:

$$\hat{f}_X(x_0) = \frac{1}{N}\sum_{i=1}^N \phi_\lambda(x_0 - x_i)$$

This "smears out" the point masses into a smooth function.


Kernel Density Classification

Now we can use density estimation for classification!

Bayes' Theorem for Classification

For class j:

$$P(G=j | X=x_0) \propto \hat{\pi}_j \times \hat{f}_j(x_0)$$

Where:

  • $\hat{\pi}_j$ = estimated prior probability (proportion of class j in training data)
  • $\hat{f}_j(x_0)$ = kernel density estimate for class j, evaluated at $x_0$

Algorithm (sketched in code below):

  1. Estimate the density separately for each class
  2. Multiply by the class prior
  3. Classify to the class with highest product
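
A minimal sketch of this procedure, built on scikit-learn's KernelDensity (the helper names and the shared bandwidth are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_kde_classifier(X, y, bandwidth=0.5):
    """Fit one kernel density estimate per class and record class priors."""
    classes = np.unique(y)
    kdes = {c: KernelDensity(bandwidth=bandwidth).fit(X[y == c]) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    return classes, kdes, priors

def predict_kde_classifier(X_new, classes, kdes, priors):
    """Classify to the class with the largest log prior + log density."""
    scores = np.column_stack([
        np.log(priors[c]) + kdes[c].score_samples(X_new) for c in classes
    ])
    return classes[np.argmax(scores, axis=1)]

# Two well-separated 2-D classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
model = fit_kde_classifier(X, y)
print(predict_kde_classifier(np.array([[0.0, 0.0], [3.0, 3.0]]), *model))  # [0 1]
```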

A Subtlety: Where Density Matters

Learning separate class densities everywhere can be misleading. Consider:

  • In dense regions (many training points), estimates are reliable
  • In sparse regions (few training points), estimates are noisy

Key insight: The density estimates only matter near the decision boundary. Far from the boundary, one class dominates anyway.


Naive Bayes Classifier

When Dimensions Are High

In high dimensions, kernel density estimation struggles — you need exponentially more data to fill the space (curse of dimensionality).

Naive Bayes makes a strong simplifying assumption: given the class, features are conditionally independent.

$$f_j(x) = \prod_{p=1}^P f_{jp}(x_p)$$

Breaking Down the Independence Assumption

Instead of estimating one P-dimensional density (hard), we estimate P one-dimensional densities (easy!).

For continuous features: Use univariate Gaussian or kernel density estimates for each feature.

For categorical features: Simply count proportions.

The Log-Odds Decomposition

$$\log\frac{P(G=k|X)}{P(G=l|X)} = \log\frac{\pi_k}{\pi_l} + \sum_{p=1}^P \log\frac{f_{kp}(x_p)}{f_{lp}(x_p)}$$

Each feature contributes additively to the log-odds — simple and interpretable!
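
To make the additive structure concrete, here is a sketch of the per-feature contributions for two classes, fitting one univariate Gaussian per feature per class (all names are illustrative; SciPy is assumed to be available):

```python
import numpy as np
from scipy.stats import norm

def naive_bayes_log_odds(x, X, y, k=1, l=0):
    """Log-odds of class k vs. class l at point x, decomposed into a prior
    term plus one additive contribution per feature."""
    contribs = []
    for p in range(X.shape[1]):
        f_kp = norm(X[y == k, p].mean(), X[y == k, p].std())
        f_lp = norm(X[y == l, p].mean(), X[y == l, p].std())
        contribs.append(f_kp.logpdf(x[p]) - f_lp.logpdf(x[p]))
    prior_term = np.log(np.mean(y == k) / np.mean(y == l))
    return prior_term + np.sum(contribs), np.array(contribs)
```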

Why "Naive" Works

The independence assumption is almost always wrong. Yet Naive Bayes often performs well because:

  1. We only need rankings: For classification, we just need to rank classes correctly, not get exact probabilities
  2. Errors may cancel: Overestimating some terms and underestimating others can balance out
  3. Robustness: Fewer parameters means less overfitting

Best applications: Text classification (spam filtering, sentiment analysis), where features (words) are high-dimensional.


Radial Basis Functions (RBF)

Radial basis functions offer another approach: explicitly construct basis functions centered at various points, then fit a linear model in this new feature space.

The Idea of Basis Functions

Instead of modeling $f(x) = \beta^T x$ (linear in original features), use:

$$f(x) = \sum_{j=1}^M \beta_j h_j(x)$$

Where $h_j(x)$ are basis functions — transformations of the original features.

Why Higher Dimensions Help

Data that isn't linearly separable in the original space may become linearly separable in a higher-dimensional feature space.

Classic example: XOR problem. Points at (0,0) and (1,1) are class 1; points at (0,1) and (1,0) are class 2. No line separates them in 2D, but adding the feature $x_1 \cdot x_2$ makes separation easy!
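
A quick numerical check of this claim (a sketch; the separating weights below were picked by hand rather than learned):

```python
import numpy as np

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])  # XOR inputs
y = np.array([1, 1, -1, -1])                    # class 1 vs. class 2

# Augment with the product feature x1 * x2
Z = np.column_stack([X, X[:, 0] * X[:, 1]])

# A hand-picked hyperplane w.z + b separates the classes in 3-D
w, b = np.array([-2.0, -2.0, 4.0]), 1.0
print(np.sign(Z @ w + b))  # [ 1.  1. -1. -1.]
```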

Gaussian RBFs

An RBF expansion uses Gaussian kernels as basis functions:

$$f(x) = \sum_{j=1}^M \beta_j K_{\lambda_j}(\xi_j, x)$$

Where:

  • $\xi_j$ = center of the j-th basis function (a point in feature space)
  • $\lambda_j$ = width of the j-th kernel
  • $\beta_j$ = coefficient (learned by regression)

More explicitly:

$$f(x) = \sum_{j=1}^M \beta_j \exp\left(-\frac{\|x - \xi_j\|^2}{2\lambda_j^2}\right)$$

Connection to Infinite Dimensions

The Gaussian kernel has a remarkable property: its Taylor series expansion involves polynomials of all degrees!

$$\exp(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ...$$

So using Gaussian RBFs is like using polynomial features of infinite degree — but without the computational explosion.
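
One way to see this in a single dimension (ignoring the normalizing constant): factor the kernel and expand the cross term with the series above,

$$\exp\left(-\frac{(x - x')^2}{2\lambda^2}\right) = \exp\left(-\frac{x^2}{2\lambda^2}\right)\exp\left(-\frac{x'^2}{2\lambda^2}\right)\sum_{k=0}^{\infty} \frac{(x x')^k}{\lambda^{2k}\, k!}$$

so the kernel is an inner product of feature vectors whose components are rescaled monomials $x^k$ of every degree $k$.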

Fitting RBF Models

Full optimization: Learn $\beta$, $\xi$, and $\lambda$ by minimizing squared error.

  • Non-linear in $\xi$ and $\lambda$
  • Requires gradient descent or other iterative methods
  • Risk of local minima

Simpler approach: Fix $\xi$ and $\lambda$, only learn $\beta$ (see the sketch after this list).

  • Choose centers using unsupervised methods (e.g., k-means on training data)
  • Use a constant bandwidth based on average distances
  • Then it's just linear regression!
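
A minimal sketch of that recipe, using scikit-learn's KMeans for the centers, a constant bandwidth set from an average-distance heuristic, and ordinary least squares for $\beta$ (the heuristic and helper names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_features(X, centers, lam):
    """Design matrix of Gaussian bumps exp(-||x - xi_j||^2 / (2 lam^2))."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * lam**2))

def fit_rbf(X, y, M=10):
    """Fix centers by k-means, set a constant bandwidth from the average
    inter-center distance, then solve for beta by least squares."""
    centers = KMeans(n_clusters=M, n_init=10, random_state=0).fit(X).cluster_centers_
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    lam = d[d > 0].mean()
    beta, *_ = np.linalg.lstsq(rbf_features(X, centers, lam), y, rcond=None)
    return centers, lam, beta

def predict_rbf(X, centers, lam, beta):
    return rbf_features(X, centers, lam) @ beta
```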

Potential Pitfalls

If the basis function centers don't cover the input space well, there can be "holes" — regions where no kernel has significant weight, leading to poor predictions.


Gaussian Mixture Models

Mixture models extend RBFs to probabilistic density estimation.

The Model

$$f(x) = \sum_{j=1}^M \alpha_j \phi(x; \mu_j, \Sigma_j)$$

Where:

  • $\phi$ = Gaussian density function
  • $\alpha_j$ = mixing proportion (how much weight to give component j)
  • $\sum_j \alpha_j = 1$
  • $\mu_j$, $\Sigma_j$ = mean and covariance of component j

Interpretation

The data is generated by:

  1. Randomly selecting a component j with probability $\alpha_j$
  2. Drawing a point from the Gaussian $N(\mu_j, \Sigma_j)$

Each component represents a "cluster" or subpopulation in the data.
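
The generative story translates directly into a sampler. A sketch with made-up parameters for a two-component mixture in two dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture: two components with weights 0.3 and 0.7
alphas = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.6], [0.6, 1.0]])]

def sample_mixture(n):
    # Step 1: pick a component j with probability alpha_j
    js = rng.choice(len(alphas), size=n, p=alphas)
    # Step 2: draw from the chosen Gaussian N(mu_j, Sigma_j)
    return np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in js])

X = sample_mixture(1000)  # points cluster around the two means
```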

Connection to RBF

If we restrict covariances to be spherical ($\Sigma_j = \sigma^2 I$), mixture models reduce to radial basis expansions!

Fitting with Maximum Likelihood

Parameters are fit by maximizing the likelihood of the data. This is typically done with the EM algorithm (covered in the Model Selection chapter).

Classification with Mixtures

For classification:

  1. Fit a separate mixture model for each class
  2. Apply Bayes' theorem: $P(G=j|x) \propto \hat{\pi}_j \times \hat{f}_j(x)$

This is more flexible than single-Gaussian LDA — each class can have multiple modes.
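
A sketch of this per-class recipe using scikit-learn's GaussianMixture (the number of components per class is fixed here rather than selected):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture_classifier(X, y, n_components=2):
    """Fit one Gaussian mixture per class and record the class priors."""
    classes = np.unique(y)
    gmms = {c: GaussianMixture(n_components=n_components, random_state=0).fit(X[y == c])
            for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    return classes, gmms, priors

def predict_mixture(X_new, classes, gmms, priors):
    """Classify to the class with the largest log prior + mixture log-density."""
    scores = np.column_stack([
        np.log(priors[c]) + gmms[c].score_samples(X_new) for c in classes
    ])
    return classes[np.argmax(scores, axis=1)]
```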


Summary: When to Use What

| Method | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Kernel Density Estimation | Low dimensions, flexible shape | Non-parametric, intuitive | Curse of dimensionality |
| Naive Bayes | High dimensions, text/categorical | Fast, scales well | Wrong independence assumption |
| RBF Networks | Smooth non-linear functions | Flexible, local | Need to choose centers/widths |
| Mixture Models | Multi-modal distributions | Probabilistic, interpretable | Need to choose number of components |

Key Takeaways

  1. Kernels enable locality: Nearby points matter more than distant ones
  2. Bandwidth/width controls bias-variance: Too small = overfit, too large = underfit
  3. High dimensions are hard: Naive Bayes and the kernel trick help
  4. Generative vs. Discriminative: Kernel density classification is generative (models P(X|G)); logistic regression is discriminative (models P(G|X))