Exemplar Methods
Non-parametric Models
- Keep the training data around
- Effective number of model parameters grows with |D|
Instance-based Learning
- Models keep training examples around at test time
- Define similarity between training points and test input
- Assign the label based on the similarity
KNN
- Classify the new input based on K closest examples in the training set
- $p(y = c | x, D) = \frac{1}{K} \sum_{i \in N_K(x)} I\{y_i = c\}$
- The closest point can be computed using Mahalanobis Distance
- $d_M(x,\mu) = \sqrt{(x-\mu)^TM(x-\mu)}$
- M is a positive definite matrix
- If M = I, the distance reduces to the Euclidean distance (see the sketch below)
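A minimal sketch of the KNN class-probability estimate with a Mahalanobis distance (NumPy; the function and variable names are my own, and M = I recovers the Euclidean case):

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x, K=5, M=None):
    """Estimate p(y = c | x, D) as the fraction of the K nearest neighbors in class c."""
    if M is None:
        M = np.eye(X_train.shape[1])                  # M = I: plain Euclidean distance
    diffs = X_train - x                               # differences to the query point
    d2 = np.einsum('nd,de,ne->n', diffs, M, diffs)    # squared Mahalanobis distances
    nearest = np.argsort(d2)[:K]                      # indices of the K closest points
    return {c: float(np.mean(y_train[nearest] == c)) for c in np.unique(y_train)}
```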
Curse of Dimensionality
- Space volume grows exponentially with increase in dimension
- Suppose the inputs are uniformly distributed over the unit hypercube
- As we move from a square to a cube, an edge covering 10% of each axis contains a far smaller fraction of the data (1% vs 0.1%); see the sketch below
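A quick numerical illustration of the same effect: for uniform data on the unit hypercube, the edge length of a sub-cube that captures 10% of the data approaches 1 as the dimension grows.

```python
# Edge length s needed so that a sub-cube with volume s**d covers 10% of the unit hypercube
for d in (1, 2, 3, 10, 100):
    print(d, round(0.1 ** (1 / d), 3))
# prints: 1 0.1 | 2 0.316 | 3 0.464 | 10 0.794 | 100 0.977
```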
Speed and Memory Requirements
- Finding the K nearest neighbors is slow (a naive search scans the whole training set per query)
- KD Tree / LSH to speed up (approximate) neighbor search; see the sketch below
- KD Tree:
- K-dimensional binary search tree
- LSH: Similar objects go to same hash bucket
- Shingling, Minhash, LSH
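A minimal sketch of KD-tree-based neighbor search using SciPy's `cKDTree`; the data and query here are made up for illustration, and an LSH library would be the alternative for very high-dimensional inputs.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))       # hypothetical training set
tree = cKDTree(X)                      # build the K-d tree once

query = rng.normal(size=8)
dists, idx = tree.query(query, k=5)    # 5 nearest neighbors without a full linear scan
print(idx, dists)
```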
Open Set Recognition
- New classes appear at test time
- Person Re-identification
- Novelty Detection
Learning Distance Matrix
- Treat the distance matrix M as a learnable parameter
- Large Margin Nearest Neighbors
- Find M such that
- $M = W^T W$ (positive semi-definite by construction)
- Similar points have minimum distance
- Dissimilar points are at least m units away (margin)
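A toy sketch of this idea (not the full LMNN algorithm): learn the metric through W so that M = W^T W stays positive semi-definite, pulling similar points together and hinge-pushing dissimilar points past the margin. The margin m and pair structure here are illustrative assumptions.

```python
import numpy as np

def mahalanobis_sq(x, y, W):
    # d_M(x, y)^2 = (x - y)^T W^T W (x - y) = ||W (x - y)||^2
    d = W @ (x - y)
    return float(d @ d)

def large_margin_loss(W, x, x_sim, x_dis, m=1.0):
    """Pull a same-class point close; push a different-class point
    to be at least m further away than the similar one (hinge)."""
    pull = mahalanobis_sq(x, x_sim, W)
    push = max(0.0, m + pull - mahalanobis_sq(x, x_dis, W))
    return pull + push
```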
Deep Metric Learning
- Reduce the curse of dimensionality
- Project the input from a high-dimensional space to a lower-dimensional embedding
- Normalize the embedding
- Compute the distance
- Euclidean or cosine distance; for normalized embeddings the two are directly related
- $\|e_1 - e_2\|^2 = \|e_1\|^2 + \|e_2\|^2 - 2\, e_1^T e_2$
- With $\|e_1\| = \|e_2\| = 1$, this gives $\|e_1 - e_2\|^2 = 2(1 - \cos\theta)$ (verified numerically below)
- $\cos \theta = \frac{a \cdot b}{\|a\|\,\|b\|}$
- Derivation via trigonometry (law of cosines): $\cos \theta = \frac{a^2 + b^2 - c^2}{2ab}$
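A quick numerical check of the identity for unit-normalized embeddings (random vectors used purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=4), rng.normal(size=4)
e1, e2 = e1 / np.linalg.norm(e1), e2 / np.linalg.norm(e2)   # normalize the embeddings

cos = e1 @ e2
print(np.linalg.norm(e1 - e2) ** 2, 2 * (1 - cos))          # the two values match
```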
- Learn an embedding function such that similar examples are close and dissimilar examples are far apart
- Loss functions:
- Classification Losses
- Only learn to push examples onto the correct side of the decision boundary; they do not directly control distances in the embedding space
- Pairwise Loss
- Siamese Neural Network
- Common Backbone to embed the inputs
- $L(\theta, x_i, x_j) = I\{y_i = y_j\}\, d(x_i, x_j) + I\{y_i \ne y_j\}\, [m - d(x_i, x_j)]_+$
- If same class, minimize the distance
- If different class, push the distance beyond the margin m (hinge loss); see the sketch below
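A minimal sketch of the pairwise (contrastive) loss on two embeddings produced by the shared backbone (function and variable names are my own):

```python
import numpy as np

def contrastive_loss(e_i, e_j, same_class, m=1.0):
    """Same class: minimize the distance. Different class: hinge loss,
    only penalize pairs that are closer than the margin m."""
    d = np.linalg.norm(e_i - e_j)
    return d if same_class else max(0.0, m - d)
```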
- Triplet Loss
- In the pairwise loss, positive and negative pairs are handled in isolation (siloed); the triplet loss compares them jointly against an anchor
- $L(\theta, x_i, x^+, x^-) = [m + d(x_i, x^+) - d(x_i, x^-)]_+$
- Minimize the distance between anchor and positive
- Maximize the distance between anchor and negative
- m is the safety margin
- Need to mine hard negatives: examples from other classes that lie close to the anchor and positive pairs (see the mining sketch below)
- Computationally slow
- Use proxies to represent each class and speed up the training
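A minimal sketch of the triplet loss plus a simple hard-negative mining step (naming and batch structure are illustrative assumptions):

```python
import numpy as np

def triplet_loss(e_anchor, e_pos, e_neg, m=1.0):
    """Zero loss once the anchor is at least m closer to the positive than to the negative."""
    return max(0.0, m + np.linalg.norm(e_anchor - e_pos)
                      - np.linalg.norm(e_anchor - e_neg))

def hardest_negative(e_anchor, negatives):
    """Hard-negative mining: pick the negative embedding closest to the anchor."""
    d = np.linalg.norm(negatives - e_anchor, axis=1)
    return negatives[np.argmin(d)]
```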
Kernel Density Estimation
- Density Kernel
- Domain: R
- Range: R+
- $\int K(x)dx = 1$
- $K(-x) = K(x)$
- $\int x K(x-x_n) dx = x_n$
- Gaussian Kernel
- $K(x) = {1 \over \sqrt{2\pi}} \exp(-{1\over2}x^2)$
- RBF: Generalization to vector valued inputs
- Bandwidth
- Parameter to control the width of the kernel
- Density Estimation
- Extend the concept of Gaussian Mixture models to the extreme
- Each point acts as an individual cluster
- Mean $x_n$
- Constant variance
- No covariance
- Var-Cov matrix is $\sigma^2 I$
- $p(x|D) = \frac{1}{N}\sum_{n=1}^{N} K_h(x - x_n)$
- No model fitting is required
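A minimal 1-D KDE sketch with a Gaussian kernel; `h` is the bandwidth and the data points are made up for illustration.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, data, h=0.5):
    """p(x | D) = (1/N) * sum_n K_h(x - x_n), with K_h(u) = K(u / h) / h."""
    return np.mean(gaussian_kernel((x - data) / h) / h)

data = np.array([-1.0, 0.0, 0.2, 3.0])   # each point contributes one Gaussian bump
print(kde(0.1, data, h=0.5))
```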
- KDE vs KNN
- KDE and KNN are closely related
- Essentially, in KNN density estimation we grow the volume around the query point until it contains K neighbors, giving $p(x|D) \approx \frac{K}{N V_K(x)}$; KDE instead fixes the kernel width and lets the number of points inside vary