Exemplar-Based Methods
Exemplar methods (also called instance-based or memory-based) keep training data around and use it directly for prediction. The classic example is K-Nearest Neighbors (KNN).
The Big Picture
Parametric models: Learn parameters θ, discard training data at test time.
- Fixed number of parameters; model size does not grow with the data
Non-parametric models: Keep training data, use it directly.
- Model complexity grows with data size
- Can adapt to arbitrary complexity
Instance-Based Learning
The Approach
- Store training examples
- At test time, find similar training examples
- Predict based on their labels
Key ingredient: A good distance/similarity measure.
K-Nearest Neighbors (KNN)
Classification
Find the K closest training points and vote on the label: $$p(y = c | x, D) = \frac{1}{K} \sum_{i \in N_K(x)} \mathbb{I}\{y_i = c\}$$ where $N_K(x)$ is the set of indices of the K nearest training points to x in D.
Hyperparameter K:
- K = 1: Highly flexible decision boundary, but sensitive to noise (overfits)
- K = N: Always predicts the majority class, ignoring x (underfits)
- Typical: K = 5-10, or tune via cross-validation (see the sketch after this list)
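As a concrete illustration, here is a minimal numpy sketch of the voting rule above; the Euclidean distance, the toy two-blob data, and K = 5 are assumptions chosen for the example, not part of any particular library.

```python
import numpy as np

def knn_classify(x, X_train, y_train, K=5):
    """Return class probabilities p(y = c | x) from a K-nearest-neighbor vote."""
    # Euclidean distances from the query to every training point: O(N d).
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the K closest training points (the set N_K(x)).
    nn = np.argsort(dists)[:K]
    classes = np.unique(y_train)
    # Fraction of neighbors voting for each class.
    return {int(c): float(np.mean(y_train[nn] == c)) for c in classes}

# Toy data: two Gaussian blobs with labels 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), X, y, K=5))
```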
Regression
Average the labels of K nearest neighbors: $$\hat{y} = \frac{1}{K} \sum_{i \in N_K(x)} y_i$$
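The regression rule is the same neighbor search with an average in place of a vote; a short sketch, again assuming Euclidean distance:

```python
import numpy as np

def knn_regress(x, X_train, y_train, K=5):
    """Average the targets of the K nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to all N training points
    nn = np.argsort(dists)[:K]                    # indices of the K nearest neighbors
    return y_train[nn].mean()
```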
Distance Metrics
Euclidean distance: $$d(x, x') = \|x - x'\|_2 = \sqrt{\sum_j (x_j - x'_j)^2}$$
Mahalanobis distance (accounts for correlations): $$d_M(x, x') = \sqrt{(x - x')^T M (x - x')}$$
Where M is a positive definite matrix (often $M = \Sigma^{-1}$).
If M = I: Reduces to Euclidean distance.
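A small numpy check of the two formulas; the toy data and the choice $M = \Sigma^{-1}$ (with $\Sigma$ estimated from that data) are assumptions for illustration.

```python
import numpy as np

def mahalanobis(x, x2, M):
    """Mahalanobis distance sqrt((x - x')^T M (x - x')) for positive definite M."""
    diff = x - x2
    return np.sqrt(diff @ M @ diff)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # toy data to estimate a covariance from
M = np.linalg.inv(np.cov(X, rowvar=False))  # common choice: M = Sigma^{-1}
x, x2 = X[0], X[1]
print(mahalanobis(x, x2, M))
print(mahalanobis(x, x2, np.eye(3)))        # M = I recovers the Euclidean distance
print(np.linalg.norm(x - x2))               # same value as the line above
```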
The Curse of Dimensionality
The Problem
In high dimensions, distances become meaningless:
- All points become approximately equidistant
- Local neighborhoods become empty
- Need exponentially more data to fill space
Example
Consider a unit hypercube and the fraction of its volume that lies within 10% of a face. The interior more than 10% away from every face fills $0.8^d$ of the volume, so the boundary fraction is $1 - 0.8^d$:
- 1D: 20%
- 10D: 89%
- 100D: 99.99999...%
Almost all points are near the boundary!
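A few lines reproduce the quoted numbers from the $1 - 0.8^d$ formula:

```python
# Fraction of a unit hypercube's volume within 10% of some face: 1 - 0.8**d.
for d in (1, 10, 100):
    print(d, 1 - 0.8 ** d)   # ~0.20, ~0.89, ~1.0
```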
Solutions
- Dimensionality reduction (PCA, autoencoders)
- Feature selection
- Metric learning (learn a better distance)
Computational Efficiency
Naive Approach
For each query, compute distance to all N training points.
- Time: O(Nd) per query
- Infeasible for large datasets
Approximate Nearest Neighbors
KD-Trees:
- Binary tree that partitions space
- O(log N) for low dimensions
- Degrades in high dimensions
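For exact search in low dimensions, SciPy ships a KD-tree; a brief usage sketch (the data sizes here are arbitrary):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))      # low-dimensional data, where KD-trees work well
tree = KDTree(X)                      # build the tree once up front

query = rng.normal(size=3)
dists, idx = tree.query(query, k=5)   # exact 5 nearest neighbors without scanning all N points
print(idx, dists)
```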
Locality-Sensitive Hashing (LSH):
- Hash similar items to same bucket
- Sublinear query time
- Approximate, not exact
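A toy sketch of the random-hyperplane (cosine) LSH idea; the bit count, the data, and the single hash table are illustrative assumptions, and production systems use many tables plus a re-ranking step.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_bits = 64, 8
planes = rng.normal(size=(n_bits, d))        # random hyperplanes define the hash bits

def lsh_hash(x):
    """Sign pattern of x against the hyperplanes, as a hashable tuple of bits."""
    return tuple((planes @ x > 0).astype(int))

# Index: vectors with similar direction tend to share a bucket.
X = rng.normal(size=(5000, d))
buckets = defaultdict(list)
for i, x in enumerate(X):
    buckets[lsh_hash(x)].append(i)

# Query: scan only the candidate bucket instead of all 5000 points.
q = X[0] + 0.01 * rng.normal(size=d)         # near-duplicate of X[0]
candidates = buckets[lsh_hash(q)]
print(len(candidates), 0 in candidates)      # small candidate set; very likely contains point 0
```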
Open Set Recognition
Standard classification: All test classes seen during training.
Open set: New, unseen classes may appear at test time.
KNN advantage: new classes can be handled by simply adding their examples to memory, and queries far from every stored example can be flagged as novel, all without retraining.
Applications:
- Person re-identification
- Anomaly detection
- Few-shot learning
Learning Distance Metrics
Motivation
Euclidean distance may not reflect true similarity.
Goal: Learn a distance metric M that captures task-relevant similarity.
Large Margin Nearest Neighbors (LMNN)
Learn M such that:
- Points with same label are close
- Points with different labels are far (by margin m)
Constraint: Parameterize $M = W^T W$, which guarantees M is positive semi-definite (positive definite when W has full column rank).
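This is not the full LMNN objective, but a short numpy sketch of why the $M = W^T W$ parameterization is convenient: the learned metric is just Euclidean distance after the linear map $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d))     # the matrix that would actually be learned
M = W.T @ W                     # positive semi-definite by construction

x, x2 = rng.normal(size=d), rng.normal(size=d)
diff = x - x2
d_M = np.sqrt(diff @ M @ diff)           # Mahalanobis form with the learned M
d_W = np.linalg.norm(W @ x - W @ x2)     # identical: Euclidean distance in the mapped space
print(np.isclose(d_M, d_W))              # True
```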
Deep Metric Learning
The Idea
Learn an embedding function $f(x; \theta)$ such that:
- Similar examples are close in embedding space
- Dissimilar examples are far
Siamese Networks
Two copies of the same network (with shared weights) process the two inputs.
Contrastive Loss: $$L = \mathbb{I}\{y_i = y_j\} \cdot d(x_i, x_j)^2 + \mathbb{I}\{y_i \neq y_j\} \cdot [m - d(x_i, x_j)]_+^2$$ where $d(x_i, x_j)$ is the distance between the embeddings $f(x_i; \theta)$ and $f(x_j; \theta)$.
- Same class: Minimize distance
- Different class: Push apart (up to margin m)
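A minimal numpy version of the pairwise loss, assuming the two embeddings have already been produced by the shared network; the margin value is an arbitrary choice.

```python
import numpy as np

def contrastive_loss(e_i, e_j, same_class, m=1.0):
    """Contrastive loss on an embedding pair e_i = f(x_i), e_j = f(x_j)."""
    d = np.linalg.norm(e_i - e_j)
    if same_class:
        return d ** 2                    # pull same-class pairs together
    return max(m - d, 0.0) ** 2          # push different-class pairs apart, up to margin m
```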
Triplet Loss
Use triplets (anchor $a$, positive $p$ from the same class, negative $n$ from a different class), compared in embedding space:
$$L = [m + d(a, p) - d(a, n)]_+$$
Goal: Anchor should be closer to positive than negative by margin m.
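The corresponding sketch for the triplet version, again on precomputed embeddings with an assumed margin:

```python
import numpy as np

def triplet_loss(a, p, n, m=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    return max(m + d_ap - d_an, 0.0)     # zero once the negative is farther than the positive by m
```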
Hard Negative Mining
Random negatives are usually too easy: they are already farther than the margin, so they contribute nothing to the loss.
Solution: Sample hard negatives that are close to anchor but from different class.
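One simple in-batch version of this idea, as a sketch; practical pipelines often prefer semi-hard mining to keep training stable.

```python
import numpy as np

def hardest_negative(anchor_emb, anchor_label, embs, labels):
    """Index of the closest embedding with a different label than the anchor."""
    dists = np.linalg.norm(embs - anchor_emb, axis=1)
    dists[labels == anchor_label] = np.inf   # rule out same-class points (and the anchor itself)
    return int(np.argmin(dists))
```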
Connection to Cosine Similarity
For normalized (unit-norm) embeddings: $$\|e_1 - e_2\|^2 = 2(1 - \cos\theta)$$ where $\theta$ is the angle between $e_1$ and $e_2$.
For unit vectors, ranking neighbors by Euclidean distance and by cosine similarity therefore gives the same ordering.
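A quick numerical check of the identity on two random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=8), rng.normal(size=8)
e1, e2 = e1 / np.linalg.norm(e1), e2 / np.linalg.norm(e2)   # unit-normalize

lhs = np.linalg.norm(e1 - e2) ** 2
rhs = 2 * (1 - e1 @ e2)          # e1 . e2 = cos(theta) for unit vectors
print(np.isclose(lhs, rhs))      # True
```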
Kernel Density Estimation
The Idea
Estimate the probability density by placing kernels at each data point: $$\hat{p}(x) = \frac{1}{N} \sum_{n=1}^N K_h(x - x_n)$$
Gaussian kernel: $$K_h(x) = \frac{1}{h\sqrt{2\pi}} \exp\left(-\frac{x^2}{2h^2}\right)$$
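A one-dimensional numpy sketch of the estimator; the bimodal toy data and the bandwidth h = 0.3 are assumptions for illustration.

```python
import numpy as np

def kde(x, data, h=0.3):
    """Gaussian kernel density estimate p_hat(x) from 1-D samples."""
    z = (x - data) / h                                         # one term per data point
    kernels = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean()                                      # average of N kernels

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 100)])
print(kde(-2.0, data), kde(0.0, data))   # higher density near the tight mode at -2
```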
Bandwidth h
Controls smoothness:
- Small h: Spiky, overfitting
- Large h: Over-smooth, underfitting
Choose via cross-validation.
Connection to GMM
KDE is like a GMM where:
- Each point is its own cluster center
- All clusters have same (spherical) covariance
- Mixing weights are fixed at 1/N (nothing to learn)
KDE for Classification
Use Bayes rule with class-conditional densities: $$p(y = c | x) \propto \pi_c \cdot \hat{p}(x | y = c)$$
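A sketch of the resulting classifier for 1-D features, with class priors estimated as empirical frequencies; the data and bandwidth are again illustrative.

```python
import numpy as np

def kde_1d(x, data, h=0.3):
    z = (x - data) / h
    return np.mean(np.exp(-0.5 * z ** 2)) / (h * np.sqrt(2 * np.pi))

def kde_classify(x, X_train, y_train, h=0.3):
    """Posterior over classes: prior pi_c times the class-conditional KDE, normalized."""
    classes = np.unique(y_train)
    scores = np.array([
        np.mean(y_train == c) * kde_1d(x, X_train[y_train == c], h)   # pi_c * p_hat(x | y = c)
        for c in classes
    ])
    return dict(zip(classes.tolist(), (scores / scores.sum()).tolist()))

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
y = np.array([0] * 100 + [1] * 100)
print(kde_classify(1.5, X, y))   # most of the posterior mass goes to class 1
```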
KDE vs KNN
Connection: Both use local neighborhoods.
- KDE: Fixed bandwidth, variable number of neighbors
- KNN: Variable bandwidth (the neighborhood grows until it contains K points), fixed number of neighbors
Dual view: KNN adapts to local density automatically.
Summary
| Method | Key Idea | Pros | Cons |
|---|---|---|---|
| KNN | Vote of K nearest neighbors | Simple, no training | Slow at test time |
| Metric Learning | Learn task-specific distance | Better than Euclidean | Requires labeled pairs |
| Deep Metric | Embed + distance | Handles high dimensions | Needs lots of data |
| KDE | Kernels at each point | Density estimation | Curse of dimensionality |
When to Use Exemplar Methods
Good for:
- Few training examples per class (few-shot learning)
- Classes change over time (no retraining needed)
- Interpretable predictions ("similar to example X")
- Baseline before trying complex methods
Challenges:
- High-dimensional data (curse of dimensionality)
- Large training sets (computational cost)
- Need good distance metric