Clustering
Clustering is the task of grouping similar objects together without being told what the groups should be. It's unsupervised learning—we have no labels, only data. The algorithm discovers structure on its own.
Why Clustering Matters:
- Customer segmentation: Group customers by behavior
- Image segmentation: Group pixels by color/texture
- Document organization: Group articles by topic
- Anomaly detection: Normal patterns vs. outliers
- Data compression: Represent data by cluster centers
Hierarchical Agglomerative Clustering
The Idea: Start with each point as its own cluster, then repeatedly merge the two most similar clusters until only one remains.
Algorithm:
- Start: Each data point is its own cluster
- Compute all pairwise distances between clusters
- Merge the two closest clusters
- Repeat steps 2-3 until one cluster remains
This produces a dendrogram—a tree showing the merge history. Cut the tree at any height to get that many clusters.
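For example, a minimal sketch with SciPy's `scipy.cluster.hierarchy` (the synthetic 2-D blobs and the choice of Ward linkage are just illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Three synthetic 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

Z = linkage(X, method="ward")                    # full merge history (the dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
# dendrogram(Z) draws the tree when a matplotlib backend is available.
```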
Defining "Similarity" Between Clusters:
| Linkage Method | Definition | Properties |
|---|---|---|
| Single Link | Distance between two closest members | Tends to create long, chain-like clusters |
| Complete Link | Distance between two farthest members | Creates compact, spherical clusters |
| Average Link | Average distance between all pairs | Balance between single and complete |
| Ward's Method | Increase in total within-cluster variance | Minimizes variance, popular choice |
When to Use:
- When you don't know the number of clusters ahead of time
- When you want to see hierarchical structure
- For small to medium datasets (doesn't scale well)
Limitation: Very slow! $O(n^3)$ time complexity makes it impractical for large datasets.
K-Means Clustering
The Idea: Partition data into exactly $K$ clusters, each represented by its centroid (center of mass).
Algorithm:
- Choose $K$ initial cluster centers (randomly or using K-Means++)
- Assign each point to nearest center: $z_n^* = \arg\min_k ||x_n - \mu_k||^2$
- Update centers as mean of assigned points: $\mu_k = \frac{1}{N_k}\sum_{n: z_n=k} x_n$
- Repeat steps 2-3 until assignments don't change
Objective Function (Distortion/Inertia): $$L = \sum_n ||x_n - \mu_{z_n}||^2$$
This is the total squared distance from each point to its assigned center.
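As a concrete illustration, here is a minimal NumPy sketch of the two alternating steps and the distortion; the random-point initialization and the simple convergence test are simplifications for the example.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-Means sketch: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]        # random initial centers
    for _ in range(n_iters):
        # Assignment: z_n = argmin_k ||x_n - mu_k||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # Update: mu_k = mean of the points assigned to cluster k
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                          # assignments have stabilized
            break
        mu = new_mu
    distortion = ((X - mu[z]) ** 2).sum()                    # L = sum_n ||x_n - mu_{z_n}||^2
    return z, mu, distortion
```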
Key Properties:
- Each iteration decreases (or maintains) the objective
- Guaranteed to converge, but not necessarily to global optimum
- Converges quickly in practice
The Initialization Problem:
- K-Means is sensitive to initial centers
- Bad initialization → stuck in poor local minimum
- Solution: Multiple restarts—run many times, keep best result
K-Means++ Initialization (the smart way):
- Choose first center randomly from data points
- For each subsequent center:
- Compute distance $D(x)$ from each point to nearest existing center
- Choose new center with probability proportional to $D(x)^2$
- This spreads out the initial centers: points far from existing centers are more likely to be chosen
Result: Much better starting point, often finds better solutions.
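A sketch of the seeding step itself (the helper name `kmeans_pp_init` is made up for this example):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Pick K initial centers by D(x)^2 sampling (K-Means++ seeding)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]              # first center: uniform at random
    for _ in range(K - 1):
        # Squared distance from each point to its nearest chosen center.
        diff = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = np.min((diff ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()                        # probability proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

In practice, `sklearn.cluster.KMeans` uses K-Means++ seeding by default and combines it with multiple restarts via its `n_init` parameter.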
K-Medoids Algorithm (PAM - Partitioning Around Medoids):
- Centers must be actual data points (medoids), not means
- More robust to outliers
- Assignment: $z_n^* = \arg\min_k d(x_n, \mu_k)$ (any distance metric)
- Update: Find point with smallest total distance to all other cluster members
- Swap step: Try swapping current medoid with non-medoid, keep if cost decreases
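A simplified alternating sketch (assignment plus medoid update, omitting the full PAM swap search); Manhattan distance is used only to show that an arbitrary metric can be plugged in.

```python
import numpy as np

def k_medoids(X, K, n_iters=50, seed=0):
    """Simplified K-Medoids sketch: centers are always actual data points."""
    rng = np.random.default_rng(seed)
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)   # pairwise Manhattan distances
    medoids = rng.choice(len(X), size=K, replace=False)
    for _ in range(n_iters):
        z = D[:, medoids].argmin(axis=1)                     # assign to the nearest medoid
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.where(z == k)[0]
            if len(members) == 0:
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids[k] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return z, medoids
```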
Choosing the Number of Clusters ($K$):
This is one of the hardest problems in clustering!
Elbow Method:
- Plot distortion vs. $K$
- Look for "elbow" where distortion stops decreasing rapidly
- Often subjective—the elbow isn't always clear
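A sketch of the elbow plot with scikit-learn (the synthetic blobs and the range of K are example choices):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
distortions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in ks]

plt.plot(list(ks), distortions, marker="o")
plt.xlabel("K")
plt.ylabel("Distortion (inertia)")
plt.show()   # look for the K where the curve bends
```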
Silhouette Score: For each point $i$:
- $a_i$ = average distance to other points in same cluster (cohesion)
- $b_i$ = average distance to points in nearest other cluster (separation)
- $S_i = \frac{b_i - a_i}{\max(a_i, b_i)}$
Interpretation:
- $S_i \approx 1$: Point is well-clustered
- $S_i \approx 0$: Point is on boundary between clusters
- $S_i \approx -1$: Point may be in wrong cluster
Average silhouette score across all points measures overall clustering quality.
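With scikit-learn, the average silhouette can be compared across candidate values of K; a sketch on synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher average silhouette = better-separated clusters
```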
K-Means is EM with Hard Assignments:
- Expectation-Maximization (EM) uses "soft" assignments (probabilities)
- K-Means uses "hard" assignments (0 or 1)
- Both assume spherical clusters with equal variance
- K-Means is a limiting case of a Gaussian Mixture Model (equal spherical covariances with variance shrinking to zero)
Limitations of K-Means:
- Assumes spherical clusters: Can't find elongated or irregular shapes
- Assumes equal-sized clusters: Tends to split large clusters
- Sensitive to outliers: Means are pulled by extreme values
- Requires specifying $K$: Must know number of clusters beforehand
- Non-convex clusters: Fails on clusters with complex shapes
- Euclidean distance: May not be appropriate for all data types
Interpreting Results:
- Cluster centers: "Prototypical" members, useful for understanding what each cluster represents
- Cluster sizes: Imbalanced sizes might indicate poor clustering or natural structure
- Within-cluster variance: Measures homogeneity
- Between-cluster variance: Measures separation
Spectral Clustering
The Idea: Use the eigenvalues (spectrum) of a similarity graph to find clusters. Works when K-Means fails on non-convex shapes.
Graph Perspective:
- Data points are nodes
- Edges connect similar points (weighted by similarity)
- Goal: Find a partition that minimizes edges cut between groups
Algorithm:
- Build similarity graph from data (e.g., k-nearest neighbors or Gaussian similarity)
- Compute the Graph Laplacian: $L = D - A$
- $D$ = degree matrix (diagonal, $D_{ii}$ = sum of edges from node $i$)
- $A$ = adjacency matrix (edge weights)
- Find eigenvectors of $L$ corresponding to smallest eigenvalues
- Use these eigenvectors as new features, run K-Means
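A compact NumPy sketch of these four steps, using a fully connected Gaussian similarity graph and the unnormalized Laplacian (the bandwidth `sigma` is an assumed example parameter):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, K, sigma=1.0):
    """Unnormalized spectral clustering sketch."""
    # 1. Similarity graph: fully connected with Gaussian (RBF) weights.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # 2. Graph Laplacian L = D - A.
    D = np.diag(A.sum(axis=1))
    L = D - A
    # 3. Eigenvectors of the K smallest eigenvalues (eigh returns ascending order).
    _, eigvecs = np.linalg.eigh(L)
    features = eigvecs[:, :K]
    # 4. K-Means on the spectral embedding.
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(features)
```

scikit-learn's `SpectralClustering` implements a normalized variant of this same pipeline.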
Why the Laplacian?:
- Smallest eigenvalue is always 0 (corresponding to all-ones vector)
- Second smallest eigenvalue (the Fiedler value) measures how well-connected the graph is; its eigenvector indicates the best 2-way cut
- For $K$ clusters, use the eigenvectors corresponding to the $K$ smallest eigenvalues
When to Use:
- Non-convex cluster shapes (crescents, rings)
- When you have a natural similarity/distance matrix
- Graph-structured data
DBSCAN
Density-Based Spatial Clustering of Applications with Noise
The Idea: Clusters are dense regions separated by sparse regions. Points in sparse regions are "noise."
Key Concepts:
- Core Point: Has at least `minPts` points within radius $\epsilon$
- Border Point: Within $\epsilon$ of a core point, but not itself a core point
- Noise Point: Neither core nor border
Reachability:
- Direct Density Reachable: Point $q$ is within $\epsilon$ of core point $p$
- Density Reachable: There's a chain of core points connecting $p$ to $q$
- Density Connected: Both $p$ and $q$ are density reachable from some core point
Algorithm:
- For each point, determine if it's a core point
- Create clusters by connecting core points that are within $\epsilon$ of each other
- Assign border points to nearby clusters
- Mark remaining points as noise
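A minimal run with scikit-learn's `DBSCAN` (the two-moons data and the parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                                  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int((labels == -1).sum()))
# Core point indices are available in db.core_sample_indices_.
```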
Parameters:
- $\epsilon$ (epsilon): Neighborhood radius
- `minPts`: Minimum number of points required to form a dense region
Choosing Parameters:
- `minPts`: Often set to $\text{dimensionality} + 1$ or higher
- $\epsilon$: Plot the k-distance graph (distance to the k-th nearest neighbor) and look for the elbow
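A sketch of the k-distance heuristic for choosing $\epsilon$ (using `NearestNeighbors` from scikit-learn; the data and the `min_pts` value are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
min_pts = 5

# Distance from every point to its min_pts-th neighbor (the query includes the point itself).
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {min_pts}-th nearest neighbor")
plt.show()   # read epsilon off the elbow of this curve
```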
Advantages:
- No need to specify number of clusters
- Finds arbitrarily shaped clusters
- Robust to outliers (identifies them as noise)
- Only two intuitive parameters
Disadvantages:
- Struggles with clusters of varying density
- Parameter selection can be tricky
- Doesn't work well in high dimensions (curse of dimensionality)
- May merge clusters that are close together or connected by a dense bridge of points
Extensions:
OPTICS (Ordering Points To Identify Clustering Structure):
- Creates an ordering of points based on density
- Can extract clusters at different density levels
- More robust to parameter choices
HDBSCAN (Hierarchical DBSCAN):
- Automatically finds clusters of varying densities
- Combines benefits of DBSCAN and hierarchical clustering
- More robust, fewer parameters to tune
- Often the best choice in practice
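A minimal sketch, assuming scikit-learn 1.3 or newer (which provides `sklearn.cluster.HDBSCAN`; the standalone `hdbscan` package offers a similar interface):

```python
from sklearn.cluster import HDBSCAN   # requires scikit-learn >= 1.3
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = HDBSCAN(min_cluster_size=10).fit_predict(X)   # -1 marks noise, as in DBSCAN
```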
Choosing a Clustering Algorithm
Decision Guide:
| Situation | Recommended Algorithm |
|---|---|
| Know number of clusters, spherical shapes | K-Means |
| Don't know number, want hierarchy | Hierarchical Agglomerative |
| Arbitrary shapes, want to identify noise | DBSCAN or HDBSCAN |
| Non-convex shapes, know number | Spectral Clustering |
| Large dataset | K-Means or Mini-batch K-Means |
| Want to visualize cluster relationships | Hierarchical with dendrogram |
Validation Approaches:
- Internal metrics (no ground truth): Silhouette score, Davies-Bouldin index, Calinski-Harabasz index
- External metrics (with ground truth): Adjusted Rand Index, Normalized Mutual Information
- Stability: Do clusters persist with data perturbations?
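For example, with scikit-learn (synthetic blobs stand in for data with known ground truth):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             normalized_mutual_info_score, silhouette_score)

X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Internal metrics: no ground truth needed.
print("silhouette:     ", silhouette_score(X, labels))
print("davies-bouldin: ", davies_bouldin_score(X, labels))

# External metrics: require known labels (rarely available in practice).
print("ARI:", adjusted_rand_score(y_true, labels))
print("NMI:", normalized_mutual_info_score(y_true, labels))
```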
Remember: Clustering is exploratory—there's often no "right" answer. Different algorithms reveal different structure. The best choice depends on your data and goals.