Dimensionality Reduction
Background
- Curse of Dimensionality
- Data has too many features relative to observations ($n \ll p$)
- Data volume required for good generalization grows exponentially
- Example: a square and a cube with the same edge length (say 10); a quick numeric check follows this list
- A 1x1 patch covers 1% of the square's area
- A 1x1x1 patch covers 0.1% of the cube's volume
- Two approaches
- Feature Selection
- Use only a subset of original features
- Latent Features
- Recombine the original features for more efficient representation
- Can be linear or non-linear
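A quick numeric check of the patch-coverage example above, generalized to $d$ dimensions (plain Python, hypothetical values):

```python
# Fraction of a hypercube with edge 10 covered by a unit patch: the covered
# fraction shrinks exponentially as the dimension d grows.
for d in (2, 3, 10):
    fraction = 1.0 / 10 ** d        # unit-patch volume / total volume 10**d
    print(f"d={d}: unit patch covers {fraction:.10%} of the volume")
```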
Principal Component Analysis
- Find a linear and orthogonal projection of data from high dimension to low dimension
- Encode original data $x \in R^D$ using $W \in R^{D \times L}$
- Encode: $z = W^T x \in R^L$
- Decode $z$ by projecting it back from the lower dimension to the higher dimension
- Decode: $\hat x = W z \in R^D$
- Objective is to minimize reconstruction error
- $L(W) = {1 \over N} \sum_i ||x_i - \hat x_i||^2$
- Proof: project all the data onto one dimension (a numerical sanity check follows this derivation)
- $w_1 \in R^D$
- $\hat x_i = z_{i1} w_1$
- Find the optimal values of $z$ and $w_1$ that minimize the reconstruction error
- $L = {1 \over N} \sum ||x_i - z_{i1} w_1||^2$
- $L = {1 \over N} \sum (x_i - z_{i1} w_1)^T(x_i - z_{i1} w_1)$
- $L = {1 \over N} \sum x_i^T x_i - 2 z_{i1} w_1^T x_i + z_{i1}^2 w_1^T w_1$
- Orthonormal Assumption $\implies w_1^Tw_1 = 1$
- $L = {1 \over N} \sum x_i^T x_i - 2 z_{i1} w_1^T x_i + z_{i1}^2$
- Take derivatives w.r.t. $z$ and $w$
- ${\partial L \over \partial z_{i1}} = {1 \over N} (-2 w_1^T x_i + 2 z_{i1}) = 0$
- Optimal embedding: $z_{i1} = w_1^T x_i$
- Plugging the value of z in L
- $L = {1 \over N} \sum x_i^T x_i - z_{i1}^2$
- $L = C - {1 \over N} \sum z_{i1}^2$
- $L = C - {1 \over N} \sum w_1^T x_i x_i^T w_1$
- $L = C - w_1^T \Sigma w_1$
- $\Sigma = {1 \over N} \sum x_i x_i^T$ is the variance-covariance matrix of $X$ (assuming centred data)
- The loss can be minimized trivially by scaling $w$
- To avoid this, impose a unit-norm constraint on $w$
- $\tilde L = - w_1^T \Sigma w_1 + \lambda (w_1^T w_1 - 1)$
- ${\partial \tilde L \over \partial w_1} = -2 \Sigma w_1 + 2 \lambda w_1 = 0 \implies \Sigma w_1 = \lambda w_1$
- The optimal $w_1$ is an eigenvector of $\Sigma$
- To minimize the loss, pick the eigenvector corresponding to the largest eigenvalue
- PCA finds vectors that maximize the variance of projected data
- $L = C - {1 \over N} \sum z_{i1}^2$
- The original data is centred (zero mean)
- $E(z_1) = E(w_1^T x) = 0$, so ${1 \over N} \sum z_{i1}^2 = \text{Var}(z_1)$
- $L = C - \text{Var}(z_1)$: minimizing reconstruction error maximizes the variance of the projection
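A minimal numerical sanity check of this result on hypothetical toy data (numpy only): the top eigenvector of the covariance matrix gives the smallest one-dimensional reconstruction error, and the variance of the projection equals the top eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
X = X - X.mean(axis=0)                                     # centre the data

Sigma = X.T @ X / len(X)                   # covariance matrix (1/N sum x x^T)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
w1 = eigvecs[:, -1]                        # eigenvector with the largest eigenvalue

def recon_error(w):
    z = X @ w                              # optimal embedding z_i = w^T x_i
    X_hat = np.outer(z, w)                 # reconstruction x_hat_i = z_i w
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# The top eigenvector beats a random unit vector on reconstruction error,
# and Var(z_1) equals the top eigenvalue.
random_w = rng.normal(size=5)
random_w /= np.linalg.norm(random_w)
print(recon_error(w1) <= recon_error(random_w))   # True
print(np.isclose(np.var(X @ w1), eigvals[-1]))    # True
```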
- Geometric Explanation
- Find a new axis to capture the data
- The squared distance of a point from the origin, $R^2$, is fixed
- $D^2$ is the squared distance of the point from the origin along the new axis (variance)
- $\epsilon^2$ is the squared perpendicular distance of the point from the new axis (distortion)
- By the Pythagorean theorem, $R^2 = D^2 + \epsilon^2$
- PCA maximizes the variance $D^2$
- This is equivalent to minimizing the distortion $\epsilon^2$, since $R^2$ is constant
- Each eigenvalue equals the variance of the data projected onto its principal component (the average squared distance of the projected points from the origin)
- Use eigenvalues to understand how much variation is captured by each principal component
- Use a scree plot (variation captured vs. number of components) to decide how many components to keep
- The maximum number of components equals the number of features in the original data
- Full basis
- If the data effectively lies in a 2D subspace, the eigenvalue for the 3rd PC will be 0
- Principal components are linear combinations of original features
- The weights used for linear combinations are called factor loadings
- Factor loadings denote the importance of features in capturing variance
- PCA + linear regression is still interpretable
- Use the estimated coefficients together with the factor loadings to understand how the original variables are being used (a short sketch follows this item)
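A brief sketch of this interpretability argument (hypothetical toy data, scikit-learn assumed): regression coefficients fitted on the principal components can be mapped back to the original features through the loadings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

pca = PCA(n_components=4).fit(X)            # full basis kept for an exact mapping
Z = pca.transform(X)                        # scores: Z = (X - mean) @ components_.T
reg = LinearRegression().fit(Z, y)

# Effective coefficients on the original (centred) features:
# y_hat = Z @ coef_ = (X - mean) @ components_.T @ coef_
beta_original = pca.components_.T @ reg.coef_
print(beta_original)   # ~ [1, -2, 0.5, 0]; approximate if components are dropped
```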
- PCA is computed using the SVD (singular value decomposition); a numerical check follows this block
- $X = U S V^T \in R^{N \times D}$
- $U \in R^{N \times N}$ is orthonormal
- $S \in R^{N \times D}$ is diagonal (contains singular values)
- $V \in R^{D \times D}$ is orthonormal
- $X^T X = (U S V^{T})^T(U S V^{T}) = V S^T U^T U S V^T = V S^T S V^T$
- Since S is a diagonal matrix, $S^TS$ is diagonal as well
- $X^T X = V D V^T$ where $D = S^T S$
- Right-multiplying both sides by $V$ (and using $V^T V = I$): $(X^T X)V = V D$
- The diagonal of $D$ gives the eigenvalues and the columns of $V$ the corresponding eigenvectors
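A small numerical check of the SVD relationship above (numpy, hypothetical data): the squared singular values are the eigenvalues of $X^T X$, the columns of $V$ are its eigenvectors, and the normalized squared singular values give the scree-plot fractions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
X = X - X.mean(axis=0)                         # centre the data first

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)     # eigh returns ascending eigenvalues

print(np.allclose(np.sort(s ** 2), eigvals))                # S^T S = eigenvalues
print(np.allclose(np.abs(Vt), np.abs(eigvecs[:, ::-1].T)))  # V = eigenvectors (up to sign)
print(s ** 2 / np.sum(s ** 2))                 # fraction of variance per component (scree plot)
```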
- Notes
- PCA doesn't work well if the interrelationships are non-linear
- Alternatives: Kernel PCA, Factor Analysis (see the usage sketch after these notes)
- PCA doesn't work well in case of outliers
- PCA can't handle missing data
- PCA is unsupervised
- LDA is a supervised dimensionality reduction technique
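A brief, hedged usage sketch for the alternatives mentioned in these notes (scikit-learn assumed): Kernel PCA for non-linear structure and LDA as a supervised reduction.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Kernel PCA: handles non-linear relationships via an RBF kernel
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)

# LDA: supervised reduction that uses the class labels y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_kpca.shape, X_lda.shape)   # (150, 2) (150, 2)
```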
Stochastic Neighbour Embedding (SNE)
- Unsupervised, non-parametric method for dimensionality reduction
- A manifold is a topological space that is locally Euclidean
- Earth's surface is a 2D manifold embedded in 3D space
- High-dimensional data can lie on a low-dimensional manifold
- Idea is to preserve nearest neighbours instead of preserving distances
- Convert the distances in high-dimension to probabilities
- Probability that point $i$ selects $j$ as its neighbour
- Gaussian Kernel
- $p_{j|i} \propto \exp(-{|| x_i - x_j||^2 \over 2\sigma_i^2})$
- $\sigma_i^2$ is the variance for data point i
- A smaller $\sigma_i$ is used in dense regions (magnifying local distances)
- A larger $\sigma_i$ is used in sparse regions (shrinking local distances)
- Perplexity parameter (say 30)
- The variance $\sigma_i$ is adjusted so that roughly 30 neighbours receive appreciable probability
- Balance between local and global aspects of the data
- Initialize the low-dimensional representations $z_i$ and compute the analogous probabilities
- $q_{j|i} \propto \exp(-{|| z_i - z_j||^2})$
- Variance is assumed to be constant here
- A good representation will preserve the neighbours
- $p$ and $q$ are probability distributions. KL Divergence will capture the distance between them
- $L = KL(p||q) = \sum_i\sum_j p_{j|i}\log({p_{j|i} \over q_{j|i}})$
- If p is high and q is low, the penalty is high
- Points that were neighbours in high dimension are not neighbours in low dimension
- If p is low and q is high, the penalty is low
- Unrelated points may end up close together in the low-dimensional map
- Calculate $z$ by minimizing the KL divergence using gradient descent (a small numerical sketch follows)
- ${\partial L \over \partial z_i} = 2 \sum_j (z_i - z_j) (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})$
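An illustrative numpy sketch of the quantities above, on hypothetical random data. For simplicity a single fixed $\sigma$ is used for every point; real SNE tunes $\sigma_i$ per point to match the perplexity.

```python
import numpy as np

def conditional_probs(points, sigma=1.0):
    """Row i holds p_{j|i}: probability that point i picks j as its neighbour."""
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point never picks itself
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def kl_divergence(P, Q, eps=1e-12):
    return np.sum(P * np.log((P + eps) / (Q + eps)))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))                    # high-dimensional points
Z = rng.normal(size=(50, 2))                     # random low-dimensional embedding

P = conditional_probs(X)
Q = conditional_probs(Z, sigma=1 / np.sqrt(2))   # exp(-||z_i - z_j||^2), constant variance
print(kl_divergence(P, Q))                       # the loss that gradient descent minimizes w.r.t. Z
```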
- Symmetric SNE
- In the above formulation the conditional probabilities are not symmetric
- $p_{i|j} \ne p_{j|i}$
- To enforce symmetry: $p_{ij} = (p_{j|i} + p_{i|j}) / (2N)$, which also makes $\sum_{i,j} p_{ij} = 1$
- This averaging works better than the simpler alternative of a single constant variance in the high-dimensional space, which is sensitive to outliers
- ${\partial L \over \partial z_i} = 4 \sum_j (z_i - z_j) (p_{ij} - q_{ij})$
- Analogous to the force in a spring ($F = kx$)
- $(p_{ij} - q_{ij})$ is k
- $(z_i - z_j)$ is x
- t-SNE
- SNE has a crowding problem
- The Gaussian kernel's probability decays quickly, so moderately far-apart points in high dimension get squeezed together in the low-dimensional map
- Replace it with t-distribution that has fatter tails (probability goes to 0 slowly)
- The fatter tails allow dissimilar points to be far apart in lower dimension as well
- Removes unwanted attractive forces between points that are moderately far apart in high dimension
- $q_{ij} \propto (1+{|| z_i - z_j||^2})^{-1}$
- ${\partial L \over \partial z_i} = 4 \sum_j (p_{ij} - q_{ij}) (z_i - z_j) (1 + || z_i - z_j||^2)^{-1}$
- $(1 + || z_i - z_j||^2)^{-1}$ ensures well separated clusters with tightly packed points inside
- Introduces strong repulsions between the dissimilar datapoints that are modeled by small pairwise distance in the low-dimensional map
- Coordinates after embedding have no inherent meaning (a minimal usage sketch follows)
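A minimal t-SNE usage sketch with scikit-learn; the perplexity parameter plays the role described above (roughly the effective number of neighbours).

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Reduce 64-dimensional digit images to 2D for visualization
Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(Z.shape)   # (1797, 2); the embedding coordinates have no inherent meaning
```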
- UMAP
- Uniform Manifold Approximation and Projection
- Similar to t-SNE but much faster
- t-SNE calculates all pairwise distances
- UMAP calculates distances between close neighbours only
- t-SNE starts from a random initialization; UMAP starts from a spectral embedding
- t-SNE moves every point slightly in each iteration; UMAP can move a single point or a subset of points per iteration
- Mathematics
- t-SNE uses a Gaussian density function to convert distances between points in high dimension into probabilities
- UMAP uses similarity scores
- Hyperparameter: number of neighbours (similar to perplexity in t-SNE)
- Compute $\log_2(\text{number of neighbours})$
- Calculate similarity scores
- $\exp(-(\text{raw distance} - \text{distance to nearest neighbour}) / \sigma_i)$
- Rescale $\sigma_i$ so that the sum of similarity scores equals $\log_2(\text{number of neighbours})$
- UMAP makes the scores symmetrical by $(S_1 + S_2) - S_1S_2$
- Initialize a low dimension graph using Spectral Embedding
- Eigendecomposition of the graph Laplacian
- Graph Laplacian = Degree Matrix - Adjacency Matrix
- Calculate the similarity in the low dimension using a t-distribution-like curve
- $(1 + \alpha d^{2\beta})^{-1}$
- The parameters help user control the shape of the curve
- Cost Function
- Cross-Entropy between graphs
- Per pair: $-\log(S_{\text{neighbour}})$ for neighbours (attraction) and $-\log(1 - S_{\text{not neighbour}})$ for non-neighbours (repulsion)
- UMAP can accommodate new data (a transform/predict step) without recomputation, as sketched below
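A hedged usage sketch, assuming the third-party umap-learn package is installed; note that, unlike t-SNE, the fitted reducer can embed new points with `transform`.

```python
import numpy as np
import umap   # pip install umap-learn

rng = np.random.default_rng(4)
X_train = rng.normal(size=(500, 20))   # hypothetical high-dimensional data
X_new = rng.normal(size=(10, 20))      # data unseen during fitting

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
Z_train = reducer.fit_transform(X_train)   # learn the low-dimensional embedding
Z_new = reducer.transform(X_new)           # embed new points without refitting
print(Z_train.shape, Z_new.shape)          # (500, 2) (10, 2)
```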
Applications of Dimensionality Reduction
- Data Visualization:
- Reduce high-dimensional data to 2D or 3D for visualization
- Helps identify patterns, clusters, and outliers visually
- Noise Reduction:
- Lower-dimensional representations can filter out noise
- PCA can help separate signal from noise when the variance of the noise is smaller than the variance of the signal
- Preprocessing for Machine Learning:
- Mitigates curse of dimensionality
- Can improve performance of models sensitive to high dimensionality
- Reduces computational complexity and storage requirements
- Feature Extraction:
- Creates new features that better capture the underlying structure of data
- Often more informative than original features
- Multicollinearity Reduction:
- Addresses correlation among predictor variables in regression
- PCA specifically creates uncorrelated components (see the pipeline sketch after this list)
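A small sketch of PCA as a preprocessing step (scikit-learn assumed): the pipeline standardizes the predictors, replaces them with a few uncorrelated components, and then fits a linear regression.

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Standardize, keep 5 uncorrelated components, then regress
model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
print(cross_val_score(model, X, y, cv=5).mean())   # cross-validated R^2
```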
Comparing Dimensionality Reduction Techniques
- Linear vs. Non-linear:
- Linear methods (PCA, LDA): Preserve global structure, computationally efficient
- Non-linear methods (t-SNE, UMAP): Better at preserving local structure, capturing complex relationships
- Supervised vs. Unsupervised:
- Unsupervised (PCA, t-SNE): No target variable required
- Supervised (LDA): Incorporates class information
- Local vs. Global:
- Global (PCA): Preserves large pairwise distances
- Local (t-SNE, UMAP): Preserves small pairwise distances
- Selection considerations:
- Data size: Some methods (t-SNE) don't scale well to large datasets
- Interpretability: Some methods (PCA) produce more interpretable features
- Goal: Visualization vs. preprocessing vs. feature extraction