Regression
Derivation
- $y = X\beta + \epsilon$
- $\epsilon \sim N(0, \sigma^2I)$
- Linear Function Approximation
- $E(Y|X) = f(X) = \beta_0 + \sum_{j=1}^p \beta_j x_j$
- Model is linear in parameters
- Coefficient $\beta_j$ represents the expected change in response for a one-unit change in $x_j$, holding other predictors constant
- Minimize Residual Sum of Squares
- $RSS = \sum (y_i - f(x_i))^2 = (y - X\beta)^T(y - X\beta)$
- Optimal value of beta:
- $\frac{\partial RSS}{\partial \beta} = 0$
- $\hat \beta = (X^T X)^{-1}(X^Ty)$
- $\hat y = X \hat \beta = (X(X^T X)^{-1}X^T)y = H y$
- H is the projection or Hat matrix
- $H_{ii}$ are leverage values indicating influence of each observation
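- A minimal numpy sketch of the closed-form fit above; the data, sizes, and variable names are synthetic stand-ins of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N observations, p predictors, plus an intercept column.
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.standard_normal(N)       # y = X beta + eps

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T projects y onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y                                    # same as X @ beta_hat

# Diagonal entries of H are the leverage values.
leverage = np.diag(H)
print(beta_hat)
print(leverage.sum())                            # trace(H) = number of parameters
```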
Sampling Distribution of $\hat\beta$
- Deviations around the conditional mean are Gaussian
- $Var(\hat \beta) = (X^T X)^{-1} \sigma^2$
- Estimate $\sigma^2$ from the residuals, dividing by $N - p - 1$ so the estimate is unbiased
- $\hat \sigma^2 = \frac{1}{N-p-1} \sum (y_i - \hat y_i)^2$
- $\hat \beta \sim N(\beta, (X^T X)^{-1} \sigma^2)$
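- A short sketch of the variance estimate and the resulting standard errors, assuming the same kind of synthetic design matrix as above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.standard_normal(N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Unbiased estimate of sigma^2: divide by N - p - 1 (p predictors + 1 intercept).
sigma2_hat = resid @ resid / (N - p - 1)

# Var(beta_hat) = sigma^2 (X^T X)^{-1}; standard errors are sqrt of the diagonal.
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
print(sigma2_hat)
print(se)
```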
Statistical Significance
- $Z_i = \frac{\hat\beta_i}{SE_i} = \frac{\hat\beta_i}{\hat \sigma \sqrt{v_i}}$
- $v_i$ is the $i$-th diagonal element of $(X^T X)^{-1}$
- Under null hypothesis $H_0: \beta_i = 0$, $Z_i \sim t_{N-p-1}$
- Testing significance for a group of parameters
- e.g., a categorical predictor encoded with $k$ dummy variables, tested for exclusion all at once
- $F = \frac{(RSS_0 - RSS_1)/(p_1-p_0)}{RSS_1 / (N - p_1-1)}$
- $RSS_0$ is from restricted model, $RSS_1$ from full model
- $p_0$ and $p_1$ are the number of parameters in each model
- Under $H_0$, $F \sim F_{p_1-p_0, N-p_1-1}$
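- A sketch of both tests on synthetic data (the nested comparison drops the last two columns, an arbitrary choice for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, p = 200, 4
X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])
# The last two predictors have zero true effect, so the group F-test should not reject.
y = X @ np.array([1.0, 2.0, -1.0, 0.0, 0.0]) + rng.standard_normal(N)

def fit(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return beta, r @ r                           # coefficients and RSS

beta_full, rss1 = fit(X, y)                      # full model, p1 = 4 predictors
sigma2_hat = rss1 / (N - p - 1)
v = np.diag(np.linalg.inv(X.T @ X))

# Per-coefficient t statistics: beta_j / (sigma_hat * sqrt(v_j)), df = N - p - 1.
t_stats = beta_full / np.sqrt(sigma2_hat * v)
p_t = 2 * stats.t.sf(np.abs(t_stats), df=N - p - 1)

# F test for dropping the last two predictors together (restricted model, p0 = 2).
_, rss0 = fit(X[:, :-2], y)
df_num, df_den = 2, N - p - 1                    # (p1 - p0), (N - p1 - 1)
F = ((rss0 - rss1) / df_num) / (rss1 / df_den)
p_F = stats.f.sf(F, df_num, df_den)
print(t_stats, p_t)
print(F, p_F)
```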
Gauss-Markov Theorem
- Among all linear unbiased estimators, the least-squares estimates have the lowest variance (BLUE)
- Expected prediction error at a new input $x_0$: $E[(Y_0 - \hat Y_0)^2] = \sigma^2 + MSE(\hat f(x_0))$, i.e. irreducible noise plus the estimator's MSE (checked numerically in the sketch after this list)
- Assumptions required:
- Linearity of the true relationship
- Independence of errors
- Homoscedasticity (constant error variance)
- No perfect multicollinearity
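- A rough Monte Carlo check of the decomposition above on simulated data (the point $x_0$ and all sizes are arbitrary); since least squares is unbiased here, the MSE term reduces to the variance of $\hat f(x_0)$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 50, 2, 1.0
beta = np.array([1.0, 2.0, -1.0])
x0 = np.array([1.0, 0.5, -0.3])                  # a new input (first entry = intercept)

sq_err, fits = [], []
for _ in range(20000):
    X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])
    y = X @ beta + sigma * rng.standard_normal(N)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    f0_hat = x0 @ beta_hat                       # fitted value at x0
    y0 = x0 @ beta + sigma * rng.standard_normal()   # a fresh observation at x0
    fits.append(f0_hat)
    sq_err.append((y0 - f0_hat) ** 2)

fits = np.array(fits)
mse_f0 = np.mean((fits - x0 @ beta) ** 2)        # MSE(f_hat(x0)); bias is ~0 here
print(np.mean(sq_err), sigma**2 + mse_f0)        # the two numbers should roughly agree
```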
Subset Selection
- Select only a few variables for better interpretability
- The best subset of size $k$ is the set of $k$ predictors yielding the minimum RSS (an exhaustive search over subsets)
- Forward Selection
- Sequentially add one variable that most improves the fit
- QR decomposition / successive orthogonalization makes it cheap to evaluate each candidate's correlation with the current residual
- Computationally efficient, but greedy, so it may miss the optimal subset
- Backward Selection
- Sequentially delete the variable that has least impact on the fit
- Drop the variable with the smallest Z-score at each step
- Requires starting with all variables (can't be used when N < p)
- Hybrid Stepwise Selection
- Consider both forward and backward moves at each step
- Use a criterion such as AIC to weigh the candidate moves
- Better exploration of the model space
- Forward Stagewise Selection
- Add the variable most correlated with current residual
- Don't re-adjust the coefficients of the existing variables
- Similar to gradient descent in function space
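- A rough sketch of forward stagewise on standardized synthetic inputs (the number of passes is arbitrary): each pass updates only the coefficient of the variable most correlated with the current residual, without re-fitting the others:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 100, 6
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized inputs
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(N)
y = y - y.mean()                                  # centered response (no intercept needed)

beta = np.zeros(p)
for _ in range(200):                              # number of passes
    r = y - X @ beta                              # current residual
    corr = X.T @ r                                # proportional to correlations with r
    j = np.argmax(np.abs(corr))                   # variable most correlated with residual
    # Update only beta_j by the simple regression coefficient of r on x_j;
    # the existing coefficients are left untouched.
    beta[j] += (X[:, j] @ r) / (X[:, j] @ X[:, j])

print(np.round(beta, 3))
```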
Shrinkage Methods
- Shrinkage methods result in biased estimators but a large reduction in variance
- More continuous than subset selection, so they don't suffer from its high variability
- Ridge Regression
- Impose a penalty on the size of the coefficients
- $\hat \beta^{\text{ridge}} = \arg \min (y - X \beta)^T(y - X \beta) + \lambda \sum_j \beta_j^2$
- Equivalently, $\hat \beta^{\text{ridge}} = \arg \min (y - X \beta)^T(y - X \beta)$ subject to $\sum_j \beta_j^2 \le t$
- t is the budget
- With correlated variables, the least-squares coefficients are poorly determined and have high variance
- A large positive coefficient on one variable can be canceled by a large negative coefficient on a correlated variable
- The solution is not invariant to input scaling, so standardize the inputs; the intercept is left out of the penalty
- $\hat \beta^{\text{ridge}} = (X^T X + \lambda I)^{-1}(X^Ty)$
- With correlated predictors, $X^T X$ may be singular or nearly so; adding the constant $\lambda$ to its diagonal makes it nonsingular and invertible
- SVD view: with $X = U D V^T$, the ridge fit is $X\hat{\beta}^{\text{ridge}} = \sum_{j=1}^p u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y$
- Each principal-component direction of $X$ is shrunk by the factor $d_j^2/(d_j^2 + \lambda)$; low-variance directions are shrunk the most
- As $\lambda$ increases, coefficients shrink toward zero but not exactly zero
- For orthonormal inputs (e.g., principal component scores), the ridge estimates are just scaled least-squares estimates: $\hat\beta^{\text{ridge}} = \hat\beta / (1 + \lambda)$
- $\lambda$ controls the effective degrees of freedom, $df(\lambda) = \sum_j d_j^2/(d_j^2 + \lambda)$; a large $\lambda$ shrinks all coefficients heavily without setting any exactly to zero (see the numerical sketch at the end of this section)
- Lasso Regression
- $\hat \beta^{\text{lasso}} = \arg \min (y - X \beta)^T(y - X \beta) + \lambda \sum |\beta|$
- No closed-form solution; the problem is convex but non-smooth (solved e.g. by quadratic programming, LARS, or coordinate descent)
- A heavy restriction on budget makes some coefficients exactly zero
- Continuous subset selection
- Comparison between Ridge and Lasso
- The ridge constraint region is a disk: $\beta_1^2 + \beta_2^2 \le t$
- The lasso constraint region is a rhombus: $|\beta_1| + |\beta_2| \le t$
- Because the rhombus has corners on the axes, the optimum can land on a corner, making some estimated parameters exactly zero (corner solutions)
- Bayesian MAP estimates with different priors
- Lasso has Laplace Prior ($p(\beta) \propto e^{-\alpha|\beta|}$)
- Ridge has Gaussian Prior ($p(\beta) \propto e^{-\alpha\beta^2/2}$)
- Elastic Net
- Penalty: $\lambda \sum_j \big( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big)$
- Variable selection like Lasso
- Shrinking coefficients like Ridge
- Better handles groups of correlated predictors
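- A numerical sketch of ridge regression on standardized, centered synthetic data (the $\lambda$ grid is arbitrary), comparing the closed form with the SVD shrinkage factors and the effective degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 100, 5
X = rng.standard_normal((N, p))
# Make two predictors strongly correlated to show the stabilizing effect.
X[:, 1] = X[:, 0] + 0.05 * rng.standard_normal(N)
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized inputs
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.standard_normal(N)
y = y - y.mean()                                  # intercept handled by centering

U, d, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(d) V^T

for lam in [0.0, 1.0, 10.0, 100.0]:
    # Closed form: (X^T X + lambda I)^{-1} X^T y
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    # SVD view: each principal direction is shrunk by d_j^2 / (d_j^2 + lambda).
    shrink = d**2 / (d**2 + lam)
    df_lam = shrink.sum()                         # effective degrees of freedom
    print(lam, np.round(beta_ridge, 3), round(df_lam, 2))
```

- With correlated columns, the directions with small singular values get the strongest shrinkage, which is what stabilizes the coefficients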
Partial Least Squares
- An alternative to principal component regression for dealing with correlated features
- Supervised transformation
- Principal component regression seeks directions that have high variance
- Partial least squares seeks directions with high variance and high correlation with the response
- New features are linear combinations of the inputs, weighted by each input's correlation with the response
- Algorithm:
- Standardize the columns of X; center y, keeping $\bar{y}$ for the final prediction
- For m = 1,2,...M:
- Compute weight vector $w_m \propto X^T_{m-1}y$
- Create score vector $z_m = X_{m-1}w_m$
- Regress y on $z_m$ to get coefficient $\hat{\phi}_m$
- Regress each column of $X_{m-1}$ on $z_m$ to get loadings $\hat{p}_m$
- Orthogonalize: $X_m = X_{m-1} - z_m\hat{p}_m^T$
- Final prediction: $\hat{y} = \bar{y} + \sum_{m=1}^M \hat{\phi}_m z_m$
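- A direct sketch of the algorithm above in numpy (synthetic data; the number of components M is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, M = 100, 5, 3
X = rng.standard_normal((N, p))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.standard_normal(N)

# Standardize X, center y (keep the mean for the final prediction).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_bar = y.mean()
y_c = y - y_bar

Xm = X_std.copy()
y_hat = np.full(N, y_bar)
for m in range(M):
    w = Xm.T @ y_c                               # weights ~ correlation with response
    z = Xm @ w                                   # score vector (derived feature)
    phi = (z @ y_c) / (z @ z)                    # regress y on z
    y_hat = y_hat + phi * z                      # accumulate the fit
    loadings = (Xm.T @ z) / (z @ z)              # regress each column of X on z
    Xm = Xm - np.outer(z, loadings)              # orthogonalize the remaining X

print(np.round(y_hat[:5], 3))
print(np.round(y[:5], 3))
```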