Model Selection
Maximum Likelihood
- Maximum Likelihood Inference
- Parametric Model
- Random variable $z_i \sim g_\theta(z)$
- Unknown Parameters $\theta$, e.g. $\theta = (\mu, \sigma^2)$ for a Gaussian model
- Likelihood Function
- $L(\theta; Z) = \prod_{i=1}^N g_\theta(z_i)$
- Probability of observed data under the model $g_\theta$
- Usually work with log-likelihood: $\ell(\theta; Z) = \sum_{i=1}^N \log g_\theta(z_i)$
- Maximize the Likelihood function
- Select the parameters $\theta$ such that the probability of observed data is maximized under the model
- For many distributions, has closed-form solution
- Score Function $S(\theta) = \frac{\partial \ell}{\partial \theta}$
- Vector of partial derivatives of log-likelihood
- At MLE: $S(\hat{\theta}) = 0$
- Information Matrix $I(\theta) = -\frac{\partial^2 \ell}{\partial \theta \, \partial \theta^T}$
- Curvature of log-likelihood
- Measures information data contains about parameters
- Evaluated at the MLE, $I(\hat\theta)$ is the observed information
- Fisher Information $i(\theta) = E_\theta[I(\theta)]$ (expected information)
- Sampling Distribution of MLE has limiting normal distribution
- $\hat\theta \to N(\theta_0, i(\theta_0)^{-1})$, where $\theta_0$ is the true parameter value
- Asymptotic result (as N → ∞)
- Allows construction of confidence intervals and hypothesis tests
- OLS estimates are equivalent to MLE estimates for Linear Regression with Gaussian errors
- $\text{Var}(\hat \beta) = \sigma^2 (X^TX)^{-1}$; for simple regression the slope variance reduces to $\sigma^2 / S_{xx}$
- $\text{Var}(\hat y_i) = \sigma^2 x_i^T (X^TX)^{-1} x_i$
- For non-Gaussian errors with zero mean, OLS still gives unbiased estimates but is no longer the MLE and may not be efficient
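
The maximum-likelihood quantities above can be checked numerically for the Gaussian case. The following is a minimal sketch on synthetic data (the sample, seed, and variable names are illustrative assumptions, not part of the notes): it computes the closed-form MLE, evaluates the score at the MLE, and uses the Fisher information $\text{diag}(N/\sigma^2, N/(2\sigma^4))$ to build an approximate 95% interval for $\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.5, size=500)  # synthetic data, illustrative only
n = z.size

# Closed-form Gaussian MLEs: sample mean and the 1/N (not 1/(N-1)) variance.
mu_hat = z.mean()
sigma2_hat = ((z - mu_hat) ** 2).mean()


def log_likelihood(mu, sigma2):
    """Gaussian log-likelihood of the sample z."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - ((z - mu) ** 2).sum() / (2 * sigma2)


def score(mu, sigma2):
    """Partial derivatives of the log-likelihood w.r.t. mu and sigma^2."""
    d_mu = (z - mu).sum() / sigma2
    d_sigma2 = -0.5 * n / sigma2 + ((z - mu) ** 2).sum() / (2 * sigma2 ** 2)
    return np.array([d_mu, d_sigma2])


# Fisher information for (mu, sigma^2), evaluated at the MLE.
fisher = np.diag([n / sigma2_hat, n / (2 * sigma2_hat ** 2)])
std_err = np.sqrt(np.diag(np.linalg.inv(fisher)))

# Approximate 95% confidence interval for mu from the asymptotic normal result.
ci_mu = (mu_hat - 1.96 * std_err[0], mu_hat + 1.96 * std_err[0])

print("MLE:", mu_hat, sigma2_hat)
print("score at MLE:", score(mu_hat, sigma2_hat))
print("95% CI for mu:", ci_mu)
```

The score evaluates to zero (up to floating-point error) at the MLE, and the standard error of $\hat\mu$ matches the familiar $\hat\sigma / \sqrt{N}$.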
Bootstrap
- Bootstrap assesses uncertainty by resampling the training data with replacement
- Refit the model on each bootstrap dataset
- Calculate the variance of the estimates for the $i$th observation across these models
- Provides empirical sampling distribution when theoretical one is unavailable
- Non-Parametric Bootstrap
- Uses raw data for sampling, model free
- Makes minimal assumptions about data distribution
- Approaches:
- Case resampling: Sample observations with replacement
- Residual resampling: Resample residuals and add to fitted values
- Parametric Bootstrap
- Simulate new target variable by adding Gaussian noise to predicted values from model
- Predictions estimated from this sampling will follow Gaussian distribution
- Assumes error distribution is correctly specified
- Computational implementation of maximum-likelihood inference
- Usable when no analytical formulae for standard errors are available
- Especially useful for complex models or statistics
- Bootstrap mean approximates the posterior average in Bayesian inference
- Under certain conditions (e.g., a non-informative prior), the bootstrap distribution has a Bayesian interpretation as an approximate posterior
- Bagging averages predictions over collection of bootstrap samples
- Reduces variance of estimates
- Bagging often decreases mean-squared error
- Most effective for high-variance, low-bias models (like decision trees)
- Bootstrap confidence intervals
- Percentile method: Use empirical quantiles from bootstrap distribution
- BCa method: Bias-corrected and accelerated, adjusts for bias and skewness
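
As an illustration of case resampling and the percentile method described above, here is a minimal sketch that bootstraps the sample median of a skewed synthetic sample. The data, the helper name `bootstrap_statistic`, and the choice of 2000 replicates are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=200)  # synthetic skewed sample


def bootstrap_statistic(sample, stat_fn, n_boot, rng):
    """Non-parametric bootstrap: resample cases with replacement, recompute stat_fn."""
    n = sample.size
    idx = rng.integers(0, n, size=(n_boot, n))  # each row indexes one bootstrap dataset
    return np.array([stat_fn(sample[row]) for row in idx])


boot_medians = bootstrap_statistic(data, np.median, n_boot=2000, rng=rng)

# Bootstrap standard error: spread of the statistic across bootstrap datasets.
boot_se = boot_medians.std(ddof=1)

# Percentile method: empirical 2.5% and 97.5% quantiles of the bootstrap distribution.
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

print("sample median:", np.median(data))
print("bootstrap SE:", boot_se)
print("95% percentile CI:", (ci_low, ci_high))
```

The median is a convenient test case because it has no simple closed-form standard error, which is the situation in which the bootstrap is most useful.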
Bayesian Methods
- Assume a prior distribution over unknown parameters
- $P(\theta)$
- Encodes initial beliefs about parameters before seeing data
- Types:
- Informative priors: strong beliefs about parameters
- Non-informative priors: minimal assumptions (e.g., uniform)
- Conjugate priors: result in posterior of same family as prior
- Sampling Distribution of data given the parameters
- $P(Z | \theta)$
- Likelihood function from frequentist approach
- Posterior Distribution
- Updated knowledge of parameters after seeing the data
- $P(\theta | Z) \propto P(Z | \theta) \times P(\theta)$
- Full distribution rather than point estimate
- Permits probabilistic statements about parameters
- Predictive Distribution
- Predicting values of new unseen observations
- $P(z | Z) = \int P(z | \theta) P(\theta | Z) d\theta$
- Integrates over all possible parameter values, weighted by posterior
- Accounts for parameter uncertainty unlike plug-in estimates
- MAP Estimate
- Maximum a posteriori, a point estimate of the unknown parameters
- Select the parameters that maximize posterior density function
- $\hat \theta = \arg \max P(\theta | Z)$
- Compromise between MLE and fully Bayesian approach
- MAP differs from frequentist approaches (like MLE) in its use of prior distribution
- Prior Distribution acts as regularization
- MAP for linear regression with Gaussian priors yields Ridge Regression
- MAP for linear regression with Laplace priors yields Lasso Regression
- Hierarchical Bayesian models
- Place priors on hyperparameters
- Allows borrowing strength across groups
- Naturally handles multilevel/grouped data
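
The link between MAP estimation and ridge regression noted above can be verified numerically. This sketch assumes a linear model with known noise variance $\sigma^2$ and a zero-mean Gaussian prior $\beta \sim N(0, \tau^2 I)$; under those assumptions the MAP estimate is the ridge solution with penalty $\lambda = \sigma^2 / \tau^2$. The data and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25  # assumed known noise variance
tau2 = 1.0     # assumed prior variance of each coefficient
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)


def neg_log_posterior_grad(beta):
    """Gradient of ||y - X beta||^2 / (2 sigma^2) + ||beta||^2 / (2 tau^2)."""
    return -(X.T @ (y - X @ beta)) / sigma2 + beta / tau2


# Ridge closed form with lambda = sigma^2 / tau^2.
lam = sigma2 / tau2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ordinary least squares (the MLE) for comparison.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("MAP / ridge:", beta_map)
print("OLS / MLE:  ", beta_ols)
print("posterior gradient at MAP:", neg_log_posterior_grad(beta_map))
```

The gradient of the negative log-posterior vanishes at the ridge solution, and the prior shrinks the coefficient vector toward zero relative to OLS.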
EM Algorithm
- Simplifies difficult MLE problems involving latent variables
- Applications:
- Missing data
- Mixture models
- Latent variable models
- Hidden Markov models
- Bimodal Data Distribution
- $Y_1 \sim N(\mu_1, \sigma^2_1)$
- $Y_2 \sim N(\mu_2, \sigma^2_2)$
- $Y = (1 - \Delta) Y_1 + \Delta Y_2$
- $\Delta \in \{0,1\}$
- $P(\Delta = 1) = \pi$
- Density function of Y
- $g_Y(y) = (1 - \pi) \phi_1(y) + \pi \phi_2(y)$
- Direct maximization of likelihood difficult
- Sum operation inside log
- $\log L(\theta) = \sum_{i=1}^N \log[(1-\pi)\phi_1(y_i) + \pi\phi_2(y_i)]$
- Responsibility
- $\Delta_i$ is latent for a given observation
- $\gamma_i(\theta) = P(\Delta_i = 1 | y_i, \theta)$
- Soft Assignments
- Posterior probability of component membership
- EM Algorithm
- Take Initial Guesses for parameters
- Sample Mean, Sample Variances, Proportion
- Can use K-means or random initialization
- Expectation Step: Compute the responsibility
- $\hat \gamma_i = \frac{\hat \pi \phi_2(y_i)}{(1 - \hat \pi) \phi_1(y_i) + \hat \pi \phi_2(y_i)}$
- Calculate expected value of log-likelihood with respect to latent variables
- Maximization Step: Compute the weighted means and variances, and mixing probability
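- $\hat \mu_1 = \frac{\sum (1 - \hat \gamma_i) y_i}{\sum (1 - \hat \gamma_i)}$, $\hat \sigma_1^2 = \frac{\sum (1 - \hat \gamma_i) (y_i - \hat \mu_1)^2}{\sum (1 - \hat \gamma_i)}$
- $\hat \mu_2 = \frac{\sum \hat \gamma_i y_i}{\sum \hat \gamma_i}$, $\hat \sigma_2^2 = \frac{\sum \hat \gamma_i (y_i - \hat \mu_2)^2}{\sum \hat \gamma_i}$
- $\hat \pi = \frac{\sum \hat \gamma_i}{N}$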
- Maximize the expected log-likelihood from E-step
- Iterate until convergence
- Properties:
- Monotonic likelihood increase
- Converges to a stationary point of the likelihood (typically a local maximum), not necessarily the global maximum
- Multiple restarts may be needed to find global maximum
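
The E-step and M-step for the two-component mixture can be sketched as follows. This is a minimal illustration on synthetic data; the initialization (smallest and largest observation as starting means), the stopping tolerance, and the function names are assumptions for the example rather than prescribed choices.

```python
import numpy as np


def normal_pdf(y, mu, sigma2):
    """Gaussian density with mean mu and variance sigma2."""
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)


def em_two_gaussians(y, n_iter=200, tol=1e-8):
    # Initial guesses: extreme points as means, overall variance, equal mixing.
    mu1, mu2 = y.min(), y.max()
    s1 = s2 = y.var()
    pi = 0.5
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibility of component 2 for each observation.
        p1 = (1 - pi) * normal_pdf(y, mu1, s1)
        p2 = pi * normal_pdf(y, mu2, s2)
        log_liks.append(np.log(p1 + p2).sum())  # log-likelihood at current parameters
        gamma = p2 / (p1 + p2)
        # M-step: weighted means, weighted variances, and mixing probability.
        w1, w2 = (1 - gamma).sum(), gamma.sum()
        mu1 = ((1 - gamma) * y).sum() / w1
        mu2 = (gamma * y).sum() / w2
        s1 = ((1 - gamma) * (y - mu1) ** 2).sum() / w1
        s2 = (gamma * (y - mu2) ** 2).sum() / w2
        pi = w2 / y.size
        # Stop once the log-likelihood has essentially stopped increasing.
        if len(log_liks) > 1 and log_liks[-1] - log_liks[-2] < tol:
            break
    return mu1, mu2, s1, s2, pi, log_liks


rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.5, 200)])
mu1, mu2, s1, s2, pi, log_liks = em_two_gaussians(y)
print("means:", mu1, mu2, "variances:", s1, s2, "pi:", pi)
```

Recording the log-likelihood at every iteration makes the monotonic-increase property easy to inspect; in practice the routine would be rerun from several starting points.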
MCMC
- Given a set of random variables $U_1, U_2, \ldots, U_K$
- Sampling from joint distribution is difficult
- Sampling from conditional distribution is easy
- For example, Bayesian inference
- Target: joint posterior of all parameters $P(\theta | Z)$
- Full conditional of each parameter given the rest, $P(\theta_k | \theta_{-k}, Z)$, is often easy to sample
- Gibbs Sampling
- Take some initial values of the RVs $U_k^{(0)}$
- Draw from conditional Distribution
- $P(U_1 | U_2^{(t)}, U_3^{(t)},..., U_K^{(t)})$
- $P(U_2 | U_1^{(t+1)}, U_3^{(t)},..., U_K^{(t)})$
- And so on, updating each variable in turn
- Continue until the joint distribution doesn't change
- Markov Chain whose stationary distribution is the true joint distribution
- Markov Chain Monte Carlo
- Metropolis-Hastings Algorithm
- More general MCMC approach than Gibbs sampling
- Steps:
- Generate proposal $\theta^* \sim q(\theta^*|\theta^{(t)})$
- Calculate acceptance ratio $r = \min\left(1, \frac{p(\theta^*|Z)q(\theta^{(t)}|\theta^*)}{p(\theta^{(t)}|Z)q(\theta^*|\theta^{(t)})}\right)$
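- Accept proposal with probability $r$; otherwise keep $\theta^{(t+1)} = \theta^{(t)}$
- Special cases include random-walk proposals and independence proposals (which do not depend on the current state)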
- Practical considerations
- Burn-in period: discard initial samples
- Thinning: use every kth sample to reduce autocorrelation
- Convergence diagnostics: trace plots, Gelman-Rubin statistic
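
The Metropolis-Hastings steps above can be sketched with a random-walk proposal. Because a Gaussian random walk is symmetric, the $q$ terms in the acceptance ratio cancel. The target here is an assumed toy problem (posterior of a Gaussian mean with unit noise variance and an $N(0, 100)$ prior), chosen because the exact posterior is known and can be compared with the samples; step size, chain length, and burn-in are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(loc=1.0, scale=1.0, size=50)  # synthetic data, noise variance assumed 1
prior_var = 100.0                            # assumed N(0, 100) prior on the mean


def log_posterior(theta):
    """Unnormalized log-posterior: Gaussian log-likelihood plus Gaussian log-prior."""
    log_lik = -0.5 * ((z - theta) ** 2).sum()
    log_prior = -0.5 * theta ** 2 / prior_var
    return log_lik + log_prior


def metropolis_hastings(log_target, theta0, n_samples, step, rng):
    samples = np.empty(n_samples)
    theta, logp = theta0, log_target(theta0)
    n_accept = 0
    for t in range(n_samples):
        # Random-walk proposal (symmetric, so the proposal densities cancel).
        proposal = theta + step * rng.normal()
        logp_prop = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(current)), on the log scale.
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = proposal, logp_prop
            n_accept += 1
        samples[t] = theta  # on rejection the current value is repeated
    return samples, n_accept / n_samples


samples, accept_rate = metropolis_hastings(log_posterior, 0.0, 20000, 0.4, rng)
kept = samples[2000:]  # discard burn-in

# Exact conjugate posterior for comparison.
post_var = 1.0 / (z.size + 1.0 / prior_var)
post_mean = post_var * z.sum()
print("MCMC mean/sd: ", kept.mean(), kept.std())
print("exact mean/sd:", post_mean, np.sqrt(post_var))
print("acceptance rate:", accept_rate)
```

Working with log densities avoids numerical underflow, and only the unnormalized posterior is needed because the normalizing constant cancels in the ratio.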
- Gibbs Sampling is related to EM algorithm
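- Generate $\Delta_i \in \{0,1\}$ using $P(\Delta_i = 1) = \gamma_i (\theta)$
- Calculate the component means from the sampled assignments, then draw new $\mu_1, \mu_2$ from their conditional distributions centered at these means
- $\hat \mu_1 = \frac{\sum (1 - \Delta_i) y_i}{\sum (1 - \Delta_i)}$
- $\hat \mu_2 = \frac{\sum \Delta_i y_i}{\sum \Delta_i}$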
- Keep repeating until the joint distribution doesn't change
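- Gibbs samples the latent $\Delta_i$ where EM uses their expectations $\gamma_i$; EM maximizes the likelihood (or finds the posterior mode when a prior is included), while MCMC explores the full posterior distribution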
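
The Gibbs counterpart of EM for the two-component mixture can be sketched as below. To keep the example short it assumes the variances and mixing probability are known and places a flat prior on each mean, so the conditional of $\mu_k$ given the assignments is $N(\hat\mu_k, \sigma_k^2 / n_k)$. These assumptions, the synthetic data, and the guard for an empty component are choices made for this illustration.

```python
import numpy as np


def normal_pdf(y, mu, sigma2):
    """Gaussian density with mean mu and variance sigma2."""
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)


def gibbs_two_gaussians(y, sigma2_1, sigma2_2, pi, n_iter, rng):
    mu1, mu2 = y.min(), y.max()  # initial values
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Responsibilities, as in the E-step of EM.
        p1 = (1 - pi) * normal_pdf(y, mu1, sigma2_1)
        p2 = pi * normal_pdf(y, mu2, sigma2_2)
        gamma = p2 / (p1 + p2)
        # Sample hard assignments: Delta_i = 1 with probability gamma_i.
        delta = rng.uniform(size=y.size) < gamma
        n1, n2 = (~delta).sum(), delta.sum()
        # Draw each mean from its conditional given the assignments (flat prior).
        # Pragmatic guard (an assumption): keep the old value if a component is empty.
        if n1 > 0:
            mu1 = rng.normal(y[~delta].mean(), np.sqrt(sigma2_1 / n1))
        if n2 > 0:
            mu2 = rng.normal(y[delta].mean(), np.sqrt(sigma2_2 / n2))
        draws[t] = mu1, mu2
    return draws


rng = np.random.default_rng(5)
y = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.5, 200)])
draws = gibbs_two_gaussians(y, sigma2_1=1.0, sigma2_2=2.25, pi=0.4, n_iter=2000, rng=rng)
posterior_means = draws[500:].mean(axis=0)  # discard burn-in
print("posterior means of (mu1, mu2):", posterior_means)
print("posterior sds of (mu1, mu2):  ", draws[500:].std(axis=0))
```

Unlike EM, which returns a single estimate, the retained draws describe the spread of plausible values for each mean.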