Statistics
Statistics is the science of learning from data. This chapter covers the key concepts for estimating model parameters and quantifying uncertainty in those estimates.
The Big Picture
Inference: Quantifying uncertainty about unknown quantities using finite data samples.
Two major paradigms:
- Frequentist: Parameters are fixed; uncertainty comes from random sampling
- Bayesian: Parameters are random variables with prior distributions
Maximum Likelihood Estimation (MLE)
The most common approach to parameter estimation: choose parameters that make the observed data most probable.
The Setup
Given:
- Data: $D = \{x_1, x_2, \ldots, x_N\}$ (assumed i.i.d.)
- Parametric model: $p(x | \theta)$
The Likelihood Function
$$L(\theta; D) = p(D | \theta) = \prod_{i=1}^N p(x_i | \theta)$$
Key insight: We treat the data as fixed and vary θ. For which θ was this data most likely?
Log-Likelihood
Products of many small probabilities are numerically unstable (they underflow), so we convert the product to a sum of logs:
$$\ell(\theta; D) = \log L(\theta; D) = \sum_{i=1}^N \log p(x_i | \theta)$$
The MLE Estimate
$$\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta; D) = \arg\min_\theta -\ell(\theta; D)$$
Equivalently: minimize the Negative Log-Likelihood (NLL).
Why MLE Works
Theoretical justifications:
- Bayesian view: MLE is the MAP estimate under a uniform (uninformative) prior
- Information-theoretic view: MLE minimizes the KL divergence between the empirical distribution and the model
Sufficient Statistics
A sufficient statistic summarizes all information in the data relevant to estimating θ.
Example (Bernoulli): For $N$ coin flips, the sufficient statistics are:
- $N_1$ = number of heads
- $N_0$ = number of tails
You don't need to know the order of the flips!
MLE Examples
Bernoulli Distribution
Model: $p(y | \theta) = \theta^y (1-\theta)^{1-y}$
NLL: $$\text{NLL}(\theta) = -[N_1 \log\theta + N_0 \log(1-\theta)]$$
Setting the derivative to zero gives: $$\hat{\theta}_{MLE} = \frac{N_1}{N_0 + N_1} = \frac{\text{# heads}}{\text{# flips}}$$
Intuitive result: Estimate probability as observed frequency.
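To make this concrete, here is a minimal sketch (not from the text: synthetic coin flips with an assumed true bias of 0.7, using NumPy and SciPy) that checks the closed-form estimate against a direct numerical minimization of the NLL:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.7, size=100)   # synthetic data: 100 flips, true theta = 0.7
N1 = flips.sum()                         # sufficient statistics: number of heads...
N0 = len(flips) - N1                     # ...and number of tails

def nll(theta):
    # NLL(theta) = -[N1 log(theta) + N0 log(1 - theta)]
    return -(N1 * np.log(theta) + N0 * np.log(1 - theta))

theta_closed_form = N1 / (N1 + N0)
theta_numeric = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded").x

print(theta_closed_form, theta_numeric)  # both close to 0.7
```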
Gaussian Distribution
Model: $p(y | \mu, \sigma^2) = \mathcal{N}(y | \mu, \sigma^2)$
MLE estimates: $$\hat{\mu} = \frac{1}{N}\sum_{i=1}^N y_i \quad \text{(sample mean)}$$ $$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N (y_i - \hat{\mu})^2 \quad \text{(sample variance, dividing by } N\text{)}$$
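As a quick sanity check (a sketch on synthetic NumPy data, not from the text), the MLE is just the sample mean and the divide-by-$N$ variance:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=3.0, size=1000)   # synthetic data: mu = 2, sigma^2 = 9

mu_hat = y.mean()                               # MLE of the mean
sigma2_hat = np.mean((y - mu_hat) ** 2)         # MLE of the variance (equivalently y.var(ddof=0))

print(mu_hat, sigma2_hat)                       # approximately 2.0 and 9.0
```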
Linear Regression
Model: $p(y | x, w, b, \sigma^2) = \mathcal{N}(y | w^T x + b, \sigma^2)$
NLL is proportional to: $$\text{NLL} \propto \sum_{i=1}^N (y_i - w^T x_i - b)^2$$
Key insight: MLE for Gaussian regression = minimize squared error!
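A minimal sketch (synthetic data, NumPy only) showing that the Gaussian-NLL solution coincides with ordinary least squares via the normal equations:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ w_true + b_true + rng.normal(scale=0.1, size=N)

Xb = np.hstack([X, np.ones((N, 1))])             # absorb the bias b into the design matrix
w_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)     # normal equations (minimize squared error)
w_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(w_hat)     # close to [1.0, -2.0, 0.5, 0.3]
print(w_lstsq)   # same solution up to numerical precision
```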
Empirical Risk Minimization
ERM generalizes MLE beyond log-loss to any loss function.
Definition
$$\hat{\theta}_{ERM} = \arg\min_\theta \frac{1}{N}\sum_{i=1}^N \ell(y_i, f(x_i; \theta))$$
Common loss functions:
- Log-loss: gives MLE
- Squared loss: regression
- 0-1 loss: classification error (one minus accuracy)
- Hinge loss: SVMs
Surrogate Losses
The 0-1 loss is non-differentiable (its gradient is zero almost everywhere), so we optimize smooth surrogate losses that are easier to work with:
- Log-loss (cross-entropy)
- Hinge loss
- Exponential loss
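To illustrate (a sketch with synthetic labels and an arbitrary candidate weight vector, not from the text), the empirical risk under the 0-1 loss and its common surrogates can all be written in terms of the margin $m_i = y_i f(x_i)$ with $y_i \in \{-1, +1\}$:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # synthetic labels in {-1, +1}
w = np.array([0.8, 1.2])                          # some candidate parameters
margin = y * (X @ w)                              # m_i = y_i * f(x_i)

risk_01    = np.mean(margin <= 0)                 # 0-1 loss: not differentiable in w
risk_hinge = np.mean(np.maximum(0, 1 - margin))   # hinge loss (SVMs)
risk_log   = np.mean(np.log1p(np.exp(-margin)))   # log-loss / cross-entropy
risk_exp   = np.mean(np.exp(-margin))             # exponential loss

print(risk_01, risk_hinge, risk_log, risk_exp)    # each is an empirical risk for the same w
```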
Online Learning
When data arrives sequentially, we can't afford to retrain from scratch each time.
Recursive Updates
Many statistics can be updated incrementally:
Running mean: $$\mu_t = \mu_{t-1} + \frac{1}{t}(x_t - \mu_{t-1})$$
Exponentially Weighted Moving Average (EWMA): $$\mu_t = \beta \mu_{t-1} + (1-\beta) x_t$$
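Both updates need only one number of state per statistic; a minimal sketch over a synthetic stream (assumptions: NumPy, EWMA initialized at the first observation to limit start-up bias):

```python
import numpy as np

rng = np.random.default_rng(4)
stream = rng.normal(loc=5.0, scale=1.0, size=10_000)   # data arriving one point at a time

mu = 0.0
ewma = stream[0]   # initialize EWMA near the data scale to limit start-up bias
beta = 0.99

for t, x in enumerate(stream, start=1):
    mu += (x - mu) / t                   # running mean: mu_t = mu_{t-1} + (x_t - mu_{t-1}) / t
    ewma = beta * ewma + (1 - beta) * x  # EWMA: recent observations weighted more heavily

print(mu, stream.mean())   # identical up to floating point
print(ewma)                # close to 5.0, but would track drift in the stream
```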
Regularization
MLE can overfit: the estimated model fits the training data too closely and then fails to generalize to new data.
The Problem
- Empirical distribution ≠ true distribution
- MLE finds parameters optimal for the empirical distribution
- May not generalize well
The Solution: Add a Penalty
$$\hat{\theta} = \arg\min_\theta \left[\text{NLL}(\theta) + \lambda R(\theta)\right]$$
Where $R(\theta)$ penalizes complex models.
MAP Estimation
From the Bayesian view, regularization corresponds to adding a prior:
$$\hat{\theta}_{MAP} = \arg\max_\theta [p(D | \theta) \cdot p(\theta)]$$
Taking logs: $$\hat{\theta}_{MAP} = \arg\min_\theta [-\log p(D|\theta) - \log p(\theta)]$$
Examples:
- Gaussian prior → L2 regularization (Ridge)
- Laplace prior → L1 regularization (Lasso)
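Concretely, a zero-mean Gaussian prior on the weights turns the MAP objective into ridge regression, which has a closed form. A minimal sketch (synthetic data; the single value of λ here lumps together the prior and noise variances):

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 50, 10
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + rng.normal(scale=0.5, size=N)

lam = 1.0   # regularization strength (ratio of noise variance to prior variance)

w_mle = np.linalg.solve(X.T @ X, X.T @ y)                    # plain MLE / least squares
w_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)  # MAP with Gaussian prior = ridge

print(np.linalg.norm(w_mle), np.linalg.norm(w_map))          # the prior shrinks the weights toward zero
```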
Choosing Regularization Strength
The regularization parameter λ controls the bias-variance trade-off.
Methods to choose λ:
- Validation set: Test on held-out data
- Cross-validation: For small datasets
- One Standard Error Rule: Choose simplest model within one SE of best
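A minimal validation-set sketch (synthetic data, the ridge solver from above, and an arbitrary log-spaced grid of λ values):

```python
import numpy as np

rng = np.random.default_rng(6)
N, D = 120, 20
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + rng.normal(scale=1.0, size=N)

X_tr, y_tr = X[:80], y[:80]   # training split
X_va, y_va = X[80:], y[80:]   # held-out validation split

def fit_ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = np.logspace(-3, 2, 20)
val_mse = [np.mean((X_va @ fit_ridge(X_tr, y_tr, lam) - y_va) ** 2) for lam in lambdas]
best_lam = lambdas[int(np.argmin(val_mse))]   # smallest validation error wins
print(best_lam)
```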
Early Stopping
Another form of regularization: stop training before the model overfits.
- Monitor validation error
- Stop when it starts increasing
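A minimal sketch of this loop for gradient descent on a least-squares objective (synthetic data; the learning rate and patience threshold are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
N, D = 100, 30
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + rng.normal(scale=1.0, size=N)
X_tr, y_tr, X_va, y_va = X[:70], y[:70], X[70:], y[70:]

w = np.zeros(D)
lr, patience = 1e-3, 20
best_val, best_w, since_best = np.inf, w.copy(), 0

for step in range(10_000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # gradient of the training MSE
    w -= lr * grad
    val = np.mean((X_va @ w - y_va) ** 2)           # monitor validation error
    if val < best_val:
        best_val, best_w, since_best = val, w.copy(), 0
    else:
        since_best += 1
    if since_best >= patience:                      # stop once it stops improving
        break

print(step, best_val)   # best_w holds the early-stopped parameters
```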
Bayesian Statistics
The Bayesian approach treats parameters as random variables.
The Bayesian Recipe
- Prior: $p(\theta)$ — initial beliefs before seeing data
- Likelihood: $p(D | \theta)$ — probability of data given parameters
- Posterior: $p(\theta | D) \propto p(D | \theta) \cdot p(\theta)$ — updated beliefs
Posterior Predictive Distribution
To predict new data, integrate over parameter uncertainty:
$$p(y_{new} | x_{new}, D) = \int p(y_{new} | x_{new}, \theta) \cdot p(\theta | D) d\theta$$
Compare to plug-in prediction: $p(y_{new} | x_{new}, \hat{\theta})$
The Bayesian approach properly accounts for uncertainty in θ!
Conjugate Priors
A prior is conjugate to a likelihood when the resulting posterior has the same functional form as the prior:
- Beta prior, Bernoulli likelihood: prior on a coin's bias
- Gaussian prior, Gaussian likelihood: prior on a Gaussian mean
- Gamma prior, Poisson likelihood: prior on a Poisson rate
Makes computation tractable.
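For example, with a Beta prior on a coin's bias, both the posterior update and the posterior predictive are closed-form. A minimal sketch (synthetic flips; the uniform Beta(1, 1) prior is an assumed choice):

```python
import numpy as np

rng = np.random.default_rng(8)
flips = rng.binomial(1, 0.7, size=20)   # a small synthetic dataset, true bias 0.7
N1 = flips.sum()
N0 = len(flips) - N1

a0, b0 = 1.0, 1.0                       # Beta(1, 1) prior: uniform over the bias
a_post, b_post = a0 + N1, b0 + N0       # conjugate update: posterior is Beta(a0 + N1, b0 + N0)

p_heads_bayes  = a_post / (a_post + b_post)   # posterior predictive P(heads | D)
p_heads_plugin = N1 / (N1 + N0)               # plug-in prediction with theta_MLE

print(p_heads_bayes, p_heads_plugin)    # the Bayesian prediction is pulled toward the prior
```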
MAP vs. Full Bayesian
| Aspect | MAP | Full Bayesian |
|---|---|---|
| Output | Point estimate | Full distribution |
| Computation | Optimization | Integration |
| Uncertainty | Not captured | Fully captured |
| Regularization | Equivalent to adding prior | Built-in |
Frequentist Statistics
In the frequentist view:
- Parameters θ are fixed (unknown) constants
- Data D is random (sampled from true distribution)
- Uncertainty comes from randomness in sampling
Sampling Distribution
If we repeated the experiment many times, our estimate $\hat{\theta}$ would vary. The sampling distribution describes this variation.
Bootstrap
When the true sampling distribution is unknown, approximate it by resampling:
- Draw N samples with replacement from your data
- Compute the statistic of interest
- Repeat many times
- The distribution of statistics approximates the sampling distribution
Key fact: On average, each bootstrap sample contains about 63.2% of the distinct original observations, since for any given observation $$P(\text{included}) = 1 - \left(1 - \frac{1}{N}\right)^N \approx 1 - \frac{1}{e} \approx 0.632$$
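A minimal bootstrap sketch (synthetic exponential data, B = 5000 resamples, NumPy only) estimating the sampling distribution of the sample mean:

```python
import numpy as np

rng = np.random.default_rng(9)
data = rng.exponential(scale=2.0, size=100)   # synthetic data, true mean = 2.0

B = 5_000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()   # resample with replacement, recompute
    for _ in range(B)
])

print(boot_means.std())                        # bootstrap estimate of the standard error
print(np.percentile(boot_means, [2.5, 97.5]))  # percentile interval (see confidence intervals below)
```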
Confidence Intervals
A 95% confidence interval means: if we repeated the experiment many times, 95% of the computed intervals would contain the true parameter.
Note: This is NOT the same as "95% probability that θ is in this interval"!
Bias-Variance Trade-off
Bias
How far off is our estimator on average?
$$\text{bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta^*$$
Unbiased: $\mathbb{E}[\hat{\theta}] = \theta^*$
Example: Sample variance $\frac{1}{N}\sum(x_i - \bar{x})^2$ is biased! The unbiased version divides by N-1.
Variance
How much does our estimate fluctuate across different datasets?
$$\text{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]$$
Mean Squared Error
Combines both: $$\text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta^*)^2] = \text{bias}^2 + \text{variance}$$
Key insight: Sometimes it's worth accepting bias if it substantially reduces variance!
This is exactly what regularization does.
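A small simulation makes the decomposition concrete (a sketch with assumed numbers: estimating a Gaussian mean θ* = 0.5 from N = 4 points, comparing the sample mean with a shrunken version):

```python
import numpy as np

rng = np.random.default_rng(10)
theta_star, N, trials = 0.5, 4, 100_000

samples = rng.normal(loc=theta_star, size=(trials, N))
theta_mle    = samples.mean(axis=1)   # sample mean: unbiased, variance 1/N
theta_shrunk = 0.5 * theta_mle        # shrink toward zero: biased, but lower variance

for name, est in [("MLE", theta_mle), ("shrunken", theta_shrunk)]:
    bias = est.mean() - theta_star
    var = est.var()
    mse = np.mean((est - theta_star) ** 2)
    print(name, bias**2 + var, mse)   # the decomposition holds; the shrunken estimator has lower MSE
```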
Summary
| Concept | Key Idea |
|---|---|
| MLE | Choose θ that maximizes probability of observed data |
| NLL | Negative log-likelihood; what we minimize |
| Sufficient Statistics | Compress data without losing information about θ |
| ERM | Generalization of MLE to any loss function |
| Regularization | Penalty on complexity to prevent overfitting |
| MAP | MLE + prior = regularized MLE |
| Bayesian | Full distribution over θ, not just point estimate |
| Posterior Predictive | Integrate predictions over parameter uncertainty |
| Bootstrap | Approximate sampling distribution by resampling |
| Bias-Variance | MSE = bias² + variance; trade-off is fundamental |