Decision Theory
Decision theory provides a formal framework for making optimal choices under uncertainty. It bridges the gap between probabilistic predictions and concrete actions.
The Big Picture
Having a good model isn't enough — you need to make decisions. Decision theory tells us how to choose actions that minimize expected loss (or maximize expected utility).
Key question: Given uncertainty about the world and different costs for different errors, what should we do?
Risk Attitudes
Risk Neutrality
A risk-neutral agent cares only about expected value:
- Indifferent between $50 for sure and a 50% chance of $100 (both have expected value $50)
Risk Aversion
A risk-averse agent prefers certainty:
- Would take $45 for sure over 50% chance of $100
- Most people are risk-averse for gains
Risk Seeking
A risk-seeking agent prefers uncertainty:
- Would reject $55 for sure to keep 50% chance of $100
- Gamblers exhibit this behavior
Classification Decision Rules
Zero-One Loss
The simplest loss: you're either right or wrong.
$$\ell_{01}(y, \hat{y}) = \mathbb{I}\{y \neq \hat{y}\} = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}$$
Optimal policy: Predict the most probable class!
$$\pi^*(x) = \arg\max_c P(Y = c | X = x)$$
Derivation: The risk (expected loss) for predicting $\hat{y}$: $$R(\hat{y} | x) = P(Y \neq \hat{y} | x) = 1 - P(Y = \hat{y} | x)$$
Minimizing risk = maximizing the probability of being correct.
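As a minimal sketch (plain Python with hypothetical class posteriors), the zero-one-optimal policy simply returns the class with the largest posterior:

```python
# Minimal sketch: Bayes-optimal prediction under zero-one loss.
# The posterior values below are hypothetical, purely for illustration.

def predict_zero_one(posteriors):
    """Return the class with the highest posterior probability P(Y = c | x)."""
    return max(posteriors, key=posteriors.get)

posteriors = {"cat": 0.2, "dog": 0.5, "bird": 0.3}   # hypothetical P(Y = c | x)

y_hat = predict_zero_one(posteriors)
risk = 1.0 - posteriors[y_hat]   # expected zero-one loss = 1 - P(correct)
print(y_hat, risk)               # -> dog 0.5
```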
Cost-Sensitive Classification
Different errors have different consequences!
Medical diagnosis example:
- False Negative (miss cancer): Potentially fatal
- False Positive (false alarm): Unnecessary tests, anxiety
We assign different costs:
- $\ell_{01}$: Cost of predicting 0 when truth is 1 (false negative)
- $\ell_{10}$: Cost of predicting 1 when truth is 0 (false positive)
Optimal decision rule: Predict class 1 when the expected loss of predicting 1 is lower than that of predicting 0: $$P(Y=0|x) \cdot \ell_{10} < P(Y=1|x) \cdot \ell_{01}$$
Effect: Higher cost of false negatives → lower threshold for predicting positive.
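A small sketch of this rule with hypothetical costs; the equivalent threshold form makes the effect explicit:

```python
# Sketch of cost-sensitive binary classification with hypothetical costs.
# l01 = cost of predicting 0 when the truth is 1 (false negative)
# l10 = cost of predicting 1 when the truth is 0 (false positive)

def predict_cost_sensitive(p1, l01, l10):
    """Predict 1 iff the expected loss of predicting 1 is lower than of predicting 0."""
    p0 = 1.0 - p1
    return 1 if p0 * l10 < p1 * l01 else 0

l01, l10 = 10.0, 1.0            # missing a positive is 10x worse than a false alarm
threshold = l10 / (l10 + l01)   # equivalent rule: predict 1 when P(Y=1|x) > ~0.09

print(predict_cost_sensitive(p1=0.2, l01=l01, l10=l10))  # -> 1, even though p1 < 0.5
print(threshold)
```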
The Confusion Matrix
For binary classification:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Key metrics:
| Metric | Formula | Meaning |
|---|---|---|
| Sensitivity (Recall, TPR) | TP / (TP + FN) | Of actual positives, how many did we catch? |
| Specificity (TNR) | TN / (TN + FP) | Of actual negatives, how many did we correctly identify? |
| Precision (PPV) | TP / (TP + FP) | Of predicted positives, how many are correct? |
| False Positive Rate (FPR) | FP / (FP + TN) | Of actual negatives, how many did we incorrectly flag? |
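A quick sketch computing the four metrics in the table from hypothetical confusion-matrix counts:

```python
# Sketch: the four metrics from the table, computed from hypothetical counts.

def confusion_metrics(tp, fn, fp, tn):
    return {
        "sensitivity": tp / (tp + fn),   # recall / TPR
        "specificity": tn / (tn + fp),   # TNR
        "precision":   tp / (tp + fp),   # PPV
        "fpr":         fp / (fp + tn),   # 1 - specificity
    }

print(confusion_metrics(tp=80, fn=20, fp=30, tn=870))
```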
The Rejection Option
Sometimes the best decision is to not decide.
Setup:
- Cost of error: $\lambda_e$
- Cost of rejection: $\lambda_r$ (where $\lambda_r < \lambda_e$)
Optimal policy:
- Predict if confident: $\max_c P(Y=c|x) \geq 1 - \frac{\lambda_r}{\lambda_e}$
- Reject (abstain) if uncertain
Use case: Route uncertain cases to human experts.
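A minimal sketch of the reject rule, with hypothetical costs and posteriors:

```python
# Sketch of classification with a reject option (hypothetical costs and posteriors).

def predict_or_reject(posteriors, cost_error, cost_reject):
    """Abstain unless the top posterior clears the confidence threshold."""
    threshold = 1.0 - cost_reject / cost_error
    best_class = max(posteriors, key=posteriors.get)
    return best_class if posteriors[best_class] >= threshold else "reject"

posteriors = {"benign": 0.55, "malignant": 0.45}
# threshold = 1 - 1/10 = 0.9, so a 0.55 top posterior is not confident enough
print(predict_or_reject(posteriors, cost_error=10.0, cost_reject=1.0))  # -> reject
```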
ROC Curves
The Receiver Operating Characteristic curve shows the trade-off between sensitivity and specificity across all classification thresholds.
Construction
For each threshold τ:
- Compute TPR (sensitivity) and FPR (1 - specificity)
- Plot the point (FPR, TPR)
Interpretation
- Perfect classifier: Goes through (0, 1) — 100% TPR, 0% FPR
- Random classifier: Diagonal line from (0, 0) to (1, 1)
- Better models: Curves closer to upper-left corner
AUC (Area Under the ROC Curve)
A single number summarizing performance:
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing
- AUC > 0.7: Generally acceptable
Interpretation: The probability that a randomly chosen positive example ranks higher than a randomly chosen negative example.
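That interpretation suggests a direct (if brute-force) estimator: compare every positive score with every negative score. A sketch with hypothetical scores:

```python
# Sketch: AUC estimated as the fraction of positive/negative score pairs
# that are ranked correctly (ties count as half). Scores are hypothetical.

def auc_by_pairs(pos_scores, neg_scores):
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

positives = [0.9, 0.8, 0.55, 0.4]   # scores given to actual positives
negatives = [0.7, 0.3, 0.2, 0.1]    # scores given to actual negatives
print(auc_by_pairs(positives, negatives))  # -> 0.875
```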
Equal Error Rate (EER)
The point where FPR = FNR. Lower is better.
Precision-Recall Curves
When to Use
ROC curves can be misleading with class imbalance (when one class is much more common).
Example: In fraud detection, 99.9% of transactions are legitimate. A model that predicts "not fraud" always achieves:
- 99.9% accuracy
- 100% specificity
- 0% recall (catches no fraud!)
PR Curve Construction
For each threshold:
- Compute Precision and Recall
- Plot the point (Recall, Precision)
Key Properties
- No dependence on TN (unlike ROC)
- Sensitive to class imbalance
- Baseline is the positive class proportion
Summary Metrics
Precision @ K: Precision when retrieving top K results
Average Precision (AP): Area under the (interpolated) PR curve
mAP: Mean AP across multiple queries/classes
F-Score: Harmonic mean of precision and recall $$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 P + R}$$
- F₁: Equal weight to precision and recall
- F₂: Weights recall higher
- F₀.₅: Weights precision higher
Why harmonic mean? It stays low if either P or R is low, so a model cannot compensate for poor recall with high precision (or vice versa).
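A short sketch of precision, recall, and $F_\beta$ on hypothetical counts, showing how β shifts the emphasis:

```python
# Sketch: precision, recall, and F_beta from hypothetical counts.

def f_beta(precision, recall, beta=1.0):
    """beta > 1 weights recall more heavily; beta < 1 weights precision more."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

tp, fp, fn = 40, 10, 60
precision = tp / (tp + fp)   # 0.8
recall = tp / (tp + fn)      # 0.4

print(f_beta(precision, recall, beta=1.0))   # F1   ~ 0.53
print(f_beta(precision, recall, beta=2.0))   # F2   ~ 0.44 (recall dominates)
print(f_beta(precision, recall, beta=0.5))   # F0.5 ~ 0.67 (precision dominates)
```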
Regression Losses
Common Loss Functions
| Loss | Formula | Properties |
|---|---|---|
| MSE | $\frac{1}{N}\sum(y - \hat{y})^2$ | Sensitive to outliers; corresponds to Gaussian likelihood |
| MAE | $\frac{1}{N}\sum \lvert y - \hat{y} \rvert$ | Robust to outliers; corresponds to Laplace likelihood |
| Huber | MSE for small errors, MAE for large | Best of both worlds |
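A per-residual sketch of the three losses (the table's formulas average these over N points); the Huber crossover point δ is a hypothetical choice:

```python
# Sketch: the three regression losses applied to a single residual r = y - y_hat.

def squared_loss(r):
    return r * r

def absolute_loss(r):
    return abs(r)

def huber_loss(r, delta=1.0):
    """Quadratic near zero, linear in the tails; delta sets the crossover."""
    return 0.5 * r * r if abs(r) <= delta else delta * (abs(r) - 0.5 * delta)

for r in (0.5, 3.0):   # a small residual and a large, outlier-like one
    print(squared_loss(r), absolute_loss(r), huber_loss(r))
```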
Quantile Loss
For predicting the q-th quantile: $$L_q(y, \hat{y}) = \begin{cases} q \cdot (y - \hat{y}) & \text{if } y > \hat{y} \\ (1-q) \cdot (\hat{y} - y) & \text{if } y \leq \hat{y} \end{cases}$$
Asymmetric penalty: Different costs for over- vs under-prediction.
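A sketch of the asymmetry for a single prediction, with q = 0.9 chosen for illustration:

```python
# Sketch: quantile (pinball) loss for one prediction; q = 0.9 penalizes
# under-prediction 9x more than over-prediction, pushing y_hat toward the 90th percentile.

def quantile_loss(y, y_hat, q):
    return q * (y - y_hat) if y > y_hat else (1 - q) * (y_hat - y)

q = 0.9
print(quantile_loss(y=10.0, y_hat=8.0, q=q))   # under-prediction: 0.9 * 2 = 1.8
print(quantile_loss(y=10.0, y_hat=12.0, q=q))  # over-prediction:  0.1 * 2 = 0.2
```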
Model Calibration
A model is well-calibrated if its predicted probabilities match actual frequencies.
Example: Among all predictions with confidence 80%, about 80% should be correct.
Reliability Diagrams
- x-axis: Predicted probability (binned)
- y-axis: Actual proportion of positives in each bin
- Perfect calibration: Points fall on the diagonal
Why Calibration Matters
- For decision making, we need accurate probabilities
- Many models (especially neural networks) are overconfident
- Calibration can be fixed post-hoc (Platt scaling, isotonic regression)
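A sketch of the reliability-diagram binning described above, along with the expected calibration error (ECE) it implies, on hypothetical predictions and labels:

```python
# Sketch: reliability-diagram bins and expected calibration error (ECE)
# from hypothetical predicted probabilities and binary labels.

def reliability_bins(probs, labels, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))

    rows, ece = [], 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / len(probs)) * abs(avg_conf - frac_pos)
        rows.append((round(avg_conf, 2), round(frac_pos, 2)))
    return rows, ece

probs  = [0.9, 0.8, 0.85, 0.3, 0.2, 0.6, 0.65, 0.1]
labels = [1,   1,   0,    0,   0,   1,   0,    0]
rows, ece = reliability_bins(probs, labels)
print(rows)  # (mean predicted probability, empirical positive rate) per non-empty bin
print(ece)   # gap between predicted probability and empirical frequency, weighted by bin size
```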
Bayesian Model Selection
The Bayesian Approach
Choose the model m that maximizes posterior probability: $$p(m | D) \propto p(D | m) \cdot p(m)$$
Marginal likelihood (evidence): $$p(D | m) = \int p(D | \theta, m) \cdot p(\theta | m) d\theta$$
This integral automatically penalizes complexity (Occam's Razor).
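To make the integral concrete, here is a deliberately naive Monte Carlo sketch on hypothetical coin-flip data: the evidence for a flexible model is its likelihood averaged over the prior, which is exactly what builds in the complexity penalty. (In practice this estimator has high variance and better methods are used.)

```python
# Naive sketch: estimate p(D | m) by averaging the likelihood over prior samples.
# Hypothetical data: 6 heads in 10 coin flips.

import random

def bernoulli_likelihood(theta, heads, tails):
    return theta ** heads * (1 - theta) ** tails

heads, tails = 6, 4

# Model 0: a fair coin with no free parameters -> evidence is the likelihood at 0.5.
evidence_m0 = bernoulli_likelihood(0.5, heads, tails)

# Model 1: unknown bias with a Uniform(0, 1) prior -> average the likelihood
# over draws from the prior (Monte Carlo approximation of the integral).
samples = [random.random() for _ in range(100_000)]
evidence_m1 = sum(bernoulli_likelihood(t, heads, tails) for t in samples) / len(samples)

# With this mildly unbalanced data, the simpler model keeps the higher evidence:
print(evidence_m0)  # ~9.8e-4
print(evidence_m1)  # ~4.3e-4: the flexible model pays an automatic Occam penalty
```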
Model Comparison Criteria
AIC (Akaike Information Criterion) $$\text{AIC} = -2 \cdot \text{LL} + 2k$$
Where k = number of parameters.
Intuition: Approximates out-of-sample predictive performance.
BIC (Bayesian Information Criterion) $$\text{BIC} = -2 \cdot \text{LL} + k \cdot \log N$$
Where N = number of observations.
Intuition: Approximates Bayesian model evidence; penalizes complexity more heavily than AIC.
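A small sketch of both criteria on hypothetical fits (the log-likelihoods and parameter counts are made up for illustration):

```python
# Sketch: AIC and BIC from a model's maximized log-likelihood (hypothetical numbers).

import math

def aic(log_likelihood, k):
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n):
    return -2.0 * log_likelihood + k * math.log(n)

# Two hypothetical models fit to the same N = 500 observations.
print(aic(-1210.0, k=3),  bic(-1210.0, k=3,  n=500))   # 2426.0, ~2438.6
print(aic(-1200.0, k=10), bic(-1200.0, k=10, n=500))   # 2420.0, ~2462.1
# AIC prefers the larger model here; BIC's log(N) penalty prefers the smaller one.
```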
AIC vs BIC
| Criterion | Penalty | Selects | Best For |
|---|---|---|---|
| AIC | 2k | Larger models | Prediction |
| BIC | k log N | Smaller models | Finding "true" model |
MDL (Minimum Description Length)
Information-theoretic view: $$\text{MDL} = L(\text{model}) + L(\text{data}|\text{model})$$
Choose the model that gives the shortest description of the data.
Frequentist Decision Theory
Risk of an Estimator
$$R(\theta, \hat{\theta}) = \mathbb{E}_{p(D|\theta)}[L(\theta, \hat{\theta}(D))]$$
The expected loss when the true parameter is θ.
Types of Risk
Bayes Risk: Average over prior on θ $$R_B = \int R(\theta, \hat{\theta}) p(\theta) d\theta$$
Maximum Risk: Worst-case over all θ $$R_{max} = \max_\theta R(\theta, \hat{\theta})$$
Empirical Risk Minimization
Population Risk: Expected loss on true distribution $$R(f) = \mathbb{E}_{p^*(x,y)}[\ell(y, f(x))]$$
Empirical Risk: Average loss on training data $$\hat{R}(f) = \frac{1}{N}\sum_{i=1}^N \ell(y_i, f(x_i))$$
The gap: the difference $R(f) - \hat{R}(f)$ between population and empirical risk is the generalization gap; empirical risk minimization is trustworthy only when this gap is small.
Structural Risk Minimization
Add complexity penalty to prevent overfitting: $$\hat{f} = \arg\min_f \left[\hat{R}(f) + \lambda C(f)\right]$$
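A sketch of both quantities for a toy predictor, using squared loss and a hypothetical complexity measure (the squared weight norm):

```python
# Sketch: empirical risk and a complexity-penalized (structural) risk for a
# toy linear predictor on hypothetical data, using squared loss.

def empirical_risk(f, data, loss):
    return sum(loss(y, f(x)) for x, y in data) / len(data)

def structural_risk(f, data, loss, complexity, lam):
    return empirical_risk(f, data, loss) + lam * complexity

squared = lambda y, y_hat: (y - y_hat) ** 2

w, b = 2.0, 0.5                              # hypothetical fitted parameters
f = lambda x: w * x + b
data = [(0.0, 0.4), (1.0, 2.6), (2.0, 4.5)]  # hypothetical (x, y) pairs

print(empirical_risk(f, data, squared))                               # ~0.0067
print(structural_risk(f, data, squared, complexity=w * w, lam=0.01))  # adds 0.01 * 4
```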
Statistical Learning Theory
PAC Learning
A concept is Probably Approximately Correct (PAC) learnable if:
- With high probability (1 - δ)
- We can find a hypothesis with low error (≤ ε)
- Using polynomial time and data
VC Dimension
Measures the "capacity" or complexity of a hypothesis class.
Definition: The largest number of points for which some configuration can be shattered, i.e., classified correctly by some hypothesis in the class under every possible labeling of those points.
Examples:
- Linear classifiers in d dimensions: VC = d + 1
- A line in 2D: VC = 3
Generalization Bounds
VC theory gives high-probability bounds (up to logarithmic factors) of the form: $$R(f) \leq \hat{R}(f) + O\left(\sqrt{\frac{\text{VC}}{N}}\right)$$
Implication: Lower VC dimension → better generalization.
Hypothesis Testing
The Setup
- Null hypothesis $H_0$: Default assumption (e.g., "no effect")
- Alternative hypothesis $H_1$: What we want to show
Error Types
| | H₀ True | H₁ True |
|---|---|---|
| Reject H₀ | Type I Error (α) | Correct |
| Fail to reject H₀ | Correct | Type II Error (β) |
- Significance level α: P(reject H₀ | H₀ true)
- Power 1 - β: P(reject H₀ | H₁ true)
p-value
The probability, under the null hypothesis, of observing a test statistic at least as extreme as what was observed.
Common misconception: p-value is NOT P(H₀ is true | data)!
Likelihood Ratio Test
Compare how well each hypothesis explains the data: $$\Lambda = \frac{p(D | H_0)}{p(D | H_1)}$$
Neyman-Pearson Lemma: The likelihood ratio test is the most powerful test for a given significance level.
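A sketch of the ratio for two simple hypotheses about a Gaussian mean (H₀: μ = 0 vs H₁: μ = 1, known σ = 1), on hypothetical observations:

```python
# Sketch: log likelihood ratio for two simple hypotheses about a Gaussian mean.
# H0: mu = 0 vs H1: mu = 1, with known sigma = 1; the data are hypothetical.

import math

def gaussian_log_pdf(x, mu, sigma=1.0):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

data = [0.3, 1.2, 0.8, -0.1, 0.9]

log_lik_h0 = sum(gaussian_log_pdf(x, mu=0.0) for x in data)
log_lik_h1 = sum(gaussian_log_pdf(x, mu=1.0) for x in data)

log_lambda = log_lik_h0 - log_lik_h1   # log of Lambda = p(D | H0) / p(D | H1)
print(log_lambda)   # -0.6 here: the data are more likely under H1 than under H0
```

The Neyman-Pearson test rejects H₀ when Λ falls below a threshold chosen to achieve the desired significance level α.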
Summary
| Concept | Key Insight |
|---|---|
| Decision Rule | Map probabilities to actions |
| Cost-Sensitive | Different errors have different costs |
| ROC Curve | Trade-off between TPR and FPR |
| PR Curve | Better for imbalanced classes |
| Calibration | Predicted probabilities should match reality |
| AIC/BIC | Trade-off between fit and complexity |
| VC Dimension | Theoretical measure of model complexity |
| Hypothesis Testing | Formal framework for statistical evidence |