Decision Theory
Decision theory provides a formal framework for making optimal decisions under uncertainty
Optimal Policy specifies which action to take for each possible observation to minimize risk or maximize utility
Risk Neutrality vs Risk Aversion
- Risk neutrality: Agent values uncertain outcomes at their expected value (e.g., a 50% chance of \$100 is worth exactly \$50)
- Risk aversion: Agent prefers certain outcomes to uncertain ones with same expected value
- Risk seeking (risk preference): Agent prefers uncertainty (gambling) over a certain outcome with the same expected value
Decision Rules for Classification
Zero-One loss: Penalizes misclassification with a unit cost
- $l_{01}(y, \hat y) = \mathbb{I}\{y \ne \hat y\}$
- Optimal policy minimizes risk by choosing the most probable class:
- $R(\hat y | x) = P(y \ne \hat y | x) = 1 - P(y = \hat y | x)$
- $\pi(x) = \arg \max_y P(y | x)$
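A minimal sketch of this rule in Python (the `posterior` values are made up for illustration):

```python
import numpy as np

# Hypothetical posterior p(y | x) over 3 classes for a batch of 2 inputs.
posterior = np.array([[0.2, 0.5, 0.3],
                      [0.7, 0.1, 0.2]])

# Under 0-1 loss, the optimal policy picks the most probable class,
# and the risk of that action is one minus its posterior probability.
y_hat = posterior.argmax(axis=1)    # pi(x) = argmax_y p(y | x)
risk = 1.0 - posterior.max(axis=1)  # R(y_hat | x) = 1 - p(y_hat | x)

print(y_hat)  # [1 0]
print(risk)   # [0.5 0.3]
```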
Cost-Sensitive Classification
- Different error types have different consequences
- False Positive (Type I error): Predict positive when truth is negative
- False Negative (Type II error): Predict negative when truth is positive
- Different costs can be assigned to the two error types, e.g., a false negative costs $c$ times a false positive: $l_{10} = c \times l_{01}$
- Choose label 1 if its expected cost is lower than that of label 0:
- $p(y=0|x) \times l_{01} < p(y=1|x) \times l_{10}$, equivalently $p(y=1|x) > {1 \over 1 + c}$
- The cost ratio c shifts the decision boundary
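A small sketch of the shifted decision rule, assuming a false negative costs `c` times a false positive (the probabilities below are hypothetical):

```python
import numpy as np

def cost_sensitive_decision(p1, c):
    """Predict 1 iff its expected cost is lower than predicting 0.

    p1 : p(y=1 | x); c : cost of a false negative relative to a false positive.
    With l_10 = c * l_01, predicting 1 is cheaper iff p(y=1|x) > 1 / (1 + c).
    """
    return (np.asarray(p1) > 1.0 / (1.0 + c)).astype(int)

p1 = np.array([0.2, 0.4, 0.6, 0.9])
print(cost_sensitive_decision(p1, c=1))  # threshold 0.5 -> [0 0 1 1]
print(cost_sensitive_decision(p1, c=4))  # threshold 0.2 -> [0 1 1 1]
```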
Rejection Option (Abstention)
- Sometimes it's better to abstain from making a decision
- Three possible actions: predict 0, predict 1, or reject
- Cost parameters:
- $\lambda_e$: Cost of making an error
- $\lambda_r$: Cost of rejection/abstention
- No decision is made when model confidence is below threshold:
- Abstain when $\max_y P(y|x) < 1 - \frac{\lambda_r}{\lambda_e}$
- This creates bands of uncertainty where expert input might be required
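A sketch of the three-way rule, with `-1` standing in for the reject action (the cost values are illustrative):

```python
import numpy as np

def predict_with_reject(posterior, lambda_r, lambda_e):
    """Return class predictions, with -1 meaning 'abstain'.

    Abstain when max_y p(y|x) < 1 - lambda_r / lambda_e.
    """
    confidence = posterior.max(axis=1)
    y_hat = posterior.argmax(axis=1)
    y_hat[confidence < 1.0 - lambda_r / lambda_e] = -1
    return y_hat

posterior = np.array([[0.55, 0.45],
                      [0.95, 0.05]])
# Rejection costs a tenth of an error -> abstain unless confidence >= 0.9.
print(predict_with_reject(posterior, lambda_r=1.0, lambda_e=10.0))  # [-1  0]
```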
ROC Curves
Summarize performance across various thresholds
Confusion Matrix
- Given a threshold $\tau$, compare the thresholded predictions against ground truth
- Confusion Matrix
- Positive, Negative: the model's prediction
- True, False: whether the prediction matches the actual label
- TP, TN: Correct Predictions
- FP: Model predicts 1, Ground Truth is 0
- FN: Model predicts 0, Ground Truth is 1
- Ratios from Confusion Matrix
- TPR, Sensitivity, Recall
- TP / (TP + FN)
- Fraction of actual positives correctly identified
- FPR, Type 1 Error rate
- FP / (FP + TN)
- Fraction of actual negatives incorrectly flagged as positive
ROC Curve is a plot of TPR (y-axis) against FPR (x-axis) across various thresholds
AUC (Area Under the Curve) is a scalar summary of the ROC curve
Equal Error Rate is where the ROC curve crosses the descending $-45^\circ$ diagonal, i.e., where FPR = FNR
ROC Curve is insensitive to class imbalance
- FPR has TN in its denominator
- When negatives dominate (TN >> FP), even many false positives barely move the FPR
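The whole ROC construction can be sketched by sweeping the threshold over the scores and integrating with the trapezoid rule (the scores and labels below are toy data):

```python
import numpy as np

def roc_curve(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    fpr, tpr = [0.0], [0.0]
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    for t in np.sort(np.unique(scores))[::-1]:
        pred = (scores >= t).astype(int)
        tpr.append(np.sum((pred == 1) & (labels == 1)) / n_pos)
        fpr.append(np.sum((pred == 1) & (labels == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

scores = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])
fpr, tpr = roc_curve(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal area under the curve
print(auc)  # ~0.89 for this toy example
```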
Precision-Recall Curves
- The set of negatives is not model specific but system specific
- For a search query, retrieving 50 vs. 500 items (or showing tiles vs. a list) changes how many negatives are in play
- Precision
- TP / (TP + FP)
- Recall
- TP / (TP + FN)
- There is no dependency on TN
- The precision curve is jagged; smooth it out by interpolation
- To summarize the performance by a scalar
- Precision @ K
- Average Precision: area under the interpolated precision-recall curve
- mAP or Mean Average Precision is mean of AP across different PR curves (say different queries)
- F-Score
- Weighted harmonic mean between precision and recall
- ${1 \over F} = {1 \over 1 + \beta^2} {1 \over P} + {\beta^2 \over 1 + \beta^2} {1 \over R}$
- The harmonic mean imposes a heavier penalty if either precision or recall falls to a very low level
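The basic quantities, with the weighted harmonic mean written out (toy scores and labels):

```python
import numpy as np

def precision_recall(scores, labels, threshold):
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

def f_beta(precision, recall, beta=1.0):
    # Weighted harmonic mean: 1/F = [1/(1+b^2)] (1/P) + [b^2/(1+b^2)] (1/R)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

scores = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])
p, r = precision_recall(scores, labels, threshold=0.5)
print(p, r, f_beta(p, r))  # 0.75 1.0 ~0.857
```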
Class Imbalance
- ROC curves are not sensitive to class imbalance. Does not matter which class is defined as 1 or 0.
- PR curves are sensitive to class imbalance. Switching which class is labeled positive changes the curve.
- $P = {TP \over TP + FP}$: the FP count, and hence precision, depends on how many negatives there are
- PR-AUC is more appropriate in the case of class imbalance
- Multiclass problems can be treated as multiple binary problems: one class vs. the rest
- Decision threshold calibration finds the optimal threshold for desired trade-off between precision and recall
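One way to perform the threshold calibration mentioned above is a brute-force sweep, here maximizing recall subject to a minimum precision (the constraint value and data are arbitrary):

```python
import numpy as np

def pick_threshold(scores, labels, min_precision=0.9):
    """Return the threshold with the best recall among those meeting min_precision."""
    best_t, best_recall = None, -1.0
    for t in np.unique(scores):
        pred = (scores >= t).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn)
        if precision >= min_precision and recall > best_recall:
            best_t, best_recall = t, recall
    return best_t

scores = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])
print(pick_threshold(scores, labels))  # 0.8
```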
Regression
- Metrics
- Mean Squared Error: ${1 \over N} \sum (y - \hat y)^2$
- Mean Absolute Error: ${1 \over N} \sum |y - \hat y|$
- MAE is robust to outliers; minimizing it corresponds to regression with a Laplace conditional distribution
- MSE has simple derivatives; minimizing it corresponds to regression with a Gaussian conditional distribution
- Huber loss: quadratic (MSE-like) below a cutoff $\delta$, linear (MAE-like) above it
- Probabilistic predictions are usually evaluated with a proper scoring rule, which is minimized in expectation by the true distribution
- Log score of the predicted density at the true value
- Quantile Loss
- Probability that value y is less than q quantile is q
- $P(y \le y_q) = q$
- Piecewise linear in the error $y - \hat y$: slope $q$ for under-prediction and $q - 1$ for over-prediction
- The asymmetry of the two slopes depends on the quantile $q$ (see the sketch after this list)
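The regression losses above, including the pinball (quantile) loss, as plain numpy functions (the data points are invented, with one deliberate outlier):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    # Quadratic for small residuals, linear beyond the cutoff delta.
    r = np.abs(y - y_hat)
    return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))

def quantile_loss(y, y_hat, q=0.9):
    # Pinball loss: slope q for under-prediction, q - 1 for over-prediction.
    e = y - y_hat
    return np.mean(np.maximum(q * e, (q - 1) * e))

y     = np.array([1.0, 2.0, 3.0, 10.0])   # last point acts as an outlier
y_hat = np.array([1.1, 1.9, 3.2, 4.0])
print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat), quantile_loss(y, y_hat))
```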
Calibration
- Property that the predicted probabilities of events match the empirical frequency of their occurrence
- Reliability Diagrams
- x-axis: predicted probability
- y-axis: observed (empirical) frequency
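A reliability diagram reduces to binning predictions and comparing each bin's mean predicted probability with its observed frequency; a rough sketch (the probabilities and labels are fabricated):

```python
import numpy as np

def reliability_bins(probs, labels, n_bins=5):
    """Return (mean predicted probability, observed frequency) per non-empty bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            rows.append((probs[mask].mean(), labels[mask].mean()))
    return rows

probs  = np.array([0.1, 0.15, 0.7, 0.75, 0.9, 0.95])
labels = np.array([0, 0, 1, 0, 1, 1])
for pred, obs in reliability_bins(probs, labels):
    print(f"predicted {pred:.2f}  observed {obs:.2f}")
```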
Bayesian Model Selection and Averaging
Bayesian approach to model selection
- Choose $m \in M$ to maximize posterior $p(m|D)$
- $p(m|D) \propto p(D|m)p(m)$
- With a uniform prior over model classes, this reduces to maximizing the marginal likelihood (evidence)
- $p(D|m) = \int p(D|\theta,m)p(\theta|m)d\theta$
- Integration over all possible parameter values
- In practice, models might be compared using BIC or AIC
- BIC approximates Bayes factor
- AIC approximates Cross-validation
Akaike's Information Criterion
- Aims to maximize the predictive density of held-out data
- Approximates out-of-sample generalization using only the training fit
- The penalty corrects for the optimism of evaluating a model on the data it was trained on
- $AIC = -2LL + 2C$
- LL is log-likelihood
- C is number of parameters in the model
Bayesian Information Criterion
- Is biased in favor of simpler models (its penalty grows with $N$)
- $BIC = -2LL + C\log N$
- LL is log-likelihood
- C is number of parameters in the model
- N is the number of observations
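Both criteria are trivial to compute once the maximized log-likelihood is known; the numbers below are hypothetical fits of a small vs. a large model:

```python
import numpy as np

def aic(log_likelihood, n_params):
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_obs):
    return -2.0 * log_likelihood + n_params * np.log(n_obs)

# Hypothetical: a 3-parameter vs. a 10-parameter model fit to N = 50 points.
for ll, c in [(-120.0, 3), (-115.0, 10)]:
    print(c, aic(ll, c), bic(ll, c, n_obs=50))
# The larger model fits better, but BIC penalizes its extra parameters more heavily.
```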
Minimum Description Length
- Minimum number of bits needed to describe the model and the data
- If the model is described in $L_1$ bits, and the data is then encoded using the model in $L_2$ bits,
- the description length of the model is $L_1 + L_2$
Approximation of Bayesian Model Comparison
- Integrated likelihood is analytically intractable
- Approximate the integral by Laplace approximation
- Uses a second-order Taylor expansion around the MLE / MAP estimate
- The posterior is approximated as a Gaussian around the mode; as $N$ grows, the log-determinant of the Hessian scales as $C \log N$ (up to a constant), which yields the BIC penalty
- AIC is an approximation to the KL divergence between the true model and a fitted model
- Simple to compute
- AIC = $2 \times \text{dof} - 2 \times LL$ (dof: degrees of freedom)
- Sampling error can result in overfitting when $N$ is small
- $\text{AIC}_c$: corrected version for small sample sizes
- $\text{AIC}_c = \text{AIC} + {2C(C+1) \over N-C-1}$
- The correction adds an extra penalty that grows with the number of parameters $C$ relative to the sample size $N$
Widely Applicable / Watanabe-Akaike Information Criterion
- Penalizes models based on the effective degrees of freedom estimated from the posterior
- The penalty term is $2 \times \text{dof}_{\text{eff}}$, replacing the raw parameter count used by AIC
Frequentist Decision Theory
- Risk of an estimator is the expected loss when applying the estimator to data sampled from the likelihood function $p(y, x \mid \theta)$
- Bayes Risk
- The true generating distribution is unknown
- Assume a prior over $\theta$ and average the risk over it
- Maximum (Minimax) Risk
- Minimize the worst-case (maximum) risk over $\theta$
- Consistent Estimator
- Recovers true parameter in the limit of infinite data
- Empirical Risk Minimization
- Population Risk
- Expectation of the loss function w.r.t. true distribution
- True distribution is unknown
- $R(f, \theta^*) = \mathbf{E}_{p(x,y|\theta^*)}[l(y, f(x))]$
- Empirical Risk
- Approximate the expectation of loss by using training data samples
- $R(f, D) = {1 \over N} \sum_{n=1}^N l(y_n, f(x_n))$
- Empirical Risk Minimization
- Optimize empirical risk over hypothesis space of functions
- $f_{ERM} = \arg \min_{f \in H} R(f, D)$
- Approximation Error
- Error incurred because the true predictor may not lie in the hypothesis space
- Estimation Error
- Error due to having a finite training set
- Equals the generalization gap: the difference between training error and test error
- Regularized Risk
- Add complexity penalty
- $R_\lambda(f,D) = R(f,D) + \lambda C(f)$
- Complexity term resembles the prior term in MAP estimation
- Structural Risk
- Empirical risk underestimates population risk
- Structural risk minimization picks the right level of model complexity, e.g., by minimizing regularized risk with the penalty strength chosen via cross-validation
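ERM in miniature: a one-dimensional hypothesis space of threshold classifiers, with the empirical 0-1 risk minimized by brute force (the data set is invented):

```python
import numpy as np

def empirical_risk(f, xs, ys):
    """Average 0-1 loss of classifier f on the training set."""
    return np.mean([f(x) != y for x, y in zip(xs, ys)])

# Hypothesis space: threshold classifiers f_t(x) = I(x > t).
xs = np.array([0.1, 0.35, 0.4, 0.7, 0.8, 0.9])
ys = np.array([0, 0, 0, 1, 1, 1])
candidates = np.linspace(0.0, 1.0, 101)

# ERM: pick the hypothesis with the lowest empirical risk.
risks = [empirical_risk(lambda x, t=t: int(x > t), xs, ys) for t in candidates]
best_t = candidates[int(np.argmin(risks))]
print(best_t, min(risks))  # a threshold between 0.4 and 0.7 attains risk 0
```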
Statistical Learning Theory
- Provides upper bounds on the generalization error that hold with high probability
- PAC (probably approximately correct) learnable
- Hoeffding's Inequality
- Bounds the probability that the empirical risk deviates from the population risk by more than $\epsilon$
- VC Dimension
- Measures the capacity (effective degrees of freedom) of a hypothesis class: the size of the largest set of points it can shatter
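For a finite hypothesis class, Hoeffding's inequality plus a union bound gives a concrete PAC-style bound; a small calculator for it (the class size and delta are arbitrary):

```python
import numpy as np

def generalization_bound(n_hypotheses, n_samples, delta=0.05):
    """With probability >= 1 - delta, for every hypothesis in a finite class H:
    |empirical risk - population risk| <= sqrt(log(2|H|/delta) / (2N))."""
    return np.sqrt(np.log(2 * n_hypotheses / delta) / (2 * n_samples))

for n in (100, 1_000, 10_000):
    print(n, generalization_bound(n_hypotheses=1000, n_samples=n))
# The bound tightens as N grows and loosens (logarithmically) with |H|.
```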
Frequentist Hypothesis Testing
- Null vs. Alternative Hypothesis
- Likelihood Ratio Test
- $p(D| H_0) / p(D| H_1)$
- Null Hypothesis Significance Testing
- Type-1 Error
- P(Reject H0 | H0 is True)
- Significance of the test
- $\alpha$
- Type-2 Error
- P(Accept H0 | H1 is True)
- $\beta$
- Power of the test is $1 - \beta$
- Most powerful test is the one with highest power given a level of significance
- Neyman-Pearson lemma: Likelihood ratio test is the most powerful test
- p-value
- Probability, under the null hypothesis, of observing a test statistic at least as extreme as the one actually observed
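A worked one-sided z-test tying together significance, p-value, and power (the sample mean, effect size, and sigma are invented):

```python
import math

def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2))

# H0: mu = 0 vs H1: mu > 0, known sigma, N observations.
n, sigma = 25, 1.0
z_crit = 1.645                       # reject H0 when z > z_crit (alpha = 0.05)

sample_mean = 0.4
z = sample_mean / (sigma / math.sqrt(n))
p_value = 1.0 - normal_cdf(z)        # P(statistic >= observed | H0), ~0.023

# Power against a specific alternative mu = 0.5: P(reject H0 | H1 true).
mu_alt = 0.5
beta = normal_cdf(z_crit - mu_alt / (sigma / math.sqrt(n)))  # Type-2 error
print(p_value, 1.0 - beta)           # power ~0.80
```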