Additive Models
Generalized Additive Models
- Linear models fail to capture non-linear trends
- Additive models are an alternative
- $g[\mu(X)] = \alpha + f_1(X_1) + f_2(X_2) + \dots + f_p(X_p)$
- $f_j(x)$ are non-parametric smoothing functions (say, cubic splines)
- $\mu(X) = E(Y \mid X)$ is the conditional mean
- $g$ is the link function
- identity, logit, log, etc.
- Estimation via the Penalized Residual Sum of Squares (PRSS)
- Each linear coefficient is replaced by a flexible function (say, a spline)
- Allows for modeling non-linear relationships (backfitting sketch below)
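A minimal backfitting sketch for the identity-link case, assuming scipy's `UnivariateSpline` as the per-variable smoother; the function name `backfit` and the smoothing level are illustrative choices, not any library's GAM API.

```python
# Backfitting sketch for y = alpha + f1(x1) + f2(x2) + noise (identity link).
# Each f_j is estimated by smoothing the partial residuals against x_j.
import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit(X, y, n_iter=20, smooth=None):
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))                       # current estimates f_j(x_ij)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove alpha and all f_k except f_j
            r = y - alpha - f.sum(axis=1) + f[:, j]
            order = np.argsort(X[:, j])
            spline = UnivariateSpline(X[order, j], r[order], s=smooth)
            f[:, j] = spline(X[:, j])
            f[:, j] -= f[:, j].mean()          # center each f_j for identifiability
    return alpha, f

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 200)
alpha, f = backfit(X, y, smooth=5.0)           # smoothing level: a tuning choice
```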
Tree-based Methods
- Partition the feature space into rectangles and fit a simple model in each partition
- Regression Setting
- $f(X) = \sum_{m=1}^{M} c_m \, I\{X \in R_m\}$
- $\hat c_m = \operatorname{ave}(y_i \mid x_i \in R_m)$
- Greedy algorithm to find the best split (sketch below)
- $R_1(j, s) = \{X \mid X_j \le s\}, \quad R_2(j, s) = \{X \mid X_j > s\}$
- $\min_{j,s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right]$
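A sketch of the greedy search over $(j, s)$ above; within each half, the optimal constant is just the mean, so only the threshold search remains. `best_split` is a hypothetical helper name.

```python
# Exhaustive greedy search for the best single split (j, s) of one node.
import numpy as np

def best_split(X, y):
    best = (None, None, np.inf)                 # (feature j, threshold s, RSS)
    for j in range(X.shape[1]):
        xs = np.unique(X[:, j])
        for s in (xs[:-1] + xs[1:]) / 2:        # midpoints: both sides non-empty
            left = X[:, j] <= s
            rss = (((y[left] - y[left].mean()) ** 2).sum() +
                   ((y[~left] - y[~left].mean()) ** 2).sum())
            if rss < best[2]:
                best = (j, s, rss)
    return best
```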
- Tree size is a hyperparameter
- Pruning
- Option 1
- Split only if the decrease in the sum of squares exceeds some threshold
- Short-sighted: a seemingly worthless split may enable a much better split further down the tree
- Option 2
- Grow a large tree, stopping only at a minimum node size (say 5), then prune it back
- $N_m$ = number of observations in the $m$-th node
- $\hat c_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i$
- $Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2$
- Cost-Complexity Pruning
- $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$
- $\alpha$ governs the trade-off; larger values lead to smaller trees (scikit-learn sketch below)
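In practice the pruning path is available directly; for example, scikit-learn exposes the sequence of effective alphas, and $\alpha$ can then be chosen by cross-validation. The toy data below is illustrative.

```python
# Cost-complexity pruning: compute the sequence of effective alphas for a
# fully grown tree, then pick the alpha with the best cross-validated score.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.2, 300)

path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
scores = [cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean() for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
```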
- Classification Setting
- $\hat p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I\{y_i = k\}$
- Splitting Criteria
- Misclassification error: $1 - \hat p_{mk(m)}$, where $k(m)$ is the majority class in node $m$
- Gini index: $\sum_{k=1}^{K} \hat p_{mk}(1 - \hat p_{mk})$
- Probability of misclassification under random label assignment
- Variance of the binomial distribution in the two-class case
- Cross-entropy: $-\sum_{k=1}^{K} \hat p_{mk} \log \hat p_{mk}$
- Gini index and cross-entropy are more sensitive to changes in the node probabilities, so they are preferred for growing the tree (comparison below)
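A quick numeric check of the three criteria on one vector of node proportions; the helper names are illustrative.

```python
# The three impurity measures for a node with class proportions p.
import numpy as np

def misclass(p):
    return 1 - p.max()

def gini(p):
    return (p * (1 - p)).sum()

def cross_entropy(p):
    p = p[p > 0]                      # 0 log 0 is taken as 0
    return -(p * np.log(p)).sum()

p = np.array([0.7, 0.2, 0.1])
print(misclass(p), gini(p), cross_entropy(p))   # 0.3, 0.46, 0.80
```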
- Splitting categorical variable
- N levels give $2^{N-1} - 1$ possible partitions
- Order the categories by the proportion falling in the outcome class
- Treat the ordered variable as continuous (example below)
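A toy illustration of the ordering trick for a binary outcome, using pandas; the column names are made up.

```python
# Order the levels of a categorical predictor by the proportion of positives,
# then encode by rank so the usual X_j <= s split search applies.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue", "green"],
                   "y":     [1,     0,      1,     0,       1,      0]})
order = df.groupby("color")["y"].mean().sort_values()       # green < blue < red
rank = {level: i for i, level in enumerate(order.index)}
df["color_rank"] = df["color"].map(rank)                     # now "continuous"
```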
- Missing Values
- Create a new level within the variable for missing observations (snippet below)
- Create a surrogate variable for missing values
- Split by non-missing values
- Leverage the correlation between the primary split variable and the surrogates to minimize the loss of information
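The first option in code, assuming a pandas Series; the level name "Missing" is arbitrary.

```python
# Treat missingness as its own category so the tree can split on it directly.
import pandas as pd

color = pd.Series(["red", None, "blue", "red"])
color = color.fillna("Missing")       # "Missing" becomes just another level
```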
- Evaluation
- $L_{k\ell}$ = loss for predicting a class-$k$ object as class $\ell$
- $L_{00} = L_{11} = 0$
- $L_{10}$ = loss of a false negative
- $L_{01}$ = loss of a false positive
- Sensitivity:
- Predicting disease as disease (Recall)
- TP / (TP + FN)
- With $N_{k\ell}$ = count of class-$k$ objects predicted as $\ell$: $N_{11} / (N_{11} + N_{10})$
- Specificity:
- Predicting non-disease as non-disease
- TN / (TN + FP)
- $N_{00} / (N_{00} + N_{01})$
- AUC-ROC
- How sensitivity (y-axis) and specificity (x-axis) vary with the classification threshold
- Area under the ROC curve is the C-statistic
- Equivalent to the Mann-Whitney U test (Wilcoxon rank-sum test)
- Median difference in the prediction scores for the two groups (snippet below)
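Computing sensitivity, specificity, and the C-statistic with scikit-learn; the scores and the 0.5 cutoff are purely illustrative.

```python
# Sensitivity/specificity at one threshold, plus the threshold-free AUC.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.55])
y_pred = (scores >= 0.5).astype(int)             # arbitrary cutoff

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                     # recall
specificity = tn / (tn + fp)
auc = roc_auc_score(y_true, scores)              # C-statistic
```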
- MARS (Multivariate Adaptive Regression Splines)
- High-dimensional regression
- Piecewise-linear basis functions: hinge pairs $(x - t)_+$ and $(t - x)_+$
- Analogous to decision tree splits
- Can handle interactions via products of basis functions (sketch below)
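The reflected hinge pair MARS creates at a knot $t$; `hinge_pair` is an illustrative helper, and the product at the end shows how interactions enter.

```python
# MARS-style basis: the reflected pair (x - t)_+ and (t - x)_+ at knot t.
import numpy as np

def hinge_pair(x, t):
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

x1 = np.linspace(-2, 2, 9)
x2 = np.linspace(0, 4, 9)
h1, _ = hinge_pair(x1, t=0.5)
h2, _ = hinge_pair(x2, t=1.0)
interaction = h1 * h2          # product of hinges = interaction term
```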
PRIM
- Patient Rule Induction Method
- Boxes with high response rates
- Non-tree partitioning structure
- Start with a box containing all the data, then alternate:
- Peeling: shrink the box by removing the slice whose removal leaves the largest mean response
- Pasting: expand the box along the face that gives the largest increase in the mean (peeling sketch below)
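A peeling-only sketch of PRIM (no pasting step); `alpha` (the fraction peeled per step) and `min_support` are tuning choices, and all names are illustrative.

```python
# Top-down peeling: repeatedly remove the alpha-fraction slice of the current
# box whose removal leaves the highest mean response, until the box is small.
import numpy as np

def peel(X, y, alpha=0.1, min_support=10):
    inside = np.ones(len(y), dtype=bool)        # indicator: point is in the box
    while inside.sum() > min_support:
        best_mask, best_mean = None, y[inside].mean()
        for j in range(X.shape[1]):
            lo = np.quantile(X[inside, j], alpha)
            hi = np.quantile(X[inside, j], 1 - alpha)
            for cand in (inside & (X[:, j] >= lo),   # peel the low face
                         inside & (X[:, j] <= hi)):  # peel the high face
                if cand.sum() >= min_support and y[cand].mean() > best_mean:
                    best_mask, best_mean = cand, y[cand].mean()
        if best_mask is None:
            break                                # no peel raises the mean
        inside = best_mask
    return inside                                # points in the final box
```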
Mixture of Experts
- Tree splits are not hard decisions but soft probabilities
- Terminal nodes are called experts
- A linear model is fit in each terminal node
- Non-terminal nodes are called gating networks
- The predictions of the experts are weighted and combined by the gating networks
- Estimation via the EM algorithm (prediction-side sketch below)
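A prediction-side sketch of a two-expert mixture with a softmax gate and linear experts; parameter estimation by EM is omitted, and all names and shapes are illustrative.

```python
# Mixture of experts, forward pass: the gate assigns soft probabilities to the
# experts and the final prediction is the gate-weighted sum of expert outputs.
import numpy as np

def moe_predict(X, gate_w, expert_ws):
    logits = X @ gate_w                                    # (n, n_experts)
    g = np.exp(logits - logits.max(axis=1, keepdims=True)) # stable softmax
    g /= g.sum(axis=1, keepdims=True)
    preds = np.stack([X @ w for w in expert_ws], axis=1)   # (n, n_experts)
    return (g * preds).sum(axis=1)                         # soft combination

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
gate_w = rng.normal(size=(3, 2))                 # gating network: 2 experts
expert_ws = [rng.normal(size=3), rng.normal(size=3)]
yhat = moe_predict(X, gate_w, expert_ws)
```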