Additive Models, Trees, and Related Methods
This chapter introduces flexible methods that can capture non-linear relationships while remaining interpretable. We start with generalized additive models, then dive deep into decision trees — one of the most intuitive and widely-used machine learning methods.
The Big Picture
Linear models are simple and interpretable but can miss important non-linear patterns. At the other extreme, highly flexible methods (like neural networks) can fit anything but are hard to interpret.
Additive models and trees offer a middle ground: they capture non-linear relationships while remaining interpretable.
Generalized Additive Models (GAMs)
The Limitation of Linear Models
In linear regression, we model: $$E[Y|X] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p$$
Each predictor has a simple linear effect. But what if the relationship is curved? What if income affects health differently for low vs. high earners?
The GAM Solution
Replace linear terms with flexible smooth functions:
$$E[Y|X] = \alpha + f_1(X_1) + f_2(X_2) + ... + f_p(X_p)$$
Where each $f_j$ is learned from data — typically a smooth curve like a spline.
More Generally: Link Functions
For non-normal responses (binary, count data), we use a link function:
$$g[E[Y|X]] = \alpha + f_1(X_1) + f_2(X_2) + ... + f_p(X_p)$$
Common link functions:
- Identity (linear regression): $g(\mu) = \mu$
- Logit (logistic regression): $g(\mu) = \log(\mu/(1-\mu))$
- Log (Poisson regression): $g(\mu) = \log(\mu)$
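For intuition, a tiny numeric check of the logit link using scipy's `logit` and `expit`: the additive predictor lives on an unbounded scale, and the inverse link maps it back to a mean (the numbers are illustrative).

```python
from scipy.special import expit, logit  # logit link and its inverse

mu = 0.8                 # a mean (probability) strictly between 0 and 1
eta = logit(mu)          # g(mu) = log(mu / (1 - mu)) ~= 1.386, lives on (-inf, inf)
print(eta, expit(eta))   # the inverse link recovers mu = 0.8
```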
Interpretability
Key advantage: Each $f_j$ shows exactly how predictor $X_j$ affects the response. You can plot $f_j(X_j)$ and see the relationship!
- Is it linear? Curved? Has a threshold?
- The shape is learned from data, not assumed
Fitting GAMs
GAMs are fit by minimizing a penalized residual sum of squares (PRSS), which balances fitting the data against keeping each function smooth:
$$\text{PRSS} = \sum_{i=1}^N \left(y_i - \alpha - \sum_j f_j(x_{ij})\right)^2 + \sum_j \lambda_j \int f_j''(t)^2 dt$$
The penalty term punishes "wiggly" functions (large second derivatives).
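For intuition, here is a minimal backfitting sketch (one standard way additive models are fit), assuming continuous predictors and using scipy's `UnivariateSpline` as the per-coordinate smoother. The `smooth` argument and the function names are illustrative stand-ins for the $\lambda_j$ machinery, not any particular library's GAM API.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit_gam(X, y, n_iters=20, smooth=None):
    """Backfitting sketch for E[Y|X] = alpha + sum_j f_j(X_j) with spline smoothers."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))                         # current estimates of f_j(x_ij)
    for _ in range(n_iters):
        for j in range(p):
            # partial residual: subtract everything except the current f_j
            r = y - alpha - f.sum(axis=1) + f[:, j]
            order = np.argsort(X[:, j])
            # `smooth` is the spline smoothing factor (larger = smoother), analogous to lambda_j
            spline = UnivariateSpline(X[order, j], r[order], s=smooth)
            fj = spline(X[:, j])
            f[:, j] = fj - fj.mean()             # center each f_j for identifiability
    return alpha, f

# Toy data: y depends non-linearly on both predictors
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, 300)
alpha, f = backfit_gam(X, y)
print(alpha, f.shape)                            # each column of f can be plotted against X[:, j]
```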
Decision Trees
Decision trees are perhaps the most intuitive machine learning method. They partition the feature space into regions and fit simple models (often just constants) in each region.
The Core Idea
Think of playing "20 Questions" with your data:
- "Is income > $50K?" → split data into two groups
- "Is age > 40?" → further split
- Keep splitting until groups are "pure" (mostly one class/similar values)
The result is a tree structure that's easy to understand and explain.
Regression Trees
For continuous outcomes, we:
- Partition the feature space into rectangles $R_1, R_2, ..., R_M$
- Fit a constant in each region: $c_m = \text{average of } y_i \text{ in } R_m$
The model is: $$f(X) = \sum_{m=1}^M c_m \cdot I\{X \in R_m\}$$
Where $I\{\cdot\}$ is the indicator function (1 if true, 0 otherwise).
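Read literally, prediction is just a lookup: find the region containing the point and return that region's constant. A tiny sketch with two hand-made intervals on a single predictor (the regions and constants are made up for illustration):

```python
def tree_predict(x, regions, constants):
    """Piecewise-constant prediction: f(x) = sum_m c_m * I{x in R_m}."""
    for (low, high), c in zip(regions, constants):
        if low <= x < high:          # indicator I{x in R_m}
            return c
    raise ValueError("x falls outside every region")

# Hypothetical partition of [0, 1) into two regions with fitted constants
regions = [(0.0, 0.5), (0.5, 1.0)]   # R_1, R_2
constants = [1.2, 3.4]               # c_m = average of the y_i falling in R_m
print(tree_predict(0.3, regions, constants), tree_predict(0.7, regions, constants))  # 1.2 3.4
```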
How to Find the Best Splits
We use a greedy algorithm — at each node, find the split that most reduces error.
For a split on variable $X_j$ at value $s$:
- Left region: $R_1 = \{X \mid X_j \leq s\}$
- Right region: $R_2 = \{X \mid X_j > s\}$
Choose $j$ and $s$ to minimize: $$\min_{j,s} \left[\min_{c_1}\sum_{x_i \in R_1}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2}(y_i - c_2)^2\right]$$
The inner minimizations are easy — just take averages in each region!
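A minimal sketch of this greedy search for a single node, assuming numeric predictors: scan every $(j, s)$ candidate, use the within-region means as $c_1, c_2$, and keep the split with the smallest total squared error.

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the (feature, threshold) pair minimizing total RSS."""
    n, p = X.shape
    best = (None, None, np.inf)                       # (j, s, total RSS)
    for j in range(p):
        values = np.unique(X[:, j])
        # candidate thresholds: midpoints between consecutive distinct values
        for s in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= s
            right = ~left
            # inner minimizations: the best constant in each region is its mean
            rss = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, float(s), rss)
    return best

# Toy data where the true split is on feature 1 near 0.6
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 3))
y = np.where(X[:, 1] > 0.6, 5.0, 1.0) + rng.normal(0, 0.1, 200)
print(best_split(X, y))
```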
Classification Trees
For categorical outcomes, each leaf predicts the most common class in that region:
$$\hat{G}_m = \text{majority class in } R_m$$
The proportion of class k in node m is: $$\hat{p}_{mk} = \frac{1}{N_m}\sum_{x_i \in R_m} I\{y_i = k\}$$
Splitting Criteria for Classification
We need to measure how "impure" a node is. Several options:
Misclassification Error: $1 - \max_k \hat{p}_{mk}$
- Simple but not differentiable — hard to optimize
Gini Index: $\sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk})$
- Measures probability of misclassifying a randomly chosen element
- Equals $2\hat{p}(1-\hat{p})$ when K=2, i.e., the sum of the two classes' Bernoulli variances
- Most commonly used
Cross-Entropy (Deviance): $-\sum_{k=1}^K \hat{p}_{mk}\log\hat{p}_{mk}$
- Information-theoretic measure of impurity
- Similar to Gini in practice
Note: Gini and entropy are more sensitive to node purity changes than misclassification error, making them better for tree growing.
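A quick numerical comparison of the three measures in the two-class case (a standalone sketch, not library code); it shows how Gini and entropy keep rewarding purity while the misclassification error changes only linearly.

```python
import numpy as np

def misclassification(p):
    return 1 - p.max()

def gini(p):
    return np.sum(p * (1 - p))

def entropy(p):
    q = p[p > 0]                     # convention: 0 * log 0 = 0
    return -np.sum(q * np.log(q))

for p1 in (0.5, 0.7, 0.9, 0.99):     # proportion of class 1 in the node
    p = np.array([p1, 1 - p1])
    print(f"p1={p1:.2f}  err={misclassification(p):.3f}  "
          f"gini={gini(p):.3f}  entropy={entropy(p):.3f}")
# Gini and entropy keep dropping as the node gets purer; the misclassification
# error is piecewise linear, so it can fail to reward a split that makes the
# children purer without changing their majority class.
```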
Handling Categorical Predictors
If variable X has L unordered categories, there are $2^{L-1} - 1$ possible binary splits, which grows exponentially in L. For classification with two classes there is a shortcut:
Trick: Order the categories by their proportion of class 1, then treat the predictor as if it were ordered. The best split is then among only $L-1$ candidate splits!
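A small sketch of that ordering trick for a string-valued predictor and a binary outcome (the data and names are made up); after sorting by class-1 proportion, only the $L-1$ "prefix vs. rest" splits need to be scored.

```python
import numpy as np

def ordered_category_splits(categories, y):
    """Return the L-1 candidate splits after ordering categories by P(class 1)."""
    cats = np.unique(categories)
    prop = {c: y[categories == c].mean() for c in cats}   # class-1 proportion per category
    ordered = sorted(cats, key=prop.get)
    # each candidate split sends a prefix of the ordered categories to the left child
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]

# Hypothetical 3-category predictor and binary outcome
categories = np.array(["a", "b", "a", "c", "c", "b", "c"])
y = np.array([0, 1, 0, 1, 1, 0, 1])
for left, right in ordered_category_splits(categories, y):
    print(list(left), "|", list(right))
```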
Handling Missing Values
Two common approaches:
1. Missing as a Category: Create a new category for missing values.
2. Surrogate Splits: Find alternative splits that mimic the primary split. At prediction time, if the primary split variable is missing, use the surrogate.
The surrogate approach leverages correlations between predictors to minimize information loss.
Pruning: Controlling Tree Size
The Problem
Trees that grow too large:
- Overfit the training data
- Have high variance
- Are harder to interpret
Option 1: Stop Early
Split only if improvement exceeds a threshold.
Problem: A bad split now might enable great splits later! This is short-sighted.
Option 2: Grow and Prune (Better!)
- Grow a large tree (until leaves have few observations)
- Prune back to find the best subtree
Cost-Complexity Pruning
Define a cost that balances fit and complexity:
$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha|T|$$
Where:
- $|T|$ = number of terminal nodes (leaves)
- $Q_m$ = impurity measure for node m (e.g., RSS/N_m for regression)
- $\alpha$ = complexity penalty parameter
- Small α: prefer large trees (focus on fit)
- Large α: prefer small trees (focus on simplicity)
Algorithm:
- For each α, find the subtree that minimizes $C_\alpha(T)$
- Use cross-validation to select the best α
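A sketch of this using scikit-learn's minimal cost-complexity pruning (available since version 0.22), which implements this weakest-link idea via `cost_complexity_pruning_path` and the `ccp_alpha` parameter; the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Grow a full tree, then get the sequence of effective alphas from weakest-link pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate each candidate alpha and keep the best one
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print("best alpha:", best_alpha, "leaves after pruning:", pruned.get_n_leaves())
```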
Evaluating Classification Performance
The Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Key Metrics
Sensitivity (Recall, True Positive Rate): $$\text{Sensitivity} = \frac{TP}{TP + FN}$$ "Of all actual positives, what fraction did we catch?"
Specificity (True Negative Rate): $$\text{Specificity} = \frac{TN}{TN + FP}$$ "Of all actual negatives, what fraction did we correctly identify?"
Precision: $$\text{Precision} = \frac{TP}{TP + FP}$$ "Of all predicted positives, what fraction are correct?"
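These three ratios follow directly from the confusion-matrix counts; a plain sketch with hypothetical numbers:

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, and precision from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # of all actual positives, fraction caught
    specificity = tn / (tn + fp)   # of all actual negatives, fraction correctly identified
    precision = tp / (tp + fp)     # of all predicted positives, fraction correct
    return sensitivity, specificity, precision

# Hypothetical counts
print(classification_metrics(tp=80, fn=20, fp=10, tn=90))   # (0.8, 0.9, 0.888...)
```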
The ROC Curve
The Receiver Operating Characteristic (ROC) curve shows the tradeoff between sensitivity and specificity as you vary the classification threshold.
- X-axis: 1 - Specificity (False Positive Rate)
- Y-axis: Sensitivity (True Positive Rate)
AUC (Area Under the ROC Curve):
- AUC = 1: Perfect classifier
- AUC = 0.5: Random guessing
- AUC > 0.7: Generally acceptable
Interpretation: AUC is the probability that a randomly chosen positive example ranks higher than a randomly chosen negative example.
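That rank interpretation can be checked directly by comparing the pairwise-comparison estimate against scikit-learn's `roc_auc_score`; the labels and scores below are random and purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                   # true binary labels
scores = y + rng.normal(0, 1.0, size=200)          # noisy scores; positives rank higher on average

pos, neg = scores[y == 1], scores[y == 0]
diff = pos[:, None] - neg[None, :]                 # all positive-vs-negative pairs
rank_prob = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)   # ties count as 1/2
print(rank_prob, roc_auc_score(y, scores))         # the two numbers match
```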
MARS: Multivariate Adaptive Regression Splines
MARS extends tree ideas to regression with smoother, continuous functions.
The Idea
Instead of piecewise constant regions, use piecewise linear basis functions:
$$(x - t)_+ = \max(0, x-t)$$ $$(t - x)_+ = \max(0, t-x)$$
These are "hockey stick" functions: each is zero on one side of the knot $t$ and rises linearly on the other.
MARS Model
$$f(X) = \beta_0 + \sum_{m=1}^M \beta_m h_m(X)$$
Where each $h_m$ is a product of basis functions (allowing interactions).
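A sketch of the hinge basis and an ordinary least-squares fit at a few hand-picked knots for a single predictor; real MARS chooses knots and interaction products adaptively (forward selection followed by backward pruning), so this only illustrates the basis itself.

```python
import numpy as np

def hinge_pair(x, t):
    """The mirrored hinge functions (x - t)_+ and (t - x)_+ at knot t."""
    return np.maximum(0, x - t), np.maximum(0, t - x)

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 400)
y = np.abs(x - 1) + rng.normal(0, 0.1, 400)        # true function has a kink at x = 1

# Design matrix: intercept plus hinge pairs at a few hand-picked knots
knots = [-1.0, 0.0, 1.0, 2.0]
columns = [np.ones_like(x)]
for t in knots:
    columns.extend(hinge_pair(x, t))
H = np.column_stack(columns)

# Ordinary least squares on the hinge basis gives a continuous piecewise-linear fit
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
print(beta.round(2))
```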
Connection to Trees
MARS is like a regression tree with smoother predictions at boundaries. Trees have sharp jumps; MARS transitions smoothly.
PRIM: Patient Rule Induction Method
PRIM takes a different approach: find regions (boxes) with unusually high (or low) response values.
The Algorithm
- Start with a box containing all data
- Peeling: Shrink the box by removing a thin slice from one face, choosing the slice whose removal leaves the highest mean response inside the box
- Pasting: Try expanding the box if it improves the mean
- Repeat to find multiple boxes
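A minimal sketch of the peeling step for numeric predictors (pasting and the search for multiple boxes are omitted, and the peeling fraction `alpha` and `min_support` are illustrative parameters): at each iteration it removes the slice, from one face, whose removal leaves the highest mean response inside the box.

```python
import numpy as np

def prim_peel(X, y, alpha=0.10, min_support=20):
    """Greedy peeling: shrink a box around the data to raise the mean response inside it."""
    box = np.column_stack([X.min(axis=0), X.max(axis=0)])   # box[j] = (low_j, high_j)
    inside = np.ones(len(y), dtype=bool)
    while inside.sum() > min_support:
        best = None                                         # (mean, j, side, edge, keep)
        for j in range(X.shape[1]):
            lo = np.quantile(X[inside, j], alpha)           # peel a slice off the low face
            hi = np.quantile(X[inside, j], 1 - alpha)       # peel a slice off the high face
            for side, edge in (("low", lo), ("high", hi)):
                keep = inside & (X[:, j] >= edge if side == "low" else X[:, j] <= edge)
                if keep.sum() >= min_support:
                    m = y[keep].mean()
                    if best is None or m > best[0]:
                        best = (m, j, side, edge, keep)
        if best is None or best[4].sum() == inside.sum():   # nothing left to peel
            break
        _, j, side, edge, keep = best
        box[j, 0 if side == "low" else 1] = edge
        inside = keep
    return box, inside

# Toy data: the response is high only in the corner x0 > 0.7, x1 > 0.7
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(500, 2))
y = ((X[:, 0] > 0.7) & (X[:, 1] > 0.7)).astype(float) + rng.normal(0, 0.1, 500)
box, inside = prim_peel(X, y)
print(box.round(2), inside.sum())                           # the box should home in on that corner
```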
Use Case
Useful for finding "hot spots" — regions where the response is particularly high. Applications include fraud detection, medical diagnosis, quality control.
Mixture of Experts
This is a probabilistic generalization of decision trees.
The Idea
Instead of hard splits (left or right), use soft probabilistic splits:
- Each observation has some probability of going to each child node
- "Gating networks" at internal nodes determine these probabilities
- "Expert" models at leaves make predictions
- Final prediction is a weighted average
Structure
Gating Networks (internal nodes):
- Soft decision functions
- Output probabilities for each branch
Experts (terminal nodes):
- Fit local models (often linear regression)
- Each expert specializes in a region
Fitting
Use the EM algorithm:
- E-step: Compute responsibilities (how much each expert contributes to each observation)
- M-step: Update expert parameters and gating parameters
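A compact, self-contained EM sketch for a one-level mixture of K linear experts with a softmax gate on a 1-D input; the gate update uses a single gradient step per M-step (a generalized-EM shortcut), and every name here is illustrative rather than taken from a library.

```python
import numpy as np

def fit_mixture_of_experts(x, y, K=2, n_iters=100, gate_lr=0.5):
    """EM sketch: K linear experts, softmax gating network, 1-D input (illustrative)."""
    rng = np.random.default_rng(0)
    n = len(x)
    Phi = np.column_stack([x, np.ones(n)])        # input with intercept
    W = rng.normal(0, 1, (K, 2))                  # expert coefficients (slope, intercept)
    sigma2 = np.ones(K)                           # expert noise variances
    V = rng.normal(0, 0.1, (K, 2))                # gating coefficients

    def gate_probs():
        logits = Phi @ V.T
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: responsibilities r[i, k] proportional to g_k(x_i) * N(y_i | expert k)
        means = Phi @ W.T
        lik = np.exp(-0.5 * (y[:, None] - means) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        r = gate_probs() * lik + 1e-12            # small floor avoids 0/0
        r /= r.sum(axis=1, keepdims=True)

        # M-step, experts: weighted least squares and weighted residual variance
        for k in range(K):
            Rk = r[:, k]
            A = Phi.T @ (Rk[:, None] * Phi)
            b = Phi.T @ (Rk * y)
            W[k] = np.linalg.solve(A, b)
            sigma2[k] = max(np.sum(Rk * (y - Phi @ W[k]) ** 2) / Rk.sum(), 1e-6)

        # M-step, gate: one gradient step on sum_i sum_k r_ik * log g_k(x_i)
        V += gate_lr * (r - gate_probs()).T @ Phi / n

    return W, sigma2, V

# Toy data with two regimes of different slope; the gate should learn the switch near x = 0
rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, 400)
y = np.where(x < 0, -2.0 * x, 0.5 * x) + rng.normal(0, 0.2, 400)
W, sigma2, V = fit_mixture_of_experts(x, y)
print(W.round(2))
```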
Advantages
- Smoother predictions than hard trees
- Naturally provides uncertainty estimates
- Can capture complex interactions
Summary: Choosing a Method
| Method | Best For | Interpretability | Flexibility |
|---|---|---|---|
| GAM | Understanding non-linear effects | High (plot each effect) | Medium |
| Decision Tree | Simple rules, categorical outcomes | Very High | Medium |
| MARS | Regression with interactions | Medium | Medium-High |
| PRIM | Finding high-response regions | High | Low |
| Mixture of Experts | Complex boundaries with uncertainty | Low | High |
Key Takeaways
- Trees are intuitive: Easy to explain and visualize
- Pruning is essential: Unpruned trees overfit badly
- GAMs maintain interpretability: Each predictor's effect is visible
- Splitting criteria matter: Gini/entropy better than misclassification for tree growing
- These methods form building blocks: Trees are the foundation for Random Forests and Boosting!