Decision Trees and Ensembles
Decision trees are intuitive models that partition the feature space into regions. While single trees are prone to overfitting, ensemble methods (Random Forests, Boosting) combine many trees for powerful, robust predictions.
The Big Picture
Trees: Recursively partition space with simple rules.
- Highly interpretable
- But high variance (unstable)
Ensembles: Combine many trees.
- Random Forests: Reduce variance through averaging
- Boosting: Reduce bias through sequential correction
Decision Tree Structure
The Model
$$f(x) = \sum_{j=1}^{J} w_j \cdot \mathbb{I}\{x \in R_j\}$$
Where:
- $R_j$ are disjoint regions (leaves)
- $w_j$ is the prediction for region j
- $\mathbb{I}\{\cdot\}$ is the indicator function
Building a Tree
At each node i:
- Select a feature $d_i$
- Select a threshold $t_i$
- Split: left if $x_{d_i} \leq t_i$, right otherwise
Result: Axis-parallel partitions of the feature space.
Leaf Predictions
Regression: Average of training labels in region $$w_j = \frac{\sum_{n: x_n \in R_j} y_n}{\sum_{n: x_n \in R_j} 1}$$
Classification: Majority vote or probability distribution
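As a minimal sketch of the model above (assuming scikit-learn and a synthetic one-dimensional dataset), the following fits a shallow regression tree and checks that each leaf value $w_j$ is just the mean of the training labels routed to that leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))              # one feature, for easy inspection
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# tree.apply maps every sample to the index of the leaf (region R_j) it falls in;
# the stored leaf value w_j should equal the mean label of those samples.
leaf_ids = tree.apply(X)
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    print(leaf, y[in_leaf].mean(), float(tree.tree_.value[leaf].squeeze()))
```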
Finding Optimal Splits
The Greedy Algorithm
Finding the globally optimal tree is NP-hard, so we use a greedy approach:
At each node, find the best split by minimizing: $$L = \frac{|D_L|}{|D|} C_L + \frac{|D_R|}{|D|} C_R$$
Where $C_L, C_R$ are the costs (impurities) of the left and right children and $D_L, D_R \subseteq D$ are the subsets of data routed to each side.
Splitting Criteria
For Regression (MSE): $$C = \frac{1}{|D|}\sum_{i \in D}(y_i - \bar{y})^2$$
For Classification:
Gini Index: $$C = \sum_{c=1}^C \hat{p}_c(1 - \hat{p}_c)$$ The probability of misclassifying a randomly chosen element if it were labeled at random according to the node's class distribution.
Cross-Entropy: $$C = -\sum_{c=1}^C \hat{p}_c \log \hat{p}_c$$ Information-theoretic measure of impurity.
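A small sketch of one greedy step with the Gini index as the impurity (the helper names here are mine, not a library API): score every candidate threshold for a feature by the weighted child impurity $L$ defined above and keep the best one.

```python
import numpy as np

def gini(y):
    """Gini index: sum_c p_c (1 - p_c) over the empirical class probabilities."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def split_cost(x, y, t, impurity=gini):
    """Weighted impurity |D_L|/|D| * C_L + |D_R|/|D| * C_R for the split x <= t."""
    left = x <= t
    n, n_left = len(y), int(left.sum())
    if n_left in (0, n):                           # degenerate split: nothing separated
        return impurity(y)
    return (n_left / n) * impurity(y[left]) + ((n - n_left) / n) * impurity(y[~left])

# Greedy search over one feature: try the midpoints between consecutive sorted values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
thresholds = (x[:-1] + x[1:]) / 2
best = min(thresholds, key=lambda t: split_cost(x, y, t))
print(best)                                        # 3.5 separates the two classes perfectly
```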
Why Binary Splits?
- More splits = more data fragmentation
- Binary splits are sufficient (can always split further)
- Simpler to optimize
Regularization (Preventing Overfitting)
Option 1: Early Stopping
Stop growing when:
- Maximum depth reached
- Minimum samples per leaf
- Improvement below threshold
Problem: May stop too early (miss good splits downstream).
Option 2: Grow and Prune
- Grow full tree (until pure leaves or minimum samples)
- Prune back using cost-complexity criterion:
$$C_\alpha(T) = \sum_{j=1}^{|T|} N_j \cdot C_j + \alpha |T|$$
Where $|T|$ is the number of leaves, $N_j$ and $C_j$ are the sample count and impurity of leaf $j$, and $\alpha$ is the complexity penalty.
Use cross-validation to select optimal α.
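A sketch of the grow-and-prune workflow with scikit-learn (the dataset is just an example): `cost_complexity_pruning_path` enumerates the α values at which the full tree's pruning changes, and cross-validation picks among them.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow the full tree once and read off the candidate alphas of its pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate a pruned tree for each alpha and keep the best one.
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(best_alpha, max(scores))
```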
Handling Missing Features
Categorical: Treat "missing" as new category.
Continuous: Use surrogate splits — find alternative splits that best mimic the primary split.
Pros and Cons of Trees
Advantages
- Interpretable: Easy to visualize and explain
- Minimal preprocessing: Handles mixed types, no normalization needed
- Fast: Prediction costs O(tree depth), logarithmic in the number of leaves for a balanced tree
- Robust to outliers: Splits depend only on the ordering of feature values, not their magnitudes
Disadvantages
- High variance: Small data changes → different tree
- Axis-aligned only: Can't capture diagonal boundaries efficiently
- Prone to overfitting: Without regularization
Ensemble Learning
The Core Idea
Combine multiple models to reduce errors.
Regression: Average predictions $$\hat{y} = \frac{1}{M}\sum_{m=1}^M f_m(x)$$
Classification: Majority vote or average probabilities
Why Ensembles Work
For M independent classifiers each with accuracy p > 0.5: $$P(\text{majority correct}) = \sum_{k > M/2} \binom{M}{k} p^k (1-p)^{M-k}$$
As M → ∞, this probability → 1!
Key requirement: Classifiers must be diverse (uncorrelated errors).
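A quick numerical check of this formula (the function name is mine):

```python
from math import comb

def p_majority_correct(M, p):
    """P(majority of M independent classifiers is correct), each with accuracy p."""
    return sum(comb(M, k) * p**k * (1 - p)**(M - k) for k in range(M // 2 + 1, M + 1))

# Even with individually mediocre classifiers (p = 0.6), the majority vote
# improves quickly as M grows, provided the errors really are independent.
for M in (1, 5, 25, 101):
    print(M, round(p_majority_correct(M, 0.6), 4))
```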
Stacking
Learn weights for combining models: $$\hat{y} = \sum_m w_m f_m(x)$$
Train weights on held-out data to avoid overfitting.
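A minimal stacking sketch (assuming scikit-learn; the base models and dataset are arbitrary choices): base models are fit on a training split, and the combining weights $w_m$ are fit by linear regression on their held-out predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Base models see only the training split.
models = [DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr),
          Ridge().fit(X_tr, y_tr)]

# The meta-learner sees only held-out predictions, so the weights are not
# biased toward whichever base model overfits the training data most.
P_val = np.column_stack([m.predict(X_val) for m in models])
meta = LinearRegression().fit(P_val, y_val)
print(meta.coef_, meta.intercept_)             # learned combining weights w_m
```

scikit-learn's StackingRegressor wraps the same idea, using cross-validated predictions instead of a single held-out split.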
Bagging (Bootstrap Aggregating)
Algorithm
- Create B bootstrap samples (sample with replacement)
- Train a tree on each bootstrap sample
- Average predictions (regression) or vote (classification)
Key Properties
Each bootstrap sample contains ~63% of unique points: $$P(\text{included}) = 1 - (1 - 1/N)^N \approx 1 - 1/e \approx 0.632$$
OOB Error: Evaluate each tree on its out-of-bag samples (free cross-validation!)
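Both facts are easy to check empirically (assuming scikit-learn; the dataset is just an example):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# A bootstrap sample of size N keeps roughly 63% of the unique points.
rng = np.random.default_rng(0)
N = 10_000
boot = rng.integers(0, N, size=N)              # sample N indices with replacement
print(len(np.unique(boot)) / N)                # ~0.632

# oob_score=True scores each tree on the points left out of its bootstrap sample.
X, y = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                        oob_score=True, random_state=0).fit(X, y)
print(bag.oob_score_)
```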
Variance Reduction
$$\text{Var}(\bar{f}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
Where ρ is correlation between trees.
- More trees (larger B) → second term vanishes
- Less correlation (smaller ρ) → first term shrinks
Random Forests
Beyond Bagging
Bagging helps, but trees trained on overlapping bootstrap samples remain highly correlated.
Random Forest innovation: Add randomness to splits.
At each split:
- Randomly select m features (typically $m = \sqrt{p}$ for classification, $m = p/3$ for regression)
- Find best split among only those m features
Why It Works
- Forces trees to use different features
- Reduces correlation between trees
- Combined with bagging → powerful ensemble
Extra Trees
Even more randomness:
- Random feature subset (like RF)
- Random threshold selection (not optimized)
- Faster training, often similar performance
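A side-by-side sketch (assuming scikit-learn; dataset and settings are arbitrary): `max_features` is the size m of the random feature subset tried at each split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# max_features="sqrt" corresponds to m = sqrt(p), the usual choice for classification.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
et = ExtraTreesClassifier(n_estimators=300, max_features="sqrt", random_state=0)

print("random forest:", cross_val_score(rf, X, y, cv=5).mean())
print("extra trees  :", cross_val_score(et, X, y, cv=5).mean())
```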
Boosting
The Key Idea
Sequentially fit weak learners, each focusing on previous mistakes.
$$F_m(x) = F_{m-1}(x) + \beta_m f_m(x)$$
Boosting reduces bias (unlike bagging which reduces variance).
AdaBoost
For classification with exponential loss: $$L(y, F) = \exp(-yF(x)), \quad y \in \{-1, +1\}$$
Algorithm:
- Initialize equal weights on examples
- For each round:
  - Train a weak learner on the weighted examples
  - Compute the learner's weight from its weighted error (accurate learners count more)
  - Increase the weights on misclassified examples
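A minimal sketch of this loop for the discrete, exponential-loss version, using decision stumps as the weak learners (an illustration under those assumptions, not a drop-in AdaBoost implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = 2 * y01 - 1                                    # exponential loss uses labels in {-1, +1}

M = 50
w = np.full(len(y), 1 / len(y))                    # start with equal example weights
stumps, betas = [], []

for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)      # weighted error of this round
    beta = 0.5 * np.log((1 - err) / (err + 1e-12)) # accurate learners get larger weight
    w = w * np.exp(-beta * y * pred)               # misclassified points get up-weighted
    w /= w.sum()
    stumps.append(stump)
    betas.append(beta)

F = sum(b * s.predict(X) for s, b in zip(stumps, betas))
print("training accuracy:", np.mean(np.sign(F) == y))
```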
Gradient Boosting
Generalize to any differentiable loss:
- Initialize: $F_0 = \arg\min_\gamma \sum_i L(y_i, \gamma)$
- For m = 1 to M:
  - Compute pseudo-residuals: $r_i = -\left.\frac{\partial L(y_i, F)}{\partial F}\right|_{F = F_{m-1}}$
  - Fit a weak learner $f_m$ to the pseudo-residuals
  - Line search for the step size
  - Update: $F_m = F_{m-1} + \eta \cdot f_m$
For MSE loss: Pseudo-residuals = actual residuals!
Regularization via shrinkage: Small learning rate η (0.01-0.1) + more trees.
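A from-scratch sketch for the MSE case, where the pseudo-residuals are the ordinary residuals and the explicit line search can be skipped because each regression tree already fits them optimally within each leaf (assuming scikit-learn for the weak learners; the data is synthetic):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=300)

eta, M = 0.1, 200
F = np.full_like(y, y.mean())                      # F_0: the constant that minimizes MSE
trees = []

for m in range(M):
    residuals = y - F                              # pseudo-residuals = residuals for MSE
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += eta * tree.predict(X)                     # shrinkage: take only a small step
    trees.append(tree)

print("final training MSE:", float(np.mean((y - F) ** 2)))
```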
Stochastic Gradient Boosting
Subsample data at each round:
- Faster training
- Better generalization (regularization effect)
XGBoost
Innovations
Regularized objective: $$L = \sum L(y_i, F(x_i)) + \gamma J + \frac{\lambda}{2}\sum w_j^2$$
Second-order approximation (use the Hessian): expand the loss around the current model in the new tree $f_m$: $$L \approx \sum_i \left[g_i f_m(x_i) + \tfrac{1}{2}h_i f_m(x_i)^2\right] + \text{regularization}$$ where $g_i$ and $h_i$ are the first and second derivatives of $L(y_i, F)$ with respect to $F$, evaluated at $F_{m-1}(x_i)$.
Optimal leaf weights: $$w_j^* = -\frac{G_j}{H_j + \lambda}$$
Where $G_j = \sum_{i \in R_j} g_i$ and $H_j = \sum_{i \in R_j} h_i$ sum over the examples reaching leaf $j$.
Split gain: $$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$
γ acts as a regularizer: a node is split only when the gain above is positive, i.e., when the improvement in the objective exceeds γ.
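The leaf-weight and gain formulas translate directly into code; the numbers below are made up, just to show the mechanics (function names are mine):

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda) for summed gradients G and Hessians H."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting a leaf into (L, R); a negative value means the split is not worth gamma."""
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Toy sums of gradients/Hessians over the examples reaching each child.
print(leaf_weight(G=-4.0, H=8.0, lam=1.0))                                  # ~0.444
print(split_gain(G_L=-3.0, H_L=4.0, G_R=2.0, H_R=4.0, lam=1.0, gamma=0.5))  # positive: worth splitting
```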
Feature Importance
Mean Decrease in Impurity
Sum the impurity decrease at all splits using feature k: $$R_k = \sum_{\text{nodes using } k} \Delta \text{impurity}$$
Average across all trees.
Caveat: Biased toward high-cardinality features.
Permutation Importance
- Compute baseline accuracy
- For each feature k:
  - Permute (shuffle) feature k's values
  - Recompute accuracy; feature k's importance is the drop from baseline
More reliable but slower.
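A bare-bones version of the permutation procedure (assuming scikit-learn; a single shuffle per feature here, whereas practical implementations such as `sklearn.inspection.permutation_importance` repeat and average):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
baseline = model.score(X_te, y_te)

importances = []
for k in range(X_te.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, k] = rng.permutation(X_perm[:, k])   # break the link between feature k and y
    importances.append(baseline - model.score(X_perm, y_te))

top = np.argsort(importances)[::-1][:5]
print(top, np.array(importances)[top])             # the five most important feature indices
```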
Partial Dependence Plots
Visualize effect of feature on prediction: $$\bar{f}_k(x_k) = \frac{1}{N}\sum_{i=1}^N f(x_k, x_{i,-k})$$
Average over all other features.
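A direct implementation of that average (assuming scikit-learn; the model, dataset, and feature index are arbitrary). scikit-learn's `PartialDependenceDisplay` produces the same curves with plotting included.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def partial_dependence(model, X, k, grid):
    """bar f_k(v) = mean_i f(v, x_{i,-k}): clamp feature k to v, average over the data."""
    out = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, k] = v                            # every row gets the same value for feature k
        out.append(model.predict(X_mod).mean())
    return np.array(out)

k = 2                                              # any feature column of interest
grid = np.linspace(X[:, k].min(), X[:, k].max(), 20)
print(partial_dependence(model, X, k, grid))
```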
Summary
| Method | Reduces | Training | Key Hyperparameters |
|---|---|---|---|
| Single Tree | — | Fast | max_depth, min_samples |
| Bagging | Variance | Parallel | n_estimators |
| Random Forest | Variance | Parallel | n_estimators, max_features |
| Boosting | Bias | Sequential | n_estimators, learning_rate, max_depth |
Practical Recommendations
- Start with Random Forest: Works well with minimal tuning
- Try XGBoost/LightGBM: Often best for tabular data
- Tune carefully: Learning rate and n_estimators together
- Early stopping: Monitor validation error
- Feature importance: Helps interpretability
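One way to put these recommendations together, sketched with scikit-learn's histogram-based gradient booster (XGBoost and LightGBM expose the same knobs under similar names; the numbers here are just starting points):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(*load_breast_cancer(return_X_y=True), random_state=0)

# Small learning rate + a generous iteration budget; early stopping on an internal
# validation split decides the effective number of trees.
model = HistGradientBoostingClassifier(learning_rate=0.05, max_iter=1000,
                                       early_stopping=True, validation_fraction=0.1,
                                       n_iter_no_change=20, random_state=0).fit(X_tr, y_tr)
print(model.n_iter_, model.score(X_te, y_te))
```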