Decision Trees
- Recursively split the input (feature) space using simple decision rules, i.e., single-feature threshold tests
- Splits are parallel to the feature axes
- Mathematical Representation (see the sketch at the end of this section)
- $R_j = \{ x : d_1 \le t_1,\ d_2 \ge t_2,\ \dots \}$
- $\hat Y_i = \sum_j w_j \, I\{x_i \in R_j\}$
- $w_j = \frac{\sum_i y_i \, I\{x_i \in R_j\}}{\sum_i I\{x_i \in R_j\}}$
- Types of Decision Trees
- Binary Splits
- Classification and Regression Trees (CART)
- C4.5 (binary splits on continuous features, multi-way splits on categorical features)
- Multi-way Splits:
- CHAID (Chi-Square Automatic Interaction Detection)
- ID3
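A minimal sketch of the prediction formula above: with hypothetical 1-D regions and leaf weights (the thresholds and weights are illustrative, not learned from data), $\hat y$ is a piecewise-constant function of $x$.

```python
import numpy as np

# Hypothetical 1-D regions R_j and leaf weights w_j (illustrative values only;
# in a real tree the thresholds come from split finding and each w_j is the
# mean of y over the training points that fall in R_j).
thresholds = [3.0, 7.0]          # R_1: x <= 3, R_2: 3 < x <= 7, R_3: x > 7
weights = [10.0, 20.0, 30.0]     # w_1, w_2, w_3

def predict(x):
    """y_hat = sum_j w_j * I{x in R_j}: exactly one indicator fires."""
    j = np.searchsorted(thresholds, x)   # index of the region containing x
    return weights[j]

print([predict(x) for x in (1.0, 5.0, 9.0)])   # -> [10.0, 20.0, 30.0]
```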
Splitting
- Split Criteria for Classification Trees
- The nodes are split to decrease impurity in classification
- Gini Criterion
- $\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$
- Probability that observation belongs to class i: $p_i$
- Misclassification interpretation (Gini as expected misclassification rate):
- For a given class $i$: probability of belonging to class $i$ times probability of being labeled as some other class
- $p_i \times (1 - p_i)$
- Across all classes:
- $\sum_{i=1}^{C} p_i (1 - p_i)$
- $= \sum_{i=1}^{C} p_i - \sum_{i=1}^{C} p_i^2$
- $= 1 - \sum_{i=1}^{C} p_i^2$
- Ranges from 0 to 0.5 for binary classification (the maximum is $1 - 1/C$ for $C$ classes)
- Entropy Criterion
- Measure of the uncertainty of a random variable
- Given an event E
- $p(E) = 1 \implies$ No Surprise
- $p(E) \to 0 \implies$ Huge Surprise
- Information Content: $I(E) = \log(1 / p(E))$
- Entropy is the expectation of this information content
- $H(E) = - \sum p(E) \log(p(E))$
- Maximum when all outcomes have the same probability of occurrence
- Ranges from 0 to 1 for binary classification (with log base 2); the maximum is $\log_2 C$ for $C$ classes
- Split Criteria for Regression Trees
- Sum-Squared Error
- $\sum_i (Y_i - \bar Y)^2$, where $\bar Y$ is the mean target in the node
- Finding the Split
- For any candidate value:
- Calculate the weighted average reduction in impurity / error
- Weights are the number of observations flowing into each child node
- Starting Gini
- $\text{Gini}_{\text{Root}}$
- $N_{\text{Root}}$
- After Split
- Child Nodes
- $\text{Gini}_{\text{Left}}, N_{\text{Left}}$
- $\text{Gini}_{\text{Right}}, N_{\text{Right}}$
- Updated Gini
- $\frac{N_{\text{Left}}}{N_{\text{Root}}} \times \text{Gini}_{\text{Left}} + \frac{N_{\text{Right}}}{N_{\text{Root}}} \times \text{Gini}_{\text{Right}}$
- Find the split that results in the minimum updated Gini
- Updated Gini $\le$ Starting Gini
- A greedy algorithm finds the best split at each node (see the sketch below)
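A minimal sketch of the greedy split search for a single feature using the Gini criterion; `gini` and `best_split` are illustrative helper names, not from any library.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector: 1 - sum_i p_i^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Scan candidate thresholds on one feature and return the threshold that
    minimizes the weighted ('updated') Gini of the two child nodes."""
    best_t, best_score = None, gini(y)            # start from the parent impurity
    for t in np.unique(x)[:-1]:                   # drop the max so the right child is non-empty
        left, right = y[x <= t], y[x > t]
        n_l, n_r = len(left), len(right)
        score = (n_l * gini(left) + n_r * gini(right)) / (n_l + n_r)
        if score < best_score:                    # updated Gini <= starting Gini
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # -> (3.0, 0.0): a perfect split
```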
Bias-Variance Trade-off
- Bias
- Measures the ability of an ML algorithm to model the true relationship between features and target
- Simplifying assumptions made by the model to learn the relationship
- Example: Linear vs Parabolic relationship
- Low Bias: Less restrictive assumptions
- High Bias: More restrictive assumptions
- Variance
- The difference in model performance across different datasets drawn from the same distribution
- Low Variance: Small changes in model performance with changes in datasets
- High Variance: Large changes in model performance with changes in datasets
- Irreducible Error
- Bayes error
- Cannot be reduced irrespective of the model form
- Best model minimizes: $\text{MSE} = \text{bias}^2 + \text{variance} + \text{irreducible error}$
- Decision trees have low bias and high variance (see the simulation at the end of this section)
- Decision trees are prone to overfitting
- Noisy Samples
- Small data samples in nodes down the tree
- Tree Pruning (covered in its own section below) solves for overfitting
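A small simulation (illustrative data, using sklearn's `DecisionTreeRegressor`) of the trade-off: an unpruned tree's prediction at a fixed point varies a lot across datasets drawn from the same distribution (high variance), while a depth-2 tree varies less but is more biased.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x0 = np.array([[0.5]])                      # fixed query point
preds = {"deep": [], "shallow": []}

for _ in range(200):                        # 200 datasets from the same distribution
    X = rng.uniform(0, 1, size=(100, 1))
    y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.3, size=100)
    for name, depth in [("deep", None), ("shallow", 2)]:
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, y)
        preds[name].append(tree.predict(x0)[0])

for name, p in preds.items():
    # deep: low bias, high variance; shallow: higher bias, lower variance
    print(name, "mean:", round(np.mean(p), 3), "variance:", round(np.var(p), 4))
```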
Nature of Decision Trees
- Decision Trees can model non-linear relationships (complex decision boundaries)
- Spline regressions cannot achieve the same results
- Spline adds indicator variables to capture interactions and create kinks
- But the decision boundary has to be continuous
- The same restriction doesn't apply to decision trees
- Decision Trees don't require feature scaling
- Decision Trees are less sensitive to outliers
- Outliers are of various kinds:
- Outliers: Points with extreme values
- Input Features
- Doesn't impact Decision Trees
- Split finding will ignore the extreme values
- Output / Target
- In regression, extreme target values pull the mean of the leaf they fall in (a local effect on predictions)
- Influential / High-Leverage Points: Undue influence on the model
- Decision Trees cannot extrapolate to ranges outside the training data (illustrated below)
- Decision trees cannot capture linear time-series trends or extrapolate seasonality beyond the training range
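A quick illustration (made-up data) of the no-extrapolation point: outside the training range a fitted tree just returns the value of the nearest leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()                       # a simple linear trend, y = 2x

tree = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)

print(tree.predict([[5.0]]))    # inside the training range: close to 10
print(tree.predict([[50.0]]))   # outside the range: stays near the largest training y (~20), not 100
```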
Bagging
- Bootstrap Aggregation
- Sampling with replacement
- Given Dataset of Size N
- Draw N samples with replacement
- Probability that a point (say i) never gets selected
- $(1 - \frac{1}{N})^N \approx \frac{1}{e}$
- Probability that a point (say $i$) gets selected at least once (simulated below)
- $1 - \frac{1}{e} \approx 63\%$
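A quick check of the 63% figure by simulation; the sample size and number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 1000, 10_000

# For each trial, draw a bootstrap sample of indices and check whether index 0 appears.
hits = sum(0 in rng.integers(0, N, size=N) for _ in range(trials))

print("simulated:", hits / trials)         # ~0.632
print("1 - 1/e  :", 1 - 1 / np.e)          # 0.6321...
```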
Random Forest
- Use bootstrap aggregation (bagging) to create multiple datasets
- "Random" subspace of dataset
- Use a random subset of the $m$ features for the split at each node
- $\sqrt{m}$ features for classification
- $m/3$ features for regression
- Comparison to single decision tree
- Bias remains the same
- Variance decreases
- Randomness in data and splits reduces the correlation in predictions across trees
- Let $\hat y_i$ be the prediction from ith tree in the forest
- Let $\sigma^2$ be the variance of $\hat y_i$
- Let $\rho$ be the correlation between two trees in the forest
- $V(\sum_i \hat y_i) = \sum_i V(\hat y_i) + 2 \sum_{i < j} \mathrm{Cov}(\hat y_i, \hat y_j)$
- $V(\sum_i \hat y_i) = n \sigma^2 + n(n-1) \rho \sigma^2$
- $V( \frac{1}{n} \sum_i \hat y_i) = \rho \sigma^2 + \frac{1-\rho}{n} \sigma^2$
- Variance goes down as more trees are added, but bias stays put
- Output Combination
- Majority Voting for Classification
- Averaging for Regression
- Out-of-bag (OOB) Error
- Use the non-selected rows in bagging to estimate model performance
- Comparable to cross-validation results (see the sketch after this list)
- Proximity Matrix
- Use OOB observations
- Count the number of times each pair goes to the same terminal node
- Identifies observations that are close/similar to each other
- Extremely Randomized Trees (ExtraTrees)
- Bagging:
- ExtraTrees: No (each tree uses the full training set by default)
- Random Forest: Yes
- Multiple trees are built using:
- A random subset of variables for splitting
- A random threshold drawn for each candidate variable (instead of searching for the best cut-point)
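A minimal sketch of a random forest with an out-of-bag score, using sklearn's `RandomForestClassifier` on a synthetic dataset; the hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,        # more trees -> lower variance, bias unchanged
    max_features="sqrt",     # random feature subset at each split
    oob_score=True,          # score each point with the trees that never saw it
    random_state=0,
).fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 3))   # comparable to a cross-validated estimate
```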
Variable Importance
- Split-based importance
- If variable j is used for split
- Calculate the improvement in Gini at the split
- Sum this improvement across all trees and splits wherever jth variable is used
- Alternate is to calculate the number of times variable is used for splitting
- Biased in favour of continuous variables which can be split multiple times
- Permutation-based importance / Boruta
- Use OOB samples to calculate variable importance
- Take bth tree:
- Pass the OOB samples and calculate accuracy
- Permute the jth variable and calculate the decrease in accuracy
- Average this decrease in accuracy across all trees to get the importance of variable j (see the sketch after this list)
- The effect is similar to setting a coefficient to 0 in regression
- Accounts for the presence of good surrogates (correlated features) in the dataset
- Partial Dependence Plots
- Marginal effect of a feature on the target
- Understand the relationship between feature and target
- Assumes features are not correlated
- $\hat f_S(x_S) = \frac{1}{n} \sum_{i=1}^{n} f(x_S, x_C^{(i)})$
- Average predictions over the observed values $x_C^{(i)}$ of all other variables
- Can be used to identify important interactions
- Friedman's H Statistic
- If features don't interact, the joint PDP can be decomposed into marginal PDPs
- Shapley Values
- Model-agnostic feature importance
- LIME
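A minimal sketch of permutation importance; this uses sklearn's model-agnostic `permutation_importance` on held-out data rather than the per-tree OOB variant described above, but the idea (shuffle a feature, measure the drop in accuracy) is the same. Data and settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and record the drop in accuracy.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for j, imp in enumerate(result.importances_mean):
    print(f"feature {j}: mean accuracy drop = {imp:.3f}")
```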
Handling Categorical Variables
- Binary categorical variables are easily incorporated into decision trees
- For multi-category variables:
- One-hot encoding (creates a binary feature for each category)
- Label encoding (assigns an ordinal value to each category); both encodings are sketched after this list
- Trees can directly handle categorical variables by considering all possible subsets for splitting
- CART typically uses binary splits (creates a binary question from categorical features)
- C4.5 and CHAID can create multi-way splits
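A small sketch (made-up data) of the two encodings for a multi-category feature, using pandas.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: a single ordinal column (imposes an arbitrary order)
label = df["color"].astype("category").cat.codes

print(one_hot)
print(label.values)   # e.g. [2, 1, 0, 1]
```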
Tree Pruning
- Decision trees are prone to overfitting
- Noisy Samples
- Small data samples in nodes down the tree
- Tree Pruning solves for overfitting
- Adding a cost term to the objective which captures tree complexity
- $\text{Tree Score} = \text{SSR} + \alpha T$, where $T$ is the number of terminal nodes (leaves)
- As the tree grows in size, the reduction in SSR has to more than offset the complexity cost
- Pre-pruning vs. Post-pruning:
- Pre-pruning: Stops tree growth early using criteria like:
- Minimum samples per leaf
- Maximum depth
- Minimum impurity decrease
- Post-pruning: Grows a full tree and then removes branches that don't improve generalization
- Cost-complexity pruning (used in CART; see the sketch at the end of this section)
- Reduced Error Pruning (REP)
- Pessimistic Error Pruning (PEP)
- Cross-validation can be used to determine optimal pruning level
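A minimal sketch of cost-complexity (post-)pruning with sklearn's CART implementation, picking $\alpha$ by cross-validation; the dataset is just for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alpha values come from the pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Refit with each alpha and keep the one with the best cross-validated accuracy.
scores = {
    alpha: cross_val_score(DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5).mean()
    for alpha in path.ccp_alphas
}
best_alpha = max(scores, key=scores.get)
print("best alpha:", round(best_alpha, 5), "cv accuracy:", round(scores[best_alpha], 3))
```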