Trees
Recursively partition the input space and define a local model in each resulting region
- Node $i$: feature dimension $d_i$ is compared to threshold $t_i$
- $R_i = \{x : x_{d_1} \leq t_1, x_{d_2} \leq t_2, \ldots\}$
- Axis-parallel splits
- At leaf node, model specifies the predicted output for any input that falls in the region
- $w_1 = \frac{\sum_{n} y_n \,\mathbb{I}\{x_n \in R_1\}}{\sum_{n} \mathbb{I}\{x_n \in R_1\}}$
- Tree structure can be represented as
- $f(x; \theta) = \sum_j w_j \,\mathbb{I}\{x \in R_j\}$
- where j denotes a leaf node
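A minimal numpy sketch of this leaf-indicator form, using a hypothetical depth-1 tree (the split feature, threshold, and leaf weights are made up for illustration):

```python
import numpy as np

# f(x; theta) = sum_j w_j * I(x in R_j) for a hypothetical depth-1 tree:
# R_1 = {x : x_0 <= 0.5}, R_2 = {x : x_0 > 0.5}; the weights are made up.
def predict_tree(X, threshold=0.5, w=(-1.0, 2.0)):
    X = np.asarray(X, dtype=float)
    in_R1 = X[:, 0] <= threshold            # axis-parallel split on feature 0
    return np.where(in_R1, w[0], w[1])      # leaf weight of the region x falls in

print(predict_tree([[0.2, 1.0], [0.9, -0.3]]))  # [-1.  2.]
```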
Model Fitting
- $L(\theta) = \sum_{j=1}^{J} \sum_{i : x_i \in R_j} \ell(y_i, w_j)$
- The tree structure is non-differentiable
- Greedy approach to grow the tree
- C4.5, ID3 etc.
- Finding the split
- $L(\theta) = \frac{|D_l|}{|D|} c_l + \frac{|D_r|}{|D|} c_r$
- Find the split such that the new weighted overall cost after splitting is minimized (see the split-search sketch at the end of this section)
- Restricted to binary splits; multi-way splits fragment the data too quickly
- Determining the cost
- Regression: Mean Squared Error
- Classification:
- Gini Index: $\sum_c \pi_{ic} (1 - \pi_{ic})$
- $\pi_{ic}$: probability that observation $i$ belongs to class $c$
- $1 - \pi_{ic}$: probability of misclassification
- Entropy: $-\sum_c \pi_{ic} \log \pi_{ic}$
- Regularization
- Approach 1: Stop growing the tree according to some heuristic
- Example: Tree reaches some maximum depth
- Approach 2: Grow the tree to its maximum possible depth and prune it back
- Handling missing features
- Categorical: Consider missing value as a new category
- Continuous: Surrogate splits
- Look for variables that are most correlated with the feature used for the split
- Advantages of Trees
- Easy to interpret
- Minimal data preprocessing is required
- Robust to outliers
- Disadvantages of Trees
- Easily overfit
- Perform poorly on distributional shifts
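A sketch of the greedy split search on a single feature, assuming `x` and `y` are numpy arrays; the cost helpers (MSE for regression, Gini/entropy for classification) and function names are my own:

```python
import numpy as np

def mse_cost(y):                 # regression cost of a node
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def gini(y):                     # sum_c pi_c * (1 - pi_c)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def entropy(y):                  # -sum_c pi_c * log(pi_c)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def best_split_1d(x, y, cost=mse_cost):
    # Score every candidate threshold by |D_l|/|D| * c_l + |D_r|/|D| * c_r.
    best_t, best_cost = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        c = len(left) / len(y) * cost(left) + len(right) / len(y) * cost(right)
        if c < best_cost:
            best_t, best_cost = t, c
    return best_t, best_cost
```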
Ensemble Learning
- Decision Trees are high variance estimators
- Average multiple models to reduce variance
- $f(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} f_m(y \mid x)$
- In case of classification, take a majority vote
- $p = \Pr(S > M/2) = 1 - \text{Bin}(M/2; M, \theta)$, where $S$ is the number of correct base classifiers and $\theta$ is each classifier's accuracy
- $\text{Bin}(\cdot)$ is the CDF of the binomial distribution
- If the errors of the models are uncorrelated, the averaging of classifiers can boost the performance
- Stacking
- Stacked Generalization
- Weighted Average of the models
- $f(y \mid x) = \sum_m w_m f_m(y \mid x)$
- Weights have to be learned on unseen data
- Stacking is different from Bayes model averaging
- Weights need not add up to 1
- Only a subset of the hypothesis space is considered in stacking
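A small numpy sketch contrasting plain averaging with stacking; `base_preds_val`/`base_preds` are assumed to be (n_samples, M) matrices of predictions from M already-fitted models, and the least-squares weight fit is just one possible choice:

```python
import numpy as np

def average_ensemble(base_preds):
    return base_preds.mean(axis=1)             # f(y|x) = (1/M) * sum_m f_m(y|x)

def fit_stacking_weights(base_preds_val, y_val):
    # Weights fit on held-out (unseen) data; they need not sum to 1.
    w, *_ = np.linalg.lstsq(base_preds_val, y_val, rcond=None)
    return w

def stacked_predict(base_preds, w):
    return base_preds @ w                      # f(y|x) = sum_m w_m f_m(y|x)
```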
Bagging
- Bootstrap aggregation
- Sampling with replacement
- Start with N data points
- Sample with replacement till N points are sampled
- Probability that a point is never selected
- $(1 - {1 \over N})^N$
- As N → $\infty$, the value is roughly 1/e (37% approx)
- Build different estimators on these sampled datasets
- Model doesn't overly rely on any single data point
- Evaluate the performance on the 37% excluded data points
- OOB (out-of-bag) error (a quick bootstrap check appears at the end of this section)
- Performance boost relies on de-correlation between various models
- Reduces the variance of the predictions
- The bias remains unchanged
- $V = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$
- If the $B$ trees were i.i.d., the correlation $\rho$ would be 0 and the variance would shrink to $\sigma^2 / B$
- Random Forests
- De-correlate the trees further by randomizing the splits
- A random subset of features chosen for split at each node
- Extra Trees: further randomization by also selecting the split thresholds at random
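A quick numpy check of the bootstrap arithmetic: sampling N indices with replacement leaves roughly a 1/e fraction of the points out of bag.

```python
import numpy as np

# One bootstrap dataset: N indices sampled with replacement; the fraction of
# points never selected should be close to (1 - 1/N)^N ~ 1/e ~ 0.37.
rng = np.random.default_rng(0)
N = 10_000
boot_idx = rng.integers(0, N, size=N)
oob = ~np.isin(np.arange(N), boot_idx)      # out-of-bag points
print(oob.mean(), (1 - 1 / N) ** N)         # both roughly 0.37
```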
Boosting
- Sequentially fitting additive models
- In the first round, use original data
- In the subsequent rounds, weight data samples based on the errors
- Misclassified examples get more weight
- Even if each single classifier is a weak learner, the above procedure makes the ensemble a strong classifier
- Boosting reduces the bias of the individual weak learners to result in an overall strong classifier
- Forward Stage-wise Additive Modeling
- $F_m(x) = F_{m-1}(x) + \beta_m f_m(x; \theta_m)$
- $\beta_m, \theta_m$ are chosen to minimize the loss.
- Add new models that address the residual error $r_i = (y_i - F_{m-1}(x_i))$
- AdaBoost
- Classification with exponential loss: $L(y, F(x)) = \exp(-yF(x))$
- $y \in \{-1, +1\}$
- Output of the tree has to be $\{-1, +1\}$ in each region
- Each round m, the optimization is simplified to finding the best classifier fit to the weighted training data.
- $\theta_m = \arg\min_\theta \sum_i w_i^{(m)} \,\mathbb{I}\{y_i \ne f(x_i; \theta)\}$
- Gradient Boosting
- Fit the entire model in forward stagewise manner
- $F_m(x) = F_{m-1}(x) + \beta_m f_m(x; \theta_m)$
- Expand this as a Taylor series around $F_{m-1}(x)$
- $L(y, F_m(x)) = L(y, F_{m-1}(x)) + \beta_m g_m(x) f_m(x) + O(\beta_m^2)$
- Neglect higher order terms
- Minimize the loss
- $f_m(x) = - g_m(x)$
- $g_m(x) = \left. \frac{\partial L(y, F(x))}{\partial F(x)} \right|_{F(x) = F_{m-1}(x)}$
- The new predictor is trained to approximate the negative gradient of the loss
- In the current form, the optimization is limited to the set of training points
- Need a function that can generalize
- Train a weak learner that can approximate the negative gradient signal (see the squared-loss sketch at the end of this section)
- $f_m = \arg\min_f \sum_i \big(-g_m(x_i) - f(x_i)\big)^2$
- Use a shrinkage factor for regularization
- Stochastic Gradient Boosting
- Data Subsampling for faster computation and better generalization
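A sketch of gradient boosting under squared loss, where the negative gradient is just the residual $y - F_{m-1}(x)$; `X` and `y` are assumed to be numpy arrays, and the weak learner (a shallow scikit-learn tree), shrinkage value, and function names are my own choices, not a reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=100, shrinkage=0.1, max_depth=2):
    F = np.full(len(y), y.mean())             # F_0: constant prediction
    trees = []
    for _ in range(n_rounds):
        neg_grad = y - F                       # -g_m for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)
        F = F + shrinkage * tree.predict(X)    # F_m = F_{m-1} + beta * f_m
        trees.append(tree)
    return y.mean(), trees                     # (F_0, fitted weak learners)

def predict_gradient_boosting(f0, trees, X, shrinkage=0.1):
    return f0 + shrinkage * sum(t.predict(X) for t in trees)
```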
XGBoost
- Extreme Gradient Boosting
- Add regularization to the objective
- $L(f) = \sum_i \ell(y_i, f(x_i)) + \Omega(f)$
- $\Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_j w_j^2$
- Consider the forward stage-wise additive modeling, with ensemble $F_{m-1}$ and new tree $f$
- $L_m(f) = \sum_i \ell\big(y_i, F_{m-1}(x_i) + f(x_i)\big) + \Omega(f)$
- Use a second-order Taylor approximation around $F_{m-1}(x_i)$
- $L_m(f) \approx \sum_i \big[\ell(y_i, F_{m-1}(x_i)) + g_{im} f(x_i) + \tfrac{1}{2} h_{im} f(x_i)^2\big] + \Omega(f)$
- $g$ is the gradient and $h$ is the Hessian of the loss
- Dropping the constant terms and using the decision-tree form of $f$
- $f(x_i) = w_j$ for $x_i \in I_j$ (the set of points falling in leaf $j$)
- $L_m = \sum_{j=1}^{J} \Big[\big(\sum_{i \in I_j} g_{im}\big) w_j + \tfrac{1}{2}\big(\sum_{i \in I_j} h_{im} + \lambda\big) w_j^2\Big] + \gamma J$
- Solution to the Quadratic Equation:
- $G_{jm} = \sum_{i \in I_j} g_{im}$
- $H_{jm} = \sum_{i \in I_j} h_{im}$
- $w_j^* = \frac{-G_{jm}}{H_{jm} + \lambda}$
- $L_m(w^*) = -\frac{1}{2} \sum_{j=1}^{J} \frac{G^2_{jm}}{H_{jm} + \lambda} + \gamma J$
- Condition for Splitting the node:
- $\text{gain} = \frac{1}{2}\Big[\frac{G^2_L}{H_L + \lambda} + \frac{G^2_R}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\Big] - \gamma$
- Gamma acts as regularization
- Tree won't split if the gain from the split is less than gamma
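A small numpy sketch of the leaf-weight and split-gain formulas, taking arrays of per-point gradients `g` and Hessians `h` for each child node (function names and the default λ, γ values are arbitrary):

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    return -g.sum() / (h.sum() + lam)          # w* = -G / (H + lambda)

def split_gain(g_l, h_l, g_r, h_r, lam=1.0, gamma=0.0):
    def score(g_sum, h_sum):
        return g_sum ** 2 / (h_sum + lam)      # G^2 / (H + lambda)
    gain = 0.5 * (score(g_l.sum(), h_l.sum())
                  + score(g_r.sum(), h_r.sum())
                  - score(g_l.sum() + g_r.sum(), h_l.sum() + h_r.sum()))
    return gain - gamma                        # split only if this is positive
```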
Feature Importance
- $R_k(T) = \sum_{j} G_j \,\mathbb{I}(v_j = k)$, summing over the internal nodes of tree $T$
- $G_j$ is the gain in accuracy / reduction in cost at node $j$
- $\mathbb{I}(\cdot)$ returns 1 if node $j$ splits on feature $k$
- Average the value of R over the ensemble of trees
- Normalize the values
- Biased towards features with a large number of levels
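For comparison, scikit-learn's tree ensembles expose an impurity-based importance of this flavor as `feature_importances_`, averaged over the trees and normalized to sum to 1 (the dataset below is synthetic):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with 5 features, only 2 of which are informative.
X, y = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)             # normalized: sums to 1.0
```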
Partial Dependency Plot
- Assess the impact of a feature on the output
- Marginalize out all features other than $k$
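A minimal sketch of this marginalization, assuming `model` is any fitted estimator with a `predict` method and `X` is a numeric numpy array (scikit-learn's `sklearn.inspection.partial_dependence` offers the same idea with more options):

```python
import numpy as np

def partial_dependence_1d(model, X, k, grid):
    # For each grid value of feature k, clamp column k for every row and
    # average the predictions, marginalizing over the other features.
    X = np.asarray(X, dtype=float)
    pd_vals = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, k] = v
        pd_vals.append(model.predict(X_mod).mean())
    return np.array(pd_vals)
```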