STA 141C Big-data and Statistical Computing
Discussion 8: Tree-based Models
TA: Tesi Xiao
Decision Tree
- Pros
- Non-linear classifier
- Nonparametric
- Good interpretability: the splitting rule at each node can be read off directly
- Fast prediction
- Cons
- Prone to overfitting (a deep tree can fit noise in the training data)
- Slow training (greedy search over candidate splits at every node)
class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None)
Criterion: Gini impurity (criterion='gini', the default) vs. information gain based on entropy (criterion='entropy')
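Both criteria are computed from the class proportions at a node; as a quick illustration (not part of the original notes, the function names gini and entropy are ours), a minimal sketch:

import numpy as np

def gini(p):
    # Gini impurity: 1 - sum_k p_k^2
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy (the basis of information gain): -sum_k p_k * log2(p_k)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero proportions to avoid log(0)
    return -np.sum(p * np.log2(p))

# A two-class node with proportions [0.5, 0.5] is maximally impure:
print(gini([0.5, 0.5]))     # 0.5
print(entropy([0.5, 0.5]))  # 1.0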
from sklearn.datasets import load_iris
from sklearn import tree
import matplotlib.pyplot as plt

iris = load_iris()
X, y = iris.data, iris.target

clf = tree.DecisionTreeClassifier()  # Gini criterion by default
clf = clf.fit(X, y)

tree.plot_tree(clf)  # visualize the fitted tree
plt.show()
Ensemble Model
(An aggregate of weak learners)
Multiple diverse models are created, either by using different modeling algorithms or by training on different training data sets, to predict one outcome. The ensemble model aggregates the predictions of the base models into one final prediction for unseen data.
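As one concrete way to build such an aggregate in scikit-learn (an illustration, not the only option), VotingClassifier combines heterogeneous base learners by majority vote:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three diverse base learners; voting='hard' takes the majority class label.
ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('nb', GaussianNB()),
    ('dt', DecisionTreeClassifier()),
], voting='hard')
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))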
Bagging (Bootstrap aggregating)
In the ensemble, each model is trained independently and votes with equal weight. Bagging trains each model on a bootstrap sample: a subset of the training set drawn at random with replacement.
Example: Random forest (multiple decision trees)
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None)
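A minimal usage sketch on the same iris data, using the defaults shown above:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees by default; each tree sees a bootstrap sample of the rows
# and a random subset of the features at each split.
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)
print(rf.score(X, y))  # training accuracy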
Boosting
Boosting builds the ensemble incrementally (sequentially): each new model is trained to emphasize the training instances that the previous models misclassified.
Each new learner thus corrects the mistakes of the previous weak learners.
Example: Gradient Boosted Decision Trees (GBDT)
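One readily available GBDT implementation is scikit-learn's GradientBoostingClassifier; a minimal sketch on the iris data (the parameter values shown are the library defaults):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Each stage fits a shallow regression tree to the gradient of the loss,
# so later trees correct the errors of earlier ones.
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbdt.fit(X, y)
print(gbdt.score(X, y))  # training accuracy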