STA 142A Statistical Learning I
Discussion 8: Additive Models, Smoothing Splines, Ensemble Models
TA: Tesi Xiao
Additive Models
Let $X = [X_1, X_2, \dots, X_d]\in \mathbb{R}^d$, $Y\in\mathbb{R}$.
\[\mathbb{E}[Y\vert X] = f(X) = \beta_0 + \sum_{j=1}^{d} f_j(X_j)\]For example, if $f_j(X_j) = \beta_j X_j$, then $\mathbb{E}[Y\vert X]=\beta_0 + \beta_1 X_1 + \dots + \beta_d X_d \rightarrow$ Linear Regression
In general, $f_j(X_j)$ can be non-linear.
Remark.
- Generalized Additive Models (GAM) \(g(\mathbb{E}[Y\vert X]) = \beta_0 + \sum_{j=1}^{d} f_j(X_j)\) where $g(\cdot)$ is the link function. For example, in Logistic Regression, $g(\mathbb{E}[Y\vert X]) = \text{logit}(\mathbb{E}[Y\vert X]) = \log(\frac{\mathbb{E}[Y\vert X]}{1-\mathbb{E}[Y\vert X]})$.
- GAM Fitting: each $f_j(X_j)$ is fitted with smoothing splines via the backfitting algorithm (see the sketch below).
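To make the backfitting idea concrete, here is a minimal Python sketch, not the exact procedure used by GAM software: it assumes simulated data and uses `scipy.interpolate.UnivariateSpline` as the per-coordinate smoother (the smoothing level `s` and the number of sweeps are illustrative choices).

```python
# Minimal backfitting sketch for an additive model E[Y|X] = b0 + sum_j f_j(X_j).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
n, d = 300, 3
X = rng.uniform(-2, 2, size=(n, d))
y = 1.0 + np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(0, 0.3, n)

b0 = y.mean()                      # intercept: overall mean of y
f_hat = np.zeros((n, d))           # fitted component values f_j(x_ij)

for _ in range(20):                # backfitting sweeps
    for j in range(d):
        # Partial residuals: remove the intercept and all other fitted components.
        r = y - b0 - f_hat.sum(axis=1) + f_hat[:, j]
        order = np.argsort(X[:, j])                     # smoother needs sorted x
        spline = UnivariateSpline(X[order, j], r[order], s=n * 0.1)
        f_hat[:, j] = spline(X[:, j])
        f_hat[:, j] -= f_hat[:, j].mean()               # center for identifiability

fitted = b0 + f_hat.sum(axis=1)
print("training MSE:", np.mean((y - fitted) ** 2))
```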
Smoothing Spline
Let $\{(x_i, y_i),~ i=1,\dots, n\}$ be a set of observations, modeled by
\[y_i = f(x_i) + \epsilon_i\]where the $\epsilon_i$'s are independent, zero-mean random variables.
Assuming $f$ is unknown to us, we try to find a smoothing spline $\hat{f}$ to estimate $f$ by solving
\[\underset{f}{\min}~ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int f^{(m)}(x)^2 \,\text{d}x\]where $f^{(m)}(x)$ is the $m$-th derivative of $f$. For example, $f^{(0)}(x)=f(x)$, $f^{(1)}(x) = f'(x)$, $f^{(2)}(x) = f''(x)$.
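For intuition about the role of $\lambda$, here is a minimal sketch (assuming NumPy, SciPy $\geq$ 1.10, and simulated data) that fits cubic smoothing splines, i.e., the $m=2$ penalty, with a small and a large $\lambda$:

```python
# Cubic smoothing splines (m = 2 penalty) with two values of lambda.
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))             # x must be strictly increasing
y = np.sin(x) + rng.normal(0, 0.3, size=x.size)

# `lam` plays the role of lambda: larger values penalize curvature more heavily.
f_wiggly = make_smoothing_spline(x, y, lam=1e-3)  # close to the data, less smooth
f_smooth = make_smoothing_spline(x, y, lam=1e3)   # heavily smoothed, nearly linear

grid = np.linspace(0, 10, 5)
print(f_wiggly(grid))
print(f_smooth(grid))
```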
Special Cases:
- $\lambda = +\infty,~ m=0$: the penalty $\int f^{(0)}(x)^2 \,\text{d}x = \int f(x)^2 \,\text{d}x$ must vanish, which forces $f(x) \equiv 0$.
- $\lambda = +\infty,~ m=1$: the penalty $\int f^{(1)}(x)^2 \,\text{d}x = \int f'(x)^2 \,\text{d}x$ must vanish, which forces $f'(x) \equiv 0$, i.e., $f(x) = \beta_0$ is a constant.
In the $m=1$ case, when $f(x)=\beta_0$,
\(\underset{f}{\min}~ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \infty \cdot \int f^{(1)}(x)^2 \,\text{d}x \Longleftrightarrow \underset{f:\, f' \equiv 0}{\min}~ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 \Longleftrightarrow \underset{\beta_0}{\min}~ \sum_{i=1}^{n} \{y_i - \beta_0\}^2.\)
Therefore, $f(x) \equiv \beta_0 = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, the sample mean of the responses.
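As a quick sanity check of the last identity, the following sketch (assuming only NumPy; the grid search is purely for illustration) minimizes the constant-fit loss numerically and compares it to $\bar{y}$:

```python
# Check: argmin_{b0} sum_i (y_i - b0)^2 equals the sample mean of y.
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=2.0, scale=1.0, size=50)

grid = np.linspace(y.min(), y.max(), 10001)               # candidate constants b0
losses = ((y[None, :] - grid[:, None]) ** 2).sum(axis=1)  # loss for each candidate
b0_hat = grid[np.argmin(losses)]

print(b0_hat, y.mean())   # the two values agree up to the grid resolution
```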
Ensemble Models: Bagging vs. Boosting
In statistics and machine learning, ensemble methods combine multiple models to obtain better predictive performance than any of the individual models alone.
Bagging (Bootstrap Aggregating)
Given a training set $D=\{(x_1, y_1), \dots, (x_n, y_n)\}$ of size $n$, bagging generates $B$ new training sets $D_b,~ b=1,2,\dots,B$, each of size $n'$, by sampling from $D$ uniformly and with replacement (bootstrap samples). We then fit $B$ models, one on each bootstrap sample, and combine them by averaging the outputs (for regression) or by majority vote (for classification).
\[f_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} f^b(x),\qquad f_{\text{vote}}(x) = \text{mode}(f^1(x), \dots, f^B(x))\]For example, Random Forest $=$ bagging of decision trees, with an extra layer of randomness: each split considers only a random subset of the features.
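Below is a minimal bagging sketch for regression (assuming NumPy, scikit-learn, and simulated data; depth-5 trees and $B=50$ are illustrative choices), which averages $B$ trees fitted on bootstrap samples:

```python
# Minimal bagging sketch: average of B trees, each fit on a bootstrap sample.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n, B = 400, 50
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, n)

trees = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)           # sample n indices with replacement
    tree = DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx])
    trees.append(tree)

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
f_avg = np.mean([t.predict(X_test) for t in trees], axis=0)   # f_avg(x)
print(f_avg)
```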
Boosting
Boosting refers to the process of turning a weak learner into a strong learner, where $B$ models are sequentially fitted:
\[f^1(x) \longrightarrow f^2(x) \longrightarrow f^3(x) \longrightarrow \dots \longrightarrow f^B(x)\]Each subsequent model learns from the mistakes of the previous ones. Finally, we aggregate the models by
\[f_{\text{boosting}}(x) = \sum_{b=1}^B \alpha_b f^{b}(x),\]where larger weights $\alpha_b$ are usually assigned to stronger (more accurate) learners; in AdaBoost, for instance, $\alpha_b$ is determined by the weighted training error of $f^b$.
For example, AdaBoost.
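As an illustration, here is a minimal AdaBoost sketch (assuming scikit-learn $\geq$ 1.2, where the weak learner is passed via `estimator`, and a simulated binary classification problem); the fitted object exposes the learner weights $\alpha_b$ through `estimator_weights_`:

```python
# Minimal AdaBoost sketch with depth-1 decision trees ("stumps") as weak learners.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner f^b
    n_estimators=100,                               # B sequential fits
    random_state=3,
).fit(X_tr, y_tr)

print("test accuracy:", ada.score(X_te, y_te))
# The alpha_b assigned to each weak learner:
print("first few alpha_b:", ada.estimator_weights_[:5])
```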