STA 142A Statistical Learning I
Discussion 8: Additive Models, Smoothing Splines, Ensemble Models
TA: Tesi Xiao
Additive Models
Let $X = [X_1, X_2, \dots, X_d]\in \mathbb{R}^d$, $Y\in\mathbb{R}$.
\[\mathbb{E}[Y\vert X] = f(X) = \beta_0 + \sum_{j=1}^{d} f_j(X_j)\]For example, if $f_j(X_j) = \beta_j X_j$, then $\mathbb{E}[Y\vert X]=\beta_0 + \beta_1 X_1 + \dots + \beta_d X_d \rightarrow$ Linear Regression
In general, $f_j(X_j)$ can be non-linear.
Remark.
- Generalized Additive Models (GAM) \(g(\mathbb{E}[Y\vert X]) = \beta_0 + \sum_{j=1}^{d} f_j(X_j)\) where $g(\cdot)$ is the link function. For example, in Logistic Regression, $g(\mathbb{E}[Y\vert X]) = \text{logit}(\mathbb{E}[Y\vert X]) = \log(\frac{\mathbb{E}[Y\vert X]}{1-\mathbb{E}[Y\vert X]})$.
- GAM Fitting: each $f_j(X_j)$ is fitted with smoothing splines via the backfitting algorithm (see the sketch below).
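To make the backfitting idea concrete, here is a minimal Python sketch, not the exact procedure used by GAM software: it assumes simulated data and uses `scipy.interpolate.UnivariateSpline` as the per-coordinate smoother (the smoothing level `s` and the number of sweeps are illustrative choices).

```python
# Minimal backfitting sketch for an additive model E[Y|X] = b0 + sum_j f_j(X_j).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
n, d = 300, 3
X = rng.uniform(-2, 2, size=(n, d))
y = 1.0 + np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(0, 0.3, n)

b0 = y.mean()                      # intercept: overall mean of y
f_hat = np.zeros((n, d))           # fitted component values f_j(x_ij)

for _ in range(20):                # backfitting sweeps
    for j in range(d):
        # Partial residuals: remove the intercept and all other fitted components.
        r = y - b0 - f_hat.sum(axis=1) + f_hat[:, j]
        order = np.argsort(X[:, j])                     # smoother needs sorted x
        spline = UnivariateSpline(X[order, j], r[order], s=n * 0.1)
        f_hat[:, j] = spline(X[:, j])
        f_hat[:, j] -= f_hat[:, j].mean()               # center for identifiability

fitted = b0 + f_hat.sum(axis=1)
print("training MSE:", np.mean((y - fitted) ** 2))
```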
Smoothing Spline
Let $\{(x_i, y_i),~ i=1,\dots, n\}$ be a set of observations, modeled by
\[y_i = f(x_i) + \epsilon_i\]where the $\epsilon_i$'s are independent, zero-mean random variables.
Assuming $f$ is unknown to us, we try to find a smoothing spline $\hat{f}$ to estimate $f$ by solving
\[\underset{f}{\min}~ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int f^{(m)}(x)^2 \,\text{d}x\]where $f^{(m)}(x)$ is the $m$-th derivative of $f$. For example, $f^{(0)}(x)=f(x)$, $f^{(1)}(x) = f'(x)$, $f^{(2)}(x) = f''(x)$.
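For intuition about the role of $\lambda$, here is a minimal sketch (assuming NumPy, SciPy $\geq$ 1.10, and simulated data) that fits cubic smoothing splines, i.e., the $m=2$ penalty, with a small and a large $\lambda$:

```python
# Cubic smoothing splines (m = 2 penalty) with two values of lambda.
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))             # x must be strictly increasing
y = np.sin(x) + rng.normal(0, 0.3, size=x.size)

# `lam` plays the role of lambda: larger values penalize curvature more heavily.
f_wiggly = make_smoothing_spline(x, y, lam=1e-3)  # close to the data, less smooth
f_smooth = make_smoothing_spline(x, y, lam=1e3)   # heavily smoothed, nearly linear

grid = np.linspace(0, 10, 5)
print(f_wiggly(grid))
print(f_smooth(grid))
```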
Special Cases:
- $\lambda = +\infty,~ m=0$: the penalty $\int f^{(0)}(x)^2 \,\text{d}x = \int f(x)^2 \,\text{d}x$ must vanish, which forces $f(x) \equiv 0$.
- $\lambda = +\infty,~ m=1$: the penalty $\int f^{(1)}(x)^2 \,\text{d}x = \int f'(x)^2 \,\text{d}x$ must vanish, which forces $f'(x) \equiv 0$, i.e., $f(x) = \beta_0$ is a constant.
In the $m=1$ case, when $f(x)=\beta_0$,
\(\underset{f}{\min}~ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \infty \cdot \int f^{(1)}(x)^2 \,\text{d}x \Longleftrightarrow \underset{f:\, f' \equiv 0}{\min}~ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 \Longleftrightarrow \underset{\beta_0}{\min}~ \sum_{i=1}^{n} \{y_i - \beta_0\}^2.\)
Therefore, $f(x) \equiv \beta_0 = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, the sample mean of the responses.
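As a quick sanity check of the last identity, the following sketch (assuming only NumPy; the grid search is purely for illustration) minimizes the constant-fit loss numerically and compares it to $\bar{y}$:

```python
# Check: argmin_{b0} sum_i (y_i - b0)^2 equals the sample mean of y.
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=2.0, scale=1.0, size=50)

grid = np.linspace(y.min(), y.max(), 10001)               # candidate constants b0
losses = ((y[None, :] - grid[:, None]) ** 2).sum(axis=1)  # loss for each candidate
b0_hat = grid[np.argmin(losses)]

print(b0_hat, y.mean())   # the two values agree up to the grid resolution
```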
Ensemble Models: Bagging vs. Boosting
In statistics and machine learning, ensemble methods combine multiple models to obtain better predictive performance than any of the individual models alone.
Bagging (Bootstrap Aggregating)
Given a training set $D=\{(x_1, y_1), \dots, (x_n, y_n)\}$ of size $n$, bagging generates $B$ new training sets $D_b,~ b=1,2,\dots,B$, each of size $n'$, by sampling from $D$ uniformly and with replacement (bootstrap samples). We then fit $B$ models, one on each bootstrap sample, and combine them by averaging the outputs (for regression) or by majority vote (for classification).
\[f_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} f^b(x),\qquad f_{\text{vote}}(x) = \text{mode}(f^1(x), \dots, f^B(x))\]For example, Random Forest $=$ bagging of decision trees, with an extra layer of randomness: each split considers only a random subset of the features.
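Below is a minimal bagging sketch for regression (assuming NumPy, scikit-learn, and simulated data; depth-5 trees and $B=50$ are illustrative choices), which averages $B$ trees fitted on bootstrap samples:

```python
# Minimal bagging sketch: average of B trees, each fit on a bootstrap sample.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n, B = 400, 50
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, n)

trees = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)           # sample n indices with replacement
    tree = DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx])
    trees.append(tree)

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
f_avg = np.mean([t.predict(X_test) for t in trees], axis=0)   # f_avg(x)
print(f_avg)
```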
Boosting
Boosting refers to the process of turning a weak learner into a strong learner, where $B$ models are sequentially fitted:
\[f^1(x) \longrightarrow f^2(x) \longrightarrow f^3(x) \longrightarrow \dots \longrightarrow f^B(x)\]Each subsequent model learns from the mistakes of the previous ones. Finally, we aggregate the models by
\[f_{\text{boosting}}(x) = \sum_{b=1}^B \alpha_b f^{b}(x),\]where larger weights $\alpha_b$ are usually assigned to stronger (more accurate) learners; in AdaBoost, for instance, $\alpha_b$ is determined by the weighted training error of $f^b$.
For example, AdaBoost.
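As an illustration, here is a minimal AdaBoost sketch (assuming scikit-learn $\geq$ 1.2, where the weak learner is passed via `estimator`, and a simulated binary classification problem); the fitted object exposes the learner weights $\alpha_b$ through `estimator_weights_`:

```python
# Minimal AdaBoost sketch with depth-1 decision trees ("stumps") as weak learners.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner f^b
    n_estimators=100,                               # B sequential fits
    random_state=3,
).fit(X_tr, y_tr)

print("test accuracy:", ada.score(X_te, y_te))
# The alpha_b assigned to each weak learner:
print("first few alpha_b:", ada.estimator_weights_[:5])
```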