STA 142A Statistical Learning I

Discussion 8: Additive Models, Smoothing Splines, Ensemble Models

TA: Tesi Xiao

Additive Models

Let $X = [X_1, X_2, \ldots, X_d] \in \mathbb{R}^d$, $Y \in \mathbb{R}$.

$$\mathbb{E}[Y \mid X] = f(X) = \beta_0 + \sum_{j=1}^{d} f_j(X_j)$$

For example, if $f_j(X_j) = \beta_j X_j$, then $\mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_d X_d$, i.e., Linear Regression.

In general, $f_j(X_j)$ can be non-linear.

Remark.

  • Generalized Additive Models (GAM): $g(\mathbb{E}[Y \mid X]) = \beta_0 + \sum_{j=1}^{d} f_j(X_j)$, where $g(\cdot)$ is the link function. For example, in Logistic Regression, $g(\mathbb{E}[Y \mid X]) = \operatorname{logit}(\mathbb{E}[Y \mid X]) = \log\!\left(\frac{\mathbb{E}[Y \mid X]}{1 - \mathbb{E}[Y \mid X]}\right)$.
  • GAM Fitting: each $f_j(X_j)$ is fitted using smoothing splines via the backfitting algorithm (see the sketch after this list).
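
A minimal backfitting sketch is given below. It assumes SciPy ≥ 1.10 for `make_smoothing_spline` (which penalizes the second derivative) and distinct values in each column of `X`; the function name `backfit_additive` and the tuning choices are illustrative, not a reference implementation.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline  # assumes SciPy >= 1.10

def backfit_additive(X, y, lam=1.0, n_iter=20):
    """Backfitting for E[Y|X] = beta0 + sum_j f_j(X_j) with smoothing splines."""
    n, d = X.shape
    beta0 = y.mean()                       # intercept is the overall mean
    fitted = np.zeros((n, d))              # current values f_j(x_{ij})
    splines = [None] * d
    for _ in range(n_iter):
        for j in range(d):
            # Partial residual: remove the intercept and all other components
            r = y - beta0 - fitted.sum(axis=1) + fitted[:, j]
            order = np.argsort(X[:, j])    # the spline fitter needs increasing x
            spl = make_smoothing_spline(X[order, j], r[order], lam=lam)
            fj = spl(X[:, j])
            fj -= fj.mean()                # center each f_j for identifiability
            fitted[:, j] = fj
            splines[j] = spl

    def predict(X_new):
        # Re-center each component with its training mean, as done during fitting
        return beta0 + sum(spl(X_new[:, j]) - spl(X[:, j]).mean()
                           for j, spl in enumerate(splines))

    return beta0, splines, predict

# Illustrative usage on simulated data
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
beta0, splines, predict = backfit_additive(X, y, lam=0.1)
```

Each pass cycles through the coordinates, smoothing the partial residuals against one predictor at a time until the component functions stabilize.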

Smoothing Spline

Let $(x_i, y_i),\ i = 1, \ldots, n$ be a set of observations, modeled by

$$y_i = f(x_i) + \epsilon_i$$

where the $\epsilon_i$'s are independent, zero-mean random variables.

Assuming $f$ is unknown to us, we try to find a smoothing spline $\hat{f}$ that estimates $f$ by solving

$$\min_f \ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int f^{(m)}(x)^2 \, dx$$

where $f^{(m)}(x)$ is the $m$-th derivative of $f$. For example, $f^{(0)}(x) = f(x)$, $f^{(1)}(x) = f'(x)$, $f^{(2)}(x) = f''(x)$.

Special Cases:

  • $\lambda = +\infty$, $m = 0$: $f^{(0)}(x) = f(x)$
$$\min_f \ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int f^{(0)}(x)^2 \, dx \;\Longrightarrow\; \min_f \int f(x)^2 \, dx \;\Longrightarrow\; f(x) \equiv 0$$
  • $\lambda = +\infty$, $m = 1$: $f^{(1)}(x) = f'(x)$
$$\min_f \ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int f^{(1)}(x)^2 \, dx \;\Longrightarrow\; \min_f \int f'(x)^2 \, dx \;\Longrightarrow\; f'(x) \equiv 0 \;\Longrightarrow\; f(x) = \beta_0$$

Meanwhile, when $f(x) = \beta_0$, the penalty term vanishes, so

$$\min_f \ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int f^{(1)}(x)^2 \, dx \;\Longrightarrow\; \min_f \ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 \;\Longrightarrow\; \min_{\beta_0} \ \sum_{i=1}^{n} \{y_i - \beta_0\}^2.$$

Therefore, $f(x) \equiv \beta_0 = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.
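
As a quick numerical illustration (a sketch, assuming SciPy ≥ 1.10): SciPy's `make_smoothing_spline` implements the $m = 2$ case, penalizing $\int f''(x)^2 \, dx$. By the same argument as above, letting $\lambda \to +\infty$ with $m = 2$ forces $f'' \equiv 0$, so the fit collapses to the least-squares straight line.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline  # assumes SciPy >= 1.10

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

f_mid = make_smoothing_spline(x, y, lam=1e-2)   # moderate penalty: wiggly fit
f_big = make_smoothing_spline(x, y, lam=1e8)    # huge penalty: f'' forced toward 0

grid = np.linspace(0, 10, 6)
print(np.round(f_mid(grid), 2))   # roughly follows sin(x)
print(np.round(f_big(grid), 2))   # nearly a straight line: the m = 2 analogue of
                                  # the constant fit obtained above for m = 1
```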

Ensemble Models: Bagging vs. Boosting

In statistics and machine learning, ensemble methods combine multiple learning algorithms to obtain better predictive performance than any of the constituent models alone.

Bagging (Bootstrap Aggregating)

Given a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of size $n$, bagging generates $B$ new training sets $D_i,\ i = 1, 2, \ldots, B$, each of size $n$, by sampling from $D$ uniformly and with replacement. We then fit $B$ models on these $B$ bootstrap samples and combine them by averaging the outputs (for regression) or voting (for classification).

$$f_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x), \qquad f_{\text{vote}}(x) = \operatorname{mode}\big(f_1(x), \ldots, f_B(x)\big)$$

For example, Random Forest = Bagging Multiple Decision Trees.
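
A minimal bagging sketch for regression, using scikit-learn decision trees as the base learners (the helper names `bagging_fit` / `bagging_predict` are just for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=100, seed=0):
    """Fit B trees, each on a bootstrap sample of size n drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # uniform sampling with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Average the B predictions (for classification, take a majority vote instead)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```

Averaging many high-variance, low-bias trees reduces variance without adding much bias, which is why deep trees are a natural base learner for bagging.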

Boosting

Boosting refers to the process of turning a weak learner into a strong learner, where B models are sequentially fitted:

$$f_1(x) \;\to\; f_2(x) \;\to\; f_3(x) \;\to\; \cdots \;\to\; f_B(x)$$

Each subsequent model learns from the errors of the previous models. Finally, we aggregate the models as

$$f_{\text{boosting}}(x) = \sum_{b=1}^{B} \alpha_b f_b(x),$$

where larger weights are usually assigned to stronger learners, i.e., $\alpha_b$ increases with the accuracy of $f_b$.

For example, AdaBoost.
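
Below is a rough AdaBoost-style sketch with decision stumps for binary labels in $\{-1, +1\}$ (one common formulation; details such as the $\tfrac{1}{2}\log$ weight vary by variant, and the function names are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, B=50):
    """AdaBoost sketch: y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                         # observation weights
    models, alphas = [], []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)       # stronger learner -> larger alpha_b
        w *= np.exp(-alpha * y * pred)              # up-weight misclassified points
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Weighted vote: sign of the alpha-weighted sum of the base predictions."""
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```

Reweighting the observations after each round is what makes the next weak learner focus on the points the current ensemble still gets wrong.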