STA 142A Statistical Learning I
Discussion 8: Additive Models, Smoothing Splines, Ensemble Models
TA: Tesi Xiao
Additive Models
Let $X = [X_1, X_2, \ldots, X_d] \in \mathbb{R}^d$ and $Y \in \mathbb{R}$.
$$\mathbb{E}[Y \mid X] = f(X) = \beta_0 + \sum_{j=1}^{d} f_j(X_j)$$

For example, if $f_j(X_j) = \beta_j X_j$, then $\mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_d X_d$ $\rightarrow$ Linear Regression.

In general, $f_j(X_j)$ can be non-linear.
Remark.
- Generalized Additive Models (GAM): $g(\mathbb{E}[Y \mid X]) = \beta_0 + \sum_{j=1}^{d} f_j(X_j)$, where $g(\cdot)$ is the link function. For example, in Logistic Regression, $g(\mathbb{E}[Y \mid X]) = \operatorname{logit}(\mathbb{E}[Y \mid X]) = \log\dfrac{\mathbb{E}[Y \mid X]}{1 - \mathbb{E}[Y \mid X]}$.
- GAM Fitting: each $f_j(X_j)$ is fitted using smoothing splines via the backfitting algorithm (a minimal sketch follows below).
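The following is a minimal backfitting sketch in Python, not the full machinery of standard GAM software: it assumes NumPy/SciPy are available, uses `scipy.interpolate.UnivariateSpline` as a stand-in univariate smoother, and the function name `backfit` and the `smooth` tuning knob are illustrative choices.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit(X, y, n_iters=20, smooth=0.1):
    """Fit E[Y|X] = beta0 + sum_j f_j(X_j) by backfitting."""
    n, d = X.shape
    beta0 = y.mean()                     # intercept: the mean of the response
    fitted = np.zeros((n, d))            # current values of f_j(x_{ij})
    for _ in range(n_iters):
        for j in range(d):
            # Partial residual: remove the intercept and all other components.
            r = y - beta0 - fitted.sum(axis=1) + fitted[:, j]
            order = np.argsort(X[:, j])  # the smoother needs increasing x
            spline = UnivariateSpline(X[order, j], r[order], s=smooth * n)
            fitted[:, j] = spline(X[:, j])
            fitted[:, j] -= fitted[:, j].mean()  # center f_j for identifiability
    return beta0, fitted

# Toy usage: y = 1 + sin(x1) + x2^2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = 1.0 + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)
beta0_hat, f_hat = backfit(X, y)
```

Each pass smooths the partial residuals against one coordinate at a time, holding the other fitted components fixed, and the centering step keeps the decomposition into $\beta_0$ and the $f_j$'s identifiable.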
Smoothing Spline
Let $(x_i, y_i)$, $i = 1, \ldots, n$, be a set of observations, modeled by
$$y_i = f(x_i) + \epsilon_i$$
where the $\epsilon_i$'s are independent, zero-mean random variables.
Assuming $f$ is unknown to us, we try to find a smoothing spline $\hat{f}$ to estimate $f$ by solving
$$\min_f\ \sum_{i=1}^{n}\{y_i - f(x_i)\}^2 + \lambda \int f^{(m)}(x)^2\,dx$$
where $f^{(m)}(x)$ is the $m$-th derivative of $f$. For example, $f^{(0)}(x) = f(x)$, $f^{(1)}(x) = f'(x)$, $f^{(2)}(x) = f''(x)$.
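As a concrete illustration, here is a sketch of fitting a cubic smoothing spline, i.e., the $m = 2$ penalty $\lambda \int f''(x)^2\,dx$. It assumes SciPy $\geq$ 1.10 for `scipy.interpolate.make_smoothing_spline`; the toy data and the choice `lam=1.0` are ours.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=100))        # abscissas must be increasing
y = np.sin(x) + rng.normal(scale=0.3, size=100)  # noisy observations of f

# lam plays the role of lambda: larger lam -> smoother fit (for the m = 2
# penalty, lam -> infinity approaches the least-squares straight line);
# smaller lam -> a curve that nearly interpolates the data.
f_hat = make_smoothing_spline(x, y, lam=1.0)
y_smooth = f_hat(x)                              # evaluate the fitted spline
```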
Special Cases ($\lambda = +\infty$):
- $m = 0$: the penalty is on $f^{(0)}(x) = f(x)$ itself; the objective stays finite only if $\int f(x)^2\,dx = 0$, i.e., $f(x) \equiv 0$.
- $m = 1$: the penalty is on $f^{(1)}(x) = f'(x)$; the objective stays finite only if $\int f'(x)^2\,dx = 0$, i.e., $f'(x) \equiv 0$, so $f$ must be a constant $\beta_0$.
In the $m = 1$ case, with $f(x) = \beta_0$ constant,
$$\min_f\ \sum_{i=1}^{n}\{y_i - f(x_i)\}^2 + \infty \cdot \int f^{(1)}(x)^2\,dx \;\Longleftrightarrow\; \min_{f:\,f' \equiv 0}\ \sum_{i=1}^{n}\{y_i - f(x_i)\}^2 \;\Longleftrightarrow\; \min_{\beta_0}\ \sum_{i=1}^{n}\{y_i - \beta_0\}^2.$$
Therefore, $\hat{f}(x) \equiv \hat{\beta}_0 = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
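The last step uses the fact that the sample mean minimizes the sum of squared deviations, which follows from the first-order condition:
$$\frac{d}{d\beta_0}\sum_{i=1}^{n}(y_i - \beta_0)^2 = -2\sum_{i=1}^{n}(y_i - \beta_0) = 0 \;\Longrightarrow\; \hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}.$$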
Ensemble Models: Bagging vs. Boosting
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance.
Bagging (Bootstrap Aggregating)
Given a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of size $n$, bagging generates $B$ new training sets $D_i$, $i = 1, 2, \ldots, B$, each of size $n'$, by sampling from $D$ uniformly and with replacement. We then fit $B$ models on these $B$ bootstrap samples and combine them by averaging the outputs (for regression) or by majority vote (for classification).
$$f_{\mathrm{avg}}(x) = \frac{1}{B}\sum_{b=1}^{B} f_b(x), \qquad f_{\mathrm{vote}}(x) = \mathrm{mode}\big(f_1(x), \ldots, f_B(x)\big)$$

For example, Random Forest = Bagging multiple decision trees.
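Here is a by-hand sketch of bagging for regression, assuming NumPy and scikit-learn are available; the function names `bagging_fit` / `bagging_predict` and the choice of `DecisionTreeRegressor` as the base learner are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=100, seed=0):
    """Fit B trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Average the B predictions (regression)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```

For classification, one would take the componentwise mode (majority vote) of the $B$ predictions instead of the mean; a Random Forest additionally decorrelates the trees by considering only a random subset of features at each split.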
Boosting
Boosting refers to the process of turning a weak learner into a strong learner, where B models are sequentially fitted:
$$f_1(x) \longrightarrow f_2(x) \longrightarrow f_3(x) \longrightarrow \cdots \longrightarrow f_B(x)$$
Each subsequent model learns from the experiences of the previous models. Finally, we aggregate the models by
$$f_{\mathrm{boosting}}(x) = \sum_{b=1}^{B} \alpha_b f_b(x),$$
where larger weights $\alpha_b$ are usually assigned to stronger learners (in AdaBoost, $\alpha_b$ increases as the weighted error of $f_b$ decreases).
For example, AdaBoost.
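As a sketch of the idea, a minimal AdaBoost implementation might look as follows; it assumes NumPy and scikit-learn, labels $y \in \{-1, +1\}$, and depth-1 decision stumps as the weak learners, all of which are our illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, B=50):
    """AdaBoost for binary labels y in {-1, +1} with decision stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start with uniform sample weights
    learners, alphas = [], []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)    # weighted training error
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # stronger learner -> larger alpha
        w *= np.exp(-alpha * y * pred)               # upweight misclassified points
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Weighted vote: sign of sum_b alpha_b * f_b(x)."""
    scores = sum(a * m.predict(X) for a, m in zip(alphas, learners))
    return np.sign(scores)

# Toy usage: two-class data with labels in {-1, +1}
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
learners, alphas = adaboost_fit(X, y, B=20)
y_hat = adaboost_predict(learners, alphas, X)
```

The reweighting step is how each subsequent stump "learns from the experiences" of its predecessors: misclassified points get larger weights, so the next weak learner focuses on them.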