STA 142A Statistical Learning I
Discussion 4: Bayes Optimal Classifier
TA: Tesi Xiao
A Big Picture of Supervised Learning
Data $(X, Y)\sim D$
- Population-level: Assume the collected data follow an underlying distribution $(X, Y)\sim D$. Here, the random variable $X$ stands for the predictor/feature; the random variable $Y$ represents the label/class in classification or the response in regression; $D$ is some unknown joint distribution of $(X,Y)$.
- Sample-level: Observe $(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n) \overset{\text{i.i.d.}}{\sim} D$, i.e., $n$ independent copies of $(X, Y)$; a toy simulation is sketched below.
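As a purely illustrative example (none of the specifics below come from lecture), the following sketch samples i.i.d. pairs from a toy joint distribution $D$ in which $Y$ takes the values $\pm 1$ with equal probability and $X \mid Y = y \sim N(y, 1)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_D(n):
    """Draw n i.i.d. pairs (x_i, y_i) from a toy joint distribution D:
    Y = +1 or -1 with probability 1/2 each, and X | Y = y ~ N(y, 1)."""
    y = rng.choice([-1, 1], size=n)       # labels
    x = rng.normal(loc=y, scale=1.0)      # features drawn conditionally on each label
    return x, y

x, y = sample_D(10)
print(x[:3], y[:3])
```

The same toy distribution is reused in the later sketches so that the pieces of the pipeline can be compared.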
Prediction Function $f\in\mathcal{F}$
A learning algorithm seeks to find a function $f$ that maps the feature vector $x$ to $f(x)$ in the output space as a prediction of $y$. The function $f$ is an element of some space $\mathcal{F}$ of all possible functions, also called the hypothesis space.
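For a concrete (hypothetical) example of a hypothesis space, the sketch below takes $\mathcal{F}$ to be the family of threshold classifiers $f_t(x) = \mathrm{sign}(x - t)$ on a scalar feature, with each $f\in\mathcal{F}$ identified by its threshold $t$.

```python
import numpy as np

def make_threshold_classifier(t):
    """Return the prediction function f_t(x) = sign(x - t), one element of the
    hypothesis space F = { f_t : t real } of threshold classifiers."""
    def f(x):
        return np.where(np.asarray(x) >= t, 1, -1)
    return f

f = make_threshold_classifier(0.0)   # one particular f in F
print(f([-1.2, 0.4, 2.0]))           # predictions for three feature values
```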
Loss Function $\ell(\hat{y}, y)$
The loss function is introduced to quantify the performance of a prediction function $f$. If we observe $(x, y)$, the loss of predicting $y$ as $\hat{y} = f(x)$ is defined as
\[\ell(\hat{y}, y) = \ell(f(x), y).\]
Here, the lower-case letters represent observed values, so $\ell(\hat{y}, y)$ is a deterministic scalar. Note that $\ell(\cdot, \cdot)$ is a function of the predicted value and the true value.
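A minimal sketch of the 0-1 loss (and, for contrast, the squared loss commonly used in regression) as plain Python functions; the prediction rule `f` below is an arbitrary stand-in chosen only for illustration.

```python
def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction differs from the true label, else 0."""
    return float(y_hat != y)

def squared_loss(y_hat, y):
    """Squared loss, typically used for regression."""
    return (y_hat - y) ** 2

# an arbitrary prediction rule f(x) = sign(x), mapping a feature to a label in {-1, +1}
f = lambda x: 1 if x >= 0 else -1

x_obs, y_obs = 0.3, -1                   # one observed pair (x, y)
print(zero_one_loss(f(x_obs), y_obs))    # a deterministic scalar value: 1.0 here
```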
Risk Function $R(f)$
If we replace the lower-case letters $(x,y)$ with the random variables $(X,Y)$, we obtain a random variable $\ell(f(X), Y)$. The risk function is the expected value of this random variable with respect to the underlying distribution of $(X, Y)$, i.e.,
\[R(f) = \mathbb{E}_{(X,Y)\sim D}[\ell(f(X), Y)].\]
Note that the risk $R(f)$ is a function of the prediction function $f$ defined on the hypothesis space $\mathcal{F}$, which distinguishes it from the loss function.
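Because the expectation is taken over $D$, the risk can only be computed (or approximated) when $D$ is known or can be sampled from. The sketch below approximates $R(f)$ for the toy distribution above and the rule $f(x) = \mathrm{sign}(x)$ by Monte Carlo; this is an illustration of the definition, not something available in practice, where $D$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_D(n):
    """i.i.d. draws from the toy distribution: Y = +/-1 equally likely, X | Y ~ N(Y, 1)."""
    y = rng.choice([-1, 1], size=n)
    x = rng.normal(loc=y, scale=1.0)
    return x, y

def f(x):
    """An arbitrary prediction rule f(x) = sign(x)."""
    return np.where(x >= 0, 1, -1)

# Monte Carlo approximation of R(f) = E[ 1{f(X) != Y} ] using a large i.i.d. sample
x, y = sample_D(1_000_000)
risk_estimate = np.mean(f(x) != y)
print(f"estimated risk R(f) ~ {risk_estimate:.4f}")   # close to P(N(0,1) > 1) ~ 0.159
```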
Optimal Prediction Function $f^\star$
Given $(X, Y)\sim D$, the hypothesis space $\mathcal{F}$, and the loss function $\ell(\cdot, \cdot)$, the optimal prediction function $f^\star$ is the function that minimizes the risk function:
\[f^\star = \underset{f\in\mathcal{F}}{\arg \min}~ R(f).\]
Empirical Risk Minimization (ERM)
In practice, the explicit form of $R(f)$ is not accessible because the distribution $D$ is unknown. In order to find an approximate solution of $f^\star$, we estimate $R(f)$ by its sample average, also called the empirical risk,
\[R_{\text{emp}}(f) = \frac{1}{n} \sum_{i=1}^n \ell (f(x_i), y_i).\]
Then, the approximate solution $\hat{f}$ is given by
\[\hat{f} = \underset{f\in\mathcal{F}}{\arg \min}~ R_{\text{emp}}(f).\]
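A toy illustration of ERM (under the same hypothetical setup as before, not a method prescribed in lecture): take the hypothesis space of threshold classifiers $f_t(x) = \mathrm{sign}(x - t)$ and pick the threshold that minimizes the empirical 0-1 risk on a simulated training sample, here by a simple grid search.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated training sample from the toy distribution used earlier
y_train = rng.choice([-1, 1], size=200)
x_train = rng.normal(loc=y_train, scale=1.0)

def empirical_risk(t, x, y):
    """Empirical 0-1 risk of the threshold classifier f_t(x) = sign(x - t)."""
    predictions = np.where(x >= t, 1, -1)
    return np.mean(predictions != y)

# ERM over a gridded version of the hypothesis space {f_t : t in the grid}
thresholds = np.linspace(-3, 3, 601)
risks = [empirical_risk(t, x_train, y_train) for t in thresholds]
t_hat = thresholds[int(np.argmin(risks))]
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {min(risks):.3f}")
```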
Bayes Optimal Classifier
Binary Classification
In the binary classification problem ($Y \in \{-1, 1\}$), given some joint distribution of $(X,Y)$, suppose we pick the 0-1 loss,
\[\ell(f(X), Y) = \mathbf{1}_{\{f(X)\neq Y\}}=\begin{cases}0,& f(X)=Y\\1,& f(X)\neq Y\end{cases}.\]
Then the classifier that minimizes the risk $R(f)$, also known as the Bayes optimal classifier, is given by
\[f^\star(x) = \begin{cases}1,& \eta(x)\geq 1/2\\-1,& \eta(x)< 1/2\end{cases},\]
where $\eta(x) = P(Y=1\mid X=x)$.
Remark. Note that $\eta(x) = P(Y=1\mid X=x)$ and $1-\eta(x)=P(Y=-1\mid X=x)$. Hence $\eta(x) \geq 1/2$ is equivalent to $P(Y=1\mid X=x) \geq P(Y=-1\mid X=x)$. In other words, the Bayes optimal classifier predicts the label with the largest conditional probability among all possible values of $Y$ given $X=x$.
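For the toy distribution used above, $\eta(x)$ has a closed form by Bayes' rule, so the Bayes optimal classifier can be written out explicitly. The sketch below (assuming equal class priors and unit-variance Gaussian class conditionals, an illustrative choice only) thresholds $\eta(x)$ at $1/2$; in this particular model the rule reduces to $\mathrm{sign}(x)$.

```python
import numpy as np
from scipy.stats import norm

def eta(x):
    """eta(x) = P(Y = 1 | X = x) for the toy model:
    P(Y=1) = P(Y=-1) = 1/2, X | Y=+1 ~ N(+1, 1), X | Y=-1 ~ N(-1, 1)."""
    p_plus = norm.pdf(x, loc=1.0, scale=1.0)     # density of X | Y = +1
    p_minus = norm.pdf(x, loc=-1.0, scale=1.0)   # density of X | Y = -1
    return p_plus / (p_plus + p_minus)

def bayes_classifier(x):
    """Predict +1 whenever eta(x) >= 1/2, otherwise -1."""
    return np.where(eta(x) >= 0.5, 1, -1)

x_grid = np.array([-2.0, -0.1, 0.0, 0.5, 2.0])
print(eta(x_grid).round(3))        # posterior probability of class +1
print(bayes_classifier(x_grid))    # coincides with sign(x) in this model
```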
Multi-class Classification
Given the observation above, it is natural to extend the Bayes optimal classifier to the multi-class case, i.e., $Y \in \{1, 2, \dots, K\}$. The Bayes optimal classifier is given by
\[f^\star = \underset{f}{\arg\min}~R(f) = \underset{f}{\arg\min}~\mathbb{E} \big[\mathbf{1}_{\{f(X)\neq Y\}}\big], \qquad \text{i.e.,}\quad f^\star(x) = \underset{k\in\{1,2,\dots,K\}}{\arg\max}~ P(Y=k\mid X=x).\]
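Sketched below is the multi-class rule for a hypothetical model with $K=3$ classes, known priors, and Gaussian class-conditional densities: the classifier returns the class with the largest posterior probability $P(Y=k\mid X=x)$.

```python
import numpy as np
from scipy.stats import norm

# hypothetical 3-class model: class priors and Gaussian class-conditional densities
priors = np.array([0.2, 0.5, 0.3])     # P(Y = k), k = 1, 2, 3
means = np.array([-2.0, 0.0, 2.0])     # mean of X | Y = k
scales = np.array([1.0, 1.0, 1.0])     # std. dev. of X | Y = k

def posterior(x):
    """P(Y = k | X = x) for each class k, via Bayes' rule."""
    joint = priors * norm.pdf(x, loc=means, scale=scales)   # P(Y = k) * p(x | Y = k)
    return joint / joint.sum()

def bayes_classifier(x):
    """Return the class (1-indexed) with the largest posterior probability."""
    return int(np.argmax(posterior(x))) + 1

for x in (-1.5, 0.3, 2.4):
    print(x, posterior(x).round(3), "->", bayes_classifier(x))
```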