STA 142A Statistical Learning I
Discussion 3: Maximum Likelihood Estimation
TA: Tesi Xiao
From MLE to Statistical Learning
In classical statistics, maximum likelihood estimation is a method of estimating the parameters of an assumed probability distribution, given some observed data [Wikipedia]. However, in the world of statistical (supervised) learning, a learning task aims to construct a model for predicting the response without knowing the true probabilistic model. In statistical machine learning, the term inference is also sometimes used to mean “make a prediction”.
The reason I put the paragraph above at the very beginning of this course is that it is important to notice the difference between estimation and inference/prediction in this course. Whenever a model is constructed to make a prediction, the goodness of fit of that model should be considered, and the discrepancy between the true model and the constructed model should be kept in mind.
MLE
Assume $x_1, x_2, …, x_n$ are i.i.d. (independent and identically distributed) samples from a distribution $P_{\theta^\star}(x)$.
- Likelihood Function \(L(\theta) = L(\theta | x_1, ..., x_n)= P_{\theta}(x_1, x_2, ..., x_n) = P_{\theta}(x_1)\cdot P_{\theta}(x_2) \cdot\cdots\cdot P_{\theta}(x_n) = \prod_{i=1}^{n} P_{\theta}(x_i)\)
- Log-Likelihood Function \(\ell(\theta) = \log L(\theta) = \log \left(\prod_{i=1}^{n} P_{\theta}(x_i)\right) = \sum_{i=1}^{n} \log P_{\theta}(x_i)\)
- MLE \(\hat{\theta}_n = \underset{\theta}{\arg\max}~ L(\theta) = \underset{\theta}{\arg\max}~ \ell(\theta) = \underset{\theta}{\arg\min}~ -\ell(\theta)\), i.e., maximizing the likelihood function = maximizing the log-likelihood function = minimizing the negative log-likelihood function. (The $\log$ function is monotonically increasing.)
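To make the last equivalence concrete, here is a minimal numerical sketch (assuming `numpy` and `scipy` are available; the exponential model, the simulated data, and the name `neg_log_lik` are only illustrative choices, not part of the notes) that computes an MLE by minimizing the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated i.i.d. data; an Exponential(rate = 2) model is an arbitrary illustrative choice.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=500)

def neg_log_lik(rate):
    """Negative log-likelihood of i.i.d. Exponential(rate) data."""
    if rate <= 0:
        return np.inf
    return -(len(x) * np.log(rate) - rate * x.sum())

# Minimizing -l(theta) is equivalent to maximizing L(theta).
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / x.mean())  # numerical MLE vs. the closed-form MLE 1 / (sample mean)
```

For this model the closed-form MLE is $1/\bar{x}$, so the two printed values should agree up to numerical tolerance.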
Find the (global) maximum/minimum of a given function
In this class, for derivation purposes, we only require univariate analyses, i.e., the domain $X\subseteq\mathbb{R}$. Vector-level analyses are optional. Motivated students can check Matrix Calculus.
Definition.
A real-valued function $f: X\rightarrow \mathbb{R}$ (defined on the domain $X$) has a global maximum at $x^\star$ if $f(x)\leq f(x^\star)$ for all $x\in X$. Similarly, a real-valued function $f: X\rightarrow \mathbb{R}$ has a global minimum at $x^\star$ if $f(x)\geq f(x^\star)$ for all $x\in X$.
Examples
(a) concave functions: $f: \mathbb{R}\rightarrow\mathbb{R}$ is a concave function if $f(tx_1 + (1-t)x_2)\geq tf(x_1) + (1-t)f(x_2)$ for all $t\in[0,1], x_1, x_2\in\mathbb{R}$.
(b) convex functions: $f: \mathbb{R}\rightarrow\mathbb{R}$ is a convex function if $f(tx_1 + (1-t)x_2)\leq tf(x_1) + (1-t)f(x_2)$ for all $t\in[0,1], x_1, x_2\in\mathbb{R}$.
If the given function is convex/concave and differentiable, the minimum/maximum point of this function can be obtained by setting the derivative equal to zero, i.e., $f'(x^\star)=0$.
Comprehensive notes on this topic: Concave and convex functions of a single variable
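As a quick illustration, the second-derivative check and the zero-derivative condition can be carried out symbolically. A sketch using `sympy` (the particular function $f(x) = (x-2)^2 + 1$ is an arbitrary convex example, not one from the notes):

```python
import sympy as sp

x = sp.symbols("x", real=True)
f = (x - 2) ** 2 + 1               # an arbitrary convex example

f1 = sp.diff(f, x)                 # f'(x) = 2x - 4
f2 = sp.diff(f, x, 2)              # f''(x) = 2 >= 0 everywhere, so f is convex
print(f2)
print(sp.solve(sp.Eq(f1, 0), x))   # [2]: the global minimizer x* = 2
```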
Steps to find the maximum/minimum of a given function
Given a function $f: [a, b]\rightarrow \mathbb{R}$ and its first and second derivatives $f', f''$.
Step 1. Check if $f$ is monotone.
- $f'\geq 0$ $\rightarrow$ $f$ is monotonically increasing $\rightarrow$ $f(a)\leq f(x) \leq f(b), \forall x\in[a, b]$
- $f'\leq 0$ $\rightarrow$ $f$ is monotonically decreasing $\rightarrow$ $f(a)\geq f(x) \geq f(b), \forall x\in[a, b]$
Step 2. Check if $f$ is convex/concave.
- $f''\geq 0$ $\rightarrow$ $f$ is convex
- $f''\leq 0$ $\rightarrow$ $f$ is concave
Solve for $x^\star\in[a, b]$ from $f'(x^\star)=0$. Then $x^\star$ or the endpoints $a, b$ can be the minimum/maximum point, depending on whether the zero-derivative point falls within the domain $[a,b]$.
Step 3. Try to plot a rough graph of the function (e.g., piece-wise functions, polynomial functions, etc.).
Remark. If the domain is $(a, b), (a, b], [a, b)$, the minimum/maximum point may not exist.
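The steps above can also be mimicked numerically. A sketch, with an arbitrary example function $f(x) = x^3 - 3x$ on $[0, 2]$ (my choice, not one from the notes):

```python
import numpy as np

def f(t):
    """Illustrative function on [a, b] = [0, 2]; an arbitrary example."""
    return t**3 - 3 * t

a, b = 0.0, 2.0

# Step 1: f'(t) = 3t^2 - 3 changes sign at t = 1, so f is not monotone on [0, 2].
# Step 2: f''(t) = 6t >= 0 on [0, 2], so f is convex there; the stationary point
#         t* = 1 solving f'(t*) = 0 lies inside [0, 2] and is the global minimizer.
#         The maximum of a convex function on [a, b] is attained at an endpoint.
candidates = np.array([a, 1.0, b])
print(dict(zip(candidates.tolist(), f(candidates).tolist())))
# {0.0: 0.0, 1.0: -2.0, 2.0: 2.0} -> minimum at t* = 1, maximum at the endpoint b = 2
```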
Some useful facts
(i) $\underset{x}{\arg\min}~ f(x) = \underset{x}{\arg\min}~ \{f(x) + c\}$
(ii) $\underset{x}{\arg\min}~ f(x) = \underset{x}{\arg\min}~ af(x),~ a>0$
(iii) $\underset{x}{\arg\min}~ f(x) = \underset{x}{\arg\max}~ af(x),~ a<0$
(iv) $\underset{x}{\arg\min}~ f(x) = \underset{x}{\arg\min}~ g(f(x))$ if $g$ is monotonically increasing.
(v) $\underset{x}{\arg\min}~ f(x) = \underset{x}{\arg\max}~ g(f(x))$ if $g$ is monotonically decreasing.
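These facts are easy to check numerically on a grid. A sketch (the function and constants below are arbitrary choices):

```python
import numpy as np

x = np.linspace(0.1, 5, 1000)
f = (x - 3) ** 2 + 1              # arbitrary function with a unique minimizer

i1 = x[np.argmin(f)]              # argmin f(x)
i2 = x[np.argmin(f + 7)]          # fact (i): adding a constant
i3 = x[np.argmin(2 * f)]          # fact (ii): positive scaling
i4 = x[np.argmax(-5 * f)]         # fact (iii): negative scaling turns argmin into argmax
i5 = x[np.argmin(np.log(f))]      # fact (iv): monotone increasing transform (f > 0 here)
print(i1, i2, i3, i4, i5)         # all equal (up to the grid resolution)
```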
Examples of MLE derivations
Example 1. Let $x_1, \dots, x_n \overset{i.i.d.}{\sim} \text{Bernoulli}(\theta), 0\leq \theta \leq 1$. Find MLE of $\theta$.
Note that $P_{\theta}(X=1) = \theta, P_{\theta}(X=0)= 1-\theta$. Then, one can write $P_{\theta}(X=x_i) = \theta^{x_i} (1-\theta)^{1-x_i}$ since $x_i=0$ or $1$.
The likelihood function is given by
\[L(\theta) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i}= \theta^{\sum_{i=1}^{n} x_i} (1-\theta)^{n - \sum_{i=1}^{n} x_i}, \quad 0\leq\theta\leq 1\]The log-likelihood function is given by
\[\ell(\theta) = \log L(\theta) = \left(\sum_{i=1}^{n} x_i\right) \log \theta + \left(n - \sum_{i=1}^{n} x_i\right) \log (1-\theta), \quad 0<\theta<1.\]Note that $\log(0)$ is not defined (it diverges to $-\infty$), and $L(0)=L(1)=0$ as long as the sample contains both 0's and 1's. Therefore, $\ell(\theta)$ is only defined on $(0,1)$.
The first derivative of $\ell(\theta)$ is given by
\[\ell'(\theta) = (\sum_{i=1}^{n}x_i)(1/\theta) + (n - \sum_{i=1}^{n} x_i) (1/(\theta-1)).\]The second derivative of $\ell(\theta)$ is given by
\[\ell''(\theta) = (\sum_{i=1}^{n}x_i) (-1/\theta^2) + (n - \sum_{i=1}^{n} x_i) (-1/(\theta-1)^2)<0, \quad \forall \theta\in(0,1).\]Therefore, $\ell(\theta)$ is concave on $(0,1)$. Setting $\ell'(\theta)=0$, we have $\hat{\theta}_{n} = \frac{1}{n}\sum_{i=1}^{n} x_{i} = \bar{x}$.
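A quick sanity check of this result (a sketch assuming `numpy` and `scipy`; the simulated Bernoulli data and the success probability 0.3 are illustrative): minimizing the negative log-likelihood numerically should recover the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.binomial(n=1, p=0.3, size=1000)   # i.i.d. Bernoulli draws (p = 0.3 is illustrative)
s = x.sum()

def neg_log_lik(theta):
    """Negative Bernoulli log-likelihood, defined on (0, 1)."""
    return -(s * np.log(theta) + (len(x) - s) * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())   # numerical maximizer vs. the closed-form MLE (the sample mean)
```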
Example 2. Let $x_1, \dots, x_n \overset{i.i.d.}{\sim} \text{Normal}(\mu, \sigma^2)$. Find MLE of $(\mu, \sigma^2)$.
Let $\theta = (\mu, \sigma^2)$. $P_{\theta}(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp (-\frac{(x_i-\mu)^2}{2\sigma^2})$.
The likelihood function is given by
\[L(\theta) = (2\pi\sigma^2)^{-n/2} \exp (-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}).\]The log-likelihood function is given by \(\ell(\theta) = \log L(\theta) = -\frac{n}{2} \log(2\pi\sigma^2) - \sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2} = C - \frac{n}{2}\log(\sigma^2) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2},\) where $C= -\frac{n}{2}\log(2\pi)$ is a constant free of $\theta$.
Although this is a bivariate function, we can still analyze it in a univariate way. Note that, for fixed $\sigma^2$, $\ell(\theta) = \ell(\mu)$ is concave on $\mathbb{R}$. Setting $\partial\ell/\partial\mu=0$, we get $\hat{\mu}_{n} = \frac{1}{n}\sum_{i=1}^{n} x_{i} = \bar{x}$.
Plugging $\hat{\mu}_{n}$ back into $\ell(\theta)$ and viewing $\ell$ as a function of $\sigma^2$ alone, we have \(\partial \ell/\partial \sigma^2 = \frac{1}{2(\sigma^2)^2}\left(\sum_{i=1}^{n}(x_i-\hat{\mu}_n)^2 - n\sigma^2\right),\) which is positive for small $\sigma^2$ and negative for large $\sigma^2$, so $\ell(\sigma^2)$ increases and then decreases. Setting $\partial \ell/\partial \sigma^2=0$ therefore gives the global maximizer $\hat{\sigma}^2_{n} = \frac{1}{n}\sum_{i=1}^{n} (x_{i} - \hat{\mu}_{n})^2.$
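As with Example 1, the closed-form answers can be verified numerically. A sketch (assuming `numpy` and `scipy`; the simulated data are illustrative) that minimizes the joint negative log-likelihood over $(\mu, \sigma^2)$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=2.0, size=2000)  # i.i.d. Normal(mu = 1.5, sigma^2 = 4), illustrative

def neg_log_lik(params):
    """Negative Normal log-likelihood as a function of (mu, sigma^2)."""
    mu, var = params
    if var <= 0:
        return np.inf
    return 0.5 * len(x) * np.log(2 * np.pi * var) + np.sum((x - mu) ** 2) / (2 * var)

res = minimize(neg_log_lik, x0=[0.0, 1.0], method="Nelder-Mead")
print(res.x)              # numerical (mu, sigma^2) estimates
print(x.mean(), x.var())  # closed-form MLEs: sample mean and the biased sample variance
```

Note that `np.var` divides by $n$ by default, which matches the (biased) MLE $\hat{\sigma}^2_n$ rather than the unbiased sample variance.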