STA 243 Computational Statistics
Discussion 5: Smooth Convex Optimization
TA: Tesi Xiao
Continuous Optimization
An optimization problem is the problem of finding the best solution from all feasible solutions. In this course, we focus on continuous optimization problems, in which the optimization variables are continuous and take values in a Euclidean space. (An optimization problem with discrete variables is known as a discrete optimization problem.)
A continuous optimization problem is to find some $\mathbf{x}^{\star } \in \Theta\subseteq\mathbb{R}^d$ attaining $\inf\{f(\mathbf{x}):\mathbf{x} \in \Theta\}$, i.e.
\[\begin{aligned} &\underset{\mathbf{x}}{\text{minimize}}\quad &f(\mathbf{x})\\ &\text{subject to }\quad &\mathbf{x}\in \Theta \end{aligned}\]
- Unconstrained: $\Theta \equiv \mathbb{R}^{d}$
- Constrained: $\Theta \subset \mathbb{R}^{d}$
- Standard Form: $\Theta$ can be characterized by a combination of inequality constraints and equality constraints.
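For concreteness, a standard-form problem with inequality constraints $g_i$ and equality constraints $h_j$ (a generic example, not tied to any particular problem in these notes) reads
\[\begin{aligned} &\underset{\mathbf{x}}{\text{minimize}}\quad &f(\mathbf{x})\\ &\text{subject to }\quad &g_i(\mathbf{x})\leq 0,\quad i=1,\dots,m,\\ & &h_j(\mathbf{x}) = 0,\quad j=1,\dots,p, \end{aligned}\]
so that $\Theta = \{\mathbf{x}\in\mathbb{R}^d: g_i(\mathbf{x})\leq 0 \text{ for all } i,\ h_j(\mathbf{x})=0 \text{ for all } j\}$.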
Smooth Convex Optimization
The problem above is called a smooth convex optimization problem if
- $\Theta\equiv \mathbb{R}^d$, or $\Theta\subset\mathbb{R}^d$ is closed, bounded (hence compact), and convex;
- $f(\mathbf{x})$ is convex on the domain $\Theta$;
- $f(\mathbf{x})$ is smooth (has Lipschitz continuous gradient).
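A standard example of a smooth convex problem (with $\Theta = \mathbb{R}^d$) is least squares:
\[f(\mathbf{x}) = \frac{1}{2}\|A\mathbf{x}-\mathbf{b}\|_2^2, \qquad \nabla f(\mathbf{x}) = A^\top(A\mathbf{x}-\mathbf{b}), \qquad \nabla^2 f(\mathbf{x}) = A^\top A \succeq 0,\]
and its gradient is Lipschitz continuous with constant $L = \lambda_{\max}(A^\top A)$.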
Remark
- In smooth convex optimization, first-order (gradient) and second-order (Hessian) information is reliable and can be used to design algorithms.
- Even in a smooth convex optimization problem, $\mathbf{x}^\star$ can be non-unique ($f$ has multiple global minima, e.g., a constant function) or even non-existent ($f$ attains no global minimum, e.g., $f(x)=e^x$ on $\mathbb{R}$).
Convex Set
A set $\Theta \subset \mathbb{R}^d$ is called a convex set if every line segment between two points in $\Theta$ is in $\Theta$, i.e.
\[\forall \mathbf{x}, \mathbf{y}\in\Theta, \forall c\in [0,1], \quad c\mathbf{x} + (1-c)\mathbf{y}\in \Theta.\]
Convex Function
Let $\Theta$ be a convex set and let $f:\Theta\rightarrow\mathbb{R}$ be a function.
Convex
$f$ is called convex if: \(\forall \mathbf{x}_{1},\mathbf{x}_{2}\in \Theta,\forall t\in [0,1]:\quad f(t\cdot \mathbf{x}_{1}+(1-t)\cdot \mathbf{x}_{2})\leq t\cdot f(\mathbf{x}_{1})+(1-t)\cdot f(\mathbf{x}_{2})\)
Strictly convex
$f$ is called strictly convex if: \(\forall \mathbf{x}_{1},\mathbf{x}_{2}\in \Theta,\forall t\in (0,1):\quad f(t\cdot \mathbf{x}_{1}+(1-t)\cdot \mathbf{x}_{2})< t\cdot f(\mathbf{x}_{1})+(1-t)\cdot f(\mathbf{x}_{2})\)
Equivalent definitions of convexity
- First-order: if $f$ is differentiable, i.e., $\nabla f(\mathbf{x})$ exists, then \(f \text{ is convex} \Longleftrightarrow f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y}-\mathbf{x}), \quad \forall \mathbf{x},\mathbf{y}\in\Theta\) (a numerical sanity check of this inequality is sketched after this list).
Geometric Interpretation: Its graph lies above all of its tangents.
- Second-order: if $f$ is twice continuously differentiable, i.e., $\nabla^2 f(\mathbf{x})$ exists, then \(f \text{ is convex} \Longleftrightarrow \nabla^2 f(\mathbf{x}) \succeq 0, \quad \forall \mathbf{x}\in\Theta\)
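As a quick numerical illustration (a sketch, not a proof; the least-squares objective and the variable names below are assumptions, not from the notes), the first-order condition can be checked at randomly sampled points:

```python
import numpy as np

# Sanity-check the first-order convexity condition
#   f(y) >= f(x) + <grad f(x), y - x>
# at random points, for the convex least-squares objective
#   f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-8  # holds since f is convex
print("first-order condition held at all sampled points")
```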
Strongly convex
- First-order: let $f$ be differentiable. Then, $f$ is called $\mu$-strongly convex (with $\mu > 0$) if \(f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}) + \frac{\mu}{2}\| \mathbf{y} - \mathbf{x} \|^2, \quad \forall \mathbf{x},\mathbf{y}\in\Theta\)
- Second-order: if $f$ is twice continuously differentiable, i.e., $\nabla^2 f(\mathbf{x})$ exists, then \(f \text{ is $\mu$-strongly convex} \Longleftrightarrow \nabla^2 f(\mathbf{x}) \succeq \mu I, \text{ i.e., } \lambda_{\min}(\nabla^2 f(\mathbf{x}))\geq \mu > 0, \quad \forall \mathbf{x}\in\Theta\)
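For example (an assumed ridge-regression objective, not from the notes), the regularized least-squares problem is strongly convex:
\[f(\mathbf{x}) = \frac{1}{2}\|A\mathbf{x}-\mathbf{b}\|_2^2 + \frac{\lambda}{2}\|\mathbf{x}\|_2^2, \qquad \nabla^2 f(\mathbf{x}) = A^\top A + \lambda I \succeq \lambda I,\]
so $f$ is $\mu$-strongly convex with $\mu = \lambda_{\min}(A^\top A) + \lambda \geq \lambda > 0$.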
Remark
- Strong convexity doesn’t necessarily require $f$ to be differentiable, and the gradient is replaced by the sub-gradient in the first-order characterization when the function is non-smooth.
- Strictly convex vs. Strongly convex
- Strongly convex $\Rightarrow$ Strictly convex, but not vice versa. See $f(x) = -\log(x)$ on $(0,\infty)$: it is strictly convex since $f''(x) = 1/x^2 > 0$, but $f''(x)\to 0$ as $x\to\infty$, so it is not strongly convex.
- Strictly convex $\Rightarrow$ $f$ has at most one global minimum; $\mathbf{x}^\star$ may not exist, and $\mathbf{x}^\star$ is unique if it exists.
- Strongly convex $\Rightarrow$ $\mathbf{x}^\star$ exists, and it is unique.
- Intuitively speaking, strong convexity quantifies how curved the function is. The first-order characterization implies a quadratic lower bound on the growth of the function (made explicit below).
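To make the quadratic lower bound explicit: if $\mathbf{x}^\star$ is an unconstrained minimizer of a $\mu$-strongly convex $f$ (so $\nabla f(\mathbf{x}^\star) = \mathbf{0}$), taking $\mathbf{x} = \mathbf{x}^\star$ in the first-order characterization gives
\[f(\mathbf{y}) \geq f(\mathbf{x}^\star) + \frac{\mu}{2}\|\mathbf{y}-\mathbf{x}^\star\|^2, \quad \forall \mathbf{y}\in\Theta,\]
which forces the minimizer to be unique.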
Smooth Function
- $L$-smooth
- A differentiable function $f$ is said to be $L$-smooth, or to have an $L$-Lipschitz continuous gradient, if there exists $L>0$ such that \(\| \nabla f(\mathbf{x}) - \nabla f(\mathbf{y}) \| \leq L\| \mathbf{x}- \mathbf{y} \|,\quad \forall \mathbf{x},\mathbf{y} \in \Theta\).
Equivalent definitions of smoothness
- First-order: \(f \text{ is $L$-smooth} \Longleftrightarrow \forall \mathbf{x},\mathbf{y}\in \Theta, f(\mathbf{y})\leq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y}-\mathbf{x}) + \frac{L}{2}\| \mathbf{x}-\mathbf{y}\|_2^2\)
- Second-order: if $f$ is twice continuously differentiable, then \(f \text{ is $L$-smooth} \Longleftrightarrow \lambda_{\max}(\nabla^2 f(\mathbf{x}))\leq L, \quad \forall \mathbf{x}\in\Theta\)
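When the Hessian is constant, both constants can be read off from its extreme eigenvalues. The sketch below (an assumed ridge-regression example with hypothetical names, not from the notes) computes $\mu$ and $L$ numerically:

```python
import numpy as np

# For the ridge objective f(x) = 0.5*||A x - b||^2 + 0.5*lam*||x||^2 the
# Hessian is constant, H = A^T A + lam*I, so
#   mu = lambda_min(H)  (strong convexity constant)
#   L  = lambda_max(H)  (smoothness constant)
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
lam = 0.1

H = A.T @ A + lam * np.eye(10)
eigvals = np.linalg.eigvalsh(H)      # eigenvalues in ascending order
mu, L = eigvals[0], eigvals[-1]
print(f"mu = {mu:.3f}, L = {L:.3f}, condition number L/mu = {L / mu:.1f}")
```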
Remark
- Smoothness characterizes the continuity of $\nabla f$: the gradient cannot change too quickly.
- If $\nabla f(\mathbf{x})$ is Lipschitz continuous, then $\nabla f$ is differentiable almost everywhere, i.e., $\nabla^2 f(\mathbf{x})$ exists almost everywhere.
- Non-smooth functions: $\vert x\vert$ (not differentiable at $0$) and $x^2 \sin(1/x)$ (with value $0$ at $x=0$; differentiable everywhere, but its derivative is not continuous at $0$).
Algorithm
The algorithms for solving smooth convex optimization problems that have been (or will soon be) covered in this course can be divided into two categories according to the oracle information they use; a minimal gradient descent sketch is given after the list:
- First-order:
- Unconstrained: (Accelerated) Gradient Descent; Heavy Ball Method
- Constrained: Projected Gradient Descent; Frank-Wolfe Method
- Second-order: Newton’s Method and its variants
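As a preview of the first-order methods above, here is a minimal sketch of plain gradient descent with the standard step size $1/L$ for an $L$-smooth objective (the function names and the least-squares example are illustrative assumptions, not the course's reference implementation):

```python
import numpy as np

def gradient_descent(grad, x0, step, n_iters=1000):
    """Plain gradient descent: x_{k+1} = x_k - step * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step * grad(x)
    return x

# Example: minimize the smooth convex least-squares objective
# f(x) = 0.5 * ||A x - b||^2, whose gradient is A^T (A x - b) and whose
# smoothness constant is L = lambda_max(A^T A); step size 1/L guarantees descent.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)

L = np.linalg.eigvalsh(A.T @ A)[-1]
x_hat = gradient_descent(lambda x: A.T @ (A @ x - b), x0=np.zeros(5), step=1.0 / L)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # closed-form minimizer for comparison
print(np.allclose(x_hat, x_star, atol=1e-6))
```

For the constrained case, projected gradient descent would additionally project each iterate back onto $\Theta$ after the gradient step.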