STA 243 Computational Statistics

Discussion 5: Smooth Convex Optimization

TA: Tesi Xiao

Continuous Optimization

An optimization problem is the problem of finding the best solution among all feasible solutions. In this course, we focus on continuous optimization problems, in which the optimization variables are continuous and take values in a Euclidean space. (An optimization problem with discrete variables is known as a discrete optimization problem.)

A continuous optimization problem is to find some $\mathbf{x}^{\star} \in \Theta\subseteq\mathbb{R}^d$ attaining $\inf\{f(\mathbf{x}) : \mathbf{x} \in \Theta\}$, i.e.

\[\begin{aligned} &\underset{\mathbf{x}}{\text{minimize}}\quad &f(\mathbf{x})\\ &\text{subject to }\quad &\mathbf{x}\in \Theta \end{aligned}\]
  • Unconstrained: $\Theta \equiv \mathbb{R}^{d}$
  • Constrained: $\Theta \subset \mathbb{R}^{d}$
  • Standard Form: $\Theta$ can be characterized by a combination of inequality constraints and equality constraints.
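
As a concrete illustration (a minimal sketch, not part of the original notes), the snippet below solves a small problem in standard form with scipy.optimize.minimize; the quadratic objective, the simplex constraint, and the starting point are all made up for this example.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative problem (made up for this sketch):
#   minimize   f(x) = ||x - b||^2
#   subject to sum(x) = 1   (equality constraint)
#              x >= 0       (inequality constraints, written as bounds)
b = np.array([0.3, 0.9])
f = lambda x: np.sum((x - b) ** 2)

res = minimize(
    f,
    x0=np.array([0.5, 0.5]),                                         # feasible start
    method="SLSQP",
    constraints=[{"type": "eq", "fun": lambda x: np.sum(x) - 1.0}],  # sum(x) = 1
    bounds=[(0.0, None), (0.0, None)],                               # x >= 0
)
print(res.x)  # approximately [0.2, 0.8], the projection of b onto the simplex
```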

Smooth Convex Optimization

The problem above is called a smooth convex optimization problem if

  • $\Theta\equiv \mathbb{R}^d$, or $\Theta\subset\mathbb{R}^d$ is closed, bounded (hence compact), and convex;
  • $f(\mathbf{x})$ is convex on the domain $\Theta$;
  • $f(\mathbf{x})$ is smooth (has Lipschitz continuous gradient).

Remark

  • In smooth convex optimization, the first-order (gradient) and second-order (Hessian) information is reliable and can be used to design algorithms.
  • Even in a smooth convex optimization problem, $\mathbf{x}^\star$ can be non-unique ($f$ has multiple global minima, e.g., a constant function) or even non-existent ($f$ attains no global minimum, e.g., $f(x)=e^x$ on $\mathbb{R}$).

Convex Set

A set $\Theta \subset \mathbb{R}^d$ is called a convex set if every line segment between two points in $\Theta$ is in $\Theta$, i.e.

\[\forall \mathbf{x}, \mathbf{y}\in\Theta, \forall c\in [0,1], \quad c\mathbf{x} + (1-c)\mathbf{y}\in \Theta.\]
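
For instance (an assumed illustration, not from the notes), the unit Euclidean ball is convex, and the defining property can be spot-checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

def in_unit_ball(z, tol=1e-12):
    """Membership test for the closed unit Euclidean ball."""
    return np.linalg.norm(z) <= 1.0 + tol

for _ in range(1000):
    # draw two points in the ball (scale down anything outside it)
    x = rng.normal(size=3); x /= max(1.0, np.linalg.norm(x))
    y = rng.normal(size=3); y /= max(1.0, np.linalg.norm(y))
    c = rng.uniform()                          # coefficient in [0, 1]
    assert in_unit_ball(c * x + (1 - c) * y)   # the segment stays inside
```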

Convex Function

Let $\Theta$ be a convex set and let $f:\Theta\rightarrow \mathbb{R}$ be a function.

Convex

$f$ is called convex if: \(\forall \mathbf{x}_{1},\mathbf{x}_{2}\in \Theta,\forall t\in [0,1]:\quad f(t\cdot \mathbf{x}_{1}+(1-t)\cdot \mathbf{x}_{2})\leq t\cdot f(\mathbf{x}_{1})+(1-t)\cdot f(\mathbf{x}_{2})\)

Strictly convex

$f$ is called strictly convex if: \(\forall \mathbf{x}_{1},\mathbf{x}_{2}\in \Theta \text{ with } \mathbf{x}_{1}\neq \mathbf{x}_{2},\forall t\in (0,1):\quad f(t\cdot \mathbf{x}_{1}+(1-t)\cdot \mathbf{x}_{2})< t\cdot f(\mathbf{x}_{1})+(1-t)\cdot f(\mathbf{x}_{2})\)
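
As a quick illustration (a sketch with made-up sample points, not from the notes), the defining inequality can be checked numerically: $f(x)=x^2$ never violates it, while the non-convex $g(x)=\sin(x)$ does.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x ** 2     # strictly convex
g = np.sin               # not convex on [-5, 5]

violations_f = violations_g = 0
for _ in range(10_000):
    x1, x2 = rng.uniform(-5.0, 5.0, size=2)
    t = rng.uniform()
    mid = t * x1 + (1 - t) * x2
    if f(mid) > t * f(x1) + (1 - t) * f(x2) + 1e-12:
        violations_f += 1
    if g(mid) > t * g(x1) + (1 - t) * g(x2) + 1e-12:
        violations_g += 1

print(violations_f, violations_g)   # expect 0 for f and many for g
```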

Equivalent definitions of convexity

  • First-order: if $f$ is differentiable, i.e., $\nabla f(\mathbf{x})$ exists, then
\[f \text{ is convex} \Longleftrightarrow \forall \mathbf{x},\mathbf{y}\in \Theta, f(\mathbf{x}) \geq f(\mathbf{y}) + \nabla f(\mathbf{y})^\top (\mathbf{x}-\mathbf{y})\]

Geometric interpretation: the graph of $f$ lies above all of its tangent (hyper)planes.

  • Second-order: if $f$ is twice continuously differentiable, i.e., $\nabla^2 f(\mathbf{x})$ exists, then
\[f \text{ is convex} \Longleftrightarrow \nabla^2 f(\mathbf{x}) \text{ is positive semidefinite } (\lambda_{\min}(\nabla^2 f(\mathbf{x}))\geq 0)\quad \forall \mathbf{x}\in\Theta\]
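
As an illustrative sketch (the matrix below is made up), the second-order criterion is easy to check for a quadratic $f(\mathbf{x})=\tfrac{1}{2}\mathbf{x}^\top A \mathbf{x}$, whose Hessian is $A$ everywhere:

```python
import numpy as np

# Hessian of the quadratic f(x) = 0.5 x^T A x (a made-up symmetric matrix)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

lam_min = np.linalg.eigvalsh(A).min()   # smallest eigenvalue of the Hessian
print(lam_min >= 0)                     # True -> the quadratic is convex
```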

Strongly convex

  • First-order: let $f$ be differentiable. Then, $f$ is called $\mu$-strongly convex ($\mu>0$) if \(f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}) + \frac{\mu}{2}\| \mathbf{y} - \mathbf{x} \|^2, \quad \forall \mathbf{x},\mathbf{y}\in\Theta\)
  • Second-order: if $f$ is twice continuously differentiable, i.e., $\nabla^2 f(\mathbf{x})$ exists, then \(f \text{ is $\mu$-strongly convex} \Longleftrightarrow \nabla^2 f(\mathbf{x}) \text{ is positive definite with } \lambda_{\min}(\nabla^2 f(\mathbf{x}))\geq \mu > 0, \quad \forall \mathbf{x}\in\Theta\)
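
As a sketch under assumed data (the matrix $A$ below is made up), for the quadratic $f(\mathbf{x})=\tfrac{1}{2}\mathbf{x}^\top A\mathbf{x}$ with symmetric positive definite $A$, the largest valid strong-convexity constant is $\mu=\lambda_{\min}(A)$, and the first-order inequality can be spot-checked numerically:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # made-up symmetric positive definite matrix
mu = np.linalg.eigvalsh(A).min()        # strong-convexity constant for this quadratic

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

rng = np.random.default_rng(2)
for _ in range(1000):
    x, y = rng.normal(size=(2, 2))      # random pair of points
    lower_bound = f(x) + grad(x) @ (y - x) + 0.5 * mu * np.linalg.norm(y - x) ** 2
    assert f(y) >= lower_bound - 1e-9   # first-order strong-convexity inequality
```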

Remark

  • Strong convexity does not require $f$ to be differentiable; when $f$ is not differentiable, the gradient in the first-order characterization is replaced by a sub-gradient.
  • Strictly convex vs. Strongly convex
    • Strongly convex $\Rightarrow$ strictly convex, but not vice versa; e.g., $f(x) = -\log(x)$ on $(0,\infty)$ is strictly convex but not strongly convex.
    • Strictly convex $\Rightarrow$ $f$ has at most one global minimum; $\mathbf{x}^\star$ may not exist, and $\mathbf{x}^\star$ is unique if it exists.
    • Strongly convex $\Rightarrow$ $\mathbf{x}^\star$ exists, and it is unique.
  • Intuitively speaking, strong convexity quantifies how curved the function is: the first-order characterization says the function grows at least quadratically away from any point, i.e., it admits a quadratic lower bound on its growth.

Smooth Function

$L$-smooth
A differentiable function $f$ is said to be $L$-smooth or to have an $L$-Lipschitz continuous gradient if for some $L>0$, \(\| \nabla f(\mathbf{x}) - \nabla f(\mathbf{y}) \| \leq L\| \mathbf{x}- \mathbf{y} \|,\quad \forall \mathbf{x},\mathbf{y} \in \Theta\).

Equivalent definitions of smoothness

  • First-order: \(f \text{ is $L$-smooth} \Longleftrightarrow \forall \mathbf{x},\mathbf{y}\in \Theta, f(\mathbf{y})\leq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y}-\mathbf{x}) + \frac{L}{2}\| \mathbf{x}-\mathbf{y}\|_2^2\)
  • Second-order: if $f$ is twice continuously differentiable, then \(f \text{ is $L$-smooth} \Longleftrightarrow \lambda_{\max}(\nabla^2 f(\mathbf{x}))\leq L, \quad \forall \mathbf{x}\in\Theta\)
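
As an illustrative sketch with a made-up quadratic, $f(\mathbf{x})=\tfrac{1}{2}\mathbf{x}^\top A\mathbf{x}$ has gradient $A\mathbf{x}$, so it is $L$-smooth with $L=\lambda_{\max}(A)$; both the Lipschitz condition and the quadratic upper bound can be spot-checked numerically:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # made-up symmetric matrix
L = np.linalg.eigvalsh(A).max()         # smoothness constant of the quadratic

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

rng = np.random.default_rng(3)
for _ in range(1000):
    x, y = rng.normal(size=(2, 2))
    # gradient Lipschitz condition
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # first-order (quadratic upper bound) characterization
    assert f(y) <= f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2 + 1e-9
```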

Remark

  • Smoothness characterizes the continuity of $\nabla f$: the gradient cannot change too rapidly.
  • If $\nabla f(\mathbf{x})$ is Lipschitz continuous, then $\nabla f$ is differentiable almost everywhere, i.e., $\nabla^2 f(\mathbf{x})$ exists almost everywhere.
  • Examples of non-smooth functions: $\vert x\vert$ (not differentiable at $0$) and $x^2 \sin(1/x)$ (differentiable when extended by $f(0)=0$, but its derivative is discontinuous at $0$).

Algorithm

The algorithms for solving smooth convex optimization problems covered (or soon to be covered) in this course can be divided into two categories according to the oracle information they use; a minimal gradient descent sketch follows the list:

  • First-order:
    • Unconstrained: (Accelerated) Gradient Descent; Heavy Ball Method
    • Constrained: Projected Gradient Descent; Frank-Wolfe Method
  • Second-order: Newton’s Method and its variants
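
As referenced above, here is a minimal gradient descent sketch (an assumed example, not the course's reference implementation): it minimizes a made-up smooth, strongly convex quadratic with the constant step size $1/L$.

```python
import numpy as np

def gradient_descent(grad, x0, step, n_iters=500):
    """Fixed-step gradient descent; returns the final iterate."""
    x = x0.copy()
    for _ in range(n_iters):
        x -= step * grad(x)
    return x

# Example problem: f(x) = 0.5 x^T A x - b^T x, whose minimizer solves A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b

L = np.linalg.eigvalsh(A).max()                 # smoothness constant
x_hat = gradient_descent(grad, x0=np.zeros(2), step=1.0 / L)
print(np.allclose(x_hat, np.linalg.solve(A, b), atol=1e-6))   # True
```

The step size $1/L$ is the standard default for $L$-smooth objectives: by the first-order (descent-lemma) bound, each iteration is guaranteed not to increase the objective.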