STA 243 Computational Statistics
Discussion 6: Non-Smooth Convex Optimization
TA: Tesi Xiao
Recap: Smooth Convex Optimization
A convex optimization problem is to find some $\mathbf{x}^{\star} \in \Theta$ attaining $\inf\{f(\mathbf{x}):\mathbf{x} \in \Theta\}$, i.e. \(\begin{aligned} &\min_{\mathbf{x}}\quad f(\mathbf{x})\\ &\text{subject to }\quad \mathbf{x}\in \Theta \end{aligned}\)
- $f(\mathbf{x})$ is convex on the domain $\Theta$;
- Unconstrained: $\Theta \equiv \mathbb{R}^{d}$;
- Constrained: $\Theta\subset \mathbb{R}^{d}$ is bounded, closed, and convex.
So far, the first-order algorithms we have learned in the lectures for solving smooth convex optimization problems, such as (Projected) (Accelerated) Gradient Descent methods and Conditional Gradient methods, require the gradient information $\nabla f(\mathbf{x})$. But what if $f(\mathbf{x})$ is non-smooth, i.e., not differentiable at some points?
Subderivative
In mathematics, the subderivative generalizes the derivative to convex functions which are not necessarily differentiable.
Derivative of Convex Functions. If $f: \Theta\subseteq \mathbb{R}\rightarrow \mathbb{R}$ is convex and differentiable at a point $x\in\Theta$ (i.e. the derivative of $f$ at $x$, $f'(x)$, exists), then we have
\[f(y) \geq f(x) + f'(x)(y-x), \quad \forall y\in\Theta.\]
Subderivative of Convex Functions. Let $f: \Theta\subseteq \mathbb{R}\rightarrow \mathbb{R}$ be a convex function defined on the open interval $\Theta$. A subderivative of $f$ at a point $x$ is a real number $c$ such that
\[f(y) \geq f(x) + c(y-x), \quad \forall y \in \Theta.\]
The set of all subderivatives is called the subdifferential of the function $f$ at $x$. One can show that the set of subderivatives at $x$ for a convex function $f$ is a nonempty closed interval $[a,b]$, where
\[a=\lim_{h\rightarrow 0^-} \frac{f(x+h) - f(x)}{h}, \quad b=\lim_{h\rightarrow 0^+} \frac{f(x+h) - f(x)}{h}\]
For example, consider the function $f(x) = \vert x \vert$, which is convex. Then, the subdifferential at the origin is the interval $[-1, 1]$. The subdifferential at any point $x<0$ is the singleton set $\{-1\}$, while the subdifferential at any point $x>0$ is the singleton set $\{1\}$.
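As a quick numerical illustration (my own sketch, not from the lecture), the two one-sided limits above can be approximated with finite difference quotients for $f(x) = \vert x \vert$:

```python
# Approximate the one-sided limits a and b that bound the subdifferential
# [a, b] of a convex function of one variable at a point x.
def one_sided_limits(f, x, h=1e-8):
    a = (f(x - h) - f(x)) / (-h)  # h -> 0^- difference quotient
    b = (f(x + h) - f(x)) / h     # h -> 0^+ difference quotient
    return a, b

# f(x) = |x|: subdifferential at 0 is [-1, 1]
print(one_sided_limits(abs, 0.0))  # (-1.0, 1.0)
# at x > 0 the function is differentiable, so a and b both approach 1
print(one_sided_limits(abs, 2.0))
```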
Subgradient
The concepts of subderivative and subdifferential can be generalized to $f: \Theta\subseteq\mathbb{R}^d \rightarrow \mathbb{R}$.
If $f:\Theta\subseteq\mathbb{R}^d\rightarrow \mathbb{R}$ is a real-valued convex function defined on a convex open set $\Theta\subseteq\mathbb{R}^d$, a vector $\mathbf{g=g(x)}\in\mathbb{R}^d$ in that space is called a subgradient of $f$ at a point $\mathbf{x}\in\Theta$ if for any $\mathbf{y}\in\Theta$ one has
\[f(\mathbf{y})\geq f(\mathbf{x})+ \mathbf{g}^\top (\mathbf{y}-\mathbf{x})\]The set of all subgradients at $\mathbf{x}$ is called the subdifferential at $\mathbf{x}$ and is denoted $\partial f(\mathbf{x})$. One can show that the subdifferential is always a nonempty convex compact set.
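To make the definition concrete, here is a small check (a sketch of my own, using NumPy) that $\mathbf{g} = \text{sign}(\mathbf{x})$ satisfies the subgradient inequality for $f(\mathbf{x}) = \Vert\mathbf{x}\Vert_1$ at a point where $f$ is not differentiable:

```python
import numpy as np

# f(x) = ||x||_1; g = sign(x) is a valid subgradient even where some x_j = 0,
# because any value in [-1, 1] is allowed for those coordinates.
f = lambda x: np.abs(x).sum()
x = np.array([1.0, 0.0, -2.0])   # f is not differentiable here (x_2 = 0)
g = np.sign(x)

rng = np.random.default_rng(0)
for _ in range(1000):
    y = rng.uniform(-5, 5, size=3)
    # subgradient inequality: f(y) >= f(x) + g^T (y - x)
    assert f(y) >= f(x) + g @ (y - x)
```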
For example, consider the function $f(\mathbf{x}) = \Vert \mathbf{x}\Vert_1 = \sum_{i=1}^{d}\vert x_i \vert$ which is convex. Then,
\[\partial f(\mathbf{x}) = \{\mathbf{v} = (v_1, \cdots, v_d)^\top: \forall j = 1, \cdots, d, ~ \left\{\begin{aligned} &v_j = 1, & x_j>0\\ &v_j = -1, & x_j<0\\ &v_j\in [-1,1], &x_j =0 \end{aligned}\right.\}\]
Subgradient Method
To solve a non-smooth convex optimization problem, we can simply use a subgradient $\mathbf{g}$ of the objective function $f$ instead of $\nabla f$, which leads to the subgradient method.
\[\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\cdot \mathbf{g}(\mathbf{x}_t)\]
Pros
- handles general non-differentiable convex problem
- often leads to very simple algorithms
Cons
- convergence can be very slow
- no good stopping criterion
- theoretical complexity: $\mathcal{O}(1/\epsilon^2)$ iterations to find $\epsilon$-optimal solution
Remark. The subgradient method is not a descent method: the negative gradient of a differentiable $f$ is a descent direction, but a negative subgradient is not always one. For example, take $f(x) = \vert x\vert$. Then $g = 0.5$ is a subgradient of $f$ at $x=0$, but $f(0 - \eta \cdot 0.5) = 0.5\eta > f(0)=0$ for any $\eta > 0$.
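A minimal sketch of the subgradient method in Python for $f(\mathbf{x}) = \Vert\mathbf{x}\Vert_1$; the diminishing step-size schedule and the practice of tracking the best iterate (since the method is not a descent method) are standard choices, not prescribed by the lecture:

```python
import numpy as np

def subgradient_method(x0, f, subgrad, eta0=1.0, n_iter=500):
    """Subgradient method with diminishing steps eta0 / sqrt(t+1).
    Tracks the best iterate seen, since f can increase along the way."""
    x = np.asarray(x0, dtype=float).copy()
    x_best, f_best = x.copy(), f(x)
    for t in range(n_iter):
        x = x - (eta0 / np.sqrt(t + 1)) * subgrad(x)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best

f = lambda x: np.abs(x).sum()   # f(x) = ||x||_1, minimized at the origin
# sign(x) is a valid subgradient of the l1 norm
x_best = subgradient_method([3.0, -2.0], f, np.sign)
print(f(x_best))  # close to 0, but convergence is slow
```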
Proximal Gradient Method
Suppose we want to solve the following non-smooth convex optimization problem:
\[\underset{\mathbf{x}\in\Theta}{\min} \quad f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})\]
- $g$ is convex, differentiable, $\Theta \equiv \mathbb{R}^d$
- $h$ is convex, but not differentiable
In this case,
- we do not want to use the subgradient method, because it is too slow
- what if there is a special structure of $h$?
- in statistical learning, there are many examples where we add a non-smooth penalty term.
The proximal gradient method involves two steps, namely forward and backward, as follows:
- (Forward) run gradient descent for $g$: \(\mathbf{y}_{t+1} = \mathbf{x}_t - \eta \nabla g(\mathbf{x}_t).\)
- (Backward) apply the proximal mapping of $h$ to keep $h(\mathbf{x})$ from increasing too much: \(\mathbf{x}_{t+1} = \arg\min_\mathbf{x} \quad \bigg( \eta h(\mathbf{x}) + \frac{1}{2} \Vert\mathbf{x}- \mathbf{y}_{t+1} \Vert_2^2 \bigg) = \text{prox}_{\eta h}(\mathbf{y}_{t+1})\)
We assume that $h$ has a special structure: the proximal mapping (prox-operator) of $h$ is easy to compute!
The proximal mapping (prox-operator) of a convex function $h$ is defined as
\[\text{prox}_h(\mathbf{y}) =\arg\min_\mathbf{x} \quad \bigg( h(\mathbf{x}) + \frac{1}{2} \|\mathbf{x}- \mathbf{y} \|_2^2 \bigg)\]
Examples
- $h(\mathbf{x}) = 0$: $\text{prox}_h(\mathbf{y}) = \mathbf{y}.$
- $h(\mathbf{x}) = \textbf{1}_\mathcal{C}(\mathbf{x})$: $\text{prox}_h$ is the projection onto $\mathcal{C}$. \(\textbf{1}_\mathcal{C}(\mathbf{x})= \begin{cases} \infty & \text{if $\mathbf{x}\notin \mathcal{C}$}\\ 0 & \text{if $\mathbf{x}\in \mathcal{C}$} \end{cases},\quad \text{prox}_h(\mathbf{y}) = \arg\min_{\mathbf{x}\in \mathcal{C}} \| \mathbf{x}-\mathbf{y}\|_2^2 = \text{Proj}_\mathcal{C}(\mathbf{y})\)
- $h(\mathbf{x}) = \Vert \mathbf{x} \Vert_1$: $\text{prox}_{\eta h}$ is the “soft-threshold” (shrinkage) operation, applied coordinate-wise: $\big[\text{prox}_{\eta h}(\mathbf{y})\big]_i = \text{sign}(y_i)\max(\vert y_i \vert - \eta, 0)$.
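Both of these prox-operators are cheap to evaluate. A minimal NumPy sketch of the last two examples (the function names are my own):

```python
import numpy as np

def soft_threshold(y, thr):
    """prox of thr * ||.||_1 at y: shrink each coordinate toward 0 by thr."""
    return np.sign(y) * np.maximum(np.abs(y) - thr, 0.0)

def proj_box(y, lo, hi):
    """prox of the indicator of the box C = [lo, hi]^d: projection = clipping."""
    return np.clip(y, lo, hi)

print(soft_threshold(np.array([3.0, 0.5, -1.2]), 1.0))  # [ 2.   0.  -0.2]
print(proj_box(np.array([2.0, -3.0, 0.5]), -1.0, 1.0))  # [ 1.  -1.   0.5]
```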
It can be shown that the proximal gradient method only requires $\mathcal{O}(1/\epsilon)$ iterations to find an $\epsilon$-optimal solution.
To solve the LASSO problem, a standard approach is ISTA/FISTA (the Iterative Shrinkage-Thresholding Algorithm and its Nesterov-accelerated variant).
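Putting the forward and backward steps together, here is a minimal ISTA sketch for the LASSO objective $\frac{1}{2}\Vert A\mathbf{x}-\mathbf{b}\Vert_2^2 + \lambda\Vert\mathbf{x}\Vert_1$, using step size $\eta = 1/L$ with $L = \Vert A\Vert_2^2$; the synthetic data and iteration count are illustrative choices of my own:

```python
import numpy as np

def ista(A, b, lam, n_iter=200):
    """ISTA for 0.5*||Ax - b||_2^2 + lam*||x||_1 (a minimal sketch)."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, L = Lipschitz const. of grad g
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        y = x - eta * A.T @ (A @ x - b)    # forward: gradient step on g
        x = np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)  # backward: prox of eta*h
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.0, 0.5]              # sparse ground truth
b = A @ x_true
x_hat = ista(A, b, lam=0.1)                # recovers a sparse vector near x_true
```

FISTA adds a Nesterov-style momentum step on top of the same forward-backward update, improving the rate from $\mathcal{O}(1/\epsilon)$ to $\mathcal{O}(1/\sqrt{\epsilon})$ iterations.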