STA 243 Computational Statistics

Discussion 6: Non-Smooth Convex Optimization

TA: Tesi Xiao

Recap: Smooth Convex Optimization

A convex optimization problem is to find some $\mathbf{x}^{\star} \in \Theta$ attaining $\inf\{f(\mathbf{x}) : \mathbf{x} \in \Theta\}$, i.e. \(\begin{aligned} &\min_{\mathbf{x}}\quad f(\mathbf{x})\\ &\text{subject to }\quad \mathbf{x}\in \Theta \end{aligned}\)

  • $f(\mathbf{x})$ is convex on the domain $\Theta$;
  • Unconstrained: $\Theta \equiv \mathbb{R}^{d}$;
  • Constrained: $\Theta\subset \mathbb{R}^{d}$ is bounded, closed, and convex.

So far, the first-order algorithms we have learned in the lectures for solving smooth convex optimization problems, such as (Projected) (Accelerated) Gradient Descent methods and Conditional Gradient methods, require the gradient information $\nabla f(\mathbf{x})$. But what if $f(\mathbf{x})$ is non-smooth and not differentiable at some points?

Subderivative

In mathematics, the subderivative generalizes the derivative to convex functions which are not necessarily differentiable.

Derivative of Convex Functions. If $f: \Theta\subseteq \mathbb{R}\rightarrow \mathbb{R}$ is convex and differentiable at a point $x\in\Theta$ (i.e. the derivative of $f$ at $x$, $f'(x)$, exists), then we have

\[f(y) \geq f(x) + f'(x)(y-x), \quad \forall y\in\Theta.\]

Subderivative of Convex Functions. Let $f: \Theta\subseteq \mathbb{R}\rightarrow \mathbb{R}$ be a convex function defined on the open interval $\Theta$. A subderivative of $f$ at a point $x$ is a real number $c$ such that

\[f(y) \geq f(x) + c(y-x), \quad \forall y \in \Theta.\]

The set of all subderivatives is called the subdifferential of the function $f$ at $x$. One can show that the set of subderivatives at $x$ for a convex function $f$ is a nonempty closed interval $[a,b]$, where

\[a=\lim_{h\rightarrow 0^-} \frac{f(x+h) - f(x)}{h}, \quad b=\lim_{h\rightarrow 0^+} \frac{f(x+h) - f(x)}{h}\]

For example, consider the function $f(x) = \vert x \vert$, which is convex. Then, the subdifferential at the origin is the interval $[-1, 1]$. The subdifferential at any point $x<0$ is the singleton set $\{-1\}$, while the subdifferential at any point $x>0$ is the singleton set $\{1\}$.
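As a quick numerical sanity check (not part of the original derivation), the one-sided difference quotients of $f(x)=\vert x\vert$ at $x=0$ approach $a=-1$ and $b=+1$, matching the interval $[-1,1]$ above; a minimal Python sketch:

```python
# Numerically approximate the one-sided limits a and b for f(x) = |x| at x = 0.
# Illustrative check only: the left quotient tends to -1, the right to +1,
# so the subdifferential at 0 is the interval [-1, 1].
f = abs
x = 0.0
for h in [1e-1, 1e-3, 1e-6]:
    left = (f(x - h) - f(x)) / (-h)   # difference quotient from the left
    right = (f(x + h) - f(x)) / h     # difference quotient from the right
    print(h, left, right)             # left -> -1, right -> +1
```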

Subgradient

The concepts of subderivative and subdifferential can be generalized to $f: \Theta\subseteq\mathbb{R}^d \rightarrow \mathbb{R}$.

If $f:\Theta\subseteq\mathbb{R}^d\rightarrow \mathbb{R}$ is a real-valued convex function defined on a convex open set $\Theta\subseteq\mathbb{R}^d$, a vector $\mathbf{g}=\mathbf{g}(\mathbf{x})\in\mathbb{R}^d$ is called a subgradient of $f$ at a point $\mathbf{x}\in\Theta$ if for any $\mathbf{y}\in\Theta$ one has

\[f(\mathbf{y})\geq f(\mathbf{x})+ \mathbf{g}^\top (\mathbf{y}-\mathbf{x})\]

The set of all subgradients at $\mathbf{x}$ is called the subdifferential at $\mathbf{x}$ and is denoted $\partial f(\mathbf{x})$. One can show that the subdifferential is always a nonempty convex compact set.

For example, consider the function $f(\mathbf{x}) = \Vert \mathbf{x}\Vert_1 = \sum_{i=1}^{d}\vert x_i \vert$ which is convex. Then,

\[\partial f(\mathbf{x}) = \{\mathbf{v} = (v_1, \cdots, v_d)^\top: \forall j = 1, \cdots, d, ~ \left\{\begin{aligned} &v_j = 1, & x_j>0\\ &v_j = -1, & x_j<0\\ &v_j\in [-1,1], &x_j =0 \end{aligned}\right.\}\]
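In practice one often just needs some element of $\partial f(\mathbf{x})$; a convenient choice takes $v_j=0$ whenever $x_j=0$. A minimal NumPy sketch (the function name is ours, for illustration):

```python
import numpy as np

def l1_subgradient(x):
    """Return one valid subgradient of ||x||_1 at x.

    Coordinates with x_j != 0 get sign(x_j); coordinates with x_j = 0 may
    take any value in [-1, 1], and np.sign conveniently returns 0 there.
    """
    return np.sign(x)

# At x = (2, 0, -3), one subgradient is (1, 0, -1).
print(l1_subgradient(np.array([2.0, 0.0, -3.0])))
```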

Subgradient Method

To solve a non-smooth convex optimization problem, we can simply use a subgradient $\mathbf{g}$ of the objective function $f$ instead of $\nabla f$, which leads to the subgradient method.

\[\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\cdot \mathbf{g}(\mathbf{x}_t)\]

Pros

  • handles general non-differentiable convex problems
  • often leads to very simple algorithms

Cons

  • convergence can be very slow
  • no good stopping criterion
  • theoretical complexity: $\mathcal{O}(1/\epsilon^2)$ iterations to find an $\epsilon$-optimal solution

Remark. The subgradient method is not a descent method: the negative gradient of a differentiable $f$ is a descent direction, but a negative subgradient is not always a descent direction. For example, take $f(x) = \vert x\vert$. Then $g = 0.5$ is a subgradient of $f$ at $x=0$, but $f(0 - \eta \cdot 0.5) = 0.5 \eta > f(0)=0$ for any $\eta > 0$.
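To make the update concrete, here is a minimal sketch of the subgradient method on a least-absolute-deviations problem $\min_{\mathbf{x}} \Vert A\mathbf{x}-\mathbf{b}\Vert_1$; the synthetic data, the diminishing step size, and the best-iterate bookkeeping are illustrative choices, not prescribed by the notes.

```python
import numpy as np

# Subgradient method for f(x) = ||A x - b||_1, using g(x) = A^T sign(A x - b),
# which is a valid subgradient of f at x.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
b = A @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

x = np.zeros(5)
eta0 = 0.01
best_x, best_f = x.copy(), np.sum(np.abs(A @ x - b))

for t in range(2000):
    g = A.T @ np.sign(A @ x - b)          # a subgradient of f at x
    x = x - eta0 / np.sqrt(t + 1) * g     # step with diminishing step size
    f_val = np.sum(np.abs(A @ x - b))
    if f_val < best_f:                    # track the best iterate, since the
        best_x, best_f = x.copy(), f_val  # method is not a descent method

print(best_f)
```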

Proximal Gradient Method

Suppose we want to solve the following non-smooth convex optimization problem:

\[\underset{\mathbf{x}\in\Theta}{\min} \quad f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x})\]
  • $g$ is convex, differentiable, $\Theta \equiv \mathbb{R}^d$
  • $h$ is convex, but not differentiable

In this case,

  • we do not want to use the subgradient method, because it is too slow
  • what if there is a special structure of $h$?
  • in statistical learning, there are multiple examples where we add a non-smooth penalty term (e.g., the $\ell_1$ penalty in the LASSO).

The proximal gradient method involves two steps, namely forward and backward, as follows:

  1. (Forward) take a gradient descent step on $g$: \(\mathbf{y}_{t+1} = \mathbf{x}_t - \eta \nabla g(\mathbf{x}_t).\)
  2. (Backward) apply the proximal mapping of $h$ to keep $h(\mathbf{x})$ from increasing too much: \(\mathbf{x}_{t+1} = \arg\min_\mathbf{x} \quad \bigg( \eta h(\mathbf{x}) + \frac{1}{2} \Vert\mathbf{x}- \mathbf{y}_{t+1} \Vert_2^2 \bigg) = \text{prox}_{\eta h}(\mathbf{y}_{t+1})\)

We assume that $h$ has a special structure: the proximal mapping (prox-operator) of $h$ is easy to compute!
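Put together, one pass of the method simply alternates the two steps above. A minimal sketch, assuming we are handed callables `grad_g` (the gradient of $g$) and `prox_h` (the prox-operator of $\eta h$), both hypothetical placeholders for a concrete problem:

```python
import numpy as np

def proximal_gradient(x0, grad_g, prox_h, eta, n_iters=100):
    """Forward-backward iteration: gradient step on g, then prox step on h.

    grad_g(x) should return the gradient of the smooth part g at x;
    prox_h(y, eta) should return prox_{eta * h}(y).
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        y = x - eta * grad_g(x)   # forward: gradient descent step on g
        x = prox_h(y, eta)        # backward: proximal mapping of eta * h
    return x
```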

The proximal mapping (prox-operator) of a convex function $h$ is defined as

\[\text{prox}_h(\mathbf{y}) =\arg\min_\mathbf{x} \quad \bigg( h(\mathbf{x}) + \frac{1}{2} \|\mathbf{x}- \mathbf{y} \|_2^2 \bigg)\]

Examples

  • $h(\mathbf{x}) = 0$: $\text{prox}_h(\mathbf{y}) = \mathbf{y}.$
  • $h(\mathbf{x}) = \textbf{1}_\mathcal{C}(\mathbf{x})$, the indicator function of a closed convex set $\mathcal{C}$: $\text{prox}_h$ is the projection onto $\mathcal{C}$. \(\textbf{1}_\mathcal{C}(\mathbf{x})= \begin{cases} \infty & \text{if $\mathbf{x}\notin \mathcal{C}$}\\ 0 & \text{if $\mathbf{x}\in \mathcal{C}$} \end{cases},\quad \text{prox}_h(\mathbf{y}) = \arg\min_{\mathbf{x}\in \mathcal{C}} \| \mathbf{x}-\mathbf{y}\|_2^2 = \text{Proj}_\mathcal{C}(\mathbf{y})\)
  • $h(\mathbf{x}) = \Vert \mathbf{x} \Vert_1$: $\text{prox}_h$ is the “soft-threshold” (shrinkage) operation
\[\text{prox}_h(\mathbf{y})_i= \begin{cases} y_i -1 & \text{if $y_i> 1$}\\ 0 & \text{if $|y_i|\leq 1$}\\ y_i +1 & \text{if $y_i< -1$} \end{cases}\]
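A one-line NumPy version of this operator, written for a general threshold $t$ so that $t=1$ recovers the three-case formula above (the vectorized packaging is our own):

```python
import numpy as np

def soft_threshold(y, t=1.0):
    """Coordinate-wise soft-thresholding, i.e. prox of t * ||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

print(soft_threshold(np.array([2.5, 0.3, -1.7])))  # -> [ 1.5  0.  -0.7]
```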

It can be shown that the proximal gradient method only requires $\mathcal{O}(1/\epsilon)$ iterations to find an $\epsilon$-optimal solution.

To solve the LASSO problem, a standard approach is ISTA/FISTA (the Iterative Shrinkage-Thresholding Algorithm and its Nesterov-accelerated variant).
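For completeness, here is a minimal ISTA sketch for the LASSO objective $\tfrac{1}{2}\Vert A\mathbf{x}-\mathbf{b}\Vert_2^2 + \lambda\Vert\mathbf{x}\Vert_1$, i.e. proximal gradient with the least-squares term as $g$ and $\lambda\Vert\cdot\Vert_1$ as $h$; the synthetic data, $\lambda$, and the step size $\eta = 1/\Vert A\Vert_2^2$ are illustrative choices.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iters=500):
    eta = 1.0 / np.linalg.norm(A, 2) ** 2      # 1/L, L = Lipschitz constant of grad g
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        y = x - eta * A.T @ (A @ x - b)        # forward: gradient step on g
        x = soft_threshold(y, eta * lam)       # backward: prox of eta * lam * ||.||_1
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
x_true = np.zeros(20)
x_true[:3] = [3.0, -2.0, 1.5]                  # sparse ground truth
b = A @ x_true + 0.1 * rng.normal(size=100)
print(np.round(ista(A, b, lam=1.0), 2))
```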