STA 243 Computational Statistics

Discussion 6: Non-Smooth Convex Optimization

TA: Tesi Xiao

Recap: Smooth Convex Optimization

A convex optimization problem is to find some $x^\star \in \Theta$ attaining $\inf\{f(x) : x \in \Theta\}$, i.e.,
$$\min_x f(x) \quad \text{subject to } x \in \Theta$$

  • $f(x)$ is convex on the domain $\Theta$;
  • Unconstrained: $\Theta = \mathbb{R}^d$;
  • Constrained: $\Theta \subset \mathbb{R}^d$ is bounded, closed, and convex.

So far, the first-order algorithms we have learned in the lectures for solving smooth convex optimization problems, such as (Projected) (Accelerated) Gradient Descent methods and Conditional Gradient methods, require the gradient information $\nabla f(x)$. But what if $f(x)$ is non-smooth and not differentiable at some points?

Subderivative

In mathematics, the subderivative generalizes the derivative to convex functions which are not necessarily differentiable.

Derivative of Convex Functions. If $f: \Theta \subseteq \mathbb{R} \to \mathbb{R}$ is convex and differentiable at a point $x \in \Theta$ (i.e., the derivative of $f$ at $x$, $f'(x)$, exists), then we have

$$f(y) \ge f(x) + f'(x)(y - x), \quad \forall y \in \Theta.$$

Subderivative of Convex Functions. Let $f: \Theta \subseteq \mathbb{R} \to \mathbb{R}$ be a convex function defined on the open interval $\Theta$. A subderivative of $f$ at a point $x$ is a real number $c$ such that

$$f(y) \ge f(x) + c(y - x), \quad \forall y \in \Theta.$$

The set of all subderivatives is called the subdifferential of the function $f$ at $x$. One can show that the set of subderivatives at $x$ for a convex function $f$ is a nonempty closed interval $[a, b]$, where

$$a = \lim_{h \to 0^-} \frac{f(x+h) - f(x)}{h}, \qquad b = \lim_{h \to 0^+} \frac{f(x+h) - f(x)}{h}.$$

For example, consider the function $f(x) = |x|$, which is convex. The subdifferential at the origin is the interval $[-1, 1]$. The subdifferential at any point $x < 0$ is the singleton set $\{-1\}$, while the subdifferential at any point $x > 0$ is the singleton set $\{1\}$.

Subgradient

The concepts of subderivative and subdifferential can be generalized to functions $f: \Theta \subseteq \mathbb{R}^d \to \mathbb{R}$.

If $f: \Theta \subseteq \mathbb{R}^d \to \mathbb{R}$ is a real-valued convex function defined on a convex open set $\Theta \subseteq \mathbb{R}^d$, a vector $g = g(x) \in \mathbb{R}^d$ is called a subgradient of $f$ at a point $x \in \Theta$ if for any $y \in \Theta$ one has

$$f(y) \ge f(x) + g^\top (y - x).$$

The set of all subgradients at $x$ is called the subdifferential at $x$ and is denoted $\partial f(x)$. One can show that the subdifferential is always a nonempty convex compact set.

For example, consider the function $f(x) = \|x\|_1 = \sum_{i=1}^d |x_i|$, which is convex. Then,

$$\partial f(x) = \left\{ v = (v_1, \dots, v_d) : \forall j = 1, \dots, d,\ \begin{cases} v_j = 1, & x_j > 0 \\ v_j = -1, & x_j < 0 \\ v_j \in [-1, 1], & x_j = 0 \end{cases} \right\}$$
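
As a quick sanity check in code, the formula above gives a one-line routine that returns one valid element of $\partial\|x\|_1$; the function name `l1_subgradient` and the choice $v_j = 0$ at $x_j = 0$ are just for illustration.

```python
import numpy as np

def l1_subgradient(x):
    """Return one element of the subdifferential of ||x||_1 at x:
    sign(x_j) on the nonzero coordinates, and 0 (a valid choice in
    [-1, 1]) on the zero coordinates."""
    return np.sign(np.asarray(x, dtype=float))
```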

Subgradient Method

To solve a non-smooth convex optimization problem, we can simply use a subgradient $g$ of the objective function $f$ instead of $\nabla f$, which leads to the subgradient method:

$$x_{t+1} = x_t - \eta\, g(x_t)$$

Pros

  • handles general non-differentiable convex problems
  • often leads to very simple algorithms

Cons

  • convergence can be very slow
  • no good stopping criterion
  • theoretical complexity: $O(1/\epsilon^2)$ iterations to find an $\epsilon$-optimal solution

Remark. The subgradient method is not a descent method: the negative gradient of a differentiable $f$ is a descent direction, but a negative subgradient is not always a descent direction. For example, take $f(x) = |x|$: $g = 0.5$ is a subgradient of $f$ at $x = 0$, but $f(0 - \eta \cdot 0.5) = 0.5\eta > f(0) = 0$.
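
Below is a minimal sketch of the subgradient method applied to a LASSO-type objective $\tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$. The objective, the diminishing step size $\eta_t = \eta_0/\sqrt{t+1}$, and the synthetic data are illustrative choices, not prescribed by these notes; since the method is not a descent method, the code tracks the best iterate seen so far.

```python
import numpy as np

def subgradient_method(A, b, lam=0.1, eta0=0.01, n_iters=5000):
    """Subgradient method for f(x) = 0.5*||A x - b||_2^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    best_x, best_f = x.copy(), np.inf
    for t in range(n_iters):
        # One valid subgradient: A^T (A x - b) + lam * sign(x)
        # (sign(0) = 0 is an admissible choice on the zero coordinates).
        g = A.T @ (A @ x - b) + lam * np.sign(x)
        # Diminishing step size eta_t = eta0 / sqrt(t + 1).
        x = x - eta0 / np.sqrt(t + 1) * g
        # Not a descent method, so keep the best iterate found so far.
        f_val = 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
        if f_val < best_f:
            best_x, best_f = x.copy(), f_val
    return best_x, best_f

# Usage on synthetic data (purely illustrative).
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
x_true = rng.standard_normal(20) * (rng.random(20) < 0.3)
b = A @ x_true + 0.1 * rng.standard_normal(100)
x_hat, f_hat = subgradient_method(A, b)
```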

Proximal Gradient Method **

Suppose we want to solve the following non-smooth convex optimization problem:

$$\min_{x \in \Theta} f(x) = g(x) + h(x)$$
  • $g$ is convex and differentiable, with $\Theta = \mathbb{R}^d$;
  • $h$ is convex, but not differentiable.

In this case,

  • we do not want to use the subgradient method, because it is too slow
  • what if there is a special structure of h?
  • in statistical learning, there are many examples where we add a non-smooth penalty term (e.g., the $\ell_1$ penalty in the LASSO).

The proximal gradient method involves two steps, namely forward and backward, as follows:

  1. (Forward) Run a gradient descent step on $g$: $y_{t+1} = x_t - \eta \nabla g(x_t)$.
  2. (Backward) Apply the proximal mapping of $h$ to control the increase of $h(x)$: $$x_{t+1} = \operatorname*{arg\,min}_x \left( \eta h(x) + \tfrac{1}{2}\|x - y_{t+1}\|_2^2 \right) = \operatorname{prox}_{\eta h}(y_{t+1})$$

We assume that h has a special structure: the proximal mapping (prox-operator) of h is easy to compute!

The proximal mapping (prox-operator) of a convex function h is defined as

$$\operatorname{prox}_h(y) = \operatorname*{arg\,min}_x \left( h(x) + \tfrac{1}{2}\|x - y\|_2^2 \right)$$
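
The forward and backward steps above translate directly into a short generic loop. The sketch below assumes the user supplies callables `grad_g` and `prox_h` (hypothetical names) and uses a fixed step size $\eta$.

```python
import numpy as np

def proximal_gradient(grad_g, prox_h, x0, eta=0.1, n_iters=500):
    """Forward-backward iteration x_{t+1} = prox_{eta*h}(x_t - eta*grad_g(x_t)).

    grad_g(x)      -- gradient of the smooth part g
    prox_h(y, eta) -- returns argmin_x eta*h(x) + 0.5*||x - y||_2^2
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        y = x - eta * grad_g(x)   # forward: gradient step on g
        x = prox_h(y, eta)        # backward: proximal step on h
    return x
```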

Examples

  • $h(x) = 0$: $\operatorname{prox}_h(y) = y$.
  • $h(x) = 1_C(x)$ (the indicator function of $C$): $\operatorname{prox}_h$ is the projection onto $C$. $$1_C(x) = \begin{cases} \infty & \text{if } x \notin C \\ 0 & \text{if } x \in C \end{cases}, \qquad \operatorname{prox}_h(y) = \operatorname*{arg\,min}_{x \in C} \|x - y\|_2^2 = \operatorname{Proj}_C(y)$$
  • $h(x) = \|x\|_1$: $\operatorname{prox}_h$ is the “soft-threshold” (shrinkage) operation (see the sketch below):
$$\operatorname{prox}_h(y)_i = \begin{cases} y_i - 1 & \text{if } y_i > 1 \\ 0 & \text{if } |y_i| \le 1 \\ y_i + 1 & \text{if } y_i < -1 \end{cases}$$
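
The three prox-operators above are straightforward to implement. In the sketch below, the nonnegative orthant is used as one concrete choice of $C$ for the indicator example, and with $\eta = 1$ the soft-thresholding routine reduces to the displayed formula.

```python
import numpy as np

def prox_zero(y, eta=1.0):
    """h(x) = 0: the prox-operator is the identity map."""
    return y

def prox_indicator_nonneg(y, eta=1.0):
    """h(x) = 1_C(x) with C = {x : x >= 0} (one concrete choice of C):
    the prox-operator is the Euclidean projection onto C."""
    return np.maximum(y, 0.0)

def prox_l1(y, eta=1.0):
    """h(x) = ||x||_1: the prox-operator of eta*h soft-thresholds each
    coordinate at level eta (eta = 1 recovers the displayed formula)."""
    return np.sign(y) * np.maximum(np.abs(y) - eta, 0.0)
```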

It can be shown that the proximal gradient method only requires $O(1/\epsilon)$ iterations to find an $\epsilon$-optimal solution.

To solve the LASSO problem, a standard approach is ISTA/FISTA (the Iterative Shrinkage-Thresholding Algorithm and its Nesterov-accelerated variant).
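
A minimal sketch of both algorithms for the LASSO objective $\tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$ is given below. The step size $1/L$ with $L = \|A\|_2^2$ (the Lipschitz constant of the gradient of the smooth part) is the usual choice here; the fixed iteration counts are purely illustrative.

```python
import numpy as np

def soft_threshold(y, tau):
    """Prox of tau * ||.||_1: coordinate-wise soft-thresholding at level tau."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def ista(A, b, lam, n_iters=500):
    """ISTA for the LASSO: min_x 0.5 * ||A x - b||_2^2 + lam * ||x||_1."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/L with L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        # forward (gradient) step on the smooth part, then prox of lam*||.||_1
        x = soft_threshold(x - eta * A.T @ (A @ x - b), eta * lam)
    return x

def fista(A, b, lam, n_iters=500):
    """FISTA: ISTA with Nesterov momentum (extrapolation point z)."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    x = z = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(n_iters):
        x_new = soft_threshold(z - eta * A.T @ (A @ z - b), eta * lam)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = x_new + (t - 1.0) / t_new * (x_new - x)   # momentum/extrapolation step
        x, t = x_new, t_new
    return x
```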