STA 243 Computational Statistics
Discussion 6: Non-Smooth Convex Optimization
TA: Tesi Xiao
Recap: Smooth Convex Optimization
A convex optimization problem is to find some $x^\star \in \Theta$ attaining $\inf\{f(x) : x \in \Theta\}$, i.e.,
$$\min_{x} f(x) \quad \text{subject to } x \in \Theta$$
- $f(x)$ is convex on the domain $\Theta$;
- Unconstrained: $\Theta \equiv \mathbb{R}^d$;
- Constrained: $\Theta \subset \mathbb{R}^d$ is bounded, closed, and convex.
So far, the first-order algorithms we have learned in the lectures for solving smooth convex optimization problems, such as (Projected) (Accelerated) Gradient Descent and Conditional Gradient methods, require the gradient information $\nabla f(x)$. But what if $f(x)$ is non-smooth and not differentiable at some points?
Subderivative
In mathematics, the subderivative generalizes the derivative to convex functions which are not necessarily differentiable.
Derivative of Convex Functions. If $f: \Theta \subseteq \mathbb{R} \to \mathbb{R}$ is convex and differentiable at a point $x \in \Theta$ (i.e. the derivative of $f$ at $x$, $f'(x)$, exists), then we have
$$f(y) \ge f(x) + f'(x)(y - x), \quad \forall y \in \Theta.$$
Subderivative of Convex Functions. Let $f: \Theta \subseteq \mathbb{R} \to \mathbb{R}$ be a convex function defined on the open interval $\Theta$. A subderivative of $f$ at a point $x$ is a real number $c$ such that
$$f(y) \ge f(x) + c(y - x), \quad \forall y \in \Theta.$$
The set of all subderivatives is called the subdifferential of the function $f$ at $x$. One can show that the set of subderivatives at $x$ for a convex function $f$ is a nonempty closed interval $[a, b]$, where
$$a = \lim_{h \to 0^-} \frac{f(x+h) - f(x)}{h}, \qquad b = \lim_{h \to 0^+} \frac{f(x+h) - f(x)}{h}.$$
For example, consider the function $f(x) = |x|$, which is convex. Then, the subdifferential at the origin is the interval $[-1, 1]$. The subdifferential at any point $x < 0$ is the singleton set $\{-1\}$, while the subdifferential at any point $x > 0$ is the singleton set $\{1\}$.
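As a quick numerical illustration (a small sketch added here, not part of the lecture material), the one-sided difference quotients of $f(x) = |x|$ at $x = 0$ equal $-1$ from the left and $+1$ from the right, recovering $[a, b] = [-1, 1]$:

```python
# One-sided difference quotients of f(x) = |x| at x = 0.
# The left quotient is -1 and the right quotient is +1 for every h > 0,
# so the subdifferential at the origin is the interval [-1, 1].
f = lambda x: abs(x)
x0 = 0.0
for h in [1e-1, 1e-3, 1e-6]:
    left = (f(x0 - h) - f(x0)) / (-h)    # increment -h, i.e. h -> 0^-
    right = (f(x0 + h) - f(x0)) / h      # increment +h, i.e. h -> 0^+
    print(f"h = {h:.0e}:  left = {left:+.1f},  right = {right:+.1f}")
```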
Subgradient
The concepts of subderivative and subdifferential can be generalized to $f: \Theta \subseteq \mathbb{R}^d \to \mathbb{R}$.

If $f: \Theta \subseteq \mathbb{R}^d \to \mathbb{R}$ is a real-valued convex function defined on a convex open set $\Theta \subseteq \mathbb{R}^d$, a vector $g = g(x) \in \mathbb{R}^d$ in that space is called a subgradient of $f$ at a point $x \in \Theta$ if for any $y \in \Theta$ one has
$$f(y) \ge f(x) + g^\top (y - x).$$
The set of all subgradients at $x$ is called the subdifferential at $x$ and is denoted $\partial f(x)$. One can show that the subdifferential is always a nonempty convex compact set.
For example, consider the function $f(x) = \|x\|_1 = \sum_{i=1}^d |x_i|$, which is convex. Then,
$$\partial f(x) = \left\{ v = (v_1, \cdots, v_d)^\top : \forall j = 1, \cdots, d, \;\; \begin{cases} v_j = 1, & x_j > 0 \\ v_j = -1, & x_j < 0 \\ v_j \in [-1, 1], & x_j = 0 \end{cases} \right\}$$
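A subgradient of the $\ell_1$ norm can be read off coordinate-wise from this description. Below is a minimal NumPy sketch (the helper name `l1_subgradient` is ours, for illustration only) that returns one valid element of $\partial f(x)$:

```python
import numpy as np

def l1_subgradient(x, rng=None):
    """Return one element of the subdifferential of ||x||_1 at x.

    For coordinates with x_j != 0 the only choice is sign(x_j);
    for x_j == 0 any value in [-1, 1] is valid (we use 0 by default,
    or a random draw from [-1, 1] if a generator is supplied).
    """
    g = np.sign(x).astype(float)
    if rng is not None:
        zero = (x == 0)
        g[zero] = rng.uniform(-1.0, 1.0, size=zero.sum())
    return g

# Example: x = (2, 0, -3) -> one valid subgradient is (1, 0, -1)
print(l1_subgradient(np.array([2.0, 0.0, -3.0])))
```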
Subgradient Method

To solve a non-smooth convex optimization problem, we can simply use a subgradient $g$ of the objective function $f$ instead of $\nabla f$, which leads to the subgradient method.
$$x_{t+1} = x_t - \eta \cdot g(x_t)$$

Pros
- handles general non-differentiable convex problem
- often leads to very simple algorithms
Cons
- convergence can be very slow
- no good stopping criterion
- theoretical complexity: $O(1/\epsilon^2)$ iterations to find an $\epsilon$-optimal solution
Remark. The subgradient method is not a descent method: the negative gradient of a differentiable $f$ is a descent direction, but a negative subgradient is not always a descent direction. For example, take $f(x) = |x|$: $g = 0.5$ is a subgradient of $f$ at $x = 0$, but $f(0 - \eta \cdot 0.5) = 0.5\eta > f(0) = 0$ for any $\eta > 0$.
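To make the update concrete, here is a minimal sketch of the subgradient method with a decaying step size $\eta_t = \eta_0/\sqrt{t}$ applied to the non-smooth least-absolute-deviations objective $f(x) = \|Ax - b\|_1$; the problem instance, step size, and iteration count are illustrative choices, not from the notes. Since the method is not a descent method, we track the best iterate seen so far.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
x_true = rng.normal(size=5)
b = A @ x_true                                # noiseless data for simplicity

f = lambda x: np.abs(A @ x - b).sum()         # non-smooth convex objective

x = np.zeros(5)
x_best, f_best = x.copy(), f(x)
eta0 = 0.01
for t in range(1, 5001):
    g = A.T @ np.sign(A @ x - b)              # a subgradient of f at x
    x = x - eta0 / np.sqrt(t) * g             # subgradient step (not necessarily descent)
    if f(x) < f_best:                         # keep the best iterate seen so far
        x_best, f_best = x.copy(), f(x)

# f_best decreases slowly toward the optimal value, which is 0 here
print(f"f(0) = {f(np.zeros(5)):.2f},  f(x_best) = {f_best:.4f}")
```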
Proximal Gradient Method **
Suppose we want to solve the following non-smooth convex optimization problem:
$$\min_{x \in \Theta} f(x) = g(x) + h(x)$$

- $g$ is convex, differentiable, $\Theta \equiv \mathbb{R}^d$
- h is convex, but not differentiable
In this case,
- we do not want to use the subgradient method, because it is too slow
- what if $h$ has a special structure?
- in statistical learning, there are multiple examples where we add a non-smooth penalty term (e.g. the $\ell_1$ penalty in the LASSO).
The proximal gradient method involves two steps, namely forward and backward, as follows (a code sketch of the resulting iteration is given after the list):
- (Forward) run gradient descent for $g$: $y_{t+1} = x_t - \eta \nabla g(x_t)$.
- (Backward) apply the proximal mapping of $h$ to control the increase of $h(x)$: $$x_{t+1} = \arg\min_x \Big( \eta\, h(x) + \tfrac{1}{2}\|x - y_{t+1}\|_2^2 \Big) = \mathrm{prox}_{\eta h}(y_{t+1})$$
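A minimal sketch of this forward-backward iteration, assuming the gradient of $g$ and the prox-operator are supplied as callables (the function name `proximal_gradient` and the interface `prox(y, eta)` are our own illustrative choices):

```python
import numpy as np

def proximal_gradient(x0, grad_g, prox, eta, n_iter=100):
    """Forward-backward iteration x_{t+1} = prox_{eta*h}(x_t - eta*grad_g(x_t)).

    grad_g(x)    : gradient of the smooth part g
    prox(y, eta) : proximal mapping of eta*h evaluated at y
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        y = x - eta * grad_g(x)   # forward: gradient step on g
        x = prox(y, eta)          # backward: proximal step on h
    return x

# Sanity check with h = 0 (prox is the identity): reduces to plain gradient descent
# on g(x) = ||x - 1||^2, so the iterates approach (1, 1, 1).
print(proximal_gradient(np.zeros(3),
                        grad_g=lambda x: 2 * (x - 1.0),
                        prox=lambda y, eta: y,
                        eta=0.1))
```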
We assume that h has a special structure: the proximal mapping (prox-operator) of h is easy to compute!
The proximal mapping (prox-operator) of a convex function h is defined as
$$\mathrm{prox}_h(y) = \arg\min_x \Big( h(x) + \tfrac{1}{2}\|x - y\|_2^2 \Big)$$

Examples
- $h(x) = 0$: $\mathrm{prox}_h(y) = y$.
- $h(x) = \mathbb{1}_C(x)$: $\mathrm{prox}_h$ is the projection onto $C$. $$\mathbb{1}_C(x) = \begin{cases} \infty & \text{if } x \notin C \\ 0 & \text{if } x \in C \end{cases}, \qquad \mathrm{prox}_h(y) = \arg\min_{x \in C} \|x - y\|_2^2 = \mathrm{Proj}_C(y)$$
- $h(x) = \|x\|_1$: $\mathrm{prox}_h$ is the “soft-threshold” (shrinkage) operation (see the sketch below).
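For $h(x) = \|x\|_1$, the prox of $\eta h$ has the closed form $\mathrm{prox}_{\eta h}(y)_j = \mathrm{sign}(y_j)\max(|y_j| - \eta, 0)$, applied coordinate-wise. A minimal sketch of this soft-threshold operator:

```python
import numpy as np

def soft_threshold(y, tau):
    """Proximal mapping of tau * ||.||_1: shrink each coordinate of y toward 0 by tau."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

# Coordinates with |y_j| <= tau are set exactly to 0 (hence "shrinkage" and sparsity).
print(soft_threshold(np.array([3.0, 0.4, -1.5]), tau=0.5))   # -> [ 2.5  0.  -1. ]
```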
It can be shown that the proximal gradient method requires only $O(1/\epsilon)$ iterations to find an $\epsilon$-optimal solution.
To solve the LASSO problem, the method of choice is ISTA/FISTA (the Iterative Shrinkage-Thresholding Algorithm and its Nesterov-accelerated variant).
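As an illustration (a hedged sketch under our own choices of data, $\lambda$, and iteration count, not the implementation from the lectures), ISTA for the LASSO objective $\tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$ is exactly the proximal gradient iteration with $g(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ and the soft-threshold prox:

```python
import numpy as np

def ista(A, b, lam, n_iter=500):
    """ISTA for min_x 0.5*||Ax - b||_2^2 + lam*||x||_1, with step size 1/L."""
    L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of grad g: ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        y = x - A.T @ (A @ x - b) / L           # forward: gradient step on g
        x = np.sign(y) * np.maximum(np.abs(y) - lam / L, 0.0)   # backward: soft-threshold
    return x

# Small synthetic example with a sparse ground truth (illustrative only).
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true + 0.1 * rng.normal(size=100)
print(np.round(ista(A, b, lam=10.0), 2))        # large entries near 2, -1, 0.5; the rest shrunk to (near) 0
```

FISTA keeps the same backward (soft-threshold) step but takes the forward step from an extrapolated point $x_t + \frac{t-1}{t+2}(x_t - x_{t-1})$, which improves the $O(1/t)$ rate of ISTA to $O(1/t^2)$.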