STA 243 Computational Statistics

Discussion 8: Non-convex Optimization

TA: Tesi Xiao

(General) Non-convex Optimization

In general, non-convex optimization is hard (NP-hard in the worst case) due to

  • potentially many local minima
  • saddle points
  • very flat regions
  • widely varying curvature

However, all the algorithms we discussed for convex optimization can still be run when the objective function is non-convex. The theoretical guarantees we can provide for those algorithms are not as strong as in the convex scenario. There is a variety of theoretical convergence results in the literature for algorithms solving different types of non-convex problems.

  • Convergence to a stationary point
    • first-order stationary point: $\Vert\nabla f(x)\Vert\leq \epsilon$
    • second-order stationary point: $\Vert \nabla f(x) \Vert \leq \epsilon_1, ~\lambda_{\min}(\nabla^2 f(x))\geq -\epsilon_2$
  • Convergence to a local minimum
  • Local convergence to the global minimum
    • requires a good initialization in the neighborhood of the global minimum.
  • Global convergence to the global minimum
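
To make the two stationarity notions concrete, here is a minimal numpy sketch that checks both conditions; the toy function $f(x, y) = x^2 - y^2$ (a saddle at the origin) and the tolerances are illustrative choices, not from the lecture:

```python
import numpy as np

# Toy objective with a saddle point at the origin: f(x, y) = x^2 - y^2.
def grad(x):
    return np.array([2 * x[0], -2 * x[1]])

def hessian(x):
    return np.array([[2.0, 0.0], [0.0, -2.0]])

def is_first_order_stationary(x, eps=1e-6):
    # ||grad f(x)|| <= eps
    return np.linalg.norm(grad(x)) <= eps

def is_second_order_stationary(x, eps1=1e-6, eps2=1e-3):
    # ||grad f(x)|| <= eps1  and  lambda_min(hessian f(x)) >= -eps2
    lam_min = np.linalg.eigvalsh(hessian(x)).min()
    return np.linalg.norm(grad(x)) <= eps1 and lam_min >= -eps2

origin = np.zeros(2)
print(is_first_order_stationary(origin))   # gradient vanishes at the saddle
print(is_second_order_stationary(origin))  # but lambda_min = -2 < 0
```

The origin passes the first-order test but fails the second-order one, which is exactly why first-order guarantees alone cannot rule out saddle points.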

Nonconvex SGD

In particular, SGD can also be used to optimize non-convex functions. In place of the convergence results discussed in the lectures, SGD admits a weaker guarantee in the non-convex regime: convergence to a first-order stationary point in expectation.

\[x_{t+1} = x_t - \eta_t \nabla F(x_t; \xi_{t+1})\]

Assumption Let $\mathcal{F}_t = \sigma (\xi_1, \dots, \xi_t)$ denote the filtration generated by the first $t$ samples.

  • $f$ is $L$-smooth.
  • $\mathbb{E}[\nabla F(x_t; \xi_{t+1})\mid\mathcal{F}_t] = \nabla f(x_t)$
  • $\mathbb{E}[\Vert \nabla F(x_t; \xi_{t+1}) \Vert^2 \mid \mathcal{F}_t]\leq M^2$

Theorem Let $\{x_t\}~(t=0, 1, \dots, T-1)$ be generated by SGD with constant step size $\eta_t \equiv \eta = \frac{c}{\sqrt{T}}$ for some $c>0$, and let $R$ be a random index drawn uniformly from $\{0, 1, \dots, T-1\}$. Then

\[\mathbb{E} [\|\nabla f(x_R)\|^2] \leq \frac{f(x_0) - f^\star}{c\sqrt{T}} + \frac{c LM^2}{2\sqrt{T}} = \mathcal{O}(T^{-1/2})\]

Proof. By the smoothness of $f$, we have

\[\begin{aligned} f(x_{t+1}) &\leq f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\|x_{t+1} - x_{t}\|^2\\ &= f(x_t) + \langle \nabla f(x_t) , - \eta_t\nabla F(x_t; \xi_{t+1}) \rangle + \frac{L\eta_t^2}{2} \| \nabla F(x_t;\xi_{t+1}) \|^2 \end{aligned}\]

Taking expectation on both sides and using the unbiasedness and bounded second moment of the stochastic gradient, we get: \(\begin{aligned} \mathbb{E}[f(x_{t+1}) - f(x_t)] &\leq -\eta_t \mathbb{E}[\| \nabla f(x_t) \|^2] + \frac{\eta_t^2 LM^2}{2}\\ \mathbb{E}[\| \nabla f(x_t) \|^2]&\leq \frac{\mathbb{E}[f(x_t) - f(x_{t+1})]}{\eta_t} + \frac{\eta_t LM^2}{2}. \end{aligned}\)

Telescoping the above inequality with $\eta_t \equiv \eta$, we further have

\[\sum_{t=0}^{T-1} \mathbb{E} [ \| \nabla f(x_t) \|^2] \leq \frac{f(x_0) - f^\star}{\eta} + \frac{T\eta LM^2}{2}.\]

Dividing both sides by $T$, letting $\eta = \frac{c}{\sqrt{T}}$, and recalling that $R$ is drawn uniformly from $\{0, 1, \dots, T-1\}$ (so that $\mathbb{E}[\|\nabla f(x_R)\|^2] = \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla f(x_t)\|^2]$), we have

\[\mathbb{E}[\|\nabla f(x_R)\|^2] \leq \frac{f(x_0) - f^\star}{c\sqrt{T}} + \frac{c LM^2}{2\sqrt{T}} = \mathcal{O}(T^{-1/2})\]

Remark

  • Non-convex SGD converges in the sense of reaching points where the expected gradient norm is arbitrarily small.
  • This does not mean it reaches a local minimum or finds the global optimum.
    • It can converge to a saddle point or a local maximum.
    • It can get stuck in a region of very flat but nonzero gradients.
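
The guarantee above can be illustrated numerically. Below is a hedged sketch of SGD on the non-convex function $f(x) = x^2 + 3\sin^2(x)$ with artificial Gaussian gradient noise; the function, noise level, and constant $c$ are illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-convex objective f(x) = x^2 + 3 sin^2(x); its true gradient:
def grad_f(x):
    return 2 * x + 3 * np.sin(2 * x)

T = 20000
c = 0.5
eta = c / np.sqrt(T)          # constant step size eta = c / sqrt(T)
x = 3.0                       # arbitrary initialization
grad_norms = []
for t in range(T):
    # Unbiased stochastic gradient with bounded noise variance
    g = grad_f(x) + rng.normal(scale=0.5)
    grad_norms.append(abs(grad_f(x)))
    x = x - eta * g

# Average squared gradient norm over the iterates should be small,
# consistent with the O(T^{-1/2}) bound on a uniformly sampled iterate.
avg_sq = np.mean(np.square(grad_norms))
print(avg_sq)
```

Sampling a random iterate $x_R$ corresponds to averaging over the trajectory, which is why the quantity printed above (rather than the last iterate alone) is what the theorem controls.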

Special Non-convex Optimization

As pointed out at the very beginning, stronger convergence results can be established for certain classes of non-convex optimization problems. One straightforward extension is global convergence of SGD when the objective function $f$ satisfies the Polyak-Łojasiewicz (PL) condition:

\[\frac{1}{2}\|\nabla f(x)\|^2 \geq \mu (f(x) - f^\star).\]
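
For intuition, $f(x) = x^2 + 3\sin^2(x)$ is a commonly used example of a non-convex function satisfying the PL condition, with global minimum $f^\star = 0$ at $x = 0$. The sketch below runs plain gradient descent on it; the step size and iteration count are illustrative choices:

```python
import numpy as np

# Non-convex PL function f(x) = x^2 + 3 sin^2(x), global minimum f* = 0 at x = 0.
def f(x):
    return x ** 2 + 3 * np.sin(x) ** 2

def grad_f(x):
    return 2 * x + 3 * np.sin(2 * x)

x = 2.5          # arbitrary initialization
eta = 0.1        # f is L-smooth with L = 8 here, so eta < 2/L guarantees descent
gaps = []        # optimality gaps f(x_t) - f*
for _ in range(50):
    gaps.append(f(x))
    x -= eta * grad_f(x)

# Under the PL condition, gradient descent drives f(x_t) - f* to zero
# at a linear (geometric) rate despite the non-convexity.
print(gaps[-1])
```
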

In addition, there is one familiar case where we can show global convergence: PCA.

PCA: Power Method/GD

Finding the dominant eigenvalue-eigenvector pair of a positive semidefinite symmetric matrix $A$ can be written as

\[\mathbf{v}_1 = \underset{\mathbf{v}\in\mathbb{R}^d, \|\mathbf{v}\|_2 = 1}{\arg\max} \mathbf{v}^\top \mathbf{A} \mathbf{v},\]

which aims to maximize a convex function over the unit sphere (a non-convex constraint set). However, the power method and the gradient descent method we learned during the lectures are guaranteed to converge when $\vert\lambda_1\vert > \vert\lambda_2\vert$.
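
The power method can be sketched in a few lines of numpy; the random PSD test matrix, seed, and iteration count below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive semidefinite matrix A = B B^T.
B = rng.normal(size=(5, 5))
A = B @ B.T

# Power method: v_{k+1} = A v_k / ||A v_k||.
v = rng.normal(size=5)
v /= np.linalg.norm(v)
for _ in range(500):
    v = A @ v
    v /= np.linalg.norm(v)

# Rayleigh quotient recovers the dominant eigenvalue.
lam = v @ A @ v
lam_true = np.linalg.eigvalsh(A)[-1]
print(lam, lam_true)
```

Convergence is geometric at rate $\vert\lambda_2/\lambda_1\vert$, which is why the eigengap assumption $\vert\lambda_1\vert > \vert\lambda_2\vert$ is needed.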

Mixture Models: EM Algorithm

The expectation-maximization (EM) algorithm is an iterative algorithm for non-convex problems that arise due to missing (latent) data. In particular, the EM algorithm is commonly used in clustering analysis under the assumption that the data come from a mixture model.

Clustering algorithms: model-free (K-Means) vs. model-based (Gaussian mixture model, fit via the EM algorithm).

The EM iteration alternates between performing an expectation (E) step and a maximization (M) step:

  • E-step: constructs the expected log-likelihood as a function of the parameters, with the expectation taken under the current parameter estimates.
  • M-step: computes the parameters maximizing the expected log-likelihood found in the E-step.

Numerically, the E-step finds a concave lower bound of the original non-convex objective function, the log-likelihood; the M-step updates the parameters by maximizing this concave function derived in the E-step. Therefore, one can show that the log-likelihood is monotonically non-decreasing in the EM algorithm. However, this does not mean that it will converge to the global optimum. Moreover, EM is not guaranteed to converge even to a local maximum; it is only guaranteed to converge to a point with zero gradient with respect to the parameters. See reference. Some recent papers show that EM enjoys global convergence, or local convergence to the global optimum, for specific mixture models. See reference.
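
The monotonicity of the log-likelihood can be checked on a toy example. The sketch below runs EM for a two-component 1-D Gaussian mixture with unit variances held fixed for brevity; the data-generating parameters and the initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a two-component 1-D Gaussian mixture.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# Parameters: mixing weight pi and means mu1, mu2 (unit variances fixed).
pi, mu1, mu2 = 0.5, -1.0, 1.0

def log_lik(pi, mu1, mu2):
    p1 = pi * np.exp(-0.5 * (x - mu1) ** 2) / np.sqrt(2 * np.pi)
    p2 = (1 - pi) * np.exp(-0.5 * (x - mu2) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(np.log(p1 + p2))

lls = []
for _ in range(50):
    lls.append(log_lik(pi, mu1, mu2))
    # E-step: posterior responsibility of component 1 for each data point.
    p1 = pi * np.exp(-0.5 * (x - mu1) ** 2)
    p2 = (1 - pi) * np.exp(-0.5 * (x - mu2) ** 2)
    r = p1 / (p1 + p2)
    # M-step: maximize the expected complete-data log-likelihood.
    pi = r.mean()
    mu1 = np.sum(r * x) / np.sum(r)
    mu2 = np.sum((1 - r) * x) / np.sum(1 - r)

# The recorded log-likelihoods are monotonically non-decreasing.
print(np.all(np.diff(np.array(lls)) >= -1e-8))
```

Monotonicity holds by construction here, but as noted above it says nothing about reaching the global (or even a local) maximum of the likelihood.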

Deep Learning as Non-Convex Optimization

Optimization algorithms for the non-convex problems arising in deep learning are much less well understood: the landscape of the loss function is hard to visualize or characterize. However, based on the accumulated experience of the community, there are still practical steps we can take to prevent things from going wrong:

Data:

  • Data augmentation
    • increases the amount of data by adding slightly modified copies of already existing data or newly created synthetic data from existing data.
    • acts as a regularizer and helps reduce overfitting when training a machine learning model.
  • Train-Validation-Test Split
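
A train/validation/test split can be done with a simple index shuffle. The sketch below assumes a hypothetical dataset and a 70/15/15 split; both the data and the proportions are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset of 1000 examples with 10 features and binary labels.
n = 1000
X = rng.normal(size=(n, 10))
y = rng.integers(0, 2, size=n)

# Shuffle indices once, then carve out 70% train / 15% validation / 15% test.
idx = rng.permutation(n)
n_train, n_val = int(0.7 * n), int(0.15 * n)
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(len(X_train), len(X_val), len(X_test))
```

The validation set is used for model selection and early stopping, while the test set is touched only once for the final performance estimate.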

Model:

  • Design the network structure carefully (width, depth, activation functions, etc.)
    • networks using a ReLU activation and batch normalization tend to avoid vanishing/exploding gradients
  • Choose the appropriate loss function
    • picking different loss functions or adding regularization terms are effective in some situations

Optimization: