STA 243 Computational Statistics

Discussion 3: Power Method and Sketching

TA: Tesi Xiao

Power Method

Given a diagonalizable matrix $\mathbf{A}\in \mathbb{R}^{d\times d}$ with a dominant eigenvalue (i.e., $\vert \lambda_1\vert > \vert\lambda_2\vert \geq \cdots \geq\vert \lambda_d \vert$), the power method is an eigenvalue algorithm that produces the dominant eigenvalue $\lambda_1$ (the eigenvalue of largest absolute value) and a nonzero vector $\mathbf{v}$, a corresponding eigenvector of $\lambda_1$.

When $\mathbf{A}$ is symmetric, its eigenvectors can be taken orthonormal, so $\mathbf{U}^\top = \mathbf{U}^{-1}$ and

\[\begin{aligned} \mathbf{A} &= \mathbf{U} \boldsymbol{\Sigma} \mathbf{U}^\top=\mathbf{U} \boldsymbol{\Sigma} \mathbf{U}^{-1}\\ &= \begin{bmatrix}\mathbf{u}_1, \cdots, \mathbf{u}_d\end{bmatrix}\begin{bmatrix} \lambda_{1} & & \\ & \ddots & \\ & & \lambda_{d} \end{bmatrix}\begin{bmatrix}\mathbf{u}_1, \cdots, \mathbf{u}_d\end{bmatrix}^\top\\ &= \lambda_1 \mathbf{u}_1 \mathbf{u}_1^\top + \cdots + \lambda_d \mathbf{u}_d \mathbf{u}_d^\top. \end{aligned}\]
  • Is the matrix $\mathbf{A}$ required to be symmetric or positive semi-definite?

    No. Any diagonalizable matrix with a dominant eigenvalue works (real symmetric matrices are always diagonalizable). In fact, the power method also works for non-diagonalizable matrices with a dominant eigenvalue; the convergence analysis can be established via the Jordan canonical form. See the Wikipedia page on power iteration.

  • What if $\mathbf{A}$ has complex eigenvalues?

    In general, the power method works for a complex matrix $\mathbf{A}\in\mathbb{C}^{d\times d}$, finding the eigenvalue with the largest modulus and a corresponding eigenvector. However, if $\mathbf{A}\in\mathbb{R}^{d\times d}$ has complex eigenvalues, they always occur in complex conjugate pairs: if $a+bi$ is an eigenvalue of $\mathbf{A}$, then so is $a-bi$. Therefore, to satisfy the requirement of the power method, $\vert \lambda_1\vert > \vert \lambda_2\vert \geq \cdots \geq \vert \lambda_d \vert$, the dominant eigenvalue $\lambda_1$ of a real matrix $\mathbf{A}$ must be real; otherwise we would have $\vert \lambda_1\vert = \vert \lambda_2\vert = \sqrt{a^2+b^2}$, violating the strict inequality.

  • Convergence Speed

    From a theoretical point of view, the convergence of the power method is geometric, with ratio $\frac{\vert \lambda_2 \vert}{\vert \lambda_1 \vert}$. In particular, assuming the starting vector $\mathbf{b}_0 = \mathbf{U} \mathbf{c} = c_1 \mathbf{u}_1 + \cdots + c_d \mathbf{u}_d$ with $c_1\neq 0$, we have

    \[\begin{aligned} \mathbf{b}_k &= \frac{\mathbf{A}^k \mathbf{b}_0}{\|\mathbf{A}^k \mathbf{b}_0 \|} = \frac{\mathbf{U} \boldsymbol{\Sigma}^k \mathbf{U}^{-1} \mathbf{b}_0}{\|\mathbf{U} \boldsymbol{\Sigma}^k \mathbf{U}^{-1} \mathbf{b}_0 \|} = \frac{\mathbf{U} \boldsymbol{\Sigma}^k \mathbf{U}^{-1} \mathbf{U} \mathbf{c}}{\|\mathbf{U} \boldsymbol{\Sigma}^k \mathbf{U}^{-1} \mathbf{U} \mathbf{c} \|} = \frac{\mathbf{U} \boldsymbol{\Sigma}^k \mathbf{c}}{\|\mathbf{U} \boldsymbol{\Sigma}^k \mathbf{c} \|} \\ &= \frac{c_1 \lambda_1^k \mathbf{u}_1 + c_2 \lambda_2^k \mathbf{u}_2 + \cdots + c_d \lambda_d^k \mathbf{u}_d}{\|c_1 \lambda_1^k \mathbf{u}_1 + c_2 \lambda_2^k \mathbf{u}_2 + \cdots + c_d \lambda_d^k \mathbf{u}_d\|}\\ &= \pm\frac{ \mathbf{u}_1 + \frac{c_2}{c_1} (\frac{\lambda_2}{\lambda_1})^k \mathbf{u}_2 + \cdots + \frac{c_d}{c_1} (\frac{\lambda_d}{\lambda_1})^k \mathbf{u}_d}{\|\mathbf{u}_1 + \frac{c_2}{c_1} (\frac{\lambda_2}{\lambda_1})^k \mathbf{u}_2 + \cdots + \frac{c_d}{c_1} (\frac{\lambda_d}{\lambda_1})^k \mathbf{u}_d\|} = \pm\left(\mathbf{u}_1 + \mathbf{r}_k\right),\\ \|\mathbf{r}_k\| &= \mathcal{O}\left(\left(\frac{\vert \lambda_2 \vert}{\vert \lambda_1 \vert}\right)^k\right) \rightarrow 0, \quad \text{as}~ k\rightarrow \infty, \end{aligned}\]

    where the sign $\pm$ is the sign of $c_1 \lambda_1^k$.

    Thus, the method converges slowly if there is an eigenvalue close in magnitude to the dominant eigenvalue.

    In practice, the overall speed also depends heavily on the implementation of the matrix-vector product. Therefore, the power iteration method is especially suitable for sparse matrices, for which $\mathbf{A}\mathbf{b}$ can be computed efficiently. For non-symmetric matrices that are well-conditioned, the power iteration method can outperform the more complex Arnoldi iteration. For symmetric matrices, the power iteration method is rarely used, since other algorithms (e.g., the Lanczos iteration) can speed up the convergence. A minimal Python sketch of the iteration is given after this list.

  • Applications: PageRank (the transition matrix is usually sparse); PCA.
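As a concrete illustration, here is a minimal NumPy sketch of power iteration. It is not taken from the lecture code; the function name, stopping rule, and Rayleigh-quotient estimate of $\lambda_1$ are illustrative choices.

import numpy as np

def power_method(A, num_iters=1000, tol=1e-10, seed=0):
    """Minimal power iteration returning an estimate of the dominant eigenpair of A."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(A.shape[0])        # random start, so c_1 != 0 almost surely
    b /= np.linalg.norm(b)
    for _ in range(num_iters):
        b_new = A @ b                          # one matrix-vector product per iteration
        b_new /= np.linalg.norm(b_new)
        # stop when the direction stabilizes (compare up to sign, in case lambda_1 < 0)
        if min(np.linalg.norm(b_new - b), np.linalg.norm(b_new + b)) < tol:
            b = b_new
            break
        b = b_new
    lam = b @ (A @ b)                          # Rayleigh quotient estimate of lambda_1
    return lam, b

# Example: dominant eigenpair of a small symmetric matrix
lam, v = power_method(np.array([[2.0, 1.0], [1.0, 3.0]]))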

Sketching

The idea of sketching in linear regression is to construct a matrix $\boldsymbol{\Phi}\in\mathbb{R}^{r\times n}$ with $r \ll n$ and solve the following sketched-OLS problem instead of the Ordinary Least Squares (OLS) problem:

\[\mathbf{b}_s = \underset{\mathbf{c}\in\mathbb{R}^d}{\arg\min}~ \| \boldsymbol{\Phi}(\mathbf{X}\boldsymbol{c} - \mathbf{y})\|^2_2\]

where $\boldsymbol{\Phi} = \mathbf{S}^\top \mathbf{H} \mathbf{D}$ is called the sketching matrix (a.k.a. the subsampled randomized Hadamard transform),

\[\boldsymbol{\Phi} = \mathbf{S}_{n\times r}^\top \mathbf{H}_{n\times n} \mathbf{D}_{n\times n},\]

and $\mathbf{X}\in\mathbb{R}^{n\times d}, \mathbf{y}\in \mathbb{R}^n$. The sketched solution has the closed form

\[\mathbf{b}_s = ((\boldsymbol{\Phi}\mathbf{X})^\top (\boldsymbol{\Phi}\mathbf{X}))^{-1}(\boldsymbol{\Phi}\mathbf{X})^\top \boldsymbol{\Phi} \mathbf{y} = (\boldsymbol{\Phi}\mathbf{X})^{+} \boldsymbol{\Phi} \mathbf{y}\]

where $(\cdot)^+$ denotes the Moore-Penrose pseudo-inverse.
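Numerically, $\mathbf{b}_s$ is best computed by solving the sketched least-squares problem directly rather than forming an explicit inverse. A minimal sketch, using randomly generated stand-ins for $\boldsymbol{\Phi}\mathbf{X}$ and $\boldsymbol{\Phi}\mathbf{y}$ (their efficient computation is discussed below):

import numpy as np

rng = np.random.default_rng(0)
Phi_X = rng.standard_normal((100, 5))   # stand-in for the sketched design matrix Phi @ X
Phi_y = rng.standard_normal(100)        # stand-in for the sketched response Phi @ y

# b_s = (Phi X)^+ (Phi y), computed via least squares
b_s, *_ = np.linalg.lstsq(Phi_X, Phi_y, rcond=None)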

Design of Matrices $\mathbf{S}, \mathbf{H}, \mathbf{D}$

  • $\mathbf{S}_{n\times r}$: The function of the sub-sampling matrix $\mathbf{S}$ is to subsample the rows of a matrix in $\mathbb{R}^{n\times d}$ and obtain a matrix in $\mathbb{R}^{r\times d}$ with an extra scaling factor $\sqrt{n/r}$.

    Let $\mathbf{S}$ be an empty matrix. For $t=1,\dots, r$ (i.i.d. trials with replacement), select an integer uniformly at random from $\{1,2,\dots, n\}$. If $i$ is selected, append the column vector $\sqrt{n/r}\, \mathbf{e}_i$ to $\mathbf{S}$, where $\mathbf{e}_i\in\mathbb{R}^n$ is the sparse one-hot vector with only the $i$-th entry equal to 1.

  • $\mathbf{H}_{n\times n}$: The normalized Hadamard transform matrix

    The normalized Hadamard transformation matrix $\mathbf{H}_{2^m}\in \mathbb{R}^{2^{m}\times 2^{m}}$ is defined recursively as follows: \(\mathbf{H}_1 = \begin{pmatrix} +1 \end{pmatrix},\quad \mathbf{H}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} +1 & +1\\ +1 & -1\end{pmatrix} ,\quad \mathbf{H}_{2^m}={\frac {1}{\sqrt {2}}}{\begin{pmatrix}\mathbf{H}_{2^{m-1}}&\mathbf{H}_{2^{m-1}}\\\mathbf{H}_{2^{m-1}}&-\mathbf{H}_{2^{m-1}}\end{pmatrix}}.\)

  • $\mathbf{D}_{n\times n}$: A random diagonal matrix with $D_{ii}$ i.i.d. uniformly selected from $\{+1, -1\}$. An explicit (but inefficient) construction of all three matrices is sketched below.
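For concreteness, here is a small NumPy sketch that materializes $\mathbf{S}$, $\mathbf{H}$, and $\mathbf{D}$ exactly as defined above. This is for illustration only; the function names are mine, and as the Complexity Analysis below explains, an efficient implementation should never form these matrices explicitly.

import numpy as np

def build_S(n, r, rng):
    """n x r subsampling matrix: columns are sqrt(n/r) * e_i, with i drawn with replacement."""
    idx = rng.integers(0, n, size=r)
    S = np.zeros((n, r))
    S[idx, np.arange(r)] = np.sqrt(n / r)
    return S

def build_H(n):
    """Normalized Hadamard matrix of order n (n must be a power of 2), built recursively."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]]) / np.sqrt(2.0)
    return H

def build_D(n, rng):
    """Random diagonal matrix with i.i.d. +/-1 entries."""
    return np.diag(rng.choice([-1.0, 1.0], size=n))

rng = np.random.default_rng(0)
n, r = 8, 4
Phi = build_S(n, r, rng).T @ build_H(n) @ build_D(n, rng)   # the r x n sketching matrix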

Complexity Analysis

Let us start by considering the exact solution, given by

\[\mathbf{b} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.\]

The complexity of the matrix multiplication $\mathbf{X}^\top \mathbf{X}$ is $\mathcal{O}(nd^2)$. The complexity of the matrix-vector product $\mathbf{X}^\top \mathbf{y}$ is $\mathcal{O}(nd)$. Inverting a $d\times d$ matrix usually costs $\mathcal{O}(d^3)$ (though this can be lowered). In fact, to obtain $\mathbf{b}$, the matrix inversion is not required; one can instead solve $\mathbf{b}$ from the following linear equations (the normal equations):

\[(\mathbf{X}^\top \mathbf{X}) \mathbf{b} = \mathbf{X}^\top \mathbf{y}\]

Solving a system of $d$ linear equations has a complexity of at most $\mathcal{O}(d^3)$; the best algorithms known to date, based on fast matrix multiplication, have a complexity of about $\mathcal{O}(d^{2.376})$. Therefore, in the worst case, the complexity of calculating the exact value of $\mathbf{b}$ is $\mathcal{O}(nd^2 + d^3)$.

  • OLS: $\mathcal{O}(nd^2 + d^3)$
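As noted above, in practice one solves the normal equations rather than forming $(\mathbf{X}^\top\mathbf{X})^{-1}$ explicitly; a minimal NumPy sketch with randomly generated stand-in data:

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Solve (X^T X) b = X^T y directly instead of inverting X^T X
b = np.linalg.solve(X.T @ X, X.T @ y)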

One may easily see that solving the sketched-OLS problem requires $\mathcal{O}(rd^2 + d^3)$ computational complexity. However, for sketched-OLS, we also need to consider the complexity of obtaining $\widetilde{\mathbf{X}}=\boldsymbol{\Phi}\mathbf{X}$. If we create the matrices $\mathbf{S}, \mathbf{H}, \mathbf{D}$ explicitly and calculate $\boldsymbol{\Phi} = \mathbf{S}^\top \mathbf{H} \mathbf{D}$, the time and space complexity of obtaining $\widetilde{\mathbf{X}}$ will be large. For an efficient implementation, we can instead think of $\boldsymbol{\Phi}$ as the composition of three linear transformations.

Given a column vector $(\cdot)_{n\times 1}$, the linear transformations $\mathbf{D}_{n\times n}, \mathbf{H}_{n\times n}, \mathbf{S}_{n\times r}^\top$ conduct the following operations on $(\cdot)_{n\times 1}$:

  • $\mathbf{D}_{n\times n}(\cdot)_{n\times 1}$: It independently multiplies each element of $(\cdot)_{n\times 1}$ by a random sign uniformly selected from $\{+1, -1\}$.
  • $\mathbf{H}_{n\times n}(\cdot)_{n\times 1}$: Walsh-Hadamard transform (WHT). A naive implementation of the WHT of order $n=2^{m}$ would have a computational complexity of $\mathcal{O}(n^{2})$. In computational mathematics, there is an efficient algorithm to compute the WHT, called the Hadamard-ordered fast Walsh–Hadamard transform (FWHT). It is a divide-and-conquer algorithm that recursively breaks down a WHT of size $n$ into two smaller WHTs of size $n/2$, which requires only $\mathcal{O}(n\log n)$ additions and subtractions. \(\mathbf{H}_n \begin{pmatrix} \mathbf{b}_1\\\mathbf{b}_2\end{pmatrix}= {\frac {1}{\sqrt {2}}}{\begin{pmatrix}\mathbf{H}_{n/2}\mathbf{b}_1 + \mathbf{H}_{n/2}\mathbf{b}_2\\ \mathbf{H}_{n/2}\mathbf{b}_1 - \mathbf{H}_{n/2}\mathbf{b}_2\end{pmatrix}}.\) Python example code for an iterative in-place FWHT can be found here. For HW, you can also use sympy.discrete.transforms.fwht(seq) for convenience. (sympy.fwht is NOT efficient at all; please refer to the remark below.)
  • $(\mathbf{S}^\top)_{r\times n}(\cdot)_{n\times 1} = \sqrt{n/r} \begin{bmatrix} \mathbf{e}_{i_1}, \cdots, \mathbf{e}_{i_r} \end{bmatrix}^\top (\cdot)_{n\times 1} = \sqrt{n/r} \begin{bmatrix} \mathbf{e}_{i_1}^\top (\cdot), \cdots , \mathbf{e}_{i_r}^\top (\cdot)\end{bmatrix}^\top$: It subsamples the $n$ elements of $(\cdot)$ to obtain a column vector of size $r$, with an extra scaling factor $\sqrt{n/r}$.

Overall, computing $\boldsymbol{\Phi}(\cdot)_{n\times 1}$ takes $\mathcal{O}(n\log n)$ time (note that $r\ll n$). Therefore, computing $\boldsymbol{\Phi}\mathbf{X}_{n\times d}$ takes $\mathcal{O}(d n\log n)$.

  • Sketched-OLS: $\mathcal{O}(nd(d+\log n) + d^3)$

Thus, sketched-OLS works in the regime where $d$ is small but $d^2 \leq n \leq e^d$.
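Putting the three transformations together, the following is a self-contained NumPy sketch of applying $\boldsymbol{\Phi}$ to a matrix without ever forming $\mathbf{S}$, $\mathbf{H}$, or $\mathbf{D}$. The function names are illustrative, $n$ is assumed to be a power of 2, and the plain-Python FWHT loop here is only for clarity (see the Remark below for a faster Numba version).

import numpy as np

def fwht_columns(B):
    """In-place unnormalized FWHT applied to each column of B (B.shape[0] must be a power of 2)."""
    n, h = B.shape[0], 1
    while h < n:
        for i in range(0, n, h * 2):
            x, y = B[i:i+h].copy(), B[i+h:i+2*h].copy()
            B[i:i+h], B[i+h:i+2*h] = x + y, x - y
        h *= 2
    return B

def apply_srht(M, r, rng):
    """Compute Phi @ M = sqrt(n/r) * S^T H (D M) for an (n, d) matrix M."""
    n = M.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)              # D: random +/-1 sign flips
    HDM = fwht_columns(signs[:, None] * M) / np.sqrt(n)  # H: normalized Hadamard transform
    idx = rng.integers(0, n, size=r)                     # S: sample r rows with replacement
    return np.sqrt(n / r) * HDM[idx]

rng = np.random.default_rng(0)
n, d, r = 2**10, 5, 200
X = rng.standard_normal((n, d))
y = rng.standard_normal((n, 1))
sketch = apply_srht(np.hstack([X, y]), r, rng)           # sketch X and y with the same Phi
Phi_X, Phi_y = sketch[:, :d], sketch[:, d]
b_s, *_ = np.linalg.lstsq(Phi_X, Phi_y, rcond=None)      # sketched-OLS solution

Note that the unnormalized FWHT is divided by $\sqrt{n}$ so that it matches the normalized $\mathbf{H}$ defined above.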

Remark

Even though sketched OLS has lower time complexity in $n$ and $d$ (in the regime $d^2\leq n \leq e^d$) than vanilla OLS, as shown above, different implementations (machine code) will have different runtimes, corresponding to different constants hidden in the $\mathcal{O}$ notation. Optimized machine code compiled from C/C++ is usually much more efficient than interpreted Python. In HW1 Problem 6, to implement an efficient sketched-OLS algorithm in Python, the easiest way I have found so far is to use Numba to optimize the Python code. Specifically, to accelerate the Hadamard transform function, all you need to do is import numba and add the @jit(nopython=True) decorator to the function.

from numba import jit

@jit(nopython=True)
def fwht(a) -> None:
    """In-place Fast Walsh–Hadamard Transform of array a (unnormalized).

    Note: len(a) must be a power of 2; divide the result by sqrt(len(a))
    to match the normalized Hadamard matrix H defined above.
    """
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                a[j], a[j+h] = a[j]+a[j+h], a[j]-a[j+h]
        h *= 2
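A hypothetical usage/timing example for the function above (the first call includes Numba's JIT compilation overhead, so time a subsequent call):

import time
import numpy as np

a = np.random.randn(2**20)
fwht(a.copy())                 # warm-up call: triggers compilation

b = a.copy()
start = time.time()
fwht(b)                        # compiled call: this is the time to report
print(f"fwht on a vector of size 2**20 took {time.time() - start:.3f} s")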

The above code is much less time-consuming: it takes only about 0.15 seconds to transform a vector of size $2^{20}$ (the function without numba takes about 4.5 seconds). From my preliminary test, the whole runtime of sketched-OLS in this implementation (including obtaining $\boldsymbol{\Phi}\mathbf{X}, \boldsymbol{\Phi}\mathbf{y}$) will still exceed the runtime of vanilla OLS for Problem 6 in HW1. But it is already fast enough for a simple Python implementation; it takes about 3 seconds to obtain the results.

Other C/C++ based implementations can speed up the transform by up to 10 times over the numba implementation; see fastwht. However, they require familiarity with compilers and command-line tools. Motivated students can give it a try.

In addition, the transform $\boldsymbol{\Phi}$ can be applied to $\mathbf{X}$ and $\mathbf{y}$ in a parallel way, which can further reduce the runtime.

For the runtime of sketched-OLS in Problem 6 of HW1, you can split it into two parts: the calculation time of $\mathbf{b}_s$ and the calculation time of $\boldsymbol{\Phi}\mathbf{X}$ and $\boldsymbol{\Phi}\mathbf{y}$.