STA 243 Computational Statistics

Discussion 3: Power Method and Sketching

TA: Tesi Xiao

Power Method

Given a diagonalizable matrix $A \in \mathbb{R}^{d \times d}$, the power method is an eigenvalue algorithm that will produce a number $\lambda_1$, which is the greatest (in absolute value) eigenvalue of $A$ (i.e., $|\lambda_1| > |\lambda_2| \geq \dots \geq |\lambda_d|$), and a nonzero vector $v$, which is a corresponding eigenvector of $\lambda_1$.

$$A = U \Sigma U^\top = U \Sigma U^{-1} = [u_1, \dots, u_d] \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{pmatrix} [u_1, \dots, u_d]^\top = \lambda_1 u_1 u_1^\top + \dots + \lambda_d u_d u_d^\top.$$
  • Is the matrix A required to be symmetric or positive semi-definite?

    No. Any diagonalizable matrix works. (Real symmetric matrices are always diagonalizable.) In fact, the power method also works for non-diagonalizable matrices with a dominant eigenvalue; the convergence analysis can be established via the Jordan canonical form. See the Wikipedia page on power iteration.

  • What if A has complex eigenvalues?

    In general, the power method works for a complex matrix $A \in \mathbb{C}^{d \times d}$ to find the eigenvector and the eigenvalue with the largest modulus. However, if $A \in \mathbb{R}^{d \times d}$ has complex eigenvalues, its complex eigenvalues will always occur in complex conjugate pairs. That is to say, if $A \in \mathbb{R}^{d \times d}$ has an eigenvalue $a + bi$, then $a - bi$ is also an eigenvalue of $A$. Therefore, to satisfy the requirement of the power method $|\lambda_1| > |\lambda_2| \geq \dots \geq |\lambda_d|$, the dominant eigenvalue $\lambda_1$ of $A$ should be real. Otherwise, we would have $|\lambda_1| = |\lambda_2| = \sqrt{a^2 + b^2}$.

  • Convergence Speed

    From a theoretical point of view, the convergence of the power method is geometric, with ratio $|\lambda_2| / |\lambda_1|$. In particular, assuming the starting vector $b_0 = Uc = c_1 u_1 + \dots + c_d u_d$ (with $c_1 \neq 0$), we have

    $$b_k = \frac{A^k b_0}{\|A^k b_0\|} = \frac{U \Sigma^k U^{-1} b_0}{\|U \Sigma^k U^{-1} b_0\|} = \frac{U \Sigma^k U^{-1} U c}{\|U \Sigma^k U^{-1} U c\|} = \frac{U \Sigma^k c}{\|U \Sigma^k c\|} = \frac{c_1 \lambda_1^k u_1 + c_2 \lambda_2^k u_2 + \dots + c_d \lambda_d^k u_d}{\|c_1 \lambda_1^k u_1 + c_2 \lambda_2^k u_2 + \dots + c_d \lambda_d^k u_d\|}$$

    $$= \frac{u_1 + \frac{c_2}{c_1} \left(\frac{\lambda_2}{\lambda_1}\right)^k u_2 + \dots + \frac{c_d}{c_1} \left(\frac{\lambda_d}{\lambda_1}\right)^k u_d}{\left\| u_1 + \frac{c_2}{c_1} \left(\frac{\lambda_2}{\lambda_1}\right)^k u_2 + \dots + \frac{c_d}{c_1} \left(\frac{\lambda_d}{\lambda_1}\right)^k u_d \right\|} = \frac{u_1 + r_k}{\|u_1 + r_k\|}, \qquad \|r_k\| = O\!\left(\left(\frac{|\lambda_2|}{|\lambda_1|}\right)^k\right) \to 0 \quad \text{as } k \to \infty.$$

    (Strictly speaking, the last two equalities hold up to a sign factor; they hold exactly when $c_1 \lambda_1^k > 0$.)

    Thus, the method converges slowly if there is an eigenvalue close in magnitude to the dominant eigenvalue.

    In practice, the runtime also depends heavily on how the matrix-vector product is implemented. The power iteration is therefore especially suitable for sparse matrices, for which $Ab$ can be calculated efficiently. For non-symmetric matrices that are well-conditioned, the power iteration can outperform the more complex Arnoldi iteration. For symmetric matrices, the power iteration is rarely used, since other algorithms can be used to speed up the convergence. A minimal implementation is sketched after this list.

  • Applications: PageRank (the transition matrix is usually sparse); PCA.
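
As a concrete illustration of the iteration analyzed above, here is a minimal NumPy sketch; the function name, random start, and iteration/tolerance parameters are my own choices, not part of the course material.

import numpy as np

def power_method(A, num_iters=1000, tol=1e-10, seed=0):
    """Approximate the dominant eigenvalue/eigenvector of A by power iteration."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(A.shape[0])   # random start: c_1 != 0 with probability 1
    b /= np.linalg.norm(b)
    lam = 0.0
    for _ in range(num_iters):
        Ab = A @ b                        # the only expensive step: one matrix-vector product
        b_new = Ab / np.linalg.norm(Ab)
        lam_new = b_new @ A @ b_new       # Rayleigh-quotient estimate of lambda_1
        if abs(lam_new - lam) < tol:
            return lam_new, b_new
        lam, b = lam_new, b_new
    return lam, b

# quick sanity check on a matrix with known spectrum (5, 2, 1)
A = np.diag([5.0, 2.0, 1.0])
lam, v = power_method(A)
print(lam, np.abs(v))   # ~5.0 and ~[1, 0, 0]

The number of iterations needed to reach a fixed tolerance grows as $|\lambda_2| / |\lambda_1|$ approaches 1, in line with the geometric rate derived above.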

Sketching

The idea of sketching in linear regression is to construct a matrix $\Phi \in \mathbb{R}^{r \times n}$ and solve the following sketched-OLS problem instead of the Ordinary Least Squares (OLS) problem:

$$b_s = \operatorname*{arg\,min}_{c \in \mathbb{R}^d} \ \|\Phi (X c - y)\|_2^2$$

where $\Phi = S^\top H D$ is called the sketching matrix (a.k.a. the subsampled randomized Hadamard transform),

$$\Phi = S^\top_{\,r \times n} \, H_{n \times n} \, D_{n \times n},$$

and $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^{n}$. The sketched solution is

$$b_s = \left( (\Phi X)^\top (\Phi X) \right)^{-1} (\Phi X)^\top \Phi y = (\Phi X)^{+} \Phi y$$

$(\cdot)^{+}$: Moore-Penrose pseudo-inverse.
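
In code, $b_s$ is usually obtained with a least-squares solver rather than an explicit (pseudo-)inverse. A minimal sketch, using random placeholders for $\Phi X$ and $\Phi y$ (in practice they come from the SRHT described below):

import numpy as np

rng = np.random.default_rng(0)
Phi_X = rng.standard_normal((64, 5))   # placeholder for the sketched design matrix (r x d)
Phi_y = rng.standard_normal(64)        # placeholder for the sketched response (r,)

# equivalent to (Phi X)^+ Phi y, without forming an explicit inverse
b_s, *_ = np.linalg.lstsq(Phi_X, Phi_y, rcond=None)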

Design of Matrices S,H,D

  • $S \in \mathbb{R}^{n \times r}$: The function of a sub-sampling matrix $S$ is to subsample the row vectors of a matrix in $\mathbb{R}^{n \times d}$ and obtain a matrix in $\mathbb{R}^{r \times d}$, with an extra scaling factor $\sqrt{n/r}$.

    Let $S$ be an empty matrix. For $t = 1, \dots, r$ (i.i.d. trials with replacement), select uniformly at random an integer from $\{1, 2, \dots, n\}$. If $i$ is selected, then append the column vector $\sqrt{n/r}\, e_i$ to $S$, where $e_i \in \mathbb{R}^n$ is the sparse one-hot vector with only the $i$-th entry equal to 1.

  • $H \in \mathbb{R}^{n \times n}$: The normalized Hadamard transform matrix

    The normalized Hadamard transformation matrix $H_{2^m} \in \mathbb{R}^{2^m \times 2^m}$ is defined recursively as follows: $$H_1 = (+1), \qquad H_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix}, \qquad H_{2^m} = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{2^{m-1}} & H_{2^{m-1}} \\ H_{2^{m-1}} & -H_{2^{m-1}} \end{pmatrix}.$$

  • $D \in \mathbb{R}^{n \times n}$: A random diagonal matrix with $D_{ii}$ i.i.d. uniformly selected from $\{+1, -1\}$. (A naive explicit construction of $S$, $H$, and $D$ is sketched after this list.)
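
For illustration only, here is a naive sketch that builds the three matrices explicitly, using scipy.linalg.hadamard (which returns the unnormalized Hadamard matrix, hence the division by $\sqrt{n}$). The complexity discussion below explains why this explicit construction should be avoided in practice.

import numpy as np
from scipy.linalg import hadamard

def build_srht(n, r, seed=0):
    """Explicitly build Phi = S^T H D (for illustration only; O(n^2) time and memory)."""
    assert n & (n - 1) == 0, "n must be a power of 2"
    rng = np.random.default_rng(seed)
    D = np.diag(rng.choice([-1.0, 1.0], size=n))   # random +/-1 signs on the diagonal
    H = hadamard(n) / np.sqrt(n)                   # normalized Hadamard matrix
    idx = rng.integers(0, n, size=r)               # r indices sampled uniformly with replacement
    S_T = np.sqrt(n / r) * np.eye(n)[idx]          # S^T stacks the rows sqrt(n/r) e_i^T
    return S_T @ H @ D

Phi = build_srht(n=8, r=4)
print(Phi.shape)   # (4, 8)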

Complexity Analysis

Let us start by considering the calculation of the exact solution, given by

$$b = (X^\top X)^{-1} X^\top y.$$

The complexity of the matrix multiplication $X^\top X$ is $O(nd^2)$. The complexity of the matrix-vector product $X^\top y$ is $O(nd)$. The complexity of inverting a $d \times d$ matrix is usually $O(d^3)$ (though it can be lowered). To obtain $b$, however, the matrix inversion is not required: one can also solve $b$ from the following linear equations:

$$(X^\top X)\, b = X^\top y$$

Solving a system of $d$ linear equations in $d$ unknowns has a complexity of at most $O(d^3)$; the best algorithm known to date has a complexity of about $O(d^{2.376})$. Therefore, in the worst case, the complexity of calculating the exact value of $b$ is $O(nd^2 + d^3)$.

  • OLS: $O(nd^2 + d^3)$
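
A small NumPy sketch contrasting the two routes above (explicit inversion versus solving the normal equations); the dimensions are arbitrary and chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

XtX = X.T @ X                        # O(n d^2)
Xty = X.T @ y                        # O(n d)

b_inv = np.linalg.inv(XtX) @ Xty     # explicit inverse: O(d^3), numerically less stable
b_slv = np.linalg.solve(XtX, Xty)    # solve (X^T X) b = X^T y directly: O(d^3)

print(np.allclose(b_inv, b_slv))     # True (up to floating-point error)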

One may easily see that sketched-OLS requires $O(rd^2 + d^3)$ computational complexity for the solve. However, for sketched-OLS we also need to consider the complexity of getting $\tilde{X} = \Phi X$. If we create the matrices $S, H, D$ explicitly and calculate $\Phi = S^\top H D$, the time and space complexity of obtaining $\tilde{X}$ will be large. For an efficient implementation, we can think of $\Phi$ as the composition of three linear transformations.

Given a column vector $(\cdot)_{n \times 1}$, the linear transformations $D_{n \times n}$, $H_{n \times n}$, $S^\top_{\,r \times n}$ perform the following operations on $(\cdot)_{n \times 1}$:

  • $D_{n \times n} (\cdot)_{n \times 1}$: It independently multiplies each element of $(\cdot)_{n \times 1}$ by a random sign uniformly selected from $\{+1, -1\}$.
  • $H_{n \times n} (\cdot)_{n \times 1}$: Walsh-Hadamard transform (WHT). A naive implementation of the WHT of order $n = 2^m$ would have a computational complexity of $O(n^2)$. In computational mathematics, there is an efficient algorithm to compute the WHT, called the Hadamard-ordered fast Walsh-Hadamard transform (FWHT). It is a divide-and-conquer algorithm that recursively breaks down a WHT of size $n$ into two smaller WHTs of size $n/2$, which requires only $O(n \log n)$ additions and subtractions: $$H_n \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{n/2} b_1 + H_{n/2} b_2 \\ H_{n/2} b_1 - H_{n/2} b_2 \end{pmatrix}.$$ Python example code for an iterative in-place FWHT can be found here. For the HW, you can also use sympy.discrete.transforms.fwht(seq) for convenience. (sympy.fwht is NOT efficient at all; please refer to the Remark below.)
  • $(S^\top)_{r \times n} (\cdot)_{n \times 1} = \sqrt{n/r}\, [e_{i_1}, \dots, e_{i_r}]^\top (\cdot)_{n \times 1} = \sqrt{n/r}\, [e_{i_1}^\top (\cdot), \dots, e_{i_r}^\top (\cdot)]^\top$: It sub-samples the $n$ elements of $(\cdot)$ to obtain a column vector of size $r$, with an extra scaling factor $\sqrt{n/r}$.

Overall, computing $\Phi (\cdot)_{n \times 1}$ takes $O(n \log n)$ time (note that $r \ll n$). Therefore, computing $\Phi X_{n \times d}$ takes $O(d\, n \log n)$ time. A minimal matrix-free implementation of this composition is sketched below.
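
A minimal sketch of applying $\Phi$ to a vector without forming $S$, $H$, or $D$; the helper names are my own, and the in-place transform below is the unnormalized WHT, so the result is divided by $\sqrt{n}$ to match the normalized $H$ defined earlier.

import numpy as np

def fwht_inplace(a):
    """Unnormalized in-place fast Walsh-Hadamard transform; len(a) must be a power of 2."""
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2

def apply_sketch(x, signs, idx, r):
    """Compute Phi x = S^T H D x in O(n log n) time without forming any matrix."""
    n = len(x)
    z = signs * x                    # D x: flip signs elementwise
    fwht_inplace(z)                  # H x (unnormalized), O(n log n)
    z /= np.sqrt(n)                  # rescale to the normalized Hadamard transform
    return np.sqrt(n / r) * z[idx]   # S^T: keep the r sampled entries, rescale by sqrt(n/r)

rng = np.random.default_rng(0)
n, r = 1024, 64
signs = rng.choice([-1.0, 1.0], size=n)       # diagonal of D
idx = rng.integers(0, n, size=r)              # sampled row indices
x = rng.standard_normal(n)
print(apply_sketch(x, signs, idx, r).shape)   # (64,)

To form $\Phi X$ and $\Phi y$, the same signs and idx must be reused for every column of $X$ and for $y$.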

  • Sketched-OLS: $O(nd \log n + rd^2 + d^3)$

Thus, the sketched-OLS works in the regime where $d$ is small but $d^2 \leq n \leq e^d$.

Remark

Even though sketched-OLS has a lower time complexity in $n$ and $d$ than vanilla OLS in the regime $d^2 \leq n \leq e^d$, as shown above, different implementations (machine code) will have different runtimes, which correspond to different constants hidden in the $O$ notation. Optimized machine code compiled from C/C++ is usually much more efficient than interpreted Python code. In HW1 Problem 6, to implement an efficient sketched-OLS algorithm in Python, the easiest way I have found so far is to use Numba to optimize the Python code. Specifically, to accelerate the Hadamard transform function, all you need to do is import numba and add @jit(nopython=True) to the function.

from numba import jit

@jit(nopython=True)
def fwht(a) -> None:
    """In-place Fast Walsh–Hadamard Transform of array a."""
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                a[j], a[j+h] = a[j]+a[j+h], a[j]-a[j+h]
        h *= 2

The above code is much less time-consuming: it takes only ~0.15 seconds to transform a vector of size $2^{20}$ (the function without Numba takes ~4.5 seconds). From my preliminary test, the whole runtime of sketched-OLS in this implementation (including obtaining $\Phi X$, $\Phi y$) will still exceed the runtime of vanilla OLS for Problem 6 in HW1. But it is already fast enough for a simple Python implementation; it takes about 3 seconds to obtain the results.
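
A hedged usage/timing sketch for the JIT-compiled fwht above: the very first call includes Numba's compilation time, so it is common to call the function once as a warm-up before timing.

import time
import numpy as np

a = np.random.standard_normal(2**20)

fwht(a.copy())                # first call triggers JIT compilation (warm-up)

start = time.perf_counter()
fwht(a)                       # subsequent calls run the compiled machine code
print(f"fwht on a vector of size 2^20: {time.perf_counter() - start:.3f} s")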

Other C/C++ based implementations can make the transform up to roughly 10 times faster than the Numba implementation; see fastwht. However, they require familiarity with compilers and command-line tools. Motivated students can give it a try.

In addition, applying the transform $\Phi$ to $X$ and $y$ can be parallelized, which can further reduce the runtime.

For the runtime of sketched-OLS in Problem 6 of HW1, you can split it into two parts: the calculation time of $b_s$ and the calculation time of $\Phi X$, $\Phi y$.
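
One possible way to record the two parts separately is sketched below; the dense Gaussian Phi is only a stand-in so the snippet runs on its own, and should be replaced by the SRHT applied via the fast transform as described above.

import time
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 4096, 20, 400
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
Phi = rng.standard_normal((r, n)) / np.sqrt(r)          # stand-in sketching matrix

t0 = time.perf_counter()
Phi_X, Phi_y = Phi @ X, Phi @ y                         # part 1: forming Phi X and Phi y
t1 = time.perf_counter()
b_s, *_ = np.linalg.lstsq(Phi_X, Phi_y, rcond=None)     # part 2: solving for b_s
t2 = time.perf_counter()

print(f"sketching time: {t1 - t0:.4f} s, solve time: {t2 - t1:.4f} s")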