STA 243 Computational Statistics
Discussion 3: Power Method and Sketching
TA: Tesi Xiao
Power Method
Given a diagonalizable matrix $A \in \mathbb{R}^{d \times d}$ with a dominant eigenvalue (i.e., $|\lambda_1| > |\lambda_2| \geq \cdots \geq |\lambda_d|$), the power method is an eigenvalue algorithm that produces the eigenvalue $\lambda_1$ of largest absolute value, together with a nonzero vector $v$ that is a corresponding eigenvector.
$$
A = U \Sigma U^\top = U \Sigma U^{-1}
= [u_1, \cdots, u_d]
\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{pmatrix}
[u_1, \cdots, u_d]^\top
= \lambda_1 u_1 u_1^\top + \cdots + \lambda_d u_d u_d^\top.
$$
Is the matrix $A$ required to be symmetric or positive semi-definite?
No. Any diagonalizable matrix with a dominant eigenvalue works. (Real symmetric matrices are always diagonalizable.) In fact, the power method also works for non-diagonalizable matrices with a dominant eigenvalue; the convergence analysis can be established via the Jordan canonical form. See the Wikipedia page on power iteration.
What if A has complex eigenvalues?
In general, the power method works for a complex matrix $A \in \mathbb{C}^{d \times d}$ to find the eigenvector and eigenvalue with the largest modulus. However, if $A \in \mathbb{R}^{d \times d}$ has complex eigenvalues, they always occur in complex conjugate pairs: if $A$ has an eigenvalue $a + bi$, then $a - bi$ is also an eigenvalue of $A$. Therefore, to satisfy the requirement of the power method $|\lambda_1| > |\lambda_2| \geq \cdots \geq |\lambda_d|$, the dominant eigenvalue $\lambda_1$ of $A$ should be real; otherwise we would have $|\lambda_1| = |\lambda_2| = \sqrt{a^2 + b^2}$.
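As a small illustration of this failure mode (the example matrix is my own, not from the original notes), a 2x2 rotation matrix has the conjugate pair of eigenvalues $\pm i$ with equal modulus, and power iteration keeps rotating the iterate instead of converging to a fixed direction:

import numpy as np

# 90-degree rotation: eigenvalues are +i and -i, so |lambda_1| = |lambda_2| = 1
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
print(np.linalg.eigvals(A))        # [0.+1.j 0.-1.j]

b = np.array([1.0, 0.3])
for k in range(4):
    b = A @ b / np.linalg.norm(A @ b)
    print(k, b)                    # the direction keeps changing; no convergence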
Convergence Speed
From a theoretical point of view, the convergence of the power method is geometric, with ratio $|\lambda_2| / |\lambda_1|$. In particular, assuming the starting vector $b_0 = Uc = c_1 u_1 + \cdots + c_d u_d$ with $c_1 \neq 0$, we have
$$
b_k = \frac{A^k b_0}{\|A^k b_0\|}
= \frac{U \Sigma^k U^{-1} b_0}{\|U \Sigma^k U^{-1} b_0\|}
= \frac{U \Sigma^k U^{-1} U c}{\|U \Sigma^k U^{-1} U c\|}
= \frac{U \Sigma^k c}{\|U \Sigma^k c\|}
= \frac{c_1 \lambda_1^k u_1 + c_2 \lambda_2^k u_2 + \cdots + c_d \lambda_d^k u_d}{\|c_1 \lambda_1^k u_1 + c_2 \lambda_2^k u_2 + \cdots + c_d \lambda_d^k u_d\|}
= \pm \frac{u_1 + r_k}{\|u_1 + r_k\|},
$$
where
$$
r_k = \frac{c_2}{c_1} \left(\frac{\lambda_2}{\lambda_1}\right)^k u_2 + \cdots + \frac{c_d}{c_1} \left(\frac{\lambda_d}{\lambda_1}\right)^k u_d,
\qquad
\|r_k\| = O\!\left(\left(\frac{|\lambda_2|}{|\lambda_1|}\right)^k\right) \to 0 \quad \text{as } k \to \infty.
$$
Thus, the method converges slowly if there is an eigenvalue close in magnitude to the dominant eigenvalue.
In practice, the convergence speed depends heavily on the implementation of the matrix-vector product. The power iteration method is therefore especially suitable for sparse matrices, for which $Ab$ can be computed efficiently. For non-symmetric matrices that are well-conditioned, the power iteration method can outperform the more complex Arnoldi iteration. For symmetric matrices, the power iteration method is rarely used, since other algorithms are available to speed up convergence.
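For concreteness, here is a minimal NumPy sketch of the power iteration (the function name, the random start, and the Rayleigh-quotient stopping rule are my own choices, not prescribed by the notes); since only matrix-vector products A @ b are used, the same code works unchanged for scipy.sparse matrices:

import numpy as np

def power_method(A, num_iter=1000, tol=1e-10, seed=0):
    """Estimate the dominant eigenvalue and a corresponding eigenvector of A."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(A.shape[0])    # random start, so c_1 != 0 almost surely
    b /= np.linalg.norm(b)
    lam_old = np.inf
    for _ in range(num_iter):
        Ab = A @ b                         # the only expensive step: a matrix-vector product
        b = Ab / np.linalg.norm(Ab)
        lam = b @ (A @ b)                  # Rayleigh quotient estimate of lambda_1
        if abs(lam - lam_old) < tol:
            break
        lam_old = lam
    return lam, b

As a quick check, power_method(np.array([[2., 1.], [1., 3.]])) should report a dominant eigenvalue of approximately 3.618, i.e., $(5 + \sqrt{5})/2$.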
Applications: PageRank (the transition matrix is usually sparse); PCA.
Sketching
The idea of sketching in linear regression is to construct a matrix $\Phi \in \mathbb{R}^{r \times n}$ and solve the following sketched-OLS problem instead of the Ordinary Least Squares (OLS) problem:
$$
b_s = \arg\min_{c \in \mathbb{R}^d} \ \|\Phi (Xc - y)\|_2^2,
$$
where $\Phi = S^\top H D$ is called the sketching matrix (a.k.a. the subsampled randomized Hadamard transform),
$$
\Phi = S_{n \times r}^\top \, H_{n \times n} \, D_{n \times n},
$$
and $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$. The sketched solution is
$$
b_s = \big((\Phi X)^\top (\Phi X)\big)^{-1} (\Phi X)^\top \Phi y = (\Phi X)^{+} \Phi y,
$$
where $(\cdot)^{+}$ denotes the Moore-Penrose pseudo-inverse.
Design of Matrices S,H,D
$S_{n \times r}$: The function of the sub-sampling matrix $S$ (applied as $S^\top$ from the left) is to subsample the row vectors of a matrix in $\mathbb{R}^{n \times d}$ and obtain a matrix in $\mathbb{R}^{r \times d}$, with an extra scaling factor $\sqrt{n/r}$.
Let $S$ be an empty matrix. For $t = 1, \ldots, r$ (i.i.d. trials with replacement), select uniformly at random an integer from $\{1, 2, \ldots, n\}$. If $i$ is selected, append the column vector $\sqrt{n/r}\, e_i$ to $S$, where $e_i \in \mathbb{R}^n$ is the sparse one-hot vector with only the $i$-th entry equal to 1.
$H_{n \times n}$: The normalized Hadamard transform matrix
The normalized Hadamard transform matrix $H_{2^m} \in \mathbb{R}^{2^m \times 2^m}$ is defined recursively as follows:
$$
H_1 = (+1), \qquad
H_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix}, \qquad
H_{2^m} = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{2^{m-1}} & H_{2^{m-1}} \\ H_{2^{m-1}} & -H_{2^{m-1}} \end{pmatrix}.
$$
$D_{n \times n}$: A random diagonal matrix with the $D_{ii}$ i.i.d. uniformly selected from $\{+1, -1\}$.
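To make these definitions concrete, the following is a small, deliberately dense and inefficient construction of $S$, $H$, and $D$ (a sketch of my own, intended only for checking the efficient matrix-free implementation discussed below on small power-of-two $n$; it uses scipy.linalg.hadamard for the unnormalized Hadamard matrix):

import numpy as np
from scipy.linalg import hadamard

def build_sketching_matrix(n, r, seed=0):
    """Explicitly form Phi = S^T H D (n must be a power of 2); for illustration only."""
    rng = np.random.default_rng(seed)
    # S: n x r, column t equals sqrt(n/r) * e_i for a uniformly sampled index i (with replacement)
    idx = rng.integers(0, n, size=r)
    S = np.zeros((n, r))
    S[idx, np.arange(r)] = np.sqrt(n / r)
    # H: normalized Hadamard matrix; hadamard(n) is unnormalized, so divide by sqrt(n)
    H = hadamard(n) / np.sqrt(n)
    # D: diagonal matrix of i.i.d. random signs
    D = np.diag(rng.choice([-1.0, 1.0], size=n))
    return S.T @ H @ D                 # the r x n sketching matrix Phi

Forming $\Phi$ explicitly like this costs $O(n^2)$ time and memory, which is exactly what the efficient implementation described in the Complexity Analysis below avoids.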
Complexity Analysis
Let us start by considering the computation of the exact solution, given by
$$
b = (X^\top X)^{-1} X^\top y.
$$
The complexity of the matrix multiplication $X^\top X$ is $O(nd^2)$. The complexity of the matrix-vector product $X^\top y$ is $O(nd)$. The complexity of inverting a $d \times d$ matrix is usually $O(d^3)$ (though it can be lowered). To obtain $b$, the matrix inversion is not actually required; one can instead solve $b$ from the following system of linear equations:
$$
(X^\top X)\, b = X^\top y.
$$
Solving a system of $d$ linear equations has a complexity of at most $O(d^3)$; the best algorithm known to date has complexity $O(d^{2.376})$. Therefore, in the worst case, the complexity of computing the exact $b$ is $O(nd^2 + d^3)$.
- OLS: $O(nd^2 + d^3)$
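In code, avoiding the explicit inverse simply means handing the normal equations to a linear solver; a minimal NumPy sketch (the function name is my own):

import numpy as np

def ols(X, y):
    """Exact OLS via the normal equations, without forming (X^T X)^{-1} explicitly."""
    XtX = X.T @ X                      # O(n d^2)
    Xty = X.T @ y                      # O(n d)
    return np.linalg.solve(XtX, Xty)   # O(d^3) linear solve instead of an explicit inverse

(In practice np.linalg.lstsq(X, y, rcond=None) is often preferred numerically, since it never forms $X^\top X$ and so avoids squaring the condition number.)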
One may easily see that sketched-OLS requires $O(rd^2 + d^3)$ computational complexity. However, for sketched-OLS we also need to consider the complexity of computing $\tilde{X} = \Phi X$. If we create the matrices $S, H, D$ explicitly and calculate $\Phi = S^\top H D$, the time and space complexity of obtaining $\tilde{X}$ will be large. For an efficient implementation, we can instead think of $\Phi$ as the composition of three linear transformations.
Given a column vector $(\cdot)_{n \times 1}$, the linear transformations $D_{n \times n}$, $H_{n \times n}$, $(S^\top)_{r \times n}$ conduct the following operations on $(\cdot)_{n \times 1}$:
- $D_{n \times n}(\cdot)_{n \times 1}$: It independently multiplies each element of $(\cdot)_{n \times 1}$ by a random sign uniformly selected from $\{+1, -1\}$.
- $H_{n \times n}(\cdot)_{n \times 1}$: Walsh-Hadamard transform (WHT). A naive implementation of the WHT of order $n = 2^m$ would have a computational complexity of $O(n^2)$. In computational mathematics, there is an efficient algorithm to compute the WHT, called the Hadamard-ordered fast Walsh-Hadamard transform. It is a divide-and-conquer algorithm that recursively breaks down a WHT of size $n$ into two smaller WHTs of size $n/2$, which requires only $O(n \log n)$ additions and subtractions:
$$
H_n \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
= \frac{1}{\sqrt{2}} \begin{pmatrix} H_{n/2} b_1 + H_{n/2} b_2 \\ H_{n/2} b_1 - H_{n/2} b_2 \end{pmatrix}.
$$
Python example code for an iterative in-place FWHT can be found here. For HW, you can also use
sympy.discrete.transforms.fwht(seq)
for convenience. (sympy.fwht is NOT efficient at all; please refer to the Remark below.)
- $(S^\top)_{r \times n} (\cdot)_{n \times 1} = \sqrt{n/r}\, [e_{i_1}, \cdots, e_{i_r}]^\top (\cdot)_{n \times 1} = \sqrt{n/r}\, [e_{i_1}^\top (\cdot), \cdots, e_{i_r}^\top (\cdot)]^\top$: It subsamples the $n$ elements of $(\cdot)$ to obtain a column vector of size $r$, with an extra scaling factor $\sqrt{n/r}$.
Overall, computing $\Phi(\cdot)_{n \times 1}$ takes $O(n \log n)$ time (note that $r \ll n$). Therefore, computing $\Phi X_{n \times d}$ takes $O(dn \log n)$.
- Sketched-OLS: $O(nd(d + \log n) + d^3)$
Thus, sketched-OLS is advantageous in the regime where $d$ is small but $d^2 \leq n \leq e^d$.
Remark
Even though sketched-OLS has a lower time complexity in $n$ and $d$ than vanilla OLS in the regime $d^2 \leq n \leq e^d$, as shown above, different implementations (machine code) will have different runtimes, corresponding to different constants hidden in the $O$ notation. Optimized machine code compiled from C/C++ is usually much more efficient than code executed by the Python interpreter. In HW1 Problem 6, to implement an efficient sketched-OLS algorithm in Python, the easiest way I have found so far is to use Numba to optimize the Python code. Specifically, to accelerate the Hadamard transform function, all you need to do is to import numba
and add @jit(nopython=True)
to the function.
from numba import jit
@jit(nopython=True)
def fwht(a) -> None:
    """In-place Fast Walsh-Hadamard Transform of array a (length must be a power of 2).

    Note: this computes the unnormalized transform; divide by sqrt(len(a)) to get H a.
    """
    h = 1                          # current half-block size
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                # butterfly update combining the two halves of each block
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
The above code is much faster: it takes only about 0.15 seconds to transform a vector of size $2^{20}$, whereas the same function without numba takes about 4.5 seconds. From my preliminary test, the whole runtime of sketched-OLS in this implementation (including obtaining $\Phi X, \Phi y$) will still exceed the runtime of vanilla OLS for Problem 6 in HW1. But it is already fast enough for a simple Python implementation; it takes about 3 seconds to obtain the results.
Other C/C++-based implementations can make the transform up to 10 times faster than the numba implementation; see fastwht. However, this requires familiarity with compilers and command-line tools. Motivated students can give it a try.
In addition, the transform $\Phi$ applied to $X$ and $y$ can be implemented in parallel, which can further reduce the runtime.
For the runtime of sketched-OLS in Problem 6 of HW1, you can split it into two parts: the calculation time of $b_s$ and the calculation time of $\Phi X, \Phi y$.
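Putting the pieces together, here is one possible end-to-end sketch of the sketched-OLS pipeline, reusing the fwht function above; the helper name apply_sketch, the column-stacking of X and y (so they share the same $\Phi$), and the use of np.linalg.lstsq are my own choices rather than a prescribed implementation, and n is assumed to be a power of 2:

import numpy as np

def apply_sketch(M, r, seed=0):
    """Apply Phi = S^T H D to the columns of M (M has n rows, n a power of 2)."""
    n = M.shape[0]
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=n)
    M = signs[:, None] * M                 # D: flip the sign of each row at random
    for j in range(M.shape[1]):            # H: in-place FWHT of each column
        fwht(M[:, j])
    M /= np.sqrt(n)                        # normalize, since fwht computes the unnormalized transform
    idx = rng.integers(0, n, size=r)       # S^T: subsample r rows with replacement
    return np.sqrt(n / r) * M[idx, :]

# Usage sketch: sketch X and y together so they share the same Phi, then solve the small problem.
# Z = apply_sketch(np.column_stack([X, y]), r)
# b_s = np.linalg.lstsq(Z[:, :-1], Z[:, -1], rcond=None)[0]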