STA 243 Computational Statistics

Discussion 1: Basic Linear Algebra

TA: Tesi Xiao

Course Overview

  • The goal of this course is to introduce computational and statistical techniques for the analysis of big data. The main emphasis of this course is on understanding the main principles behind the computational algorithms used in statistics; the class will be about theoretical details and not just high-level ideas. The course will cover:
    • Randomized Algorithms in Linear Algebra
      • Randomized Matrix Multiplication
      • Eigenvector Computation: Power Method
      • Randomized Algorithms for Least Squares: Sketching
    • Optimization Algorithms (Frequentist)
      • Convex Optimization
      • Non-convex Optimization
    • Sampling Algorithms (Bayesian)
      • Rejection Sampling, Importance Sampling, etc.
      • Markov Chain Monte Carlo
  • Python is the only programming language used in this class. If you are unfamiliar with the basic syntax (such as the control flow tools), please go over Sections 1–6 of the Python tutorial during the first several weeks. Some familiarity with object-oriented programming (OOP) is also preferred; if you want to learn OOP in Python, take time to read Section 9: Classes.

  • Discussions will cover supplementary material to the lectures and homework-related questions.

Matrix = Linear Transformation

A matrix $\mathbf{A}\in \mathbb{R}^{m\times n}$ is a linear transformation mapping a vector in $\mathbb{R}^n$ to a vector in $\mathbb{R}^m$:

\[\mathbf{y}=\mathbf{A}\mathbf{x}\]
  • Square Matrix ($m=n$)
    • $\text{Rank}(\mathbf{A})\leq n$
    • Eigenvalues, Eigenvectors
    • e.g. Rotation Matrix, Householder Matrix (reflection); see the NumPy sketch after this list
  • Rectangular matrix ($m\neq n$)
    • $\text{Rank}(\mathbf{A})\leq \min(m, n)$
    • Singular values, Singular vectors
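
As a concrete illustration, the sketch below applies a rotation matrix (square case) and a rectangular matrix in NumPy; the specific matrices and the 90° angle are illustrative choices, not from the lecture.

```python
import numpy as np

# Square case: a 2x2 rotation matrix maps R^2 to R^2.
theta = np.pi / 2                       # rotate by 90 degrees (illustrative)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = np.array([1.0, 0.0])
print(A @ x)                            # y = Ax, approximately [0., 1.]

# Rectangular case: a 2x3 matrix maps R^3 to R^2, so rank(B) <= min(2, 3).
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
print(B @ np.ones(3))                   # a vector in R^2: [3., 2.]
print(np.linalg.matrix_rank(B))         # 2
```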

Eigenvalues & Eigenvectors

Let the square matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ be a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^n$, and let $\mathbf{e}\in\mathbb{R}^n$ be a nonzero vector. Then $\mathbf{e}$ is an eigenvector of $\mathbf{A}$ if $\mathbf{A}\mathbf{e}$ is a scalar multiple of $\mathbf{e}$. This can be written as

\[\mathbf{A}\mathbf{e} = \lambda \mathbf{e}\]
  • Eigenvector $\mathbf{e}$ of the linear transformation $\mathbf{A}$: a non-zero vector that changes at most by a scalar factor $\lambda$ when the linear transformation $\mathbf{A}$ is applied to it;
  • Eigenvalue $\lambda$ associated with $\mathbf{e}$: the scaling factor, which can be negative, zero, or complex.
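
A quick numerical check of the defining relation, using an arbitrary (illustrative) symmetric matrix, plus a rotation matrix whose eigenvalues are complex:

```python
import numpy as np

# Check A e = lambda e for an eigenpair returned by NumPy (illustrative matrix).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)     # columns of eigvecs are eigenvectors
e, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ e, lam * e))      # True: A e is a scalar multiple of e

# A real rotation matrix has complex eigenvalues (here +i and -i).
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
print(np.linalg.eig(R)[0])              # [0.+1.j  0.-1.j]
```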

For example, let $\mathbf{A}\in \mathbb{R}^{n\times n}$ be a symmetric matrix. Its eigenvectors $\mathbf{e}_1, \cdots, \mathbf{e}_n$ can be chosen orthonormal, so one can diagonalize it as

\[\mathbf{A} = [\mathbf{e}_1, \cdots, \mathbf{e}_n] \begin{bmatrix} \lambda_1 & &\\ & \ddots & \\ & & \lambda_n \end{bmatrix} [\mathbf{e}_1, \cdots, \mathbf{e}_n]^\top = \lambda_1 \mathbf{e}_1 \mathbf{e}_1^\top + \cdots + \lambda_n \mathbf{e}_n \mathbf{e}_n^\top\]
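
A minimal sketch of this spectral decomposition with `numpy.linalg.eigh` (the matrix below is an arbitrary symmetric example):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])              # illustrative symmetric matrix
lam, E = np.linalg.eigh(A)              # eigh is for symmetric/Hermitian matrices

# A = E diag(lam) E^T = sum_i lam_i e_i e_i^T
print(np.allclose(A, E @ np.diag(lam) @ E.T))                             # True
print(np.allclose(A, sum(l * np.outer(e, e) for l, e in zip(lam, E.T))))  # True
```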

Singular values & Singular vectors

A non-negative real number $\sigma$ is a singular value of $\mathbf{M}\in \mathbb{R}^{m\times n}$ if and only if there exist unit-length vectors $\mathbf{u}\in \mathbb{R}^{m}$ and $\mathbf{v}\in \mathbb{R}^{n}$ such that

\[\mathbf{M}\mathbf{v} = \sigma \mathbf{u}, \qquad \mathbf{M}^\top \mathbf{u} = \sigma \mathbf{v}\]

The vectors $\mathbf{u}$ and $\mathbf{v}$ are called left-singular and right-singular vectors for $\sigma$, respectively.
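
These two relations can be verified directly from NumPy's SVD; the 2×3 matrix below is an arbitrary example:

```python
import numpy as np

M = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])          # illustrative 2x3 matrix
U, s, Vt = np.linalg.svd(M)

u, v, sigma = U[:, 0], Vt[0, :], s[0]    # leading singular triple
print(np.allclose(M @ v, sigma * u))     # M v = sigma u  -> True
print(np.allclose(M.T @ u, sigma * v))   # M^T u = sigma v -> True
```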

Singular value decomposition (SVD)

In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any $m\times n$ matrix. Let $\mathbf{M}\in \mathbb{R}^{m\times n}$ and $p=\min(m,n)$. Then

\[\mathbf{M} = \mathbf{U}\boldsymbol{\Sigma} \mathbf{V}^\top = \sigma_1 \mathbf{u}_1\mathbf{v}_1^\top + \cdots + \sigma_p \mathbf{u}_p \mathbf{v}_p^\top\]

where \(\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_m] \in\mathbb{R}^{m\times m}, \mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_n] \in \mathbb{R}^{n\times n}\) such that \(\mathbf{U}^\top \mathbf{U} = \mathbf{I}_{m\times m}, \mathbf{V}^\top \mathbf{V} = \mathbf{I}_{n\times n},\) and $\boldsymbol{\Sigma}\in \mathbb{R}^{m\times n}$ is a diagonal matrix in the form of

\[\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_p & \\ & & & \end{bmatrix}_{m\times n}.\]

By convention, the singular values are ordered so that $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0$. We also have the following relations:

\[\sigma_i(\mathbf{M}) = \sqrt{\lambda_i (\mathbf{M}^\top \mathbf{M})} = \sqrt{\lambda_i (\mathbf{M} \mathbf{M}^\top)},\]

where $\lambda_i(\mathbf{A})$ is the $i$-th largest eigenvalue of the matrix $\mathbf{A}$.
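
The sketch below checks this relation on a random matrix (the shape and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))          # illustrative random 4x3 matrix

s = np.linalg.svd(M, compute_uv=False)   # sigma_1 >= ... >= sigma_p
lam = np.linalg.eigvalsh(M.T @ M)[::-1]  # eigenvalues of M^T M, decreasing
print(np.allclose(s, np.sqrt(lam)))      # True
```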

Norm

Vector Norm: $\mathbf{x} = [x_1, x_2, \cdots, x_n ]^\top\in \mathbb{R}^n$

  • $\ell_p$-norm: \(\|\mathbf{x}\|_p = (\sum_{i=1}^n \vert x_i\vert ^p)^{1/p}\)
    • $\ell_1$-norm: \(\|\mathbf{x}\|_1 = \sum_{i=1}^n \vert x_i\vert\)
    • $\ell_2$-norm: \(\|\mathbf{x}\|_2 = (\sum_{i=1}^{n} x_i^2)^{1/2}\)
    • $\ell_\infty$-norm: \(\|\mathbf{x}\|_{\infty} = \max_{1\leq i\leq n} \vert x_i \vert\)
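
All three norms are available through `numpy.linalg.norm`; the vector below is an arbitrary example:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])     # illustrative vector
print(np.linalg.norm(x, 1))        # l1 norm: 7.0
print(np.linalg.norm(x, 2))        # l2 norm: 5.0
print(np.linalg.norm(x, np.inf))   # l-infinity norm: 4.0
```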

Matrix Norm: $\mathbf{M} \in \mathbb{R}^{m\times n}$

  • Matrix norms induced by vector norms \(\|\mathbf{M}\|_p = \sup_{\mathbf{x}\neq \mathbf{0}} \frac{\|\mathbf{M}\mathbf{x}\|_p}{\|\mathbf{x}\|_p}\):
    • $p=2$: Operator norm \(\| \mathbf{M} \|_2 = \|\mathbf{M} \|_{\text{op}} = \sup_{\mathbf{x}\neq \mathbf{0}} \frac{\|\mathbf{M}\mathbf{x}\|_2}{\|\mathbf{x}\|_2}\)
  • “Entry-wise” matrix norms:
    • Treat the matrix as a vector of size $m\cdot n$: \(\|\mathbf{M}\|_{p,p} = \| \text{vec}(\mathbf{M}) \|_p\)
    • $L_{p,q}$ norm: \(\|\mathbf{M}\|_{p,q} = \left(\sum_{j=1}^{n} \left(\sum_{i=1}^{m} | M_{i,j} |^p\right)^{\frac{q}{p}} \right)^{\frac{1}{q}}\)
      • $p=q=2$: Frobenius norm \(\|\mathbf{M}\|_F = (\sum_{i,j} M_{i,j}^2)^{1/2}\)
      • $p=q=\infty$: Max norm \(\|\mathbf{M}\|_{\max} = \max_{i,j} | M_{i,j} |\)
  • Schatten $p$-norms: \(\|\mathbf{M}\|_p = \left(\sum_{i=1}^{\min(m,n)} \sigma_i^p\right)^{1/p}\)
    • $p=1$: Nuclear norm \(\|\mathbf{M}\|_{*} = \sum_{i=1}^{\min(m,n)} \sigma_i(\mathbf{M}) = \text{trace}(\sqrt{\mathbf{M}^\top \mathbf{M}})\)
    • $p=2$: Frobenius norm
    • $p=\infty$: Operator norm
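
A quick NumPy check of several of the matrix norms above on an arbitrary 2×2 example; note that the operator, Frobenius, and nuclear norms can all be read off the singular values:

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])                # illustrative matrix
s = np.linalg.svd(M, compute_uv=False)    # singular values, decreasing

# Operator norm (induced 2-norm) = largest singular value.
print(np.isclose(np.linalg.norm(M, 2), s[0]))                       # True
# Frobenius norm = entrywise l2 norm = Schatten 2-norm.
print(np.isclose(np.linalg.norm(M, 'fro'), np.sqrt(np.sum(s**2))))  # True
# Nuclear norm (Schatten 1-norm) = sum of singular values.
print(np.isclose(np.linalg.norm(M, 'nuc'), s.sum()))                # True
# Max norm = largest absolute entry.
print(np.abs(M).max())                                              # 4.0
```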