STA 141C Big-data and Statistical Computing
Discussion 4: Spectral Clustering
TA: Tesi Xiao
Graph
A graph is made up of vertices (nodes/points) which are connected edges (links/lines). Mathematically speaking, a graph is an ordered pair G=(V,E)
- G=(V,E)
- V: the set of vertices
- E⊆{(x,y)|x,y∈V and x≠y} (for simplicity, we do not consider self-loops here)
- Undirected graph: (x,y) is an unordered pair;
- Directed graph: (x,y) is an ordered pair.
- Weighted graph G=(V,E,W): A weighted graph (or a network) is a graph with weighted edges. That is to say, a number (the weight) wij∈W is assigned to the edge (i,j)∈E. wij=0 if (i,j)∉E.
- |V|: the cardinality of set V / the number of vertices
- |E|: the cardinality of set E / the number of edges
- Weight Matrix W: a |V|×|V| matrix where Wi,j=wij if (i,j)∈E else Wi,j=0. For undirected graphs, W is symmetric, i.e., wij=wji.
Note that for unweigted graph, one can assign 1 to all the edges, then W would become the adjacency matrix A where Ai,j=1 if (i,j)∈E else Ai,j=0
- Degree: For undirected graphs, the degree of a vertix v∈V denoted by deg(v) is defined as deg(v)=∑j∈Vwvj. For directed graphs, we can similarly define the in-degree and the out-degree of a vertex. See Wikipage.
Commonly, we use D to represent the degree matrix, which is a diagonal matrix including the degrees of vertices.
- Laplacian Matrix for weighted graphs
- Laplacian Matrix L: L=D−W
- Symmetric Normalized Laplacian ¯L=D−1/2LD−1/2=I−D−1/2WD−1/2
Netwokx
NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. Check out Tutorials
# a simple example of using networkx
import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()
G.add_nodes_from([1,2,3])
G.add_node(4)
G.add_edges_from([(1,2), (2,3), (1,3)])
G.add_edge(3,4)
nx.draw(G, with_labels=True)
Graph Cut
Partitioning into two clusters
- Naive Balanced Cut: minV1,V2 cut(V1,V2):=∑vi∈V1,vj∈V2Wij s.t. |V1|=|V2| and V1∪V2=V,V1∩V2=∅
- Ratio-Cut: minV1,V2 RC(V1,V2):={cut(V1,V2)|V1|+cut(V1,V2)|V2|}
- Normalized-Cut: minV1,V2 NC(V1,V2):={cut(V1,V2)deg(V1)+cut(V1,V2)deg(V2)}
In the orginal paper about Normalized-Cut, they call deg(V1) as assoc(V1,V), which is the total connections from nodes in V1 to all nodes in the graph. See paper.
Generalization to k clusters
- Ratio-Cut: minV1,...,Vk∑kc=1cut(Vc,V−Vc)|Vc|
- Normalized-Cut: minV1,...,Vk∑kc=1cut(Vc,V−Vc)deg(Vc)
Spectral Clustering
The above graph-cut problems are hard to solve. However, fortunately, one can derive relaxed versions of those problems.
- Relaxed to the real-valued Rayleigh quotient minimization problem
- Ratio-Cut: miny1,...,yk∑kc=1y⊤cLycy⊤cyc
- Normalized-Cut: miny1,...,yk∑kc=1y⊤c¯Lycy⊤cyc
- Remark: yc is a vector indicates which vertices belong to the c-th cluster, thus satifying certain constraints. Putting these ycs together into a matrix, we will get a problem in the form of minY⊤Y=I=Trace(Y⊤LY)(Ratio-Cut) or Trace(Y⊤¯LY)(Normalized-Cut) Then we are able to solve it by finding eigenvectors corresponding the smallest k eigenvalues of L. However, the obtained Y may not satisfy the contraints that Y should exactly indicate the partition. We run K-means algorithm on the rows of Y to obtain the final results.
- Summary: We convert the original problem into a eigenvalue problem of the graph Laplacian.