STA 141C Big-data and Statistical Computing

Discussion 4: Spectral Clustering

TA: Tesi Xiao

Graph

A graph is made up of vertices (nodes/points) which are connected edges (links/lines). Mathematically speaking, a graph is an ordered pair G=(V,E)

  • G=(V,E)
  • V: the set of vertices
  • E{(x,y)|x,yV and xy} (for simplicity, we do not consider self-loops here)
    • Undirected graph: (x,y) is an unordered pair;
    • Directed graph: (x,y) is an ordered pair.
  • Weighted graph G=(V,E,W): A weighted graph (or a network) is a graph with weighted edges. That is to say, a number (the weight) wijW is assigned to the edge (i,j)E. wij=0 if (i,j)E.
    • |V|: the cardinality of set V / the number of vertices
    • |E|: the cardinality of set E / the number of edges
    • Weight Matrix W: a |V|×|V| matrix where Wi,j=wij if (i,j)E else Wi,j=0. For undirected graphs, W is symmetric, i.e., wij=wji.

Note that for unweigted graph, one can assign 1 to all the edges, then W would become the adjacency matrix A where Ai,j=1 if (i,j)E else Ai,j=0

  • Degree: For undirected graphs, the degree of a vertix vV denoted by deg(v) is defined as deg(v)=jVwvj. For directed graphs, we can similarly define the in-degree and the out-degree of a vertex. See Wikipage.

Commonly, we use D to represent the degree matrix, which is a diagonal matrix including the degrees of vertices.

  • Laplacian Matrix for weighted graphs
    • Laplacian Matrix L: L=DW
    • Symmetric Normalized Laplacian ¯L=D1/2LD1/2=ID1/2WD1/2

Netwokx

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. Check out Tutorials

# a simple example of using networkx

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_nodes_from([1,2,3])
G.add_node(4)

G.add_edges_from([(1,2), (2,3), (1,3)])
G.add_edge(3,4)
nx.draw(G, with_labels=True)

Graph Cut

  • Partitioning into two clusters

    • Naive Balanced Cut: minV1,V2 cut(V1,V2):=viV1,vjV2Wij s.t. |V1|=|V2| and V1V2=V,V1V2=
    • Ratio-Cut: minV1,V2 RC(V1,V2):={cut(V1,V2)|V1|+cut(V1,V2)|V2|}
    • Normalized-Cut: minV1,V2 NC(V1,V2):={cut(V1,V2)deg(V1)+cut(V1,V2)deg(V2)}

    In the orginal paper about Normalized-Cut, they call deg(V1) as assoc(V1,V), which is the total connections from nodes in V1 to all nodes in the graph. See paper.

  • Generalization to k clusters

    • Ratio-Cut: minV1,...,Vkkc=1cut(Vc,VVc)|Vc|
    • Normalized-Cut: minV1,...,Vkkc=1cut(Vc,VVc)deg(Vc)

Spectral Clustering

The above graph-cut problems are hard to solve. However, fortunately, one can derive relaxed versions of those problems.

  • Relaxed to the real-valued Rayleigh quotient minimization problem
    • Ratio-Cut: miny1,...,ykkc=1ycLycycyc
    • Normalized-Cut: miny1,...,ykkc=1yc¯Lycycyc
    • Remark: yc is a vector indicates which vertices belong to the c-th cluster, thus satifying certain constraints. Putting these ycs together into a matrix, we will get a problem in the form of minYY=I=Trace(YLY)(Ratio-Cut) or Trace(Y¯LY)(Normalized-Cut) Then we are able to solve it by finding eigenvectors corresponding the smallest k eigenvalues of L. However, the obtained Y may not satisfy the contraints that Y should exactly indicate the partition. We run K-means algorithm on the rows of Y to obtain the final results.
  • Summary: We convert the original problem into a eigenvalue problem of the graph Laplacian.

Spectral Clustering by scikit learn