Slide 1

Introduction to Graph Cluster Analysis
Slide 2

Outline
- Introduction to Cluster Analysis
- Types of Graph Cluster Analysis
- Algorithms for Graph Clustering
  - k-Spanning Tree
  - Shared Nearest Neighbor
  - Betweenness Centrality Based
  - Highly Connected Components
  - Maximal Clique Enumeration
  - Kernel k-means
- Application
Slide 3

Outline (agenda slide repeated)
Slide 4

What is Cluster Analysis?
The process of dividing a set of input data into possibly overlapping subsets, where elements in each subset are considered related by some similarity measure.

[Figure: the same points grouped two ways – by shape (2 clusters) and by color (3 clusters)]
Slide 5

Applications of Cluster Analysis
Summarization: provides a macro-level view of the data set, e.g., clustering precipitation in Australia.

From Tan, Steinbach, Kumar, Introduction to Data Mining, Addison-Wesley, 1st Edition.
Slide 6

Outline (agenda slide repeated)
Slide 7

What is Graph Clustering?
Types:
- Between-graph: clustering a set of graphs
- Within-graph: clustering the nodes/edges of a single graph
Slide 8

Between-graph Clustering
Between-graph clustering methods divide a set of graphs into different clusters.
E.g., a set of graphs representing chemical compounds can be grouped into clusters based on their structural similarity.
Slide 9

Within-graph Clustering
Within-graph clustering methods divide the nodes of a graph into clusters.
E.g., in a social networking graph, these clusters could represent people with the same or similar hobbies.
Note: In this chapter we will look at different algorithms to perform within-graph clustering.
Slide 10

Outline (agenda slide repeated)
Slide 11

k-Spanning Tree
STEPS:
1. Obtain the minimum spanning tree (MST) of the input graph G
2. Remove k-1 edges from the MST
3. The result is k clusters

[Figure: the example MST on vertices 1-5 is split by removing k-1 edges into k groups of non-overlapping vertices]
Slide 12

What is a Spanning Tree?
A connected subgraph with no cycles that includes all vertices in the graph.
Note: an edge weight can represent either the distance or the similarity between its two vertices.

[Figure: the weighted example graph G on vertices 1-5 and one of its spanning trees, with total weight 17]
Slide 13

What is a Minimum Spanning Tree (MST)?
The spanning tree of a graph with the minimum possible sum of edge weights, if the edge weights represent distance.
Note: take the maximum possible sum of edge weights if the edge weights represent similarity.

[Figure: graph G and three of its spanning trees, with total weights 11 (the MST), 13, and 17]
Slide 14

Algorithm to Obtain the MST: Prim's Algorithm
Given input graph G:
1. Select a vertex at random (e.g., vertex 5) and initialize an empty tree T with it
2. Select the list of edges L from G such that at most ONE vertex of each edge is in T
3. From L, select the edge X with minimum weight and add X to T
4. Repeat steps 2-3 until all vertices are added to T

[Figure: Prim's algorithm growing T from vertex 5 on the example graph]
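The steps above can be sketched in code. The chapter's examples use R; this is an illustrative Python version, and the example graph below is hypothetical (the slides' full edge list is not recoverable from the text).

```python
import heapq

def prim_mst(n, edges):
    """Grow a tree T from an arbitrary start vertex, repeatedly adding
    the minimum-weight edge with exactly ONE endpoint in T (the steps
    on the slide). Vertices are 0..n-1; edges are (u, v, weight)."""
    adj = {v: [] for v in range(n)}
    for u, v, w in edges:
        adj[u].append((w, u, v))
        adj[v].append((w, v, u))
    in_tree = {0}                       # arbitrary start vertex
    frontier = list(adj[0])             # candidate edges leaving T
    heapq.heapify(frontier)
    mst = []
    while frontier and len(in_tree) < n:
        w, u, v = heapq.heappop(frontier)
        if v in in_tree:                # both endpoints already in T: skip
            continue
        in_tree.add(v)
        mst.append((u, v, w))
        for e in adj[v]:                # new candidate edges out of v
            if e[2] not in in_tree:
                heapq.heappush(frontier, e)
    return mst

# A small hypothetical weighted graph on 5 vertices
edges = [(0, 1, 2), (0, 2, 4), (1, 2, 3), (1, 3, 6),
         (2, 3, 2), (2, 4, 5), (3, 4, 4)]
mst = prim_mst(5, edges)
print(sum(w for _, _, w in mst))  # total MST weight
```

A spanning tree of an n-vertex graph always has n-1 edges, so the loop stops after n-1 successful additions.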
Slide 15

k-Spanning Tree
Remove the k-1 edges with the highest weight from the minimum spanning tree.
Note: k is the number of clusters.

[Figure: for k=3, removing the 2 highest-weight edges from the example MST on vertices 1-5 leaves 3 clusters]
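The clustering step itself is a one-liner plus a connected-components pass. A Python sketch (the chapter's own code is R; the MST below is from a hypothetical 5-vertex graph):

```python
def k_spanning_tree_clusters(n, mst_edges, k):
    """Drop the k-1 heaviest MST edges; the connected components of
    what remains are the k clusters. mst_edges are (u, v, weight)."""
    kept = sorted(mst_edges, key=lambda e: e[2])[:len(mst_edges) - (k - 1)]
    parent = list(range(n))             # union-find over the kept edges
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for u, v, _ in kept:
        parent[find(u)] = find(v)
    clusters = {}
    for v in range(n):
        clusters.setdefault(find(v), set()).add(v)
    return sorted(clusters.values(), key=min)

# MST of a hypothetical 5-vertex graph (vertices 0..4)
mst = [(0, 1, 2), (1, 2, 3), (2, 3, 2), (3, 4, 4)]
print(k_spanning_tree_clusters(5, mst, k=3))  # → [{0, 1}, {2, 3}, {4}]
```

Because a tree on n vertices has n-1 edges, removing k-1 of them always leaves exactly k components.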
Slide 16

k-Spanning Tree R code

library(GraphClusterAnalysis)
library(RBGL)
library(igraph)
library(graph)

data(MST_Example)
G = graph.data.frame(MST_Example, directed=FALSE)
E(G)$weight = E(G)$V3
MST_PRIM = minimum.spanning.tree(G, weights=G$weight, algorithm="prim")
OutputList = k_clusterSpanningTree(MST_PRIM, 3)
Clusters = OutputList[[1]]
outputGraph = OutputList[[2]]
Clusters
Slide 17

Outline (agenda slide repeated)
Slide 18

Shared Nearest Neighbor Clustering
STEPS:
1. Obtain the shared nearest neighbor (SNN) graph of the input graph G
2. Remove edges from the SNN graph with weight less than τ

[Figure: the SNN graph on vertices 0-4 is split into groups of non-overlapping vertices]
Slide 19

What is a Shared Nearest Neighbor?
(Refresher from the Proximity chapter)
Shared nearest neighbor is a proximity measure: it denotes the number of neighbor nodes common between any given pair of nodes u and v.
Slide 20

Shared Nearest Neighbor (SNN) Graph
Given input graph G, weight each edge (u,v) with the number of shared nearest neighbors between u and v.
E.g., node 0 and node 1 have 2 neighbors in common (node 2 and node 3), so edge (0,1) gets weight 2.

[Figure: graph G on vertices 0-4 and its SNN graph, with edge weights 1, 2, and 3]
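The construction reduces to a neighborhood intersection per edge. A Python sketch (the chapter's examples are in R); the adjacency below is reconstructed by assumption, since the slide's figure is not in the text:

```python
def snn_graph(adj):
    """Weight each edge (u, v) of G by the number of neighbors its two
    endpoints share (the SNN-graph construction on the slide).
    `adj` maps each vertex to the set of its neighbors."""
    return {(u, v): len(adj[u] & adj[v])
            for u in adj for v in adj[u] if u < v}   # each edge once

# Assumed adjacency for a 5-vertex example in the spirit of the slide
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3, 4},
       3: {0, 1, 2, 4}, 4: {2, 3}}
snn = snn_graph(adj)
print(snn[(0, 1)])  # nodes 0 and 1 share neighbors 2 and 3 → 2
```

Note that only edges already present in G are weighted; non-adjacent pairs get no SNN edge in this variant.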
Slide 21

Shared Nearest Neighbor Clustering: Jarvis-Patrick Algorithm
In the SNN graph of the input graph G, if u and v share at least τ neighbors, place them in the same cluster.

[Figure: the example SNN graph on vertices 0-4 thresholded with τ = 3]
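A minimal Python sketch of the thresholding step (the chapter uses R). Following slide 18's rule, edges with SNN weight below τ are removed (I read "keep if at least τ shared neighbors"); the example adjacency is the same assumed 5-vertex graph as before:

```python
def jarvis_patrick(adj, tau):
    """Keep an edge (u, v) only if its endpoints share at least tau
    neighbors, then return the connected components of the pruned
    graph as clusters. `adj` maps vertices to neighbor sets."""
    keep = {v: set() for v in adj}
    for u in adj:
        for v in adj[u]:
            if len(adj[u] & adj[v]) >= tau:   # SNN weight >= tau survives
                keep[u].add(v)
    seen, clusters = set(), []
    for s in adj:                             # components via DFS
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(keep[x] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

# Assumed 5-vertex example; with tau = 3 only the weight-3 edge survives
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3, 4},
       3: {0, 1, 2, 4}, 4: {2, 3}}
print(jarvis_patrick(adj, 3))  # → [{0}, {1}, {2, 3}, {4}]
```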
Slide 22

SNN Clustering R code

library(GraphClusterAnalysis)
library(RBGL)
library(igraph)
library(graph)

data(SNN_Example)
G = graph.data.frame(SNN_Example, directed=FALSE)
tkplot(G)
Output = SNN_Clustering(G, 3)
Output
Slide 23

Outline (agenda slide repeated)
Slide 24

What is Betweenness Centrality?
(Refresher from the Proximity chapter)
Betweenness centrality quantifies the degree to which a vertex (or edge) occurs on the shortest paths between all the other pairs of nodes.
Two types:
- Vertex betweenness
- Edge betweenness
Slide 25

Vertex Betweenness
The number of shortest paths in the graph G that pass through a given node S.
E.g., Sharon is likely a liaison between NCSU and DUKE, and hence many connections between DUKE and NCSU pass through Sharon.
Slide 26

Edge Betweenness
The number of shortest paths in the graph G that pass through a given edge (S, B).
E.g., Sharon and Bob both study at NCSU, and they are the only link between the NY DANCE and CISCO groups.
Vertices and edges with high betweenness form good starting points to identify clusters.
Slide 27

Vertex Betweenness Clustering
Given input graph G, compute the betweenness of each vertex. Repeat until the highest vertex betweenness is ≤ μ:
1. Select the vertex v with the highest betweenness (e.g., vertex 3 with value 0.67)
2. Disconnect the graph at the selected vertex (e.g., vertex 3)
3. Copy the vertex to both components
Slide 28

Vertex Betweenness Clustering R code

library(GraphClusterAnalysis)
library(RBGL)
library(igraph)
library(graph)

data(Betweenness_Vertex_Example)
G = graph.data.frame(Betweenness_Vertex_Example, directed=FALSE)
betweennessBasedClustering(G, mode="vertex", threshold=0.2)
Slide 29

Edge Betweenness Clustering: Girvan and Newman Algorithm
Given input graph G, compute the betweenness of each edge. Repeat until the highest edge betweenness is ≤ μ:
1. Select the edge with the highest betweenness (e.g., edge (3,4) with value 0.571)
2. Disconnect the graph at the selected edge (e.g., (3,4))
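The edge-betweenness computation can be sketched naively by enumerating all pairwise shortest paths (the chapter's code is R; production code would use Brandes' algorithm). The graph below is a hypothetical "two triangles joined by a bridge", not the slides' example:

```python
from collections import deque
from itertools import combinations

def shortest_paths(adj, s, t):
    """All shortest paths from s to t via BFS with predecessor lists."""
    dist, preds = {s: 0}, {s: []}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], preds[v] = dist[u] + 1, [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:     # another equally short route
                preds[v].append(u)
    def build(v):
        return [[v]] if v == s else [p + [v] for u in preds[v] for p in build(u)]
    return build(t) if t in dist else []

def edge_betweenness(adj):
    """Credit each edge with the fraction of s-t shortest paths using it."""
    bc = {}
    for s, t in combinations(adj, 2):
        paths = shortest_paths(adj, s, t)
        for p in paths:
            for u, v in zip(p, p[1:]):
                e = (min(u, v), max(u, v))
                bc[e] = bc.get(e, 0) + 1 / len(paths)
    return bc

# Hypothetical barbell graph: triangles {0,1,2} and {3,4,5}, bridge (2,3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bc = edge_betweenness(adj)
print(max(bc, key=bc.get))  # → (2, 3)
```

Girvan-Newman would now remove the bridge (2, 3), recompute betweenness on the remainder, and repeat until the highest value drops to μ or below.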
Slide 30

Edge Betweenness Clustering R code

library(GraphClusterAnalysis)
library(RBGL)
library(igraph)
library(graph)

data(Betweenness_Edge_Example)
G = graph.data.frame(Betweenness_Edge_Example, directed=FALSE)
betweennessBasedClustering(G, mode="edge", threshold=0.2)
Slide 31

Outline (agenda slide repeated)
Slide 32

What is a Highly Connected Subgraph?
Requires the following definitions:
- Cut
- Minimum edge cut (MinCut)
- Edge connectivity (EC)
Slide 33

Cut
The set of edges whose removal disconnects a graph.

[Figure: a 9-vertex example graph (vertices 0-8) with two example cuts: Cut = {(0,1),(1,2),(1,3)} and Cut = {(3,5),(4,2)}]
Slide 34

Minimum Cut
The minimum set of edges whose removal disconnects a graph.
E.g., MinCut = {(3,5),(4,2)} for the example graph.
Slide 35

Edge Connectivity (EC)
The minimum NUMBER of edges that will disconnect a graph.
E.g., MinCut = {(3,5),(4,2)}, so EC = |MinCut| = |{(3,5),(4,2)}| = 2.
Slide 36

Highly Connected Subgraph (HCS)
A graph G = (V,E) is highly connected if EC(G) > |V|/2.
E.g., for the 9-vertex example graph, EC(G) = 2 and |V|/2 = 9/2; since 2 > 9/2 is false, G is NOT a highly connected subgraph.
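The HCS test is easy to check directly on toy graphs. An illustrative Python sketch (the chapter's code is R); the brute-force edge-connectivity search below is exponential and only meant for tiny examples:

```python
from itertools import combinations

def connected(nodes, edges):
    """DFS connectivity check over an undirected edge list."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(adj[x] - seen)
    return len(seen) == len(nodes)

def edge_connectivity(nodes, edges):
    """EC(G): minimum number of edges whose removal disconnects G.
    Brute force over candidate cuts, smallest first."""
    for k in range(len(edges) + 1):
        for cut in combinations(edges, k):
            rest = [e for e in edges if e not in cut]
            if not connected(nodes, rest):
                return k
    return len(edges)

def is_highly_connected(nodes, edges):
    """The slide's HCS condition: EC(G) > |V| / 2."""
    return edge_connectivity(nodes, edges) > len(nodes) / 2

# A triangle is highly connected: EC = 2 > 3/2
print(is_highly_connected({0, 1, 2}, [(0, 1), (1, 2), (0, 2)]))   # → True
# A 5-cycle is not: EC = 2, but 2 > 5/2 is false
print(is_highly_connected({0, 1, 2, 3, 4},
                          [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]))  # → False
```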
Slide 37

HCS Clustering
Given input graph G:
1. Find the minimum cut MinCut(G), e.g., {(3,5),(4,2)}
2. If EC(G) > |V|/2, return G
3. Otherwise, divide G into G1 and G2 using MinCut(G), then process G1 and G2 recursively

[Figure: the 9-vertex example graph is split at its minimum cut into G1 = {0,1,2,3} and G2 = {4,5,6,7,8}]
Slide 38

HCS Clustering R code

library(GraphClusterAnalysis)
library(RBGL)
library(igraph)
library(graph)

data(HCS_Example)
G = graph.data.frame(HCS_Example, directed=FALSE)
HCSClustering(G, kappa=2)
Slide 39

Outline (agenda slide repeated)
Slide 40

What is a Clique?
A subgraph C of graph G with edges between all pairs of nodes.

[Figure: graph G on vertices 4-8; vertices 5, 6, 7 form a clique C]
Slide 41

What is a Maximal Clique?
A maximal clique is a clique that is not part of a larger clique.

[Figure: in the example graph, {5,6,7} is a clique but not maximal; {5,6,7,8} is a maximal clique]
Slide 42

Maximal Clique Enumeration: Bron and Kerbosch Algorithm
BK(C, P, N), applied to the input graph G:
- C – vertices in the current clique
- P – vertices that can be added to C
- N – vertices that cannot be added to C
Condition: if both P and N are empty, output C as a maximal clique.
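The basic (pivotless) Bron-Kerbosch recursion fits in a few lines. A Python sketch (the chapter's code is R); the graph is a hypothetical 5-vertex example in the spirit of the clique slides:

```python
def bron_kerbosch(C, P, N, adj, out):
    """BK(C, P, N) from the slide: C is the current clique, P the
    vertices that may still extend it, N the vertices already tried.
    When P and N are both empty, C is reported as a maximal clique."""
    if not P and not N:
        out.append(C)
        return
    for v in list(P):
        bron_kerbosch(C | {v}, P & adj[v], N & adj[v], adj, out)
        P = P - {v}          # v has been fully explored ...
        N = N | {v}          # ... so move it from P to N

# Hypothetical graph: {5,6,7,8} and {4,5,6} are its maximal cliques
adj = {4: {5, 6}, 5: {4, 6, 7, 8}, 6: {4, 5, 7, 8},
       7: {5, 6, 8}, 8: {5, 6, 7}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
print(cliques)  # → [{4, 5, 6}, {5, 6, 7, 8}]
```

Intersecting P and N with v's neighborhood is what guarantees every reported C is a clique, and the N set is what prevents the same maximal clique from being reported twice.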
Slide 43

Maximal Clique R code

library(GraphClusterAnalysis)
library(RBGL)
library(igraph)
library(graph)

data(CliqueData)
G = graph.data.frame(CliqueData, directed=FALSE)
tkplot(G)
maximalCliqueEnumerator(G)
Slide 44

Outline (agenda slide repeated)
Slide 45

What is k-means?
k-means is a clustering algorithm applied to vector data points.
k-means recap:
1. Select k data points from the input as centroids
2. Assign the other data points to the nearest centroid
3. Recompute the centroid for each cluster
4. Repeat steps 2 and 3 until the centroids don't change
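The recap above is plain Lloyd's iteration. An illustrative Python sketch on 2-D points (the data and the simple "first k points" initialization are hypothetical):

```python
def kmeans(points, k, iters=100):
    """Lloyd's k-means following the recap: take k input points as
    initial centroids, assign every point to its nearest centroid,
    recompute each centroid as its cluster mean, and repeat until
    the centroids stop changing."""
    centroids = points[:k]                         # initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # assign to nearest centroid
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        new = [tuple(sum(x) / len(pts) for x in zip(*pts)) if pts else centroids[i]
               for i, pts in enumerate(clusters)]  # recompute cluster means
        if new == centroids:                       # converged
            break
        centroids = new
    return centroids, clusters

# Two well-separated hypothetical blobs
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```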
Slide 46

k-means on Graphs: Kernel k-means
The basic algorithm is the same as k-means on vector data, but we utilize the "kernel trick" (recall the Kernel chapter).
"Kernel trick" recap: we can use within-graph kernel functions to calculate the inner product of a pair of vertices in a user-defined feature space. We replace the standard distance/proximity measures used in k-means with this within-graph kernel function.
Slide 47

Outline (agenda slide repeated)
Slide 48

Application
Functional modules in protein-protein interaction networks: subgraphs with pair-wise interacting nodes => maximal cliques.

R code:
library(GraphClusterAnalysis)
library(RBGL)
library(igraph)
library(graph)

data(YeasPPI)
G = graph.data.frame(YeasPPI, directed=FALSE)
Potential_Protein_Complexes = maximalCliqueEnumerator(G)
Potential_Protein_Complexes