Slide1
Community Detection: Overlapping Communities
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Slide2
Overlapping Communities
Non-overlapping vs. overlapping communities
11/10/2010
Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu
Slide3
Overlaps of Social Circles
A node belongs to many social circles
[Palla et al., ‘05]
Slide4
Slide5
Clique Percolation Method (CPM)
Two nodes belong to the same community if they can be connected through adjacent k-cliques:
k-clique: fully connected graph on k nodes
Adjacent k-cliques: overlap in k-1 nodes
k-clique community: set of nodes that can be reached through a sequence of adjacent k-cliques
[Figure: a 3-clique; adjacent 3-cliques]
[Palla et al., ‘05]
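The two definitions above can be checked mechanically. The following is a minimal sketch, assuming an undirected graph stored as a dict that maps each node to its set of neighbors; the helper names `is_clique` and `are_adjacent` and the toy graph are illustrative and do not come from Palla et al.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of the given nodes is connected by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def are_adjacent(clique_a, clique_b, k):
    """Two k-cliques are adjacent if they share exactly k-1 nodes."""
    a, b = set(clique_a), set(clique_b)
    return len(a) == len(b) == k and len(a & b) == k - 1

# Hypothetical toy graph: triangles {1,2,3} and {2,3,4} share the edge 2-3,
# while triangle {5,6,7} hangs off node 4 through a single edge.
adj = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5},
    5: {4, 6, 7},
    6: {5, 7},
    7: {5, 6},
}
```

In this toy graph, {1,2,3} and {2,3,4} are adjacent 3-cliques, so nodes 1 through 4 would fall into one 3-clique community, while {5,6,7} would form its own.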
Note: Give an example of two non-overlapping 3-cliques!
Slide6
Clique Percolation Method (CPM)
Two nodes belong to the same community if they can be connected through adjacent k-cliques:
[Figure: a 4-clique; adjacent 4-cliques; communities for k=4]
[Palla et al., ‘05]
Note: Give an example of two non-overlapping 4-cliques!
Slide7
CPM: Steps
Clique Percolation Method:
Find maximal cliques (not k-cliques!)
Clique overlap graph: each clique is a node; connect two cliques if they overlap in at least k-1 nodes
Communities: connected components of the clique overlap graph
How to set k? Set k so that we get the “richest” community structure (most widely distributed cluster sizes)
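The steps above can be sketched in a few lines once the maximal cliques are known. This is a minimal sketch, not a reference implementation: the function name `cpm_communities` is made up for illustration, the maximal cliques are assumed to be supplied by the caller, and dropping cliques with fewer than k nodes is an assumption that the slide does not spell out.

```python
from itertools import combinations

def cpm_communities(maximal_cliques, k):
    """Clique percolation given a precomputed list of maximal cliques."""
    # Assumption: cliques with fewer than k nodes contain no k-clique, so drop them.
    cliques = [frozenset(c) for c in maximal_cliques if len(c) >= k]
    # Clique overlap graph: one node per clique, edge if overlap >= k-1.
    neighbors = {i: set() for i in range(len(cliques))}
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            neighbors[i].add(j)
            neighbors[j].add(i)
    # Communities: union of the cliques in each connected component.
    communities, seen = [], set()
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            i = stack.pop()
            members |= cliques[i]
            for j in neighbors[i] - seen:
                seen.add(j)
                stack.append(j)
        communities.append(members)
    return communities
```

Because a community is the union of its cliques, a node that sits in cliques from two different components ends up in both communities, which is where the overlap comes from.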
[Figure: example graph with cliques A, B, C, D; cliques and resulting communities for k=3]
Note: On the clique overlap graph, show the communities (circles around connected components).
Note: Emphasize that this is for parameter k=3.
Note: Define maximal clique!
Slide8
CPM method: Example
Start with graph
Find maximal cliques
Create clique overlap matrix
Threshold the matrix at value k-1: if a_ij < k-1, set a_ij to 0
Communities are the connected components of the thresholded matrix
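The matrix version of the procedure can be sketched as follows. The function names and the example cliques are hypothetical. Two details are assumptions rather than statements from the slide: entries that survive the threshold are set to 1, and cliques smaller than k are filtered out before the matrix is built.

```python
def clique_overlap_matrix(cliques):
    """a[i][j] = number of nodes shared by cliques i and j (a[i][i] = clique size)."""
    sets = [set(c) for c in cliques]
    return [[len(a & b) for b in sets] for a in sets]

def threshold(matrix, k):
    """Set entries below k-1 to 0; assumption: the remaining entries become 1."""
    return [[1 if value >= k - 1 else 0 for value in row] for row in matrix]

def components(binary):
    """Connected components of the graph whose adjacency matrix is `binary`."""
    n, seen, result = len(binary), set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if binary[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        result.append(sorted(comp))
    return result

# Hypothetical maximal cliques, all of size >= k = 4.
cliques = [{1, 2, 3, 4}, {3, 4, 5, 6}, {4, 5, 6, 7}]
overlap = clique_overlap_matrix(cliques)
binary = threshold(overlap, 4)
groups = components(binary)   # lists of clique indices, one list per community
```

Here the second and third cliques share three nodes and are merged, while the first clique shares at most two nodes with the others and stays on its own.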
[Figure: (1) Graph; (2) Clique overlap matrix; (3) Thresholded matrix at 3; (4) Communities (connected components)]
Slide9
Example: Phone-Call Network
Communities in a “tiny” part of a phone call network of 4 million users
[Palla et al., ‘07]
Slide10
Example: Website
[Farkas et al., ‘07]
Slide11
How to Find Maximal Cliques?
No nice way; it is a hard combinatorial problem
Maximal clique: a clique that can’t be extended
{a,b,c} is a clique but not a maximal clique
{a,b,c,d} is a maximal clique
Algorithm: Sketch
Start with a seed node
Expand the clique around the seed
Once the clique cannot be further expanded, we have found a maximal clique
Note: This will generate the same clique multiple times
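One way to read the sketch above is as a greedy expansion. The following is a minimal illustration under that reading, assuming the graph is a dict of neighbor sets; `grow_clique` is a hypothetical name, and a single greedy run returns only one maximal clique containing the seed, not all of them.

```python
def grow_clique(adj, seed):
    """Greedily expand a clique around `seed` until no node can be added."""
    clique = {seed}
    candidates = set(adj[seed])      # nodes adjacent to every clique member
    while candidates:
        v = min(candidates)          # any choice works; min() keeps runs reproducible
        clique.add(v)
        candidates &= adj[v]         # keep only nodes also adjacent to v
    return clique

# The 4-node example from the slide: {a,b,c,d} is a maximal clique.
adj = {node: set("abcd") - {node} for node in "abcd"}
from_a = grow_clique(adj, "a")
from_c = grow_clique(adj, "c")
```

Starting from `a` and from `c` yields the same set, which illustrates the note above: different seeds regenerate the same maximal clique.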
[Figure: graph on nodes a, b, c, d]
Slide12
How to Find Maximal Cliques?
Start with a seed vertex “a”
Goal: Find the maximal clique Q that “a” belongs to
Observation: If some “x” belongs to Q then it is a neighbor of “a”
Why? If a, x ∈ Q but there is no edge a–x, then Q is not a clique!
Recursive algorithm:
Q … current clique
R … candidate vertices to expand the clique to
Γ(u) … neighbor set of u
Example: Start with “a” and expand around it
Steps of the recursive algorithm:
Q = {a}: R = {b,c,d}
Q = {a,b}: R = {b,c,d} ∩ Γ(b) = {c,d}
Q = {a,b,c}: R = {d} ∩ Γ(c) = {}
backtrack
Q = {a,b,d}: R = {c} ∩ Γ(d) = {}
[Figure: graph on nodes a, b, c, d]
Slide13
Slide14
How to Find Maximal Cliques?
Q … current clique
R … candidate vertices
Γ(p) … neighbor set of p
Expand(R, Q):
  while R ≠ {}:
    p = vertex in R
    Qp = Q ∪ {p}
    Rp = R ∩ Γ(p)
    if Rp ≠ {}: Expand(Rp, Qp)
    else: output Qp
    R = R – {p}
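A minimal Python sketch of the recursion follows. It is written to reproduce the worked trace on the slide, in which a vertex that was already tried stays available as a candidate (after the branch p = b, the branch p = d still sees Rp = {b}), so it loops over a fixed R instead of shrinking it. That reading is an assumption. The edge list is also hypothetical: it is the smallest graph on a, b, c, d that is consistent with the traces, and nodes e and f are left out because their edges are not recoverable from the text.

```python
def expand(adj, R, Q, out):
    """Record every way of growing the ordered clique Q using candidates R."""
    for p in sorted(R):
        Qp = Q + [p]          # Q ∪ {p}, kept as a list to show the visiting order
        Rp = R & adj[p]       # R ∩ Γ(p): candidates adjacent to all of Qp
        if Rp:
            expand(adj, Rp, Qp, out)
        else:
            out.append(Qp)    # nothing can extend Qp, so it is maximal

# Hypothetical edges, chosen to be consistent with the traces on the slides.
adj = {
    "a": {"b", "d"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"a", "b", "c"},
}
found = []
expand(adj, set(adj), [], found)
distinct = {frozenset(q) for q in found}
```

Each maximal clique is reported once per ordering of its nodes, so the two triangles in this graph are each output six times. This is the duplication that the lexicographic rule is meant to remove.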
[Figure: example graph on nodes a, b, c, d, e, f]
Start: Expand(V, {})
R={a,…f}, Q={}
p = a: Qp = {a}, Rp = {b,d}
Expand(Rp, Qp): R = {b,d}, Q = {a}
  p = b: Qp = {a,b}, Rp = {d}
  Expand(Rp, Qp): R = {d}, Q = {a,b}
    p = d: Qp = {a,b,d}, Rp = {}: output {a,b,d}
  p = d: Qp = {a,d}, Rp = {b}
  Expand(Rp, Qp): R = {b}, Q = {a,d}
    p = b: Qp = {a,d,b}, Rp = {}: output {a,d,b}
Note: Have an animation about R and Q on the example graph.
Slide15
How to Find Maximal Cliques?
Q … current clique
R … candidate vertices
Γ(p) … neighbor set of p
Expand(R, Q):
  while R ≠ {}:
    p = vertex in R
    Qp = Q ∪ {p}
    Rp = R ∩ Γ(p)
    if Rp ≠ {}: Expand(Rp, Qp)
    else: output Qp
    R = R – {p}
[Figure: example graph on nodes a, b, c, d, e, f]
Start: Expand(V, {})
R={a,…f}, Q={}
p = b: Qp = {b}, Rp = {a,c,d}
Expand(Rp, Qp): R = {a,c,d}, Q = {b}
  p = a: Qp = {b,a}, Rp = {d}
  Expand(Rp, Qp): R = {d}, Q = {b,a}
    p = d: Qp = {b,a,d}, Rp = {}: output {b,a,d}
  p = c: Qp = {b,c}, Rp = {d}
  Expand(Rp, Qp): R = {d}, Q = {b,c}
    p = d: Qp = {b,c,d}, Rp = {}: output {b,c,d}
Slide16
How to Find Maximal Cliques?
How to prevent maximal cliques from being generated multiple times?
Only output cliques that are lexicographically minimum: {a,b,c} < {b,a,c}
Even better: Only expand to the nodes higher in the lexicographical order
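The "only expand upward" rule can be sketched as a small change to the recursive expansion. This is a hedged illustration: `expand_ordered` is a hypothetical name, the graph is an assumed 4-node example with triangles {a,b,d} and {b,c,d}, and the candidate set R is still the full common neighborhood so that maximality is judged correctly even when a branch is skipped.

```python
def expand_ordered(adj, R, Q, out):
    """Grow cliques only in increasing node order, so each is output once."""
    for p in sorted(R):
        if Q and p < Q[-1]:
            continue              # "don't expand": p is lower than the last node added
        Qp = Q + [p]
        Rp = R & adj[p]           # common neighbors of Qp, regardless of order
        if Rp:
            expand_ordered(adj, Rp, Qp, out)
        else:
            out.append(Qp)        # maximal, and reached only via its sorted order

# Hypothetical example graph with two triangles sharing the edge b-d.
adj = {
    "a": {"b", "d"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"a", "b", "c"},
}
found = []
expand_ordered(adj, set(adj), [], found)
```

With Q = [a, d] the only candidate is b, which is lower than d, so that branch stops without output. Each triangle is therefore reported exactly once, in sorted order.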
[Figure: example graph on nodes a, b, c, d, e, f]
Start: Expand(V, {})
R={a,…f}, Q={}
p = a: Qp = {a}, Rp = {b,d}
Expand(Rp, Qp): R = {b,d}, Q = {a}
  p = b: Qp = {a,b}, Rp = {d}
  Expand(Rp, Qp): R = {d}, Q = {a,b}
    p = d: Qp = {a,b,d}, Rp = {}: output {a,b,d}
  p = d: Qp = {a,d}, Rp = {b}
  Don’t expand: d > b
Note: Better explain the lexicographical ordering and why cliques are generated multiple times.
Slide17
How to Model Networks with Communities?
Slide18
Reflections: Finding Communities
Let’s rethink what we are doing…
Given a network
Want to find communities!
Need to: Formalize the notion of a community
Need an algorithm that will find sets of nodes that are “good” communities
More generally: How to think about clusters in large networks?
Note: Better motivate what we want to do, and how we want to do it: why NCP, and what will we get out of it?
Slide19
Community Score
How community-like is a set of nodes?
A good cluster S has:
Many edges internally
Few edges pointing outside
Simplest objective function: Conductance
Small conductance corresponds to good clusters
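As a concrete illustration, conductance can be computed with the formula c/(2m+c) that the deck lists under "Many Different Objective Functions", where m counts edges inside S and c counts edges leaving S. The function below is a minimal sketch that assumes an undirected graph given as a dict of neighbor sets; the toy graph (two triangles joined by one edge) is hypothetical.

```python
def conductance(adj, S):
    """Conductance of node set S: c / (2m + c). Undefined if S touches no edges."""
    S = set(S)
    c = sum(1 for u in S for v in adj[u] if v not in S)     # edges leaving S
    m = sum(1 for u in S for v in adj[u] if v in S) // 2    # edges inside S
    return c / (2 * m + c)

# Hypothetical graph: triangles {1,2,3} and {4,5,6} joined by the edge 3-4.
adj = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3, 5, 6},
    5: {4, 6},
    6: {4, 5},
}
good = conductance(adj, {1, 2, 3})   # one triangle: 3 inside edges, 1 leaving
poor = conductance(adj, {1, 2})      # splits a triangle: 1 inside edge, 2 leaving
```

The whole triangle scores lower (better) than the pair that cuts through it, matching the intuition of many internal edges and few edges pointing outside.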
[Figure: cluster S and its complement S’]
Slide20
Network Community Profile Plot
Define: Network community profile (NCP) plot
Plot the score of the best community of size k
[Figure: NCP plot; x-axis: community size, log k; y-axis: log Φ(k); example best communities for k=5, k=7, k=10]
[WWW ‘08]
Slide21
How to (Really) Compute NCP?
Run your favorite clustering method
Each dot represents a cluster
For each size find the “best” cluster
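The procedure above amounts to taking, for every cluster size k, the lowest score seen among clusters of that size. The sketch below assumes the clustering runs have already produced (cluster, score) pairs; the function name `ncp` and the sample values are made up for illustration and are not results from the lecture.

```python
import math

def ncp(clusters_with_scores):
    """Best (lowest) score per cluster size from (cluster, score) pairs."""
    best = {}
    for cluster, score in clusters_with_scores:
        k = len(cluster)
        if k not in best or score < best[k]:
            best[k] = score
    return dict(sorted(best.items()))

# Hypothetical output of some clustering runs: (node set, conductance).
runs = [
    ({1, 2, 3}, 0.30),
    ({4, 5, 6}, 0.14),
    ({1, 2, 3, 4, 5}, 0.40),
    ({2, 3, 4, 5, 6}, 0.25),
]
profile = ncp(runs)
# Points for the log-log plot: (log k, log Φ(k)).
points = [(math.log10(k), math.log10(phi)) for k, phi in profile.items()]
```

Since the clustering methods are heuristics, the resulting curve is only an upper bound on the true minimum score at each size.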
[Figure: scatter of clusters found by Spectral, Graclus, and Metis; x-axis: cluster size, log k; y-axis: cluster score, log Φ(k)]
Slide22
NCP Plot: Meshes
Meshes, grids, dense random graphs:
[Figure: NCP plots for d-dimensional meshes and the California road network]
[WWW ‘08]
Slide23
NCP plot: Network Science
Collaborations between scientists in networks [Newman, 2005]
[Figure: NCP plot; x-axis: community size, log k; y-axis: conductance, log Φ(k)]
[WWW ‘08]
Slide24
Natural Hypothesis
Natural hypothesis about NCP:
NCP of real networks slopes downward
Slope of the NCP corresponds to the “dimensionality” of the network
What about large networks?
[Internet Mathematics ‘09]
Slide25
Large Networks: Very Different
Typical example: General Relativity collaborations (n=4,158, m=13,422)
[Internet Mathematics ‘09]
Slide26
More NCP Plots of Networks
[Internet Mathematics ‘09]
Slide27
NCP: LiveJournal (n=5M, m=42M)
[Figure: NCP plot; x-axis: k (cluster size); y-axis: Φ(k) (score)]
Better and better clusters
Clusters get worse and worse
Best cluster has ~100 nodes
Slide28
Explanation: The Upward Part
As clusters grow, the number of edges inside grows slower than the number crossing
Each node has twice as many children
[Figure: clusters of increasing size in a tree, with Φ=2/10=0.2, Φ=1/7=0.14, Φ=8/20=0.4, Φ=64/92=0.69]
Slide29
Explanation: Downward Part
Empirically we note that best clusters are barely connected to the network
[Figure: NCP plot; core-periphery structure]
Note: Make this slide first, before the infinite tree.
Slide30
What If We Remove Good Clusters?
Nothing happens!
Nestedness of the core-periphery structure
Slide31
Suggested Network Structure
Nested Core-Periphery (jellyfish, octopus)
Whiskers are responsible for good communities
Denser and denser core of the network
Core contains 60% of the nodes and 80% of the edges
Slide32
******* END *********
Good lecture
Overlapping part went well
People had questions – make it more interactive
Make more homeworks, quizzes so that students do more work – now they just come to class and stare at the slides.
Slide33
Communities: Issues and Questions
Slide34
Communities: Issues and Questions
Some issues with community detection:
Many different formalizations of clustering objective functions
Objectives are NP-hard to optimize exactly
Methods can find clusters that are systematically “biased”
Questions:
How well do algorithms optimize objectives?
What clusters do different methods find?
Slide35
Many Different Objective Functions
Single-criterion:
Modularity: m − E(m)
Edges cut: c
Multi-criterion:
Conductance: c/(2m+c)
Expansion: c/n
Density: 1 − m/n²
CutRatio: c/(n(N−n))
Normalized Cut: c/(2m+c) + c/(2(M−m)+c)
Flake-ODF: fraction of nodes with more than ½ of their edges pointing outside S
[Figure: node set S]
n: nodes in S
m: edges in S
c: edges pointing outside S
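The multi-criterion scores above depend only on a handful of counts, so they are easy to compare side by side. The sketch below evaluates them for one cluster; it assumes that N and M denote the number of nodes and edges in the whole graph (the slide does not define them), and the function name `cluster_scores` and the example counts are illustrative. Modularity and Flake-ODF are omitted because they need more than these counts.

```python
def cluster_scores(n, m, c, N, M):
    """Multi-criterion scores from the counts defined on the slide.

    n, m, c: nodes in S, edges in S, edges pointing outside S.
    N, M: nodes and edges in the whole graph (assumed meaning).
    """
    return {
        "conductance": c / (2 * m + c),
        "expansion": c / n,
        "density": 1 - m / n**2,
        "cut_ratio": c / (n * (N - n)),
        "normalized_cut": c / (2 * m + c) + c / (2 * (M - m) + c),
    }

# Hypothetical cluster: one triangle of a graph made of two triangles
# joined by a single edge (6 nodes, 7 edges in total).
scores = cluster_scores(n=3, m=3, c=1, N=6, M=7)
```

For every score in this dictionary, lower values indicate a better cluster under the slide's definitions.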
[WWW ‘09]
Slide36
Many Classes of Algorithms
Many algorithms implicitly or explicitly optimize objectives and extract communities:
Heuristics:
Girvan-Newman, modularity optimization: popular heuristics
Metis: multi-resolution heuristic [Karypis-Kumar ‘98]
Theoretical approximation algorithms:
Spectral partitioning
[WWW ‘09]
Slide37
NCP: Live Journal
[Figure: NCP plot of LiveJournal for Spectral and Metis]
[WWW ‘09]
Slide38
Properties of Clusters (1)
[Figure: 500 node communities from Spectral; 500 node communities from Metis]
[WWW ‘09]
Slide39
Properties of Clusters (2)
Metis gives sets with better conductance
Spectral gives tighter and more well-rounded sets
[Figure: three panels comparing Spectral, Connected Metis, and Disconnected Metis: conductance of bounding cut; diameter of the cluster; external / internal conductance (lower is good)]
[WWW ‘09]
Note: Expand this slide into 3 different slides to illustrate what each of the figures plots.
Slide40
Multi-criterion Objectives
All qualitatively similar
Observations:
Conductance, Expansion, Norm-cut, Cut-ratio are similar
Flake-ODF prefers larger clusters
Density is bad
Cut-ratio has high variance
[WWW ‘09]
Slide41
Single-criterion Objectives
Observations:
All measures are monotonic
Modularity prefers large clusters and ignores small clusters
[WWW ‘09]