Community Detection: Overlapping Communities

Slide1

Community Detection: Overlapping Communities

CS224W: Social and Information Network Analysis

Jure Leskovec, Stanford University

http://cs224w.stanford.edu

Slide2

Overlapping Communities

Non-overlapping vs. overlapping communities

11/10/2010

Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Slide3

Overlaps of Social Circles

A node belongs to many social circles [Palla et al., ‘05]

Slide4

11/17/2011


Slide5

Clique Percolation Method (CPM)

Two nodes belong to the same community if they can be connected through adjacent k-cliques:

k-clique: fully connected graph on k nodes

Adjacent k-cliques: two k-cliques that overlap in k-1 nodes

k-clique community: set of nodes that can be reached through a sequence of adjacent k-cliques

[Figure: a 3-clique and two adjacent 3-cliques] [Palla et al., ‘05]

Give an example of two non-overlapping 3-cliques!

Slide6

Clique Percolation Method (CPM)

Two nodes belong to the same community if they can be connected through adjacent k-cliques.

[Figure: a 4-clique, adjacent 4-cliques, and the communities for k=4] [Palla et al., ‘05]

Give an example of two non-overlapping 4-cliques!

Slide7

CPM: Steps

Clique Percolation Method:

Find maximal cliques (not k-cliques!)

Build the clique overlap graph: each clique is a node; connect two cliques if they overlap in at least k-1 nodes

Communities: connected components of the clique overlap graph

How to set k?

Set k so that we get the “richest” community structure (the most widely distributed cluster sizes)

[Figure: cliques A, B, C, D and the resulting communities for k=3]

On the clique overlap graph, show the communities as circles around the connected components.

Emphasize that this is for parameter k=3.

Define maximal clique!
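The three CPM steps can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: it assumes the maximal cliques are already known and passed in as sets, the function name `clique_percolation` is made up here, and the example cliques are hypothetical rather than taken from the slide's figure.

```python
from itertools import combinations


def clique_percolation(maximal_cliques, k):
    """Return the k-clique communities given the maximal cliques of a graph."""
    # A maximal clique smaller than k contains no k-clique, so it is dropped.
    cliques = [frozenset(c) for c in maximal_cliques if len(c) >= k]

    # Union-find over clique indices: each clique is a node of the overlap graph.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Connect two cliques if they overlap in at least k-1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    # Each connected component of the overlap graph is one community:
    # the union of the nodes of its cliques.
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=sorted)


# Hypothetical maximal cliques (not the ones drawn on the slide).
example_cliques = [{"a", "b", "c"}, {"b", "c", "d"}, {"d", "e", "f"}, {"f", "g"}]
communities_k3 = clique_percolation(example_cliques, k=3)
```

In this example node d ends up in both communities, which is exactly the overlap that CPM is meant to capture.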

Slide8

CPM method: Example

Start with the graph

Find maximal cliques

Create the clique overlap matrix

Threshold the matrix at value k-1: if a_ij < k-1, set a_ij = 0

Communities are the connected components of the thresholded matrix

[Figure: (1) graph, (2) clique overlap matrix, (3) matrix thresholded at 3, (4) communities (connected components)]
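The matrix view of the same procedure can be sketched as follows. The function names and the example cliques are assumptions made for illustration. One detail goes beyond the slide's rule: the diagonal entry a_ii is the size of clique i, and this sketch also zeroes diagonal entries below k, so that a clique too small to contain a k-clique does not survive as a community of its own.

```python
def clique_overlap_matrix(cliques):
    """a[i][j] = number of nodes shared by clique i and clique j."""
    cliques = [set(c) for c in cliques]
    return [[len(ci & cj) for cj in cliques] for ci in cliques]


def threshold(matrix, k):
    """Zero off-diagonal entries below k-1 and diagonal entries below k."""
    n = len(matrix)
    return [
        [
            matrix[i][j] if matrix[i][j] >= (k if i == j else k - 1) else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


def components(matrix):
    """Connected components (lists of clique indices) of a thresholded matrix."""
    n = len(matrix)
    seen, result = set(), []
    for start in range(n):
        if start in seen or matrix[start][start] == 0:
            continue  # already assigned, or clique removed by the threshold
        stack, comp = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j not in seen and matrix[i][j] > 0:
                    seen.add(j)
                    stack.append(j)
        result.append(sorted(comp))
    return result


# Hypothetical maximal cliques, numbered 0..3 by position.
cliques = [{1, 2, 3, 4}, {2, 3, 4, 5}, {5, 6, 7, 8}, {4, 5, 9}]
overlap = clique_overlap_matrix(cliques)
thresholded = threshold(overlap, k=4)
clique_groups = components(thresholded)
```

Each group of clique indices corresponds to one community; its node set is the union of the cliques in the group.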

Slide9

Example: Phone-Call Network

Communities in a “tiny” part of a phone call network of 4 million users [Palla et al., ‘07]

Slide10

Example: Website

[Farkas et al., ‘07]

Slide11

How to Find Maximal Cliques?

No nice way: this is a hard combinatorial problem

Maximal clique: a clique that cannot be extended

{a,b,c} is a clique but not a maximal clique

{a,b,c,d} is a maximal clique

Algorithm sketch:

Start with a seed node

Expand the clique around the seed

Once the clique cannot be expanded further, we have found a maximal clique

Note: this will generate the same clique multiple times

[Figure: example graph on nodes a, b, c, d]

Slide12

How to Find Maximal Cliques?

Start with a seed vertex “a”

Goal: find the maximal clique Q that “a” belongs to

Observation: if some “x” belongs to Q, then it is a neighbor of “a”

Why? If a, x ∈ Q but there is no edge a–x, then Q is not a clique!

Recursive algorithm:

Q … current clique

R … candidate vertices to expand the clique to

Γ(u) … neighbor set of u

Example: start with “a” and expand around it

Steps of the recursive algorithm:

Q = {a}, R = {b,c,d}

Q = {a,b}, R = {b,c,d} ∩ Γ(b) = {c,d}

Q = {a,b,c}, R = {d} ∩ Γ(c) = {}

backtrack

Q = {a,b,d}, R = {c} ∩ Γ(d) = {}

[Figure: example graph on nodes a, b, c, d]

Slide14

How to Find Maximal Cliques?

Q … current clique

R … candidate vertices

Expand(R, Q):
    while R ≠ {}:
        p = vertex in R
        Qp = Q ∪ {p}
        Rp = R ∩ Γ(p)
        if Rp ≠ {}: Expand(Rp, Qp)
        else: output Qp
        R = R - {p}

[Figure: example graph on nodes a, b, c, d, e, f]

Start: Expand(V, {})

R = {a,…,f}, Q = {}

p = a: Qp = {a}, Rp = {b,d}

Expand(Rp, Qp): R = {b,d}, Q = {a}

p = b: Qp = {a,b}, Rp = {d}

Expand(Rp, Qp): R = {d}, Q = {a,b}

p = d: Qp = {a,b,d}, Rp = {}: output {a,b,d}

p = d: Qp = {a,d}, Rp = {b}

Expand(Rp, Qp): R = {b}, Q = {a,d}

p = b: Qp = {a,d,b}, Rp = {}: output {a,d,b}

Have an animation about R and Q on the example graph.
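The recursion can be tried out with a small Python sketch. It is one possible reading of the slide, not a definitive implementation: like the trace (where p = d still sees Rp = {b} after b has been processed), it always intersects with the full candidate set and therefore leaves out the R = R - {p} step. The graph is an assumption: only the four nodes a, b, c, d with the edges implied by the trace; the edges of e and f are not recoverable from the transcript.

```python
def expand(graph, R, Q, out):
    """Naive Expand(R, Q): R holds the common neighbors of every vertex in Q."""
    for p in sorted(R):
        Qp = Q + [p]
        Rp = R & graph[p]  # candidates that are also adjacent to p
        if Rp:
            expand(graph, Rp, Qp, out)
        else:
            out.append(Qp)  # nothing can extend Qp, so it is maximal


# Assumed example graph: adjacency sets for the nodes a, b, c, d only.
graph = {
    "a": {"b", "d"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"a", "b", "c"},
}

found = []
expand(graph, set(graph), [], found)
distinct = {frozenset(q) for q in found}
```

Every maximal clique is reported once per ordering of its vertices, which illustrates the note that the same clique is generated multiple times.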

Slide15

How to Find Maximal Cliques?

Same Expand(R, Q) procedure as on the previous slide, now expanding from p = b.

[Figure: example graph on nodes a, b, c, d, e, f]

Start: Expand(V, {})

R = {a,…,f}, Q = {}

p = b: Qp = {b}, Rp = {a,c,d}

Expand(Rp, Qp): R = {a,c,d}, Q = {b}

p = a: Qp = {b,a}, Rp = {d}

Expand(Rp, Qp): R = {d}, Q = {b,a}

p = d: Qp = {b,a,d}, Rp = {}: output {b,a,d}

p = c: Qp = {b,c}, Rp = {d}

Expand(Rp, Qp): R = {d}, Q = {b,c}

p = d: Qp = {b,c,d}, Rp = {}: output {b,c,d}

Slide16

How to Find Maximal Cliques?

How to prevent maximal cliques from being generated multiple times?

Only output cliques that are lexicographically minimum: {a,b,c} < {b,a,c}

Even better: only expand to nodes that are higher in the lexicographical order

[Figure: example graph on nodes a, b, c, d, e, f]

Start: Expand(V, {})

R = {a,…,f}, Q = {}

p = a: Qp = {a}, Rp = {b,d}

Expand(Rp, Qp): R = {b,d}, Q = {a}

p = b: Qp = {a,b}, Rp = {d}

Expand(Rp, Qp): R = {d}, Q = {a,b}

p = d: Qp = {a,b,d}, Rp = {}: output {a,b,d}

p = d: Qp = {a,d}, Rp = {b}

Don’t expand: d > b

Better explain the lexicographical ordering and why cliques are generated multiple times.
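A minimal sketch of the ordering trick, under the same assumptions as before (hypothetical function name, and a graph restricted to the nodes a, b, c, d with the edges implied by the traces). The recursion only expands to vertices that come later in the order, but it still checks maximality against the full set of common neighbors, so a branch such as Q = {a,d} with R = {b} produces nothing.

```python
def maximal_cliques(graph):
    """Yield each maximal clique exactly once, as a sorted list of vertices."""

    def expand(R, Q):
        # R is the set of common neighbors of all vertices in Q.
        if not R:
            yield Q  # no common neighbor is left, so Q is maximal
            return
        last = Q[-1] if Q else None
        for p in sorted(R):
            if last is not None and p <= last:
                continue  # don't expand to nodes lower in the order
            yield from expand(R & graph[p], Q + [p])

    yield from expand(set(graph), [])


# Assumed example graph: adjacency sets for the nodes a, b, c, d only.
graph = {
    "a": {"b", "d"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"a", "b", "c"},
}
cliques_once = list(maximal_cliques(graph))
```

Compared with the naive recursion, each maximal clique is now reached only along its sorted vertex order, so no post-hoc deduplication is needed.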

Slide17

How to Model Networks with Communities?


Slide18

Reflections: Finding Communities

Let’s rethink what we are doing…

Given a network, we want to find communities!

Need to:

Formalize the notion of a community

Find an algorithm that will find sets of nodes that are “good” communities

More generally: how to think about clusters in large networks?

Better motivate what we want to do and how we want to do it: why NCP, and what will we get out of it?

Slide19

Community Score

How community-like is a set of nodes?

A good cluster S has:

Many edges internally

Few edges pointing outside

Simplest objective function: conductance

Small conductance corresponds to good clusters

[Figure: a node set S and its complement S’]
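The conductance formula itself did not survive in this transcript. A minimal sketch, assuming the usual definition (edges leaving S divided by the sum of degrees of the nodes in S, which matches the expression c/(2m+c) given later in the deck), could look like this. The function name and the example graph, two triangles joined by one edge, are made up for illustration.

```python
def conductance(graph, S):
    """Edges leaving S divided by the total degree of the nodes in S."""
    S = set(S)
    cut = sum(1 for u in S for v in graph[u] if v not in S)
    volume = sum(len(graph[u]) for u in S)  # equals 2m + c
    return cut / volume if volume else 0.0


# Hypothetical graph: triangles {1,2,3} and {4,5,6} joined by the edge 3-4.
graph = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3, 5, 6},
    5: {4, 6},
    6: {4, 5},
}
phi_triangle = conductance(graph, {1, 2, 3})  # one cut edge, volume 7
phi_pair = conductance(graph, {1, 2})         # two cut edges, volume 4
```

The triangle scores lower (better) than the pair, in line with the slide: many internal edges and few edges pointing outside.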

Slide20

Network Community Profile Plot

Define: network community profile (NCP) plot

Plot the score of the best community of size k

[Figure: NCP plot of log Φ(k) against community size log k, with example communities for k=5, k=7, and k=10] [WWW ‘08]

Slide21

How to (Really) Compute NCP?

11/12/2009

Jure Leskovec, Stanford CS322: Network Analysis

Run your favorite clustering method

Each dot represents a cluster

For each size, find the “best” cluster

[Figure: cluster score log Φ(k) against cluster size log k for Spectral, Graclus, and Metis]
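The recipe above (collect clusters, then keep the best score per size) can be sketched as follows. This is an illustration under assumptions: the candidate clusters are simply given as a list (in practice they would come from a clustering method such as the ones named in the figure), conductance is used as the score, and the function names and example graph are hypothetical. Plotting on log-log axes is left out.

```python
def conductance(graph, S):
    """Edges leaving S divided by the total degree of the nodes in S."""
    S = set(S)
    cut = sum(1 for u in S for v in graph[u] if v not in S)
    volume = sum(len(graph[u]) for u in S)
    return cut / volume if volume else 0.0


def network_community_profile(graph, clusters):
    """Map each cluster size k to the lowest conductance seen at that size."""
    best = {}
    for S in clusters:
        S = set(S)
        if not S or len(S) == len(graph):
            continue  # skip the empty set and the whole graph
        phi = conductance(graph, S)
        k = len(S)
        if k not in best or phi < best[k]:
            best[k] = phi
    return dict(sorted(best.items()))


# Hypothetical graph: triangles {1,2,3} and {4,5,6} joined by the edge 3-4.
graph = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3, 5, 6},
    5: {4, 6},
    6: {4, 5},
}
candidates = [{1, 2}, {3, 4}, {1, 2, 3}, {4, 5, 6}, {2, 3, 4}]
profile = network_community_profile(graph, candidates)
```

Each entry of `profile` is one point of the NCP curve: the size k on the horizontal axis and the best score found for that size on the vertical axis.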

Slide22

NCP Plot: Meshes

Meshes, grids, dense random graphs:

[Figure: NCP plots for d-dimensional meshes and the California road network] [WWW ‘08]

Slide23

NCP plot: Network Science

Collaborations between scientists in networks [Newman, 2005]

[Figure: NCP plot of conductance log Φ(k) against community size log k] [WWW ‘08]

Slide24

Natural Hypothesis

Natural hypothesis about NCP:

NCP of real networks slopes downward

Slope of the NCP corresponds to the “dimensionality” of the network

What about large networks?

[Internet Mathematics ‘09]

Slide25

Large Networks: Very Different

Typical example: General Relativity collaborations (n=4,158, m=13,422) [Internet Mathematics ‘09]

Slide26

More NCP Plots of Networks

[Internet Mathematics ‘09]

Slide27

NCP: LiveJournal (n=5m, m=42m)

[Figure: NCP plot of score Φ(k) against cluster size k]

Better and better clusters

Clusters get worse and worse

Best cluster has ~100 nodes

Slide28

Explanation: The Upward Part

As clusters grow, the number of edges inside grows slower than the number crossing

[Figure: tree in which each node has twice as many children, with example clusters of conductance Φ=1/7=0.14, Φ=2/10=0.2, Φ=8/20=0.4, and Φ=64/92=0.69]

Slide29

Explanation: Downward Part

Empirically, we note that the best clusters are barely connected to the network

[Figure: NCP plot and core-periphery structure]

Make this slide first, before the infinite tree.

Slide30

What If We Remove Good Clusters?

Nothing happens!

Nestedness of the core-periphery structure

Slide31

Suggested Network Structure

Nested Core-Periphery (jellyfish, octopus)

Whiskers are responsible for good communities

Denser and denser core of the network

Core contains 60% of the nodes and 80% of the edges

Slide32

******* END *********

Good lecture

Overlapping part went well

People had questions: make it more interactive

Make more homeworks and quizzes so that students do more work; now they just come to class and stare at the slides.

Slide33

Communities: Issues and Questions


Slide34

Communities: Issues and Questions

Some issues with community detection:

Many different formalizations of clustering objective functions

Objectives are NP-hard to optimize exactly

Methods can find clusters that are systematically “biased”

Questions:

How well do algorithms optimize objectives?

What clusters do different methods find?

Slide35

Many Different Objective Functions

S: a set of nodes; n: nodes in S; m: edges in S; c: edges pointing outside S

Single-criterion:

Modularity: $m - E(m)$

Edges cut: $c$

Multi-criterion:

Conductance: $c/(2m+c)$

Expansion: $c/n$

Density: $1 - m/n^2$

CutRatio: $c/(n(N-n))$

Normalized Cut: $c/(2m+c) + c/(2(M-m)+c)$

Flake-ODF: fraction of nodes with more than ½ of their edges pointing outside S

[WWW ‘09]
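The closed-form scores above can be evaluated directly from the counts n, m, and c. The sketch below is illustrative only. It assumes that N and M, which the slide does not define, stand for the total number of nodes and edges in the graph, and it requires 0 < n < N. Modularity is left out because E(m) depends on a null model, and Flake-ODF is left out because it needs per-node degrees rather than counts. The function name is hypothetical.

```python
def community_scores(n, m, c, N, M):
    """Scores of a node set S from its counts.

    n: nodes in S, m: edges inside S, c: edges pointing outside S.
    N, M: assumed here to be the total nodes and edges of the graph.
    """
    return {
        "edges_cut": c,
        "conductance": c / (2 * m + c),
        "expansion": c / n,
        "density": 1 - m / n**2,
        "cut_ratio": c / (n * (N - n)),
        "normalized_cut": c / (2 * m + c) + c / (2 * (M - m) + c),
    }


# Hypothetical example: one triangle of a graph made of two triangles
# joined by a single edge (N = 6 nodes, M = 7 edges).
scores = community_scores(n=3, m=3, c=1, N=6, M=7)
```

For all of these scores a lower value indicates a more community-like set.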

Slide36

Many Classes of Algorithms

Many algorithms implicitly or explicitly optimize objectives and extract communities:

Heuristics:

Girvan-Newman, modularity optimization: popular heuristics

Metis: multi-resolution heuristic [Karypis-Kumar ‘98]

Theoretical approximation algorithms:

Spectral partitioning

[WWW ‘09]

Slide37

NCP: LiveJournal

[Figure: NCP plots of LiveJournal computed with Spectral and Metis] [WWW ‘09]

Slide38

Properties of Clusters (1)

[Figure: 500-node communities from Spectral]

[Figure: 500-node communities from Metis]

[WWW ‘09]

Slide39

Properties of Clusters (2)

Metis gives sets with better conductance

Spectral gives tighter and more well-rounded sets

[Figure: three plots comparing Spectral, connected Metis, and disconnected Metis clusters: conductance of the bounding cut, diameter of the cluster, and external/internal conductance; annotation: lower is good] [WWW ‘09]

Expand this slide into 3 different slides to illustrate what each of the figures plots.

Slide40

Multi-criterion Objectives

Observations:

Conductance, Expansion, Norm-cut, and Cut-ratio are similar (all qualitatively similar)

Flake-ODF prefers larger clusters

Density is bad

Cut-ratio has high variance

[WWW ‘09]

Slide41

Single-criterion Objectives

Observations:

All measures are monotonic

Modularity prefers large clusters and ignores small clusters

[WWW ‘09]