Analysis of Large Graphs: Overlapping Communities

Uploaded On 2018-11-09



Presentation Transcript

Slide1

Analysis of Large Graphs: Overlapping Communities

Mining of Massive Datasets

Jure Leskovec, Anand Rajaraman, Jeff Ullman

Stanford University

http://www.mmds.org

Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Slide2

Identifying Communities

Nodes: Football Teams

Edges: Games played

Can we identify node groups?

(communities, modules, clusters)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Slide3

NCAA Football Network

NCAA conferences

Nodes: Football Teams

Edges: Games played

Slide4

Protein-Protein Interactions

Can we identify functional modules?

Nodes: Proteins

Edges: Physical interactions

Slide5

Protein-Protein Interactions

Functional modules

Nodes: Proteins

Edges: Physical interactions

Slide6

Facebook Network

Can we identify social communities?

Nodes: Facebook Users

Edges: Friendships

Slide7

Facebook Network

Social communities: High school, Summer internship, Stanford (Squash), Stanford (Basketball)

Nodes: Facebook Users

Edges: Friendships

Slide8

Overlapping Communities

Non-overlapping vs. overlapping communities

Slide9

Non-overlapping Communities

Figure: a network and its adjacency matrix (rows and columns are nodes).

Slide10

Communities as Tiles!

What is the structure of community overlaps?

Edge density in the overlaps is higher!

Communities as “tiles”

Slide11

Recap so far…

This is what we want!

Communities in a network

Slide12

Plan of attack

1) Given a model, we generate the network

2) Given a network, find the “best” model

Figure: a generative model for networks and an example network on nodes A through H.

Slide13

Model of networks

Goal: Define a model that can generate networks

The model will have a set of “parameters” that we will later want to estimate (and detect communities)

Q: Given a set of nodes, how do communities “generate” edges of the network?

Figure: a generative model for networks produces an example network on nodes A through H.

Slide14

Community-Affiliation Graph

Generative model $B(V, C, M, \{p_c\})$ for graphs:

Nodes $V$, Communities $C$, Memberships $M$

Each community $c$ has a single probability $p_c$

Later we fit the model to networks to detect communities

Figure: the model (communities $C$ with probabilities $p_A$, $p_B$, linked to nodes $V$ by memberships $M$) next to the network it generates.

Slide15

AGM: Generative Process

Figure: the model (communities $C$ with probabilities $p_A$, $p_B$, linked to nodes $V$ by memberships $M$) and the generated network.

AGM generates the links: For each pair of nodes in community $A$, we connect them with prob. $p_A$

The overall edge probability is:

$P(u,v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c)$

$M_u$ … set of communities node $u$ belongs to

If $u$, $v$ share no communities: $P(u,v) = \varepsilon$

Think of this as an “OR” function: If at least 1 community says “YES” we create an edge
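The generative process described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-based membership format, the `eps` background parameter, and the example communities and probabilities are all assumptions. It samples each pair once with the combined "OR" probability, which has the same distribution as flipping one coin per shared community.

```python
import itertools
import random


def edge_probability(u, v, memberships, p, eps=0.0):
    """P(u, v) = 1 - product over shared communities c of (1 - p[c]).

    memberships maps node -> set of communities; p maps community -> p_c.
    eps (an assumed name) is used when u and v share no community.
    """
    shared = memberships[u] & memberships[v]
    if not shared:
        return eps
    no_edge = 1.0
    for c in shared:
        no_edge *= 1.0 - p[c]
    return 1.0 - no_edge


def generate_agm(memberships, p, eps=0.0, seed=0):
    """Sample an undirected edge set: one biased coin flip per node pair."""
    rng = random.Random(seed)
    edges = set()
    for u, v in itertools.combinations(sorted(memberships), 2):
        if rng.random() < edge_probability(u, v, memberships, p, eps):
            edges.add((u, v))
    return edges


# Hypothetical model: two overlapping communities A and B.
memberships = {
    "u": {"A"}, "v": {"A", "B"}, "w": {"A", "B"}, "x": {"B"}, "y": set(),
}
p = {"A": 0.5, "B": 0.8}

print(edge_probability("v", "w", memberships, p))  # two shared communities
print(generate_agm(memberships, p))
```

Nodes that sit in the overlap (here `v` and `w`) get a higher edge probability than nodes sharing a single community, which matches the earlier observation that edge density is higher in overlaps.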

 

Slide16

Recap: AGM networks

Figure: an AGM model and the network it generates.

Slide17

AGM: Flexibility

AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested

Slide18

How do we detect communities with AGM?

Slide19

Detecting Communities

Detecting communities with AGM:

Given a graph, find the model:

Affiliation graph $M$

Number of communities $C$

Parameters $p_c$

Figure: example network on nodes A through H.

Slide20

Maximum Likelihood Estimation

Maximum Likelihood Principle (MLE):

Given: Data $X$

Assumption: Data is generated by some model $f(\Theta)$

$f$ … model

$\Theta$ … model parameters

Want to estimate $P_f(X \mid \Theta)$: the probability that our model $f$ (with parameters $\Theta$) generated the data

Now let’s find the most likely model that could have generated the data:

$\arg\max_{\Theta} P_f(X \mid \Theta)$

 

Slide21

Example: MLE

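Imagine we are given a set of coin flips

Task: Figure out the bias of a coin!

Data: Sequence of coin flips: $X = [x_1, \dots, x_n]$, $x_i \in \{0, 1\}$

Model: $f(\Theta)$ = return 1 with prob. $\Theta$, else return 0

What is $P_f(X \mid \Theta)$? Assuming coin flips are independent

So, $P_f(X \mid \Theta) = \prod_{i=1}^{n} P_f(x_i \mid \Theta)$

What is $P_f(1 \mid \Theta)$? Simple, $P_f(1 \mid \Theta) = \Theta$

Then, for a sequence with $k$ ones, $P_f(X \mid \Theta) = \Theta^{k} (1 - \Theta)^{n-k}$

What did we learn? Our data was most likely generated by coin with bias $\Theta = k/n$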
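The coin example can be checked numerically. The sketch below is illustrative only: the flip sequence is hypothetical (it is not taken from the slides), and `likelihood` and `mle_bias` are names chosen here. It compares the closed-form estimate (the fraction of ones) with a coarse grid search over the bias.

```python
def likelihood(flips, theta):
    """P_f(X | theta) for independent flips: theta^(#ones) * (1 - theta)^(#zeros)."""
    ones = sum(flips)
    zeros = len(flips) - ones
    return theta ** ones * (1.0 - theta) ** zeros


def mle_bias(flips):
    """Closed-form maximizer of the likelihood: the fraction of ones."""
    return sum(flips) / len(flips)


# Hypothetical data: 3 ones in 8 flips.
flips = [1, 0, 0, 0, 1, 0, 0, 1]
theta_hat = mle_bias(flips)

# A grid search over candidate biases should land on the same value.
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=lambda t: likelihood(flips, t))

print(theta_hat, best)
print(likelihood(flips, 0.5), likelihood(flips, theta_hat))
```

The fitted bias gives a higher likelihood than a fair coin would, which is the sense in which it is the "most likely" model for this data.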

 

Slide22

MLE for Graphs

How do we do MLE for graphs?

Model generates a probabilistic adjacency matrix

We then flip all the entries of the probabilistic matrix to obtain the binary adjacency matrix

For every pair of nodes $u$, $v$, AGM gives the prob. $P(u,v)$ of them being linked:

$\begin{pmatrix} 0 & 0.10 & 0.10 & 0.04 \\ 0.10 & 0 & 0.02 & 0.06 \\ 0.10 & 0.02 & 0 & 0.06 \\ 0.04 & 0.06 & 0.06 & 0 \end{pmatrix}$

Flip biased coins:

$\begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$

The likelihood of AGM generating graph $G$:

$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$
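To make the graph likelihood concrete, the sketch below evaluates it for the two 4×4 matrices on this slide. It assumes the numbers in the transcript are listed row by row, and it counts each unordered node pair once (undirected graph, no self-loops); the function name is chosen here for illustration.

```python
import math

# Probabilistic adjacency matrix P(u, v) and the binary adjacency matrix,
# read row by row from the slide (the row-major reading is an assumption).
P = [
    [0.00, 0.10, 0.10, 0.04],
    [0.10, 0.00, 0.02, 0.06],
    [0.10, 0.02, 0.00, 0.06],
    [0.04, 0.06, 0.06, 0.00],
]
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
]


def graph_log_likelihood(P, A):
    """Sum of log P(u,v) over edges plus log(1 - P(u,v)) over non-edges."""
    n = len(A)
    total = 0.0
    for u in range(n):
        for v in range(u + 1, n):  # each unordered pair once
            if A[u][v]:
                total += math.log(P[u][v])
            else:
                total += math.log(1.0 - P[u][v])
    return total


ll = graph_log_likelihood(P, A)
print(ll, math.exp(ll))
```

Working in log space avoids underflow: for larger graphs the raw product of many small probabilities quickly becomes too small to represent.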

Slide23

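Graphs: Likelihood $P(G \mid \Theta)$

Given graph $G(V,E)$ and $\Theta$, we calculate the likelihood that $\Theta$ generated $G$: $P(G \mid \Theta)$

Edge probabilities given by $\Theta = B(V, C, M, \{p_c\})$ (the figure shows two communities, A and B):

$\begin{pmatrix} 0 & 0.9 & 0.9 & 0 \\ 0.9 & 0 & 0.9 & 0 \\ 0.9 & 0.9 & 0 & 0.9 \\ 0 & 0 & 0.9 & 0 \end{pmatrix}$

Adjacency matrix of $G$:

$\begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$

Slide24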

MLE for Graphs

Our goal: Find $\Theta = B(V, C, M, \{p_c\})$ such that:

$\arg\max_{\Theta} P(G \mid \Theta)$

where $P(G \mid \Theta)$ is the likelihood of $G$ under AGM

How do we find $B(V, C, M, \{p_c\})$ that maximizes the likelihood?

Slide25

MLE for AGM

Our goal is to find $B(V, C, M, \{p_c\})$ such that:

$\arg\max_{B(V, C, M, \{p_c\})} \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$

Problem: Finding $B$ means finding the bipartite affiliation network. There is no nice way to do this.

Fitting $B(V, C, M, \{p_c\})$ is too hard, let’s change the model (so it is easier to fit)!

Slide26

From AGM to BigCLAM

Relaxation: Memberships have strengths

$F_{uA}$: The membership strength of node $u$ to community $A$ ($F_{uA} = 0$: no membership)

Each community $A$ links nodes independently:

$P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$

Slide27

Factor Matrix $F$

Community membership strength matrix $F$ (rows: nodes, columns: communities)

$F_{uA}$ … strength of $u$’s membership to $A$

$F_u$ … vector of community membership strengths of $u$

$P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$

Probability of connection increases with the product of strengths

Notice: If one node doesn’t belong to the community ($F_{uC} = 0$) then $P_C(u,v) = 0$

Prob. that at least one common community $C$ links the nodes:

$P(u,v) = 1 - \prod_{C} (1 - P_C(u,v))$
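The two formulas on this slide can be checked against each other in code. The sketch below is an illustration under stated assumptions: the function names are chosen here, the strength values are borrowed from the example matrix that appears later in the deck, and the row labels `u`, `v`, `w` are assumed. It also shows that the per-community product collapses to a single dot product.

```python
import math


def community_link_prob(f_u, f_v, c):
    """P_c(u, v) = 1 - exp(-F_uc * F_vc): community c alone links u and v."""
    return 1.0 - math.exp(-f_u[c] * f_v[c])


def edge_prob(f_u, f_v):
    """P(u, v) = 1 - prod_c (1 - P_c(u, v)): at least one community links them."""
    no_link = 1.0
    for c in range(len(f_u)):
        no_link *= 1.0 - community_link_prob(f_u, f_v, c)
    return 1.0 - no_link


def edge_prob_dot(f_u, f_v):
    """Equivalent closed form: 1 - exp(-F_u . F_v)."""
    return 1.0 - math.exp(-sum(a * b for a, b in zip(f_u, f_v)))


# Membership strengths over 4 communities (row labels are an assumption).
F_u = [0.0, 1.2, 0.0, 0.2]
F_v = [0.5, 0.0, 0.0, 0.8]
F_w = [0.0, 1.8, 1.0, 0.0]

print(edge_prob(F_u, F_v), edge_prob_dot(F_u, F_v))
print(edge_prob(F_u, F_w), edge_prob(F_v, F_w))
```

Nodes with no community in common (here `v` and `w`) get probability exactly zero, because every factor in the product equals one.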

Slide28

From AGM to BigCLAM

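Community $A$ links nodes $u$, $v$ independently:

$P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$

Then prob. at least one common $C$ links them:

$P(u,v) = 1 - \prod_{C} (1 - P_C(u,v)) = 1 - \exp(-\sum_{C} F_{uC} \cdot F_{vC}) = 1 - \exp(-F_u \cdot F_v^T)$

Example $F$ matrix (node community membership strengths):

$F_u = (0, 1.2, 0, 0.2)$

$F_v = (0.5, 0, 0, 0.8)$

$F_w = (0, 1.8, 1, 0)$

Then: $F_u \cdot F_v^T = 0.16$

And: $P(u,v) = 1 - \exp(-0.16) \approx 0.15$

But: $P(u,w) = 1 - \exp(-2.16) \approx 0.88$ and $P(v,w) = 0$

Slide29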

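BigCLAM: How to find F

Task: Given a network $G(V,E)$, estimate $F$

Find $F$ that maximizes the likelihood:

$\arg\max_{F} \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$

where: $P(u,v) = 1 - \exp(-F_u \cdot F_v^T)$

Many times we take the logarithm of the likelihood, and call it log-likelihood: $l(F) = \log P(G \mid F)$

Goal: Find $F$ that maximizes $l(F)$:

$l(F) = \sum_{(u,v) \in E} \log(1 - \exp(-F_u F_v^T)) - \sum_{(u,v) \notin E} F_u F_v^T$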
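A direct, unoptimized evaluation of the log-likelihood is sketched below. It assumes an undirected graph given as a list of node-index pairs and a membership matrix `F` stored as a list of rows; the function names and the small example are hypothetical. The non-edge term is simply the negative dot product, because the log of one minus the edge probability equals minus the dot product of the two rows.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_likelihood(F, edges):
    """l(F): log(1 - exp(-F_u.F_v)) summed over edges, minus F_u.F_v over non-edges."""
    n = len(F)
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for u in range(n):
        for v in range(u + 1, n):  # each unordered pair once
            x = dot(F[u], F[v])
            if frozenset((u, v)) in edge_set:
                total += math.log(1.0 - math.exp(-x))  # needs x > 0 on edges
            else:
                total -= x
    return total


# Hypothetical path graph 0 - 1 - 2 with two communities.
edges = [(0, 1), (1, 2)]
F_good = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # 0 and 2 share no community
F_flat = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # everyone in everything

ll_good = log_likelihood(F_good, edges)
ll_flat = log_likelihood(F_flat, edges)
print(ll_good, ll_flat)
```

The memberships that keep the unconnected pair in separate communities score higher, since the flat assignment pays a penalty for the missing edge between nodes 0 and 2.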

 

Slide30

BigCLAM: V1.0

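Compute gradient of a single row $F_u$ of $F$:

$\nabla l(F_u) = \sum_{v \in \mathcal{N}(u)} F_v \frac{\exp(-F_u F_v^T)}{1 - \exp(-F_u F_v^T)} - \sum_{v \notin \mathcal{N}(u)} F_v$

$\mathcal{N}(u)$ … set of outgoing neighbors of $u$

Coordinate gradient ascent:

Iterate over the rows of $F$:

Compute gradient $\nabla l(F_u)$ of row $u$ (while keeping others fixed)

Update the row $F_u$: $F_u \leftarrow F_u + \eta \nabla l(F_u)$

Project $F_u$ back to a non-negative vector: If $F_{uC} < 0$: $F_{uC} = 0$

This is slow! Computing $\nabla l(F_u)$ takes time linear in the number of nodes!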
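One step of the coordinate gradient ascent described on this slide might look as follows. This is a sketch, not the BigCLAM reference code: the function names, the step size `eta`, the neighbor dictionary format, and the toy data are assumptions, and the code assumes every neighbor pair has a strictly positive dot product so the division is well defined.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gradient_row(F, neighbors, u):
    """Naive gradient of l(F) with respect to row F_u; visits every other node."""
    k = len(F[u])
    grad = [0.0] * k
    for v in range(len(F)):
        if v == u:
            continue
        if v in neighbors[u]:
            x = dot(F[u], F[v])  # assumed > 0 for neighbors
            w = math.exp(-x) / (1.0 - math.exp(-x))
            for c in range(k):
                grad[c] += F[v][c] * w
        else:
            for c in range(k):
                grad[c] -= F[v][c]
    return grad


def ascent_step(F, neighbors, u, eta=0.01):
    """Update row u in place: gradient step, then clip negatives to zero."""
    grad = gradient_row(F, neighbors, u)
    F[u] = [max(0.0, f + eta * g) for f, g in zip(F[u], grad)]


# Hypothetical path graph 0 - 1 - 2 with two communities.
neighbors = {0: {1}, 1: {0, 2}, 2: {1}}
F = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

g0 = gradient_row(F, neighbors, 0)
ascent_step(F, neighbors, 0, eta=0.1)
print(g0, F[0])
```

In this toy run the second coordinate of row 0 has a negative gradient, so the projection keeps it at zero rather than letting the membership strength go negative.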

Slide31

BigCLAM: V2.0

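However, we notice:

$\sum_{v \notin \mathcal{N}(u)} F_v = \sum_{v} F_v - F_u - \sum_{v \in \mathcal{N}(u)} F_v$

We cache $\sum_{v} F_v$

So, computing $\sum_{v \notin \mathcal{N}(u)} F_v$ now takes time linear in the degree $|\mathcal{N}(u)|$ of $u$

In networks, the degree of a node is much smaller than the total number of nodes in the network, so this is a significant speedup!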
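The caching trick can be verified on a small example. The sketch below compares a naive sum over all non-neighbors with the cached version; the function names and the toy data are hypothetical.

```python
def non_neighbor_sum_naive(F, neighbors, u):
    """Sum of F_v over all v that are neither u nor a neighbor of u: O(n) per call."""
    k = len(F[u])
    total = [0.0] * k
    for v in range(len(F)):
        if v != u and v not in neighbors[u]:
            for c in range(k):
                total[c] += F[v][c]
    return total


def column_sums(F):
    """Compute sum_v F_v once; it must be refreshed whenever a row changes."""
    return [sum(row[c] for row in F) for c in range(len(F[0]))]


def non_neighbor_sum_cached(F, neighbors, u, cached_total):
    """cached_total - F_u - (sum of F_v over neighbors): O(degree(u)) per call."""
    total = [t - f for t, f in zip(cached_total, F[u])]
    for v in neighbors[u]:
        for c in range(len(total)):
            total[c] -= F[v][c]
    return total


# Hypothetical graph: path 0 - 1 - 2 plus an isolated node 3.
neighbors = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
F = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]]

cache = column_sums(F)
for u in range(len(F)):
    print(u, non_neighbor_sum_naive(F, neighbors, u),
          non_neighbor_sum_cached(F, neighbors, u, cache))
```

After a row update, the cache can be kept current by adding the new row and subtracting the old one, so it never has to be recomputed from scratch.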

 

Slide32

BigCLAM: Scalability

BigCLAM takes 5 minutes for 300k node nets

Other methods take 10 days

Can process networks with 100M edges!

Slide33

Extension: Directed memberships

Slide34

Extension: Beyond Clusters

Slide35

Extension: Directed AGM

Extension: Make community membership edges directed!

Outgoing membership: Node “sends” edges

Incoming membership: Node “receives” edges

Slide36

Example: Model and Network

Slide37

Directed AGM

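Everything is almost the same except now we have 2 matrices: $F$ and $H$

$F$ … out-going community memberships

$H$ … in-coming community memberships

Edge prob.: $P(u,v) = 1 - \exp(-F_u H_v^T)$

Network log-likelihood:

$l(F,H) = \sum_{(u,v) \in E} \log(1 - \exp(-F_u H_v^T)) - \sum_{(u,v) \notin E} F_u H_v^T$

which we optimize the same way as before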
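The effect of splitting memberships into outgoing and incoming parts can be seen in a tiny sketch. Everything here is illustrative: the names `F_out` and `H_in` are placeholders for the two matrices, and the strength values are hypothetical.

```python
import math


def directed_edge_prob(f_out_u, h_in_v):
    """P(u -> v) = 1 - exp(-(outgoing strengths of u) . (incoming strengths of v))."""
    return 1.0 - math.exp(-sum(a * b for a, b in zip(f_out_u, h_in_v)))


# Hypothetical strengths for a single community: u only sends, v only receives.
F_out = {"u": [2.0], "v": [0.0]}
H_in = {"u": [0.0], "v": [1.5]}

p_uv = directed_edge_prob(F_out["u"], H_in["v"])  # edge u -> v
p_vu = directed_edge_prob(F_out["v"], H_in["u"])  # edge v -> u
print(p_uv, p_vu)
```

Unlike the undirected model, the probability is no longer symmetric in the two nodes, which lets a community contain "senders" and "receivers" rather than a single dense cluster.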

 

Slide38

Predator-prey Communities

Slide39

More details at…

Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013.

Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014.

Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013.