Slide1
Analysis of Large Graphs: Overlapping Communities
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
Slide2
Identifying Communities
Nodes: Football Teams
Edges: Games played
Can we identify node groups?
(communities, modules, clusters)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Slide3
NCAA Football Network
NCAA conferences
Nodes: Football Teams
Edges: Games played
Slide4
Protein-Protein Interactions
Can we identify functional modules?
Nodes: Proteins
Edges: Physical interactions
Slide5
Protein-Protein Interactions
Functional modules
Nodes: Proteins
Edges: Physical interactions
Slide6
Facebook Network
Can we identify social communities?
Nodes: Facebook Users
Edges: Friendships
Slide7
Facebook Network
Social communities: High school, Summer internship, Stanford (Squash), Stanford (Basketball)
Nodes: Facebook Users
Edges: Friendships
Slide8
Overlapping Communities
Non-overlapping vs. overlapping communities
Slide9
Non-overlapping Communities
Network
Adjacency matrix (rows and columns: nodes)
Slide10
Communities as Tiles!
What is the structure of community overlaps?
Edge density in the overlaps is higher!
Communities as “tiles”
Slide11
Recap so far…
This is what we want!
Communities in a network
Slide12
Plan of attack
1) Given a model, we generate the network
2) Given a network, find the “best” model
[Figure: generative model for networks and an example network on nodes A to H, shown for both steps]
Slide13
Model of networks
Goal: Define a model that can generate networks
The model will have a set of “parameters” that we will later want to estimate (and detect communities)
Q: Given a set of nodes, how do communities “generate” edges of the network?
[Figure: generative model for networks producing an example network on nodes A to H]
Slide14
Community-Affiliation Graph
Generative model $B(V, C, M, \{p_c\})$ for graphs:
Nodes $V$, Communities $C$, Memberships $M$
Each community $c$ has a single probability $p_c$
Later we fit the model to networks to detect communities
[Figure: Model (bipartite graph linking Communities, C, with probabilities $p_A$, $p_B$, to Nodes, V, through Memberships, M) and the Network it generates]
Slide15
AGM: Generative Process
AGM generates the links: For each pair of nodes in community $A$, we connect them with prob. $p_A$
The overall edge probability is:
$P(u,v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c)$
$M_u$ … set of communities node $u$ belongs to
If $u, v$ share no communities: $P(u,v) = \varepsilon$ (a small background probability)
Think of this as an “OR” function: If at least 1 community says “YES” we create an edge
[Figure: Model (Communities, C; Nodes, V; Memberships, M; community affiliations with probabilities $p_A$, $p_B$) and the generated Network]
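A minimal sketch of this generative process, written in Python under stated assumptions: the function names, the community labels `c1` and `c2`, the membership table, and the probabilities are all hypothetical and chosen only for illustration. The `epsilon` argument stands in for an assumed background probability for pairs that share no community (set to 0 here).

```python
import itertools
import random


def agm_edge_probability(u, v, memberships, p, epsilon=0.0):
    """Edge probability: 1 minus the product of (1 - p_c) over shared communities."""
    shared = memberships[u] & memberships[v]
    if not shared:
        return epsilon  # assumed background probability when no community is shared
    no_edge = 1.0
    for c in shared:
        no_edge *= 1.0 - p[c]  # community c fails to create the edge
    return 1.0 - no_edge  # "OR": at least one shared community says yes


def generate_agm_graph(memberships, p, epsilon=0.0, seed=0):
    """Flip one biased coin per node pair and return the set of sampled edges."""
    rng = random.Random(seed)
    edges = set()
    for u, v in itertools.combinations(sorted(memberships), 2):
        if rng.random() < agm_edge_probability(u, v, memberships, p, epsilon):
            edges.add((u, v))
    return edges


# Hypothetical memberships M and per-community probabilities p_c.
memberships = {
    "A": {"c1"},
    "B": {"c1"},
    "C": {"c1", "c2"},
    "D": {"c2"},
    "E": {"c2"},
    "F": {"c1", "c2"},
}
p = {"c1": 0.8, "c2": 0.5}

edges = generate_agm_graph(memberships, p)
```

Nodes C and F sit in both communities, so their edge probability is higher than that of any pair sharing a single community, which is the "denser overlap" behaviour the model is meant to capture.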
Slide16
Recap: AGM networks
Model
Network
Slide17
AGM: Flexibility
AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested
Slide18
How do we detect communities with AGM?
Slide19
Detecting Communities
Detecting communities with AGM:
Given a graph $G$, find the model:
1) Affiliation graph $M$
2) Number of communities $C$
3) Parameters $p_c$
[Figure: example network on nodes A to H]
Slide20
Maximum Likelihood Estimation
Maximum Likelihood Principle (MLE):
Given: Data $X$
Assumption: Data is generated by some model $f(\Theta)$
$f$ … model
$\Theta$ … model parameters
Want to estimate $P_f(X \mid \Theta)$: the probability that our model $f$ (with parameters $\Theta$) generated the data
Now let’s find the most likely model that could have generated the data: $\arg\max_{\Theta} P_f(X \mid \Theta)$
Slide21
Example: MLE
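Imagine we are given a set of coin flips
Task: Figure out the bias of a coin!
Data: Sequence of coin flips: $X = [x_1, \dots, x_n]$, $x_i \in \{0, 1\}$
Model: $f(\Theta)$ = return 1 with prob. $\Theta$, else return 0
What is $P_f(X \mid \Theta)$? Assuming coin flips are independent,
$P_f(X \mid \Theta) = P_f(x_1 \mid \Theta) \cdot P_f(x_2 \mid \Theta) \cdots P_f(x_n \mid \Theta)$
What is $P_f(1 \mid \Theta)$? Simple, $P_f(1 \mid \Theta) = \Theta$
Then, $P_f(X \mid \Theta) = \Theta^{k} (1 - \Theta)^{n-k}$, where $k$ is the number of 1s in $X$
What did we learn? Our data was most likely generated by a coin with bias $\Theta = k/n$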
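The coin example can be checked numerically. The sketch below assumes a hypothetical sequence of 8 flips with 3 ones (the data values are made up for illustration) and compares the closed-form estimate, the fraction of 1s, against a brute-force search over a grid of candidate biases.

```python
def coin_likelihood(flips, theta):
    """P(X | theta) for independent flips: theta^k * (1 - theta)^(n - k)."""
    k = sum(flips)
    return theta ** k * (1.0 - theta) ** (len(flips) - k)


def coin_mle(flips):
    """Closed-form maximizer of the likelihood: the fraction of 1s."""
    return sum(flips) / len(flips)


flips = [1, 0, 0, 0, 1, 0, 0, 1]  # hypothetical data: 3 ones in 8 flips
theta_hat = coin_mle(flips)

# Sanity check: search a grid of candidate biases for the largest likelihood.
grid = [i / 1000 for i in range(1001)]
theta_grid = max(grid, key=lambda t: coin_likelihood(flips, t))
```

Both routes pick the same bias, and the likelihood at that value is larger than at the fair-coin value 0.5.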
Slide22
MLE for Graphs
How do we do MLE for graphs?
Model generates a probabilistic adjacency matrix
We then flip all the entries of the probabilistic matrix to obtain the binary adjacency matrix
For every pair of nodes $u, v$ AGM gives the prob. $P(u,v)$ of them being linked:
0     0.10  0.10  0.04
0.10  0     0.02  0.06
0.10  0.02  0     0.06
0.04  0.06  0.06  0
Flip biased coins:
0  1  0  0
1  0  1  1
0  1  0  1
0  1  1  0
The likelihood of AGM generating graph $G$:
$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$
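As a rough illustration of the likelihood of a graph under a probabilistic adjacency matrix, the sketch below multiplies $P(u,v)$ over linked pairs and $1 - P(u,v)$ over unlinked pairs, using the two 4x4 matrices from this slide. It works in log space to avoid underflow on larger graphs; the function name and the choice to count each unordered pair once are assumptions.

```python
import itertools
import math

# Probabilistic adjacency matrix from the slide.
P = [
    [0.00, 0.10, 0.10, 0.04],
    [0.10, 0.00, 0.02, 0.06],
    [0.10, 0.02, 0.00, 0.06],
    [0.04, 0.06, 0.06, 0.00],
]
# Binary adjacency matrix from the slide.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
]


def graph_log_likelihood(P, A):
    """Sum of log P(u,v) over edges plus log(1 - P(u,v)) over non-edges."""
    total = 0.0
    # Assumption: the graph is undirected, so each unordered pair counts once.
    for u, v in itertools.combinations(range(len(A)), 2):
        q = P[u][v] if A[u][v] else 1.0 - P[u][v]
        total += math.log(q)
    return total


log_lik = graph_log_likelihood(P, A)
likelihood = math.exp(log_lik)
```

The resulting likelihood is tiny even for four nodes, which is one practical reason to maximize the log-likelihood instead of the raw product.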
Slide23
Graphs: Likelihood P(G|Θ)
Given graph G(V,E) and Θ, we calculate the likelihood that Θ generated G: P(G|Θ)
Θ = B(V, C, M, {p_c})
Edge probabilities under Θ (communities A and B):
0    0.9  0.9  0
0.9  0    0.9  0
0.9  0.9  0    0.9
0    0    0.9  0
G:
0  1  1  0
1  0  1  0
1  1  0  1
0  0  1  0
Slide24
MLE for Graphs
Our goal: Find $\Theta = B(V, C, M, \{p_c\})$ such that:
$\arg\max_{\Theta} P(G \mid \Theta)$, where $P(G \mid \Theta)$ is the likelihood of the AGM generating $G$
How do we find $B(V, C, M, \{p_c\})$ that maximizes the likelihood?
Slide25
MLE for AGM
Our goal is to find $B(V, C, M, \{p_c\})$ such that:
$\arg\max_{B(V, C, M, \{p_c\})} \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$
Problem: Finding $B$ means finding the bipartite affiliation network. There is no nice way to do this.
Fitting $B(V, C, M, \{p_c\})$ is too hard, let’s change the model (so it is easier to fit)!
Slide26
From AGM to BigCLAM
Relaxation: Memberships have strengths
$F_{uA}$: The membership strength of node $u$ to community $A$ ($F_{uA} = 0$: no membership)
Each community $A$ links nodes independently:
$P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$
Slide27
Factor Matrix
Community membership strength matrix $F$ (rows: nodes, columns: communities)
$F_{vA}$ … strength of $v$’s membership to $A$
$F_u$ … vector of community membership strengths of $u$
Probability of connection grows with the product of strengths: $P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$
Notice: If one node doesn’t belong to the community ($F_{uC} = 0$) then $P_C(u,v) = 0$
Prob. that at least one common community $C$ links the nodes:
$P(u,v) = 1 - \prod_{C} (1 - P_C(u,v))$
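Because each factor $1 - P_C(u,v)$ equals $\exp(-F_{uC} F_{vC})$, the product over communities collapses to $\exp(-F_u \cdot F_v^T)$. The sketch below checks that identity numerically. The three strength vectors are taken from the rows of the example matrix on the following slide; the names `F_u`, `F_v`, `F_w` and the function names are assumptions made for illustration.

```python
import math


def per_community_prob(f_uc, f_vc):
    """P_C(u,v) = 1 - exp(-F_uC * F_vC) for a single community C."""
    return 1.0 - math.exp(-f_uc * f_vc)


def edge_prob_product_form(F_u, F_v):
    """1 - prod over communities of (1 - P_C(u,v))."""
    no_link = 1.0
    for f_uc, f_vc in zip(F_u, F_v):
        no_link *= 1.0 - per_community_prob(f_uc, f_vc)
    return 1.0 - no_link


def edge_prob_dot_form(F_u, F_v):
    """1 - exp(-F_u . F_v), the collapsed form of the same probability."""
    return 1.0 - math.exp(-sum(a * b for a, b in zip(F_u, F_v)))


# Strength vectors over four communities (illustrative row labels).
F_u = [0.0, 1.2, 0.0, 0.2]
F_v = [0.5, 0.0, 0.0, 0.8]
F_w = [0.0, 1.8, 1.0, 0.0]

p_uv = edge_prob_dot_form(F_u, F_v)
p_uw = edge_prob_dot_form(F_u, F_w)
p_vw = edge_prob_dot_form(F_v, F_w)
```

Nodes whose strength vectors have no community in common, such as the second and third rows here, get probability exactly zero under this model.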
Slide28
From AGM to BigCLAM
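Community $A$ links nodes $u, v$ independently:
$P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$
Then prob. at least one common $C$ links them:
$P(u,v) = 1 - \prod_{C} (1 - P_C(u,v)) = 1 - \exp(-\sum_{C} F_{uC} \cdot F_{vC}) = 1 - \exp(-F_u \cdot F_v^T)$
Example $F$ matrix (node community membership strengths; rows $F_u$, $F_v$, $F_w$):
F_u:  0    1.2  0    0.2
F_v:  0.5  0    0    0.8
F_w:  0    1.8  1    0
Then: $F_u \cdot F_v^T = 0.16$
And: $P(u,v) = 1 - \exp(-0.16) \approx 0.15$
But: $P(u,w) = 1 - \exp(-2.16) \approx 0.88$ and $P(v,w) = 1 - \exp(0) = 0$
Slide29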
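BigCLAM: How to find F
Task: Given a network $G(V,E)$, estimate $F$
Find $F$ that maximizes the likelihood:
$\arg\max_{F} \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$
where: $P(u,v) = 1 - \exp(-F_u \cdot F_v^T)$
Many times we take the logarithm of the likelihood, and call it log-likelihood: $l(F) = \log P(G \mid F)$
Goal: Find $F$ that maximizes $l(F)$:
$l(F) = \sum_{(u,v) \in E} \log(1 - \exp(-F_u F_v^T)) - \sum_{(u,v) \notin E} F_u F_v^T$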
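A small sketch of the BigCLAM log-likelihood, assuming an undirected graph whose edges are stored as pairs of node indices. The strength values are the rows of the example matrix on the preceding slide; the edge sets and the function name are hypothetical. Each non-edge contributes $\log(\exp(-F_u F_v^T)) = -F_u F_v^T$, which is why the second sum has no logarithm.

```python
import itertools
import math


def bigclam_log_likelihood(F, edges):
    """Sum of log(1 - exp(-F_u.F_v)) over edges minus F_u.F_v over non-edges."""
    total = 0.0
    for u, v in itertools.combinations(range(len(F)), 2):
        s = sum(a * b for a, b in zip(F[u], F[v]))
        if (u, v) in edges:
            if s <= 0.0:
                return -math.inf  # an observed edge that the model gives probability 0
            total += math.log(-math.expm1(-s))  # log(1 - exp(-s)), computed stably
        else:
            total -= s  # log(1 - P(u,v)) = -F_u . F_v
    return total


F = [
    [0.0, 1.2, 0.0, 0.2],
    [0.5, 0.0, 0.0, 0.8],
    [0.0, 1.8, 1.0, 0.0],
]
edges = {(0, 1), (0, 2)}  # hypothetical edges, stored as (smaller index, larger index)
ll = bigclam_log_likelihood(F, edges)

# A graph containing an edge between nodes with no shared community is impossible
# under this F, so its log-likelihood is minus infinity.
ll_impossible = bigclam_log_likelihood(F, {(1, 2)})
```

The loop over all pairs is quadratic in the number of nodes, so this form is only useful for checking small examples.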
Slide30
BigCLAM: V1.0
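Compute gradient of a single row $F_u$ of $F$:
$\nabla l(F_u) = \sum_{v \in N(u)} F_v \frac{\exp(-F_u F_v^T)}{1 - \exp(-F_u F_v^T)} - \sum_{v \notin N(u)} F_v$
$N(u)$ … set of outgoing neighbors of $u$
Coordinate gradient ascent:
Iterate over the rows of $F$:
Compute gradient $\nabla l(F_u)$ of row $u$ (while keeping others fixed)
Update the row $F_u$: $F_u \leftarrow F_u + \eta \nabla l(F_u)$
Project $F_u$ back to a non-negative vector: If $F_{uC} < 0$: $F_{uC} = 0$
This is slow! Computing $\nabla l(F_u)$ takes time linear in the number of nodes!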
Slide31
BigCLAM: V2.0
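However, we notice:
$\sum_{v \notin N(u)} F_v = \sum_{v} F_v - F_u - \sum_{v \in N(u)} F_v$
We cache $\sum_{v} F_v$
So, computing $\sum_{v \notin N(u)} F_v$ now takes time linear in the degree $|N(u)|$ of $u$
In networks the degree of a node is much smaller than the total number of nodes in the network, so this is a significant speedup!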
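The caching idea can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the 4-node graph, the starting strengths, the step size `eta`, and the small `floor` that guards against a zero dot product are all hypothetical. `gradient_naive` loops over every node, `gradient_cached` touches only the neighbors of $u$ and uses the cached column sums of $F$.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def neighbor_weight(F_u, F_v, floor=1e-10):
    """exp(-s) / (1 - exp(-s)) with s = F_u . F_v; floor is an assumed guard for s = 0."""
    s = max(dot(F_u, F_v), floor)
    return math.exp(-s) / (1.0 - math.exp(-s))


def gradient_naive(u, F, neighbors):
    """Loops over every other node, so the cost grows with the number of nodes."""
    k = len(F[u])
    grad = [0.0] * k
    for v in range(len(F)):
        if v == u:
            continue
        if v in neighbors[u]:
            w = neighbor_weight(F[u], F[v])
            for c in range(k):
                grad[c] += F[v][c] * w
        else:
            for c in range(k):
                grad[c] -= F[v][c]
    return grad


def gradient_cached(u, F, neighbors, col_sum):
    """Touches only N(u); col_sum is the cached sum of all rows of F."""
    k = len(F[u])
    grad = [0.0] * k
    nbr_sum = [0.0] * k
    for v in neighbors[u]:
        w = neighbor_weight(F[u], F[v])
        for c in range(k):
            grad[c] += F[v][c] * w
            nbr_sum[c] += F[v][c]
    for c in range(k):
        # sum over non-neighbors = sum over all rows - own row - neighbor rows
        grad[c] -= col_sum[c] - F[u][c] - nbr_sum[c]
    return grad


def ascent_step(u, F, neighbors, col_sum, eta=0.05):
    """One projected update of row u that also keeps the cached sums in sync."""
    grad = gradient_cached(u, F, neighbors, col_sum)
    for c in range(len(F[u])):
        new_value = max(0.0, F[u][c] + eta * grad[c])  # project to non-negative
        col_sum[c] += new_value - F[u][c]
        F[u][c] = new_value


# Hypothetical 4-node graph and starting strengths for 2 communities.
neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
F = [[0.9, 0.1], [0.8, 0.3], [0.5, 0.6], [0.1, 0.7]]
col_sum = [sum(row[c] for row in F) for c in range(2)]

grads_naive = [gradient_naive(u, F, neighbors) for u in range(4)]
grads_cached = [gradient_cached(u, F, neighbors, col_sum) for u in range(4)]

for u in range(4):
    ascent_step(u, F, neighbors, col_sum)
```

Updating `col_sum` inside `ascent_step` is what keeps the cache valid across rows; recomputing it from scratch after every update would bring back the linear cost the trick is meant to remove.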
Slide32
BigCLAM: Scalability
BigCLAM takes 5 minutes for 300k node nets
Other methods take 10 days
Can process networks with 100M edges!
Slide33
Extension: Directed memberships
Slide34
Extension: Beyond Clusters
Slide35
Extension: Directed AGM
Extension: Make community membership edges directed!
Outgoing membership: Node “sends” edges
Incoming membership: Node “receives” edges
Slide36
Example: Model and Network
Slide37
Directed AGM
Everything is almost the same except now we have 2 matrices: $F$ and $H$
$F$ … out-going community memberships
$H$ … in-coming community memberships
Edge prob.: $P(u,v) = 1 - \exp(-F_u H_v^T)$
Network log-likelihood:
$l(F,H) = \sum_{(u,v) \in E} \log(1 - \exp(-F_u H_v^T)) - \sum_{(u,v) \notin E} F_u H_v^T$
which we optimize the same way as before
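A minimal sketch of the directed variant, assuming edges are ordered pairs `(u, v)` meaning "u links to v" and that self-loops are ignored (both are assumptions for illustration, as are the function names and the toy matrices). The only change from the undirected case is that the sender's row of the outgoing matrix is paired with the receiver's row of the incoming matrix, so the edge probability is no longer symmetric.

```python
import math


def directed_edge_prob(F_u, H_v):
    """P(u -> v) = 1 - exp(-F_u . H_v)."""
    return -math.expm1(-sum(a * b for a, b in zip(F_u, H_v)))


def directed_log_likelihood(F, H, edges):
    """Log-likelihood over all ordered pairs u != v."""
    total = 0.0
    n = len(F)
    for u in range(n):
        for v in range(n):
            if u == v:
                continue  # assumption: self-loops are not modeled
            s = sum(a * b for a, b in zip(F[u], H[v]))
            if (u, v) in edges:
                if s <= 0.0:
                    return -math.inf  # observed edge with model probability 0
                total += math.log(-math.expm1(-s))
            else:
                total -= s
    return total


# Hypothetical example: nodes 0 and 2 only send, node 1 only receives.
F = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.5]]  # outgoing memberships
H = [[0.0, 0.0], [2.0, 0.5], [0.0, 0.0]]  # incoming memberships
edges = {(0, 1), (2, 1)}

p_01 = directed_edge_prob(F[0], H[1])
p_10 = directed_edge_prob(F[1], H[0])
ll = directed_log_likelihood(F, H, edges)
```

Here the probability of an edge from node 0 to node 1 is high while the reverse direction has probability zero, which is the kind of asymmetric "2-mode" structure the directed extension is meant to express.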
Slide38
Predator-prey Communities
Slide39
More details at…
Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013.
Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014.
Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference on Data Mining (ICDM), 2013.