Slide1
Estimating Clique Composition and Size Distributions from Sampled Network Data
Minas Gjoka, Emily Smith, Carter T. Butts
University of California, Irvine
Slide2
Outline
Problem statement
Estimation methodology
Results with real-life graphs
Slide3
Cliques
A complete subgraph that contains i vertices is an order-i clique.
[Figure: example cliques of order 1, 2, 3, 4, 5, ..., i]
A maximal clique is a clique that is not included in a larger clique.
Slide4
Cliques
A complete subgraph that contains i vertices is an order-i clique.
A maximal clique is a clique that is not included in a larger clique.
[Figure: an order-4 clique on vertices a, b, c, d. It contains 4 non-maximal order-3 cliques: {a, b, d}, {b, c, d}, {a, b, c}, {a, c, d}]
Slide5
Counting of Cliques
C_i is the count of order-i cliques (maximal or non-maximal).
[Figure: graph G on nodes 1 to 8, with its cliques of order 1, 2, 3, and 4 marked C_1, C_2, C_3, C_4]
Clique Distribution of G
C = (C_1, C_2, C_3, C_4) = (0, 1, 2, 1)
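As a rough illustration of what a clique distribution is, the sketch below counts every order-i clique (maximal or not) by brute force. The toy graph, `build_adj`, and `clique_counts` are hypothetical names introduced here, not the graph or code from the talk, and the approach is only practical for very small graphs.

```python
from itertools import combinations

def build_adj(edges):
    """Build an undirected adjacency dict {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clique_counts(adj, max_order):
    """Return {i: C_i}, counting every order-i clique (maximal or not)."""
    nodes = sorted(adj)
    counts = {}
    for i in range(1, max_order + 1):
        counts[i] = sum(
            1 for group in combinations(nodes, i)
            if all(v in adj[u] for u, v in combinations(group, 2))
        )
    return counts

# Hypothetical toy graph: a 4-clique on a, b, c, d plus a pendant edge d-e.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"),
             ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
toy_counts = clique_counts(build_adj(toy_edges), 4)
print(toy_counts)
```

Restricting the count to maximal cliques would need an extra check that no vertex outside the group is adjacent to all of its members.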
Goal 1: Estimate C_i (for all i) in graph G from sampled network data
Slide6
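Counting of Cliques: Vertex Attributes
[Figure: graph G on nodes 1 to 8, with vertices marked by attribute value]
Vertex attribute vector X with values X_j, j = 1..p, p <= N (here p = 3)
A composition vector u records how many vertices of a clique take each attribute value, e.g. u = [3 0 0], u = [2 1 0], u = [2 0 1] for order-3 cliques.
Clique Composition Distribution of G
C_u is the count of order-u cliques.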
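A minimal sketch of counting cliques by composition, assuming each vertex carries a single categorical attribute and that u lists how many clique members fall in each attribute value. The toy data and the function name `composition_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def composition_counts(adj, attr, values, order):
    """Count cliques of the given order, keyed by composition vector u."""
    counts = Counter()
    for group in combinations(sorted(adj), order):
        if all(b in adj[a] for a, b in combinations(group, 2)):
            # u[k] = number of clique members whose attribute equals values[k]
            u_vec = tuple(sum(attr[x] == val for x in group) for val in values)
            counts[u_vec] += 1
    return counts

# Hypothetical toy data: triangles {1, 2, 3} and {2, 3, 4}; three attribute values.
toy_adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
toy_attr = {1: "r", 2: "r", 3: "r", 4: "g"}
toy_cu = composition_counts(toy_adj, toy_attr, ("r", "g", "b"), 3)
print(dict(toy_cu))
```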
Goal 2: Estimate C_u (for all u) in graph G from sampled network data
Slide7
What type of cliques can we count?
Maximal cliques
Non-maximal cliques
Slide8
Motivation
Counting of Cliques
  cliques describe local structure (clustering, cohesive subgroups)
  algorithmic implications of cliques in engineering context
  cliques used as input in network models
Sampled network data
  unknown graphs with access limitations
  massive known graphs
Slide9
Related Work
Model-based methods
  Do not scale
  Do not help with counting
Design-based methods
  Subgraph (or motif) counting tools that use sampling, e.g. MFinder, FANMOD, MODA
  No support for subgraphs of size larger than 10
  No support for vertex attributes
  Biased estimation
Slide10
Estimation
Slide11
Methodology
Collect an egocentric network sample H_1, ..., H_n
  Collect a probability sample of n nodes from the graph: V_j, X[V_j], j = 1..n
    uniform independence sampling
    weighted independence sampling
    link-trace sampling
    with replacement
    without replacement
Slide12
Methodology
Collect an egocentric network sample H_1, ..., H_n
  Collect a probability sample of n nodes from the graph: V_j, X[V_j], j = 1..n
[Figure: graph G(V,E) on nodes 1 to 8 with n = 2 sampled nodes, 7 and 4; the quantity of interest is C_3]
Slide13
Methodology
Collect an egocentric network sample H_1, ..., H_n
  Collect a probability sample of n nodes from the graph: V_j, X[V_j], j = 1..n
  Fetch the egonet of each sampled node: G[V_j], j = 1..n
[Figure: graph G(V,E) on nodes 1 to 8, n = 2; the fetched egonets contain nodes {6, 7, 8} and {2, 3, 4, 5}; the quantity of interest is C_3]
Slide14
Methodology
Collect an egocentric network sample H_1, ..., H_n
  Collect a probability sample of n nodes from the graph: V_j, X[V_j], j = 1..n
  Fetch the egonet of each sampled node: G[V_j], j = 1..n
  Calculate the clique count C_i (or C_u) in each egonet H_j
[Figure: graph G(V,E) on nodes 1 to 8, n = 2; the fetched egonets contain nodes {6, 7, 8} and {2, 3, 4, 5}; the quantity of interest is C_3]
Slide15
Methodology
Collect an egocentric network sample H_1, ..., H_n
  Collect a probability sample of n nodes from the graph: V_j, X[V_j], j = 1..n
  Fetch the egonet of each sampled node: G[V_j], j = 1..n
  Calculate the clique count C_i (or C_u) in each egonet H_j
    can use existing exact clique counting algorithms
    clique type is determined by counting algorithm
[Figure: graph G(V,E) on nodes 1 to 8, n = 2; the fetched egonets contain nodes {6, 7, 8} and {2, 3, 4, 5}, with per-egonet C_3 counts of 1 and 0]
Slide16
Methodology
Collect an egocentric network sample H_1, ..., H_n
  Collect a probability sample of n nodes from the graph: V_j, X[V_j], j = 1..n
  Fetch the egonet of each sampled node: G[V_j], j = 1..n
  Calculate the clique count C_i (or C_u) in each egonet H_j
  Apply estimation method that combines calculations
    Clique Degree Sums (CDS)
    Distinct Clique Counting (CC)
[Figure: graph G(V,E) on nodes 1 to 8, n = 2; the fetched egonets contain nodes {6, 7, 8} and {2, 3, 4, 5}, with per-egonet C_3 counts of 1 and 0]
Slide17
Methodology
Collect an egocentric network sample H_1, ..., H_n
  Collect a probability sample of n nodes from the graph: V_j, X[V_j], j = 1..n
  Fetch the egonet of each sampled node: G[V_j], j = 1..n
  Calculate the clique count C_i (or C_u) in each egonet H_j
  Apply estimation method that combines calculations
    Clique Degree Sums (CDS): labeling of neighbors not required, more space efficient
    Distinct Clique Counting (CC): higher accuracy
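The first three steps can be sketched as follows, assuming uniform independence sampling without replacement and a brute-force counter in place of an exact clique counting algorithm. The toy graph and the names `egonet`, `count_cliques`, and `sample_egonet_counts` are hypothetical.

```python
import random
from itertools import combinations

def egonet(adj, ego):
    """Induced subgraph on the ego and its neighbors, as an adjacency dict."""
    members = {ego} | adj[ego]
    return {v: adj[v] & members for v in members}

def count_cliques(sub, order):
    """Brute-force count of cliques of the given order in a small subgraph."""
    return sum(
        1 for group in combinations(sorted(sub), order)
        if all(b in sub[a] for a, b in combinations(group, 2))
    )

def sample_egonet_counts(adj, n, order, seed=0):
    """Sample n distinct egos uniformly, fetch each egonet, count its cliques."""
    rng = random.Random(seed)
    egos = rng.sample(sorted(adj), n)
    return {v: count_cliques(egonet(adj, v), order) for v in egos}

# Hypothetical toy graph: triangle 1-2-3 with a tail 3-4-5.
toy_adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
per_ego = sample_egonet_counts(toy_adj, n=2, order=3)
print(per_ego)
```

Each egonet is processed independently, so the per-egonet counts could be computed in parallel before an estimator combines them.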
[Figure: graph G(V,E) on nodes 1 to 8, n = 2; the fetched egonets contain nodes {6, 7, 8} and {2, 3, 4, 5}, with per-egonet C_3 counts of 1 and 0]
Slide18
Labeling of neighbors
[Figure: graph G on nodes 1 to 9; the quantity of interest is C_3]
Slide19
Labeling of neighbors
[Figure: graph G on nodes 1 to 9 with n = 2 sampled nodes; the data collected for each sampled node are V_j, X[V_j], G[V_j]; the quantity of interest is C_3]
Slide20
Labeling of neighbors
Distinct Clique Counting (CC): labeled neighbors
[Figure: graph G on nodes 1 to 9, n = 2; the sampled egonets are shown with labeled neighbors and used to calculate the count C_3]
Slide21
Labeling of neighbors
Distinct Clique Counting (CC): labeled neighbors
Clique Degree Sums (CDS): unlabeled neighbors
[Figure: graph G on nodes 1 to 9, n = 2; the count C_3 is calculated once from egonets with labeled neighbors and once from egonets with unlabeled neighbors]
Slide22
Clique Degree Sums: unlabeled neighbors
The order-i clique degree d_ij is the number of i-cliques to which node j belongs.
Slide23
Clique Degree Sums: unlabeled neighbors
The order-i clique degree d_ij is the number of i-cliques to which node j belongs.
[Figure: graph G(V,E) on nodes 1 to 8 with the egonet H_8 of node 8 highlighted; for C_3, d_38 = 2]
Slide24
Clique Degree Sums: unlabeled neighbors
D_i is the order-i clique degree sum:
$D_i = \sum_{j \in V} d_{ij}$
The sum runs over all nodes, and d_ij is the number of i-cliques to which node j belongs.
Slide25
Clique Degree Sums: unlabeled neighbors
[Figure: graph G(V,E) on nodes 1 to 8, with d_38 marked; the quantity of interest is C_3]
$D_i = \sum_{j \in V} d_{ij}$ (sum over all nodes of the number of i-cliques to which node j belongs)
D_3 = d_31 + d_32 + d_33 + d_34 + d_35 + d_36 + d_37 + d_38
D_3 = 1 + 1 + 0 + 1 + 2 + 1 + 1 + 2
D_3 = 9
D_3 = 3 C_3
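The identity D_3 = 3 C_3 can be checked numerically. The edge list below is an assumption chosen so that the clique degrees match the values on the slide (1, 1, 0, 1, 2, 1, 1, 2); the actual edges of the figure are not recoverable from the transcript, and `clique_degrees` is a hypothetical helper.

```python
from itertools import combinations

def clique_degrees(adj, order):
    """For each node j, count the cliques of the given order that contain j."""
    deg = {v: 0 for v in adj}
    for group in combinations(sorted(adj), order):
        if all(b in adj[a] for a, b in combinations(group, 2)):
            for v in group:
                deg[v] += 1
    return deg

# Assumed edges: triangles {6,7,8}, {1,2,5}, {4,5,8} plus a pendant edge 2-3.
edges = [(6, 7), (6, 8), (7, 8), (1, 2), (1, 5), (2, 5),
         (4, 5), (4, 8), (5, 8), (2, 3)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

d3 = clique_degrees(adj, 3)
D3 = sum(d3.values())   # every triangle is counted once per member
C3 = D3 // 3            # so dividing by the order gives the clique count
print(d3, D3, C3)
```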
D_i is the order-i clique degree sum.
Slide26
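Clique Degree Sums: unlabeled neighbors
$D_i = \sum_{j \in V} d_{ij}$ (all nodes; d_ij is the number of i-cliques to which node j belongs)
$\hat{C}_i = \frac{1}{i} \sum_{j \in S} \frac{d_{ij}}{\pi_j}$ (sampled nodes S; $\pi_j$ is the inclusion probability of node j)
$\hat{C}_i$ is a design-unbiased Horvitz-Thompson estimator ($E[\hat{C}_i] = C_i$).
Slide27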
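Clique Degree Sums: unlabeled neighbors
$\hat{C}_i = \frac{1}{i} \sum_{j \in S} \frac{d_{ij}}{\pi_j}$, with $D_i = \sum_{j \in V} d_{ij}$ taken over all nodes
S is the set of sampled nodes, $\pi_j$ is the inclusion probability of node j, and d_ij is the number of i-cliques to which node j belongs.
$\hat{C}_u = \frac{1}{|u|} \sum_{j \in S} \frac{d_{uj}}{\pi_j}$, where d_uj is the number of u-cliques to which node j belongs and |u| is the clique order (the sum of the entries of u)
Each is a design-unbiased Horvitz-Thompson estimator ($E[\hat{C}_i] = C_i$, $E[\hat{C}_u] = C_u$).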
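A minimal sketch of the Clique Degree Sums estimate, assuming the Horvitz-Thompson form (1/i) * sum over sampled nodes of d_ij / pi_j and uniform sampling without replacement, so that pi_j = n/N. The clique degrees and the function name `cds_estimate` are illustrative assumptions.

```python
from itertools import combinations

def cds_estimate(sampled_degrees, inclusion_probs, order):
    """Horvitz-Thompson estimate: (1/i) * sum_j d_ij / pi_j over sampled nodes."""
    total = sum(d / p for d, p in zip(sampled_degrees, inclusion_probs))
    return total / order

# Assumed order-3 clique degrees for an 8-node graph (true C_3 = 9 / 3 = 3).
d3 = [1, 1, 0, 1, 2, 1, 1, 2]
N, n = len(d3), 2
pi = n / N  # node inclusion probability under uniform sampling w/o replacement

# Estimate from one particular sample (the nodes at positions 4 and 5).
one_sample = cds_estimate([d3[4], d3[5]], [pi, pi], order=3)

# Averaging the estimate over every possible sample of size n recovers C_3,
# which is what design-unbiasedness means.
all_estimates = [
    cds_estimate([d3[a], d3[b]], [pi, pi], order=3)
    for a, b in combinations(range(N), n)
]
mean_estimate = sum(all_estimates) / len(all_estimates)
print(one_sample, mean_estimate)
```

Only the clique degree of each sampled ego is needed, which is why this method works with unlabeled neighbors.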
Slide28
Clique Degree Sums: Estimator Variance
We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$.
[Formula labels: node inclusion probability, joint node inclusion probability]
Slide29
Clique Degree Sums: Estimator Variance
We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$.
  Uniform Independence Sampling
  Weighted Independence Sampling
  Link-trace Sampling
  Without replacement
  With replacement
Slide30
Clique Degree Sums: Estimator Variance
We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$.
Uniform Independence Sampling
Without replacement
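A sketch of a variance estimate for this design, using the textbook Horvitz-Thompson variance estimator with pi_j = n/N and pi_jk = n(n-1)/(N(N-1)). This is standard sampling theory offered as an assumption about what the slide's formula showed; the function names and sample values are hypothetical.

```python
from statistics import variance

def ht_variance_of_total(values, N):
    """HT variance estimate for the estimated total sum_j y_j / pi_j."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two sampled nodes")
    pi = n / N                                # node inclusion probability
    pi_joint = n * (n - 1) / (N * (N - 1))    # joint inclusion probability
    var = 0.0
    for j, yj in enumerate(values):
        for k, yk in enumerate(values):
            if j == k:
                var += (1 - pi) * (yj / pi) ** 2
            else:
                var += (pi_joint - pi * pi) / pi_joint * (yj / pi) * (yk / pi)
    return var

def cds_variance(sampled_degrees, N, order):
    """Variance estimate for C_i-hat = D_i-hat / i."""
    return ht_variance_of_total(sampled_degrees, N) / order ** 2

sample_d3 = [1, 0, 2, 1]   # assumed clique degrees of n = 4 sampled nodes
v_total = ht_variance_of_total(sample_d3, N=8)
# For this design the estimator reduces to N^2 * (1 - n/N) * s^2 / n.
closed_form = 8 ** 2 * (1 - 4 / 8) * variance(sample_d3) / 4
v_c3 = cds_variance(sample_d3, N=8, order=3)
print(v_total, closed_form, v_c3)
```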
[Formula labels: joint node inclusion probability, node inclusion probability, all nodes, sampled nodes]
Slide31
Distinct Clique Counting: labeled neighbors
$\hat{C}_i = \sum_{k=1}^{c_i} \frac{1}{\pi_k}$
c_i is the number of distinct i-cliques in H_1, ..., H_n, and $\pi_k$ is the inclusion probability of the k-th distinct i-clique.
$\hat{C}_i$ is a design-unbiased Horvitz-Thompson estimator ($E[\hat{C}_i] = C_i$).
  Uniform Independence Sampling
  Weighted Independence Sampling
  Link-trace Sampling
  With replacement
  Without replacement
Slide32
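Distinct Clique Counting: labeled neighbors
$\hat{C}_i = \sum_{k=1}^{c_i} \frac{1}{\pi_k}$
c_i is the number of distinct i-cliques in H_1, ..., H_n, and $\pi_k$ is the i-clique inclusion probability.
Uniform Independence Sampling
With replacement
$\hat{C}_i$ is a design-unbiased Horvitz-Thompson estimator ($E[\hat{C}_i] = C_i$).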
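A sketch of Distinct Clique Counting for uniform independence sampling with replacement. It assumes an i-clique is observed whenever at least one of its i vertices is drawn, which gives an inclusion probability of 1 - (1 - i/N)^n; that expression, the observed cliques, and the function names are assumptions made here for illustration.

```python
def clique_inclusion_prob(order, N, n):
    """P(at least one of the clique's vertices appears in n uniform draws)."""
    return 1.0 - (1.0 - order / N) ** n

def dcc_estimate(egonet_cliques, order, N):
    """Distinct observed cliques of the given order, each weighted by 1/pi."""
    n = len(egonet_cliques)  # one entry per draw, repeats included
    distinct = set()
    for cliques in egonet_cliques:
        # Labeled neighbors let us recognize the same clique across egonets.
        distinct.update(frozenset(c) for c in cliques if len(c) == order)
    return len(distinct) / clique_inclusion_prob(order, N, n)

# Hypothetical observations from n = 4 draws in a graph with N = 8 nodes.
observed = [[(6, 7, 8)], [(8, 6, 7)], [(1, 2, 5)], [(2, 1, 5)]]
c3_hat = dcc_estimate(observed, order=3, N=8)
print(c3_hat)
```

Because all i-cliques share the same inclusion probability under this design, the sum of 1/pi_k collapses to c_i / pi.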
Slide33
Distinct Clique Counting: labeled neighbors
[Figure: graph G on nodes 1 to 8 (N = 8) with its order-3 cliques labeled a, b, c; n = 4, UIS with replacement; the quantity of interest is C_3]
Slide34
Distinct Clique Counting: labeled neighbors
[Figure: graph G on nodes 1 to 8 (N = 8) with its order-3 cliques labeled a, b, c; n = 4, UIS with replacement. The observed order-3 cliques are repeated sightings of {6, 7, 8} and {1, 2, 5}, which reduce to two distinct order-3 cliques]
Slide35
Computational complexity
Space complexity to count C_i or C_u
  O(1) for the Clique Degree Sums method
  O(c_i) or O(c_u) for the Distinct Clique Counting method
Time complexity
  from $O(3^{N/3})$ to $O(n \cdot 3^{D/3})$, where N is the graph size, D is the maximum degree, and n is the sample size
  from $O(n \cdot 3^{D/3})$ to $O(3^{D/3})$ via parallel computations per egonet
Slide36
Benefits of our methodology
Full knowledge of graph not required
Fast estimation for massive known graphs
Estimation or exact computation easily parallelizable for massive known graphs
Estimation with or without neighbor labels
Supports vertex attributes
Supports a variety of sampling designs
Slide37
Results
Slide38
Simulation Results
Slide39
Simulation Results
Facebook New Orleans
Egonet sample size n = 1,000
Uniform independence sampling, without replacement
1000 simulations
[Figure panels: Clique Degree Sums, Distinct Clique Counting]
Slide40
Simulation Results
Error metric: Normalized Mean Absolute Error (NMAE)
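The slide's formula did not survive extraction. A common definition, assumed here, is the mean absolute error of the estimates across simulations divided by the true value. The function name `nmae` and the numbers are illustrative only.

```python
def nmae(estimates, true_value):
    """Mean absolute error across simulations, normalized by the true value."""
    if true_value == 0:
        raise ValueError("NMAE is undefined when the true value is zero")
    abs_errors = [abs(e - true_value) for e in estimates]
    return sum(abs_errors) / (len(abs_errors) * abs(true_value))

# Hypothetical estimates of C_3 from four simulated samples, true C_3 = 3.
example_nmae = nmae([2.5, 3.5, 3.0, 4.0], true_value=3)
print(example_nmae)
```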
1000 simulations
[Figure panels: Distinct Clique Counting, Clique Degree Sums]
Slide41
Simulation Results
[Figure panels: Distinct Clique Counting, Clique Degree Sums]
Slide42
Which estimation method to use?
Heuristic
Average Edge Count = (all edges between egos and neighbors) / (unique edges between egos and neighbors)
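A small sketch of the heuristic ratio, assuming every edge of a fetched egonet counts as observed and that edges are undirected. The three egonets below are hypothetical triangles, and `average_edge_count` is a name introduced here.

```python
def average_edge_count(egonet_edge_lists):
    """Ratio of all observed egonet edges to the unique edges among them."""
    all_edges = [frozenset(e) for edges in egonet_edge_lists for e in edges]
    if not all_edges:
        raise ValueError("no edges observed")
    return len(all_edges) / len(set(all_edges))

# Hypothetical sample of n = 3 egonets: the same triangle seen twice, plus another.
sampled_egonets = [
    [(6, 7), (6, 8), (7, 8)],
    [(7, 6), (7, 8), (8, 6)],
    [(1, 2), (1, 5), (2, 5)],
]
aec = average_edge_count(sampled_egonets)
print(aec)
```

A ratio near 1 means the egonets rarely overlap; larger values mean the same edges are observed repeatedly.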
[Figure: graph G on nodes 1 to 8 (N = 8) with order-3 cliques a, b, c; n = 3 sampled egonets]
Average Edge Count = 9 / 6 = 1.5
Slide43
Which estimation method to use?
Heuristic
[Figure labels: Average Edge Count, Clique Degree Sums Error, Distinct Clique Counting Error]
Slide44
Estimation Results: Facebook '09
Facebook '09 crawled dataset [1]
36,628 unique egonets
[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, "Walking in Facebook: A Case Study of Unbiased Sampling of OSNs", IEEE INFOCOM 2010.
Slide45
Estimation Results: vertex attributes, Facebook '09
Complemented dataset with gender attributes
about 6 million users
Slide46
References
[1] M. Gjoka, E. Smith, C. T. Butts, "Estimating Clique Composition and Size Distributions from Sampled Network Data", IEEE NetSciCom '14.
[2] Facebook datasets: http://odysseas.calit2.uci.edu/research/osn.html
[3] Python code for Clique Estimators: http://tinyurl.com/clique-estimators
Thank you!
Unbiased estimation methods of clique distributions
  Clique Degree Sums
  Distinct Clique Counting
Facebook cliques
Future work
  support estimation of any subgraphs (beyond cliques)