Subsampling Graphs
Recap of PageRank-Nibble
Why I'm talking about graphs

Lots of large data is graphs:
- Facebook, Twitter, citation data, and other social networks
- The web, the blogosphere, the semantic web, Freebase, Wikipedia, Twitter, and other information networks
- Text corpora (like RCV1), large datasets with discrete feature values, and other bipartite networks
  - nodes = documents or words
  - links connect document → word or word → document
- Computer networks, biological networks (proteins, ecosystems, brains, …), …
- Heterogeneous networks with multiple types of nodes
  - people, groups, documents
Our first operation on graphs: local graph partitioning

Why? It's about the best we can do in terms of subsampling/exploratory analysis of a graph.
What is Local Graph Partitioning?

[figures: global vs. local partitioning]
Main results of the paper (Andersen et al.)

- An approximate personalized PageRank computation that only touches nodes "near" the seed, but has small error relative to the true PageRank vector
- A proof that a sweep over the approximate personalized PageRank vector finds a cut with conductance sqrt(α ln m), unless no good cut exists, i.e. no subset S contains significantly more mass in the approximate PageRank than in a uniform distribution
Approximate PageRank: Key Idea

- p is the current approximation (start at 0)
- r is the set of "recursive calls to make", i.e. the residual error; start with all mass on the seed s
- u is the node picked for the next call
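In symbols, a push on node u works out to the following updates (a reconstruction following the Andersen et al. formulation, using the lazy-walk convention where half of the remaining mass stays on u):

$$
\begin{aligned}
p'(u) &= p(u) + \alpha\,r(u)\\
r'(u) &= (1-\alpha)\,\frac{r(u)}{2}\\
r'(v) &= r(v) + (1-\alpha)\,\frac{r(u)}{2\,d(u)} \quad \text{for each neighbor } v \text{ of } u
\end{aligned}
$$

Each push moves an α fraction of r(u) into p and spreads half of the remainder over u's neighbors, so |r|_1 drops by exactly α·r(u) – the quantity the analysis below tracks.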
Analysis of PageRank-Nibble

[derivation slides: the steps apply linearity of the PageRank operator, then re-group terms and apply linearity again]
Approximate PageRank: Key Idea

By definition, PageRank is the fixed point of: [equation on slide]
Claim: [equation on slide]
Proof: [derivation on slide]
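A reconstruction of that fixed point and claim, following Andersen et al. (lazy-walk convention):

$$\mathrm{pr}(\alpha, s) \;=\; \alpha\, s \;+\; (1-\alpha)\,\mathrm{pr}(\alpha, s)\,W, \qquad W = \tfrac{1}{2}\big(I + D^{-1}A\big)$$

and the claim is the invariant maintained by every push:

$$\mathrm{pr}(\alpha, s) \;=\; p \;+\; \mathrm{pr}(\alpha, r)$$

i.e. the approximation p plus the PageRank of the residual r always equals the true personalized PageRank of the seed. The proof checks that a single push leaves the right-hand side unchanged, using linearity of pr(α, ·) in its second argument and the fixed-point equation applied to the mass r(u) being pushed.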
Approximate PageRank: Algorithm

[algorithm listing on slide]
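A minimal sketch of the routine in Java, assuming an in-memory adjacency-list graph where every node appears as a key and has at least one neighbor:

    import java.util.*;

    /** Sketch: approximate personalized PageRank (PageRank-Nibble push phase). */
    public class ApproxPageRank {
        // graph: node id -> list of neighbor ids (undirected adjacency lists)
        public static Map<Integer, Double> apr(Map<Integer, List<Integer>> graph,
                                               int seed, double alpha, double eps) {
            Map<Integer, Double> p = new HashMap<>(); // current approximation
            Map<Integer, Double> r = new HashMap<>(); // residual ("calls to make")
            r.put(seed, 1.0);                         // start with all mass on the seed

            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(seed);
            while (!queue.isEmpty()) {
                int u = queue.poll();
                List<Integer> nbrs = graph.get(u);
                int du = nbrs.size();
                double ru = r.getOrDefault(u, 0.0);
                if (ru / du <= eps) continue;          // below threshold, nothing to push
                // push(u): move an alpha fraction of r(u) into p(u) ...
                p.merge(u, alpha * ru, Double::sum);
                // ... keep half of the rest on u (lazy walk) ...
                r.put(u, (1 - alpha) * ru / 2);
                // ... and spread the other half over u's neighbors.
                double share = (1 - alpha) * ru / (2 * du);
                for (int v : nbrs) {
                    r.merge(v, share, Double::sum);
                    if (r.get(v) / graph.get(v).size() > eps) queue.add(v);
                }
                if (r.get(u) / du > eps) queue.add(u); // u may still be above threshold
            }
            return p;
        }
    }

For example, apr(graph, seed, 0.15, 1e-4) returns the sparse approximation p; the residual r could be returned as well if the caller wants the error terms.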
Analysis

So, at every point in the apr algorithm, the invariant pr(α, s) = p + pr(α, r) holds.

Also, at each push, |r|_1 decreases by at least α·ε·degree(u). So after T push operations, where the i-th pushed node has degree d_i, the d_i sum to at most 1/(εα) – which bounds the size of r and p (see the derivation below).
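Spelling out that bound (a one-line reconstruction; |r|_1 starts at 1 because all mass begins on the seed, and each push removes at least α·ε·d_i from it):

$$1 \;\ge\; \sum_{i=1}^{T} \alpha\,\epsilon\, d_i \quad\Longrightarrow\quad \sum_{i=1}^{T} d_i \;\le\; \frac{1}{\epsilon\,\alpha}$$

So the total number of nonzero entries in p and r – and the total work – is O(1/(εα)), independent of the size of the graph.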
Analysis

With the invariant pr(α, s) = p + pr(α, r), and r(u)/d(u) ≤ ε for every u at termination, this bounds the error of p relative to the true PageRank vector.
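Concretely (a reconstruction following the Andersen et al. analysis): since r ≤ ε·d entrywise at termination, and the degree vector d is a fixed point of the lazy walk W, monotonicity and linearity of PageRank give, for any vertex set S,

$$\mathrm{pr}(\alpha, s)(S) \;-\; p(S) \;=\; \mathrm{pr}(\alpha, r)(S) \;\le\; \epsilon \cdot \mathrm{vol}(S)$$

i.e. p underestimates the personalized PageRank of any set by at most ε times the set's volume.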
Comments – API

- p, r are hash tables – they are small (O(1/(εα)) entries)
- push just needs p, r, and the neighbors of u
- Could implement with an API like:

    List<Node> neighbor(Node u)
    int degree(Node u)

  so that d(v) = api.degree(v)
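As a minimal Java interface (method names taken from the slide; the generic parameter is my addition):

    import java.util.List;

    interface GraphApi<Node> {
        List<Node> neighbor(Node u);  // adjacency list of u
        int degree(Node u);           // d(u), i.e. neighbor(u).size()
    }

Since push only ever calls these two methods, the same apr code can run against an in-memory graph, a disk-backed store, or a remote service.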
Comments – Ordering

For the next push, you might pick the u with the largest r(u)/d(u)… or…
Comments – Ordering for Scanning

Scan repeatedly through an adjacency-list encoding of the graph. For every line you read,

    u, v1, …, v_d(u)

such that r(u)/d(u) > ε, perform a push on u (see the sketch below).

Benefit: storage is O(1/(εα)) for the hash tables, and this avoids any seeking.
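A sketch of this scanning strategy in Java, assuming (my assumption, not the deck's) that each line of the graph file is a tab-separated adjacency list u, v1, …, v_d(u):

    import java.io.*;
    import java.util.*;

    /** Sketch: approximate PageRank via repeated sequential scans of an adjacency-list file. */
    public class ScanningApr {
        public static Map<String, Double> apr(File graphFile, String seed,
                                              double alpha, double eps) throws IOException {
            Map<String, Double> p = new HashMap<>();
            Map<String, Double> r = new HashMap<>();
            r.put(seed, 1.0);
            boolean pushed = true;
            while (pushed) {                        // one full scan per iteration
                pushed = false;
                try (BufferedReader in = new BufferedReader(new FileReader(graphFile))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] tok = line.split("\t");
                        String u = tok[0];
                        int du = tok.length - 1;    // neighbors are tok[1..]
                        double ru = r.getOrDefault(u, 0.0);
                        if (ru / du <= eps) continue;  // below threshold on this pass
                        pushed = true;
                        p.merge(u, alpha * ru, Double::sum);
                        r.put(u, (1 - alpha) * ru / 2);
                        double share = (1 - alpha) * ru / (2 * du);
                        for (int i = 1; i < tok.length; i++)
                            r.merge(tok[i], share, Double::sum);
                    }
                }
            }
            return p;
        }
    }

A node whose residual rises above threshold after its line has already gone by is simply caught on the next scan, which is why the loop runs until a scan performs no pushes.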
Possible optimizations?

Scanning is much faster than doing random access for the first few scans, but slower for the last few, when there will be only a few pushes per scan.

Optimizations you might imagine:
- Parallelize?
- Hybrid seek/scan:
  - Index the nodes in the graph on the first scan
  - Start seeking when you expect too few pushes to justify a scan – say, less than one push per megabyte of scanning
- Hotspots:
  - Save the adjacency-list representation of nodes with a large r(u)/d(u) in a separate file of "hot spots" as you scan
  - Then rescan that smaller list of "hot spots" until their scores drop below threshold
After computing apr…

Given a graph that's too big for memory, and/or that's only accessible via an API, we can extract a sample in an interesting area:
- Run the apr/rwr from a seed node
- Sweep to find a low-conductance subset
- Then:
  - compute statistics
  - test out some ideas
  - visualize it
  - …
Key idea: a “sweep”

Order the vertices v1, v2, … by highest apr score, and pick a prefix S = {v1, v2, …, vk}:

    S = {}; volS = 0; B = {}
    for k = 1, …, n:
        S += {vk}
        volS += degree(vk)
        B += {u : vk→u and u not in S}
        Φ[k] = |B| / volS
    pick the k for which Φ[k] is smallest

(see the code sketch below)
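A sketch of the sweep in Java; it orders by the degree-normalized score (as in the Andersen et al. sweep) and, unlike the simplified pseudocode above, updates the cut size exactly – an edge stops crossing once both endpoints are inside S (self-loop-free graph assumed):

    import java.util.*;

    /** Sketch: sweep over apr scores to find a low-conductance prefix. */
    public class Sweep {
        // graph: node -> neighbors; scores: apr values (e.g. from ApproxPageRank.apr)
        public static Set<String> bestPrefix(Map<String, List<String>> graph,
                                             Map<String, Double> scores) {
            // Order scored vertices by decreasing degree-normalized apr score.
            List<String> order = new ArrayList<>(scores.keySet());
            order.sort((a, b) -> Double.compare(
                    scores.get(b) / graph.get(b).size(),
                    scores.get(a) / graph.get(a).size()));

            Set<String> S = new HashSet<>();
            long vol = 0, cut = 0;
            double bestPhi = Double.MAX_VALUE;
            int bestK = 0;
            for (int k = 0; k < order.size(); k++) {
                String v = order.get(k);
                S.add(v);
                vol += graph.get(v).size();
                for (String u : graph.get(v))
                    cut += S.contains(u) ? -1 : +1;  // internal edges stop crossing
                double phi = (double) cut / vol;     // conductance of this prefix
                if (phi < bestPhi) { bestPhi = phi; bestK = k; }
            }
            return new HashSet<>(order.subList(0, bestK + 1));
        }
    }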
Scoring a graph prefix

- the numerator counts the edges leaving S
- vol(S) is the sum of deg(x) for x in S
- for small S: Φ(S) ≈ Prob(a random edge from S leaves S)
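As a formula (the standard conductance definition, with the small-S simplification the slide uses):

$$\phi(S) \;=\; \frac{\big|\{(u,v)\in E : u \in S,\ v \notin S\}\big|}{\min\big(\mathrm{vol}(S),\ \mathrm{vol}(V\setminus S)\big)} \;\approx\; \frac{|\partial S|}{\mathrm{vol}(S)} \quad\text{for small } S$$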
Putting this together

Given a graph that's too big for memory, and/or that's only accessible via an API, we can extract a sample in an interesting area:
- Run the apr/rwr from a seed node
- Sweep to find a low-conductance subset
- Then:
  - compute statistics
  - test out some ideas
  - visualize it
  - …
Visualizing a Graph with Gephi

Screen shots/demo

- Gephi – a Java tool
- Reads several input formats, including .csv files
- [demo]