Slide1
Leveraging Big Data: Lecture 12
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo
http://www.cohenwang.com/edith/bigdataclass2013
Slide2
Today
All-Distances Sketches
Applications of All-Distances sketches
Back to linear sketches (random linear transformations)
Slide3
All-Distances Sketches (ADSs)
Node u is in ADS(v) if u is in the Min-Hash sketch of N_d(v) for some d, where N_d(v) is the set of nodes within distance at most d from v.
Bottom-k ADS(v) is a list of pairs (u, d_vu), where u is a node ID: u is included iff h(u) < kth smallest hash of nodes that are closer to v than u.
Slide4
ADS example
[Figure: an example graph with edge weights; SP distances from a source node; each node is assigned a random hash value in (0,1) -- a random permutation of nodes.]
Slide5
ADS example
[Figure: sorted SP distances from v to all other nodes, each node shown with its hash value.]
Slide6
ADS example
[Figure: sorted SP distances from v to all other nodes with hash values; the ADS(v) entries are highlighted.]
Slide7
Expected Size of Bottom-k ADS
Lemma: E[|ADS(v)|] = O(k ln n).
Proof: The ith closest node to v is included with probability min{1, k/i}; summing over i = 1, ..., n gives at most k + k ln(n/k).*
* Same argument as in lecture 2 to bound the number of updates to a Min-Hash sketch of a stream, with distance taking the role of time.
Slide8
Computing bottom-k ADS for all nodes: pruned Dijkstra's
Iterate over nodes u by increasing hash h(u): run Dijkstra's algorithm from u on the reverse graph. When visiting v:
IF fewer than k current entries of ADS(v) are strictly closer to v (so u belongs in ADS(v)): insert (u, d(u,v)) and continue from v.
ELSE, prune the search at v.
Slide9
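The pruned-Dijkstra procedure can be sketched in Python. This is a minimal illustration under stated assumptions, not the lecture's reference implementation: ADS entries are stored as (distance, source) pairs, and `rev_adj[v]` lists the reverse edges `(w, weight)` so a search from u yields distances d(v, u).

```python
import heapq

def bottom_k_ads(n, rev_adj, hashes, k):
    """Bottom-k ADS for every node via pruned Dijkstra searches.

    Sources are processed in increasing hash order, so when a search from u
    visits v, all existing entries of ads[v] have smaller hash than h(u);
    u belongs in ADS(v) iff fewer than k of them are strictly closer.
    """
    ads = [[] for _ in range(n)]  # ads[v]: (distance, source) in insertion order
    for u in sorted(range(n), key=lambda i: hashes[i]):
        dist = {u: 0.0}
        pq = [(0.0, u)]
        while pq:
            d, v = heapq.heappop(pq)
            if d > dist[v]:
                continue  # stale queue entry
            closer = sum(1 for dv, _ in ads[v] if dv < d)
            if closer >= k:
                continue  # prune: do not relax edges out of v
            ads[v].append((d, u))
            for w, wt in rev_adj[v]:
                nd = d + wt
                if nd < dist.get(w, float("inf")):
                    dist[w] = nd
                    heapq.heappush(pq, (nd, w))
    return ads
```

For an undirected graph, `rev_adj` is simply the adjacency list; for a directed graph it holds the reversed edges.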
ADS Computation: Dijkstra
Perform pruned Dijkstra's searches from the nodes in increasing hash order.
[Figure: the example graph with edge weights and node hash values.]
Slide10
Computing bottom-k ADSs (unweighted edges): Dynamic Programming
Initialize: each node v places (h(v), 0) in ADS(v) and sends the update (h(v), 1) to its neighbors.
Iterate until no updates: for all nodes v, if an update (h, d) is received and h is smaller than the kth smallest hash among entries of ADS(v) at distance < d, then insert the new entry (h, d) and send the update (h, d+1) to v's neighbors.
Slide11
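The round-based dynamic program can be sketched as follows. A minimal illustration assuming unweighted edges; the data layout (per-node lists of (distance, hash) entries, with a known-hash set per node) is a choice of this sketch, not prescribed by the slides.

```python
def ads_unweighted(n, adj, hashes, k):
    """Bottom-k ADS on an unweighted graph: round d only creates
    entries at distance exactly d, so entries appear by increasing distance."""
    ads = [[(0, hashes[v])] for v in range(n)]       # (distance, hash) pairs
    known = [{hashes[v]} for v in range(n)]          # hashes already in ADS(v)
    updates = {v: [hashes[v]] for v in range(n)}     # inserted in the last round
    d = 0
    while updates:
        d += 1
        received = {}
        for v, hs in updates.items():
            for u in adj[v]:                         # forward to neighbors
                received.setdefault(u, set()).update(hs)
        updates = {}
        for v, hs in received.items():
            prior = sorted(known[v])                 # hashes at distance < d
            for h in sorted(hs - known[v]):
                # accept iff h beats the kth smallest strictly-closer hash
                if len(prior) < k or h < prior[k - 1]:
                    ads[v].append((d, h))
                    known[v].add(h)
                    updates.setdefault(v, []).append(h)
    return ads
```

Each (node, hash) pair is accepted at most once, so the loop terminates after (diameter) rounds.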
ADS Computation: Dynamic Programming
Iteration d computes entries with distance d. We display the lowest hash in the current iteration.
[Figure: the example graph with node hash values.]
Slide12
ADS Computation: Dynamic Programming
Start: each node v places (h(v), 0) in ADS(v).
[Figure: the example graph with node hash values.]
Slide13
ADS Computation: Dynamic Programming
Start: each node v places (h(v), 0) in ADS(v).
Slide14
ADS Computation: Dynamic Programming
Iteration 1: each node sends its hash to all neighbors; a neighbor creates an ADS entry with distance 1 if the received hash is lower.
Slide15
ADS Computation: Dynamic Programming
Iteration 2: each node sends its distance-1 entries to all neighbors; a neighbor creates an ADS entry with distance 2 if the received hash is lower.
Slide18
ADS computation: Analysis
Pruned Dijkstra's introduces ADS entries by increasing hash. DP introduces entries by increasing distance.
With either approach, the in-neighbors of a node are scanned only after an update to its ADS, so the expected number of edge traversals is bounded by the sum over nodes of ADS size times in-degree: O(k |E| ln n) in expectation.
Slide19
ADS computation: Comments
Pruned Dijkstra ADS computation can be parallelized, similarly to BFS reachability sketches, to reduce dependency.
DP can be implemented via (diameter many) sequential passes over the edge set. We only need to keep in memory k entries for each node (the k smallest hashes so far).
It is also possible to perform a distributed computation where nodes asynchronously communicate with neighbors. Entries can then be modified or removed, which incurs overhead.
Slide20
Next: Some ADS applications
Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches
Closeness centralities: HIP estimators
Closeness similarities
Distance oracles
Slide21
Using ADSs
Extract a Min-Hash sketch of the d-neighborhood N_d(v) of v from ADS(v): the bottom-k hashes among entries with distance at most d.
Directly using Min-Hash sketches (lectures 2-3): can estimate the cardinality |N_d(v)|; can estimate the Jaccard similarity of N_d(u) and N_d(v).
Slide23
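The extraction step can be sketched as below, assuming the ADS is stored as (distance, hash) pairs; the estimators are the bottom-k cardinality estimator (k-1)/tau_k and a union-sketch Jaccard estimator in the spirit of lectures 2-3 (the function names are illustrative).

```python
def neighborhood_sketch(ads_v, d, k):
    """Bottom-k Min-Hash sketch of N_d(v), read off a bottom-k ADS
    given as (distance, hash) pairs: the k smallest hashes at distance <= d."""
    return sorted(h for dist, h in ads_v if dist <= d)[:k]

def cardinality_estimate(sketch, k):
    """Bottom-k estimator: (k-1)/tau_k with tau_k the kth smallest hash;
    exact count when fewer than k hashes were collected."""
    if len(sketch) < k:
        return float(len(sketch))
    return (k - 1) / sketch[k - 1]

def jaccard_estimate(s1, s2, k):
    """Fraction of the k smallest hashes of the union that lie in both sketches."""
    union = sorted(set(s1) | set(s2))[:k]
    return sum(1 for h in union if h in s1 and h in s2) / len(union)
```

One ADS thus answers neighborhood queries for every distance d at once, which is the point of keeping all distances.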
Some ADS applications
Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches
Closeness centralities: HIP estimators
Closeness similarities
Distance oracles
Slide24
Closeness Centrality
Based on distances to all other nodes.
Bavelas (1948): C(v) = 1 / sum_u d(v, u).
Issues: does not work for disconnected graphs; emphasis on the contribution of "far" nodes.
Correction: harmonic mean of distances, C(v) = sum_{u != v} 1 / d(v, u).
Slide25
Closeness Centrality
More general definition: C_{a,b}(v) = sum_{u != v} a(d(v, u)) b(u), where a is non-increasing and b is some filter.
Based on distances to all other nodes.
Harmonic mean: a(x) = 1/x. Exponential decay with distance: a(x) = 2^{-x}. Degree centrality: a(x) = 1 for x <= 1, else 0. Neighborhood size: a(x) = 1 for x <= d, else 0.
Slide26
Closeness Centrality
More general definition: C_{a,b}(v) = sum_{u != v} a(d(v, u)) b(u), where a is non-increasing and b is some filter.
Centrality with respect to a filter b: education level, community (TAU graduates), geography, language, product type.
Applications of the filter: attribute completion, targeted ads.
Slide27
HIP estimators for ADSs
For each node u, we estimate the "presence" of u with respect to v (= 1 if u is in ADS(v), 0 otherwise).
The estimate is 0 if u is not in ADS(v). If u is in ADS(v), we compute the probability tau_u that it is included, conditioned on fixed hash values of all nodes that are closer to v than u. We then use the inverse-probability estimate 1/tau_u. For bottom-k, tau_u is the kth smallest hash among nodes closer to v than u (tau_u = 1 when fewer than k nodes are closer).
Slide28
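A sketch of the HIP computation, assuming the ADS is stored as (distance, hash) pairs and the "closer" relation is read off the stored distances; the convention that entries tied in distance do not condition each other is an assumption of this sketch.

```python
def hip_estimates(ads_v, k):
    """HIP inverse-probability estimates from a bottom-k ADS.

    For an entry at distance d, tau is the kth smallest hash among strictly
    closer entries (1 if fewer than k exist); conditioned on those hashes the
    entry is present with probability tau, giving the estimate 1/tau."""
    entries = sorted(ads_v)          # by distance
    closer, est, i = [], [], 0
    while i < len(entries):
        j = i
        while j < len(entries) and entries[j][0] == entries[i][0]:
            j += 1                   # group entries tied in distance
        tau = sorted(closer)[k - 1] if len(closer) >= k else 1.0
        est.extend((dist, 1.0 / tau) for dist, _ in entries[i:j])
        closer.extend(h for _, h in entries[i:j])
        i = j
    return est

def hip_centrality(ads_v, k, alpha=lambda d: 1.0 / d if d > 0 else 0.0):
    """Closeness-centrality estimate sum_u alpha(d_u) / tau_u
    (harmonic alpha by default; a filter b could weight terms the same way)."""
    return sum(alpha(d) * a for d, a in hip_estimates(ads_v, k))
```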
Example: HIP estimates
[Figure: bottom-2 ADS of v; for each entry, tau is the 2nd smallest hash amongst closer nodes, and the HIP estimate is 1/tau.]
Slide29
Example: HIP estimates
[Figure: bottom-2 ADS of v with distances; summing the HIP estimates yields the neighborhood-cardinality estimates.]
Slide30
Example: HIP estimates
Only good guys (filter: b(u) = 1 iff u is good).
[Figure: the same bottom-2 ADS of v with distances, restricted by the filter.]
Slide31
Example: HIP estimates
Only bad guys (filter: b(u) = 1 iff u is bad).
[Figure: the same bottom-2 ADS of v with distances, restricted by the filter.]
Slide32
Estimating Closeness Centrality
Lemma: The HIP estimator of C_{a,b}(v) (a non-increasing, b some filter) has CV <= 1/sqrt(2(k-1)), for uniform b or when ADSs are computed with respect to b.
We do not give the proof here.
Slide33
Closeness Centrality interpreted as the L1 norm of closeness vectors
View nodes as features, weighted by some b. The relevance of node u to node v decreases with the distance d(v, u), according to a.
The closeness vector of node v has entries c_u(v) = a(d(v, u)) b(u); the centrality C_{a,b}(v) is its L1 norm.
[Figure: example closeness vectors over the nodes.]
Slide34
Next: Some ADS applications
Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches
Closeness centralities: HIP estimators
Closeness similarities
Distance oracles
Slide35
Closeness Similarity
Computed from the closeness vectors c(u), c(v):
Weighted Jaccard coefficient: sum_i min(c_i(u), c_i(v)) / sum_i max(c_i(u), c_i(v))
Cosine similarity: <c(u), c(v)> / (||c(u)|| ||c(v)||)
Slide36
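Both measures can be computed exactly from (sparse) closeness vectors; this illustrative snippet represents a vector as a dict mapping node id to weight, a representation chosen here for brevity.

```python
import math

def weighted_jaccard(cu, cv):
    """Weighted Jaccard of sparse nonnegative vectors: sum of
    coordinate-wise minima over sum of coordinate-wise maxima."""
    keys = set(cu) | set(cv)
    num = sum(min(cu.get(i, 0.0), cv.get(i, 0.0)) for i in keys)
    den = sum(max(cu.get(i, 0.0), cv.get(i, 0.0)) for i in keys)
    return num / den if den else 0.0

def cosine(cu, cv):
    """Cosine similarity <cu, cv> / (||cu|| ||cv||) of sparse vectors."""
    dot = sum(w * cv.get(i, 0.0) for i, w in cu.items())
    nu = math.sqrt(sum(w * w for w in cu.values()))
    nv = math.sqrt(sum(w * w for w in cv.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

The point of the lecture's sketches is to approximate these quantities without materializing the full vectors.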
Closeness Similarity: choices of a, b
Similarity of d-neighborhoods: a(x) = 1 when x <= d, a(x) = 0 when x > d.
d-neighborhood Jaccard: the weighted Jaccard coefficient under this a.
1-neighborhood Adamic-Adar: d = 1 with nodes down-weighted by degree, b(u) = 1/log deg(u).
Slide37
Estimating Closeness Similarity
Lemma: We can estimate the weighted Jaccard coefficient or the cosine similarity of the closeness vectors of two nodes from their ADSs, with mean square error that decreases with k.*
* For uniform b, or when ADSs are computed with respect to b.
We do not give the proof here.
Slide38
Next: Some ADS applications
Cardinality/similarity of d-neighborhoods by extracting Min-Hash sketches
Closeness centralities: HIP estimators
Closeness similarities
Distance oracles
Slide39
Estimating SP distance
We can use ADS(u) and ADS(v) to obtain an upper bound on d(u, v): the minimum, over nodes w common to both sketches, of d(u, w) + d(w, v).
Comment: for directed graphs we need a "forward" ADS and a "backward" ADS.
What can we say about the quality of this bound?
Slide40
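The query can be sketched as below, assuming each ADS is given as a dict mapping node id to its distance from the sketch's root (the representation is an assumption of this sketch).

```python
def distance_upper_bound(ads_u, ads_v):
    """Upper bound on d(u, v): min over nodes w common to both sketches
    of d(u, w) + d(w, v); infinity if the sketches share no node."""
    common = set(ads_u) & set(ads_v)
    return min((ads_u[w] + ads_v[w] for w in common), default=float("inf"))
```

Each path u -> w -> v witnesses a real walk, so the value is always an upper bound on the true distance; the stretch analysis later bounds how loose it can be.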
Bottom-k ADSs of u and v
[Figure: the two sketches, each node shown with its hash value and its distance from the respective root.]
Slide41
Common nodes in bottom-k ADSs of u and v
[Figure: the common nodes give candidate sums 10+15=25, 10+4=14, 10+15=25; the estimate is the minimum, 14.]
Slide42
Query time improvement: only test membership of the Min-Hash nodes in the other ADS.
[Figure: the two sketches with hash values and distances; only the pivot nodes are tested.]
Slide43
Query time
Basic version: intersection of the two ADSs.
A faster version tests the presence of the "pivots" (Min-Hash nodes) in the other ADS; with suitable data structures this makes queries much cheaper.
Can (we will not show this in class) further reduce query time by noting dependencies between pivots: there is no need to test all of them.
Comment: the theoretical worst-case upper bound on stretch stays the same, but in practice estimate quality deteriorates with the query-time "improvements": better to use the full "sketch".
Slide44
Bounding the stretch
Theorem: On undirected graphs, for any integer c >= 1, if we use k = Omega(n^{1/c}), there is constant probability that the estimate is at most (2c - 1) times the actual distance.
With fixed k we get O(log n) stretch.
The stretch/representation-size tradeoff is worst-case tight (under some hardness assumptions).
In practice, stretch is typically much better.
Stretch: the ratio between the approximate and the true distance.
Slide45
Bounding the stretch
Theorem: On undirected graphs, for any integer c >= 1, if we use k = Omega(n^{1/c}), there is constant probability that the estimate is at most (2c - 1) times the actual distance.
We prove a slightly weaker version in class.
Slide46
Slide47
Proof outline:
Part 1: We show that if the ratio between the cardinalities of two consecutive neighborhood sets is at most k, then the min-hash of the smaller set is likely to be a member of the bottom-k of the larger set. If the larger set is a d-neighborhood, the bound that we get is proportional to d.
Part 2: If all consecutive pairs have ratio at least k, we show the number of such expansions cannot be too big, and the minimum-hash node gives good stretch.
Slide48
Part 2: If all consecutive pairs have ratio at least k:
After i expansions from u, the set size is at least k^i. If the growth ratio is at least k at every step, we must have k^i <= n, i.e., at most log_k n = c expansions.
After c expansions the neighborhoods cover everything: all nodes are within distance proportional to c times d(u, v) from (one of) u, v. In particular, the node with minimum hash must be within that distance from one of them and, by the triangle inequality, a comparable distance from the other.
We obtain "stretch" O(c).
Slide49
If the minimum-hash node of the smaller set is in the bottom-k of the larger set, it appears in both ADSs, and our estimate is at most the sum of its two distances.
More generally....
Slide51
Lemma: For S a subset of T, the probability that the minimum hash in S is the minimum hash in T is |S|/|T|.
Proof: Look at the minimum in T. It is a uniform random node in T, so it lies in S with probability |S|/|T|.
Slide54
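The lemma is easy to check by simulation; this hedged sketch (the function name and parameters are illustrative) draws i.i.d. uniform hashes for a set of t_size nodes and counts how often the overall minimum lands in a fixed subset of s_size nodes.

```python
import random

def min_in_subset_rate(s_size, t_size, trials=20000, seed=1):
    """Empirical probability that the minimum hash of t_size nodes falls
    in a fixed s_size-node subset; the lemma predicts s_size / t_size."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        hashes = [rng.random() for _ in range(t_size)]
        # take S to be the first s_size indices; find the argmin over T
        if min(range(t_size), key=hashes.__getitem__) < s_size:
            hits += 1
    return hits / trials
```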
Lemma: For S a subset of T, the probability that the minimum hash in S is in the bottom-k of T is at least 1 - (1 - rho)^k, where rho = |S|/|T|.
Proof: The probability that none of the bottom-k of T is in S is at most (1 - rho)^k. So with probability at least 1 - (1 - rho)^k, some bottom-k node of T is in S; the minimum hash of S is then at most that node's hash, so it is itself in the bottom-k of T.
Slide55
Lemma: The probability that the minimum hash in S is in the bottom-k of T is at least 1 - (1 - rho)^k, where rho = |S|/|T|.
The argument holds for any two consecutive neighborhoods in the sequence: if rho >= 1/k, the probability is at least 1 - (1 - 1/k)^k >= 1 - 1/e, at least a constant.
Slide56
Bibliography
Closeness centrality on ADSs using HIP: E. Cohen. "All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis." arXiv, 2013.
Closeness similarity and distance oracles using ADSs: E. Cohen, D. Delling, F. Fuchs, A. Goldberg, M. Goldszmidt, and R. Werneck. "Scalable similarity estimation in social networks: Closeness, node labels, and random edge lengths." In COSN, 2013.
Related:
Distance distribution on social graphs: P. Boldi, M. Rosa, and S. Vigna. "HyperANF: Approximating the neighbourhood function of very large graphs on a budget." In WWW, 2011.
More on distance oracles: M. Thorup and U. Zwick. "Approximate Distance Oracles." JACM, 2005.
Slide57
Next: (back to) Linear Sketches
Linear transformations of the input vector to a lower dimension.
When to use linear sketches?
Examples: the JL Lemma on Gaussian random projections, the AMS sketch.
Slide58
Min-Hash sketches
Simple, little overhead.
Mergeable (in particular, can add elements).
One sketch with many uses: distinct count, similarity, sample.
Weighted vectors (we did not see this in class): weighted sampling, norm/distance estimates. The sketch supports increasing an entry's value or replacing it with a larger value.
But... no support for negative updates.
Slide59
Linear Sketches
Linear transformations (usually "random"): sketch = A x.
Input vector x of dimension n.
Matrix A, with many fewer rows than columns, whose entries are specified by (carefully chosen) random hash functions.
Slide60
Advantages of Linear Sketches
Easy to update the sketch under positive and negative updates to an entry: an update (i, delta) means x_i <- x_i + delta. To update the sketch: add delta times column i of A.
Naturally mergeable (over signed entries): A(x + y) = Ax + Ay.
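Both advantages follow from linearity alone, which this toy sketch demonstrates. It uses a dense +/-1 matrix as a stand-in for the 4-wise independent hash functions of the real AMS sketch (so it illustrates the update and merge mechanics, not the AMS accuracy guarantees).

```python
import random

def random_sign_matrix(d, n, seed=0):
    """A d x n matrix of random +/-1 entries; a plain seeded RNG stands in
    for the limited-independence hashes used in the actual AMS analysis."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(d)]

def apply_sketch(A, x):
    """Sketch s = A x (dimension d instead of n)."""
    return [sum(row[i] * x[i] for i in range(len(x))) for row in A]

def update_sketch(A, s, i, delta):
    """Point update x_i <- x_i + delta in O(d): add delta * column i of A.
    Works for positive and negative delta alike."""
    return [s_j + delta * A[j][i] for j, s_j in enumerate(s)]
```

Merging is just coordinate-wise addition of sketches, since A(x + y) = Ax + Ay.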