Leveraging Big Data: Lecture - PowerPoint Presentation


Presentation Transcript

Slide 1

Leveraging Big Data: Lecture 12

Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo

http://www.cohenwang.com/edith/bigdataclass2013

Slide 2

Today

- All-Distances Sketches
- Applications of All-Distances sketches
- Back to linear sketches (random linear transformations)

Slide 3

All-Distances Sketches (ADSs)

$v \in \mathrm{ADS}(u)$ iff $v$ is in the Min-Hash sketch of $N_d(u)$ for some $d$, where $N_d(u)$ is the set of nodes that are within distance at most $d$ from $u$.

Bottom-$k$: $\mathrm{ADS}(u)$ is a list of pairs $(h(v), d_{uv})$, where $v$ is a node ID and $h(v)$ is smaller than the $k$th smallest hash among the nodes that are closer to $u$ than $v$.
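Read operationally, the definition says: scan nodes by increasing distance from $u$ and keep a node iff its hash is below the $k$th smallest hash seen so far. A minimal Python illustration of exactly this scan (hypothetical helper names; it assumes all distances from $u$ are available, which is only sensible for small examples):

```python
import heapq
from itertools import groupby

def bottom_k_ads(u, dist, h, k):
    """Bottom-k ADS of u from the definition: v enters iff h(v) is below
    the k-th smallest hash among nodes strictly closer to u than v.
    dist: node -> d(u, node); h: node -> hash value in [0, 1]."""
    others = sorted((v for v in dist if v != u), key=lambda v: dist[v])
    ads, kth = [], []      # kth: max-heap (negated) of the k smallest hashes
    for d, group in groupby(others, key=lambda v: dist[v]):
        group = list(group)
        for v in group:    # nodes at equal distance are not "closer"
            if len(kth) < k or h[v] < -kth[0]:
                ads.append((h[v], d, v))
        for v in group:    # from now on they count as closer
            if len(kth) < k:
                heapq.heappush(kth, -h[v])
            elif h[v] < -kth[0]:
                heapq.heapreplace(kth, -h[v])
    return ads
```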

Slide 4

ADS example

[Figure: an example graph with edge weights; the shortest-path (SP) distances from the source node; and hash values in $[0,1]$ assigned to the nodes by a random permutation.]

Slide 5

ADS example

 

Sorted SP distances from $u$ to all other nodes:

[Figure: the sorted distances with each node's hash value; a node enters the bottom-1 ADS when its hash is below those of all closer nodes, giving entries with hashes 0.63, 0.42, 0.07.]

Slide 6

ADS example

 

[Figure: the same sorted distances with $k = 2$; a node enters the bottom-2 ADS when its hash is below the 2nd smallest hash among closer nodes, giving entries with hashes 0.63, 0.42, 0.56, 0.07, 0.35, 0.14.]

Slide 7

Expected Size of Bottom-$k$ ADS

Lemma: $\mathbb{E}[|\mathrm{ADS}(u)|] = \sum_{i=1}^{n-1} \min\{1, k/i\} \le k(1 + \ln n)$.

Proof: The $i$th closest node to $u$ is included with probability $\min\{1, k/i\}$: it enters the ADS iff its hash is among the $k$ smallest of the $i$ closest nodes, which for a random permutation happens with probability $\min\{1, k/i\}$. Summing, $k + \sum_{i=k+1}^{n-1} k/i \le k(1 + \ln n)$.

* Same argument as in lecture 2 to bound the number of updates to a Min-Hash sketch of a stream: distance instead of time.

Slide 8

Computing bottom-$k$ ADSs for all nodes: pruned Dijkstra's

Iterate over nodes $u$ by increasing $h(u)$: run Dijkstra's algorithm from $u$ on the reverse graph. When visiting $v$:

IF fewer than $k$ entries of $\mathrm{ADS}(v)$ have distance at most $d_{vu}$ (all existing entries have lower hash, since nodes are processed in increasing hash order): insert $(h(u), d_{vu})$ into $\mathrm{ADS}(v)$ and continue from $v$;
ELSE, prune the search at $v$.
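A compact Python rendering of the pruned search (illustrative, not the course's reference code; `radj` holds reverse-graph adjacency with weights, so the search from $u$ finds distances $d_{vu}$, and hashes are assumed distinct):

```python
import heapq
from collections import defaultdict

def pruned_dijkstra_ads(nodes, radj, h, k):
    """Bottom-k ADS(v) for every node v via pruned Dijkstra searches,
    run from the nodes in increasing hash order."""
    ads = defaultdict(list)                    # v -> [(hash, distance)]
    for u in sorted(nodes, key=lambda x: h[x]):
        dist, pq = {u: 0}, [(0, u)]
        while pq:
            d, v = heapq.heappop(pq)
            if d > dist.get(v, float('inf')):
                continue                       # stale queue entry
            # All current entries of ADS(v) have lower hash than h(u),
            # so (h(u), d) qualifies iff fewer than k of them are as close.
            if sum(1 for _, dd in ads[v] if dd <= d) >= k:
                continue                       # prune the search at v
            ads[v].append((h[u], d))
            for x, wt in radj[v]:              # relax reverse edges
                if d + wt < dist.get(x, float('inf')):
                    dist[x] = d + wt
                    heapq.heappush(pq, (d + wt, x))
    return ads
```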

Slide 9

ADS Computation: Dijkstra

Perform pruned Dijkstra from nodes by increasing hash:

[Figure: the example graph with edge weights and node hashes; pruned searches are run from the nodes in increasing hash order.]

Slide 10

Computing bottom-$k$ ADSs (unweighted edges): Dynamic Programming

Initialize: each node $v$ places $(h(v), 0)$ in $\mathrm{ADS}(v)$ and sends the update $(h(v), 1)$ to its neighbors.

Iterate until no updates: for all nodes $v$, if an update $(h, d)$ is received and $h$ is smaller than the $k$th smallest hash among entries of $\mathrm{ADS}(v)$ with distance less than $d$, insert $(h, d)$. If this created a new entry, send the update $(h, d+1)$ to $v$'s neighbors.
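A Python rendering of the DP as synchronous rounds (illustrative; `adj` is the adjacency list of an unweighted graph, and hashes are assumed distinct):

```python
def dp_ads_unweighted(nodes, adj, h, k):
    """Bottom-k ADSs by synchronous rounds: iteration d inserts entries
    at distance d and forwards them for distance d + 1."""
    ads = {v: [(h[v], 0)] for v in nodes}     # v -> [(hash, distance)]
    updates = [(h[v], 1, u) for v in nodes for u in adj[v]]
    while updates:
        nxt = []
        for hv, d, v in updates:
            if any(hh == hv for hh, _ in ads[v]):
                continue                       # already present, and closer
            closer = sorted(hh for hh, dd in ads[v] if dd < d)
            # insert iff hv beats the k-th smallest hash among closer entries
            if len(closer) < k or hv < closer[k - 1]:
                ads[v].append((hv, d))
                nxt += [(hv, d + 1, u) for u in adj[v]]
        updates = nxt
    return ads
```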

Slide 11

ADS Computation: Dynamic Programming

Iteration $d$ computes entries with distance $d$. We display the lowest hash in the current iteration.

[Figure: the example graph with node hashes.]

Slide 12

ADS Computation: Dynamic Programming

Start: each node $v$ places $(h(v), 0)$ in $\mathrm{ADS}(v)$.

[Figure: animation frames (slides 12-13) showing each node's own hash as its initial entry.]

Iteration 1: each node sends its distance-0 hash to all neighbors; a neighbor creates an ADS entry with distance 1 if the received hash is lower.

[Figure: animation frames (slides 14-16) showing the distance-1 entries propagating.]

Iteration 2: each node sends its distance-1 entries to all neighbors; a neighbor creates an ADS entry with distance 2 if the hash is lower.

[Figure: animation frame (slide 17) showing the distance-2 entries.]

Slide 18

ADS computation: Analysis

Pruned Dijkstra's introduces ADS entries by increasing hash; the DP introduces entries by increasing distance.

With either approach, in-neighbors of a node are traversed only after an update to it, so the expected number of edge traversals is bounded by the sum over nodes of ADS size times in-degree: $\sum_v |\mathrm{ADS}(v)|\,\mathrm{inDeg}(v)$, which is $O(km \log n)$ in expectation.

Slide 19

ADS computation: Comments

Pruned Dijkstra ADS computation can be parallelized, similarly to BFS reachability sketches, to reduce dependencies. The DP can be implemented via (diameter-many) sequential passes over the edge set; we only need to keep in memory $k$ entries for each node (the $k$ smallest hashes so far). It is also possible to perform a distributed computation where nodes asynchronously communicate with neighbors; entries can then be modified or removed, which incurs overhead.

Slide 20

Next: Some ADS applications

- Cardinality/similarity of $d$-neighborhoods, by extracting Min-Hash sketches
- Closeness centralities: HIP estimators
- Closeness similarities
- Distance oracles

Slide 22

Using ADSs

Extract the Min-Hash sketch of the $d$-neighborhood of $u$, $N_d(u)$, from $\mathrm{ADS}(u)$: the bottom-$k$ hashes among entries with distance at most $d$.

Directly using Min-Hash sketches (lectures 2-3):
- Can estimate the cardinality $|N_d(u)|$
- Can estimate the Jaccard similarity of $N_d(u)$ and $N_{d'}(v)$
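In code, the extraction is a one-liner over the (hash, distance) entries; the cardinality estimator below is the standard bottom-$k$ one from lecture 2 (names are illustrative):

```python
def minhash_of_neighborhood(ads_u, d, k):
    """Bottom-k Min-Hash sketch of N_d(u), read off the ADS entries."""
    return sorted(h for h, dd in ads_u if dd <= d)[:k]

def cardinality_estimate(sketch, k):
    """Bottom-k estimator: (k - 1) / (k-th smallest hash); exact when
    the neighborhood has fewer than k nodes."""
    return len(sketch) if len(sketch) < k else (k - 1) / sketch[k - 1]
```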

Slide 23

Some ADS applications

- Cardinality/similarity of $d$-neighborhoods, by extracting Min-Hash sketches
- Closeness centralities: HIP estimators
- Closeness similarities
- Distance oracles

Slide 24

Closeness Centrality

Based on distances to all other nodes.

Bavelas (1948): $C(u) = \dfrac{1}{\sum_{v \ne u} d_{uv}}$

Issues:
- Does not work for disconnected graphs
- Emphasis on the contribution of "far" nodes

Correction: harmonic mean of distances, $C(u) = \sum_{v \ne u} \dfrac{1}{d_{uv}}$

Slide 25

Closeness Centrality

More general definition, based on distances to all other nodes:

$C_{\alpha,\beta}(u) = \sum_{v \ne u} \alpha(d_{uv})\,\beta(v)$, where $\alpha$ is non-increasing and $\beta$ is some filter.

- Harmonic mean: $\alpha(x) = 1/x$
- Exponential decay with distance: $\alpha(x) = 2^{-x}$
- Degree centrality: $\alpha(x) = 1$ for $x \le 1$, else $0$
- Neighborhood size: $\alpha(x) = 1$ for $x \le d$, else $0$
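The definition translates directly to code. A non-sketched reference evaluation, assuming exact distances from $u$ are at hand; the $\alpha$ instantiations mirror the list above, with the decay base and the radius chosen only for illustration:

```python
def closeness(u, dist_from_u, alpha, beta):
    """Generalized closeness centrality: sum of alpha(d(u,v)) * beta(v)
    over all v != u with finite distance from u."""
    return sum(alpha(d) * beta(v) for v, d in dist_from_u.items() if v != u)

harmonic  = lambda d: 1.0 / d                   # harmonic-mean centrality
exp_decay = lambda d: 2.0 ** (-d)               # exponential decay (base 2)
degree    = lambda d: 1.0 if d <= 1 else 0.0    # degree centrality
ball_5    = lambda d: 1.0 if d <= 5 else 0.0    # neighborhood size, d = 5
```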

Slide 26

Closeness Centrality

More general definition, based on distances to all other nodes: $C_{\alpha,\beta}(u) = \sum_{v \ne u} \alpha(d_{uv})\,\beta(v)$; $\alpha$ non-increasing, $\beta$ some filter.

Centrality with respect to a filter $\beta$: education level, community (TAU graduates), geography, language, product type. Applications of filters: attribute completion, targeted ads.

Slide 27

HIP estimators for ADSs

For each node $v$, we estimate the "presence" $a_{uv}$ of $v$ with respect to $u$ ($a_{uv} = 1$ if $d_{uv} < \infty$, 0 otherwise).

The estimate is $\hat{a}_{uv} = 0$ if $v \notin \mathrm{ADS}(u)$. If $v \in \mathrm{ADS}(u)$, we compute the probability $\tau_v$ that it is included, conditioned on fixed hash values of all nodes that are closer to $u$ than $v$. We then use the inverse-probability estimate $\hat{a}_{uv} = 1/\tau_v$. For bottom-$k$ and $h \sim U[0,1]$, $\tau_v$ is the $k$th smallest hash among nodes closer to $u$ than $v$.
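In code: the $k$ smallest hashes among nodes closer than $v$ are themselves always ADS entries, so $\tau_v$ can be read off the sketch. A minimal Python illustration over (hash, distance) entries (assumes hashes uniform on $[0,1]$ and distinct):

```python
def hip_estimates(ads_u, k):
    """Inverse-probability presence estimates for bottom-k ADS entries:
    tau = k-th smallest hash among strictly closer entries (1.0 if fewer
    than k are closer); the estimate for the entry is 1 / tau."""
    est = {}
    for hv, dv in ads_u:
        closer = sorted(hh for hh, dd in ads_u if dd < dv)
        tau = closer[k - 1] if len(closer) >= k else 1.0
        est[(hv, dv)] = 1.0 / tau
    return est
```

Summing $\alpha(d_{uv})\,\beta(v)/\tau_v$ over the entries yields the centrality estimators used on the following slides.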

Slide 28

Example: HIP estimates

Bottom-2 ADS of $u$:

[Figure: the ADS entries; for each entry, $\tau$ is the 2nd smallest hash among the closer nodes, and the presence estimate is $1/\tau$.]

Slide 29

Example: HIP estimates

Bottom-2 ADS of $u$:

[Figure: the same ADS annotated with the distance of each entry and the resulting estimates, which are summed to estimate neighborhood cardinalities.]

Slide 30

Example: HIP estimates

Bottom-2 ADS of $u$. Only good guys ($\beta(v) = 1$ iff $v$ is good):

[Figure: the same ADS with the entry distances; only the estimates of entries passing the filter $\beta$ are summed.]

Slide 31

Example: HIP estimates

Bottom-2 ADS of $u$. Only bad guys ($\beta(v) = 1$ iff $v$ is bad):

[Figure: the same ADS with the complementary filter applied.]

Slide 32

Estimating Closeness Centrality

$C_{\alpha,\beta}(u) = \sum_{v \ne u} \alpha(d_{uv})\,\beta(v)$; $\alpha$ non-increasing, $\beta$ some filter.

Lemma: The HIP estimator $\hat{C}(u) = \sum_{v \in \mathrm{ADS}(u)} \alpha(d_{uv})\,\beta(v)/\tau_v$ has CV $\le 1/\sqrt{2(k-1)}$, for uniform $\beta$ or when ADSs are computed with respect to $\beta$.

We do not give the proof here.

Slide 33

Closeness Centrality interpreted as the L1 norm of closeness vectors

View nodes as features, weighted by some $\beta$. The relevance of node $v$ to node $u$ decreases with the distance $d_{uv}$, according to $\alpha$.

The closeness vector of node $u$ is $c_u$ with entries $c_{uv} = \alpha(d_{uv})\,\beta(v)$; then $C_{\alpha,\beta}(u) = \lVert c_u \rVert_1$.

[Figure: two example closeness vectors over the feature nodes.]

Slide 34

Next: Some ADS applications

- Cardinality/similarity of $d$-neighborhoods, by extracting Min-Hash sketches
- Closeness centralities: HIP estimators
- Closeness similarities
- Distance oracles

Slide 35

Closeness Similarity

Computed from the closeness vectors $c_u$, $c_v$:

Weighted Jaccard coefficient: $\dfrac{\sum_i \min(c_{ui}, c_{vi})}{\sum_i \max(c_{ui}, c_{vi})}$

Cosine similarity: $\dfrac{c_u \cdot c_v}{\lVert c_u \rVert_2 \, \lVert c_v \rVert_2}$

[Figure: the two example closeness vectors compared entrywise.]
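For reference, both measures computed exactly from sparse closeness vectors (dicts mapping feature node to weight); per the lemma two slides ahead, the same quantities can be estimated from ADSs instead:

```python
import math

def weighted_jaccard(cu, cv):
    """Weighted Jaccard coefficient of two closeness vectors."""
    keys = cu.keys() | cv.keys()
    num = sum(min(cu.get(i, 0.0), cv.get(i, 0.0)) for i in keys)
    den = sum(max(cu.get(i, 0.0), cv.get(i, 0.0)) for i in keys)
    return num / den if den else 0.0

def cosine(cu, cv):
    """Cosine similarity of two closeness vectors."""
    dot = sum(x * cv.get(i, 0.0) for i, x in cu.items())
    nu = math.sqrt(sum(x * x for x in cu.values()))
    nv = math.sqrt(sum(x * x for x in cv.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```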

Slide 36

Closeness Similarity: choices of $\alpha$, $\beta$

- Similarity of $d$-neighborhoods: $\alpha(x) = 1$ when $x \le d$, $\alpha(x) = 0$ when $x > d$
- $d$-Neighborhood Jaccard: $\beta \equiv 1$
- $d$-Neighborhood Adamic-Adar: $\beta(v)$ decreasing in the frequency of $v$ (Adamic-Adar weighting)

Slide 37

Estimating Closeness Similarity

Lemma: We can estimate the weighted Jaccard coefficient or the cosine similarity of the closeness vectors of two nodes from their ADSs, with mean square error $O(1/k)$.*

* For uniform $\beta$, or when ADSs are computed with respect to $\beta$.

We do not give the proof here.

Slide 38

Next: Some ADS applications

- Cardinality/similarity of $d$-neighborhoods, by extracting Min-Hash sketches
- Closeness centralities: HIP estimators
- Closeness similarities
- Distance oracles

Slide 39

Estimating SP distance

We can use $\mathrm{ADS}(u)$ and $\mathrm{ADS}(v)$ to obtain an upper bound on $d_{uv}$:

$\hat{d}_{uv} = \min_{w \in \mathrm{ADS}(u) \cap \mathrm{ADS}(v)} (d_{uw} + d_{wv})$

Comment: For directed graphs we need a "forward" $\mathrm{ADS}(u)$ and a "backward" $\mathrm{ADS}(v)$.

What can we say about the quality of this bound?
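The query itself is a single pass over the two sketches. A minimal illustration, representing each ADS as a dict from node ID to the distance stored in its entry:

```python
def estimate_distance(ads_u, ads_v):
    """Upper bound on d(u, v): minimize d(u, w) + d(w, v) over nodes w
    appearing in both ADSs (infinity if the sketches are disjoint)."""
    common = ads_u.keys() & ads_v.keys()
    return min((ads_u[w] + ads_v[w] for w in common), default=float('inf'))
```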

Slide 40

Bottom-$k$ ADSs of $u, v$

[Figure: the bottom-2 ADS of each of the two nodes, listing (hash, distance) entries.]

Slide 41

Common nodes in bottom-$k$ ADSs of $u, v$

[Figure: the entries common to both ADSs; each common node $w$ gives the bound $d_{uw} + d_{wv}$: 10+15=25, 10+4=14, 10+15=25. The estimate is the minimum, 14.]

Slide 42

Query time improvement: only test membership of Min-Hash nodes in the other ADS

[Figure: only the smallest-hash nodes ("pivots") of each ADS are tested for membership in the other ADS.]

Slide 43

Query time

Basic version: intersection of the two ADSs, of size $O(k \log n)$ each.

A faster version tests the presence of the $k$ "pivots" in the other ADS; query time (requires data structures) is $O(k)$.

We can (will not show in class) further reduce query time by noting dependencies between pivots: no need to test all of them.

Comment: the theoretical worst-case upper bound on the stretch is the same, but in practice, estimate quality deteriorates with query-time "improvements": better to use the full "sketch".

Slide 44

Bounding the stretch

Stretch: the ratio between the approximate and the true distance.

Theorem: On undirected graphs, for any integer $c$, if we use $k = n^{1/c}$, there is constant probability that the estimate is at most $2c - 1$ times the actual distance.

$c = 1$: stretch is at most $1$, but $k = n$. With fixed $k$ we get stretch $O(\log n)$. The stretch/representation-size tradeoff is worst-case tight (under some hardness assumptions). In practice, the stretch is typically much better.

Slide 45

Bounding the stretch

Theorem: On undirected graphs, for any integer $c$, if we use $k = n^{1/c}$, there is constant probability that the estimate is at most $2c - 1$ times the actual distance.

We prove a slightly weaker version in class.

Slide 46

 

 

 

 

 

 

Slide 47

 

 

 

 

Proof outline:

Part 1: We show that if the ratio between the cardinalities of two consecutive neighborhood sets is bounded, then the min-hash of the smaller set is likely to be a member of the bottom-$k$ of the larger set. The distance bound we get is determined by the radius of the larger set.

Part 2: If all consecutive pairs have a large ratio, we show the sequence cannot be too long, and the minimum-hash node gives good stretch.

Slide 48

Part 2: If all consecutive pairs have cardinality ratio at least $r$:

At "distance" $i$ in the sequence the set size is at least $r^i$; since sizes are bounded by $n$, we must have $i \le \log_r n$. Once the sequence stops growing, all nodes are within the corresponding distance from (one of) $u, v$. In particular, the node with minimum hash must be of distance at most that radius from one of them, and at most the sum of the radii from the other. We obtain the claimed "stretch".

Slide 49

 

 

 

 

Slide 50

 

 

 

 

 

If the minimum hash in the smaller neighborhood is in the bottom-$k$ of the larger one, then the corresponding node appears in both ADSs, and our estimate is at most the sum of the two neighborhood radii. More generally....

Slide 51

 

 

 

 

 

Slide 52

 

 

 

 

Slide 53

 

 

 

 

 

Lemma: For sets $S \subseteq T$, the probability that the minimum hash in $S$ is the minimum hash in $T$ is $|S|/|T|$.

Proof: Look at the minimum in $T$: it is a uniform random node in $T$, so it falls in $S$ with probability $|S|/|T|$.

Slide 54

 

 

Lemma: For sets $S \subseteq T$, the probability that the minimum hash in $S$ is in the bottom-$k$ of $T$ is at least $1 - (1 - q)^k$, where $q = |S|/|T|$.

Proof: The probability that none of the bottom-$k$ of $T$ is in $S$ is $\prod_{j=0}^{k-1} \frac{|T| - |S| - j}{|T| - j} \le (1 - q)^k$. The probability that at least one is in $S$ is therefore at least $1 - (1 - q)^k$; and if any node of $S$ is in the bottom-$k$ of $T$, then so is the minimum-hash node of $S$.

Slide 55

 

 

Lemma: For sets $S \subseteq T$, the probability that the minimum hash in $S$ is in the bottom-$k$ of $T$ is at least $1 - (1 - q)^k$, where $q = |S|/|T|$.

The argument holds for any two consecutive neighborhoods in the sequence. If $q \ge 1/k$, the probability is at least $1 - (1 - 1/k)^k \ge 1 - 1/e$, at least a constant.

Slide 56

Bibliography

Closeness centrality on ADSs using HIP: E. Cohen. "All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis." arXiv, 2013.

Closeness similarity and distance oracles using ADSs: E. Cohen, D. Delling, F. Fuchs, A. Goldberg, M. Goldszmidt, and R. Werneck. "Scalable similarity estimation in social networks: Closeness, node labels, and random edge lengths." In COSN, 2013.

Related:

Distance distribution on social graphs: P. Boldi, M. Rosa, and S. Vigna. "HyperANF: Approximating the neighbourhood function of very large graphs on a budget." In WWW, 2011.

More on distance oracles: M. Thorup and U. Zwick. "Approximate Distance Oracles." JACM, 2005.

Slide 57

Next: (back to) Linear Sketches

Linear transformations of the input vector to a lower dimension.

When to use linear sketches?

Examples: the JL Lemma on Gaussian random projections, the AMS sketch.

Slide 58

Min-Hash sketches

- Simple, little overhead
- Mergeable (in particular, can add elements)
- One sketch with many uses: distinct count, similarity, sample
- Weighted vectors (we did not see this in class): weighted sampling, norm/distance estimates. The sketch supports increasing an entry's value or replacing it with a larger value.
- But... no support for negative updates

Slide 59

Linear Sketches

Linear transformations (usually "random"): the input is a vector $x$ of dimension $n$; the sketch is $Ax$, where $A$ is a $d \times n$ matrix ($d \ll n$) whose entries are specified by (carefully chosen) random hash functions.

Slide 60

Advantages of Linear Sketches

Easy to update the sketch under positive and negative updates to an entry: an update $(i, \Delta)$, where $\Delta$ means $x_i \leftarrow x_i + \Delta$. To update the sketch: add $\Delta A^{(i)}$, where $A^{(i)}$ is the $i$th column of $A$.

Naturally mergeable (over signed entries): the sketch of $x + y$ is $Ax + Ay$.
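To illustrate both properties, here is a minimal Gaussian random-projection sketch (a JL-style instance; the explicitly stored matrix stands in for entries specified implicitly by random hash functions, and all names are illustrative):

```python
import numpy as np

class LinearSketch:
    """Stores s = A @ x for a d x n matrix A with i.i.d. N(0, 1/d)
    entries, without ever materializing x."""

    def __init__(self, n, d, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, n))
        self.s = np.zeros(d)                  # sketch of the zero vector

    def update(self, i, delta):
        """x[i] += delta (delta may be negative): add delta * column i."""
        self.s += delta * self.A[:, i]

    def merge(self, other):
        """Sketch of x + y is A @ x + A @ y (same A on both sides)."""
        self.s += other.s

    def l2_estimate(self):
        """||s||_2 estimates ||x||_2 (Johnson-Lindenstrauss)."""
        return float(np.linalg.norm(self.s))
```

Two sketches can be merged only if they were built with the same matrix (here, the same seed).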