
What’s the Difference? - PowerPoint Presentation

Uploaded On 2016-04-01


Efficient Set Reconciliation without Prior Context. Frank Uyeda (University of California, San Diego), David Eppstein, Michael T. Goodrich & George Varghese. Motivation: Distributed applications often need to compare remote state.



Presentation Transcript

Slide1

What’s the Difference?

Efficient Set Reconciliation without Prior Context

Frank Uyeda
University of California, San Diego
David Eppstein, Michael T. Goodrich & George Varghese

Slide2

Motivation

Distributed applications often need to compare remote state.

[Figure: replicas R1 and R2 reconnect when a partition heals.]

Must solve the Set-Difference Problem!

Slide3

What is the Set-Difference problem?

What objects are unique to host 1?

What objects are unique to host 2?

[Figure: Host 1 and Host 2 each hold objects drawn from A to F; A and F appear on both hosts.]

Slide4

Example 1: Data Synchronization

Identify missing data blocks

Transfer blocks to synchronize sets

[Figure: Host 1 and Host 2 with objects A to F; blocks B, C, D, and E are transferred between the hosts.]

Slide5

Example 2: Data De-duplication

Identify all unique blocks.

Replace duplicate data with pointers

[Figure: Host 1 and Host 2 with objects A to F; A and F appear on both hosts.]

Slide6

Set-Difference Solutions

Trade a sorted list of objects: O(n) communication, O(n log n) computation

Approximate Solutions:

Approximate Reconciliation Tree (Byers): O(n) communication, O(n log n) computation

Polynomial Encodings (Minsky & Trachtenberg): let “d” be the size of the difference; O(d) communication, O(dn + d^3) computation

Invertible Bloom Filter: O(d) communication, O(n + d) computation
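
As a baseline for the first option on this slide, here is a minimal sketch of reconciling two sorted lists with a single merge pass. The function name `sorted_diff` is illustrative and not from the slides. It assumes each host has already sorted its IDs (the O(n log n) step) and has received the other host's full list (the O(n) communication).

```python
def sorted_diff(a: list[int], b: list[int]) -> tuple[list[int], list[int]]:
    """Merge two sorted, duplicate-free lists; return (only_in_a, only_in_b)."""
    only_a, only_b = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:          # common element: skip it on both sides
            i += 1
            j += 1
        elif a[i] < b[j]:         # a[i] cannot appear later in b
            only_a.append(a[i])
            i += 1
        else:                     # b[j] cannot appear later in a
            only_b.append(b[j])
            j += 1
    only_a.extend(a[i:])          # leftovers are unique to their side
    only_b.extend(b[j:])
    return only_a, only_b


host1 = sorted([1, 3, 6])
host2 = sorted([1, 2, 4, 5, 6])
print(sorted_diff(host1, host2))  # ([3], [2, 4, 5])
```

The cost that difference digests aim to avoid is visible here: the whole list crosses the network even when only a few IDs differ.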

Slide7

Difference Digests

Efficiently solves the set-difference problem.

Consists of two data structures:

Invertible Bloom Filter (IBF): efficiently computes the set difference; needs the size of the difference.

Strata Estimator: approximates the size of the set difference; uses IBFs as a building block.

Slide8

Invertible Bloom Filters (IBF)

Encode local object identifiers into an IBF.

[Figure: Host 1 and Host 2, each holding objects drawn from A to F, produce IBF 1 and IBF 2.]

Slide9

IBF Data Structure

Array of IBF cells

For a set difference of size d, require αd cells (α > 1)

Each ID is assigned to many IBF cells

Each IBF cell contains:

idSum: XOR of all IDs in the cell

hashSum: XOR of hash(ID) for all IDs in the cell

count: number of IDs assigned to the cell

Slide10

IBF Encode

[Figure: ID A is hashed by Hash1, Hash2, and Hash3 into three cells of the IBF; IDs B and C follow.]

Assign ID to many cells

“Add” ID to cell: idSum ⊕ A, hashSum ⊕ H(A), count++

IBF: αd cells. Not O(n), like Bloom Filters!
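
The encode step on this slide can be sketched as follows. This is an illustration under assumptions, not the authors' code: IDs are taken to be 64-bit non-negative integers, SHA-256 stands in for the unspecified hash functions, and the helper names (`h`, `cell_indexes`, `new_ibf`, `encode`) are made up for this example. Giving each hash function its own sub-table is one way to guarantee an ID lands in K distinct cells.

```python
import hashlib

K = 3  # number of hash functions; the slides report that 3 or 4 work well


def h(key: int, salt: bytes) -> int:
    """64-bit hash of a 64-bit ID (SHA-256 is an assumption, not from the slides)."""
    digest = hashlib.sha256(salt + key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def cell_indexes(key: int, num_cells: int) -> list[int]:
    """One cell per hash function, each in its own sub-table, so all K differ."""
    sub = num_cells // K
    return [i * sub + h(key, bytes([i])) % sub for i in range(K)]


def new_ibf(num_cells: int) -> list[list[int]]:
    """Each cell is [idSum, hashSum, count]."""
    return [[0, 0, 0] for _ in range(num_cells)]


def encode(ibf: list[list[int]], key: int) -> None:
    """'Add' the ID to each of its K cells."""
    for i in cell_indexes(key, len(ibf)):
        cell = ibf[i]
        cell[0] ^= key               # idSum   = idSum XOR ID
        cell[1] ^= h(key, b"check")  # hashSum = hashSum XOR H(ID)
        cell[2] += 1                 # count++


ibf = new_ibf(12)
for object_id in (0xA, 0xB, 0xC):
    encode(ibf, object_id)
print(sum(cell[2] for cell in ibf))  # 9: three IDs, three cells each
```

Because XOR is commutative, the resulting table does not depend on the order in which IDs are added, which is what lets two hosts build comparable IBFs independently.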

All hosts use the same hash functions.

Slide11

Invertible Bloom Filters (IBF)

Trade IBFs with remote host

[Figure: Host 1 and Host 2 exchange IBF 1 and IBF 2.]

Slide12

Invertible Bloom Filters (IBF)

“Subtract” IBF structures

Produces a new IBF containing only unique objects

[Figure: Host 1 and Host 2 with IBF 1 and IBF 2; subtracting them gives IBF (2 - 1).]

Slide13

IBF Subtract

Slide14

Timeout for Intuition

After subtraction, all elements common to both sets have disappeared. Why?

Any common element (e.g., W) is assigned to the same cells on both hosts (assuming the same hash functions on both sides).

On subtraction, W XOR W = 0. Thus, W vanishes.

While elements in the set difference remain, they may be randomly mixed, so we need a decode procedure.
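
The cancellation argument can be checked with a small sketch. The helper names and the SHA-256-based hashing are assumptions made for illustration; the slides only fix the cell layout (idSum, hashSum, count) and the requirement that both hosts use the same hash functions.

```python
import hashlib

K = 3  # assumed number of hash functions


def h(key: int, salt: bytes) -> int:
    digest = hashlib.sha256(salt + key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ibf(keys, num_cells: int) -> list[tuple[int, int, int]]:
    """Encode a set of 64-bit IDs into cells of (idSum, hashSum, count)."""
    id_sum, hash_sum, count = [0] * num_cells, [0] * num_cells, [0] * num_cells
    sub = num_cells // K  # one sub-table per hash function
    for key in keys:
        for i in range(K):
            j = i * sub + h(key, bytes([i])) % sub
            id_sum[j] ^= key
            hash_sum[j] ^= h(key, b"check")
            count[j] += 1
    return list(zip(id_sum, hash_sum, count))


def subtract(ibf_a, ibf_b):
    """Cell-wise: XOR the idSums and hashSums, subtract the counts."""
    return [(ia ^ ib, ha ^ hb, ca - cb)
            for (ia, ha, ca), (ib, hb, cb) in zip(ibf_a, ibf_b)]


W, X = 0x57, 0x58  # W is on both hosts, X is only on host 1
diff = subtract(build_ibf({W, X}, 12), build_ibf({W}, 12))
print(diff == build_ibf({X}, 12))  # True: W cancelled out of every cell
```

The subtracted table is indistinguishable from an IBF built from the difference alone, which is why its size only needs to scale with d and not with n.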

Slide15

Invertible Bloom Filters (IBF)

Decode resulting IBF

Recover object identifiers from IBF structure.

[Figure: decoding IBF (2 - 1) recovers B, C, D, and E and attributes each to Host 1 or Host 2.]

Slide16

IBF Decode

Test for Purity: H(idSum) = hashSum

If a cell holds only V: H(V) = H(V)

If a cell holds V, X, and Z: H(V ⊕ X ⊕ Z) ≠ H(V) ⊕ H(X) ⊕ H(Z)

Slide17

IBF Decode

Slide18

IBF Decode

Slide19

IBF Decode
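
The decode walk-through on these slides (find a pure cell, output its ID, remove that ID from its other cells, repeat) can be sketched as below. This is a minimal illustration under assumptions: 64-bit integer IDs, SHA-256 in place of the unspecified hash functions, and invented helper names. The purity check is the one from the slides, H(idSum) = hashSum, combined with a count of +1 or -1.

```python
import hashlib

K = 3  # assumed number of hash functions


def h(key: int, salt: bytes) -> int:
    digest = hashlib.sha256(salt + key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def cells_of(key: int, num_cells: int) -> list[int]:
    sub = num_cells // K  # one sub-table per hash function
    return [i * sub + h(key, bytes([i])) % sub for i in range(K)]


def update(ibf, key: int, delta: int) -> None:
    """Add (delta=+1) or remove (delta=-1) an ID; XOR is its own inverse."""
    for j in cells_of(key, len(ibf)):
        ibf[j][0] ^= key
        ibf[j][1] ^= h(key, b"check")
        ibf[j][2] += delta


def build(keys, num_cells: int):
    ibf = [[0, 0, 0] for _ in range(num_cells)]  # [idSum, hashSum, count]
    for key in keys:
        update(ibf, key, +1)
    return ibf


def subtract(a, b):
    return [[x[0] ^ y[0], x[1] ^ y[1], x[2] - y[2]] for x, y in zip(a, b)]


def is_pure(cell) -> bool:
    """Purity test: count is +1 or -1 and H(idSum) == hashSum."""
    return cell[2] in (1, -1) and h(cell[0], b"check") == cell[1]


def decode(diff):
    """Peel pure cells until none remain. Returns (ok, only_in_a, only_in_b)."""
    diff = [cell[:] for cell in diff]  # work on a copy
    only_a, only_b = set(), set()
    todo = [j for j, cell in enumerate(diff) if is_pure(cell)]
    while todo:
        j = todo.pop()
        if not is_pure(diff[j]):  # already emptied by an earlier peel
            continue
        key, count = diff[j][0], diff[j][2]
        (only_a if count == 1 else only_b).add(key)
        update(diff, key, -count)  # remove the ID from all of its cells
        todo.extend(i for i in cells_of(key, len(diff)) if is_pure(diff[i]))
    ok = all(cell == [0, 0, 0] for cell in diff)  # leftovers mean decode failed
    return ok, only_a, only_b


host1, host2 = {1, 2, 3, 4, 5}, {4, 5, 6, 7}
ok, only1, only2 = decode(subtract(build(host1, 60), build(host2, 60)))
print(ok, sorted(only1), sorted(only2))
```

Decoding is probabilistic: if the table is too small for the difference, no pure cell may be left while non-empty cells remain, and `ok` comes back false.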

Slide20

How many IBF cells?

[Figure: Space Overhead vs. Set Difference for Hash Cnt 3 and Hash Cnt 4; overhead to decode at >99%.]

Small Diffs: 1.4x - 2.3x

Large Differences: 1.25x - 1.4x

Slide21

How many hash functions?

1 hash function produces many pure cells initially but nothing to undo when an element is removed.

[Figure: IDs A, B, and C, each assigned to a single cell.]

Slide22

How many hash functions?

1 hash function produces many pure cells initially but nothing to undo when an element is removed.

Many (say 10) hash functions: too many collisions.

[Figure: IDs A, B, and C, each assigned to many cells.]

Slide23

How many hash functions?

1 hash function produces many pure cells initially but nothing to undo when an element is removed.

Many (say 10) hash functions: too many collisions.

We find by experiment that 3 or 4 hash functions work well. Is there some theoretical reason?

[Figure: IDs A, B, and C, each assigned to three cells.]

Slide24

Theory

Let d = difference size, k = # hash functions.

Theorem 1: With (k + 1) d cells, failure probability falls exponentially. For k = 3, this implies a 4x tax on storage, a bit weak.

[Goodrich, Mitzenmacher]: Failure is equivalent to finding a 2-core (loop) in a random hypergraph.

Theorem 2: With c_k d cells, failure probability falls exponentially. c_4 = 1.3x tax, agrees with experiments.

Slide25

How many IBF cells?

[Figure: Space Overhead vs. Set Difference for Hash Cnt 3 and Hash Cnt 4; overhead to decode at >99%.]

Large Differences: 1.25x - 1.4x

Slide26

Connection to Coding

Mystery: IBF decode is similar to the peeling procedure used to decode Tornado codes. Why?

Explanation: Set Difference is equivalent to coding with insert-delete channels.

Intuition: Given a code for set A, send codewords only to B. Think of B’s set as a corrupted form of A’s.

Reduction: If the code can correct D insertions/deletions, then B can recover A and the set difference.

Reed Solomon <---> Polynomial Methods
LDPC (Tornado) <---> Difference Digest

Slide27

Difference Digests

Consists of two data structures:

Invertible Bloom Filter (IBF): efficiently computes the set difference; needs the size of the difference.

Strata Estimator: approximates the size of the set difference; uses IBFs as a building block.

Slide28

Strata Estimator

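[Figure: keys A, B, and C pass through consistent partitioning into strata holding ~1/2, ~1/4, ~1/8, and 1/16 of the keys, encoded as IBF 1 through IBF 4 of the estimator.]

Divide keys into partitions, with partition k containing ~1/2^k of the keys.

Encode each partition into an IBF of fixed size.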
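
One common way to get such a consistent partitioning is to hash each key and count trailing zero bits; whether the authors use exactly this rule is not stated on the slide, so treat it as an assumption. `NUM_STRATA`, `stratum`, and `partition` are illustrative names, and SHA-256 stands in for the unspecified hash.

```python
import hashlib

NUM_STRATA = 16  # assumed; the slide suggests about log(n) strata


def stratum(key: int) -> int:
    """Level k (1-based) gets keys whose hash has k-1 trailing zero bits.

    The last level also absorbs any key with more trailing zeros.
    """
    digest = hashlib.sha256(b"strata" + key.to_bytes(8, "big")).digest()
    value = int.from_bytes(digest[:8], "big")
    level = 1
    while value & 1 == 0 and level < NUM_STRATA:
        value >>= 1
        level += 1
    return level


def partition(keys) -> dict[int, set[int]]:
    """Group keys by stratum; a shared key lands in the same stratum on every host."""
    strata = {k: set() for k in range(1, NUM_STRATA + 1)}
    for key in keys:
        strata[stratum(key)].add(key)
    return strata


sizes = {k: len(s) for k, s in partition(range(10_000)).items()}
print(sizes)  # level 1 holds about half the keys, level 2 about a quarter, ...
```

Because the stratum depends only on the key, common keys still cancel level by level when two hosts subtract their estimators.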

log(n) IBFs of ~80 cells each

Slide29

Strata Estimator

[Figure: Host 1 and Host 2 exchange Estimator 1 and Estimator 2, each made of IBF 1 through IBF 4; one level is decoded and its count is scaled by 4x.]

Attempt to subtract & decode IBFs at each level.

If level k decodes, then return: 2^k x (the number of IDs recovered)

Slide30

Strata Estimator

[Figure: Host 1 and Host 2 exchange Estimator 1 and Estimator 2, each made of IBF 1 through IBF 4; one level is decoded and its count is scaled by 4x.]

Attempt to subtract & decode IBFs at each level.

If level k decodes, then return: 2^k x (the number of IDs recovered)

What about the other strata?

Slide31

Strata Estimator

[Figure: Estimator 1 and Estimator 2 from Host 1 and Host 2, each made of IBF 1 through IBF 4; several levels are decoded and the combined count is scaled by 2x.]

Observation: Extra partitions hold useful data.

Sum elements from all decoded strata & return: 2^(k-1) x (the number of IDs recovered)
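
The estimator described above can be sketched end to end as follows. This is a hedged illustration, not the authors' implementation: it assumes 64-bit integer IDs, SHA-256-based hashing, trailing-zero partitioning, and 16 strata, and all function names are invented. It walks from the sparsest level down, sums the IDs recovered, and scales by a power of two once a level fails to decode.

```python
import hashlib

K = 3            # assumed number of hash functions per IBF
CELLS = 80       # cells per stratum, as on the slide
NUM_STRATA = 16  # assumed; about log(n)


def h(key: int, salt: bytes) -> int:
    digest = hashlib.sha256(salt + key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def cells_of(key: int, num_cells: int) -> list[int]:
    sub = num_cells // K
    return [i * sub + h(key, bytes([i])) % sub for i in range(K)]


def update(ibf, key: int, delta: int) -> None:
    for j in cells_of(key, len(ibf)):
        ibf[j][0] ^= key
        ibf[j][1] ^= h(key, b"check")
        ibf[j][2] += delta


def stratum(key: int) -> int:
    """Level k (1-based) holds about 1/2^k of the keys."""
    value, level = h(key, b"strata"), 1
    while value & 1 == 0 and level < NUM_STRATA:
        value, level = value >> 1, level + 1
    return level


def build_estimator(keys, cells: int = CELLS):
    """One fixed-size IBF per stratum; index 0 is unused so levels are 1-based."""
    est = [[[0, 0, 0] for _ in range(cells)] for _ in range(NUM_STRATA + 1)]
    for key in keys:
        update(est[stratum(key)], key, +1)
    return est


def try_decode(a, b):
    """Subtract two IBFs and peel; return the number of IDs, or None on failure."""
    diff = [[x[0] ^ y[0], x[1] ^ y[1], x[2] - y[2]] for x, y in zip(a, b)]
    pure = lambda c: c[2] in (1, -1) and h(c[0], b"check") == c[1]
    todo, found = [j for j, c in enumerate(diff) if pure(c)], 0
    while todo:
        j = todo.pop()
        if pure(diff[j]):
            key = diff[j][0]
            update(diff, key, -diff[j][2])  # remove the ID from all its cells
            found += 1
            todo.extend(cells_of(key, len(diff)))
    return found if all(c == [0, 0, 0] for c in diff) else None


def estimate(est_a, est_b) -> int:
    """Decode from the sparsest level down; scale the count when a level fails."""
    count = 0
    for level in range(NUM_STRATA, 0, -1):
        found = try_decode(est_a[level], est_b[level])
        if found is None:
            # Levels above `level` decoded and together sample ~1/2^level of the
            # keys. With k = level + 1 the lowest decoded level, this is 2^(k-1).
            return (2 ** level) * count
        count += found
    return count  # every level decoded, so the count is exact


a = set(range(0, 3000))
b = set(range(2000, 5000))  # true difference: 2000 + 2000 = 4000
print(estimate(build_estimator(a), build_estimator(b)))
```

The printed value is an estimate and varies with the hash functions; it is only exact when every level decodes.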

Slide32

Estimation Accuracy

Strata good for small differences.

Min-Wise good for large differences.

[Figure: Average Estimation Error (15.3 KBytes); x-axis: Set Difference; y-axis: Relative Error in Estimation (%).]

Slide33

Hybrid Estimator

Combine Strata and Min-Wise Estimators.

Use IBF Strata for small differences.

Use Min-Wise for large differences.

[Figure: a Strata estimator (IBF 1 through IBF 4) and a Min-Wise estimator combine into a Hybrid estimator (IBF 1 through IBF 3 plus Min-Wise).]

Slide34

Hybrid Estimator Accuracy

Hybrid matches Strata for small differences.

Converges with Min-Wise for large differences.

[Figure: Average Estimation Error (15.3 KBytes); x-axis: Set Difference; y-axis: Relative Error in Estimation (%).]

Slide35

Application: KeyDiff Service

Promising Applications:

File Synchronization

P2P file sharing

Failure Recovery

[Figure: three Applications, each talking to a Key Service.]

Add( key )

Remove( key )

Diff( host1, host2 )
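
The three operations above could be exposed roughly as follows. This is a hypothetical sketch: the slides name only Add, Remove, and Diff, so the class `KeyService`, the fixed table size, the incremental IBF maintenance, and the SHA-256 hashing are all assumptions made for illustration. A real service would presumably size the IBF from an estimator first.

```python
import hashlib

K, CELLS = 3, 300  # assumed: 3 hash functions, fixed-size IBF of 300 cells


def h(key: int, salt: bytes) -> int:
    digest = hashlib.sha256(salt + key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big")


def cells_of(key: int) -> list[int]:
    sub = CELLS // K
    return [i * sub + h(key, bytes([i])) % sub for i in range(K)]


def apply(ibf, key: int, delta: int) -> None:
    for j in cells_of(key):
        ibf[j][0] ^= key
        ibf[j][1] ^= h(key, b"check")
        ibf[j][2] += delta


class KeyService:
    """Hypothetical per-host service that keeps an IBF current as keys change."""

    def __init__(self) -> None:
        self.keys: set[int] = set()
        self.ibf = [[0, 0, 0] for _ in range(CELLS)]

    def add(self, key: int) -> None:
        if key not in self.keys:
            self.keys.add(key)
            apply(self.ibf, key, +1)

    def remove(self, key: int) -> None:
        if key in self.keys:
            self.keys.remove(key)
            apply(self.ibf, key, -1)


def diff(host1: KeyService, host2: KeyService):
    """Return (keys only on host1, keys only on host2), or None if decode fails."""
    d = [[x[0] ^ y[0], x[1] ^ y[1], x[2] - y[2]]
         for x, y in zip(host1.ibf, host2.ibf)]
    pure = lambda c: c[2] in (1, -1) and h(c[0], b"check") == c[1]
    only1, only2 = set(), set()
    todo = list(range(CELLS))
    while todo:
        j = todo.pop()
        if pure(d[j]):
            key, count = d[j][0], d[j][2]
            (only1 if count == 1 else only2).add(key)
            apply(d, key, -count)  # peel the key out of all of its cells
            todo.extend(cells_of(key))
    return (only1, only2) if all(c == [0, 0, 0] for c in d) else None


svc1, svc2 = KeyService(), KeyService()
for k in (1, 2, 3):
    svc1.add(k)
    svc2.add(k)
svc1.add(4)
svc2.remove(1)
print(diff(svc1, svc2))
```

Since adding and removing a key both touch only K cells, keeping the IBF up to date costs a constant amount of work per write in this sketch.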

Slide36

Difference Digests Summary

Strata & Hybrid Estimators
Estimate the size of the Set Difference.
For 100K sets, 15KB estimator has <15% error.
O(log n) communication, O(log n) computation.

Invertible Bloom Filter
Identifies all IDs in the Set Difference.
16 to 28 Bytes per ID in Set Difference.
O(d) communication, O(n+d) computation.

Implemented in KeyDiff Service

Slide37

Conclusions: Got Diffs?

New randomized algorithm (difference digests) for set difference or insertion/deletion coding.

Could it be useful for your system? Need:

Large but roughly equal size sets

Small set differences (less than 10% of set size)

Slide38

Slide39

Extra Slides

Slide40

Comparison to Logs

IBFs work with no prior context.

Logs work with prior context, BUT
Redundant information when sync’ing with multiple parties.
Logging must be built into the system for each write.
Logging adds overhead at runtime.
Logging requires non-volatile storage, which is often not present in network devices.

IBFs may outperform logs when:

Synchronizing multiple parties

Synchronizations happen infrequently