Efficient Set Reconciliation without Prior Context. Frank Uyeda (University of California, San Diego), David Eppstein, Michael T. Goodrich & George Varghese.
Slide1
What’s the Difference?
Efficient Set Reconciliation without Prior Context
Frank Uyeda, University of California, San Diego
David Eppstein, Michael T. Goodrich & George Varghese
Slide2
Motivation
Distributed applications often need to compare remote state.
[Figure: two replicas, R1 and R2; label: "Partition Heals".]
Must solve the Set-Difference Problem!
Slide3
What is the Set-Difference problem?
What objects are unique to host 1?
What objects are unique to host 2?
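[Figure: Host 1 and Host 2 each hold a set of objects drawn from A to F; A and F appear on both hosts, while B, C, D and E each appear on only one.]
Slide4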
Example 1: Data Synchronization
Identify missing data blocks
Transfer blocks to synchronize sets
[Figure: Host 1 and Host 2 hold sets drawn from A to F; the missing blocks B, C, D and E are transferred between the hosts.]
Slide5
Example 2: Data De-duplication
Identify all unique blocks.
Replace duplicate data with pointers
[Figure: Host 1 and Host 2 hold sets drawn from A to F; A and F are stored on both hosts.]
Slide6
Set-Difference Solutions
Trade a sorted list of objects: O(n) communication, O(n log n) computation.
Approximate solutions:
Approximate Reconciliation Tree (Byers): O(n) communication, O(n log n) computation.
Polynomial Encodings (Minsky & Trachtenberg): with d the size of the difference, O(d) communication, O(dn + d^3) computation.
Invertible Bloom Filter: O(d) communication, O(n + d) computation.
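As a point of reference for these bounds, the first option (trading a full sorted list) can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `sorted_list_diff` and the example host contents are assumptions, and each list is assumed to hold distinct IDs.

```python
def sorted_list_diff(local_ids, remote_sorted):
    """Return (only_local, only_remote) by merging two sorted ID lists.

    `remote_sorted` stands for the full sorted list received from the
    other host, which is the O(n) communication cost.
    """
    local_sorted = sorted(local_ids)  # O(n log n) computation
    only_local, only_remote = [], []
    i = j = 0
    while i < len(local_sorted) and j < len(remote_sorted):
        a, b = local_sorted[i], remote_sorted[j]
        if a == b:          # present on both hosts
            i += 1
            j += 1
        elif a < b:         # present only locally
            only_local.append(a)
            i += 1
        else:               # present only remotely
            only_remote.append(b)
            j += 1
    only_local.extend(local_sorted[i:])
    only_remote.extend(remote_sorted[j:])
    return only_local, only_remote


# Hypothetical host contents, chosen only for illustration.
host1 = ["A", "C", "F", "E"]
host2 = ["A", "B", "D", "F"]
print(sorted_list_diff(host1, sorted(host2)))
```

Every ID crosses the network here, which is the cost that the O(d) approaches listed above avoid.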
Slide7
Difference Digests
Efficiently solves the set-difference problem.
Consists of two data structures:
Invertible Bloom Filter (IBF): efficiently computes the set difference; needs the size of the difference.
Strata Estimator: approximates the size of the set difference; uses IBFs as a building block.
Slide8
Invertible Bloom Filters (IBF)
Encode local object identifiers into an IBF.
[Figure: Host 1 and Host 2 hold sets drawn from A to F; each host encodes its own set, giving IBF 1 and IBF 2.]
Slide9
IBF Data Structure
Array of IBF cells.
For a set difference of size d, require αd cells (α > 1).
Each ID is assigned to many IBF cells.
Each IBF cell contains:
idSum: XOR of all IDs in the cell
hashSum: XOR of hash(ID) for all IDs in the cell
count: number of IDs assigned to the cell
Slide10
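The three cell fields from the IBF Data Structure slide can be sketched as a small record. This is an illustrative sketch only: the class name `Cell`, the method `add`, and the choice of a truncated SHA-256 as the check hash `H` are assumptions, and IDs are assumed to be non-negative integers below 2^64.

```python
import hashlib
from dataclasses import dataclass


def H(id_: int) -> int:
    """Hypothetical 32-bit check hash; any fixed hash shared by all hosts would do."""
    digest = hashlib.sha256(id_.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:4], "big")


@dataclass
class Cell:
    id_sum: int = 0    # XOR of all IDs in the cell
    hash_sum: int = 0  # XOR of H(ID) for all IDs in the cell
    count: int = 0     # number of IDs assigned to the cell

    def add(self, id_: int) -> None:
        self.id_sum ^= id_
        self.hash_sum ^= H(id_)
        self.count += 1


cell = Cell()
for key in (1, 2, 4):
    cell.add(key)
print(cell.id_sum, cell.count)  # prints "7 3" (1 ^ 2 ^ 4 == 7)
```

Because both sums are XORs, the cell stays the same size no matter how many IDs land in it.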
IBF Encode
[Figure: ID A is hashed by Hash1, Hash2 and Hash3 to three cells of an IBF with αd cells; IDs B and C follow.]
Assign ID to many cells.
"Add" ID to each cell: idSum ⊕ A, hashSum ⊕ H(A), count++.
The IBF has αd cells: not O(n), like Bloom Filters!
All hosts use the same hash functions.
Slide11
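The encode step from the IBF Encode slide might look like the sketch below. The names (`encode`, `cell_indices`, `check_hash`), the salted SHA-256 hashing, and the choice to give each hash function its own sub-table (so one ID never lands in the same cell twice) are assumptions made for illustration; the slide only specifies that each ID is XORed into several cells chosen by shared hash functions.

```python
import hashlib
from dataclasses import dataclass

K = 3  # number of hash functions (assumed; the talk later suggests 3 or 4)


def _hash(id_: int, salt: int) -> int:
    """Salted 64-bit hash; every host must use the same function."""
    data = bytes([salt]) + id_.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def check_hash(id_: int) -> int:
    """Plays the role of H(ID) on the slide."""
    return _hash(id_, 255)


def cell_indices(id_: int, num_cells: int) -> list[int]:
    """One cell per hash function, each in its own sub-table of the array."""
    part = num_cells // K
    return [j * part + _hash(id_, j) % part for j in range(K)]


@dataclass
class Cell:
    id_sum: int = 0
    hash_sum: int = 0
    count: int = 0


def encode(ids, num_cells: int) -> list[Cell]:
    """Build an IBF whose size depends on num_cells, not on len(ids)."""
    cells = [Cell() for _ in range(num_cells)]
    for id_ in ids:
        for idx in cell_indices(id_, num_cells):
            cell = cells[idx]
            cell.id_sum ^= id_
            cell.hash_sum ^= check_hash(id_)
            cell.count += 1
    return cells


ibf = encode([10, 20, 30], num_cells=12)
print(sum(c.count for c in ibf))  # prints 9: three IDs, three cells each
```

The array length is fixed by `num_cells` (the αd of the slide), so encoding a million IDs produces a structure of the same size as encoding three.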
Invertible Bloom Filters (IBF)
Trade IBFs with remote host.
[Figure: Host 1 and Host 2 exchange IBF 1 and IBF 2.]
Slide12
Invertible Bloom Filters (IBF)
"Subtract" IBF structures.
Produces a new IBF containing only unique objects.
[Figure: IBF 1 and IBF 2 are combined into IBF (2 - 1).]
Slide13
IBF Subtract
Slide14
Timeout for Intuition
After subtraction, all elements common to both sets have disappeared. Why?
Any common element (e.g. W) is assigned to the same cells on both hosts (assuming the same hash functions on both sides).
On subtraction, W XOR W = 0. Thus, W vanishes.
The elements in the set difference remain, but they may be randomly mixed within cells, so we need a decode procedure.
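The cancellation argument can be checked with a small sketch. Everything here is illustrative rather than taken from the talk: cells are plain `[idSum, hashSum, count]` lists, the hashing uses salted SHA-256, each hash function gets its own sub-table, and the two host sets are made up.

```python
import hashlib

K = 3  # number of hash functions (assumed)


def _hash(x: int, salt: int) -> int:
    data = bytes([salt]) + x.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def cell_indices(x: int, num_cells: int) -> list[int]:
    part = num_cells // K
    return [j * part + _hash(x, j) % part for j in range(K)]


def encode(ids, num_cells: int):
    cells = [[0, 0, 0] for _ in range(num_cells)]  # [idSum, hashSum, count]
    for x in ids:
        for i in cell_indices(x, num_cells):
            cells[i][0] ^= x
            cells[i][1] ^= _hash(x, 255)
            cells[i][2] += 1
    return cells


def subtract(ibf_a, ibf_b):
    """Cell-wise a - b: XOR the two sums, subtract the counts."""
    return [[a[0] ^ b[0], a[1] ^ b[1], a[2] - b[2]]
            for a, b in zip(ibf_a, ibf_b)]


host1 = {1, 2, 3, 4}  # hypothetical IDs
host2 = {1, 2, 5}
diff = subtract(encode(host2, 12), encode(host1, 12))

# The common IDs 1 and 2 leave no trace: the result equals the
# subtraction of IBFs built from the unique IDs alone.
print(diff == subtract(encode({5}, 12), encode({3, 4}, 12)))  # prints True
```

A positive count marks IDs held only by the first operand and a negative count marks IDs held only by the second, which the decode step later uses to attribute each recovered ID to a host.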
Slide15
Invertible Bloom Filters (IBF)
Decode resulting IBF.
Recover object identifiers from IBF structure.
[Figure: decoding IBF (2 - 1) recovers B, C, D and E and attributes them to Host 1 or Host 2.]
Slide16
IBF Decode
Test for purity: a cell is pure when H(idSum) = hashSum.
A cell holding a single ID V passes: H(V) = H(V).
A cell holding several IDs fails: H(V ⊕ X ⊕ Z) ≠ H(V) ⊕ H(X) ⊕ H(Z).
Slide17
IBF Decode
Slide18
IBF Decode
Slide19
IBF Decode
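A peeling decoder built on the purity test H(idSum) = hashSum could be sketched as follows. This is an assumption-laden illustration, not the authors' code: the helper names, the `[idSum, hashSum, count]` cell layout, the salted SHA-256 hashing, the per-hash-function sub-tables, and the extra check that a pure cell has count +1 or -1 are all choices made for this example.

```python
import hashlib

K = 3  # number of hash functions (assumed)


def _hash(x: int, salt: int) -> int:
    data = bytes([salt]) + x.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def indices(x: int, m: int) -> list[int]:
    part = m // K
    return [j * part + _hash(x, j) % part for j in range(K)]


def encode(ids, m: int):
    cells = [[0, 0, 0] for _ in range(m)]  # [idSum, hashSum, count]
    for x in ids:
        for i in indices(x, m):
            cells[i][0] ^= x
            cells[i][1] ^= _hash(x, 255)
            cells[i][2] += 1
    return cells


def subtract(a, b):
    return [[p[0] ^ q[0], p[1] ^ q[1], p[2] - q[2]] for p, q in zip(a, b)]


def is_pure(cell) -> bool:
    """Exactly one ID left in the cell: count is +/-1 and H(idSum) == hashSum."""
    return cell[2] in (1, -1) and _hash(cell[0], 255) == cell[1]


def decode(diff):
    """Peel pure cells; return (success, ids_only_in_a, ids_only_in_b)."""
    cells = [c[:] for c in diff]
    m = len(cells)
    only_a, only_b = set(), set()
    queue = [i for i in range(m) if is_pure(cells[i])]
    while queue:
        i = queue.pop()
        if not is_pure(cells[i]):
            continue  # already emptied by an earlier peel
        x, sign = cells[i][0], cells[i][2]
        (only_a if sign == 1 else only_b).add(x)
        for j in indices(x, m):  # remove x from every cell it was added to
            cells[j][0] ^= x
            cells[j][1] ^= _hash(x, 255)
            cells[j][2] -= sign
            if is_pure(cells[j]):
                queue.append(j)
    success = all(c == [0, 0, 0] for c in cells)
    return success, only_a, only_b


common = set(range(1, 51))
ibf_a = encode(common | {1000}, 30)
ibf_b = encode(common, 30)
print(decode(subtract(ibf_a, ibf_b)))  # prints (True, {1000}, set())
```

Removing a recovered ID from its other cells is what can turn mixed cells into pure ones, so the loop keeps going until no pure cell is left. If cells remain non-empty at that point, the IBF was too small for the difference and decoding reports failure.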
Slide20
How many IBF cells?
[Figure: space overhead needed to decode at >99% versus set difference, for hash counts 3 and 4.]
Small differences: 1.4x – 2.3x
Large differences: 1.25x – 1.4x
Slide21
How many hash functions?
1 hash function produces many pure cells initially but nothing to undo when an element is removed.
[Figure: IDs A, B and C, each placed in a single cell.]
Slide22
How many hash functions?
1 hash function produces many pure cells initially but nothing to undo when an element is removed.
Many (say 10) hash functions: too many collisions.
[Figure: IDs A, B and C, each placed in many cells.]
Slide23
How many hash functions?
1 hash function produces many pure cells initially but nothing to undo when an element is removed.
Many (say 10) hash functions: too many collisions.
We find by experiment that 3 or 4 hash functions work well. Is there some theoretical reason?
[Figure: IDs A, B and C, each placed in three cells.]
Slide24
Theory
Let d = difference size, k = # hash functions.
Theorem 1: With (k + 1)d cells, failure probability falls exponentially.
For k = 3, this implies a 4x tax on storage, a bit weak.
[Goodrich, Mitzenmacher]: Failure is equivalent to finding a 2-core (loop) in a random hypergraph.
Theorem 2: With c_k d cells, failure probability falls exponentially.
c_4 = 1.3x tax, agrees with experiments.
Slide25
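To make the two storage taxes from the Theory slide concrete, the arithmetic can be worked through for an example difference size. The function names and the value d = 1000 are assumptions for illustration; the constants (k + 1) and c_4 = 1.3 are the ones stated on that slide.

```python
import math


def cells_theorem1(d: int, k: int) -> int:
    """Cell count under Theorem 1: (k + 1) * d."""
    return (k + 1) * d


def cells_theorem2(d: int, c_k: float) -> int:
    """Cell count under Theorem 2: c_k * d, rounded up to a whole cell."""
    return math.ceil(c_k * d)


d = 1000  # hypothetical difference size
print(cells_theorem1(d, k=3))      # prints 4000, the "4x tax"
print(cells_theorem2(d, c_k=1.3))  # prints 1300, the "1.3x tax" for k = 4
```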
How many IBF cells?
[Figure: space overhead needed to decode at >99% versus set difference, for hash counts 3 and 4.]
Large differences: 1.25x – 1.4x
Slide26
Connection to Coding
Mystery: IBF decode is similar to the peeling procedure used to decode Tornado codes. Why?
Explanation: Set Difference is equivalent to coding with insert-delete channels.
Intuition: Given a code for set A, send codewords only to B. Think of B’s set as a corrupted form of A’s.
Reduction: If the code can correct D insertions/deletions, then B can recover A and the set difference.
Reed Solomon <---> Polynomial Methods
LDPC (Tornado) <---> Difference Digest
Slide27
Difference Digests
Consists of two data structures:
Invertible Bloom Filter (IBF): efficiently computes the set difference; needs the size of the difference.
Strata Estimator: approximates the size of the set difference; uses IBFs as a building block.
Slide28
Strata Estimator
[Figure: keys A, B and C pass through consistent partitioning into IBF 1 to IBF 4, which hold ~1/2, ~1/4, ~1/8 and ~1/16 of the keys and together form the estimator.]
Divide keys into partitions, with partition k containing ~1/2^k of the keys.
Encode each partition into an IBF of fixed size.
log(n) IBFs of ~80 cells each.
Slide29
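One common way to obtain the consistent partitioning from the Strata Estimator slide, with partition k receiving about 1/2^k of the keys, is to count trailing zero bits of a hash of the key. The sketch below assumes that approach along with the function name `stratum`, the SHA-256 hash, and 16 strata; none of these details are given on the slide.

```python
import hashlib
from collections import Counter

NUM_STRATA = 16  # assumed; the slide suggests about log(n) strata


def stratum(key: int, num_strata: int = NUM_STRATA) -> int:
    """Return the 1-based partition of a key; partition k gets ~1/2^k of keys.

    The partition depends only on the key's hash, so both hosts place
    the same key in the same stratum.
    """
    digest = hashlib.sha256(key.to_bytes(8, "big")).digest()
    h = int.from_bytes(digest[:8], "big")
    trailing_zeros = (h & -h).bit_length() - 1 if h else 64
    return min(trailing_zeros + 1, num_strata)


counts = Counter(stratum(k) for k in range(100_000))
for level in range(1, 6):
    print(level, counts[level])  # each level holds roughly half of the previous one
```

Because each stratum holds a geometrically shrinking sample, some level will contain few enough differing keys for its small IBF to decode, whatever the true difference size is.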
Strata Estimator
[Figure: Estimator 1 and Estimator 2, each made of IBF 1 to IBF 4, are subtracted and decoded; labels: Host 1, Host 2, "4x".]
Attempt to subtract & decode IBFs at each level.
If level k decodes, then return: 2^k x (the number of IDs recovered).
Slide30
Strata Estimator
[Figure: Estimator 1 and Estimator 2, each made of IBF 1 to IBF 4, are subtracted and decoded; labels: Host 1, Host 2, "4x".]
Attempt to subtract & decode IBFs at each level.
If level k decodes, then return: 2^k x (the number of IDs recovered).
What about the other strata?
Slide31
Strata Estimator
[Figure: Estimator 1 and Estimator 2, each made of IBF 1 to IBF 4; several strata are decoded and their results combined; labels: Host 1, Host 2, "2x".]
Observation: Extra partitions hold useful data.
Sum elements from all decoded strata & return: 2^(k-1) x (the number of IDs recovered).
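Putting the pieces together, the estimation rule (sum the IDs recovered from every stratum that decodes, then scale by 2^(k-1) where k is the lowest decoded level) could be sketched as below. This is a self-contained illustration under several assumptions: 16 strata of 81 cells each, 3 hash functions with one sub-table per function, salted SHA-256 hashing, trailing-zero partitioning, and hypothetical function names throughout.

```python
import hashlib

K = 3        # hash functions per IBF (assumed)
CELLS = 81   # cells per stratum (about 80 on the slide; a multiple of K here)
LEVELS = 16  # number of strata (assumed)


def _hash(x: int, salt: int) -> int:
    data = bytes([salt]) + x.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def level_of(x: int) -> int:
    """1-based stratum; level k receives about 1/2^k of the keys."""
    h = _hash(x, 254)
    trailing_zeros = (h & -h).bit_length() - 1 if h else 64
    return min(trailing_zeros + 1, LEVELS)


def indices(x: int) -> list[int]:
    part = CELLS // K
    return [j * part + _hash(x, j) % part for j in range(K)]


def build_estimator(ids):
    """One fixed-size IBF per stratum; cells are [idSum, hashSum, count]."""
    strata = [[[0, 0, 0] for _ in range(CELLS)] for _ in range(LEVELS)]
    for x in ids:
        ibf = strata[level_of(x) - 1]
        for i in indices(x):
            ibf[i][0] ^= x
            ibf[i][1] ^= _hash(x, 255)
            ibf[i][2] += 1
    return strata


def decode_count(a, b):
    """Subtract two IBFs and peel; return IDs recovered, or None on failure."""
    cells = [[p[0] ^ q[0], p[1] ^ q[1], p[2] - q[2]] for p, q in zip(a, b)]

    def pure(c):
        return c[2] in (1, -1) and _hash(c[0], 255) == c[1]

    queue = [i for i, c in enumerate(cells) if pure(c)]
    recovered = 0
    while queue:
        i = queue.pop()
        if not pure(cells[i]):
            continue
        x, sign = cells[i][0], cells[i][2]
        recovered += 1
        for j in indices(x):
            cells[j][0] ^= x
            cells[j][1] ^= _hash(x, 255)
            cells[j][2] -= sign
            if pure(cells[j]):
                queue.append(j)
    return recovered if all(c == [0, 0, 0] for c in cells) else None


def estimate(est1, est2) -> int:
    """Walk from the sparsest stratum down, summing recovered IDs."""
    count = 0
    for k in range(LEVELS, 0, -1):
        n = decode_count(est1[k - 1], est2[k - 1])
        if n is None:
            # Lowest decoded level is k + 1, so scale by 2^((k + 1) - 1).
            return 2 ** k * count
        count += n
    return count  # every level decoded: scale factor 2^(1 - 1) = 1


host1 = set(range(10_000))       # hypothetical key sets
host2 = set(range(200, 10_200))  # true difference: 400 keys
print(estimate(build_estimator(host1), build_estimator(host2)))
```

The printed value is an estimate and will generally not equal 400 exactly. When every stratum decodes, as happens for very small differences, the function returns the exact count.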
Slide32
Estimation Accuracy
[Figure: Average Estimation Error (15.3 KBytes); x-axis: Set Difference; y-axis: Relative Error in Estimation (%).]
Strata good for small differences.
Min-Wise good for large differences.
Slide33
Hybrid Estimator
[Figure: the Strata estimator (IBF 1 to IBF 4, ...) and the Min-Wise estimator are combined into the Hybrid estimator (IBF 1, IBF 2, IBF 3 and Min-Wise).]
Combine Strata and Min-Wise Estimators.
Use IBF strata for small differences.
Use Min-Wise for large differences.
Slide34
Hybrid Estimator Accuracy
[Figure: Average Estimation Error (15.3 KBytes); x-axis: Set Difference; y-axis: Relative Error in Estimation (%).]
Hybrid matches Strata for small differences.
Converges with Min-Wise for large differences.
Slide35
Application: KeyDiff Service
[Figure: several applications, each backed by a Key Service.]
Operations: Add( key ), Remove( key ), Diff( host1, host2 )
Promising Applications:
File Synchronization
P2P file sharing
Failure Recovery
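The intended meaning of the three operations can be pinned down with a toy in-memory model. The slide lists only the operation names, so the class `KeyService`, the function `diff`, and their signatures are hypothetical; this model also computes the difference directly from Python sets, where a deployed service would use the difference-digest exchange described in this talk.

```python
class KeyService:
    """Toy stand-in for one host's key service (hypothetical)."""

    def __init__(self):
        self.keys = set()

    def add(self, key):
        self.keys.add(key)

    def remove(self, key):
        self.keys.discard(key)  # removing an absent key is a no-op here


def diff(host1: KeyService, host2: KeyService):
    """Return (keys only on host1, keys only on host2)."""
    return host1.keys - host2.keys, host2.keys - host1.keys


a, b = KeyService(), KeyService()
for key in "ACFE":
    a.add(key)
for key in "ABDF":
    b.add(key)
b.remove("D")
only_a, only_b = diff(a, b)
print(sorted(only_a), sorted(only_b))  # prints ['C', 'E'] ['B']
```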
Slide36
Difference Digests Summary
Strata & Hybrid Estimators:
Estimate the size of the Set Difference.
For 100K sets, 15KB estimator has <15% error.
O(log n) communication, O(log n) computation.
Invertible Bloom Filter:
Identifies all IDs in the Set Difference.
16 to 28 Bytes per ID in Set Difference.
O(d) communication, O(n+d) computation.
Implemented in KeyDiff Service.
Slide37
Conclusions: Got Diffs?
New randomized algorithm (difference digests) for set difference or insertion/deletion coding.
Could it be useful for your system? Need:
Large but roughly equal size sets
Small set differences (less than 10% of set size)
Slide38
Slide39
Extra Slides
Slide40
Comparison to Logs
IBFs work with no prior context.
Logs work with prior context, BUT:
Redundant information when syncing with multiple parties.
Logging must be built into the system for each write.
Logging adds overhead at runtime.
Logging requires non-volatile storage, which is often not present in network devices.
IBFs may out-perform logs when:
Synchronizing multiple parties
Synchronizations happen infrequently