Slide1
Cuckoo Filter: Practically Better Than Bloom
Bin Fan (CMU/Google), David Andersen (CMU), Michael Kaminsky (Intel Labs), Michael Mitzenmacher (Harvard)
Slide2
What is a Bloom Filter? A Compact Data Structure for Storing Set Membership
Bloom filters answer “is item x in set Y?” with either:
  “definitely no”, or
  “probably yes”, with probability ε of being wrong (ε is the false positive rate)
Benefit: not always precise, but highly compact
  Typically a few bits per item
  Achieving a lower ε (more accurate) requires spending more bits per item
Slide3
Example Use: Safe Browsing
Known malicious URLs are stored in a Bloom filter, which scales to millions of URLs.
Lookup(“www.binfan.com”) returns one of two answers:
  “No!”
  “Probably Yes!”, in which case a remote server is asked, “Please verify www.binfan.com”, and replies, “It is Good!”
Slide4
Bloom Filter Basics
A Bloom filter consists of m bits and k hash functions. Example: m = 10, k = 3
Initially all bits are 0:
  0 0 0 0 0 0 0 0 0 0
Insert(x) sets the bits at hash1(x), hash2(x), hash3(x):
  1 0 0 0 0 0 1 0 1 0
Lookup(y) checks the bits at hash1(y), hash2(y), hash3(y):
  1 0 0 0 0 0 1 0 1 0
At least one checked bit is 0, so Lookup(y) = not found
Slide5
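A minimal Python sketch of the insert and lookup steps from the Bloom Filter Basics slide. It is an illustration only, not the authors' code: the class name `BloomFilter` and the way the k hash functions are derived (SHA-256 salted with the hash index) are assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: m bits and k hash functions (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item: str) -> list:
        # Assumption: derive k hash functions by salting SHA-256 with the index i.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def lookup(self, item: str) -> bool:
        # False means "definitely no"; True means "probably yes".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=10, k=3)
bf.insert("x")
print(bf.lookup("x"))  # True: an inserted item is always reported as present
```

Because bits are only ever set, never cleared, this structure cannot support deletion, which is the gap the rest of the talk addresses.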
Succinct Data Structures for Approximate Set-membership Tests
                        High Performance   Low Space Cost   Delete Support
Bloom Filter            ✔                  ✔                ✗
Counting Bloom Filter   ✔                  ✗                ✔
Quotient Filter         ✗                  ✔                ✔
Can we achieve all three in practice?
Slide6
Outline
Background
Cuckoo filter algorithm
Performance evaluation
Summary
Slide7
Basic Idea: Store Fingerprints in Hash Table
Fingerprint(x): a hash value of x
  A lower false positive rate ε requires a longer fingerprint
[Figure: hash table with slots 0 to 7 holding FP(a), FP(b), FP(c)]
Slide8
Basic Idea: Store Fingerprints in Hash Table
Fingerprint(x): a hash value of x
  A lower false positive rate ε requires a longer fingerprint
Insert(x): add Fingerprint(x) to the hash table
[Figure: hash table with slots 0 to 7 holding FP(a), FP(b), FP(c), and the newly inserted FP(x)]
Slide9
Basic Idea: Store Fingerprints in Hash Table
Fingerprint(x): a hash value of x
  A lower false positive rate ε requires a longer fingerprint
Insert(x): add Fingerprint(x) to the hash table
Lookup(x): search for Fingerprint(x) in the hash table
[Figure: hash table with slots 0 to 7 holding FP(a), FP(b), FP(c), FP(x); Lookup(x) = found]
Slide10
Basic Idea: Store Fingerprints in Hash Table
Fingerprint(x): a hash value of x
  A lower false positive rate ε requires a longer fingerprint
Insert(x): add Fingerprint(x) to the hash table
Lookup(x): search for Fingerprint(x) in the hash table
Delete(x): remove Fingerprint(x) from the hash table
[Figure: hash table with slots 0 to 7 holding FP(a), FP(b), FP(c), FP(x); Delete(x) removes FP(x)]
How to construct the hash table?
Slide11
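A small Python sketch of the "store fingerprints" idea, before any decision about how the hash table itself is built. A `Counter` stands in for the table, and the helper `fingerprint` (a truncated SHA-256 digest) is an assumption for illustration. The search for a colliding item shows where false positives come from: two different items can share a short fingerprint.

```python
import hashlib
from collections import Counter


def fingerprint(item: str, bits: int = 8) -> int:
    # Assumption: the fingerprint is a SHA-256 digest truncated to `bits` bits.
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)


table = Counter()  # stand-in for the hash table: fingerprint -> count


def insert(item: str) -> None:
    table[fingerprint(item)] += 1


def lookup(item: str) -> bool:
    return table[fingerprint(item)] > 0


def delete(item: str) -> None:
    # Only safe for items that were actually inserted.
    fp = fingerprint(item)
    if table[fp] > 0:
        table[fp] -= 1


insert("a")
# Find a different item whose 8-bit fingerprint collides with that of "a".
collider = next(f"item{i}" for i in range(100000)
                if fingerprint(f"item{i}") == fingerprint("a"))
print(lookup("a"), lookup(collider))  # True True: the second is a false positive
delete("a")
print(lookup("a"))  # False after deletion
```

Lengthening the fingerprint makes such collisions rarer, which is the "lower ε, longer fingerprint" trade-off stated on the slide.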
Strawman: (Minimal) Perfect Hashing: No Collision but Update is Expensive
Perfect hashing: maps all items with no collisions
[Figure: f(x) maps the set {a, b, c, d, e, f} to six slots holding FP(a) through FP(f), one fingerprint per slot]
Slide12
Strawman: (Minimal) Perfect Hashing: No Collision but Update is Expensive
Perfect hashing: maps all items with no collisions
Changing the set requires recalculating f → high cost / poor update performance
[Figure: f(x) maps {a, b, c, d, e, f} to six slots holding FP(a) through FP(f); once the set changes to {a, b, c, d, e, g}, f(x) = ?]
Slide13
Strawman: Conventional Hash Table: High Space Cost
Chaining:
  Pointers → low space utilization
Linear probing:
  Making lookups O(1) requires a large percentage of the table to be empty → low space utilization
  Comparing multiple fingerprints sequentially → more false positives
[Figure: Lookup(x) on a chained table with buckets bkt0 to bkt3, and Lookup(x) on a linear-probing table, each comparing against several stored fingerprints]
Slide14
Cuckoo Hashing [Pagh2004]: Good, But ...
High space utilization
  4-way set-associative table: >95% of entries occupied
Fast lookup: O(1)
[Figure: lookup(x) checks two candidate buckets, hash1(x) and hash2(x), in a table with buckets 0 to 7]
Standard cuckoo hashing doesn’t work with fingerprints
[Pagh2004] Cuckoo hashing.
Slide15
Standard Cuckoo Requires Storing Each Item
[Figure: table with buckets 0 to 7 holding items a, b, c; Insert(x) has two candidate buckets, h1(x) and h2(x)]
Slide16
Standard Cuckoo Requires Storing Each Item
[Figure: table with buckets 0 to 7; Insert(x) places x in bucket h2(x), displacing a]
Rehash a: alternate(a) = 4
Kick a to bucket 4
Slide17
Standard Cuckoo Requires Storing Each Item
[Figure: table with buckets 0 to 7; x sits in bucket h2(x) and a now occupies bucket 4, displacing c]
Rehash a: alternate(a) = 4
Kick a to bucket 4
Rehash c: alternate(c) = 1
Kick c to bucket 1
Slide18
Standard Cuckoo Requires Storing Each Item
[Figure: table with buckets 0 to 7; x sits in bucket h2(x), a in bucket 4, c in bucket 1, and b is unmoved]
Rehash a: alternate(a) = 4
Kick a to bucket 4
Rehash c: alternate(c) = 1
Kick c to bucket 1
Insert complete (or fail if MaxSteps reached)
Slide19
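A minimal Python sketch of the standard cuckoo hashing insert walked through on the "Standard Cuckoo Requires Storing Each Item" slides, with one item per bucket as in the figures. The class name `CuckooHashTable`, the helper `_h`, and the SHA-256 based hash functions are assumptions for illustration. Note that the table stores full items, which is what lets a kicked-out victim be rehashed.

```python
import hashlib
import random


def _h(seed: str, item: str, n: int) -> int:
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


class CuckooHashTable:
    """Standard cuckoo hashing, one item per bucket (illustrative only)."""

    def __init__(self, num_buckets: int = 8, max_steps: int = 50) -> None:
        self.n = num_buckets
        self.max_steps = max_steps
        self.buckets = [None] * num_buckets

    def _candidates(self, item: str):
        return _h("h1", item, self.n), _h("h2", item, self.n)

    def lookup(self, item: str) -> bool:
        return any(self.buckets[i] == item for i in self._candidates(item))

    def insert(self, item: str) -> bool:
        b1, b2 = self._candidates(item)
        for b in (b1, b2):
            if self.buckets[b] is None:
                self.buckets[b] = item
                return True
        # Both candidate buckets are occupied: kick out a victim and rehash it.
        b = random.choice((b1, b2))
        for _ in range(self.max_steps):
            item, self.buckets[b] = self.buckets[b], item
            # The victim's full key is stored, so its other bucket can be recomputed.
            c1, c2 = self._candidates(item)
            b = c2 if b == c1 else c1
            if self.buckets[b] is None:
                self.buckets[b] = item
                return True
        # MaxSteps reached; in this sketch the item still in hand is dropped.
        return False


table = CuckooHashTable(num_buckets=8)
for key in ("a", "b", "c", "x"):
    table.insert(key)
print([k for k in ("a", "b", "c", "x") if table.lookup(k)])
```

The line that recomputes `c1, c2` is exactly the step that becomes impossible once only fingerprints are stored.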
Challenge: How to Perform Cuckoo?
Cuckoo hashing requires rehashing and displacing existing items
With only a fingerprint, how do we calculate an item’s alternate bucket?
[Figure: table with buckets 0 to 7 holding FP(a), FP(b), FP(c)]
Kick FP(a) to which bucket?
Kick FP(c) to which bucket?
Slide20
Solution: We Apply Partial-Key Cuckoo
Standard cuckoo hashing: two independent hash functions for two buckets
  bucket1 = hash1(x)
  bucket2 = hash2(x)
Partial-key cuckoo hashing: use one bucket and the fingerprint to derive the other [Fan2013]
  bucket1 = hash(x)
  bucket2 = bucket1 ⊕ hash(FP(x))
To displace an existing fingerprint:
  alternate(x) = current(x) ⊕ hash(FP(x))
[Fan2013] MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing
Slide21
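A short Python sketch of the partial-key rule alternate(x) = current(x) ⊕ hash(FP(x)). It shows why XOR is convenient: applying the rule twice returns to the starting bucket, so either bucket can be derived from the other using the stored fingerprint alone. The helper names and the SHA-256 based hashing are assumptions for illustration, as is the choice of a power-of-two bucket count, which keeps the XOR result within range.

```python
import hashlib

NUM_BUCKETS = 8  # assumption: a power of two, so XOR results stay within range


def _hash(data: str) -> int:
    return int.from_bytes(hashlib.sha256(data.encode()).digest()[:8], "big")


def fingerprint(item: str, bits: int = 8) -> int:
    return _hash("fp:" + item) % (1 << bits)


def alternate(bucket: int, fp: int) -> int:
    # Uses only the current bucket and the stored fingerprint, never the item itself.
    return bucket ^ (_hash(f"bucket:{fp}") % NUM_BUCKETS)


x = "www.binfan.com"
fp = fingerprint(x)
bucket1 = _hash(x) % NUM_BUCKETS
bucket2 = alternate(bucket1, fp)
print(alternate(bucket2, fp) == bucket1)  # True: XOR-ing twice returns to the start
```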
Solution: Partial-Key Cuckoo Hashing
Perform cuckoo hashing on fingerprints
[Figure: table with buckets 0 to 7 holding FP(a), FP(b), FP(c)]
Kick FP(a) to “6 ⊕ hash(FP(a))”
Kick FP(c) to “4 ⊕ hash(FP(c))”
Can we still achieve high space utilization with partial-key cuckoo hashing?
Slide22
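Putting the pieces together, a cuckoo filter built on partial-key cuckoo hashing might look like the following Python sketch. It is not the authors' C++ implementation: the class name `CuckooFilter`, the default parameters, and the SHA-256 based hashing are assumptions chosen for readability.

```python
import hashlib
import random


class CuckooFilter:
    """Cuckoo filter sketch: 4-way buckets of short fingerprints (illustrative only)."""

    def __init__(self, num_buckets: int = 64, bucket_size: int = 4,
                 fp_bits: int = 8, max_steps: int = 500) -> None:
        assert num_buckets & (num_buckets - 1) == 0, "power of two required for XOR"
        self.n = num_buckets
        self.b = bucket_size
        self.fp_bits = fp_bits
        self.max_steps = max_steps
        self.buckets = [[] for _ in range(num_buckets)]

    @staticmethod
    def _hash(data: str) -> int:
        return int.from_bytes(hashlib.sha256(data.encode()).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        return self._hash("fp:" + item) % (1 << self.fp_bits)

    def _alternate(self, bucket: int, fp: int) -> int:
        return bucket ^ (self._hash(f"bucket:{fp}") % self.n)

    def _buckets_for(self, item: str):
        fp = self._fingerprint(item)
        b1 = self._hash(item) % self.n
        return fp, b1, self._alternate(b1, fp)

    def insert(self, item: str) -> bool:
        fp, b1, b2 = self._buckets_for(item)
        for b in (b1, b2):
            if len(self.buckets[b]) < self.b:
                self.buckets[b].append(fp)
                return True
        b = random.choice((b1, b2))
        for _ in range(self.max_steps):
            # Swap with a random resident fingerprint, then move the victim.
            slot = random.randrange(self.b)
            fp, self.buckets[b][slot] = self.buckets[b][slot], fp
            b = self._alternate(b, fp)
            if len(self.buckets[b]) < self.b:
                self.buckets[b].append(fp)
                return True
        return False  # table considered full; the fingerprint in hand is dropped

    def lookup(self, item: str) -> bool:
        fp, b1, b2 = self._buckets_for(item)
        return fp in self.buckets[b1] or fp in self.buckets[b2]

    def delete(self, item: str) -> bool:
        # Only safe for items that were actually inserted.
        fp, b1, b2 = self._buckets_for(item)
        for b in (b1, b2):
            if fp in self.buckets[b]:
                self.buckets[b].remove(fp)
                return True
        return False


cf = CuckooFilter()
cf.insert("www.binfan.com")
print(cf.lookup("www.binfan.com"))  # True
cf.delete("www.binfan.com")
print(cf.lookup("www.binfan.com"))  # False
```

Lookup touches at most two buckets, and deletion removes one matching fingerprint copy, which is how the structure supports deletes where a plain Bloom filter cannot.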
Fingerprints Must Be “Long” for Space Efficiency
Fingerprint must be Ω(log n / b) bits in theory
  n: hash table size, b: bucket size
  See more analysis in the paper
When the fingerprint is longer than 5 bits, table space utilization is high
[Figure: table space utilization; table size n = 128 million entries]
Slide23
Semi-Sorting: Further Save 1 bit/item
Based on an observation: a monotonic sequence of integers is easier to compress [Bonomi2006]
Semi-sorting:
  Sort the fingerprints in each bucket
  Compress the sorted fingerprints
+ For a 4-way bucket, saves one bit per item
- Slower lookup / insert
[Figure: the fingerprints in a bucket (21, 97, 88, 04) are sorted to (04, 21, 88, 97), which is easier to compress]
[Bonomi2006] Beyond Bloom filters: From approximate membership checks to approximate state machines.
Slide24
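One way to see where the saved bit in semi-sorting can come from: a sorted bucket is a multiset, and there are fewer multisets than ordered tuples. The Python sketch below assumes 4-bit fingerprints (an assumption for this example; the slide only states the one-bit saving for a 4-way bucket) and encodes a sorted bucket as its index in the list of all possible sorted buckets.

```python
from itertools import combinations_with_replacement
from math import ceil, log2

FP_BITS = 4       # assumption for this sketch
BUCKET_SIZE = 4   # 4-way bucket, as on the slide

# Every possible sorted bucket (a multiset of 4 fingerprints), in a fixed order.
SORTED_BUCKETS = list(combinations_with_replacement(range(1 << FP_BITS), BUCKET_SIZE))
INDEX_OF = {bucket: i for i, bucket in enumerate(SORTED_BUCKETS)}


def encode(bucket):
    """Sort the fingerprints, then store only the index of the sorted tuple."""
    return INDEX_OF[tuple(sorted(bucket))]


def decode(index):
    return SORTED_BUCKETS[index]


unsorted_bits = FP_BITS * BUCKET_SIZE          # 4 raw fingerprints of 4 bits each
sorted_bits = ceil(log2(len(SORTED_BUCKETS)))  # bits needed to store the index
print(len(SORTED_BUCKETS), unsorted_bits, sorted_bits)  # 3876 16 12
print(decode(encode([9, 2, 15, 2])))                    # (2, 2, 9, 15)
```

Under these assumptions a bucket shrinks from 16 to 12 bits, which is one bit per item. The extra encode and decode work on every access is consistent with the slower lookup and insert noted on the slide.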
Space Efficiency
[Figure: bits per item needed to achieve ε, plotted against ε (the target false positive rate). Curve shown: lower bound. Annotations: “More Space”, “More False Positive”.]
Slide25
Space Efficiency
[Figure: bits per item needed to achieve ε, plotted against ε (the target false positive rate). Curves shown: Bloom filter, lower bound. Annotations: “More Space”, “More False Positive”.]
Slide26
Space Efficiency
[Figure: bits per item needed to achieve ε, plotted against ε (the target false positive rate). Curves shown: cuckoo filter, Bloom filter, lower bound. Annotations: “More Space”, “More False Positive”.]
Slide27
Space Efficiency
[Figure: bits per item needed to achieve ε, plotted against ε (the target false positive rate). Curves shown: cuckoo filter, Bloom filter, lower bound. Annotations: “More Space”, “More False Positive”.]
Cuckoo filter + semisorting is more compact than a Bloom filter at 3%
Slide28
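The curves on the Space Efficiency slides can be approximated with commonly cited closed forms. None of these formulas appear on the slides, so treat them as assumptions: the lower bound log2(1/ε), a space-optimized Bloom filter at about 1.44·log2(1/ε) bits per item, and a cuckoo filter with 4-way buckets at (log2(1/ε) + 3)/α bits per item, or (log2(1/ε) + 2)/α with semi-sorting. The load factor α = 0.955 is likewise an assumed value, chosen to match the ">95% of entries occupied" figure quoted for 4-way tables.

```python
from math import log2

LOAD_FACTOR = 0.955  # assumption: table occupancy for 4-way buckets


def bits_lower_bound(eps: float) -> float:
    return log2(1 / eps)


def bits_bloom(eps: float) -> float:
    # Assumed approximation for a space-optimized Bloom filter.
    return 1.44 * log2(1 / eps)


def bits_cuckoo(eps: float, semisort: bool = False) -> float:
    # Assumed model: fingerprint bits plus a per-bucket overhead, divided by occupancy.
    overhead = 2 if semisort else 3
    return (log2(1 / eps) + overhead) / LOAD_FACTOR


for eps in (0.1, 0.03, 0.01, 0.001):
    print(f"eps={eps:<6} bound={bits_lower_bound(eps):5.2f} "
          f"bloom={bits_bloom(eps):5.2f} "
          f"cuckoo={bits_cuckoo(eps):5.2f} "
          f"cuckoo+semisort={bits_cuckoo(eps, semisort=True):5.2f}")
```

Under these assumed formulas the cuckoo + semisorting curve drops below the Bloom filter curve at roughly ε ≈ 2.5%, in line with the "3%" crossover stated on the slide.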
Outline
Background
Cuckoo filter algorithm
Performance evaluation
Summary
Slide29
Evaluation
Compare cuckoo filter with:
  Bloom filter (cannot delete)
  Blocked Bloom filter [Putze2007] (cannot delete)
  d-left counting Bloom filter [Bonomi2006]
  Cuckoo filter + semisorting
  More in the paper
C++ implementation, single threaded
[Putze2007] Cache-, hash- and space-efficient Bloom filters.
[Bonomi2006] Beyond Bloom filters: From approximate membership checks to approximate state machines.
Slide30
Lookup Performance (MOPS)
[Figure: lookup throughput in MOPS for Cuckoo, Cuckoo + semisort, d-left counting Bloom, blocked Bloom (no deletion), Bloom (no deletion)]
Slide31
Lookup Performance (MOPS)
[Figure: lookup throughput in MOPS for Cuckoo, Cuckoo + semisort, d-left counting Bloom, blocked Bloom (no deletion), Bloom (no deletion)]
Slide32
Lookup Performance (MOPS)
[Figure: lookup throughput in MOPS for Cuckoo, Cuckoo + semisort, d-left counting Bloom, blocked Bloom (no deletion), Bloom (no deletion)]
Slide33
Lookup Performance (MOPS)
Cuckoo filter is among the fastest regardless of workload.
[Figure: lookup throughput in MOPS for Cuckoo, Cuckoo + semisort, d-left counting Bloom, blocked Bloom (no deletion), Bloom (no deletion)]
Slide34
Insert Performance (MOPS)
Cuckoo filter has a decreasing insert rate, but overall it is slower only than the blocked Bloom filter.
[Figure: insert throughput in MOPS for Cuckoo, Cuckoo + semisorting, Blocked Bloom, d-left Bloom, Standard Bloom]
Slide35
Summary
Cuckoo filter, a Bloom filter replacement:
  Deletion support
  High performance
  Less space than Bloom filters in practice
  Easy to implement
Source code available in C++: https://github.com/efficient/cuckoofilter