Anecdotes on hashing
Udi Wieder – VMware Research Group
Plan
I spent the last decade+ advising on numerous cases where hash tables/functions were used
A few observations on
What data structures I’ve seen implemented and where
What developers think, and where they need help
Where can we help
Application to secure MPC
Disclaimer
Subjective.
Partial. Arguably different companies see different problems
Trivial (?)
Discussion welcome!
Hash Tables - Dictionaries
Which data structure do programmers use?
50+ years of research: Many designs.
Chained Hash Table: An array of linked lists
Space: O(n). Insert: O(1). Lookup: O(1) \ O(log n)
Linear Probing: scan an array until an empty bucket is found
Space: O(n). Insert O(1) \ O(log n). Lookup: O(1) \ O(log n)
Cuckoo Hashing: each key lives in one of two candidate buckets
Space: O(n). Insert: O(1) \ O(log n). Lookup: O(1)
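To make the cuckoo design concrete, here is a minimal sketch (not from the talk; class and parameter names are illustrative): each key has two candidate buckets, lookups probe at most two cells, and inserts evict and relocate occupants.

```python
class CuckooHashTable:
    """Two tables, two hash functions; every key sits in exactly one of
    its two candidate buckets, so lookup probes at most two cells."""

    def __init__(self, capacity=128):
        self.size = capacity
        self.tables = [[None] * capacity, [None] * capacity]

    def _bucket(self, key, i):
        # Seeded tuple hash stands in for two independent hash functions.
        return hash((i + 1, key)) % self.size

    def lookup(self, key):
        return any(self.tables[i][self._bucket(key, i)] == key for i in (0, 1))

    def insert(self, key, max_kicks=100):
        """Insert a (non-None) key; returns False if a rehash would be needed."""
        if self.lookup(key):
            return True
        i = 0
        for _ in range(max_kicks):
            b = self._bucket(key, i)
            # Place the key, evicting the current occupant (if any).
            key, self.tables[i][b] = self.tables[i][b], key
            if key is None:
                return True
            i = 1 - i  # the evicted key tries its other table
        return False  # eviction chain too long: rehash needed
```

The eviction loop is exactly why vanilla cuckoo hashing needs an inflexible rehash when a chain fails, as discussed later in the deck.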
Hash Tables – Does it matter?
Majority of hash tables in code are small
In a web server most hash tables are of size ~ 6
Not the bottleneck of performance
When programming in a managed language (Java, C#, etc.) most of the time is taken by ‘type’-related problems:
Boxing/unboxing types (int vs. Integer)
Superfluous function calls
Indirection in the ‘value’ type – objects are large and pointer chasing is inevitable
Hash Tables -
First step: building type-specific classes
Example: popular open source library released by Goldman Sachs …
(makes the code pretty ugly)
and/or moving to lower-level languages
After that, choose the right one
Why is simple chaining so popular?
Because it is simple
Because it’s actually reasonable space-wise
Use 32 bit pointers
Since rehash is so expensive it is often the case that load factors are low anyhow
Because write operations are O(1), and if there is locality in the workload read is surprisingly fast
Assuming memory allocation is done in bulk
Because a linked list is a well-understood data structure in multithreaded environments
The main array is read-only. Locking only upon a rehash.
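The chained design these bullets defend fits in a few lines. A minimal sketch (illustrative Python, with plain lists standing in for the linked lists):

```python
class ChainedHashTable:
    """An array of buckets; each bucket is a chain of (key, value) pairs.
    Python lists stand in for the linked lists of a real implementation."""

    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))  # O(1) write: append to the chain

    def get(self, key, default=None):
        for k, v in self._bucket(key):  # scan the chain
            if k == key:
                return v
        return default
```

Note how the main array is only read during put/get; only a rehash (omitted here) would mutate it, which is the concurrency point made above.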
Linear Probing
Popular. Simple, at least without deletions
When keys and values are integers - linear probing is excellent
Especially if the workload is write intensive
Design of choice for most of the more specialized libraries
What about CH?
Vanilla CH is dead in the water
Performance; the need to rehash is non-flexible
Blocked Cuckoo Hashing (2 hash functions, bigger buckets) should be a candidate for read-heavy workloads
Education gap
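Linear probing, per the slide, is simple precisely when deletions are left out. A hedged sketch under those assumptions (no deletion, no resizing, load kept below capacity; names are illustrative):

```python
class LinearProbingTable:
    """Open addressing: on collision, scan forward to the next empty slot.
    Deletion and resizing are omitted, matching the 'simple without
    deletions' caveat; the caller must keep the load below capacity."""
    EMPTY = object()

    def __init__(self, capacity=64):
        self.keys = [self.EMPTY] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        # Returns the slot holding `key`, or the first empty slot on its run.
        i = hash(key) % len(self.keys)
        while self.keys[i] is not self.EMPTY and self.keys[i] != key:
            i = (i + 1) % len(self.keys)
        return i

    def put(self, key, value):
        i = self._probe(key)
        self.keys[i], self.values[i] = key, value

    def get(self, key, default=None):
        i = self._probe(key)
        return self.values[i] if self.keys[i] == key else default
```

With integer keys the two parallel arrays hold raw machine words, which is why this layout is so friendly to write-intensive integer workloads.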
Which hash function?
We have a fantastic theory with time/space/quality/randomness tradeoffs
When the workload is assumed to be ‘random’ anyway, use Multiply-shift
If you see problems ‘blame’ the data set
Otherwise use Murmur3
Why?
Fast, simple, works in practice
Programmers are unaware of tabulation hashing and unmotivated to try
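Multiply-shift is literally one multiply and one shift. A minimal sketch (the key width W and the seed are illustrative assumptions, not from the talk): pick a random odd W-bit multiplier a, then hash a W-bit key x to d bits via (a·x mod 2^W) >> (W − d).

```python
import random

W = 64  # assumed key width in bits

def make_multiply_shift(d, seed=42):
    """Return h(x) = (a*x mod 2^W) >> (W - d), mapping W-bit keys to d bits.
    The only randomness is the odd W-bit multiplier a."""
    a = random.Random(seed).getrandbits(W) | 1  # multiplier must be odd
    mask = (1 << W) - 1

    def h(x):
        return ((a * x) & mask) >> (W - d)

    return h
```

This is about as cheap as a hash function gets, which is why it wins whenever the input can be assumed ‘random’ anyway.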
Deterministic hash functions
A surprisingly large number of times the hash function is deterministic
Who chooses the keys? Coordinates them?
Serialization problems
Simply good enough surprisingly often
Same reasons for a push back for tabulation hashing
The ‘curse of the hash code’. Java and .NET have 32 bit hash codes natively.
There are so many collisions in the hash codes themselves that they mask the quality of the hash function
Confusion with the semantics of hash codes
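The ‘curse of the hash code’ is just the birthday bound; a quick back-of-the-envelope calculation (illustrative, not from the talk) shows how early 32-bit codes start colliding:

```python
def expected_hashcode_collisions(n, bits=32):
    """Expected number of colliding pairs when n distinct keys get
    uniformly random `bits`-bit hash codes: C(n, 2) / 2^bits."""
    return n * (n - 1) / 2 / 2 ** bits

# With only ~131k keys, a 32-bit hash code already expects ~2 colliding
# pairs, no matter how good the hash function behind it is.
print(expected_hashcode_collisions(2 ** 17))
```

At tens of millions of keys the collisions number in the thousands, which is what masks the quality of whatever function produced the codes.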
Conclusion
From a practical point of view, hash tables and dictionaries are largely a solved problem
Personally I found determinism and hash codes the most frustrating aspect
There is an educational gap: if a dictionary is in a hot loop or the workload is skewed, developers need help finding the correct (known) solution
Workloads in single threaded instances are typically write heavy
Logs
Streams
Events
Join computations
When read-heavy, it is typically due to multiple users
It would be nice to prove formal guarantees for Murmur3 and its kind (some proprietary versions exist too). Seems hard
Approximate Set Representation
EVERYONE knows and loves Bloom Filters
Many computations of the hash function (mitigated by KM08)
Not optimal in space
Not cache friendly (often in fast memory, where locality is less important)
NO ONE uses the (more natural, imho) version of a dictionary with short fingerprints
Vanilla LP seems to behave very similarly to Bloom filters space-wise
Random probing has better space efficiency, if a bit slow
Succinct solutions based on CH – I haven’t seen any
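A minimal Bloom filter sketch (illustrative names; SHA-256 stands in for a fast hash), using the Kirsch–Mitzenmacher trick referenced above as KM08: derive all k indices as h1 + i·h2 from two base hashes instead of computing k independent hash functions.

```python
import hashlib

class BloomFilter:
    """Bit-array Bloom filter for string items. The k probe indices are
    derived as h1 + i*h2 from a single digest (the KM08 trick), so one
    hash computation serves all k probes."""

    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _indices(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # keep h2 odd
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item):
        """True for every added item; occasionally true for others."""
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))
```

The k probes still touch k scattered cache lines, which is the cache-unfriendliness noted above; the fingerprint-dictionary alternative keeps each item's information in one place.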
ASP on disks and on SSDs
We are used to thinking of ASPs as filtering for slower memory/network, but they can also filter for a slower data structure
Sophisticated distributed search (Elastic Search)
B-Trees designed for slow storage are ill equipped for insert heavy workloads
Newer DS use principles of buffering and deferral upon insertion
Log-structured merge trees (LSM), B^ε trees, …
Currently on key-value stores
File systems next
ASP on disks and on SSDs
A hierarchical list of sorted arrays
Compaction offline
Reads are more expensive. ASP data structures allow you to skip the binary search and go down to the leaves
ASP data structure written in the disk
Locality more important than before
Currently mitigated by buffering
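The read path described above can be sketched as follows (a toy model, not any real store's API; an exact set stands in for the approximate per-run filter): a read consults each run's filter and performs the binary search only in runs that might contain the key.

```python
from bisect import bisect_left

class TinyLSM:
    """A hierarchical list of sorted runs. A per-run membership filter
    (here an exact set standing in for an approximate one) lets reads
    skip the binary search in runs that cannot contain the key."""

    def __init__(self):
        self.runs = []  # each entry: (sorted list of (key, value), filter)

    def write_run(self, pairs):
        # Writes are buffered elsewhere and flushed as whole sorted runs.
        run = sorted(pairs)
        self.runs.append((run, {k for k, _ in run}))

    def get(self, key):
        for run, filt in reversed(self.runs):  # newest run wins
            if key not in filt:
                continue  # filter miss: skip the binary search entirely
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Without the filters, a point read pays one binary search per run; with them, it pays only for runs that (probably) hold the key.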
Other uses of hashing
Streaming data structures
HyperLogLog, count–min sketch, min hashing, etc.
Reports and accumulated experience show Murmur3 is a good choice in practice
Why not tabulated hashing? (Hashing for statistics over partitions)
Maybe they should, I’m unaware of a serious experiment comparing the two
In parallel settings all processes need to use the same key; smaller keys are better
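As a concrete instance of these streaming structures, here is a minimal count–min sketch (illustrative sketch, not from the talk; the seeded tuple hash stands in for the row hash functions): each update touches one counter per row, and a query takes the minimum, so estimates can overcount but never undercount.

```python
class CountMinSketch:
    """d rows of w counters. Each update increments one counter per row;
    a query takes the minimum over the rows, which overestimates (due to
    collisions) but never underestimates the true count."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        # One seeded hash per row stands in for d independent functions.
        return [hash((r, item)) % self.width for r in range(self.depth)]

    def add(self, item, count=1):
        for row, c in zip(self.rows, self._cols(item)):
            row[c] += count

    def estimate(self, item):
        return min(row[c] for row, c in zip(self.rows, self._cols(item)))
```

In a parallel setting, every process must build its rows with the same hash functions (the shared key mentioned above) for the sketches to be mergeable by element-wise addition.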
Large system – Load Balancing
Value for expertise
Replication
Rate of churn
Load balancing reads/writes/space
Failure recovery
Multi-tenancy
Every system is different – no magic solution; the (underpaid) expert is needed
Secure Multi-Party Computation (MPC)
Alice holds X; Bob holds Y
Want to compute some function of X, Y
But leak no other information! (even to each other)
Semi-honest model – both parties follow the protocol
Private Set Intersection (PSI)
Private Set Intersection (PSI)
Client input: X = x_1 … x_n
Server input: Y = y_1 … y_n
Output: the client learns X ∩ Y only; the server learns nothing
Variants are important: client learns size of intersection; sum of payload; etc.
A circuit-based protocol
There are generic protocols for securely computing any function expressed as a binary circuit
GMW, Yao, …
Advantages: adaptability, existing code base
The overhead depends on the size of the circuit
The Algorithmic Challenge
Goal: find the smallest circuit for computing SI
Alice and Bob can prepare their inputs
Circuit does not depend on data!
Example: both parties sort their inputs and the circuit merges the sorted lists
Size of the circuit is O(n log n)
Can hashing help?
Hashing
Naïve solution: map n items to n/log n bins using a hash function
The expected size of a bin is O(log n); the maximum size of a bin is whp O(log n)
Solution: pad each bin with dummy items => all bins are of the size of the most populated bin
Number of comparisons: O(n/log n · log² n) = O(n log n)
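The balls-into-bins bound behind this count can be sanity-checked with a quick simulation (illustrative parameters and a uniform random hash as a stand-in):

```python
import math
import random

def bin_loads(n, seed=0):
    """Throw n items into ~n/log2(n) bins uniformly; return bin loads."""
    rng = random.Random(seed)
    num_bins = max(1, n // int(math.log2(n)))
    loads = [0] * num_bins
    for _ in range(n):
        loads[rng.randrange(num_bins)] += 1
    return loads

n = 1 << 14
loads = bin_loads(n)
avg = n / len(loads)  # expected bin size: ~log2(n)
# whp the maximum load stays within a constant factor of the average,
# so padding every bin to the maximum wastes only a constant factor.
print(max(loads), avg)
```

The padding step then costs a factor of max/avg in dummy items, which the simulation shows is a small constant, keeping the total at O(n log n) comparisons.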
Cuckoo Hashing – can it help?
What if each party stores its items using CH?
Can we get O(n) comparisons?
No. Alice may store x in T2 while Bob stores it in T1
[figure: Alice’s tables T1, T2 and Bob’s tables T1, T2, with x placed in different tables]
Our Solution – 2D Cuckoo
Alice and Bob each hold 4 tables, and the same 4 hash functions
Alice: item in (T1 and T2) or (T3 and T4)
Bob: item in (T1 and T3) or (T2 and T4)
[figure: items placed across the four tables T1–T4]
Like a quorum system where Alice picks a row and Bob a column
Question: How big should the tables be so that n items can be placed w.h.p.?
2D cuckoo hashing
Invariant: item in (T1 and T2) or (T3 and T4)
Theorem: n items can be placed maintaining the invariant w.h.p. if each table has > 2n buckets of size 1
Total of 8n buckets and 8n comparisons
Set Intersection on GPU
Goal: Highly parallel algorithm with simple logic. Ended up with essentially the same model.
AmossenPagh11: 3 tables
Each item is stored in any two out of the three
Deal with double-counting by adding an extra bit to each item (clever!)
Total number of buckets is 6n
Ours: 8n; [AP11]: 6n; experiments: 7.2n
Idea: increase the capacity of buckets.
- Reduces #buckets
- Reduces #stash
- May increase comparisons
- No proof
Other ideas?
Lower bound?