
Anecdotes on hashing

Udi Wieder – VMware Research Group

Plan

I spent the last decade+ advising on numerous cases where hash tables/functions were used

A few observations on

What data structures I’ve seen implemented and where

What do developers think, where they need help

Where can we help

Application to secure MPC

Disclaimer

Subjective.

Partial. Arguably different companies see different problems

Trivial (?)

Discussion welcome!

Hash Tables - Dictionaries

Which data structure do programmers use?

50+ years of research: Many designs.

Chained Hash Table: an array of linked lists
Space: O(n). Insert: O(1). Lookup: O(1) expected \ O(log n) w.h.p.
Linear Probing: scan an array until an empty bucket is found
Space: O(n). Insert: O(1) expected \ O(log n) w.h.p. Lookup: O(1) expected \ O(log n) w.h.p.
Cuckoo Hashing: each item sits in one of a few possible buckets
Space: O(n). Insert: O(1) expected \ O(log n) w.h.p. Lookup: O(1) worst case

Hash Tables – Does it matter?

Majority of hash tables in code are small

In a web server most hash tables are of size ~ 6

Not the bottleneck of performance

When programming in a managed language (Java, C#, etc.) most of the time is taken by 'type'-related problems:
Boxing/unboxing of types (int vs. Integer)
Superfluous function calls
Indirection in the 'value' type – objects are large and pointer chasing inevitable

Hash Tables -

First step: building type-specific classes
Example: popular open source released by Goldman Sachs … (makes the code pretty ugly)
and/or moving to lower-level languages
After that, choose the right one
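The Goldman Sachs library alluded to above is presumably GS Collections (now Eclipse Collections); the sketch below is only illustrative and assumes its IntIntHashMap primitive map, contrasted with a boxed HashMap.

import java.util.HashMap;
import java.util.Map;
import org.eclipse.collections.impl.map.mutable.primitive.IntIntHashMap;

public class PrimitiveMapDemo {
    public static void main(String[] args) {
        // Generic map: every key and value is boxed into an Integer object,
        // adding allocations, indirection and pointer chasing per entry.
        Map<Integer, Integer> boxed = new HashMap<>();
        boxed.put(42, 7);                  // autoboxes both 42 and 7

        // Type-specific primitive map: keys and values live in flat int arrays.
        IntIntHashMap primitive = new IntIntHashMap();
        primitive.put(42, 7);              // no boxing
        System.out.println(primitive.get(42));
    }
}

The price is the ugliness the slide mentions: one specialized class per key/value type combination.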

Why is simple chaining so popular?

Because it is simple

Because it's actually reasonable space-wise
Use 32-bit pointers

Since rehash is so expensive it is often the case that load factors are low anyhow

Because write operations are O(1), and if there is locality in the workload read is surprisingly fast

Assuming memory allocation is done in bulk
Because a linked list is a well-understood data structure in multithreaded environments
The main array is read-only; locking only upon a rehash.
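A minimal sketch of the chained design described above: a main array allocated once, head insertion so writes are O(1), and an identity hash standing in for a real mixing function (all names here are illustrative).

/** Minimal chained hash table sketch: an array of singly linked lists, int keys and values. */
public class ChainedMap {
    private static final class Node {
        final int key;
        final int value;
        final Node next;
        Node(int key, int value, Node next) { this.key = key; this.value = value; this.next = next; }
    }

    private final Node[] buckets;   // the "main array": allocated once, only bucket heads are swapped

    public ChainedMap(int capacity) {
        buckets = new Node[capacity];
    }

    private int index(int key) {
        return Math.floorMod(key, buckets.length);  // identity hash for the sketch; a real table mixes the bits
    }

    /** O(1) write: push a new node onto the head of the chain (an older node with the same key is shadowed). */
    public void put(int key, int value) {
        int i = index(key);
        buckets[i] = new Node(key, value, buckets[i]);
    }

    /** Walk the chain; fast when chains are short or the workload has locality. */
    public Integer get(int key) {
        for (Node n = buckets[index(key)]; n != null; n = n.next) {
            if (n.key == key) return n.value;
        }
        return null;
    }
}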

Linear Probing

Popular. Simple, at least without deletions

When keys and values are integers - linear probing is excellent

Especially if the workload is write intensive

Design of choice for most of the more specialized libraries

What about CH (cuckoo hashing)?

Vanilla CH is dead in the water

Performance; the need to rehash makes it inflexible
Blocked Cuckoo Hashing (2 hash functions, bigger buckets) should be a candidate for read-heavy workloads
Education gap
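A minimal sketch of an int-to-int linear probing table (fixed capacity, no deletions, caller keeps the load factor below 1); keys and values sit in flat arrays, which is what makes probing cache-friendly for integer keys.

import java.util.Arrays;

/** Minimal int -> int linear probing sketch: fixed capacity, no deletions. */
public class LinearProbingMap {
    private static final int EMPTY = Integer.MIN_VALUE;  // sentinel key, assumed never inserted
    private final int[] keys;
    private final int[] values;

    public LinearProbingMap(int capacity) {
        keys = new int[capacity];
        values = new int[capacity];
        Arrays.fill(keys, EMPTY);
    }

    private int index(int key) {
        // Multiplicative mixing, then reduce into the table; illustrative, not an analyzed family.
        return Math.floorMod(key * 0x9E3779B9, keys.length);
    }

    /** Scan forward from the hashed slot until the key or an empty slot is found. */
    public void put(int key, int value) {
        int i = index(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        values[i] = value;
    }

    public int get(int key, int defaultValue) {
        int i = index(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) % keys.length;
        }
        return defaultValue;
    }
}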

Which hash function?

We have a fantastic theory with time/space/quality/randomness tradeoffs

When the workload is assumed to be ‘random’ anyway, use Multiply-shift

If you see problems ‘blame’ the data set

Otherwise use Murmur3

Why?

Fast, simple, works in practice

Programmers are unaware of tabulation hashing and unmotivated to try
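A minimal multiply-shift sketch for 64-bit keys, the "fast and simple" option above: pick a random odd multiplier a and keep the top d bits of a*x.

import java.util.concurrent.ThreadLocalRandom;

/** Multiply-shift hashing sketch: h_a(x) = (a * x) >>> (64 - d), mapping 64-bit keys to 2^d buckets. */
public class MultiplyShift {
    private final long a;      // random odd 64-bit multiplier
    private final int shift;   // keep only the top d bits of the product

    public MultiplyShift(int logBuckets) {
        this.a = ThreadLocalRandom.current().nextLong() | 1L;  // force the multiplier to be odd
        this.shift = 64 - logBuckets;
    }

    public int bucket(long key) {
        return (int) ((a * key) >>> shift);
    }

    public static void main(String[] args) {
        MultiplyShift h = new MultiplyShift(10);   // 1024 buckets
        System.out.println(h.bucket(123456789L));
    }
}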

Deterministic hash functions

A surprisingly large number of times the hash function is deterministic

Who chooses the keys? Coordinates them?

Serialization problems

Simply good enough surprisingly often

The same reasons cause push-back against tabulation hashing

The ‘curse of the hash code’. Java and .NET have 32 bit hash codes natively.

There are so many collisions in the hash codes themselves that they mask the quality of the hash function
Confusion with the semantics of hash codes
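A tiny illustration of the "curse of the hash code": Java's String.hashCode() is a deterministic 32-bit value, and distinct keys can collide before any table-level hash function is even applied.

public class HashCodeCollision {
    public static void main(String[] args) {
        // A classic String.hashCode() collision: both evaluate to 2112.
        System.out.println("Aa".hashCode());
        System.out.println("BB".hashCode());
        // Whatever hash function the table applies on top of hashCode(),
        // these two keys always land in the same bucket.
    }
}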

Conclusion

From a practical point of view, hash tables and dictionaries are largely a solved problem

Personally, I found determinism and hash codes the most frustrating aspects
There is an educational gap: if a dictionary is in a hot loop or the workload is skewed, developers need help finding the correct (known) solution
Workloads in single-threaded instances are typically write-heavy:
Logs
Streams
Events
Join computations
When read-heavy, it is typically due to multiple users
It would be nice to prove formal guarantees for Murmur3 and its kind (some proprietary versions exist too). Seems hard

Approximate Set Representation

EVERYONE knows and loves Bloom Filters

Many computations of the hash function (KM08)
Not optimal in space
Not cache friendly (though often in fast memory, where locality is less important)
NO ONE uses the (more natural, imho) version of a dictionary with short fingerprints
Vanilla linear probing seems to behave very similarly to Bloom filters space-wise
Random probing has better space efficiency, if a bit slower
Succinct solutions based on cuckoo hashing: I haven't seen any
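A minimal Bloom filter sketch; it derives the k probes from two base hashes (the trick behind the KM08 reference above), and the mixing constants are arbitrary placeholders rather than an analyzed hash family.

import java.util.BitSet;

/** Minimal Bloom filter sketch: k probes into an m-bit array; false positives possible, no false negatives. */
public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // probes per item

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // i-th probe derived from two base hashes: h_i(x) = h1(x) + i * h2(x)  [KM08].
    private int probe(long key, int i) {
        long h1 = key * 0x9E3779B97F4A7C15L;
        long h2 = ((key ^ (key >>> 33)) * 0xC2B2AE3D27D4EB4FL) | 1L;
        return (int) Math.floorMod(h1 + i * h2, (long) m);
    }

    public void add(long key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    /** True means "possibly in the set"; false means "definitely not". */
    public boolean mightContain(long key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(probe(key, i))) return false;
        }
        return true;
    }
}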

ASPs on disks and on SSDs

We are used to thinking of an ASP as filtering for slower memory/network, but it can also filter for a slower data structure
Sophisticated distributed search (Elasticsearch)
B-Trees designed for slow storage are ill-equipped for insert-heavy workloads
Newer data structures use principles of buffering and deferral upon insertion
Log-structured merge trees (LSM), Bε-trees, …
Currently in key-value stores
File systems next

ASPs on disks and on SSDs

A hierarchical list of sorted arrays
Compaction offline
Reads are more expensive. ASP data structures allow you to skip the binary search and go down to the leaves
ASP data structure written to disk
Locality more important than before
Currently mitigated by buffering
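A sketch of the read path these two slides describe, assuming an LSM-like layout and reusing the BloomFilter sketch from the earlier slide: each sorted run is guarded by a filter, so a lookup skips most runs without searching them.

import java.util.Arrays;
import java.util.List;

/** LSM-style read path sketch: check the filter first, binary search only when it says "maybe". */
public class LsmReadPath {
    /** A sorted run (think: on-disk SSTable) plus its in-memory membership filter. */
    static final class SortedRun {
        final long[] sortedKeys;          // stands in for the on-disk sorted array
        final BloomFilter filter;         // the BloomFilter sketch from the earlier slide

        SortedRun(long[] sortedKeys) {
            this.sortedKeys = sortedKeys;
            this.filter = new BloomFilter(10 * sortedKeys.length, 7);
            for (long k : sortedKeys) filter.add(k);
        }

        boolean contains(long key) {
            if (!filter.mightContain(key)) return false;   // skip the expensive search entirely
            return Arrays.binarySearch(sortedKeys, key) >= 0;
        }
    }

    static boolean lookup(List<SortedRun> runsNewestToOldest, long key) {
        for (SortedRun run : runsNewestToOldest) {
            if (run.contains(key)) return true;
        }
        return false;
    }
}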

Other uses of hashing

Streaming data structures

HyperLogLog, count-min sketch, min-hashing, etc.
Reports and accumulated experience show Murmur3 is a good choice in practice
Why not tabulation hashing? (Hashing for statistics over partitions)
Maybe they should; I'm unaware of a serious experiment comparing the two
In parallel settings all processes need to use the same key; smaller keys are better
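A minimal count-min sketch, one of the streaming structures listed above: d rows of w counters, one increment per row, and the estimate is the minimum over rows. The per-row hash is a simple multiplicative mix for illustration, not the pairwise-independent family assumed by the analysis.

import java.util.Random;

/** Minimal count-min sketch: estimates never under-count (for non-negative updates). */
public class CountMinSketch {
    private final long[][] counts;
    private final long[] seeds;
    private final int w;

    public CountMinSketch(int width, int depth, Random rng) {
        this.w = width;
        this.counts = new long[depth][width];
        this.seeds = new long[depth];
        for (int i = 0; i < depth; i++) seeds[i] = rng.nextLong() | 1L;  // random odd multiplier per row
    }

    private int bucket(int row, long key) {
        long h = key * seeds[row];
        return (int) Math.floorMod(h ^ (h >>> 32), (long) w);
    }

    public void add(long key, long amount) {
        for (int r = 0; r < counts.length; r++) counts[r][bucket(r, key)] += amount;
    }

    /** An over-estimate of the true count of key. */
    public long estimate(long key) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < counts.length; r++) min = Math.min(min, counts[r][bucket(r, key)]);
        return min;
    }
}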

Large system - Load Balancing

value for expertise

Replication

Rate of churn

Load balancing reads/writes/space

Failure recovery

Multi-tenancy
Every system is different – no magic solution; the (underpaid) expert is needed

Secure Multi-Party Computation (MPC)

Two parties, Alice and Bob, hold inputs X and Y
Want to compute some function of X, Y
But leak no other information! (even to each other)
Semi-honest model – both parties follow the protocol


Private Set Intersection (PSI)

Client: input X = x_1, …, x_n; output X ∩ Y only
Server: input Y = y_1, …, y_n; output nothing
Variants are important: client learns the size of the intersection; sum of payload; etc.

A circuit based protocol

There are generic protocols for securely computing any function expressed as a binary circuit

GMW, Yao, …
Advantages: adaptability, existing code base
The overhead depends on the size of the circuit

The Algorithmic Challenge

Goal: find the smallest circuit for computing set intersection
Alice and Bob can prepare their inputs
The circuit does not depend on the data!
Example: both parties sort their inputs and the circuit computes a merge sort
Size of the circuit is O(n log n)
Can hashing help?

Hashing Solution

Pad each bin with dummy items => all bins are of the size of the most populated bin
Naïve solution: map n items to n/log n bins using a hash function
The expected size of a bin is O(log n)
The maximum size of a bin is w.h.p. O(log n)
Number of comparisons: O(n/log n · log² n) = O(n log n)
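A small simulation of the naïve binning above (parameters chosen arbitrarily): throw n balls into n/log n bins and look at the maximum load, which is what every bin must be padded to.

import java.util.Random;

/** Simulates hashing n items into n / log2(n) bins and reports the maximum bin load. */
public class MaxLoadSimulation {
    public static void main(String[] args) {
        int n = 1 << 20;                                   // ~1M items
        int bins = (int) (n / (Math.log(n) / Math.log(2)));
        int[] load = new int[bins];
        Random rng = new Random(42);
        for (int i = 0; i < n; i++) load[rng.nextInt(bins)]++;
        int max = 0;
        for (int l : load) max = Math.max(max, l);
        System.out.printf("n=%d, bins=%d, average load=%.1f, max load=%d%n",
                n, bins, (double) n / bins, max);
        // Padding every bin to the maximum load gives about bins * max^2 comparisons,
        // i.e., O(n/log n * log^2 n) = O(n log n).
    }
}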

Cuckoo Hashing – can it help?

What if each party stores its items using CH?
Can we get O(n) comparisons?
No. Alice may store x in T2 while Bob in T1

Our Solution – 2D Cuckoo

Alice and Bob each hold 4 tables, and the same 4 hash functions
Alice: item in (T1 and T2) or (T3 and T4)
Bob: item in (T1 and T3) or (T2 and T4)
Like a quorum system where Alice picks a row and Bob a column
Question: How big should the tables be so that n items can be placed w.h.p.?
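A small check of the quorum property behind 2D cuckoo: whichever "row" pair of tables Alice uses and whichever "column" pair Bob uses, the two pairs intersect in exactly one table, so a shared item is always compared somewhere.

import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Verifies that every Alice row-pair and Bob column-pair of tables intersect in exactly one table. */
public class QuorumCheck {
    enum Table { T1, T2, T3, T4 }

    public static void main(String[] args) {
        List<Set<Table>> aliceRows = List.of(EnumSet.of(Table.T1, Table.T2), EnumSet.of(Table.T3, Table.T4));
        List<Set<Table>> bobCols   = List.of(EnumSet.of(Table.T1, Table.T3), EnumSet.of(Table.T2, Table.T4));

        for (Set<Table> row : aliceRows) {
            for (Set<Table> col : bobCols) {
                Set<Table> common = EnumSet.copyOf(row);
                common.retainAll(col);
                System.out.println(row + " ∩ " + col + " = " + common);  // always exactly one table
            }
        }
    }
}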

2D cuckoo hashing

Invariant: item in (T1 and T2) or (T3 and T4)
Theorem: n items can be placed maintaining the invariant w.h.p. if each table has > 2n buckets of size 1
Total of 8n buckets and 8n comparisons

Set Intersection on GPU

Goal: Highly parallel algorithm with simple logic. Ended up with essentially the same model.

Amossen–Pagh [AP11]:
3 tables
Each item is stored in any two out of the three
Deal with double-counting by adding an extra bit to each item (clever!)
Total number of buckets is 6n

Ours: 8n

[AP11]: 6n

Experiments: 7.2n

Idea: increase the capacity of buckets.

- Reduces #buckets

- Reduces #stash

- May increase comparisons

- No proof

Other ideas?

Lower bound?