
Cache Craftiness for Fast Multicore Key-Value Storage Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)

Let’s build a fast key-value store. KV store systems are important: Google Bigtable, Amazon Dynamo, Yahoo! PNUTS. Single-server KV performance matters: it reduces cost and eases management. Goal: a fast KV store for a single multi-core server, assuming all data fits in memory (as in Redis and VoltDB).

Feature wish list: clients send queries over the network; data persists across crashes; range queries; good performance on various workloads, including hard ones!

Hard workloads: skewed key popularity (load imbalance), small key-value pairs, many puts, and arbitrary keys, whether strings (e.g. www.wikipedia.org/...) or integers. All hard!

First try: a fast binary tree. 140M short KV pairs, put-only, 16 cores; throughput in millions of req/sec. Network and disk are not bottlenecks (high-bandwidth NIC, multiple disks). Result: 3.7 million queries/second! Can we do better? What bottleneck remains? DRAM!

Cache craftiness goes 1.5X farther. 140M short KV pairs, put-only, 16 cores; throughput in millions of req/sec. Cache craftiness: careful use of cache and memory.

Contributions. Masstree achieves millions of queries per second across various hard workloads: skewed key popularity, various read/write ratios, variable and relatively long keys, and data much larger than the on-chip cache. New ideas: a trie of B+trees, the permuter, and more. Full system: the new ideas combined with best practices (network, disk, etc.).

Experiment environment: a 16-core server with three active DRAM nodes, a single 10Gb Network Interface Card (NIC), four SSDs, and 64 GB of DRAM, plus a cluster of load generators.

Potential bottlenecks in Masstree: within the single multi-core server, the network, the disk (for logging), and DRAM.

NIC bottleneck can be avoided. A single 10Gb NIC has multiple queues and scales to many cores. Target: 100B KV pairs => 10M req/sec. Use the network stack efficiently: pipeline requests and avoid copying costs.

Disk bottleneck can be avoided. 10M puts/sec => 1 GB of logs/sec! That exceeds a single disk: a mainstream disk writes 100-300 MB/sec at about $1/GB, while a high-performance SSD writes up to 4.4 GB/sec but costs over $40/GB. With multiple disks, split the log (see the paper for details).
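The split-log idea can be sketched in a few lines of C (an illustration only, not the paper's logging design: the per-worker log files, the mount paths, and the round-robin worker-to-disk assignment are all assumptions):

#include <stdio.h>

#define NDISKS   4
#define NWORKERS 16

static FILE *logs[NWORKERS];

static int open_logs(void)
{
    char path[64];
    for (int w = 0; w < NWORKERS; w++) {
        /* e.g. each SSD mounted at /ssd0 .. /ssd3 (hypothetical paths) */
        snprintf(path, sizeof path, "/ssd%d/log.%d", w % NDISKS, w);
        logs[w] = fopen(path, "ab");
        if (!logs[w])
            return -1;
    }
    return 0;
}

static void log_put(int worker, const void *rec, size_t len)
{
    fwrite(rec, 1, len, logs[worker]);   /* fsync policy omitted in sketch */
}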

DRAM bottleneck – hard to avoid. 140M short KV pairs, put-only, 16 cores; throughput in millions of req/sec. Cache craftiness goes 1.5X farther, even including the cost of network and disk.

DRAM bottleneck – without network and disk. 140M short KV pairs, put-only, 16 cores; throughput in millions of req/sec. Cache craftiness goes 1.7X farther!

DRAM latency – binary tree. 140M short KV pairs, put-only, 16 cores; throughput in millions of req/sec. Each lookup walks ~log2 N levels of pointer chasing, i.e. ~log2 N serial DRAM latencies. With 10M keys (VoltDB): 2.7 us/lookup, or about 380K lookups/core/sec.

DRAM latency – lock-free 4-way tree. Concurrency: same as the binary tree. One cache line per node => 3 keys and 4 children per node, giving half as many levels as the binary tree and therefore half as many serial DRAM latencies. A node layout sketch follows.
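A minimal sketch of such a node, assuming 64-byte cache lines, 8-byte keys and pointers, and GCC/Clang alignment attributes (field names are illustrative, not the paper's):

#include <stdint.h>

struct node4 {
    uint64_t      keys[3];     /* 3 eight-byte keys            : 24 bytes */
    struct node4 *child[4];    /* 4 children (8-byte pointers) : 32 bytes */
    uint8_t       nkeys;       /* number of keys in use        :  1 byte  */
    uint8_t       pad[7];      /* pad to exactly 64 bytes                 */
} __attribute__((aligned(64)));

/* One node = one cache line, so each level costs at most one DRAM fetch,
 * and with fanout 4 the tree is about half as deep as a binary tree. */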

The 4-tree beats the binary tree by 40%. 140M short KV pairs, put-only, 16 cores; throughput in millions of req/sec.

The 4-tree may perform terribly! It is unbalanced: with sequential inserts, for example, it degenerates to O(N) levels and hence O(N) serial DRAM latencies. We want a balanced tree with wide fanout.

B+tree – wide and balanced. A concurrent main-memory B+tree [OLFIT] with optimistic concurrency control based on a version technique: lookups and scans are lock-free, and puts hold at most 3 per-node locks.
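The version technique for lock-free lookups can be sketched as follows (a toy node and field names of my own, not the OLFIT or Masstree code; the unsynchronized slot reads are acceptable only as a sketch):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy node: a version word plus a few key/value slots. The low bit of the
 * version is set while a writer is modifying the node; every modification
 * bumps the version. */
struct bnode {
    _Atomic uint64_t version;
    uint64_t keys[15];
    uint64_t vals[15];
    int      nkeys;
};

bool optimistic_lookup(struct bnode *n, uint64_t key, uint64_t *val)
{
    for (;;) {
        uint64_t v1 = atomic_load_explicit(&n->version, memory_order_acquire);
        if (v1 & 1)
            continue;                              /* writer active: wait   */
        bool found = false;
        for (int i = 0; i < n->nkeys; i++)
            if (n->keys[i] == key) {
                *val = n->vals[i];
                found = true;
                break;
            }
        uint64_t v2 = atomic_load_explicit(&n->version, memory_order_acquire);
        if (v1 == v2)
            return found;                          /* no concurrent change  */
        /* version moved: a writer touched the node; retry the read */
    }
}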

A wide-fanout B+tree is 11% slower! 140M short KV pairs, put-only; throughput in millions of req/sec. With fanout 15 it has fewer levels than the 4-tree, but it fetches at least as many cache lines from DRAM: the 4-tree's internal nodes are full while B+tree nodes are only ~75% full, so it suffers at least as many serial DRAM latencies.

B+tree – software prefetch, same as [pB+-trees]. Masstree uses a B+tree with fanout 15, so each node spans 4 cache lines. Always prefetch the whole node when it is accessed. Result: one DRAM latency per node instead of 2, 3, or 4.
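Whole-node prefetch might look roughly like this, assuming a 15-key node that spans four 64-byte cache lines and a GCC/Clang-style __builtin_prefetch (the node layout is illustrative):

#include <stdint.h>

#define CACHE_LINE 64

struct btnode {                    /* ~4 cache lines: 15 keys + 16 links */
    uint64_t version;
    uint64_t keys[15];
    void    *links[16];
};

static inline void prefetch_node(const struct btnode *n)
{
    const char *p = (const char *)n;
    /* Issue all four line fetches up front so they overlap in the memory
     * system: the node then costs ~one DRAM latency instead of 2, 3, or 4. */
    for (int i = 0; i < 4; i++)
        __builtin_prefetch(p + i * CACHE_LINE);
}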

B+tree with prefetch. 140M short KV pairs, put-only, 16 cores; throughput in millions of req/sec. Beats the 4-tree by 9%: balanced beats unbalanced!

Concurrent B+tree problem: a lookup must retry if a concurrent insert is in progress, because inserting, say, B into a node holding A, C, D shifts keys and exposes an intermediate state. The lock-free 4-tree does not have this problem, since its keys never move around, but it is unbalanced.

B+tree optimization – permuter. Keys are stored unsorted within a node; a 64-bit integer, the permuter, defines their sorted order. A concurrent lookup does not need to retry: it searches keys through the permuter, and an insert appears atomic to lookups. For example, insert(B) writes B into a free slot and then publishes a new permuter that places it between A and C.
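One plausible encoding of the permuter, following the slide's description (a key count plus per-position slot indices packed into one 64-bit word; the exact bit layout is an assumption):

#include <stdint.h>

typedef uint64_t permuter_t;       /* low nibble: key count; nibble i+1:
                                      slot index of the i-th smallest key */

static inline int perm_size(permuter_t p)            { return p & 15; }
static inline int perm_ith_slot(permuter_t p, int i) { return (p >> (4 * (i + 1))) & 15; }

/* Return a new permuter with free slot `slot` inserted at sorted position
 * `i`. A writer first stores the new key into `slot`, then publishes this
 * permuter with one 64-bit write, so concurrent lookups see either the old
 * or the new ordering, never a half-shifted node. */
static inline permuter_t perm_insert(permuter_t p, int i, int slot)
{
    uint64_t lo_mask = (UINT64_C(1) << (4 * (i + 1))) - 1;     /* count + first i slots */
    uint64_t lo = p & lo_mask;
    uint64_t hi = (p & ~lo_mask) << 4;                         /* later slots shift up  */
    return (hi | ((uint64_t)slot << (4 * (i + 1))) | lo) + 1;  /* +1 bumps the count    */
}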

B+tree with permuter. 140M short KV pairs, put-only, 16 cores; throughput in millions of req/sec. Improves throughput by 4%.

Performance drops dramatically as key length increases. Short values, 50% updates, 16 cores, no logging; throughput in millions of req/sec vs. key length, with keys differing only in the last 8 bytes. Why? The key suffix is stored indirectly, so each key comparison compares the full key and incurs an extra DRAM fetch.

Masstree – trie of B+trees. A trie is a tree where each level is indexed by a fixed-length key fragment. Masstree is a trie with fanout 2^64 in which each trie node is a B+tree: the first level is indexed by key bytes k[0:7], the next by k[8:15], then k[16:23], and so on. This compresses shared key prefixes!
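A hedged sketch of the descent through a trie of B+trees, consuming 8 key bytes per layer (the layer type and the layer_get helper are assumed for illustration, not Masstree's interfaces):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct layer;                      /* one B+tree keyed by 8-byte slices */
struct leafvalue {
    bool          is_layer;        /* more key bytes remain: go deeper  */
    struct layer *next_layer;
    void         *record;          /* otherwise the stored value        */
};

/* Assumed helper: look up an 8-byte slice in one B+tree layer. */
extern bool layer_get(struct layer *l, uint64_t slice, struct leafvalue *out);

void *trie_of_btrees_get(struct layer *root, const char *key, size_t len)
{
    struct layer *l = root;
    for (size_t off = 0; ; off += 8) {
        uint64_t slice = 0;                     /* next 8 key bytes, zero-padded */
        size_t rem = len > off ? len - off : 0;
        size_t n = rem < 8 ? rem : 8;
        memcpy(&slice, key + off, n);
        struct leafvalue v;
        if (!layer_get(l, slice, &v))
            return NULL;                        /* slice not present            */
        if (!v.is_layer)
            return v.record;                    /* key fully consumed here      */
        l = v.next_layer;                       /* many keys share this slice:
                                                   descend to the next layer    */
    }
}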

Case study: keys share a P-byte prefix – better than a single B+tree. The first P/8 trie levels each contain only one node; below them sits a single B+tree with 8-byte keys. Comparison complexity and DRAM accesses: Masstree O(log N) and O(log N); a single B+tree O(P log N) and O(P log N).

Masstree performs better for long keys with shared prefixes. Short values, 50% updates, 16 cores, no logging; throughput in millions of req/sec vs. key length. 8-byte key comparisons instead of full-key comparisons.

Does the trie of B+trees hurt short-key performance? 140M short KV pairs, put-only, 16 cores; throughput in millions of req/sec. No: it is 8% faster, because the code is more efficient when internal nodes handle 8-byte keys only.

Evaluation. How does Masstree compare to other systems? How does Masstree compare to partitioned trees, i.e. how much do we pay for handling skewed workloads? How does Masstree compare with a hash table, i.e. how much do we pay for supporting range queries? Does Masstree scale on many cores?

Masstree performs well even with persistence and range queries. 20M short KV pairs, uniform distribution, read-only, 16 cores, with network; throughput in millions of req/sec. Two comparison systems with richer data and query models reach only 0.04 and 0.22 million req/sec, so that comparison is unfair; on the other side, Memcached is not persistent and has no range queries, and Redis has no range queries.

Multi-core – partition among cores? One option is multiple instances, each owning a unique set of keys (Memcached, Redis, VoltDB). Masstree instead uses a single shared tree: each core can access all keys, which reduces load imbalance.

A single Masstree performs better for skewed workloads. 140M short KV pairs, read-only, 16 cores, with network; throughput in millions of req/sec as a function of the skew δ, where one partition receives δ times more queries than the others. Partitioning avoids remote DRAM access and concurrency control, but under skew the partitioned system is about 80% idle: the one hot partition handles 40% of the queries while each of the other 15 handles 4%.

Cost of supporting range queries. Without range queries, one could use a hash table: no resize cost (pre-allocate a large table), lock-free updates with cmpxchg, and efficient code that supports 8-byte keys only. At 30% full, each lookup takes 1.1 hash probes on average. Measured in the Masstree framework, the hash table gets 2.5X the throughput of Masstree, so range-query support costs about 2.5X in performance.
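The hash-table alternative can be sketched as a pre-allocated open-addressing table with compare-and-swap inserts (C11 atomic_compare_exchange compiles to cmpxchg on x86); the table size, hash function, and empty-key convention here are assumptions, and deletions and resizing are omitted:

#include <stdatomic.h>
#include <stdint.h>

#define TABLE_SIZE (UINT64_C(1) << 20)   /* pre-allocated; kept ~30% full */
#define EMPTY_KEY  UINT64_C(0)           /* key 0 is reserved as "empty"  */

static _Atomic uint64_t keys[TABLE_SIZE];
static _Atomic uint64_t vals[TABLE_SIZE];

static void hash_put(uint64_t key, uint64_t val)
{
    uint64_t h = key * UINT64_C(0x9E3779B97F4A7C15);    /* cheap mixing hash */
    for (uint64_t i = h;; i++) {                        /* linear probing    */
        uint64_t slot = i % TABLE_SIZE;
        uint64_t cur = atomic_load(&keys[slot]);
        if (cur == EMPTY_KEY) {                         /* try to claim slot */
            uint64_t expected = EMPTY_KEY;
            if (!atomic_compare_exchange_strong(&keys[slot], &expected, key)
                && expected != key)
                continue;                               /* lost the race to a
                                                           different key     */
            cur = key;                                  /* claimed, or raced
                                                           with the same key */
        }
        if (cur == key) {
            atomic_store(&vals[slot], val);             /* publish the value */
            return;
        }
        /* slot holds another key: keep probing */
    }
}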

Scales to 12X on 16 cores. Short KV pairs, without logging; per-core throughput in millions of req/sec vs. number of cores. Relative to perfect scalability, gets scale to 12X and puts scale similarly, limited by the shared memory system.

Related work. [OLFIT]: optimistic concurrency control. [pB+-trees]: B+tree with software prefetch. [pkB-tree]: stores a fixed number of differing key bits inline. [PALM]: lock-free B+tree, 2.3X the throughput of [OLFIT]. Masstree is the first system to combine these techniques, with new optimizations: the trie of B+trees and the permuter.

Summary. Masstree: a general-purpose, high-performance, persistent KV store. 5.8 million puts/sec, 8 million gets/sec. More comparisons with other systems appear in the paper. Cache craftiness improves performance by 1.5X.

Thank you!