Caching at the Web Scale


Presentation Transcript

Slide 1

Caching at the Web Scale

Victor Zakhary, Divyakant Agrawal, Amr El Abbadi

Slide 2

The old problem of Caching

[Figure: the memory hierarchy: L1, L2, RAM, Disk; going down the hierarchy the levels get larger, slower, and cheaper; going up they get smaller, faster, and more expensive]

Slide 3

The old problem of Caching

[Figure: the same memory hierarchy: L1, L2, RAM, Disk]

Ta = Th + m × Tm

Ta: average access time
Th: access time in case of a hit
m: miss ratio (1 - hit ratio)
Tm: access time in case of a miss
Tm >>>> Th
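A quick worked example of the formula, with illustrative numbers that are not from the slides (1 ns hit time, 100 ns miss penalty, 95% hit ratio, so m = 0.05):

```latex
T_a = T_h + m \times T_m = 1\,\mathrm{ns} + 0.05 \times 100\,\mathrm{ns} = 6\,\mathrm{ns}
```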

Slide 4

The old problem of Caching

Tm >>>> Th
When the cache is full → replacement policy
Replacement policy → eviction mechanism
Having the right elements in cache increases the hit ratio
High hit ratio → lower average access time

Slide 5

The old problem of Caching

Ta = Th + m × Tm

Ta: average access time
Th: access time in case of a hit
m: miss ratio (1 - hit ratio)
Tm: access time in case of a miss
Tm >>>> Th

Th and m are always in contention.
A good caching strategy lowers m but requires more tracking, which increases Th.
Less tracking lowers Th but increases m.

Slide 6

The old problem of Caching

This is not a tutorial on 70s materials. Right?

[Figure: the memory hierarchy again: L1, L2, RAM, Disk; smaller/faster/expensive at the top, larger/slower/cheaper at the bottom]

Slide 7

Nowadays

Hardware technologies have changed:

Storage

Memory

Network

etc

Solutions have to exploit all these changes to serve client requests:

at billions of requests per second scale

with low latency

with high availability

achieving data consistency (varies from application to application)

New designs for Caching Services

Very time sensitive

Huge amount of data

Dynamically generated data

Source: http://www.visualcapitalist.com/what-happens-internet-minute-2016/

Slide 8

Facebook page load

Each page load is translated into hundreds of lookups

Lookups are done in multiple rounds

8

This slide was taken from the Scaling Memcache at Facebook presentation at NSDI'13.

Slide 9

Nowadays Architecture

[Figure: millions of end-users generate a page-load and page-update stream (millions/sec); stateless application servers turn it into billions of key lookups per second against persistent storage, which is overloaded]

Slide 10

Nowadays Architecture

[Figure: the same workload now passes through a load balancer to hundreds of stateless application servers; persistent storage is still overloaded]

Slide 11

Nowadays Architecture

Partition and replicate the persistent storage.

[Figure: load balancer, hundreds of stateless application servers, and partitioned, replicated persistent storage; remaining concerns: high latency, supported operations, consistency]

Slide 12

Facebook page load

Each page load is translated into hundreds of lookups

Lookups are done in multiple rounds

Reads are 99.8% while writes are only 0.2% [Tao ATC`13]

Persistent Storage cannot handle this request throughput at this scale

Caching → lower latency + alleviates load on storage

12

This slide was taken from the Scaling Memcache at Facebook presentation at NSDI'13.

Slide 13

Nowadays Architecture

[Figure: a caching server is added between the application servers and the partitioned, replicated persistent storage; hits are served from the cache, misses go to storage, and the single caching server becomes overloaded]

Slide 14

Nowadays Architecture

[Figure: tens of caching servers now sit between the application servers and the partitioned, replicated persistent storage; new concerns: failures, load balancing, and lookaside vs. knowledge-based caching]

Slide 15

Slide 16

Access latency

[Figure: a table of access latency numbers, omitted in the transcript]

Peter Norvig: http://norvig.com/21-days.html#answers

Slide 17

Goal and challenges, old vs. modern caching:

Old caching — Goal: access latency ↓. Challenges: replacement policy, update strategy, update durability, thread contention.
Modern caching — Goal: access latency ↓, load distribution, request rate. Challenges: scale management, load balancing (utilization), update strategy, update durability, data consistency.

Ta = Th + m × Tm

Slide 18

Replacement Policies

Slide 19

Cache Replacement Policies

[Figure: cache lookup flow: on a cache hit, the page is served from the cache; on a cache miss, the page is fetched and inserted into the cache: if the cache is not full, insert directly, otherwise evict and insert]

Slide 20

Cache Replacement Policies

Cache size is limited → cannot fit everything → eviction mechanism
Contention between hit access time and miss ratio
FIFO, LIFO, LRU (recency of access), Pseudo-LRU, ARC (frequency and recency of access), MRU, …

Slide 21

LRU

Hardware-supported implementations: using counters, using a binary 2D matrix
Software implementation: using a doubly linked list and a hash table

Slide 22

LRU – Hardware using Counters

Use a large enough counter (64-128 bits) and increment it after each instruction.
When accessing a page, tag the page with the current counter value.
When a page fault happens, evict the page with the lowest counter tag.
Very expensive to examine the counter for every page.

Slide 23

LRU – Hardware using 2D Binary Matrix

Keep an N×N bit matrix, one row and one column per page, initially all zeros.
On access to page i: set all bits of row i to 1, then set all bits of column i to 0.
O(N) bits per page.
The row with the smallest binary value is the eviction candidate.

[Figure: successive matrix states for a 4-page cache as pages are accessed]

Slide 24

LRU – Software Implementation

A hash table maps each key to a node in a doubly linked list kept in recency order: on every access the node moves to the head of the list, and the node at the tail is the eviction candidate.

[Figure, slides 24–27: the hash table pointing into the doubly linked list, animated over a short access sequence]
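A minimal Python sketch of this structure (hash table plus doubly linked list with sentinel head/tail nodes). The class layout and fixed capacity are illustrative, not from the slides:

```python
class Node:
    """Doubly linked list node holding one cached key/value pair."""
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.prev = self.next = None

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = {}                    # hash table: key -> Node
        self.head = Node(None, None)       # sentinel on the most-recently-used side
        self.tail = Node(None, None)       # sentinel on the least-recently-used side
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, node):
        node.prev.next, node.next.prev = node.next, node.prev

    def _push_front(self, node):
        node.next, node.prev = self.head.next, self.head
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        node = self.table.get(key)
        if node is None:
            return None                    # miss
        self._unlink(node)                 # hit: move to the MRU position
        self._push_front(node)
        return node.value

    def put(self, key, value):
        if key in self.table:
            self._unlink(self.table.pop(key))
        elif len(self.table) >= self.capacity:
            victim = self.tail.prev        # least recently used node
            self._unlink(victim)
            del self.table[victim.key]
        node = Node(key, value)
        self.table[key] = node
        self._push_front(node)
```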

Slide 28

Pseudo-LRU (PLRU)

Bit-PLRU:
One bit per page, initially zero.
On access, flip the page's bit to one.
If all bits are one, flip all to zero except the last accessed page.

[Figure: bit states for the access sequence 1, 2, 3, 4, ending in 0 0 0 1]
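A small Python sketch of Bit-PLRU as described above. The frame count, the demo access sequence, and the choice of the lowest-indexed zero-bit frame as the victim are illustrative assumptions:

```python
class BitPLRU:
    def __init__(self, num_frames):
        self.frames = [None] * num_frames   # cached page per frame
        self.bits = [0] * num_frames        # one MRU bit per frame

    def _touch(self, i):
        self.bits[i] = 1
        if all(self.bits):                  # all ones: reset all except the last accessed page
            self.bits = [0] * len(self.bits)
            self.bits[i] = 1

    def access(self, page):
        if page in self.frames:             # hit
            self._touch(self.frames.index(page))
            return "hit"
        # miss: fill an empty frame if any, otherwise evict a frame whose bit is 0
        if None in self.frames:
            victim = self.frames.index(None)
        else:
            victim = self.bits.index(0)     # any zero-bit frame is a PLRU candidate
        self.frames[victim] = page
        self._touch(victim)
        return "miss"

cache = BitPLRU(4)
for p in [1, 2, 3, 4, 1, 5]:
    print(p, cache.access(p))
```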

Slide 29

Pseudo-LRU

Organizes blocks on a binary tree of bits (initially all zero).
The path from the root, following the bits, leads to the PLRU leaf.
On access, flip the values along the path to the leaf.
0 goes left, 1 goes right.

Access: 1 3 2 4 1 4 5

Slide 30

Slides 30–36 repeat the same rules and animate the access sequence 1 3 2 4 1 4 5 on the tree, updating the bits along the accessed path at each step; the final state on slide 36 holds pages 1, 5, 3, 4, i.e., page 5 has replaced the PLRU victim (page 2).
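A compact Python sketch of tree Pseudo-LRU for a power-of-two number of ways. It follows the usual convention that an access makes the bits on its path point away from the accessed leaf (the slides describe this as flipping the bits along the path), and following the bits from the root yields the eviction candidate; the class layout is an illustrative assumption:

```python
class TreePLRU:
    """Tree PLRU over a power-of-two number of ways; internal nodes in a flat array (index 1 = root)."""
    def __init__(self, ways=4):
        self.ways = ways
        self.bits = [0] * ways              # internal nodes 1 .. ways-1 (index 0 unused)
        self.pages = [None] * ways          # one page per leaf slot

    def _leaf_path(self, leaf):
        """Internal-node indices from the root down to `leaf`, with the direction taken (0 left, 1 right)."""
        node, lo, span, path = 1, 0, self.ways, []
        while span > 1:
            span //= 2
            go_right = leaf >= lo + span
            path.append((node, 1 if go_right else 0))
            if go_right:
                node, lo = 2 * node + 1, lo + span
            else:
                node = 2 * node
        return path

    def _touch(self, leaf):
        for node, direction in self._leaf_path(leaf):
            self.bits[node] = 1 - direction  # point away from the accessed leaf

    def victim(self):
        """Follow the bits from the root: 0 goes left, 1 goes right."""
        node, lo, span = 1, 0, self.ways
        while span > 1:
            span //= 2
            if self.bits[node] == 1:
                lo += span
                node = 2 * node + 1
            else:
                node = 2 * node
        return lo

    def access(self, page):
        if page in self.pages:
            leaf = self.pages.index(page)    # hit
        else:
            leaf = self.victim()             # miss: the bits point at the slot to (re)fill
            self.pages[leaf] = page
        self._touch(leaf)

plru = TreePLRU(4)
for p in [1, 3, 2, 4, 1, 4, 5]:
    plru.access(p)
print(plru.pages)   # [1, 5, 3, 4]: page 5 replaced the PLRU victim (page 2), as on slide 36
```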

Slide 37

ARC – Adaptive Replacement Cache

Maintains two LRU lists, L1 and L2:
L1 for recency
L2 for frequency
Tracks twice as many pages as fit in the cache: |L1| + |L2| = 2c
Dynamically and adaptively balances between recency and frequency
Online and self-tuning, in response to evolving and possibly changing access patterns
ARC: Almaden Research Center

Megiddo, Nimrod, and Dharmendra S. Modha. "ARC: A Self-Tuning, Low Overhead Replacement Cache." FAST. Vol. 3. 2003.

Slide 38

ARC

[Figure: L1 = T1 + ghost list B1 (recency); L2 = T2 + ghost list B2 (frequency); T1 and T2 together hold the c resident pages]

A miss is inserted at the head of L1 (its resident part, T1).
If a page in L1 is accessed twice, it moves to L2, at the head of T2.
A miss that falls in B1 increases the size of L1.
A miss that falls in B2 increases the size of L2.
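A much-simplified Python sketch of the four-list bookkeeping sketched on this slide (resident lists T1/T2 and ghost lists B1/B2, with an adaptive target p nudged by ghost hits). It deliberately omits the exact case analysis of the published ARC algorithm, and all names plus the fetch callback are illustrative assumptions:

```python
from collections import OrderedDict

class SimplifiedARC:
    """Skeleton of ARC's four-list bookkeeping; not the full published algorithm."""
    def __init__(self, c, fetch):
        self.c = c                    # cache capacity (number of resident pages)
        self.fetch = fetch            # callback that loads a missing page
        self.p = 0                    # adaptive target size for T1
        self.t1 = OrderedDict()       # resident, seen once recently (recency)
        self.t2 = OrderedDict()       # resident, seen at least twice (frequency)
        self.b1 = OrderedDict()       # ghost keys recently evicted from T1
        self.b2 = OrderedDict()       # ghost keys recently evicted from T2

    def _evict_one(self):
        if self.t1 and (not self.t2 or len(self.t1) > self.p):
            key, _ = self.t1.popitem(last=False)    # LRU end of T1
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)    # LRU end of T2
            self.b2[key] = None
        while len(self.b1) > self.c:                # keep the ghost lists bounded
            self.b1.popitem(last=False)
        while len(self.b2) > self.c:
            self.b2.popitem(last=False)

    def access(self, key):
        if key in self.t1:                          # second access: recency -> frequency
            value = self.t1.pop(key)
            self.t2[key] = value
            return value
        if key in self.t2:                          # frequent page: refresh MRU position
            self.t2.move_to_end(key)
            return self.t2[key]
        seen_before = False
        if key in self.b1:                          # ghost hit in B1: recency deserves more room
            self.p = min(self.c, self.p + 1)
            del self.b1[key]
            seen_before = True
        elif key in self.b2:                        # ghost hit in B2: frequency deserves more room
            self.p = max(0, self.p - 1)
            del self.b2[key]
            seen_before = True
        if len(self.t1) + len(self.t2) >= self.c:
            self._evict_one()
        value = self.fetch(key)
        if seen_before:
            self.t2[key] = value                    # a re-seen page goes to the frequency side
        else:
            self.t1[key] = value                    # a brand-new page goes to the head of T1
        return value
```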

Slide 39

Goal and challenges, old vs. modern caching (repeated from slide 17):

Old caching — Goal: access latency ↓. Challenges: replacement policy, update strategy, update durability, thread contention.
Modern caching — Goal: access latency ↓, load distribution, request rate. Challenges: scale management, load balancing (utilization), update strategy, update durability, data consistency.

Slide 40

Scale Management

Slide 41

Memcached*

Distributed in-memory caching system
Free and open source
First written in Perl by Brad Fitzpatrick in 2003, rewritten in C by Anatoly Vorobey
Client-driven caching

How does it work?

41

https://memcached.org/
Fitzpatrick, B. "Distributed caching with memcached." Linux Journal 2004, 124 (Aug. 2004), 5.

Slide 42

Memcached*

[Figure: the Memcached logic sits on the client side, in the application server or a dedicated cache client, in front of the cache server and storage]

1- lookup(k) to the cache server
2- Response(k): if k != null, done; else
3- lookup(k) against storage
4- Response(k) from storage
5- Set(k, V) in the cache server

So what does Memcached provide?
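A hedged sketch of this client-driven, look-aside flow in Python. The cache object stands for any Memcached client exposing get/set (most client libraries do), query_storage is a placeholder for the authoritative storage query, and the fixed TTL is an illustrative choice:

```python
def query_storage(key):
    """Placeholder for the authoritative lookup against persistent storage."""
    raise NotImplementedError

def lookup(cache, key, ttl=60):
    """Look-aside read: the client, not the cache server, talks to storage on a miss."""
    value = cache.get(key)          # 1- lookup(k) / 2- response(k)
    if value is not None:           # cache hit: done
        return value
    value = query_storage(key)      # 3- lookup(k) / 4- response(k) from storage
    if value is not None:
        cache.set(key, value, ttl)  # 5- set(k, V) so the next reader hits the cache
    return value
```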

42

Slide 43

Memcached

[Figure: an application server or dedicated cache client in front of three 1GB caching servers]

Slide 44

Memcached

[Figure: three 1GB caching servers form one 3GB logical cache in front of the application server or dedicated cache client]

Each key is mapped to one caching server
Better memory utilization through hashing
Clients know all servers
Servers don't communicate with each other
Shared-nothing architecture
Easy to scale

Slide 45

LRU – Software Implementation

[Figure: the hash table + doubly linked list LRU structure, repeated from slides 24–27]

Slide 46

Memcached

[Figure: the client computes Hash(k) % server count and sends Lookup(k) to that caching server]

On the chosen server: is k here?
Yes: update the LRU and return the value
No: return null

Slide 47

Consistent Hashing

When adding/removing a server, the % function causes high key churn (remapping).
With consistent hashing, only ~K/n keys are remapped.
The churn problem: assume keys 1,2,3,4,5,6,7,8,9,10,11,12 distributed over 4 servers 1,2,3,4.

47

Karger, David, et al. "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web." Proceedings of the twenty-ninth annual ACM Symposium on Theory of Computing. ACM, 1997.
Karger, David, et al. "Web caching with consistent hashing." Computer Networks 31.11 (1999): 1203-1213.

Slide 48

Consistent Hashing

[Figure: keys placed by k % 4 — server 1: {1,5,9}, server 2: {2,6,10}, server 3: {3,7,11}, server 4: {4,8,12}; e.g., 5 % 4 = 1 and 11 % 4 = 3]

What happens when the number of servers changes?

Slide 49

Consistent Hashing

[Figure: after dropping to 3 servers, k % 3 gives {1,4,7,10}, {2,5,8,11}, {3,6,9,12}, compared with the old layout {1,5,9}, {2,6,10}, {3,7,11}, {4,8,12}]

Keys 3,4,5,6,7,8,9,10,11 are remapped.
Keys 4,5,6,8,9,10 are remapped even though their machines are still up.

Slide 50

Consistent Hashing

[Figure, slides 50–53: the keys 1–12 and the servers are hashed onto the same ring; each key is assigned to the first server encountered clockwise from its position, so adding or removing a server only remaps the keys in the affected arc]
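A small Python sketch of a consistent-hashing ring in the spirit of the figures: servers (with a few virtual points each) and keys are hashed onto the same ring, and a key belongs to the first server point at or after its hash. The server names, virtual-node count, and use of MD5 are illustrative choices, and the final comparison just counts how many keys move when one of four servers is removed:

```python
import bisect
import hashlib

def ring_hash(s):
    """Map a string onto the ring (a large integer space)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._points = []                     # sorted (hash, server) pairs on the ring
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        for i in range(self.vnodes):
            bisect.insort(self._points, (ring_hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self._points = [p for p in self._points if p[1] != server]

    def server_for(self, key):
        h = ring_hash(key)
        idx = bisect.bisect(self._points, (h, ""))   # first point clockwise from the key
        return self._points[idx % len(self._points)][1]

# Compare churn: modulo placement vs. the ring, when one of four servers is removed.
keys = [f"key-{i}" for i in range(10_000)]
servers = ["s1", "s2", "s3", "s4"]

mod_before = {k: servers[ring_hash(k) % 4] for k in keys}
mod_after = {k: servers[ring_hash(k) % 3] for k in keys}
print("modulo remapped:", sum(mod_before[k] != mod_after[k] for k in keys))

ring = ConsistentHashRing(servers)
ring_before = {k: ring.server_for(k) for k in keys}
ring.remove_server("s4")
ring_after = {k: ring.server_for(k) for k in keys}
print("ring remapped:  ", sum(ring_before[k] != ring_after[k] for k in keys))  # roughly K/n
```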

Slide 54

Memcached

Item (object):
Key (string up to 250 bytes in length)
Expiration time (0 means never)
Values up to 1 MB in size
ASCII or binary representations (ASCII vs. binary protocol)
The binary protocol is faster and more compact: 12345678 (4 bytes) instead of "12345678" (8 bytes)
In the binary protocol, SASL can be enabled to limit access
Clients have to authenticate themselves before sending commands

Slide 55

Memcached API

Documentation: https://github.com/memcached/memcached/wiki/
Full API: https://github.com/memcached/memcached/wiki/Commands

Slide 56

Redis

Redis stands for REmote DIctionary Server
Open source, BSD licensed
By Salvatore Sanfilippo, May 2009
Developed in ANSI C
A data structure store, more than just a key/value store. How?

https://redis.io/

Slide 57

Redis

[Figure: a Redis caching server maps keys to values of different types: string, list, set, sorted set, hash map, bit map, HyperLogLog]

A rich API is used to map keys to:

String values

Linked lists

Sets

Sorted Sets

Hash maps

Bit maps

HyperLogLog

https://redis.io/

Slide 58

For Key/Value

GET key / SET key
For lists:
LPOP key: remove and get the first element in a list
LPUSH key value [value …]: prepend one or multiple values to a list
LSET key index value: list[key][index] = value

Full API commands:

https://redis.io/commands

[Figure: the Redis key-to-type mapping, repeated from slide 57]
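A short sketch of these commands from Python, assuming the widely used redis-py client and a Redis server on localhost; the connection details and key names are illustrative:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Plain key/value
r.set("greeting", "hello")
print(r.get("greeting"))          # "hello"

# Lists: LPUSH prepends, LPOP removes and returns the head, LSET writes by index
r.delete("queue")
r.lpush("queue", "c", "b", "a")   # list is now [a, b, c]
print(r.lpop("queue"))            # "a"
r.lset("queue", 0, "B")           # list is now [B, c]
print(r.lrange("queue", 0, -1))   # ["B", "c"]
```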

Slide 59

Implementing Redis using Memcached: what if?

[Figure, slides 59–60: a Memcached client can only Get K2 and Set K2 as whole values, so updating a stored data structure means fetching it, modifying it at the client, and writing it back]

Slide 61

Redis

Redis's rich API enables updating stored data structures without fetching them
String values are up to 512 MB
Lists are lists of strings; maximum length of a list is 2^32 - 1 elements
Sets are unordered collections of strings; maximum size of a set is 2^32 - 1 elements
Redis data types: https://redis.io/topics/data-types

Slide 62

Redis – Replacement Policies

noeviction: return an error when memory is full and the operation requires more memory
allkeys-lru: LRU-like eviction over all keys
volatile-lru: evict the least recently used keys among keys with an expire set
allkeys-random: evict random keys
volatile-random: evict random keys among keys with an expire set
volatile-ttl: evict only keys with an expire set, starting with the keys that have the shortest time to live (TTL)

A key with an expire set has a lifetime.
If no keys have an expire set, the volatile policies behave like noeviction.

https://redis.io/topics/lru-cache

Slide 63

Redis – Cluster Management

Client-side partitioning: hash partitioning, range partitioning, or consistent hashing
Proxy-assisted partitioning: Client → Proxy → Server instance → Client
Twemproxy: Twitter's proxy for Memcache and Redis
Query routing: Client → Random Redis server → Client → Redis server

https://redis.io/topics/partitioning

Slide 64

Redis – Replication & Persistence

Master-slave replication: a master can update one or more slaves
Updates are sent asynchronously → eventual consistency model
The master should persist updates locally
Slaves give better scalability for read-only workloads
Updates are sent to the master
Clients reading from a slave should tolerate some staleness

[Figure: one master replicating to three slaves]

https://redis.io/topics/replication

Slide 65

Amazon ElastiCache

Compatible with Memcached and Redis
Cache as a service: Amazon deploys and operates the instances and manages scalability
Automatically detects and replaces failed nodes
Better interfaces for monitoring

Slide 66

Partitioned Caches

[Figure: a hash table partitioned across multiple machines]

Contention between threads
Partition on multiple machines
Partition memory based on object size (slabs)

Slide 67

Partitioned Caches

Motivation

Reduce contention between threads

Memory resources are distributed

Fairness between large objects and small objects

Drawback

Fragmentation

Slide 68

Solutions

Change the partitioning technique: redistribute keys in a way that achieves higher utilization
Change the memory assigned to different partitions: highly loaded partitions get more memory, less loaded partitions get less memory
Easy in shared-memory architectures

Slide 69

Dynacache: Dynamic Cloud Caching

[Figure, slides 69–70: hit rate P(hit) as a function of the number of objects cached; slide 70 contrasts two differently shaped hit-rate curves]

Cidon, Asaf, et al. "Dynacache: Dynamic Cloud Caching." HotStorage. 2015.
These slides were taken from the Dynacache presentation at HotCloud'15.

Slide 71

CliffHanger: Scaling performance cliffs in web memory caches

Increases the overall hit rate by 1.2%
Reduces the total number of cache misses by 36.7%
Achieves the same hit rate with 45% less memory capacity

Memory is usually divided between objects based on size (slabs)
Assign more memory to slabs that have a high miss rate
Take memory from slabs that do not experience a high miss rate
How? Hit-rate gradient

Cidon, Asaf, et al. "Cliffhanger: Scaling performance cliffs in web memory caches." USENIX NSDI. 2016.

Slide 72

CliffHanger: Scaling performance cliffs in web memory caches

[Figure: two hit-rate curves P(hit) versus number of objects cached, one with a low gradient and one with a high gradient]

Slide 73

CliffHanger: Scaling performance cliffs in web memory caches

Maintain shadow lists for the different slabs.
When a miss happens in a slab's shadow list, increase that list's size and decrease the list size of another slab.
When the shadow queue size exceeds a certain limit, assign more memory to that slab.
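A rough Python sketch of the shadow-queue idea as I read this slide: each slab keeps a small shadow queue of recently evicted keys, a cache miss that hits the shadow queue is counted as evidence the slab wants more memory, and a periodic rebalance moves memory toward the slab with the most such hits. Thresholds, names, and the rebalancing step are illustrative, not from the paper:

```python
from collections import OrderedDict

class SlabWithShadow:
    def __init__(self, name, shadow_limit=64):
        self.name = name
        self.shadow = OrderedDict()      # recently evicted keys (no values)
        self.shadow_limit = shadow_limit
        self.shadow_hits = 0             # evidence that this slab wants more memory

    def record_eviction(self, key):
        self.shadow[key] = None
        if len(self.shadow) > self.shadow_limit:
            self.shadow.popitem(last=False)

    def on_cache_miss(self, key):
        """Called when `key` missed this slab's cache; returns True on a shadow hit."""
        if key in self.shadow:
            del self.shadow[key]
            self.shadow_hits += 1
            return True
        return False

def rebalance(slabs, step=1):
    """Move `step` units of memory from the coldest slab to the hottest, then reset the counters."""
    hottest = max(slabs, key=lambda s: s.shadow_hits)
    coldest = min(slabs, key=lambda s: s.shadow_hits)
    if hottest is not coldest and hottest.shadow_hits > 0:
        print(f"grow {hottest.name} by {step}, shrink {coldest.name} by {step}")
    for s in slabs:
        s.shadow_hits = 0
```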

73

Slide 74

CliffHanger: Scaling performance cliffs in web memory caches

[Figure: two hit-rate curves P(hit) versus number of objects cached; one of them exhibits a performance cliff]

Slide 75

Solving Utilization using Hashing

Slide 76

Cuckoo hashing

[Figure: Insert(K) goes through a hash function H(K) → id that maps the key to exactly one bucket]

Insert(K): if memory is full, eject some key and insert K; else insert K.
Deterministically map each key to one bucket.
If the bucket is full, eject some other key even if other buckets have empty slots.

Pagh, Rasmus, and Flemming Friche Rodler. "Cuckoo hashing." Journal of Algorithms 51.2 (2004): 122-144.

Slide 77

Cuckoo hashing

[Figure: Insert(K) goes through a cuckoo hash function H(K) → id1, id2 that maps the key to two candidate buckets; the insert fails only if memory is full]

Deterministically map each key to 2 buckets. If either of them has space, insert the key there. Otherwise, eject a random key from one of the 2 buckets, try to insert the ejected key into its second bucket, and repeat until the key is inserted or the eviction path exceeds some length limit.
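A small Python sketch of the insertion procedure just described: two hash functions give each key two candidate buckets, and insertion kicks a resident key out and re-homes it, up to a bounded path length. One slot per bucket matches the slide (larger buckets are a common variant); the bucket count, hash construction, and path limit are illustrative:

```python
import hashlib
import random

class CuckooHashTable:
    def __init__(self, num_buckets=16, slots_per_bucket=1, max_path=32):
        self.buckets = [[] for _ in range(num_buckets)]
        self.slots = slots_per_bucket
        self.max_path = max_path

    def _bucket_ids(self, key):
        h = hashlib.sha256(str(key).encode()).digest()
        id1 = int.from_bytes(h[:8], "big") % len(self.buckets)
        id2 = int.from_bytes(h[8:16], "big") % len(self.buckets)
        return id1, id2

    def lookup(self, key):
        id1, id2 = self._bucket_ids(key)
        return key in self.buckets[id1] or key in self.buckets[id2]

    def insert(self, key):
        for _ in range(self.max_path):              # bounded eviction path
            id1, id2 = self._bucket_ids(key)
            for b in (id1, id2):
                if len(self.buckets[b]) < self.slots:
                    self.buckets[b].append(key)      # a candidate bucket has room
                    return True
            victim_bucket = random.choice((id1, id2))
            slot = random.randrange(self.slots)
            # Swap: the new key takes the slot, the ejected key becomes the key to re-home.
            key, self.buckets[victim_bucket][slot] = self.buckets[victim_bucket][slot], key
        return False                                 # path too long: give up (table needs resizing)
```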

Slide 78

Cuckoo hashing

[Figure: Insert(k): H(k) → id1, id2; one candidate bucket is full and the other is not, so k goes into the bucket with space]

Slide 79

Cuckoo hashing

[Figure: Insert(l): H(l) → id1, id2; both candidate buckets are full]

Slide 80

Cuckoo hashing

[Figure: Insert(l): a resident key is ejected from one of the full buckets, l takes its slot, and the ejected key is re-inserted into its own second bucket, continuing the cuckoo eviction path until a bucket with space is found]

Slide 81

Cuckoo hashing

[Figure: lookup(k): H(k) → id1, id2]

Lookup both buckets in parallel. Bloom filters can be used to reduce the lookups to 1.

Slide 82

Cuckoo Hashing optimizations

MemC3: Fan, Bin, David G. Andersen, and Michael Kaminsky. "MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing." NSDI. Vol. 13. 2013.
Algorithmic improvements for fast concurrent cuckoo hashing: Li, Xiaozhou, et al. "Algorithmic improvements for fast concurrent cuckoo hashing." Proceedings of the Ninth European Conference on Computer Systems. ACM, 2014.

Slide 83

Web Scale Distributed Caching Examples

Slide 84

Scaling Memcache at Facebook

Globally distributed caching (an in-memory key/value store)
Billions of requests per second over trillions of items

Goals:
Allow near real-time communication
Aggregate content on-the-fly from multiple resources
Be able to access and update very popular shared content
Scale to process millions of user requests per second

Nishtala, Rajesh, et al. "Scaling Memcache at Facebook." NSDI. Vol. 13. 2013.

Slide 85

Scaling Memcache at Facebook

Read-dominated workload
Multiple data sources: MySQL, HDFS
One caching layer for multiple data sources
Demand-filled look-aside cache
High fanout and hierarchical data fetching

Slide 86

Lookups and Updates

Lookup:
1- the client (web server) issues lookup(k) to Memcached
2- on a miss, it SELECTs from the database
3- and Sets (k, v) back into Memcached

Update:
1- the client updates the database
2- and then Deletes k from Memcached
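A Python sketch of the two paths in this figure. The cache object stands for any memcached-style client with get/set/delete, and db_select / db_update are placeholders for the real SQL; all names are illustrative:

```python
def db_select(key):
    raise NotImplementedError        # placeholder for the real storage query

def db_update(key, value):
    raise NotImplementedError        # placeholder for the real storage update

def read(cache, key):
    """Look-aside read: on a miss, fetch from the database and populate the cache."""
    value = cache.get(key)           # 1- lookup(k)
    if value is None:
        value = db_select(key)       # 2- SELECT ... from the database
        cache.set(key, value)        # 3- set(k, v)
    return value

def write(cache, key, new_value):
    """Update path: write the database first, then invalidate with a delete, not a set."""
    db_update(key, new_value)        # 1- update the database
    cache.delete(key)                # 2- delete k (deletes are idempotent, so safe to repeat)
```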

Slide 87

Handling updates

Memcache needs to be invalidated after a DB write
Prefer deletes to sets: idempotent
It is up to the web application to specify which keys to invalidate after a database update

87

This slide was taken from the Scaling Memcache at Facebook presentation at NSDI'13.

Slide 88

One Cluster

[Figure: many clients (web servers) talk to many Memcached servers; keys are spread over the servers with consistent hashing, and the all-to-all traffic causes congestion]

Web servers:
batch requests
use a sliding window to limit outgoing requests
use UDP for gets
use TCP for deletes & sets

Slide 89

Replication

[Figure: multiple front-end clusters send lookups & updates to a shared storage cluster]

Slide 90

[Slides 90 and 91: figures taken from the Scaling Memcache at Facebook presentation at NSDI'13]

Slide 92

Tolerate datacenter-scale outages
Read latency
Master-slave replication

[Slides 93 and 94: figures taken from the Scaling Memcache at Facebook presentation at NSDI'13]

Slide 95

Caching with Domain Knowledge

Slide 96

TAO: Facebook’s Distributed Data Store for the Social Graph

Nodes and associations: the Facebook graph is made of nodes and associations.
A plain look-aside cache with Get/Set on keys is agnostic of the data structure, which means more requests and more rounds of lookups.

Bronson, Nathan, et al. "TAO: Facebook's Distributed Data Store for the Social Graph." USENIX Annual Technical Conference. 2013.

Slide 97

The Social Graph

[Figure: a hypothetical encoding of the social graph, with USER, POST, COMMENT, PHOTO, LOCATION, EXIF_INFO and GPS_DATA nodes connected by FRIEND, AUTHOR, LIKE, COMMENT, AT, CHECKIN and EXIF associations]

This slide was taken from Facebook TAO's presentation at ATC'13.

Slide 98

What does TAO do?

Nodes and associations
Graph partitioning instead of hash partitioning
Instead of loading a single key (node or association) per request, enrich the API:
Load ranges of an association list (e.g., the newest ten comments on a post)
Get the count of an association (how many comments are there on a post?)

Slide 99

Objects and Associations API

Reads – 99.8%
Point queries: obj_get 28.9%, assoc_get 15.7%
Range queries: assoc_range 40.9%, assoc_time_range 2.8%
Count queries: assoc_count 11.7%

Writes – 0.2%
Create, update, delete for objects: obj_add 16.5%, obj_update 20.7%, obj_del 2.0%
Set and delete for associations: assoc_add 52.5%, assoc_del 8.3%

This slide was taken from Facebook TAO's presentation at ATC'13.

Slide 100

Partitioning

Objects are assigned a shard id and are bound to this shard for their lifetime.
Associations are defined as (id1, atype, id2), where id1 and id2 are node ids.
An association is stored on the shard of its id1, so an association query can be served from a single server.
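A tiny Python sketch of why this placement keeps an association query on a single server. The modulo shard function is only a stand-in (TAO derives the shard from the object id itself), and the ids are made up:

```python
NUM_SHARDS = 1024

def shard_of(object_id):
    """Stand-in placement: TAO actually embeds the shard id inside the object id."""
    return object_id % NUM_SHARDS

def shard_for_assoc(id1, atype, id2):
    """An association (id1, atype, id2) lives on id1's shard."""
    return shard_of(id1)

post_id, commenter_ids = 42, [7, 9, 11]
# Every COMMENT association of this post maps to the same shard, so a range
# or count query over the post's comments needs exactly one server:
print({shard_for_assoc(post_id, "COMMENT", cid) for cid in commenter_ids})
```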

Slide 101

Follower and Leader Caches

[Figure: web servers talk to follower caches, which talk to a leader cache in front of the database]

This slide was taken from Facebook TAO's presentation at ATC'13.

Slide 102

Write-through Caching – Association Lists

[Figure: a write X → Y from a web server flows through a follower cache and the leader cache to the database; acknowledgements (ok) flow back, and the leader sends refill X messages so the other follower caches refresh their cached association lists with a range get]

This slide was taken from Facebook TAO's presentation at ATC'13.

Slide 103

Asynchronous DB Replication

[Figure: a master data center and a replica data center, each with web servers, follower caches, a leader cache, and a database]

Invalidation and refill messages are embedded in the replicated SQL
Writes are forwarded to the master
Delivery happens after DB replication is done

This slide was taken from Facebook TAO's presentation at ATC'13.

Slide 104

Improving Availability: Read Failover

[Figure: read failover between the replica data center and the master data center]

This slide was taken from Facebook TAO's presentation at ATC'13.

Slide 105

Hotspots

Objects are semi-randomly distributed among shards → load imbalance (a post by Justin Bieber vs. a post by Victor Zakhary)
Load imbalance → replicate (more followers to handle the same object), at the cost of more invalidation messages
Cache hot objects at the client: an additional level of invalidation, but it alleviates request load on the caching servers

Slide 106

Inverse associations

Bidirectional relationships have separate a→b and b→a edges
inv_type(LIKES) = LIKED_BY
inv_type(FRIEND_OF) = FRIEND_OF
Forward and inverse types are linked only during writes
TAO assoc_add will update both
Not atomic, but failures are logged and repaired

[Figure: Nathan and Carol connected by FRIEND_OF edges, and the post "On the summit" connected by AUTHOR/AUTHORED_BY and LIKES/LIKED_BY edge pairs]

This slide was taken from Facebook TAO's presentation at ATC'13.

Slide 107

Hardware Advances for Scale

Slide 108

FaRM: Fast Remote Memory

Data is in memory
Accessed using RDMA (Remote Direct Memory Access)
Avoids TCP/IP bottlenecks
Remote memory access is still non-uniform (NUMA): local memory is 23x faster to access

Dragojević, Aleksandar, et al. "FaRM: Fast remote memory." Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation. 2014.
This slide was taken from FaRM's presentation at NSDI'14.

Slide 109

FaRM: Fast Remote Memory

Communication primitives using RDMA
Distributed memory management: shared address space, lock-free reads
Distributed transaction management: strong consistency guarantees
Distributed key/value store: compared with Facebook's TAO, 10x the throughput and 40–50x lower latency on a cluster of 20 machines

Slide 110

FaRM – Communication Primitives

RDMA reads to read remote objects
RDMA writes to send messages
Messages are put into a circular buffer
The receiver updates the sender after processing messages

Slide 111

FaRM – Distributed Memory Management

[Figure taken from FaRM's presentation at NSDI'14]

Slide 112

Traditional Lock-Free Read

[Figure: an object with a version field V]

1- Read the version
2- Read the data
3- Read the version again
The read is consistent if both versions are equal.
Three accesses: not suitable for RDMA.
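A Python sketch of the version-check read with a retry loop. The object layout is illustrative and runs in a single process; the odd/even "write in progress" convention is the usual seqlock refinement rather than something stated on the slide. FaRM's observation is that doing this over RDMA costs three remote accesses, which motivates the cache-line versioning on the next slide:

```python
import threading

class VersionedObject:
    """A writer bumps the version to an odd value while mutating, back to even when done."""
    def __init__(self, data):
        self.version = 0
        self.data = data
        self._lock = threading.Lock()

    def write(self, new_data):
        with self._lock:
            self.version += 1          # odd: write in progress
            self.data = new_data
            self.version += 1          # even: write complete

    def lock_free_read(self):
        while True:
            v1 = self.version          # 1- read the version
            data = self.data           # 2- read the data
            v2 = self.version          # 3- read the version again
            if v1 == v2 and v1 % 2 == 0:
                return data            # consistent snapshot
            # otherwise a writer was active: retry
```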

This slide was taken from FaRM's presentation at NSDI'14.

Slide 113

Hardware supported Lock-Free Read

[Figure: every cache line of the object carries its own version V]

Coherent DMA cache line versioning: invalidate writes, flush reads.
Read all cache lines of the object.
If all the versions are the same, then the read is consistent.

This slide was taken from FaRM's presentation at NSDI'14.

Slide 114

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

Boosts the performance of a single machine: 65.6 to 76.9 million key-value operations per second using a single general-purpose multi-core system.

Lim, Hyeontaek, et al. "MICA: A holistic approach to fast in-memory key-value storage." NSDI. 2014.
Slides 114–116 were taken from MICA's presentation at NSDI'14.

Slide 117

Caching as a Service

Memcachier: the client focuses on the app; Memcachier handles allocation and cluster management, availability, and a monitoring dashboard
Amazon ElastiCache

Challenges? Isolation of clients, utilization

Slide 118

FairRide: Near-Optimal, Fair Cache Sharing

Low latency
Reduce load on the backend

Pu, Qifan, et al. "FairRide: Near-optimal, fair cache sharing." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, 2016.
This slide was taken from FairRide's presentation at NSDI'16.

Slide 119

FairRide: Near-Optimal, Fair Cache Sharing

LRU, MRU, …: prone to strategic behavior

Slide 120

FairRide: Near-Optimal, Fair Cache Sharing

Statically allocated → isolation, strategy-proof
Globally shared → utilization

Slide 121

FairRide: Near-Optimal, Fair Cache Sharing

A modification to max-min fairness
Probabilistically block requests from users based on their behavior
Disincentivizes users from cheating: the more a user cheats, the more they are hurt

[Figure: Max-Min vs. FairRide]

Slide 122

Slicer: Auto-Sharding for Datacenter Applications

Provides auto-sharding without tying it to storage
Separates the assignment-generation "control plane" from the request-forwarding "data plane" via a small interface
In a scalable, consistent, fault-tolerant manner
Reshards for capacity and failure adaptation, and for load balancing

Adya, Atul, et al. "Slicer: Auto-sharding for datacenter applications." USENIX Symposium on Operating Systems Design and Implementation. 2016.
This slide was taken from Slicer's presentation at OSDI'16.

Slide 123

Slicer Sharding Model

[Figure: keys K1, K2, K3 are hashed into the range 0 … 2^63 - 1, and ranges ("slices") of that space are assigned to caching servers]

Hash keys into a 63-bit space
Assign ranges ("slices") of the space to servers
Split/merge/migrate slices for load balancing
Asymmetric replication: more copies for hot slices
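A Python sketch of the data-plane lookup implied by this figure: hash the key into the 63-bit space and binary-search a sorted slice table for the owning task. The slice boundaries and task names are made-up illustrative values:

```python
import bisect
import hashlib

SPACE = 2 ** 63   # keys are hashed into [0, 2^63 - 1]

def slice_hash(key):
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % SPACE

# Assignment produced by the control plane: slice start -> serving task (illustrative values).
slice_starts = [0, SPACE // 4, SPACE // 2, 3 * SPACE // 4]
slice_owners = ["cache-task-0", "cache-task-1", "cache-task-2", "cache-task-3"]

def owner_of(key):
    h = slice_hash(key)
    idx = bisect.bisect_right(slice_starts, h) - 1   # last slice starting at or before h
    return slice_owners[idx]

print(owner_of("user:1234"), owner_of("user:5678"))
```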

123

Slide 124

Slicer Overview

[Figure: frontends use a Clerk library and caching servers use a Slicelet library; Hash(key) forwarding stays on the distributed data plane, while the Slicer Service forms the centralized control plane]

This slide was taken from Slicer's presentation at OSDI'16.

Slide 125

Slicer Architecture

[Figure: frontends (Clerk) and caching servers (Slicelet) interact with the Slicer control plane: a Distributor, a backup Distributor, and an Assigner, built on existing Google infrastructure for capacity monitoring, health monitoring, load monitoring, and lease management]

This slide was taken from Slicer's presentation at OSDI'16.

Slide 126

Summary

Replacement policies
Partitioned caches: fragmentation, Cliffhanger
Cuckoo hashing
Cluster management: Memcached, Redis
Cross-datacenter caching: Memcache at Facebook
Domain-knowledge caching: TAO at Facebook

Slide 127

Summary

Hardware-supported distributed caches: FaRM, MICA
Caching as a service: Memcachier, ElastiCache
Fairness for caching: FairRide
Auto-sharding: Slicer

Slide 128

Open problems

Load balancing and utilization: distributing keys equally between shards, memory utilization, load per key (the hotspots problem)
Re-sharding: how should the mapping change?
Replication: propagating updates, consistency vs. latency
Geo-replication: consistency vs. latency

Slide 129

Open problems

Domain-knowledge caching: caching for distributed real-time analytics
Advances in hardware: SDN, RDMA, NVM, SSDs
Datacenters, edge and fog computing

Slide 130

References and Sources

http://www.visualcapitalist.com/what-happens-internet-minute-2016/
http://norvig.com/21-days.html#answers
Tanenbaum, Andrew S., and Herbert Bos. Modern Operating Systems. Prentice Hall Press, 2014.
Megiddo, Nimrod, and Dharmendra S. Modha. "ARC: A Self-Tuning, Low Overhead Replacement Cache." FAST. Vol. 3. 2003.
https://memcached.org/
Fitzpatrick, B. "Distributed caching with memcached." Linux Journal 2004, 124 (Aug. 2004), 5.
Karger, David, et al. "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web." Proceedings of the twenty-ninth annual ACM Symposium on Theory of Computing. ACM, 1997.
Karger, David, et al. "Web caching with consistent hashing." Computer Networks 31.11 (1999): 1203-1213.

Slide 131

References and Sources

https://redis.io/
Cidon, Asaf, et al. "Dynacache: Dynamic Cloud Caching." HotStorage. 2015.
Cidon, Asaf, et al. "Cliffhanger: Scaling performance cliffs in web memory caches." USENIX NSDI. 2016.
Pagh, Rasmus, and Flemming Friche Rodler. "Cuckoo hashing." Journal of Algorithms 51.2 (2004): 122-144.
Fan, Bin, David G. Andersen, and Michael Kaminsky. "MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing." NSDI. Vol. 13. 2013.
Li, Xiaozhou, et al. "Algorithmic improvements for fast concurrent cuckoo hashing." Proceedings of the Ninth European Conference on Computer Systems. ACM, 2014.
Nishtala, Rajesh, et al. "Scaling Memcache at Facebook." NSDI. Vol. 13. 2013.
Bronson, Nathan, et al. "TAO: Facebook's Distributed Data Store for the Social Graph." USENIX Annual Technical Conference. 2013.

Slide 132

References and Sources

Dragojević, Aleksandar, et al. "FaRM: Fast remote memory." Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation. 2014.
Lim, Hyeontaek, et al. "MICA: A holistic approach to fast in-memory key-value storage." NSDI. 2014.
Pu, Qifan, et al. "FairRide: Near-optimal, fair cache sharing." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, 2016.
Adya, Atul, et al. "Slicer: Auto-sharding for datacenter applications." USENIX Symposium on Operating Systems Design and Implementation. 2016.

Slide 133

Questions

Thank you!

133