Slide1
Caching at the Web Scale
Victor Zakhary, Divyakant Agrawal, Amr El Abbadi
1Slide2
The old problem of Caching
[Diagram: memory hierarchy — L1 cache, L2 cache, RAM, Disk; levels toward the top are smaller, faster, and more expensive; levels toward the bottom are larger, slower, and cheaper]
2Slide3
The old problem of Caching
[Diagram: the same memory hierarchy, annotated with access times]
Ta = Th + m × Tm
Ta: average access time
Th: access time in case of a hit
m: miss ratio (1 − hit ratio)
Tm: access time in case of a miss
Tm ≫ Th
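As a quick illustration of the formula (the numbers below are made up for the example, not taken from the slides):

```python
# Hypothetical numbers: a RAM hit costs ~100 ns, a disk miss costs ~10 ms.
t_hit = 100e-9        # Th, seconds
t_miss = 10e-3        # Tm, seconds
miss_ratio = 0.01     # m = 1 - hit ratio (99% hit ratio)

t_avg = t_hit + miss_ratio * t_miss   # Ta = Th + m * Tm
print(f"Ta = {t_avg * 1e6:.1f} microseconds")   # ~100.1 us, dominated by the misses
```

Even at a 99% hit ratio the average is dominated by Tm, which is why raising the hit ratio matters so much.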
3Slide4
The old problem of Caching
Tm ≫ Th
When the cache is full → replacement policy
Replacement policy = eviction mechanism
Having the right elements in cache increases the hit ratio
High hit ratio → lower average access time
4Slide5
The old problem of Caching
Ta = Th + m × Tm
Ta: average access time
Th: access time in case of a hit
m: miss ratio (1 − hit ratio)
Tm: access time in case of a miss
Tm ≫ Th
Th and m are always in contention
A good caching strategy lowers m, but requires more tracking, which increases Th
Less tracking lowers Th, but increases m
5Slide6
The old problem of Caching
This is not a tutorial on 70s material, right?
[Diagram: the same memory hierarchy again — L1, L2, RAM, Disk — smaller/faster/more expensive toward the top, larger/slower/cheaper toward the bottom]
6Slide7
Nowadays
Hardware technologies have changed:
Storage
Memory
Network
etc
Solutions have to exploit all these changes to serve client requests:
at billions of requests per second scale
with low latency
with high availability
achieving data consistency (varies from application to application)
New designs for Caching Services
Very time sensitive
Huge amount of data
Dynamically generated data
7
Source: http://www.visualcapitalist.com/what-happens-internet-minute-2016/Slide8
Facebook page load
Each page load is translated into hundreds of lookups
Lookups are done on multiple rounds
8
This slide was taken from Scaling
Memcache
at Facebook presentation in NSDI`13 Slide9
Today's Architecture
Persistent Storage
Millions of end-users
Page-load and page-update stream (millions/sec)
Billions of key lookups per second
Stateless
Application Servers
Overloaded
9Slide10
Today's Architecture
Load Balancer
Persistent Storage
Millions of end-users
Page-load and page-update stream (millions/sec)
Billions of key lookups per second
Hundreds of Stateless
Application Servers
Overloaded
10Slide11
Today's Architecture
Partition and replicate
Load Balancer
Millions of end-users
Page-load and page-update stream (millions/sec)
Billions of key lookups per second
Hundreds of Stateless
Application Servers
Persistent Storage
High Latency
Supported
operations
Consistency
11Slide12
Facebook page load
Each page load is translated into hundreds of lookups
Lookups are done on multiple rounds
Reads are 99.8% while writes are only 0.2% [Tao ATC`13]
Persistent Storage cannot handle this request throughput at this scale
Caching
lower latency + alleviate load on storage
12
This slide was taken from Scaling
Memcache
at Facebook presentation in NSDI`13 Slide13
Today's Architecture
Partitioned and replicated
Load Balancer
Millions of end-users
Page-load and page-update stream (millions/sec)
Billions of key lookups per second
Hundreds of Stateless
Application Servers
Persistent Storage
Caching Server
hit
miss
Overloaded
13Slide14
Today's Architecture
Partitioned and replicated
Load Balancer
Millions of end-users
Page-load and page-update stream (millions/sec)
Billions of key lookups per second
Hundreds of Stateless
Application Servers
Persistent Storage
Tens of Caching Servers
hit
miss
Failures
Load balance
Lookaside vs. knowledge-based
14Slide15
15Slide16
Access latency
[Table: "Latency numbers every programmer should know" — approximate access times from L1 cache through main memory, disk, and network round trips]
16
Peter Norvig: http://norvig.com/21-days.html#answersSlide17
Goal
Old Caching — Goal: Access Latency ↓; Challenges: Replacement policy, Update strategy, Update durability, Thread contention
Modern Caching — Goal: Access Latency ↓, Load distribution; Challenges: Scale management, Load balancing (utilization), Update strategy, Update durability, Data consistency, Request rate
Ta = Th + m × Tm
17Slide18
Replacement Policies
18Slide19
Cache Replacement Policies
[Flowchart: Lookup in cache → cache hit: done; cache miss: fetch the page, then insert it into the cache — if the cache is not full, insert; if the cache is full, evict a page and insert]
19Slide20
Cache Replacement Policies
Cache size is limited → cannot fit everything → eviction mechanism
Contention between hit access time and miss ratio
FIFO, LIFO
LRU (recency of access)
Pseudo-LRU
ARC (frequency and recency of access)
MRU, …
20Slide21
LRU
Hardware-supported implementations
Using counters
Using a binary 2D matrix
Software implementation
Using a doubly linked list and a hash table
21Slide22
LRU – Hardware using Counters
Large enough counter (64–128 bits)
Increment the counter after each instruction
When accessing a page: tag the page with the current counter value at the access time
When a page fault happens: evict the page with the lowest counter tag
Very expensive to examine the counter for every page
22Slide23
LRU – Hardware using 2D Binary Matrix
[Example, animated on the slide: a 4×4 binary matrix, initially all zeros. On access to page k, set all bits in row k to 1, then set all bits in column k to 0. The matrix is shown after accesses to pages 1, 2, and 3.]
O(N) bits per page
The row with the smallest binary value is the eviction candidate23Slide24
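A small sketch of the matrix method (an illustration written for this text, not code from the slides); each row is stored as an integer so setting a row or clearing a column is a single bit operation:

```python
class MatrixLRU:
    """LRU tracking with an N x N bit matrix (rows stored as Python ints)."""
    def __init__(self, n):
        self.n = n
        self.rows = [0] * n                    # rows[k] holds the N bits of row k

    def access(self, k):
        self.rows[k] = (1 << self.n) - 1       # set row k to all ones
        mask = ~(1 << (self.n - 1 - k))        # then clear column k in every row
        for i in range(self.n):
            self.rows[i] &= mask

    def victim(self):
        # the row with the smallest binary value is the least recently used page
        return min(range(self.n), key=lambda i: self.rows[i])

m = MatrixLRU(4)
for page in (1, 2, 3):
    m.access(page)
print(m.victim())   # page 0 (never accessed here) is the eviction candidate
```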
LRU – Software Implementation
[Diagram, slides 24–27: a hash table maps each key to a node in a doubly linked list kept in recency order; on access, the node is found through the hash table and moved to the head of the list; the tail of the list is the LRU eviction candidate]
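A minimal sketch of the hash table + doubly linked list design; Python's OrderedDict maintains exactly this structure internally, so it stands in for both:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()          # keys kept in recency order

    def get(self, key):
        if key not in self.items:
            return None                     # cache miss
        self.items.move_to_end(key)         # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used key

cache = LRUCache(2)
cache.set("a", 1); cache.set("b", 2)
cache.get("a")                  # touch "a"
cache.set("c", 3)               # evicts "b", the least recently used
print(list(cache.items))        # ['a', 'c']
```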
27Slide28
Pseudo-LRU (PLRU)
Bit-PLRU
One bit per page, initially zero
On access, flip the page's bit to one
If all bits are one, flip all to zero except the last accessed page
[Example, 4 pages, access sequence 1, 2, 3, 4: bits go 0000 → 1000 → 1100 → 1110 → 0001]
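A sketch of Bit-PLRU reproducing the slide's example (illustrative code, not from the slides):

```python
class BitPLRU:
    """Bit-PLRU over a fixed set of ways; the victim is any way whose bit is 0."""
    def __init__(self, n_ways):
        self.bits = [0] * n_ways

    def access(self, way):
        self.bits[way] = 1
        if all(self.bits):                        # all bits one: reset everything...
            self.bits = [0] * len(self.bits)
            self.bits[way] = 1                    # ...except the page just accessed

    def victim(self):
        return self.bits.index(0)                 # first way whose bit is still 0

p = BitPLRU(4)
for way in (0, 1, 2, 3):     # the slide's accesses 1, 2, 3, 4, zero-indexed here
    p.access(way)
print(p.bits)                # [0, 0, 0, 1] — matches the slide's final state
```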
28Slide29
Pseudo-LRU (tree-based)
Organizes blocks in a binary tree
The path from the root leads to the PLRU leaf
On access, flip the values along the path to the leaf
0 goes left, 1 goes right
[Diagram, slides 29–36: the tree bits are updated step by step for the access sequence 1, 3, 2, 4, 1, 4, 5; on the final access, key 5 replaces the block that the root-to-leaf path of bits points to]
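A sketch of tree-based PLRU under one common convention: on access, each bit on the path is set to point away from the just-accessed leaf, so following the bits from the root leads to the PLRU victim (illustrative code, not from the slides):

```python
class TreePLRU:
    """Tree-PLRU for n_ways (a power of two); tree bits stored heap-style."""
    def __init__(self, n_ways):
        self.n = n_ways
        self.bits = [0] * (n_ways - 1)      # internal nodes of a complete binary tree

    def access(self, way):
        node, lo, hi = 0, 0, self.n
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:                   # accessed leaf is in the left subtree:
                self.bits[node] = 1         # point this bit right, away from it
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0         # point this bit left, away from it
                node, lo = 2 * node + 2, mid

    def victim(self):
        node, lo, hi = 0, 0, self.n
        while hi - lo > 1:                  # 0 goes left, 1 goes right
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo

plru = TreePLRU(4)
for way in (0, 1, 2, 3):
    plru.access(way)
print(plru.victim())   # 0: with all ways touched in order, the PLRU victim is way 0
```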
36Slide37
ARC- Adaptive Replacement Cache
Maintains 2 LRU lists, L1 and L2
L1 for recency, L2 for frequency
Tracks pages up to twice the size of the cache: |L1| + |L2| = 2|c|
Dynamically and adaptively balances between recency and frequency
Online and self-tuning, in response to evolving and possibly changing access patterns
ARC: Almaden Research Center
37
Megiddo, Nimrod, and Dharmendra S. Modha. "ARC: A Self-Tuning, Low Overhead Replacement Cache." FAST. Vol. 3. 2003.Slide38
ARC
[Diagram: L1 = T1 + ghost list B1 (recency side); L2 = T2 + ghost list B2 (frequency side); only T1 and T2 hold cached pages, of total size c; B1 and B2 hold ghost entries (keys only)]
A miss is inserted at the head of L1 (into T1)
If a page in L1 is accessed twice, move it to L2 (head of T2)
A miss that hits in ghost list B1 increases the target size of L1
A miss that hits in ghost list B2 increases the target size of L2
38Slide39
Goal
Old Caching — Goal: Access Latency ↓; Challenges: Replacement policy, Update strategy, Update durability, Thread contention
Modern Caching — Goal: Access Latency ↓, Load distribution; Challenges: Scale management, Load balancing (utilization), Update strategy, Update durability, Data consistency, Request rate
39Slide40
Scale Management
40Slide41
Memcached*
Distributed in-memory caching system
Free and open source
Written first in Perl by Brad Fitzpatrick in 2003
Rewritten in C by Anatoly Vorobey
Client driven caching
How does it work?
41
https://memcached.org/,
FITZPATRICK, B. Distributed caching with
memcached
.
Linux Journal 2004, 124 (Aug. 2004), 5.Slide42
Memcached*
Memcached
logic
Client Side
Server Side
Storage
Application server or dedicated cache client
Cache server
1- lookup(k)
2- Response(k)
If k != null
done
Else?
3- lookup(k)
4- Response(k)
5- Set(k, V)
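In code, the look-aside pattern sketched above looks roughly like this (pseudocode; `cache` stands for any Memcached client and `db_read` for a storage lookup — both are placeholders, not real APIs):

```python
def lookaside_get(cache, db_read, k):
    v = cache.get(k)          # 1- lookup(k); 2- response(k)
    if v is not None:         # cache hit: done
        return v
    v = db_read(k)            # 3- lookup(k) in storage; 4- response(k)
    if v is not None:
        cache.set(k, v)       # 5- set(k, v): the *client* fills the cache
    return v
```

The cache server never talks to storage; the client drives both lookups, which is why Memcached is described as client-driven caching.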
So what does
Memcached
provide?
42
https://memcached.org/,
FITZPATRICK, B. Distributed caching with
memcached
.
Linux Journal 2004, 124 (Aug. 2004), 5.Slide43
Memcached
1GB
Caching servers
1GB
1GB
Application server or dedicated cache client
43
https://memcached.org/,
FITZPATRICK, B. Distributed caching with
memcached
.
Linux Journal 2004, 124 (Aug. 2004), 5.Slide44
Memcached
1GB
Caching servers
1GB
1GB
3GB
Application server or dedicated cache client
Each key is mapped to one caching server
Better memory utilization through hashing
Clients know all servers
Servers don’t communicate with each other
Shared-nothing architecture
Easy to scale
44
https://memcached.org/,
FITZPATRICK, B. Distributed caching with
memcached
.
Linux Journal 2004, 124 (Aug. 2004), 5.Slide45
LRU – Software Implementation
Hash Table
45Slide46
Memcached
1GB
Caching servers
1GB
1GB
Application server or dedicated cache client
Lookup(k)
Hash(k) % server count → pick the caching server
Lookup(k) on that server:
- Is k here?
- Yes → update LRU and return the value
- No → return null
46
https://memcached.org/,
FITZPATRICK, B. Distributed caching with
memcached
.
Linux Journal 2004, 124 (Aug. 2004), 5.Slide47
Consistent Hashing
When adding/removing a server, the % (mod) function causes high key churn (remapping)
With consistent hashing, only ~K/n keys are remapped
The churn problem
Assume keys 1,2,3,4,5,6,7,8,9,10,11,12 distributed over 4 servers 1,2,3,4
47
Karger
, David, et al. "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web."
Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
. ACM, 1997.
Karger, David, et al. "Web caching with consistent hashing." Computer Networks 31.11 (1999): 1203-1213.Slide48
Consistent Hashing
[Example: keys 1–12 distributed over 4 servers with key % 4 — {1,5,9}, {2,6,10}, {3,7,11}, {4,8,12}; e.g., 5 % 4 = 1, 11 % 4 = 3]
What happens when the number of servers changes?
48
Karger
, David, et al. "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web."
Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
. ACM, 1997.
Karger
, David, et al. "Web caching with consistent hashing." Computer Networks 31.11 (1999): 1203-1213.Slide49
Consistent Hashing
[Example: dropping to 3 servers with key % 3 gives {1,4,7,10}, {2,5,8,11}, {3,6,9,12} — keys 3,4,5,6,7,8,9,10,11 are remapped]
Keys 4,5,6,8,9,10 are remapped even though their machines are still up
49
Karger
, David, et al. "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web."
Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
. ACM, 1997.
Karger
, David, et al. "Web caching with consistent hashing." Computer Networks 31.11 (1999): 1203-1213.Slide50
Consistent Hashing
[Diagram, slides 50–53: keys and servers are hashed onto the same circular hash space; each key is stored on the first server encountered clockwise; adding or removing a server only remaps the keys in the arc it owns, roughly K/n keys]
53
Karger, David, et al. "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web." Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. ACM, 1997.
Karger, David, et al. "Web caching with consistent hashing." Computer Networks 31.11 (1999): 1203-1213.Slide54
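A toy sketch of the ring shown on the preceding slides (illustrative only; production implementations add many virtual nodes per server to smooth the load):

```python
import bisect
import hashlib

def h(value):
    """Map a string onto the hash ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers):
        self.ring = sorted((h(s), s) for s in servers)   # (point, server) pairs

    def server_for(self, key):
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, h(key)) % len(self.ring)   # first server clockwise
        return self.ring[i][1]

ring = ConsistentHashRing(["cache1", "cache2", "cache3", "cache4"])
print(ring.server_for("user:42"))
# Adding "cache5" only remaps the keys that fall between cache5 and its predecessor;
# with hash % n, most keys would move.
```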
Memcached
Item (object)
Key (string up to 250 bytes in length)
Expiration time (0 means never)
Values up to 1 MB in size
ASCII or binary representations (ASCII vs. binary protocol)
The binary protocol is faster and more compact: 12345678 (4 bytes) instead of "12345678" (8 bytes)
In the binary protocol
SASL can be enabled to limit access
Clients have to authenticate themselves before sending commands
54Slide55
Memcached API
Documentation: https://github.com/memcached/memcached/wiki/
Full API: https://github.com/memcached/memcached/wiki/Commands
55Slide56
Redis
Redis stands for REmote DIctionary Server
Open source, BSD licensed
By Salvatore Sanfilippo, May 2009
Developed in ANSI C
Data structure store, more than just a key/value store
How?
56
https://redis.io/Slide57
Redis
[Diagram: a Redis caching server maps each key to a value that can be a string, linked list, set, sorted set, hash map, bit map, or HyperLogLog]
A rich API is used to map keys to:
String values
Linked lists
Sets
Sorted Sets
Hash maps
Bit maps
HyperLogLog
57
https://redis.io/Slide58
For Key/Value
GET key
SET key
For lists
LPOP key: remove and get the first element in a list
LPUSH key value [value …]: prepend one or multiple values to a list
LSET key index value: list[key][index] = value
Full API commands:
https://redis.io/commands
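For example, with the redis-py client (assuming a Redis server running on localhost; the key names are arbitrary):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# plain key/value
r.set("page:42", "<html>...</html>")
print(r.get("page:42"))

# list operations update the stored structure in place, no fetch-modify-write needed
r.lpush("timeline:alice", "post3", "post2", "post1")   # prepend values
r.lset("timeline:alice", 0, "edited-post1")            # timeline[0] = ...
print(r.lpop("timeline:alice"))                        # remove and return the head
```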
Redis
[Diagram: the same Redis key-to-data-structure mapping as on the previous slide]
58Slide59
Implementing Redis using Memcached
What if?
Memcached
server
K
1
Value
K
2
Memcached
client
Get K
2
59Slide60
Implementing Redis using Memcached
What if?
Memcached
server
K
1
Value
K
2
Memcached
client
Set K
2
60Slide61
Redis
Redis' rich API enables updating stored data structures without fetching them
String values are up to 512 MB
Lists are lists of strings
Maximum length of a list is 2^32 − 1 elements
Sets are unordered collections of strings
Maximum number of elements in a set is 2^32 − 1
Redis
data types: https://redis.io/topics/data-types
61Slide62
Redis – Replacement Policies
noeviction: return an error when memory is full and the operation requires more memory
allkeys-lru: LRU-like eviction across all keys
volatile-lru: evict the least recently used keys among keys with an expire set
allkeys-random: evict random keys
volatile-random: evict random keys among keys with an expire set
volatile-ttl: evict only keys with an expire set, starting with keys that have the shortest time to live (TTL)
A key with an expire set has a lifetime
If no keys have an expire set, the volatile policies behave like noeviction
62
https://redis.io/topics/lru-cacheSlide63
Redis- Cluster Management
Client-side partitioning
Hash partitioning, range partitioning, or consistent hashing
Proxy-assisted partitioning
Client → Proxy → Server instance → Client
Twemproxy: Twitter's proxy for Memcache and Redis
Query routing
Client → any Redis server → the query is routed to the right Redis server
63
https://redis.io/topics/partitioningSlide64
Redis- Replication & Persistence
Master-slave replication
A master can update one or more slaves
Updates are sent asynchronously
Eventual consistency model
The master should persist updates locally
Slaves give better scalability for read-only workloads
Updates are sent to the master
Clients reading from a slave should tolerate some staleness
Master
Slave
Slave
Slave
64
https://redis.io/topics/replicationSlide65
Amazon ElastiCache
Compatible with Memcached and Redis
Cache as a service
Amazon deploys and operates the instances
Amazon manages scalability
Automatically detects and replaces failed nodes
Better interfaces for monitoring
65Slide66
Partitioned Caches
Hash Table
….
….
Contention between threads
Partition on multiple machines
Partition memory based on object size (slabs)
66Slide67
Partitioned Caches
Motivation
Reduce contention between threads
Memory resources are distributed
Fairness between large objects and small objects
Drawback
Fragmentation
67Slide68
Solutions
Change the partitioning technique
Redistribute keys in a way that achieves higher utilization
Change the memory assigned to different partitions
Highly loaded partitions get more memory
Less loaded partitions get less memory
Easy in shared-memory architectures
68Slide69
Dynacache: Dynamic Cloud Caching
[Plot: hit rate P(hit) vs. number of objects cached, saturating at 1]
69
Cidon
,
Asaf
, et al. "
Dynacache
: Dynamic Cloud Caching."
HotStorage
. 2015.
This slide was taken from
Dynacache
presentation in Hotcloud`15 Slide70
Dynacache: Dynamic Cloud Caching
[Plot: two hit-rate curves (P(hit) vs. number of objects cached), each saturating at 1 but with different shapes]
70
Cidon
,
Asaf
, et al. "
Dynacache
: Dynamic Cloud Caching."
HotStorage
. 2015.
This slide was taken from
Dynacache
presentation in Hotcloud`15 Slide71
CliffHanger: Scaling performance cliffs in web memory caches
Increases overall hit rate by 1.2%
Reduces the total number of cache misses by 36.7%
Same hit rate with 45% less memory capacity
Memory is usually divided between objects based on size (slabs)
Assign more memory to slabs that have a high miss rate
Take memory from slabs that do not experience a high miss rate
How?
Hit rate gradient
71
Cidon
,
Asaf, et al. "Cliffhanger: Scaling performance cliffs in web memory caches." USENIX NSDI. 2016.Slide72
CliffHanger
: Scaling performance cliffs in web memory caches
[Plot: two hit-rate curves, P(hit) vs. number of objects cached — one with a low gradient, one with a high gradient]
72
Cidon
,
Asaf
, et al. "Cliffhanger: Scaling performance cliffs in web memory caches." USENIX NSDI. 2016.Slide73
CliffHanger: Scaling performance cliffs in web memory caches
Maintain shadow lists (shadow queues) for the different slabs
When a miss happens in a shadow list, increase that list's size and decrease the size of another slab's list
When the shadow queue size exceeds a certain limit, assign more memory to that slab
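A toy illustration of the shadow-queue idea (a simplification written for this text, not the paper's algorithm): each slab keeps a small list of recently evicted keys; a request that misses in the cache but hits in that list means a slightly bigger slab would have served it, so that slab earns memory from another slab.

```python
from collections import OrderedDict

class Slab:
    def __init__(self, capacity, shadow_capacity=100):
        self.capacity = capacity
        self.items = OrderedDict()           # real LRU cache for this slab
        self.shadow = OrderedDict()          # keys evicted recently ("shadow queue")
        self.shadow_capacity = shadow_capacity
        self.shadow_hits = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)
            return True                      # real hit
        if key in self.shadow:               # would have been a hit with more memory
            self.shadow_hits += 1
        return False

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            evicted, _ = self.items.popitem(last=False)
            self.shadow[evicted] = True      # remember what we just threw away
            if len(self.shadow) > self.shadow_capacity:
                self.shadow.popitem(last=False)

def rebalance(slab_a, slab_b, step=1):
    # crude hill climbing: move memory toward the slab whose shadow queue is hotter
    if slab_a.shadow_hits > slab_b.shadow_hits and slab_b.capacity > step:
        slab_a.capacity += step
        slab_b.capacity -= step
    slab_a.shadow_hits = slab_b.shadow_hits = 0
```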
73
Cidon
,
Asaf
, et al. "Cliffhanger: Scaling performance cliffs in web memory caches." USENIX NSDI. 2016.Slide74
CliffHanger: Scaling performance cliffs in web memory caches
[Plot: hit-rate curves, P(hit) vs. number of objects cached, showing a performance cliff — a region where the hit rate jumps sharply once enough objects fit in the cache]
74
Cidon
,
Asaf
, et al. "Cliffhanger: Scaling performance cliffs in web memory caches." USENIX NSDI. 2016.Slide75
Solving Utilization using Hashing
75Slide76
Cuckoo hashing
[Diagram: Insert(K) → hash function H(K) → one bucket id]
Insert(K): if the bucket is full, eject some key and insert K; else insert K
Deterministically map each key to one bucket
If the bucket is full, eject some other key even if other buckets have empty slots.
76
Pagh
, Rasmus, and
Flemming
Friche
Rodler
. "Cuckoo hashing." Journal of Algorithms 51.2 (2004): 122-144.Slide77
Cuckoo hashing
[Diagram: Insert(K) → cuckoo hash function H(K) → two bucket ids, id1 and id2; if the eviction path gets too long, the insert fails]
Deterministically map each key to 2 buckets. If either of them has space, insert the key into that bucket. Otherwise, eject a random key from one of the 2 buckets, try to insert the ejected key into its other bucket, and repeat until the key is inserted or the eviction path exceeds some length limit.
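A compact sketch of the insert/lookup logic described above (toy code with one key per bucket; MemC3 and other real systems use multi-slot buckets and partial-key tags):

```python
import random

class CuckooHash:
    """Toy 2-choice cuckoo hash table; each bucket holds a single key."""
    def __init__(self, n_buckets, max_kicks=32):
        self.n = n_buckets
        self.max_kicks = max_kicks
        self.table = [None] * n_buckets

    def _buckets(self, key):
        # two candidate buckets derived from two differently seeded hashes
        return hash((0, key)) % self.n, hash((1, key)) % self.n

    def lookup(self, key):
        b1, b2 = self._buckets(key)
        return self.table[b1] == key or self.table[b2] == key

    def insert(self, key):
        b1, b2 = self._buckets(key)
        if self.table[b1] in (None, key):
            self.table[b1] = key
            return True
        if self.table[b2] in (None, key):
            self.table[b2] = key
            return True
        # both buckets full: evict a key and push it to its alternate bucket
        slot = random.choice((b1, b2))
        for _ in range(self.max_kicks):
            key, self.table[slot] = self.table[slot], key   # displace the occupant
            a1, a2 = self._buckets(key)
            slot = a2 if slot == a1 else a1                  # its other bucket
            if self.table[slot] is None:
                self.table[slot] = key
                return True
        return False   # eviction path too long: the insert fails (table needs resizing)
```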
77Slide78
Cuckoo hashing
Insert(k)
Cuckoo Hash Function
H(k)
id
1
, id
2
Insert(k)
Insert(k)
…
…
…
…
a
b
c
d
m
n
o
p
z
x
e
w
j
g
f
h
i
…
…
…
…
…
…
…
…
full
not
full
k
78
Pagh
, Rasmus, and
Flemming
Friche
Rodler
. "Cuckoo hashing." Journal of Algorithms 51.2 (2004): 122-144.Slide79
Cuckoo hashing
Insert(l)
Cuckoo Hash Function
H(l)
id
1
, id
2
Insert(l)
Insert(l)
…
…
…
…
a
b
c
d
m
n
o
p
z
x
e
w
j
k
g
f
h
i
…
…
…
…
…
…
…
…
full
full
not
full
full
79
Pagh
, Rasmus, and
Flemming
Friche
Rodler
. "Cuckoo hashing." Journal of Algorithms 51.2 (2004): 122-144.Slide80
Cuckoo hashing
Insert(l)
Cuckoo Hash Function
H(l)
id
1
, id
2
Insert(l)
Insert(l)
…
…
…
…
l
b
c
d
m
a
o
p
z
x
e
w
j
k
n
g
f
h
i
…
…
…
…
…
…
…
…
full
full
not
full
full
80
Pagh
, Rasmus, and
Flemming
Friche
Rodler
. "Cuckoo hashing." Journal of Algorithms 51.2 (2004): 122-144.Slide81
Cuckoo hashing
[Diagram: lookup(k) → the cuckoo hash function yields the two candidate buckets, id1 and id2; both are checked]
Lookup both buckets in parallel. Bloom filters can be used to reduce the lookups to 1.
81
Pagh
, Rasmus, and
Flemming
Friche
Rodler
. "Cuckoo hashing." Journal of Algorithms 51.2 (2004): 122-144.Slide82
Cuckoo Hashing optimizations
MemC3
Fan, Bin, David G. Andersen, and Michael Kaminsky. "MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing." NSDI. Vol. 13. 2013.
Algorithmic improvements for fast concurrent cuckoo hashing
Li, Xiaozhou, et al. "Algorithmic improvements for fast concurrent cuckoo hashing." Proceedings of the Ninth European Conference on Computer Systems. ACM, 2014.
82Slide83
Web Scale Distributed Caching Examples
83Slide84
Scaling Memcache at Facebook
Globally distributed caching (in-memory key/value store)
Billions of requests per second over trillions of items
Goals
Allow near real-time communication
Aggregate content on-the-fly from multiple resources
Be able to access and update very popular shared content
Scale to process millions of user requests per second
84
Nishtala, Rajesh, et al. "Scaling Memcache at Facebook." NSDI. Vol. 13. 2013.Slide85
Scaling Memcache at Facebook
Read-dominated workload
Multiple data sources: MySQL, HDFS
One caching layer for multiple data sources
Demand-filled look-aside cache
High fanout and hierarchical data fetching
85
Nishtala
, Rajesh, et al. "Scaling
Memcache
at Facebook."
nsdi. Vol. 13. 2013.Slide86
Lookups and Updates
Database
Memcached
Client
(Web Server)
1- lookup(k)
2- Select ..
3- Set(
k,v
)
Database
Memcached
Client
(Web Server)
2- Delete k
1- Update ..
Lookup
Update
86
Nishtala
, Rajesh, et al. "Scaling
Memcache
at Facebook."
nsdi
. Vol. 13. 2013.Slide87
Handling updates
Memcache needs to be invalidated after a DB write
Prefer deletes to sets (deletes are idempotent)
Up to the web application to specify which keys to invalidate after a database update
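The write path can be sketched like this (pseudocode; `db_update` and `cache` are placeholders for the real storage and Memcached client calls):

```python
def write(db_update, cache, key, new_value):
    db_update(key, new_value)   # 1- update the database (the source of truth)
    cache.delete(key)           # 2- delete, don't set: deletes are idempotent, and
                                #    the next read re-fills the cache (look-aside)
```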
87
Nishtala
, Rajesh, et al. "Scaling
Memcache
at Facebook."
nsdi. Vol. 13. 2013.This slide was taken from Scaling Memcache at Facebook presentation in NSDI`13 Slide88
One Cluster
[Diagram: many web-server clients use consistent hashing to pick among the cluster's memcached servers; the all-to-all request pattern causes congestion]
Webservers:
batch requests
use sliding window to limit outgoing requests
use UDP for gets
use TCP for deletes & sets
88
Nishtala
, Rajesh, et al. "Scaling
Memcache
at Facebook."
nsdi
. Vol. 13. 2013.Slide89
Replication
Storage Cluster
Lookups &
Updates
Front-end Cluster
Front-end Cluster
89Slide90
90
This slide was taken from Scaling
Memcache
at Facebook presentation in NSDI`13 Slide91
91
This slide was taken from Scaling
Memcache
at Facebook presentation in NSDI`13 Slide92
Tolerate datacenter scale outages
Read latency
Master slave replication
92
This slide was taken from Scaling
Memcache
at Facebook presentation in NSDI`13 Slide93
93
This slide was taken from Scaling
Memcache
at Facebook presentation in NSDI`13 Slide94
94
This slide was taken from Scaling
Memcache
at Facebook presentation in NSDI`13 Slide95
Caching with Domain Knowledge
95Slide96
TAO: Facebook’s Distributed Data Store for the Social Graph
Nodes and Associations
Facebook graph
Nodes
Associations
Look-aside cache
Get/Set keys
Agnostic of the data structure
More requests
More rounds of lookups
96
Bronson, Nathan, et al. "TAO: Facebook's Distributed Data Store for the Social Graph." USENIX Annual Technical Conference. 2013.Slide97
FRIEND
The Social Graph
COMMENT
POST
USER
USER
PHOTO
LOCATION
USER
Carol
USER
USER
USER
EXIF_INFO
GPS_DATA
AT
PHOTO
EXIF
COMMENT
CHECKIN
LIKE
LIKE
LIKE
LIKE
AUTHOR
AUTHOR
FRIEND
(hypothetical
encoding)
97
This slide was taken from Facebook TAO’s presentation ATC`13 Slide98
What does TAO do?
Nodes and Associations
Graph partitioning instead of hash partitioning
Instead of loading a single key per request
Node
Association
Enrich the API
Load ranges of the association list
Newest ten comments on a post
Get count of association
How many comments are there on a post?
98
Bronson, Nathan, et al. "TAO: Facebook's Distributed Data Store for the Social Graph." USENIX Annual Technical Conference. 2013.Slide99
Objects and Associations API
Reads – 99.8%
Point queries: obj_get 28.9%, assoc_get 15.7%
Range queries: assoc_range 40.9%, assoc_time_range 2.8%
Count queries: assoc_count 11.7%
Writes – 0.2%
Create, update, delete for objects: obj_add 16.5%, obj_update 20.7%, obj_del 2.0%
Set and delete for associations: assoc_add 52.5%, assoc_del 8.3%
99
This slide was taken from Facebook TAO’s presentation ATC`13 Slide100
Partitioning
Objects are assigned a shard id
Objects are bound to this shard for their lifetime
Associations
Defined as (id1, atype, id2); id1 and id2 are node ids
Stored on the shard of id1
An association query can be served from a single server
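A toy illustration of the placement rule (in the real system the shard id is embedded in the object id rather than derived by a modulo; the constants here are invented):

```python
N_SHARDS = 1024   # hypothetical number of shards

def object_shard(object_id):
    return object_id % N_SHARDS        # an object stays on its shard for life

def assoc_shard(id1, atype, id2):
    return object_shard(id1)           # (id1, atype, id2) lives on id1's shard,
                                       # so assoc_range(id1, atype) hits one server
```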
100Slide101
Follower and Leader Caches
Follower cache
Database
Web servers
Leader cache
101
This slide was taken from Facebook TAO’s presentation ATC`13 Slide102
Write-through Caching – Association Lists
Follower cache
Database
Web servers
X,…
X,A,B,C
Leader cache
X,A,B,C
Y,A,B,C
Y,A,B,C
X –> Y
X –> Y
X –> Y
ok
ok
refill X
refill X
ok
Y,…
X,A,B,C
Y,A,B,C
range get
102
This slide was taken from Facebook TAO’s presentation ATC`13 Slide103
Asynchronous DB Replication
Follower cache
Database
W
eb servers
Master data center
Replica data center
Leader
cache
Inval
and refill embedded in SQL
Writes forwarded to master
Delivery after DB replication done
103
This slide was taken from Facebook TAO’s presentation ATC`13 Slide104
Improving Availability: Read Failover
Follower cache
Database
Web servers
Master data center
Replica data
c
enter
Leader
cache
104
This slide was taken from Facebook TAO’s presentation ATC`13 Slide105
Hotspots
Objects are semi-randomly distributed among shards
Load imbalance: a post by Justin Bieber vs. a post by Victor Zakhary
Load imbalance → replicate (more followers to handle the same object)
→ more invalidation messages
Cache hot objects at the client
An additional level of invalidation
Alleviates request load on the caching servers
105Slide106
Inverse associations
Bidirectional relationships have separate a→b and b→a edges
inv_type(LIKES) = LIKED_BY
inv_type(FRIEND_OF) = FRIEND_OF
Forward and inverse types linked only during write
TAO assoc_add will update both
Not atomic, but failures are logged and repaired
[Diagram: Nathan and Carol linked by FRIEND_OF edges in both directions; a post "On the summit" with AUTHOR/AUTHORED_BY and LIKES/LIKED_BY edge pairs]
106
This slide was taken from Facebook TAO’s presentation ATC`13 Slide107
Hardware Advances for Scale
107Slide108
FaRM
: Fast Remote Memory
Data is in-memory
Accessed using RDMA (Remote Direct Memory Access)
Avoid TCP/IP bottlenecks
Still, memory access is non-uniform (NUMA): local memory is ~23x faster to access than remote memory
108
Dragojević
, Aleksandar, et al. "
FaRM
: Fast remote memory." Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation. 2014.
This slide was taken
FaRM’s
presentation NSDI`14 Slide109
FaRM: Fast Remote Memory
Communication primitives: using RDMA primitives
Distributed memory management: shared address space, lock-free reads
Distributed transaction management: strong consistency guarantees
Distributed key/value store
On the Facebook TAO workload: 10x the throughput and 40–50x lower latency on a cluster of 20 machines
109
Dragojević
, Aleksandar, et al. "
FaRM: Fast remote memory." Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation. 2014.Slide110
FaRM
- Communication
Primitives
RDMA read to read remote objects
RDMA write to send a message
Messages are put into a circular buffer
The receiver updates the sender after processing messages
110Slide111
FaRM
- Distributed Memory Management
111
This slide was taken
FaRM’s
presentation NSDI`14 Slide112
Traditional Lock-Free Read
V
1- Read Version
2 - Read Data
3- Read Version
Consistent if both versions are equal
3 accesses, not suitable for RDMA
112
This slide was taken
FaRM’s
presentation NSDI`14 Slide113
Hardware supported Lock-Free Read
V
V
V
Coherent DMA cache line versioning:
Invalidate writes
Flush reads
Read all cache lines of the object
If all the versions are the same then the read is consistent
113
This slide was taken
FaRM’s
presentation NSDI`14 Slide114
MICA: A Holistic Approach to Fast In-Memory Key Value Storage
Boost the performance of a single machine
65.6 to 76.9 million key-value operations per second
Using a single general-purpose multi-core system
114
Lim,
Hyeontaek
, et al. "MICA: A holistic approach to fast in-memory key-value storage." management 15.32 (2014): 36.
This slide was taken MICA’s presentation NSDI`14 Slide115
Boost the performance of a single machine
65.6 to 76.9 million key-value operations per second
Using a single general-purpose multi-core system
MICA: A Holistic Approach to Fast In-Memory Key Value Storage
115
Lim,
Hyeontaek
, et al. "MICA: A holistic approach to fast in-memory key-value storage." management 15.32 (2014): 36.
This slide was taken MICA’s presentation NSDI`14 Slide116
MICA: A Holistic Approach to Fast In-Memory Key Value Storage
Boost the performance of a single machine
65.6 to 76.9 million key-value operations per second
Using a single general-purpose multi-core system
116Slide117
Caching as a Service
Memcachier
Clients focus on the app; Memcachier handles:
Allocation and cluster management
Availability
Monitoring dashboard
Amazon ElastiCache
Isolation of clients
Utilization
117Slide118
FairRide
: Near-Optimal, Fair Cache Sharing
Low latency
Reduce load on the backend
118
Pu,
Qifan
, et al. "
Fairride
: Near-optimal, fair cache sharing." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, 2016.
This slide was taken
FairRide’s
presentation NSDI`16 Slide119
FairRide
: Near-Optimal, Fair Cache Sharing
LRU, MRU, ….
Prone to Strategic behavior
119
Pu,
Qifan
, et al. "
Fairride
: Near-optimal, fair cache sharing." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, 2016.
This slide was taken
FairRide’s
presentation NSDI`16 Slide120
FairRide
: Near-Optimal, Fair Cache Sharing
Statically allocated
Isolation
Strategy-proof
Globally shared
Utilization
120
Pu,
Qifan
, et al. "
Fairride
: Near-optimal, fair cache sharing." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, 2016.
This slide was taken
FairRide’s
presentation NSDI`16 Slide121
FairRide
: Near-Optimal, Fair Cache Sharing
A modification to max-min fairness
Probabilistically block requests from users based on their behavior
Disincentivizes users from cheating
The more a user cheats, the more that user is hurt
Max-Min
FairRide
121Slide122
Slicer: Auto-Sharding for Datacenter Applications
Provides auto-sharding without tying it to storage
Separates assignment generation ("control plane") from request forwarding ("data plane") via a small interface
In a scalable, consistent, fault-tolerant manner
Reshards for capacity and failure adaptation, and for load balancing
122
Adya
, Atul, et al. "Slicer: Auto-sharding
for datacenter applications." USENIX Symposium on Operating Systems Design and Implementation. 2016.This slide was taken Slicer’s presentation OSDI`16 Slide123
Slicer Sharding Model
[Diagram: keys K1, K2, K3 are hashed into the space 0 … 2^63 − 1; contiguous ranges of that space ("slices") are assigned to caching servers]
Hash keys into 63-bit space
Assign ranges ("slices") of space to
servers
Split/Merge/Migrate slices for load balancing
“
Asymmetric replication”: more copies for hot slices
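A toy sketch of slice lookup on the data plane (illustrative only; real Slicer assignments come from the control plane and slices are split, merged, and migrated over time):

```python
import bisect
import hashlib

def slice_key(key):
    """Hash a key into Slicer's 63-bit space."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") >> 1         # 63-bit value

# a hypothetical assignment: slice start -> server, covering 0 .. 2**63 - 1
assignment = [(0, "srv-a"), (2**61, "srv-b"), (2**62, "srv-c"), (3 * 2**61, "srv-d")]

def server_for(key):
    starts = [start for start, _ in assignment]
    i = bisect.bisect_right(starts, slice_key(key)) - 1   # slice containing the key
    return assignment[i][1]

print(server_for("user:42"))
```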
123
Adya
, Atul, et al. "Slicer: Auto-
sharding
for datacenter applications." USENIX Symposium on Operating Systems Design and Implementation. 2016.
This slide was taken Slicer’s presentation OSDI`16 Slide124
Slicer Overview
Frontends
Caching
servers
Slicelet
Clerk
124
Distributed data plane
Centralized control plane
Hash(key)
Hash(key)
Slicer Service
Adya
, Atul, et al. "Slicer: Auto-
sharding
for datacenter applications." USENIX Symposium on Operating Systems Design and Implementation. 2016.
This slide was taken Slicer’s presentation OSDI`16 Slide125
Slicer Architecture
Frontends
Caching
servers
Slicelet
Clerk
Distributor
Backup
Distributor
Assigner
Existing Google
Infrastructure
125
Capacity Monitoring
Health Monitoring
Load Monitoring
Lease Manager
This slide was taken Slicer’s presentation OSDI`16 Slide126
Summary
Replacement policy
Partitioning caches, fragmentation
Cliffhanger
Cuckoo hashing
Cluster management: Memcached, Redis
Cross-datacenter caching: Memcache at Facebook
Domain-knowledge caching: TAO at Facebook
126Slide127
Summary
Hardware-supported distributed caches: FaRM, MICA
Caching as a service: Memcachier, ElastiCache
Fairness for caching: FairRide
Auto-sharding: Slicer
127Slide128
Open problems
Load balancing and utilization
Distribute keys equally between shards
Memory utilization
Load per key (hotspot problem)
Re-sharding: mapping?
Replication
Propagate updates
Consistency vs Latency
Geo-replication
Consistency vs Latency
128Slide129
Open problems
Domain-knowledge caching
Caching for distributed real-time analytics
Advances in hardware: SDN, RDMA, NVM, SSDs
Datacenters, edge, and fog computing
129Slide130
References and Sources
http://www.visualcapitalist.com/what-happens-internet-minute-2016/
http://norvig.com/21-days.html#answers
Tanenbaum, Andrew S., and Herbert Bos. Modern Operating Systems. Prentice Hall Press, 2014.
Megiddo, Nimrod, and Dharmendra S. Modha. "ARC: A Self-Tuning, Low Overhead Replacement Cache." FAST. Vol. 3. 2003.
https://memcached.org/
Fitzpatrick, B. "Distributed caching with memcached." Linux Journal 2004, 124 (Aug. 2004), 5.
Karger, David, et al. "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web." Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. ACM, 1997.
Karger, David, et al. "Web caching with consistent hashing." Computer Networks 31.11 (1999): 1203-1213.
130Slide131
References and Sources
https://redis.io/
Cidon, Asaf, et al. "Dynacache: Dynamic Cloud Caching." HotStorage. 2015.
Cidon, Asaf, et al. "Cliffhanger: Scaling performance cliffs in web memory caches." USENIX NSDI. 2016.
Pagh, Rasmus, and Flemming Friche Rodler. "Cuckoo hashing." Journal of Algorithms 51.2 (2004): 122-144.
Fan, Bin, David G. Andersen, and Michael Kaminsky. "MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing." NSDI. Vol. 13. 2013.
Li, Xiaozhou, et al. "Algorithmic improvements for fast concurrent cuckoo hashing." Proceedings of the Ninth European Conference on Computer Systems. ACM, 2014.
Nishtala, Rajesh, et al. "Scaling Memcache at Facebook." NSDI. Vol. 13. 2013.
Bronson, Nathan, et al. "TAO: Facebook's Distributed Data Store for the Social Graph." USENIX Annual Technical Conference. 2013.
131Slide132
References and Sources
Dragojević, Aleksandar, et al. "FaRM: Fast remote memory." Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation. 2014.
Lim, Hyeontaek, et al. "MICA: A holistic approach to fast in-memory key-value storage." NSDI. 2014.
Pu, Qifan, et al. "FairRide: Near-optimal, fair cache sharing." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, 2016.
Adya, Atul, et al. "Slicer: Auto-sharding for datacenter applications." USENIX Symposium on Operating Systems Design and Implementation. 2016.
132Slide133
Questions
Thank you
133