Fast Crash Recovery in RAMCloud
Motivation
The role of DRAM has been increasing
Facebook used 150TB of DRAM
For 200TB of disk storage
However, there are limitations
DRAM is typically used as a cache
Need to worry about consistency and cache misses
RAMCloud
Keeps all data in RAM at all times
Designed to scale to thousands of servers
To host terabytes of data
Provides low-latency access (5-10 µs) for small reads
Design goals
High durability and availability
Without compromising performance
Alternatives
Triple replication in DRAM
3x cost and energy
Still vulnerable to power failures
RAMCloud keeps one copy in RAM
Plus two copies on disk
To achieve good availability
Fast crash recovery (64GB in 1-2 seconds)
RAMCloud Basics
Thousands of off-the-shelf servers
Each with 64GB of RAM
With Infiniband NICs
Remote access below 10 µs
Data Model
Key-value store
Tables of objects
Object
64-bit ID + byte array (up to 1MB) + 64-bit version number
No atomic updates to multiple objects
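The data model above can be sketched as a small record type; the class and field names here are illustrative, not RAMCloud's actual API:

```python
from dataclasses import dataclass

MAX_VALUE_BYTES = 1 << 20  # values are capped at 1MB (per the slide)

@dataclass
class RamcloudObject:
    """Illustrative sketch of a RAMCloud object (names are hypothetical)."""
    object_id: int    # 64-bit identifier within a table
    value: bytes      # opaque byte array, up to 1MB
    version: int = 1  # 64-bit version number, bumped on every update

    def update(self, new_value: bytes) -> None:
        if len(new_value) > MAX_VALUE_BYTES:
            raise ValueError("object value exceeds the 1MB limit")
        self.value = new_value
        self.version += 1  # versions enable conditional updates on a single object

obj = RamcloudObject(object_id=42, value=b"hello")
obj.update(b"world")
```

Since there are no atomic multi-object updates, the version number only guards changes to one object at a time.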
System Structure
A large number of storage servers
Each server hosts
A master, which manages objects in local DRAM and services requests
A backup, which stores copies of objects from other masters on durable storage
A coordinator
Manages configuration info and object locations
Not involved in most requests
RAMCloud Cluster Architecture
(diagram: clients, masters, backups, and the coordinator)
More on the Coordinator
Maps objects to servers in units of tablets
Tablets hold consecutive key ranges within a single table
For locality reasons
Small tables are stored on a single server
Large tables are split across servers
Clients can cache the tablet map to access servers directly
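One way to picture the tablet map is a per-table sorted list of key-range start points; this is a sketch for illustration, not the coordinator's actual data structure:

```python
import bisect

class TabletMap:
    """Sketch of tablet-to-server mapping (structure and names are illustrative)."""
    def __init__(self):
        self._tablets = {}  # table_id -> sorted list of (start_key, server)

    def add_tablet(self, table_id, start_key, server):
        bisect.insort(self._tablets.setdefault(table_id, []), (start_key, server))

    def locate(self, table_id, key):
        # Find the tablet whose key range covers `key`: the rightmost
        # tablet whose start_key is <= key.
        ranges = self._tablets[table_id]
        starts = [start for start, _ in ranges]
        return ranges[bisect.bisect_right(starts, key) - 1][1]

tm = TabletMap()
tm.add_tablet(table_id=1, start_key=0, server="server-A")        # first half of a large table
tm.add_tablet(table_id=1, start_key=1 << 32, server="server-B")  # second half, split off
```

A client would cache this map and talk to servers directly, falling back to the coordinator only when a cached entry turns out to be stale.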
Log-structured Storage
Logging approach
Each master logs data in memory
Log entries are forwarded to backup servers
Backup servers buffer log entries in battery-backed memory
Writes complete once all backup servers acknowledge
A backup server flushes its buffer to disk when full
8MB segments are the unit of logging, buffering, and I/O
Each server can handle 300K 100-byte writes/sec
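The write path above can be sketched as follows; class names and the synchronous flush are simplifications (real backups flush asynchronously from battery-backed buffers):

```python
SEGMENT_BYTES = 8 * 1024 * 1024  # 8MB segments (per the slide)

class Backup:
    """Buffers incoming log data; flushes a segment's worth to 'disk' when full."""
    def __init__(self):
        self.buffer = bytearray()
        self.flushed_segments = 0

    def append(self, data: bytes) -> bool:
        self.buffer.extend(data)
        if len(self.buffer) >= SEGMENT_BYTES:
            self.flushed_segments += 1  # stand-in for an asynchronous disk write
            self.buffer = bytearray()
        return True                     # acknowledgment back to the master

class Master:
    """Appends to its in-memory log, then waits for every backup to acknowledge."""
    def __init__(self, backups):
        self.log = []
        self.backups = backups

    def write(self, data: bytes) -> bool:
        self.log.append(data)                         # the single DRAM copy
        acks = [b.append(data) for b in self.backups]
        return all(acks)                              # durable once all backups ack

master = Master([Backup(), Backup(), Backup()])
ok = master.write(b"x" * 100)
```

The key property is that a write does not return until every backup has the data buffered, so the DRAM copy is never the only copy.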
Recovery
When a server crashes, its DRAM content must be reconstructed
1-2 second recovery time is good enough
Using Scale
Simple 3-replica approach
Recovery limited by the speed of three disks
3.5 minutes to read 64GB of data
Scattered over 1,000 disks
Takes 0.6 seconds to read 64GB
Centralized recovery master becomes a bottleneck
A 10 Gbps network means ~1 minute to transfer 64GB of data to the centralized master
RAMCloud
Uses 100 recovery masters
Cuts the time down to 1 second
Scattering Log Segments
Ideally segments would be scattered uniformly, but several constraints apply
Need to avoid correlated failures
Need to account for heterogeneity of hardware
Need to coordinate placement so buffers on individual machines do not overflow
Need to account for changing server membership due to failures
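The constraints above suggest randomized placement with rejection; this is a simplified sketch (all names are illustrative, and it omits the real system's disk-speed weighting and membership tracking):

```python
import random

def choose_backups(racks, per_rack_servers, buffer_free, n_replicas=3, master_rack=0):
    """Pick backups for a segment at random, rejecting candidates that
    share the master's rack (correlated failure) or have no buffer space."""
    chosen = []
    while len(chosen) < n_replicas:
        cand = (random.randrange(racks), random.randrange(per_rack_servers))
        if cand[0] == master_rack:         # avoid correlated failure with the master
            continue
        if cand in chosen:                 # replicas must be on distinct servers
            continue
        if buffer_free.get(cand, 0) <= 0:  # don't overflow a backup's buffer
            continue
        chosen.append(cand)
        buffer_free[cand] -= 1
    return chosen

random.seed(1)  # deterministic for the example
free = {(rack, srv): 2 for rack in range(4) for srv in range(5)}
picks = choose_backups(racks=4, per_rack_servers=5, buffer_free=free)
```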
Failure Detection
Periodic pings to random servers
With a 99% chance of detecting a failed server within 5 rounds
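A simple model makes the 99% figure plausible: if each of N servers pings one random other server per round, the chance that a dead server goes unpinged for a whole round is about 1/e, so five rounds miss it with probability about e^-5 (N = 1000 is an assumption; the result barely depends on N once it is large):

```python
N = 1000  # assumed cluster size

# Probability that no server happens to ping the dead one in a single round:
p_miss_round = (1 - 1 / (N - 1)) ** (N - 1)  # approaches 1/e ≈ 0.368

p_detect_in_5 = 1 - p_miss_round ** 5         # comfortably above 0.99
```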
Recovery
Setup
Replay
Cleanup
Setup
Coordinator finds log segment replicas
By querying all backup servers
Detecting incomplete logs
Logs are self-describing
Starting Partition Recoveries
Each master periodically uploads a "will" to the coordinator, to be carried out in the event of its demise
Coordinator carries out the will accordingly
Replay
Parallel recovery
Six stages of pipelining
At segment granularity
Segments are processed in a consistent order at each stage to avoid pipeline stalls
Only the primary replicas are involved in recovery
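Conceptually, pipelined replay can be sketched by chaining stages as lazy maps, so one segment can be in a later stage while the next is in an earlier one; the stage names below are illustrative, not the paper's exact six:

```python
def pipeline(segments, stages):
    """Chain stages lazily so segments flow through them one at a time."""
    stream = iter(segments)
    for stage in stages:
        stream = map(stage, stream)
    return list(stream)

# Hypothetical stage names; each appends a tag to mark the work done:
stages = [
    lambda s: s + ":read",       # backup reads the segment replica from disk
    lambda s: s + ":partition",  # entries are divided among recovery masters
    lambda s: s + ":transmit",   # partitioned data is shipped over the network
    lambda s: s + ":replay",     # recovery master inserts entries into its hash table
]
out = pipeline(["seg0", "seg1"], stages)
```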
Cleanup
Get master online
Free up segments from the previous crash
Consistency
Exactly-once semantics
Implementation not yet complete
ZooKeeper
handles coordinator failures
Distributed configuration service
With its own replication
Additional Failure Modes
Current focus
Recover DRAM content for a single master failure
Failed backup server
Need to know what segments are lost from the server
Re-replicate those lost segments across remaining disks
Multiple Failures
Multiple servers fail simultaneously
Recover each failure independently
Some will involve secondary replicas
Based on projections
With 5,000 servers, recovering 40 masters within a rack will take about 2 seconds
Can’t do much when many racks are blacked out
Cold Start
Complete power outage
Backups will contact the coordinator as they reboot
Need a quorum of backups before starting to reconstruct masters
Current implementation does not perform cold starts
Evaluation
60-node cluster
Each node
16GB RAM, 1 disk
Infiniband (25 Gbps)
User-level apps can talk to NICs, bypassing the kernel
Results
Can recover lost data at 22 GB/s
A crashed server with 35 GB storage
Can be recovered in 1.6 seconds
Recovery time stays nearly flat from 1 to 20 recovery masters, each talking to 6 disks
60 recovery masters add only 10 ms of recovery time
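As a sanity check, the slide's figures are mutually consistent: recovering 35GB in 1.6 seconds implies an aggregate rate of about 22 GB/s:

```python
recovered_gb = 35   # crashed server's data
seconds = 1.6       # measured recovery time
throughput = recovered_gb / seconds  # 21.875 GB/s, matching the ~22 GB/s claim
```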
Results
Fast recovery significantly reduces the risk of data loss
Data is lost only if the remaining replicas also fail before recovery completes, so shrinking the recovery window shrinks the loss probability
Assume a recovery time of 1 sec
The risk of data loss for a 100K-node cluster is 10^-5 in one year
A 10x improvement in recovery time improves reliability by 1,000x
Assumes independent failures
Theoretical Recovery Speed Limit
Harder to be faster than a few hundred msec
150 msec to detect failure
100 msec to contact every backup
100 msec to read a single segment from disk
Risks
Scalability study based on a small cluster
May mistake performance glitches for failures
Triggering unnecessary recovery
Access patterns can change dynamically
May lead to unbalanced load