Presentation Transcript

Slide1

Fast Crash Recovery in RAMCloud

Slide2

Motivation

The role of DRAM has been increasing

Facebook used 150 TB of DRAM for 200 TB of disk storage

However, there are limitations

DRAM is typically used as a cache

Need to worry about consistency and cache misses

Slide3

RAMCloud

Keeps all data in RAM at all times

Designed to scale to thousands of servers hosting terabytes of data

Provides low latency (5-10 µs) for small reads

Design goals: high durability and availability without compromising performance

Slide4

Alternatives

3x replication in RAM: 3x the cost and energy, and still vulnerable to power failures

RAMCloud instead keeps one copy in RAM and two copies on disk

To achieve good availability: fast crash recovery (64 GB in 1-2 seconds)

Slide5

RAMCloud Basics

Thousands of off-the-shelf servers

Each with 64 GB of RAM and Infiniband NICs

Remote access below 10 µs

Slide6

Data Model

Key-value store: tables of objects

Object = 64-bit ID + byte array (up to 1 MB) + 64-bit version number

No atomic updates across multiple objects (a minimal sketch of this model follows)
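A minimal Python sketch of this data model; the class and method names are illustrative assumptions, not RAMCloud's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Object:
    object_id: int   # 64-bit identifier within a table
    value: bytes     # variable-length byte array (up to 1 MB)
    version: int = 0 # 64-bit version number, bumped on every write

@dataclass
class Table:
    objects: dict[int, Object] = field(default_factory=dict)

    def write(self, object_id: int, value: bytes) -> int:
        """Create or overwrite a single object; returns its new version."""
        old = self.objects.get(object_id)
        version = old.version + 1 if old else 1
        self.objects[object_id] = Object(object_id, value, version)
        return version

    def read(self, object_id: int) -> Object:
        return self.objects[object_id]
```

Each write touches exactly one object; there is no operation that atomically updates two objects, which is what the last bullet refers to.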

Slide7

System Structure

A large number of storage servers

Each server hosts a master, which manages the objects in its local DRAM and services requests, and a backup, which stores copies of objects from other masters on durable storage

A coordinator manages configuration information and object locations; it is not involved in most requests

Slide8

RAMCloud Cluster Architecture

(Diagram: clients issue requests to masters; each server runs a master and a backup; the coordinator tracks object locations.)

Slide9

More on the Coordinator

Maps objects to servers in units of tablets

A tablet holds a consecutive key range within a single table, for locality

Small tables are stored on a single server; large tables are split across servers

Clients can cache tablet locations to access servers directly (see the sketch below)
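A rough sketch of the tablet-lookup path with client-side caching; the names and data structures are assumptions for illustration, not the real interfaces:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tablet:
    table_id: int
    first_key: int  # first 64-bit object ID in the range (inclusive)
    last_key: int   # last object ID in the range (inclusive)
    server: str     # master currently serving this tablet

class Coordinator:
    """Owns the authoritative tablet map; contacted only on cache misses."""
    def __init__(self, tablets: list[Tablet]):
        self.tablets = tablets

    def locate(self, table_id: int, key: int) -> Tablet:
        for t in self.tablets:
            if t.table_id == table_id and t.first_key <= key <= t.last_key:
                return t
        raise KeyError((table_id, key))

class Client:
    """Caches tablet locations so most requests go straight to the right master."""
    def __init__(self, coordinator: Coordinator):
        self.coordinator = coordinator
        self.cached: list[Tablet] = []

    def server_for(self, table_id: int, key: int) -> str:
        for t in self.cached:
            if t.table_id == table_id and t.first_key <= key <= t.last_key:
                return t.server
        t = self.coordinator.locate(table_id, key)  # cache miss: ask the coordinator
        self.cached.append(t)
        return t.server
```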

Slide10

Log-structured Storage

Logging approach: each master logs data in memory

Log entries are forwarded to backup servers, which buffer them in battery-backed memory

Writes complete once all backup servers acknowledge

A backup flushes its buffer to disk when the segment is full

8 MB segments are the unit of logging, buffering, and I/O

Each server can handle 300K 100-byte writes/sec (write path sketched below)
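A simplified sketch of this write path; the class names and details are illustrative assumptions, not the actual implementation:

```python
SEGMENT_BYTES = 8 * 1024 * 1024   # 8 MB segments used for logging, buffering, and I/O

class Backup:
    def __init__(self):
        self.buffer = bytearray()    # battery-backed buffer (simulated here as RAM)
        self.disk: list[bytes] = []  # closed segments flushed to disk

    def append(self, entry: bytes) -> bool:
        self.buffer.extend(entry)
        return True                  # acknowledgement back to the master

    def close_segment(self) -> None:
        self.disk.append(bytes(self.buffer))  # flush the full segment to disk
        self.buffer = bytearray()

class Master:
    def __init__(self, backups: list[Backup]):
        self.log: list[bytes] = []   # in-memory log: the only copy kept in DRAM
        self.backups = backups       # backups holding replicas of the open segment
        self.segment_bytes = 0

    def write(self, entry: bytes) -> None:
        self.log.append(entry)
        # The write completes only once every backup has acknowledged the entry.
        assert all(backup.append(entry) for backup in self.backups)
        self.segment_bytes += len(entry)
        if self.segment_bytes >= SEGMENT_BYTES:
            for backup in self.backups:
                backup.close_segment()  # backups write the closed segment to disk
            self.segment_bytes = 0      # (a real master would open a fresh segment,
                                        #  possibly on a new set of backups)
```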

Slide11

Recovery

When a server crashes, its DRAM contents must be reconstructed

A 1-2 second recovery time is good enough to look like continuous availability

Slide12

Using Scale

With a simple 3-replica approach, recovery is limited by the speed of three disks: about 3.5 minutes to read 64 GB of data

Scattered over 1,000 disks, reading 64 GB takes about 0.6 seconds

But a single centralized recovery master becomes the bottleneck: a 10 Gbps network means about 1 minute to transfer 64 GB to that master (arithmetic sketched below)
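The back-of-envelope arithmetic behind these numbers; the ~100 MB/s per-disk bandwidth is an assumption, not stated on the slide:

```python
GB = 1e9
DISK_BW = 100e6      # assumed sequential disk bandwidth, bytes/s
NET_BW = 10e9 / 8    # 10 Gbps link in bytes/s
data = 64 * GB

print(data / (3 * DISK_BW))     # ~213 s (~3.5 min): reading from only 3 disks
print(data / (1000 * DISK_BW))  # ~0.64 s: reading from 1,000 disks in parallel
print(data / NET_BW)            # ~51 s (~1 min): funnelling 64 GB into one master
```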

Slide13

RAMCloud

Uses ~100 recovery masters in parallel, each recovering a partition of the crashed master's data

Cuts the recovery time down to about 1 second

Slide14

Scattering Log Segments

Ideally a uniform random scattering, but several details matter (see the placement sketch below):

Need to avoid correlated failures (e.g., replicas in the same rack)

Need to account for heterogeneity of hardware

Need to coordinate so that buffers on individual machines do not overflow

Need to account for changing server membership due to failures
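One plausible placement policy consistent with these constraints, loosely in the spirit of randomized choice with refinement; the candidate fields and parameters are assumptions for illustration:

```python
import random

def choose_backup(candidates, master_rack, existing_replica_racks, samples=5):
    """Pick a backup for a new segment replica: sample a few candidates at
    random, reject ones that would correlate failures (same rack as the
    master or as an existing replica) or overflow their buffer, then take
    the least-loaded survivor. Illustrative policy only."""
    picks = random.sample(candidates, min(samples, len(candidates)))
    ok = [b for b in picks
          if b["rack"] != master_rack
          and b["rack"] not in existing_replica_racks
          and b["buffer_free"] > 0]               # don't overflow a backup's buffer
    if not ok:
        return None                               # caller retries with a fresh sample
    return min(ok, key=lambda b: b["load"])       # account for slower/busier machines
```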

Slide15

Failure Detection

Servers periodically ping randomly chosen servers

99% chance of detecting a failed server within 5 rounds (checked in the sketch below)

Recovery proceeds in three phases: Setup, Replay, Cleanup
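A quick check of the five-round detection claim, assuming each of the other servers pings one uniformly random peer per round (a simplification of the actual protocol):

```python
def undetected_probability(rounds: int, n_servers: int) -> float:
    """Probability that a crashed server is pinged by nobody for `rounds` rounds,
    assuming each surviving server pings one random peer per round."""
    p_missed_per_round = (1 - 1 / n_servers) ** (n_servers - 1)  # ~1/e for large clusters
    return p_missed_per_round ** rounds

print(1 - undetected_probability(5, 1000))  # ~0.993, i.e. >99% detection within 5 rounds
```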

Slide16

Setup

Coordinator finds log segment replicas by querying all backup servers

Detecting incomplete logs: logs are self-describing, so the coordinator can tell whether every segment was located (see the check below)

Starting partition recoveries: each master periodically uploads a "will" to the coordinator, describing how to partition its data in the event of its demise

The coordinator carries out the will accordingly, assigning partitions to recovery masters
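A sketch of what a completeness check could look like, assuming the newest (head) segment carries a digest listing every segment ID in the log; the function and parameter names are mine, not the paper's:

```python
def log_is_complete(found_segments: dict[int, set[str]], head_digest: set[int]) -> bool:
    """found_segments: segment ID -> backups that reported holding a replica of it.
    head_digest: segment IDs listed in the digest of the newest (head) segment.
    The log is replayable only if at least one replica of every segment in the
    digest was located during the coordinator's query of all backups."""
    return all(found_segments.get(seg_id) for seg_id in head_digest)
```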

Slide17

Replay

Recovery masters replay segments in parallel

Six stages of pipelining, at segment granularity

The same ordering of operations is used across segments to avoid pipeline stalls

Only the primary replicas are involved in recovery (replay sketched below)
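A much-simplified view of what replay accomplishes per segment, collapsing the six-stage pipeline into one function; the names and entry format are assumptions:

```python
def replay_segment(entries, partition, recovered):
    """Replay one segment's entries on a recovery master.
    entries:   (table_id, object_id, version, value) tuples read from a backup
    partition: predicate selecting the objects this recovery master owns
    recovered: dict mapping (table_id, object_id) -> (version, value)
    Segments may arrive in any order, so an entry is applied only if its version
    is newer than whatever has been recovered for that object so far."""
    for table_id, object_id, version, value in entries:
        if not partition(table_id, object_id):
            continue
        key = (table_id, object_id)
        if key not in recovered or version > recovered[key][0]:
            recovered[key] = (version, value)
```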

Slide18

Cleanup

Get the recovered master online

Free up segments from the previous crash

Slide19

Consistency

Exactly-once semantics (implementation not yet complete)

ZooKeeper handles coordinator failures: a distributed configuration service with its own replication

Slide20

Additional Failure Modes

Current focus: recovering DRAM contents after a single master failure

Failed backup server: determine which segments lost replicas on that server, and re-replicate those segments across the remaining disks

Slide21

Multiple Failures

If multiple servers fail simultaneously, each failure is recovered independently

Some recoveries will have to use secondary replicas

Based on projection, with 5,000 servers, recovering the 40 masters within a failed rack would take about 2 seconds

Not much can be done when many racks are blacked out at once

Slide22

Cold Start

Complete power outage

Backups contact the coordinator as they reboot

A quorum of backups is needed before master reconstruction can start

The current implementation does not perform cold starts

Slide23

Evaluation

60-node cluster

Each node: 16 GB RAM, 1 disk, Infiniband (25 Gbps)

User-level applications can talk to the NICs directly, bypassing the kernel

Slide24

Results

Can recover lost data at 22 GB/s

A crashed server with 35 GB of data can be recovered in 1.6 seconds

Recovery time stays nearly flat from 1 to 20 recovery masters, each talking to 6 disks

Going to 60 recovery masters adds only 10 ms of recovery time

Slide25

Results

Fast recovery significantly reduces the risk of data loss

Assuming a recovery time of 1 second, the risk of data loss for a 100K-node cluster is about 10^-5 in one year

A 10x improvement in recovery time improves reliability by about 1,000x

Assumes independent failures

Slide26

Theoretical Recovery Speed Limit

Hard to go faster than a few hundred milliseconds:

150 ms to detect the failure

100 ms to contact every backup

100 ms to read a single segment from disk

Together these put a floor of roughly 350 ms on recovery time

Slide27

Risks

The scalability study is based on a small cluster

Performance glitches can be treated as failures, triggering unnecessary recoveries

Access patterns can change dynamically, which may lead to unbalanced load