Presentation Transcript

Slide1

the shared log approach to cloud-scale consistency

Mahesh Balakrishnan

Microsoft Research / VMware Research

Collaborators: Dahlia Malkhi, Ted Wobber, Vijayan Prabhakaran, Phil Bernstein, Ming Wu, Michael Wei, Dan Glick, John D. Davis, Aviad Zuck, Tao Zou, Sriram Rao

Slide2

anatomy of a distributed system

data is distributed, metadata is logically centralized

[figure: an HDFS example — the HDFS client talks to the HDFS namenode (metadata) and to the HDFS datanodes (data)]

- schedulers

- allocators

- coordinators

- namespaces

- indices

- controllers

- resource managers

filesystems, key-value stores, block stores, MapReduce runtimes, software-defined networks…

Slide3

the Achilles’ heel of the cloud

“Coordinator failures will be handled safely using the ZooKeeper service [14].” Fast Crash Recovery in RAMCloud, Ongaro et al., SOSP 2011.

“However, adequate resilience can be achieved by applying standard replication techniques to the decision element.” NOX: Towards an Operating System for Networks, Gude et al., SIGCOMM CCR 2008.

“Efforts are also underway to address high availability of a YARN cluster by having passive/active failover of RM to a standby node.” Apache Hadoop YARN: Yet Another Resource Negotiator, Vavilapalli et al., SoCC 2013.

“The NameNode is a Single Point of Failure for the HDFS cluster. HDFS is not currently a High Availability system. … needs active contributions to make it Highly Available.” wiki.apache.org, Nov 2011.

metadata is physically centralized; distribute later for durability / availability / scalability… but distributing a centralized service is difficult!

Slide4

problem #1: the abstraction gap

centralized metadata services are built using in-memory data structures (e.g. Java / C# Collections)

state resides in maps, trees, queues, counters, graphs…

transactional access to data structures

example: a scheduler atomically moves a node from a free list to an allocation map
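For concreteness, a minimal Java sketch of the kind of atomic, in-memory transition such a scheduler performs (the FreeList/AllocationMap shapes here are illustrative assumptions, not taken from the talk):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class Scheduler {
    private final Deque<String> freeList = new ArrayDeque<>();
    private final Map<String, String> allocationMap = new HashMap<>();  // node -> job

    // move a node from the free list to the allocation map in one atomic step
    synchronized String allocate(String jobId) {
        String node = freeList.poll();                      // take a free node, if any
        if (node != null) allocationMap.put(node, jobId);   // both updates happen together
        return node;
    }
}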

distributing a service requires different abstractions: move state to an external service like a DB or ZooKeeper… or implement distributed protocols

Slide5

problem #2: protocol spaghetti

inefficient when layered, unsafe when combined

caching, geo-mirroring, versioning, snapshots, rollback, elasticity…

logging, replication, sharding, transactions

Slide6

problem statement

metadata services are difficult to build, harden, and scale, due to:

restrictive abstractions

complex protocols

can we simplify the construction of distributed metadata services?

Slide7

the shared log abstraction

shared log API:

O = append(V)
V = read(O)
trim(O)      // GC
O = check()  // tail


clients can concurrently append to the log, read from anywhere in its body, check the current tail, and trim entries that are no longer needed.
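Expressed as a Java interface, the API above might look like the following sketch (the names are illustrative, not the actual CORFU client interface):

public interface SharedLog {
    long append(byte[] value);    // append V to the tail, returns its offset O
    byte[] read(long offset);     // read the value at offset O
    void trim(long offset);       // garbage-collect entries up to O
    long check();                 // return the current tail offset
}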

[figure: clients appending to and reading from a remote shared log]

Slide8

outline

a shared log is a powerful and versatile abstraction.

Tango (SOSP 2013) provides transactional in-memory data structures backed by a shared log.

the shared log abstraction can be implemented efficiently. CORFU (NSDI 2012) is a scalable, distributed shared log that supports millions of appends/sec.

a fast, scalable shared log enables fast, scalable distributed services. Tango+CORFU supports millions of transactions/sec.

Slide9

outline

a shared log is a powerful and versatile abstraction.

Tango (SOSP 2013) provides transactional in-memory data structures backed by a shared log.

the shared log abstraction can be implemented efficiently. CORFU (NSDI 2012) is a scalable, distributed shared log that supports millions of appends/sec.

a fast, scalable shared log enables fast, scalable distributed services. Tango+CORFU supports millions of transactions/sec.

Slide10

the shared log approach

the shared log is the source of

persistence

consistency

elasticity

atomicity and isolation

… across multiple objects

[figure: the shared log, holding commit records and uncommitted data]

a Tango object = view (an in-memory data structure) + history (its updates in the shared log)

no messages… only appends/reads on the shared log!

Tango objects are easy to use. Tango objects are easy to build.

[figure: at each client, an application runs on top of the Tango runtime]

Slide11

under the hood: Tango objects are easy to use

implement standard interfaces (Java/C# Collections)

linearizability for single operations

example:

curowner = ownermap.get("ledger");
if (curowner.equals(myname))
    ledger.add(item);

Slide12

under the hood: Tango objects are easy to use

implement standard interfaces (Java/C# Collections)

linearizability for single operations
serializable transactions

example:

TR.BeginTX();
curowner = ownermap.get("ledger");
if (curowner.equals(myname))
    ledger.add(item);
status = TR.EndTX();

TX commits if the read-set (ownermap) has not changed in the conflict window

TX commit record:
    read-set: (ownermap, ver:2)
    write-set: (ledger, ver:6)
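To make the conflict-window rule above concrete, a rough Java sketch (hypothetical types, not the actual Tango code) of the check a client could apply when it plays a commit record; every client sees the same log, so every client reaches the same decision:

import java.util.List;
import java.util.Map;

class CommitCheck {
    record ReadSetEntry(int objectId, long version) {}
    record CommitRecord(List<ReadSetEntry> readSet) {}

    // A TX commits iff no object in its read-set changed between the version it
    // read and the log position of its commit record.
    static boolean commits(CommitRecord rec, Map<Integer, Long> versionAtCommitPos) {
        for (ReadSetEntry e : rec.readSet()) {
            long current = versionAtCommitPos.getOrDefault(e.objectId(), 0L);
            if (current != e.version()) return false;   // conflict: abort
        }
        return true;                                    // no conflicts: commit
    }
}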

speculative commit records: each client decides if the TX commits or aborts independently but deterministically [similar to Hyder (Bernstein et al., CIDR 2011)]

Slide13

Tango objects are easy to build

class TangoRegister {
    int oid;
    TangoRuntime *T;
    int state;                     // object-specific state

    // invoked by the Tango runtime on EndTX to change state
    void apply(void *X) { state = *(int *)X; }

    // mutator: updates the TX write-set, appends to the shared log
    void writeRegister(int newstate) { T->update_helper(&newstate, sizeof(int), oid); }

    // accessor: updates the TX read-set, returns local state
    int readRegister() { T->query_helper(oid); return state; }
}

15 LOC == persistent, highly available, transactional register

Other examples:
Java ConcurrentMap: 350 LOC
Apache ZooKeeper: 1000 LOC
Apache BookKeeper: 300 LOC

simple API exposed by the runtime to the object: 1 upcall + two helper methods
arbitrary API exposed by the object to the application: mutators and accessors

Slide14

outline

a shared log is a powerful and versatile abstraction.

Tango (SOSP 2013) provides transactional in-memory data structures backed by a shared log.

the shared log abstraction can be implemented efficiently. CORFU (NSDI 2012) is a scalable, distributed shared log that supports millions of appends/sec.

a fast, scalable shared log enables fast, scalable distributed services. Tango+CORFU supports millions of transactions/sec.

Slide15

the CORFU design

[figure: the client stack — application / Tango runtime / CORFU]

CORFU API:

O = append(V)
V = read(O)
trim(O)      // GC
O = check()  // tail

append to the tail, read from anywhere

each entry maps to a replica set
passive flash units: write-once, sparse address spaces
smart client library

Slide16

the CORFU protocol: reads

[figure: a read. The client's Tango layer calls read(pos) on the CORFU library, which uses the projection to map the position to a replica set and a page, then issues read(D1/D2, page#) against the CORFU cluster. Projection: positions L0, L4, … live on drives D1/D2, L1, L5, … on D3/D4, L2, L6, … on D5/D6, and L3, L7, … on D7/D8, at pages 0, 1, …]
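A minimal Java sketch of the projection idea shown above (illustrative names, not the real CORFU client library): with 4 replica sets, position P lives on replica set (P mod 4) at page (P div 4).

import java.util.List;

class Projection {
    private final List<List<String>> replicaSets;   // e.g. [[D1,D2],[D3,D4],[D5,D6],[D7,D8]]

    Projection(List<List<String>> replicaSets) { this.replicaSets = replicaSets; }

    // which drives hold log position pos
    List<String> replicaSetFor(long pos) {
        return replicaSets.get((int) (pos % replicaSets.size()));
    }

    // which page on those drives holds log position pos
    long pageFor(long pos) {
        return pos / replicaSets.size();
    }
}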

Slide17

the CORFU protocol: appends

[figure: an append. The client asks the sequencer (T0) to reserve the next position in the log (e.g., 8), then uses the projection to issue write(D1/D2, val) against that position's replica set in the CORFU cluster.]

CORFU append throughput: # of 64-bit tokens issued per second
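A rough sketch of the append path (Projection is from the sketch above; Sequencer and PageStore are assumptions, not the real CORFU interfaces). The sequencer only hands out 64-bit tokens, which is why append throughput is counted in tokens issued per second:

interface Sequencer { long nextToken(); }
interface PageStore { void write(String drive, long page, byte[] data); }

class Appender {
    long append(Sequencer seq, Projection proj, PageStore store, byte[] value) {
        long pos = seq.nextToken();                       // reserve the next log position
        for (String drive : proj.replicaSetFor(pos)) {    // write down the chain, in order
            store.write(drive, proj.pageFor(pos), value);
        }
        return pos;                                       // the position is the entry's offset O
    }
}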


the sequencer is only an optimization! clients can probe for the tail or reconstruct it from the flash units

other clients can fill holes in the log caused by a crashed client

fast reconfiguration protocol: 10 ms for a 32-drive cluster

Slide18

chain replication in CORFU


safety under contention:

if multiple clients try to write to same log position concurrently, only one wins

writes to already written pages => error


durability:

data is only visible to reads if entire chain has seen it

reads on unwritten pages => error

requires write-once semantics from the flash unit
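A minimal Java sketch (illustrative, not the actual flash-unit interface) of the write-once page semantics CORFU relies on: the first write to a page wins, later writes fail, and reads of unwritten pages fail.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class WriteOnceUnit {
    private final Map<Long, byte[]> pages = new ConcurrentHashMap<>();

    // returns false (error) if the page was already written
    boolean write(long page, byte[] data) {
        return pages.putIfAbsent(page, data) == null;
    }

    // error on unwritten pages
    byte[] read(long page) {
        byte[] data = pages.get(page);
        if (data == null) throw new IllegalStateException("unwritten page");
        return data;
    }
}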

Slide19

outline

a shared log is a powerful and versatile abstraction.

Tango (SOSP 2013) provides transactional in-memory data structures backed by a shared log.

the shared log abstraction can be implemented efficiently. CORFU (NSDI 2012) is a scalable, distributed shared log that supports millions of appends/sec.

a fast, scalable shared log enables fast, scalable distributed services. Tango+CORFU supports millions of transactions/sec.

Slide20


the playback bottleneck: clients must read all entries, so each node's inbound NIC (10 Gbps) is the bottleneck

[figure: entries of objects A, B, and C interleaved in the shared log]

solution: the stream abstraction

readnext(streamid)
append(value, streamid1, …)
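An illustrative Java sketch of the stream abstraction (not the actual CORFU/Tango streams API): an entry can be appended to several streams at once, and each client plays back only the streams it cares about.

import java.util.Set;

interface StreamingLog {
    long append(byte[] value, Set<String> streamIds);  // one entry, tagged with many streams
    byte[] readNext(String streamId);                  // next entry on this stream, or null
    long check(String streamId);                       // tail of this stream
}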

[figure: objects — a free list, an aggregation tree, an allocation table — partitioned across the nodes]

each client only plays entries of interest to it


a fast shared log isn’t enough…

Slide21


beginTX; read A; write C; endTX

decision record with commit/abort bit

commit/abort? has A changed? don't know!

commit/abort? has A changed? yes, abort

txes over streams


node 1 helps node 2

Slide22


beginTX; read A, B; write C; endTX

commit/abort? has A changed? don't know!

commit/abort? has B changed? don't know!

distributed txes over streams


node 1 and node 2 help each other!
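A rough Java sketch of this "helping" idea (hypothetical; builds on the CommitCheck and StreamingLog sketches above): a node that hosts the read-set objects evaluates the commit record and appends a small decision record carrying the commit/abort bit, so nodes that do not play those streams can learn the outcome.

import java.util.Map;
import java.util.Set;

class Helper {
    void helpDecide(StreamingLog log, CommitCheck.CommitRecord rec,
                    Map<Integer, Long> versionAtCommitPos, Set<String> txStreams) {
        boolean commit = CommitCheck.commits(rec, versionAtCommitPos);  // deterministic check
        byte[] decision = encode(commit);                               // commit/abort bit
        log.append(decision, txStreams);    // decision record, visible to every interested node
    }

    private byte[] encode(boolean commit) { return new byte[] { (byte) (commit ? 1 : 0) }; }
}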

distributed transactions without a distributed (commit) protocol!

Slide23

research insights

a durable, iterable total order (i.e., a shared log) is a unifying abstraction for distributed systems, subsuming the roles of many distributed protocols

it is possible to impose a total order at speeds exceeding the I/O capacity of any single machine

a total order is useful even when individual nodes consume a subsequence of it

Slide24

how far is CORFU from Paxos?

Slide25

how far is CORFU from Paxos?

[figure: the flash drives in the CORFU cluster (D1–D8) play the acceptor role; clients play the learner role]

CORFU scales the Paxos acceptor role: each consensus decision is made by a different set of acceptors

streaming CORFU scales the Paxos learner role: each learner plays a subsequence of commands

Slide26

(recent) related work

[figure: a 2006–2014 timeline of related work in distributed txes and state machine replication — Chubby, Sinfonia, ZooKeeper, Percolator, Hyder, Walter, Megastore, Calvin, Spanner, Eve, Transaction Chains, RAFT, CORFU, Tango]

Slide27

evaluation: linearizable operations

adding more clients → more reads/sec … until the shared log is saturated
a beefier shared log → scaling continues…
ultimate bottleneck: the sequencer

a Tango object provides elasticity for strongly consistent reads

constant write load (10K writes/sec); each client adds 10K reads/sec (latency = 1 ms)

Slide28

evaluation: multi-object txes


18 clients, each hosting its own TangoMap
cross-partition tx: a client moves an element from its own TangoMap to some other client's TangoMap

similar scaling to 2PL… without a complex distributed protocol
over 100K txes/sec when 16% of txes are cross-partition

Tango enables fast, distributed transactions across multiple objects

More recent results: 1M txes/sec with 64% cross-partition txes

Slide29

ongoing work

CorfuDB: open source rewrite (www.github.com/corfudb)

Hopper: streams hop across different shared logs
geo-distribution (a log per site)
tiered storage (a log per storage class)

collaborations around shared log designs for:
single-machine block storage (with Hakim Weatherspoon)
mobile data-sharing (with Rodrigo Rodrigues)
multi-core data structures (with Marcos Aguilera, Sid Sen)
big data analytics (with Tyson Condie, Wyatt Lloyd)

Slide30

future work

can we get rid of complex distributed protocols?

(what this talk was about…)

can we get rid of complex storage stacks?

storage idioms: index/log/cache/RAID/buffer…
hypothesis: storage stacks can be constructed by composing persistent, transactional data structures: easy to synthesize code, predict performance, and discover new designs

Slide31

conclusion

Tango objects: data structures backed by a shared log

key idea: the shared log does all the heavy lifting

(durability, consistency, atomicity, isolation, elasticity…)

Tango objects are easy to use, easy to build, and fast…

… thanks to CORFU, a shared log without an I/O bottleneck