Slide 1

the shared log approach to cloud-scale consistency

Mahesh Balakrishnan
Microsoft Research / VMware Research

Collaborators: Dahlia Malkhi, Ted Wobber, Vijayan Prabhakaran, Phil Bernstein, Ming Wu, Michael Wei, Dan Glick, John D. Davis, Aviad Zuck, Tao Zou, Sriram Rao
Slide 2

anatomy of a distributed system

data is distributed, metadata is logically centralized

[figure: an HDFS client talking to HDFS datanodes (data) and the HDFS namenode (metadata)]

- schedulers
- allocators
- coordinators
- namespaces
- indices
- controllers
- resource managers

filesystems, key-value stores, block stores, MapReduce runtimes, software-defined networks…
Slide 3

the Achilles’ heel of the cloud

“Coordinator failures will be handled safely using the ZooKeeper service [14].” Fast Crash Recovery in RAMCloud, Ongaro et al., SOSP 2011.

“However, adequate resilience can be achieved by applying standard replication techniques to the decision element.” NOX: Towards an Operating System for Networks, Gude et al., Sigcomm CCR 2008.

“Efforts are also underway to address high availability of a YARN cluster by having passive/active failover of RM to a standby node.” Apache Hadoop YARN: Yet Another Resource Negotiator, Vavilapalli et al., SOCC 2013.

“The NameNode is a Single Point of Failure for the HDFS cluster. HDFS is not currently a High Availability system. … needs active contributions to make it Highly Available.” wiki.apache.org, Nov 2011.

metadata is physically centralized; distribute later for durability / availability / scalability… but distributing a centralized service is difficult!
Slide 4

problem #1: the abstraction gap

centralized metadata services are built using in-memory data structures (e.g., Java / C# Collections)
- state resides in maps, trees, queues, counters, graphs…
- transactional access to data structures
- example: a scheduler atomically moves a node from a free list to an allocation map

distributing a service requires different abstractions: move state to an external service like a DB / ZooKeeper… or implement distributed protocols
Slide 5

problem #2: protocol spaghetti

inefficient when layered, unsafe when combined:
- caching, geo-mirroring, versioning, snapshots, rollback, elasticity…
- logging, replication, sharding, transactions
Slide 6

problem statement

metadata services are difficult to build, harden, and scale, due to:
- restrictive abstractions
- complex protocols

can we simplify the construction of distributed metadata services?
Slide 7

the shared log abstraction

shared log API:
O = append(V)
V = read(O)
trim(O)     // GC
O = check() // tail

clients can concurrently append to the log’s tail, read from anywhere in its body, check the current tail, and trim entries that are no longer needed.

[figure: clients appending to the tail of a remote shared log and reading from anywhere in its body]
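As a sketch of how the four calls fit together, the API can be mocked with an in-memory log (illustrative Python only; the class and field names are assumptions, not the real CORFU client library):

```python
# Illustrative in-memory mock of the shared log API (hypothetical names,
# not the actual CORFU client library).
class SharedLog:
    def __init__(self):
        self.entries = {}   # offset -> value (sparse; trimmed entries removed)
        self.tail = 0       # next offset to be assigned

    def append(self, value):
        """Append value at the tail; return its offset O."""
        offset = self.tail
        self.entries[offset] = value
        self.tail += 1
        return offset

    def read(self, offset):
        """Read the value at a previously appended offset."""
        return self.entries[offset]

    def check(self):
        """Return the current tail offset."""
        return self.tail

    def trim(self, offset):
        """Mark an entry as no longer needed (GC)."""
        self.entries.pop(offset, None)

log = SharedLog()
o = log.append("hello")
assert log.read(o) == "hello"
assert log.check() == 1
log.trim(o)
```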
Slide 8

outline

- a shared log is a powerful and versatile abstraction. Tango (SOSP 2013) provides transactional in-memory data structures backed by a shared log.
- the shared log abstraction can be implemented efficiently. CORFU (NSDI 2012) is a scalable, distributed shared log that supports millions of appends/sec.
- a fast, scalable shared log enables fast, scalable distributed services. Tango + CORFU supports millions of transactions/sec.
Slide 9

outline

- a shared log is a powerful and versatile abstraction. Tango (SOSP 2013) provides transactional in-memory data structures backed by a shared log.
- the shared log abstraction can be implemented efficiently. CORFU (NSDI 2012) is a scalable, distributed shared log that supports millions of appends/sec.
- a fast, scalable shared log enables fast, scalable distributed services. Tango + CORFU supports millions of transactions/sec.
Slide 10

the shared log approach

the shared log is the source of:
- persistence
- consistency
- elasticity
- atomicity and isolation … across multiple objects

a Tango object = view (an in-memory data structure) + history (its updates in the shared log)

no messages… only appends/reads on the shared log!

Tango objects are easy to use and easy to build.

[figure: applications on multiple nodes running over Tango runtimes, all appending to and reading from one shared log containing commit records and uncommitted data]
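The view + history split above can be sketched in a few lines (illustrative Python; the class and method names are assumptions, and the real Tango runtime handles versioning and transactions on top of this):

```python
# Sketch (assumed names): a Tango-style object is an in-memory view
# reconstructed by playing back its update history from a shared log.
class LogBackedMap:
    def __init__(self, log):
        self.log = log        # shared history: list of (key, value) updates
        self.view = {}        # this client's in-memory view
        self.applied = 0      # how far into the history we have played

    def _sync(self):
        # play back any updates appended since the last sync
        while self.applied < len(self.log):
            key, value = self.log[self.applied]
            self.view[key] = value
            self.applied += 1

    def put(self, key, value):
        self.log.append((key, value))   # mutator: append to the log

    def get(self, key):
        self._sync()                    # accessor: sync the view, then read
        return self.view.get(key)

shared_history = []
m1 = LogBackedMap(shared_history)
m2 = LogBackedMap(shared_history)  # a second client over the same history
m1.put("ledger", "alice")
assert m2.get("ledger") == "alice"  # m2 learns the update via playback
```

No messages pass between m1 and m2; consistency comes entirely from the shared history.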
Slide 11

under the hood: Tango objects are easy to use

- implement standard interfaces (Java/C# Collections)
- linearizability for single operations

example:
curowner = ownermap.get("ledger");
if (curowner.equals(myname))
    ledger.add(item);
Slide 12

under the hood: Tango objects are easy to use

- implement standard interfaces (Java/C# Collections)
- linearizability for single operations
- serializable transactions

example:
TR.BeginTX();
curowner = ownermap.get("ledger");
if (curowner.equals(myname))
    ledger.add(item);
status = TR.EndTX();

the TX commits if its read-set (ownermap) has not changed in the conflict window.

TX commit record:
- read-set: (ownermap, ver:2)
- write-set: (ledger, ver:6)

speculative commit records: each client decides if the TX commits or aborts independently but deterministically [similar to Hyder (Bernstein et al., CIDR 2011)]
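The deterministic decision can be sketched as follows (illustrative Python; the record format and function names are assumptions, simplified from the version-based conflict check the slide describes):

```python
# Sketch (assumed record format): deciding a speculative commit record
# deterministically. A TX commits iff no object in its read-set was
# updated between the version it read and the commit record's position.
def decide(commit_record, log):
    """commit_record: {'pos': int, 'read_set': {obj: version_read}}.
    log: list where log[i] is the set of objects updated by entry i."""
    for obj, ver in commit_record['read_set'].items():
        # scan the conflict window: entries after the version read,
        # up to (but not including) the commit record itself
        for pos in range(ver + 1, commit_record['pos']):
            if obj in log[pos]:
                return 'abort'   # read-set changed: abort
    return 'commit'

# entries 0..6: the set of objects each log entry updates
log = [{'ownermap'}, set(), {'ownermap'}, set(), set(), set(), {'ledger'}]
# the TX read ownermap at version 2; its commit record sits at position 6
rec = {'pos': 6, 'read_set': {'ownermap': 2}}
assert decide(rec, log) == 'commit'   # no conflicting update in the window
# the same TX, but it read ownermap at version 0: entry 2 conflicts
rec2 = {'pos': 6, 'read_set': {'ownermap': 0}}
assert decide(rec2, log) == 'abort'
```

Because every client scans the same totally ordered prefix, all clients reach the same verdict independently.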
Slide 13

Tango objects are easy to build

class TangoRegister {
    int oid;
    TangoRuntime *T;
    int state;                // object-specific state
    // invoked by the Tango runtime on EndTX to change state
    void apply(void *X) { state = *(int *)X; }
    // mutator: updates the TX write-set, appends to the shared log
    void writeRegister(int newstate) {
        T->update_helper(&newstate, sizeof(int), oid);
    }
    // accessor: updates the TX read-set, returns local state
    int readRegister() {
        T->query_helper(oid);
        return state;
    }
};

15 LOC == a persistent, highly available, transactional register

other examples:
- Java ConcurrentMap: 350 LOC
- Apache ZooKeeper: 1000 LOC
- Apache BookKeeper: 300 LOC

simple API exposed by the runtime to the object: 1 upcall + two helper methods
arbitrary API exposed by the object to the application: mutators and accessors
Slide 14

outline

- a shared log is a powerful and versatile abstraction. Tango (SOSP 2013) provides transactional in-memory data structures backed by a shared log.
- the shared log abstraction can be implemented efficiently. CORFU (NSDI 2012) is a scalable, distributed shared log that supports millions of appends/sec.
- a fast, scalable shared log enables fast, scalable distributed services. Tango + CORFU supports millions of transactions/sec.
Slide 15

the CORFU design

CORFU API:
O = append(V)
V = read(O)
trim(O)     // GC
O = check() // tail

- each entry maps to a replica set
- passive flash units: write-once, sparse address spaces
- smart client library

[figure: the application over the Tango runtime over the CORFU library, appending to the tail of the cluster and reading from anywhere in it]
Slide 16

the CORFU protocol: reads

the client’s CORFU library maps a log position to a flash page on a replica set using a local Projection, then reads directly from the cluster:

read(pos) → read(D1/D2, page#)

Projection (positions striped across replica pairs, consecutive pages per pair):
- D1/D2: L0, L4, …
- D3/D4: L1, L5, …
- D5/D6: L2, L6, …
- D7/D8: L3, L7, …
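The projection in the figure can be written down directly (illustrative Python; the round-robin striping and group names follow the figure, and the function name is an assumption):

```python
# Sketch of the projection assumed from the slide's figure: log positions
# are striped round-robin across replica groups, and each group stores
# its positions at consecutive flash pages.
GROUPS = [("D1", "D2"), ("D3", "D4"), ("D5", "D6"), ("D7", "D8")]

def project(pos, groups=GROUPS):
    """Map a log position to (replica set, page number on those units)."""
    return groups[pos % len(groups)], pos // len(groups)

assert project(0) == (("D1", "D2"), 0)   # L0 -> first page on D1/D2
assert project(1) == (("D3", "D4"), 0)   # L1 -> first page on D3/D4
assert project(4) == (("D1", "D2"), 1)   # L4 -> second page on D1/D2
```

Because the projection is deterministic, a client needs no server round-trip to locate any log position.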
Slide 17

the CORFU protocol: appends

to append(val), the client first reserves the next position in the log (e.g., 8) from a sequencer node (T0), then writes directly to that position’s replica set: write(D1/D2, val).

CORFU append throughput: the number of 64-bit tokens issued per second.

the sequencer is only an optimization!
- clients can probe for the tail or reconstruct it from the flash units
- other clients can fill holes in the log caused by a crashed client
- fast reconfiguration protocol: 10 ms for a 32-drive cluster
Slide 18

chain replication in CORFU

safety under contention: if multiple clients try to write to the same log position concurrently, only one wins
- writes to already-written pages => error

durability: data is only visible to reads if the entire chain has seen it
- reads on unwritten pages => error

requires write-once semantics from the flash unit

[figure: clients C1, C2, and C3 writing through a two-unit chain (1 → 2)]
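Both properties fall out of write-once pages plus a fixed chain order, as this sketch shows (illustrative Python; the class and function names are assumptions):

```python
# Sketch of chain replication over write-once pages (assumed names).
# Writers go down the chain head-to-tail; a second write to the same
# page errors out, so at most one client wins each log position.
class WriteOnceUnit:
    def __init__(self):
        self.pages = {}
    def write(self, page, val):
        if page in self.pages:
            raise IOError("page already written")  # write-once semantics
        self.pages[page] = val
    def read(self, page):
        if page not in self.pages:
            raise IOError("page unwritten")
        return self.pages[page]

def chain_write(chain, page, val):
    for unit in chain:           # head to tail, in order
        unit.write(page, val)

def chain_read(chain, page):
    return chain[-1].read(page)  # read from the tail: the value is only
                                 # visible once the entire chain has it

chain = [WriteOnceUnit(), WriteOnceUnit()]
chain_write(chain, 0, "C1 data")
assert chain_read(chain, 0) == "C1 data"
try:
    chain_write(chain, 0, "C2 data")  # contending write loses at the head
    assert False
except IOError:
    pass
```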
Slide 19

outline

- a shared log is a powerful and versatile abstraction. Tango (SOSP 2013) provides transactional in-memory data structures backed by a shared log.
- the shared log abstraction can be implemented efficiently. CORFU (NSDI 2012) is a scalable, distributed shared log that supports millions of appends/sec.
- a fast, scalable shared log enables fast, scalable distributed services. Tango + CORFU supports millions of transactions/sec.
Slide 20

the playback bottleneck

a fast shared log isn’t enough… clients must read all entries, so each node’s inbound 10 Gbps NIC becomes a bottleneck.

solution: the stream abstraction
readnext(streamid)
append(value, streamid1, … )

each client only plays the entries of interest to it.

[figure: node 1 and node 2 hosting a free list, an aggregation tree, and an allocation table; without streams each node plays every A/B/C entry, with streams each plays only its own]
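The stream abstraction can be sketched as tagged log entries with per-stream cursors (illustrative Python; the class name and cursor mechanics are assumptions, and the real CORFU streams avoid scanning every entry):

```python
# Sketch (assumed semantics): each log entry is tagged with the stream
# ids it belongs to; readnext(streamid) skips entries a client is not
# interested in.
class StreamedLog:
    def __init__(self):
        self.entries = []          # (value, set of stream ids)
        self.cursors = {}          # stream id -> next position to scan

    def append(self, value, *streamids):
        self.entries.append((value, set(streamids)))

    def readnext(self, streamid):
        """Return the next entry in this stream, or None at the tail."""
        pos = self.cursors.get(streamid, 0)
        while pos < len(self.entries):
            value, ids = self.entries[pos]
            pos += 1
            if streamid in ids:
                self.cursors[streamid] = pos
                return value
        self.cursors[streamid] = pos
        return None

log = StreamedLog()
log.append("alloc page 7", "alloctable")
log.append("push node 3", "freelist")
log.append("move node 9", "freelist", "alloctable")  # multi-stream entry
assert log.readnext("freelist") == "push node 3"
assert log.readnext("freelist") == "move node 9"
assert log.readnext("alloctable") == "alloc page 7"
```

Note that an entry can belong to multiple streams, which is what lets a single append update several objects atomically.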
Slide 21

txes over streams

a TX (beginTX; read A; write C; endTX) appends a commit record that not every node can resolve:
- a node that does not play A’s stream: “commit/abort? has A changed? don’t know!”
- a node that does play A’s stream: “has A changed? yes — abort”

node 1 helps node 2: the node that knows appends a decision record with a commit/abort bit.

[figure: node 1 and node 2 hosting a free list, an aggregation tree, and an allocation table]
Slide 22

distributed txes over streams

a TX (beginTX; read A, B; write C; endTX) reads from streams split across nodes:
- one node plays A’s stream but not B’s: “has B changed? don’t know!”
- the other plays B’s stream but not A’s: “has A changed? don’t know!”

node 1 and node 2 help each other!
distributed transactions without a distributed (commit) protocol!

[figure: node 1 and node 2 hosting a free list, an aggregation tree, and an allocation table]
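One way this cooperative resolution could look is sketched below (illustrative Python only; the partial-decision record format, function names, and the commit rule are assumptions for this sketch, not Tango’s exact protocol):

```python
# Sketch (assumed record formats): resolving a speculative commit record
# cooperatively. Each node only tracks the objects in its own streams;
# a node appends a partial decision covering what it can check, and the
# TX aborts if any tracked object changed, commits once every object in
# the read-set has been cleared by some node.
def partial_decision(record, tracked_versions):
    """Check the part of the read-set this node tracks.
    tracked_versions: {obj: current version} for local objects only."""
    checked, abort = set(), False
    for obj, ver_read in record['read_set'].items():
        if obj in tracked_versions:
            checked.add(obj)
            if tracked_versions[obj] != ver_read:
                abort = True
    return {'txid': record['txid'], 'checked': checked, 'abort': abort}

def resolve(record, decisions):
    """Combine partial decision records appended by helper nodes."""
    if any(d['abort'] for d in decisions):
        return 'abort'
    checked = set().union(*(d['checked'] for d in decisions))
    if checked == set(record['read_set']):
        return 'commit'
    return 'undecided'   # still waiting for another node's help

rec = {'txid': 1, 'read_set': {'A': 3, 'B': 7}}
d1 = partial_decision(rec, {'A': 3})   # node 1 tracks A: unchanged
d2 = partial_decision(rec, {'B': 7})   # node 2 tracks B: unchanged
assert resolve(rec, [d1]) == 'undecided'
assert resolve(rec, [d1, d2]) == 'commit'
d2b = partial_decision(rec, {'B': 8})  # B changed since it was read
assert resolve(rec, [d1, d2b]) == 'abort'
```

The decisions themselves travel through the same totally ordered log, so no separate commit protocol runs between the nodes.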
Slide 23

research insights

- a durable, iterable total order (i.e., a shared log) is a unifying abstraction for distributed systems, subsuming the roles of many distributed protocols
- it is possible to impose a total order at speeds exceeding the I/O capacity of any single machine
- a total order is useful even when individual nodes consume a subsequence of it
Slide 24

how far is CORFU from Paxos?
Slide 25

how far is CORFU from Paxos?

CORFU scales the Paxos acceptor role: each consensus decision is made by a different set of acceptors.

streaming CORFU scales the Paxos learner role: each learner plays a subsequence of the commands.

[figure: the CORFU cluster (D1–D8) as rotating sets of acceptors, with clients as learners]
Slide 26

(recent) related work

[figure: timeline from 2006 to 2014 placing CORFU and Tango among related systems, spanning distributed txes and state machine replication: Chubby, Sinfonia, Percolator, ZooKeeper, Hyder, Megastore, Walter, Spanner, Calvin, RAFT, Transaction Chains, Eve]
Slide 27

evaluation: linearizable operations

a Tango object provides elasticity for strongly consistent reads:
- constant write load (10K writes/sec); each added client adds 10K reads/sec (latency = 1 ms)
- adding more clients => more reads/sec… until the shared log is saturated
- with a beefier shared log, scaling continues…
- ultimate bottleneck: the sequencer
Slide 28

evaluation: multi-object txes

18 clients, each hosting its own TangoMap; a cross-partition tx moves an element from one client’s TangoMap to another client’s TangoMap.

- similar scaling to 2PL… without a complex distributed protocol
- over 100K txes/sec when 16% of txes are cross-partition

Tango enables fast, distributed transactions across multiple objects.

more recent results: 1M txes/sec with 64% cross-partition txes
Slide 29

ongoing work

- CorfuDB: open source rewrite (www.github.com/corfudb)
- Hopper: streams hop across different shared logs
  - geo-distribution (a log per site)
  - tiered storage (a log per storage class)
- collaborations around shared log designs for:
  - single-machine block storage (with Hakim Weatherspoon)
  - mobile data-sharing (with Rodrigo Rodrigues)
  - multi-core data structures (with Marcos Aguilera, Sid Sen)
  - big data analytics (with Tyson Condie, Wyatt Lloyd)
Slide 30

future work

can we get rid of complex distributed protocols? (what this talk was about…)

can we get rid of complex storage stacks?
- storage idioms: index / log / cache / RAID / buffer…
- hypothesis: storage stacks can be constructed by composing persistent, transactional data structures
- easy to synthesize code, predict performance, and discover new designs
Slide 31

conclusion

- Tango objects: data structures backed by a shared log
- key idea: the shared log does all the heavy lifting (durability, consistency, atomicity, isolation, elasticity…)
- Tango objects are easy to use, easy to build, and fast…
- … thanks to CORFU, a shared log without an I/O bottleneck