Cezara Drăgoi (INRIA ENS CNRS), Thomas A. Henzinger (IST Austria), Damien Zufferey (MIT CSAIL). SNAPL, 2015.05.04. Fault-tolerant distributed algorithms: How to get it right when things go wrong.
Slide1
The Need for Language Support for Fault-Tolerant Distributed Systems
Cezara Drăgoi, INRIA ENS CNRS
Thomas A. Henzinger, IST Austria
Damien Zufferey, MIT CSAIL
SNAPL, 2015.05.04
Slide2
Fault-tolerant distributed algorithms
How to get it right when things go wrong?
Crash, network partition, …
Mean time to failure (things eventually go wrong)
Replication using Consensus
Agreement: All correct processes must agree on the same value.
Irrevocability: Every correct process decides at most one value.
Validity: If all processes propose the same value v, then all correct processes decide v.
Integrity: If value v is a decision, then v must have been proposed by some process.
Termination: Every correct process decides some value.
Slide3
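The five consensus properties listed above can be read as predicates over one finished run. The following is a minimal sketch under stated assumptions: the function name `check_consensus` and the run representation (a map of proposals, a map of decision lists, and a set of correct processes) are hypothetical and do not come from the slides.

```python
from typing import Dict, List, Set


def check_consensus(proposals: Dict[str, int],
                    decisions: Dict[str, List[int]],
                    correct: Set[str]) -> Dict[str, bool]:
    """Evaluate the five properties on one finished run (illustrative only).

    proposals: the value each process proposed.
    decisions: every value each process decided, in order (may be empty).
    correct:   the processes that did not fail during the run.
    """
    proposed = set(proposals.values())
    decided_by_correct = {v for p in correct for v in decisions.get(p, [])}
    decided_by_anyone = {v for vals in decisions.values() for v in vals}

    # Agreement: correct processes never decide two different values.
    agreement = len(decided_by_correct) <= 1
    # Irrevocability: a correct process decides at most one value.
    irrevocability = all(len(set(decisions.get(p, []))) <= 1 for p in correct)
    # Validity: if there is a single proposed value v, correct processes decide v.
    if len(proposed) == 1:
        validity = all(set(decisions.get(p, [])) == proposed for p in correct)
    else:
        validity = True
    # Integrity: every decided value was proposed by some process.
    integrity = decided_by_anyone <= proposed
    # Termination: every correct process decided something.
    termination = all(len(decisions.get(p, [])) >= 1 for p in correct)

    return {
        "agreement": agreement,
        "irrevocability": irrevocability,
        "validity": validity,
        "integrity": integrity,
        "termination": termination,
    }
```

A checker of this kind only inspects a single recorded run; it says nothing about runs that were not observed, which is why the later slides argue for models that make verification tractable.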
Our journey starts on the island of Paxos …
… where archeologists made an interesting discovery about a parliament system …
CC-BY-SA-NC Matt Taylor Copyright ACM
Slide4
The Paxos Algorithm [Lamport 98]
Used at Google (Chubby), Microsoft (Autopilot)
[Figure: message exchange between one Proposer and two Acceptors: Prepare, Promise, Accept, Accepted]
Slide5
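The Prepare, Promise, Accept, Accepted exchange named above can be sketched in memory as follows. This is a simplified illustration of single-decree Paxos under stated assumptions: the names `Acceptor`, `on_prepare`, `on_accept` and `propose` are hypothetical, calls are synchronous, and there is no network, failure, or retry logic.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Accepted = Tuple[int, str]  # (ballot, value) pair an acceptor has accepted


@dataclass
class Acceptor:
    promised: int = -1                   # highest ballot promised so far
    accepted: Optional[Accepted] = None  # last accepted (ballot, value), if any

    def on_prepare(self, ballot: int) -> Tuple[bool, Optional[Accepted]]:
        """Prepare -> Promise: promise only if the ballot is the highest seen."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def on_accept(self, ballot: int, value: str) -> bool:
        """Accept -> Accepted: accept unless a higher ballot was promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def propose(acceptors: List[Acceptor], ballot: int, value: str) -> Optional[str]:
    """Run both phases once; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: Prepare / Promise.
    replies = [a.on_prepare(ballot) for a in acceptors]
    promises = [acc for ok, acc in replies if ok]
    if len(promises) < majority:
        return None

    # If some acceptor already accepted a value, adopt the one with the
    # highest ballot instead of our own value.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda bv: bv[0])[1]

    # Phase 2: Accept / Accepted.
    acks = sum(1 for a in acceptors if a.on_accept(ballot, value))
    return value if acks >= majority else None
```

In this sketch a later proposer with a higher ballot is forced to re-propose a value that was already accepted, and a proposer with a stale ballot gets no promises. The difficulty the talk points at lies in everything this sketch leaves out: asynchrony, message loss, crashes, and recovery.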
Paxos in the Literature
The part-time parliament [Lamport 98]
Paxos made simple [Lamport 01]
Paxos made live: An engineering perspective [Chandra et al. 07]
In search of an understandable consensus algorithm [Ongaro and Ousterhout 14]
Paxos made moderately complex [van Renesse and Altinbuken 15]
...
Claim:
If it is hard, more of the same is not going to help.
Changing the way we think about it might.
Slide6
Why is the PL community concerned?
Quotes from Paxos made live [Chandra et al. 07]
“The fault-tolerance computing community has not developed the tools to make it easy to implement their algorithms.”
“The fault-tolerance computing community has not paid enough attention to testing, a key ingredient for building fault-tolerant systems.”
“In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.”
Slide7
Challenges to understanding what is going on
Parametric systems
Asynchrony (Interleaving, delays)
Channels
Faults
…
Slide8
Programming Models & Languages
Consensus is not solvable with asynchrony and faults ([FLP 85]).
Asynchronous
Synchronous (timed)
Actor model, CSP, CCS, pi-calculus, …
Not realistic for distributed systems
Many PLs based on or implementing those models
Timed automata, timed process calculi
Lustre, Esterel, Giotto, LabVIEW
?
Partial synchrony
Failure detectors
Crash-stop, crash-recovery
Benign, Byzantine faults
Faults introduce a middle ground
Alternation between synchronous and asynchronous periods
We don’t want a model/language for each variation.
We want a simple model that unifies all of them.
network contention
crash
Slide9
Structure of distributed algorithms: Communication-closed Rounds
[Figure: message exchange between one Proposer and two Acceptors: Prepare, Promise, Accept, Accepted]
[Elrad & Francez 82]: decomposition of algorithms into communication-closed rounds.
[Dwork & Lynch & Stockmeyer 88]: defines a round model for non-synchronous models (partial synchrony).
A round defines the scope of its messages.
Slide10
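One way to read "a round defines the scope of its messages" is that a message tagged with round r can only be consumed in round r. The sketch below illustrates that idea with a per-process mailbox; the class `RoundMailbox` and its methods are hypothetical names, and the policy of buffering early messages while discarding late ones is an assumption about how a runtime might enforce communication-closure, not something stated on the slides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (sender, payload)


class RoundMailbox:
    """Per-process mailbox that keeps rounds communication-closed."""

    def __init__(self) -> None:
        self.round = 0
        self.pending: Dict[int, List[Message]] = defaultdict(list)

    def deliver(self, msg_round: int, sender: str, payload: str) -> bool:
        """Called when a message arrives; returns False if it is discarded."""
        if msg_round < self.round:
            # Late message: its round is over, so it is out of scope.
            return False
        # Current or future round: keep it until that round is processed.
        self.pending[msg_round].append((sender, payload))
        return True

    def end_round(self) -> List[Message]:
        """Return the messages of the current round and move to the next."""
        received = self.pending.pop(self.round, [])
        self.round += 1
        return received
```

With this discipline a late message looks exactly like a lost one, so the algorithm only ever reasons about the set of messages it received within the round.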
Faults: the environment as an adversary.
Semantics:
Execution:
Compiler + runtime
Slide11
Benefits for verification
[Figure labels: Promise, Accept]
Reason about rounds in isolation.
Lock-step semantics, no interleaving.
Simple invariants that connect the rounds at their boundaries.
No messages in flight, only the local state of the processes.
Slide12
The Heard-Of model [Charron-Bost & Schiper 09]
Intuitive model: communication-closed rounds
send and update operations
Illusion of synchrony
a single process cannot distinguish between a synchronous and an asynchronous execution
Maps every fault to a message fault
A crashed process is the same as a process whose messages are dropped.
Byzantine faults can be simulated by altering messages
Simplifies the proofs: no need to case split on (in)correct processes
Handling transient/permanent faults is transparent at the algorithm level
Developed for theoretical simplicity
Slide13
Conclusion
Building fault-tolerant distributed systems is hard and important.
The current programming abstractions are inadequate.
The DA community has models that streamline fault handling.
We started to build a language around those ideas:
Key elements (HO-model):
Communication-closed rounds
Asynchrony and faults as an adversary that drops messages
Benefits:
Conceptually simpler
Automated reasoning/verification becomes possible
Acceptable runtime overhead (early results)
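The two key elements above, communication-closed rounds and an adversary that drops messages, can be sketched as a lock-step loop. This is an illustration only and not the language the authors built: `run_rounds`, the `heard_of` callback, and the "keep the minimum value" update rule are all assumptions chosen to keep the example small.

```python
from typing import Callable, Dict, Set

Process = int
# The adversary: for a round and a receiver, the set of senders heard from.
HeardOf = Callable[[int, Process], Set[Process]]


def run_rounds(init: Dict[Process, int],
               heard_of: HeardOf,
               rounds: int) -> Dict[Process, int]:
    """Execute `rounds` lock-step rounds of a toy send/update algorithm."""
    state = dict(init)
    for r in range(rounds):
        # send: every process broadcasts its current value.
        sent = dict(state)
        # update: each process sees only what the adversary let through.
        new_state = {}
        for p in state:
            mailbox = [sent[q] for q in heard_of(r, p) if q in sent]
            # Toy update rule (an assumption): keep the smallest value seen.
            new_state[p] = min([state[p]] + mailbox)
        state = new_state
    return state


def hear_everyone(r: int, p: Process) -> Set[Process]:
    """Fault-free adversary: every message is delivered."""
    return {0, 1, 2}


def process_1_crashed(r: int, p: Process) -> Set[Process]:
    """A crash of process 1, expressed as all of its messages being dropped."""
    return {0, 2}
```

Running `run_rounds({0: 5, 1: 3, 2: 7}, hear_everyone, 1)` and then the same call with `process_1_crashed` shows the point of the model: the algorithm code is unchanged, and only the heard-of sets differ between a fault-free run and a run with a crash.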