The Need for Language Support for Fault-Tolerant Distributed Systems

Presentation Transcript

Slide1

The Need for Language Support for Fault-Tolerant Distributed Systems

Cezara Drăgoi, INRIA ENS CNRS

Thomas A. Henzinger, IST Austria

Damien Zufferey, MIT CSAIL

SNAPL, 2015.05.04

Slide2

Fault-tolerant distributed algorithms

How to get it right when things go wrong?

Crash, network partition, …

Mean time to failure: things eventually go wrong.

Replication using consensus:

Agreement: Every correct process must agree on the same value.

Irrevocability: Every correct process decides at most one value.

Validity: If all processes propose the same value v, then all correct processes decide v.

Integrity: If value v is a decision, then v must have been proposed by some process.

Termination: Every correct process decides some value.

Slide3

Our journey starts on the island of Paxos …

… where archeologists made an interesting discovery about a parliament system …

CC-BY-SA-NC Matt Taylor Copyright ACM

Slide4

The Paxos Algorithm [Lamport 98]

Used at Google (Chubby), Microsoft (Autopilot)
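To make the Prepare/Promise and Accept/Accepted exchange concrete, here is a minimal Python sketch of single-decree Paxos with in-memory acceptors and one proposer attempt per ballot. The names `Acceptor`, `choose_value`, and `run_ballot` are hypothetical, and networking, persistence, learners, and leader election are deliberately left out; it illustrates the two phases, not a production implementation.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    """In-memory acceptor for single-decree Paxos (hypothetical sketch)."""
    promised: int = -1                          # highest ballot promised so far
    accepted: Optional[Tuple[int, Any]] = None  # (ballot, value) last accepted

    def on_prepare(self, ballot: int):
        """Phase 1: answer Prepare(ballot) with a Promise, or None to ignore it."""
        if ballot > self.promised:
            self.promised = ballot
            # The Promise carries the previously accepted (ballot, value), if any.
            return ("promise", ballot, self.accepted)
        return None

    def on_accept(self, ballot: int, value: Any):
        """Phase 2: answer Accept(ballot, value) with Accepted, unless a higher ballot was promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return ("accepted", ballot, value)
        return None


def choose_value(promises: List[tuple], proposal: Any) -> Any:
    """Proposer rule: reuse the value accepted with the highest ballot, else propose our own."""
    previous = [accepted for _, _, accepted in promises if accepted is not None]
    if previous:
        return max(previous, key=lambda bv: bv[0])[1]
    return proposal


def run_ballot(acceptors: List[Acceptor], ballot: int, proposal: Any) -> Optional[Any]:
    """One proposer attempt with a given ballot; returns the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    promises = [m for a in acceptors if (m := a.on_prepare(ballot)) is not None]
    if len(promises) < majority:
        return None
    value = choose_value(promises, proposal)
    accepted = [m for a in acceptors if (m := a.on_accept(ballot, value)) is not None]
    return value if len(accepted) >= majority else None


if __name__ == "__main__":
    acceptors = [Acceptor() for _ in range(3)]
    print(run_ballot(acceptors, 1, "a"))  # a
    print(run_ballot(acceptors, 2, "b"))  # a: ballot 2 must re-propose the chosen value
    print(run_ballot(acceptors, 1, "c"))  # None: stale ballot, no promises
```

Because a majority accepted "a" in ballot 1, the proposer of ballot 2 learns that value from the Promise replies and has to re-propose it instead of its own "b".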

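(Figure: message flow between one proposer and two acceptors. The proposer sends Prepare, the acceptors reply with Promise, the proposer sends Accept, and the acceptors reply with Accepted.)

Slide5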

Paxos in the Literature

The part-time parliament [Lamport 98]

Paxos made simple [Lamport 01]

Paxos made live: An engineering perspective [Chandra et al. 07]

In search of an understandable consensus algorithm [Ongaro and Ousterhout 14]

Paxos made moderately complex [van Renesse and Altinbuken 15]

...

Claim: If it is hard, more of the same is not going to help. Changing the way we think about it might.

Slide6

Why is the PL community concerned?

Quotes from Paxos made live [Chandra et al. 07]:

“The fault-tolerance computing community has not developed the tools to make it easy to implement their algorithms.”

“The fault-tolerance computing community has not paid enough attention to testing, a key ingredient for building fault-tolerant systems.”

“In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.”

Slide7

Challenges to understanding what is going on

Parametric systems

Asynchrony (Interleaving, delays)

Channels

Faults
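One way to see why asynchrony makes these systems hard to reason about: even with no delays, no channels, and no faults, the number of possible interleavings grows very quickly. The sketch below assumes n processes that each execute k atomic steps and counts the distinct global orderings, which is the multinomial coefficient (nk)!/(k!)^n; the function name `interleavings` is hypothetical.

```python
from math import factorial


def interleavings(n: int, k: int) -> int:
    """Distinct interleavings of n processes that each take k atomic steps.

    This is the multinomial coefficient (n*k)! / (k!)**n: choose positions in
    the global schedule for each process's steps, whose local order is fixed.
    """
    return factorial(n * k) // factorial(k) ** n


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"{n} processes x 3 steps: {interleavings(n, 3)} interleavings")
```

With three steps per process this gives 20, 1680, and 369600 interleavings for 2, 3, and 4 processes; message delays and faults only add more nondeterminism on top.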

Slide8

Programming Models & Languages

Consensus is not solvable with asynchrony and faults ([FLP 85]).

Asynchronous

Synchronous (timed)

Actor model, CSP, CCS, pi-calculus, …

Not realistic for distributed systems

Many PLs based on or implementing those models

Timed automata, timed process calculi

Lustre, Esterel, Giotto, LabVIEW

?

Partial synchrony

Failure detectors

Crash-stop, crash-recovery

Benign, Byzantine faults

Faults introduce a middle ground

Alternation between synchronous and asynchronous periods

We don’t want a model/language for each variation.

We want a simple model that unifies all of them.
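The middle ground can be made concrete with round timeouts. In the sketch below (a hypothetical setup, not taken from the talk), each process waits a fixed timeout per round and keeps only the messages that arrived in time. A synchronous period, network contention, and a crash then differ only in which senders each process hears from in that round, which is the intuition behind the Heard-Of model presented later in the talk. The delays, the 10 ms timeout, and the function name `heard_of` are assumptions for illustration.

```python
from typing import Dict, Optional, Set, Tuple

Edge = Tuple[int, int]  # (sender, receiver)


def heard_of(delays: Dict[Edge, Optional[float]], timeout: float, n: int) -> Dict[int, Set[int]]:
    """Senders each receiver hears from in one round.

    delays[(s, d)] is the delivery delay of s's message to d in this round,
    or None if the message never arrives. Late and lost messages are treated
    identically: they are simply not heard in this round.
    """
    heard: Dict[int, Set[int]] = {p: set() for p in range(n)}
    for (src, dst), delay in delays.items():
        if delay is not None and delay <= timeout:
            heard[dst].add(src)
    return heard


if __name__ == "__main__":
    n, timeout = 3, 10.0
    synchronous = {(s, d): 5.0 for s in range(n) for d in range(n)}
    # Network contention: two messages to process 1 are delayed past the timeout.
    contention = {**synchronous, (0, 1): 50.0, (2, 1): 50.0}
    # Crash: nothing sent by process 2 arrives any more.
    crash = {**synchronous, **{(2, d): None for d in range(n)}}
    for name, delays in [("synchronous", synchronous), ("contention", contention), ("crash", crash)]:
        print(name, heard_of(delays, timeout, n))
```

In the contention round process 1 hears only itself, and in the crash round nobody hears process 2: both faults end up expressed the same way, as missing messages.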

(Figure labels: network contention, crash.)

Slide9

Structure of distributed algorithms: Communication-closed Rounds

(Figure: the Paxos exchange from the previous diagram, with a proposer, two acceptors, and the Prepare, Promise, Accept, and Accepted messages.)

[Elrad & Francez 82]: decomposition of algorithms into communication-closed rounds.

[Dwork, Lynch & Stockmeyer 88]: defines a round model for non-synchronous systems: partial synchrony.

A round defines the scope of its messages.

Slide10

Faults: the environment as an adversary.
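The round semantics with the environment as an adversary can be sketched as a lock-step interpreter: in each round every process sends one message, the adversary picks for each receiver the set of senders it hears from, and every process updates its local state from that round's messages only. This is a minimal illustration, not the authors' compiler or runtime; the names `run_rounds`, `send_min`, `update_min`, `everyone`, and `crashed_0`, and the toy min-flooding algorithm, are hypothetical.

```python
from typing import Any, Callable, Dict, List, Set

Send = Callable[[int, int, Any], Any]                    # (round, pid, state) -> message
Update = Callable[[int, int, Any, Dict[int, Any]], Any]  # (round, pid, state, mailbox) -> state
Adversary = Callable[[int, int], Set[int]]               # (round, pid) -> senders heard


def run_rounds(states: List[Any], send: Send, update: Update,
               adversary: Adversary, rounds: int) -> List[Any]:
    """Execute communication-closed rounds in lock step.

    Messages exist only within their round: whatever the adversary does not
    deliver in round r is discarded, so no message is ever in flight between
    rounds and the global state is just the list of local states.
    """
    n = len(states)
    for r in range(rounds):
        messages = {p: send(r, p, states[p]) for p in range(n)}
        states = [
            update(r, p, states[p], {q: messages[q] for q in adversary(r, p) if q in messages})
            for p in range(n)
        ]
    return states


# Toy algorithm: every process keeps the smallest value it has heard so far.
def send_min(r: int, p: int, x: int) -> int:
    return x


def update_min(r: int, p: int, x: int, mailbox: Dict[int, int]) -> int:
    return min([x, *mailbox.values()])


def everyone(r: int, p: int) -> Set[int]:
    return {0, 1, 2}


def crashed_0(r: int, p: int) -> Set[int]:
    # Process 0 "crashes": the others never hear from it, but it still runs.
    return {0, 1, 2} if p == 0 else {1, 2}


if __name__ == "__main__":
    print(run_rounds([1, 3, 7], send_min, update_min, everyone, 2))   # [1, 1, 1]
    print(run_rounds([1, 3, 7], send_min, update_min, crashed_0, 2))  # [1, 3, 3]
```

In the second run process 0 behaves as if it had crashed: its messages are dropped, so the others never learn the value 1, yet the interpreter needs no special case for crashes.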

Semantics:

Execution: Compiler + runtime

Slide11

Benefits for verification

(Figure labels: Promise, Accept.)

Reason about rounds in isolation.

Lock-step semantics, no interleaving.

Simple invariants that connect rounds at their boundaries.

No messages in flight, only the local state of the processes.

Slide12

The Heard-Of model [Charron-Bost & Schiper 09]

Intuitive model: communication-closed rounds with send and update operations

Illusion of synchrony: a single process cannot distinguish between a synchronous and an asynchronous execution

Maps every fault to message faults:

A crashed process is the same as a process whose messages are dropped.

Byzantine faults can be simulated by altering messages.

Simplifies the proofs: no need to case-split on (in)correct processes

Handling transient/permanent faults is transparent at the algorithm level

Developed for theoretical simplicity

Slide13

Conclusion

Building fault-tolerant distributed systems is hard and important.

The current programming abstractions are inadequate.

The distributed algorithms (DA) community has models that streamline fault handling.

We started to build a language around those ideas:

Key elements (HO model):

Communication-closed rounds

Asynchrony and faults as an adversary that drops messages

Benefits:

Conceptually simpler

Automated reasoning/verification becomes possible

Acceptable runtime overhead (early results)
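As a small illustration of the automated reasoning that communication-closed rounds enable, the sketch below exhaustively checks a bounded instance of OneThirdRule, a consensus algorithm from the Heard-Of literature: in each round a process that hears from more than 2n/3 processes adopts the smallest most frequently received value, and decides v if more than 2n/3 of the received values equal v. The checker enumerates every binary input vector and every choice of heard-of sets for n = 3 and two rounds, and checks agreement, integrity, and irrevocability (from the consensus properties at the start of the talk) on local states at each round boundary. The bound, the binary inputs, the assumption that a process always hears itself, and all function names are choices made here to keep the example small; passing the check is evidence for this instance only, not a proof for all n.

```python
from collections import Counter
from itertools import combinations, product
from typing import FrozenSet, List, Optional, Sequence, Tuple


def one_third_rule_round(n: int, xs: List[int], decided: List[Optional[int]],
                         ho: Sequence[FrozenSet[int]]) -> Tuple[List[int], List[Optional[int]]]:
    """One round of OneThirdRule; ho[p] is the heard-of set of process p."""
    new_xs, new_decided = list(xs), list(decided)
    for p in range(n):
        received = [xs[q] for q in ho[p]]          # every process sends its x
        if 3 * len(received) > 2 * n:              # |HO(p)| > 2n/3
            counts = Counter(received)
            top = max(counts.values())
            new_xs[p] = min(v for v, c in counts.items() if c == top)
            for v, c in counts.items():
                if 3 * c > 2 * n:                  # more than 2n/3 copies of v
                    new_decided[p] = v
    return new_xs, new_decided


def ho_sets(p: int, n: int) -> List[FrozenSet[int]]:
    """All heard-of sets of process p, assuming p always hears itself."""
    others = [q for q in range(n) if q != p]
    return [frozenset((p, *c)) for k in range(n) for c in combinations(others, k)]


def check_one_third_rule(n: int = 3, rounds: int = 2, values: Sequence[int] = (0, 1)) -> int:
    """Enumerate all inputs and adversary choices; return the number of runs checked."""
    per_round = list(product(*(ho_sets(p, n) for p in range(n))))
    runs = 0
    for inputs in product(values, repeat=n):
        for schedule in product(per_round, repeat=rounds):
            xs, decided = list(inputs), [None] * n
            for ho in schedule:
                previous = decided
                xs, decided = one_third_rule_round(n, xs, decided, ho)
                # Invariants at the round boundary mention only local states.
                if any(a is not None and a != b for a, b in zip(previous, decided)):
                    raise AssertionError(f"irrevocability violated: {inputs}, {schedule}")
                decisions = {d for d in decided if d is not None}
                if len(decisions) > 1:
                    raise AssertionError(f"agreement violated: {inputs}, {schedule}")
                if not decisions <= set(inputs):
                    raise AssertionError(f"integrity violated: {inputs}, {schedule}")
            runs += 1
    return runs


if __name__ == "__main__":
    print(check_one_third_rule())  # 32768 runs: 8 input vectors x 64^2 schedules
```

Each extra round multiplies the number of schedules by 64 here, which is one reason to prefer the round-local invariants from the verification slide over brute-force enumeration.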