
Presentation Transcript

Slide1

Consensus

V. Arun
Computer Science
University of Massachusetts Amherst

1

Slide2

Problem statement

High-level goal: To allow a set of distributed processes to choose a decision value from a set of values proposed by one or more of them.

2

Slide3

Assumptions

Message passing: Processes can only communicate by sending messages to each other.
Asynchrony: Messages may be lost or take arbitrarily long to arrive. (Note: Assuming reliable, FIFO channels does not make the problem any simpler as long as delays are unbounded.)
Crash failures: Processes may crash for arbitrarily long (with no ability to send or receive messages while crashed) and then recover at any time.
Stable storage: Each process has access to stable storage that ensures durability across crashes.

3

Slide4

Safety requirements

Validity: Only a proposed value may be chosen as the decision.
Agreement: Only a single value may be chosen as the decision.
Non-triviality: No process knows the decision value before the protocol even starts.

Notes:
Not all processes may propose a value.
Validity implies that if no process proposed value V, then V cannot be chosen.
Not all processes may arrive at a decision value, but those that do must agree.

4

Slide5

Liveness properties

Termination: The protocol should terminate quickly, i.e., each process should arrive at a decision if at least one value was proposed and that process and enough others remain uncrashed and able to communicate in a timely manner for long enough.

Efficiency: The protocol should not be wasteful in terms of message overhead, computational overhead, or the amount of stable storage it requires.

Note: It is okay to compromise on liveness for safety.

5

Slide6

Classic “FLP” impossibility result

It is impossible for a consensus protocol to be safe and yet guarantee termination (in any finite amount of time) in an asynchronous (unbounded delays) environment if even a single process can crash (Fischer, Lynch, and Paterson, 1985).

6

Slide7

Party time! New Year’s eve meeting point

We all in class need to pick one of exactly two locations, Amherst or Northampton, on New Year's eve.

Assumptions:
We can send (individual) text messages to each other.

Some texts may take arbitrarily long; like really really long; like literally forever.

Sent/received texts are persistently stored.

Any of us can faint anytime for long periods and then recover later.

At 11:55pm, Dec 31, 2017, everyone not at home must gather at the same place: Amherst or Northampton

7

Slide8

Strawman 1

Init:
  rcvd_props = {}
  myProposal ← pick_rand({"Amherst", "Noho"})
  sendAll(PROPOSAL, myProposal, self)

Recv(PROPOSAL, prop, P):
  rcvd_props ← rcvd_props U [prop, P]
  if (count(rcvd_props, "Amherst") > N/2)
    stable_decision ← "Amherst"; sendAll(DECISION, "Amherst")
  if (count(rcvd_props, "Noho") > N/2)
    stable_decision ← "Noho"; sendAll(DECISION, "Noho")

Recv(DECISION, decn, P):
  stable_decision ← decn

8
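Why Strawman 1 can fail to terminate: with an even N, the random proposals can split exactly N/2 to N/2, so neither majority test ever fires and nobody decides; the Andrew amendment on the next slide breaks exactly this tie. A minimal Python sketch of the counting logic (illustrative, not from the slides):

from collections import Counter

def try_decide(rcvd_props, N):
    # rcvd_props maps each process to its proposed value.
    counts = Counter(rcvd_props.values())
    if counts["Amherst"] > N / 2:
        return "Amherst"
    if counts["Noho"] > N / 2:
        return "Noho"
    return None  # no strict majority: Strawman 1 stalls here

# With N = 4 and a 2-2 split, no decision is ever reached:
print(try_decide({1: "Amherst", 2: "Amherst", 3: "Noho", 4: "Noho"}, 4))  # None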

Slide9

Strawman 1 + Andrew amendment (AA)

Init:
  rcvd_props = {}
  myProposal ← pick_rand({"Amherst", "Noho"})
  sendAll(PROPOSAL, myProposal, self)

Recv(PROPOSAL, prop, P):
  rcvd_props ← rcvd_props U [prop, P]
  if (count(rcvd_props, "Amherst") > N/2)
    stable_decision ← "Amherst"; sendAll(DECISION, "Amherst")
  if (count(rcvd_props, "Noho") > N/2)
    stable_decision ← "Noho"; sendAll(DECISION, "Noho")
  if (count(rcvd_props, "Amherst") == N/2 && N%2 == 0)
    stable_decision ← "Amherst"; sendAll(DECISION, "Amherst")

Recv(DECISION, decn, P):
  stable_decision ← decn

9

Slide10

Strawman 1 + AA + logging

Init:
  rcvd_props = {}
  if not logged anything
    stable_log(myProposal ← pick_rand({"Amherst", "Noho"}))
  else
    myProposal ← logged proposal
  sendAll(PROPOSAL, myProposal, self)

Recv(PROPOSAL, prop, P):
  rcvd_props ← rcvd_props U [prop, P]
  if (count(rcvd_props, "Amherst") > N/2)
    stable_decision ← "Amherst"; sendAll(DECISION, "Amherst")
  if (count(rcvd_props, "Noho") > N/2)
    stable_decision ← "Noho"; sendAll(DECISION, "Noho")
  if (count(rcvd_props, "Amherst") == N/2 && N%2 == 0)
    stable_decision ← "Amherst"; sendAll(DECISION, "Amherst")

Recv(DECISION, decn, P):
  stable_decision ← decn

10

Slide11

New Year’s eve: What if 2+ choices?

Discussion

11

Slide12

(Single) Paxos

Phase 1:
A proposer selects a proposal number n and sends a PREPARE(n) to a majority of all* acceptors.
If an acceptor receives a PREPARE request with number n greater than that of any PREPARE to which it has already responded, then it responds with a promise not to accept any more proposals numbered less than n, along with the highest-numbered proposal (if any) it has accepted.

12

* Either one of "a majority of" or "all" is safe.
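A minimal Python sketch of the acceptor's Phase 1 behavior described above (the class and field names are illustrative assumptions, not from the slides):

class Acceptor:
    def __init__(self):
        self.promised_n = -1   # highest PREPARE number already responded to
        self.accepted = None   # highest-numbered accepted proposal (n, v), if any

    def on_prepare(self, n):
        # Promise only if n exceeds every PREPARE responded to so far.
        if n > self.promised_n:
            self.promised_n = n
            # The PREPARE-ACK reports the highest-numbered accepted proposal (or None).
            return ("PREPARE-ACK", n, self.accepted)
        return None  # ignore stale PREPAREs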

Slide13

(Single) Paxos (cont’d)

Phase 2:
If the proposer receives a response to its PREPARE(n) from a majority of acceptors, then it sends an ACCEPT request to each of those* acceptors for a proposal numbered n with a request value v, where v is the value of the highest-numbered proposal among the responses (in Phase 1B), or an arbitrarily chosen value if the responses reported no proposals.
If an acceptor receives an ACCEPT(n, v) message, it accepts the proposal unless it has already responded to a PREPARE with a number greater than n.

13

* Either one of "each of those" or "all" is safe.
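Continuing the illustrative Acceptor sketch from the previous slide, Phase 2 value selection on the proposer side and ACCEPT handling on the acceptor side might look like:

def phase2_value(promises, my_value):
    # promises: accepted proposals reported in PREPARE-ACKs, each (n, v) or None.
    reported = [p for p in promises if p is not None]
    if reported:
        # Must adopt the value of the highest-numbered reported proposal.
        return max(reported, key=lambda p: p[0])[1]
    return my_value  # free to pick any value if no proposals were reported

def on_accept(acceptor, n, v):
    # Accept unless a PREPARE with a number greater than n was already promised.
    if n >= acceptor.promised_n:
        acceptor.accepted = (n, v)
        return ("ACCEPT-ACK", n)
    return None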

Slide14

(Single) Paxos (cont’d)

The first two phases, conducted by proposers and acceptors, are crucial for safety. Learning the decision value is easier and constitutes the third phase.

14

Slide15

(Single) Phase 3

Phase 3:
If the proposer (or, for that matter, any process) gets an acknowledgment from some majority of acceptors for ACCEPT(n, v), then it decides v and can announce DECISION(v) to anyone.

If any process receives DECISION(v), it can safely adopt v as the decision.

15
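And the Phase 3 decision test, under the same illustrative assumptions (acks is the set of acceptors that acknowledged ACCEPT(n, v)):

def decided(acks, num_acceptors):
    # A value is decided once some majority of acceptors has acked ACCEPT(n, v);
    # anyone may then be sent DECISION(v) and can safely adopt it.
    return len(acks) > num_acceptors // 2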

Slide16

Key invariant maintained

For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either no acceptor in S has accepted any proposal numbered less than n, or v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.

This invariant is essential to ensuring that if a value v has been decided, then no proposer will even try to issue ACCEPT(*, v') for v' != v.

16

Slide17

Observations: progress

Why is progress not guaranteed? Two (or more) proposers could alternately keep issuing a sequence of PREPAREs, and possibly corresponding ACCEPTs, that supersede lower-numbered PREPAREs, without a decision ever being chosen.
Example: Proposer p completes Phase 1 with prepare number n1 and, before it completes Phase 2 at a majority of acceptors, proposer q issues Phase 1 with a higher prepare number n2 > n1. Proposer p could then again try to prepare n3 > n2, causing acceptors to ignore q's Phase 2, and so on.

17

Slide18

Observations: progress (cont’d)

Selecting a distinguished proposer, i.e., a unique coordinator designated to issue PREPAREs, ensures that it will eventually find a prepare number higher than any other and succeed in determining a decision, provided it can communicate with a majority of acceptors.
Uniqueness of the coordinator is not necessary for safety, only for progress, to prevent proposers from one-upping each other ad infinitum.
Long-lived, well-connected coordinators are critical in practice.
Failure detectors are needed for another member to try to take over as the coordinator (by just issuing a higher-numbered prepare) if the current coordinator is suspected faulty.

18

Slide19

Observations: prepare number

Prepare numbers could be drawn from an arbitrary, infinitely large set of totally ordered elements. In practice:
Number space: The prepare number space has to be finite, otherwise the size of protocol metadata will grow arbitrarily. Wraparound is handled by assuming, for safety, that the ballot increase rate and message reordering are small relative to the size of the space, avoiding ambiguities such as whether 1 > 0 or 1 < 0 (because 1 + MAX = 0).
ID: It is useful to look at a prepare number n and automatically infer who issued it, e.g., to check their health.
2-tuple ballots: A common technique (not necessary for safety) is to use "ballots" as prepare numbers, where a ballot <N,C> is a two-tuple consisting of a ballot number N and a corresponding issuing coordinator C. Ballot B1 = <N1,C1> is greater than B2 = <N2,C2> iff N1 > N2, or N1 == N2 and C1 > C2.

19
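The 2-tuple ballot order above is just lexicographic comparison; a minimal Python sketch:

from typing import NamedTuple

class Ballot(NamedTuple):
    number: int       # ballot number N
    coordinator: int  # issuing coordinator ID C (any totally ordered ID works)

# NamedTuples compare lexicographically, which matches the slide's rule:
# B1 > B2 iff N1 > N2, or N1 == N2 and C1 > C2.
assert Ballot(2, 1) > Ballot(1, 9)
assert Ballot(2, 3) > Ballot(2, 1)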

Slide20

Observations: logging

Every PREPARE or ACCEPT message must be logged to stable storage before being acknowledged, so that a member never reneges on the promise implicit in its PREPARE-ACK(n) or ACCEPT-ACK(n,*).
It is important to keep the logs from growing unbounded; this is especially critical for replicated state machines (coming next).
A common (but potentially problematic) optimization is to maintain only in-memory message logs and to collect the missing log entries upon recovery by contacting a majority of live replicas. With this optimization, the RSM can get stuck or violate safety if a majority crashes.

20
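A sketch of the log-before-ack rule, with hypothetical names; the essential point is that the durable write (flush + fsync) happens strictly before the acknowledgment is sent:

import json, os

def ack_prepare(log_file, n):
    # Durably record the promise BEFORE acknowledging it, so that a crash
    # and recovery can never renege on PREPARE-ACK(n).
    log_file.write(json.dumps({"type": "PREPARE", "n": n}) + "\n")
    log_file.flush()
    os.fsync(log_file.fileno())  # force the record to stable storage
    return ("PREPARE-ACK", n)    # only now is it safe to reply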

Slide21

Replicated State Machines

21

Slide22

Building a replicated state machine

A replicated state machine (RSM) is a set of processes agreeing on the order of requests received from clients.
1. It assumes an underlying consensus protocol for each [slot, request] pair, where slot is the position in the sequence.
2. Requests may arrive asynchronously.
3. Requests have deterministic outcomes, i.e., a request R executed on state S1 results in state S2 no matter which process executes it.
All processes are assumed to start with the same initial state, so 1-3 imply that they have the same state after executing m requests, for any given m.

22
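A sketch of the execution loop this implies on each replica (apply_request and the decided map are illustrative names): decided requests are applied strictly in slot order, so determinism plus an agreed order yields identical state everywhere.

def execute_ready(state, decided, next_slot, apply_request):
    # decided: slot -> request, filled in by the per-slot consensus protocol.
    # Execute contiguously from next_slot; a gap means consensus for that
    # slot has not completed yet, so execution must wait.
    while next_slot in decided:
        state = apply_request(state, decided[next_slot])
        next_slot += 1
    return state, next_slot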

Slide23

Typical sequence

23

[Figure: typical message sequence for Phases 2 and 3, annotated with slot numbers]

Slide24

RSM message efficiency with Paxos

Only Phases 2 and 3 are needed in a typical replicated state machine with long-lived coordinators.

How many messages end-to-end per client request?
  1  // request R from client to entry acceptor
+ 1  // request R from entry acceptor to coordinator
+ n  // prepare(ballot) from coordinator to all, including self
+ n  // prepare-ack(ballot) from all to coordinator
+ n  // accept(ballot, slot, R) from coordinator to all
+ n  // accept-ack(ballot, slot) from all to coordinator
+ n  // decision(slot, digest(R)) from coordinator to all
+ 1  // response resp(R) from entry acceptor to client
= 3 + 5n messages

Only 3 + 3n messages without the prepare phase (long-lived coordinators).
3n messages if we don't count messages to self.
3n-1 messages if the client can learn the coordinator.
Roughly only n|R| + |resp(R)| bytes of traffic in the common case.

24

[Figure annotations: (n-1)|R| bytes, |R| bytes, |R| bytes, |resp(R)| bytes]

Slide25

RSM delay with Paxos

How many one-way delays (T) per end-to-end client request?
8T with the prepare phase (as before).
5T (or 2.5 round trips) without the prepare phase and sending directly to the coordinator.

25

Summary: Compared to a "perfect" single server, RSM/Paxos
- adds 1.5 RTTs of delay
- adds (n-1)|R| additional app-specific bytes of traffic
- adds 3n-1 additional messages (many of which are small)

Slide26

RSM/Paxos soft and hard state

Checkpoints help trim logs and shorten roll-forward time upon recovery.
Accept(slot, *, *) state must be maintained until at least a majority of replicas have received all decisions up to slot; otherwise safety can be violated.

26

Slide27

Observations: odds and ends

Does the RSM guarantee execution of every submitted request? Why or why not?
Does the RSM execution order maintain client-local order of submitted requests? Why or why not?
Can the RSM guarantee rejection of a submitted request with certainty? Why or why not?

27

Slide28

Byzantine fault tolerance

28

Slide29

Byzantine fault tolerance

3f+1 or more replicas are needed to tolerate up to f faults.

29

Practical Byzantine Fault Tolerance, Castro and Liskov, OSDI 1999

Slide30

Safety and liveness

Safety is ensured when at most floor((n-1)/3) replicas are faulty.
Liveness is ensured only under (weak) synchrony assumptions, i.e., a finite (possibly unknown) delay bound exists for successful transmission if a sender keeps retransmitting a message until successful.
"Weak" because the protocol does not need to know the value of the bound anywhere (say, to call the end of a round or to declare a replica crashed).

30

Slide31

High-level operation

Four high-level steps (the key step is the multicast in step 2):
1. A client sends a request to invoke a service operation to the primary.
2. The primary multicasts the request to the replicas.
3. Replicas execute the request and send a reply to the client.
4. The client waits for f+1 replies from different replicas with the same result; this is the result of the operation.

31
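A sketch of the client's step 4 (waiting for f+1 matching replies), with illustrative names:

from collections import Counter

def client_result(replies, f):
    # replies: list of (replica_id, result) pairs from distinct replicas.
    # f+1 matching results must include at least one non-faulty replica,
    # so the matched result can be trusted.
    counts = Counter(result for _, result in replies)
    for result, c in counts.items():
        if c >= f + 1:
            return result
    return None  # keep waiting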

Slide32

Typical sequence

32

Slide33

Message formats

[[PRE-PREPARE, v, n, d]σ(p), m]
  p = issuing primary
  v = view (like a prepare ballot in Paxos)
  n = sequence number (like a slot number in Paxos)
  d = digest(m)
  [X]σ(p) = X signed by p's private key

[PREPARE, v, n, d, i]σ(i), where i is the issuing replica

[COMMIT, v, n, D(m), i]σ(i), where i is the issuing replica

33
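These formats map naturally onto, e.g., Python dataclasses; a sketch with placeholder signature and digest fields (the real protocol signs with private keys):

from dataclasses import dataclass

@dataclass
class PrePrepare:
    v: int      # view (like a prepare ballot in Paxos)
    n: int      # sequence number (like a slot number in Paxos)
    d: bytes    # digest(m)
    sig: bytes  # signed by the primary p's private key

@dataclass
class Prepare:
    v: int
    n: int
    d: bytes
    i: int      # issuing replica
    sig: bytes  # signed by replica i

@dataclass
class Commit:
    v: int
    n: int
    d: bytes    # D(m)
    i: int      # issuing replica
    sig: bytes  # signed by replica i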

Slide34

Key invariants

Definition: prepared(m, v, n, i) is true if replica i has inserted in its log: the request m, a pre-prepare for m in view v with sequence number n, and 2f matching prepares from different other replicas.

Invariant: If prepared(m, v, n, i) is true, then prepared(m', v, n, j) is false for any non-faulty replica j (including i = j) and any m' != m.

Why is this true?

34

Slide35

Key invariants (cont’d)

Definition: committed(m, v, n) is true iff prepared(m, v, n, i) is true for all i in some set of f+1 non-faulty replicas; committed-local(m, v, n, i) is true iff prepared(m, v, n, i) is true and i has accepted 2f+1 commits (possibly including its own) from different replicas that match the pre-prepare for m.

Invariant: committed-local(m, v, n, i) is true at some non-faulty i => committed(m, v, n) is true.

35
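A sketch of the prepared predicate from slide 34's definition (Log and digest are illustrative stand-ins; "matching" means same view, sequence number, and digest):

from dataclasses import dataclass, field
from hashlib import sha256

def digest(m: bytes) -> bytes:
    return sha256(m).digest()

@dataclass
class Log:
    requests: set = field(default_factory=set)     # request bodies m
    preprepares: set = field(default_factory=set)  # (v, n, d) entries
    prepares: set = field(default_factory=set)     # (v, n, d, j) entries

def prepared(log: Log, m: bytes, v: int, n: int, i: int, f: int) -> bool:
    # True iff replica i has logged the request m, a pre-prepare for m in
    # view v with sequence number n, and 2f matching prepares from
    # different replicas other than i.
    d = digest(m)
    others = {j for (vv, nn, dd, j) in log.prepares
              if (vv, nn, dd) == (v, n, d) and j != i}
    return (m in log.requests
            and (v, n, d) in log.preprepares
            and len(others) >= 2 * f)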

Slide36

View change protocol

Any replica i can issue [VIEW-CHANGE, v+1, n, C, P, i]σ(i)
  n = sequence number of the last stable checkpoint s known to i
  C is a certificate: a set of 2f+1 valid checkpoint messages proving the correctness of checkpoint s
  P is a set containing a set Pm for each request m that prepared at i with a sequence number higher than n; each Pm contains a valid pre-prepare and 2f matching prepares, all signed by different replicas

The new primary p multicasts [NEW-VIEW, v+1, V, O]σ(p) when it receives 2f valid view-change messages from others
  It determines the sequence number min-s of the latest stable checkpoint and max-s of the highest prepare in V
  V = the set of valid view-change messages received by the primary
  O is a set of pre-prepare messages (without the piggybacked request) computed in a manner similar to how a Paxos coordinator carries forward ACCEPTs from lower ballots

36

Slide37

Reconfiguration

37

Slide38

Motivation and broad approaches

Ability to configure replica groups based on demand, available resources, and fault tolerance needs.

38

Slide39

Spanner: Google's current geo-distributed DB

39

Slide40

GigaPaxos: Arbitrary object-group mapping

40

Slide41

Goal: change RSM group membership

The RSM invariant must be preserved, i.e., agreement on the total order of all executed requests.
How to define a "total" order when not all replicas execute all requests because of changing groups?
The execution sequence is divided into epochs.
Each epoch e is associated with a group Ge = [S1, ..., Sk].
For every epoch e, all replicas in Ge agree upon the sequence of requests executed in e.

41

Challenge: Managing epoch transitions, i.e., the end of an epoch and the start of the next epoch, correctly.

Slide42

Challenges

Managing epoch transitions, i.e., the end of an epoch and the start of the next epoch, correctly:
All replicas in the current and next epoch must agree on the transition point.
Safety with any number of failures, and liveness if a majority is up.

Locating the current group:
How does a client find authoritative information about the current set of replicas?

42

Slide43

High-level protocol operation

Three phases to transition from epoch i to i+1:
StopEpoch: stop epoch i with members Gi
StartEpoch: start epoch i+1 with members Gi+1
DropEpoch: clear state for epoch i for members in Gi

43

Slide44

Event-action protocol steps

Any process P, either a member of Gi or a member of Gi+1 or an external server, can initiate reconfiguration:
Init: Issue a special [STOP, i] request to Gi
[STOP_COMMIT, i]: Issue [START, i+1, Gi] to Gi+1
majority([START_ACK, i]): Issue [DROP, i] to Gi // epoch i+1 is ready at this point to start executing requests!

Coordinator with ballot (bnum, bcoord) in Gi:
[STOP, i]: Coordinate [STOP, i] using consensus
Ri: If ACCEPT([num, coord], STOPi, n) has been issued for some slot number n, never issue ACCEPT([num, coord], Ri, m) for m > n

44

Slide45

Event-action protocol steps (cont’d)

Replica Q in Gi+1:
[START, i+1, Gi, P]: Issue [GET_EPOCH_FINAL_STATE, i, Q] to any member of Gi
[EPOCH_FINAL_STATE, i, state]: Initialize the RSM state with epoch i's final "state"; send [START_ACK, i+1] to P

Replica in Gi:
Execute([STOP, i]): Store the epoch-final checkpoint Ci; remove all other volatile and stable state for epoch i
[GET_EPOCH_FINAL_STATE, i, Q]: Send Ci to Q
[DROP, i]: Erase Ci // reconfiguration complete

45

Slide46

Locating current group approaches

Centralized (Liskov/Cowling): A centralized, administrator-controlled server returns the current membership group. This could also be an authoritative DNS server.

Distributed (GigaPaxos approach): The reconfiguration (+location) service is itself replicated as an RSM. Consensus ensures that all reconfigurators agree on all reconfiguration operations, so the group is never "lost".

46

Slide47

Strawman candidates and shortcomings

47

Slide48

Scratch: Protocol3

Init:
  myProposal ← pick_random({UMass, Amherst Coffee})
  send_all(PROPOSAL: [myProposal, myID])  // format: msg_type: [msg body]
  // evicted = null set, N = total no. of processes

Event/action:

receive(PROPOSAL: [proposal, i]): record locally

waitfor(someProposal, n/2+1):  // same proposal from majority
  localState ← PRE_DECISION
  sendAll(PRE_DECISION: [someProposal, self])

timeout(T, PROPOSAL: [*, i]):  // evict i
  if (notAlreadyEvicted(i)) evicted ← evicted U {i}
  n ← N - |evicted|

waitfor(PRE_DECISION: [someProposal, *], N/2+1):
  if (localState == PRE_DECISION) decide someProposal for self and optionally announce to all

48