Slide 1: Consensus
V. Arun, Computer Science, University of Massachusetts Amherst
Slide 2: Problem statement
High-level goal: To allow a set of distributed processes to choose a decision value from a set of values proposed by one or more of them.
Slide 3: Assumptions
Message passing: Processes can only communicate by sending messages to each other.
Asynchrony: Messages may be lost or take arbitrarily long to arrive. (Note: Assuming reliable, FIFO channels does not make the problem any simpler as long as delays are unbounded.)
Crash failures: Processes may crash for arbitrarily long (with no message sending/receiving ability when crashed) and then recover at any time.
Stable storage: Each process has access to stable storage that ensures durability across crashes.
Slide 4: Safety requirements
Validity: Only a proposed value may be chosen as the decision.
Agreement: Only a single value may be chosen as the decision.
Non-triviality: No process knows the decision value before the protocol even starts.
Notes:
Not all processes may propose a value.
Validity implies that if no process proposed value V, then V cannot be chosen.
Not all processes may arrive at a decision value, but those that do must agree.
Slide 5: Liveness properties
Termination: The protocol should terminate quickly, i.e., each process should arrive at a decision if at least one value was proposed and that process and enough others remain uncrashed for a long enough time and can communicate in a timely manner.
Efficiency: The protocol should not be wasteful in terms of message overhead, computational overhead, or the amount of stable storage it requires.
Note: It is okay to compromise on liveness for safety.
Slide 6: Classic “FLP” impossibility result
It is impossible for a consensus protocol to be safe and yet guarantee termination (in any finite amount of time) in an asynchronous (unbounded delays) environment if even a single process can crash.
Slide 7: Party time! New Year’s eve meeting point
We all in class need to pick between exactly two locations – Amherst or Northampton – on New Year’s eve.
Assumptions:
We can send (individual) text messages to each other.
Some texts may take arbitrarily long; like really really long; like literally forever.
Sent/received texts are persistently stored.
Any of us can faint anytime for long periods and then recover later.
At 11:55pm, Dec 31, 2017, everyone not at home must gather at the same place: Amherst or Northampton.
Slide 8: Strawman 1

Init:
  rcvd_props = {}
  myProposal = pick_rand({“Amherst”, “Noho”})
  sendAll(PROPOSAL, myProposal, self)

Recv(PROPOSAL, prop, P):
  rcvd_props = rcvd_props U [prop, P]
  if (count(rcvd_props, “Amherst”) > N/2)
    stable_decision = “Amherst”; sendAll(DECISION, “Amherst”)
  if (count(rcvd_props, “Noho”) > N/2)
    stable_decision = “Noho”; sendAll(DECISION, “Noho”)

Recv(DECISION, decn, P):
  stable_decision = decn
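Why Strawman 1 can get stuck: with an even N the vote can split exactly N/2 vs. N/2, so neither strict-majority test ever fires. A minimal Python sketch of the counting rule (function and message names are illustrative, not from the slides):

```python
from collections import Counter

def check_decision(rcvd_props, N):
    """Strawman 1 rule: decide a value only if a strict majority of the
    N processes proposed it."""
    counts = Counter(prop for prop, sender in rcvd_props)
    for value in ("Amherst", "Noho"):
        if counts[value] > N / 2:
            return value
    return None  # no strict majority, so no decision ever

# Split vote with N = 4: two proposals each way, nobody can ever decide.
props = [("Amherst", 1), ("Amherst", 2), ("Noho", 3), ("Noho", 4)]
print(check_decision(props, 4))  # None
```

This is exactly the gap the Andrew amendment on the next slide tries to patch.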
Slide 9: Strawman 1 + Andrew amendment (AA)

Init:
  rcvd_props = {}
  myProposal = pick_rand({“Amherst”, “Noho”})
  sendAll(PROPOSAL, myProposal, self)

Recv(PROPOSAL, prop, P):
  rcvd_props = rcvd_props U [prop, P]
  if (count(rcvd_props, “Amherst”) > N/2)
    stable_decision = “Amherst”; sendAll(DECISION, “Amherst”)
  if (count(rcvd_props, “Noho”) > N/2)
    stable_decision = “Noho”; sendAll(DECISION, “Noho”)
  if (count(rcvd_props, “Amherst”) == N/2 && N%2 == 0)  // AA: break exact ties in favor of Amherst
    stable_decision = “Amherst”; sendAll(DECISION, “Amherst”)

Recv(DECISION, decn, P):
  stable_decision = decn
Slide 10: Strawman 1 + AA + logging

Init:
  rcvd_props = {}
  if not logged anything
    stable_log(myProposal = pick_rand({“Amherst”, “Noho”}))
  else
    myProposal = logged proposal
  sendAll(PROPOSAL, myProposal, self)

Recv(PROPOSAL, prop, P):
  rcvd_props = rcvd_props U [prop, P]
  if (count(rcvd_props, “Amherst”) > N/2)
    stable_decision = “Amherst”; sendAll(DECISION, “Amherst”)
  if (count(rcvd_props, “Noho”) > N/2)
    stable_decision = “Noho”; sendAll(DECISION, “Noho”)
  if (count(rcvd_props, “Amherst”) == N/2 && N%2 == 0)
    stable_decision = “Amherst”; sendAll(DECISION, “Amherst”)

Recv(DECISION, decn, P):
  stable_decision = decn
Slide 11: New Year’s eve: What if 2+ choices?
Discussion
Slide 12: (Single) Paxos
Phase 1: A proposer selects a proposal number n and sends PREPARE(n) to a majority of* acceptors.
If an acceptor receives a PREPARE request with number n greater than that of any previously received PREPARE to which it has already responded, then it responds with a promise not to accept any more proposals numbered less than n, along with the highest-numbered proposal (if any) it has accepted.

* Either one of “a majority of” or “all” is safe.
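The acceptor side of Phase 1 can be sketched as follows. This is an illustrative Python fragment, not the slides' code: acceptor state is held in a plain dict, and the write to stable storage that must precede the PROMISE is omitted.

```python
def on_prepare(acceptor, n):
    """Phase 1B: promise n only if it exceeds every prepare number this
    acceptor has already responded to; the promise carries the
    highest-numbered proposal (if any) the acceptor has accepted."""
    if n > acceptor["promised"]:
        acceptor["promised"] = n  # must be logged to stable storage first
        return ("PROMISE", n, acceptor["accepted"])  # accepted: (num, val) or None
    return None  # ignore: already promised a number >= n

acc = {"promised": -1, "accepted": None}
print(on_prepare(acc, 5))  # ('PROMISE', 5, None)
print(on_prepare(acc, 3))  # None: 3 does not exceed the promised 5
```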
Slide 13: (Single) Paxos (cont’d)
Phase 2: If the proposer receives a response to its PREPARE(n) from a majority of acceptors, then it sends an ACCEPT request to each of those* acceptors for a proposal numbered n with value v, where v is the value of the highest-numbered proposal among the responses (in Phase 1B), or any arbitrarily chosen value if the responses reported no proposals.
If an acceptor receives an ACCEPT(n, v) message, it accepts the proposal unless it has already responded to a PREPARE with a number greater than n.

* Either one of “each of those” or “all” is safe.
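The Phase 2A value-selection rule is the heart of Paxos safety; a small Python sketch (the (number, value) promise format and names are assumptions for illustration):

```python
def choose_value(promises, my_value):
    """Phase 2A: adopt the value of the highest-numbered proposal among
    the PROMISE responses; only if none of the majority reported an
    accepted proposal may the proposer use its own value."""
    accepted = [p for p in promises if p is not None]  # p = (number, value)
    if accepted:
        return max(accepted)[1]  # value of the highest-numbered proposal
    return my_value

print(choose_value([None, (2, "A"), (7, "B")], "C"))  # 'B'
print(choose_value([None, None], "C"))                # 'C'
```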
Slide 14: (Single) Paxos (cont’d)
The first two phases, conducted by proposers and acceptors, are crucial for safety. Learning the decision value is easier and constitutes the third phase.
Slide 15: (Single) Phase 3
Phase 3: If the proposer (or, for that matter, any process) gets an acknowledgment from some majority of acceptors for ACCEPT(n, v), then it decides v and can announce DECISION(v) to anyone.
If any process receives DECISION(v), it can safely adopt v as the decision.
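The Phase 3 decision test can be sketched in Python (the (n, v) ack encoding is an assumption for illustration):

```python
from collections import Counter

def learn(accept_acks, n_acceptors):
    """Phase 3: v is decided once a majority of acceptors acknowledge
    the same ACCEPT(n, v)."""
    tally = Counter(accept_acks)  # each ack is an (n, v) pair
    for (n, v), k in tally.items():
        if k > n_acceptors // 2:  # strict majority
            return v
    return None

acks = [(3, "X"), (3, "X"), (3, "X"), (2, "Y")]
print(learn(acks, 5))  # 'X': 3 of 5 acceptors acknowledged ACCEPT(3, 'X')
```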
Slide 16: Key invariant maintained
For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either no acceptor in S has accepted any proposal numbered less than n, or v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.
This invariant is essential to ensuring that if a proposal v has been decided, then no proposer will even try to issue ACCEPT(*, v’) for v’ != v.
Slide 17: Observations: progress
Why is progress not guaranteed? Two (or more) proposers could alternately keep issuing a sequence of PREPAREs and, possibly, corresponding ACCEPTs, that supersede lower PREPAREs without a decision ever being chosen.
Example: Proposer p completes Phase 1 with prepare number n1 and, before it completes Phase 2 at a majority of acceptors, proposer q issues Phase 1 with a higher prepare number n2 > n1. Proposer p could then again try to prepare n3 > n2, causing acceptors to ignore q’s Phase 2, and so on.
Slide 18: Observations: progress (cont’d)
Selecting a distinguished proposer, i.e., a unique coordinator designated to issue PREPAREs, ensures that it will eventually find a higher prepare number than any other and succeed in determining a decision, provided it can communicate with a majority of acceptors.
Uniqueness of the coordinator is not necessary for safety, only for progress, to limit one-upping of proposers ad infinitum.
Long-lived, well-connected coordinators are critical in practice.
Failure detectors are needed for another member to try to take over as the coordinator (by just issuing a higher-numbered prepare) if the current coordinator is suspected faulty.
Slide 19: Observations: prepare number
Prepare numbers could be drawn from an arbitrary, infinitely big set of totally ordered elements. In practice:
Number space: The prepare number space has to be finite, otherwise the size of protocol meta-data will grow arbitrarily. Wraparound is handled by assuming, for safety, that the ballot increase rate and message reordering are small relative to the size of the space; otherwise ambiguities arise, like whether 1 > 0 or 1 < 0 (because 1 + MAX = 0).
ID: It is useful to be able to look at a prepare number n and automatically infer who issued it, e.g., to check their health.
2-tuple ballots: A common technique (not necessary for safety) is to use “ballots” as prepare numbers, where a ballot <N, C> is a two-tuple consisting of a ballot number N and the corresponding issuing coordinator C. Ballot B1 = <N1, C1> is greater than B2 = <N2, C2> iff N1 > N2, or N1 == N2 and C1 > C2.
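The two-tuple ballot ordering maps directly onto lexicographic tuple comparison; a small illustrative Python check:

```python
def ballot_gt(b1, b2):
    """Ballots are (N, C) pairs; b1 > b2 iff N1 > N2, or N1 == N2 and
    C1 > C2 -- which is exactly Python's lexicographic tuple ordering."""
    return b1 > b2

print(ballot_gt((2, 1), (1, 9)))  # True: higher ballot number wins
print(ballot_gt((2, 1), (2, 3)))  # False: same number, lower coordinator ID
```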
Slide 20: Observations: logging
Every PREPARE or ACCEPT message must be logged to stable storage before being acknowledged, so that a member never reneges on the promise implicit in its PREPARE-ACK(n) or ACCEPT-ACK(n, *).
It is important to limit logs from growing unboundedly; this is especially critical for replicated state machines (coming next).
A common (but potentially problematic) optimization is to maintain only in-memory message logs and to collect the missing logs upon recovery by contacting a majority of alive replicas. The RSM can get stuck or violate safety if a majority crashes.
Slide 21: Replicated State Machines
Slide 22: Building a replicated state machine
A replicated state machine (RSM) is a set of processes agreeing on the order of requests received from clients. It assumes:
An underlying consensus protocol for each [slot, request] pair, where slot is the position in the sequence.
Requests may arrive asynchronously.
Requests have deterministic outcomes, i.e., a request R executed on state S1 results in state S2 no matter which process executes it.
All processes are assumed to start with the same initial state, so 1-3 imply that they have the same state after executing m requests for any given m.
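Deterministic outcomes plus an agreed total order are what make replication work; a toy Python sketch with a key-value state machine (the (op, key, value) request format is invented for illustration):

```python
def apply_log(initial_state, log):
    """Replay a totally ordered request log deterministically: replicas
    that start from the same initial state and apply the same log end
    in identical states."""
    state = dict(initial_state)
    for op, key, val in log:  # illustrative request format: (op, key, value)
        if op == "put":
            state[key] = val
        elif op == "del":
            state.pop(key, None)
    return state

log = [("put", "x", 1), ("put", "y", 2), ("del", "x", None)]
print(apply_log({}, log))                        # {'y': 2}
print(apply_log({}, log) == apply_log({}, log))  # True: same order, same state
```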
Slide 23: Typical sequence
[Figure: typical per-request message sequence, annotated with the slot number and Phases 2 and 3]
Slide 24: RSM message efficiency with Paxos
Only Phases 2 and 3 are needed in a typical replicated state machine with long-lived coordinators.
How many messages end-to-end per client request?
  1 // request R from client to entry acceptor
+ 1 // request R from entry acceptor to coordinator
+ n // prepare(ballot, …) from coordinator to all including self
+ n // prepare-ack(ballot) from all to coordinator
+ n // accept(ballot, slot, R) from coordinator to all
+ n // accept-ack(ballot, slot) from all to coordinator
+ n // decision(slot, digest(R)) from coordinator to all
+ 1 // response resp(R) from entry acceptor to client
= 3 + 5n messages
Only 3 + 3n messages without the prepare phase (long-lived coordinators)
3n messages if we don’t count messages to self
3n-1 messages if the client can learn the coordinator
Roughly only n|R| + |resp(R)| bytes of traffic in the common case
[Figure: per-hop traffic: |R| bytes from client to entry acceptor, |R| bytes to the coordinator, (n-1)|R| bytes from coordinator to the other replicas, |resp(R)| bytes back to the client]
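The headline counts above reduce to a one-line formula; a small Python sketch of the slide's tally:

```python
def rsm_message_count(n, with_prepare=True):
    """Per-request message tally from the slide: the accept, accept-ack,
    and decision rounds each cost n messages; the prepare phase adds two
    more rounds of n; plus 3 point-to-point hops (client to entry
    acceptor, entry acceptor to coordinator, response to client)."""
    rounds = 5 if with_prepare else 3
    return 3 + rounds * n

print(rsm_message_count(5))                      # 28 = 3 + 5*5
print(rsm_message_count(5, with_prepare=False))  # 18 = 3 + 3*5
# Not counting the coordinator's 3 messages to itself gives 3n, and
# letting the client reach the coordinator directly gives 3n - 1.
```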
Slide 25: RSM delay with Paxos
How many one-way delays (T) per end-to-end client request?
8T with the prepare phase (as before)
5T (or 2.5 round-trips) without the prepare phase and sending directly to the coordinator
Summary: Compared to a “perfect” single server, RSM/Paxos
- adds 1.5 RTTs of delay
- adds (n-1)|R| additional app-specific bytes of traffic
- adds 3n-1 additional messages (many of which are small)
Slide 26: RSM/Paxos soft and hard state
Checkpoints help trim logs and reduce roll-forward time upon recovery.
Accept(slot, *, *) must be maintained until at least a majority of replicas have received all decisions up to slot, otherwise safety can be violated.
Slide 27: Observations: odds and ends
Does RSM guarantee execution of every submitted request? Why or why not?
Does RSM execution order maintain the client-local order of submitted requests? Why or why not?
Can RSM guarantee rejection of a submitted request with certainty? Why or why not?
Slide 28: Byzantine fault tolerance

Slide 29: Byzantine fault tolerance
3f+1 or more replicas are needed to tolerate up to f faults.
Practical Byzantine fault tolerance, Castro and Liskov, OSDI 1999
Slide 30: Safety and liveness
Safety is ensured when at most floor((n-1)/3) replicas are faulty.
Liveness is ensured only under (weak) synchrony assumptions, i.e., a finite (possibly unknown) delay bound exists for successful transmission if a sender keeps retransmitting a message until successful.
“Weak” because the protocol does not need to know the value of the bound anywhere (say, to call the end of a round or declare a replica as crashed, etc.)
Slide 31: High-level operation
Four high-level steps (the key step is the multicast in step 2):
A client sends a request to invoke a service operation to the primary.
The primary multicasts the request to the replicas.
Replicas execute the request and send a reply to the client.
The client waits for f+1 replies from different replicas with the same result; this is the result of the operation.
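Step 4 can be sketched in Python: at most f replicas are faulty, so f+1 matching replies from different replicas must include at least one from a correct replica. The (replica_id, result) message format is illustrative, not PBFT's wire format.

```python
from collections import Counter

def pbft_result(replies, f):
    """PBFT client rule: accept a result once f+1 replies from
    different replicas agree."""
    seen = {}
    for replica_id, result in replies:  # one vote per replica
        seen[replica_id] = result
    tally = Counter(seen.values())
    if not tally:
        return None
    result, count = tally.most_common(1)[0]
    return result if count >= f + 1 else None

print(pbft_result([(0, "ok"), (1, "ok"), (2, "bad")], f=1))  # 'ok'
```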
Slide 32: Typical sequence
[Figure: typical PBFT message sequence]
Slide 33: Message formats
[[PRE-PREPARE, v, n, d]σ(p), m]
p = issuing primary
v = view (like the prepare ballot in Paxos)
n = sequence number (like the slot number in Paxos)
d = digest(m)
[X]σ(p) ≅ X signed by p’s private key
[PREPARE, v, n, d, i]σ(i) where i is the issuing replica
[COMMIT, v, n, D(m), i]σ(i) where i is the issuing replica
Slide 34: Key invariants
Definition: prepared(m,v,n,i) is true if replica i has inserted in its log: the request m, a pre-prepare for m in view v with sequence number n, and 2f matching prepares from different other replicas.
Invariant: If prepared(m,v,n,i) is true, then prepared(m’,v,n,j) is false for any non-faulty replica j (including i=j).
Why is this true?
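The prepared predicate can be sketched as a log check in Python; the log-entry tuples below are an invented encoding for illustration, not PBFT's actual message format:

```python
def prepared(log, m, v, n, f):
    """prepared(m, v, n, i): the log holds the request m, a pre-prepare
    for m in view v with sequence number n, and 2f matching prepares
    from different replicas."""
    has_request = ("request", m) in log
    has_preprepare = ("pre-prepare", v, n, m) in log
    preparers = {e[4] for e in log
                 if len(e) == 5 and e[0] == "prepare" and e[1:4] == (v, n, m)}
    return has_request and has_preprepare and len(preparers) >= 2 * f

log = {("request", "m1"), ("pre-prepare", 1, 5, "m1"),
       ("prepare", 1, 5, "m1", 2), ("prepare", 1, 5, "m1", 3)}
print(prepared(log, "m1", 1, 5, f=1))  # True: 2f = 2 matching prepares
```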
Slide 35: Key invariants (cont’d)
Definition: committed(m,v,n) is true iff prepared(m,v,n,i) is true for all i in some set of f+1 non-faulty replicas; committed-local(m,v,n,i) is true iff prepared(m,v,n,i) is true and i has accepted 2f+1 commits (possibly including its own) from different replicas that match the pre-prepare for m.
Invariant: committed-local(m,v,n,i) is true at some non-faulty i => committed(m,v,n) is true.
Slide 36: View change protocol
Any replica i can issue [VIEW-CHANGE, v+1, n, C, P, i]σ(i)
n = sequence number of the last stable checkpoint s known to i
C is a certificate, a set of 2f+1 valid checkpoint messages proving the correctness of checkpoint s
P is a set containing a set P_m for each m that prepared at i with a sequence number higher than n. Each set P_m contains a valid pre-prepare and 2f+1 matching prepares all signed by different replicas.
The new primary p multicasts [NEW-VIEW, v+1, V, O]σ(p) when it receives 2f valid view-change messages from others.
It determines the sequence number min-s of the latest stable checkpoint and max-s of the highest prepare in V.
V = set containing the valid view-change messages received by the primary
O is a set of pre-prepare messages (without the piggybacked request) computed in a manner similar to how a Paxos coordinator carries forward ACCEPTs from lower ballots.
Slide 37: Reconfiguration

Slide 38: Motivation and broad approaches
Ability to configure replica groups based on demand, available resources, and fault-tolerance needs

Slide 39: Spanner: Google’s current geo-distributed DB

Slide 40: GigaPaxos: Arbitrary object-group mapping
Slide 41: Goal: change RSM group membership
The RSM invariant must be preserved, i.e., agreement on the total order of all executed requests. How do we define a “total” order when not all replicas execute all requests because of changing groups?
The execution sequence is divided into epochs.
Each epoch e is associated with a group G_e = [S_1, …, S_k].
For every epoch e, all replicas in G_e agree upon the sequence of requests executed in e.
Challenge: Managing epoch transitions, i.e., the end of an epoch and the start of the next epoch, correctly.
Slide 42: Challenges
Managing epoch transitions, i.e., the end of an epoch and the start of the next epoch, correctly
All replicas in the current and next epoch must agree on the transition point
Safety with any number of failures, and liveness if a majority is up
Locating the current group
How does a client find authoritative information about the current set of replicas?
Slide 43: High-level protocol operation
Three phases to transition from epoch i to i+1:
StopEpoch: stop epoch i with members G_i
StartEpoch: start epoch i+1 with members G_{i+1}
DropEpoch: clear state for epoch i for members in G_i
Slide 44: Event-action protocol steps
Any process P, either a member of G_i or a member of G_{i+1} or an external server, can initiate reconfiguration:
Init: Issue a special [STOP, i] request to G_i
[STOP_COMMIT, i]: Issue [START, i+1, G_i] to G_{i+1}
majority([START_ACK, i]): Issue [DROP, i] to G_i // epoch i+1 is ready at this point to start executing requests!

Coordinator with ballot (bnum, bcoord) in G_i:
[STOP, i]: Coordinate [STOP, i] using consensus
R_i: If ACCEPT([bnum, bcoord], STOP_i, n) was issued for some slot number n, never issue ACCEPT([bnum, bcoord], R_i, m) for m > n
Slide 45: Event-action protocol steps (cont’d)
Replica Q in G_{i+1}:
[START, i+1, G_i, P]: Issue [GET_EPOCH_FINAL_STATE, i, Q] to any member of G_i
[EPOCH_FINAL_STATE, i, state]: Initialize RSM state for epoch i with “state”; send [START_ACK, i+1] to P

Replica in G_i:
Execute([STOP, i]): Store epoch final checkpoint C_i; remove all other volatile and stable state for epoch i
[GET_EPOCH_FINAL_STATE, i, Q]: Send C_i to Q
[DROP, i]: Erase C_i // reconfiguration complete
Slide 46: Locating current group: approaches
Centralized (Liskov/Cowling): A centralized administrator-controlled server returns the current membership group. Could also be an authoritative DNS server.
Distributed (GigaPaxos approach): The reconfiguration (+location) service is itself replicated as an RSM. Consensus ensures that all reconfigurators agree on all reconfiguration operations, so the group is never “lost”.
Slide 47: Strawman candidates and shortcomings
Slide 48: Scratch: Protocol3

Protocol3 init:
  myProposal <- pick_random(UMass, Amherst Coffee)
  send_all(PROPOSAL: [myProposal, myID]) // format: msg_type: [msg body]
  // evicted = null set, N = total no. of processes

Event/action:
  receive(PROPOSAL: [proposal, i]): record locally
  waitfor(someProposal, n/2+1): // same proposal from a majority
    localState <- PRE_DECISION
    sendAll(PRE_DECISION: [someProposal, self])
  timeout(T, PROPOSAL: [*, i]): // evict i
    if (notAlreadyEvicted(i))
      evicted <- evicted U {i}
      n <- N - |evicted|
  waitfor(PRE_DECISION: [someProposal, *], N/2+1):
    if (localState == PRE_DECISION)
      decide someProposal for self and optionally announce to all
sendAll(PRE_DECISION:[someProposal, self])timeout(T, PROPOSAL:[*,i]): // evict iIf(notAlreadyEvicted(i)) evicted evicted U {i}n N - |evicted|waitfor(PRE_DECISION:[someProposal,*], N/2+1) :if(localState==PRE_DECISION) decide someProposal for self and optionally announce to all48