Raft: A Consensus Algorithm for Replicated Logs
Diego Ongaro and John Ousterhout, Stanford University


Presentation Transcript

1. Raft: A Consensus Algorithm for Replicated Logs
Diego Ongaro and John Ousterhout
Stanford University

2. Goal: Replicated Log
- Replicated log => replicated state machine
  - All servers execute the same commands in the same order
- Consensus module ensures proper log replication
- System makes progress as long as any majority of servers are up
- Failure model: fail-stop (not Byzantine), delayed/lost messages
[Diagram: clients send commands (add, jmp, mov, shl) to servers; each server contains a consensus module, a log, and a state machine]
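
The whole point of the consensus module is that every state machine sees the same command sequence. A minimal Go sketch of that contract (the interface and apply loop are illustrative, not part of Raft itself):

    // A deterministic state machine: feeding the same commands in the same
    // order to every replica yields the same state on every replica.
    type StateMachine interface {
        Apply(command []byte) []byte
    }

    // applyCommitted advances a server's state machine through newly
    // committed log entries, in log order (log indexes are 1-based here).
    func applyCommitted(sm StateMachine, log [][]byte, lastApplied, commitIndex int) int {
        for lastApplied < commitIndex {
            lastApplied++
            sm.Apply(log[lastApplied-1])
        }
        return lastApplied
    }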

3. Approaches to Consensus
Two general approaches to consensus:
- Symmetric, leader-less:
  - All servers have equal roles
  - Clients can contact any server
- Asymmetric, leader-based:
  - At any given time, one server is in charge; others accept its decisions
  - Clients communicate with the leader
Raft uses a leader:
- Decomposes the problem (normal operation, leader changes)
- Simplifies normal operation (no conflicts)
- More efficient than leader-less approaches

4. Raft Overview
1. Leader election:
   - Select one of the servers to act as leader
   - Detect crashes, choose new leader
2. Normal operation (basic log replication)
3. Safety and consistency after leader changes
4. Neutralizing old leaders
5. Client interactions:
   - Implementing linearizable semantics
6. Configuration changes:
   - Adding and removing servers

5. Server States
At any given time, each server is either:
- Leader: handles all client interactions, log replication
  - At most 1 viable leader at a time
- Follower: completely passive (issues no RPCs, responds to incoming RPCs)
- Candidate: used to elect a new leader
Normal operation: 1 leader, N-1 followers
[Diagram: Follower → Candidate on election timeout; Candidate → Leader on receiving votes from a majority of servers; Candidate → Candidate on timeout (new election); Candidate → Follower on discovering the current leader or a higher term; Leader → Follower ("step down") on discovering a server with a higher term]
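
The three roles map naturally onto a simple enumeration; a Go sketch (the names are illustrative):

    // State of a Raft server: each server is in exactly one state at a time.
    type State int

    const (
        Follower  State = iota // passive: answers RPCs, never issues them
        Candidate              // soliciting votes after an election timeout
        Leader                 // handles clients and replicates the log
    )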

6. Terms
Time is divided into terms:
- Election
- Normal operation under a single leader
Properties:
- At most 1 leader per term
- Some terms have no leader (failed election)
- Each server maintains current term value
Key role of terms: identify obsolete information
[Diagram: timeline divided into Terms 1-5; each term starts with an election, usually followed by normal operation under one leader; one term ends in a split vote and has no leader]

7. Raft Protocol Summary

Followers
- Respond to RPCs from candidates and leaders
- Convert to candidate if election timeout elapses without either:
  - Receiving valid AppendEntries RPC, or
  - Granting vote to candidate

Candidates
- Increment currentTerm, vote for self
- Reset election timeout
- Send RequestVote RPCs to all other servers; wait for either:
  - Votes received from majority of servers: become leader
  - AppendEntries RPC received from new leader: step down
  - Election timeout elapses without election resolution: increment term, start new election
- Discover higher term: step down

Leaders
- Initialize nextIndex for each follower to last log index + 1
- Send initial empty AppendEntries RPCs (heartbeats) to each follower; repeat during idle periods to prevent election timeouts
- Accept commands from clients; append new entries to local log
- Whenever last log index ≥ nextIndex for a follower, send AppendEntries RPC with log entries starting at nextIndex; update nextIndex if successful
- If AppendEntries fails because of log inconsistency, decrement nextIndex and retry
- Mark log entries committed if stored on a majority of servers and at least one entry from current term is stored on a majority of servers
- Step down if currentTerm changes

Persistent State
Each server persists the following to stable storage synchronously before responding to RPCs:
- currentTerm: latest term server has seen (initialized to 0 on first boot)
- votedFor: candidateId that received vote in current term (or null if none)
- log[]: log entries

Log Entry
- term: term when entry was received by leader
- index: position of entry in the log
- command: command for state machine

RequestVote RPC
Invoked by candidates to gather votes.
Arguments:
- candidateId: candidate requesting vote
- term: candidate's term
- lastLogIndex: index of candidate's last log entry
- lastLogTerm: term of candidate's last log entry
Results:
- term: currentTerm, for candidate to update itself
- voteGranted: true means candidate received vote
Implementation:
1. If term > currentTerm, currentTerm ← term (step down if leader or candidate)
2. If term == currentTerm, votedFor is null or candidateId, and candidate's log is at least as complete as local log, grant vote and reset election timeout

AppendEntries RPC
Invoked by leader to replicate log entries and discover inconsistencies; also used as heartbeat.
Arguments:
- term: leader's term
- leaderId: so follower can redirect clients
- prevLogIndex: index of log entry immediately preceding new ones
- prevLogTerm: term of prevLogIndex entry
- entries[]: log entries to store (empty for heartbeat)
- commitIndex: last entry known to be committed
Results:
- term: currentTerm, for leader to update itself
- success: true if follower contained entry matching prevLogIndex and prevLogTerm
Implementation:
1. Return if term < currentTerm
2. If term > currentTerm, currentTerm ← term
3. If candidate or leader, step down
4. Reset election timeout
5. Return failure if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm
6. If existing entries conflict with new entries, delete all existing entries starting with first conflicting entry
7. Append any new entries not already in the log
8. Advance state machine with newly committed entries
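
The state and message layouts above translate almost directly into types. A Go sketch mirroring the slide's field lists (the Go names and types are assumptions, not taken from any particular implementation):

    // Persistent state: written to stable storage before answering any RPC.
    type PersistentState struct {
        CurrentTerm int        // latest term this server has seen
        VotedFor    int        // candidate voted for in current term (-1 = none)
        Log         []LogEntry // the log itself
    }

    // One log entry; its index is its 1-based position in the log.
    type LogEntry struct {
        Term    int    // term when the entry was received by the leader
        Command []byte // opaque command for the state machine
    }

    // RequestVote RPC: invoked by candidates to gather votes.
    type RequestVoteArgs struct {
        Term         int // candidate's term
        CandidateId  int // candidate requesting the vote
        LastLogIndex int // index of candidate's last log entry
        LastLogTerm  int // term of candidate's last log entry
    }

    type RequestVoteReply struct {
        Term        int  // currentTerm, for the candidate to update itself
        VoteGranted bool // true means the candidate received this vote
    }

    // AppendEntries RPC: invoked by the leader to replicate entries and to
    // discover inconsistencies; empty Entries doubles as a heartbeat.
    type AppendEntriesArgs struct {
        Term         int        // leader's term
        LeaderId     int        // so the follower can redirect clients
        PrevLogIndex int        // index of entry immediately preceding new ones
        PrevLogTerm  int        // term of the PrevLogIndex entry
        Entries      []LogEntry // entries to store (empty for heartbeat)
        CommitIndex  int        // last entry known to be committed
    }

    type AppendEntriesReply struct {
        Term    int  // currentTerm, for the leader to update itself
        Success bool // follower had a matching entry at PrevLogIndex
    }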

8. Heartbeats and Timeouts
- Servers start up as followers
- Followers expect to receive RPCs from leaders or candidates
- Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority
- If electionTimeout elapses with no RPCs:
  - Follower assumes leader has crashed
  - Follower starts new election
- Timeouts typically 100-500 ms

9. Election Basics
- Increment current term
- Change to Candidate state
- Vote for self
- Send RequestVote RPCs to all other servers, retry until either:
  1. Receive votes from majority of servers:
     - Become leader
     - Send AppendEntries heartbeats to all other servers
  2. Receive RPC from valid leader:
     - Return to follower state
  3. No one wins election (election timeout elapses):
     - Increment term, start new election

10. Elections, cont'd
- Safety: allow at most one winner per term
  - Each server gives out only one vote per term (persisted on disk)
  - Two different candidates can't accumulate majorities in the same term
- Liveness: some candidate must eventually win
  - Choose election timeouts randomly in [T, 2T]
  - One server usually times out and wins the election before others wake up
  - Works well if T >> broadcast time
[Diagram: a row of servers; the servers that voted for candidate A form a majority, so candidate B can't also get a majority]
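
The [T, 2T] randomization is a one-liner; a Go sketch (the function name is illustrative):

    import (
        "math/rand"
        "time"
    )

    // electionTimeout draws a fresh timeout uniformly from [T, 2T), so one
    // server usually times out, requests votes, and wins before the rest
    // wake up. Works well when T is much larger than the broadcast time.
    func electionTimeout(T time.Duration) time.Duration {
        return T + time.Duration(rand.Int63n(int64(T)))
    }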

11. Log Structure
- Log entry = index, term, command
- Log stored on stable storage (disk); survives crashes
- Entry committed if known to be stored on majority of servers
  - Durable, will eventually be executed by state machines
[Diagram: leader log and follower logs over indexes 1-8; each entry shows its term and command (e.g., 1:add, 3:jmp); the prefix replicated on a majority is marked as committed entries]

12. Normal Operation
- Client sends command to leader
- Leader appends command to its log
- Leader sends AppendEntries RPCs to followers
- Once new entry committed:
  - Leader passes command to its state machine, returns result to client
  - Leader notifies followers of committed entries in subsequent AppendEntries RPCs
  - Followers pass committed commands to their state machines
- Crashed/slow followers?
  - Leader retries RPCs until they succeed
- Performance is optimal in the common case:
  - One successful RPC to any majority of servers

13. Log Consistency
High level of coherency between logs:
- If log entries on different servers have the same index and term:
  - They store the same command
  - The logs are identical in all preceding entries
- If a given entry is committed, all preceding entries are also committed
[Diagram: two logs agreeing in index and term over their shared prefix, so their contents match entry for entry]

14. AppendEntries Consistency Check
- Each AppendEntries RPC contains index and term of the entry preceding the new ones
- Follower must contain a matching entry; otherwise it rejects the request
- Implements an induction step, ensures coherency
[Diagram: two examples over indexes 1-5. AppendEntries succeeds when the follower has a matching entry at prevLogIndex; it fails when the follower's entry at prevLogIndex has a different term]
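
The follower side of the check is a small pure function; a sketch reusing the LogEntry type from the slide-7 sketch:

    // consistencyCheck reports whether this log contains an entry at
    // prevLogIndex (1-based) whose term matches prevLogTerm. prevLogIndex 0
    // means the new entries start at the very beginning of the log.
    func consistencyCheck(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
        if prevLogIndex == 0 {
            return true // nothing precedes the new entries
        }
        if prevLogIndex > len(log) {
            return false // follower is missing entries
        }
        return log[prevLogIndex-1].Term == prevLogTerm
    }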

15. Leader Changes
- At beginning of new leader's term:
  - Old leader may have left entries partially replicated
  - No special steps by new leader: just start normal operation
  - Leader's log is "the truth"
  - Will eventually make followers' logs identical to the leader's
- Multiple crashes can leave many extraneous log entries
[Diagram: logs of servers s1-s5 over indexes 1-8, with diverging term numbers illustrating partially replicated and extraneous entries]

16. Safety Requirement
Once a log entry has been applied to a state machine, no other state machine may apply a different value for that log entry.
Raft safety property:
- If a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders
This guarantees the safety requirement:
- Leaders never overwrite entries in their logs
- Only entries in the leader's log can be committed
- Entries must be committed before being applied to a state machine
Committed → present in future leaders' logs; achieved through restrictions on commitment and restrictions on leader election.

17. Picking the Best Leader
- Can't tell which entries are committed! (Some of that information may be unavailable during a leader transition.)
- During elections, choose the candidate with the log most likely to contain all committed entries:
  - Candidates include log info in RequestVote RPCs (index & term of last log entry)
  - Voting server V denies vote if its log is "more complete":
    (lastTermV > lastTermC) || (lastTermV == lastTermC) && (lastIndexV > lastIndexC)
  - Leader will have "most complete" log among electing majority
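
The "more complete" test written out as a Go function:

    // moreComplete reports whether the voter's log (V) is more complete
    // than the candidate's (C); if so, the voter denies the vote.
    func moreComplete(lastTermV, lastIndexV, lastTermC, lastIndexC int) bool {
        return lastTermV > lastTermC ||
            (lastTermV == lastTermC && lastIndexV > lastIndexC)
    }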

18. Committing Entry from Current Term
Case #1 of 2: leader decides an entry in the current term is committed
- Safe: the leader for term 3 must contain entry 4
[Diagram: servers s1-s5, log indexes 1-6; the term-2 leader has just gotten AppendEntries for entry 4 accepted on a majority, and any server missing that entry can't be elected leader for term 3]

19. Committing Entry from Earlier Term
Case #2 of 2: leader is trying to finish committing an entry from an earlier term
- Entry 3 is not safely committed:
  - s5 can be elected as leader for term 5
  - If elected, it will overwrite entry 3 on s1, s2, and s3!
[Diagram: servers s1-s5; the term-4 leader has replicated the old term-2 entry at index 3 to a majority, but s5 holds a term-3 entry at that index and could still win a term-5 election]

20. New Commitment Rules
For a leader to decide an entry is committed:
- It must be stored on a majority of servers
- At least one new entry from the leader's term must also be stored on a majority of servers
Once entry 4 is committed:
- s5 cannot be elected leader for term 5
- Entries 3 and 4 are both safe
The combination of election rules and commitment rules makes Raft safe.
[Diagram: same scenario; once the leader's own term-4 entry at index 4 is stored on a majority, s5 can no longer win the election, so entries 3 and 4 are both safe]
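
A common way to implement this rule, sketched here with an assumed matchIndex array of per-follower replication progress: the leader only counts replicas for entries from its own term; committing such an entry commits every earlier entry transitively.

    // canCommit reports whether the leader may mark the entry at index
    // (1-based) committed: the entry must be from the leader's current term
    // and stored on a strict majority of the cluster (leader + followers).
    func canCommit(index, currentTerm int, log []LogEntry, matchIndex []int) bool {
        if log[index-1].Term != currentTerm {
            return false // older-term entries commit only transitively
        }
        replicas := 1 // the leader's own copy
        for _, m := range matchIndex {
            if m >= index { // this follower stores the entry
                replicas++
            }
        }
        return replicas > (len(matchIndex)+1)/2
    }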

21. Log Inconsistencies
Leader changes can result in log inconsistencies:
[Diagram: leader for term 8 with log indexes 1-12; possible followers (a)-(f) show missing entries (shorter logs) and extraneous entries (uncommitted entries from older terms)]

22. Repairing Follower Logs
- New leader must make follower logs consistent with its own:
  - Delete extraneous entries
  - Fill in missing entries
- Leader keeps nextIndex for each follower:
  - Index of next log entry to send to that follower
  - Initialized to (1 + leader's last index)
- When the AppendEntries consistency check fails, decrement nextIndex and try again (see the sketch below)
[Diagram: leader for term 7 with indexes 1-12; followers (a) and (b) have shorter or conflicting logs, and nextIndex backs up until it reaches the point where the logs match]
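
A sketch of that retry loop for one follower, with the RPC abstracted behind a supplied send callback (an assumption; real code would issue the AppendEntries RPC there):

    // repairFollower backs nextIndex up until the consistency check passes,
    // then ships all entries from nextIndex onward. Returns the follower's
    // new nextIndex. Log indexes are 1-based, as on the slides.
    func repairFollower(log []LogEntry,
        send func(prevIndex, prevTerm int, entries []LogEntry) bool) int {

        nextIndex := len(log) + 1 // 1 + leader's last index
        for {
            prevIndex := nextIndex - 1
            prevTerm := 0
            if prevIndex > 0 {
                prevTerm = log[prevIndex-1].Term
            }
            if send(prevIndex, prevTerm, log[nextIndex-1:]) {
                return len(log) + 1 // follower now matches the leader
            }
            nextIndex-- // consistency check failed: back up and retry
        }
    }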

23. Repairing Logs, cont'd
When a follower overwrites an inconsistent entry, it deletes all subsequent entries (see the sketch below):
[Diagram: leader for term 7, indexes 1-11; overwriting the follower's first conflicting entry deletes its extraneous tail, after which the leader's entries fill in]
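
The follower-side overwrite can be expressed as a pure function on the log (a sketch; LogEntry as before):

    // appendAt installs new entries after prevLogIndex. An existing entry
    // that conflicts with a new one (same index, different term) is deleted
    // together with everything after it; entries already present are kept.
    func appendAt(log []LogEntry, prevLogIndex int, entries []LogEntry) []LogEntry {
        for i, e := range entries {
            pos := prevLogIndex + i // 0-based slice position of this entry
            if pos < len(log) && log[pos].Term != e.Term {
                log = log[:pos] // conflict: truncate the tail
            }
            if pos >= len(log) {
                return append(log, entries[i:]...)
            }
        }
        return log
    }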

24. Neutralizing Old Leaders
- A deposed leader may not be dead:
  - Temporarily disconnected from the network
  - Other servers elect a new leader
  - Old leader reconnects, attempts to commit log entries
- Terms are used to detect stale leaders (and candidates):
  - Every RPC contains the term of its sender
  - If sender's term is older, the RPC is rejected; the sender reverts to follower and updates its term
  - If receiver's term is older, it reverts to follower, updates its term, then processes the RPC normally
- Election updates terms of a majority of servers
  - Deposed server cannot commit new log entries
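
These term rules boil down to one guard executed before any RPC body; a sketch reusing the State type from the slide-5 sketch (the Server struct is an assumption):

    type Server struct {
        CurrentTerm int
        State       State
    }

    // checkTerm applies the sender's term before processing any RPC. It
    // returns false if the RPC must be rejected as stale; a true return may
    // still have demoted this server to follower on seeing a newer term.
    func (s *Server) checkTerm(senderTerm int) bool {
        if senderTerm < s.CurrentTerm {
            return false // stale sender: reject; it will step down on reply
        }
        if senderTerm > s.CurrentTerm {
            s.CurrentTerm = senderTerm
            s.State = Follower // step down if leader or candidate
        }
        return true
    }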

25. Client Protocol
- Send commands to the leader
  - If leader unknown, contact any server
  - If contacted server is not the leader, it will redirect to the leader
- Leader does not respond until command has been logged, committed, and executed by the leader's state machine
- If the request times out (e.g., leader crash):
  - Client reissues command to some other server
  - Eventually redirected to new leader
  - Retry request with new leader

26. Client Protocol, cont'd
- What if the leader crashes after executing a command, but before responding?
  - Must not execute the command twice
- Solution: client embeds a unique id in each command
  - Server includes id in log entry
  - Before accepting a command, leader checks its log for an entry with that id
  - If id found in log, ignore new command and return the response from the old command
- Result: exactly-once semantics as long as the client doesn't crash
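
A sketch of the duplicate-detection idea (the map-based cache and the apply stub are illustrative; the slide's version scans the log for the id instead):

    // execOnce executes a client command at most once. seen caches, per
    // command id, the response of the already-executed command; apply stands
    // in for "append to log, commit, run on the state machine".
    func execOnce(seen map[string][]byte, id string, cmd []byte,
        apply func([]byte) []byte) []byte {

        if resp, ok := seen[id]; ok {
            return resp // duplicate: reply with the old result, don't re-run
        }
        resp := apply(cmd)
        seen[id] = resp
        return resp
    }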

27. Configuration Changes
- System configuration:
  - ID, address for each server
  - Determines what constitutes a majority
- Consensus mechanism must support changes in the configuration:
  - Replace failed machine
  - Change degree of replication

28. Configuration Changes, cont'd
Cannot switch directly from one configuration to another: conflicting majorities could arise
[Diagram: servers 1-5 over time; during the switch, a majority of Cold and a majority of Cnew could be disjoint, allowing two leaders in the same term]

29. Joint Consensus
Raft uses a 2-phase approach:
- Intermediate phase uses joint consensus (elections and commitment need a majority of both old and new configurations)
- A configuration change is just a log entry; it is applied immediately on receipt (committed or not)
- Once joint consensus is committed, begin replicating the log entry for the final configuration
[Diagram: timeline Cold → Cold+new → Cnew; Cold can make unilateral decisions until the Cold+new entry is appended, and Cnew can make unilateral decisions once the Cnew entry is committed]
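
During the Cold+new phase, every decision needs two majorities at once; a Go sketch (the types are illustrative):

    // jointMajority reports whether the yes-voters form a majority of BOTH
    // the old and the new configuration, as joint consensus requires for
    // elections and for commitment.
    func jointMajority(yes map[int]bool, cOld, cNew []int) bool {
        count := func(cfg []int) int {
            n := 0
            for _, id := range cfg {
                if yes[id] {
                    n++
                }
            }
            return n
        }
        return count(cOld) > len(cOld)/2 && count(cNew) > len(cNew)/2
    }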

30. Joint Consensus, cont'd
Additional details:
- Any server from either configuration can serve as leader
- If the current leader is not in Cnew, it must step down once Cnew is committed
[Diagram: same timeline as the previous slide; a leader not in Cnew steps down at the point where the Cnew entry is committed]

31. Raft Summary
1. Leader election
2. Normal operation
3. Safety and consistency
4. Neutralize old leaders
5. Client protocol
6. Configuration changes