Abstract

The distributed transaction commit problem requires reaching agreement on whether a transaction is committed or aborted. The classic Two-Phase Commit protocol blocks if the coordinator fails. Fault-tolerant consensus algorithms also reach agreement, but do not block whenever any majority of the processes are working. The Paxos Commit algorithm runs a Paxos consensus algorithm on the commit/abort decision of each participant to obtain a transaction commit protocol that uses 2F+1 coordinators and makes progress if at least F+1 of them are working properly. Paxos Commit has the same stable-storage write delay, and can be implemented to have the same message delay in the fault-free case, as Two-Phase Commit, but it uses more messages. The classic Two-Phase Commit algorithm is obtained as the special F=0 case of the Paxos Commit algorithm.

1 Introduction

A distributed transaction consists of a number of operations, performed at multiple
sites, terminated by a request to commit or abort the transaction. The sites then use a transaction commit protocol to decide whether the transaction is committed or aborted. The transaction can be committed only if all sites are willing to commit it. Achieving this all-or-nothing atomicity property in a distributed system is not trivial. The requirements for transaction commit are stated precisely in Section 2.

The classic transaction commit protocol is Two-Phase Commit [9], described in Section 3. It uses a single coordinator to reach agreement. The failure of that coordinator can cause the protocol to block, with no process knowing the outcome, until the coordinator is repaired. In Section 4, we use the Paxos consensus algorithm [12] to obtain a transaction commit protocol that uses multiple coordinators; it makes progress if a majority of the coordinators are working. Section 5 compares Two-Phase Commit and Paxos Commit. We show that Two-Phase Commit is a degenerate case of the Paxos Commit algorithm with a single coordinator, guaranteeing progress only if that coordinator is working. Section 6 discusses some practical aspects of transaction management. Related work is discussed in the conclusion.

Our computation model assumes that algorithms are executed by a collection of processes that communicate using messages. Each process executes at a node in a network. A process can save data on stable storage that survives failures. Different processes may execute on the same node. Our cost model counts inter-node messages, message delays, stable-storage writes, and stable-storage write delays. We assume that messages between processes on the same node have negligible cost. Our failure model assumes that nodes, and hence their processes, can fail; messages can be lost or duplicated, but not (undetectably) corrupted. Any process executing at a failed node simply stops performing actions; it does not perform incorrect actions and does not forget its state. Implementing this model of process failure requires writing information to stable storage, which can be an expensive operation. We will see that the delays incurred by writes to stable storage are the same in Two-Phase Commit and Paxos Commit.

In general, there are two kinds of correctness properties that an algorithm must satisfy: safety and liveness. Intuitively, a safety property describes what is allowed to happen, and a liveness property describes what must happen [2]. Our algorithms are asynchronous in the sense that their safety properties do not depend on timely execution by processes or on bounded message

[...]

- Consistency: It is impossible for one RM to be in the committed state and another to be in the aborted state.

These two properties imply that, once an RM enters the committed state, no other RM can enter the aborted state, and vice versa.

Each RM also has a prepared state. We require that
- An RM can enter the committed state only after all RMs have been in the prepared state.

These requirements imply that the transaction can commit, meaning that all RMs reach the committed state, only by the following sequence of events:
- All the RMs enter the prepared state, in any order.
- All the RMs enter the committed state, in any order.

The protocol allows the following event that prevents the transaction from committing:
- Any RM in the working state can enter the aborted state.

The stability and consistency conditions imply that this spontaneous abort event cannot occur if some RM has entered the committed state. In practice, a working RM will abort when it realizes that it cannot perform its part of the transaction.

These requirements are summarized in the state-transition diagram of Figure 1. The goal of the algorithm is for all RMs to reach the committed or aborted state, but this cannot be achieved in a non-trivial way if RMs can fail or become isolated through communication failure. (A trivial solution is one in which all RMs always abort.) Moreover, the classic theorem of Fischer, Lynch, and Paterson [8] implies that a deterministic, purely asynchronous

[...]

[Figure 1: The state-transition diagram for a resource manager (states: working, prepared, committed, aborted). It begins in the working state, in which it may decide that it wants to abort or commit. It aborts by simply entering the aborted state. If it decides to commit, it enters the prepared state. From this state, it can commit only if all other resource managers also decided to commit.]

3 Two-Phase Commit

3.1 The Protocol

The Two-Phase Commit protocol is an implementation of transaction commit that uses a transaction manager (TM) process to coordinate the decision-making procedure. The RMs have the same states in this protocol as in the specification of transaction commit. The TM has the following states: init (its initial state), preparing, committed, and aborted.

The Two-Phase Commit protocol starts when an RM enters the prepared state and sends a Prepared message to the TM. Upon receipt of the Prepared message, the TM enters the preparing state and sends a Prepare message to every other RM. Upon receipt of the Prepare message, an RM that is still in the working state can enter the prepared state and send a Prepared message to the TM. When it has received a Prepared message from all RMs, the TM can enter the committed state and send Commit messages to all the other processes. The RMs can enter the committed state upon receipt of the Commit message from the TM. The message flow for the Two-Phase Commit protocol is shown in Figure 2.

Figure 2 shows one distinguished RM spontaneously preparing. In fact, any RM can spontaneously go from the working to the prepared state and send a Prepared message at any time. The TM's Prepare message can be viewed as an optional suggestion that now would be a good time to do so. Other events, including real-time deadlines, might cause working RMs to prepare. This observation is the basis for variants of the Two-Phase Commit protocol that use fewer messages.

An RM can spontaneously enter the aborted state if it is in the working state; and the TM can spontaneously enter the aborted state unless it is in

[...]

[Figure 2: The message flow for Two-Phase Commit in the normal failure-free case, where RM1 is the first RM to enter the prepared state. Participants: RM1, the other RMs, and the TM; messages in order: Prepared (RM1 to TM), Prepare (TM to the other RMs), Prepared (other RMs to TM), Commit (TM to all RMs).]

[...]

As discussed in Section 3.1, we can eliminate the TM's Prepare messages, reducing the message complexity to 2N. But in practice, this requires either extra message delays or some real-time assumptions.
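
The failure-free flow of Section 3.1 can be replayed in a few lines of Python to count messages. This is a minimal sketch under stated assumptions, not the paper's code: no process fails, every process runs on its own node (so every logged message is inter-node), RM0 is the first RM to prepare, and the class and function names are hypothetical. With `tm_sends_prepare=False` it models the variant mentioned above in which the RMs prepare on their own and the TM sends no Prepare messages.

```python
from dataclasses import dataclass, field


@dataclass
class TwoPhaseRun:
    """Final states and message log of one failure-free Two-Phase Commit run."""
    rm_state: list
    tm_state: str = "init"
    messages: list = field(default_factory=list)  # (type, sender, receiver) triples


def two_phase_commit(n_rms: int, tm_sends_prepare: bool = True) -> TwoPhaseRun:
    """Replay the normal-case message flow of Section 3.1 for n_rms RMs.

    Assumes no process fails and every process runs on its own node, so
    every message logged here counts as an inter-node message.
    """
    run = TwoPhaseRun(rm_state=["working"] * n_rms)

    def send(kind: str, src: str, dst: str) -> None:
        run.messages.append((kind, src, dst))

    if tm_sends_prepare:
        # RM0 prepares spontaneously and tells the TM.
        run.rm_state[0] = "prepared"
        send("Prepared", "RM0", "TM")
        # The TM enters "preparing" and suggests that every other RM prepare.
        run.tm_state = "preparing"
        for i in range(1, n_rms):
            send("Prepare", "TM", f"RM{i}")
        # Each other RM, still working, prepares and replies.
        for i in range(1, n_rms):
            run.rm_state[i] = "prepared"
            send("Prepared", f"RM{i}", "TM")
    else:
        # Variant without Prepare messages: every RM prepares on its own
        # initiative (for example, at a real-time deadline).
        for i in range(n_rms):
            run.rm_state[i] = "prepared"
            send("Prepared", f"RM{i}", "TM")

    # With a Prepared message from every RM, the TM commits and tells each RM.
    assert all(state == "prepared" for state in run.rm_state)
    run.tm_state = "committed"
    for i in range(n_rms):
        send("Commit", "TM", f"RM{i}")
        run.rm_state[i] = "committed"
    return run
```

For N = 3 the default run logs 8 messages (3N−1, the Two-Phase Commit entry of Figure 4), while the variant without Prepare messages logs 6 (2N).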

In addition to the message delays, the Two-Phase Commit protocol incurs the delays associated with writes to stable storage: the write by the first RM to prepare, the writes by the remaining RMs when they prepare, and the write by the TM when it makes the commit decision. This can be reduced to two write delays by having all RMs prepare concurrently.

3.3 The Problem with Two-Phase Commit

In a transaction commit protocol, if one or more RMs fail, the transaction is usually aborted. For example, in the Two-Phase Commit protocol, if the TM does not receive a Prepared message from some RM soon enough after sending the Prepare message, then it will abort the transaction by sending Abort messages to the other RMs. However, the failure of the TM can cause the protocol to block until the TM is repaired. In particular, if the TM fails right after every RM has sent a Prepared message, then the other RMs have no way of knowing whether the TM committed or aborted the transaction.

A non-blocking commit protocol is one in which the failure of a single process does not prevent the other processes from deciding if the transaction is committed or aborted. They are often called Three-Phase Commit protocols. Several have been proposed, and a few have been implemented [3, 4, 19]. They have usually attempted to "fix" the Two-Phase Commit protocol by choosing another TM if the first TM fails. However, we know of none that provides a complete algorithm proven to satisfy a clearly stated correctness condition. For example, the discussion of non-blocking commit in the classic text of Bernstein, Hadzilacos, and Goodman [3] fails to explain what a process should do if it receives messages from two different processes, both claiming to be the current TM. Guaranteeing that this situation cannot arise is a problem that is as difficult as implementing a transaction commit protocol.

4 Paxos Commit

4.1 The Paxos Consensus Algorithm

The distributed computing community has studied the more general problem of consensus, which requires that a collection of processes agree on some value. Many solutions to this problem have been proposed, under various

[...]

containing its current state, which consists of
- the largest ballot number for which it received a phase 1a message, and
- the phase 2b message with the highest ballot number it has sent, if any.

The acceptor ignores the phase 1a message if it has performed an action for a ballot number bal or greater.

Phase 2a: When the leader has received a phase 1b message for ballot number bal from a majority of the acceptors, it can learn one of two possibilities:
- Free: None of the majority of acceptors report having sent a phase 2b message, so the algorithm has not yet chosen a value.
- Forced: Some acceptor in the majority reports having sent a phase 2b message. Let μ be the maximum ballot number of all the reported phase 2b messages, and let M_μ be the set of all those phase 2b messages that have ballot number μ. All the messages in M_μ have the same value v, which might already have been chosen.

In the free case, the leader can try to get any value accepted; it usually picks the first value proposed by a client. In the forced case, it tries to get the value v chosen by sending a phase 2a message with value v and ballot number bal to every acceptor.

Phase 2b: When an acceptor receives a phase 2a message for a value v and ballot number bal, if it has not already received a phase 1a or 2a message for a larger ballot number, it accepts that message and sends a phase 2b message for v and bal to the leader. The acceptor ignores the message if it has already participated in a higher-numbered ballot.

Phase 3: When the leader has received phase 2b messages for value v and ballot bal from a majority of the acceptors, it knows that the value v has been chosen and communicates that fact to all interested processes with a phase 3 message.

Ballot 0 has no phase 1 because there are no lower-numbered ballots, so there is nothing for acceptors to report in phase 1b messages.

An explanation of why the Paxos algorithm is correct can be found in the literature [6, 12, 13, 15]. As with any asynchronous algorithm, process

[...]

Paxos Commit uses a separate instance of the Paxos consensus algorithm to obtain agreement on the decision each RM makes of whether to prepare or abort, a decision we represent by the values Prepared and Aborted. So, there is one instance of the consensus algorithm for each RM. The transaction is committed iff each RM's instance chooses Prepared; otherwise the transaction is aborted. The idea of performing a separate consensus on each RM's decision can be used with any consensus algorithm, but how one uses this idea to save a message delay depends on the algorithm.

Paxos Commit uses the same set of 2F+1 acceptors and the same current leader for each instance of Paxos. So, the cast of characters consists of N RMs, 2F+1 acceptors, and the current leader. We assume for now that the RMs know the acceptors in advance.

In ordinary Paxos, a ballot 0 phase 2a message can have any value v. While the leader usually sends such a message, the Paxos algorithm obviously remains correct if the sending of that message is delegated to any single process chosen in advance. In Paxos Commit, each RM announces its prepare/abort decision by sending, in its instance of Paxos, a ballot 0 phase 2a message with the value Prepared or Aborted.

Execution of Paxos Commit normally starts when some RM decides to prepare and sends a BeginCommit message to the leader. The leader then sends a Prepare message to all the other RMs. If an RM decides that it wants to prepare, it sends a phase 2a message with value Prepared and ballot number 0 in its instance of the Paxos algorithm. Otherwise, it sends a phase 2a message with the value Aborted and ballot number 0. For each instance, an acceptor sends its phase 2b message to the leader. The leader knows the outcome of this instance if it receives F+1 phase 2b messages for ballot number 0, whereupon it can send its phase 3 message announcing the outcome to the RMs. (As observed in Section 4.1 above, phase 3 can be eliminated by having the acceptors send their phase 2b messages directly to the RMs.) The transaction is committed iff every RM's instance of the Paxos algorithm chooses Prepared; otherwise the transaction is aborted.

For efficiency, an acceptor can bundle its phase 2b messages for all instances of the Paxos algorithm into a single physical message. The leader can distill its phase 3 messages for all instances into a single Commit or Abort message, depending on whether or not all instances chose the value Prepared.

The instances of the Paxos algorithm for one or more RMs may not reach a decision with ballot number 0. In that case, the leader (alerted by a timeout) assumes that each of those RMs has failed and executes phase 1a for a larger ballot number in each of their instances of Paxos. If, in phase 2a,

[...]

[Figure 3: The message flow for Paxos Commit in the normal failure-free case, where RM1 is the first RM to enter the prepared state, and 2a Prepared and 2b Prepared are the phase 2a and 2b messages of the Paxos consensus algorithm. Participants: RM1, the other RMs, the initial leader, and the acceptors; messages in order: BeginCommit, 2a Prepared, Prepare, 2a Prepared, 2b Prepared, Commit.]

4.3 The Cost of Paxos Commit

We now consider the cost of Paxos Commit in the normal case, when the transaction is committed. The sequence of message exchanges is shown in Figure 3. We again assume that there are N RMs. We consider a system that can tolerate F faults, so there are 2F+1 acceptors. However, we assume the optimization in which the leader sends phase 2a messages to F+1 acceptors, and only if one or more of them fail are other acceptors used.

In the normal case, the Paxos Commit algorithm uses the following potentially inter-node messages:
- The first RM to prepare sends a BeginCommit message to the leader. (1 message)
- The leader sends a Prepare message to every other RM. (N−1 messages)
- Each RM sends a ballot 0 phase 2a Prepared message for its instance of Paxos to the F+1 acceptors. (N(F+1) messages)
- For each RM's instance of Paxos, an acceptor responds to a phase 2a message by sending a phase 2b Prepared message to the leader. However, an acceptor can bundle the messages for all those instances into a single message. (F+1 messages)
- The leader sends a single Commit message to each RM containing a phase 3 Prepared message for every instance of Paxos. (N messages)
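
The list above can be checked with a little arithmetic. The Python sketch below, whose function names are hypothetical, sums the five items, which gives (N+1)(F+3)−2 messages when every process is on its own node. Figure 4's entry of (N+1)(F+3)−4 assumes the initial leader shares a node with an acceptor; the code subtracts two messages for that case, and the explanation of where those two savings come from (given in a comment) is our assumption rather than something stated in this excerpt.

```python
def paxos_commit_messages(n_rms: int, f: int) -> dict:
    """Normal-case Paxos Commit messages, one entry per item of the list above.

    Assumes F+1 acceptors receive phase 2a messages, acceptors bundle their
    phase 2b messages, and every process is on its own node.
    """
    return {
        "BeginCommit": 1,            # first RM to prepare -> leader
        "Prepare": n_rms - 1,        # leader -> every other RM
        "phase2a": n_rms * (f + 1),  # each RM -> F+1 acceptors, ballot 0
        "phase2b": f + 1,            # one bundled message per acceptor -> leader
        "Commit": n_rms,             # leader -> each RM, carrying all phase 3 results
    }


def paxos_commit_total(n_rms: int, f: int, leader_on_acceptor_node: bool = False) -> int:
    """Total inter-node messages; optionally apply the co-location saving of Figure 4."""
    total = sum(paxos_commit_messages(n_rms, f).values())
    if leader_on_acceptor_node:
        # Assumed source of the two saved messages: the co-located acceptor's
        # phase 2b message to the leader becomes intra-node, and BeginCommit
        # is folded into the first RM's phase 2a message to that acceptor.
        total -= 2
    return total


# The list sums to (N+1)(F+3) - 2; with the leader on an acceptor's node it
# becomes Figure 4's (N+1)(F+3) - 4, and at F = 0 that equals 3N - 1, the
# Two-Phase Commit entry.
for n in range(1, 6):
    for f in range(0, 3):
        assert paxos_commit_total(n, f) == (n + 1) * (f + 3) - 2
        assert paxos_commit_total(n, f, leader_on_acceptor_node=True) == (n + 1) * (f + 3) - 4
    assert paxos_commit_total(n, 0, leader_on_acceptor_node=True) == 3 * n - 1
```

The last assertion checks that, with F = 0, the Paxos Commit count collapses to the Two-Phase Commit count, in line with the claim that Two-Phase Commit is the F=0 case of Paxos Commit.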

| | Two-Phase Commit | Paxos Commit | Faster Paxos Commit |
|---|---|---|---|
| Message delays | 4 | 5 | 4 |
| Messages, no co-location | 3N−1 | (N+1)(F+3)−4 | N(2F+3)−1 |
| Messages, with co-location | 3N−3 | N(F+3)−3 | (N−1)(2F+3) |
| Stable-storage write delays | 2 | 2 | 2 |
| Stable-storage writes | N+1 | N+F+1 | N+F+1 |

Figure 4: Corresponding Complexity

5 Paxos versus Two-Phase Commit

In the Two-Phase Commit protocol, the TM both makes the abort/commit decision and stores that decision in stable storage. Two-Phase Commit can block indefinitely if the TM fails. Had we used Paxos simply to obtain consensus on a single decision value, this would have been equivalent to replacing the TM's stable storage by the acceptors' stable storage, and replacing the single TM by a set of possible leaders. Our Paxos Commit algorithm goes further in essentially eliminating the TM's role in making the decision. In Two-Phase Commit, the TM can unilaterally decide to abort. In Paxos Commit, a leader can make an
abort decision only for an RM that does not decide for itself. The leader does this by initiating a ballot with number greater than 0 for that RM's instance of Paxos. (The leader must be able to do this to prevent blocking by a failed RM.)

Sections 3.2 and 4.3 describe the normal-case cost in messages and writes to stable storage of Two-Phase Commit and Paxos Commit, respectively. Both algorithms have the same three stable-storage write delays (two if all RMs prepare concurrently). The other costs are summarized in Figure 4. The entries for Paxos Commit assume that the initial leader is on the same node as an acceptor. Faster Paxos Commit is the algorithm optimized to remove phase 3 of the Paxos consensus algorithm. For Two-Phase Commit, co-location means that the initiating RM and the TM are on the same node. For Paxos Commit, it means that each acceptor is on the same node as an RM, and that the initiating RM is on the same node as the initial leader. In Paxos Commit without co-location, we assume that the initial leader is an acceptor.

[...]

Section 5 showed that Two-Phase Commit is the F=0 case of Paxos Commit, in which the transaction manager performs the functions of the one acceptor and the one possible leader. We therefore consider only Paxos Commit.

To accommodate a dynamic set of RMs, we introduce a registrar process that keeps track of what RMs have joined the transaction. The registrar acts much like an additional RM, except that its input to the commit protocol is the set of RMs that have joined, rather than the value Prepared or Aborted. As with an RM, Paxos Commit runs a separate instance of the Paxos consensus algorithm to decide upon the registrar's input, using the same set of acceptors. The transaction is committed iff the consensus algorithm for the registrar chooses a set of RMs and the instance of the consensus algorithm for each of those RMs chooses Prepared.

The registrar is generally on the same node as
the initial leader, which is typically on the same node as the RM that creates the transaction. In Two-Phase Commit, the registrar's function is usually performed by the TM rather than by a separate process. (Recall that for the case of Two-Phase Commit, the Paxos consensus algorithm is the trivial one in which the TM simply chooses the value and writes it to stable storage.)

We now describe how the dynamic Paxos algorithm works.

6.1 Transaction Creation

Each node has a local transaction service that an RM can call to create and manage transactions. To create a transaction, the service constructs a descriptor for the transaction, consisting of a unique identifier (uid) and the names of the transaction's coordinator processes. The coordinator processes are all processes other than the RMs that take part in the commit protocol: namely, the registrar, the initial leader, the other possible leaders, and the acceptors. Any message sent during the execution of a transaction contains the transaction descriptor, so a recipient knows which transaction the message is for. A process might first learn about the existence of a transaction by receiving such a message. The descriptor tells the process the names of the coordinators that it must know to perform its role in the protocol.

6.2 Joining a Transaction

An RM joins a transaction by sending a join message to the registrar. As observed above, the join message must contain the transaction descriptor if

[...]

instances. The acceptor waits until it knows what phase 2b message to send for all instances before sending this one message. However, "all instances" includes an instance for each participating RM, and the set of participating RMs is chosen by the registrar's instance. To break this circularity, we observe that, if the registrar's instance chooses the value Aborted, then it doesn't matter what values are chosen by the RMs' instances. Therefore, the acceptor waits until it is ready to send a phase 2b message for the registrar's instance. If that message contains a set J of RMs as a value, then the acceptor waits until it can send the phase 2b message for each RM in J. If the phase 2b message for the registrar's instance contains the value Aborted, then the acceptor sends only that phase 2b message.

As explained in Section 4.2, the protocol can be short-circuited and abort messages sent to all processes if any participating RM chooses the value Aborted. Instead of sending a phase 2a message, the RM can simply send an abort message to the coordinator processes. The registrar can relay the abort message to all other RMs that have joined the transaction.

Failure of the registrar before it sends its ballot 0 phase 2a message causes the transaction to abort. However, failure of a single RM can also cause the transaction to abort. Fault tolerance means only that failure of an individual process does not prevent a commit/abort decision from being made.

6.4 Learning the Outcome

The description above shows that, when there is no failure, the dynamic commit protocol works essentially as described in Figure 3 of Section 4.3. We now consider what happens in the event of failure. The case of acceptor failure is straightforward. If the transaction is created to have 2F+1 acceptors, then failure of up to F of them causes no problem. If more acceptors fail, the protocol simply blocks until there are F+1 working acceptors, whereupon it continues as if nothing had happened.

Before considering other process failures, let us examine how a process P, knowing only the transaction descriptor, can discover the outcome of the protocol, that is, whether the transaction was committed or aborted. For example, P might be a restarted RM that had failed after sending a phase 2a Prepared message but before recording the outcome in its stable storage. Having the descriptor, P knows the set of all possible leader processes. It sends them a message containing the descriptor and asking what the outcome was. If all the leader processes have failed, then P must wait until one or more of them are restarted. (Each node that has an acceptor process

[...]

7 Conclusion

Two-Phase Commit is the classical transaction commit protocol. Indeed, it is sometimes thought to be synonymous with transaction commit [17]. Two-Phase Commit is not fault tolerant because it uses a single coordinator whose failure can cause the protocol to block. We have introduced Paxos Commit, a new transaction commit protocol that uses multiple coordinators and makes progress if a majority of them are working. Hence, 2F+1 coordinators can make progress even if F of them are faulty. Two-Phase Commit is isomorphic to Paxos Commit with a single coordinator.

In the normal, failure-free case, Paxos Commit requires one more message delay than Two-Phase Commit. This extra message delay is eliminated by Faster Paxos Commit, which has the theoretically minimal message delay for a non-blocking protocol.

Non-blocking transaction commit protocols were first proposed in the early 1980s [3, 4, 19]. The initial algorithms had two message delays more than Two-Phase Commit in the failure-free case; later algorithms reduced this to one extra message delay [3]. All of these algorithms used a coordinator process and assumed that two different processes could never both believe they were the coordinator, an assumption that cannot be implemented in a purely asynchronous system. Transient network failures could cause them to violate the consistency requirement of transaction commit.

It is easy to implement non-blocking commit using a consensus algorithm, an observation also made in the 1980s [16]. However, the obvious way of doing this leads to one message delay more than that of Paxos Commit. The only algorithm that achieved the low message delay of Faster Paxos Commit is that of Guerraoui, Larrea, and Schiper [11]. It is essentially the same as Faster Paxos Commit in the absence of failures. (It can be modified with an optimization analogous to the sending of phase 2a messages only to a majority of acceptors to give it the same message complexity as Faster Paxos Commit.) This similarity to Paxos Commit is not surprising, since most asynchronous consensus algorithms (and most incomplete attempts at algorithms) are the same as Paxos in the failure-free case. However, their algorithm is more complicated than Paxos Commit. It uses a special procedure for the failure-free case and calls upon a modified version of an ordinary consensus algorithm, which adds an extra message delay in the event of failure.

With 2F+1 coordinators and N resource managers, Paxos Commit requires about 2FN more messages than Two-Phase Commit in the normal case. Both algorithms incur the same delay for writing to stable storage. In

[...]

References

[5] Bernadette Charron-Bost and André Schiper. Uniform consensus is harder than consensus (extended abstract). Technical Report DSC/2000/028, École Polytechnique Fédérale de Lausanne, Switzerland, May 2000.
[6] Roberto De Prisco, Butler Lampson, and Nancy Lynch. Revisiting the Paxos algorithm. In Marios Mavronicolas and Philippas Tsigas, editors, Proceedings of the 11th International Workshop on Distributed Algorithms (WDAG 97), volume 1320 of Lecture Notes in Computer Science, pages 111–125, Saarbrücken, Germany, 1997. Springer-Verlag.
[7] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM, 35(2):288–323, April 1988.
[8] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.
[9] J. N. Gray. Notes on database operating systems. In R. Bayer, R. M. Graham, and G. Seegmüller, editors, Operating Systems: An Advanced Course, volume 60 of Lecture Notes in Computer Science, pages 393–481. Springer-Verlag, Berlin, Heidelberg, New York, 1978.
[10] Rachid Guerraoui. Revisiting the relationship between non-blocking atomic commitment and consensus. In Jean-Michel Hélary and Michel Raynal, editors, Proceedings of the 9th International Workshop on Distributed Algorithms (WDAG 95), volume 972 of Lecture Notes in Computer Science, pages 87–100, Le Mont-Saint-Michel, France, September 1995. Springer-Verlag.
[11] Rachid Guerraoui, Mikel Larrea, and André Schiper. Reducing the cost for non-blocking in atomic commitment. In Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS), pages 692–697, Hong Kong, May 1996. IEEE Computer Society.
[12] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, May 1998.
[13] Leslie Lamport. Paxos made simple. ACM SIGACT News (Distributed Computing Column), 32(4):51–58, December 2001.
[14] Leslie Lamport. Specifying Systems. Addison-Wesley, Boston, 2003. A link to an electronic copy can be found at http://lamport.org.

[...]

A The TLA+ Specifications

A.1 The Specification of a Transaction Commit Protocol

---- MODULE TCommit ----
CONSTANT RM       \* The set of participating resource managers
VARIABLE rmState  \* rmState[rm] is the state of resource manager rm.

TCTypeOK ==
  \* The type-correctness invariant
  rmState \in [RM -> {"working", "prepared", "committed", "aborted"}]

TCInit == rmState = [rm \in RM |-> "working"]
  \* The initial predicate.

canCommit == \A rm \in RM : rmState[rm] \in {"prepared", "committed"}
  \* True iff all RMs are in the "prepared" or "committed" state.

notCommitted == \A rm \in RM : rmState[rm] # "committed"
  \* True iff no resource manager has decided to commit.

\* We now define the actions that may be performed by the RMs, and then
\* define the complete next-state action of the specification to be the
\* disjunction of the possible RM actions.

Prepare(rm) == /\ rmState[rm] = "working"
               /\ rmState' = [rmState EXCEPT ![rm] = "prepared"]

Decide(rm) == \/ /\ rmState[rm] = "prepared"
                 /\ canCommit
                 /\ rmState' = [rmState EXCEPT ![rm] = "committed"]
              \/ /\ rmState[rm] \in {"working", "prepared"}
                 /\ notCommitted
                 /\ rmState' = [rmState EXCEPT ![rm] = "aborted"]

TCNext == \E rm \in RM : Prepare(rm) \/ Decide(rm)
  \* The next-state action.

TCSpec == TCInit /\ [][TCNext]_rmState
  \* The complete specification of the protocol.

\* We now assert invariance properties of the specification.

[...]

Message ==
  \* The set of all possible messages. Messages of type "Prepared" are sent
  \* from the RM indicated by the message's rm field to the TM. Messages of
  \* type "Commit" and "Abort" are broadcast by the TM, to be received by all RMs. The
  \* set msgs contains just a single copy of such a message.
  [type : {"Prepared"}, rm : RM] \cup [type : {"Commit", "Abort"}]

TPTypeOK ==
  \* The type-correctness invariant
  /\ rmState \in [RM -> {"working", "prepared", "committed", "aborted"}]
  /\ tmState \in {"init", "committed", "aborted"}
  /\ tmPrepared \subseteq RM
  /\ msgs \subseteq Message

TPInit ==
  \* The initial predicate.
  /\ rmState = [rm \in RM |-> "working"]
  /\ tmState = "init"
  /\ tmPrepared = {}
  /\ msgs = {}

\* We now define the actions that may be performed by the processes, first
\* the TM's actions, then the RMs' actions.

TMRcvPrepared(rm) ==
  \* The TM receives a "Prepared" message from resource manager rm.
  /\ tmState = "init"
  /\ [type |-> "Prepared", rm |-> rm] \in msgs
  /\ tmPrepared' = tmPrepared \cup {rm}
  /\ UNCHANGED <<rmState, tmState, msgs>>

TMCommit ==
  \* The TM commits the transaction; enabled iff the TM is in its initial
  \* state and every RM has sent a "Prepared" message.
  /\ tmState = "init"
  /\ tmPrepared = RM
  /\ tmState' = "committed"
  /\ msgs' = msgs \cup {[type |-> "Commit"]}
  /\ UNCHANGED <<rmState, tmPrepared>>

TMAbort ==
  \* The TM spontaneously aborts the transaction.

[...]

THEOREM TPSpec => []TPTypeOK
  \* This theorem asserts that the type-correctness predicate TPTypeOK is an
  \* invariant of the specification.

\* We now assert that the Two-Phase Commit protocol implements the
\* Transaction Commit protocol of module TCommit. The following statement
\* defines TC!TCSpec to be formula TCSpec of module TCommit. (The TLA+
\* INSTANCE statement is used to rename the operators defined in module
\* TCommit, avoiding any name conflicts that might exist with operators in
\* the current module.)

TC == INSTANCE TCommit

THEOREM TPSpec => TC!TCSpec
  \* This theorem asserts that the specification TPSpec of the Two-Phase
  \* Commit protocol implements the specification TCSpec of the Transaction
  \* Commit protocol.

\* The two theorems in this module have been checked with TLC for six RMs,
\* a configuration with 50816 reachable states, in a little over a minute
\* on a 1 GHz PC.

A.3 The Paxos Commit Algorithm

---- MODULE PaxosCommit ----
\* This module specifies the Paxos Commit algorithm. We specify only safety
\* properties, not liveness properties. We simplify the specification in
\* the following ways.
\*  - As in the specification of module TwoPhase, and for the same reasons,
\*    we let the variable msgs be the set of all messages that have ever
\*    been sent. If a message is sent to a set of recipients, only one copy
\*    of the message appears in msgs.
\*  - We do not explicitly model the receipt of messages. If an operation
\*    can be performed when a process has received a certain set of
\*    messages, then the operation is represented by an action that is
\*    enabled when those messages are in the set msgs of sent messages.
\*    (We are specifying only safety properties, which assert what events
\*    can occur, and the operation can occur if the messages that enable
\*    it have been sent.)
\*  - We do not model leader selection. We define actions that the current
\*    leader may perform, but do not specify who performs them.
\* As in the specification of Two-Phase Commit in module TwoPhase, we have
\* RMs spontaneously issue Prepared messages and we ignore Prepare messages.

EXTENDS Integers

Maximum(S) ==
  \* If S is a set of numbers, then this defines Maximum(S) to be the
  \* maximum of those numbers, or -1 if S is empty.
  IF S = {} THEN -1 ELSE CHOOSE n \in S : \A m \in S : n \geq m

[...]

        bal : Ballot \cup {-1},
        val : {"prepared", "aborted", "none"}]]]
  /\ msgs \in SUBSET Message

PCInit ==
  \* The initial predicate.
  /\ rmState = [rm \in RM |-> "working"]
  /\ aState = [ins \in RM |->
                 [ac \in Acceptor |-> [mbal |-> 0, bal |-> -1, val |-> "none"]]]
  /\ msgs = {}

\* The Actions

Send(m) == msgs' = msgs \cup {m}
  \* An action expression that describes the sending of message m.

\* RM Actions

RMPrepare(rm) ==
  \* Resource manager rm prepares by sending a phase 2a message for ballot
  \* number 0 with value "prepared".
  /\ rmState[rm] = "working"
  /\ rmState' = [rmState EXCEPT ![rm] = "prepared"]
  /\ Send([type |-> "phase2a", ins |-> rm, bal |-> 0, val |-> "prepared"])
  /\ UNCHANGED aState

RMChooseToAbort(rm) ==
  \* Resource manager rm spontaneously decides to abort. It may (but need
  \* not) send a phase 2a message for ballot number 0 with value "aborted".
  /\ rmState[rm] = "working"
  /\ rmState' = [rmState EXCEPT ![rm] = "aborted"]
  /\ Send([type |-> "phase2a", ins |-> rm, bal |-> 0, val |-> "aborted"])
  /\ UNCHANGED aState

RMRcvCommitMsg(rm) ==
  \* Resource manager rm is told by the leader to commit. When this action
  \* is enabled, rmState[rm] must equal either "prepared" or "committed".
  \* In the latter case, the action leaves the state unchanged (it is a
  \* "stuttering step").
  /\ [type |-> "Commit"] \in msgs
  /\ rmState' = [rmState EXCEPT ![rm] = "committed"]
  /\ UNCHANGED <<aState, msgs>>

[...]

  /\ Send([type |-> "phase2a", ins |-> rm, bal |-> bal, val |-> v])
  /\ UNCHANGED <<rmState, aState>>

Decide ==
  \* A leader can decide that Paxos Commit has reached a result and send a
  \* message announcing the result if it has received the necessary phase
  \* 2b messages.
  /\ LET Decided(rm, v) ==
           \* True iff instance rm of the Paxos consensus algorithm has
           \* chosen the value v.
           \E b \in Ballot, MS \in Majority :
             \A ac \in MS : [type |-> "phase2b", ins |-> rm, bal |-> b,
                             val |-> v, acc |-> ac] \in msgs
     IN  \/ /\ \A rm \in RM : Decided(rm, "prepared")
            /\ Send([type |-> "Commit"])
         \/ /\ \E rm \in RM : Decided(rm, "aborted")
            /\ Send([type |-> "Abort"])
  /\ UNCHANGED <<rmState, aState>>

\* Acceptor Actions

Phase1b(acc) ==
  \E m \in msgs :
    /\ m.type = "phase1a"
    /\ aState[m.ins][acc].mbal < m.bal
    /\ aState' = [aState EXCEPT ![m.ins][acc].mbal = m.bal]
    /\ Send([type |-> "phase1b", ins |-> m.ins, mbal |-> m.bal,
             bal |-> aState[m.ins][acc].bal, val |-> aState[m.ins][acc].val,
             acc |-> acc])
    /\ UNCHANGED rmState

Phase2b(acc) ==
  /\ \E m \in msgs :
       /\ m.type = "phase2a"
       /\ aState[m.ins][acc].mbal \leq m.bal
       /\ aState' = [aState EXCEPT ![m.ins][acc].mbal = m.bal,
                                   ![m.ins][acc].bal = m.bal,

[...]
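
To make the acceptor actions Phase1b and Phase2b, the Decided test, and the leader's free/forced choice in phase 2a easier to follow, here is a Python sketch for a single Paxos instance. It is an illustration, not a translation of the PaxosCommit module: messages are plain tuples rather than TLA+ records, all function names are hypothetical, and the rule that a leader proposes "aborted" in the free case of a ballot above 0 is our reading of Section 5's statement that a leader makes an abort decision only for an RM that does not decide for itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AcceptorState:
    """One acceptor's state for one Paxos instance, like aState[ins][acc]."""
    mbal: int = 0       # highest ballot number the acceptor has taken part in
    bal: int = -1       # ballot of the last phase 2a message it accepted (-1: none)
    val: str = "none"   # value accepted in that ballot


def phase1b(state: AcceptorState, ballot: int) -> Optional[tuple]:
    """Handle a phase 1a message; return the (bal, val) report, or None if ignored."""
    if state.mbal < ballot:
        state.mbal = ballot
        return (state.bal, state.val)
    return None  # already took part in a ballot numbered `ballot` or higher


def phase2b(state: AcceptorState, ballot: int, value: str) -> bool:
    """Handle a phase 2a message; return True iff a phase 2b message is sent."""
    if state.mbal <= ballot:
        state.mbal = state.bal = ballot
        state.val = value
        return True
    return False  # already took part in a higher-numbered ballot


def phase2a_value(reports: list) -> str:
    """Leader's phase 2a value for a ballot > 0, given a majority's phase 1b reports."""
    voted = [(b, v) for b, v in reports if b >= 0]
    if not voted:
        # Free case. Assumed Paxos Commit rule: a leader that takes over an
        # RM's instance proposes "aborted" on that RM's behalf.
        return "aborted"
    # Forced case: use the value of the highest-numbered reported ballot.
    return max(voted, key=lambda report: report[0])[1]


def decided(phase2b_msgs: set, n_acceptors: int, value: str) -> bool:
    """True iff some ballot has phase 2b messages for `value` from a majority.

    Each element of phase2b_msgs is a (ballot, value, acceptor) triple.
    """
    voters: dict = {}
    for ballot, v, acc in phase2b_msgs:
        if v == value:
            voters.setdefault(ballot, set()).add(acc)
    return any(len(accs) > n_acceptors // 2 for accs in voters.values())


def outcome(chosen: dict) -> Optional[str]:
    """Commit iff every RM's instance chose "prepared"; abort if any chose "aborted"."""
    if any(v == "aborted" for v in chosen.values()):
        return "Abort"
    if all(v == "prepared" for v in chosen.values()):
        return "Commit"
    return None  # some instance is still undecided
```

For example, with F = 1 (three acceptors), suppose an RM's ballot-0 Prepared message reaches only acceptor 0 before the RM fails. If the leader's ballot 1 hears from acceptors 0 and 1, `phase2a_value` is forced to "prepared"; if it hears only from acceptors 1 and 2, it is free and proposes "aborted". Either way, acceptor 0 ignores any late ballot-0 message once it has joined ballot 1.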
