Abstract

The distributed transaction commit problem requires reaching agreement on whether a transaction is committed or aborted. The classic Two-Phase Commit protocol blocks if the coordinator fails. Fault-tolerant consensus algorithms also reach agreement, but do not block whenever any majority of the processes are working. The Paxos Commit algorithm runs a Paxos consensus algorithm on the commit/abort decision of each participant to obtain a transaction commit protocol that uses 2F+1 coordinators and makes progress if at least F+1 of them are working properly. Paxos Commit has the same stable-storage write delay, and can be implemented to have the same message delay in the fault-free case, as Two-Phase Commit, but it uses more messages. The classic Two-Phase Commit algorithm is obtained as the special F=0 case of the Paxos Commit algorithm.

1 Introduction

A distributed transaction consists of a number of operations, performed at multiple
sites, terminated by a request to commit or abort the transaction. The sites then use a transaction commit protocol to decide whether the transaction is committed or aborted. The transaction can be committed only if all sites are willing to commit it. Achieving this all-or-nothing atomicity property in a distributed system is not trivial. The requirements for transaction commit are stated precisely in Section 2.

The classic transaction commit protocol is Two-Phase Commit [9], described in Section 3. It uses a single coordinator to reach agreement. The failure of that coordinator can cause the protocol to block, with no process knowing the outcome, until the coordinator is repaired. In Section 4, we use the Paxos consensus algorithm [12] to obtain a transaction commit protocol that uses multiple coordinators; it makes progress if a majority of the coordinators are working. Section 5 compares Two-Phase Commit and Paxos Commit. We show that Two-Phase Commit is a degenerate case of the Paxos Commit algorithm with a single coordinator, guaranteeing progress only if that coordinator is working. Section 6 discusses some practical aspects of transaction management. Related work is discussed in the conclusion.

Our computation model assumes that algorithms are executed by a collection of processes that communicate using messages. Each process executes at a node in a network. A process can save data on stable storage that survives failures. Different processes may execute on the same node. Our cost model counts inter-node messages, message delays, stable-storage writes, and stable-storage write delays. We assume that messages between processes on the same node have negligible cost. Our failure model assumes that nodes, and hence their processes, can fail; messages can be lost or duplicated, but not (undetectably) corrupted. Any process executing at a failed node simply stops performing actions; it does not perform incorrect actions and does not forget its state. Implementing this model of process failure requires writing information to stable storage, which can be an expensive operation. We will see that the delays incurred by writes to stable storage are the same in Two-Phase Commit and Paxos Commit.

In general, there are two kinds of correctness properties that an algorithm must satisfy: safety and liveness. Intuitively, a safety property describes what is allowed to happen, and a liveness property describes what must happen [2]. Our algorithms are asynchronous in the sense that their safety properties do not depend on timely execution by processes or on bounded message [...]

[...]

Consistency  It is impossible for one RM to be in the committed state and another to be in the aborted state.

These two properties imply that, once an RM enters the committed state, no other RM can enter the aborted state, and vice versa. Each RM also has a prepared state. We require that

• An RM can enter the committed state only after all RMs have been in the prepared state.

These requirements imply that the transaction can commit, meaning that all RMs reach the committed state, only by the following sequence of events:

• All the RMs enter the prepared state, in any order.
• All the RMs enter the committed state, in any order.

The protocol allows the following event that prevents the transaction from committing:

• Any RM in the working state can enter the aborted state.

The stability and consistency conditions imply that this spontaneous abort event cannot occur if some RM has entered the committed state. In practice, a working RM will abort when it realizes that it cannot perform its part of the transaction. These requirements are summarized in the state-transition diagram of Figure 1.

The goal of the algorithm is for all RMs to reach the committed or aborted state, but this cannot be achieved in a non-trivial way if RMs can fail or become isolated through communication failure. (A trivial solution is one in which all RMs always abort.) Moreover, the classic theorem of Fischer, Lynch, and Paterson [8] implies that a deterministic, purely asynchronous [...]

[Figure 1 diagram: states working, prepared, committed, aborted]
Figure 1: The state-transition diagram for a resource manager. It begins in the working state, in which it may decide that it wants to abort or commit. It aborts by simply entering the aborted state. If it decides to commit, it enters the prepared state. From this state, it can commit only if all other resource managers also decided to commit.

3 Two-Phase Commit

3.1 The Protocol

The Two-Phase Commit protocol is an implementation of transaction commit that uses a transaction manager (TM) process to coordinate the decision-making procedure. The RMs have the same states in this protocol as in the specification of transaction commit. The TM has the following states: init (its initial state), preparing, committed, and aborted.

The Two-Phase Commit protocol starts when an RM enters the prepared state and sends a Prepared message to the TM. Upon receipt of the Prepared message, the TM enters the preparing state and sends a Prepare message to every other RM. Upon receipt of the Prepare message, an RM that is still in the working state can enter the prepared state and send a Prepared message to the TM. When it has received a Prepared message from all RMs, the TM can enter the committed state and send Commit messages to all the other processes. The RMs can enter the committed state upon receipt of the Commit message from the TM. The message flow for the Two-Phase Commit protocol is shown in Figure 2.

Figure 2 shows one distinguished RM spontaneously preparing. In fact, any RM can spontaneously go from the working to prepared state and send a prepared message at any time. The TM's prepare message can be viewed as an optional suggestion that now would be a good time to do so. Other events, including real-time deadlines, might cause working RMs to prepare. This observation is the basis for variants of the Two-Phase Commit protocol that use fewer messages.

An RM can spontaneously enter the aborted state if it is in the working state; and the TM can spontaneously enter the aborted state unless it is in [...]

[Figure 2 diagram: message flow among RM1, Other RMs, and TM, with messages Prepared, Prepare, and Commit]
Figure 2: The message flow for Two-Phase Commit in the normal failure-free case, where RM1 is the first RM to enter the prepared state.

[...]

As discussed in Section 3.1, we can eliminate the TM's Prepare messages, reducing the message complexity to 2N. But in practice, this requires either extra message delays or some real-time assumptions. In addition to the message delays, the two-phase commit protocol incurs the delays associated with writes to stable storage: the write by the first RM to prepare, the writes by the remaining RMs when they prepare, and the write by the TM when it makes the commit decision. This can be reduced to two write delays by having all RMs prepare concurrently.

3.3 The Problem with Two-Phase Commit

In a transaction commit protocol, if one or more RMs fail, the transaction is usually aborted. For example, in the Two-Phase Commit protocol, if the TM does not receive a Prepared message from some RM soon enough after sending the Prepare message, then it will abort the transaction by sending Abort messages to the other RMs. However, the failure of the TM can cause the protocol to block until the TM is repaired. In particular, if the TM fails right after every RM has sent a Prepared message, then the other RMs have no way of knowing whether the TM committed or aborted the transaction.

A non-blocking commit protocol is one in which the failure of a single process does not prevent the other processes from deciding if the transaction is committed or aborted. They are often called Three-Phase Commit protocols. Several have been proposed, and a few have been implemented [3, 4, 19]. They have usually attempted to "fix" the Two-Phase Commit protocol by choosing another TM if the first TM fails. However, we know of none that provides a complete algorithm proven to satisfy a clearly stated correctness condition. For example, the discussion of non-blocking commit in the classic text of Bernstein, Hadzilacos, and Goodman [3] fails to explain what a process should do if it receives messages from two different processes, both claiming to be the current TM. Guaranteeing that this situation cannot arise is a problem that is as difficult as implementing a transaction commit protocol.

4 Paxos Commit

4.1 The Paxos Consensus Algorithm
The distributed computing community has studied the more general problem of consensus, which requires that a collection of processes agree on some value. Many solutions to this problem have been proposed, under various [...]

[...] containing its current state, which consists of
• The largest ballot number for which it received a phase 1a message, and
• The phase 2b message with the highest ballot number it has sent, if any.
The acceptor ignores the phase 1a message if it has performed an action for a ballot numbered bal or greater.

Phase 2a  When the leader has received a phase 1b message for ballot number bal from a majority of the acceptors, it can learn one of two possibilities:

Free  None of the majority of acceptors report having sent a phase 2b message, so the algorithm has not yet chosen a value.

Forced  Some acceptor in the majority reports having sent a phase 2b message. Let μ be the maximum ballot number of all the reported phase 2b messages, and let M_μ be the set of all those phase 2b messages that have ballot number μ. All the messages in M_μ have the same value v, which might already have been chosen.

In the free case, the leader can try to get any value accepted; it usually picks the first value proposed by a client. In the forced case, it tries to get the value v chosen by sending a phase 2a message with value v and ballot number bal to every acceptor.

Phase 2b  When an acceptor receives a phase 2a message for a value v and ballot number bal, if it has not already received a phase 1a or 2a message for a larger ballot number, it accepts that message and sends a phase 2b message for v and bal to the leader. The acceptor ignores the message if it has already participated in a higher-numbered ballot.

Phase 3  When the leader has received phase 2b messages for value v and ballot bal from a majority of the acceptors, it knows that the value v has been chosen and communicates that fact to all interested processes with a phase 3 message.

Ballot 0 has no phase 1 because there are no lower-numbered ballots, so there is nothing for acceptors to report in phase 1b messages.

An explanation of why the Paxos algorithm is correct can be found in the literature [6, 12, 13, 15]. As with any asynchronous algorithm, process [...]

[...]

Paxos Commit uses a separate instance of the Paxos consensus algorithm to obtain agreement on the decision each RM makes of whether to prepare or abort, a decision we represent by the values Prepared and Aborted. So, there is one instance of the consensus algorithm for each RM. The transaction is committed iff each RM's instance chooses Prepared; otherwise the transaction is aborted. The idea of performing a separate consensus on each RM's decision can be used with any consensus algorithm, but how one uses this idea to save a message delay depends on the algorithm.

Paxos Commit uses the same set of 2F+1 acceptors and the same current leader for each instance of Paxos. So, the cast of characters consists of N RMs, 2F+1 acceptors, and the current leader. We assume for now that the RMs know the acceptors in advance. In ordinary Paxos, a ballot 0 phase 2a message can have any value v. While the leader usually sends such a message, the Paxos algorithm obviously remains correct if the sending of that message is delegated to any single process chosen in advance. In Paxos Commit, each RM announces its prepare/abort decision by sending, in its instance of Paxos, a ballot 0 phase 2a message with the value Prepared or Aborted.

Execution of Paxos Commit normally starts when some RM decides to prepare and sends a BeginCommit message to the leader. The leader then sends a Prepare message to all the other RMs. If an RM decides that it wants to prepare, it sends a phase 2a message with value Prepared and ballot number 0 in its instance of the Paxos algorithm. Otherwise, it sends a phase 2a message with the value Aborted and ballot number 0. For each instance, an
acceptor sends its phase 2b message to the leader. The leader knows the outcome of this instance if it receives F+1 phase 2b messages for ballot number 0, whereupon it can send its phase 3 message announcing the outcome to the RMs. (As observed in Section 4.1 above, phase 3 can be eliminated by having the acceptors send their phase 2b messages directly to the RMs.) The transaction is committed iff every RM's instance of the Paxos algorithm chooses Prepared; otherwise the transaction is aborted.

For efficiency, an acceptor can bundle its phase 2b messages for all instances of the Paxos algorithm into a single physical message. The leader can distill its phase 3 messages for all instances into a single Commit or Abort message, depending on whether or not all instances chose the value Prepared.

The instances of the Paxos algorithm for one or more RMs may not reach a decision with ballot number 0. In that case, the leader (alerted by a timeout) assumes that each of those RMs has failed and executes phase 1a for a larger ballot number in each of their instances of Paxos. If, in phase 2a, [...]

[Figure 3 diagram: message flow among RM1, Other RMs, Initial Leader, and Acceptors, with messages BeginCommit, Prepare, 2a Prepared, 2b Prepared, and Commit]
Figure 3: The message flow for Paxos Commit in the normal failure-free case, where RM1 is the first RM to enter the prepared state, and 2a Prepared and 2b Prepared are the phase 2a and 2b messages of the Paxos consensus algorithm.

4.3 The Cost of Paxos Commit

We now consider the cost of Paxos Commit in the normal case, when the transaction is committed. The sequence of message exchanges is shown in Figure 3.

We again assume that there are N RMs. We consider a system that can tolerate F faults, so there are 2F+1 acceptors. However, we assume the optimization in which the leader sends phase 2a messages to F+1 acceptors, and only if one or more of them fail are other acceptors used. In the normal case, the Paxos Commit algorithm uses the following potentially inter-node messages:

• The first RM to prepare sends a BeginCommit message to the leader. (1 message)
• The leader sends a Prepare message to every other RM. (N - 1 messages)
• Each RM sends a ballot 0 phase 2a Prepared message for its instance of Paxos to the F+1 acceptors. (N(F+1) messages)
• For each RM's instance of Paxos, an acceptor responds to a phase 2a message by sending a phase 2b Prepared message to the leader. However, an acceptor can bundle the messages for all those instances into a single message. (F+1 messages)
• The leader sends a single Commit message to each RM containing a phase 3 Prepared message for every instance of Paxos. (N messages)
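The per-step counts in the list above can be added up mechanically. The following Python sketch does only that; the function name `paxos_commit_messages` and the dictionary keys are illustrative choices, not names from the paper. The closed form (N+1)(F+3) - 2 checked in the usage lines is simply the algebraic sum of the five items, not a figure quoted from this section. Figure 4 reports slightly smaller totals because its entries make a further co-location assumption.

```python
def paxos_commit_messages(n: int, f: int) -> dict:
    """Count normal-case Paxos Commit messages, item by item.

    n is the number of RMs (N) and f the number of tolerated faults (F).
    The five entries mirror the five bullet items in the text, under the
    stated optimization that only F+1 acceptors are used in the normal case.
    """
    if n < 1 or f < 0:
        raise ValueError("need at least one RM and a non-negative F")
    counts = {
        "begin_commit": 1,          # first RM to prepare -> leader
        "prepare": n - 1,           # leader -> every other RM
        "phase2a": n * (f + 1),     # each RM -> F+1 acceptors
        "phase2b": f + 1,           # each acceptor -> leader (bundled)
        "commit": n,                # leader -> each RM
    }
    # The right-hand side is evaluated before "total" is added to the dict.
    counts["total"] = sum(counts.values())
    return counts


# Usage: five RMs tolerating one fault.
example = paxos_commit_messages(5, 1)
print(example["total"])  # sum of the five items for N=5, F=1
assert example["total"] == (5 + 1) * (1 + 3) - 2
```

Keeping the items separate rather than hard-coding the closed form makes it easy to see which term grows with F (only the phase 2a and phase 2b traffic).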
                        Two-Phase Commit   Paxos Commit       Faster Paxos Commit
Message delays          4                  5                  4
Messages
  no co-location        3N - 1             (N+1)(F+3) - 4     N(2F+3) - 1
  with co-location      3N - 3             N(F+3) - 3         (N-1)(2F+3)
Stable storage
  write delays          2                  2                  2
  writes                N + 1              N + F + 1          N + F + 1

Figure 4: Corresponding Complexity

5 Paxos versus Two-Phase Commit

In the Two-Phase Commit protocol, the TM both makes the abort/commit decision and stores that decision in stable storage. Two-Phase Commit can block indefinitely if the TM fails. Had we used Paxos simply to obtain consensus on a single decision value, this would have been equivalent to replacing the TM's stable storage by the acceptors' stable storage, and replacing the single TM by a set of possible leaders. Our Paxos Commit algorithm goes further in essentially eliminating the TM's role in making the decision. In Two-Phase Commit, the TM can unilaterally decide to abort. In Paxos Commit, a leader can make an
abort decision only for an RM that does not decide for itself. The leader does this by initiating a ballot with number greater than 0 for that RM's instance of Paxos. (The leader must be able to do this to prevent blocking by a failed RM.)

Sections 3.2 and 4.3 describe the normal-case cost in messages and writes to stable storage of Two-Phase Commit and Paxos Commit, respectively. Both algorithms have the same three stable storage write delays (two if all RMs prepare concurrently). The other costs are summarized in Figure 4. The entries for Paxos Commit assume that the initial leader is on the same node as an acceptor. Faster Paxos Commit is the algorithm optimized to remove phase 3 of the Paxos consensus algorithm. For Two-Phase Commit, co-location means that the initiating RM and the TM are on the same node. For Paxos Commit, it means that each acceptor is on the same node as an RM, and that the initiating RM is on the same node as the initial leader. In Paxos Commit without co-location, we assume that the initial leader is an acceptor.

[...]

Section 5 showed that Two-Phase Commit is the F=0 case of Paxos Commit, in which the transaction manager performs the functions of the one acceptor and the one possible leader. We therefore consider only Paxos Commit.

To accommodate a dynamic set of RMs, we introduce a registrar process that keeps track of what RMs have joined the transaction. The registrar acts much like an additional RM, except that its input to the commit protocol is the set of RMs that have joined, rather than the value Prepared or Aborted. As with an RM, Paxos Commit runs a separate instance of the Paxos consensus algorithm to decide upon the registrar's input, using the same set of acceptors. The transaction is committed iff the consensus algorithm for the registrar chooses a set of RMs and the instance of the consensus algorithm for each of those RMs chooses Prepared.

The registrar is generally on the same node as
the initial leader, which is typically on the same node as the RM that creates the transaction. In Two-Phase Commit, the registrar's function is usually performed by the TM rather than by a separate process. (Recall that for the case of Two-Phase Commit, the Paxos consensus algorithm is the trivial one in which the TM simply chooses the value and writes it to stable storage.) We now describe how the dynamic Paxos algorithm works.

6.1 Transaction Creation

Each node has a local transaction service that an RM can call to create and manage transactions. To create a transaction, the service constructs a descriptor for the transaction, consisting of a unique identifier (uid) and the names of the transaction's coordinator processes. The coordinator processes are all processes other than the RMs that take part in the commit protocol: namely, the registrar, the initial leader, the other possible leaders, and the acceptors.

Any message sent during the execution of a transaction contains the transaction descriptor, so a recipient knows which transaction the message is for. A process might first learn about the existence of a transaction by receiving such a message. The descriptor tells the process the names of the coordinators that it must know to perform its role in the protocol.

6.2 Joining a Transaction

An RM joins a transaction by sending a join message to the registrar. As observed above, the join message must contain the transaction descriptor if [...]

[...] instances. The acceptor waits until it knows what phase 2b message to send for all instances before sending this one message. However, "all instances" includes an instance for each participating RM, and the set of participating RMs is chosen by the registrar's instance. To break this circularity, we observe that, if the registrar's instance chooses the value Aborted, then it doesn't matter what values are chosen by the RMs' instances. Therefore, the acceptor waits until it is ready to send a phase 2b message for the registrar's instance. If that message contains a set J of RMs as a value, then the acceptor waits until it can send the phase 2b message for each RM in J. If the phase 2b message for the registrar's instance contains the value Aborted, then the acceptor sends only that phase 2b message.

As explained in Section 4.2, the protocol can be short-circuited and abort messages sent to all processes if any participating RM chooses the value Aborted. Instead of sending a phase 2a message, the RM can simply send an abort message to the coordinator processes. The registrar can relay the abort message to all other RMs that have joined the transaction.

Failure of the registrar before it sends its ballot 0 phase 2a message causes the transaction to abort. However, failure of a single RM can also cause the transaction to abort. Fault-tolerance means only that failure of an individual process does not prevent a commit/abort decision from being made.

6.4 Learning the Outcome

The description above shows that, when there is no failure, the dynamic commit protocol works essentially as described in Figure 3 of Section 4.3. We now consider what happens in the event of failure.

The case of acceptor failure is straightforward. If the transaction is created to have 2F+1 acceptors, then failure of up to F of them causes no problem. If more acceptors fail, the protocol simply blocks until there are F+1 working acceptors, whereupon it continues as if nothing had happened.

Before considering other process failures, let us examine how a process P, knowing only the transaction descriptor, can discover the outcome of the protocol, that is, whether the transaction was committed or aborted. For example, P might be a restarted RM that had failed after sending a phase 2a Prepared message but before recording the outcome in its stable storage. Having the descriptor, P knows the set of all possible leader processes. It sends them a message containing the descriptor and asking what the outcome was. If all the leader processes have failed, then P must wait until one or more of them are restarted. (Each node that has an acceptor process [...]

7 Conclusion

Two-Phase Commit is the classical transaction commit protocol. Indeed, it is sometimes thought to be synonymous with transaction commit [17]. Two-Phase Commit is not fault tolerant because it uses a single coordinator whose failure can cause the protocol to block. We have introduced Paxos Commit, a new transaction commit protocol that uses multiple coordinators and makes progress if a majority of them are working. Hence, 2F+1 coordinators can make progress even if F of them are faulty. Two-Phase Commit is isomorphic to Paxos Commit with a single coordinator.

In the normal, failure-free case, Paxos Commit requires one more message delay than Two-Phase Commit. This extra message delay is eliminated by Faster Paxos Commit, which has the theoretically minimal message delay for a non-blocking protocol.

Non-blocking transaction commit protocols were first proposed in the early 1980s [3, 4, 19]. The initial algorithms had two message delays more than Two-Phase Commit in the failure-free case; later algorithms reduced this to one extra message delay [3]. All of these algorithms used a coordinator process and assumed that two different processes could never both believe they were the coordinator, an assumption that cannot be implemented in a purely asynchronous system. Transient network failures could cause them to violate the consistency requirement of transaction commit.

It is easy to implement non-blocking commit using a consensus algorithm, an observation also made in the 1980s [16]. However, the obvious way of doing this leads to one message delay more than that of Paxos Commit. The only algorithm that achieved the low message delay of Faster Paxos Commit is that of Guerraoui, Larrea, and Schiper [11]. It is essentially the same as Faster Paxos Commit in the absence of failures. (It can be modified with an optimization analogous to the sending of phase 2a messages only to a majority of acceptors to give it the same message complexity as Faster Paxos Commit.) This similarity to Paxos Commit is not surprising, since most asynchronous consensus algorithms (and most incomplete attempts at algorithms) are the same as Paxos in the failure-free case. However, their algorithm is more complicated than Paxos Commit. It uses a special procedure for the failure-free case and calls upon a modified version of an ordinary consensus algorithm, which adds an extra message delay in the event of failure.

With 2F+1 coordinators and N resource managers, Paxos Commit requires about 2FN more messages than Two-Phase Commit in the normal case. Both algorithms incur the same delay for writing to stable storage. In [...]

[...]

[5] Bernadette Charron-Bost and André Schiper. Uniform consensus is harder than consensus (extended abstract). Technical Report DSC/2000/028, École Polytechnique Fédérale de Lausanne, Switzerland, May 2000.

[6] Roberto De Prisco, Butler Lampson, and Nancy Lynch. Revisiting the Paxos algorithm. In Marios Mavronicolas and Philippas Tsigas, editors, Proceedings of the 11th International Workshop on Distributed Algorithms (WDAG 97), volume 1320 of Lecture Notes in Computer Science, pages 111–125, Saarbrücken, Germany, 1997. Springer-Verlag.

[7] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM, 35(2):288–323, April 1988.

[8] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.

[9] J. N. Gray. Notes on database operating systems. In R. Bayer, R. M. Graham, and G. Seegmüller, editors, Operating Systems: An Advanced Course, volume 60 of Lecture Notes in Computer Science, pages 393–481. Springer-Verlag, Berlin, Heidelberg, New York, 1978.

[10] Rachid Guerraoui. Revisiting the relationship between non-blocking atomic commitment and consensus. In Jean-Michel Hélary and Michel Raynal, editors, Proceedings of the 9th International Workshop on Distributed Algorithms (WDAG 95), volume 972 of Lecture Notes in Computer Science, pages 87–100, Le Mont-Saint-Michel, France, September 1995. Springer-Verlag.

[11] Rachid Guerraoui, Mikel Larrea, and André Schiper. Reducing the cost for non-blocking in atomic commitment. In Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS), pages 692–697, Hong Kong, May 1996. IEEE Computer Society.

[12] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, May 1998.

[13] Leslie Lamport. Paxos made simple. ACM SIGACT News (Distributed Computing Column), 32(4):51–58, December 2001.

[14] Leslie Lamport. Specifying Systems. Addison-Wesley, Boston, 2003. A link to an electronic copy can be found at http://lamport.org.

[...]

A The TLA+ Specifications

A.1 The Specification of a Transaction Commit Protocol

---------------------------- MODULE TCommit ----------------------------
CONSTANT RM       \* The set of participating resource managers
VARIABLE rmState  \* rmState[rm] is the state of resource manager rm.

TCTypeOK ==
  \* The type-correctness invariant
  rmState \in [RM -> {"working", "prepared", "committed", "aborted"}]

TCInit == rmState = [rm \in RM |-> "working"]
  \* The initial predicate.

canCommit == \A rm \in RM : rmState[rm] \in {"prepared", "committed"}
  \* True iff all RMs are in the "prepared" or "committed" state.

notCommitted == \A rm \in RM : rmState[rm] # "committed"
  \* True iff no resource manager has decided to commit.

\* We now define the actions that may be performed by the RMs, and then
\* define the complete next-state action of the specification to be the
\* disjunction of the possible RM actions.

Prepare(rm) == /\ rmState[rm] = "working"
               /\ rmState' = [rmState EXCEPT ![rm] = "prepared"]

Decide(rm) == \/ /\ rmState[rm] = "prepared"
                 /\ canCommit
                 /\ rmState' = [rmState EXCEPT ![rm] = "committed"]
              \/ /\ rmState[rm] \in {"working", "prepared"}
                 /\ notCommitted
                 /\ rmState' = [rmState EXCEPT ![rm] = "aborted"]

TCNext == \E rm \in RM : Prepare(rm) \/ Decide(rm)
  \* The next-state action.

TCSpec == TCInit /\ [][TCNext]_rmState
  \* The complete specification of the protocol.

\* We now assert invariance properties of the specification.

[...]

Message ==
  \* The set of all possible messages. Messages of type "Prepared" are
  \* sent from the RM indicated by the message's rm field to the TM.
  \* Messages of type "Commit" and "Abort" are broadcast by the TM, to be
  \* received by all RMs. The
  \* set msgs contains just a single copy of such a message.
  [type : {"Prepared"}, rm : RM] \cup [type : {"Commit", "Abort"}]

TPTypeOK ==
  \* The type-correctness invariant
  /\ rmState \in [RM -> {"working", "prepared", "committed", "aborted"}]
  /\ tmState \in {"init", "committed", "aborted"}
  /\ tmPrepared \subseteq RM
  /\ msgs \subseteq Message

TPInit ==
  \* The initial predicate.
  /\ rmState = [rm \in RM |-> "working"]
  /\ tmState = "init"
  /\ tmPrepared = {}
  /\ msgs = {}

\* We now define the actions that may be performed by the processes,
\* first the TM's actions, then the RMs' actions.

TMRcvPrepared(rm) ==
  \* The TM receives a "Prepared" message from resource manager rm.
  /\ tmState = "init"
  /\ [type |-> "Prepared", rm |-> rm] \in msgs
  /\ tmPrepared' = tmPrepared \cup {rm}
  /\ UNCHANGED <<rmState, tmState, msgs>>

TMCommit ==
  \* The TM commits the transaction; enabled iff the TM is in its initial
  \* state and every RM has sent a "Prepared" message.
  /\ tmState = "init"
  /\ tmPrepared = RM
  /\ tmState' = "committed"
  /\ msgs' = msgs \cup {[type |-> "Commit"]}
  /\ UNCHANGED <<rmState, tmPrepared>>

TMAbort ==
  \* The TM spontaneously aborts the transaction.
  [...]

[...]

THEOREM TPSpec => []TPTypeOK
  \* This theorem asserts that the type-correctness predicate TPTypeOK is
  \* an invariant of the specification.

\* We now assert that the Two-Phase Commit protocol implements the
\* Transaction Commit protocol of module TCommit. The following statement
\* defines TC!TCSpec to be formula TCSpec of module TCommit. (The TLA+
\* INSTANCE statement is used to rename the operators defined in module
\* TCommit, which avoids any name conflicts that might exist with
\* operators in the current module.)

TC == INSTANCE TCommit

THEOREM TPSpec => TC!TCSpec
  \* This theorem asserts that the specification TPSpec of the Two-Phase
  \* Commit protocol implements the specification TCSpec of the
  \* Transaction Commit protocol.

\* The two theorems in this module have been checked with TLC for six RMs,
\* a configuration with 50816 reachable states, in a little over a minute
\* on a 1 GHz PC.

A.3 The Paxos Commit Algorithm

-------------------------- MODULE PaxosCommit --------------------------
\* This module specifies the Paxos Commit algorithm. We specify only
\* safety properties, not liveness properties. We simplify the
\* specification in the following ways.
\*
\*  - As in the specification of module TwoPhase, and for the same
\*    reasons, we let the variable msgs be the set of all messages that
\*    have ever been sent. If a message is sent to a set of recipients,
\*    only one copy of the message appears in msgs.
\*
\*  - We do not explicitly model the receipt of messages. If an operation
\*    can be performed when a process has received a certain set of
\*    messages, then the operation is represented by an action that is
\*    enabled when those messages are in the set msgs of sent messages.
\*    (We are specifying only safety properties, which assert what events
\*    can occur, and the operation can occur if the messages that enable
\*    it have been sent.)
\*
\*  - We do not model leader selection. We define actions that the
\*    current leader may perform, but do not specify who performs them.
\*
\* As in the specification of Two-Phase Commit in module TwoPhase, we have
\* RMs spontaneously issue Prepared messages and we ignore Prepare
\* messages.

EXTENDS Integers

Maximum(S) ==
  \* If S is a set of numbers, then this defines Maximum(S) to be the
  \* maximum of those numbers, or -1 if S is empty.
  IF S = {} THEN -1
            ELSE CHOOSE n \in S : \A m \in S : n >= m

[...]

                                      bal : Ballot \cup {-1},
                                      val : {"prepared", "aborted", "none"}]]]
  /\ msgs \in SUBSET Message

PCInit ==
  \* The initial predicate.
  /\ rmState = [rm \in RM |-> "working"]
  /\ aState = [ins \in RM |->
                [ac \in Acceptor |-> [mbal |-> 0, bal |-> -1, val |-> "none"]]]
  /\ msgs = {}

\* The Actions

Send(m) == msgs' = msgs \cup {m}
  \* An action expression that describes the sending of message m.

\* RM Actions

RMPrepare(rm) ==
  \* Resource manager rm prepares by sending a phase 2a message for ballot
  \* number 0 with value "prepared".
  /\ rmState[rm] = "working"
  /\ rmState' = [rmState EXCEPT ![rm] = "prepared"]
  /\ Send([type |-> "phase2a", ins |-> rm, bal |-> 0, val |-> "prepared"])
  /\ UNCHANGED aState

RMChooseToAbort(rm) ==
  \* Resource manager rm spontaneously decides to abort. It may (but need
  \* not) send a phase 2a message for ballot number 0 with value "aborted".
  /\ rmState[rm] = "working"
  /\ rmState' = [rmState EXCEPT ![rm] = "aborted"]
  /\ Send([type |-> "phase2a", ins |-> rm, bal |-> 0, val |-> "aborted"])
  /\ UNCHANGED aState

RMRcvCommitMsg(rm) ==
  \* Resource manager rm is told by the leader to commit. When this action
  \* is enabled, rmState[rm] must equal either "prepared" or "committed".
  \* In the latter case, the action leaves the state unchanged (it is a
  \* "stuttering step").
  /\ [type |-> "Commit"] \in msgs
  /\ rmState' = [rmState EXCEPT ![rm] = "committed"]
  /\ UNCHANGED <<aState, msgs>>

[...]

  /\ Send([type |-> "phase2a", ins |-> rm, bal |-> bal, val |-> v])
  /\ UNCHANGED <<rmState, aState>>

Decide ==
  \* A leader can decide that Paxos Commit has reached a result and send a
  \* message announcing the result if it has received the necessary
  \* phase 2b messages.
  /\ LET Decided(rm, v) ==
           \* True iff instance rm of the Paxos consensus algorithm has
           \* chosen the value v.
           \E b \in Ballot, MS \in Majority :
             \A ac \in MS : [type |-> "phase2b", ins |-> rm,
                             bal |-> b, val |-> v, acc |-> ac] \in msgs
     IN  \/ /\ \A rm \in RM : Decided(rm, "prepared")
            /\ Send([type |-> "Commit"])
         \/ /\ \E rm \in RM : Decided(rm, "aborted")
            /\ Send([type |-> "Abort"])
  /\ UNCHANGED <<rmState, aState>>

\* Acceptor Actions

Phase1b(acc) ==
  \E m \in msgs :
    /\ m.type = "phase1a"
    /\ aState[m.ins][acc].mbal < m.bal
    /\ aState' = [aState EXCEPT ![m.ins][acc].mbal = m.bal]
    /\ Send([type |-> "phase1b", ins |-> m.ins, mbal |-> m.bal,
             bal |-> aState[m.ins][acc].bal,
             val |-> aState[m.ins][acc].val,
             acc |-> acc])
    /\ UNCHANGED rmState

Phase2b(acc) ==
  /\ \E m \in msgs :
       /\ m.type = "phase2a"
       /\ aState[m.ins][acc].mbal <= m.bal
       /\ aState' = [aState EXCEPT ![m.ins][acc].mbal = m.bal,
                                   ![m.ins][acc].bal = m.bal,
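The acceptor actions Phase1b and Phase2b above can be mimicked for a single Paxos instance with a few lines of Python. This is only a sketch under stated assumptions: the class `AcceptorState` and the functions `phase1b` and `phase2b` are illustrative names. The strict and non-strict ballot comparisons are inferred from the prose of Section 4.1, because the comparison symbols did not survive in the extracted specification. The update of `val` in `phase2b` is likewise assumed, since the specification text breaks off before that point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AcceptorState:
    """One acceptor's state for one instance (cf. aState[ins][acc])."""
    mbal: int = 0        # highest ballot in which the acceptor has acted
    bal: int = -1        # ballot of the last accepted value, -1 if none
    val: str = "none"    # last accepted value, "none" if nothing accepted


def phase1b(state: AcceptorState, bal: int) -> Optional[dict]:
    """React to a phase 1a message for ballot `bal`.

    Assumed guard: respond only if `bal` exceeds every ballot already
    acted on; otherwise ignore the message and return None.
    """
    if state.mbal < bal:
        state.mbal = bal
        return {"type": "phase1b", "mbal": bal,
                "bal": state.bal, "val": state.val}
    return None


def phase2b(state: AcceptorState, bal: int, val: str) -> Optional[dict]:
    """React to a phase 2a message proposing `val` in ballot `bal`.

    Assumed guard: accept unless the acceptor has already participated
    in a higher-numbered ballot.
    """
    if state.mbal <= bal:
        state.mbal = bal
        state.bal = bal
        state.val = val   # assumed update; the spec text is truncated here
        return {"type": "phase2b", "bal": bal, "val": val}
    return None


# Usage: an RM's ballot-0 phase 2a message is accepted; after a leader
# starts ballot 2, a late message for ballot 1 is ignored.
acc = AcceptorState()
print(phase2b(acc, 0, "prepared"))
print(phase1b(acc, 2))
print(phase2b(acc, 1, "aborted"))
```

Unlike the specification, which models messages as a growing set `msgs`, this sketch returns the reply directly; that choice keeps the example short and says nothing about how a real implementation would transport messages.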