D-ARIES: A Distributed Version of the ARIES Recovery Algorithm

Jayson Speer and Markus Kirchberg
Information Science Research Centre, Department of Information Systems, Massey University, Private Bag 11 222, Palmerston North 5301, New Zealand
JaysonSpeer@gmail.com, M.Kirchberg@massey.ac.nz

Abstract. This paper presents an adaptation of the ARIES recovery algorithm that solves the problem of recovery in Shared Disk (SD) database systems, whilst preserving all the desirable properties of the original algorithm. Previous such adaptations of the ARIES algorithm have failed to solve some of the problems associated with SD systems, resulting in a number of undesirable compromises being made. The most significant problem is how to assign log sequence numbers (LSNs) in such a way that the system is recoverable after a crash. Existing adaptations have required, among other things, a perfectly synchronised global clock and a central merge of log data either during normal processing or crash recovery, which clearly imposes a significant overhead on the database system. This adaptation of ARIES removes this requirement entirely, meaning that log merges and synchronised clocks become entirely unnecessary. Further enhancements that allow the Redo and Undo phases of recovery to be performed on a page-by-page basis have significantly reduced the recovery time. Additionally, it is possible for the database to return to normal processing at the end of the Analysis phase, rather than waiting for the recovery process to complete.

1 Introduction

Introduced by Mohan et al. [1], the ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) algorithm has had a significant impact on current thinking on database transaction logging and recovery. It has been incorporated into Lotus Notes as well as a number of major database management systems (DBMSs) [2]. ARIES, like many other algorithms, is based on the WAL (Write Ahead Logging) protocol that ensures recoverability of a database in the presence of a crash. However, ARIES' Repeating History paradigm sets it apart from all other WAL based protocols. The repeating history paradigm requires, during recovery, that the database be returned to the same state as it was before the crash occurred. This aspect of ARIES allows it to support fine granularity locking, operation locking (i.e. increment and decrement) and partial rollbacks. ARIES further supports the use of buffer managers that use the Steal and No-Force approach to page replacement [3].

1.1 Recovery in Shared Disk Architectures

Often the performance of a single DBMS is unacceptable, resulting in the use of multiple systems where the database is split across multiple systems. The two major architectures are Shared Disk (SD) or Shared Nothing (SN) architecture.

[Fig. 1. Simplified Shared Disk (SD) Architecture: systems S1, S2, ..., Sn, each with its own local log Log1, Log2, ..., Logn, all attached to the shared disks.]

In the SD architecture, the disks containing the database are shared amongst the different systems. Each system has an instance of the DBMS executing on it and may read or update any portion of the database held on the shared disks.

In the simplified architecture shown in Figure 1, the disks containing the database are shared amongst the systems {S1, ..., Sn}, with each system having an instance of the DBMS. Each system may perform updates on any portion of the database and will log each such update in its local log {Log1, ..., Logn}.

In the event of one or more of the systems failing (i.e. power failure, etc.), the algorithm must be able to return the system to a consistent state afterwards. Here, such nodes are referred to as crash nodes and are distinguished from non-crash nodes, since they play slightly different roles in the recovery process.

1.2 Problems

The introduction of the SD architecture adds an extra dimension of complexity to the problem of recovery, which can be almost entirely attributed to assigning log sequence numbers (LSNs) and how to interpret them during crash recovery. Mohan and Narang ([4], p. 312) detailed a number of these problems, which include:

1. During recovery, how do we determine whether or not there is a need to redo the updates of a log record if the addresses of log records (when used as the LSN values) are not monotonically increasing across all systems?
2. If the log records representing updates to a particular page are scattered across the different systems' local log files, how do we start to recover the page and how do we merge the log records from the different local log files?

ARIES has been adapted for use in SD architectures [5], with several different schemes being proposed. However, each scheme made a number of significant
compromises in order to address the aforementioned problems. The major assumptions made for all these schemes are:

1. 'For the purposes of handling data recovery, one system in the SD complex which has connectivity to all the local logs' disks produces a merged version of those logs.' ([5], p. 194)
2. '... the LSN is a timestamp and that the clocks across the SD complex are perfectly synchronised.' ([5], p. 195)

The first assumption clearly introduces an extra layer of overhead to the logging process that is not present when ARIES is applied to a single system database, due to the extra communication and processing that each system must now perform. Such problems will be exacerbated in the case where the system performing the log merge becomes a bottleneck in the process.

The second assumption not only adds an extra layer of overhead due to the requirement that all clocks in the system are perfectly synchronised, but also adds an extra layer of complexity to the handling of LSNs.

Whilst logging, each record is assigned a unique LSN when that record is appended to the log. In single system databases, LSNs are typically 'the logical addresses of the corresponding log records' ([1], p. 96). This allows a correspondence to be established between LSNs and the physical location of the corresponding log record. Under the proposed assumption, this correspondence can no longer be established. This means extra data structures must be implemented to relate LSNs to the corresponding logical address, which causes a drop in database performance.

Additionally, in the case where ownership of a page is transferred to another system without first forcing that page, the scheme proposed by Mohan & Narang [5] requires a log merge in order to enable recovery of the page. Rahm [6] avoids this problem by prohibiting the change of page ownership altogether.

Lomet [7] offers the method, of all those investigated, that most resembles the one presented here, where state identifiers are used to track updates to data objects. Whilst this method does not require globally synchronised clocks, it does require the state identifiers to be monotonically increasing for any given data object, which introduces an undesirable layer of complexity and breaks the direct correspondence between the logical and physical addresses of log records. Other restrictive assumptions are also made, such
as requiring that a page be forced to disk before transferring ownership to another node. Bozas & Kober [8] offer a similar method to [7] that requires state identifiers to be monotonically increasing for a page, but additionally does not support logical operation logging or Steal buffer management policies.

1.3 Solutions

The algorithm proposed in this paper addresses the problem of recovery within a SD system whilst addressing the aforementioned problems. In particular, the need to have LSNs monotonically increasing across all systems has been removed. This removes all reliance on clocks and allows the correspondence between LSNs and the physical location of the corresponding log records to be preserved. The algorithm also removes any need for the merging of log data (during normal processing or recovery), but requires that all nodes participate in crash recovery.

Additional optimisations were made to the Redo and Undo phases of the recovery process whereby these phases are now performed on a page-by-page basis. Performing the Redo and Undo phases on a page-by-page basis allows a much higher degree of concurrency during the recovery process. This technique also provides the basis for a proposed method to improve concurrency during the rollback of transactions during normal processing.

During recovery, the time that the database is quiesced is reduced since normal processing can commence as soon as the Analysis phase of recovery is complete. Before the end of the Analysis phase, an exclusive lock is acquired (on behalf of the recovery algorithm) on those pages that must have changes reapplied (Redo phase) or removed (Undo phase). As a result, all other pages remain available for normal processing. Since the Redo and Undo phases are performed on a page-by-page basis, the exclusive lock on each page is released as soon as possible, reducing the unavailability of the database to a minimum.

It was also important not to impose any unnecessary overhead on the algorithm in terms of both communication costs and logging. The extra logging required is very small with a single extra field being added to some log record types and fields removed from others. Communication is typically the bottleneck in any distributed environment. Therefore, reducing the amount of communication required or, more significantly, increasing the concurrency of communication
, will improve the overall system performance. Concurrency of communication during the Redo and Undo phases of recovery has been significantly increased by performing both phases independently on each page, hence decreasing the overall delay due to communication.

While addressing the problems experienced in previous adaptations of ARIES to the SD environment, it was important not to lose those aspects of ARIES that make it both unique and widely accepted. For this reason, the algorithm preserves the repeating history paradigm, fine granularity locking and partial rollbacks. It also places no restriction on the nature of buffer management used, thereby supporting the No-Force and Steal approaches previously supported.

1.4 Overview

This paper is organised into the following sections. Section 2 briefly introduces the original ARIES recovery algorithm. Sections 3 to 5 present our adaptation of the ARIES algorithm targeted at SD database systems. Section 6 draws conclusions and identifies the areas in which further work might be undertaken.

2 ARIES

This section provides a brief overview of the original ARIES algorithm. The reader is referred to [1] for the details of ARIES and to [9] for an introduction to the ARIES family of algorithms.
ARIES, like many other algorithms, is based on the WAL protocol that ensures recoverability of a database in the presence of a crash. All updates to all pages are logged (e.g. in logical fashion). ARIES uses an LSN stored on each page to correlate the state of the page with logged updates of that page. By examining the LSN of a page (called the PageLSN) it can be easily determined which logged updates are reflected in the page. Being able to determine the state of a page w.r.t. logged updates is critical whilst repeating history, since it is essential that any update be applied to a page once and only once. Failure to respect this requirement will in most cases result in a violation of data consistency.

Updates performed during forward processing of transactions are described by Update Log Records (ULRs). However, logging is not restricted to forward processing. ARIES also logs, using Compensation Log Records (CLRs), updates (i.e. compensations of updates of aborted/incomplete transactions) performed during partial or total rollbacks of transactions. By appropriate chaining of CLR records to log records written during forward processing, a bounded amount of logging is ensured during rollbacks, even in the face of repeated failures during crash recovery. This chaining is achieved by 1) assigning LSNs in ascending sequence; and 2) adding a pointer (called the PrevLSN) to the most recent preceding log record written by the same transaction to each log record.

When the undo of a log record causes a CLR record to be written, a pointer (called the UndoNextLSN) to the predecessor of the log record being undone is added to the CLR record. The UndoNextLSN keeps track of the progress of a rollback. It tells the system from where to continue the rollback of the transaction, if a system failure were to interrupt the completion of the rollback.

Periodically during normal processing, ARIES takes fuzzy checkpoints in order to avoid quiescing the database while checkpoint data is written to disk. Checkpoints are taken to make crash recovery more efficient.

When performing crash recovery, ARIES makes three passes (i.e. Analysis, Redo and Undo) over the log. During Analysis, ARIES scans the log from the most recent checkpoint to the end of the log. It determines 1) the starting point of the Redo phase by keeping track of dirty pages; and 2) the list of transactions to be
rolled back in the Undo phase by monitoring the state of transactions. During Redo, ARIES repeats history. It is ensured that updates of all transactions have been executed once and only once. Thus, the database is returned to the state it was in immediately before the crash. Finally, Undo rolls back all updates of transactions that have been identified as active at the time the crash occurred.

3 D-ARIES - Preliminaries

In the next three sections, we introduce a distributed version of the ARIES recovery algorithm (referred to as D-ARIES).

3.1 Overview

D-ARIES preserves the desirable properties of the original ARIES algorithm. Crash recovery still makes three passes over the log. However, supporting ownership changes of pages (without the need to flush respective pages first or merge logs) requires all nodes to be involved in crash recovery. Thus, it becomes even more important that the time the database is quiesced is reduced. D-ARIES achieves this since normal processing can resume as soon as the Analysis phase of recovery is complete. Furthermore, since the Redo and Undo phases are performed on a page-by-page basis, pages required during the Redo and Undo phases are made available to normal transaction processing as soon as possible, reducing the unavailability of the database to a minimum. Section 4 discusses crash recovery processing in more detail.

In the event of one or more of the systems failing, the crash recovery algorithm must be able to return the system(s) to the most recent consistent state. Besides crash recovery, it is also necessary to discuss transaction rollback during normal processing. Section 5 considers two main classes of schedules that must be addressed when defining a rollback algorithm; these are schedules with cascading aborts and schedules without. D-ARIES takes advantage of multi-threading and rolls back transactions on a page-by-page basis.

3.2 Logging

Previous adaptations of the ARIES algorithm require that the LSNs be globally monotonically increasing across all nodes in the system. This requirement introduces the problem of how to ensure that the LSNs are monotonically increasing on a global basis and means that the physical address of a log record can no longer be inferred from its LSN.

The adaptation of the ARIES algorithm presented here no longer requires LSNs to be globally monotonically increasing; however, it is required that LSNs be monotonically increasing within each node. This requirement for monotonically increasing LSNs within a node is not a burden, but rather a benefit, since it allows a direct correspondence between a log record's physical address and logical address to be maintained.

In order to remove the need for globally monotonically increasing log numbers, a number of modifications must be made to the way the ARIES algorithm performs logging. These are:

1. Definition of a 'Distributed LSN'.
2. Modification of the CLR record.
3. Definition of a 'Special Compensation Log Record (SCR)'.
4. Addition of PageLastLSN pointers (to specific types of log records).

Distributed LSN. In order to identify which node a log record belongs to, a node identifier must be included in each LSN. Although the exact method for doing this is left open, for the purposes of this paper, the format of distributed LSNs is as follows:

    RecNum.NodeId

where NodeId is a globally unique identifier for the node that wrote the log record and RecNum is monotonically increasing and unique within that node. For the purposes of this paper, the node identifier and record number of each LSN can be referred to individually as LSN[NodeId] and LSN[RecNum].

Modification of the CLR Record. In this distributed incarnation of the ARIES algorithm, extensive changes are made to the CLR record, both in terms of the information it contains and the way in which it is used. The changes are:

1. The UndoneLSN field replaces the UndoNextLSN field. Whereas the UndoNextLSN records the LSN of the next operation to be undone, the UndoneLSN records the LSN of the operation that was undone.
2. The PrevLSN field is no longer required for the CLR record.
3. CLR records are now used to record undo operations during normal processing only; the newly defined SCR records (see below) are used to record undo operations during crash recovery.

The rationale behind these modifications can be understood by observing the differences in how D-ARIES and the original ARIES algorithm perform undo operations and how this affects the information required by D-ARIES. In the original ARIES algorithm, operations are undone one at a time in the reverse order to which they were performed by transactions. However, in order to increase concurrency, this algorithm can perform multiple undo operations concurrently, where updates to individual pages
are undone independently of each other. The result of this is that although operations are undone in reverse order on a per page basis, it is now possible for operations belonging to a single transaction to be undone in an order that does not correspond to the reverse of the order in which they were done.

Definition of the SCR Record. The Special Compensation Log Record (SCR) is identical in almost every respect to the modified version of the CLR record, the only differences being:

- The record type field (SCR rather than CLR),
- SCRs are written during crash recovery rather than normal processing.

During normal rollback processing, operations are undone in the reverse order to which they were performed by individual transactions. However, during crash recovery rollback, operations are undone in the reverse order to which they were performed on individual pages. Having separate log records for compensation during recovery and normal rollback allows us to exploit this fact.

PageLastLSN Pointers. The PageLastLSN pointer^1 is added to all ULR, SCR and CLR records. It records the LSN of the record that last modified an object on this page. Recording these PageLastLSN pointers provides an easy method of tracing all modifications made to a particular set of objects (stored on a single page).

^1 The PageLastLSN pointer does not only support page-by-page processing. It is also the crucial data structure that helps to avoid globally monotonically increasing LSNs.

The PageLastLSN for each record must be a distributed LSN and include both the LSN and the node on which the record resides. This is due to the fact that the last modification to the page may have occurred on a different node.

3.3 Fuzzy Checkpoints

As with the original ARIES algorithm, a fuzzy checkpoint is performed in order to avoid quiescing the database while checkpoint data is written to disk. The following information is stored during the checkpoint:

- Active transaction table; and
- DirtyLSN value.

For each active transaction, the transaction table stores the following data:

TransId    Identifier of the active transaction.
FirstLSN   LSN of the first log record written on this node for the transaction.
Status     Either Active or Commit.

Thus, each node must maintain the FirstLSN of each active transaction. The value of FirstLSN can be easily set when the corresponding transaction table entry is created.

Given the set of pages that were dirty at the time of the checkpoint, the DirtyLSN value points to the record that represents the oldest update to any such page that has not yet been forced to disk.

4 D-ARIES - Crash Recovery

As with the original ARIES algorithm, recovery remains split into three phases, which are Analysis, Redo and Undo. However, unlike the original ARIES algorithm, recovery takes place on a page-by-page basis, where updates are reapplied to (Redo phase) and removed from (Undo phase) pages independently from one another. The Redo phase reapplies changes to each page in the exact order that they were logged and the Undo phase undoes changes to each page in the exact reverse order that they were performed. Since the state of each page is accurately recorded (by use of the PageLSN), the consistency of the database will be maintained during such a process.

4.1 Data Structures.

The data collected by the algorithm during the Analysis phase is stored in the following data structures:

Transaction Status Table. The purpose of the transaction status (TransStatus) table is to determine the final status of all transactions that were active at some time after the last checkpoint. This information is used to determine whether changes made to the database should be kept or discarded.
The TransStatus table has the following fields:

TransId   Identifier of the transaction.
Status    Status of the transaction, which determines whether or not it must be rolled back. Possible states are Active, End and Commit.

Once the Analysis algorithm has completed scanning all required records, it will send the TransStatus table to all 'crash nodes' in the system. Upon receiving a TransStatus table from another node (Ni), the crash node will update its TransStatus table using the following rule:

The crash node will delete an entry for transaction Tj from the TransStatus table if: 1) the TransStatus table received from node Ni contains an entry for Tj; and 2) the status of that entry is either Commit or End.

After having received the TransStatus table from all other nodes and updating the local table, any transaction with the status 'Active' is declared a 'loser transaction', whilst all other transactions are declared 'winner transactions'.

Local Page Link List. The purpose of the Local Page Link (LLink) list is to provide a linked list of records for each page modified by a node. This linked list is used in the Redo phase to navigate forwards through the log reapplying updates made to data objects on that page.

For each page that has CLR, SCR or ULR records, each node will create an LLink list, which is an ordered list of all LSNs that record changes to that page.

The data contained within the LLink list is sufficient to allow forward navigation through a set of records provided all records for the page are stored on the same node. However, in order to allow forward navigation when changes to a page are logged on multiple nodes, further data must be captured. This data is stored in the Distributed Page Link List (see below).

Distributed Page Link List. The purpose of the Distributed Page Link (DLink) list is to augment the LLink list in such a way that a linked list of records for each page is possible even when a page has log records stored on multiple nodes.

For each CLR, SCR or ULR record encountered that has a PageLastLSN pointer that points to another node, a DLink list entry will be created. Each DLink list entry has the following fields:

PageId        Identifier of the page.
LSN           LSN of the record.
PageLastLSN   PageLastLSN pointer of the record.
NodeId        Identifier of the node that the PageLastLSN pointer points to.

At the end of the Analysis phase, each node will send the relevant portion^2 of the DLink list to the other nodes. The data contained in the DLink list will be used to augment the LLink list for all pages. The rules for inserting data from a DLink list into an LLink list are:

^2 Each node will be sent all entries where their node identifier is equal to NodeId.

1. Locate the point in the LLink list where LSN = DLink.PageLastLSN.
2. Insert the corresponding LSN from the DLink list directly after this point.

Example 1.

DLink List (From Node 2):

LSN     PageLastLSN   NodeId
10.2    12.1          1

LLink List (Node 1): 5, 7, 12, 16

First, locate the entry where LSN = 12.1 (PageLastLSN from DLink list), then insert 10.2 (LSN from DLink list) directly after this point.

Augmented LLink List (Node 1): 5, 7, 12, 10.2, 16

Page Start List. The purpose of the Page Start List is to determine, for each page, from which node to initiate the Redo phase of the recovery process. The Page Start List has the following fields:

PageId   Identifier of the page.

During the forward scan of the log, the first time the algorithm encounters a log record for a page Pj, it takes the following actions:

1. Create a Page Start List entry for Pj.

At the end of the Analysis phase, after having augmented the LLink list, revisit each entry of the local Page Start List and apply the following rule:

2. Remove the entry if the first entry for the corresponding page in the augmented LLink list does not refer to the local node.

After step 1. has been completed on all nodes, we are guaranteed to have captured all pages that have to be visited during the Redo and Undo phases. Thus, we are able to lock those pages exclusively. All other pages can then be made available for normal processing.^3 Multiple lock requests (all on behalf of the recovery algorithm) for the same page are possible, but do not cause any problems. Second, third, ... requests can simply be ignored.

Step 2. then removes all entries except those that refer to pages that have been updated first by the local node. This ensures that there is exactly one Page Start List entry per page (that is of interest to the crash recovery algorithm) across all nodes. This entry refers to the node which will be in charge of initiating the Redo pass for that particular page.

^3 Note: by this time, no communication between nodes nor access to persistent data other than the local log was required.
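The two insertion rules and Example 1 can be illustrated with a minimal Python sketch. It is not taken from the paper: the names `LSN`, `DLinkEntry` and `augment_llink` are hypothetical, the page identifier "P1" is invented for the illustration, and the LLink entries 5, 7, 12 and 16 of Node 1 are assumed to stand for the distributed LSNs 5.1, 7.1, 12.1 and 16.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LSN:
    """Distributed LSN in the RecNum.NodeId format of Section 3.2."""
    rec_num: int
    node_id: int

    def __str__(self) -> str:
        return f"{self.rec_num}.{self.node_id}"


@dataclass(frozen=True)
class DLinkEntry:
    """One DLink list entry (field names follow the paper, layout assumed)."""
    page_id: str
    lsn: LSN
    page_last_lsn: LSN
    node_id: int  # node that page_last_lsn points to


def augment_llink(llink, dlink_entries):
    """Return a copy of llink with each DLink LSN inserted directly
    after the entry that equals its PageLastLSN (rules 1 and 2)."""
    result = list(llink)
    for entry in dlink_entries:
        # Rule 1: locate the point where LSN = DLink.PageLastLSN.
        idx = result.index(entry.page_last_lsn)  # ValueError if absent
        # Rule 2: insert the DLink LSN directly after this point.
        result.insert(idx + 1, entry.lsn)
    return result


# Example 1: Node 1 holds records 5, 7, 12 and 16 for the page;
# Node 2 reports record 10.2, whose PageLastLSN is 12.1.
llink_node1 = [LSN(5, 1), LSN(7, 1), LSN(12, 1), LSN(16, 1)]
dlink_from_node2 = [DLinkEntry("P1", LSN(10, 2), LSN(12, 1), 1)]
augmented = augment_llink(llink_node1, dlink_from_node2)
print(" ".join(str(lsn) for lsn in augmented))  # 5.1 7.1 12.1 10.2 16.1
```

The sketch handles a single page and assumes every PageLastLSN received is already present in the local list; how chains spanning more than two nodes are resolved is not covered here.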
Page End List. The purpose of the Page End List is to determine, for each page, where the Undo phase of the recovery algorithm should stop processing. However, this is only an optional feature, which will be omitted due to space constraints. In the absence of the Page End List, the Undo phase terminates as soon as the ScanLSN record (see below) has been reached on any node.

Undone List. The purpose of the Undone List is to store a list of all operations that have been previously undone. The Undone List has the following fields:

PageId      Identifier of the page.
UndoneLSN   LSN of the record that has been undone.

During the scan of the log, whenever the algorithm encounters a CLR record, it adds an entry to the Undone List.

4.2 Analysis Phase.

During the Analysis phase, the algorithm collects all data that is required to restore the database to the most recent consistent state. This involves performing a forward scan through the log, collecting the data required for the Redo and Undo phases of recovery. The Analysis phase of the recovery process is comprised of three steps, being: Initialisation, Data Collection and Completion.

Step 1: Initialisation. The initialisation of the Analysis phase involves reading the most recent checkpoint in order to construct an initial TransStatus table and determine the start point (ScanLSN) for the forward scan of the log.

TransStatus Table. For each transaction stored in the Active Transaction table, a corresponding entry is created in the TransStatus table.

Start Point. The start point of the scan (ScanLSN) is the point in the log where the node will start scanning and is computed as follows: the lowest LSN of either: 1) DirtyLSN; or 2) the lowest FirstLSN of any transaction in the TransStatus table whose status is 'Active'.

Step 2: Data Collection. During the forward scan of its log, each node will collect data to be stored in the data structures discussed in Section 4.1. The type of record encountered during the log scan determines the data that is collected and into which data structure it is stored. The records from which the Analysis phase collects data are Commit Log Record, End Log Record, ULR, CLR, and SCR.

Commit Log Record. Each time a Commit Log Record is encountered, the recovery manager inserts an entry into the TransStatus table for the transaction with status set to Commit. Any existing entries for this transaction are
replaced.

End Log Record. Each time an End Log Record is encountered, the recovery manager inserts an entry into the TransStatus table for the transaction with status set to End. Any existing entries for this transaction are replaced.

Update Log Record (ULR). Each time a ULR record is encountered, the following data is captured:

- If no entry exists in the TransStatus table for this transaction, then an entry is created for this transaction with status set to Active.
- Add an entry to the LLink list.
- If the PageLastLSN pointer points to a different node, then add an entry to the DLink list.
- Create a Page Start List entry as required.

Compensation Log Record (CLR). Each time a CLR record is encountered, in addition to capturing all data described for a ULR record, an entry will be added to the Undone List.

Special Compensation Log Record (SCR). The same data is captured for an SCR record as that captured for a ULR record.

Step 3: Completion. Once the node has completed the forward scan of the log, it performs the following actions:

1. Acquire an exclusive lock, on behalf of the recovery algorithm, on all pages identified in the Page Start List.
2. Send the following data: TransStatus table (to all crash nodes) and DLink list (to all other nodes).
3. Upon receiving this data from other nodes, the algorithm performs the following tasks: 1) update the TransStatus table; 2) augment the LLink lists.
4. Once all LLink lists have been augmented, update the Page Start List.
5. Once a crash node has received the TransStatus table from all other nodes, it can send a list of 'loser transactions' to all other nodes.

Notes:

- Once all nodes in the system have acquired the required locks (see action 1 above), the database can commence normal processing. Only those pages that are locked for recovery will remain unavailable.
- Once a node has received the DLink lists from all other nodes (and augmented its LLink lists), it can enter the Redo phase.
- Once a node has received a list of loser transactions from all crash nodes, it can potentially enter the Undo phase.
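As a rough illustration of two computations in the Analysis phase, the following Python sketch derives the ScanLSN start point and applies the TransStatus deletion rule from Section 4.1 to determine the loser transactions on a crash node. The function names, the dictionary layout and all sample values are assumptions made for this sketch, and ScanLSN is assumed to be compared on node-local record numbers, since FirstLSN refers to records written on the local node.

```python
ACTIVE, COMMIT, END = "Active", "Commit", "End"


def scan_lsn(dirty_lsn, checkpoint_table):
    """Start point of the forward scan: the lowest of DirtyLSN and the
    FirstLSN of any transaction whose status is Active.

    checkpoint_table maps TransId -> (FirstLSN record number, Status).
    """
    active_firsts = [first for first, status in checkpoint_table.values()
                     if status == ACTIVE]
    return min([dirty_lsn] + active_firsts)


def merge_trans_status(local, received):
    """Delete the local entry for Tj if the received table contains an
    entry for Tj whose status is Commit or End."""
    merged = dict(local)
    for trans_id, status in received.items():
        if status in (COMMIT, END):
            merged.pop(trans_id, None)
    return merged


def loser_transactions(local, received_tables):
    """After all tables have been merged, every transaction that is
    still Active is a loser transaction."""
    table = dict(local)
    for received in received_tables:
        table = merge_trans_status(table, received)
    return sorted(t for t, status in table.items() if status == ACTIVE)


# Hypothetical checkpoint on one node: T2 has already committed, so
# only the FirstLSN of T1 competes with DirtyLSN.
start = scan_lsn(40, {"T1": (35, ACTIVE), "T2": (20, COMMIT)})

# Hypothetical crash node: T1 committed on node N2, T2 ended nowhere.
local_table = {"T1": ACTIVE, "T2": ACTIVE, "T3": COMMIT}
from_n2 = {"T1": COMMIT}
from_n3 = {"T2": ACTIVE, "T4": END}
losers = loser_transactions(local_table, [from_n2, from_n3])
print(start, losers)  # 35 ['T2']
```

Locking, message exchange and the LLink augmentation of the Completion step are deliberately left out; the sketch only covers the table arithmetic.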
4.3 Redo Phase.

The Redo phase is responsible for returning each page in the database to the state it was in immediately before the crash. In order for a node to enter the Redo phase, it must have augmented its LLink lists and updated the Page Start List.

For each page that the node has in its Page Start List, the Redo algorithm will spawn a thread that 'repeats history' for that page. Given a page Pj, history is repeated by performing the following tasks:

1. Start by considering the oldest log record for page Pj that was written after PageLSN. This requires reading the page into main memory.
2. Using the augmented LLink lists for page Pj, move forward through the log until no more records for this page exist.
3. Each time a re-doable record is encountered, reapply the described changes. Re-doable records are: SCR, CLR and ULR records.

Once the thread has processed the last record for this page, the recovery algorithm may enter the Undo phase for this page. The recovery algorithm may enter the Undo phase for different pages at different times; for example, page P1 might enter the Undo phase while page P2 is still in the Redo phase.

Once the recovery algorithm has completed the Redo phase for all pages, an End Log Record can be written for all transactions whose status is Commit in the TransStatus table. Since a transaction can have a Commit entry on only a single node, it is guaranteed that exactly one End Log Record will be written for such transactions. For expediency, this can be deferred until after the Undo phase is complete if so desired.

4.4 Undo Phase.

The Undo phase is responsible for undoing the effects of all updates that were performed by so-called 'loser transactions'. In order for a node to enter the Undo phase, it must have received a list of loser transactions from all crash nodes.

The thread that was spawned for the Redo phase for page Pj will now begin working backwards through the log undoing all updates to the page that were made by loser transactions by performing the following tasks:

1. Work backwards through the log using the PageLastLSN pointers processing each log record until all updates by loser transactions have been undone.
2. Each time an SCR or ULR record is encountered, take the following actions:
   - Special Compensation Log Record (SCR).
     (a) Jump to the record immediately preceding the record pointed to by the UndoneLSN field. The UndoneLSN field indicates that during
a previous invocation of the recovery algorithm, the updates recorded by the record at UndoneLSN have already been undone.
   - Update Log Record (ULR). If the update was not written by a loser transaction or has previously been undone (the Undone List is used to determine this), then no action is taken. Otherwise, the following actions are taken to undo the update:
     (a) Write an SCR record that describes the undo action to be performed with the UndoneLSN field set equal to the LSN of the ULR record whose updates have been undone.
     (b) Execute the undo action described in the SCR record written.

Once the thread has completed processing all records back to ScanLSN, the page can be unlocked, on behalf of the recovery algorithm, and made available for normal processing again. The advantage of allowing each page to be unlocked individually is that the database can return to normal processing as quickly as possible.

Once the recovery algorithm has completed the Undo phase for all pages, an End Log Record can be written for all transactions whose status is Active in the TransStatus table (these are the loser transactions). The node that is responsible for the transaction will write this record.

4.5 Crashes During Crash Recovery.

By preserving ARIES' paradigm of repeating history, it can be guaranteed that multiple crashes during crash recovery will not affect the outcome of the recovery process. The Redo phase ensures that each update lost during the crash is applied exactly once by using the PageLSN value to determine which logged updates have already been applied to the page. Since all compensation operations are logged during the Undo phase, the Redo phase and the nature of the Undo phase ensure that compensation operations are also performed exactly once. The UndoneLSN plays a similar role in D-ARIES as the UndoNextLSN does in ARIES.

4.6 Enhancements

It is possible to further enhance the fault tolerance and speed of recovery of this algorithm in the face of a node that has crashed and is slow to become available. This is achieved by making both the database partition and recovery log of each node in the system accessible to each other node in the system.

During recovery, it is then possible for a non-crash node to act as a 'proxy' for the crash node. That is, the non-crash node can perform recovery on behalf of the crashed node. Whilst acting as a proxy, a non-crash node would read the log records of the crash node and perform updates on the crash node's database partition, just as the crash node would do if it were available.

Clearly such an enhancement offers real benefits in terms of recovery speed, particularly when a crash node becomes unavailable for an extended period of time. By the time the crash node becomes available, its database partition and log will be in a consistent state and the node will be able to commence processing immediately.

5 D-ARIES - Rollback During Normal Processing

Having defined the algorithm for rollback of transactions during crash recovery, it is now necessary to do the same for normal processing. There are two main classes of schedules that must be considered when defining a rollback algorithm; these are schedules with cascading aborts and schedules without.

The case where cascading aborts do not exist is trivial: rolling back a transaction simply involves following the PrevLSN pointers for the transaction backwards, undoing each operation as it is encountered. Since cascading aborts do not exist in these schedules, no consideration need be given to conflicts between the aborting transaction and any other transactions.

The case where cascading aborts do exist is a great deal more complex, since rolling back a transaction may necessitate the rollback of one or more other transactions. Each time an operation is undone, it is necessary to consider which transactions, if any, must be rolled back in order to avoid database inconsistencies.

In the original ARIES algorithm, rollback of transaction Ti involves undoing each operation in reverse order by following the PrevLSN pointers from one ULR record to the next. Whenever an undo operation for transaction Ti conflicts with an operation in some other transaction Tj, a cascading abort of transaction Tj must be initiated. Transaction Ti must then suspend rollback and wait for transaction Tj to roll back beyond the conflicting operation before it can recommence rollback.

Clearly this is not the most efficient method, since the rollback of the entire transaction is suspended due to a single operation being in conflict. A more desirable method is to suspend rollback only of those operations that are in conflict and to continue rollback of all other operations. It is also desirable to trigger the cascading abort of all transactions in conflict as early as possible.

By taking advantage of multi-threading, it is possible to roll back a transaction on a page-by-page basis. This allows a transaction in rollback to simultaneously:

- Trigger multiple cascading aborts,
- Suspend rollback of updates to pages whilst waiting for other transactions to roll back, and
- Continue rolling back updates that do not have any conflicts.

Partial rollback of transactions is achieved by establishing savepoints [10] during processing, then at some later point requesting the rollback of the transaction to the most recent savepoint. This can be contrasted with total rollback that removes all updates performed by the transaction.

5.1 Algorithms.

Rollback of a transaction Ti is achieved by the use of a single 'Master Thread' that is responsible for coordinating the rollback process and multiple 'Slave Threads' that are responsible for the rollback of updates made to individual pages.

Master Thread. The master thread is responsible for coordinating the rollback of a transaction by performing the following actions:

- Triggering the cascading abort of transactions as required.
- Undoing all update operations that are not in conflict with update operations from other transactions.
- Spawning a new slave thread whenever a conflict is detected that requires the undo of updates to a page to be delayed while other transaction(s) roll back.

Algorithm 1:

Parameters.
    Ti        Identifier of transaction to roll back.
    SaveLSN   The savepoint for partial rollback of the transaction. If SaveLSN is Null, then total rollback will occur.

Variables.
    CurrLSN   The LSN of the record currently being processed.
    PS        Set of page identifiers for which slave threads are spawned.

Procedure.
     1. CurrLSN = LSN of last update performed by Ti.
     2. WHILE (Record.LSN ≠ SaveLSN) DO
     3.     Move to Record at CurrLSN
     4.     IF (Current operation is not in conflict with any other) THEN
     5.         IF (Record.PageId ∉ PS) THEN
     6.             Undo current operation
     7.         ENDIF
     8.     ELSE
     9.         IF (Record.PageId ∉ PS) THEN
    10.             Add Record.PageId to PS
    11.             Spawn a new slave thread for page Record.PageId
                    Parameters: (Record.PageId, SaveLSN, NodeId^4)
    12.         ENDIF
    13.         Trigger the cascading abort of any transaction(s) that have conflicting operations (and are not already aborting).
    14.     ENDIF
    15.     CurrLSN = Record.PrevLSN
    16. ENDDO

The algorithm terminates once the master thread has reached the savepoint and has r
eceivedaDone.messagefromallslavethreadsspawned.SlaveThread.Theslavethreadisresponsibleforundoingallupdatesmadebythetransactiontoasinglepage.Theslavethreadmustnotundoanyoperationuntilanycon\rictingoperation(s)havebeenrolledback.Oncetheslavethreadhascompletedrollingbackallchangestothepage,itsendsaDone.messagebacktothemasterthread.4IftheSaveLSN=Null,thenNodeIdisthenodeonwhichthemasterthreadwasspawned,otherwiseNodeId=SaveLSN[NodeId].ThisistoensurethatallslavethreadssendtheirDone.messagestothesamenode. 29Algorithm2:Parameters.PiIdentifierofpagetorollback.SaveLSNThesavepointforpartialrollbackofthetransaction.IfSaveLSNisNull,thentotalrollbackwilloccur.NodeIdIdentifierofthenodethatshouldreceivetheDone.message.Variables.CurrLSNTheLSNoftherecordcurrentlybeingprocessed.Procedure.1.WHILE(Record.LSN6=SaveLSN)DO2.MovetorecordatCurrLSN3.IF(Currentoperationisinconflictwithanyother)THEN4.Waituntilconflictingoperations(s)havebeenrolledback.5.ENDIF6.Undocurrentoperation7.CurrLSN=Record.PrevLSN8.ENDDO9.SendDone.Messagetothemasterthread.Optimisation.Inrollback,itispossibleforboththemasterthreadandtheslavethreadstoreducethefrequencywithwhichtheycheckforcon\rictsbetweenthecurrentoperationandoperationsbelongingtoothertransactions.GivenaULRrecordwrittenforpagePjbytransactionTi,itisonlynecessarytocheckforcon\rictsifthelastULRrecordwrittenforpagePjwasnotwrittenbytransactionTi.ThiscanbeachievedbycheckingifthePageLastLSNofthepreviouslogrecordwrittenbytransactionTipointstothecurrentlogrecordwrittenbytransactionTi.Ifthisisthecase,nocon\rictcheckingneedbeperformed.6ConclusionsThealgorithmpresentedinthispaperoersanadaptationofARIESthatsolvestheproblemofrecoveryinSDdatabasesystems.ThedesirablepropertiesofARIES,suchasnegranularitylocking5,operationlockingandpartialrollbacks,havebeenpreservedwithoutimposingsignicantoverheadonthesystem.TheoriginalARIESalgorithmreliesontheLSNsbeinggloballymonotoni-callyincreasinginordertoensurerecoverabilityofthedatabaseafteracrash.Assuch,previousadaptationshavesoughtto
ensurethatLSNsaremonotonicallyincreasingthroughouttheSDsystemratherthanresolvethisreliance.Thishasresultedinundesirablecompromisesbeingmadeincludingtherequirementforaperfectlysynchronisedglobalclockandtheglobalmergeofthelogs.Thisalgo-rithmremovesARIES'dependencyongloballymonotonicallyincreasingLSNs,whichinturnrendersagloballogmergeorclockofanykindredundant.5SameaswithARIES,D-ARIESsupportsnegranularitylockingassumingthatrecoverabilityisensuredbythetransactionmanager. 30Furtherenhancementsweremadetodecreasethetimetakenforthedatabasetorecoverfromacrashandreducethetimethatthedatabaseremainsun-availablefornormalprocessing.SuchenhancementsincludeallowingnormalprocessingtocommenceaftercompletionoftheAnalysisphaseandtheuseofmulti-threadingtoincreasetheconcurrencyofrecoveryoperations.Thecon-ceptofmulti-threadedrecoverywasfurtheradaptedtoprovideamechanismofincreasedconcurrencyoftransactionrollbackduringnormalprocessing.Also,itshouldnotbeforgottentomentionthattheproposedalgorithmoersbenetseventocentraliseddatabasesystems.Multi-threadedrecoveryallowsforahighdegreeofparallelism.Also,supportforrecoveryfromisolatedhardwarefailures(e.g.asingle-pagerestoreafteratornwrite)isprovided.Moreover,theproposedalgorithmreadilypermitstoexploitacommonandveryeectiveoptimisation,namelyloggingofdiskwrites.Futureworkidentiedinthisareaincludestheadaptationofthisalgorithmtothemulti-leveltransactionmodelandtheuseofmulti-threadingtofurtherimproveconcurrencyofbothrecoveryandnormalprocessing.References1.Mohan,C.,Haderle,D.J.,Lindsay,B.G.,Pirahesh,H.,Schwarz,P.M.:ARIES:atransactionrecoverymethodsupportingne-granularitylockingandpartialroll-backsusingwrite-aheadlogging.ACMTransactionsonDatabaseSystems(TODS)17(1992)94{1622.Mohan,C.:ARIESfamilyoflockingandrecoveryalgorithms(2004)http://www.almaden.ibm.com/u/mohan/ARIESImpact.html3.Harder,T.,Reuter,A.:Principlesoftransaction-orienteddatabaserecovery.ACMComputingSurveys(CSUR)15(1983)287{3174.Mohan,C.,Narang,I.:Databaserecoveryinshareddisksandclient-serverar-c
hitectures.In:Proceedingsofthe12thInternationalConferenceonDistributedComputingSystems,IEEEComputerSocietyPress(1992)310{3175.Mohan,C.,Narang,I.:Recoveryandcoherency-controlprotocolsforfastinter-systempagetransferandne-granularitylockinginashareddiskstransactionenvironment.In:Proceedingsofthe17thInternationalConferenceonVeryLargeDataBases,MorganKaufmannPublishersInc.(1991)193{2076.Rahm,E.:Recoveryconceptsfordatasharingsystems.In:Proceedingsof21stInternationalSymposiumonFault-TolerantComputing.(1991)368{3777.Lomet,D.B.:Recoveryforshareddisksystemsusingmultipleredologs.TechnicalReportCLR90/4,DigitalEquipmentCorp.,CambridgeResearchLab,Cambridge,MA(1990)8.Bozas,G.,Kober,S.:Loggingandcrashrecoveryinshared-diskparalleldatabasesystems.TechnicalReportTUM-I9812,SFB-BerichtNr.342/06/98A,DeptofComputerScience,MunichUniversityofTechnology,Germany(1998)9.Mohan,C.:RepeatinghistorybeyondARIES.InAtkinson,M.P.,Orlowska,M.E.,Valduriez,P.,Zdonik,S.B.,Brodie,M.L.,eds.:VLDB'99,Proceedingsof25thInternationalConferenceonVeryLargeDataBases,September7-10,1999,Edin-burgh,Scotland,UK,MorganKaufmann(1999)1{1710.Gray,J.,McJones,P.,Blasgen,M.,Lindsay,B.,Lorie,R.,Price,T.,Putzolu,F.,Traiger,I.:TherecoverymanageroftheSystemRdatabasemanager.ACMComputingSurveys(CSUR)13(1981)223{242
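
Illustrative sketch. The page-by-page rollback of Section 5.1 (Algorithms 1 and 2) can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the authors' implementation: the names `LogRecord`, `Conflict`, `Rollback` and the `trigger_abort` callback are hypothetical, a conflict is modelled as an event that is set once the other transaction has rolled back past the conflicting operation, the Done message is modelled by joining the slave thread on a single node (so NodeId is omitted), and each slave is assumed to start at the record where the conflict was first detected.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Conflict:
    """Hypothetical stand-in for a conflict with another transaction Tj."""
    txn_id: str
    resolved: threading.Event = field(default_factory=threading.Event)


@dataclass
class LogRecord:
    lsn: int
    prev_lsn: Optional[int]            # PrevLSN: previous record of the same transaction
    page_id: str
    conflict: Optional[Conflict] = None


class Rollback:
    def __init__(self, log: Dict[int, LogRecord],
                 trigger_abort: Callable[[Conflict], None]) -> None:
        self.log = log
        self.trigger_abort = trigger_abort   # assumed hook that starts a cascading abort
        self.undone: List[int] = []          # LSNs in the order they were undone
        self.aborting: Set[str] = set()      # transactions already asked to abort
        self._lock = threading.Lock()

    def _undo(self, rec: LogRecord) -> None:
        with self._lock:                     # the undo itself is only recorded here
            self.undone.append(rec.lsn)

    @staticmethod
    def _in_conflict(rec: LogRecord) -> bool:
        return rec.conflict is not None and not rec.conflict.resolved.is_set()

    def slave(self, page_id: str, start_lsn: int, save_lsn: Optional[int]) -> None:
        """Algorithm 2 (sketch): undo this transaction's updates to one page."""
        curr: Optional[int] = start_lsn
        while curr is not None and curr != save_lsn:
            rec = self.log[curr]
            if rec.page_id == page_id:
                if rec.conflict is not None:
                    rec.conflict.resolved.wait()   # wait for the conflicting rollback
                self._undo(rec)
            curr = rec.prev_lsn
        # Returning plays the role of the Done message (the master joins the thread).

    def master(self, last_lsn: int, save_lsn: Optional[int] = None) -> None:
        """Algorithm 1 (sketch): coordinate rollback back to save_lsn (None = total)."""
        ps: Set[str] = set()                 # pages handed over to slave threads
        slaves: List[threading.Thread] = []
        curr: Optional[int] = last_lsn
        while curr is not None and curr != save_lsn:
            rec = self.log[curr]
            if not self._in_conflict(rec):
                if rec.page_id not in ps:
                    self._undo(rec)
            else:
                if rec.page_id not in ps:
                    ps.add(rec.page_id)
                    t = threading.Thread(target=self.slave,
                                         args=(rec.page_id, curr, save_lsn))
                    t.start()
                    slaves.append(t)
                if rec.conflict.txn_id not in self.aborting:
                    self.aborting.add(rec.conflict.txn_id)
                    self.trigger_abort(rec.conflict)
            curr = rec.prev_lsn
        for t in slaves:                     # wait for every slave to finish
            t.join()
```

In this sketch the master never blocks on a conflict: it hands the affected page to a slave, triggers the cascading abort once per conflicting transaction, and keeps undoing updates to other pages. The per-page undo order is preserved because, once a page is in PS, only its slave touches it.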