D-ARIES: A Distributed Version of the ARIES Recovery Algorithm

Jayson Speer and Markus Kirchberg

Information Science Research Centre, Department of Information Systems, Massey University, Private Bag 11 222, Palmerston North 5301, New Zealand
JaysonSpeer@gmail.com, M.Kirchberg@massey.ac.nz

Abstract. This paper presents an adaptation of the ARIES recovery algorithm that solves the problem of recovery in Shared Disk (SD) database systems, whilst preserving all the desirable properties of the original algorithm. Previous such adaptations of the ARIES algorithm have failed to solve some of the problems associated with SD systems, resulting in a number of undesirable compromises being made. The most significant problem is how to assign log sequence numbers (LSNs) in such a way that the system is recoverable after a crash. Existing adaptations have required, among other things, a perfectly synchronised global clock and a central merge of log data either during normal processing or crash recovery, which clearly imposes a significant overhead on the database system. This adaptation of ARIES removes this requirement entirely, meaning that log merges and synchronised clocks become entirely unnecessary. Further enhancements that allow the Redo and Undo phases of recovery to be performed on a page-by-page basis have significantly reduced the recovery time. Additionally, it is possible for the database to return to normal processing at the end of the Analysis phase, rather than waiting for the recovery process to complete.

1 Introduction

Introduced by Mohan et al. [1], the ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) algorithm has had a significant impact on current thinking on database transaction logging and recovery. It has been incorporated into Lotus Notes as well as a number of major database management systems (DBMSs) [2]. ARIES, like many other algorithms, is based on the WAL (Write Ahead Logging) protocol that ensures recoverability of a database in the presence of a crash. However, ARIES' Repeating History paradigm sets it apart from all other WAL based protocols. The repeating history paradigm requires, during recovery, that the database be returned to the same state as it was before the crash occurred. This aspect of ARIES allows it to support fine granularity locking, operation locking (e.g. increment and decrement) and partial rollbacks. ARIES further supports the use of buffer managers that use the Steal and No-Force approach to page replacement [3].

1.1 Recovery in Shared Disk Architectures

Often the performance of a single DBMS is unacceptable, resulting in the use of multiple systems where the database is split across multiple systems. The two major architectures are the Shared Disk (SD) and Shared Nothing (SN) architectures.

[Figure 1: Simplified Shared Disk (SD) Architecture. Systems S1, S2, ..., Sn, each with its own local log Log1, Log2, ..., Logn, all attached to the shared disks.]

In the SD architecture, the disks containing the database are shared amongst the different systems. Each system has an instance of the DBMS executing on it and may read or update any portion of the database held on the shared disks. In the simplified architecture shown in Figure 1, the disks containing the database are shared amongst the systems {S1, ..., Sn}, with each system having an instance of the DBMS. Each system may perform updates on any portion of the database and will log each such update in its local log {Log1, ..., Logn}. In the event of one or more of the systems failing (e.g. power failure), the algorithm must be able to return the system to a consistent state afterwards. Here, such nodes are referred to as crash nodes and are distinguished from non-crash nodes, since they play slightly different roles in the recovery process.

1.2 Problems

The introduction of the SD architecture adds an extra dimension of complexity to the problem of recovery, which can be almost entirely attributed to assigning log sequence numbers (LSNs) and how to interpret them during crash recovery. Mohan and Narang ([4], p. 312) detailed a number of these problems, which include:

1. During recovery, how do we determine whether or not there is a need to redo the updates of a log record if the addresses of log records (when used as the LSN values) are not monotonically increasing across all systems?
2. If the log records representing updates to a particular page are scattered across the different systems' local log files, how do we start to recover the page and how do we merge the log records from the different local log files?

ARIES has been adapted for use in SD architectures [5], with several different schemes being proposed. However, each scheme made a number of significant
compromises in order to address the aforementioned problems. The major assumptions made for all these schemes are:

1. 'For the purposes of handling data recovery, one system in the SD complex which has connectivity to all the local logs' disks produces a merged version of those logs.' ([5], p. 194)
2. '... the LSN is a timestamp and that the clocks across the SD complex are perfectly synchronised.' ([5], p. 195)

The first assumption clearly introduces an extra layer of overhead to the logging process that is not present when ARIES is applied to a single system database, due to the extra communication and processing that each system must now perform. Such problems will be exacerbated in the case where the system performing the log merge becomes a bottleneck in the process.

The second assumption not only adds an extra layer of overhead due to the requirement that all clocks in the system are perfectly synchronised, but also adds an extra layer of complexity to the handling of LSNs. Whilst logging, each record is assigned a unique LSN when that record is appended to the log. In single system databases, LSNs are typically 'the logical addresses of the corresponding log records' ([1], p. 96). This allows a correspondence to be established between LSNs and the physical location of the corresponding log record. Under the proposed assumption, this correspondence can no longer be established. This means extra data structures must be implemented to relate LSNs to the corresponding logical address, which causes a drop in database performance.

Additionally, in the case where ownership of a page is transferred to another system without first forcing that page, the scheme proposed by Mohan & Narang [5] requires a log merge in order to enable recovery of the page. Rahm [6] avoids this problem by prohibiting the change of page ownership altogether.

Lomet [7] offers the method, of all those investigated, that most resembles the one presented here, where state identifiers are used to track updates to data objects. Whilst this method does not require globally synchronised clocks, it does require the state identifiers to be monotonically increasing for any given data object, which introduces an undesirable layer of complexity and breaks the direct correspondence between the logical and physical addresses of log records. Other restrictive assumptions are also made, such as requiring that a page be forced to disk before transferring ownership to another node.

Bozas & Kober [8] offer a similar method to [7] that requires state identifiers to be monotonically increasing for a page, but additionally does not support logical operation logging or Steal buffer management policies.

1.3 Solutions

The algorithm proposed in this paper addresses the problem of recovery within an SD system whilst resolving the aforementioned problems. In particular, the need to have LSNs monotonically increasing across all systems has been removed. This removes all reliance on clocks and allows the correspondence between LSNs and the physical location of the corresponding log records to be preserved. The algorithm also removes any need for the merging of log data (during normal processing or recovery), but requires that all nodes participate in crash recovery.

Additional optimisations were made to the Redo and Undo phases of the recovery process whereby these phases are now performed on a page-by-page basis. Performing the Redo and Undo phases on a page-by-page basis allows a much higher degree of concurrency during the recovery process. This technique also provides the basis for a proposed method to improve concurrency during the rollback of transactions during normal processing.

During recovery, the time that the database is quiesced is reduced since normal processing can commence as soon as the Analysis phase of recovery is complete. Before the end of the Analysis phase, an exclusive lock is acquired (on behalf of the recovery algorithm) on those pages that must have changes reapplied (Redo phase) or removed (Undo phase). As a result, all other pages remain available for normal processing. Since the Redo and Undo phases are performed on a page-by-page basis, the exclusive lock on each page is released as soon as possible, reducing the unavailability of the database to a minimum.

It was also important not to impose any unnecessary overhead on the algorithm in terms of both communication costs and logging. The extra logging required is very small, with a single extra field being added to some log record types and fields removed from others. Communication is typically the bottleneck in any distributed environment. Therefore, reducing the amount of communication required or, more significantly, increasing the concurrency of communication, will improve the overall system performance. Concurrency of communication during the Redo and Undo phases of recovery has been significantly increased by performing both phases independently on each page, hence decreasing the overall delay due to communication.

While addressing the problems experienced in previous adaptations of ARIES to the SD environment, it was important not to lose those aspects of ARIES that make it both unique and widely accepted. For this reason, the algorithm preserves the repeating history paradigm, fine granularity locking and partial rollbacks. It also places no restriction on the nature of buffer management used, thereby supporting the No-Force and Steal approaches previously supported.

1.4 Overview

This paper is organised into the following sections. Section 2 briefly introduces the original ARIES recovery algorithm. Sections 3 to 5 present our adaptation of the ARIES algorithm targeted at SD database systems. Section 6 draws conclusions and identifies the areas in which further work might be undertaken.

2 ARIES

This section provides a brief overview of the original ARIES algorithm. The reader is referred to [1] for the details about ARIES and to [9] for an introduction to the ARIES family of algorithms.

ARIES, like many other algorithms, is based on the WAL protocol that ensures recoverability of a database in the presence of a crash. All updates to all pages are logged (e.g. in logical fashion). ARIES uses an LSN stored on each page to correlate the state of the page with logged updates of that page. By examining the LSN of a page (called the PageLSN) it can be easily determined which logged updates are reflected in the page. Being able to determine the state of a page w.r.t. logged updates is critical whilst repeating history, since it is essential that any update be applied to a page once and only once. Failure to respect this requirement will in most cases result in a violation of data consistency.

Updates performed during forward processing of transactions are described by Update Log Records (ULRs). However, logging is not restricted to forward processing. ARIES also logs, using Compensation Log Records (CLRs), updates (i.e. compensations of updates of aborted/incomplete transactions) performed during partial or total rollbacks of transactions. By appropriate chaining of CLR records to log records written during forward processing, a bounded amount of logging is ensured during rollbacks, even in the face of repeated failures during crash recovery. This chaining is achieved by 1) assigning LSNs in ascending sequence; and 2) adding a pointer (called the PrevLSN) to the most recent preceding log record written by the same transaction to each log record.

When the undo of a log record causes a CLR record to be written, a pointer (called the UndoNextLSN) to the predecessor of the log record being undone is added to the CLR record. The UndoNextLSN keeps track of the progress of a rollback. It tells the system from where to continue the rollback of the transaction, if a system failure were to interrupt the completion of the rollback.

Periodically during normal processing, ARIES takes fuzzy checkpoints in order to avoid quiescing the database while checkpoint data is written to disk. Checkpoints are taken to make crash recovery more efficient.

When performing crash recovery, ARIES makes three passes (i.e. Analysis, Redo and Undo) over the log. During Analysis, ARIES scans the log from the most recent checkpoint to the end of the log. It determines 1) the starting point of the Redo phase by keeping track of dirty pages; and 2) the list of transactions to be
rolled back in the Undo phase by monitoring the state of transactions. During Redo, ARIES repeats history. It is ensured that updates of all transactions have been executed once and only once. Thus, the database is returned to the state it was in immediately before the crash. Finally, Undo rolls back all updates of transactions that have been identified as active at the time the crash occurred.

3 D-ARIES - Preliminaries

In the next three sections, we introduce a distributed version of the ARIES recovery algorithm (referred to as D-ARIES).

3.1 Overview

D-ARIES preserves the desirable properties of the original ARIES algorithm. Crash recovery still makes three passes over the log. However, supporting ownership changes of pages (without the need to flush respective pages first or merge logs) requires all nodes to be involved in crash recovery. Thus, it becomes even more important that the time the database is quiesced is reduced. D-ARIES achieves this since normal processing can resume as soon as the Analysis phase of recovery is complete. Furthermore, since the Redo and Undo phases are performed on a page-by-page basis, pages required during the Redo and Undo phases are made available to normal transaction processing as soon as possible, reducing the unavailability of the database to a minimum. Section 4 discusses crash recovery processing in more detail.

In the event of one or more of the systems failing, the crash recovery algorithm must be able to return the system(s) to the most recent consistent state. Besides crash recovery, it is also necessary to discuss transaction rollback during normal processing. Section 5 considers two main classes of schedules that must be addressed when defining a rollback algorithm; these are schedules with cascading aborts and schedules without. D-ARIES takes advantage of multi-threading and rolls back transactions on a page-by-page basis.

3.2 Logging

Previous adaptations of the ARIES algorithm require that the LSNs be globally monotonically increasing across all nodes in the system. This requirement introduces the problem of how to ensure that the LSNs are monotonically increasing on a global basis and means that the physical address of a log record can no longer be inferred from its LSN.

The adaptation of the ARIES algorithm presented here no longer requires LSNs to be globally monotonically increasing; however, it is required that LSNs be monotonically increasing within each node. This requirement for monotonically increasing LSNs within a node is not a burden, but rather a benefit, since it allows a direct correspondence between a log record's physical address and logical address to be maintained.

In order to remove the need for globally monotonically increasing log numbers, a number of modifications must be made to the way the ARIES algorithm performs logging. These are:

1. Definition of a 'Distributed LSN'.
2. Modification of the CLR record.
3. Definition of a 'Special Compensation Log Record (SCR)'.
4. Addition of PageLastLSN pointers (to specific types of log records).

Distributed LSN. In order to identify which node a log record belongs to, a node identifier must be included in each LSN. Although the exact method for doing this is left open, for the purposes of this paper, the format of distributed LSNs is as follows:

    RecNum.NodeId

where NodeId is a globally unique identifier for the node that wrote the log record and RecNum is monotonically increasing and unique within that node. For the purposes of this paper, the node identifier and record number of each LSN can be referred to individually as LSN[NodeId] and LSN[RecNum].

Modification of the CLR Record. In this distributed incarnation of the ARIES algorithm, extensive changes are made to the CLR record, both in terms of the information it contains and the way in which it is used. Changes are:

1. The UndoneLSN field replaces the UndoNextLSN field. Whereas the UndoNextLSN records the LSN of the next operation to be undone, the UndoneLSN records the LSN of the operation that was undone.
2. The PrevLSN field is no longer required for the CLR record.
3. CLR records are now used to record undo operations during normal processing only; the newly defined SCR records (see below) are used to record undo operations during crash recovery.

The rationale behind these modifications can be understood by observing the differences in how D-ARIES and the original ARIES algorithm perform undo operations and how this affects the information required by D-ARIES.

In the original ARIES algorithm, operations are undone one at a time in the reverse order to which they were performed by transactions. However, in order to increase concurrency, this algorithm can perform multiple undo operations concurrently, where updates to individual pages are undone independently of each other. The result of this is that although operations are undone in reverse order on a per page basis, it is now possible for operations belonging to a single transaction to be undone in an order that does correspond to the order in which they were done.

Definition of the SCR Record. The Special Compensation Log Record (SCR) is identical in almost every respect to the modified version of the CLR record, the only differences being:

- The record type field (SCR rather than CLR),
- SCRs are written during crash recovery rather than normal processing.

During normal rollback processing, operations are undone in the reverse order to which they were performed by individual transactions. However, during crash recovery rollback, operations are undone in the reverse order to which they were performed on individual pages. Having separate log records for compensation during recovery and normal rollback allows us to exploit this fact.

PageLastLSN Pointers. The PageLastLSN pointer¹ is added to all ULR, SCR and CLR records. It records the LSN of the record that last modified an object on this page. Recording these PageLastLSN pointers provides an easy method of tracing all modifications made to a particular set of objects (stored on a single page). The PageLastLSN for each record must be a distributed LSN and include both the LSN and the node on which the record resides. This is due to the fact that the last modification to the page may have occurred on a different node.

¹ The PageLastLSN pointer does not only support page-by-page processing. It is also the crucial data structure that helps to avoid globally monotonically increasing LSNs.

3.3 Fuzzy Checkpoints

As with the original ARIES algorithm, a fuzzy checkpoint is performed in order to avoid quiescing the database while checkpoint data is written to disk. The following information is stored during the checkpoint:

- Active transaction table; and
- DirtyLSN value.

For each active transaction, the transaction table stores the following data:

    TransId    Identifier of the active transaction.
    FirstLSN   LSN of the first log record written on this node for the transaction.
    Status     Either Active or Commit.

Thus, each node must maintain the FirstLSN of each active transaction. The value of FirstLSN can be easily set when the corresponding transaction table entry is created.

Given the set of pages that were dirty at the time of the checkpoint, the DirtyLSN value points to the record that represents the oldest update to any such page that has not yet been forced to disk.

4 D-ARIES - Crash Recovery

As with the original ARIES algorithm, recovery remains split into three phases, which are Analysis, Redo and Undo. However, unlike the original ARIES algorithm, recovery takes place on a page-by-page basis, where updates are reapplied to (Redo phase) and removed from (Undo phase) pages independently from one another. The Redo phase reapplies changes to each page in the exact order that they were logged and the Undo phase undoes changes to each page in the exact reverse order that they were performed. Since the state of each page is accurately recorded (by use of the PageLSN), the consistency of the database will be maintained during such a process.

4.1 Data Structures.

The data collected by the algorithm during the Analysis phase is stored in the following data structures:

Transaction Status Table. The purpose of the transaction status (TransStatus) table is to determine the final status of all transactions that were active at some time after the last checkpoint. This information is used to determine whether changes made to the database should be kept or discarded.
The TransStatus table has the following fields:

    TransId   Identifier of the transaction.
    Status    Status of the transaction, which determines whether or not it must be rolled back. Possible states are Active, End and Commit.

Once the Analysis algorithm has completed scanning all required records, it will send the TransStatus table to all 'crash nodes' in the system. Upon receiving a TransStatus table from another node (Ni), the crash node will update its TransStatus table using the following rule: The crash node will delete an entry for transaction Tj from the TransStatus table if: 1) the TransStatus table received from node Ni contains an entry for Tj; and 2) the status of that entry is either Commit or End.

After having received the TransStatus table from all other nodes and updating the local table, any transaction with the status 'Active' is declared a 'loser transaction', whilst all other transactions are declared 'winner transactions'.

Local Page Link List. The purpose of the Local Page Link (LLink) list is to provide a linked list of records for each page modified by a node. This linked list is used in the Redo phase to navigate forwards through the log reapplying updates made to data objects on that page. For each page that has CLR, SCR or ULR records, each node will create an LLink list, which is an ordered list of all LSNs that record changes to that page.

The data contained within the LLink list is sufficient to allow forward navigation through a set of records provided all records for the page are stored on the same node. However, in order to allow forward navigation when changes to a page are logged on multiple nodes, further data must be captured. This data is stored in the Distributed Page Link List (see below).

Distributed Page Link List. The purpose of the Distributed Page Link (DLink) list is to augment the LLink list in such a way that a linked list of records for each page is possible even when a page has log records stored on multiple nodes. For each CLR, SCR or ULR record encountered that has a PageLastLSN pointer that points to another node, a DLink list entry will be created. Each DLink list entry has the following fields:

    PageId        Identifier of the page.
    LSN           LSN of the record.
    PageLastLSN   PageLastLSN pointer of the record.
    NodeId        Identifier of the node that the PageLastLSN pointer points to.

At the end of the Analysis phase, each node will send the relevant portion² of the DLink list to the other nodes. The data contained in the DLink list will be used to augment the LLink list for all pages. The rules for inserting data from a DLink list into an LLink list are:

1. Locate the point in the LLink list where LSN = DLink.PageLastLSN.
2. Insert the corresponding LSN from the DLink list directly after this point.

² Each node will be sent all entries where their node identifier is equal to NodeId.

Example 1.

    DLink List (from Node 2):
        LSN     PageLastLSN   NodeId
        10.2    12.1          1

    LLink List (Node 1):  5, 7, 12, 16

First, locate the entry where LSN = 12.1 (PageLastLSN from DLink list), then insert 10.2 (LSN from DLink list) directly after this point.

    Augmented LLink List (Node 1):  5, 7, 12, 10.2, 16

Page Start List. The purpose of the Page Start List is to determine, for each page, from which node to initiate the Redo phase of the recovery process. The Page Start List has the following fields:

    PageId   Identifier of the page.

During the forward scan of the log, the first time the algorithm encounters a log record for a page Pj, it takes the following action:

1. Create a Page Start List entry for Pj.

At the end of the Analysis phase, after having augmented the LLink list, revisit each entry of the local Page Start List and apply the following rule:

2. Remove the entry if the first entry for the corresponding page in the augmented LLink list does not refer to the local node.

After step 1 has been completed on all nodes, we are guaranteed to have captured all pages that have to be visited during the Redo and Undo phases. Thus, we are able to lock those pages exclusively. All other pages can then be made available for normal processing.³ Multiple lock requests (all on behalf of the recovery algorithm) for the same page are possible, but do not cause any problems. Second, third, ... requests can simply be ignored. Step 2 then removes all entries except those that refer to pages that have been updated first by the local node. This ensures that there is exactly one Page Start List entry per page (that is of interest to the crash recovery algorithm) across all nodes. This entry refers to the node which will be in charge of initiating the Redo pass for that particular page.

³ Note: by this time, neither communication between nodes nor access to persistent data other than the local log has been required.

Page End List. The purpose of the Page End List is to determine, for each page, where the Undo phase of the recovery algorithm should stop processing. However, this is only an optional feature, which will be omitted due to space constraints. In the absence of the Page End List, the Undo phase terminates as soon as the ScanLSN record (see below) has been reached on any node.

Undone List. The purpose of the Undone List is to store a list of all operations that have been previously undone. The Undone List has the following fields:

    PageId      Identifier of the page.
    UndoneLSN   LSN of the record that has been undone.

During the scan of the log, whenever the algorithm encounters a CLR record, it adds an entry to the Undone List.

4.2 Analysis Phase.

During the Analysis phase, the algorithm collects all data that is required to restore the database to the most recent consistent state. This involves performing a forward scan through the log, collecting the data required for the Redo and Undo phases of recovery. The Analysis phase of the recovery process comprises three steps: Initialisation, Data Collection and Completion.

Step 1: Initialisation. The initialisation of the Analysis phase involves reading the most recent checkpoint in order to construct an initial TransStatus table and determine the start point (ScanLSN) for the forward scan of the log.

TransStatus Table. For each transaction stored in the Active Transaction table, a corresponding entry is created in the TransStatus table.

Start Point. The start point of the scan (ScanLSN) is the point in the log where the node will start scanning and is computed as follows: The lowest LSN of either: 1) DirtyLSN; or 2) the lowest FirstLSN of any transaction in the TransStatus table whose status is 'Active'.

Step 2: Data Collection. During the forward scan of its log, each node will collect data to be stored in the data structures discussed in Section 4.1. The type of record encountered during the log scan determines the data that is collected and into which data structure it is stored. The records from which the Analysis phase collects data are Commit Log Record, End Log Record, ULR, CLR, and SCR.

Commit Log Record. Each time a Commit Log Record is encountered, the recovery manager inserts an entry into the TransStatus table for the transaction with status set to Commit. Any existing entries for this transaction are replaced.

End Log Record. Each time an End Log Record is encountered, the recovery manager inserts an entry into the TransStatus table for the transaction with status set to End. Any existing entries for this transaction are replaced.

Update Log Record (ULR). Each time a ULR record is encountered, the following data is captured:

- If no entry exists in the TransStatus table for this transaction, then an entry is created for this transaction with status set to Active.
- Add an entry to the LLink list.
- If the PageLastLSN pointer points to a different node, then add an entry to the DLink list.
- Create a Page Start List entry as required.

Compensation Log Record (CLR). Each time a CLR record is encountered, in addition to capturing all data described for a ULR record, an entry will be added to the Undone List.

Special Compensation Log Record (SCR). The same data is captured for an SCR record as that captured for a ULR record.

Step 3: Completion. Once the node has completed the forward scan of the log, it performs the following actions:

1. Acquire an exclusive lock, on behalf of the recovery algorithm, on all pages identified in the Page Start List.
2. Send the following data: TransStatus table (to all crash nodes) and DLink list (to all other nodes).
3. Upon receiving this data from other nodes, the algorithm performs the following tasks: 1) update the TransStatus table; 2) augment the LLink lists.
4. Once all LLink lists have been augmented, update the Page Start List.
5. Once a crash node has received the TransStatus table from all other nodes, it can send a list of 'loser transactions' to all other nodes.

Notes:

- Once all nodes in the system have acquired the required locks (see Step 1), the database can commence normal processing. Only those pages that are locked for recovery will remain unavailable.
- Once a node has received the DLink lists from all other nodes (and augmented its LLink lists), it can enter the Redo phase.
- Once a node has received a list of loser transactions from all crash nodes, it can potentially enter the Undo phase.

4.3 Redo Phase.

The Redo phase is responsible for returning each page in the database to the state it was in immediately before the crash. In order for a node to enter the Redo phase, it must have augmented its LLink lists and updated the Page Start List. For each page that the node has in its Page Start List, the Redo algorithm will spawn a thread that 'repeats history' for that page. Given a page Pj, history is repeated by performing the following tasks:

1. Start by considering the oldest log record for page Pj that was written after PageLSN. This requires reading the page into main memory.
2. Using the augmented LLink lists for page Pj, move forward through the log until no more records for this page exist.
3. Each time a re-doable record is encountered, reapply the described changes. Re-doable records are: SCR, CLR and ULR records.

Once the thread has processed the last record for this page, the recovery algorithm may enter the Undo phase for this page. The recovery algorithm may enter the Undo phase for different pages at different times; for example, page P1 might enter the Undo phase while page P2 is still in the Redo phase.

Once the recovery algorithm has completed the Redo phase for all pages, an End Log Record can be written for all transactions whose status is Commit in the TransStatus table. Since a transaction can have a Commit entry on only a single node, it is guaranteed that exactly one End Log Record will be written for such transactions. For expediency, this can be deferred until after the Undo phase is complete if so desired.

4.4 Undo Phase.

The Undo phase is responsible for undoing the effects of all updates that were performed by so-called 'loser transactions'. In order for a node to enter the Undo phase, it must have received a list of loser transactions from all crash nodes. The thread that was spawned for the Redo phase for page Pj will now begin working backwards through the log, undoing all updates to the page that were made by loser transactions, by performing the following tasks:

1. Work backwards through the log using the PageLastLSN pointers, processing each log record until all updates by loser transactions have been undone.
2. Each time an SCR or ULR record is encountered, take the following actions:
   - Special Compensation Log Record (SCR).
     (a) Jump to the record immediately preceding the record pointed to by the UndoneLSN field. The UndoneLSN field indicates that during a previous invocation of the recovery algorithm, the updates recorded by the record at UndoneLSN have already been undone.
   - Update Log Record (ULR). If the update was not written by a loser transaction or has previously been undone (the Undone List is used to determine this), then no action is taken. Otherwise, the following actions are taken to undo the update:
     (a) Write an SCR record that describes the undo action to be performed with the UndoneLSN field set equal to the LSN of the ULR record whose updates have been undone.
     (b) Execute the undo action described in the SCR record written.

Once the thread has completed processing all records back to ScanLSN, the page can be unlocked, on behalf of the recovery algorithm, and made available for normal processing again. The advantage of allowing each page to be unlocked individually is that the database can return to normal processing as quickly as possible.

Once the recovery algorithm has completed the Undo phase for all pages, an End Log Record can be written for all transactions whose status is Active in the TransStatus table (these are the loser transactions). The node that is responsible for the transaction will write this record.

4.5 Crashes During Crash Recovery.

By preserving ARIES' paradigm of repeating history, it can be guaranteed that multiple crashes during crash recovery will not affect the outcome of the recovery process. The Redo phase ensures that each update lost during the crash is applied exactly once by using the PageLSN value to determine which logged updates have already been applied to the page. Since all compensation operations are logged during the Undo phase, the Redo phase and the nature of the Undo phase ensure that compensation operations are also performed exactly once. The UndoneLSN plays a similar role in D-ARIES as the UndoNextLSN does in ARIES.

4.6 Enhancements

It is possible to further enhance the fault tolerance and speed of recovery of this algorithm in the face of a node that has crashed and is slow to become available. This is achieved by making both the database partition and recovery log of each node in the system accessible to each other node in the system. During recovery, it is then possible for a non-crash node to act as a 'proxy' for the crash node. That is, the non-crash node can perform recovery on behalf of the crashed node. Whilst acting as a proxy, a non-crash node would read the log record of the crash node and perform updates on the crash node's database partition, just as the crash node would do if it were available. Clearly such an enhancement offers real benefits in terms of recovery speed, particularly when a crash node becomes unavailable for an extended period of time. By the time the crash node becomes available, its database partition and log will be in a consistent state and the node will be able to commence processing immediately.
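
As a rough illustration of the Analysis-phase bookkeeping from Section 4.1, the following Python sketch applies the two DLink insertion rules to one page's LLink list and applies the TransStatus deletion rule on a crash node. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `parse_lsn`, `DLinkEntry`, `augment_llink`, `merge_trans_status` and `loser_transactions` are hypothetical, and representing a distributed LSN as a `(RecNum, NodeId)` tuple is only one possible encoding of the RecNum.NodeId format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed encoding of a distributed LSN "RecNum.NodeId" as (RecNum, NodeId).
LSN = Tuple[int, int]


def parse_lsn(text: str) -> LSN:
    """Split a 'RecNum.NodeId' string into its two integer parts."""
    rec_num, node_id = text.split(".")
    return int(rec_num), int(node_id)


@dataclass(frozen=True)
class DLinkEntry:
    """One DLink list entry with the four fields listed in Section 4.1."""
    page_id: str
    lsn: LSN            # LSN of the record itself
    page_last_lsn: LSN  # PageLastLSN pointer of the record
    node_id: int        # node that the PageLastLSN pointer points to


def augment_llink(llink: List[LSN], dlink: List[DLinkEntry]) -> List[LSN]:
    """Insert remote LSNs into one page's LLink list.

    Rule 1: locate the entry whose LSN equals the DLink entry's PageLastLSN.
    Rule 2: insert the DLink entry's LSN directly after that entry.
    """
    augmented = list(llink)
    for entry in dlink:
        position = augmented.index(entry.page_last_lsn)
        augmented.insert(position + 1, entry.lsn)
    return augmented


def merge_trans_status(local: Dict[str, str],
                       received: Dict[str, str]) -> Dict[str, str]:
    """Drop a local entry if the received table marks it Commit or End."""
    return {tid: status for tid, status in local.items()
            if received.get(tid) not in ("Commit", "End")}


def loser_transactions(table: Dict[str, str]) -> List[str]:
    """After all merges, transactions still marked Active are the losers."""
    return sorted(tid for tid, status in table.items() if status == "Active")


if __name__ == "__main__":
    # Example 1: Node 1 holds records 5, 7, 12 and 16 for the page and
    # Node 2 reports record 10.2, whose PageLastLSN is 12.1.
    llink_node1 = [parse_lsn(s) for s in ("5.1", "7.1", "12.1", "16.1")]
    from_node2 = [DLinkEntry("P", parse_lsn("10.2"), parse_lsn("12.1"), 1)]
    print(augment_llink(llink_node1, from_node2))

    local = {"T1": "Active", "T2": "Active"}
    print(loser_transactions(merge_trans_status(local, {"T1": "Commit"})))
```

The sketch assumes that every DLink entry a node receives points at a record already present in its local LLink list, which is what footnote 2 implies; `list.index` raises `ValueError` if that assumption is violated.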

5 D-ARIES - Rollback During Normal Processing

Having defined the algorithm for rollback of transactions during crash recovery, it is now necessary to do the same for normal processing. There are two main classes of schedules that must be considered when defining a rollback algorithm; these are schedules with cascading aborts and schedules without.

The case where cascading aborts do not exist is trivial: rolling back a transaction simply involves following the PrevLSN pointers for the transaction backwards, undoing each operation as it is encountered. Since cascading aborts do not exist in these schedules, no consideration need be given to conflicts between the aborting transaction and any other transactions.

The case where cascading aborts do exist is a great deal more complex, since rolling back a transaction may necessitate the rollback of one or more other transactions. Each time an operation is undone, it is necessary to consider which transactions, if any, must be rolled back in order to avoid database inconsistencies.

In the original ARIES algorithm, rollback of transaction Ti involves undoing each operation in reverse order by following the PrevLSN pointers from one ULR record to the next. Whenever an undo operation for transaction Ti conflicts with an operation in some other transaction Tj, a cascading abort of transaction Tj must be initiated. Transaction Ti must then suspend rollback and wait for transaction Tj to roll back beyond the conflicting operation before it can recommence rollback. Clearly this is not the most efficient method, since the rollback of the entire transaction is suspended due to a single operation being in conflict.

A more desirable method is to suspend rollback only of those operations that are in conflict and to continue rollback of all other operations. It is also desirable to trigger the cascading abort of all transactions in conflict as early as possible. By taking advantage of multi-threading, it is possible to roll back a transaction on a page-by-page basis. This allows a transaction in rollback to simultaneously:

- Trigger multiple cascading aborts,
- Suspend rollback of updates to pages whilst waiting for other transactions to roll back, and
- Continue rolling back updates that do not have any conflicts.

Partial rollback of transactions is achieved by establishing savepoints [10] during processing, then at some later point requesting the rollback of the transaction to the most recent savepoint. This can be contrasted with total rollback, which removes all updates performed by the transaction.

5.1 Algorithms.

Rollback of a transaction Ti is achieved by the use of a single 'Master Thread' that is responsible for coordinating the rollback process and multiple 'Slave Threads' that are responsible for the rollback of updates made to individual pages.

Master Thread. The master thread is responsible for coordinating the rollback of a transaction by performing the following actions:

- Triggering the cascading abort of transactions as required.
- Undoing all update operations that are not in conflict with update operations from other transactions.
- Spawning a new slave thread whenever a detected conflict requires the undo of updates to a page to be delayed while other transaction(s) roll back.

Algorithm 1:

Parameters.
    Ti        Identifier of transaction to roll back.
    SaveLSN   The savepoint for partial rollback of the transaction.
              If SaveLSN is Null, then total rollback will occur.
Variables.
    CurrLSN   The LSN of the record currently being processed.
    PS        Set of page identifiers for which slave threads are spawned.
Procedure.
     1. CurrLSN = LSN of last update performed by Ti.
     2. WHILE (Record.LSN ≠ SaveLSN) DO
     3.     Move to Record at CurrLSN
     4.     IF (Current operation is not in conflict with any other) THEN
     5.         IF (Record.PageId ∉ PS) THEN
     6.             Undo current operation
     7.         ENDIF
     8.     ELSE
     9.         IF (Record.PageId ∉ PS) THEN
    10.             Add Record.PageId to PS
    11.             Spawn a new slave thread for page Record.PageId
                    Parameters: (Record.PageId, SaveLSN, NodeId⁴)
    12.         ENDIF
    13.         Trigger the cascading abort of any transaction(s) that have
                conflicting operations (and are not already aborting).
    14.     ENDIF
    15.     CurrLSN = Record.PrevLSN
    16. ENDDO

The algorithm terminates once the master thread has reached the savepoint and has received a Done message from all slave threads spawned.

Slave Thread. The slave thread is responsible for undoing all updates made by the transaction to a single page. The slave thread must not undo any operation until any conflicting operation(s) have been rolled back. Once the slave thread has completed rolling back all changes to the page, it sends a Done message back to the master thread.

⁴ If the SaveLSN = Null, then NodeId is the node on which the master thread was spa
wned,otherwiseNodeId=SaveLSN[NodeId].ThisistoensurethatallslavethreadssendtheirDone.messagestothesamenode. 29Algorithm2:Parameters.PiIdentifierofpagetorollback.SaveLSNThesavepointforpartialrollbackofthetransaction.IfSaveLSNisNull,thentotalrollbackwilloccur.NodeIdIdentifierofthenodethatshouldreceivetheDone.message.Variables.CurrLSNTheLSNoftherecordcurrentlybeingprocessed.Procedure.1.WHILE(Record.LSN6=SaveLSN)DO2.MovetorecordatCurrLSN3.IF(Currentoperationisinconflictwithanyother)THEN4.Waituntilconflictingoperations(s)havebeenrolledback.5.ENDIF6.Undocurrentoperation7.CurrLSN=Record.PrevLSN8.ENDDO9.SendDone.Messagetothemasterthread.Optimisation.Inrollback,itispossibleforboththemasterthreadandtheslavethreadstoreducethefrequencywithwhichtheycheckforcon\rictsbetweenthecurrentoperationandoperationsbelongingtoothertransactions.GivenaULRrecordwrittenforpagePjbytransactionTi,itisonlynecessarytocheckforcon\rictsifthelastULRrecordwrittenforpagePjwasnotwrittenbytransactionTi.ThiscanbeachievedbycheckingifthePageLastLSNofthepreviouslogrecordwrittenbytransactionTipointstothecurrentlogrecordwrittenbytransactionTi.Ifthisisthecase,nocon\rictcheckingneedbeperformed.6ConclusionsThealgorithmpresentedinthispapero ersanadaptationofARIESthatsolvestheproblemofrecoveryinSDdatabasesystems.ThedesirablepropertiesofARIES,suchas negranularitylocking5,operationlockingandpartialrollbacks,havebeenpreservedwithoutimposingsigni cantoverheadonthesystem.TheoriginalARIESalgorithmreliesontheLSNsbeinggloballymonotoni-callyincreasinginordertoensurerecoverabilityofthedatabaseafteracrash.Assuch,previousadaptationshavesoughttoensurethatLSNsaremonotonicallyincreasingthroughouttheSDsystemratherthanresolvethisreliance.Thishasresultedinundesirablecompromisesbeingmadeincludingtherequirementforaperfectlysynchronisedglobalclockandtheglobalmergeofthelogs.Thisalgo-rithmremovesARIES'dependencyongloballymonotonicallyincreasingLSNs,whichinturnrendersagloballogmergeorclockofanykindredundant.5SameaswithARIES,D-ARIESsupports 
fine-granularity locking assuming that recoverability is ensured by the transaction manager.

Further enhancements were made to decrease the time taken for the database to recover from a crash and reduce the time that the database remains unavailable for normal processing. Such enhancements include allowing normal processing to commence after completion of the Analysis phase and the use of multi-threading to increase the concurrency of recovery operations. The concept of multi-threaded recovery was further adapted to provide a mechanism of increased concurrency of transaction rollback during normal processing.

The proposed algorithm also offers benefits to centralised database systems. Multi-threaded recovery allows for a high degree of parallelism. Also, support for recovery from isolated hardware failures (e.g. a single-page restore after a torn write) is provided. Moreover, the proposed algorithm readily permits the exploitation of a common and very effective optimisation, namely logging of disk writes.

Future work identified in this area includes the adaptation of this algorithm to the multi-level transaction model and the use of multi-threading to further improve concurrency of both recovery and normal processing.

References

1. Mohan, C., Haderle, D.J., Lindsay, B.G., Pirahesh, H., Schwarz, P.M.: ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems (TODS) 17 (1992) 94–162
2. Mohan, C.: ARIES family of locking and recovery algorithms (2004) http://www.almaden.ibm.com/u/mohan/ARIESImpact.html
3. Härder, T., Reuter, A.: Principles of transaction-oriented database recovery. ACM Computing Surveys (CSUR) 15 (1983) 287–317
4. Mohan, C., Narang, I.: Database recovery in shared disks and client-server architectures. In: Proceedings of the 12th International Conference on Distributed Computing Systems, IEEE Computer Society Press (1992) 310–317
5. Mohan, C., Narang, I.: Recovery and coherency-control protocols for fast inter-system page transfer and
fine-granularity locking in a shared disks transaction environment. In: Proceedings of the 17th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc. (1991) 193–207
6. Rahm, E.: Recovery concepts for data sharing systems. In: Proceedings of 21st International Symposium on Fault-Tolerant Computing. (1991) 368–377
7. Lomet, D.B.: Recovery for shared disk systems using multiple redo logs. Technical Report CLR 90/4, Digital Equipment Corp., Cambridge Research Lab, Cambridge, MA (1990)
8. Bozas, G., Kober, S.: Logging and crash recovery in shared-disk parallel database systems. Technical Report TUM-I9812, SFB-Bericht Nr. 342/06/98 A, Dept of Computer Science, Munich University of Technology, Germany (1998)
9. Mohan, C.: Repeating history beyond ARIES. In Atkinson, M.P., Orlowska, M.E., Valduriez, P., Zdonik, S.B., Brodie, M.L., eds.: VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, Morgan Kaufmann (1999) 1–17
10. Gray, J., McJones, P., Blasgen, M., Lindsay, B., Lorie, R., Price, T., Putzolu, F., Traiger, I.: The recovery manager of the System R database manager. ACM Computing Surveys (CSUR) 13 (1981) 223–242
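
As an illustration of the master/slave rollback scheme of Section 5.1 (Algorithms 1 and 2), the following is a minimal single-process Python sketch. It is not the authors' implementation. All names are assumptions made for illustration: `Record`, `master`, `slave`, `demo`, the `undo` and `trigger_abort` callbacks, and the `blocked` map (LSN to `threading.Event`), which stands in for conflict detection. The sketch further assumes that a slave thread starts at the conflicting record and walks the transaction's PrevLSN chain, handling only records for its own page, and it models Done messages with an in-process queue rather than messages between nodes.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass(frozen=True)
class Record:
    lsn: int
    prev_lsn: Optional[int]  # previous record written by the same transaction
    page_id: str


def in_conflict(rec: Record, blocked: Dict[int, threading.Event]) -> bool:
    # Assumption: an unset Event means the conflicting operation of another
    # transaction has not been rolled back yet.
    ev = blocked.get(rec.lsn)
    return ev is not None and not ev.is_set()


def slave(log, start_lsn, save_lsn, page_id, blocked, undo, done) -> None:
    """Undo the transaction's updates to one page (cf. Algorithm 2)."""
    curr = start_lsn
    while curr is not None and curr != save_lsn:
        rec = log[curr]
        if rec.page_id == page_id:
            if rec.lsn in blocked:
                blocked[rec.lsn].wait()  # wait until the conflict is rolled back
            undo(rec)
        curr = rec.prev_lsn
    done.put(page_id)  # Done message to the master thread


def master(log: Dict[int, Record], last_lsn: int, save_lsn: Optional[int],
           blocked: Dict[int, threading.Event],
           undo: Callable[[Record], None],
           trigger_abort: Callable[[Record], None]) -> None:
    """Coordinate rollback of one transaction (cf. Algorithm 1)."""
    ps: Set[str] = set()  # pages handed over to slave threads
    done: "queue.Queue[str]" = queue.Queue()
    curr: Optional[int] = last_lsn
    while curr is not None and curr != save_lsn:
        rec = log[curr]
        if not in_conflict(rec, blocked):
            if rec.page_id not in ps:
                undo(rec)
        else:
            if rec.page_id not in ps:
                ps.add(rec.page_id)
                threading.Thread(
                    target=slave,
                    args=(log, rec.lsn, save_lsn, rec.page_id, blocked, undo, done),
                ).start()
            trigger_abort(rec)  # cascading abort of the conflicting transaction(s)
        curr = rec.prev_lsn
    for _ in ps:  # terminate only after a Done message from every slave
        done.get()


def demo() -> List[int]:
    # One transaction with four updates; LSN 3 (page A) is in conflict.
    log = {1: Record(1, None, "A"), 2: Record(2, 1, "B"),
           3: Record(3, 2, "A"), 4: Record(4, 3, "B")}
    blocked = {3: threading.Event()}
    undone: List[int] = []
    lock = threading.Lock()

    def undo(rec: Record) -> None:
        with lock:
            undone.append(rec.lsn)

    def trigger_abort(rec: Record) -> None:
        # Stand-in for the other transaction finishing its rollback a little later.
        threading.Timer(0.05, blocked[rec.lsn].set).start()

    master(log, 4, None, blocked, undo, trigger_abort)  # total rollback
    return undone
```

Calling `demo()` returns the LSNs in the order they were undone. The interleaving between pages may vary from run to run, but within each page the updates are undone in reverse LSN order, and the master continues with page B while the slave for page A waits on the conflict.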