498 504 499 500 503 502 508 507 497 501 505 506 REWIND:Re coveryW riteAheadSystemforI nMemoryN onVolatileD ataStructuresAndreasChatzistergiouUniversityofEdinburgh,UKa.chatzistergiou@sms.ed.ac.ukMarceloCintraIntel,Germanymarcelo.cintra@intel.comStratisD.ViglasUniversityofEdinburgh,UKsviglas@inf.ed.ac.ukABSTRACTRecentnon-volatilememory(NVM)technologies,suchasPCM,STT-MRAMandReRAM,canactasbothmainmemoryandstorage.ThishasledtoresearchintoNVMpro-grammingmodels,wherepersistentdatastructuresremaininmemoryandareaccesseddirectlythroughCPUloadsandstores.Existingmechanismsfortransactionalupdatesarenotappropriateinsuchasettingastheyareoptimizedforblock-basedstorage.WepresentREWIND,auser-modelibraryapproachtomanagingtransactionalupdatesdirectlyfromusercodewritteninanimperativegeneral-purposelanguage.REWINDreliesonacustompersistentin-memorydatastructureforthelogthatsupportsrecover-ableoperationsonitself.Theschemealsoemploysacombi-nationofnon-temporalupdates,persistentmemoryfences,andlightweightlogging.ExperimentalresultsonsynthetictransactionalworkloadsandTPC-CshowtheoverheadofREWINDcomparedtoitsnon-recoverableequivalenttobewithinafactorofonly1:5and1:39respectively.More-over,REWINDoutperformsstate-of-the-artapproachesfordatastructurerecoverabilityaswellasgeneralpurposeandNVM-awareDBMS-basedrecoveryschemesbyuptotwoordersofmagnitude.1.INTRODUCTIONNon-volatilememory(NVM)technologies,suchasPCM,STT-MRAMandReRAM,raisetheprospectofpersistentbyte-addressablerandomaccessmemorywithlargeenoughcapacitytodoubleasstorage.Byitselfthiswouldallowap-plicationstostoretheirpersistentdatainmainmemorybymountingaportionofthelesystemtoit.ThisintroducesNVMintothedatamanagementprogrammingstack,butinafarfromidealmanner.Consideratypicalmulti-tierap-plication:theprogrammerdecidesontheapplication-levelcontrolanddatastructures,andthendecidesonthestorage-levelpersistentrepresentationofthedatastructures.Spe-cializedAPIs(e.g.,embeddedSQL,JDBC,etc.)translateThisworkislicensedundertheCreativeCommonsAttributionNonCom
mercialNoDerivs3.0UnportedLicense.Toviewacopyofthislicense,visithttp://creativecommons.org/licenses/byncnd/3.0/.Obtainpermissionpriortoanyusebeyondthosecoveredbythelicense.Contactcopyrightholderbyemailinginfo@vldb.org.Articlesfromthisvolumewereinvitedtopresenttheirresultsatthe41stInternationalConferenceonVeryLargeDataBases,August31stSeptember4th2015,KohalaCoast,Hawaii.ProceedingsoftheVLDBEndowment,Vol.8,No.5Copyright2015VLDBEndowment21508097/15/01.databetweenthetworuntimesusingSQLastheintermedi-atelanguage,inacumbersomeandsometimeserror-proneprocess.Moreover,datamaybereplicatedinbothDRAMandNVM,whilethebyte-addressabilityofNVMisnotlever-aged.Clearly,thisissuboptimal.Analternativeistoportanin-memorydatabasesystemtopersistentmemory.Thatwouldmakeuseofbyteaddressability,butitwouldstillre-quiredatabereplicatedandrepresentedintwodatamodels.Wearguethatweneedasolutionthatisnotintrusivetotheprogrammerandseamlesslyintegratestheapplication'sdatastructureswiththeirpersistentrepresentation.Wetargetuse-caseswherethedataownerhasfullcontrolofthedata,foreseeslittlechangetotheschema,andwouldliketotightlyco-designtheschemawiththeoperationsforperformance.Theseuse-casescapturealargegroupofcon-temporaryapplications.Indeed,persistenceAPIsarebeingusedacrossavarietyofoperatingsystems[1,18].Theseplatformssupporteitherapersistentstoragemanager[22]oranembeddeddatabase[16].Ourstanceisthatinsuchsce-narioswearebetterointegratingthestoragemanagerandtheapplicationmemoryspaces.Bydoingsoweenabletheuseofarbitrarypersistentdatastructuresinalightweightsoftwarestackthatsignicantlyreducesthecostofman-agingdata[15,27].ToaddresstheseissuesweintroduceREWIND:auser-modelibrarythatenablestransactionalrecoverabilityofanarbitrarysetofpersistentupdatestomainmemorydata.TheruntimesystemofREWINDtrans-parentlymanagesalogofprogramupdatestothecriticaldataandhandlesbothcommitandrecoveryoftransactions.Bytrackingtheoperationsofthetransactionsthatcommittheruntimecanidentifythepointoffailureandcanresumeoperationrelyingontheco
nsistencyofcriticaldata.Ourworkstemsfromanalternativepersistentmemoryaccessmodelthathasgainedinterestrecently:directlyprogrammingwithpersistentmemorythroughmechanismssuchaspersistentregions[14]orpersistentheaps[5,31].Persistentdataisaccesseddirectlybytheprocessorwithaload-storeinterfaceandwith(mostly)automaticpersistencewithoutinteractingwiththeI/Osoftwarestack.Thismodelishighlydisruptiveasitenablesanewclassofdatamanage-mentsystemsinwhichbothuserdataanddatabasemeta-dataaremanagedentirelyinmemoryasonewouldmanagevolatiledata[29].Thus,somefundamentalassumptionsonprogramminginterfacesandsoftwarearchitectureneedtoberevisedaspersistentdataneedstobedirectlymanagedbytheprogrammer'scodeusingimperativelanguages.Theprogrammer-visibleAPIofREWINDoerstwomainfunctionalities:onetodemarcatethebeginningandendof transactionsandonetologupdatestocriticaldata.Ourintentionistocompletelydoawaywiththesecondfunction-alitybyrelyingoncompilersupportsothattheprogrammeronlyneedstoidentifythecriticaldata.Thekeychallengeofusingpersistentin-memorydatastructuresisguaranteeingconsistentdataupdatesinthepresenceofsystemfailures(e.g.,powerfailuresandsystemcrashes).Thisdiersfromtransactionalmemorywherethemainintentionistodealop-timisticallyonlywithatomicityandisolation.Italsodiersfromnon-recoverabledevicefailureasthisisanorthogonalissueandishardware-related;weonlytargetfailuresduetosystemandsoftwaremalfunctions.REWINDprovidesfullatomicityanddurabilityforpersistentmemorythroughwrite-aheadlogging.Whiletheprinciplesofthemechanismarestillapplicabletopersistentmemory,itsimplementationandtradeosmustberevisitedgiventhesignicantdier-encesinaccesslatenciesandsynchronizationcontrol(see,e.g.,[10,12]).Thus,REWINDovercomesanumberofchal-lengesthatariseinthisnewcontext.First,theprocessingmodelisthatpersistentdataisonbyte-addressableNVM,accessibledirectlyfromusercodethroughCPUloadsandstores.1Traditionally,dataupdatesarerstperformedinvolatilememory.Itisthuspossibletodelaymakinglogentriespersistentuntilthetransactioncommi
tsorthedataupdatesarepurgedfrommainmemory.InREWIND,updatesaredonedirectlyonNVMdata:thelogentriesmustbemadepersistentimmediately,andaheadofthedataupdates.Weachievethisthroughenhancedver-sionsofmemoryfences(i.e.,barriersthatenforceorderingandpersistencetoprecedinginstructions),cacheline ushesandnon-temporalstores(i.e.,directtoNVMstoresthatbypassthecache)withpersistenceguarantees.REWINDusesphysicalloggingasittsbetterwithim-perativelanguagesandallowseasiercompilersupport.How-ever,itmightresultinmorelogrecordsthanlogical/phys-iologicalloggingwhenmemoryblocksareshiftedinmem-ory.Then,thelogitselfmustbemanipulatedatomicallyinarecoverableway.Traditionally,thelogismaintainedinvolatilememoryandpushedtopersistentstoragethroughsystemcalls.InREWIND,thelogitselfresidesinpersis-tentmainmemoryandupdatesaremadein-place.Trans-actionalhandlingoffailureoflogupdatesisattainedwithcarefullycrafteddatastructuresandcodesequences.Fur-thermore,performanceisrelativetoabaselinewiththelowcostofindividualmemoryoperations.Thus,loggingmustbeoptimizedtoincuronlyasmallincreaseinthecostofamemoryoperation.InREWIND,weguaranteethiswithminimalistdatastructuresandcodesequences.Moreover,whilecontemporarysystemsoerrecordlevellockingfromadata-centricperspective,theyusecoarse-grainedpage-levellatchinginternally.REWINDemploysne-grainedlatch-ingatalogrecordgranularity:thisenablesmoreecientand exiblelockingmechanisms.Finally,themajorityofrecoverymanagersbasedonARIES[20]areimplementedwithinDBMSs.Thus,theyhidedatamanagementbehindsomedatamodel(e.g.,relational)andallowdatamanip-ulationthroughawell-denedquerylanguage(e.g.,SQL).REWINDisimplementedasauser-modelibrarythatcanbelinkedtoanynativeapplication,givingtheprogrammerfull 1NAND-basedbattery-backedNV-DIMMsalreadysup-portthis(seealsohttp://www.smartstoragesys.com/pdfs/ULLtraDIMM_overview.pdf).Newertechnologieswillbringmorepracticalimplementations. 
Figure1:Transactionalaccesstoapplicationdata.accesstothedatausinganarbitrarysequenceofimperativecommands.Moreover,thedesignofREWINDitselfissuchthatitcanbestraightforwardlyembeddedintothecompilersothatthedisruptiontousercodeisfurtherminimized.ContributionsandorganizationOurcontributionsandthestructureoftherestofthispaperareasfollows:WeintroduceREWIND,auser-modelibraryforlog-gingandtransactionmanagementinNVM.WeexplorethedesignspaceofREWIND(Section2)viafourcongurationsthatresultfromchoosingbe-tween:(a)twodierentlogimplementationsopti-mizedforminimizingeitherloggingoverheadorsearchspeed;and(b)forcingornotuserdatatonon-temporalstores.Wedescribetwowaystoimplementthelogcompletelyinpersistentmemory,soitisrecoverableandatomicaswell.Wethenpresenttwooptimizedlogversionsthatfurtherreducethewriteoverhead(Section3).Byleveragingtherecoverablelogweshowhowtoenablein-memorypersistentdatastructures.Wepresenthowthesenewmechanismscanbeincor-poratedintoimperativegeneral-purposelanguagesthroughtheREWINDlibraryandruntime(Section4).WeanalyzethesensitivityofREWINDtoitsparam-etersandshowhowitcanbeconguredtodeliverlow-overheadtransactionalprocessingandrecoverabil-ityofdatastructuresinNVM.WecompareREWINDtostate-of-the-artrecoverymanagersaswellastotraditionalandNVM-awareDBMS-basedtechniques.REWIND'soverheadiswithinafactorof1:5fromitsnon-recoverablecounterpart,whileitoutperformsthecompetitionbyuptotwoordersofmagnitude(Sec-tions5.1and5.2).WeuseamodiedversionofTPC-CtoshowhowREWINDenablestheco-designofalgorithmsanddatastructures.Workload-andprogram-specicoptimiza-tionsresultinaREWINDperformancewithinafactorof1:3fromitsnon-recoverableversion(Section5.3).Finally,wepresentrelatedworkinSection6andconcludeandidentifyfutureworkdirectionsinSection7.2.SYSTEMOVERVIEWREWINDisauser-moderecoveryruntimesystemthatcanbeusedbyprogrammersandcompilerstoprovideatomicrecoverabilitytoarbitrarycodethatoperatesonper-sistentdatastructuresinNVM.WeenvisiontheREWINDlibrarybeingstaticallylinkedwithexecutables,butothervariation
ssuchasdynamiclinkingorasharedlibrarycouldalsobedeveloped.Thus,REWINDcanbeusedthenasastandalonerecoverymanagerforindividualapplications,or Scheme Pros Cons TransactionalFS portability programmability scalability exibility DBMS robustness initialperf.cost scalability exibility REWIND programmability disruptivemodel initialperf.cost Table1:ProsandconsoftheoptionsofFigure1.couldbeusedasthebuildingblockofalarger,multi-userdatamanagementsystem.WeviewREWINDasafunda-mentalbuildingblocktowardsintroducingpersistenceatthesystemlevel,especiallywhendataoutlivesapplications.InFigure1weshowtheoptionsavailablefortransac-tionaldatamanagement.TheapplicationcaninterfacewithaDBMS(ApplicationC),orwithalesystemthatoerstransactionalaccesstouserdata(ApplicationB)[7,28].ThelibraryapproachofREWIND(ApplicationA)operatesdirectlyondatastoredinNVM.ThisrequiresanNVM-awarememorymanagerintheOS[5,31].Table1summa-rizesthehigh-levelprosandconsofeachoption.Theuseofalesystemleadstoportabledataformats,butsuersintermsofprogrammabilityand exibilitybyexpectingtheprogrammertomanagebothanin-memoryandaserializedon-diskversionofthedata.TheDBMSisatriedapproach,butlimits 
exibilitybyimposingadatamodelandqueryAPI,andsuersfromtheoverheadsofaclient-serverar-chitectureandahighlycomplexserver.REWINDoersincreasedprogrammabilitybyenablingin-memorypersis-tentdatastructuresandAPIsaswellasthelowoverheadofauser-modelibrary.REWINDcurrentlytargetsindivid-ualapplicationsandmayrequireadditionalfunctionalityforoperationinmulti-userenvironments.Itprovides,however,acriticalmechanismtoenablethisnewclassofdataman-agementinNVM.Othershavealsoarguedforsupportingavarietyofstoragemodelsandinfrastructurestomeetthedemandsofdierentworkloads(e.g.,[27]).ThecoreofREWINDisatransactionalrecoveryprotocolbasedonWAL(Write-AheadLogging).UnliketheARIESimplementationsofDBMSs,REWINDprovidesprogram-merswithdirectcontrolofwhatupdatesshouldbetrans-actionalthroughasimpleAPIwithaconstructtomarktransactionsandafunctioncalltothelogoperation.Log-gingcallsarecurrentlyinsertedmanuallybytheprogram-mer,butweexpectthemtobeinsertedtransparentlybythecompiler,similarlytohowSoftwareTransactionalMemory(STM)compilerswork[11,32].Listing1isafunctiontore-moveanelementfromadoubly-linkedlistwiththestateofthelistupdatedthroughCPUwrites(lines3to6).Tomaketheoperationrecoverable,weenclosethecriticalupdatesinapersistentatomicblock.1voidremove(node*n){2persistentatomic{3if(n==tail)tail=n-prv;4if(n==head)head=n-nxt;5if(n-prv)n-prv-nxt=n-nxt;6if(n-nxt)n-nxt-prv=n-prv;7delete(n);}}//endofatomicblockListing1:Removalfromadoubly-linkedlist.Toprovideprogrammerswiththefamiliaraccessinterfaceofcurrentmainmemorydata,wearerestrictedtophysi-calloggingandin-placeupdates.Thisdiersfromtradi-tionaldisk-basedsystemswherewearefreetouselogicallogsordelayforcinguserupdatestoimproveperformance.AsREWINDisbasedonwrite-aheadlogging,allupdatestopersistentdatamustbeprecededbyacalltothelogfunction.ThisseparatesdatafromthelogasthedataarehandledbytheuserprogramwhilethelogishandledbyREWIND.Theresultingcode,asexpandedbytheprogram-merorcompiler,isshowninListing2.Theruntime'strans-actionmanageriscalledatthestartoftheblock(line2)tocrea
teanewtransactionidentier:transactionmanagementistransparenttotheprogrammerandcompiler.Loggingcalls(e.g.,line4)precedeeverycriticalupdate(e.g.,line5).Loggingcallparametersincludethetransactionidentier,theaddressofthememorylocationbeingupdated,andthepreviousandnewvalues2.Attheendoftheexpandedcodethecommitcallmarkstheendofthepersistentatomicblock.Thede-allocationofthememoryoccupiedbytheremovednodemustbeplacedaftertransactioncommit(line16):withoutadditionalOSsupport,de-allocatingmemoryisanoperationthatcannotbeundonebyREWIND.1voidremove(node*n){2inttID=tm-getNextID();3if(n==tail){4tm-log(tID,&tail,tail,n-prv);5tail=n-prv;}6if(n==head){7tm-log(tID,&head,head,n-nxt);8head=n-nxt;}9if(n-prv){10tm-log(tID,&n-prv-nxt,n-prv-nxt,n-nxt);11n-prv-nxt=n-nxt;}12if(n-nxt){13tm-log(tID,&n-nxt-prv,n-nxt-prv,n-prv);14n-nxt-prv=n-prv;}15tm-commit(tID);16delete(n);}Listing2:ExpandedcodeforListing1.WeproposeandevaluatefourcongurationsofloggingandtransactionmanagementinREWINDthroughdeciding:(a)whetherornottoforceuserupdatestoNVMastheyhappen;and(b)thenumberoflogginglayerstoemploy(oneortwo).Eachcongurationcomeswithitsowntradeos.Forcing/notforcinguserupdatesAforcepolicyslowsdownloggingduetotheextratimeneededtoguaranteethepersistenceoftheupdate.However,itonlyrequiresatwo-phaserecovery(analysisandundo)insteadofthethree-phaserecovery(analysis,redo,andundo)oftheno-forcepolicy.Thus,thetradeoisfasterrecoveryoveraslightslowdownduringlogging.Theforcepolicyalsoal-lows(withoutdictating,however)analternativelogclear-ingmethodinsteadofcheckpoints.Eachtransactioncanclearitsownrecordsimmediatelyaftercommit,resultinginslowercommitsbuteliminatingcheckpoints.Logclearingbecomesmoreexpensiveasthenumberofconcurrenttrans-actionsattemptingtocommitgrowsthroughincreasedlock-ingcongestion:clearingrequirescoarser-grainedlocksthanaddingrecordsasitinvalidatestheiteratorsofconcurrentthreads.However,itutilizesmemorybetter:memoryisde-allocatedrightaftercommitandnotafterthecheckpoint.Italsominimizesthesizeofthelog,whichimpro
vesthetimetondrecordsofatransaction.Inourimplementationwecombinetheforcepolicywithlogclearingatcommit-time. 2Byaddressofalocationwemeanapersistentvirtualad-dress,e.g.,thatoeredby[31],arelativeaddress,orsomeotherformofpersistentreferencetothememorylocation. NumberoflogginglayersThelogginginfrastructureandtherecoverymanageroertwocongurationsofthelogdatastructure.Initssimplestformthelogisaspeciallycraftedrecoverablepersistentdoubly-linkedlist.Alterna-tively,thelogdatastructureisorganizedintwolayers:anauxiliarydatastructure(anAVLtree)atthetoplayer,overtherecoverablepersistentdoubly-linkedlist.Thus,there-coverymanagercatersfortherecoveryofuptothreeel-ements:programmertransaction,anoptionalcomplexlogstructure,andafundamentalandsimpledatastructure.Re-coverystartsbyrecoveringthesimpledatastructuretoaconsistentstate,whosecontentsarethenusedtorecovertheauxiliarylogstructure,ifthereisone.Thecontentsofbothprimaryandauxiliarylogstructuresareusedtore-covertheupdatesoftheprogrammertransaction.Themaintradeobetweenthesetwovariantsisthatloggingisfasterintheone-layercasebutthetwo-layerlogstructureoersfastersearchwhichbenetsrollingtransactionsback.3.THERECOVERABLELOG3.1DesignoverviewInNVM,thelogisitselfanin-memorynon-volatiledatastructureandlogupdatesrequireaseriesofCPUwrites.Thus,updatesforloggingandrecoveringuserupdatesmustthemselvesbeloggedandrecoverable.Thechallengeistoensureatomicityanddurabilityforlogupdatesusingarecoverymechanism.Notethatuserupdatesarenowmuchcheaper,whichmakestraditionallogginginfrastruc-tureheavyweightandcallsforalowoverheaddesign.DBMSsoftenuseauxiliarydatastructuresforindexingthelog.Updatingsuchstructuresmayrequireavariablenumberofupdatesduetotheneedtore-organizethedatastructure.ThismakesmaintainingtransactionalatomicitydicultinNVM.Oursolutionistocreateaspecializeddatastructurethatembedstransactionallogicandisabletore-coveritselfinthecaseofasystemfailure.Thedatastructurerequiresaconstantnumberofoperationstoinsertorremoveanentry,sothatitsstatecanbetrackedwithon
lyafewvari-ablesthatcanbeupdatedandmadepersistentinasingle,atomic,CPUwrite.Thisreliesononlythelastoperationinthebasicstructurebeingpending,soweonlyneedtologoneoperation.Wealsorequiretheappend/removeopera-tionstobethread-safe.WeforceallupdatesonthebasicdatastructuretobeperformeddirectlyonNVMthrough:memoryfencestoforcependingwrites;non-temporal,syn-chronous,writesthatbypassthecacheanddonotcompletebeforereachingNVM;andcacheline ushes.Allprimitivesarepresentinmostinstructionsetstoday,buttheyguar-anteeonlywritevisibilitywithinthememoryconsistencymodelofthemachine;theydonotguaranteepersistence.WeassumethatwhenNVMsystemsbecomewidelyavail-abletheywillbecapabletoalsoguaranteepersistencetoNVM.MostworkonpersistentdatastructuresforNVMmakessimilarassumptions(e.g.,[5,10,12,31]).Basedontheserequirementsweuseadoubly-linkedlistasthebasiclogdatastructure.Listnodescontainthelogrecords,whichcontaininformationalsofoundinARIES,e.g.,atransactionidentierandtheoldandnewvaluesoftheaectedmemorylocation,etc.Withcarefullycraftedcodeitispossibletomakeatomicnodeinsertionsanddele-tionsoverthelinkedlist.ThisAtomicDoubly-LinkedList(ADLL)isthekeydatastructureforlogginguserupdates.However,itrequireslinearsearchtolocateanentry.Anal-ternativeistoindexlogrecordsbytransactionidentierandusetheADLLtologthependingupdatestotheindex.Theresultisatwo-layercongurationwheretheindex(anAVLtreeinourcase)logspendinguserupdatesandthebasicdatastructure(i.e.,theADLL)logspendingREWINDup-datestothecomplexdatastructure.Theone-layercongu-rationoersfasterloggingbutmayleadtoslowerrollback;andvice-versaforthetwo-layerconguration.3.2Onelayerlogging:theAtomicDoublyLinkedListAssumingthatrecoveryandrollbackarerareevents,wecanachievefasterloggingatthecostofaslowerretrievaloflogentries.TheonlyloggingstructureistheADLL,soin-sertingarecordcostsasmallconstantnumberofwrites.Weoptimizeloggingattheexpenseofmoreworkduringrecov-ery.Wedosobynotkeepinganytransaction-specicstate.Atthepriceofahigherrollback/recoverycostweeliminatethetransactiont
ableduringloggingandreducethenumberofvariablesweupdate;weonlyconstructthetransactiontableduringrecovery.Thisdeparturefromback-chainingisacceptableasweexpectrollback/recoverytoberareevents.Insteadofrollingbackonetransactionatatime,weper-formasinglebackwardscanofthelogandrecoveralltrans-actions,attheexpenseofhighermemoryutilization(seeSection4.5).Rollingbackasingletransactionisnottypicalinsystemfailures,butselectiverollbackisnecessarytoallowuserstoabortspecictransactions.Toachievethisweneedtoscantheentirelogjustfortherolledbacktransaction.Long-runningtransactionsexacerbatethis,asdothenum-berofconcurrenttransactions:theyincreasethenumberofrecordsbetweenrecordsofthetransactionbeingrolledbackthatweneedtoskip.Torectify,weclearthelogatcheck-points(Section4.6):bytuningthecheckpointingfrequencywebalancetheinsertionoverheadagainsttherollbackspeed.TheADLLisakeystoneofREWINDasitenablestheatomicinsertionandremovaloflogrecordsinto/fromtheloginNVM.TheADLLisrecoverableitselfthrough:(a)useofsinglevariablestologtheinternalstate,whichcanbeupdatedatomicallyinhardware;(b)recoverybyredoingonlythelastoperation:repeatedredos,eitherpartialorinfull(duetofurthersystemfailuresinthemiddleofanADLLrecovery),aresafeandleavethelistinacorrectstate;(c)simpleoperationsthatmakeiteasiertoproducecodewiththeredorecoverabilityproperty;and(d)perform-ingallwritesvianon-temporalstores.TheADLLusesfourloggingvariables:lastTail,thetailofthelistbeforein-sertion;toAppend,thenodetobeappended;andtoRemove,thenodetoberemoved.Eachlistnodepointstothenextandpreviousnodes,andtotheactuallogrecord.Thelat-terissowecancreatenewrecords\o-line"andatomicallyinsert/appendthemtothelist.AppendThisoperationinvolvescreatingthenewnode,updatingthetail/headofthelist(ifneeded)andthenextpointerofthelasttail.TheoperationfortheADLLisshowninAlgorithm1.Lines5and10markthebeginningandend,respectively,ofthepersistentoperation.Line5correspondstothecriticalupdate:itsavesthenodetobeappendedsotheoperationcanberedoneduringrecovery.Ifthesystemfailsata
nypointbeforeline5,thestateofthelistisnotaltered,andthusconsistent.Ifthesystemfailsafterline5,therecoveryoperation(describednext)willre-applythe Algorithm1:AppendoperationontheADLL,invokedaspartofthetransactionmanager'sloggingoperation. input:elementEtoinsert1//setupnewnode2n=newNode();n.element=E;n.prior=tail;3//undoinformation4lastTail=tail;//Keeptailbeforelogginglastinsertion5toAppend=n;6ifhead=NULLthenhead=n;//updatehead7iftail6=NULLthentail.next=n;//updatetail8tail=n;9//appendfinished,clearundo10toAppend=NULL; append.Line4isnotcriticalasitonlysetslastTailanddoesnotalterthestateofthelist.Ifthesystemcrashesbetweenlines4and5thisvaluewillbeoverwrittenbythenextappendattempt.Theorderoftheupdatesoflines4and5iscriticalforcorrectrecovery.Inline6theheadofthelistisupdated,ifnecessary.Thisisnotcriticalasitisdesignedsothatitcanberepeatedmultipletimesduringrecovery.Inlines7and8thenextpointerofthetailandthetailitselfareupdated.Ifthesystemfailsafterline10,thestateofthelistincludesthenewnodeandisconsistent.RecoveryduringappendWeusethetoAppendvariabletoidentifytheinterruptedaction:anon-NULLvalueimpliesanunnishedappendoperation.Thus,weneedtorepeatthecriticalsectionoftheappend.ToallowtherecoverycodetoberecoverableitselfweusethelastTailvariable,insteadoftailusedoriginally(line7ofAlgorithm1).Thisresolvestheproblemofacrashbetweenlines8and10ofAlgorithm1thatwouldcausethesecondrecoverytoreinsertthenode.RemovalToguaranteeatomicityandrecoverabilityofre-movalswefollowthesameprinciples.WestorethenodetoremoveinthetoRemovevariableatthebeginningofthecriticalsection,similarlytotoAppend.Torecover,werepeattheremovalcodewhichisdesignedtobesafelyre-executed.WerstcheckthetoRemovevariabletoidentifyifthesystemcrashedduringremovalandrepeattheprocess.ADLLrecoveryWerecoverbyrstidentifyingthein-terruptedoperation(appendorremoval)bycheckingthetoAppendandtoRemovevariables.Then,werepeattheap-propriateoperationasdiscussed.3.3OptimizingthelogstructureAppendingarecordtotheADLLthroughAlgorithm1re-quiresmultiplen
on-temporalstoresandbearsoverheadduetothewritelatencyandtheuseoffences.Moreover,writesrefertonon-consecutivelocations(thelist'snodes),whichforbidspackingthemtofewercachelines.Wecansigni-cantlyreducethewriteoverheadbychangingthememorylayoutofthedatastructurebyblockingmultiplerecordsintoxed-sizebucketsrepresentedasarrays,asshowninFigure2.Aftercreatingalogrecordweplaceitintoabucketwithonewrite,renderinginsertionbothatomicandcheap.Thelogisresizedbyatomicallyappendingnewbuck-etstotheADLL.Thislayoutusescheaparrayappendsandamortizesthecostofatomicexpansion:insteadofsinglenodes,weinsertbuckets.Therecoveryalgorithmsareunaf-fectedbythenewstructure.Theonlyexceptionisthatweneedtokeepthenextpositioninthelastbuckettoinsertanewrecord.Doingsothroughanon-temporalstorewouldincreasetheinsertioncost.Instead,wereconstructthein-formationduringtheanalysisphaseintheeventofacrash.Weinitializethecellsofeachbuckettozero,and,during Figure2:Minimizingthewriteoverhead.analysis,weidentifythelastoccupiedcellafterskippingallemptycellsclearedbythelogclearingprocess.ClearingthelogRemovinglogrecordsfromthehybridstructureismoreinvolvedduetotheneedtoshiftrecordstollremovedrecordgaps.Doingthisatomicallyisex-pensiveandaddsunnecessarycomplexity.Weavoiditbyallowingmarkedgapsinabucket,keepingcountofoccupiedcells,andremovingabucketwhenitbecomesempty.Wedonotexplicitlystorebucketcounts,but,intheeventofacrash,wereconstructthemthroughthemarkedgaps.Wethussimplifyrecordremovalbutmaywastememoryinlong-runningtransactions.Underbothforcepolicies,therecordsoflong-runningtransactionscanspanmultiplebuckets,thuspreventingbucketremoval.Wecantunebucketsizetobal-ancetheimpactoflong-runningtransactions.Alternatively,wecancompactthelogifitsoccupancydropsbelowsomethresholdbycreatinganewlog,copyingrecordsover,andatomicallychangingthepointertotheheadbucket.MultiplelogrecordspercachelineAkeychallengeinkeepinguserdatainNVMisthelackofcontroloverwhentheirupdatesbecomepersistent,preventingDBMS-likeop-timizations[10,27]wherethelogtailis
 ushedfrommemorytopersistentstorageinbatches.Thisguaranteestheper-sistenceoflogrecordsandallowsthepackingofwritesincachelinesinNVM,butitisnotpossiblewhenuserdataisalsoinNVM:delayinglogwritesmaycausedatawritestoovertaketheirlogrecords,violatingtheWALprotocol.InREWIND,wecanperformsimilaroptimizationsoverourhybridlog.Multiplerecordsarepackedintoasinglecachelinesincetherecordpointersarestoredinconsecutivememorylocations.Thisdoesnotrequirethelogrecordsthemselvestobestoredtogetherinmemory.Thecompilerneedstoreorderthelogcallsandplacetheminbatchesabovethecorrespondinguserwrites.Thisguaranteesthelogwritesarenotovertakenbyuserwritesandrecordsareplacedinonecacheline.With64-bytecachelinesand8-bytepointersweneedjustasinglefenceandasinglenon-temporalstoreforeveryeightlogrecords.Thisalsomit-igatesthecostofthefenceandthegroupsizeservesasatuningknobforadjustingtodierentfencelatencies.Commitlogrecordscanbereorderedsafelybeforetheuserwritesasthelogrecordsthatprecedethemguaranteerecoverability.Evenifthecommitrecordscannotbemoved,wecanmoveallprecedingrecordsandproceedasbefore.Thisrequiresthecachelinebewrittenatomicallysinceweonlyassumethehardwarecanguaranteesingle-wordatomicwrites.Wedothisbykeepingthepositioninthebucketuptowhichlogrecordsareguaranteedtobepersistent.Thisissettozerowhenthebucketiscreated.Then,itisupdatedafterweissueamemoryfence(usinganon-temporalstore)withthepositionofthelastrecord.Thisguaranteesthatalllogrecordsuptothatpointarepersistent.Ifacacheline isnotintentionally 
ushed,thisindexisnotupdated.Thisisvitalforensuringcorrectness,asduringrecoveryweonlyconsiderlogrecordsuptothelastpersistentindex.Weissueamemoryfence/indexupdateforeverybjcachelinej=jpointerjcrecords;orwhenthebucketisfull;orwhenwendanENDrecord.ThelatterisimportantsinceENDrecordsmarkthecompletionofcommit/rollback.Delayingtheupdateofthelastpersistentpositionafteracommitmayleadtohavingtoabortacompletedtransactionafteracrash.3.4Twolayerlogging:theAtomicAVLTreeToimprovesearchintheADLLweuseanauxiliarystruc-ture:anAtomicAVLTree(AAVLT),whichindexeslogrecordsbytheiridentierandisrecoverablebymaintainingalogofitsinternaloperationsintheADLL.Themostinten-siveloggingactivityisduringrebalancingoninsertion/re-moval.WeusetheoptimizedversionoftheADLL.EveryupdatetotheAAVLTisonlyexecutedbyasinglethreadandforwardeddirectlytoNVM.DoingsoallowsustologonlythelastoperationontheAAVLTandclearthelogentriesaftercompletion,thusreducingthelengthoftheADLL.IntermsofAAVLTinsertionandremoval:(a)welogallthewritesthataectthestateofthestructure,and(b)wedelaythede-allocationoftheremovednodesuntilafterthesuccessfulcompletionoftheoperation.Theloggingandre-coveryimplementationisasimpliedversionoftherecoveryschemeofSection4.Wealsoskiptheanalysisphaseasthereisonlyonetransactiontoundo.Rollbackalsolargelyre-mainsthesamewiththeonlyissuebeinglocatingthenextlogrecordtoundo.Thisisstraightforwardforanormalrollback(thepreviousentry),butafteracrashduringtherollbackoperationitselfweneedtoskipallrecordsthatwerepreviouslyundoneandcontinuefromthatpoint.Wedothisinthesamewayasintherecoveryofone-layerlogging(seeSection4.5).Finally,weclearlogentriesaftereachAAVLToperationaswedescribeinSection4.6fortheforcepolicy.4.THERECOVERYRUNTIME4.1TransactionrecoverymanagementThetransactionrecoverymanagermaintainstwostruc-tures:thelogandthetransactiontable.Thelogtrackstheprogram'swrites.Theformatoftheserecordsisstan-dardandincludestherecordID,thetransactionID,therecordtype,theoldandnewvalues,theaddressofthemem-orylocationmodied,andpointerst
ootherrecords.Thetransactiontablestoresinformationabouttheactivetrans-actions.TransactiontableentriesincludethetransactionID,itsstatus,theIDofthelastrecordofthetransaction,andtheIDoftherecordtoundonext.Thetransactionta-bleisconstructedduringrecoveryinallcongurationsbutismaintainedduringlogginginthetwo-layerconguration.ThereisnoneedforadirtypagetableasNVMsarebyte-addressable.Thetransactionrecoverymanagerconstructsthetransactiontableatapplicationstartanddetermineswhetherasystemorapplicationcrashoccurred,inwhichcaserecoveryisperformed,orwhetherthisisacleanstart.4.2LoggingUndertheWALprotocolalogrecordmustbepersistedbeforethecorrespondingpersistentwrite.Weusethisap-proachfortheADLL,theAAVLTinthetwo-layercongura-tion,andtheprogrammerdata(Section2).Weusephysicallogginginsteadoflogicalloggingasittsbetterwithimpera-tivelanguages.DBMSsenforceWALwithsystemcallsandasynchronousI/Ointerface.InREWIND,wemustenforceWALforCPUwritesthatpassthroughacomplexmemoryhierarchyandmaybere-orderedbeforereachingNVM.WeuseasimpliedversionoftheoriginalARIESlogfunctionwiththekeydierencebeingthatthedirtypagetableisabsent,aswedonothavepages.Forone-layerloggingthetransactiontableisalsoabsentduringloggingandonlyre-constructedduringrecovery.Logrecordsarecreatedgivenappropriateparametersandthenamemoryfenceisissuedtoensuretherecordeldshavereachedthememory.Afterthat,therecordisinsertedatomicallytothelog.Iftwo-layerloggingisused,therecordisinsertedintotheAAVLTandtheAAVLTmaintenanceoperationsareloggedinstead.AswediscussedinSection3.3wecanreducethenumberoffencesrequiredbymovinggroupsoflogrecordsbeforethewritesandthenissuingasinglefence.4.3CommitThelogfunctionguaranteesthattherelevantlogrecordsareinNVMuponcommit.Underaforcepolicy,allupdatesofatransactionmustbeinNVMbythetimeatransactioncommits.WedothisbydirectlyupdatingNVMusingnon-temporalstoresandfollowit,atcommit-time,withamem-oryfenceandanENDlogrecord.Wemayalsothenremovethelogentriesofthistransaction.Inno-forcecongurations,allweneedistoinserttheENDlogrecordat
commit-time.Thelogentriesofcommittedtransactionsareclearedinthebackgroundbycheckpointing(aswewillseeinSection4.6).ARIESfollowsano-forcepolicytoimproveI/Owhenwrit-inglogpagesanddirtypagestodisk.InNVM,persist-ingthelogentriesisasexpensiveasmakingtheupdatesthemselvespersistent.ARIESusesastealpolicy,whichinourcaseisinapplicableasthereisnobuer-pool.Commit-tinginARIESexplicitlyforcesanyin-memorylogentriestopersistentstorage.ThisisnotrequiredinREWINDaslogentriesareimmediatelymadepersistent(throughnon-temporalstores).ThisisanovelrequirementinNVM-basedsystemstopreventreorderingofwritesinthememoryhier-archyfrombreakingtheWALprotocol.Memoryde-allocation(e.g.,line7ofListing1)requiresspecialhandlingforrecoverability.Inno-forcecongura-tions,wedelaymemoryde-allocationuntilthecorrespond-inglogentryisprocessedatthenextcheckpoint(seeSec-tion4.6).Thede-allocationdetailsarestoredinaspecialDELETErecord.Inforcecongurations,wepostponemem-oryde-allocationuntilaftercommitting(asinline16ofListing2).WealsorelyonaDELETErecordtohandleasystemfailurebetweencommitandtheactualde-allocation.4.4RollbackTransactionrollbackinREWIND(eitherexplicitlyorasaresultofasystemfailure)proceedsasfollows.Inone-layerlogging,rollbackisatrivialbackwardscan.Thesituationismorecomplicatedintwo-layerlogging,whereweselectivelyscanthelogforthetransactionbeingrolledbackthroughtheAAVLT.TherollbackcanberepeatedanunlimitednumberoftimesthroughtheuseofCLRsthatlogundooperations.Asweusephysicallogging,undosetsavariabletoitsoldvalue.Notethatundertheforcepolicytheundosshouldbemadepersistentaswell.Thisisrequiredtobeabletoclearthelogsaftertherollback.Onecomplicationisthatweneed 
to redo the last CLR when we recover from a crashed rollback. This protects from the corner case of a crash after the creation of the last CLR but before the corresponding update was made persistent. Finally, we mark the successful rollback completion by writing an END log record.

4.5 Recovery

To recover, we must first recover the log itself. This is followed by either three phases (analysis, redo, and undo) or two phases (analysis and undo) depending on the force/no-force configuration. ARIES and DBMSs exploit the I/O substrate to present a consistent and persistent log structure in case of system failures during log writes. In our case the log is in NVM. Thus, we require custom mechanisms to recover from interrupted log updates (see also Section 3). When recovery finishes, we also clear the transaction table as all transactions are henceforth considered completed.

After recovering the log, the analysis phase reconstructs the transaction table by scanning the log forward to the point of failure. Then, in the no-force/three-phase configuration only, we scan the log forward again and redo all writes. The redo phase handles a crash during a previous rollback, as it ensures that all undos are redone and consequently not lost during the second crash. In the third phase we consult the transaction table to undo all uncommitted transactions. The undo implementation depends on whether we use one- or two-layer logging as we will discuss shortly.

After completing recovery, and under a force policy, we know that all transactions are completed: either committed or aborted. Thus, we clear the log in three steps: (a) keep the pointer to the log in a temporary variable; (b) create a new log; and (c) de-allocate the old log. De-allocating the entire log is faster compared to individually removing its records.

Two-layer logging. For each unfinished transaction, we update its status as being aborted and scan its log records backwards by following the undoNextLogID pointers: the ID of the next record to undo; we retrieve each record through the AAVLT and call the rollback function. Then, we write END records for all aborted transactions. In the force policy, to address the corner case of a crash between the last CLR and the corresponding user write, we redo the last CLR.

One-layer logging. This is similar to undo in two-layer logging with two main differences: First, selectively scanning the log is too inefficient so we implement a custom undo process (shown in Algorithm 2) by undoing all uncommitted transactions in a single backward scan. Second, during the scan we track the last CLR (undo) record of all transactions at an unfinished rollback state with the aid of an auxiliary data structure. We use this to skip the UPDATE records that have already been aborted so we can find the next record to undo without using the undoNextLogID pointer.

4.6 Log checkpointing

Reducing the size of the log is an important requirement of REWIND as: (a) despite their good scalability, NVM capacities will likely lag behind those of disk, and (b) the fine-grained logging of REWIND leads to larger metadata sizes. Keeping the log small is critical in one-layer logging to reduce the cost of scanning. The removal of log records depends on the configuration. When forcing, we clear the log records right after a transaction commits or rolls back. In a no-force policy the records are removed at checkpoints.

Algorithm 2: Undo operation (one-layer logging) invoked during recovery.

    while ADLLLog.hasPrior() do
        rec = ADLLLog.prior();
        xact = transactionTable[rec.xactID];
        if xact.status = RUNNING or xact.status = ABORTED then
            if xact.status = RUNNING then
                ADLLLog.insert(xact.xactID, ROLLBACK);
            if rec.type = CLR then
                if undoMap[rec.xactID] ≠ NULL then
                    undoMap[rec.xactID] = rec.undoLogID;
                    if force policy then rec.redo();
            else if rec.type = UPDATE and rec.isUndoable then
                if undoMap[rec.xactID] ≠ NULL
                   and undoMap[rec.xactID] rec.logID then
                    // extra arguments for CLR record omitted
                    ADLLLog.insert(xact.ID, CLR, ...);
                    rec.undo();
    // Add END records
    while transactionTable.hasNext() do
        xact = transactionTable.next();
        if xact.type ≠ FINISHED then ADLLLog.insert(xact.ID, END);

At a checkpoint, the cache is flushed to make all pending
writes persistent. Regardless of the method used, we have to update the log in a recoverable way. We thus atomically remove each transaction's END log record as the last operation to guarantee that, after a crash during clearing, the next attempt will be performed in exactly the same way.

To clear the log when forcing, we scan the log backwards and remove the records of committed transactions. A checkpoint under a no-force policy is more complex. It is designed as a "cache-consistent" checkpoint to allow fine-grained locking. This forces a scan of the log, but allows concurrent transactions to keep using the log, which is possible as transactions only append to the log while checkpointing removes records from the middle. We insert a CHECKPOINT record before the cache flush to mark the point in the log that is persistent; all records before that point can be safely removed. We do this by removing END records last. Issuing first the cache flush and then the CHECKPOINT record could lead to newly inserted records appearing to be persistent.

4.7 Concurrency

REWIND allows low-overhead, fine-grained concurrency. We use simple locks to serialize log access and ensure traversals (during a checkpoint) are thread-safe. The one-layer/no-force configuration offers the finest-grained concurrency due to the simple log structure, which allows us to lock the log only briefly during insertion or removal. REWIND could further benefit from a lock-free ADLL but this is left for future work. Thread-safe access to user data by multiple transactions in REWIND is up to the programmer. This is due to REWIND's imperative language nature, which allows the programmer to arbitrarily update data.

5. PERFORMANCE EVALUATION

We implemented REWIND in C++ (using g++ 4.7.3) to evaluate its performance. We used a quad-core Intel® Xeon E5420 clocked at 2.5 GHz per core with 12 GB of fully buffered DDR2 memory running the GNU/Linux 3.9 kernel. We emulated NVM by adding latency through a busy loop (see also [31]) preceded by a cache line flush and followed by a memory fence. The latency emulation is inlined before accessing NVM. We consider every non-temporal store as an individual NVM write, but group consecutive writes to the same cache line into a single NVM write. We set the NVM write latency to 510 processor cycles (150 ns). We do
not model a higher NVM read latency than DRAM because: (a) the two are already comparable in current NVM technology [25]; and (b) transaction processing is update-heavy so writes affect performance the most.

We compare REWIND to Stasis [27], a state-of-the-art storage manager for persistent data structures. Stasis employs data-structure-specific persistence and recovery optimizations as opposed to general DBMS-based recovery mechanisms. We also compare to the popular BerkeleyDB. Finally, we include the version of Shore-MT from [33], which is heavily modified for persistent memory. All approaches work over block devices, so an easy way to port them to NVM would have been to run them on a memory-mounted filesystem (e.g., RAMFS). We followed a different approach and used PMFS [9]: a kernel-level filesystem that is memory-mounted and byte-addressable. PMFS guarantees persistence through standard filesystem calls, but its implementation is optimized for byte-addressability, thus minimizing the overhead over NVM. Thus, it does not adversely impact the performance of Stasis, BerkeleyDB or Shore-MT. We further favor the two former systems by only charging latencies for user-data writes to PMFS and not for PMFS's internal bookkeeping writes. We also favor Shore-MT by disabling any latencies used in [33]. We do not compare to approaches like Mnemosyne [31] or NV-Heaps [5] since they do not provide full logging and recovery functionality and are thus complementary (see also Section 6).

We used BerkeleyDB version 6.0.20 deployed as in [27]. The cache and log buffer sizes matched those of Stasis. The lock manager was disabled to further improve performance. For Shore-MT we used the transaction-level partitioning variant with durable-cache enabled and a configuration similar to that of the other two systems. We refer to the three versions of REWIND as Simple, Optimized and Batch and these correspond to the doubly-linked-list (Section 3.2), hybrid doubly-linked-list (Section 3.3) and hybrid doubly-linked-list with batched log records (Section 3.3) implementations. We configured the Optimized version with a bucket size of 1,000 records and the Batch version with a 64-byte cache line size and 8-byte pointer size to match our hardware. In Section 5.1 we use Optimized REWIND for all one-layer configurations as this is the configuration used as the bottom layer of the two-layer approach. All results are the average of three runs, with an average standard deviation of 1.4%.

5.1 Sensitivity analysis

Logging overhead. We measure the overhead of logging as a function of the number of memory stores. We implemented a microbenchmark with a single transaction that alternates between updating an in-memory table and performing some computation between updates. The transaction successfully commits at the end. We calibrated the computation cost to be a multiple of the cost of a non-logged store to NVM. In the left plot of Figure 3 we show the logging overhead as a function of the fraction of time spent on updates; the overhead is reported as the ratio between the performance of REWIND and the non-recoverable implementation over NVM, e.g., a ratio of 5 means REWIND is 5x slower. We tested all four configurations: two-layer or one-layer logging (2L or 1L); and force or no-force policies (FP or NFP).

Figure 3: Logging overhead as a function of update intensity (left) and number of skip records (right).

The rightmost point of the plots represents the worst case: the user program only updates critical data. Then, the overheads of the two-layer configurations are higher compared to the one-layer configurations. The low overheads of the one-layer configurations show the effectiveness of the Optimized implementation of REWIND. The difference between the overheads of the two-layer and one-layer configurations stems from the cost of using the AVL tree and maintaining the transaction table. The total overhead decreases steeply as the intensity of updates decreases. For a 10% update intensity, the overall overhead drops to only 1.5× for the one-layer no-force configuration and 8.5× for the two-layer no-force configuration.

The difference in logging overhead between the force and no-force policies is not as dramatic, especially for one-layer logging, but it is still significant. To better convey the information we have magnified the plot for the one-layer runs at the bottom of the graph. For one-layer logging the overhead varies between 2% and 35% and for two-layer logging between 24% and 74%. The increased logging overhead of the force policy is due to the more expensive non-temporal writes to NVM for the user updates and to the extra work to clear the log at commit (Section 4.3).

We next focus on the comparison of one- and two-layer logging under a force policy. Recall that commits in one-layer logging require linear scans of the log, which become more expensive with more interleaving log entries (a measure of the number of concurrently running transactions). We term such entries skip records, as they will need to be skipped if this transaction is to be selectively processed. Two-layer logging rectifies this through the AVL index. We changed our microbenchmark to generate a variable number of records from other transactions between records of a specific transaction. All transactions update the same in-memory table, so they correspond to the worst-case 100% update-intensive workload of the previous experiment. The number of skip records varied from 100 to 1,000. This might seem like a small number, but recall that REWIND runs in user-mode and in a single application context. Skip records correspond to the number of intervening concurrent updates of a shared resource in a single context (performed, perhaps, by multiple threads), so a smaller number of such records is enough to measure the overhead and sufficient to indicate the performance trends of each REWIND configuration.

In the right plot of Figure 3 we report the overhead of the one- and two-layer configurations as a function of the number of skip records. The overhead is again expressed as the ratio over the performance of the non-recoverable version of the same microbenchmark. In one-layer logging the overhead grows sharply with the number of skip records. In two-layer logging, on the other hand, the overhead is relatively fixed. In reality, it also grows with the number of skip records, but at such a slow rate that it is untraceable in the plot. Even though one-layer logging starts off performing better than two-layer logging, its degradation as the number of skip records grows is so severe that the two-layer
configuration outperforms it after around 600 skip records. This suggests that in a user application the decision of which REWIND configuration to employ is not a clear one as there will be a crossover point beyond which the two-layer configuration starts showing its merits. It is up to the user to decide if the concurrency needs of the application are high enough for two-layer logging to be the best choice.

Figure 4: Single-transaction rollback (left) and recovery (right) for a varying number of skip records.

Rollback and recovery costs. Our purpose is to assess the impact of the number of logging layers on the performance of single transaction rollback. We use the same microbenchmark as before, but instead of committing the targeted transaction we roll it back. In the left plot of Figure 4 we show the rollback duration (in milliseconds) as a function of the number of skip records for the one- and two-layer configurations and for a force policy. The rollback time of the one-layer configuration grows faster than that of the two-layer configuration as we increase the number of skip records. The two-layer configuration catches up with the one-layer one at around 400 skip records. As was the case for commit, this suggests that the two-layer configuration exhibits its merits after a sufficient number of skip records. Again, the programmer should customize the REWIND configuration for the expected application workload. REWIND itself can be tuned to adapt to various workloads.

Next we report the cost of aborting a single uncommitted transaction during recovery, instead of rolling it back during normal operation, again as a function of the number of skip records. This case appears when a transaction starts its commit protocol, but does not finish committing (i.e., it does not log an END record); it must then be aborted during recovery. This continues the analysis of the choice between one- or two-layer configurations, but in a more contrived scenario. We extended the microbenchmark to commit all other transactions but the target one, but without clearing them from the log so their entries have to be skipped when recovering the target. That could happen if the system crashed after these transactions logged their END records (so the system will not try to abort them) but before clearing the log. In the right plot of Figure 4 we report the recovery time as a function of the number of skip records for the one- and two-layer configurations and with a force policy. One-layer logging now significantly outperforms the two-layer configuration. Although two-layer logging performs better during the undo phase, and for selective transaction rollback, it is swamped by the slower iteration over the log contents during the analysis/redo phases, which greatly increases the recovery time. This contrasts with the earlier results where one-layer logging was outperformed by two-layer logging and reinforces the intricacies of choosing a configuration.

We now report the total processing cost (logging plus commit or recovery) as a function of the likelihood that transactions are recovered. We extended the microbenchmark to select a varying number of transactions to be recovered and timed both the logging and the commit or recovery process of all transactions.

Figure 5: Logging and recovery cost as a function of the fraction of recovered transactions.

Figure 6: Impact of checkpointing frequency.

In Figure 5 we show the total time as a function of the fraction of transactions that need to be recovered, for the one-layer configuration with both force and no-force policies and with three values of skip records: 10, 150 and 300. For both policies we factor out the duration of log clearing to compare the methods irrespective of whether clearing is immediate or through checkpoints. We do not consider the two-layer configuration as we have already seen it perform worse than the one-layer one in terms of both recovery and logging. The execution time is sensitive to the number of skip records given the dependency between rollback/recovery cost and the value of this parameter, as was shown earlier. Recall that the no-force policy requires three phases during recovery, whereas the force policy requires two (Section 4.5). Observe then that the no-force policy has a slight advantage for the same number of skip records and a very low crash probability. It is eventually outperformed by the force policy because of the extra recovery phase. This is more evident as the number of skip records increases because the duration of the extra phase increases as well.

Checkpoint overhead. To measure the checkpoint overhead we inserted ten million log records in the three REWIND versions, configured with one-layer logging and a no-force policy. We ran the insertions for each configuration with and without checkpoints and we report the overhead of the checkpointed run as the percentage of non-checkpointed execution for a varying checkpoint frequency. Overall, the overhead declines with decreasing checkpoint frequency. However, the overhead in the Simple version is more severe compared to the other two versions. This is due to the coarser degree of concurrency: the Simple approach needs to lock and serialize the insertion of a new record to the ADLL while the other methods only apply a single update to a bucket. As shown in Figure 6, the overheads of the Simple, Optimized, and Batch REWIND versions vary from 79% to 60%, 32% to 9%, and 20% to 3% respectively.

5.2 Complex transactional workloads

Logging. We evaluate the overhead of REWIND when recovering data stored in a B+-tree. We tested eight configurations: DRAM, without persistence or recoverability; NVM with persistence but without recoverability; the three REWIND versions running on NVM; and Stasis, BerkeleyDB and Shore-MT running on NVM. The last six configurations guarantee persistence and are recoverable. All REWIND versions were configured with a no-force policy and without checkpoints. We implemented one in-memory B+-tree version for each different persistence layer: REWIND, Stasis, BerkeleyDB, Shore-MT. We loaded the B+-tree with 100k 32-byte-long records and performed a mix of 200k lookups and updates as we varied the read/update ratio. The updates were equally divided between insertions and deletions for a constant tree size per read/update ratio.

Figure 7: B+-tree logging performance for REWIND vs. no recoverability (left); REWIND vs. Stasis, BerkeleyDB and Shore-MT (right).

In the left plot of Figure 7 we show the total execution time for the workload as a function of the fraction of update queries. The overhead of the DRAM and NVM implementations grows with the fraction of updates, albeit gently, as updates are more expensive than lookups. This is exaggerated in the NVM implementation because of the overhead of NVM writes. All REWIND configurations perform well and close to the DRAM and NVM implementations. The Optimized version improves the Simple version by 27% and the Batch version improves it by 37%. We therefore focus on the REWIND Batch variant from now on.

In the right plot of Figure 7 we compare the overhead of REWIND to Stasis, BerkeleyDB, and Shore-MT. REWIND outperforms Stasis by 85×, BerkeleyDB by 105× and Shore-MT by 205× at 100% update queries. This is due to REWIND's minimalistic design, leaner software stack, and NVM-specific optimizations. Shore-MT is outperformed as it is optimized for multi-threaded performance while the workload is single-threaded. Later in this section we show how Shore-MT scales better than BerkeleyDB and Stasis in multi-threaded mode.

Rollback and recovery. We report the cost of transaction rollback as a function of the number of operations. We started with a 100k-record B+-tree and then invoked a mixed workload of an equal number of randomly distributed insertions and deletions. This keeps the size of the B+-tree small, but generates a large number of log records. We report the results in the left plot of Figure 8. REWIND Batch outperforms Stasis by 30×, BerkeleyDB by 12× and Shore-MT by 4×. This is due to the REWIND algorithms and its minimal physical fine-grained logging, as opposed to the logical logging of Stasis, or the coarse-grained, page-level logging of BerkeleyDB and Shore-MT. Shore-MT's excellent performance is due to undo buffers keeping the undo log records in memory. In the right plot of Figure 8 we report the cost of full recovery for multiple transactions. We used the same setup but now we created a new transaction every 200 operations. Thus, the number of transactions varies from 400 to 4,000.

Figure 8: B+-tree recovery for single (left); and multiple transactions (right).

REWIND outperforms Stasis by 20×, BerkeleyDB by 14× and Shore-MT by 8×. This is due to the lower per-transaction overhead of REWIND and one-layer logging doing away with the transaction table. Coupled with the efficient NVM-specific implementation, the result is a large performance margin over the competition.
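As a rough illustration of the one-layer recovery path measured above, the following C++ sketch rebuilds a transaction table with a forward analysis scan and then undoes every unfinished transaction in a single backward scan, restoring old values from physical log records. The record layout, the names `LogRecord` and `recover`, and the use of a `std::vector` in DRAM in place of the NVM-resident log are assumptions made for illustration only; CLRs, END records for aborted transactions, and the redo phase are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical record layout for illustration; not REWIND's actual format.
enum class RecType { Update, End };

struct LogRecord {
    RecType type;
    int xact_id;
    std::size_t slot;      // index of the updated variable (UPDATE only)
    std::int64_t old_val;  // value before the update (physical undo image)
};

// Analysis: scan forward and note which transactions logged END.
// Undo: scan backward once and restore the old values written by
// transactions that never logged END. Returns the number of undone updates.
inline std::size_t recover(const std::vector<LogRecord>& log,
                           std::vector<std::int64_t>& data) {
    std::unordered_map<int, bool> finished;  // rebuilt transaction table
    for (const LogRecord& r : log) {
        if (r.type == RecType::End) {
            finished[r.xact_id] = true;
        } else {
            finished.emplace(r.xact_id, false);  // keeps an existing entry
        }
    }

    std::size_t undone = 0;
    for (auto it = log.rbegin(); it != log.rend(); ++it) {
        if (it->type != RecType::Update) continue;
        if (finished[it->xact_id]) continue;  // committed: skip its records
        assert(it->slot < data.size());
        data[it->slot] = it->old_val;         // physical undo
        ++undone;
    }
    return undone;
}
```

Because the backward scan visits the newest record first, a variable updated several times by the same unfinished transaction ends up with its oldest logged value, which is the state before that transaction started.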
Figure 9: Multithreaded B+-tree logging performance.

Concurrency. To test REWIND's fine-grained concurrency, we started multiple threads with each thread performing 100k operations on a B+-tree. Each operation is either an insert/delete pair or a lookup. The lookup-to-insert/delete ratio ranges from 20% to 80% (e.g., 30% lookups, 70% insert/delete). Each thread is assigned a ratio at the beginning and picks up operations from a pool of available tasks. We measured the total duration of the run, i.e., until all threads finished, as a function of the number of threads. REWIND uses its own library-level concurrency mechanisms. For Stasis and BerkeleyDB we let readers progress without locks but use locks for insert/delete pairs. This improves performance for BerkeleyDB as it eliminates deadlocks. For Shore-MT we use its own concurrency mechanisms for up to four threads as it creates one log partition for each core. Beyond that we found it better to use the same locking as Stasis and BerkeleyDB. As shown in Figure 9, the processing times of Stasis and BerkeleyDB grow linearly with the number of threads. Shore-MT, as expected, scales better than Stasis and BerkeleyDB for the first four threads and then yields performance similar to BerkeleyDB. REWIND scales significantly better after three threads. The processing time for REWIND does not increase monotonically. This is due to the OS scheduling threads to the same core. Although we set the affinity of each task to a different core, the lightweight locking of REWIND results in threads finishing so fast that the OS seems to ignore that hint and schedules threads with different affinities to the same core.

Memory fence sensitivity. Memory fence latency varies depending on the storage architecture. We show how we can mitigate its impact by grouping log records. We repeat the benchmark of Figure 7 with the fraction of update queries set to 1 (the worst case scenario). We compare REWIND Optimized, which supports in-place updates but no grouping, with REWIND Batch for varying group sizes, e.g., REWIND Batch 8 uses 8 records per memory fence; we also include variations of 16 and 32 records per group.
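To make the grouping idea concrete, the sketch below counts how many fences are issued when a single fence is shared by a group of log records. It is a minimal illustration under stated assumptions: `BatchedLog` and `fences_for` are hypothetical names, the log is an ordinary `std::vector`, and `std::atomic_thread_fence` is only a stand-in for a persistent memory fence (it orders memory operations but does not by itself guarantee durability on NVM). The write-ahead ordering between a fenced group and the corresponding data writes is not modeled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical batched log: one fence per group of appended records.
struct BatchedLog {
    std::vector<std::uint64_t> records;  // stand-in for the NVM-resident log
    std::size_t group_size;              // records sharing one fence
    std::size_t pending = 0;             // records appended since last fence
    std::size_t fences = 0;              // fences issued so far

    explicit BatchedLog(std::size_t g) : group_size(g) { assert(g > 0); }

    // Stand-in for a persistent memory fence; also closes the current group.
    void fence() {
        std::atomic_thread_fence(std::memory_order_seq_cst);
        ++fences;
        pending = 0;
    }

    // Append a record and fence only when the group is full.
    void append(std::uint64_t rec) {
        records.push_back(rec);
        if (++pending == group_size) fence();
    }

    // Fence any partially filled group, e.g. before dependent data writes.
    void flush() {
        if (pending > 0) fence();
    }
};

// Number of fences needed to log n_records with the given group size.
inline std::size_t fences_for(std::size_t n_records, std::size_t group_size) {
    BatchedLog log(group_size);
    for (std::size_t i = 0; i < n_records; ++i) log.append(i);
    log.flush();
    return log.fences;
}
```

For 100 records, `fences_for(100, 1)` issues 100 fences while `fences_for(100, 8)` issues 13, which is the kind of reduction that grouping trades against the size of the window that must be fenced together.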
Figure 10: Memory fence sensitivity.

Our results are shown in Figure 10. REWIND Optimized is affected and is slowed down by 5× while REWIND Batch has a slowdown of 1.63×, 1.32×, and 1.18× for group sizes of 8, 16, and 32 respectively. We can therefore mitigate the fence cost of different storage architectures by tuning the group size. We also tested Stasis, a disk-replacement solution, which remained unaffected as expected. These results are in line with previous work [24]. In REWIND the benefits of the optimizations of Section 3.3 are twofold as they mitigate the cost of the fence and also reduce the write overhead. Due to the lack of pages REWIND does not need to restrict the transactions of the group, i.e., force all transactions in the group to commit or abort.

5.3 TPC-C

We use a variant of the TPC-C benchmark to: (a) stress-test REWIND; and (b) show that by collapsing the boundaries between the in-memory and the persistent representations we can improve performance by co-designing the algorithms and the physical data layout. We implement the TPC-C schema with B+-trees for table storage and focus on the new order transaction. We use a scaling factor of one and use ten threads on our test machine to simulate the ten terminals issuing new order transactions, which is a slight deviation from TPC-C where a terminal can choose among five types of transactions. However, our goal is to measure the overhead in write-intensive operations and not test the features of a full-blown DBMS. Thus, the new order transaction is the best choice as it is the most write-intensive TPC-C transaction and the backbone of the entire workload.

We use four data layouts: standard persistent but not recoverable B+-trees in NVM; naive B+-trees over REWIND; an optimized layout of B+-trees over REWIND to represent compound keys; and the latter with a distributed log [24]. For REWIND with optimized B+-trees, we use an array of B+-trees to represent a table with a compound key. For the order tables (orders, order_line, and new_order), instead of having a B+-tree with a compound key on (warehouse_id, district_id, order_id) per table, we noted that the domains of warehouse and district consisted of one and ten values respectively as there are ten districts in a single warehouse. Thus, we build an array of ten B+-trees, each on order_id.

In REWIND, the use of distributed logging is up to the user. Using a single transaction manager for all transactions dictates a shared log, while a per-transaction manager implies a distributed log. This flexibility further enables co-design: through the persistence and recovery guarantees of REWIND, programmers can optimize the data structures and the implementation of transactions.

As per the TPC-C specifications, we abort 1% of transactions. In REWIND these transactions are rolled back while in the standard NVM version they are considered non-recoverable and ignored: this adds a significant overhead to the REWIND B+-tree. We do not compare to other systems as REWIND significantly outperformed them earlier.

Figure 11: TPC-C throughput.

In Figure 11, the non-recoverable implementation with naive data structures has a throughput of 273k transactions per minute (tpm). The optimized implementation over REWIND yields a throughput of 197k tpm for a 1.39× overhead. This highlights the potential of co-design: REWIND enables program-level workload-specific optimizations as persistence and recoverability need not be offloaded to a different runtime. Distributed logging improves the throughput even more to 262k tpm and a 1.05× overhead. REWIND with a naive implementation of the data structures gives a throughput of 37k tpm for a slowdown of 7.37× over the non-recoverable NVM version. This performance is in line with the microbenchmark results of Section 5.1 and the results of [24] for distributed logging.

6. RELATED WORK

Persistent virtual memory [26, 34] has received renewed interest through persistent regions [14]. Such attempts employ block-level I/O devices and file abstractions. Recoverability relies on staging persistence and logging through combining volatile main memory and persistent disk storage. Closer to our work, [19] uses battery-backed DRAM for persisting the file cache [3], but ultimately relies on I/O and uses a coarse-grained region approach to undo logging.

Two recent proposals [5, 31] provide NVM heaps to applications. We leverage this to support in-memory persistent data structures. Both [5, 31] only provide primitives for programmers to create and manage their own recovery protocols. Fang et al. [10] propose an NVM-based log manager for DBMSs, which, unlike our approach, relies on a client-server design and uses epoch barriers to guarantee persistence. Giles et al. [12] address embedded transaction management in user code, but unlike our work they require custom hardware to force the redo log to NVM before committing, while keeping user updates in a dedicated buffer before persisting the log; they do not elaborate on recovery mechanisms. More recent work [13] has similar goals to REWIND, but does not go as far in addressing recovery and concurrency: it only performs redo logging without in-place updates. Similarly, [35] embeds transaction management in user code but it assumes the existence of a non-volatile cache that it uses instead of logging. Finally, [2] studies the update semantics of NVM data in lock-based code (as opposed to transactional code) and touches only superficially on the mechanisms used for logging and recovery in NVM.

Prior to NVM, researchers proposed battery-backed DRAM and buffer manager extensions to support recoverability [21]. For instance, [6] uses battery-backed DRAM with an ARIES-like protocol, but, unlike us, it still assumes page-level I/O for data and log updates. DBMSs optimized for volatile memory [8, 17] are also relevant. These significantly improve on disk-based alternatives but are still suboptimal for NVM as they are subject to the inefficiencies of a block-based design towards durability. Pelley et al. [24] propose distributed logging and group commits for mitigating the memory fence latency in NVM.
These are orthogonal to REWIND and we examine their effects in Sections 5.2 and 5.3, respectively. Similarly, [33] examines how NVM allows practical distributed logging. Unlike our approach, [33] targets page-level data and log updates. We compared this to REWIND in Section 5.2.

Recent work has considered more lightweight data management than full-blown DBMSs. For instance, [27] compares both DBMSs and filesystems to custom alternatives, while [15] quantifies the overhead of several DBMS functionalities. There has also been recent interest in similarly extending filesystems. For example, [23] presents an extended transactional and recoverable I/O interface for multiple, non-consecutive blocks. These still present a block interface to programmers unlike our byte-addressable approach. Finally, there has also been work on query processing algorithms, e.g., [4, 30], for NVM. These differ from our approach as they assume a complete DBMS instead of our programmer-managed data structures.

7. CONCLUSIONS AND FUTURE WORK

New NVM technologies allow programmers to maintain a single copy of their persistent data structures in main memory and access them directly with CPU loads and stores. This renders transactional recovery mechanisms based on block I/O and the separation of volatile and non-volatile data inappropriate. We presented REWIND, a user-mode library that directly manages persistent data structures in NVM in a recoverable way. The library provides a simple API and transparently handles recovery of critical data. Our results show that REWIND outperforms I/O-based solutions at a minimal overhead, thereby providing a promising path toward enabling persistent in-memory data structures.

As this is a fresh research area, there is more work to be done. Our overarching goal is to embed REWIND into a compiler framework à la software transactional memory. Further performance benefits will likely come if we implement the basic log structure using lock-free techniques. Another goal is to introduce autotuning so that the system adapts to the workload through monitoring.

Acknowledgments. The authors would like to thank the anonymous reviewers for their comments and the authors of [33] for their implementation of Shore-MT. This work was supported by the Intel University Research Office.

8. REFERENCES

[1] Apple Developer Library. Core Data Programming Guide, 2014.
[2] D. Chakrabarti and H.-J. Boehm. Durability semantics for lock-based multithreaded programs. In HOTPAR, 2013.
[3] P. M. Chen et al. The Rio file cache: Surviving operating system crashes. In ASPLOS, 1996.
[4] S. Chen et al. Rethinking database algorithms for phase change memory. In CIDR, 2011.
[5] J. Coburn et al. NV-heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In ASPLOS, 2011.
[6] G. Copeland et al. The case for safe RAM. In VLDB, 1989.
[7] B. Cornell et al. Wayback: A user-level versioning file system for Linux. In ATC, 2004.
[8] C. Diaconu et al. Hekaton: SQL server's memory-optimized OLTP engine. In SIGMOD, 2013.
[9] S. R. Dulloor et al. System software for persistent memory. In EuroSys, 2014.
[10] R. Fang et al. High performance database logging using storage class memory. In ICDE, 2011.
[11] P. Felber et al. Transactifying applications using an open compiler framework. In TRANSACT, 2007.
[12] E. Giles et al. Bridging the programming gap between persistent and volatile memory using WrAP. In CF, 2013.
[13] E. Giles et al. Software support for atomicity and persistence in non-volatile memory. In MEAOW, 2013.
[14] J. Guerra et al. Software persistent memory. In ATC, 2012.
[15] S. Harizopoulos et al. OLTP through the looking glass, and what we found there. In SIGMOD, 2008.
[16] D. R. Hipp et al. SQLite Database, 2014.
[17] R. Kallman et al. H-Store: a high-performance, distributed main memory transaction processing system. PVLDB, 1(2), 2008.
[18] Linux Kernel. Linux Programmer's Manual, 2014.
[19] D. E. Lowell and P. M. Chen. Free transactions with Rio Vista. In SOSP, 1997.
[20] C. Mohan et al. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM TODS, 17(1), 1992.
[21] W. T. Ng and P. M. Chen. Integrating reliable memory in databases. In VLDB, 1997.
[22] Oracle Corporation. Oracle Berkeley DB 11g, 2014.
[23] X. Ouyang et al. Beyond block I/O: Rethinking traditional storage primitives. In HPCA, 2011.
[24] S. Pelley et al. Storage management in the NVRAM era. PVLDB, 7(2), 2014.
[25] M. K. Qureshi et al. Phase Change Memory: from devices to systems. Morgan & Claypool, 2012.
[26] M. Satyanarayanan et al. Lightweight recoverable virtual memory. In SOSP, 1993.
[27] R. Sears and E. Brewer. Stasis: Flexible transactional storage. In OSDI, 2006.
[28] R. P. Spillane et al. Enabling transactional file access via lightweight kernel extensions. In FAST, 2009.
[29] S. Venkataraman et al. Consistent and durable data structures for non-volatile byte-addressable memory. In FAST, 2011.
[30] S. D. Viglas. Write-limited sorts and joins for persistent memory. PVLDB, 7(5), 2014.
[31] H. Volos et al. Mnemosyne: Lightweight persistent memory. In ASPLOS, 2011.
[32] C. Wang et al. Code generation and optimization for transactional memory constructs in an unmanaged language. In CGO, 2007.
[33] T. Wang and R. Johnson. Scalable logging through emerging non-volatile memory. PVLDB, 7(10), 2014.
[34] M. Wu and W. Zwaenepoel. eNVy: A non-volatile, main memory storage system. In ASPLOS, 1994.
[35] J. Zhao et al. Kiln: Closing the performance gap between systems with and without persistence support. In MICRO, 2013.