Steal but No Force: Efficient Hardware Undo+Redo Logging for Persistent Memory Systems

Matheus Almeida Ogleari, Ethan L. Miller, Jishen Zhao
University of California, Santa Cruz; Pure Storage; University of California, San Diego
{mogleari,elm,jishen.zhao}@ucsc.edu

2018 IEEE International Symposium on High Performance Computer Architecture. 2378-203X/18/$31.00 ©2018 IEEE. DOI 10.1109/HPCA.2018.00037

Abstract

Persistent memory is a new tier of memory that functions as a hybrid of traditional storage systems and main memory. It combines the benefits of both: the data persistence of storage with the fast load/store interface of memory. Most previous persistent memory designs place careful control over the order of writes arriving at persistent memory. This can prevent caches and memory controllers from optimizing system performance through write coalescing and reordering. We identify that such write-order control can be relaxed by employing undo+redo logging for data in persistent memory systems. However, traditional software logging mechanisms are expensive because of their performance and energy overheads. Previously proposed hardware logging schemes are inefficient and do not fully address the issues in software. To address these challenges, we propose a hardware undo+redo logging scheme which maintains data persistence by leveraging the write-back, write-allocate policies used in commodity caches. Furthermore, we develop a cache force-write-back mechanism in hardware to significantly reduce the performance and energy overheads from forcing data into persistent memory. Our evaluation across persistent memory microbenchmarks and real workloads demonstrates that our design reduces both dynamic energy and memory traffic. It also provides strong consistency guarantees compared to software approaches.

I. INTRODUCTION

Persistent memory presents a new tier of data storage components for future computer systems. By attaching Non-Volatile Random-Access Memories (NVRAMs) [1], [2], [3], [4] to the memory bus, persistent memory unifies memory and storage systems. NVRAM offers the fast load/store access of memory with the data recoverability of storage in a single device. Consequently, hardware and software vendors recently began adopting persistent memory techniques in their next-generation designs. Examples include Intel's ISA support, ARM's new cache write-back instruction [6], Microsoft's storage class memory support in Windows OS and in-memory databases [7], [8], Red Hat's persistent memory support in the Linux kernel [9], and Mellanox's persistent memory support over fabric [10].

Though promising, persistent memory fundamentally changes current memory and storage system design assumptions. Reaping its full potential is challenging. Previous persistent memory designs introduce large performance and energy overheads compared to native memory systems that do not enforce consistency [11], [12], [13]. A key reason is the write-order control used to enforce data persistence. Typical caches and memory controllers coalesce and reorder writes to optimize system performance [14], [15], [16], [13]. However, most previous persistent memory designs employ memory barriers and forced cache write-backs (or cache flushes) to enforce the order of persistent data arriving at NVRAM. This write-order control is sub-optimal for performance and does not consider natural caching and memory scheduling mechanisms.

Several recent studies strive to relax write-order control in persistent memory systems [15], [16], [13]. However, these studies either impose substantial hardware overhead by adding NVRAM caches in the processor [13] or fall back to low-performance modes once certain bookkeeping resources are saturated.

Our goal in this paper is to design a high-performance persistent memory system without (i) an NVRAM cache or buffer in the processor, (ii) falling back to a low-performance mode, or (iii) interfering with the write reordering by caches and memory controllers. Our key idea is to maintain data persistence with a combined undo+redo logging scheme in hardware.

Undo+redo logging stores both old (undo) and new (redo) values in the log during a persistent data update. It offers a key benefit: relaxing the write-order constraints on caching persistent data in the processor. In our paper, we show that undo+redo logging can maintain data persistence without needing strict write-order control. As a result, the caches and memory controllers can reorder the writes like in traditional non-persistent memory systems (discussed in Section II-B).

Previous persistent memory systems typically implement either undo or redo logging in software. However, high-performance software undo+redo logging in persistent memory is unfeasible due to inefficiencies. First, software logging generates extra instructions in software, competing for limited hardware resources in the pipeline with other critical workload operations. Undo+redo logging can double the number of extra instructions over undo or redo logging alone. Second, logging introduces extra memory traffic in addition to working data access [13]. Undo+redo logging would impose more than double extra memory traffic in software. Third, the hardware states of caches are invisible to software. As a result, software undo+redo logging, an idea borrowed from database mechanisms designed to coordinate with software-managed caches, can only conservatively coordinate with hardware caches. Finally, with multithreaded workloads, context switches by the operating system (OS) can interrupt the logging and persistent data updates. This can risk the data consistency guarantee in multithreaded environments (Section II-C discusses this further).

Several prior works investigated hardware undo or redo logging separately [17], [15] (Section VII). These designs have similar challenges such as hardware and energy overheads [17], and slowdown due to saturated hardware bookkeeping resources in the processor [15]. Supporting both undo and redo logging can further exacerbate the issues. Additionally, hardware logging mechanisms can eliminate the logging instructions in the pipeline, but the extra memory traffic generated from the log still exists.

To address these challenges, we propose a combined undo+redo logging scheme in hardware that allows persistent memory systems to relax the write-order control by leveraging existing caching policies. Our design consists of two mechanisms. First, a Hardware Logging (HWL) mechanism performs undo+redo logging by leveraging write-back write-allocate caching policies [14] commonly used in processors. Our HWL design causes a persistent data update to automatically trigger logging for that data. Whether a store generates an L1 cache hit or miss, its address, old value, and new value are all available in the cache hierarchy. As such, our design utilizes the cache block writes to update the log with word-size values. Second, we propose a cache Force Write-Back (FWB) mechanism to force write-backs of cached persistent working data at a much lower, yet more efficient, frequency than in software models. This frequency depends only on the allocated log size and NVRAM write bandwidth, thus decoupling cache force write-backs from transaction execution. We summarize the contributions of this paper as follows:

• This is the first paper to exploit the combination of undo+redo logging to relax ordering constraints on caches and memory controllers in persistent memory systems. Our design relaxes the ordering constraints in a way that undo logging, redo logging, or copy-on-write alone cannot.
• We enable efficient undo+redo logging for persistent memory systems in hardware, which imposes substantially more challenges than implementing either undo- or redo-logging alone.
• We develop a hardware-controlled cache force write-back mechanism, which significantly reduces the performance overhead of force write-backs by efficiently tuning the write-back frequency.
• We implement our design through lightweight software support and processor modifications.

II. BACKGROUND AND MOTIVATION

Persistent memory is fundamentally different from traditional DRAM main memory or their NVRAM replacement, due to its persistence (i.e., crash consistency) property inherited from storage systems. Persistent memory needs to ensure the integrity of in-memory data despite system crashes and power loss [18], [19], [20], [21], [16], [22], [23], [24], [25], [26], [27]. The persistence property is not guaranteed by memory consistency in traditional memory systems. Memory consistency ensures a consistent global view of processor caches and main memory, while persistent memory needs to ensure that the data in the NVRAM main memory is standalone consistent [16], [19], [22].

A. Persistent Memory Write-order Control

To maintain data persistence, most persistent memory designs employ transactions to update persistent data and carefully control the order of writes arriving in NVRAM [16], [19], [28]. A transaction (e.g., the code example in Figure 1) consists of a group of persistent memory updates performed in the manner of "all or nothing" in the face of system failures. Persistent memory systems also force cache write-backs (e.g., clflush, clwb, and dccvap) and use memory barrier instructions (e.g., mfence and sfence) throughout transactions to enforce write-order control [28], [15], [13], [19], [29].

Recent works strived to improve persistent memory performance towards a native non-persistent system [15], [16], [13]. In general, whether employing logging in persistent memory or not, most face similar problems. (i) They introduce nontrivial hardware overhead (e.g., by integrating NVRAM cache/buffers or substantial extra bookkeeping components in the processor) [13], [30]. (ii) They fall back to low-performance modes once the bookkeeping components or the NVRAM cache/buffer are saturated [13], [15]. (iii) They inhibit caches from coalescing and reordering persistent data writes [13] (details discussed in Section VII).

Forced cache write-backs ensure that cached data updates made by completed (i.e., committed) transactions are written to NVRAM. This ensures NVRAM is in a persistent state with the latest data updates. Memory barriers stall subsequent data updates until the previous updates by the transaction complete. However, this write-order control prevents caches from optimizing system performance via coalescing and reordering writes. The forced cache write-backs and memory barriers can also block or interfere with subsequent read and write requests that share the memory bus. This happens regardless of whether these requests are independent from the persistent data access or not [26], [31].

[Figure 1: pseudocode and timelines for a transaction in which "Write A" consists of N store instructions (store A1 ... store AN). Log writes (Ulog_A1 ... Ulog_AN, Rlog_A1 ... Rlog_AN) are uncacheable; working-data stores are cacheable.]

(a) Undo logging only
    Tx_begin
    do some reads
    do some computation
    Uncacheable_Ulog( addr(A), old_val(A) )
    write new_val(A)    // new_val(A) = A'
    clwb                // force writeback
    Tx_commit

(b) Redo logging only
    Tx_begin
    do some reads
    do some computation
    Uncacheable_Rlog( addr(A), new_val(A) )
    memory_barrier
    write new_val(A)    // new_val(A) = A'
    Tx_commit

(c) Undo+redo logging
    Tx_begin
    do some reads
    do some computation
    Uncacheable_log( addr(A), new_val(A), old_val(A) )
    write new_val(A)    // new_val(A) = A'
    clwb                // can be delayed
    Tx_commit
Figure 1. Comparison of executing a transaction in persistent memory with (a) undo logging, (b) redo logging, and (c) both undo and redo logging.

B. Why Undo+Redo Logging

While prior persistent memory designs only employ either undo or redo logging to maintain data persistence, we observe that using both can substantially relax the aforementioned write-order control placed on caches.

Logging in persistent memory. Logging is widely used in persistent memory designs [19], [29], [22], [15]. In addition to working data updates, persistent memory systems can maintain copies of the changes in the log. Previous designs typically employ either undo or redo logging. Figure 1(a) shows that an undo log records old versions of data before the transaction changes the value. If the system fails during an active transaction, the system can roll back to the state before the transaction by replaying the undo log. Figure 1(b) illustrates an example of a persistent transaction that uses redo logging. The redo log records new versions of data. After system failures, replaying the redo log recovers the persistent data with the latest changes tracked by the redo log. In persistent memory systems, logs are typically uncacheable because they are meant to be accessed only during the recovery. Thus, they are not reused during application execution. They must also arrive in NVRAM in order, which is guaranteed through bypassing the caches.

Benefits of undo+redo logging. Combining undo and redo logging (undo+redo) is widely used in disk-based database management systems (DBMSs) [32]. Yet, we find that we can leverage this concept in persistent memory design to relax the write-order constraints on the caches.

Figure 1(a) shows that uncacheable, store-granular undo logging can eliminate the memory barrier between the log and working data writes. As long as the log entry (Ulog_A1) is written into NVRAM before its corresponding store to the working data (store A1), we can undo the partially completed store after a system failure. Furthermore, store A1 must traverse the cache hierarchy. The uncacheable Ulog_A1 may be buffered (e.g., in a four to six cache-line sized entry write-combining buffer in x86 processors). However, it still requires much less time to get out of the processor than cached stores. This naturally maintains the write ordering without explicit memory barrier instructions between the log and the persistent data writes. That is, logging and working data writes are performed in a pipeline-like manner (like in the timeline in Figure 1(a)). This is similar to the "steal" attribute in DBMS [32], i.e., cached working data updates can steal the way into persistent storage before the transaction commits. However, a downside is that undo logging requires a forced cache write-back before the transaction commits. This is necessary if we want to recover the latest transaction state after system failures. Otherwise, the data changes made by the transaction will not be committed to memory.

Instead, redo logging allows transactions to commit without explicit cache write-backs because the redo log, once updates complete, already has the latest version of the transactions (Figure 1(b)). This is similar to the "no-force" attribute in DBMS [32], i.e., there is no need to force the working data updates out of the caches at the end of transactions. However, we must use memory barriers to complete the redo log of A before any stores of A reach NVRAM. We illustrate this ordering constraint by the dashed blue line in the timeline. Otherwise, a system crash when the redo logging is incomplete, while working data A is partially overwritten in NVRAM (by store Ak), causes data corruption.

Figure 1(c) shows that undo+redo logging combines the benefits of both "steal" and "no-force". As a result, we can eliminate the memory barrier between the log and persistent writes. A forced cache write-back (e.g., clwb) is unnecessary for an unlimited sized log. For a limited sized log, it can be postponed until after the transaction commits (Section II-C).
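The steal and no-force argument above can be made concrete with a small software model of recovery. This is a minimal sketch under stated assumptions, not the paper's recovery procedure: `LogRecord`, `recover`, and the explicit `committed` set are hypothetical names, the sketch assumes recovery already knows which transactions committed, and it assumes committed and uncommitted transactions do not touch the same address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    txid: int   # transaction that issued the store
    addr: int   # address of the updated word
    undo: int   # old value, read before the store
    redo: int   # new value carried by the store


def recover(nvram, log, committed):
    """Return a repaired copy of `nvram` (a dict mapping address -> value)."""
    state = dict(nvram)
    # Redo pass, oldest first: committed updates may still have been in the
    # volatile caches at the crash (no-force), so reapply their new values.
    for rec in log:
        if rec.txid in committed:
            state[rec.addr] = rec.redo
    # Undo pass, newest first: uncommitted updates may already have been
    # evicted to NVRAM (steal), so restore their old values.
    for rec in reversed(log):
        if rec.txid not in committed:
            state[rec.addr] = rec.undo
    return state


# Transaction 1 committed, but its new value (2) never left the cache.
# Transaction 2 did not commit, but its new value (7) was already evicted.
nvram_after_crash = {0x10: 1, 0x20: 7}
log = [LogRecord(1, 0x10, undo=1, redo=2), LogRecord(2, 0x20, undo=5, redo=7)]
recovered = recover(nvram_after_crash, log, committed={1})
print(recovered)
```

With only an undo log, the value written by transaction 1 would be unrecoverable unless it had been forced out before commit; with only a redo log, the early eviction by transaction 2 could not be rolled back. Holding both values in each record is what lets the caches evict or retain lines freely.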
[Figure 2: (a) software logging code and the micro-ops it generates; (b) a processor (core caches, shared cache, memory controller; volatile) and NVRAM (nonvolatile) whose log holds undo and redo records Ulog_A/Rlog_A and Ulog_B/Rlog_B. Ulog_C/Rlog_C are about to overwrite the records of A while Ak is still in caches, which requires a clwb of Ak.]

    Tx_begin(TxID)
    do some reads
    do some computation
    Uncacheable_log(addr(A), new_val(A), old_val(A))
    write new_val(A)    // A'
    clwb                // conservatively used
    Tx_commit

    Micro-ops: load A1, load A2, store log_A1, store log_A2, ...
    Micro-ops: store log_A1, store log_A2, ...

Figure 2. Inefficiency of logging in software.

C. Why Undo+Redo Logging in Hardware

Though promising, undo+redo logging is not used in persistent memory system designs because previous software logging schemes are inefficient (Figure 2).

Extra instructions in the CPU pipeline. Logging in software uses logging functions in transactions. Figure 2(a) shows that both undo and redo logging can introduce a large number of instructions into the CPU pipeline. As we demonstrate in our experimental results (Section VI), using only undo logging can more than double the number of instructions compared to memory systems without persistent memory. Undo+redo logging can introduce a prohibitively large number of instructions to the CPU pipeline, occupying compute resources needed for data movement.

Increased NVRAM traffic. Most instructions for logging are loads and stores. As a result, logging substantially increases memory traffic. In particular, undo logging must not only store to the log, but it must also first read the old values of the working data from the cache and memory hierarchy. This further increases memory traffic.

Conservative cache forced write-back. Logs can have a limited size.¹ Suppose that, without losing generality, a log can hold undo+redo records of two transactions (Figure 2(b)). To log a third transaction (Ulog_C and Rlog_C), we must overwrite an existing log record, say Ulog_A and Rlog_A (transaction A). If any updates of transaction A (e.g., Ak) are still in caches, we must force these updates into the NVRAM before we overwrite their log entry. The problem is that caches are invisible to software. Therefore, software does not know whether or which particular updates to A are still in the caches. Thus, once a log becomes full (after garbage collection), software may conservatively force cache write-backs before committing the transaction. This unfortunately negates the benefit of redo logging.

¹ Although we can grow the log size on demand, this introduces extra system overhead on managing variable size logs [19]. Therefore, we study fixed size logs in this paper.

Risks of data persistence in multithreading. In addition to the above challenges, multithreading further complicates software logging in persistent memory when a log is shared by multiple threads. Even if a persistent memory system issues clwb instructions in each transaction, a context switch by the OS can occur before the clwb instruction executes. This context switch interrupts the control flow of transactions and diverts the program to other threads. This reintroduces the aforementioned issue of prematurely overwriting the records in a filled log. Implementing per-thread logs can mitigate this risk. However, doing so can introduce a new persistent memory API and complicate recovery.

These inefficiencies expose the drawbacks of undo+redo logging in software and warrant a hardware solution.

III. OUR DESIGN

To address the challenges, we propose a hardware undo+redo logging design, consisting of Hardware Logging (HWL) and cache Force Write-Back (FWB) mechanisms. This section describes our design principles. We describe detailed implementation methods and the required software support in Section IV.

A. Assumptions and Architecture Overview

Figure 3(a) depicts an overview of our processor and memory architecture. The figure also shows the circular log structure in NVRAM. All processor components are completely volatile. We use write-back, write-allocate caches common to processors. We support hybrid DRAM+NVRAM for main memory, deployed on the processor-memory bus with separate memory controllers [19], [13]. However, this paper focuses on persistent data updates to NVRAM.

Failure Model. Data in DRAM and caches, but not in NVRAM, are lost across system reboots. Our design focuses on maintaining persistence of user-defined critical data stored in NVRAM. After failures, the system can recover this data by replaying the log in NVRAM. DRAM is used to store data without persistence [19], [13].

Persistent Memory Transactions. Like prior work in persistent memory [19], [22], we use persistent memory transactions as a software abstraction to indicate regions of memory that are persistent. Persistent memory writes require a persistence guarantee. Figure 2 illustrates a simple code example of a persistent memory transaction implemented with logging (Figure 2(a)), and with our design (Figure 2(b)). The transaction defines object A as critical data that needs a persistence guarantee. Unlike most logging-based persistent memory transactions, our transactions eliminate explicit logging functions, cache forced write-back instructions, and memory barrier instructions. We discuss our software interface design in Section IV.

Uncacheable Logs in the NVRAM. We use a single-consumer, single-producer Lamport circular structure [33] for the log. Our system software can allocate and truncate the log (Section IV). Our hardware mechanisms append to the log. We chose a circular log structure because it allows simultaneous appends and truncates without locking [33], [19]. Figure 3(a) shows that log records maintain undo and redo information of a single update (e.g., store A1). In addition to the undo (A1) and redo (A'1) values, log records also contain the following fields: a 16-bit transaction ID, an 8-bit thread ID, a 48-bit physical address of the data, and a torn bit. We use a torn bit per log entry to indicate the update is complete [19]. Torn bits have the same value for all entries in one pass over the log, but reverse when a log entry is overwritten. Thus, completely-written log records all have the same torn bit value, while incomplete entries have mixed values [19]. The log must accommodate all write requests of undo+redo.

The log is typically used during system recovery, and rarely reused during application execution. Additionally, log updates must arrive in NVRAM in store-order. Therefore, we make the log uncacheable. This is in line with most prior works, in which log updates are written directly into a write-combine buffer (WCB) [19], [31] that coalesces multiple stores to the same cache line.

[Figure 3: (a) Architecture overview: a volatile processor (cores, L1 caches, lower-level and last-level caches, cache controllers, memory controllers, log buffer), DRAM, and nonvolatile NVRAM holding the uncacheable circular log with head and tail pointers. Log entry fields: torn bit (1-bit), transaction ID (16-bit), thread ID (8-bit), address (48-bit), undo value (1-word), redo value (1-word). (b) In case of a store hit in L1 cache. (c) In case of a store miss in L1 cache, where A1 hits in a lower-level cache. Transaction code shown: Tx_begin(TxID); do some reads; do some computation; Write A; Tx_commit, where A'1, A'2, ... are new values to be written.]

Figure 3. Overview of the proposed hardware logging in persistent memory.

B. Hardware Logging (HWL)

The goal of our Hardware Logging (HWL) mechanism is to enable feasible undo+redo logging of persistent data in our microarchitecture. HWL also relaxes ordering constraints on caching in a manner that neither undo nor redo logging can. Furthermore, our HWL design leverages information naturally available in the cache hierarchy but not to the programmer or software. It does so without the performance overhead of unnecessary data movement or of executing logging, cache force-write-back, or memory barrier instructions in the pipeline.

Leveraging Existing Undo+Redo Information in Caches. Most processors' caches use write-back, write-allocate caching policies [34]. On a write hit, a cache only updates the cache line in the hitting level with the new values. A dirty bit in the cache tag indicates cache values are modified but not committed to memory. On a write miss, the write-allocate (also called fetch-on-write) policy requires the cache to first load (i.e., allocate) the entire missing cache line before writing new values to it. HWL leverages the write-back, write-allocate caching policies to feasibly enable undo+redo logging in persistent memory. HWL automatically triggers a log update on a persistent write in hardware. HWL records both redo and undo information in the log entry in NVRAM (shown in Figure 2(b)). We get the redo data from the currently in-flight write operation itself. We get the undo data from the write request's corresponding write-allocated cache line. If the write request hits in the L1 cache, we read the old value before overwriting the cache line and use that for the undo log. If the write request misses in the L1 cache, that cache line must first be allocated anyway, at which point we get the undo data in a similar manner. The log entry, consisting of a transaction ID, thread ID, the address of the write, and undo and redo values, is written out to the circular log in NVRAM using the head and tail pointers. These pointers are maintained in special registers described in Section IV.
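The log record fields and the way HWL pairs the old cached value with the in-flight new value can be illustrated with a short software model. This is only a sketch under stated assumptions: the field order inside the packed header, the helper names (`pack_header`, `unpack_header`, `CircularLog`, `hwl_store`), the initial torn-bit value, and the use of a Python dict as the cache are all hypothetical and stand in for hardware structures.

```python
TXID_BITS, TID_BITS, ADDR_BITS = 16, 8, 48


def pack_header(torn, txid, tid, addr):
    """Pack torn bit, transaction ID, thread ID and address into one integer."""
    limits = ((torn, 1), (txid, TXID_BITS), (tid, TID_BITS), (addr, ADDR_BITS))
    for value, bits in limits:
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in %d bits" % bits)
    header = torn
    header = (header << TXID_BITS) | txid
    header = (header << TID_BITS) | tid
    header = (header << ADDR_BITS) | addr
    return header


def unpack_header(header):
    """Inverse of pack_header: return (torn, txid, tid, addr)."""
    addr = header & ((1 << ADDR_BITS) - 1)
    header >>= ADDR_BITS
    tid = header & ((1 << TID_BITS) - 1)
    header >>= TID_BITS
    txid = header & ((1 << TXID_BITS) - 1)
    torn = header >> TXID_BITS
    return torn, txid, tid, addr


class CircularLog:
    """Fixed-size log; the torn bit flips each time the tail wraps around."""

    def __init__(self, n_entries):
        self.entries = [None] * n_entries
        self.tail = 0
        self.torn = 1

    def append(self, txid, tid, addr, undo, redo):
        header = pack_header(self.torn, txid, tid, addr)
        self.entries[self.tail] = (header, undo, redo)
        self.tail = (self.tail + 1) % len(self.entries)
        if self.tail == 0:
            self.torn ^= 1  # next pass over the log uses the opposite value


def hwl_store(cache, log, txid, tid, addr, new_value):
    """Model one persistent store: log undo+redo, then update the cache."""
    # The old value is already in the (hit or freshly allocated) cache line;
    # a line never seen before is modelled here as holding 0.
    old_value = cache.get(addr, 0)
    log.append(txid, tid, addr, old_value, new_value)
    cache[addr] = new_value
```

For example, after `cache = {0x1000: 5}` and `hwl_store(cache, CircularLog(2), 7, 1, 0x1000, 9)`, the first log entry carries undo value 5 and redo value 9 without any extra load being issued by the program, which is the point of reusing the write-allocate behaviour.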
Inherent Ordering Guarantee Between the Log and Data. Our design does not require explicit memory barriers to enforce that undo log updates arrive at NVRAM before their corresponding working data. The ordering is naturally ensured by how HWL performs the undo logging and working data updates. This includes i) the uncached log updates and cached working data updates, and ii) store-granular undo logging. The working data writes must traverse the cache hierarchy, but the uncacheable undo log updates do not. Furthermore, our HWL also provides an optional volatile log buffer in the processor, similar to the write-combining buffers in commodity processor design, that coalesces the log updates. We configure the number of log buffer entries based on cache access latency. Specifically, we ensure that the log updates write out of the log buffer before a cached store writes out of the cache hierarchy. Section IV-C and Section VI further discuss and evaluate this log buffer.

C. Decoupling Cache FWBs and Transaction Execution

Writes are seemingly persistent once their logs are written to NVRAM. In fact, we can commit a transaction once logging of that transaction is completed. However, this does not guarantee data persistence because of the circular structure of the log in NVRAM (Section II-A). Yet inserting cache write-back instructions (such as clflush and clwb) in software can impose substantial performance overhead (Section II-A). This further complicates data persistence support in multithreading (Section II-C).

We eliminate the need for forced write-back instructions and guarantee persistence in multithreaded applications by designing a cache Force-Write-Back (FWB) mechanism in hardware. FWB is decoupled from the execution of each transaction. Hardware uses FWB to force certain cache blocks to write back when necessary. FWB introduces a force write-back bit (fwb) alongside the tag and dirty bit of each cache line. We maintain a finite state machine in each cache block (Section IV-D) using the fwb and dirty bits. Caches already maintain the dirty bit: a cache line update sets the bit and a cache eviction (write-back) resets it. A cache controller maintains our fwb bit by scanning cache lines periodically. On the first scan, it sets the fwb bit in dirty cache blocks if unset. On the second scan, it forces write-backs in all cache lines with {fwb, dirty} = {1, 1}. If the dirty bit ever gets reset for any reason, the fwb bit also resets and no forced write-back occurs.

Our FWB design is also decoupled from software multithreading mechanisms. As such, our mechanism is impervious to software context switch interruptions. That is, when the OS requires the CPU to context switch, hardware waits until ongoing cache write-backs complete. The frequency of the forced write-backs can vary. However, forced write-backs must be faster than the rate at which log entries with uncommitted persistent updates are overwritten in the circular log. In fact, we can determine the force write-back frequency (associated with the scanning frequency) based on the log size and the NVRAM write bandwidth (discussed in Section IV-D). Our evaluation shows the frequency determination (Section VI).

D. Instant Transaction Commits

Previous designs require software or hardware memory barriers (and/or cache force-write-backs) at transaction commits to enforce write ordering of log updates (or persistent data) into NVRAM across consecutive transactions [13], [26]. Instead, our design gives transaction commits a "free ride". That is, no explicit instructions are needed. Our mechanisms also naturally enforce the order of intra- and inter-transaction log updates: we issue log updates in the order of writes to corresponding working data. We also write the log updates into NVRAM in the order they are issued (the log buffer is a FIFO). Therefore, log updates of subsequent transactions can only be written into NVRAM after current log updates are written and committed.

E. Putting It All Together

Figure 3(b) and (c) illustrate how our hardware logging works. Hardware treats all writes encompassed in persistent transactions (e.g., write A in the transaction delimited by tx_begin and tx_commit in Figure 2(b)) as persistent writes. Those writes invoke our HWL and FWB mechanisms. They work together as follows. Note that log updates go directly to the WCB or NVRAM if the system does not adopt the log buffer.

The processor sends writes of data object A (a variable or other data structure), consisting of new values of one or more cache lines {A'1, A'2, ...}, to the L1 cache. Upon updating an L1 cache line (e.g., from old value A1 to a new value A'1):

1) Write the new value (redo) into the cache line.
   a) If the update is the first cache line update of data object A, the HWL mechanism (which has the transaction ID and the address of A from the CPU) writes a log record header into the log buffer.
   b) Otherwise, the HWL mechanism writes the new value (e.g., A'1) into the log buffer.
2) Obtain the undo data from the old value in the cache line. This step runs parallel to Step-1.
   a) If the cache line write request hits in L1 (Figure 3(b)), the L1 cache controller immediately extracts the old value (e.g., A1) from the cache line before writing the new value. The cache controller reads the old value from the hitting line out of the cache read port and writes it into the log buffer in Step-3. No additional read instruction is necessary.
   b) If the write request misses in the L1 cache (Figure 3(c)), the cache hierarchy must write-allocate that cache block as is standard. The cache controller at a lower-level cache that owns that cache line extracts the old value (e.g., A1). The cache controller sends the extracted old value to the log buffer in Step-3.
3) Update the undo information of the cache line: the cache controller writes the old value of the cache line (e.g., A1) to the log buffer.
4) The L1 cache controller updates the cache line in the L1 cache. The cache line can be evicted via standard cache eviction policies without being subjected to data persistence constraints. Additionally, our log buffer is small enough to guarantee that log updates traverse through the log buffer faster than the cache line traverses the cache hierarchy (Section IV-D). Therefore, this step occurs without waiting for the corresponding log entries to arrive in NVRAM.
5) The memory controller evicts the log buffer entries to NVRAM in a FIFO manner. This step is independent from other steps.
6) Repeat Step-1-(b) through 5 if the data object A consists of multiple cache line writes. The log buffer coalesces the log updates of any writes to the same cache line.
7) After log entries of all the writes in the transaction are issued, the transaction can commit.
8) Persistent working data updates remain cached until they are written back to NVRAM by either normal eviction or our cache FWB.

F. Discussion

Types of Logging. Systems with non-volatile memory can adopt centralized [35] or distributed (e.g., per-thread) logs [36], [37]. Distributed logs can be more scalable than centralized logs in large systems from software's perspective. Our design works with either type of log. With centralized logging, each log record needs to maintain a thread ID, while distributed logs do not need to maintain this information in log records. With a centralized log, our hardware design effectively reduces the software overhead and can substantially improve system performance with real persistent memory workloads, as we show in our experiments. In addition, our design also allows systems to adopt alternative formats of distributed logs. For example, we can partition the physical address space into multiple regions and maintain a log per memory region. We leave the evaluation of such log implementations to our future work.

NVRAM Capacity Utilization. Storing an undo+redo log can consume more NVRAM space than either undo or redo alone. Our log uses a fixed-size circular buffer rather than doubling any previous undo or redo log implementation. The log size can trade off with the frequency of our cache FWB (Section IV). The software support discussed in Section IV-A allows users to determine the size of the log. Our FWB mechanism will adjust the frequency accordingly to ensure data persistence.

Lifetime of NVRAM Main Memory. The lifetime of the log region is not an issue. Suppose a log has 64K entries (4MB) and NVRAM (assuming phase-change memory) has a 200 ns write latency. Each entry will be overwritten once every 64K × 200 ns, i.e., roughly every 13 ms. If NVRAM endurance is 10^8 writes, a cell, even statically allocated to the log, will take about 1.3 × 10^6 s, or 15 days, to wear out, which is plenty of time for conventional NVRAM wear-leveling schemes to trigger [38], [39], [40]. In addition, our scheme has two impacts on overall NVRAM lifetime: logging normally leads to write amplification, but we improve NVRAM lifetime because our caches coalesce writes. The overall impact is likely slightly negative. However, wear-leveling will trigger before any damage occurs.

    void persistent_update(int threadid)
    {
        tx_begin(threadid);
        // Persistent data updates
        write A[threadid];
        tx_commit();
    }
    // ...
    int main()
    {
        // Executes one persistent
        // transaction per thread
        for (int i = 0; i < nthreads; i++)
            thread t(persistent_update, i);
    }

Figure 4. Pseudocode example for tx_begin and tx_commit, where the thread ID is used as the transaction ID to perform one persistent transaction per thread.

IV. IMPLEMENTATION

In this section, we describe the implementation details of our design and hardware overhead. We covered the impact of NVRAM space consumption, lifetime, and endurance in Section III-F.

A. Software Support

Our design has software support for defining persistent memory transactions, allocating and truncating the circular log in NVRAM, and reserving a special character as the log header indicator.

Transaction Interface. We use a pair of transaction functions, tx_begin(txid) and tx_commit(), that define transactions which do persistent writes in the program. We use txid to provide the transaction ID information used by our HWL mechanism. This ID groups writes from the same transaction. This transaction interface has been used by numerous previous persistent memory designs [13], [29]. Figure 4 shows an example of multithreaded pseudocode with our transaction functions.
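As an illustration of the transaction interface, the Figure 4 pseudocode can be mirrored in a small runnable model. This is a hedged sketch, not the system library: in the proposed design the hardware performs the logging, whereas here a hypothetical `persistent_write` helper stands in for it, a thread-local variable stands in for the special register holding the transaction ID, and the `log`, `A`, and `run` names are assumptions made for the example.

```python
import threading

_current = threading.local()   # per-thread stand-in for the txid register
_lock = threading.Lock()
log = []                       # stand-in for the hardware-maintained log
A = {}                         # stand-in for persistent data object A


def tx_begin(txid):
    """Mark the start of a persistent transaction on this thread."""
    _current.txid = txid


def tx_commit():
    """Mark the end of the transaction; no flush or barrier is issued."""
    _current.txid = None


def persistent_write(data, key, value):
    """Tag a write with the active txid and record (txid, key, undo, redo)."""
    txid = getattr(_current, "txid", None)
    if txid is None:
        raise RuntimeError("persistent write outside a transaction")
    with _lock:
        log.append((txid, key, data.get(key), value))
        data[key] = value


def persistent_update(thread_id):
    tx_begin(thread_id)
    persistent_write(A, thread_id, thread_id * 10)
    tx_commit()


def run(nthreads):
    """Execute one persistent transaction per thread, as in Figure 4."""
    threads = [threading.Thread(target=persistent_update, args=(i,))
               for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `run(4)` leaves one log record per thread, each tagged with that thread's transaction ID. The program itself contains no logging, flush, or barrier calls between `tx_begin` and `tx_commit`, which is the property the interface is meant to expose.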
System Library Functions Maintain the Log. Our HWL mechanism performs log updates, while the system software maintains the log structure. In particular, we use system library functions, log_create() and log_truncate() (similar to functions used in prior work [19]), to allocate and truncate the log, respectively. The system software sets the log size. The memory controller obtains log maintenance information by reading special registers (Section IV-B), indicating the head and tail pointers of the log. Furthermore, a single transaction that exceeds the originally allocated log size can corrupt persistent data. We provide two options to prevent overflows: 1) The log_create() function allocates a large-enough log by reading the maximum transaction size from the program interface (e.g., #define MAX_TX_SIZE N); 2) An additional library function log_grow() allocates additional log regions when the log is filled by an uncommitted transaction.

B. Special Registers

The txid argument from tx_begin() translates into an 8-bit unsigned integer (a physical transaction ID) stored in a special register in the processor. Because the transaction IDs group writes of the same transactions, we can simply pick a not-in-use physical transaction ID to represent a newly received txid. An 8-bit length can accommodate 256 unique active persistent memory transactions at a time. A physical transaction ID can be reused after the transaction commits.

We also use two 64-bit special registers to store the head and tail pointers of the log. The system library initializes the pointer values when allocating the log using log_create(). During log updates, the memory controller and the log_truncate() function update the pointers. If log_grow() is used, we employ additional registers to store the head and tail pointers of newly allocated log regions and an indicator of the active log region.

C. An Optional Volatile Log Buffer

To improve performance of log updates to NVRAM, we provide an optional log buffer (a volatile FIFO, similar to WCB) in the memory controller to buffer and coalesce log updates. This log buffer is not required for ensuring data persistence, but only for performance optimization.

Data persistence requires that log records arrive at NVRAM before the corresponding cache line with the working data. Without the log buffer, log updates are directly forced to the NVRAM bus without buffering in the processor. If we choose to adopt a log buffer with N entries, a log entry will take N cycles to reach the NVRAM bus. A data store sent to the L1 cache takes at least the latency (cycles) of all levels of cache access and memory controller queues before reaching the NVRAM bus. The minimum value of this latency is known at design time. Therefore, we can ensure that log updates arrive at the NVRAM bus before the corresponding data stores by designing N to be smaller than the minimum number of cycles for a data store to traverse through the cache hierarchy. Section VI evaluates the bound of N and system performance across various log buffer sizes based on our system configurations.

D. Cache Modifications

To implement our cache force write-back scheme, we add one fwb bit to the tag of each cache line, alongside the dirty bit as in conventional cache implementations. FWB maintains three states (IDLE, FLAG, and FWB) for each cache block using these state bits.

Cache Block State Transition. Figure 5 shows the finite-state machine for FWB, implemented in the cache controller of each level. When an application begins executing, cache controllers initialize (reset) each cache line to the IDLE state by setting the fwb bit to 0. Standard cache implementation also initializes dirty and valid bits to 0. During application execution, baseline cache controllers naturally set the dirty and valid bits to 1 whenever a cache line is written and reset the dirty bit back to 0 after the cache line is written back to a lower level (typically on eviction). To implement our state machine, the cache controllers periodically scan the valid, dirty, and fwb bits of each cache line and perform the following.

• A cache line with {fwb, dirty} = {0, 0} is in the IDLE state; the cache controller does nothing to those cache lines.
• A cache line with {fwb, dirty} = {0, 1} is in the FLAG state; the cache controller sets the fwb bit to 1. This indicates that the cache line needs a write-back during the next scanning iteration if it is still in the cache.
• A cache line with {fwb, dirty} = {1, 1} is in the FWB state; the cache controller force writes back this line. After the forced write-back, the cache controller changes the line back to the IDLE state by resetting {fwb, dirty} = {0, 0}.
• If a cache line is evicted from the cache at any point, the cache controller resets its state to IDLE.

[Figure 5: state diagram with states {fwb, dirty} = {0, 0}, {0, 1}, and {1, 1}; transitions labeled reset, cache line write, set fwb=1, force-write-back, write-back, and not dirty.]

Figure 5. State machine in cache controller for FWB.

Determining the Cache FWB Frequency. The tag scanning frequency determines the frequency of our cache force write-back operations. FWB must occur frequently enough to ensure that the working data is written back to NVRAM before its log record
sareoverwrittenbynewerupdates.Asaresult
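As an illustration only, the scan-driven state transitions described above (IDLE, FLAG, FWB) can be modeled in a few lines of Python. This is a minimal software sketch, not the hardware implementation: the names CacheLine, state, write, evict, and scan are hypothetical, a forced write-back is represented by recording the line index, and clearing fwb on a line with {fwb, dirty} = {1, 0} is an assumption, since the text does not spell out that case.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheLine:
    """Per-line tag bits; reset leaves every bit at 0 (IDLE)."""
    valid: bool = False
    dirty: bool = False
    fwb: bool = False


def state(line: CacheLine) -> str:
    """Map the {fwb, dirty} bits to the three states named in the text."""
    if not line.dirty:
        return "IDLE"
    return "FWB" if line.fwb else "FLAG"


def write(line: CacheLine) -> None:
    """A store to the line: the baseline controller sets valid and dirty."""
    line.valid = True
    line.dirty = True


def evict(line: CacheLine) -> None:
    """Eviction resets the line to IDLE."""
    line.valid = False
    line.dirty = False
    line.fwb = False


def scan(lines: List[CacheLine]) -> List[int]:
    """One periodic tag scan; returns the indices that were force-written-back."""
    written_back = []
    for index, line in enumerate(lines):
        if not line.dirty:
            line.fwb = False          # assumption: a clean line returns to IDLE
        elif not line.fwb:
            line.fwb = True           # FLAG: mark for write-back on the next scan
        else:
            written_back.append(index)  # FWB: stand-in for the forced write-back
            line.fwb = False
            line.dirty = False
    return written_back
```

Under this model, a line written just after one scan is flagged by the next scan and forced back by the one after that, so a dirty line remains in the cache for at most two scan intervals.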
As a result, the more frequent the write requests, the more frequently the log will be overwritten. The larger the log, the less frequently the log will be overwritten. Therefore, the scanning frequency is determined by the maximum log update frequency (bounded by NVRAM write bandwidth, since applications cannot write to the NVRAM faster than its bandwidth) and the log size (see the sensitivity study in Section VI). To accommodate large cache sizes with low scanning performance overhead, we also grow the size of the log to reduce the scanning frequency accordingly.

E. Summary of Hardware Overhead

Table I presents the hardware overhead of our implemented design in the processor. Note that these values may vary depending on the native processor and ISA. Our implementation assumes a 64-bit machine, which is why the circular log head and tail pointers are 8 bytes. Only half of these bytes are required in a 32-bit machine. The size of the log buffer varies based on the size of the cache line. The size of the overhead needed for the fwb state varies with the total number of cache lines at all levels of cache. This is much lower than previous studies that track transaction information in cache tags [13]. The numbers in the table were computed based on the specifications of all our system caches described in Section V.

Mechanism                    Logic Type    Size
Transaction ID register      flip-flops    1 Byte
Log head pointer register    flip-flops    8 Bytes
Log tail pointer register    flip-flops    8 Bytes
Log buffer (optional)        SRAM          964 Bytes
Fwb tag bit                  SRAM          768 Bytes
Table I. SUMMARY OF MAJOR HARDWARE OVERHEAD.

Note that these are major state logic components on-chip. Our design also requires additional gates for logic operations. However, these gates are primarily small and medium-sized gates, on the same complexity level as a multiplexer or decoder.

F. Recovery

We outline the steps of recovering the persistent data in systems that adopt our design.

Step 1: Following a power failure, the first step is to obtain the head and tail pointers of the log in NVRAM. These pointers are part of the log structure. They allow systems to correctly order the log entries. We use only one centralized circular log for all transactions for all threads.

Step 2: The system recovery handler fetches log entries from NVRAM and uses the address, old value, and new value fields to generate writes to NVRAM at the addresses specified. The addresses are maintained via the page table in NVRAM. We identify which writes did not commit by tracing back from the tail pointer. Log entries with mismatched values in NVRAM are considered non-committed. The address stored with each entry corresponds to the address of the persistent data member. Aside from the head and tail pointers, we also use the torn bit to correctly order these writes [19]. Log entries with the same txid and torn bit are complete.

Step 3: The generated writes bypass the caches and go directly to NVRAM. We use volatile caches, so their states are reset and all generated writes on recovery are persistent. Therefore, they can bypass the caches without issue.

Step 4: We update the head and tail pointers of the circular log for each generated persistent write. After all updates from the log are redone (or undone), the head and tail pointers of the log point to entries to be invalidated.

V. EXPERIMENTAL SETUP

We evaluate our design by implementing it in McSimA+ [41], a Pin-based [42] cycle-level multi-core simulator. We configure the simulator to model a multi-core out-of-order processor with NVRAM DIMM described in Table II. Our simulator also models additional memory traffic for logging and clwb instructions.

Processor            Similar to Intel Core i7 / 22 nm
Cores                4 cores, 2.5 GHz, 2 threads/core
IL1 Cache            32 KB, 8-way set-associative, 64 B cache lines, 1.6 ns latency
DL1 Cache            32 KB, 8-way set-associative, 64 B cache lines, 1.6 ns latency
L2 Cache             8 MB, 16-way set-associative, 64 B cache lines, 4.4 ns latency
Memory Controller    64-/64-entry read/write queues
NVRAM DIMM           8 GB, 8 banks, 2 KB row; 36 ns row-buffer hit, 100/300 ns read/write row-buffer conflict [44]
Power and Energy     Processor: 149 W (peak); NVRAM: row buffer read (write): 0.93 (1.02) pJ/bit, array read (write): 2.47 (16.82) pJ/bit [44]
Table II. PROCESSOR AND MEMORY CONFIGURATIONS.

Name          Memory Footprint    Description
Hash [29]     256 MB              Searches for a value in an open-chain hash table. Insert if absent, remove if found.
RBTree [13]   256 MB              Searches for a value in a red-black tree. Insert if absent, remove if found.
SPS [13]      1 GB                Random swaps between entries in a 1 GB vector of values.
BTree [45]    256 MB              Searches for a value in a B+ tree. Insert if absent, remove if found.
SSCA2 [46]    16 MB               A transactional implementation of SSCA 2.2, performing several analyses of a large, scale-free graph.
Table III. A LIST OF EVALUATED MICROBENCHMARKS.

We feed the performance simulation results into McPAT [43], a widely used architecture-level power and area modeling tool, to estimate processor dynamic energy consumption. We modify the McPAT processor configuration to model our hardware modifications, including the components added to support HWL and FWB. We adopt phase-change memory parameters in the NVRAM DIMM [44]. Because all of our performance numbers shown in Section VI are relative, the same observations are valid for different NVRAM latency and access energy.
Our work focuses on improving persistent memory access, so we do not evaluate DRAM access in our experiments.

We evaluate both microbenchmarks and real workloads in our experiments. The microbenchmarks repeatedly update persistent memory storing to different data structures including hash table, red-black tree, array, B+ tree, and graph. These are data structures widely used in storage systems [29]. Table III describes these benchmarks. Our experiments use multiple versions of each benchmark and vary the data type between integers and strings within them. Data structures with integer elements pack less data (smaller than a cache line) per element, whereas those with strings require multiple cache lines per element. This allows us to explore complex structures used in real-world applications. In our microbenchmarks, each transaction performs an insert, delete, or swap operation. The number of transactions is proportional to the data structure size, listed as memory footprint in Table III. We compile these benchmarks in native x86 and run them on the McSimA+ simulator. We evaluate both single-threaded and multithreaded versions of each benchmark. In addition, we evaluate the set of real workload benchmarks from the WHISPER persistent memory benchmark suite [11]. The benchmark suite incorporates various workloads, such as key-value stores, in-memory databases, and persistent data caching, which are likely to benefit from future persistent memory techniques.

VI. RESULTS

We evaluate our design in terms of transaction throughput, instructions per cycle (IPC), instruction count, NVRAM traffic, and dynamic energy consumption. Our experiments compare among the following cases.
• non-pers: This uses NVRAM as a working memory without any data persistence or logging. This configuration yields an ideal yet unachievable performance for persistent memory systems [13].
• unsafe-base: This uses software logging without forced cache write-backs. As such, it does not guarantee data persistence (hence "unsafe"). Note that the dashed lines in our figures show the best case achieved between either redo or undo logging for that benchmark.
• redo-clwb and undo-clwb: Software redo and undo logging, respectively. These invoke the clwb instruction to force cache write-backs after persistent transactions.
• hw-rlog and hw-ulog: Hardware redo or undo logging with no persistence guarantee (as in unsafe-base). These show an extremely optimized performance of hardware undo or redo logging [13].
• hwl: This design includes undo+redo logging from our hardware logging (HWL) mechanism, but uses the clwb instruction to force cache write-backs.
• fwb: This is the full implementation of our hardware undo+redo logging design with both HWL and FWB.

A. Microbenchmark Results

We make the following major observations of our microbenchmark experiments and analyze the results. We evaluate benchmark configurations from single to eight threads. The prefixes of these results correspond to one (-1t), two (-2t), four (-4t), and eight (-8t) threads.

System Performance and Energy Consumption. Figure 6 and Figure 8 compare the transaction throughput and memory dynamic energy of each design. We observe that processor dynamic energy is not significantly altered by different configurations. Therefore, we only show memory dynamic energy in the figure. The figures illustrate that hwl alone improves system throughput and dynamic energy consumption, compared with software logging. Note that our design supports undo+redo logging, while the evaluated software logging mechanisms only support either undo or redo logging, not both. Fwb yields higher throughput and lower energy consumption: overall, it improves throughput by 1.86× with one thread and 1.75× with eight threads, compared with the better of redo-clwb and undo-clwb. The SSCA2 and BTree benchmarks show less throughput and energy improvement over software logging. This is because SSCA2 and BTree use more complex data structures, where the overhead of manipulating the data structures outweighs that of the log structures. Figure 9 shows that our design substantially reduces NVRAM writes.

The figures also show that unsafe-base, redo-clwb, and undo-clwb significantly degrade throughput by up to 59% and impose up to 62% memory energy overhead compared with the ideal case non-pers. Our design brings system throughput back up. Fwb achieves 1.86× throughput, with only 6% processor-memory and 20% dynamic memory energy overhead, respectively. Furthermore, our design's performance and energy benefits over software logging remain as we increase the number of threads.

IPC and Instruction Count. We also study IPC and the number of executed instructions, shown in Figure 7. Overall, hwl and fwb significantly improve IPC over software logging. This appears promising because the figure shows our hardware logging design executes far fewer instructions. Compared with non-pers, software logging imposes up to 2.5× the number of instructions executed. Our design fwb only imposes a 30% instruction overhead.

Performance Sensitivity to Log Buffer Size. Section IV-C discusses how the log buffer size is bounded by the data persistence requirement.
The log updates must arrive at NVRAM before their corresponding working data updates. This bound is 15 entries based on our processor configuration. Indeed, larger log buffers better improve throughput, as we studied using the hash benchmark (Figure 11(a)). An 8-entry log buffer improves system throughput by 10%; our implementation with a 15-entry log buffer improves throughput by 18%. Further increasing the log buffer size, which may no longer guarantee data persistence, additionally improves system throughput until reaching the NVRAM write bandwidth limitation (64 entries based on our NVRAM configuration). Note that the system throughput results with 128 and 256 entries are generated assuming infinite NVRAM write bandwidth. We also improve throughput over baseline hardware logging hw-rlog and hw-ulog.

Relation Between FWB Frequency and Log Size. Section IV-D discusses that the force write-back frequency is determined by the NVRAM write bandwidth and log size. With a given NVRAM write bandwidth, we study the relation between the required FWB frequency and log size. Figure 11(b) shows that we only need to perform forced write-backs every three million cycles if we have a 4 MB log. As a result, the fwb tag scanning only introduces 3.6% performance overhead with our 8 MB cache.

Figure 6. Transaction throughput speedup (higher is better), normalized to unsafe-base.
Figure 7. IPC speedup (higher is better) and instruction count (lower is better), normalized to unsafe-base.
Figure 8. Dynamic energy reduction (higher is better), normalized to unsafe-base (dashed line).

B. WHISPER Results

Compared with microbenchmarks, we observe even more promising performance and energy improvements in real persistent memory workloads in the WHISPER benchmark suite with large data sets (Figure 10). Among the WHISPER benchmarks, the ctree and hashmap benchmarks accurately correspond to and reflect the results achieved in our microbenchmarks due to their similarities. Although the magnitude of improvement varies, our design leads to much higher performance, lower energy, and lower NVRAM traffic than our baselines. Compared with redo-clwb, our design significantly reduces the dynamic memory energy consumption of tpcc and ycsb due to the high write intensity in these workloads. Overall, our design achieves up to 2.7× the throughput of the best case redo-clwb. This is also within 73% of non-pers throughput of the same benchmarks. In addition, our design achieves up to a 2.43× reduction in dynamic memory energy over the baselines.

Figure 9. Memory write traffic reduction (higher is better), normalized to unsafe-base (dashed line).
Figure 10. WHISPER benchmark results, including IPC, dynamic memory energy consumption, transaction throughput, and NVRAM write traffic, normalized to unsafe-base (the dashed line).
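The 15-entry bound on the log buffer can be related to the cache latencies in Table II with a small worked example. The sketch below is one plausible reading and not necessarily how the bound was derived: it assumes the minimum store traversal time is the sum of the L1 and L2 access latencies (1.6 ns and 4.4 ns), ignores memory controller queueing (whose latency is not given), assumes one log entry drains per cycle, and converts to cycles at 2.5 GHz. All function names are hypothetical.

```python
def cycle_time_ps(freq_mhz: int) -> int:
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1_000_000 // freq_mhz


def min_traversal_cycles(level_latencies_ps, freq_mhz: int) -> int:
    """Cycles a store needs to pass through every cache level (queues ignored)."""
    return sum(level_latencies_ps) // cycle_time_ps(freq_mhz)


def log_buffer_within_bound(n_entries: int, level_latencies_ps, freq_mhz: int) -> bool:
    """True if an N-entry buffer drains no later than the store traversal time.

    Assumption: because queueing delay is left out, the real traversal time is
    longer than this estimate, so N equal to the estimate is treated as safe.
    """
    return n_entries <= min_traversal_cycles(level_latencies_ps, freq_mhz)


# Values taken from Table II, expressed as integers to avoid rounding error.
L1_LATENCY_PS = 1600   # 1.6 ns
L2_LATENCY_PS = 4400   # 4.4 ns
CORE_FREQ_MHZ = 2500   # 2.5 GHz

bound = min_traversal_cycles([L1_LATENCY_PS, L2_LATENCY_PS], CORE_FREQ_MHZ)
print(bound)  # 15
```

With these assumptions the 8-entry and 15-entry buffers fall within the bound and the 64-entry buffer does not, which is consistent with the observation that larger buffers may no longer guarantee data persistence.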
Figure 11. Sensitivity studies of (a) system throughput with varying log buffer sizes and (b) cache fwb frequency with various NVRAM log sizes.

VII. RELATED WORK

Compared to previous architecture support for persistent memory systems, our design further relaxes ordering constraints on caches with less hardware cost.

Hardware support for logging. Several recent studies proposed hardware support for log-based persistent memory design. Lu et al. propose custom hardware logging mechanisms and multi-versioning caches to reduce intra- and inter-transaction dependencies [24]. However, they require both large-scale changes to the cache hierarchy and cache multi-versioning support. Kolli et al. propose a delegated persist ordering [15] that substantially relaxes persistence ordering constraints by leveraging hardware support and cache coherence. However, the design relies on snoop-based coherence and a dedicated persistent memory controller. Instead, our design is flexible because it directly leverages the information already in the baseline cache hierarchy. ATOM [35] and DudeTM [47] only implement either undo or redo logging. As a result, the studies do not provide the level of relaxed ordering offered by our design. In addition, DudeTM [47] also relies on a [...] which can incur substantial memory access cost. Doshi et al. use redo logging for data recoverability with a backend controller [48]. The backend controller reads log entries from the log in memory and updates data in-place. However, this design can unnecessarily saturate the memory read bandwidth needed for critical read operations. Also, it requires a separate victim cache to protect from dirty cache blocks. Instead, our design directly uses dirty cache bits to enforce persistence.

Footnote: Volatile TM supports concurrency but does not guarantee persistence in memory.

Hardware support for persistent memory. Recent studies also propose general hardware mechanisms for persistent memory with or without logging. Recent works propose that caches may be implemented in software [49], or that an additional non-volatile cache be integrated in the processor [50], [13] to maintain persistence. However, doing so can double the memory footprint for persistent memory operations. Other works [31], [26], [51] optimize the memory controller to improve performance by distinguishing logging and data updates. Epoch barrier [29], [16], [52] is proposed to relax the ordering constraints of persistent memory by allowing coarse-grained transaction ordering. However, epoch barriers incur non-trivial overhead to the cache hierarchy. Furthermore, system performance can be sub-optimal with small epoch sizes, which is observed in many persistent memory workloads [11]. Our design uses lightweight hardware changes on existing processor designs without expensive non-volatile on-chip transaction buffering components.

Persistent memory design in software. Previous works, such as Mnemosyne [19] and REWIND [53], utilize write-ahead logging implemented in software. These rely on instructions, such as clflush, clwb, and pcommit, to achieve persistency and enforce log-data ordering in their critical path. Our design does not require these persistent instructions, which we have shown can clog the pipeline and are often inefficient or unnecessary. JUSTDO [54] logging also relies on these instructions (for manual flushing by the programmer). Additionally, these works' programming models all falter because Intel's pcommit instruction is now deprecated. Our design works on legacy code and does not have strict functionality requirements on the programming model. This makes it immune to relying on instructions or software functions that become deprecated.

VIII. CONCLUSIONS

We proposed a hardware logging scheme that allows the caches to perform cache line updates and write-backs without non-volatile buffers or caches in the processor. These mechanisms make up a complexity-effective design that excels over traditional software logging for persistent memory. Our evaluation shows that our design significantly increases performance while reducing memory traffic and energy.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their valuable feedback. This paper is supported in part by NSF grants 1652328 and 1718158, and NSF I/UCRC Center for Research on Storage Systems.

REFERENCES

[1] V. Sousa, "Phase change materials engineering for RESET current reduction," in Proceedings of the Memory Workshop, 2012.
[2] C. Cagli, "Characterization and modelling of electrode impact in HfO2-based RRAM," in Proceedings of the Memory Workshop, 2012.
[3] W. Zhao, E. Belhaire, Q. Mistral, C. Chappert, V. Javerliac, B. Dieny, and E. Nicolle, "Macro-model of spin-transfer torque based magnetic tunnel junction device for hybrid magnetic-CMOS design," in Behavioral Modeling and Simulation Workshop, Proceedings of the 2006 IEEE International, Sept 2006, pp. 40-43.
[4] Intel and Micron, "Intel and Micron produce breakthrough memory technology," 2015, http://newsroom.intel.com/community/intelnewsroom/.
[5] Intel, "Intel architecture instruction set extensions programming reference," 2016, https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf.
[6] ARM, "ARMv8-a architecture evolution," 2016.
[7] N. Christiansen, "Storage class memory support in the Windows OS," in SNIA NVM Summit, 2016.
[8] C. Diaconu, "Microsoft SQL Hekaton - towards large scale use of PM for in-memory databases," in SNIA NVM Summit, 2016.
[9] J. Moyer, "Persistent memory in Linux," in SNIA NVM Summit, 2016.
[10] K. Deierling, "Persistent memory over fabric," in SNIA NVM Summit, 2016.
[11] S. Nalli, S. Haria, M. D. Hill, M. M. Swift, H. Volos, and K. Keeton, "An analysis of persistent memory use with WHISPER," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2017, pp. 1-14.
[12] S. Haria, S. Nalli, M. M. Swift, M. D. Hill, H. Volos, and K. Keeton, "Hands-off persistence system (HOPS)," in Non-volatile Memories Workshop, 2017.
[13] J. Zhao, S. Li, D. H. Yoon, Y. Xie, and N. P. Jouppi, "Kiln: Closing the performance gap between systems with and without persistence support," in Proceedings of the 46th International Symposium on Microarchitecture (MICRO), 2013, pp. 421-432.
[14] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, 5th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
[15] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen, and T. F. Wenisch, "Delegated persist ordering," in Proceedings of the 49th International Symposium on Microarchitecture, 2016, pp. 1-13.
[16] S. Pelley, P. M. Chen, and T. F. Wenisch, "Memory persistency," in Proceedings of the International Symposium on Computer Architecture, 2014, pp. 1-12.
[17] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood, "LogTM: log-based transactional memory," in Proceedings of the 12th International Symposium on High Performance Computer Architecture, 2006, pp. 1-12.
[18] Intel, "a collection of linux persistent memory programming examples," https://github.com/pmem/linux-examples.
[19] H. Volos, A. J. Tack, and M. M. Swift, "Mnemosyne: Lightweight persistent memory," in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI. New York, NY, USA: ACM, 2011, pp. 91-104.
[20] S. Pelley, T. F. Wenisch, B. T. Gold, and B. Bridge, "Storage management in the NVRAM era," Proceedings of the VLDB Endowment, vol. 7, no. 2, 2013.
[21] I. Moraru, D. G. Andersen, M. Kaminsky, N. Tolia, N. Binkert, and P. Ranganathan, "Consistent, durable, and safe memory management for byte-addressable nonvolatile main memory," in Proceedings of the ACM Conference on Timely Results in Operating Systems, 2013, pp. 1-17.
[22] A. Kolli, S. Pelley, A. Saidi, P. M. Chen, and T. F. Wenisch, "High-performance transactions for persistent memories," in Proceedings of the 21st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2016, pp. 1-12.
[23] Y. Lu, J. Shu, and L. Sun, "Blurred persistence in transactional persistent memory," in Mass Storage Systems and Technologies (MSST), 2015 31st Symposium on, 2015, pp. 1-13.
[24] Y. Lu, J. Shu, L. Sun, and O. Mutlu, "Loose-ordering consistency for persistent memory," in ICCD, 2014.
[25] S. Kannan, A. Gavrilovska, and K. Schwan, "Reducing the cost of persistence for nonvolatile heaps in end user devices," in Proceedings of the International Symposium on High Performance Computer Architecture, 2014, pp. 1-12.
[26] R.-S. Liu, D.-Y. Shen, C.-L. Yang, S.-C. Yu, and C.-Y. M. Wang, "NVM Duet: Unified working memory and persistent store architecture," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2014, pp. 455-470.
[27] A. M. Caulfield, T. I. Mollov, L. A. Eisner, A. De, J. Coburn, and S. Swanson, "Providing safe, user space access to fast, solid state disks," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2012, pp. 387-400.
[28] J. Arulraj, M. Perron, and A. Pavlo, "Write-behind logging," Proc. VLDB Endow., vol. 10, no. 4, pp. 337-348, Nov. 2016.
[29] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson, "NV-heaps: making persistent objects fast and safe with next-generation, non-volatile memories," in International Conference on Architectural Support for Programming Languages and Operating Systems, 2011, pp. 105-118.
[30] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, "ThyNVM: Enabling software-transparent crash consistency in persistent memory systems," in Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48), 2015, pp. 1-13.
[31] J. Zhao, O. Mutlu, and Y. Xie, "FIRM: Fair and high-performance memory control for persistent memory systems," in Proceedings of the 47th International Symposium on Microarchitecture (MICRO-47), 2014.
[32] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz, "ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging," ACM Trans. Database Syst., vol. 17, no. 1, pp. 94-162, Mar. 1992.
[33] L. Lamport, "Proving the correctness of multiprocess programs," IEEE Trans. Softw. Eng., vol. 3, no. 2, pp. 125-143, Mar. 1977.
[34] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design), 4th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.
[35] A. Joshi, V. Nagarajan, S. Viglas, and M. Cintra, "Atom: Atomic durability in non-volatile memory through hardware logging," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 361-372.
[36] T. Wang and R. Johnson, "Scalable logging through emerging non-volatile memory," Proc. VLDB Endow., vol. 7, no. 10, pp. 865-876, Jun. 2014.
[37] J. Huang, K. Schwan, and M. K. Qureshi, "Nvram-aware logging in transaction systems," Proc. VLDB Endow., vol. 8, no. 4, pp. 389-400, Dec. 2014.
[38] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "A durable and energy efficient main memory using phase change memory technology," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09. New York, NY, USA: ACM, 2009, pp. 14-23.
[39] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali, "Enhancing lifetime and security of pcm-based main memory with start-gap wear leveling," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 14-23.
[40] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable high performance main memory system using phase-change memory technology," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09. New York, NY, USA: ACM, 2009, pp. 24-33.
[41] J. H. Ahn, S. Li, O. Seongil, and N. Jouppi, "Mcsima+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling," in Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on, 2013, pp. 74-85.
[42] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, New York, NY, USA, 2005, pp. 190-200.
[43] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 469-480.
[44] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in International Symposium on Computer Architecture, 2009, pp. 2-13.
[45] T. Bingmann, "STX B+ Tree," Sept. 2008, http://panthema.net/2007/stx-btree.
[46] D. A. Bader and K. Madduri, "Design and implementation of the hpcs graph analysis benchmark on symmetric multiprocessors," in Proceedings of the 12th International Conference on High Performance Computing, 2005, pp. 465-476.
[47] M. Liu, M. Zhang, K. Chen, X. Qian, Y. Wu, W. Zheng, and J. Ren, "Dudetm: Building durable transactions with decoupling for persistent memory," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '17. New York, NY, USA: ACM, 2017, pp. 329-343.
[48] K. Doshi, E. Giles, and P. Varman, "Atomic persistence for SCM with a non-intrusive backend controller," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016, pp. 77-89.
[49] P. Li, D. R. Chakrabarti, C. Ding, and L. Yuan, "Adaptive software caching for efficient nvram data persistence," in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017, pp. 112-122.
[50] C.-H. Lai, J. Zhao, and C.-L. Yang, "Leave the cache hierarchy operation as it is: A new persistent memory accelerating approach," in Proceedings of the 54th Annual Design Automation Conference 2017, ser. DAC '17. New York, NY, USA: ACM, 2017, pp. 5:1-5:6.
[51] L. Sun, Y. Lu, and J. Shu, "DP2: Reducing transaction overhead with differential and dual persistency in persistent memory," in Proceedings of the 12th ACM International Conference on Computing Frontiers, 2015, pp. 24:1-24:8.
[52] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, "Better I/O through byte-addressable, persistent memory," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ser. SOSP '09. New York, NY, USA: ACM, 2009, pp. 133-146.
[53] A. Chatzistergiou, M. Cintra, and S. D. Viglas, "REWIND: Recovery write-ahead system for in-memory non-volatile data-structures," Proc. VLDB Endow., vol. 8, no. 5, pp. 497-508, Jan. 2015.
[54] J. Izraelevitz, T. Kelly, and A. Kolli, "Failure-atomic persistent memory updates via justdo logging," SIGPLAN Not., vol. 51, no. 4, pp. 427-442