Steal but No Force: Efficient Hardware Undo+Redo Logging for Persistent Memory Systems

Matheus Almeida Ogleari, Ethan L. Miller, Jishen Zhao
University of California, Santa Cruz; Pure Storage; University of California, San Diego
{mogleari,elm,jishen.zhao}@ucsc.edu

2018 IEEE International Symposium on High Performance Computer Architecture. 2378-203X/18/$31.00 ©2018 IEEE. DOI 10.1109/HPCA.2018.00037

Abstract

Persistent memory is a new tier of memory that functions as a hybrid of traditional storage systems and main memory. It combines the benefits of both: the data persistence of storage with the fast load/store interface of memory. Most previous persistent memory designs place careful control over the order of writes arriving at persistent memory. This can prevent caches and memory controllers from optimizing system performance through write coalescing and reordering. We identify that such write-order control can be relaxed by employing undo+redo logging for data in persistent memory systems. However, traditional software logging mechanisms are [...] and energy overheads. Previously proposed hardware logging schemes are inefficient and do not fully address the issues in software. To address these challenges, we propose a hardware undo+redo logging scheme which maintains data persistence by leveraging the write-back, write-allocate policies used in commodity caches. Furthermore, we develop a cache force-write-back mechanism in hardware to significantly reduce the performance and energy overheads from forcing data into persistent memory. Our evaluation across persistent memory microbenchmarks and real workloads demonstrates [...] reduces both dynamic energy and memory traffic. It also provides strong consistency guarantees compared to software approaches.

I. INTRODUCTION

Persistent memory presents a new tier of data storage components for future computer systems. By attaching Non-Volatile Random-Access Memories (NVRAMs) [1], [2], [3], [4] to the memory bus, persistent memory unifies memory and storage systems. NVRAM offers the fast load/store access of memory with the data recoverability of storage in a single device. Consequently, hardware and software vendors recently began adopting persistent memory techniques in their next-generation designs. Examples include Intel's ISA [...] ARM's new cache write-back instruction [6], Microsoft's storage class memory support in Windows OS and in-memory databases [7], [8], Red Hat's persistent memory support in the Linux kernel [9], and Mellanox's persistent memory support over fabric [10].

Though promising, persistent memory fundamentally changes current memory and storage system design assumptions. Reaping its full potential is challenging. Previous persistent memory designs introduce large performance and energy overheads compared to native memory systems, without enforcing consistency [11], [12], [13]. A key reason is the write-order control used to enforce data persistence. Typical [...] memory controllers to optimize system performance [14], [15], [16], [13]. However, most previous persistent memory designs employ memory barriers and forced cache write-backs (or cache flushes) to enforce the order of persistent data arriving at NVRAM. This write-order control is sub-optimal for performance and does not consider natural caching and memory scheduling mechanisms.

Several recent studies strive to relax write-order control in persistent memory systems [15], [16], [13]. However, these studies either impose substantial hardware overhead by adding NVRAM caches in the processor [13] or fall back to low-performance modes once certain bookkeeping resources [...].

Our goal in this paper is to design a high-performance persistent memory system without (i) an NVRAM cache or buffer in the processor, (ii) falling back to a low-performance mode, or (iii) interfering with the write reordering by caches and memory controllers. Our key idea is to maintain data persistence with a combined undo+redo logging scheme in hardware.

Undo+redo logging stores both old (undo) and new (redo) values in the log during a persistent data update. It offers a key benefit: relaxing the write-order constraints on caching persistent data in the processor. In our paper, we show [...] needing strict write-order control. As a result, the caches and memory controllers can reorder the writes like in traditional non-persistent memory systems (discussed in Section II-B).

Previous persistent memory systems typically implement either undo or redo logging in software. However, high-performance software undo+redo logging in persistent memory is unfeasible due to inefficiencies. First, software logging generates extra instructions in software, competing for limited hardware resources in the pipeline with other critical workload operations. Undo+redo logging can double the number of extra instructions over undo or redo logging alone. Second, logging introduces extra memory traffic in addition to working data access [13]. Undo+redo logging would impose more than double extra memory traffic in software. Third, the hardware states of caches are invisible to software. As a result, software undo+redo logging, an idea borrowed from database mechanisms designed to coordinate with software-managed caches, can only conservatively coordinate with hardware caches. Finally, with multithreaded workloads, context switches by the operating system (OS) can interrupt the logging and persistent data updates. This can risk the data consistency guarantee in a multithreaded environment (Section II-C discusses this further).

Several prior works investigated hardware undo or redo logging separately [17], [15] (Section VII). These designs have similar challenges such as hardware and energy overheads [17], and slowdown due to saturated hardware bookkeeping resources in the processor [15]. Supporting both undo and redo logging can further exacerbate the issues. Additionally, hardware logging mechanisms can eliminate the logging instructions in the pipeline, but the extra memory traffic generated from the log still exists.

To address these challenges, we propose a combined undo+redo logging scheme in hardware that allows persistent memory systems to relax the write-order control by leveraging existing caching policies. Our design consists of two mechanisms. First, a Hardware Logging (HWL) mechanism performs undo+redo logging by leveraging write-back write-allocate caching policies [14] commonly used in processors. Our HWL design causes a persistent data update to automatically trigger logging for that data. Whether a store generates an L1 cache hit or miss, its address, old value, and new value are all available in the cache hierarchy. As such, our design utilizes the cache block writes to update the log with word-size values. Second, we propose a cache Force Write-Back (FWB) mechanism to force write-backs of cached persistent working data in a much lower, yet more efficient frequency than in software models. This frequency depends only on the allocated log size and NVRAM write bandwidth, thus decoupling cache force write-backs from transaction execution. We summarize the contributions of this paper as follows:

• This is the first paper to exploit the combination of undo+redo logging to relax ordering constraints on caches and memory controllers in persistent memory systems. Our design relaxes the ordering constraints in a way that undo logging, redo logging, or copy-on-write alone cannot.
• We enable efficient undo+redo logging for persistent memory systems in hardware, which imposes
substantially more challenges than implementing either undo- or redo-logging alone.
• We develop a hardware-controlled cache force write-back mechanism, which significantly reduces the performance overhead of force write-backs by efficiently tuning the write-back frequency.
• We implement our design through lightweight software support and processor modifications.

II. BACKGROUND AND MOTIVATION

Persistent memory is fundamentally different from traditional DRAM main memory or their NVRAM replacement, due to its persistence (i.e., crash consistency) property inherited from storage systems. Persistent memory needs to ensure the integrity of in-memory data despite system crashes and power loss [18], [19], [20], [21], [16], [22], [23], [24], [25], [26], [27]. The persistence property is not guaranteed by memory consistency in traditional memory systems. Memory consistency ensures a consistent global view of processor caches and main memory, while persistent memory needs to ensure that the data in the NVRAM main memory is standalone consistent [16], [19], [22].

A. Persistent Memory Write-order Control

To maintain data persistence, most persistent memory designs employ transactions to update persistent data and carefully control the order of writes arriving in NVRAM [16], [19], [28]. A transaction (e.g., the code example in Figure 1) consists of a group of persistent memory updates performed in the manner of “all or nothing” in the face of system failures. Persistent memory systems also force cache write-backs (e.g., clflush, clwb, and dc cvap) and use memory barrier instructions (e.g., mfence and sfence) throughout transactions to enforce write-order control [28], [15], [13], [19], [29].

Recent works strived to improve persistent memory performance towards a native non-persistent system [15], [16], [13]. In general, whether employing logging in persistent memory or not, most face similar problems. (i) They introduce nontrivial hardware overhead (e.g., by integrating NVRAM cache/buffers or substantial extra bookkeeping components in the processor) [13], [30]. (ii) They fall back to low-performance modes once the bookkeeping components or the NVRAM cache/buffer are saturated [13], [15]. (iii) They inhibit caches from coalescing and reordering persistent data writes [13] (details discussed in Section VII).

Forced cache write-backs ensure that cached data updates made by completed (i.e., committed) transactions are written to NVRAM. This ensures NVRAM is in a persistent state with the latest data updates. Memory barriers stall subsequent data updates until the previous updates by the transaction complete. However, this write-order control prevents caches from optimizing system performance via coalescing and reordering writes. The forced cache write-backs and memory barriers can also block or interfere with subsequent read and write requests that share the memory bus. This happens regardless of whether these requests are independent from the persistent data access or not [26], [31].

[Figure 1 content: transaction pseudocode and timelines for three logging schemes. In the timelines, “Write A” consists of N store instructions (store A1 ... store AN), each paired with log writes (Ulog_A1 ... Ulog_AN and/or Rlog_A1 ... Rlog_AN) between Tx begin and Tx commit; log writes are uncacheable and data stores are cacheable.]

(a) Undo logging only:
    Tx_begin
      do some reads
      do some computation
      Uncacheable_Ulog( addr(A), old_val(A) )
      write new_val(A)   // new_val(A) = A'
      clwb               // force writeback
    Tx_commit

(b) Redo logging only:
    Tx_begin
      do some reads
      do some computation
      Uncacheable_Rlog( addr(A), new_val(A) )
      memory_barrier
      write new_val(A)   // new_val(A) = A'
    Tx_commit

(c) Undo+redo logging:
    Tx_begin
      do some reads
      do some computation
      Uncacheable_log( addr(A), new_val(A), old_val(A) )
      write new_val(A)   // new_val(A) = A'
      clwb               // can be delayed
    Tx_commit
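To make the undo+redo transaction of Figure 1(c) concrete, the following is a minimal software model, not the hardware design itself. It assumes single-word values, transactions that touch disjoint addresses, and a log that never fills; every name in it (`pm_t`, `pm_write`, `pm_commit`, `pm_evict`, `pm_crash_and_recover`) is hypothetical and chosen only for illustration. Each write appends a record holding the address, old value, and new value before performing the cached store, so recovery can roll committed transactions forward and uncommitted ones back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PM_WORDS  8    /* hypothetical sizes, chosen only for illustration */
#define LOG_SLOTS 16
#define MAX_TX    4

typedef struct {
    int      tx_id;
    size_t   addr;      /* index of the updated word */
    uint64_t old_val;   /* undo value */
    uint64_t new_val;   /* redo value */
} log_rec_t;

typedef struct {
    /* State that survives a crash. */
    uint64_t  nvram[PM_WORDS];
    log_rec_t log[LOG_SLOTS];
    size_t    log_len;
    bool      committed[MAX_TX];
    /* Volatile state, lost on a crash. */
    uint64_t  cache[PM_WORDS];
    bool      dirty[PM_WORDS];
} pm_t;

static void pm_init(pm_t *pm) { memset(pm, 0, sizeof *pm); }

static uint64_t pm_read(const pm_t *pm, size_t addr) {
    return pm->dirty[addr] ? pm->cache[addr] : pm->nvram[addr];
}

/* Log {addr, old, new} first, then perform the cached store. */
static void pm_write(pm_t *pm, int tx, size_t addr, uint64_t val) {
    assert(pm->log_len < LOG_SLOTS && addr < PM_WORDS && tx >= 0 && tx < MAX_TX);
    pm->log[pm->log_len++] = (log_rec_t){ tx, addr, pm_read(pm, addr), val };
    pm->cache[addr] = val;
    pm->dirty[addr] = true;
}

/* Commit needs no write-back: the redo values are already in the log. */
static void pm_commit(pm_t *pm, int tx) { pm->committed[tx] = true; }

/* A cache eviction may write data back at any time, even before commit. */
static void pm_evict(pm_t *pm, size_t addr) {
    if (pm->dirty[addr]) {
        pm->nvram[addr] = pm->cache[addr];
        pm->dirty[addr] = false;
    }
}

/* Crash: drop volatile state, then repair NVRAM from the log. */
static void pm_crash_and_recover(pm_t *pm) {
    memset(pm->dirty, 0, sizeof pm->dirty);
    for (size_t i = 0; i < pm->log_len; i++)      /* redo committed, oldest first */
        if (pm->committed[pm->log[i].tx_id])
            pm->nvram[pm->log[i].addr] = pm->log[i].new_val;
    for (size_t i = pm->log_len; i-- > 0; )       /* undo uncommitted, newest first */
        if (!pm->committed[pm->log[i].tx_id])
            pm->nvram[pm->log[i].addr] = pm->log[i].old_val;
}
```

In this model a committed transaction whose data never left the cache is recovered from the redo values, and an uncommitted transaction whose data was evicted early is rolled back from the undo values.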
Figure 1. Comparison of executing a transaction in persistent memory with (a) undo logging, (b) redo logging, and (c) both undo and redo logging.

B. Why Undo+Redo Logging

While prior persistent memory designs only employ either undo or redo logging to maintain data persistence, we observe that using both can substantially relax the aforementioned write-order control placed on caches.

Logging in persistent memory. Logging is widely used in persistent memory designs [19], [29], [22], [15]. In addition to working data updates, persistent memory systems can maintain copies of the changes in the log. Previous designs typically employ either undo or redo logging. Figure 1(a) shows that an undo log records old versions of data before the transaction changes the value. If the system fails during an active transaction, the system can roll back to the state before the transaction by replaying the undo log. Figure 1(b) illustrates an example of a persistent transaction that uses redo logging. The redo log records new versions of data. After system failures, replaying the redo log recovers the persistent data with the latest changes tracked by the redo log. In persistent memory systems, logs are typically uncacheable because they are meant to be accessed only during the recovery. Thus, they are not reused during application execution. They must also arrive in NVRAM in order, which is guaranteed through bypassing the caches.

Benefits of undo+redo logging. Combining undo and redo logging (undo+redo) is widely used in disk-based database management systems (DBMSs) [32]. Yet, we find that we can leverage this concept in persistent memory design to relax the write-order constraints on the caches.

Figure 1(a) shows that uncacheable, store-granular undo logging can eliminate the memory barrier between the log and working data writes. As long as the log entry (Ulog_A1) is written into NVRAM before its corresponding store to the working data (store A1), we can undo the partially completed store after a system failure. Furthermore, store A1 must traverse the cache hierarchy. The uncacheable Ulog_A1 may be buffered (e.g., in a four to six cache-line sized entry write-combining buffer in x86 processors). However, it still requires much less time to get out of the processor than cached stores. This naturally maintains the write ordering without explicit memory barrier instructions between the log and the persistent data writes. That is, logging and working data writes are performed in a pipeline-like manner (like in the timeline in Figure 1(a)). [...] is similar to the “steal” attribute in DBMS [32], i.e., cached working data updates can steal the way into persistent storage before transaction commits. However, a downside is that undo logging requires a forced cache write-back before the transaction commits. This is necessary if we want to recover the latest transaction state after system failures. Otherwise, the data changes made by the transaction will not be committed to memory.

Instead, redo logging allows transactions to commit without explicit cache write-backs because the redo log, once updates complete, already has the latest version of the transactions (Figure 1(b)). This is similar to the “no-force” attribute in DBMS [32], i.e., no need to force the working data updates out of the caches at the end of transactions. However, we must use memory barriers to complete the redo log of A before any stores of A reach NVRAM. We illustrate this ordering constraint by the dashed blue line in the timeline. Otherwise, a system crash when the redo logging is incomplete, while working data A is partially overwritten in NVRAM (by store Ak), causes data corruption.

Figure 1(c) shows that undo+redo logging combines the benefits of both “steal” and “no-force”. As a result, we can eliminate the memory barrier between the
log and persistent writes. A forced cache write-back (e.g., clwb) is unnecessary for an unlimited sized log. However, it can be postponed until after the transaction commits for a limited sized log (Section II-C).

[Figure 2 content: (a) a software-logged transaction and the micro-ops it generates; (b) a processor (cores, core caches, shared cache, memory controller; volatile) and NVRAM (nonvolatile) holding undo and redo records Ulog_A/Rlog_A and Ulog_B/Rlog_B, with Ulog_C/Rlog_C about to overwrite the records of A while Ak is still in caches and a clwb of Ak is pending.]

    Tx_begin(TxID)
      do some reads
      do some computation
      Uncacheable_log(addr(A), new_val(A), old_val(A))
      write new_val(A)   // A'
      clwb               // conservatively used
    Tx_commit

    Micro-ops: load A1, load A2, ..., store log_A1, store log_A2, ...

Figure 2. Inefficiency of logging in software.

C. Why Undo+Redo Logging in Hardware

Though promising, undo+redo logging is not used in persistent memory system designs because previous software logging schemes are inefficient (Figure 2).

Extra instructions in the CPU pipeline. Logging in software uses logging functions in transactions. Figure 2(a) shows that both undo and redo logging can introduce a large number of instructions into the CPU pipeline. As we demonstrate in our experimental results (Section VI), using only undo logging can lead to more than doubled instructions compared to memory systems without persistent memory. Undo+redo logging can introduce a prohibitively large number of instructions to the CPU pipeline, occupying compute resources needed for data movement.

Increased NVRAM traffic. Most instructions for logging are loads and stores. As a result, logging substantially increases memory traffic. In particular, undo logging must not only store to the log, but it must also first read the old values of the working data from the cache and memory hierarchy. This further increases memory traffic.

Conservative cache forced write-back. Logs can have a limited size (see Footnote 1). Suppose that, without losing generality, a log can hold undo+redo records of two transactions (Figure 2(b)). To log a third transaction (Ulog_C and Rlog_C), we must overwrite an existing log record, say Ulog_A and Rlog_A (transaction A). If any updates of transaction A (e.g., Ak) are still in caches, we must force these updates into the NVRAM before we overwrite their log entry. The problem is that caches are invisible to software. Therefore, software does not know whether or which particular updates to A are still in the caches. Thus, once a log becomes full (after garbage collection), software may conservatively force cache write-backs before committing the transaction. This unfortunately negates the benefit of redo logging.

Risks of data persistence in multithreading. In addition to the above challenges, multithreading further complicates software logging in persistent memory, when a log is shared by multiple threads. Even if a persistent memory system issues clwb instructions in each transaction, a context switch by the OS can occur before the clwb instruction executes. This context switch interrupts the control flow of transactions and diverts the program to other threads. This reintroduces the aforementioned issue of prematurely overwriting the records in a filled log. Implementing per-thread logs can mitigate this risk. However, doing so can introduce a new persistent memory API and complicate recovery.

These inefficiencies expose the drawbacks of undo+redo logging in software and warrant a hardware solution.

Footnote 1: Although we can grow the log size on demand, this introduces extra system overhead on managing variable size logs [19]. Therefore, we study fixed size logs in this paper.

III. OUR DESIGN

To address the challenges, we propose a hardware undo+redo logging design, consisting of Hardware Logging (HWL) and cache Force Write-Back (FWB) mechanisms. This section describes our design principles. We describe detailed implementation methods and the required software support in Section IV.

A. Assumptions and Architecture Overview

Figure 3(a) depicts an overview of our processor and memory architecture. The figure also shows the circular log structure in NVRAM. All processor components are completely volatile. We use write-back, write-allocate caches common to processors. We support hybrid DRAM+NVRAM for main memory, deployed on the processor-memory bus with separate memory controllers [19], [13]. However, this paper focuses on persistent data updates to NVRAM.

Failure Model. Data in DRAM and caches, but not in NVRAM, are lost across system reboots. Our design focuses on maintaining persistence of user-defined critical data stored in NVRAM. After failures, the system can recover this data by replaying the log in NVRAM. DRAM is used to store data without persistence [19], [13].

Persistent Memory Transactions. Like prior work in persistent memory [19], [22], we use “persistent memory transactions” as a software abstraction to indicate regions of memory that are persistent. Persistent memory writes require a persistence guarantee. Figure 2 illustrates a simple code example of a persistent memory transaction implemented with logging (Figure 2(a)), and with our design (Figure 2(b)). The transaction defines object A as critical data that needs a persistence guarantee. Unlike most logging-based persistent memory transactions, our transactions eliminate explicit logging functions, cache forced write-back instructions, and memory barrier instructions. We discuss our software interface design in Section IV.

Uncacheable Logs in the NVRAM. We use a single-consumer, single-producer Lamport circular structure [33] for the log. Our system software can allocate and truncate the log (Section IV). Our hardware mechanisms append the log. We chose a circular log structure because it allows simultaneous appends and truncates without locking [33],

[Figure 3 content: (a) Architecture overview: a processor (cores, L1 caches, lower-level caches, last-level cache, cache controllers, memory controllers, log buffer; volatile) attached to DRAM and to NVRAM (nonvolatile), which holds the uncacheable circular log with head and tail pointers. Log entry field widths: 1-bit, 16-bit, 8-bit, 48-bit, 1-word, 1-word. (b) In case of a store hit in L1 cache. (c) In case of a store miss in L1 cache, where A1 hits in a lower-level cache. The transaction shown is:]

    Tx_begin(TxID)
      do some reads
      do some computation
      Write A
    Tx_commit
    (A'1, A'2, ... are new values to be written)
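The log entry layout and circular log of Figure 3(a) can be sketched in C as follows. This is an illustrative model under stated assumptions, not the paper's hardware: the field names follow the surrounding text (torn bit, transaction ID, thread ID, physical address, undo and redo words), while `clog_init`, `clog_append`, `clog_truncate`, `pass_bit`, and the tiny `LOG_CAPACITY` are hypothetical. The sketch also assumes that the tail is the append side, the head is the truncate side, and the first pass over the log writes a torn bit of 1 so that it differs from zeroed NVRAM.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_CAPACITY 4u   /* hypothetical; tiny so wrap-around is easy to see */
#define ADDR_MASK ((UINT64_C(1) << 48) - 1)

typedef struct {
    uint8_t  torn;       /* 1 bit in hardware */
    uint16_t tx_id;      /* 16-bit transaction ID */
    uint8_t  thread_id;  /* 8-bit thread ID */
    uint64_t addr;       /* 48-bit physical address, kept in the low bits */
    uint64_t undo;       /* old value, one word */
    uint64_t redo;       /* new value, one word */
} log_entry_t;

typedef struct {
    log_entry_t slot[LOG_CAPACITY];
    uint64_t head;   /* number of entries truncated so far (consumer side) */
    uint64_t tail;   /* number of entries appended so far (producer side) */
} circ_log_t;

static void clog_init(circ_log_t *l) { *l = (circ_log_t){0}; }

/* Torn-bit value for the pass that append number n belongs to.
   Assumption: the first pass writes 1, and the value flips on every pass. */
static uint8_t pass_bit(uint64_t n) {
    return (uint8_t)(((n / LOG_CAPACITY) & 1u) ^ 1u);
}

/* Append one undo+redo record; returns -1 if no slot is free. */
static int clog_append(circ_log_t *l, uint16_t tx, uint8_t tid,
                       uint64_t addr, uint64_t undo, uint64_t redo) {
    assert(l->tail >= l->head);
    if (l->tail - l->head == LOG_CAPACITY)
        return -1;
    log_entry_t *e = &l->slot[l->tail % LOG_CAPACITY];
    e->tx_id = tx;
    e->thread_id = tid;
    e->addr = addr & ADDR_MASK;
    e->undo = undo;
    e->redo = redo;
    e->torn = pass_bit(l->tail);   /* written last: marks the entry complete */
    l->tail++;
    return 0;
}

/* Release up to n of the oldest entries so their slots can be overwritten. */
static void clog_truncate(circ_log_t *l, uint64_t n) {
    uint64_t live = l->tail - l->head;
    l->head += (n < live) ? n : live;
}
```

Because only the appender advances `tail` and only the truncator advances `head`, the two sides never write the same pointer, which is the property that lets a single-producer, single-consumer circular structure avoid locks.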
Figure 3. Overview of the proposed hardware logging in persistent memory.

[19]. Figure 3(a) shows that log records maintain undo and redo information of a single update (e.g., store A1). In addition to the undo (A1) and redo (A'1) values, log records also contain the following fields: a 16-bit transaction ID, an 8-bit thread ID, a 48-bit physical address of the data, and a torn bit. We use a torn bit per log entry to indicate the update is complete [19]. Torn bits have the same value for all entries in one pass over the log, but reverse when a log entry is overwritten. Thus, completely-written log records all have the same torn bit value, while incomplete entries have mixed values [19]. The log must accommodate all write requests of undo+redo.

The log is typically used during system recovery, and rarely reused during application execution. Additionally, log updates must arrive in NVRAM in store-order. Therefore, we make the log uncacheable. This is in line with most prior works, in which log updates are written directly into a write-combine buffer (WCB) [19], [31] that coalesces multiple stores to the same cache line.

B. Hardware Logging (HWL)

The goal of our Hardware Logging (HWL) mechanism is to enable feasible undo+redo logging of persistent data in our microarchitecture. HWL also relaxes ordering constraints on caching in a manner that neither undo nor redo logging can. Furthermore, our HWL design leverages information naturally available in the cache hierarchy but not to the programmer or software. It does so without the performance overhead of unnecessary data movement or executing logging, cache force-write-back, or memory barrier instructions in the pipeline.

Leveraging Existing Undo+Redo Information in Caches. Most processors' caches use write-back, write-allocate caching policies [34]. On a write hit, a cache only updates the cache line in the hitting level with the new values. A dirty bit in the cache tag indicates cache values are modified but not committed to memory. On a write miss, the write-allocate (also called fetch-on-write) policy requires the cache to first load (i.e., allocate) the entire missing cache line before writing new values to it. HWL leverages the write-back, write-allocate caching policies to feasibly enable undo+redo logging in persistent memory. HWL automatically triggers a log update on a persistent write in hardware. HWL records both redo and undo information in the log entry in NVRAM (shown in Figure 2(b)). We get the redo data from the currently in-flight write operation itself. We get the undo data from the write request's corresponding write-allocated cache line. If the write request hits in the L1 cache, we read the old value before overwriting the cache line and use that for the undo log. If the write request misses in L1 cache, that cache line must first be allocated anyway, at which point we get the undo data in a similar manner. The log entry, consisting of a transaction ID, thread, the address of the write, and undo and redo values, is written out to the circular log in NVRAM using the head and tail pointers. These pointers are maintained in special registers described in Section IV.

Inherent Ordering Guarantee Between the Log and Data. Our design does not require explicit memory barriers to enforce that undo log updates arrive at NVRAM before their corresponding working data. The ordering is naturally ensured by how HWL performs the undo logging and working data updates. This includes i) the uncached log updates and cached working data updates, and ii) store-granular undo logging. The working data writes must traverse the cache hierarchy, but the uncacheable undo log updates do not. Furthermore, our HWL also provides an optional volatile log buffer in the processor, similar to the write-combining buffers in commodity processor design, that coalesces the log updates. We configure the number of log buffer entries based on cache access latency. Specifically, we ensure that the log updates write out of the log buffer before a cached store writes out of the cache hierarchy. Section IV-C and Section VI further discuss and evaluate this log buffer.

C. Decoupling Cache FWBs and Transaction Execution

Writes are seemingly persistent once their logs are written to NVRAM. In fact, we can commit a transaction once logging of that transaction is completed. However, this does not guarantee data persistence because of the circular structure of the log in NVRAM (Section II-A). However, inserting cache write-back instructions (such as clflush and clwb) in software can impose substantial performance overhead (Section II-A). This further complicates data persistence support in multithreading (Section II-C).

We eliminate the need for forced write-back instructions and guarantee persistence in multithreaded applications by designing a cache Force-Write-Back (FWB) mechanism in hardware. FWB is decoupled from the execution of each transaction. Hardware uses FWB to force certain cache blocks to write back when necessary. FWB introduces a force write-back bit (fwb) alongside the tag and dirty bit of each cache line. We maintain a finite state machine in each cache block (Section IV-D) using the fwb and dirty bits. Caches already maintain the dirty bit: a cache line update sets the bit and a cache eviction (write-back) resets it. A cache controller maintains our fwb bit by scanning cache lines periodically. On the first scan, it sets the fwb bit in dirty cache blocks if unset. On the second scan, it forces write-backs in all cache lines with {fwb, dirty} = {1, 1}. If the dirty bit ever gets reset for any reason, the fwb bit also resets and no forced write-back occurs.

Our FWB design is also decoupled from software multithreading mechanisms. As such, our mechanism is impervious to software context switch interruptions. That is, when the OS requires the CPU to context switch, hardware waits until ongoing cache write-backs complete. The frequency of the forced write-backs can vary. However, forced write-backs must be faster than the rate at which log entries with uncommitted persistent updates are overwritten in the circular log. In fact, we can determine force write-back frequency (associated with the scanning frequency) based on the log size and the NVRAM write bandwidth (discussed in Section IV-D). Our evaluation shows the frequency determination (Section VI).

D. Instant Transaction Commits

Previous designs require software or hardware memory barriers (and/or cache force-write-backs) at transaction commits to enforce write ordering of log updates (or persistent data) into NVRAM across consecutive transactions [13], [26]. Instead, our design gives transaction commits a “free ride”. That is, no explicit instructions are needed. Our mechanisms also naturally enforce the order of intra- and inter-transaction log updates: we issue log updates in the order of writes to corresponding working data. We also write the log updates into NVRAM in the order they are issued (the log buffer is a FIFO). Therefore, log updates of subsequent transactions can only be written into NVRAM after current log updates are written and committed.

E. Putting It All Together

Figure 3(b) and (c) illustrate how our hardware logging works. Hardware treats all writes encompassed in persistent transactions (e.g., write A in the transaction delimited by tx_begin and tx_commit in Figure 2(b)) as persistent writes. Those writes invoke our HWL and FWB mechanisms. They work together as follows. Note that log updates go directly to the WCB or NVRAM if the system does not adopt the log buffer.

The processor sends writes of data object A (a variable or other data structure), consisting of new values of one or more cache lines {A'1, A'2, ...}, to the L1 cache. Upon updating an L1 cache line (e.g., from old value A1 to a new value A'1):

1) Write the new value (redo) into the cache line.
   a) If the update is the first cache line update of data object A, the HWL mechanism (which has the transaction ID and the address of A from the CPU) writes a log record header into the log buffer.
   b) Otherwise, the HWL mechanism writes the new value (e.g., A'1) into the log buffer.
2) Obtain the undo data from the old value in the cache line. This step runs parallel to Step-1.
   a)
      If the cache line write request hits in L1 (Figure 3(b)), the L1 cache controller immediately extracts the old value (e.g., A1) from the cache line before writing the new value. The cache controller reads the old value from the hitting line out of the cache read port and writes it into the log buffer in Step-3. No additional read instruction is necessary.
   b) If the write request misses in the L1 cache (Figure 3(c)), the cache hierarchy must write-allocate that cache block as is standard. The cache controller at a lower-level cache that owns that cache line extracts the old value (e.g., A1). The cache controller sends the extracted old value to the log buffer in Step-3.
3) Update the undo information of the cache line: the cache controller writes the old value of the cache line (e.g., A1) to the log buffer.
4) The L1 cache controller updates the cache line in the L1 cache. The cache line can be evicted via standard cache eviction policies without being subjected to data persistence constraints. Additionally, our log buffer is small enough to guarantee that log updates traverse through the log buffer faster than the cache line traverses the cache hierarchy (Section IV-D). Therefore, this step occurs without waiting for the corresponding log entries to arrive in NVRAM.
5) The memory controller evicts the log buffer entries to NVRAM in a FIFO manner. This step is independent from other steps.
6) Repeat Step-1-(b) through 5 if the data object A consists of multiple cache line writes. The log buffer coalesces the log updates of any writes to the same cache line.
7) After log entries of all the writes in the transaction are issued, the transaction can commit.
8) Persistent working data updates remain cached until they are written back to NVRAM by either normal eviction or our cache FWB.

F. Discussion

Types of Logging. Systems with non-volatile memory can adopt centralized [35] or distributed (e.g., per-thread) logs [36], [37]. Distributed logs can be more scalable than centralized logs in large systems from software's perspective. Our design works with either type of logs. With centralized logging, each log record needs to maintain a thread ID, while distributed logs do not need to maintain this information in log records. With centralized log, our hardware design effectively reduces the software overhead and can substantially improve system
performance with real persistent memory workloads as we show in our experiments. In addition, our design also allows systems to adopt alternative formats of distributed logs. For example, we can partition the physical address space into multiple regions and maintain a log per memory region. We leave the evaluation of such log implementations to our future work.

NVRAM Capacity Utilization. Storing undo+redo log can consume more NVRAM space than either undo or redo alone. Our log uses a fixed-size circular buffer rather than doubling any previous undo or redo log implementation. The log size can trade off with the frequency of our cache FWB (Section IV). The software support discussed in Section IV-A allows users to determine the size of the log. Our FWB mechanism will adjust the frequency accordingly to ensure data persistence.

Lifetime of NVRAM Main Memory. The lifetime of the log region is not an issue. Suppose a log has 64K entries (4MB) and NVRAM (assuming phase-change memory) has a 200 ns write latency. Each entry will be overwritten once every 64K × 200 ns. If NVRAM endurance is 10^8 writes, a cell, even statically allocated to the log, will take 15 days to wear out, which is plenty of time for conventional NVRAM wear-leveling schemes to trigger [38], [39], [40]. In addition, our scheme has two impacts on overall NVRAM lifetime: logging normally leads to write amplification, but we improve NVRAM lifetime because our caches coalesce writes. The overall impact is likely slightly negative. However, wear-leveling will trigger before any damage occurs.

    void persistent_update(int threadid)
    {
      tx_begin(threadid);
      // Persistent data updates
      write A[threadid];
      tx_commit();
    }
    // ...
    int main()
    {
      // Executes one persistent
      // transaction per thread
      for (int i = 0; i < nthreads; i++)
        thread t(persistent_update, i);
    }

Figure 4. Pseudocode example for tx_begin and tx_commit, where thread ID is transaction ID to perform one persistent transaction per thread.

IV. IMPLEMENTATION

In this section, we describe the implementation details of our design and hardware overhead. We covered the impact of NVRAM space consumption, lifetime, and endurance in Section III-F.

A. Software Support

Our design has software support for defining persistent memory transactions, allocating and truncating the circular log in NVRAM, and reserving a special character as the log header indicator.

Transaction Interface. We use a pair of transaction functions, tx_begin(txid) and tx_commit(), that define transactions which do persistent writes in the program. We use txid to provide the transaction ID information used by our HWL mechanism. This ID groups writes from the same transaction. This transaction interface has been used by numerous previous persistent memory designs [13], [29]. Figure 4 shows an example of multithreaded pseudocode with our transaction functions.

System Library Functions Maintain the
Log. Our HWL mechanism performs log updates, while the system software maintains the log structure. In particular, we use system library functions, log_create() and log_truncate() (similar to functions used in prior work [19]), to allocate and truncate the log, respectively. The system software sets the log size. The memory controller obtains log maintenance information by reading special registers (Section IV-B), indicating the head and tail pointers of the log. Furthermore, a single transaction that exceeds the originally allocated log size can corrupt persistent data. We provide two options to prevent overflows: 1) The log_create() function allocates a large-enough log by reading the maximum transaction size from the program interface (e.g., #define MAX_TX_SIZE N); 2) An additional library function log_grow() allocates additional log regions when the log is filled by an uncommitted transaction.

B. Special Registers

The txid argument from tx_begin() translates into an 8-bit unsigned integer (a physical transaction ID) stored in a special register in the processor. Because the transaction IDs group writes of the same transactions, we can simply pick a not-in-use physical transaction ID to represent a newly received txid. An 8-bit length can accommodate 256 unique active persistent memory transactions at a time. A physical transaction ID can be reused after the transaction commits.

We also use two 64-bit special registers to store the head and tail pointers of the log. The system library initializes the pointer values when allocating the log using log_create(). During log updates, the memory controller and log_truncate() function update the pointers. If log_grow() is used, we employ additional registers to store the head and tail pointers of newly allocated log regions and an indicator of the active log region.

C. An Optional Volatile Log Buffer

To improve performance of log updates to NVRAM, we provide an optional log buffer (a volatile FIFO, similar to WCB) in the memory controller to buffer and coalesce log updates. This log buffer is not required for ensuring data persistence, but only for performance optimization.

Data persistence requires that log records arrive at NVRAM before the corresponding cache line with the working data. Without the log buffer, log updates are directly forced to the NVRAM bus without buffering in the processor. If we choose to adopt a log buffer with N entries, a log entry will take N cycles to reach the NVRAM bus. A data store sent to the L1 cache takes at least the latency (cycles) of all levels of cache access and memory controller queues before reaching the NVRAM bus. The minimum value of this latency is known at design time. Therefore, we can ensure that log updates arrive at the NVRAM bus before the corresponding data stores by designing N to be smaller than the minimum number of cycles for a data store to traverse through the cache hierarchy. Section VI evaluates the bound of N and system performance across various log buffer sizes based on our system configurations.

D. Cache Modifications

To implement our cache force write-back scheme, we add one fwb bit to the tag of each cache line, alongside the dirty bit as in conventional cache implementations. FWB maintains three states (IDLE, FLAG, and FWB) for each cache block using these state bits.

Cache Block State Transition. Figure 5 shows the finite-state machine for FWB, implemented in the cache controller of each level. When an application begins executing, cache controllers initialize (reset) each cache line to the IDLE state by setting the fwb bit to 0. Standard cache implementation also initializes dirty and valid bits to 0. During application

[Figure 5 content: state diagram over {fwb, dirty} = {0, 0}, {0, 1}, and {1, 1}, with transitions labeled reset, cache line write, set fwb=1, force-write-back, write-back, and not dirty.]
Figure5.StatemachineincachecontrollerforFWB.execution,baselinecachecontrollersnaturallysetthedirtyandvalidbitsto1wheneveracachelineiswrittenandresetthedirtybitbackto0afterthecachelineiswrittenbacktoalowerlevel(typicallyoneviction).Toimplementourstatemachine,thecachecontrollersperiodicallyscanthevalid,dirty,andfwbbitsofeachcachelineandperformsthefollowing.€Acachelinewith{fwb,dirty}={0,0}isinIDLEstate;thecachecontrollerdoesnothingtothosecachelines;€Acachelinewith{fwb,dirty}={0,1}isintheFLAGstate;thecachecontrollersetsthefwbbitto1.Thisindicatesthatthecachelineneedsawrite-backduringthenextscanningiterationifitisstillinthecache.€Acachelinewith{fwb,dirty}={1,1}isinFWBstate;thecachecontrollerforcewrites-backthisline.Aftertheforcedwrite-back,thecachecontrollerchangesthelinebacktoIDLEstatebyresetting{fwb,dirty}={0,0}.€Ifacachelineisevictedfromthecacheatanypoint,thecachecontrollerresetsitsstatetoIDLE.DeterminingtheCacheFWBFrequency.Thetagscanningfrequencydeterminesthefrequencyofourcacheforcewrite-backoperations.TheFWBmustoccurasfrequentlyastoensurethattheworkingdataiswrittenbacktoNVRAMbeforeitslogrecord

sareoverwrittenbynewerupdates.Asaresult
sareoverwrittenbynewerupdates.Asaresult,themorefrequentthewriterequests,themorefrequentthelogwillbeoverwritten.Thelargerthelog,thelessfrequentthelogwillbeoverwritten.Therefore,thescanningfrequencyisdeterminedbythemaximumlogupdatefrequency(boundedbyNVRAMwritebandwidthsinceapplicationscannotwritetotheNVRAMfasterthanitsbandwidth)andlogsize(seethesensitivitystudyinSectionVI).Toaccommodatelargecachesizeswithlowscanningperformanceoverhead,wealsogrowthesizeofthelogtoreducethescanningfrequencyaccordingly.E.SummaryofHardwareOverheadTableIpresentsthehardwareoverheadofourimple-menteddesignintheprocessor.NotethatthesevaluesmayvarydependingonthenativeprocessorandISA.Ourimplementationassumesa64-bitmachine,hencewhythecircularlogheadandtailpointersare8bytes.Onlyhalfofthesebytesarerequiredina32-bitmachine.Thesizeofthelogbuffervariesbasedonthesizeofthecacheline.Thesizeoftheoverheadneededforthefwbstatevariesonthetotalnumberofcachelinesatalllevelsofcache.Thisismuchlowerthanpreviousstudiesthattracktransaction343MechanismLogicTypeSizeTransactionIDregister”ip-”ops1ByteLogheadpointerregister”ip-”ops8BytesLogtailpointerregister”ip-”ops8BytesLogbuffer(optional)SRAM964BytesFwbtagbitSRAM768BytesTableISUMMARYOFMAJORHARDWAREOVERHEAD.informationincachetags[13].Thenumbersinthetablewerecomputedbasedonthespeci“cationsofalloursystemcachesdescribedinSectionV.Notethatthesearemajorstatelogiccomponentson-chip.Ourdesignalsoalsorequiresadditionalgatesforlogicoperations.However,thesegatesareprimarilysmallandmedium-sizedgates,onthesamecomplexitylevelasamultiplexerordecoder.F.RecoveryWeoutlinethestepsofrecoveringthepersistentdatainsystemsthatadoptourdesign.Step1:Followingapowerfailure,the“rststepistoobtaintheheadandtailpointersoftheloginNVRAM.Thesepointersarepartofthelogstructure.Theyallowsystemstocorrectlyorderthelogentries.Weuseonlyonecentralizedcircularlogforalltransactionsforallthreads.Step2:ThesystemrecoveryhandlerfetcheslogentriesfromNVRAMandusetheaddress,oldvalue,andnewvalue“eldstogeneratewritestoNVRAM
to the addresses specified. The addresses are maintained via the page table in NVRAM. We identify which writes did not commit by tracing back from the tail pointer. Log entries with mismatched values in NVRAM are considered non-committed. The address stored with each entry corresponds to the address of the persistent data member. Aside from the head and tail pointers, we also use the torn bit to correctly order these writes [19]. Log entries with the same txid and torn bit are complete.

Step 3: The generated writes bypass the caches and go directly to NVRAM. We use volatile caches, so their states are reset and all generated writes on recovery are persistent. Therefore, they can bypass the caches without issue.

Step 4: We update the head and tail pointers of the circular log for each generated persistent write. After all updates from the log are redone (or undone), the head and tail pointers of the log point to entries to be invalidated.

V. EXPERIMENTAL SETUP

We evaluate our design by implementing it in McSimA+ [41], a Pin-based [42] cycle-level multi-core simulator. We configure the simulator to model a multi-core out-of-order processor with an NVRAM DIMM, as described in Table II. Our simulator also models additional memory traffic for logging and clwb instructions.

Processor            Similar to Intel Core i7 / 22 nm
Cores                4 cores, 2.5GHz, 2 threads/core
IL1 Cache            32KB, 8-way set-associative, 64B cache lines, 1.6ns latency
DL1 Cache            32KB, 8-way set-associative, 64B cache lines, 1.6ns latency
L2 Cache             8MB, 16-way set-associative, 64B cache lines, 4.4ns latency
Memory Controller    64-/64-entry read/write queues
NVRAM DIMM           8GB, 8 banks, 2KB row; 36ns row-buffer hit, 100/300ns read/write row-buffer conflict [44]
Power and Energy     Processor: 149W (peak); NVRAM: row buffer read (write): 0.93 (1.02) pJ/bit, array read (write): 2.47 (16.82) pJ/bit [44]

Table II. PROCESSOR AND MEMORY CONFIGURATIONS.

Name           Memory Footprint    Description
Hash [29]      256MB               Searches for a value in an open-chain hash table. Insert if absent, remove if found.
RBTree [13]    256MB               Searches for a value in a red-black tree. Insert if absent, remove if found.
SPS [13]       1GB                 Random swaps between entries in a 1GB vector of values.
BTree [45]     256MB               Searches for a value in a B+ tree. Insert if absent, remove if found.
SSCA2 [46]     16MB                A transactional implementation of SSCA 2.2, performing several analyses of a large, scale-free graph.

Table III. A LIST OF EVALUATED MICROBENCHMARKS.

We feed the performance simulation results into McPAT [43], a widely used architecture-level power and area modeling tool, to estimate processor dynamic energy consumption. We modify the McPAT processor configuration to model our hardware modifications, including the components added to support HWL and FWB. We adopt phase-change memory parameters in the NVRAM DIMM [44]. Because all of our performance numbers shown in Section VI are relative, the same observations are valid

for different NVRAM latency and access energy. Our work focuses on improving persistent memory access, so we do not evaluate DRAM access in our experiments.

We evaluate both microbenchmarks and real workloads in our experiments. The microbenchmarks repeatedly update persistent memory, storing to different data structures including a hash table, red-black tree, array, B+ tree, and graph. These data structures are widely used in storage systems [29]. Table III describes these benchmarks. Our experiments use multiple versions of each benchmark and vary the data type between integers and strings within them. Data structures with integer elements pack less data (smaller than a cache line) per element, whereas those with strings require multiple cache lines per element. This allows us to explore complex structures used in real-world applications. In our microbenchmarks, each transaction performs an insert, delete, or swap operation. The number of transactions is proportional to the data structure size, listed as "memory footprint" in Table III. We compile these benchmarks in native x86 and run them on the McSimA+ simulator. We evaluate both single-threaded and multithreaded versions of each benchmark. In addition, we evaluate the set of real workload benchmarks from the WHISPER persistent memory benchmark suite [11]. The benchmark suite incorporates various workloads, such as key-value stores, in-memory databases, and persistent data caching, which are likely to benefit from future persistent memory techniques.

VI. RESULTS

We evaluate our design in terms of transaction throughput, instructions per cycle (IPC), instruction count, NVRAM traffic, and dynamic energy consumption. Our experiments compare the following cases.

• non-pers: This uses NVRAM as a working memory without any data persistence or logging. This configuration yields an ideal yet unachievable performance for persistent memory systems [13].
• unsafe-base: This uses software logging without forced cache write-backs. As such, it does not guarantee data persistence (hence "unsafe"). Note that the dashed lines in our figures show the best case achieved between either redo or undo logging for that benchmark.
• redo-clwb and undo-clwb: Software redo and undo logging, respectively. These invoke the clwb instruction to force cache write-backs after persistent transactions.
• hw-rlog and hw-ulog: Hardware redo or undo logging with no persistence guarantee (as in unsafe-base). These show an extremely optimized performance of hardware undo or redo logging [13].
• hwl: This design includes undo+redo logging from our hardware logging (HWL) mechanism, but uses the clwb instruction to force cache write-backs.
• fwb: This is the full implementation of our hardware undo+redo logging design with both HWL and FWB.

A. Microbenchmark Results

We make the following major observations from our microbenchmark experiments and analyze the results. We evaluate benchmark configurations from one to eight threads. The prefixes of these results correspond to one (-1t), two (-2t), four (-4t), and eight (-8t) threads.

System Performance and Energy Consumption. Figure 6 and Figure 8 compare the transaction throughput and memory dynamic energy of each design. We observe that processor dynamic energy is not significantly altered by different configurations. Therefore, we only show memory dynamic energy in the figure. The figures illustrate that hwl alone improves system throughput and dynamic energy consumption, compared with software logging. Note that our design supports undo+redo logging, while the evaluated software logging mechanisms only support either undo or redo logging, not both. Fwb yields higher throughput and lower energy consumption: overall, it improves throughput by 1.86× with one thread and 1.75× with eight threads, compared with the better of redo-clwb and undo-clwb. The SSCA2 and BTree benchmarks show less throughput and energy improvement over software logging. This is because SSCA2 and BTree use more complex data structures, where the overhead of manipulating the data structures outweighs that of the log structures. Figure 9 shows that our design substantially reduces NVRAM writes.

The figures also show that unsafe-base, redo-clwb, and undo-clwb significantly degrade throughput by up to 59% and impose up to 62% memory energy overhead compared with the ideal case non-pers. Our design brings system throughput back up. Fwb achieves 1.86× throughput, with only 6% processor-memory and 20% dynamic memory energy overhead, respectively. Furthermore, our design's performance and energy benefits over software logging remain as we increase the number of threads.

IPC and Instruction Count. We also study IPC and the number of executed instructions, shown in Figure 7. Overall, hwl and fwb significantly improve IPC over software logging. This appears promising because the figure shows our hardware logging design executes far fewer instructions. Compared with non-pers, software logging imposes up to 2.5× the number of instructions executed. Our design fwb only imposes a 30% instruction overhead.

Performance Sensitivity to Log Buffer Size. Section IV-C discusses how the log buffer size is bounded by the data persistence requirement: log updates must arrive at NVRAM before their corresponding working data updates. This bound is 15 entries based on our processor configuration. Indeed, larger log buffers improve throughput more, as we observed using the hash benchmark (Figure 11(a)). An 8-entry log buffer improves system throughput by 10%; our implementation with a 15-entry log buffer improves throughput by 18%. Further increasing the log buffer size, which may no longer guarantee data persistence, additionally improves system throughput until reaching the NVRAM write bandwidth limitation (64 entries based on our NVRAM configuration). Note that the system throughput results with 128 and 256 entries are generated assuming infinite NVRAM write bandwidth. We also improve throughput over the baseline hardware logging schemes hw-rlog and hw-ulog.

Relation Between FWB Frequency and Log Size. Section IV-D discusses how the force write-back frequency is determined by the NVRAM write bandwidth and log size. With a given NVRAM write bandwidth, we study the relation between the required FWB frequency and log size. Figure 11(b) shows that we only need to perform forced write-backs every three million cycles if we have a 4MB log. As a result, the fwb tag scanning only introduces 3.6% performance overhead with our 8MB cache.

Figure 6. Transaction throughput speedup (higher is better), normalized to the baseline.

Figure 7. IPC speedup (higher is better) and instruction count (lower is better), normalized to the baseline.

Figure 8. Dynamic energy reduction (higher is better), normalized to the baseline (dashed line).

B. WHISPER Results

Compared with the microbenchmarks, we observe even more promising performance and energy improvements in real persistent memory workloads in the WHISPER benchmark suite with large data sets (Figure 10). Among the WHISPER benchmarks, the ctree and hashmap benchmarks closely correspond to and reflect the results achieved in our microbenchmarks due to their similarities. Although the magnitude of improvement varies, our design leads to much higher performance, lower energy, and lower NVRAM traffic than our baselines. Compared with redo-clwb, our design significantly reduces the dynamic memory energy consumption of tpcc and ycsb due to the high write intensity in these workloads. Overall, our design achieves up to 2.7× the throughput of the best case redo-clwb. This is also within 73% of the non-pers throughput of the same benchmarks. In addition, our design achieves up to a 2.43× reduction in dynamic memory energy over the baselines.

Figure 9. Memory write traffic reduction (higher is better), normalized to the baseline (dashed line).

Figure 10. WHISPER benchmark results, including IPC, dynamic memory energy consumption, transaction throughput, and NVRAM write traffic, normalized to the baseline (the dashed line).
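The relation between log size and the required FWB scan interval (Section IV-D, Figure 11(b)) can be illustrated with a back-of-the-envelope calculation. The sketch below is only an illustration under stated assumptions, not the model used in the evaluation. It assumes that the circular log cannot be overwritten faster than a sustained NVRAM write bandwidth allows, and that a dirty line may need up to two scan passes (one to enter FLAG, one to be force-written back) before its data reaches NVRAM. The function name and the 4 GB/s bandwidth are hypothetical; only the 2.5GHz clock and the log sizes in MB follow the text.

```python
def max_scan_interval_cycles(log_bytes, nvram_write_bw, clock_hz, passes=2):
    """Return an upper bound (in core cycles) on the FWB tag-scan interval.

    log_bytes      -- size of the circular log in bytes
    nvram_write_bw -- assumed sustained NVRAM write bandwidth, bytes/second
    clock_hz       -- core clock frequency in Hz
    passes         -- scan passes a dirty line may need before write-back
                      (assumption: FLAG on one pass, FWB on the next)
    """
    if min(log_bytes, nvram_write_bw, clock_hz, passes) <= 0:
        raise ValueError("all parameters must be positive")
    # The log wraps no sooner than log_bytes / nvram_write_bw seconds.
    # A line must be written back within that window, and may wait for
    # `passes` scans, so each scan interval gets 1/passes of the window.
    # Integer arithmetic keeps the result exact for integer inputs.
    return (log_bytes * clock_hz) // (nvram_write_bw * passes)


if __name__ == "__main__":
    MIB = 1 << 20
    for log_mib in (1, 2, 4, 8):
        cycles = max_scan_interval_cycles(
            log_mib * MIB,
            4_000_000_000,   # hypothetical 4 GB/s NVRAM write bandwidth
            2_500_000_000,   # 2.5 GHz core clock (Table II)
        )
        print(f"{log_mib} MiB log: scan at least every {cycles} cycles")
```

Under these assumptions the permitted interval grows linearly with the log size, which matches the qualitative trend described in the text: a larger log allows less frequent scanning. The absolute numbers depend entirely on the assumed bandwidth and should not be read as reproducing Figure 11(b).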

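The FWB tag scan whose overhead is reported above follows the per-line state machine of Section IV-D (Figure 5). A minimal software model of that state machine is sketched here, assuming a flat list of cache lines and a caller-supplied write-back callback. The class, function, and parameter names are hypothetical and stand in for hardware behavior; this is a sketch of the described transitions, not the simulator implementation.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    valid: bool = False
    dirty: bool = False
    fwb: bool = False

    def state(self):
        """Map the {fwb, dirty} bits to the state names used in the text."""
        if not self.dirty:
            return "IDLE"
        return "FWB" if self.fwb else "FLAG"


def write(line):
    """A store to the line sets the valid and dirty bits (baseline behavior)."""
    line.valid = True
    line.dirty = True


def evict(line, write_back):
    """Eviction writes back dirty data and resets the line to IDLE."""
    if line.valid and line.dirty:
        write_back(line)
    line.valid = line.dirty = line.fwb = False


def scan(lines, write_back):
    """One periodic tag scan; returns the number of forced write-backs."""
    forced = 0
    for line in lines:
        if not line.valid or not line.dirty:
            line.fwb = False          # IDLE: nothing to do
        elif not line.fwb:
            line.fwb = True           # FLAG: mark for write-back next scan
        else:
            write_back(line)          # FWB: force the write-back now
            line.dirty = False
            line.fwb = False          # back to IDLE
            forced += 1
    return forced
```

In this model a line written just after a scan is flagged on the next scan and written back on the one after, so dirty data stays in the cache for at most roughly two scan intervals unless it is evicted earlier.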
Figure 11. Sensitivity studies of (a) system throughput with varying log buffer sizes and (b) cache fwb frequency with various NVRAM log sizes.

VII. RELATED WORK

Compared to previous architecture support for persistent memory systems, our design further relaxes ordering constraints on caches with less hardware cost.

Footnote: Volatile TM supports concurrency but does not guarantee persistence in memory.

Hardware support for logging. Several recent studies proposed hardware support for log-based persistent memory design. Lu et al. propose custom hardware logging mechanisms and multi-versioning caches to reduce intra- and inter-transaction dependencies [24]. However, they require both large-scale changes to the cache hierarchy and cache multi-versioning support. Kolli et al. propose delegated persist ordering [15], which substantially relaxes persistence ordering constraints by leveraging hardware support and cache coherence. However, the design relies on snoop-based coherence and a dedicated persistent memory controller. Instead, our design is flexible because it directly leverages the information already in the baseline cache hierarchy. ATOM [35] and DudeTM [47] only implement either undo or redo logging. As a result, these studies do not provide the level of relaxed ordering offered by our design. In addition, DudeTM [47] also relies on a mechanism which can incur substantial memory access cost. Doshi et al. use redo logging for data recoverability with a backend controller [48]. The backend controller reads log entries from the log in memory and updates data in place. However, this design can unnecessarily saturate the memory read bandwidth needed for critical read operations. Also, it requires a separate victim cache to protect from dirty cache blocks. Instead, our design directly uses dirty cache bits to enforce persistence.

Hardware support for persistent memory. Recent studies also propose general hardware mechanisms for persistent memory with or without logging. Recent works propose that caches may be implemented in software [49], or that an additional non-volatile cache be integrated in the processor [50], [13], to maintain persistence. However, doing so can double the memory footprint for persistent memory operations. Other works [31], [26], [51] optimize the memory controller to improve performance by distinguishing logging and data updates. Epoch barriers [29], [16], [52] have been proposed to relax the ordering constraints of persistent memory by allowing coarse-grained transaction ordering. However, epoch barriers incur non-trivial overhead in the cache hierarchy. Furthermore, system performance can be sub-optimal with small epoch sizes, which are observed in many persistent memory workloads [11]. Our design uses lightweight hardware changes on existing processor designs without expensive non-volatile on-chip transaction buffering components.

Persistent memory design in software. Previous works, such as Mnemosyne [19] and REWIND [53], utilize write-ahead logging implemented in software. These rely on instructions such as clflush, clwb, and pcommit to achieve persistency and enforce log-data ordering in their critical path. Our design does not require these persistence instructions, which we have shown can clog the pipeline and are often inefficient or unnecessary. JUSTDO [54] logging also relies on these instructions (for manual flushing by the programmer). Additionally, these works' programming models all falter because Intel's pcommit instruction is now deprecated. Our design works on legacy code and does not have strict functionality requirements on the programming model. This makes it immune to relying on instructions or software functions that become deprecated.

VIII. CONCLUSIONS

We proposed a hardware logging scheme that allows the caches to perform cache line updates and write-backs without non-volatile buffers or caches in the processor. These mechanisms make up a complexity-effective design that excels over traditional software logging for persistent memory. Our evaluation shows that our design significantly increases performance while reducing memory traffic and energy.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their valuable feedback. This paper is supported in part by NSF grants 1652328 and 1718158, and the NSF I/UCRC Center for Research on Storage Systems.

REFERENCES

[1] V. Sousa, "Phase change materials engineering for RESET current reduction," in Proceedings of the Memory Workshop, 2012.
[2] C. Cagli, "Characterization and modelling of electrode impact in HfO2-based RRAM," in Proceedings of the Memory Workshop, 2012.
[3] W. Zhao, E. Belhaire, Q. Mistral, C. Chappert, V. Javerliac, B. Dieny, and E. Nicolle, "Macro-model of spin-transfer torque based magnetic tunnel junction device for hybrid magnetic-CMOS design," in Behavioral Modeling and Simulation Workshop, Proceedings of the 2006 IEEE International, Sept. 2006, pp. 40-43.
[4] Intel and Micron, "Intel and Micron produce breakthrough memory technology," 2015, http://newsroom.intel.com/community/intelnewsroom/.
[5] Intel, "Intel architecture instruction set extensions programming reference," 2016, https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf.
[6] ARM, "ARMv8-a architecture evolution," 2016.
[7] N. Christiansen, "Storage class memory support in the Windows OS," in SNIA NVM Summit, 2016.
[8] C. Diaconu, "Microsoft SQL Hekaton - towards large scale use of PM for in-memory databases," in SNIA NVM Summit, 2016.
[9] J. Moyer, "Persistent memory in Linux," in SNIA NVM Summit, 2016.
[10] K. Deierling, "Persistent memory over fabric," in SNIA NVM Summit, 2016.
[11] S. Nalli, S. Haria, M. D. Hill, M. M. Swift, H. Volos, and K. Keeton, "An analysis of persistent memory use with WHISPER," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2017, pp. 1-14.
[12] S. Haria, S. Nalli, M. M. Swift, M. D. Hill, H. Volos, and K. Keeton, "Hands-off persistence system (HOPS)," in Non-volatile Memories Workshop, 2017.
[13] J. Zhao, S. Li, D. H. Yoon, Y. Xie, and N. P. Jouppi, "Kiln: Closing the performance gap between systems with and without persistence support," in Proceedings of the 46th International Symposium on Microarchitecture (MICRO), 2013, pp. 421-432.
[14] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition: A Quantitative Approach, 5th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
[15] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen, and T. F. Wenisch, "Delegated persist ordering," in Proceedings of the 49th International Symposium on Microarchitecture, 2016, pp. 1-13.
[16] S. Pelley, P. M. Chen, and T. F. Wenisch, "Memory persistency," in Proceedings of the International Symposium on Computer Architecture, 2014, pp. 1-12.
[17] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood, "LogTM: log-based transactional memory," in Proceedings of the 12th International Symposium on High Performance Computer Architecture, 2006, pp. 1-12.
[18] Intel, "a collection of linux persistent memory programming examples," https://github.com/pmem/linux-examples.
[19] H. Volos, A. J. Tack, and M. M. Swift, "Mnemosyne: Lightweight persistent memory," in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI. New York, NY, USA: ACM, 2011, pp. 91-104.
[20] S. Pelley, T. F. Wenisch, B. T. Gold, and B. Bridge, "Storage management in the NVRAM era," Proceedings of the VLDB Endowment, vol. 7, no. 2, 2013.
[21] I. Moraru, D. G. Andersen, M. Kaminsky, N. Tolia, N. Binkert, and P. Ranganathan, "Consistent, durable, and safe memory management for byte-addressable nonvolatile main memory," in Proceedings of the ACM Conference on Timely Results in Operating Systems, 2013, pp. 1-17.
[22] A. Kolli, S. Pelley, A. Saidi, P. M. Chen, and T. F. Wenisch, "High-performance transactions for persistent memories," in Proceedings of the 21st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2016, pp. 1-12.
[23] Y. Lu, J. Shu, and L. Sun, "Blurred persistence in transactional persistent memory," in Mass Storage Systems and Technologies (MSST), 2015 31st Symposium on, 2015, pp. 1-13.
[24] Y. Lu, J. Shu, L. Sun, and O. Mutlu, "Loose-ordering consistency for persistent memory," in ICCD, 2014.
[25] S. Kannan, A. Gavrilovska, and K. Schwan, "Reducing the cost of persistence for nonvolatile heaps in end user devices," in Proceedings of the International Symposium on High Performance Computer Architecture, 2014, pp. 1-12.
[26] R.-S. Liu, D.-Y. Shen, C.-L. Yang, S.-C. Yu, and C.-Y. M. Wang, "NVM Duet: Unified working memory and persistent store architecture," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2014, pp. 455-470.
[27] A. M. Caulfield, T. I. Mollov, L. A. Eisner, A. De, J. Coburn, and S. Swanson, "Providing safe, user space access to fast, solid state disks," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2012, pp. 387-400.
[28] J. Arulraj, M. Perron, and A. Pavlo, "Write-behind logging," Proc. VLDB Endow., vol. 10, no. 4, pp. 337-348, Nov. 2016.
[29] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson, "NV-heaps: making persistent objects fast and safe with next-generation, non-volatile memories," in International Conference on Architectural Support for Programming Languages and Operating Systems, 2011, pp. 105-118.
[30] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, "ThyNVM: Enabling software-transparent crash consistency in persistent memory systems," in Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48), 2015, pp. 1-13.
[31] J. Zhao, O. Mutlu, and Y. Xie, "FIRM: Fair and high-performance memory control for persistent memory systems," in Proceedings of the 47th International Symposium on Microarchitecture (MICRO-47), 2014.
[32] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz, "ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging," ACM Trans. Database Syst., vol. 17, no. 1, pp. 94-162, Mar. 1992.
[33] L. Lamport, "Proving the correctness of multiprocess programs," IEEE Trans. Softw. Eng., vol. 3, no. 2, pp. 125-143, Mar. 1977.
[34] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design), 4th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.
[35] A. Joshi, V. Nagarajan, S. Viglas, and M. Cintra, "Atom: Atomic durability in non-volatile memory through hardware logging," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb. 2017, pp. 361-372.
[36] T. Wang and R. Johnson, "Scalable logging through emerging non-volatile memory," Proc. VLDB Endow., vol. 7, no. 10, pp. 865-876, Jun. 2014.
[37] J. Huang, K. Schwan, and M. K. Qureshi, "Nvram-aware logging in transaction systems," Proc. VLDB Endow., vol. 8, no. 4, pp. 389-400, Dec. 2014.
[38] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "A durable and energy efficient main memory using phase change memory technology," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09. New York, NY, USA: ACM, 2009, pp. 14-23.
[39] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali, "Enhancing lifetime and security of pcm-based main memory with start-gap wear leveling," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 14-23.
[40] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable high performance main memory system using phase-change memory technology," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09. New York, NY, USA: ACM, 2009, pp. 24-33.
[41] J. H. Ahn, S. Li, O. Seongil, and N. Jouppi, "Mcsima+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling," in Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on, 2013, pp. 74-85.
[42] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood,
"Pin: Building customized program analysis tools with dynamic instrumentation," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, New York, NY, USA, 2005, pp. 190-200.
[43] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 469-480.
[44] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in International Symposium on Computer Architecture, 2009, pp. 2-13.
[45] T. Bingmann, "STX B+ Tree, Sept. 2008," http://panthema.net/2007/stx-btree.
[46] D. A. Bader and K. Madduri, "Design and implementation of the hpcs graph analysis benchmark on symmetric multiprocessors," in Proceedings of the 12th International Conference on High Performance Computing, 2005, pp. 465-476.
[47] M. Liu, M. Zhang, K. Chen, X. Qian, Y. Wu, W. Zheng, and J. Ren, "Dudetm: Building durable transactions with decoupling for persistent memory," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '17. New York, NY, USA: ACM, 2017, pp. 329-343.
[48] K. Doshi, E. Giles, and P. Varman, "Atomic persistence for SCM with a non-intrusive backend controller," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016, pp. 77-89.
[49] P. Li, D. R. Chakrabarti, C. Ding, and L. Yuan, "Adaptive software caching for efficient nvram data persistence," in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017, pp. 112-122.
[50] C.-H. Lai, J. Zhao, and C.-L. Yang, "Leave the cache hierarchy operation as it is: A new persistent memory accelerating approach," in Proceedings of the 54th Annual Design Automation Conference 2017, ser. DAC '17. New York, NY, USA: ACM, 2017, pp. 5:1-5:6.
[51] L. Sun, Y. Lu, and J. Shu, "DP2: Reducing transaction overhead with differential and dual persistency in persistent memory," in Proceedings of the 12th ACM International Conference on Computing Frontiers, 2015, pp. 24:1-24:8.
[52] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, "Better I/O through byte-addressable, persistent memory," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ser. SOSP '09. New York, NY, USA: ACM, 2009, pp. 133-146.
[53] A. Chatzistergiou, M. Cintra, and S. D. Viglas, "REWIND: Recovery write-ahead system for in-memory non-volatile data-structures," Proc. VLDB Endow., vol. 8, no. 5, pp. 497-508, Jan. 2015.
[54] J. Izraelevitz, T. Kelly, and A. Kolli, "Failure-atomic persistent memory updates via justdo logging," SIGPLAN Not., vol. 51, no. 4, pp. 427-442
