Mojim: A Reliable and Highly-Available Non-Volatile Memory System

Yiying Zhang, Jian Yang, Amirsaman Memaripour, Steven Swanson
Department of Computer Science and Engineering, University of California, San Diego
{yiyingzhang, jianyang, amemarip, swanson}@cs.ucsd.edu

Abstract

Next-generation non-volatile memories (NVMs) promise DRAM-like performance, persistence, and high density. They can attach directly to processors to form non-volatile main memory (NVMM) and offer the opportunity to build very low-latency storage systems. These high-performance storage systems would be especially useful in large-scale data center environments where reliability and availability are critical. However, providing reliability and availability to NVMM is challenging, since the latency of data replication can overwhelm the low latency that NVMM should provide.

We propose Mojim, a system that provides the reliability and availability that large-scale storage systems require, while preserving the performance of NVMM. Mojim achieves these goals by using a two-tier architecture in which the primary tier contains a mirrored pair of nodes and the secondary tier contains one or more secondary backup nodes with weakly consistent copies of data. Mojim uses highly-optimized replication protocols, software, and networking stacks to minimize replication costs and expose as much of NVMM's performance as possible.

We evaluate Mojim using raw DRAM as a proxy for NVMM and using an industrial NVMM emulation system. We find that Mojim provides replicated NVMM with similar or even better performance than un-replicated NVMM (reducing latency by 27% to 63% and delivering between 0.4× to 2.7× the throughput). We demonstrate that replacing MongoDB's built-in replication system with Mojim improves MongoDB's performance by 3.4× to 4×.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASPLOS '15, March 14–18, 2015, Istanbul, Turkey.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2835-7/15/03...$15.00.
http://dx.doi.org/10.1145/2694344.2694370

Categories and Subject Descriptors: D.4.2 [Storage Management]: Main memory

Keywords: non-volatile memory; distributed storage systems; reliability; availability; data center; storage-class memory

1. Introduction

Fast, non-volatile memory technologies such as phase change memory (PCM), spin-transfer torque magnetic memories (STTMs), and the memristor are poised to radically alter the performance landscape for storage systems. They will blur the line between storage and memory, forcing designers to rethink how volatile and non-volatile data interact and how to manage non-volatile memories as reliable storage.

Attaching NVMs directly to processors will produce non-volatile main memories (NVMMs), exposing the performance, flexibility, and persistence of these memories to applications. However, taking full advantage of NVMMs' potential will require changes in system software [3].

The need for such changes is especially acute in large-scale data center environments where storage systems require more than simple non-volatility. These environments demand reliability and availability in the face of hardware, software, and network failures. Without this reliability and availability, NVMM will only be suitable as a transient data store or as a caching layer; it will not be able to serve as a reliable primary storage medium.

Data centers traditionally provide both reliability and availability by adding redundancy using replication [5, 11, 15, 55, 56] or erasure coding schemes [18, 21]. These approaches rest on the assumption that storage is slow, so the cost of the network and software protocols required to implement replications is acceptable. NVMM will change this situation completely, since the networking and software overhead of existing replication mechanisms will squander the low latency that NVMM can provide. The interface with NVMM is also different from traditional storage: applications access NVMM directly with fine-grained memory operations.

We propose Mojim¹, a system that provides replicated, reliable, and highly-available NVMM as an operating system service. Applications can access data in Mojim using normal load and store instructions while controlling when and how updates propagate to replicas using system calls. Mojim allows applications to build data persistence abstractions ranging from simple log-based systems to complex transactions.

Mojim uses a two-tier architecture that allows flexibility in choosing different levels of reliability, availability, consistency, and monetary cost, while minimizing performance overhead. The primary tier includes one primary node and one mirror node. Mojim can, depending on the configuration, keep these nodes strongly or weakly consistent. An optional secondary tier provides an additional level of redundancy with one or more backup nodes that are weakly consistent with the primary tier.

Mojim efficiently replicates fine-grained data from the primary node to the mirror node using an optimized RDMA-based protocol that is simpler than existing replication protocols. The mirror node replicates data to the backup nodes in the background, thus keeping the secondary tier off the performance-critical path. This design offers good performance and two strongly consistent copies of data plus more copies of weakly consistent data. To further improve availability and reliability, Mojim also provides a fast recovery process and atomic semantics that guarantee data integrity.

In building Mojim, we explore the performance and monetary cost impacts of providing availability, reliability, and consistency with NVMM, and we explore trade-offs among replication protocols for NVMM. Interestingly, we find that adding availability, reliability, and consistency does not necessarily impair NVMM performance, as long as the replication protocols and software layers are optimized for NVMM.

We evaluate Mojim using raw DRAM as a proxy for future NVMMs and with an industrial NVMM emulation system. Our evaluation shows that, surprisingly, Mojim reduces the average latency of the un-replicated system by 27% to 63%, even when it provides strongly consistent copies of data. Mojim's performance gain is mainly due to inefficiencies in the current instruction sets the un-replicated system uses to enforce data persistence. Mojim provides 0.4× to 2.7× the throughput of the un-replicated system. We also run several popular applications including a file system [12], the Google Hash Table [16], and MongoDB [40] on Mojim. The MongoDB results are the most striking: Mojim is 3.4× to 4× faster than the MongoDB replication mechanism and 35× to 741× faster than un-replicated MongoDB.

The rest of the paper is organized as follows. Section 2 provides some background on persistent memory and replicated storage systems. We present Mojim and its implementation in Sections 3 and 4. Section 5 describes our experience adapting applications to use Mojim. We then present the evaluation results of Mojim in Section 6. Finally, we discuss related work in Section 7 and conclude in Section 8.

¹ Mojim is the Chinese word for “magic mirror.”

2. NVMM in the Data Center

NVMM blurs the line between memory and storage, and it poses new challenges for system designers and architects. Previous research on NVMM has focused on how to use these memories in a single machine [3, 8, 9, 12, 39, 58, 59], while most mission-critical data resides in distributed, replicated storage systems (e.g., in data centers). For NVMM to succeed as a first-class storage technology, it must provide the reliability and availability that these storage systems require [26].

Mojim's goal is to make NVMM a reliable and highly-available storage layer suited to these data center environments. Achieving this goal will require designers to address trade-offs between performance, reliability, monetary cost, and consistency. Below, we introduce NVMM and discuss why it demands new approaches to providing availability, reliability, and consistency in storage systems.

2.1 Next-Generation Non-Volatile Memory

NVM technologies are closing the performance, cost-per-bit, and capacity gap between low-latency, volatile memory technologies and high-capacity, persistent storage technologies [25, 32, 49, 62]. Next-generation NVM technologies (like PCM, the memristor, and STTM) are byte-addressable and projections show that their performance may approach that of DRAM [20, 27, 35, 60]. For example, PCM, the most mature next-generation NVM technology, has access latencies within a small factor of DRAM [33, 34, 38, 48].

Attaching next-generation NVMs to the main memory bus provides a raw storage medium that is orders of magnitude faster than modern disks and SSDs. NVMM presents many new technical challenges and has inspired a host of research projects on topics including OS management of NVMM [3], user-space libraries and programming models [8, 58], specialized NVMM file systems [9, 12, 59], and hybrid main memory and heterogeneous memory allocation [39, 41, 50]. This work focuses on the problems of replicating NVMM content so that NVMM can be applied in a distributed data center context.

2.2 NVMM Availability, Reliability, and Consistency

Although NVMM protects against power failure by making the contents of memory persistent, it does not address the other ways that systems fail, including software, hardware, and networking errors that are common in data centers [14, 42]. Providing availability and reliability in such environments is important to meet client SLAs [55] and application requirements. Strong consistency is also desirable in storage systems, since it makes it easier to reason about system correctness.

Adding redundancy or replication is a common technique for providing reliability and availability [1, 5, 11, 15, 17, 18, 29, 46, 47, 51, 52, 56, 61]. Various protocols exist to provide different consistency levels among redundant copies of data [2, 4, 11, 22, 31, 47]. For traditional storage systems with slow hard disks and SSDs, the performance overhead of replication is small relative to the cost of accessing a hard drive or SSD, even with complex protocols for strong consistency. With NVMMs, however, the networking round trips and software overhead involved in these techniques [4, 22, 31, 56] threaten to outstrip the low-latency benefit of using NVMMs in the first place. Even for systems with weak consistency [11, 47], increasing the rate of reconciliation between inconsistent copies of data can threaten performance [17, 28].

Since NVMM is vastly faster than existing storage technologies, it presents new challenges to data replication. First, NVMM-based systems must deliver high performance to justify their increased cost relative to disks or SSDs. Existing replication mechanisms built for these slower storage media have software and networking performance overhead that would obscure the performance benefits that NVMM could provide.

Second, NVMM is memory, and applications should be able to use it like memory (i.e., via load and store instructions without operating system overheads for most accesses) rather than as a storage device (i.e., via I/O system calls).

3. Mojim Design

Mojim provides an easy-to-use, generic layer of replicated NVMM that ensures reliability, availability, and consistency, while sacrificing as little of NVMM's performance as possible. Mojim uses a two-tier architecture and supports several operating modes to let applications tune Mojim's reliability, availability, and consistency to match their particular needs.

This section discusses Mojim's interfaces and architecture and the different modes Mojim provides.

3.1 Mojim's Interfaces

Mojim is an operating system service that provides reliable and highly-available NVMM. This section describes Mojim's typical usage scenario and the interface it provides.

To use Mojim, a system configuration file specifies a set of Mojim regions on the primary node to be replicated, along with a mirror node and a list of backup nodes where the replicas should reside. The primary node supports reads and writes to the replicated data. The mirror node and backup nodes support reads only.

Kernel modules can access these regions and use them to build complex, replicated, memory-based services such as a kernel-level persistent key value store, a persistent disk cache, or a file system. The kernel could also make these services available to applications via a malloc()-like interface.

While Mojim can serve as the basis for many memory-based services, deploying an NVMM-aware file system to manage the replicated NVMM region would provide the most flexibility in application usage models. The file system would provide familiar file-system-based mechanisms of allocation and naming as well as conventional file-based access for non-performance-critical applications. The key requirement of the file system is that, for an mmap'd file, it maps the NVMM pages corresponding to the file directly into the applications' address spaces rather than paging them in and out of the kernel's buffer cache. In our experiments, we use PMFS [12] for this purpose.

With a file system in place, applications can create files in the Mojim-backed file system and map them into their address space using mmap. We call the NVMM area mapped by applications the data area. After an mmap, applications can perform direct memory accesses to the data area using load and store instructions on the primary node and load instructions on the mirror node.

Mojim provides a mechanism called a sync point that allows applications to control when and what updates in the data area propagate to the replicas. At each sync point, Mojim atomically replicates all memory regions specified by an application. Two APIs allow applications to create sync points: msync and gmsync.

Mojim leverages the existing msync system call to specify a sync point that applies to a single, contiguous address range. The semantics of Mojim's msync correspond to conventional msync, and applications that use msync will work correctly without modification under Mojim. Mojim allows an application to specify a fine-grained memory region in the msync API and replicates it atomically, while traditional msync flushes page-aligned memory regions to persistent storage and does not provide atomicity guarantees [45]. Mojim's gmsync adds the ability to specify multiple memory regions for the sync point to replicate, allowing for more flexibility than msync.

Mojim provides a mechanism to allow applications to make their data persistent atomically, but it does not provide primitives for synchronization. It would be possible to add synchronization primitives to Mojim, but this would increase the complexity of the system and require selecting a set of synchronization mechanisms to support. A better approach would be to build synchronization mechanisms that leverage Mojim's mechanisms.

Figure 1 shows a simple example in C of how to use Mojim to manage an append-only log on Mojim. The program first opens and mmaps a file in a Mojim region. It then updates the access count of the log and makes this value persistent with the conventional msync API. Next, it appends a log entry and updates the size of the log. It makes both these data persistent with a gmsync call. The atomicity that gmsync provides guarantees that the log size is consistent with the log content on the replica nodes.
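
The contrast with conventional msync can be made concrete. On stock Linux, msync fails with EINVAL when the start address is not a multiple of the page size, so code that wants to sync a small object without Mojim has to widen the range to page boundaries itself. The following is a minimal sketch of such a helper. The names page_align_down and msync_range are hypothetical and introduced here only for illustration; they are not part of Mojim or PMFS, and the helper gives no atomicity guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not a Mojim API): round addr down to a multiple of
 * page. The page size must be a non-zero power of two. */
static uintptr_t page_align_down(uintptr_t addr, uintptr_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return addr & ~(page - 1);
}

/* Hypothetical helper (not a Mojim API): flush the pages that cover
 * [addr, addr + len) using conventional msync. Whole pages are written
 * back, and nothing here makes the update atomic.
 * Returns 0 on success and -1 on failure. */
static int msync_range(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    uintptr_t start = page_align_down((uintptr_t)addr, (uintptr_t)page);
    /* Extend the length by the bytes skipped when rounding down. */
    size_t span = (size_t)((uintptr_t)addr - start) + len;

    return msync((void *)start, span, MS_SYNC);
}
```

A call such as msync_range(log_size_p, sizeof(unsigned long)) would flush the whole page that holds the log size, which is the page-granularity behavior the text attributes to traditional msync.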

    int fd = open("/mnt/mmapfile", O_CREAT|O_RDWR);  // open a file in mounted Mojim region
    void *base = mmap(NULL, 40960, PROT_WRITE, MAP_SHARED, fd, 0);  // mmap a 40KB area in the file
    unsigned long *access_count_p = base;  // access count of the log
    unsigned long *log_size_p = base + sizeof(unsigned long);  // size of the log
    int *log = base + 2 * sizeof(unsigned long);  // the log

    *access_count_p = *access_count_p + 1;  // memory load and store
    msync(access_count_p, sizeof(unsigned long), MS_SYNC);  // call conventional msync

    int beautiful_num = 24;
    unsigned long curr_log_pos = *log_size_p;  // memory load and store
    log[curr_log_pos] = beautiful_num;
    *log_size_p = *log_size_p + 1;

    struct msync_input { void *address; int length; };
    struct msync_input input[2];
    input[0].address = &(log[curr_log_pos]);
    input[0].length = sizeof(int);
    input[1].address = log_size_p;
    input[1].length = sizeof(unsigned long);
    gmsync(input, 2, MS_MOJIM);  // call gmsync to commit the log append

Figure 1. Sample code to use Mojim. Code snippet that implements a simple log append operation with Mojim.

Figure 2. Mojim Architecture. The numbered circles represent different steps in the Mojim replication process. MLog stands for the metadata log.

3.2 Architecture

Mojim uses a two-tier architecture. The primary tier contains a primary node and its read-only mirror node; the secondary tier includes one or more backup nodes with weakly consistent, read-only copies of data. Figure 2 depicts the architecture of Mojim.

Mojim's primary tier contains a pair of mirroring nodes: a primary node replicates data to its mirror node at each sync point (i.e., call to msync or gmsync). The application can read and write data on the primary node, but Mojim only allows reads from the mirror node.

The primary tier offers good performance even when guaranteeing strong consistency, since it requires only one networking round trip for each sync point. Existing architectures that allow writes to all replicas (E-writeall) [31], or that use one primary and multiple secondary nodes (E-chain and E-broadcast) [2, 56], require multiple networking round trips or other performance overhead to guarantee strong consistency. We will discuss how Mojim differs from these existing schemes in more detail in Section 7.

To further improve performance, we connect the primary node and the
mirror node with a high-speed InfiniBand link and use an efficient software and networking layer to replicate data between them. To improve reliability, we place the primary node and the mirror node on different racks, since failure bursts often happen within the same rack [14, 42].

The optional secondary tier includes one or more backup nodes to maintain additional copies of data. It provides additional reliability and availability, so that failure bursts will not be catastrophic. The mirror node replicates data to the backup nodes in the background. Thus, data in the backup nodes is not strongly consistent with data in the primary tier. By keeping the replication to the secondary tier in the background and off the performance-critical path, Mojim ensures good application performance.

With both tiers in operation and a total of N nodes, Mojim can tolerate N−1 node failures. In most environments, one or a few backup nodes are enough to prevent data loss, since failure bursts are more likely to involve only a small number of nodes [14, 42]. Also, in most failure bursts, the nodes do not all fail at the same time; failures are usually separated by a few seconds. A fast recovery can thus prevent data loss even with few copies of replicated data. We discuss recovery optimizations in Section 4.3.

    Scheme           R     A      C             $
    S-unreplicated   0     Worst  N/A           Low
    M-async          1     Good   Weak          Fair
    M-sync           1     Good   Strong        Fair
    M-syncdisk       1     OK     Strong        Low
    M-syncsec        N−1   Best   Strong+Weak   High
    M-syncseceth     N−1   Good   Strong+Weak   Fair
    E-writeall       N−1   Best   Strong        High
    E-chain          N−1   Best   Strong        High
    E-broadcast      N−1   Best   Strong        High

Table 1. Replication Schemes. Mojim supports a wide range of reliability (R), availability (A), consistency (C), and monetary cost ($) levels (columns 2-5). The reliability column represents the number of node failures that can be tolerated in a system of N nodes. The last three rows compare Mojim to other existing replication schemes.

3.3 Mojim Modes and Replication Protocols

Mojim supports several replication modes and protocols that allow users to choose different levels of performance, reliability, consistency, availability, and monetary cost depending on application requirements.

Table 1 summarizes these different modes and their properties, and we discuss them below using the numbered circles in Figure 2 to illustrate the replication process in each mode.

Across all the modes Mojim provides, Mojim achieves most of its performance by adopting a different architecture than most replicated storage systems. Instead of supporting multiple consistent replicas, Mojim only supports strong consistency at a single mirror node. This decision makes our replication protocols much simpler (e.g., there's no need for multi-phase commit or a complex consensus protocol) and, therefore, allows for much higher performance.

Mojim achieves the goal of providing its atomic data persistence interface by ensuring that atomic operations are replicated atomically to the mirror node and the backup nodes, by appending replicated data to logs on the mirror node and the backup nodes.

Un-replicated without Mojim: A single machine without Mojim (S-unreplicated) must flush an msync'd memory region from the processor's caches to ensure data persistence. S-unreplicated has poor availability and is only as reliable as the NVM devices. Moreover, even if the NVMM is recovered after a crash, data can still be corrupted. For example, if a crash occurs after a pointer is made persistent but before the data it points to becomes persistent, the system will contain corrupted data.

Sync: Mojim's M-sync mode guarantees strong consistency
between the primary and the mirror node. It provides improved reliability and availability over S-unreplicated, since in the case of a failure the mirror node can take the place of the primary node without losing data.

In M-sync, when an application calls msync or gmsync (① in Figure 2), Mojim pushes data from the primary node to the mirror node via RDMA (③) and writes the data in the mirror node log (④). The primary node waits for the acknowledgment from the mirror node (⑤), and then returns the msync or gmsync call (⑥). The mirror node later takes a checkpoint to apply the log contents to the data area (⑦). Mojim stores both the mirror node logs and its data area in NVMM for high performance and fast recovery.

In M-sync, Mojim does not flush data from the primary node's caches (②). Modern RDMA devices are cache-coherent, so they will send the most up-to-date data [24, 30]. Thus, the mirror node always gets the latest data and pushing data to the mirror node is sufficient to ensure persistence.

If the primary node crashes, the mirror node has the most up-to-date data. If the mirror node crashes, the primary node has all the data, but it may not be persistent, so the primary node immediately flushes its caches to prevent data loss. This means there is a small “window of vulnerability” after a mirror node failure during which a primary node failure could result in data loss. On our system, this window lasts for 450 μs, the time required to flush the processor caches.

Surprisingly, our evaluation results show that M-sync offers performance comparable to or better than S-unreplicated because flushing CPU caches is often more expensive than pushing the data over RDMA. The current clflush instruction is strongly ordered and cannot utilize the parallelism offered in modern processor architecture. Intel recently announced two instructions that are more efficient than clflush and that will be available on future systems [23], which should help resolve this problem.

Sync with cache flush: To close the window of vulnerability mentioned above, Mojim can flush data from the primary node's caches (②) before returning to applications' msync or gmsync calls (⑥). This mode is called M-syncflush, and with M-syncflush, all data can survive simultaneous failures of the primary node and the mirror node.

Async: M-async provides weaker consistency between the primary node and the mirror node. M-async ensures that data is persistent on the primary node for each sync point (②) and pushes the data to the mirror node (③), but it does not wait for the mirror node's acknowledgment (⑤) to complete the application's msync or gmsync call (⑥). Thus, data on the mirror node can be out of date relative to the primary node. M-async must flush the primary node CPU caches at each sync point to ensure that the latest data is persistent.

Sync with slow storage: To reduce the monetary cost of M-sync, Mojim supports a mode that stores the log on the mirror node in NVMM, but stores the mirror node's data area on a hard disk or SSD (M-syncdisk). M-syncdisk has a slower recovery process than M-sync, since Mojim needs to read data from hard disk or SSD to NVMM before applications can access them.

Sync with the secondary tier: M-syncsec adds the secondary tier to M-sync and increases reliability and availability by adding more copies of data. Mojim replicates data from the mirror node to the backup node in the background (⑧-⑪). M-syncsec provides two strongly-consistent copies of the data at the primary and mirror nodes and one weakly-consistent data copy at each backup node. The amount of inconsistency between the mirror node and backup nodes is tunable and affects the recovery time. Even though the data at each backup node may be out-of-date, it still represents a consistent snapshot of application data because of the atomic semantics Mojim provides. Our evaluation results show that M-syncsec delivers performance similar to M-sync because replication to the backup nodes takes place in the background.

Sync with low-cost secondary tier: M-syncsec requires fast networks between the mirror node and backup nodes, which increases the monetary cost and networking bandwidth consumption of the system. A lower cost option, M-syncseceth, uses Ethernet between the mirror node and backup nodes. M-syncseceth has the worst performance of all the Mojim modes, but it still provides the same reliability, availability, and consistency guarantees as M-syncsec.

4. Implementation

This section describes our implementation of Mojim in the Linux kernel. The core of Mojim comprises an optimized network stack and the replication and recovery code.

4.1 Networking

The networking delay of data replication is the most important factor in determining Mojim's overall performance. Mojim uses InfiniBand (IB), a high-performance switched network that supports RDMA. RDMA is crucial because it allows the primary node to transfer data directly into the mirror node's NVMM without requiring additional buffering, copying, or cache flushes.

Mojim uses IB-Verbs, a set of native IB APIs based on send, receive, and completion queues [37]. IB-Verbs requires the application to post send (receive) requests to send (receive) queues. It uses completion messages in the completion queue to indicate the completion of requests and supports both polling and interrupts to detect completions. IB-Verbs offers native IB performance and outperforms alternative IB protocols such as IPoIB and RDS (see Section 6.2). Existing IB-Verbs implementations are user space libraries that bypass the kernel. We created a kernel version of IB-Verbs for Mojim.

Mojim uses a thin protocol based on the reliable transportation mode of IB-Verbs. The Mojim protocol directly fetches data from the primary node and writes it to NVMM on the mirror node. For each sync point, the primary node posts a send request on the IB send queue and polls for its completion. The mirror node posts a set of receive requests in advance and polls for the arrival of incoming messages. Our measurements show that polling is more efficient than interrupts.
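
The post-then-poll pattern described above can be illustrated with a toy model. The sketch below is a single-threaded simulation of a completion queue and a bounded busy-poll loop. It is not the IB-Verbs API and not Mojim's kernel code: every name (toy_cq, toy_cq_push, toy_cq_poll, toy_wait_for_send) is hypothetical, and the convention that status 0 means success and that statuses are non-negative is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CQ_DEPTH 8u  /* ring size; a power of two so index wrap is safe */

/* One completion record: which request finished, and how (0 = success). */
struct toy_completion {
    uint64_t request_id;
    int status;
};

/* Toy completion queue: a fixed-size ring buffer. */
struct toy_cq {
    struct toy_completion slots[TOY_CQ_DEPTH];
    unsigned head;  /* next slot to consume */
    unsigned tail;  /* next slot to fill */
};

/* Producer side, standing in for the network adapter reporting that a send
 * finished. Returns 0 on success, -1 if the queue is full. */
static int toy_cq_push(struct toy_cq *cq, uint64_t request_id, int status)
{
    assert(cq != NULL);
    if (cq->tail - cq->head == TOY_CQ_DEPTH)
        return -1;
    cq->slots[cq->tail % TOY_CQ_DEPTH] =
        (struct toy_completion){ request_id, status };
    cq->tail++;
    return 0;
}

/* Non-blocking poll: returns 1 and fills *out if a completion was waiting,
 * 0 if the queue was empty. */
static int toy_cq_poll(struct toy_cq *cq, struct toy_completion *out)
{
    assert(cq != NULL && out != NULL);
    if (cq->head == cq->tail)
        return 0;
    *out = cq->slots[cq->head % TOY_CQ_DEPTH];
    cq->head++;
    return 1;
}

/* Busy-poll up to max_polls times for the completion of request_id.
 * Completions for other requests are consumed and dropped in this toy.
 * Returns the completion status, or -1 if it was not seen in time. */
static int toy_wait_for_send(struct toy_cq *cq, uint64_t request_id,
                             unsigned max_polls)
{
    struct toy_completion c;
    for (unsigned i = 0; i < max_polls; i++) {
        if (toy_cq_poll(cq, &c) && c.request_id == request_id)
            return c.status;
    }
    return -1;
}
```

In this model a sync point corresponds to posting a request and then calling toy_wait_for_send with that request's ID. A real implementation would poll the adapter's completion queue instead of an in-memory ring, and the -1 result corresponds to the timeout case in which the sender has to retry.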

Figure 3. Example of Mojim Replication. An example of Mojim's replication process. Each cell represents a request. The letter in the cell stands for the memory address and the number in the cell represents its unique ID. The upper-left part shows three threads placing three gmsync calls. The upper-right part shows the data area on the mirror node. The end-mark symbol denotes the end of a gmsync operation. The gray cell in the mirror node data area represents the data that is recovered after a crash.

The protocol does not require explicit acknowledgment messages from the mirror node to the primary node, since we configure the IB link to provide a successful completion notification for the primary node's send request only once the data transfer succeeds. In the event of an error or a timeout, the primary node resends the message to the backup node. After a set number of unsuccessful resend attempts, Mojim invokes its recovery process.

To sustain high bandwidth, Mojim creates multiple IB connections to handle client requests. For each connection, we assign one thread on the mirror node to poll for incoming messages. On the primary node, we let the application thread perform IB send operations for M-sync and use a background thread to post these operations for M-async.

4.2 Replication

We now describe the Mojim replication process and the techniques that we use to enable reliable, atomic, and consistent data replication.

Primary tier replication: At each sync point, the primary node posts IB send requests containing the target memory regions. Mojim ensures that all requests belonging to the same atomic operation are consecutive and on the same IB connection and marks the last request to let the mirror node know the end of an atomic operation. Since Mojim's reliable IB protocol ensures ordering in each IB connection, these requests will appear in the same order on the mirror node. A unique ID on each send request allows the mirror node to keep updates ordered across IB connections. For recovery purposes, the primary node stores the memory addresses of the most recent requests in a metadata log.

For each IB connection, the mirror node maintains a circular log and a thread that polls incoming requests. Mojim places the logs in NVMM for good performance and persistence and pre-allocates fixed-size buffers on the logs for RDMA accesses. With pre-allocated memory slots,
Mojim only needs one IB round trip to replicate data from the primary node to the mirror node. Because the receive buffer size is fixed, we limit the size of each send request on the primary node and break original memory regions into multiple send requests if needed. Since RDMA writes directly to NVMM, there is no need to flush the cache on the mirror node.

After all the data for a sync point has arrived on the log, the mirror node can write them to their permanent locations in the data area. This checkpointing happens periodically after a configurable number of requests (CHECKPOINT_THRESH) have been received, as well as when the system is idle and when a log is full. Mojim maintains global pointers to the beginning and end of each log to indicate data available for checkpointing.

To ensure that read-only applications on the mirror node see a consistent view of their data, Mojim removes the page table entries of the affected memory locations before a checkpointing operation. During the checkpointing, an application reading from those pages will generate a page fault. We changed the page fault handler to wait until Mojim completes the checkpointing and then restore the page table entries and return the application read.

Secondary tier replication: Replication to the secondary backups occurs in the background when there is data on the mirror node's logs. The protocol for replication to the backup node mimics the replication to the mirror node.

The mirror node maintains a pointer for each log to indicate the amount of data that has not yet been replicated to the backup node. Mojim uses a threshold (SECONDARY_TIER_THRESH) to limit the amount of such un-replicated data on the mirror node and stalls further replication to the mirror node until un-replicated data drops below SECONDARY_TIER_THRESH.

Example: Figure 3 illustrates an example of Mojim's data structures and its replication process. In this example, Mojim uses two IB connections and two mirror node logs. Three application threads post three gmsync calls to the two IB send queues. To guarantee atomicity, Mojim serializes thread 2's requests after thread 1's requests on the second send queue. Mojim then sends these requests to the mirror node's logs. The mirror node threads poll for the completion of these writes and update the log-end pointers when they have received all requests belonging to one gmsync call. The checkpointing service processes the logs from the log-begin pointer to the log-end pointer. The mirror node replicates the log content between the log-bak-begin pointer and the log-end pointer to the backup node.

4.3 Recovery

Fast recovery is crucial to providing high availability and preventing data loss in the event of a failure. There are three types of failure scenarios: primary, mirror, and backup node failures. Mojim uses heartbeats to detect failures, but other techniques [7, 36] are possible.

When the primary node fails, the mirror node becomes the new primary node and a backup node becomes the new mirror node. The new primary node first sends the un-replicated data in its logs to the new mirror node and checkpoints its log content to its data area after the failure. For M-syncdisk, the new primary node needs to load data from the data file on disk to the NVMM. After these operations, applications can restart on the new primary node. Until these operations complete, the Mojim contents will be unavailable.

One option for activating a new backup node is to wait for the failed node to come back online. Rebooting the machine is often sufficient and more efficient than constructing a new node [14]. When the crashed primary node restarts, it receives the new data accumulated during its down time from the new primary node.

When the failed node cannot reboot fast enough, a human operator or a system monitoring service selects a new backup node based on its available NVMM size, its networking topology, and other criteria [6, 53]. The new node receives a complete copy of the memory region and begins processing updates from the new mirror node.

When the mirror node or the backup node fails, the recovery process is similar. If the mirror node fails, the primary node first flushes its CPU caches. It also uses its metadata log to locate un-replicated data and sends them to the backup node.

To restart the mirror node or the backup node, Mojim replays the logs and writes only the completed atomic operation content to the data area, with the help of the atomic operation end mark and unique IDs. In the example in Figure 3, the mirror node crashes after Mojim checkpoints G. The recovery process will checkpoint C and discard H. If the failed node cannot restart, a newly chosen node receives replicated data from the primary node as described above.

When both the primary node and the mirror node fail in quick succession, Mojim falls back to the backup node. Now Mojim needs to reconstruct two new nodes that the administrating node selects. This recovery process is more costly than the recovery of a single node failure. We reduce the risk of this situation by placing nodes on different racks and by setting a small SECONDARY_TIER_THRESH, thus speeding up the recovery process of a single node failure.

5. Mojim Applications

We have ported several existing systems to Mojim to illustrate how applications can use Mojim's interface. The applications include the PMFS file system [12], the Google hash table [16], and MongoDB [40].

5.1 PMFS

The Persistent Memory File System [12] (PMFS) provides a conventional file-system-like interface to NVMM, allowing applications to allocate space with file creation, limit access to data via file permissions, and name portions of the
NVMM using file names. The key difference between PMFS and a conventional file system is that its implementation of mmap maps the physical pages of NVMM into the applications' address spaces rather than moving them back and forth between the file store and the buffer cache.

PMFS ensures persistence using sfence and clflush instructions. Mojim invokes its replication when PMFS performs its persistence procedure. Mojim's M-sync also removes clflush and only performs sfence on the primary node. Mojim's change required modifications to just 20 lines of PMFS source code. Applications can use mmap to gain load/store access to a file's contents and then use fsync, msync, or gmsync to manage replication and data consistency.

5.2 Google hash table

Google hash table [16] is an open source implementation of sparse and dense hash tables. Our Mojim-enabled version of the hash table stores its data in mmap'd PMFS files and performs msync at each insert and delete operation to let Mojim replicate the data. Porting the Google hash table to Mojim requires changes to just 18 lines of code.

5.3 MongoDB

MongoDB [40] is a popular NoSQL database. Several aspects of MongoDB make it a good comparison point for Mojim. First, MongoDB stores its data in memory-mapped files and performs memory loads and stores for data access, a perfect match for Mojim's NVMM interface. Second, MongoDB supports both single node and replication in a set of nodes in several different modes (called "write concerns") that trade off among performance, reliability, and availability. Mojim provides similar functionality with a more general mechanism.

By default, MongoDB logs data in a journal file and checkpoints the data to the memory-mapped data file in a lazy fashion. With the JOURNALED write concern, MongoDB blocks a client call until the updated data is written to the journal file. With the FSYNC_SAFE write concern, MongoDB flushes all the dirty pages to the data file after each write operation and blocks the client call until this operation completes.

MongoDB supports data replication across a set of machines. A primary node in a MongoDB replica set serves all write requests and pushes operation logs to the secondary nodes. Secondary nodes can serve read requests but may return stale data. The MongoDB write concern REPLICAS_SAFE returns the client request after at least two secondary nodes have received the corresponding operation log. The REPLICAS_SAFE write concern does not wait for journal writes or checkpointing on the primary node.

Mojim offers another way to provide reliability and availability to MongoDB. With the help of Mojim's gmsync API and its reliability guarantees, we can remove journaling from MongoDB and still achieve the same consistency level. To guarantee the same atomicity of client requests as available through MongoDB, we modify the storage engine of MongoDB to keep track of all writes to the data file and group the written memory regions belonging to the same client request into a gmsync call. In total, this change requires modifying 117 lines of MongoDB.

An alternative way of using Mojim is to run unmodified MongoDB on Mojim by configuring MongoDB to place both its data file and journal file in Mojim's mmap'd data area. When MongoDB commits data to the journal or checkpoints the data to the data file, it performs an msync operation, which will trigger Mojim's data replication transparently.

6. Evaluation with DRAM

In this section, we study the performance of Mojim under each of the configurations and applications we described in Sections 3 and 5. Specifically, we first evaluate the performance of different Mojim modes and compare them to existing replication methods. We then evaluate the effects of different application parameters and Mojim configurations, the performance of applications ported to Mojim, and Mojim's recovery costs.

6.1 Test Bed Systems

We use two different systems to evaluate Mojim. The first is an industrial NVMM emulation system from Intel called PMEP [12]. PMEP augments an off-the-shelf, dual-socket server platform with special CPU microcode and custom firmware. It partitions the system's DRAM into emulated NVMM and regular DRAM. PMEP emulates NVMM read latency, read and write bandwidth, and data persistence costs. For read latency and read/write bandwidth, PMEP modifies the CPU and the memory controller. The PMEP platform uses write-back CPU caches and does not emulate NVMM write latency. It uses software to emulate the cost of data persistence: the kernel running on PMEP issues clflush instructions followed by an sfence, and adds a write barrier delay to model the cost of ensuring data persistence in NVMM. In our experiments, we emulate NVMM by setting the read latency to 300 ns, read and write bandwidth to 5 GB/s and 1.6 GB/s (1/8 of DRAM bandwidth), and the write barrier delay to 1 µs, the configuration used in Intel's PMFS project [12].

Each PMEP node has two 2.6 GHz 8-core Intel Xeon processors, 40 MB of aggregate CPU cache, 8 GB of DDR3 DRAM used as normal DRAM, 128 GB of DRAM used as emulated NVMM, and a 7200 RPM 4 TB hard disk. They also have 40 Gbps Mellanox Infiniband NICs and are directly connected to each other via Infiniband without a switch. The platforms run Ubuntu 13.10 and the 3.11.0 Linux kernel.

[Figure 4. msync Latency with DRAM and NVMM. The average 4 KB msync latency with PMEP's DRAM and NVMM modes. Bars: S-unrep, M-async, M-sync, M-syncflush, M-syncdisk, each for DRAM and NVMM; y-axis: average latency (µs).]

[Figure 5. msync Throughput with DRAM and NVMM. The 4 KB msync bandwidth with PMEP's DRAM and NVMM modes. Bars: S-unrep, M-async, M-sync, M-syncflush, M-syncdisk, each for DRAM and NVMM; y-axis: throughput (GB/s).]

[Figure 6. msync Latency with DRAM-based machines. The average 4 KB msync latency with S-unreplicated and Mojim two-tier architecture. Bars: S-unreplicated, M-syncsec, M-syncseceth, E-chain, E-broadcast; y-axis: average latency (µs).]

We have access to only two PMEP machines (located at an Intel facility), so to evaluate Mojim modes that require more than two machines, we use similar machines in our
lab that do not include PMEP functionality and use ordinary DRAM as a proxy for NVMM. Each of these machines has two Intel Xeon X5647 processors, 48 GB DRAM, one 40 Gbps Mellanox Infiniband NIC, and a 1000 Mbps Ethernet. A QLogic Infiniband Switch connects these machines' IB links. All machines run the CentOS 6.4 distribution and the 3.11.0 Linux kernel.

[Figure 7. msync Throughput with DRAM-based machines. The 4 KB msync throughput with S-unreplicated and Mojim two-tier architecture. Bars: S-unreplicated, M-syncsec, M-syncseceth, E-chain, E-broadcast; y-axis: throughput (GB/s).]

In all experiments, unless otherwise specified, we set CHECKPOINT_THRESH (the frequency of checkpointing the mirror node logs) to 1 (after each log write) and SECONDARY_TIER_THRESH (the threshold for sending un-replicated data to the backup nodes) to 40 MB.

6.2 Overall Replication Performance

We first compare the microbenchmark performance of Mojim modes that only involve two nodes using the PMEP platforms. To evaluate the impact of NVMM vs. DRAM, we run the same experiments with both PMEP's DRAM mode and its emulated NVMM.

Figures 4 and 5 present the average latency and throughput of msync calls with S-unreplicated, M-async, M-sync, M-syncflush, and M-syncdisk. For each experiment, we perform 10000 random 4 KB msync calls in a 4 GB mmap'd file.

Surprisingly, M-sync outperforms S-unreplicated significantly for both DRAM and emulated NVMM (reducing latency by 45% and 40% respectively). Even though M-sync waits for a networking round trip between the primary node and the mirror node, it still outperforms S-unreplicated because it does not need to flush data from processors' caches, while S-unreplicated must flush data on each msync.

M-async's performance is similar to S-unreplicated, as it also needs to flush the primary node's caches. M-syncflush has higher latency than S-unreplicated, since it performs both cache flushes and networking round trips. Placing the mirror node data on disk adds only 1% to 10% overhead. However, M-syncdisk does not support read applications on the mirror node and adds an overhead in recovery time (see Section 6.5).

Comparing DRAM and emulated NVMM, the performance with emulated NVMM for all schemes is close to that with DRAM, indicating that the performance degradation of NVMM over DRAM has only a very small effect on application-level performance.

[Figure 8. Average msync Latency with Different msync Sizes on Emulated NVMM. The average latency of msync operation on NVMM with request sizes from 8 bytes to 12 KB. Series: S-unreplicated, M-async, M-sync; x-axis: request size (8 B to 12 KB); y-axis: average latency (µs).]

[Figure 9. Throughput with Different Application Threads on Emulated NVMM. The msync throughput with 1 to 12 threads performing msync. Series: S-unreplicated, M-async, M-sync_4threads, M-sync_2threads, M-sync_1thread; x-axis: number of threads; y-axis: throughput (GB/s).]

Next, to augment the PMEP results with more machines and to test Mojim's two-tier architecture, we use three DRAM-based machines in our lab to evaluate the performance of Mojim's two-tier modes and two existing schemes that use a one-primary, multiple-secondary architecture (Table 1). One of these existing schemes, E-chain, allows writes only at the primary node and propagates data replication from the primary node to the secondary nodes in a serialized order [2, 56]. The other existing scheme, E-broadcast [15], is similar to E-chain but broadcasts updates to the secondary nodes. E-chain and E-broadcast use one primary node and two secondary nodes interconnected by IB. They use the same IB protocol that we implemented for Mojim.

Figures 6 and 7 plot the average latency and throughput of using our lab machines to run the experiments shown in Figures 4 and 5. Compared to S-unreplicated, Mojim with the secondary tier does not degrade performance if a fast network connects the backup node. However, the lower-cost Ethernet configuration degrades performance by 37×, because the mirror node cannot drain its circular log fast enough and has to stall the primary tier replication. Both E-chain and E-broadcast are slower than Mojim, increasing latency by 1.8× and 2.8× respectively, compared to M-syncsec.

Finally, we compare Mojim with two existing IB kernel protocols, RDS and IPoIB. We find that they both have worse performance than Mojim's networking protocol on IB-Verbs, with 4.9× and 31× slowdown.

Overall, Mojim delivers performance similar to or better than no replication while adding reliability and availability. Mojim's good performance is due to its efficient replication protocol, its ability to avoid expensive cache flush operations, and its optimized software and networking stacks.

6.3 Sensitivity Analysis

Both Mojim's configuration parameters and application-level behavior can affect performance. In this section, we measure their impact on Mojim's performance.

6.3.1 msync Size

The amount of data per msync has a strong impact on performance. Figure 8 plots the average latency of performing msync calls to random memory regions of 8 bytes to 12 KB with S-unreplicated, M-async, and M-sync using PMEP's emulated NVMM.

For smaller request sizes, M-async performs much better than S-unreplicated. S-unreplicated underperforms because of the way msync calls are currently implemented in Linux, with the msync call handler checking the range of the msync memory and rounding it to memory pages (4 KB pages for the default Linux kernel). With Mojim, we modify the msync call handler to allow any memory address range and only flush and replicate the application-specified memory regions.

M-sync does not perform clflush (since transferring the data to the mirror node guarantees persistence). As a result, its performance is always better than S-unreplicated and is better than M-async when the msync size is bigger than 1 KB.

6.3.2 Application Threads and Networking Connections

Application thread count and the number of network connections Mojim uses also impact performance. Figure 9 presents the 4 KB msync throughput with one to 12 application threads for S-unreplicated, M-async, and M-sync using PMEP's emulated NVMM.

Both M-async and S-unreplicated scale well with the number of application threads, while M-sync with one IB connection (and thus one log) scales poorly. With more connections, M-sync's scaling improves. A tradeoff with increasing networking connections is that Mojim uses more threads to poll for receiving messages, consuming more CPU cycles.

[Figure 10. Filebench Throughput with Emulated NVMM. The throughput of three Filebench workloads with single machine and no replication, the M-async mode, and the M-sync mode. Workloads: Varmail, FileServer, WebServer; series: S-unreplicated, M-async, M-sync; y-axis: throughput (MB/s).]

[Figure 11. Google Hash Table Average Latency with Emulated NVMM. The average latency of sequentially and randomly inserting key-value pairs to the Google dense hash table. Workloads: Sequential, Random; series: S-unreplicated, M-async, M-sync; y-axis: average latency (µs).]

[Figure 12. Insert Avg Latency. Average latency of inserting key-value pairs on emulated NVMM. Bars: JOURNALED, FSYNC_SAFE, REPLICAS_SAFE, M-async, M-sync; y-axis: latency (ms).]

[Figure 13. Insert Throughput. Throughput of inserting key-value pairs on emulated NVMM. Bars: JOURNALED, FSYNC_SAFE, REPLICAS_SAFE, M-async, M-sync; y-axis: throughput (IOPS).]

[Figure 14. YCSB Average Latency. Average latency of YCSB workloads on emulated NVMM. Workloads: A to F; series: JOURNALED, FSYNC_SAFE, REPLICAS_SAFE, M-async, M-sync; y-axis: average latency (ms).]

6.3.3 Checkpoint and Secondary Tier Replication Thresholds

We change the two thresholds Mojim uses in its configurations: we change CHECKPOINT_THRESH, the frequency of checkpointing the mirror node logs, from 1 to 10000, and we change SECONDARY_TIER_THRESH, the amount of un-replicated data accumulated before it is sent to the backup node, from 40 KB to 400 MB. We find that neither CHECKPOINT_THRESH nor SECONDARY_TIER_THRESH affects the application performance, because both the checkpointing process and the secondary tier replication via IB are fast enough not to block the primary tier replication.

6.4 Application Performance

In this section, we present the evaluation results for three applications: a file system, a hash table, and a NoSQL database.

6.4.1 PMFS

We use the FileServer, WebServer, and Varmail workloads in the Filebench suite [54] to evaluate different Mojim modes under PMFS using emulated NVMM. Figure 10 presents the throughput of the three workloads of Filebench. For WebServer and Varmail, both M-async and M-sync yield performance similar to S-unreplicated. For FileServer, M-async and M-sync have slightly worse performance than S-unreplicated.

6.4.2 Google hash table

We perform sequential and random key-value insertion to the Google Dense Hash Table [16]. Each key-value pair contains an integer key and a random integer value. Figure 11 plots the average latency of S-unreplicated, M-async, and M-sync with emulated NVMM. For both workloads, all three schemes have similar performance, showing that Mojim has small performance overhead when it comes to hash table operations.
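
The access pattern exercised by this benchmark (a hash table kept in an mmap'd file, with one msync per insert) can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the Google hash table and not Mojim's code: the names MappedHashTable, SLOT, and EMPTY are hypothetical, the table uses simple linear probing with non-negative integer keys, and Python's mmap.flush() stands in for msync. On a stock kernel this only writes the page back to the backing file; with Mojim the same call would be the point at which replication is triggered. Because mmap.flush() requires a page-aligned offset, the sketch syncs the whole page containing the slot, which mirrors the page rounding of the default Linux msync handler discussed in Section 6.3.1.

```python
import mmap
import os
import struct
import tempfile

SLOT = struct.Struct("<qq")  # one slot = (key, value), 16 bytes
EMPTY = -1                   # hypothetical sentinel; keys must be non-negative


class MappedHashTable:
    """Fixed-capacity, linear-probing hash table stored in an mmap'd file."""

    def __init__(self, path, capacity=1024):
        self.capacity = capacity
        page = mmap.PAGESIZE
        # Round the file size up to whole pages so page-sized flushes stay in range.
        size = (capacity * SLOT.size + page - 1) // page * page
        new = not os.path.exists(path)
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)
        if new:
            os.ftruncate(self.fd, size)
        self.mem = mmap.mmap(self.fd, size)
        if new:
            for i in range(capacity):
                SLOT.pack_into(self.mem, i * SLOT.size, EMPTY, 0)
            self.mem.flush()

    def _sync(self, offset):
        # mmap.flush() wraps msync(); its offset must be page-aligned,
        # so sync the single page that holds the modified slot.
        start = offset - offset % mmap.PAGESIZE
        self.mem.flush(start, mmap.PAGESIZE)

    def insert(self, key, value):
        for probe in range(self.capacity):
            off = ((key + probe) % self.capacity) * SLOT.size
            k, _ = SLOT.unpack_from(self.mem, off)
            if k in (EMPTY, key):
                SLOT.pack_into(self.mem, off, key, value)  # plain memory store
                self._sync(off)                            # sync point per insert
                return
        raise RuntimeError("table full")

    def get(self, key):
        for probe in range(self.capacity):
            off = ((key + probe) % self.capacity) * SLOT.size
            k, v = SLOT.unpack_from(self.mem, off)
            if k == key:
                return v
            if k == EMPTY:
                return None
        return None

    def close(self):
        self.mem.close()
        os.close(self.fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        table = MappedHashTable(os.path.join(d, "table.bin"), capacity=8)
        table.insert(3, 30)
        table.insert(11, 110)  # collides with key 3 and lands in the next slot
        print(table.get(3), table.get(11), table.get(5))
        table.close()
```

Reads never call flush, so lookups run at memory speed; only the insert path pays for a sync point, which is the cost that Figure 11 compares across S-unreplicated, M-async, and M-sync.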
Workload   Read   Update   Scan   Insert   Read&Update
A          50     50       -      -        -
B          95     5        -      -        -
C          100    -        -      -        -
D          95     -        -      5        -
E          -      -        95     5        -
F          50     -        -      -        50

Table 2. YCSB Workload Properties. The percentage of different operations in each YCSB workload.

6.4.3 MongoDB

MongoDB is a natural fit for Mojim. We evaluate how MongoDB and Mojim compare using micro- and macro-benchmarks.

Microbenchmark: Our microbenchmark inserts key-value pairs to MongoDB. Each insert operation contains 10 key-value pairs, with each pair containing 100 bytes of randomly generated data. Figures 12 and 13 present the average latency and throughput of key-value pair insertions with PMEP's emulated NVMM. We set the MongoDB replication method to use two replicas (the primary node and the secondary node) and connect these nodes with IB.

MongoDB with Mojim outperforms the MongoDB replication method REPLICAS_SAFE by 3.7 to 3.9×. This performance gain is due to Mojim's efficient replication protocol and networking stack. Mojim also outperforms the un-replicated JOURNALED MongoDB by 56 to 59× and the un-replicated FSYNC_SAFE by 701 to 741×. JOURNALED flushes journal content for each client write request. FSYNC_SAFE performs fsync of the data file after each write operation to guarantee data reliability without journaling. Both these operations are expensive.

To evaluate Mojim's two-tier architecture with MongoDB, we perform the same set of experiments using three DRAM-based machines in our lab. Similar to the PMEP results, Mojim's M-syncsec outperforms MongoDB's replication method by 3.4 to 4×, the un-replicated JOURNALED MongoDB by 35 to 43×, and the un-replicated FSYNC_SAFE by 238 to 311×, suggesting that Mojim's replication is better than MongoDB replication.

Finally, MongoDB can run unmodified on Mojim by configuring both its journal and data file to be in an mmap'd NVMM region. In this case, its performance is similar to JOURNALED, with a performance overhead of 0.2% to 6%. However, Mojim provides better reliability and availability than the un-replicated MongoDB.

Macrobenchmark: YCSB [10] is a benchmark designed to evaluate key-value store systems. YCSB includes six workloads that imitate web applications' data access models. The workloads contain a combination of read, update, scan, and insert operations. Table 2 summarizes the percentage of these operations in the YCSB workloads. Each workload performs 1000 operations on a database with 1000 1 KB records.

Figure 14 presents the latency of MongoDB and Mojim using the six YCSB workloads on emulated NVMM. For most workloads, both M-async and M-sync outperform the un-replicated and replicated MongoDB schemes. The performance improvement is especially high for write-heavy workloads. We also find similar results with three DRAM-based machines.

6.5 Recovery

Recovery performance is important because it directly affects availability and may impact reliability, since Mojim is vulnerable to additional node failures during some recovery scenarios. To test the robustness of the system, we stop a Mojim node at random and find that the rest of the system can continue serving client requests correctly. We further measure the recovery time in the event of a node failure.

We use a typical recovery scenario to illustrate Mojim's recovery performance. When a mirror node fails with M-syncsec, the recovery process requires sending the remaining, un-replicated data to the backup node, flushing the CPU caches on the primary node, and copying all the data areas to the new mirror node. Mojim performs these operations in parallel. We set SECONDARY_TIER_THRESH to 40 MB and use three machines in our lab to perform the recovery performance evaluation.

Mojim takes 450 µs to flush the 26 MB CPU caches on the primary node. If the primary node also fails before it has flushed all its caches, there will be data loss. The window of vulnerability also depends on how soon the failure can be detected, thus in practice it will be longer than 450 µs [7]. It takes 14 ms to send 40 MB of data to the backup node and 1.9 seconds to send a 5 GB data area to the new mirror node.

The whole recovery process completes in 1.9 seconds for a 5 GB NVMM. Even for a 1 TB NVMM, the recovery process will only take 6.5 minutes. Notice that the vulnerability window depends on how fast the primary node detects a failure and flushes its caches, not on NVMM size.

For M-syncdisk, Mojim also needs to read the data file from the disk to the NVMM before applications can access the data. In this case, recovery takes 17 seconds for a 5 GB data file, a much higher cost in availability than when we use NVMM for the data area.

7. Related Work

This section places Mojim in context with other related research projects and systems.

Non-Volatile Main Memory: Recent years have seen increased interest in NVMM. Researchers have focused on NVMM-related problems, such as building NVMM file systems [9, 12, 59], hybrid DRAM/NVMM memory systems [39], memory allocators [41], memory management and paging mechanisms [3], and programming models [8, 58]. While previous research has focused on un-replicated NVMM in a single machine, Mojim focuses on providing
reliability, high availability, and redundancy to NVMM in datacenter environments.

Redundancy and Replication: To provide data reliability and availability, many systems use data redundancy or replication [1, 5, 11, 15, 17, 18, 29, 46, 47, 51, 52, 56, 61]. Several previous systems adopt the architecture of one primary and multiple backups [2, 15, 56]. These systems either use a total ordering of nodes to serialize data replication [2, 56] or broadcast the replicated data and have the primary wait for all the backups [15] to guarantee strong consistency. Mojim uses a two-tier architecture containing one primary node, one mirror node, and multiple backup nodes. Mojim's replication between the primary node and mirror node and the background replication to the backup nodes is more efficient than replication among one primary and multiple backups.

Other architectures allow writes to all replicas (E-writeall) [11, 29, 47, 55]. Systems that require strong consistency among the replicas use Paxos-like protocols [4, 22, 31]. With such architectures, at least two networking round trips are needed to deliver strong consistency. Moreover, the round trips are necessary at each write (memory store) rather than for the less frequent sync points. In contrast, Mojim only replicates data at sync points. Ensuring strong consistency for the Mojim primary tier is also much simpler and has much smaller performance cost. There are also systems that implement weak consistency protocols for the E-writeall architecture [11, 29, 47]. These systems require a reconciliation process for conflicts, which can increase performance overhead [17]. Mojim does not involve any reconciliation and supports strong consistency in its primary tier.

Mojim's architecture is similar to the two-tier architecture proposed in [17], which reduced the locking and reconciliation overhead in mobile, disconnected environments. However, we focus on datacenter environments where nodes are mostly connected. Moreover, Mojim's primary tier only uses one primary and one mirror node for good performance.

To reduce system downtime, commercial storage systems often maintain a pair of interconnected nodes (called high-availability pairs) [13, 19, 43, 57]. When one node fails, the other member of the pair takes over its duties. Most of these high-availability pair schemes rely on shared storage and do not replicate storage data, whereas Mojim replicates data in NVMM. Moreover, Mojim can also provide additional redundancy with its secondary tier.

Finally, RAMCloud is a low-latency key-value store that keeps all data in DRAM [44]. While Mojim and RAMCloud both provide reliable memory-based storage systems, RAMCloud provides a key-value interface rather than a memory-like interface to applications. The key-value software layer adds significant latency to accesses and obscures much of the performance of the underlying memory. Mojim offers a memory-based interface (i.e., applications access Mojim using memory load and store instructions). Based on the interface it supports and its performance targets, Mojim makes design decisions differently from RAMCloud. Mojim replicates NVMM to NVMM at application sync points, while RAMCloud replicates key-value contents on each put operation to slow storage devices. Mojim does not shard memory, because sharding would result in portions of applications' NVMM being on a remote node, vastly increasing latency. As a result, Mojim uses fail-over to recover from failures instead of RAMCloud's approach that relies on sharded data storage to achieve fast recovery.

8. Conclusions

We have described Mojim, a system for providing reliable and highly-available NVMM. Mojim uses a two-tier architecture and efficiently replicates data in NVMM. Our results demonstrate that Mojim can provide replication with small cost, in many cases even outperforming the un-replicated system. In doing so, Mojim paves the way for deploying NVMM in datacenters that wish to take advantage of NVMM's enhanced performance but require strong guarantees about data safety.

Acknowledgments

We thank the anonymous reviewers for their enormously valuable feedback and comments, which have substantially improved the content and presentation of this paper. We also thank Dulloor Subramanya, Jeff Jackson, and the vLab team from Intel Corp. for their help with the PMEP platforms. Finally, we thank the members of the NVSL research group for their insightful comments.

This work was supported in part by the Center for Future Architectures Research (C-FAR), one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of C-FAR or other institutions.

References

[1] Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, John R. Douceur, Jon Howell, Jacob R. Lorch, Marvin Theimer, and Roger P. Wattenhofer. FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI '02), Boston, Massachusetts, December 2002.
[2] Peter A. Alsberg and John D. Day. A principle for resilient sharing of distributed resources. In Proceedings of the 2nd International Conference on Software Engineering (ICSE '76), San Francisco, California, October 1976.
[3] Katelin Bailey, Luis Ceze, Steven D. Gribble, and Henry M. Levy. Operating system implications of fast, cheap, non-volatile memory. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems (HotOS '13), Napa, California, May 2011.
[4] Mike Burrows. The chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06), Seattle, Washington, November 2006.
[5] Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. Windows azure storage: A highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11), Cascais, Portugal, October 2011.
[6] Mosharaf Chowdhury, Srikanth Kandula, and Ion Stoica. Leveraging endpoint flexibility in data-intensive clusters. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM (SIGCOMM '13), Hong Kong, China, August 2013.
[7] Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert Morris. Efficient replica maintenance for distributed
storage systems. In Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI '06), San Jose, California, May 2006.
[8] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. Nv-heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '11), New York, New York, March 2011.
[9] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Doug Burger, Benjamin C. Lee, and Derrick Coetzee. Better i/o through byte-addressable, persistent memory. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP '09), Big Sky, Montana, October 2009.
[10] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with ycsb. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), New York, New York, June 2010.
[11] Guiseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-Value Store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP '07), Stevenson, Washington, October 2007.
[12] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System software for persistent memory. In Proceedings of the EuroSys Conference (EuroSys '14), Amsterdam, The Netherlands, April 2014.
[13] EMC Corporation. EMC VNXe High Availability. https://www.emc.com/collateral/hardware/white-papers/h8276-emc-vnxe-high-availability-wp.pdf.
[14] Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. Availability in Globally Distributed Storage Systems. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI '10), Vancouver, Canada, December 2010.
[15] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), Bolton Landing, New York, October 2003.
[16] Google Inc. Google Sparse Hash. http://goog-sparsehash.sourceforge.net.
[17] Jim Gray, Pat Helland, Patrick O'Neil, and Dennis Shasha. The dangers of replication and a solution. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD '96), New York, New York, June 1996.
[18] Lisa Hellerstein, Garth A. Gibson, Richard M. Karp, Randy H. Katz, and David A. Patterson. Coding Techniques for Handling Failures in Large Disk Arrays. Algorithmica, 12(2):182–208, August 1994.
[19] Hewlett Packard. HP NonStop operating system. http://h17007.www1.hp.com/us/en/enterprise/servers/integrity/nonstop/nonstop-os.aspx.
[20] M Hosomi, H Yamagishi, T Yamamoto, K Bessho, Y Higo, K Yamane, H Yamada, M Shoji, H Hachino, C Fukumoto, et al. A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-ram. In Electron Devices Meeting, 2005. IEDM Technical Digest. IEEE International, pages 459–462, 2005.
[21] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. Erasure coding in windows azure storage. In Proceedings
of the USENIX Annual Technical Conference (USENIX '12), Boston, Massachusetts, June 2012.
[22] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. Zookeeper: Wait-free coordination for internet-scale systems. In Proceedings of the USENIX Annual Technical Conference (USENIX '10), Boston, Massachusetts, June 2010.
[23] Intel. Add Support for New Persistent Memory Instructions. http://www.lwn.net/Articles/619851.
[24] Intel. Intel 64 and IA-32 Architectures Software Developer's Manual. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf.
[25] Engin Ipek, Jeremy Condit, Edmund B. Nightingale, Doug Burger, and Thomas Moscibroda. Dynamically replicated memory: Building reliable systems from nanoscale resistive memories. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XIV), Pittsburgh, Pennsylvania, March 2010.
[26] James Pinkerton. The Future of Computing: The Convergence of Memory and Storage through Non-Volatile Memory (NVM). Storage Industry Summit, San Jose, California, Jan 2014.
[27] Brian G Johnson and Charles H Dennison. Phase change memory, September 2004. US Patent 6,791,102.
[28] Brent ByungHoon Kang, Robert Wilensky, and John Kubiatowicz. The hash history approach for reconciling mutual inconsistency. In Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS '03), Providence, Rhode Island, May 2003.
[29] John Kubiatowicz, David Bindel, Patrick Eaton, Yan Chen, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Westley Weimer, Chris Wells, Hakim Weatherspoon, and Ben Zhao. OceanStore: An Architecture for Global-Scale Persistent Storage. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), Cambridge, Massachusetts, November 2000.
[30] Amit Kumar and Ram Huggahalli. Impact of cache coherence protocols on the processing of network traffic. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '07), Chicago, Illinois, Dec 2007.
[31] Leslie Lamport. Paxos Made Simple. ACM SIGACT News, 32(4):18–25, November 2001.
[32] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting phase change memory as a scalable dram alternative. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), Austin, Texas, June 2009.
[33] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Phase change memory architecture and the quest for scalability. Commun. ACM, 53(7):99–106, 2010.
[34] Benjamin C Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin Ipek, Onur Mutlu, and Doug Burger. Phase-change technology and the future of main memory. IEEE micro, 30(1):143, 2010.
[35] Myoung-Jae Lee, Chang Bum Lee, Dongsoo Lee, Seung Ryul Lee, Man Chang, Ji Hyun Hur, Young-Bae Kim, Chang-Jung Kim, David H Seo, Sunae Seo, et al. A fast, high-endurance and scalable non-volatile memory device made from asymmetric ta2o5-x/tao2-x bilayer structures. Nature materials, 10(8):625–630, 2011.
[36] Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, and Michael Walfish. Detecting Failures in Distributed Systems with the Falcon Spy Network. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11), Cascais, Portugal, October 2011.
[37] Mellanox Technologies. Rdma aware networks programming user manual. http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf.
[38] Micron Technology Inc. P8p parallel phase change memory (pcm). http://www.micron.com/~/media/Documents/Products/Data%20Sheet/PCM/p8p_parallel_pcm_ds.pdf.
[39] Jeffrey C. Mogul, Eduardo Argollo, Mehul Shah, and Paolo Faraboschi. Operating system support for nvm+dram hybrid main memory. In The Twelfth Workshop on Hot Topics in Operating Systems (HotOS XII), Monte Verita, Switzerland, May 2009.
[40] MongoDB Inc. MongoDB. http://www.mongodb.org/.
[41] Iulian Moraru, David G Andersen, Michael Kaminsky, Niraj Tolia, Parthasarathy Ranganathan, and Nathan Binkert. Consistent, durable, and safe memory management for byte-addressable nonvolatile main memory. In Conference on Timely Results in Operating Systems (TRIOS '13), Farmington, Pennsylvania, November 2013.
[42] Suman Nath, Haifeng Yu, Philip B. Gibbons, and Srinivasan Seshan. Subtleties in tolerating correlated failures in wide-area storage systems. In Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI '06), San Jose, California, May 2006.
[43] NetApp Inc. NetApp SnapMirror Data Replication. http://www.netapp.com/us/products/protection-software/snapmirror.aspx.
[44] Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. Fast Crash Recovery in RAMCloud. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11), Cascais, Portugal, October 2011.
[45] Stan Park, Terence Kelly, and Kai Shen. Failure-atomic msync(): a simple and efficient mechanism for preserving the integrity of durable data. In Proceedings of the EuroSys Conference (EuroSys '13), Prague, Czech Republic, April 2013.
[46] David Patterson, Garth Gibson, and Randy Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM SIGMOD Conference on the Management of Data (SIGMOD '88), Chicago, Illinois, June 1988.
[47] Karin Petersen, Mike J. Spreitzer, Douglas B. Terry, Marvin M. Theimer, and Alan J. Demers. Flexible Update Propagation for Weakly Consistent Replication. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP '97), Saint-Malo, France, October 1997.
[48] Moinuddin K Qureshi, Michele M Franceschini, Luis A Lastras-Montaño, and John P Karidis. Morphable memory system: a robust architecture for exploiting multi-level phase change memories. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '07), June 2010.
[49] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable high performance main memory system using phase-change memory technology. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), Austin, Texas, June 2009.
[50] Luiz E. Ramos, Eugene Gorbatov, and Ricardo Bianchini. Page placement in hybrid memory systems. In Proceedings of the International Conference on Supercomputing (ICS '11), Tucson, Arizona, 2011.
[51] Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz. Pond: The oceanstore prototype. In Proceedings of the 2nd USENIX Symposium on File and Storage Technologies (FAST '03), San Francisco, California, April 2003.
[52] Antony Rowstron and Peter Druschel. Storage Management and Caching in PAST, A Large-scale, Persistent Peer-to-peer Storage Utility. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), Banff, Canada, October 2001.
[53] David Spence, Jon Crowcroft, Steven Hand, and Tim Harris. Location based placement of whole distributed systems. In Proceedings of the 2005 ACM Conference on Emerging Network Experiment and Technology (CoNEXT '05), Toulouse, France, October 2005.
[54] Sun Microsystems. Solaris Internals: FileBench. http://filebench.sourceforge.net/.
[55] Douglas B. Terry, Vijayan Prabhakaran, Ramakrishna Kotla, Mahesh Balakrishnan, Marcos K. Aguilera, and Hussam Abu-Libdeh. Consistency-Based Service Level Agreements for Cloud Storage. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13), Farmington, Pennsylvania, November 2013.
[56] Robbert van Renesse and Fred B. Schneider. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, California, December 2004.
[57] VMWare Inc. VMware High Availability. http://www.vmware.com/files/pdf/VMware-High-Availability-DS-EN.pdf.
[58] Haris Volos, Andres Jaan Tack, and Michael M. Swift. Mnemosyne: Lightweight persistent memory. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '11), New York, New York, March 2011.
[59] Xiaojian Wu and A. L. N. Reddy. Scmfs: A file system for storage class memory. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), Nov 2011.
[60] J Joshua Yang, Dmitri B Strukov, and Duncan R Stewart. Memristive devices for computing. Nature nanotechnology, 8(1):13–24, 2013.
[61] Ming Zhong, Kai Shen, and Joel Seiferas. Replication degree customization for high availability. In Proceedings of the EuroSys Conference (EuroSys '08), Glasgow, Scotland UK, March 2008.
[62] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A durable and energy efficient main memory using phase change memory technology. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), Austin, Texas, June 2009.
