Scalable Persistent Memory File System with Kernel-Userspace Collaboration



Youmin Chen, Youyou Lu, Bohong Zhu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Jiwu Shu

Tsinghua University, University of Wisconsin–Madison

Jiwu Shu is the corresponding author (shujw@tsinghua.edu.cn).

Abstract

We introduce Kuco, a novel direct-access file system architecture whose main goal is scalability. Kuco utilizes three key techniques (collaborative indexing, two-level locking, and versioned reads) to offload time-consuming tasks, such as pathname resolution and concurrency control, from the kernel to userspace, thus avoiding kernel processing bottlenecks. Building on Kuco, we present the design and implementation of KucoFS, and then experimentally show that KucoFS has excellent performance in a wide range of experiments; importantly, KucoFS scales better than existing file systems by up to an order of magnitude for metadata operations, and fully exploits device bandwidth for data operations.

1 Introduction

Emerging byte-addressable persistent memories (PMs), such as PCM [22, 34, 51], ReRAM [3], and the recently released Intel Optane DC PMM [27], provide performance close to DRAM and data persistence similar to disks. Such high-performance hardware increases the importance of redesigning file systems. In the past decade, the systems community has proposed a number of file systems, such as BPFS [11], PMFS [14], and NOVA [43], to minimize the software overhead caused by a traditional file system architecture. However, these PM-aware file systems are part of the operating system, and applications need to trap into the kernel to access them, where system calls (syscalls) and the virtual file system (VFS) still incur non-negligible overhead. In this regard, recent work [13, 21, 28, 39] proposes to deploy file systems in userspace to access file data directly (i.e., direct access), thus exploiting the high performance of PM.

Despite these efforts, we find that another important performance metric, scalability, still has not been well addressed, especially when multicore processors meet fast PMs. NOVA [43] improves multicore scalability by partitioning internal data structures and avoiding global locks. However, our evaluation shows that it still fails to scale well due to the existence of the VFS layer. Even worse, some userspace file system designs further exacerbate the scalability problem by introducing a centralized component. For example, Aerie [39] ensures the integrity of file system metadata by sending expensive inter-process communications (IPCs) to a trusted process (TFS) that has the authority to update metadata. Strata [21], as another example, avoids the involvement of a centralized process in normal operations by directly recording updates in PM logs, but requires a KernFS to apply them (including both data and metadata) to the file system, which incurs one more data copy. The trusted process (e.g., TFS or KernFS) in both file systems is also responsible for concurrency control, which inevitably becomes the bottleneck under high concurrency.

In this paper, we revisit the file system design by introducing a kernel-userspace collaboration architecture, or Kuco, to achieve both direct-access performance and high scalability. Kuco follows a classic client/server model with two components: a userspace library (named Ulib) that provides basic file system interfaces, and a trusted thread (named Kfs) placed in the kernel to process requests sent by Ulib and perform critical updates (e.g., metadata).

Inspired by distributed file system designs, e.g., AFS [17], that improve scalability by minimizing server loads and reducing client/server interactions, Kuco presents a novel task division and collaboration between Ulib and Kfs, which offloads most tasks to Ulib to avoid a possible Kfs bottleneck. For metadata scalability, we introduce a collaborative indexing technique that allows Ulib to perform pathname resolution before sending requests to Kfs. In this way, Kfs can update metadata items directly with the pre-located addresses provided by Ulib. For data scalability, we first propose a two-level locking mechanism to coordinate concurrent writes to shared files. Specifically, Kfs manages a write lease for each file and assigns it to the process that intends to open the file. Threads within this process, in turn, lock the file with a range lock maintained entirely in userspace. Second, we introduce a versioned read protocol to achieve direct reads even without interacting with Kfs, despite the presence of concurrent writers.

Kuco also includes techniques to enforce data protection and improve baseline performance. Kuco maps the PM space into userspace in readonly mode to prevent buggy programs from corrupting file data. Userspace direct writes are achieved with a three-phase write protocol. Before Ulib writes a file, Kfs switches the related PM pages from readonly to writable by toggling the permission bits in the page table. A pre-allocation technique is also used to reduce the number of interactions between Ulib and Kfs when writing a file.
With the Kuco architecture, we build a PM file system named KucoFS, which gains userspace direct-access performance and delivers high scalability simultaneously. We evaluate KucoFS with file system benchmarks and real-world applications. The evaluation results show that KucoFS scales better than existing file systems by an order of magnitude under high-contention workloads (e.g., creating files in the same directory or writing data in a shared file), and delivers slightly higher throughput under low contention. It also hits the bandwidth ceiling of PM devices for normal data operations. In summary, we make the following contributions:

• We conduct an in-depth analysis of state-of-the-art PM-aware file systems and summarize their limitations in solving the software overhead and scalability problems.
• We introduce Kuco, a userspace-kernel collaboration architecture with three key techniques (collaborative indexing, two-level locking, and versioned read) to achieve high scalability.
• We implement a PM file system named KucoFS based on the Kuco architecture, and experimentally show that KucoFS achieves up to one order of magnitude higher scalability for metadata operations, and fully exploits the PM bandwidth for data operations.

2 Motivation

In the past decade, researchers have developed a number of PM file systems, such as BPFS [11], SCMFS [41], PMFS [14], HiNFS [29], NOVA [43], Aerie [39], Strata [21], SplitFS [28], and ZoFS [13]. They are broadly categorized into three types. First, kernel-level file systems. Applications access them by trapping into the kernel for both data and metadata operations. Second, userspace file systems (e.g., Aerie [39], Strata [21], and ZoFS [13]). Among them, Aerie [39] relies on a trusted process (TFS) to manage metadata and ensure its integrity. The TFS also coordinates concurrent reads and writes to shared files with a distributed lock service. Strata [21], in contrast, enables applications to append their updates directly to a per-process log, but requires background threads (KernFS) to asynchronously digest logged data to storage devices. ZoFS avoids using a centralized component and allows userspace applications to update metadata directly with the help of a new hardware feature named Intel Memory Protection Keys (MPK). Note that Aerie, Strata, and ZoFS still rely on the kernel to enforce coarse-grained allocation and protection. Third, hybrid file systems (e.g., SplitFS [28] and our proposed Kuco). SplitFS [28] presents a coarse-grained split between a userspace library and an existing kernel file system. It handles data operations entirely in userspace, and processes metadata operations through the Ext4 file system. Table 1 provides a summary of existing PM-aware file systems and how well they behave in various aspects.

Table 1: Comparison of different NVM-aware file systems.

                        NOVA             Aerie/Strata              ZoFS                    SplitFS              KucoFS
Category                Kernel           Userspace                 Userspace               Hybrid               Hybrid
Scalability
  Metadata              Medium (§5.2.1)  Low (§5.2.1)              […] (Fig. 7g in [13])   Low (§5.2.1)         High (§5.2.1)
  Read                  Medium (§5.2.2)  Low (§5.2.2)              High                    Low ([…] in Ext4)    High (§5.2.2)
  Write                 Medium (§5.2.3)  Low (§5.2.3)              […] (Fig. 7f in [13])   […]                  High (§5.2.3)
Software overhead       High             Low                       Medium (sigsetjmp)      Medium (metadata)    Low
Other issues
  Avoid stray writes    Yes              No                        Yes                     No                   Yes
  Read protection       POSIX            Partition                 Coffer                  POSIX                Partition
  Visibility of updates Immediately      After batch/After digest  Immediately             After sync           Immediately
  Hardware              None             None                      MPK                     None                 None

Multicore scalability. NOVA [43], a state-of-the-art kernel file system for PMs, is carefully designed to improve scalability by introducing a per-core allocator and a per-inode log. Nevertheless, VFS still limits its scalability for certain operations. We experimentally show this by deploying NOVA on Intel Optane DC PMMs (the detailed experimental setup is described in §5.1) and using multiple threads to create, delete, or rename files in the same directory. As shown in Figure 1a, their throughput is almost unchanged as we increase the number of threads, since VFS needs to acquire the lock of the parent directory. Aerie [39] relies on a centralized TFS to handle metadata operations and enforce concurrency control. Although Aerie batches metadata changes to reduce communication with the TFS, our evaluation in §5 shows that the TFS still inevitably becomes the bottleneck under high concurrency. In Strata [21], the KernFS needs to digest logged data and metadata in the background. If an application completely uses up its log, it has to wait for an in-progress digest to complete before it can reclaim log space. As a result, the number of digestion threads limits Strata's overall scalability. Both Aerie and Strata interact with the trusted process (TFS/KernFS) via expensive IPCs, which introduces extra syscall overhead. ZoFS does not require a centralized component, so it achieves much higher scalability.
However, ZoFS still fails to scale well when processing operations that require allocating new space from the kernel (see Figures 7d, 7f, and 7g in their paper). Our evaluation shows that SplitFS scales poorly for both data and metadata operations because it 1) does not support sharing between different processes, and 2) relies on Ext4 to update metadata (see Figures 7 and 9).

Software overhead. Placing a file system in the kernel incurs two types of software overhead, i.e., the syscall and VFS overhead. We investigate such overhead by again analyzing NOVA, where we collect the latency breakdown of common file system operations. Each operation is performed on 1 million files or directories with a single thread. We make two observations from Figure 1b. First, syscalls take up to 21% of the total execution time. Also, after a process traps into the kernel, the OS may schedule other tasks before returning control to the original one. Hence, syscalls bring extra uncertainty for latency-sensitive applications [12, 33]. Second, Linux kernel file systems are implemented by overriding VFS functions, and VFS causes non-negligible overhead. Although recent PM file systems [9, 11, 14, 29, 40, 43, 50] use direct access (DAX) to bypass the page cache in VFS, we find that an average of 34% of the time is still spent in the VFS layer for NOVA.

Figure 1: Software overhead and scalability of NOVA. (a) Throughput (Mops/s) of create and rename versus the number of threads. (b) Latency breakdown (%) into FS, VFS, and syscall time for stat, mkdir, mknod, rename, unlink, read-1K, and write-1K.

ZoFS [13] deploys a file system in userspace to avoid trapping into the kernel; however, it still incurs extra software overhead. ZoFS allows userspace applications to update metadata directly, which may cause a normal program to be terminated when accessing metadata that is corrupted by malicious attackers. To achieve graceful error return, ZoFS invokes sigsetjmp at the beginning of each syscall, which causes extra delays (200 ns). SplitFS requires a kernel file system to handle metadata operations, so it still introduces kernel overhead.

Other issues. First, misused pointers can lead to writes to incorrect locations and corrupt the data, which is known as stray writes [14]. Strata [21] exposes the per-process operation log and the DRAM cache (including both metadata and data) to userspace applications. Aerie [39] and SplitFS [28] map a subset of the file system image to userspace. Hence, stray writes can easily corrupt the data in these areas, and such corruptions are permanent in NVM even after reboots. Second, Aerie, Strata, and SplitFS improve performance by delaying the visibility of newly written data to other processes until a later synchronization point (batch, digest, or sync in Table 1), forcing applications to make corresponding adjustments. Third, ZoFS heavily relies on the MPK mechanism; if an application also needs to use MPK, the two may compete for the limited MPK resources.

To summarize, it is hard to achieve high scalability and low software overhead with existing file system designs, and this motivates us to introduce the Kuco architecture.

3 The Kuco Architecture

In this paper, we introduce the Kuco architecture to show that a client/server model can be adopted to realize the two goals simultaneously. The central idea underlying Kuco is a fine-grained task division and collaboration between the client and the server, where most load is offloaded to the client part to prevent the server from becoming the bottleneck.

3.1 Overview

Figure 2 shows the Kuco architecture. It follows a client/server model with two parts, a userspace library and a global kernel thread, which are called Ulib and Kfs respectively. An application accesses Kuco by linking with Ulib first, and different Ulib instances (i.e., applications) interact with Kfs via separate memory message buffers. Like existing userspace file systems [21, 39], Kuco maps the PM space to userspace to support direct read and write accesses. To protect file system metadata from being corrupted, Kuco does not allow applications to update metadata directly; instead, such requests are posted to Kfs, and Kfs then updates metadata on their behalf.

Figure 2: The Kuco architecture. Metadata updates: Ulib interacts with Kfs via collaborative indexing; read: direct access via versioned read; write: direct access based on a three-phase write protocol, with two-level locking for concurrency control.

Kuco delivers high scalability with a fine-grained task division and collaboration between Ulib and Kfs. For metadata scalability, Kuco incorporates the collaborative indexing mechanism to offload the pathname traversal job from Kfs to userspace (§3.2). Instead of sending metadata operations to Kfs directly, Ulib first finds all the related metadata items in userspace, and then encapsulates such information in the request before sending it out. Therefore, Kfs can perform metadata modifications directly with the given addresses. For data scalability, a two-level locking mechanism is used to handle concurrent writes to shared files (§3.3). Specifically, Kfs uses a lease-based distributed lock to resolve write conflicts between different applications (or processes). Concurrent writes from the same process are serialized using a pure userspace range lock, which can be acquired without the involvement of Kfs. Kuco introduces the versioned read technique to perform file reading in userspace (§3.5). By adding extra version bits in data block mappings (which map logical file data to physical PM addresses), Kuco can read a consistent version of data blocks without interacting with Kfs to acquire the lock, even when there are concurrent writers.

To further prevent buggy programs from corrupting file data, PM space is mapped to userspace in readonly mode. Kuco enables userspace direct writes on readonly addresses by placing Kfs in the kernel with a three-phase write protocol (§3.4). Before Ulib writes a file, Kfs first modifies the permission bits in the page table to switch the involved data pages from readonly to writable. To further reduce the number of interactions between Ulib and Kfs when writing a file, Kuco uses pre-allocation, where Ulib can allocate more free pages from Kfs than it currently needs. In addition to the write protection mechanism that prevents stray writes, the PM space in Kuco is divided into different partition trees, which act as the minimum unit for read protection.

By applying Kuco in a file system named KucoFS and putting all techniques together, KucoFS gains direct-access performance, delivers high scalability, and ensures kernel-level data protection.

3.2 Collaborative Indexing

In a typical client/server model, whenever Kfs receives a metadata request, it needs to find the related metadata (e.g., inodes that describe file attributes, or dentries that map file names to inode numbers) by performing iterative pathname resolution from the root inode to the directory containing this file. Such pathname traversal is a heavy burden for Kfs, especially when a directory contains a large number of sub-files or the directory hierarchy is deep. To address this issue, we propose to offload the pathname resolution task from Kfs to Ulib. Because partition trees are mapped to userspace, Ulib can find the related metadata items directly in userspace, and then sends a metadata update request to Kfs, encapsulating the metadata addresses in the request as well. In this way, Kfs can update metadata directly with the given addresses, and the pathname resolution overhead is offloaded to userspace.
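The division of labor described above can be pictured with a toy model. The sketch below is illustrative only and is not KucoFS code: `MetadataArea`, `ulib_resolve`, and `kfs_set_size` are hypothetical names, Python integers stand in for PM addresses, and a dictionary stands in for the metadata area that is mapped into the application.

```python
# Toy model of collaborative indexing (all names are hypothetical).
# Integers play the role of metadata addresses; a dict plays the role of
# the metadata area that is mapped read-only into the application.

class MetadataArea:
    def __init__(self):
        self.items = {}          # address -> metadata item
        self.next_addr = 0x1000

    def alloc(self, item):
        addr = self.next_addr
        self.next_addr += 0x100
        self.items[addr] = item
        return addr


def make_inode(area):
    return area.alloc({"type": "inode", "children": {}, "size": 0})


def add_child(area, dir_addr, name, child_addr):
    area.items[dir_addr]["children"][name] = child_addr


def ulib_resolve(area, root_addr, path):
    """Client side: walk the path in user space and return the address of
    the target inode together with the number of lookup steps."""
    addr = root_addr
    steps = 0
    for name in [p for p in path.split("/") if p]:
        addr = area.items[addr]["children"][name]
        steps += 1
    return addr, steps


def kfs_set_size(area, inode_addr, size):
    """Server side: update the item at the given address, no traversal."""
    area.items[inode_addr]["size"] = size


area = MetadataArea()
root = make_inode(area)
home = make_inode(area)
add_child(area, root, "home", home)
log_file = make_inode(area)
add_child(area, home, "log.txt", log_file)

# The client pays for the two lookup steps; the server only dereferences.
addr, steps = ulib_resolve(area, root, "/home/log.txt")
kfs_set_size(area, addr, 4096)
```

In this model the cost of the traversal grows with path depth on the client side, while the server-side work stays constant per request, which is the effect the technique aims for.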
Figure 3 shows how Kuco creates a file given its pathname. Ulib first finds the predecessor dentry of the new file in the dentry list of its parent directory. It then sends a request to Kfs, and the address of the predecessor is put in the message too. Kfs creates the file after receiving the request, which includes creating an inode for this file and then inserting a new dentry in the parent directory's dentry list after the given predecessor. To delete a file, both the inode of this file and the dentry in the parent directory should be deleted, so both of their addresses are kept in the request before Ulib sends it. Note that access-time (atime) updates are disabled by default, enabling readonly operations (e.g., stat) to be performed in userspace without posting extra requests to Kfs.

Figure 3: Creating a file with collaborative indexing.

In Kuco, Ulibs produce pointers and Kfs consumes them. This "one-way" pointer sharing paradigm simplifies ensuring the correctness and safety of Kuco. On the one hand, metadata items are placed in a metadata area with a separate address space, and Ulib can only pass the addresses of two types of metadata items (i.e., dentry and inode). Hence, we add an identifier field at the beginning of each metadata item, which helps Kfs check the metadata type: any address that is not in the metadata area, or that does not point to a dentry or inode, is considered invalid. On the other hand, Kfs also performs consistency checking based on the file system's internal logic.

First, Ulib might read an inconsistent directory tree. For example, when Kfs is creating new files in a directory, Ulibs may read an inconsistent dentry list of this directory. To address this issue, we organize the dentry list of each directory as a skip list [32], and each dentry is indexed by the hash value of the file name. A skip list has multiple layers of linked-list-like data structures. Each higher layer acts as an "express lane" for the layer below. The list-based structure enables lock-free atomic updates by performing pointer manipulations. Besides, only insert and delete operations are applied to the dentry list, and they are performed by a single Kfs; this includes rename operations, which are performed by first inserting a new node and then deleting the old one. Therefore, a read of a dentry always observes a consistent one, even without acquiring the lock.
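The pointer check that Kfs applies to addresses received from Ulib, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions: the bounds of the metadata area and the identifier values are made up, and a dictionary maps an address to the identifier stored at that address.

```python
# Hypothetical constants: the bounds of the metadata area and the
# identifier values below are invented for this sketch.
META_BASE = 0x10000
META_END = 0x20000
ID_INODE = 0xA1
ID_DENTRY = 0xB2


def is_valid_pointer(memory, addr):
    """Return True only if addr lies inside the metadata area and the
    identifier stored there marks an inode or a dentry."""
    if not (META_BASE <= addr < META_END):
        return False
    return memory.get(addr) in (ID_INODE, ID_DENTRY)


# `memory` maps an address to the identifier found at that address.
memory = {
    0x10000: ID_INODE,    # valid inode inside the metadata area
    0x10040: ID_DENTRY,   # valid dentry inside the metadata area
    0x10080: 0x00,        # inside the area, but not a metadata header
    0x30000: ID_INODE,    # right identifier, wrong address range
}
```

A request carrying `0x10080` or `0x30000` would be rejected before Kfs touches the item; the consistency checks based on file system logic are a separate, later step.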
Second, with such a lock-free design, userspace applications may read metadata items that are being deleted by Kfs, causing the "read-after-delete" anomaly. To safely reclaim the deleted items, we need to ensure that no threads access them anymore. We address this issue by using an epoch-based reclamation mechanism (EBR) [15]. EBR maintains a global epoch and three reclaim queues, where the execution is divided into epochs and reclaim queues are maintained for the last three epochs. Each Ulib thread also owns a private epoch. Items deleted in epoch e are placed into the queue for epoch e. Each time Ulib starts an operation, it reads the global epoch and updates its own epoch to be equal to the global one. It then checks the private epochs of the others. If all Ulibs are active in the current epoch e, then a new epoch e+1 begins. At this time, all threads are active either in e or in e+1, and items in the queue related to e-1 can be reclaimed safely. We also add a flag in each inode and dentry. Kfs deletes a metadata item by setting its flag to an invalid state, preventing applications from reading already deleted items.

Kfs needs to handle conflicting metadata operations properly. For example, when multiple Ulibs are performing metadata operations concurrently, the pre-located metadata item of one Ulib might be deleted or renamed by another before Kfs accesses it. Hence, this item is no longer valid and its address cannot be used by Kfs anymore. It is also possible that a malicious process attacks Kfs by providing arbitrary addresses. Luckily, only Kfs can update metadata, and it can validate the pre-located metadata before processing the operation. Specifically, Kfs checks whether the pre-located item still exists or is still the predecessor, and avoids creating files with the same name. When the validation fails, Kfs resolves the pathname itself and returns an error code to Ulib if the operation fails anyway.

First, Kuco ensures that all metadata operations are processed atomically. For a file creation, Kfs atomically inserts a new dentry in the skip list only after the inode has been created, to make the created file visible; for a deletion, it atomically deletes the dentry before deleting other fields. A rename involves updating two dentries (creating a new entry in the destination path, and then deleting the old one), so a program could see the same file in both places at some point in time. We leverage the flag in each dentry to prevent such an inconsistent state. Specifically, the old entry on the source path is marked dirty before the new entry is created, and is then set invalid after the new entry is created. As a whole, metadata operations always change the directory tree atomically, and Ulib is guaranteed to have a consistent view of the directory tree even without acquiring the lock.

Second, Kuco's scalability is further improved by avoiding locks: concurrent metadata updates are all delegated to the global Kfs, so they can be processed without any locking overhead (only Kfs can update metadata) [16, 35]. Kuco ensures the crash consistency of metadata via an operation log, which is discussed in §4.2.

3.3 Two-Level Locking

Kuco introduces a two-level locking service to coordinate concurrent writes to shared files, which prevents Kfs from being frequently involved in concurrency control. First, Kfs manages write leases (in the kernel, see Figure 2) on files to enforce coarse-grained coordination between different processes, as in Aerie and Strata [21, 39]. Only the process that holds a valid write lease (not yet expired) can write the file. We assume that Ulib applies for leases infrequently, based on the fact that it is not the common case for multiple processes to frequently and concurrently write the same file. More fine-grained sharing between processes can be achieved via shared memory or pipes [21]. Read leases are not needed in Kuco (see Section 3.5).

Second, we introduce a direct access range-lock to serialize concurrent writes between threads within the same process. Once a Ulib acquires the write lease of a file, it creates a range lock for this file in userspace, which is actually a DRAM ring buffer (as shown in Figure 4). A thread writes a file by acquiring the range-lock first, and it is blocked if a lock conflict occurs. Each slot in the ring buffer has five fields, which are state, offset, size, ctime, and a checksum. The checksum is the hash value of the first four fields. We also place a version at the head of each ring buffer to describe the order of each write operation. To acquire the lock of a file, Ulib first increments the version with an atomic fetch-and-add. It then inserts a lock item into a specific slot in the ring buffer (the location is determined by the fetched version modulo the ring buffer size). The insertion is blocked when this slot overlaps with the head of the ring buffer. After this, Ulib traverses the ring buffer backward to find the first conflicting lock item (i.e., one whose written data overlaps). If such a conflict exists, Ulib verifies its checksum, and then polls on its state until it is released. Ulib also checks its ctime field repeatedly to avoid a deadlock if a thread aborts before it releases the lock. With this design, multiple threads can write different data pages in the same file concurrently.

Figure 4: Direct access range-lock. Each opened file owns a ring buffer; the numbered steps in the figure show how to acquire a lock.

3.4 Three-Phase Write

Once the lock has been acquired, Ulib can actually write file data. Since PM space is mapped to userspace in readonly mode, Ulib cannot write file data directly. Instead, we propose a three-phase write protocol to perform direct writes.

To ensure crash consistency, Kuco follows a copy-on-write (CoW) approach to write file data, where newly written data is always redirected to new PM pages. Similar to NOVA [43] and PMFS [14], we use 4 KB as the default data page size. The write protocol in Kuco consists of three steps. First, Ulib locks the file via two-level locking and sends a request to Kfs to allocate new PM pages. Note that, with CoW, space allocation is necessary for both overwrites and appends. Kfs also needs to modify the related page table entries to make these allocated PM pages writable before sending the response message back. Second, Ulib copies both the unmodified data from the old place and the new data from the user buffer to the allocated PM pages, and persists them via flush instructions. Third, Ulib sends another request to Kfs to update the metadata of this file (i.e., inode, block mapping) and switch the newly written pages to readonly, and finally releases the lock.

Furthermore, we introduce pre-allocation to avoid allocating new PM pages from Kfs for every write operation. Specifically, we allow Ulib to allocate more free pages from Kfs than it currently needs (4 MB at a time in our implementation). In this way, Ulib can use local free PM pages without interacting with Kfs for most write operations. When an application exits, the unused pages are given back to Kfs. After an abnormal exit, these free pages are temporarily non-reusable by other applications, but can still be reclaimed during the recovery phase (see §4.2).
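The effect of pre-allocation on the number of Ulib-to-Kfs interactions can be illustrated with a small counting model. The sketch below is an assumption-laden toy, not the KucoFS allocator: the `Kfs` and `Ulib` classes, their method names, and the use of integer page numbers are all hypothetical; only the 4 KB page size and the 4 MB batch size come from the text.

```python
PAGE_SIZE = 4096                                # 4 KB data pages
BATCH_PAGES = (4 * 1024 * 1024) // PAGE_SIZE    # 4 MB per request = 1024 pages


class Kfs:
    """Stand-in for the kernel thread; it only counts allocation requests."""

    def __init__(self):
        self.requests = 0
        self.next_page = 0

    def allocate(self, n_pages):
        self.requests += 1
        start = self.next_page
        self.next_page += n_pages
        return list(range(start, start + n_pages))


class Ulib:
    def __init__(self, kfs):
        self.kfs = kfs
        self.free_pages = []    # locally cached free pages

    def get_pages(self, n_pages):
        """Serve a write from the local pool, refilling in 4 MB batches."""
        while len(self.free_pages) < n_pages:
            shortfall = n_pages - len(self.free_pages)
            self.free_pages.extend(
                self.kfs.allocate(max(BATCH_PAGES, shortfall)))
        taken = self.free_pages[:n_pages]
        self.free_pages = self.free_pages[n_pages:]
        return taken

    def release_unused(self):
        """On a normal exit, hand the unused pages back."""
        unused, self.free_pages = self.free_pages, []
        return unused


kfs = Kfs()
ulib = Ulib(kfs)
for _ in range(1000):       # 1000 single-page writes
    ulib.get_pages(1)
```

In this model the 1000 single-page writes trigger a single allocation request, and the next request is only needed once the 1024-page batch is exhausted.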
Pre-allocation also helps reduce the overhead of updating page table entries. When Kfs updates page table entries after each allocation, it needs to flush the related TLB entries explicitly to make the modifications visible. Pre-allocation allows allocating multiple data pages at a time, so the TLB entries can be flushed in batch.

3.5 Versioned Read

In the write protocol, both old and new versions of data pages are temporarily kept due to CoW, giving us the opportunity to read file data even without blocking writes. However, block mappings that map a logical file to physical pages are still updated in place by Kfs. This drives us to design the versioned read mechanism to achieve user-level direct reads without any involvement of Kfs, regardless of concurrent writers.

Figure 5: Block mapping format and the versioned read. Mapping items with the same version correspond to the same write operation. The three consistent cases describe how the bits can be formatted when the version changes.

Versioned read is designed to allow userspace reads without locking the file, while ensuring that readers never read data from incomplete writes. To achieve this, Kuco uses an Ext2-like [6] block mapping to index data pages and embeds a version field in each pointer of the block mapping. As shown in Figure 5, each 96-bit block mapping item contains four fields, which are start, version, end, and pointer. For a write operation that writes, say, three data pages, Kfs updates the related block mapping items with the following format: 1 | V1 | 0 | P1, 0 | V1 | 0 | P2, 0 | V1 | 1 | P3. In particular, all three items share the same version (i.e., V1), which is provided by Ulib when it acquires the range lock (Section 3.3). The start bit of the first item and the end bit of the last item are set to 1. We only reserve 40 bits for the pointer field since it points to a 4 KB-aligned page and the lower 12 bits can be discarded. With this format, readers can read a consistent snapshot of data pages when one of the three cases in Figure 5 is met:
• No overlapping. When two updates to a file are performed on non-overlapping pages, items with the same version should be enclosed with both a start bit and an end bit (case a).
• Overlaps the end part. When a thread overwrites the end part of a former write, a reader should always see a start bit when the version increases (case b).
• Overlaps the front part. When a thread overwrites the front part of a former write, a reader should always see an end bit before the version decreases (case c).

If Ulib meets any case other than the above three, it indicates that Kfs is updating the block mapping for some other incomplete write. In this case, Ulib needs to validate again by re-scanning the sequence of the related versions.
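One way to read the three cases above is as a rule on every boundary where the version changes between neighbouring mapping items: an increase must coincide with a start bit on the newer item, and a decrease must be preceded by an end bit on the newer item. The sketch below encodes that reading; it is a simplified interpretation rather than the KucoFS implementation, and the `Item`, `is_consistent`, and `versioned_read` names as well as the retry limit are assumptions.

```python
from typing import Callable, List, NamedTuple


class Item(NamedTuple):
    start: int     # 1 on the first item of a write
    version: int   # version assigned to the write
    end: int       # 1 on the last item of a write
    pointer: int   # page pointer (opaque in this sketch)


def is_consistent(items: List[Item]) -> bool:
    """Check each boundary where the version changes between neighbours."""
    for prev, cur in zip(items, items[1:]):
        if cur.version > prev.version and cur.start != 1:
            return False    # version increased without a start bit
        if cur.version < prev.version and prev.end != 1:
            return False    # version decreased without an end bit
    return True


def versioned_read(load_items: Callable[[], List[Item]],
                   max_retries: int = 1000) -> List[int]:
    """Re-scan the mapping until a consistent snapshot is observed."""
    for _ in range(max_retries):
        items = load_items()
        if is_consistent(items):
            return [it.pointer for it in items]
    raise RuntimeError("no consistent snapshot observed")


# A four-page write with version 1, then a two-page overwrite (version 2)
# of the middle pages, observed half-way through and after completion.
old = [Item(1, 1, 0, 100), Item(0, 1, 0, 101),
       Item(0, 1, 0, 102), Item(0, 1, 1, 103)]
partial = [Item(1, 1, 0, 100), Item(1, 2, 0, 201),
           Item(0, 1, 0, 102), Item(0, 1, 1, 103)]
done = [Item(1, 1, 0, 100), Item(1, 2, 0, 201),
        Item(0, 2, 1, 202), Item(0, 1, 1, 103)]

snapshots = iter([partial, done])
pages = versioned_read(lambda: next(snapshots))
```

Here the half-updated mapping is rejected because the version drops from 2 to 1 without an end bit, so the reader retries and returns the pointers of the completed mapping.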
Once Ulib succeeds in the version checking, it reads the associated data pages. As a whole, Kuco utilizes the embedded versions to detect incomplete writes and retries until it reads a consistent snapshot of data.

Read semantics. In a multi-threaded process execution, versioned read is slightly different from legacy locked read in that it allows concurrent writes. For example, a write starts and has not yet been completed, but in between there is a read, which reads an old snapshot of data. In this case, the execution is still equivalent to a serializable order (e.g., "read → write", where → denotes "happens-before"). Versioned read has the same semantics as locked read within each thread, because a read or write has to complete before the next one is issued.

Figure 6: Data layout of a partition tree in KucoFS. A file creation with three steps is also shown.

4 KucoFS Implementation

In this section, we describe how the Kuco architecture is applied in a persistent memory file system named KucoFS.

4.1 Data Layout

KucoFS organizes the partition trees of Kuco in a hybrid way using both DRAM and PM (Figure 6). In DRAM, an array of pointers (the inode table) is placed at a predefined location to point to the actual inodes. The first element in the inode table points to the root inode of the current partition tree. With this, Ulib can find any file from the root inode in userspace. As discussed before, the dentry list of a directory is organized as a skip list, which is also placed in DRAM.

For efficiency, KucoFS only operates on the DRAM metadata for normal requests. To ensure the durability and crash consistency of metadata, KucoFS places an append-only persistent operation log in PM for each partition tree. When Kfs updates the metadata, it first atomically appends a log entry, and then actually updates the DRAM metadata (see §4.2).
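The log-then-update ordering just described can be modelled in a few lines. The sketch below is a toy under stated assumptions: the `PartitionTree` class and its method names are hypothetical, a Python list stands in for the persistent operation log in PM, and a dictionary stands in for the DRAM metadata. A replay helper is included only to show why the ordering matters.

```python
class PartitionTree:
    def __init__(self):
        self.op_log = []       # stands in for the append-only log in PM
        self.dram_meta = {}    # stands in for DRAM metadata: name -> inode no.

    def _apply(self, entry):
        op, name, ino = entry
        if op == "create":
            self.dram_meta[name] = ino
        elif op == "delete":
            self.dram_meta.pop(name, None)

    def update(self, op, name, ino=None):
        entry = (op, name, ino)
        self.op_log.append(entry)    # 1) append the log entry first
        self._apply(entry)           # 2) then update the DRAM metadata

    def crash(self):
        self.dram_meta = {}          # DRAM contents are lost; the log survives

    def recover(self):
        for entry in self.op_log:    # rebuild DRAM metadata from the log
            self._apply(entry)


tree = PartitionTree()
tree.update("create", "a", 1)
tree.update("create", "b", 2)
tree.update("delete", "a")
before = dict(tree.dram_meta)

tree.crash()
tree.recover()
```

Because every change reaches the log before it reaches DRAM, replaying the log after the simulated crash reproduces the same metadata state as before it.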
).Whensystemfailuresoccur,theDRAMmetadatacanalwaysberecoveredbyreplayingthelogentriesintheoperationlog.Inadditiontotheoperationlog,theextraPMspaceiscutinto4KBdatapagesandmetadatapages.FreePMpagesaremanagedwithbothabitmapinPMandafreelistinDRAM(forfastallocation),andthebitmapislazilypersistedbytheduringthecheckpointphase.4.2CrashConsistencyandRecoveryMetadataconsistency.FSensuresthemetadatacon-sistencybyorderingupdatestoDRAMandPM.Figure 6 showsthestepsofhowcreatesalewhenitreceivesarequestfrom.Inreservesanunusedinodenumberfromtheinodetableandappendsalogentrytothe operationlog.Thislogentryrecordstheinodenumber,lename,parentdirectoryinodenumber,andotherattributes.In,itallocatesaninodewitheacheldlled,andupdatestheinodetabletopointtothisinode.In,ittheninsertsadentryintothedentrylistwiththegivenaddressofthepredecessor,tomakethecreatedlevisible.Acreationfailsifthesamedentryalreadyexists(avoidcreatingthesameles).Todeleteale,appendsalogentryrst,deletesthedentryintheparentdirectorywiththegivenaddresses,andnallyfreestherelatedspaces(e.g.,inode,datapagesandblockmapping).Ifacrashhappensbeforetheoperationisnished,theDRAMmetadataupdateswillbelost,butreconstructthemtotheneweststatebyreplayingthelogafterrecovery.Foroprations,exceptforsystemfailures,thekernelthreadmaycrashandcausethedirtyagtobeinaninconsistentstate.However,weconsiderthewholelesystemcrashesifthekernelthreadcrashes,whichrequiresthelesystemtoberebooted,andtheaboveloggingtechniqueensuresthatoperationisalsocrash-consistent.Dataconsistency.FShandleslewriteoperationsbyrstupdatingdatapagesinaCoWway,andthenappendingalogentryintheoperationlogtorecordthemetadatamodications.Atthispoint,thewriteisconsidereddurableThen,KFScansafelyupdateDRAMmetadatatomakethisoperationvisible.whenasystemfailureoccursbeforethelogentryispersisted,KFScanrollbacktoitslastconsistentstatesinceolddataandmetadataareuntouched.Otherwise,thiswriteoperationismadevisiblebyreplayingtheoperationlogafterrecovery.Logcleaningandrecovery.Weintroduceacheckpointmech-anis
mtoavoidtheoperationlogfromgrowingarbitrarily.Whentheisnotbusy,orthesizeofthelogexceedsathreshold(1MBonourimplementation),weuseabackgroundkernelthreadtotriggeracheckpoint,whichappliesmetadatamodicationsintheoperationlogtoPMmetadatapages.ThebitmapthatisusedtomanagethePMfreepagesisupdatedandpersistedaswell.Afterthat,theoperationlogistruncated.Backgrounddigestionneverblocksfront-endoperations,andtheonlyimpactisthatlogcleaningconsumesextraPMbandwidth.However,metadataaretypica

8 llysmall-sizedandbandwidthconsumptionisn
llysmall-sizedandbandwidthconsumptionisnothigh.EachtimeKFSisrebootedfromacrash,replaystheun-checkpointedlogentriesintheoperationlog,soastomakePMmetadatapagesup-to-date.ItthencopiesPMmetadatapagestoDRAM.ThefreelistofPMdatapagesisalsoreconstructedaccordingtothebitmapstoredinPM.Crashingagainduringtherecoveryisnotaconcernsincetheloghasnotyetbeentruncatedandcanbereplayedagain.KeepingredundantcopiesofmetadatabetweenDRAMandPMintroduceshigherconsumptionofPMDRAMspace,butwebelieveitisworththeeorts.WithstructuredmetadatainDRAM,wecanperformfastindexingdirectlyinDRAM;appendinglogentriesinthelogsavesthenumberofupdatestoPMs,whichreducesthepersistenceoverhead.Inthefuture,weplantoreducetheDRAMfootprintbyonlykeepingactivemetadatainDRAM.4.3WriteProtectionFSstrictlycontrolsupdatestothelesystemimage.Bothin-memorymetadataandthepersistentoperationlogarecritical,sotheinthekernelistheonlyonethatisallowedtoupdatethem.Filepagesaremappedtouserspaceinreadonlymode.ApplicationscanonlywritedatatonewlyallocatedPMpagesandexistingdatapagescannotbemodied.KFSalsoprovidesprocess-levelisolationforuserspacedatastructures.Themessagebuerandrangelocksareprivatelyownedbyeachprocess,soanattackercannotaccesstheminotherprocesses,exceptthatitperformsaprivilegeescalationattack.Suchsecurityissuesareoutofthescopeofthiswork.Assuch,weconcludethatKachievesthesamewriteprotectionaskernellesystems.Preventingstraywrites.Unlikemanyexistinguserspacelesystemsthatarevulnerabletostraywrites[ 21 , 28 , 39 
KucoFS prevents this issue by mapping the PM space in read-only mode. Note that there is still a temporary, writable window (less than 1 µs) for the newly written pages after a write operation is finished but before the permission bits are changed. This is unavoidable, and the same window exists in kernel file systems like PMFS. Fortunately, stray writes within this window are rare. Besides, range locks and message buffers in userspace might also be corrupted by stray writes. For this threat, we add checksum and lease fields to each slot, which can be used to check whether the inserted element has been corrupted.

4.4 Read Protection

KucoFS organizes its directory tree with partition trees, which act as the minimal unit for access control. Each partition tree is self-contained, consisting of metadata and data in PM, and the related metadata copy in DRAM. KucoFS does not allow directory structures to span different partitions. When a program accesses KucoFS, only the partition trees it has access to are mapped to its address space; other partition trees are invisible to it. In KucoFS, read access control is enforced with the following compromises. First, similar to existing userspace file systems [13, 39], KucoFS cannot support "write-only" or complex permission semantics such as POSIX access control lists (ACLs), since the existing page table only has a single bit to indicate whether a page is read-only or read-write. Second, Kuco does not support flexible data sharing between users, because it is hard to change the permission of a specific file (e.g., via chmod) with the partition tree design [13, 21, 31]. Yet there are several practical approaches: creating a standalone partition that applications with different permissions all have access to, or posting user-level RPCs between different applications to acquire the data. We believe such a trade-off is not likely to be an obstacle, since KucoFS still supports efficient data sharing between applications within the same user, which is the more common case in real-world scenarios [13].

4.5 Memory-Mapped I/O

Supporting the DAX feature in a copy-on-write file system needs extra effort, since files are updated out-of-place in normal operations [43
]. Besides, DAX leaves great challenges for programmers to correctly use PM space with atomicity and crash consistency. Taking these factors into consideration, we borrow the idea from NOVA to provide atomic-mmap, which has higher consistency guarantees. When an application maps a file into userspace, Ulib copies file data to its privately managed data pages, and then sends a request to Kfs to map these pages into a contiguous address space. When the application issues an msync system call, Ulib handles it as a write operation, so as to atomically make the updates in these data pages visible to other applications.

4.6 KucoFS's APIs

KucoFS provides a POSIX-like interface, so existing applications are able to access it without any modifications to the source code. We achieve this by setting the LD_PRELOAD environment variable. Ulib intercepts all APIs in the standard C library that are related to file system operations. Ulib processes syscalls directly if the prefix of an accessed file matches a predefined string. Otherwise, the syscall is processed in legacy mode. Note that operations such as read and write pass only file descriptors in the parameter list. Ulib distinguishes them from legacy syscalls via a mapping table [23], which tracks the open files of KucoFS.

4.7 Examples: Putting It All Together

Finally, we summarize the design of Kuco and KucoFS by walking through an example of writing 4 KB of data to a new file and then reading it out. Before sending an open request, Ulib pre-locates the related metadata first. Since this is a new file, Ulib cannot find it directly. Instead, it finds the predecessor in its parent directory's dentry list for later creation. The address, as well as other information (e.g., file name, flags, etc.), is encapsulated in the request. When Kfs receives the request, it creates this file based on the given address. It also needs to assign a write lease to this process. Then, Kfs sends a response message. After this, Ulib creates a file descriptor and a range lock for this opened file, and returns to the application.

The application then uses a write call via Ulib to write 4 KB of data to this newly created file. First, Ulib tries to lock the file via the two-level locking service. Since the write lease is still valid, it acquires the lock directly through the range lock. Ulib blocks the program when there are write conflicts and waits until other concurrent threads have released the lock. After that, Ulib can acquire the lock successfully. It then allocates a 4 KB page from the pre-allocated pages, copies the data into it, and flushes the data out of the CPU cache. Ulib also needs to post an extra request to Kfs to allocate more free data pages once the pre-allocated space is used up. Finally, Ulib sends the write request to Kfs to finish the remaining steps, including changing the permission bits of the written data pages to read-only, appending a log entry to describe this write operation, and updating the DRAM metadata. Ulib finally unlocks the file in the range lock.

KucoFS enables reading file data without interacting with Kfs. To read the first 4 KB from this file, Ulib locates the inode in userspace and reads the first block mapping item. The version checking is performed to ensure its state satisfies one of the three conditions described in Section 3.5
. After that, Ulib can safely read the data page pointed to by the pointer in the mapping item. Ulib also needs to send a close request to Kfs upon closing this file. Kfs then reclaims the write lease since the process will not access this file anymore.

5 Evaluation

In our evaluation, we try to answer the following questions:
- Does KucoFS achieve the goal of delivering direct-access performance and high scalability?
- How does each individual technique in KucoFS help with achieving the above goals?
- How does KucoFS perform under macro-benchmarks and real-world applications?

5.1 Experimental Setup

Testbed. Our experimental testbed is equipped with two Xeon Gold 6240M CPUs (36 physical cores), 384 GB DDR4 DRAM, and 12 Optane DCPMMs (256 GB per module, 3 TB in total). We perform all experiments on the Optane DCPMMs residing on NUMA node 0 (1.5 TB), whose read bandwidth peaks at 37.6 GB/s and write bandwidth at 13.2 GB/s. The server runs Ubuntu 19.04 and Linux 5.1, the newest kernel version supported by NOVA.

Compared systems. We evaluate KucoFS against NVM-aware file systems including PMFS [14], NOVA [43], SplitFS [28], Aerie [39], and Strata [21], as well as traditional file systems with DAX support including Ext4-DAX [2] and XFS-DAX [38]. Strata only supports a few applications and has trouble running multi-threaded workloads. Similar to previous papers [13, 49], we only show part of its performance results in §5.3 and §5.4. We only evaluate SplitFS in §5.2 and §5.4 since it only supports a subset of APIs. For a fair comparison, we deploy SplitFS in strict mode, which ensures both durability and atomicity. ZoFS is not open-sourced so we did not evaluate it. Aerie is based on Linux 3.2.2, which cannot support Optane DCPMMs. Hence, we compare with Aerie [39] by emulating persistent memory with DRAM and injecting extra delays. Due to limited space, we only describe Aerie's experimental data verbally without adding extra figures.

Figure 7: creat performance with FxMark. Panels: (a) creat, low; (b) creat, medium; (c) creat, medium (more files). Low: in different folders; medium: in the same folder; more files: each thread creates one million files. (Plot omitted; axes: throughput vs. number of threads; systems: XFS-DAX, Ext4-DAX, PMFS, NOVA, SplitFS, KucoFS.)

5.2 Effects of Individual Techniques

We use FxMark [25], which explores the scalability of basic file system operations, to analyze the effects of individual techniques. FxMark provides 19 micro-benchmarks, which are categorized based on four criteria: data types (i.e., data or metadata), modes (i.e., read or write), operations, and sharing levels. We only evaluate the commonly used operations (e.g., read, write, mknod, etc.) due to the limited space.

5.2.1 Effects of Collaborative Indexing

Basic performance. In KucoFS, the creat operation requires posting requests to Kfs, so we choose this operation to show the effects of collaborative indexing. FxMark evaluates creat operations by letting each client thread create 10 K files in private directories (i.e., low sharing level) or a shared directory (i.e., medium). As shown in Figures 7a and 7b, KucoFS exhibits the highest performance among the compared file systems and its throughput never collapses, regardless of the sharing level. XFS-DAX, Ext4-DAX, and PMFS use a global lock to perform metadata journaling in a shared log, which leads to their poor scalability. NOVA shows excellent scalability under the low sharing level by avoiding global locks (e.g., it uses a per-inode log and partitions its free space). However, all kernel file systems fail to scale under the medium sharing level since VFS needs to lock the parent directory before creating files. SplitFS relies on Ext4 to create files, which accounts for its low scalability. From the ZoFS paper we also find that ZoFS shows even lower throughput than NOVA under the low sharing level, since it needs to trap into the kernel frequently to allocate new space. Aerie supports synchronizing metadata updates of the created files to the TFS with batching (by compromising visibility), so it achieves performance comparable to that of KucoFS. However, Aerie fails to work properly when the number of threads increases. The throughput of KucoFS, in contrast, decreases only slightly at the medium sharing level; it is one order of magnitude higher than that of the other file systems, and higher than that of ZoFS. We explain the high scalability of KucoFS from the following aspects. First, in KucoFS, all metadata updates are delegated to Kfs, so it can update them without any locking overhead. Second, by offloading indexing tasks to userspace, Kfs only needs to do lightweight work.

Figure 8: Benefits of collaborative indexing and versioned read. w/o CI: KucoFS without collaborative indexing. Panels: collaborative indexing (create, medium), conflict handling, and versioned read. (Plots omitted; axes: throughput or execution time vs. number of threads; legend entries: w/o CI, NOVA, r/w lock, w/o lock, KucoFS.)

Larger workload. Furthermore, we measure the scalability of KucoFS in terms of data capacity by extending the workload size. Specifically, we let each thread create 1 million files, larger than the default size in FxMark, and the results are shown in Figure 7c. Compared to the results with a smaller workload size, the throughput of KucoFS drops by 28.5%. This is mainly because a file system needs more time to find a proper slot for insertion in the parent directory when the number of files increases. Even so, KucoFS still outperforms other file systems by an order of magnitude.

Conflict handling. KucoFS requires Ulib to fall back and retry when a conflict occurs, which may impact overall performance. In this regard, we also test how KucoFS behaves when handling operations that conflict with each other. Specifically, we use multiple threads to create the same file concurrently if it does not exist, or delete it instead when it has already been created. We collect the throughput of the successful creations and deletions, and the results are shown in Figure 8b. As a comparison, the results of NOVA are also shown in the figure. We can observe that KucoFS achieves 2.4× higher throughput than NOVA. In NOVA, a thread needs to acquire the lock before creating or deleting files. Worse, if this creation or deletion fails, other concurrent threads will be blocked unnecessarily since the lock does not protect a valid operation. Instead, in KucoFS, threads can send creation or deletion requests to Kfs without being blocked; Kfs is responsible for determining whether this operation can be processed successfully. Furthermore, since Ulib has already provided the related addresses in the request, Kfs can use these addresses to validate metadata items directly, which introduces insignificant overhead.

Figure 9: Read and write throughput with FxMark. Panels: (a) read, low; (b) read, medium; (c) overwrite, low; (d) append, low; (e) overwrite, medium. Low: threads read (write) data from (to) separate files; medium: in the same file but different data blocks; default I/O size: 4 KB; gray area: Optane DCPMMs do not scale on NUMA platforms. (Plots omitted; axes: throughput in Mops/s vs. number of threads; systems: XFS-DAX, Ext4-DAX, PMFS, NOVA, SplitFS, KucoFS.)

Breakdown. We also measure the benefit of collaborative indexing by comparing with a variant of KucoFS that disables this optimization (i.e., moving the metadata indexing tasks back to Kfs, denoted as "w/o CI"). Figure 8a shows the results by measuring the throughput of creat with a varying number of clients. We make the following observations. First, in the single-thread evaluation, collaborative indexing does not contribute to improving performance, since moving the metadata indexing task between Kfs and Ulib does not reduce the overall latency of each operation. Second, when the number of client threads increases, we find that collaborative indexing improves throughput by up to 55%. Since Kuco only allows Kfs to update metadata on behalf of multiple Ulib instances, the theoretical throughput limit is $T_{\max} = 1/L$ (in ops/s, where $L$ is the latency for Kfs to process one request). Therefore, the offloading mechanism improves performance by shortening the execution time of each request (i.e., $L$).

5.2.2 Effects of Versioned Read

Figures 9a and 9b show the file read performance of each file system with a varying number of threads under different sharing levels (i.e., low/medium). We make the following observations. First, KucoFS exhibits the highest throughput among the compared file systems, which peaks at 9.4 Mops/s (the hardware bandwidth has been fully utilized). The performance improvement stems primarily from the design of versioned read, which empowers userspace direct access without the involvement of Kfs. The kernel file systems (e.g., XFS, Ext4, NOVA, and PMFS) have to perform context switches and walk through the VFS layer, which impacts read performance. SplitFS only achieves performance comparable to that of NOVA despite its direct-access feature. We find that SplitFS needs to map more PM space to userspace whenever it reads a page that has not been mapped yet, which causes extra overhead. The performance improvement of KucoFS is more obvious at the medium sharing level because all the compared systems need to lock the file before actually reading file data. The locking overhead impacts their performance significantly, even though they use shared locks [23]. Second, the read performance of all evaluated file systems drops dramatically when the number of threads keeps increasing (gray area). To get stable results, we first bind threads to NUMA node 0 (local access), and the cores at NUMA node 1 are used only if the total number of threads is greater than 18. Both we and past work [47] observe that cross-socket access to Optane impacts performance greatly. To confirm that our software design is scalable, we deploy NOVA and KucoFS in DRAM, and both of them show scalable read throughput again. Therefore, many recent papers [13, 19
] only use the cores from the local NUMA node in their evaluation. With our emulated persistent memory, Aerie shows almost the same performance as that of KucoFS at the low sharing level, but its throughput falls far behind the others at the medium sharing level because Aerie needs to interact with the TFS frequently.

We further demonstrate the efficacy of versioned read by concurrently reading and writing data from/to the same file. In our evaluation, one read thread is selected to sequentially read a file with an I/O size of 16 KB, and an increasing number of threads are launched to overwrite the same file concurrently (4 KB writes to a random offset). We let the read thread issue read operations 1 million times and measure its execution time with a varying number of writers. For comparison, we also implement KucoFS-r/w lock, which reads file data by acquiring read-write locks in the range-lock ring buffer, and KucoFS-w/o lock, which reads file data directly without a correctness guarantee. We make the following observations from Figure 8c. First, the proposed versioned read achieves almost the same performance as that of KucoFS-w/o lock. This shows that the overhead of version checking is extremely low. We also observe that KucoFS-r/w lock needs much more time to finish reading (7% to 3.2× more time than KucoFS for different I/O sizes). This is because it needs to use atomic operations to acquire the range lock, which severely impacts read performance when conflicts become frequent.
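To make the difference between the lock-based and version-checked readers compared above more concrete, the following is a minimal single-process sketch of optimistic, version-checked reading. It is a generic seqlock-style check, not the three-condition protocol of Section 3.5: the `VersionedBlock` class, its even/odd `version` convention, and the `read_consistent` helper are hypothetical names introduced here purely for illustration. What it shows is that the reader never takes a lock; it only compares a version before and after copying the data and retries if a writer intervened.

```python
import threading


class VersionedBlock:
    """Hypothetical data block whose version is odd while a write is in progress."""

    def __init__(self, size: int) -> None:
        self.version = 0                      # even: stable, odd: being written
        self.data = bytearray(size)
        self._writer_lock = threading.Lock()  # serializes writers only, never readers

    def write(self, payload: bytes) -> None:
        with self._writer_lock:
            self.version += 1                 # mark the block as unstable
            self.data[:len(payload)] = payload
            self.version += 1                 # publish the new, stable version


def read_consistent(block: VersionedBlock, max_retries: int = 1000) -> bytes:
    """Lock-free read: copy the data, then confirm the version did not move."""
    for _ in range(max_retries):
        before = block.version
        if before % 2 == 1:                   # a writer is mid-update, try again
            continue
        snapshot = bytes(block.data)
        if block.version == before:           # no writer interfered with the copy
            return snapshot
    raise RuntimeError("too many conflicting writers")
```

In this sketch a reader that overlaps with a writer simply retries, so its cost in the common conflict-free case is two version loads around the copy. That is the intuition behind a version-checked reader staying close to an unsynchronized one, whereas a reader that must acquire a lock pays for an atomic update on every read.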
Second, the execution time of NOVA is orders of magnitude higher than that of KucoFS. NOVA directly uses a lock to synchronize the reader and concurrent writers. As a result, the reader is always blocked by writers.

5.2.3 Effects of Three-Phase Writes

We evaluate both overwrite and append operations to analyze the write protocol (see Figures 9c-d). For overwrite operations at the low sharing level, some of the file systems exhibit a performance curve that increases first and then decreases. In the rising part, KucoFS shows the highest throughput among the compared systems because it is able to write data in userspace directly. XFS and NOVA also show good scalability. Among them, NOVA partitions free space to avoid the locking overhead when allocating new data pages, while XFS directly writes data in place without allocating new pages. Both PMFS and Ext4 fail to scale since they rely on a centralized transaction manager to write data, introducing extra locking overhead. In the decreasing part, their throughput is mainly affected by two factors: the cross-NUMA overhead, which has been explained before, and the poor scalability of Optane's write performance [19]. SplitFS fails to run properly under this setting.

For append operations, XFS-DAX, Ext4-DAX, and PMFS exhibit poor scalability as the number of threads increases. This is because they use a global lock to manage the free data pages and the metadata journal, so lock contention contributes the major overhead. Both NOVA and KucoFS show better scalability, and KucoFS outperforms NOVA by 10% to 2× with an increasing number of threads. The throughput of SplitFS lies between NOVA and Ext4-DAX. This is because SplitFS first appends data in a staging file, and then re-links it to the original file by trapping into the kernel. On our emulated persistent memory, Aerie shows the worst performance because the trusted service is the bottleneck: clients need to frequently interact with the TFS to acquire the lock and allocate space.

Two-level locking. To analyze the effects of the lock design, we also evaluate overwrite operations at the medium sharing level, where threads write data to the same file at different offsets. As shown in Figure 9e, the throughput of KucoFS is one order of magnitude higher than that of the other four file systems when the number of threads is small (SplitFS fails to run properly in this setting). The range-lock design in KucoFS enables parallel updates to different data blocks in the same file. The performance of KucoFS drops again when the number of threads grows to more than 8, which is mainly restricted by the ring buffer size in the range lock (we reserve 8 lock slots in the ring buffer). We also find that ZoFS shows up to 3× higher throughput than that of NOVA (Fig. 7f in their paper), but it still underperforms KucoFS.

Memory-mapped I/O. Memory-mapped I/O is the most efficient way to access the file system. KucoFS constructs all page tables in advance when processing an mmap request. For a fair comparison, we add the MAP_POPULATE flag when using mmap to access kernel file systems, which builds the page table during the syscall. The experimental results are as expected (not shown in the figure): when we concurrently issue 4 KB read/write requests, all the evaluated file systems saturate the hardware bandwidth.

Table 2: Filebench throughput with 1 and 16 threads (ops/s).

Workload     Fileserver     Webserver     Webproxy      Varmail
R/W size     16 KB/16 KB    1 MB/8 KB     1 MB/16 KB    1 MB/16 KB
R/W ratio    1:2            10:1          5:1           1:1
Total number of files in each workload is 100 K.

             Fileserver      Webserver       Webproxy        Varmail
Threads      1      16       1      16       1      16       1      16
XFS-DAX      39K    127K     121K   1.35M    192K   863K     99K    319K
Ext4-DAX     52K    362K     123K   1.33M    316K   2.50M    57K    135K
PMFS         72K    317K     110K   1.25M    218K   1.54M    169K   1.06M
NOVA         71K    537K     133K   1.43M    337K   3.02M    220K   2.04M
Strata       75K    -        105K   -        420K   -        283K   -
KucoFS       99K    683K     141K   1.48M    463K   3.22M    320K   2.55M
Improvement  32%    27%      6%     3%       10%    7%       13%    24%
The last row indicates the performance improvement over the second-best system.

5.3 Filebench: Macro-Benchmarks

We then use Filebench [1] as a macro-benchmark to evaluate the performance of KucoFS. Table 2 shows both the workload settings (similar to those in the NOVA paper) and the experimental results with 1 and 16 threads (adding more threads does not contribute to higher throughput with Filebench [13]). We observe that, first, KucoFS shows the highest performance across all the evaluated workloads. In the single-threaded evaluation with the Fileserver workload, its throughput is 2.5×, 1.9×, 1.39×, and 1.32× that of XFS-DAX, Ext4-DAX, NOVA, and Strata, respectively (and, from Table 2, roughly 1.4× that of PMFS), and is 3.2×, 5.6×, 1.9×, 1.45×, and 1.13× that of XFS-DAX, Ext4-DAX, PMFS, NOVA, and Strata with the Varmail workload. For read-dominated workloads (e.g., Webserver and Webproxy), KucoFS also shows slightly higher throughput. The performance improvement mainly comes from the direct-access feature of KucoFS. Strata also benefits from direct access and performs second-best in most workloads. We also observe that the design of KucoFS is a good fit for the Varmail workload. This is expected: Varmail frequently creates and deletes files, so it generates more metadata operations and issues system calls more frequently. As described before, KucoFS eliminates the OS-part overhead and is better at handling metadata operations. Besides, Strata shows much higher throughput than NOVA since the file I/Os in Varmail are small-sized. Strata only needs to append these small-sized updates to the operation log, reducing the write amplification dramatically.

Second, KucoFS is better at handling concurrent workloads. With 16 client threads under the Fileserver workload, KucoFS outperforms XFS-DAX by 4.4×, PMFS by 1.2×, and NOVA by 27%. The performance improvement is more obvious for the Varmail workload: it achieves 10× higher performance than XFS-DAX and Ext4-DAX on average. Two reasons contribute to its good performance: first, KucoFS incorporates techniques like collaborative indexing to enable Kfs to provide scalable metadata access performance; second, KucoFS avoids using a global lock by letting each client manage private free data pages. NOVA also exhibits good scalability since it uses a per-inode log structure and partitions the free space to avoid global locks.

Figure 10: Redis performance with different file systems. (Plot omitted; axes: throughput in Kops/s vs. SET object size, with ticks at 128, 1 KB, 4 KB, and 8 KB; systems: XFS-DAX, Ext4-DAX, PMFS, NOVA, Strata, SplitFS, KucoFS.)

5.4 Redis: Real-World Application

Redis exports a set of APIs allowing applications to process and query structured data, and uses the file system for persistent data storage. Redis has two approaches to persistently record its data: one is to log operations to an append-only file (AOF), and the other is to use an asynchronous snapshot mechanism. We only evaluate Redis with AOF mode in this paper. Figure 10 shows the throughput of SET operations using 12-byte keys with various value sizes. For small values, the throughput of Redis is 53% higher on average on KucoFS, compared to PMFS, NOVA, and Strata, and 76% higher compared to XFS-DAX and Ext4-DAX. This is consistent with the results in Section 5.2. With larger object sizes, KucoFS achieves slightly higher throughput than other file systems since most of the time is spent on writing data. Note that Redis is a single-threaded application, so it is reasonable for KucoFS to achieve a throughput of 100 Kops/s with 8 KB objects (around 800 MB/s). SplitFS handles data operations well since it processes them in userspace. However, it still underperforms KucoFS, because Redis issues fsync to flush the AOF file each time it appends new data. Hence, SplitFS needs to trap into the kernel to update metadata, which again incurs VFS and syscall overhead.

6 Related Work

Kernel-userspace collaboration. The idea of moving I/O operations from the kernel to userspace has been well studied. Belay et al. [4] abstract the Dune process by leveraging the virtualization hardware in modern processors. It enables direct access to privileged CPU instructions in userspace and executes syscalls with reduced overhead. Based on Dune, IX [5] goes a step further to improve the performance of data-center applications by separating the management and scheduling functions of the kernel (control plane) from network processing (data plane). Arrakis [31] is a new network server operating system, where applications have direct access to I/O devices and the kernel only enforces coarse-grained protection. FLEX [42] avoids kernel overhead by replacing conventional file operations with similar DAX-based operations, which shares some similarities with SplitFS. While these systems share the same idea of splitting tasks between the kernel and userspace, KucoFS is different in that it exhibits a fine-grained split of responsibilities while enforcing close collaboration.

Persistent memory storage systems. In addition to the persistent memory file systems mentioned before, we summarize more PM systems here. First, general PM optimizations. Yang et al. [46] explore the performance properties and characteristics of Optane DCPMM at the micro and macro levels, and provide a number of guidelines to maximize performance. Libnvmmio [10] extends userspace memory-mapped I/O with failure atomicity. Many recent papers also designed various data structures that work correctly and efficiently on persistent memory [7, 18, 26, 30, 48, 52]. Second, PM-aware file systems. BPFS [11] adopts short-circuit shadow paging to guarantee metadata and data consistency. SCMFS [41] simplifies file management by mapping files to contiguous virtual address regions with the virtual memory management (VMM) in the existing OS. NOVA-Fortis [44] goes a step further to be fault-tolerant by providing a snapshot mechanism. Ziggurat [49] is a tiered file system which estimates the temperature of file data and migrates cold data from PM to disks. DevFS [20] pushes the file system implementation into a storage device that has compute capability and device-level RAM. Third, distributed PM systems. Hotpot [36] manages the PM devices of different nodes in the cluster with a distributed shared persistent memory architecture. Octopus [24, 37] leverages PM and RDMA to build an efficient distributed file system by reducing the software overhead. Similarly, Orion [45] is also a distributed persistent memory file system, but is built in the kernel. FlatStore [8] is a log-structured key-value storage engine based on an RDMA network; it minimizes the flush overhead by batching small-sized requests.

7 Conclusion

In this paper, we introduce a kernel and user-level collaborative architecture named Kuco, which exhibits a fine-grained task division between userspace and the kernel. Based on Kuco, we further design and implement a PM file system named KucoFS, and experiments show that KucoFS provides both efficient and highly scalable performance.

Acknowledgements

We sincerely thank our shepherd Donald E. Porter and the anonymous reviewers for their insightful feedback. We also thank Qing Wang and Ramnatthan Alagappan for their excellent suggestions. This material is supported by the National Key Research & Development Program of China (Grant No. 2018YFB1003301), the National Natural Science Foundation of China (Grant No. 62022051, 61832011, 61772300, 61877035), and Huawei (Grant No. YBN2019125112).
References

[1] Filebench file system benchmark. http://www.nfsv4bat.org/Documents/nasconf/2004/filebench.pdf, 2004.
[2] Support ext4 on NV-DIMMs. https://lwn.net/Articles/588218, 2014.
[3] I. G. Baek, M. S. Lee, S. Seo, M. J. Lee, D. H. Seo, D.-S. Suh, J. C. Park, S. O. Park, H. S. Kim, I. K. Yoo, et al. Highly scalable nonvolatile resistive memory using simple binary oxide driven by asymmetric unipolar voltage pulses. In Electron Devices Meeting, 2004. IEDM Technical Digest. IEEE International, pages 587–590. IEEE, 2004.
[4] Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. Dune: Safe user-level access to privileged CPU features. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 335–348, Berkeley, CA, USA, 2012. USENIX Association.
[5] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 49–65, Berkeley, CA, USA, 2014. USENIX Association.
[6] Remy Card, Theodore Ts'o, and Stephen Tweedie. Design and implementation of the second extended filesystem. In Proceedings of the 1st Dutch International Symposium on Linux, pages 1–6, 1994.
[7] Shimin Chen and Qin Jin. Persistent B+-trees in non-volatile main memory. Proc. VLDB Endow., 8(7):786–797, February 2015.
[8] Youmin Chen, Youyou Lu, Fan Yang, Qing Wang, Yang Wang, and Jiwu Shu. FlatStore: An efficient log-structured key-value storage engine for persistent memory. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pages 1077–1091, New York, NY, USA, 2020. Association for Computing Machinery.
[9] Youmin Chen, Jiwu Shu, Jiaxin Ou, and Youyou Lu. HiNFS: A persistent memory file system with both buffering and direct-access. ACM Trans. Storage, 14(1):4:1–4:30, April 2018.
[10] Jungsik Choi, Jaewan Hong, Youngjin Kwon, and Hwansoo Han. Libnvmmio: Reconstructing software IO path with failure-atomic memory-mapped interface. In USENIX Annual Technical Conference (USENIX ATC 20), pages 1–16. USENIX Association, July 2020.
[11] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O through byte-addressable, persistent memory. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 133–146, New York, NY, USA, 2009. ACM.
[12] Jeffrey Dean and Luiz André Barroso. The tail at scale. Commun. ACM, 56(2):74–80, February 2013.
[13] Mingkai Dong, Heng Bu, Jiefei Yi, Benchao Dong, and Haibo Chen. Performance and protection in the ZoFS user-space NVM file system. In The 27th ACM Symposium on Operating Systems Principles, SOSP '19, 2019.
[14] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System software for persistent memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 15:1–15:15, New York, NY, USA, 2014. ACM.
[15] Keir Fraser. Practical lock-freedom. Technical report, University of Cambridge, Computer Laboratory, 2004.
[16] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '10, pages 355–364, New York, NY, USA, 2010. Association for Computing Machinery.
[17] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, Robert N. Sidebotham, and M. West. Scale and performance in a distributed file system. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, SOSP '87, pages 1–2, New York, NY, USA, 1987. Association for Computing Machinery.
[18] Deukyeon Hwang, Wook-Hee Kim, Youjip Won, and Beomseok Nam. Endurable transient inconsistency in byte-addressable persistent B+-tree. In Proceedings of the 16th USENIX Conference on File and Storage Technologies, FAST '18, page 187, 2018.
[19] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R. Dulloor, et al. Basic performance measurements of the Intel Optane DC persistent memory module. arXiv preprint arXiv:1903.05714, 2019.
[20] Sudarsun Kannan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Yuangang Wang, Jun Xu, and Gopinath Palani. Designing a true direct-access file system with DevFS. In 16th USENIX Conference on File and Storage Technologies, page 241, 2018.
[21] Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, and Thomas Anderson. Strata: A cross media file system. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 460–477, New York, NY, USA, 2017. ACM.
[22] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting phase change memory as a scalable DRAM alternative. In Proceedings of the 36th Annual International Symposium on Computer Architecture, pages 2–13, New York, NY, USA, 2009. ACM.
[23] Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Datacenter sockets can be fast and compatible. In Proceedings of the ACM Special Interest Group on Data Communication, SIGCOMM '19, pages 90–103, New York, NY, USA, 2019. ACM.
[24] Youyou Lu, Jiwu Shu, Youmin Chen, and Tao Li. Octopus: An RDMA-enabled distributed persistent memory file system. In Proceedings of the 2017 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC '17, pages 773–785, USA, 2017. USENIX Association.
[25] Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim. Understanding manycore scalability of file systems. In Proceedings of the 2016 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC '16, pages 71–85, Berkeley, CA, USA, 2016. USENIX Association.
[26] Moohyeon Nam, Hokeun Cha, Youngri Choi, Sam H. Noh, and Beomseok Nam. Write-optimized dynamic hashing for persistent memory. In 17th USENIX Conference on File and Storage Technologies (FAST 19), pages 31–44, Boston, MA, February 2019. USENIX Association.
[27] Intel Newsroom. Intel Optane™ DC persistent memory. www/us/en/products/memory-storage/optane-dc-persistent-memory.html, April.
[28] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. SplitFS: A file system that minimizes software overhead in file systems for persistent memory. In The 27th ACM Symposium on Operating Systems Principles, SOSP '19, 2019.
[29] Jiaxin Ou, Jiwu Shu, and Youyou Lu. A high performance file system for non-volatile main memory. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 12:1–12:16, New York, NY, USA, 2016. ACM.
[30] Ismail Oukid, Johan Lasperas, Anisoara Nica, Thomas Willhalm, and Wolfgang Lehner. FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for storage class memory. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 371–386, New York, NY, USA, 2016. ACM.
[31] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 1–16, Berkeley, CA, USA, 2014. USENIX Association.
[32] William Pugh. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM, 33(6):668–676, June 1990.
[33] Henry Qin, Qian Li, Jacqueline Speiser, Peter Kraft, and John Ousterhout. Arachne: Core-aware thread management. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI '18, pages 145–160, USA, 2018. USENIX Association.
[34] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable high performance main memory system using phase-change memory technology. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA), pages 24–33, New York, NY, USA, 2009. ACM.
[35] Sepideh Roghanchi, Jakob Eriksson, and Nilanjana Basu. Ffwd: Delegation is (much) faster than you think. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 342–358, New York, NY, USA, 2017. ACM.
[36] Yizhou Shan, Shin-Yeh Tsai, and Yiying Zhang. Distributed shared persistent memory. In Proceedings of the 2017 Symposium on Cloud Computing, SoCC '17, pages 323–337, New York, NY, USA, 2017. Association for Computing Machinery.
[37] Jiwu Shu, Youmin Chen, Qing Wang, Bohong Zhu, Junru Li, and Youyou Lu. TH-DPMS: Design and implementation of an RDMA-enabled distributed persistent memory storage system. ACM Trans. Storage, 16(4), October 2020.
[38] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalability in the XFS file system. In USENIX Annual Technical Conference, volume 15, 1996.
[39] Haris Volos, Sanketh Nalli, Sankarlingam Panneerselvam, Venkatanathan Varadarajan, Prashant Saxena, and Michael M. Swift. Aerie: Flexible file-system interfaces to storage-class memory. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 14:1–14:14, New York, NY, USA, 2014. ACM.
[40] Ying Wang, Dejun Jiang, and Jin Xiong. Caching or not: Rethinking virtual file system for non-volatile main memory. In 10th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 18). USENIX Association, 2018.
[41] Xiaojian Wu and A. L. Narasimha Reddy. SCMFS: A file system for storage class memory. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 39:1–39:11, New York, NY, USA, 2011. ACM.
[42] Jian Xu, Juno Kim, Amirsaman Memaripour, and Steven Swanson. Finding and fixing performance pathologies in persistent memory software stacks. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '19, pages 427–439, New York, NY, USA, 2019. Association for Computing Machinery.
[43] Jian Xu and Steven Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In Proceedings of the 14th USENIX Conference on File and Storage Technologies, FAST '16, pages 323–338, Berkeley, CA, USA, 2016. USENIX Association.
[44] Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Steven Swanson, and Andy Rudoff. NOVA-Fortis: A fault-tolerant non-volatile main memory file system. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 478–496, New York, NY, USA, 2017. ACM.
[45] Jian Yang, Joseph Izraelevitz, and Steven Swanson. Orion: A distributed file system for non-volatile main memories and RDMA-capable networks. In Proceedings of the 17th USENIX Conference on File and Storage Technologies, FAST '19, pages 221–234, USA, 2019. USENIX Association.
[46] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steve Swanson. An empirical guide to the behavior and use of scalable persistent memory. In 18th USENIX Conference on File and Storage Technologies (FAST 20), pages 169–182, Santa Clara, CA, February 2020. USENIX Association.
[47] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steven Swanson. An empirical guide to the behavior and use of scalable persistent memory. arXiv preprint arXiv:1908.03583, 2019.
[48] Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. NV-Tree: Reducing consistency cost for NVM-based single level systems. In Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST '15, pages 167–181, Berkeley, CA, USA, 2015. USENIX Association.
[49] Shengan Zheng, Morteza Hoseinzadeh, and Steven Swanson. Ziggurat: A tiered file system for non-volatile main memories and disks. In 17th USENIX Conference on File and Storage Technologies (FAST 19), pages 207–219, 2019.
[50] Deng Zhou, Wen Pan, Tao Xie, and Wei Wang. A file system bypassing volatile main memory: Towards a single-level persistent store. In Proceedings of the 15th ACM International Conference on Computing Frontiers, CF '18, pages 97–104, New York, NY, USA, 2018. ACM.
[51] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A durable and en
ergyecientmainmemoryusingphasechangememorytechnology.InProceedingsofthe36thannualInternationalSymposiumonComputerArchitecture(ISCA),pages14–23,NewYork,NY,USA,2009.ACM.CM.PengfeiZuo,YuHua,andJieWu.Write-optimizedandhigh-performancehashingindexschemeforpersistentmemory.InProceedingsofthe13thUSENIXConfer-enceonOperatingSystemsDesignandImplementationOSDI'18,page461–476,USA,2018.USENIXAssoci- USENIX Association 19th USENIX Conference on File and Storage Technologies 81 82 19th USENIX Conference on File and Storage Technologies USENIX Association USENIX Association 19th USENIX Conference on File and Storage Technologies 83 84 19th USENIX Conference on File and Storage Technologies USENIX Association USENIX Association 19th USENIX Conference on File and Storage Technologies 85 86 19th USENIX Conference on File and Storage Technologies USENIX Association USENIX Association 19th USENIX Conference on File and Storage Technologies 87 88 19th USENIX Conference on File and Storage Technologies USENIX Association USENIX Association 19th USENIX Conference on File and Storage Technologies 89 90 19th USENIX Conference on File and Storage Technologies USENIX Association USENIX Association 19th USENIX Conference on File and Storage Technologies 91 92 19th USENIX Conference on File and Storage Technologies USENIX Association USENIX Association 19th USENIX Conference on File and Storage Technologies 93 94 19th USENIX Conference on File and Storage Technologies USENIX Association USENIX Association 19th USENIX Conference on File and Storage Technologies 95 Scalable Persistent Memory File Systemwith Kernel-Userspace CollaborationYoumin Chen, Youyou Lu, and Bohong Zhu, Tsinghua University;Andrea C. Arpaci-Dusseau and Remzi H. 
Arpaci-Dusseau,University of Wisconsin–Madison; Jiwu Shu, Tsinghua Universityhttps://www.usenix.org/conference/fast21/presentation/chen-youmin This paper is included in the Proceedings of the 19th USENIX Conference on File and Storage Technologies.February 23–25, 2021978-1-939133-20-5Open access to the Proceedings of the 19th USENIX Conference on File and Storage Technologiesis sponsored b
