The Scalable Commutativity Rule Designing Scalable Sof

Uploaded On 2015-05-18

Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler. MIT CSAIL and Harvard University.

Abstract: What fundamental opportunities for scalability are latent in interfaces, such as system call APIs? Can scalability opportunities b


The commutativity rule makes intuitive sense: when operations commute, their results (return value and effect on system state) are independent of order. Hence, communication between commutative operations is unnecessary, and eliminating it yields conflict-free implementations.

This intuitive version of the rule is useful in practice, but not precise enough to reason about formally. §3 formally defines the commutativity rule and proves the correctness of the formalized rule.

An important consequence of this presentation is a novel form of commutativity we call SIM commutativity. The usual definition of commutativity (e.g., for algebraic operations) is so stringent that it rarely applies to the complex, stateful interfaces common in systems software. SIM commutativity, in contrast, is state-dependent and interface-based, as well as monotonic. When operations commute in the context of a specific system state, specific operation arguments, and specific concurrent operations, we show that an implementation exists that is conflict-free for that state and those arguments and concurrent operations. This exposes many more opportunities to apply the rule to real interfaces (and thus discover scalable implementations) than a more conventional notion of commutativity would. Despite its logical state dependence, SIM commutativity is interface-based: rather than requiring all operation orders to produce identical internal states, it requires the resulting states to be indistinguishable via the interface. SIM commutativity is thus independent of any specific implementation, enabling developers to apply the rule directly to interface design.

The commutativity rule leads to a new way to design scalable software: first, analyze the interface's commutativity, and then design an implementation that scales in commutative situations. For example, consider file creation in a POSIX-like file system. Imagine that multiple processes create files in the same directory at the same time. Can the creation system calls be made to scale? Our first answer was “obviously not”: the system calls modify the same directory, so surely the implementation must serialize access to the directory. But it turns out these operations commute if the two files have different names (and no hard or symbolic links are involved) and, therefore, have an implementation that
scales for such names. One such implementation represents each directory as a hash table indexed by file name, with an independent lock per bucket, so that creation of differently named files is conflict-free, barring hash collisions. Before the rule, we tried to determine if these operations could scale by analyzing all of the implementations we could think of. This process was difficult, unguided, and itself did not scale to complex interfaces, which motivated our goal of reasoning about scalability in terms of interfaces.

Complex interfaces can make it difficult to spot and reason about all commutative cases even given the rule. To address this challenge, §5 introduces a tool named COMMUTER that automates this reasoning. COMMUTER takes an interface model in the form of a simplified, symbolic implementation, computes precise conditions under which sets of operations commute, and tests an implementation for conflict-freedom under these conditions. This tool can be integrated into the development process to drive initial design and implementation, to incrementally improve existing implementations, or to help developers understand the commutativity of an interface.

This paper demonstrates the value of the commutativity rule and COMMUTER in two ways. In §4, we explore the commutativity of POSIX and use this understanding both to suggest guidelines for designing interfaces whose operations commute and to propose specific modifications to POSIX that would allow for greater scalability.

In §6, we apply COMMUTER to a simplified model of 18 POSIX file system and virtual memory system calls. From this model, COMMUTER generates 13,664 tests of commutative system call pairs, all of which can be made conflict-free according to the rule. We use these tests to guide the implementation of a new research operating system kernel named sv6. sv6 has a novel virtual memory system (RadixVM [15]) and in-memory file system (named ScaleFS). COMMUTER determines that sv6 is conflict-free for 13,528 of the 13,664 tests, while Linux is conflict-free for 9,389 tests. Some of the commutative cases where Linux doesn't scale are important to applications, such as commutative mmaps and creating different files in a shared directory. §7 confirms that commutative conflict-free system calls translate to better application scalability on an 80-core machine.

2 Related work

The scalable commutativity rule is to the best of our knowledge the first observation to directly connect scalability to interface commutativity. This section relates the rule and its use in sv6 and COMMUTER to prior work.

2.1 Thinking about scalability

Israeli and Rappoport introduce the notion of disjoint-access-parallel memory systems [26]. Roughly, if a shared memory system is disjoint-access-parallel and a set of processes access disjoint memory locations, then those processes scale linearly. Like the commutativity rule, this is a conditional scalability guarantee: if the application uses shared memory in a particular way, then the shared memory implementation will scale. However, where disjoint-access parallelism is specialized to the memory system interface, our work encompasses any software interface. Attiya et al. extend Israeli and Rappoport's definition to additionally require conflict-free operations to scale [1]. Our work builds on the assumption that memory systems behave this way, and we indi-

expand the situations that commute, and that therefore can scale. For example, few OS system calls unconditionally commute in every state and history. (One that does is getpid(), since its result is constant over a process's lifetime.) But many system calls conditionally commute. Consider Unix's open system call. Two calls to open("a", O_CREAT|O_EXCL) often don't commute: one call will create the file and the other will fail because the file already exists. However, two calls to open("a", O_CREAT|O_EXCL) do commute if called from processes with different working directories. And even if the processes have the same working directory, two calls to open("a", O_CREAT|O_EXCL)
will commute if the file already exists (both calls will return the same error). SIM commutativity allows us to distinguish these cases, even though the operations are the same in each. This, in turn, means the commutativity rule can tell us that scalable implementations exist in the commutative cases.

SIM commutativity is also interface-based. It evaluates the consequences of execution order using only the specification. Furthermore, it doesn't say that every reordering has indistinguishable results on a given implementation; it requires instead that every reordering is allowed by the specification to have indistinguishable results. This is important because any given implementation might have unnecessary scalability bottlenecks that show through the interface. The SIM commutativity of an interface can be considered even when no implementation exists. This in turn makes it possible to use the commutativity rule early in software development, during interface design and initial implementation.

3.3 Implementations

To reason about implementation scalability, we need to model implementations in enough detail to tell whether different threads' “memory accesses” are conflict-free. (As discussed in §1, conflict freedom is our proxy for scalability.) We define an implementation as a step function: given a state and an invocation, it produces a new state and a response. Special CONTINUE actions enable concurrent overlapping operations and blocking.

We begin by defining three sets:

• S is the set of implementation states.
• I is the set of valid invocations, including CONTINUE.
• R is the set of valid responses, including CONTINUE.

An implementation m is a function in S × I → S × R. Given an old state and an invocation, the implementation produces a new state and a response (where the response must have the same thread as the invocation). A CONTINUE response indicates that a real response for that thread is not yet ready, and allows the implementation to effectively switch to another thread. CONTINUE invocations give the implementation an opportunity to complete an outstanding request (or further delay its response); however, the response must be for the thread matching the CONTINUE invocation.¹

An implementation generates a history when calls to the implementation (perhaps including CONTINUE invocations) could potentially produce the corresponding history. For example, this sequence shows an implementation m generating a history A B B A (invoke A, invoke B, respond to B, respond to A):

• m(s0, A) = ⟨s1, CONTINUE⟩
• m(s1, B) = ⟨s2, CONTINUE⟩
• m(s2, CONTINUE) = ⟨s3, CONTINUE⟩
• m(s3, CONTINUE) = ⟨s4, B⟩
• m(s4, CONTINUE) = ⟨s5, A⟩

The state is threaded from step to step; invocations appear as arguments and responses as return values. The generated history consists of the invocations and responses, in order, with CONTINUEs removed.

An implementation m is correct for some specification 𝒮 when the responses it generates are always allowed by the specification. Specifically, assume H ∈ 𝒮 is a valid history and r is a response where m can generate H ∥ r. We say that m is correct when for any such H and r, H ∥ r ∈ 𝒮. Note that a correct implementation need not be capable of generating every possible valid response; it's just that every response it does generate is valid.

To reason about conflict freedom, we must peek into implementation states, identify reads and writes, and check for access conflicts. Let each state s ∈ S be a tuple ⟨s.0, ..., s.m⟩, and let s[i←x] indicate component replacement:

s[i←x] = ⟨s.0, ..., s.(i−1), x, s.(i+1), ..., s.m⟩.

Now consider an implementation step m(s, a) = ⟨s′, r⟩. This step writes state component i when s.i ≠ s′.i. It reads state component i when s.i may affect the step's behavior; that is, when for some y,

m(s[i←y], a) ≠ ⟨s′[i←y], r⟩.

Two implementation steps have an access conflict when they are on different threads and one writes a state component that the other either writes or reads. A set of implementation steps is conflict-free when no pair of steps in the set has an access conflict. This notion of access conflicts maps directly onto read and write access conflicts on real shared-memory machines. Since modern MESI-based cache-coherent machines usually provide good scalability on conflict-free access patterns, we can loosely say that a conflict-free set of implementation steps “scales.”

¹ There are restrictions on how implementation arguments are chosen; we assume, for example, that CONTINUE invocations are passed only when a thread has an outstanding request. Since implementations are functions, they must be deterministic. We could model implementations instead as relations, allowing non-determinism, though this would complicate later arguments somewhat.
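
The read, write, and access-conflict definitions of §3.3 can be sketched in a few lines of Python. This is an illustrative model only, not code from the paper: states are tuples of integers, an invocation is a hypothetical (thread id, operation name) pair, and the names `replace`, `writes`, `reads`, `access_conflict`, `shared_counter`, `per_thread_counter`, and `two_increments_conflict` are all assumptions introduced here. The "for some y" in the read definition is approximated by a finite list of candidate values.

```python
def replace(s, i, x):
    """Component replacement: the tuple s with component i set to x."""
    return s[:i] + (x,) + s[i + 1:]


def writes(m, s, a):
    """Components i with s.i != s'.i, where m(s, a) = (s', r)."""
    s_new, _ = m(s, a)
    return {i for i in range(len(s)) if s[i] != s_new[i]}


def reads(m, s, a, candidates):
    """Components whose value may affect the step's behavior.

    Approximation (assumption): only the values in `candidates` are tried
    as the replacement y, instead of every possible y.
    """
    s_new, r = m(s, a)
    ys = list(candidates)
    return {
        i
        for i in range(len(s))
        if any(m(replace(s, i, y), a) != (replace(s_new, i, y), r) for y in ys)
    }


def access_conflict(m, step1, step2, candidates):
    """True if two (state, invocation) steps on different threads conflict."""
    (s1, a1), (s2, a2) = step1, step2
    if a1[0] == a2[0]:  # same thread: never an access conflict
        return False
    w1, w2 = writes(m, s1, a1), writes(m, s2, a2)
    r1, r2 = reads(m, s1, a1, candidates), reads(m, s2, a2, candidates)
    # One step writes a component that the other writes or reads.
    return bool(w1 & (w2 | r2)) or bool(w2 & (w1 | r1))


def shared_counter(s, a):
    """Hypothetical implementation: every thread increments component 0."""
    return replace(s, 0, s[0] + 1), "ok"


def per_thread_counter(s, a):
    """Hypothetical implementation: thread t increments component t."""
    thread, _op = a
    return replace(s, thread, s[thread] + 1), "ok"


def two_increments_conflict(m):
    """Run one increment on thread 0, then one on thread 1, and check them."""
    s0 = (0, 0)
    a0, a1 = (0, "inc"), (1, "inc")
    s1, _ = m(s0, a0)  # the state is threaded from step to step
    return access_conflict(m, (s0, a0), (s1, a1), candidates=range(-2, 3))
```

Under these assumptions, `two_increments_conflict(shared_counter)` reports a conflict because both steps write component 0, while `two_increments_conflict(per_thread_counter)` does not, since each step touches only the invoking thread's component.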
• s.commute[t]: a per-thread flag indicating whether the commutative region has been reached. Initialized to FALSE.
• s.refstate: the reference implementation's state.

Each step of m in the commutative region accesses only state components specific to the invoking thread. This means that any two steps in the commutative region are conflict-free, and the commutativity rule is proved.

The construction uses SIM commutativity when initializing the reference implementation's state via H′. If the observed invocations diverge before the commutative region, then just as in m_ns, H′ will exactly equal the observed invocations. If the observed invocations diverge in or after the commutative region, however, there's not enough information to recover the order of invocations. (The s.h[t] components track which invocations have happened per thread, but not the order of those invocations between threads.) Therefore, H′ might reorder the invocations in Y. SIM commutativity guarantees that replaying H′ will nevertheless produce results indistinguishable from those of the actual invocation order, even if the execution diverges within the commutative region.²

3.6 Discussion

The commutativity rule and proof construction push state and history dependence to an extreme: the proof construction is specialized for a single commutative region. Repeated application of the construction can build an implementation that scales over multiple commutative regions in a history, or for the union of many histories. (This is because, once the constructed machine leaves the specialized region, it passes invocations directly to the reference and has the same conflict-freedom properties as the reference.) Nevertheless, the proof construction is impractical, and real implementations usually achieve scalability using different techniques.

We believe it is easier to create practical scalable implementations for operations that commute in more situations. The arguments and system states for which a set of operations commutes often collapse into fairly well-defined classes (e.g., file creation might commute whenever the containing directories are different). In practice, implementations scale for whole classes of states and arguments, not just for specific histories.
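
As one illustration of an implementation that scales for a whole class of arguments, the directory representation described in the introduction (a hash table indexed by file name, with an independent lock per bucket) might be sketched as below. This is a structural sketch under stated assumptions, not ScaleFS code: the class `Directory`, its methods, the bucket count, and the use of CRC32 as the hash function are hypothetical choices. Python threads do not expose cache-line behavior, so the sketch shows only which bucket and lock each `create` touches.

```python
import threading
import zlib


class Directory:
    """Hypothetical directory: file name -> inode number, one lock per bucket."""

    def __init__(self, nbuckets=64):
        self._buckets = [dict() for _ in range(nbuckets)]
        self._locks = [threading.Lock() for _ in range(nbuckets)]

    def bucket_of(self, name):
        # CRC32 is an arbitrary, deterministic stand-in for a real hash function.
        return zlib.crc32(name.encode("utf-8")) % len(self._buckets)

    def create(self, name, inum):
        """Add name if absent; return False if the name already exists."""
        b = self.bucket_of(name)
        with self._locks[b]:  # only this bucket's lock is taken
            if name in self._buckets[b]:
                return False
            self._buckets[b][name] = inum
            return True

    def lookup(self, name):
        """Return the inode number for name, or None if it does not exist."""
        b = self.bucket_of(name)
        with self._locks[b]:
            return self._buckets[b].get(name)

    def touches_disjoint_state(self, name1, name2):
        """True when two creates would use different buckets and locks."""
        return self.bucket_of(name1) != self.bucket_of(name2)
```

Two creations of differently named files share no bucket or lock unless their names collide in the hash, which matches the "barring hash collisions" caveat; creations of the same name always meet in one bucket, which is also the case where they do not commute.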
² We effectively have assumed that M, the reference implementation, produces the same results for any reordering of the commutative region. This is stricter than SIM commutativity, which places requirements on the specification, not the implementation. We also assumed that M is indifferent to the placement of CONTINUE invocations in the input history. Neither of these restrictions is fundamental, however. If during replay M produces responses that are inconsistent with the desired results, m could throw away M's state, produce a new H′ with different CONTINUE invocations and/or commutative region ordering, and try again. This procedure must eventually succeed.

It is also often the case that a set of operations commutes in more than one class of situation, but no single implementation scales for all classes. Consider, for example, an interface with two calls: put(x) records a sample with value x, and max() returns the maximum sample recorded so far (or 0). Suppose

H = [A = put(1), A, B = put(1), B, C = max(), C = 1].

An implementation could store per-thread maxima reconciled by max and be conflict-free for A A B B in H. Alternatively, it could use a global maximum that put checked before writing. This is conflict-free for B B C C
in H. But no correct implementation can be conflict-free across all of H. In the end, a system designer must decide which situations involving commutative operations are most important, and find practical implementation strategies that scale in those situations. In §6 we show that many operations in POSIX have implementations that scale quite broadly, with few cases of incompatible scalability classes.

The commutativity rule shows that SIM-commutative regions have conflict-free implementations. It does not show the converse, however: commutativity suffices for conflict-free accesses, but it may not be necessary. Some non-commutative interfaces may have scalable implementations, for instance, on machines that offer scalable access to strictly increasing sources of time, or when the core interconnect allows certain communication patterns to scale. Furthermore, some conflict-free access patterns don't scale on real machines; if an application overwhelms the memory bus with memory accesses, scalability will suffer regardless of whether those accesses have conflicts. We hope to investigate these problems in future, but as we show below, the rule is already a good guideline for achieving practical scalability.

4 Designing commutative interfaces

The rule facilitates scalability reasoning at the interface and specification level, and SIM commutativity lets us apply the rule to complex interfaces. This section demonstrates the interface-level reasoning enabled by the rule. Using POSIX as a case study, we explore changes that make operations commute in more situations, enabling more scalable implementations.

Already, many POSIX operations commute with many other operations, a fact we will quantify in the next section; this section focuses on problematic cases to give a sense of the subtler issues of commutative interface design.

Decompose compound operations. Many POSIX APIs combine several operations into one, limiting the combined operation's commutativity. For example, fork both creates a new process and snapshots the current process's entire memory state, file descriptor state, signal

[Figure 3 diagram labels: Python model; ANALYZER (§5.1); TESTGEN (§5.2); MTRACE (§5.3); implementation; shared cache lines; commutativity conditions; test cases]
Figure 3: The components of COMMUTER.

A developer can use these test cases to understand the commutative cases they should consider, to iteratively find and fix scalability issues in their code, or as a regression test suite to ensure scalability bugs do not creep into the implementation over time.

5.1 ANALYZER

ANALYZER automates the process of analyzing the commutativity of an interface, saving developers from the tedious and error-prone process of considering large numbers of interactions between complex operations. ANALYZER takes as input a model of the behavior of an interface, written in a symbolic variant of Python, and outputs commutativity conditions: expressions in terms of arguments and state for exactly when sets of operations commute. A developer can inspect these expressions to understand an interface's commutativity or pass them to TESTGEN (§5.2) to generate concrete examples of when interfaces commute.

Given the Python code for a model, ANALYZER uses symbolic execution to consider all possible behaviors of the interface model and construct complete commutativity conditions. Symbolic execution also enables ANALYZER to reason about the external behavior of an interface, rather than specifics of the model's implementation, and enables models to capture specification non-determinism (like creat's ability to choose any free inode) as under-constrained symbolic values.

ANALYZER considers every set of operations of a certain size (typically we use pairs). For each set of operations o, it constructs an unconstrained symbolic system state s and unconstrained symbolic arguments for each operation in o, and executes all permutations of o, each starting from a copy of s. This execution forks at any branch that can go both ways, building up path conditions that constrain the state and arguments that can lead to each code path. At the end of each code path, ANALYZER checks if its path condition yields an initial state and arguments that make o commute by testing if each operation's return value is equivalent in all permutations and if the system states reached by all permutations are equivalent (or can be equivalent for some choice of non-deterministic values like newly allocated inode numbers). For sets larger than pairs, ANALYZER must also check that the intermediate states are equivalent for every permutation of each subset of o.

This test codifies the definition of SIM commutativity from §3.2, except that (1) it assumes the specification is sequentially consistent, and (2) instead of considering all possible future operations (which would be difficult in symbolic execution), it substitutes state equivalence. It's up to the model's author to define state equivalence as whether two states are externally indistinguishable. This is standard practice for high-level data types (e.g., two sets represented as trees could be equal even if they are balanced differently). For the POSIX model we present in §6, only a few types need special handling beyond what the standard data types provide automatically.

    SymInode    = tstruct(data  = tlist(SymByte),
                          nlink = SymInt)
    SymIMap     = tdict(SymInt, SymInode)
    SymFilename = tuninterpreted('Filename')
    SymDir      = tdict(SymFilename, SymInt)

    def __init__(self, ...):
        self.fname_to_inum = SymDir.any()
        self.inodes = SymIMap.any()

    @symargs(src=SymFilename, dst=SymFilename)
    def rename(self, src, dst):
        if not self.fname_to_inum.contains(src):
            return (-1, errno.ENOENT)
        if src == dst:
            return 0
        if self.fname_to_inum.contains(dst):
            self.inodes[self.fname_to_inum[dst]].nlink -= 1
        self.fname_to_inum[dst] = self.fname_to_inum[src]
        del self.fname_to_inum[src]
        return 0

Figure 4: A simplified version of our rename model.

Figure 4 gives an example of how a developer could model rename. The first five lines declare symbolic types (tuninterpreted declares a type whose values support only equality), and __init__ instantiates the file system state. The implementation of rename itself is straightforward. Indeed, the familiarity of Python and ease of manipulating state were part of why we chose it over abstract specification languages.

Given two rename operations, rename(a, b) and rename(c, d), ANALYZER outputs that they commute if any of the following hold:

• Both source files exist, and the file names are all different (a and c exist, and a, b, c, d all differ).
• One rename's source does not exist, and it is not the other rename's destination (either a exists, c does not, and b ≠ c, or c exists, a does not, and d ≠ a).
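
The conditions above can be spot-checked on concrete states. The sketch below is not ANALYZER: it swaps symbolic execution for direct execution of a plain-Python transcription of the Figure 4 model on small hand-written file systems, and compares return values and final states across both orders. The dictionary layout and the helper names `rename` and `commutes` are assumptions made for illustration, and plain equality stands in for the model's state-equivalence check.

```python
import copy
import errno


def rename(fs, src, dst):
    """Concrete transcription of the Figure 4 rename model.

    fs is a hypothetical concrete state:
      fs["names"]: file name -> inode number
      fs["nlink"]: inode number -> link count
    """
    names, nlink = fs["names"], fs["nlink"]
    if src not in names:
        return (-1, errno.ENOENT)
    if src == dst:
        return 0
    if dst in names:
        nlink[names[dst]] -= 1  # the overwritten destination loses a link
    names[dst] = names[src]
    del names[src]
    return 0


def commutes(fs, op1, op2):
    """Run op1;op2 and op2;op1 from copies of fs and compare the outcomes.

    op1 and op2 are (src, dst) argument pairs for rename.
    """
    fs_a, fs_b = copy.deepcopy(fs), copy.deepcopy(fs)
    r1_a = rename(fs_a, *op1)
    r2_a = rename(fs_a, *op2)
    r2_b = rename(fs_b, *op2)
    r1_b = rename(fs_b, *op1)
    # Each operation must return the same value in both orders, and the
    # resulting states must be equal.
    return (r1_a, r2_a) == (r1_b, r2_b) and fs_a == fs_b


# Both sources exist and all four names differ: commutes.
both_exist = {"names": {"a": 1, "c": 2}, "nlink": {1: 1, 2: 1}}
print(commutes(both_exist, ("a", "b"), ("c", "d")))

# Source c does not exist and is not the other destination: commutes.
only_a = {"names": {"a": 1}, "nlink": {1: 1}}
print(commutes(only_a, ("a", "b"), ("c", "d")))

# Source c does not exist but is the other rename's destination: order matters.
print(commutes(only_a, ("a", "c"), ("c", "d")))
```

A check like this only samples individual states, whereas the symbolic approach described in §5.1 derives the full condition under which a pair commutes.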
plementation is conflict-free for every test. If it finds a violation of the commutativity rule (a test whose commutative operations are not conflict-free), it reports which variables were shared and what code accessed them. For example, when running the test case shown in Figure 5 on a Linux ramfs file system, MTRACE reports that the two functions make conflicting accesses to the dcache reference count and lock, which limits the scalability of those operations.

MTRACE runs the entire operating system in a modified version of qemu [4]. At the beginning of each test case, it issues a hypercall to qemu to start recording memory accesses, and then executes the test operations on different virtual cores. During test execution, MTRACE logs all reads and writes by each core, along with information about the currently executing kernel thread, to filter out irrelevant conflicts by background threads or interrupts. After execution, MTRACE analyzes the log and reports all conflicting memory accesses, along with the C data type of the accessed memory location (resolved from DWARF [20] information and logs of every dynamic allocation's type) and stack traces for each conflicting access.

5.4 Implementation

We built a prototype implementation of COMMUTER's three components. ANALYZER and TESTGEN consist of 3,050 lines of Python code, including the symbolic execution engine, which uses the Z3 SMT solver [19] via Z3's Python bindings. MTRACE consists of 1,594 lines of code changed in qemu, along with 612 lines of code changed in the guest Linux kernel (to report memory type information, context switches, etc.). Another program, consisting of 2,865 lines of C++ code, processes the log file to find and report memory locations that are shared between different cores for each test case.

6 Finding scalability opportunities

To understand whether COMMUTER is useful to kernel developers, we modeled several POSIX file system and virtual memory calls in COMMUTER, then used this both to evaluate Linux's scalability and to develop a scalable file and virtual memory system for our sv6 research kernel. The rest of this section uses this case study to answer the following questions:

• How many test cases does COMMUTER generate, and what do they test?
• How good are current implementations of the POSIX interface? Do the test cases generated by COMMUTER find cases where current implementations don't scale?
• What techniques are necessary to achieve scalability for cases where current file and virtual memory systems do not scale?
• What situations might be too difficult or impractical to make scale, despite being commutative?

6.1 POSIX test cases

To answer the first question, we developed a simplified model of the POSIX file system and virtual memory APIs in COMMUTER. The model covers 18 system calls, and includes inodes, file names, file descriptors and their offsets, hard links, link counts, file lengths, file contents, file times, pipes, memory-mapped files, anonymous memory, processes, and threads. Our model also supports nested directories, but we disable them because Z3 does not currently handle the resulting constraints. We restrict file sizes and offsets to page granularity; some sv6 data structures are conflict-free for offsets on different pages, but offsets within a page conflict.

COMMUTER generates a total of 13,664 test cases from our model. Generating the test cases and running them on both Linux and sv6 takes a total of 8 minutes on the machine described in §7.1. The model implementation and its model-specific test code generator are 596 and 675 lines of Python code, respectively. Figure 4 showed a part of our model, and Figure 5 gave an example test case generated by COMMUTER. We verified that all test cases return the expected results on both Linux and sv6.

6.2 Current implementation scalability

To evaluate the scalability of existing file and virtual memory systems, we used MTRACE to check the above test cases against Linux kernel version 3.8. Linux developers have invested significant effort in making the file system scale [9], and it already scales in many interesting cases, such as concurrent operations in different directories or concurrent operations on different files in the same directory that already exist [17]. We evaluated the ramfs file system because ramfs is effectively a user-space interface to the Linux buffer cache. Since exercising ramfs is equivalent to exercising the buffer cache and the buffer cache underlies all Linux file systems, this represents the best-case scalability for a Linux file system. Linux's virtual memory system, in contrast, involves process-wide locks that are known to limit its scalability and impact real applications [9, 14, 41].

The left half of Figure 6 shows the results. Out of 13,664 test cases, 4,275 cases, widely distributed across the system call pairs, were not conflict-free. This indicates that even a mature and reasonably scalable operating system implementation misses many cases that can be made to scale according to the commutativity rule.

A common source of access conflicts is shared reference counts. For example, most file name lookup operations update the reference count on a struct dentry; the resulting write conflicts cause them to not scale. Similarly, most operations that take a file descriptor update the

[Figure 6 panels: Linux (9,389 of 13,664 cases scale) and sv6 (13,528 of 13,664 cases scale). Both axes list the system calls open, link, unlink, rename, stat, fstat, lseek, close, pipe, read, write, pread, pwrite, mmap, munmap, mprotect, memread, and memwrite. Scale: 100% to 0%.]
Figure 6: Scalability for system call pairs, showing the fraction and number of test cases generated by COMMUTER that are not conflict-free for each system call pair. One example test case was shown in Figure 5.

reference count on a struct file, making commutative operations such as two fstat calls on the same file descriptor not scale. Coarse-grained locks are another source of access conflicts. For instance, Linux locks the parent directory for any operation that creates file names, even though operations that create distinct names generally commute. Similarly, we see that coarse-grained locking in the virtual memory system severely limits the conflict-freedom of address space manipulation operations. This agrees with previous findings [9, 14, 15], which demonstrated these problems in the context of several applications.

6.3 Making test cases scale

Given that Linux does not scale in many cases, how hard is it to implement scalable file systems and virtual memory systems? To answer this question, we designed and implemented a ramfs-like in-memory file system called ScaleFS and a virtual memory system called RadixVM for sv6, our research kernel based on xv6 [18]. RadixVM appeared in previous work [15], so we focus on ScaleFS here. Although it is in principle possible to make the same changes in Linux, we chose not to implement ScaleFS in Linux because ScaleFS's design would have required modifying code throughout the Linux kernel. The designs of both RadixVM and ScaleFS were guided by the commutativity rule. For ScaleFS, we relied heavily on COMMUTER throughout development to guide its design and identify sharing problems in its implementation (RadixVM was built prior to COMMUTER). The right half of Figure 6 shows the result of applying COMMUTER to sv6.

ScaleFS makes extensive use of existing techniques for scalable implementations, such as per-core resource allocation, double-checked locking, lock-free readers using RCU [31], scalable reference counts using Refcache [15], and seqlocks [28: §6]. These techniques lead to several common patterns, as follows; we illustrate the patterns with example test cases from COMMUTER that led us to discover these situations:

Layer scalability. ScaleFS uses data structures that themselves naturally satisfy the commutativity rule, such as linear arrays, radix arrays [15], and hash tables. In contrast with structures like balanced trees, these data structures typically share no cache lines when different elements are accessed or modified. For example, ScaleFS stores the cached data pages for a given inode using a radix array, so that concurrent reads or writes to different file pages scale, even in the presence of operations extending or truncating the file. Many operations also use this radix array to determine if some offset is within the file's bounds without risking conflicts with operations that change the file's size.

Defer work. Many kernel resources are shared, such as files and pages, and must be freed when no longer referenced. Typically, kernels release resources immediately, but this requires eagerly tracking references to resources, causing commutative operations that access

also report single-core numbers for comparison, though these are expected to be higher because one core can use the entire 30 MB cache.

We run all benchmarks with the hardware prefetcher disabled because we found that it often prefetched contended cache lines to cores that did not ultimately access those cache lines, causing significant variability in our benchmark results and hampering our efforts to precisely control sharing. We believe that, as large multicores and highly parallel applications become more prevalent, prefetcher heuristics will likewise evolve to avoid inducing this false sharing.

As a single core performance baseline, we compare against the same benchmarks running on Linux 3.5.7 from Ubuntu Quantal. Direct comparison is difficult because Linux implements many features sv6 does not, but this comparison indicates sv6's performance is sensible.

7.2 Microbenchmarks

We evaluate scalability and performance on real hardware using two microbenchmarks and an application-level benchmark. Each benchmark has two variants, one that uses standard, non-commutative POSIX APIs and another that accomplishes the same task using the modified, more broadly commutative APIs from §4. By benchmarking the standard interfaces against their commutative counterparts, we can isolate the cost of non-commutativity and also examine the scalability of conflict-free implementations of commutative operations.

We run each benchmark three times and report the mean. Variance from the mean is always under 4% and typically under 1%.

statbench. In general, it's difficult to argue that an implementation of a non-commutative interface achieves the best possible scalability for that interface and that no implementation could scale better. However, in limited cases, we can do exactly this. We start with statbench, which measures the scalability of fstat with respect to link. This benchmark creates a single file that n/2 cores repeatedly fstat. The other n/2 cores repeatedly link this file to a new, unique file name, and then unlink the new file name. As discussed in §4, fstat does not commute with link or unlink on the same file because fstat returns the link count. In practice, applications rarely invoke fstat to get the link count, so sv6 introduces fstatx, which allows applications to request specific fields (a similar system call has been proposed for Linux [25]).

We run statbench in two modes: one mode uses fstat, which does not commute with the link and unlink operations performed by the other threads, and the other mode uses fstatx to request all fields except the link count, an operation that does commute with link and unlink. We use a Refcache scalable counter [15] for the link count so that the links and unlinks do not conflict, and place it on its own cache line to avoid false sharing. Figure 7(a) shows the results. With the commutative fstatx, statbench scales perfectly and experiences zero L2 cache misses in fstatx, while fstat severely limits the scalability of statbench.

To better isolate the difference between fstat and fstatx, we run statbench in a third mode that uses fstat, but represents the link count using a simple shared counter instead of Refcache. In this mode, fstat performs better (at the expense of link and unlink), but still does not scale. With a shared link count, each fstat call experiences exactly one L2 cache miss (for the cache line containing the link count), which means this is the most scalable that fstat can possibly be in the presence of concurrent links and unlinks. Yet, despite sharing only a single cache line, this seemingly innocuous non-commutativity limits the implementation's scalability. One small tweak to make the operation commute by omitting st_nlink eliminates the barrier to scaling, demonstrating the cost of non-commutativity.

In the case of fstat, optimizing for scalability sacrifices some sequential performance. Tracking the link count with Refcache (or some scalable counter) is necessary to make link and unlink scale linearly, but requires fstat to reconcile the distributed link count to return st_nlink. The exact overhead depends on the core count (which determines the number of Refcache caches), but with 80 cores, fstat is 3.9× more expensive than on Linux. In contrast, fstatx can avoid this overhead unless link counts are requested; like fstat with a shared count, it performs similarly to Linux's fstat on a single core.

openbench. Figure 7(b) shows the results of openbench, which stresses the file descriptor allocation performed by open. In openbench, n threads concurrently open and close per-thread files. These calls do not commute because each open must allocate the lowest unused file descriptor in the process. For many applications, it suffices to return any unused file descriptor (in which case the open calls commute), so sv6 adds an O_ANYFD flag to open, which it implements using per-core partitions of the FD space. Much like statbench, the standard, non-commutative open interface limits openbench's scalability, while openbench with O_ANYFD scales linearly.

Furthermore, there appears to be no performance penalty to ScaleFS's open, with or without O_ANYFD: at one core, both cases perform identically and outperform Linux's open by 27%. Some of the performance difference is because sv6 doesn't implement things like permissions checking, but much of Linux's overhead comes from locking that ScaleFS avoids.

7.3 Application performance

Finally, we perform a similar experiment using a simple mail server to produce a system call workload more representative of a real application. Our mail server uses a se-
Expensive synchronization in concurrent algorithms cannot be eliminated. In Proceedings of the 38th ACM Symposium on Principles of Programming Languages, Austin, TX, January 2011.

[3] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The Multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP), Big Sky, MT, October 2009.

[4] F. Bellard et al. QEMU. http://www.qemu.org/.

[5] D. J. Bernstein. Some thoughts on security after ten years of qmail 1.0. In Proceedings of the ACM Workshop on Computer Security Architecture, Fairfax, VA, November 2007.

[6] P. A. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys, 13(2):185–221, June 1981.

[7] S. Boyd-Wickizer. Optimizing Communication Bottlenecks in Multiprocessor Operating System Kernels. PhD thesis, Massachusetts Institute of Technology, February 2014.

[8] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, M. F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: An operating system for many cores. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI), San Diego, CA, December 2008.

[9] S. Boyd-Wickizer, A. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. An analysis of Linux scalability to many cores. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI), Vancouver, Canada, October 2010.

[10] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. EXE: Automatically generating inputs of death. In Proceedings of the 13th ACM Conference on Computer and Communications Security, 2006.

[11] C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI), San Diego, CA, December 2008.

[12] B. Cantrill and J. Bonwick. Real-world concurrency. Communications of the ACM, 51(11):34–39, 2008.

[13] K. Claessen and J. Hughes. QuickCheck: A lightweight tool for random testing of Haskell programs. In Proceedings of the 5th ACM SIGPLAN International Conference on Functional Programming, Montreal, Canada, September 2000.

[14] A. T. Clements, M. F. Kaashoek, and N. Zeldovich. Concurrent address spaces using RCU balanced trees. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), London, UK, March 2012.

[15] A. T. Clements, M. F. Kaashoek, and N. Zeldovich. RadixVM: Scalable address spaces for multithreaded applications. In Proceedings of the ACM EuroSys Conference, Prague, Czech Republic, April 2013.

[16] J. Corbet. The search for fast, scalable counters, May 2010. http://lwn.net/Articles/170003/.

[17] J. Corbet. Dcache scalability and RCU-walk, April 23, 2012. http://lwn.net/Articles/419811/.

[18] R. Cox, M. F. Kaashoek, and R. T. Morris. Xv6, a simple Unix-like teaching operating system. http://pdos.csail.mit.edu/6.828/2012/xv6.html.

[19] L. de Moura and N. Bjørner. Z3: An efficient SMT solver. In Proceedings of the 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, Budapest, Hungary, March–April 2008.

[20] DWARF Debugging Information Format Committee. DWARF debugging information format, version 4, June 2010.

[21] F. Ellen, Y. Lev, V. Luchangco, and M. Moir. SNZI: Scalable nonzero indicators. In Proceedings of the 26th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, Portland, OR, August 2007.

[22] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, IL, June 2005.

[23] M. Herlihy and E. Koskinen. Transactional boosting: A methodology for highly-concurrent transactional objects. In Proceedings of the 13th ACM Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, February 2008.