Miller, Harshad Kasture, George Kurian, Charles Gruenwald III, Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal. Massachusetts Institute of Technology, Cambridge, MA.

Abstract: This paper introduces Graphite, an open-source distributed parallel...
Graphite: A Distributed Parallel Simulato...
The simulator maintains the illusion that all of the threads are running in a single process with a single shared address space. This allows the simulator to run off-the-shelf parallel applications on any number of machines without having to recompile the apps for different configurations. Graphite is not intended to be completely cycle-accurate but instead uses a collection of models and techniques to provide accurate estimates of performance and various machine statistics. Instructions and events from the core, network, and memory subsystem functional models are passed to analytical timing models that update individual local clocks in each core. The local clocks are synchronized using message timestamps when cores interact (e.g., through synchronization or messages) [11]. However, to reduce the time wasted on synchronization, Graphite does not strictly enforce the ordering of all events in the system. In certain cases, timestamps are ignored and operation latencies are based on the ordering of events during native execution rather than the precise ordering they would have in the simulated system (see Section III-F). This is similar to the unbounded slack mode in SlackSim [12]; however, Graphite also supports a new scalable mechanism called LaxP2P for managing slack and improving accuracy.

Graphite has been evaluated both in terms of the validity of the simulation results as well as the scalability of simulator performance across multiple cores and machines. The results from these evaluations show that Graphite scales well, has reasonable performance, and provides results consistent with expectations. For the scaling study, we perform a fully cache-coherent simulation of 1024 cores across up to 10 target machines and run applications from the SPLASH-2 benchmark suite. The slowdown versus native execution is as low as 41x when using eight 8-core host machines, indicating that Graphite can be used for realistic application studies.

The remainder of this paper is structured as follows. Section II describes the architecture of Graphite. Section III discusses the implementation of Graphite in more detail. Section IV evaluates the accuracy, performance, and scaling of the simulator. Section V discusses related work, and Section VI summarizes our findings.

II. SYSTEM ARCHITECTURE

Graphite is an application-level simulator for tiled multicore architectures. A simulation consists of executing a multi-threaded application on a target multicore architecture defined by the simulator's models and runtime configuration parameters. The simulation runs on one or more host machines, each of which may be a multicore machine itself. Figure 1 illustrates how a multi-threaded application running on a target architecture with multiple tiles is simulated on a cluster of host machines. Graphite maps each thread in the application to a tile of the target architecture and distributes these threads among multiple host processes which are running on multiple host machines. The host operating system is then responsible for the scheduling and execution of these threads.

Fig. 1: High-level architecture. Graphite consists of one or more host processes distributed across machines and working together over sockets. Each process runs a subset of the simulated tiles, one host thread per simulated tile.

Figure 2a illustrates the types of target architectures Graphite is designed to simulate. The target architecture contains a set of tiles interconnected by an on-chip network. Each tile is composed of a compute core, a network switch, and a part of the memory subsystem (cache hierarchy and DRAM controller) [13]. Tiles may be homogeneous or heterogeneous; however, we only examine homogeneous architectures in this paper. Any network topology can be modeled as long as each tile contains an endpoint.

Graphite has a modular design based on swappable modules. Each of the components of a tile is modeled by a separate module with well-defined interfaces. Modules can be configured through run-time parameters or completely replaced to study alternate architectures. Modules may also be replaced to alter the level of detail in the models and trade off between performance and accuracy.

Figure 2b illustrates the key components of a Graphite simulation. Application threads are executed under a dynamic binary translator (currently Pin [14]) which rewrites instructions to generate events at key points. These events cause traps into Graphite's backend which contains the compute core, memory, and network modeling modules. Points of interest intercepted by the dynamic binary translator (DBT) include: memory references, system calls, synchronization routines, and user-level messages. The DBT is also used to generate a stream of executed instructions used in the compute core models.

Graphite's simulation backend can be broadly divided into two sets of features: functional and modeling. Modeling features model various aspects of the target architecture while functional features ensure correct program behavior.

Fig. 2: System architecture. a) Overview of the target architecture. Tiles contain a compute core, a network switch, and a node of the memory system. b) The anatomy of a Graphite simulation. Tiles are distributed among multiple processes. The app is instrumented to trap into one of three models at key points: a core model, network model, or memory system model. These models interact to model the target system. The physical transport layer abstracts away the host-specific details of inter-tile communication.

A. Modeling Features

As shown in Figure 2b, the Graphite backend is comprised of many modules that model various components of the target architecture. In particular, the core model is responsible for modeling the computational pipeline; the memory model is responsible for the memory subsystem, which is composed of different levels of caches and DRAM; and the network model handles the routing of network packets over the on-chip network and accounts for various delays encountered due to contention and routing overheads. Graphite's models interact with each other to determine the cost of each event in the application. For instance, the memory model uses the round-trip delay times from the network model to compute the latency of memory operations, while the core model relies on latencies from the memory model to determine the time taken to execute load and store operations.

One of the key techniques Graphite uses to achieve good simulator performance is lax synchronization. With lax synchronization, each target tile maintains its own local clock which runs independently of the clocks of other tiles. Synchronization between the local clocks of different tiles happens only on application synchronization events, user-level messages, and thread creation and termination events. Due to this, modeling of certain aspects of system behavior, such as network contention and DRAM queueing delays, becomes complicated. Section III-F discusses how Graphite addresses this challenge.

B. Functional Features

Graphite's ability to execute an unmodified pthreaded application across multiple host machines is central to its scalability and ease of use. In order to achieve this, Graphite has to address a number of functional challenges to ensure that the application runs correctly:

1) Single Address Space: Since threads from the application execute in different processes and hence in different address spaces, allowing application memory references to access the host address space would not be functionally correct. Graphite provides the infrastructure to modify these memory references, present a uniform view of the application address space to all threads, and maintain data coherence between them.

2) Consistent OS Interface: Since application threads execute in different host processes on multiple hosts, Graphite implements a system interface layer that intercepts and handles all application system calls in order to maintain the illusion of a single process.

3) Threading Interface: Graphite implements a threading interface that intercepts thread creation requests from the application and seamlessly distributes these threads across multiple hosts. The threading interface also implements certain thread management and synchronization functions, while others are handled automatically by virtue of the single, coherent address space.

To help address these challenges, Graphite spawns additional threads called the Master Control Program (MCP) and the Local Control Program (LCP). There is one LCP per process but only one MCP for the entire simulation. The MCP and LCP ensure the functional correctness of the simulation by providing services for synchronization, system call execution, and thread management.

All of the actual communication between tiles is handled by the physical transport (PT) layer. For example, the network model calls into this layer to perform the functional task of moving data from one tile to another. The PT layer abstracts away the host-architecture dependent details of intra- and inter-process communication, making it
easier to port Graphite to new hosts.

III. IMPLEMENTATION

This section describes the design and interaction of Graphite's various models and simulation layers. It discusses the challenges of high-performance parallel distributed simulation and how Graphite's design addresses them.

A. Core Performance Model

The core performance model is a purely modeled component of the system that manages the simulated clock local to each tile. It follows a producer-consumer design: it consumes instructions and other dynamic information produced by the rest of the system. The majority of instructions are produced by the dynamic binary translator as the application thread executes them. Other parts of the system also produce pseudo-instructions to update the local clock on unusual events. For example, the network produces a message-receive pseudo-instruction when the application uses the network messaging API (Section III-C), and a spawn pseudo-instruction is produced when a thread is spawned on the core.

Other information beyond instructions is required to perform modeling. Latencies of memory operations, paths of branches, etc. are all dynamic properties of the system not included in the instruction trace. This information is produced by the simulator back-end (e.g., memory operations) or dynamic binary translator (e.g., branch paths) and consumed by the core performance model via a separate interface. This allows the functional and modeling portions of the simulator to execute asynchronously without introducing any errors.

Because the core performance model is isolated from the functional portion of the simulator, there is great flexibility in implementing it to match the target architecture. Currently, Graphite supports an in-order core model with an out-of-order memory system. Store buffers, load units, branch prediction, and instruction costs are all modeled and configurable. This model is one example of many different architectural models that can be implemented in Graphite. It is also possible to implement core models that differ drastically from the operation of the functional models, i.e., although the simulator is functionally in-order with sequentially consistent memory, the core performance model can be an out-of-order core with a relaxed memory model. Models throughout the remainder of the system will reflect the new core type, as they are ultimately based on clocks updated by the core model. For example, memory and network utilization will reflect an out-of-order architecture because message timestamps are generated from core clocks.

B. Memory System

The memory system of Graphite has both a functional and a modeling role. The functional role is to provide an abstraction of a shared address space to the application threads, which execute in different address spaces. The modeling role is to simulate the cache hierarchies and memory controllers of the target architecture. The functional and modeling parts of the memory system are tightly coupled, i.e., the messages sent over the network to load/store data and ensure functional correctness are also used for performance modeling.

The memory system of Graphite is built using generic modules such as caches, directories, and simple cache coherence protocols. Currently, Graphite supports a sequentially consistent memory system with full-map and limited directory-based cache coherence protocols, private L1 and L2 caches, and memory controllers on every tile of the target architecture. However, due to its modular design, a different implementation of the memory system could easily be developed and swapped in instead. The application's address space is divided up among the target tiles which possess memory controllers. Performance modeling is done by appending simulated timestamps to the messages sent between different memory system modules (see Section III-F). The average memory access latency of any request is computed using these timestamps.

The functional role of the memory system is to service all memory operations made by application threads. The dynamic binary translator in Graphite rewrites all memory references in the application so they get redirected to the memory system.

An alternate design option for the memory system is to completely decouple its functional and modeling parts. This
was not done for performance reasons. Since Graphite is a distributed system, both the functional and modeling parts of the memory system have to be distributed. Hence, decoupling them would double the number of messages in the system (one set for ensuring functional correctness and another for modeling). An additional advantage of tightly coupling the functional and the modeling parts is that it automatically helps verify the correctness of complex cache hierarchies and coherence protocols of the target architecture, as their correct operation is essential for the completion of simulation.

C. Network

The network component provides high-level messaging services between tiles built on top of the lower-level transport layer, which uses shared memory and TCP/IP to communicate between target tiles. It provides a message-passing API directly to the application, as well as serving other components of the simulator backend, such as the memory system and system call handler.

The network component maintains several distinct network models. The network model used by a particular message is determined by the message type. For instance, system messages unrelated to application behavior use a separate network model from application messages, and therefore have no impact on simulation results. The default simulator configuration also uses separate models for application and memory traffic, as is commonly done in multicore chips [13], [15]. Each network model is configured independently, allowing for exploration of new network topologies focused on particular subcomponents of the system.

The network models are responsible for routing packets and updating timestamps to account for network delay. Each network model shares a common interface. Therefore, network model implementations are swappable, and it is simple to develop new network models. Currently, Graphite supports a basic model that forwards packets with no delay (used for system messages), and several mesh models with different tradeoffs in performance and accuracy.
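As a rough sketch of what such a swappable network-model interface might look like (the class names, `Packet` fields, and per-hop latency parameter are our own illustrative choices, not Graphite's actual API), each model consumes a timestamped packet and returns the arrival time at the destination:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Packet:
    src: tuple      # (x, y) tile coordinates of the sender
    dst: tuple      # (x, y) tile coordinates of the receiver
    timestamp: int  # sender's local clock, in simulated cycles

class NetworkModel(ABC):
    """Common interface shared by all network models, so implementations
    can be swapped without touching the rest of the simulator."""
    @abstractmethod
    def route(self, pkt: Packet) -> int:
        """Return the arrival timestamp at the destination tile."""

class NoDelayModel(NetworkModel):
    """Forwards packets with no delay; suitable for system messages,
    which must not affect simulation results."""
    def route(self, pkt: Packet) -> int:
        return pkt.timestamp

class MeshModel(NetworkModel):
    """Minimal X-Y mesh model: a fixed latency per hop, no contention."""
    def __init__(self, hop_latency: int = 2):
        self.hop_latency = hop_latency

    def route(self, pkt: Packet) -> int:
        # Manhattan distance between tiles gives the hop count.
        hops = abs(pkt.dst[0] - pkt.src[0]) + abs(pkt.dst[1] - pkt.src[1])
        return pkt.timestamp + hops * self.hop_latency
```

Because both models satisfy the same `route` interface, callers such as a memory model need not know which one is configured.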
D. Consistent OS Interface

Graphite implements a system interface layer that intercepts and handles system calls in the target application. System calls require special handling for two reasons: the need to access data in the target address space rather than the host address space, and the need to maintain the illusion of a single process across multiple processes executing the target application.

Many system calls, such as clone and rt_sigaction, pass pointers to chunks of memory as input or output arguments to the kernel. Graphite intercepts such system calls and modifies their arguments to point to the correct data before executing them on the host machine. Any output data is copied to the simulated address space after the system call returns.

Some system calls, such as the ones that deal with file I/O, need to be handled specially to maintain a consistent process state for the target application. For example, in a multi-threaded application, threads might communicate via files, with one thread writing to a file using a write system call and passing the file descriptor to another thread which then reads the data using the read system call. In a Graphite simulation, these threads might be in different host processes (each with its own file descriptor table), and might be running on different host machines, each with its own file system. Instead, Graphite handles these system calls by intercepting and forwarding them along with their arguments to the MCP, where they are executed. The results are sent back to the thread that made the original system call, achieving the desired result. Other system calls, e.g., open, fstat, etc., are handled in a similar manner. Similarly, system calls that are used to implement synchronization between threads, such as futex, are intercepted and forwarded to the MCP, where they are emulated. System calls that do not require special handling are allowed to execute directly on the host machine.

1) Process Initialization and Address Space Management: At the start of the simulation, Graphite's system interface layer needs to make sure that each host process is correctly initialized. In particular, it must ensure that all process segments are properly set up, the command line arguments and environment variables are updated in the target address space, and the thread local storage (TLS) is correctly initialized in each process. Eventually, only a single process in the simulation executes main(), while all the other processes execute threads subsequently spawned by Graphite's threading mechanism.

Graphite also explicitly manages the target address space, setting aside portions for thread stacks, code, and static and dynamic data. In particular, Graphite's dynamic memory manager services requests for dynamic memory from the application by intercepting the brk, mmap, and munmap system calls and allocating (or deallocating) memory from the target address space as required.

E. Threading Infrastructure

One challenging aspect of the Graphite design was seamlessly dealing with thread spawn calls across a distributed simulation. Other programming models, such as MPI, force the application programmer to be aware of distribution by allocating work among processes at start-up. This design is limiting and often requires the source code of the application to be changed to account for the new programming model. Instead, Graphite presents a single-process programming model to the user while distributing the threads across different machines. This allows the user to customize the distribution of the simulation as necessary for the desired scalability, performance, and available resources.

The above parameters can be changed between simulation runs through run-time configuration options without any changes to the application code. The actual application interface is simply the pthread spawn/join interface. The only limitation to the programming interface is that the maximum number of threads at any time may not exceed the total number of tiles in the target architecture. Currently the threads are long-living, that is, they run to completion without being swapped out.

To accomplish this, the spawn calls are first intercepted at the callee. Next, they are forwarded to the MCP to ensure a consistent view of the thread-to-tile mapping. The MCP chooses an available core and forwards the spawn request to the LCP on the machine that holds the chosen tile. The mapping between tiles and processes is currently implemented by simply striping the tiles across the processes. Thread joining is implemented in a similar manner by synchronizing through the MCP.

F. Synchronization Models

For high performance and scalability across multiple machines, Graphite decouples
tile simulations by relaxing the timing synchronization between them. By design, Graphite is not cycle-accurate. It supports several synchronization strategies that represent different timing accuracy and simulator performance tradeoffs: lax synchronization, lax with barrier synchronization, and lax with point-to-point synchronization. Lax synchronization is Graphite's baseline model. Lax with barrier synchronization and lax with point-to-point synchronization layer mechanisms on top of lax synchronization to improve its accuracy.

1) Lax Synchronization: Lax synchronization is the most permissive in letting clocks differ and offers the best performance and scalability. To keep the simulated clocks in reasonable agreement, Graphite uses application events to synchronize them, but otherwise lets threads run freely.

Lax synchronization is best viewed from the perspective of a single tile. All interaction with the rest of the simulation takes place via network messages, each of which carries a timestamp that is initially set to the clock of the sender. These timestamps are used to update clocks during synchronization events. A tile's clock is updated primarily when instructions executed on that tile's core are retired. With the exception of memory operations, these events are independent of the rest of the simulation. However, memory operations use message round-trip time to determine latency, so they do not force synchronization with other tiles. True synchronization only occurs in the following events: application synchronization such as locks, barriers, etc., receiving a message via the message-passing API, and spawning or joining a thread. In all cases, the clock of the tile is forwarded to the time that the event occurred. If the event occurred earlier in simulated time, then no updates take place.

The general strategy to handle out-of-order events is to ignore simulated time and process events in the order they are received [12]. An alternative is to re-order events so they are handled in simulated-time order, but this has some fundamental problems. Buffering and re-ordering events leads to deadlock in the memory system, and is difficult to implement anyway because there is no global cycle count. Alternatively, one could optimistically process events in the order they are received and roll them back when an earlier event arrives, as done in BigSim [11]. However, this requires state to be maintained throughout the simulation and hurts performance. Our results in Section IV-C show that lax synchronization, despite out-of-order processing, still predicts performance trends well.

This complicates models, however, as events are processed out of order. Queue modeling, e.g., at memory controllers and network switches, illustrates many of the difficulties. In a cycle-accurate simulation, a packet arriving at a queue is buffered. At each cycle, the buffer head is dequeued and processed. This matches the actual operation of the queue and is the natural way to implement such a model. In Graphite, however, the packet is processed immediately and potentially carries a timestamp in the past or far future, so this strategy does not work.

Instead, queueing latency is modeled by keeping an independent clock for the queue. This clock represents the time in the future when the processing of all messages in the queue will be complete. When a packet arrives, its delay is the difference between the queue clock and the global clock. Additionally, the queue clock is incremented by the processing time of the packet to model buffering.

However, because cores in the system are loosely synchronized, there is no easy way to measure progress or a global clock. This problem is addressed by using packet timestamps to build an approximation of global progress. A window of the most recently-seen timestamps is kept, on the order of the
number of target tiles. The average of these timestamps gives an approximation of global progress. Because messages are generated frequently (e.g., on every cache miss), this window gives an up-to-date representation of global progress even with a large window size while mitigating the effect of outliers.

Combining these techniques yields a queueing model that works within the framework of lax synchronization. Error is introduced because packets are modeled out of order in simulated time, but the aggregate queueing delay is correct. Other models in the system face similar challenges and solutions.

2) Lax with Barrier Synchronization: Graphite also supports quanta-based barrier synchronization (LaxBarrier), where all active threads wait on a barrier after a configurable number of cycles. This is used for validation of lax synchronization, as very frequent barriers closely approximate cycle-accurate simulation. As expected, LaxBarrier also hurts performance and scalability (see Section IV-C).

3) Lax with Point-to-point Synchronization: Graphite supports a novel synchronization scheme called point-to-point synchronization (LaxP2P). LaxP2P aims to achieve the quanta-based accuracy of LaxBarrier without sacrificing the scalability and performance of lax synchronization. In this scheme, each tile periodically chooses another tile at random and synchronizes with it. If the clocks of the two tiles differ by more than a configurable number of cycles (called the slack of the simulation), then the tile that is ahead goes to sleep for a short period of time.

LaxP2P is inspired by the observation that in lax synchronization, there are usually a few outlier threads that are far ahead or behind and responsible for simulation error. LaxP2P prevents outliers, as any thread that runs ahead will put itself to sleep and stay tightly synchronized. Similarly, any thread that falls behind will put other threads to sleep, which quickly propagates through the simulation.

The amount of time that a thread must sleep is calculated based on the real-time rate of simulation progress. Essentially, the thread sleeps for enough real time such that its synchronizing partner will have caught up when it wakes. Specifically, let c be the difference in clocks between the tiles, and suppose that the thread in front is progressing at a rate of r simulated cycles per second. We approximate the thread's progress with a linear curve and put the thread to sleep for s seconds, where s = c/r. r is currently approximated by total progress, meaning the total number of simulated cycles over the total wall-clock simulation time.

Finally, note that LaxP2P is completely distributed and uses no global structures. Because of this, it introduces less overhead than LaxBarrier and has superior scalability (see Section IV-C).

IV. RESULTS

This section presents experimental results using Graphite. We demonstrate Graphite's ability to scale to large target architectures and distribute across a host cluster. We show that lax synchronization provides good performance and accuracy, and validate results with two architectural studies.

A. Experimental Setup

The experimental results provided in this section were all obtained on a homogeneous cluster of machines. Each machine within the cluster has dual quad-core Intel(R) X5460 CPUs running at 3.16 GHz and 8 GB of DRAM. They are running Debian Linux with kernel version 2.6.26. Applications were compiled with gcc version 4.3.2. The machines within the cluster are connected to a Gigabit Ethernet switch with two trunked Gigabit ports per machine. This hardware is typical of current commodity servers.

Each of the experiments in this section uses the target architecture parameters summarized in Table I unless otherwise noted. These parameters were chosen to match the host architecture as closely as possible.

Fig. 3: Simulator performance scaling for SPLASH-2 benchmarks across different numbers of host cores. The target architecture has 32 tiles in all cases. Speed-up is normalized to simulator runtime on 1 host core. Host machines each contain 8 cores. Results from 1 to 8 cores use a single machine. Above 8 cores, simulation is distributed across multiple machines.
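The LaxP2P sleep computation described in Section III-F (s = c/r, applied only when two randomly paired tiles differ by more than the slack) can be sketched as follows; the function name is ours, and the 100,000-cycle slack default is taken from the study in Section IV-C:

```python
def laxp2p_sleep_seconds(my_clock, partner_clock,
                         total_cycles, wall_seconds, slack=100_000):
    """Seconds the faster tile should sleep so its randomly chosen
    partner catches up under LaxP2P.

    Returns 0 if this tile is not ahead of its partner by more than
    `slack` simulated cycles. The progress rate r is approximated by
    total simulated cycles over total wall-clock simulation time.
    """
    c = my_clock - partner_clock        # how far ahead we are, in cycles
    if c <= slack:
        return 0.0                      # within slack: no sleep needed
    r = total_cycles / wall_seconds     # r: simulated cycles per second
    return c / r                        # s = c / r
```

Because each tile pairs with a random partner and sleeps locally, this rule needs no global barrier or shared state, which is what gives LaxP2P its scalability.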
Feature: Value
Clock frequency: 3.16 GHz
L1 caches: Private, 32 KB (per tile), 64-byte line size, 8-way associativity, LRU replacement
L2 cache: Private, 3 MB (per tile), 64-byte line size, 24-way associativity, LRU replacement
Cache coherence: Full-map directory based
DRAM bandwidth: 5.3 GB/s
Interconnect: Mesh network

TABLE I: Selected Target Architecture Parameters. All experiments use these target parameters (varying the number of target tiles) unless otherwise noted.

B. Simulator Performance

1) Single- and Multi-Machine Scaling: Graphite is designed to scale well to both large numbers of target tiles and large numbers of host cores. By leveraging multiple machines, simulation of large target architectures can be accelerated to provide fast turn-around times.

Figure 3 demonstrates the speedup achieved by Graphite as additional host cores are devoted to the simulation of a 32-tile target architecture. Results are presented for several SPLASH-2 [16] benchmarks and are normalized to the runtime on a single host core. The results from one to eight host cores are collected by allowing the simulation to use additional cores within a single host machine. The results for 16, 32, and 64 host cores correspond to using all the cores within 2, 4, and 8 machines, respectively.

As shown in Figure 3, all applications except fft exhibit significant simulation speedups as more host cores are added. The best speedups are achieved with 64 host cores (across 8 machines) and range from about 2x (fft) to 20x (radix). Scaling is generally better within a single host machine than across machines due to the lower overhead of communication. Several apps (fmm, ocean, and radix) show nearly ideal speedup curves from 1 to 8 host cores (within a single machine). Some apps show a drop in performance when going from 8 to 16 host cores (from 1 to 2 machines) because the additional overhead of inter-machine communication outweighs the benefits of the additional compute resources. This effect clearly depends on specific application characteristics such as algorithm, computation/communication ratio, and degree of memory sharing. If the application itself does not scale well to large numbers of cores, then there is nothing Graphite can do to improve it, and performance will suffer.

These results demonstrate that Graphite is able to take advantage of large quantities of parallelism in the host platform to accelerate simulations. For rapid design iteration and software development, the time to complete a single simulation is more important than efficient utilization of host resources. For these tasks, an architect or programmer must stop and wait for the results of their simulation before they can continue their work. Therefore it makes sense to apply additional machines to a simulation even when the speedup achieved is less than ideal. For bulk processing of a large number of simulations, total simulation time can be reduced by using the most efficient configuration for each application.

2) Simulator Overhead: Table II shows simulator performance for several benchmarks from the SPLASH-2 suite. The table lists the native execution time for each application on a single 8-core machine, as well as overall simulation runtimes on one and eight host machines. The slowdowns experienced over native execution for each of these cases are also presented. The data in Table II demonstrate that Graphite achieves very good performance for all the benchmarks studied.

Application       Native   1 machine:        8 machines:
                  Time     Time  Slowdown    Time  Slowdown
cholesky          1.99     689   346         508   255
fft               0.02     80    3978        78    3930
fmm               7.11     670   94          298   41
lu_cont           0.072    288   4007        212   2952
lu_non_cont       0.08     244   3061        163   2038
ocean_cont        0.33     168   515         66    202
ocean_non_cont    0.41     177   433         78    190
radix             0.11     178   1648        63    584
water_nsquared    0.30     742   2465        396   1317
water_spatial     0.13     129   966         82    616
Mean              -        -     1751        -     1213
Median            -        -     1307        -     616

TABLE II: Multi-Machine Scaling Results. Wall-clock execution time of SPLASH-2 simulations running on 1 and 8 host machines (8 and 64 host cores). Times are in seconds. Slowdowns are calculated relative to native execution.
Fig. 4: Simulation speed-up as the number of host machines is increased from 1 to 10 for a matrix-multiply kernel with 1024 application threads running on a 1024-tile target architecture.

The total runtime for all the benchmarks is on the order of a few minutes, with a median slowdown of 616x over native execution. This high performance makes Graphite a very useful tool for rapid architecture exploration and software development for future architectures.

As can be seen from the table, the speed of the simulation relative to native execution time is highly application dependent, with the simulation slowdown being as low as 41x for fmm and as high as 3930x for fft. This depends, among other things, on the computation-to-communication ratio for the application: applications with a high computation-to-communication ratio are able to more effectively parallelize and hence show higher simulation speeds.

3) Scaling with Large Target Architectures: This section presents performance results for a large target architecture containing 1024 tiles and explores the scaling of such simulations. Figure 4 shows the normalized speed-up of a 1024-thread matrix-multiply kernel running across different numbers of host machines. matrix-multiply was chosen because it scales well to large numbers of threads, while still having frequent synchronization via messages with neighbors.
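Section IV-C below compares the three synchronization models. The clock-update rule they all build on is the lax rule of Section III-F: clocks advance locally as instructions retire, outgoing messages carry the sender's clock, and on a true synchronization event the receiver's clock is forwarded to the event timestamp but never moved backwards. A minimal sketch (class and method names are ours, not Graphite's):

```python
class TileClock:
    """Local simulated clock of one tile under lax synchronization."""

    def __init__(self):
        self.cycles = 0

    def retire(self, instruction_cost):
        # Ordinary instruction retirement advances the clock locally,
        # independently of all other tiles.
        self.cycles += instruction_cost

    def stamp(self):
        # Outgoing messages carry the sender's current clock.
        return self.cycles

    def on_sync_event(self, timestamp):
        # Forward the clock to the event time; an event that occurred
        # earlier in simulated time triggers no update.
        self.cycles = max(self.cycles, timestamp)
```

A slow tile receiving a message from a fast one jumps forward to the sender's time, while a fast tile ignores stale timestamps, which is exactly why clocks can drift apart between synchronization events.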
            Lax           LaxP2P        LaxBarrier
            1 mc   4 mc   1 mc   4 mc   1 mc   4 mc
Run-time    1.00   0.55   1.10   0.59   1.82   1.09
Scaling        1.80          1.84          1.69
Error (%)      7.56          1.28          1.31
CoV (%)        0.58          0.31          0.09

TABLE III: Mean performance and accuracy statistics for data presented in Figure 5. Scaling is the performance improvement going from 1 to 4 host machines.

This graph shows steady performance improvement for up to ten machines. Performance improves by a factor of 3.85 with ten machines compared to a single machine. Speed-up is consistent as machines are added, closely matching a linear curve. We expect scaling to continue as more machines are added, as the number of host cores is not close to saturating the parallelism available in the application.

C. Lax Synchronization

Graphite supports several synchronization models, namely lax synchronization and its barrier and point-to-point variants, to mitigate the clock skew between different target cores and increase the accuracy of the observed results. This section provides simulator performance and accuracy results for the three models, and shows the trade-offs offered by each.

1) Simulator performance: Figure 5a and Table III illustrate the simulator performance (wall-clock simulation time) of the three synchronization models using three SPLASH-2 benchmarks. Each simulation is run on one and four host machines. The barrier interval was chosen as 1,000 cycles to give very accurate results. The slack value for LaxP2P was chosen to give a good trade-off between performance and accuracy, which was determined to be 100,000 cycles. Results are normalized to the performance of Lax on one host machine.

We observe that Lax outperforms both LaxP2P and LaxBarrier due to its lower synchronization overhead. Performance of Lax also increases considerably when going from one machine to four machines (1.8x). LaxP2P performs only slightly worse than Lax. It shows an average slowdown of 1.10x and 1.07x when compared to Lax on one and four host machines respectively. LaxP2P shows good scalability with a performance improvement of 1.84x going from one to four host machines. This is mainly due to the distributed nature of synchronization in LaxP2P, allowing it to scale to a larger number of host cores. LaxBarrier performs poorly as expected. It encounters an average slowdown of 1.82x and 1.94x when compared to Lax on one and four
It incurs an average slowdown of 1.82× and 1.94× compared to Lax on one and four host machines respectively. Although the performance improvement of LaxBarrier when going from one to four host machines is comparable to the other schemes, we expect the rate of improvement to decrease rapidly as the number of target tiles increases, due to the inherently non-scalable nature of barrier synchronization.

2) Simulation error: This study examines simulation error and variability for the various synchronization models. Results are generated from ten runs of each benchmark using the same parameters as the previous study.

[Figure: simulator speed-up vs. number of host machines (1-10).]

Fig. 5: Performance and accuracy data comparison for the different synchronization schemes. Data is collected from SPLASH-2 benchmarks on one and four host machines, using ten runs of each simulation. (a) Simulation run-time, normalized to Lax on one host machine. (b) Simulation error, given as percentage deviation from LaxBarrier on one host machine. (c) Simulation variability, given as the coefficient of variation for each type of simulation.

Fig. 6: Clock skew in simulated cycles during the course of the simulation for the various synchronization models: (a) Lax, (b) LaxP2P, (c) LaxBarrier. Data collected running the fmm SPLASH-2 benchmark.

We compare results for single- and multi-machine simulations, as distribution across machines involves high-latency network communication that potentially introduces new sources of error and variability. Figure 5b, Figure 5c and Table III show the error and coefficient of variation of the synchronization models. The error data is presented as the percentage deviation of the mean simulated application run-time (in cycles) from a baseline. The baseline we choose is LaxBarrier, as it gives highly accurate results. The coefficient of variation (CoV) is a measure of how consistent results are from run to run. It is defined as the ratio of standard deviation to mean, expressed as a percentage. Error and CoV values close to 0.0% are best.

As seen in the table, LaxBarrier shows the best CoV (0.09%). This is expected, as the barrier forces target cores to run in lock-step, so there is little opportunity for deviation.
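The CoV metric defined above is straightforward to reproduce. The helper below is an illustrative sketch (the paper does not specify whether sample or population standard deviation is used; this sketch uses the population form, and the run-time values are hypothetical):

```python
import statistics

def cov_percent(samples):
    """Coefficient of variation: standard deviation / mean, as a percentage."""
    return 100.0 * statistics.pstdev(samples) / statistics.mean(samples)

# Ten hypothetical simulated run-times (in cycles) for one benchmark:
runs = [1000, 1010, 990, 1005, 995, 1002, 998, 1007, 993, 1000]
print(cov_percent(runs))  # about 0.6, i.e. very consistent runs
```

A perfectly repeatable simulation would yield a CoV of exactly 0.0%, which is why values near zero indicate low run-to-run variability.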
We also observe that LaxBarrier gives very accurate results across four host machines. This is also expected, as the barrier eliminates the clock skew that arises from variable communication latencies.

LaxP2P shows both good error (1.28%) and CoV (0.32%). Despite the large slack size, LaxP2P maintains low CoV and error by preventing the occurrence of outliers. In fact, LaxP2P shows error nearly identical to LaxBarrier; the main difference between the schemes is that LaxP2P has modestly higher CoV.

Lax shows the worst error (7.56%). This is expected, because only application events synchronize target tiles. As shown below, Lax allows thread clocks to vary significantly, giving more opportunity for the final simulated run-time to vary. For the same reason, Lax has the worst CoV (0.58%).

3) Clock skew: Figure 6 shows the approximate clock skew of each synchronization model during one run of the SPLASH-2 benchmark fmm. The graphs show the difference between the maximum and minimum clocks in the system at a given time. These results match what one expects from the various synchronization models. Lax shows by far the greatest skew, and application synchronization events are clearly visible. The skew of LaxP2P is several orders of magnitude less than that of Lax, but application synchronization events are still visible and skew is on the order of 10,000 cycles. LaxBarrier has the least skew, as one would expect. Application synchronization events are largely undetectable, and skew appears constant throughout execution.¹

4) Summary: Graphite offers three synchronization models that trade off simulation speed against accuracy. Lax gives the highest performance while achieving reasonable accuracy, but it also lets threads deviate considerably during simulation. This means that fine-grained interactions can be missed or misrepresented. At the other extreme, LaxBarrier forces tight synchronization and accurate results, at the cost of performance and scaling. LaxP2P lies in between, keeping threads from deviating too far and giving very accurate results, while reducing performance by only 10%.

¹ Spikes in the graphs, as seen in Figure 6c, are due to approximations in the calculation of clock skew. See [17].
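The LaxP2P rule summarized above can be sketched directly: at a fixed interval, each tile compares its clock with one randomly chosen peer and goes to sleep if it has run more than the slack (100,000 cycles in these experiments) ahead of that peer. The single-process Python sketch below is illustrative only; Graphite's actual mechanism is distributed across host machines and all names here are assumptions:

```python
import random

SLACK = 100_000  # cycles a tile may run ahead of a randomly chosen peer

def should_wait(my_clock, peer_clocks, rng=random):
    """LaxP2P-style periodic check: compare against one randomly chosen
    peer; if this tile is ahead by more than SLACK, it should pause
    until the peer catches up."""
    peer = rng.choice(peer_clocks)
    return my_clock > peer + SLACK

# A tile at 400,000 cycles is more than SLACK ahead of both peers,
# so whichever peer is drawn, it is told to wait:
print(should_wait(400_000, [50_000, 120_000]))  # True
```

Because each check involves only two tiles and no central manager, this style of synchronization distributes naturally across host machines, which is the property the results above attribute to LaxP2P's scalability.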
Fig. 7: Different cache coherency schemes are compared using speed-up relative to simulated single-tile execution in blackscholes as the target tile count is scaled.

Finally, we observed while running these experiments that the parameters of the synchronization models can be tuned to match application behavior. For example, some applications can tolerate large barrier intervals with no measurable degradation in accuracy. This allows LaxBarrier to achieve performance near that of LaxP2P for some applications.

D. Application Studies

1) Cache Miss-rate Characterization: We replicate the study performed by Woo et al. [16] characterizing cache miss rates as a function of cache line size. Target architectural parameters are chosen to match those in [16] as closely as possible. In particular, Graphite's L1 cache models are disabled to simulate a single-level cache. The architectures still differ, however, as Graphite simulates x86 instructions whereas [16] uses SGI machines. Our results match the expected trends for each benchmark studied, although the miss rates do not match exactly due to architectural differences. Further details can be found in [17].

2) Scaling of Cache Coherence: As processors scale to ever-increasing core counts, the viability of cache coherence in future manycores remains unsettled. This study explores three cache coherence schemes as a demonstration of Graphite's ability to explore this relevant architectural question, as well as its ability to run large simulations.

Graphite supports several cache coherence protocols. A limited directory MSI protocol with i sharers, denoted Dir_i NB [18], is the default cache coherence protocol. Graphite also supports full-map directories and the LimitLESS protocol.²

Figure 7 shows the comparison of the different cache coherency schemes on the application blackscholes, a member of the PARSEC benchmark suite [8]. blackscholes is nearly perfectly parallel, as little information is shared between cores. However, by tracking all requests through the memory system, we observed that some global addresses in the system libraries are heavily shared as read-only data. All tests were run using the simsmall input. The blackscholes source code was unmodified.

² In the LimitLESS protocol, a limited number of hardware pointers exist for the first i sharers, and additional requests to shared data are handled by a software trap, avoiding the need to evict existing sharers [19].

As seen in Figure 7, blackscholes achieves near-perfect scaling with the full-map directory and LimitLESS directory protocols up to 32 target tiles. Beyond 32 target tiles, parallelization overhead begins to outstrip performance gains. From simulator results, we observe that larger target tile counts give increased average memory access latency. This occurs in at least two ways: (i) increased network distance to memory controllers, and (ii) additional latency at memory controllers. Latency at the memory controller increases because the default target architecture places a memory controller at every tile, evenly splitting total off-chip bandwidth. This means that as the number of target tiles increases, the bandwidth at each controller decreases proportionally, and the service time for a memory request increases. Queueing delay also increases because the bandwidth is statically partitioned into separate queues, but results show that this effect is less significant.

The LimitLESS and full-map protocols exhibit little differentiation from one another. This is expected, as the heavily shared data is read-only. Therefore, once the data has been cached, the LimitLESS protocol exhibits the same characteristics as the full-map protocol.

The limited directory protocols do not scale. Dir4NB does not exhibit scaling beyond four target tiles. Because only four sharers can cache any given memory line at a time, heavily shared read data is constantly evicted at higher target tile counts. This serializes memory references and damages performance. Likewise, the Dir16NB protocol does not exhibit scaling beyond sixteen target cores.
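The serialization described for the limited directory protocols can be seen in a toy model of a Dir_i NB directory entry: with at most i sharer pointers and no broadcast, every read beyond the i-th forces the invalidation of an existing sharer. The class below is a hypothetical sketch, not Graphite's protocol implementation:

```python
class LimitedDirectoryEntry:
    """Toy Dir-i-NB-style directory entry: at most max_sharers hardware
    pointers; a read that overflows the sharer list evicts an existing
    sharer instead of broadcasting."""
    def __init__(self, max_sharers):
        self.max_sharers = max_sharers
        self.sharers = []
        self.invalidations = 0

    def read(self, tile):
        if tile in self.sharers:
            return                    # already a sharer, no action
        if len(self.sharers) == self.max_sharers:
            self.sharers.pop(0)       # invalidate an existing sharer
            self.invalidations += 1
        self.sharers.append(tile)

# 16 tiles repeatedly reading one heavily shared line under Dir4NB:
entry = LimitedDirectoryEntry(max_sharers=4)
for _ in range(2):
    for tile in range(16):
        entry.read(tile)
print(entry.invalidations)  # 28: almost every read evicts someone
```

With 16 readers and only 4 pointers, the steady state is that nearly every read misses and evicts, which is exactly the serialization of heavily shared read-only data that hurts Dir4NB and Dir16NB at high target tile counts.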
V. RELATED WORK

Because simulation is such an important tool for computer architects, a wide variety of simulators and emulators exists. Conventional sequential simulators/emulators include SimpleScalar [1], RSIM [2], SimOS [3], Simics [4], and QEMU [5]. Some of these are capable of simulating parallel target architectures, but all of them execute sequentially on the host machine. Like Graphite, Proteus [20] is designed to simulate highly parallel architectures and uses direct execution and configurable, swappable models. However, it too runs only on sequential hosts.

The projects most closely related to Graphite are parallel simulators of parallel target architectures, including: SimFlex [21], GEMS [22], COTSon [23], BigSim [11], FastMP [24], SlackSim [12], Wisconsin Wind Tunnel (WWT) [25], Wisconsin Wind Tunnel II (WWT II) [10], and those described by Chidester and George [26] and Penry et al. [27].

SimFlex and GEMS both use an off-the-shelf sequential emulator (Simics) for functional modeling plus their own models for memory systems and core interactions. Because Simics is a closed-source commercial product, it is difficult to experiment with different core architectures. GEMS uses its timing model to drive Simics one instruction at a time, which results in much lower performance than Graphite. SimFlex avoids this problem by using statistical sampling of the application, but therefore does not observe its entire behavior. Chidester and George take a similar approach by joining together several copies of SimpleScalar using MPI. They do not report absolute performance numbers, but SimpleScalar is typically slower than the direct execution used by Graphite.

COTSon uses AMD's SimNow! for functional modeling and therefore suffers from some of the same problems as SimFlex and GEMS. The sequential instruction stream coming out of SimNow! is demultiplexed into separate threads before timing simulation. This limits parallelism and restricts COTSon to a single host machine for shared-memory simulations. COTSon can perform multi-machine simulations, but only if the applications are written for distributed memory and use a messaging library like MPI.

BigSim and FastMP assume distributed memory in their target architectures and do not provide coherent shared memory between the parallel portions of their simulators. Graphite permits study of the much broader and more interesting class of architectures that use shared memory. WWT is one of the earliest parallel simulators, but it requires applications to use an explicit interface for shared memory and only runs on CM-5 machines, making it impractical for modern usage. Graphite has several similarities with WWT II. Both use direct execution and provide shared memory across a cluster of machines. However, WWT II does not model anything other than the target memory system and requires applications to be modified to explicitly allocate shared-memory blocks. Graphite also models compute cores and communication networks and implements a transparent shared-memory system. In addition, WWT II uses a very different quantum-based synchronization scheme rather than lax synchronization.

Penry et al. provide a much more detailed, low-level simulation and are targeting hardware designers. Their simulator, while fast for a cycle-accurate hardware model, does not provide the performance necessary for rapid exploration of different ideas or software development.

The problem of accelerating slow simulations has been addressed in a number of ways other than large-scale parallelization. ProtoFlex [9], FAST [6], and HASim [28] all use FPGAs to implement timing models for cycle-accurate simulations. ProtoFlex and FAST implement their functional models in software, while HASim implements functional models in the FPGA as well. These approaches require the user to buy expensive special-purpose hardware, while Graphite runs on commodity Linux machines. In addition, implementing a new model in an FPGA is more difficult than in software, making it harder to quickly experiment with different designs.

Other simulators improve performance by modeling only a portion of the total execution. FastMP [24] estimates performance for parallel workloads with no memory sharing (such as SPECrate) by carefully simulating only some of the independent processes and using those results to model the others. Finally, simulators such as SimFlex [21] use statistical sampling by carefully modeling short segments of the overall program run and assuming that the rest of the run is similar. Although Graphite does make some approximations, it differs from these projects in that it observes and models the behavior of the entire application execution.

The idea of maintaining independent local clocks and using timestamps on messages to synchronize them during interactions was pioneered by the Time Warp system [29] and used in the Georgia Tech Time Warp [30], BigSim [11], and SlackSim [12]. The first three systems assume that perfect ordering must be maintained and roll back when the timestamps indicate out-of-order events. SlackSim (developed concurrently with Graphite) is the only other system that allows events to occur out of order. It allows all threads to run freely as long as their local clocks remain within a specified window. SlackSim's unbounded slack mode is essentially the same as plain lax synchronization. However, its approach to limiting slack relies on a central manager which monitors all threads using shared memory. This (along with other factors) restricts it to running on a single host machine and ultimately limits its scalability. Graphite's LaxP2P is completely distributed and enables scaling to larger numbers of target tiles and host machines. Because Graphite has more aggressive goals than SlackSim, it requires more sophisticated techniques to mitigate and compensate for excessive slack.

TreadMarks [31] implements a generic distributed shared-memory system across a cluster of machines. However, it requires the programmer to explicitly allocate blocks of memory that will be kept consistent across the machines. This requires applications that assume a single shared address space (e.g., pthread applications) to be rewritten to use the TreadMarks interface. Graphite operates transparently, providing a single shared address space to off-the-shelf applications.

VI. CONCLUSIONS

Graphite is a distributed, parallel simulator for design-space exploration of large-scale multicores and for applications research. It uses a variety of techniques to deliver the high performance and scalability needed for useful evaluations, including: direct execution, multi-machine distribution, analytical modeling, and lax synchronization. Some of Graphite's other key features are its flexible and extensible architecture modeling, its compatibility with commodity multicores and clusters, its ability to run off-the-shelf pthreads application binaries, and its support for a single shared simulated address space despite running across multiple host machines.

Our results demonstrate that Graphite is high performance and achieves slowdowns as little as 41× over native execution for simulations of SPLASH-2 benchmarks on a 32-tile target. We also demonstrate that Graphite is scalable, obtaining near-linear speed-up on a simulation of a 1000-tile target using from 1 to 10 host machines. Lastly, this work evaluates several lax synchronization simulation strategies and characterizes their performance versus accuracy. We develop a novel synchronization strategy called LaxP2P for both high performance and accuracy, based on periodic, random, point-to-point synchronizations between target tiles. Our results show that LaxP2P performs on average within 8% of the highest-performance strategy while keeping average error to 1.28% of the most accurate strategy for the studied benchmarks.

Graphite will be released to the community as open-source software to foster research on large-scale multicore architectures and applications.

ACKNOWLEDGEMENTS

The authors would like to thank James Psota for his early help in researching potential implementation strategies and pointing us towards Pin. His role as liaison with the Pin team at Intel was also greatly appreciated. This work was partially funded by the National Science Foundation under Grant No. 0811724.

REFERENCES

[1] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An infrastructure for computer system modeling," IEEE Computer, vol. 35, no. 2, pp. 59-67, 2002.
[2] C. J. Hughes, V. S. Pai, P. Ranganathan, and S. V. Adve, "Rsim: Simulating shared-memory multiprocessors with ILP processors," IEEE Computer, vol. 35, no. 2, pp. 40-49, 2002.
[3] M. Rosenblum, S. Herrod, E. Witchel, and A. Gupta, "Complete computer system simulation: The SimOS approach," IEEE Parallel & Distributed Technology: Systems & Applications, vol. 3, no. 4, pp. 34-43, Winter 1995.
[4] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," IEEE Computer, vol. 35, no. 2, pp. 50-58, Feb. 2002.
[5] F. Bellard, "QEMU, a fast and portable dynamic translator," in ATEC '05: Proc. of the USENIX Annual Technical Conference, Berkeley, CA, USA, 2005.
[6] D. Chiou, D. Sunwoo, J. Kim, N. A. Patil, W. Reinhart, D. E. Johnson, J. Keefe, and H. Angepat, "FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators," in MICRO '07: Proc. of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007, pp. 249-261.
[7] A. KleinOsowski
and D. J. Lilja, "MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research," Computer Architecture Letters, vol. 1, Jun. 2002.
[8] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proc. of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), October 2008.
[9] E. S. Chung, M. K. Papamichael, E. Nurvitadhi, J. C. Hoe, K. Mai, and B. Falsafi, "ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs," ACM Trans. Reconfigurable Technol. Syst., vol. 2, no. 2, pp. 1-32, 2009.
[10] S. S. Mukherjee, S. K. Reinhardt, B. Falsafi, M. Litzkow, M. D. Hill, D. A. Wood, S. Huss-Lederman, and J. R. Larus, "Wisconsin Wind Tunnel II: A fast, portable parallel architecture simulator," IEEE Concurrency, vol. 8, no. 4, pp. 12-20, Oct-Dec 2000.
[11] G. Zheng, G. Kakulapati, and L. V. Kalé, "BigSim: A parallel simulator for performance prediction of extremely large parallel machines," in 18th International Parallel and Distributed Processing Symposium (IPDPS), Apr. 2004, p. 78.
[12] J. Chen, M. Annavaram, and M. Dubois, "SlackSim: A Platform for Parallel Simulations of CMPs on CMPs," SIGARCH Comput. Archit. News, vol. 37, no. 2, pp. 20-29, 2009.
[13] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffman, P. Johnson, J. Kim, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams," in Proc. of the International Symposium on Computer Architecture, Jun. 2004, pp. 2-13.
[14] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in PLDI '05: Proc. of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2005, pp. 190-200.
[15] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown, and A. Agarwal, "On-chip interconnection architecture of the Tile processor," IEEE Micro, vol. 27, no. 5, pp. 15-31, Sept-Oct 2007.
[16] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in ISCA '95: Proc. of the 22nd Annual International Symposium on Computer Architecture, June 1995, pp. 24-36.
[17] J. Miller, H. Kasture, G. Kurian, N. Beckmann, C. Gruenwald III, C. Celio, J. Eastep, and A. Agarwal, "Graphite: A distributed simulator for multicores," Cambridge, MA, USA, Tech. Rep. MIT-CSAIL-TR-2009-056, November 2009.
[18] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, "An evaluation of directory schemes for cache coherence," in ISCA '88: Proc. of the 15th Annual International Symposium on Computer Architecture, Los Alamitos, CA, USA, 1988, pp. 280-298.
[19] D. Chaiken, J. Kubiatowicz, and A. Agarwal, "LimitLESS directories: A scalable cache coherence scheme," in Proc. of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), 1991, pp. 224-234.
[20] E. A. Brewer, C. N. Dellarocas, A. Colbrook, and W. E. Weihl, "Proteus: a high-performance parallel-architecture simulator," in SIGMETRICS '92/PERFORMANCE '92: Proc. of the 1992 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, New York, NY, USA, 1992, pp. 247-248.
[21] T. F. Wenisch, R. E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J. C. Hoe, "SimFlex: Statistical sampling of computer system simulation," IEEE Micro, vol. 26, no. 4, pp. 18-31, July-Aug 2006.
[22] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, "Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset," SIGARCH Comput. Archit. News, vol. 33, no. 4, pp. 92-99, November 2005.
[23] M. Monchiero, J. H. Ahn, A. Falcón, D. Ortega, and P. Faraboschi, "How to simulate 1000 cores," SIGARCH Comput. Archit. News, vol. 37, no. 2, pp. 10-19, 2009.
[24] S. Kanaujia, I. E. Papazian, J. Chamberlain, and J. Baxter, "FastMP: A multi-core simulation methodology," in MOBS 2006: Workshop on Modeling, Benchmarking and Simulation, June 2006.
[25] S. K. Reinhardt, M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, and D. A. Wood, "The Wisconsin Wind Tunnel: virtual prototyping of parallel computers," in SIGMETRICS '93: Proc. of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1993, pp. 48-60.
[26] M. Chidester and A. George, "Parallel simulation of chip-multiprocessor architectures," ACM Trans. Model. Comput. Simul., vol. 12, no. 3, pp. 176-200, 2002.
[27] D. A. Penry, D. Fay, D. Hodgdon, R. Wells, G. Schelle, D. I. August, and D. Connors, "Exploiting parallelism and structure to accelerate the simulation of chip multi-processors," in HPCA '06: The Twelfth International Symposium on High-Performance Computer Architecture, Feb. 2006, pp. 29-40.
[28] N. Dave, M. Pellauer, and J. Emer, "Implementing a functional/timing partitioned microprocessor simulator with an FPGA," in 2nd Workshop on Architecture Research using FPGA Platforms (WARFP 2006), Feb. 2006.
[29] D. R. Jefferson, "Virtual time," ACM Transactions on Programming Languages and Systems, vol. 7, no. 3, pp. 404-425, July 1985.
[30] S. Das, R. Fujimoto, K. Panesar, D. Allison, and M. Hybinette, "GTW: A Time Warp System for Shared Memory Multiprocessors," in WSC '94: Proceedings of the 26th Conference on Winter Simulation, 1994, pp. 1332-1339.
[31] C. Amza, A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel, "TreadMarks: Shared memory computing on networks of workstations," IEEE Computer, vol. 29, no. 2, pp. 18-28, Feb. 1996.