Graphite: A Distributed Parallel Simulator for Multicores

Jason E. Miller, Harshad Kasture, George Kurian, Charles Gruenwald III, Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal
Massachusetts Institute of Technology, Cambridge, MA

The simulator maintains the illusion that all of the threads are running in a single process with a single shared address space. This allows the simulator to run off-the-shelf parallel applications on any number of machines without having to recompile the apps for different configurations. Graphite is not intended to be completely cycle-accurate but instead uses a collection of models and techniques to provide accurate estimates of performance and various machine statistics. Instructions and events from the core, network, and memory subsystem functional models are passed to analytical timing models that update individual local clocks in each core. The local clocks are synchronized using message timestamps when cores interact (e.g., through synchronization or messages) [11]. However, to reduce the time wasted on synchronization, Graphite does not strictly enforce the ordering of all events in the system. In certain cases, timestamps are ignored and operation latencies are based on the ordering of events during native execution rather than the precise ordering they would have in the simulated system (see Section III-F). This is similar to the "unbounded slack" mode in SlackSim [12]; however, Graphite also supports a new scalable mechanism called LaxP2P for managing slack and improving accuracy.

Graphite has been evaluated both in terms of the validity of the simulation results as well as the scalability of simulator performance across multiple cores and machines. The results from these evaluations show that Graphite scales well, has reasonable performance, and provides results consistent with expectations. For the scaling study, we perform a fully cache-coherent simulation of 1024 cores across up to 10 host machines and run applications from the SPLASH-2 benchmark suite. The slowdown versus native execution is as low as 41x when using eight 8-core host machines, indicating that Graphite can be used for realistic application studies.

The remainder of this paper is structured as follows. Section II describes the architecture of Graphite. Section III discusses the implementation of Graphite in more detail. Section IV evaluates the accuracy, performance, and scaling of the simulator. Section V discusses related work, and Section VI summarizes our findings.

II. SYSTEM ARCHITECTURE

Graphite is an application-level simulator for tiled multicore architectures. A simulation consists of executing a multi-threaded application on a target multicore architecture defined by the simulator's models and runtime configuration parameters. The simulation runs on one or more host machines, each of which may be a multicore machine itself. Figure 1 illustrates how a multi-threaded application running on a target architecture with multiple tiles is simulated on a cluster of host machines. Graphite maps each thread in the application to a tile of the target architecture and distributes these threads among multiple host processes which are running on multiple host machines. The host operating system is then responsible for the scheduling and execution of these threads.

Fig. 1: High-level architecture. Graphite consists of one or more host processes distributed across machines and working together over sockets. Each process runs a subset of the simulated tiles, one host thread per simulated tile.

Figure 2a illustrates the types of target architectures Graphite is designed to simulate. The target architecture contains a set of tiles interconnected by an on-chip network. Each tile is composed of a compute core, a network switch and a part of the memory subsystem (cache hierarchy and DRAM controller) [13]. Tiles may be homogeneous or heterogeneous; however, we only examine homogeneous architectures in this paper. Any network topology can be modeled as long as each tile contains an endpoint.

Graphite has a modular design based on swappable modules. Each of the components of a tile is modeled by a separate module with well-defined interfaces. Modules can be configured through run-time parameters or completely replaced to study alternate architectures. Modules may also be replaced to alter the level of detail in the models and trade off between performance and accuracy.

Figure 2b illustrates the key components of a Graphite simulation. Application threads are executed under a dynamic binary translator (currently Pin [14]) which rewrites instructions to generate events at key points. These events cause traps into Graphite's backend which contains the compute core, memory, and network modeling modules. Points of interest intercepted by the dynamic binary translator (DBT) include: memory references, system calls, synchronization routines and user-level messages. The DBT is also used to generate a stream of executed instructions used in the compute core models.

Graphite's simulation backend can be broadly divided into two sets of features: functional and modeling. Modeling features model various aspects of the target architecture while functional features ensure correct program behavior.

Fig. 2: System architecture. (a) Overview of the target architecture. Tiles contain a compute core, a network switch, and a node of the memory system. (b) The anatomy of a Graphite simulation. Tiles are distributed among multiple processes. The app is instrumented to trap into one of three models at key points: a core model, network model, or memory system model. These models interact to model the target system. The physical transport layer abstracts away the host-specific details of inter-tile communication.

A. Modeling Features

As shown in Figure 2b, the Graphite backend is comprised of many modules that model various components of the target architecture. In particular, the core model is responsible for modeling the computational pipeline; the memory model is responsible for the memory subsystem, which is composed of different levels of caches and DRAM; and the network model handles the routing of network packets over the on-chip network and accounts for various delays encountered due to contention and routing overheads.

Graphite's models interact with each other to determine the cost of each event in the application. For instance, the memory model uses the round-trip delay times from the network model to compute the latency of memory operations, while the core model relies on latencies from the memory model to determine the time taken to execute load and store operations.

One of the key techniques Graphite uses to achieve good simulator performance is lax synchronization. With lax synchronization, each target tile maintains its own local clock which runs independently of the clocks of other tiles. Synchronization between the local clocks of different tiles happens only on application synchronization events, user-level messages, and thread creation and termination events. Due to this, modeling certain aspects of system behavior, such as network contention and DRAM queueing delays, becomes complicated. Section III-F describes how Graphite addresses this challenge.

B. Functional Features

Graphite's ability to execute an unmodified pthreaded application across multiple host machines is central to its scalability and ease of use. In order to achieve this, Graphite has to address a number of functional challenges to ensure that the application runs correctly:

1) Single Address Space: Since threads from the application execute in different processes and hence in different address spaces, allowing application memory references to access the host address space would not be functionally correct. Graphite provides the infrastructure to modify these memory references, present a uniform view of the application address space to all threads, and maintain data coherence between them.

2) Consistent OS Interface: Since application threads execute on different host processes on multiple hosts, Graphite implements a system interface layer that intercepts and handles all application system calls in order to maintain the illusion of a single process.

3) Threading Interface: Graphite implements a threading interface that intercepts thread creation requests from the application and seamlessly distributes these threads across multiple hosts. The threading interface also implements certain thread management and synchronization functions, while others are handled automatically by virtue of the single, coherent address space.

To help address these challenges, Graphite spawns additional threads called the Master Control Program (MCP) and the Local Control Program (LCP). There is one LCP per process but only one MCP for the entire simulation. The MCP and LCP ensure the functional correctness of the simulation by providing services for synchronization, system call execution and thread management.

All of the actual communication between tiles is handled by the physical transport (PT) layer. For example, the network model calls into this layer to perform the functional task of moving data from one tile to another. The PT layer abstracts away the host-architecture dependent details of intra- and inter-process communication, making
it easier to port Graphite to new hosts.

III. IMPLEMENTATION

This section describes the design and interaction of Graphite's various models and simulation layers. It discusses the challenges of high-performance parallel distributed simulation and how Graphite's design addresses them.

A. Core Performance Model

The core performance model is a purely modeled component of the system that manages the simulated clock local to each tile. It follows a producer-consumer design: it consumes instructions and other dynamic information produced by the rest of the system. The majority of instructions are produced by the dynamic binary translator as the application thread executes them. Other parts of the system also produce pseudo-instructions to update the local clock on unusual events. For example, the network produces a "message receive pseudo-instruction" when the application uses the network messaging API (Section III-C), and a "spawn pseudo-instruction" is produced when a thread is spawned on the core.

Other information beyond instructions is required to perform modeling. Latencies of memory operations, paths of branches, etc. are all dynamic properties of the system not included in the instruction trace. This information is produced by the simulator back-end (e.g., memory operations) or dynamic binary translator (e.g., branch paths) and consumed by the core performance model via a separate interface. This allows the functional and modeling portions of the simulator to execute asynchronously without introducing any errors.

Because the core performance model is isolated from the functional portion of the simulator, there is great flexibility in implementing it to match the target architecture. Currently, Graphite supports an in-order core model with an out-of-order memory system. Store buffers, load units, branch prediction, and instruction costs are all modeled and configurable. This model is one example of many different architectural models that can be implemented in Graphite.

It is also possible to implement core models that differ drastically from the operation of the functional models; i.e., although the simulator is functionally in-order with sequentially consistent memory, the core performance model can be an out-of-order core with a relaxed memory model. Models throughout the remainder of the system will reflect the new core type, as they are ultimately based on clocks updated by the core model. For example, memory and network utilization will reflect an out-of-order architecture because message timestamps are generated from core clocks.

B. Memory System

The memory system of Graphite has both a functional and a modeling role. The functional role is to provide an abstraction of a shared address space to the application threads which execute in different address spaces. The modeling role is to simulate the cache hierarchies and memory controllers of the target architecture. The functional and modeling parts of the memory system are tightly coupled, i.e., the messages sent over the network to load/store data and ensure functional correctness are also used for performance modeling.

The memory system of Graphite is built using generic modules such as caches, directories, and simple cache coherence protocols. Currently, Graphite supports a sequentially consistent memory system with full-map and limited directory-based cache coherence protocols, private L1 and L2 caches, and memory controllers on every tile of the target architecture. However, due to its modular design, a different implementation of the memory system could easily be developed and swapped in instead. The application's address space is divided up among the target tiles which possess memory controllers. Performance modeling is done by appending simulated timestamps to the messages sent between different memory system modules (see Section III-F). The average memory access latency of any request is computed using these timestamps.

The functional role of the memory system is to service all memory operations made by application threads. The dynamic binary translator in Graphite rewrites all memory references in the application so they get redirected to the memory system.

An alternate design option for the memory system is to completely decouple its functional and modeling parts. This was not done for performance reasons. Since Graphite is a distributed system, both the functional and modeling parts of the memory system have to be distributed. Hence, decoupling them would lead to doubling the number of messages in the system (one set for ensuring functional correctness and another for modeling). An additional advantage of tightly coupling the functional and the modeling parts is that it automatically helps verify the correctness of the complex cache hierarchies and coherence protocols of the target architecture, as their correct operation is essential for the completion of simulation.

C. Network

The network component provides high-level messaging services between tiles built on top of the lower-level transport layer, which uses shared memory and TCP/IP to communicate between target tiles. It provides a message-passing API directly to the application, as well as serving other components of the simulator backend, such as the memory system and system call handler.

The network component maintains several distinct network models. The network model used by a particular message is determined by the message type. For instance, system messages unrelated to application behavior use a separate network model from application messages, and therefore have no impact on simulation results. The default simulator configuration also uses separate models for application and memory traffic, as is commonly done in multicore chips [13], [15]. Each network model is configured independently, allowing for exploration of new network topologies focused on particular subcomponents of the system.

The network models are responsible for routing packets and updating timestamps to account for network delay. Each network model shares a common interface. Therefore, network model implementations are swappable, and it is simple to develop new network models. Currently, Graphite supports a basic model that forwards packets with no delay (used for system messages), and several mesh models with different tradeoffs in performance and accuracy.
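The swappable-model idea above can be illustrated with a minimal sketch. The class and method names (NetworkModel, route) and the hop-latency parameter are assumptions for illustration, not Graphite's actual interface; the two models mirror the ones just described: a no-delay model for system messages and a simple mesh with a fixed per-hop latency.

```python
# Sketch of a common network-model interface (illustrative names,
# not Graphite's real API). Every model answers the same route()
# call, so implementations can be swapped via configuration.

from dataclasses import dataclass

@dataclass
class Packet:
    src: int        # source tile id
    dst: int        # destination tile id
    timestamp: int  # simulated cycle at which the packet was sent

class NetworkModel:
    """Common interface shared by all network models."""
    def route(self, pkt: Packet) -> int:
        """Return the packet's arrival time in simulated cycles."""
        raise NotImplementedError

class NoDelayModel(NetworkModel):
    """Forwards packets with no delay (as used for system messages)."""
    def route(self, pkt: Packet) -> int:
        return pkt.timestamp

class MeshModel(NetworkModel):
    """Simple mesh: latency proportional to Manhattan distance."""
    def __init__(self, width: int, hop_latency: int = 2):
        self.width = width            # tiles per mesh row
        self.hop_latency = hop_latency  # assumed cycles per hop

    def route(self, pkt: Packet) -> int:
        sx, sy = pkt.src % self.width, pkt.src // self.width
        dx, dy = pkt.dst % self.width, pkt.dst // self.width
        hops = abs(sx - dx) + abs(sy - dy)
        return pkt.timestamp + hops * self.hop_latency
```

Because both models implement the same interface, the rest of the backend never needs to know which topology is in use.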
D. Consistent OS Interface

Graphite implements a system interface layer that intercepts and handles system calls in the target application. System calls require special handling for two reasons: the need to access data in the target address space rather than the host address space, and the need to maintain the illusion of a single process across multiple processes executing the target application.

Many system calls, such as clone and rt_sigaction, pass pointers to chunks of memory as input or output arguments to the kernel. Graphite intercepts such system calls and modifies their arguments to point to the correct data before executing them on the host machine. Any output data is copied to the simulated address space after the system call returns.

Some system calls, such as the ones that deal with file I/O, need to be handled specially to maintain a consistent process state for the target application. For example, in a multi-threaded application, threads might communicate via files, with one thread writing to a file using a write system call and passing the file descriptor to another thread which then reads the data using the read system call. In a Graphite simulation, these threads might be in different host processes (each with its own file descriptor table), and might be running on different host machines each with their own filesystem. Instead, Graphite handles these system calls by intercepting and forwarding them along with their arguments to the MCP, where they are executed. The results are sent back to the thread that made the original system call, achieving the desired result. Other system calls, e.g., open, fstat etc., are handled in a similar manner. Similarly, system calls that are used to implement synchronization between threads, such as futex, are intercepted and forwarded to the MCP, where they are emulated. System calls that do not require special handling are allowed to execute directly on the host machine.

1) Process Initialization and Address Space Management: At the start of the simulation, Graphite's system interface layer needs to make sure that each host process is correctly initialized. In particular, it must ensure that all process segments are properly set up, the command line arguments and environment variables are updated in the target address space, and the thread local storage (TLS) is correctly initialized in each process. Eventually, only a single process in the simulation executes main(), while all the other processes execute threads subsequently spawned by Graphite's threading mechanism.

Graphite also explicitly manages the target address space, setting aside portions for thread stacks, code, and static and dynamic data. In particular, Graphite's dynamic memory manager services requests for dynamic memory from the application by intercepting the brk, mmap and munmap system calls and allocating (or deallocating) memory from the target address space as required.

E. Threading Infrastructure

One challenging aspect of the Graphite design was seamlessly dealing with thread spawn calls across a distributed simulation. Other programming models, such as MPI, force the application programmer to be aware of distribution by allocating work among processes at start-up. This design is limiting and often requires the source code of the application to be changed to account for the new programming model. Instead, Graphite presents a single-process programming model to the user while distributing the threads across different machines. This allows the user to customize the distribution of the simulation as necessary for the desired scalability, performance, and available resources.

The above parameters can be changed between simulation runs through run-time configuration options without any changes to the application code. The actual application interface is simply the pthread spawn/join interface. The only limitation to the programming interface is that the maximum number of threads at any time may not exceed the total number of tiles in the target architecture. Currently the threads are long-lived; that is, they run to completion without being swapped out.

To accomplish this, the spawn calls are first intercepted at the callee. Next, they are forwarded to the MCP to ensure a consistent view of the thread-to-tile mapping. The MCP chooses an available core and forwards the spawn request to the LCP on the machine that holds the chosen tile. The mapping between tiles and processes is currently implemented by simply striping the tiles across the processes. Thread joining is implemented in a similar manner by synchronizing through the MCP.

F. Synchronization Models

For high performance and scalability across multiple machines, Graphite decouples
tile simulations by relaxing the timing synchronization between them. By design, Graphite is not cycle-accurate. It supports several synchronization strategies that represent different timing accuracy and simulator performance tradeoffs: lax synchronization, lax with barrier synchronization, and lax with point-to-point synchronization. Lax synchronization is Graphite's baseline model. Lax with barrier synchronization and lax with point-to-point synchronization layer mechanisms on top of lax synchronization to improve its accuracy.

1) Lax Synchronization: Lax synchronization is the most permissive in letting clocks differ and offers the best performance and scalability. To keep the simulated clocks in reasonable agreement, Graphite uses application events to synchronize them, but otherwise lets threads run freely.

Lax synchronization is best viewed from the perspective of a single tile. All interaction with the rest of the simulation takes place via network messages, each of which carries a timestamp that is initially set to the clock of the sender. These timestamps are used to update clocks during synchronization events. A tile's clock is updated primarily when instructions executed on that tile's core are retired. With the exception of memory operations, these events are independent of the rest of the simulation. However, memory operations use message round-trip time to determine latency, so they do not force synchronization with other tiles. True synchronization only
occurs in the following events: application synchronization such as locks, barriers, etc., receiving a message via the message-passing API, and spawning or joining a thread. In all cases, the clock of the tile is forwarded to the time that the event occurred. If the event occurred earlier in simulated time, then no updates take place.

The general strategy to handle out-of-order events is to ignore simulated time and process events in the order they are received [12]. An alternative is to re-order events so they are handled in simulated-time order, but this has some fundamental problems. Buffering and re-ordering events leads to deadlock in the memory system, and is difficult to implement anyway because there is no global cycle count. Alternatively, one could optimistically process events in the order they are received and roll them back when an "earlier" event arrives, as done in BigSim [11]. However, this requires state to be maintained throughout the simulation and hurts performance. Our results in Section IV-C show that lax synchronization, despite out-of-order processing, still predicts performance trends well.

This complicates models, however, as events are processed out-of-order. Queue modeling, e.g., at memory controllers and network switches, illustrates many of the difficulties. In a cycle-accurate simulation, a packet arriving at a queue is buffered. At each cycle, the buffer head is dequeued and processed. This matches the actual operation of the queue and is the natural way to implement such a model. In Graphite, however, the packet is processed immediately and potentially carries a timestamp in the past or far future, so this strategy does not work.

Instead, queueing latency is modeled by keeping an independent clock for the queue. This clock represents the time in the future when the processing of all messages in the queue will be complete. When a packet arrives, its delay is the difference between the queue clock and the "global clock". Additionally, the queue clock is incremented by the processing time of the packet to model buffering.

However, because cores in the system are loosely synchronized, there is no easy way to measure progress or a "global clock". This problem is addressed by using packet timestamps to build an approximation of global progress. A window of the most recently-seen timestamps is kept, on the order of the number of target tiles. The average of these timestamps gives an approximation of global progress. Because messages are generated frequently (e.g., on every cache miss), this window gives an up-to-date representation of global progress even with a large window size while mitigating the effect of outliers.

Combining these techniques yields a queueing model that works within the framework of lax synchronization. Error is introduced because packets are modeled out-of-order in simulated time, but the aggregate queueing delay is correct. Other models in the system face similar challenges and solutions.

2) Lax with Barrier Synchronization: Graphite also supports quanta-based barrier synchronization (LaxBarrier), where all active threads wait on a barrier after a configurable number of cycles. This is used for validation of lax synchronization, as very frequent barriers closely approximate cycle-accurate simulation. As expected, LaxBarrier also hurts performance and scalability (see Section IV-C).

3) Lax with Point-to-point Synchronization: Graphite supports a novel synchronization scheme called point-to-point synchronization (LaxP2P). LaxP2P aims to achieve the quanta-based accuracy of LaxBarrier without sacrificing the scalability and performance of lax synchronization. In this scheme, each tile periodically chooses another tile at random and synchronizes with it. If the clocks of the two tiles differ by more than a configurable number of cycles (called the slack of simulation), then the tile that is ahead goes to sleep for a short period of time.

LaxP2P is inspired by the observation that in lax synchronization, there are usually a few outlier threads that are far ahead or behind and responsible for simulation error. LaxP2P prevents outliers, as any thread that runs ahead will put itself to sleep and stay tightly synchronized. Similarly, any thread that falls behind will put other threads to sleep, which quickly propagates through the simulation.

The amount of time that a thread must sleep is calculated based on the real-time rate of simulation progress. Essentially, the thread sleeps for enough real time such that its synchronizing partner will have caught up when it wakes. Specifically, let c be the difference in clocks between the tiles, and suppose that the thread "in front" is progressing at a rate of r simulated cycles per second. We approximate the thread's progress with a linear curve and put the thread to sleep for s seconds, where s = c / r. r is currently approximated by total progress, meaning the total number of simulated cycles over the total wall-clock simulation time.

Finally, note that LaxP2P is completely distributed and uses no global structures. Because of this, it introduces less overhead than LaxBarrier and has superior scalability (see Section IV-C).

IV. RESULTS

This section presents experimental results using Graphite. We demonstrate Graphite's ability to scale to large target architectures and distribute across a host cluster. We show that lax synchronization provides good performance and accuracy, and validate results with two architectural studies.

A. Experimental Setup

The experimental results provided in this section were all obtained on a homogeneous cluster of machines. Each machine within the cluster has dual quad-core Intel(R) X5460 CPUs running at 3.16 GHz and 8 GB of DRAM. They are running Debian Linux with kernel version 2.6.26. Applications were compiled with gcc version 4.3.2. The machines within the cluster are connected to a Gigabit Ethernet switch with two trunked Gigabit ports per machine. This hardware is typical of current commodity servers.

Each of the experiments in this section uses the target architecture parameters summarized in Table I unless otherwise noted. These parameters were chosen to match the host architecture as closely as possible.

Fig. 3: Simulator performance scaling for SPLASH-2 benchmarks across different numbers of host cores. The target architecture has 32 tiles in all cases. Speed-up is normalized to simulator runtime on 1 host core. Host machines each contain 8 cores. Results from 1 to 8 cores use a single machine. Above 8 cores, simulation is distributed across multiple machines.
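The queueing model of Section III-F (an independent clock per queue, with global progress approximated by averaging a window of recently seen message timestamps) can be sketched as follows. The class name, the window size, and the clamping of an idle queue's delay to zero are illustrative assumptions, not Graphite's actual code.

```python
# Sketch of a lax-synchronization queue model: the queue keeps its own
# clock, and "global" progress is approximated from a bounded window of
# recent packet timestamps (names and window size are assumptions).

from collections import deque

class QueueModel:
    def __init__(self, window_size: int = 1024):
        self.queue_clock = 0  # cycle when all queued work will be done
        self.window = deque(maxlen=window_size)  # recent timestamps

    def global_progress(self) -> float:
        # The average of recently seen timestamps approximates a
        # global clock that lax synchronization never maintains exactly.
        return sum(self.window) / len(self.window) if self.window else 0.0

    def access(self, timestamp: int, processing_time: int) -> int:
        """Return the queueing delay for a packet stamped `timestamp`."""
        self.window.append(timestamp)
        global_clock = int(self.global_progress())
        # Delay is the gap between the queue clock and approximate
        # global progress; clamped at zero here so an idle queue
        # contributes no delay (an illustrative simplification).
        delay = max(0, self.queue_clock - global_clock)
        # Model buffering: the queue stays busy for the processing time.
        self.queue_clock = max(self.queue_clock, global_clock) + processing_time
        return delay
```

Packets are charged a delay immediately on arrival, even if their timestamps are out of order; as the text notes, individual delays may be misattributed but the aggregate queueing delay remains correct.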
Feature            Value
Clock frequency    3.16 GHz
L1 caches          Private, 32 KB (per tile), 64-byte line size, 8-way associativity, LRU replacement
L2 cache           Private, 3 MB (per tile), 64-byte line size, 24-way associativity, LRU replacement
Cache coherence    Full-map directory based
DRAM bandwidth     5.3 GB/s
Interconnect       Mesh network

TABLE I: Selected Target Architecture Parameters. All experiments use these target parameters (varying the number of target tiles) unless otherwise noted.

B. Simulator Performance

1) Single- and Multi-Machine Scaling: Graphite is designed to scale well to both large numbers of target tiles and large numbers of host cores. By leveraging multiple machines, simulation of large target architectures can be accelerated to provide fast turn-around times.

Figure 3 demonstrates the speedup achieved by Graphite as additional host cores are devoted to the simulation of a 32-tile target architecture. Results are presented for several SPLASH-2 [16] benchmarks and are normalized to the runtime on a single host core. The results from one to eight host cores are collected by allowing the simulation to use additional cores within a single host machine. The results for 16, 32, and 64 host cores correspond to using all the cores within 2, 4, and 8 machines, respectively.

As shown in Figure 3, all applications except fft exhibit significant simulation speedups as more host cores are added. The best speedups are achieved with 64 host cores (across 8 machines) and range from about 2x (fft) to 20x (radix). Scaling is generally better within a single host machine than across machines due to the lower overhead of communication. Several apps (fmm, ocean, and radix) show nearly ideal speedup curves from 1 to 8 host cores (within a single machine).

Some apps show a drop in performance when going from 8 to 16 host cores (from 1 to 2 machines) because the additional overhead of inter-machine communication outweighs the benefits of the additional compute resources. This effect clearly depends on specific application characteristics such as algorithm, computation/communication ratio, and degree of memory sharing. If the application itself does not scale well to large numbers of cores, then there is nothing Graphite can do to improve it, and performance will suffer.

These results demonstrate that Graphite is able to take advantage of large quantities of parallelism in the host platform to accelerate simulations. For rapid design iteration and software development, the time to complete a single simulation is more important than efficient utilization of host resources. For these tasks, an architect or programmer must stop and wait for the results of their simulation before they can continue their work. Therefore it makes sense to apply additional machines to a simulation even when the speedup achieved is less than ideal. For bulk processing of a large number of simulations, total simulation time can be reduced by using the most efficient configuration for each application.

2) Simulator Overhead: Table II shows simulator performance for several benchmarks from the SPLASH-2 suite. The table lists the native execution time for each application on a single 8-core machine, as well as overall simulation runtimes on one and eight host machines. The slowdowns experienced over native execution for each of these cases are also presented. The data in Table II demonstrate that Graphite achieves very good performance for all the benchmarks studied.

Application       Native    1 machine          8 machines
                  Time      Time   Slowdown    Time   Slowdown
cholesky          1.99      689    346         508    255
fft               0.02      80     3978        78     3930
fmm               7.11      670    94          298    41
lu_cont           0.072     288    4007        212    2952
lu_non_cont       0.08      244    3061        163    2038
ocean_cont        0.33      168    515         66     202
ocean_non_cont    0.41      177    433         78     190
radix             0.11      178    1648        63     584
water_nsquared    0.30      742    2465        396    1317
water_spatial     0.13      129    966         82     616
Mean              -         -      1751        -      1213
Median            -         -      1307        -      616

TABLE II: Multi-Machine Scaling Results. Wall-clock execution time of SPLASH-2 simulations running on 1 and 8 host machines (8 and 64 host cores). Times are in seconds. Slowdowns are calculated relative to native execution.
Fig. 4: Simulation speed-up as the number of host machines is increased from 1 to 10 for a matrix-multiply kernel with 1024 application threads running on a 1024-tile target architecture.

The total runtime for all the benchmarks is on the order of a few minutes, with a median slowdown of 616x over native execution. This high performance makes Graphite a very useful tool for rapid architecture exploration and software development for future architectures.

As can be seen from the table, the speed of the simulation relative to native execution time is highly application dependent, with the simulation slowdown being as low as 41x for fmm and as high as 3930x for fft. This depends, among other things, on the computation-to-communication ratio for the application: applications with a high computation-to-communication ratio are able to more effectively parallelize and hence show higher simulation speeds.

3) Scaling with Large Target Architectures: This section presents performance results for a large target architecture containing 1024 tiles and explores the scaling of such simulations. Figure 4 shows the normalized speed-up of a 1024-thread matrix-multiply kernel running across different numbers of host machines. matrix-multiply was chosen because it scales well to large numbers of threads, while still having frequent synchronization via messages with neighbors.
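The LaxP2P sleep computation from Section III-F, whose accuracy and overhead are evaluated below, reduces to the formula s = c / r. A minimal sketch (the function name and parameters are illustrative assumptions, not Graphite's code):

```python
# Sketch of LaxP2P's sleep decision: if a randomly chosen partner is
# more than `slack` cycles behind, the leading tile sleeps for
# s = c / r seconds, where c is the clock difference and r is the
# observed simulation rate in simulated cycles per wall-clock second.

def laxp2p_sleep_seconds(my_clock: int, partner_clock: int,
                         slack: int, total_cycles: int,
                         total_wall_seconds: float) -> float:
    """Return how long (in seconds) this tile should sleep, or 0.0."""
    c = my_clock - partner_clock
    if c <= slack:
        return 0.0  # within the allowed slack: keep running
    # r approximated by total progress: simulated cycles / wall-clock time.
    r = total_cycles / total_wall_seconds
    return c / r    # sleep long enough for the partner to catch up
```

Because each tile only ever consults one randomly chosen partner, this check needs no global structures, which is why it scales better than a barrier.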
Lax LaxP2P LaxBarrier 1mc 4mc 1mc 4mc 1mc 4mc Run-time 1.0 0.55 1.10 0.59 1.82 1.09 Scaling 1.80 1.84 1.69 Error(%) 7.56 1.28 1.31 CoV(%) 0.58 0.31 0.09 TABLEIII:MeanperformanceandaccuracystatisticsfordatapresentedinFigure5.Scalingistheperformanceimprovementgoingfrom1to4hostmachines.Thisgraphshowssteadyperformanceimprovementforuptotenmachines.Performanceimprovesbyafactorof3:85withtenmachinescomparedtoasinglemachine.Speed-upisconsistentasmachinesareadded,closelymatchingalinearcurve.Weexpectscalingtocontinueasmoremachinesareadded,asthenumberofhostcoresisnotclosetosaturatingtheparallelismavailableintheapplication.C.LaxsynchronizationGraphitesupportsseveralsynchronizationmodels,namelylaxsynchronizationanditsbarrierandpoint-to-pointvariants,tomitigatetheclockskewbetweendifferenttargetcoresandincreasetheaccuracyoftheobservedresults.Thissectionprovidessimulatorperformanceandaccuracyresultsforthethreemodels,andshowsthetrade-offsofferedbyeach.1)Simulatorperformance:Figure5aandTableIIIillus-tratethesimulatorperformance(wall-clocksimulationtime)ofthethreesynchronizationmodelsusingthreeSPLASH-2benchmarks.Eachsimulationisrunononeandfourhostmachines.Thebarrierintervalwaschosenas1,000cyclestogiveveryaccurateresults.TheslackvalueforLaxP2Pwaschosentogiveagoodtrade-offbetweenperformanceandaccuracy,whichwasdeterminedtobe100,000cycles.ResultsarenormalizedtotheperformanceofLaxononehostmachine.WeobservethatLaxoutperformsbothLaxP2PandLaxBar-rierduetoitslowersynchronizationoverhead.PerformanceofLaxalsoincreasesconsiderablywhengoingfromonemachinetofourmachines(1.8).LaxP2PperformsonlyslightlyworsethanLax.Itshowsanaverageslowdownof1.10and1.07whencomparedtoLaxononeandfourhostmachinesrespectively.LaxP2Pshowsgoodscalabilitywithaperformanceimprovementof1.84goingfromonetofourhostmachines.ThisismainlyduetothedistributednatureofsynchronizationinLaxP2P,allowingittoscaletoalargernumberofhostcores.LaxBarrierperformspoorlyasexpected.Itencountersanaverageslowdownof1.82and1.94whencomparedtoLaxononeandfour
host machines respectively. Although the performance improvement of LaxBarrier when going from one to four host machines is comparable to the other schemes, we expect the rate of performance improvement to decrease rapidly as the number of target tiles is increased, due to the inherent non-scalable nature of barrier synchronization.

2) Simulation error: This study examines simulation error and variability for the various synchronization models. Results are generated from ten runs of each benchmark using the same parameters as the previous study. We compare results for single- and multi-machine simulations, as distribution across machines involves high-latency network communication that potentially introduces new sources of error and variability.

[Figure: simulator speedup (0–4) vs. number of host machines (1–10).]

(a) Normalized run-time   (b) Error (%)   (c) Coefficient of variation (%)
Fig. 5: Performance and accuracy data comparison for different synchronization schemes. Data is collected from SPLASH-2 benchmarks on one and four host machines, using ten runs of each simulation. (a) Simulation run-time in seconds, normalized to Lax on one host machine. (b) Simulation error, given as percentage deviation from LaxBarrier on one host machine. (c) Simulation variability, given as the coefficient of variation for each type of simulation.

(a) Lax   (b) LaxP2P   (c) LaxBarrier
Fig. 6: Clock skew in simulated cycles during the course of simulation for various synchronization models. Data collected running the fmm SPLASH-2 benchmark.

Figure 5b, Figure 5c and Table III show the error and coefficient of variation of the synchronization models. The error data is presented as the percentage deviation of the mean simulated application run-time (in cycles) from some baseline. The baseline we choose is LaxBarrier, as it gives highly accurate results. The coefficient of variation (CoV) is a measure of how consistent results are from run to run. It is defined as the ratio of standard deviation to mean, as a percentage. Error and CoV values close to 0.0% are best.

As seen in the table, LaxBarrier shows the best CoV (0.09%). This is expected, as the barrier forces target cores to run in lock-step, so there is little opportunity for deviation. We also observe that LaxBarrier shows very accurate results across four host machines. This is also expected, as the barrier eliminates clock skew that occurs due to variable communication latencies.

LaxP2P shows both good error (1.28%) and CoV (0.32%). Despite the large slack size, by preventing the occurrence of outliers LaxP2P maintains low CoV and error. In fact, LaxP2P shows error nearly identical to LaxBarrier. The main difference between the schemes is that LaxP2P has modestly higher CoV.

Lax shows the worst error (7.56%). This is expected because only application events synchronize target tiles. As shown below, Lax allows thread clocks to vary significantly, giving more opportunity for the final simulated run-time to vary. For the same reason, Lax has the worst CoV (0.58%).

3) Clock skew: Figure 6 shows the approximate clock skew of each synchronization model during one run of the SPLASH-2 benchmark fmm. The graph shows the difference between the maximum and minimum clocks in the system at a given time.

These results match what one expects from the various synchronization models. Lax shows by far the greatest skew, and application synchronization events are clearly visible. The skew of LaxP2P is several orders of magnitude less than Lax, but application synchronization events are still visible and skew is on the order of 10,000 cycles. LaxBarrier has the least skew, as one would expect. Application synchronization events are largely undetectable; skew appears constant throughout execution.¹

4) Summary: Graphite offers three synchronization models that give a trade-off between simulation speed and accuracy. Lax gives optimal performance while achieving reasonable accuracy, but it also lets threads deviate considerably during simulation. This means that fine-grained interactions can be missed or misrepresented. On the other extreme, LaxBarrier forces tight synchronization and accurate results, at the cost of performance and scaling. LaxP2P lies somewhere in between, keeping threads from deviating too far and giving very accurate results, while only reducing performance by 10%.

¹ Spikes in the graphs, as seen in Figure 6c, are due to approximations in the calculation of clock skew. See [17].
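The LaxP2P scheme described above (periodic, random, point-to-point synchronization within a slack window) can be sketched as a toy model. This is not Graphite's implementation: the LaxP2P class, the CHECK_INTERVAL constant, and the stall-until-caught-up policy are illustrative assumptions (the slack window of 100,000 cycles matches the value used in the evaluation, but the real mechanism differs in detail), shown only to make the skew-bounding idea concrete.

```python
import random
import threading

SLACK = 100_000          # slack window in simulated cycles (value from the evaluation)
CHECK_INTERVAL = 10_000  # cycles between synchronization checks (illustrative choice)

class LaxP2P:
    """Toy point-to-point lax synchronization: every core advances a
    private clock and periodically compares it against one randomly
    chosen peer, stalling while it is more than SLACK cycles ahead."""

    def __init__(self, num_cores):
        self.clocks = [0] * num_cores
        self.max_skew = 0                   # largest observed clock spread
        self.cond = threading.Condition()

    def advance(self, core, cycles):
        with self.cond:
            self.clocks[core] += cycles
            self.max_skew = max(self.max_skew,
                                max(self.clocks) - min(self.clocks))
            self.cond.notify_all()          # wake any peer waiting on this core

    def maybe_sync(self, core):
        with self.cond:
            peer = random.choice([i for i in range(len(self.clocks))
                                  if i != core])
            # A core that has run too far ahead of its chosen peer waits
            # until the peer's clock is back within the slack window.
            while self.clocks[core] - self.clocks[peer] > SLACK:
                self.cond.wait()

    def run_core(self, core, total_cycles):
        for _ in range(total_cycles // CHECK_INTERVAL):
            self.advance(core, CHECK_INTERVAL)
            self.maybe_sync(core)
```

With two cores the skew is deterministically bounded by SLACK + CHECK_INTERVAL, since the peer is always the lagging core; with more cores the random peer choice bounds skew only probabilistically, which mirrors the residual skew visible for LaxP2P in Figure 6b.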
Finally, we observed while running these results that the parameters to the synchronization models can be tuned to match application behavior. For example, some applications can tolerate large barrier intervals with no measurable degradation in accuracy. This allows LaxBarrier to achieve performance near that of LaxP2P for some applications.

D. Application Studies

1) Cache Miss-rate Characterization: We replicate the study performed by Woo et al. [16] characterizing cache miss rates as a function of cache line size. Target architectural parameters are chosen to match those in [16] as closely as possible. In particular, Graphite's L1 cache models are disabled to simulate a single-level cache. The architectures still differ, however, as Graphite simulates x86 instructions whereas [16] uses SGI machines. Our results match the expected trends for each benchmark studied, although the miss rates do not match exactly due to architectural differences. Further details can be found in [17].

2) Scaling of Cache Coherence: As processors scale to ever-increasing core counts, the viability of cache coherence in future manycores remains unsettled. This study explores three cache coherence schemes as a demonstration of Graphite's ability to explore this relevant architectural question, as well as its ability to run large simulations.

Graphite supports a few cache coherence protocols. A limited directory MSI protocol with i sharers, denoted Dir_i NB [18], is the default cache coherence protocol. Graphite also supports full-map directories and the LimitLESS protocol.²

Figure 7 shows the comparison of the different cache coherency schemes in the application blackscholes, a member of the PARSEC benchmark suite [8]. blackscholes is nearly perfectly parallel, as little information is shared between cores. However, by tracking all requests through the memory system, we observed that some global addresses in the system libraries are heavily shared as read-only data. All tests were run using the simsmall input. The blackscholes source code was unmodified.

[Figure 7 plot: application speedup vs. number of target tiles (1–256) for LimitLESS4, full-map directory, Dir16NB, and Dir4NB.]
Fig. 7: Different cache coherency schemes are compared using speedup relative to simulated single-tile execution in blackscholes by scaling target tile count.

As seen in Figure 7, blackscholes achieves near-perfect scaling with the full-map directory and LimitLESS directory protocols up to 32 target tiles. Beyond 32 target tiles, parallelization overhead begins to outstrip performance gains. From simulator results, we observe that larger target tile counts give increased average memory access latency. This occurs in at least two ways: (i) increased network distance to memory controllers, and (ii) additional latency at memory controllers. Latency at the memory controller increases because the default target architecture places a memory controller at every tile, evenly splitting total off-chip bandwidth. This means that as the number of target tiles increases, the bandwidth at each controller decreases proportionally, and the service time for a memory request increases. Queueing delay also increases by statically partitioning the bandwidth into separate queues, but results show that this effect is less significant.

The LimitLESS and full-map protocols exhibit little differentiation from one another. This is expected, as the heavily shared data is read-only. Therefore, once the data has been cached, the LimitLESS protocol will exhibit the same characteristics as the full-map protocol.

The limited map directory protocols do not scale. Dir4NB does not exhibit scaling beyond four target tiles. Because only four sharers can cache any given memory line at a time, heavily shared read data is constantly evicted at higher target tile counts. This serializes memory references and damages performance. Likewise, the Dir16NB protocol does not exhibit scaling beyond sixteen target cores.

² In the LimitLESS protocol, a limited number of hardware pointers exist for the first i sharers and additional requests to shared data are handled by a software trap, preventing the need to evict existing sharers [19].

V. RELATED WORK

Because simulation is such an important tool for computer architects, a wide variety of different simulators and emulators exists. Conventional sequential simulators/emulators include SimpleScalar [1], RSIM [2], SimOS [3], Simics [4], and QEMU [5]. Some of these are capable of simulating parallel target architectures, but all of them execute sequentially on the host machine. Like Graphite, Proteus [20] is designed to simulate highly parallel architectures and uses direct execution and configurable, swappable models. However, it too runs only on sequential hosts.

The projects most closely related to Graphite are parallel simulators of parallel target architectures, including: SimFlex [21], GEMS [22], COTSon [23], BigSim [11], FastMP [24], SlackSim [12], Wisconsin Wind Tunnel (WWT) [25], Wisconsin Wind Tunnel II (WWTII) [10], and those described by Chidester and George [26], and Penry et al. [27].

SimFlex and GEMS both use an off-the-shelf sequential emulator (Simics) for functional modeling plus their own models for memory systems and core interactions. Because Simics is a closed-source commercial product, it is difficult to experiment with different core architectures. GEMS uses their timing model to drive Simics one instruction at a time, which results in much lower performance than Graphite. SimFlex avoids
eforsharedmemoryandonlyrunsonCM-5machines,makingitimpracticalformodernusage.GraphitehasseveralsimilaritieswithWWTII.Bothusedirectexecutionandprovidesharedmemoryacrossaclusterofmachines.However,WWTIIdoesnotmodelanythingotherthanthetargetmemorysystemandrequiresapplicationstobemodiedtoexplicitlyallocatesharedmemoryblocks.Graphitealsomodelscomputecoresandcommunicationnetworksandimplementsatransparentsharedmemorysystem.Inaddition,WWTIIusesaverydifferentquantum-basedsynchronizationschemeratherthanlaxsynchronization.Penryetal.provideamuchmoredetailed,low-levelsim-ulationandaretargetinghardwaredesigners.Theirsimulator,whilefastforacycle-accuratehardwaremodel,doesnotprovidetheperformancenecessaryforrapidexplorationofdifferentideasorsoftwaredevelopment.Theproblemofacceleratingslowsimulationshasbeenaddressedinanumberofdifferentwaysotherthanlarge-scaleparallelization.ProtoFlex[9],FAST[6],andHASim[28]alluseFPGAstoimplementtimingmodelsforcycle-accuratesimulations.ProtoFlexandFASTimplementtheirfunctionalmodelsinsoftwarewhileHASimimplementsfunctionalmod-elsintheFPGAaswell.Theseapproachesrequiretheusertobuyexpensivespecial-purposehardwarewhileGraphiterunsoncommodityLinuxmachines.Inaddition,implementinganewmodelinanFPGAismoredifcultthansoftware,makingithardertoquicklyexperimentwithdifferentdesigns.Othersimulatorsimproveperformancebymodelingonlyaportionofthetotalexecution.FastMP[24]estimatesper-formanceforparallelworkloadswithnomemorysharing(suchasSPECrate)bycarefullysimulatingonlysomeoftheindependentprocessesandusingthoseresultstomodeltheothers.Finally,simulatorssuchasSimFlex[21]usestatisticalsamplingbycarefullymodelingshortsegmentsoftheoverallprogramrunandassumingthattherestoftherunissimilar.AlthoughGraphitedoesmakesomeapproximations,itdiffersfromtheseprojectsinthatitobservesandmodelsthebehavioroftheentireapplicationexecution.Theideaofmaintainingindependentlocalclocksandusingtimestampsonmessagestosynchronizethemduringinterac-tionswaspioneeredbytheTimeWarpsystem[29]andusedintheGeorgi
aTechTimeWarp[30],BigSim[11],andSlack-Sim[12].Therstthreesystemsassumethatperfectorderingmustbemaintainedandrollbackwhenthetimestampsindicateout-of-orderevents.SlackSim(developedconcurrentlywithGraphite)istheonlyothersystemthatallowseventstooccuroutoforder.Itallowsallthreadstorunfreelyaslongastheirlocalclocksremainwithinaspeciedwindow.SlackSim's“unboundedslack”modeisessentiallythesameasplainlaxsynchronization.However,itsapproachtolimitingslackreliesonacentralmanagerwhichmonitorsallthreadsusingsharedmemory.This(alongwithotherfactors)restrictsittorunningonasinglehostmachineandultimatelylimitsitsscalability.Graphite'sLaxP2Piscompletelydistributedandenablesscalingtolargernumbersoftargettilesandhostmachines.BecauseGraphitehasmoreaggressivegoalsthanSlackSim,itrequiresmoresophisticatedtechniquestomitigateandcompensateforex-cessiveslack.TreadMarks[31]implementsagenericdistributedsharedmemorysystemacrossaclusterofmachines.However,itre-quirestheprogrammertoexplicitlyallocateblocksofmemorythatwillbekeptconsistentacrossthemachines.Thisrequiresapplicationsthatassumeasinglesharedaddressspace(e.g.,pthreadapplications)toberewrittentousetheTreadMarksinterface.Graphiteoperatestransparently,providingasinglesharedaddressspacetooff-the-shelfapplications.VI.CONCLUSIONSGraphiteisadistributed,parallelsimulatorfordesign-spaceexplorationoflarge-scalemulticoresandapplicationsresearch.Itusesavarietyoftechniquestodeliverthehighperformanceandscalabilityneededforusefulevaluationsincluding:directexecution,multi-machinedistribution,analyticalmodeling,andlaxsynchronization.SomeofGraphite'sotherkeyfeaturesareitsexibleandextensiblearchitecturemodeling,itscom-patibilitywithcommoditymulticoresandclusters,itsabilitytorunoff-the-shelfpthreadsapplicationbinaries,anditssupportforasinglesharedsimulatedaddressspacedespiterunningacrossmultiplehostmachines.OurresultsdemonstratethatGraphiteishighperformanceandachievesslowdownsaslittleas41overnativeexecutionforsimulationsofSPLASH-2benchmarksona32-tiletarget.Weals
odemonstratethatGraphiteisscalable,obtainingnearlinearspeeduponasimulationofa1000-tiletargetusingfrom1to10hostmachines.Lastly,thisworkevaluatesseverallaxsynchronizationsimulationstrategiesandcharacterizestheirperformanceversusaccuracy.Wedevelopanovelsynchro-nizationstrategycalledLaxP2Pforbothhighperformanceand accuracybasedonperiodic,random,point-to-pointsynchro-nizationsbetweentargettiles.OurresultsshowthatLaxP2Pperformsonaveragewithin8%ofthehighestperformancestrategywhilekeepingaverageerrorto1.28%ofthemostaccuratestrategyforthestudiedbenchmarks.Graphitewillbereleasedtothecommunityasopen-sourcesoftwaretofosterresearchonlarge-scalemulticorearchitec-turesandapplications.ACKNOWLEDGEMENTTheauthorswouldliketothankJamesPsotaforhisearlyhelpinresearchingpotentialimplementationstrategiesandpointingustowardsPin.HisroleasliaisonwiththePinteamatIntelwasalsogreatlyappreciated.ThisworkwaspartiallyfundedbytheNationalScienceFoundationunderGrantNo.0811724.REFERENCES[1]T.Austin,E.Larson,andD.Ernst,“SimpleScalar:Aninfrastructureforcomputersystemmodeling,”IEEEComputer,vol.35,no.2,pp.59–67,2002.[2]C.J.Hughes,V.S.Pai,P.Ranganathan,andS.V.Adve,“Rsim:Sim-ulatingshared-memorymultiprocessorswithilpprocessors,”Computer,vol.35,no.2,pp.40–49,2002.[3]M.Rosenblum,S.Herrod,E.Witchel,andA.Gupta,“Completecomputersystemsimulation:TheSimOSapproach,”IEEEParallel&DistributedTechnology:Systems&Applications,vol.3,no.4,pp.34–43,Winter1995.[4]P.Magnusson,M.Christensson,J.Eskilson,D.Forsgren,G.Hallberg,J.Hogberg,F.Larsson,A.Moestedt,andB.Werner,“Simics:Afullsystemsimulationplatform,”IEEEComputer,vol.35,no.2,pp.50–58,Feb2002.[5]F.Bellard,“QEMU,afastandportabledynamictranslator,”inATEC'05:Proc.oftheUSENIXAnnualTechnicalConference2005onUSENIXAnnualTechnicalConference,Berkeley,CA,USA,2005.[6]D.Chiou,D.Sunwoo,J.Kim,N.A.Patil,W.Reinhart,D.E.Johnson,J.Keefe,andH.Angepat,“FPGA-AcceleratedSimulationTechnologies(FAST):Fast,Full-System,Cycle-AccurateSimulators,”inMICRO'07:Proceedingsofthe40thAnnualIEEE/ACMInternationa
lSymposiumonMicroarchitecture,2007,pp.249–261.[7]A.KleinOsowskiandD.J.Lilja,“MinneSPEC:AnewSPECbench-markworkloadforsimulation-basedcomputerarchitectureresearch,”ComputerArchitectureLetters,vol.1,Jun.2002.[8]C.Bienia,S.Kumar,J.P.Singh,andK.Li,“ThePARSECbenchmarksuite:Characterizationandarchitecturalimplications,”inProc.ofthe17thInternationalConferenceonParallelArchitecturesandCompila-tionTechniques(PACT),October2008.[9]E.S.Chung,M.K.Papamichael,E.Nurvitadhi,J.C.Hoe,K.Mai,andB.Falsa,“ProtoFlex:TowardsScalable,Full-SystemMultiprocessorSimulationsUsingFPGAs,”ACMTrans.RecongurableTechnol.Syst.,vol.2,no.2,pp.1–32,2009.[10]S.S.Mukherjee,S.K.Reinhardt,B.Falsa,M.Litzkow,M.D.Hill,D.A.Wood,S.Huss-Lederman,andJ.R.Larus,“WisconsinWindTunnelII:Afast,portableparallelarchitecturesimulator,”IEEEConcurrency,vol.8,no.4,pp.12–20,Oct–Dec2000.[11]G.Zheng,G.Kakulapati,andL.V.Kal´e,“BigSim:Aparallelsimulatorforperformancepredictionofextremelylargeparallelmachines,”in18thInternationalParallelandDistributedProcessingSymposium(IPDPS),Apr2004,p.78.[12]J.Chen,M.Annavaram,andM.Dubois,“SlackSim:APlatformforParallelSimulationsofCMPsonCMPs,”SIGARCHComput.Archit.News,vol.37,no.2,pp.20–29,2009.[13]M.B.Taylor,W.Lee,J.Miller,D.Wentzlaff,I.Bratt,B.Greenwald,H.Hoffman,P.Johnson,J.Kim,J.Psota,A.Saraf,N.Shnidman,V.Strumpen,M.Frank,S.Amarasinghe,andA.Agarwal,“EvaluationoftheRawmicroprocessor:Anexposed-wire-delayarchitectureforILPandstreams,”inProc.oftheInternationalSymposiumonComputerArchitecture,Jun.2004,pp.2–13.[14]C.-K.Luk,R.Cohn,R.Muth,H.Patil,A.Klauser,G.Lowney,S.Wallace,V.J.Reddi,andK.Hazelwood,“Pin:Buildingcustomizedprogramanalysistoolswithdynamicinstrumentation,”inPLDI'05:Proc.ofthe2005ACMSIGPLANconferenceonProgramminglanguagedesignandimplementation,June2005,pp.190–200.[15]D.Wentzlaff,P.Grifn,H.Hoffmann,L.Bao,B.Edwards,C.Ramey,M.Mattina,C.-C.Miao,J.F.Brown,andA.Agarwal,“On-chipinterconnectionarchitectureoftheTileprocessor,”IEEEMicro,vol.27,no.5,pp.15–31,Sept-Oct2007.[16]S.C.Woo,M.Ohara,E.Torrie,J.P
.Singh,andA.Gupta,“TheSPLASH-2programs:characterizationandmethodologicalconsidera-tions,”inISCA'95:Proc.ofthe22ndannualinternationalsymposiumonComputerarchitecture,June1995,pp.24–36.[17]J.Miller,H.Kasture,G.Kurian,N.Beckmann,C.GruenwaldIII,C.Celio,J.Eastep,andA.Agarwal,“Graphite:Adistributedsimulatorformulticores,”Cambridge,MA,USA,Tech.Rep.MIT-CSAIL-TR-2009-056,November2009.[18]A.Agarwal,R.Simoni,J.Hennessy,andM.Horowitz,“Anevaluationofdirectoryschemesforcachecoherence,”inISCA'88:Proc.ofthe15thAnnualInternationalSymposiumonComputerarchitecture,LosAlamitos,CA,USA,1988,pp.280–298.[19]D.Chaiken,J.Kubiatowicz,andA.Agarwal,“Limitlessdirectories:Ascalablecachecoherencescheme,”inProc.oftheFourthInternationalConferenceonArchitecturalSupportforProgrammingLanguagesandOperatingSystems(ASPLOSIV,1991,pp.224–234.[20]E.A.Brewer,C.N.Dellarocas,A.Colbrook,andW.E.Weihl,“Proteus:ahigh-performanceparallel-architecturesimulator,”inSIGMETRICS'92/PERFORMANCE'92:Proc.ofthe1992ACMSIGMETRICSjointinternationalconferenceonMeasurementandmodelingofcomputersystems,NewYork,NY,USA,1992,pp.247–248.[21]T.F.Wenisch,R.E.Wunderlich,M.Ferdman,A.Ailamaki,B.Falsa,andJ.C.Hoe,“SimFlex:Statisticalsamplingofcomputersystemsimulation,”IEEEMicro,vol.26,no.4,pp.18–31,July-Aug2006.[22]M.M.K.Martin,D.J.Sorin,B.M.Beckmann,M.R.Marty,M.Xu,A.R.Alameldeen,K.E.Moore,M.D.Hill,andD.A.Wood,“Multifacet'sgeneralexecution-drivenmultiprocessorsimulator(GEMS)toolset,”SIGARCHComput.Archit.News,vol.33,no.4,pp.92–99,November2005.[23]M.Monchiero,J.H.Ahn,A.Falc´on,D.Ortega,andP.Faraboschi,“Howtosimulate1000cores,”SIGARCHComput.Archit.News,vol.37,no.2,pp.10–19,2009.[24]S.Kanaujia,I.E.Papazian,J.Chamberlain,andJ.Baxter,“FastMP:Amulti-coresimulationmethodology,”inMOBS2006:WorkshoponModeling,BenchmarkingandSimulation,June2006.[25]S.K.Reinhardt,M.D.Hill,J.R.Larus,A.R.Lebeck,J.C.Lewis,andD.A.Wood,“Thewisconsinwindtunnel:virtualprototypingofparallelcomputers,”inSIGMETRICS'93:Proc.ofthe1993ACMSIGMETRICSconferenceonMeasurementandmodelingofcomp
utersystems,1993,pp.48–60.[26]M.ChidesterandA.George,“Parallelsimulationofchip-multiprocessorarchitectures,”ACMTrans.Model.Comput.Simul.,vol.12,no.3,pp.176–200,2002.[27]D.A.Penry,D.Fay,D.Hodgdon,R.Wells,G.Schelle,D.I.August,andD.Connors,“Exploitingparallelismandstructuretoacceleratethesimulationofchipmulti-processors,”inHPCA'06:TheTwelfthInternationalSymposiumonHigh-PerformanceComputerArchitecture,Feb2006,pp.29–40.[28]N.Dave,M.Pellauer,andJ.Emer,“Implementingafunctional/timingpartitionedmicroprocessorsimulatorwithanFPGA,”in2ndWorkshoponArchitectureResearchusingFPGAPlatforms(WARFP2006),Feb2006.[29]D.R.Jefferson,“Virtualtime,”ACMTransactionsonProgrammingLanguagesandSystems,vol.7,no.3,pp.404–425,July1985.[30]S.Das,R.Fujimoto,K.Panesar,D.Allison,andM.Hybinette,“GTW:ATimeWarpSystemforSharedMemoryMultiprocessors,”inWSC'94:Proceedingsofthe26thconferenceonWintersimulation,1994,pp.1332–1339.[31]C.Amza,A.Cox,S.Dwarkadas,P.Keleher,H.Lu,R.Rajamony,W.Yu,andW.Zwaenepoel,“TreadMarks:Sharedmemorycomputingonnetworksofworkstations,”IEEEComputer,vol.29,no.2,pp.18–28,Feb1996.