[...] explicit representation even for programs that are distributed across many nodes.

We begin in Section 2 with more discussion of related work in parallel runtime systems. Section 3 covers how light-weight Realm events enable a deferred execution model, how deferred execution differs from standard asynchronous models, and motivates the need for the other novel features of Realm including reservations and physical regions. Section 4 gives an overview of the Realm interface. Each subsequent section highlights one of our primary contributions:

Section 5 describes the use of Realm events and how events are implemented with generational events. The low cost of managing events in Realm is central to the overall design, as Realm clients allocate events at rates in the tens of thousands per second per node.

Section 6 introduces reservations for performing synchronization in a deferred execution model, thereby enabling relaxed execution orderings not expressible in many explicit systems. We show how reservations are typically used by Realm clients and present an efficient implementation.

Section 7 covers Realm's physical region system and how it supports data movement in a deferred execution model. We also cover Realm's novel support for bulk reductions.

Section 8 evaluates Realm on a collection of microbenchmarks that stress-test our implementations of events, reservations, and data movement operations. We show that the performance of Realm's primitives approaches the limits of the underlying hardware.

Section 9 details the performance of three real-world applications written in both an implicit representation model and in Realm's explicit deferred execution model. In one case we also compare against an independently written and optimized MPI code. We find that applications using Realm range from 22-135% faster than equivalent implicit versions.

2. BACKGROUND AND RELATED WORK

In this section we discuss both related work and the distinctions between implicit/explicit and static/dynamic runtime systems in more detail. The presentation of Realm begins in Section 3.

A categorization of a number of (but by no means all) classic and recent parallel runtimes is given in Table 1.

    Implicit Representation Systems:
        MPI [36], GASNet [41], Co-Array Fortran [33], UPC [12], Titanium [40],
        Chapel [15], X10 [17], HJ [14], Cilk [9], Charm++ [29]
    Explicit Representation Systems (Static, Dynamic):
        TAM [19], Id [4], Sequoia [22, 26], DPJ [10], StarPU [5], TAM [19],
        Tarragon [18], CnC [31, 11], Realm, Lucid [27], CGD [37]

    Table 1: Categorization of parallel runtimes.

The most widely used high-performance parallel runtime is MPI [36], which is an implicit representation system. MPI implements a bulk-synchronous model in which programs are divided into phases of computation, communication, and synchronization [24]. A common bulk-synchronous idiom is a loop that alternates phases:

    while (...) {
      compute(...);       // local computation
      barrier;
      communicate(...);
      barrier;
    }

By design, computation and communication cannot happen at the same time in a bulk synchronous execution, idling significant machine resources. Thus, MPI has evolved to include asynchronous operations for overlapping communication and computation. Consider the following MPI-like code:

    receive(x, ...);
    Y;                    // see discussion
    sync;
    f(x);

Here x is a local buffer that is the target of a receive operation copying data from remote memory. The receive executes asynchronously (the main thread continues execution with Y while the receive is also executing), but the only way to safely use the contents of x is to perform a sync operation that blocks until the receive completes.

It is the responsibility of the programmer to find useful computation Y to overlap with the receive. There are several constraints on the choice of Y. The execution time of Y must not be too short (or the latency of the receive will not be hidden) and it must not be too long (or the continuation f(x) will be unnecessarily delayed). Since Y can't use x, Y must be an unrelated computation, which can result in non-modular code that is difficult to maintain. Thus, the programmer is responsible not only for adding sufficient synchronization but also for static scheduling (e.g., overlapping Y with the receive). Other implicit runtimes have the same issue, as only the programmer has the knowledge of what can be parallelized.

The PGAS languages (UPC [12], Titanium [40] and Co-Array Fortran [33]) are, for the purposes of this discussion, very similar to MPI. Other programming models that differ significantly from the bulk synchronous model still require user-placed and potentially blocking synchronization to join asynchronous computations (e.g., Cilk's spawn/sync [9], X10's [17] and Habanero Java's [14] async/finish and Chapel's rich collection of synchronization constructs [15]). The actor model provided by Charm++ [29] is a form of implicit representation, as the runtime has no knowledge of what messages will be sent by a message handler until it is actually executed. Charm++ also provides futures that allow asynchronous computations to be called; a thread using a future executes an implicit synchronization operation if the value is not yet available.

Static explicit systems vary widely in how they represent the graph of dependent operations. Some, including classic dataflow systems such as Id [4], provide languages that are very close to the underlying static graphs. There are also recent examples, particularly for coarse-grain static dataflow [...]

[...] asynchronously. Next, the asynchronous copy from a to b would block pending the availability of a. Up to this point, the two models are the same: A is running and the copy is waiting on the completion of A. However, in standard asynchronous execution the client makes no further progress because it has the responsibility of waiting until it is safe to issue the copy. In deferred execution, this responsibility is delegated to the runtime, and the client can immediately continue with the spawn of B, even if the data in a is not yet ready. Deferred execution allows the client to continue to build the explicit graph, enabling the runtime to discover and construct the graph for the independent chain of operations C and D while A is executing. Furthermore, once processor p1 is available after executing A, the two chains of operations can execute in parallel. A similar effect can be achieved in implicit systems by hoisting C and D above B, but this solution places the burden for scheduling on the programmer (recall Section 2).

A key to enabling deferred execution in Realm is making events inexpensive. With the client able to issue operations far ahead of the actual execution, a large number of events are needed to track the dependences between operations. As we show in Section 9, Realm clients can generate tens of thousands of unique events per second per node during execution. In Section 5, we describe the implementation of generational events that are crucial to making events cheap and deferred execution practical.

Realm's novel operations are integrated into the deferred execution model as well. For example, a reservation request immediately returns an event that triggers when the reservation is granted. In this way, reservations are a deferred execution version of locks that do not block and allow another operation's execution to be dependent on the reservation's acquisition. Similarly, Realm's data movement operations are deferred. Realm's support for novel bulk reductions allows clients to construct sophisticated asynchronous reduction trees that match the shape of the machine's memory hierarchy. We illustrate a real-world application that leverages this feature in Section 9.

4. REALM INTERFACE

Realm is a low-level runtime, providing a small set of primitives for performing computations on heterogeneous, distributed memory machines. The focus is on providing the necessary mechanisms, while leaving the policy decisions (e.g. which processor to run a task on, what copies to perform) under the complete control of the client. While that client may be a programmer coding directly to the Realm interface, the expected usage model for Realm is as a target for higher-level languages and runtimes.

The Realm interface is shown in Figure 1. Except for the static singleton machine object, object instances are light-weight handles that uniquely name the underlying object. Every handle is valid everywhere in the system, allowing handles to be freely copied, passed as arguments, and stored in the heap. For performance, Realm does not track where handles propagate. In the case of events, this creates an interesting problem of knowing when it is safe to reclaim resources associated with events (see Section 5). In the rest of this section we explain Realm processor and machine objects. In subsequent sections we present Realm's events, reservations, and physical regions.

     1  class Event {
     2    const unsigned id, gen;
     3    static const Event NO_EVENT;
     4
     5    bool has_triggered() const;
     6    void wait() const;
     7    static Event merge_events(const set<Event> &to_merge);
     8  };
     9
    10  class UserEvent : public Event {
    11    static UserEvent create_user_event();
    12    void trigger(Event wait_on = NO_EVENT) const;
    13  };
    14  class Processor {
    15    const unsigned id;
    16    typedef unsigned TaskFuncID;
    17    typedef void (*TaskFuncPtr)(void *args, size_t arglen, Processor p);
    18    typedef map<TaskFuncID, TaskFuncPtr> TaskIDTable;
    19
    20    enum Kind { CPU_PROC, GPU_PROC /* ... */ };
    21    Kind kind() const;
    22
    23    Event spawn(TaskFuncID func_id, const void *args, size_t arglen,
    24                Event wait_on) const;
    25  };
    26  class Reservation {
    27    const unsigned id;
    28    Event acquire(Event wait_on = NO_EVENT) const;
    29    void release(Event wait_on = NO_EVENT) const;
    30
    31    static Reservation create_reservation(size_t payload_size = 0);
    32    void *payload_ptr();
    33    void destroy_lock();
    34  };
    35  class Memory {
    36    const unsigned id;
    37    size_t size() const;
    38  };
    39
    40  class PhysicalRegion {
    41    const unsigned id;
    42    static const PhysicalRegion NO_REGION;
    43
    44    static PhysicalRegion create_region(size_t num_elmts, size_t elmt_size);
    45    void destroy_region() const;
    46
    47    ptr_t alloc();
    48    void free(ptr_t p);
    49
    50    RegionInstance create_instance(Memory memory) const;
    51    RegionInstance create_instance(Memory memory,
    52                                   ReductionOpID redopid) const;
    53    void destroy_instance(RegionInstance instance,
    54                          Event wait_on = NO_EVENT) const;
    55  };
    56
    57  class RegionInstance {
    58    const unsigned id;
    59
    60    void *element_data_ptr(ptr_t p);
    61    Event copy_to(RegionInstance target, Event wait_on = NO_EVENT);
    62    Event reduce_to(RegionInstance target, ReductionOpID redopid,
    63                    Event wait_on = NO_EVENT);
    64  };
    65  class Machine {
    66    Machine(int *argc, char ***argv,
    67            const Processor::TaskIDTable &task_table);
    68
    69    void run(Processor::TaskFuncID task_id,
    70             const void *args, size_t arglen);
    71
    72    static Machine *get_machine(void);
    73    const set<Memory> &get_all_memories(void) const;
    74    const set<Processor> &get_all_processors(void) const;
    75
    76    int get_proc_mem_affinity(vector<ProcMemAffinity> &result, ...);
    77    int get_mem_mem_affinity(vector<MemMemAffinity> &result, ...);
    78  };

    Figure 1: Runtime Interface.
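To make the deferred execution model of Section 3 concrete, the following is a minimal single-threaded C++ sketch of the Event and spawn portions of Figure 1. It is an illustration under stated assumptions, not Realm's implementation: `ToyRuntime`, `GenEvent`, `run_demo`, and the gating event `go` are hypothetical names, tasks run inline in the triggering call rather than on processors, and there is no distribution. Event handles are `(id, gen)` pairs and triggered slots are recycled for the next generation, in the spirit of the generational events the paper describes in Section 5. The task names loosely mirror the A, copy, B chain and the independent C, D chain discussed above.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Lightweight handle, mirroring the (id, gen) pair of Event in Figure 1.
struct Event { unsigned id, gen; };
static const Event NO_EVENT{0, 0};

// Hypothetical single-threaded stand-in for a deferred-execution runtime.
class ToyRuntime {
  struct GenEvent {
    unsigned gen = 0;        // current generation of this slot
    bool triggered = true;   // has the current generation triggered?
    std::vector<std::function<void()>> waiters;  // operations deferred on it
  };
  std::vector<GenEvent> pool_;

 public:
  ToyRuntime() : pool_(1) {}  // slot 0 backs NO_EVENT (always triggered)

  // Reuse a slot whose current generation has triggered, else grow the pool.
  Event create_event() {
    for (unsigned i = 1; i < pool_.size(); ++i)
      if (pool_[i].triggered) {
        pool_[i].triggered = false;
        return Event{i, ++pool_[i].gen};
      }
    pool_.emplace_back();
    pool_.back().gen = 1;
    pool_.back().triggered = false;
    return Event{static_cast<unsigned>(pool_.size() - 1), 1};
  }

  // Succeeds if the slot is triggered or has moved on to a later generation.
  bool has_triggered(Event e) const {
    return pool_[e.id].triggered || pool_[e.id].gen > e.gen;
  }

  void trigger(Event e) {
    assert(pool_[e.id].gen == e.gen && !pool_[e.id].triggered);
    pool_[e.id].triggered = true;
    std::vector<std::function<void()>> ready;
    ready.swap(pool_[e.id].waiters);  // detach first: ops may modify the pool
    for (auto &op : ready) op();
  }

  // Returns immediately; the task runs only once wait_on has triggered.
  Event spawn(std::function<void()> task, Event wait_on = NO_EVENT) {
    Event done = create_event();
    auto op = [this, task, done] { task(); trigger(done); };
    if (has_triggered(wait_on)) op();
    else pool_[wait_on.id].waiters.push_back(op);
    return done;
  }
};

// Issues two dependence chains up front, then releases them with one trigger.
std::vector<std::string> run_demo() {
  ToyRuntime rt;
  std::vector<std::string> log;
  auto task = [&log](std::string name) {
    return [&log, name] { log.push_back(name); };
  };
  Event go = rt.create_event();          // plays the role of a user event
  Event a  = rt.spawn(task("A"), go);
  Event cp = rt.spawn(task("copy"), a);  // copy depends on A
  rt.spawn(task("B"), cp);               // B depends on the copy
  Event c  = rt.spawn(task("C"), go);    // independent chain C -> D
  rt.spawn(task("D"), c);
  assert(log.empty());                   // whole graph issued, nothing has run
  rt.trigger(go);
  return log;
}
```

Calling `run_demo()` from a `main` function returns the tasks in an order that respects both chains. The point of the sketch is that the client never waits: every `spawn` returns an event at once, and the dependence graph is fully built before any task executes. A blocking client would instead have to wait on `a` before it could even issue the copy.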
[...] node o indicating node n should be informed when e triggers. Any additional dependent operations on node n are added to n's local list of e's waiters without further communication. When e triggers, the owner node o notifies all local waiters and sends an event trigger message to each subscribed node. If the owner node o receives a subscription message after e triggers, o immediately responds with a trigger message.

The triggering of an event may occur on any node. When it occurs on a node t other than the owner o, a trigger message is sent from t to o, which forwards that message to all other subscribed nodes. The triggering node t notifies its local waiters immediately; no message is sent from o back to t. While a remote event trigger results in the latency of a triggering operation being at least two active message flight times, it bounds the number of active messages required per event trigger to $2N - 2$ where N is the number of nodes monitoring the event (which is generally a small fraction of the total number of machine nodes). An alternative is to share the subscriber list so that the triggering node can notify all interested nodes directly. However, such an algorithm is both more complicated (due to race conditions) and requires $O(N^2)$ active messages. Any algorithm super-linear in the number of nodes in the system will not scale well, and as we show in Section 8.1, the latency of a single event trigger active message is very small.

5.2 Generational Events

There are several constraints on the lifetime of the data structure used to represent an event e. Creation and triggering of e can each happen only once, but any number of operations may depend on e. Furthermore, some operations depending on e may not even be requested until long after e has triggered. Therefore, the data structure used to represent e cannot be freed until all these operations depending on e have been registered. Some systems ask the programmer to explicitly create and destroy events [5], but this is problematic when most events are created by Realm rather than the programmer. Other systems address this issue by reference counting events [30], but reference counting adds client and runtime overhead even on a single node, and incurs even greater cost in a distributed memory implementation.

Instead of freeing event data structures, our implementation aggressively recycles them. Compared to reference counting, our implementation requires fewer total event data structures and has no client or runtime overhead. The key observation is that one generational event data structure can represent one untriggered event and a large number (e.g., $2^{32} - 1$) of already-triggered events. We extend each event handle to include a generation number and the identifier for its generational event. Each generational event records how many generations have already triggered. A generational event can be reused for a new generation as soon as the current generation triggers. (Any new operation dependent on a previous generation can immediately be executed.) To create a new event, a node finds a generational event in the triggered state (or creates one if all existing generational events owned by the node are in the untriggered state), increases the generation by one, and sets the generational event's state to untriggered. As before, this can be done with no inter-node communication.

An example of how multiple events can be represented by a single generational event is shown in Figure 3. Timelines for events x, y, and z indicate where creation (C), triggering (T) and queries (Q) occur. Queries that succeed (i.e. the event has triggered) are shown with solid arrows, while those that fail are dotted. The lifetime of an event extends from its creation until the last operation (trigger or query) performed on it. Although the lifetime of event x overlaps with those of y and z, the untriggered intervals are non-overlapping, and all three can be mapped onto generational event w, with event x being assigned generation 1, y being assigned 2, and z being assigned 3. A query on the generational event succeeds if the generational event is either in the triggered state or has a current generation larger than the one associated with the query.

    Figure 3: Generational Event Timelines.

Nodes maintain generational event data structures for both events they own as well as remote events that they have observed. Remote generational event data structures record the most recent generation known to have triggered as well as the generation of the most recent subscription message sent (if any). Remote generational events enable an interesting optimization. If a remote generational event receives a query on a later generation than its current generation, it can infer that all generations up to the requested generation have triggered, because the new generation(s) of the event were able to be created by the event's owner. All local waiters for earlier generations can be notified even before receiving the event trigger message for the current generation.

In Section 8 we show that the latency of event triggering is very low, even in the case of dense, distributed graphs of dependent operations. In Section 9 we show that our generational event implementation results in a large reduction in space requirements to record events for real applications.

6. RESERVATIONS

Recall that the tasks on the right of Figure 2 are the actual application-level tasks and that the mapping tasks on the left dynamically compute where the application-level tasks should run. The higher-level runtime's mapping tasks may be run in any order, but each requires exclusive access to a shared data structure. In most systems, locks are used for atomic data access. However, standard locking primitives [...]

[...] physical regions to manage the layout and movement of data in a deferred execution model.

A physical region defines an addressing scheme for a collection of elements of a common type. To a first approximation, physical regions of elements of type T are arrays of T with additional metadata to support efficient copies, reductions, and allocation/deallocation of elements within the region. Realm supports creating multiple instances of a physical region in different memories for replication or data migration. Because all instances of a physical region r use the same addressing scheme, Realm has sufficient information to perform deferred copies between instances of r.

Lines 35-64 of Figure 1 show a subset of the interface for physical regions (we omit portions of the interface due to space constraints). Each physical memory is named by a Memory object (line 35). Physical region objects are constructed by defining the maximum number and size of elements (line 44). Physical region instances are created in a specific Memory (line 50). To maintain performance transparency, there is no virtualization of memory: each Memory is sized based on a physical capacity and a new instance can be allocated only if sufficient space remains. Instances must be explicitly destroyed,
which can be contingent on an event (lines 53-54).

Elements can be dynamically allocated or freed within a physical region (lines 47-48). Elements are accessed by Realm pointers of type ptr_t. By definition, a Realm pointer into physical region r is valid for every instance of r regardless of its memory location (line 60). This allows Realm pointers to be stored in data structures and reused later, even if instances have been moved around. In common cases, pointer indexing reduces to inexpensive array address calculations which are compiled to individual loads and stores.

Realm supports copy operations between instances of the same physical region (line 61). Copy operations in Figure 2 are rhomboids marked copy. Like all other Realm operations, copy requests accept an event precondition and return an event that triggers upon completion of the copy. Realm does not guarantee the coherence of data between different instances; coherence must be explicitly managed by the client via copy operations.

7.1 Reduction Instances

If a task only performs reductions on an instance, a special reduction-only instance may be created (lines 51-52). In Figure 2, each task T_i is mapped to use reduction-only instance t_i to accumulate reductions in the GPU zero-copy memory. Using the reduce_to method (lines 62-63), these reduction buffers are eventually applied to a normal instance residing in GASNet memory. We detail this pattern in a real-world application in Section 9. Bulk reduction operations are rhomboids marked reduce in Figure 2.

Reduction-only instances differ in two important ways from normal instances. First, the per-element storage in reduction-only instances is sized to hold the "right-hand side" of the reduction operation (e.g., the v in struct.field += v). Second, individual reductions are accumulated (atomically) into the local reduction instance, which can then be sent as a batched reduction to a remote target instance. When multiple reductions are made to the same element, they are folded locally, further reducing communication. The fold operation is not always identical to the reduction operation. For example, if the reduction is exponentiation, the corresponding fold is multiplication:

    (r[i] **= a) **= b   ⇔   r[i] **= (a * b)

The client registers reduction and fold operations at system startup (omitted from Figure 1 due to space constraints). Reduction instances can also be folded into other reduction instances to build hierarchical bulk reductions matching the memory hierarchy.

Realm supports two classes of reduction-only instances. A reduction fold instance is similar to a normal physical region in that it is implemented as an array indexed by the same element indices. The difference is that each instance element is a value of the reduction's right-hand-side type, which is often smaller than the region's element type (the left-hand-side type). A reduction operation simply folds the supplied right-hand-side value into the corresponding array location. When the reduction fold instance p is reduced to a normal instance r, first p is copied to r's location, where Realm automatically applies p to r by invoking the reduction function once per location via a cache-friendly linear sweep over the memory.

The second kind of reduction instance is a reduction list instance, where the instance is implemented as a list of reductions. A reduction list instance logs every reduction operation (the pointer location and right-hand-side value). When the reduction list instance p is reduced to a normal physical region r, p is transferred and replayed at r's location. In cases where the list of reductions is smaller than the number of elements in r, the reduction in data transferred can yield better performance (see Section 8.3).

8. MICROBENCHMARKS

We evaluate our Realm implementation using microbenchmarks that test whether performance approaches the capacity of the underlying hardware. All experiments were run on the Keeneland supercomputer [38]. Each Keeneland KIDS node is composed of two Xeon 5660 CPUs, three Tesla M2090 GPUs, and 24 GB of DRAM. Nodes are connected by an Infiniband QDR interconnect.

8.1 Event Latency and Trigger Rates

We use two microbenchmarks to evaluate event performance. The first tests event triggering latency, both within and between nodes. Processors are organized in a ring and each processor creates a user event dependent on the previous processor's event. The first event in the chain of dependent events is triggered and the time until the triggering of the chain's last event is measured; dividing the total time by the number of events in the chain yields the mean trigger time. In the single-node case, all events are local to that node, so no active messages are required. For all other cases, the ring uses a single processor per node so that every trigger requires the transmission (and reception) of an event trigger active message.

    Nodes                     1      2      4      8      16
    Mean Trigger Time (μs)    0.329  3.259  3.799  3.862  4.013

    Figure 4: Event Latency Results.

Figure 4 shows the mean trigger times. The cost of manipulating the data structures and running dependent operations [...]

    Figure 7: Data Placement and Movement for Circuit.

[...] AMR is an adaptive mesh refinement benchmark based on the third heat equation example from the Berkeley Labs BoxLib project [32]. AMR simulates the two dimensional heat diffusion equation using three different levels of refinement. Each level is partitioned and distributed across the machine. Time steps require both intra- and inter-level communication and synchronization. Dependences between tasks from the same and different levels are again expressed through events.

9.1 Event Lifetimes

To illustrate the need for generational event data structures, we instrumented Realm to capture information about the lifetime of events. The usage of events by all three applications was similar, so we present representative results from just one. Figure 6c shows a timeline of the execution of the Fluid application on 16 nodes using 128 cores. The dynamic events line measures the total number of event creations. A large number of events are created (over 260,000 in less than 20 seconds of execution) and allocating separate storage for every event would clearly be difficult for long-running applications.

An event is live until its last operation (e.g., query, trigger) is performed. After an event's last operation a reference counting implementation would recover the event's associated storage. The live events line in Figure 6c is therefore the number of needed events in a reference counting scheme. In this example, reference counting reduces the storage needed for dynamic events by over 10X, but with the additional overhead associated with reference counting. This line also gives a lower bound for the number of events when the application performs explicit creation and destruction of events.

As discussed in Section 5.2, our implementation requires storage that grows with the maximum number of untriggered events, a number that is 10X smaller than even the maximal live event count. The actual storage requirements of our Realm implementation are shown by the generational events line, which shows the total number of generational events in the system. The maximum number of generational events needed is slightly larger than the peak number of untriggered events because nodes must create a new event locally if they have no available (i.e. triggered) generational events, even if there are available generational events on remote nodes. Overall, our implementation uses 5X less storage than a reference counting implementation and avoids any related overhead. These savings would likely be even more dramatic for longer runs of the application, as the number of live events is steadily growing as the application runs, while the peak number of generational events needed appears to occur during the start-up of the application. Overall this demonstrates the ability of generational events to represent large numbers of live events with minimal storage overhead.

9.2 Reservation Performance

The Circuit and AMR applications both made use of reservations, creating 3336 and 1393 reservations respectively. Of all created reservations in both applications, 14% were migrated at least once between nodes. The grant rates for both applications are orders of magnitude smaller than the maximum reservation grant rates achieved by our reservation microbenchmarks in Section 8.2. Thus, for these benchmarks reservations were needed to express non-blocking synchronization and were far from being a performance limiter.

9.3 Comparison with Implicit Representations

We now attempt to estimate the performance gains attributable to the latency hiding provided by deferred execution. To compare with a standard implicit implementation, we modified each Realm application to wait for events in the application code immediately before the dependent operation rather than supplying them as preconditions. While this methodology has the disadvantage that our approximation of an implicit runtime may not be as fast as a purpose-built one, it has the great advantage of controlling for the myriad possible performance effects in comparing two completely different implementations: any performance differences will be due exactly to the more relaxed execution ordering enabled by deferred execution.

Figure 8a shows three curves for the AMR application: the original Realm version, the implicit version, and an independently written and optimized MPI implementation [32]. Observe that the implicit version is competitive with the independent MPI code, which is some evidence that using the implicit version as a reference point is reasonable. Both the Realm and implicit versions start out ahead of MPI due to better mapping decisions provided by the higher-level runtime [6]. The Realm implementation of AMR uses a simple all-to-all pattern for its communication. The additional latency inherent in this pattern is hidden well by the deferred execution model, but is a bottleneck for the implicit version, resulting in up to 102% slowdown at 16 nodes. The MPI version continues to scale by using much more complicated asynchronous communication patterns, but the need for blocking synchronization primitives still results in exposed communication latency, causing a 66% slowdown relative to Realm on 16 nodes.

It is worth emphasizing that in principle the MPI code can be just as fast as (or faster than) Realm: there is nothing in Realm that a programmer cannot emulate with sufficient effort using the primitives available in any implicit runtime system. However, as discussed in Section 2, this programming work is substantial, difficult to maintain, and often machine specific; it is our experience that few programmers undertake it.

Figures 8b and 8c show performance results for the Circuit and Fluid applications respectively; for these applications we do not have independently optimized distributed memory implementations and so we compare only the Realm and implicit implementations. Each plot contains performance curves for both implementations on two different problem