Explicit representation even for programs that are distributed across many nodes

min-jolicoeur
Uploaded On 2016-07-19

Presentation Transcript

[...] explicit representation even for programs that are distributed across many nodes.

We begin in Section 2 with more discussion of related work in parallel runtime systems. Section 3 covers how light-weight Realm events enable a deferred execution model, how deferred execution differs from standard asynchronous models, and motivates the need for the other novel features of Realm including reservations and physical regions. Section 4 gives an overview of the Realm interface. Each subsequent section highlights one of our primary contributions:

Section 5 describes the use of Realm events and how events are implemented with generational events. The low cost of managing events in Realm is central to the overall design, as Realm clients allocate events at rates in the tens of thousands per second per node.

Section 6 introduces reservations for performing synchronization in a deferred execution model, thereby enabling relaxed execution orderings not expressible in many explicit systems. We show how reservations are typically used by Realm clients and present an efficient implementation.

Section 7 covers Realm's physical region system and how it supports data movement in a deferred execution model. We also cover Realm's novel support for bulk reductions.

Section 8 evaluates Realm on a collection of microbenchmarks that stress-test our implementations of events, reservations, and data movement operations. We show that the performance of Realm's primitives approaches the limits of the underlying hardware.

Section 9 details the performance of three real-world applications written in both an implicit representation model and in Realm's explicit deferred execution model. In one case we also compare against an independently written and optimized MPI code. We find that applications using Realm range from 22-135% faster than equivalent implicit versions.

2. BACKGROUND AND RELATED WORK

In this section we discuss both related work and the distinctions between implicit/explicit and static/dynamic runtime systems in more detail. The presentation of Realm begins in Section 3. A categorization of a number of (but by no means all) classic and recent parallel runtimes is given in Table 1.

Table 1: Categorization of parallel runtimes.
  Implicit Representation Systems: MPI [36], GASNet [41], Co-Array Fortran [33], UPC [12], Titanium [40], Chapel [15], X10 [17], HJ [14], Cilk [9], Charm++ [29]
  Explicit Representation Systems (columns Static and Dynamic): TAM [19], Id [4], Sequoia [22, 26], DPJ [10], StarPU [5], TAM [19], Tarragon [18], CnC [31, 11], Realm, Lucid [27], CGD [37]

The most widely used high-performance parallel runtime is MPI [36], which is an implicit representation system. MPI implements a bulk-synchronous model in which programs are divided into phases of computation, communication, and synchronization [24]. A common bulk-synchronous idiom is a loop that alternates phases:

    while (...) {
        compute(...);        // local computation
        barrier;
        communicate(...);
        barrier;
    }

By design, computation and communication cannot happen at the same time in a bulk synchronous execution, idling significant machine resources. Thus, MPI has evolved to include asynchronous operations for overlapping communication and computation. Consider the following MPI-like code:

    receive(x, ...);
    Y;                       // see discussion
    sync;
    f(x);

Here x is a local buffer that is the target of a receive operation copying data from remote memory. The receive executes asynchronously (the main thread continues execution with Y while the receive is also executing), but the only way to safely use the contents of x is to perform a sync operation that blocks until the receive completes. It is the responsibility of the programmer to find useful computation Y to overlap with the receive. There are several constraints on the choice of Y. The execution time of Y must not be too short (or the latency of the receive will not be hidden) and it must not be too long (or the continuation f(x) will be unnecessarily delayed). Since Y can't use x, Y must be an unrelated computation, which can result in non-modular code that is difficult to maintain. Thus, the programmer is responsible not only for adding sufficient synchronization but also for static scheduling (e.g., overlapping Y with the receive).

Other implicit runtimes have the same issue, as only the programmer has the knowledge of what can be parallelized. The PGAS languages (UPC [12], Titanium [40] and Co-Array Fortran [33]) are, for the purposes of this discussion, very similar to MPI. Other programming models that differ significantly from the bulk synchronous model still require user-placed and potentially blocking synchronization to join asynchronous computations (e.g., Cilk's spawn/sync [9], X10's [17] and Habanero Java's [14] async/finish and Chapel's rich collection of synchronization constructs [15]). The actor model provided by Charm++ [29] is a form of implicit representation, as the runtime has no knowledge of what messages will be sent by a message handler until it is actually executed. Charm++ also provides futures that allow asynchronous computations to be called; a thread using a future executes an implicit synchronization operation if the value is not yet available.

Static explicit systems vary widely in how they represent the graph of dependent operations. Some, including classic dataflow systems such as Id [4], provide languages that are very close to the underlying static graphs. There are also recent examples, particularly for coarse-grain static dataflow [...]
[...] asynchronously. Next, the asynchronous copy from a to b would block pending the availability of a. Up to this point, the two models are the same: A is running and the copy is waiting on the completion of A. However, in standard asynchronous execution the client makes no further progress because it has the responsibility of waiting until it is safe to issue the copy. In deferred execution, this responsibility is delegated to the runtime, and the client can immediately continue with the spawn of B, even if the data in a is not yet ready. Deferred execution allows the client to continue to build the explicit graph, enabling the runtime to discover and construct the graph for the independent chain of operations C and D while A is executing. Furthermore, once processor p1 is available after executing A, the two chains of operations can execute in parallel. A similar effect can be achieved in implicit systems by hoisting C and D above B, but this solution places the burden for scheduling on the programmer (recall Section 2).

A key to enabling deferred execution in Realm is making events inexpensive. With the client able to issue operations far ahead of the actual execution, a large number of events are needed to track the dependences between operations. As we show in Section 9, Realm clients can generate tens of thousands of unique events per second per node during execution. In Section 5, we describe the implementation of generational events that are crucial to making events cheap and deferred execution practical.

Realm's novel operations are integrated into the deferred execution model as well. For example, a reservation request immediately returns an event that triggers when the reservation is granted. In this way, reservations are a deferred execution version of locks that do not block and allow another operation's execution to be dependent on the reservation's acquisition. Similarly, Realm's data movement operations are deferred. Realm's support for novel bulk reductions allows clients to construct sophisticated asynchronous reduction trees that match the shape of the machine's memory hierarchy. We illustrate a real-world application that leverages this feature in Section 9.

4. REALM INTERFACE

Realm is a low-level runtime, providing a small set of primitives for performing computations on heterogeneous, distributed memory machines.
The focus is on providing the necessary mechanisms, while leaving the policy decisions (e.g. which processor to run a task on, what copies to perform) under the complete control of the client. While that client may be a programmer coding directly to the Realm interface, the expected usage model for Realm is as a target for higher-level languages and runtimes.

The Realm interface is shown in Figure 1. Except for the static singleton machine object, object instances are light-weight handles that uniquely name the underlying object. Every handle is valid everywhere in the system, allowing handles to be freely copied, passed as arguments, and stored in the heap. For performance, Realm does not track where handles propagate. In the case of events, this creates an interesting problem of knowing when it is safe to reclaim resources associated with events (see Section 5). In the rest of this section we explain Realm processor and machine objects. In subsequent sections we present Realm's events, reservations, and physical regions.

     1  class Event {
     2    const unsigned id, gen;
     3    static const Event NO_EVENT;
     4
     5    bool has_triggered() const;
     6    void wait() const;
     7    static Event merge_events(const set<Event> &to_merge);
     8  };
     9
    10  class UserEvent : public Event {
    11    static UserEvent create_user_event();
    12    void trigger(Event wait_on = NO_EVENT) const;
    13  };
    14  class Processor {
    15    const unsigned id;
    16    typedef unsigned TaskFuncID;
    17    typedef void (*TaskFuncPtr)(void *args, size_t arglen, Processor p);
    18    typedef map<TaskFuncID, TaskFuncPtr> TaskIDTable;
    19
    20    enum Kind { CPU_PROC, GPU_PROC /* ... */ };
    21    Kind kind() const;
    22
    23    Event spawn(TaskFuncID func_id, const void *args, size_t arglen,
    24                Event wait_on) const;
    25  };
    26  class Reservation {
    27    const unsigned id;
    28    Event acquire(Event wait_on = NO_EVENT) const;
    29    void release(Event wait_on = NO_EVENT) const;
    30
    31    static Reservation create_reservation(size_t payload_size = 0);
    32    void *payload_ptr();
    33    void destroy_lock();
    34  };
    35  class Memory {
    36    const unsigned id;
    37    size_t size() const;
    38  };
    39
    40  class PhysicalRegion {
    41    const unsigned id;
    42    static const PhysicalRegion NO_REGION;
    43
    44    static PhysicalRegion create_region(size_t num_elmts, size_t elmt_size);
    45    void destroy_region() const;
    46
    47    ptr_t alloc();
    48    void free(ptr_t p);
    49
    50    RegionInstance create_instance(Memory memory) const;
    51    RegionInstance create_instance(Memory memory,
    52                                   ReductionOpID redopid) const;
    53    void destroy_instance(RegionInstance instance,
    54                          Event wait_on = NO_EVENT) const;
    55  };
    56
    57  class RegionInstance {
    58    const unsigned id;
    59
    60    void *element_data_ptr(ptr_t p);
    61    Event copy_to(RegionInstance target, Event wait_on = NO_EVENT);
    62    Event reduce_to(RegionInstance target, ReductionOpID redopid,
    63                    Event wait_on = NO_EVENT);
    64  };
    65  class Machine {
    66    Machine(int *argc, char ***argv,
    67            const Processor::TaskIDTable &task_table);
    68
    69    void run(Processor::TaskFuncID task_id,
    70             const void *args, size_t arglen);
    71
    72    static Machine *get_machine(void);
    73    const set<Memory> &get_all_memories(void) const;
    74    const set<Processor> &get_all_processors(void) const;
    75
    76    int get_proc_mem_affinity(vector<ProcMemAffinity> &result, ...);
    77    int get_mem_mem_affinity(vector<MemMemAffinity> &result, ...);
    78  };

Figure 1: Runtime Interface.

[Figure 3: Generational Event Timelines. Timelines for events x, y, and z and for generational event w (generations 0 to 3), marking creation (C), trigger (T), and query (Q) points.]

[...] node o indicating node n should be informed when e triggers. Any additional dependent operations on node n are added to n's local list of e's waiters without further communication. When e triggers, the owner node o notifies all local waiters and sends an event trigger message to each subscribed node. If the owner node o receives a subscription message after e triggers, o immediately responds with a trigger message.

The triggering of an event may occur on any node. When it occurs on a node t other than the owner o, a trigger message is sent from t to o, which forwards that message to all other subscribed nodes. The triggering node t notifies its local waiters immediately; no message is sent from o back to t. While a remote event trigger results in the latency of a triggering operation being at least two active message flight times, it bounds the number of active messages required per event trigger to 2N - 2, where N is the number of nodes monitoring the event (which is generally a small fraction of the total number of machine nodes). An alternative is to share the subscriber list so that the triggering node can notify all interested nodes directly. However, such an algorithm is both more complicated (due to race conditions) and requires O(N^2) active messages. Any algorithm super-linear in the number of nodes in the system will not scale well, and as we show in Section 8.1, the latency of a single event trigger active message is very small.

5.2 Generational Events

There are several constraints on the lifetime of the data structure used to represent an event e. Creation and triggering of e can each happen only once, but any number of operations may depend on e. Furthermore, some operations depending on e may not even be requested until long after e has triggered. Therefore, the data structure used to represent e cannot be freed until all these operations depending on e have been registered. Some systems ask the programmer to explicitly create and destroy events [5], but this is problematic when most events are created by Realm rather than the programmer. Other systems address this issue by reference counting events [30], but reference counting adds client and runtime overhead even on a single node, and incurs even greater cost in a distributed memory implementation.

Instead of freeing event data structures, our implementation aggressively recycles them. Compared to reference counting, our implementation requires fewer total event data structures and has no client or runtime overhead. The key observation is that one generational event data structure can represent one untriggered event and a large number (e.g., 2^32 - 1) of already-triggered events. We extend each event handle to include a generation number and the identifier for its generational event. Each generational event records how many generations have already triggered. A generational event can be reused for a new generation as soon as the current generation triggers. (Any new operation dependent on a previous generation can immediately be executed.) To create a new event, a node finds a generational event in the triggered state (or creates one if all existing generational events owned by the node are in the untriggered state), increases the generation by one, and sets the generational event's state to untriggered. As before, this can be done with no inter-node communication.

An example of how multiple events can be represented by a single generational event is shown in Figure 3. Timelines for events x, y, and z indicate where creation (C), triggering (T) and queries (Q) occur. Queries that succeed (i.e. the event has triggered) are shown with solid arrows, while those that fail are dotted. The lifetime of an event extends from its creation until the last operation (trigger or query) performed on it. Although the lifetime of event x overlaps with those of y and z, the untriggered intervals are non-overlapping, and all three can be mapped onto generational event w, with event x being assigned generation 1, y being assigned 2, and z being assigned 3. A query on the generational event succeeds if the generational event is either in the triggered state or has a current generation larger than the one associated with the query.

Nodes maintain generational event data structures for both events they own as well as remote events that they have observed. Remote generational event data structures record the most recent generation known to have triggered as well as the generation of the most recent subscription message sent (if any). Remote generational events enable an interesting optimization. If a remote generational event receives a query on a later generation than its current generation, it can infer that all generations up to the requested generation have triggered, because the new generation(s) of the event were able to be created by the event's owner. All local waiters for earlier generations can be notified even before receiving the event trigger message for the current generation.

In Section 8 we show that the latency of event triggering is very low, even in the case of dense, distributed graphs of dependent operations. In Section 9 we show that our generational event implementation results in a large reduction in space requirements to record events for real applications.

6. RESERVATIONS

Recall that the tasks on the right of Figure 2 are the actual application-level tasks and that the mapping tasks on the left dynamically compute where the application-level tasks should run. The higher-level runtime's mapping tasks may be run in any order, but each requires exclusive access to a shared data structure. In most systems, locks are used for atomic data access. However, standard locking primitives [...]

[...] physical regions to manage the layout and movement of data in a deferred execution model.

A physical region defines an addressing scheme for a collection of elements of a common type. To a first approximation, physical regions of elements of type T are arrays of T with additional metadata to support efficient copies, reductions, and allocation/deallocation of elements within the region. Realm supports creating multiple instances of a physical region in different memories for replication or data migration. Because all instances of a physical region r use the same addressing scheme, Realm has sufficient information to perform deferred copies between instances of r.

Lines 35-64 of Figure 1 show a subset of the interface for physical regions (we omit portions of the interface due to space constraints). Each physical memory is named by a Memory object (line 35). Physical region objects are constructed by defining the maximum number and size of elements (line 44). Physical region instances are created in a specific Memory (line 50). To maintain performance transparency, there is no virtualization of memory: each Memory is sized based on a physical capacity and a new instance can be allocated only if sufficient space remains. Instances must be explicitly destroyed, which can be contingent on an event (lines 53-54). Elements can be dynamically allocated or freed within a physical region (lines 47-48).

Elements are accessed by Realm pointers of type ptr_t. By definition, a Realm pointer into physical region r is valid for every instance of r regardless of its memory location (line 60). This allows Realm pointers to be stored in data structures and reused later, even if instances have been moved around. In common cases, pointer indexing reduces to inexpensive array address calculations which are compiled to individual loads and stores.

Realm supports copy operations between instances of the same physical region (line 61). Copy operations in Figure 2 are rhomboids marked copy. Like all other Realm operations, copy requests accept an event precondition and return an event that triggers upon completion of the copy. Realm does not guarantee the coherence of data between different instances; coherence must be explicitly managed by the client via copy operations.

7.1 Reduction Instances

If a task only performs reductions on an instance, a special reduction-only instance may be created (lines 51-52). In Figure 2, each task T_i is mapped to use reduction-only instance t_i to accumulate reductions in the GPU zero-copy memory. Using the reduce_to method (lines 62-63), these reduction buffers are eventually applied to a normal instance residing in GASNet memory. We detail this pattern in a real-world application in Section 9. Bulk reduction operations are rhomboids marked reduce in Figure 2.

Reduction-only instances differ in two important ways from normal instances. First, the per-element storage in reduction-only instances is sized to hold the "right-hand side" of the reduction operation (e.g., the v in struct.field += v). Second, individual reductions are accumulated (atomically) into the local reduction instance, which can then be sent as a batched reduction to a remote target instance. When multiple reductions are made to the same element, they are folded locally, further reducing communication. The fold operation is not always identical to the reduction operation. For example, if the reduction is exponentiation, the corresponding fold is multiplication:

    (r[i] **= a) **= b    is equivalent to    r[i] **= (a * b)

The client registers reduction and fold operations at system startup (omitted from Figure 1 due to space constraints). Reduction instances can also be folded into other reduction instances to build hierarchical bulk reductions matching the memory hierarchy.

Realm supports two classes of reduction-only instances. A reduction fold instance is similar to a normal physical region in that it is implemented as an array indexed by the same element indices. The difference is that each instance element is a value of the reduction's right-hand-side type, which is often smaller than the region's element type (the left-hand-side type). A reduction operation simply folds the supplied right-hand-side value into the corresponding array location. When the reduction fold instance p is reduced to a normal instance r, first p is copied to r's location, where Realm automatically applies p to r by invoking the reduction function once per location via a cache-friendly linear sweep over the memory.

The second kind of reduction instance is a reduction list instance, where the instance is implemented as a list of reductions. A reduction list instance logs every reduction operation (the pointer location and right-hand-side value). When the reduction list instance p is reduced to a normal physical region r, p is transferred and replayed at r's location. In cases where the list of reductions is smaller than the number of elements in r, the reduction in data transferred can yield better performance (see Section 8.3).

8. MICROBENCHMARKS

We evaluate our Realm implementation using microbenchmarks that test whether performance approaches the capacity of the underlying hardware. All experiments were run on the Keeneland supercomputer [38]. Each Keeneland KIDS node is composed of two Xeon 5660 CPUs, three Tesla M2090 GPUs, and 24 GB of DRAM. Nodes are connected by an Infiniband QDR interconnect.

8.1 Event Latency and Trigger Rates

We use two microbenchmarks to evaluate event performance. The first tests event triggering latency, both within and between nodes. Processors are organized in a ring and each processor creates a user event dependent on the previous processor's event. The first event in the chain of dependent events is triggered and the time until the triggering of the chain's last event is measured; dividing the total time by the number of events in the chain yields the mean trigger time. In the single-node case, all events are local to that node, so no active messages are required. For all other cases, the ring uses a single processor per node so that every trigger requires the transmission (and reception) of an event trigger active message.

    Nodes                     1       2       4       8       16
    Mean Trigger Time (μs)    0.329   3.259   3.799   3.862   4.013

Figure 4: Event Latency Results.

Figure 4 shows the mean trigger times. The cost of manipulating the data structures and running dependent operations [...]

[Figure 7: Data Placement and Movement for Circuit.]

[...] AMR is an adaptive mesh refinement benchmark based on the third heat equation example from the Berkeley Labs BoxLib project [32]. AMR simulates the two dimensional heat diffusion equation using three different levels of refinement. Each level is partitioned and distributed across the machine. Time steps require both intra- and inter-level communication and synchronization. Dependences between tasks from the same and different levels are again expressed through events.

9.1 Event Lifetimes

To illustrate the need for generational event data structures, we instrumented Realm to capture information about the lifetime of events. The usage of events by all three applications was similar, so we present representative results from just one. Figure 6c shows a timeline of the execution of the Fluid application on 16 nodes using 128 cores. The dynamic events line measures the total number of event creations. A large number of events are created (over 260,000 in less than 20 seconds of execution), and allocating separate storage for every event would clearly be difficult for long-running applications.

An event is live until its last operation (e.g., query, trigger) is performed. After an event's last operation a reference counting implementation would recover the event's associated storage. The live events line in Figure 6c is therefore the number of needed events in a reference counting scheme. In this example, reference counting reduces the storage needed for dynamic events by over 10X, but with the additional overhead associated with reference counting. This line also gives a lower bound for the number of events when the application performs explicit creation and destruction of events.

As discussed in Section 5.2, our implementation requires storage that grows with the maximum number of untriggered events, a number that is 10X smaller than even the maximal live event count. The actual storage requirements of our Realm implementation are shown by the generational events line, which shows the total number of generational events in the system. The maximum number of generational events needed is slightly larger than the peak number of untriggered events because nodes must create a new event locally if they have no available (i.e. triggered) generational events, even if there are available generational events on remote nodes. Overall, our implementation uses 5X less storage than a reference counting implementation and avoids any related overhead. These savings would likely be even more dramatic for longer runs of the application, as the number of live events is steadily growing as the application runs, while the peak number of generational events needed appears to occur during the start-up of the application. Overall this demonstrates the ability of generational events to represent large numbers of live events with minimal storage overhead.

9.2 Reservation Performance

The Circuit and AMR applications both made use of reservations, creating 3336 and 1393 reservations respectively. Of all created reservations in both applications, 14% were migrated at least once between nodes. The grant rates for both applications are orders of magnitude smaller than the maximum reservation grant rates achieved by our reservation microbenchmarks in Section 8.2. Thus, for these benchmarks reservations were needed to express non-blocking synchronization and were far from being a performance limiter.

9.3 Comparison with Implicit Representations

We now attempt to estimate the performance gains attributable to the latency hiding provided by deferred execution. To compare with a standard implicit implementation, we modified each Realm application to wait for events in the application code immediately before the dependent operation rather than supplying them as preconditions. While this methodology has the disadvantage that our approximation of an implicit runtime may not be as fast as a purpose-built one, it has the great advantage of controlling for the myriad possible performance effects in comparing two completely different implementations: any performance differences will be due exactly to the more relaxed execution ordering enabled by deferred execution.

Figure 8a shows three curves for the AMR application: the original Realm version, the implicit version, and an independently written and optimized MPI implementation [32]. Observe that the implicit version is competitive with the independent MPI code, which is some evidence that using the implicit version as a reference point is reasonable. Both the Realm and implicit versions start out ahead of MPI due to better mapping decisions provided by the higher-level runtime [6]. The Realm implementation of AMR uses a simple all-to-all pattern for its communication. The additional latency inherent in this pattern is hidden well by the deferred execution model, but is a bottleneck for the implicit version, resulting in up to 102% slowdown at 16 nodes. The MPI version continues to scale by using much more complicated asynchronous communication patterns, but the need for blocking synchronization primitives still results in exposed communication latency, causing a 66% slowdown relative to Realm on 16 nodes.

It is worth emphasizing that in principle the MPI code can be just as fast as (or faster than) Realm: there is nothing in Realm that a programmer cannot emulate with sufficient effort using the primitives available in any implicit runtime system. However, as discussed in Section 2, this programming work is substantial, difficult to maintain, and often machine specific; it is our experience that few programmers undertake it.

Figures 8b and 8c show performance results for the Circuit and Fluid applications respectively; for these applications we do not have independently optimized distributed memory implementations and so we compare only the Realm and implicit implementations. Each plot contains performance curves for both implementations on two different problem [...]
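As an illustration of the generational event scheme from Section 5.2, the following single-node C++ sketch shows how one slot can stand for one untriggered event plus all of its earlier, already-triggered generations. It is a simplified model under stated assumptions, not Realm's implementation: the names EventHandle, GenerationalEventTable, and num_slots are hypothetical, waiter lists, subscriptions, and remote nodes are left out, and the linear search for a reusable slot stands in for whatever free-list a real runtime would use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical handle: names a generational event slot plus a generation.
struct EventHandle {
    std::size_t id;     // index of the generational event (slot)
    std::uint32_t gen;  // generation this handle refers to
};

// Hypothetical single-node table of generational events.
class GenerationalEventTable {
public:
    // Reuse a slot whose current generation has triggered, bumping its
    // generation; allocate a fresh slot only if every slot is untriggered.
    EventHandle create_event() {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            if (slots_[i].triggered) {
                slots_[i].current_gen += 1;
                slots_[i].triggered = false;
                return EventHandle{i, slots_[i].current_gen};
            }
        }
        slots_.push_back(Slot{1, false});
        return EventHandle{slots_.size() - 1, 1};
    }

    // Trigger the current generation of the slot named by the handle.
    void trigger(EventHandle e) {
        assert(e.id < slots_.size());
        Slot &s = slots_[e.id];
        if (s.current_gen == e.gen) s.triggered = true;
    }

    // A query succeeds if the slot has moved past the queried generation,
    // or if the queried generation is current and has triggered.
    bool has_triggered(EventHandle e) const {
        assert(e.id < slots_.size());
        const Slot &s = slots_[e.id];
        return s.current_gen > e.gen ||
               (s.current_gen == e.gen && s.triggered);
    }

    std::size_t num_slots() const { return slots_.size(); }

private:
    struct Slot {
        std::uint32_t current_gen;  // most recently created generation
        bool triggered;             // has current_gen triggered?
    };
    std::vector<Slot> slots_;
};
```

In this sketch, creating an event x, triggering it, and then creating an event y reuses the same slot with generation 2, while a later query on x still succeeds because the slot's current generation exceeds x's. A second slot is allocated only when an event is created while y is still untriggered, which mirrors the claim that storage grows with the number of untriggered events rather than the number of live ones.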

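The two kinds of reduction-only instance described in Section 7.1 can be contrasted with a small C++ sketch. It assumes a sum reduction on double elements, for which the fold operation happens to coincide with the reduction itself, and it models the target instance as a plain std::vector. The class names FoldInstance and ListInstance and their methods are hypothetical; atomic accumulation and the actual data transfer between memories are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical reduction fold instance: one right-hand-side slot per
// element, indexed like the target region.
class FoldInstance {
public:
    explicit FoldInstance(std::size_t num_elmts) : rhs_(num_elmts, 0.0) {}

    // Fold a right-hand-side value into the slot for element idx.
    void reduce(std::size_t idx, double v) {
        assert(idx < rhs_.size());
        rhs_[idx] += v;  // for a sum reduction the fold is also a sum
    }

    // Apply the folded values to the target in one linear sweep.
    void apply_to(std::vector<double> &target) const {
        assert(target.size() == rhs_.size());
        for (std::size_t i = 0; i < rhs_.size(); ++i) target[i] += rhs_[i];
    }

private:
    std::vector<double> rhs_;
};

// Hypothetical reduction list instance: a log of (location, value) pairs.
class ListInstance {
public:
    // Record the reduction instead of applying it.
    void reduce(std::size_t idx, double v) { log_.emplace_back(idx, v); }

    // Replay the log against the target.
    void apply_to(std::vector<double> &target) const {
        for (const auto &entry : log_) {
            assert(entry.first < target.size());
            target[entry.first] += entry.second;
        }
    }

    std::size_t entries() const { return log_.size(); }

private:
    std::vector<std::pair<std::size_t, double>> log_;
};
```

Both sketches leave the target in the same state. The difference is in what would be transferred: the fold instance always carries one value per element, while the list instance carries one entry per logged reduction, which is smaller when reductions are sparse relative to the region size.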