
1structNodefoatvoltage,new - PDF document

stefany-barnette
Uploaded On 2015-08-24





Presentation Transcript

 1  struct Node { float voltage, new_charge, capacitance; };
 2  struct Wire<rn> { Node@rn in_node, out_node; float current, ... ; };
 3  struct Circuit { region r_all_nodes; /* contains all nodes for the circuit */
 4                   region r_all_wires; /* contains all circuit wires */ };
 5  struct CircuitPiece {
 6    region rn_pvt, rn_shr, rn_ghost; /* private, shared, ghost node regions */
 7    region rw_pvt; /* private wires region */ };
 8
 9  void simulate_circuit(Circuit c, float dt) : RWE(c.r_all_nodes, c.r_all_wires)
10  {
11    // The construction of the colorings is not shown. The colorings wire_owner_map,
12    // node_owner_map, and node_neighbor_map have MAX_PIECES colors
13    // 0..MAX_PIECES-1. The coloring node_sharing_map has two colors 0 and 1.
14    //
15    // Partition of wires into MAX_PIECES pieces
16    partition<disjoint> p_wires = c.r_all_wires.partition(wire_owner_map);
17    // Partition nodes into two parts for all-private vs. all-shared
18    partition<disjoint> p_nodes_pvs = c.r_all_nodes.partition(node_sharing_map);
19
20    // Partition all-private into MAX_PIECES disjoint circuit pieces
21    partition<disjoint> p_pvt_nodes = p_nodes_pvs[0].partition(node_owner_map);
22    // Partition all-shared into MAX_PIECES disjoint circuit pieces
23    partition<disjoint> p_shr_nodes = p_nodes_pvs[1].partition(node_owner_map);
24    // Partition all-shared into MAX_PIECES ghost regions, which may be aliased
25    partition<aliased> p_ghost_nodes = p_nodes_pvs[1].partition(node_neighbor_map);
26
27    CircuitPiece pieces[MAX_PIECES];
28    for (i = 0; i < MAX_PIECES; i++)
29      pieces[i] = { rn_pvt: p_pvt_nodes[i], rn_shr: p_shr_nodes[i],
30                    rn_ghost: p_ghost_nodes[i], rw_pvt: p_wires[i] };
31    for (t = 0; t < TIME_STEPS; t++) {
32      spawn (i = 0; i < MAX_PIECES; i++) calc_new_currents(pieces[i]);
33      spawn (i = 0; i < MAX_PIECES; i++) distribute_charge(pieces[i], dt);
34      spawn (i = 0; i < MAX_PIECES; i++) update_voltages(pieces[i]);
35    }
36  }
37  // ROE = Read-Only-Exclusive
38  void calc_new_currents(CircuitPiece piece):
39    RWE(piece.rw_pvt), ROE(piece.rn_pvt, piece.rn_shr, piece.rn_ghost) {
40    foreach (w : piece.rw_pvt)
41      w->current = (w->in_node->voltage - w->out_node->voltage) / w->resistance;
42  }
43  // RdA = Reduce-Atomic
44  void distribute_charge(CircuitPiece piece, float dt):
45    ROE(piece.rw_pvt), RdA(piece.rn_pvt, piece.rn_shr, piece.rn_ghost) {
46    foreach (w : piece.rw_pvt) {
47      w->in_node->new_charge += -dt * w->current;
48      w->out_node->new_charge += dt * w->current;
49    }
50  }
51
52  void update_voltages(CircuitPiece piece): RWE(piece.rn_pvt, piece.rn_shr) {
53    foreach (n : piece.rn_pvt, piece.rn_shr) {
54      n->voltage += n->new_charge / n->capacitance;
55      n->new_charge = 0;
56    }
57  }

Listing 1. Circuit simulation.

(Section IV). Often using application-specific information results in better mappings than a generic mapping strategy. We describe a mapping interface that allows programmers to give the SOOP a specification of how to map tasks and regions for a specific application, or even part of an application. This mapping API is designed so that any user-supplied mapping strategy can only affect the performance of applications, not their correctness.

We present results of experiments on three applications: fluid-flow on a regular grid, a three-level AMR code solving a heat diffusion equation, and a circuit simulation. We compare each application with the best reference versions on three different clusters of multicore processors with GPUs, including the Keeneland supercomputer [4].

II. EXAMPLE: CIRCUIT SIMULATOR

We begin by describing an example program written in the Legion programming model. Listing 1 shows code for an electrical circuit simulation, which takes a collection of wires and nodes where wires meet. At each time step the simulation calculates currents, distributes charges, and updates voltages.

The key decisions in a Legion program are how data is grouped into regions and how regions are partitioned into subregions. The goal is to pick an organization that makes explicit which computations are independent. A Circuit has two regions: a collection of nodes and a collection of wires (line 3 of Listing 1).¹ An efficient parallel implementation breaks this unstructured graph into pieces that can be processed (mostly) independently. An appropriate region organization makes explicit which nodes and wires are involved in intra-piece computation and, where wires connect different pieces, which are involved in inter-piece computation.

Figure 1(b) shows how the nodes in a small graph might be split into three pieces. Blue (lighter) nodes, attached by wires only to nodes in the same piece, are private to the piece. Red (darker) nodes, on the boundary of a piece, are shared with (connected to) other pieces. In the simulation, computations on the private nodes of different pieces are independent, while computations on the shared nodes require communication. To make this explicit in the program, we partition the nodes region into private and shared subregions (line 18). To partition a region, we provide a coloring, which is a relation between the elements of a region and a set of colors. For each color c in the coloring, the partition contains a subregion r of the region being partitioned, with r consisting of the elements colored c. Note that the partition into shared and private nodes is disjoint because each node has one color.

The private and shared nodes are partitioned again into private and shared nodes for each circuit piece (lines 21 and 23); both partitions are disjoint. There is another useful partition of the shared nodes: for a piece i, we will need the shared nodes that border i in other pieces of the circuit. This ghost node partition (line 25) has two interesting properties. First, it is a second partition of the shared nodes: we have two views onto the same collection of data. Second, the ghost node partition is aliased, meaning the subregions are not disjoint: a node may border several different circuit pieces and belong to more than one ghost node subregion (thus, node_neighbor_map on line 25 assigns more than one color to some nodes). The private, shared, and ghost node subregions for the upper-left piece of the example graph are shown in Figures 1(c), 1(d), and 1(e) respectively.

Figure 1(a) shows the final hierarchy of node partitions and subregions. The symbol indicates a partition is disjoint. This region tree data structure plays an important role in scheduling tasks for out-of-order execution (see Section III). The organization of the wires is much simpler: a single disjoint

¹ Note that all pointers declare the region to which they point. For example, the definition of Wire (line 2) is parametrized on the region rn to which the Node pointers in fields in_node and out_node point.
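The coloring-based partitioning described in this section can be sketched in a few lines of Python. This is only an illustration, not Legion's implementation: the helpers `partition` and `is_disjoint`, the seven-node circuit, and the way the three colorings are derived from a wire list are all assumptions made for the example (Listing 1 states that the construction of the colorings is not shown). A coloring is modeled as a mapping from each element to a set of colors, so an element with more than one color lands in more than one subregion.

```python
from collections import defaultdict


def partition(region, coloring):
    """Return {color: subregion}, where each subregion holds the elements given that color."""
    subregions = defaultdict(set)
    for elem in region:
        for color in coloring.get(elem, ()):
            subregions[color].add(elem)
    return dict(subregions)


def is_disjoint(subregions):
    """True when no element appears in more than one subregion."""
    seen = set()
    for elems in subregions.values():
        if seen & elems:
            return False
        seen |= elems
    return True


# Hypothetical circuit: 7 nodes owned by 3 pieces, wires given as (in_node, out_node).
owner = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 2}
wires = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (3, 6)]
all_nodes = set(owner)

# Assumed construction of the colorings named in Listing 1.
node_owner_map = {n: {p} for n, p in owner.items()}
node_neighbor_map = defaultdict(set)  # node -> pieces it borders (may be several)
for a, b in wires:
    if owner[a] != owner[b]:
        node_neighbor_map[a].add(owner[b])
        node_neighbor_map[b].add(owner[a])
# Color 1 = shared (touches another piece), color 0 = private.
node_sharing_map = {n: {1 if n in node_neighbor_map else 0} for n in all_nodes}

# Mirrors lines 18-25 of Listing 1.
p_nodes_pvs = partition(all_nodes, node_sharing_map)
p_pvt_nodes = partition(p_nodes_pvs[0], node_owner_map)
p_shr_nodes = partition(p_nodes_pvs[1], node_owner_map)
p_ghost_nodes = partition(p_nodes_pvs[1], node_neighbor_map)

print("shared by owner:", p_shr_nodes)
print("ghost by piece: ", p_ghost_nodes)
print("ghost partition disjoint?", is_disjoint(p_ghost_nodes))
```

In this toy circuit node 3 borders pieces 0 and 2, so it receives two colors in `node_neighbor_map` and appears in two ghost subregions, which is what makes the ghost partition aliased while the owner-based partitions stay disjoint.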
              Exclusive  Atomic  Simultaneous  Relaxed
Exclusive     Dep        Dep     Dep           Dep
Atomic        Dep        Same    Cont          Cont
Simultaneous  Dep        Cont    Same          None
Relaxed       Dep        Cont    None          None

Fig. 2. Dependence table.

information to map t, but does identify all tasks that must map before t can map. The third SOOP stage maps t by carrying out a more refined analysis of the (already mapped) tasks on which t depends (Section III-C). Between these stages, the second stage distributes tasks to other processors (Section III-B). Once a task has been assigned a processor and physical instances it is issued for deferred execution (Section III-D). The final SOOP stage reclaims resources from completed tasks (Section III-E).

A. Stage 1: Mapping Dependences

Each processor in the system runs an instance of the SOOP. When a parent task spawns a child task, the child is registered with the SOOP on the parent's processor; registration records the subtask's logical regions, privileges and coherence properties. Children are registered in the sequential order the parent spawns them and enqueued for mapping dependence analysis. In the circuit simulation in Listing 1, the spawn statements on lines 32-34 register all three kinds of subtasks (in program order) on the processor where simulate_circuit executes.

Detecting mapping dependences between a newly registered task t and a previously registered task t′ requires comparing the two sets of logical regions accessed. For each logical region used by t′ that may alias (may share data with) a logical region used by t, the privileges and coherence modes are compared to determine whether a dependence exists. If both regions need only read privileges there is never a dependence, but if either task needs write or reduction privileges, the coherence modes are compared using the table in Figure 2. Dep indicates dependence while None indicates independence. Same is a dependence unless the two tasks use the same physical instance of the logical region. Since tasks have not mapped physical instances at this point in the pipeline, Same is always a mapping dependence. Cont indicates a dependence contingent on the privileges of the two tasks (e.g., an Atomic Read-Write task and a Simultaneous Read-Only task will not have a dependence).

The table lists simultaneous and relaxed coherence modes that we have not yet discussed. Both modes allow other tasks using the region to execute at the same time and differ only in what updates must be observed. With simultaneous coherence, a task must see all updates to the logical region made by other tasks operating on the same region simultaneously (i.e., shared memory semantics). With relaxed coherence, a task may or may not observe concurrent updates.

A key property is that dependence analysis is not needed between arbitrary pairs of tasks. In fact, it suffices to check only siblings (children with the same parent) for dependences.

Observation 1. Let t1 and t2 be two sibling tasks with no dependence. Then no subtask of t1 has a dependence with any subtask of t2.

Recall from Section II that subtasks only access regions (or subregions) that their parent accesses. Thus if the regions used by t1 and t2 do not alias, the regions used by any subtasks of t1 cannot alias the regions used by any subtask of t2. If regions of t1 and t2 alias but there is no dependence because the regions have simultaneous or relaxed coherence, then by definition there is no dependence between the subtasks.

Consider the first two subtasks spawned on line 32 in Listing 1, calc_new_currents(pieces[0]) and calc_new_currents(pieces[1]). The first of these tasks reads and writes the private wires subregion pieces[0].rw_pvt in exclusive mode, and reads in exclusive mode the node subregions pieces[0].rn_pvt, .rn_shr, and .rn_ghost (see line 39). The second subtask must be checked against the first for any dependences; the second subtask uses pieces[1].rw_pvt (read/write exclusive) and pieces[1].rn_pvt, .rn_shr, and .rn_ghost (read-only exclusive). It may be helpful to refer to the region tree for nodes in Figure 1(a) (recall that the wires region tree consists of a single disjoint partition). We give a few representative examples of reasoning about pairwise aliasing (not all pairs are covered):

1) pieces[0].rw_pvt and pieces[1].rw_pvt do not alias as they are different subregions of a disjoint partition.
2) pieces[0].rn_ghost and pieces[1].rn_shr alias as they are in different partitions of the same region.
3) Subregions of an aliased partition always alias (e.g., pieces[0].rn_ghost and pieces[1].rn_ghost).
4) pieces[0].rw_pvt and pieces[1].rn_pvt do not alias because they are in different region trees.

Cases 2 and 3 indicate a possible dependence; however, the regions are accessed with only read privileges. All other cases for these two subtasks are similar to one of the examples above; thus, the two subtasks are independent.

A closer look at this example shows that whether r1 aliases r2 can be determined by examining their least common ancestor r1 ⊔ r2 in the region tree. If r1 ⊔ r2 either does not exist (the regions are in different region trees) or is a disjoint partition, then r1 and r2 are disjoint. If r1 ⊔ r2 is an aliased partition or a region, then r1 and r2 alias.

Consider tasks distribute_charge(pieces[0], dt) and update_voltages(pieces[1]) in Listing 1. The former task uses region pieces[0].rn_ghost and the latter uses region pieces[1].rn_shr. The least common ancestor is the region of all shared nodes p_nodes_pvs[1], so these two regions alias. Since both tasks modify their respective subregions, there is a dependence and the update_voltages task can only map after the distribute_charge task has mapped.

We can now describe the algorithm for mapping dependence analysis. The SOOP maintains a local region forest for each task t, the roots of which are t's region arguments and including all partitions and subregions used by t. Dependence

(a) Two calc_new_currents tasks. (b) distribute_charge depends on calc_new_currents. (c) update_voltages depends on distribute_charge.
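The least-common-ancestor aliasing rule and the Figure 2 lookup from Section III-A can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the SOOP's algorithm: the `TreeNode` class, the function names, and the privilege strings are hypothetical, the tree is built by hand for two pieces following the description of Figure 1(a), and the function simply returns the raw table entry (the paper does not spell out how a `Cont` entry is resolved, so no attempt is made here).

```python
class TreeNode:
    """Region-tree node; kind is 'region', 'disjoint' (partition) or 'aliased' (partition)."""

    def __init__(self, name, kind, parent=None):
        self.name, self.kind, self.parent = name, kind, parent

    def path_to_root(self):
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path


def least_common_ancestor(r1, r2):
    """Deepest node that is an ancestor of (or equal to) both regions, or None."""
    seen = {id(n) for n in r1.path_to_root()}
    for n in r2.path_to_root():  # walks upward, so the first hit is the deepest
        if id(n) in seen:
            return n
    return None


def may_alias(r1, r2):
    """Disjoint if the LCA is missing or a disjoint partition; otherwise the regions alias."""
    lca = least_common_ancestor(r1, r2)
    return lca is not None and lca.kind in ("region", "aliased")


# Coherence table of Fig. 2; each row lists entries in the order of MODES.
MODES = ("Exclusive", "Atomic", "Simultaneous", "Relaxed")
TABLE = {
    "Exclusive": ("Dep", "Dep", "Dep", "Dep"),
    "Atomic": ("Dep", "Same", "Cont", "Cont"),
    "Simultaneous": ("Dep", "Cont", "Same", "None"),
    "Relaxed": ("Dep", "Cont", "None", "None"),
}


def mapping_dependence(use1, use2):
    """Each use is (region, privilege, coherence mode); returns a Fig. 2 entry as a string."""
    (r1, priv1, mode1), (r2, priv2, mode2) = use1, use2
    if not may_alias(r1, r2):
        return "None"
    if priv1 == "read" and priv2 == "read":
        return "None"  # two readers never depend on each other
    return TABLE[mode1][MODES.index(mode2)]


# Node region tree as described for Fig. 1(a), for two pieces, plus the separate wires tree.
all_nodes = TreeNode("r_all_nodes", "region")
p_pvs = TreeNode("p_nodes_pvs", "disjoint", all_nodes)
all_pvt = TreeNode("p_nodes_pvs[0]", "region", p_pvs)
all_shr = TreeNode("p_nodes_pvs[1]", "region", p_pvs)
p_pvt = TreeNode("p_pvt_nodes", "disjoint", all_pvt)
p_shr = TreeNode("p_shr_nodes", "disjoint", all_shr)
p_ghost = TreeNode("p_ghost_nodes", "aliased", all_shr)
rn_pvt = [TreeNode(f"rn_pvt[{i}]", "region", p_pvt) for i in range(2)]
rn_shr = [TreeNode(f"rn_shr[{i}]", "region", p_shr) for i in range(2)]
rn_ghost = [TreeNode(f"rn_ghost[{i}]", "region", p_ghost) for i in range(2)]
all_wires = TreeNode("r_all_wires", "region")
p_wires = TreeNode("p_wires", "disjoint", all_wires)
rw_pvt = [TreeNode(f"rw_pvt[{i}]", "region", p_wires) for i in range(2)]

# distribute_charge(pieces[0]) reduces into rn_ghost[0] (RdA);
# update_voltages(pieces[1]) reads and writes rn_shr[1] (RWE).
print(mapping_dependence((rn_ghost[0], "reduce", "Atomic"),
                         (rn_shr[1], "read-write", "Exclusive")))
```

The four numbered aliasing cases in the text correspond to `may_alias(rw_pvt[0], rw_pvt[1])`, `may_alias(rn_ghost[0], rn_shr[1])`, `may_alias(rn_ghost[0], rn_ghost[1])`, and `may_alias(rw_pvt[0], rn_pvt[1])`. Walking parent pointers keeps the check proportional to tree depth, which fits the paper's remark that the work per region depends on the depth of the region forest.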
(d) An update_voltages mapping.

Fig. 3. Dependence analysis examples from circuit_simulation.

a writer does not require a valid instance; because reductions can be reordered, combining the instances of r and r′ (if they alias) can be deferred to a later consumer of the data (see case 4).

4) If t′ has reduce privilege for r′ and t has read, write, or reduce privileges with a different operator than t′ for r, and r aliases r′, then s′ must be reduced (using the reduction operator of t′) to an instance s″ of r ⊔ r′ and then a fresh instance of r created from s″. Instance s′ is removed from the region tree.

To map r, we walk from r's root ancestor in the region forest to r, along the way exploring off-path subtrees to find region instances satisfying cases 2 and 4. The details and analysis are similar to the implementation of dependence analysis in Section III-A; the amortized work per region mapped is proportional to the depth of the region forest and independent of the number of tasks or physical instances.

As part of the walk any required copy and reduction operations are issued to construct a valid instance of r. These operations are deferred, waiting to execute on the tasks that produce the data they copy or reduce. Similarly, the start of t's execution is made dependent on completion of the copies/reductions that construct the instance of r it will use.

Figure 3(d) shows part of the mapping of task update_voltages[0]; we focus only on the shared nodes. Because the immediately preceding distribute_charge tasks all performed the same reduction on the shared and ghost nodes, they were allowed to run in parallel. But update_voltages[0] needs read/write privilege for pieces[0].rn_shr, which forces all of the reductions to be merged back to an instance of the all-shared nodes region, from which a new instance of pieces[0].rn_shr is copied and used by update_voltages[0].

D. Stage 4: Execution

After a task t has been mapped it enters the execution stage of the SOOP. When all of the operations (other tasks and copies) on which t depends have completed, t is launched on the processor it was mapped to. When t spawns subtasks, they are registered by the SOOP on which t is executing, using t as the parent task and t's region arguments as the roots of the region forest. Each child of t traverses the SOOP pipeline on the same processor as t, possibly being mapped to a SOOP instance on a different processor to execute.

E. Stage 5: Clean-Up

Once task t is done executing its state is reclaimed. Dependence and mapping information is removed from the region tree. The most involved aspect of clean-up is collecting physical instances that are no longer in use, for which we use a distributed reference counting scheme.

IV. MAPPING INTERFACE

As mentioned previously, a novel aspect of Legion is the mapping interface, which gives programmers control over where tasks run and where region instances are placed, making possible application- or machine-specific mapping decisions that would be difficult for a general-purpose programming system to infer. Furthermore, this interface is invoked at runtime, which allows for dynamic mapping decisions based on program input data. We describe the interface (Section IV-A), our base implementation (Section IV-B), and the benefits of creating custom mappers (Section IV-C).

A. The Interface

The mapping interface consists of ten methods that SOOPs call for mapping decisions. A mapper implementing these methods has access to a simple interface for inspecting properties of the machine, including a list of processors and their type (e.g., CPU, GPU), a list of memories visible to each processor, and their latencies and bandwidths. For brevity we only discuss the three most important interface calls:

select_initial_processor - For each task t in its mapping queue a SOOP will ask for a processor for t. The mapper can keep the task on the local processor or send it to any other processor in the system.

permit_task_steal - When handling a steal request a SOOP asks which tasks may be stolen. Stealing can be disabled by always returning the empty set.

map_task_region - For each logical region r used by a task, a SOOP asks for a prioritized list of memories where a physical instance of r should be placed. The SOOP provides a list of r's current valid physical instances; the mapper returns a priority list of memories in which the SOOP should attempt to either reuse or create a physical instance of r. Beginning with the first memory, the SOOP uses a current valid instance if one is present. Otherwise, the SOOP attempts to allocate a physical instance and issue copies to retrieve the valid data. If that also fails, the SOOP moves on to the next memory in the list.
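The way a SOOP consumes the priority list returned by map_task_region can be sketched as a simple fallback loop. The Python below is an illustration only: `place_instance`, its parameters, the memory names, and the byte counts are hypothetical, allocation is assumed to fail only when a memory lacks free capacity, and the `(None, "unmapped")` result for an exhausted list is an assumption, since the text does not say what happens in that case.

```python
def place_instance(priority_list, valid_instances, free_bytes, region_bytes):
    """Walk the mapper's priority list and return (memory, action).

    priority_list   -- memories in the order the mapper prefers them
    valid_instances -- memories that already hold a valid instance of the region
    free_bytes      -- dict of remaining capacity per memory (updated on allocation)
    region_bytes    -- size of the instance that would have to be allocated
    """
    for memory in priority_list:
        if memory in valid_instances:
            return memory, "reuse"  # a current valid instance is present
        if free_bytes.get(memory, 0) >= region_bytes:
            free_bytes[memory] -= region_bytes
            return memory, "allocate"  # the SOOP would also issue copies of valid data
        # Allocation failed here, so fall through to the next memory in the list.
    return None, "unmapped"  # assumption: the source does not describe this case


# Hypothetical capacities: the fastest memory is nearly full.
free = {"framebuffer": 64, "zero_copy": 512, "gasnet": 4096}
prefs = ["framebuffer", "zero_copy", "gasnet"]

first = place_instance(prefs, {"gasnet"}, free, region_bytes=128)
second = place_instance(prefs, {"zero_copy"}, free, region_bytes=128)
print(first, second)
```

In the first call the only valid instance sits in the lowest-priority memory, so the loop allocates in `zero_copy` instead of reusing it; the ordering of the list, not the location of existing data, drives the choice. That matches the text's point that the mapper affects only performance: any outcome of the loop yields a usable instance.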
The mapping interface has two desirable properties. First, program correctness is unaffected by mapper decisions, which can only impact performance. Regardless of where a mapper places a task or region, the SOOPs schedule tasks and copies in accordance with the privileges and coherence properties specified in the program. Therefore, when writing a Legion application, a programmer can begin by using the default mapper and later improve performance by creating and refining a custom mapper. Second, the mapping interface isolates machine-specific decisions to the mapper. As a result, Legion programs are highly portable. To port a Legion program to a new architecture, a programmer need only implement a new mapper with decisions specific to the new architecture.

B. Default Mapper

To make writing Legion applications easier, we provide a default mapper that can quickly get an application working with moderate performance. The default mapper employs a simple scheme for mapping tasks. When select_initial_processor is invoked, the mapper checks the type of processors for which task t has implementations (e.g., GPU). If the fastest implementation is for the local processor the mapper keeps t local, otherwise it sends t to the closest processor of the fastest kind that can run t.

The default mapper employs a Cilk-like algorithm for task stealing [5]. Tasks are kept local whenever possible and only moved when stolen. Unlike Cilk, the default mapper has the information necessary for locality-aware stealing. When permit_task_steal is called for a task, the default mapper inspects the logical regions for the task being stolen and marks that other tasks using the same logical regions should be stolen as well.

For calls to map_task_region, the default mapper constructs a stack of memories ordered from best-to-worst by bandwidth from the local processor. This stack is then returned as the location of memories to be used for mapping each region. This greedy algorithm works well in common cases, but can cause some regions to be pulled unnecessarily close to the processor, consuming precious fast memory.

C. Custom Mappers

To optimize a Legion program or library, programmers can create one or more custom mappers. Each custom mapper extends the default mapper. A programmer need only override the mapper functions he wishes to customize. Mappers are registered with the runtime and given unique handles. When a task is launched, the programmer specifies the handle for the mapper that should be invoked by the runtime for mapping that particular task.

Supporting custom mappers has two benefits. First, it allows for the composition of Legion applications and Legion libraries each with their own custom mappers. Second, custom mappers can be used to create totally static mappings, mappings that memoize their results, or even totally dynamic mappings for different subsets of tasks in Legion applications. We describe examples of custom mappers in Section V.

Cluster         Sapling          Viz               Keeneland
Nodes           4                10                32 (120)
CPUs/Node       2x Xeon 5680     2x Xeon 5680      2x Xeon 5660
HyperThreading  on               off               off
GPUs/Node       2x Tesla C2070   5x Quadro Q5000   3x Tesla M2090
DRAM/Node       48 GB            24 GB             24 GB
Infiniband      2x QDR           QDR               2x QDR

Fig. 4. System configurations used for the experiments.

V. EXPERIMENTS

We evaluate the efficiency and scalability of Legion using three applications on three clusters (see Figure 4). All three clusters were Linux-based, and the Legion runtime was built using pthreads for managing CPU threads, CUDA [6] for GPUs, and GASNet [7] for inter-node communication. The RDMA features of GASNet were used to create a globally addressable, but relatively slow, GASNet memory that is accessible by all nodes. For each application, multiple problem sizes were used, and each size problem was run on subsets of each machine ranging from the smallest (a single CPU core or GPU) to the largest or near-largest (except Keeneland, where we limited runs to 32 nodes to get sufficient cluster time). By examining performance of the same size problem over progressively larger machines, we measure Legion's strong scaling. By increasing the problem size as well, we also measure weak scaling.

A. Circuit Simulation

The first experiment we investigate is the distributed circuit simulation described in Section II. The Legion SOOP runtime handles all of the resource allocation, scheduling, and data movement across the cluster of GPUs. In particular, Legion's ability to efficiently move the irregularly partitioned shared data around the system while keeping the private nodes and wires resident in each GPU's framebuffer memory is critical to achieving good scalability.

Circuits of two different sizes were simulated. The first had 480K wires, connecting 120K nodes. The second is twice as large, with nearly 1M wires connecting 250K nodes. In addition to running these tests on varying numbers of nodes, the number of GPUs used by the runtime was also varied. In no case did the changes to nodes or number of GPUs per node require changes to the application code.

The circuit simulation has a simple application-specific mapper. At initialization time, the mapper queries the list of GPUs in the machine and identifies each GPU's framebuffer memory and zero-copy memory (pinned memory that both the GPUs and CPUs on a node can access directly). Once the circuit is partitioned, the partitions are assigned a home GPU in round-robin fashion. Every task related to that partition is then sent to the home GPU, with no task stealing allowed. (In a well-partitioned circuit, load imbalance is low enough that the cost of moving the private data for a piece from one GPU to another outweighs any benefits.)

The regions for the tasks are mapped as shown in Figure 5. Wires and private node data are kept in each GPU's framebuffer at all times. An instance of the all-shared-nodes

(a) Single-node particle simulation speed. (b) Multi-node scaling.

Fig. 7. Fluid simulation results.

ghost cells as interior grid cells. The larger problem sizes (2.4M and 19M particles) perform much better, with scaling of up to 5.4x when going from 1 node to 16 because of a lower communication-to-computation ratio.

C. Adaptive Mesh Refinement

Our final application is based on the third heat equation example from the Berkeley Labs BoxLib project [9]. This application is a three-level adaptive-mesh-refinement (AMR) code that computes a first order stencil on a 2D mesh of cells. Updating the simulation for one time step consists of three phases. In the first phase, the boundary cells around a box at a refined level linearly interpolate their values from the nearby cells at the next coarser level. The second phase performs the stencil computation on each cell in every level. In the third phase, cells at a coarser level that have been refined are restricted to the average of the cells that they physically contain at the next finest level of refinement.

Achieving high-performance on this application is particularly challenging for several reasons. First, the application has a very high communication-to-computation ratio which, for a fixed problem size, begins as being memory bound and, with increasing node count, becomes network bound as the perimeter-to-area ratio of cell grids increases. Second, when choosing how to partition cells into grids, the programmer must consider the locality between cells within a level as well as across levels. For cross-level cell dependences, optimal mapping decisions can only be made at runtime as the location of refinements are dynamically determined. Finally, this application has parallelism both between tasks running at the same level and tasks running across levels, leading to complicated input-dependent data dependences.

BoxLib's implementation partitions cells within a level into a number of grids based on the number of nodes in the machine and distributes one grid from each level to each node. This optimizes for memory bandwidth and load balance, but does not exploit cross-level locality between grids from different levels of refinement. Furthermore, BoxLib does not block grids into sub-grids to take advantage of intra-grid locality.

Our Legion implementation performs two optimizations that allow us to outperform BoxLib. First, for each level of refinement we recursively partition the logical region of cells based on the number of nodes in the machine and the sizes of the L2 and L3 caches. Our second optimization takes advantage of the cross-level locality. We wrote an application-specific mapper that dynamically discovers relationships between grids at different levels of refinement. The mapper dynamically performs intersection tests between logical regions containing grids of different refinement levels. If the mapper discovers overlaps between grids from different levels, the mapper places them on the same node in the machine. The mapper memoizes the intersection tests to amortize their cost. The mapper also dynamically load balances by distributing unconstrained grids from the coarsest level onto under-loaded nodes.

We compared our Legion implementation against BoxLib on three different problem sizes with a fixed number of cells per level of refinement, but with randomly chosen refinement locations. BoxLib also supports OpenMP and we took their best performance from using 1, 2, 4, or 8 threads per node. Our Legion implementation always uses one thread per node to illustrate that in this application locality is significantly more important than fine-grained data-parallelism.
Figure 8 gives the results. On just one node, blocking for caches using Legion achieves up to 2.6X speedup over BoxLib. As node count increases, the mapper's ability to exploit cross-level locality further increases the performance advantage to 5.4X by reducing the total communication costs.

As the node count increases the AMR code becomes highly dependent on interconnect performance. BoxLib performs much better on Keeneland than on Viz due to the better interconnect. At higher node counts BoxLib begins to catch up (see Figure 8(c)) because our application's intra-level ghost-cell exchange algorithm uses GASNet memory to communicate ghost cells, requiring a linear increase in network traffic with the number of nodes. BoxLib uses direct node-to-node exchanges of ghost cells, similar to our fluid application. A future implementation of our AMR code will employ a similar ghost cell exchange algorithm to improve scalability.

(a) Sapling results. (b) Viz results. (c) Keeneland results.

Fig. 8. Throughput of adaptive mesh refinement code.

VI. RELATED WORK

Legion began as an outgrowth of Sequoia, a locality-aware programming language capable of expressing computations in deep memory hierarchies [1]. Sequoia is a special case of the Legion programming model in which only arrays can be recursively partitioned, all access is exclusive, there is a static mapping of tasks and data (though extensions to Sequoia make this mapping more dynamic [10]) and, most fundamentally, the decomposition of tasks and the decomposition of data is one-to-one. Legion generalizes the Sequoia model by allowing for dynamic partitioning of pointer data structures through regions, enabling dynamic mappings through the mapper interface, and allowing different coherence properties. Legion's decoupling of the task tree from the region tree leads directly to the scheduling problem solved by our software out-of-order processor for tasks with region arguments.

The SSMP programming model is the most similar work to Legion [11]. Like Legion, SSMP supports dynamic detection of dependences between tasks based on data requirements. However, SSMP only supports a single disjoint rectilinear partition of an array, unlike Legion which supports multiple arbitrary partitions of regions. Furthermore, the SSMP runtime must perform dependence checks between every pair of tasks created in the system. Legion's programming model only requires dependence checks between tasks with the same parent task which enables scalable nested parallelism in distributed machines. SSMP only operates on shared memory machines.

Chapel has several concepts to support the expression of locality [12]. Domains are similar to logical regions in that they describe maps from indexes to objects. Domains can create sub-domains by slicing the index sets from a parent domain. Domains are a higher level concept than regions; the domain index sets support dimensionality and iterators, whereas logical regions can only be accessed by pointers. Also, the act of creating subdomains in Chapel does not track disjointness information, making it more challenging for the Chapel compiler or runtime to infer task independence.

In addition to domains, Chapel also supports the notion of domain maps and locales to enable the programmer to efficiently map domains onto hardware [13]. Locales are a flat array of abstract locations. Programmers can use locales by writing domain maps that specify how domains are subdivided and assigned to locales. Domain maps provide the same functionality as partitions and mappers in Legion, but require the user to correctly implement domain maps for the program to be correct. Legion explicitly isolates correctness from performance by defining the Mapper interface. In addition, Chapel's flat array of locales makes it challenging to fully utilize deep memory hierarchies. Chapel currently supports clusters and GPUs in isolation [14], but we are not aware of any results that make use of both.

X10 is another parallel programming language designed to operate on distributed memory machines [15]. X10's places enable programmers to talk about where to place both data and tasks. However, once data and tasks have been placed they are fixed, which mandates that data movement be explicitly managed by user level code or implicitly by the compiler [16]. Recently X10 has introduced regions into the compiler's intermediate representation [17]. Unlike Legion, regions in X10 are not visible to the programmer but are inferred from high level arrays through static analysis. X10 provides support for clusters of GPUs [18], but requires the programmer to write all code managing data movement through both the cluster and GPU memory hierarchies.

Deterministic Parallel Java (DPJ) is a parallel extension of Java that, like Legion, uses regions to express locality, but does static dependence analysis on region arguments to functions to find dependences [19]. The primary goal of DPJ is to provide
