
II. BACKGROUND

We present our approach in the context of Cilk, an exemplar fork/join model. In this section, we briefly describe the Cilk scheduler and its data access behavior. More detailed descriptions can be found elsewhere [7], [8].

Cilk is a parallel programming extension to the C language that introduces three additional keywords: cilk, spawn, and sync. The cilk keyword specifies that a function is capable of being executed in parallel. The spawn keyword specifies that the invoked function, referred to as the spawned task, is concurrent with the statements that follow in the spawning task. No statement following the sync statement in a task can be executed until all preceding tasks have completed. Every function invocation is treated as a task. At any given point in time, the sequence of instructions remaining to be executed in a task is referred to as the task's continuation. A continuation is often marked by a goto label and referred to by an integer. A closure refers to the partially executed state of a task, or the state of the local variables in the function invocation. Execution begins with one of the threads executing the closure corresponding to the main() function. The actions of the scheduler can be described by the following loop:

    void Cilk_Scheduler():
        foreach (w in workers):
            while (/not terminated/):
                Closure cl = try_steal_from_deque(w);
                if (!cl) cl = Closure_steal(w, random_victim());
                if (cl) execute_closure(cl);

    void execute_closure(Closure cl):
        // execute the corresponding closure starting at continuation
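To make the scheduler loop concrete, here is a minimal single-threaded Python simulation of it. It is a sketch under stated assumptions, not Cilk's implementation. The names run_work_stealing and make_tree_task are hypothetical. Workers take turns in round-robin order instead of running concurrently. For brevity, spawned tasks (rather than continuations) are queued, so the sketch does not model Cilk's work-first policy.

```python
import random
from collections import deque


def run_work_stealing(root_task, num_workers=4, seed=0):
    """Simulate the scheduler loop; return the task names each worker ran.

    A task is a callable that receives a `spawn` function and returns its
    own name. Calling spawn(t) queues t on the executing worker's deque.
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    executed = [[] for _ in range(num_workers)]
    deques[0].append(root_task)  # execution begins on worker 0

    while any(deques):  # "not terminated": some deque still holds work
        for w in range(num_workers):
            if deques[w]:
                task = deques[w].pop()  # local work: newest entry first
            else:
                victim = rng.randrange(num_workers)
                if victim == w or not deques[victim]:
                    continue  # failed steal attempt, retry next round
                task = deques[victim].popleft()  # steal the oldest entry
            executed[w].append(task(deques[w].append))
    return executed


def make_tree_task(name, depth):
    """Hypothetical workload: a binary tree of tasks of the given depth."""
    def task(spawn):
        if depth > 0:
            spawn(make_tree_task(name + "L", depth - 1))
            spawn(make_tree_task(name + "R", depth - 1))
        return name
    return task


schedule = run_work_stealing(make_tree_task("r", 3), num_workers=4, seed=1)
```

The returned per-worker lists are one way to represent a schedule in the sense used in this section: an ordered list of the tasks executed by each thread.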
A worker with a valid closure executes the closure in a work-first (i.e., depth-first) fashion. A spawned task is immediately executed, allowing the continuation of the currently executing task to be stolen. Upon fully executing a task, a worker returns to execute the invoking task. When no local work is available, a thread becomes a thief and randomly attempts to steal the oldest continuation from another thread. The state of this continuation is encapsulated in a closure, which is then executed. This proceeds until the root task completes execution. A schedule is a specification of an ordered list of tasks executed by each thread.

While shown to be provably space- and time-efficient, Cilk does not take data locality into account. In particular, we consider two challenges associated with executing fork/join programs that incur non-trivial data access costs.

First, the lack of locality-awareness across computation phases significantly impacts performance. For example, we evaluated the execution time for performing a parallel memory copy between two 8 GB arrays on an 80-core system after both arrays have been initialized (system configuration is detailed in Section IV-C). The memory copy is organized as concurrent tasks operating on contiguous blocks of the array. A statically scheduled OpenMP loop aligned the initialization and copy operations and took 169 ms to perform the memory copy. Conversely, implementations using Cilk and OpenMP tasks took 436 ms, about 2.6 times as long.

Second, the performance of such programs is very sensitive to task granularity. The preceding results were obtained using a 512 KB block size, the best performing Cilk and OpenMP tasks version. Other block sizes, ranging from 4 KB (page size) to a few MBs, performed worse than this. While compute-bound programs can be optimized in terms of the smallest task granularity that maximizes sequential performance, optimizing data access costs imposes additional challenges.

III. OVERVIEW

In this section, we present an overview of our approach to data locality optimization for fork/join programs. Our objective is to match the actions of the work-stealing scheduler across different phases of a fork/join program. For example, making the same worker thread execute the initialization and copy tasks on a given data block would improve data locality and thus performance, resulting in the same performance as the OpenMP statically scheduled loops.

Here, we focus on a runtime approach to data locality optimization by matching the task execution schedules across different phases with user guidance. This involves efficient construction of a template schedule and algorithms to constrain the actions of the work-stealing scheduler so that subsequent phases follow a previously constructed template schedule.

We exploit the fact that the schedule for a fork/join program scheduled using work stealing can be described in terms of the steal operations involved. These steal operations can be combined to construct a steal tree that represents a phase's execution schedule. We present two approaches to construct the template schedules. The first approach involves efficiently tracing the execution of a phase to extract the corresponding steal tree. The second approach involves user-specified partitioning of the work, where the user constructs a synthetic steal tree as the program is executed. Automated extraction of the steal tree (in Section IV-A) minimizes user effort but cannot immediately optimize for data locality and load balance. User-specified steal trees (discussed in Section IV-B) provide direct control to the user while requiring additional user effort.

We describe three constrained schedulers (discussed in Section IV-C) that present different trade-offs between faithfully preserving the template schedule and continuing to improve load balance. The two strict schedulers preserve the steal tree provided and can avoid the costs associated with work stealing. The relaxed replay scheduler incrementally load balances the computation, starting from the template schedule. The following code snippet illustrates the optimization approach:

    spawn initialization(); sync;
    // s ← extract schedule from initialization
    for (i = 0; i < numIters; i++)
        if (!converged):
            // s ← use relaxed scheduler with s on kernel
            spawn kernel(); sync;
            // use strict scheduler with s on data relocation code
        else:
            // use strict scheduler with s to maintain locality
            spawn kernel(); sync;
In this snippet, the user is interested in matching the schedules for the initialization (the initialization() spawn) and iterative kernel (the kernel() spawn) phases. The user extracts the schedule for the initialization phase into s. This schedule may not be optimal due to data locality inefficiencies in the initialization phase itself. In addition, the extracted schedule might not result […]

[…] spawned this one) also should have a stolen continuation [9]. This is true when the number of nested steals at this point equals the number of nested spawns minus one. If this is not true, then the parent task did not have a stolen continuation.

    void check_validity(Closure cl, int curLevel):
        assert(curLevel == cl.level + 1);

Schedule Extraction API. At this point, the schedule constructed can be extracted using the routine extractSchedulePrevious(). This routine returns a StealTree structure representing the steal tree rooted at the immediately preceding spawn. Note that the statements following the spawn can be executed while the spawned task, and those spawned transitively, continue to be executed. Therefore, a call to extract the steal tree must be preceded by a sync to ensure that the steal tree has been completely constructed before it is extracted.

Given that the steal tree is requested a posteriori, we construct the steal tree for every spawn throughout the computation. An optimized implementation could include a split-phase design in which a specification of the intent to extract a steal tree triggers the steal tree construction for a spawn. However, we observe that the overheads of steal tree construction are marginal in practice because the steal tree construction operations are proportional to the number of steals, which are a small fraction of the total number of tasks in a fork/join program with sufficient concurrency.

C. Constrained Work Stealing

We have implemented three different scheduling algorithms that constrain a work-stealing scheduler to a template schedule with varying levels of fidelity. Depending on how refined and effective a schedule is for a given computation, it may need to be exactly followed or revised to adapt to changes in the environment (e.g., changes in the locality of data accessed). In Figure 1, we depict how the three types of constrained schedulers can improve a default schedule:

STOWS (strict ordered work stealing): the work-stealing scheduler exactly follows a template schedule t by guiding each worker to execute the tasks in the order prescribed by t.

STUWS (strict unordered work stealing): the work-stealing scheduler approximates a template schedule t by guiding each worker to execute the same tasks it executed in t, but it allows them to greedily deviate in order (ensuring that all dependencies are followed).

RELWS (relaxed work stealing): the work-stealing scheduler approximates a template schedule t by guiding each worker to execute the tasks it executed in t in any order, while allowing further steals when a worker is idle.

Constraining the execution to a template schedule requires coordinating the workers so that each worker steals the same closures as dictated by the schedule. Depending on the level of fidelity, the order may matter or further steals may be allowed. A possible method to implement this involves coordinating the thieves so they steal from the same victims as specified during the designated working phase. However, this may slow down the victim if it has to wait for the thief to steal due to perturbations in the execution or schedule variations from lower levels of fidelity (unordered or relaxed work stealing). Hence, we have implemented all of the scheduling algorithms using a donation protocol: when a continuation marked as stolen in the template schedule is encountered, the victim steals the next continuation from itself and donates the closure to the worker designated in the steal tree.

Fig. 1: Example computation scheduled with the default scheduler and then modified with three types of constrained work stealing. Default scheduler: all workers begin busy with work then attempt to steal when they finish. STOWS reproduces this schedule without having to search for work. STUWS is able to revise the order on thread 0, reducing the time. RELWS performs an additional steal, further balancing the workload.

The common operation in constrained work stealing executes a closure in the current constrain mode. Once a spawn is encountered and pushed on the stack, if there exists a steal point in the steal tree for the continuation after the spawn, the worker steals from itself and passes the stolen continuation to the worker designated as thief. Henceforth, we shall refer to the thief that steals a continuation in a template schedule as the designated worker for that continuation.

    enum constrain_mode {RWS, StOWS, StUWS, RelWS};
    constrain_mode current_mode = RWS;

    void execute_closure(Closure cl, constrain_mode m):
        // set global work stealing constrain mode
        current_mode = m;
        // execute continuation

    // executed after a spawn is encountered and pushed on the stack,
    // incrementing the current level
    void pushed_spawn(int worker, int spawn, int curLevel, Closure cl):
        if (current_mode != RWS):
            int contThd = cl.tree.br[spawn+1].thd;
            if (contThd != worker):
                // steal continuation from self
                Closure cont = Closure_steal(worker, worker);
                // transfer current steal tree to stolen continuation
                cont.tree = cl.tree;
                // discard transferred steal tree from current continuation
                cl.tree = cl.tree.br[spawn];
                if (current_mode == RelWS || current_mode == StUWS):
                    donate_continuation(contThd, cont);
                else if (current_mode == StOWS):
                    int seq = cont.tree.br[spawn+1].seq;
                    seqs[contThd][seq].cur = cont;
                    seqs[contThd][seq].ready = true;
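The donation step can be illustrated with a small Python sketch. It is a simplified model rather than the paper's implementation. The StealNode class, the dictionary-based br field, and the inboxes mapping are hypothetical stand-ins for the steal tree and for the per-worker buffers that receive donated continuations. The sketch also assumes that the continuation just pushed is the oldest entry on the worker's deque, as in the single-entry example below.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StealNode:
    """Hypothetical steal-tree node for this sketch."""
    thd: int                                # worker that ran this part in the template
    br: dict = field(default_factory=dict)  # spawn index -> node of a stolen continuation


def pushed_spawn(worker, spawn, tree, own_deque, inboxes):
    """Donate the continuation after spawn number `spawn` if the template says so.

    Returns the designated worker, or None when the continuation stays local.
    """
    node = tree.br.get(spawn)
    if node is None or node.thd == worker:
        return None  # not stolen in the template: keep it on our deque
    # "Steal from self": remove the oldest entry, assumed to be the
    # continuation that was just pushed (see the lead-in above).
    cont = own_deque.popleft()
    # Hand the continuation and its subtree to the designated worker.
    inboxes[node.thd].append((cont, node))
    return node.thd


# Template: the continuation after spawn 1 was stolen by worker 2.
tree = StealNode(thd=0, br={1: StealNode(thd=2)})
own = deque(["continuation after spawn 1"])
inboxes = {0: deque(), 1: deque(), 2: deque()}
designated = pushed_spawn(0, 1, tree, own, inboxes)
```

Because the victim pushes work to the designated worker instead of waiting to be robbed, no worker has to search for a victim to reproduce a steal that the template already records.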
[…] a worker is idle, indicating that the schedule has a deficiency at this point. Similar to STUWS, RELWS does not follow the order and uses the bounded buffer to transfer continuations between workers. When the bounded buffer is empty, the worker becomes a thief. If the worker performs a successful steal, it continues to follow the template schedule for the stolen continuation. Any descendent continuations from this stolen continuation that are marked as stolen in the template schedule continue to be treated as steals and get donated to the designated worker. Therefore, each overriding steal only modifies, at most, one branch of the tree. The primary advantage of RELWS is that it can adapt to changes that may arise. However, it does incur the most overhead of the three policies due to its use of the bounded buffer and the stealing overhead when the buffer is empty. The following algorithm shows how the relaxed scheduler functions:

    void RelWS_Scheduler(Closure starting):
        int startThd = starting.tree.thd;
        donate_continuation(startThd, starting);
        foreach (w in workers):
            while (/not terminated/):
                Closure cl;
                if /ready continuations of w not empty/:
                    cl = try_extract_continuation(w);
                    if (cl):
                        execute_closure(cl, RelWS);
                else:
                    cl = Closure_steal(w, random_victim());
                    if (cl):
                        execute_closure(cl, RelWS);
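The choice at the heart of the relaxed scheduler, donated work first and random stealing only as a fallback, can be sketched in Python as follows. This is an illustrative model only. The function relws_next and the ready and deques lists are hypothetical names, with ready standing in for the per-worker bounded buffer, and the sketch ignores concurrency and buffer capacity.

```python
import random
from collections import deque


def relws_next(w, ready, deques, rng):
    """Pick the next closure for worker w under the relaxed policy.

    Returns (closure, source), where source is "donated" or "stolen",
    or (None, None) when nothing could be obtained on this attempt.
    """
    if ready[w]:
        # Follow the template: run a continuation donated to this worker.
        return ready[w].popleft(), "donated"
    if len(deques) < 2:
        return None, None  # no other worker to steal from
    # Buffer empty: become a thief and try one random victim.
    victim = rng.choice([v for v in range(len(deques)) if v != w])
    if deques[victim]:
        return deques[victim].popleft(), "stolen"  # oldest entry
    return None, None


rng = random.Random(0)
ready = [deque(["a"]), deque()]
deques = [deque(), deque(["x", "y"])]
first = relws_next(0, ready, deques, rng)   # worker 0 has donated work
second = relws_next(0, ready, deques, rng)  # buffer now empty, so it steals
```

In the sketch, a stolen closure is returned like any other, which mirrors the text: after an overriding steal the worker keeps following the template for the stolen continuation's descendants.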
We now evaluate the overhead of building the steal tree and the adaptability of the constrained schedulers with the recursive Fibonacci benchmark (fib) implemented in Cilk. For all of the experiments conducted with fib, we calculate the 48th Fibonacci number, unless specified differently. When we reach the depth of fib(30), we invoke a sequential kernel.

Experimental Setup. All of the experiments in this paper were performed on an Intel 80-core machine, composed of eight 2.27 GHz E7-8860 processors, each with 10 cores. They are connected via Intel QPI 6.4 GT/s, and the machine has 2 TB of DRAM. All our codes were compiled with GNU GCC version 4.3.4, using the MIT Cilk 5.4.6 translator [7] or just with GCC and OpenMP 3.0 (version 200805). For the OpenMP results, we tried using ICC with the Intel OpenMP implementation, but found no significant scaling difference. The machine runs Red Hat Linux version 4.4.7-3 and has been configured to use a page size of 4096 bytes. All of our codes set the affinity of created threads, pinning each thread (in Cilk or OpenMP) to a specific core during the execution. The first 10 threads created are pinned to a single socket.

Overhead Evaluation. In Figure 2a, the first set of bars plots the normalized execution time compared to executing fib without tracing for the four different configurations. Building the steal tree, shown as “Trace” on the plot, incurs very little overhead and is within the standard deviation. We observe that the strict ordered scheduler speeds up execution by 1.4%, but unordered and relaxed work stealing incur an execution time penalty of about 6.8% and 7.8% with standard deviations of 0.8% and 2.2%, respectively. This matches our expectation that the strict ordered scheduler slightly improves performance if the computation is sufficiently load balanced and the schedule is appropriate, but unordered or relaxed schedulers may impose overheads if they are not needed to refine the schedule.

Adaptability Evaluation. In the second set of bars, we test the efficacy of RELWS by using a schedule from a smaller problem size for a larger problem to observe how it adapts. We first execute fib(48), extract the schedule, and use that schedule as a template for fib(48+6). We compare this to running fib(48+6) with the default Cilk scheduler. We find that using the strict ordered scheduler incurs a performance penalty of around 8% due to scheduling deficiencies. This is from the mismatch between the template schedule and the work being performed. The strict unordered scheduler causes high average overhead but with a large standard deviation, indicating that the execution time is unpredictable, because the unordered scheduler has varying performance depending on the order of execution. However, RELWS recovers the lost performance entirely by adapting to the new problem size, achieving performance close to the native schedule.

In Figure 2b, we vary the number of working threads to further show the flexibility of the RELWS scheduler. The first bar in each set plots the execution time from fib(48) with p - 10 threads. The second bar shows the execution time with p threads. In the third bar, we show the execution time using the schedule produced with p - 10 threads as a template for p threads. We observe that with RELWS the performance almost matches a schedule natively generated by Cilk for this scenario. In the case of p = 20, the performance is within 0.35% of the native. For p = 40, it is 2.6%, and for p = 60, it is 3.97%.

In the final fib experiment shown in Figure 2c, we present the baseline execution time for fib(48). Then, we arbitrarily slow down a single worker by inflating the size of every task at the bottom of the tree for only those tasks the slow worker executes. This is performed by enlarging every task the slow worker executes from fib(n) to fib(n+3). Using the strict ordered scheduler with a slow worker increases the execution time by a factor of 4. Using RELWS on this same schedule restores the performance almost entirely by stealing work away from the slow worker.

V. WHOLE PROGRAM DATA LOCALITY OPTIMIZATION

We demonstrate the usefulness of the algorithms presented for optimizing data locality for six benchmarks. To iteratively optimize data locality, we start with a template schedule that may be extracted from the data initialization code, depending on whether or not the initialization code has similar structure to the kernel. If not, the initial template schedule is derived from applying random work stealing on the kernel code. After we apply RELWS to the template schedule for the kernel, we redistribute the data by invoking a fork/join initializer that copies and reinitializes the data constrained by the strict ordered scheduler. We iteratively apply this method to gradually localize data and correspondingly load balance the schedule. For the benchmarks tested, we found that the schedules and data distributions converge quickly, within about three to five iterations.

[…] been used to increase granularity [36], while other work has focused on increasing the steal granularity [37], [38], which is different than our approach.

Our approach is applied to the Cilk work-first scheduling runtime and can be adapted to other fork/join models. It can be directly applied to other work-stealing models, such as a help-first scheduler. However, for more divergent models, efficient tracing and constrained execution algorithms will be required to effectively implement our methodology.

VIII. CONCLUSIONS

We present an approach to optimize fork/join programs for data locality and grain size selection. We describe two different methodologies: (1) user-specified steal tree construction that requires additional programmer effort and is not adaptive and (2) an automatic iterative optimization scheme that nearly converges to the same performance. The evaluation demonstrates that we can obtain up to 2.5x performance improvement using our iterative scheme. We show that high performance still can be obtained without the application-specific knowledge user specification requires while maintaining the efficiency and automatic load balancing that work stealing provides. We also show that dynamic coarsening can effectively match the performance of a manually optimized grain size while retaining the scheduling flexibility of finer tasks.

ACKNOWLEDGMENTS

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under contract number 63823. The research was performed using PNNL Institutional Computing at Pacific Northwest National Laboratory.

REFERENCES

[1] D. Lea, “Java specification request 166: Concurrency utilities,” 2004.
[2] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, “Cilk: An efficient multithreaded runtime system,” JPDC, vol. 37, no. 1, pp. 55–69, 1996.
[3] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang, “The design of OpenMP tasks,” TPDS, vol. 20, no. 3, pp. 404–418, 2009.
[4] V. A. Saraswat, V. Sarkar, and C. von Praun, “X10: concurrent programming for modern architectures,” in PPOPP, 2007, p. 271.
[5] “Cilk plus,” http://www.cilkplus.org.
[6] P. Halpern, 2012. [Online]. Available: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3409.pdf
[7] M. Frigo, C. E. Leiserson, and K. H. Randall, “The implementation of the Cilk-5 multithreaded language,” ACM Sigplan Notices, vol. 33, no. 5, pp. 212–223, 1998.
[8] M. Frigo, P. Halpern, C. E. Leiserson, and S. Lewin-Berlin, “Reducers and other Cilk++ hyperobjects,” in SPAA '09. ACM, 2009, pp. 79–90.
[9] J. Lifflander, S. Krishnamoorthy, and L. V. Kale, “Steal tree: low-overhead tracing of work stealing schedulers,” in PLDI '13, 2013, pp. 507–518.
[10] L.-N. Pouchet, “Polybench: The polyhedral benchmark suite,” 2012.
[11] J.-S. Park, M. Penner, and V. K. Prasanna, “Optimizing graph algorithms for improved cache performance,” TPDS, vol. 15, no. 9, pp. 769–782, 2004.
[12] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber et al., “The NAS parallel benchmarks,” Int. J. High Perform. Comput. Appl., vol. 5, no. 3, pp. 63–73, 1991.
[13] S. Seo, G. Jo, and J. Lee, “Performance characterization of the NAS parallel benchmarks in OpenCL,” in IISWC '11, Nov 2011.
[14] R. E. Ladner and M. J. Fischer, “Parallel prefix computation,” Journal of the ACM (JACM), vol. 27, no. 4, pp. 831–838, 1980.
[15] Y. Guo, R. Barik, R. Raman, and V. Sarkar, “Work-first and help-first scheduling policies for async-finish task parallelism,” in IPDPS '09, 2009, pp. 1–12.
[16] Y. Yan, J. Zhao, Y. Guo, and V. Sarkar, “Hierarchical place trees: A portable abstraction for task parallelism and data movement,” in LCPC '10. Springer, 2010, pp. 172–187.
[17] S.-J. Min, C. Iancu, and K. Yelick, “Hierarchical work stealing on manycore clusters,” in PGAS '11, 2011.
[18] U. A. Acar, G. E. Blelloch, and R. D. Blumofe, “The data locality of work stealing,” TOCS, vol. 35, no. 3, pp. 321–347, 2002.
[19] G. E. Blelloch and P. B. Gibbons, “Effectively sharing a cache among threads,” in SPAA '04, 2004, pp. 235–244.
[20] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. Von Praun, and V. Sarkar, “X10: an object-oriented approach to non-uniform cluster computing,” Acm Sigplan Notices, vol. 40, no. 10, pp. 519–538, 2005.
[21] L. V. Kale and S. Krishnan, CHARM++: a portable concurrent object oriented system based on C++, 1993.
[22] G. Zheng, “Achieving high performance on extremely large parallel machines: performance prediction and load balancing,” Ph.D. dissertation, UIUC, 2005.
[23] J. Lifflander, S. Krishnamoorthy, and L. V. Kale, “Work stealing and persistence-based load balancers for iterative overdecomposed applications,” ser. HPDC '12, 2012, pp. 137–148.
[24] D. S. Nikolopoulos, E. Artiaga, E. Ayguadé, and J. Labarta, “Exploiting memory affinity in OpenMP through schedule reuse,” ACM SIGARCH Computer Architecture News, vol. 29, no. 5, pp. 49–55, 2001.
[25] S. L. Olivier, B. R. de Supinski, M. Schulz, and J. F. Prins, “Characterizing and mitigating work time inflation in task parallel programs,” in SC '12, 2012, pp. 1–12.
[26] L. Huang, H. Jin, L. Yi, and B. Chapman, “Enabling locality-aware computations in OpenMP,” Sci. Program., vol. 18, no. 3-4, pp. 169–181, Aug. 2010.
[27] J. Bircsak, P. Craig, R. Crowell, Z. Cvetanovic, J. Harris, C. A. Nelson, and C. D. Offner, “Extending OpenMP for NUMA machines,” in SC '00, 2000.
[28] F. Broquedis, N. Furmento, B. Goglin, R. Namyst, and P.-A. Wacrenier, “Dynamic task and data placement over NUMA architectures: An OpenMP runtime perspective,” in IWOMP '09, 2009, pp. 79–92.
[29] S. L. Olivier, A. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins, “OpenMP task scheduling strategies for multicore NUMA systems,” Int. J. High Perform. Comput. Appl., vol. 26, no. 2, May 2012.
[30] Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson, “The pochoir stencil compiler,” in SPAA '11. ACM, 2011, pp. 117–128.
[31] A. Chien, W. Feng, V. Karamcheti, and J. Plevyak, “Techniques for efficient execution of fine-grained concurrent programs,” in LCPC '93. Springer, 1993, pp. 160–174.
[32] Y. Sun, G. Zheng, P. Jetley, and L. V. Kale, “An Adaptive Framework for Large-scale State Space Search,” in LSPP '11, May 2011.
[33] V. Kumar, D. Frampton, S. M. Blackburn, D. Grove, and O. Tardieu, “Work-stealing without the baggage,” OOPSLA '12, vol. 47, no. 10, pp. 297–314, 2012.
[34] A. Tzannes, G. C. Caragea, R. Barua, and U. Vishkin, “Lazy binary-splitting: A run-time adaptive work-stealing scheduler,” in PPoPP '10, 2010.
[35] M. A. Rainey, “Effective scheduling techniques for high-level parallel programming languages,” Ph.D. dissertation, Chicago, IL, USA, 2010.
[36] E. Mohr, D. A. Kranz, and R. H. Halstead Jr, “Lazy task creation: A technique for increasing the granularity of parallel programs,” TPDS, vol. 2, no. 3, pp. 264–280, 1991.
[37] S. L. Olivier and J. F. Prins, “Evaluating OpenMP 3.0 runtime systems on unbalanced task graphs,” in IWOMP '09. Springer, 2009, pp. 63–78.
[38] D. Hendler and N. Shavit, “Non-blocking steal-half work queues,” in PODC '02. ACM, 2002, pp. 280–289.
