II. BACKGROUND

We present our approach in the context of Cilk, an exemplar fork/join model. In this section, we briefly describe the Cilk scheduler and its data access behavior. More detailed descriptions can be found elsewhere [7], [8].

Cilk is a parallel programming extension to the C language that introduces three additional keywords: cilk, spawn, and sync. The cilk keyword specifies that a function is capable of being executed in parallel. The spawn keyword specifies that the invoked function, referred to as the spawned task, is concurrent with the statements that follow in the spawning task. No statement following the sync statement in a task can be executed until all preceding tasks have completed. Every function invocation is treated as a task. At any given point in time, the sequence of instructions remaining to be executed in a task is referred to as the task's continuation. A continuation often is marked by a goto label and referred to by an integer. A closure refers to the partially executed state of a task, or the state of the local variables in the function invocation. Execution begins with one of the threads executing the closure corresponding to the main() function. The actions of the scheduler can be described by the following loop:

    void Cilk_Scheduler():
      foreach (w in workers):
        while (/* not terminated */):
          Closure cl = try_steal_from_deque(w);
          if (!cl) cl = Closure_steal(w, random_victim());
          if (cl) execute_closure(cl);

    void execute_closure(Closure cl):
      // execute the corresponding closure starting at continuation
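The loop above can be illustrated with a small single-threaded sketch in plain C. It is not the Cilk runtime: closures are reduced to integer IDs, the deque is a fixed-size array with no synchronization, and the victim is chosen round-robin instead of randomly so that the behavior is deterministic. The names Deque, push_bottom, pop_bottom, steal_top, and next_closure are hypothetical. The sketch assumes the usual work-stealing discipline in which the owner takes the newest entry of its own deque and a thief takes the oldest entry of the victim's deque.

```c
#define DEQUE_CAP 64
#define NO_CLOSURE (-1)

/* Hypothetical per-worker deque of closure IDs. */
typedef struct {
    int items[DEQUE_CAP];
    int top;     /* index of the oldest entry */
    int bottom;  /* one past the newest entry */
} Deque;

void deque_init(Deque *d) { d->top = 0; d->bottom = 0; }

/* Owner pushes a new closure at the bottom; returns 0 when the array is full. */
int push_bottom(Deque *d, int closure) {
    if (d->bottom == DEQUE_CAP) return 0;
    d->items[d->bottom++] = closure;
    return 1;
}

/* Owner takes the newest closure from its own deque. */
int pop_bottom(Deque *d) {
    if (d->bottom == d->top) return NO_CLOSURE;
    return d->items[--d->bottom];
}

/* Thief takes the oldest closure from a victim's deque. */
int steal_top(Deque *d) {
    if (d->bottom == d->top) return NO_CLOSURE;
    return d->items[d->top++];
}

/* One scheduler step for worker w: local work first, then try the others. */
int next_closure(Deque *deques, int nworkers, int w) {
    int cl = pop_bottom(&deques[w]);
    for (int k = 1; cl == NO_CLOSURE && k < nworkers; k++) {
        int victim = (w + k) % nworkers;  /* assumption: round-robin, not random */
        cl = steal_top(&deques[victim]);
    }
    return cl;
}
```

A caller would repeatedly invoke next_closure for each worker and execute whatever closure ID it returns, stopping once every call yields NO_CLOSURE.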
A worker with a valid closure executes the closure in a work-first (i.e., depth-first) fashion. A spawned task is immediately executed, allowing the continuation of the currently executing task to be stolen. Upon fully executing a task, a worker returns to execute the invoking task. When no local work is available, a thread becomes a thief and randomly attempts to steal the oldest continuation from another thread. The state of this continuation is encapsulated in a closure, which is then executed. This proceeds until the root task completes execution. A schedule is a specification of an ordered list of tasks executed by each thread.

While shown to be provably space- and time-efficient, Cilk does not take data locality into account. In particular, we consider two challenges associated with executing fork/join programs that incur non-trivial data access costs.

First, the lack of locality-awareness across computation phases significantly impacts performance. For example, we evaluated the execution time for performing a parallel memory copy between two 8 GB arrays on an 80-core system after both arrays have been initialized (system configuration is detailed in Section IV-C). The memory copy is organized as concurrent tasks operating on contiguous blocks of the array. A statically scheduled OpenMP loop aligned the initialization and copy operations and took 169 ms to perform the memory copy. Conversely, implementations using Cilk and OpenMP tasks took 436 ms.

Second, the performance of such programs is very sensitive to task granularity. The preceding results were obtained using a 512 KB block size, the best performing Cilk and OpenMP tasks version. Other block sizes, ranging from 4 KB (page size) to a few MBs, performed worse than this. While compute-bound programs can be optimized in terms of the smallest task granularity that maximizes sequential performance, optimizing data access costs imposes additional challenges.

III. OVERVIEW

In this section, we present an overview of our approach to data locality optimization for fork/join programs. Our objective is to match the actions of the work-stealing scheduler across different phases of a fork/join program. For example, making the same worker thread execute the initialization and copy tasks on a given data block would improve data locality and thus performance, resulting in the same performance as the OpenMP statically scheduled loops.

Here, we focus on a runtime approach to data locality optimization by matching the task execution schedules across different phases with user guidance. This involves efficient construction of a template schedule and algorithms to constrain the actions of the work-stealing scheduler to execute subsequent phases to follow a previously constructed template schedule.

We exploit the fact that the schedule for a fork/join program scheduled using work stealing can be described in terms of the steal operations involved. These steal operations can be combined to construct a steal tree that represents a phase's execution schedule. We present two approaches to construct the template schedules. The first approach involves efficiently tracing the execution of a phase to extract the corresponding steal tree. The second approach involves user-specified partitioning of the work, where the user constructs a synthetic steal tree as the program is executed. Automated extraction of the steal tree (in Section IV-A) minimizes user effort but cannot immediately optimize for data locality and load balance. User-specified steal trees (discussed in Section IV-B) provide direct control to the user while requiring additional user effort.

We describe three constrained schedulers (discussed in Section IV-C) that present different trade-offs between faithfully preserving the template schedule and continuing to improve load balance. The two strict schedulers preserve the steal tree provided and can avoid the costs associated with work stealing. The relaxed replay scheduler incrementally load balances the computation, starting from the template schedule. The following code snippet illustrates the optimization approach:

    spawn initialization(); sync;
    // s <- extract schedule from initialization
    for (i = 0; i < numIters; i++)
      if (!converged):
        // s <- use relaxed scheduler with s on kernel
        spawn kernel(); sync;
        // use strict scheduler with s on data relocation code
      else:
        // use strict scheduler with s to maintain locality
        spawn kernel(); sync;
In this snippet, the user is interested in matching the schedules for the initialization (the initialization() spawn) and iterative kernel (the kernel() spawn) phases. The user extracts the schedule for the initialization phase into s. This schedule may not be optimal due to data locality inefficiencies in the initialization phase itself. In addition, the extracted schedule might not result

spawned this one) also should have a stolen continuation [9]. This is true when the number of nested steals at this point equals the number of nested spawns minus one. If this is not true, then the parent task did not have a stolen continuation.

    void check_validity(Closure cl, int curLevel):
      assert(curLevel == cl.level + 1);

Schedule Extraction API. At this point, the schedule constructed can be extracted using the routine extractSchedulePrevious(). This routine returns a StealTree structure representing the steal tree rooted at the immediately preceding spawn. Note that the statements following the spawn can be executed while the spawned task, and those spawned transitively, continue to be executed. Therefore, a call to extract the steal tree must be preceded by a sync to ensure that the steal tree has been completely constructed before it is extracted. Given that the steal tree is requested a posteriori, we construct the steal tree for every spawn throughout the computation. An optimized implementation could include a split-phase design in which a specification of the intent to extract a steal tree triggers the steal tree construction for a spawn. However, we observe that the overheads of steal tree construction are marginal in practice because the steal tree construction operations are proportional to the number of steals, which are a small fraction of the total number of tasks in a fork/join program with sufficient concurrency.

C. Constrained Work Stealing

We have implemented three different scheduling algorithms that constrain a work-stealing scheduler to a template schedule with varying levels of fidelity. Depending on how refined and effective a schedule is for a given computation, it may need to be exactly followed or revised to adapt to changes in the environment (e.g., changes in the locality of data accessed). In Figure 1, we depict how the three types of constrained schedulers can improve a default schedule:

- STOWS (strict ordered work stealing): the work-stealing scheduler exactly follows a template schedule t by guiding each worker to execute the tasks in the order prescribed by t.
- STUWS (strict unordered work stealing): the work-stealing scheduler approximates a template schedule t by guiding each worker to execute the same tasks it executed in t, but it allows them to greedily deviate in order (ensuring that all dependencies are followed).
- RELWS (relaxed work stealing): the work-stealing scheduler approximates a template schedule t by guiding each worker to execute the tasks it executed in t in any order, while allowing further steals when a worker is idle.

Constraining the execution to a template schedule requires coordinating the workers so that each worker steals the same closures as dictated by the schedule. Depending on the level of fidelity, the order may matter or further steals may be allowed. A possible method to implement this involves coordinating the thieves so they steal from the same victims as specified during the designated working phase. However, this may slow down the victim if it has to wait for the thief to steal due to perturbations in the execution or schedule variations from lower levels of fidelity (unordered or relaxed work stealing). Hence, we have implemented all of the scheduling algorithms using a donation protocol: when a continuation marked as stolen in the template schedule is encountered, the victim steals the next continuation from itself and donates the closure to the worker designated in the steal tree.

Fig. 1: Example computation scheduled with the default scheduler and then modified with three types of constrained work stealing. Default scheduler: all workers begin busy with work then attempt to steal when they finish. STOWS reproduces this schedule without having to search for work. STUWS is able to revise the order on thread 0, reducing the time. RELWS performs an additional steal, further balancing the workload.

The common operation in constrained work stealing executes a closure in the current constrain mode. Once a spawn is encountered and pushed on the stack, if there exists a steal point in the steal tree for the continuation after the spawn, the worker steals from itself and passes the stolen continuation to the worker designated as thief. Henceforth, we shall refer to the thief that steals a continuation in a template schedule as the designated worker for that continuation.

    enum constrain_mode {RWS, StOWS, StUWS, RelWS};
    constrain_mode current_mode = RWS;

    void execute_closure(Closure cl, constrain_mode m):
      // set global work-stealing constrain mode
      current_mode = m;
      // execute continuation

    // executed after a spawn is encountered and pushed on the stack,
    // incrementing the current level
    void pushed_spawn(int worker, int spawn, int curLevel, Closure cl):
      if (current_mode != RWS):
        int contThd = cl.tree.br[spawn+1].thd;
        if (contThd != worker):
          // steal continuation from self
          Closure cont = Closure_steal(worker, worker);
          // transfer current steal tree to stolen continuation
          cont.tree = cl.tree;
          // discard transferred steal tree from current continuation
          cl.tree = cl.tree.br[spawn];
          if (current_mode == RelWS || current_mode == StUWS):
            donate_continuation(contThd, cont);
          else if (current_mode == StOWS):
            int seq = cont.tree.br[spawn+1].seq;
            seqs[contThd][seq].cur = cont;
            seqs[contThd][seq].ready = true;
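The branching in pushed_spawn can be isolated as a small decision function. The following C sketch is a simplification made for illustration: the steal tree is reduced to a single Branch record carrying the designated thread and sequence number, closures are integer IDs, and the names Branch, Action, Slot, donation_action, donate_ordered, and take_in_order are hypothetical. It also assumes that, under strict ordered work stealing, the designated worker consumes its slots strictly in sequence and waits when the next slot is not yet ready.

```c
#include <stdbool.h>

typedef enum { RWS, StOWS, StUWS, RelWS } constrain_mode;

/* Hypothetical flattened steal-tree branch for the continuation after a spawn. */
typedef struct {
    int thd;  /* worker designated to run this continuation */
    int seq;  /* position in that worker's ordered sequence (StOWS only) */
} Branch;

typedef enum { KEEP_LOCAL, DONATE_UNORDERED, DONATE_ORDERED } Action;

/* Decide what the current worker does with the continuation after a spawn. */
Action donation_action(constrain_mode mode, int worker, Branch cont) {
    if (mode == RWS) return KEEP_LOCAL;         /* unconstrained: thieves decide */
    if (cont.thd == worker) return KEEP_LOCAL;  /* template has no steal here */
    if (mode == StOWS) return DONATE_ORDERED;   /* placed in slot cont.seq */
    return DONATE_UNORDERED;                    /* StUWS and RelWS */
}

/* One entry of a designated worker's ordered sequence. */
typedef struct {
    int closure;
    bool ready;
} Slot;

/* StOWS donor: publish the closure in the slot the template assigns to it. */
void donate_ordered(Slot *slots, Branch cont, int closure) {
    slots[cont.seq].closure = closure;
    slots[cont.seq].ready = true;
}

/* StOWS receiver: take slot `next` only once it has been filled. */
bool take_in_order(const Slot *slots, int next, int *closure) {
    if (!slots[next].ready) return false;
    *closure = slots[next].closure;
    return true;
}
```

The ordered path is what distinguishes STOWS from the other two modes: a donation that arrives early stays in its slot until every earlier slot has been consumed, whereas the unordered path lets the receiver run donations in whatever order they arrive.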
a worker is idle, indicating that the schedule has a deficiency at this point. Similar to STUWS, RELWS does not follow the order and uses the bounded buffer to transfer continuations between workers. When the bounded buffer is empty, the worker becomes a thief. If the worker performs a successful steal, it continues to follow the template schedule for the stolen continuation. Any descendant continuations from this stolen continuation that are marked as stolen in the template schedule continue to be treated as steals and get donated to the designated worker. Therefore, each overriding steal only modifies, at most, one branch of the tree. The primary advantage of RELWS is that it can adapt to changes that may arise. However, it does incur the most overhead of the three policies due to its use of the bounded buffer and the stealing overhead when the buffer is empty. The following algorithm shows how the relaxed scheduler functions:

    void RelWS_Scheduler(Closure starting):
      int startThd = starting.tree.thd;
      donate_continuation(startThd, starting);
      foreach (w in workers):
        while (/* not terminated */):
          Closure cl;
          if (/* ready continuations of w not empty */):
            cl = try_extract_continuation(w);
            if (cl):
              execute_closure(cl, RelWS);
          else:
            cl = Closure_steal(w, random_victim());
            if (cl):
              execute_closure(cl, RelWS);
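The bounded buffer that carries donated continuations can be pictured as a fixed-capacity ring. The C sketch below is a single-threaded illustration only: the capacity, the FIFO order, and the choice to reject a donation when the buffer is full are assumptions, since the text does not specify them, and a real implementation would need synchronization between donor and receiver. The names BoundedBuffer, buffer_init, donate, and try_extract are hypothetical.

```c
#include <stdbool.h>

#define BUF_CAP 4

/* Hypothetical per-worker bounded buffer of donated closure IDs (FIFO ring). */
typedef struct {
    int items[BUF_CAP];
    int head;   /* index of the next closure to extract */
    int count;  /* number of closures currently stored */
} BoundedBuffer;

void buffer_init(BoundedBuffer *b) {
    b->head = 0;
    b->count = 0;
}

/* Donor side: returns false when the buffer is full (assumed policy). */
bool donate(BoundedBuffer *b, int closure) {
    if (b->count == BUF_CAP) return false;
    b->items[(b->head + b->count) % BUF_CAP] = closure;
    b->count++;
    return true;
}

/* Receiver side: returns false when empty, the cue to fall back to stealing. */
bool try_extract(BoundedBuffer *b, int *closure) {
    if (b->count == 0) return false;
    *closure = b->items[b->head];
    b->head = (b->head + 1) % BUF_CAP;
    b->count--;
    return true;
}
```

In terms of the relaxed scheduler above, a false return from try_extract corresponds to the branch in which the worker becomes a thief and calls Closure_steal on a random victim.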
We now evaluate the overhead of building the steal tree and adaptability of using the constrained schedulers with the recursive Fibonacci benchmark (fib) implemented in Cilk. For all of the experiments conducted with fib, we calculate the 48th Fibonacci number, unless specified differently. When we reach the depth of fib(30), we invoke a sequential kernel.

Experimental Setup. All of the experiments in this paper were performed on an Intel 80-core machine, composed of eight 2.27 GHz E7-8860 processors, each with 10 cores. They are connected via Intel QPI 6.4 GT/s, and the machine has 2 TB of DRAM. All our codes were compiled with GNU GCC version 4.3.4, using the MIT Cilk 5.4.6 translator [7] or just with GCC and OpenMP 3.0 (version 200805). For the OpenMP results, we tried using ICC with the Intel OpenMP implementation, but found no significant scaling difference. The machine runs Red Hat Linux version 4.4.7-3 and has been configured to use a page size of 4096 bytes. All of our codes set the affinity of created threads so that each thread (in Cilk or OpenMP) is pinned to a specific core during the execution. The first 10 threads created are pinned to a single socket.

Overhead Evaluation. In Figure 2a, the first set of bars plots the normalized execution time compared to executing fib without tracing for the four different configurations. Building the steal tree, shown as Trace on the plot, incurs very little overhead and is within the standard deviation. We observe that the strict ordered scheduler speeds up execution by 1.4%, but unordered and relaxed work stealing incur an execution time penalty of about 6.8% and 7.8% with standard deviations of 0.8% and 2.2%, respectively. This matches our expectation that the strict ordered scheduler slightly improves performance if the computation is sufficiently load balanced and the schedule is appropriate, but unordered or relaxed schedulers may impose overheads if they are not needed to refine the schedule.

Adaptability Evaluation. In the second set of bars, we test the efficacy of RELWS by using a schedule from a smaller problem size for a larger problem to observe how it adapts. We first execute fib(48), extract the schedule, and use that schedule as a template for fib(48 + 6). We compare this to running fib(48 + 6) with the default Cilk scheduler. We find that using the strict ordered scheduler incurs a performance penalty of around 8% due to scheduling deficiencies. This is from the mismatch between the template schedule and the work being performed. The strict unordered scheduler causes high average overhead but with a large standard deviation, indicating that the execution time is unpredictable, because the unordered scheduler has varying performance depending on the order of execution. However, RELWS recovers the lost performance entirely by adapting to the new problem size, achieving performance close to the native schedule.

In Figure 2b, we vary the number of working threads to further show the flexibility of the RELWS scheduler. The first bar in each set plots the execution time from fib(48) with p - 10 threads. The second bar shows the execution time with p threads. In the third bar, we show the execution time using the schedule produced with p - 10 threads as a template for p threads. We observe the performance almost matches a schedule natively generated by Cilk for the scenario with RELWS. In the case of p = 20, the performance is within 0.35% of the native. For p = 40, it is 2.6%, and for p = 60, it is 3.97%.

In the final fib experiment shown in Figure 2c, we present the baseline execution time for fib(48). Then, we arbitrarily slow down a single worker by inflating the size of every task at the bottom of the tree for only those tasks the slow worker executes. This is performed by enlarging every task the slow worker executes from fib(n) to fib(n + 3). Using the strict ordered scheduler with a slow worker increases the execution time by a factor of 4. Using RELWS on this same schedule restores the performance almost entirely by stealing work away from the slow worker.

V. WHOLE PROGRAM DATA LOCALITY OPTIMIZATION

We demonstrate the usefulness of the algorithms presented for optimizing data locality for six benchmarks. To iteratively optimize data locality, we start with a template schedule that may be extracted from the data initialization code, depending on whether or not the initialization code has similar structure to the kernel. If not, the initial template schedule is derived from applying random work stealing on the kernel code. After we apply RELWS to the template schedule for the kernel, we redistribute the data by invoking a fork/join initializer that copies and reinitializes the data constrained by the strict ordered scheduler. We iteratively apply this method to gradually localize data and correspondingly load balance
the schedule. For the benchmarks tested, we found that the schedules and data distributions converge quickly, within about three to five iterations.

been used to increase granularity [36], while other work has focused on increasing the steal granularity [37], [38], which is different from our approach. Our approach is applied to the Cilk work-first scheduling runtime and can be adapted to other fork/join models. It can be directly applied to other work-stealing models, such as a help-first scheduler. However, for more divergent models, efficient tracing and constrained execution algorithms will be required to effectively implement our methodology.

VIII. CONCLUSIONS

We present an approach to optimize fork/join programs for data locality and grain size selection. We describe two different methodologies: (1) user-specified steal tree construction that requires additional programmer effort and is not adaptive and (2) an automatic iterative optimization scheme that nearly converges to the same performance. The evaluation demonstrates that we can obtain up to 2.5x performance improvement using our iterative scheme. We show that high performance still can be obtained without the application-specific knowledge that user specification requires while maintaining the efficiency and automatic load balancing that work stealing provides. We also show that dynamic coarsening can effectively match the performance of a manually optimized grain size while retaining the scheduling flexibility of finer tasks.

ACKNOWLEDGMENTS

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under contract number 63823. The research was performed using PNNL Institutional Computing at Pacific Northwest National Laboratory.

REFERENCES

[1] D. Lea, "Java specification request 166: Concurrency utilities," 2004.
[2] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, "Cilk: An efficient multithreaded runtime system," JPDC, vol. 37, no. 1, pp. 55-69, 1996.
[3] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang, "The design of OpenMP tasks," TPDS, vol. 20, no. 3, pp. 404-418, 2009.
[4] V. A. Saraswat, V. Sarkar, and C. von Praun, "X10: concurrent programming for modern architectures," in PPOPP, 2007, p. 271.
[5] "Cilk plus," http://www.cilkplus.org.
[6] P. Halpern, 2012. [Online]. Available: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3409.pdf
[7] M. Frigo, C. E. Leiserson, and K. H. Randall, "The implementation of the Cilk-5 multithreaded language," ACM Sigplan Notices, vol. 33, no. 5, pp. 212-223, 1998.
[8] M. Frigo, P. Halpern, C. E. Leiserson, and S. Lewin-Berlin, "Reducers and other Cilk++ hyperobjects," in SPAA '09. ACM, 2009, pp. 79-90.
[9] J. Lifflander, S. Krishnamoorthy, and L. V. Kale, "Steal tree: low-overhead tracing of work stealing schedulers," in PLDI '13, 2013, pp. 507-518.
[10] L.-N. Pouchet, "Polybench: The polyhedral benchmark suite," 2012.
[11] J.-S. Park, M. Penner, and V. K. Prasanna, "Optimizing graph algorithms for improved cache performance," TPDS, vol. 15, no. 9, pp. 769-782, 2004.
[12] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber et al., "The NAS parallel benchmarks," Int. J. High Perform. Comput. Appl., vol. 5, no. 3, pp. 63-73, 1991.
[13] S. Seo, G. Jo, and J. Lee, "Performance characterization of the NAS parallel benchmarks in OpenCL," in IISWC '11, Nov 2011.
[14] R. E. Ladner and M. J. Fischer, "Parallel prefix computation," Journal of the ACM (JACM), vol. 27, no. 4, pp. 831-838, 1980.
[15] Y. Guo, R. Barik, R. Raman, and V. Sarkar, "Work-first and help-first scheduling policies for async-finish task parallelism," in IPDPS '09, 2009, pp. 1-12.
[16] Y. Yan, J. Zhao, Y. Guo, and V. Sarkar, "Hierarchical place trees: A portable abstraction for task parallelism and data movement," in LCPC '10. Springer, 2010, pp. 172-187.
[17] S.-J. Min, C. Iancu, and K. Yelick, "Hierarchical work stealing on manycore clusters," in PGAS '11, 2011.
[18] U. A. Acar, G. E. Blelloch, and R. D. Blumofe, "The data locality of work stealing," TOCS, vol. 35, no. 3, pp. 321-347, 2002.
[19] G. E. Blelloch and P. B. Gibbons, "Effectively sharing a cache among threads," in SPAA '04, 2004, pp. 235-244.
[20] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. Von Praun, and V. Sarkar, "X10: an object-oriented approach to non-uniform cluster computing," ACM Sigplan Notices, vol. 40, no. 10, pp. 519-538, 2005.
[21] L. V. Kale and S. Krishnan, "CHARM++: a portable concurrent object oriented system based on C++," 1993.
[22] G. Zheng, "Achieving high performance on extremely large parallel machines: performance prediction and load balancing," Ph.D. dissertation, UIUC, 2005.
[23] J. Lifflander, S. Krishnamoorthy, and L. V. Kale, "Work stealing and persistence-based load balancers for iterative overdecomposed applications," ser. HPDC '12, 2012, pp. 137-148.
[24] D. S. Nikolopoulos, E. Artiaga, E. Ayguadé, and J. Labarta, "Exploiting memory affinity in OpenMP through schedule reuse," ACM SIGARCH Computer Architecture News, vol. 29, no. 5, pp. 49-55, 2001.
[25] S. L. Olivier, B. R. de Supinski, M. Schulz, and J. F. Prins, "Characterizing and mitigating work time inflation in task parallel programs," in SC '12, 2012, pp. 1-12.
[26] L. Huang, H. Jin, L. Yi, and B. Chapman, "Enabling locality-aware computations in OpenMP," Sci. Program., vol. 18, no. 3-4, pp. 169-181, Aug. 2010.
[27] J. Bircsak, P. Craig, R. Crowell, Z. Cvetanovic, J. Harris, C. A. Nelson, and C. D. Offner, "Extending OpenMP for NUMA machines," in SC '00, 2000.
[28] F. Broquedis, N. Furmento, B. Goglin, R. Namyst, and P.-A. Wacrenier, "Dynamic task and data placement over NUMA architectures: An OpenMP runtime perspective," in IWOMP '09, 2009, pp. 79-92.
[29] S. L. Olivier, A. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins, "OpenMP task scheduling strategies for multicore NUMA systems," Int. J. High Perform. Comput. Appl., vol. 26, no. 2, May 2012.
[30] Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson, "The pochoir stencil compiler," in SPAA '11. ACM, 2011, pp. 117-128.
[31] A. Chien, W. Feng, V. Karamcheti, and J. Plevyak, "Techniques for efficient execution of fine-grained concurrent programs," in LCPC '93. Springer, 1993, pp. 160-174.
[32] Y. Sun, G. Zheng, P. Jetley, and L. V. Kale, "An Adaptive Framework for Large-scale State Space Search," in LSPP '11, May 2011.
[33] V. Kumar, D. Frampton, S. M. Blackburn, D. Grove, and O. Tardieu, "Work-stealing without the baggage," OOPSLA '12, vol. 47, no. 10, pp. 297-314, 2012.
[34] A. Tzannes, G. C. Caragea, R. Barua, and U. Vishkin, "Lazy binary-splitting: A run-time adaptive work-stealing scheduler," in PPoPP '10, 2010.
[35] M. A. Rainey, "Effective scheduling techniques for high-level parallel programming languages," Ph.D. dissertation, Chicago, IL, USA, 2010.
[36] E. Mohr, D. A. Kranz, and R. H. Halstead Jr, "Lazy task creation: A technique for increasing the granularity of parallel programs," TPDS, vol. 2, no. 3, pp. 264-280, 1991.
[37] S. L. Olivier and J. F. Prins, "Evaluating OpenMP 3.0 run time systems on unbalanced task graphs," in IWOMP '09. Springer, 2009, pp. 63-78.
[38] D. Hendler and N. Shavit, "Non-blocking steal-half work queues," in PODC '02. ACM, 2002, pp. 280-289.