
computation does not exceed the stack threshold. If it exceeds the threshold, the adaptive scheduler will move the stack pressure to the heap, since the heap is usually much larger than the stack. To obtain good scalability in both scenarios, when steals are rare vs. frequent, SLAW has a heuristic to adjust scheduling policies according to the stealing rate. The adaptiveness in scheduling policy and the bounded stack usage make SLAW capable of handling large irregular computations.

The experimental results from a variety of benchmarks show that SLAW's adaptive scheduler achieves 0.98× to 9.2× speedup over the help-first scheduler and 0.97× to 4.5× speedup over the work-first scheduler for 64-thread executions. In contrast, the help-first policy is 9.2× slower than work-first in the worst case for a fixed help-first policy, and the work-first policy is 3.7× slower than help-first in the worst case for a fixed work-first policy. Further, for large irregular recursive parallel computations, the adaptive scheduler runs with bounded stack usage and achieves performance (and supports data sizes) that cannot be delivered by the use of any single fixed policy.

Locality-aware scheduling: Work-stealing has also been known to be cache-unfriendly for some applications due to randomized stealing [11]. For tasks that share the same memory footprint, randomized locality-oblivious work-stealing schedulers do nothing to ensure that these tasks are scheduled on workers that share a cache. This significantly limits the scalability of some memory-bandwidth-bound applications on machines that have separate caches. SLAW is designed for programming models in which locality hints are provided by the programmer or the compiler, such as X10's places [3] and Chapel's locales [1]. In SLAW, workers are grouped by places, and our current implementation restricts stealing to occur only within places. Our experimental results show that locality-aware scheduling can achieve up to 2.6× speedup over locality-oblivious scheduling by increasing temporal data reuse. We believe that the performance benefits of locality-aware scheduling will continue to increase in future systems.

Organization of the paper: Section II describes the context for the SLAW scheduler, including the work-first and help-first scheduling policies and the Habanero-Java (HJ) language [12]. Section III describes the SLAW adaptive scheduling algorithm, and Section IV describes extensions to SLAW for locality-awareness. Sections V to VII contain our experimental results, related work discussion, and conclusions, respectively.

II. BACKGROUND: SLAW CONTEXT

A. Work-stealing Scheduling Policies

Work-first and help-first are two commonly used task scheduling policies applied when spawning a task. Under the work-first policy, the worker executes the spawned task eagerly and leaves the continuation to be stolen. Under the help-first policy, the worker makes the spawned task available for stealing and itself continues execution of the parent task. Both policies yield coarse-grain tasks to the thief, because stealing proceeds from the root of the spawn tree towards the leaves. The work-first and help-first policies have different stack and memory requirements, and they exhibit performance issues in complementary scenarios.
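The contrast between the two policies at a spawn point can be sketched in Java as follows. This is only an illustrative sketch, not SLAW's implementation: the Task type, the single deque, and the method names are hypothetical, and all synchronization is elided.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Sketch: what a worker keeps vs. exposes for stealing at a spawn point.
    class SpawnPolicySketch {
        interface Task { void run(); }

        private final Deque<Task> deque = new ArrayDeque<>(); // thieves take from the other end

        // Work-first: dive into the child now; the parent's continuation becomes stealable.
        void spawnWorkFirst(Task child, Task parentContinuation) {
            deque.push(parentContinuation);
            child.run();
        }

        // Help-first: expose the child for stealing; keep executing the parent.
        void spawnHelpFirst(Task child) {
            deque.push(child);
            // the worker simply continues with the parent task
        }
    }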
1) Performance Issues: Context switches represent a major source of overhead in a work-stealing scheduler [13]. A context switch is performed when the worker cannot continue normal sequential execution. This happens in two situations: a) when the current task terminates and the worker cannot resume execution of its parent task, which can happen when the continuation of the parent task is stolen or when the current task was not spawned under the work-first policy; b) when the current task reaches a synchronization point and there are still unfinished descendant tasks. In both situations, the worker returns to the scheduler to request a new task. In the second situation, the task is suspended, and the context switch also saves the activation frame of the current task to the heap so that it can be resumed later. Compared to normal sequential execution, the overhead of context switches primarily arises from bookkeeping overhead in the scheduler and from cache misses.

Under the work-first policy, if there is only one worker thread, it will execute all tasks in the same order as the equivalent sequential program, thereby requiring no context switch at task synchronization points. In general, the work-first policy works well in situations where steals are infrequent [10]. However, for applications whose spawn trees are shallow and wide, steals can become frequent because each steal yields a limited amount of work. Consider a scenario in which one busy worker creates N tasks consecutively and the other N−1 workers are idle and looking for tasks. In order to distribute N tasks to the N−1 idle workers under the work-first policy, continuations must be passed from the victim worker to the thief through stealing, and there will be N−1 such steals of continuations. More importantly, these steals must be serialized, thereby limiting scalability and contributing to a large steal overhead. On the other hand, under the help-first policy, the steals are performed more efficiently. A steal consists of two parts: task retrieval from the victim and the following context switch to execute the task. Under the help-first policy, although the tasks have to be popped from the victim's queue in order, the context switches can be done in parallel. However, if steals are rare, using the help-first policy may increase the 1-thread execution time relative to sequential execution, because of the context switches that can occur at every task synchronization point. In general, the work-first policy works better for recursive parallelism, in which the task spawn tree is deep, and the help-first policy works better for flat parallelism, in which the task spawn tree is shallow and wide. In the extreme case, our experimental results show that, for 64 threads, the help-first policy can be 9.2× slower than the work-first policy on the Fib microbenchmark, and the work-first policy can be 4.6× slower than the help-first policy on the FJ microbenchmark (Section V-C).

For the purpose of this paper, a place is implemented as a set of worker threads in a work-stealing scheduler such that no stealing is permitted across places. Note that places are multi-threaded in general and stealing can occur within a place. The mapping of worker threads to places is specified when the program is launched, and is also referred to as the deployment. This mapping includes a binding of worker threads to processor cores. Unlike X10, the HJ subset used in this paper only associates places with tasks (async's) and not with data. Data locality is instead achieved indirectly by assigning async's with data affinity to execute in the same place.

The isolated construct is HJ's renaming of X10's atomic construct. As stated in [3], an atomic block in X10 is intended to be "executed by a task as if in a single step during which all other concurrent tasks in the same place are suspended". This definition implies strong atomicity semantics for the atomic construct. However, all X10 implementations that we are aware of are lock-based and do not enforce any mutual exclusion guarantees between computations within and outside an atomic block. As advocated in [15], we use the isolated keyword instead of atomic to make explicit the fact that the construct supports weak isolation rather than strong atomicity.
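The place-restricted deployment described above can be modeled, purely as a sketch, with one worker pool per place: work may balance freely inside a pool but never across pools. The PlaceDeployment class and asyncAt method below are hypothetical stand-ins for HJ's deployment and async-at-place constructs, not SLAW's API.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of a deployment: a fixed mapping of worker threads to places.
    class PlaceDeployment {
        private final ExecutorService[] places;

        PlaceDeployment(int numPlaces, int workersPerPlace) {
            places = new ExecutorService[numPlaces];
            for (int p = 0; p < numPlaces; p++) {
                // Each place gets a private work-stealing pool; no queue is shared across places.
                places[p] = Executors.newWorkStealingPool(workersPerPlace);
            }
        }

        // Run a task at the place named by its locality hint.
        void asyncAt(int place, Runnable task) {
            places[place].submit(task);
        }
    }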
III. SCALABLE ADAPTIVE SCHEDULING

This section presents SLAW's adaptive scheduling algorithm for a single place. The extension to multiple places is summarized in Section IV. We first informally give an overview of the adaptive scheduler, followed by the scheduling model and taxonomy assumed in this paper. We then present our adaptive scheduling algorithm, along with theoretical worst-case space bounds. The section concludes with a summary of the compiler support and the runtime implementation for the adaptive scheduling algorithm.

A. Overview of the Adaptive Scheduler

There are two major scalability concerns when designing the adaptive scheduler: (1) establishing space bounds, which include the stack space of worker threads as well as the total memory space; and (2) selecting between the help-first and work-first policies in different scenarios for better scalability. As described in Section II-A2, a stack-based implementation of the work-first policy increases stack pressure, whereas the help-first policy can be used to reduce stack pressure (but at the expense of additional context switches). Let us assume that S is the space limit (or threshold) for a worker's stack. If the input program has a spawn tree depth greater than S, then it is necessary at some point to use the help-first policy to ensure that a worker's stack space does not exceed the threshold S. This decision is presented as the stack condition in the spawn rule of Algorithm 1, discussed later.

Besides the stack bound, we also consider the total memory bound. The total memory bound is determined by the memory usage of both started and fresh tasks. Started tasks are those that have been executed by some processor; fresh tasks have been spawned but never executed. When spawning only under the work-first policy, there are no fresh tasks, and the total memory bound of started tasks has been established by past research on work-stealing schedulers [14], [16]. However, under the help-first policy, all child tasks are created as fresh tasks and saved on the heap. In order to provide a total memory guarantee for the adaptive work-stealing scheduler, the scheduler must switch to the work-first policy when the number of fresh tasks exceeds a threshold; this ensures that the total memory used by fresh tasks is bounded. This threshold is called the fresh-task threshold, denoted F. This decision is presented as the fresh-task condition in the spawn rule of Algorithm 1. These two conditions are enough to establish the stack and total memory bounds for the adaptive scheduling algorithm. It is important to notice that the adaptive scheduler treats the stack bound as a hard bound and gives the stack condition higher priority than the fresh-task condition. When the stack threshold is reached, the help-first policy will always be used to avoid stack overflow, regardless of the number of fresh tasks created.

SLAW employs a runtime heuristic to select the policy if neither of these two conditions is met. This heuristic is not required to establish the worst-case stack and memory space bounds, but is designed to achieve better scalability and performance in practice. For this reason, the heuristic is described below but is not presented in the algorithm. Before describing the heuristic, we first discuss two techniques used to reduce the overhead of adapting and evaluating the task spawning policy. First, each worker maintains its own spawning policy, and the heuristic used to evaluate the spawning policy consists only of thread-local operations. We show in Section V-B1 that the overhead is lower than 5%. Second, SLAW does not re-evaluate the spawning policy at every spawning point. Instead, it starts with the help-first policy and re-evaluates the spawning policy periodically, once for every INT spawned tasks. The reason it starts with the help-first policy is that steals are usually frequent at the beginning of the application, and the help-first policy can raise parallelism when stealing is frequent. Evaluating the policy periodically further amortizes the overhead.

The heuristic used by SLAW is based on a simple estimate of the likelihood that a newly spawned task will be stolen. It counts the number of tasks that were stolen from the worker during the last interval. If the number of steals is greater than INT, this implies that the steal rate is higher than the task creation rate, and the scheduler will use the help-first policy for new tasks in the next interval to increase the rate at which tasks are distributed to other workers. Otherwise, the scheduler assumes a new task will not be stolen and thus uses the work-first policy for the next interval to reduce the overhead of context switches.

In summary, the thresholds S and F for the stack condition and the fresh-task condition are used to bound the algorithm's stack and total space requirements, respectively. The third parameter, INT, controls the policy re-evaluation interval in SLAW to reduce overhead. Later, in Section V-B, we evaluate the sensitivity of SLAW's performance to these three parameters.
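The interval-based heuristic described above can be sketched as follows. This is a simplified, hypothetical rendering: the field and method names are our own, and in practice the counters would be maintained with thread-local operations.

    // Sketch of SLAW's per-worker policy heuristic, re-evaluated every INT spawns.
    class PolicyHeuristic {
        enum Policy { WORK_FIRST, HELP_FIRST }

        private final int interval;                // INT: spawns per re-evaluation interval
        private Policy policy = Policy.HELP_FIRST; // steals are frequent at startup
        private int spawnsThisInterval = 0;
        private int stealsThisInterval = 0;        // tasks thieves took from this worker

        PolicyHeuristic(int interval) { this.interval = interval; }

        void onTaskStolen() { stealsThisInterval++; }

        Policy policyForNextSpawn() {
            if (++spawnsThisInterval >= interval) {
                // Steal rate above the task-creation rate => distribute tasks faster.
                policy = (stealsThisInterval > interval) ? Policy.HELP_FIRST
                                                         : Policy.WORK_FIRST;
                spawnsThisInterval = 0;
                stealsThisInterval = 0;
            }
            return policy;
        }
    }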
Subroutine 1: Idle subroutine for processor p before step t

1) If p owns any fresh task or preempted task, remove one. If multiple such tasks exist, the following tie-breakers are used:
   a) return one that is the deepest in the sync tree;
   b) return preempted tasks before fresh tasks;
   c) return one that is the deepest in the spawn tree.
   If successful, go to 4.
2) In this case, processor p does not own any task, and it will go stealing. It attempts to remove a task τ from the pool that meets one of the following stealing restrictions: (a) if τ is fresh and created by some processor q, τ is the one that was created earliest among all fresh tasks created by q; (b) if τ is preempted and owned by some processor q, τ is the one that was preempted earliest among all tasks preempted and owned by q. If successful, go to 4.
3) Processor p remains idle. Go to 1.
4) Processor p returns the task for execution at step t.

D. Scheduling Algorithm

We use P-ADP(S, F) to denote a P-processor adaptive schedule that can be generated by the adaptive work-stealing algorithm shown in Algorithm 1. As mentioned earlier, S and F denote the stack threshold and the fresh-task threshold, respectively. To abstract the runtime call stack, some tasks are flagged on-stack-p, where p is a processor id. The activation frames of tasks flagged on-stack-p are considered to be on processor p's runtime call stack. The algorithm always flags a task τ as on-stack-p if processor p starts executing τ. This flag is not cleared when τ spawns a new task under the work-first policy, since a work-first task spawn is implemented as a sequential call in SLAW. However, when a processor p does a context switch to start executing a fresh task, or to resume a suspended task, all on-stack-p flags for that particular processor p are cleared.

Actions 1-5 in the algorithm and the idle subroutine guarantee that all adaptive schedules execute tasks in a depth-first order when a task is suspended or terminates. However, when many tasks have the same depth, the tie-breakers in the idle subroutine are important to ensure the progressiveness of the schedule, which leads to the space bound of the algorithm.

Theorem 3.2: All adaptive schedules are progressive.

E. Theoretical Space Bound

Given a dag G(V, E, S1), the following theorem presents the space bound for any P-processor adaptive schedule.

Theorem 3.3: If S1 ≤ S, the memory space of any P-ADP(S, F) schedule is bounded by S1·P + O(F·P). If S1 > S, the memory space of any P-ADP(S, F) schedule is bounded by S1·P + O(V).

Theorem 3.3 establishes the memory space bound for the adaptive schedule. If S1 ≤ S, this is the case in which work-first can run successfully without exceeding the stack bound; the work-first work-stealing scheduler's memory bound in this case is S1·P, and the memory space of the adaptive scheduler is bounded by S1·P + O(F·P). If S1 > S, this is the case in which work-first would overflow the stack. The adaptive schedule will never overflow the stack, and its memory space is bounded by S1·P + O(V).
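In display form, the bound of Theorem 3.3 can be written as follows (our notation; space(·) denotes the total memory used by a schedule):

    \[
      \operatorname{space}\bigl(P\text{-}\mathrm{ADP}(S,F)\bigr) \;\le\;
      \begin{cases}
        S_1 \cdot P + O(F \cdot P), & S_1 \le S,\\
        S_1 \cdot P + O(V),         & S_1 > S.
      \end{cases}
    \]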
Algorithm 1: Adaptive work-stealing algorithm ADP(S, F)

1) Environment: There are P processors and a shared task pool from which every processor can remove tasks and into which it can put tasks. All operations are assumed to be atomic. The algorithm proceeds step by step. Note that both the spawn tree and the sync tree are unfolded online as the algorithm progresses. When a processor p is idle before step t, it calls the idle subroutine in Subroutine 1 to attempt to remove and execute a task for step t.

2) Step 0: At step 0, one processor starts executing the root task. All other processors are idle.

3) Step t+1: The task executed at step t decides the task to execute at step t+1. If processor p executes task τa at step t, it will execute the next instruction in task τa unless τa spawns, suspends, or terminates. In these cases, the following rules a)-c) are followed, respectively:

a) Spawn: Let τa spawn τb. Processor p uses the following rule to decide the spawn policy:
   i) if the space of the activation frames of all tasks marked on-stack-p is ≥ S, use help-first (stack condition);
   ii) otherwise, if the number of fresh tasks currently owned by p before t is ≥ F, use work-first (fresh-task condition);
   iii) otherwise, any heuristic may be used; see Section III-A for SLAW's heuristic.
   Action 1: If the spawn is under the work-first policy, return τa to the pool and execute τb at step t+1.
   Action 2: If the spawn is under the help-first policy, put τb into the pool and continue executing the next instruction of τa at step t+1.

b) Suspension: If task τa is suspended, processor p returns τa to the pool, performs a context switch, and clears all on-stack-p flags on tasks. Then, Action 3: processor p removes any fresh task it created in STsync(τa). If unsuccessful, p becomes idle.

c) Termination: If task τa terminates and τa is the root task, the schedule ends. Otherwise, let T1 be PRspawn(τa) and Tn be PRsync(τa). Processor p will take Action 4: if T1 is preempted and owned by p, remove T1 and execute it at step t+1. If Action 4 is not taken, processor p performs a context switch and clears all on-stack-p flags on tasks. Then, Action 5: it checks whether Tn becomes ready before step t+1; if yes, processor p attempts to remove Tn from the pool and execute Tn at step t+1. If Action 5 is not taken, processor p becomes idle.
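Steps i)-iii) of the spawn rule can be rendered as a small decision function. The sketch below is illustrative only; onStackFrames and freshTasksOwned are assumed to be per-worker counters maintained elsewhere.

    // Sketch of Algorithm 1's spawn rule: the hard stack bound is checked first,
    // then the fresh-task bound, and only then the adaptive heuristic.
    class SpawnRule {
        enum Policy { WORK_FIRST, HELP_FIRST }

        private final int stackThreshold;  // S, in activation frames
        private final int freshThreshold;  // F

        SpawnRule(int stackThreshold, int freshThreshold) {
            this.stackThreshold = stackThreshold;
            this.freshThreshold = freshThreshold;
        }

        Policy decide(int onStackFrames, int freshTasksOwned, Policy heuristicChoice) {
            if (onStackFrames >= stackThreshold) return Policy.HELP_FIRST;   // stack condition
            if (freshTasksOwned >= freshThreshold) return Policy.WORK_FIRST; // fresh-task condition
            return heuristicChoice; // e.g., the INT-based heuristic of Section III-A
        }
    }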
It is important to notice that S1 is related to the input data size [17], because it is the space requirement of serial depth-first execution. However, F is a preset parameter and is not related to the input data size. The constant in the O(F·P) term is the size of the heap-allocated holder of a fresh task, which is usually small. In SLAW, the task holder contains only the values of the input parameters to the task function and a few bookkeeping fields.

F. Runtime Implementation and Compiler Support

Many work-stealing schedulers use the deque data structure to store tasks. A deque is a double-ended queue: steal operations are performed at the top end of the deque, while only the owner of the deque pushes and pops tasks at the bottom end. SLAW's deque implementation is based on the dynamic circular deque described in [18]. SLAW implements the adaptive scheduling algorithm using two deques per worker: one for preempted tasks owned by the worker, and the other for fresh tasks created by the worker. When a task is preempted at a work-first spawn, it is pushed onto the worker's preempted-task deque.
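A minimal sketch of this two-deque organization is shown below. It assumes a Chase-Lev-style usage pattern (owner at one end, thieves at the other); the class and method names are illustrative, and the synchronization a real concurrent deque needs is elided.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Sketch: each worker keeps two deques, one for preempted tasks and one for
    // fresh tasks. The owner works at one end; thieves steal from the other.
    class WorkerDeques<T> {
        private final Deque<T> preempted = new ArrayDeque<>(); // continuations preempted at work-first spawns
        private final Deque<T> fresh = new ArrayDeque<>();     // tasks created at help-first spawns

        void pushPreempted(T task) { preempted.addLast(task); } // owner end
        void pushFresh(T task)     { fresh.addLast(task); }     // owner end

        // Owner prefers its own tasks, taking preempted before fresh ones
        // (cf. tie-breaker b) in Subroutine 1).
        T ownerNext() {
            T t = preempted.pollLast();
            return (t != null) ? t : fresh.pollLast();
        }

        // A thief takes the oldest fresh task first (cf. stealing restriction (a)).
        T steal() {
            T t = fresh.pollFirst();
            return (t != null) ? t : preempted.pollFirst();
        }
    }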
Benchmark | Type | Description | Source
Fib(35) | Micro, recursive | Two-way recursive Fibonacci for n = 35, with no sequential threshold/cutoff | Cilk++
FJ(1024) | Micro, flat | Create 1024 dummy tasks and join them | JGF
SOR | Loop | 2-D Successive Over-Relaxation on a 2000×2000 float array | JGF
CG.A | Loop | Conjugate Gradient, size A | NPB 3.0
MG.A | Loop | Multi-Grid, size A | NPB 3.0
Sort | Recursive | Parallel merge sort on 50,331,648 random integers | BOTS [19]
Matmul | Recursive | Recursive matrix multiplication (two 1500×1500 double matrices, threshold = 64) | Cilk++
LU | Recursive | Recursive LU decomposition (2048×2048 double matrix, block size = 64) | JCilk
GC | Recursive | Graph coloring using parallel constraint-satisfaction search (CLIQUE 10, 10 colors) | [20]
PDFS | Irregular, recursive | Parallel DFS (Figure 1) on a torus graph with 4M nodes | XWS [21]
TABLE I. LIST OF BENCHMARKS IMPLEMENTED IN HJ AND THEIR SOURCES.

1) Impact of Parameter INT: If INT is set too large, then some noticeable context-switch overhead will be observed before the policy switch occurs. The same situation occurs for the PDFS benchmark in Figure 6(d): when we increase the INT value past 64, the throughput of the parallel depth-first search benchmark declines. Figure 6(c) shows the impact of INT on the execution time of SOR with 64 workers. In this fork-join version of SOR, 64 tasks are distributed among 64 workers in each outer (time-step) iteration, and these tasks are joined with a finish construct at the end of each iteration. Note that stealing is very frequent in this example, since 63 out of 64 tasks will be stolen in each iteration, implying that the stealing rate is high and the help-first policy is the best choice. Because the adaptive scheduler starts with the help-first policy and re-evaluates the policy after every INT spawns, this experiment suggests that INT should be set greater than or equal to the number of workers. Doing so ensures that at least one task is spawned for each worker under the help-first policy before the worker switches to the work-first policy. If a worker switches to the work-first policy too early, it will delay task creation and negatively affect the performance of the entire application. For the same reason, the fresh-deque threshold should also be equal to or greater than the number of workers, so as to hold at least one task for each worker. For the reasons described above, since the maximum number of workers is 64 (on Niagara 2), the default value of INT is set to 64 for the experiments presented in this paper.

2) Impact of Parameter F: Figure 6(e) shows how the throughput of the PDFS benchmark varies for different values of the fresh-task threshold F. The other two parameters, INT and S, are fixed at 50 and 256, respectively. INT is fixed at 50 for this example because (INT=50, S=256) was the best combination we found for PDFS after enumerating the parameter space. All other experimental results reported in this paper use the default parameter values unless otherwise specified. We do not find any correlation between the throughput and F. This is in part because F is a soft bound that has lower priority in the SLAW algorithm than the stack bound: the worker will always use the help-first policy to create fresh tasks, regardless of the number of existing fresh tasks, if the stack bound prevents the use of the work-first policy. In the previous discussion of the parameter INT, we also mentioned the reason why F should be equal to or greater than the number of workers. The default value of F in SLAW is set to 128.

3) Impact of Parameter S: Figure 6(f) shows the throughput of the PDFS benchmark as a function of the stack threshold S. When S is set to 1, the adaptive schedule becomes equivalent to the help-first schedule. Interestingly, if the stack threshold is set too large, performance also degrades. This is because, if the stack becomes too deep, the number of memory pages spanned by the runtime call stack increases, which in turn leads to an increase in TLB misses. The default stack threshold in SLAW is set to 256 activation frames. Among the benchmarks used in this paper, only the PDFS benchmark requires a stack that grows proportionally with the input problem size. The stack requirement of the other benchmarks is bounded by a small number; consequently, they do not hit the stack bound.

(a) Fib(35), S=256, F=128, W=1; (b) Fib(35), S=256, F=128, W=32; (c) SOR, S=256, F=128, W=64; (d) PDFS, S=256, F=128, W=64; (e) PDFS, S=256, INT=50, W=64; (f) PDFS, F=128, INT=50, W=64.
Fig. 6. Analysis of adaptive scheduler parameter sensitivity on the Niagara 2 system. The benchmark name, the SLAW parameter values (S, F, INT), and the number of workers (W) are specified in the sub-figure captions. Better performance is indicated by smaller values in (a), (b), (c) and larger values in (d), (e), (f).

Parameter | INT | F | S
Default value | 64 | 128 | 256 activation frames
TABLE II. DEFAULT ADAPTIVE SCHEDULER PARAMETER VALUES.
C. Benchmark Results

Recursive parallelism and flat parallelism are two common patterns of task parallelism. In recursive parallelism, the parallelism is expressed recursively, so the task spawn tree is usually deep. Flat parallelism, on the other hand, corresponds to do-across loops and pointer-chasing cases in which asynchronous tasks are spawned iteratively; in flat parallelism, the task spawn tree is usually wide and shallow. In work-stealing, each steal yields a whole sub-tree for the thief to work on. Intuitively, a spawn tree with large depth implies that each worker gets more work compared to a shallow spawn tree. As a result, stealing is generally considered rare in recursive task parallelism but frequent in flat parallelism. One optimization that has been studied in previous research is to transform some flat loops into recursive style [22], [5]. This optimization requires compiler support, applies only to do-all loops over a divisible region, and does not apply to do-across loops or pointer-chasing programs. As the main contribution of this paper is to show the robustness of the runtime, we do not apply such optimizations. However, for the FJ microbenchmark, we do show performance results both with and without the recursive loop optimization.

We use two microbenchmarks, Fib and FJ, to show the extreme cases in which the work-first policy is better than the help-first policy, and vice versa. Fib is the extreme case for recursive parallelism, and FJ without the recursive loop optimization is the extreme case for flat parallelism. Table III shows the execution time of Fib on Niagara 2, and Figure 7 shows the number of fork-joins performed per second. For Fib, the work-first policy is 10.2× faster than the help-first policy due to fewer context switches. In FJ, steals are frequent and the help-first policy is 4.6× faster than the work-first policy. In both microbenchmarks, the performance of the adaptive scheduler is close to that of the better policy.

Figure 8 shows the number of fork-joins performed per second in a recursive-style fork-join (fj-rec) and compares it to the iterative fork-join (fj). In fj-rec, the tasks are spawned recursively; to spawn 1024 parallel tasks, the depth of the task spawn tree is 11. When the number of threads is small (≤ 8), the work-first policy performs better than the help-first policy, because steals are infrequent compared to the number of tasks spawned. As the number of threads increases, steals become more frequent (considering that the depth of the task spawn tree is 10 for 1024 tasks), and the help-first policy performs better than the work-first policy. This example is interesting because it shows that the best choice of scheduling policy is more a dynamic choice than a static one, although the shape of a program can probably give some clue. The experiment also confirms that fj-rec is more scalable than fj, as the task spawns are now performed in parallel as well. However, the sequential overhead of fj-rec is higher than that of the iterative fj, which explains its lower performance when the number of threads is small.
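The difference between the two spawn shapes can be sketched in Java. This is a hypothetical illustration using a generic spawn primitive rather than HJ syntax: fj spawns all 1024 tasks from a single loop (wide and shallow), while fj-rec halves the range recursively, so the spawning itself is parallel.

    // Sketch: flat (fj) vs. recursive (fj-rec) spawning of 1024 tasks.
    class SpawnShapes {
        interface Spawner { void spawn(Runnable task); } // stand-in for async

        // fj: one worker spawns every task; the spawn tree is wide and shallow.
        static void fjIterative(Spawner s, Runnable[] tasks) {
            for (Runnable t : tasks) {
                s.spawn(t);
            }
        }

        // fj-rec: tasks are spawned by recursive halving; the spawn tree is a
        // binary tree of depth log2(1024) = 10, so spawning is itself parallel.
        static void fjRecursive(Spawner s, Runnable[] tasks, int lo, int hi) {
            if (hi - lo == 1) {
                s.spawn(tasks[lo]);
                return;
            }
            int mid = (lo + hi) / 2;
            s.spawn(() -> fjRecursive(s, tasks, lo, mid));
            fjRecursive(s, tasks, mid, hi);
        }
    }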
Wrks | 1 | 2 | 4 | 8 | 16 | 32 | 64
hf | 334.14 | 173.64 | 79.43 | 39.71 | 21.43 | 11.04 | 8.04
wf | 34.45 | 17.13 | 8.65 | 4.31 | 2.24 | 1.23 | 0.87
adp | 34.25 | 16.99 | 8.54 | 4.36 | 2.25 | 1.25 | 0.90
TABLE III. PERFORMANCE RESULTS FOR THE FIB(35) MICROBENCHMARK ON NIAGARA 2 USING 1 TO 64 WORKERS. EXECUTION TIME (IN SECONDS) IS REPORTED. (SMALLER IS BETTER.)

Fig. 7. Performance results for the FJ(1024) microbenchmark (tasks spawned iteratively) on Niagara 2 using 1 to 64 workers. The number of fork-joins performed per second is reported. (Bigger is better.)

Fig. 8. Performance results for the FJ(1024) microbenchmark (in fj-rec, tasks are spawned recursively) on Niagara 2 using 1 to 64 workers. The number of fork-joins performed per second is reported. (Bigger is better.)

Figure 9 shows the speedup of the SLAW scheduler on Niagara 2 over the Java-serial version, with one exception for PDFS, whose speedup is based on 1-thread help-first execution. Both the serial version and the work-first schedule of PDFS overflow the stack, as described in Section II-A2; this is why there is no bar for wf in the figure. This exception also applies to Figure 10. Three scheduling policies are compared: help-first only, work-first only, and the adaptive scheduling algorithm described in Section III. As all cores on Niagara 2 share the same L2 cache, this experiment uses the locality-oblivious deployment, which specifies only one place with all 64 workers. No processor binding is used.

CG.A, MG.A and SOR are flat, loop-based parallel benchmarks. In these benchmarks, the help-first policy performs better than the work-first policy, and the results in Figure 9 show that the adaptive scheduling algorithm matches or exceeds the performance of the help-first policy. Sort, Matmul, LU and GC are task-recursive parallel benchmarks. In these benchmarks, the work-first policy is better than, or almost the same as, the help-first policy, and the results show that the adaptive scheduling algorithm matches or exceeds the performance of the work-first policy.

PDFS is an irregular graph computation. Irregular graph computations are interesting because the structure of the spawn tree depends on the order in which nodes are visited (labeled) in parallel. We used the Parallel Depth First Search (PDFS) benchmark studied in [21] (kernel code shown in Figure 1) and applied it to a two-dimensional 2000×2000 torus graph consisting of 4 million nodes. Our results show that the adaptive approach outperformed the help-first policy for this benchmark because of its ability to combine the help-first and work-first policies. At the beginning, all workers are idle and stealing is frequent, making help-first the more desirable policy. After each worker gets some work, the workers begin traversing the graph and stealing becomes less frequent, causing the adaptive runtime to switch to the work-first policy. The work-first policy incurs no synchronization overhead and executes tasks as if they were sequential calls. Finally, according to the stack condition in the adaptive scheduling algorithm, the runtime switches back to the help-first policy when it becomes necessary to avoid overflowing the stack size limit.

Figure 10 shows the speedup of the SLAW scheduler on the Xeon SMP using the three scheduling policies. For reference, we also report Cilk++'s speedup for those benchmarks for which a Cilk version is available (Sort, Matmul, LU). We also translated the JGF SOR to Cilk++ and used cilk_for to parallelize the loop. To factor out uniprocessor performance differences between Java and C, the speedup for SLAW in this figure is based on the Java-serial version, and Cilk++'s speedup is based on the C-serial version. This experiment uses the locality-oblivious deployment; the experimental results of locality-aware scheduling are presented later, in Section V-D. For Sort, Matmul and LU, SLAW achieves over 10× speedup on the Xeon SMP. SLAW scales almost linearly on Matmul. Cilk++ also scales almost linearly from 1 worker to 16 workers, but its speedup looks smaller because Cilk++'s 1-worker case is 2.4× slower than the C-serial version, due to some optimizations that are disabled by the Cilk++ compiler. CG, MG and SOR do not scale beyond 4 workers because they hit a memory-bandwidth wall. The scalability of CG and SOR can be significantly improved by locality-aware scheduling, as shown next in Section V-D. We also discuss the scalability issues of GC and PDFS in Section V-E.

D. Locality-aware Scheduling Results

We show the performance improvement on memory-bandwidth-bound applications from locality-aware work-stealing scheduling, with the locality hints provided by the programmer, on the Xeon SMP. With 4 quad-core processors in the Xeon SMP, there are in total eight (8) shared L2 caches on the machine, with a total size of 24 MB. Thus, the locality-aware deployment provided for the SLAW scheduler has 8 places, with 1 or 2 workers per place. The SLAW scheduler binds workers to virtual processors. Figure 11 shows the performance results of the SLAW locality-aware scheduler using two locality-aware deployments: one with 8 places and 1 worker per place, for a total of 8 workers; the other with 8 places and 2 workers per place, for a total of 16 workers. The scheduling policy used for task scheduling within each place is the adaptive schedule. The speedup reported for Cilk++ is also based on the Java-serial version, in order to compare execution times.

For SOR, the total data size is a 2000×2000 float matrix. Divided across 8 places, the data for each place fits into the 3 MB L2 cache. The 8-place, 8-worker locality-aware scheduling is 2.1× faster than locality-oblivious scheduling using the adaptive schedule, and the speedup for the 8-place, 16-worker locality-aware scheduling is 2.6×. For CG.A, the total data size is 2,198,000 double sparse-matrix elements. Divided across 8 places, the data set for each place also fits in the L2 cache, thus enabling temporal data reuse. Experimental results show that the 8-place, 8-worker locality-aware scheduling is 1.7× faster than locality-oblivious scheduling. We do not get an improvement from 1 worker per place to 2 workers per place; this is due to cache contention between the two workers within the same place. MG.A cannot benefit from temporal reuse on the Xeon SMP because the memory footprint of each place exceeds the capacity of the L2 cache.

(a) SOR; (b) CG.
Fig. 11. Comparing the locality-aware scheduler with the locality-oblivious scheduler on SOR and CG on the Intel Xeon SMP. The locality-aware deployment for adp+locality has 8 places with 1 or 2 workers per place. The workers are bound to virtual processors.

E. Other Scalability Issues

Benchmarks like PDFS and GC scale well on Niagara 2, with its one shared L2 cache, but not on the Xeon SMP with separate L2 caches. This trend has also been observed by other researchers, and some explanations have been proposed, such as the allocation-wall theory [23]. Besides the allocation wall, we also observe false sharing between application objects accessed from different worker threads. In parallel graph algorithms, there are lots of small objects (e.g., graph nodes) accessed by multiple worker threads. Without padding, two objects accessed by two different threads may lie in the same cache line and thus cause false sharing. But automatic padding of application objects in Java is not straightforward. A full investigation of the scalability issue for these benchmarks involves the JVM memory management system and is beyond the scope of this paper.
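The padding idea can be illustrated with a hypothetical sketch: inflating each hot object past the size of a cache line so that two nodes updated by different workers do not share a line. Note that this is only an illustration of the technique; the JVM is free to reorder or even strip such fields, which is precisely why automatic, reliable padding is not straightforward.

    // Sketch: manually padding a graph node to roughly a 64-byte cache line,
    // so neighboring nodes written by different workers do not share a line.
    class PaddedGraphNode {
        volatile int color;              // hot field written during parallel traversal
        long p0, p1, p2, p3, p4, p5, p6; // padding; may be reordered/stripped by the JVM
    }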
Fig. 9. Performance results on Niagara 2. The deployment is locality-oblivious (1 place, 64 workers) with no processor binding.

Fig. 10. Performance results on the Xeon SMP. The deployment is locality-oblivious (1 place, 16 workers) with no processor binding.

VI. RELATED WORK

Following the popular Cilk-5 work-stealing runtime [4], many task-parallel runtime systems have been developed in the last 10 years, including Cilk++ [5], Intel Threading Building Blocks [22], Java's fork/join framework [7], Microsoft's Task Parallel Library [8], StackThreads/MP [9], and XWS [21]. These runtimes are either exposed as library routines, or serve as implementations of new language constructs, or some combination thereof. SLAW is a work-stealing runtime designed for a task-parallel language called HJ, which is derived from X10 v1.5. In our implementation, the HJ compiler generates calls to the SLAW runtime. However, it is possible to use the techniques introduced in this paper to support a library approach as well. Some work-stealing runtime schedulers use the work-first policy inspired by Cilk [4], [5], [9], while others use the help-first policy [22], [8], [7]. In particular, Intel TBB uses the "depth-first execution and breadth-first theft" principle to raise parallelism as well as to maintain locality [6], which is similar to the help-first policy. XWS exposes low-level deque operations but does not have compiler support for either policy. SLAW includes compiler and runtime support for both the work-first and help-first policies. The programmer has the option of selecting a fixed scheduling policy in HJ, or of using SLAW's default adaptive scheduler, which selects the policy on a per-task basis at runtime.

Two approaches in past work have been shown to be provably space-efficient. One category consists of work-stealing schedulers with the work-first policy, which were first proven to be space-efficient for fully-strict computations [14]. The same result was later extended to terminally-strict computations [16] in languages like X10 and HJ. Another category of techniques is based on depth-first schedulers [17], such as DFDeques [24], which can use less memory than work-stealing schedulers. All these scheduling techniques focus on memory space usage without bounding the stack pressure of individual processors, because all of these models assume that a serial depth-first schedule can run successfully. SLAW's adaptive scheduling algorithm addresses this problem by tracking the stack pressure and generating schedules with bounded stack usage, even in cases when a sequential execution cannot run successfully.

Previous research has compared the pros and cons of different scheduling policies [10]: work-first works better in situations where steals are rare, while help-first works better when steals are frequent. [25] compared depth-first and breadth-first task spawning policies in OpenMP task scheduling and found that depth-first performed slightly better than the breadth-first policy. Their breadth-first policy is different from the help-first policy in work-stealing, because it uses a global task pool to store all untied tasks, while work-stealing uses a local pool per worker. Second, the benchmarks they used to evaluate performance are mostly task-recursive parallel programs in which steals are rare.

Another area of research aims to reduce scheduling overhead by increasing task granularity. This can be done by chunking a parallel loop or by using a cut-off technique to run a task sequentially if it is too fine-grained [26], [22]. These techniques are applicable to SLAW as well. [27] proposed a backtracking-based scheduling approach in which the program runs normally in sequential mode and backtracks upon a steal request. This approach requires the programmer to explicitly write roll-back code at the language level.

The locality issue in work-stealing has also received a lot of attention in past work. Acar et al. [11] provide a theoretical bound on the number of cache misses for randomized work-stealing and give a locality-guided work-stealing implementation on a single-core SMP. SLAW's locality-aware scheduling is designed for multicore SMPs, and stealing occurs within a single place but not across places.
Chen et al. [28] studied and compared the cache behavior of work-stealing and parallel depth-first schedulers on simulators, for cores that share an L2 cache. They proposed approaches to control task granularity and promote constructive cache sharing. In a multicore SMP environment, the concept of places in HJ can be used to enable constructive cache sharing on a single multicore processor. Intel TBB has the affinity_partitioner structure to exploit temporal cache reuse by binding an iteration to the same worker thread that previously executed it. TBB allows stealing regardless of the affinity and has a mechanism to reduce counter-productive stealing. SLAW currently disables cross-place stealing in order to avoid counter-productive stealing.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we introduced SLAW, a scalable locality-aware adaptive work-stealing scheduler. SLAW uses a novel adaptive scheduling algorithm with guaranteed space bounds, and performs locality-aware scheduling in accordance with locality hints (place annotations) provided by the programmer. Experimental results show that the use of the adaptive scheduler delivered up to 4.5× speedup over the work-first policy on iterative loop parallelism (FJ) and up to 9.2× speedup over the help-first policy on recursive task parallelism (Fib). In addition, the adaptive algorithm achieves scalable performance with support for large data sizes (as in the case of PDFS) that cannot be achieved by the use of a fixed policy. Finally, our results show that locality-aware scheduling can achieve up to 2.6× speedup over locality-oblivious scheduling for the benchmarks and platforms studied in this paper. For future work, we will extend the place-locality model and implementation to allow load balancing across places while avoiding counter-productive stealing. Another direction for future research is to extend the work-stealing runtime system presented in this paper to support additional language features such as futures and phasers. We also plan to investigate other scalability bottlenecks, such as the impact of the JVM memory management system on the work-stealing runtime.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation under the HECURA program, award number CCF-0833166. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the National Science Foundation. We also gratefully acknowledge support from an IBM Open Collaborative Faculty Award. We would like to thank all Habanero team members for their contributions to the HJ software that served as the foundational infrastructure for this research. Finally, we would like to thank Doug Lea for access to the UltraSPARC T2 SMP system used to obtain the experimental results reported in this paper.

REFERENCES

[1] B. Chamberlain, D. Callahan, and H. Zima, "Parallel programmability and the Chapel language," Int. J. High Perform. Comput. Appl., vol. 21, no. 3, pp. 291–312, 2007.
[2] http://projectfortress.sun.com/Projects/Community.
[3] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar, "X10: an object-oriented approach to non-uniform cluster computing," in OOPSLA '05. New York, NY, USA: ACM, 2005, pp. 519–538.
[4] M. Frigo, C. E. Leiserson, and K. H. Randall, "The implementation of the Cilk-5 multithreaded language," SIGPLAN Not., vol. 33, no. 5, pp. 212–223, 1998.
[5] Cilk Arts: http://www.cilk.com.
[6] Intel Threading Building Blocks, Intel Corporation.
[7] D. Lea, "A Java fork/join framework," in JAVA '00: Proceedings of the ACM 2000 Conference on Java Grande. New York, NY, USA: ACM, 2000, pp. 36–43.
[8] D. Leijen, W. Schulte, and S. Burckhardt, "The design of a task parallel library," SIGPLAN Not., vol. 44, no. 10. New York, NY, USA: ACM, 2009, pp. 227–242.
[9] K. Taura, K. Tabata, and A. Yonezawa, "StackThreads/MP: integrating futures into calling standards," SIGPLAN Not., vol. 34, no. 8, pp. 60–71, 1999.
[10] Y. Guo, R. Barik, R. Raman, and V. Sarkar, "Work-first and help-first scheduling policies for async-finish task parallelism," in IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing. Washington, DC, USA: IEEE Computer Society, 2009, pp. 1–12.
[11] U. A. Acar, G. E. Blelloch, and R. D. Blumofe, "The data locality of work stealing," in SPAA '00: Proceedings of the Twelfth Annual ACM Symposium on Parallel Algorithms and Architectures. New York, NY, USA: ACM, 2000, pp. 1–12.
[12] "The Habanero Java (HJ) programming language." [Online]. Available: http://habanero.rice.edu/hj
[13] D. Spoonhower, G. E. Blelloch, P. B. Gibbons, and R. Harper, "Beyond nested parallelism: tight bounds on work-stealing overheads for parallel futures," in SPAA '09: Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures. New York, NY, USA: ACM, 2009, pp. 91–100.
[14] R. D. Blumofe and C. E. Leiserson, "Scheduling multithreaded computations by work stealing," J. ACM, vol. 46, no. 5, pp. 720–748, 1999.
[15] J. R. Larus and R. Rajwar, Transactional Memory. Morgan & Claypool, 2006.
[16] S. Agarwal, R. Barik, D. Bonachea, V. Sarkar, R. K. Shyamasundar, and K. Yelick, "Deadlock-free scheduling of X10 computations with bounded resources," in SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures. New York, NY, USA: ACM, 2007, pp. 229–240.
[17] G. E. Blelloch, P. B. Gibbons, and Y. Matias, "Provably efficient scheduling for languages with fine-grained parallelism," J. ACM, vol. 46, no. 2, pp. 281–321, 1999.
[18] D. Chase and Y. Lev, "Dynamic circular work-stealing deque," in SPAA '05: Proceedings of the Seventeenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2005.
[19] A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguadé, "Barcelona OpenMP Tasks Suite: a set of benchmarks targeting the exploitation of task parallelism in OpenMP," in ICPP '09: Proceedings of the 2009 International Conference on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 2009, pp. 124–131.
[20] R. Haralick and G. Elliott, "Increasing tree search efficiency for constraint satisfaction problems," Artificial Intelligence, 1980.
[21] G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen, "Solving large, irregular graph problems using adaptive work-stealing," in ICPP '08: Proceedings of the 37th International Conference on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 2008, pp. 536–545.
[22] A. Robison, M. Voss, and A. Kukanov, "Optimization via reflection on work stealing in TBB," in IPDPS 2008: IEEE International Symposium on Parallel and Distributed Processing, April 2008, pp. 1–8.
[23] Y. Zhao, J. Shi, K. Zheng, H. Wang, H. Lin, and L. Shao, "Allocation wall: a limiting factor of Java applications on emerging multi-core platforms," in OOPSLA '09. New York, NY, USA: ACM, 2009, pp. 361–376.
[24] G. J. Narlikar, "Scheduling threads for low space requirement and good locality," in SPAA '99: Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures. New York, NY, USA: ACM, 1999, pp. 83–95.
[25] A. Duran, J. Corbalán, and E. Ayguadé, "Evaluation of OpenMP task scheduling strategies," in Proceedings of IWOMP 2008.
[26] A. Duran, J. Corbalán, and E. Ayguadé, "An adaptive cut-off for task parallelism," in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1–11.
[27] T. Hiraishi, M. Yasugi, S. Umatani, and T. Yuasa, "Backtracking-based load balancing," in PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York, NY, USA: ACM, 2009, pp. 55–64.
[28] S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilkerson, "Scheduling threads for constructive cache sharing on CMPs," in SPAA '07. New York, NY, USA: ACM, 2007, pp. 105–115.