Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling
Dhruba Borthakur (Facebook Inc., dhruba@facebook.com), Joydeep Sen Sarma (Facebook Inc., jssarma@facebook.com), Khaled Elmeleegy (Yahoo! Research, khaled@yahoo-inc.com), Scott Shenker (University of California, Berkeley, shenker@cs.berkeley.edu), Ion Stoica (University of California, Berkeley)

- Fair sharing: ...two jobs are running, each should get half the resources; if a third job is launched, each job's share should be 33%.
- Data locality: place computations near their input data, to maximize system throughput.

To achieve the first goal (fair sharing), a scheduler must reallocate resources between jobs when the number of jobs changes. A key design question is what to do with tasks (units of work that make up a job) from running jobs when a new job is submitted, in order to give resources to the new job. At a high level, two approaches can be taken:

1. Kill running tasks to make room for the new job.
2. Wait for running tasks to finish.

Killing reallocates resources instantly and gives control over locality for the new jobs, but it has the serious disadvantage of wasting the work of killed tasks. Waiting, on the other hand, does not have this problem, but can negatively impact fairness, as a new job needs to wait for tasks to finish to achieve its share, and locality, as the new job may not have any input data on the nodes that free up.

Our principal result in this paper is that, counterintuitively, an algorithm based on waiting can achieve both high fairness and high data locality. We show first that in large clusters, tasks finish at such a high rate that resources can be reassigned to new jobs on a timescale much smaller than job durations. However, a strict implementation of fair sharing compromises locality, because the job to be scheduled next according to fairness might not have data on the nodes that are currently free. To resolve this problem, we relax fairness slightly through a simple algorithm called delay scheduling, in which a job waits for a limited amount of time for a scheduling opportunity on a node that has data for it. We show that a very small amount of waiting is enough to bring locality close to 100%. Delay scheduling performs well in typical Hadoop workloads because Hadoop tasks are short relative to jobs, and because there are multiple locations where a task can run to access each data block.

Delay scheduling is applicable beyond fair sharing. In general, any scheduling policy defines an order in which jobs should be given resources. Delay scheduling only asks that we sometimes give resources to jobs out of order to improve data locality. We have taken advantage of the generality of delay scheduling in HFS to implement a hierarchical scheduling policy motivated by the needs of Facebook's users: a top-level scheduler divides slots between users according to weighted fair sharing, but users can schedule their own jobs using either FIFO or fair sharing.

Although we motivate our work with the data warehousing workload at Facebook, it is applicable in other settings. Our Yahoo! contacts also report job queueing delays to be a big frustration. Our work is also relevant to shared academic Hadoop clusters [8, 10, 14], and to systems other than Hadoop. Finally, one consequence of the simplicity of delay scheduling is that it can be implemented in a distributed fashion; we discuss the implications of this in Section 6.

This paper is organized as follows. Section 2 provides background on Hadoop. Section 3 analyzes a simple model of fair sharing to identify when fairness conflicts with locality, and explains why delay scheduling can be expected to perform well. Section 4 describes the design of HFS and our implementation of delay scheduling. We evaluate HFS and delay scheduling in Section 5. Section 6 discusses limitations and extensions of delay scheduling. Section 7 surveys related work. We conclude in Section 8.
2. Background

Hadoop's implementation of MapReduce resembles that of Google [18]. Hadoop runs over a distributed file system called HDFS, which stores three replicas of each block, like GFS [21]. Users submit jobs consisting of a map function and a reduce function. Hadoop breaks each job into tasks. First, map tasks process each input block (typically 64 MB) and produce intermediate results, which are key-value pairs. There is one map task per input block. Next, reduce tasks pass the list of intermediate values for each key through the user's reduce function, producing the job's final output.

Job scheduling in Hadoop is performed by a master, which manages a number of slaves. Each slave has a fixed number of map slots and reduce slots in which it can run tasks. Typically, administrators set the number of slots to one or two per core. The master assigns tasks in response to heartbeats sent by slaves every few seconds, which report the number of free map and reduce slots on the slave.

Hadoop's default scheduler runs jobs in FIFO order, with five priority levels. When the scheduler receives a heartbeat indicating that a map or reduce slot is free, it scans through jobs in order of priority and submit time to find one with a task of the required type. For maps, Hadoop uses a locality optimization as in Google's MapReduce [18]: after selecting a job, the scheduler greedily picks the map task in the job with data closest to the slave (on the same node if possible, otherwise on the same rack, or finally on a remote rack).

3. Delay Scheduling

Recall that our goal is to statistically multiplex clusters while having a minimal impact on fairness (i.e. giving new jobs their fair share of resources quickly) and achieving high data locality. In this section, we analyze a simple fair sharing algorithm to answer two questions:

1. How should resources be reassigned to new jobs?
2. How should data locality be achieved?

To answer the first question, we consider two approaches to reassigning resources: killing tasks from existing jobs to make room for new jobs, and waiting for tasks to finish to assign slots to new jobs. Killing has the advantage that it is instantaneous, but the disadvantage that work performed by the killed tasks is wasted. We show that waiting imposes little impact on job response times when jobs are longer than the average task length and when a cluster is shared between many users.

Figure 2: Data locality vs. job size in production at Facebook.

3.3 Locality Problems with Naïve Fair Sharing

The main aspect of MapReduce that complicates scheduling is the need to place tasks near their input data. Locality increases throughput because network bandwidth in a large cluster is much lower than the total bandwidth of the cluster's disks [18]. Running on a node that contains the data (node locality) is most efficient, but when this is not possible, running on the same rack (rack locality) is faster than running off-rack. For now, we only consider node locality. We describe two locality problems that arise with naïve fair sharing: head-of-line scheduling and sticky slots.
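Algorithm 1, the naïve fair sharing scheduler referred to throughout this section, is not reproduced in this excerpt. The short Python sketch below is our reconstruction from the behavior described in Sections 3.3.1 and 3.3.2: when a heartbeat reports a free slot, jobs are sorted by number of running tasks and a task from the head-of-line job is launched regardless of locality. The Job structure and the return value are illustrative assumptions, not HFS code.

    # Sketch of naive fair sharing (Algorithm 1) as described in the text:
    # give the free slot to the job with the fewest running tasks, ignoring locality.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Job:
        name: str
        running_tasks: int = 0
        pending_tasks: List[str] = field(default_factory=list)  # task ids left to launch

    def assign_slot_naive(jobs: List[Job], node: str) -> Optional[str]:
        """Called when `node` reports a free slot; returns a description of the launched task."""
        for job in sorted(jobs, key=lambda j: j.running_tasks):
            if job.pending_tasks:
                task = job.pending_tasks.pop(0)   # launched no matter where its data lives
                job.running_tasks += 1
                return f"{job.name}:{task} on {node}"
        return None  # no runnable work

    if __name__ == "__main__":
        jobs = [Job("A", running_tasks=9, pending_tasks=["a10"]),
                Job("B", running_tasks=10, pending_tasks=["b11"])]
        print(assign_slot_naive(jobs, "node7"))  # job A wins the slot: it is furthest below its share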
3.3.1 Head-of-line Scheduling

The first locality problem occurs in small jobs (jobs that have small input files and hence have a small number of data blocks to read). The problem is that whenever a job reaches the head of the sorted list in Algorithm 1 (i.e. has the fewest running tasks), one of its tasks is launched on the next slot that becomes free, no matter which node this slot is on. If the head-of-line job is small, it is unlikely to have data on the node that is given to it. For example, a job with data on 10% of nodes will only achieve 10% locality.

We observed this head-of-line scheduling problem at Facebook in a version of HFS without delay scheduling. Figure 2 shows locality for jobs of different sizes (number of maps) running at Facebook in March 2009. (Recall that there is one map task per input block.) Each point represents a bin of job sizes. The first point is for jobs with 1 to 25 maps, which only achieve 5% node locality and 59% rack locality. Unfortunately, this behavior was problematic because most jobs at Facebook are small. In fact, 58% of Facebook's jobs fall into this first bin (1-25 maps). Small jobs are so common because both ad-hoc queries and periodic reporting jobs work on small data sets.

3.3.2 Sticky Slots

A second locality problem, sticky slots, happens even with large jobs if fair sharing is used. The problem is that there is a tendency for a job to be assigned the same slot repeatedly. For example, suppose that there are 10 jobs in a 100-node cluster with one slot per node, and that each job has 10 running tasks. Suppose job j finishes a task on node n. Node n now requests a new task. At this point, j has 9 running tasks while all the other jobs have 10. Therefore, Algorithm 1 assigns the slot on node n to job j again. Consequently, in steady state, jobs never leave their original slots. This leads to poor data locality because input files are striped across the cluster, so each job needs to run some tasks on each machine.

Figure 3: Expected effect of sticky slots on node locality under various values of file replication level (R) and slots per node (L).

The impact of sticky slots depends on the number of jobs, the number of slots per slave (which we shall denote L), and the number of replicas per block in the file system (which we denote R). Suppose that job j's fractional share of the cluster is f. Then for any given block b, the probability that none of j's slots are on a node with a copy of b is (1 − f)^{RL}: there are R replicas of b, each replica is on a node with L slots, and the probability that a slot does not belong to j is 1 − f. Therefore, j is expected to achieve at most 1 − (1 − f)^{RL} locality. We plot this bound on locality for different R and L and different numbers of concurrent jobs (with equal shares of the cluster) in Figure 3. Even with large R and L, locality falls below 80% for 15 jobs and below 50% for 30 jobs.

Interestingly, sticky slots do not occur in Hadoop due to a bug in how Hadoop counts running tasks. Hadoop tasks enter a "commit pending" state after finishing their work, where they request permission to rename their output to its final filename. The job object in the master counts a task in this state as running, whereas the slave object doesn't. Therefore another job can be given the task's slot. While this is a bug (breaking fairness), it has limited impact on throughput and response time. Nonetheless, we explain sticky slots to warn other system designers of the problem. For example, sticky slots have been reported in Dryad [24]. In Section 5, we show that sticky slots lower throughput by 2x in a version of Hadoop without this bug.
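The bound 1 − (1 − f)^{RL} above is easy to evaluate directly. The short sketch below tabulates it for equal-share jobs (f = 1 / number of jobs) under a few values of R and L; the specific (R, L) combinations are our own illustrative choices, picked to mirror the ranges shown in Figure 3.

    # Expected upper bound on node locality under sticky slots: 1 - (1 - f)^(R*L),
    # where f is a job's fractional share, R the replication level, L slots per node.
    def sticky_slot_locality_bound(num_jobs: int, R: int, L: int) -> float:
        f = 1.0 / num_jobs                      # equal shares of the cluster
        return 1.0 - (1.0 - f) ** (R * L)

    if __name__ == "__main__":
        for R, L in [(1, 2), (3, 2), (3, 6)]:   # illustrative (R, L) combinations
            row = [f"{sticky_slot_locality_bound(n, R, L):.0%}" for n in (5, 15, 30)]
            print(f"R={R}, L={L}: 5 jobs {row[0]}, 15 jobs {row[1]}, 30 jobs {row[2]}")

With R = 3 and L = 6 this prints roughly 71% for 15 jobs and 46% for 30 jobs, consistent with the "below 80%" and "below 50%" figures quoted above.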
3.4 Delay Scheduling

The problems we presented happen because following a strict queuing order forces a job with no local data to be scheduled. We address them through a simple technique called delay scheduling. When a node requests a task, if the head-of-line job cannot launch a local task, we skip it and look at subsequent jobs. However, if a job has been skipped long enough, we start allowing it to launch non-local tasks, to avoid starvation. The key insight behind delay scheduling is that although the first slot we consider giving to a job is unlikely to have data for it, tasks finish so quickly that some slot with data for it will free up in the next few seconds.

In this section, we consider a simple version of delay scheduling where we allow a job to be skipped up to D times. Pseudocode for this algorithm is shown below:

    Algorithm 2: Fair Sharing with Simple Delay Scheduling

    Initialize j.skipcount to 0 for all jobs j.
    when a heartbeat is received from node n:
        if n has a free slot then
            sort jobs in increasing order of number of running tasks
            for j in jobs do
                if j has unlaunched task t with data on n then
                    launch t on n
                    set j.skipcount = 0
                else if j has unlaunched task t then
                    if j.skipcount >= D then
                        launch t on n
                    else
                        set j.skipcount = j.skipcount + 1
                    end if
                end if
            end for
        end if

Note that once a job has been skipped D times, we let it launch arbitrarily many non-local tasks without resetting its skip count. However, if it ever manages to launch a local task again, we set its skip count back to 0. We explain the rationale for this design in our analysis of delay scheduling.
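As a concrete rendering of Algorithm 2, the following executable Python sketch mirrors its logic. It is a simplified illustration under our own assumptions, not HFS code: the Job class and data_nodes map are hypothetical, and the function fills at most one free slot per heartbeat and returns a description of the launch.

    # Executable sketch of Algorithm 2 (fair sharing with simple delay scheduling).
    from dataclasses import dataclass, field
    from typing import Dict, List, Optional, Set

    @dataclass
    class Job:
        name: str
        running: int = 0
        pending: List[str] = field(default_factory=list)                 # unlaunched task ids
        data_nodes: Dict[str, Set[str]] = field(default_factory=dict)    # task id -> nodes holding its input
        skipcount: int = 0

    def on_heartbeat(jobs: List[Job], node: str, D: int) -> Optional[str]:
        """Assign the node's single free slot, preferring node-local tasks."""
        for j in sorted(jobs, key=lambda job: job.running):              # fewest running tasks first
            local = [t for t in j.pending if node in j.data_nodes.get(t, set())]
            if local:
                t = local[0]
                j.pending.remove(t)
                j.running += 1
                j.skipcount = 0                                          # launched locally: reset skip count
                return f"local {j.name}:{t} on {node}"
            elif j.pending:
                if j.skipcount >= D:                                     # skipped long enough: allow non-local
                    t = j.pending.pop(0)
                    j.running += 1
                    return f"non-local {j.name}:{t} on {node}"
                j.skipcount += 1                                         # otherwise skip this job for now
        return None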
3.5 Analysis of Delay Scheduling

In this section, we explore how the maximum skip count D in Algorithm 2 affects locality and response times, and how to set D to achieve a target level of locality. We find that:

1. Non-locality decreases exponentially with D.
2. The amount of waiting required to achieve a given level of locality is a fraction of the average task length and decreases linearly with the number of slots per node L.

We assume that we have an M-node cluster with L slots per node, for a total of S = ML slots. Also, at each time, let P_j denote the set of nodes on which job j has data left to process, which we call "preferred" nodes for job j, and let p_j = |P_j| / M be the fraction of nodes that j prefers. To simplify the analysis, we assume that all tasks are of the same length T and that the sets P_j are uncorrelated (for example, either every job has a large input file and therefore has data on every node, or every job has a different input file).

We first consider how much locality improves depending on D. Suppose that job j is farthest below its fair share. Then j has probability p_j of having data on each slot that becomes free. If j waits for up to D slots before being allowed to launch non-local tasks, then the probability that it does not find a local task is (1 − p_j)^D. This probability decreases exponentially with D. For example, a job with data on 10% of nodes (p_j = 0.1) has a 65% chance of launching a local task with D = 10, and a 99% chance with D = 40.

A second question is how long a job waits below its fair share to launch a local task. Because there are S slots in the cluster, a slot becomes free every T/S seconds on average. Therefore, once a job j reaches the head of the queue, it will wait at most (D/S)T seconds before being allowed to launch non-local tasks, provided that it stays at the head of the queue.[3] This wait will be much less than the average task length if S is large. In particular, waiting for a local task may cost less time than running a non-local task: in our experiments, local tasks ran up to 2x faster than non-local tasks. Note also that for a fixed number of nodes, the wait time decreases linearly with the number of slots per node.

We conclude with an approximate analysis of how to set D to achieve a desired level of locality.[4] Suppose that we wish to achieve locality greater than l for jobs with N tasks on a cluster with M nodes, L slots per node and replication factor R. We will compute the expected locality for an N-task job j over its lifetime by averaging up the probabilities that it launches a local task when it has N, N−1, ..., 1 tasks left to launch. When j has K tasks left to launch, p_j = 1 − (1 − K/M)^R, because the probability that a given node does not have a replica of one of j's input blocks is (1 − K/M)^R. Therefore, the probability that j launches a local task at this point is 1 − (1 − p_j)^D = 1 − (1 − K/M)^{RD} ≥ 1 − e^{−RDK/M}. Averaging this quantity over K = 1 to N, the expected locality for job j, given a skip count D, is at least:

    l(D) = (1/N) * sum_{K=1}^{N} (1 − e^{−RDK/M})
         = 1 − (1/N) * sum_{K=1}^{N} e^{−RDK/M}
         ≥ 1 − (1/N) * sum_{K=1}^{∞} e^{−RDK/M}
         = 1 − e^{−RD/M} / ( N (1 − e^{−RD/M}) )

Solving for l(D) ≥ l, we find that we need to set:

    D ≥ −(M/R) * ln( (1 − l) N / (1 + (1 − l) N) )        (2)

For example, for l = 0.95, N = 20, and R = 3, we need D ≥ 0.23M. Also, the maximum time a job waits for a local task is (D/S)T = (D/(LM))T = (0.23/L)T. For example, if we have L = 8 slots per node, this wait is 2.8% of the average task length.

[3] Once a job reaches the head of the queue, it is likely to stay there, because the head-of-queue job is the one that has the smallest number of running tasks. The slots that the job lost to fall below its share must have been given to other jobs, so the other jobs are likely above their fair share.

[4] This analysis does not consider that a job can launch non-local tasks without waiting after it launches its first one. However, this only happens towards the end of a job, so it does not matter much in large jobs. On the flip side, the inequalities we use underestimate the locality for a given D.
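The closed-form bound in equation (2) is straightforward to evaluate numerically. The sketch below (our own illustration, with the parameter values taken from the example in the text and an assumed cluster size M) computes the required D and the corresponding maximum wait as a fraction of the average task length.

    import math

    def min_skip_count(l: float, N: int, R: int, M: int) -> float:
        """Smallest D satisfying equation (2): D >= -(M/R) * ln((1-l)*N / (1+(1-l)*N))."""
        x = (1.0 - l) * N
        return -(M / R) * math.log(x / (1.0 + x))

    if __name__ == "__main__":
        M, L, R = 1000, 8, 3          # illustrative cluster: M nodes, L slots per node, replication R
        D = min_skip_count(l=0.95, N=20, R=R, M=M)
        print(f"D >= {D:.0f} skips  (= {D / M:.2f} * M)")                  # about 0.23 * M
        # Maximum wait once at the head of the queue: (D/S)*T with S = M*L slots,
        # i.e. a D / (L*M) fraction of the average task length T.
        print(f"max wait = {D / (L * M):.1%} of the average task length")  # roughly 2.9%, matching the ~2.8% figure quoted above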
4.1 TaskAssignmentinHFSWheneveraslotisfree,HFSassignsatasktoitthroughatwo-stepprocess:First,wecreateasortedlistofjobsaccordingtoourhierarchicalschedulingpolicy.Second,wescandownthislisttondajobtolaunchataskfrom,applyingdelayschedulingtoskipjobsthatdonothavedataonthenodebeingassignedforalimitedtime.Thesamealgorithmisappliedindependentlyformapslotsandreduceslots,althoughwedonotusedelayschedulingforreducesbecausetheyusuallyneedtoreaddatafromallnodes.Tocreateasortedlistofjobs,weusearecursivealgo-rithm.First,wesortthepools,placingpoolsthatarebelowtheirminimumshareattheheadofthelist(breakingtiesbasedonhowfareachpoolisbelowitsminimumshare),andsortingtheotherpoolsbycurrentshare weighttoachieveweightedfairsharing.Then,withineachpool,wesortjobsbasedonthepool'sinternalpolicy(FIFOorfairsharing).OurimplementationofdelayschedulingdiffersslightlyfromthesimpliedalgorithminSection3.4totakeintoac- countsomepracticalconsiderations.First,ratherthanusingamaximumskipcountDtodeterminehowlongajobwaitsforalocaltask,wesetamaximumwaittimeinseconds.Thisallowsjobstolaunchwithinapredictabletimewhenalargenumberofslotsintheclusterarelledwithlongtasksandslotsfreeupataslowrate.Second,toachieveracklocal-itywhenajobisunabletolaunchnode-localtasks,weusetwolevelsofdelayscheduling–jobswaitW1secondstondanode-localtask,andthenW2secondstondarack-localtask.Thisalgorithmisshownbelow: Algorithm3DelaySchedulinginHFS Maintainthreevariablesforeachjobj,initializedas j:level=0,j:wait=0,andj:skipped=false. whenaheartbeatisreceivedfromnoden: foreachjobjwithj:skipped=true,increasej:waitbythetimesincethelastheartbeatandsetj:skipped=false ifnhasafreeslotthen sortjobsusinghierarchicalschedulingpolicy forjinjobsdo ifjhasanode-localtasktonnthen setj:wait=0andj:level=0 returntton elseifjhasarack-localtasktonnand(j:level1orj:waitW1)then setj:wait=0andj:level=1 returntton elseifj:level=2or(j:level=1andj:waitW2)or(j:level=0andj:waitW1+W2)then setj:wait=0andj:level=2 returnanyunlaunchedtasktinjton else setj:skipped=true endif endfor endif Eachjobbeginsata“localitylevel”of0,whereitcanonlylaunchnode-localtasks.IfitwaitsatleastW1seconds,itgoestolocalitylevel1andmaylaunchrack-localtasks.IfitwaitsafurtherW2seconds,itgoestolevel2andmaylaunchoff-racktasks.Finally,ifajobeverlaunchesa“morelocal”taskthanthelevelitison,itgoesbackdowntoapreviouslevel,asmotivatedinSection3.5.1.Thealgorithmisstraightforwardtogeneralizetomorelocalitylevelsforclusterswithmorethanatwo-levelnetworkhierarchy.WeexpectadministratorstosetthewaittimesW1andW2basedontherateatwhichslotsfreeupintheirclusterandthedesiredleveloflocality,usingtheanalysisinSection3.5.Forexample,atFacebook,wesee27mapslotsfreeingpersecondwhentheclusterisunderload,lesarereplicatedR=3ways,andthereareM=620machines.Therefore,settingW1=10swouldgiveeachjobroughlyD=270schedulingopportunitiesbeforeitisallowedtolaunchnon-localtasks.ThisisenoughtoletjobswithK=1taskachieveatleast1�e�RDK=M=1�e�32701=620=73%locality,andtoletjobswith10tasksachieve90%locality. Environment Nodes HardwareandConguration AmazonEC2 100 42GHzcores,4disksand15GBRAMpernode.Appearstohave1Gbpslinks.4mapand2reduceslotspernode. PrivateCluster 100 8coresand4diskspernode.1GbpsEthernet.4racks.6mapand4reduceslotspernode. Table1:Experimentalenvironmentsusedinevaluation. 5. 
5. Evaluation

We have evaluated delay scheduling and HFS through a set of macrobenchmarks based on the Facebook workload, microbenchmarks designed to test hierarchical scheduling and stress delay scheduling, a sensitivity analysis, and an experiment measuring scheduler overhead.

We ran experiments in two environments: Amazon's Elastic Compute Cloud (EC2) [1] and a 100-node private cluster. On EC2, we used "extra-large" VMs, which appear to occupy a whole physical node. Both environments are atypical of large MapReduce installations because they have fairly high bisection bandwidth; the private cluster spanned only 4 racks, and while topology information is not provided by EC2, tests revealed that nodes were able to send 1 Gbps to each other. Therefore, our experiments understate potential performance gains from locality. We ran a modified version of Hadoop 0.20, configured with a block size of 128 MB because this improved performance (Facebook uses this setting in production). We set the number of task slots per node in each cluster based on hardware capabilities. Table 1 lists the hardware and slot counts in each environment.

Table 1: Experimental environments used in evaluation.

    Environment      Nodes  Hardware and Configuration
    Amazon EC2       100    4 2GHz cores, 4 disks and 15 GB RAM per node. Appears to have 1 Gbps links. 4 map and 2 reduce slots per node.
    Private Cluster  100    8 cores and 4 disks per node. 1 Gbps Ethernet. 4 racks. 6 map and 4 reduce slots per node.

5.1 Macrobenchmarks

To evaluate delay scheduling and HFS on a multi-user workload, we ran a set of macrobenchmarks based on the workload at Facebook on EC2. We generated a submission schedule for 100 jobs by sampling job inter-arrival times and input sizes from the distribution seen at Facebook over a week in October 2009. We ran this job submission schedule with three workloads based on the Hive benchmark [5] (which is itself based on Pavlo et al.'s benchmark comparing MapReduce to parallel databases [26]): an IO-heavy workload, in which all jobs were IO-bound; a CPU-heavy workload, in which all jobs were CPU-bound; and a mixed workload, which included all the jobs in the benchmark. For each workload, we compared response times and data locality under FIFO scheduling, naïve fair sharing, and fair sharing with delay scheduling. We now describe our experimental methodology in detail, before presenting our results.

We began by generating a common job submission schedule that was shared by all the experiments. We chose to use the same schedule across experiments so that elements of "luck," such as a small job submitted after a large one, happened the same number of times in all the experiments. However, the schedule was long enough (100 jobs) to contain a variety of behaviors. To generate the schedule, we first sampled job inter-arrival times at random from the Facebook trace. This distribution of inter-arrival times was roughly exponential with a mean of 14 seconds, making the total submission schedule 24 minutes long. We also generated job input sizes based on the Facebook workload, by looking at the distribution of number of map tasks per job at Facebook and creating data sets with the correct sizes (because there is one map task per 128 MB input block). We quantized the job sizes into nine bins, listed in Table 2, to make it possible to compare jobs in the same bin within and across experiments. We note that most jobs at Facebook are small, but the last bin, for jobs with more than 1501 maps, contains some very large jobs: the 98th percentile job size is 3065 map tasks, the 99th percentile is 3846 maps, the 99.5th percentile is 6232 maps, and the largest job in the week we looked at had over 25,000 maps.[6] We chose 4800 maps as our representative for this bin to pose a reasonable load while remaining manageable for our EC2 cluster.

[6] Many of the smallest jobs are actually periodic jobs that run several times per hour to import external data into the cluster and generate reports.

Table 2: Distribution of job sizes (in terms of number of map tasks) at Facebook and in our macrobenchmarks.

    Bin  #Maps     %Jobs at Facebook  #Maps in Benchmark  #Jobs in Benchmark
    1    1         39%                1                   38
    2    2         16%                2                   16
    3    3-20      14%                10                  14
    4    21-60     9%                 50                  8
    5    61-150    6%                 100                 6
    6    151-300   6%                 200                 6
    7    301-500   4%                 400                 4
    8    501-1500  4%                 800                 4
    9    >=1501    3%                 4800                4
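As a rough illustration of how such a submission schedule can be generated, the sketch below draws exponential inter-arrival times with a 14-second mean and samples each job's size from the bin frequencies in Table 2. This is our own approximation: the actual schedule was produced by sampling directly from the Facebook trace rather than from a fitted distribution.

    import random

    # (bin, maps in benchmark, fraction of jobs at Facebook) from Table 2
    BINS = [(1, 1, 0.39), (2, 2, 0.16), (3, 10, 0.14), (4, 50, 0.09), (5, 100, 0.06),
            (6, 200, 0.06), (7, 400, 0.04), (8, 800, 0.04), (9, 4800, 0.03)]

    def make_schedule(num_jobs: int = 100, mean_gap_s: float = 14.0, seed: int = 0):
        """Return a list of (submit_time_s, num_map_tasks) pairs."""
        rng = random.Random(seed)
        t, schedule = 0.0, []
        sizes = [maps for _, maps, _ in BINS]
        weights = [frac for _, _, frac in BINS]
        for _ in range(num_jobs):
            t += rng.expovariate(1.0 / mean_gap_s)            # exponential inter-arrival gap
            schedule.append((round(t, 1), rng.choices(sizes, weights)[0]))
        return schedule

    if __name__ == "__main__":
        sched = make_schedule()
        print(f"{len(sched)} jobs over {sched[-1][0] / 60:.0f} minutes")  # close to the 24-minute length quoted above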
We used our submission schedule for three workloads (IO-heavy, CPU-heavy, and mixed), to evaluate the impact of our algorithms for organizations with varying job characteristics. We chose the actual jobs to run in each case from the Hive benchmark [5], which contains Hive versions of four queries from Pavlo et al.'s MapReduce benchmark [26]: text search, a simple filtering selection, an aggregation, and a join that gets translated into multiple MapReduce steps.

Finally, we ran each workload under three schedulers: FIFO (Hadoop's default scheduler), naïve fair sharing (i.e. Algorithm 1), and fair sharing with 5-second delay scheduling. For simplicity, we submitted each job as a separate user, so that jobs were entitled to equal shares of the cluster.

5.1.1 Results for IO-Heavy Workload

To evaluate our algorithms on an IO-heavy workload, we picked the text search job out of the Hive benchmark, which scans through a data set and prints out only the records that contain a certain pattern. Only 0.01% of the records contain the pattern, so the job is almost entirely bound by disk IO. Our results are shown in Figures 5, 6, and 7. First, Figure 5 shows a CDF of job running times for various ranges of ...

Table 3: Job types and sizes for each bin in our mixed workload.

    Bin  Job Type     Map Tasks  Reduce Tasks  #Jobs Run
    1    select       1          NA            38
    2    text search  2          NA            16
    3    aggregation  10         3             14
    4    select       50         NA            8
    5    text search  100        NA            6
    6    aggregation  200        50            6
    7    select       400        NA            4
    8    aggregation  800        180           4
    9    join         2400       360           2
    10   text search  4800       NA            2

Figure 9: CDFs of running times of jobs in various bin ranges in the mixed workload. (Panels: (a) Bins 1-3, (b) Bins 4-8, (c) Bins 9-10.)
For this experiment, we split bin 9 into two smaller bins, one of which contained 2 join jobs (which translate into a large 2400-map job followed by three smaller jobs each) and another of which contained two 4800-map jobs as before. We list the job we used as a representative in each bin in Table 3. Unlike our first two workloads, which had map-only jobs, this workload also contained jobs with reduce tasks, so we also list the number of reduce tasks per job.

We plot CDFs of job response times in each bin in Figure 9. As in the previous experiments, fair sharing significantly improves the response time for smaller jobs, while slightly slowing larger jobs. Because the aggregation jobs take longer than map-only jobs (due to having a communication-heavy reduce phase), we have also plotted the speedups achieved by each bin separately in Figure 10. The dark bars show speedups for naïve fair sharing over FIFO, while the light bars show speedups for fair sharing with delay scheduling over FIFO. The smaller map-only jobs (bins 1 and 2) achieve significant speedups from fair sharing. Bin 3 does not achieve a speedup as high as in other experiments because the jobs are longer (the median one is about 100 seconds, while the median in bins 1 and 2 is 32 seconds with delay scheduling). However, in all but the largest bins, jobs benefit from both fair sharing and delay scheduling. We also see that the benefits from delay scheduling are larger for the bins with IO-intensive jobs (1, 2, 4, 5, 7 and 10) than for bins where there are also reduce tasks (and hence a smaller fraction of the job running time is spent reading input).

Figure 10: Average speedup of naïve fair sharing and fair sharing with delay scheduling over FIFO for jobs in each bin in the mixed workload. The black lines show standard deviations.

5.2 Microbenchmarks

We ran several microbenchmarks to test HFS in a more controlled manner, and to stress-test delay scheduling in situations where locality is difficult to achieve. For these experiments, we used a "scan" job included in Hadoop's GridMix benchmark. This is a synthetic job in which each map outputs 0.5% of its input records, similar to the text search job in the macrobenchmarks. As such, it is a suitable workload for stress-testing data locality because it is IO-bound. The scan job normally has one reduce task that counts the results, but we also ran some experiments with no reduces (saving map outputs as the job's output) to emulate pure filtering jobs.
5.2.1 Hierarchical Scheduling

To evaluate the hierarchical scheduling policy in HFS and measure how quickly resources are given to new jobs, we set up three pools on the EC2 cluster. Pools 1 and 2 used fair sharing as their internal policy, while pool 3 used FIFO. We then submitted a sequence of jobs to test both sharing between pools and scheduling within a pool. Figure 11 shows a timeline of the experiment. Delay scheduling (with W1 = 5s) was also enabled, and all jobs achieved 99-100% locality.

We used two types of filter jobs: two long jobs with long tasks (12000 map tasks that each took 25s on average) and four jobs with short tasks (800 map tasks that each took 12s on average). To make the first type of jobs have longer tasks, we set their filtering rate to 50% instead of 0.5%.

We began by submitting a long-task job to pool 1 at time 0. This job was given tasks on all the nodes in the cluster. Then, at time 57s, we submitted a second long-task job in pool 2. This job reached its fair share (half the cluster) in 17 seconds. Then, at time 118s, we submitted three short-task jobs to pool 3. The pool acquired 33% of the slots in the cluster in 12 seconds and scheduled its jobs in FIFO order, so that as soon as the first job finished tasks, slots were given to the second job. Once pool 3's jobs finished, the cluster returned to being shared equally between pools 1 and 2. Finally, at time 494s, we submitted a second job in pool 1. Because pool 1 was configured to perform fair sharing, it split up its slots between its jobs, giving them 25% of the slots each, while pool 2's share remained 50%.

Note that the graph in Figure 11 shows a "bump" in the share of pool 2 twenty seconds after it starts running jobs, and a smaller bump when pool 3 starts. These bumps occur because of the "commit pending" bug in Hadoop discussed in Section 3.3.2. Hadoop tasks enter a "commit pending" phase after they finish running the user's map function, during which they are still reported as running but a second task can be launched in their slot. However, during this time, the job object in the Hadoop master counts the task as running, while the slave object doesn't. Normally, a small percent of tasks from each job are in the "commit pending" state, so the bug doesn't affect fairness. However, when pool 2's first job is submitted, none of its tasks finish until about 20 seconds pass, so it holds onto a greater share of the cluster than 50%. (We calculated each job's share as the percent of running tasks that belong to it when we plotted Figure 11.)

Figure 11: Stacked chart showing the percent of map slots in the cluster given to each job as a function of time in our hierarchical scheduling experiment. Pools 1 and 2 use fair sharing internally, while pool 3 uses FIFO. The job submission schedule is explained in the text.

5.2.2 Delay Scheduling with Small Jobs

To test the effect of delay scheduling on locality and throughput in a small-job workload where head-of-line scheduling poses a problem, we ran workloads consisting of filter jobs with 3, 10 or 100 map tasks on the private cluster. For each workload, we picked the number of jobs based on the job sizes so as to have the experiment take 10-20 minutes. We compared fair sharing and FIFO with and without delay scheduling (W1 = W2 = 15s). FIFO performed the same as fair sharing, so we only show one set of numbers for both.

Figure 12 shows normalized running times of the workload, while Table 4 shows locality achieved by each scheduler. Delay scheduling increased throughput by 1.2x for 3-map jobs, 1.7x for 10-map jobs, and 1.3x for 100-map jobs, and raised data locality to at least 75% and rack locality to at least 94%. The throughput gain is higher for 10-map jobs than for 100-map jobs because locality with 100-map jobs is fairly good even without delay scheduling.

Table 4: Node and rack locality in small-jobs stress test workload. Results were similar for FIFO and fair sharing.

    Job Size  Node/Rack Locality Without Delay Sched.  Node/Rack Locality With Delay Sched.
    3 maps    2% / 50%                                 75% / 96%
    10 maps   37% / 98%                                99% / 100%
    100 maps  84% / 99%                                94% / 99%

Figure 12: Performance of small-jobs stress test with and without delay scheduling. Results were similar for FIFO and fair sharing.
The gain for the 3-map jobs was low because, at small job sizes, job initialization becomes a bottleneck in Hadoop. Interestingly, the gains with 10 and 100 maps were due to moving from rack-local to node-local tasks; rack locality was good even without delay scheduling because our cluster had only 4 racks.

5.2.3 Delay Scheduling with Sticky Slots

As explained in Section 3.3, sticky slots do not normally occur in Hadoop due to an accounting bug. We tested a version of Hadoop with this bug fixed to quantify the effect of sticky slots. We ran this test on EC2. We generated a large 180-GB data set (2 GB per node), submitted between 5 and 50 concurrent scan jobs on it, and measured the time to finish all jobs and the locality achieved. Figures 14 and 13 show the results with and without delay scheduling (with W1 = 10s). Without delay scheduling, locality was lower the more concurrent jobs there were: from 92% with 5 jobs down to 27% for 50 jobs. Delay scheduling raised locality to 99-100% in all cases. This led to an increase in throughput of 1.1x for 10 jobs, 1.6x for 20 jobs, and 2x for 50 jobs.

Figure 13: Node locality in sticky slots stress test. As the number of concurrent jobs grows, locality falls because of sticky slots.

Figure 14: Finish times in sticky slots stress test. When delay scheduling is not used, performance decreases as the number of jobs increases because data locality decreases. In contrast, finish times with delay scheduling grow linearly with the number of jobs.

5.3 Sensitivity Analysis

We measured the effect of the wait time in delay scheduling on data locality through a series of experiments in the EC2 environment. We ran experiments with two small job sizes: 4 maps and 12 maps, to measure how well delay scheduling mitigates head-of-line scheduling. We ran 200 jobs in each experiment, with 50 jobs active at any time. We varied the node locality wait time, W1, from 0 seconds to 10 seconds. There was no rack locality because we do not have information about racks on EC2; however, rack locality will generally be much higher than node locality because there are more slots per rack. Figure 15 shows the results. We see that without delay scheduling, both 4-map jobs and 12-map jobs have poor locality (5% and 11%). Setting W1 as low as 1 second improves locality to 68% and 80% respectively. Increasing the delay to 5s achieves nearly perfect locality. Finally, with a 10s delay, we got 100% locality for the 4-map jobs and 99.8% locality for the 12-map jobs.

Figure 15: Effect of delay scheduling's wait time W1 on node locality for small jobs with 4 and 12 map tasks. Even delays as low as 1 second raise locality from 5% to 68% for 4-map jobs.

5.4 Scheduler Overhead

In our 100-node experiments, HFS did not add any noticeable scheduling overhead. To measure the performance of HFS under a much heavier load, we used mock objects to simulate a cluster with 2500 nodes and 4 slots per node (2 map and 2 reduce), running 100 jobs with 1000 map and 1000 reduce tasks each that were placed into 20 pools. Under this workload, HFS was able to schedule 3200 tasks per second on a 2.66 GHz Intel Core 2 Duo. This is several times more than is needed to manage a cluster of this size running reasonably-sized tasks (e.g., if the average task length is 10s, there will only be 1000 tasks finishing per second).

6. Discussion

Underlying our work is a classic tradeoff between utilization and fairness. In provisioning a cluster computing infrastructure, there is a spectrum between having a separate cluster per user, which provides great fairness but poor utilization, and having a single FIFO cluster, which provides great utilization but no fairness.
Our work enables a sweet spot on this spectrum: multiplexing a cluster efficiently while giving each user response times comparable to a private cluster through fair sharing. To implement fair sharing, we had to consider two other tradeoffs between utilization and fairness: first, whether to kill tasks or wait for them to finish when new jobs are submitted, and second, how to achieve data locality. We have proposed a simple strategy called delay scheduling that achieves both fairness and locality by waiting for tasks to finish. Two key aspects of the cluster environment enable delay scheduling to perform well: first, most tasks are short compared to jobs, and second, there are multiple locations in which a task can run to read a given data block, because systems like Hadoop support multiple task slots per node.

Delay scheduling performs well in environments where these two conditions hold, which include the Hadoop environments at Yahoo! and Facebook. Delay scheduling will not be effective if a large fraction of tasks is much longer than the average job, or if there are few slots per node. However, as cluster technology evolves, we believe that both of these factors will improve. First, making tasks short improves fault tolerance [18], so as clusters grow, we expect more developers to split their work into short tasks. Second, due to multicore, cluster nodes are becoming "bigger" and can thus support more tasks at once. In the same spirit, organizations are putting more disks per node: for example, Google used 12 disks per node in its petabyte sort benchmark [9]. Lastly, 10 Gbps Ethernet will greatly increase network bandwidth within a rack, and may allow rack-local tasks to perform as well as node-local tasks. This would increase the number of locations from which a task can efficiently access its input block by an order of magnitude.

Because delay scheduling only involves being able to skip jobs in a sorted order that captures "who should be scheduled next," we believe that it can be used in a variety of environments beyond Hadoop and HFS. We now discuss several ways in which delay scheduling can be generalized.

Scheduling Policies other than Fair Sharing: Delay scheduling can be applied to any queuing policy that produces a sorted list of jobs. For example, in Section 5.2.2, we showed that it can also double throughput under FIFO.

Scheduling Preferences other than Data Locality: Some jobs may prefer to run multiple tasks in the same location rather than running each task near its input block. For example, some Hadoop jobs have a large data file that is shared by all tasks and is dynamically copied onto nodes that run the job's tasks using a feature called the distributed cache [4, 12]. In this situation, the locality test in our algorithm can be changed to prefer running tasks on nodes that have the cached file. To allow a cluster to be shared between jobs that want to reuse their slots and jobs that want to read data spread throughout the cluster, we can distribute tasks from the former throughout the cluster using the load balancing mechanism proposed for long tasks in Section 3.5.1.

Load Management Mechanisms other than Slots: Tasks in a cluster may have heterogeneous resource requirements. To improve utilization, the number of tasks supported on a node could be varied dynamically based on its load rather than being fixed as in Hadoop. As long as each job has a roughly equal chance of being scheduled on each node, delay scheduling will be able to achieve data locality.
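These generalizations share a common shape: delay scheduling needs only an ordering over jobs and a per-job placement preference. The sketch below (our own illustration, not HFS code) factors the core that way, so the same loop works for FIFO or fair-sharing orderings and for data locality or cached-file affinity alike; the Job class, the order callback, and the one-task-per-job simplification are assumptions for brevity.

    from typing import Callable, List, Optional

    class Job:
        def __init__(self, name: str, prefers: Callable[[str], bool]):
            self.name = name
            self.prefers = prefers      # e.g. "node holds my input block" or "node has my cached file"
            self.skipcount = 0
            self.unlaunched = True      # simplified: one pending task per job

    def delay_schedule(jobs: List[Job], node: str,
                       order: Callable[[List[Job]], List[Job]], D: int) -> Optional[Job]:
        """Generic delay scheduling: `order` ranks jobs, each job's `prefers` tests placement."""
        for j in order(jobs):
            if not j.unlaunched:
                continue
            if j.prefers(node):          # preferred placement: launch and reset the skip count
                j.skipcount, j.unlaunched = 0, False
                return j
            if j.skipcount >= D:         # waited long enough: launch anyway
                j.unlaunched = False
                return j
            j.skipcount += 1             # otherwise skip this job for now
        return None

    if __name__ == "__main__":
        fifo = lambda js: js                                    # FIFO: submission order
        cache_job = Job("uses-distributed-cache", prefers=lambda n: n in {"n1", "n2"})
        data_job = Job("reads-block-7", prefers=lambda n: n == "n9")
        chosen = delay_schedule([cache_job, data_job], node="n2", order=fifo, D=3)
        print(chosen.name if chosen else "skipped all jobs")    # -> uses-distributed-cache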
Distributed Scheduling Decisions: We have also implemented delay scheduling in Nexus [22], a two-level cluster scheduler that allows multiple instances of Hadoop, or of other cluster computing frameworks, to coexist on a shared cluster. In Nexus, a master process from each framework registers with the Nexus master to receive slots on the cluster. The Nexus master schedules slots by making "slot offers" to the appropriate framework (using fair sharing), but frameworks are allowed to reject an offer to wait for a slot with better data locality. We have seen locality improvements similar to those in Section 5 when running multiple instances of Hadoop on Nexus with delay scheduling. The fact that high data locality can be achieved in a distributed fashion provides significant practical benefits: first, multiple isolated instances of Hadoop can be run to ensure that experimental jobs do not crash the instance that runs production jobs; second, multiple versions of Hadoop can coexist; and lastly, organizations can use multiple cluster computing frameworks and pick the best one for each application.
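The offer-based variant can be sketched as follows: the framework-side scheduler receives a slot offer and either accepts it or rejects it, tracking how long it has been waiting so that, past a threshold, it accepts a non-preferred slot anyway. This is our own minimal illustration of the idea described above, not the Nexus API; the class, method names, and timing scheme are assumptions.

    import time
    from typing import Optional, Set

    class DelaySchedulingFramework:
        """Framework-side delay scheduling over slot offers (illustrative only)."""

        def __init__(self, preferred_nodes: Set[str], max_wait_s: float):
            self.preferred = preferred_nodes
            self.max_wait = max_wait_s
            self.waiting_since: Optional[float] = None   # set when we first reject an offer

        def on_slot_offer(self, node: str, now: Optional[float] = None) -> bool:
            """Return True to accept the offered slot, False to reject and keep waiting."""
            now = time.monotonic() if now is None else now
            if node in self.preferred:
                self.waiting_since = None                # got locality: stop the clock
                return True
            if self.waiting_since is None:
                self.waiting_since = now
            if now - self.waiting_since >= self.max_wait:
                self.waiting_since = None                # waited long enough: take it anyway
                return True
            return False

    if __name__ == "__main__":
        fw = DelaySchedulingFramework(preferred_nodes={"n3", "n7"}, max_wait_s=5.0)
        print(fw.on_slot_offer("n1", now=0.0))   # False: not preferred, start waiting
        print(fw.on_slot_offer("n7", now=2.0))   # True: preferred node offered
        print(fw.on_slot_offer("n1", now=3.0))   # False: waiting restarted
        print(fw.on_slot_offer("n1", now=9.0))   # True: exceeded the 5s wait, accept non-local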
7. Related Work

Scheduling for Data-Intensive Cluster Applications: The closest work we know of to our own is Quincy [24], a fair scheduler for Dryad. Quincy also tackles the conflict between locality and fairness in scheduling, but uses a very different mechanism from HFS. Each time a scheduling decision needs to be made, Quincy represents the scheduling problem as an optimization problem, in which tasks must be matched to nodes and different assignments have different costs based on locality and fairness. Min-cost flow is used to solve this problem. Quincy then kills some of the running tasks and launches new tasks to place the cluster in the configuration returned by the flow solver.

While killing tasks may be the most effective way to reassign resources in some situations, it wastes computation. Our work shows that waiting for suitable slots to free up can also be effective in a diverse real-world workload. One of the main differences between our environment and Quincy's is that Hadoop has multiple task slots per node, while the system in [24] only ran one task per node. The probability that all slots with local copies of a data block are filled by long tasks (necessitating killing) decreases exponentially with the number of slots per node, as shown in Section 3.5.1. Another important difference is that task lengths are much shorter than job lengths in typical Hadoop workloads.

At first sight, it may appear that Quincy uses more information about the cluster than HFS, and hence should make better scheduling decisions. However, HFS also uses information that is not used by Quincy: delay scheduling is based on knowledge about the rate at which slots free up. Instead of making scheduling decisions based on point snapshots of the state of the cluster, we take into account the fact that many tasks will finish in the near future. Finally, delay scheduling is simpler than the optimization approach in Quincy, which makes it easy to use with scheduling policies other than fair sharing, as we do in HFS.

High Performance Computing (HPC): Batch schedulers for HPC clusters, like Torque [13], support job priority and resource-consumption-aware scheduling. However, HPC jobs run on a fixed number of machines, so it is not possible to change jobs' allocations over time as we do in Hadoop to achieve fair sharing. HPC jobs are also usually CPU or communication bound, so there is less need for data locality.

Grids: Grid schedulers like Condor [28] support locality constraints, but usually at the level of geographic sites, because the jobs are more compute-intensive than MapReduce. Recent work also proposes replicating data across sites on demand [17]. Similarly, in BAD-FS [16], a workload scheduler manages distribution of data across a wide-area network to dedicated storage servers in each cluster. Our work instead focuses on task placement in a local-area cluster where data is stored on the same nodes that run jobs.

Parallel Databases: Like MapReduce, parallel databases run data-intensive workloads on a cluster. However, database queries are usually executed as long-running processes rather than short tasks like Hadoop's, reducing the opportunity for fine-grained sharing. Much like in HPC schedulers, queries must wait in a queue to run [6], and a single "monster query" can take up the entire system [11]. Reservations can be used to avoid starving interactive queries when a batch query is running [6], but this leads to underutilization when there are no interactive queries. In contrast, our Hadoop scheduler can assign all resources to a batch job and reassign slots rapidly when interactive jobs are launched.

Fair Sharing: A plethora of fair sharing algorithms have been developed in the networking and OS domains [7, 19,