
MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com
Google, Inc.

To appear in OSDI 2004.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.

2 Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

2.1 Example

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.

In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.

2.2 Types

Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

  map    (k1, v1)       -> list(k2, v2)
  reduce (k2, list(v2)) -> list(v2)

I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.

Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.

2.3 More Examples

Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs (URL, 1). The reduce function adds together all values for the same URL and emits a (URL, total count) pair.

Reverse Web-Link Graph: The map function outputs (target, source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: (target, list(source))

Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs. The map function emits a (hostname, term vector) pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final (hostname, term vector) pair.

Inverted Index: The map function parses each document, and emits a sequence of (word, document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a (word, list(document ID)) pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

Distributed Sort: The map function extracts the key from each record, and emits a (key, record) pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.

3 Implementation

Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.

This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:

(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.

(2) Commodity networking hardware is used - typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.

(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.

(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.

(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.

Figure 1: Execution overview

Figure 1 shows the overall flow of a MapReduce operation in our implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in Figure 1 correspond to the numbers in the list below):

1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.

2. One of the copies of the program is special - the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.

6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

After successful completion, the output of the mapreduce execution is available in the R output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these R output files into one file - they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.

3.2 Master Data Structures

The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).

The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.

3.3 Fault Tolerance

Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.

Worker Failure

The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.

Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.

When a map task is executed first by worker A and then later executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker A will read the data from worker B.

MapReduce is resilient to large-scale worker failures. For example, during one MapReduce operation, network maintenance on a running cluster was causing groups of 80 machines at a time to become unreachable for several minutes. The MapReduce master simply re-executed the work done by the unreachable worker machines, and continued to make forward progress, eventually completing the MapReduce operation.

Master Failure

It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.

Semantics in the Presence of Failures

When the user-supplied map and reduce operators are deterministic functions of their input values, our distributed implementation produces the same output as would have been produced by a non-faulting sequential execution of the entire program.

We rely on atomic commits of map and reduce task outputs to achieve this property. Each in-progress task writes its output to private temporary files. A reduce task produces one such file, and a map task produces R such files (one per reduce task). When a map task completes, the worker sends a message to the master and includes the names of the R temporary files in the message. If the master receives a completion message for an already completed map task, it ignores the message. Otherwise, it records the names of R files in a master data structure.

When a reduce task completes, the reduce worker atomically renames its temporary output file to the final output file. If the same reduce task is executed on multiple machines, multiple rename calls will be executed for the same final output file. We rely on the atomic rename operation provided by the underlying file system to guarantee that the final file system state contains just the data produced by one execution of the reduce task.

The vast majority of our map and reduce operators are deterministic, and the fact that our semantics are equivalent to a sequential execution in this case makes it very easy for programmers to reason about their program's behavior. When the map and/or reduce operators are non-deterministic, we provide weaker but still reasonable semantics. In the presence of non-deterministic operators, the output of a particular reduce task R1 is equivalent to the output for R1 produced by a sequential execution of the non-deterministic program. However, the output for a different reduce task R2 may correspond to the output for R2 produced by a different sequential execution of the non-deterministic program.

Consider map task M and reduce tasks R1 and R2. Let e(Ri) be the execution of Ri that committed (there is exactly one such execution). The weaker semantics arise because e(R1) may have read the output produced by one execution of M and e(R2) may have read the output produced by a different execution of M.

3.4 Locality

Network bandwidth is a relatively scarce resource in our computing environment. We conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS [8]) is stored on the local disks of the machines that make up our cluster. GFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines. The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data (e.g., on a worker machine that is on the same network switch as the machine containing the data). When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.

3.5 Task Granularity

We subdivide the map phase into M pieces and the reduce phase into R pieces, as described above. Ideally, M and R should be much larger than the number of worker machines. Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails: the many map tasks it has completed can be spread out across all the other worker machines.

There are practical bounds on how large M and R can be in our implementation, since the master must make O(M + R) scheduling decisions and keeps O(M * R) state in memory as described above. (The constant factors for memory usage are small however: the O(M * R) piece of the state consists of approximately one byte of data per map task/reduce task pair.)

Furthermore, R is often constrained by users because the output of each reduce task ends up in a separate output file. In practice, we tend to choose M so that each individual task is roughly 16 MB to 64 MB of input data (so that the locality optimization described above is most effective), and we make R a small multiple of the number of worker machines we expect to use. We often perform MapReduce computations with M = 200,000 and R = 5,000, using 2,000 worker machines.

3.6 Backup Tasks

One of the common causes that lengthens the total time taken for a MapReduce operation is a "straggler": a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. Stragglers can arise for a whole host of reasons. For example, a machine with a bad disk may experience frequent correctable errors that slow its read performance from 30 MB/s to 1 MB/s. The cluster scheduling system may have scheduled other tasks on the machine, causing it to execute the MapReduce code more slowly due to competition for CPU, memory, local disk, or network bandwidth. A recent problem we experienced was a bug in machine initialization code that caused processor caches to be disabled: computations on affected machines slowed down by over a factor of one hundred.

We have a general mechanism to alleviate the problem of stragglers. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes. We have tuned this mechanism so that it typically increases the computational resources used by the operation by no more than a few percent. We have found that this significantly reduces the time to complete large MapReduce operations. As an example, the sort program described in Section 5.3 takes 44% longer to complete when the backup task mechanism is disabled.

4 Refinements

Although the basic functionality provided by simply writing Map and Reduce functions is sufficient for most needs, we have found a few extensions useful. These are described in this section.

4.1 Partitioning Function
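The partitioning step described above (route each intermediate key to one of R reduce tasks via hash(key) mod R, or a user-supplied function such as hash(Hostname(urlkey)) mod R) can be sketched in a few lines. The following is a minimal illustrative Python sketch, not the paper's C++ library; the names default_partition, host_partition, and shuffle are invented here for illustration.

```python
# Minimal sketch of MapReduce's partitioning of intermediate keys
# into R regions (Sections 3.1 and 4.1). Illustrative only -- the
# paper's implementation is a C++ library; these names are invented.
from collections import defaultdict
from urllib.parse import urlparse

def default_partition(key, R):
    # Default partitioning function: hash(key) mod R.
    # (Python's str hash is salted per process, but it is stable
    # within a single run, which is all this sketch needs.)
    return hash(key) % R

def host_partition(url_key, R):
    # Custom partitioning: hash(Hostname(urlkey)) mod R, so that all
    # URLs from the same host land in the same reduce task / output file.
    return hash(urlparse(url_key).netloc) % R

def shuffle(intermediate_pairs, R, partition=default_partition):
    # Group intermediate (key, value) pairs into R regions, one per
    # reduce task -- analogous to what map workers do on local disk
    # in step (4) of the execution overview.
    regions = [defaultdict(list) for _ in range(R)]
    for key, value in intermediate_pairs:
        regions[partition(key, R)][key].append(value)
    return regions

pairs = [("http://a.com/x", 1), ("http://a.com/y", 1), ("http://b.org/z", 1)]
regions = shuffle(pairs, R=4, partition=host_partition)
```

With host_partition, both a.com keys are guaranteed to fall in the same region (whichever of the 4 regions that is), whereas under default_partition they could land anywhere; that is exactly the property the custom function buys.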
TheusersofMapReducespecifythenumberofreducetasks/outputlesthattheydesire(R).Datagetsparti-tionedacrossthesetasksusingapartitioningfunctionontheintermediatekey.Adefaultpartitioningfunctionisprovidedthatuseshashing(e.g.“hash(key)modR”).Thistendstoresultinfairlywell-balancedpartitions.Insomecases,however,itisusefultopartitiondatabysomeotherfunctionofthekey.Forexample,sometimestheoutputkeysareURLs,andwewantallentriesforasinglehosttoendupinthesameoutputle.Tosupportsituationslikethis,theuseroftheMapReducelibrarycanprovideaspecialpartitioningfunction.Forexample,using“hash(Hostname(urlkey))modR”asthepar-titioningfunctioncausesallURLsfromthesamehosttoendupinthesameoutputle.4.2OrderingGuaranteesWeguaranteethatwithinagivenpartition,theinterme-diatekey/valuepairsareprocessedinincreasingkeyor-der.Thisorderingguaranteemakesiteasytogenerateasortedoutputleperpartition,whichisusefulwhentheoutputleformatneedstosupportefcientrandomaccesslookupsbykey,orusersoftheoutputnditcon-venienttohavethedatasorted.4.3CombinerFunctionInsomecases,thereissignicantrepetitionintheinter-mediatekeysproducedbyeachmaptask,andtheuser-speciedReducefunctioniscommutativeandassocia-tive.Agoodexampleofthisisthewordcountingexam-pleinSection2.1.SincewordfrequenciestendtofollowaZipfdistribution,eachmaptaskwillproducehundredsorthousandsofrecordsoftheform&#xthe,;1.Allofthesecountswillbesentoverthenetworktoasinglere-ducetaskandthenaddedtogetherbytheReducefunctiontoproduceonenumber.WeallowtheusertospecifyanoptionalCombinerfunctionthatdoespartialmergingofthisdatabeforeitissentoverthenetwork.TheCombinerfunctionisexecutedoneachmachinethatperformsamaptask.Typicallythesamecodeisusedtoimplementboththecombinerandthereducefunc-tions.TheonlydifferencebetweenareducefunctionandacombinerfunctionishowtheMapReducelibraryhan-dlestheoutputofthefunction.Theoutputofareducefunctioniswrittentothenaloutputle.Theoutputofacombinerfunctioniswrittentoanintermediatelethatwillbesenttoareducetask.Partialcombiningsignicantlyspeedsupcertainclasseso
fMapReduceoperations.AppendixAcontainsanexamplethatusesacombiner.4.4InputandOutputTypesTheMapReducelibraryprovidessupportforreadingin-putdatainseveraldifferentformats.Forexample,“text”ToappearinOSDI20046 modeinputtreatseachlineasakey/valuepair:thekeyistheoffsetintheleandthevalueisthecontentsoftheline.Anothercommonsupportedformatstoresasequenceofkey/valuepairssortedbykey.Eachinputtypeimplementationknowshowtosplititselfintomean-ingfulrangesforprocessingasseparatemaptasks(e.g.textmode'srangesplittingensuresthatrangesplitsoc-curonlyatlineboundaries).Userscanaddsupportforanewinputtypebyprovidinganimplementationofasim-plereaderinterface,thoughmostusersjustuseoneofasmallnumberofpredenedinputtypes.Areaderdoesnotnecessarilyneedtoprovidedatareadfromale.Forexample,itiseasytodeneareaderthatreadsrecordsfromadatabase,orfromdatastruc-turesmappedinmemory.Inasimilarfashion,wesupportasetofoutputtypesforproducingdataindifferentformatsanditiseasyforusercodetoaddsupportfornewoutputtypes.4.5Side-effectsInsomecases,usersofMapReducehavefounditcon-venienttoproduceauxiliarylesasadditionaloutputsfromtheirmapand/orreduceoperators.Werelyontheapplicationwritertomakesuchside-effectsatomicandidempotent.Typicallytheapplicationwritestoatempo-raryleandatomicallyrenamesthisleonceithasbeenfullygenerated.Wedonotprovidesupportforatomictwo-phasecom-mitsofmultipleoutputlesproducedbyasingletask.Therefore,tasksthatproducemultipleoutputleswithcross-leconsistencyrequirementsshouldbedetermin-istic.Thisrestrictionhasneverbeenanissueinpractice.4.6SkippingBadRecordsSometimestherearebugsinusercodethatcausetheMaporReducefunctionstocrashdeterministicallyoncertainrecords.SuchbugspreventaMapReduceoperationfromcompleting.Theusualcourseofactionistoxthebug,butsometimesthisisnotfeasible;perhapsthebugisinathird-partylibraryforwhichsourcecodeisunavail-able.Also,sometimesitisacceptabletoignoreafewrecords,forexamplewhendoingstatisticalanalysisonalargedataset.Weprovideanoptionalmodeofexecu-tionwheretheMapReducelibrarydetectswhi
chrecordscausedeterministiccrashesandskipstheserecordsinor-dertomakeforwardprogress.Eachworkerprocessinstallsasignalhandlerthatcatchessegmentationviolationsandbuserrors.BeforeinvokingauserMaporReduceoperation,theMapRe-ducelibrarystoresthesequencenumberoftheargumentinaglobalvariable.Iftheusercodegeneratesasignal,thesignalhandlersendsa“lastgasp”UDPpacketthatcontainsthesequencenumbertotheMapReducemas-ter.Whenthemasterhasseenmorethanonefailureonaparticularrecord,itindicatesthattherecordshouldbeskippedwhenitissuesthenextre-executionofthecorre-spondingMaporReducetask.4.7LocalExecutionDebuggingproblemsinMaporReducefunctionscanbetricky,sincetheactualcomputationhappensinadis-tributedsystem,oftenonseveralthousandmachines,withworkassignmentdecisionsmadedynamicallybythemaster.Tohelpfacilitatedebugging,proling,andsmall-scaletesting,wehavedevelopedanalternativeim-plementationoftheMapReducelibrarythatsequentiallyexecutesalloftheworkforaMapReduceoperationonthelocalmachine.Controlsareprovidedtotheusersothatthecomputationcanbelimitedtoparticularmaptasks.Usersinvoketheirprogramwithaspecialagandcantheneasilyuseanydebuggingortestingtoolstheynduseful(e.g.gdb).4.8StatusInformationThemasterrunsaninternalHTTPserverandexportsasetofstatuspagesforhumanconsumption.Thesta-tuspagesshowtheprogressofthecomputation,suchashowmanytaskshavebeencompleted,howmanyareinprogress,bytesofinput,bytesofintermediatedata,bytesofoutput,processingrates,etc.Thepagesalsocontainlinkstothestandarderrorandstandardoutputlesgen-eratedbyeachtask.Theusercanusethisdatatopre-dicthowlongthecomputationwilltake,andwhetherornotmoreresourcesshouldbeaddedtothecomputation.Thesepagescanalsobeusedtogureoutwhenthecom-putationismuchslowerthanexpected.Inaddition,thetop-levelstatuspageshowswhichworkershavefailed,andwhichmapandreducetaskstheywereprocessingwhentheyfailed.Thisinforma-tionisusefulwhenattemptingtodiagnosebugsintheusercode.4.9CountersTheMapReducelibraryprovidesacounterfacilitytocountoccurrencesofvariousevents.Forexample,usercod
emaywanttocounttotalnumberofwordsprocessedorthenumberofGermandocumentsindexed,etc.Tousethisfacility,usercodecreatesanamedcounterobjectandthenincrementsthecounterappropriatelyintheMapand/orReducefunction.Forexample:ToappearinOSDI20047 Counter*uppercase;uppercase=GetCounter("uppercase");map(Stringname,Stringcontents):foreachwordwincontents:if(IsCapitalized(w)):�uppercase-Increment();EmitIntermediate(w,"1");Thecountervaluesfromindividualworkermachinesareperiodicallypropagatedtothemaster(piggybackedonthepingresponse).ThemasteraggregatesthecountervaluesfromsuccessfulmapandreducetasksandreturnsthemtotheusercodewhentheMapReduceoperationiscompleted.Thecurrentcountervaluesarealsodis-playedonthemasterstatuspagesothatahumancanwatchtheprogressofthelivecomputation.Whenaggre-gatingcountervalues,themastereliminatestheeffectsofduplicateexecutionsofthesamemaporreducetasktoavoiddoublecounting.(Duplicateexecutionscanarisefromouruseofbackuptasksandfromre-executionoftasksduetofailures.)SomecountervaluesareautomaticallymaintainedbytheMapReducelibrary,suchasthenumberofin-putkey/valuepairsprocessedandthenumberofoutputkey/valuepairsproduced.Usershavefoundthecounterfacilityusefulforsan-itycheckingthebehaviorofMapReduceoperations.Forexample,insomeMapReduceoperations,theusercodemaywanttoensurethatthenumberofoutputpairsproducedexactlyequalsthenumberofinputpairspro-cessed,orthatthefractionofGermandocumentspro-cessediswithinsometolerablefractionofthetotalnum-berofdocumentsprocessed.5PerformanceInthissectionwemeasuretheperformanceofMapRe-duceontwocomputationsrunningonalargeclusterofmachines.Onecomputationsearchesthroughapproxi-matelyoneterabyteofdatalookingforaparticularpat-tern.Theothercomputationsortsapproximatelyoneter-abyteofdata.Thesetwoprogramsarerepresentativeofalargesub-setoftherealprogramswrittenbyusersofMapReduce–oneclassofprogramsshufesdatafromonerepresenta-tiontoanother,andanotherclassextractsasmallamountofinterestingdatafromalargedataset.5.1ClusterCongurationAlloftheprogramswereexecut
edonaclusterthatconsistedofapproximately1800machines.Eachma-chinehadtwo2GHzIntelXeonprocessorswithHyper-Threadingenabled,4GBofmemory,two160GBIDE20406080100Seconds0100002000030000Input (MB/s)Figure2:Datatransferrateovertimedisks,andagigabitEthernetlink.Themachineswerearrangedinatwo-leveltree-shapedswitchednetworkwithapproximately100-200Gbpsofaggregateband-widthavailableattheroot.Allofthemachineswereinthesamehostingfacilityandthereforetheround-triptimebetweenanypairofmachineswaslessthanamil-lisecond.Outofthe4GBofmemory,approximately1-1.5GBwasreservedbyothertasksrunningonthecluster.Theprogramswereexecutedonaweekendafternoon,whentheCPUs,disks,andnetworkweremostlyidle.5.2GrepThegrepprogramscansthrough1010100-byterecords,searchingforarelativelyrarethree-characterpattern(thepatternoccursin92,337records).Theinputissplitintoapproximately64MBpieces(M=15000),andtheen-tireoutputisplacedinonele(R=1).Figure2showstheprogressofthecomputationovertime.TheY-axisshowstherateatwhichtheinputdataisscanned.TherategraduallypicksupasmoremachinesareassignedtothisMapReducecomputation,andpeaksatover30GB/swhen1764workershavebeenassigned.Asthemaptasksnish,theratestartsdroppingandhitszeroabout80secondsintothecomputation.Theentirecomputationtakesapproximately150secondsfromstarttonish.Thisincludesaboutaminuteofstartupover-head.Theoverheadisduetothepropagationofthepro-gramtoallworkermachines,anddelaysinteractingwithGFStoopenthesetof1000inputlesandtogettheinformationneededforthelocalityoptimization.5.3SortThesortprogramsorts1010100-byterecords(approxi-mately1terabyteofdata).ThisprogramismodeledaftertheTeraSortbenchmark[10].Thesortingprogramconsistsoflessthan50linesofusercode.Athree-lineMapfunctionextractsa10-bytesortingkeyfromatextlineandemitsthekeyandtheToappearinOSDI20048 500100005000100001500020000Input (MB/s)500100005000100001500020000Shuffle (MB/s)5001000Seconds05000100001500020000Output (MB/s)Done(a)Normalexecution500100005000100001500020000Input (MB/s)500100005000100001500020000Shuffle 
(MB/s)5001000Seconds05000100001500020000Output (MB/s)Done(b)Nobackuptasks500100005000100001500020000Input (MB/s)500100005000100001500020000Shuffle (MB/s)5001000Seconds05000100001500020000Output (MB/s)Done(c)200taskskilledFigure3:Datatransferratesovertimefordifferentexecutionsofthesortprogramoriginaltextlineastheintermediatekey/valuepair.Weusedabuilt-inIdentityfunctionastheReduceoperator.Thisfunctionspassestheintermediatekey/valuepairun-changedastheoutputkey/valuepair.Thenalsortedoutputiswrittentoasetof2-wayreplicatedGFSles(i.e.,2terabytesarewrittenastheoutputoftheprogram).Asbefore,theinputdataissplitinto64MBpieces(M=15000).Wepartitionthesortedoutputinto4000les(R=4000).Thepartitioningfunctionusestheini-tialbytesofthekeytosegregateitintooneofRpieces.Ourpartitioningfunctionforthisbenchmarkhasbuilt-inknowledgeofthedistributionofkeys.Inageneralsortingprogram,wewouldaddapre-passMapReduceoperationthatwouldcollectasampleofthekeysandusethedistributionofthesampledkeystocomputesplit-pointsforthenalsortingpass.Figure3(a)showstheprogressofanormalexecutionofthesortprogram.Thetop-leftgraphshowstherateatwhichinputisread.Theratepeaksatabout13GB/sanddiesofffairlyquicklysinceallmaptasksnishbe-fore200secondshaveelapsed.Notethattheinputrateislessthanforgrep.ThisisbecausethesortmaptasksspendabouthalftheirtimeandI/Obandwidthwritingin-termediateoutputtotheirlocaldisks.Thecorrespondingintermediateoutputforgrephadnegligiblesize.Themiddle-leftgraphshowstherateatwhichdataissentoverthenetworkfromthemaptaskstothere-ducetasks.Thisshufingstartsassoonastherstmaptaskcompletes.Thersthumpinthegraphisfortherstbatchofapproximately1700reducetasks(theentireMapReducewasassignedabout1700machines,andeachmachineexecutesatmostonereducetaskatatime).Roughly300secondsintothecomputation,someoftheserstbatchofreducetasksnishandwestartshufingdatafortheremainingreducetasks.Alloftheshufingisdoneabout600secondsintothecomputation.Thebottom-leftgraphshowstherateatwhichsorteddataiswrittentothenaloutputlesbythereducetasks.T
A few things to note: the input rate is higher than the shuffle rate and the output rate because of our locality optimization: most data is read from a local disk and bypasses our relatively bandwidth constrained network. The shuffle rate is higher than the output rate because the output phase writes two copies of the sorted data (we make two replicas of the output for reliability and availability reasons). We write two replicas because that is the mechanism for reliability and availability provided by our underlying file system. Network bandwidth requirements for writing data would be reduced if the underlying file system used erasure coding [14] rather than replication.

To appear in OSDI 2004

5.4 Effect of Backup Tasks

In Figure 3(b), we show an execution of the sort program with backup tasks disabled. The execution flow is similar to that shown in Figure 3(a), except that there is a very long tail where hardly any write activity occurs. After 960 seconds, all except 5 of the reduce tasks are completed. However these last few stragglers don't finish until 300 seconds later. The entire computation takes 1283 seconds, an increase of 44% in elapsed time.

5.5 Machine Failures

In Figure 3(c), we show an execution of the sort program where we intentionally killed 200 out of 1746 worker processes several minutes into the computation. The underlying cluster scheduler immediately restarted new worker processes on these machines (since only the processes were killed, the machines were still functioning properly).

The worker deaths show up as a negative input rate since some previously completed map work disappears (since the corresponding map workers were killed) and needs to be redone. The re-execution of this map work happens relatively quickly. The entire computation finishes in 933 seconds including startup overhead (just an increase of 5% over the normal execution time).
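The difference measured in Section 5.4 comes from the backup-task mechanism: near the end of a job, the master schedules duplicate executions of the remaining in-progress tasks, and a task is marked complete when any copy finishes. The toy sketch below captures only that decision rule; the 90%-done threshold and the class shape are assumptions for illustration, not the paper's implementation.

```cpp
#include <cassert>
#include <set>

// Toy model of the backup-task decision: once 90% of tasks have
// completed, any still-unfinished task becomes eligible for a
// redundant (backup) execution on another worker.
class BackupScheduler {
 public:
  explicit BackupScheduler(int total_tasks) : total_(total_tasks) {}

  void MarkDone(int task_id) { done_.insert(task_id); }

  bool ShouldBackup(int task_id) const {
    bool nearly_done = static_cast<int>(done_.size()) * 10 >= total_ * 9;
    return nearly_done && done_.count(task_id) == 0;
  }

 private:
  int total_;
  std::set<int> done_;
};
```

With 10 tasks, no backups are scheduled until 9 have finished; the one straggler then becomes eligible for redundant execution, which is what eliminates the long tail seen in Figure 3(b).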
6 Experience

We wrote the first version of the MapReduce library in February of 2003, and made significant enhancements to it in August of 2003, including the locality optimization, dynamic load balancing of task execution across worker machines, etc. Since that time, we have been pleasantly surprised at how broadly applicable the MapReduce library has been for the kinds of problems we work on. It has been used across a wide range of domains within Google, including:

- large-scale machine learning problems,
- clustering problems for the Google News and Froogle products,
- extraction of data used to produce reports of popular queries (e.g. Google Zeitgeist),
- extraction of properties of web pages for new experiments and products (e.g. extraction of geographical locations from a large corpus of web pages for localized search), and
- large-scale graph computations.

[Figure 4: MapReduce instances over time (number of instances in the source tree, 2003/03 through 2004/09)]

Table 1: MapReduce jobs run in August 2004

  Number of jobs                    29,423
  Average job completion time       634 secs
  Machine days used                 79,186 days
  Input data read                   3,288 TB
  Intermediate data produced        758 TB
  Output data written               193 TB
  Average worker machines per job   157
  Average worker deaths per job     1.2
  Average map tasks per job         3,351
  Average reduce tasks per job      55
  Unique map implementations        395
  Unique reduce implementations     269
  Unique map/reduce combinations    426

Figure 4 shows the significant growth in the number of separate MapReduce programs checked into our primary source code management system over time, from 0 in early 2003 to almost 900 separate instances as of late September 2004. MapReduce has been so successful because it makes it possible to write a simple program and run it efficiently on a thousand machines in the course of half an hour, greatly speeding up the development and prototyping cycle. Furthermore, it allows programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily.

At the end of each job, the MapReduce library logs statistics about the computational resources used by the job. In Table 1, we show some statistics for a subset of MapReduce jobs run at Google in August 2004.

6.1 Large-Scale Indexing

One of our most significant uses of MapReduce to date has been a complete rewrite of the production indexing
system that produces the data structures used for the Google web search service. The indexing system takes as input a large set of documents that have been retrieved by our crawling system, stored as a set of GFS files. The raw contents for these documents are more than 20 terabytes of data. The indexing process runs as a sequence of five to ten MapReduce operations. Using MapReduce (instead of the ad-hoc distributed passes in the prior version of the indexing system) has provided several benefits:

- The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3800 lines of C++ code to approximately 700 lines when expressed using MapReduce.

- The performance of the MapReduce library is good enough that we can keep conceptually unrelated computations separate, instead of mixing them together to avoid extra passes over the data. This makes it easy to change the indexing process. For example, one change that took a few months to make in our old indexing system took only a few days to implement in the new system.

- The indexing process has become much easier to operate, because most of the problems caused by machine failures, slow machines, and networking hiccups are dealt with automatically by the MapReduce library without operator intervention. Furthermore, it is easy to improve the performance of the indexing process by adding new machines to the indexing cluster.

7 Related Work

Many systems have provided restricted programming models and used the restrictions to parallelize the computation automatically. For example, an associative function can be computed over all prefixes of an N element array in log N time on N processors using parallel prefix computations [6, 9, 13]. MapReduce can be considered a simplification and distillation of some of these models based on our experience with large real-world computations. More significantly, we provide a fault-tolerant implementation that scales to thousands of processors. In contrast, most of the parallel processing systems have only been implemented on smaller scales and leave the details of handling machine failures to the programmer.
Bulk Synchronous Programming [17] and some MPI primitives [11] provide higher-level abstractions that make it easier for programmers to write parallel programs. A key difference between these systems and MapReduce is that MapReduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault-tolerance.

Our locality optimization draws its inspiration from techniques such as active disks [12, 15], where computation is pushed into processing elements that are close to local disks, to reduce the amount of data sent across I/O subsystems or the network. We run on commodity processors to which a small number of disks are directly connected instead of running directly on disk controller processors, but the general approach is similar.

Our backup task mechanism is similar to the eager scheduling mechanism employed in the Charlotte System [3]. One of the shortcomings of simple eager scheduling is that if a given task causes repeated failures, the entire computation fails to complete. We fix some instances of this problem with our mechanism for skipping bad records.

The MapReduce implementation relies on an in-house cluster management system that is responsible for distributing and running user tasks on a large collection of shared machines. Though not the focus of this paper, the cluster management system is similar in spirit to other systems such as Condor [16].

The sorting facility that is a part of the MapReduce library is similar in operation to NOW-Sort [1]. Source machines (map workers) partition the data to be sorted and send it to one of R reduce workers. Each reduce worker sorts its data locally (in memory if possible). Of course NOW-Sort does not have the user-definable Map and Reduce functions that make our library widely applicable.

River [2] provides a programming model where processes communicate with each other by sending data over distributed queues. Like MapReduce, the River system tries to provide good average case performance even in the presence of non-uniformities introduced by heterogeneous hardware or system perturbations. River achieves this by careful scheduling of disk and network transfers to achieve balanced completion times. MapReduce has a different approach. By restricting the programming model, the MapReduce framework is able to partition the problem into a large number of fine-grained tasks. These tasks are dynamically scheduled on available workers so that faster workers process more tasks. The restricted programming model also allows us to schedule redundant executions of tasks near the end of the job which greatly reduces completion time in the presence of non-uniformities (such as slow or stuck workers).
BAD-FS [5] has a very different programming model from MapReduce, and unlike MapReduce, is targeted to the execution of jobs across a wide-area network. However, there are two fundamental similarities. (1) Both systems use redundant execution to recover from data loss caused by failures. (2) Both use locality-aware scheduling to reduce the amount of data sent across congested network links.

TACC [7] is a system designed to simplify construction of highly-available networked services. Like MapReduce, it relies on re-execution as a mechanism for implementing fault-tolerance.

8 Conclusions

The MapReduce programming model has been successfully used at Google for many different purposes. We attribute this success to several reasons. First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used for the generation of data for Google's production web search service, for sorting, for data mining, for machine learning, and many other systems. Third, we have developed an implementation of MapReduce that scales to large clusters of machines comprising thousands of machines. The implementation makes efficient use of these machine resources and therefore is suitable for use on many of the large computational problems encountered at Google.

We have learned several things from this work. First, restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant. Second, network bandwidth is a scarce resource. A number of optimizations in our system are therefore targeted at reducing the amount of data sent across the network: the locality optimization allows us to read data from local disks, and writing a single copy of the intermediate
data to local disks saves network bandwidth. Third, redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.

Acknowledgements

Josh Levenberg has been instrumental in revising and extending the user-level MapReduce API with a number of new features based on his experience with using MapReduce and other people's suggestions for enhancements. MapReduce reads its input from and writes its output to the Google File System [8]. We would like to thank Mohit Aron, Howard Gobioff, Markus Gutschke, David Kramer, Shun-Tak Leung, and Josh Redstone for their work in developing GFS. We would also like to thank Percy Liang and Olcan Sercinoglu for their work in developing the cluster management system used by MapReduce. Mike Burrows, Wilson Hsieh, Josh Levenberg, Sharon Perl, Rob Pike, and Debby Wallach provided helpful comments on earlier drafts of this paper. The anonymous OSDI reviewers, and our shepherd, Eric Brewer, provided many useful suggestions of areas where the paper could be improved. Finally, we thank all the users of MapReduce within Google's engineering organization for providing helpful feedback, suggestions, and bug reports.

References

[1] Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-performance sorting on networks of workstations. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997.

[2] Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David Patterson, and Kathy Yelick. Cluster I/O with River: Making the fast case common. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems (IOPADS '99), pages 10-22, Atlanta, Georgia, May 1999.

[3] Arash Baratloo, Mehmet Karaul, Zvi Kedem, and Peter Wyckoff. Charlotte: Metacomputing on the web. In Proceedings of the 9th International Conference on Parallel and Distributed Computing Systems, 1996.

[4] Luiz A. Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The Google cluster architecture. IEEE Micro, 23(2):22-28, April 2003.

[5] John Bent, Douglas Thain, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Explicit control in a batch-aware distributed
file system. In Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 2004.

[6] Guy E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, C-38(11), November 1989.

[7] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Cluster-based scalable network services. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 78-91, Saint-Malo, France, 1997.

[8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In 19th Symposium on Operating Systems Principles, pages 29-43, Lake George, New York, 2003.

[9] S. Gorlatch. Systematic efficient parallelization of scan and other list homomorphisms. In L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Euro-Par '96. Parallel Processing, Lecture Notes in Computer Science 1124, pages 401-408. Springer-Verlag, 1996.

[10] Jim Gray. Sort benchmark home page. http://research.microsoft.com/barc/SortBenchmark/.

[11] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1999.

[12] L. Huston, R. Sukthankar, R. Wickremesinghe, M. Satyanarayanan, G. R. Ganger, E. Riedel, and A. Ailamaki. Diamond: A storage architecture for early discard in interactive search. In Proceedings of the 2004 USENIX File and Storage Technologies (FAST) Conference, April 2004.

[13] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831-838, 1980.

[14] Michael O. Rabin. Efficient dispersal of information for security, load balancing and fault tolerance. Journal of the ACM, 36(2):335-348, 1989.

[15] Erik Riedel, Christos Faloutsos, Garth A. Gibson, and David Nagle. Active disks for large-scale data processing. IEEE Computer, pages 68-74, June 2001.

[16] Douglas Thain, Todd Tannenbaum, and Miron Livny. Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience, 2004.

[17] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1997.

[18] Jim Wyllie. Spsort: How to sort a terabyte quickly. http://alme1.almaden.ibm.com/cs/spsort.pdf.

A Word Frequency
This section contains a program that counts the number of occurrences of each unique word in a set of input files specified on the command line.

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i]))
        i++;

      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i-start), "1");
    }
  }
};

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }

    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);

  MapReduceSpecification spec;

  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  //   ...
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

  // Optional: do partial sums within map
  // tasks to save network bandwidth
  out->set_combiner_class("Adder");

  // Tuning parameters: use at most 2000
  // machines and 100 MB of memory per task
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  // Done: 'result' structure contains info
  // about counters, time taken, number of
  // machines used, etc.
  return 0;
}
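The tokenization loop in WordCounter::Map above can be exercised without the MapReduce library. The stand-alone sketch below is our restructuring, not part of the paper's API: it collects the (word, "1") pairs into a vector instead of calling Emit.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Same scan as WordCounter::Map: skip leading whitespace, find the
// word end, and emit one (word, "1") pair per whitespace-separated
// word. The unsigned char cast keeps isspace well-defined for bytes
// outside the ASCII range.
std::vector<std::pair<std::string, std::string>> MapWords(
    const std::string& text) {
  std::vector<std::pair<std::string, std::string>> out;
  const int n = text.size();
  for (int i = 0; i < n; ) {
    // Skip past leading whitespace
    while (i < n && isspace(static_cast<unsigned char>(text[i])))
      i++;

    // Find word end
    int start = i;
    while (i < n && !isspace(static_cast<unsigned char>(text[i])))
      i++;
    if (start < i)
      out.emplace_back(text.substr(start, i - start), "1");
  }
  return out;
}
```

MapWords("to be or not to be") yields six pairs, one per word occurrence; the Adder reducer would then sum the "1" values per key, giving counts of 2 for "to" and "be" and 1 for "or" and "not".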