Tyson Condie (UC Berkeley), Neil Conway (UC Berkeley), Khaled Elmeleegy (Yahoo! Research), Joseph M. Hellerstein (UC Berkeley), Russell Sears (UC Berkeley)
BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud
1. Distributed systems benefit substantially from a data-centric design style that focuses the programmer's attention on carefully capturing all the important state of the system as a family of collections (sets, relations, streams, etc.). Given such a model, the state of the system can be distributed naturally and flexibly across nodes via familiar mechanisms like partitioning and replication.

2. The key behaviors of such systems can be naturally implemented using declarative programming languages that manipulate these collections, abstracting the programmer from both the physical layout of the data and the fine-grained orchestration of data manipulation.

Taken together, these hypotheses suggest that traditionally difficult distributed programming tasks can be recast as data processing problems that are easy to reason about in a distributed setting and expressible in a high-level language. In turn, this should provide significant reductions in code complexity and development overhead, and improve system evolution and program correctness. We also conjecture that these hypotheses, taken separately, can offer design guidelines useful in a wide variety of programming models.

1.1 BOOM Analytics

We decided to begin the BOOM project with an experiment in construction, by implementing a substantial piece of distributed software in a data-centric, declarative style. Upon review of recent literature on datacenter infrastructure (e.g., [7, 11, 12, 14]), we observed that most of the complexity in these systems relates to the management of various forms of asynchronously-updated state, including sessions, protocols, and storage. Although quite complex, few of these systems involve intricate, uninterrupted sequences of computational steps. Hence, we suspected that datacenter infrastructure might be a good initial litmus test for our hypotheses about building distributed software.

In this paper, we report on our experiences building BOOM Analytics, an API-compliant reimplementation of the HDFS distributed file system and the Hadoop MapReduce engine. We named these two components BOOM-FS and BOOM-MR, respectively.[1] In writing BOOM Analytics, we preserved the Java API "skin" of HDFS and Hadoop, but replaced complex internal state with a set of relations, and replaced key system logic with code written in a declarative
language. The Hadoop stack appealed to us as a challenge for two reasons. First, it exercises the distributed power of a cluster. Unlike a farm of independent web service instances, the HDFS and Hadoop code entails coordination of large numbers of nodes toward common tasks. Second, Hadoop is missing significant distributed systems features like availability and scalability of master nodes. This allowed us to evaluate the difficulty of extending BOOM Analytics with complex features not found in the original codebase.

[1] The BOOM Analytics software described in this paper can be found at http://db.cs.berkeley.edu/eurosys-2010.

We implemented BOOM Analytics using the Overlog logic language, originally developed for Declarative Networking [24]. Overlog has been used with some success to prototype distributed system protocols, notably in a simple prototype of Paxos [34], a set of Byzantine Fault Tolerance variants [32], a suite of distributed file system consistency protocols [6], and a distributed hash table routing protocol implemented by our own group [24]. On the other hand, Overlog had not previously been used to implement a full-featured distributed system on the scale of Hadoop and HDFS. One goal of our work on BOOM Analytics was to evaluate the strengths and weaknesses of Overlog for system programming in the large, to inform the design of a new declarative framework for distributed programming.

1.2 Contributions

This paper describes our experience implementing and evolving BOOM Analytics, and running it on Amazon EC2. We document the effort required to develop BOOM Analytics in Overlog, and the way we were able to introduce significant extensions, including Paxos-supported replicated-master availability, and multi-master state-partitioned scalability. We describe the debugging tasks that arose when programming at this level of abstraction, and our tactics for metaprogramming Overlog to instrument our distributed system at runtime.

While the outcome of any software experience is bound in part to the specific programming tools used, there are hopefully more general lessons that can be extracted. To that end, we try to separate out (and in some cases critique) the specifics of Overlog as a declarative language, and the more general lessons of high-level data-centric programming. The more general data-centric aspect of the work is both positive and language-independent: many of the benefits we describe arise from exposing as much system state as possible via collection data types, and proceeding from that basis to write simple code to manage those collections.

As we describe each module of BOOM Analytics, we report the person-hours we spent implementing it and the size of our implementation in lines of code (comparing against the relevant feature of Hadoop, if appropriate). These are noisy metrics, so we are most interested in numbers that transcend the noise terms: for example, order-of-magnitude reductions in code size. We also validate that the performance of BOOM Analytics is competitive with the original Hadoop codebase.

We present the evolution of BOOM Analytics from a straightforward reimplementation of HDFS and Hadoop to a significantly enhanced system. We describe how our initial BOOM-FS prototype went through a series of major revisions ("revs") focused on availability (Section 4), scalability (Section 5), and debugging and monitoring (Section 6). We then detail how we designed BOOM-MR by replacing Hadoop's task scheduling logic with a declarative scheduling framework (Section 7). In each case, we discuss how the
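The two hypotheses above can be made concrete with a small sketch of our own (not code from BOOM Analytics): system state is held as relations of tuples, and the familiar mechanism of hash partitioning spreads those tuples across "nodes". All names and schemas here are hypothetical illustrations, loosely echoing the file relation used later in BOOM-FS.

```python
import zlib

NUM_PARTITIONS = 3
# Each shard holds one partition of a "file" relation:
# tuples of (file_id, parent_id, name, is_dir).
shards = [set() for _ in range(NUM_PARTITIONS)]

def shard_for(name: str) -> int:
    # Partitioning: route each tuple to a shard by a stable hash of its key.
    return zlib.crc32(name.encode()) % NUM_PARTITIONS

def insert_file(file_id, parent_id, name, is_dir):
    shards[shard_for(name)].add((file_id, parent_id, name, is_dir))

def lookup(name):
    # A query consults only the shard that owns the key.
    return [t for t in shards[shard_for(name)] if t[2] == name]

insert_file(1, None, "/", True)
insert_file(2, 1, "data", True)
insert_file(3, 2, "log.txt", False)
assert lookup("log.txt") == [(3, 2, "log.txt", False)]
```

The point of the sketch is that once state lives in collections, distribution decisions (which shard owns which tuple) are orthogonal to the logic that queries the collections.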
Figure 2. An Overlog timestep at a participating node: incoming events are applied to local state, the local Datalog program is run to fixpoint, and outgoing events are emitted.

When Overlog tuples arrive at a node either through rule evaluation or external events, they are handled in an atomic local Datalog timestep. Within a timestep, each node sees only locally-stored tuples. Communication between Datalog and the rest of the system (Java code, networks, and clocks) is modeled using events corresponding to insertions or deletions of tuples in Datalog tables.

Each timestep consists of three phases, as shown in Figure 2. In the first phase, inbound events are converted into tuple insertions and deletions on the local table partitions. The second phase interprets the local rules and tuples according to traditional Datalog semantics, executing the rules to a fixpoint in a traditional bottom-up fashion [36], recursively evaluating the rules until no new results are generated. In the third phase, updates to local state are atomically made durable, and outbound events (network messages, Java callback invocations) are emitted. Note that while Datalog is defined over a static database, the first and third phases allow Overlog programs to mutate state over time.

2.1 JOL

The original Overlog implementation (P2) is aging and targeted at network protocols, so we developed a new Java-based Overlog runtime we call JOL. Like P2, JOL compiles Overlog programs into pipelined dataflow graphs of operators (similar to elements in the Click modular router [19]). JOL provides metaprogramming support akin to P2's Evita Raced extension [10]: each Overlog program is compiled into a representation that is captured in rows of tables. Program testing, optimization and rewriting can be written concisely as metaprograms in Overlog that manipulate those tables.

Because the Hadoop stack is implemented in Java, we anticipated the need for tight integration between Overlog and Java code. Hence, JOL supports Java-based extensibility in the model of Postgres [33]. It supports Java classes as abstract data types, allowing Java objects to be stored in fields of tuples, and Java methods to be invoked on those fields from Overlog. JOL also allows Java-based aggregation functions to run on sets of column values, and supports Java table functions: Java iterators producing tuples,
which can be referenced in Overlog rules as ordinary relations. We made significant use of each of these features in BOOM Analytics.

3. HDFS Rewrite

Our first effort in developing BOOM Analytics was BOOM-FS, a clean-slate rewrite of HDFS in Overlog. HDFS is loosely based on GFS [14], and is targeted at storing large files for full-scan workloads. In HDFS, file system metadata is stored at a centralized NameNode, but file data is partitioned into chunks and distributed across a set of DataNodes. By default, each chunk is 64MB and is replicated at three DataNodes to provide fault tolerance. DataNodes periodically send heartbeat messages to the NameNode containing the set of chunks stored at the DataNode. The NameNode caches this information. If the NameNode has not seen a heartbeat from a DataNode for a certain period of time, it assumes that the DataNode has crashed and deletes it from the cache; it will also create additional copies of the chunks stored at the crashed DataNode to ensure fault tolerance. Clients only contact the NameNode to perform metadata operations, such as obtaining the list of chunks in a file; all data operations involve only clients and DataNodes.

HDFS only supports file read and append operations; chunks cannot be modified once they have been written. Like GFS, HDFS maintains a clean separation of control and data protocols: metadata operations, chunk placement and DataNode liveness are decoupled from the code that performs bulk data transfers. Following this lead, we implemented the simple high-bandwidth data path by hand in Java, concentrating our Overlog code on the trickier control-path logic. This allowed us to use a prototype version of JOL that focused on functionality more than performance. As we document in Section 8, this was sufficient to allow BOOM-FS to keep pace with HDFS in typical MapReduce workloads.

3.1 File System State

The first step of our rewrite was to represent file system metadata as a collection of relations (Table 1). We then implemented file system operations by writing queries over this schema. The file relation contains a row for each file or directory stored in BOOM-FS. The set of chunks in a file is identified by the corresponding rows in the fchunk relation.[2] The datanode and hb_chunk relations contain the set of live DataNodes and the chunks stored by each DataNode, respectively. The NameNode updates these relations as new heartbeats arrive; if the NameNode does not receive a heartbeat from a DataNode within a configurable amount of time, it assumes that the DataNode has failed and removes the corresponding rows from these tables.

[2] The order of a file's chunks must also be specified, because relations are unordered. Currently, we assign chunk IDs in a monotonically increasing fashion and only support append operations, so clients can determine a file's chunk order by sorting chunk IDs.

  Name     | Description                | Relevant attributes
  file     | Files                      | fileid, parentfileid, name, isDir
  fqpath   | Fully-qualified path names | path, fileid
  fchunk   | Chunks per file            | chunkid, fileid
  datanode | DataNode heartbeats        | nodeAddr, lastHeartbeatTime
  hb_chunk | Chunk heartbeats           | nodeAddr, chunkid, length

Table 1. BOOM-FS relations defining file system metadata. The underlined attributes together make up the primary key of each relation.

The NameNode must ensure that file system metadata is durable and restored to a consistent state after a failure. This was easy to implement using Overlog; each Overlog fixpoint brings the system from one consistent state to another. We used the Stasis storage library [30] to write durable state changes to disk as an atomic transaction at the end of each fixpoint. Like P2, JOL allows durability to be specified on a per-table basis. So the relations in Table 1 were marked durable, whereas "scratch" tables that are used to compute responses to file system requests were transient (emptied at the end of each fixpoint).

Since a file system is naturally hierarchical, the queries needed to traverse it are recursive. While recursion in SQL is considered somewhat esoteric, it is a common pattern in Datalog and hence Overlog. For example, an attribute of the file table describes the parent-child relationship of files; by computing the transitive closure of this relation, we can infer the fully-qualified path name of each file (fqpath). The two Overlog rules that derive fqpath from file are listed in Figure 3. Note that when a file representing a directory is removed, all fqpath tuples that describe child paths of that directory are automatically removed (because they can no longer be derived from the updated contents of file).

    // fqpath: Fully-qualified paths.
    // Base case: root directory has null parent
    fqpath(Path, FileId) :-
        file(FileId, FParentId, _, true),
        FParentId = null, Path = "/";

    fqpath(Path, FileId) :-
        file(FileId, FParentId, FName, _),
        fqpath(ParentPath, FParentId),
        // Do not add extra slash if parent is root dir
        PathSep = (ParentPath = "/" ? "" : "/"),
        Path = ParentPath + PathSep + FName;

Figure 3. Example Overlog for deriving fully-qualified path names from the base file system metadata in BOOM-FS.

Because path information is accessed frequently, we configured the fqpath relation to be cached after it is computed. Overlog will automatically update fqpath when file is changed, using standard relational view maintenance logic [36]. BOOM-FS defines several other views to compute derived file system metadata, such as the total size of each file and the contents of each directory. The materialization of each view can be changed via simple Overlog table definition statements without altering the semantics of the program. During the development process, we regularly adjusted view materialization to trade off read performance against write performance and storage requirements.

At each DataNode, chunks are stored as regular files on the file system. In addition, each DataNode maintains a relation describing the chunks stored at that node. This relation is populated by periodically invoking a table function defined in Java that walks the appropriate directory of the DataNode's local file system.

3.2 Communication Protocols

Both HDFS and BOOM-FS use three different protocols: the metadata protocol that clients and NameNodes use to exchange file metadata, the heartbeat protocol that DataNodes use to notify the NameNode about chunk locations and DataNode liveness, and the data protocol that clients and DataNodes use to exchange chunks. We implemented the metadata and heartbeat protocols with a set of distributed Overlog rules. The data protocol was implemented in Java because it is simple and performance critical. We proceed to describe the three protocols in order.

For each command in the metadata protocol, there is a single rule at the client (stating that a new request tuple should be "stored" at the NameNode). There are typically two corresponding rules at the NameNode: one to specify the result tuple that should be stored at the client, and another to handle errors by returning a failure message. Requests that modify metadata follow the same basic structure, except that in addition to deducing a new result tuple at the client, the NameNode rules also deduce changes to the file system metadata relations. Concurrent requests to the NameNode are handled in a serial fashion by JOL. While this simple approach has been sufficient for our experiments, we plan to explore more sophisticated concurrency control techniques in the future.

The heartbeat protocol follows a similar request/response pattern, but it is not driven by the arrival of network events. In order to trigger such events in a data-centric language, Overlog offers a "periodic" relation [24] that can be configured to produce new tuples at every tick of a wall-clock timer. DataNodes use the periodic relation to send heartbeat messages to NameNodes.

The NameNode can also send control messages to DataNodes. This occurs when a file system invariant is unmet and the NameNode requires the cooperation of the DataNode to restore the invariant. For example, the NameNode records the number of replicas of each chunk (as reported by heartbeat messages). If the number of replicas of a chunk drops below the configured replication factor (e.g., due to a DataNode failure), the NameNode sends a message to a DataNode that stores the chunk, asking it to send a copy of the chunk to another DataNode.

Finally, the data protocol is a straightforward mechanism for transferring the contents of a chunk between clients and DataNodes. This protocol is orchestrated by Overlog rules but implemented in Java. When an Overlog rule deduces

Next, we needed to convert basic Paxos into a working primitive for a distributed log. This required adding the ability to efficiently pass a series of log entries (Multi-Paxos), a liveness module, and a catchup algorithm. While the first was for the most part a simple schema change, the latter two caused our implementation to swell to 50 rules in roughly 400 lines of code. Echoing the experience of Chandra et al. [9], these enhancements made our code considerably more difficult to check for correctness. The code also lost some of its pristine declarative character; we return to this point in Section 9.

4.2 BOOM-FS Integration

Once we had Paxos in place, it was straightforward to support the replication of file system metadata. All state-altering actions are represented in the revised BOOM-FS as Paxos decrees, which are passed into the Paxos logic via a single Overlog rule that intercepts tentative actions and places them into a table that is joined with Paxos rules. Each action is considered complete at a given site when it is read back from the Paxos log (i.e., when it becomes visible in a join with a table representing the local copy of that log). A sequence number field in the Paxos log table captures the globally-accepted order of actions on all replicas.

We validated the performance of our implementation experimentally. In the absence of failure, replication has negligible performance impact, but when the primary NameNode fails, a backup NameNode takes over reasonably quickly. We present performance results in the technical report [2].

4.3 Discussion

Our Paxos implementation constituted roughly 400 lines of code and required six person-weeks of development time. Adding Paxos support to BOOM-FS took two person-days and required making mechanical changes to ten BOOM-FS rules (as described in Section 4.2). We suspect that the rule modifications required to add Paxos support could be performed as an automatic rewrite.

Lamport's original paper describes Paxos as a set of logical invariants. This specification naturally lent itself to a data-centric design in which ballots, ledgers, internal counters and vote-counting logic are represented uniformly as tables. However, as we note in a workshop paper [4], the principal benefit of our approach came directly from our use of a rule-based declarative language to encode Lamport's invariants. We found that we were able to capture the design patterns frequently encountered in consensus protocols (e.g., multicast, voting) via the composition of language constructs like aggregation, selection and join. In our initial implementation of basic Paxos, we found that each rule covered a large portion of the state space, avoiding the case-by-case transitions that would need to be specified in a state machine-based implementation.

However, choosing an invariant-based approach made it harder to adopt optimizations from the literature as the code evolved, in part because these optimizations were often described using state machines. We had to choose between translating the optimizations up to a higher level while preserving their intent, or directly encoding the state machine into logic, resulting in a lower-level implementation. In the end, we adopted both approaches, giving sections of the code a hybrid feel.

5. The Scalability Rev

HDFS NameNodes manage large amounts of file system metadata, which are kept in memory to ensure good performance. The original GFS paper acknowledged that this could cause significant memory pressure [14], and NameNode scaling is often an issue in practice at Yahoo!. Given the data-centric nature of BOOM-FS, we hoped to simply scale out the NameNode across multiple NameNode-partitions. Having exposed the system state in tables, this was straightforward: it involved adding a "partition" column to various tables to split them across nodes in a simple way. Once this was done, the code to query those partitions (regardless of the language in which it is written) composes cleanly with our availability implementation: each NameNode-partition can be deployed either as a single node or a Paxos group.

There are many options for partitioning the files in a directory tree. We opted for a simple strategy based on the hash of the fully-qualified path name of each file. We also modified the client library to broadcast requests for directory listings and directory creation to every NameNode-partition. Although the resulting directory creation implementation is not atomic, it is idempotent; recreating a partially-created directory will restore the system to a consistent state, and will preserve any files in the partially-created directory. For all other BOOM-FS operations, clients have enough local information to determine the correct NameNode-partition.

We did not attempt to support atomic "rename" across partitions. This would involve the atomic transfer of state between independent Paxos groups. We believe this would be relatively straightforward to implement (we have previously built a two-phase commit protocol in Overlog [4]), but we decided not to pursue this feature at present.

5.1 Discussion

By isolating the file system state into relations, it became a textbook exercise to partition that state across nodes. It took eight hours of developer time to implement NameNode partitioning; two of these hours were spent adding partitioning and broadcast support to the BOOM-FS client library. This was a clear win for the data-centric approach, independent of any declarative features of Overlog. Before attempting this work, we were unsure whether partitioning for scale-out would compose naturally with state replication for fault tolerance. Because scale-out in BOOM-FS amounted to little more than partitioning data collections, we found it quite easy to convince ourselves that

trace summary to the end user, which constituted 280 lines of Java. Because JOL already provided the metaprogramming features we needed, it took less than one developer day to implement these rewrites.

Capturing parser state in tables had several benefits. Because the program code itself is represented as data, introspection is a query over the metadata catalog, while automatic program rewrites are updates to the catalog tables. Setting up traces to report upon distributed executions was a simple matter of writing rules that query existing rules and insert new ones.

Using a declarative, rule-based language allowed us to express assertions in a cross-cutting fashion. A "watchdog" rule describes a query over system state that must never hold: such a rule is both a specification of an invariant and a check that enforces it. The assertion need not be closely coupled with the rules that modify the relevant state; instead, assertion rules may be written as an independent collection of concerns.

7. MapReduce Port

In contrast to our clean-slate strategy for developing BOOM-FS, we built BOOM-MR, our MapReduce implementation, by replacing Hadoop's core scheduling logic with Overlog. Our goal in building BOOM-MR was to explore embedding a data-centric rewrite of a non-trivial component into an existing procedural system. MapReduce scheduling policies are one issue that has been treated in recent literature (e.g., [40, 41]). To enable credible work on MapReduce scheduling, we wanted to remain true to the basic structure of the Hadoop MapReduce codebase, so we proceeded by understanding that code, mapping its core state into a relational representation, and then writing Overlog rules to manage that state in the face of new messages delivered by the existing Java APIs. We follow that structure in our discussion.

7.1 Background: Hadoop MapReduce

In Hadoop MapReduce, there is a single master node called the JobTracker, which manages a number of worker nodes called TaskTrackers. A job is divided into a set of map and reduce tasks. The JobTracker assigns tasks to worker nodes. Each map task reads an input chunk from the distributed file system, runs a user-defined map function, and partitions output key/value pairs into hash buckets on the local disk. Reduce tasks are created for each hash bucket. Each reduce task fetches the corresponding hash buckets from all mappers, sorts locally by key, runs a user-defined reduce function and writes the results to the distributed file system.

Each TaskTracker has a fixed number of slots for executing tasks (two maps and two reduces by default). A heartbeat protocol between each TaskTracker and the JobTracker is used to update the JobTracker's bookkeeping of the state of running tasks, and drive the scheduling of new tasks: if the JobTracker identifies free TaskTracker slots, it will schedule further tasks on the TaskTracker. Also, Hadoop will attempt

  Name        | Description             | Relevant attributes
  job         | Job definitions         | jobid, priority, submit_time, status, jobConf
  task        | Task definitions        | jobid, taskid, type, partition, status
  taskAttempt | Task attempts           | jobid, taskid, attemptid, progress, state, phase, tracker, input_loc, start, finish
  taskTracker | TaskTracker definitions | name, hostname, state, map_count, reduce_count, max_map, max_reduce
Table 3. BOOM-MR relations defining JobTracker state.

to schedule speculative tasks to reduce a job's response time if it detects straggler nodes [11].

7.2 MapReduce Scheduling in Overlog

Our initial goal was to port the JobTracker code to Overlog. We began by identifying the key state maintained by the JobTracker. This state includes both data structures to track the ongoing status of the system and transient state in the form of messages sent and received by the JobTracker. We captured this information in four Overlog tables, shown in Table 3.

The job relation contains a single row for each job submitted to the JobTracker. In addition to some basic metadata, each job tuple contains an attribute called jobConf that holds a Java object constructed by legacy Hadoop code, which captures the configuration of the job. The task relation identifies each task within a job. The attributes of this relation identify the task type (map or reduce), the input "partition" (a chunk for map tasks, a bucket for reduce tasks), and the current running status.

A task may be attempted more than once, due to speculation or if the initial execution attempt failed. The taskAttempt relation maintains the state of each such attempt. In addition to a progress percentage and a state (running/completed), reduce tasks can be in any of three phases: copy, sort, or reduce. The tracker attribute identifies the TaskTracker that is assigned to execute the task attempt. Map tasks also need to record the location of their input data, which is given by input_loc. The taskTracker relation identifies each TaskTracker in the cluster with a unique name.

Overlog rules are used to update the JobTracker's tables by converting inbound messages into job, taskAttempt and taskTracker tuples. These rules are mostly straightforward. Scheduling decisions are encoded in the taskAttempt table, which assigns tasks to TaskTrackers. A scheduling policy is simply a set of rules that join against the taskTracker relation to find TaskTrackers with unassigned slots, and schedules tasks by inserting tuples into taskAttempt. This architecture makes it easy for new scheduling policies to be defined.

7.3 Evaluation

To validate the extensible scheduling architecture described in Section 7.2, we implemented both Hadoop's default First-Come-First-Serve (FCFS) policy and the LATE policy proposed by Zaharia et al. [41]. Our goals were both to evaluate

Figure 5. CDFs representing the elapsed time between job startup and task completion for both map and reduce tasks, for all combinations of Hadoop and BOOM-MR over HDFS and BOOM-FS. In each graph, the horizontal axis is elapsed time in seconds, and the vertical represents the percentage of tasks completed.

and the compactness of BOOM-MR reflects that simplicity appropriately.

8. Performance Validation

While improved performance was not a goal of our work, we wanted to ensure that the performance of BOOM Analytics was competitive with Hadoop. We compared BOOM Analytics with Hadoop 18.1, using the 101-node EC2 cluster described in Section 7.3. The workload was a wordcount job on a 30GB file, using 481 map tasks and 100 reduce tasks. Figure 5 contains four graphs comparing the performance of different combinations of Hadoop MapReduce, HDFS, BOOM-MR, and BOOM-FS. Each graph reports a cumulative distribution of the elapsed time in seconds from job startup to map or reduce task completion.

The map tasks complete in three distinct waves. This is because only 2 × 100 map tasks can be scheduled at once. Although all 100 reduce tasks can be scheduled immediately, no reduce task can finish until all maps have been completed, because each reduce task requires the output of all map tasks.

The lower-left graph describes the performance of Hadoop running on top of HDFS, and hence serves as a baseline for the subsequent graphs. The upper-left graph details BOOM-MR running over HDFS. This graph shows that map and reduce task durations under BOOM-MR are nearly identical to Hadoop 18.1. The lower-right and upper-right graphs detail the performance of Hadoop MapReduce and BOOM-MR running on top of BOOM-FS, respectively. BOOM-FS performance is slightly slower than HDFS, but remains competitive.

9. Experience and Lessons

Our overall experience with BOOM Analytics has been quite positive. Building the system required only nine months of part-time work by four developers, including the time required to go well beyond the feature set of HDFS. Clearly, our experience is not universal: system infrastructure is only one class of distributed software, and analytics for Big Data is even more specialized. However, we feel that our experience sheds light on common patterns that occur in many distributed systems, including the coordination of multiple nodes toward common goals, replicated state for high availability, state partitioning for scalability, and monitoring and invariant checking to improve manageability.

Anecdotally, we feel that much of our productivity came from using a data-centric design philosophy, which exposed the simplicity of tasks we undertook. When you examine the tables we defined to implement HDFS and MapReduce, it
seems natural that the system implementation on top should be fairly simple, regardless of the language used. Overlog imposed this data-centric discipline throughout the development process: no private state was registered "on the side" to achieve specific tasks. Beyond this discipline, we also benefited from a few key features of a declarative language, including built-in queries with support for recursion, flexible view materialization and maintenance, and high-level metaprogramming opportunities afforded by Overlog and implemented in JOL.

9.1 Everything Is Data

In BOOM Analytics, everything is data, represented as objects in collections. This includes traditional persistent information like file system metadata, runtime state like TaskTracker status, summary statistics like those used by the JobTracker's scheduling policy, in-flight messages, system events, execution state of the system, and even parsed code. The benefits of this approach are perhaps best illustrated by the simplicity with which we scaled out the NameNode via partitioning (Section 5): by having the relevant state stored as data, we were able to use standard data partitioning to achieve what would ordinarily be a significant rearchitecting of the system. Similarly, the ease with which we implemented system monitoring via both system introspection tables and rule rewriting arose because we could easily write rules that manipulated concepts as diverse as transient system state and program semantics, all stored in a unified database representation (Section 6).

The uniformity of data-centric interfaces also allows interposition [18] of components in a natural manner: the dataflow "pipe" between two system modules can be easily rerouted to go through a third module. This enabled the simplicity of incorporating our Overlog LATE scheduler into BOOM-MR (Section 7.2). Because dataflows can be routed across the network (via the location specifier in a rule's head), interposition can also involve distributed logic; this is how we easily added Paxos support to the BOOM-FS NameNode (Section 4). Our experience suggests that a form of encapsulation could be achieved by constraining the points in the dataflow at which interposition is allowed to occur.

In all, relatively little of this discussion seems specific to logic programming per se. Many modern languages and frameworks now support convenient high-level programming over collection types (e.g., list comprehensions in Python, query expressions in the LINQ framework [27], and algebraic dataflows like MapReduce). Programmers using those frameworks can enjoy many of the benefits we identified in our work by adhering carefully to data-centric design patterns when building distributed systems. However, this requires discipline in a programming language like Python, where data-centric design is possible but not necessarily idiomatic.

9.2 Developing in Overlog

Some of the benefits we describe above can be attributed to data-centric design, while others relate to the high-level declarative nature of Overlog. However, Overlog is by no means a perfect language for distributed programming, and it caused us various frustrations. Many of the bugs we encountered were due to ambiguities in the language semantics, particularly with regard to state update and aggregate functions. This is partly due to Overlog's heritage: traditional Datalog does not support updates, and Overlog's support for updates has never had a formally-specified semantics. We have recently been working to address this with new theoretical foundations [3].

In retrospect, we made very conservative use of Overlog's support for distributed queries: we used the "@" syntax as a convenient shorthand for unidirectional messaging, but we did not utilize arbitrary distributed queries (i.e., rules with two or more distinct location specifiers in their body terms). In particular, we were unsure of the semantics of such queries in the event of node failures and network partitions. Instead, we implemented protocols for distributed messaging explicitly, depending on the requirements at hand (e.g., Paxos consensus, communication between NameNode and DataNodes). As we observed in Section 3.3, this made the enforcement of distributed invariants somewhat lower-level than our specifications of local-node invariants. By examining the coding patterns found in our hand-written messaging protocols, we hope to develop new higher-level abstractions for specifying and enforcing distributed invariants.

During our Paxos implementation, we needed to translate state machine descriptions into logic (Section 4). In fairness, the porting task was not actually very hard: in most cases it amounted to writing
message-handling rules in Overlog that had a familiar structure. But upon deeper reflection, our port was shallow and syntactic; the resulting Overlog does not feel like logic, in the invariant style of Lamport's original Paxos specification. Now that we have achieved a functional Paxos implementation, we hope to revisit this topic with an eye toward rethinking the intent of the state-machine optimizations. This would not only fit the spirit of Overlog better, but perhaps contribute to a deeper understanding of the ideas involved.

Finally, Overlog's syntax allows programs to be written concisely, but it can be difficult and time-consuming to read. Some of the blame can be attributed to Datalog's specification of joins via repetition of variable names. We are experimenting with an alternative syntax based on SQL's named-field approach.

9.3 Performance

JOL performance was good enough for BOOM Analytics to match Hadoop performance, but we are conscious that it has room to improve. We observed that system load averages were much lower with Hadoop than with BOOM Analytics.

[15] H. S. Gunawi et al. SQCK: A Declarative File System Checker. In OSDI, 2008.
[16] A. Gupta et al. Constraint checking with partial information. In PODS, 1994.
[17] M. Isard et al. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.
[18] M. B. Jones. Interposition agents: transparently interposing user code at the system interface. In SOSP, 1993.
[19] E. Kohler et al. The Click modular router. ACM Transactions on Computer Systems, 18(3):263-297, August 2000.
[20] M. S. Lam et al. Context-sensitive program analysis as database queries. In PODS, 2005.
[21] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133-169, 1998.
[22] LATE Hadoop Jira. Hadoop Jira issue tracker, July 2009. http://issues.apache.org/jira/browse/HADOOP.
[23] B. T. Loo et al. Declarative networking: language, execution and optimization. In SIGMOD, 2006.
[24] B. T. Loo et al. Implementing declarative overlays. In SOSP, 2005.
[25] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1997.
[26] W. R. Marczak et al. Declarative reconfigurable trust management. In CIDR, 2009.
[27] F. Marguerie et al. LINQ in Action. Manning Publications Co., 2008.
[28] Nokia Corporation. disco: massive data, minimal code, 2009. http://discoproject.org/.
[29] T. Schütt et al. Scalaris: Reliable transactional P2P key/value store. In ACM SIGPLAN Workshop on Erlang, 2008.
[30] R. Sears and E. Brewer. Stasis: flexible transactional storage. In OSDI, 2006.
[31] A. Singh et al. Using queries for distributed monitoring and forensics. In EuroSys, 2006.
[32] A. Singh et al. BFT protocols under fire. In NSDI, 2008.
[33] M. Stonebraker. Inclusion of new types in relational database systems. In ICDE, 1986.
[34] B. Szekely and E. Torres, Dec. 2005. http://www.klinewoods.com/papers/p2paxos.pdf.
[35] A. Thusoo et al. Hive: a warehousing solution over a Map-Reduce framework. In VLDB, 2009.
[36] J. D. Ullman. Principles of Database and Knowledge-Base Systems: Volume II: The New Technologies. W. H. Freeman & Company, 1990.
[37] W. White et al. Scaling games to epic proportions. In SIGMOD, 2007.
[38] F. Yang et al. Hilda: A high-level language for data-driven web applications. In ICDE, 2006.
[39] Y. Yu et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, 2008.
[40] M. Zaharia et al. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010.
[41] M. Zaharia et al. Improving MapReduce performance in heterogeneous environments. In OSDI, 2008.