BOOM Analytics: Exploring Data-Centric Declarative Programming for the Cloud
Peter Alvaro




Tyson Condie, UC Berkeley (tcondie@eecs.berkeley.edu); Neil Conway, UC Berkeley (nrc@eecs.berkeley.edu); Khaled Elmeleegy, Yahoo! Research (khaled@yahoo-inc.com); Joseph M. Hellerstein, UC Berkeley (hellerstein@eecs.berkeley.edu); Russell Sears, UC Berkeley (sears@eecs.be



1. Distributed systems benefit substantially from a data-centric design style that focuses the programmer's attention on carefully capturing all the important state of the system as a family of collections (sets, relations, streams, etc.). Given such a model, the state of the system can be distributed naturally and flexibly across nodes via familiar mechanisms like partitioning and replication.

2. The key behaviors of such systems can be naturally implemented using declarative programming languages that manipulate these collections, abstracting the programmer from both the physical layout of the data and the fine-grained orchestration of data manipulation.

Taken together, these hypotheses suggest that traditionally difficult distributed programming tasks can be recast as data processing problems that are easy to reason about in a distributed setting and expressible in a high-level language. In turn, this should provide significant reductions in code complexity and development overhead, and improve system evolution and program correctness. We also conjecture that these hypotheses, taken separately, can offer design guidelines useful in a wide variety of programming models.

1.1 BOOM Analytics

We decided to begin the BOOM project with an experiment in construction, by implementing a substantial piece of distributed software in a data-centric, declarative style. Upon review of recent literature on datacenter infrastructure (e.g., [7, 11, 12, 14]), we observed that most of the complexity in these systems relates to the management of various forms of asynchronously-updated state, including sessions, protocols, and storage. Although quite complex, few of these systems involve intricate, uninterrupted sequences of computational steps. Hence, we suspected that datacenter infrastructure might be a good initial litmus test for our hypotheses about building distributed software.

In this paper, we report on our experiences building BOOM Analytics, an API-compliant reimplementation of the HDFS distributed file system and the Hadoop MapReduce engine. We named these two components BOOM-FS and BOOM-MR, respectively.[1] In writing BOOM Analytics, we preserved the Java API "skin" of HDFS and Hadoop, but replaced complex internal state with a set of relations, and replaced key system logic with code written in a declarative language.

The Hadoop stack appealed to us as a challenge for two reasons. First, it exercises the distributed power of a cluster. Unlike a farm of independent web service instances, the HDFS and Hadoop code entails coordination of large numbers of nodes toward common tasks. Second, Hadoop is missing significant distributed systems features like availability and scalability of master nodes. This allowed us to evaluate the difficulty of extending BOOM Analytics with complex features not found in the original codebase.

[1] The BOOM Analytics software described in this paper can be found at http://db.cs.berkeley.edu/eurosys-2010.

We implemented BOOM Analytics using the Overlog logic language, originally developed for Declarative Networking [24]. Overlog has been used with some success to prototype distributed system protocols, notably in a simple prototype of Paxos [34], a set of Byzantine Fault Tolerance variants [32], a suite of distributed file system consistency protocols [6], and a distributed hash table routing protocol implemented by our own group [24]. On the other hand, Overlog had not previously been used to implement a full-featured distributed system on the scale of Hadoop and HDFS. One goal of our work on BOOM Analytics was to evaluate the strengths and weaknesses of Overlog for system programming in the large, to inform the design of a new declarative framework for distributed programming.

1.2 Contributions

This paper describes our experience implementing and evolving BOOM Analytics, and running it on Amazon EC2. We document the effort required to develop BOOM Analytics in Overlog, and the way we were able to introduce significant extensions, including Paxos-supported replicated-master availability, and multi-master state-partitioned scalability. We describe the debugging tasks that arose when programming at this level of abstraction, and our tactics for metaprogramming Overlog to instrument our distributed system at runtime.

While the outcome of any software experience is bound in part to the specific programming tools used, there are hopefully more general lessons that can be extracted. To that end, we try to separate out (and in some cases critique) the specifics of Overlog as a declarative language, and the more general lessons of high-level data-centric programming. The more general data-centric aspect of the work is both positive and language-independent: many of the benefits we describe arise from exposing as much system state as possible via collection data types, and proceeding from that basis to write simple code to manage those collections.

As we describe each module of BOOM Analytics, we report the person-hours we spent implementing it and the size of our implementation in lines of code (comparing against the relevant feature of Hadoop, if appropriate). These are noisy metrics, so we are most interested in numbers that transcend the noise terms: for example, order-of-magnitude reductions in code size. We also validate that the performance of BOOM Analytics is competitive with the original Hadoop codebase.

We present the evolution of BOOM Analytics from a straightforward reimplementation of HDFS and Hadoop to a significantly enhanced system. We describe how our initial BOOM-FS prototype went through a series of major revisions ("revs") focused on availability (Section 4), scalability (Section 5), and debugging and monitoring (Section 6). We then detail how we designed BOOM-MR by replacing Hadoop's task scheduling logic with a declarative scheduling framework (Section 7). In each case, we discuss how the
Figure 2. An Overlog timestep at a participating node: incoming events are applied to local state, the local Datalog program is run to fixpoint, and outgoing events are emitted.

When Overlog tuples arrive at a node either through rule evaluation or external events, they are handled in an atomic local Datalog "timestep." Within a timestep, each node sees only locally-stored tuples. Communication between Datalog and the rest of the system (Java code, networks, and clocks) is modeled using events corresponding to insertions or deletions of tuples in Datalog tables.

Each timestep consists of three phases, as shown in Figure 2. In the first phase, inbound events are converted into tuple insertions and deletions on the local table partitions. The second phase interprets the local rules and tuples according to traditional Datalog semantics, executing the rules to a "fixpoint" in a traditional bottom-up fashion [36], recursively evaluating the rules until no new results are generated. In the third phase, updates to local state are atomically made durable, and outbound events (network messages, Java callback invocations) are emitted. Note that while Datalog is defined over a static database, the first and third phases allow Overlog programs to mutate state over time.

2.1 JOL

The original Overlog implementation (P2) is aging and targeted at network protocols, so we developed a new Java-based Overlog runtime we call JOL. Like P2, JOL compiles Overlog programs into pipelined dataflow graphs of operators (similar to "elements" in the Click modular router [19]). JOL provides metaprogramming support akin to P2's Evita Raced extension [10]: each Overlog program is compiled into a representation that is captured in rows of tables. Program testing, optimization and rewriting can be written concisely as metaprograms in Overlog that manipulate those tables.

Because the Hadoop stack is implemented in Java, we anticipated the need for tight integration between Overlog and Java code. Hence, JOL supports Java-based extensibility in the model of Postgres [33]. It supports Java classes as abstract data types, allowing Java objects to be stored in fields of tuples, and Java methods to be invoked on those fields from Overlog. JOL also allows Java-based aggregation functions to run on sets of column values, and supports Java table functions: Java iterators producing tuples, which can be referenced in Overlog rules as ordinary relations. We made significant use of each of these features in BOOM Analytics.
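The three-phase timestep described above can be sketched in a few lines of ordinary Python. This is an illustrative toy, not JOL's actual API: the table names, the event encoding, and the single "ack" rule are all invented for the example.

```python
# Minimal sketch of the three-phase Overlog timestep (illustrative only).

def run_timestep(tables, inbound_events, rules):
    # Phase 1: apply inbound events as tuple insertions/deletions.
    for table, tup, is_insert in inbound_events:
        (tables[table].add if is_insert else tables[table].discard)(tup)

    # Phase 2: run rules to fixpoint, bottom-up: keep applying rules
    # until no rule derives a new tuple.
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for table, tup in rule(tables):
                if tup not in tables[table]:
                    tables[table].add(tup)
                    changed = True

    # Phase 3: derived "outbound" tuples become emitted events; the
    # outbound table is a scratch table, emptied every timestep.
    outbound = sorted(tables["outbound"])
    tables["outbound"].clear()
    return outbound

# Toy rule: every heartbeat tuple derives an acknowledgement event.
def ack_rule(tables):
    return [("outbound", ("ack", node)) for (node,) in tables["heartbeat"]]

tables = {"heartbeat": set(), "outbound": set()}
events = [("heartbeat", ("datanode1",), True)]
print(run_timestep(tables, events, [ack_rule]))  # [('ack', 'datanode1')]
```

The point of the sketch is the shape of the loop: events in, rules to fixpoint, events out, with all state living in the tables dictionary.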
3. HDFS Rewrite

Our first effort in developing BOOM Analytics was BOOM-FS, a clean-slate rewrite of HDFS in Overlog. HDFS is loosely based on GFS [14], and is targeted at storing large files for full-scan workloads. In HDFS, file system metadata is stored at a centralized NameNode, but file data is partitioned into chunks and distributed across a set of DataNodes. By default, each chunk is 64MB and is replicated at three DataNodes to provide fault tolerance. DataNodes periodically send heartbeat messages to the NameNode containing the set of chunks stored at the DataNode. The NameNode caches this information. If the NameNode has not seen a heartbeat from a DataNode for a certain period of time, it assumes that the DataNode has crashed and deletes it from the cache; it will also create additional copies of the chunks stored at the crashed DataNode to ensure fault tolerance. Clients only contact the NameNode to perform metadata operations, such as obtaining the list of chunks in a file; all data operations involve only clients and DataNodes. HDFS only supports file read and append operations; chunks cannot be modified once they have been written.

Like GFS, HDFS maintains a clean separation of control and data protocols: metadata operations, chunk placement and DataNode liveness are decoupled from the code that performs bulk data transfers. Following this lead, we implemented the simple high-bandwidth data path "by hand" in Java, concentrating our Overlog code on the trickier control-path logic. This allowed us to use a prototype version of JOL that focused on functionality more than performance. As we document in Section 8, this was sufficient to allow BOOM-FS to keep pace with HDFS in typical MapReduce workloads.

3.1 File System State

The first step of our rewrite was to represent file system metadata as a collection of relations (Table 1). We then implemented file system operations by writing queries over this schema. The file relation contains a row for each file or directory stored in BOOM-FS. The set of chunks in a file is identified by the corresponding rows in the fchunk relation.[2]
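The NameNode's heartbeat bookkeeping described above can be sketched as operations over two small relations. This is a hedged illustration, not BOOM-FS code: the timeout value, replication factor, and function names are made up, and real HDFS tracks much more per node.

```python
# Illustrative sketch of heartbeat-driven liveness tracking and
# re-replication detection; relation names loosely follow the paper.

HEARTBEAT_TIMEOUT = 30.0   # seconds (hypothetical value)
REPLICATION_FACTOR = 3

def expire_datanodes(datanode, hb_chunk, now):
    """Drop DataNodes whose last heartbeat is too old, plus their chunk rows."""
    dead = {n for n, last in datanode.items() if now - last > HEARTBEAT_TIMEOUT}
    for n in dead:
        del datanode[n]
    hb_chunk[:] = [(n, c) for (n, c) in hb_chunk if n not in dead]
    return dead

def under_replicated(hb_chunk):
    """Chunks with fewer than REPLICATION_FACTOR live replicas."""
    counts = {}
    for _node, chunk in hb_chunk:
        counts[chunk] = counts.get(chunk, 0) + 1
    return {c for c, k in counts.items() if k < REPLICATION_FACTOR}

datanode = {"dn1": 100.0, "dn2": 10.0}     # nodeAddr -> lastHeartbeatTime
hb_chunk = [("dn1", "c1"), ("dn2", "c1")]  # (nodeAddr, chunkid)
expire_datanodes(datanode, hb_chunk, now=50.0)  # dn2 has timed out
print(under_replicated(hb_chunk))               # {'c1'}
```

Chunks flagged by `under_replicated` would trigger the NameNode's "create additional copies" behavior described above.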
The datanode and hb_chunk relations contain the set of live DataNodes and the chunks stored by each DataNode, respectively. The NameNode updates these relations as new heartbeats arrive; if the NameNode does not receive a heartbeat from a DataNode within a configurable amount of time, it assumes that the DataNode has failed and removes the corresponding rows from these tables.

[2] The order of a file's chunks must also be specified, because relations are unordered. Currently, we assign chunk IDs in a monotonically increasing fashion and only support append operations, so clients can determine a file's chunk order by sorting chunk IDs.

Name      Description                Relevant attributes
file      Files                      fileid, parentfileid, name, isDir
fqpath    Fully-qualified pathnames  path, fileid
fchunk    Chunks per file            chunkid, fileid
datanode  DataNode heartbeats        nodeAddr, lastHeartbeatTime
hb_chunk  Chunk heartbeats           nodeAddr, chunkid, length

Table 1. BOOM-FS relations defining file system metadata. The underlined attributes together make up the primary key of each relation.

The NameNode must ensure that file system metadata is durable and restored to a consistent state after a failure. This was easy to implement using Overlog; each Overlog fixpoint brings the system from one consistent state to another. We used the Stasis storage library [30] to write durable state changes to disk as an atomic transaction at the end of each fixpoint. Like P2, JOL allows durability to be specified on a per-table basis. So the relations in Table 1 were marked durable, whereas "scratch tables" that are used to compute responses to file system requests were transient, emptied at the end of each fixpoint.

Since a file system is naturally hierarchical, the "queries" needed to traverse it are recursive. While recursion in SQL is considered somewhat esoteric, it is a common pattern in Datalog and hence Overlog. For example, an attribute of the file table describes the parent-child relationship of files; by computing the transitive closure of this relation, we can infer the fully-qualified pathname of each file (fqpath). The two Overlog rules that derive fqpath from file are listed in Figure 3. Note that when a file representing a directory is removed, all fqpath tuples that describe child paths of that directory are automatically removed (because they can no longer be derived from the updated contents of file).

    // fqpath: Fully-qualified paths.
    // Base case: root directory has null parent
    fqpath(Path, FileId) :-
        file(FileId, FParentId, _, true),
        FParentId = null,
        Path = "/";

    fqpath(Path, FileId) :-
        file(FileId, FParentId, FName, _),
        fqpath(ParentPath, FParentId),
        // Do not add extra slash if parent is root dir
        PathSep = (ParentPath = "/" ? "" : "/"),
        Path = ParentPath + PathSep + FName;

Figure 3. Example Overlog for deriving fully-qualified pathnames from the base file system metadata in BOOM-FS.

Because path information is accessed frequently, we configured the fqpath relation to be cached after it is computed. Overlog will automatically update fqpath when file is changed, using standard relational view maintenance logic [36]. BOOM-FS defines several other views to compute derived file system metadata, such as the total size of each file and the contents of each directory. The materialization of each view can be changed via simple Overlog table definition statements without altering the semantics of the program. During the development process, we regularly adjusted view materialization to trade off read performance against write performance and storage requirements.

At each DataNode, chunks are stored as regular files on the file system. In addition, each DataNode maintains a relation describing the chunks stored at that node. This relation is populated by periodically invoking a table function defined in Java that walks the appropriate directory of the DataNode's local file system.

3.2 Communication Protocols

Both HDFS and BOOM-FS use three different protocols: the metadata protocol that clients and NameNodes use to exchange file metadata, the heartbeat protocol that DataNodes use to notify the NameNode about chunk locations and DataNode liveness, and the data protocol that clients and DataNodes use to exchange chunks.
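The two Figure 3 rules can be mirrored in ordinary Python by iterating to fixpoint over the file relation, which makes the bottom-up evaluation concrete. The tuple layout and function name below are invented for the example; Overlog's actual evaluation is incremental, not this naive loop.

```python
# Python rendering of the Figure 3 fqpath rules: derive fully-qualified
# paths from the file relation by naive bottom-up iteration to fixpoint.

def derive_fqpath(file_rel):
    # file_rel: set of (fileid, parentfileid, name, isDir)
    fqpath = {}  # fileid -> path
    # Base case: the root directory has a null parent.
    for fid, parent, _name, is_dir in file_rel:
        if parent is None and is_dir:
            fqpath[fid] = "/"
    changed = True
    while changed:  # recursive case: repeat until no new paths derive
        changed = False
        for fid, parent, name, _is_dir in file_rel:
            if fid not in fqpath and parent in fqpath:
                pp = fqpath[parent]
                sep = "" if pp == "/" else "/"  # no extra slash under root
                fqpath[fid] = pp + sep + name
                changed = True
    return fqpath

files = {(0, None, "", True), (1, 0, "tmp", True), (2, 1, "out.txt", False)}
print(derive_fqpath(files))  # {0: '/', 1: '/tmp', 2: '/tmp/out.txt'}
```

The automatic-removal property noted above falls out of the same structure: rerunning the derivation after deleting a directory's file tuple simply never produces the child paths again.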
We implemented the metadata and heartbeat protocols with a set of distributed Overlog rules. The data protocol was implemented in Java because it is simple and performance critical. We proceed to describe the three protocols in order.

For each command in the metadata protocol, there is a single rule at the client (stating that a new request tuple should be "stored" at the NameNode). There are typically two corresponding rules at the NameNode: one to specify the result tuple that should be stored at the client, and another to handle errors by returning a failure message. Requests that modify metadata follow the same basic structure, except that in addition to deducing a new result tuple at the client, the NameNode rules also deduce changes to the file system metadata relations. Concurrent requests to the NameNode are handled in a serial fashion by JOL. While this simple approach has been sufficient for our experiments, we plan to explore more sophisticated concurrency control techniques in the future.

The heartbeat protocol follows a similar request/response pattern, but it is not driven by the arrival of network events. In order to trigger such events in a data-centric language, Overlog offers a periodic relation [24] that can be configured to produce new tuples at every tick of a wall-clock timer. DataNodes use the periodic relation to send heartbeat messages to NameNodes.

The NameNode can also send control messages to DataNodes. This occurs when a file system invariant is unmet and the NameNode requires the cooperation of the DataNode to restore the invariant. For example, the NameNode records the number of replicas of each chunk (as reported by heartbeat messages). If the number of replicas of a chunk drops below the configured replication factor (e.g., due to a DataNode failure), the NameNode sends a message to a DataNode that stores the chunk, asking it to send a copy of the chunk to another DataNode.

Finally, the data protocol is a straightforward mechanism for transferring the contents of a chunk between clients and DataNodes. This protocol is orchestrated by Overlog rules but implemented in Java. When an Overlog rule deduces

Next, we needed to convert basic Paxos into a working primitive for a distributed log. This required adding the ability to efficiently pass a series of log entries ("Multi-Paxos"), a liveness module, and a catchup algorithm. While the first was for the most part a simple schema change, the latter two caused our implementation to swell to 50 rules in roughly 400 lines of code. Echoing the experience of Chandra et al. [9], these enhancements made our code considerably more difficult to check for correctness. The code also lost some of its pristine declarative character; we return to this point in Section 9.

4.2 BOOM-FS Integration

Once we had Paxos in place, it was straightforward to support the replication of file system metadata. All state-altering actions are represented in the revised BOOM-FS as Paxos decrees, which are passed into the Paxos logic via a single Overlog rule that intercepts tentative actions and places them into a table that is joined with Paxos rules. Each action is considered complete at a given site when it is "read back" from the Paxos log (i.e., when it becomes visible in a join with a table representing the local copy of that log). A sequence number field in the Paxos log table captures the globally-accepted order of actions on all replicas.

We validated the performance of our implementation experimentally. In the absence of failure, replication has negligible performance impact, but when the primary NameNode fails, a backup NameNode takes over reasonably quickly. We present performance results in the technical report [2].

4.3 Discussion

Our Paxos implementation constituted roughly 400 lines of code and required six person-weeks of development time. Adding Paxos support to BOOM-FS took two person-days and required making mechanical changes to ten BOOM-FS rules (as described in Section 4.2). We suspect that the rule modifications required to add Paxos support could be performed as an automatic rewrite.

Lamport's original paper describes Paxos as a set of logical invariants. This specification naturally lent itself to a data-centric design in which "ballots," "ledgers," internal counters and vote-counting logic are represented uniformly as tables. However, as we note in a workshop paper [4], the principal benefit of our approach came directly from our use of a rule-based declarative language to encode Lamport's invariants.
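The "read back" pattern described in Section 4.2 can be sketched procedurally: a tentative action only takes effect once it appears, at the next expected sequence number, in the local copy of the Paxos log. Everything here is illustrative; the consensus machinery itself is faked by directly delivering decrees, and the class and field names are invented.

```python
# Hedged sketch of applying Paxos decrees in globally-agreed sequence
# order; the Paxos protocol itself is stubbed out.

class Replica:
    def __init__(self):
        self.next_seq = 0   # next log position to apply
        self.log = {}       # seq -> action (local copy of the Paxos log)
        self.state = {}     # toy stand-in for file system metadata

    def deliver(self, seq, action):
        """A decree learned via Paxos lands in the local log copy."""
        self.log[seq] = action
        self.apply_ready()

    def apply_ready(self):
        # Apply entries in sequence order, never skipping a gap: an
        # action is "complete" only once read back from the log.
        while self.next_seq in self.log:
            op, path, value = self.log[self.next_seq]
            if op == "create":
                self.state[path] = value
            elif op == "delete":
                self.state.pop(path, None)
            self.next_seq += 1

r = Replica()
r.deliver(1, ("create", "/b", "y"))  # out of order: buffered, not applied
r.deliver(0, ("create", "/a", "x"))  # fills the gap; both now apply
print(r.state)  # {'/a': 'x', '/b': 'y'}
```

Because every replica applies the same log in the same order, their metadata states converge, which is what the sequence number field in the Paxos log table guarantees.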
We found that we were able to capture the design patterns frequently encountered in consensus protocols (e.g., multicast, voting) via the composition of language constructs like aggregation, selection and join. In our initial implementation of basic Paxos, we found that each rule covered a large portion of the state space, avoiding the case-by-case transitions that would need to be specified in a state machine-based implementation.

However, choosing an invariant-based approach made it harder to adopt optimizations from the literature as the code evolved, in part because these optimizations were often described using state machines. We had to choose between translating the optimizations "up" to a higher level while preserving their intent, or directly "encoding" the state machine into logic, resulting in a lower-level implementation. In the end, we adopted both approaches, giving sections of the code a hybrid feel.

5. The Scalability Rev

HDFS NameNodes manage large amounts of file system metadata, which are kept in memory to ensure good performance. The original GFS paper acknowledged that this could cause significant memory pressure [14], and NameNode scaling is often an issue in practice at Yahoo!.

Given the data-centric nature of BOOM-FS, we hoped to simply scale out the NameNode across multiple NameNode-partitions. Having exposed the system state in tables, this was straightforward: it involved adding a "partition" column to various tables to split them across nodes in a simple way. Once this was done, the code to query those partitions, regardless of the language in which it is written, composes cleanly with our availability implementation: each NameNode-partition can be deployed either as a single node or a Paxos group.

There are many options for partitioning the files in a directory tree. We opted for a simple strategy based on the hash of the fully-qualified pathname of each file. We also modified the client library to broadcast requests for directory listings and directory creation to every NameNode-partition. Although the resulting directory creation implementation is not atomic, it is idempotent; recreating a partially-created directory will restore the system to a consistent state, and will preserve any files in the partially-created directory. For all other BOOM-FS operations, clients have enough local information to determine the correct NameNode-partition.
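The routing strategy just described can be sketched in a few lines: hash the fully-qualified pathname to pick a partition, and broadcast the handful of directory operations that need every partition. The partition count, the choice of MD5, and the operation names are illustrative assumptions, not details from BOOM-FS.

```python
# Sketch of client-side routing across NameNode-partitions (illustrative).
import hashlib

def partition_of(path, n_partitions):
    # Stable hash of the fully-qualified pathname picks one partition.
    digest = hashlib.md5(path.encode()).hexdigest()
    return int(digest, 16) % n_partitions

def route(op, path, n_partitions):
    if op in ("ls", "mkdir"):  # directory listing/creation: broadcast
        return list(range(n_partitions))
    # All other operations go to exactly one partition.
    return [partition_of(path, n_partitions)]

print(route("ls", "/tmp", 4))               # [0, 1, 2, 3]
print(len(route("create", "/tmp/out", 4)))  # 1
```

Broadcast `mkdir` is what makes directory creation idempotent rather than atomic: re-issuing it to every partition converges on a consistent state.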
We did not attempt to support atomic "rename" across partitions. This would involve the atomic transfer of state between independent Paxos groups. We believe this would be relatively straightforward to implement (we have previously built a two-phase commit protocol in Overlog [4]) but we decided not to pursue this feature at present.

5.1 Discussion

By isolating the file system state into relations, it became a textbook exercise to partition that state across nodes. It took eight hours of developer time to implement NameNode partitioning; two of these hours were spent adding partitioning and broadcast support to the BOOM-FS client library. This was a clear win for the data-centric approach, independent of any declarative features of Overlog. Before attempting this work, we were unsure whether partitioning for scale-out would compose naturally with state replication for fault tolerance. Because scale-out in BOOM-FS amounted to little more than partitioning data collections, we found it quite easy to convince ourselves that

trace summary to the end user, which constituted 280 lines of Java. Because JOL already provided the metaprogramming features we needed, it took less than one developer day to implement these rewrites.

Capturing parser state in tables had several benefits. Because the program code itself is represented as data, introspection is a query over the metadata catalog, while automatic program rewrites are updates to the catalog tables. Setting up traces to report upon distributed executions was a simple matter of writing rules that query existing rules and insert new ones. Using a declarative, rule-based language allowed us to express assertions in a "cross-cutting" fashion. A watchdog rule describes a query over system state that must never hold: such a rule is both a specification of an invariant and a check that enforces it. The assertion need not be closely coupled with the rules that modify the relevant state; instead, assertion rules may be written as an independent collection of concerns.

7. MapReduce Port

In contrast to our clean-slate strategy for developing BOOM-FS, we built BOOM-MR, our MapReduce implementation, by replacing Hadoop's core scheduling logic with Overlog. Our goal in building BOOM-MR was to explore embedding a data-centric rewrite of a non-trivial component into an existing procedural system. MapReduce scheduling policies are one issue that has been treated in recent literature (e.g., [40, 41]). To enable credible work on MapReduce scheduling, we wanted to remain true to the basic structure of the Hadoop MapReduce codebase, so we proceeded by understanding that code, mapping its core state into a relational representation, and then writing Overlog rules to manage that state in the face of new messages delivered by the existing Java APIs. We follow that structure in our discussion.

7.1 Background: Hadoop MapReduce

In Hadoop MapReduce, there is a single master node called the JobTracker which manages a number of worker nodes called TaskTrackers. A job is divided into a set of map and reduce tasks. The JobTracker assigns tasks to worker nodes. Each map task reads an input chunk from the distributed file system, runs a user-defined map function, and partitions output key/value pairs into hash buckets on the local disk. Reduce tasks are created for each hash bucket. Each reduce task fetches the corresponding hash buckets from all mappers, sorts locally by key, runs a user-defined reduce function and writes the results to the distributed file system.

Each TaskTracker has a fixed number of slots for executing tasks (two maps and two reduces by default). A heartbeat protocol between each TaskTracker and the JobTracker is used to update the JobTracker's bookkeeping of the state of running tasks, and drive the scheduling of new tasks: if the JobTracker identifies free TaskTracker slots, it will schedule further tasks on the TaskTracker. Also, Hadoop will attempt to schedule speculative tasks to reduce a job's response time if it detects "straggler" nodes [11].

Name         Description       Relevant attributes
job          Job definitions   jobid, priority, submit_time, status, jobConf
task         Task definitions  jobid, taskid, type, partition, status
taskAttempt  Task attempts     jobid, taskid, attemptid, progress, state,
                               phase, tracker, input_loc, start, finish
taskTracker  TaskTracker       name, hostname, state, map_count,
             definitions       reduce_count, max_map, max_reduce

Table 3. BOOM-MR relations defining JobTracker state.
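The slot bookkeeping in Section 7.1 amounts to a join between unassigned tasks and trackers with free slots. The toy below makes that concrete in Python under FCFS ordering; field names loosely follow Table 3, but the function, its dictionary encoding, and the two-slot default are illustrative, not the actual BOOM-MR rules.

```python
# Toy relational version of FCFS task assignment (illustrative only).

def fcfs_schedule(tasks, task_attempts, task_trackers, max_slots=2):
    """Assign pending tasks to trackers with free slots, in FCFS order."""
    used = {}  # tracker name -> attempts currently occupying a slot
    for att in task_attempts:
        used[att["tracker"]] = used.get(att["tracker"], 0) + 1
    assigned = {att["taskid"] for att in task_attempts}
    new_attempts = []
    for task in tasks:                # FCFS: submission order
        if task["taskid"] in assigned:
            continue
        for tracker in task_trackers:  # "join" against free slots
            if used.get(tracker, 0) < max_slots:
                new_attempts.append({"taskid": task["taskid"],
                                     "tracker": tracker})
                used[tracker] = used.get(tracker, 0) + 1
                break
    return new_attempts

tasks = [{"taskid": t} for t in ("m0", "m1", "m2")]
print(fcfs_schedule(tasks, [], ["tt1"], max_slots=2))
# m0 and m1 land on tt1; m2 waits for a free slot
```

A different policy (e.g., LATE's straggler speculation) would change only the selection logic in the inner loop, which is the extensibility point the relational encoding is meant to expose.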
7.2 MapReduce Scheduling in Overlog

Our initial goal was to port the JobTracker code to Overlog. We began by identifying the key state maintained by the JobTracker. This state includes both data structures to track the ongoing status of the system and transient state in the form of messages sent and received by the JobTracker. We captured this information in four Overlog tables, shown in Table 3.

The job relation contains a single row for each job submitted to the JobTracker. In addition to some basic metadata, each job tuple contains an attribute called jobConf that holds a Java object constructed by legacy Hadoop code, which captures the configuration of the job. The task relation identifies each task within a job. The attributes of this relation identify the task type (map or reduce), the input "partition" (a chunk for map tasks, a bucket for reduce tasks), and the current running status.

A task may be attempted more than once, due to speculation or if the initial execution attempt failed. The taskAttempt relation maintains the state of each such attempt. In addition to a progress percentage and a state (running/completed), reduce tasks can be in any of three phases: copy, sort, or reduce. The tracker attribute identifies the TaskTracker that is assigned to execute the task attempt. Map tasks also need to record the location of their input data, which is given by input_loc. The taskTracker relation identifies each TaskTracker in the cluster with a unique name.

Overlog rules are used to update the JobTracker's tables by converting inbound messages into job, taskAttempt and taskTracker tuples. These rules are mostly straightforward. Scheduling decisions are encoded in the taskAttempt table, which assigns tasks to TaskTrackers. A scheduling policy is simply a set of rules that join against the taskTracker relation to find TaskTrackers with unassigned slots, and schedules tasks by inserting tuples into taskAttempt. This architecture makes it easy for new scheduling policies to be defined.

7.3 Evaluation

To validate the extensible scheduling architecture described in Section 7.2, we implemented both Hadoop's default First-Come-First-Serve (FCFS) policy and the LATE policy proposed by Zaharia et al. [41]. Our goals were both to evaluate

Figure 5. CDFs representing the elapsed time between job startup and task completion for both map and reduce tasks, for all combinations of Hadoop and BOOM-MR over HDFS and BOOM-FS. In each graph, the horizontal axis is elapsed time in seconds, and the vertical represents the percentage of tasks completed.

and the compactness of BOOM-MR reflects that simplicity appropriately.

8. Performance Validation

While improved performance was not a goal of our work, we wanted to ensure that the performance of BOOM Analytics was competitive with Hadoop. We compared BOOM Analytics with Hadoop 18.1, using the 101-node EC2 cluster described in Section 7.3. The workload was a wordcount job on a 30 GB file, using 481 map tasks and 100 reduce tasks.

Figure 5 contains four graphs comparing the performance of different combinations of Hadoop MapReduce, HDFS, BOOM-MR, and BOOM-FS. Each graph reports a cumulative distribution of the elapsed time in seconds from job startup to map or reduce task completion. The map tasks complete in three distinct "waves." This is because only 2 x 100 map tasks can be scheduled at once. Although all 100 reduce tasks can be scheduled immediately, no reduce task can finish until all maps have been completed because each reduce task requires the output of all map tasks.

The lower-left graph describes the performance of Hadoop running on top of HDFS, and hence serves as a baseline for the subsequent graphs. The upper-left graph details BOOM-MR running over HDFS. This graph shows that map and reduce task durations under BOOM-MR are nearly identical to Hadoop 18.1. The lower-right and upper-right graphs detail the performance of Hadoop MapReduce and BOOM-MR running on top of BOOM-FS, respectively. BOOM-FS performance is slightly slower than HDFS, but remains competitive.

9. Experience and Lessons

Our overall experience with BOOM Analytics has been quite positive. Building the system required only nine months of part-time work by four developers, including the time required to go well beyond the feature set of HDFS. Clearly, our experience is not universal: system infrastructure is only one class of distributed software, and analytics for "Big Data" is even more specialized. However, we feel that our experience sheds light on common patterns that occur in many distributed systems, including the coordination of multiple nodes toward common goals, replicated state for high availability, state partitioning for scalability, and monitoring and invariant checking to improve manageability.

Anecdotally, we feel that much of our productivity came from using a data-centric design philosophy, which exposed the simplicity of tasks we undertook. When you examine the tables we defined to implement HDFS and MapReduce, it seems natural that the system implementation on top should be fairly simple, regardless of the language used.
Overlog imposed this data-centric discipline throughout the development process: no private state was registered "on the side" to achieve specific tasks. Beyond this discipline, we also benefited from a few key features of a declarative language, including built-in queries with support for recursion, flexible view materialization and maintenance, and high-level metaprogramming opportunities afforded by Overlog and implemented in JOL.

9.1 Everything Is Data

In BOOM Analytics, everything is data, represented as objects in collections. This includes traditional persistent information like file system metadata, runtime state like TaskTracker status, summary statistics like those used by the JobTracker's scheduling policy, in-flight messages, system events, execution state of the system, and even parsed code. The benefits of this approach are perhaps best illustrated by the simplicity with which we scaled out the NameNode via partitioning (Section 5): by having the relevant state stored as data, we were able to use standard data partitioning to achieve what would ordinarily be a significant rearchitecting of the system. Similarly, the ease with which we implemented system monitoring (via both system introspection tables and rule rewriting) arose because we could easily write rules that manipulated concepts as diverse as transient system state and program semantics, all stored in a unified database representation (Section 6).

The uniformity of data-centric interfaces also allows interposition [18] of components in a natural manner: the dataflow "pipe" between two system modules can be easily rerouted to go through a third module. This enabled the simplicity of incorporating our Overlog LATE scheduler into BOOM-MR (Section 7.2). Because dataflows can be routed across the network (via the location specifier in a rule's head), interposition can also involve distributed logic; this is how we easily added Paxos support to the BOOM-FS NameNode (Section 4). Our experience suggests that a form of encapsulation could be achieved by constraining the points in the dataflow at which interposition is allowed to occur.

In all, relatively little of this discussion seems specific to logic programming per se. Many modern languages and frameworks now support convenient high-level programming over collection types (e.g., list comprehensions in Python, query expressions in the LINQ framework [27], and algebraic dataflows like MapReduce). Programmers using those frameworks can enjoy many of the benefits we identified in our work by adhering carefully to data-centric "design patterns" when building distributed systems. However, this requires discipline in a programming language like Python, where data-centric design is possible but not necessarily idiomatic.

9.2 Developing in Overlog

Some of the benefits we describe above can be attributed to data-centric design, while others relate to the high-level declarative nature of Overlog. However, Overlog is by no means a perfect language for distributed programming, and it caused us various frustrations. Many of the bugs we encountered were due to ambiguities in the language semantics, particularly with regard to state update and aggregate functions. This is partly due to Overlog's heritage: traditional Datalog does not support updates, and Overlog's support for updates has never had a formally-specified semantics. We have recently been working to address this with new theoretical foundations [3].

In retrospect, we made very conservative use of Overlog's support for distributed queries: we used the @ syntax as a convenient shorthand for unidirectional messaging, but we did not utilize arbitrary distributed queries (i.e., rules with two or more distinct location specifiers in their body terms). In particular, we were unsure of the semantics of such queries in the event of node failures and network partitions. Instead, we implemented protocols for distributed messaging explicitly, depending on the requirements at hand (e.g., Paxos consensus, communication between NameNode and DataNodes). As we observed in Section 3.3, this made the enforcement of distributed invariants somewhat "lower-level" than our specifications of local-node invariants. By examining the coding patterns found in our hand-written messaging protocols, we hope to develop new higher-level abstractions for specifying and enforcing distributed invariants.

During our Paxos implementation, we needed to translate state machine descriptions into logic (Section 4). In fairness, the porting task was not actually very hard: in most cases it amounted to writing message-handling rules in Overlog that had a familiar structure. But upon deeper reflection, our port was shallow and syntactic; the resulting Overlog does not "feel" like logic, in the invariant style of Lamport's original Paxos specification. Now that we have achieved a functional Paxos implementation, we hope to revisit this topic with an eye toward rethinking the intent of the state-machine optimizations. This would not only fit the spirit of Overlog better, but perhaps contribute to a deeper understanding of the ideas involved.

Finally, Overlog's syntax allows programs to be written concisely, but it can be difficult and time-consuming to read. Some of the blame can be attributed to Datalog's specification of joins via repetition of variable names. We are experimenting with an alternative syntax based on SQL's named-field approach.

9.3 Performance

JOL performance was good enough for BOOM Analytics to match Hadoop performance, but we are conscious that it has room to improve. We observed that system load averages were much lower with Hadoop than with BOOM Analytics.

[15] H. S. Gunawi et al. SQCK: A Declarative File System Checker. In OSDI, 2008.
[16] A. Gupta et al. Constraint checking with partial information. In PODS, 1994.
[17] M. Isard et al. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.
[18] M. B. Jones. Interposition agents: transparently interposing user code at the system interface. In SOSP, 1993.
[19] E. Kohler et al. The Click modular router. ACM Transactions on Computer Systems, 18(3):263–297, August 2000.
[20] M. S. Lam et al. Context-sensitive program analysis as database queries. In PODS, 2005.
[21] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, 1998.
[22] LATE Hadoop Jira. Hadoop jira issue tracker, July 2009. http://issues.apache.org/jira/browse/HADOOP.
[23] B. T. Loo et al. Declarative networking: language, execution and optimization. In SIGMOD, 2006.
[24] B. T. Loo et al. Implementing declarative overlays. In SOSP, 2005.
[25] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1997.
[26] W. R. Marczak et al. Declarative reconfigurable trust management. In CIDR, 2009.
[27] F. Marguerie et al. LINQ in Action. Manning Publications Co., 2008.
[28] Nokia Corporation. disco: massive data – minimal code, 2009.
http://discoproject.org/.[29]T.Schuttetal.Scalaris:ReliabletransactionalP2Pkey/valuestore.InACMSIGPLANWorkshoponErlang,2008.[30]R.SearsandE.Brewer.Stasis:exibletransactionalstorage.InOSDI,2006.[31]A.Singhetal.Usingqueriesfordistributedmonitoringandforensics.InEuroSys,2006.[32]A.Singhetal.BFTprotocolsunderre.InNSDI,2008.[33]M.Stonebraker.Inclusionofnewtypesinrelationaldatabasesystems.InICDE,1986.[34]B.SzekelyandE.Torres,Dec.2005.http://www.klinewoods.com/papers/p2paxos.pdf.[35]A.Thusooetal.Hive-awarehousingsolutionoveraMap-Reduceframework.InVLDB,2009.[36]J.D.Ullman.PrinciplesofDatabaseandKnowledge-BaseSystems:VolumeII:TheNewTechnologies.W.H.Freeman&Company,1990.[37]W.Whiteetal.Scalinggamestoepicproportions.InSIGMOD,2007.[38]F.Yangetal.Hilda:Ahigh-levellanguagefordata-drivenwebapplications.InICDE,2006.[39]Y.Yuetal.DryadLINQ:Asystemforgeneral-purposedistributeddata-parallelcomputingusingahigh-levellanguage.InOSDI,2008.[40]M.Zahariaetal.Delayscheduling:Asimpletechniqueforachievinglocalityandfairnessinclusterscheduling.InEuroSys,2010.[41]M.Zahariaetal.ImprovingMapReduceperformanceinheterogeneousenvironments.InOSDI,2008.