Hellerstein, David Maier. UC Berkeley: {palvaro, nrc, hellerstein}@cs.berkeley.edu. Portland State University: maier@cs.pdx.edu.

Abstract: Distributed consistency is perhaps the most discussed topic in distributed systems today. Coordina
[...] efficient and manageable protocol of asynchronous point-to-point communication between producers and consumers, called sealing, that indicates when partitions of a stream have stopped changing. These partitions are identified and "chased" through a dataflow via techniques from functional dependency analysis, another surprising application of database theory to distributed consistency.

The BLAZES architecture is depicted in Figure 1. BLAZES can be directly applied to existing programming platforms based on distributed stream or dataflow processing, including Twitter Storm [21], Apache S4 [23], and Spark Streaming [24]. Programmers of stream processing engines interact with BLAZES in a "grey box" manner: they provide simple semantic annotations to the black-box components in their dataflows, and BLAZES performs the analysis of all dataflow paths through the program. BLAZES can also take advantage of the richer analyzability of declarative languages like Bloom. Bloom programmers are freed from the need to supply annotations, since Bloom's language semantics allow BLAZES to infer component properties automatically.

We make the following contributions in this paper:

Consistency Anomalies and Properties. We present a spectrum of consistency anomalies that arise in distributed dataflows. We identify key properties of both streams and components that affect consistency.

Composition of Properties. We show how to analyze the composition of consistency properties in complex programs via a term-rewriting technique over dataflow paths, which translates local component properties into end-to-end stream properties.

Custom Coordination Code. We distinguish two alternative coordination strategies, ordering and sealing, and show how we can automatically generate application-aware coordination code that uses the cheaper sealing technique in many cases.

We conclude by evaluating the performance benefits offered by using BLAZES as an alternative to generic, order-based coordination mechanisms available in both Storm and Bloom.

B. Running Examples

We consider two running examples: a streaming analytic query implemented using the Storm stream processing system and a distributed ad-tracking network implemented using the Bloom distributed programming language.

Streaming analytics with Storm: Figure 2 shows the architecture of a Storm topology that computes a continuous word count over the Twitter stream. Each tweet is associated with a numbered batch (the unit of replay) and is sent to exactly one Splitter component, which divides tweets into their constituent words, via random partitioning. The words are hash partitioned to the Count component, which tallies the number of occurrences of each word in the current batch. When a batch ends, the Commit component records the batch number and frequency for each word in a backing store. Storm ensures fault-tolerance via replay: if component instances fail or time out, stream sources redeliver their inputs.

Fig. 1: The BLAZES framework. In the "grey box" system, programmers supply a configuration file representing an annotated dataflow. In the "white box" system, this file is automatically generated via static analysis.

Fig. 2: Physical architecture of a Storm word count topology.

It is up to the programmer to ensure that accurate counts are committed to the store despite these at-least-once delivery semantics. One approach is to make the Storm topology transactional, i.e., one that processes tuples in atomic batches, ensuring that certain components (called committers) emit the batches in a total order. By recording the last successfully processed batch identifier, a programmer may ensure at-most-once processing in the face of possible replay by incurring the extra overhead of synchronizing the processing of batches.

Note that batches are independent in the word counting application; because the streaming query groups outputs by batch id, there is no need to order batches with respect to each other. BLAZES can aid a topology designer in avoiding unnecessarily conservative ordering constraints, which (as we will see in Section VIII) results in up to a 3× improvement in throughput in our experiments.

Ad-tracking with Bloom: Figure 3 depicts an ad-tracking network, in which a collection of ad servers deliver advertisements to users (not shown) and send click logs (edges labeled c) to a set of reporting server replicas. Reporting servers compute a continuous query; analysts make requests (q) for subsets of the query answer (e.g., by visiting a "dashboard") and receive results via the stream labeled r. To improve response times for common queries, a caching tier is interposed between analysts and reporting servers. An analyst poses a request about a particular ad to a cache server. If the cache contains an [...]

Severity  Label    Confluent  Stateless
1         CR       X          X
2         CW       X
3         OR_gate             X
4         OW_gate

Fig. 7: The C.O.W.R. component annotations. A component path is either Confluent or Order-sensitive, and either changes component state (a Write path) or does not (a Read-only path). Component paths with higher severity annotations can produce more stream anomalies.

[...] indicating the termination of a campaign, then the nonmonotonic query CAMPAIGN can produce deterministic outputs.

IV. ANNOTATED DATAFLOW GRAPHS

So far, we have focused on the consistency anomalies that can affect individual "black box" components. In this section, we extend our discussion by presenting a grey box model in which programmers provide simple annotations about the semantic properties of components. In Section V, we show how BLAZES can use these annotations to automatically derive the consistency properties of entire dataflow graphs.

A. Annotations and Labels

In this section, we describe a language of annotations and labels that enriches the black box model (Section II) with additional semantic information. Programmers supply annotations about paths through components and about input streams; using this information, BLAZES derives labels for each component's output streams.

1) Component Annotations: BLAZES provides a small, intuitive set of annotations that capture component properties relevant to stream consistency. A review of the implementation or analysis of a component's input/output behavior should be sufficient to choose an appropriate annotation. Figure 7 lists the component annotations supported by BLAZES. Each annotation applies to a path from an input interface to an output interface; if a component has multiple input or output interfaces, each path can have a different annotation.

The CR annotation indicates that a path through a component is confluent and stateless; that is, it produces deterministic output regardless of its input order, and its inputs do not modify the component's state. CW denotes a path that is confluent and stateful.

The annotations OR_gate and OW_gate denote non-confluent paths that are stateless or stateful, respectively. The gate subscript is a set of attribute names that indicates the partitions of the input streams over which the non-confluent component operates. This annotation allows BLAZES to determine whether an input stream containing end-of-partition punctuations can produce deterministic executions without using global coordination. Supplying gate is optional; if the programmer does not know the partitions over which the component path operates, the annotations OR and OW indicate that each record belongs to a different partition.

Consider a reporting server component implementing the query WINDOW. When it receives a request referencing a particular advertisement and window, it returns a response if the advertisement has fewer than 1000 clicks within that window. An appropriate label for the path from request inputs to outputs is OR_{id,window}: a stateless order-sensitive path operating over partitions with composite key id,window. Requests do not affect the internal state of the component, but they do return potentially nondeterministic results that depend on the outcomes of races between queries and click records. Note however that if we were to delay the results of queries until we were certain that there would be no new records for a particular advertisement or a particular window [2], the output would be deterministic. Hence WINDOW is compatible with click streams partitioned (and emitting appropriate punctuations) on id or window.

2) Stream Annotations: Programmers can also supply optional annotations to describe the semantics of streams. The Seal_key annotation means that the stream is punctuated on the subset key of the stream's attributes; that is, the stream contains punctuations on key, and there is at least one punctuation corresponding to every stream record. For example, a stream representing messages between a client and server might have the label Seal_session, to indicate that clients will send messages indicating that sessions are complete. To ensure progress, there must be a punctuation for every session identifier that appears in any message.

Programmers can use the Rep annotation to indicate that a stream is replicated. A replicated stream connects a producer component instance (or instances) to more than one consumer component instance, and produces the same contents for all stream instances (unlike, for example, a partitioned stream). The Rep annotation carries semantic information both about expected execution topology and programmer intent, which BLAZES uses to determine when nondeterministic stream contents can lead to replica divergence. Rep is an optional Boolean flag that may be combined with other annotations and labels.

V. COORDINATION ANALYSIS AND SYNTHESIS

BLAZES uses component and stream annotations to determine if a given dataflow is guaranteed to produce deterministic outcomes; if it cannot make this guarantee, it augments the program with coordination code. In this section, we describe the program analysis and synthesis process.

A. Analysis

To derive labels for the output streams in a dataflow graph, BLAZES starts by enumerating all paths between pairs of sources and sinks. To rule out infinite paths, it reduces each cycle in the graph to a single node with a collapsed label by selecting the label of highest severity among the cycle [...]

Footnote 2: This rules out races by ensuring (without enforcing an ordering on message delivery) that the query comes after all relevant click records.

[...] unnecessary to ensure deterministic replay, and hence consistent outcomes.

1) Component annotations: To annotate the three components of the Storm word count query, we provide the following file to BLAZES:

    Splitter:
      annotation:
        - { from: tweets, to: words, label: CR }
    Count:
      annotation:
        - { from: words, to: counts, label: OW, subscript: [word, batch] }
    Commit:
      annotation:
        - { from: counts, to: db, label: CW }

Splitter is a stateless, confluent component: we give it the annotation CR. We annotate Count as OW_{word,batch}: it is stateful (accumulating counts over time) and order-sensitive, but potentially sealable on word or batch (or both). Lastly, Commit is also stateful (the backing store to which it stores the final counts is persistent), but since it is append-only and does not record the order of appends, we annotate it CW.

2) Analysis: In the absence of any seal annotations, BLAZES derives an output label of Run for the word count dataflow. Without coordination, nondeterministic input orders may produce nondeterministic output contents due to the order-sensitive nature of the Count component. To ensure that replay (Storm's internal fault-tolerance strategy) is deterministic, BLAZES will recommend that the topology be coordinated; the programmer can achieve this by making the topology "transactional" (in Storm terminology), totally ordering the batch commits.

If, on the other hand, the input stream is sealed on batch, BLAZES recognizes the compatibility between the stream punctuations and the Count component, which operates over grouping sets of word, batch. Because a batch is atomic (its contents may be completely determined once a seal record arrives) and independent (emitting a processed batch never affects any other batches), the topology will produce deterministic outputs under all interleavings.

B. Ad-reporting system

Next we describe how we might annotate the various components of the ad-reporting system. As we discuss in Section VII, these annotations can be automatically extracted from the Bloom syntax; for exposition, in this section we discuss how a programmer might manually annotate an analogous dataflow written in a language without Bloom's static-analysis capabilities. As we will see, ensuring deterministic outputs will require different mechanisms for the different queries listed in Figure 6.

1) Component annotations: Below is the BLAZES annotation file for the ad serving network:

    Cache:
      annotation:
        - { from: request, to: response, label: CR }
        - { from: response, to: response, label: CW }
        - { from: request, to: request, label: CR }
    Report:
      Rep: true
      annotation:
        - { from: click, to: response, label: CW }
      POOR: { from: request, to: response, label: OR, subscript: [id] }
      THRESH: { from: request, to: response, label: CR }
      WINDOW: { from: request, to: response, label: OR, subscript: [id, window] }
      CAMPAIGN: { from: request, to: response, label: OR, subscript: [id, campaign] }

The cache is clearly a stateful component, but since its state is append-only and order-independent we may annotate it CW. Because the data-collection path through the reporting server simply appends clicks and impressions to a log, we annotate this path CW also.

All that remains is to annotate the read-only path through the reporting component corresponding to the various continuous queries listed in Figure 6. Report is a replicated component, so we supply the Rep annotation for all instances. We annotate the query path corresponding to THRESH, which is confluent because it never emits a record until the ad impressions reach the given threshold, as CR. We annotate queries POOR and CAMPAIGN as OR_{id} and OR_{id,campaign}, respectively. These queries can return different contents in different executions, recording the effect of message races between click and request messages. We give query WINDOW the annotation OR_{id,window}. Unlike POOR and CAMPAIGN, WINDOW includes the input stream attribute window in its grouping clause. Its outputs are therefore partitioned by values of window, making it compatible with an input stream sealed on window.

2) Analysis: Having annotated all of the instances of the reporting server component for different queries, we may now consider how BLAZES automatically derives output stream labels for the global dataflow. If we supply THRESH, BLAZES derives a final label of Async for the output path from cache to sink. All components are confluent, so the complete dataflow produces deterministic outputs without coordination. If we chose, we could encapsulate the service as a single component with annotation CW.

Given query POOR with no input stream annotations, BLAZES derives a label of Diverge. The poor performers query is not confluent: it produces nondeterministic output contents. Because these outputs mutate a stateful, replicated component (i.e., the cache) that affects system outputs, the output stream is tainted by divergent replica state. Preventing replica divergence will require a coordination strategy that controls message delivery order to the reporting server.

If, however, the input stream is sealed on campaign, BLAZES recognizes the compatibility between the stream partitioning and the component path annotation OR_{id,campaign}, synthesizes a protocol that allows the partition to be processed when it has stopped changing, and gives the dataflow the label Async. Implementing this sealing strategy does not require global coordination, but merely some synchronization between stream producers and consumers. Similarly, WINDOW (given an input stream sealed on [...]
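As a rough illustration of the seal-based protocol discussed in this section, the following Python sketch shows a consumer that buffers records per partition and releases a partition only after a seal punctuation has arrived from every producer of that partition. The class name SealedPartitionBuffer, its methods, and the producers_for lookup are hypothetical names introduced here for illustration; they are not part of BLAZES, Storm, or Bloom, and the sketch ignores networking and fault tolerance.

```python
from collections import defaultdict


class SealedPartitionBuffer:
    """Hypothetical consumer-side buffer for seal-based coordination.

    Records are held back per partition key and released as one batch once
    every producer responsible for that key has sent a seal punctuation.
    """

    def __init__(self, producers_for):
        # producers_for: assumed lookup, partition key -> set of producer ids
        # (in the paper's experiments this information comes from Zookeeper).
        self._producers_for = producers_for
        self._records = defaultdict(list)   # key -> buffered records
        self._seals = defaultdict(set)      # key -> producers that sealed it
        self._released = set()              # keys already handed downstream

    def on_record(self, key, record):
        """Buffer a record; a record for a released partition is an error."""
        if key in self._released:
            raise ValueError(f"record arrived for sealed partition {key!r}")
        self._records[key].append(record)

    def on_seal(self, key, producer):
        """Note a seal; return the partition's records once it is complete.

        Returns None while seals are still missing. The released records are
        sorted so the result does not depend on arrival order (this assumes
        records are mutually comparable).
        """
        self._seals[key].add(producer)
        complete = self._seals[key] >= self._producers_for(key)
        if complete and key not in self._released:
            self._released.add(key)
            return sorted(self._records.pop(key, []))
        return None
```

With a single producer per partition, the first seal releases the partition immediately; with several producers, the consumer waits for the slowest one, and no global ordering of messages is needed in either case.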
Fig. 8: The effect of coordination on throughput for a Storm topology computing a streaming word count.

We used a single dedicated node (as the documentation recommends) for the Storm master and three Zookeeper servers. In each experiment, we allowed the topology to "warm up" and reach steady state by running it for 10 minutes.

Figure 8 plots the throughput of the coordinated and uncoordinated implementations of the word count dataflow as a function of the cluster size. The overhead of conservatively deploying a transactional topology is considerable. The uncoordinated dataflow has a peak throughput roughly 1.8 times that of its coordinated counterpart in a 5-node deployment. As we scale up the cluster to 20 nodes, the difference in throughput grows to 3×.

B. Ad reporting

To compare the performance of the sealing and ordering coordination strategies, we conducted a series of experiments using a Bloom implementation of the ad tracking network introduced in Section I-B. For ad servers, which simply generate click logs and forward them to reporting servers, we used 10 micro instances. We created 3 reporting servers using medium instances. Our Zookeeper cluster consisted of 3 small instances.

Ad servers generate a workload of 1000 log entries per server, dispatching 50 click log messages in batch and sleeping periodically. During the workload, we pose a number of requests to the reporting servers, all of which implement the continuous query CAMPAIGN.

Although this system, implemented in the Bloom language prototype, does not illustrate the volume we would expect in a high-performance implementation, we will see that it highlights some important relative patterns across different coordination strategies.

1) Baseline: No Coordination: For the first run, we do not enable the BLAZES preprocessor. Thus click logs and requests flow in an uncoordinated fashion to the reporting servers. The uncoordinated run provides a lower bound for performance of appropriately coordinated implementations. However, it does not have the same semantics. We confirmed by observation that certain queries posed to multiple reporting server replicas returned inconsistent results. The line labeled "Uncoordinated" in Figures 9 and 10 shows the log records processed over time for the uncoordinated run, for systems with 5 and 10 ad servers, respectively.

2) Ordering Strategy: In the next run we enabled the BLAZES preprocessor but did not supply any input stream annotations. BLAZES recognized the potential for inconsistent answers across replicas and synthesized a coordination strategy based on ordering. By inserting calls to Zookeeper, all click log entries and requests were delivered in the same order to all replicas. The line labeled "Ordered" in Figures 9 and 10 plots the records processed over time for this strategy.

The ordering strategy ruled out inconsistent answers from replicas but incurred a significant performance penalty. Scaling up the number of ad servers by a factor of two had little effect on the performance of the uncoordinated implementation, but increased the processing time in the coordinated run by a factor of three.

3) Sealing Strategies: For the last experiments we provided the input annotation Seal_campaign and embedded punctuations in the ad click stream indicating when there would be no further log records for a particular campaign. Recognizing the compatibility between the sealed stream and the aggregate query in CAMPAIGN (a "group-by" on id, campaign), BLAZES synthesized a seal-based coordination strategy.

Using the seal-based strategy, reporting servers do not need to wait until events are globally ordered; instead, they are processed as soon as a reporting server can determine that they belong to a sealed partition. The reporting servers use Zookeeper only to determine the set of ad servers responsible for each campaign; that is, one call to Zookeeper per campaign. When a reporting server has received seal messages from all producers for a given campaign, it emits the partition for processing.

In Figures 9 and 10 we evaluate the sealing strategy for two alternative partitionings of click records: in "Independent seal" each campaign is mastered at exactly one ad server, while in "Seal", all ad servers produce click records for all campaigns. Note that both seal-based runs closely track the performance of the uncoordinated run; doubling the number of ad servers effectively doubles system throughput.

To highlight the differences between the two seal-based runs, Figure 11 plots the 10-server run but omits the ordering strategy. As we would expect, independent seals result in lower latencies because reporting servers may process partitions as soon as a single seal message appears (since each partition has a single producer). By contrast, the step-like shape of the non-independent seal strategy reflects the fact that reporting servers delay processing input partitions until they have received a seal record from every producer. Partitioning the data across ad servers so as to place advertisement content close to consumers (i.e., partitioning by ad id) caused campaigns to be spread across ad servers, conflicting with the coordination strategy. We revisit the notion of "coordination locality" in Section X.

IX. RELATED WORK

Our approach to automatically coordinating distributed services draws inspiration from the literature on both distributed [...]