Highly Available, Fault-Tolerant, Parallel Dataflows

Mehul A. Shah, U.C. Berkeley (mashah@cs.berkeley.edu)
Joseph M. Hellerstein, U.C. Berkeley and Intel Research, Berkeley (jmh@cs.berkeley.edu)
Eric Brewer, U.C. Berkeley (brewer@cs.berkeley.edu)

ABSTRACT

We present a technique that masks failures in a cluster to provide high availability and fault-tolerance for long-running, parallelized dataflows. We can use these dataflows to implement a variety of continuous query (CQ) applications that require high-throughput, 24x7 operation. Examples include network monitoring, phone call processing, click-stream processing, and online financial analysis. Our main contribution is a scheme that carefully integrates traditional query processing techniques for partitioned parallelism with the process-pairs approach for high availability. This delicate integration allows us to tolerate failures of portions of a parallel dataflow without sacrificing result quality. Upon failure, our technique provides quick fail-over, and automatically recovers the lost pieces on the fly. This piecemeal recovery provides minimal disruption to the ongoing dataflow computation and improved reliability as compared to the straight-forward application of the process-pairs technique on a per-dataflow basis. Thus, our technique provides the high availability necessary for critical CQ applications.

Our techniques are encapsulated in a reusable dataflow operator called Flux, an extension of the Exchange that is used to compose parallel dataflows. Encapsulating the fault-tolerance logic into Flux minimizes modifications to existing operator code and relieves the burden on the operator writer of repeatedly implementing and verifying this critical logic. We present experiments illustrating these features with an implementation of Flux in the TelegraphCQ code base [8].

1. INTRODUCTION

There are a number of continuous query (CQ) or stream processing applications that require high-throughput, 24x7 operation. One important class of these applications includes critical, online monitoring tasks. For example, to detect attacks on hosts or websites, intrusion detection systems reconstruct flows from network packets and inspect the flows' contents [20, 27]. Another example is processing a stream of call detail records for telecommunication network management [3, 17]. Phone billing systems perform various data management operations using call records to charge or even route phone calls. Websites may analyze click streams in real-time for user-targeted marketing or site-use violations [32]. Other applications include financial quote analysis for real-time arbitrage opportunities, monitoring manufacturing processes, and instant messaging infrastructure. Recent research suggests that we can implement these applications using a more general CQ infrastructure that produces continuously processing dataflows [7, 8, 9, 26]. A CQ dataflow is a DAG in which vertices represent operators and edges denote the direction in which data is passed.

[Footnote: This work was supported by NSF grant nos. 0205647, 0208588, a UC MICRO grant, and gifts from Microsoft, IBM, and Intel. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2004, June 13-18, 2004, Paris, France. Copyright 2004 ACM 1-58113-859-8/04/06 ...$5.00.]

For such critical, long-running dataflow applications, scalability, high availability, and fault-tolerance are the primary concerns. A viable approach for scaling these dataflows is to parallelize them across a shared-nothing cluster of workstations. Clusters are a cost-effective, highly-scalable platform [2, 10] with the potential to yield fast response times and permit processing of high-throughput input streams. However, since these dataflows run for an indefinite period, they are bound to experience machine faults on this platform, either due to unexpected failures or planned reboots. A single failure can defeat the purpose of the dataflow application altogether. These applications cannot tolerate losing accumulated operator state, dropping incoming or in-flight data, and long interruptions in result delivery. In an intrusion detection scenario, for example, any one of these problems could lead to hosts being compromised. For financial applications, even short interruptions can lead to missed opportunities and quantifiable revenue loss. Hence, in addition to scalability, such applications need a fault-tolerance and recovery mechanism that is fast, automatic, and non-disruptive.

To address this problem, we present a mechanism called Flux that masks machine failures to provide fault-tolerance and high availability for dataflows parallelized across a cluster. Flux has the following salient features.

Flux interleaves traditional query processing techniques for partitioned parallelism with the process-pairs [15] approach of replicated computation. Typically, a database query plan is parallelized by partitioning its constituent dataflow operators and their state across the machines in a cluster, a technique known as partitioned parallelism [25]. The main contribution of Flux is a technique for correctly coordinating replicas of individual operator partitions within a larger parallel dataflow. We address the challenges of avoiding long stalls and maintaining exactly-once, in-order delivery of the input to these replicas during failure and recovery. This delicate integration allows the dataflow to handle failures of individual partitions without sacrificing result quality or reducing availability.

In addition to quick fail-over (a direct benefit of replicated computation), Flux also provides automatic on-the-fly recovery that limits disruptions to ongoing dataflow processing. Flux restores accumulated operator state and lost in-flight data for the failed partitions of the dataflow while allowing the processing to continue for unaffected partitions. These features provide the high availability
necessary for critical applications that not only cannot tolerate inaccurate results and data loss but also have low latency requirements. Moreover, this piecemeal recovery yields a lower mean time to recovery (MTTR) than the straight-forward approach of coordinating replicas of an entire parallel dataflow, thus improving the reliability, or mean time to failure (MTTF), of the system. For restoring lost state, Flux leverages an API, supplied by the operator writer, for extracting and installing the state of operator partitions.

Inspired by Exchange [13], Flux encapsulates the coordination logic for fail-over and recovery into an opaque operator used to compose parallelized dataflows. Flux is a generalization of the Exchange [13], the communication abstraction used to compose partitioned-parallel query plans. This technique of encapsulating the fault-tolerance logic allows Flux to be reused for a wide variety of database operators including stateful ones. This design relieves the burden on the operator writer of repeatedly implementing and verifying critical fault-tolerance logic. By inserting Flux at the communication points within a dataflow, an application developer can make the entire dataflow robust.

In this paper, we describe the Flux design and recovery protocol and illustrate its features. In Section 2, we start by stating our assumptions and outline our basic approach for fault tolerance and high availability. Inspired by process-pairs, in Section 3, we describe how to coordinate replicated single-site dataflows to provide fault tolerance and high availability. Section 4 describes the necessary modifications to Exchange for CQ dataflows and the problems with a naive application of our single-site technique to partitioned parallel dataflows. In Section 5, we build on the techniques in the previous sections to develop Flux's design and its recovery protocol. We also demonstrate the benefits of Flux with experiments using an implementation of Flux in the TelegraphCQ code base. Section 6 summarizes the related work. Section 7 concludes.

2. ENVIRONMENT AND SETUP

In this section, we outline the model for dataflow processing and our basic approach for achieving high availability and fault-tolerance. We also describe the underlying platform, the faults we guard against, and services upon which we rely. Finally, we present a motivating example used throughout the paper.

2.1 Processing Model

A CQ dataflow is a generalization of a pipelined query plan. It is a DAG in which the nodes represent "non-blocking" operators, and the edges represent the direction in which data is processed. A CQ dataflow is usually a portion of a larger end-to-end dataflow. As an example, consider a packet sprayer that feeds data to a CQ dataflow. Imagine the dataflow monitors for intrusions and its output is spooled to an administrator's console. In this example, the larger dataflow includes the applications generating packets and the administrators that receive notifications. In this paper, we only concentrate on the availability of a CQ dataflow and not the end-to-end dataflow nor the entry and exit points. We focus on CQ dataflows that receive input data from a single entry point and return data to clients through a single exit, thereby allowing us to impose a total order on the input and output data. Also, we will only discuss linear dataflows for purposes of exposition, but our techniques extend to arbitrary DAGs. For the cluster-based setting, we assume a partitioned dataflow model where CQ operators are declustered across multiple nodes, and multiple operators are connected in a pipeline [25].

CQ operators process over infinite streams of data that arrive at their inputs. Operators export the Fjord interface [24], init() and processNext(), which are similar to the traditional iterator interfaces [14]. When the operator is invoked via the processNext() call, it performs some processing and returns a tuple if available for output. However, unlike getNext() of the iterator interface, processNext() is non-blocking. The processNext() method performs a small amount of work and quickly returns control to the caller, but it need not return data. Operators communicate via queues that buffer intermediate results. The operators can be invoked in a push-based fashion from the source or in a pull-based fashion from the output, or some combination [7, 24].

2.2 Platform Assumptions

In this section, we describe our platform, the types of faults that we handle, and cluster-based services we rely upon external to our mechanism. For this work, we assume a shared-nothing parallel computing architecture in which each processing node (or site) has a private CPU, memory, and disk, and is connected to all other nodes via a high-bandwidth, low-latency network. We assume the network layer provides a reliable, in-order point-to-point message delivery protocol, e.g. TCP. Thus, a connection between two operators is modeled as two separate uni-directional FIFO queues, whose contents are lost only when either endpoint fails.

We ignore recurrent deterministic bugs and only consider hardware failures or faults due to "heisenbugs" in the underlying platform. These faults in the underlying runtime system and operating system are caused by unusual timings and data races that arise rarely and are often missed during the quality testing process. When these faults occur, we assume the faulty machine or process is fail-stop: the error is immediately detected and the process stops functioning without sending spurious output. Schneider [28] and Gray and Reuter [15] show how to build fail-stop processes. Moreover, since we aim to provide both consistency and availability, we cannot guard against arbitrary network partitions [12].

We rely on a cluster service that allows us to maintain a consistent global view of the dataflow layout and active nodes in the cluster [33]. We call it the controller. We rely on it to update group membership to reflect active, dead, and standby nodes, set up and tear down the dataflow, and perform consistent updates to the dataflow structure upon failures. Standard group communication software [6, 33] provides functionality for managing membership and can be used to maintain the dataflow layout. Note, the controller is not a single point of failure, but a uniform view of a highly available service. Such a service usually employs a highly available consensus protocol [22], which ensures controller messages arrive at all cluster nodes in the same order despite failures.

However, we do not rely on the controller to maintain or recover any information about the in-flight data or internal state of operators. The initial state of operators must be made persistent by external means, and we assume it is always available for recovery. Our techniques also do not rely on any stable storage for making the transient dataflow state fault-tolerant. When nodes fail and re-enter the system, they are stateless. Also, link failures are transformed into a failure of one of the endpoints.

2.3 Basic Approach

Our goal is to make the in-flight data and transient operator state fault-tolerant and highly available. In-flight data consists of all tuples in the system from acknowledged input from the source to unacknowledged output to the client. This in-flight data includes intermediate output generated from operators within the dataflow that may be in local buffers or within the network itself.

Inspired by the process-pairs technique [15], our approach provides fault-tolerance and high availability by properly coordinating redundant copies of the dataflow computation. Redundant computation allows quick fail-over and thus gives high availability. We restrict our discussion in this paper to techniques for coordinating two replicas; thus, we can tolerate a single failure between recovery points. While further degrees of replication are possible with our technique, for most practical purposes, the reliability achieved with pairs is sufficient [15].

[Figure 1: Example Dataflow. A linear flow from a click-stream source with schema (src, dst, data, ts), through a Group-By on {src, dst} emitting (app, src, dur), through a Group-By on {app, src} emitting (user-id, max(dur), avg(dur)), to the output.]

We model CQ operators as deterministic state-machines; thus, given the same sequence of input tuples, an operator will produce the same sequence of output tuples. We call this property sequence-preserving. Most CQ operators fall within this category, for example, windowed symmetric hash join or windowed group-by aggregates. Section 3.4 describes how to relax this assumption for single-site dataflows.

Our techniques ensure that replicas are kept consistent by properly replicating their input stream during normal processing, upon failure, and after recovery. With the dataflow model, maintaining input ordering is straightforward, given an in-order communication protocol and sequence-preserving operators. However, consistency is difficult to maintain during failure and after recovery because connections can lose in-flight data and operators may not be perfectly synchronized. Thus, our techniques maintain the following two invariants to achieve this goal.

1. Loss-free: no tuples in the input stream sequence are lost.
2. Dup-free: no tuples in the input stream sequence are duplicated.

To maintain these invariants, we introduce intermediate operators that connect existing operators in a replicated dataflow. Between every producer-consumer operator pair that communicate via a network connection, we interpose these intermediate operators to coordinate copies of the producer and consumer. Abstractly, the protocol is as follows. To keep track of in-flight tuples, we assign and maintain a sequence number with each tuple. The intermediate operator on the consumer side receives input from its producer, and acknowledges the receipt of input (with the sequence number) to the producer's copy. The intermediate operator at the producer's copy stores in-flight tuples in an internal buffer and ensures our invariants in case the original producer fails. The acknowledgements track the consumer's progress and are used to drain the buffer and filter duplicates. We will show that by composing an existing dataflow using intermediaries that embody this unusual, seemingly asymmetric protocol in various forms, the dataflow can tolerate loss of its pieces and be recovered in piecemeal.

2.4 Motivating Example

As a motivating example, we consider a dataflow that may be used for analyzing network packets flowing through a firewall (or DMZ) for a large organization. Suppose a network administrator wants to track the maximum and average duration of network sessions across this boundary in real time. The sessions may be for various applications, e.g. HTTP, FTP, Gnutella, etc. Further, suppose the administrator wants these statistics on a per source and application basis. Such information may be used, for example, to detect anomalous behavior or identify misbehaving parties. The linear dataflow in Figure 1 performs this computation; we will use this as our example throughout the paper.

[Figure 2: Dataflow Pairs Normal Processing. A Source feeds an Ingress operator, which forwards data to a primary segment (S-Cons^P up to S-Prod^P) and a secondary segment (S-Cons^S up to S-Prod^S); an Egress operator forwards output to the Destination, with separate data and ack connections.]

The data source, the first operator, provides a raw packet stream. To determine session duration, the network session must be reconstructed from a packet stream. Each packet stream tuple has the schema (src, dst, data, ts), where the source and destination fields determine a unique session. Each tuple also has a timestamp, ts, and a data field that contains the payload or signals a start or end of session. The second operator is a streaming group-by aggregation that reconstructs each session, determines the application and duration, and outputs an (app, src, dur) tuple upon session completion. The third operator is also a streaming group-by aggregation that maintains for each application and source, i.e. (app, src), the maximum and average duration over a user-specified window and emits a new output at a user-specified frequency. The final operator is an output operator.

3. SINGLE-SITE DATAFLOW

Inspired by process-pairs, in this section, we describe how to coordinate replicated, single-site CQ dataflows to provide quick fail-over and thus high availability. We extend our technique to parallel dataflows in the next section.

For a given CQ dataflow, we introduce additional operators that coordinate replicas of the dataflow during normal processing and automatically perform recovery upon machine failure. We refer to these operators as boundary operators because they are interposed at the input and output of the dataflow. These operators encapsulate the coordination and recovery logic so modifications to existing operators are unnecessary. Recovery has two phases: take-over and catch-up. Immediately after a failure is detected, take-over ensues. During take-over, we adjust the routing of data within the dataflow to allow the remaining replica to continue processing incoming data and delivering results. After take-over, the dataflow is vulnerable: one additional failure would cause the dataflow to stop. Once a standby machine is available, the catch-up phase creates a new replica of the dataflow, making the dataflow once again fault-tolerant.

3.1 Dataflow Pairs

In this section, we describe the normal-case coordination necessary between replicated dataflows to guard against failures.

3.1.1 Components of a Single-Site Dataflow

A single-site CQ dataflow is a DAG of non-blocking operators which are typically in a single thread of control. In our example, the dataflow is a pipeline. We call any such locally connected sequence of operators a dataflow segment, as shown in the left portion of Figure 2. Communication to non-local machines occurs only at the top and the bottom of the dataflow segment. When we refer to the
top, we mean the end that delivers output and the bottom is the end that receives input. Likewise, we use "above" to indicate closer to the output, and vice-versa for "below".

In top-down execution, the topmost operator invokes the operators below it recursively through the processNext() method, and returns the results through an egress operator that forwards results to the destination. At the bottom are operators that receive data from an ingress operator. Ingress handles the interface to the network source. The ingress and egress are boundary operators. The circles are operators in the dataflow; in our running example they are the streaming group-by operators.

In this paper, we place the ingress and egress operators on separate machines and assume they are always available. These operators are proxies for the input and output of the dataflow and may have customized interfaces to the external world. Existing techniques, e.g. process pairs, may be used to make these operators highly available, but we do not discuss these further. Our focus is on the availability of the dataflow that processes the incoming data and its interaction with these proxy operators. Thus, in the remaining discussion, we detail ingress and egress alongside the other boundary operators we introduce.

3.1.2 Normal Case Protocol

To make a single-site dataflow fault-tolerant and highly available, we have two copies of the entire dataflow running on separate machines and coordinate the copies through the ingress and egress. Henceforth, we use the superscript P to denote the primary copy of an object, and the superscript S to denote the secondary (see Figure 2). Ingress incorporates the input and forwards it to both dataflows, loosely synchronizing their processing, and egress forwards the output.

The ingress operator incorporates the network input into data structures accessible to the dataflow execution engine, i.e. tuples. Once the ingress operator receives an input and has incorporated it into the dataflow, it sends an acknowledgement (abbreviated as ack) to the source indicating that the input is stable and the source can discard it. For push-based sources that do not process acks, the operator does its best to incorporate the incoming data. The egress operator does the reverse. It sends the output to the destination and when an ack is received from the destination, the egress operator can discard the output. If the destination does not send acks, result delivery is also just best-effort.

We introduce additional boundary operators called S-Prod and S-Cons at the top (producing) and at the bottom (consuming) end of the dataflow segment respectively. We assume each input tuple is assigned a monotonically increasing input sequence number (ISN) from the source or ingress operator. The ingress operator buffers and forwards each input tuple to both S-Cons^P and S-Cons^S which upon receipt send acks. The S-Cons operators do not forward an incoming tuple to operators above unless an ack for that tuple has been sent to the ingress. Acknowledgments, unless otherwise noted, are just the sequence numbers assigned to tuples. When ingress receives an ack from both replicas for a specific ISN, it drops the corresponding input from its internal buffer. Likewise, both S-Prod operators assign an output sequence number (OSN) to each output and store the output in an internal buffer. Only S-Prod^P forwards the output to the egress operator after which it immediately drops the tuple. Egress sends acks to S-Prod^S for every input received. Once an ack is sent, egress can forward the input to the destination. Once S-Prod^S has received the ack from egress, it discards the output tuple with that OSN from its buffer. Note, acks may arrive before output is produced by the dataflow. S-Prod^S maintains these acks in the buffer until the corresponding output is produced. We will see in Section 5 that this asymmetric egress protocol is exactly half of the symmetric Flux protocol.

[Figure 3: Abstract Producer Specification - Normal Case. A guarded-action table over a buffer B of {sn, tuple, mark} entries, with actions guarded by not B.full() (t := processNext()), status[dest] = ACTIVE for sending (t := B.peek(dest)) and for ack processing (sn := recv(dest)).]

The dataflow pairs scheme ensures both dataflows receive the same input sequence and maintains the loss-free and dup-free invariants in the face of failures. The former is true because of our in-order delivery assumption about connections. Next, we show how to maintain the latter by using the internal buffers. The buffer serves as a redundant store for in-flight tuples as well as a duplicate filter during and after recovery. We describe the interface these buffers support and how the boundary operators use them during normal processing and recovery.

3.1.3 Using Buffers

The buffer in the ingress operator stores sequence numbers and, associated with those sequence numbers, stores tuples and markings. If a sequence number does not have a tuple associated with it, we call it a dangling sequence number. The markings indicate the places from which the sequence number was received. A marking can be any combination of PROD, PRIM, and SEC. The PROD mark indicates it is from a tuple produced from below. The PRIM and SEC markings indicate it is from an ack received from the primary and secondary destinations, respectively. The buffer also maintains two cursors: one each for the primary and secondary destinations. The cursors point to the first undelivered tuple to each destination. The buffer in S-Prod is the same, except there is only one destination, so only one cursor exists and only two markings are allowed, PROD and PRIM.

The buffer supports the following methods: peek(dest), advance(dest), put(tuple, SN, del), ack(SN, dest, del), ackall(dest, del), reset(dest). The peek() method returns the first undelivered tuple for the destination and advance() moves the cursor for that destination to the next undelivered tuple. The put() method inserts a sequence number, SN, into the buffer if the SN does not exist. Then, it associates a tuple with that sequence number, and marks that tuple as produced. The ack() method marks the sequence number as acknowledged by the given destination. The del parameter for both put() and ack() indicates which markings must exist for a sequence number in order to remove it and its associated tuple from the buffer. The reset() and ackall() methods are used during take-over and catch-up, so we defer their description to Section 3.2.

Our operators that produce or forward data use this buffer and implement the abstract state-machine specification in Figure 3. The specification contains state variables and actions that can modify the state or produce output. State variables can be scalar or set-valued. The values they can hold are declared within braces, and for set-valued types, these values are separated by bars. Each row in the specification is an action. Each action has a guard, a predicate that must be true, to enable the action. Each enabled action causes
icatethattheconnectionsareusedforbothsend-ingdataandreceivingacks.ThedelvariableissettofPROD,PRIM,SECgtoindicatethatallthreemarkingsarenecessaryforevictionofanentry.ForS-Prod,Pconn[PRIM]=fSENDg;forS-Prod,Sconn[PRIM]=fACKg,i.e.dataisnotforwarded.FortheS-Prodoperators,thereisonlyoneconnectiontotheegress,sodestalwaysissettoPRIM,anddelissettofPROD,PRIMg.Thus,forS-Prod,Pentriesinthebufferaredeletedonlyafterhav-ingbeenproducedandsent.ForS-Prod,Sentriesaredeletedafterhavingbeenproducedandacked.Note,acksfromegressmaybeinsertedintheS-Prod,Sbufferbeforetheirassociatedtuplesareproduced,resultingindanglingsequencenumbersinthebuffer.Thesizeoftheingressbufferlimitsthedriftofthetworepli-cas;theshorterthebufferthelessslackispossiblebetweenthetwo.Sinceweassumeunderlyingin-ordermessagedelivery,thebufferscanbeimplementedassimplequeues,andtheackscanbesentperiodicallybyS-Consoregresstoindicatethelatesttuplereceived.Withthisoptimization,ack()acknowledgeseverytu-plewithsequencenumberSN.Thisschemeallowsustoamor-tizetheoverheadofround-triplatencies,buttake-overandcatch-upprotocolsmustchangeslightly.Weomitthesedetailsduetospaceconstraints.Themean-time-to-failure(MTTF)analysisforthedataowpairduringnormalprocessingisthesameasthatofprocesspairs,as-sumingindependentfailures.TheoverallMTTFofthesystemis(MTTFs)2=MTTRswhereMTTRsisthetimetorecoverasinglemachine,MTTFsistheMTTFforasinglemachine,andMTTRsMTTFs[15].Inourcase,MTTRsconsistsofboththetake-overandcatch-upphase.Thelatteronlyoccursifastandbyisavailable.3.2Take­OverInthissection,wedescribetheactionsinvolvedinthetake-overprotocolfortheboundaryoperators.Webeginbydiscussinghowcontrollermessagesarehandledandwhatadditionalstatetheboundaryoperatorsmustmaintain.Wethendescribetheprotocolandshowhowitmaintainsourtwoinvariantsafterafailure. not(status[dest]=DEAD)\rconn[p(dest)]\r{status[dest]:=DEAD;\rGuard\rState Change\rExt. 
Action\r fail(dest)\r{p_fail := true;}\r fail(pair)\r reverse()\r{conn[PRIM]:={SEND};\r not(status[dest]=DEAD)\r{status[dest]:=DEAD;\r fail(dest)\rE\rg\rr\re\rs\rs\rS\r-\rP\rr\ro\rd\rI\rn\rg\rr\re\rs\rs\rS\r-\rC\ro\rn\rs\r{p_fail := true;}\r fail(pair)\r1\r2\r3\ra\rFigure4:Take-OverSpecication3.2.1ControllerMessagesandOperatorModesAsmentionedbefore,werelyonthecontrollertothemaintaingroupmembershipanddataowlayout,anditcanbeimplementedusingstandardsoftware[6,33].Typically,suchaservicewillde-tectfailuresviatimeoutofperiodicheartbeats,andinformallac-tivenodesofthefailednodethroughadistributedconsensusproto-col[22](i.e.doaviewupdate).Note,nodefailuresarepermanent;aresetnodeenterstheclusteratinitialstate.Whenafailuremes-sagearrivesfromthecontroller(viaaviewupdate),wemodelitasanexternalactionfail()invokedonourboundaryoperators.Thecontrolleralsosendsavailabilitymessages(avail())usedduringcatch-up(seeSection3.3).Sincethecontrollercansendmultiplemessages,weenforcethatallenabledactionsforacontrollermes-sagecompletebeforeactionsforthenextmessagecanbegin.Whilecontrollermessagesarriveatallmachinesinsameorderwithinallcontrollercommands,thereisnoorderingguaranteewithrespecttodataroutedwithinadataow.Forexample,afail()messagemayarriveattheingressbeforesometuplethasbeenforwardedandarriveataS-Consmuchafterthetuplethasbeenreceived.Thus,ourboundaryoperatorsmustcoordinateusingmes-sageswithinthedataowtoperformtake-over.Moreover,ourboundaryoperatorsmustmaintainthestatusoftheoperatorontheotherendofalloutgoingandincomingconnec-tions.Eachoperatorcanbeinfourdistinctmodes:ACTIVE,DEAD,STDBY,PAUSE.IntheACTIVEmode,theoperatorisaliveandpro-cessing.Whenitisdead,itisnolongerpartofthedataow.WediscusstheSTDBYandPAUSEmodesinSection3.3ofcatch-up.3.2.2Re­routingUponFailureUponfailure,weneedtomaketwoadjustmentstotheroutingdonebyourboundaryoperatorstoensuretheremainingdataowsegmentcontinuescomputinganddeliveringresults.First,intheingressoperator,wemustadjustthebuffertonolongeraccountforthefailedreplica.Secon
d,iftheprimarydataowsegmentfails,thesecondaryS-Prodmustforwarddatatoegress.Theactionsenableduponafailureandduringtake-overforourboundaryoperatorsareshowninFigure4.Wedescribethesetop-downinourdataow.Whenafailuremessagearrivesattheegressoperator,rstitsim-plyadjuststheconnectionstatustoindicatethatthedestinationisdead(action(1)).Iftheconnectiontotheremainingreplica,in-dexedbyp(dest),isreceivingresults,nothingelsehappens.Oth-erwise,theegressoperatormarksthatconnectionforreceivingtu-ples.Egressalsosendsareversemessagetothereplica(action(1a)).Hence,iftheprimarydataowsegmentfails,egressbeginsprocessingresultsfromthesecondary.ThereversemessageissentalongtheconnectiontoS-Prodthatwasusedforacks.Thismes-sageisanindicationtoS-Prodtobeginforwardingtuplesinsteadofprocessingacks. WhenareversemessagearrivesattheS-Prod,itsimplyadjuststheconnectionstatetobeginsendingtoitsonlyconnection(action(2)).Usingthebufferensuresthatnoresultsarelostorduplicated.Oncethereversemessagearrives,therearenooutstandingacksfromegress,givenourassumptionaboutin-orderdeliveryalongconnections.Atthispoint,therecanbetwotypesofentriesinthebuffer:unacknowledgedtuples,anddanglingsequencenumbers.Thedanglingsequencenumbersarefromreceivedacksforwhichtupleshaveyettobeproduced.Notuplesarelostbecausealllostin-ighttuplesfromtheprimaryeitherremainunacknowledgedinthebufferorwilleventuallybeproduced,assumingtheingressisfault-tolerant.Moreover,newlyproducedtupleswithalreadyackedsequencenumberswillnotberesentbecausethebufferactsasaduplicatelter.Onceproduced,atupleisinsertedusingtheput()methodandisimmediatelyremovedifadanglingsequencenumberforitexists.WhenS-Prodstartssending,thebufferensurestheprimary'scur-sorpointstotherstundeliveredtupleinthebuffer,i.e.therstunacknowledgedtuple.Asdescribedinthenextsection,theS-ProdandS-Consoperatorsmaintainadditionalstate,pfail,usedtoindicatethecompletionofthetake-overphase.Uponreceivingthefailuremessage,theingressoperatorrstsetstheconnectiontodead(action(3)).Then,usingtheackall()method,itm
arksalltuplesdestinedforthedeadconnectionasac-knowledged.Likeack(),ackall()willremovetuplescontain-ingallthemarksindel.Finally,ingressadjuststhedelvariabletoonlyconsidertheproducedmarker,PROD,andoneofPRIMorSECfortheremainingconnection,whenevictingtuples.Notuplesarelostorduplicatedbecausefortheremainingconnection,theingressoperatorisperformingexactlythesameactions.Moreover,thebufferdoesnotllindenitelyaftertake-overbecausedelismodiedtoignorethedeadconnection.Aftertake-overiscomplete,thedataowcontinuestoprocessanddeliverresults.But,itisstillvulnerabletoanadditionalfailure.3.3Catch­UpTomakethedataowfault-tolerantagain,weinitiatearecov-eryprotocolcalledcatch-up.Figure5showsthespecicationforthecatch-upphase.Duringcatch-up,apassivestandbydataowisbroughtup-to-datewiththeremainingcopy.Todoso,weneedanothernon-faultymachinethathasthedataowoperatorsinitial-izedinSTDBYmode.Weassumethatthecontrollerarrangessuchamachineafterfailureandissuesanavail()messagetoindi-cateitsavailability.Wealsoassumethatthecontrollerinitializesthestandbymachinewithoperatorsintheirinitialstateandhastheconnectionsproperlysetupwiththeboundaryoperators.Onceavailable,thestate-movementphasebeginsinwhichwetransferstateoftheactivedataowontothestandbymachine.Theninthefold-inphaseweincorporatethenewcopyintotheoveralldataow.Below,wedetailthesetwophasesofcatch-up.3.3.1State­MovementWerstprovideanoverviewofthestate-movementphase.Onceastandbydataowsegmentisavailable,state-movementbegins.Initially,theS-Consoperatorquiescesthedataowsegmentandbeginsstatemovement.S-ConsleveragesaspecicAPI(whichwedescribeshortly)forextractingandinstallingthestateofthedataowoperators.ThisstateistransferredthroughaStateMoverthatislocaltothemachine,butinaseparatethreadoutsidethedataow.TheStateMoverhasaconnectiontotheStateMoversonallmachinesinthecluster.Oncestate-movementiscomplete,wehavetwoconsistentcopiesofthedataowsegment.Next,wedetailthespecicactionstakeninthisphase.TheS-Consoperatorinitiatesthestate-movementphasewhenitrecognizesthatastan
dbydataowsegmentisreadyviaanavail()messagefromthecoordinator.TheS-Consoperatorsonboththeactiveandthestandbymachinesrelyonalocal,butexternalState-Moverprocessfortransferringstate.TheS-Consontheactivema-chineisinACTIVEmode,andtheS-ConsonthestandbymachineisinSTDBYmode.TheactiveS-ConsoperatorcheckswithitsS-Prodabovetodetermineiftake-overhascompleted(action(1)).Ifso,itsignalstheStateMoverprocesstobeginstatemovement.TheStateMoversonthestandbyandactivemachinescommuni-cate,and,whenreadyfortransferringstate,theybothsignaltheirrespectiveS-Consoperatorswithsmready()(action(2)).Next,theactiveS-Consincrementsitsversionnumberandpausesthein-comingconnection.ItalsocallsinitTrans()whichisinvokedoneveryoperatorinthedataowsegmenttoquiescetheoperatorsformovement.EventuallythecallreachestheS-Prodaboveandpausestheoperator.Thenstatetransferbegins.Atthispoint,wenotetwoimportantitems.First,S-Prodpauses move()\r sm_ready()\r my_status=ACTIVE\r{send(sm,send-req};}\r avail(pair)\r{my_ver:=my_ver+1;\r my_status=ACTIVE\r{st:=getState();\r sm_done()\r{endTrans(),p_fail:=false;\r{status[dest]:=STDBY;\rGuard\rState Change\rExt. 
Action\r avail(dest)\r psync(dest,v)\r status[dest]=ACTIVE\r{ver[dest] := v;\r p_fail ^ r_done\r{return true;}\r tk_done()\r sync(ver)\r my_status=STDBY\r{p_fail:=r_done:=false;\r initTrans()\r{p_status:=my_status;\r endTrans()\r{my_status:=p_status;}\rE\rg\rr\re\rs\rs\rS\r-\rP\rr\ro\rd\rS\r-\rC\ro\rn\rs\r{status[dest]:=STDBY;\r avail(dest)\r csync(dest,\rstatus[dest]=ACTIVE\r{ver[dest]:=v;\rconn[\rp(\rdest\r)\r]:=\r{\rSEND,ACK\r}\r;}\r{ver[dest]:=v;\r8\r7\r1\r5\r6\r2\r4\r3\rI\rn\rg\rr\re\rs\rs\rFigure5:Catch-UpSpecication topreventadditionalstatechangesfromoccurringduringmove-ment(action(3)).Weenforcenot(mystatus=PAUSE)asaguardforeveryactiontostallthedataowsegmentcompletely.(Section5showshowtoalleviatethisstallforparalleldataows.)Second,wehaveintroducedversionnumbersforeachdataowsegment.Theversionnumbertrackswhichcheckpointofthedataowstatehasbeentransferred.Ingressandegressalsomaintainthecheckpointversionsforthesegmentsattheotherendoftheirconnections.Dur-ingstatemovement,theversionnumberiscopiedsotheactive'sandstandby'sversionvaluesmatchaftermovement.Thesecheck-pointversionnumbersarethenusedtomatchthesynchronizationmessagessentduringthefold-inphase.Inordertotransferthestateofdataowoperators,operatorde-velopersmustimplementaspecicAPI.First,werequiretwometh-ods:getState()andinstallState()(action(4))thatextractandmarshall,andunmarshallandinstallthestateoftheoperator,respectively.S-Conscallsthesemethodsoneverydataowopera-torduringstatemovement.Forexample,forthegroup-byfromourexample,getState()extractsthehashtableentriesthatcontainintermediateper-groupstateandmarshallsitintomachineindepen-dentform;installState()doesthereverse.EvenS-Prodim-plementsthesemethodstotransferitsinternalbuffer.Similarly,weuseinitTrans()toquiescetheoperatorbeforetransferandendTrans()torestarttheoperatorafterwards.Oncestatetransferisnished,thesmdone()actionisenabled(action(5)).Next,bothS-Consoperatorsrestarttheirancestors,restarttheirinputconnectionforreceivingandackingdata,anden-ablesync()onthei
rdownstreamS-Prodoperator.Atthispoint,thetwodataowcopiesareconsistentwithrespecttotheirstate,theamountofinputprocessed,andoutputsent.Thefold-inphaserestartstheinputandoutputstreamsofthenewreplicawhilemain-tainingthedup-freeandloss-freeinvariants.3.3.2Fold­InOncestatemovementiscompletebothactiveandstandbyS-ProdandS-Consoperatorsbeginfold-inbysendingsynchroniza-tionmessagestotheegressandingressoperators,respectively.Themessagesmarkthepoint,withintheoutputorinputstreamsofthedataowsegment,atwhichthetwonewreplicasareconsistent.Oncetheingressandegressoperatorsreceivethesemessagesfromtheactiveandstandby,theincomingandoutgoingconnectionsarerestartedandfold-iniscomplete.WerstdescribetheinteractionbetweentheingressandS-Consoperators,andthendescribetheinteractionbetweentheegressandS-Prodoperators.S-Conssendsacsyncmessagewithitslatestcheckpointversiontotheingressoperator(action(5)).Note,thecsyncmessagesaresentalongtheconnectiontotheingressforsendingacks,ushingallackstotheingress.Oncesent,bothS-Consareactive.Ontheotherside,theingressmustproperlyincorporatebothmessagesandrestarttheconnectionstotheS-Consoperators.Theingressoperatorcompletesfold-inafterbothcsyncmes-sagesarrive(action(6)).Sincethecsyncmessagescanarriveattheingressinanyorder,wemustkeeptrackoftheirarrivalsandtakeactionstoavoidskippingorrepeatinginputtothenewstandby.Thevervariablemaintainsthelatestcheckpointversionforthedataowattheotherendofeachconnection.Sinceversionsaresentonlywithcsyncmessages,weusevertoindicatewhetheracsyncwasreceivedfromtheconnection.Whencsyncarrivesfromtheactivedataowsegment,wearecertainthatallackssentpriortostatetransferhavebeenreceived.So,inordertoavoiddrop-pinginputforthenewreplica,ingressupdatesdeltoindicatethatacksfrombothconnectionsarenecessaryfordeletion.Ingressalsoresetsthecursorforthestandbythroughreset()whichpointsthecursortotherstoftheunacknowledgedtuplesinthebuffer.Resetalsoensuresthemarkingsforbothconnectionsmatchoneachen-tryandremovesentrieswithallthreeornomarkings.Note,wearecareful
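To make the state-movement API of Section 3.3.1 concrete, the sketch below shows the four methods as an operator might implement them. This is a minimal, hypothetical Python rendering, not the TelegraphCQ C implementation; the class name, the groups field, and the transfer() helper are our own illustration.

```python
import pickle

# Hypothetical sketch of the state-movement API from Section 3.3.1:
# initTrans()/endTrans() quiesce and restart an operator;
# getState()/installState() marshall and install its state.
class StreamingGroupBy:
    def __init__(self):
        self.groups = {}        # intermediate per-group state (hash table)
        self.paused = False

    def initTrans(self):
        """Quiesce the operator before state transfer."""
        self.paused = True

    def endTrans(self):
        """Restart the operator after state transfer."""
        self.paused = False

    def getState(self):
        """Extract state and marshall it into machine-independent form."""
        return pickle.dumps(self.groups)

    def installState(self, blob):
        """Unmarshall and install state extracted from the active copy."""
        self.groups = pickle.loads(blob)

def transfer(active, standby):
    # The StateMover's role reduces to shipping the getState() bytes
    # between machines; locally the handoff is just this sequence.
    active.initTrans()
    standby.installState(active.getState())
    active.endTrans()
    standby.endTrans()
```

In this rendering, S-Cons would invoke initTrans() on every operator in the segment, ship each getState() payload through the local StateMover, and call endTrans() on both copies once the standby has installed the state.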
Note that we are careful not to restart the connection to the standby segment unless the csync from the active dataflow segment has arrived. Otherwise, ingress might duplicate previously consumed tuples to the new replica. Fold-in is complete when ingress activates the connection to the new replica.

Returning to the top of the dataflow segment: once the S-Prod operators receive the sync() from the S-Cons below, they emit a psync with the latest checkpoint version along the forward connection to egress (action (7)). After this message is sent, both S-Prod are active, and only fold-in for the egress remains to be discussed. At the egress operator, we must ensure that it neither forwards acks for any in-flight data sent before movement began nor misses sending acks for any tuples sent after movement completed (action (8)). If the psync arrives from the active first, we are careful to pause the connection until the second psync arrives. Otherwise, egress would miss sending acks for tuples received after state-movement but before the standby's psync arrives. If the psync arrives from the standby first, we are careful not to activate the connection for acks; otherwise, egress would forward acks for tuples sent before state-movement. The second psync causes both connections to be activated. Once egress starts sending acks to the new replica, fold-in is complete. Once fold-in is completed at both ends, catch-up is finished and the dataflow is again fault-tolerant.

3.4 Conclusion: Dataflow Pairs

There are two properties of this protocol that we want to highlight. First, with some more modifications, we can make the catch-up protocol idempotent. That is, even if the standby fails during movement, the dataflow will continue to process correctly and attempt catch-up again. Idempotency can be extremely useful. For example, imagine an administrator who wants to migrate the dataflow to a machine with an untested, upgraded OS. With such a property, she can simply terminate a replica, bring up a standby with the new OS, and, if the standby fails, revert back to the old OS. We sketch the changes needed to achieve idempotent catch-up. On a failure during movement, the S-Cons operator completes the protocol as if movement did finish. But multiple failures can cause the checkpoint version numbers on the outgoing (incoming) connections of ingress (egress) to drift apart. Thus, we must modify the ingress and egress psync(), csync(), and fail() actions to track this gap and properly restart connections when the versions match.

Second, note that the entire protocol only makes use of sequence numbers as unique identifiers; we never take advantage of their order, except in the optimizations for amortizing the cost of round-trip latencies. Some operators exhibit external behavior that depends on inputs outside the scope of the dataflow. For example, the output order of XJoin [31] depends upon the prevailing memory pressures. If we ignore these optimizations, we can accommodate dataflow segments that include non-deterministic but set-preserving operators such as XJoin. That is, given the same set of input tuples, the operator will produce the same set of output tuples. In this case, instead of generating new OSNs, we require that outputs have a unique key, which is used in place of sequence numbers. We outline methods for maintaining such keys in Section 4.1.

4. PARALLEL DATAFLOW

Parallelizing a CQ dataflow across a cluster of workstations is a cost-effective method for scaling high-throughput applications implemented with CQ dataflows. For example, in our monitoring scenario, one can imagine having thousands of simultaneous sessions and thousands of sources. Moreover, the statistics (e.g., max) collected may range over some sizeable window of history. To keep up with high-throughput input rates and maintain low latencies, the dataflow can be scaled by partitioning it across a cluster.

[Figure 6: Exchange Architecture]

On a cluster, a CQ dataflow is a collection of dataflow segments (one or more per machine). Individual operators are parallelized by partitioning their input and processing across the cluster, a technique called partitioned parallelism. When the partitions of an operator need to communicate with non-local partitions of the next operator in the chain, the communication occurs via the Exchange [13]. In Section 4.1, we describe Exchange and the extensions necessary for its use in a partitioned-parallel CQ dataflow consisting of sequence-preserving operators that require their input to be in arrival order. In this configuration, a scheme that naively applies the dataflow-pairs technique without accounting for the cross-machine communication within a parallel dataflow quickly becomes unreliable. In Section 4.2, we show that this naive approach, called cluster pairs, leads to a mean-time-to-failure (MTTF) that falls off quadratically with the number of machines. Moreover, the parallel dataflow must stall during recovery, thereby reducing the availability of the system. Instead, embedding coordination and recovery logic within the Exchange speeds up the MTTR, thereby improving both availability and MTTF. The improved MTTF falls off only linearly with the number of machines. In Section 5, we describe the Flux design, which achieves this MTTF.

4.1 Exchange

In a parallel database, an Exchange [13] is used to connect a producer-consumer operator pair in which the producer's output must be repartitioned for the consumer. In our running example, a viable way to parallelize the dataflow is to partition the group-bys' state and input based on (src, dst) at the first level and (app, src) at the second. The Exchange ensures proper routing of data between the partitioned instances of these operators. Exchange is composed of two intermediate operators, Ex-Cons and Ex-Prod (see Figure 6). Ex-Prod encapsulates the routing logic; it forwards output from a producer partition to the appropriate consumer partition based on the output's content. In our example, if the partitioning were hash-based, Ex-Prod would compute a hash on (app, src) to determine the destination. Since any producer partition can generate output destined for any consumer partition, each Ex-Prod instance is connected to each Ex-Cons instance. Typically, Ex-Cons and Ex-Prod are scheduled independently in separate threads and support the iterator interface. Ex-Cons merges the streams from the incoming connections to a consumer instance. Since Exchange encapsulates the logic needed for parallelism, the operator writer can write relational operators while remaining agnostic to their use in a parallel or single-site setting.

We make two modifications to Exchange for use in CQ dataflows. First, to allow a combination of push and pull processing, Ex-Prod and Ex-Cons must support the non-blocking Fjord interfaces [24]. Second, Ex-Cons must be order-preserving. Some CQ operators, like a sliding-window group-by whose window slides with every new input, require that their input data arrive in sequential order. In the parallel setting, since the input stream is partitioned, the input will arrive in some interleaved order at an Ex-Cons. The streams output by the individual Ex-Prod instances are, however, in sequential order. An Ex-Cons instance can recover this order by merging its input data using their sequence numbers. We can also modify Ex-Cons to support operators that relax this ordering constraint.

An important related issue is how to maintain sequence numbers as tuples are processed through the dataflow. We cannot simply generate new output sequence numbers (OSNs) at Ex-Prod; otherwise the ordering across Ex-Prod instances would be lost. Instead, we must keep the original ISN intact to reconstruct the order at the consumer side. For operators that perform one-to-one transformations, the operators just need to keep the input tuple's sequence number (SN) intact. For one-to-many transformations, the output SN can be a compound key consisting of the input SN and another value that uniquely orders all tuples generated from that input. For example, for a symmetric join, a concatenation of the SNs from its two input streams will suffice. For many-to-one transformations like windowed aggregates, the largest SN of the input that produced the output will suffice. It is the task of operator developers to generate correct SNs.

4.2 Naive Solution: Cluster Pairs

In this section, we describe how to make a parallel dataflow highly available and fault-tolerant in a straightforward manner using the technique in Section 3. Assume we still have a single ingress and egress operator. Also assume we use partitioned parallelism for the entire dataflow. Each machine in the cluster executes a single-site dataflow, except that the operators only process a partition of the input and repartition midway if needed. A naive scheme for parallel fault-tolerance would be to apply the ideas of the previous section only to the operator partitions on each machine that communicate with ingress and egress. Using this technique, we can devote half of the machines in a cluster to the primary partitions of the dataflow and the other half to the secondary, with each machine having an associated pair (copy). We call this scheme cluster pairs. We refer to the set of machines running either the primary partitions or the secondary partitions as a replica set.

With cluster pairs, an operator partition and its copy are not processing input in lock-step; thus, when a machine fails, the copies may be inconsistent. Since partitioned operators communicate via Exchange, we may lose in-flight data on failure, making it impossible to reconcile this inconsistency. Thus, a single machine failure in either the primary or secondary replica set renders the entire dataflow replica set useless. To see why, from our example, imagine a machine in the primary replica set failed with a partition of the two group-by operators on it. Let us call the lower streaming group-by operator G1 and the upper G2. At the time of failure, a secondary partition of G1 may have already produced, say, a hundred tuples that its primary copy was just about to send. Now, if we recovered the state from the remaining secondary partition, all the primary operator partitions of G2 that relied on the failed primary partition would never receive those hundred tuples. From that point forward, the primary partitions would be inconsistent with their replicas. Without accounting for the communication at the Exchange points, we must recover the state of the entire parallel dataflow across all the machines in the primary replica set to maintain consistency and correctness.

[Figure 7: Flux design and normal case protocol]

The MTTF for this naive technique is the same as in the process-pairs approach, (MTTF_c)^2 / MTTR_c, where MTTF_c and MTTR_c are for a replica set. Since the replica set is partitioned over N/2 machines, MTTF_c = 2 * MTTF_s / N. Thus, the overall MTTF is 4 * (MTTF_s)^2 / (N^2 * MTTR_c), a quadratic drop-off with N. The catch-up phase is the bulk of our recovery time; it is the determining factor in this equation. Depending on the available bandwidth during catch-up, MTTR_s <= MTTR_c <= N * MTTR_s. More importantly, we need all N/2 machines available before catch-up completes, and, during catch-up, the entire parallel dataflow is stalled. This stall may lead to unacceptably long response times and perhaps even dropped input.

Clearly, the reliability of the cluster-pairs technique does not scale well with the number of machines, nor does it provide the high availability we want during recovery for our critical, high-throughput dataflow applications. If we had a technique that properly coordinated input to operator partitions at the Exchange points, then we could build parallel dataflows that tolerate the loss of individual partitions. Also, we would only need to recover the state of the failed partition, resulting in improved reliability and availability. In our configuration above, the MTTF for such a scheme would be the time for any paired nodes to fail. Since there are N/2 paired nodes, and the MTTF for any pair is (MTTF_s)^2 / MTTR_s, overall the MTTF would be 2 * (MTTF_s)^2 / (N * MTTR_s). To achieve this improved MTTF, we must modify the Exchange to coordinate operator-partition replicas and properly adjust the routing after failures and recovery.

5. PARTITION PAIRS

Our analysis in the previous section shows that without proper coordination of operator partitions at their communication points, recovery is inefficient, leading to reduced reliability and availability. In this section, we build on the protocols for the single-site case and show how to coordinate operator partitions by modifying the Exchange. We show how to maintain the loss-free and dup-free invariants for operator partitions rather than for entire dataflows. This design permits us to recover the dataflow piecemeal and allows processing for the unperturbed parts of the dataflow to continue, thereby improving both reliability and availability.

Our new operator, Flux, has the same architecture as Exchange; its constituent operators are called F-Cons and F-Prod. For each F-Cons instance in the primary dataflow, there is a corresponding F-Cons instance in the secondary dataflow, and likewise for the F-Prod instances. We call any such pair of instances partition pairs. The F-Cons protocol is similar to the egress's during normal processing and take-over, and similar to S-Cons during catch-up. F-Prod's protocol is similar to S-Prod during the normal case and recovery. During normal processing, each F-Cons instance acks input received from one of its F-Prod instances to its dual F-Prod in the replicated dataflow (see Figure 7). We assume Flux uses the order-preserving variant of Exchange to ensure that the input is consumed in the same order by both partition replicas. F-Prod is responsible for routing output to its primary consumers and incorporating acks from the secondary consumers into its buffer.

There are a few salient differences between the Flux protocol and the cluster-pairs and dataflow-pairs schemes. First, the Flux normal-case protocol is symmetric between the two dataflows. This property makes Flux easier to implement and test because it reduces the state space of possible failure modes and therefore the number of cases to verify. In the rest of this paper, we artificially distinguish between the primary and secondary versions of the F-Cons and F-Prod. From the point of view of an operator, we use the adjective primary to mean within the same dataflow and secondary to mean within the dual dataflow. Second, Flux handles both multiple producers and multiple consumers. In a partitioned-parallel dataflow, these producers and consumers are partitions of dataflow operators. Finally, the instances of F-Prod or F-Cons (and the operators in their corresponding dataflow segments) are free to be placed on any machine in a cluster as long as they fail independently. This flexibility is useful for administrative and load-balancing purposes. The independence requirement leads to at least one constraint: no two replicas of a partition are on the same machine.

In this section, we describe the modifications necessary to the previous normal-case and recovery protocols to accommodate all-to-all communication between partitioned operators. Since there may be many such communication points in a dataflow, recovery proceeds bottom-up, recovering one level at a time. We have already shown the base cases for the entry and exit points, and now we show the inductive step at the Exchange points.

5.1 Flux Normal Case

We specify the normal-case forwarding, buffering, and acking protocol Flux uses to guard against unexpected failures. F-Cons behaves the same as the egress in Figure 2 except that it manages multiple connections for multiple producers. For each tuple received from a connection to a primary F-Prod, it sends an ack of the tuple's sequence number to the corresponding secondary F-Prod, before considering the tuple for any further processing. Meanwhile, for each destination, an F-Prod instance obeys the same abstract normal-case specification for producers shown in Figure 3. The actions remain the same, but the state size increases: F-Prod maintains one set of state variables for each consumer partition pair. We use a subscript i to denote the variable associated with partition i. Unlike the ingress operator of Figure 2, however, F-Prod only forwards data to the primary partition, conn_i[PRIM] = {SEND}, and only processes acks from the secondary, conn_i[SEC] = {ACK}. Finally, once a tuple has been produced, sent to the primary, and acked by the secondary, it is evicted from the buffer, i.e., del_i = {PROD, PRIM, SEC}.

Since F-Prod does not remove any tuples until an ack has been received, all in-flight and undelivered tuples from F-Prod's replica are in its buffers or will eventually be produced. Thus, this scheme ensures the loss-free invariant with up to a single failure per partition pair. In the next section, we describe the take-over protocol, which allows the dataflow to continue processing after failures and ensures that no tuples will be duplicated to consumer instances.

5.2 Flux Take-Over

Take-over ensures that regardless of the number of machine failures, as long as only one replica of each partition pair fails, the dataflow will continue to process incoming data and deliver results. Since the normal-case protocol is symmetric, there are only four distinct configurations in which a particular (F-Prod_P, F-Prod_S, F-Cons_P, F-Cons_S) quartet can survive after failures. Without loss of generality, these cases are: F-Cons_S fails; F-Prod_S fails;
F-Cons_S and F-Prod_S fail; or F-Cons_P and F-Prod_S fail. We describe the take-over actions that handle these cases.

For F-Cons, the take-over specification is exactly the same as the one for egress, except that the fail message now specifies exactly which partition, i, and which copy of F-Prod failed. If the primary fails, F-Cons sends a reverse message to the secondary and begins receiving data from the secondary. Like S-Cons, it also notes, using p_fail, whether its own replica failed, because it is then responsible for catch-up, as described in the next section.

For F-Prod, take-over is a combination of the ingress actions and the S-Prod actions. Like ingress, when it detects a failure of a consumer instance, it marks all unacked sequence numbers in the corresponding buffer for the failed copy using ackall(dest, del). In this case, in addition to removing entries with all three marks, this method also removes entries with only the dest mark, leaving only entries relevant to the remaining consumer partition. So, if the secondary consumer fails, the method will remove all dangling sequence numbers from the secondary, and the buffer will contain only undelivered tuples for the primary. Or, if the primary fails, the buffer may contain tuples not yet acked by the secondary or dangling sequence numbers from the secondary. Moreover, F-Prod also adjusts del to ignore the failed partition during the processing between take-over and catch-up.

Like S-Prod, F-Prod notes whether its own replica failed and handles reverse messages from secondary consumer partitions. The action enabled in this case is different (it replaces Figure 4-(2)):

  reverse(i, SEC) { conn_i[SEC] := {SEND}; r_done_i := true; del_i := del_i + {SEC}; }

This state change allows F-Prod to properly forward to both consumer partitions if its replica fails, and to just the secondary consumer if the primary consumer of partition i also fails.

After take-over, we observe that the Flux protocol maintains the loss-free and dup-free invariants by noting its similarity to the protocol on the egress side. In the egress case, we assumed the egress was fault-tolerant; in this case, F-Cons is fault-tolerant because it is replicated. If F-Cons_S fails, or F-Cons_S and F-Prod_S fail, the failed partitions are ignored and the buffer entries are modified accordingly. If F-Prod_S fails, or F-Prod_S and F-Cons_P fail, a reverse message is sent, and once it is received, F-Prod_P feeds the remaining consumer(s). In the latter cases, the buffer filters duplicates. In all cases, the remaining partitions in both dataflows continue processing. For the interested reader, a straightforward case-by-case analysis similar to Section 3.2 shows that our invariants hold.

5.3 Flux Catch-Up

The catch-up phase for a partition pair is similar to the one described for a single-site dataflow. In this section, we detail the differences. For the purposes of exposition, we assume no failures occur during catch-up of a single partition. Achieving the idempotency property in this case is akin to that in Section 3.4. Once a newly reset node or a standby machine is available, the catch-up phase ensues. Each failed dataflow segment (a partition with its Flux operators) is recovered individually, bottom-up. F-Cons initiates catch-up when it recognizes that catch-up for the previous level is complete and that take-over is complete for its downstream F-Prod (or S-Prod). Like S-Cons, it stalls the operators within its dataflow segment and transfers state through the StateMover. Once finished, synchronization messages are broadcast to all primary and secondary consumer partitions at the top of the segment, and to all primary and secondary producer partitions at the bottom, to fold in the new partition replica.

There are a few differences between catch-up in this case and in the single-site case, which we outline first. First, before state-movement begins, F-Cons cannot just stall the incoming connections by setting them to PAUSE. Instead, F-Cons pauses the primary incoming connections through a distributed protocol that stalls the outgoing connections from all of its primary F-Prods. Second, when the contents of the buffer in F-Prod are installed at the standby, all markings in the entries must be reversed and the cursor positions swapped. We need this reversal because the primary and secondary destinations are swapped for the new replica. Finally, the fold-in synchronization messages received at F-Prod and F-Cons are handled slightly differently from those at the ingress and egress operators. Below, we detail the distributed pausing protocol and fold-in.

5.3.1 State-Movement

When F-Cons begins state-movement, all of its primary and secondary producer instances are finished with catch-up. Thus, it is in the first survival scenario: F-Cons_S has failed.
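The mark-based buffer bookkeeping that these protocols rely on, ack() and ackall() over the del set (Sections 3.2 and 5.2), can be sketched as follows. This is a hypothetical Python rendering of the single-pair ingress variant with invented names, not the actual implementation; F-Prod's ackall() additionally drops entries holding only the failed destination's mark.

```python
# Hypothetical sketch of the mark-based buffer: each entry carries
# marks (PROD when produced, PRIM/SEC as the copies consume or ack it)
# and is evicted once its marks cover the del set. ack_all() implements
# take-over for a dead connection.
PROD, PRIM, SEC = "PROD", "PRIM", "SEC"

class MarkBuffer:
    def __init__(self):
        self.entries = {}                    # sequence number -> set of marks
        self.delete_when = {PROD, PRIM, SEC} # the del set

    def mark(self, sn, m):
        """Record a produce or an ack; evict once all required marks arrive."""
        marks = self.entries.setdefault(sn, set())
        marks.add(m)
        if self.delete_when <= marks:        # all marks in del present
            del self.entries[sn]

    def ack_all(self, dead):
        # Take-over: treat every buffered entry as acked by the dead copy,
        # then stop requiring its acks so the buffer cannot grow forever.
        for sn in list(self.entries):
            self.mark(sn, dead)
        self.delete_when.discard(dead)
```

During normal processing a tuple accumulates PROD, PRIM, and SEC marks in some order and is evicted on the third; after ack_all(SEC), the PROD and PRIM marks alone suffice for eviction.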
If the remaining F-Cons_P just pauses the connection locally and copies over its state (see Figure 5-(2)), F-Prod_P will not properly account for the data in flight to F-Cons_P immediately after the pause. Those tuples no longer exist in F-Prod_P's buffer because it is not receiving or accounting for acks. However, acks for those in-flight tuples will arrive at F-Prod_P after catch-up via the new F-Cons_S, fed from F-Prod_S. And these acks would remain in the buffer indefinitely.

Hence, we must ensure there is no in-flight data before state-movement begins. To do so, F-Cons_P sends a pause message to all of its producers, which, upon receipt, pause their outgoing connection and enqueue an ack for the pause on that connection. Once F-Cons_P receives all the incoming pause-acks, it can begin state transfer. A slight complication arises if F-Cons is order-preserving. Since a pause-ack can arrive at any time, it may prevent F-Cons from merging and consuming incoming tuples from other, unstalled incoming connections. This deadlock occurs because, when merging in order, inputs from all partitions are necessary to select the next tuple in line. Thus, F-Cons must buffer the in-flight tuples internally in order to drain the network and receive all pause-acks. Of course, all buffered tuples still need to be acked to F-Prod_S. State transfer is accomplished through the StateMover and is the same as in the single-site case.

5.3.2 Fold-In

After state-movement, the two partition copies are consistent, and we must restart all connections to the new replica without losing or duplicating input. We first discuss how F-Prod folds in a new F-Cons replica, and then describe how F-Cons folds in a new F-Prod.

Both F-Cons_P and the new F-Cons_S broadcast to all producer partitions a csync message with the corresponding checkpoint version number. The actions taken by F-Prod are now slightly different from those at the ingress (this replaces Figure 5-(6)):

  csync(dest, v) {
    ver[dest] := v;
    status[dest] = ACTIVE -> { del += p(dest); B.reset(p(dest));
      dest = PRIM -> { conn[PRIM] := {SEND}; ver[SEC] = v -> conn[SEC] := {ACK}; }
      dest = SEC ^ ver[PRIM] = v -> conn[PRIM] := {SEND}; }
    status[dest] = STDBY -> { status[dest] := ACTIVE;
      dest = PRIM ^ ver[SEC] = v -> conn[PRIM] := {SEND};
      dest = SEC ^ ver[PRIM] = v -> conn[SEC] := {ACK}; }
  }

Like the ingress, if a csync arrives from the active connection, F-Prod resets the buffer and modifies del to account for the new replica. Otherwise, further processing of data (or acks) to (or from) the active copy might evict tuples in the buffer that are intended for the new replica. Note that during state movement, the primary forwarding connection is paused for both F-Prod replicas. If the primary connection is connected to the active replica and a csync arrives, the connection is immediately unpaused. If the primary connection goes to a standby replica, the connection is restarted only after the csync from the secondary has arrived. This allows F-Prod_S to consume all in-flight acks sent before the movement. Otherwise, F-Prod_S might duplicate tuples already consumed by F-Cons_P to the new F-Cons_S. In any case, we start the standby connection only after both syncs arrive.

[Figure 8: Performance During Recovery — (a) Throughput, (b) Avg. Latency]

At the top of the dataflow segment, like S-Prod, both F-Prod operators broadcast a psync to all remaining F-Cons. Unlike S-Prod, however, F-Prod can have either one or both F-Cons remaining. F-Prod still broadcasts the psync, but if it is forwarding data to its secondary, it stops immediately and begins processing acks instead, i.e., conn[SEC] := {ACK}. The F-Cons at the next level must also handle the psync messages differently, according to the following action (replacing Figure 5-(8)):

  psync(dest, v) {
    ver[dest] := v; status[dest] := ACTIVE;
    dest = SEC ^ ver[PRIM] = v -> { conn[SEC] := {ACK}; conn[PRIM] := {RECV}; }
    dest = PRIM -> { conn[PRIM] := {PAUSE};
                     ver[SEC] = v -> { conn[SEC] := {ACK}; conn[PRIM] := {RECV}; } }
  }

To avoid missing acks or sending redundant acks, F-Cons cannot activate a connection upon receiving a psync unless the corresponding psync from the replica has arrived. Thus, if a psync arrives from the primary connection first, then we must pause that connection until the second psync has arrived. Otherwise, F-Cons might miss acking tuples received after state-movement. If a psync arrives from the secondary first, F-Cons cannot start sending acks until the psync arrives from the primary. Otherwise, F-Cons might resend acks for tuples sent before state-movement. Unlike egress, after both psyncs arrive, F-Cons_P acks F-Prod_S and receives data from F-Prod_P regardless of which was active before catch-up.

5.4 Experiment

In this section, we illustrate the benefits of our design by examining the performance of a parallel implementation of our example dataflow during failure and recovery. We implemented a streaming hash-based group-by aggregation operator (2K lines of C code), our boundary operators, and Flux (11K lines of C code) within the TelegraphCQ open-source code base.

In this experiment, we partition this dataflow across four machines in a cluster and place the ingress and egress operators on a separate, fifth machine. The ingress has a 400K duplicate-elimination buffer. We insert a Flux after the first group-by operator to repartition its output on (app, src). At startup, we place a partition of each operator on each of the four machines, numbered 0 to 3, and replicate them using a chained declustering strategy [16]. That is, each primary partition has its replica on the next machine, and the last partition has its replica on the first. For example, the primary copy of partition 3 of the first group-by is on machine 3 and its replica is on machine 0. In this configuration, when a single machine fails, all four survival scenarios occur in different partitions. At startup time, we introduce a standby machine with operators in their initial state. We have not implemented the controller, because it involves well-known techniques implemented by standard cluster-management software [33]. We simulate failure by killing the TelegraphCQ process on one of the machines, which causes connections to that machine to close and raise an error. Each machine has a Pentium III 1.4GHz CPU and 512MB of RAM, and is connected to a 100Mbps switch.

For the purposes of the experiment, to approximate a high-throughput network-monitoring workload, our ingress operator generates sequentially numbered session start and end events as fast as possible. There are 10K unique (app, src) values (uniformly distributed) and 100K unique (src, dst) pairs. The second group-by outputs running statistics every other update. With this setup, Figure 8 shows the total output rate and the average latency per tuple at the egress. In this experiment, the network is the bottleneck.

At t = 20 sec, when the experiment reaches steady state, we kill one of the four machines. The throughput remains steady for about a quarter of a second, and then suddenly drops. The drop occurs because, during state movement, the partition being recovered is stalled and eventually causes all downstream partitions to stall as well. In this experiment, about 8.5MB of state was transferred in 941 msec. Once catch-up is finished, at about t = 21 seconds, we observe a sudden spike in throughput. This spike occurs because, during movement, the queues to the unaffected partitions fill and are ready to be processed once catch-up completes. Around the same time, Figure 8(b) shows an increase in latency because the input and in-flight data are buffered during movement. Then the output rate and average latency settle back to normal. During this entire experiment, the input rate at the ingress stayed at a constant 42K tuples/sec with no data dropped. This experiment illustrates that with piecemeal recovery and sufficient buffering (400K), we can effectively mask the effects of machine failures.

To understand the overheads of Flux, we added just enough CPU processing to the lower-level group-by to make it the bottleneck. In this configuration, the input rate was 82K tuples/sec for a single parallel dataflow, 40K tuples/sec for cluster pairs, and 36K tuples/sec for Flux (10% slower than cluster pairs). Additional processing would only reduce the Flux overhead relative to the others.

6. RELATED WORK

There is a plethora of work on fault-tolerance, availability, and recovery, but the most closely related work encompasses mechanisms for making generalized computations fault-tolerant. In contrast, our work focuses on a narrower but still generally useful style of computation: CQ dataflows.

The replicated state-machine approach [21] coordinates redundant computation to provide protection against faults in a distributed environment. Schneider [29] provides a survey of state-machine-based approaches. The critical step in this approach is to reach consensus among the replicas on a consistent view of the input sequence. The Paxos [22] algorithm is the most fault-tolerant method for reaching distributed consensus. Process-pairs [15] is similar because it coordinates two processes, but it differs because a single leader determines the input order. Persistent queues [4] are an abstraction that makes messages persistent across failures. They are too
heavyweight for our scenario because they provide transactional semantics and depend on stable storage. These schemes are black-box techniques that address a client-server model rather than a chain of computations, one feeding the next. They do not exploit the structure of parallel CQ dataflows to provide improved reliability and availability.

There are a number of disk-based checkpoint-and-replay schemes for message-passing systems; the authors in [11] survey these methods. The Phoenix project [23] leverages these techniques to provide persistent COM components. These rollback-recovery techniques provide reliability, but not the high availability needed for critical data-streaming applications.

The Isis [5], Horus, Ensemble [6], and Spread [1] projects are generic group-membership and communication toolkits for building highly available applications. Isis maintains group membership and provides the application programmer with reliable communication primitives such as group broadcast and atomic broadcast. Horus and Ensemble are extensible systems in which the programmer can layer these reliable communication primitives as necessary for his or her application. Spread provides similar abstractions for wide-area networks. Such toolkits can be used to implement the controller in our system. They have also been used to perform efficient, eager database replication [19]. But their primitives offer semantics different from those necessary for partitioned parallel dataflow computation.

Finally, we list recent work on availability for CQ dataflows. Our work in [30] describes mechanisms for extending Exchange to provide load-balancing dataflows. Work in the Aurora system [18] describes techniques for building highly available, wide-area CQ dataflows with stateless operators. In contrast, our techniques handle more general, stateful operators and address parallel dataflows.

7. CONCLUSION
In this paper, we show how to achieve high availability and fault-tolerance for critical, long-running, parallel dataflows. Our main contribution is a technique for coordinating replicas of operator partitions within a larger parallel dataflow. It is a delicate combination of partitioned parallelism and process pairs. Our technique is more reliable and more available than the straightforward cluster-pairs approach. Our scheme provides online recovery without stalling the ongoing dataflow computation because it allows the dataflow to be recovered piecemeal. The protocols we describe are encapsulated in an opaque dataflow operator called Flux. Thus, an application developer can reuse Flux with a variety of operators to make existing, brittle dataflows more robust. We believe integrating load-balancing mechanisms into Flux is a necessary next step for providing performance availability [30].

8. REFERENCES
[1] Y. Amir and J. Stanton. The Spread Wide Area Group Communication System. Technical Report CNDS-98-4, Johns Hopkins, 1998.
[2] T. Anderson, D. Culler, and D. Patterson. A Case for Networks of Workstations: NOW. IEEE Micro, Feb. 1995.
[3] J. Baulier, S. Blott, H. Korth, and A. Silberschatz. A Database System for Real-Time Event Aggregation in Telecommunication. 1998.
[4] P. A. Bernstein, M. Hsu, and B. Mann. Implementing Recoverable Requests Using Queues. In SIGMOD, 1990.
[5] K. Birman et al. ISIS: A System for Fault-Tolerant Distributed Computing. Technical Report TR86-744, Cornell, 1986.
[6] K. Birman et al. The Horus and Ensemble Projects: Accomplishments and Limitations. Technical Report TR99-1774, Cornell, 1999.
[7] D. Carney et al. Monitoring Streams - A New Class of Data Management Applications. In VLDB, 2002.
[8] S. Chandrasekaran et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In CIDR, 2003.
[9] J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In SIGMOD, 2000.
[10] D. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. CACM, June 1992.
[11] E. Elnozahy, D. Johnson, and Y. Wang. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. Technical Report CMU-CS-96-181, CMU, 1996.
[12] S. Gilbert and N. Lynch. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. SIGACT News, 2002.
[13] G. Graefe. Encapsulation of Parallelism in the Volcano Query Processing System. In SIGMOD, 1990.
[14] G. Graefe. Query Evaluation Techniques for Large Databases. ACM Computing Surveys, June 1993.
[15] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[16] H. Hsiao and D. DeWitt. Chained Declustering: A New Availability Strategy for Multiprocessor Database Machines. In ICDE, 1990.
[17] S. Hvasshovd et al. The ClustRa Telecom Database. In VLDB, 1995.
[18] J. Hwang et al. A Comparison of Stream-Oriented High-Availability Algorithms. Technical Report CS-03-17, Brown, 2003.
[19] B. Kemme and G. Alonso. Don't Be Lazy, Be Consistent. In VLDB, 2000.
[20] C. Kruegel, F. Valeur, G. Vigna, and R. A. Kemmerer. Stateful Intrusion Detection for High-Speed Networks. In IEEE Symposium on Security and Privacy, May 2002.
[21] L. Lamport. The Implementation of Reliable Distributed Multiprocess Systems. Computer Networks, 1978.
[22] B. Lampson. The ABCD's of Paxos. In PODC, Aug. 2001.
[23] D. Lomet and R. Barga. Phoenix Project: Fault Tolerant Applications. SIGMOD Record, June 2002.
[24] S. Madden and M. Franklin. Fjording the Stream: An Architecture for Queries over Streaming Sensor Data. In ICDE, 2002.
[25] M. Mehta and D. DeWitt. Managing Intra-operator Parallelism in Parallel Database Systems. In VLDB, 1995.
[26] R. Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. In CIDR, 2003.
[27] V. Paxson. Bro: A System for Detecting Network Intruders in Real-Time. Computer Networks, 1999.
[28] F. Schneider. Byzantine Generals in Action: Implementing Fail-Stop Processors. ACM Transactions on Computer Systems, May 1984.
[29] F. Schneider. Implementing Fault-Tolerant Services Using the State-Machine Approach: A Tutorial. ACM Computing Surveys, Dec. 1990.
[30] M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin. Flux: An Adaptive Partitioning Operator for Continuous Query Systems. In ICDE, 2003.
[31] T. Urhan and M. Franklin. XJoin: A Reactively-Scheduled Pipelined Join Operator. IEEE Data Engineering Bulletin, June 2000.
[32] G. Vigna, W. Robertson, V. Kher, and R. A. Kemmerer. A Stateful Intrusion Detection System for World-Wide Web Servers. In ACSAC, 2003.
[33] W. Vogels et al. The Design and Architecture of the Microsoft Cluster Service. In FTCS, 1998.