Slingshot: Time-Critical Multicast for Clustered Applications

Mahesh Balakrishnan    Stefan Pleisch    Ken Birman
Department of Computer Science
Cornell University, Ithaca, NY 14853
mahesh, pleisch, ken@cs.cornell.edu

Abstract

Datacenters are complex environments consisting of thousands of failure-prone commodity components connected by fast, high-capacity interconnects. The software running on such datacenters typically uses multicast communication patterns involving multiple senders. We examine [...]

* Our effort is supported by DARPA's SRS program, the AFRL/Cornell Information Assurance Institute, the Swiss National Science Foundation, and a MURI grant.

[...] useful building block for constructing datacenter applications. However, the unique combination of a service-oriented architecture and a datacenter networking topology is one that existing protocols are not designed to deal with efficiently. Communication between and within services results in multicast groups with many senders transmitting small messages at uneven rates, across nodes and over time. State-of-the-art reliable multicast protocols - especially those optimized for delivery time - are not designed to deal with such communication patterns. Further, datacenter topologies occupy a unique point in the networking problem space, scaling beyond the reach of conventional LAN protocols while escaping much of the complexity of WAN deployments. Exploiting the positive aspects of such topologies - such as high bandwidth, proximity, and flat routing structure - while retaining scalability is an open problem.

Our key idea is to employ Forward Error Correction (FEC) at the receivers of a multicast. FEC is a well-known technique for introducing reliability into multicast, involving the injection of redundant packets into a stream to enable receivers to recover from packet loss. Traditional FEC is strictly a sender-based mechanism, and has been used to good effect in single-sender settings involving steady constant-rate streams of data. We explore the idea of having receivers in a multicast group encode incoming data into FEC packets, which are then exchanged proactively to repair losses. We present a protocol called Slingshot, which layers unreliable IP Multicast with a gossip-style scheme for disseminating receiver-generated FEC packets. Slingshot offers applications tunable, probabilistic guarantees on timeliness, and allows users to choose either probabilistic or complete reliability.

We evaluate Slingshot on a rack-style cluster and compare its time-critical properties to Scalable Reliable Multicast [8], a well-known reliable multicast protocol. We believe SRM to be a closer match for time-critical, multi-sender settings than other existing protocols, and hence a valid baseline to measure Slingshot against. Slingshot achieves recovery of lost packets two orders of magnitude faster than SRM in our evaluation, a finding that highlights the value of receiver-based FEC in the datacenter. We supplement this evaluation on a real cluster with simulation results that highlight the scalability of Slingshot to larger system sizes.

The rest of the paper is organized as follows: In Section 2 we describe the operating conditions for a datacenter multicast protocol, in terms of workload and network behavior, and how existing reliable multicast protocols perform within these constraints. Section 3 describes the operation of Slingshot in detail, along with its overheads and assumptions, and provides an analysis of the protocol. Section 4 presents evaluation results for Slingshot. Section 5 places Slingshot in context with related work, and Section 6 concludes the paper.

2 Design Space

A reliable multicast protocol aimed at time-critical datacenter settings has to take into account the expected nature of communication within groups, as well as the characteristics of the underlying network. As mentioned in the introduction, the service-oriented nature of datacenter computation results in multicast groups with large numbers of senders transmitting at varying rates. We obtain some insight into the characteristics of datacenter
networks from a recent description of the Google File System (GFS) [10], which is set against a networking backdrop of large clusters of machines distributed across many machine racks, with switches facilitating inter-rack communication. The paper's evaluation setup involves 1 Gbps links between switches and 100 Mbps links to machines, indicating that high-bandwidth links are the norm. The combination of such flat, high-capacity routing structures and inexpensive commodity hosts results in communication characteristics in the datacenter being determined by the end-points, rather than the network. For instance, inter-node latency is dominated by the time spent by packets in either protocol stack, with a negligible quantum of time spent on the actual wire. More pertinently for reliable multicast protocols, packet loss occurs at the end nodes due to buffer overflows, and not at intermediary routers and switches. This allows a datacenter multicast protocol to assume that packet loss occurs independent of network structure (though other kinds of loss correlation will exist, such as nodes running the same services getting overloaded at the same time). Further, the fact that time-critical data entering a node comprises only a fraction of the total incoming traffic at that node allows us to assume that there is no significant temporal correlation of losses; a buffer overflow would affect applications across the board, and not just the time-critical segment. Hence, we can assume that the pattern of losses visible to the time-critical protocol is not bursty and optimize for the case where packets are lost singly.

With this set of assumptions in mind, we examine the current solution space available to reliable multicast protocol designers. Most reliable multicast protocols layer a packet loss discovery/recovery mechanism over an unreliable delivery primitive, such as IP Multicast [6] or some form of overlay multicast [4]. While IP Multicast is not widely deployed on the Internet, it is a viable alternative in a datacenter context, given the attendant administrative homogeneity and relatively lower number of dimensions in which scalability is required. Introducing reliability over this unreliable delivery primitive decomposes into two intertwined questions: detecting that a node has not received a packet, and recovering that lost packet. Existing reliability schemes can be divided based on delegation of responsibility into two classes: sender-based and receiver-based.
2.1 Sender-Based Reliability

In sender-based schemes [14, 12], the sender of a multicast is responsible for ensuring that the data is delivered to all receivers. The trivial extension of unicast reliability to a multicast scenario involves positive acknowledgements: each receiver sends an ACK for every packet back to the sender. However, if multicast is used heavily, this causes ACK implosion, where the sender is overwhelmed by acknowledgements from its many receivers. A standard way to avoid ACK implosion is to use ACK trees, which are used to aggregate acknowledgements before passing them on to the sender. For example, RMTP and RMTP-II [14, 12] use such a hierarchical structure to collect ACKs and respond to retransmission requests (both are designed for single-sender settings). In a time-critical setting, though, any kind of hierarchical structure imposes unacceptable latency on the discovery process, since the sender has to wait for a reasonable amount of time to allow the acknowledgement aggregate to percolate up the tree before it declares packets lost. In general, sender-based reliability mechanisms have several disadvantages. In many cases, the sender is likely to be busy, and going back to it for retransmissions might overload it. Also, the round trip time to the sender might add unnecessary latency to the packet recovery process, particularly if there are less loaded receivers near the affected node from which lost packets could be recovered.

2.2 Receiver-Based Reliability

Receiver-based schemes [1, 8] for reliable multicast place the burden of discovering and recovering from packet loss on the receivers. Many solutions use sender-specific sequencing for discovering packet loss: the sender numbers the messages it multicasts, and a receiver knows it missed a message if it receives the next message in the sequence. If messages are delivered out of order by the transport subsystem, timeout thresholds are used to determine if a packet is truly lost. Once a packet is declared lost, one recovery mechanism is to send a NAK, or negative acknowledgment, requesting retransmission of the packet. The NAK can be sent back to the sender, to a nearby receiver, or multicast to the entire group, as in SRM [8]. SRM discovers loss only when it receives the next packet from the same sender; an alternative, such as in Bimodal Multicast [1], is to have receivers gossiping message histories with each other to expedite discovery of loss. Here, once a packet is discovered to be missing, a request for it can be sent back to the node who initiated the gossip exchange and the packet can be recovered.

2.3 Forward Error Correction

Forward Error Correction (FEC) [11, 13, 16] is used in scenarios where the latency of a two-phase discovery/recovery mechanism is unacceptable, and where contacting the sender for missing packets is undesirable or impossible. In its simplest form, FEC involves creating c repair packets from r data packets such that any r out of the r + c resulting packets is enough to recover the original r data packets [11]. Traditional applications of FEC to reliable multicast have the sender generate c repair packets for every r data packets and inject them into the data stream. Hence, for every block of r + c packets, a receiver is insulated from up to c packet losses. FEC is particularly attractive as a reliability mechanism as it imposes a constant overhead on the system and has easily understandable behavior under arbitrary network conditions. However, it is designed primarily for situations where a single sender is transmitting data at a high, steady rate: for example, bulk file transfers [2], or video and audio feeds [3].
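As a concrete illustration of the simplest form of FEC, the sketch below (Python, our own illustration rather than code from the paper, assuming fixed-length packets) shows how a single XOR parity packet computed over a block of data packets lets a receiver reconstruct any one missing packet in that block; this XOR primitive is the same one Slingshot later applies at the receivers.

    # Minimal XOR-parity FEC sketch (illustrative only; names are ours, not the paper's).
    # One repair packet, XORed over r equal-length data packets, can recover exactly
    # one missing packet from that block.

    def make_repair(block):
        """XOR all packets in the block into a single repair packet."""
        repair = bytearray(len(block[0]))
        for packet in block:
            for i, b in enumerate(packet):
                repair[i] ^= b
        return bytes(repair)

    def recover(received, repair):
        """Recover the single missing packet by XORing the repair with every received packet."""
        missing = bytearray(repair)
        for packet in received.values():
            for i, b in enumerate(packet):
                missing[i] ^= b
        return bytes(missing)

    if __name__ == "__main__":
        r = 4
        block = [bytes([seq]) * 8 for seq in range(r)]   # four toy 8-byte data packets
        repair = make_repair(block)
        received = {seq: block[seq] for seq in range(r) if seq != 2}  # packet 2 is lost
        assert recover(received, repair) == block[2]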
3 Slingshot

Slingshot uses an unreliable multicast mechanism such as IP Multicast for an initial best-effort delivery, and then injects point-to-point error correction traffic between receivers to proactively recover lost packets. Like conventional FEC, it generates a constant percentage of error correction packets, with the difference that each receiver encodes over all incoming packets within a group, as opposed to a sender encoding on the packets it sends. In some sense, Slingshot leverages the multiplicative increase in message density at multicast receivers (the fundamental cause of ACK implosion) to use error correction techniques that require a certain critical rate of input packets, decreasing the latency of packet loss recovery/discovery as a result. Below, we describe the basic operation of Slingshot, describe the nature of the overheads it imposes and offer analytical bounds on its performance.

3.1 Protocol Details

Message loss discovery and recovery in Slingshot occurs in two phases. Phase 1 involves the proactive dissemination of repair packets between receivers, and guarantees fast discovery and recovery of a high percentage of lost packets; this is the main contribution of this paper. Phase 2 is optional and configurable, and allows for complete recovery of all packets. Slingshot deals with two types of packets: data packets, which contain the original data multicast within the group and are uniquely identified by message IDs in the form of (sender, sequence number) tuples, and repair packets, which contain recovery information for data packets.

In Phase 1, Slingshot introduces a constant percentage overhead on communication by having receivers send each other repair packets. For every r data packets that a node receives via the underlying unreliable multicast, it generates a repair packet using FEC and sends it to c other randomly selected receivers. Slingshot uses XOR, the simplest and fastest form of FEC, which allows the recipient of a repair packet to recover from one missing data packet. To ensure that both data and repair packets are sent in the network without fragmentation, data packets are limited to a size slightly smaller than the Maximum Transmission Unit (MTU); this ensures that a repair packet has space for the XOR, which is equal to the size of a single data packet, and a list of contents, comprised of message IDs describing data packets which can be recovered using it. We say that a data packet is contained within a repair packet if it can be recovered from the latter. The fraction of overhead traffic and the resulting percentage of lost packets recovered by Phase 1 are directly determined by the 2-tuple parameter (r, c), which we call the rate-of-fire. We give an analysis for the expected percentage of lost packets recovered in Section 3.4.

To facilitate recovery of lost data packets from repairs, each node maintains a data buffer, where it stores incoming data packets. When a node receives a repair packet, it first checks the list of contents to determine if there are data packets included that it has not received. If only one such packet is included in the repair, it retrieves the other data packets from its data buffer and combines them with the XOR contained in the repair packet to recover the lost packet. Further, each node maintains a repair bin where it stores pointers for up to r recently received data packets, to be included in the next repair packet it sends.

The probabilistic nature of Phase 1 results in a small percentage of lost packets going unrecovered. This usually happens when all incoming repairs at a node containing a particular lost data packet include other losses, making them useless, and also in the unlikely event that a node does not receive a repair containing the lost packet, due to the random manner in which nodes pick destinations for repair packets. The probabilities attached to these two cases depend on the rate-of-fire parameter, allowing us to tune the reliability provided by Phase 1 to desired levels. Even if the FEC traffic in Phase 1 does not enable packet recovery, it ensures that loss is discovered very quickly. In the case that the packet is not recoverable due to all relevant incoming repairs containing multiple losses, discovery takes place when the first such repair is received. At this point, the node can either wait for more repair packets to arrive, or run Phase 2 after some timeout period has elapsed.
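The sketch below (Python, again our own illustration under assumed helper names rather than the authors' code) mirrors the Phase 1 receiver logic just described: data packets accumulate in a repair bin, every r-th packet triggers an XOR repair sent to c receivers drawn at random from the node's view, and an incoming repair is usable whenever exactly one packet in its contents list is missing.

    import random

    # Illustrative Phase 1 receiver state (names and structure are ours, not the paper's).
    # Assumes fixed-size (padded) payloads and a view with at least c other receivers.
    class Phase1Receiver:
        def __init__(self, view, rate_of_fire, send):
            self.view = view                  # other receivers in the group
            self.r, self.c = rate_of_fire     # rate-of-fire (r, c)
            self.send = send                  # callback: send(destination, repair) point-to-point
            self.data_buffer = {}             # message ID -> payload
            self.repair_bin = []              # up to r recently received (msg_id, payload) pairs

        def on_data(self, msg_id, payload):
            """Called for every data packet delivered by the unreliable multicast."""
            self.data_buffer[msg_id] = payload
            self.repair_bin.append((msg_id, payload))
            if len(self.repair_bin) == self.r:
                repair = self._compose_repair()
                for dest in random.sample(self.view, self.c):
                    self.send(dest, repair)
                self.repair_bin = []

        def _compose_repair(self):
            """XOR the r buffered payloads and attach their message IDs as the contents list."""
            xor = bytearray(len(self.repair_bin[0][1]))
            for _, payload in self.repair_bin:
                for i, b in enumerate(payload):
                    xor[i] ^= b
            return {"contents": [m for m, _ in self.repair_bin], "xor": bytes(xor)}

        def on_repair(self, repair):
            """Recover a lost packet if exactly one packet in the contents list is missing."""
            missing = [m for m in repair["contents"] if m not in self.data_buffer]
            if len(missing) == 1:
                recovered = bytearray(repair["xor"])
                for m in repair["contents"]:
                    if m != missing[0]:
                        for i, b in enumerate(self.data_buffer[m]):
                            recovered[i] ^= b
                self.data_buffer[missing[0]] = bytes(recovered)
                return missing[0], bytes(recovered)
            return None   # zero missing: nothing to do; two or more: repair unusable for now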
In Phase 2, the node initiates recovery of the lost packet by sending an explicit request to some other node. If we use a scheme similar to Bimodal Multicast, this explicit recovery request is sent to some arbitrarily chosen receiver of the multicast - for instance, the sender of the repair packet - which services the request from its data buffer. Figure 1 illustrates the difference between Slingshot and Bimodal Multicast, if Phase 2 is structured in this manner. Another option is to send the retransmit request back to the sender, which requires senders to either buffer sent packets or delegate responsibility for reconstructing the packet to the application, as in SRM. Algorithm 1 gives the complete Slingshot algorithm, substituting the first option for Phase 2.

Figure 1. Comparison with Bimodal Multicast: both start with an initial multicast; Bimodal Multicast follows it with gossiping and then explicit requests, while Slingshot sends FEC-based repairs in Phase 1 and explicit requests only in Phase 2.

Algorithm 1. Slingshot algorithm (receiver code).

    Initialisation:
        DataBuffer  := {}    -- contains received data packets
        RepairBin   := {}    -- data packets used to create the next repair packet
        LastSeqNo[s]         -- last known data packet from each sender s
        KnownLost   := {}    -- list of message IDs thought to be lost

    r-multicast(m): send(m) to all processes in the group

    on reception of data packet dp from sender s:
        deliver dp to the application
        DataBuffer := DataBuffer ∪ {dp}
        RepairBin  := RepairBin ∪ {dp}
        markLost(dp.id)
        KnownLost  := KnownLost \ {dp.id}
        composeRepairPac()

    Phase 1 recovery: on reception of repair packet rp from sender s:
        for all message ids id ∈ rp.MsgIdList: markLost(id)
        if exactly one id ∈ rp.MsgIdList is missing from DataBuffer then
            recover the data packet dp corresponding to this id
            deliver dp to the application
            DataBuffer := DataBuffer ∪ {dp}
            KnownLost  := KnownLost \ {dp.id}

    procedure markLost(id):    -- marks unreceived messages up to the passed-in id as lost
        for i := LastSeqNo[id.sender] + 1 to id.seqno:
            KnownLost := KnownLost ∪ {(id.sender, i)}
        LastSeqNo[id.sender] := id.seqno

    procedure composeRepairPac():    -- constructs an FEC repair packet; (r, c) is the rate-of-fire
        if |RepairBin| = r then
            DestinationSet := select c processes at random
            generate repair packet rp from RepairBin and send it to all processes in DestinationSet
            RepairBin := {}

    Phase 2: on receiving request message (MsgIdList) from process q:
        for all data packets dp s.t. dp.id ∈ MsgIdList and dp ∈ DataBuffer: send dp to q

    Phase 2: task, runs periodically:
        MsgIdList := all MsgIDs in KnownLost for longer than a timeout
        send request message (MsgIdList) to an arbitrary process

3.2 Overheads

In Phase 1, Slingshot imposes a constant percentage communication overhead on the data sent through it, which depends on the rate-of-fire (r, c); for every r packets a node receives, it sends out c additional repair packets. The ratio of incoming repair packets to data packets over all nodes in the system is given by the simple formula (c/r)(1 - p), where p is the probability - independent, consistent with our assumptions in Section 2 - of a packet being dropped at an end-node. Other than the rate-of-fire (r, c), the principal parameter to Slingshot is the size of its data buffer. Our experiments show that in an 8 Mbps group, lost packets are retrieved within a few milliseconds of the time they were originally sent at, which reduces buffering requirements to tens of milliseconds worth of data, or a few hundred Kbytes. An optional data structure is the pending repairs buffer, where repair packets containing multiple missing packets are temporarily stored, in case all but one of their included lost packets is recovered through other means; in practice, we achieve good results with the size of this buffer set to twenty packets. Finally, the computational overhead imposed by Slingshot is minimal and easily quantifiable: every incoming data packet is XORed incrementally into a buffer to prepare the outgoing repair packet, and during recovery, r - 1 XORs are performed to extract the missing data packet from the repair packet. Phase 2 recovery imposes the extra overhead of an explicit request and response; we do not consider this significant, since we expect Slingshot's rate-of-fire to be tuned such that the percentage of packets left unrecovered in Phase 1 is very small.
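As a rough worked example of the overhead formula, using assumed illustrative values rather than the configuration reported later in the paper: with a rate-of-fire of (r, c) = (8, 4) and an end-node drop probability of p = 0.01, the ratio of incoming repair traffic to incoming data traffic is (c/r)(1 - p) = (4/8)(0.99) ≈ 0.495, i.e. roughly one repair packet arrives for every two data packets received.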
3.3 Membership

Slingshot needs a weakly consistent membership service to provide each node with the list of other nodes in the group, known as a view, from which it can pick targets for repair packets randomly. The more accurately this view reflects actual group membership, the better Slingshot's discovery and recovery mechanism performs; the repair packets sent to nodes who are in a view but not in the group are wasted, and members of the group who are not accurately represented in views are less likely to receive sufficient repair packets. However, Slingshot's probabilistic nature allows it to perform well despite weakly consistent membership, allowing it to be layered over gossip-based membership protocols [15]. Slingshot also works well with partial views, where each node knows only a subset of the group; hence, a scalable, external membership service providing uniformly selected partial views, such as Scamp [9], can be used underneath it. In our implementation we include a simple state-machine replicated membership server that provides nodes with view updates, and have nodes perform failure detection in a ring. We believe such a solution to be appropriate for datacenter settings, where group sizes are limited to thousands of nodes and churn is not a concern. Because we do not require strong consistency, the overhead imposed by the underlying membership service is negligible: in our implementation, each node pings one other node once a second and membership updates are propagated lazily to nodes by the central service.

3.4 Analysis

We present a simplistic analysis to predict how the probability of a lost packet being recovered depends on the rate-of-fire parameter, the size of the group, the partial view size, and the probability of packet loss. We assume that packets are dropped at end node buffers with some fixed independent probability p, and that routers do not drop packets; this is consistent with our assumptions in Section 2. Given a group size n, a partial view size v and a rate-of-fire parameter (r, c), we want to predict the probability of recovering a given data packet dp lost at a node. We make the assumption that the partial view at a node is a uniformly chosen subset of the whole group membership, and always includes the node itself.

Now, let X be a random variable signifying the number of nodes in the system which have the affected node in their partial views. Since each partial view is a uniformly picked subset of the group membership, the probability of the node being included in a view other than its own is (v - 1)/(n - 1). Thus, X has a binomial distribution with parameters n - 1 and (v - 1)/(n - 1). Given that the number of nodes including the affected node in their views is a particular value x, let Y be a random variable denoting the number of repair packets originating at these x nodes that include the data packet dp and are targeted at the affected node. The upper bound on the total number of such packets is x, for the case that each of the x nodes receives dp without loss and sends out a repair packet containing it to the affected node. At any of the x nodes, the probability of the affected node being selected as one of the c destinations is c/(v - 1). If we also consider the probability of the sender of the repair packet dropping dp, then Y has a binomial distribution with parameters x and (1 - p) c/(v - 1). If we set the number of repair packets which include dp and are targeted at the affected node to a value y, then the number of such packets received without loss is represented by a random variable Z that has a binomial distribution with parameters y and 1 - p. Let us set Z to the value z. Now, we need to compute the probability of dp being recovered if the node receives z repair packets containing it. We can recover dp if, for at least one of the incoming repair packets containing dp, the node has all the other data packets included in that repair packet. We derive inclusive upper and lower bounds on this probability; both are equal to zero when z = 0. The lower bound corresponds to the case where all repair packets have the same contents, and the upper bound is given by the case where all repair packets are pairwise disjoint, i.e. they include completely different data packets. Hence, for z >= 1, the probability of recovery given z useful repairs is bounded by:

    (1 - p)^(r-1)  <=  P(recover | z)  <=  1 - (1 - (1 - p)^(r-1))^z.

Hence, the final probability of a data packet dp lost at the node being recovered successfully is bounded by weighting these bounds by the distributions of X, Y and Z:

    sum over x, y, z of P(X = x) P(Y = y | X = x) P(Z = z | Y = y) (1 - p)^(r-1)
        <=  P(recovery)  <=
    sum over x, y, z of P(X = x) P(Y = y | X = x) P(Z = z | Y = y) (1 - (1 - (1 - p)^(r-1))^z).
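To make the analysis concrete, the short sketch below (Python, our own illustration with assumed parameter values, not figures from the paper) estimates these bounds numerically by sampling X, Y and Z from the binomial distributions defined above.

    import random

    # Numerical sketch of the Section 3.4 bounds (assumed parameters, not the paper's).
    def recovery_bounds(n=64, v=64, r=8, c=4, p=0.01, trials=20_000):
        lo = hi = 0.0
        q = (1 - p) ** (r - 1)   # prob. a given repair is usable (all r-1 other packets present)
        for _ in range(trials):
            # X: nodes (other than the affected one) holding it in their partial views
            x = sum(random.random() < (v - 1) / (n - 1) for _ in range(n - 1))
            # Y: repairs containing dp that are aimed at the affected node
            y = sum(random.random() < (1 - p) * c / (v - 1) for _ in range(x))
            # Z: such repairs actually received (each survives with prob. 1 - p)
            z = sum(random.random() < 1 - p for _ in range(y))
            if z > 0:
                lo += q                     # all repairs share the same contents
                hi += 1 - (1 - q) ** z      # repairs are pairwise disjoint
        return lo / trials, hi / trials

    if __name__ == "__main__":
        print(recovery_bounds())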
4 Evaluation

We evaluated an implementation of Slingshot on a rack-style cluster of 64 nodes, consisting of four racks of blade-servers connected via two switches. We believe this cluster to be fairly typical of a datacenter setup, with average ping round trip time at around 100 microseconds. Unless indicated otherwise, our loss model involves packets being dropped at end nodes with .01 independent probability.

Figure 2. Tuning a Slingshot implementation on a 64-node cluster: tradeoff points available between reliability, timeliness and overhead by changing the rate-of-fire (r, c). Panel (a) changes c with fixed r; panel (b) changes r with c = r/2. Reliability (fraction of lost packets recovered) and overhead (fraction of overhead in traffic) are plotted against the left y-axis, and average packet recovery time (microseconds) on the right y-axis.

4.1 Reliability and Timeliness vs. Overhead

Figure 2 shows how changing the rate-of-fire parameter affects the amount of overhead induced and the resulting reliability and timeliness characteristics. Figure 2a shows the effect of varying c for constant r, and Figure 2b shows results for different values of r, keeping c = r/2 to maintain the same level of overhead. In these experiments, each node multicasts a packet once every 64 milliseconds, resulting in 1000 packets/second in the group. Note that the average recovery time includes discovery of packet loss. Most packet recovery mechanisms, including SRM [8], present only the time taken to recover packets in their results, ignoring discovery time. In a multi-sender setting, discovery time using sender-based sequencing would heavily dominate recovery time, averaging at 64 milliseconds in this experiment. In contrast, Slingshot performs both discovery and recovery within a few milliseconds. We run only Phase 1 of Slingshot
here, and the graph includes the fraction of packets recovered in this phase; running a second phase would provide complete reliability. After a certain point, increasing the amount of overhead by raising c has diminishing returns; for the rest of our simulations we set the rate-of-fire to be [...].

Figure 3. Comparison of a Slingshot implementation against SRM on a 64-node rack-style cluster: c.d.f. of lost packets recovered against time elapsed after the initial send (logarithmic time axis) for a 1000 packets/sec workload; curves show Slingshot, SRM Discovery and SRM Recovery.

4.2 Comparison with SRM

Figure 3 shows a comparison of the time-critical properties of Slingshot against SRM. The experiment involves 64 nodes multicasting every 64 milliseconds, to achieve a data rate of 1000 packets per second within the group. The graph is a cdf of packets recovered against time taken (since the original unreliable send), on a logarithmic x-axis. SRM discovers packet loss through sender-based sequencing, which explains the steep rise of the "SRM Discovery" curve at around 64 milliseconds; a lost packet is not discovered until its sender multicasts again, 64 milliseconds later. In this particular run of SRM, roughly 53% of all traffic is repair overhead, while the Slingshot configuration used (rate-of-fire = [...]) results in 38% of all traffic being overhead. Unless SRM is allowed to facilitate faster discovery by sending blank messages between actual multicasts, using up even more overhead, its recovery time is bounded below by the inter-send time. Hence, even if recovery were made faster by optimizing SRM's timing parameters for datacenter settings, it cannot take place faster than the inter-send time. Running only Phase 1 of the Slingshot protocol, we achieve recovery of almost all lost packets (97.5% in this run) two orders of magnitude faster than SRM.

4.3 Scalability

To assess the scalability of Slingshot beyond the limits of our 64-node cluster, we ran a simulation of the protocol on a 400 node network. Our topology consisted of 20 switches forming a star network around a gateway router, with each switch having 20 end-hosts under it. As in the implementation setup, end-hosts drop packets with 1% probability, and routers and switches do not drop packets. In Figure 4, we randomly select nodes from the 400 node topology to fill a group of a given size, and evaluate the percentage of packets recovered by Slingshot's Phase 1 in the resulting group. We also used the simulator to assess the impact of partial view size on Slingshot's performance. Figure 5 shows that Slingshot works as well with small, partial views as it does with global views.

Figure 4. Effect of group size on Slingshot's Phase 1 recovery percentage in a 400 node simulated datacenter.

Figure 5. Effect of changing the partial view size in a 400 node simulated datacenter: fraction of lost packets recovered and average recovery time.

5 Related Work

Slingshot lies in the intersection of various distributed systems technologies, such as reliable multicast, FEC schemes, and real-time protocols. Amongst the reliable multicast protocols discussed in Section 2, the one closest to Slingshot is Bimodal Multicast (see Figure 1); they are both layered over IP Multicast, and involve receivers exchanging recovery information with each other. Slingshot has the obvious advantage over Bimodal Multicast of requiring only a single message send for recovery, as opposed to three.
Also, running Bimodal Multicast in time-critical settings would involve exchanging message digests at a very high rate. When digests are sent this frequently, the case where some receivers get a packet before others could trigger off many unnecessary two-step recoveries. Slingshot does not suffer from this problem, as recovering a packet from a repair is a very fast operation. The passing resemblance of Slingshot to Bimodal Multicast is partially by construction; we believe that this is one way of enhancing reliable multicast with receiver-based FEC, but one can imagine other alternatives, such as deterministically flooding repairs on an overlay, that offer guarantees of a different flavor.

Slingshot is closer to the reliable multicast space than to real-time protocols, due to the nature of its guarantees and operating assumptions. Real-time protocols either assume timing guarantees at a lower level or characterize the violation of timing assumptions as failures. For instance, the delta-t protocol [5] offers deterministic bounds on delivery, given a certain number of such failures. Our characterization of time-criticality as a need for very small, probabilistic time bounds is, hence, quite different from traditional notions of real-time requirements.

Lastly, FEC has long been a topic of extensive research; nonetheless, we are not aware of any existing work that proposes encoding repair packets at the receiver end. Most current research in FEC is focused on providing efficient ways to perform complex encodings that allow recovery from multiple losses in a stream. While we have restricted ourselves to using XOR for encoding, using more powerful forms of FEC is an avenue open to further exploration.

6 Conclusion

Slingshot offers unique probabilistic guarantees on timeliness in datacenter settings by placing FEC at the receivers of a multicast. It exploits the accumulated multicasting rates of multiple senders to achieve faster packet loss detection and recovery at the receivers. In our evaluation setup of a 64-node rack-style cluster, Slingshot recovers packets two orders of magnitude faster than Scalable Reliable Multicast, underscoring its utility as a building block for time-critical datacenter applications.

Acknowledgments

We thank Vidhyashankar Venkataraman for his help with the analysis of the protocol, Robbert van Renesse for his comments on Slingshot, and Luis Rodrigues for his pointers to real-time multicasts.

References

[1] K. P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal multicast. ACM Trans. Comput. Syst., 17(2):41-88, 1999.
[2] J. Byers, M. Luby, and M. Mitzenmacher. A digital fountain approach to asynchronous reliable multicast. IEEE Journal on Selected Areas in Communications, 20(8), October 2002.
[3] G. Carle and E. Biersack. Survey of error recovery techniques for IP-based audio-visual multicast applications. IEEE Network, December 1997.
[4] Y.-H. Chu, S. G. Rao, and H. Zhang. A case for end system multicast. In Measurement and Modeling of Computer Systems, pages 1-12, 2000.
[5] F. Cristian, H. Aghali, R. Strong, and D. Dolev. Atomic broadcast: From simple message diffusion to byzantine agreement. In Proc. 15th Int. Symp. on Fault-Tolerant Computing (FTCS-15), pages 200-206, Ann Arbor, MI, USA, 1985. IEEE Computer Society Press.
[6] S. E. Deering and D. R. Cheriton. Multicast routing in datagram internetworks and extended LANs. ACM Trans. Comput. Syst., 8(2):85-110, 1990.
[7] P. T. Eugster, R. Guerraoui, S. B. Handurukande, P. Kouznetsov, and A.-M. Kermarrec. Lightweight probabilistic broadcast. In DSN '01: Proceedings of the 2001 International Conference on Dependable Systems and Networks, pages 443-452. IEEE Computer Society, 2001.
[8] S. Floyd, V. Jacobson, C.-G. Liu, S. McCanne, and L. Zhang. A reliable multicast framework for light-weight sessions and application level framing. IEEE/ACM Transactions on Networking, 5(6):784-803, 1997.
[9] A. J. Ganesh, A.-M. Kermarrec, and L. Massoulié. Peer-to-peer membership management for gossip-based protocols. IEEE Trans. Computers, 52(2):139-149, 2003.
[10] S. Ghemawat, H. Gobioff, and S.-T. Leung.
The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 29-43. ACM Press, 2003.
[11] C. Huitema. The case for packet level FEC. In Protocols for High-Speed Networks, pages 109-120, 1996.
[12] T. Montgomery, B. Whetten, M. Basavaiah, S. Paul, N. Rastogi, J. Conlan, and T. Yeh. The RMTP-II protocol, Apr. 1998. IETF Internet Draft.
[13] J. Nonnenmacher, E. W. Biersack, and D. Towsley. Parity-based loss recovery for reliable multicast transmission. IEEE/ACM Transactions on Networking, 6(4):349-361, 1998.
[14] S. Paul, K. K. Sabnani, J. C.-H. Lin, and S. Bhattacharyya. Reliable multicast transport protocol (RMTP). IEEE Journal of Selected Areas in Communications, 15(3):407-421, 1997.
[15] R. V. Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. Technical Report TR98-1687, 1998.
[16] L. Rizzo and L. Vicisano. A reliable multicast data distribution protocol based on software FEC techniques. In The Fourth IEEE Workshop on the Architecture and Implementation of High Performance Communication Systems (HPCS '97), Sani Beach, Chalkidiki, Greece, June 1997.