The Nature of Datacenter Traffic: Measurements & Analysis
Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, Ronnie Chaiken
Microsoft Research

ABSTRACT

We explore the nature of traffic in data centers, designed to support the mining of massive datasets. We instrument the servers to collect socket-level logs, with negligible performance impact. In a 1500 server operational cluster, we thus amass roughly a petabyte of measurements over two months, from which we obtain and report detailed views of traffic and congestion conditions and patterns.

We further consider whether traffic matrices in the cluster might be obtained instead via tomographic inference from coarser-grained counter data.

Categories and Subject Descriptors: C.2.4 [Distributed Systems]: Distributed applications; C.4 [Performance of Systems]: Performance Attributes

General Terms: Design, experimentation, measurement, performance

Keywords: Datacenter traffic, characterization, models, tomography

1. INTRODUCTION

Analysis of massive data sets is a major driver for today's data centers [ ]. For example, web search relies on continuously collecting and analyzing billions of web pages to build fresh indexes and mining of click-stream data to improve search quality. As a result, distributed infrastructures that support query processing on peta-bytes of data using commodity servers are increasingly prevalent (e.g., GFS, BigTable [17], Yahoo!'s Hadoop, PIG [27] and Microsoft's Cosmos, Scope [23]). Besides search providers, the economics and performance of these clusters appeal to commercial cloud computing providers who offer fee-based access to such infrastructures [ ].

To the best of our knowledge, this paper provides the first description of the characteristics of traffic arising in an operational distributed query processing cluster that supports diverse workloads created in the course of solving business and engineering problems. Our measurements collected network-related events from each of the 1500 servers, which represent a logical cluster in an operational datacenter housing tens of thousands of servers, for over two months. Our contributions are as follows:

Measurement Instrumentation. We describe a lightweight, extensible instrumentation and analysis methodology that measures traffic on data center servers, rather than switches, providing socket-level logs. This server-centric approach, we believe, provides an advantageous tradeoff for monitoring traffic in data centers. Server overhead (CPU, memory, storage) is relatively small, though the traffic volumes generated in total are large: over 10 GB per server per day. Further, such server instrumentation enables linking up network traffic to the applications that generate or depend on it, letting us understand the causes (and impact) of network incidents.

Figure 1: Sketch of a typical cluster. Tens of servers per rack are connected via inexpensive top-of-rack switches that in turn connect to high-degree aggregation switches. VLANs are set up between small numbers of racks to keep broadcast domains small. We collect traces from all (1500) nodes in a production cluster.

networktractotheapplicationsthatgenerateordependon it,let- tingusunderstandthecauses(andimpact)ofnetworkincide nts. Trac Characteristics. Much of the trac volume could be ex- plained by two clearly visible patterns which we call Work-Seeks- Bandwidth and Scatter-Gather . Using socket level logs, we investi- gatethenatureofthetracwithinthesepatterns: owchara cteris- tics,congestion,andrateofchangeofthetracmix. Tomography Inference Accuracy. Will the familiar infer- ence methods to obtain trac matrices in the Internet Servic Provider (ISP)

networks extendto data centers[ 20 32 34 35 ]? If they do, the barrier to understandthe trac characteristic sof dat- acenterswill beloweredfromthedetailed instrumentation thatwe have done here to analyzing the more easily available SNMP li nk counters. Our evaluation shows that tomography performs po orly fordatacentertracandwepostulatesomereasonsforthis. Aconsistentthemethatrunsthroughourinvestigationisth atthe methodology that works in the data center and the results see n in the data centerare dierent than their counterpartsin ISP o r even

enterprisenetworks.eopportunitiesandsweetspotsfo rinstru- mentationare dierent. echaracteristicsof the trac are dier- ent, as are the challenges of associatedinference problems . Simple intuitiveexplanationsarisefromengineeringconsiderat ions,where thereistightercouplinginapplicationsuseofnetwork,c omputing, andstorageresources,thanthatisseeninothersettings. 2. DATA & METHODOLOGY We briey present our instrumentation methodology. Measur e- mentsinISPsandenterprisesconcentrateoninstrumenting thenet- workdeviceswiththefollowing choices:

SNMPcounters ,whichsupportpacketandbytecountsacrossindi- vidual switchinterfacesandrelatedmetrics, areubiquito uslyavail- able on network devices. However, logistic concerns on how o en routers can be polled limit availability to coarse time-sca les, typi- callyonceeveryveminutes,andbyitselfSNMPprovideslit tlein- sightintoow-levelorevenhost-levelbehavior. Sampledow or sampled packet header level data [ 16 29 22 31 canprovideowlevelinsightatthecostofkeepingahigherv olume of data for analysis and for assurance that samples are repre senta- tive[ 15 ].

Whilenotyetubiquitous,thesecapabilitiesarebecomin moreavailable,especiallyonnewerplatforms[ 12 ]. Deep packet inspection: Much research mitigates the costs of packetinspectionathighspeed[ 11 13 ]butfewcommercialdevices supporttheseacrossproductionswitchandrouterinterfac es.
In this context, how do we design data-center measurements that achieve accurate and useful data while keeping costs manageable? What drives cost is detailed measurement at very high speed. To achieve speed, the computations have to be implemented in firmware and, more importantly, the high-speed memory or storage required to keep track of details is expensive, causing little of it to be available on-board the switch or router. Datacenters provide a unique choice: rather than collecting data on network devices with limited capabilities for measurement, we could obtain measurements at the servers, even commodity versions of which have multiple cores, GBs of memory, and 100s of GBs or more of local storage. When divided across servers, the per-server monitoring task is a surprisingly small fraction of what a network device might incur. Further, modern data centers have a common management framework spanning their entire environment (servers, storage, and network), simplifying the task of managing measurements and storing the produced data. Finally, instrumentation at the servers allows us to link the network traffic with application-level logs (e.g., at the level of individual processes), which is otherwise impossible to do with reasonable accuracy. This lets us understand not only the origins of network traffic but also the impact of network incidents (such as congestion or incast) on applications.

The idea of using servers to ease operations is not novel; network exception handlers leverage end hosts to enforce access policies [24], and some prior work adapts PCA-based anomaly detectors to work well even when data is distributed on many servers [21]. Yet, performing cluster-wide instrumentation of servers to obtain detailed measurements is a novel aspect of this work.

We use the ETW (Event Tracing for Windows [2]) framework to collect socket-level events at each server and parse the information locally. Periodically, the measured data is stowed away using the APIs of the underlying distributed file system, which we also use for analyzing the data. In our cluster, the cost of turning on ETW was a median increase of 1.6% in CPU utilization, an increase of 1.2% in disk utilization, 10% more CPU cycles per byte of network traffic, and fewer than a 2 Mbps drop in network throughput even when the server was using the NIC at capacity (i.e., at 1 Gbps). This overhead is low primarily due to the efficient tracing framework [ ] underlying ETW, but also because, unlike packet capture, which involves an interrupt from the kernel's network stack for each packet, we use ETW to obtain socket-level events, one per application read or write, which aggregate over several packets and skip network chatter. To keep the cumulative data upload rate manageable, we compress the logs prior to uploading. Compression reduces the network bandwidth used by the measurement infrastructure by at least 10x.

In addition to network-level events, we collect and use application logs (job queues, process error codes, completion times, etc.) to see which applications generate what network traffic as well as how network artifacts (congestion etc.) impact applications. Over a month, our instrumentation collected nearly a petabyte of uncompressed data.
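To make the collection pipeline concrete, the sketch below aggregates socket-level read/write events into per-connection byte totals and gzip-compresses the result before upload. It is a minimal illustration only: the event field names, the record layout, and the CSV serialization are assumptions of this sketch, not the actual ETW schema or the distributed file-system APIs used in our cluster.

import csv, gzip, io
from collections import defaultdict

def aggregate_socket_events(event_rows):
    """Sum bytes per (src_ip, src_port, dst_ip, dst_port, proto) connection.

    event_rows: iterable of dicts with keys 'ts', 'pid', 'src_ip', 'src_port',
    'dst_ip', 'dst_port', 'proto', 'bytes'; a simplified stand-in for the
    socket-level events reported once per application read or write."""
    totals = defaultdict(int)
    for ev in event_rows:
        key = (ev['src_ip'], ev['src_port'], ev['dst_ip'], ev['dst_port'], ev['proto'])
        totals[key] += int(ev['bytes'])
    return totals

def compress_for_upload(totals):
    """Serialize the per-connection totals as CSV and gzip them, mirroring the
    compress-before-upload step described above."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for (sip, sport, dip, dport, proto), nbytes in totals.items():
        writer.writerow([sip, sport, dip, dport, proto, nbytes])
    return gzip.compress(buf.getvalue().encode('utf-8'))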

We believe that deep packet inspection is infeasible in production clusters of this scale: it would be hard to justify the associated cost, and the spikes in CPU usage associated with packet capture and parsing on the server interfaces are a concern to production cluster managers. The socket-level detail we collect is both doable and useful, since, as we will show next, it lets us answer questions that SNMP feeds cannot.

3. APPLICATION WORKLOAD

Before we delve into measurement results, we briefly sketch the nature of the application that is driving traffic on the instrumented cluster. At a high level, the cluster is a set of commodity servers that supports map-reduce style jobs as well as a distributed replicated block store layer for persistent storage. Programmers write jobs in a high-level SQL-like language called Scope [8]. The Scope compiler transforms the job into a workflow (similar to that of Dryad [23]) consisting of phases of different types. Some of the common phase types are Extract, which looks at the raw data and generates a stream of relevant records; Partition, which divides a stream into a set number of buckets; Aggregate, which is the Dryad equivalent of reduce; and Combine, which implements joins. Each phase consists of one or more vertices that run in parallel and perform the same computation on different parts of the input stream. Input data may need to be read off the network if it is not available on the same machine, but outputs are always written to the local disk for simplicity.

Some phases can function as a pipeline; for example, Partition may start dividing the data generated by Extract into separate hash bins as soon as an extract vertex finishes, while other phases may not be pipeline-able. For example, an Aggregate phase that computes the median sale price of different textbooks would need to look at every sales record for a textbook before it can compute the median price. Hence, in this case the aggregate can run only after every partition vertex that may output sales records for this book name completes. All the inputs and the eventual outputs of jobs are stored in a reliable replicated block storage mechanism called Cosmos that is implemented on the same commodity servers that do computation. Finally, jobs range over a broad spectrum, from short interactive programs that may be written to quickly evaluate a new algorithm to long-running, highly optimized production jobs that build indexes.

4. TRAFFIC CHARACTERISTICS

Context: The datacenter we collect traffic from has the typical structure sketched in Figure 1. Virtualization is not used in this cluster, hence each IP corresponds to a distinct machine, which we will refer to as a server. A matrix representing how much traffic is exchanged from the server denoted by the row to the server denoted by the column will be referred to as a traffic matrix (TM). We compute TMs at multiple time-scales (1 s, 10 s, and 100 s) and between both servers and top-of-rack (ToR) switches. The latter ToR-to-ToR TM has zero entries on the diagonal, i.e., unlike the server-to-server TM, only traffic that flows across racks is included here. By flow, we mean the canonical five-tuple (source IP, source port, destination IP, destination port, and protocol). When explicit begins and ends of a flow are not available, similar to much prior work [26, 30], we use a long inactivity timeout (default 60 s) to determine when a flow ends (or a new one begins). Finally, clocks across the various servers are not synchronized, but also not so far skewed as to affect the subsequent analysis.
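As an illustration of these definitions, the following sketch splits socket-level records into flows using the 60 s inactivity timeout and bins bytes into a server-to-server traffic matrix at a chosen time-scale. The (timestamp, five-tuple, bytes) record layout is an assumption of the sketch; ToR-to-ToR matrices follow by mapping servers to their racks and zeroing the diagonal.

from collections import defaultdict

INACTIVITY_TIMEOUT = 60.0  # seconds; the default used here to delimit flows

def split_into_flows(records):
    """Group (ts, five_tuple, nbytes) records into flows.

    A new flow starts whenever more than INACTIVITY_TIMEOUT seconds elapse
    between consecutive records of the same five-tuple."""
    last_seen = {}
    current_flow = {}
    flows = defaultdict(lambda: {'start': None, 'end': None, 'bytes': 0})
    for ts, five_tuple, nbytes in sorted(records, key=lambda r: r[0]):
        if (five_tuple not in last_seen
                or ts - last_seen[five_tuple] > INACTIVITY_TIMEOUT):
            current_flow[five_tuple] = (five_tuple, ts)   # start a new flow
        last_seen[five_tuple] = ts
        f = flows[current_flow[five_tuple]]
        if f['start'] is None:
            f['start'] = ts
        f['end'] = ts
        f['bytes'] += nbytes
    return flows

def traffic_matrix(records, timescale):
    """Bytes exchanged per (src_server, dst_server) pair in each window of
    `timescale` seconds (e.g., 1, 10, or 100)."""
    tm = defaultdict(lambda: defaultdict(int))  # window index -> (src, dst) -> bytes
    for ts, (src, sport, dst, dport, proto), nbytes in records:
        tm[int(ts // timescale)][(src, dst)] += nbytes
    return tm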

4.1 Patterns

Two pronounced patterns together comprise a large chunk of the traffic in the datacenter. We call these the work-seeks-bandwidth pattern and the scatter-gather pattern, due to their respective causes.
Figure 2: The Work-Seeks-Bandwidth and Scatter-Gather patterns in datacenter traffic, as seen in a matrix of log(Bytes) exchanged between server pairs in a representative 10 s period. (See Section 4.1.)

Figure 2 plots the log(Bytes) exchanged between server pairs in a 10 s period. We order the servers such that those within a rack are adjacent to each other on the axes. The small squares around the diagonal represent a large chunk of the traffic and correspond to exchanges among servers within a rack. At first blush, this figure resembles CPU and memory layouts on ASIC chips that are common in the architecture community. Indeed, the resemblance extends to the underlying reasons. While chip designers prefer placing components that interact often (e.g., CPU and L1 cache, multiple CPU cores) close by to get high-bandwidth interconnections on the cheap, writers of data center applications prefer placing jobs that rely on heavy traffic exchanges with each other in areas where high network bandwidth is available. In topologies such as the one in Figure 1, this translates to the engineering decision of placing jobs within the same server, within servers on the same rack, or within servers in the same VLAN, and so on, in decreasing order of preference; hence the work-seeks-bandwidth pattern. Further, the horizontal and vertical lines represent instances wherein one server pushes (or pulls) data to many servers across the cluster. This is indicative of the map and reduce primitives underlying distributed query processing infrastructures, wherein data is partitioned into small chunks, each of which is worked on by different servers, and the resulting answers are later aggregated. Hence, we call this the scatter-gather pattern. Finally, we note that the dense diagonal does not extend all the way to the top right corner. This is because the area on the far right (and far top) corresponds to servers that are external to the cluster, which upload new data into the cluster or pull out results from it.

Figure 3: How much traffic is exchanged between server pairs (non-zero entries)?

Figure 4: How many other servers does a server correspond with? (Rack = 20 servers, Cluster ~ 1500 servers)

We attempt to characterize these patterns with a bit more precision. Figure 3 plots the log-distribution of the non-zero entries of the TM. At first both distributions appear similar: non-zero entries are somewhat heavy-tailed, ranging up to roughly 20 in log(Bytes), with server pairs that are within the same rack more likely to exchange more bytes. Yet, the true distributions are quite different due to the numbers of zero entries: the probability of exchanging no traffic is 89% for server pairs that belong to the same rack and 99.5% for pairs that are in different racks. Finally, Figure 4 shows the distributions of how many correspondents a server talks with. A server either talks to almost all the other servers within its rack (the bump near 1 in Figure 4, left) or it talks to fewer than 25% of servers within the rack. Further, a server either doesn't talk to servers outside its rack (the spike at zero in Figure 4, right) or it talks to about 1-10% of outside servers. The median numbers of correspondents for a server are two (other) servers within its rack and four servers outside the rack.

We believe that Figures 2 to 4 together form the first characterization of datacenter traffic at a macroscopic level and comprise a model that can be used in simulating such traffic.
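The correspondent counts behind Figure 4 can be computed from a per-window server-pair byte matrix and a server-to-rack mapping; both inputs are assumptions of this sketch (the rack mapping would come from cluster inventory).

from collections import defaultdict

def correspondent_counts(pair_bytes, rack_of):
    """pair_bytes: dict mapping (src, dst) -> bytes for one time window.
    rack_of: dict mapping server -> rack id.
    Returns per-server counts of distinct within-rack and cross-rack correspondents."""
    within = defaultdict(set)
    across = defaultdict(set)
    for (src, dst), nbytes in pair_bytes.items():
        if nbytes == 0 or src == dst:
            continue
        bucket = within if rack_of[src] == rack_of[dst] else across
        bucket[src].add(dst)
        bucket[dst].add(src)
    return ({s: len(peers) for s, peers in within.items()},
            {s: len(peers) for s, peers in across.items()})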

4.2 Congestion Within the Datacenter

Next, we shift focus to hot-spots in the network, i.e., links whose average utilization is above some constant threshold. Results in this section use a threshold of 70%, but choosing a threshold of 90% or 95% yields qualitatively similar results. Ideally, one would like to drive the network at as high a utilization as possible without adversely affecting throughput. Pronounced periods of low network utilization likely indicate (a) that the application by nature demands more of other resources, such as CPU and disk, than the network, or (b) that the applications can be re-written to make better use of available network bandwidth.

Figure 5: When and where does congestion happen in the datacenter?

Figure 5 illustrates when and where links within the monitored network are highly utilized. Highly utilized links happen often! Among the 150 inter-switch links that carry the traffic of the 1500 monitored machines, 86% of the links observe congestion lasting at least 10 seconds and 15% observe congestion lasting at least 100 seconds. Short congestion periods (blue circles in Figure 5; 10 s of high utilization) are highly correlated across many tens of links and are due to brief spurts of high demand from the application. Long-lasting congestion periods tend to be more localized to a small set of links.
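A sketch of the hot-spot detection used in this section: given per-link utilization samples over fixed windows (an assumed input format; in our setting the utilizations are derived from the server-side logs), flag windows above the 70% threshold and coalesce consecutive flagged windows into congestion episodes.

def congestion_episodes(util_by_window, threshold=0.70):
    """util_by_window: list of (window_start_seconds, utilization in [0, 1]) for
    one link, covering consecutive fixed-size windows in time order.
    Returns (start_of_first_hot_window, start_of_last_hot_window) pairs for runs
    of windows whose utilization exceeds `threshold`."""
    episodes = []
    run_start = None
    last_hot = None
    for t, util in util_by_window:
        if util > threshold:
            if run_start is None:
                run_start = t
            last_hot = t
        elif run_start is not None:
            episodes.append((run_start, last_hot))
            run_start = None
    if run_start is not None:
        episodes.append((run_start, last_hot))
    return episodes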

Figure 6: Length of congestion events.

Figure 6 shows that most periods of congestion tend to be short-lived. Of all congestion events that are more than one second long, over 90% are no longer than 2 seconds, but long epochs of congestion exist: in one day's worth of data, there were 665 unique episodes of congestion that each lasted more than 10 s, a few epochs lasted several hundreds of seconds, and the longest lasted for 382 seconds.

When congestion happens, is there collateral damage to victim flows that happen to be using the congested links? Figure 7 compares the rates of flows that overlap high-utilization periods with the rates of all flows. From an initial inspection, it appears as if the rates do not change appreciably (see the CDF in Figure 7). Errors such as flow timeouts or failure to start may not be visible in flow rates, hence we correlate high-utilization epochs directly with application-level logs. Figure 8 shows that jobs experience a median increase of 1.1x in their probability of failing to read input(s) if they have flows traversing high-utilization links. Note that while outputs are always written to the local disk, the next phase of the job that uses this data may have to read it over the network if necessary.
Figure 7: Comparing rates of flows that overlap congestion with rates of all flows.

Figure 8: Impact of high utilization. The likelihood that a job fails because it is unable to read requisite data over the network increases by 1.1x (median) during high-utilization epochs.

When a job is unable to find its input data, or is unable to connect to the machine that has the input data, or is stuck, i.e., does not make steady progress in reading more of its input, the job is killed and logged as a read failure. We note upfront that not all read failures are due to the network; besides congestion, they could be caused by an unresponsive machine, bad software or bad disk sectors. However, we observe a high correlation between network congestion and read failures, leading us to believe that a sizable chunk of the observed read failures are due to congestion. Over a one-week period, we see that the inability to read input(s) increases when the network is highly utilized. Further, the more prevalent the congestion (on 5th and 8th Jan, for example), the larger the increase; in particular, the days with little increase (10th and 11th Jan) correspond to a lightly loaded weekend.

When high-utilization epochs happen, we would like to know the causes behind the high volumes of traffic. Operators would like to know if these high volumes are normal. Developers can better engineer job placement if they know which applications send how much traffic, and network designers can evaluate architecture choices better by knowing what drives the traffic. To attribute network traffic to the applications that generate it, we merge the network event logs with logs at the application level that describe which job and phase (e.g., map, reduce) were active at that time. Our results show that, as expected, jobs in the reduce phase are responsible for a fair amount of the network traffic. Note that in the reduce phase of a map-reduce job, data in each partition that is present at multiple servers in the cluster (e.g., all personnel records that start with A) has to be pulled to the server that handles the reduce for the partition (e.g., count the number of records that begin with A) [14, 23].

However, unexpectedly, the extract phase also contributed a fair amount of the flows on high-utilization links. In Dryad [23], extract is an early phase in the workflow that parses the data blocks. Hence, it looks at by far the largest amount of data, and the job manager attempts to keep the computation as close to the data as possible. It turns out that a small fraction of all extract instances read data off the network if all of the cores on the machine that has the data are busy at the time. Yet another unexpected cause for highly utilized links was evacuation events: when a server repeatedly experiences problems, the automated management system in our cluster evacuates all the usable blocks on that server prior to alerting a human that the server is ready to be re-imaged (or reclaimed). The latter two unexpected sources of congestion helped developers re-engineer the applications based on these measurements.

To sum up, high-utilization epochs are common, appear to be caused by application demand, and have a moderate negative impact on job performance.

4.3 Flow Characteristics

Figure 9: More than 80% of the flows last less than ten seconds, fewer than 0.1% last longer than 200 s, and more than 50% of the bytes are in flows lasting less than 25 s.

Figure 9 shows that the traffic mix changes frequently. The figure plots the durations of 165 million flows (a day's worth of flows) in the cluster. Most flows come and go (80% last less than 10 s) and there are few long-running flows (less than 0.1% last longer than 200 s). This has interesting implications for traffic engineering.

Centralized decision making, in terms of deciding which path a certain flow should take, is quite challenging: not only would the central scheduler have to deal with a rather high volume of scheduling decisions, but it would also have to make the decisions very quickly to avoid visible lag in flows. One might wonder whether most of the bytes are contained in the long-running flows. If this were true, scheduling just the few long-running flows would be enough. Unfortunately, this does not turn out to be the case in DC traffic; more than half the bytes are in flows that last no longer than 25 s.

Figure 10 shows how the traffic changes over time within the data center. The figure on the top shows the aggregate traffic rate over all server pairs for a ten-hour period. Traffic changes quite quickly; some spikes are transient but others last for a while. Interestingly, the top of the spikes is more than half the full-duplex bisection bandwidth of the network. Communication patterns that are full duplex are rare, because typically at any time the producers and consumers of data are fixed. Hence, this means that at several times during a typical day all the used network links run close to capacity.

Another dimension in traffic change is the flux in participants: even when the net traffic rate remains the same, the servers that exchange those bytes may change. Figure 10 (bottom) quantifies the absolute change in traffic matrix from one instant to another, normalized by the total traffic. More precisely, if $TM_t$ and $TM_{t+\delta}$ are the traffic matrices at time $t$ and $t+\delta$, we plot

Normalized Change $= \dfrac{\sum_{i,j} \left| TM_{t+\delta}(i,j) - TM_{t}(i,j) \right|}{\sum_{i,j} \left| TM_{t}(i,j) \right|}$,

where the numerator is the absolute sum of the entry-wise differences of the two matrices and the denominator is the absolute sum of entries in $TM_t$. We plot changes for both $\delta$ = 100 s and $\delta$ = 10 s. At both these time-scales, the median change in traffic is roughly 82%, and the 90th and 10th percentiles are 149% and 37% respectively. This means that even when the total traffic in the matrix remains the same (flat regions on the top graph), the server pairs that are involved in these traffic exchanges change appreciably. There are instances of both leading and lagging change: short bursts cause spikes at the shorter time-scale (dashed line in Figure 10) that smooth out at the longer time-scale (solid line), whereas gradual changes appear conversely, smoothed out at shorter time-scales yet pronounced on the longer time-scale. Significant variability appears to be a key aspect of datacenter traffic.
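The normalized-change metric can be computed directly from two TMs; a minimal sketch, with each TM represented as a dictionary keyed by server pair (a representation chosen for illustration):

def normalized_change(tm_now, tm_next):
    """|| TM_{t+delta} - TM_t ||_1 / || TM_t ||_1, where each TM is a dict
    mapping (src, dst) -> bytes. The numerator sums absolute entry-wise
    differences; the denominator sums absolute entries of TM_t."""
    keys = set(tm_now) | set(tm_next)
    num = sum(abs(tm_next.get(k, 0) - tm_now.get(k, 0)) for k in keys)
    den = sum(abs(v) for v in tm_now.values())
    return num / den if den else float('inf')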
Figure 10: Traffic in the data center changes in both the magnitude (top) and the participants (bottom).

Figure 11: Distribution of inter-arrival times of the flows seen in the entire cluster, at top-of-rack switches (averaged) and at servers (averaged).

Figure 11 portrays the distribution of inter-arrival times between flows as seen at hosts in the datacenter. How long after a flow arrives would one expect another flow to arrive? If flow arrivals were a Poisson process, network designers could safely design for the average case. Yet, we see evidence of periodic short-term bursts and long tails. The inter-arrivals at both servers and top-of-rack switches have pronounced periodic modes spaced apart by roughly 15 ms. We believe that this is likely due to the stop-and-go behavior of the application that rate-limits the creation of new flows. The tail for these two distributions is quite long as well; servers may see flows spaced apart by up to 10 s. Finally, the median arrival rate of all flows in the cluster is 10^5 flows per second, or 100 flows in every millisecond. Centralized schedulers that decide which path to pin a flow on may be hard pressed to keep up. Scheduling application units (jobs etc.) rather than the flows caused by these units is likely to be more feasible, as would distributed schedulers that engineer flows by making simple random choices [25, 36].
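For completeness, the inter-arrival distribution of Figure 11 reduces to differencing sorted flow start times at a vantage point; a minimal sketch, assuming flow records produced by the earlier flow-splitting sketch:

def inter_arrival_times(flow_start_times):
    """flow_start_times: iterable of flow start timestamps (seconds) observed at
    one server or ToR switch. Returns the gaps between consecutive flow
    arrivals, in milliseconds."""
    starts = sorted(flow_start_times)
    return [(b - a) * 1000.0 for a, b in zip(starts, starts[1:])]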

4.4 On Incast

We do not see direct evidence of the incast problem [10, 33], perhaps because we do not have detailed TCP-level statistics for flows in the datacenter. However, we comment on how often the several conditions that need to hold for incast to occur are met in the examined datacenter. First, due to the low round-trip times in datacenters, the bandwidth-delay product is small, which, when divided over the many contending flows on a link, results in a small congestion window for each flow. Second, when the interface's queue is full, multiple flows should see their packets dropped. Due to their small congestion windows, these flows cannot recover via TCP fast retransmit, are stuck until a TCP timeout, and have poor throughput. Third, for the throughput of the network to also go down, synchronization should happen such that no other flow is able to pick up the slack when some flows are in TCP timeout. Finally, an application is impacted more if it cannot make forward progress until all its network flows finish.

MapReduce, or at least the implementation in our datacenter, exhibits very few scenarios wherein a job phase cannot make incremental progress with the data it receives from the network. Further, two engineering decisions explicitly limit the number of mutually contending flows: first, applications limit their simultaneously open connections to a small number (default of 4), and second, computation is placed such that with a high probability network exchanges are local (i.e., within a rack, within a VLAN, etc.). This local nature of flows, most of which are either within the same rack or VLAN, implicitly isolates flows from other flows elsewhere in the network and reduces the likelihood that a bottlenecked switch will carry the large number of flows needed to trigger incast. Finally, several jobs run on the cluster at any time. Though one or a few flows may suffer timeouts, this multiplexing allows other flows to use up the bandwidth that becomes free, thereby reducing the likelihood of wholesale throughput collapse.

We do believe that TCP's inability to recover from even a few packet drops without resorting to timeouts in low bandwidth-delay product settings is a fundamental problem that needs to be solved. However, on the observed practical workloads, which are perhaps typical of a wide set of datacenter workloads, we see little evidence of throughput collapse due to this weakness in TCP.

5. TOMOGRAPHY IN THE DATA CENTER

Socket-level instrumentation, which we used to drive the results presented so far in the paper, is unavailable in most datacenters, but link counters at routers (e.g., SNMP byte counts) are widely available. It is natural to ask: in the absence of more detailed instrumentation, to what approximation can we achieve similar value from link counters? In this section, we primarily focus on network tomography methods that infer traffic matrices (origin-destination flow volumes) from link-level SNMP measurements [20, 35]. If these techniques are as applicable in datacenters as they are in ISP networks, they would help us unravel the nature of traffic in many more datacenters without the overhead of detailed measurement.

There are several challenges for tomography methods to extend to data centers. Tomography is inherently an under-constrained problem; while the number of origin-destination flow volumes to be estimated is quadratic in the number of nodes (N(N-1) entries), the number of link measurements available (i.e., constraints) is much fewer, often a small constant times the number of nodes. Further, the typical datacenter topology (Fig. 1) represents a worst-case scenario for tomography. As many ToR switches connect to one or a few high-degree aggregation switches, the number of link measurements available is small. To combat this under-constrained nature, tomography methods model the traffic seen in practice and use these models as a priori estimates of the traffic matrix, thereby narrowing the space of TMs that are possible given the link data.

A second difficulty stems from the fact that many of the priors that are known to be effective make simplifying assumptions. For example, the gravity model assumes that the amount of traffic a node (origin) would send to another node (destination) is proportional to the traffic volume received by the destination. Though this prior has been shown to be a good predictor in ISP networks [20, 35], the pronounced patterns in traffic that we observe are quite far from the simple spread that the gravity prior would generate. A final difficulty is due to scale. While most existing methods can compute traffic matrices between a few hundred participants (e.g., POPs in an ISP), even a reasonable cluster has several thousand servers.
Figure 12: CDF of estimation error for TMs estimated by (i) tomogravity, (ii) tomogravity augmented with job information, and (iii) sparsity maximization.

Figure 13: The fraction of entries that comprise 75% of the traffic in the ground truth TM correlates well (negatively) with the estimation error of tomogravity.

Methodology: We compute link counts from the ground truth TM and measure how well the TM estimated by tomography from these link counts approximates the true TM. Our error function avoids penalizing mis-estimates of matrix entries that have small values [35]. Specifically, we choose a threshold $T$ such that entries larger than $T$ make up about 75% of traffic volume, and then obtain the Root Mean Square Relative Error (RMSRE) as

$\mathrm{RMSRE} = \sqrt{ \frac{1}{N_T} \sum_{x^{\mathrm{true}}_{ij} > T} \left( \frac{x^{\mathrm{est}}_{ij} - x^{\mathrm{true}}_{ij}}{x^{\mathrm{true}}_{ij}} \right)^{2} }$,

where $x^{\mathrm{true}}_{ij}$ and $x^{\mathrm{est}}_{ij}$ are the true and estimated entries respectively and $N_T$ is the number of entries larger than $T$. This evaluation sidesteps the issue of scale by attempting to obtain traffic matrices at the ToR level. We report aggregate results over 288 ToR-level TMs, which is about a day's worth of 5-minute average TMs.
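The RMSRE computation can be sketched as follows, with the true and estimated TMs passed as equal-shape arrays (an assumed interface for this illustration); the threshold is picked so that entries above it account for roughly 75% of total volume, as defined above.

import numpy as np

def rmsre(tm_true, tm_est, volume_fraction=0.75):
    """Root Mean Square Relative Error over the large entries of the true TM.

    The threshold T is chosen so that true-TM entries >= T account for about
    `volume_fraction` of the total volume; smaller entries are ignored so that
    mis-estimates of tiny entries are not penalized."""
    true = np.asarray(tm_true, dtype=float).ravel()
    est = np.asarray(tm_est, dtype=float).ravel()
    order = np.argsort(true)[::-1]                 # entries by size, descending
    cum = np.cumsum(true[order])
    cutoff = np.searchsorted(cum, volume_fraction * true.sum())
    threshold = true[order][min(cutoff, len(true) - 1)]
    mask = (true >= threshold) & (true > 0)
    rel_err = (est[mask] - true[mask]) / true[mask]
    return float(np.sqrt(np.mean(rel_err ** 2)))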

5.1 Tomogravity

Tomogravity-based tomography methods [35] use the gravity traffic model to estimate a priori the traffic between a pair of nodes. In Figure 12, we plot the CDF of tomogravity estimation errors of 5-minute TMs taken over an entire day. Tomogravity results in fairly inaccurate inferences, with estimation errors ranging from 35% to 184% and a median of 60%. We observed that the gravity prior used in estimation tends to spread traffic around, whereas the ground truth TMs are sparse. An explanation for this is that communication is more likely between nodes that are assigned to the same job rather than all nodes, whereas the gravity model, not being aware of these job-clusters, introduces traffic across clusters, thus resulting in many non-zero TM entries. To verify this conjecture, we show, in Figure 13, that the estimation error of tomogravity is correlated with the sparsity of the ground truth TM: the fewer the number of entries in the ground truth TM, the larger the estimation error. (A logarithmic best-fit curve is shown in black.)
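For reference, a sketch of the gravity prior that tomogravity starts from: the prior estimate of traffic from ToR i to ToR j is proportional to i's total outbound volume times j's total inbound volume. The subsequent tomogravity step, which adjusts this prior to satisfy the link-count constraints, is not shown; the function below is only the prior.

import numpy as np

def gravity_prior(out_volume, in_volume):
    """Gravity-model prior for a ToR-to-ToR traffic matrix.

    out_volume[i]: total bytes originating at ToR i; in_volume[j]: total bytes
    destined to ToR j (row and column sums, or values derived from link counts).
    Entry (i, j) of the prior is out_volume[i] * in_volume[j] / total volume,
    which spreads traffic smoothly; this is exactly the behaviour that clashes
    with the sparse, job-clustered TMs observed in this cluster. (For a
    ToR-to-ToR TM the diagonal would additionally be zeroed.)"""
    out_v = np.asarray(out_volume, dtype=float)
    in_v = np.asarray(in_volume, dtype=float)
    total = in_v.sum()
    if total == 0:
        return np.zeros((len(out_v), len(in_v)))
    return np.outer(out_v, in_v) / total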

methodthatfavorssparserTMs among the many possibl e. Specically, weformulatedamixed integerlinearprogram ( MILP) that generates the sparsest TM subject to link trac constra ints. Sparsity maximization has been used earlier to isolate anom alous trac[ 34 ].However,wendthatthesparsestTMsaremuchsparser thangroundtruthTMs(seeFigure 14 )andhenceyieldaworseesti- matethantomogravity(seeFigure 12 ). eTMsestimatedviaspar- Figure14:ComparingtheTMsestimatedbyvarioustomograph methodswiththegroundtruthintermsofthenumberofTMen- triesthataccountfor 75%

ofthetotaltrac.Ground truthTMs are sparser than tomogravity estimated TMs, and denser than sparsitymaximizedestimatedTMs. sity maximization contain typically 150 non-zero entries, whichis about 3% of the total TM entries. Further, these non-zero entries donotcorrespondtoheavyhittersinthegroundtruthTMson lya handful( 20 )oftheseentriescorrespondtoentriesingroundtruth TMwithvaluegreaterthanthe 97 -thpercentile.Sparsitymaximiza- tion appears overly aggressive and datacenter trac appear s to be somewhere in between the dense nature of tomogravity estima ted
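One way to write the sparsity-maximizing estimator described above is as the following schematic mixed-integer program (a formulation consistent with the description, not necessarily the exact program we solved): minimize the number of non-zero TM entries subject to the link counts being explained, using binary indicators $z$ and a bound $M$ on any single entry.

\begin{align*}
\min_{x,\,z} \quad & \sum_{i,j} z_{ij} \\
\text{s.t.} \quad & \sum_{i,j} R_{\ell,ij}\, x_{ij} = y_\ell \qquad \forall\ \text{links } \ell \\
& 0 \le x_{ij} \le M\, z_{ij}, \qquad z_{ij} \in \{0,1\} \qquad \forall\ i,j
\end{align*}

Here $x_{ij}$ is the estimated traffic from ToR $i$ to ToR $j$, $R_{\ell,ij}$ is the fraction of $i \to j$ traffic that routing places on link $\ell$, $y_\ell$ is the measured byte count on link $\ell$, and $M$ is an upper bound on any single entry.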

5.3 Prior based on application metadata

Can we leverage application logs to supplement the shortcomings of tomogravity? Specifically, we use metadata on which jobs ran when and which machines were running instances of the same job. We extend the gravity model to include an additional multiplier for traffic between two given nodes (ToRs) $s$ and $d$ that is larger if the nodes share more jobs and smaller otherwise, i.e., the product of the number of instances of a job running on servers under ToRs $s$ and $d$, summed over all jobs. In practice, however, this extension seems to not improve vanilla tomogravity by much; the estimation errors are only marginally better (Figure 12), though the TMs estimated by this method are closer to ground truth in terms of sparsity (Figure 14). We believe that this is due to nodes in a job assuming different roles over time and traffic patterns varying with the respective roles. As future work, we plan to incorporate further information on roles of nodes assigned to a job.
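The job-based multiplier can be sketched as follows; the per-ToR instance counts and the entry-wise rescaling of the gravity prior are one plausible reading of the description above, not necessarily the exact weighting we used.

import numpy as np

def job_multiplier(job_instances):
    """job_instances: dict mapping job -> array of per-ToR instance counts
    (how many instances of that job run under each ToR; assumed to be drawn
    from the job placement metadata). Returns a matrix m where
    m[s, d] = sum over jobs of n_job[s] * n_job[d], i.e., larger when the two
    ToRs host instances of the same jobs."""
    counts = [np.asarray(n, dtype=float) for n in job_instances.values()]
    n_tors = len(counts[0])
    m = np.zeros((n_tors, n_tors))
    for n in counts:
        m += np.outer(n, n)
    return m

def job_aware_prior(gravity, multiplier):
    """Scale the gravity prior entry-wise by the job multiplier and renormalize
    so that the total volume is unchanged."""
    scaled = gravity * multiplier
    total = scaled.sum()
    return scaled * (gravity.sum() / total) if total > 0 else gravity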

6. RELATED WORK

Datacenter networking has recently emerged as a topic of interest. There is not much work on measurement, analysis, and characterization of datacenter traffic. Greenberg et al. [18] report datacenter traffic characteristics (variability at small timescales and statistics on flow sizes and concurrent flows) and use these to guide network design. Benson et al. [7] perform a complementary study of traffic at the edges of a datacenter by examining SNMP traces from routers and identify ON-OFF characteristics, whereas this paper examines novel aspects of traffic within a datacenter in detail.

Traffic measurement in enterprises is better studied, with papers that compare enterprise traffic to wide-area traffic [28], study the health of an enterprise network based on the fraction of successful flows generated by end-hosts [19], and use traffic measurement on end-hosts for fine-grained access control [24].

7. DISCUSSION

We believe that our results here would extend to other mining datacenters that employ some flavor of map-reduce style workflow computation on top of a distributed block store. For example, several companies including Yahoo! and Facebook have clusters running Hadoop, an open-source implementation of map-reduce, and Google has clusters that run MapReduce. In contrast, web or cloud data centers that primarily deal with generating responses for web requests (e.g., mail, messenger) are likely to have different characteristics.
Our results are primarily dictated by how the applications have been engineered and are likely to hold even as the specifics, such as network topology and over-subscription ratios, change. However, we note that, pending future data center measurements, based perhaps on instrumentation similar to that described here, these beliefs remain conjectures at this point.

An implication of our measurements is worth calling out. By partitioning the measurement problem, which in the past was done at switches or routers, across many commodity servers, we relax many of the typical constraints (memory, cycles) for measurement. Clever counters or data structures to perform measurement at line speed under constrained memory are no longer as crucial, but continue to be useful in keeping overheads small. Conversely, however, handling scenarios where multiple independent parties are each measuring a small piece of the puzzle gains new weight.

8. CONCLUSIONS

In spite of widespread interest in datacenter networks, little has been published that reveals the nature of their traffic, or the problems that arise in practice. This paper is a first attempt to capture both the macroscopic patterns (which servers talk to which others, when, and for what reasons) as well as the microscopic characteristics (flow durations, inter-arrival times, and like statistics) that should provide a useful guide for datacenter network designers. These statistics appear more regular and better behaved than counterparts from ISP networks (e.g., elephant flows only last about one second). This, we believe, is the natural outcome of the tighter coupling between network, computing, and storage in datacenter applications. We did not see evidence of super-large flows (flow sizes being determined largely by chunking considerations, optimizing for storage latencies), TCP incast problems (the preconditions apparently not arising consistently), or sustained overloads (owing to near-ubiquitous use of TCP). However, episodes of congestion and negative application impact do occur, highlighting the significant promise for improvement through better understanding of traffic and mechanisms that steer demand.

Acknowledgments: We are grateful to Igor Belianski and Aparna Rajaraman for invaluable help in deploying the measurement infrastructure, and also to Bikas Saha, Jay Finger, Mosha Pasumansky and Jingren Zhou for helping us interpret results.

9. REFERENCES

[1] Amazon Web Services. http://aws.amazon.com
[2] Event Tracing for Windows. http://msdn.microsoft.com/en-us/library/ms751538.asp
[3] Google App Engine. http://code.google.com/appengine/
[4] Hadoop Distributed File System. http://hadoop.apache.org
[5] Windows Azure. http://www.microsoft.com/azure/
[6] L. A. Barroso and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, 2009.
[7] T. Benson, A. Anand, A. Akella, and M. Zhang. Understanding Datacenter Traffic Characteristics. In SIGCOMM WREN Workshop, 2009.
[8] R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. In VLDB, 2008.
[9] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In OSDI, 2006.
[10] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph. Understanding TCP Incast Throughput Collapse in Datacenter Networks. In SIGCOMM WREN Workshop, 2009.
[11] Cisco Guard DDoS Mitigation Appliance. http://www.cisco.com/en/US/products/ps5888/
[12] Cisco Nexus 7000 Series Switches. http://www.cisco.com/en/US/products/ps9402/
[13] C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk. Gigascope: A Stream Database for Network Applications. In SIGMOD, 2003.
[14] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
[15] N. Duffield, C. Lund, and M. Thorup. Estimating Flow Distributions from Sampled Flow Statistics. In SIGCOMM, 2003.
[16] C. Estan, K. Keys, D. Moore, and G. Varghese. Building a Better NetFlow. In SIGCOMM, 2004.
[17] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SOSP, 2003.
[18] A. Greenberg, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: A Scalable and Flexible Data Center Network. In ACM SIGCOMM, 2009.
[19] S. Guha, J. Chandrashekar, N. Taft, and K. Papagiannaki. How Healthy are Today's Enterprise Networks? In IMC, 2008.
[20] A. Gunnar, M. Johansson, and T. Telkamp. Traffic Matrix Estimation on a Large IP Backbone: A Comparison on Real Data. In IMC, 2004.
[21] L. Huang, X. Nguyen, M. Garofalakis, J. Hellerstein, M. Jordan, M. Joseph, and N. Taft. Communication-Efficient Online Detection of Network-Wide Anomalies. In INFOCOM, 2007.
[22] IETF Working Group on IP Flow Information Export (ipfix). http://www.ietf.org/html.charters/ipfix-charter.html
[23] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EUROSYS, 2007.
[24] T. Karagiannis, R. Mortier, and A. Rowstron. Network Exception Handlers: Host-Network Control in Enterprise Networks. In SIGCOMM, 2008.
[25] M. Kodialam, T. V. Lakshman, and S. Sengupta. Efficient and Robust Routing of Highly Variable Traffic. In HotNets, 2004.
[26] R. Kompella and C. Estan. The Power of Slicing in Internet Flow Measurement. In IMC, 2005.
[27] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD, 2008.
[28] R. Pang, M. Allman, M. Bennett, J. Lee, V. Paxson, and B. Tierney. A First Look at Modern Enterprise Traffic. In IMC, 2005.
[29] IETF Packet Sampling (Active WG). http://tools.ietf.org/wg/psamp/
[30] S. Kandula, D. Katabi, S. Sinha, and A. Berger. Dynamic Load Balancing Without Packet Reordering. In CCR, 2006.
[31] sFlow.org. Making the Network Visible. http://www.sflow.org
[32] A. Soule, A. Lakhina, N. Taft, K. Papagiannaki, K. Salamatian, A. Nucci, M. Crovella, and C. Diot. Traffic Matrices: Balancing Measurements, Inference and Modeling. In ACM SIGMETRICS, 2005.
[33] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. Andersen, G. Ganger, G. Gibson, and B. Mueller. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication. In SIGCOMM, 2009.
[34] Y. Zhang, Z. Ge, A. Greenberg, and M. Roughan. Network Anomography. In IMC, 2005.
[35] Y. Zhang, M. Roughan, N. C. Duffield, and A. Greenberg. Fast Accurate Computation of Large-Scale IP Traffic Matrices from Link Loads. In ACM SIGMETRICS, 2003.
[36] R. Zhang-Shen and N. McKeown. Designing a Predictable Internet Backbone Network. In HotNets, 2004.