Improving Network Connection Locality on Multicore Systems Aleksey Pesterev Jaco

Morris MIT CSAIL Quanta Research Cambridge alekseyp nickolai rtm csailmitedu jacobstraussqrclabcom

Abstract: Incoming and outgoing processing for a given TCP connection often execute on different cores: an incoming packet is typically processed on th


[Figure 1: A TCP listen socket in Linux is composed of two data structures: the request hash table, which holds request sockets, and the accept queue, which holds established TCP sockets. The listen socket performs three duties: (1) tracking connection initiation requests on SYN packet reception, (2) storing TCP connections that finished the three-way handshake, and (3) supplying TCP connections to applications on calls to accept(). Diagram labels: listen socket, lock, request hash table, request socket, accept queue, TCP socket; steps 1. SYN, 2. ACK, 3. accept().]

[...] space application. Unfortunately, this does not work due to limited steering table sizes and the cost of maintaining this table; §7 gives more details.

We present an evaluation on a 48-core Linux server running Apache. First, we show that, independent of Affinity-Accept, splitting up the new connection queue into multiple queues with finer-grained locks improves throughput by 2.8×. Next, we evaluate Affinity-Accept and show that processing each connection on a single core further reduces time spent in the TCP stack by 30% and improves overall throughput by 24%. The main reason for this additional improvement is a reduction in cache misses on connection state written by other cores.

The rest of the paper is organized as follows. §2 describes the problems with connection management in more detail, using Linux as an example. §3 presents the kernel components of Affinity-Accept's design, with application-level considerations in §4, and implementation details in §5. §6 evaluates Affinity-Accept. §7 presents related work and §8 concludes.

2. Connection Processing in Linux

This section provides a brief overview of Linux connection initiation and processing, and describes two problems with parallel connection processing: a single lock per socket and costly accesses to shared cache lines.

2.1 TCP Listen Socket Lock

The Linux kernel represents a TCP port that is waiting for incoming connections with a TCP listen socket, shown in Figure 1. To establish a connection on this port the client starts a TCP three-way connection setup handshake by sending a SYN packet to the server. The server responds with a SYN-ACK packet. The client finishes the handshake by returning an ACK packet. A connection that has finished the handshake is called an established connection. A listen socket has two parts to track this handshake protocol: a hash table recording arrived SYNs for which no ACK has arrived, and an “accept queue” of connections for which the SYN/SYN-ACK/ACK handshake has completed. An arriving SYN creates an entry in the hash table and triggers a SYN-ACK response. An arriving ACK moves the connection from the hash table to the accept queue. An application pulls connections from the accept queue using the accept() system call.

Incoming connection processing scales poorly on multi-core machines because each TCP port's listen socket has a single hash table and a single accept queue, protected by a single lock. Only one core at a time can process incoming SYN packets, ACK packets (in response to SYN-ACKs), or accept() system calls.

Once Linux has set up a TCP connection and an application process has accepted it, further processing scales much better. Per-connection locks allow only one core at a time to manipulate a connection, but in an application with many active connections, this allows for sufficient parallelism. The relatively coarse-grained per-connection locks have low overhead and are easy to reason about [9, 18].

Our approach to increasing parallelism during TCP connection setup is to partition the state of the listen socket in order to allow finer-grained locks. §3 describes this design. Locks, however, are not the whole story: there is still the problem of multiple cores sharing an established connection's state, which we discuss next.

2.2 Shared Cache Lines in Packet Processing

The Linux kernel associates a large amount of state with each established TCP connection, and manipulates that state whenever a packet arrives or departs on that connection. The core which manipulates a connection's state in response to incoming packets is determined by where the Ethernet interface (NIC) delivers packets. A different core may execute the application process that reads and writes data on the connection's socket. Thus, for a single connection, different cores often handle incoming packets, outgoing packets, and copy connection data to and from the application.

The result is that cache lines holding a connection's state may frequently move between cores. Each time one core modifies a cache line, the cache coherency hardware must invalidate any other copies of this data. Subsequent reads on other cores pull data out of a remote core's cache. This sharing of cache lines is expensive on a multicore machine because remote accesses are much slower than local accesses.

Processing the same connection on multiple cores also creates memory allocation performance problems. An example is managing buffers for incoming packets. The kernel allocates buffers to hold packets out of a per-core pool. The kernel allocates a buffer on the core that initially receives the packet from the RX DMA ring, and deallocates a buffer on the core that calls recvmsg(). With a single core processing a connection, both allocation and deallocation are fast because they access the same local pool. With multiple cores performance suffers because remote deallocation is slower.

[...] by partitioning the accept queue into per-core accept queues, each protected by its own lock, and by using a separate lock to protect each bucket in the request hash table. These changes avoid lock contention on the listen socket. We describe these design decisions in more detail in §5.

To achieve connection affinity, Affinity-Accept modifies the behavior of the accept() system call. When an application calls accept(), Affinity-Accept returns a connection from the local core's accept queue for the corresponding listen socket, if available. If no local connections are available, accept() goes to sleep (as will be described in §3.3, the core first checks other cores' queues before going to sleep). When new connections arrive, the network stack wakes up any threads waiting on the local core's accept queue. This allows all connection processing to occur locally.

3.3 Connection Load Balancer

Always accepting connections from the local accept queue, as described in §3.2, addresses the connection affinity problem, but introduces potential load imbalance problems. If one core cannot keep up with incoming connections in its local accept queue, the accept queue will overflow, and the kernel will drop connection requests, adversely affecting the client. However, even when one core is too busy to accept connections, other cores may be idle. An ideal system would offload connections from the local accept queue to other idle cores.

There are two cases for why some cores may be able to process more connections than others. The first is a short-term load spike on one core, perhaps because that core is handling a CPU-intensive request, or an unrelated CPU-intensive process runs on that core. To deal with short-term imbalance, Affinity-Accept performs connection stealing, whereby an application thread running on one core accepts incoming connections from another core. Since connection stealing transfers one connection at a time between cores, updating the NIC's flow routing table for each stolen connection would not be worthwhile.

The second case is a longer-term load imbalance, perhaps due to an uneven distribution of flow groups in the NIC, due to unrelated long-running CPU-intensive processes, or due to differences in CPU performance (e.g., some CPUs may be further away from DRAM). In this case, Affinity-Accept's goal is to preserve efficient local processing of connections. Thus, Affinity-Accept must match the load offered to each core (by packets from the NIC's RX DMA rings) to the application's throughput on that core. To do this, Affinity-Accept implements flow group migration, in which it changes the assignment of flow groups in the NIC's FDir table (§3.1).

In the rest of this subsection, we first describe connection stealing in §3.3.1, followed by flow group migration in §3.3.2.

3.3.1 Connection Stealing

Affinity-Accept's connection stealing mechanism consists of two parts: the first is the mechanism for stealing a connection from another core, and the second is the logic for determining when stealing should be done, and determining the core from which the connection should be stolen.

Stealing mechanism. When a stealer core decides to steal a connection from a victim core, it acquires the lock on the victim's local accept queue, and dequeues a connection from it. Once a connection has been stolen, the stealer core executes application code to process the connection, but the victim core still performs processing of incoming packets from the NIC's RX DMA ring. This is because the FDir table in the NIC cannot be updated on a per-flow basis. As a result, the victim core is still responsible for performing some amount of processing on behalf of the stolen connection. Thus, short-term connection stealing temporarily violates Affinity-Accept's goal of connection affinity, in hope of resolving a load imbalance.

Stealing policy. To determine when one core should steal incoming connections from another core, Affinity-Accept designates each core to be either busy or non-busy. Each core determines its own busy status depending on the length of its local accept queue over time; we will describe this algorithm shortly. Non-busy cores try to steal connections from busy cores, in order to even out the load in the short term. Busy cores never steal connections from other cores.

When an application calls accept() on a non-busy core, Affinity-Accept can either choose to dequeue a connection from its local accept queue, or from the accept queue of a busy remote core. When both types of incoming connections are available, Affinity-Accept must maintain efficient local processing of incoming connections, while also handling some connections from remote cores. To do this, Affinity-Accept implements proportional-share scheduling. We find that a ratio of 5:1 between local and remote connections accepted appears to work well for a range of workloads. The overall performance is not significantly affected by the choice of this ratio. Ratios that are too low start to prefer remote connections over local ones, and ratios that are too high do not steal enough connections to resolve a load imbalance.

Each non-busy core uses a simple heuristic to choose from which remote core to steal. Cores are deterministically ordered. Each core keeps track of the last remote core it stole from, and starts searching for the next busy core one past the last core. Thus, non-busy cores steal in a round-robin fashion from all busy remote cores, achieving fairness and avoiding contention. Unfortunately, round robin does not give preference to any particular remote queue, even if some cores are more busy than others. Investigating this trade-off is left to future work.

Tracking busy cores. An application specifies the maximum accept queue length in the listen() system call. Affinity-Accept splits the maximum length evenly across all cores; this length is called the maximum local accept queue [...]
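
The stealing policy of §3.3.1 can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: the class name `PerCoreListenSocket`, its fields, and the rule "attempt one steal after every five local accepts" are hypothetical choices made here to approximate the 5:1 proportional share; the busy flags are set by hand because the busy-tracking algorithm is not modeled; and `accept()` returns `None` where the kernel would put the caller to sleep.

```python
from collections import deque
from threading import Lock

LOCAL_TO_REMOTE = 5  # local accepts per remote steal (the 5:1 ratio from the text)


class PerCoreListenSocket:
    """Toy model: one accept queue and one lock per core (hypothetical names)."""

    def __init__(self, num_cores):
        self.queues = [deque() for _ in range(num_cores)]
        self.locks = [Lock() for _ in range(num_cores)]
        self.busy = [False] * num_cores             # set externally in this sketch
        self.last_victim = list(range(num_cores))   # round-robin cursor per core
        self.local_streak = [0] * num_cores         # local accepts since last steal

    def enqueue(self, core, conn):
        with self.locks[core]:
            self.queues[core].append(conn)

    def _dequeue(self, core):
        with self.locks[core]:
            return self.queues[core].popleft() if self.queues[core] else None

    def _steal(self, core):
        # Search busy cores in a fixed order, starting one past the last victim.
        n = len(self.queues)
        for step in range(1, n + 1):
            victim = (self.last_victim[core] + step) % n
            if victim == core or not self.busy[victim]:
                continue
            conn = self._dequeue(victim)
            if conn is not None:
                self.last_victim[core] = victim
                return conn
        return None

    def accept(self, core):
        may_steal = not self.busy[core]  # busy cores never steal
        if may_steal and self.local_streak[core] >= LOCAL_TO_REMOTE:
            conn = self._steal(core)
            if conn is not None:
                self.local_streak[core] = 0
                return conn
        conn = self._dequeue(core)
        if conn is not None:
            self.local_streak[core] += 1
            return conn
        # Local queue empty: a non-busy core checks remote queues before giving up.
        return self._steal(core) if may_steal else None
```

In this model each queue has its own lock, so a steal contends only with the victim core, which mirrors the mechanism described above of acquiring the lock on the victim's local accept queue.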
4.2 Application Structure

To get optimal performance with Affinity-Accept, calls to accept(), recvmsg(), and sendmsg() on the same connection must take place on the same core. The architecture of the web server determines whether this happens or not.

An event-driven web server like lighttpd [4] adheres to this guideline. Event-driven servers typically run multiple processes, each running an event loop in a single thread. On calls to accept() the process gets back a connection with an affinity for the local core. Subsequent calls to recvmsg() and sendmsg() therefore also deal with connections that have an affinity for the local core. The designers of such web servers recommend spawning at least two processes per available core [6] to deal with file system I/O blocking a process and all of its pending requests. If one process blocks, another non-blocked process may be available to run. The Linux process load balancer distributes the multiple processes among the available cores. One potential concern with the process load balancer is that it migrates processes between cores when it detects a load imbalance. All connections the process accepts after the migration would have an affinity to the new core, but existing connections would have affinity for the original core. §6 shows that this is not a problem because the Linux load balancer rarely migrates processes, as long as the load is close to even across all cores.

The Apache [2] web server has more modes of operation than lighttpd, but none of Apache's modes are ideal for Affinity-Accept without additional changes. In “worker” mode, Apache forks multiple processes; each accepts connections in one main thread and spawns multiple threads to process those connections. The problem with using this design with Affinity-Accept is that the scheduler disperses the threads across cores, causing the accept and worker threads to run on different cores. As a result, once the accept thread accepts a new connection, it hands it off to a worker thread that is executing on another core, violating connection affinity.

Apache's “prefork” mode is similar to lighttpd in that it forks multiple processes, each of which accepts and processes a single connection to completion. Prefork does not perform well with Affinity-Accept for two reasons. First, prefork uses many more processes than worker mode, and thus spends more time context-switching between processes. Second, each process allocates memory from the DRAM controller closest to the core on which it was forked, and in prefork mode, Apache initially forks all processes on a single core. Once the processes are moved to another core by the Linux process load balancer, memory operations become more expensive since they require remote DRAM accesses.

The approach we take to evaluate Apache's performance is to use the “worker” mode, but to pin Apache's accept and worker threads to specific cores, which avoids these problems entirely. However, this does require additional setup configuration at startup time to identify the correct number of cores to use, and reduces the number of threads which the Linux process load balancer can move between cores to address a load imbalance. A better solution would be to add a new kernel interface for specifying groups of threads which the kernel should schedule together on the same core. Designing such an interface is left to future work.

5. Implementation

Affinity-Accept builds upon the Linux 2.6.35 kernel, patched with changes described by Boyd-Wickizer et al. [9], and includes the TCP listen socket changes described in §3. Affinity-Accept does not create any new system calls. We added 1,200 lines of code to the base kernel, along with a new kernel module to implement the connection load balancer in about 800 lines of code.

We used the 2.0.84.9 IXGBE driver. We did not use a newer driver (version 3.3.9) because we encountered a 20% performance regression. We modified the driver to add a mode that configures the FDir hardware so that the connection load balancer could migrate connections between cores. We also added an interface to migrate flow groups from one core to another. The changes required about 700 lines of code.

We modified the Apache HTTP Server version 2.2.14 to disable a mutex (described in §4) used to serialize multiple processes on calls to accept() and poll().

5.1 Per-core Accept Queues

One of the challenging aspects of implementing Affinity-Accept was to break up the single lock and accept queue in each listen socket into per-core locks and accept queues. The difficulty comes from the large amount of Linux code that deals with the data structures in question. In particular, Linux uses a single data structure, called a sock, to represent sockets for any protocol (TCP, UDP, etc). Each protocol specializes the sock structure to hold its own additional information for that socket (such as the request hash table and accept queue, for a TCP listen socket). Some functions, especially those in early packet processing stages, manipulate the sock part of the data structure. Other operations are protocol-specific, and use a table of function pointers to invoke protocol-specific handlers. Importantly, socket locking happens on the sock data structure directly, and does not invoke a protocol-specific function pointer; for example, the networking code often grabs the sock lock, calls a protocol-specific function pointer, and then releases the lock. Thus, changing the locking policy on sock objects would require changing the locking policy throughout the entire Linux network stack.

To change the listen socket locking policy without changing the shared networking code that deals with processing sock objects, we clone the listen socket. This way, there is a per-core copy of the original listen socket, each protected by its own socket lock. This ensures cores can manipulate their per-core clones in parallel. Most of the existing code does not need to change because it deals with exactly the same [...]
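
The cloning idea in §5.1 can be sketched in a few lines of user-space code. The sketch below is an illustration under assumptions and does not reflect the kernel's actual data layout: `Sock`, `generic_receive`, `tcp_listen_rcv`, and `clone_per_core` are hypothetical names, a dictionary stands in for the table of function pointers, and a list stands in for the accept queue. The point it shows is that the shared code path (lock the sock, call the protocol handler, unlock) stays unchanged, while giving each core its own clone gives each core its own lock.

```python
from threading import Lock


class Sock:
    """Generic socket: a lock plus a table of protocol-specific handlers."""

    def __init__(self, ops):
        self.lock = Lock()
        self.ops = ops            # stands in for the table of function pointers
        self.accept_queue = []    # protocol-specific state, heavily simplified


def generic_receive(sock, packet):
    """Shared networking code: grab the sock lock, call the handler, release."""
    with sock.lock:
        return sock.ops["rcv"](sock, packet)


def tcp_listen_rcv(sock, packet):
    """Hypothetical protocol handler: queue the packet, return the queue length."""
    sock.accept_queue.append(packet)
    return len(sock.accept_queue)


def clone_per_core(ops, num_cores):
    """One clone of the listen socket per core, each with its own lock and queue."""
    return [Sock(ops) for _ in range(num_cores)]


clones = clone_per_core({"rcv": tcp_listen_rcv}, num_cores=4)

# Holding core 0's lock does not block work on core 1's clone.
with clones[0].lock:
    queued_on_core1 = generic_receive(clones[1], "syn-from-client")
```

Because `generic_receive` only ever touches the lock of the `Sock` it is handed, it needs no changes when the single listen socket is replaced by per-core clones.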
[...] directly because it does not sufficiently stress the network stack: some requests involve performing SQL queries or running PHP code, which stresses the disk and CPU more than the network stack. Applications that put less stress on the network stack will see less pronounced improvements with Affinity-Accept. The files served range from 30 bytes to 5,670 bytes. The web server serves 30,000 distinct files, and a client chooses a file to request uniformly over all files.

Unless otherwise stated, in all experiments a client requests a total of 6 files per connection with requests spaced out by think time. First, a client requests one file and waits for 100 ms. The client then requests two more files, waits 100 ms, requests three more files, and finally closes the connection. §6.6 shows that the results are independent of the think time.

We configure lighttpd with 10 processes per core for a total of 480 processes on the AMD machine. Each process is limited to a maximum of 200 connections. Having several processes handling connections on each core limits the number of broken connection affinities if the Linux scheduler migrates one of the processes to another core, and reduces the number of file descriptors each process must pass to the kernel via poll().

We run Apache in worker mode and spawn one process per core. Each process consists of one thread that only accepts connections and multiple worker threads that process accepted connections. We modify the worker model to pin each process to a separate core. All threads in a process inherit the core affinity of the process, and thus the accept thread and worker threads always run on the same core. A single thread processes one connection at a time from start to finish. We configure Apache with 1,024 worker threads per process, which is enough to keep up with the load and think time.

We use a few different implementations of the listen socket to evaluate our design. We first compare Affinity-Accept to a stock Linux listen socket that we call “Stock-Accept” and then a second intermediate listen socket implementation that we refer to as “Fine-Accept”. Fine-Accept is similar to Affinity-Accept, but does not maintain connection affinity to cores. On calls to accept(), Fine-Accept dequeues connections out of cloned accept queues in a round-robin fashion. This scheme performs better than Stock-Accept's single accept queue, because with multiple accept queues, each queue is protected by a distinct lock, and multiple connections can be accepted in parallel. The Fine-Accept listen socket does not need a load balancer because accepting connections round-robin is intrinsically load balanced: all queues are serviced equally.

In all configurations we use the NIC's FDir hardware to distribute incoming packets among all hardware DMA rings (as described in §3.1) and we configure interrupts so that each core processes its own DMA ring.

6.3 Socket Lock

First, we measure the throughput achieved with the stock Linux listen socket, which uses a single socket lock. The Stock-Accept line in Figure 2 shows the scalability of Apache on the AMD machine. The number of requests each core can process decreases drastically as the number of cores increases (in fact, the total number of requests handled per second stays about the same, despite the added cores). There is an increasing amount of idle time past 12 cores because the socket lock works in two modes: spinlock mode where the kernel busy loops and mutex mode where the kernel puts the thread to sleep.

[Figure 2: Apache performance with different listen socket implementations on the AMD machine. Plot of throughput (requests/sec/core) against cores for Stock-Accept, Fine-Accept, and Affinity-Accept.]

[Figure 3: Lighttpd performance with different listen socket implementations on the AMD machine. Plot of throughput (requests/sec/core) against cores for Stock-Accept, Fine-Accept, and Affinity-Accept.]

To understand the source of this bottleneck, the Stock-Accept row in Table 2 shows the cost of acquiring the socket lock when running Apache on a stock Linux kernel on the AMD machine. The numbers are collected using lock_stat, a Linux kernel lock profiler that reports, for all kernel locks, how long each lock is held and the wait time to acquire the lock. Using lock_stat incurs substantial overhead due to accounting on each lock operation, and lock_stat does not track the wait time to acquire the socket lock in mutex mode; however, the results do give a picture of which locks are contended. Using Stock-Accept, the machine can process a request in 590 µs, 82 µs of which it waits to acquire the listen socket lock in spin mode and at most 320 µs in mutex mode. Close to 70% of the time is spent waiting for another core. Thus, the decline observed for Stock-Accept in Figure 2 is due to contention on the listen socket lock.

                                                               |---------- Non-Idle Time ----------|
Listen Socket    Throughput            Total Time  Idle Time   Socket Lock  Socket Lock  Other Time
                 (requests/sec/core)                           Wait Time    Hold Time
Stock-Accept     1,700                 590 µs      320 µs      82 µs        25 µs        163 µs
Fine-Accept      5,700                 178 µs      8 µs        0 µs         30 µs        140 µs
Affinity-Accept  7,000                 144 µs      4 µs        0 µs         17 µs        123 µs

Table 2: The composition of time to process a single request with Apache running on the AMD machine with all 48 cores enabled. These numbers are for a lock_stat enabled kernel; as a consequence the throughput numbers, shown in the first column, are lower than in other experiments. The total time to process a request, shown in the second column, is composed of both idle and non-idle time. The idle time is shown in the third column; included in the idle time is the wait time to acquire the socket lock in mutex mode. The last three columns show the composition of active request processing time. The fourth column shows the time the kernel waits to acquire the socket lock in spinlock mode and the fifth column shows the time the socket lock is held once it is acquired. The last column shows the time spent outside the socket lock.

To verify that Apache's thread pinning is not responsible for Affinity-Accept's performance advantage, Figure 3 presents results from the same experiment with lighttpd, which does not pin threads. Affinity-Accept again consistently achieves higher throughput. The downward slope of Affinity-Accept is due to lighttpd's higher performance that saturates the NIC: the NIC hardware is unable to process any additional packets. Additionally the higher request processing rate triggers a scalability limitation in how the kernel tracks reference counts to file objects; we have not yet explored workarounds or solutions for this problem.

The Affinity-Accept line in Figure 2 shows that the scalability of Apache improves when we use the Affinity-Accept listen socket. Part of the performance improvement comes from the reduced socket lock wait time, as shown in the Affinity-Accept row of Table 2. Part of the improvement also comes from improved locality, as we evaluate next.

6.4 Cache Line Sharing

To isolate the performance gain of using fine-grained locking from gains due to local connection processing, we analyze the performance of Fine-Accept. The Fine-Accept row in Table 2 confirms that Fine-Accept also avoids bottlenecks on the listen socket lock. However, Figures 2 and 3 show that Affinity-Accept consistently outperforms Fine-Accept. This means that local connection processing is important to achieving high throughput, even with fine-grained locking. In the case of Apache, Affinity-Accept outperforms Fine-Accept by 24% at 48 cores and in the case of lighttpd by 17%.

Tracking misses. In order to find out why Affinity-Accept outperforms Fine-Accept, we instrumented the kernel to record a number of performance counter events during each type of system call and interrupt. Table 3 shows results of three performance counters (clock cycles, instruction count, and L2 misses) tracking only kernel execution. The table also shows the difference between Fine-Accept and Affinity-Accept. The softirq_net_rx kernel entry processes incoming packets. These results show that Fine-Accept uses 40% more clock cycles than Affinity-Accept to do the same amount of work in softirq_net_rx. Summing the cycles column over network stack related system calls and interrupts, the improvement from Fine-Accept to Affinity-Accept is 30%. The application level improvement due to Affinity-Accept, however, is not as high at 24%. This is because the machine is doing much more than just processing packets when it runs Apache and lighttpd. Both implementations execute approximately the same number of instructions; thus, the increase is not due to executing more code. The number of L2 misses, however, doubles when using Fine-Accept. These L2 misses indicate that cores need to load more cache lines from either the shared L3, remote caches, or DRAM.

                   Cycles               Instructions          L2 Misses
Kernel Entry       Total        Δ       Total        Δ        Total      Δ
softirq_net_rx     97k / 69k    28k     33k / 34k    -788     352 / 178  174
sys_read           17k / 10k    7k      4k / 4k      260      60 / 31    29
schedule           23k / 17k    6k      9k / 8k      450      79 / 38    41
sys_accept4        12k / 7k     5k      3k / 2k      666      38 / 19    19
sys_writev         15k / 12k    3k      5k / 4k      120      53 / 33    20
sys_poll           12k / 9k     3k      4k / 4k      94       39 / 17    22
sys_shutdown       8k / 6k      3k      3k / 3k      55       28 / 7     21
sys_futex          18k / 16k    3k      8k / 8k      357      56 / 45    11
sys_close          5k / 4k      707     2k / 2k      29       12 / 10    2
softirq_rcu        714 / 603    111     212 / 204    8        4 / 3      1
sys_fcntl          375 / 385    -10     275 / 276    -1       0 / 0      0
sys_getsockname    706 / 719    -13     277 / 275    2        1 / 1      0
sys_epoll_wait     2k / 2k      -29     568 / 601    -33      3 / 2      1

Table 3: Performance counter results categorized by kernel entry point. System call kernel entry points begin with “sys”, and timer and interrupt kernel entry points begin with “softirq”. Numbers before and after the slash correspond to Fine-Accept and Affinity-Accept, respectively. Δ reports the difference between Fine-Accept and Affinity-Accept. The kernel processes incoming connections in softirq_net_rx.

To understand the increase in L2 misses, and in particular, what data structures are contributing to the L2 misses, we use DProf [19]. DProf is a kernel profiling tool that, for a particular workload, profiles the most commonly used data structures and the access patterns to these data structures. Table 4 shows that the most-shared objects are those associated with connection and packet processing. For example the tcp_sock data type represents a TCP established socket. Cores share 30% of the bytes that make up this structure.
CoresFigure6:Lighttpdperformancewithdifferentlistensocketimple-mentationsontheIntelmachine.showstheperformanceoflighttpdonourIntelsystem.Afnity-AcceptoutperformsFine-Acceptbyasmallermar-ginonthissystemthanontheAMDmachine.Wesuspectthatthisdifferenceisduetofastermemoryaccessesandafasterinterconnect.6.5LoadBalancerToevaluatetheloadbalancerwewanttoshowtwothings.First,thatwithoutconnectionstealing,acceptqueuesover-owandaffecttheperformanceperceivedbyclients.Second,thatowgroupmigrationreducestheincomingpacketloadoncoresnotprocessingnetworkpackets,andspeedsupotherapplicationsrunningonthesecores.Therstexperimentillustratesthattheloadbalancercandealwithcoresthatsuddenlycannotkeepupwiththeincom-ingload.Wetestthisbyrunningthewebserverbenchmarkonallcoresbutadjustingtheloadsotheserverusesonly50%oftheCPUtime.Foreachconnection,theclientterminatestheconnectionafter10secondsifitgetsnoresponsefromtheserver.Toreducetheprocessingcapacityofthetestmachine,westartabuildoftheLinuxkernelusingparallelmakeonhalfofthecores(usingsched setaffinity()tolimitthecoresonwhichmakecanrun).Eachclientrecordsthetimetoserviceeachconnection,andwecomputethemedianlatency.Runningjustthewebserverbenchmarkyieldsamedianand90thpercentilelatencyof200mswithandwithouttheloadbalancer.Thisincludesboththetimetoprocessesthe6requests,aswellasthetwo100msclientthinktimes,indicatingthatrequestprocessingtakesunder1ms.Whenmakeisrunningonhalfofthecores,themedianand90thpercentilelatenciesjumpto10secondsintheabsenceofaloadbalancer,becausethemajorityofconnectionstimeoutattheclientbeforetheyareservicedbythewebserver.Thetimeoutsareduetoacceptqueueoverowsoncoresrunningmake.Enablingloadbalancingreducesthemedianlatencyto230ms,andthe90thpercentilelatencyto480ms.Theextra30msinmedianlatencyisduetothe100%utilizationofcoresstillexclusivelyrunninglighttpd,andisaconsequenceoftakingaworkloadthatuses50%ofCPUtimeon48coresand Fine-AcceptAfnity-AcceptStock-Accept02000400060008000100001200014000160001101001000 Throughput(requests/sec/core) 
RequestsperConnectionFigure7:TheeffectofTCPconnectionreuseonApache'sthrough-putrunningontheAMDmachine.TheacceptratedecreasesasclientssendmoreHTTPrequestsperTCPconnection. Fine-AcceptAfnity-AcceptStock-Accept0200040006000800010000120000.11101001000 Throughput(requests/sec/core) ThinkTime(ms)Figure8:TheeffectofincreasingclientthinktimeonApache'sthroughputrunningontheAMDmachine.squeezingitonto24cores.Iftheinitialwebserverutilizationislessthan50%,themedianlatencyfallsbackto200ms.Thesecondexperimentshowsthatowgroupmigrationimprovesnon-webserverapplicationperformance.Werunthesameexperimentasabove,withconnectionstealingenabled,butthistimemeasuretheruntimeofthekernelcompile.Asabaseline,thecompilationtakes125son24coreswithoutthewebserverrunning.Addingthewebserverworkloadwithowgroupmigrationdisabledincreasesthetimeto168s.Enablingowgroupmigrationreducesthetimeto130s.Theextra5sisduetothetimeittakesowgroupmigrationtomoveallowgroupsawayfromthecoresrunningmake.Thismigrationactuallyhappenstwice,becausethekernelmakeprocesshastwoparallelphasesseparatedbyamulti-secondserialprocess.Duringthebreakbetweenthetwophases,Afnity-Acceptmigratesowgroupsbacktothecoresthatwererunningmake.6.6VariationstotheWorkloadTheevaluationthusfarconcentratedonshort-livedconnec-tions.ThissectionexaminestheeffectofthreeparametersoftheworkloadonAfnity-Acceptperformance:acceptrate, Fine-AcceptAfnity-AcceptStock-Accept0200040006000800010000120001400010100100010000 Throughput(requests/sec/core) 
AverageFileSize(bytes)Figure9:TheeffectofdifferentaveragelesizesonApache'sthroughputrunningontheAMDmachine.Theaveragelesizeisofallservicedles.clientthinktime,andaverageservedlesize.AlloftheexperimentsinthissectionwererunwithallCPUsenabled.AcceptRate.TherstworkloadvariationweconsiderisHTTPconnectionreuse.AclientcansendmultiplerequeststhroughasingleHTTPconnection,whichreducesthefrac-tionofacceptsintheserver'stotalnetworktrafc.Inpreviousexperimentstherequestperconnectionratioisxedto6;Fig-ure7showstheperformanceofApacheasthenumberofrequestsperconnectionvaries.Inthisexample,Apacheisconguredtopermitanunboundednumberofrequestsperconnectioninsteadofthedefaultcongurationwhichlimitsconnectionreuse.Whenthenumberofrequestsperconnec-tionissmall,Afnity-AcceptandFine-AcceptoutperformStock-Acceptasdescribedintheearlierexperiments.Ascon-nectionreuseincreases,totalthroughputincreases,asthereislessoverheadtoinitiateandteardownconnections.Afnity-AcceptoutperformsFine-Acceptatallmeasuredpoints.Atveryhighratesofconnectionreuse(above5,000requestsperconnection),lockcontentionforthelistensocketisnolongerabottleneckforStock-Accept,anditsperformancematchesthatofFine-Accept.Figure7alsoshowsthatAfnity-AcceptprovidesabenetoverFine-Acceptevenwhenacceptsarelessfrequent,sinceAfnity-Acceptreducesdatasharingcostsafterthekernelacceptsconnections.ThinkTime.Figure8showstheeffectofincreasingthelifetimeofaconnectionbyaddingthinktimeattheclientbetweenrequestssentoverthesameTCPconnection.Thisexperimentholdsconnectionreuseconstantat6requestsperconnection,soitdoesnotvarythefractionofnetworktrafcdevotedtoconnectioninitiation.Itdoesaddavariableamountofclient-sidethinktimebetweensubsequentrequestsonthesameconnectiontoincreasethetotalnumberofactiveconnectionsthattheservermusttrack.Therangeofthinktimesintheplotcoverstherangeofdelaysaservermightexperienceindatacenterorwideareaenvironments,althoughthepatternofpacketarrivalwouldbesomewhatdifferentifthedelayswereduetopropagationdelayinsteadofthinktime.Beyondtherightmo
stedgeoftheplot(1s),theserverwouldneedmorethanhalfamillionthreads,whichourkernelcannotsupport.Stock-Acceptdoesnotperformwellinanyofthesecasesduetosocketlockcontention.Afnity-AcceptoutperformsFine-Acceptandthetwosustainaconstantrequestthroughputacrossawiderangeofthinktimes.ThisgraphalsopointsouttheproblemwithNICassistedowredirection.Inthisexperimentat100msofthinktimetherearemorethan50,000concurrentlyactiveconnectionsandat1softhinktimemorethan300,000connections.SuchalargenumberofactiveconnectionswouldlikelynottintoacurrentNIC'sowsteeringtable.AverageFileSize.Figure9showshowlesizeaffectsAfnity-Accept.Theaveragelesizeforpreviousexperi-mentsisaround700bytesandtranslatesto4.5Gbpsoftrafcat12,000requests/second/core.Herewechangealllespro-portionallytoincreaseordecreasetheaveragelesize.TheperformanceofStock-Acceptisonceagainlowduetolockcontention.Atanaveragelesizelargerthan1Kbyte,theNIC'sbandwidthsaturatesforbothFine-AcceptandAfnity-Accept;asaconsequence,therequestratedecreasesandservercoresexperienceidletime.TheStock-Acceptcong-urationdoesnotserveenoughrequeststosaturatetheNIC,untiltheaveragelesizereachesabout10Kbytes.7.RelatedWorkTherehasbeenpreviousresearchthatshowsprocessingpacketsthatarepartofthesameconnectiononasinglecoreiscriticaltogoodnetworkingperformance.Nahumetal.[18],Yatesetal.[24],Boyd-Wickizeretal.[9],andWillmannetal.[23]alldemonstratethatanetworkstackwillscalewiththenumberofcoresaslongastherearemanyconnectionsthatdifferentcorescanprocessesinparallel.Theyalsoshowthatitisbesttoprocesspacketsofthesameconnectiononthesamecoretoavoidperformanceissuesduetolocksandoutoforderpacketprocessing.Wepresentindetailamethodforprocessingasingleconnectiononthesamecore.Boyd-Wickizeretal.[9]usedanearlierversionofthisworktogetgoodscalabilityresultsfromtheLinuxnetworkstack.RouteBricks[10]evaluatespacketprocessingandroutingschemesonamulticoremachineequippedwithmultipleIXGBEcards.Theyshowthatprocessingapacketexclusivelyononecoresubstantiallyimprovesperformancebecauseitreducesinter-cor
ecachemissesandDMAringlockingcosts.Serverswitch[17]appliesrecentimprovementsinthepro-grammabilityofnetworkcomponentstodatacenternetworks.UsingsimilarfeatureswithinfutureNICdesignscouldenableabettermatchbetweenhardwareandtheneedsofsystemssuchasAfnity-Accept.Inadditiontonetworkstackorganization,therehavebeenattemptstoaddresstheconnectionafnityproblem.Wedescribetheminthenexttwosections. Fine-AcceptAfnity-AcceptStock-AcceptTwenty-Policy02000400060008000100001200014000160001101001000 Throughput(requests/sec/core) RequestsperConnectionFigure10:TheeffectofTCPconnectionlengthonApache'sthroughputrunningontheAMDmachine.ThisisaduplicateofFigure7butincludes“Twenty-Policy”:stockLinuxwithowsteeringinhardware.7.1DealingwithConnectionAfnityinHardwareTheIXGBEdriverauthorshavetriedtouseFDirtorouteincomingpacketstothecoreprocessingoutgoingpackets.TheydosobyupdatingtheFDirhashtableonevery20thtransmittedpackettorouteincomingpacketstothecorecallingsendmsg().Wecallthisscheme“Twenty-Policy”andFigure10showsitsperformance.At1,000requestsperconnectiontheNICdoesagoodjobofroutingpacketstothecorrectcoreandtheperformancematchesAfnity-Accept.At500requestsperconnection,however,maintainingthehardwaretablelimitsperformance.Socketlockcontentionlimitsperformancebelow100requestsperconnection;thetablemaintenanceproblemswouldstilllimitperformanceevenifthesocketlockwerenotabottleneck.ThereareafewproblemswithTwenty-Policy.First,itisexpensivetotalktothenetworkcard.Ittakes10,000cyclestoaddanentryintotheFDirhashtable.Thebulkofthiscostcomesfromcalculatingthehashvalue,andthetableinserttakes600cycles.Second,managingthehashtableisdifcult.Thedrivercannotremoveindividualentriesfromthehashtable,becauseitdoesnotknowwhenconnectionsarenolongeractive.Thedriver,instead,ushesthetablewhenitoverows.Ittakesupto80,000cycles(40ms)toschedulethekerneltoruntheushoperation,and70,000cycles(35ms)toushthetable.Thedriverhaltspackettransmissionsforthedurationoftheush.Arateof50,000connections/secondandahashtablewith32Kentriesrequiresaush
every 0.6 seconds. We have also confirmed that the NIC misses many incoming packets when running in this mode. Although we do not have a concrete reason, we suspect it is because the NIC cannot keep up with the incoming rate while the flush is in progress. The stopped transmission and missed packets cause TCP timeouts and delays, and the end result is poor performance.

Tighter integration with the network stack can reduce many of these costs. This is exactly the approach Accelerated Receive Flow Steering (aRFS) [13] takes. Instead of on every 20th transmitted packet, an aRFS enabled kernel adds a routing entry to the NIC on calls to sendmsg(). To avoid the kernel calculating the connection hash value on a hash table update the NIC reports, in the RX packet descriptor, the hash value of the flow and the network stack stores this value. Unfortunately, the network stack does not notify the driver when a connection is no longer in use so the driver can selectively shoot down connections. Instead, the driver needs to periodically walk the hardware table and query the network stack asking if a connection is still in use. Just as in Twenty-Policy, we see the need for the driver to search for dead connections as a point of inefficiency.

Even with aRFS, flow steering in hardware is still impractical because the third problem is the hard limit on the size of the NIC's table. Table 5 lists the table sizes for different modern 10Gbit NICs. FreeBSD developers, who are also implementing aRFS-like functionality, have raised similar concerns over hardware table sizes [20].

NIC              HW DMA Rings   RSS DMA Rings   Flow Steering Table (# connections)
Intel [14]       64             16              32K
Chelsio [1]      32 or 64       32 or 64        “tens of thousands”
Solarflare [16]  32             32              8K
Myricom [15]     32             32              -

Table 5: Comparison of features available on modern NICs. Entries with “-” indicate that we could not find information.

Additionally, currently available 10Gbit NICs provide limited hardware functionality in one way or another. Table 5 summarizes key NIC features. Each card offers either a small number of DMA rings, RSS supported DMA rings, or flow steering entries. For example, using the IXGBE NIC there is no way to spread incoming load among all cores if FDir is also used to route individual connections to particular cores. In this case, we would have to use RSS for load balancing new connections, which only supports 16 DMA rings. It is imperative for NIC manufacturers to grow the number of DMA rings with increasing core counts and provide functionality to all DMA rings.

7.2 Dealing with Connection Affinity in Software

Routing in software is more flexible than routing in hardware. Google's Receive Flow Steering patch [11, 12] for Linux implements flow routing in software. Instead of having the hash table reside in the NIC's memory, the table is in main memory. On each call to sendmsg() the kernel updates the hash table entry with the core number on which sendmsg() executed. The NIC is configured to distribute load equally among a set of cores (the routing cores). Each routing core does the minimum work to extract the information needed to do a lookup in the hash table to find the destination core. The routing core then appends the packet to a destination core's queue (this queue acts like a virtual DMA ring). The