Distributed Computing with the CLAN Network

David Riddoch (Department of Engineering, University of Cambridge, Cambridge, England; djr23@cam.ac.uk)
Kieran Mansley (Department of Engineering, University of Cambridge, Cambridge, England; kjm25@cam.ac.uk)
Steve Pope (AT&T Laboratories - Cambridge, 24a Trumpington Street, Cambridge, England; steve.pope@cl.cam.ac.uk)

Abstract

CLAN (Collapsed LAN) is a high performance user-level network targeted at the server room. It presents a simple low-level interface to applications: connection-oriented non-coherent shared memory for data transfer, and Tripwire, a user-level programmable CAM for synchronisation. This simple interface is implemented using only hardware state machines on the NIC, yet is flexible enough to support many different applications and communications paradigms.

We show how CLAN is used to support a number of standard transports and middleware: MPI, VIA, TCP/IP and CORBA. In each case we demonstrate performance that approaches the underlying network. For TCP/IP we present our initial results using an in-kernel stack, and describe the architecture of our prototype Gigabit Ethernet/CLAN bridge, which demultiplexes Ethernet frames directly to user-level TCP/IP stacks via the CLAN network. For VIA we present a software implementation with better latency than a commercial VIA NIC implemented on ASIC technology.

Keywords: CLAN, high performance, user-level networks, network interface.

1. Introduction

As the line speed of local area networks reaches a gigabit per second and beyond, the overhead of software on the host system is increasingly becoming the limiting factor for performance. At high message rates the processing time is dominated by network overheads, at the expense of the application, and can lead to performance collapse.

The overhead is due to a number of factors [11] including copying data between buffers, protocol processing, demultiplexing, interrupts and system calls. In addition to the processor time taken, these activities have a detrimental effect on the cache performance of the application.

One solution that addresses these problems is user-level networking, wherein applications communicate directly with the network interface controller (NIC), bypassing the operating system altogether in the common case. The NIC typically has direct access to application buffers, eliminating unnecessary copies. In some cases the network provides a reliable transport, which simplifies protocol processing.

A variety of user-level network interfaces have been developed [28, 8, 9], each supporting a particular communications paradigm. For example, SCI has largely been used to support shared-memory scientific clusters, and Arsenic [24] supports processing of TCP and UDP streams. Other communication interfaces can be built as layers of software above the raw network, but this typically incurs significant additional overhead when the two interfaces are dissimilar.

One approach to supporting multiple network interfaces is to use a programmable NIC. Myrinet [9] is a gigabit class user-level accessible NIC which incorporates a processor. A number of communications interfaces have been built using Myrinet, including MPI [25], the Virtual Interface Architecture (VIA) [7, 10], VMMC-2 [13] and TCP/IP [15]. However, at any one time all communicating nodes must be programmed to support the same model.

The CLAN network presents a single, low-level network interface that supports communication with low overhead and latency, high bandwidth, and efficient and flexible synchronisation. In this paper we show how this interface supports a range of disparate styles of communication, without sacrificing the performance of the raw network.

MPI, VIA and CORBA are implemented as user-level libraries, requiring no privileged code or modifications to the network. We present an in-kernel IP implementation, and also describe the architecture of our Gigabit Ethernet/CLAN bridge, which demultiplexes Ethernet frames directly onto user-level TCP/IP stacks via the CLAN network.

2. The CLAN Network

CLAN is a high performance user-level network designed for the server room. Key aims of the project include support for general purpose multiprogrammed distributed systems, and scalability to large numbers of applications and endpoints. An overview of the key features of the network follows.

At the lowest level the communications model is non-coherent distributed shared memory (DSM). A portion of the virtual address space of an application is logically mapped over the network onto physical memory in another node. Data is transferred between applications by writing to the shared memory region using standard processor write instructions. A buffer in a remote node is represented by a Remote Direct Memory Access (RDMA) cookie, the possession of which implies permission to access that buffer.

However, the CLAN network is not intended to support the traditional DSM communications model. Instead, the shared memory interface is used as the low-level data transfer layer on which higher-level communications abstractions are built. The network supports small datagram messages, which are currently used for connection management. The NIC also provides a programmable DMA engine to off-load data transfer from the CPU.

2.1. Simple data transfer

By way of example, we present the implementation of a simple message passing protocol. The Distributed Message Queue is based on a circular buffer in memory local to the receiver, as illustrated in Figure 1. The sender writes a message through its mapping onto the receive buffer, at the position indicated by the write pointer (write_i). The write pointer is then incremented modulo the size of the buffer, and the new value copied to the remote address-space (lazy_write_i).

Figure 1. A Distributed Message Queue.

The receiver compares its lazy copy of the write pointer with the read pointer to determine whether or not the queue is empty. Messages are dequeued by reading them from the buffer, then incrementing the read pointer, and copying its new value to the sender. Transferring small messages in this way consists of just a few processor write instructions, and hence has very low overhead.
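To make the mechanism concrete, the sketch below shows the send side of such a queue in C. It is illustrative only: the struct, the aperture pointers and the fixed entry size are assumptions made for the example and do not correspond to the real CLAN programming interface, and the write barrier a real machine would need is reduced to a comment.

    /* Minimal sketch of a Distributed Message Queue sender (illustrative
     * only; types and names are hypothetical, not the CLAN API). */
    #include <stdint.h>
    #include <string.h>

    #define Q_SIZE    64          /* number of fixed-size entries           */
    #define MSG_BYTES 64          /* size of each entry                     */

    struct dmq_send {
        volatile uint8_t  *remote_buf;     /* mapping onto receiver's buffer */
        volatile uint32_t *lazy_write_i;   /* mapping onto receiver's copy
                                              of the write pointer           */
        volatile uint32_t *lazy_read_i;    /* in local memory, updated
                                              remotely by the receiver       */
        uint32_t write_i;
    };

    /* Enqueue one message with PIO: a few processor writes, no system call. */
    static int dmq_send(struct dmq_send *q, const void *msg)
    {
        uint32_t next = (q->write_i + 1) % Q_SIZE;
        if (next == *q->lazy_read_i)
            return -1;                                 /* queue full         */
        memcpy((void *)(q->remote_buf + q->write_i * MSG_BYTES),
               msg, MSG_BYTES);
        /* ...a write barrier belongs here so the data is visible at the
         * receiver before the updated write pointer...                     */
        q->write_i = next;
        *q->lazy_write_i = next;     /* publish new write pointer remotely   */
        return 0;
    }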
2.2. RDMA cookie-based communication

In some cases, it is possible to arrange for the application to read received data directly from the circular buffer (in-place). Other programming interfaces require data to be delivered to application-level receive buffers, which requires an additional copy.

Figure 2. RDMA cookie-based data transfer.

This copy can be avoided if the sender is informed of the location of the receive buffers in advance. To achieve this, the receiver sends RDMA cookies for its buffers to the sender using a distributed message queue, known as a cookie queue. The sender retrieves an RDMA cookie from the cookie queue, and uses it as the target for a DMA transfer. This is illustrated in Figure 2.

2.3. Synchronisation

On the receive path, data is placed in an application's buffers asynchronously, without the intervention of the CPU. This minimises overhead, but means the application has no way to determine that a message has arrived other than by polling memory locations explicitly.

Other networks have solved this problem in one of two ways: by being able to request that an interrupt be generated when a particular region of shared memory is accessed, or by using some form of out-of-band synchronisation messages.

The CLAN NIC provides a novel solution: the tripwire [27]. A tripwire is an entry in a content addressable memory (CAM) which matches a particular address in an application's address space. The address of each memory location that is accessed via the network is looked up in the CAM, and when there is a match the application receives a notification. If the application is blocked waiting for such a notification, an interrupt is generated and the application is rescheduled.

Tripwires are programmed directly by user-level applications, and are set on locations that correspond to protocol specific events. For example, when using the distributed message queue above, the receiver sets a tripwire on lazy_write_i, and receives a notification whenever a new message is placed in the queue. If the receiver is blocked waiting for a new message, it will be rescheduled. Similarly the sender can block waiting for space in the queue by setting a tripwire on lazy_read_i.

This is a flexible and fine-grained solution to the synchronisation problem. With tripwires, synchronisation is orthogonal to data transfer, and decoupled from the transmitter. This greatly simplifies the hardware implementation. A tripwire can be associated with control signals as above, or alternatively with in-band data.
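The receive side of the message queue, paired with a tripwire, might then look like the sketch below. The clan_tripwire_* primitives are assumed stand-ins for the driver interface (which this paper does not spell out); the point is only that the receiver sleeps on the address the sender writes to, rather than spinning.

    /* Sketch of tripwire-based receive synchronisation.  The clan_tripwire_*
     * names are hypothetical stand-ins for the real driver interface. */
    #include <stdint.h>

    int clan_tripwire_set(volatile void *addr);   /* assumed: returns an id  */
    int clan_tripwire_wait(int tripwire_id);      /* assumed: block until it
                                                     fires (NIC interrupt is
                                                     turned into a wakeup)   */

    struct dmq_recv {
        volatile uint8_t  *buf;            /* local circular buffer          */
        volatile uint32_t *lazy_write_i;   /* written remotely by the sender */
        uint32_t read_i;
    };

    /* Wait until the queue is non-empty, sleeping rather than polling. */
    static void dmq_wait_for_message(struct dmq_recv *q)
    {
        int tw = clan_tripwire_set(q->lazy_write_i);
        while (*q->lazy_write_i == q->read_i)   /* re-check after each wakeup */
            clan_tripwire_wait(tw);
    }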
2.4. Event handling

The NIC generates a variety of events, including DMA completion, out-of-band message arrival and tripwire events. Any of these can be directed to an asynchronous event queue [26]. This is a shared memory data structure, which allows events to be dequeued at user-level with very low overhead. Once an event has been enqueued it is blocked, so the queue is not susceptible to overflow. The CPU overhead of event delivery is O(1) with respect to the number of events registered with the queue.

2.5. Prototype implementation

The prototype CLAN NICs are based on off-the-shelf parts, including an Altera 10k50e FPGA clocked at 60 MHz, a V3 PCI bridge (32 bit, 33 MHz) and HP's G-Link optical transceivers with 1.5 Gbit/s link speed. We have also built a five port worm-hole-routed switch, again using FPGAs, and a non-blocking crossbar switch fabric. A bridge to Gigabit Ethernet is at the debug stage.

The tripwire synchronisation primitive is implemented by a content addressable memory, supporting 4096 tripwires in the current version. Tripwires are managed by the device driver, and when a tripwire fires an interrupt is generated. The interrupt service routine delivers an event to the application, and wakes any processes waiting for the event.

The V3 PCI bridge chip includes an integrated DMA engine, which can only be programmed with a single request at a time, and generates an interrupt after each transfer. The interrupt service routine then starts the next DMA request. This causes a large gap between each DMA request, which severely limits DMA performance for small and medium sized messages.

The format of data packets on the wire resembles that of write bursts on a memory bus. The header identifies the target node and the address of the first word of data. The amount of data in the packet is not encoded in the header, but is implicit in the data which follows. The packet can thus be split at any point, and a new header generated for the trailing portion. Conversely, consecutive packets that represent a contiguous transfer can be merged into a single packet by a switch or receiving NIC.

Because the packet length is not encoded in the header, the NICs and switches can begin to emit packets as soon as data is available, rather than waiting for an entire packet. This contributes to the low latency of the CLAN network. The switch exploits the ability to split packets to prevent large packets from hogging an output port unfairly. No maximum packet size is enforced, so the network operates as efficiently as the traffic patterns allow. If congestion is encountered, small packets are likely to be merged, leading to larger packets and higher efficiency.

Flow control is rate-based on a per-hop basis, with flow control information passed in-band with the data. This ensures that the source rate can be adjusted in a timely fashion to prevent buffer overruns in the receiver. The NICs and each switch port have just 512 bytes of buffer space.
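The split and merge property described above can be summarised with a rough sketch of a wire header. The field names and widths below are guesses for illustration; the paper does not give the exact encoding.

    /* Illustrative only: the real CLAN wire format is not specified here, so
     * the field names and widths are assumptions. */
    #include <stdint.h>

    struct clan_pkt_header {
        uint16_t target_node;   /* destination node                          */
        uint64_t target_addr;   /* address of the first word of data         */
        /* No length field: the payload length is implicit in the data that
         * follows, so a switch may split a packet at any point and
         * synthesise a new header for the trailing portion, or merge
         * consecutive packets that describe a contiguous transfer.          */
    };

    /* Splitting at byte offset 'off' only requires a new header whose target
     * address is advanced by 'off'; the trailing data is unchanged. */
    static struct clan_pkt_header
    clan_pkt_split(struct clan_pkt_header h, uint64_t off)
    {
        h.target_addr += off;
        return h;
    }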
Figure 3. Raw bandwidth vs. message size (bandwidth in Mbit/s against message size in bytes, for an Alpha 21264 using PIO, a PIII 650 MHz using PIO, and a PIII 650 MHz using DMA).

2.6. Scalability

The datapath in the CLAN NICs and switches is simple compared with other technologies which run at the same line speed. It is implemented entirely as hardware combinatorials and state machines, and runs at full speed on three year old FPGA technology. We have recently completed a design to run at 3 Gbit/s, also using an FPGA. These factors indicate that the network model is likely to scale to significantly higher line speeds, if necessary with integration.

The maximum number of endpoints that can be supported in a user-level network is usually limited by per-endpoint resource in the NIC. In CLAN NICs, per-endpoint resource just consists of incoming and outgoing aperture mappings and tripwires, so a large number of endpoints can be supported relatively cheaply.

3. Baseline Performance

3.1. Test configuration

The test system consisted of a pair of off-the-shelf PC systems connected through a CLAN switch. Each node was a 650 MHz Intel Pentium III system with 256 MB SDRAM and 256 KB cache, running an unmodified Linux 2.4.6 kernel. Except where otherwise stated, all performance results given in this paper were measured using this configuration. The error in the graphs is too small to represent with error bars.

3.2. Latency and bandwidth

Application to application latency for single word programmed I/O (PIO) writes was measured by timing a large number of ping-pongs (100000). The median round-trip time was 5.6 us, with 98.6% below 5.7 us. Measurement with a logic analyser showed that the switch contributed 0.8 us in each direction.

The bandwidth was measured by streaming a large amount of data through the distributed message queue described in Section 2.1, using a 50 KB buffer. The results are shown in Figure 3. All data is touched on both the transmit and receive side.

For small messages DMA performance is limited by the V3 bridge's DMA engine, which has high overhead and a high turn-around time between requests. The kink between 64 and 128 bytes is due to an optimisation in the DMA driver, where PIO is used for small messages.

PIO gives excellent performance with low overhead for small messages, but is limited by the PC I/O system to less than 400 Mbit/s. Using an Alpha 21264 system we are able to saturate the network, achieving up to 960 Mbit/s with PIO, and half bandwidth is available with messages of just 100 bytes. An improved DMA engine with a user-level interface and pre-fetching ought to achieve performance that is much closer to this curve.

4. MPI

MPI is the de facto standard communications interface for parallel scientific computing, and is widely implemented and used. It has been designed to be efficient on a variety of architectures, from shared-memory multiprocessors to networks of workstations. The interface is based on message passing, and includes primitives for both point-to-point communications and a variety of collective operations, including multicast.

4.1. Implementation

Our port of MPI is based on the LAM [3] implementation, which runs over the standard BSD socket interface. All collective operations are implemented in terms of point-to-point connections. We have replaced the standard socket calls with a user-level socket library that provides the same semantics. Our socket library is based on the distributed message queue described in Section 2.1.

Figure 4. The architecture of CLAN MPI.

The round-trip time for small messages using MPI_Send() and MPI_Recv() is 15 us. This compares with 19 us for MPI-BIP [25] using Myrinet hardware, and 33 us for MPI over FM over the Emulex cLAN 1000 [21].
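Round-trip figures of this kind are typically produced by a ping-pong such as the one below. This is ordinary MPI code and nothing in it is CLAN-specific; it simply illustrates the measurement.

    /* A plain MPI ping-pong of the kind used to measure round-trip times
     * like those quoted above.  Nothing here is CLAN-specific. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 100000;
        char msg[4] = {0};
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {                       /* initiator */
                MPI_Send(msg, 4, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(msg, 4, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {                /* echo */
                MPI_Recv(msg, 4, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(msg, 4, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("mean round-trip: %.2f us\n",
                   (MPI_Wtime() - t0) * 1e6 / iters);
        MPI_Finalize();
        return 0;
    }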
To demonstrate a real application, we chose a standard n-body problem. It is representative of the applications that can be solved with loosely coupled networks of processors, yet is not perfectly parallelisable (so is a good indicator of network performance). Figure 5 shows the speed-up achieved using two nodes.

Figure 5. MPI n-body calculation speed-up using two nodes (speed-up against number of particles, for CLAN U/L sockets, CLAN TCP/IP, Gigabit Ethernet and Fast Ethernet).

We compare MPI over Fast Ethernet, Gigabit Ethernet (3c985), CLAN kernel-level IP (see Section 6.1) and CLAN MPI. This problem is latency constrained for small numbers of particles, so MPI over CLAN at user-level does substantially better than MPI over TCP/IP. CLAN MPI has very low overhead, so will also give improved performance to applications that are not sensitive to latency.

5. Virtual Interface Architecture

The Virtual Interface Architecture is an industry standard [6] for user-level networking. Its scope is to describe an interface between the NIC and software on the host, and an application programming interface [2]. The intention is that vendors develop and market devices that implement this specification, such as Emulex's cLAN 1000 [1].

Alternatively, VIA can be provided on existing networks by emulating the API in software. M-VIA [5] consists of a user-level library and loadable kernel module for Linux, and supports VIA over Ethernet. A third approach is to use an intelligent NIC. Intel's proof-of-concept [7] implementation and Berkeley VIA [10] both use Myrinet.

5.1. VIA data transfer

In the standard send/receive data transfer model, a sending process enqueues descriptors for source buffers by calling VipPostSend(), which returns immediately. The send operation completes asynchronously, and the application can poll for completion by calling VipSendDone(), or block waiting for completion with VipSendWait().

Similarly the receiving process posts descriptors describing buffers to the receive queue using VipPostRecv(). These descriptors are completed when data is delivered into the buffers, and the application synchronises using VipRecvDone() and VipRecvWait().

To support applications that manage multiple connections, notifications of completed requests from a number of VIA endpoints can be directed to a completion queue. The application can poll the completion queue (VipCQDone()), or block waiting for events (VipCQWait()). The returned value indicates which endpoint the descriptor completed on, and whether it was a send or receive event.
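The sketch below puts these calls together into the basic send/receive pattern. The entry points are the VIPL functions named above; the exact types and parameter lists are an approximation of the VIA specification, given here for flavour rather than as a definitive reference.

    /* Sketch of the VIA send/receive model.  The VipXxx entry points are
     * those named in the text; parameter lists are approximate. */
    #include "vipl.h"   /* VIPL header; the exact name may vary by vendor */

    /* Receiver: post a buffer, then block until data is delivered into it. */
    static void receive_one(VIP_VI_HANDLE vi, VIP_DESCRIPTOR *rxd,
                            VIP_MEM_HANDLE mh)
    {
        VIP_DESCRIPTOR *done;
        VipPostRecv(vi, rxd, mh);               /* returns immediately       */
        VipRecvWait(vi, VIP_INFINITE, &done);   /* completes on data arrival */
    }

    /* Sender: enqueue a descriptor for the source buffer, poll for completion. */
    static void send_one(VIP_VI_HANDLE vi, VIP_DESCRIPTOR *txd,
                         VIP_MEM_HANDLE mh)
    {
        VIP_DESCRIPTOR *done;
        VipPostSend(vi, txd, mh);               /* asynchronous              */
        while (VipSendDone(vi, &done) != VIP_SUCCESS)
            ;                                   /* or block in VipSendWait() */
    }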
5.2. Implementation

We have implemented the VIA API as a user-space software library over CLAN. The architecture is similar to that of our MPI implementation, shown in Figure 4. The functionality we have so far includes the send/receive data transfer model, polling and blocking modes of synchronisation, completion queues and all three reliability levels. This is sufficient to provide source-level compatibility for many VIA applications.

5.2.1. Data transfer. Basic data transfer is illustrated in Figure 6. The receiving application posts a receive descriptor to an endpoint (1) using VipPostRecv(). The segments within the descriptor are mapped to CLAN RDMA cookies, and passed to the remote endpoint via a cookie queue, as described in Section 2.2. Control is returned to the application immediately.

Figure 6. VIA data transfer.

Some time later (2), the sending application posts a send descriptor. The cookie queue is interrogated to find the RDMA cookies for the receive buffers, and one or more DMA requests are made to transfer the application data directly from the send buffers to the receive buffers (3). A second message queue, the transfer queue, is used to pass meta-data (including the size of the message) and control from the sender to the receiver (4).

Data transfer itself happens asynchronously when the DMA requests reach the front of the DMA queue. Alternatively, the data can be transferred by PIO, which has lower overhead and latency for small messages. This requires a memory mapping onto the remote receive buffer, which is relatively expensive to set up, and so a cache of such mappings is maintained.

A further benefit of PIO for small messages is that data transfer happens during the application's scheduling time slice, rather than when the NIC chooses to schedule the transfer, as for DMA. Applications that have delay sensitive traffic can use PIO to ensure timely delivery of messages, even when competing with large transfers. Thus the operating system's process scheduling policy also manages network access. Jitter introduced by the network is very small (at most 20.5 us per hop), so good quality of service can be achieved with a real-time scheduler.

5.2.2. Synchronisation. Send synchronisation is trivial, and merely involves determining whether the DMA requests associated with a descriptor have completed. This information is provided by the CLAN DMA interface.

The completion of an incoming message is indicated by the arrival of a message in the transfer queue. For the non-blocking VipRecvDone() method this can be detected by inspecting the transfer queue. To support blocking receives (VipRecvWait()) a tripwire on the transfer queue is used as described in Section 2.3.

The VIA completion queue is implemented using the CLAN asynchronous event queue. When an endpoint's receive queue is associated with a completion queue, a tripwire is attached to the transfer queue, and configured to deliver events to the event queue. To associate an endpoint's send queue with a completion queue, DMA completion events are directed to the event queue.

5.2.3. Flow control. If the cookie queue is found to be empty when a send descriptor is posted, then the receive buffers have been overrun, and VIA specifies that the data should be dropped. This condition is detected without any data being transmitted across the network, so the network is not loaded with data that cannot be delivered.

To avoid packet loss, applications have to build flow control on top of VIA. To get good performance, flow control information must be timely, and this cannot be achieved if it is being multiplexed over the same channel as bulk data. To address this, Emulex VIA provides non-standard interfaces for communicating out-of-band information with low latency.

In our implementation, an application may configure an endpoint to queue up send descriptors until corresponding receive buffers are posted. This is possible because the sending application receives a notification when a message is placed in the cookie queue. This extension to the standard improves performance, simplifies application code considerably, and requires no additional non-standard primitives.

5.2.4. Protection. Due to lack of space on the FPGA, our prototype NIC hardware currently lacks full protection on the receive path. Having given a remote process access to a buffer, it is not possible to revoke access. This means that a faulty or malicious node that goes in below the level of VIA can overwrite data in an application's receive buffers after the receive descriptor has completed, which could cause the application to misbehave. Proper protection will be available in a future revision of the NIC.
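The send-side flow control decision of Section 5.2.3 reduces to a small piece of logic, sketched below. All of the types and helper functions are hypothetical stand-ins, not the internals of the CLAN VIA library; the sketch only shows the decision between sending, deferring, and dropping.

    /* Sketch of the send-side decision from Section 5.2.3.  All helpers and
     * types are hypothetical stand-ins, not the CLAN VIA library. */
    #include <stdbool.h>

    struct descriptor;
    struct endpoint;

    bool cookie_queue_empty(struct endpoint *ep);              /* assumed   */
    int  start_dma_to_next_cookie(struct endpoint *ep, struct descriptor *d);
    void defer_until_cookie_arrives(struct endpoint *ep, struct descriptor *d);
    int  complete_with_error(struct descriptor *d);
    bool queue_when_empty(struct endpoint *ep);    /* extension configured?  */

    int post_send(struct endpoint *ep, struct descriptor *d)
    {
        if (!cookie_queue_empty(ep))
            return start_dma_to_next_cookie(ep, d); /* receive buffer ready  */

        if (queue_when_empty(ep)) {
            /* Extension: a tripwire on the cookie queue tells the sender
             * when the receiver posts a buffer; the deferred send is issued
             * then. */
            defer_until_cookie_arrives(ep, d);
            return 0;
        }
        /* Standard VIA behaviour: the receive buffers have been overrun, so
         * the data is dropped without loading the network. */
        return complete_with_error(d);
    }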
5.3. Performance

In this section we compare the performance of our implementation of VIA with that of an existing commercial implementation: the Emulex cLAN 1000. The Emulex NIC is a 64 bit, 33 MHz PCI card, with a single chip implementation and 1.25 Gbit/s link speed. We did not have access to an Emulex switch for these tests, so the Emulex NICs were connected back-to-back. The system setup and benchmark programs for the two systems were identical. Since many distributed applications require reliable communications, the reliability level used in these tests was reliable delivery. To prevent receive buffer overrun, the test applications used credit-based flow control. For CLAN VIA we present separate results for PIO and DMA data transfer for clarity (although by default we switch between the two dynamically).

The latency for small messages was measured by timing a large number of round-trips. This value includes the time taken to post a send descriptor, process that descriptor, transfer the data, synchronise with completion on the receive side and make the return trip. The results are given in Table 1.

Table 1. Round-trip time for VIA (us).

    Bytes transferred    CLAN (DMA)    CLAN (PIO)    Emulex cLAN 1000
    0                    10.7          8.5           12.6
    4                    14.5          11.6          14.5
    40                   15.5          12.1          18.3

The small message latency for CLAN VIA (PIO) is the lowest by some margin, despite the fact that the CLAN NICs are connected by a switch, whereas the Emulex NICs are connected back-to-back. Without a switch, the CLAN VIA (PIO) round-trip time is just 6.7 us. For comparison, M-VIA report latency over Gigabit Ethernet of 38 us [4], and Berkeley VIA over Myrinet report 46 us [10].

The bandwidth achieved for various message sizes is given in Figure 7. Data is 'touched' on both the send and receive side. For messages up to 128 bytes, CLAN VIA (PIO) has the highest throughput. CLAN VIA (DMA) performs poorly for small messages due to the high overhead of the V3's DMA engine. The kink between 64 and 128 bytes is due to the DMA optimisation described in Section 3.2.

We have also measured maximum transaction rates with a server application that simply acknowledges each message it receives. With about 15 clients Emulex VIA saturates at 150,000 requests per second. The same application implemented over the raw CLAN network is able to process 1,030,000 requests per second, a seven-fold improvement.

Figure 7. VIA bandwidth vs. message size (CLAN VIA (PIO), CLAN VIA (DMA) and Emulex VIA).
5.4. Analysis

That CLAN VIA has lower latency (and comparable bandwidth) than an ASIC implementation designed specifically for VIA is surprising. Profiling shows that posting send and receive buffers has very low overhead for the Emulex cLAN 1000: about 0.6 us. This suggests that performance is limited by high overhead in the NIC. We suspect that the NIC has an embedded processor, which may be limiting performance for small messages.

6. TCP/IP

Although an increasing number of applications are making use of high performance interfaces such as MPI and VIA, the vast majority of distributed applications continue to use TCP sockets.

6.1. Kernel level IP

The simplest way to support IP networking is to use existing support in the operating system. We have written a low-level network device driver for the Linux kernel that works in a similar manner to classical IP over ATM [17]. In our case, IP packets are tunneled over a CLAN connection. When an IP packet is first sent to a particular host, a CLAN connection is established and used to transmit subsequent packets.

Our initial implementation used a distributed message queue (Section 2.1) to transfer the data. When data arrives in the receiving host, a tripwire generates an interrupt, and the interrupt service routine schedules a 'bottom half' which passes the data down into the standard networking subsystem.

The use of the distributed message queue has two disadvantages: (1) the receive buffer has a fixed size and (2) data has to be copied from the receive buffer into the kernel's socket buffers. This was improved upon by using RDMA cookie-based data transfer, as described in Section 2.2. Each host allocates a pool of socket buffers, and sends RDMA cookies for these buffers through a cookie queue to the other host. Data is transferred by DMA directly from socket buffers in the sender to socket buffers in the receiver, in a similar manner to our VIA implementation. In Figure 6 the send and receive buffers are now kernel socket buffers.
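A rough sketch of the transmit path of such a driver is shown below. It is illustrative only: the clan_* calls and the lookup helpers are hypothetical stand-ins rather than the actual driver, and locking, error handling and the Linux driver interface are all omitted.

    /* Illustrative transmit path for IP-over-CLAN tunnelling (Section 6.1).
     * The clan_* and lookup helpers are hypothetical; this is not the real
     * driver. */
    #include <stdint.h>
    #include <stddef.h>

    struct clan_conn;
    struct rdma_cookie;

    struct clan_conn   *conn_lookup(uint32_t dest_ip);     /* per-host cache */
    struct clan_conn   *clan_connect(uint32_t dest_ip);    /* set up on demand */
    void                conn_insert(uint32_t dest_ip, struct clan_conn *c);
    struct rdma_cookie *cookie_queue_pop(struct clan_conn *c); /* remote buf */
    int clan_dma(struct clan_conn *c, struct rdma_cookie *dst,
                 const void *src, size_t len);

    int clan_ip_xmit(uint32_t dest_ip, const void *pkt, size_t len)
    {
        struct clan_conn *c = conn_lookup(dest_ip);
        if (c == NULL) {
            /* First packet to this host: establish a CLAN connection and use
             * it for all subsequent packets, as in classical IP over ATM. */
            c = clan_connect(dest_ip);
            conn_insert(dest_ip, c);
        }
        /* DMA the packet straight from the sender's socket buffer into a
         * socket buffer advertised by the receiver via the cookie queue. */
        return clan_dma(c, cookie_queue_pop(c), pkt, len);
    }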
6.1.1. Performance. We have measured the performance of this implementation using the standard TTCP benchmark. Figure 8 shows the results for CLAN IP and a Gigabit Ethernet network using the 3Com 3c985 adapter. The 3c985 is a programmable NIC with two on-board processors. Interrupt coalescing and check-sum offload are used to reduce overhead on the host processor. For both networks, the Linux kernel was configured to allow large receive windows, and 256 KB of socket buffers were used. The 3c985 results were obtained with 9 KB (jumbo) frames, and the CLAN results with an 8 KB MTU.

Figure 8. TCP bandwidth for CLAN IP and Gigabit Ethernet.

This configuration exposes the weaknesses of our prototype NIC. Performance is limited by the high overhead of DMA transfers, and we take many more interrupts on the receive side than the 3c985. The 3c985 also benefits from check-sum offload. Despite this, performance for the two networks is very similar up to about 512 byte messages, above which both saturate with CLAN IP slightly faster.

We measured the round-trip time using the standard 'ping' command. The results given in Table 2 are averaged over 100 pings for 'normal' pings (with one second gaps), and over many thousands of pings for the flood ping.

Table 2. Round-trip time for ping over CLAN IP and Gigabit Ethernet.

    Network    Test      Ping RTT (us)    Error (us)
    CLAN IP    normal    78               10
    CLAN IP    flood     54               10
    3c985      normal    196              37
    3c985      flood     216              19

6.2. Accelerating TCP/IP

The performance of the in-kernel TCP/IP support described above is limited by the high overhead of the TCP stack. One solution is to offload some of the protocol onto the NIC, as is done by the 3c985. In the Arsenic [24] project, the NIC demultiplexes incoming data directly into application-level buffers, and the TCP stack is executed at user-level. Overhead is substantially reduced, and a further improvement is gained by using a zero-copy interface. Trapeze/IP [15] also offloads the check-sum calculation and provides a zero-copy socket interface.

Within a local area network, an alternative is to provide a fast path for TCP/IP traffic with a simplified stack that does not duplicate functionality in the network. For example, the CLAN network is reliable and guarantees in-order delivery, so check-sums, sequence numbers, timers and re-transmission are not needed. The use of RDMA cookies for data transfer provides implicit flow control, so management of the TCP receive window could also be removed. However, this approach is not an option where applications require TCP or the other end of the connection is not in the local network.

6.3. Gigabit Ethernet/CLAN bridge

Figure 9. CLAN server room architecture.

Although we can already bridge IP traffic between Ethernet and the CLAN network by configuring a PC appropriately, this solution does not scale well to the high line rates experienced by large server clusters. The Gigabit Ethernet/CLAN bridge connects a CLAN network to the outside world, as shown in Figure 9. Prototype hardware for the bridge has recently been assembled, and is currently in the debug stage. We briefly describe the architecture here.

The main function of the bridge is to demultiplex incoming TCP streams onto CLAN streams which terminate in user-level applications. The IP and TCP/UDP headers of incoming Ethernet frames will be looked up in a CAM to identify the associated CLAN stream. For TCP streams, the sequence number is inspected to determine where the packet data should be delivered in the receive buffer. IP packets that are not associated with a particular CLAN stream will be delivered via a distinguished stream to the operating system.

Figure 10. The prototype Gigabit Ethernet bridge.

The TCP protocol stack is executed at user-level. We have selected the lwIP [14] stack as the starting point for our implementation, which is currently able to exchange packets between CLAN hosts. On the transmit side, complete IP packets are assembled in the application and delivered via a CLAN stream to a staging buffer in the bridge. The bridge will verify key fields in the IP and TCP header, and then emit the Ethernet frames.
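The demultiplexing step performed by the bridge amounts to a lookup keyed on the packet headers, sketched below in software. The types, field names and helpers are illustrative stand-ins; in the real bridge this lookup is done in hardware with a CAM.

    /* Sketch of the bridge's demultiplexing step (Section 6.3).  Types and
     * helpers are illustrative; the real bridge does this in hardware. */
    #include <stdint.h>
    #include <stddef.h>

    struct flow_key {                   /* fields extracted from the frame   */
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        uint8_t  protocol;              /* TCP or UDP                        */
    };

    struct clan_stream;
    struct clan_stream *cam_lookup(const struct flow_key *k);  /* assumed CAM */
    struct clan_stream *os_stream(void);           /* the distinguished stream */
    void deliver(struct clan_stream *s, size_t offset,
                 const void *data, size_t len);

    /* Place TCP payload into the user-level receive buffer at the position
     * given by its sequence number; unmatched packets fall through to the OS. */
    void demux_frame(const struct flow_key *k, uint32_t tcp_seq,
                     uint32_t rcv_base, const void *payload, size_t len)
    {
        struct clan_stream *s = cam_lookup(k);
        if (s == NULL) {
            deliver(os_stream(), 0, payload, len);  /* not a CLAN-terminated
                                                       flow */
            return;
        }
        deliver(s, tcp_seq - rcv_base, payload, len);
    }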
7. CORBA

The Common Object Request Broker Architecture [20] is a standard for object-oriented remote procedure call. It simplifies distributed computing by presenting applications with a very high level abstraction of the network. CORBA specifies a language independent object model, a network protocol for invoking requests on objects, and bindings to a variety of programming languages.

The ORB is responsible for providing reliable communication, managing resources such as connections and threads, and providing a number of services. These include management of objects' lifecycle, naming, location (including transparent forwarding of requests) and flow control.

7.1. omniORB

omniORB is a CORBA implementation with bindings for the C++ and Python programming languages, developed at AT&T Laboratories - Cambridge. It has been certified compliant with version 2.1 of the specification. The ORB uses a thread-per-connection model on the server side, which avoids context switches on the call path. The transport interface [18] is flexible and efficient, and transports over TCP/IP, ATM [23], SCI [22] and HTTP (for tunneling through firewalls) have been implemented. The architecture of an omniORB server with the CLAN transport is shown in Figure 11.

Figure 11. The architecture of omniORB.

7.2. CLAN transport

Data transfer is based on the distributed message queue described in Section 2.1, with tripwires for synchronisation. On the transmit side, the CLAN transport provides a buffer which the marshalling layer marshals a message into. When that buffer fills or the request is completed, it is passed back to the transport layer, where it is transferred into the remote circular buffer either by PIO or DMA. Large chunks of data are passed directly to the CLAN transport, and can be transferred directly to the receiver without first being copied into the marshalling buffer.

On the receive side the unmarshalling layer makes requests to the transport layer for buffers containing received data, specifying a minimum size. The CLAN transport provides direct access to the receive buffer, hence eliminating unnecessary copies. It is possible that the data requested is non-contiguous in the circular receive buffer, so a small amount of space is reserved immediately before and after the buffer, and data copied there as necessary to provide a contiguous chunk to the unmarshalling layer.

7.3. Threads and demultiplexing

omniORB's thread-per-connection model has a number of drawbacks. The principal problem is that a single connection may be serviced repeatedly at the expense of others, until its thread's timeslice is exhausted. In addition, a large number of threads are needed if there are many connections, and a thread switch is always needed between requests on different connections. As the cost of the network transport decreases, the impact of these defects becomes more apparent.

Our solution is to adopt a hybrid thread-pool model. A single asynchronous event queue gathers tripwire events from multiple connections, and demultiplexes active connections onto available threads. When more than one connection is active, it is necessary to ensure that sufficient threads are runnable to ensure concurrency is not limited if a thread blocks in the up-call to the object implementation. However, if a thread does not block, it is able to serve many requests before a thread switch occurs.
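A dispatcher in such a hybrid model might look like the sketch below. All of the names are hypothetical stand-ins for omniORB internals; the sketch only shows the shape of the loop that turns tripwire events into work for the thread pool.

    /* Sketch of the hybrid thread-pool model (Section 7.3): one asynchronous
     * event queue gathers tripwire events from all connections and a
     * dispatcher hands active connections to worker threads.  Names are
     * hypothetical. */

    struct connection;
    struct event { struct connection *conn; };

    struct event evq_next(void);                /* assumed: blocking dequeue  */
    void workpool_submit(struct connection *c); /* assumed: run on a worker   */

    void *dispatcher_loop(void *arg)
    {
        (void)arg;
        for (;;) {
            struct event ev = evq_next();       /* tripwire fired on a conn   */
            /* The worker serves every request pending on this connection
             * before returning to the pool, so a thread switch is only
             * needed when it blocks in an up-call or the connection goes
             * idle.  The pool keeps spare runnable threads so that
             * concurrency is preserved if an up-call does block. */
            workpool_submit(ev.conn);
        }
    }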
7.4. Performance

In this section we present some early performance results. The round-trip time for small requests is given in Table 3. The lowest latency reported to date is for Padico [12] on Myrinet-2000 (a 2 Gbit/s technology), which also uses omniORB. We also give results for DCOM (object-oriented RPC for Microsoft platforms) over VIA, taken from [19].

Table 3. Round-trip time for small messages using CORBA and DCOM.

    Hardware and interface            RTT (us)
    CORBA:
      Fast Ethernet                   128
      Gigabit Ethernet (3c985)        180
      CLAN                            20
      Padico (Myrinet-2000)           20
    DCOM:
      Emulex cLAN 1000 VIA            70
      Emulex VIA (polling)            40

For omniORB we have also measured the maximum request rate when serving multiple clients. The results are given in Table 4. For Fast and Gigabit Ethernet, the ORB was saturated with six clients. The request rate for omniORB over CLAN was greater than 105,000 requests per second with just three clients. Since each request does no useful work, this provides a measure of the total overhead of the ORB and network transport on the server side, which is just 9.5 us per request.

Table 4. Maximum request rate for omniORB.

    Transport                     Requests per second
    Fast Ethernet                 19,100
    Gigabit Ethernet (3c985)      21,600
    CLAN                          105,900

Previous studies have found that ORB overhead is very high [16]. However, the results presented here indicate that ORB overhead can be very low, and considerable improvements are achieved with high performance transports.

8. Future work

Due to the recent closure of AT&T Laboratories - Cambridge, the NIC and Gigabit Ethernet/CLAN bridge are not being developed further. In light of this, the function of the bridge will be emulated on a PC, so that development of CLAN user-level TCP can continue.

MPI performance could be substantially improved by implementing it directly over the raw network, rather than the user-level socket interface. This would allow zero-copy optimisations and reduce overhead.

A number of improvements are being considered for our CORBA implementation, including marshalling messages directly into the receive buffer in the remote application, and using DMA for large messages. The use of a tailored marshalling protocol might also provide an improvement.

9. Conclusions

In this paper we have described the CLAN network, and shown that its simple, low-level interface supports a wide range of communications paradigms. Each higher-level abstraction is built as a layer of software, without additional support in the network, and without sacrificing performance.

We have found that technologies with more complex network interfaces and protocols are limited to relatively low message rates by the processing requirements on the NIC. We expect the simple hardware model of CLAN to scale more easily to high line rates.

Our MPI implementation has lower latency than comparable interconnects, despite its simplicity. Further improvements can be expected if MPI is implemented directly over the raw network interface rather than user-level sockets.

For both CLAN VIA and in-kernel TCP our implementations give comparable performance to ASIC solutions that have been designed specifically for these protocols. In each case the latency of the CLAN implementation is significantly lower. Further, the CLAN performance would be expected to improve significantly with a proper implementation of the DMA engine.

We have also shown that a fully featured CORBA ORB need not incur the high overhead that has often been associated with it. omniORB over CLAN achieves transaction rates of 105,000 requests per second on our test system, without any modifications to applications.

There has been a trend away from PIO in user-level networks (for example SHRIMP moved to a DMA only model on Myrinet). This is likely to be because it is difficult to manage in the NIC, due to data being pushed rather than pulled. However, we have found that the low latency and overhead of PIO for small messages has been invaluable in the implementation of application level protocols. The combination of PIO for small messages, DMA to off-load bulk data transfer and tripwires for synchronisation is a very flexible and efficient model, with a simple scalable hardware implementation.
Acknowledgments

The authors would like to thank members of the Laboratory for Communications Engineering, former members of AT&T Laboratories - Cambridge and particularly Jon Crowcroft for his invaluable advice. David Riddoch and Kieran Mansley are both jointly funded by a Royal Commission for the Exhibition of 1851 Industrial Fellowship, and AT&T Labs - Research.

References

[1] Emulex cLAN1000. http://wwwip.emulex.com/ip/products/clan1000.html.
[2] Intel Virtual Interface (VI) Architecture Developer's Guide. http://developer.intel.com/design/servers/vi/.
[3] LAM/MPI Parallel Computing. http://www.mpi.nd.edu/lam/.
[4] M-VIA Performance. http://www.nersc.gov/research/FTG/via/faq.html#q14.
[5] M-VIA Project. http://www.nersc.gov/research/FTG/via/.
[6] The Virtual Interface Architecture. http://www.viarch.org/.
[7] F. Berry, E. Deleganes, and A. M. Merritt. The Virtual Interface Architecture Proof-of-Concept Performance Results. Technical report, Intel Corporation.
[8] M. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. Felten, and J. Sandberg. Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer. In 21st Annual Symposium on Computer Architecture, pages 142-153, April 1994.
[9] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W.-K. Su. Myrinet - A Gigabit-per-Second Local-Area Network. IEEE Micro, 15(1), 1995.
[10] P. Buonadonna. An Implementation and Analysis of the Virtual Interface Architecture. Master's thesis, University of California, Berkeley, May 1999.
[11] D. Clarke, V. Jacobson, J. Romkey, and H. Salwen. An Analysis of TCP Processing Overhead. IEEE Communications Magazine, 27(6):23-29, June 1989.
[12] A. Denis, C. Perez, and T. Priol. Towards High Performance CORBA and MPI Middleware for Grid Computing. In Grid Computing, 2001.
[13] C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, and K. Li. VMMC-2: Efficient Support for Reliable, Connection-Oriented Communication. In Hot Interconnects V, August 1997.
[14] A. Dunkels. Minimal TCP/IP implementation with proxy support. Master's thesis, Swedish Institute of Computer Science, February 2001.
[15] A. Gallatin, J. Chase, and K. Yocum. Trapeze/IP: TCP/IP at Near-Gigabit Speeds. In USENIX Technical Conference, 1999.
[16] A. Gokhale and D. Schmidt. Measuring the Performance of Communications Middleware on High-Speed Networks. In ACM SIGCOMM, 1996.
[17] M. Laubach. Classical IP and ARP over ATM. RFC 1577, January 1994.
[18] S.-L. Lo and S. Pope. The Implementation of a High Performance ORB over Multiple Network Transports. Technical Report 98.4, AT&T Laboratories - Cambridge, 1998.
[19] R. Madukkarumukumana and H. Shah. Harnessing User-Level Networking Architectures for Distributed Object Computing over High-Speed Networks. In 2nd USENIX Windows NT Symposium, 1998.
[20] Object Management Group. The Common Object Request Broker: Architecture and Specification. http://www.omg.org/.
[21] A. Pant. A High Performance MPI Implementation on the NTSC VIA Cluster. Technical report, NCSA, University of Illinois at Urbana-Champaign, 1999.
[22] S. Pope, S. Hodges, G. Mapp, D. Roberts, and A. Hopper. Enhancing Distributed Systems with Low-Latency Networking. In Parallel and Distributed Computing and Networks, December 1998.
[23] S. Pope and S.-L. Lo. The Implementation of a Native ATM Transport for a High Performance ORB. Technical Report 98.5, AT&T Laboratories - Cambridge, 1998.
[24] I. Pratt and K. Fraser. Arsenic: A User-Accessible Gigabit Ethernet Interface. In IEEE INFOCOM, April 2001.
[25] L. Prylli, B. Tourancheau, and R. Westrelin. The design for a high performance MPI implementation on the Myrinet network. In EuroPVM/MPI, pages 223-230, 1999.
[26] D. Riddoch and S. Pope. A Low Overhead Application/Device-driver Interface for User-level Networking. In International Conference on Parallel and Distributed Processing Techniques and Applications, June 2001.
[27] D. Riddoch, S. Pope, D. Roberts, G. Mapp, D. Clarke, D. Ingram, K. Mansley, and A. Hopper. Tripwire: A Synchronisation Primitive for Virtual Memory Mapped Communication. Journal of Interconnection Networks (JOIN), 2(3):345-364, September 2001.
[28] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In 15th ACM Symposium on Operating Systems Principles, December 1995.