Flattened Butterfly Topology for On-Chip Networks

Uploaded On 2017-08-25


John Kim, James Balfour, and William J. Dally
Computer Systems Laboratory
Stanford University, Stanford, CA 94305
{jjk12, jbalfour, dally}@cva.stanford.edu

With the trend towards increasing number of cores in chip multiprocessors, the on-chip interconnect that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks in on-chip interconnection net-

Figure 1. The use of concentration in interconnection networks: (a) 8-node (N0-N7) ring with 8 routers (R0-R7) without concentration, (b) 4-node ring with 2-way concentrator, and (c) the same topology as (b) with the 2-way concentrator integrated into the router.

2 Background

This section describes how the cost structures of on-chip networks differ from system interconnection networks and presents an argument for high-radix networks on chip.

2.1 Cost of On-Chip Networks

The cost of an off-chip interconnection network typically increases with the channel count. As the channel count increases with the hop count, reducing the diameter of the network reduces the cost of the network [16]. On-chip networks differ because inexpensive wires make bandwidth plentiful, while buffers are comparatively expensive. However, reducing the diameter of on-chip networks remains beneficial for several reasons. The power consumed by the channels is a significant part of aggregate on-chip network power consumption [3]; thus, reducing the number of channels can reduce the overall power consumption. Furthermore, since buffered flow control is often used in on-chip networks, the aggregate area allocated to the input buffers in the routers decreases as the number of channels is reduced. In this work, we show how the use of high-radix routers and the flattened butterfly topology leads to lower diameter and, as a result, to lower cost by reducing the number of channels and the amount of buffering required in the network.

In addition to reducing the diameter, the use of concentration, where network resources are shared among different processor nodes, can improve the efficiency of the network. An example of concentration is shown in Figure 1. Using a ring topology, 8 nodes can be connected with 8 routers as shown in Figure 1(a). By using a concentration factor of two, the 8 nodes can be connected in a 4-node ring where each node consists of two terminal nodes and a router, as shown in Figure 1(b). The use of concentration aggregates traffic from different nodes into a single network interface. This reduces both the number of resources allocated to the network routers and the average hop count, which can improve latency. Thus, while providing the same bisection bandwidth, concentration can reduce the cost of the network by reducing the network size. The concentrator can be integrated into the router by increasing the radix of the router, as shown in Figure 1(c), to allow all of the terminal nodes to access the network concurrently, instead of allowing only one terminal node associated with a router to access the network during any one cycle. The use of concentration in on-chip networks also reduces the wiring complexity, as shown in Section 6.1.

Concentration is practical for on-chip networks because the probability that more than one of the processors attached to a single router will attempt to access the network on a given cycle is relatively low. For example, in a chip multiprocessor (CMP) architecture with a shared L2 cache distributed across the chip, the L1 miss rate is often low, which results in relatively low traffic injection rates at the processors. Consequently, using concentration to share the network resources is an effective technique for CMP traffic. The advantages of using concentration in on-chip networks were previously described for 2-D mesh networks [3]. In this paper, we describe how the flattened butterfly topology [15] for on-chip networks uses concentration as well as high-radix routers to reduce the diameter of the network and improve cost-efficiency.

Figure 2. Latency of a packet in on-chip networks, split into $T_s$ and $T_h$ for the mesh and the flattened butterfly under average and worst-case traffic (y-axis: latency in nsec).

2.2 Latency in On-Chip Networks

Latency is a critical performance metric for on-chip networks. The latency of a packet through an interconnection network can be expressed as the sum of the header latency ($T_h$), the serialization latency ($T_s$), and the time of flight on the wires ($T_w$):

$$T = T_h + T_s + T_w = H t_r + \frac{L}{b} + T_w$$

where $t_r$ is the router delay, $H$ is the hop count, $L$ is the packet size, and $b$ is the channel bandwidth.

Minimizing latency requires establishing a careful balance between $T_h$ and $T_s$. For on-chip networks, wires are
abundant, on-chip bandwidth plentiful, and consequently $T_s$ can be reduced significantly by providing very wide channels. However, traditional 2-D mesh networks tend to establish $T_s$ and $T_h$ values that are unbalanced, with wide channels providing a low $T_s$ while $T_h$ remains high due to the high hop count. Consequently, these networks fail to minimize latency. For example, in the 2-D mesh network used in the Intel TeraFlop [25], with uniform random traffic, $T_h$ is approximately 3 times $T_s$, and for worst-case traffic the difference is larger still. However, by using high-radix routers and the flattened butterfly topology, the hop count can be reduced at the expense of increasing the serialization latency, assuming the bisection bandwidth is held constant, as shown in Figure 2. As a result, the flattened butterfly achieves a lower overall latency. Note that the wire delay ($T_w$) associated with the Manhattan distance between the source and the destination nodes generally corresponds to the minimum packet latency in an on-chip network [18]. This paper describes how high-radix routers and the flattened butterfly topology can be used to build on-chip networks that try to approach this ideal latency by significantly reducing the number of intermediate routers.

Footnote: The latency calculations were based on Intel TeraFlop [25] parameters ($t_r$ = 1.25 ns, $b$ = 16 GB/s, $L$ = 320 bits) and an estimated value of wire delay for 65 nm ($t_w$ = 250 ps per mm).

3 On-Chip Flattened Butterfly

3.1 Topology Description

The flattened butterfly topology [15] is a cost-efficient topology for use with high-radix routers. The flattened butterfly is derived by combining (or flattening) the routers in each row of a conventional butterfly topology while preserving the inter-router connections. The flattened butterfly is similar to a generalized hypercube [4]; however, by providing concentration in the routers, the flattened butterfly significantly reduces the wiring complexity of the topology, allowing it to scale more efficiently.

To map a 64-node on-chip network onto the flattened butterfly topology, we collapse a 3-stage radix-4 butterfly network (4-ary 3-fly) to produce the flattened butterfly shown in Figure 3(a). The resulting flattened butterfly has 2 dimensions and uses radix-10 routers. With four processor nodes attached to each router, the routers have a concentration factor of 4. The remaining 6 router ports are used for inter-router connections: 3 ports are used for the dimension 1 connections, and 3 ports are used for the dimension 2 connections. Routers are placed as shown in Figure 3(b) to embed the topology in a planar VLSI layout, with each router placed in the middle of its 4 processing nodes. Routers connected in dimension 1 are aligned horizontally, while routers connected in dimension 2 are aligned vertically; thus, the routers within a row are fully connected, as are the routers within a column.

Figure 3. (a) Block diagram of a 2-dimension flattened butterfly consisting of 64 nodes and (b) the corresponding layout of the flattened butterfly, where dimension 1 routers are horizontally placed and dimension 2 routers are vertically placed.

The wire delay associated with the Manhattan distance between a packet's source and its destination provides a lower bound on the latency required to traverse an on-chip network. When minimal routing is used, processors in this flattened butterfly network are separated by only 2 hops, which is a significant improvement over the hop count of a 2-D mesh. The flattened butterfly attempts to approach the wire delay bound by reducing the number of intermediate routers, resulting in not only lower latency but also lower energy consumption.

However, the wires connecting distant routers in the flattened butterfly network are necessarily longer than those found in the mesh. The adverse impact of long wires on performance is readily reduced by optimally inserting repeaters and pipeline registers to preserve the channel bandwidth while tolerating channel traversal times that may be several cycles. The longer channels also require deeper buffers to cover the credit round-trip latency in order to maintain full throughput.

3.2 Routing and Deadlock

Both minimal and non-minimal routing algorithms can be implemented on the flattened butterfly topology. A limited number of virtual channels (VCs) [8] may be needed to prevent deadlock within the network when certain routing algorithms are used. Additional VCs may be required for purposes such as separating traffic into different classes or avoiding deadlock in the client protocols. Dimension-ordered routing (DOR) can be used as a minimal routing algorithm for the flattened butterfly (e.g. route in dimension 1, then route in dimension 2); in this case, the routing algorithm itself is restrictive enough to prevent deadlock. Non-minimal routing allows the path diversity available in the flattened butterfly network to be used to improve load balance and performance. For these reasons, we use the non-minimal global adaptive routing (UGAL) [23] algorithm when evaluating the flattened butterfly topology in this work. UGAL load balances by determining whether it is beneficial to route minimally or non-minimally. If it selects non-minimal routing, UGAL routes minimally to an intermediate node in the first phase, and then routes minimally to the destination in the second phase. To reduce the number of VCs that are needed, we use DOR within each phase of UGAL routing; thus, only 2 VCs are needed.

Figure 4. Routing paths in a 2-D on-chip flattened butterfly. (a) All of the traffic from nodes attached to R1 is sent to nodes attached to R2. The minimal path route is shown in (b) and the two non-minimal paths are shown in (c) and (d). For simplicity, the processing nodes attached to these routers are not shown and only the first row of routers is shown.

4 Bypass Channels and Microarchitecture

As shown in Figure 3, the routers are fully connected in each dimension. As a result, the channels that pass over other routers in the same row or column can be referred to as bypass channels, i.e. channels that bypass local routers to connect directly to their destination router. The use of these bypass channels with non-minimal routing in the on-chip flattened butterfly topology can result in a non-minimal physical distance being traversed. An example is shown in Figure 4, with the traffic pattern shown in Figure 4(a). The minimal path for this traffic pattern is shown in Figure 4(b) and the non-minimal paths are shown in Figure 4(c, d). The layout of an on-chip flattened butterfly can result in the non-minimal routes overshooting the destination on the way to the intermediate node selected for load balancing (Figure 4(c)). A non-minimal route may also route a packet away from its destination before it is routed back to its destination on a bypass channel that passes over the source (Figure 4(d)).

To avoid the inefficiencies of routing packets on paths of non-minimal physical length, the bypass channels can be connected to the routers they pass over. These additional connections allow packets to enter or leave the bypass channels early when doing so is beneficial. In this section, we explain how the router microarchitecture and the flow control mechanisms can be modified to connect the bypass channels directly to the router switch in order to reduce latency and improve energy efficiency.

Figure 5. Flattened butterfly router diagram with bypass channels in (a) a conventional flattened butterfly and (b) a flattened butterfly with muxes to efficiently utilize the bypass channels. The router diagram is illustrated for router R1 in Figure 4, with the connections shown only for a single dimension of the flattened butterfly.

4.1 Router Bypass Architecture

A high-level diagram of a router in an on-chip flattened butterfly is shown in Figure 5(a). It consists of the switch and bypass channels that connect the neighbouring routers. One method to connect the bypass channels to the local router is to add additional inputs to the switch. However, doing so would significantly increase the complexity of the switch. For example, in the flattened butterfly shown in Figure 3, the switch would increase from $10 \times 10$ to $18 \times 18$ in the worst case, nearly quadrupling the area consumed by the switch. In addition, the use of bypass channels is not intended to increase the bandwidth of the topology, but rather to reduce latency and energy; thus, the larger switch, which would provide additional bandwidth, is not needed. Instead, we
break the bypass channels as they pass over the router and insert muxes as shown in Figure 5(b).

As illustrated in Figure 5(b), two types of muxes are added to connect the bypass channels to the local router: input muxes and output muxes. The inputs to the muxes can be classified as either bypass inputs (e.g. inputs from the bypass channels) or direct inputs (e.g. inputs/outputs to/from the local router). The input muxes receive packets destined for the local router that would otherwise bypass the local router en route to the intermediate node selected by the routing algorithm, as illustrated in Figure 4(c). Thus, each input mux receives both the direct inputs that the packet would have used if the non-minimal path were taken and the inputs from the bypass channels. The output muxes are used by packets that would have initially been routed away from their destinations before being routed back over the local router, as illustrated in Figure 4(d). The inputs to the output muxes are the direct outputs from the local router and the bypass channel inputs, i.e. the path the packet would have taken if the non-minimal route were taken. The addition of these muxes does not eliminate the need for non-minimal routing for load-balancing purposes. Instead, the muxes reduce the distance traveled by packets to improve energy efficiency and reduce latency.

Footnote: Using the analogy of cars and highways, the long wires introduced correspond to adding highways. The input muxes correspond to adding additional exit ramps to the highways, while the output muxes correspond to adding entrance ramps to get on the highway.

4.2 Mux Arbiter

The arbiters that control the bypass muxes are critical to the proper utilization of the bypass channels. A simple round-robin arbiter could be implemented at the muxes. While this type of arbiter leads to locally fair arbitration at the mux, it does not guarantee global fairness [11]. To provide global fairness, we implement an arbiter that yields to the primary input, i.e. the input that would have used the channel bandwidth at the output of the mux if the bypass channels were not connected to the local router. Accordingly, the direct input is given priority at an input mux, while the bypass channel is given priority at an output mux. Thus, if the primary input is idle, the arbiter grants access to the non-primary input.

To prevent starvation of the non-primary inputs, a control packet is sent along the non-minimal path originally selected by the routing algorithm. This control packet contains only routing information and a marker bit to identify the packet as a control packet. The control packet is routed at the intermediate node as though it were a regular packet, which results in it eventually arriving at the primary input of the muxes at the destination router. When the control packet arrives, the non-primary input is granted access to the mux output bandwidth. This policy guarantees that a packet waiting at the non-primary input of a bypass mux will eventually be granted access to the bypass mux. In the worst-case (i.e. highly congested) environment, the latency of the non-minimally routed packets will be identical to that of a flattened butterfly that does not directly utilize the bypass channels. However, there will still be energy savings because the data flit does not traverse the non-minimal physical distance; instead, only the control flit, which is much smaller than a data packet, travels the full physical distance of the non-minimal route.

Figure 6. Modification to the buffers introduced into the flow control with the utilization of bypass channels. The additional bits of the buffers correspond to V: valid bit, CNT: count of control packets, and DEST: control packet content, which contains a destination. Once a flit in the input buffer is processed, if the V bit is set, CNT control packets are processed in the router before the next flit in the input buffer is processed.

4.3 Switch Architecture

With minimal routing, the crossbar switch can be simplified because it need not be fully connected. Non-minimal routing increases the complexity of the switch because some packets might need to be routed twice within a dimension, which requires more connections within the crossbar. However, by using the bypass channels efficiently, non-minimal routing can be implemented using a switch of lesser complexity, one that approaches the complexity of a flattened butterfly that only supports minimal routing. If the bypass channels are utilized, non-minimal routing does not require sending full packets through intermediate routers, and as a result, the connections within the switch approach those of a switch that only supports minimal routing.

4.4 Flow Control and Routing

Buffers are required at the non-primary inputs of the bypass muxes for flow control, as illustrated in Figure 6. Thus, with non-minimal routing, credits for the bypass buffers are needed before a packet can depart a router. The control packets that are generated can be buffered in the input buffers. However, since buffers are an expensive resource in on-chip networks, instead of having the short control packets occupy the input buffers, minor modifications can be made to the input buffers to handle the control packets. Once a control packet arrives, its destination is stored in a separate buffer (shown as the DEST field in Figure 6) and the control bits of the input buffer are updated so that the control packet can be properly processed in the router. These modifications should introduce little overhead because the datapaths of on-chip networks are typically much wider than required for the control packet.

The bypass channels are exploited when non-minimal routing is utilized, and to properly load balance the channels, we modify the UGAL [23] routing algorithm to support bypass channels. When non-minimal routing is selected, instead of routing to an intermediate node and then routing to the destination, UGAL is broken into multiple phases such that UGAL is applied in dimension 1 followed by UGAL in dimension 2. This ensures that load balancing is achieved while using the bypass channel to provide the minimum physical path from source to destination.

5 Evaluation

We compare the performance of the following topologies in this section:

1. conventional 2-D mesh (MESH)
2. concentrated mesh with express channels [3] (CMESH)
3. flattened butterfly (FBFLY)
   (a) flattened butterfly with minimal routing only (FBFLY-MIN)
   (b) flattened butterfly with non-minimal routing (FBFLY-NONMIN)
   (c) flattened butterfly with non-minimal routing and use of bypass channels (FBFLY-BYP)

The topologies were evaluated using a cycle-accurate network simulator. We compare the networks' performance and power consumption. The power consumption is based on the model described in [3] for a 65 nm technology. We accurately model the additional pipeline delay required for the high-radix routers as well as the additional serialization latency through the narrower channels. The bisection bandwidth is held constant across all topologies for the purpose of comparing the performance of the different networks.

5.1 Performance

We compare the performance of the different topologies for a 64-node on-chip network by evaluating their throughput using open-loop simulation, and we also compare them using closed-loop simulation with both synthetic traffic patterns and traces from simulations. The routing algorithms used for the different topologies are described in Table 1. For flow control, virtual channel flow control is used with 2 VCs to break routing deadlock and another 2 VCs to break protocol deadlock for the closed-loop simulations.

Table 1. Routing algorithms used in simulation comparison.

Topology        Routing
FBFLY-MIN       randomized-dimension
FBFLY-NONMIN    UGAL [23]
FBFLY-BYP       UGAL [23]
CMESH           O1Turn with express channels [3]
MESH            O1Turn [22]

Figure 7. Throughput comparison of CMESH and FBFLY for (a) tornado and (b) bit complement traffic pattern (x-axis: offered load as a fraction of capacity; y-axis: latency in cycles).

Figure 9. Node completion time variance for the different topologies: (a) mesh, (b) CMESH, and (c) flattened butterfly (x-axis: completion time in cycles; y-axis: number of nodes).

5.1.1 Throughput Comparison

To evaluate the throughput, the simulator is warmed up under load without taking measurements until steady state is reached. Then a sample of injected packets is labeled during a measurement interval. The simulation is run until all labeled packets exit the system. For the throughput analysis, packets are assumed to be single-flit packets.

In Figure 7, we compare the latency vs. offered load on two adversarial traffic patterns for CMESH and FBFLY,
two topologies that utilize concentration to reduce the cost of the network. By effectively utilizing non-minimal routing and the smaller diameter of the topology, FBFLY can provide up to a 50% increase in throughput compared to CMESH while providing lower zero-load latency. Although the MESH can provide higher throughput for some traffic patterns, it has been previously shown that CMESH results in a more cost- and energy-efficient topology compared to the MESH [3].

Figure 8. Latency comparison of alternative topologies (MESH, CMESH, FBFLY-MIN, FBFLY-NONMIN, FBFLY-BYP) across different synthetic traffic patterns (bitrev, transpose, tornado, randperm, bitcomp); latency is normalized to the mesh network.

5.1.2 Synthetic Batch Traffic

In addition to the throughput measurement, we use a batch experiment to model the memory coherence traffic of a shared-memory multiprocessor. Each processor executes a fixed number of remote memory operations (e.g. remote cache line read/write requests) during the simulation, and we record the time required for all operations to complete. Read requests and write acknowledgements are mapped into 64-bit messages, while read replies and write requests are mapped into 576-bit messages. Each processor may have up to four outstanding memory operations. The synthetic traffic patterns used are uniform random (UR), bit complement, transpose, tornado, a random permutation, and bit reverse [11].

Figure 8 shows the performance comparison for the batch experiment, with latency normalized to the mesh network. CMESH reduces latency by 10% compared to the MESH, and the flattened butterfly reduces the latency further. With FBFLY-NONMIN, the latency can actually increase because of the extra latency incurred by non-minimal routing. However, FBFLY-BYP provides the benefit of non-minimal routing while reducing the latency, since all packets take the minimal physical path, and results in approximately 28% latency reduction compared to the MESH network.

In addition to the latency required to complete the batch job, we also plot the variance of the completion time of the 64 nodes. Figure 9 shows a histogram of the completion time for each processing node. With the flattened butterfly, because of the lower diameter, the completion time has a much tighter distribution and smaller variance across all of the nodes. Less variance can reduce the global synchronization time in chip multiprocessor systems. The CMESH, because it is not a symmetric topology, leads to an unbalanced distribution of completion time.

5.1.3 Multiprocessor Traces

Network traces were collected from an instrumented version of a 64-processor directory-based Transactional Coherence and Consistency (TCC) multiprocessor simulator [6]. In addition to capturing the traffic patterns and message distributions, we record detailed protocol information so that we can infer dependencies between messages and identify sequences of interdependent communication and computation phases at each processor node. This improves the accuracy of the performance measured when the traces are replayed through a cycle-accurate network simulator. When capturing the traces, we use an idealized interconnection network model which provides instantaneous message delivery to avoid adversely biasing the traces. The traffic injection process uses the recorded protocol information to reconstruct the state of each processor node. Essentially, each processor is modelled as being in one of two states: an active computing state, during which it progresses towards its next communication event; or an idle communicating state, in which it is stalled waiting for an outstanding communication request to complete. Incoming communication events which do not require processor intervention, such as a remote read request, are assumed to be handled by the memory controller and therefore do not interrupt the receiving processor.

Figure 10. Performance comparison from SPLASH benchmark traces (barnes, ocean, equake, tomcatv) generated from a distributed TCC simulator; latency is normalized to the mesh network.

Four benchmarks (barnes, ocean, equake, and tomcatv) from the SPLASH benchmarks [26] were used to evaluate the alternative topologies, and the results are shown in Figure 10. For two of the benchmarks (equake, tomcatv), the flattened butterfly on-chip network provides less than 5% reduction in latency. However, for the other two benchmarks (barnes, ocean), the flattened butterfly can provide up to 20% reduction in latency.

5.2 Power Comparison

The power consumption comparison is shown in Figure 11. The flattened butterfly provides additional power savings compared to the CMESH. With the reduction in the width of the datapath, the power consumption of the crossbar is also reduced, achieving approximately 38% power reduction compared to the mesh network.

Figure 11. Power consumption comparison of alternative topologies (MESH, CMESH, FBFLY-MIN, FBFLY-NONMIN, FBFLY-BYP) on UR traffic, broken down into memory, crossbar, and channel power.

The area of a router tends to increase with its radix. Control logic, such as the allocators, consumes area proportional to the radix of the router; however, it represents a small fraction of the aggregate area. Consequently, the buffers and switch dominate the router area. The buffer area can be kept constant as the radix increases by reducing the buffer space allocated per input port. The total switch area can be approximated as $n(bk)^2$, where $n$ is the number of routers, $b$ is the bandwidth per port, and $k$ is the router radix. As $k$ increases, $b$ decreases because the bisection bandwidth is held constant, and $n$ also decreases, because each router services more processors. Consequently, we expect that high-radix on-chip networks will consume less area. We estimate that the flattened butterfly provides an area reduction of approximately 4x compared to the conventional mesh network and a reduction of 2.5x compared to the concentrated mesh. Although the introduction of the bypass muxes can increase the area as well as the power, the impact is negligible compared to the area and power consumption of the buffers and the channels.

6 Discussion

In this section we describe how the flattened butterfly topology can be scaled as more processors are integrated on chip. We also describe how the long direct links used in the flattened butterfly topology are likely to benefit from advances in on-chip signalling techniques, and how these direct links provide some of the performance benefits traditionally provided by virtual channels.

6.1 Comparison to Generalized Hypercube

The flattened butterfly topology is similar to the generalized hypercube [4], but the main difference is the use of concentration in the flattened butterfly. The use of concentration significantly reduces the wiring complexity because the resulting network requires fewer channels to connect the routers. Furthermore, it is often possible to provide wider channels when concentration is used, because there are fewer channels competing for limited wire resources, which improves the typical serialization latency. When the topology is embedded into a planar VLSI layout, as is required for on-chip networks, the total bandwidth crossing a particular bisection must be divided among a larger number of channels as the channel count across that bisection increases, thus decreasing the amount of bandwidth per channel.

A layout of a conventional mesh network and the 2-D flattened butterfly is shown in Figure 12(a, b). Although the flattened butterfly increases the number of channels crossing

Footnote (Section 5.2): Although there is a radix increase from radix-8 to radix-10 comparing CMESH to the flattened butterfly, since $b$ is reduced in half, there is an overall decrease in total area.
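As a rough check on the area estimates in Section 5.2 and the footnote above, the following sketch evaluates the switch-area approximation for the three 64-node networks. It assumes the total switch area scales as n(bk)^2 for n routers of radix k with per-port bandwidth b. The mesh radix of 5, the equal channel width for MESH and CMESH, and all function and dictionary names are illustrative assumptions, not values taken from the evaluation.

```python
def switch_area(n_routers, port_bandwidth, radix):
    """Relative total crossbar area, n * (b * k)^2, in arbitrary units."""
    return n_routers * (port_bandwidth * radix) ** 2


# Hypothetical parameters for a 64-node network (bandwidth in relative units).
# Radix-8 (CMESH) and radix-10 (FBFLY) follow the text; the mesh radix of 5
# (four neighbours plus one local port) and the MESH/CMESH channel width of
# 1.0 are assumptions made for this sketch. FBFLY channels are half as wide.
TOPOLOGIES = {
    "MESH": {"n_routers": 64, "port_bandwidth": 1.0, "radix": 5},
    "CMESH": {"n_routers": 16, "port_bandwidth": 1.0, "radix": 8},
    "FBFLY": {"n_routers": 16, "port_bandwidth": 0.5, "radix": 10},
}


def area_ratios():
    """Return each topology's switch area relative to the flattened butterfly."""
    areas = {name: switch_area(**params) for name, params in TOPOLOGIES.items()}
    reference = areas["FBFLY"]
    return {name: area / reference for name, area in areas.items()}


if __name__ == "__main__":
    for name, ratio in area_ratios().items():
        print(f"{name}: {ratio:.2f}x the FBFLY switch area")
```

Under these assumptions the flattened butterfly's switch area comes out 4x smaller than the mesh's and about 2.56x smaller than the concentrated mesh's, in line with the estimates quoted in the text.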
neighboring routers in the middle by a factor of 4, the use of concentration allows the two rows of wire bandwidth to be combined, resulting in a reduction of bandwidth per channel by only a factor of 2. However, the generalized hypercube topology (Figure 12(c)) would increase the number of channels in the bisection of the network by a factor of 16, which would adversely impact the serialization latency and the overall latency of the network.

Figure 12. Layout of 64-node on-chip networks, illustrating the connections for the top two rows of nodes and routers for (a) a conventional 2-D mesh network, (b) a 2-D flattened butterfly, and (c) a generalized hypercube. Because of the complexity, only the channels connected to R0 are shown for the generalized hypercube.

6.2 Scaling On-Chip Flattened Butterfly

The flattened butterfly in on-chip networks can be scaled to accommodate more nodes in different ways. One method of scaling the topology is to increase the concentration factor. For example, the concentration factor can be increased from 4 to 8 to increase the number of nodes in the network from 64 to 128, as shown in Figure 13(a). This further increases the radix of the router to radix-14. With this approach, the bandwidth of the inter-router channels needs to be properly adjusted such that there is sufficient bandwidth to support the terminal bandwidth.

Another scaling methodology is to increase the dimension of the flattened butterfly. For example, the dimension can be increased from a 2-D flattened butterfly to a 3-D flattened butterfly to provide an on-chip network with up to 256 nodes, as shown in Figure 13(b). To scale to a larger number of nodes, both the concentration factor and the number of dimensions can be increased.

However, as mentioned in Section 6.1, as the number of channels crossing the bisection increases, the reduction in bandwidth per channel and the increased serialization latency become problematic. To overcome this, a hybrid approach can be used to scale the on-chip flattened butterfly. One such possibility is shown in Figure 13(c), where the 2-D flattened butterfly is used locally and the clusters of 2-D flattened butterflies are connected with a mesh network. This reduces the number of channels crossing the bisection and minimizes the impact of narrower channels at the expense of a slight increase in the average hop count (compared to a pure flattened butterfly).

Figure 13. Different methods to scale the on-chip flattened butterfly by (a) increasing the concentration factor, (b) increasing the dimension of the flattened butterfly, and (c) using a hybrid approach to scaling.

6.3 Future Technologies

The use of high-radix routers in on-chip networks introduces long wires. The analysis in this work assumed optimally repeated wires to mitigate the impact of long wires and utilized pipelined buffers for multicycle wire delays. However, many evolving technologies will impact communication in future on-chip networks, and the longer wires in the on-chip flattened butterfly topology are well suited to exploit these technologies. For example, on-chip optical signalling [17] and on-chip high-speed signalling [13] attempt to provide signal propagation velocities that are close to the speed of light while providing higher bandwidth for on-chip communications. With a conventional 2D mesh topology, all of the
wires are relatively short and, because of the overhead involved in these technologies, low-radix topologies cannot exploit their benefits. However, the on-chip flattened butterfly topology contains both short wires and long wires, so it can take advantage of cheap electrical wires for the short channels while exploiting these new technologies for the long channels.

6.4 Use of Virtual Channels

Virtual channels (VCs) were originally used to break deadlocks [9] and were also proposed to increase network performance by providing multiple virtual lanes for a single physical channel [8]. When multiple output VCs can be assigned to an input VC, VC allocation is required, which can significantly impact the latency and the area of a router. For example, in the TRIPS on-chip network, VC allocation consumes 27% of the cycle time [12], and an area analysis shows that VC allocation can occupy up to 35% of the total area of an on-chip network router [20].

Figure 14(a) shows an example of how blocking can occur in a conventional network with wormhole flow control. By utilizing virtual channel flow control (Figure 14(b)), buffers are partitioned into multiple VCs, which allows packets to pass blocked packets. However, with the use of high-radix routers and the flattened butterfly topology, the additional wire resources available can be used to overcome the blocking that can occur, as shown in Figure 14(c). As a result, blocking is reduced to only packets originating from the same source router and destined to routers in the same column of the network. Thus, the topology reduces the need for the buffers to be partitioned into multiple VCs and takes advantage of the abundant wiring available in an on-chip network. VCs for other uses, such as separating traffic into different classes, might still be needed, but such usage does not require VC allocation.

Figure 14. Block diagram of packet blocking in (a) wormhole flow control, (b) virtual channel flow control, and (c) flattened butterfly.

Figure 15 shows how the performance of the flattened butterfly changes as the number of VCs is increased. In the simulation comparison, the total amount of storage per physical channel is held constant; thus, as the number of VCs is increased, the amount of buffering per VC is decreased. As a result, increasing the number of VCs for an on-chip flattened butterfly can slightly degrade the performance of the network, since the amount of buffering per VC is reduced.

Figure 15. Performance comparison as the number of virtual channels is increased in the flattened butterfly (x-axis: offered load as a fraction of capacity; y-axis: latency in cycles).

7 Related Work

Most on-chip networks that have been proposed are low-radix, mostly utilizing a 2D mesh or a torus network [10]. Balfour and Dally proposed using concentrated mesh and express channels in on-chip networks to reduce the diameter and energy of the network [3]. This work expands on their idea and provides a symmetric topology that further reduces latency and energy. In addition, the benefits of creating parallel subnetworks [3] can also be applied to the flattened butterfly topology in on-chip networks.

Kim et al. [16] showed that increasing pin bandwidth can be exploited with high-radix routers to achieve lower cost and lower latency in off-chip interconnection networks. Although on-chip networks have very different constraints compared to off-chip networks, we showed how the flattened butterfly topology [15] proposed for high-radix off-chip interconnection networks can also be applied to on-chip networks.

Kumar et al. [18] proposed the use of express virtual channels (EVC) to reduce the latency of 2-D mesh on-chip networks by bypassing intermediate routers. However, EVC requires sharing the bandwidth between EVC and non-EVC packets in a 2-D mesh network. Both EVC and the flattened butterfly topology share the objective of trying to achieve ideal latency in on-chip networks, but they differ in approach: EVC is a flow control mechanism while the flattened butterfly is a topology.

The yield arbiter described in Section 4.2 is similar in concept to flit-reservation flow control (FRFC) [21]. In FRFC, a control flit is sent ahead of the data flits and reserves the buffers and channels for the ensuing data flits. However, the scheme described in this paper uses the control flit to ensure bandwidth for the data flits in the worst-case scenario (i.e. a very congested network); otherwise, the bypass channel is used regardless of the control flit.

Scaling the topology with a hybrid (mesh) approach at the top level has been proposed for off-chip interconnection networks to reduce the length of the global cables [27]. Similar scaling can also be applied to on-chip networks, as shown in Section 6.2, with the benefits being not just shorter wires but also reduced on-chip wiring complexity.

8 Conclusion

In this paper, we described how high-radix routers and the flattened butterfly topology can be utilized in on-chip networks to realize reductions in latency and power. Reducing the number of routers and channels in the network results in a more efficient network with lower latency and lower energy consumption. In addition, we described the use of bypass channels to support non-minimal routing with minimal increase in power while further reducing latency in the on-chip network. We show that the flattened butterfly can increase throughput by up to 50% compared to the concentrated mesh, and reduce latency by 28% while reducing power consumption by 38% compared to a mesh network.

References

[1] P. Abad, V. Puente, J. A. Gregorio, and P. Prieto. Rotary Router: An Efficient Architecture for CMP Interconnection Networks. In Proc. of the International Symposium on Computer Architecture (ISCA), pages 116-125, San Diego, CA, June 2007.
[2] A. Agarwal, L. Bao, J. Brown, B. Edwards, M. Mattina, C.-C. Miao, C. Ramey, and D. Wentzlaff. Tile Processor: Embedded Multicore for Networking and Multimedia. In Hot Chips 19, Stanford, CA, Aug. 2007.
[3] J. Balfour and W. J. Dally. Design tradeoffs for tiled CMP on-chip networks. In Proc. of the International Conference on Supercomputing, pages 187-198, 2006.
[4] L. N. Bhuyan and D. P. Agrawal. Generalized hypercube and hyperbus structures for a computer network. IEEE Trans. Computers, 33(4):323-333, 1984.
[5] T. Bjerregaard and S. Mahadevan. A survey of research
and practices of network-on-chip. ACM Comput. Surv., 38(1):1, 2006.
[6] H. Chafi, J. Casper, B. D. Carlstrom, A. McDonald, C. Cao Minh, W. Baek, C. Kozyrakis, and K. Olukotun. A scalable, non-blocking approach to transactional memory. In 13th International Symposium on High Performance Computer Architecture (HPCA), Feb. 2007.
[7] S. Cho and L. Jin. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. In IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 455–468, Orlando, FL, 2006.
[8] W. J. Dally. Virtual-channel Flow Control. IEEE Transactions on Parallel and Distributed Systems, 3(2):194–205, 1992.
[9] W. J. Dally and C. L. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Transactions on Computers, 36(5):547–553, 1987.
[10] W. J. Dally and B. Towles. Route Packets, Not Wires: On-Chip Inteconnection Networks. In Proc. of the 38th conference on Design Automation (DAC), pages 684–689, 2001.
[11] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, San Francisco, CA, 2004.
[12] P. Gratz, C. Kim, R. McDonald, S. Keckler, and D. Burger. Implementation and Evaluation of On-Chip Network Architectures. In International Conference on Computer Design (ICCD), 2006.
[13] A. Jose, G. Patounakis, and K. Shepard. Near speed-of-light on-chip interconnects using pulsed current-mode signalling. In Digest of Technical Papers, 2005 Symposium on VLSI Circuits, pages 108–111, 2005.
[14] J. Kim, J. Balfour, and W. J. Dally. Flattened butterfly for on-chip networks. IEEE Computer Architecture Letters, July 2007.
[15] J. Kim, W. J. Dally, and D. Abts. Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks. In Proc. of the International Symposium on Computer Architecture (ISCA), pages 126–137, San Diego, CA, June 2007.
[16] J. Kim, W. J. Dally, B. Towles, and A. K. Gupta. Microarchitecture of a High-Radix Router. In Proc. of the International Symposium on Computer Architecture (ISCA), pages 420–431, Madison, WI, 2005.
[17] N. Kirman, M. Kirman, R. K. Dokania, J. F. Martinez, A. B. Apsel, M. A. Watkins, and D. H. Albonesi. Leveraging optical technology in future bus-based chip multiprocessors. In IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 492–503, Orlando, FL, 2006.
[18] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha. Express Virtual Channels: Towards the Ideal Interconnection Fabric. In Proc. of the International Symposium on Computer Architecture (ISCA), pages 150–161, San Diego, CA, June 2007.
[19] R. D. Mullins, A. West, and S. W. Moore. Low-Latency Virtual-Channel Routers for On-Chip Networks. In Proc. of the International Symposium on Computer Architecture (ISCA), pages 188–197, Munich, Germany, 2004.
[20] C. A. Nicopoulos, D. Park, J. Kim, N. Vijaykrishnan, M. S. Yousif, and C. R. Das. ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers. In IEEE/ACM International Symposium on Microarchitecture (MICRO), Orlando, FL, 2006.
[21] L.-S. Peh and W. J. Dally. Flit-reservation flow control. In International Symposium on High-Performance Computer Architecture (HPCA), pages 73–84, 2000.
[22] D. Seo, A. Ali, W.-T. Lim, N. Rafique, and M. Thottethodi. Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks. In Proc. of the International Symposium on Computer Architecture (ISCA), pages 432–443, Madison, WI, 2005.
[23] A. Singh. Load-Balanced Routing in Interconnection Networks. PhD thesis, Stanford University, 2005.
[24] M. B. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal. Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures. In International Symposium on High-Performance Computer Architecture (HPCA), pages 341–353, Anaheim, California.
[25] S. Vangal et al. An 80-Tile 1.28 TFLOPS Network-on-Chip in 65nm CMOS. In IEEE Int'l Solid-State Circuits Conf., Digest of Tech. Papers (ISSCC), 2007.
[26] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. of the International Symposium on Computer Architecture (ISCA), pages 24–36, Santa Margherita Ligure, Italy.
[27] M. Woodacre. Towards Multi-Paradigm Computing at SGI. Talk at Stanford University EE380 seminar, Jan. 2006.
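
The fixed-storage comparison described in Section 6.4 (total buffering per physical channel held constant while the number of VCs grows) can be illustrated with a minimal Python sketch. It assumes the per-channel buffer budget is split evenly across VCs. The 32-flit budget and the function names are hypothetical, chosen only for illustration, and are not taken from the paper's simulation setup.

```python
def flits_per_vc(total_flits_per_channel: int, num_vcs: int) -> int:
    """Depth of each VC buffer when a fixed per-channel budget is split evenly.

    Assumption (not from the paper): an even split, rounding down.
    """
    if num_vcs < 1:
        raise ValueError("num_vcs must be at least 1")
    if total_flits_per_channel < num_vcs:
        raise ValueError("budget must give every VC at least one flit")
    return total_flits_per_channel // num_vcs


def partition_table(total_flits_per_channel: int, vc_counts: list[int]) -> dict[int, int]:
    """Map each VC count to the resulting per-VC buffer depth."""
    return {n: flits_per_vc(total_flits_per_channel, n) for n in vc_counts}


if __name__ == "__main__":
    # Hypothetical budget of 32 flits per physical channel.
    for vcs, depth in partition_table(32, [1, 2, 4]).items():
        print(f"{vcs} VC(s): {depth} flits per VC")
```

Under this assumed budget, going from 1 VC to 2 VCs halves the depth of each VC buffer, which is the reduction in per-VC buffering that Section 6.4 links to the slight performance degradation.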