Catnap: Energy Proportional Multiple Network-on-Chip
Reetuparna Das, University of Michigan

Satish Narayanasamy, University of Michigan, nsatish@umich.edu
Sudhir K. Satpathy, Intel Labs, sudhirksatpathy@intel.com
Ronald G. Dreslinski, University of Michigan, rdreslin@umich.edu

Abstract
Multiple networks have been used in several processor implementa...

... processors with many cores. The power-gating policy determines when to wake up a router, and hence it must react quickly to congestion to avoid accruing wake-up delays from powered-off routers at each hop.

To address these problems, we propose the Catnap NoC architecture that employs a new subnet-selection policy and a new power-gating policy. Catnap's subnet-selection policy enforces a strict priority between subnets during injection. A packet is injected into a higher-order subnet only if the current set of active subnets is getting close to congestion. This priority ordering ensures that at runtime only the required number of lower-order subnets are active, while higher-order subnets can be powered off. It also ensures efficient operation under bursty traffic. After a burst, as the network load reduces, congestion in all the subnets reduces. Once the lower-order subnets' congestion goes below the congestion detection threshold, they are prioritized so that new packets are not injected into the higher-order networks.

To determine when a router and its associated links should be turned on/off, we discuss a power-gating policy that works synergistically with Catnap's subnet-selection policy. Its objective is to maximize the sleep cycles while reducing frequent switches between power states and performance loss. In our design, a router in a subnet is turned off when its input buffers are empty for a predefined number of consecutive cycles and the congestion status of currently active subnets is all set to false. A power-gated router is woken up when either of these two conditions changes.

Our proposed mechanisms rely on congestion detection. To achieve fast congestion detection, we discuss a regional congestion detection mechanism, which is the critical enabler of Catnap's subnet-selection policy and power-gating policy. A node detects congestion in a subnet if any node in its region detects local congestion in that subnet. A node determines its local congestion status of a subnet by examining that subnet's local router every cycle.

We investigated several solutions for locally detecting congestion in a subnet. We found that some of the seemingly promising local congestion metrics, such as the local packet injection rate, did not perform well. The reason is that the injection rate threshold for determining congestion varies significantly by the type of network traffic, which can change at runtime based on application characteristics. Congestion metrics such as occupancy of the injection queue (network interface queue) or average buffer occupancy of all ports of a router did not perform well either. We found the maximum buffer occupancy of a local router to be the most effective local congestion metric, as it has the key advantage that its congestion threshold is independent of the network traffic pattern. Also, it incurs lower design complexity than the other alternatives we considered.

Multi-NoC is attractive even from a dynamic power perspective. For low-bandwidth networks with fewer nodes, the overhead of duplicating control logic across multiple routers in different subnets could be expensive in terms of area and power. However, for high-bandwidth networks, a Multi-NoC design with multiple narrower networks is more efficient in terms of dynamic power than a bandwidth-equivalent Single-NoC design with a single wider network. The reason is that, beyond a certain datapath width, increasing the width of a router incurs a higher power cost than increasing the number of routers. We find that, at a higher datapath width, the latency of a crossbar becomes the bottleneck in the router pipeline. Therefore, a wider router needs to be operated at a higher voltage than a narrower router to achieve the same frequency. As power increases quadratically with respect to voltage, for high-bandwidth networks, the power of a single wider router is higher than the aggregate power of multiple narrower routers.

We evaluate an 8x8 concentrated mesh for a 256-core system. Our evaluations show that Catnap's power-gating mechanism is effective in profitably power gating Multi-NoC's network components for as much as 70% of execution cycles, while losing less than 2% performance for workloads with low network demand. However, for Single-NoC, we not only observe that there is only a negligible reduction in static power, but also that there is about a 10% performance loss for workloads with low network demand. When averaged over different multiprogrammed workloads, we find that the average network power of a Catnap Multi-NoC with four subnets (20 W) is 44% lower than a bandwidth-equivalent Single-NoC design (36 W), while the average performance overhead is about 5%.
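The congestion-detection, subnet-selection, and power-gating rules described in this paper can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `BFM_THRESHOLD` value, the `>=` comparison, and the fallback to the highest-order subnet when every subnet is congested are hypothetical choices made for illustration. The four-cycle empty-buffer window is the value the paper reports for its experiments, and the sleep rule follows Section 3.3 (the regional congestion status of the immediately lower-order subnet is off and the router's buffers have been empty).

```python
from typing import List

# Hypothetical threshold (flits in the fullest input buffer); the excerpt
# does not state the value used in the paper.
BFM_THRESHOLD = 3
# The paper's experiments use four consecutive empty cycles.
EMPTY_CYCLES_TO_SLEEP = 4


def local_congestion(buffer_occupancies: List[int],
                     threshold: int = BFM_THRESHOLD) -> bool:
    """Local congestion status from the maximum buffer occupancy (BFM)
    across a router's input buffers. The >= comparison is an assumption."""
    return max(buffer_occupancies, default=0) >= threshold


def regional_congestion(local_statuses: List[bool]) -> bool:
    """A node treats a subnet as congested if any node in its region
    reports local congestion in that subnet."""
    return any(local_statuses)


def select_subnet(rcs: List[bool]) -> int:
    """Strict priority at injection: choose the lowest-order subnet whose
    regional congestion status (RCS) is off. rcs[i] is the RCS of subnet i.
    Falling back to the highest-order subnet is an assumption."""
    for subnet, congested in enumerate(rcs):
        if not congested:
            return subnet
    return len(rcs) - 1


def should_sleep(subnet: int, rcs: List[bool], empty_cycles: int) -> bool:
    """Power-gating rule: a router in subnet h may sleep if the RCS of
    subnet h - 1 is off and its buffers have been empty long enough.
    Subnet 0 is always kept active."""
    if subnet == 0:
        return False
    return (not rcs[subnet - 1]) and empty_cycles >= EMPTY_CYCLES_TO_SLEEP
```

For example, with four subnets and `rcs = [True, False, False, False]`, `select_subnet` picks subnet 1, and a router in subnet 2 may sleep once its buffers have been empty for four cycles, while a router in subnet 1 stays awake because subnet 0 is congested.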
Figure 1. A processor with 256 cores connected by a single 8x8 concentrated mesh. Each router is connected to four tiles.

2. Background and Motivation

This section provides a brief background on packet-based network-on-chip (NoC). We discuss network power-scaling issues of a single physical network design, which are likely to motivate processor manufacturers to adopt a multiple-network design.

2.1 Packet-Based Network-on-Chip Architecture

Our baseline many-core processor with 256 cores and 8 memory controllers is illustrated in Figure 1. The on-chip network is organized as a grid of routers in an 8x8 concentrated mesh topology. A tile consists of a processor core, its private cache, and a slice of the shared last-level cache. Four tiles are connected to a single router through a shared Network Interface (NI) buffer. A router has five input ports: four ports to connect with the neighboring four routers, and one port to connect with the node's NI.

A data or a control message is communicated as one packet. Each packet consists of several smaller flow-control units called flits. Each flit travels hop by hop from one router to another until it reaches the destination router. The router we model is input buffered with virtual channels, uses wormhole switching, has a speculative two-stage pipeline [24], and performs look-ahead routing [12]. For an in-depth introduction to NoCs, we refer the reader to [9, 17].

2.2 Scaling On-Chip Network Bandwidth

Our target processor core count for this study is 256 cores. To sustain the performance improvement of many-core processors, it is essential to maintain the current per-core bandwidth, although under tight power constraints [6]. To sustain today's 8 GB/s per-core bandwidth [15] in a 256-core processor, the bisection bandwidth of
Figure 5. Router's states for power gating.

... a router, and also be able to predict when a sleeping router should be activated to avoid performance loss. Figure 5 shows the power state transitions for a router.

We employ the following mechanism to determine when to deactivate a router. A router in a subnet S_h is unlikely to receive a flit if the regional congestion status (RCS) of the immediately lower-order subnet S_l (where l = h − 1) is off. Under this condition, Catnap's subnet-selection policy would choose the lower-order network S_l to inject a packet. Therefore, our power-gating policy decides to switch off a router in a subnet S_h if the RCS of its immediately lower-order subnet S_l (l = h − 1) is off and the buffer-empty condition is true. The buffer-empty condition is true only if all of the router's buffers have been empty for a predefined number of consecutive cycles (four in our experiments). Note that our power-gating policy keeps the 0th subnet always active.

To wake up a router of a subnet, Catnap's policy again relies on the RCS of its immediately lower-order subnet. A sleeping router in a subnet S_h is woken up, and its power state transitions to the wake-up state, immediately when the RCS of its immediately lower-order subnet S_l (l = h − 1) is turned on. Also, when a router receives the head flit of a packet and determines the next hop during routing computation, it sends a wakeup signal to the downstream router to activate it. This allows us to hide part of the wake-up penalty (T-wakeup) when the RCS-based policy fails to activate the router in time before the flit arrives. This back-up policy to activate a router essentially leverages look-ahead routing [12] to hide the wake-up delay [21]. For a two-stage router, a wake-up delay of up to 3 cycles can be hidden by sending a wakeup signal from an upstream router to a downstream sleeping router.

3.4 Alternative Policies for Local Congestion Status

We use BFM to determine the local congestion status of a router (Section 3.2.1). We arrived at this policy after evaluating several other policies that looked promising when we started our investigation. We now describe those policies and explain why they failed (Section 6.4 presents a quantitative evaluation).

3.4.1 Injection Rate (IR)

The NI of a node can measure its injection rate, which is the number of flits it injects into the network over a period of time. Injection rate can be useful in selecting subnets, because it is a direct measure of the load that is being injected into the network. However, we found that the threshold for the injection rate at which congestion manifests varies for different traffic patterns. We found that the injection rate at which congestion manifests for uniform random traffic patterns is much higher than for an adversarial traffic pattern like transpose. Conservatively choosing a small threshold that provides acceptable performance for all traffic patterns ends up providing few opportunities for saving leakage power.

3.4.2 Average Buffer Occupancy (BFA)

Instead of computing the maximum buffer occupancy of a router, we tried average buffer occupancy. This policy also did not perform well, because when congestion manifests along only a few paths, only a subset of the input ports' buffers get filled up. The rest may be empty, which unnecessarily lowers the average buffer occupancy, and as a result we may not be able to detect the congestion.

3.4.3 Injection Queue Occupancy (IQOcc)

Instead of computing the injection rate at a node, we also tried a metric based on occupancy of the injection queue [7]. This metric did not perform well either, because it was too slow to react to congestion. The injection queues get filled up only after a router's buffers are filled up. Also, the buffer occupancy of a router's ports indicates the congestion of its neighbors and hence is a better indicator than the occupancy of injection queues.

3.4.4 Blocking Delay Based (Delay)

We also evaluated a policy in which a node detects congestion in a subnet when the average blocking delay per flit of that subnet's router exceeds a certain threshold. Average blocking delay-based congestion detection does not suffer from the problems that the other three metrics we discussed face. We can identify a threshold that works well for all traffic patterns and bursty traffic. However, this policy is expensive to implement in routers. An accurate implementation would require a counter per flit to measure the blocking delay, an adder to sum the blocking delays of all flits in the router, and a shift register to divide the sum by the number of flits. To reduce this measurement overhead, this metric can be approximated by sampling only a few flits and computing a moving average of the blocking delays for only the sampled flits. We evaluated this sampling-based approach and found that the maximum buffer occupancy (BFM) that we use in our final design outperforms it.

Cores: 256 cores @ 2 GHz, 64-entry instruction window, 2-wide fetch/issue/commit
L1 Caches: 32 KB/core, private, 4-way set associative, 64-byte block, 2-cycle latency, write-back, split I/D caches, 32-entry MSHR
L2 Caches: 256 KB/core, shared, 16-way set associative, 64-byte block size, 6-cycle bank latency, 32 MSHRs
Coherence: 4-hop MESI directory protocol
Network: 2 GHz 2-stage router, 4 VCs/port, 4 flits/VC, 8x8 mesh topology with 4 tiles/node, wormhole switched, VC flow control, X-Y deterministic routing
Main Memory: Eight 4 GB DRAMs, 80-cycle access latency, 8 on-chip memory controllers (MCs), 4 DDR channels providing 16 GB/s per MC, up to 16 outstanding requests for each core per MC

Table 1. Processor configuration

4. Methodology

We describe three main aspects of our simulation infrastructure. First, we describe our cycle-level performance simulator, which models a 256-core processor with single-network and multiple-network configurations. Second, we describe our router power model. Finally, we describe our SPICE simulations for determining the power and latency of components used in power gating.

4.1 Multi-Core Simulator

We use a cycle-level trace-driven multi-core simulator [10] to evaluate the performance and power of the 256-core processor described in Section 2. We use a front-end functional simulator based on Pin [23] to collect instruction traces from applications, which are then fed into a back-end cycle-level simulator. Configurations for the processor core model, two-level caches, DRAM, and memory controllers are listed in Table 1.

Our cycle-level simulator is integrated with a detailed packet-switched on-chip network model. Our network model consists of a state-of-the-art two-stage, wormhole-switched, 5-port router microarchitecture [24]. The routers operate at 2 GHz, and we assume link widths of 512 bits for Single-NoC and 128 bits for
References

[1] D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu, “Energy proportional datacenter networks,” in ISCA, 2010.
[2] J. D. Balfour and W. J. Dally, “Design tradeoffs for tiled CMP on-chip networks,” in ICS, 2006.
[3] L. A. Barroso and U. Hölzle, “The case for energy-proportional computing,” IEEE Computer, 2007.
[4] E. Baydal, P. Lopez, and J. Duato, “A family of mechanisms for congestion control in wormhole networks,” IEEE Trans. Parallel Distrib. Syst., 2005.
[5] S. Borkar, “Design challenges of technology scaling,” Micro, IEEE, 1999.
[6] S. Borkar, “Thousand core chips: a technology perspective,” in DAC-44, 2007.
[7] J. Camacho and J. Flich, “HPC-mesh: A homogeneous parallel concentrated mesh for fault-tolerance and energy savings,” in ANCS-7, 2011.
[8] L. Chen and T. M. Pinkston, “NoRD: Node-router decoupling for effective power-gating of on-chip routers,” in MICRO-45, 2012.
[9] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003.
[10] R. Das, O. Mutlu, T. Moscibroda, and C. Das, “Application-Aware Prioritization Mechanisms for On-Chip Networks,” in MICRO-42, 2009.
[11] X. Fan, W.-D. Weber, and L. A. Barroso, “Power provisioning for a warehouse-sized computer,” in ISCA, 2007.
[12] M. Galles, “Scalable pipelined interconnect for distributed endpoint routing: the SGI SPIDER chip,” in Symposium on High Performance Interconnects (Hot Interconnects), 1996, pp. 141–146.
[13] P. Gratz, B. Grot, and S. W. Keckler, “Regional congestion awareness for load balance in networks-on-chip,” in HPCA-16, 2008.
[14] M. Hayenga, D. Johnson, and M. H. Lipasti, “Pitfalls of ORION-based simulation,” in WDDD-10, 2010.
[15] J. Howard et al., “A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS,” in ISSCC, 2010.
[16] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. V. Zyuban, H. M. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in ISLPED, 2004.
[17] N. E. Jerger and L. S. Peh, On-Chip Networks, Synthesis Lecture in Computer Architecture. Morgan and Claypool Publishers, 2003.
[18] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi, “Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration,” in DATE, 2009.
[19] J. Kim, J. Balfour, and W. Dally, “Flattened butterfly topology for on-chip networks,” MICRO-40, 2007.
[20] H. Matsutani, M. Koibuchi, D. Ikebuchi, K. Usami, H. Nakamura, and H. Amano, “Performance, area, and power evaluations of ultrafine-grained run-time power-gating routers for CMPs,” in Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 2011.
[21] H. Matsutani, M. Koibuchi, H. Amano, and D. Wang, “Run-time power gating of on-chip routers using look-ahead routing,” in ASP-DAC, 2008.
[22] D. Meisner, B. T. Gold, and T. F. Wenisch, “PowerNap: eliminating server idle power,” in ASPLOS, 2009.
[23] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, “Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation,” in MICRO-37, 2004.
[24] L.-S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” in Proceedings of the 7th International Symposium on High-Performance Computer Architecture, 2001.
[25] A. Samih, R. Wang, A. Krishna, C. Maciocco, C. Tai, and Y. Solihin, “Energy-efficient interconnect via router parking,” in HPCA-19, 2013.
[26] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, “Exploiting ILP, TLP, and DLP with The Polymorphous TRIPS Architecture,” in ISCA-30, 2003.
[27] M. B. Taylor, J. S. Kim, J. E. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. P. Amarasinghe, and A. Agarwal, “The Raw microprocessor: A computational fabric for software circuits and general-purpose programs,” IEEE Micro, 2002.
[28] M. Thottethodi, A. R. Lebeck, and S. S. Mukherjee, “Self-tuned congestion control for multiprocessor networks,” in HPCA-7, 2001.
[29] S. Volos, C. Seiculescu, B. Grot, N. Pour, B. Falsafi, and G. De Micheli, “CCNoC: Specializing on-chip interconnects for energy efficiency in cache-coherent servers,” in NOCS-6, 2012.
[30] H. Wang, L.-S. Peh, and S. Malik, “Power-driven design of router microarchitectures in on-chip networks,” in MICRO, 2003.
[31] H. Wang, L.-S. Peh, and S. Malik, “A power model for routers: Modeling Alpha 21364 and InfiniBand routers,” IEEE Micro, 2003.
[32] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. B. III, and A. Agarwal, “On-chip interconnection architecture of the tile processor,” IEEE Micro, 2007.