Today, state-of-the-art bus based systems, such as the AMBA AXI [2] or the STBUS platform [3], support the instantiation of crossbar matrices, where multiple buses operate in parallel, providing a high bandwidth communication infrastructure. While methodologie
Chapter 2 Designing Crossbar Based Systems

Over the last decade, the communication architecture of SoCs has evolved from single shared bus systems to multi-bus systems. Today, state-of-the-art bus based systems, such as the AMBA AXI [2] or the STBUS platform [3] supports the in-

Fig. 2.5 The crossbar synthesis phase

floorplan of the synthesized design is performed. Floorplanning is the process of determining the exact 2D positions of the different cores and the switch matrix in the design. For obtaining the floorplan, we use Parquet [92], a fast and accurate floorplanner that minimizes the design area as well as the average wire-length. As the cores in the MPSoC are usually predesigned hardware blocks, we realistically assume that the size of the cores (either the width and height or the aspect ratio and area) are provided as an input to the synthesis process. From the floorplan of the design, the length of the wires (based on the Manhattan distance), and hence the power consumption on the wires are obtained. In the next step, for the chosen frequency point, the wire-lengths are checked to see whether the maximum wire-length exceeds the length that the data can traverse in a single clock cycle. In the next step, from the switch matrix power consumption and the wire power consumption, the power consumption of the synthesized communication architecture is obtained. From the set of generated crossbar architectures for each architectural design point, the most power efficient architecture that satisfies the performance and timing constraints is chosen.
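The wire-length check described above can be sketched as follows. This is only an illustration under stated assumptions: the `Block` type, the use of block centre coordinates, the idea that each core has one wire to the switch matrix, and the `max_len_per_cycle` parameter are all hypothetical names and simplifications that do not come from the text. In the actual flow the positions come from the Parquet floorplanner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A placed block; (x, y) is assumed to be its centre, in mm."""
    name: str
    x: float
    y: float


def manhattan(a: Block, b: Block) -> float:
    """Manhattan distance between two placed blocks."""
    return abs(a.x - b.x) + abs(a.y - b.y)


def wire_lengths(cores: list[Block], switch: Block) -> dict[str, float]:
    """Length of the wire from every core to the switch matrix (assumed topology)."""
    return {core.name: manhattan(core, switch) for core in cores}


def meets_timing(lengths: dict[str, float], max_len_per_cycle: float) -> bool:
    """True if the longest wire can be traversed in a single clock cycle."""
    return max(lengths.values()) <= max_len_per_cycle


# Hypothetical placement, for illustration only.
switch = Block("switch_matrix", 1.0, 1.0)
cores = [Block("core_0", 0.0, 0.0), Block("core_1", 3.0, 4.0)]
lengths = wire_lengths(cores, switch)
print(lengths, meets_timing(lengths, max_len_per_cycle=5.0))
```

A frequency point whose floorplan fails this check would be discarded before the power comparison between the remaining design points.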
2.3.2 Exact Crossbar Synthesis Algorithm

The exact algorithm for the crossbar design has two major steps: the first is to find the best crossbar configuration that satisfies the performance constraints (that were presented in the above subsection) and the second step is to find the optimal binding of the cores to the chosen crossbar configuration. In order to find the best crossbar configuration, we vary the number of buses in the design, from the maximum number (equal to the number of cores in the design, modeling a full crossbar) to one (modeling a single shared bus), in a binary search manner. For each configuration of bus count, we check whether a feasible solution that satisfies the constraints of the ILP (formed by the set of inequalities from equations () to ()) exists. Once the minimum number of buses has been identified from applying the ILP, possibly multiple times, the buses used by the masters and slaves of the design are separated, thereby generating the optimal crossbar configuration.

Once the best crossbar configuration is obtained, in the next step, the optimal binding of the cores onto buses of the crossbar is obtained. A binding of cores to the buses that minimizes the amount of overlap of traffic on each bus will result in lower average and peak latency for data transfer. For this, the above ILP is solved with the objective of reducing the maximum overlap on each of the buses (the maximum overlap over all the buses is represented by the variable maxov), and satisfying the performance constraints as follows:

    minimize maxov
    such that, on every bus k, the traffic overlap summed over all pairs of cores (i, j) bound to that bus is at most maxov,
    and subject to equations () to ()    (2.9)

By splitting the problem into two ILPs, the execution time of the algorithm is reduced, as solving ILP1 for feasibility check is usually faster than solving the ILP2 with objective function and additional constraints. The ILPs are solved using the CPLEX package [59].

2.4 Heuristic Approach to Crossbar Synthesis

As the exact ILP approach is not scalable to large problem instances, either when the number of cores in the design is large or when the number of simulation windows used for analysis is large, in this section we present a fast and efficient heuristic approach for crossbar synthesis. The problem of assigning cores to the minimum number of buses, subject to the performance constraints, is a special instance of the general problem of constrained bin-packing [60]. There are several efficient heuristics that have been developed for the bin-packing problem [60]. In this work, we use an approach that is based on the first-fit heuristic to bin-packing. We chose this heuristic for several reasons. When the performance constraints are removed, the heuristic procedure is theoretically guaranteed to provide solutions that are within two times the optimum solution that would be obtained by an exact algorithm [60]. Practically, we found that the solutions obtained by the heuristic are close to the optimum solution possible for experiments on several SoC benchmarks. Moreover, the heuristics are relatively simple to implement and have a very low run-time complexity, making the approach scalable to large designs and allowing the use of a large number of simulation windows for analysis.

The heuristic algorithm for crossbar synthesis is presented in Algorithm. In the first step of the algorithm, the bandwidth available in each simulation window is calculated. In the next step, all the cores are initialized as unmapped, as they are yet to be mapped onto buses. Then the number of buses in the crossbar is initialized to zero (step 5). In steps 6 to 25, the assignment of the cores onto the buses of the crossbar is performed. The basic approach used is the following: We try to map as many cores as possible onto a single bus. While mapping the cores, from the set of all cores that satisfy the bandwidth and conflict constraints, we choose the one that minimizes the pair-wise traffic overlap with the cores that have been already mapped onto the current bus. When no more cores can be assigned to the current bus, either because the bandwidth of the bus in any of the simulation windows has been saturated, or because of conflicts with the cores already mapped onto the bus, a new bus is instantiated. The process is repeated until all the cores in the design have been mapped onto a bus. From the resulting number of buses, the buses onto which masters are attached and those onto which the slaves are attached are separated. From this, the efficient crossbar configuration for the design is obtained.

Example 1 Let us consider a small example with 5 cores, with 3 of them being masters and the rest being slaves. For illustrative purposes, let us assume that two simulation windows are used for analysis (although in real systems usually several thousand windows are used). The communication traffic rates for each of the cores (in MB/s) for the two simulation windows are presented in Table and the amount of traffic overlap between the different cores over all the windows is presented in Table 2.3. Let us assume that the current frequency design point is 100 MHz and the bus width is 32 bits, which are automatically tuned by the crossbar synthesis procedure (as presented in Figure). In the first step of the heuristic algorithm, the bandwidth of the bus in each simulation window is calculated to be 400 MB/s (frequency × data-width). Initially, a single bus is instantiated and core_0 is chosen to be mapped onto the bus, as it has the maximum bandwidth requirements of the different cores, across all the simulation windows (see Figure). Then from the set of all cores, those cores that satisfy the bandwidth and conflict constraints are chosen. As cores that are masters and slaves are not allowed to be mapped onto the same bus (specified as part of the conflict constraints), the set of assignable cores to the bus are core_1 and core_2. From these two, core_2 is

Fig. 2.6 Example application of the heuristic algorithm

Table 2.3 Amount of traffic overlap between cores (in MB/s) of example system
Columns: core_0, core_1, core_2, core_3, core_4
core_0: 30, 10
core_1: 30
core_2: 10, 27

Fig. 2.7 Power consumption for different crossbar
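The greedy mapping described in Section 2.4 can be sketched as below. This is a minimal sketch and not the book's Algorithm: the function names are hypothetical, the per-window traffic rates are invented for illustration (the table with the real rates is not available here), and two rules are assumptions, namely that every new bus is seeded with the unmapped core of highest peak rate and that candidates are ranked by their summed overlap with the cores already on the bus. Only the 100 MHz and 32 bit design point and the overlap values 30 and 10 for core_0 are taken from Example 1. Assigning the value 27 to the pair core_1 and core_2 is a guess at where it sits in the damaged table.

```python
def bus_bandwidth_mb_s(freq_mhz: float, width_bits: int) -> float:
    """Bus bandwidth per window: frequency x data-width (MHz x bytes = MB/s)."""
    return freq_mhz * width_bits / 8


def fits(core, bus, traffic, is_master, bus_bw):
    """Conflict constraint (no master/slave mix) and bandwidth constraint in every window."""
    if any(is_master[core] != is_master[m] for m in bus):
        return False
    windows = range(len(traffic[core]))
    return all(
        sum(traffic[m][w] for m in bus) + traffic[core][w] <= bus_bw
        for w in windows
    )


def pair_overlap(a, b, overlap):
    """Pair-wise traffic overlap; pairs missing from the table count as 0."""
    return overlap.get(frozenset((a, b)), 0)


def map_cores(traffic, overlap, is_master, bus_bw):
    """Fill one bus at a time, always adding the feasible core with least overlap."""
    unmapped = sorted(traffic)
    buses = []
    while unmapped:
        # Assumed seeding rule: the unmapped core with the highest peak rate.
        # The seed is placed even if it alone exceeds bus_bw.
        seed = max(unmapped, key=lambda c: max(traffic[c]))
        unmapped.remove(seed)
        bus = [seed]
        while True:
            candidates = [c for c in unmapped
                          if fits(c, bus, traffic, is_master, bus_bw)]
            if not candidates:
                break  # bus saturated or only conflicting cores left
            best = min(candidates,
                       key=lambda c: sum(pair_overlap(c, m, overlap) for m in bus))
            unmapped.remove(best)
            bus.append(best)
        buses.append(bus)
    return buses


# Hypothetical traffic rates (MB/s) in two windows; cores 0-2 are masters.
traffic = {"core_0": [300, 150], "core_1": [100, 150], "core_2": [100, 50],
           "core_3": [250, 100], "core_4": [100, 200]}
is_master = {"core_0": True, "core_1": True, "core_2": True,
             "core_3": False, "core_4": False}
overlap = {frozenset(("core_0", "core_1")): 30,
           frozenset(("core_0", "core_2")): 10,
           frozenset(("core_1", "core_2")): 27}  # placement of 27 is a guess

buses = map_cores(traffic, overlap, is_master, bus_bandwidth_mb_s(100, 32))
master_buses = [b for b in buses if is_master[b[0]]]
slave_buses = [b for b in buses if not is_master[b[0]]]
print(buses, len(master_buses), len(slave_buses))
```

With these invented rates, core_2 joins core_0 on the first bus because its overlap with core_0 (10) is lower than that of core_1 (30), which matches the choice between core_1 and core_2 that the example begins to describe. Separating the resulting buses into master-side and slave-side buses gives the two dimensions of the crossbar.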
configuration is required to satisfy the bandwidth constraints. A larger crossbar configuration usually also leads to an increased wiring complexity. These two factors coupled together result in larger power consumption for the communication architecture. At very high operating frequencies, the power consumption of the communication architecture is higher, as the power consumption increases linearly with the operating frequency of the system. For the IMP2 design, the crossbar architecture with lowest power consumption is obtained at 400 MHz.

The synthesized crossbar architecture (a 5×6 crossbar) for the IMP2 design is presented in Figure. In order to satisfy the window bandwidth constraints, only few of the cores can share a single bus, and thus each of the buses used in the crossbar has at most 2 cores attached to it. The bindings are such that the cores with highly overlapping streams are placed on different buses. As a result, the designed crossbar has acceptable performance (in terms of average and maximum latency constraints) with 1[...] reduction in the number of buses used, when compared to a full crossbar. The floorplan of the IMP2 SoC with the designed crossbar, as obtained from the Parquet floorplanner, is presented in Figure.

The size and power consumption of the synthesized crossbar architectures for the different SoC designs and for full crossbar configurations are reported in Table. The power consumption of both the switch matrix and the crossbar bus wires is reported in the table. The methodology results in a large reduction in the crossbar architecture power consumption (45.3% on average) when compared to the traditional full crossbar based systems. The synthesized crossbar configurations also lead to a large reduction in the total length of the buses used in the design (38.0% on average, refer to Figure), as there are fewer buses in the design. Reducing wiring congestion is essential to have a faster physical design process and to achieve faster design closure. The normalized average and maximum read/write transaction latencies (to read or write one data word) for the designs obtained using the methodology based on average traffic flows and using the proposed methodology (referred to as "slot" in the
same core) in the application to study its impact on the crossbar synthesis process. The typical burst size for the benchmark is initially set to 100 cycles. When the window size is much smaller than the burst size, the size of the crossbar generated is very close to that of a full crossbar (refer to Figure 2.13). When the window size is around a few times that of the burst size (from 1–4 times), the synthesized crossbar has a much smaller size (typically around 25%) and acceptable latencies (around [...]) of that of a full crossbar. For aggressive designs, the window size can be set closer to the burst size and for conservative designs (where larger transaction latencies can be tolerated), the window size can be set to a few times the typical burst size. The acceptable window sizes for various burst sizes are presented in Figure 2.14. It can be seen from the plot that the window size varies almost linearly with the burst size, consolidating the above arguments.

Fig. 2.13 Effect of window size on crossbar size and power consumption

Fig. 2.14 Burst vs. window

2.5.5 Real-Time Streams & Effect of Binding

In each simulation window, the critical traffic streams that require real-time guarantees are recorded. During the preprocessing step of the design flow (refer to Figure), the real-time traffic streams that overlap with each other in any window are identified. In order to provide real-time guarantees to such streams, the cores with critical streams that have temporal overlap are placed onto separate buses of the crossbar. Experimental results on the benchmark applications show a very low transaction latency (almost equal to the latency of perfect communication using a full crossbar) for such streams. Please note that in order to provide hard real-time guarantees, the underlying crossbar architecture should also provide support for having priorities for the different traffic streams, so that the real-time streams are given higher priorities over other streams. In many crossbar architectures, such as the STbus, such support is provided in the crossbar architecture by utilizing priority based arbitration mechanisms.

After finding the best crossbar configuration, we do an optimal binding of the cores onto the buses of the crossbar, minimizing the total overlap on each bus. By minimizing the overlap on each bus, the transaction latencies reduce significantly. To illustrate this effect, we compare the crossbars designed using the proposed approach with two binding schemes: random binding of cores onto the buses, satisfying the design constraints (equations ()), and optimal binding that minimizes overlap on each bus, satisfying the design constraints. The latency incurred by the random binding scheme for the benchmark applications was on average 2× higher than that incurred by the optimal binding scheme.

2.5.6 Overlap Threshold Setting

By varying the two parameters, window size and overlap threshold, the crossbar can be designed such that the average and the maximum transaction latencies incurred in the design are acceptable. The effect of the overlap threshold parameter on the size and power consumption of the crossbar generated for the synthetic benchmark is presented in Figures (a) and (b). The crossbar size and power numbers are normalized with respect to the case when the overlap threshold is set to 0%, which leads to a full crossbar configuration (as no two cores can share a bus in this case). The plots end at 50% overlap between cores because, if the pair-wise overlap between two cores exceeds 50% of the window size (in any of the windows), then the window bandwidth constraints cannot be satisfied. So, the maximum value of the overlap parameter can be set at 50% of the window size. This will also speed up the process of finding the best crossbar configuration, as such overlapping cores will be identified in the preprocessing phase (refer to Figure) and will be forbidden to be on the same bus of the crossbar. From experiments, we found that for aggressive designs (where there are tight requirements on the maximum latencies) the threshold can be set to around 10% and for conservative designs, the threshold can be set to 30%–40% of the window size.
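The preprocessing step that applies the overlap threshold can be sketched as follows. This is an illustration under assumptions, not the implementation used in the experiments: the function name, the data layout (overlap per core pair and per window, measured in cycles) and the strict "exceeds" comparison are all hypothetical choices. The 50% upper bound on the threshold is the one stated above.

```python
def forbidden_pairs(overlap_cycles, window_size, threshold_pct):
    """Return the core pairs that may not share a bus.

    overlap_cycles: {(core_a, core_b): [overlap in cycles, one entry per window]}
    window_size:    window length in cycles
    threshold_pct:  overlap threshold as a percentage of the window size (0 to 50)
    """
    if not 0 <= threshold_pct <= 50:
        raise ValueError("overlap threshold must lie between 0% and 50%")
    limit = threshold_pct * window_size / 100
    return {
        frozenset(pair)
        for pair, per_window in overlap_cycles.items()
        # A pair is forbidden if it exceeds the limit in any single window.
        if any(cycles > limit for cycles in per_window)
    }


# Hypothetical overlaps for three cores over two windows of 100 cycles each.
overlaps = {("a", "b"): [5, 40], ("a", "c"): [0, 8], ("b", "c"): [12, 3]}
for pct in (0, 10, 30):
    print(pct, sorted(sorted(p) for p in forbidden_pairs(overlaps, 100, pct)))
```

In this sketch a threshold of 0% forbids every pair that overlaps at all, which for cores with any common activity degenerates to one core per bus, in line with the full crossbar case used for normalization. Raising the threshold removes pairs from the forbidden set and so allows smaller crossbars at the cost of higher latency.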