Chapter Designing Crossbar Based Systems Over the las - PDF document

Uploaded On 2015-05-23

Today, state-of-the-art bus based systems, such as the AMBA AXI [2] or the STBUS platform [3], support the instantiation of crossbar matrices, where multiple buses operate in parallel, providing a high bandwidth communication infrastructure. While methodologie

Chapter 2
Designing Crossbar Based Systems

Over the last decade, the communication architecture of SoCs has evolved from single shared bus systems to multi-bus systems. Today, state-of-the-art bus based systems, such as the AMBA AXI [2] or the STBUS platform [3], support the instantiation of crossbar matrices, where multiple buses operate in parallel, providing a high bandwidth communication infrastructure.

Fig. 2.5 The crossbar synthesis phase

floorplan of the synthesized design is performed. Floorplanning is the process of determining the exact 2D positions of the different cores and the switch matrix in the design. For obtaining the floorplan, we use Parquet [92], a fast and accurate floorplanner that minimizes the design area as well as the average wire-length. As the cores in the MPSoC are usually predesigned hardware blocks, we realistically assume that the size of each core (either the width and height or the aspect ratio and area) is provided as an input to the synthesis process.

From the floorplan of the design, the length of the wires (based on the Manhattan distance), and hence the power consumption on the wires, are obtained. In the next step, for the chosen frequency point, the wire-lengths are checked to see whether the maximum wire-length exceeds the length that the data can traverse in a single clock cycle. Then, from the switch matrix power consumption and the wire power consumption, the power consumption of the synthesized communication architecture is obtained. From the set of generated crossbar architectures for each architectural design point, the most power efficient architecture that satisfies the performance and timing constraints is chosen.
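The last steps of the synthesis phase (Manhattan wire lengths, the single-cycle timing check, and selection by total power) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Candidate` class, the linear wire power model (`wire_power_mw_per_mm`), the units and the sample numbers are all assumptions made for illustration, and every candidate is assumed to already satisfy the performance constraints.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for one synthesized crossbar design point."""
    name: str
    switch_power_mw: float  # switch matrix power (assumed given)
    wires: list             # [((x1, y1), (x2, y2)), ...] endpoints in mm


def manhattan(p, q):
    """Manhattan distance between two floorplan positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def total_power(candidate, max_len_per_cycle_mm, wire_power_mw_per_mm):
    """Return switch + wire power, or None if the timing check fails.

    Wire power is modeled as proportional to wire length; this linear
    model is an assumption of the sketch, not taken from the chapter.
    """
    lengths = [manhattan(p, q) for p, q in candidate.wires]
    # Timing check: no wire may be longer than one clock cycle can cover.
    if max(lengths, default=0.0) > max_len_per_cycle_mm:
        return None
    return candidate.switch_power_mw + wire_power_mw_per_mm * sum(lengths)


def pick_most_power_efficient(candidates, max_len_per_cycle_mm,
                              wire_power_mw_per_mm):
    """Among candidates that pass timing, pick the one with lowest power."""
    best = None
    for c in candidates:
        power = total_power(c, max_len_per_cycle_mm, wire_power_mw_per_mm)
        if power is not None and (best is None or power < best[1]):
            best = (c.name, power)
    return best


# Made-up design points: the second has one wire that is too long.
candidates = [
    Candidate("3x2", 10.0, [((0, 0), (2, 1)), ((1, 1), (3, 1))]),
    Candidate("2x2", 8.0, [((0, 0), (4, 3))]),
]
best = pick_most_power_efficient(candidates, 5.0, 1.0)
print(best)
```

In this made-up example the smaller crossbar has lower switch power but is rejected because its single wire (length 7) exceeds the assumed per-cycle limit of 5.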
2.3.2 Exact Crossbar Synthesis Algorithm

The exact algorithm for the crossbar design has two major steps: the first is to find the best crossbar configuration that satisfies the performance constraints (that were presented in the above subsection) and the second is to find the optimal binding of the cores to the chosen crossbar configuration. In order to find the best crossbar configuration, we vary the number of buses in the design, from the maximum number (equal to the number of cores in the design, modeling a full crossbar) to one (modeling a single shared bus), in a binary search manner. For each configuration of bus count, we check whether a feasible solution that satisfies the constraints of the ILP (formed by the set of inequalities from equations () to ()) exists. Once the minimum number of buses has been identified from applying the ILP, possibly multiple times, the buses used by the masters and slaves of the design are separated, thereby generating the optimal crossbar configuration.

Once the best crossbar configuration is obtained, in the next step the optimal binding of the cores onto the buses of the crossbar is obtained. A binding of cores to the buses that minimizes the amount of overlap of traffic on each bus will result in lower average and peak latency for data transfer. For this, the above ILP is solved with the objective of reducing the maximum overlap on each bus (the maximum overlap over all the buses is represented by the variable maxov), while satisfying the performance constraints, as follows:

    min maxov
    s.t. the traffic overlap summed over all core pairs i, j bound to bus k is at most maxov, for every bus k,
    and subject to equations () to ()                                        (2.9)

By splitting the problem into two ILPs, the execution time of the algorithm is reduced, as solving ILP1 for the feasibility check is usually faster than solving ILP2 with its objective function and additional constraints. The ILPs are solved using the CPLEX package [59].

2.4 Heuristic Approach to Crossbar Synthesis

As the exact ILP approach is not scalable to large problem instances, either when the number of cores in the design is large or when the number of simulation windows used for analysis is large, in this section we present a fast and efficient heuristic approach for crossbar synthesis. The problem of assigning cores to the minimum number of buses, subject to the performance constraints, is a special instance of the general problem of constrained bin-packing [60]. There are several efficient heuristics that have been developed for the bin-packing problem [60]. In this work, we use an approach that is based on the first-fit heuristic for bin-packing. We chose this heuristic for several reasons. When the performance constraints are removed, the heuristic procedure is theoretically guaranteed to provide solutions that are within two times the optimum solution that would be obtained by an exact algorithm [60]. Practically, we found that the solutions obtained by the heuristic are close to the optimum solution possible for experiments on several SoC benchmarks. Moreover, the heuristics are relatively simple to implement and have a very low run-time complexity, making the approach scalable to large designs and allowing the use of a large number of simulation windows for analysis.

The heuristic algorithm for crossbar synthesis is presented in Algorithm. In the first step of the algorithm, the bandwidth available in each simulation window is calculated. In the next step, all the cores are initialized as unmapped, as they are yet to be mapped onto buses. Then the number of buses in the crossbar is initialized to zero (step 5). In steps 6 to 25, the assignment of the cores onto the buses of the crossbar is performed. The basic approach used is the following: we try to map as many cores as possible onto a single bus. While mapping the cores, from the set of all cores that satisfy the bandwidth and conflict constraints, we choose the one that minimizes the pair-wise traffic overlap with the cores that have already been mapped onto the current bus. When no more cores can be assigned to the current bus, either because the bandwidth of the bus in any of the simulation windows has been saturated, or because of conflicts with the cores already mapped onto the bus, a new bus is instantiated. The process is repeated until all the cores in the design have been mapped onto a bus. From the resulting number of buses, the buses onto which masters are attached and those onto which the slaves are attached are separated. From this, the efficient crossbar configuration for the design is obtained.

Example 1 Let us consider a small example with 5 cores, with 3 of them being masters and the rest being slaves. For illustrative purposes, let us assume that two simulation windows are used for analysis (although in real systems usually several thousand windows are used). The communication traffic rates for each of the cores (in MB/s) for the two simulation windows are presented in Table and the amount of traffic overlap between the different cores over all the windows is presented in Table. Let us assume that the current frequency design point is 100 MHz and the bus width is 32 bits, which are automatically tuned by the crossbar synthesis procedure (as presented in Figure). In the first step of the heuristic algorithm, the bandwidth of the bus in each simulation window is calculated to be 400 MB/s (frequency × data-width). Initially, a single bus is instantiated and core_0 is chosen to be mapped onto the bus, as it has the maximum bandwidth requirements of the different cores, across all the simulation windows (see Figure). Then, from the set of all cores, those cores that satisfy the bandwidth and conflict constraints are chosen. As cores that are masters and slaves are not allowed to be mapped onto the same bus (specified as part of the conflict constraints), the set of assignable cores to the bus are core_1 and core_2. From these two, core_2 is

Fig. 2.6 Example application of the heuristic algorithm

Table 2.3 Amount of traffic overlap between cores (in MB/s) of example system

            core_0   core_1   core_2   core_3   core_4
    core_0           30       10
    core_1  30
    core_2  10

Fig. 2.7 Power consumption for different crossbar
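The heuristic of Section 2.4 and the start of Example 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the chapter's Algorithm. The per-window traffic rates are invented, because the traffic table is not reproduced here. Only the two overlap values that survive in Table 2.3 (core_0/core_1 = 30, core_0/core_2 = 10) are used, and all other overlaps are treated as 0. The only conflict constraint modeled is the master/slave separation. Seeding every new bus with the unmapped core of highest peak bandwidth is also an assumption, since the text states that rule only for the first bus.

```python
def bus_bandwidth_mb_s(freq_mhz, width_bits):
    """Bus bandwidth per window: frequency x data-width, in MB/s."""
    return freq_mhz * width_bits / 8


def synthesize_buses(traffic, overlap, is_master, capacity):
    """Greedy first-fit-style mapping of cores onto buses.

    traffic:   {core: [MB/s in window 0, MB/s in window 1, ...]}
    overlap:   {frozenset({a, b}): pair-wise traffic overlap}
    is_master: {core: bool}
    capacity:  bus bandwidth available in each window (MB/s)
    """
    unmapped = set(traffic)
    buses = []
    while unmapped:
        # Assumption: seed each bus with the highest-peak unmapped core.
        seed = max(sorted(unmapped), key=lambda c: max(traffic[c]))
        unmapped.remove(seed)
        bus, load = [seed], list(traffic[seed])
        while True:
            # Cores that satisfy the conflict (master/slave) and the
            # per-window bandwidth constraints on the current bus.
            fits = [c for c in sorted(unmapped)
                    if is_master[c] == is_master[seed]
                    and all(l + t <= capacity
                            for l, t in zip(load, traffic[c]))]
            if not fits:
                break  # bus saturated or only conflicts left: open a new bus
            # Choose the core with the least overlap with the cores on the bus.
            pick = min(fits, key=lambda c: sum(
                overlap.get(frozenset((c, m)), 0) for m in bus))
            unmapped.remove(pick)
            bus.append(pick)
            load = [l + t for l, t in zip(load, traffic[pick])]
        buses.append(bus)
    return buses


capacity = bus_bandwidth_mb_s(100, 32)  # 100 MHz, 32-bit bus

# Hypothetical traffic rates (MB/s) for two simulation windows.
traffic = {
    "core_0": [250, 150], "core_1": [100, 220], "core_2": [150, 100],
    "core_3": [180, 120], "core_4": [120, 200],
}
is_master = {"core_0": True, "core_1": True, "core_2": True,
             "core_3": False, "core_4": False}
# The two overlap values that survive in Table 2.3.
overlap = {frozenset(("core_0", "core_1")): 30,
           frozenset(("core_0", "core_2")): 10}

buses = synthesize_buses(traffic, overlap, is_master, capacity)
print(capacity, buses)
```

With these made-up rates, core_0 seeds the first bus and core_2 joins it in preference to core_1 because of its lower overlap with core_0, after which the first window is saturated and a new bus is opened. Separating the resulting buses by core type gives two master-side buses and one slave-side bus.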
configuration is required to satisfy the bandwidth constraints. A larger crossbar configuration usually also leads to an increased wiring complexity. These two factors coupled together result in larger power consumption for the communication architecture. At very high operating frequencies, the power consumption of the communication architecture is higher, as the power consumption increases linearly with the operating frequency of the system. For the IMP2 design, the crossbar architecture with lowest power consumption is obtained at 400 MHz.

The synthesized crossbar architecture (a 5×6 crossbar) for the IMP2 design is presented in Figure. In order to satisfy the window bandwidth constraints, only a few of the cores can share a single bus, and thus each of the buses used in the crossbar has at most 2 cores attached to it. The bindings are such that the cores with highly overlapping streams are placed on different buses. As a result, the designed crossbar has acceptable performance (in terms of average and maximum latency constraints) with 1 reduction in the number of buses used, when compared to a full crossbar. The floorplan of the IMP2 SoC with the designed crossbar, as obtained from the Parquet floorplanner, is presented in Figure.

The size and power consumption of the synthesized crossbar architectures for the different SoC designs and for full crossbar configurations are reported in Table. The power consumption of both the switch matrix and the crossbar bus wires is reported in the table. The methodology results in a large reduction in the crossbar architecture power consumption (453% on average) when compared to the traditional full crossbar based systems. The synthesized crossbar configurations also lead to a large reduction in the total length of the buses used in the design (38.0% on average, refer to Figure), as there are fewer buses in the design. Reducing wiring congestion is essential to have a faster physical design process and to achieve faster design closure.

The normalized average and maximum read/write transaction latencies (to read or write one data word) for the designs obtained using the methodology based on average traffic flows and using the proposed methodology (referred to as "slot" in the
same core) in the application to study its impact on the crossbar synthesis process. The typical burst size for the benchmark is initially set to 100 cycles. When the window size is much smaller than the burst size, the size of the crossbar generated is very close to that of a full crossbar (refer Figure). When the window size is around a few times that of the burst size (from 1–4 times), the synthesized crossbar has much smaller size (typically around 25%) and acceptable latencies (around ) of that of a full crossbar. For aggressive designs, the window size can be set closer to the burst size and for conservative designs (where larger transaction latencies can be tolerated), the window size can be set to a few times the typical burst size. The acceptable window sizes for various burst sizes are presented in Figure. It can be seen from the plot that the window size varies almost linearly with the burst size, consolidating the above arguments.

Fig. 2.13 Effect of window size on crossbar size and power consumption

Fig. 2.14 Burst vs. window

2.5.5 Real-Time Streams & Effect of Binding

In each simulation window, the critical traffic streams that require real-time guarantees are recorded. During the preprocessing step of the design flow (refer to Figure), the real-time traffic streams that overlap with each other in any window are identified. In order to provide real-time guarantees to such streams, the cores with critical streams that have temporal overlap are placed onto separate buses of the crossbar. Experimental results on the benchmark applications show a very low transaction latency (almost equal to the latency of perfect communication using a full crossbar) for such streams. Please note that in order to provide hard real-time guarantees, the underlying crossbar architecture should also provide support for having priorities for the different traffic streams, so that the real-time streams are given higher priorities over other streams. In many crossbar architectures, such as the STbus, such support is provided in the crossbar architecture by utilizing priority based arbitration mechanisms.

After finding the best crossbar configuration, we do an optimal binding of the cores onto the buses of the crossbar, minimizing the total overlap on each bus. By minimizing the overlap on each bus, the transaction latencies reduce significantly. To illustrate this effect, we compare the crossbars designed using the proposed approach with two binding schemes: random binding of cores onto the buses, satisfying the design constraints (equations ()), and optimal binding that minimizes overlap on each bus, satisfying the design constraints. The average latency incurred by the random binding scheme for the benchmark applications was on average 2 higher than that incurred by the optimal binding scheme.

2.5.6 Overlap Threshold Setting

By varying the two parameters, window size and overlap threshold, the crossbar can be designed such that the average and the maximum transaction latencies incurred in the design are acceptable. The effect of the overlap threshold parameter on the size and power consumption of the crossbar generated for the synthetic benchmark is presented in Figures (a) and (b). The crossbar size and power numbers are normalized with respect to the case when the overlap threshold is set to 0%, which leads to a full crossbar configuration (as no two cores can share a bus in this case). The plots end at 50% overlap between cores because, if the pair-wise overlap between two cores exceeds 50% of the window size (in any of the windows), then the window bandwidth constraints cannot be satisfied. So, the maximum value of the overlap parameter can be set at 50% of the window size. This will also speed up the process of finding the best crossbar configuration, as such overlapping cores will be identified in the preprocessing phase (refer to Figure) and will be forbidden to be on the same bus of the crossbar. From experiments, we found that for aggressive designs (where there are tight requirements on the maximum latencies) the threshold can be set to around 10% and for conservative designs, the threshold can be set to 30%–40% of the window size.
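The preprocessing check described above, which forbids two cores from sharing a bus when their pair-wise overlap in any window reaches the threshold, can be sketched in Python. This is an illustrative sketch under assumptions: the representation of each core's activity as a set of busy cycle indices per window, the function name `forbidden_pairs` and the sample data are hypothetical. The comparison uses "greater than or equal" so that a 0% threshold forbids all sharing (a full crossbar), which matches the behaviour described in the text; the exact comparison used by the authors is not stated.

```python
from itertools import combinations


def forbidden_pairs(busy, window_size, threshold_percent):
    """Return the core pairs that may not be placed on the same bus.

    busy:              {core: [set of busy cycle indices, one set per window]}
    window_size:       window length in cycles
    threshold_percent: overlap threshold as a percentage of the window size
    """
    # The text caps the useful threshold at 50% of the window size.
    if not 0 <= threshold_percent <= 50:
        raise ValueError("threshold must lie between 0% and 50%")
    pairs = set()
    for a, b in combinations(sorted(busy), 2):
        for win_a, win_b in zip(busy[a], busy[b]):
            overlap_cycles = len(win_a & win_b)
            # Integer comparison avoids floating-point rounding issues.
            if 100 * overlap_cycles >= threshold_percent * window_size:
                pairs.add((a, b))
                break  # one offending window is enough
    return pairs


# Hypothetical activity traces for two windows of 100 cycles each.
busy = {
    "core_a": [set(range(0, 60)), set(range(0, 10))],
    "core_b": [set(range(40, 100)), set(range(50, 60))],
    "core_c": [set(range(0, 30)), set(range(0, 5))],
}

aggressive = forbidden_pairs(busy, 100, 10)  # tight latency requirements
relaxed = forbidden_pairs(busy, 100, 25)
print(sorted(aggressive), sorted(relaxed))
```

In this made-up trace, the 10% setting separates core_a from both other cores, while the 25% setting separates only core_a and core_c, so a looser threshold leaves more freedom to share buses.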