Document on Subject: "MaPU: A Novel Mathematical Computing Architecture" (Donglin Wang et al., CASIA, Beijing, China). Transcript:

MaPU: A Novel Mathematical Computing Architecture

Donglin Wang, Shaolin Xie, Zhiwei Zhang, Xueliang Du, Lei Wang, Zijun Liu, Xiao Lin, Jie Hao, Chen Lin, Hong Ma, Zhonghua Pu, Guangxin, Wenqin Sun, Fabiao, Weili Ren, Huijuan Wang, Mengchen Zhu, Lipeng Yang, Nuozhou Xiao, Qian Cui, Xingang Wang, Ruoshan Guo, Xiaoqin Wang (CASIA, Beijing, China); Leizu Yin (Spreadtrum Comm, Inc.); Tao Wang, Yongyong Yang (Huawei Tech Co, Ltd.)

ABSTRACT
As the feature size of the semiconductor process scales down to 10 nm and below, it becomes possible to assemble systems with high performance processors that can theoretically provide computational power of up to tens of PFLOPS. However, the power consumption of these systems is also rocketing up to tens of millions of watts, and the actual performance is only around 60% of the theoretical performance. Today, power efficiency and sustained performance have become the main foci of processor designers. Traditional computing architectures such as superscalar and GPGPU are proven to be power inefficient, and there is a big gap between the actual and peak performance. In this paper, we present the MaPU architecture, a novel architecture which is suitable for data-intensive computing with great power efficiency and sustained computation throughput. To achieve this goal, MaPU attempts to optimize the application from a system perspective, including the hardware, the algorithm and the corresponding program model. It uses an innovative multi-granularity parallel memory system with intrinsic shuffle ability, cascading pipelines with wide SIMD data paths and a state-machine-based program model. When executing typical signal processing algorithms, a single MaPU core implemented with a 40 nm process exhibits a sustained performance of 134 GFLOPS while consuming only 2.8 W of power, which increases the actual power efficiency by an order of magnitude compared with the traditional CPU and GPGPU.

1. INTRODUCTION
Today, power efficiency is a key factor for mobile computing. Great advances have been made to prolong battery life and provide greater computing abilities [1][2]. However, power efficiency is not only an important factor in mobile computing, but also a key metric in supercomputing. As Moore's Law is still effective, tremendous numbers of transistors can be used for building super processors like Intel's Xeon Phi co-processor and the Nvidia GPU. The Intel Xeon Phi 7120p co-processor was built with 61 cores, providing a theoretical peak performance of 2.4 TFLOPS. The latest Nvidia Kepler GK210 provides a theoretical peak performance of 4.37 TFLOPS. However, the peak power levels of these two chips are 300 W and 150 W, respectively, which means their power efficiency levels are only 8 GFLOPS/W and 29 GFLOPS/W, respectively. Furthermore, these figures represent their theoretical power efficiency; their actual power efficiency is even lower. As reported by Green500, which aims to provide a ranking of the most energy-efficient supercomputers in the world, the most power-efficient supercomputer achieves only 0.7 GFLOPS/W.
Although the power consumption of a single processor is not necessarily a disadvantage for indoor systems with a sustained power supply, the aggregate power of those supercomputers built with thousands of super processors would eventually limit the scale of the system, given the increasing cost of deployment and maintenance involved. For example, the most powerful supercomputer, Tianhe-2, consumes 24 MW (with its cooling system) and occupies 720 square meters of space. To build such a system, a dedicated computer complex and power station are needed.

Another problem with today's computing is the gap between peak performance and sustained performance [3]. Only a few improvements have been made to date [4]. The actual performance of the two most powerful supercomputers (ranked by Top500 in June 2015) is only 62% and 65% of the corresponding peak performance, respectively, even with a very structured algorithm such as LINPACK applied. As their sustained performance is much lower than their theoretical performance, the actual power efficiency of their processors is less than the theoretical power efficiency.

Many factors contribute to the performance gap [3], although the most important one is that compilers usually implicitly use simplified models of the processor architecture and do not take the detailed workings of each processor component into account, such as the memory hierarchy, SIMD instruction extensions, and zero-overhead loops. It has been reported that the mean usage for various GPU benchmarks is only 45%, and that the main sources of underuse are memory stalls, which are caused by memory access latency and inefficient access patterns [4]. As a result, most processors rely heavily on hand-optimized libraries to boost actual performance in many applications, which makes the compiler subsidiary in performance-critical programs.

Given the current abundance of chip transistors, many new architectures leverage various ASIC-based accelerators to increase the power efficiency and narrow the performance gap. However, these architectures are inflexible and require great effort to design a chip for specific applications. We aimed to construct a programmable accelerator architecture from a system perspective that can provide performance and power efficiency comparable with ASIC implementations and can be tailored by the programmer to specific workloads. The intuitive strategy we adopt is to map the mathematical representation of the computation kernel into massive reconfigurable computing logic, and map the data into a highly reconfigurable memory system. We call this architecture MaPU, which stands for Mathematical Processing Unit.

In this paper, we first discuss the considerations involved in designing MaPU. We then introduce the instruction set architecture of MaPU in Section 2. The highlights of the MaPU architecture are presented in Section 3. To prove the advantages of the MaPU architecture, a chip with four MaPU cores is designed, implemented and taped out with a 40-nm process. The structure, performance and power of this chip are fully analyzed in Section 4.

2. RETROSPECT AND OVERVIEW OF MAPU
It took us a long time to arrive at a feasible micro-architecture for MaPU. Traditional superscalar has proved to be inherently power inefficient [5], and GPGPU is also power hungry. As such, our work excluded both of those architectures. Strategies such as VLIW and SIMD were taken into consideration. The first proposed MaPU micro-architecture featured customized wide vector instructions in a RISC style and an innovative multi-granularity parallel (MGP) memory. This approach was discarded for its low efficiency. We then proposed a micro-architecture that included massive computing units with hardwired state machines, in which each computation kernel was represented by a state machine. This micro-architecture manifested high performance and power efficiency. However, as we tried to support more kernels, the state machine became so complex that the circuit could only run at a much lower frequency. The micro-architecture then evolved into the current one, in which the state machines were broken down into microcodes and became programmable. This provides the possibility of supporting various kernels with customized state machines.

As an accelerator framework, MaPU is made up of three main components: the microcode pipeline, the MGP memory and the scalar pipeline, as shown in Figure 1.
[Figure 1: MaPU architecture framework.]

The scalar pipeline is subsidiary and can be a simple RISC core or a VLIW DSP core. It is used to communicate with the system on chip (SoC) and to control the microcode pipeline. The Communication and Synchronization Unit (CSU) in the scalar pipeline includes a DMA controller used to transport high dimensional data to and from the SoC, and some control registers that can be read/written by other SoC masters to control or check the status of the MaPU core. Although MaPU is an accelerator architecture, we include this scalar pipeline to facilitate the interaction between the MaPU cores and the SoC. For example, the scalar pipeline includes exclusive load/store operation pairs to support multi-thread primitives such as atomic addition, spin lock and fork/join. Therefore, it would be easier to develop a MaPU runtime library to support a multi-core program framework such as OpenMPI, OpenMP and OpenCL.

2.1 Microcode pipeline
The microcode pipeline is the key component of the MaPU architecture. A functional unit (FU) can be an ALU, a MAC or any other module with special functions. Superscalar components such as register rename logic and instruction issue windows are power inefficient [5]. To eliminate the power consumed by control logic, the FUs in MaPU are controlled by microcodes in VLIW style. There is a highly structured forwarding matrix between the FUs, and its routing is controlled dynamically by microcodes. In this way, the FUs can cascade into a complete datapath which resembles the data flow of the algorithm. Data dependence and routing are handled by the program to further simplify the control logic.

This micro-architecture, which consists of massive FUs with a structured forwarding matrix, manifests high performance and power efficiency but leaves all of the complexity to the programmers. This is plausible because computation kernels are usually simple and structured, such as the FFT and matrix multiply algorithms. A library that includes common routines would decrease the complexity for programmers.

The microcode pipeline has many features in common with coarse grain reconfigurable architecture (CGRA), but with two enhancements. First, all of the FUs and the forwarding matrix in MaPU operate in the SIMD manner and have the same bit width, which is supposed to be wide.

Wider SIMD can amortize the power overhead of instruction fetch and dispatch, bringing more benefits in terms of energy efficiency. Second, the microcode pipeline has a highly coupled forwarding matrix instead of dedicated routing units. With this forwarding matrix, FUs can cascade into a compact datapath that resembles the data flow of the algorithm. Therefore, it can provide performance and power efficiency comparable with that of an ASIC. Furthermore, as each FU has a dedicated path to and from other specific FUs, programmers do not have to consider the data routing congestion problem, making instruction scheduling much simpler than in CGRA. These benefits come at the cost of extra wires, which occupy a considerable chip area. In fact, some implementations have to divide the forwarding matrix into two stages to decrease the connecting wires between FUs, and only part of the FUs may be connected, depending on the characteristics of the applications.

[Figure 2: Microcode line format. Each line carries one microcode per functional unit (FU0, FU1, FU2, ...) plus a microcode controlling the next PC address. An FU microcode, for example, tells the IALU to add the values in two registers and send the result to a SHU register; the controller microcodes Repeat and Loop repeat the current microcode line, or loop to a specific microcode line, a specific number of times.]

There are N microcodes issued in a clock period for an implementation with N FUs. The microcodes issued at the same time are called a microcode line; microcode lines are stored in the microcode memory and can be updated at run time. This coding schema is simple and effective. However, for most kernels, only a few FUs are working simultaneously most of the time; thus, there are NOPs in most microcode lines. A compression strategy can be used to increase the code density in the future.

The microcode line format is illustrated in Figure 2. Each microcode for an FU has two parts: "Operation" and "Result Destination." "Operation" tells the FU what should be done, and "Result Destination" tells the forwarding logic how to route the result. At the end of the line is the microcode for the controller, which takes charge of the microcode fetch and dispatch. There are two types of microcodes for the controller: repeat and loop. These two microcodes instruct the controller to repeat the current microcode line, or loop to a specific microcode line, a certain number of times. The MaPU architecture incorporates these two special microcodes to support the nested loop structures that are widespread in kernel applications.
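As a rough sketch of this line format (the field names, field widths and FU count below are illustrative assumptions for the sketch, not the chip's actual encoding), a microcode line can be modeled as one {operation, result destination} slot per FU plus the trailing controller field:

    /* Illustrative model of a microcode line as described for Figure 2.
     * Field names, widths and NUM_FU are assumptions of this sketch,
     * not the documented encoding of the real chip. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_FU 14                     /* one microcode slot per functional unit */

    enum ctrl_kind { CTRL_NONE, CTRL_REPEAT, CTRL_LOOP };

    struct fu_microcode {
        uint16_t operation;               /* what the FU should do this cycle             */
        uint16_t result_dest;             /* where the forwarding logic routes the result */
    };

    struct microcode_line {
        struct fu_microcode slot[NUM_FU]; /* issued to all FUs in the same clock cycle */
        enum ctrl_kind ctrl;              /* repeat this line or loop to another line  */
        uint16_t target_line;             /* loop target (ignored for repeat)          */
        uint16_t times;                   /* repeat/loop count                         */
    };

    int main(void) {
        /* A line that repeats itself 16 times, with all FU slots left as NOPs. */
        struct microcode_line line = { .ctrl = CTRL_REPEAT, .times = 16 };
        printf("controller: kind=%d, times=%d\n", (int)line.ctrl, (int)line.times);
        return 0;
    }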
2.2 Multi-granularity parallel (MGP) memory
As mentioned in previous work [4], memory access patterns are an important source of processor underuse. However, computation kernels are always structured, and their memory access patterns can be classified into a few categories. Therefore, it is highly possible to normalize these patterns with a strict data model and a specially designed memory system.

The MGP memory system serves as a software-managed local memory system and is designed with an intrinsic data shuffle ability, supporting various access patterns. Providing a row- and column-major layout simultaneously for matrices with common data types, this memory architecture makes the time-consuming matrix transposition unnecessary. With the MaPU data model, matrices in the MGP memory system can be treated as normal and transposed forms at the same time. We explore this feature in more detail in Section 3.1.

3. ARCHITECTURE HIGHLIGHTS

3.1 MGP memory system
The MGP memory system supports various access patterns, especially the simultaneous row- and column-major layout for matrices with common data types. Before describing the structure of the MGP memory system, some basic concepts should be explained.

Physical bank: on-chip SRAM that can be accessed with multiple bytes in parallel, which can be generated from a memory compiler or customized.
Logic bank: a group of physical banks, on which the address resolution is based.
Granularity parameter: the parameter that controls which physical banks should be grouped into a logic bank and how the inputted address is resolved.

3.1.1 Basic structure of the MGP memory system
The MGP memory system provides W bytes of parallel access and N bytes of capacity, and has three interfaces: one read/write address, a granularity parameter (G) and the data for reading or writing. The MGP memory system has the following constraints. W should be an integer power of 2, and N = 2^k, where k must be a natural number. G should be an integer power of 2, ranging from 1 to W; that is, G = 2^g, where 0 <= g <= log2(W). W physical banks, labeled from 0 to W-1, are required. Each bank can read/write W bytes in parallel and has an N/W-byte capacity. When accessed, this memory system operates according to the following procedures to decode the address.

Logic bank formation: Physical banks cascade and group into logic banks according to the parameter G. The G consecutive physical banks labeled from i*G to (i+1)*G-1 cascade into logic bank i, where 0 <= i < W/G.

Address mapping: Physical banks in a logic bank are addressed in sequence, starting from zero. All of the logic banks have the same address space. As the size of each physical memory bank is N/W and there are G physical banks in a logic bank, the address of the logic bank ranges from 0 to G*N/W - 1.

Data access: When reading/writing, each logic bank accesses only G bytes of the whole W bytes. The access address is the address inputted into the MGP memory system. Each physical memory bank uses a mask to control this partial access.

[Figure 3: MGP memory example that provides W = 4 bytes of parallel access with a total capacity of N = 64 bytes. (a) G = 1, address = 0: one physical memory bank is grouped into a logic bank and each logic bank accesses 1 byte. (b) G = 2, address = 0: two physical banks are grouped into a logic bank and each logic bank accesses 2 bytes. (c) G = 4, address = 0: four physical banks are grouped into a logic bank and each logic bank accesses 4 bytes.]

Figure 3 shows an MGP memory system in which W = 4 and N = 64. With the granularity parameter G, the memory system can support log2(W) + 1 types of access patterns. The layout of the physical memories is controlled dynamically by the granularity parameter G. When G = W, the MGP memory system has only one logic bank and all of the physical banks are addressed in sequence. In this case, the memory system falls back to an ordinary memory system with a W-byte access interface. When combined with carefully designed data layouts, the MGP memory system can provide interesting features such as simultaneous parallel access to matrix rows and columns. In fact, when writing data into memory with one G value and then reading them with a different G value, we shuffle the data implicitly. Different pairs of G values represent different shuffle patterns, which makes this MGP memory system versatile in handling high dimensional data.
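To make the decoding procedure concrete, the following minimal C sketch models the W = 4, N = 64 example above. How the G bytes of a logic bank map onto the output lanes is an assumption of the sketch (the text only pins down the G = 1 and G = W cases), but it reproduces the documented accesses for the initial matrix layout of Section 3.1.2 below: a 5x5 byte matrix with row i placed in logic bank i % W is read as a column (0, 5, 10, 15) with G = 1 and as a row (0, 1, 2, 3) with G = W.

    /* Minimal behavioral sketch of the MGP address decoding described above,
     * for the W = 4, N = 64 example.  Assumption (not spelled out in the text):
     * logic bank i drives output lanes i*G .. (i+1)*G-1, and the G bytes it
     * returns are the G consecutive bytes starting at the input address in its
     * own sequential address space. */
    #include <assert.h>
    #include <stdio.h>

    #define W 4                 /* bytes of parallel access = number of physical banks */
    #define N 64                /* total capacity in bytes */
    #define BANK_SZ (N / W)     /* capacity of one physical bank */

    static unsigned char bank[W][BANK_SZ];

    /* Read W bytes at 'addr' with granularity G into out[0..W-1]. */
    static void mgp_read(int G, int addr, unsigned char out[W]) {
        for (int lane = 0; lane < W; lane++) {
            int lb = lane / G;                 /* logic bank feeding this lane        */
            int ia = addr + lane % G;          /* address inside that logic bank      */
            int pb = lb * G + ia / BANK_SZ;    /* physical bank inside the logic bank */
            out[lane] = bank[pb][ia % BANK_SZ];
        }
    }

    int main(void) {
        /* Initial layout of a 5x5 byte matrix (element value = 5*row + col):
         * row i goes to logic bank i % W (G = 1), rows in a bank are consecutive. */
        const int P = 5, Q = 5;
        for (int i = 0; i < P; i++)
            for (int j = 0; j < Q; j++)
                bank[i % W][(i / W) * Q + j] = (unsigned char)(i * Q + j);

        unsigned char v[W];
        mgp_read(1, 0, v);      /* column-major access, expects 0, 5, 10, 15 */
        assert(v[0] == 0 && v[1] == 5 && v[2] == 10 && v[3] == 15);
        mgp_read(W, 0, v);      /* row-major access, expects 0, 1, 2, 3 */
        assert(v[0] == 0 && v[1] == 1 && v[2] == 2 && v[3] == 3);
        printf("row and column accesses match the described layout\n");
        return 0;
    }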

In fact, the MGP memory system is the most distinguishing feature of MaPU, and other designs may certainly benefit from it. Other reconfigurable architectures focus mainly on the computing fabric, ignoring or compensating for the performance and power penalties caused by an inefficient memory system. MGP memory tries to address the root cause of the problem. It is not like the scatter/gather-enabled memory in conventional vector processors. In such a scatter/gather operation, multiple addresses are sent to memory banks, and access conflicts are unavoidable and can lead to pipeline stalls, affecting the performance and increasing the complexity of pipeline control in turn. MGP memory in MaPU is a mathematically structured, conflict-free and vectorized system. Most vector accessing patterns can be mapped into the MGP memory system, such as accessing a matrix row or column with the same data layout, and reading and writing FFT data for any stage of the butterfly diagram. More of the applications that may benefit from MGP memory are explored later.

[Figure 4: Matrix initial layout when (a) the element bit width is the same as that of the addressable units and (b) the element bit width is twice that of the addressable units. The width W in this example is 4 bytes. Both (a) and (b) show only the initial layout of the matrix; the actual layout is controlled by the granularity parameter G when this memory is read/written.]

3.1.2 Matrix layout in the MGP memory system
A matrix can be accessed in parallel in row or column order simultaneously only if it is initialized with a specific layout in the MGP memory system. Figure 4 shows two matrices whose initial layout allows them to be accessed in either row or column order. Keep in mind that Figure 4 shows only the layout in which the matrices should be initialized; the data layout will change according to the provided G parameter when accessed.

For matrices with elements whose bit widths are the same as those of the addressable units, the following procedures would produce the initial layout.

Set the granularity parameter G = 1.
Put the i-th row in logic bank i % W. Rows in the same logic bank should be consecutive.

These procedures generate the initial layout for a 5x5 matrix in an MGP memory system with W = 4, as shown in Figure 4(a). Providing this data layout, the matrix can be accessed in row order with G = W and in column order with G = 1. When G = 1 and address = 0, the column elements (0, 5, 10, 15) are accessed in parallel. When G = W and address = 0, the row elements (0, 1, 2, 3) are accessed in parallel.

In fact, a formal expression for the address offset of each row during the initialization procedure, and for the address offset of each element during the read/write process after initialization, can be derived for general cases. Matrices with elements whose bit width is M times that of the addressable units, where M is an integer power of 2, should be initialized as follows (providing the capacity of the memory system is N, and the dimension of the matrix is PxQ).

Set the granularity parameter G = W. (In this case, there is only one logic bank.)
The address offset of the i-th row is A(i) = (i % (W/M)) * NM/W + (iM/W) * QM.

Table 1: Address offset for matrix element (i, j), which occupies M memory units. The matrix size is PxQ. The memory width is W and the total capacity is N.
Access Mode     G    Address Offset
Row major       W    (i % (W/M)) * NM/W + (iM/W) * QM + W * (jM/W)
Column major    M    (iM/W) * QM + jM

When masters access the matrix in the MGP memory system, they access consecutive W bytes as a whole, by row or by column. The address offset for accessing element (i, j) can be calculated as shown in Table 1. Here, "%" represents the modulo operation, and all of the divisions are integer divisions with no remainder. As masters can only access W bytes as a whole, Table 1 shows only the accessing address for these W bytes. The address computation is complicated, but the stride of the addresses is regular. Thus, it is plausible to traverse the whole matrix with simple address generation hardware.

3.2 High dimension data model
As mentioned previously, the MGP memory system is versatile in handling high dimensional data, but its address computation is complicated and the addresses are not always consecutive. To describe and access high dimensional data in the MGP memory system, a much more expressive data model than vectors is required.

Figure 5 shows the basic parameters needed for each dimension: the base address (KB); the address stride (KS), which is the address difference between two consecutive elements; and the total number of elements (KI). Figure 6 shows two-dimensional data described by these parameters.

[Figure 5: Parameters describing the dimensions of data.]
[Figure 6: Example of high dimension data description: a 4x3 matrix distributed in a 4x8 space (KB0 = 1, KI0 = 3, KS1 = 8, KI1 = 4).]
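A small sketch of the address generation this data model implies is given below; addresses are enumerated with the low dimension varying fastest. The values are those of the Figure 6 example (a 4x3 matrix distributed in a 4x8 space); KS0 = 1 is an assumption, since only the base address, KS1 and the element counts are legible in the figure.

    /* Sketch of a KB/KS/KI address generator for the two-dimensional example
     * of Figure 6: base address KB = 1, strides KS = {1, 8}, counts KI = {3, 4}.
     * KS0 = 1 is an assumption of this sketch. */
    #include <stdio.h>

    int main(void) {
        const int KB = 1;
        const int KS[2] = {1, 8};      /* address stride per dimension       */
        const int KI[2] = {3, 4};      /* number of elements per dimension   */

        for (int d1 = 0; d1 < KI[1]; d1++)        /* high dimension varies last  */
            for (int d0 = 0; d0 < KI[0]; d0++)    /* low dimension varies first  */
                printf("%d ", KB + d1 * KS[1] + d0 * KS[0]);
        /* prints: 1 2 3 9 10 11 17 18 19 25 26 27 */
        printf("\n");
        return 0;
    }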

With this data model, the matrix described in Table 1 can be configured according to the access modes. All of the base addresses of the dimensions are the same as the start address of the matrix.

When accessed by row, the matrix in the MGP memory system can be seen as three-dimensional data. The first dimension is the elements in a row; as the memory system accesses W bytes at a time, the address stride of this dimension is W and the number of elements is QM/W. The second dimension is the rows spread across the logic banks, whose address stride is the size of a logic bank, i.e., NM/W; the number of elements is the number of logic banks, i.e., W/M. The third dimension is the rows within the same logic bank, whose address stride is the number of memory units occupied by a row, i.e., QM, and the number of elements is PM/W.

When accessed by column, the first dimension is the elements in a column, whose address stride is the number of memory units occupied by a row, i.e., QM; the number of elements is PM/W. The second dimension is the columns of the matrix, whose address stride is M, and the total number of elements is Q. Table 2 shows the KS and KI configurations of these two access modes.

Table 2: Parameters for a matrix whose size is PxQ. Each element occupies M memory units. The memory width is W and the total capacity is N.

Access Mode     G    KS0    KI0     KS1     KI1    KS2    KI2
Row major       W    W      QM/W    NM/W    W/M    QM     PM/W
Column major    M    QM     PM/W    M       Q      -      -

In the MaPU architecture, the number of dimensions that the chip can support depends on the implementation. These parameters are set by the scalar pipeline. Their values are stored in registers of the FUs that take charge of the load/store operations. During operation, data are accessed in sequential groups where the low dimensions are accessed first, followed by the high dimensions. The addresses of these groups are calculated automatically, like those in DMA operations, except that the time at which the data are accessed is controlled by the microcodes.
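As a small illustration of how these parameters could be filled in for the two access modes of Table 2 (the struct, the helper names and the 8x8 byte example matrix with W = 4, N = 64 and M = 1 are illustrative, not taken from the MaPU toolchain):

    /* Sketch: dimension parameters for a P x Q matrix whose elements occupy M
     * addressable units, following Table 2.  The struct and example values
     * (an 8x8 byte matrix, W = 4, N = 64, M = 1) are illustrative only. */
    #include <stdio.h>

    struct dim_cfg { int G; int ndim; int KS[3]; int KI[3]; };

    static struct dim_cfg row_major(int W, int N, int M, int P, int Q) {
        struct dim_cfg c = { .G = W, .ndim = 3 };
        c.KS[0] = W;         c.KI[0] = Q * M / W;  /* W-byte groups inside one row       */
        c.KS[1] = N * M / W; c.KI[1] = W / M;      /* rows spread across the logic banks */
        c.KS[2] = Q * M;     c.KI[2] = P * M / W;  /* rows within the same logic bank    */
        return c;
    }

    static struct dim_cfg col_major(int W, int N, int M, int P, int Q) {
        struct dim_cfg c = { .G = M, .ndim = 2 };
        (void)N;                                   /* N is not needed in this mode       */
        c.KS[0] = Q * M; c.KI[0] = P * M / W;      /* groups of W/M rows down one column */
        c.KS[1] = M;     c.KI[1] = Q;              /* step to the next column            */
        return c;
    }

    int main(void) {
        struct dim_cfg r = row_major(4, 64, 1, 8, 8);
        struct dim_cfg c = col_major(4, 64, 1, 8, 8);
        printf("row major:    G=%d KS={%d,%d,%d} KI={%d,%d,%d}\n",
               r.G, r.KS[0], r.KS[1], r.KS[2], r.KI[0], r.KI[1], r.KI[2]);
        printf("column major: G=%d KS={%d,%d} KI={%d,%d}\n",
               c.G, c.KS[0], c.KS[1], c.KI[0], c.KI[1]);
        return 0;
    }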
3.3 Cascading pipeline with state-machine-based program model
FUs in the microcode pipeline can cascade into a datapath that suits the algorithm, like the datapath in an ASIC controlled by a state machine. Different storage elements such as pipeline registers and caches have different access times and energy consumption for each operation. This cascading structure can dramatically decrease data movement between register files and memory, thus decreasing the overall consumed energy. Moreover, keeping the data flow centralized in the FUs and forwarding paths increases the overall computation efficiency, as the FUs and forwarding logic always run at a higher frequency than caches and memory. Figure 7 shows this concept for the FFT and matrix multiply algorithms.

[Figure 7: Data flow mapping in the MaPU architecture, which can change dynamically through microcode sequences.]

The mapping between the data flow and the datapath is represented by a microcode sequence that defines the operations of the FUs at every clock cycle. As these microcodes are stored in memory and can be updated at run time, this mapping can also be updated dynamically.

As the hardware does not handle dependencies between microcodes, the programmer should write the microcodes carefully to make sure that results do not arrive too early or too late. This involves a great amount of effort. MaPU implements a state-machine-based program model to simplify this task. Programmers only need to describe the state machine of each FU and the time at which these state machines should start. The compiler can then transform these state machines into microcode lines that can be emitted simultaneously.

First, the programmer establishes a state-machine-based program description. Sub-state machines for each node are constructed using the MaPU instruction set, and the top state machine is then constructed according to the time delays, i.e., the data dependences. Next, according to the micro-architecture features of MaPU, the compiler transforms each state machine into an intermediate expression that conveys the sequential, cyclic or repeated structure of the basic block. The compiler then merges the different state machines by abstracting their common attributes to generate combinational state machines and the microcode lines. Finally, the compiler performs grammatical structure, resource conflict and data dependence checks to ensure that the microcode lines satisfy the MaPU instruction set constraints. Figure 8 shows the concept of this program model.

[Figure 8: Illustration of the state-machine-based program model, built from the state machine of each functional unit.]

The state machine of each FU is described by the microcodes that the FU supports and by microcodes for loop control. The following code snippet shows an example state machine of a load/store unit.

    .hmacro FU1SM
    // loop to label 1; the loop count is stored in the KI12 register
    // load data to the register file and calculate the next load address
    BIU0.DM(A++, K++) -> M[0];
    NOP;    // idle for one cycle
    NOP;    // idle for one cycle

The following code snippet shows the top state machine of the algorithm.

    .hmacro MainSM
    FU1SM;              // start the FU1 state machine at cycle 0
                        // wait six cycles
    FU2SM || FU3SM;     // start the FU2 and FU3 state machines at cycle 7
                        // wait six cycles
    FU4SM;              // start the FU4 state machine at cycle 13

4. THE FIRST MAPU CHIP
To prove the advantages of the MaPU architecture, we design, implement and tape out a chip with four cores that implement the MaPU instruction set architecture.

4.1 SoC architecture
Figure 9 shows a simplified diagram of this chip.

[Figure 9: Simplified SoC structure: four APEs with local memories on a high speed network, two DDR3 controllers, PCIe, RapidIO, a Cortex-A8 core, Ethernet, an external bus interface, watchdog and interrupt controllers on the L1/L2 buses.]

The MaPU cores in this chip are called APEs. In addition to the four MaPU cores, there are other subsidiary components such as a Cortex-A8 core and peripheral IPs. The chip also includes some high-speed IOs such as DDR3 controllers, PCIe and RapidIO, and some other low-speed interfaces. All of these components are connected by a three-level bus matrix. The chip is implemented with a TSMC 40-nm low-power process. Figure 10 shows the final layout; the total area is 363.468 mm².

Figure 11 shows the APE structure. The microcode pipeline in an APE runs at 1 GHz and the other components run at 500 MHz. There are 10 FUs in each APE, as listed below. Each FU can handle 512 bits of data in the SIMD manner.

IALU: for integer computation, with SIMD ability for 8-, 16- and 32-bit data types.
FALU: for IEEE 754 single and double precision floating point computation.
IMAC: for integer multiply accumulation, with SIMD ability for 8-, 16- and 32-bit data types.
FMAC: for IEEE 754 single and double precision floating point multiply accumulation.
BIU0, BIU1, BIU2: for load/store operations; each calculates the next data address automatically and supports data access with four dimensions.
MReg: a 128x512-bit matrix register file with slide window and auto index features.
SHU0, SHU1: shuffle units that can extract specific bytes from the source register and write them into the destination register in any order.

[Figure 10: Final layout of the chip, showing the four APEs, the PCIe and DDR3 controllers with their PHYs, and the bus fabric.]
[Figure 11: APE structure, showing the microcode fetch and microcode memory, the FUs (SHU0/SHU1, IALU/IMAC/FALU/FMAC, MReg), the AGU/BIU load/store path and the data memories on AXI interfaces.]

Some special function units are designed to exploit data locality, so that the data flow can be concentrated inside the microcode pipeline and the load/store operations can be decreased. For example, the SHU can perform cascading shift operations, in which two 512-bit registers are connected and shifted circularly by 1, 2 or 4 bytes, just like the sliding window in the finite impulse response (FIR) algorithm. Combined with the large matrix register file, the coefficients and input data need to be loaded only once in the whole FIR process.

The bus, forwarding matrix and memory system are also 512 bits wide. To decrease the scale of the forwarding matrix, not all of the FUs are connected; for example, the FMAC result cannot be forwarded to the IALU and IMAC. The microcode syntax embodies these constraints. There are six data memories in total, each of which is an MGP memory system with a 2 Mbit capacity.

The microcode line includes 14 microcodes, each of which is assigned to an FU except for MReg, which requires 4 microcodes. The microcode line is 328 bits wide, and the microcode memory (MIM) can hold 2,000 microcode lines. To accelerate the turbo decoding process, we add a dedicated turbo co-processor in the APE.

The scalar pipeline is a 32-bit VLIW DSP core that contains four FUs: one for 32-bit integer and IEEE 754 single precision floating point computation; an AGU for load/store and register file transfer; one for microcode pipeline configuration and control; and one for jumps, loops and function calls. These four FUs can run in parallel, and the scalar pipeline can issue four instructions in one cycle.

Figure 12 shows the final layout of the APE. The total area is 36 mm².

[Figure 12: Final layout of APE. FMAC, IMAC, FALU, IALU and MReg are distributed in eight blocks, each of which can handle 64 bits of data. The microcode pipeline runs at 1 GHz and the other components run at 500 MHz in typical cases.]

The toolchain of the APE is based mainly on open source frameworks, as shown in Table 3.

Table 3: MaPU toolchains
Tool                               Open Source Framework
Compiler for microcode pipeline    Ragel & Bison & LLVM
Compiler for scalar pipeline       Clang & LLVM
Assembler                          Ragel & Bison & LLVM
Linker                             Binutils Gold
Debugger for scalar pipeline       GDB
Simulator                          Gem5
Emulator                           OpenOCD

4.2 Performance
Before taping out the chip, we simulated many typical signal processing algorithms on the APE with the final RTL and compared the performance with that of the TI C66x core, a commercial DSP with a similar process node and computation resources. We assume that the APE runs at 1 GHz and that the C66x core runs at 1.25 GHz. We obtain the performance of the C66x core by running each algorithm in Code Composer Studio (CCS v5) with the official optimized DSP library and image processing library (DSPLIB and IMGLIB).

Table 4: Complex SP FFT performance (time unit: us)
Length    128    256    512    1024    2048    4096
C66x      0.65   1.18   2.82   5.49    13.09   26.00
APE       0.56   0.88   1.41   2.63    4.75    9.79

Table 4 shows the execution time of a complex single precision floating point FFT algorithm of varying lengths. Table 5 shows the execution time of a complex 16-bit fixed point FFT algorithm. The specific FFT algorithm used here is the cached-FFT [6]. In this algorithm, butterflies are divided into groups. Each group contains multiple butterflies and stages that can be computed independently without interacting with the data in other groups. Thus, a group of butterflies can be loaded and computed thoroughly without writing back to memory. Figure 13(a) shows a data flow diagram of a butterfly group within a complex single floating point FFT, and Figure 13(b) shows the corresponding datapath of the FUs.

[Figure 13: FFT data flow diagram and datapath mapping in APE.]

Table 5: 16-bit complex fixed point FFT performance (time unit: us)
Length    256    512    1,024    2,048    4,096
C66x      0.60   1.33   2.59     5.91     12.03
APE       0.56   0.79   1.50     2.41     4.80

Although the butterflies in a group can be computed independently, the data must be shuffled between groups in different epochs. This is done naturally using MGP memory. The result of a group is stored back to memory with one G value after computation and then loaded back to the microcode pipeline with a different G value. The loaded data can be computed directly without any shuffle operation. As different pairs of G values are used for FFTs with different data types, the memory access pattern matches the datapath perfectly. As a result, the overall performance is boosted and the power is reduced. The original work was implemented with a dedicated co-processor; in this paper, the algorithm is re-implemented with only microcodes.

The average speedups of APE vs. the C66x core are 2.00 for the SP FFT algorithm and 1.89 for the fixed point FFT (see Table 6). In fact, we can further improve the performance of the APE through microcode optimization. For example, the execution time for a 4,096-point 16-bit FFT can be reduced from 4.80 us to 4.10 us after further microcode optimization, an improvement of almost 15%. From this example, we can see that MaPU has a huge performance potential with customized state machines.

We implemented other typical algorithms with the same strategies as the FFT, in which the datapath resembles the data flow and MGP memory provides the matched access patterns. Table 6 summarizes the average speedups of APE vs. C66x for these algorithms. These algorithms have different data types and data flows. However, given the cascading pipeline and intrinsic shuffle abilities of MGP memory, all of them are mapped successfully onto the APE, and their performance is quite impressive. APE achieves speedups of more than a hundred for table lookups due to its SHU units, which can handle 64 parallel queries on a table with 256 records within 5 cycles. Taking the frequency of both processors into consideration, APE has more advantages in terms of architecture, as it runs at a lower frequency but performs better.

4.3 Power efficiency
Before taping out the chip, we estimated the power of the APE using PrimeTime with various algorithms.

The switching activity is generated through the final post-simulation with the final netlist and SDF annotation. We also tested the power of the APE after the chip returned from the fab. In the SoC, there are dedicated power domains and clock gates for the APEs. As we can turn on the power supply and clock for each APE separately, the power of each APE can be measured precisely by the power increase when it is invoked.

Table 7 shows the power data for the typical algorithms. The data types of each algorithm are the same as in Table 6. All of the micro-benchmarks used are held in on-chip memory, but the overall amount of power consumed by memory is small. The number of DMs in the APE is designed for scalability. For benchmarks that exceed the size of a DM, three of the six DMs can be used as computation buffers, and the other three DMs can be used for DMA transfers. The aggregated power of all of the MGP memories (six DMs in Figure 13) for the benchmarks in Table 7 (in order, except for Idle) is 8%, 6%, 3%, 3%, 2%, 3% and 15%, respectively. The 15% is for matrix transpose, which consists entirely of memory read/write operations. Taking DMA transfers into consideration, the energy efficiency will degrade slightly but remain almost the same as the presented results.

Table 6: APE vs. C66x core: actual performance comparison
Algorithm      Speedup    Data Types
Cplx SP FFT    2.00       Complex single floating point
Cplx FP FFT    1.89       Complex 16-bit fixed point
Matrix mul     4.77       Real single floating point
2D filter      6.94       Real 8-bit fixed point
SP FIR         6.55       Real single floating point
Table lookup   161.00     Table with 8-bit address, 256 records
Matrix trans   6.29       16-bit matrix

Table 7: Estimated and tested power of APE at 1 GHz (power unit: Watt)
Algorithm      Est     Tested    Diff    Size
Cplx SP FFT    2.81    2.95      -5%
Cplx FP FFT    2.63    2.85      -8%
Matrix mul     3.05    3.10      -2%     65*66, 66*67 matrix
2D filter      4.13    4.15      -1%     508*508, 5*5 template
SP FIR         2.19    2.20      -1%     4,096 points, 128 coefficients
Table lookup   2.75    2.95      -7%     4,096 queries
Matrix trans   2.28    2.45      -7%     512*256 matrix
Idle           1.51    1.55      -2%     while APE standby

We can see clearly from the table that the power consumption of most of the algorithms is below 3 W and that the standby power of the APE is as high as 1.55 W. Preliminary analysis indicated that the clock network contributes most of the idle power, which will require improvement in the future. It can also be seen from Table 7 that the estimated and measured power are almost the same. This indicates that our power evaluation method is effective and that the module-based power analysis discussed later is highly reasonable.

To compute the actual dynamic power efficiency of the APE, we collected detailed instruction statistics. Table 8 shows the number of microcodes issued when the APE runs different algorithms. The data types and sizes are the same as in Tables 6 and 7. Based on the microcode statistics in Table 8, we know how many data operations are needed to complete an algorithm. We obtain the GFLOPS or GOPS of the APE for different algorithms by dividing the number of operations by the corresponding execution times. The corresponding GFLOPS/W and GOPS/W are computed by dividing the GFLOPS or GOPS by the power, as shown in Figures 14 and 15. The computation operations include IALU, IMAC, FALU, FMAC and SHU0/SHU1 operations (each MAC instruction is considered as two operations), and the total operations include computation, register file read/write and load/store operations.

Figure 14 shows that the maximum performance of the APE for floating point applications is 64.33 GFLOPS, with the SP FIR algorithm. The maximum computation performance for fixed point is 255.21 GOPS, with the 8-bit 2D filter application. When taking power into account, the maximum total power efficiency of the APE for floating point applications is 45.69 GFLOPS/W, with the SP FFT algorithm. The maximum total power efficiency across all of the algorithms is 103.49 GOPS/W, with the 2D filter application, as shown in Figure 15.
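Spelled out, these efficiency figures are simple ratios of operation count, execution time and power. As a rough cross-check (a sketch only, combining the roughly 134 GFLOPS sustained SP FFT throughput quoted in the abstract with the 2.95 W tested SP FFT power from Table 7; the exact operation counts come from the microcode statistics in Table 8):

    efficiency = operations / (execution time * power) = GFLOPS (or GOPS) / power
    134 GFLOPS / 2.95 W ~= 45 GFLOPS/W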
Table 8: Microcode statistics for different algorithms
Algorithm      MR0      MR1      MR2     MR3     SHU0      SHU1      IALU    IMAC       FALU      FMAC      BIU0     BIU1     BIU2
Cplx SP FFT    1,908    1,841    1,888   12      1,878     1,836     0       0          1,771     1,847     807      746
Cplx FP FFT    942      930      897     10      894       894       0       885        0         0         255      193
Matrix mul     29,478   0        0       1,650   29,478    0         0       0          12,854    29,476    1,650    488
2D filter      20,336   20,336   0       0       105,740   105,740   0       105,737    0         0         8,703    20,344   7,599
FIR            2,048    2,049    0       0       34,817    34,817    0       0          1,792     34,817    265      2,048
Table lookup   258      192      128     0       257       0         320     0          0         0         4        64
Matrix tran    0        0        0       4,098   0         0         0       0          0         0         4,099    0        4,096

[Figure 14: Actual performance of APE for different algorithms. The units are GFLOPS for floating point applications and GOPS for fixed point applications.]

Table 9: Estimated performance of current processors
Processor            Process      GFLOPS    GFLOPS/W
Core i7 960          45 nm        96
Nvidia GTX 280       65 nm        410       2.6
Cell                 65 nm SOI    200
Nvidia GTX 480       40 nm        940
Stratix IV FPGA      40 nm        200
TI C66x DSP          40 nm        74
Xeon Phi 7210D       22 nm        2,225
Tesla K40 (+CPU)     28 nm        3,800
Tegra K1             28 nm        290
MaPU Core            40 nm        134       45.7

Table 9 summarizes the energy efficiency of several processors, as presented in [7]. From this table and the data presented in [4], we can see that the peak energy efficiency of GPGPUs and general-purpose processors is below 10 GFLOPS/W. The actual maximum energy efficiency of the APE is around 40-50 GFLOPS/W, a large improvement for both the floating point and the fixed point applications.

4.4 Discussion
In addition to the instruction statistics in Table 8, we gathered the power data for each individual module through detailed simulation. With the microcode count and power, we can estimate the average dynamic energy consumed per microcode, as presented in Table 10. Table 7 shows that the estimated and tested power from the real chip are very close; thus, the power statistics here, which are based on simulation, are highly reliable.

[Figure 15: Actual power efficiency tested from the real chip. The units are GFLOPS/W for floating point applications and GOPS/W for fixed point applications.]

The result is calculated as follows:

    energy per microcode = (running power - idle power) * execution time / instruction count

The average load/store energy includes the load/store unit, the data bus and the memory. Table 10 clearly shows that register file access is the most energy efficient. The energy consumed by most of the computation FUs is almost half that of the load/store unit, except for the IMAC, which requires further improvement.

[Figure 16: Microcode composition for different algorithms. MReg access includes microcodes for MR0-MR3; computation includes microcodes for SHU, IALU, IMAC, FMAC and FALU; load/store includes microcodes for BIU0-BIU2.]

Table 10: Dynamic energy consumed per microcode
Module          Energy per microcode (unit: pJ, 512 bit)
Register R/W    133.25
Load/Store
FALU            345.65
IALU
FMAC
IMAC            788.77
SHU             213.04

As the energy consumed by the FUs is much less than that consumed by load/store operations, it is reasonable to keep data moving between FUs through the cascading pipelines as much as possible. Figure 16 shows the microcode composition for different algorithms. Using the MGP memory system and cascading pipelines, typical algorithms can be mapped into structured datapaths in which the operations mainly comprise register file access and computation operations, which consume much less energy than the load/store operations.

Figure 17 further shows the average usage and average energy consumption of different components when the APE runs the algorithms in Table 7, except for matrix transpose. A trend can be seen in Figure 17: the register file access operation consumes much less energy than the load/store operations but is used much more, and the computation units are used much more than the register file and memory units. As such, the energy-efficient FUs are used much more frequently than the energy-inefficient FUs.

This is an important factor that contributes to the remarkable energy efficiency of MaPU. The power efficiency benefits from two aspects. The first aspect is the novel micro-architecture. MaPU consists of massive FUs but simple control logic. As in an ASIC, most of the energy is consumed by the FUs that do the real computations, and energy-hungry operations such as memory accesses are minimized through the MGP memory system. Therefore, it is possible to achieve high power efficiency in MaPU. In particular, the instruction fetching and dispatching logic consumes only 0.18% of the power, and the FUs consume up to 53% of the power in the FFT benchmark.

The second aspect is the elaborately optimized algorithms on MaPU. To achieve outstanding power efficiency, data locality should be explored at the algorithm level, and the state machines should be constructed in a manner that centralizes data movement in the FUs and forwarding matrix. Figure 17 indicates that the benchmarks have been mapped mostly into computation and register file access operations. As the power-consuming load/store operations are only a small part of the overall operations, power consumption is reduced overall.

Great efforts have been made to optimize the micro-benchmarks presented in this paper. Although the state-machine-based program model has simplified the optimization process, it would take about one month to implement a kernel on MaPU for those familiar with the micro-architecture. We have developed an informal flow to facilitate the optimization process. We would like to explore this flow more thoroughly in the near future and hope to develop a more convenient, higher-level program model based on the one adopted in this paper.

[Figure 17: Average usage and energy consumption of different components.]

5. RELATED WORK
In general, MaPU can be classified among CGRAs and vector processors. One principle underlying MaPU involves mapping the computation graph into a reconfigurable fabric while mapping the data accessing pattern into an MGP memory system. The computation mapping of MaPU is similar to DySER [8]. However, MaPU uses crossbar-based forwarding logic rather than a switching network. Furthermore, the reconfiguration fabric and computation sub-region in DySER is more like an instruction extension to the scalar pipeline, whereas the reconfiguration fabric and microcode pipeline in MaPU form a standalone processor core that can execute kernel algorithms such as the matrix multiply and 2D filter algorithms. MaPU is not like GARP [9] and SGMF [10]. GARP uses fine-grained reconfigurable arrays to construct FUs such as adders and shifters in a way that resembles an FPGA, while MaPU uses FUs to map high-level algorithms. At the same time, MaPU has no thread concept and thus no data dependence handling logic; this is different from SGMF [10].

The power efficiency of computer architecture has become more and more important in recent years. Voltage and frequency adjustment and clock gating are two main techniques used to decrease the power of previous processors [11]. However, although chips are integrating far more transistors than before, their total power is still strictly constrained. Only a few parts of the chip can be lit up, which leads to the idea of so-called dark silicon [12]. In this new design regime, heterogeneous architectures have been proposed in which some general-purpose cores are augmented by many other cores and accelerators of different micro-architectures [13]. GreenDroid [14] is such an aggressive dark silicon processor with great power efficiency. Although the dedicated co-processor (c-core) in this chip can be reconfigured after manufacture, its compatibility with algorithm updates presents a concern for programmers.

Some other customized processors that aim at power efficiency are less aggressive. Most of them focus mainly on improving datapath computation, such as by adding a vector processing unit [7][1], adding chained FUs [7] and adding specialized instructions [2][15]. Although improvements have been made through these techniques, they are limited by their memory access efficiency.
Transport triggered architecture possesses many advantages, including modularity, flexibility and scalability. Its low-power potential is exploited in [16], but that work focuses mainly on compiler optimization and the reduction of register file accesses. As clock distribution networks consume around 20-50% of the total power in a synchronous circuit [17], asynchronous circuits are considered an alternative for building low-power processors [18]. However, asynchronous circuits are difficult to design, and the related EDA tools are far from mature; they may only be successful in special chips such as neuromorphic processors [19].

6. CONCLUSION
The novel MaPU architecture is presented in this paper. With an MGP memory system, cascading pipelines and a state-machine-based program model, this architecture possesses great performance and energy potential for computation-intensive applications. A chip with four MaPU cores has been designed, implemented and taped out following a TSMC 40-nm low-power process.

The performance and power of the chip have been fully analyzed and compared with other processors. Although the implementation of this first MaPU chip can be further improved, the results indicate that MaPU processors can provide a performance/power efficiency improvement of 10 times over traditional CPUs and GPGPUs.

The first MaPU chip presented here is only one example of the MaPU architecture. Designers can easily implement this architecture in specific domains through customized FUs. Great efforts are still needed in programming MaPU. Although a low-level state-machine-based programming model has been proposed, we hope to develop a more efficient model at a higher level. Furthermore, we are also trying to implement a heterogeneous computing framework such as OpenCL on MaPU, which would hide all of the hardware complexity and provide efficient runtimes and libraries designed for specific domains. In addition, we are now working hard to construct the relevant wiki pages and to make detailed documentation and tools available to the open source community.

7. ACKNOWLEDGEMENT
This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (under Grant XDA...).

8. REFERENCES
[1] B. Barry, C. Brick, F. Connor, D. Donohoe, D. Moloney, R. Richmond, M. O'Riordan, and V. Toma, "Always-on vision processing unit for mobile applications," IEEE Micro, no. 2, pp. 56-66, 2015.
[2] L. Codrescu, W. Anderson, S. Venkumanhanti, M. Zeng, E. Plondke, C. Koob, A. Ingle, C. Tabony, and R. Maule, "Hexagon DSP: An architecture optimized for mobile multimedia and communications," IEEE Micro, vol. 34, no. 2, pp. 34-43, 2014.
[3] D. Parello, O. Temam, and J. M. Verdun, "On increasing architecture awareness in program optimizations to bridge the gap between peak and sustained processor performance: matrix-multiply revisited," in SC Conference, pp. 31-31, 2002.
[4] A. Sethia, G. Dasika, T. Mudge, and S. Mahlke, "A customized processor for energy efficient scientific computing," IEEE Transactions on Computers, vol. 61, no. 12, pp. 1711-1723, 2012.
[5] V. V. Zyuban and P. M. Kogge, "Inherently lower-power high-performance superscalar architectures," IEEE Transactions on Computers, vol. 50, no. 3, pp. 268-285, 2001.
[6] B. M. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE Journal of Solid-State Circuits, vol. 34, no. 3, pp. 380-387, 1999.
[7] M. H. Ionica and D. Gregg, "The Movidius Myriad architecture's potential for scientific computing," IEEE Micro, vol. 35, no. 1, pp. 6-14, 2015.
[8] V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically specialized datapaths for energy efficient computing," in 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA), pp. 503-514, Feb 2011.
[9] J. Hauser and J. Wawrzynek, "Garp: a MIPS processor with a reconfigurable coprocessor," in Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 12-21, Apr 1997.
[10] D. Voitsechov and Y. Etsion, "Single-graph multiple flows: Energy efficient design alternative for GPGPUs," in 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pp. 205-216, June 2014.
[11] S. Kaxiras and M. Martonosi, "Computer architecture techniques for power-efficiency," Synthesis Lectures on Computer Architecture, vol. 3, no. 1, pp. 1-207, 2008.
[12] M. B. Taylor, "A landscape of the new dark silicon design regime," IEEE Micro, vol. 33, no. 5, pp. 8-19, 2013.
[13] M. Själander, M. Martonosi, and S. Kaxiras, "Power-efficient computer architectures: Recent advances," Synthesis Lectures on Computer Architecture, vol. 9, no. 3, pp. 1-96, 2014.
[14] N. Goulding-Hotta, J. Sampson, S. Swanson, M. B. Taylor, G. Venkatesh, S. Garcia, J. Auricchio, P.-C. Huang, M. Arora, S. Nath, V. Bhatt, and J. Babb, "The GreenDroid mobile application processor: An architecture for silicon's dark future," IEEE Micro, no. 2, pp. 86-95, 2011.
[15] S. Z. Gilani, N. S. Kim, and M. J. Schulte, "Power-efficient computing for compute-intensive GPGPU applications," in 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pp. 330-341, IEEE, 2013.
[16] Y. He, D. She, B. Mesman, and H. Corporaal, "MOVE-Pro: a low power and high code density TTA architecture," in 2011 International Conference on Embedded Computer Systems (SAMOS), pp. 294-301, IEEE, 2011.
[17] F. H. Asgari and M. Sachdev, "A low-power reduced swing global clocking methodology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 5, pp. 538-545, 2004.
[18] M. Laurence, "Introduction to Octasic asynchronous processor technology," in 2012 18th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), pp. 113-117, IEEE, 2012.
[19] J.-s. Seo, B. Brezzo, Y. Liu, B. D. Parker, S. K. Esser, R. K. Montoye, B. Rajendran, J. Tierno, L. Chang, D. S. Modha, and D. J. Friedman, "A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons," in 2011 IEEE Custom Integrated Circuits Conference (CICC), pp. 1-4, IEEE, 2011.