Energy and Performance Exploration of Accelerator Cohe

Uploaded On 2015-05-17


ABSTRACT
Cooperation of CPU and hardware accelerator to accomplish computational intensive tasks provides significant advantages in runtime speed and energy. Efficient management of data sharing among mu


- Evaluation of speed and energy efficiency of each method by running practical experiments on the hardware (section 4).
- Description of the lessons learnt from the results.

The authors provide the infrastructure, containing hardware projects, firmware and developed software as open source code and free of charge to the research community.

1.1 Related Work

Heterogeneous System Architecture (HSA) foundation provides architectural and application level solutions to help system designers integrate different kinds of heterogeneous computing units in a way that eliminates the inefficiencies of sharing data and sending work items between them [20]. The developed acceleration hardware blocks become HSA compliant by declaration of the necessary low-level interface layers. This frees the programmers from the burden of tailoring a program to a specific hardware platform.

As a part of HSA, [25] describes the AMD Fusion System Architecture, targeted to unify CPUs and GPUs in a flexible computing fabric. The proposed idea, however, is mainly developed for sharing memory between CPU and fully programmable GPU cores and is not targeting reconfigurable heterogeneous architectures like ZYNQ.

A methodology for analyzing the impact of hardware accelerator data transfer granularity on the performance of a typical embedded system is presented in [21]. This is particularly important because, as we will show, the granularity of data transfer between memory and accelerator, and thus the interrupt rate to the CPU, has direct impact on the performance.

The idea of using a portion of CPU sub-system caches as buffers for the accelerators is studied in [6]. This results in smaller silicon area since each accelerator doesn't instantiate its own buffer. The basic idea of dedicating a shared memory space to accelerators is interesting because the ZYNQ device provides a dedicated On-Chip Memory (OCM) which can be used for the same purpose. We also consider processor-accelerator memory sharing using OCM in our tests.

The problem of maintaining coherency between CPU caches and accelerator data in a multi-core embedded system is addressed in [2]. The paper discusses possible hardware architectures and related software solutions to tackle the problem. It concludes that the optimal solution heavily depends on the characteristics of the application. The paper discusses the solutions at architecture level and does not provide detailed practical comparisons on the performance and energy efficiency of each solution.

An area- and power-efficient many-core computing fabric which features clusters of up to 16 processor cores is proposed in [1]. The developed platform delivers an extraordinary level of computational speed (>80 GOPS) while consuming a relatively small amount of power (2 W). The paper is of particular importance since it provides ideas on the development of energy efficient accelerator logic. Indeed, the developed architecture in our paper is partially inspired by the method used by [1] to connect to its host CPU.

A high-performance, energy improved mobile processing platform named big.LITTLE is introduced by [7]. The platform consists of a high performance Cortex-A15 processor and an energy efficient Cortex-A7. The connection between the CPU sub-systems is provided through the CCI-400 interconnect which facilitates full coherency between Cortex-A15 and Cortex-A7 as well as GPUs, accelerators and I/O. This platform, if connected to a programmable gate array, can provide a suitable test bed for evaluation of various processor-accelerator memory sharing schemes.

The impact of cache architecture on the performance and area of FPGA based processor and parallel accelerator systems is discussed in [4]. The paper proposes a simple hardware containing one MIPS core, multiple accelerator units, a multi-port shared L1 cache and a DRAM controller. It considers different structural parameters for the L1 cache (such as number of ports, associativity, etc.) and defines a set of computational tasks to be done only by accelerators. It then quantifies the impact of cache structure on the overall speed of accelerators connected to the L1 cache. The paper does not discuss the cooperative operation of CPU and accelerators. Moreover, the developed hardware in the paper is very simple. It is not capable of booting an operating system and communicating with the outside world.

The idea of adding hardware accelerators to reduce power in FPGAs is investigated in [8]. The paper shows practical comparisons for the power consumption of sample computational tasks, when they are executed by the CPU or the accelerator logic. The paper does not address issues related to coherency and processor-accelerator memory sharing.

To the authors' knowledge, this is the first work which practically quantifies the potential processing bandwidth and energy efficiency of different processor-accelerator memory sharing methods using the ZYNQ device. Moreover, it is the first work which provides an explicit practical comparison in terms of energy and speed, on processor-accelerator memory sharing using ACP and other traditional methods. In addition, we also provide a flexible research vehicle which facilitates evaluation of innovative ideas regarding the design of hardware accelerators in heterogeneous architectures on programmable gate arrays.

1.2 Key ZYNQ Architecture Description

The Xilinx ZYNQ device [16] contains two parts: 1- Programmable Logic (PL), which is roughly a full FPGA. 2- Programmable System (PS), which is a complete sub-system with ARM CPU cores and different peripherals [26]. A set of AXI interfaces (as shown in Figure 1) is implemented to make the communication between PS and the PL logic possible. Basically these AXI interfaces divide into two groups:

- AXI Master interfaces (GP) connect to AXI slaves residing on the PL. The CPU is able to initiate read/write transactions over these AXI masters to transfer data to PL modules. There are two 32 Bits AXI master ports available in the ZYNQ device: GP0 and GP1.
- AXI Slave interfaces (HP, ACP and SGP) connect to the implemented AXI masters on the PL. There exist four High Performance (HP) ports and one Accelerator Coherency Port (ACP). Each of these interfaces implements a full-duplex 64 Bits connection, meaning that at every clock cycle, a total of 16 Bytes of data can be transferred on the AXI read and AXI write channels concurrently. The two SGP0 and SGP1 interfaces implement 32 Bits connections.

There exists a defined memory map for the ZYNQ device [16] which indicates the address range of each logic block. Every

Figure 2: Block diagram of the developed hardware.

handling other incoming requests.

We have implemented this hardware on the ZYNQ XC7Z020-1C device available on the Xilinx ZC-702 board. The total number of used logic slices is equal to 7324, which corresponds to 55% of the total available slices. The design consumes 92 block memories of 36 Kbits size (65%) and 9 block memories of 18 KBits size (3%). The maximum clock frequency for this design is 128.2 MHz.

2.2 Software Environment

Our developed software is divided into two major parts: Linux kernel level drivers and user level applications.

At Linux kernel level, our infrastructure consists of two drivers which are called axiD and axiDdummygenerator. axiD manages the AXI masters located on the HP0 and ACP ports. The driver is responsible for:

- Memory allocation and obtaining the physical address of allocated memory buffers which will be used by AXI masters. Memory can be allocated in either cachable or non-cachable regions depending on the memory sharing method. (Further descriptions in section 3.)
- Initializing, programming and triggering the AXI masters and handling the interrupts generated by them.
- Calculating the source and destination addresses for the set of accelerators based on the status of the on-going processing tasks, number of passed loops and number of processed data chunks.
- Interaction with user-level applications: receiving raw input data from the user side, copying it to source memory buffers and then writing back the processed results to user level.
- Providing an accurate tool to measure time intervals. The driver enables access to the free running 64 Bits counters of the ZYNQ device [16] which are clocked at 333 MHz (half CPU clock frequency).
- Configuring the PL310 [10] cache controller statistics unit so that it reflects the total number of read requests received at the cache and the total number of read hits. The driver reflects these values at the beginning and ending time of each task.

The developed axiDdummygenerator does not perform any acceleration related task. When needed, it enables us to activate the dummy traffic generator AXI masters residing on the HP1 port.

At user level, we have prepared the following items:

- A simple application which communicates with the axiD driver.
- A simple memory intensive application (called background application), which allocates a memory buffer and performs arbitrary reads and writes to this buffer in an endless loop. This application will be used to demonstrate the effect of cache pollution on the performance of the ACP accelerator.
- The Oprofile statistical performance monitoring tool [22] which we have ported to the ZYNQ environment. This enables us to measure important performance metrics of the CPU sub-system.

2.3 Power Measurement

The ZC-702 board utilizes a set of power supply units which provide online sensors for voltage, current and temperature measurement [15]. Basically we sample the following consumed power values: 1- Core logic for the PL part of ZYNQ. 2- Internal logic of the PS. 3- Interfaces and I/O buffers of the PS and 4- DRAM chips. Based on our practical observations, these four items are the most power hungry
parts of the system. The sampling frequency for measurement of voltage and current values is equal to 2 Hz.

3. MEMORY SHARING METHODS

In order to evaluate different processor-accelerator memory sharing methods in terms of speed and energy efficiency, we first define a processing task. Then we define a set of processing methods to accomplish this task. Each processing method utilizes a different memory sharing scheme. We then execute each processing method on the real hardware and measure performance and power.

Processing Task: For a sample image of i bytes, perform the following: read the image from the source buffer, pass the image through the FIR filter, and finally write the output back to the destination buffer. Source and destination buffers are different. In practice, we continuously perform this operation a large number of times. This enables us to have an accurate speed and power measurement.

Processing Methods: Here, we describe the methods that we use to perform the processing task. We assign a name to each method, which will be used during the rest of this paper.

- HP0 Only: The accelerator located on HP0 is responsible for performing the processing task alone. Image source and destination buffers are allocated on the DRAM memory and in the non-cachable area. (The Linux kernel call dma_alloc_coherent is used for this purpose.)
- ACP Only: The accelerator located on ACP is responsible for accomplishing the processing task alone. Image source and destination buffers are allocated using the normal kmalloc Linux kernel call, thus they are allowed to also be cached by the CPU sub-system.
- OCM Only: The accelerator located on ACP is responsible for accomplishing the processing task alone. However, image source and destination buffers are located in the On-Chip Memory (OCM) block of the ZYNQ device. Here the allocation is done like for other hardware peripherals, using the request_mem_region and then ioremap Linux kernel calls.
- CPU Cache: The CPU core is responsible for doing the processing task alone. No accelerator is active. The source and destination image buffers are allocated using kmalloc, thus they are allowed to get cached.
- CPU no Cache: Is similar to CPU Cache, however memory allocation for the source and destination buffers is done using dma_alloc_coherent, thus they are located in a non-cachable region of memory.
- CPU HP0: The CPU and the accelerator on the HP0 port cooperate to perform the processing task. At each iteration, first the CPU reads the source image, performs the processing and writes the result back to the memory. Then it is the turn of the accelerator on the HP0 port to perform the processing task. Image source and destination buffers are allocated in a non-cachable region of memory.
- CPU ACP: Is similar to CPU HP0, however the accelerator on ACP cooperates with the CPU to accomplish the task and image buffers are allowed to be cached.
- CPU OCM: Is similar to CPU ACP, however source and destination image buffers are located on the OCM.

Figure 3: Processing bandwidth comparison of acceleration methods. Image size sweeps from 4 KB to 2048 KB.

4. EXPERIMENTAL RESULTS

We consider the processing task described in section 3 and we use each of the described methods to accomplish this task and to measure processing speed and energy. We sweep over different image size values i to evaluate the effect of used memory size on the speed of operation (i = {4, 16, 64, 128, 256, 1024, 2048} KBytes). By increasing i, we also increase the size of packets (p) transferred by the AXI masters (p = {4, 16, 64, 128, 128, 128, 128} KBytes). Although our AXI masters are capable of handling packets of up to 1 MByte, we limit the packet size to 128 KB, which is the size of the FIFOs inside the acceleration logic. During these tests, we use a fixed running frequency of 125 MHz for the entire logic residing on the PL. We measure total execution time and thus total processing bandwidth (read + write) for each case. During each test we also measure the total number of L2 cache requests and hits to have a better insight on L2 cache utilization.

Figure 3 shows the total processing bandwidth for each method. The Y axis represents total transferred data (read + write) in MB/s. The X axis represents the size of the image (i) being processed in a logarithmic scale.

The following processing methods show the highest performance: HP0 Only, ACP Only and OCM Only. For OCM Only we can perform the processing task only for limited values of i since the total On-Chip Memory available on the ZYNQ device is limited to 256 KB. For i = {4, 16, 64} KB we see almost equal performance for each of these three methods. At i = 128 KB and i = 256 KB, we notice a slight decrease in the performance of ACP Only compared to HP0 Only (1708.5 MB/s for HP0 Only vs. 1665.9 MB/s and 1640.8 MB/s for ACP Only). When the image size grows over 256 KB, a significant drop appears in the performance of ACP Only (653.3 MB/s for ACP Only vs. 1708.5 MB/s for HP0 Only).

This phenomenon can be described as follows. For each processing task, the total utilized memory (by source and destination image arrays) is 2i. If the total available cache size is L, then while 2i ≤ L the system is able to efficiently store local copies of recently used ACP accelerator data in its caches, thus providing fast access to data when needed. However, when 2i > L, it is no longer possible to cache the entire data objects used by the accelerator. As a result, some accelerator requests to the cache will fail and will eventually end up in the DRAM memory. The extra delay introduced by passing through the caches to DRAM causes a serious decrease in performance.

Figure 4: Averaged power consumption of major system blocks for each processing method.

Either an increase in i (e.g. increasing the size of the processed image) or a decrease in L (e.g. a background application is also consuming available caches) can cause the above phenomenon. In Figure 3, no background application is running on the system. The size of the available shared L2 cache is 512 KB in the ZYNQ device. As we see, the performance drop happens when 2i > 512 KB.

We now consider processing methods which fully or partially use the CPU cores to perform the processing task. The CPU no Cache method shows the lowest performance (average 140 MB/s) and CPU Cache is slightly higher (average 170 MB/s). Here, the entire processing is done only with the ALU of the CPU. Enhancement in speed is possible if we use the NEON SIMD engine of the ARM CPU cores. But even in that case the possible speed-up is around 8X [9].

Finally, we have CPU ACP, CPU OCM and CPU HP0 with speeds between the CPU only and hardware only methods. Here, cooperation of the accelerator with the CPU causes a speed-up in data processing. CPU ACP is always faster than CPU HP0. This is because of the possibility of sharing the data between CPU and accelerator on the cache (thus the CPU can access data faster). The speed of CPU OCM is always between the other two methods. For i ≤ 256 KB, CPU ACP is approximately 1.22X faster than CPU HP0. By growing the image size however, the speed of CPU ACP begins converging to CPU HP0.

Looking at the number of L2 cache hits, we notice a significant difference between the methods which use ACP and the methods which use HP0. For example, in ACP Only we see more than 2 hits per each 32 bytes (one cache-line) of processed data, while for HP0 Only this value is practically zero (in the order of 10^-5).

For each test point in Figure 3, we also measure power. Figure 4 shows the results. Considering the fact that power values do not change significantly by changing the image size for each processing method, we only show the averaged power value for all of the image sizes. In Figure 4 we show the four major power sinks (as described in section 2.3) of the ZC-702 board at the time of the tests. As shown, the HP0 Only method has the highest DRAM power. CPU Cache causes the highest power consumption by PS internal logic, and HP0 Only, ACP Only and OCM Only show the highest values of power consumption by the PL. Power consumption of the PS I/O buffers is at the same level for all methods.

Figure 5: Energy consumed for processing of one byte of data for each processing method.

Having the processing bandwidth and power, we calculate the energy consumed for processing one byte of data. Figure 5 shows energy values for each of the test points. Looking at Figure 5 we see that, for i ≤ 256 KB, ACP Only and OCM Only consume the least energy. For i > 256 KB however, HP0 Only shows better results. At the next level, among methods which utilize the CPU, cooperation of CPU and accelerator over ACP (CPU ACP) shows the lowest energy. After that, we have CPU OCM and then CPU HP0 showing more energy consumption.

4.1 Effect of Background Workloads

Now, we study the effect of background workloads on the performance of the ACP Only and HP0 Only methods. First, we turn on the dummy traffic generator on the HP1 AXI interface. This block continuously performs arbitrary read and write operations to an allocated 2 MB DRAM area. At the same time we measure the performance of ACP Only and HP0 Only for different values of i.

In another test, we execute a memory intensive background application on the ARM CPU cores. The dummy AXI traffic generator is off. The background application performs arbitrary read and write operations to an allocated 2 MB array. The array is allowed to be cached. Thus, it occupies CPU caches during execution. In fact, this test shows the effect of decreasing L on the performance of the acceleration method.

Figure 6 shows the results of both tests. In this figure, the X axis is i and the Y axis is total transferred data in MBytes/s.

Figure 6: Processing bandwidth comparison of ACP and HP0 accelerators in the presence of background traffic.

We first look at the effect of AXI dummy activity on performance. As we see, the performance drop of ACP Only is negligible. However, a major drop can be seen in the performance of HP0 Only. For example, at i = 128 KB, the HP0 Only speed is 1708.5 MB/s when there is no other activity going on in DRAM. However, its speed drops to 1382.2 MB/s when the AXI dummy interface is active and occupying DRAM bandwidth. For ACP Only the corresponding numbers are 1665.9 MB/s and 1664.3 MB/s respectively.

Looking at the results of the second test, where the cache is heavily occupied by the background application, we see a clear drop in the performance of ACP Only while HP0 Only shows a slight performance shift. In ACP + background application, the speed of the ACP accelerator does not grow for i > 16 KB. For example, at i = 128 KB, the speed of ACP Only is 1665.9 MB/s and 531.6 MB/s without and with the background application running, respectively.

Another noticeable point during the second test is that the operation speed of both ACP and HP0 accelerators, at i = 4 KB, is higher when the background application is running on the CPUs compared to when the CPUs are not doing any specific task (700.0 MB/s for HP0 and 707.3 MB/s for ACP when the background application is running, compared to 608.5 MB/s for HP0 and 631.8 MB/s for ACP while the CPUs are idle). We describe this phenomenon as follows: when the background application is running on the CPU cores, it keeps the CPU sub-system active, preventing it from going to the idle state. Thus when an AXI master finishes its current task and issues an interrupt request to the CPU, the service routine will get executed in a shorter time, and the master begins the next task faster. This description can be further confirmed by noting the fact that when the packet size increases (and thus the rate of AXI master interrupts to the CPU decreases) this speed-up disappears. For further evaluation results please refer to [26].

5. LESSONS LEARNED

Based on the obtained results we derive the following design rules:

- If a specific task should be done by cooperation of CPU and accelerator: The speed and energy consumption of the CPU ACP and CPU OCM methods are always better than (or in the worst case equal to) CPU HP0. The main drawback of using CPU ACP is that parts of the available cache space will be occupied by the acceleration task. Thus if there exists any other critical application whose performance is heavily dependent on CPU caches, it may face problems handling its duty on time. In this case, using CPU OCM (for small array sizes) and CPU HP0 (for big arrays) is recommended.
- If the task should be done by the hardware accelerator only (and then the CPU will just use the final result), then ACP Only or OCM Only might be used only when the processed array blocks are small (smaller than the size of available cache, or on-chip memory) and there is no other background application consuming these resources. Otherwise, HP0 Only always provides better results.

The above rules can also be expanded to other platforms with architectures similar to the ZYNQ.

The size of packets transferred by each AXI interface, and the maximum burst length, heavily affect the overall data transfer bandwidth. As a result, based on the traffic pattern of the processing task (e.g. the size of data chunks processed in each iteration), and the provided implementation for AXI masters, interconnects and interfaces, these parameters should be set to the largest possible values.

6. CONCLUSION

In this paper, we demonstrated an assessment of the performance of acceleration by hardware blocks which communicate with the CPU sub-system and DRAM over AXI interfaces. We selected the Xilinx ZYNQ APSoC as the target and developed an infrastructure to measure the processing speed and energy. We compared the results when each of the CPU and accelerator perform the task alone and when the CPU and accelerator cooperate to accomplish the task. Based on the results, we derived a set of rules which can be used for efficient processor-accelerator memory sharing.

7. ACKNOWLEDGMENTS

This work was supported, in parts, by EU FP7 Project Virtical (GA n. 288574).

8. REFERENCES

[1] L. Benini, E. Flamand, D. Fuin, and D. Melpignano. P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator. In Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pages 983-987, 2012.
[2] T. Berg. Maintaining i/o data coherence in embedded multicore systems. Micro, IEEE, 29(3):10-19, 2009.
[3] C. Cascaval, S. Chatterjee, H. Franke, K. Gildea, and P. Pattnaik. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development, 54(5):5:1-5:10, 2010.
[4] J. Choi, K. Nam, A. Canis, J. Anderson, S. Brown, and T. Czajkowski. Impact of cache architecture and interface on performance and area of fpga-based processor/parallel-accelerator systems. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pages 17-24, 2012.
[5] F. Clermidy, C. Bernard, R. Lemaire, J. Martin, I. Miro-Panades, Y. Thonnart, P. Vivet, and N. Wehn. Magali: A network-on-chip based multi-core system-on-chip for mimo 4g sdr. In IC Design and Technology (ICICDT), 2010 IEEE International Conference on, pages 74-77, 2010.
[6] C. Fajardo, Z. Fang, R. Iyer, G. Garcia, S. E. Lee, and L. Zhao. Buffer-integrated-cache: A cost-effective sram architecture for handheld and embedded platforms. In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, pages 966-971, 2011.
[7] P. Greenhalgh. big.little processing with arm cortex-a15 & cortex-a7. september 2011.
[8] Altera. Inc. Adding hardware accelerators to reduce power in embedded systems. september 2009.
[9] ARM. Inc. Introducing neon development, 2009. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/BABCJFDG.html.
[10] ARM. Inc. Cortex-A9 MPCore Technical Reference Manual, 2012. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0460c/CIAIIJCE.html.
[11] ARM. Inc. AMBA AXI and ACE Protocol Specification, February 2013. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ihi0022e/index.html.
[12] Synopsys. Inc. DesignWare DDR3/2 SDRAM Memory Controller, 2013. http://www.synopsys.com/dw/ipdir.php?ds=dwc_ddr3_mem.
[13] Xilinx. Inc. LogiCORE IP AXI Master Burst (DS844), June 2011. http://www.xilinx.com/support/documentation/ip_documentation/axi_master_burst/v1_00_a/ds844_axi_master_burst.pdf.
[14] Xilinx. Inc. LogiCORE IP ChipScope AXI Monitor (DS810), March 2011. http://www.xilinx.com/support/documentation/ip_documentation/chipscope_axi_monitor/v2_00_a/ds810_chipscope_axi_monitor.pdf.
[15] Xilinx. Inc. ZC-702 Evaluation Board for the Zynq-7000 XC7Z020 All Programmable SoC, April 2013. http://www.xilinx.com/support/documentation/boards_and_kits/zc702_zvik/ug850-zc702-eval-bd.pdf.
[16] Xilinx. Inc. Zynq-7000 All Programmable SoC Technical Reference Manual (UG585), March 2013. http://www.xilinx.com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf.
[17] S. Ishikawa, A. Tanaka, and T. Miyazaki. Hardware accelerator for blast. In Embedded Multicore Socs (MCSoC), 2012 IEEE 6th International Symposium on, pages 16-22, 2012.
[18] S. Kaxiras and A. Ros. Efficient, snoopless, system-on-chip coherence. In SOC Conference (SOCC), 2012 IEEE International, pages 230-235, 2012.
[19] A. Kennedy, X. Wang, and B. Liu. Energy efficient packet classification hardware accelerator. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1-8, 2008.
[20] G. Kyriazis. Heterogeneous system architecture: A technical review. Technical report, Advanced Micro Devices, August 2012.
[21] S. Lafond and J. Lilius. Interrupt costs in embedded system with short latency hardware accelerators. In Engineering of Computer Based Systems, 2008. ECBS 2008. 15th Annual IEEE International Conference and Workshop on the, pages 317-325, 2008.
[22] J. Levon, M. Johnson, et al. Oprofile: A system profiler for linux. http://oprofile.sourceforge.net/.
[23] O. Mencer. Maximum performance computing for exascale applications. In Embedded Computer Systems (SAMOS), 2012 International Conference on, pages iii-iii, 2012.
[24] M. Nadeem, S. Wong, G. Kuzmanov, and A. Shabbir. A high-throughput, area-efficient hardware accelerator for adaptive deblocking filter in h.264/avc. In Embedded Systems for Real-Time Multimedia, 2009. ESTIMedia 2009. IEEE/ACM/IFIP 7th Workshop on, pages 18-27, 2009.
[25] M. O'Connor. Accelerated processing and the fusion system architecture. In Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, pages 93-93, 2012.
[26] M. Sadri. Technical report: Energy and performance exploration of accelerator coherency port using xilinx zynq. Technical report, Department of Electrical, Electronic and Information Engineering, University of Bologna, May 2013.
[27] N. C. Stephane Eric Sebastien Brochier. Managing the storage of data in coherent data stores, 09 2009.
[28] T. Suh, D. Blough, and H.-H. Lee. Supporting cache coherence in heterogeneous multiprocessor systems. In Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, volume 2, pages 1150-1155 Vol.2, 2004.
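
The cache-capacity argument of section 4 (ACP Only slows down once the working set 2i exceeds the available cache L), the energy-per-byte calculation behind Figure 5, and the accelerator-only rule of section 5 can be illustrated with a small sketch. It is not part of the authors' infrastructure: all function names are hypothetical, 1 MB is assumed to mean 10^6 bytes, and the rule is a simplified reading of section 5 that ignores the OCM option. Only the 512 KB L2 size is taken from the text.

```python
KB = 1024

# Shared L2 cache size of the ZYNQ device as stated in section 4.
L2_CACHE_BYTES = 512 * KB


def working_set_bytes(image_bytes: int) -> int:
    """Source plus destination buffer, i.e. 2i in the notation of section 4."""
    return 2 * image_bytes


def fits_in_cache(image_bytes: int, available_cache_bytes: int = L2_CACHE_BYTES) -> bool:
    """True while 2i <= L, the regime in which ACP Only keeps up with HP0 Only."""
    return working_set_bytes(image_bytes) <= available_cache_bytes


def energy_per_byte_nj(power_watts: float, bandwidth_mb_per_s: float) -> float:
    """Energy per processed byte in nanojoules, E = P / B.

    Assumption: 1 MB/s = 10^6 bytes/s (the text does not state the convention).
    """
    bytes_per_second = bandwidth_mb_per_s * 1e6
    return power_watts * 1e9 / bytes_per_second


def suggest_accelerator_only_method(image_bytes: int,
                                    background_uses_cache: bool = False) -> str:
    """Simplified reading of the accelerator-only design rule in section 5.

    ACP Only is suggested only for small working sets with no competing
    cache user; otherwise HP0 Only. The OCM variant is left out on purpose.
    """
    if background_uses_cache or not fits_in_cache(image_bytes):
        return "HP0 Only"
    return "ACP Only"


if __name__ == "__main__":
    for size_kb in (4, 16, 64, 128, 256, 1024, 2048):  # image sizes swept in section 4
        print(size_kb, "KB ->", suggest_accelerator_only_method(size_kb * KB))
```

As a usage example, comparing `energy_per_byte_nj(p, 1708.5)` with `energy_per_byte_nj(p, 653.3)` for the same hypothetical power value `p` shows why the bandwidth drop of ACP Only above 256 KB translates into a higher energy per byte; the measured power values themselves are only reported graphically in Figure 4.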