sadr2, luca.benini (unibo.it); weis, wehn (eit.uni-kl.de)

ABSTRACT

Cooperation of CPU and hardware accelerator to accomplish computationally intensive tasks provides significant advantages in runtime speed and energy. Efficient management of data sharing among mu
• Evaluation of speed and energy efficiency of each method by running practical experiments on the hardware (section 4).
• Description of the lessons learnt from the results.

The authors provide the infrastructure, containing hardware projects, firmware and developed software, as open source code and free of charge to the research community.

1.1 Related Work

The Heterogeneous System Architecture (HSA) foundation provides architectural and application level solutions to help system designers integrate different kinds of heterogeneous computing units in a way that eliminates the inefficiencies of sharing data and sending work items between them [20]. The developed acceleration hardware blocks become HSA compliant by declaration of the necessary low-level interface layers. This frees the programmers from the burden of tailoring a program to a specific hardware platform. As a part of HSA, [25] describes the AMD Fusion System Architecture, targeted to unify CPUs and GPUs in a flexible computing fabric. The proposed idea, however, is mainly developed for sharing memory between CPU and fully programmable GPU cores and is not targeting reconfigurable heterogeneous architectures like ZYNQ.

A methodology for analyzing the impact of hardware accelerator data transfer granularity on the performance of a typical embedded system is presented in [21]. This is particularly important because, as we will show, the granularity of data transfer between memory and accelerator, and thus the interrupt rate to the CPU, has a direct impact on the performance.

The idea of using a portion of the CPU sub-system caches as buffers for the accelerators is studied in [6]. This results in smaller silicon area since each accelerator does not instantiate its own buffer. The basic idea of dedicating a shared memory space to accelerators is interesting because the ZYNQ device provides a dedicated On-Chip Memory (OCM) which can be used for the same purpose. We also consider processor-accelerator memory sharing using OCM in our tests.

The problem of maintaining coherency between CPU caches and accelerator data in a multi-core embedded system is addressed in [2]. The paper discusses possible hardware architectures and related software solutions to tackle the problem. It concludes that the optimal solution heavily depends on the characteristics of the application. The paper discusses the solutions at architecture level and does not provide detailed practical comparisons of the performance and energy efficiency of each solution.

An area- and power-efficient many-core computing fabric which features clusters of up to 16 processor cores is proposed in [1]. The developed platform delivers an extraordinary level of computational speed (80 GOPS) while consuming a relatively small amount of power (2 W). The paper is of particular importance since it provides ideas on the development of energy efficient accelerator logic. Indeed, the architecture developed in our paper is partially inspired by the method used by [1] to connect to its host CPU.

A high-performance, energy improved mobile processing platform named big.LITTLE is introduced by [7]. The platform consists of a high performance Cortex-A15 processor and an energy efficient Cortex-A7. The connection between the CPU sub-systems is provided through the CCI-400 interconnect, which facilitates full coherency between Cortex-A15 and Cortex-A7 as well as GPUs, accelerators and I/O. This platform, if connected to a programmable gate array, can provide a suitable test bed for evaluation of various processor-accelerator memory sharing schemes.

The impact of cache architecture on the performance and area of FPGA based processor and parallel accelerator systems is discussed in [4]. The paper proposes a simple hardware containing one MIPS core, multiple accelerator units, a multi-port shared L1 cache and a DRAM controller. It considers different structural parameters for the L1 cache (such as number of ports, associativity, etc.) and defines a set of computational tasks to be done only by accelerators. It then quantifies the impact of cache structure on the overall speed of accelerators connected to the L1 cache. The paper does not discuss the cooperative operation of CPU and accelerators. Moreover, the hardware developed in the paper is very simple. It is not capable of booting an operating system and communicating with the outside world.

The idea of adding hardware accelerators to reduce power in FPGAs is investigated in [8]. The paper shows practical comparisons of the power consumption of sample computational tasks when they are executed by the CPU or by the accelerator logic. The paper does not address issues related to coherency and processor-accelerator memory sharing.

To the authors' knowledge, this is the first work which practically quantifies the potential processing bandwidth and energy efficiency of different processor-accelerator memory sharing methods using the ZYNQ device. Moreover, it is the first work which provides an explicit practical comparison, in terms of energy and speed, of processor-accelerator memory sharing using ACP and other traditional methods. In addition, we also provide a flexible research vehicle which facilitates evaluation of innovative ideas regarding the design of hardware accelerators in heterogeneous architectures on programmable gate arrays.

1.2 Key ZYNQ Architecture Description

The Xilinx ZYNQ device [16] contains two parts: 1- Programmable Logic (PL), which is roughly a full FPGA. 2- Processing System (PS), which is a complete sub-system with ARM CPU cores and different peripherals [26]. A set of AXI interfaces (as shown in Figure 1) is implemented to make the communication between the PS and the PL logic possible. Basically these AXI interfaces divide into two groups:

• AXI Master interfaces (GP) connect to AXI slaves residing on the PL. The CPU is able to initiate read/write transactions over these AXI masters to transfer data to PL modules. There are two 32-bit AXI master ports available in the ZYNQ device: GP0 and GP1.
• AXI Slave interfaces (HP, ACP and SGP) connect to the AXI masters implemented on the PL. There exist four High Performance (HP) ports and one Accelerator Coherency Port (ACP). Each of these interfaces implements a full-duplex 64-bit connection, meaning that at every clock cycle a total of 16 bytes of data can be transferred on the AXI read and AXI write channels concurrently. The two SGP0 and SGP1 interfaces implement 32-bit connections.

There exists a defined memory map for the ZYNQ device [16] which indicates the address range of each logic block.
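The port widths described above translate directly into a theoretical peak transfer rate. The following is a minimal sketch of that arithmetic, not part of the authors' infrastructure: the function name is hypothetical, 1 MB is taken as 10^6 bytes, and the 125 MHz PL clock and the 1708.5 MB/s figure are the values reported later in Section 4.

```python
def axi_peak_bandwidth_mb_s(bus_width_bits, clock_mhz, full_duplex=True):
    """Theoretical peak throughput of one AXI port in MB/s (1 MB = 10**6 bytes)."""
    bytes_per_cycle = bus_width_bits // 8   # bytes moved per cycle in one direction
    if full_duplex:
        bytes_per_cycle *= 2                # read and write channels active together
    return bytes_per_cycle * clock_mhz      # bytes/cycle * 10**6 cycles/s


# 64-bit HP or ACP port driven at the 125 MHz PL clock used in Section 4
peak = axi_peak_bandwidth_mb_s(64, 125.0)
measured = 1708.5                           # best HP0 Only bandwidth reported in Section 4
print(f"peak = {peak:.1f} MB/s, measured / peak = {measured / peak:.3f}")
```

Under these assumptions the measured read+write bandwidth of the fastest method is roughly 85% of the theoretical port limit.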
Figure 2: Block diagram of the developed hardware.

handling other incoming requests.

We have implemented this hardware on the ZYNQ XC7Z020-1C device available on the Xilinx ZC-702 board. The total number of used logic slices is equal to 7324, which corresponds to 55% of the total available slices. The design consumes 92 block memories of 36 Kbits size (65%) and 9 block memories of 18 Kbits size (3%). The maximum clock frequency for this design is 128.2 MHz.

2.2 Software Environment

Our developed software is divided into two major parts: Linux kernel level drivers and user level applications.

At Linux kernel level, our infrastructure consists of two drivers which are called axiD and axiD dummy generator. axiD manages the AXI masters located on the HP0 and ACP ports. The driver is responsible for:

• Memory allocation and obtaining the physical address of allocated memory buffers which will be used by AXI masters. Memory can be allocated in either cachable or non-cachable regions depending on the memory sharing method. (Further descriptions in section 3.)
• Initializing, programming and triggering the AXI masters and handling the interrupts generated by them.
• Calculating the source and destination addresses for the set of accelerators based on the status of the on-going processing tasks, number of passed loops and number of processed data chunks.
• Interaction with user-level applications: receiving raw input data from the user side, copying it to source memory buffers and then writing back the processed results to user level.
• Providing an accurate tool to measure time intervals. The driver enables access to the free running 64-bit counters of the ZYNQ device [16], which are clocked at 333 MHz (half the CPU clock frequency).
• Configuring the PL310 [10] cache controller statistics unit so that it reflects the total number of read requests received at the cache and the total number of read hits. The driver reflects these values at the beginning and ending time of each task.

The developed axiD dummy generator does not perform any acceleration related task. When needed, it enables us to activate the dummy traffic generator AXI masters residing on the HP1 port.

At user level, we have prepared the following items:

• A simple application which communicates with the axiD driver.
• A simple memory intensive application (called background application), which allocates a memory buffer and performs arbitrary reads and writes to this buffer in an endless loop. This application will be used to demonstrate the effect of cache pollution on the performance of the ACP accelerator.
• The OProfile statistical performance monitoring tool [22], which we have ported to the ZYNQ environment. This enables us to measure important performance metrics of the CPU sub-system.

2.3 Power Measurement

The ZC-702 board utilizes a set of power supply units which provide online sensors for voltage, current and temperature measurement [15]. Basically we sample the following consumed power values: 1- Core logic for the PL part of ZYNQ. 2- Internal logic of the PS. 3- Interfaces and I/O buffers of the PS and 4- DRAM chips. Based on our practical observations, these four items are the most power hungry
parts of the system. The sampling frequency for measurement of voltage and current values is equal to 2 Hz.

3. MEMORY SHARING METHODS

In order to evaluate different processor-accelerator memory sharing methods in terms of speed and energy efficiency, we first define a processing task. Then we define a set of processing methods to accomplish this task. Each processing method utilizes a different memory sharing scheme. We then execute each processing method on the real hardware and measure performance and power.

Processing Task: For a sample image of i bytes, perform the following: read the image from the source buffer, pass the image through the FIR filter, and finally write the output back to the destination buffer. Source and destination buffers are different. In practice, we continuously perform this operation a large number of times. This enables us to have an accurate speed and power measurement.

Processing Methods: Here, we describe the methods that we use to perform the processing task. We assign a name to each method, which will be used during the rest of this paper.

• HP0 Only: The accelerator located on HP0 is responsible for performing the processing task alone. Image source and destination buffers are allocated in DRAM and in the non-cachable area. (The Linux kernel call dma_alloc_coherent is used for this purpose.)
• ACP Only: The accelerator located on ACP is responsible for accomplishing the processing task alone. Image source and destination buffers are allocated using the normal kmalloc Linux kernel call, thus they are also allowed to be cached by the CPU sub-system.
• OCM Only: The accelerator located on ACP is responsible for accomplishing the processing task alone. However, image source and destination buffers are located in the On-Chip Memory (OCM) block of the ZYNQ device. Here the allocation is done as for other hardware peripherals, using the request_mem_region and then ioremap Linux kernel calls.
• CPU Cache: The CPU core is responsible for doing the processing task alone. No accelerator is active. The source and destination image buffers are allocated using kmalloc, thus they are allowed to get cached.
• CPU noCache: Is similar to CPU Cache; however, memory allocation for the source and destination buffers is done using dma_alloc_coherent, thus they are located in a non-cachable region of memory.
• CPU HP0: The CPU and the accelerator on the HP0 port cooperate to perform the processing task. At each iteration, first the CPU reads the source image, performs the processing and writes the result back to the memory. Then it is the turn of the accelerator on the HP0 port to perform the processing task. Image source and destination buffers are allocated in a non-cachable region of memory.
• CPU ACP: Is similar to CPU HP0; however, the accelerator on ACP cooperates with the CPU to accomplish the task and image buffers are allowed to be cached.
• CPU OCM: Is similar to CPU ACP; however, source and destination image buffers are located in the OCM.

Figure 3: Processing bandwidth comparison of acceleration methods. Image size sweeps from 4 KB to 2048 KB.

4. EXPERIMENTAL RESULTS

We consider the processing task described in section 3 and we use each of the described methods to accomplish this task and to measure processing speed and energy. We sweep over different image size i values to evaluate the effect of used memory size on the speed of operation (i = {4, 16, 64, 128, 256, 1024, 2048} KBytes). By increasing i, we also increase the size of packets (p) transferred by the AXI masters (p = {4, 16, 64, 128, 128, 128, 128} KBytes). Although our AXI masters are capable of handling packets of up to 1 MByte, we limit the packet size to 128 KB, which is the size of the FIFOs inside the acceleration logic. During these tests, we use a fixed running frequency of 125 MHz for the entire logic residing on the PL. We measure total execution time and thus total processing bandwidth (read+write) for each case. During each test we also measure the total number of L2 cache requests and hits to have a better insight into L2 cache utilization.

Figure 3 shows the total processing bandwidth for each method. The Y axis represents total transferred data (read+write) in MB/s. The X axis represents the size of the image (i) being processed, in a logarithmic scale.

The following processing methods show the highest performance: HP0 Only, ACP Only and OCM Only. For OCM Only we can perform the processing task only for limited values of i, since the total On-Chip Memory available on the ZYNQ device is limited to 256 KB. For i = {4, 16, 64} KB we see almost equal performance for each of these three methods. At i = 128 KB and i = 256 KB, we notice a slight decrease in the performance of ACP Only compared to HP0 Only (
1708.5 MB/s for HP0 Only vs. 1665.9 MB/s and 1640.8 MB/s for ACP Only). When the image size grows over 256 KB, a significant drop appears in the performance of ACP Only (653.3 MB/s for ACP Only vs. 1708.5 MB/s for HP0 Only).

This phenomenon can be described as follows. For each processing task, the total utilized memory (by source and destination image arrays) is 2i. If the total available cache size is L, then while 2i ≤ L the system is able to efficiently store local copies of recently used ACP accelerator data in its caches, thus providing fast access to data when needed. However, when 2i > L, it is no longer possible to cache the entire data objects used by the accelerator. As a result, some accelerator requests to the cache will fail and will eventually end up in the DRAM memory. The extra delay introduced by passing through the caches to DRAM causes a serious decrease in performance. Either an increase in i (e.g. increasing the size of the processed image) or a decrease in L (e.g. a background application is also consuming available caches) can cause the above phenomenon. In Figure 3, no background application is running on the system. The size of the available shared L2 cache is 512 KB in the ZYNQ device. As we see, the performance drop happens when 2i > 512 KB.

Figure 4: Averaged power consumption of major system blocks for each processing method.

We now consider processing methods which fully or partially use the CPU cores to perform the processing task. The CPU noCache method shows the lowest performance (average 140 MB/s) and CPU Cache is slightly higher (average 170 MB/s). Here, the entire processing is done only with the ALU of the CPU. An enhancement in speed is possible if we use the NEON SIMD engine of the ARM CPU cores. But even in that case the possible speed-up is around 8X [9].

Finally, we have CPU ACP, CPU OCM and CPU HP0 with speeds between the CPU only and hardware only methods. Here, cooperation of the accelerator with the CPU causes a speed-up in data processing. CPU ACP is always faster than CPU HP0. This is because of the possibility of sharing the data between CPU and accelerator in the cache (thus the CPU can access data faster). The speed of CPU OCM is always between the other two methods. For i ≤ 256 KB, CPU ACP is approximately 1.22X faster than CPU HP0. As the image size grows, however, the speed of CPU ACP begins converging to that of CPU HP0.

Looking at the number of L2 cache hits, we notice a significant difference between the methods which use ACP and the methods which use HP0. For example, in ACP Only we see more than 2 hits per 32 bytes (one cache line) of processed data, while for HP0 Only this value is practically zero (in the order of 10^-5).

For each test point in Figure 3, we also measure power. Figure 4 shows the results. Since power values do not change significantly with the image size for each processing method, we only show the averaged power value over all of the image sizes. In Figure 4 we show the four major power sinks (as described in section 2.3) of the ZC-702 board at the time of the tests. As shown, the HP0 Only method has the highest DRAM power. CPU Cache causes the highest power consumption by PS internal logic, and HP0 Only, ACP Only and OCM Only show the highest values of power consumption by the PL. Power consumption of the PS I/O buffers is at the same level for all methods.

Figure 5: Energy consumed for processing of one byte of data for each processing method.

Having the processing bandwidth and power, we calculate the energy consumed for processing one byte of data. Figure 5 shows energy values for each of the test points. Looking at Figure 5 we see that, for i ≤ 256 KB, ACP Only and OCM Only consume the least energy. For i > 256 KB, however, HP0 Only shows better results. At the next level, among methods which utilize the CPU, cooperation of CPU and accelerator over ACP (CPU ACP) shows the lowest energy. After that, we have CPU OCM and then CPU HP0 showing more energy consumption.

4.1 Effect of Background Workloads

Now, we study the effect of background workloads on the performance of the ACP Only and HP0 Only methods. First, we turn on the dummy traffic generator on the HP1 AXI interface. This block continuously performs arbitrary read and write operations to an allocated 2 MB DRAM area. At the same time we measure the performance of ACP Only and HP0 Only for different values of i.

In another test, we execute a memory intensive background application on the ARM CPU cores. The dummy AXI traffic generator is off. The background application performs arbitrary read and write operations to an allocated 2 MB array. The array is allowed to be cached. Thus, it occupies CPU caches during execution. In fact, this test shows the effect of decreasing L on the performance of the acceleration method. Figure 6 shows the results of both tests. In this figure, the X axis is i and the Y axis is total transferred data in MBytes/s.

Figure 6: Processing bandwidth comparison of ACP and HP0 accelerators in the presence of background traffic.

We first look at the effect of AXI dummy activity on performance. As we see, the performance drop of ACP Only is negligible. However, a major drop can be seen in the performance of HP0 Only. For example, at i = 128 KB, HP0 Only speed is 1708.5 MB/s when there is no other activity going on in DRAM. However, its speed drops to 1382.2 MB/s when the AXI dummy interface is active and occupying DRAM bandwidth. For ACP Only the corresponding numbers are 1665.9 MB/s and 1664.3 MB/s respectively.

Looking at the results of the second test, where the cache is heavily occupied by the background application, we see a clear drop in the performance of ACP Only while HP0 Only shows a slight performance shift. In the ACP + background application case, the speed of the ACP accelerator does not grow for i > 16 KB. For example, at i = 128 KB the speed of ACP Only is 1665.9 MB/s without and 531.6 MB/s with the background application running.

Another noticeable point during the second test is that the operation speed of both the ACP and HP0 accelerators, at i = 4 KB, is higher when the background application is running on the CPUs compared to when the CPUs are not doing any specific task (700.0 MB/s for HP0 and 707.3 MB/s for ACP when the background application is running, compared to 608.5 MB/s for HP0 and 631.8 MB/s for ACP while the CPUs are idle). We describe this phenomenon as follows: when the background application is running on the CPU cores, it keeps the CPU sub-system active, preventing it from going to idle state. Thus when an AXI master finishes its current task and issues an interrupt request to the CPU, the service routine gets executed in a shorter time, and the master begins the next task faster. This description can be further confirmed by noting that when the packet size increases (and thus the rate of AXI master interrupts to the CPU decreases) this speed-up disappears. For further evaluation results please refer to [26].

5. LESSONS LEARNED

Based on the obtained results we derive the following design rules:

• If a specific task should be done by cooperation of CPU and accelerator: The speed and energy consumption of the CPU ACP and CPU OCM methods are always better than (or in the worst case equal to) CPU HP0. The main drawback of using CPU ACP is that parts of the available cache space will be occupied by the acceleration task. Thus if there exists any other critical application whose performance is heavily dependent on CPU caches, it may face problems handling its duty on time. In this case, using CPU OCM (for small array sizes) and CPU HP0 (for big arrays) is recommended.
• If the task should be done by the hardware accelerator only (and then the CPU will just use the final result), then ACP Only or OCM Only might be used only when the processed array blocks are small (smaller than the size of the available cache, or on-chip memory) and there is no other background application consuming these resources. Otherwise, HP0 Only always provides better results.

The above rules can also be extended to other platforms with architectures similar to the ZYNQ.

The size of packets transferred by each AXI interface, and the maximum burst length, heavily affect the overall data transfer bandwidth. As a result, based on the traffic pattern of the processing task (e.g. the size of data chunks processed in each iteration) and the provided implementation of AXI masters, interconnects and interfaces, these parameters should be set to the largest possible values.

6. CONCLUSION

In this paper, we presented an assessment of the performance of acceleration by hardware blocks which communicate with the CPU sub-system and DRAM over AXI interfaces. We selected the Xilinx ZYNQ AP SoC as the target and developed an infrastructure to measure the processing speed and energy. We compared the results when each of the CPU and accelerator performs the task alone and when the CPU and accelerator cooperate to accomplish the task. Based on the results, we derived a set of rules which can be used for efficient processor-accelerator memory sharing.

7. ACKNOWLEDGMENTS

This work was supported, in parts, by EU FP7 Project Virtical (GA n. 288574).

8. REFERENCES

[1] L. Benini, E. Flamand, D. Fuin, and D. Melpignano. P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator. In Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pages 983-987, 2012.
[2] T. Berg. Maintaining i/o data coherence in embedded multicore systems. Micro, IEEE, 29(3):10-19, 2009.
[3] C. Cascaval, S. Chatterjee, H. Franke, K. Gildea, and P. Pattnaik. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development, 54(5):5:1-5:10, 2010.
[4] J. Choi, K. Nam, A. Canis, J. Anderson, S. Brown, and T. Czajkowski. Impact of cache architecture and interface on performance and area of fpga-based processor/parallel-accelerator systems. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pages 17-24, 2012.
[5] F. Clermidy, C. Bernard, R. Lemaire, J. Martin, I. Miro-Panades, Y. Thonnart, P. Vivet, and N. Wehn. Magali: A network-on-chip based multi-core system-on-chip for mimo 4g sdr. In IC Design and Technology (ICICDT), 2010 IEEE International Conference on, pages 74-77, 2010.
[6] C. Fajardo, Z. Fang, R. Iyer, G. Garcia, S. E. Lee, and L. Zhao. Buffer-integrated-cache: A cost-effective sram architecture for handheld and embedded platforms. In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, pages 966-971, 2011.
[7] P. Greenhalgh. big.little processing with arm cortex-a15 & cortex-a7. september 2011.
[8] Altera. Inc. Adding hardware accelerators to reduce power in embedded systems. september 2009.
[9] ARM. Inc. Introducing neon development, 2009. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/BABCJFDG.html.
[10] ARM. Inc. Cortex-A9 MPCore Technical Reference Manual, 2012. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0460c/CIAIIJCE.html.
[11] ARM. Inc. AMBA AXI and ACE Protocol Specification, February 2013. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ihi0022e/index.html.
[12] Synopsys. Inc. DesignWare DDR3/2 SDRAM Memory Controller, 2013. http://www.synopsys.com/dw/ipdir.php?ds=dwc_ddr3_mem.
[13] Xilinx. Inc. LogiCORE IP AXI Master Burst (DS844), June 2011. http://www.xilinx.com/support/documentation/ip_documentation/axi_master_burst/v1_00_a/ds844_axi_master_burst.pdf.
[14] Xilinx. Inc. LogiCORE IP ChipScope AXI Monitor (DS810), March 2011. http://www.xilinx.com/support/documentation/ip_documentation/chipscope_axi_monitor/v2_00_a/ds810_chipscope_axi_monitor.pdf.
[15] Xilinx. Inc. ZC-702 Evaluation Board for the Zynq-7000 XC7Z020 All Programmable SoC, April 2013. http://www.xilinx.com/support/documentation/boards_and_kits/zc702_zvik/ug850-zc702-eval-bd.pdf.
[16] Xilinx. Inc. Zynq-7000 All Programmable SoC Technical Reference Manual (UG585), March 2013. http://www.xilinx.com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf.
[17] S. Ishikawa, A. Tanaka, and T. Miyazaki. Hardware accelerator for blast. In Embedded Multicore Socs (MCSoC), 2012 IEEE 6th International Symposium on, pages 16-22, 2012.
[18] S. Kaxiras and A. Ros. Efficient, snoopless, system-on-chip coherence. In SOC Conference (SOCC), 2012 IEEE International, pages 230-235, 2012.
[19] A. Kennedy, X. Wang, and B. Liu. Energy efficient packet classification hardware accelerator. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1-8, 2008.
[20] G. Kyriazis. Heterogeneous system architecture: A technical review. Technical report, Advanced Micro Devices, August 2012.
[21] S. Lafond and J. Lilius. Interrupt costs in embedded system with short latency hardware accelerators. In Engineering of Computer Based Systems, 2008. ECBS 2008. 15th Annual IEEE International Conference and Workshop on the, pages 317-325, 2008.
[22] J. Levon, M. Johnson, et al. Oprofile: A system profiler for linux. http://oprofile.sourceforge.net/.
[23] O. Mencer. Maximum performance computing for exascale applications. In Embedded Computer Systems (SAMOS), 2012 International Conference on, pages iii-iii, 2012.
[24] M. Nadeem, S. Wong, G. Kuzmanov, and A. Shabbir. A high-throughput, area-efficient hardware accelerator for adaptive deblocking filter in h.264/avc. In Embedded Systems for Real-Time Multimedia, 2009. ESTIMedia 2009. IEEE/ACM/IFIP 7th Workshop on, pages 18-27, 2009.
[25] M. O'Connor. Accelerated processing and the fusion system architecture. In Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, pages 93-93, 2012.
[26] M. Sadri. Technical report: Energy and performance exploration of accelerator coherency port using xilinx zynq. Technical report, Department of Electrical, Electronic and Information Engineering, University of Bologna, May 2013.
[27] N. C. Stephane Eric Sebastien Brochier. Managing the storage of data in coherent data stores, 09 2009.
[28] T. Suh, D. Blough, and H.-H. Lee. Supporting cache coherence in heterogeneous multiprocessor systems. In Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, volume 2, pages 1150-1155 Vol.2, 2004.
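As a supplementary illustration of two calculations from Section 4, the sketch below checks the cache-capacity condition (source plus destination buffers, 2i, against the cache size L) and computes energy per byte from power and bandwidth. It is a minimal sketch under stated assumptions: the function names are hypothetical, 1 MB is taken as 10^6 bytes, and the 1.0 W power value is a placeholder, since the measured power values appear only in Figure 4 and are not given numerically in the text.

```python
L2_CACHE_KB = 512                                   # shared L2 size stated in Section 4
IMAGE_SIZES_KB = [4, 16, 64, 128, 256, 1024, 2048]  # image size sweep used in Section 4


def fits_in_cache(image_kb, cache_kb=L2_CACHE_KB):
    """True while source plus destination buffers (2i) do not exceed the cache size L."""
    return 2 * image_kb <= cache_kb


def energy_per_byte_nj(power_w, bandwidth_mb_s):
    """Energy per processed byte in nJ: power divided by bandwidth (1 MB = 10**6 bytes)."""
    return power_w * 1e3 / bandwidth_mb_s


# Image sizes for which the ACP accelerator's working set can stay in the L2 cache
cached_sizes = [i for i in IMAGE_SIZES_KB if fits_in_cache(i)]
print(cached_sizes)

# Placeholder power value (assumption), combined with the HP0 Only bandwidth from Section 4
HYPOTHETICAL_POWER_W = 1.0
print(energy_per_byte_nj(HYPOTHETICAL_POWER_W, 1708.5))
```

With L = 512 KB the condition holds up to i = 256 KB, which matches the image size beyond which the paper reports the ACP Only bandwidth drop.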