[Figure 4: Total data stalls and processing time in IPs during execution. Y-axis: % of time spent, split into processing cycles and data stall cycles, per IP, for Angry Birds, Snd. Rec, MP3, Photos, CamPic, Skype, Video Record, and Youtube.]

[Figure 5: Trends showing increase of percentage of data stalls with each newer generation of IPs and DRAMs. Y-axis: fraction of time an IP stalls in memory; configurations: 0.5x-LPDDR2-800, Base-LPDDR3-1333, 2x-LPDDR3-1600, 4x-LPDDR4-3200.]

[Figure 6: Percentage of frames completed in a subset of applications with varying memory bandwidths. Y-axis: % of frames shown; applications A6, A7, A8, and AVG (all); bandwidths 5.3, 10.6, and 21.2 GBPS.]

to mean the number of cycles an IP stalls for data without doing any useful computation, after issuing a request to the memory. We observe from Figure 4 that the video decoder and video encoder IPs spend most of their time processing the data, and do not stress the memory subsystem. IPs that have very small compute time, like the audio decoder and sound engine, demand much higher bandwidth than what memory can provide, and thus tend to stall more than compute. Camera IP and graphics IP, on the other hand, send bursts of requests for large frames of data at regular intervals. Here as well, if memory is not able to meet the high bandwidth or has high latency, the IP remains in the high-power mode stalling for the requests. The high data stalls seen in Figure 4 translate to frame drops, which are shown in Figure 6 (for 5.3 GBPS memory bandwidth). We see that on average 24% of the frames are dropped with the default baseline system, which can hurt user experience with the device. With higher memory bandwidths (2x and 4x of the baseline bandwidth), though the frame drops decrease, they still do not improve as much as the increase in bandwidth. Even with 4x baseline bandwidth, we observe more than 10% frame drops (because of higher memory latencies).

As user demands increase and more use-cases need to be supported, the number of IPs in the SoC is likely to
increase [44] along with data sizes [12]. Even as the DRAM speeds increase, the need to go off-chip for data accesses creates a significant bottleneck. This affects performance, power and eventually the overall user experience. To understand the severity of this problem, we conduct a simple experiment shown in Figure 5, demonstrating how the cycles per frame vary across the base system (given in Table III) and when the IPs compute at half their base speed (last generation IPs), and twice their speed (next generation) etc. For DRAM, we varied the memory throughput by varying the LPDDR configurations. We observe from the results in Figure 5 that the percentage of data stalls increases as we go from one generation to the next. Increasing the DRAM peak bandwidth alone is not sufficient to match the IP scaling. We require solutions that can tackle this problem within the SoC.

Further, to establish the maximum gains that can be obtained if we had an ideal and perfect memory, we did a hypothetical study of perfect memory with 1 cycle latency. The cycles-per-frame results with this perfect memory system are shown in Figure 7. As expected, we observed

[Figure 7: Percentage reduction in Cycles-Per-Frame in different flows with a perfect memory configuration. Y-axis: % cycles per frame; flows grouped by application A1 to A8.]

rendering when operating on frames and for the display rendering. These tiles are similar to the sub-frames proposed in this work. A typical tiled rendering algorithm first transforms the geometries to be drawn (multiple objects on the screen) into screen space and assigns them into tiles. Each screen-space tile holds a list of geometries that need to be drawn in that tile. This list is sorted front to back, and the geometry behind another is clipped away and only the ones in the front are drawn to the screen. GPUs render each tile separately to a local on-chip buffer, which is finally written back to main memory inside a frame buffer region. From this region, the display controller reads the frame to be shown on screen. All tiles are independent of each other, and thus form a sub-frame in our system.
B. OS and Hardware Support

In current systems, IPs are invoked sequentially one after another per frame. Let us revisit the example considered previously: a flow with 3 IPs. The OS, through device drivers, calls the first IP in the flow. It waits for the processing to complete and the data to be written back to memory and then calls the second IP. After the second IP finishes its processing, the third IP is called. With sub-frames, when the data is ready in the first IP, the second IP is notified of the incoming request so that it can be ready (by entering the active state from a low-power state) when the data arrives. We envision that the OS can capture this information through a library (or multiple libraries) since the IP-flows for each application are pretty standard. In Android [15] for instance, there is a layer of libraries (Hardware Abstraction Layer, HAL) that interface with the underlying device drivers, and these HALs are specific to IPs. As devices evolve, HAL and the corresponding drivers are expected to enable access to devices to run different applications. By adding an SA HAL and its driver counterpart to communicate the flow information, we can accomplish our requirements. From the application's perspective, this is transparent since the access to the SA HAL happens from within other HALs as they are requested by the applications. Figure 16 shows a high-level view of the sub-frame implementation in SA along with our short-circuiting techniques.

From a hardware perspective, to enable sub-framing of data, the SA needs to have a small matrix of all IPs, with rows corresponding to producers and columns to consumers. Each entry in the row is 1 bit per IP. Currently, we are looking at about 8 IPs, and this is about 8 bytes in total. In future, even as we grow to 100 IPs, the size of the matrix is small. As each IP completes its sub-frame, the SA looks at its matrix and informs the consumer IP. In situations where we have multiple flows (currently Android allows two applications to run simultaneously [32]) with an IP in common, the entries in the SA for the common IP can be swapped in or out along with the context of the application running. This will maintain the correct consumer IP status in the matrix for any IP.
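The producer/consumer matrix described above can be illustrated with a minimal sketch. This is only a software model under our own assumptions, not the hardware design: the class name `FlowMatrix`, its methods, and the IP indices are hypothetical, and each row is stored as one integer bitmask with one bit per potential consumer.

```python
class FlowMatrix:
    """Hypothetical software model of the SA's producer-to-consumer bit matrix."""

    def __init__(self, num_ips):
        self.num_ips = num_ips
        # One bitmask per producer row; bit c set means IP c consumes from this producer.
        self.rows = [0] * num_ips

    def connect(self, producer, consumer):
        """Record that `consumer` reads the sub-frames written by `producer`."""
        self.rows[producer] |= 1 << consumer

    def consumers_of(self, producer):
        """IPs the SA would inform when `producer` completes a sub-frame."""
        return [c for c in range(self.num_ips) if (self.rows[producer] >> c) & 1]

    def swap_row(self, ip, new_row):
        """Swap a row in or out with the application context (assumed mechanism)."""
        old_row = self.rows[ip]
        self.rows[ip] = new_row
        return old_row

    def size_bytes(self):
        """Storage for num_ips x num_ips one-bit entries, rounded up to bytes."""
        return (self.num_ips * self.num_ips + 7) // 8


# Example flow CAM -> IMG -> DC, using made-up IP indices.
CAM, IMG, DC = 0, 1, 2
matrix = FlowMatrix(8)
matrix.connect(CAM, IMG)
matrix.connect(IMG, DC)
print(matrix.consumers_of(CAM), matrix.size_bytes(), FlowMatrix(100).size_bytes())
```

With 8 IPs the model needs 64 bits, matching the 8 bytes quoted in the text; with 100 IPs it needs 10,000 bits, or 1,250 bytes, which is still small.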
VII. EVALUATION

In this section, we present the performance and power benefits obtained by using sub-frames compared to the conventional baseline system which uses full frames in IP flows. We used a modified version of the GemDroid infrastructure [7] for the evaluation. For each application evaluated, we captured and ran the traces either to completion or for a fixed time. The trace lengths varied from about 2 secs to 30 secs. Usually this length is limited by the frame sizes we needed to capture. Workloads evaluated are listed in Table II and the configuration of the system used is described in Section II-D and Table III. For this evaluation we used a sub-frame size of 32 cache lines (2 KB). The transaction queue and bank queues in SA can hold 64 entries (totaling 8 KB). For the flow buffer solution, we used a 32 KB buffer (based on the hit rates observed in Figure 14).

User Experience: As a measure of user experience, we track the number of frames that could complete in each flow. The more frames that get completed, the fewer the frame drops and the better the user experience. Figure 17 shows the number of frames completed in different schemes. The y-axis shows the percentage of frames completed out of the total frames in an application trace. The first bar shows the frames completed in the baseline system with full frame flows. The second and third bars show the percentage of frames completed with our two techniques. In the baseline system, only around 76% of frames were displayed. By using our two techniques, the percentage of frames completed improved to 92% and 88%, respectively. Improvements in our schemes are mainly attributed to the reduced memory bandwidth demand and improved memory latency as the consumer IP's requests are served through the flow-buffers or by short-circuiting the memory requests. The hit rates of consumers' requests were previously shown in Figure 14. In some cases, flow-buffers perform better than short-circuiting due to the space advantage in the flow buffering technique.
Performance Gains: To understand the effectiveness of our schemes, we plot the average number of cycles taken to process a frame in each flow in Figure 18. This is the time between the invocation of the first IP and completion of the last IP in each flow. Note that reducing the cycles per frame can lead to fewer frame drops. When we use our techniques with sub-framing, due to pipelining of intra-frame data across multiple IPs instead of sequentially processing one frame after another, we are able to substantially reduce the cycles per frame by 45% on average. We also observed that in the A6-Skype application (which has multiple flows), through the use of sub-framing, the memory subsystem gets overwhelmed because we allow more IPs to work at the same time. This is not the case in the base system. If IPs do not benefit from flow-buffers or IP request short-circuiting, the memory pressure is more than the baseline, leading to some performance loss (17%).

Energy Gains: Energy efficiency is a very important

Short-Circuiting Memory Traffic in Handheld Platforms

Praveen Yedlapalli, Nachiappan Chidambaram Nachiappan, Niranjan Soundararajan, Anand Sivasubramaniam, Mahmut T. Kandemir, Chita R. Das
The Pennsylvania State University, Intel Corp.
Email: {praveen,nachi,anand,kandemir,das}@cse.psu.edu, {niranjan.k.soundararajan}@intel.com

Abstract

through the system and show that, by communicating at frame granularity, we miss significant performance optimization opportunities, caused by large IP-to-IP data reuse distances. By carefully breaking these frames into sub-frames, while maintaining correctness, we demonstrate substantial gains with limited hardware requirements. Specifically, we evaluate two techniques, flow-buffering and IP-IP short-circuiting, and show that these techniques bring both power-performance benefits and enhanced user experience.
I. INTRODUCTION

The propensity of tablets and mobile phones in this handheld era raises several interesting challenges. At the application end, these devices are being used for a number

which may not be possible just by optimizing the system in parts. With this philosophy, this paper focuses on an important class of applications run on these handheld devices (real-time frame-oriented video/graphics/audio), examines the data flow in these applications through the different computational kernels, identifies the inefficiencies when sustaining these flows in today's hardware solutions, which simply rely on main memory to exchange such data, and proposes alternate hardware enhancements to optimize such flows.

The authors would like to confirm that this work is an academic exploration and does not reflect any effort within Intel.

Real-time interactive applications, including interactive

[Figure 13: Delay breakdown of a memory request issued by IPs or cores. The numbers above the bar give the absolute cycles (150, 124, 137, 238, 166). Components: MC Trans Q, MC Bank Q, DRAM; applications: Audio Record (Snd. Rec), AR Game, Video Record, Video Play, Average (10 apps).]

small, low-latency, cost- and area-efficient buffers.

Note that, in these use-cases, cores typically run device driver code and handle interrupts. They do minimal processing of data frames. Consequently, we do not incorporate flow-buffers between a core and any other accelerator. Also, when a use-case is in its steady state (for example, a minute into running a video), the IPs are in the active state and quickly consume data. However, if an IP is finishing up an activity, is busy with another activity, or is waking up from the sleep state, the sub-frames can be overwritten in the flow-buffer. In that case, based on sub-frame addresses, the consumer IP can find its data in the main memory since the flow-buffer is a write-through buffer. In our experiments, discussed later in Section VII, we found that a flow-buffer size of 32 KB provides a good trade-off between avoiding a large flow-buffer and sub-frames getting overwritten.
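The write-through behavior of a flow-buffer can be sketched as follows. This is a simplified software model under stated assumptions, not the evaluated hardware: the class name `FlowBuffer`, the dictionary standing in for main memory, and the oldest-first overwrite policy are all our own illustrative choices; the text only states that sub-frames can be overwritten and that the buffer is write-through.

```python
from collections import OrderedDict


class FlowBuffer:
    """Hypothetical model of a private, write-through buffer between two IPs."""

    def __init__(self, capacity_bytes, subframe_bytes, memory):
        self.slots = capacity_bytes // subframe_bytes
        self.entries = OrderedDict()  # sub-frame address -> data, oldest first
        self.memory = memory          # dict standing in for main memory
        self.hits = 0
        self.misses = 0

    def write(self, addr, data):
        """Producer writes a sub-frame: always to memory, and into the buffer."""
        self.memory[addr] = data  # write-through
        if addr in self.entries:
            self.entries.pop(addr)
        elif len(self.entries) >= self.slots:
            self.entries.popitem(last=False)  # overwrite the oldest sub-frame (assumed policy)
        self.entries[addr] = data

    def read(self, addr):
        """Consumer reads a sub-frame: from the buffer if present, else from memory."""
        if addr in self.entries:
            self.hits += 1
            return self.entries[addr]
        self.misses += 1
        return self.memory[addr]


# A 32 KB buffer with 2 KB sub-frames holds 16 sub-frames.
dram = {}
fb = FlowBuffer(32 * 1024, 2 * 1024, dram)
for i in range(17):                 # the producer runs 17 sub-frames ahead
    fb.write(i * 2048, f"subframe-{i}")
print(fb.read(16 * 2048), fb.read(0), fb.hits, fb.misses)
```

In this run the consumer still obtains sub-frame 0 correctly after it has been overwritten in the buffer, because the write-through copy in memory is always valid; it simply pays the memory latency for that access.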
B. IP-IP Short-circuiting

The flow-buffer solution requires an extra piece of hardware to work. To avoid the cost of adding the flow-buffers, an alternate technique would be to enable consumers to directly use the data that their producers provide. Towards that, we analyzed the average round-trip delays of all accesses issued by the cores or IPs (shown in Figure 13) and found that requests spend the most time queuing in the memory subsystem. MC Trans Queue shows the time taken from the request leaving the IP till it gets to the head of the transaction queue. The next part, MC Bank Queue, is the time spent in bank queues. This is primarily determined by whether the data access was a row buffer hit or miss. And, finally, DRAM shows the time for the DRAM access along with the response back to the IPs. As can be seen, most of the time is spent in the memory transaction queues (~100 cycles). This means that data that could otherwise be reused lies idle in the memory queues, and we use this observation towards building an opportunistic IP-to-IP short-circuiting technique, similar in concept to store-load forwarding in CPU cores [29], [38] (footnote 2), though our technique is in between different IPs. There are correctness and implementation differences, which we highlight in the following paragraphs/sections.

Footnote 2: Core requests spend a relatively insignificant amount of time in transaction queues as they are not bursty in nature. Due to their strict QoS deadlines, they are prioritized over other IP requests. They spend more time in bank queues and in DRAM.

[Figure 14: Hit rates with flow-buffering and IP-IP short-circuiting. Y-axis: hit rate (%); applications: Angry Birds, Snd. Rec, MP3, Photo, CamPic, Skype, Video Record, Youtube; configurations: 32 KB, 16 KB, 8 KB, and 4 KB flow-buffers, and IP-IP short-circuiting.]

IPs usually load the data frames produced by other IPs.
Similar to store-load forwarding, if the consumer IP's load requests can be satisfied from the memory transaction queue or bank queues, the memory stall time can be considerably reduced. As the sub-frame size gets smaller, the probability of a load hitting a store gets higher. Unlike the flow-buffers discussed in Section V-A, store data does not remain in the queues till it is overwritten. This technique is opportunistic and as the memory bank clears up its entries, the request moves from the transaction queue into the bank queues and eventually into main memory. Thus, the loads need to follow the stores quickly, else they have to go to memory. This distance between the consumer IP load request and producer IP store request depends on how full the transaction and bank queues are. In the extreme case, if both the queues (transaction-queue and bank-queue) are full, the number of requests by which a load can trail a store will be the sum of the number of entries in the queues.

The overhead of implementing the IP-IP short-circuiting is not significant since we are using pre-existing queues present in the system agent. The transaction and bank queues already implement an associative search to re-order requests based on their QoS requirements and row-buffer hits, respectively [33]. Address searches for satisfying core loads already exist and these can be reused for other IPs. As we will show later, this technique works only when the sub-frame reuse distance is small.

C. Effects of Sub-framing Data with Flow-Buffering and IP-IP Short-circuiting

The benefits of sub-framing are quantified in Figure 14 in terms of hit rates when using flow-buffering and IP-IP short-circuiting. We can see that the buffer hit rates increase as we increase the size of flow-buffers, and saturate when the size of the buffers is in the range of 16 KB to 32 KB. The other advantage of having sub-frames is the reduced bandwidth consumption due to the reduced number of memory accesses. As discussed before, accelerators primarily face bandwidth issues with the current memory subsystem. Sub-framing alleviates such bottlenecks by avoiding fetching every piece of data from memory. Redundant writes and reads to the same addresses are avoided. Latency benefits of our techniques, as well as their impact on user experience, will be given later in Section VII.
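The opportunistic forwarding from the memory queues described in Section V-B can be sketched with a small model. This is an illustration under our own simplifying assumptions, not the system agent's implementation: the transaction and bank queues are merged into one FIFO named `pending`, the class `MemoryQueues` and its methods are hypothetical, and a store arriving at a full queue simply forces the oldest entry out to DRAM (real hardware would apply back-pressure and reorder requests).

```python
from collections import deque


class MemoryQueues:
    """Hypothetical model of store-to-load forwarding from the SA's queues."""

    def __init__(self, trans_entries=64, bank_entries=64):
        self.capacity = trans_entries + bank_entries
        self.pending = deque()  # (address, data) pairs, oldest on the left
        self.dram = {}
        self.forwarded = 0      # loads satisfied from the queues
        self.from_dram = 0      # loads that had to go to memory

    def drain(self, n=1):
        """DRAM services the n oldest queued stores."""
        for _ in range(min(n, len(self.pending))):
            addr, data = self.pending.popleft()
            self.dram[addr] = data

    def store(self, addr, data):
        """Producer IP store; a full queue pushes its oldest entry to DRAM first."""
        if len(self.pending) == self.capacity:
            self.drain(1)
        self.pending.append((addr, data))

    def load(self, addr):
        """Consumer IP load; the youngest matching queued store is forwarded."""
        for queued_addr, data in reversed(self.pending):
            if queued_addr == addr:
                self.forwarded += 1
                return data
        self.from_dram += 1
        return self.dram.get(addr)


def load_after(k):
    """Issue one store, then k unrelated stores, then load the first address."""
    q = MemoryQueues()
    q.store(0, "subframe-0")
    for i in range(1, k + 1):
        q.store(i, "other")
    return q.load(0), q.forwarded


print(load_after(127), load_after(128))
```

In this model, with 64 + 64 entries, a load that trails its store by 127 other requests is still forwarded, while one that trails by 128 finds the store already drained and goes to memory. This mirrors the observation that the tolerable distance is bounded by the combined queue occupancy.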
often deployed to leverage the specificity in computation for each frame and deliver high energy efficiency for the required computation. The frames are then pipelined through these accelerators one after another sequentially. Fourth, in many of these applications, the frames have to flow not just through one such computational stage (accelerator) but possibly through several such stages. For instance, consider a video capture application, where the camera IP may capture raw data, which is then encoded into an appropriate form by another IP, before being sent either to a flash storage or a display. Consequently, the frames have to flow through

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 1072-4451/14 $31.00 © 2014 IEEE. DOI 10.1109/MICRO.2014.60

available locality across IPs goes unexploited. By breaking the data frames into sub-frames, the reuse distance decreases substantially, and we can use flow buffers or the existing memory controller queues to forward data from the producer to consumer. Our results show that such techniques help to reduce frame latencies, which in turn enhance the end-user experience while minimizing energy consumption.

ACKNOWLEDGMENTS

This research is supported in part by the following NSF grants: #1205618, #1213052, #1302225, #1302557, #1317560, #1320478, #1409095, #1439021, #1439057, and grants from Intel.

REFERENCES

[1] "Vision Statement: How People Really Use Mobile," January-February 2013. [Online]. Available: http://hbr.org/2013/01/how-people-really-use-mobile/ar/1
[2] B. Akesson et al., "Predator: A predictable sdram memory controller," in CODES+ISSS, 2007.
[3] Apple, "Apple iphone 5s." Available: https://www.apple.com/iphone/
[4] N. Balasubramanian et al., "Energy consumption in mobile phones: A measurement study and implications for network applications," in IMC, 2009.
[5] M. N. Bojnordi and E. Ipek, "Pardis: A programmable memory controller for the ddrx interfacing standards," in ISCA, 2012.
[6] G. Chen et al., "Exploiting inter-processor data sharing for improving behavior of multi-processor socs," in ISVLSI, 2005.
[7] N. Chidambaram Nachiappan et al., "GemDroid: A Framework to Evaluate Mobile Platforms," in SIGMETRICS, 2014.
[8] P. Conway et al., "Cache hierarchy and memory subsystem of the amd opteron processor," IEEE Micro, 2010.
[9] A. Coskun et al., "Temperature aware task scheduling in mpsocs," in DATE, 2007.
[10] R. Das et al., "Aergia: Exploiting packet latency slack in on-chip networks," in ISCA, 2010.
[11] B. Diniz et al., "Limiting the power consumption of main memory," in ISCA, 2007.
[12] J. Engwell, "The high resolution future - retina displays and design," Blurgroup, Tech. Rep., 2013.
[13] J. Y. C. Engwell, "Gpu technology trends and future requirements," Nvidia Corp., Tech. Rep.
[14] S. Fenney, "Texture compression using low-frequency signal modulation," in HWWS, 2003.
[15] Google, "Android HAL." Available: https://source.android.com/devices/index.html
[16] Google. (2013) "Android sdk - emulator." Available: http://developer.android.com/
[17] M. I. Gordon et al., "A stream compiler for communication-exposed architectures," in ASPLOS, 2002.
[18] A. Gutierrez et al., "Full-system analysis and characterization of interactive smartphone applications," in IISWC, 2011.
[19] K. Han et al., "A hybrid display frame buffer architecture for energy efficient display subsystems," in ISLPED, 2013.
[20] M. K. Jeong et al., "A qos-aware memory controller for dynamically balancing gpu and cpu bandwidth use in an mpsoc," in DAC, 2012.
[21] A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," in ASPLOS, 2013.
[22] M. Kandemir et al., "Exploiting shared scratch pad memory space in embedded multiprocessor systems," in DAC, 2002.
[23] C. N. Keltcher et al., "The amd opteron processor for multiprocessor servers," IEEE Micro, 2003.
[24] H. b. T. Khan and M. K. Anwar, "Quality-aware Frame Skipping for MPEG-2 Video Based on Inter-frame Similarity," Malardalen University, Tech. Rep.
[25] Y. Kim et al., "ATLAS: A Scalable and High-performance Scheduling Algorithm for Multiple Memory Controllers," in HPCA, 2010.
[26] Y. Kim et al., "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior," in MICRO, 2010.
[27] K.-B. Lee and T.-S. Chang, Essential Issues in SoC Design: Designing Complex Systems-on-Chip. Springer, 2006, ch. SoC Memory System Design.
[28] A. W. Lim and M. S. Lam, "Maximizing parallelism and minimizing synchronization with affine transforms," in POPL, 1997.
[29] G. H. Loh et al., "Memory bypassing: Not worth the effort," in WDDD, 2002.
[30] P. Marwedel et al., "Mapping of applications to mpsocs," in CODES+ISSS, 2011.
[31] A. Naveh et al., "Power management architecture of the 2nd generation Intel® Core™ microarchitecture, formerly codenamed sandy bridge," 2011.
[32] S. T. Report, "Wqxga solution with exynos dual," 2012.
[33] S. Rixner et al., "Memory access scheduling," in ISCA, 2000.
[34] P. Rosenfeld et al., "Dramsim2: A cycle accurate memory system simulator," CAL, 2011.
[35] R. Saleh et al., "System-on-chip: Reuse and integration," Proceedings of the IEEE, 2006.
[36] Samsung, "Samsung galaxy s5," 2014. Available: http://www.samsung.com/global/microsite/galaxys5/
[37] H. Schwarz et al., "Overview of the scalable video coding extension of the h.264/avc standard," IEEE Transactions on Circuits and Systems for Video Technology, 2007.
[38] T. Sha et al., "Scalable store-load forwarding via store queue index prediction," in MICRO, 2005.
[39] H. Shim et al., "A compressed frame buffer to reduce display power consumption in mobile systems," in ASP-DAC, 2004.
[40] P. Shivakumar and N. P. Jouppi, "Cacti 3.0: An integrated cache timing, power, and area model," Technical Report 2001/2, Compaq Computer Corporation, Tech. Rep., 2001.
[41] A. K. Singh et al., "Communication-aware heuristics for run-time task mapping on noc-based mpsoc platforms," J. Syst. Archit., 2010.
[42] R. SoC, "Rk3188 multimedia codec benchmark," 2011.
[43] Y. Song and Z. Li, "New tiling techniques to improve cache temporal locality," in ACM SIGPLAN Notices, 1999.
[44] D. S. Steve Scheirey, "Sensor fusion, sensor hubs and the future of smartphone intelligence," ARM, Tech. Rep., 2013.
[45] V. Suhendra et al., "Integrated scratchpad memory optimization and task scheduling for mpsoc architectures," in CASES, 2006.
[46] Y. Wang et al., "Markov-optimal sensing policy for user state estimation in mobile devices," in IPSN, 2010.
[47] L. Xue et al., "Spm conscious loop scheduling for embedded chip multiprocessors," in ICPADS, 2006.
[48] M. Yuffe et al., "A fully integrated multi-cpu, gpu and memory controller 32nm processor," in ISSCC, 2011.
[49] Y. Zhang et al., "A data layout optimization framework for nuca-based multicores," in MICRO, 2011.
[50] Y. Zhu and V. J. Reddi, "High-performance and energy-efficient mobile web browsing on big/little systems," in HPCA, 2013.

[Figure 11: Area and power-overhead with large shared caches. Panels: power (W), split into dynamic and leakage, and area (mm^2), for 4 MB, 8 MB, 16 MB, and 32 MB caches.]

from Figure 10, increasing the cache sizes does not always help and there is no optimal size. For IPs like SND and AD, the frame sizes are small and hence a smaller cache suffices. From there on, increasing cache size increases lookup latencies, and affects the access times. In other cases, like DC, as the frame sizes are large, we observe fewer cycles per frame as we increase the cache size. For other accelerators with latency tolerance, once their data fits in the cache, they encounter no performance impact.

Further, scaling cache sizes above 4 MB is not reasonable due to their area and power overheads. Figure 11 plots the overheads for different cache sizes. Typically, handhelds operate in the range of 2W to 10W, which includes everything on the device (SoC+display+network). Even the 2W consumed by the 4 MB cache will impact battery life severely.

Summary: To summarize, the high number of memory stalls is the main reason for frame drops, and large IP-to-IP reuse distances are the main cause for large memory stalls. Even large caches are not sufficient to capture the data reuse and hence, accelerators and devices still have considerable memory stalls. All of these observations led us to re-architect how data gets exchanged between different IPs, paving the way for better performance.

V. SUB-FRAMING

Our primary goal is to reduce the IP-to-IP data reuse distances, and thereby reduce data stalls, which we saw were a major impediment to performance in Section III.

To achieve this, we propose a novel approach of sub-framing the data. One of the commonly used compiler techniques to reduce the data reuse distance in loop nests that manipulate array data is to employ loop tiling [28], [43].
It is the process of partitioning a loop's iteration space into smaller blocks (tiles) in a manner that the data used by the loop remains in the cache, enabling quicker reuse. Inspired by tiling, we propose to break the data frames into smaller sub-frames, which reduces IP-to-IP data reuse distances.

In current systems, IPs receive a request to process a data frame (it could be a video frame, audio frame, display frame or image frame). Once an IP completes its processing, the next IP in the pipeline is triggered, which in turn triggers the following IP once it completes its processing, and so on. In our solution, we propose to sub-divide these data frames into smaller sub-frames, so that once IP1 finishes its first sub-frame, IP2 is invoked to process it. In the following sections, we show that this design considerably reduces the hardware requirements to store and move the data, thereby bringing both performance and power gains.

[Figure 12: IP-to-IP reuse distance variation with different sub-frame sizes. Note that the y-axis is in the log scale. Y-axis: access interval; x-axis: sub-frame size (Frame, 1024, 512, 128, 64, 32, 8, 1); panels: CAM-IMG and IMG-DC; MMC-VD and VD-DC.]

The granularity of the sub-frame can have a profound impact on various metrics. To quantify the effects of subdividing a frame, we varied the sub-frame sizes from 1 cache line to the current data frame size, and analyzed the reuse distances. Figure 12 plots the reduction in the IP-to-IP reuse distances (on the y-axis, plotted on log-scale), as we reduced the size of a sub-frame. We can see from this plot an inverse exponential decrease in reuse distances. In fact, for very small sub-frame sizes, we see reuse distances of less than 100 cycles. To capitalize on such small reuse distances, we explore two techniques: flow-buffering and opportunistic IP-to-IP request short-circuiting.
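The effect of invoking IP2 as soon as IP1 finishes a sub-frame can be illustrated with a toy timing model. This is a sketch under strong assumptions that are ours, not the paper's: each IP takes a fixed number of cycles per sub-frame, memory stalls and queueing are ignored, and the function names and cycle counts are hypothetical.

```python
def frame_cycles_sequential(per_subframe_cycles, num_subframes):
    """Frame granularity: each IP processes the whole frame before the next IP starts."""
    return sum(cycles * num_subframes for cycles in per_subframe_cycles)


def frame_cycles_pipelined(per_subframe_cycles, num_subframes):
    """Sub-frame granularity: an IP starts a sub-frame once it is free and the
    previous IP in the flow has finished that same sub-frame."""
    finish_prev_ip = [0] * num_subframes  # the first IP has its input available at cycle 0
    for cycles in per_subframe_cycles:
        finish = []
        ip_free_at = 0
        for s in range(num_subframes):
            start = max(ip_free_at, finish_prev_ip[s])
            ip_free_at = start + cycles
            finish.append(ip_free_at)
        finish_prev_ip = finish
    return finish_prev_ip[-1]


# Made-up flow of three IPs taking 10, 20, and 5 cycles per sub-frame, 4 sub-frames per frame.
flow = [10, 20, 5]
print(frame_cycles_sequential(flow, 4), frame_cycles_pipelined(flow, 4))
```

For this made-up flow the model gives 140 cycles at frame granularity and 95 cycles with sub-frames, since the pipelined time is bounded by the slowest IP rather than by the sum of all IPs. The numbers are illustrative only and are not measurements from the evaluated system.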
A. Flow-Buffering

In Section IV-B, we showed that even large caches were not very effective in avoiding misses. This is primarily due to very large reuse distances that are present between the data-frame write by a producer and the data-frame read by a consumer. With sub-frames, the reuse distances reduce dramatically. Motivated by this, we now re-explore the option of caching data. Interestingly, in this scenario, caches of much smaller size can be far more effective (low misses). The reuse distances resulting from sub-framing are so small that even having a structure with a few cache-lines is sufficient to capture the temporal locality offered by IP pipelining in SoCs. We call these structures flow-buffers. Unlike a shared cache, the flow-buffers are private between any two IPs. This design avoids the conflict misses seen in a shared cache (fully associative has high power implications). These flow-buffers are write-through. As the sub-frame gets written to the flow-buffer, it is also written to memory. The reason for this design choice is discussed next.

In a typical use-case involving data flow from IP-A → IP-B → IP-C, IP-A gets its data from the main memory and starts computing on it. During this process, as it completes a sub-frame, it writes back this chunk of data into the flow-buffer between IP-A and IP-B. IP-B starts processing this sub-frame from the flow-buffer (in parallel with IP-A working on another sub-frame) and writes it back to the flow-buffer between itself and IP-C. Once IP-C is done, the data is written into the memory or the display. Originally, every read and write in the above scenario would have been scheduled to reach the main memory. Now, with the flow-buffers in place, all the requests can be serviced from these

With their advent, we also see a tremendous increase in device-user interactivity and real-time data processing needs. Media (audio/video/camera) and gaming use-cases are gaining substantial user attention and are defining product successes. The combination of increasing demand from these use-cases and having to run them at low power (from a battery) means that architects have to carefully study the applications and optimize the hardware and software stack together to gain significant optimizations. In this work, we study workloads from these
domains and identify the memory subsystem (system agent) to be a critical bottleneck to performance scaling. We characterize

[Figure 1: Target SoC platform with a high-level view of different functional blocks in the system.]

the workhorses of the SoC as they provide maximum performance and power efficiency, e.g. video encoders/decoders, graphics, imaging and audio engines.

Interactions between Core, IPs and Operating System: SoC applications are highly interactive and involve multiple accelerators and devices to enhance user experience. Using API calls, application requirements get transformed to accelerator requirements through different layers of the OS. Typically, the calls happen through software device drivers in the kernel portion of the OS. These calls decide if, when and for how long the different accelerators get used. The device drivers, which are optimized by the IP vendors, control the functionality and the power states of the accelerators. Once an accelerator needs to be invoked, its device driver is notified with a request and the associated physical address of the input data. The device driver sets up the different activities that the accelerator needs to do, including writing appropriate registers with pointers to the memory region where the data should be fetched and written back. The accelerator reads the data from main memory through DMA. Input data fetching and processing are pipelined and the fetching granularity depends on how the local buffer is designed. Once data is processed, it is written back to the local buffers and eventually to the main memory at the address region specified by the driver. As most accelerators work faster than main memory, there is a need for input and output buffers.

The System Agent (SA): Also known as the Northbridge, the SA is a controller that receives commands from the core and passes them on to the IPs. Some designs add more intelligence to the SA to prioritize and reorder requests to meet QoS deadlines and to improve DRAM hits. The SA usually incorporates the memory controller (MC) as well.
Apart from re-ordering requests across components to meet QoS guarantees, even fine-grained re-ordering among IPs' requests can be done to maximize DRAM bandwidth and bus utilization. With increasing user demands from handhelds, the number of accelerators and their speeds keep increasing [12], [13], [44]. These trends will place a very high demand on DRAM traffic. Consequently, unless we design a sophisticated SA that can handle the increased amount of traffic, the improvement in accelerators' performance will not translate into improved user experience.

[Figure 2: Overview of data flow in SoC architectures.]

B. Data movement in SoCs

Figure 2 depicts the high-level view of the data flow in SoC architectures. Once a core issues a request to an IP through the SA (shown as (1)), the IP starts its work by injecting a memory request into the SA. First, the request traverses an interconnect, which is typically a bus or cross-bar, and is enqueued in a memory transaction queue. Here, requests can be reordered by the SA according to individual IP priorities to help requests meet their deadlines. Subsequently, requests are placed in the bank-queues of the memory controller, where requests from IPs are re-arranged to maximize the bus utilization (and in turn, the DRAM bandwidth). Following that, an off-chip DRAM access is made. The response is returned to the IP through the response network in the SA (shown as (2)). IP-1 writes its output data to memory (shown as (3)) till it completes processing the whole frame. After IP-1 completes processing, IP-2 is invoked by the core (shown as (4)), and a data flow similar to that of IP-1 is followed, as captured by (5) and (6) in Figure 2. The unit of data processing in media and gaming IPs (including audio, video and graphics) is a frame, which carries information about the image or pixels or audio delivered to the user. Typically a high frame drop rate corresponds to a deterioration in user experience.
The total request access latency includes the network latency, the queuing latencies at the transaction queue and bank queue, DRAM service time, and the response network latency. This latency, as expected, is not constant and varies based on the system dynamics (including DRAM row-buffer hits/misses). When running a particular application, the OS maps the data frames of different IPs to different physical memory regions. Note that these regions get reused during the application run for writing different frames (over time). In a data flow involving multiple IPs that process the frames, one after another, the OS (through device drivers) synchronizes the IPs such that the producer IP writes another frame of data onto the same memory region after the consumer IP had consumed the earlier frame.

C. Decomposing an Application Execution into Flows

Applications cannot directly access the hardware for I/O or acceleration purposes. In Android, for example, application requests get translated by intermediate libraries and

[Figure 17: Percentage of Frames Completed (Higher the better). Y-axis: fraction of frames completed; schemes: Base, SubFrame-Cache, SubFrame-Forward; flows grouped by application (Angry Birds, Snd. Rec, MP3, Photo, CamPic, Skype, Video Record, Youtube, AVG).]

[Figure 18: Reduction in Cycles Per Frame in a flow normalized to Baseline (Lower the better). Y-axis: % of cycles per frame; schemes: SubFrame-Cache, SubFrame-Forward; flows grouped by application A1 to A8 and AVG.]
metric in handhelds since they operate out of a battery (most of the time). Exact IP design and power states incorporated are not available publicly. As a proxy, we use the number of cycles an IP was active to correspond to the energy consumed when running the specific applications. In Figure 19, we plot the total number of active cycles consumed by an accelerator compared to the base case. We plot this graph only for accelerators as they are compute-intensive and hence, consume most of the power in a flow. On average, we observe 46% and 35% reduction in active cycles (going up to 80% in GPU) with our techniques, which translates to substantial system-level energy gains. With sub-framing, we also reduce the memory energy consumption by 33% due to (1) reduced DRAM accesses, and (2) memory spending more time in low-power mode. From the above results it can be concluded that sub-framing yields significant performance and power gains.

[Figure 19: Reduction in Number of Active Cycles of Accelerators (Lower the better). Y-axis: % of active cycles; schemes: SubFrame-Cache, SubFrame-Forward; accelerators grouped by application A1 to A8.]

VIII. RELATED WORK

Data Reuse: Data reuse within and across cores has been studied by many works. Chen et al. [6], [47], Gordon et al. [17] and Kandemir et al. [22] propose compiler optimizations that perform code restructuring and enable data sharing across processors. Suhendra et al. [45] proposed ways to optimally use scratchpad memory in MPSoCs along with methods to schedule processes to cores. There have been multiple works that discuss application and task mapping to MPSoCs [9], [30], [41] with the goal of minimizing data movement across cores. Our work looks at accelerator traffic, which is dominant in SoCs, and identifies that frame data is reused across IPs. Unlike core traffic, the reuse can be exploited only if the data frames are broken into sub-frames. We capture this for data frames of different classes of applications (audio/video/graphics) and propose techniques to reduce the data movement by short-circuiting the producer writes to the consumer reads.
Memory Controller Design: A large body of work exists in the area of memory scheduling techniques and memory controller designs in the context of MPSoCs. Lee and Chang [27] describe the essential issues in memory system design for SoCs. Akesson et al. [2] propose a memory scheduling technique that provides a guaranteed minimum bandwidth and maximum latency bound to IPs. Jeong et al. [20] provide QoS guarantees to frames by balancing memory requests at the memory controller. Our work identifies a specific characteristic (reuse at sub-frame level) that exists when data flows through accelerators and optimizes system agent design. Our solution is complementary to prior techniques and can work in tandem with them.

Along with IP design and analysis, several works have proposed IP-specific optimizations [14], [19], [24], [39] and low-power aspects of system-on-chip architectures [11], [18], [46]. Our solution is not specific to any IP; rather, it is at the system level. By reducing the IP stall times and memory traffic, we make the SoC performance- and power-efficient.

IX. CONCLUSIONS

Memory traffic is a performance bottleneck in mobile systems, and it is critical that we optimize the system as a whole to avoid such bottlenecks. Media and graphics applications are very popular on mobile devices, and operate on data frames. These applications have a lot of temporal locality of data between producer and consumer IPs. We show that by operating a frame as an atomic block between different IPs, the reuse distances are very high, and thus, the

ago), requiring substantial computational resources for real-time interactivity, on both input and output sides, with the external world. On the hardware end, power and limited battery capacities mandate high degrees of energy efficiencies to perform these computational tasks. Meeting these computational needs with the continuing improvements in hardware capabilities is no longer just a matter of throwing high performance and plentiful cores or even accelerators at the problem. Instead, a careful examination and marriage of the hardware with the application and execution characteristics is warranted for extracting the maximum efficiencies. In other words, a co-design of software and hardware is nec-

VI. IMPLEMENTATION DETAILS
In implementing our sub-frame idea, we account for the probable presence of dependencies and correctness issues resulting from splitting frames. Below, we discuss the correctness issue and the associated intricacies that need to be addressed to implement sub-frames. We then discuss the software, hardware and system-level support needed for such implementations.

A. Correctness

We broadly categorize data frames into the following types: (i) video, (ii) audio, (iii) graphics display, and (iv) network packets. Of these, the first three types of frames are the ones that usually demand sustained high bandwidth, with the frame sizes varying from a megabyte to tens of MBs. In this work, we address only the first three types of frames, and leave out network packets as the latency of network packet transmission is considerably higher compared to the time spent in the SoC.

Video and Audio Frames: Encoding and decoding, abbreviated as codec, is the compression and decompression of data, which can be performed at either the hardware or the software layer. Current-generation smartphones such as the Samsung S5 [36] and Apple iPhone [3] have multiple types of codecs embedded in them.

Video Codecs: First, let us consider the flows containing video frames, and analyze the correctness of sub-dividing such large frames into smaller ones. Among the video codecs, the most commonly used are the H.264 (MPEG-4) and H.265 (MPEG-H, HEVC) codecs. Let us take a small set of video frames and analyze the decoding process. The encoding process is almost equivalent to the inversion of each stage of decoding. As a result, similar principles apply there as well. Figure 15 shows a video clip in its entirety, with each frame component named. A high-quality HD video is made up of multiple frames of data. Assuming a default of 60 FPS, the amount of data needed to show the clip for a minute would be 1920x1080 (screen resolution) x 3 (bytes/pixel) x 60 (frame rate) x 60 (seconds) = 21.35 GB. Even if each frame is compressed individually and stored on today's hand-held devices, the amount of storage available would not permit it. To overcome this limitation, codecs take advantage of the temporal redundancy present in video frames, as the next frame is usually not very different from the previous frame.
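The raw-size estimate for the one-minute clip above can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the function name `raw_video_bytes` is hypothetical and introduced only for illustration. The product comes to about 22.39 x 10^9 bytes; the 21.35 GB figure quoted in the text appears to correspond to converting bytes to MB with 2^20 and then MB to GB with 1000, which is an inference from the numbers rather than something the text states.

```python
def raw_video_bytes(width, height, bytes_per_pixel, fps, seconds):
    """Uncompressed size of a clip: every pixel of every frame is stored."""
    return width * height * bytes_per_pixel * fps * seconds


# One uncompressed 1080p frame at 3 bytes/pixel (roughly 6 MB).
frame_bytes = raw_video_bytes(1920, 1080, 3, 1, 1)

# One minute of 1080p video at 60 FPS.
clip_bytes = raw_video_bytes(1920, 1080, 3, 60, 60)

# Assumed unit convention that matches the quoted figure:
# bytes -> MB using 2**20, then MB -> GB using 1000.
clip_gb_mixed_units = clip_bytes / 2**20 / 1000

print(frame_bytes)          # 6220800
print(clip_bytes)           # 22394880000
print(clip_gb_mixed_units)  # about 21.36
```

Whichever unit convention is used, the conclusion of the paragraph is unchanged: a minute of raw video is in the tens of gigabytes, far beyond what per-frame compression alone can bring within handheld storage budgets.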
[Figure 15: Pictorial representation showing the structure of five consecutive video frames.]

[Figure 16: High-level view of the SA that handles sub-frames.]

Each frame can be dissected into multiple slices. Slices are further split into macroblocks, which are usually a block of 16x16 pixels. These macroblocks can be further divided into finer granularities such as sub-macroblocks or pixel blocks. But we do not need such fine granularities for our purpose. Slices can be classified into 3 major types: I-Slice (independent or intra slices), P-Slice (predictive), and B-Slice (bi-directional) [37], as depicted in Figure 15 (see footnote 3). I-slices have all data contained in them, and do not need motion prediction. P-slices use motion prediction from one slice which belongs to the past or future. B-slices use two slices from the past or the future. Each slice is an independent entity and can be decoded without the need for any other slice in the current frame. P- and B-slices need slices from a previous or next frame only.

In our sub-frame implementation, we choose slice-level granularity as the finest level of sub-division to ensure correctness without having any extra overhead of synchronization. As slices are independently decoded in a frame, the need for another slice in the frame does not arise, and we can be sure that correctness is maintained. Sub-dividing any further would bring in dependencies, stale data and overwrites.

Audio Codecs: Audio data is coded in a much simpler fashion than video data. An audio file has a large number of frames, with each audio frame having the same number of bits. Each frame is independent of the others and consists of a header block and a data block. The header block (in MP3 format) stores 32 bits of metadata about the data block that follows. Thus, each audio frame can be decoded independently of the others, as all data required for decoding is present in the current frame's header. Therefore, using a full audio frame as a sub-frame would not cause any correctness issue.
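The granularity rules above (one sub-frame per slice for video, one sub-frame per whole frame for audio) can be sketched as follows. This is only an illustration under stated assumptions: the `SubFrame` type and the helper names are hypothetical, slices are assumed to arrive already separated as byte strings, and no real codec parsing is performed. The only codec fact used is the 32-bit (4-byte) MP3 frame header mentioned in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

MP3_HEADER_BYTES = 4  # 32 bits of metadata, as stated in the text


@dataclass
class SubFrame:
    """Hypothetical unit handed from a producer IP to a consumer IP."""
    frame_id: int
    index: int
    payload: bytes


def split_video_frame(frame_id: int, slices: List[bytes]) -> List[SubFrame]:
    """One sub-frame per slice: slices in a frame decode independently,
    so no finer split (macroblocks, pixel blocks) is attempted."""
    return [SubFrame(frame_id, i, s) for i, s in enumerate(slices)]


def split_audio_frame(frame_id: int, frame: bytes) -> List[SubFrame]:
    """A whole audio frame is already self-contained, so it is one sub-frame."""
    return [SubFrame(frame_id, 0, frame)]


def audio_header_and_data(frame: bytes) -> Tuple[bytes, bytes]:
    """Separate the 4-byte header block from the data block of an audio frame."""
    return frame[:MP3_HEADER_BYTES], frame[MP3_HEADER_BYTES:]


video_subframes = split_video_frame(7, [b"slice-0", b"slice-1", b"slice-2"])
audio_subframes = split_audio_frame(8, b"\xff\xfb\x90\x00" + b"payload")
header, data = audio_header_and_data(audio_subframes[0].payload)
```

The design choice mirrors the argument in the text: the split follows the boundaries at which the codec already guarantees independence, so no additional synchronization between sub-frames is introduced.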
Graphics Rendering: Graphics IPs already employ tiled

Footnote 3: Earlier codecs had frame-level classification instead of slice-level. In such situations, an I-frame is constructed as a frame with only I-slices.

back, are amongst the most popular on today's tablets and mobiles apart from email and social networking. Such applications account for nearly 65% of the usage on today's handhelds [1], stressing the importance of meeting the challenges imposed by such applications efficiently. This important class of applications has several key characteristics that are relevant to this study. First, these applications work with input (sensors, network, camera, etc.) and/or output (display, speaker, etc.) devices, mandating real-time responsiveness. Second, these applications deal with frames of data, with the requirement to process a frame within a stipulated time constraint. Third, the computation required for processing a

Apart from the computational needs for real-time execution, all the above observations stress the memory intensity of these applications. Frames of data coming from any external sensor/device are streamed into memory, from which they are streamed out by a different IP, processed and put back in memory. Such an undertaking places heavy demands on the memory subsystem. When we have several concurrent flows, either within the same application or across applications in a multiprogrammed environment, all of these flows contend for the memory and stress it even further. This contention can have several consequences: (i) without a steady stream of data to/from memory, the efficiencies from having specialized IPs with continuous data flow can get lost with the IPs stalling for memory; (ii) such stalls with idle IPs can lead to energy wastage in the IPs themselves; and (iii) the high memory traffic can also contend with, and slow down, the memory accesses of the main cores in the system. While there has been a lot of work covering processing, whether it be CPU cores or specialized IPs and accelerators (e.g.
[35][50][27]), for these handheld environments, the topic of optimizing the data flows, while keeping the memory system in mind, has drawn little attention. Optimizing for memory system performance, and minimizing consequent queueing delays, has itself received substantial interest in the past decade, but only in the area of high-end systems (e.g., [4][26][25][10]). This paper addresses this critical issue in the design of handhelds, where memory will play an increasingly important role in sustaining the data flow not just across CPU cores, but also between IPs, and with the peripheral input-output (display, sound, network and sensors) devices.

In today's handheld architectures, a System Agent (SA) [8], [23], [31], [48] serves as the glue integrating all the compute (whether it be IPs or CPU cores) and storage components. It also serves as the conduit to the memory system. However, it does not clearly understand data flows, and simply acts as a slave initiating and serving memory requests regardless of which component issues them. As a result, the high frame rate requirements translate to several transactions in the memory queues, and the flow of these frames from one IP to another explicitly goes through these queues, i.e., the potential for data flow (or data reuse) across IPs is not really being exploited. Instead, in this paper we explore the idea of virtually integrating accelerator pipelines by short-circuiting many of the read/write requests, so that the traffic in the memory queues can be substantially reduced. Specifically, we explore the possibility of shared buffers/caches and short-circuiting communication between the IP cores based on requests already pending in the memory transaction queues. In this context, this paper makes the following contributions:

- We show that the memory is highly utilized in these systems, with IPs spending around 47% of their total execution time stalling for data, in turn causing 24% of the frames to be dropped in these applications. We cannot afford to let technology take care of this problem since, with each DRAM technology advancement, the demands from the memory system also become more stringent.
- Blindly provisioning a shared cache to leverage data flow/reuse across the IP cores is also likely to be less beneficial from a practical standpoint. An analysis of the IP-to-IP reuse distances suggests that such caches have to run into several megabytes for reasonable hit rates (which would also be undesirable for power).

- We show that this problem is mainly due to the current frame sizes being relatively large. Akin to tiling for locality enhancement in nested loops of large arrays [28], [43], [49], we introduce the notion of a sub-frame for restructuring the data flow, which can substantially reduce reuse distances. We discuss how this sub-framing can be done for the considered applications.

- With this sub-framing in place, we show that reasonably sized shared caches (referred to as flow buffers in this paper) between the producer and consumer IPs of a frame can circumvent the need to go to main memory for many of the loads from the consumer IP. Such reduction in memory traffic results in around 20% performance improvement in these applications.

- While these flow buffers can benefit these platforms substantially, we also explore an alternate idea of not requiring any separate hardware structures: leveraging existing memory queues for data forwarding from the producer to the consumer. Since memory traffic is usually high, recently produced items are more likely to be waiting in these queues (serving as a small cache), which could be forwarded to the requesting consumer IP. We show that this can be accommodated in recently-proposed memory queue structures [5], and demonstrate performance and power benefits that are nearly as good as that of the flow buffer solution.

II. BACKGROUND AND EXPERIMENTAL SETUP

In this section, we first provide a brief overview of current SoC (system-on-chip) platforms, showing how the OS, core and the IPs interact from a software and hardware perspective. Next, we describe our evaluation platform, and properties of the applications that are used in this work.
A. Overview of SoC Platforms

As shown in Figure 1, handhelds available in the market have multiple cores and other specialized IPs. The IPs in these platforms can be broadly classified into two categories: accelerators and devices. Devices interact directly with the user or external world and include cameras, touch screen, speaker and wireless. Accelerators are the on-chip hardware components which specialize in certain activities. They are

Table I: Expansions for IP abbreviations used in this paper.

IP Abbr.  Expansion            IP Abbr.  Expansion
VD        Video Decoder        AD        Audio Decoder
DC        Display Controller   VE        Video Encoder
MMC       Flash Controller     MIC       Microphone
AE        Audio Encoder        CAM       Camera
IMG       Imaging              SND       Sound

Table II: IP flows in our applications.

Id  Application              IP Flows
A1  Angry Birds              AD-SND; GPU-DC
A2  Sound Record             MIC-AE-MMC
A3  Audio Playback (MP3)     MMC-AD-SND
A4  Photos Gallery           MMC-IMG-DC
A5  Photo Capture (CamPic)   CAM-IMG-DC; CAM-IMG-MMC
A6  Skype                    CAM-VE; VD-DC; AD-SND; MIC-AE
A7  Video Record             CAM-VE-MMC; MIC-AE-MMC
A8  Youtube                  VD-DC; AD-SND

device drivers into commands that drive the accelerators and devices. This translation results in hardware processing the data, moving it between multiple IPs and finally writing it to storage, displaying it, or sending it over the network. Let us consider, for example, a video player application. The flash controller reads a chunk of the video file from memory, the chunk gets processed by the core, and two separate requests are sent to the video decoder and audio decoder. They read their data from the memory and, once an audio/video frame is decoded, it is sent to the display through memory. In this paper, we term such a regular stream of data movement from one IP to another a flow. All our target applications have such flows, as shown in Table II. Table I gives the expansions for IP abbreviations. Note that an application can have one or more flows. With multiple flows, each flow could be invoked at a different point in time, or multiple independent flows can be active concurrently without sharing any common IP or data across them.
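To make the notion of a flow concrete, the flows of Table II can be written down as plain data and split into producer-consumer pairs. This is a minimal sketch: the dictionary layout and the helper names (`ips_in_flow`, `producer_consumer_pairs`, `shared_ips`) are assumptions made for illustration and are not part of the evaluation infrastructure.

```python
# Flows per application, transcribed from Table II.
FLOWS = {
    "A1": ["AD-SND", "GPU-DC"],
    "A2": ["MIC-AE-MMC"],
    "A3": ["MMC-AD-SND"],
    "A4": ["MMC-IMG-DC"],
    "A5": ["CAM-IMG-DC", "CAM-IMG-MMC"],
    "A6": ["CAM-VE", "VD-DC", "AD-SND", "MIC-AE"],
    "A7": ["CAM-VE-MMC", "MIC-AE-MMC"],
    "A8": ["VD-DC", "AD-SND"],
}


def ips_in_flow(flow):
    """IPs visited by a flow, in order."""
    return flow.split("-")


def producer_consumer_pairs(flow):
    """Adjacent IPs in a flow form (producer, consumer) pairs."""
    ips = ips_in_flow(flow)
    return list(zip(ips, ips[1:]))


def shared_ips(flow_a, flow_b):
    """IPs that two flows have in common (empty set if independent)."""
    return set(ips_in_flow(flow_a)) & set(ips_in_flow(flow_b))


pairs = producer_consumer_pairs("CAM-IMG-DC")
skype_overlap = shared_ips("CAM-VE", "VD-DC")
campic_overlap = shared_ips("CAM-IMG-DC", "CAM-IMG-MMC")
```

Under this representation, the four Skype flows have no IP in common and so fit the concurrent, independent case described above, whereas the two Photo Capture flows both pass through CAM and IMG.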
D. Evaluation Platform

Handheld/mobile platforms commonly run applications that rely on user inputs and are interactive in nature. Studying such a system is tricky due to the non-determinism associated with it. To enable that, we use GemDroid [7], which utilizes Google Android's open-source emulator [16] to capture the complete system-level activity. This provides a complete memory trace (with cycles between memory accesses) along with all IP calls when they were invoked by the application. We extended the platform by including DRAMSim2 [34] for accurate memory performance evaluation. Further, we enhanced the tool to extensively model the system agent, accelerators and devices in detail.

Specifications of select IPs (frequency, frame sizes and processing latency) are available from [42]. For completeness, we give all core parameters, DRAM parameters, and IP details in Table III. The specifications used are derived from current systems in the market [3], [36]. Note that the techniques discussed in this work are generic and not tied to specific micro-architectural parameters.

Table III: Platform configuration.

Processor                   ARM ISA; 4-core processor; clocked at 2 GHz; OoO w/ issue width: 4
Caches                      32 KB L1-I; 32 KB L1-D; 512 KB L2
Memory                      Till 2 GB reserved for cores. 2 GB to 3 GB reserved for IPs.
                            LPDDR3-1333; 1 channel; 1 rank; 8 banks
                            5.3 GBPS peak bandwidth; tCL, tRP, tRCD = 12, 12, 12 ns
System Agent                Frequency: 500 MHz; Interconnect latency: 1 cycle per 16 bytes
                            Memory Transaction-Q.: 64 entries; Bank-Q.: 8 entries
IPs and System Parameters   All IPs run at 500 MHz frequency
                            Aud. Frame: 16 KB frame; Vid. Frame: 4K (3840x2160)
                            Camera Frame: 1080p (1920x1080)
                            Input Buffer Sizes: 16-32 KB; Output Buffer Sizes: 32-64 KB
                            Enc/Decoding Ratio: VD 1:32; AD 1:8
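As a rough sanity check on Table III, the 5.3 GBPS peak bandwidth is consistent with an LPDDR3-1333 channel if one assumes a 32-bit data bus; the bus width is not given in the table, so it is an assumption here, as are the function names and the 3 bytes/pixel frame size. The sketch also estimates how long one uncompressed 4K frame would occupy the channel at peak rate.

```python
def peak_bandwidth_bytes_per_s(mega_transfers_per_s, bus_width_bits, channels=1):
    """Peak DRAM bandwidth = transfer rate x bytes per transfer x channels."""
    return mega_transfers_per_s * 1_000_000 * (bus_width_bits // 8) * channels


def frame_transfer_ms(frame_bytes, bandwidth_bytes_per_s):
    """Time to move one frame at the given bandwidth, in milliseconds."""
    return frame_bytes / bandwidth_bytes_per_s * 1e3


# LPDDR3-1333, 1 channel (Table III); 32-bit bus width is ASSUMED.
peak = peak_bandwidth_bytes_per_s(1333, 32)

# 4K video frame (Table III) at an ASSUMED 3 bytes/pixel.
frame_4k_bytes = 3840 * 2160 * 3
ms_per_4k_frame = frame_transfer_ms(frame_4k_bytes, peak)

print(peak / 1e9)       # 5.332
print(ms_per_4k_frame)  # about 4.67
```

Even at peak rate and with no contention, a single pass of one 4K frame through memory takes several milliseconds of a 16.7 ms frame period at 60 FPS, and each producer-consumer hop in a flow adds both a write and a read of that frame.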
III. MOTIVATION: MEMORY STALLS

In this section, we show that DRAM stalls are high in current SoCs and that this will only worsen as IP performance scales. Typically, DRAM is shared between the cores and IPs and is used to transfer data between them. There is a high degree of data movement, and this often results in high contention for memory controller bandwidth between the different IPs [21]. Figure 3 shows the memory bandwidth obtained by two of our applications, YouTube and Skype, with a 6.4 GBPS memory. One can notice the burstiness of the accesses in these plots. Depending on the type of IPs involved, frames get written to memory or read from memory at a certain rate. For example, cameras today can capture video frames of resolution 1920x1080 at 60 FPS, and the display refreshes the screen with these frames at the same rate (60 FPS). Therefore, 60 bursts of memory requests from both IPs happen in a second, with each burst requesting one complete frame. While the request rate is small, the data size per request is high: 6 MB for a 1920x1080 resolution frame (this will increase with 4K resolutions [12]). If this amount of bandwidth cannot be catered to by the DRAM, the memory controller and DRAM queues fill up rapidly, and in turn the devices and accelerators start experiencing performance drops. The performance drop also affects battery life as execution time increases. In the right-side graph in Figure 3, whenever the camera (CAM) initiates its burst of requests, the peak memory bandwidth consumption can be seen (about 6 GBPS). We also noticed that the average memory latency more than doubles in those periods, and memory queues sustain over 95% utilization.

To explain how much impact the memory subsystem and the system agent can have on IPs' execution time (active cycles, during which the IP remains in the active state), in Figure 4, we plot the total number of cycles spent by an IP in processing data and in data stalls. Here, we use data stall

[Figure 3: Bandwidth usage of Youtube and Skype over time. Panels: (a) Youtube, (b) Skype; x-axis: time; y-axis: memory bandwidth (in GBPS).]
system (the DRAM main memory) is employed to facilitate this flow. Finally, we may need to support several such flows at the same time. Even a single application may have several concurrent flows (the video part and audio part of the video capture application, which have their own pipelines). Even otherwise, with multiprogramming increasingly prevalent in handhelds, there is a need to concurrently support individual application flows in such environments.

Footnote 1: We use the terms accelerators and IPs interchangeably in this work.

drastic reduction in cycles per frame across applications and IPs (as high as 75%). In some IPs, memory is not a bottleneck and those did not show improved benefits. From this data, we conclude that reducing the memory access times does bring the cycles per frame down, which in turn boosts the overall application performance. Note that this perfect memory does not allow any frames to be dropped.

IV. IP-TO-IP DATA REUSE

This section explores performance optimization opportunities that exist in current designs and whether existing solutions can exploit them.

A. Data Reuse and Reuse Distance

In a flow, data gets read, processed (by IPs) and written back. The producer and consumer of the data could be two different IPs or sometimes even the same IP. We capture this IP-to-IP reuse in Figure 8, where we plot the physical addresses accessed by the core and other IPs for the YouTube application. Note that this figure only captures a very small slice of the entire application run. Here, we can see that the display controller (DC) (red points) reads a captured frame from a memory region that was previously written to by the video decoder (black points). Similarly, we can also see that the sound engine reads from an address region where the audio decoder writes. This clearly shows that the data gets reused repeatedly across IPs, but the reuse distances can be very high. As mentioned in Section II-B, when a particular application is run, the same physical memory regions get used (over time) by an IP for writing different frames. In our current context, the reuse we mention is only between the producer and consumer IPs for a particular frame and has nothing to do with frames being rewritten to the same addresses. Due to frame rate requirements, reuse distances between display-frame-based IPs were more than
tens of milliseconds, while those between audio-frame-based IPs were less than a millisecond. Thus, there is a large variation in producer-consumer reuse distances between IPs that process large (display) frames (e.g., VD, CAM) and IPs that process smaller (audio) frames (e.g., AD, AE).

[Figure 8: Data access pattern of IPs in YouTube application. X-axis: time; y-axis: address range.]

[Figure 9: Hit rates under various cache capacities. Y-axis: hit rate (%); x-axis: applications A1 to A8; series: 4 MB, 8 MB, 16 MB, 32 MB.]

B. Converting Data Reuse into Locality

Given the data reuse, the simplest solution is to place an on-chip cache and allow the multiple IPs to share it. The expectation is that caches are best for locality and hence they should work. In this subsection, we evaluate the impact of adding such a shared cache to hold the data frames. As in conventional caches, on a cache miss, the request is sent to the transaction queue. The shared cache is implemented as a direct-mapped structure, with multiple read and write ports, and multiple banks (with a bank size of 4 MB), and the read/write/lookup latencies are modeled using CACTI [40].
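The interaction between frame size and a direct-mapped shared cache can be illustrated with a toy model. The sketch below is not the simulator used in this work (it has no banks, ports or CACTI latencies); the function name, the write-allocate policy and the optional `chunk_lines` parameter are assumptions chosen for illustration. A producer writes a run of cache lines, a consumer then reads the same lines back, and the function reports the consumer's hit rate.

```python
def consumer_hit_rate(frame_lines, cache_lines, chunk_lines=None):
    """Hit rate of a consumer reading data a producer just wrote,
    through a direct-mapped cache of `cache_lines` lines.

    The producer writes `chunk_lines` lines, then the consumer reads them,
    and so on until the frame is done. By default the chunk is the whole
    frame, i.e. the frame is handed over as one atomic block.
    """
    if chunk_lines is None:
        chunk_lines = frame_lines
    cache = [None] * cache_lines  # cache[i] holds the line address stored in set i
    hits = 0
    for start in range(0, frame_lines, chunk_lines):
        chunk = range(start, min(start + chunk_lines, frame_lines))
        for line in chunk:  # producer write allocates the line (assumed policy)
            cache[line % cache_lines] = line
        for line in chunk:  # consumer read
            if cache[line % cache_lines] == line:
                hits += 1
            else:
                cache[line % cache_lines] = line  # miss: fill from memory
    return hits / frame_lines


fits = consumer_hit_rate(frame_lines=4, cache_lines=4)
too_big = consumer_hit_rate(frame_lines=8, cache_lines=4)
chunked = consumer_hit_rate(frame_lines=8, cache_lines=4, chunk_lines=2)
```

In this toy setting, a frame that fits in the cache is read back entirely from the cache, while a frame twice the cache size evicts its own early lines before the consumer reaches them. Handing the same frame over in smaller pieces shortens the reuse distance and restores the hits, which is the intuition behind the sub-frame idea listed among the contributions.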
We evaluated multiple cache sizes, ranging from 4 MB to 32 MB, and analyzed their hit rates and the reduction in cycles taken per frame to be displayed. We present the results for 4 MB, 8 MB, 16 MB and 32 MB shared caches in Figure 9 and Figure 10 for clarity. They capture the overall trend observed in our experiments. In our first experiment, we notice that as the cache sizes increase, the cache hit rates either increase or remain the same. For applications like Audio Record and Audio Play (with small frames), we notice 100% cache hit rates from a 4 MB cache. For other applications like Angry Birds or Video-play (with larger frames), a smaller cache does not suffice. Thus, as we increase the cache capacity, we achieve higher hit rates. Interestingly, some applications have very low cache hit rates even with large caches. This can be attributed to two main reasons. First, frame sizes are too large to fit even two frames in a large 32 MB cache (as in the case of YouTube and Gallery). Second, and most importantly, if the reuse distances are large, data gets kicked out of caches by the other flows in the system or by other frames in the same flow. Applications with large reuse distances like Video-record exhibit such behavior.

In our second experiment, we quantify the performance benefits of having such large shared caches between IPs, and give the average cycles consumed by an IP to process a full frame (audio/video/camera frame). As can be seen

[Figure 10: Cycles Per Frame under various cache capacities. Y-axis: % cycles per frame; x-axis: IPs (AE, AD, VE, VD, IMG, DC, SND, MIC, CAM); series: 4 MB, 8 MB, 16 MB, 32 MB.]