phoebe-click
Uploaded On 2016-08-01


Presentation Transcript

[Figure 4: Total data stalls and processing time in IPs during execution. Y-axis: % of time spent, split into processing cycles and data stall cycles; x-axis: IPs grouped by application (Angry Birds, Snd. Rec, MP3, Photos, CamPic, Skype, Video Record, Youtube).]

[Figure 5: Trends showing increase of percentage of data stalls with each newer generation of IPs and DRAMs. Y-axis: fraction of time an IP stalls in memory; series: 0.5x-LPDDR2-800, Base-LPDDR3-1333, 2x-LPDDR3-1600, 4x-LPDDR4-3200.]

[Figure 6: Percentage of frames completed in a subset of applications with varying memory bandwidths. Y-axis: % of frames shown; series: 5.3 GBPS, 10.6 GBPS, 21.2 GBPS; x-axis: A6, A7, A8, AVG (all).]

to mean the number of cycles an IP stalls for data without doing any useful computation, after issuing a request to the memory. We observe from Figure 4 that the video decoder and video encoder IPs spend most of their time processing the data, and do not stress the memory subsystem. IPs that have very small compute time, like the audio decoder and sound engine, demand much higher bandwidth than memory can provide, and thus tend to stall more than compute. Camera IP and graphics IP, on the other hand, send bursts of requests for large frames of data at regular intervals. Here as well, if memory is not able to meet the high bandwidth or has high latency, the IP remains in the high-power mode stalling for the requests. The high data stalls seen in Figure 4 translate to frame drops, which are shown in Figure 6 (for 5.3 GBPS memory bandwidth). We see that on average 24% of the frames are dropped with the default baseline system, which can hurt user experience with the device. With higher memory bandwidths (2x and 4x of the baseline bandwidth), though the frame drops decrease, they still do not improve as much as the increase in bandwidth. Even with 4x baseline bandwidth, we observe more than 10% frame drops (because of higher memory latencies).

As user demands increase and more use-cases need to be supported, the number of IPs in the SoC is likely to
increase [44] along with data sizes [12]. Even as the DRAM speeds increase, the need to go off-chip for data accesses places a significant bottleneck. This affects performance, power and eventually the overall user experience. To understand the severity of this problem, we conduct a simple experiment shown in Figure 5, demonstrating how the cycles per frame vary across the base system (given in Table III) and when the IPs compute at half their base speed (last generation IPs), and twice their speed (next generation) etc. For DRAM, we varied the memory throughput by varying the LPDDR configurations. We observe from the results in Figure 5 that the percentage of data stalls increases as we go from one generation to the next. Increasing the DRAM peak bandwidth alone is not sufficient to match the IP scaling. We require solutions that can tackle this problem within the SoC.

Further, to establish the maximum gains that can be obtained if we had an "ideal and perfect memory", we did a hypothetical study of perfect memory with 1 cycle latency. The cycles-per-frame results with this perfect memory system are shown in Figure 7. As expected, we observed

[Figure 7: Percentage reduction in Cycles-Per-Frame in different flows with a perfect memory configuration. Y-axis: % cycles per frame; x-axis: flows grouped by application A1 to A8.]

rendering when operating on frames and for the display rendering. These tiles are similar to the sub-frames proposed in this work. A typical tiled rendering algorithm first transforms the geometries to be drawn (multiple objects on the screen) into screen space and assigns them into tiles. Each screen-space tile holds a list of geometries that needs to be drawn in that tile. This list is sorted front to back, and the geometry behind another is clipped away and only the ones in the front are drawn to the screen. GPUs render each tile separately to a local on-chip buffer, which is finally written back to main memory inside a framebuffer region. From this region, the display controller reads the frame to be shown on screen. All tiles are independent of each other, and thus form a sub-frame in our system.
B. OS and Hardware Support

In current systems, IPs are invoked sequentially one after another per frame. Let us revisit the example considered previously: a flow with 3 IPs. The OS, through device drivers, calls the first IP in the flow. It waits for the processing to complete and the data to be written back to memory and then calls the second IP. After the second IP finishes its processing, the third IP is called. With sub-frames, when the data is ready in the first IP, the second IP is notified of the incoming request so that it can be ready (by entering the active state from a low-power state) when the data arrives. We envision that the OS can capture this information through a library (or multiple libraries) since the IP-flows for each application are pretty standard. In Android [15] for instance, there is a layer of libraries (Hardware Abstraction Layer, HAL) that interface with the underlying device drivers, and these HALs are specific to IPs. As devices evolve, HAL and the corresponding drivers are expected to enable access to devices to run different applications. By adding an SA HAL and its driver counterpart to communicate the flow information, we can accomplish our requirements. From the application's perspective, this is transparent since the access to the SA HAL happens from within other HALs as they are requested by the applications. Figure 16 shows a high-level view of the sub-frame implementation in SA along with our short-circuiting techniques.

From a hardware perspective, to enable sub-framing of data, the SA needs to have a small matrix of all IPs, with rows corresponding to producers and columns to consumers. Each entry in the row is 1 bit per IP. Currently, we are looking at about 8 IPs, and this is about 8 bytes in total. In future, even as we grow to 100 IPs, the size of the matrix is small. As each IP completes its sub-frame, the SA looks at its matrix and informs the consumer IP. In situations where we have multiple flows (currently Android allows two applications to run simultaneously [32]) with an IP in common, the entries in the SA for the common IP can be swapped in or out along with the context of the application running. This will maintain the correct consumer IP status in the matrix for any IP.
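The producer/consumer matrix described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the hardware design: the class name `FlowMatrix`, its methods, and the choice of eight IP names are hypothetical, and each row is simply stored as an integer bit mask.

```python
class FlowMatrix:
    """Producer x consumer bit matrix (hypothetical model of the SA table).

    Row i belongs to producer IP i; bit j of that row is set when IP j
    consumes the sub-frames that IP i produces.
    """

    def __init__(self, ip_names):
        self.ip_names = list(ip_names)
        self.index = {name: i for i, name in enumerate(self.ip_names)}
        self.rows = [0] * len(self.ip_names)  # one bit mask per producer

    def add_flow(self, flow):
        """Register a flow such as ["CAM", "IMG", "DC"]; each IP feeds the next."""
        for producer, consumer in zip(flow, flow[1:]):
            self.rows[self.index[producer]] |= 1 << self.index[consumer]

    def consumers_of(self, producer):
        """IPs to notify when `producer` completes a sub-frame."""
        row = self.rows[self.index[producer]]
        return [name for j, name in enumerate(self.ip_names) if (row >> j) & 1]

    def size_bytes(self):
        """Storage for an n x n matrix at 1 bit per entry, rounded up to bytes."""
        n = len(self.ip_names)
        return (n * n + 7) // 8


# Eight IPs, as in the text; the particular selection is an assumption.
matrix = FlowMatrix(["GPU", "DC", "AD", "SND", "CAM", "IMG", "VE", "VD"])
matrix.add_flow(["CAM", "IMG", "DC"])
```

With eight IPs the matrix holds 64 bits, which matches the "about 8 bytes" quoted in the text; at 100 IPs the same layout needs 10,000 bits, or 1,250 bytes.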
VII. EVALUATION

In this section, we present the performance and power benefits obtained by using sub-frames compared to the conventional baseline system which uses full frames in IP flows. We used a modified version of the GemDroid infrastructure [7] for the evaluation. For each application evaluated, we captured and ran the traces either to completion or for a fixed time. The trace lengths varied from about 2 secs to 30 secs. Usually this length is limited by frame sizes we needed to capture. Workloads evaluated are listed in Table II and the configuration of the system used is described in Section II-D and Table III. For this evaluation we used a sub-frame size of 32 cache lines (2 KB). The transaction queue and bank queues in SA can hold 64 entries (totaling 8 KB). For the flow buffer solution, we used a 32 KB buffer (based on the hit rates observed in Figure 14).

User Experience: As a measure of user experience, we track the number of frames that could complete in each flow. The more frames that get completed, the fewer the frame drops and the better the user experience. Figure 17 shows the number of frames completed in different schemes. The y-axis shows the percentage of frames completed out of the total frames in an application trace. The first bar shows the frames completed in the baseline system with full frame flows. The second and third bars show the percentage of frames completed with our two techniques. In the baseline system, only around 76% of frames were displayed. By using our two techniques, the percentage of frames completed improved to 92% and 88%, respectively. Improvements in our schemes are mainly attributed to the reduced memory bandwidth demand and improved memory latency as the consumer IP's requests are served through the flow-buffers or by short-circuiting the memory requests. The hit rates of consumers' requests were previously shown in Figure 14. In some cases, flow-buffers perform better than short-circuiting due to the space advantage in the flow buffering technique.
Performance Gains: To understand the effectiveness of our schemes, we plot the average number of cycles taken to process a frame in each flow in Figure 18. This is the time between the invocation of the first IP and completion of the last IP in each flow. Note that reducing the cycles per frame can lead to fewer frame drops. When we use our techniques with sub-framing, due to pipelining of intra-frame data across multiple IPs instead of sequentially processing one frame after another, we are able to substantially reduce the cycles per frame by 45% on average. We also observed that in the A6-Skype application (which has multiple flows), through the use of sub-framing, the memory subsystem gets overwhelmed because we allow more IPs to work at the same time. This is not the case in the base system. If IPs do not benefit from flow-buffers or IP request short-circuiting, the memory pressure is more than the baseline, leading to some performance loss (17%).

Energy Gains: Energy efficiency is a very important

Short-Circuiting Memory Traffic in Handheld Platforms

Praveen Yedlapalli, Nachiappan Chidambaram Nachiappan, Niranjan Soundararajan, Anand Sivasubramaniam, Mahmut T. Kandemir, Chita R. Das
The Pennsylvania State University; Intel Corp.
Email: {praveen, nachi, anand, kandemir, das}@cse.psu.edu, {niranjan.k.soundararajan}@intel.com

Abstract

through the system and show that, by communicating at frame granularity, we miss significant performance optimization opportunities, caused by large IP-to-IP data reuse distances. By carefully breaking these frames into sub-frames, while maintaining correctness, we demonstrate substantial gains with limited hardware requirements. Specifically, we evaluate two techniques, flow-buffering and IP-IP short-circuiting, and show that these techniques bring both power-performance benefits and enhanced user experience.
I. INTRODUCTION

The propensity of tablets and mobile phones in this handheld era raises several interesting challenges. At the application end, these devices are being used for a number

which may not be possible just by optimizing the system in parts. With this philosophy, this paper focusses on an important class of applications run on these handheld devices (real-time frame-oriented video/graphics/audio), examines the data flow in these applications through the different computational kernels, identifies the inefficiencies when sustaining these flows in today's hardware solutions, which simply rely on main memory to exchange such data, and proposes alternate hardware enhancements to optimize such flows.

(Footnote: The authors would like to confirm that this work is an academic exploration and does not reflect any effort within Intel.)

Real-time interactive applications, including interactive

[Figure 13: Delay breakdown of a memory request issued by IPs or cores. The numbers above the bar give the absolute cycles (150, 124, 137, 238, 166). Y-axis: delay breakdown into MC Trans Q, MC Bank Q and DRAM; x-axis: Audio Record, AR Game, Video Record, Video Play, Average (10 apps).]

small low-latency cost and area efficient buffers.

Note that, in these use-cases, cores typically run device driver code and handle interrupts. They perform minimal data-frame processing. Consequently, we do not incorporate flow-buffers between core and any other accelerator. Also, when a use-case is in its steady state (for example, a minute into running a video), the IPs are in the active state and quickly consume data. However, if an IP is finishing up on an activity or busy with another activity or waking up from sleep state, the sub-frames can be overwritten in the flow-buffer. In that case, based on sub-frame addresses, the consumer IP can find its data in the main memory since the flow-buffer is a write-through buffer. In our experiments, discussed later in Section VII, we found that a flow-buffer size of 32 KB provides a good trade-off between avoiding a large flow-buffer and sub-frames getting overwritten.
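The write-through behaviour and the fallback to main memory described above can be illustrated with a small Python model. It is a sketch under stated assumptions rather than the authors' hardware: the name `FlowBuffer`, the dictionary standing in for main memory, and the oldest-first overwrite policy are all hypothetical choices, since the text only says that sub-frames "can be overwritten".

```python
from collections import OrderedDict


class FlowBuffer:
    """Write-through buffer between one producer IP and one consumer IP.

    Assumption: when the buffer is full, the oldest sub-frame is overwritten.
    """

    def __init__(self, capacity_bytes=32 * 1024, subframe_bytes=2 * 1024):
        self.capacity = capacity_bytes // subframe_bytes  # sub-frames held
        self.entries = OrderedDict()  # sub-frame address -> data, oldest first
        self.hits = 0
        self.misses = 0

    def write(self, addr, data, memory):
        """Producer side: write to main memory and keep a copy in the buffer."""
        memory[addr] = data  # write-through, so memory always has the data
        self.entries.pop(addr, None)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # overwrite the oldest sub-frame
        self.entries[addr] = data

    def read(self, addr, memory):
        """Consumer side: use the buffer on a hit, fall back to memory on a miss."""
        if addr in self.entries:
            self.hits += 1
            return self.entries[addr]
        self.misses += 1
        return memory[addr]


# A deliberately tiny buffer (two sub-frames) so that an overwrite occurs.
main_memory = {}
buffer = FlowBuffer(capacity_bytes=4096, subframe_bytes=2048)
for address in (0, 1, 2):
    buffer.write(address, f"subframe-{address}", main_memory)
late_read = buffer.read(0, main_memory)   # overwritten, served from memory
fresh_read = buffer.read(2, main_memory)  # still buffered
```

Because every write also reaches main memory, a slow consumer still reads correct data after its sub-frame has been overwritten; it only loses the latency benefit.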
B. IP-IP Short-circuiting

The flow-buffer solution requires an extra piece of hardware to work. To avoid the cost of adding the flow-buffers, an alternate technique would be to enable consumers to directly use the data that their producers provide. Towards that, we analyzed the average round-trip delays of all accesses issued by the cores or IPs (shown in Figure 13) and found requests spend maximum time queuing in the memory subsystem. MC Trans Queue shows the time taken from the request leaving the IP till it gets to the head of the transaction queue. The next part, MC Bank Queue, is the time spent in bank queues. This is primarily determined by whether the data access was a row buffer hit or miss. And, finally, DRAM shows the time for DRAM accessing along with the response back to the IPs. As can be seen, most of the time is spent in the memory transaction queues (~100 cycles). This means that data that could otherwise be reused lies idle in the memory queues, and we use this observation towards building an opportunistic IP-to-IP short-circuiting technique, similar in concept to "store-load forwarding" in CPU cores [29], [38] (see footnote 2), though our technique is in between different IPs. There are correctness and implementation differences, which we highlight in the following paragraphs/sections.

(Footnote 2: Core requests spend a relatively insignificant amount of time in transaction queues as they are not bursty in nature. Due to their strict QoS deadlines, they are prioritized over other IP requests. They spend more time in bank queues and in DRAM.)

[Figure 14: Hit rates with flow-buffering and IP-IP short-circuiting. Y-axis: hit rate (%); series: 32 KB, 16 KB, 8 KB and 4 KB flow-buffers, and IP-IP short-circuiting; x-axis: Angry Birds, Snd. Rec, MP3, Photo, CamPic, Skype, Video Record, Youtube.]

IPs usually load the data frames produced by other IPs.
Similar to store-load forwarding, if the consumer IP's load requests can be satisfied from the memory transaction queue or bank queues, the memory stall time can be considerably reduced. As the sub-frame size gets smaller, the probability of a load hitting a store gets higher. Unlike the flow-buffers discussed in Section V-A, store data does not remain in the queues till they are overwritten. This technique is opportunistic and as the memory bank clears up its entries, the request moves from the transaction queue into the bank queues and eventually into main memory. Thus, the loads need to follow the stores quickly, else they have to go to memory. This distance between the consumer IP load request and producer IP store request depends on how full the transaction and bank queues are. In the extreme case, if both the queues (transaction queue and bank queue) are full, the number of requests that a load can come after a store will be the sum of the number of entries in the queues.

The overhead of implementing the IP-IP short-circuiting is not significant since we are using pre-existing queues present in the system agent. The transaction and bank queues already implement an associative search to re-order requests based on their QoS requirements and row-buffer hits, respectively [33]. Address searches for satisfying core loads already exist and these can be reused for other IPs. As we will show later, this technique works only when the sub-frame reuse distance is small.

C. Effects of Sub-framing Data with Flow-Buffering and IP-IP Short-circuiting

The benefits of sub-framing are quantified in Figure 14 in terms of hit rates when using flow-buffering and IP-IP short-circuiting. We can see that the buffer hit rates increase as we increase the size of flow-buffers, and saturate when the size of buffers is in the range of 16 KB to 32 KB. The other advantage of having sub-frames is the reduced bandwidth consumption due to the reduced number of memory accesses. As discussed before, accelerators primarily face bandwidth issues with the current memory subsystem.
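The opportunistic nature of IP-IP short-circuiting can be made concrete with a toy Python model. This is a sketch under simplifying assumptions, not the system-agent implementation: the transaction and bank queues are merged into one first-in-first-out list, stores drain only when the list is full, and the names `MemoryQueueModel`, `store`, `load` and `drain` are hypothetical.

```python
from collections import deque


class MemoryQueueModel:
    """Toy model of store-to-load forwarding out of the memory queues.

    Assumption: transaction and bank queues are treated as one FIFO whose
    capacity is the sum of their entries, as in the extreme case in the text.
    """

    def __init__(self, transaction_entries, bank_entries):
        self.capacity = transaction_entries + bank_entries
        self.pending = deque()  # (address, data) stores not yet in DRAM
        self.dram = {}

    def drain(self, count=1):
        """Retire the oldest pending stores to DRAM."""
        for _ in range(min(count, len(self.pending))):
            address, data = self.pending.popleft()
            self.dram[address] = data

    def store(self, address, data):
        """Producer IP store; a full queue forces the oldest store out first."""
        if len(self.pending) >= self.capacity:
            self.drain(1)
        self.pending.append((address, data))

    def load(self, address):
        """Consumer IP load; returns (data, forwarded_from_queue)."""
        for queued_address, data in reversed(self.pending):  # youngest first
            if queued_address == address:
                return data, True
        return self.dram.get(address), False


# Four queue entries in total and five stores: the first store has drained.
queues = MemoryQueueModel(transaction_entries=2, bank_entries=2)
for address in range(5):
    queues.store(address, f"subframe-{address}")
recent = queues.load(4)  # still queued, so it is short-circuited
stale = queues.load(0)   # already drained, so it comes from DRAM
```

In this model a load is forwarded only if it arrives within `capacity` stores of its producer, which mirrors the statement that the maximum store-to-load distance is the sum of the queue entries.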
Sub-framing alleviates such bottlenecks by avoiding fetching every piece of data from memory. Redundant writes and reads to the same addresses are avoided. Latency benefits of our techniques, as well as their impact on user experience, will be given later in Section VII.

often deployed to leverage the specificity in computation for each frame and delivering high energy efficiency for the required computation. The frames are then pipelined through these accelerators one after another sequentially. Fourth, in many of these applications, the frames have to flow not just through one such computational stage (accelerator) but possibly through several such stages. For instance, consider a video capture application, where the camera IP may capture raw data, which is then encoded into an appropriate form by another IP, before being sent either to a flash storage or a display. Consequently, the frames have to flow through

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 1072-4451/14 $31.00 © 2014 IEEE. DOI 10.1109/MICRO.2014.60

available locality across IPs goes unexploited. By breaking the data frames into sub-frames, the reuse distance decreases substantially, and we can use flow buffers or the existing memory controller queues to forward data from the producer to consumer. Our results show that such techniques help to reduce frame latencies, which in turn enhance the end-user experience while minimizing energy consumption.

ACKNOWLEDGMENTS

This research is supported in part by the following NSF grants: #1205618, #1213052, #1302225, #1302557, #1317560, #1320478, #1409095, #1439021, #1439057, and grants from Intel.

REFERENCES

[1] "Vision Statement: How People Really Use Mobile," January-February 2013. [Online]. Available: http://hbr.org/2013/01/how-people-really-use-mobile/ar/1
[2] B. Akesson et al., "Predator: A predictable sdram memory controller," in CODES+ISSS, 2007.
[3] Apple, "Apple iphone 5s." Available: https://www.apple.com/iphone/
[4] N. Balasubramanian et al., "Energy consumption in mobile phones: A measurement study and implications for network applications," in IMC, 2009.
[5] M. N. Bojnordi and E. Ipek, "Pardis: A programmable memory controller for the ddrx interfacing standards," in ISCA, 2012.
[6] G. Chen et al., "Exploiting inter-processor data sharing for improving behavior of multi-processor socs," in ISVLSI, 2005.
[7] N. Chidambaram Nachiappan et al., "GemDroid: A Framework to Evaluate Mobile Platforms," in SIGMETRICS, 2014.
[8] P. Conway et al., "Cache hierarchy and memory subsystem of the amd opteron processor," IEEE micro, 2010.
[9] A. Coskun et al., "Temperature aware task scheduling in mpsocs," in DATE, 2007.
[10] R. Das et al., "Aergia: Exploiting packet latency slack in on-chip networks," in ISCA, 2010.
[11] B. Diniz et al., "Limiting the power consumption of main memory," in ISCA, 2007.
[12] J. Engwell, "The high resolution future - retina displays and design," Blurgroup, Tech. Rep., 2013.
[13] J. Y. C. Engwell, "Gpu technology trends and future requirements," Nvidia Corp., Tech. Rep.
[14] S. Fenney, "Texture compression using low-frequency signal modulation," in HWWS, 2003.
[15] Google, "Android HAL." Available: https://source.android.com/devices/index.html
[16] Google. (2013) Android sdk - emulator. Available: http://developer.android.com/
[17] M. I. Gordon et al., "A stream compiler for communication-exposed architectures," in ASPLOS, 2002.
[18] A. Gutierrez et al., "Full-system analysis and characterization of interactive smartphone applications," in IISWC, 2011.
[19] K. Han et al., "A hybrid display frame buffer architecture for energy efficient display subsystems," in ISLPED, 2013.
[20] M. K. Jeong et al., "A qos-aware memory controller for dynamically balancing gpu and cpu bandwidth use in an mpsoc," in DAC, 2012.
[21] A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," in ASPLOS, 2013.
[22] M. Kandemir et al., "Exploiting shared scratch pad memory space in embedded multiprocessor systems," in DAC, 2002.
[23] C. N. Keltcher et al., "The amd opteron processor for multiprocessor servers," IEEE Micro, 2003.
[24] H. b. T. Khan and M. K. Anwar, "Quality-aware Frame Skipping for MPEG-2 Video Based on Inter-frame Similarity," Malardalen University, Tech. Rep.
[25] Y. Kim et al., "ATLAS: A Scalable and High-performance Scheduling Algorithm for Multiple Memory Controllers," in HPCA, 2010.
[26] Y. Kim et al., "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior," in MICRO, 2010.
[27] K.-B. Lee and T.-S. Chang, Essential Issues in SoC Design: Designing Complex Systems-on-Chip. Springer, 2006, ch. SoC Memory System Design.
[28] A. W. Lim and M. S. Lam, "Maximizing parallelism and minimizing synchronization with affine transforms," in POPL, 1997.
[29] G. H. Loh et al., "Memory bypassing: Not worth the effort," in WDDD, 2002.
[30] P. Marwedel et al., "Mapping of applications to mpsocs," in CODES+ISSS, 2011.
[31] A. Naveh et al., "Power management architecture of the 2nd generation intel® core™ microarchitecture, formerly codenamed sandy bridge," 2011.
[32] S. T. Report, "Wqxga solution with exynos dual," 2012.
[33] S. Rixner et al., "Memory access scheduling," in ISCA, 2000.
[34] P. Rosenfeld et al., "Dramsim2: A cycle accurate memory system simulator," CAL, 2011.
[35] R. Saleh et al., "System-on-chip: Reuse and integration," Proceedings of the IEEE, 2006.
[36] Samsung, "Samsung galaxy s5," 2014. Available: http://www.samsung.com/global/microsite/galaxys5/
[37] H. Schwarz et al., "Overview of the scalable video coding extension of the h.264/avc standard," IEEE Transactions on Circuits and Systems for Video Technology, 2007.
[38] T. Sha et al., "Scalable store-load forwarding via store queue index prediction," in MICRO, 2005.
[39] H. Shim et al., "A compressed frame buffer to reduce display power consumption in mobile systems," in ASP-DAC, 2004.
[40] P. Shivakumar and N. P. Jouppi, "Cacti 3.0: An integrated cache timing, power, and area model," Technical Report 2001/2, Compaq Computer Corporation, Tech. Rep., 2001.
[41] A. K. Singh et al., "Communication-aware heuristics for run-time task mapping on noc-based mpsoc platforms," J. Syst. Archit., 2010.
[42] R. SoC, "Rk3188 multimedia codec benchmark," 2011.
[43] Y. Song and Z. Li, "New tiling techniques to improve cache temporal locality," in ACM SIGPLAN Notices, 1999.
[44] D. S. Steve Scheirey, "Sensor fusion, sensor hubs and the future of smartphone intelligence," ARM, Tech. Rep., 2013.
[45] V. Suhendra et al., "Integrated scratchpad memory optimization and task scheduling for mpsoc architectures," in CASES, 2006.
[46] Y. Wang et al., "Markov-optimal sensing policy for user state estimation in mobile devices," in IPSN, 2010.
[47] L. Xue et al., "Spm conscious loop scheduling for embedded chip multiprocessors," in ICPADS, 2006.
[48] M. Yuffe et al., "A fully integrated multi-cpu, gpu and memory controller 32nm processor," in ISSCC, 2011.
[49] Y. Zhang et al., "A data layout optimization framework for nuca-based multicores," in MICRO, 2011.
[50] Y. Zhu and V. J. Reddi, "High-performance and energy-efficient mobile web browsing on big/little systems," in HPCA, 2013.

[Figure 11: Area and power-overhead with large shared caches. Left panel: power (W), split into dynamic and leakage; right panel: area (mm^2); x-axis: cache sizes of 4 MB, 8 MB, 16 MB and 32 MB.]

from Figure 10, increasing the cache sizes does not always help and there is no optimal size. For IPs like SND and AD, the frame sizes are small and hence a smaller cache suffices. From there on, increasing cache size increases lookup latencies, and affects the access times. In other cases, like DC, as the frame sizes are large, we observe fewer cycles per frame as we increase the cache size. For other accelerators with latency tolerance, once their data fits in the cache, they encounter no performance impact.

Further, scaling cache sizes above 4 MB is not reasonable due to their area and power overheads. Figure 11 plots the overheads for different cache sizes. Typically, handhelds operate in the range of 2 W to 10 W, which includes everything on the device (SoC + display + network). Even the 2 W consumed by the 4 MB cache will impact battery life severely.

Summary: To summarize, the high number of memory stalls is the main reason for frame drops, and large IP-to-IP reuse distances are the main cause of large memory stalls. Even large caches are not sufficient to capture the data reuse and hence, accelerators and devices still have considerable memory stalls. All of these observations led us to re-architect how data gets exchanged between different IPs, paving the way for better performance.
V. SUB-FRAMING

Our primary goal is to reduce the IP-to-IP data reuse distances, and thereby reduce data stalls, which we saw were a major impediment to performance in Section III. To achieve this, we propose a novel approach of sub-framing the data. One of the commonly used compiler techniques to reduce the data reuse distance in loop nests that manipulate array data is to employ loop tiling [28], [43]. It is the process of partitioning a loop's iteration space into smaller blocks (tiles) in a manner that the data used by the loop remains in the cache, enabling quicker reuse. Inspired by tiling, we propose to break the data frames into smaller sub-frames, which reduces IP-to-IP data reuse distances.

[Figure 12: IP-to-IP reuse distance variation with different sub-frame sizes. Note that the y-axis is in the log scale. Panels: CAM-IMG and IMG-DC; MMC-VD and VD-DC. Y-axis: access interval; x-axis: sub-frame size (Frame, 1024, 512, 128, 64, 32, 8, 1).]

In current systems, IPs receive a request to process a data frame (it could be a video frame, audio frame, display frame or image frame). Once an IP completes its processing, the next IP in the pipeline is triggered, which in turn triggers the following IP once it completes its processing, and so on. In our solution, we propose to sub-divide these data frames into smaller sub-frames, so that once IP1 finishes its first sub-frame, IP2 is invoked to process it. In the following sections, we show that this design reduces the hardware requirements to store and move the data considerably, thereby bringing
both performance and power gains. The granularity of the sub-frame can have a profound impact on various metrics. To quantify the effects of subdividing a frame, we varied the sub-frame sizes from 1 cache line to the current data frame size, and analyzed the reuse distances. Figure 12 plots the reduction in the IP-to-IP reuse distances (on y-axis, plotted on log scale), as we reduced the size of a sub-frame. We can see from this plot an inverse exponential decrease in reuse distances. In fact, for very small sub-frame sizes, we see reuse distances of less than 100 cycles. To capitalize on such small reuse distances, we explore two techniques: flow-buffering and opportunistic IP-to-IP request short-circuiting.

A. Flow-Buffering

In Section IV-B, we showed that even large caches were not very effective in avoiding misses. This is primarily due to very large reuse distances that are present between the data-frame write by a producer and the data-frame read by a consumer. With sub-frames, the reuse distances reduce dramatically. Motivated by this, we now re-explore the option of caching data. Interestingly, in this scenario, caches of much smaller size can be far more effective (low misses). The reuse distances resulting from sub-framing are so small that even having a structure with a few cache lines is sufficient to capture the temporal locality offered by IP pipelining in SoCs. We call these structures flow-buffers. Unlike a shared cache, the flow-buffers are private between any two IPs. This design avoids the conflict misses seen in a shared cache (fully associative has high power implications). These flow-buffers are write-through. As the sub-frame gets written to the flow-buffer, it is also written to memory. The reason for this design choice is discussed next.
In a typical use-case involving data flow from IP-A → IP-B → IP-C, IP-A gets its data from the main memory and starts computing it. During this process, as it completes a sub-frame, it writes back this chunk of data into the flow-buffer between IP-A and IP-B. IP-B starts processing this sub-frame from the flow-buffer (in parallel with IP-A working on another sub-frame) and writes it back to the flow-buffer between itself and IP-C. Once IP-C is done, the data is written into the memory or the display. Originally, every read and write in the above scenario would have been scheduled to reach the main memory. Now, with the flow-buffers in place, all the requests can be serviced from these

With their advent, we also see a tremendous increase in device-user interactivity and real-time data processing needs. Media (audio/video/camera) and gaming use-cases are gaining substantial user attention and are defining product successes. The combination of increasing demand from these use-cases and having to run them at low power (from a battery) means that architects have to carefully study the applications and optimize the hardware and software stack together to gain significant optimizations. In this work, we study workloads from these domains and identify the memory subsystem (system agent) to be a critical bottleneck to performance scaling. We characterize

[Figure 1: Target SoC platform with a high-level view of different functional blocks in the system.]

the workhorses of the SoC as they provide maximum performance and power efficiency, e.g. video encoders/decoders, graphics, imaging and audio engines.

Interactions between Core, IPs and Operating System: SoC applications are highly interactive and involve multiple accelerators and devices to enhance user experience. Using API calls, application requirements get transformed to accelerator requirements through different layers of the OS.
Typically, the calls happen through software device drivers in the kernel portion of the OS. These calls decide if, when and for how long the different accelerators get used. The device drivers, which are optimized by the IP vendors, control the functionality and the power states of the accelerators. Once an accelerator needs to be invoked, its device driver is notified with the request and the associated physical address of the input data. The device driver sets up the different activities that the accelerator needs to do, including writing appropriate registers with pointers to the memory region where the data should be fetched and written back. The accelerator reads the data from main memory through DMA. Input data fetching and processing are pipelined, and the fetching granularity depends on how the local buffer is designed. Once data is processed, it is written back to the local buffers and eventually to the main memory at the address region specified by the driver. As most accelerators work faster than main memory, there is a need for input and output buffers.

The System Agent (SA): Also known as the Northbridge, the SA is a controller that receives commands from the core and passes them on to the IPs. Some designs add more intelligence to the SA to prioritize and reorder requests to meet QoS deadlines and to improve DRAM hits. The SA usually incorporates the memory controller (MC) as well. Apart from re-ordering requests across components to meet QoS guarantees, even fine-grained re-ordering among IP requests can be done to maximize DRAM bandwidth and bus utilization. With increasing user demands from handhelds, the number of accelerators and their speeds keep increasing [12], [13], [44]. These trends will place a very high demand on DRAM traffic. Consequently, unless we design a sophisticated SA that can handle the increased amount of traffic, the improvement in accelerator performance will not translate into improved user experience.

[Figure 2: Overview of data flow in SoC architectures.]
B. Data Movement in SoCs

Figure 2 depicts the high-level view of the data flow in SoC architectures. Once a core issues a request to an IP through the SA (shown as (1)), the IP starts its work by injecting a memory request into the SA. First, the request traverses through an interconnect, which is typically a bus or crossbar, and is enqueued in a memory transaction queue. Here, requests can be reordered by the SA according to individual IP priorities to help requests meet their deadlines. Subsequently, requests are placed in the bank queues of the memory controller, where requests from IPs are rearranged to maximize the bus utilization (and in turn, the DRAM bandwidth). Following that, an off-chip DRAM access is made. The response is returned to the IP through the response network in the SA (shown as (2)). IP-1 writes its output data to memory (shown as (3)) till it completes processing the whole frame. After IP-1 completes processing, IP-2 is invoked by the core (shown as (4)), and a data flow similar to what IP-1 had is followed, as captured by (5) and (6) in Figure 2. The unit of data processing in media and gaming IPs (including audio, video and graphics) is a frame, which carries information about the image or pixels or audio delivered to the user. Typically a high frame drop rate corresponds to a deterioration in user experience.

The total request access latency includes the network latency, the queuing latencies at the transaction queue and bank queue, DRAM service time, and the response network latency. This latency, as expected, is not constant and varies based on the system dynamics (including DRAM row-buffer hits/misses). When running a particular application, the OS maps the data frames of different IPs to different physical memory regions. Note that these regions get reused during the application run for writing different frames (over time). In a data flow involving multiple IPs that process the frames one after another, the OS (through device drivers) synchronizes the IPs such that the producer IP writes another frame of data onto the same memory region only after the consumer IP has consumed the earlier frame.
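To see why handing over whole frames, as in the flow above, costs latency compared with handing over sub-frames, the following Python sketch compares the two under an idealised model. All of it is an assumption made for illustration: the function names, the stage cycle counts, the equal split of work across sub-frames, and the absence of memory stalls or notification overhead are not taken from the paper.

```python
def frame_granularity_cycles(stage_cycles):
    """Each IP processes the whole frame before the next IP is invoked."""
    return sum(stage_cycles)


def subframe_pipelined_cycles(stage_cycles, num_subframes):
    """Each IP hands over a sub-frame as soon as it is done (ideal pipeline).

    Assumes the work of every IP divides evenly across sub-frames. The first
    sub-frame passes through every stage, and each remaining sub-frame then
    completes one slowest-stage interval later.
    """
    per_subframe = [cycles / num_subframes for cycles in stage_cycles]
    return sum(per_subframe) + (num_subframes - 1) * max(per_subframe)


# Hypothetical three-IP flow with per-frame costs in cycles.
stages = [300, 600, 300]
sequential = frame_granularity_cycles(stages)
pipelined = subframe_pipelined_cycles(stages, num_subframes=4)
```

In this model the pipelined latency approaches the cost of the slowest IP alone as the number of sub-frames grows, which is the upper bound on what overlap can deliver; real gains also depend on memory behaviour, which the sketch ignores.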
C. Decomposing an Application Execution into Flows

Applications cannot directly access the hardware for I/O or acceleration purposes. In Android, for example, application requests get translated by intermediate libraries and

[Figure 17: Percentage of Frames Completed (Higher the better). Y-axis: fraction of frames completed; series: Base, SubFrame-Cache, SubFrame-Forward; x-axis: flows grouped by application (Angry Birds, Snd. Rec, MP3, Photo, CamPic, Skype, Video Record, Youtube, AVG).]

[Figure 18: Reduction in Cycles Per Frame in a flow normalized to Baseline (Lower the better). Y-axis: % of cycles per frame; series: SubFrame-Cache, SubFrame-Forward; x-axis: flows grouped by application A1 to A8, AVG.]

metric in handhelds since they operate out of a battery (most of the time). Exact IP design and power states incorporated are not available publicly. As a proxy, we use the number of cycles an IP was active to correspond to the energy consumed when running the specific applications. In Figure 19, we plot the total number of active cycles consumed by an accelerator compared to the base case. We plot this graph only for accelerators as they are compute-intensive and hence consume most of the power in a flow. On average, we observe 46% and 35% reduction in active cycles (going up to 80% in GPU) with our techniques, which translates to substantial system-level energy gains. With sub-framing, we also reduce the memory energy consumption by 33% due to (1) reduced DRAM accesses, and (2) memory spending more time in low-power mode. From the above results it can be concluded that sub-framing yields significant performance and power gains.
VIII. RELATED WORK

Data Reuse: Data reuse within and across cores has been studied by many works. Chen et al. [6], [47], Gordon et al. [17] and Kandemir et al. [22] propose compiler optimizations that perform code restructuring and enable data sharing across processors. Suhendra et al. [45] proposed ways to optimally use scratchpad memory in MPSoCs along with methods to schedule processes to cores. There have been multiple works that discuss application and task mapping to MPSoCs [9], [30], [41] with the goal of minimizing data movement across cores. Our work looks at accelerator traffic, which is dominant in SoCs, and identifies that frame data is reused across IPs. Unlike core traffic, the reuse can be exploited only if the data frames are broken into sub-frames. We capture this for data frames of different classes of applications (audio/video/graphics) and propose techniques to reduce the data movement by short-circuiting the producer writes to the consumer reads.

Figure 19: Reduction in Number of Active Cycles of Accelerators (lower is better). Y-axis: % of active cycles; x-axis: accelerators in applications A1-A8; series: SubFrame-Cache, SubFrame-Forward.

Memory Controller Design: A large body of work exists in the area of memory scheduling techniques and memory controller designs in the context of MPSoCs. Lee and Chang [27] describe the essential issues in memory system design for SoCs. Akesson et al. [2] propose a memory scheduling technique that provides a guaranteed minimum bandwidth and maximum latency bound to IPs. Jeong et al. [20] provide QoS guarantees to frames by balancing memory requests at the memory controller. Our work identifies a specific characteristic (reuse at sub-frame level) that exists when data flows through accelerators and optimizes system agent design. Our solution is complementary to prior techniques and can work in tandem with them.

Along with IP design and analysis, several works have proposed IP-specific optimizations [14], [19], [24], [39] and low-power aspects of system-on-chip architectures [11], [18], [46]. Our solution is not specific to any IP; rather, it is at the system level. By reducing the IP stall times and memory traffic, we make the SoC performance- and power-efficient.
IX. CONCLUSIONS

Memory traffic is a performance bottleneck in mobile systems, and it is critical that we optimize the system as a whole to avoid such bottlenecks. Media and graphics applications are very popular on mobile devices, and operate on data frames. These applications have a lot of temporal locality of data between producer and consumer IPs. We show that by operating on a frame as an atomic block between different IPs, the reuse distances are very high, and thus, the

ago), requiring substantial computational resources for real-time interactivity, on both input and output sides, with the external world. On the hardware end, power and limited battery capacities mandate high degrees of energy efficiency to perform these computational tasks. Meeting these computational needs with the continuing improvements in hardware capabilities is no longer just a matter of throwing high-performance and plentiful cores or even accelerators at the problem. Instead, a careful examination and marriage of the hardware with the application and execution characteristics is warranted for extracting the maximum efficiencies. In other words, a co-design of software and hardware is nec-

VI. IMPLEMENTATION DETAILS

In implementing our sub-frame idea, we account for the probable presence of dependencies and correctness issues resulting from splitting frames. Below, we discuss the correctness issue and the associated intricacies that need to be addressed to implement sub-frames. We then discuss the software, hardware and system-level support needed for such implementations.

A. Correctness

We broadly categorize data frames into the following types: (i) video, (ii) audio, (iii) graphics display, and (iv) network packets. Of these, the first three types of frames are the ones that usually demand sustained high bandwidth, with frame sizes varying from a megabyte to tens of MBs. In this work, we address only the first three types of frames, and leave out network packets as the latency of network packet transmission is considerably higher compared to the time spent in the SoC.
Video and Audio Frames: Encoding and decoding, abbreviated as codec, is the compression and decompression of data, and can be performed at either the hardware or the software layer. Current-generation smartphones such as the Samsung S5 [36] and Apple iPhone [3] have multiple types of codecs embedded in them.

Video Codecs: First, let us consider the flows containing video frames, and analyze the correctness of sub-dividing such large frames into smaller ones. Among the video codecs, the most commonly used are the H.264 (MPEG-4) and H.265 (MPEG-H, HEVC) codecs. Let us take a small set of video frames and analyze the decoding process. The encoding process is almost equivalent to the inversion of each stage of decoding. As a result, similar principles apply there as well. Figure 15 shows a video clip in its entirety, with each frame component named. A high-quality HD video is made up of multiple frames of data. Assuming a default of 60 FPS, the amount of data needed to show the clip for a minute would be 1920x1080 (screen resolution) x 3 (bytes/pixel) x 60 (frame rate) x 60 (seconds) = 21.35 GB. Even if each frame is compressed individually and stored on today's hand-held devices, the amount of storage available would not permit it. To overcome this limitation, codecs take advantage of the temporal redundancy present in video frames, as the next frame is usually not very different from the previous frame.

Figure 15: Pictorial representation showing the structure of five consecutive video frames.

Figure 16: High-level view of the SA that handles sub-frames.

Each frame can be dissected into multiple slices. Slices are further split into macroblocks, which are usually blocks of 16x16 pixels. These macroblocks can be further divided into finer granularities such as sub-macroblocks or pixel blocks. But we do not need such fine granularities for our purpose. Slices can be classified into 3 major types: I-Slice (independent or intra slices), P-Slice (predictive), and B-Slice (bi-directional) [37], as depicted in Figure 15.
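The raw-video size estimate in the Video Codecs paragraph can be checked with a few lines of arithmetic. This is a sketch for verification only; the variable and function names are ours. The product comes to 22,394,880,000 bytes, which is about 22.39 GB in decimal units and about 20.86 GiB in binary units. The quoted 21.35 GB appears to correspond to dividing by 1000 x 1024 x 1024 (about 21.36), which is an observation about the arithmetic, not a statement from the paper.

```python
WIDTH, HEIGHT = 1920, 1080   # screen resolution
BYTES_PER_PIXEL = 3
FPS = 60                     # frame rate
SECONDS = 60                 # one minute of video

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL   # bytes in one raw frame
clip_bytes = frame_bytes * FPS * SECONDS         # bytes in one minute of raw video


def in_units(n_bytes: int) -> dict:
    """Express a byte count under three different 'gigabyte' conventions."""
    return {
        "decimal_GB": n_bytes / 10**9,
        "binary_GiB": n_bytes / 2**30,
        "mixed_1000x2^20": n_bytes / (1000 * 2**20),
    }


sizes = in_units(clip_bytes)
for name, value in sizes.items():
    print(f"{name}: {value:.2f}")
```

Whichever convention is used, the conclusion in the text is unchanged: roughly 20 GB per minute of uncompressed video is far beyond what handheld storage can accommodate.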
I-slices have all data contained in them, and do not need motion prediction. P-slices use motion prediction from one slice which belongs to the past or the future. B-slices use two slices from the past or the future. Each slice is an independent entity and can be decoded without the need for any other slice in the current frame. P- and B-slices need slices from a previous or next frame only.

In our sub-frame implementation, we choose slice-level granularity as the finest level of sub-division to ensure correctness without having any extra overhead of synchronization. As slices are independently decoded in a frame, the need for another slice in the frame does not arise, and we can be sure that correctness is maintained. Sub-dividing any further would bring in dependencies, stale data and overwrites.

Audio Codecs: Audio data is coded in a much simpler fashion than video data. An audio file has a large number of frames, with each audio frame having the same number of bits. Each frame is independent of the others and consists of a header block and a data block. The header block (in MP3 format) stores 32 bits of metadata about the data block that follows. Thus, each audio frame can be decoded independently of another, as all data required for decoding is present in the current frame's header. Therefore, using a full audio frame as a sub-frame would not cause any correctness issue.

Graphics Rendering: Graphics IPs already employ tiled

Footnote 3 (to the slice classification above): Earlier codecs had frame-level classification instead of slice-level. In such situations, an I-frame is constructed as a frame with only I-slices.

back, are amongst the most popular on today's tablets and mobiles apart from email and social networking. Such applications account for nearly 65% of the usage on today's handhelds [1], stressing the importance of meeting the challenges imposed by such applications efficiently. This important class of applications has several key characteristics that are relevant to this study. First, these applications work with input (sensors, network, camera, etc.) and/or output (display, speaker, etc.) devices, mandating real-time responsiveness.
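The slice-level rule from the correctness discussion (a sub-frame boundary may fall between slices but never inside one) can be sketched as follows. This is an illustration under our own assumptions, not the paper's implementation: the `Slice` type, the `split_into_subframes` function, the byte sizes and the 64,000-byte budget are all hypothetical. The sketch groups consecutive slices into sub-frames, so every sub-frame is a whole number of slices.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Slice:
    kind: str   # "I", "P" or "B"
    size: int   # encoded size in bytes (illustrative values only)


def split_into_subframes(slices: List[Slice], max_bytes: int) -> List[List[Slice]]:
    """Group consecutive slices into sub-frames of at most max_bytes each.

    A slice is never cut: a slice larger than max_bytes forms a sub-frame
    on its own, so the slice stays the finest unit of sub-division.
    """
    subframes: List[List[Slice]] = []
    current: List[Slice] = []
    used = 0
    for s in slices:
        # Close the current sub-frame if adding this slice would exceed the budget.
        if current and used + s.size > max_bytes:
            subframes.append(current)
            current, used = [], 0
        current.append(s)
        used += s.size
    if current:
        subframes.append(current)
    return subframes


frame = [Slice("I", 40_000), Slice("P", 12_000), Slice("B", 9_000),
         Slice("P", 30_000), Slice("B", 70_000)]
subframes = split_into_subframes(frame, max_bytes=64_000)
print([len(sf) for sf in subframes])
```

Choosing a budget smaller than every slice degenerates to one slice per sub-frame, which is the finest granularity the text allows.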
Second, these applications deal with "frames" of data, with the requirement to process a frame within a stipulated time constraint. Third, the computation required for processing a

Apart from the computational needs for real-time execution, all the above observations stress the memory intensity of these applications. Frames of data coming from any external sensor/device are streamed into memory, from which they are streamed out by a different IP, processed and put back in memory. Such an undertaking places heavy demands on the memory subsystem. When we have several concurrent flows, either within the same application or across applications in a multiprogrammed environment, all of these flows contend for the memory and stress it even further. This contention can have several consequences: (i) without a steady stream of data to/from memory, the efficiencies from having specialized IPs with continuous data flow can get lost with the IPs stalling for memory; (ii) such stalls with idle IPs can lead to energy wastage in the IPs themselves; and (iii) the high memory traffic can also contend with, and slow down, the memory accesses of the main cores in the system. While there has been a lot of work covering processing for these handheld environments, whether it be CPU cores or specialized IPs and accelerators (e.g., [35], [50], [27]), the topic of optimizing the data flows, while keeping the memory system in mind, has drawn little attention. Optimizing for memory system performance, and minimizing consequent queueing delays, has itself received substantial interest in the past decade, but only in the area of high-end systems (e.g., [4], [26], [25], [10]). This paper addresses this critical issue in the design of handhelds, where memory will play an increasingly important role in sustaining the data flow not just across CPU cores, but also between IPs, and with the peripheral input-output (display, sound, network and sensors) devices.
Intodayshandheldarchitectures,a SystemAgent (SA) [8],[23],[31],[48]servesastheglueintegratingallthe compute(whetheritbeIPsorCPUcores)andstorage components.Italsoservesastheconduittothememory system.However,itdoesnotclearlyunderstanddata”ows, andsimplyactsasaslaveinitiatingandservingmemory requestsregardlessofwhichcomponentrequestsit.Asa result,thehighframeraterequirementstranslatetoseveral transactionsinthememoryqueues,andthe”owofthese framesfromoneIPtoanotherexplicitlygoesthroughthese queues,i.e.,thepotentialfordata”ow(ordata reuse )across IPsisnotreallybeingexploited.Instead,inthispaperwe exploretheideaof virtuallyintegratingacceleratorpipelines byshort-circuitingŽmanyoftheread/writerequests,so thatthetraf“cinthememoryqueuescanbesubstantially reduced.Speci“cally,weexplorethepossibilityof shared buffers/caches and short-circuitingcommunication between theIPcoresbasedonrequestsalreadypendinginthe memorytransactionqueues.Inthiscontext,thispapermakes thefollowing contributions : € Weshowthatthememoryishighlyutilizedinthese systems,withIPsfacingaround47%oftheirtotal executiontimestallingfordata,inturn,causing24% oftheframestobedroppedintheseapplications.We cannotaffordtolettechnologytakecareofthisproblem sincewitheachDRAMtechnologyadvancement,the demandsfromthememorysystemalsobecomemore stringent. € Blindlyprovisioningasharedcachetoleveragedata ”ow/reuseacrosstheIPcoresisalsolikelytobeless bene“cialfromapracticalstandpoint.Ananalysisof theIP-to-IPreusedistancessuggeststhatsuchcaches havetorunintoseveralmegabytesforreasonablehit rates(whichwouldalsobeundesirableforpower). € Weshowthatthisproblemismainlyduetothecurrent framesizesbeingrelativelylarge.Akintotilingforlo- calityenhancementinnested-loopsoflargearrays[28], [43],[49],weintroducethenotionof sub-frame Žfor restructuringthedata”ow,whichcansubstantiallyre- ducereusedistances.Wediscusshowthissub-framing canbedonefortheconsideredapplications. 
• With this sub-framing in place, we show that reasonably sized shared caches (referred to as flow buffers in this paper) between the producer and consumer IPs of a frame can circumvent the need to go to main memory for many of the loads from the consumer IP. Such reduction in memory traffic results in around 20% performance improvement in these applications.

• While these flow buffers can benefit these platforms substantially, we also explore an alternate idea that does not require any separate hardware structures: leveraging existing memory queues for data forwarding from the producer to the consumer. Since memory traffic is usually high, recently produced items are more likely to be waiting in these queues (serving as a small cache), which could be forwarded to the requesting consumer IP. We show that this can be accommodated in recently proposed memory queue structures [5], and demonstrate performance and power benefits that are nearly as good as those of the flow buffer solution.

II. BACKGROUND AND EXPERIMENTAL SETUP

In this section, we first provide a brief overview of current SoC (system-on-chip) platforms showing how the OS, core and the IPs interact from a software and hardware perspective. Next, we describe our evaluation platform, and properties of the applications that are used in this work.

A. Overview of SoC Platforms

As shown in Figure 1, handhelds available in the market have multiple cores and other specialized IPs. The IPs in these platforms can be broadly classified into two categories: accelerators and devices. Devices interact directly with the user or external world and include cameras, touch screen, speaker and wireless. Accelerators are the on-chip hardware components which specialize in certain activities. They are

IP Abbr.   Expansion             IP Abbr.   Expansion
VD         Video Decoder         AD         Audio Decoder
DC         Display Controller    VE         Video Encoder
MMC        Flash Controller      MIC        Microphone
AE         Audio Encoder         CAM        Camera
IMG        Imaging               SND        Sound

Table I: Expansions for IP abbreviations used in this paper.
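The queue-forwarding idea in the last contribution above (serving a consumer read from a producer write that is still pending in the memory transaction queue) can be illustrated with a toy model. This is a sketch under our own assumptions, not the paper's system agent design: the `TransactionQueue` class, its methods, the oldest-first drain policy and the counters are all hypothetical. The default of 64 entries mirrors the transaction queue size listed in the platform configuration.

```python
from collections import OrderedDict


class TransactionQueue:
    """Toy memory transaction queue that can forward pending writes to readers."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.pending_writes = OrderedDict()  # address -> data, oldest entry first
        self.dram = {}                       # stands in for off-chip memory
        self.forwarded = 0                   # reads served from the queue
        self.dram_reads = 0                  # reads that had to go to DRAM

    def write(self, addr: int, data: int) -> None:
        # A newer write to the same address supersedes the pending one.
        self.pending_writes.pop(addr, None)
        if len(self.pending_writes) >= self.capacity:
            # Queue full: retire the oldest pending write to DRAM.
            old_addr, old_data = self.pending_writes.popitem(last=False)
            self.dram[old_addr] = old_data
        self.pending_writes[addr] = data

    def read(self, addr: int):
        if addr in self.pending_writes:
            # Short-circuit: the producer's data is still waiting in the queue.
            self.forwarded += 1
            return self.pending_writes[addr]
        self.dram_reads += 1
        return self.dram.get(addr)


q = TransactionQueue(capacity=4)
for addr in range(6):                          # producer IP writes six items
    q.write(addr, addr * 10)
values = [q.read(addr) for addr in range(6)]   # consumer IP reads them back
print(q.forwarded, q.dram_reads)
```

The sketch also shows why sub-framing matters for this scheme: only items produced recently enough to still be in the queue can be forwarded, so the consumer has to follow the producer closely.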
Id   Application              IP Flows
A1   Angry Birds              AD-SND; GPU-DC
A2   Sound Record             MIC-AE-MMC
A3   Audio Playback (MP3)     MMC-AD-SND
A4   Photos Gallery           MMC-IMG-DC
A5   Photo Capture (Cam Pic)  CAM-IMG-DC; CAM-IMG-MMC
A6   Skype                    CAM-VE; VD-DC; AD-SND; MIC-AE
A7   Video Record             CAM-VE-MMC; MIC-AE-MMC
A8   Youtube                  VD-DC; AD-SND

Table II: IP flows in our applications.

device drivers into commands that drive the accelerators and devices. This translation results in hardware processing the data, moving it between multiple IPs and finally writing it to storage, displaying it, or sending it over the network. Let us consider, for example, a video player application. The flash controller reads a chunk of the video file from memory; the chunk gets processed by the core, and two separate requests are sent to the video decoder and audio decoder. They read their data from the memory and, once an audio/video frame is decoded, it is sent to the display through memory. In this paper, we term such a regular stream of data movement from one IP to another a flow. All our target applications have such flows, as shown in Table II. Table I gives the expansions for IP abbreviations. It is to be noted that an application can have one or more flows. With multiple flows, each flow could be invoked at a different point in time, or multiple independent flows can be active concurrently without sharing any common IP or data across them.

D. Evaluation Platform

Handheld/mobile platforms commonly run applications that rely on user inputs and are interactive in nature. Studying such a system is tricky due to the non-determinism associated with it. To enable that, we use GemDroid [7], which utilizes Google Android's open-source emulator [16] to capture the complete system-level activity. This provides a complete memory trace (with cycles between memory accesses) along with all IP calls when they were invoked by the application. We extended the platform by including DRAMSim2 [34] for accurate memory performance evaluation. Further, we enhanced the tool to extensively model the system agent, accelerators and devices in detail.
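The notion of a flow defined above can be made concrete by encoding Table II as data and deriving the producer-consumer IP pairs within each flow. This is a small sketch; the dictionary layout and the `producer_consumer_pairs` helper are our own illustrative choices, while the application names and flow strings are copied from Table II.

```python
# Table II encoded as: application -> list of flows, each flow an IP chain.
FLOWS = {
    "A1 Angry Birds": ["AD-SND", "GPU-DC"],
    "A2 Sound Record": ["MIC-AE-MMC"],
    "A3 Audio Playback (MP3)": ["MMC-AD-SND"],
    "A4 Photos Gallery": ["MMC-IMG-DC"],
    "A5 Photo Capture (Cam Pic)": ["CAM-IMG-DC", "CAM-IMG-MMC"],
    "A6 Skype": ["CAM-VE", "VD-DC", "AD-SND", "MIC-AE"],
    "A7 Video Record": ["CAM-VE-MMC", "MIC-AE-MMC"],
    "A8 Youtube": ["VD-DC", "AD-SND"],
}


def producer_consumer_pairs(flow: str):
    """Return adjacent (producer, consumer) IP pairs along one flow."""
    ips = flow.split("-")
    return list(zip(ips, ips[1:]))


total_flows = sum(len(flows) for flows in FLOWS.values())
print(total_flows)
print(producer_consumer_pairs("CAM-IMG-DC"))
```

Each pair is a point where one IP's output frame becomes the next IP's input, which is where the data reuse studied in this paper arises.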
Specifications of select IPs (frequency, frame sizes and processing latency) are available from [42]. For completeness, we give all core parameters, DRAM parameters, and IP details in Table III. The specifications used are derived from current systems in the market [3], [36]. Note that the techniques discussed in this work are generic and not tied to specific micro-architectural parameters.

Processor        ARM ISA; 4-core processor; clocked at 2 GHz; OoO w/ issue width: 4
Caches           32 KB L1-I; 32 KB L1-D; 512 KB L2
Memory           Up to 2 GB reserved for cores; 2 GB to 3 GB reserved for IPs
                 LPDDR3-1333; 1 channel; 1 rank; 8 banks
                 5.3 GBPS peak bandwidth; tCL, tRP, tRCD = 12, 12, 12 ns
System Agent     Frequency: 500 MHz; interconnect latency: 1 cycle per 16 bytes
                 Memory Transaction-Q.: 64 entries; Bank-Q.: 8 entries
IPs and System   All IPs run at 500 MHz frequency
Parameters       Aud. frame: 16 KB; Vid. frame: 4K (3840x2160)
                 Camera frame: 1080p (1920x1080)
                 Input buffer sizes: 16-32 KB; output buffer sizes: 32-64 KB
                 Enc/decoding ratio: VD 1:32; AD 1:8

Table III: Platform configuration.

III. MOTIVATION: MEMORY STALLS

In this section, we show that DRAM stalls are high in current SoCs and that this will only worsen as IP performance scales. Typically, DRAM is shared between the cores and IPs and is used to transfer data between them. There is a high degree of data movement and this often results in high contention for memory controller bandwidth between the different IPs [21]. Figure 3 shows the memory bandwidth obtained by two of our applications, YouTube and Skype, with a 6.4 GBPS memory. One can notice the burstiness of the accesses in these plots. Depending on the type of IPs involved, frames get written to memory or read from memory at a certain rate. For example, cameras today can capture video frames of resolution 1920x1080 at 60 FPS and the display refreshes the screen with these frames at the same rate (60 FPS). Therefore, 60 bursts of memory requests from both IPs happen in a second, with each burst requesting one complete frame. While the request rate is small, the data size per request is high: 6 MB for a 1920x1080 resolution frame (this will increase with 4K resolutions [12]). If this amount of bandwidth cannot be catered to by the DRAM, the memory controller and DRAM
queues fill up rapidly and in turn the devices and accelerators start experiencing performance drops. The performance drop also affects battery life as execution time increases. In the right-side graph in Figure 3, whenever the camera (CAM) initiates its burst of requests, the peak memory bandwidth consumption can be seen (about 6 GBPS). We also noticed that the average memory latency more than doubles in those periods, and memory queues sustain over 95% utilization.

To explain how much impact the memory subsystem and the system agent can have on an IP's execution time (active cycles during which the IP remains in the active state), in Figure 4, we plot the total number of cycles spent by an IP in processing data and in data stalls. Here, we use "data stall"

Figure 3: Bandwidth usage of Youtube and Skype over time. Panels: (a) Youtube, (b) Skype; x-axis: time; y-axis: memory bandwidth (in GBPS).

system (the DRAM main memory) is employed to facilitate this flow. Finally, we may need to support several such flows at the same time. Even a single application may have several concurrent flows (the video part and audio part of the video capture application, which have their own pipelines). Even otherwise, with multiprogramming increasingly prevalent in handhelds, there is a need to concurrently support individual application flows in such environments.

Footnote 1: We use the terms accelerators and IPs interchangeably in this work.

drastic reduction in cycles per frame across applications and IPs (as high as 75%). In some IPs, memory is not a bottleneck and those did not show improved benefits. From this data, we conclude that reducing the memory access times does bring the cycles per frame down, which in turn boosts the overall application performance. Note that this perfect memory does not allow any frames to be dropped.

IV. IP-TO-IP DATA REUSE

This section explores performance optimization opportunities that exist in current designs and whether existing solutions can exploit them.
A. Data Reuse and Reuse Distance

In a flow, data gets read, processed (by IPs) and written back. The producer and consumer of the data could be two different IPs or sometimes even the same IP. We capture this IP-to-IP reuse in Figure 8, where we plot the physical addresses accessed by the core and other IPs for the YouTube application. Note that this figure only captures a very small slice of the entire application run. Here, we can see that the display controller (DC) (red points) reads a captured frame from a memory region that was previously written to by the video decoder (black points). Similarly, we can also see that the sound engine reads from an address region where the audio decoder writes. This clearly shows that the data gets reused repeatedly across IPs, but the reuse distances can be very high. As mentioned in Section II-B, when a particular application is run, the same physical memory regions get used (over time) by an IP for writing different frames. In our current context, the reuse we mention is only between the producer and consumer IPs for a particular frame and has nothing to do with frames being rewritten to the same addresses. Due to frame rate requirements, reuse distances between display-frame-based IPs were more than tens of milliseconds, while those between audio-frame-based IPs were less than a millisecond. Thus, there is a large variation in producer-consumer reuse distances between IPs that process large (display) frames (e.g., VD, CAM) and IPs that process smaller (audio) frames (e.g., AD, AE).

B. Converting Data Reuse into Locality

Figure 8: Data access pattern of IPs in the YouTube application. X-axis: time (2.2e+09 to 2.22e+09); y-axis: address range.

Figure 9: Hit rates under various cache capacities. Y-axis: hit rate (%); x-axis: applications A1-A8; series: 4 MB, 8 MB, 16 MB, 32 MB.

Given the data reuse, the simplest solution is to place an on-chip cache and allow the multiple IPs to share it. The
expectation is that caches are best for locality and hence they should work. In this subsection, we evaluate the impact of adding such a shared cache to hold the data frames. As is typical of conventional caches, on a cache miss, the request is sent to the transaction queue. The shared cache is implemented as a direct-mapped structure, with multiple read and write ports, and multiple banks (with a bank size of 4 MB), and the read/write/lookup latencies are modeled using CACTI [40]. We evaluated multiple cache sizes, ranging from 4 MB to 32 MB, and analyzed their hit rates and the reduction in cycles taken per frame to be displayed. We present the results for 4 MB, 8 MB, 16 MB and 32 MB shared caches in Figure 9 and Figure 10 for clarity. They capture the overall trend observed in our experiments. In our first experiment, we notice that as the cache sizes increase, the cache hit rates either increase or remain the same. For applications like Audio Record and Audio Play (with small frames), we notice 100% cache hit rates from a 4 MB cache. For other applications like Angry Birds or Video-play (with larger frames), a smaller cache does not suffice. Thus, as we increase the cache capacity, we achieve higher hit rates. Interestingly, some applications have very low cache hit rates even with large caches. This can be attributed to two main reasons. First, frame sizes are too large to fit even two frames in a large 32 MB cache (as in the case of YouTube and Gallery). Second, and most importantly, if the reuse distances are large, data gets kicked out of caches by the other flows in the system or by other frames in the same flow. Applications with large reuse distances like Video-record exhibit such behavior.

In our second experiment, we quantify the performance benefits of having such large shared caches between IPs, and give the average cycles consumed by an IP to process a full frame (audio/video/camera frame). As can be seen

Figure 10: Cycles Per Frame under various cache capacities. Y-axis: % cycles per frame; x-axis: IPs (AE, AD, VE, VD, IMG, DC, SND, MIC, CAM); series: 4 MB, 8 MB, 16 MB, 32 MB.
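The interaction between frame size, cache capacity and reuse distance discussed in this subsection can be reproduced with a toy direct-mapped cache model. This is a deliberately simplified sketch, not the simulator used in the paper: the `consumer_hit_rate` function, the write-allocate policy, the strictly sequential producer-then-consumer ordering and the sizes (in cache lines) are all our own assumptions. The producer writes a frame in chunks and the consumer reads each chunk right after it is written; a chunk equal to the whole frame models frame-level hand-off, while a smaller chunk models a shorter reuse distance.

```python
def consumer_hit_rate(frame_lines: int, cache_lines: int, chunk_lines: int) -> float:
    """Consumer hit rate in a direct-mapped cache shared with the producer.

    The frame occupies line addresses 0..frame_lines-1. The producer writes
    chunk_lines lines (allocating them in the cache), then the consumer reads
    those same lines, and the pattern repeats until the frame is done.
    """
    cache = [None] * cache_lines   # cache index -> line address currently stored
    hits = reads = 0
    for start in range(0, frame_lines, chunk_lines):
        chunk = range(start, min(start + chunk_lines, frame_lines))
        for line in chunk:                     # producer writes (write-allocate)
            cache[line % cache_lines] = line
        for line in chunk:                     # consumer reads the same chunk
            reads += 1
            if cache[line % cache_lines] == line:
                hits += 1
            else:
                cache[line % cache_lines] = line   # fill the line on a miss
    return hits / reads


# Frame four times larger than the cache, handed over as one atomic block:
print(consumer_hit_rate(frame_lines=1024, cache_lines=256, chunk_lines=1024))
# Same frame and cache, handed over in chunks that fit in the cache:
print(consumer_hit_rate(frame_lines=1024, cache_lines=256, chunk_lines=128))
```

In this model the whole-frame hand-off yields no hits because every line is evicted before the consumer reaches it, whereas the chunked hand-off hits on every read; real traces with multiple concurrent flows would fall somewhere in between.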