
Fundamental Latency Trade-offs in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design

Moinuddin K. Qureshi, Dept. of Electrical and Computer Engineering, Georgia Institute of Technology, moin@gatech.edu
Gabriel H. Loh, AMD Research, Advanced Micro Devices, Inc., gabe.loh@amd.com

(* The work on Memory Access Prediction (Section 5) was done in 2009 while the first author was a research scientist at IBM Research [15].)

Abstract

This paper analyzes the design trade-offs in architecting large-scale DRAM caches. Prior research, including the recent work from Loh and Hill, has organized DRAM caches similar to conventional caches. In this paper, we contend that some of the basic design decisions typically made for conventional caches (such as serialization of tag and data access, large associativity, and update of replacement state) are detrimental to the performance of DRAM caches, as they exacerbate the already high hit latency. We show that higher performance can be obtained by optimizing the DRAM cache architecture first for latency, and then for hit rate. We propose a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst. We also propose a simple and highly effective Memory Access Predictor that incurs a storage overhead of 96 bytes per core and a latency of 1 cycle. It helps service cache misses faster without the need to wait for cache miss detection in the common case. Our evaluations show that our latency-optimized cache design significantly outperforms both the recent proposal from Loh and Hill, as well as an impractical SRAM Tag-Store design that incurs an unacceptable overhead of several tens of megabytes. On average, the proposal from Loh and Hill provides 8.7% performance improvement, the "idealized" SRAM-Tag design provides 24%, and our simple latency-optimized design provides 35%.

1. Introduction

Emerging 3D-stacked memory technology has the potential to provide a step function in memory performance. It can provide caches of hundreds of megabytes (or a few gigabytes) at almost an order of magnitude higher bandwidth compared to traditional DRAM; as such, it has been a very active research area [2, 4, 7, 12, 13, 19]. However, to get performance benefit from such large caches, one must first handle several key challenges, such as architecting the tag store, optimizing hit latency, and handling misses efficiently. The prohibitive overhead of storing tags in SRAM can be avoided by placing the tags in DRAM, but naively doing so doubles the latency of the DRAM cache (one access each for tag and data). A recent work from Loh and Hill [10, 11] makes the tags-in-DRAM approach efficient by co-locating the tags and data in the same row.
However, similar to prior work on DRAM caches, the recent work also architects DRAM caches in largely the same way as traditional SRAM caches, for example by having a serialized tag-and-data access and by employing typical optimizations such as high associativity and intelligent replacement.

We observe that the effectiveness of cache optimizations depends on technology constraints and parameters. What may be regarded as indispensable in one set of constraints may be rendered ineffective when the parameters and constraints change. Given that the latency and size parameters of a DRAM cache are so widely different from traditional caches, and the technology constraints are disparate, we must be careful about the implicit optimizations that get incorporated in the architecture of the DRAM cache. In particular, we point out that DRAM caches are much slower than traditional caches, so optimizations that exacerbate the already high hit latency may degrade overall performance even if they provide a marginal improvement in hit rate. While this may seem to be a fairly simple and straightforward concept, it has a deep impact (and often counter-intuitive implications) on the design of DRAM cache architectures.

We explain the need for reexamining conventional cache optimizations for DRAM caches with a simple example. Consider a system with a cache and a memory. Memory accesses incur a latency of 1 unit, and cache accesses incur 0.1 unit. Increasing the cache hit rate from 0% to 100% reduces the average latency linearly from 1 to 0.1, shown as "Base Cache" in Figure 1(a). Assuming the base cache has a hit rate of 50%, the average memory access time for the base cache is 0.55. Now consider an optimization A that eliminates 40% of the misses (hit rate with A: 70%) but increases hit latency to 1.4x (hit latency with A: 0.14 unit). We want to implement A only if it reduces average latency. We may begin by examining the target hit rate for A given the higher hit latency, such that the average latency is equal to the base case, which we call the Break-Even Hit Rate (BEHR). If the hit rate with A is higher than the BEHR, then A will reduce average latency. For our example, the BEHR for A is 52%. So, we deem A to be a highly effective optimization, and indeed it reduces average latency from 0.55 to 0.40.
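The arithmetic behind these numbers follows directly from the standard average-latency model. The sketch below is a minimal illustration of ours (not from the paper) that reproduces the 0.55, 0.40, and 52% figures; the same helper also covers the slow-cache case discussed next.

    def avg_latency(hit_rate, hit_lat, miss_lat=1.0):
        """Average access latency: hits at hit_lat, misses at miss_lat."""
        return hit_rate * hit_lat + (1.0 - hit_rate) * miss_lat

    def break_even_hit_rate(base_avg, hit_lat, miss_lat=1.0):
        """Hit rate at which a cache with latency hit_lat matches base_avg."""
        return (miss_lat - base_avg) / (miss_lat - hit_lat)

    base = avg_latency(0.50, 0.10)           # base cache: 0.55 units
    with_a = avg_latency(0.70, 0.14)         # optimization A: ~0.40 units
    behr = break_even_hit_rate(base, 0.14)   # ~0.52, so A must exceed 52%
    print(base, with_a, behr)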
Figure 1: Effectiveness of cache optimizations depends on cache hit latency. Option A increases hit latency by 1.4x and hit rate from 50% to 70%. (a) For a fast cache [hit latency 0.1], A is highly effective at reducing average latency from 0.55 to 0.40 (break-even hit rate: 52%). (b) For a slow cache [hit latency 0.5], A increases average latency from 0.75 to 0.79 (break-even hit rate: 83%).

Now, consider the same "highly effective" optimization A, but now the cache has a latency of 0.5 units, much like the relative latency of a DRAM cache. The revised hit latency with A will now be 1.4 x 0.5 = 0.7 units. Consider again that our base cache has a hit rate of 50%. Then the average latency for the base cache would be 0.75 units, as shown in Figure 1(b). To achieve this average latency, A must have a hit rate of 83%. Thus optimization A, which was regarded as highly effective in the prior case, ends up increasing average latency (from 0.75 to 0.79). The Break-Even Hit Rate also depends on the hit rate of the base cache. If the base cache had a hit rate of 60%, then A would need a 100% hit rate simply to break even!

Thus, seemingly indispensable and traditionally effective cache optimizations may be rendered ineffective if they have a significant impact on cache hit latency for DRAM caches. Note that typical cache optimizations, such as higher associativity and better replacement, do not usually provide a miss reduction as high as the 40% we have considered for A. However, our detailed analysis (Section 2) shows that to support these optimizations, previously analyzed DRAM cache architectures do incur a hit latency overhead of more than the 1.4x considered for A.

It is our contention that DRAM caches should be designed from the ground up keeping hit latency as a first priority for optimization. Design choices that increase hit latency by more than a negligible amount must be carefully analyzed to see if they indeed provide an overall improvement. We find that previously proposed designs for DRAM caches that try to maximize hit rate are not well suited for optimizing overall performance. For example, they continue to serialize the tag and data access (similar to traditional caches), which increases hit latency significantly. They provide high associativity (several tens of ways) at the expense of hit latency. We can significantly improve the performance of DRAM caches by optimizing them for latency first, and then for hit rate. With this insight, this paper makes the following contributions:

1. We analyze the latency of three designs: SRAM-Tags, the proposal from Loh and Hill, and an ideal latency-optimized DRAM cache. We find that the Loh-Hill proposal suffers from significant latency overheads due to tag serialization and due to the MissMap predictor. For SRAM-Tags, tag serialization latency limits performance. Both designs leave significant room for performance improvement compared to the latency-optimized design.

2. We show that de-optimizing the DRAM cache from a highly-associative structure to direct-mapped improves performance by reducing the hit latency, even if it degrades cache hit rate. For example, simply configuring the design of Loh and Hill from 29-way to direct-mapped enhances the performance improvement from 8.7% to 15%. However, this design still suffers from tag serialization due to separate accesses to the "tag-store" and "data-store."

3. We propose the Alloy Cache, a highly-effective latency-optimized cache architecture. Rather than splitting cache space into "tag store" and "data store," it tightly integrates, or alloys, the tag and data into one unit (Tag and Data, TAD). The Alloy Cache streams out a TAD unit on each cache access, thus avoiding the tag serialization penalty.

4. We present a simple and effective Memory Access Predictor [15]
to avoid the cache access penalty in the path of servicing a cache miss. Unlike MissMap, which incurs multi-megabyte storage and L3 access delay, our proposal requires a storage overhead of 96 bytes per core and incurs a latency of 1 cycle. Our predictor provides a performance improvement within 2% of a perfect predictor.

Our evaluations with a 256MB DRAM cache show that, on average, our latency-optimized design (35%) significantly outperforms both the proposal from Loh and Hill (8.7%) as well as the impractical SRAM-Tag design (24%). Thus, our simple design with less than 1KB of overhead (due to the predictor) provides 1.5x the performance benefit of the SRAM design that requires several tens of megabytes of overhead.

Figure 2: DRAM cache organization and flow for a typical access for (a) SRAM Tag-Store, (b) the organization proposed by Loh and Hill (with MissMap), and (c) an IDEAL latency-optimized cache.

2. Background and Motivation

While stacked memory can enable giga-scale DRAM caches, several challenges must be overcome before such caches can be deployed. An effective design of a DRAM cache must balance (at least) four goals. First, it should minimize the non-DRAM storage required for cache management (using a small fraction of DRAM space is acceptable). Second, it should minimize hit latency. Third, it should minimize miss latency, so that misses can be sent to memory quickly. Fourth, it should provide a good hit rate. These requirements often conflict with each other, and a good design must balance them appropriately to maximize performance.

It is desirable to organize DRAM caches at the granularity of a cache line in order to efficiently use cache capacity, and to minimize the consumption of main memory bandwidth [10]. One of the main challenges in architecting a DRAM cache at a line granularity is the design of the tag store. A per-line tag overhead of 5-6 bytes quickly translates into a total tag-store overhead of a few tens of megabytes for a cache size in the regime of a few hundred megabytes. We discuss the options to architect the tag store, and how it impacts cache latency.

2.1. SRAM-Tag Design

This approach stores tags in a separate SRAM structure, as shown in Figure 2(a). For the cache sizes we consider, this design incurs an unacceptably high overhead (24MB for a 256MB DRAM cache). We can configure the DRAM cache as a 32-way cache and store the entire set in one row of the cache [2, 10]. To obtain data, the access must first go through the tag store. We call the latency due to serialization of tag access the "Tag Serialization Latency" (TSL). TSL directly impacts the cache hit latency, and hence must be minimized.
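The 24MB figure follows from simple arithmetic over the line count. The snippet below is a back-of-the-envelope check of ours (assuming 64-byte lines and a 6-byte per-line tag entry, per the estimate above), not part of the original paper.

    MB = 1024 * 1024

    def sram_tag_overhead(cache_bytes, line_bytes=64, tag_bytes=6):
        """SRAM tag-store size for a line-granularity DRAM cache."""
        num_lines = cache_bytes // line_bytes
        return num_lines * tag_bytes

    print(sram_tag_overhead(256 * MB) / MB)  # -> 24.0 MB for a 256MB cache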
2.2. Tags-in-DRAM: The LH-Cache

We can place the tags in DRAM to avoid the SRAM overhead. However, naively doing so would require that each DRAM cache access incur the latency of two accesses, one for tag and the other for data, further exacerbating the already high hit latency. A recent work from Loh and Hill [10, 11] reduces the access penalty of DRAM tags by co-locating the tags and data for the entire set in the same row, as shown in Figure 2(b). It reserves three lines in a row for the tag store, and makes the other 29 lines available as data lines, thus providing a 29-way cache. A cache access must first obtain the tags, and then the data line. The authors propose Compound Access Scheduling so that the second access (for data) is guaranteed to get a row buffer hit. However, the second access still incurs approximately half the latency of the first, so this design still incurs significant TSL overhead.

Given that the tag check incurs a full DRAM access, the latency for servicing a cache miss is increased significantly. To service cache misses quickly, the authors propose a MissMap structure that keeps track of the lines in the DRAM cache. If a miss is detected in the MissMap, then the access can go directly to memory without the need to wait for a tag check. Unfortunately, the MissMap structure requires a multi-megabyte storage overhead. To implement this efficiently, the authors propose to embed the MissMap in the L3 cache. The MissMap is queried on each L3 miss, which means that the extra latency of the MissMap, which we call Predictor Serialization Latency (PSL), is added to the latency of both cache hits and cache misses. Thus, the hit latency suffers from both TSL and PSL. Throughout this paper, we will assume that the design from Loh and Hill [10] is always implemented with the MissMap, and we will refer to it simply as the LH-Cache.

Figure 3: Latency breakdown for two classes of isolated accesses, X and Y. X has good row buffer locality and Y needs to activate the memory row to get serviced. The latency incurred in an activity is marked as [N] processor cycles.

2.3. IDEAL Latency-Optimized Design

Both SRAM-Tags and LH-Cache have hit latency due to TSL. To reduce conflict misses, both designs are configured similarly to conventional set-associative caches. They place the entire set in a row for conflict miss reduction, sacrificing row-buffer hits for cache accesses (sequentially-addressed lines map to different sets, and the probability of temporally-close accesses going to the same set is 1%). Furthermore, for LH-Cache, supporting high associativity incurs higher latency due to streaming a large number of tag lines, and the bandwidth consumed by replacement update and victim selection further worsens the already high hit latency.

We argue that DRAM caches must be architected to minimize hit latency. This can be done by a suitable cache structure that avoids extraneous latency overheads and supports row buffer locality. Ideally, such a structure would have zero TSL and PSL, and would stream out exactly one cache line after a latency equal to the raw latency of the DRAM structure (ACT+CAS for accesses that open the row, and only CAS for row-buffer hits). Also, it would know a priori if the access would hit in the cache or go to memory. We call such a design IDEAL-LO (Latency Optimized). As shown in Figure 2(c), it does not incur any latency overheads.
2.4. Raw Latency Breakdown

In this section, we quantitatively analyze the latency effectiveness of the different designs. While there are several alternative implementations of both SRAM-Tags and LH-Cache, we will restrict the analysis in this section to the exact implementations of SRAM-Tags and LH-Cache as previously described, including identical latency numbers for all parameters [10], which are summarized in Table 2. We report latency in terms of processor cycles. Off-chip memory has tACT and tCAS of 36 cycles each, and needs 16 cycles to transfer one line on the bus. Stacked DRAM has tACT and tCAS of 18 cycles each, and needs 4 cycles to transfer one line on the bus. The latency for accessing the L3 cache as well as the SRAM tag store is assumed to be 24 cycles.

To keep the analysis tractable, we will initially consider only isolated accesses of two types, X and Y. Type X has a high row buffer hit rate in off-chip memory and is serviced by memory with a latency equal to a row buffer hit. Type Y needs to open the row in order to get serviced. The baseline memory system would service X in 52 cycles (36 for CAS, and 16 for bus), and Y in 88 cycles (36 for ACT, 36 for CAS, and 16 for bus). Figure 3 shows the latency incurred by the different designs to service X and Y.

As both SRAM-Tags and LH-Cache map the entire set to a single DRAM row, they get poor row buffer hit rates in the DRAM cache. Therefore, for both X and Y, neither cache design will give a row buffer hit, and a hit for both X and Y will incur the latency of an ACT. With IDEAL-LO, however, X gets a row buffer hit and only Y needs the latency of an ACT.

The SRAM-Tag design suffers a Tag Serialization Latency of 24 cycles for both cache hits and misses. A cache hit needs another 40 cycles (18 ACT + 18 CAS + 4 burst), for a total of 64 cycles. Thus SRAM-Tag increases latency for hits on X, decreases latency for hits on Y, and increases latency for misses on both X and Y due to the inherent latency of tag lookup.

LH-Cache first probes the MissMap, which incurs a latency of 24 cycles.[1] For a hit, LH-Cache then issues a read for tag information (ACT+CAS, 36 cycles), then it streams out the three tag lines (12 cycles), followed by one DRAM cycle for the tag check. This is followed by the access to the data line (CAS+burst). Thus a hit in LH-Cache incurs a latency of 96 cycles, almost doubling the latency for X on a hit, degrading the latency for Y on a hit, and adding the MissMap latency to misses.

An IDEAL-LO organization would service X with a row buffer hit, reducing the latency to 22 cycles. A hit for Y would incur 40 cycles. IDEAL-LO does not increase miss latency.

To summarize, we assumed that the raw latency of the stacked DRAM cache is half that of the off-chip memory. However, due to the inherent serialization latencies, LH-Cache (and in most cases SRAM-Tag) has a higher raw latency than off-chip memory, whereas IDEAL-LO continues to provide a reduction in hit latency on cache hits.

[1] The MissMap serialization latency can be avoided by probing the MissMap in parallel with the L3 access. However, this would double the L3 accesses, as the MissMap would be probed on L3 hits as well, causing bank/port contention and increasing L3 latency and power consumption. Hence, prior work [10] used serial access for the MissMap, and so did we.
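These cycle counts compose in a straightforward way from the parameters above. The following sketch is our own reconstruction, not the paper's; the tag-check term is rounded to 2 processor cycles (one stacked-DRAM cycle at half the core clock), which is an assumption on our part.

    # Latencies in processor cycles (3.2GHz core), from Table 2 / Section 2.4.
    MEM_ACT, MEM_CAS, MEM_BUS = 36, 36, 16   # off-chip DRAM, one 64B line
    ST_ACT, ST_CAS, ST_BUS = 18, 18, 4       # stacked DRAM, one 64B line
    L3_OR_TAGSTORE = 24                      # L3 / SRAM tag-store / MissMap probe
    TAG_CHECK = 2                            # ~one stacked-DRAM cycle (assumed)

    # Baseline off-chip service: X hits the row buffer, Y must open the row.
    x_mem = MEM_CAS + MEM_BUS                # 52 cycles
    y_mem = MEM_ACT + MEM_CAS + MEM_BUS      # 88 cycles

    # Set-associative designs map a set to one row, so cache hits pay an ACT.
    sram_tag_hit = L3_OR_TAGSTORE + ST_ACT + ST_CAS + ST_BUS          # 64
    lh_cache_hit = (L3_OR_TAGSTORE                    # MissMap probe (PSL)
                    + ST_ACT + ST_CAS + 3 * ST_BUS + TAG_CHECK  # tag read
                    + ST_CAS + ST_BUS)                # data line -> 96 total

    # IDEAL-LO keeps row buffer locality: X becomes a row buffer hit.
    ideal_lo_x = ST_CAS + ST_BUS                      # 22
    ideal_lo_y = ST_ACT + ST_CAS + ST_BUS             # 40

    print(x_mem, y_mem, sram_tag_hit, lh_cache_hit, ideal_lo_x, ideal_lo_y)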
2.5. Bandwidth Benefits of DRAM Cache

Even with a higher raw hit latency than main memory, both LH-Cache and SRAM-Tag can still improve performance by providing two indirect benefits. First, stacked DRAM has 8x more bandwidth than off-chip DRAM, which means cache requests wait less. Second, contention for off-chip memory is reduced as DRAM cache hits are filtered. The performance benefit of LH-Cache and SRAM-Tags comes largely from these two indirect benefits and not from raw latency.

The first benefit relies on having a cache with high bandwidth. Although stacked DRAM has 8x raw bandwidth compared to off-chip, LH-Cache uses more than 4x line transfers on each cache access (3 for tag, 1 for data, and some for update), so the effective bandwidth becomes 2x. Both SRAM-Tag and IDEAL-LO maintain the 8x bandwidth by using it efficiently. Therefore, they are more effective than LH-Cache at reducing the waiting time for cache requests. We found that the latency for servicing requests from off-chip memory is similar for all three designs.

2.6. Performance Potential

Figure 4 compares the performance of the three designs: SRAM-Tag, LH-Cache, and IDEAL-LO. The numbers are speedups with respect to a baseline that does not have a DRAM cache, and are reported for a DRAM cache of size 256MB (methodology in Section 3).

Figure 4: Performance potential of the IDEAL-LO design (speedup of LH-Cache, SRAM-Tag, and IDEAL-LO over a no-DRAM-cache baseline).

Validation with Prior Work: On average, SRAM-Tags provide a performance improvement of 24% and LH-Cache 8.7%. Thus, LH-Cache obtains only one-third of the performance benefit of SRAM-Tags, which is inconsistent with the original LH-Cache study [10], which reported that LH-Cache obtains performance very close to SRAM-Tag. Given the difference in raw hit latencies between the two designs (see Figure 3) and the 4x bandwidth consumption of LH-Cache compared to SRAM-Tags, it is highly unlikely that LH-Cache would perform close to SRAM-Tags. A significant part of this research study was to resolve this inconsistency with previously reported results. The authors of the LH-Cache study [10] have subsequently published an errata [9] that shows revised evaluations after correcting deficiencies in their evaluation infrastructure. The revised evaluations for 256MB show on average a 10% improvement for LH-Cache and 25% for SRAM-Tag, consistent with our evaluations.

Note that IDEAL-LO outperforms both SRAM-Tags and LH-Cache, and provides an average of 38%. For libquantum, the memory access pattern has very high row-buffer hit rates in the off-chip DRAM, resulting in mostly type X requests. Therefore, both SRAM-Tag and LH-Cache show performance degradation due to their inability to exploit the spatial locality of sequential access streams.

2.7. De-Optimizing for Performance

We now present simple de-optimizations that improve the overall performance of LH-Cache at the expense of hit rate. The first is using a replacement scheme that does not require an update (random replacement, instead of LRU-based DIP): this avoids LRU-update and victim-selection overheads, which improves hit latency due to reduced bank contention. The second converts LH-Cache from 29-way to direct-mapped. This has two advantages: a direct one, in that we do not need to stream out three tag lines on each access, and an indirect one, in that we can employ open-page mode for lower latency. For SRAM-Tag and LH-Cache, sequentially addressed cache lines are mapped to different sets, and because each set is mapped to a unique row, the probability of a row-buffer hit is very low. With a direct-mapped organization, several consecutive sets map to the same physical DRAM row, and so accesses with spatial locality result in row buffer hits. The row-buffer hit rate for the direct-mapped configuration was measured to be 56% on average, compared to less than 0.1% when the entire set (29-way or 32-way) is mapped to the same row.
Table 1: Impact of De-Optimizing LH-Cache.

  Configuration          Speedup   Hit-Rate   Hit Latency (cycles)
  LH-Cache               8.7%      55.2%      107
  LH-Cache + RandRepl    10.2%     51.5%      98
  LH-Cache (1-way)       15.2%     49.0%      82
  SRAM-Tag (32-way)      23.8%     56.8%      67
  SRAM-Tag (1-way)       24.3%     51.5%      59
  IDEAL-LO (1-way)       38.4%     48.2%      35

Table 1 shows the speedup, hit rate, and average hit latency for various flavors of LH-Cache. We also compare them with SRAM-Tag and IDEAL-LO. LH-Cache has a hit latency of 107 cycles, almost 3x that of IDEAL-LO. De-optimizing LH-Cache reduces the latency to 98 cycles (random replacement) and 82 cycles (direct-mapped). These de-optimizations reduce hit rate and increase misses significantly (a reduction in hit rate from 55% to 49% represents almost 15% more misses). However, they still improve performance significantly. For SRAM-Tag, converting from 32-way to 1-way had little benefit (0.5%), as the reduction in hit latency is offset by the reduction in hit rate.

While a direct-mapped implementation of LH-Cache is more effective than the set-associative implementation, it still suffers from Tag Serialization Latency, as well as Predictor Serialization Latency, resulting in a significant performance gap between LH-Cache and IDEAL-LO (15% vs. 38%). Our proposal removes these serialization latencies and obtains performance close to IDEAL-LO. We describe our experimental methodology before describing our solution.

3. Experimental Methodology

3.1. Configuration

We use a Pin-based x86 simulator with a detailed memory model. Table 2 shows the configuration used in our study. The parameters for the L3 cache and DRAM (off-chip and stacked) are identical to the original LH-Cache study [10], including a 24-cycle latency for the SRAM-Tag. For LH-Cache, we model an idealized unlimited-size MissMap that resides in the L3 cache but does not consume any L3 cache capacity. For both LH-Cache and SRAM-Tag we use LRU-based DIP [16] replacement. We perform detailed studies for a 256MB DRAM cache. In Section 6.1, we analyze cache sizes ranging from 64MB to 1GB.

Table 2: Baseline Configuration.

  Processors
    Number of cores         8
    Frequency               3.2GHz
    Width                   1 IPC
  Last-Level Cache
    L3 (shared)             8MB, 16-way, 24 cycles
  Off-Chip DRAM
    Bus frequency           800MHz (DDR 1.6GHz)
    Channels                2
    Ranks                   1 rank per channel
    Banks                   8 banks per rank
    Row buffer size         2048 bytes
    Bus width               64 bits per channel
    tCAS-tRCD-tRP-tRAS      9-9-9-36
  Stacked DRAM
    Bus frequency           1.6GHz (DDR 3.2GHz)
    Channels                4
    Banks                   16 banks per rank
    Bus width               128 bits per channel
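As a sanity check, the processor-cycle latencies used in Section 2.4 can be derived from the Table 2 bus clocks. The sketch below is our own illustration of that conversion (assuming, as is conventional, that the timing parameters are expressed in memory-bus cycles).

    CORE_GHZ = 3.2

    def to_core_cycles(bus_cycles, bus_ghz):
        """Convert memory-bus cycles to 3.2GHz processor cycles."""
        return bus_cycles / bus_ghz * CORE_GHZ

    def line_transfer_cycles(line_bytes, bus_bits, ddr_gts):
        """Core cycles to burst one cache line over a DDR bus
        running at ddr_gts gigatransfers per second."""
        transfers = line_bytes / (bus_bits // 8)
        return transfers / ddr_gts * CORE_GHZ

    print(to_core_cycles(9, 0.8))              # off-chip tCAS: 36 core cycles
    print(to_core_cycles(9, 1.6))              # stacked tCAS: 18 core cycles
    print(line_transfer_cycles(64, 64, 1.6))   # off-chip line burst: 16 cycles
    print(line_transfer_cycles(64, 128, 3.2))  # stacked line burst: 4 cycles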
3.2. Workloads

We use a single SimPoint [14] slice of 1 billion instructions for each benchmark from the SPEC2006 suite. We perform evaluations by executing 8 copies of each benchmark in rate mode. Given that our study is about large caches, we perform detailed studies only for the 10 workloads that have a speedup of more than 2 with a perfect L3 cache (100% hit rate). Other workloads are analyzed in Section 6.4. Table 3 shows the workloads sorted by perfect-L3 speedup, along with Misses Per 1000 Instructions (MPKI) and footprint (the number of unique lines multiplied by the line size). We model a virtual-to-physical mapping to ensure two benchmarks do not map to the same physical address. We use the suffix _r with the name of a benchmark to indicate rate mode. We perform timing simulation until all benchmarks in the workload finish execution, and measure the execution time of the workload as the average execution time across all 8 cores.

Table 3: Benchmark Characteristics.

  Workload (Name)   Perfect-L3 Speedup   MPKI   Footprint
  mcf_r             4.9x                 74.0   10.4GB
  lbm_r             3.8x                 31.8   3.3GB
  soplex_r          3.5x                 27.0   1.9GB
  milc_r            3.5x                 25.7   4.1GB
  omnetpp_r         3.1x                 20.9   259MB
  gcc_r             2.8x                 16.5   458MB
  bwaves_r          2.8x                 18.7   1.5GB
  sphinx_r          2.4x                 12.3   80MB
  gems_r            2.2x                 9.7    3.6GB
  libquantum_r      2.1x                 25.4   262MB

Figure 5: Architecture and operation of the Alloy Cache, which integrates tag and data into a single entity called a TAD (Tag and Data). A 2KB row buffer holds 28 x 72-byte TADs (28 data lines, with 32 bytes unused). The size of data transfers is determined by the 16-byte-wide data bus, hence a minimum transfer of 80 bytes for obtaining one TAD: either IGNORE [8B] + TAG [8B] + DATA [64B] or TAG [8B] + DATA [64B] + IGNORE [8B].

4. Latency-Optimized Cache Architecture

While configuring the LH-Cache from a 29-way structure to a direct-mapped structure improved performance (from 8.7% to 15%), it still left significant room for improvement compared to a latency-optimized solution (38%). One of the main sources of this gap is the serialization latency due to tag lookup. We note that LH-Cache created a separate "tag-store" and "data-store" in the DRAM cache, similar to conventional caches. A separate tag-store and data-store makes sense for a conventional cache, because they are indeed physically separate structures. The tag-store is optimized for latency to support quick lookups and can have multiple ports, whereas the data-store is optimized for density. We make the important observation that creating a separate contiguous tag-store (similar to conventional caches) is not necessary when tags and data co-exist in the same DRAM array.

4.1. Alloy Cache

Obviating the separation of tag-store and data-store can help us avoid the TSL overhead. This is the key insight in our proposed cache structure, which we call the Alloy Cache. The Alloy Cache tightly integrates, or alloys, tag and data into a single entity called a TAD (Tag and Data). On an access to the Alloy Cache, it provides one TAD. If the tag obtained from the TAD matches the given line address, it indicates a cache hit and the data line in the TAD is supplied. A tag mismatch indicates a cache miss. Thus, instead of having two separate accesses (one to the "tag-store" and the other to the "data-store"), the Alloy Cache tightly integrates those two accesses into a single unified access, as shown in Figure 5. On a cache miss, there is a minor cost in that bandwidth is consumed transferring a data line that is not used. Note that this overhead is still substantially less than the three tag lines that must be transferred for both hits and misses in the LH-Cache.

Each TAD represents one set of the direct-mapped Alloy Cache. Given that the Alloy Cache has a non-power-of-two number of sets, we cannot simply use the address bits to identify the set. We assume that a modulo operation on the line address is used to determine the set index of the Alloy Cache.[2]
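A minimal sketch of this indexing (ours, for illustration): with 28 TADs per 2KB row, the set index decomposes into a slot within a row plus a DRAM row id. Footnote 2's hardware computes the mod-28 residue with narrow adders; plain integer arithmetic stands in for it here.

    TADS_PER_ROW = 28  # 2KB row buffer / 72-byte TAD

    def alloy_cache_index(line_addr, num_rows):
        """Map a line-granularity address to (DRAM row, TAD slot)
        in the direct-mapped Alloy Cache."""
        slot = line_addr % TADS_PER_ROW             # which TAD within the row
        row = (line_addr // TADS_PER_ROW) % num_rows
        return row, slot

    # Example: a 256MB cache has 256MB / 2KB = 128K rows of 28 sets each.
    row, slot = alloy_cache_index(line_addr=0x123456, num_rows=128 * 1024)
    print(row, slot)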
A non-power-of-two number of sets also means that the tag entry needs to store full tags, which increases the size of the tag entry. We estimate that a tag entry of 8 bytes is more than sufficient for the Alloy Cache (for a physical address space of 48 bits, we need 42 tag bits, 1 valid bit, 1 dirty bit, and the remaining 20 bits for coherence support and other optimizations). The minimum size of a TAD is thus 72 bytes (64 bytes for the data line and 8 bytes for the tag). The Alloy Cache can store 28 lines in a row, reaching close to the 29-lines-per-row storage efficiency of the LH-Cache.

The size of a data transfer from the Alloy Cache is also affected by the physical constraints of the DRAM cache. For example, the size of the data bus assumed for our stacked DRAM configuration is 16 bytes, which means transfers to and from the cache occur at a granularity of 16 bytes. Thus, it takes a burst of five transfers to obtain one TAD of 72 bytes. To keep our design simple, we restrict the transfers to be aligned at the granularity of the data-bus size. This requirement means that for odd sets of the Alloy Cache the first 8 bytes are ignored, and for even sets the last 8 bytes are ignored. The tag-check logic checks either the first eight bytes or the next eight bytes depending on the low bit of the set index.

[2] Designing a general-purpose modulo-computing unit incurs high area and latency overheads. However, here we compute the modulo with respect to a constant, so it is much simpler and faster than a general-purpose solution. In fact, modulo with respect to 28 (the number of sets in one row of the Alloy Cache) can be computed easily with eight 5-bit adders using residue arithmetic (28 = 32 - 4). This value can then be removed from the line address to get the row-id of the DRAM cache. We estimate the calculation to take two cycles and only a few hundred logic gates. We assume that the index calculation of the Alloy Cache happens in parallel with the L3 cache access (thus, we have up to 24 cycles to calculate the set index of the Alloy Cache).

4.2. Impact on Effective Bandwidth

Table 4 compares the effective bandwidth of servicing one cache line from various structures. The raw bandwidths and effective bandwidths are normalized to off-chip memory. On a cache hit, LH-Cache transfers 3 lines of tag + 1 line of data + the replacement update, reducing a raw bandwidth of 8x to an effective bandwidth of less than 2x, whereas the Alloy Cache can provide an effective bandwidth of up to 6.4x.

Table 4: Bandwidth comparison (relative to off-chip memory).

  Structure         Raw Bandwidth   Transfer per access (hit)   Effective Bandwidth
  Off-chip Memory   1x              64 bytes                    1x
  SRAM-Tag          8x              64 bytes                    8x
  LH-Cache          8x              (256+16) bytes              1.8x
  IDEAL-LO          8x              64 bytes                    8x
  Alloy Cache       8x              80 bytes                    6.4x
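The effective-bandwidth column is simply the raw bandwidth scaled by how many bytes each 64-byte line costs to fetch. A small check of Table 4 (our illustration, not from the paper):

    def effective_bandwidth(raw_bw, bytes_per_hit, line_bytes=64):
        """Raw bandwidth discounted by the traffic needed per cache-line hit."""
        return raw_bw * line_bytes / bytes_per_hit

    print(effective_bandwidth(8, 256 + 16))  # LH-Cache: 4 lines + update -> ~1.8x
    print(effective_bandwidth(8, 80))        # Alloy Cache: one 80-byte TAD -> 6.4x
    print(effective_bandwidth(8, 64))        # SRAM-Tag / IDEAL-LO: 8x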
4.3. Latency and Performance Impact

The Alloy Cache avoids tag serialization. Instead of two serialized accesses, one each for tag and data, it provides tag and data in a single burst of five transfers on the data bus. Comparatively, a transfer of only the data line would take four transfers, so the latency overhead of transferring a TAD instead of only the data line is 1 bus cycle. This overhead is negligible compared to the TSL overhead incurred by SRAM-Tag (24 cycles) and LH-Cache (32-50 cycles). Because of the avoidance of TSL, the average hit latency for the Alloy Cache is significantly better (42 cycles) than both SRAM-Tag (69 cycles) and LH-Cache (107 cycles).

The Alloy Cache reduces the TSL but not the PSL, so the overall performance depends on how misses are handled. We consider three scenarios: first, no prediction (wait for the tag access until a cache miss is detected); second, use the MissMap (PSL of 24 cycles); third, a perfect predictor (100% accuracy, 0 latency). Figure 6 compares the speedup of these to the impractical SRAM-Tag design configured as 32-way.

Figure 6: Speedup with the Alloy Cache (Alloy+NoPred, Alloy+MissMap, and Alloy+Perfect).

Even without any predictor, the Alloy Cache provides a 21% performance improvement, much closer to the impractical SRAM-Tag. This is primarily due to the lower hit latency. A MissMap provides better miss handling, but the 24-cycle PSL is incurred on both hits and misses, so the performance is actually worse than not using a predictor. With a perfect predictor (100% accuracy and zero-cycle latency), the Alloy Cache's performance increases to 37%. The next section describes effective single-cycle predictors that obtain performance close to that with a perfect predictor.

5. Low-Latency Memory Access Prediction

The MissMap approach focuses on getting perfect information about the presence of a line in the DRAM cache. Therefore, it needs to keep track of information on a per-line basis. Even if this incurred a storage of one bit per line, given that a large cache can have many millions of lines, the size of the MissMap quickly gets into the megabyte regime. Given the large size of the MissMap, it is better to avoid dedicated storage and store it in an already existing on-chip structure such as the L3 cache. Hence, it incurs the significant latency of an L3 cache access (24 cycles). In this section, we describe accurate predictors that incur negligible storage and delay. We lay the groundwork for operating such a predictor before describing the predictor itself. The ideas described in this section are derived from prior work by Qureshi [15].

5.1. Serial Access vs. Parallel Access

The implicit assumption made in the LH-Cache study was that the system needs to ensure that there is a DRAM cache miss before accessing memory. This assumption is similar to how conventional caches operate. We call this the Serial Access Model (SAM), as the cache access and memory access get serialized. The SAM model is bandwidth-efficient, as it sends only the cache misses to main memory, as shown in Figure 7.

Figure 7: Cache access models: serial (SAM) vs. parallel (PAM).

Alternatively, we may choose to use a less bandwidth-efficient model, which probes both the cache and memory in parallel. We call this the Parallel Access Model (PAM), as shown in Figure 7. The advantage of PAM is that it removes the serialization of the cache-miss detection latency from the memory access path. To implement PAM correctly, though, we should give priority to the cache content rather than the memory content, as the cache content can be dirty. Also, if the memory system returns data before the cache returns the outcome of the tag check, then we must wait before using the data, as the line could still be present in a dirty state in the cache.

At first blush, it may seem wasteful to access the DRAM cache in case of a DRAM cache miss. However, for both LH-Cache and the Alloy Cache, the tags are located in DRAM. So, even on a DRAM cache miss, we still need to read the tags anyway to select a victim line and check if the victim is dirty (to schedule a writeback). So, PAM does not have a significant impact on cache utilization compared to a perfect predictor.
5.2. To Wait or Not to Wait

We can get the best of both SAM and PAM by dynamically choosing between the two, based on an estimate of whether the line is likely to be present in the cache or not. We call this the Dynamic Access Model (DAM). If the line is likely to be present in the cache, DAM uses SAM to save memory bandwidth. And if the line is unlikely to be present, DAM uses PAM to reduce latency. Note that DAM does not require perfect information for deciding between SAM and PAM, but simply a good estimate. To help with this estimate, we propose a hardware-based Memory Access Predictor (MAP). To keep the latency of our predictor to a bare minimum, we consider only simple predictors.

5.3. Memory Access Predictor

The latency savings of PAM and the bandwidth savings of SAM depend on the cache hit rate. If the cache hit rate is very high, then SAM can reduce bandwidth. If the cache hit rate is very low, then PAM can reduce latency. So, we could simply use the cache hit rate for memory-access prediction. However, it is well known that both cache misses and hits show good correlation with previous outcomes [5], and exploiting such correlation results in more effective prediction than simply using the hit rate. For example, if H is a hit and M is a miss, and the last eight outcomes are MMMMHHHH, then using the hit rate would give an accuracy of 50%, but a simple last-time predictor would give an accuracy of 87.5% (assuming the first M was predicted correctly). Based on this insight, we propose History-Based Memory Access Predictors.

5.3.1. Global-History-Based MAP (MAP-G)

Our basic implementation, called MAP Global or MAP-G, uses a single saturating counter called the Memory Access Counter (MAC) that keeps track of whether recent L3 misses resulted in a memory access or a hit in the DRAM cache. If an L3 miss results in a memory access, then the MAC is incremented; otherwise the MAC is decremented (both operations use saturating arithmetic). For prediction, MAP-G simply uses the MSB of the MAC to decide if the L3 miss should employ SAM (MSB=0) or PAM (MSB=1). We employ MAP-G on a per-core basis and use a 3-bit counter for the MAC. Our results show that MAP-G bridges more than half the performance gap between SAM and perfect prediction. Note that because writes are not on the critical path (at this level, writes are mainly due to dirty evictions from on-chip caches), we do not make predictions for writes and simply employ SAM.

5.3.2. Instruction-Based MAP (MAP-I)

We can improve the effectiveness of MAP-G by exploiting the well-known observation that cache hit/miss information is heavily correlated with the instruction address that caused the cache access [3, 8, 18]. We call this implementation Instruction-Based MAP, or simply MAP-I. Instead of using a single MAC, MAP-I uses a table of MACs, called the Memory Access Counter Table (MACT). The address of the instruction causing the L3 miss is hashed (using a folded-xor [17]) into the MACT to obtain the desired MAC. All predictions and updates happen based on this MAC. We found that simply using 256 entries (an 8-bit index) in the MACT is sufficient. The storage overhead for this implementation of MAP-I is 256 * 3 bits = 96 bytes. We keep the MACT on a per-core basis to avoid interference between the cores (for eight cores, the total overhead is only 96 * 8 = 768 bytes). Like MAP-G, MAP-I does not make predictions for write requests.

Note that our predictors do not require that the instruction address be stored in the cache. For read misses, the instruction address of the miss-causing load is forwarded with the miss request. As writeback misses are serviced with SAM, we do not need instruction addresses for writebacks.
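A behavioral sketch of MAP-I follows, reconstructed by us from the description above (not the authors' hardware); the byte-folding hash and the counter's initial value are our stand-in assumptions, with the fold playing the role of the folded-xor of [17]. MAP-G is the degenerate case of a one-entry table.

    class MapI:
        """Instruction-based Memory Access Predictor: 256 x 3-bit saturating
        counters (MACT), indexed by a folded-xor hash of the miss-causing PC."""
        ENTRIES, MAX = 256, 7          # 8-bit index, 3-bit counters

        def __init__(self):
            # Initial counter value is an assumption (mid-range, MSB clear).
            self.mact = [self.MAX // 2] * self.ENTRIES

        def _index(self, pc):
            # Fold the PC into 8 bits by xoring successive 8-bit chunks.
            idx = 0
            while pc:
                idx ^= pc & 0xFF
                pc >>= 8
            return idx

        def predict_use_pam(self, pc):
            # MSB set (counter >= 4) predicts a memory access -> use PAM.
            return self.mact[self._index(pc)] >= 4

        def update(self, pc, went_to_memory):
            i = self._index(pc)
            if went_to_memory:
                self.mact[i] = min(self.MAX, self.mact[i] + 1)
            else:
                self.mact[i] = max(0, self.mact[i] - 1)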
5.4. Performance Results

Figure 8 shows the speedup from the Alloy Cache with different memory access predictors. If we use a prediction of always-cache-hit, the system behaves like SAM, and if we use a prediction of never-cache-hit, the system behaves like PAM. The perfect predictor assumes 100% accuracy at zero latency.

Figure 8: Performance improvement of the Alloy Cache for different Memory Access Predictors (SAM, PAM, MAP-G, MAP-I).

On average, there is a 14% gap between SAM (22.6%) and perfect prediction (36.6%). PAM provides a 29.6% performance improvement but results in almost twice as many memory accesses as perfect prediction. MAP-G provides 30.9% performance, bridging half the performance difference between SAM and the perfect predictor. It thus performs similarly to PAM but without doubling the memory traffic. MAP-I provides an average of 35%, coming within 1.6% of the performance of a perfect predictor. Thus, even though our predictors are simple (100 bytes per core) and low latency (1 cycle), they capture almost all of the potential performance.

For libquantum, MAP-G performs 3% better than the perfect predictor. This happens because some of the mispredictions avoid the row buffer penalty for later demand misses. For example, consider four lines A, B, C, D that map to the same DRAM row. Only A and B are present in the DRAM cache. A, B, C, D are accessed in sequence. If A and B are predicted correctly, C would incur a row-opening penalty when it goes to memory. If, on the other hand, A is mispredicted, it would avoid the row-opening penalty for C.

5.5. Prediction Accuracy Analysis

To provide insight into the effectiveness of the predictors, we analyzed the different outcome-prediction scenarios. There are four cases: 1) the L3 miss is serviced by memory and our predictor predicts it as such, 2) the L3 miss is serviced by memory and our predictor predicts that it will be serviced by the DRAM cache, 3) the L3 miss is serviced by the DRAM cache and our predictor predicts a memory access, and 4) the L3 miss is serviced by the DRAM cache and our predictor predicts it to be so. Scenarios 2 and 3 denote mispredictions. However, note that the costs of the two mispredictions are quite different (scenario 2 incurs higher latency and scenario 3 extra bandwidth). Table 5 shows the scenario distribution for the different predictors, averaged across all workloads.

Table 5: Accuracy for Different Predictors.

              Serviced by Memory,      Serviced by Cache,       Overall
  Prediction  predicted Mem / Cache    predicted Mem / Cache    Accuracy
  SAM         0       / 51.8%          0       / 48.1%          48.1%
  PAM         51.8%   / 0              48.2%   / 0              51.8%
  MAP-G       45.1%   / 6.7%           10.8%   / 37.4%          82.5%
  MAP-I       48.3%   / 3.5%           1.9%    / 46.2%          94.5%
  Perfect     51.8%   / 0%             0%      / 48.2%          100%

PAM almost doubles the memory traffic compared to the other approaches (48% of L3 misses wastefully access memory when they are in fact serviced by the DRAM cache). Compared to a perfect predictor, MAP-I has higher latency for 3.5% of the L3 misses, and extraneous bandwidth consumption for 1.9% of the L3 misses. For the remaining 94.5% of the L3 misses, the MAP-I prediction is correct. Thus, even though our predictors are quite simple, low-cost, and low-latency, they are still highly effective, provide high accuracy, and obtain almost all of the potential performance improvement from memory access prediction. Unless stated otherwise, the Alloy Cache is always implemented with MAP-I in the remainder of this paper.

5.6. Implications on Memory Power and Energy

Accessing memory in parallel with the cache, as done in PAM and conditionally in DAM, increases power in the memory system due to wasteful memory accesses. With PAM, all of the L3 misses would be sent to off-chip memory, whereas with SAM only the misses in the DRAM cache would get sent to memory. From Table 5, it can be concluded that PAM would almost double the memory activity compared to SAM. Hence, we do not recommend unregulated use of PAM (except as a reference point). For DAM, our MAP-I predictor is quite accurate, which means wasteful parallel accesses account for only 1.9% of L3 misses, compared to 48% with PAM.

6. Analysis and Discussions

6.1. Sensitivity to Cache Size

The default DRAM cache size for all of our studies is 256MB. In this section, we study the impact of the different schemes as the cache size is varied from 64MB to 1GB.
Figure 9 shows the average speedup with LH-Cache (29-way), SRAM-Tag (32-way), Alloy Cache, and IDEAL-LO. IDEAL-LO is the latency-optimized theoretical design that transfers only 64 bytes on a cache hit and has a perfect zero-latency predictor.

Figure 9: Performance impact across various cache sizes (64MB, 128MB, 256MB, 512MB, 1GB). The Alloy Cache continues to significantly outperform the impractical SRAM-Tag and reaches close to the upper bound of IDEAL-LO.

The SRAM-Tag design suffers from Tag Serialization Latency (TSL). LH-Cache suffers from both TSL and PSL due to the MissMap. The Alloy Cache avoids both TSL and PSL, hence it outperforms both the LH-Cache and SRAM-Tag across all studied cache sizes. For the 1GB cache size, LH-Cache provides an average improvement of 11.1%, SRAM-Tag provides 29.3%, and the Alloy Cache provides 46.1%. Thus, the Alloy Cache provides approximately 1.5 times the improvement of the SRAM-Tag design. Note that the SRAM-Tag implementation incurs an impractical storage overhead of 6MB, 12MB, 24MB, 48MB, and 96MB for DRAM cache sizes of 64MB, 128MB, 256MB, 512MB, and 1GB, respectively. Our proposal, on the other hand, requires less than one kilobyte of storage, and still outperforms SRAM-Tag significantly, consistently reaching close to the performance of IDEAL-LO.

6.2. Impact on Hit Latency

The primary reason why the Alloy Cache performs so well is that it is designed from the ground up to have lower latency. Figure 10 compares the average read latency of LH-Cache, SRAM-Tags, and the Alloy Cache. Note that SRAM-Tags incur a tag serialization latency of 24 cycles, and LH-Cache incurs the MissMap delay of 24 cycles in addition to its tag serialization latency (32-50 cycles). For the Alloy Cache, there is no tag serialization, except for the one additional bus cycle for obtaining the tag with the data line. The average hit latency of LH-Cache is 107 cycles. The Alloy Cache cuts this latency by 60%, bringing it to 43 cycles. This significant reduction causes the Alloy Cache to outperform LH-Cache despite the lower hit rate. SRAM-Tag incurs an average latency of 67 cycles, hence lower performance than the Alloy Cache.
Figure 10: Average hit latency: LH-Cache 107 cycles, SRAM-Tag 67 cycles, and Alloy Cache 43 cycles.

6.3. Impact on Hit Rate

Our design de-optimizes the cache architecture from a highly-associative structure to a direct-mapped structure in order to reduce hit latency. We compare the hit rate of the highly-associative 29-way LH-Cache with the direct-mapped Alloy Cache. Table 6 shows the average hit rate for different cache sizes. For a 256MB cache, the absolute difference in hit rates between the 29-way LH-Cache and the direct-mapped Alloy Cache is 7%. Thus, the Alloy Cache increases misses by 15% compared to LH-Cache. However, as shown, the 60% reduction in hit latency compared to LH-Cache provides much more performance benefit than the slight performance degradation from the reduced hit rate. Table 6 also shows that the hit-rate difference between a highly-associative cache and a direct-mapped cache shrinks as the cache size increases (at 1GB it is 2.5%, i.e., 5% more misses). This narrowing of the gap between the hit rates of highly-associative and direct-mapped caches with increasing cache size is well known [6].

Table 6: Hit rate: highly associative vs. direct mapped.

  Cache Size   LH-Cache (29-way)   Alloy Cache (1-way)   Delta Hit Rate
  256MB        55.2%               48.2%                 7.0%
  512MB        59.6%               55.2%                 4.4%
  1GB          62.6%               59.1%                 2.5%

6.4. Other Workloads

In our detailed studies, we only considered memory-intensive workloads that have a speedup of at least 2x if the L3 cache is made perfect (100% hit rate). Figure 11 shows the performance improvement from LH-Cache, SRAM-Tags, and the Alloy Cache for the remaining workloads that spend at least 1% of their time in memory. These benchmarks were executed in rate mode as well. The bar labeled Gmean represents the geometric mean improvement over these fourteen workloads. As the potential is low, the improvements from all designs are lower compared to the detailed study. However, the broad trend remains the same. On average, LH-Cache improves performance by 3%, SRAM-Tag by 7.3%, and the Alloy Cache by 11%. Thus, the Alloy Cache continues to outperform both LH-Cache and SRAM-Tag.
Figure 11: Performance impact for the other SPEC workloads.

6.5. Impact of Odd-Size Burst Length

Our proposal assumes a burst length of five for the Alloy Cache, transferring 80 bytes on each DRAM cache access. However, conventional DDR specifications may restrict the burst length to a power of two even for stacked DRAM. If such a restriction exists, then the Alloy Cache can stream out a burst of eight transfers (a total of 128 bytes per access). Our evaluation shows that a design with a burst of eight provides a 33% performance improvement on average, compared to 35% if the burst length can be set to five. Thus, our assumption of an odd-size burst length has minimal impact on the performance benefit of the Alloy Cache. Note that die-stacked DRAMs will likely use different interfaces than conventional DDR. The larger number of through-silicon vias could make it easier to provide additional control signals to, for example, dynamically specify the amount of data to be transferred.

6.6. Potential Room for Improvement

Our proposal is a simple and practical design that significantly outperforms the impractical SRAM-Tag design, but there is still room for improvement. Table 7 compares the average performance of Alloy Cache + MAP-I with (a) perfect memory access prediction (Perf-Pred), (b) IDEAL-LO, a configuration that incurs minimum latency and bandwidth and has Perf-Pred, and (c) IDEAL-LO with no tag overhead, so that all of the 256MB space is available to store data.

Table 7: Room for improvement.

  Design                        Performance Improvement
  Alloy Cache + MAP-I           35.0%
  Alloy Cache + Perf-Pred       36.6%
  IDEAL-LO                      38.4%
  IDEAL-LO + No Tag Overhead    41.0%

We observe that our design would get 1.6% additional performance improvement from a perfect predictor and another 1.8% from an IDEAL-LO cache. Thus, our practical solution is within 3% of the performance of an idealized design that places tags in DRAM. If we can come up with a way to avoid the storage overhead of tags in DRAM, then another 2.6% improvement is possible. While all three of these optimizations show a small opportunity for improvement,
we must be mindful that solutions to obtain these improvements must incur minimal latency overheads, otherwise the marginal improvements may be quickly negated.

6.7. How About Two-Way Alloy Caches?

We also evaluated two-way Alloy Caches that stream out two TAD entries on each access. While this improved the hit rate from 48.2% to 49.7%, we found that the hit latency increased from 43 cycles to 48 cycles. This was due to the increased burst length (2x), the associated bandwidth consumption (2x), and the reduction in row buffer hit rate. Overall, the performance impact of the degraded hit latency outweighs the marginal improvement from hit rate. We envision that future researchers will look at reducing conflict misses in DRAM caches (and we encourage them to do so); however, we advise them to pay close attention to the impact on hit latency.

7. Conclusion

This paper analyzed the trade-offs in architecting DRAM caches. We compared the performance of a recently-proposed design (LH-Cache) and an impractical SRAM-based Tag-Store (SRAM-Tags) with a latency-optimized design, and showed that optimizing for latency provides a much more effective DRAM cache than optimizing simply for hit rate. To obtain a practical and effective latency-optimized design, this paper went through a three-step process:

1. We showed that simply converting the DRAM cache from highly associative to direct-mapped can itself provide a good performance improvement. For example, configuring LH-Cache from 29-way to 1-way enhances the performance improvement from 8.7% to 15%. This happens because of the lower latency of a direct-mapped cache as well as the ability to exploit row buffer hits.

2. Simply having a direct-mapped structure is not enough. A cache design that creates a separate "tag-store" and "data-store" still incurs the tag-serialization latency even for direct-mapped caches. To avoid this tag serialization latency, we proposed a cache architecture called the Alloy Cache that fuses the data and tag together into one storage entity, thus converting two serialized accesses for tag and data into a single unified access. We showed that a direct-mapped Alloy Cache improves performance by 21%.

3. The performance of the Alloy Cache can be improved further by handling misses faster, i.e., by sending them to memory before completing the tag check in the DRAM cache. However, doing so with a MissMap incurs megabytes of storage overhead and tens of cycles of latency, which negates much of the performance benefit of handling misses early. Instead, we presented a low-latency (single cycle), low-storage-overhead (96 bytes per core), highly accurate (95% accuracy) hardware-based Memory Access Predictor that enhances the performance benefit of the Alloy Cache to 35%.

Optimizing for latency enabled our proposed design to provide better performance than even the impractical option of having the tag store in an SRAM array (24% improvement), which would require tens of megabytes of storage. Thus, we showed that simple designs can be highly effective if they exploit the constraints of the given technology. While the technology and constraints of today are quite different from those of the 1980s, in spirit the initial part of our work is similar to that of Mark Hill [6] from twenty-five years ago, making a case for direct-mapped caches and showing that they can outperform set-associative caches. Indeed, sometimes "Big and Dumb is Better" [1].
Acknowledgments

Thanks to André Seznec and Mark Hill for comments on earlier versions of this paper. Moinuddin Qureshi is supported by a NetApp Faculty Fellowship and an Intel Early CAREER award.

References

[1] Quote from Mark Hill's bio: https://www.cs.wisc.edu/event/mark-hill-efciently-enabling-conventional-block-sizes-very-large-die-stacked-dram-caches (short link: http://tinyurl.com/hillbio).
[2] X. Dong, Y. Xie, N. Muralimanohar, and N. P. Jouppi. Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support. In Supercomputing, 2010.
[3] M. Farrens, G. Tyson, J. Matthews, and A. R. Pleszkun. A modified approach to data cache management. In MICRO-28, 1995.
[4] M. Ghosh and H.-H. S. Lee. Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs. In MICRO-40, 2007.
[5] A. Hartstein, V. Srinivasan, T. R. Puzak, and P. G. Emma. Cache miss behavior: is it sqrt(2)? In Computing Frontiers, 2006.
[6] M. D. Hill. A case for direct-mapped caches. IEEE Computer, Dec 1988.
[7] X. Jiang, N. Madan, L. Zhao, M. Upton, R. Iyer, S. Makineni, D. Newell, Y. Solihin, and R. Balasubramonian. CHOP: Adaptive filter-based DRAM caching for CMP server platforms. In HPCA-16, 2010.
[8] S. M. Khan, D. A. Jiménez, D. Burger, and B. Falsafi. Using dead blocks as a virtual victim cache. In PACT-19, 2010.
[9] G. H. Loh and M. D. Hill. Addendum for "Efficiently enabling conventional block sizes for very large die-stacked DRAM caches". http://www.cs.wisc.edu/multifacet/papers/micro11_missmap_addendum.pdf.
[10] G. H. Loh and M. D. Hill. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. In MICRO-44, 2011.
[11] G. H. Loh and M. D. Hill. Supporting very large DRAM caches with compound access scheduling and MissMaps. In IEEE Micro Top Picks, 2012.
[12] N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian, R. Iyer, S. Makineni, and D. Newell. Optimizing Communication and Capacity in a 3D Stacked Reconfigurable Cache Hierarchy. In HPCA-15, 2009.
[13] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan. Enabling efficient and scalable hybrid memories using fine-granularity DRAM cache management. Computer Architecture Letters, Feb 2012.
[14] E. Perelman et al. Using SimPoint for accurate and efficient simulation. ACM SIGMETRICS Performance Evaluation Review, 2003.
[15] M. K. Qureshi. Memory access prediction. U.S. Patent Application Number 12700043, Filed Feb 2010, Published Aug 2011.
[16] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. Emer. Adaptive insertion policies for high-performance caching. In ISCA-34, pages 167-178, 2007.
[17] A. Seznec and P. Michaud. A case for (partially) tagged geometric history length branch prediction. In Journal of Instruction Level Parallelism, 2006.
[18] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer. SHiP: Signature-based hit predictor for high performance caching. In MICRO-44, 2011.
[19] L. Zhao, R. Iyer, R. Illikkal, and D. Newell. Exploring DRAM cache architectures for CMP server platforms. In ICCD, 2007.