CHOP: Integrating DRAM Caches for CMP Server Platforms


bandwidth utilization tended to increase as the threshold increased. To understand why this is the case, recall that the AFC scheme adjusts its DRAM cache allocation policy with respect to the available memory bandwidth. Moreover, the average memory bandwidth utilization achieved was 93 percent for the DRAM and 32 percent for CHOP-FC (Figure 3b), which essentially denotes the upper and lower bounds of the memory bandwidth utilization that the CHOP-AFC scheme can reach. When the threshold was less than 0.3, the measured memory utilization with the filter cache turned on was always larger than the threshold, so the filter cache was always turned on, resulting in an average of 32 percent memory bandwidth utilization. Therefore, with the threshold between 0 and 0.3, CHOP-AFC behaved the same as CHOP-FC. This explains why CHOP-AFC shows a constant speedup and memory bandwidth utilization under those thresholds. When the threshold was greater than 0.3, owing to the abundance of available memory bandwidth, the filter cache could be turned off for some time, and therefore more pages could go into the DRAM cache. As a result, the useful hits provided by the DRAM cache tended to increase as well. This explains why both memory bandwidth utilization and performance kept increasing at first for threshold values between 0.3 and 1. However, after some point, because of the high memory bandwidth utilization, queueing delay began to dominate and consequently performance decreased. The reason the TPC-C workload shows a different trend for thresholds between 0 and 0.3 is its smaller lower-bound memory bandwidth utilization of 15 percent (the DRAM bar in Figure 5b).

Figure 5. Speedup ratios (a) and memory bandwidth utilization (b) of the CHOP adaptive filter cache (CHOP-AFC) with various memory bandwidth utilization thresholds.

Comparison of the three schemes

Compared to the traditional DRAM-caching approach, all our filter-based
techniques improve performance by caching only the hot subset of pages. For a 4-Kbyte block size, CHOP-MFC achieved an average speedup of 18.6 percent with only 1 Kbyte of extra on-die storage overhead, whereas CHOP-FC achieved a 17.2 percent speedup with 132 Kbytes of extra on-die storage overhead. Comparing these two schemes, CHOP-MFC uniquely offers larger address space coverage with very small storage overhead, whereas CHOP-FC naturally offers both hotness and timeliness information at a moderate storage overhead and minimal hardware modifications.

The adaptive filter cache schemes (CHOP-AFC and CHOP-AMFC) typically outperformed CHOP-FC and CHOP-MFC. These schemes intelligently detect available memory bandwidth to dynamically adjust the DRAM-caching coverage policy and thereby improve performance. Dynamically turning the filter cache on and off adapts to the memory bandwidth utilization on the fly. When memory bandwidth is abundant, the adaptive filter cache enlarges coverage, and more blocks are cached into the DRAM cache to produce more useful cache hits. When memory bandwidth is scarce, the adaptive filter cache reduces coverage, and only hot pages are cached to produce a reasonable amount of useful cache hits as often as possible. By setting a proper memory utilization threshold, CHOP-AFC and CHOP-AMFC can use memory bandwidth wisely. The adaptive filter cache schemes show their performance robustness and guarantee performance improvement by quickly adapting to the bandwidth utilization situation. Possible extensions to these schemes include application to other forms of two-level memory hierarchies, including phase-change memory (PCM) and flash-based memory, and architectural support for exposing hot-page information back to the operating system to guide scheduling.

Acknowledgments

We thank Zhen Fang for providing valuable comments and feedback on this article.

References

1. D. Burger, J.R. Goodman, and A. Kagi, "Memory Bandwidth Limitations of Future Microprocessors," Proc. 23rd Ann. Int'l Symp. Computer Architecture (ISCA 96), ACM Press, 1996, pp. 78-79.
2. F. Liu et al., "Understanding How Off-Chip Memory Bandwidth Partitioning in Chip Multiprocessors Affects System Performance," Proc. IEEE 16th Int'l Symp. High Performance Computer Architecture (HPCA 10), IEEE Press, 2010, doi:10.1109/HPCA.2010.5416655.
3. B.M. Rogers et al., "Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling," Proc. 36th Ann. Int'l Symp. Computer Architecture (ISCA 09), ACM Press, 2009, pp. 371-382.
4. B. Black et al., "Die Stacking (3D) Microarchitecture," Proc. 39th Int'l Symp. Microarchitecture, IEEE CS Press, 2006, pp. 469-479.
5. R. Iyer, "Performance Implications of Chipset Caches in Web Servers," Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software (ISPASS 03), IEEE CS Press, 2003, pp. 176-185.
6. N. Madan et al., "Optimizing Communication and Capacity in a 3D Stacked Reconfigurable Cache Hierarchy," Proc. IEEE 15th Int'l Symp. High Performance Computer Architecture (HPCA 09), IEEE Press, 2009, pp. 262-274.
7. Z. Zhang, Z. Zhu, and X. Zhang, "Design and Optimization of Large Size and Low Overhead Off-Chip Caches," IEEE Trans. Computers, vol. 53, no. 7, 2004, pp. 843-855.
8. L. Zhao et al., "Exploring DRAM Cache Architectures for CMP Server Platforms," Proc. 25th Int'l Conf. Computer Design (ICCD 07), IEEE Press, 2007, pp. 55-62.
9. X. Jiang et al., "CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms," Proc. IEEE 16th Int'l Symp. High Performance Computer Architecture (HPCA 10), IEEE Press, 2010, doi:10.1109/HPCA.2010.5416642.
10. Intel Architecture Software Developer's Manual, vol. 1, Intel, 2008; http://www.intel.com/design/pentiumii/manuals/243190.htm.
11. L. Zhao et al., "Exploring Large-Scale CMP Architectures Using ManySim," IEEE Micro, vol. 27, no. 4, 2007, pp. 21-33.
memory bandwidth requirement increases accordingly and becomes the primary bottleneck. Therefore, at large cache line sizes, the DRAM-caching challenge changes into a memory bandwidth efficiency issue. Saving memory bandwidth requires strictly controlling the amount of cache lines that enter the DRAM cache. However, even when caching only a subset of the working set, we must retain enough useful DRAM cache hits to provide performance benefits. So, we must carefully choose this subset of the working set.

To understand how we should choose this subset, we performed a detailed offline profiling of workloads to obtain their DRAM cache access information (assuming 4-Kbyte page-size cache lines). Table 1 shows the results for four server workloads. For this profiling, we define hot pages as the topmost accessed pages that contribute to 80 percent of the total access number. We calculate the hot-page percentage as the number of hot pages divided by the total number of pages for each workload. We consider about 25 percent of the pages to be hot pages. The last column in Table 1 also shows the minimum number of accesses to these hot pages. On average, a hot page is accessed at least 79 times. If we can capture such hot pages on the fly and allocate them into the DRAM cache, we can reduce memory traffic by about 75 percent and retain about 80 percent of the DRAM cache hits.

Table 1. Offline hot-page profiling statistics for server workloads.*

Workload   Hot-page percentage (%)   Hot-page minimum no. of accesses
SAP        24.8                      95
Sjas       38.4                      65
Sjbb       30.6                      93
TPC-C       7.2                      64
Average    25.2                      79

* SAP: SAP Standard Application; Sjas (SPECjApps2004): SPEC Java Application Server 2004; Sjbb (SPECjbb2005): SPEC Java Server 2005; TPC-C: Transaction Processing Performance Council C.

Filter-based DRAM caching

We introduce a small filter cache that profiles the memory access pattern and identifies hot pages: pages that are heavily accessed because of temporal or spatial locality. By enabling a filter cache that identifies hot pages, we can introduce a filter-based DRAM cache that only allocates cache lines for the hot pages. By eliminating allocation of cold pages in the DRAM cache, we can reduce the bandwidth wasted on allocating cache lines that never get touched later.

Filter cache (CHOP-FC)

Figure 2a shows our basic filter-based DRAM-caching architecture (CHOP-FC), which incorporates a filter cache on die with the DRAM cache tag array (the data array is off die via 3D stacking or MCP). Both the filter cache and DRAM cache use page-size lines. Each filter cache entry (Figure 2b) includes a tag, LRU (least recently used) bits, and a counter indicating how often the line is touched. The filter cache allocates a line into its tag array the first time an LLC miss occurs on this line. The counter starts at zero for a newly allocated line and increases incrementally for subsequent LLC misses on that line. Once the counter reaches a threshold, the filter cache considers the line to be a hot page and puts it into the DRAM cache. To allocate a hot page into the DRAM cache, the filter cache sends a page request to the memory to fetch the page, and the

Figure 2. The filter cache for DRAM caching: architecture overview (a) and filter cache structure (b). (DT: DRAM cache tag array; FC: filter cache; L2C: level-two cache; LLC: last-level cache; LRU: least recently used; Way: cache way of the filter cache.)
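To make the promotion logic concrete, here is a minimal behavioral sketch (not the hardware design) of the filter cache described above: an entry is allocated on the first LLC miss to a page, its counter is incremented on subsequent misses, and the page is promoted to the DRAM cache when the counter reaches the threshold. The fully associative organization, the `FilterCache` class name, and the demo trace are simplifications of my own; the entry count and threshold of 32 are taken from the evaluation configuration discussed later.

```python
# Behavioral sketch of CHOP-FC, assuming a single fully associative set with
# LRU replacement for brevity (the real filter cache is set-associative, Figure 2b).
from collections import OrderedDict

class FilterCache:
    def __init__(self, num_entries: int, threshold: int):
        self.entries = OrderedDict()       # page tag -> miss counter, in LRU order
        self.num_entries = num_entries
        self.threshold = threshold

    def on_llc_miss(self, page_tag: int) -> bool:
        """Returns True when the page has just been identified as hot,
        i.e. it should now be fetched into the DRAM cache."""
        if page_tag in self.entries:
            self.entries[page_tag] += 1            # subsequent LLC miss on the page
            self.entries.move_to_end(page_tag)     # refresh LRU position
        else:
            if len(self.entries) >= self.num_entries:
                self.entries.popitem(last=False)   # evict LRU entry; its counter history is lost
            self.entries[page_tag] = 0             # counter starts at zero on allocation
        return self.entries[page_tag] >= self.threshold

fc = FilterCache(num_entries=32 * 1024, threshold=32)   # 128-Mbyte coverage at 4-Kbyte pages
for page in [7, 7, 9] + [7] * 40:
    if fc.on_llc_miss(page):
        print(f"page {page} promoted to the DRAM cache")
        break
```

Note that evicting a filter-cache entry discards its counter, which is exactly the weakness the memory-based filter cache (CHOP-MFC) later addresses by backing counters up in memory.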
Integrating large caches is a promising way to address the memory bandwidth wall issue in the manycore era. However, organizing and

58.2 percent for Sjbb. With the 8-Kbyte cache line size, it resulted in an average slowdown of 50.1 percent, with a worst case of 77.1 percent for Sjbb. To understand why this occurred, Figure 3b presents each scheme's memory bandwidth utilization. Using large cache line sizes easily saturates the available memory bandwidth, with an average of 92.8 percent and 98.9 percent for 4-Kbyte and 8-Kbyte line sizes, respectively. Because allocating cache lines for each LLC miss in the DRAM cache saturates the memory bandwidth, one possible solution is to cache only a subset of it. However, the RAND bar proves that this subset must be carefully chosen. On average, RAND shows 20.2 percent and 48.1 percent slowdowns for the 4-Kbyte and 8-Kbyte line sizes, respectively.

Figure 3a also demonstrates that our CHOP-FC scheme generally outperforms DRAM and RAND. With the 4-Kbyte and 8-Kbyte line sizes, CHOP-FC achieved average speedups of 17.7 percent and 12.8 percent, respectively, using a counter threshold of 32. The reason is that although hot pages were only 25.24 percent of the LLC-missed portion of the working set, they contributed to 80 percent of the entire LLC misses. Caching those hot pages significantly reduces memory bandwidth utilization and thus reduces the associated queueing delays. As compared to RAND, caching hot pages also provides a far smaller MPI.

CHOP-MFC evaluation

We compared CHOP-MFC's performance and memory bandwidth utilization to that of a regular DRAM cache. Figure 4 compares the speedup ratios (Figure 4a) and memory bandwidth utilization (Figure 4b) of a basic DRAM cache

Figure 3. Speedup ratios for a 4-Kbyte cache line (a) and an 8-Kbyte cache line (b), and memory bandwidth utilization for a 4-Kbyte cache line (c) and an 8-Kbyte cache line (d), of our CHOP (Caching Hot Pages) filter-cache-based schemes and two other schemes. (CHOP-FC: CHOP filter-cache-based DRAM-caching architecture; DRAM: regular DRAM caching; RAND: caching a random subset of LLC misses; Mix: a mixture of all the benchmarks shown, that is, SAP, Sjas, Sjbb, and TPC-C.)
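The 25 percent / 80 percent split quoted above is what the offline profiling behind Table 1 measures: rank pages by access count and take the most-accessed pages that together cover 80 percent of all accesses. A minimal sketch of that computation, run here on a synthetic trace rather than the paper's server-workload bus traces (the function name and the toy addresses are my own):

```python
# Offline hot-page profiling in the spirit of Table 1.
from collections import Counter

def profile_hot_pages(page_trace, coverage=0.80, page_size=4096):
    counts = Counter(addr // page_size for addr in page_trace)   # accesses per page
    total_accesses = sum(counts.values())
    hot, covered = [], 0
    for page, n in counts.most_common():                         # most-accessed pages first
        if covered >= coverage * total_accesses:
            break
        hot.append((page, n))
        covered += n
    hot_pct = 100.0 * len(hot) / len(counts)                     # hot pages / total pages
    min_hot_accesses = min(n for _, n in hot)                    # last column of Table 1
    return hot_pct, min_hot_accesses

# Toy trace in which a couple of pages dominate the accesses.
trace = [0x1000] * 500 + [0x2000] * 300 + \
        list(range(0x100000, 0x100000 + 200 * 4096, 4096))
hot_pct, min_acc = profile_hot_pages(trace)
print(f"hot pages: {hot_pct:.1f}% of pages, each accessed at least {min_acc} times")
```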
the TLB has already obtained the desired page's counter and could supply it to the filter cache. So, we can significantly reduce the counter fetch operation's overhead because this operation doesn't need to go off chip. Upon a filter cache eviction, the filter cache must write the replaced counter back to the PTEs. To assist in PTE lookup, the filter cache must store both a page's virtual and physical tag bits. This approach is transparent to the operating system in that the latter doesn't allocate or update the counters in PTEs. Rather, the processor TLB and the filter cache alone operate these counters.

Adaptive filter cache schemes (CHOP-AFC and CHOP-AMFC)

Because different workloads can have distinct behaviors, and even the same workload can have different phases, memory subsystems could be affected differently. To address such effects, we propose two adaptive filter cache schemes, CHOP-AFC and CHOP-AMFC, in which the processor turns the filter cache on and off dynamically on the basis of the memory utilization status. CHOP-AMFC differs from CHOP-AFC in that the former is based on CHOP-MFC, whereas the latter is based on CHOP-FC. We add monitors to track memory traffic. When memory utilization exceeds a certain threshold, the processor turns on the filter cache so that only hot pages are fetched into the DRAM cache; otherwise, the processor turns off the filter cache and brings all pages into the DRAM cache on demand. With the capability of adjusting the DRAM cache allocation policy on the fly, we expect more performance benefits by using the entire available memory bandwidth.

Evaluation methodology

For the simulation environment, we used the ManySim simulator to evaluate our filter cache schemes. The simulated chip multiprocessor (CMP) architecture consists of 8 cores (each operating at 4 GHz). Each core has an eight-way, 512-Kbyte private level-two (L2) cache, and all cores share a 16-way, 8-Mbyte L3 cache. An on-die interconnect with bidirectional ring topology connects the L2 and L3 caches. The DRAM cache is 128 Mbytes with an on-die tag array. The filter cache has a coverage of 128 Mbytes. The MFC contains 64 entries with four-way associativity. The memory access latency is 400 cycles with a bandwidth of 12.8 Gbytes per second. For CHOP-MFC, we evaluated the second option, which pins down counter memory space in the DRAM cache, because this option involves fewer hardware changes.

We chose four key commercial server workloads: TPC-C, SAP Standard Application (SAP), SPEC Java Server 2005 (Sjbb), and SPEC Java Application Server 2004 (Sjas). We collected long bus traces (data and instruction) for the workloads on an Intel Xeon multiprocessor platform with the LLC disabled.

Evaluation results

We evaluated the performance efficacy of the three filter-cache-based DRAM caching techniques individually, and then we compared their results collectively.

CHOP-FC evaluation

To demonstrate our basic filter cache's effectiveness, we compared CHOP-FC's performance results to a regular 128-Mbyte DRAM cache and a naive scheme that caches a random portion of the working set. Figure 3a shows the speedup ratios of the basic 128-Mbyte DRAM cache (DRAM), a scheme that caches a random subset of the LLC misses (RAND) into the DRAM cache, and our CHOP-FC scheme, which captures the hot subset of pages. We normalized all results to the base case in which no DRAM cache was applied. For the RAND scheme, we generated a random number for each new LLC miss and compared it to a probability threshold to determine whether the block should be cached into the DRAM cache. We adjusted the probability threshold, and Figure 3 shows the one that leads to the best performance.

The DRAM bar shows that directly adding a 128-Mbyte DRAM cache doesn't improve performance as much as expected. Instead, this addition incurs a slowdown in many cases. With a 4-Kbyte cache line size, the DRAM resulted in an average slowdown of 21.9 percent, with a worst case of
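The CHOP-AFC policy described above reduces to a simple control loop: sample memory bandwidth utilization, enable the filter cache (hot pages only) when utilization exceeds the threshold, and otherwise disable it and allocate every LLC-missed page on demand. A minimal sketch under assumptions of my own, namely an epoch-based monitor, a threshold of 0.5, and made-up utilization samples; none of these values come from the paper:

```python
# Sketch of the CHOP-AFC control loop: toggle the filter cache per epoch based
# on measured memory bandwidth utilization.

def filter_cache_on(utilization: float, threshold: float) -> bool:
    """Enable the filter cache when memory bandwidth is scarce."""
    return utilization > threshold

def allocate_in_dram_cache(filter_on: bool, page_is_hot: bool) -> bool:
    """Allocation policy for one LLC-missed page under the current mode."""
    return page_is_hot if filter_on else True    # filter off: allocate all pages on demand

THRESHOLD = 0.5                                  # memory bandwidth utilization threshold
for epoch, util in enumerate([0.20, 0.45, 0.95, 0.60]):
    on = filter_cache_on(util, THRESHOLD)
    cold_allocated = allocate_in_dram_cache(on, page_is_hot=False)
    print(f"epoch {epoch}: utilization {util:.0%} -> filter cache "
          f"{'ON' if on else 'OFF'}, cold pages "
          f"{'allocated on demand' if cold_allocated else 'bypass the DRAM cache'}")
```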
DRAM cache chooses a victim. Our measurements indicated that the LFU (least frequently used) replacement policy performs better than the LRU policy for the DRAM cache. To employ the LFU policy, the DRAM cache also maintains counters similarly to the way the filter cache does. The counter is incremented when a hit to the line occurs. When the DRAM cache replaces a line, it chooses the line with the smallest counter value as the victim. The DRAM cache then returns the victim to the filter cache as a cold page, while writing its data back to memory. The victim's counter is then set to half its current value so that the victim has a better chance than other new lines in the filter cache of becoming hot again. The processor considers as cold pages any LLC misses that also miss in both the filter cache and the DRAM cache (or that hit in the filter cache with a counter below the threshold). For such pages, the LLC sends a regular request (64 bytes) to the memory. So, the processor doesn't waste memory bandwidth on fetching cold pages into the DRAM cache.

Memory-based filter cache (CHOP-MFC)

In the baseline filter cache scheme, the filter cache size directly impacts how long a line can stay in it. The counter history is completely lost when the line is evicted. This could reduce the likelihood of a hot page being identified. To deal with this problem, we propose a memory-based filter cache (CHOP-MFC), which spills a counter into memory when the line is replaced and restores it when the line is fetched back again. With a memory backup for counters, we can expect the filter cache to provide more accurate hot-page identification. In addition, this scheme lets us safely reduce the filter cache size significantly.

Storing counters in memory requires allocating extra space in the memory. A 16-Gbyte memory with a 4-Kbyte page size and a counter threshold of 256 (8-bit counters) requires 4 Mbytes of memory space. We propose three options for this memory backup.

In the first option, either the operating system allocates this counter memory space in the main memory or the firmware or BIOS reserves it without exposing it to the operating system. However, this option requires off-chip memory accesses to look up the counters in the allocated memory region. For smaller filters with potentially higher miss rates, the performance benefit of using a filter cache might not be able to amortize the cost of off-chip counter lookups.

To deal with these problems, the second option for memory backup is to pin this counter memory space in the DRAM cache. If the DRAM cache size is 128 Mbytes, and only 4 Mbytes of memory space are required for counter backup, then the loss in DRAM cache space is minimal (about 3 percent). To generate the DRAM cache address for retrieving a counter, we simply use the page frame number of a page to reference the counter memory space in the DRAM cache. Similar to CHOP-FC, whenever an LLC miss occurs, the processor checks the filter cache and increments the corresponding counter. To further reduce the counter lookups' latency, the processor can send prefetch requests for counters upon a lower-level cache miss. If a replacement in the filter cache occurs, the processor updates the corresponding counter in the reserved memory space. This counter writeback can occur in the background. Once the counter in the filter cache reaches the threshold, the filter cache identifies a hot page and installs it into the DRAM cache. (More details on this option are available elsewhere.)

We further propose a third option for the MFC, in which we distribute the counter memory space and augment it into the page table entry (PTE) of each page. The processor allocates the unused bits in the PTEs (18 bits in a 64-bit machine) for that page's counter. Upon a translation look-aside buffer (TLB) miss, the processor's TLB miss handler or the page table walk routine brings the counter along with other PTE attributes into the TLB. The processor performs virtual to physical page translation at the beginning of the instruction fetch stage (for instruction pages) and the memory access stage (for data pages). Therefore, by the time a filter cache access fires for a load or store instruction,
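Returning to the second option, its sizing follows from simple arithmetic: one 8-bit counter per 4-Kbyte page of a 16-Gbyte memory is 4 Mbytes, about 3 percent of the 128-Mbyte DRAM cache, and a counter's location can be derived directly from the page frame number. A small sketch of that arithmetic and indexing, where `COUNTER_BASE` and the function name are illustrative placeholders rather than values from the article:

```python
# Worked check of the CHOP-MFC "option 2" counter backup pinned in the DRAM cache.
PAGE_SIZE = 4 * 2**10            # 4-Kbyte pages
MEMORY_SIZE = 16 * 2**30         # 16-Gbyte main memory
DRAM_CACHE_SIZE = 128 * 2**20    # 128-Mbyte DRAM cache
COUNTER_BITS = 8                 # a threshold of 256 fits in an 8-bit counter

num_pages = MEMORY_SIZE // PAGE_SIZE                  # 4M pages
counter_space = num_pages * COUNTER_BITS // 8         # 4 Mbytes of counters
print(f"counter backup: {counter_space / 2**20:.0f} Mbytes "
      f"({100 * counter_space / DRAM_CACHE_SIZE:.1f}% of the DRAM cache)")

COUNTER_BASE = 0                 # assumed start of the pinned counter region

def counter_address(phys_addr: int) -> int:
    """DRAM-cache address of the counter for the page containing phys_addr."""
    page_frame_number = phys_addr // PAGE_SIZE
    return COUNTER_BASE + page_frame_number            # one byte of counter per page

print(hex(counter_address(0x1234_5678)))
```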
Xiaowei Jiang is a research scientist at Intel. His research interests include system-on-chip (SoC) and heterogeneous-core architectures. He has a PhD in computer engineering from North Carolina State University.

Niti Madan is a National Science Foundation and Computing Research Association Computing Innovation Fellow at the IBM Thomas J. Watson Research Center. Her research interests include power- and reliability-aware architectures and the use of emerging technologies such as 3D die stacking. She has a PhD in computer science from the University of Utah.

Li Zhao is a research scientist at Intel. Her research interests include SoC architectures and cache memory systems. She has a PhD in computer science from the University of California, Riverside.

Mike Upton is a principal engineer at Intel. His research interests include graphics processor architectures and many-core architectures. He has a PhD in computer architecture from the University of Michigan, Ann Arbor.

Ravi Iyer is a principal engineer at Intel. His research interests include SoC and heterogeneous-core architectures, cache memory systems, and accelerator architectures. He has a PhD in computer science from Texas A&M University.

Srihari Makineni is a senior researcher at Intel. His research interests include cache and memory subsystems, interconnects, networking, and large-scale chip multiprocessor (CMP) architectures. He has an MS in electrical and computer engineering from Lamar University, Texas.

Donald Newell is the CTO of the Server Product Group at AMD. His research interests include server architectures, cache memory systems, and networking. He has a BS in computer science from the University of Oregon.

Yan Solihin is an associate professor in the Department of Electrical and Computer Engineering at North Carolina State University. His research interests include cache memory systems, performance modeling, security, and reliability. He has a PhD in computer science from the University of Illinois at Urbana-Champaign.

Rajeev Balasubramonian is an associate professor in the School of Computing at the University of Utah. His research interests include memory systems, interconnect design, large-cache design, transactional memory, and reliability. He has a PhD in computer science from the University of Rochester.

Direct questions and comments about this article to Xiaowei Jiang, JF2, Intel, 2111 NE 25th Ave., Hillsboro, OR 97124; xiaowei.jiang@intel.com.
(DRAM) with CHOP-MFC under various counter thresholds (MFC-32 for threshold 32, and so on). Figure 4a shows that CHOP-MFC outperforms DRAM in most cases. For the 4-Kbyte line size, MFC-64 achieved the highest speedup (25.4 percent), followed by MFC-32 (22.5 percent), MFC-128 (18.6 percent), and MFC-256 (12.0 percent). For the 8-Kbyte line size, MFC-128 performed best, with an average speedup of 18.9 percent. The performance improvements are due to the significant reduction in memory bandwidth utilization, as Figure 4b shows. These results demonstrate that having a small MFC is sufficient for keeping the fresh hot-page candidates. Unlike CHOP-FC, replacements in the MFC incur counter writebacks to memory rather than a total loss of counters. Therefore, CHOP-MFC achieves higher hot-page identification accuracy. CHOP-MFC with 64 entries had an average miss rate of 4.29 percent compared to 10.34 percent in CHOP-FC, even with 32,000 entries, for the 4-Kbyte line size. Thanks to the performance robustness (speedups in all cases) that counter threshold 128 provided, we chose to use it as the default.

Figure 4. Speedup ratios for a 4-Kbyte cache line (a) and an 8-Kbyte cache line (b), and memory bandwidth utilization for a 4-Kbyte cache line (c) and an 8-Kbyte cache line (d), of the CHOP memory-based filter cache (CHOP-MFC) with various thresholds.

CHOP-AFC evaluation

Figure 5 exhibits the results for CHOP-AFC with memory utilization thresholds varying from 0 to 1. (For this evaluation, we show CHOP-AFC's results only because CHOP-AMFC exhibited similar performance.) For memory bandwidth utilization thresholds less than 0.3, the speedups achieved remained the same (except for TPC-C). For threshold values between 0.3 and 1, the speedups show an increasing trend followed by an immediate decrease (see Figure 5a). On the other hand, Figure 5b shows that for threshold values less than 0.3, the memory bandwidth utilization remained approximately the same (except for an increase in TPC-C). For threshold values greater than 0.3, memory
bandwidth wall problem. Such a solution enables 8 to 16 times higher density than traditional static RAM (SRAM) caches, and consequently provides far higher cache capacity. The basic DRAM-caching approach is to integrate the tag arrays on die for fast lookup, whereas the DRAM cache data arrays can be either on chip (embedded DRAM, or eDRAM) or off chip (3D stacked or MCP). To understand the benefits and challenges of DRAM caching, we experimented with the Transaction Processing Performance Council C benchmark (TPC-C). Figure 1 shows the DRAM cache benefits (in terms of misses per instruction [MPI] and performance) along with the trade-offs (in terms of tag space overhead and memory bandwidth utilization). The baseline configuration has no DRAM cache.

At smaller cache lines, tag space overhead becomes significant. Figure 1a shows that a 128-Mbyte DRAM cache with 64-byte cache lines has a tag space overhead of 6 Mbytes. If we implement this cache, then we must either increase the die size by 6 Mbytes or reduce the LLC space to accommodate the tag space. Either alternative significantly offsets the performance benefits, so we don't consider this a viable approach. The significant tag space overhead at small cache lines inherently promotes the use of large cache lines, which alleviates the tag space problem by more than an order of magnitude (as Figure 1a shows, the 4-Kbyte line size enables a tag space of only a few kilobytes). However, Figure 1b shows that using larger cache lines doesn't significantly reduce MPI. This is because larger cache line sizes alone aren't sufficient for providing spatial locality. More importantly, memory bandwidth utilization increases significantly (Figure 1c) because the decrease in DRAM cache miss rate (or MPI) isn't directly proportional to the increase in line size. Such an increase in main memory bandwidth requirements saturates the main memory subsystem and therefore immediately and significantly affects this DRAM cache configuration's performance (as shown by the 2-Kbyte and 4-Kbyte line size results in the last two bars of Figure 1d).

Server workload DRAM caching

Using large cache line sizes (such as 4-Kbyte page sizes) considerably alleviates the tag space overhead. However, the main

Figure 1. Overview of dynamic RAM (DRAM) cache benefits, challenges, and trade-offs: tag array storage overhead (a), misses per instruction (b), memory bandwidth utilization (c), and performance improvement (d).
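The tag-overhead trend in Figure 1a follows directly from the line count: a 128-Mbyte cache holds 2M 64-byte lines but only 32K 4-Kbyte lines. A minimal sketch of that arithmetic, assuming roughly 3 bytes of tag and state metadata per line; that per-entry width is my own assumption, chosen only because it reproduces the ~6-Mbyte figure quoted above for 64-byte lines, so only the scaling trend (not the absolute numbers) should be read from it:

```python
# Back-of-the-envelope tag-array overhead for a 128-Mbyte DRAM cache.
CACHE_BYTES = 128 * 2**20          # 128-Mbyte DRAM cache
BYTES_PER_TAG_ENTRY = 3            # assumed tag + LRU/state metadata per line

def tag_overhead_bytes(line_size: int) -> int:
    """On-die tag storage needed when the cache uses line_size-byte lines."""
    return (CACHE_BYTES // line_size) * BYTES_PER_TAG_ENTRY

for line_size in (64, 128, 256, 512, 1024, 2048, 4096):
    overhead_kb = tag_overhead_bytes(line_size) / 2**10
    print(f"{line_size:>5}-byte lines: {overhead_kb:8.0f} Kbytes of tag array")
# Under this assumption, 64-byte lines need ~6 Mbytes of tags while 4-Kbyte
# lines need well under 100 Kbytes -- the more-than-an-order-of-magnitude
# reduction the text describes.
```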