Appears in the Proceedings of the 43rd Annual IEEE/ACM Symposium on Microarchitecture (MICRO-43), 2010

The ZCache: Decoupling Ways and Associativity

Daniel Sanchez and Christos Kozyrakis
Electrical Engineering Department, Stanford University
{sanchezd, kozyraki}@stanford.edu

Abstract: The ever-increasing importance of main memory latency and bandwidth is pushing CMPs towards caches with higher capacity and associativity. Associativity is typically improved by increasing the number of ways. This reduces conflict misses, but increases hit latency and energy, placing a stringent trade-off on cache design. We present the zcache, a cache design that allows much higher associativity than the number of physical ways (e.g. a 64-associative cache with 4 ways). The zcache draws on previous research on skew-associative caches and cuckoo hashing. Hits, the common case, require a single lookup, incurring the latency and energy costs of a cache with a very low number of ways. On a miss, additional tag lookups happen off the critical path, yielding an arbitrarily large number of replacement candidates for the incoming block. Unlike conventional designs, the zcache provides associativity by increasing the number of replacement candidates, but not the number of cache ways.

To understand the implications of this approach, we develop a general analysis framework that allows comparing associativity across different cache designs (e.g. a set-associative cache and a zcache) by representing associativity as a probability distribution. We use this framework to show that for zcaches, associativity depends only on the number of replacement candidates, and is independent of other factors (such as the number of cache ways or the workload). We also show that, for the same number of replacement candidates, the associativity of a zcache is superior to that of a set-associative cache for most workloads. Finally, we perform detailed simulations of multithreaded and multiprogrammed workloads on a large-scale CMP with zcache as the last-level cache. We show that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.

I. INTRODUCTION

As Moore's law enables chip-multiprocessors (CMPs) with tens and hundreds of cores [22, 40], the limited bandwidth, high latency, and high energy of main memory accesses become an important limitation to scalability. To mitigate this bottleneck, CMPs rely on complex memory hierarchies with large and highly associative caches, which commonly take more than 50% of chip area and contribute significantly to static and dynamic power consumption [29, 43].

The goal of this work is to improve the efficiency of cache associativity. Higher associativity provides more flexibility in block (re)placement and allows us to utilize the limited cache capacity in the best possible manner. Last-level caches in existing CMPs are already highly associative and the trend is to increase the number of ways with core count. Moreover, several architectural proposals rely on highly associative caches. For example, many designs for transactional memory and thread-level speculation [13, 19], deterministic replay [42], event monitoring and user-level interrupts [8, 34], and even memory consistency implementations [12] use caches to buffer or pin specific blocks. Low associativity makes it difficult to buffer large sets of blocks, limiting the applicability of these schemes or requiring expensive fall-back mechanisms.

Conventional caches improve associativity by increasing the number of physical ways. Unfortunately, this also increases the latency and energy cost of cache hits, placing a stringent trade-off on cache design. For example, this trade-off limits the associativity of first-level caches in most chips to two or four ways. For last-level caches, a 32-way set-associative cache has up to 3.3× the energy per hit and is 32% slower than a 4-way design.

Most alternative approaches to improve associativity rely on increasing the number of locations where a block can be placed (with e.g. multiple locations per way [1, 10, 37], victim caches [3, 25] or extra levels of indirection [18, 36]). Increasing the number of possible locations of a block ultimately increases the energy and latency of cache hits, and many of these schemes are more complex than conventional cache arrays (requiring e.g. heaps [3], hash-table-like arrays [18] or predictors [10]). Alternatively, hashing can be used to index the cache, spreading out accesses and avoiding worst-case access patterns [26, 39]. While hashing-based schemes improve performance, they are still limited by the number of locations that a block can be in.

In this paper,
we propose a novel cache design that achieves arbitrarily high associativity with a small number of physical ways, breaking the trade-off between associativity and access latency or energy. The design is motivated by the observation that associativity is the ability of a cache to select a good block to evict on a replacement. For instance, assuming an access pattern with high temporal locality, the best block to evict is the least recently used one in the entire cache. For a transactional memory system, the best block to evict is one that does not store transactional metadata. A cache that provides a higher quality stream of evicted blocks essentially has higher associativity, regardless of the number of ways it uses and the number of locations each block can be placed in.

Our three main contributions are:

1) We propose zcache, a cache design that improves associativity while keeping the number of possible locations (i.e. ways) of each block small. The zcache's design is based on the insight that associativity is not determined by the number of locations that a block can reside in, but by the number of replacement candidates on an eviction. Like a skew-associative cache [39], a zcache accesses each way
using a different hash function. A block can be in only one location per way, so hits, the common case, require only a single lookup. On a replacement, the zcache exploits the fact that, with different hash functions, a block that conflicts with the incoming block can be moved to a non-conflicting location in another way instead of being evicted to accommodate the new block. This is similar to cuckoo hashing [35], a technique to build space-efficient hash tables. On a miss, the zcache walks the tag array to obtain additional replacement candidates, evicts the best one, and performs a series of relocations to accommodate the incoming block. This happens off the critical path, concurrently with the miss and other lookups, so it has no effect on access latency.

2) We develop a novel analysis framework to understand associativity and compare the associativities of different cache designs independently of the replacement policy. We define associativity as a probability distribution and show that, under a set of conditions, which are met by zcaches, associativity depends only on the number of replacement candidates. Therefore, we prove that the zcache decouples associativity from the number of ways (or locations that a block can be in).

3) We evaluate a first use of zcaches at the last-level cache of the CMP's memory hierarchy. Using the analytical framework we show that, for the same number of ways, zcaches provide higher associativity than set-associative caches for most workloads. We also simulate a variety of multithreaded and multiprogrammed workloads on a large-scale CMP, and show that zcaches achieve the benefits of highly-associative caches without increasing access latency or energy. For example, over a set of 10 miss-intensive workloads, a 4-way zcache provides 7% higher IPC and 10% better energy efficiency than a 32-way set-associative cache.

The rest of the paper is organized as follows. Section II gives the necessary background on approaches to increase cache associativity. Section III presents the zcache design. Section IV develops the theoretical framework to understand and analyze associativity. Section V discusses our evaluation methodology, and Section VI presents the evaluation of the zcache as a last-level cache. Section VII discusses additional related work, and Section VIII concludes the paper.

II. BACKGROUND ON CACHE ASSOCIATIVITY

Apart from simply increasing the number of ways in a cache and checking them in parallel, there is abundant prior work on alternative schemes to improve associativity. They mainly rely on either using hash functions to spread out cache accesses, or increasing the number of locations that a block can be in.

A. Hashing-based Approaches

Hash block address: Instead of using a subset of the block address bits as the cache index, we can use a better hash function on the address to compute the index. Hashing spreads out access patterns that are otherwise pathological, such as strided accesses that always map to the same set. Hashing slightly increases access latency as well as area and power overheads due to this additional circuitry. It also adds tag store overheads, since the full block address needs to be stored in the tag. Simple hash functions have been shown to perform well [26], and some commercial processors implement this technique in their last-level cache [41].

Skew-associative caches: Skew-associative caches [39] index each way with a different hash function. A specific block address conflicts with a fixed set of blocks, but those blocks conflict with other addresses on other ways, further spreading out conflicts. Skew-associative caches typically exhibit lower conflict misses and higher utilization than a set-associative cache with the same number of ways [7]. However, they break the concept of a set, so they cannot use replacement policy implementations that rely on set ordering (e.g. using pseudo-LRU to approximate LRU).

B. Approaches that Increase the Number of Locations

Allow multiple locations per way: Column-associative caches [1] extend direct-mapped caches to allow a block to reside in two locations based on two (primary and secondary) hash functions. Lookups check the second location if the first is a miss and a rehash bit indicates that a block in the set is in its secondary location. To improve access latency, a hit in a secondary location causes the primary and secondary locations to be swapped. This scheme has been extended with better ways to predict which location to probe first [10], higher associativities [45], and schemes that explicitly identify the less used sets and use them to store the more used ones [37]. The drawbacks of allowing multiple locations per way are the variable hit latency and reduced cache bandwidth due to multiple lookups, and the additional energy required to do swaps on hits.

Use a victim cache: A victim cache is a highly or fully-associative small cache that stores blocks evicted from the main cache until they are either evicted or re-referenced [25]. It avoids conflict misses that are re-referenced after a short period, but works poorly with a sizable amount of conflict misses in several hot ways [9]. Scavenger [3] divides cache space into two equally large parts, a conventional set-associative cache and a fully-associative victim cache organized as a heap. Victim cache designs work well as long as misses in the main cache are rare. On a miss in the main cache, they introduce additional latency and energy consumption to check the victim cache, regardless of whether the victim cache holds the block.

Use indirection in the tag array: An alternative strategy is to implement tag and data arrays separately, making the tag array highly associative, and having it contain pointers to a non-associative data array. The Indirect Index Cache (IIC) [18] implements the tag array as a hash table using open-chained hashing for high associativity. The V-Way cache [36] implements a conventional set-associative tag array, but makes it larger than the data array to make conflict misses rare. Tag indirection schemes suffer from extra hit latency, as they are forced to serialize accesses to the tag and data arrays. Both the IIC and the V-Way cache have tag array overheads of around […], and the IIC has a variable hit latency.
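As a rough illustration of the hashing-based indexing described in Section II-A, the sketch below compares conventional bit-selection indexing with a simple XOR-folding hash on a strided access pattern. The line size, the number of sets and the XOR-fold function are assumptions chosen for the example; they are not taken from the paper or from any specific processor.

```python
LINE_BYTES = 64                       # assumed cache line size
NUM_SETS = 256                        # assumed number of sets (power of two)
INDEX_BITS = NUM_SETS.bit_length() - 1


def bit_select_index(addr):
    """Conventional index: low-order bits of the block address."""
    block = addr // LINE_BYTES
    return block & (NUM_SETS - 1)


def xor_fold_index(addr):
    """Illustrative hash (an assumption): XOR-fold the block address
    into INDEX_BITS bits."""
    block = addr // LINE_BYTES
    index = 0
    while block:
        index ^= block & (NUM_SETS - 1)
        block >>= INDEX_BITS
    return index


# Strided pattern whose stride equals LINE_BYTES * NUM_SETS, so every
# address has the same low-order index bits.
stride = LINE_BYTES * NUM_SETS
addresses = [i * stride for i in range(64)]

sets_plain = {bit_select_index(a) for a in addresses}
sets_hashed = {xor_fold_index(a) for a in addresses}
print(len(sets_plain), len(sets_hashed))
```

With bit selection, all 64 addresses collide in a single set, while the hashed index spreads them over distinct sets. The price, as noted above, is that the full block address must be kept in the tag.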
Several of these designs both increase cache associativity and propose a new replacement policy, sometimes tailored to the proposed design [3, 18, 36, 39]. This makes it difficult to elucidate how much improvement is due to the higher associativity and how much depends on the better replacement policy. In this work we consider that associativity and replacement policy are separate issues, and focus on associativity.

III. THE ZCACHE DESIGN

Structurally, the zcache shares many common elements with the skew-associative cache. Each way is indexed by a different hash function, and a cache block can only reside in a single position on each way. That position is given by the hash value of the block's address. Hits happen exactly as in the skew-associative cache, requiring a single lookup to a small number of ways. On a miss, however, the zcache exploits the fact that two blocks that conflict on a way often do not conflict on the other ways to increase the number of replacement candidates.

The zcache performs a replacement over multiple steps. First, it walks the tag array to identify the set of replacement candidates. It then picks the candidate preferred by the replacement policy (e.g. least recently used block for LRU), and evicts it. Finally, it performs a series of relocations to be able to accommodate the incoming block at the right location.

The multi-step replacement process happens while fetching the incoming block from the memory hierarchy, and does not affect the time required to serve the miss. In non-blocking caches, simultaneous lookups happen concurrently with this process. The downside is that the replacement process requires extra bandwidth, especially on the tag array, and needs extra energy. However, should bandwidth or energy become an issue, the replacement process can be stopped early, simply resulting in a worse replacement candidate.

A. Operation

We explain the operation of the replacement process in detail using the example in Fig. 1. The example uses a small 3-way cache with 8 lines per way. Letters A-Z denote cache blocks, and numbers denote hash values. Fig. 1g shows the timeline of reads and writes to the tag and data arrays, and the memory bus. Throughout Fig. 1, addresses and hash values obtained in the same access share the same color.

Walk: Fig. 1a shows the initial contents of the cache and the miss for address Y that triggers the process. Initially, the addresses returned by the tag lookup for Y are our only replacement candidates for the incoming block (addresses A, D and M). These are the first-level candidates. A skew-associative cache would only consider these candidates. In a zcache, the controller starts the walk to expand the number of candidates by computing the hash values of these addresses, shown in Fig. 1b. One of the hash values always matches the hash value of the incoming block. The others denote the positions in the array where we could move each of our current replacement candidates to accommodate the incoming block. For example, as column A in Fig. 1b shows, we could move block A to line 2 in way 1 (evicting K) or line 1 in way 2 (evicting X) and write incoming block Y in line 5 of way 0.

We take the six non-matching hash values in Fig. 1b and perform two accesses, giving us an additional set of six second-level replacement candidates, as shown in Fig. 1c (addresses B, K, X, P, Z, and S). We can repeat this process (which, at its core, is a breadth-first graph walk) indefinitely, getting more and more replacement candidates. In practice, we eventually need to stop the walk and select the best candidate found so far. In this example, we expand up to a third level, reaching 21 (3+6+12) replacement candidates. In general, it is not necessary to obtain full levels. Fig. 1d shows a tree with the three levels of candidates. Note how, in expanding the second level, some hash values are repeated and lead to the same address. These repeats are bound to happen in this small example, but are very rare in larger caches with hundreds of blocks per way.

Once the walk finishes, the replacement policy chooses the best replacement candidate. We discuss the implementation of replacement policies in Section III-E. In our example, block N is the best candidate, as shown in Fig. 1d. To accommodate the incoming block Y, the zcache evicts N and relocates its ancestors in the tree (both data and tags), as shown in Fig. 1e. This involves reading and writing the tags and data to their new locations, as the timeline in Fig. 1g indicates. Fig. 1f shows the contents of the cache after the replacement process is finished, with N evicted and Y in the cache. Note how N and Y both used way 0, but completely different locations.

B. General figures of merit

A zcache with W ways where the walk is limited to L levels has the following figures of merit:

Replacement candidates (R): Assuming no repeats when expanding the tree, $R = W \sum_{l=0}^{L-1} (W-1)^l$.

Replacement process energy ($E_{miss}$): If the energies to read/write tag or data in a single way are denoted $E_{rt}$, $E_{wt}$, $E_{rd}$ and $E_{wd}$, then $E_{miss} = E_{walk} + E_{relocs} = R \cdot E_{rt} + m \cdot (E_{rt} + E_{rd} + E_{wt} + E_{wd})$, where $m \in \{0, \ldots, L-1\}$ is the number of relocations. Note that reads and writes to the data array, which consume most of the energy, grow with L, i.e. logarithmically with R.

Replacement process latency: Because accesses in a walk can be pipelined, the latency of a walk grows with the number of levels, unless there are so many accesses on each level that they fully cover the latency of a tag array read: $T_{walk} = \sum_{l=0}^{L-1} \max(T_{tag}, (W-1)^l)$. This means that, for W > 2, we can get tens of candidates in a small amount of delay. For example, Fig. 1g assumes a tag read delay of 4 cycles, and shows how the walk process for 21 candidates (3 levels) completes in 4 × 3 = 12 cycles. The whole process finishes in 20 cycles, much earlier than the 100 cycles used to retrieve the incoming block from main memory.

[Fig. 1. Replacement process in a zcache. Panels: (a) Initial state of the cache and initial miss; (b) Hash values of first-level candidates; (c) Hash values of second-level candidates; (d) The three levels of replacement candidates, where N is selected by the replacement policy; (e) Relocations done to accommodate the incoming block; (f) Final cache state after replacement; (g) Timeline of requests and responses.]

Fig. 1b, hash values of the incoming block and the first-level candidates (H0, H1 and H2 index ways 0, 1 and 2):

Addr   Y  A  D  M
H0     5  5  3  2
H1     4  2  4  5
H2     0  1  7  0

Fig. 1c, hash values of the second-level candidates:

Addr   B  K  X  P  Z  S
H0     3  7  4  2  6  1
H1     6  2  3  3  5  2
H2     1  0  1  5  3  7
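The walk and relocation steps of Section III-A can be sketched in software as follows. This is a toy model under stated assumptions, not the hardware design: the class name `ToyZCache`, the per-way hash in `pos`, and the use of a last-access dictionary as the LRU rank are all hypothetical. The sketch also skips positions it has already visited, which sidesteps the repeats that occur in very small caches.

```python
from collections import deque


class ToyZCache:
    """Toy zcache tag array: each way is indexed by its own hash function."""

    def __init__(self, ways=3, lines=8):
        self.ways, self.lines = ways, lines
        self.tags = [[None] * lines for _ in range(ways)]
        self.stamp = {}          # block address -> time of last access (LRU rank)
        self.clock = 0

    def pos(self, way, addr):
        # Hypothetical per-way hash, chosen only for illustration.
        return (addr * (2 * way + 3) + (addr >> 3) * (way + 1)) % self.lines

    def lookup(self, addr):
        # A hit needs one probe per way, as in a skew-associative cache.
        return any(self.tags[w][self.pos(w, addr)] == addr
                   for w in range(self.ways))

    def access(self, addr, levels=3):
        """Access a block; returns the evicted block on a miss, else None."""
        self.clock += 1
        evicted = None
        if not self.lookup(addr):
            evicted = self._replace(addr, levels)
        self.stamp[addr] = self.clock
        return evicted

    def _replace(self, addr, levels):
        # Walk: breadth-first expansion of candidates, up to `levels` levels.
        nodes, seen = [], set()          # node = (way, line, parent index)
        queue = deque((w, self.pos(w, addr), None, 1) for w in range(self.ways))
        while queue:
            way, line, parent, level = queue.popleft()
            if (way, line) in seen:      # skip repeated positions
                continue
            seen.add((way, line))
            nodes.append((way, line, parent))
            block = self.tags[way][line]
            if block is not None and level < levels:
                me = len(nodes) - 1
                for w in range(self.ways):
                    if w != way:         # where could this block move instead?
                        queue.append((w, self.pos(w, block), me, level + 1))

        # Victim selection: an empty slot if any, else the LRU candidate.
        def age(i):
            block = self.tags[nodes[i][0]][nodes[i][1]]
            return -1 if block is None else self.stamp[block]

        i = min(range(len(nodes)), key=age)
        evicted = self.tags[nodes[i][0]][nodes[i][1]]
        if evicted is not None:
            del self.stamp[evicted]

        # Relocations: move each ancestor into its child's slot, bottom-up.
        while nodes[i][2] is not None:
            way, line, parent = nodes[i]
            self.tags[way][line] = self.tags[nodes[parent][0]][nodes[parent][1]]
            i = parent
        self.tags[nodes[i][0]][nodes[i][1]] = addr
        return evicted


cache = ToyZCache()
resident = set()
for addr in range(100, 160):
    evicted = cache.access(addr)
    resident.add(addr)
    resident.discard(evicted)
print(len(resident), "blocks resident")
```

Every relocated block lands in a slot given by one of its own hash values, so a later lookup still finds it with a single probe per way. A hardware controller would track hash values in a small walk table rather than Python tuples.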
C. Implementation

To implement the replacement process, the cache controller needs some modifications involving hash functions, some additional state and, for non-blocking caches, scheduling of concurrent operations.

Hash functions: We need one hash function per way. Hash functions range from extremely simple (e.g. bit selection) to exceedingly complex (e.g. cryptographic hash functions like SHA-1). In this study, we use H3 hash functions [11], a family of low-cost, universal, pairwise-independent hash functions that require a few XOR gates per hash bit [38].

State: The controller needs to remember the positions of the replacement candidates visited during the walk and the position of the best eviction candidate. Tracking only the most desirable replacement candidate is not sufficient, because relocations need to know about all blocks in the path. However, a single-ported SRAM or small register file suffices. Note that we do not have to remember full tags, just hash values. Also, no back-pointers need to be stored, because for a certain position in the SRAM, the parent's position is always the same. In the example shown in Fig. 1, the controller needs 63 bits of state to track candidates (21 hash values × 3 bits/value). If the cache were larger, e.g. 3MB, with 1MB per way and 64-byte lines (requiring 14 bits/hash value), it would need 294 bits. Additionally, the controller must buffer the tags and data of the L lines it reads and writes on a relocation. Since the number of levels is typically small (2 or 3 in our experiments), this also entails a small overhead.

Concurrent operations for non-blocking caches: To avoid increasing cache latency, the replacement process should be able to run concurrently with all other operations (tag/data reads and writes due to hits, write-backs, invalidations, etc.). The walk process can run concurrently without interference. This may lead to benign races where, for example, the walk identifies the best eviction candidate to be a block that was accessed (e.g. with a hit) in the interim. This is exceedingly rare in large caches, so we simply evict the block anyway. In smaller caches (e.g. highly-associative but small TLBs or first-level caches), we could keep track of the best two or three eviction candidates and discard them if they are accessed while the walk process is running. In the second part of the replacement, the relocations, the controller must block intervening operations to at most L positions while blocks in these positions are being relocated. We note that the controller already has logic to deal with these cases (e.g. with MSHRs [28]).

While it is feasible to run multiple replacement processes concurrently, it would complicate the cache controller, and since replacements are not in the critical path, they can simply queue up. Concurrent replacements would only make sense to increase bandwidth utilization when the cache is close to bandwidth saturation. As shown in Section VI, we do not see the need for such a mechanism in our experiments.

In conclusion, the zcache imposes minor state and logic overheads on traditional cache controllers.

D. Extensions

We now discuss additional implementation options to enhance zcaches.

Avoiding repeats: In small first-level caches or TLBs, repeats can be common due to walking a significant portion of the cache. Moreover, a repeat at a low level can trigger the expansion of many repeated candidates. Repeats can be avoided by inserting the addresses visited during the walk in a Bloom filter [6], and not continuing the walk through addresses that are already represented in the filter. Repeats are rare in our experiments, so we do not see any performance benefit from this.

Alternative walk strategies: The current walk performs a breadth-first search (BFS) for candidates, fully expanding all levels. Alternatively, we could perform a depth-first search (DFS), always moving towards higher levels of replacement candidates. Cuckoo hashing [35] follows this strategy. DFS allows us to remove the walk table and interleave walk with relocations, reducing state. However, it increases the number of relocations for a given number of replacement candidates (L = R/W), which in turn increases both the energy required per replacement (as relocations read and write to the much wider data array) and replacement latency (as accesses in the walk cannot be pipelined). BFS is a much better match to a hardware implementation, as the extra required state for BFS is a few hundred bits at most. Nevertheless, a controller can implement a hybrid BFS+DFS strategy to increase associativity cheaply. For instance, in our example in Fig. 1, the controller could perform a second phase of BFS, trying to re-insert N rather than evicting it, to double the number of candidates without increasing the state needed.

E. Replacement Policy

So far, we have purposely ignored how the replacement policy is implemented. In this section, we cover how to implement or approximate LRU. While set-associative caches can cheaply maintain an order of the blocks in each set (e.g. using LRU or pseudo-LRU), the concept of a set does not exist in a zcache, so policies that rely on this ordering need to be implemented differently. However, several processor designs already find it too expensive to implement set ordering and resort to policies that do not require it [20, 41]. Additionally, some of the latest, highest-performing policies do not rely on set ordering [24]. While designing a replacement policy specifically tailored to zcaches is an interesting endeavor, we defer it to future work.

Full LRU: We use a global timestamp counter, and add a timestamp field to each block in the cache. On each access, the timestamp counter is incremented, and the timestamp field is updated to the current counter value. On a replacement, the controller selects the replacement candidate with the lowest timestamp (in mod $2^n$ arithmetic, where n is the timestamp width in bits). This design requires very simple logic, but timestamps have to be large (e.g. 32 bits) to make wrap-arounds rare, thus having high area overhead.
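The timestamp comparison used by full LRU can be sketched as below. The point of the example is the modular comparison: once the counter wraps around, the numerically smallest timestamp is no longer the oldest one, so the controller compares distances to the current counter instead. The class name `TimestampLRU` and the 8-bit width are assumptions for illustration; the width is deliberately small so that a wrap-around occurs within a few hundred accesses.

```python
TS_BITS = 8                          # assumed timestamp width, small on purpose
MASK = (1 << TS_BITS) - 1


class TimestampLRU:
    """Global counter plus a per-block timestamp field."""

    def __init__(self):
        self.counter = 0             # global timestamp counter (TS_BITS wide)
        self.ts = {}                 # block -> timestamp field

    def touch(self, block):
        # On each access, bump the counter and stamp the block.
        self.counter = (self.counter + 1) & MASK
        self.ts[block] = self.counter

    def age(self, block):
        # Distance from the block's timestamp to the counter, modulo 2^n.
        return (self.counter - self.ts[block]) & MASK

    def pick_victim(self, candidates):
        # "Lowest timestamp in mod 2^n arithmetic" means largest age.
        return max(candidates, key=self.age)


lru = TimestampLRU()
for block in "ABC":
    lru.touch(block)
for _ in range(250):                 # many accesses to D push the counter up
    lru.touch("D")
for block in "BAEF":                 # the counter wraps around while touching E
    lru.touch(block)

print(lru.pick_victim(["A", "B", "E"]))
```

Among candidates A, B and E, block B was touched least recently and is selected, even though E holds the numerically smallest timestamp after the wrap-around. The scheme still fails if a block survives a full wrap without being accessed, which is why the text calls for wide timestamps.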
Bucketed LRU: To decrease space overheads, timestamps are made smaller, and the controller increases the timestamp counter once every k accesses. For example, with k = 5% of the cache size and n = 8 bits per timestamp, it is rare for a block to survive a wrap-around without being either accessed or evicted. We use this LRU policy in our evaluation.

IV. AN ANALYTICAL FRAMEWORK FOR ASSOCIATIVITY

Quantifying and comparing associativity across different cache designs is hard. In set-associative caches, more ways implicitly mean higher associativity. However, when comparing different designs (e.g. a set-associative cache and a zcache), the number of ways becomes a useless proxy for associativity.

The most commonly used approach to quantify associativity is by the number of conflict misses [21]. Conflict misses for a cache are calculated by subtracting the number of misses incurred by a fully-associative cache of the same size from the total number of misses. Using conflict misses as a proxy for associativity has the advantage of being an end-to-end metric, directly linking associativity to performance. However, it is subject to three problems. First, it is highly dependent on the replacement policy; for example, by using an LRU replacement policy in a workload with an anti-LRU access pattern, we can get higher conflict misses when increasing the number of ways. Second, in CMPs with multilevel memory hierarchies, changing the associativity can alter the reference stream at higher cache levels, and comparing the number of conflict misses when the total number of accesses differs is meaningless. Finally, conflict misses are workload-dependent, so they cannot be used as a general proxy for associativity.

In this section, we develop a framework to address these issues, with the objectives of 1) being able to compare associativity between different cache organizations, and 2) determining how various design aspects (e.g. ways, number of replacement candidates, etc.) influence cache associativity.

A. Associativity Distribution

We divide a cache into the following components:

Cache array: Holds tags and data, implements associative lookups by block address, and, on a replacement, gives a list of replacement candidates that can be evicted.

Replacement policy: Maintains a rank of which cache blocks to replace.

This model assumes very little about the underlying cache implementation. The array could be set-associative, a zcache, or any of the schemes mentioned in Section II. The only requirement that we impose on the replacement policy is to define a global ordering of blocks, which most policies inherently do. For example, in LRU blocks are ranked by the time of their last reference, in LFU they are ordered by access frequency, and in OPT [4] they are ranked by the time to their next reference. This does not mean that the implementation actually maintains this global rank. In a set-associative cache, LRU only needs to remember the order of elements in each set, and in a zcache this can be achieved with timestamps, as explained in Section III-E.

By convention, blocks with a higher preference to be evicted are given a higher rank r. In a cache with B blocks, $r \in [0, \ldots, B-1]$. To make the rest of the analysis independent of cache size, we define a block's eviction priority e to be its rank normalized to [0, 1], i.e. $e = r/(B-1)$.

Associativity distribution: We define the associativity distribution as the probability distribution of the eviction priorities of evicted blocks. In a fully-associative cache, we would always evict the block with e = 1. However, most cache designs examine only a small subset of the blocks in an eviction, so they select blocks with lower eviction priorities. In general, the more skewed the distribution is towards e = 1, the higher the associativity is.

The associativity distribution characterizes the quality of the replacement decisions made by the cache in a way that is independent of the replacement policy. Note that this decouples how the array performs from ill effects of the replacement policy. For example, a highly associative cache may always find replacement candidates with high eviction priorities, but if the replacement policy does a poor job in ranking the blocks, this may actually hurt performance.

B. Linking Associativity and Replacement Candidates

Defining associativity as a probability distribution lets us evaluate the quality of the replacement candidates, but is still dependent on workload and replacement policy. However, under certain general conditions this distribution can be characterized by a single number, the number of replacement candidates. This is the figure of merit optimized by zcaches.

Uniformity assumption: If the cache array always returns n replacement candidates, and we treat the eviction priorities of these blocks as random variables $E_i$, assuming that they are 1) uniformly distributed in [0, 1] and 2) statistically independent from each other, we can derive the associativity distribution. Since $E_1, \ldots, E_n \sim U[0,1]$, i.i.d., the cumulative distribution function (CDF) of each eviction priority is $F_{E_i}(x) = \mathrm{Prob}(E_i \le x) = x$, $x \in [0,1]$ (see footnote 1). The associativity is the random variable $A = \max\{E_1, \ldots, E_n\}$, and its CDF is:

$F_A(x) = \mathrm{Prob}(A \le x) = \mathrm{Prob}(E_1 \le x \wedge \ldots \wedge E_n \le x) = \mathrm{Prob}(E_i \le x)^n = x^n$, $x \in [0,1]$

Therefore, under this uniformity assumption, the associativity distribution only depends on n, the number of replacement candidates. Fig. 2 shows example CDFs of the associativity distribution, in linear and semi-log scales, with each line representing a different number of replacement candidates. The higher the number of replacement candidates, the more skewed towards 1.0 the associativity distribution becomes. Also, evictions of blocks with a low eviction priority quickly become very rare. For example, for 16 replacement candidates, the probability of evicting a block with e […]

Footnote 1: Note that we are treating the $E_i$ as continuous random variables, even though they are discrete (normalized ranks with one of B equally probable values in [0, 1]). We do this to achieve results that are independent of cache size B. Results are the same for the discretized version of these equations.

[Fig. 2. Associativity CDFs under the uniformity assumption ($F_A(x) = x^n$, $x \in [0,1]$) for n = 4, 8, 16, 64 candidates, in linear and logarithmic scales. Axes: eviction priority (horizontal), associativity CDF (vertical).]
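The uniformity-assumption result can be checked numerically. Under that assumption the evicted block has the highest priority among n independent, uniformly distributed candidates, so the probability that its priority is at most x is x raised to the power n. The sketch below compares this closed form against a small Monte Carlo experiment; the function names, trial count and seed are assumptions made for the example.

```python
import random


def associativity_cdf(x, n):
    """Closed form under the uniformity assumption: F_A(x) = x^n."""
    return x ** n


def empirical_cdf(x, n, trials=20000, seed=1):
    """Fraction of trials in which the best of n uniform priorities is <= x."""
    rng = random.Random(seed)
    hits = sum(max(rng.random() for _ in range(n)) <= x
               for _ in range(trials))
    return hits / trials


for n in (4, 8, 16, 64):
    print(n, associativity_cdf(0.8, n), empirical_cdf(0.8, n))
```

As n grows, the probability of evicting a block with priority below a given threshold shrinks geometrically, which is why a few tens of candidates already make poor evictions very unlikely.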
Unif.Assumption wupwisem apsim mgridm canneal uidanimate blackscholes 0:0 0:2 0:4 0:6 0:8 1:0Evictionpriority 0:2 0:4 0:6 0:8 1:0AssociativityCDF SetAssoc4-way 0:0 0:2 0:4 0:6 0:8 1:0Evictionpriority 0:2 0:4 0:6 0:8 1:0AssociativityCDF SetAssoc16-way (a)Set-associativecacheswithouthashing 0:0 0:2 0:4 0:6 0:8 1:0Evictionpriority 0:2 0:4 0:6 0:8 1:0AssociativityCDF SetAssoc4-wayw/hash 0:0 0:2 0:4 0:6 0:8 1:0Evictionpriority 0:2 0:4 0:6 0:8 1:0AssociativityCDF SetAssoc16-wayw/hash (b)Set-associativecacheswith 0:0 0:2 0:4 0:6 0:8 1:0Evictionpriority 0:2 0:4 0:6 0:8 1:0AssociativityCDF SkewAssoc4-way(Z4/4) 0:0 0:2 0:4 0:6 0:8 1:0Evictionpriority 0:2 0:4 0:6 0:8 1:0AssociativityCDF SkewAssoc16-way(Z16/16) (c)Skew-associativecaches 0:0 0:2 0:4 0:6 0:8 1:0Evictionpriority 0:2 0:4 0:6 0:8 1:0AssociativityCDF ZCache4way/16rc 0:0 0:2 0:4 0:6 0:8 1:0Evictionpriority 0:2 0:4 0:6 0:8 1:0AssociativityCDF ZCache4way/52rc (d)4-wayzcachesFig.3.AssociativitydistributionsforselectedPARSECandSPECOMPworkloadsusingdifferenttypesofcaches.Randomcandidatescache:Theuniformityassumptionmakesitsimpletocharacterizeassociativity,butitisnotmetingeneralbyrealcachedesigns.However,acachearraythatrandomlyselectedreplacementcandidates(withrep-etition)fromalltheblocksinthecachealwaysachievestheseassociativitycurves.Eachisuniformlydistributedbecauseitisanunbiasedrandomsamplingofoneofthepossiblevaluesofarank,andsincedifferentselectionsaredoneindependently,theareindependentaswell.Wesimulatedthiscachedesignwithtensofrealworkloads,underseveralcongurationsandreplacementpolicies,andobtainedassociativitydistributionsasshowninFig.2,experimentallyvalidatingthepreviousderivation.Althoughthisrandomcandidatescachedesignisunrealis-tic,itrevealsasufcientconditiontoachievetheuniformityassumption:themorerandomizedthereplacementcandidates,thebetteracachewillmatchtheuniformityassumption.C.AssociativityMeasurementsofRealCachesOuranalyticalframeworkimpliesthatthenumberofre-placementcandidatesisthekeymetricindeterminingasso-
ciativity.Wenowevaluatewhetherthisisthecaseusingrealcachedesigns.Set-associativecaches:Fig.3ashowstheassociativitydis-tributionsfor8MBL2set-associativecachesof4and16ways,usinganLRUreplacementpolicy.ThedetailsonsystemcongurationandmethodologycanbefoundinSectionV.Eachofthe6solidlinesrepresentsadifferentbenchmark, Cores 32cores,x86-64ISA,in-order,IPC=1exceptonmemoryaccesses,2GHz L1caches 32KB,4-waysetassociative,splitD/I,1-cyclelatency L2cache 8MBNUCA,8banks,1MBbank,shared,inclusive,MESIdirectorycoherence, 4-cycleaverageL1-to-L2-banklatency,6–11-cycleL2banklatency MCU 4memorycontrollers,200cycleszero-loadlatency,64GB/speakmemoryBW TABLEIAINCHARACTERISTICSOFTHESIMULATEDCMP.THELATENCIESASSUMEA32NMPROCESSAT2GHfromarepresentativeselectionofPARSECandSPECOMPapplications.Thesingledottedlinepergraphplotstheasso-ciativitydistributionundertheuniformityassumption,whichisindependentoftheworkload.Weseethatthedistributionsdiffersignicantlyfromtheuniformityassumption.Twowork-loads(wupwiseandapsi)dosignicantlyworse,withtheCDFrapidlyclimbingtowards1.0.Forexample,inwupwise,60%oftheevictionshappentoblockswith20%evictionpri-ority.Others(mgrid,cannealanduidanimate)havesensiblyworseassociativity,andonlyonebenchmark(blackscholes)outperformstheuniformityassumption.Thesedifferencesarenotsurprising:replacementcandidatesallcomefromthesamesmallset,thwartingindependence,andlocalityofreferencewillskewevictionprioritiestowardslowervalues,breakingtheassumptionofanuniformdistribution.Wecanimproveassociativitywithhashing.Fig.3bshowstheassociativitydistributionsofset-associativecachesindexedbyanhashoftheblockaddress.Associativitydistribu-tionsgenerallyimprove,butsomehot-spotsremain,andallworkloadsnowperformsensiblyworsethantheuniformityassumptioncase.Skew-associativecachesandzcaches:Fig.3cshowstheassociativitydistributionsof4and16-wayskew-associativecaches.Aswecansee,skew-associativecachescloselymatchtheuniformityassumptiononallworkloads.Theseresultsprovideananalyticalfoundationtothepreviousempiricalob-
servations that skew-associative caches “improve performance predictability” [7].

Fig. 3d shows the associativity of 4-way zcaches with 2 and 3 levels of replacement candidates. We also observe a close match to the uniformity assumption. This is expected, since replacement candidates are even more randomized: nth-level candidates depend on the addresses of the (n−1)th-level candidates, making the set of positions checked vary with cache contents.

In conclusion, both skew-associative caches and zcaches match the uniformity assumption in practice. Hence, their associativity is directly linked to the number of candidates examined on replacement. Although the graphs only show a small set of applications for clarity, results with other workloads and replacement policies are essentially identical. The small differences observed between applications decrease by either increasing the number of ways (and hash functions) or improving the quality of hash functions (the same experiments using more complex SHA-1 hash functions instead of the simpler ones yield distributions identical to the uniformity assumption).

Overall, our analysis framework reveals two main results:
1) In a zcache, associativity is determined by the number of replacement candidates, and not the number of ways, decoupling ways and associativity.
2) When using an equal number of replacement candidates, zcaches empirically show better associativity than set-associative caches for most applications.

V. EXPERIMENTAL METHODOLOGY

Infrastructure: We perform microarchitectural, execution-driven simulation using an x86-64 simulator based on Pin [31]. We use McPAT [30] to obtain comprehensive timing, area and energy estimations for the CMPs we model, and use CACTI 6.5 [33] for more detailed cache area, power and timing models. We use 32nm ITRS models, with a high-performance process for all the components of the chip except the L2 cache, which uses a low-leakage process. We model a 32-core CMP, with in-order x86 cores modeled after the Atom processor [17]. The system has a 2-level cache hierarchy, with a fully shared L2 cache. Table I shows the details of the system. On 32nm, this CMP requires about 220 mm² and has a TDP of around 90W at 2GHz, both reasonable budgets.

Workloads: We use a variety of multithreaded and multiprogrammed benchmarks: 6 PARSEC [5] applications (blackscholes, canneal, fluidanimate, freqmine, streamcluster and swaptions), 10 SPECOMP benchmarks (all except galgel, which gcc cannot compile) and 26 SPEC CPU2006 programs (all except dealII, tonto and wrf, which we could not compile). For multiprogrammed runs, we run different instances of the same single-threaded CPU2006 application on each core, plus 30 random CPU2006 workload combinations (choosing 32 workloads each time, with repetitions allowed). These make a total of 72 workloads. All applications are run with their reference (maximum size) input sets. For multithreaded workloads, we fast-forward into the parallel region and run the first 10 billion instructions. Since synchronization can skew IPC results for multithreaded workloads [2], we do not count instructions in synchronization routines (locks, barriers, etc.) to determine when to stop execution, but we do include them in energy calculations. For multiprogrammed workloads, we follow standard methodology from prior work [24]: we fast-forward 20 billion instructions for each process, simulate until all the threads have executed at least 256 million instructions, and only take the first 256 million instructions of each thread into account for IPC computations.
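As a rough illustration of how the multiprogrammed part of the workload set described above could be assembled, the following Python sketch builds the 26 homogeneous mixes and the 30 random mixes and checks that the totals add up to 72. The program names, the fixed seed and the helper names (`homogeneous_mixes`, `random_mixes`) are hypothetical placeholders, not taken from the paper; only the counts come from the text.

```python
import random

# Counts taken from the text; program names are hypothetical placeholders.
NUM_PARSEC = 6
NUM_SPECOMP = 10
CPU2006_PROGRAMS = [f"cpu2006_prog{i:02d}" for i in range(26)]
NUM_CORES = 32
NUM_RANDOM_MIXES = 30


def homogeneous_mixes(programs, cores):
    """One mix per program: the same program replicated on every core."""
    return [[p] * cores for p in programs]


def random_mixes(programs, cores, count, seed=0):
    """`count` mixes, each drawing `cores` programs with repetition allowed."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    return [rng.choices(programs, k=cores) for _ in range(count)]


multiprogrammed = (homogeneous_mixes(CPU2006_PROGRAMS, NUM_CORES)
                   + random_mixes(CPU2006_PROGRAMS, NUM_CORES, NUM_RANDOM_MIXES))
total_workloads = NUM_PARSEC + NUM_SPECOMP + len(multiprogrammed)
print(total_workloads)  # 6 + 10 + 26 + 30 = 72
```

Sampling with `random.choices` matches the "repetitions allowed" wording, since the same program may be drawn for several cores of one mix.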
                   Serial lookups                  Parallel lookups
Cache Type         Bank      Bank      Bank        Bank      Bank      Bank        L2          L2
                   latency   E/hit     E/miss      latency   E/hit     E/miss      area        leakage
SetAssoc 4-way     4.14 ns   0.61 nJ   1.26 nJ     2.91 ns   0.71 nJ   1.42 nJ     42.3 mm²    535 mW
SetAssoc 8-way     4.41 ns   0.75 nJ   1.57 nJ     3.18 ns   0.99 nJ   1.88 nJ     45.1 mm²    536 mW
SetAssoc 16-way    4.74 ns   0.88 nJ   1.87 nJ     3.51 ns   1.42 nJ   2.46 nJ     46.4 mm²    561 mW
SetAssoc 32-way    5.05 ns   1.23 nJ   2.66 nJ     3.82 ns   2.34 nJ   3.82 nJ     51.9 mm²    588 mW
ZCache 4/16        4.14 ns   0.62 nJ   2.28 nJ     2.91 ns   0.72 nJ   2.44 nJ     42.3 mm²    535 mW
ZCache 4/52        4.14 ns   0.62 nJ   3.47 nJ     2.91 ns   0.72 nJ   3.63 nJ     42.3 mm²    535 mW

TABLE II
AREA, POWER AND LATENCY OF 8MB, 8-BANKED L2 CACHES WITH DIFFERENT ORGANIZATIONS

[Figure 4: panels (a) OPT replacement and (b) LRU replacement; each panel plots L2 MPKI reduction (y-axis 1.0 to 1.4) and IPC improvement (y-axis 1.00 to 1.15) against Workload (x-axis 0 to 70). Series: SetAssoc 4-way, SetAssoc 16-way, SetAssoc 32-way, Z 4way/4rc, Z 4way/16rc, Z 4way/52rc.]
Fig. 4. L2 MPKI and IPC improvements for all workloads, over a 4-way set-associative with hashing baseline.

VI. EVALUATION OF ZCACHE AS A LAST-LEVEL CACHE

The zcache can be used with any design that requires high associativity at low overheads in terms of area, hit time, and hit energy. In this paper, we evaluate zcache as a last-level cache in a 32-node CMP. We defer other use cases, such as first-level caches or TLBs, to future work. We first quantify the area, energy and latency advantages of zcaches versus set-associative caches with similar associativity, then compare the performance and system-wide energy over our set of workloads.

A. Cache Costs

Table II shows the timing, area and power requirements of both set-associative caches and zcaches with varying associativities. We use CACTI's models to obtain these numbers. Tag and data arrays are designed separately by doing a full design space exploration and choosing the design that minimizes area × power. Arrays are sub-banked, and both the address and data ports are implemented using H-trees. We show results for both serial and parallel-lookup caches. In serial caches, tag and data arrays are accessed sequentially, saving energy at
the expense of delay. In parallel caches, both tag and data accesses are initiated in parallel. When the tag read resolves the appropriate way, it propagates a way-select signal to the data array, which selects and propagates the correct output. This parallelizes most of the tag and data accesses while avoiding an exceedingly wide data array port. For zcaches, we explore designs with two and three-level walks. We denote zcaches with “W/R”, indicating the number of ways and replacement candidates, respectively. For example, a 4/16 zcache has 4 ways and 16 replacement candidates per eviction (obtained from a two-level walk).

Table II shows that increasing the number of ways beyond 8 starts imposing significant area, latency and energy overheads. For example, a 32-way cache with serial lookups has 1.22× the area, 1.23× the hit latency and 2× the hit energy of a 4-way cache (for parallel lookups, hit latency is 1.32× and hit energy is 3.3×). This is logical, since a 32-way cache reads 4× more tag bits than data bits per lookup, the tag array has a much wider port, and the critical path is longer (slower tag array, more comparators). For zcaches, however, area, hit latency and hit energy grow with the number of ways, but not with the number of replacement candidates. This comes at the expense of increasing energy per miss, which, however, is still similar to set-associative caches with the same associativity. For example, a serial-lookup zcache 4/52 has almost twice the associativity of a 32-way set-associative cache at 1.3× the energy per miss, but retains the 2× lower hit energy and 1.23× lower access latency of a 4-way cache.

B. Performance

Fig. 4 shows the improvements in both L2 misses per thousand instructions (MPKI) and IPC for all workloads, using both OPT and LRU replacement policies. Each line represents the improvement of a different cache design over a baseline 4-way set-associative cache with hashing. Caches without hashing perform significantly worse (even at high associativities), so we do not consider them here. Serial-lookup caches are used in all cases. For each line, workloads (in the x-axis) are sorted according to the improvement achieved, so each line is monotonically increasing. Fractional improvements are given (e.g. an L2 MPKI reduction of 1.2 means 1.2× lower MPKI than the baseline).

[Figure 5: panels (a) OPT replacement and (b) LRU replacement; each panel plots IPC improvement (%) and BIPS/W improvement (%) for ammpm, 416.gamess, cpu2K6rand0, canneal, 436.cactusADM, gmean(72) and gmean(10). Series: SetAssoc 4-way, SetAssoc 16-way, SetAssoc 32-way, Z 4way/4rc, Z 4way/16rc and Z 4way/52rc, each in serial-lookup (S) and parallel-lookup (P) variants.]
Fig. 5. IPC and energy efficiency (BIPS/W) improvements for serial and parallel-lookup caches, over a serial-lookup 4-way set-associative with hashing baseline. Each graph shows improvements for 5 representative workloads, plus the geometric mean over both all 72 workloads and the 10 workloads with the highest L2 MPKI.

OPT: Fig. 4a shows the effects of using OPT replacement (i.e. evicting the candidate reused furthest). OPT simulations are run in trace-driven mode. Although OPT is unrealistic, it removes ill-effects from the replacement policy (where e.g. increasing associativity degrades performance), allowing us to decouple replacement policy issues from associativity effects. Note that these numbers do not necessarily show maximum improvements from increasing associativity, as other replacement policies may be more sensitive to associativity changes. In terms of misses, higher associativities always improve MPKI, and designs with the same associativity have practically the same improvements (e.g. 16-way set-associative vs Z4/16). However, for set-associative caches, these improvements in MPKI do not always translate to IPC, due to the additional access latency (1 extra cycle for 16-way, 2 cycles for 32-way). For example, the 32-way set-associative design performs worse than the 4-way design on 15 workloads (which have a large number of L1 misses, but few L2 misses), and performs worse than the 16-way design on half of the workloads (36). In contrast, zcaches do not suffer from increased access latency, sensibly improving IPC with associativity for all workloads (e.g. a Z4/52 improves IPC by up to 16% over the baseline). In caches with interference across sets, like skew-associative caches and zcaches, OPT is not actually optimal, but it is a good heuristic.

LRU: Fig. 4b compares cache designs when using LRU. Associativity improves MPKI for all but 3 workloads, and both MPKI and IPC improvements are significant (e.g. a Z4/52 reduces L2 misses by up to 2.1× and improves performance by up to 25% over a 4-way set-associative cache). With LRU, the difference between Z4/16 and Z4/52 designs is lower than with OPT; however, they significantly outperform both the baseline and the Z4/4 (skew-associative) design.

C. Serial vs Parallel-Lookup Caches

Fig. 5 shows the performance and energy efficiency when using serial and parallel-lookup caches, under both OPT and LRU replacement policies. Results are normalized to a serial-lookup, 4-way set-associative cache with hashing. Each graph shows improvements on five representative applications, as well as the geometric means of both all 72 workloads and the 10 workloads with the highest L2 MPKI.

We can distinguish three types of applications: a few benchmarks, like blackscholes or freqmine, have low L1 miss rates, and are insensitive to the L2's organization. Other applications, like ammp and gamess, have frequent L2 hits but infrequent L2 misses. These workloads are sensitive to hit latency, so parallel-lookup caches provide higher performance gains than increasing associativity (e.g. a 3% IPC improvement on gamess vs serial-lookup caches). In fact, increasing associativity in set-associative caches reduces performance due to higher hit latencies, while highly-associative zcaches do not degrade performance. Finally, workloads like cpu2K6rand0, canneal, and cactusADM have frequent L2 misses. These applications are often sensitive to associativity, and a highly-associative cache improves performance (by reducing L2 MPKI) more than reducing access time (e.g. in cactusADM with LRU,
going from Z4/4 to Z4/52 improves IPC by 9%, while going from serial to parallel-lookup improves IPC by 3%).

In terms of energy efficiency, set-associative caches and zcaches show different behaviors when increasing associativity. Because hit energy increases steeply with the number of ways in parallel-lookup caches, 16 and 32-way set-associative caches often achieve lower energy efficiency than serial-lookup caches (e.g. up to 8% lower BIPS/W in cactusADM). In contrast, serial and parallel-lookup zcaches achieve practically the same energy efficiency on most workloads, due to their similarly low access and miss energies. In conclusion, zcaches enable highly-associative, energy-efficient parallel-lookup caches.

Overall, zcaches offer both the best performance and energy efficiency. For example, under LRU, when considering all 72 workloads, a parallel-lookup zcache 4/52 improves IPC by 7% and BIPS/W by 3% over the 4-way baseline. Over the subset of the 10 most L2 miss-intensive workloads, a zcache 4/52 improves IPC by 18% and energy efficiency by 13% over the 4-way baseline, and obtains 7% higher performance and 10% better energy efficiency than a 32-way set-associative cache.

D. Array Bandwidth

Since zcaches perform multiple tag lookups on a miss, it is worth examining whether these additional lookups can saturate bandwidth. Of the 72 workloads, the maximum average load per bank is 15.2% (i.e. 0.152 core accesses/cycle/L2 bank). However, as L2 misses increase, average load decreases: at 0.005 misses/cycle/bank, average load is 0.035 accesses/cycle/bank, and total load on the tag array for a Z4/52 cache is 0.092 tag accesses/cycle/bank. In other words, as L2 misses increase, bandwidth pressure on the L2 decreases; the system is self-throttling. ZCaches use this spare tag bandwidth to improve associativity. Ultimately, even for high-MLP architectures, the load on the tag arrays is limited by main memory bandwidth, which is more than an order of magnitude smaller than the maximum L2 tag bandwidth and much harder to scale.

VII. RELATED WORK

The zcache is inspired by cuckoo hashing, a technique to build space-efficient hash tables proposed by Pagh and Rodler [35]. The original design uses two hash functions to index the hash table, so each lookup needs to check two locations. On an insertion, if both possible locations are occupied, the incoming item replaces one of them at random, and the replaced block is reinserted. This is repeated until an empty location is found or, if a limit on the number of retries is reached, elements are rehashed into a larger array. Though cuckoo hashing has been mostly studied as a technique for software hash tables, hardware variants have been proposed to implement lookup tables in IP routers [16]. For additional references, Mitzenmacher has a survey on recent research in cuckoo hashing [32].

Both high associativity and a good replacement policy are necessary to improve cache performance. The growing importance of cache performance has sparked research into alternative policies that outperform LRU [14, 23, 24, 44]. The increasing importance of on-chip wire delay has also motivated research in non-uniform cache architectures (NUCA) [27]. Some NUCA designs such as NuRAPID [15] use indirection to enhance the flexibility of NUCA placement and reduce access latency instead of increasing associativity.

VIII. CONCLUSIONS AND FUTURE WORK

We have presented the zcache, a cache design that enables high associativity with a small number of ways. The zcache uses a different hash function per way to enable an arbitrarily large number of replacement candidates on a miss. To evaluate the zcache's associativity, we have developed a novel analytical framework to characterize and compare associativity. We use this framework to show that, for zcaches, associativity is determined by the number of replacement candidates, not the number of ways, hence decoupling ways and associativity. An evaluation using zcaches as the last-level cache in a CMP shows that they provide high associativity with low overheads in terms of area, hit time, and hit energy. ZCaches outperform traditional set-associative caches in both performance and energy efficiency, with a 4-way zcache achieving both 18% higher performance and 13% higher performance/watt than a 4-way set-associative counterpart over a set of 10 L2 miss-intensive workloads, and 7% higher performance and 10% better energy efficiency than a 32-way set-associative cache.

There are several opportunities for further research, such as using zcaches to build highly associative first-level caches and TLBs for multithreaded cores. Additionally, replacement policies that are specifically suited to the zcache could be designed. Finally, since the zcache makes it trivial to increase or reduce associativity with the same hardware design, it would be interesting to explore adaptive replacement schemes that use the high associativity only when it improves performance, saving cache bandwidth and energy when high associativity is not needed, or even making associativity a software-controlled property.

ACKNOWLEDGEMENTS

We sincerely thank John Brunhaver, Christina Delimitrou, David Lo, George Michelogiannakis and the anonymous reviewers for their useful feedback on earlier versions of this manuscript. Daniel Sanchez was supported by a Hewlett-Packard Stanford School of Engineering Fellowship.

REFERENCES

[1] A. Agarwal and S. D. Pudar, “Column-associative caches: a technique for reducing the miss rate of direct-mapped caches,” in Proc. of the 20th annual Intl. Symp. on Computer Architecture, 1993.
[2] A. Alameldeen and D. Wood, “IPC considered harmful for multiprocessor workloads,” IEEE Micro, vol. 26, no. 4, 2006.
[3] A. Basu, N. Kirman, M. Kirman, M. Chaudhuri, and J. Martinez, “Scavenger: A new last level cache architecture with global block priority,” in Proc. of the 40th annual IEEE/ACM Intl. Symp. on Microarchitecture, 2007.
[4] L. A. Belady, “A study of replacement algorithms for a virtual-storage computer,” IBM Syst. J., vol. 5, no. 2, 1966.
[5] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: Characterization and architectural implications,” in Proc. of the 17th Intl. Conf. on Parallel Architectures and Compilation Techniques, 2008.
[6] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Commun. ACM, vol. 13, no. 7, 1970.
[7] F. Bodin and A. Seznec, “Skewed associativity enhances performance predictability,” in Proc. of the 22nd annual Intl. Symp. on Computer Architecture, 1995.
[8] A. Bracy, K. Doshi, and Q. Jacobson, “Disintermediated active communication,” Comput. Archit. Lett., vol. 5, no. 2, 2006.
[9] M. W. Brehob, “On the mathematics of caching,” Ph.D. dissertation, Michigan State University, 2003.
[10] B. Calder, D. Grunwald, and J. Emer, “Predictive sequential associative cache,” in Proc. of the 2nd IEEE Symp. on High-Performance Computer Architecture, 1996.
[11] J. L. Carter and M. N. Wegman, “Universal classes of hash functions (extended abstract),” in Proc. of the 9th annual ACM Symposium on Theory of Computing, 1977.
[12] L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas, “BulkSC: bulk enforcement of sequential consistency,” in Proc. of the 34th annual Intl. Symp. on Computer Architecture, 2007.
[13] L. Ceze, J. Tuck, J. Torrellas, and C. Cascaval, “Bulk disambiguation of speculative threads in multiprocessors,” in Proc. of the 33rd annual Intl. Symp. on Computer Architecture, 2006.
[14] M. Chaudhuri, “Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches,” in Proc. of the 42nd annual IEEE/ACM Intl. Symp. on Microarchitecture, 2009.
[15] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, “Distance associativity for high-performance energy-efficient non-uniform cache architectures,” in Proc. of the 36th annual IEEE/ACM Intl. Symp. on Microarchitecture, 2003.
[16] S. Demetriades, M. Hanna, S. Cho, and R. Melhem, “An efficient hardware-based multi-hash scheme for high speed IP lookup,” in Proc. of the 16th IEEE Symp. on High Performance Interconnects, 2008.
[17] G. Gerosa et al., “A sub-1W to 2W low-power IA processor for mobile internet devices and ultra-mobile PCs in 45nm hi-K metal gate CMOS,” in IEEE Intl. Solid-State Circuits Conf.
[18] E. G. Hallnor and S. K. Reinhardt, “A fully associative software-managed cache design,” in Proc. of the 27th annual Intl. Symp. on Computer Architecture, 2000.
[19] L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, “Transactional memory coherence and consistency,” in Proc. of the 31st annual Intl. Symp. on Computer Architecture, 2004.
[20] Hewlett-Packard, “Inside the Intel Itanium 2 processor,” Tech. Rep., 2002.
[21] M. D. Hill and A. J. Smith, “Evaluating associativity in CPU caches,” IEEE Trans. Comput., vol. 38, no. 12, 1989.
[22] J. Howard et al., “A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS,” in IEEE Intl. Solid-State Circuits Conf., 2010.
[23] A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely, Jr., and J. Emer, “Adaptive insertion policies for managing shared caches,” in Proc. of the 17th Intl. Conf. on Parallel Architectures and Compilation Techniques, 2008.
[24] A. Jaleel, K. Theobald, S. C. S. Jr, and J. Emer, “High performance cache replacement using re-reference interval prediction (RRIP),” in Proc. of the 37th annual Intl. Symp. on Computer Architecture, 2010.
[25] N. P. Jouppi, “Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers,” in Proc. of the 17th annual Intl. Symp. on Computer Architecture, 1990.
[26] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee, “Using prime numbers for cache indexing to eliminate conflict misses,” in Proc. of the 10th Intl. Symp. on High Performance Computer Architecture, 2004.
[27] C. Kim, D. Burger, and S. W. Keckler, “An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches,” in Proc. of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2002.
[28] D. Kroft, “Lockup-free instruction fetch/prefetch cache organization,” in Proc. of the 8th annual Intl. Symp. on Computer Architecture, 1981.
[29] N. Kurd et al., “Westmere: A family of 32nm IA processors,” in IEEE Intl. Solid-State Circuits Conf., 2010.
[30] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures,” in Proc. of the 42nd annual IEEE/ACM Intl. Symp. on Microarchitecture, 2009.
[31] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: building customized program analysis tools with dynamic instrumentation,” in Proc. of the ACM SIGPLAN Conf. on Programming Language Design and Implementation, 2005.
[32] M. Mitzenmacher, “Some open questions related to cuckoo hashing,” in Proc. of the 17th annual European Symp. on, 2009.
[33] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,” in Proc. of the 40th annual IEEE/ACM Intl. Symp. on Microarchitecture, 2007.
[34] V. Nagarajan and R. Gupta, “ECMon: exposing cache events for monitoring,” in Proc. of the 36th annual Intl. Symp. on Computer Architecture, 2009.
[35] R. Pagh and F. F. Rodler, “Cuckoo hashing,” in Proc. of the 9th annual European Symp. on Algorithms, 2001.
[36] M. K. Qureshi, D. Thompson, and Y. N. Patt, “The v-way cache: Demand based associativity via global replacement,” in Proc. of the 32nd annual Intl. Symp. on Computer Architecture, 2005.
[37] D. Rolan, B. B. Fraguela, and R. Doallo, “Adaptive line placement with the set balancing cache,” in Proc. of the 42nd annual IEEE/ACM Intl. Symp. on Microarchitecture, 2009.
[38] D. Sanchez, L. Yen, M. D. Hill, and K. Sankaralingam, “Implementing signatures for transactional memory,” in Proc. of the 40th annual IEEE/ACM Intl. Symp. on Microarchitecture, 2007.
[39] A. Seznec, “A case for two-way skewed-associative caches,” in Proc. of the 20th annual Intl. Symp. on Computer Architecture.
[40] J. Shin et al., “A 40nm 16-core 128-thread CMT SPARC SoC processor,” in IEEE Intl. Solid-State Circuits Conf., 2010.
[41] Sun Microsystems, “UltraSPARC T2 supplement to the UltraSPARC architecture 2007,” Tech. Rep., 2007.
[42] J. Torrellas, L. Ceze, J. Tuck, C. Cascaval, P. Montesinos, W. Ahn, and M. Prvulovic, “The bulk multicore architecture for improved programmability,” Commun. ACM, vol. 52, no. 12.
[43] D. Wendel et al., “The implementation of POWER7: A highly parallel and scalable multi-core high-end server processor,” in IEEE Intl. Solid-State Circuits Conf., 2010.
[44] Y. Xie and G. H. Loh, “PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches,” in Proc. of the 36th annual Intl. Symp. on Computer Architecture, 2009.
[45] C. Zhang, X. Zhang, and Y. Yan, “Two fast and high-associativity cache schemes,” IEEE Micro, vol. 17, no. 5, 1997.
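
Supplementary sketch. The cuckoo hashing insertion procedure summarized in Section VII (two candidate locations per key, random eviction when both are occupied, reinsertion of the evicted item, and rehashing into a larger array once a retry limit is hit) can be illustrated with the minimal Python sketch below. It is a software illustration under stated assumptions, not the zcache hardware design and not the authors' code: the class name `CuckooTable`, the `max_kicks` parameter, and the use of Python's built-in `hash` with random salts as a stand-in for two independent hash functions are all hypothetical choices.

```python
import random


class CuckooTable:
    """Two-choice cuckoo hash set (illustrative sketch, not the zcache)."""

    def __init__(self, capacity=8, max_kicks=32, seed=0):
        self.capacity = capacity
        self.max_kicks = max_kicks  # retry limit before rehashing (assumed value)
        self.rng = random.Random(seed)
        self.slots = [None] * capacity
        self._new_salts()

    def _new_salts(self):
        # Salted built-in hash stands in for two independent hash functions.
        self.salts = (self.rng.getrandbits(32), self.rng.getrandbits(32))

    def _positions(self, key):
        return [hash((salt, key)) % self.capacity for salt in self.salts]

    def contains(self, key):
        # A lookup checks exactly two locations.
        return any(self.slots[p] == key for p in self._positions(key))

    def insert(self, key):
        if self.contains(key):
            return
        for _ in range(self.max_kicks):
            positions = self._positions(key)
            for p in positions:
                if self.slots[p] is None:
                    self.slots[p] = key
                    return
            # Both locations occupied: evict one at random, then reinsert it.
            victim_pos = self.rng.choice(positions)
            key, self.slots[victim_pos] = self.slots[victim_pos], key
        # Retry limit reached: rehash everything into a larger array.
        self._grow(pending=key)

    def _grow(self, pending):
        old = [k for k in self.slots if k is not None] + [pending]
        self.capacity *= 2
        self.slots = [None] * self.capacity
        self._new_salts()
        for k in old:
            self.insert(k)


table = CuckooTable()
for k in range(100):
    table.insert(k)
```

In a cache, by contrast, rehashing is not an option, so a design along the lines described in the paper bounds the walk and evicts one of the candidates it has found instead of growing the array.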