Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor

André Seznec, IRISA/INRIA, Campus de Beaulieu, 35042 Rennes, France, seznec@irisa.fr
Stephen Felix, Intel, 334 South Street, Shrewsbury, MA 01545, USA, Stephen.Felix@intel.com
Venkata Krishnan, StarGen, Inc., 225 Cedar Hill Street, Marlborough, MA 01752, USA, krishnan@stargen.com
Yiannakis Sazeides, Dept of Computer Science, University of Cyprus, CY-1678 Nicosia, Cyprus, yanos@cs.ucy.ac.cy

This work was done while the authors were with Compaq during 1999

Abstract

This paper presents the Alpha EV8 conditional branch predictor. The Alpha EV8 microprocessor project, canceled in June 2001 in a late [...] faced during the definition of the predictor, and on various trade-offs performed that led to the final design. In particular, we elucidate the following: (a) use of a global history branch prediction scheme, (b) choice of the prediction scheme derived from the hybrid skewed branch predictor [...] harnessed to fit the EV8 predictor in a 352 Kbits memory budget. Section 5 presents and justifies the history and path information used to index the branch predictor. On the Alpha EV8, the branch predictor tables must support two independent reads of 8 predictions per cycle. Section 6 presents the scheme used to guarantee two conflict-free accesses per cycle on a bank-interleaved predictor. Section 7 presents the hardware constraints for composing index functions for the prediction tables and describes the functions that were eventually used. Section 8 presents a step by step performance evaluation of the EV8 branch predictor as constraints are added and turn-around solutions are adopted. Finally, we provide concluding remarks in Section 9.

2 Alpha EV8 front-end pipeline

To sustain high performance, the Alpha EV8 fetches up to two 8-instruction blocks per cycle from the instruction cache. An instruction fetch block consists of all consecutive valid instructions fetched from the I-cache: an instruction fetch block ends either at the end of an aligned 8-instruction block or on a taken control flow instruction. Not taken conditional branches do not end a fetch block, thus up to 16 conditional branches may be fetched and predicted in every cycle.

On every cycle, the addresses of the next two fetch blocks must be generated. Since this must be achieved in a single cycle, it can only involve very fast hardware. On the Alpha EV8, a line predictor [1] is used for this purpose. The line predictor consists of three tables indexed with the address of the most recent fetch block and a very limited hashing logic. A consequence of simple indexing logic is relatively low line prediction accuracy.

To avoid huge performance loss, due to fairly poor line predictor accuracy and long branch resolution latency (on the EV8 pipeline, the outcome of a branch is known at the earliest in cycle 14 and more often around cycle 20 or 25), the line predictor is backed up with a powerful program counter (PC) address generator. This includes a conditional branch predictor, a jump predictor, a return address stack predictor, conditional branch target address computation (from instructions flowing out of the instruction cache) and final address selection. PC-address-generation is pipelined in two cycles as illustrated in Fig. 1: up to four dynamically successive fetch blocks A, B, C and D are simultaneously in flight in the PC-address-generator. In case of a mismatch between line prediction and PC-address-generation, the instruction fetch is resumed with the PC-address-generation result.

Figure 1. PC address generation pipeline (cycles 1 to 3, phases 0 and 1: line prediction, prediction tables read and PC address generation for blocks Y and Z, A and B, C and D)

3 Global vs Local history

The previous generation Alpha microprocessor [7] incorporated a hybrid predictor using both global and local branch history information. On Alpha EV8, up to 16 branch outcomes (8 for each fetch block) have to be predicted per cycle. Implementing a hybrid branch predictor for EV8 based on local history or including a component using local history would have been a challenge.

Local branch prediction requires for each prediction a read of the local history table and then a read of the prediction table. Performing the 16 local history reads in parallel requires a dual-ported history table. One port for each fetch block is sufficient since one can read in parallel the histories for sequential instructions on sequential table entries. But performing the 16 prediction table reads would require a 16-ported prediction table.

Whenever an occurrence of a branch is in flight, the speculative history associated with the younger in-flight occurrence of the branch should be used [8]. Maintaining and using speculative local history is already quite complex on a processor fetching and predicting a single branch per cycle [20]. On Alpha EV8, the number of in-flight branches is possibly equal to the maximum number of in-flight instructions (that is more than 256). Moreover, in EV8 when indexing the branch predictor there are up to three fetch blocks for which the (speculative) branch outcomes have not been determined (see Fig. 1). These three blocks may contain up to three previous occurrences of every branch in the fetch block. In contrast, a single speculative global history (per thread) is simpler to build and, as shown in Section 8, the accuracy of the EV8 global history prediction scheme is virtually insensitive to the effects of three fetch blocks old global history.

Finally, the Alpha EV8 is a simultaneous multithreaded processor [25, 26]. When independent threads are running, they compete for predictor table entries. Such interference on a local history based scheme can be disastrous, because it pollutes both the local history and prediction tables. What is more, when several parallel threads are spawned by a single application, the pollution is exacerbated unless the local history table is indexed using PC and thread number. In comparison, for global history schemes a global history register must be maintained per thread, and parallel threads from the same application benefit from constructive aliasing [10].
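To make the last point of Section 3 concrete, here is a minimal Python sketch of one speculative global history register per hardware thread. It is an illustration only, not the EV8 circuit: the class name `GlobalHistory`, the checkpoint-and-repair interface and the choice of four threads are assumptions introduced here, and the 21-bit length is simply borrowed from the history length listed for table G1 in Table 1.

```python
class GlobalHistory:
    """Speculative global history register for one hardware thread (hypothetical helper)."""

    def __init__(self, length):
        self.length = length
        self.mask = (1 << length) - 1
        self.bits = 0

    def push(self, taken):
        """Shift one predicted outcome in and return the previous value as a checkpoint."""
        checkpoint = self.bits
        self.bits = ((self.bits << 1) | int(taken)) & self.mask
        return checkpoint

    def repair(self, checkpoint, actual_taken):
        """After a misprediction, restore the checkpoint and insert the real outcome."""
        self.bits = ((checkpoint << 1) | int(actual_taken)) & self.mask


# One register per thread: threads never pollute each other's history.
# The thread count of 4 is an assumption for this sketch.
histories = {thread_id: GlobalHistory(21) for thread_id in range(4)}
```

The whole speculative state of a thread is one short bit vector, so repairing it after a misprediction amounts to restoring a single saved value. A local history scheme would instead have to track speculative state for every in-flight occurrence of every branch.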
4 The branch prediction scheme

Global branch history branch predictor tables lead to a phenomenon known as aliasing or interference [28, 24], in which multiple branch information vectors share the same entry in the predictor table, causing the predictions for two or more branch substreams to intermingle. "De-aliased" global history branch predictors have been recently introduced: the enhanced skewed branch predictor e-gskew [15], the agree predictor [22], the bimode predictor [13] and the YAGS predictor [4]. These predictors have been shown to achieve higher prediction accuracy at equivalent hardware complexity than larger "aliased" global history branch predictors such as gshare [14] or GAs [27]. However, hybrid predictors combining a global history predictor and a typical bimodal predictor only indexed with the PC [21] may deliver higher prediction accuracy than a conventional single branch predictor [14]. Therefore, "de-aliased" branch predictors should be included in hybrid predictors to build efficient branch predictors.

The EV8 branch predictor is derived from the hybrid skewed branch predictor 2Bc-gskew presented in [19]. In this section, the structure of the hybrid skewed branch predictor is first recalled. Then we outline the update policy used on the EV8 branch predictor. The three degrees of freedom available in the design space of the 2Bc-gskew predictor are described: different history lengths for the predictor components, size of the different predictor components and using smaller hysteresis tables than prediction tables. These degrees of freedom were leveraged to design the "best" possible branch predictor fitting in the EV8 hardware budget constraints.

4.1 General structure of the hybrid skewed predictor 2Bc-gskew

The enhanced skewed branch predictor e-gskew is a very efficient single component branch predictor [15, 13] and therefore a natural candidate as a component for a hybrid predictor. The hybrid predictor 2Bc-gskew illustrated in Fig. 2 combines e-gskew and a bimodal predictor. 2Bc-gskew consists of four 2-bit counters banks. Bank BIM is the bimodal predictor, but is also part of the e-gskew predictor. Banks G0 and G1 are the two other banks of the e-gskew predictor. Bank Meta is the meta-predictor. Depending on Meta, the prediction is either the prediction coming out from BIM or the majority vote on the predictions coming out from G0, G1 and BIM.

Figure 2. The 2Bc-gskew predictor (BIM indexed with the address; G0, G1 and Meta indexed with address and history; a majority vote over BIM, G0 and G1 gives the e-gskew prediction; Meta selects between the bimodal prediction and the e-gskew prediction)

4.2 Partial update policy

In a multiple table branch predictor, the update policy can have a bearing on the prediction accuracy [15]. Partial update policy was shown to result in higher prediction accuracy than total update policy for e-gskew.

Applying partial update policy on 2Bc-gskew also results in better prediction accuracy. The bimodal component accurately predicts strongly biased static branches. Therefore, once the metapredictor has recognized this situation, the other tables are not updated and do not suffer from aliasing associated with easy-to-predict branches.

The partial update policy implemented on the Alpha EV8 consists of the following:

on a correct prediction:
  when all predictors were agreeing do not update (see Rationale 1)
  otherwise: strengthen Meta if the two predictions were different, and strengthen the correct prediction on all participating tables G0, G1 and BIM as follows:
    - strengthen BIM if the bimodal prediction was used
    - strengthen all the banks that gave the correct prediction if the majority vote was used

on a misprediction:
  when the two predictions were different, first update the chooser (see Rationale 2), then recompute the overall prediction according to the new value of the chooser:
    - correct prediction: strengthens all participating tables
    - misprediction: update all banks

Rationale 1. The goal is to limit the number of strengthened counters on a correct prediction. When a counter is strengthened, it is harder for another (address, history) pair to "steal" it. But, when the three predictors BIM, G0 and G1 are agreeing, one counter entry can be stolen by another (address, history) pair without destroying the majority prediction. By not strengthening the counters when the three predictors agree, such a stealing is made easier.

Rationale 2. The goal is to limit the number of counters written on a wrong prediction: there is no need to steal a table entry from another (address, history) pair when it can be avoided.

4.3 Using distinct prediction and hysteresis arrays

Partial update leads to better prediction accuracy than total update policy due to better space utilization.
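The prediction rule of Section 4.1 and the partial update policy of Section 4.2 can be sketched in Python as follows. This is a reading of the policy under stated assumptions, not the EV8 implementation. The names `TwoBcGskew`, `Index`, `taken` and `toward` are hypothetical. The Meta encoding (a counter value of 2 or 3 selects the majority vote) is an assumption. A misprediction on which the bimodal and majority predictions agree is treated as "update all banks". Table indices are passed in by the caller, so no index function is implied.

```python
from collections import namedtuple

# One index per logical table; how indices are hashed is outside this sketch.
Index = namedtuple("Index", "bim g0 g1 meta")


def taken(counter):
    """Direction predicted by a 2-bit counter holding a value in 0..3."""
    return counter >= 2


def toward(counter, direction):
    """Move a 2-bit saturating counter one step toward `direction`."""
    return min(counter + 1, 3) if direction else max(counter - 1, 0)


class TwoBcGskew:
    def __init__(self, entries):
        # Every counter starts weakly not taken (value 1).
        self.t = {name: [1] * entries for name in Index._fields}

    def lookup(self, idx):
        """Return (votes, majority, use_majority, prediction)."""
        votes = {n: taken(self.t[n][getattr(idx, n)]) for n in ("bim", "g0", "g1")}
        majority = sum(votes.values()) >= 2
        # Assumed encoding: Meta >= 2 selects the majority vote, otherwise BIM.
        use_majority = taken(self.t["meta"][idx.meta])
        prediction = majority if use_majority else votes["bim"]
        return votes, majority, use_majority, prediction

    def _bump(self, name, idx, direction):
        i = getattr(idx, name)
        self.t[name][i] = toward(self.t[name][i], direction)

    def _strengthen_participants(self, idx, votes, use_majority, outcome):
        if use_majority:
            # Majority vote used: strengthen only the banks that were right.
            for name, vote in votes.items():
                if vote == outcome:
                    self._bump(name, idx, outcome)
        else:
            # Bimodal prediction used: strengthen BIM only.
            self._bump("bim", idx, outcome)

    def update(self, idx, outcome):
        votes, majority, use_majority, prediction = self.lookup(idx)
        differ = votes["bim"] != majority
        if prediction == outcome:
            if votes["bim"] == votes["g0"] == votes["g1"]:
                return  # Rationale 1: all three agree, write nothing.
            if differ:
                self._bump("meta", idx, use_majority)  # strengthen the chooser
            self._strengthen_participants(idx, votes, use_majority, outcome)
            return
        if differ:
            # Rationale 2: update the chooser first, then recompute.
            self._bump("meta", idx, majority == outcome)
            use_majority = taken(self.t["meta"][idx.meta])
            if (majority if use_majority else votes["bim"]) == outcome:
                self._strengthen_participants(idx, votes, use_majority, outcome)
                return
        # Still wrong: move BIM, G0 and G1 toward the actual outcome.
        for name in ("bim", "g0", "g1"):
            self._bump(name, idx, outcome)
```

In this sketch a branch on which BIM, G0 and G1 already agree costs no table write at all. That is the property Rationale 1 relies on to leave counters easy to steal for other (address, history) pairs.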
Partial update also allows a simpler hardware implementation of a hybrid predictor with 2-bit counters.

When using the partial update described earlier, on a correct prediction, the prediction bit is left unchanged (and not written), while the hysteresis bit is strengthened on participating components (and need not be read). Therefore, a correct prediction requires only one read of the prediction array (at fetch time) and (at most) one write of the hysteresis array (at commit time). A misprediction leads to a read of the hysteresis array followed by possible updates of the prediction and hysteresis arrays.

4.4 Sharing a hysteresis bit between several counters

Using partial update naturally leads to a physical implementation of the branch predictor as two different memory arrays, a prediction array and a hysteresis array.

For the Alpha EV8, silicon area and chip layout constraints allowed less space for the hysteresis memory array than for the prediction memory array. Instead of reducing the size of the prediction array, it was decided to use half size hysteresis tables for components G0 and Meta (see Table 1). As a result, two prediction entries share a single hysteresis entry: the prediction table and the hysteresis table are indexed using the same index function, except the most significant bit.

Consequently, the hysteresis table suffers from more aliasing than the prediction table. For instance, the following scenario may occur. Prediction entries A and B share the same hysteresis entry. Both (address, history) pairs associated with the entries are strongly biased, but B remains always wrong due to continuous resetting of the hysteresis bit by the (address, history) pair associated with A. While such a scenario certainly occurs, it is very rare: any two consecutive accesses to B without intermediate access to A will allow B to reach the correct state. Moreover, the partial update policy implemented on the EV8 branch predictor limits the number of writes on the hysteresis tables and therefore decreases the impact of aliasing on the hysteresis tables.

4.5 History lengths

Previous studies of the skewed branch predictor [15] and the hybrid skewed branch predictor [19] assumed that tables G0 and G1 were indexed using different hashing functions on the (address, history) pair but with the same history length used for all the tables. Using different history lengths for the two tables allows slightly better behavior. Moreover, as pointed out by Juan et al. [12], the optimal history length for a predictor varies depending on the application. This phenomenon is less important on a hybrid predictor featuring a bimodal table as a component. Its significance is further reduced on 2Bc-gskew if two different history lengths are used for tables G0 and G1. A medium history length can be used for G0 while a longer history length is used for G1.

Table 1. Characteristics of Alpha EV8 branch predictor

                    BIM    G0     G1     Meta
prediction table    16K    64K    64K    64K
hysteresis table    16K    32K    64K    32K
history length      4      13     21     15

4.6 Different prediction table sizes

In most academic studies of multiple table predictors [15, 13, 14, 19], the sizes of the predictor tables are considered equal. This is convenient for comparing different prediction schemes. However, for the design of a real predictor in hardware, the overall design space has to be explored. Equal table sizes in the 2Bc-gskew branch predictor is a good trade-off for small size predictors (for instance 4*4K entries). However, for very large branch predictors (i.e. 4*64K entries), the bimodal table BIM is used very sparsely since each branch instruction maps onto a single entry.

Consequently, the large branch predictor used in EV8 implements a BIM table smaller than the other three components.

4.7 The EV8 branch predictor configuration

The Alpha EV8 implements a very large 2Bc-gskew predictor. It features a total of 352 Kbits of memory, consisting of 208 Kbits for prediction and 144 Kbits for hysteresis. Design space exploration led to the table sizes indexed with different history lengths as listed in Table 1. It may be remarked that the table BIM (originally the bimodal table) is indexed using a 4-bit history length. This will be justified when implementation constraints are discussed in Section 7.

5 Path and branch outcome information

The accuracy of a branch predictor depends both on the prediction scheme and predictor table sizes as well as on the information vector used to index it. This section describes how pipeline constraints lead to the effective information vector used for indexing the Alpha EV8 branch predictor. This information vector combines the PC address, a compressed form of the three fetch blocks old branch and path history and path information from the three last blocks.

5.1 Three fetch blocks old block compressed history

Three fetch blocks old history. Information used to read the predictor tables must be available at indexing time. On the Alpha EV8, the branch predictor has a latency of two cycles and two blocks are fetched every cycle. Fig. 1 shows that the branch history information used to predict a branch outcome in block D cannot include any (speculative) branch outcome from conditional branches in block D itself, and also from blocks C, B and A. Thus the EV8 branch predictor can only be indexed using a three fetch blocks old branch history (i.e. updated with history information from Z) for predicting branches in block D.

Block compressed history lghist. When a single branch is predicted per cycle, at most one history bit has to be shifted in the global history register on every cycle. When up to 16 branches are predicted per cycle, up to 16 history bits have to be shifted in the history on every cycle. Such an update requires complex circuitry. On the Alpha EV8, this complex history-register update would have stressed critical paths to the extent that even older history would have had to be used (five or even seven blocks old).

Instead, just a single history bit is inserted per fetch block [5]. The inserted bit combines the last branch outcome with path information. It is computed as follows: whenever at least one conditional branch is present in the fetch block, the outcome of the last conditional branch in the fetch block (1 for taken, 0 for not-taken) is exclusive-ORed with bit 4 in the PC address of this last branch. The rationale for exclusive-ORing the branch outcome with a PC bit is to get a more uniform distribution of history patterns for an application. Highly optimized codes tend to exhibit fewer taken branches than not-taken branches. Therefore, the distribution of "pure" branch history outcomes in those applications is non-uniform.

While using a single history bit was originally thought of as a compromising design trade-off (since it is possible to compress up to 8 history bits into 1), Section 8 shows that it does not have a significant effect on the accuracy of the branch predictor.

Notation. The block compressed history defined above will be referred to as lghist.

5.2 Path information from the three last fetch blocks

Due to EV8 pipeline constraints (Section 2), three fetch-blocks old lghist is used for
the predictor. Although no branch history information from these three blocks can be used, their addresses are available for indexing the branch predictor. The addresses of the three previous fetch blocks are used in the index functions of the predictor tables.

5.3 Using very long history

The Alpha EV8 features a very large branch predictor compared to those implemented in previous generation microprocessors. Most academic studies on global history branch predictors have assumed that the length of the global history is smaller than or equal to log2 of the number of entries of the branch predictor table. For the size of predictor used in Alpha EV8, this is far from optimal even when using lghist. For example, when considering "not compressed" branch history for a 4*64K 2-bit entries 2Bc-gskew predictor, using equal history length for G0, G1 and Meta, history length 24 was found to be a good design point. When considering different history lengths, using 17 for G0, 20 for Meta and 27 for G1 was found to be a good trade-off.

For the same predictor configuration with three fetch blocks old lghist, slightly shorter length was found to be the best performing. However, the optimal history length is still longer than log2 of the size of the branch predictor table: for the EV8 branch predictor 21 bits of lghist history are used to index table G1 with 64K entries.

In Section 8, we show empirically that for large predictors, branch history longer than log2 of the predictor table size is almost always beneficial.

6 Conflict free bank interleaved branch predictor

Up to 16 branch predictions from two fetch blocks must be computed in parallel on the Alpha EV8. Normally, since the addresses of the two fetch blocks are independent, each of the branch predictor tables would have had to support two independent reads per cycle. Therefore the predictor tables would have had to be multi-ported, dual-pumped or bank-interleaved. This section presents a scheme that allowed the implementation of the EV8 branch predictor as 4-way bank interleaved using only single-ported memory cells. Bank conflicts are avoided by construction: the predictions associated with two dynamically successive fetch blocks are assured to lie in two distinct banks in the predictors.

6.1 Parallel access to predictions associated with a single block

Parallel access to all the predictions associated with a single fetch block is straightforward. The prediction tables in the Alpha EV8 branch predictor are indexed based on a hashing function of address, three fetch blocks old lghist branch and path history, and the three last fetch block addresses. For all the elements of a single fetch block, the same vector of information (except bits 2, 3 and 4 of the PC address) is used. Therefore, the indexing functions used guarantee that eight predictions lie in a single 8-bit word in the tables.

6.2 Guaranteeing two successive non-conflicting accesses

The Alpha EV8 branch predictor must be capable of delivering predictions associated with two fetch blocks per clock cycle. This typically means the branch predictor must be multi-ported, dual-pumped or bank interleaved.

On the Alpha EV8 branch predictor, this difficulty is circumvented through a bank number computation. The bank number computation described below guarantees by construction that any two dynamically successive fetch blocks will generate accesses to two distinct predictor banks. Therefore, bank conflicts never occur. Moreover, the bank number is computed on the same cycle as the address of the fetch block is generated by the line predictor, thus no extra delay is added to access the branch predictor tables (Fig. 3).

Figure 3. Flow of the branch predictor tables read access (cycles 0 to 2, phases 0 and 1: Y and Z, then A and B, flow out from the line predictor; bank number computation for A and B; bank selection; wordline selection; column selection; unshuffle; final PC selection; PC address generation completed)

The implementation of the bank number computation is defined below: let B_A be the bank number for instruction fetch block A, let Y, Z be the addresses of the two previous access slots, let B_Z be the number of the bank accessed by instruction fetch block Z, let (y52, y51, .., y6, y5, y4, y3, y2, 0, 0) be the binary representation of address Y, then B_A is computed as follows:

  if ((y6, y5) == B_Z) then B_A = (y6, y5 ⊕ 1) else B_A = (y6, y5)

This computation guarantees the prediction for a fetch block will be read from a different bank than that of the previous fetch block. The only information bits needed to compute the bank numbers for the two next fetch blocks A and B are bits (y6, y5), (z6, z5) and B_Z: that is two-block ahead [18] bank number computation. These information bits are available one cycle before the effective access on the branch predictor is performed and the required computations are very simple. Therefore, no delay is introduced on the branch predictor by the bank number computation. In fact, bank selection can be performed at the end of Phase 1 of the cycle preceding the read of the branch predictor tables.

7 Indexing the branch predictor

As previously mentioned, the Alpha EV8 branch predictor is 4-way interleaved and the prediction and hysteresis tables are separate. Since the logical organization of the predictor contains the four 2Bc-gskew components, this should translate to an implementation with 32 memory tables. However, the Alpha EV8 branch predictor only implements eight memory arrays: for each of the four banks there is an array for prediction and an array for hysteresis. Each wordline in the arrays is made up of the four logical predictor components.

This section presents the physical implementation of the branch predictor arrays and the constraints they impose on the composition of the indexing functions. The section also includes a detailed definition of the hashing functions that were selected for indexing the different logical components in the Alpha EV8 branch predictor.

7.1 Physical implementation and constraints

Each of the four banks in the Alpha EV8 branch predictor is implemented as two physical memory arrays: the prediction memory array and the hysteresis memory array. Each wordline in the arrays is made up of the four logical predictor components.

Each bank features 64 wordlines. Each wordline contains 32 8-bit prediction words from G0, G1 and Meta, and 8 8-bit prediction words from BIM. A single 8-bit prediction word is selected from the wordline from each predictor table G0, G1, Meta and BIM. A prediction read spans over 3 half cycle phases (5 phases including bank number computation and bank selection). This is illustrated in Fig. 3 and 4. A detailed description is given below.

1. Wordline selection: one of the 64 wordlines of the accessed bank is selected. The four predictor components share the 6 address bits needed for wordline selection. Furthermore, these 6 address bits cannot be hashed since the wordline decode and array access constitute a critical path for reading the branch prediction array, and consequently the inputs to the decoder must be available at the very beginning of the cycle.

(Figure 4 stages, in time order: bank selection, 1 out of 4; wordline selection, 1 out of 64; column selection, 1 out of 8 for BIM and 1 out of 32 for G0, G1 and Meta; unshuffle, permutation 8 to 8; 8 predictions per table for G0, G1, BIM and Meta.)
Figure 4. Reading the branch prediction tables

2. Column selection: each wordline consists of multiple 8-bit prediction entries of the four logical predictor tables. One 8-bit prediction word is selected for each of the logical predictor tables. As only one cycle phase is available to compute the index in the column, only a single 2-entry XOR gate is allowed to compute each of these column bits.

3. Unshuffle: 8-bit prediction words are systematically read. This word is further rearranged through a XOR permutation (that is, the bit at position i is moved to position i ⊕ f). This final permutation ensures a larger dispersion of the predictions over the array (only entries corresponding to a branch instruction are finally useful). It also makes it possible to discriminate between longer histories for the same branch PC, since the computation of the parameter f for the XOR permutation can span over a complete cycle: each bit of f can be computed by a large tree of XOR gates.

Notations. The three fetch-blocks old lghist history will be noted H = (h20, .., h0). A = (a52, .., a2, 0, 0) is the address of the fetch block. Z and Y are the two previous fetch blocks. I = (i15, .., i0) is the index function of a table, (i1, i0) being the bank number, (i4, i3, i2) being the offset in the word, (i10, i9, i8, i7, i6, i5) being the line number, and the highest order bits being the column number.

7.2 General philosophy for the design of indexing functions

When defining the indexing functions, we tried to apply two general principles while respecting the hardware implementation constraints. First, we tried to limit aliasing as much as possible on each individual table by picking individual indexing functions that would spread the accesses over the predictor table as uniformly as possible. For each individual function, this normally can be obtained by mixing a large number of bits from the history and from the address to compute each individual bit in the function. However, general constraints for computing the indexing functions only allowed such complex computations for the unshuffle bits. For the other indexing bits, we favored the use of lghist bits instead of the address bits. Due to the inclusion of path information in lghist, lghist vectors were more uniformly distributed than PC addresses.

In [17], it was pointed out that the indexing functions in a skewed cache should be chosen to minimize the number of block pairs that will conflict on two or more ways. The same applies for the 2Bc-gskew branch predictor.

7.3 Shared bits

The indexing functions for the four prediction tables share a total of 8 bits, the bank number (2 bits) and the wordline number (i10, .., i5). The bank number computation was described in Section 6.

The wordline number must be immediately available at the very beginning of the branch predictor access. Therefore, it can either be derived from information already available earlier, such as the bank number, or directly extracted from information available at the end of the previous cycle such as the three fetch blocks old lghist and the fetch block address.

The fetch block address is the most natural choice, since it allows the use of an effective bimodal table for component BIM in the predictor. However, simulations showed that the distribution of the accesses over the BIM table entries was unbalanced. Some regions in the predictor tables were used infrequently and others were congested.

Using a mix of lghist history bits and fetch block address bits leads to a more uniform use of the different wordlines in the predictor, thus allowing overall better predictor performance. As a consequence, component BIM in the branch predictor uses 4 bits of history in its indexing function. The wordline number used is given by (i10, i9, i8, i7, i6, i5) = (h3, h2, h1, h0, a8, a7).

7.4 Indexing BIM

The indexing function for BIM is already using 4 history bits, that are three fetch blocks old, and some path information from two fetch blocks ahead (for bank number computation). Therefore path information from the last instruction fetch block (that is Z) is used. The extra bits for indexing BIM are (i13, i12, i11, i4, i3, i2) = (a11, [...], [...], a4, [...], [...]).

7.5 Engineering the indexing functions for G0, G1 and Meta

The following methodology was used to define the indexing functions for G0, G1 and Meta. First, the best history length combination was determined using standard skewing functions from [17]. Then, the column indices and the XOR functions for the three predictors were manually defined applying the following principles as best as we could:

1. favor a uniform distribution of column numbers for the choice of wordline index.
Column index bits must be computed using only one two-entry XOR gate. Since history vectors are more uniformly distributed than address numbers, to favor an overall good distribution of column numbers, history bits were generally preferred to address bits.

2. if, for the same instruction fetch block address A, two histories differ by only one or two bits then the two occurrences should not map onto the same predictor entry in any table: to guarantee this, whenever an information bit is XORed with another information bit for computing a column bit, at least one of them will appear alone for the computation of one bit of the unshuffle parameter.

3. if a conflict occurs in a table, then try to avoid it on the two other tables: to approximate this, different pairs of history bits are XORed for computing the column bits for the three tables.

This methodology led to the design of the indexing functions defined below:

Indexing G0. To simplify the implementation of column selectors, G0 and Meta share i15 and i14. Column selection is given by
  (i15, i14, i13, i12, i11) = ([...]).
Unshuffling is defined by
  (i4, i3, i2) = ([...],
                  a11 ⊕ h9 ⊕ h10 ⊕ h12 ⊕ z6 ⊕ a5,
                  a2 ⊕ a14 ⊕ a10 ⊕ h6 ⊕ h4 ⊕ h7 ⊕ a6).

Indexing G1. Column selection is given by
  (i15, i14, i13, i12, i11) = (h19 ⊕ h12, h18 ⊕ h11, h17 ⊕ h10, h16 ⊕ h4, h15 ⊕ h20).
Unshuffling is defined by
  (i4, i3, i2) = (a4 ⊕ a11 ⊕ a14 ⊕ a6 ⊕ h4 ⊕ h6 ⊕ h9 ⊕ h14 ⊕ h15 ⊕ h16 ⊕ z6,
                  a3 ⊕ a10 ⊕ a13 ⊕ h5 ⊕ h11 ⊕ h13 ⊕ h18 ⊕ h19 ⊕ h20 ⊕ z5,
                  a2 ⊕ a5 ⊕ a9 ⊕ h4 ⊕ h8 ⊕ h7 ⊕ h10 ⊕ h12 ⊕ h13 ⊕ h14 ⊕ h17).

Indexing Meta. Column selection is given by
  (i15, i14, i13, i12, i11) = (h7 ⊕ h11, h8 ⊕ h12, h5 ⊕ h13, h4 ⊕ h9, a9 ⊕ h6).
Unshuffling is defined by
  (i4, i3, i2) = (a4 ⊕ a10 ⊕ a5 ⊕ h7 ⊕ h10 ⊕ h14 ⊕ h13 ⊕ z5,
                  a3 ⊕ a12 ⊕ a14 ⊕ a6 ⊕ h4 ⊕ h6 ⊕ h8 ⊕ h14,
                  a2 ⊕ a9 ⊕ a11 ⊕ a13 ⊕ h5 ⊕ h9 ⊕ h11 ⊕ h12 ⊕ z6).

8 Evaluation

In this section, we evaluate the different design decisions that were made in the Alpha EV8 predictor design. We first justify the choice of the hybrid skewed predictor 2Bc-gskew against other schemes relying on global history. Then step by step, we analyze benefits or detriments brought by design decisions and implementation constraints.

8.1 Methodology

8.1.1 Simulation

Trace driven branch simulations with immediate update were used to explore the design space for the Alpha EV8 branch predictor, since this methodology is about three orders of magnitude faster than the complete Alpha EV8 processor simulation. We checked that for branch predictors using (very) long global history as those considered in this study, the relative error
in number of branch mispredictions between a trace driven simulation, assuming immediate update, and the complete simulation of the Alpha EV8, assuming predictor update at commit time, is insignificant.

The metric used to report the results is mispredictions per 1000 instructions (misp/KI). To experiment with history length wider than log2 of table sizes, indexing functions from the family presented in [17, 15] were used for all predictors, except in Section 8.5. The initial state of all entries in the prediction tables was set to weakly not taken.

8.1.2 Benchmark set

Displayed simulation results were obtained using traces collected with Atom [23]. The benchmark suite was SPECINT95. Binaries were highly optimized for the Alpha 21264 using profile information from the train input. The traces were recorded using the ref inputs. One hundred million instructions were traced after skipping 400 million instructions, except for compress (2 billion instructions were skipped). Table 2 details the characteristics of the benchmark traces.

Table 2. Benchmark characteristics

Benchmark                      compress  gcc    go     ijpeg  li     m88ksim  perl   vortex
dyn. cond. branches (x1000)    12044     16035  11285  8894   16254  9706     13263  12757
static cond. branches          46        12086  3710   904    251    409      273    2239

8.2 2Bc-gskew vs other global history based predictors

We first validated the choice of the 2Bc-gskew prediction scheme against other global prediction schemes. Fig. 5 shows simulation results for predictors with memorization size in the same range as the Alpha EV8 predictor. Displayed results assume conventional branch history. For all the predictors, the best history length results are presented. Fig. 6 shows the number of additional mispredictions for the same configurations as in Fig. 5 but using log2 of the table size, instead of the best history length.

Figure 5. Branch prediction accuracy for various global history schemes

Figure 6. Additional mispredictions when using log2 table size history length

The illustrated configurations are:

- a 4*32K entries (i.e. 256 Kbits) 2Bc-gskew using history lengths 0, 13, 16 and 23 respectively for BIM, G0, Meta and G1, and a 4*64K entries (i.e. 512 Kbits) 2Bc-gskew using history lengths 0, 17, 20 and 27. For limited (log2) history length, the lengths are equal for all tables and are 15 for the 256 Kbit configuration and 16 for the 512 Kbit.

- a bimode predictor [13] consisting of two 128K entries tables for respectively biased taken and not taken branches and a 16K entries bimodal table, for a total of 544 Kbits of memorization (1). The optimum history length (for our benchmark set) was 20. For log2 history length 17 bits were used.

- a 1M entries (2 Mbits) gshare. The optimum history length (on our benchmark set) was 20 (i.e. log2 of the predictor table size).

- a 288 Kbits and 576 Kbits YAGS predictor [4] (respective best history length 23 and 25). The small configuration consists of a 16K entry bimodal and two 16K partially tagged tables called direction caches; tags are 6 bits wide. When the bimodal table predicts taken (resp. not-taken), the not-taken (resp. taken) direction cache is searched. On a miss in the searched direction cache, the bimodal table provides the prediction. On a hit, the direction cache provides the prediction. For log2 history length 14 bits (resp. 15 bits) were used.

(1) The original proposition for the bimode predictor assumes equal sizes for the three tables. For large size predictors, using a smaller bimodal table is more cost-effective. On our benchmark set, using more than 16K entries in the bimodal table did not add any benefit.

First, our simulation results confirm that, at equivalent memorization budget, 2Bc-gskew outperforms the other global history branch predictors except YAGS. There is no clear winner between the YAGS predictor and 2Bc-gskew. However, the YAGS predictor uses (partially) tagged arrays. Reading and checking 16 of these tags in only one and a half cycles would have been difficult to implement. Second, the data support that predictors featuring a large number of entries need very long history length and that log2 table size history is suboptimal.

8.3 Quality of the information vector

The discussion below examines the impact of successive modifications of the information vector on branch prediction accuracy assuming a 4*64K entries 2Bc-gskew predictor. For each configuration the accuracies for the best history lengths are reported in Fig. 7. ghist represents the conventional branch history. lghist, no path assumes that lghist does not include path information. lghist+path includes path information. 3-old lghist is the same as before, but considering three fetch blocks old history. EV8 info vector represents the information vector used on Alpha EV8, that is three fetch blocks old lghist history including path information plus path information on the three last blocks.

Figure 7. Impact of the information vector on branch prediction accuracy

lghist. As expected the optimal lghist history length is shorter than the optimal real branch history: (15, 17, 23) instead of (17, 20, 27) respectively for tables G0, Meta and G1.

Quite surprisingly (see Fig. 7), lghist has the same performance as conventional branch history. Depending on the application, there is either a small loss or a small benefit in accuracy. Embedding path information in lghist is generally beneficial: we determined that it is more often useful to de-alias otherwise aliased history paths.

The loss of information from branches in the same fetch block in lghist is balanced by the use of history from more branches (even though represented by a shorter information vector): for instance, for vortex the 23 lghist bits represent on average 36 branches. Table 3 represents the average number of conditional branches represented by one bit in lghist for the different benchmarks.

Three fetch blocks old history. Using three fetch blocks old history slightly degrades the accuracy of the predictor, but the impact is limited. Moreover, using path information from the three fetch blocks missing in the history consistently recovers most of this loss.

EV8 information vector. In summary, despite the fact that the vector of information used for indexing the Alpha EV8 branch predictor was largely dictated by implementation constraints, on our benchmark set this vector of information achieves approximately the same levels of accuracy as without any constraints.

8.4 Reducing some table sizes

Fig. 8 shows the effect of reducing table sizes. The base configuration is a 4*64K entries 2Bc-gskew predictor (512 Kbits). The data denoted by small BIM shows the performance when the BIM size is reduced from 64K to 16K 2-bit counters. The performance with a small BIM and half the size for G0 and Meta hysteresis tables is denoted by EV8 Size. The latter fits the 352 Kbits budget of the Alpha EV8 predictor. The information vector used for indexing the predictor is the information vector used on Alpha EV8.

Figure 8. Adjusting table sizes in the predictor

Reducing the size of the BIM table has no impact at all on our benchmark set. Except for go, the effect of using half size hysteresis tables for G0 and Meta is barely n
oticeable.gopresentsaverylargefootprintandconsequentlyisthemostsensitivetosizereduction.Figure9.Effectofwordlineindices8.5IndexingfunctionconstraintsSimulationsresultspresentedsofardidnottakeintoac-counthardwareconstraintsontheindexingfunctions.8bitsofindexmustbesharedandcannotbehashed,andcompu-tationofthecolumnbitscanonlyuseone2-entryXORgateIntuitively,theseconstraintsshouldleadtosomelossofef-ciency,sinceitrestrictsthepossiblechoicesforindexingfunctions.However,itwasremarkedin[16]that(forcaches)partialskewingisalmostasefcientascompleteskewing.Thesameappliesforbranchpredictors:sharing8bitsintheindicesdoesnothurtthepredictionaccuracyaslongasthesharedindexisuniformlydistributed.Theconstraintofusingunhashedbitsforthewordlinenumberturnedouttobemorecritical,sinceitrestrictedthedistributionofthesharedindex.IdeallyfortheEV8branchpredictor,onewoulddesiretogetthedistributionofthisshared8bitindexasuniformaspossibletospreadaccessesonG0,G1andMetaovertheentiretable.Fig.9illustratestheeffectsofthevariouschoicesmadeforselectingthewordlinenumber.addressonly,nopathassumesthatonlyPCaddressbitsareusedinthesharedindexandthatnopathinformationisusedinlghist.addressonly,pathassumesthatonlyPCaddressbitsareusedinthesharedindex,butpathinformationisembeddedinlghist.nopathassumes4historybitsand2PCbitsaswordlinenumber,butthatnopathinformationisusedinlghist.EV8illustratestheaccuracyoftheAlphaEV8branchpredictorwhere4historybitsareusedinthewordlinenumberindexandpathinformationisembeddedinthehistory.Finallycompletehashrecallstheresultsassuminghashingonalltheinformationbitsand4*64K2Bc-gskewghistrepresentsthesimulationresultsassuminga512Kbitspredictorwithnoconstraintonindexfunctionsandconventionalbranchhistory.Previouslywasnotedthatincorporatingpathinforma-tioninlghisthasonlyasmallimpactona2Bc-gskewpre-dictorindexedusinghashingfunctionswithnohardwareconstraints.However,addingthepathinformationinthehistoryfortheAlphaEV8predictormakesthedistributionoflghistmoreuniform,allowsitsuseinthesharedindex10 
and therefore can increase prediction accuracy.

Table 3. Ratio lghist/ghist

Benchmark       compress   gcc    go  ijpeg    li  m88ksim  perl  vortex
lghist/ghist        1.24  1.57  1.12   1.20  1.55     1.53  1.32    1.59

Figure 10. Limits of using global history

The constraint on the column bits computation indirectly had a positive impact by forcing us to design the column indexing and the unshuffle functions very carefully. The (nearly) total freedom for computing the unshuffle was fully exploited: 11 bits are XORed in the unshuffling function on table G1. The indexing functions used in the final design outperform the standard hashing functions considered in the rest of the paper: these functions (originally defined for skewed associative caches [17]) exhibit good inter-bank dispersion, but were not manually tuned to enforce the three criteria described in Section 7.5.

To summarize, the 352 Kbits Alpha EV8 branch predictor stands the comparison against a 512 Kbits 2Bc-gskew predictor using conventional branch history.

9 Conclusion

The branch predictor on Alpha EV8 was defined at the beginning of 1999. It features 352 Kbits of memory and delivers up to 16 branch predictions per cycle for two dynamically successive instruction fetch blocks. Therefore, a global history prediction scheme had to be used. In 1999, the hybrid skewed branch predictor 2Bc-gskew prediction scheme [19] represented the state of the art for global history prediction schemes. The Alpha EV8 branch predictor implements a 2Bc-gskew predictor scheme enhanced with an optimized update policy and the use of different history lengths on the different tables.

Some degrees of freedom in the definition of 2Bc-gskew were tuned to adapt the predictor parameters to silicon area and chip layout constraints: the bimodal component is smaller than the other components, and the hysteresis tables of two of the other components are only half the size of the prediction tables.

Implementation constraints imposed a three fetch blocks old compressed form of branch history, lghist, instead of the effective branch history. However, the information vector used to index the Alpha EV8 branch predictor stands the comparison with complete branch history. It achieves that by combining path information with the branch outcome to build lghist and by using path information from the three fetch blocks that have to be ignored in lghist.

The Alpha EV8 branch predictor is four-way interleaved and each bank is single ported. On each cycle, the branch predictor supports requests from two dynamically successive instruction fetch blocks but does not require any hardware conflict resolution, since bank number computation guarantees by construction that any two dynamically successive fetch blocks will access two distinct predictor banks.

The Alpha EV8 branch predictor features four logical components, but is implemented as only two memory arrays, the prediction array and the hysteresis array. Therefore, the definition of index functions for the four (logical) predictor tables is strongly constrained: 8 bits must be shared among the four indices. Furthermore, timing constraints restrict the complexity of hashing that can be applied for index computation. However, efficient index functions working around these constraints were designed.

Despite implementation and size constraints, the Alpha EV8 branch predictor delivers accuracy equivalent to a 4*64K entries 2Bc-gskew predictor using conventional branch history for which no constraint on the indexing functions was imposed.

In future generation microprocessors, branch prediction accuracy will remain a major issue. Even larger predictors than the predictor implemented in the Alpha EV8 may be considered. However, this brute force approach would have limited return except for applications with a very large number of branches. This is exemplified on our benchmark set in Fig. 10, which shows simulation results for a 4*1M 2-bit entries 2Bc-gskew predictor. Adding back-up predictor components [3] relying on different information vector types (local history, value prediction [9, 6]) or new prediction concepts (e.g., perceptron [11]) to tackle hard-to-predict branches seems more promising. Since such a predictor will face timing constraint issues, one may consider further extending the hierarchy of predictors with increased accuracies and delays: line predictor, global history branch prediction, backup branch predictor. The backup branch predictor would deliver its prediction later than the global history branch predictor.
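The memory budgets quoted for the three configurations of Section 8.4 can be checked with a few lines of arithmetic. The sketch below assumes that each 2-bit counter is split into one prediction bit and one hysteresis bit, which is consistent with the separate prediction and hysteresis arrays described above; the function name and the dictionary layout are illustrative only and do not come from the EV8 design.

```python
K = 1024

def total_bits(tables):
    """Sum prediction and hysteresis bits over all logical tables."""
    return sum(pred + hyst for pred, hyst in tables.values())

# Each table is (prediction entries, hysteresis entries), one bit per entry.
# Base configuration: 4 tables of 64K 2-bit counters.
base = {name: (64 * K, 64 * K) for name in ("BIM", "G0", "G1", "Meta")}

# "small BIM": the bimodal table shrinks from 64K to 16K 2-bit counters.
small_bim = dict(base, BIM=(16 * K, 16 * K))

# "EV8 Size": in addition, G0 and Meta use half-size hysteresis tables.
ev8_size = dict(small_bim, G0=(64 * K, 32 * K), Meta=(64 * K, 32 * K))

for label, cfg in (("base", base), ("small BIM", small_bim), ("EV8 Size", ev8_size)):
    print(f"{label}: {total_bits(cfg) // K} Kbits")
```

Under these assumptions the base configuration comes to 512 Kbits and the EV8 Size configuration to 352 Kbits, matching the figures given in the text.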
References

[1] B. Calder and D. Grunwald. Next cache line and set prediction. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995.
[2] K. Diefendorff. Compaq Chooses SMT for Alpha. Microprocessor Report, December 1999.
[3] K. Driesen and U. Holzle. The cascaded predictor: Economical and adaptive branch target prediction. In Proceeding of the 30th Symposium on Microarchitecture, Dec. 1998.
[4] A. N. Eden and T. Mudge. The YAGS branch predictor. In Proceedings of the 31st Annual International Symposium on Microarchitecture, Dec 1998.
[5] G. Giacalone and J. Edmonson. Method and apparatus for predicting multiple conditional branches. In US Patent No 6,272,624, August 2001.
[6] J. Gonzalez and A. Gonzalez. Control-flow speculation through value prediction for superscalar processors. In Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), 1999.
[7] L. Gwennap. Digital 21264 sets new standard. Microprocessor Report, October 1996.
[8] E. Hao, P.-Y. Chang, and Y. N. Patt. The effect of speculatively updating branch history on branch prediction accuracy, revisited. In Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, California, 1994.
[9] T. Heil, Z. Smith, and J. E. Smith. Improving branch predictors by correlating on data values. In 32nd Int. Symp. on Microarchitecture, Nov. 1999.
[10] S. Hily and A. Seznec. Branch prediction and simultaneous multithreading. In Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques (PACT '96), Oct. 1996.
[11] D. Jimenez and C. Lin. Dynamic branch prediction with perceptrons. In Proceedings of the Seventh International Symposium on High Performance Computer Architecture, January 2001.
[12] T. Juan, S. Sanjeevan, and J. Navarro. Dynamic history-length fitting: A third level of adaptivity for branch prediction. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), volume 26,3 of ACM Computer Architecture News, pages 155–166, New York, June 27–July 1 1998. ACM Press.
[13] C.-C. Lee, I.-C. Chen, and T. Mudge. The bi-mode branch predictor. In Proceedings of the 30th Annual International Symposium on Microarchitecture, Dec 1997.
[14] S. McFarling. Combining branch predictors. Technical report, DEC, 1993.
[15] P. Michaud, A. Seznec, and R. Uhlig. Trading conflict and capacity aliasing in conditional branch predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA-97), June 1997.
[16] A. Seznec. A case for two-way skewed-associative caches. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
[17] A. Seznec and F. Bodin. Skewed associative caches. In Proceedings of PARLE '93, May 1993.
[18] A. Seznec, S. Jourdan, P. Sainrat, and P. Michaud. Multiple-block ahead branch predictors. In Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pages 116–127, 1996.
[19] A. Seznec and P. Michaud. De-aliased hybrid branch predictors. Technical Report RR-3618, Inria, Feb 1999.
[20] K. Skadron, M. Martonosi, and D. Clark. Speculative updates of local and global branch history: A quantitative analysis. Journal of Instruction-Level Parallelism, vol. 2, January 2000.
[21] J. Smith. A study of branch prediction strategies. In Proceedings of the 8th Annual International Symposium on Computer Architecture, May 1981.
[22] E. Sprangle, R. S. Chappell, M. Alsup, and Y. Patt. The agree predictor: A mechanism for reducing negative branch history interference. In Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA-97), pages 284–291, June 1997.
[23] A. Srivastava and A. Eustace. ATOM: A system for building customized program analysis tools. ACM SIGPLAN Notices, 29(6):196–205, 1994.
[24] A. Talcott, M. Nemirovsky, and R. Wood. The influence of branch prediction table interference on branch prediction scheme performance. In Proceedings of the 3rd Annual International Conference on Parallel Architectures and Compilation Techniques, 1995.
[25] D. M. Tullsen, S. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.
[26] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous MultiThreading processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
[27] T.-Y. Yeh and Y. Patt. Alternative implementations of two-level adaptive branch prediction. In Proceedings of the 19th Annual International Symposium on Computer Architecture, May 1992.
[28] C. Young, N. Gloy, and M. Smith. A comparative analysis of schemes for correlated branch prediction. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

Acknowledgement

The authors would like to recognise the work of all those who contributed to the architecture, circuit implementation and verification of the EV8 Branch Predictor. They include Ta-chung Chang, George Chrysos, John Edmondson, Joel Emer, Tryggve Fossum, Glenn Giacalone, Balakrishnan Iyer, Manickavelu Balasubramanian, Harish Patil, George Tien and James Vash.