http://www.eecs.berkeley.edu/~kubitron/cs252

Presentation Transcript

http://www.eecs.berkeley.edu/~kubitron/cs252
cs252-S09, Lecture 9, 2/23/09

Slide 2
Keep both the branch PC and target PC in the BTB
[Figure labels: entry PC, =, target PC]

Slide 3
Two possibilities; current branch depends on:
• ... Produces a “GA” (for “global adaptive”) in the Yeh and Patt classification
• ... Produces a “PA” (for “per-address adaptive”) in same classification

Slide 4
Yeh and Patt, 1992
    if (x[i] ) y += 1;
    if (x[i] ) c -= 4;
If first condition false, second condition also false

Slide 5
(2,2) GAs predictor; (0,2) GAs predictor
[Figure labels: branch address, prediction, 2-bit global branch history register, BHT]
That gives us a GAs history table. Each slot is a 2-bit counter

Slide 6: Two-Level Branch Predictor (e.g. GAs)
Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct)
[Figure labels: 00, fetch PC, shift in Taken/¬Taken, Taken/¬Taken?]

Slide 7: What are Important Metrics?
• Clearly, Hit Rate matters
  – Even 1% can be important when above 90% hit rate
• Speed: Does this affect cycle time?
• Space: Clearly Total Space matters!
  – Papers which do not try to normalize across different options are playing fast and loose with data
  – Try to get best performance for the cost

Slide 8: Accuracy of Different Schemes
[Figure: frequency of mispredictions for 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2,2) BHT; benchmarks shown include matrix300 and espresso]

Slide 9
Mispredict because either: ...
gcc at 12%

Slide 10: Yeh and Patt classification
[Figure label: GBHR]
• GAg: Global History Register, Global History Table
• PAg: Per-Address History Register, Global History Table
• PAp: Per-Address History Register, Per-Address History Table

Slide 11
PAp best: But uses a lot more state!
GAg not effective with 6-bit history registers
PAg performs better because it ...

Slide 12
GAg requires 18-bit history register
PAg requires 12-bit history register
PAp requires 6-bit history register
PAg is the cheapest among these
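
The two-level (2,2) GAs scheme described above can be sketched roughly as follows. This is a minimal illustration, not the Pentium Pro's actual design: the class name `GAsPredictor`, the table size, the initial counter value, and the choice to index by the PC with its low two bits dropped are all assumptions made for the example. A 2-bit global history register selects one of four tables of 2-bit saturating counters, and the branch address picks the slot within that table.

```python
import random


class GAsPredictor:
    """Two-level predictor: global history selects one of 2**history_bits counter tables."""

    def __init__(self, history_bits=2, index_bits=10):
        self.history_bits = history_bits
        self.index_bits = index_bits
        self.history = 0  # global branch history register, newest outcome in bit 0
        # One table of 2-bit counters per history pattern; start at 1 (weakly not-taken).
        self.tables = [[1] * (1 << index_bits) for _ in range(1 << history_bits)]

    def _slot(self, pc):
        # Assumption: 4-byte aligned instructions, so the low 2 PC bits are dropped.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.tables[self.history][self._slot(pc)] >= 2

    def update(self, pc, taken):
        table = self.tables[self.history]
        slot = self._slot(pc)
        # Saturating increment/decrement of the selected 2-bit counter.
        table[slot] = min(3, table[slot] + 1) if taken else max(0, table[slot] - 1)
        # Shift the outcome into the global history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def correlated_demo(rounds=200, seed=0):
    """Two branches testing the same condition, as in the x[i] example above."""
    rng = random.Random(seed)
    predictor = GAsPredictor()
    hits = 0
    for _ in range(rounds):
        cond = rng.random() < 0.5
        predictor.update(0x100, cond)            # first branch resolves
        hits += predictor.predict(0x200) == cond  # second branch follows the first
        predictor.update(0x200, cond)
    return hits / rounds
```

Because the first branch's outcome sits in the newest history bit when the second branch is predicted, the second branch mispredicts only while its counters warm up, even though the condition itself is random.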
Slide 13: Why doesn’t GAg do better?
Difference between GAg and both PA variants:
• GAg tracks correlations between different branches
• PAg/PAp track correlations between different instances of the ...
Among other things, GAg good for branches in straight-line code ...
... GAg doesn’t leave flexibility to do this

Slide 14: Gshare
[Figure labels: global history register, address]

Slide 15
“... Prediction,” by Cliff Young, Nicolas Gloy, and Michael D. Smith
... to further bias branch behavior
Yes: filter out biased branches to save prediction resources for the unbiased ones

Slide 16: Bimode and YAGS
[Figure labels: TAG, Pred]

Slide 17
From: “An Analysis of Correlation and Predictability: What Makes ...rk,” Evers, Patel, Chappell, Patt
Difference in predictability quite significant for some branches!

Slide 18: Dynamically finding structure in “spaghetti code”
• Are all branches likely to need the same type of ...
• What to do about it?
  – How about predicting which predictor will be best?
  – Called a “Tournament predictor”

Slide 19
Motivation for correlating branch predictors is 2-... adding global information, performance ...
Tournament predictors: use 2 predictors, 1 based on global information and 1 based on addr...
[Figure label: Predictor A]

Slide 20
1. 4K 2-bit counters to choose from among a global predictor and a local predictor
2. Global predictor (GAg): 4K entries, indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
   – 12-bit pattern: i-th bit 0 => i-th prior branch not taken; i-th bit 1 => i-th prior branch taken
3. Local predictor: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry.
   – 10-bit history allows ...
   – Next level: selected entry from the local history table is used to index a table of 1K entries consisting of a 3-bit saturating ...
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits

Slide 21
[Figure: % of predictions from local predictor in Tournament Scheme, axis 0% to 100%]

Slide 22
[Figure: branch prediction accuracy, axis 0% to 100%, for gcc, espresso, fpppp, doduc, tomcatv; series: profile-based, 2-bit counter, tournament]

Slide 23: Accuracy v. Size (SPEC89)
[Figure: x-axis 0 to 128 in steps of 8; series shown include Correlating and Tournament]

Slide 24
64 avg. 11.5 mispredictions per 1000 instructions
64 avg. 16.5 mispredictions per 1000 instructions
64 avg. 17 mispredictions per 1000 instructions
64 avg. 15 mispredictions per 1000 instructions

Slide 25: Special Case Return Addresses
• Register Indirect branch hard to predict address
  – SPEC89 85% such branches for procedure return
  – Since stack discipline for procedures, save return address in small buffer that acts like a stack: 8 to 16 entries has small miss rate
[Figure labels: BTB, next PC, fetch unit, destination from call instruction (on fetch?), select for indirect jumps (on fetch), return address stack, mux]

Slide 26: Performance: Return Address Predictor
• Push a return address on stack
  – Pop an address off stack & predict as new PC
[Figure: misprediction frequency vs. return address buffer entries (0, 1, 2, 4, 8, 16) for go, m88ksim, compress, xlisp, ijpeg, perl, vortex]

Slide 27
Stores commit in order (ROB), so no WAR/WAW memory hazards.
... store prior to load is waiting for its address

Slide 28: In-Order Memory Queue
• Execute all loads and stores in program order
  => Load and store cannot leave ROB for execution until all previous loads and stores have ...
• Can still execute loads and stores speculatively, and out-of-order with respect to other instructions

Slide 29
st r1, (r2)
uncommitted stores

Slide 30
st r1, (r2)
Guess that r4 != r2
Need to hold all completed but uncommitted load/store ...
If subsequently find r4 == r2, squash load and ...urate address speculation

Slide 31
st r1, (r2)

Slide 32
– A speculative store buffer is a structure introduced to hold ...

Slide 33: Speculative Store Buffer
• On store execute:
  – mark entry valid and speculative, and save data and tag of instruction.
• On store commit:
  – clear speculative bit and eventually move data to cache
• On store abort:
  – clear valid bit
[Figure labels: data, load address, tags, store commit path, load data, tag]

Slide 34: Speculative Store Buffer
• If data in both store buffer and cache, which should we use: speculative store buffer
• If same address in store buffer twice, which should we use: youngest store older than load
[Figure labels: data, load address, tags, store commit path, load data, tag]

Slide 35
Naïve Speculation: always let ...
“... using Store Sets”, Chrysos and Emer.

Slide 36
Said another way: Could we do better?
• ... oracle predictor
  – We can get significantly better performance if we find a good predictor
  – Question: How to build a good predictor?

Slide 37
Not always true! Hopefully true most of time
Store Set: set of store insts that affect given load
Example:
    Addr  Inst
    0     Store C
    4     Store A
    8     Store B
    12    Store C
    28    Load B    Store set { PC 8 }
    32    Load D    Store set { (null) }
    36    Load C    Store set { PC 0, PC 12 }
    40    Load B    Store set { PC 8 }
Idea: Store set for load starts empty.
If the load ever goes forward and this causes a violation, add the offending store to the load’s store set. Else let it go forward.

Slide 38
“Infinite” here means to place no limits on: ...
Note: “Not Predicted” means load had empty store set
Only Applu and Xlisp seem to have false dependencies

Slide 39
Notice that this requires each store to be in only one store set!
LFST: Maps SSIDs to most ...
When Load is fetched, allows it to find most recent store in its store set that is executing (if any); allows stalling until store finished
Pretty much same type of ordering as enforced by ROB anyway
Transitivity: loads end up waiting for all active stores in store set
Allow store sets to be merged together deterministically
Want periodic clearing of SSIT to avoid: problems with aliasing across program

Slide 40: How well does this do?
• Comparison against Store Barrier Cache
  – Marks individual Stores as “tending to cause memory violations”
  – Not specific to particular loads….
• Problem with APPLU?
  – Analyzed in paper: has complex 3-level inner loop in which loads occasionally depend on stores
  – Forces overly conservative stalls (i.e. false dependencies)

Slide 41
Mikko H. Lipasti, Christopher B. Wilkerson and John Paul Shen

Slide 42: Load Value Prediction Table
• Load Value Prediction Table (LVPT)
  – Untagged, Direct Mapped
  – Takes Instructions -> Predicted Data
• ... of last n unique values from given instruction
  – Can contain aliases, since untagged
• How to predict?
  – When n=1, easy
  – When n=16? Use Oracle
• No!
Why not?
  – Must identify predictable loads somehow
[Figure label: LVPT]

Slide 43: Load Classification Table (LCT)
• Load Classification Table (LCT)
  – Untagged, Direct Mapped
  – Takes Instructions -> Single bit of whether or not to predict
• How to implement?
  – Uses saturating counters (2 or 1 bit)
  – When prediction correct, increment
  – When prediction incorrect, decrement
• With 2 bit counter
  – ... not predictable
  – ... constant (very predictable)
• With 1 bit counter
  – ... not predictable
  – ... constant (very predictable)
[Figure label: instruction addr]

Slide 44
Difference between “Simple” and ...
... “predictability” of structure

Slide 45: Constant Value Unit
• Idea: Identify a load instruction as “constant”
  – Can ignore cache lookup (no verification)
  – Must enforce by monitoring result status
• How well does this work?
  – Seems to identify 6-18% of loads as constant
  – Must be unchanging enough to cause LCT to classify as constant

Slide 46: Load Value Architecture
• LCT/LVPT in fetch stage
• CVU in execute stage
  – Used to bypass cache entirely
  – (Know that result is good)
• Results: Some speedups
  – 21264 seems to do better than Power PC
  – Authors think this is because of small first-level cache and in-order execution makes CVU more useful

Slide 47
Global Predictors: GAg, GAs, GShare, Bimode, YAGS
Store set: set of stores that have had dependencies with load in past
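
The store-set idea summarized above (a load's store set starts empty, a violation adds the offending store, and a fetched load waits for the most recent in-flight store in its set) can be sketched as follows. This is a simplified illustration under stated assumptions, not the hardware from the Chrysos and Emer paper: the class and method names are hypothetical, Python dictionaries stand in for the SSIT and LFST (which the slides describe only as tables), and "both PCs take the smaller SSID" is just one possible deterministic merge rule.

```python
class StoreSetPredictor:
    """Toy model of store sets: SSIT maps PC -> SSID, LFST maps SSID -> last fetched store."""

    def __init__(self):
        self.ssit = {}      # instruction PC -> store set ID (absent = empty store set)
        self.lfst = {}      # SSID -> tag of the most recently fetched in-flight store
        self.next_ssid = 0

    def on_violation(self, load_pc, store_pc):
        """A load went ahead of a store it depended on: put both in one store set."""
        load_id = self.ssit.get(load_pc)
        store_id = self.ssit.get(store_pc)
        if load_id is None and store_id is None:
            ssid = self.next_ssid
            self.next_ssid += 1
        elif load_id is None:
            ssid = store_id
        elif store_id is None:
            ssid = load_id
        else:
            # Assumed merge rule: both instructions take the smaller SSID.
            ssid = min(load_id, store_id)
        # Each PC has exactly one SSIT entry, so a store is in only one store set.
        self.ssit[load_pc] = ssid
        self.ssit[store_pc] = ssid

    def fetch_store(self, store_pc, tag):
        """Record the store as the latest in its set; return the store it must follow."""
        ssid = self.ssit.get(store_pc)
        if ssid is None:
            return None
        previous = self.lfst.get(ssid)
        self.lfst[ssid] = tag
        return previous

    def fetch_load(self, load_pc):
        """Return the tag of the store this load should wait for, or None to go ahead."""
        ssid = self.ssit.get(load_pc)
        return None if ssid is None else self.lfst.get(ssid)

    def store_executed(self, store_pc, tag):
        """Clear the LFST entry once the recorded store has executed."""
        ssid = self.ssit.get(store_pc)
        if ssid is not None and self.lfst.get(ssid) == tag:
            del self.lfst[ssid]
```

Using the addresses from the store-set example, the load at PC 28 initially goes ahead freely; after one violation against the store at PC 8, a later fetch of that load returns the tag of the in-flight store and stalls until `store_executed` clears it.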