Presentation Transcript

EE194/Comp140 (ECEC 194): High Performance Computer Architecture
Spring 2016, Tufts University
Instructor: Prof. Mark Hempstead (mark@ece.tufts.edu)
Lecture 3: Review of Branch Prediction, Memory Hierarchy, and Caches

More pipelining and stalls
Consider a random assembly program:
  load r2=mem[r1]
  add r5=r3+r2
  add r8=r6+r7
The architecture says that instructions are executed in order.
The 2nd instruction uses the r2 written by the first instruction.

Pipelined instructions
Pipelining cheats! It launches instructions before the previous ones finish.
Hazards occur when one computation uses the results of another – it exposes our sleight of hand.
Pipeline stages per instruction (spread over cycles 1-8):
  load r2=mem[r1]:    fetch | read R1     | execute (nothing) | access cache      | write r2
  add r5=r3+r2:       fetch | read r3,r2  | add r3,r2         | load/st (nothing) | write r5
  add r8=r6+r7:       fetch | read r6,r7  | add r6,r7         | load/st (nothing) | write r8
  store mem[r9]=r10:  fetch | read r9,r10 | execute (nothing) | store             | write (nothing)

Same problem with branches
Code:
  load r2=mem[r1]
  if (r2==0) goto L1
  sub r8=r6-r7
  …
  L1: add r8=r8+r1
Same problem as before: we don't know whether to fetch the "sub" or the "add" until we know r2.
The simplest solution is to stall, just like before:
  load r2=mem[r1]:  fetch | read R1 | execute (nothing) | access cache | write r2
  branch if r2==0:  fetch | read r2 | read r2 | read r2
  sub r8=r6-r7:     fetch | read r6,r7 | add r6,r7
  …:                fetch | read

What prediction looks like
Let's predict that r2 != 0, so we don't take the branch, and we will execute r8=r6-r7.
We start executing it right away, before we know if we are really supposed to.
What could go wrong with this? (Code as on the previous slide.)
  load r2=mem[r1]:  fetch | read R1 | execute (nothing) | access cache | write r2
  branch if r2==0:  fetch | read r2 | read r2 | read r2
  sub r8=r6-r7:     fetch | read r6,r7 | add r6,r7 | access (nothing) | write r8
  …:                fetch | read | execute | access

What prediction looks like
What could go wrong with this?
What if we hit cycle 5 and find that r2==0?
We must undo the sub, and do the add instead.
(Same code and pipeline diagram as the previous slide.)

What prediction looks like
Easy, right?
What if the load took longer, so we already wrote r8? (Code as on the previous slide.)
  load r2=mem[r1]:  fetch | read R1 | execute (nothing) | access cache | write r2
  branch if r2==0:  fetch | read r2 | read r2 | read r2
  add r8=r8+1:      fetch | read r8 | …

What prediction looks like
Now we have a cache miss, and it takes longer for the load to finish.
Problems with this scenario?
We hit cycle 5 and find that r2==0.
We want to cancel the "sub" – but it already wrote r8!
We'll talk about how to fix this later. (Code as on the previous slide.)
  load r2=mem[r1]:  fetch | read R1 | execute (nothing) | access cache (×4, cache miss) | write r2
  branch if r2==0:  fetch | read r2 (×6, waiting) | …
  sub r8=r6-r7:     fetch | read r6,r7 | add r6,r7 | access (nothing) | write r8
  …:                fetch | read | execute | access

Branch prediction
Guessing whether a branch will be taken is called branch prediction.
Bottom line: it's hard to fix a wrong prediction, so you had better get it right most of the time.
Now we'll look at branch prediction:
The hardware guesses the branch outcome.
Start fetching from the guessed address.
Somehow fix things up on a mis-predict.

People branch-predict too
Buy an engagement ring, predicting a "yes" answer.
If you guessed wrong, then throw away the ring?
Pick which way to drive to work based on what the traffic was like yesterday.
If today's backups are different, then make excuses when you get to work.

Branch prediction is hard
"Making predictions is hard, especially when they're about things that haven't happened yet." – Yogi Berra
Corollary: branch prediction is hard.

Branch Characteristics
Integer benchmarks: 14-16% of instructions are conditional branches; FP: 3-12%.
On average, 67% of conditional branches are "taken".
60% of forward branches are taken.
85% of backward branches are taken. Why? They tend to be the termination condition of a loop.

Simplest of predictors
Really simple predictor: 67% of conditional branches are "taken", so assume branches are always taken.
Sounds nice, but being right 67% of the time isn't nearly good enough.
The penalty for being wrong is usually far too high.
Ditto for "assume not taken."

Learn from past, predict the future
Record the past in a hardware structure.
Direction predictor (DIRP): map a conditional-branch PC to a taken/not-taken (T/N) decision.
Individual conditional branches are often biased or weakly biased (90%+ one way or the other is considered "biased"). Why? Loop back edges, and checks for uncommon conditions.
Branch history table (BHT): the simplest predictor.
The PC indexes a table of bits (0 = not taken, 1 = taken), with no tags.
Essentially: the branch will go the same way it went last time.
What about aliasing – two PCs with the same lower bits? No problem, it's just a prediction!
[Figure: low-order PC bits (e.g., [9:2]) form the BHT index; the BHT entry supplies the taken/not-taken prediction. Not shown: some way to update the BHT after a mis-predict.]
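Below is a minimal software sketch of this one-bit BHT. The table size, index hashing, and function names are illustrative assumptions, not details from the slide:

    #include <stdint.h>
    #include <stdbool.h>

    #define BHT_ENTRIES 256               /* assumed size: 2^8 entries, no tags */

    static uint8_t bht[BHT_ENTRIES];      /* 0 = predict not-taken, 1 = predict taken */

    /* Index with low-order PC bits (word-aligned PCs: drop the bottom 2 bits). */
    static unsigned bht_index(uint32_t pc) { return (pc >> 2) & (BHT_ENTRIES - 1); }

    /* Predict: the branch goes the same way it went last time. */
    bool bht_predict(uint32_t pc) { return bht[bht_index(pc)] != 0; }

    /* Update after the branch resolves; aliasing between PCs is harmless,
     * it only costs prediction accuracy. */
    void bht_update(uint32_t pc, bool taken) { bht[bht_index(pc)] = taken ? 1 : 0; }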

Branch History Table (BHT)
Branch history table (BHT): the simplest direction predictor.
The PC indexes a table of bits (0 = N, 1 = T), no tags.
It predicts that the branch will go the same way it went last time.
Problem: the inner-loop branch below.
  for (i=0;i<100;i++)
    for (j=0;j<4;j++)
      // whatever
It will be wrong twice per inner loop: once each at the beginning and the end.
The branch predictor "changes its mind too quickly".
  Time  State  Prediction  Outcome  Result
   1    N      N           T        Wrong
   2    T      T           T        Correct
   3    T      T           T        Correct
   4    T      T           N        Wrong
   5    N      N           T        Wrong
   6    T      T           T        Correct
   7    T      T           T        Correct
   8    T      T           N        Wrong
   9    N      N           T        Wrong
  10    T      T           T        Correct
  11    T      T           T        Correct
  12    T      T           N        Wrong

Two-Bit Saturating Counters (2bc)
Two-bit saturating counters (2bc) [Smith 1981].
Replace each single-bit predictor with a 2-bit counter: (0,1,2,3) = (N,n,t,T).
This adds "hysteresis": the predictor must mis-predict twice before "changing its mind".
Result: one mispredict per loop execution (rather than two), which improves our inner-loop problem.

Two-bit Saturating Counters
2-bit FSMs mean the prediction must miss twice before it changes.
N-bit predictors are possible, but beyond 2 bits there is not much benefit.
[FSM diagram: four states – strongly taken, weakly taken, weakly not taken, strongly not taken (encoded 00-11); each Taken outcome moves one state toward the strongly-taken end, each Not Taken outcome moves one state toward the strongly-not-taken end.]
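A minimal sketch of the 2-bit saturating-counter update, assuming the conventional encoding (0 = strongly not taken … 3 = strongly taken) and an illustrative 4K-entry table:

    #include <stdint.h>
    #include <stdbool.h>

    #define CTR_ENTRIES 4096                    /* assumed: 4K two-bit counters */

    static uint8_t ctr[CTR_ENTRIES];            /* 0..3 = N, n, t, T */

    static unsigned ctr_index(uint32_t pc) { return (pc >> 2) & (CTR_ENTRIES - 1); }

    /* Predict taken when in a "taken" state (2 = weakly taken, 3 = strongly taken). */
    bool counter_predict(uint32_t pc) { return ctr[ctr_index(pc)] >= 2; }

    /* Saturating update: step toward 3 on taken, toward 0 on not-taken.
     * Two consecutive mispredicts are needed before the prediction flips. */
    void counter_update(uint32_t pc, bool taken) {
        uint8_t *c = &ctr[ctr_index(pc)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }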

Two-Bit Saturating Counters (2bc)
The same inner-loop branch, now with a 2-bit saturating counter: one mispredict per loop execution rather than two.
  Time  State  Prediction  Outcome  Result
   1    N      N           T        Wrong
   2    n      N           T        Wrong
   3    t      T           T        Correct
   4    T      T           N        Wrong
   5    t      T           T        Correct
   6    T      T           T        Correct
   7    T      T           T        Correct
   8    T      T           N        Wrong
   9    t      T           T        Correct
  10    T      T           T        Correct
  11    T      T           T        Correct
  12    T      T           N        Wrong

Intuition on how well this works
Simple predictors work well on branches that repeat the same behavior in streaks; they work better when the streaks are longer.
They don't work well for random or data-dependent branches.
Data dependence can be common in some applications, but it's evil for branch prediction.

Accuracy of Different Schemes
[Chart: frequency of mispredictions across SPEC benchmarks (nasa7, matrix300, doducd, spice, fpppp, gcc, espresso, eqntott, li, tomcatv) for three predictors: a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) predictor. Misprediction rates range from roughly 0% to 11%.]

Branch prediction is hard
"Making predictions is hard, especially when they're about things that haven't happened yet." – Yogi Berra
Corollary: we're not going to find a perfect branch predictor.

Hybrid Branch Predictors
Tournament predictors: adaptively combine local and global predictors.
Different schemes work better for different branches.
[Diagram: a local predictor and a global predictor both feed a chooser predictor (which could itself be a 2-bit BHT); the chooser selects which taken/not-taken prediction to use.]

Branch prediction costs
The BTB and BHT are big arrays that run every single cycle.
They are on critical paths, and thus must run fast.
They can be big power hogs.
They will never work well if there are data-dependent branches.
Nonetheless, for good single-stream performance, a good branch predictor is essential.

More costs
"When you come to a fork in the road, take it." – Yogi Berra
The central idea of branch prediction is that doing nothing is bad.
Don't stall. Instead, speculatively execute a (hopefully pretty good) prediction.
But executing instructions takes energy, and if you have to kill them then that energy was wasted.
Unless your predictors are always right, branch prediction + speculative execution always wastes some energy on flushed instructions.
Nonetheless, the tradeoff is usually a good one.

How far should we go?
How many transistors should we spend on branch prediction?
The BP itself costs power and area, but does no computation.
A BP is necessary for good single-stream performance.
We already saw diminishing returns on bigger BHTs.
The difficult-to-predict branches are often data dependent; a better/bigger algorithm won't help much.
It would be really nice if we just didn't care about single-stream performance. But we do – usually.

Takeaways
Speculation is necessary to get reasonable single-stream performance; waiting for all branches to be resolved is too slow.
Branch prediction is needed for speculation to work well.
Good branch prediction is expensive and power hungry.
It's not nice to fool your branch predictor – and it makes your code run really slowly. Irregular (e.g., data-dependent) branches are the usual culprit.

BACKUP

Interrupts
Interrupts are hard!

Types of interrupts
What kinds of things can go wrong in a program?
Floating-point errors: divide by 0, underflow, overflow
Integer errors: divide by 0
Memory errors: access unaligned data, illegal address, virtual memory fault
Instruction errors: illegal instruction, reserved instruction
Others: user-requested breakpoint, I/O error, …

What do we do about it?
Terminate the program. Easy for the architect, not so nice for the user: it means your program crashes, with no ability to recover and keep going.
Resume after fixing the problem. Harder for the architect, but it's usually what the user wants.
Resuming is absolutely needed for, e.g., page faults and for user traps for single stepping.

Precise interrupts
An interrupt is precise if, after an instruction takes an interrupt, the machine state is such that:
All earlier instructions are completed.
The interrupting instruction seemingly never happened.
All later instructions seemingly never happened.
All interrupts are taken in program order.
Seems easy, right? It's not!

Interrupts can happen backwards
Cycle 3: the "Axx" takes an illegal-instruction interrupt, since there is no "Axx" instruction.
Cycle 4: the LDW takes an unaligned-memory interrupt, since R2=0x1000 and you cannot read a word from address 0x1003.
But the later instruction raised its interrupt first!
  Cycle           1   2   3   4    5    6
  LDW R3, R2, #3  IF  ID  EX  MEM  WB
  Axx R4, R3, R5      IF  ID  EX   MEM  WB
How do you solve this? Exceptions are only handled in WB.

Interrupts can still happen backwards!
The ADD.D and ADD are in separate pipelines; the floating-point pipe has more EX stages than the integer pipe.
Cycle 6: the ADD finishes and writes the register file.
Cycle 7: the ADD.D takes an overflow interrupt.
But the earlier instruction already changed the RF!
  Cycle           1   2   3    4    5    6    7    8
  ADD.D F3,F2,F1  IF  ID  EX1  EX2  EX3  EX4  EX5  WB
  ADD R4, R3, R5      IF  ID   EX   MEM  WB

Branch-delay slots
Consider the code:
  loop: stuff…
        …
        BEQ R2, R4, #loop
        LD  R9, 30(R8)
The LD is a delay-slot instruction. MIPS always executes it after the branch, even if the branch is taken!
What if the LD faults after the BEQ is taken?
We must rerun it and then jump to "loop".
But you're not re-executing the BEQ, so there's no reason to jump to loop.
Suddenly restarting after an interrupt is harder.

Summary of Exceptions
Precise interrupts are a headache!
Delayed branches make them a bigger headache.
Preview: OOO execution makes them a still-bigger headache – more ways to have something write back earlier than the exception.
Some machines punt on the problem:
Precise exceptions only for the integer pipe.
A special "precise mode" used for debugging (10x slower).

Parting quote
"Democracy is the worst form of government there is… except for all of the others." – Winston Churchill
Corollary: the same is true for imprecise interrupts.

Backup

Branch prediction methods
When is information about branches gathered/applied?
When the machine is designed.
When the program is compiled ("compile-time").
When a "training run" of the program is executed ("profile-based").
As the program is executing ("dynamic").
Section 3.3 in the textbook.

Interrupt Taxonomy (C.4)
Synchronous vs. asynchronous (HW error, I/O).
User request (exception?) vs. coerced.
User maskable vs. nonmaskable (ignorable).
Within vs. between instructions.
Resume vs. terminate.
The difficult exceptions are resumable interrupts within instructions: save the state, correct the cause, restore the state, continue execution.

Interrupt Taxonomy
  Exception Type              Sync/Async  User Req./Coerced  Maskable/Nonmask  Within/Between Insn  Resume/Terminate
  I/O Device Req.             Async       Coerced            Nonmask           Between              Resume
  Invoke O/S                  Sync        User               Nonmask           Between              Resume
  Tracing Instructions        Sync        User               Maskable          Between              Resume
  Breakpoint                  Sync        User               Maskable          Between              Resume
  Arithmetic Overflow         Sync        Coerced            Maskable          Within               Resume
  Page Fault (not in main m)  Sync        Coerced            Nonmask           Within               Resume
  Misaligned Memory           Sync        Coerced            Maskable          Within               Resume
  Mem. Protection Violation   Sync        Coerced            Nonmask           Within               Resume
  Using Undefined Insns       Sync        Coerced            Nonmask           Within               Terminate
  Hardware/Power Failure      Async       Coerced            Nonmask           Within               Terminate

Interrupts on Instruction Phases
Exceptions can occur in many different pipeline phases.
However, exceptions are only handled in WB. Why?
  load  IF  ID  EX  MEM  WB
  add       IF  ID  EX   MEM  WB
  Exception Type                   Phase(s) where it can occur
  Arithmetic Overflow              EX
  Page Fault (not in main memory)  IF, MEM
  Misaligned Memory                IF, MEM
  Mem. Protection Violation        IF, MEM

Tournament Predictors
A tournament predictor using, say, 4K 2-bit counters indexed by the local branch address chooses between:
Global predictor: 4K entries indexed by the history of the last 12 branches (2^12 = 4K); each entry is a standard 2-bit predictor.
Local predictor: a local history table of 1024 10-bit entries recording the last 10 outcomes of each branch, indexed by branch address; the pattern of the last 10 occurrences of that particular branch is then used to index a table of 1K entries with 3-bit saturating counters.
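A rough software sketch of the chooser mechanism in such a tournament predictor. The chooser here is a table of 2-bit counters indexed by the branch PC; the table size, function names, and exact update policy are illustrative assumptions, not details from the slide:

    #include <stdint.h>
    #include <stdbool.h>

    #define CHOOSER_ENTRIES 4096
    static uint8_t chooser[CHOOSER_ENTRIES];   /* 2-bit counters: 0..1 prefer local, 2..3 prefer global */

    static unsigned chooser_index(uint32_t pc) { return (pc >> 2) & (CHOOSER_ENTRIES - 1); }

    /* Select between the two component predictions for this branch. */
    bool tournament_select(uint32_t pc, bool local_pred, bool global_pred) {
        return (chooser[chooser_index(pc)] >= 2) ? global_pred : local_pred;
    }

    /* After the branch resolves, nudge the chooser toward whichever component
     * predictor was correct (only when exactly one of them was). */
    void tournament_update(uint32_t pc, bool taken, bool local_pred, bool global_pred) {
        uint8_t *c = &chooser[chooser_index(pc)];
        if (global_pred == taken && local_pred != taken && *c < 3) (*c)++;
        if (local_pred == taken && global_pred != taken && *c > 0) (*c)--;
    }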

Comparing Predictors (Fig. 2.8)
The advantage of a tournament predictor is its ability to select the right predictor for a particular branch.
This is particularly crucial for the integer benchmarks.
A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks.

Core i7 Branch Predictor
Two levels of predictors:
A simple 1st-level predictor.
A slower, more accurate 2nd-level predictor.
Each predictor has three components:
A simple 2-bit predictor (Appendix C).
A global history predictor.
A loop exit predictor.

Why predict? Speculative Execution
Execute beyond branch boundaries before the branch is resolved.
Correct speculation: the stall is avoided, the result is computed early, performance++.
Incorrect speculation: abort/squash incorrect instructions (complexity+), undo any incorrect state changes (complexity++).
The performance gain is weighed against the penalty.
Speculation accuracy = branch prediction accuracy.

Branch Direction Prediction
Basic idea: hope that the future behavior of the branch is correlated to its past behavior (loops, error-checking conditionals).
For a single branch PC:
Simplest possible idea: keep 1 bit around to indicate taken or not-taken.
2nd simplest idea: keep 2 bits around, as a saturating counter.
Store a "cache" of counters for recent PC addresses; use the low-order bits of the PC as the index into that cache.

Dynamic Branch Prediction
Why does prediction work?
The underlying algorithm has regularities.
The data being operated on has regularities.
The instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems.
Is dynamic branch prediction better than static branch prediction? It seems to be.
There are a small number of important branches in programs which have dynamic behavior.

Branch Prediction Buffer (branch history table, BHT)
A small memory indexed with the low bits of the branch instruction's address. Why the low bits?
Implementation:
A separate memory accessed during the IF phase, or
2 bits attached to each block in the instruction cache.
Caveats: you cannot separately size the I-cache and the BHT, and what about multiple branches in a cache line?
Does this help our simple 5-stage pipeline?
[Figure: 12 bits of the PC index a table of 2^12 = 4K entries, producing a taken/not-taken prediction.]

2-level Predictor with Global History
Global history: one Branch History Register (BHR).
Per-address/set history: a per-address/set Branch History Table holds many BHRs.
Concatenate the global k-bit shift register with PC bits to index the PHT.
[Figure: the k-bit global history is combined with PC bits to select a PHT entry, which supplies the taken/not-taken prediction.]

How to take an exception
Force a trap instruction on the next IF.
Squash younger instructions (turn off all writes to registers/memory for the faulting instruction and all instructions that follow it).
Save all processor state after the trap begins: PC-chain, PSW, condition codes, trap condition.
The PC-chain is the length of the branch delay plus 1.
Perform the trap/exception code, then restart where we left off.

Extra-credit problem
Pronounce your own name.
Pronounce "Donald Knuth."
Which of the following would make the best president?
Donald Trump
Donald Knuth
Donald Duck
Worth 1 point on the midterm.

Branch Prediction Performance
Parameters: branch: 20%, load: 20%, store: 10%, other: 50%.
75% of branches are taken.
Dynamic branch prediction: branches predicted with 95% accuracy.
Assuming a 2-cycle mis-prediction penalty: CPI = 1 + 20% * 5% * 2 = 1.02.
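Written out as a general formula (the 2-cycle mis-prediction penalty is the assumption implied by the slide's arithmetic):

$$\mathrm{CPI} = \mathrm{CPI}_{\mathrm{base}} + f_{\mathrm{branch}} \cdot (1 - \mathrm{accuracy}) \cdot \mathrm{penalty} = 1 + 0.20 \cdot 0.05 \cdot 2 = 1.02$$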

Dynamic Branch Prediction Components
Step #1: is it a branch? Easy after decode...
Step #2: is the branch taken or not taken? The direction predictor (applies to conditional branches only) predicts taken/not-taken.
Step #3: if the branch is taken, where does it go? The branch target predictor. Easy after decode…
[Figure: the branch predictor (BP) sits alongside the I$, register file, and D$ in the pipeline.]

Two-bit Saturating Counters, take 2
A slightly different version works better for some benchmarks.
[FSM diagram: a variant of the 2-bit counter FSM with the same four states (strongly/weakly taken, weakly/strongly not taken, encoded 00-11) but a different transition structure.]

Branch Target Buffer (BTB)
It's of little use predicting taken/not-taken early in ID if you don't know the target address until late in EX.
So we record past branch targets in a hardware structure.
Branch target buffer (BTB): "The last time branch X was taken, it went to address Y; so, in the future, if address X is fetched, fetch address Y next."
This works well because most control instructions use direct targets: the target is encoded in the instruction itself, so the "taken" target is the same every time.
What about indirect targets? The target is held in a register, so it can be different each time.
Two indirect-call idioms: dynamically linked functions (DLLs) – target always the same; dynamically dispatched (virtual) functions – hard, but uncommon.
Two indirect unconditional-jump idioms: switches – hard, but uncommon; function returns – hard and common, but… (see the return address stack below).

Implementation
A small RAM: address = PC, data = target PC.
Accessed at fetch, in parallel with instruction memory: predicted-target = BTB[hash(PC)].
Updated whenever target != predicted-target: BTB[hash(PC)] = correct-target.
Very similar to the BHT, but each entry is a full address, not just one taken/not-taken bit.
Aliasing? No problem, this is only a prediction.
[Figure: PC bits index the BTB, which supplies the predicted target address.]
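A minimal sketch of that lookup/update logic, assuming a direct-mapped, untagged table; the entry count and function names are illustrative:

    #include <stdint.h>

    #define BTB_ENTRIES 512                       /* assumed size; the slide doesn't fix one */

    static uint32_t btb[BTB_ENTRIES];             /* data = predicted target PC */

    static unsigned btb_index(uint32_t pc) { return (pc >> 2) & (BTB_ENTRIES - 1); }

    /* Looked up at fetch, in parallel with instruction memory. */
    uint32_t btb_predict_target(uint32_t pc) { return btb[btb_index(pc)]; }

    /* Update only when the real target disagrees with what we predicted.
     * Aliasing between branches just hurts accuracy; it is still only a prediction. */
    void btb_update(uint32_t pc, uint32_t actual_target) {
        unsigned i = btb_index(pc);
        if (btb[i] != actual_target)
            btb[i] = actual_target;
    }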

Return Address Stack (RAS)
Call instruction? RAS[TopOfStack++] = PC+4
Return instruction? predicted-target = RAS[--TopOfStack]
Q: how can you tell whether an instruction is a call/return before decoding it? Accessing the RAS on every instruction would waste power.
Answer: another predictor (or put them in the BTB, marked as "return"); or pre-decode bits in the instruction memory, written when the instruction is first executed.
[Figure: the fetch path combines PC+4, the BTB (tag match selects the predicted target), and the RAS.]
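A minimal sketch of the RAS push/pop described above; the stack depth and the circular-wrap behavior are illustrative assumptions:

    #include <stdint.h>

    #define RAS_DEPTH 16u               /* assumed depth (power of 2 so the index wraps cleanly) */

    static uint32_t ras[RAS_DEPTH];
    static unsigned tos;                /* top-of-stack counter; unsigned wraparound keeps it circular */

    /* On a call: push the return address (PC + 4). */
    void ras_push_call(uint32_t pc) {
        ras[tos % RAS_DEPTH] = pc + 4;
        tos++;
    }

    /* On a return: pop the predicted return target. */
    uint32_t ras_predict_return(void) {
        tos--;
        return ras[tos % RAS_DEPTH];
    }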

Putting It All Together
The BTB and the branch direction predictor are both accessed during fetch.
If the branch prediction is correct, then there is no penalty for branches!
[Figure: PC+4, the BTB (tag match selects the predicted target), the RAS, and the BHT (taken/not-taken) together choose the next fetch address.]

Branch Prediction Performance
Dynamic branch prediction, with 20% of instructions being branches:
Simple predictor: branches predicted with 75% accuracy. CPI = 1 + (20% * 25% * 2) = 1.1.
More advanced predictor: 95% accuracy. CPI = 1 + (20% * 5% * 2) = 1.02.
Branch mis-predictions are still a big problem, though:
Pipelines are long: a typical mis-prediction penalty is 10+ cycles.
For cores that do more per cycle, mis-predictions are even more costly (later).
There are many other methods for building a better branch predictor, including correlating predictors and global history tables (Chapter 3).

BHT Accuracy
Mispredictions happen because either:
We made the wrong guess for that branch, or
We got the branch history of the wrong branch due to aliasing (a 4K-entry table is shown here).
[Chart: misprediction rates for integer and floating-point benchmarks; put another way, prediction accuracy is between 82% and 99%.]

ADVANCED BRANCH PREDICTION (3.3)

Correlating Predictors
The 2-bit scheme only looks at a branch's own history to predict its behavior.
What if we use other branches to predict it as well?
  if (aa==2) aa=0;    // Branch #1
  if (bb==2) bb=0;    // Branch #2
  if (aa!=bb) {..}    // Branch #3
Does branch #3 depend on the outcome of #1 and #2? Yes. If #1 and #2 are both taken, then #3 will not be taken.
Sometimes the past can predict the future with certainty! This is called "global history."

Another example of global history
Consider our inner loop again. The branch pattern is (T,T,T,N), repeating.
Consider the following global rule:
If the last branches were "T,T,T", then predict not taken.
Else predict taken.
This works perfectly – the trick is having the predictor figure out that rule.
Let's make a predictor that looks at both the history at this branch and the recent outcomes of all branches (including this one).

Branch History Register
A simple shift register: shift in branch outcomes as they occur (1 = the branch was taken, 0 = not taken).
A k-bit BHR gives 2^k patterns. Now we know the outcome of the last k branches.
Use these patterns to address into the Pattern History Table (PHT). Effectively, the PHT is a separate BHT for each global pattern.

Generic Adaptive Branch Prediction (Correlating Predictor)
A two-level BP requires two main components:
Branch history register (BHR): recent outcomes of branches (the last k branches encountered).
Pattern history table (PHT): branch behavior for recent occurrences of the specific pattern of these k branches.
In effect, we concatenate the BHR with branch-PC bits to index the PHT.
[Figure: we've drawn a (2,2) predictor – a 2-bit BHR selects among pattern history tables of 2^12 = 4K entries each, indexed by 12 bits of the PC; the selected 2-bit counter supplies the taken/not-taken prediction.]
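A minimal sketch of a (2,2)-style two-level predictor in which the global BHR is concatenated with low-order PC bits to index a table of 2-bit counters; the bit widths, table size, and function names here are illustrative assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define HIST_BITS 2                            /* k = 2 bits of global history: a (2,2) predictor */
    #define PC_BITS   10                           /* assumed number of PC bits used in the index */
    #define PHT_ENTRIES (1u << (HIST_BITS + PC_BITS))

    static uint8_t  pht[PHT_ENTRIES];              /* 2-bit saturating counters */
    static uint32_t bhr;                           /* global branch history register */

    /* Concatenate the global history with low-order PC bits to form the PHT index. */
    static unsigned pht_index(uint32_t pc) {
        uint32_t pc_part = (pc >> 2) & ((1u << PC_BITS) - 1);
        uint32_t hist    = bhr & ((1u << HIST_BITS) - 1);
        return (hist << PC_BITS) | pc_part;
    }

    bool twolevel_predict(uint32_t pc) { return pht[pht_index(pc)] >= 2; }

    void twolevel_update(uint32_t pc, bool taken) {
        uint8_t *c = &pht[pht_index(pc)];          /* counter chosen by the pre-update history */
        if (taken) { if (*c < 3) (*c)++; }         /* saturating 2-bit counter */
        else       { if (*c > 0) (*c)--; }
        bhr = (bhr << 1) | (taken ? 1u : 0u);      /* shift the outcome into the BHR */
    }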

Pattern History Table
It has 2^k entries, usually with a 2-bit counter for each prediction.
The BHR is used to address the PHT. I.e., a (k,2) PHT is an array of 2^k BHTs, each of which has a 2-bit counter.

(2,1) correlated predictor
Correlated (two-level) predictor [Patt 1991].
Exploits the observation that branch outcomes are correlated.
Maintains a separate prediction per (PC, BHR) pair.
Branch history register (BHR): recent branch outcomes.
Simple working example: assume the program has one branch.
BHT: one 1-bit DIRP entry. BHT + 2-bit BHR: 2^2 = 4 one-bit DIRP entries.
The trace below follows this predictor on the same (T,T,T,N) branch pattern.

(2,1) correlated predictor: what went wrong?
  Time  BHR  PHT[NN] PHT[NT] PHT[TN] PHT[TT]  Prediction  Outcome  Result
   1    NN   N       N       N       N        N           T        Wrong
   2    NT   T       N       N       N        N           T        Wrong
   3    TT   T       T       N       N        N           T        Wrong
   4    TT   T       T       N       T        T           N        Wrong
   5    TN   T       T       N       N        N           T        Wrong
   6    NT   T       T       T       N        T           T        Correct
   7    TT   T       T       T       N        N           T        Wrong
   8    TT   T       T       T       T        T           N        Wrong
   9    TN   T       T       T       N        T           T        Correct
  10    NT   T       T       T       N        T           T        Correct
  11    TT   T       T       T       N        N           T        Wrong
  12    TT   T       T       T       T        T           N        Wrong
Track the mistakes made under each history pattern: for NN, NT, and TN we only made one mistake each, but TT kept making mistakes. Why?
After a TT history we see both T,T,T and also T,T,N. Every time we went from one to the other we got the wrong answer and had to re-train our predictor.
How do we fix this simply? Go to a (3,1) predictor.

(3,1) correlated predictor
A 3-bit BHR, with 2^3 DIRP entries per branch; each entry is a one-bit BHT (not a 2-bit saturating counter).
The trace below follows the same (T,T,T,N) branch pattern.

(3,1) correlated predictor
  Time  BHR  PHT[NNN] [NNT] [NTN] [NTT] [TNN] [TNT] [TTN] [TTT]  Prediction  Outcome  Result
   1    NNN  N        N     N     N     N     N     N     N      N           T        Wrong
   2    NNT  T        N     N     N     N     N     N     N      N           T        Wrong
   3    NTT  T        T     N     N     N     N     N     N      N           T        Wrong
   4    TTT  T        T     N     T     N     N     N     N      N           N        Correct
   5    TTN  T        T     N     T     N     N     N     N      N           T        Wrong
   6    TNT  T        T     N     T     N     N     T     N      N           T        Wrong
   7    NTT  T        T     N     T     N     T     T     N      T           T        Correct
   8    TTT  T        T     N     T     N     T     T     N      N           N        Correct
   9    TTN  T        T     N     T     N     T     T     N      T           T        Correct
  10    TNT  T        T     N     T     N     T     T     N      T           T        Correct
  11    NTT  T        T     N     T     N     T     T     N      T           T        Correct
  12    TTT  T        T     N     T     N     T     T     N      N           N        Correct
With 3 bits of history, the history is now enough to fully predict the branch result. Each history pattern causes at most one misprediction; once we train the predictor for a given history, it never makes another mistake.

Hardware Costs of 2-level predictions
An (m,n) predictor has m bits of global history and an n-bit predictor per entry.
Total storage = 2^m * n * (number of prediction entries) bits.
Example: m = 2 bits of history, n = 2 bits of predictor per entry, 1K prediction entries.
A (2,2) predictor with 1K prediction entries: 2^2 * 2 * 1024 = 8K bits.

Outline
Control Hazards and Branch Prediction
Interrupts
Memory Hierarchy
Cache review