Slide1
Computer Architecture: A Constructive Approach
Branch Prediction - 1
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology
April 9, 2012
L16-1
http://csg.csail.mit.edu/6.S078
Slide2
Control Flow Penalty
[Figure: pipeline diagram. Stages: Fetch, Decode, Execute, Commit. Structures: PC, I-cache, Fetch Buffer, Issue Buffer, Func. Units, Result Buffer, Arch. State. Annotations: "Next fetch started" and "Branch executed".]
Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!
How much work is lost if the pipeline doesn't follow the correct instruction flow?
~ Loop length x pipeline width
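As a rough worked example of the estimate above, the sketch below multiplies loop length by pipeline width. The function name and the sample numbers (10 stages, 4-wide fetch) are illustrative assumptions, not figures from the lecture.

```python
def lost_slots(loop_length_stages, pipeline_width):
    """Instruction slots wasted per misprediction: loop length x width."""
    return loop_length_stages * pipeline_width


# Hypothetical machine: 10 stages between next-PC calculation and
# branch resolution, fetching 4 instructions per cycle.
print(lost_slots(10, 4))  # 40
```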
Slide3
Average Run-Length between Branches
Average dynamic instruction mix from SPEC92:

            SPECint92   SPECfp92
ALU           39 %        13 %
FPU Add                   20 %
FPU Mult                  13 %
load          26 %        23 %
store          9 %         9 %
branch        16 %         8 %
other         10 %        12 %

SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92: doduc, ear, hydro2d, mdljdp2, su2cor

What is the average run-length between branches?
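One way to approach the question, assuming branches are spread evenly through the dynamic instruction stream: the average run-length is the reciprocal of the branch fraction. The sketch below applies that to the branch percentages in the table; the function name is an assumption.

```python
def average_run_length(branch_fraction):
    """Average number of instructions per branch (the branch included)."""
    return 1.0 / branch_fraction


# Branch fractions taken from the SPEC92 instruction mix.
branch_mix = {"SPECint92": 0.16, "SPECfp92": 0.08}
for suite, fraction in branch_mix.items():
    print(suite, average_run_length(fraction))
```

Under that assumption there is roughly one branch every 6 instructions in SPECint92 and one every 12 in SPECfp92.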
Slide4
MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1. Is the preceding instruction a taken branch?
2. If so, what is the target address?

Instruction   Taken known?         Target known?
J             After Inst. Decode   After Inst. Decode
JR            After Inst. Decode   After Reg. Fetch
BEQZ/BNEZ     After Exec           After Inst. Decode

Slide5
Currently our simple pipelined architecture does very simple branch prediction
What is it?
Branch is predicted not taken: pc, pc+4, pc+8, …
Can we do better?
Slide6
Branch Prediction Bits
Assume 2 BP bits per instruction
Use saturating counter

1 1   Strongly taken
1 0   Weakly taken
0 1   Weakly ¬taken
0 0   Strongly ¬taken

On taken: move one state toward 1 1
On ¬taken: move one state toward 0 0
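The four states can be modeled in a few lines. This is a behavioral sketch in Python rather than hardware; the class and method names are assumptions made for illustration.

```python
class TwoBitCounter:
    """2-bit saturating counter: 0b00 strongly not taken ... 0b11 strongly taken."""

    def __init__(self, state=0b00):
        self.state = state

    def predict_taken(self):
        # The high bit gives the prediction: 1x = taken, 0x = not taken.
        return self.state >= 0b10

    def update(self, taken):
        # Count up on taken, down on not taken, saturating at both ends.
        if taken:
            self.state = min(0b11, self.state + 1)
        else:
            self.state = max(0b00, self.state - 1)


counter = TwoBitCounter(state=0b11)
counter.update(False)           # one not-taken outcome: 11 -> 10
print(counter.predict_taken())  # True: still predicts taken
```

The point of the second bit is visible in the usage lines: a single not-taken outcome (for example a loop exit) does not flip the prediction.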
Slide7
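Branch History Table (BHT)
[Figure: k bits of the Fetch PC (above the two low-order 0 0 bits) form the BHT Index into a 2^k-entry BHT, 2 bits/entry, which outputs Taken/¬Taken?. In parallel the I-Cache supplies the Instruction; its Opcode answers Branch? and its offset is added (+) to the PC to form the Target PC.]
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions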
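A minimal behavioral sketch of the BHT lookup, assuming word-aligned PCs, counters that start at strongly not taken, and a MIPS-style target of pc+4+offset. All names here are illustrative and not taken from the course code.

```python
class BranchHistoryTable:
    """2**k two-bit saturating counters indexed by PC bits [k+1:2]."""

    def __init__(self, k=12):  # k=12 gives a 4K-entry table
        self.mask = (1 << k) - 1
        self.counters = [0] * (1 << k)  # assumption: start strongly not taken

    def index(self, pc):
        return (pc >> 2) & self.mask  # drop the two always-zero low bits

    def predict_taken(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def predict_next_pc(bht, pc, is_branch, offset):
    """Decode-stage prediction: opcode and offset come from the instruction."""
    target = pc + 4 + offset  # assumption: MIPS-style PC-relative target
    return target if is_branch and bht.predict_taken(pc) else pc + 4
```

Note that `predict_next_pc` needs `is_branch` and `offset`, which only exist once the instruction has been fetched and decoded.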
Slide8
Where does BHT fit in the processor pipeline?
BHT can only be used after instruction decode
What should we do at the fetch stage?
Need a mechanism to update the BHT
  where does the update information come from?
Slide9
Overview of branch prediction
[Figure: pipeline PC → Decode → Reg Read → Execute, with prediction loops feeding back to the PC]
PC: Need next PC immediately
Decode: Instr type, PC relative targets available
Reg Read: Simple conditions, register targets available
Execute: Complex conditions available
Loops: one tight loop (Next Addr Pred) and three loose loops (BP, JMP, Ret and the later stages)
Best predictors reflect program behavior
Slide10
Next Address Predictor (NAP): first attempt
[Figure: k bits of the PC index a Branch Target Buffer (2^k entries) in parallel with iMem; each entry holds BPb and a predicted target, and outputs BP and target.]
BP bits are stored with the predicted target address.
IF stage: nPC = if (BP=taken) then target else pc+4
later: check prediction; if wrong then kill the instruction and update BTB & BPb, else update BPb
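The IF-stage rule and the later check can be sketched as a small behavioral model. Assumptions, none of them from the slides: Python instead of hardware, word-aligned indexing, 2-bit counters as the BP bits, and a target field that is rewritten only when the branch was actually taken.

```python
class FirstAttemptNAP:
    """Untagged table of (BP counter, predicted target), indexed by k PC bits."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.bp = [0] * (1 << k)       # 0..3; 2 and 3 mean "taken"
        self.target = [0] * (1 << k)

    def index(self, pc):
        return (pc >> 2) & self.mask   # assumption: word-aligned PCs

    def predict(self, pc):
        # IF stage: nPC = if (BP=taken) then target else pc+4
        i = self.index(pc)
        return self.target[i] if self.bp[i] >= 2 else pc + 4

    def resolve(self, pc, predicted, taken, actual_target):
        """Later check; returns True if the fetched instruction must be killed."""
        i = self.index(pc)
        # BP bits are updated whether or not the prediction was right.
        self.bp[i] = min(3, self.bp[i] + 1) if taken else max(0, self.bp[i] - 1)
        correct = actual_target if taken else pc + 4
        if predicted != correct:
            if taken:
                self.target[i] = actual_target
            return True
        return False


nap = FirstAttemptNAP(k=4)
for _ in range(3):
    guess = nap.predict(0x40)
    print(hex(guess), nap.resolve(0x40, guess, True, 0x10))
```

With counters starting at 0, the loop above mispredicts twice before the entry for 0x40 starts supplying the target 0x10.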
Slide11
Address Collisions
Assume a 128-entry NAP

NAP entry:           BPb = take, target = 236
Instruction Memory:  132  Jump 100
                     1028 Add .....

What will be fetched after the instruction at 1028?
NAP prediction = 236
Correct target = 1032
kill PC=236 and fetch PC=1032
Is this a common occurrence?
Can we avoid these bubbles?
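The collision can be checked numerically. The sketch assumes the 128-entry NAP is indexed by the low 7 bits of the PC (pc mod 128), which is the indexing under which 132 and 1028 share an entry; the helper names are hypothetical.

```python
NAP_ENTRIES = 128


def nap_index(pc):
    # Assumption: the index is the low 7 bits of the PC.
    return pc % NAP_ENTRIES


# Entry left behind by "132 Jump 100": BPb = take, target = 236.
nap = {nap_index(132): ("take", 236)}


def nap_predict(pc):
    bp, target = nap.get(nap_index(pc), ("not take", None))
    return target if bp == "take" else pc + 4


print(nap_index(132), nap_index(1028))  # 4 4: same entry
print(nap_predict(1028))                # 236, but the Add should fall through to 1032
```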
Slide12
Use NAP for Control Instructions only
NAP contains useful information for branch and jump instructions only
  Do not update it for other instructions
For all other instructions the next PC is (PC)+4!
How to achieve this effect without decoding the instruction?
Slide13
Branch Target Buffer (BTB): a special form of NAP
[Figure: k bits of the PC index a 2^k-entry direct-mapped BTB in parallel with the I-Cache; each entry holds Valid, Entry PC, and predicted target; the Entry PC is compared (=) with the PC, and valid and match select the target PC.]
Keep the (pc, predicted pc) in the BTB
pc+4 is predicted if no pc match is found
BTB is updated only for branches and jumps
Permits nextPC to be determined before instruction is decoded
Slide14
Consulting BTB Before Decoding
Instruction Memory:  132  Jump 100
                     1028 Add .....
BTB entry:           entry PC = 132, BPb = take, target = 236

The match for pc=1028 fails and 1028+4 is fetched
  eliminates false predictions after ALU instructions
BTB contains entries only for control transfer instructions
  more room to store branch targets
Even very small BTBs are very effective
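A behavioral sketch of the tagged lookup, replaying the 132/1028 example. Assumptions: Python instead of hardware, a 128-entry table indexed by the low bits of the PC so that the two addresses share an entry, and no BP bits (a matching entry always predicts its target). The class name is illustrative.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry holds (valid, entry PC, predicted target)."""

    def __init__(self, entries=128):
        self.entries = entries
        self.table = [(False, 0, 0)] * entries

    def index(self, pc):
        # Assumption: low bits of the PC select the entry, so that 132 and
        # 1028 map to the same entry.
        return pc % self.entries

    def predict(self, pc):
        valid, entry_pc, target = self.table[self.index(pc)]
        return target if valid and entry_pc == pc else pc + 4

    def update(self, pc, target):
        # Called only for branches and jumps.
        self.table[self.index(pc)] = (True, pc, target)


btb = BranchTargetBuffer()
btb.update(132, 236)      # "132 Jump 100" installs its target
print(btb.predict(132))   # 236
print(btb.predict(1028))  # 1032: the tag match fails, so pc+4 is used
```

Because the entry PC is compared before the target is used, the lookup needs only the PC and can run in the fetch stage.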
Slide15
Observations
There is a plethora of branch prediction schemes; their importance grows with the depth of the processor pipeline
Processors often use more than one prediction scheme
It is usually easy to understand the data structures required to implement a particular scheme
It takes considerably more effort to understand how a particular scheme, with its lookup and updates, is integrated in the pipeline and how various schemes interact with each other
Slide16
Plan
We will begin with a very simple 2-stage pipeline and integrate a simple BTB scheme in it
We will extend the design to a multistage pipeline and integrate at least one more predictor, say BHT, in the pipeline (next lecture)
Revisit the simple two-stage pipeline without branch prediction
Slide17
Decoupled Fetch and Execute
[Figure: Fetch → ir → Execute carrying <instructions, pc, epoch>; Execute → nextPC → Fetch carrying <updated pc>]
Fetch sends instructions to Execute along with pc and other control information
Execute sends information about the target pc to Fetch, which updates pc and other control registers whenever it looks at the nextPC fifo
Slide18
A solution using epoch
Add fEpoch and eEpoch registers to the processor state; initialize them to the same value
The epoch changes whenever Execute determines that the pc prediction is wrong. This change is reflected immediately in eEpoch and eventually in fEpoch via the nextPC FIFO
Associate the fEpoch with every instruction when it is fetched
In the execute stage, reject, i.e., kill, the instruction if its epoch does not match eEpoch
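The epoch scheme can be traced with a small cycle-by-cycle model. This is a sketch under stated assumptions, not the course's implementation: it is written in Python, Execute runs before Fetch in each simulated cycle, the instruction encoding (a dict of pc to tuples) is made up, and every address missing from the dict is treated as a nop.

```python
from collections import deque


def run_epoch_pipeline(program, cycles):
    """Two-stage fetch/execute model; returns (executed pcs, killed pcs)."""
    pc, f_epoch, e_epoch = 0, False, False
    ir = deque()       # Fetch -> Execute: (pc, epoch, inst)
    next_pc = deque()  # Execute -> Fetch: redirect target
    executed, killed = [], []
    for _ in range(cycles):
        # Execute: kill instructions whose epoch does not match eEpoch.
        if ir:
            ipc, epoch, inst = ir.popleft()
            if epoch != e_epoch:
                killed.append(ipc)
            else:
                executed.append(ipc)
                if inst[0] == "jmp":        # Fetch guessed pc+4: wrong
                    next_pc.append(inst[1])
                    e_epoch = not e_epoch   # visible to Execute at once
        # Fetch: tag the instruction with fEpoch, then pick the next pc.
        ir.append((pc, f_epoch, program.get(pc, ("nop",))))
        if next_pc:
            pc = next_pc.popleft()
            f_epoch = not f_epoch           # Fetch catches up
        else:
            pc += 4
    return executed, killed


program = {4: ("jmp", 16)}  # every other address holds a nop
print(run_epoch_pipeline(program, cycles=6))  # ([0, 4, 16, 20], [8])
```

The instruction at pc 8 is fetched with the old epoch while the jump is executing, so Execute discards it; the instruction at 16 carries the new epoch and is accepted.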
Slide19
Two-Stage pipeline: A robust two-rule solution
[Figure: PC (+4) → Inst Memory → ir (Pipeline FIFO) → Decode / Register File / Execute / Data Memory; Execute → nextPC (Bypass FIFO) → PC; fEpoch and eEpoch registers.]
Either fifo can be a normal (>1 element) fifo
Slide20
Two-stage pipeline Decoupled
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile rf <- mkRFile;
  IMemory iMem <- mkIMemory;
  DMemory dMem <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool) fEpoch <- mkReg(False);
  Reg#(Bool) eEpoch <- mkReg(False);
  FIFOF#(Addr) nextPC <- mkBypassFIFOF;

  rule doFetch (ir.notFull);      // explicit guard
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode{pc:pc, epoch:fEpoch, inst:inst});
    if(nextPC.notEmpty) begin
      pc <= nextPC.first;
      fEpoch <= !fEpoch;
      nextPC.deq;
    end
    else
      pc <= pc + 4;               // simple branch prediction
  endrule
Slide21
Two-stage pipeline Decoupled cont
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc;
    let inst = ir.first.inst;
    if(ir.first.epoch==eEpoch) begin
      let eInst = decodeExecute(irpc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.brTaken) begin
        nextPC.enq(eInst.addr);
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule
Slide22
Two-Stage pipeline with a Branch Predictor
[Figure: PC → Inst Memory → ir → Decode / Register File / Execute / Data Memory; the Branch Predictor supplies ppc to the PC and to ir; Execute → nextPC → PC; fEpoch and eEpoch registers.]
Slide23
Branch Predictor Interface
interface NextAddressPredictor;
  method Addr prediction(Addr pc);
  method Action update(Addr pc, Addr target);
endinterface
Slide24
Null Branch Prediction
module mkNeverTaken(NextAddressPredictor);
  method Addr prediction(Addr pc);
    return pc+4;
  endmethod
  method Action update(Addr pc, Addr target);
    noAction;
  endmethod
endmodule
Replaces PC+4 with …
Already implemented in the pipeline
Right most of the time
Why?
Slide25
Branch Target Prediction (BTB)
module mkBTB(NextAddressPredictor);
  RegFile#(LineIdx, Addr) tagArr <- mkRegFileFull;
  RegFile#(LineIdx, Addr) targetArr <- mkRegFileFull;
  method Addr prediction(Addr pc);
    LineIdx index = truncate(pc >> 2);
    let tag = tagArr.sub(index);
    let target = targetArr.sub(index);
    if (tag==pc) return target; else return (pc+4);
  endmethod
  method Action update(Addr pc, Addr target);
    LineIdx index = truncate(pc >> 2);
    tagArr.upd(index, pc);
    targetArr.upd(index, target);
  endmethod
endmodule
Slide26
Two-stage pipeline + BP
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile rf <- mkRFile;
  IMemory iMem <- mkIMemory;
  DMemory dMem <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool) fEpoch <- mkReg(False);
  Reg#(Bool) eEpoch <- mkReg(False);
  FIFOF#(Tuple2#(Addr,Addr)) nextPC <- mkBypassFIFOF;
  NextAddressPredictor bpred <- mkNeverTaken;   // Some target predictor

The definition of TypeFetch2Decode is changed to include predicted pc
typedef struct {
  Addr pc;
  Addr ppc;
  Bool epoch;
  Data inst;
} TypeFetch2Decode deriving (Bits, Eq);
Slide27
Two-stage pipeline + BP Fetch rule
  rule doFetch (ir.notFull);
    let ppc = bpred.prediction(pc);
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode{pc:pc, ppc:ppc, epoch:fEpoch, inst:inst});
    if(nextPC.notEmpty) begin
      match{.ipc, .ippc} = nextPC.first;
      pc <= ippc;
      fEpoch <= !fEpoch;
      nextPC.deq;
      bpred.update(ipc, ippc);
    end
    else
      pc <= ppc;
  endrule
Slide28
Two-stage pipeline + BP Execute rule
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc;
    let inst = ir.first.inst;
    let irppc = ir.first.ppc;
    if(ir.first.epoch==eEpoch) begin
      let eInst = decodeExecute(irpc, irppc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.missPrediction) begin
        nextPC.enq(tuple2(irpc, eInst.brTaken ? eInst.addr : irpc+4));
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule
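To see what the predictor buys, the two rules can be mimicked in a behavioral model and run with either predictor. Everything here is an assumption made for illustration: Python instead of Bluespec, Execute running before Fetch in each simulated cycle, a made-up instruction encoding with a counted "loop" branch and a "halt", and a 16-entry table for the BTB.

```python
from collections import deque


class NeverTaken:
    def prediction(self, pc):
        return pc + 4

    def update(self, pc, target):
        pass


class BTB:
    def __init__(self, k=4):
        self.mask = (1 << k) - 1
        self.entries = {}  # index -> (entry pc, target)

    def prediction(self, pc):
        tag, target = self.entries.get((pc >> 2) & self.mask, (None, None))
        return target if tag == pc else pc + 4

    def update(self, pc, target):
        self.entries[(pc >> 2) & self.mask] = (pc, target)


def run(bpred, program, count, max_cycles=1000):
    """Returns (committed, killed) instruction counts for a counted loop."""
    pc, f_epoch, e_epoch = 0, False, False
    ir, next_pc = deque(), deque()
    committed = killed = 0
    for _ in range(max_cycles):
        if ir:  # doExecute
            ipc, ppc, epoch, inst = ir.popleft()
            if epoch != e_epoch:
                killed += 1
            else:
                committed += 1
                if inst[0] == "halt":
                    return committed, killed
                taken = False
                if inst[0] == "loop":       # taken until the count runs out
                    count -= 1
                    taken = count > 0
                actual = inst[1] if taken else ipc + 4
                if actual != ppc:           # missPrediction
                    next_pc.append((ipc, actual))
                    e_epoch = not e_epoch
        ppc = bpred.prediction(pc)          # doFetch
        ir.append((pc, ppc, f_epoch, program.get(pc, ("nop",))))
        if next_pc:
            ipc, ippc = next_pc.popleft()
            pc, f_epoch = ippc, not f_epoch
            bpred.update(ipc, ippc)
        else:
            pc = ppc
    raise RuntimeError("program did not halt")


program = {8: ("loop", 0), 12: ("halt",)}  # nops at 0 and 4
print(run(NeverTaken(), program, count=10))  # (31, 9)
print(run(BTB(), program, count=10))         # (31, 2)
```

In this model the never-taken predictor kills one instruction on each of the nine taken iterations, while the BTB mispredicts only on the first iteration and on the loop exit.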
Slide29
Execute Function
function ExecInst exec(DecodedInst dInst, Data rVal1, Data rVal2, Addr pc, Addr ppc);
  ExecInst einst = ?;
  let aluVal2 = (dInst.immValid) ? dInst.imm : rVal2;
  let aluRes = alu(rVal1, aluVal2, dInst.aluFunc);
  let brAddr = brAddrCal(pc, rVal1, dInst.iType, dInst.imm);
  einst.itype = dInst.iType;
  einst.addr = memType(dInst.iType) ? aluRes : brAddr;
  einst.data = dInst.iType==St ? rVal2 : aluRes;
  einst.brTaken = aluBr(rVal1, aluVal2, dInst.brComp);
  einst.missPrediction = einst.brTaken ? brAddr!=ppc : (pc+4)!=ppc;
  einst.rDst = dInst.rDst;
  return einst;
endfunction
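The misprediction test in the function reduces to comparing the predicted pc with the real next pc. A minimal Python restatement, with parameter names chosen to mirror the Bluespec fields (the function itself is hypothetical):

```python
def miss_prediction(br_taken, br_addr, pc, ppc):
    """True when the predicted pc (ppc) differs from the real next pc."""
    actual_next_pc = br_addr if br_taken else pc + 4
    return actual_next_pc != ppc


print(miss_prediction(True, 0x20, 0x40, 0x44))   # True: taken, predicted pc+4
print(miss_prediction(False, 0x20, 0x40, 0x44))  # False: fell through as predicted
```

A non-branch instruction is covered by the second case: it is never taken, so it counts as mispredicted only if the predictor supplied something other than pc+4.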