Presentation Transcript

Slide 1

Computer Architecture: A Constructive Approach
Branch Prediction - 1
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology

April 9, 2012

L16-1

http://csg.csail.mit.edu/6.S078

Slide 2

Control Flow Penalty

[Figure: processor pipeline from PC/Fetch through I-cache, Fetch Buffer, Decode, Issue Buffer, Func. Units, Result Buffer, and Commit to Arch. State; the next fetch is started long before the branch is executed.]

Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!

How much work is lost if the pipeline doesn't follow the correct instruction flow?
~ Loop length x pipeline width

Slide 3

Average Run-Length between Branches

Average dynamic instruction mix from SPEC92:

                SPECint92   SPECfp92
  ALU              39 %       13 %
  FPU Add           -         20 %
  FPU Mult          -         13 %
  load             26 %       23 %
  store             9 %        9 %
  branch           16 %        8 %
  other            10 %       12 %

SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92:  doduc, ear, hydro2d, mdljdp2, su2cor

What is the average run-length between branches?
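A rough answer (not spelled out on the slide, but it follows from the mix above): with about 16 % branches in SPECint92 the average run-length between branches is roughly 1/0.16, about 6 instructions; with about 8 % branches in SPECfp92 it is roughly 12.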

Slide 4

MIPS Branches and Jumps

Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1. Is the preceding instruction a taken branch?
2. If so, what is the target address?

  Instruction   Taken known?         Target known?
  J             After Inst. Decode   After Inst. Decode
  JR            After Inst. Decode   After Reg. Fetch
  BEQZ/BNEZ     After Exec           After Inst. Decode

Slide 5

Currently our simple pipelined architecture does very simple branch prediction.
What is it?
Branch is predicted not taken: pc, pc+4, pc+8, ...
Can we do better?

Slide 6

Branch Prediction Bits

Assume 2 BP bits per instruction; use a saturating counter.

  1 1   Strongly taken
  1 0   Weakly taken
  0 1   Weakly ¬taken
  0 0   Strongly ¬taken

[Figure: state-transition diagram; each taken branch moves the counter up toward "Strongly taken", each ¬taken branch moves it down toward "Strongly ¬taken".]
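As a minimal sketch (not from the lecture) of the saturating update, assuming the two BP bits are kept as a Bit#(2) with 3 = strongly taken and 0 = strongly ¬taken:

  // Hypothetical helper, not the lecture's code: 2-bit saturating counter update.
  function Bit#(2) updateBP(Bit#(2) cnt, Bool taken);
    return taken ? ((cnt == 3) ? cnt : cnt + 1)   // saturate at strongly taken
                 : ((cnt == 0) ? cnt : cnt - 1);  // saturate at strongly ¬taken
  endfunction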

Slide 7

Branch History Table (BHT)

[Figure: k bits of the fetch PC index a 2^k-entry BHT with 2 bits per entry, which answers Taken/¬Taken; the opcode and offset fetched from the I-Cache are added to the fetch PC to form the target PC.]

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
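For concreteness, a BHT along these lines could be sketched in BSV roughly as follows. This is an assumption-laden sketch, not the lecture's code: the Bht interface, BhtIndex type, and method names are made up for illustration, and the course's Addr type is assumed to be in scope.

  import RegFile::*;

  typedef Bit#(12) BhtIndex;   // 4K entries, as on the slide

  interface Bht;
    method Bool taken(Addr pc);                    // lookup: predict Taken/¬Taken
    method Action train(Addr pc, Bool wasTaken);   // update on the actual outcome
  endinterface

  module mkBht(Bht);
    // 2-bit counters; contents are uninitialized at reset in this sketch
    RegFile#(BhtIndex, Bit#(2)) counters <- mkRegFileFull;
    function BhtIndex idx(Addr pc) = truncate(pc >> 2);

    method Bool taken(Addr pc) = (counters.sub(idx(pc)) >= 2);

    method Action train(Addr pc, Bool wasTaken);
      let c = counters.sub(idx(pc));
      counters.upd(idx(pc), wasTaken ? ((c == 3) ? c : c + 1)
                                     : ((c == 0) ? c : c - 1));
    endmethod
  endmodule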

Slide 8

Where does the BHT fit in the processor pipeline?

The BHT can only be used after instruction decode.
What should we do at the fetch stage?
Need a mechanism to update the BHT: where does the update information come from?

Slide 9

Overview of branch prediction

[Figure: the PC needs the next PC immediately, so a Next Addr Pred sits in a tight loop with instruction fetch; Decode (instr type and PC-relative targets available), Reg Read (simple conditions and register targets available), and Execute (complex conditions available) each send corrections for BP, JMP, and Ret back to the PC in progressively looser loops.]

Best predictors reflect program behavior.

Slide 10

Next Address Predictor (NAP): first attempt

BP bits are stored with the predicted target address.

IF stage:  nPC = if (BP = taken) then target else pc+4
Later:     check the prediction; if wrong, kill the instruction and update the BTB & BPb,
           else update the BPb.

[Figure: the PC indexes a 2^k-entry Branch Target Buffer in parallel with the iMem; the selected entry supplies the BP bits (BPb) and the predicted target.]

Slide 11

Address Collisions

Assume a 128-entry NAP.

Instruction memory:
  132    Jump 100
  1028   Add .....

NAP entry:  BPb = take, target = 236

What will be fetched after the instruction at 1028?
  NAP prediction = 236
  Correct target = 1032
kill PC=236 and fetch PC=1032

Is this a common occurrence?

Can we avoid these bubbles?
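(A note on why 1028 and 132 can collide, assuming the simplest indexing, which the slide does not spell out: if the 128-entry NAP is indexed by PC mod 128, then 1028 mod 128 = 132 mod 128 = 4, so the Add at 1028 reads the entry left behind by the jump at 132.)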

Slide 12

Use NAP for Control Instructions only

The NAP contains useful information for branch and jump instructions only; do not update it for other instructions.

For all other instructions the next PC is (PC)+4 !

How to achieve this effect without decoding the instruction?

Slide 13

Branch Target Buffer (BTB): a special form of NAP

Keep (pc, predicted pc) pairs in the BTB; pc+4 is predicted if no pc match is found.
The BTB is updated only for branches and jumps.

[Figure: a 2^k-entry direct-mapped BTB, accessed with the PC in parallel with the I-Cache; each entry holds a valid bit, an entry PC, and a target PC; on a match with a valid entry, the target PC becomes the predicted target.]

Permits nextPC to be determined before the instruction is decoded.

Slide 14

Consulting the BTB Before Decoding

Instruction memory:
  132    Jump 100
  1028   Add .....

BTB entry:  entry PC = 132, BPb = take, target = 236

The match for pc = 1028 fails, and 1028+4 is fetched.
- eliminates false predictions after ALU instructions
- the BTB contains entries only for control transfer instructions
- more room to store branch targets

Even very small BTBs are very effective.

Slide 15

Observations

There is a plethora of branch prediction schemes; their importance grows with the depth of the processor pipeline.
Processors often use more than one prediction scheme.
It is usually easy to understand the data structures required to implement a particular scheme.
It takes considerably more effort to understand how a particular scheme, with its lookups and updates, is integrated into the pipeline, and how various schemes interact with each other.

Slide 16

Plan

We will begin with a very simple 2-stage pipeline and integrate a simple BTB scheme in it.
We will extend the design to a multistage pipeline and integrate at least one more predictor, say a BHT, in the pipeline (next lecture).
Revisit the simple two-stage pipeline without branch prediction.

Slide 17

Decoupled Fetch and Execute

[Figure: Fetch and Execute connected by two FIFOs: ir carries <instructions, pc, epoch> from Fetch to Execute, and nextPC carries the <updated pc> from Execute back to Fetch.]

Fetch sends instructions to Execute along with the pc and other control information.
Execute sends information about the target pc to Fetch, which updates the pc and other control registers whenever it looks at the nextPC fifo.

Slide 18

A solution using epoch

Add fEpoch and eEpoch registers to the processor state; initialize them to the same value.
The epoch changes whenever Execute determines that the pc prediction is wrong. This change is reflected immediately in eEpoch and eventually in fEpoch via the nextPC FIFO.
Associate the fEpoch with every instruction when it is fetched.
In the execute stage, reject, i.e., kill, the instruction if its epoch does not match eEpoch.
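For example (a walk-through, not on the slide): suppose fEpoch = eEpoch = False and Fetch has been following a wrong prediction. When Execute detects the misprediction it toggles eEpoch to True and enqueues the correct pc into nextPC; the wrong-path instructions still carry epoch False, so Execute kills them, and once Fetch dequeues nextPC it toggles fEpoch to True and resumes on the correct path.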

Slide 19

Two-Stage pipeline: a robust two-rule solution

[Figure: the PC (with a +4 adder) fetches from the Inst Memory; the ir Pipeline FIFO carries fetched instructions to the Decode / Register File / Execute / Data Memory stage; the nextPC Bypass FIFO carries redirects back to Fetch; fEpoch lives on the fetch side and eEpoch on the execute side.]

Either fifo can be a normal (>1 element) fifo.

Slide 20

Two-stage pipeline, Decoupled

  module mkProc(Proc);
    Reg#(Addr)   pc   <- mkRegU;
    RFile        rf   <- mkRFile;
    IMemory      iMem <- mkIMemory;
    DMemory      dMem <- mkDMemory;
    PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
    Reg#(Bool)   fEpoch <- mkReg(False);
    Reg#(Bool)   eEpoch <- mkReg(False);
    FIFOF#(Addr) nextPC <- mkBypassFIFOF;

    rule doFetch (ir.notFull);            // explicit guard
      let inst = iMem(pc);
      ir.enq(TypeFetch2Decode{pc:pc, epoch:fEpoch, inst:inst});
      if (nextPC.notEmpty) begin
        pc <= nextPC.first;
        fEpoch <= !fEpoch;
        nextPC.deq;
      end
      else
        pc <= pc + 4;                     // simple branch prediction
    endrule
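For this version of the pipeline, TypeFetch2Decode presumably carries just the pc, the epoch, and the instruction; only on Slide 26 is it extended with the predicted pc. A plausible definition (an assumption, inferred from Slide 26's extended version) would be:

  typedef struct {
    Addr pc;
    Bool epoch;
    Data inst;
  } TypeFetch2Decode deriving (Bits, Eq);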

Slide 21

Two-stage pipeline, Decoupled (cont.)

    rule doExecute (ir.notEmpty);
      let irpc = ir.first.pc;
      let inst = ir.first.inst;
      if (ir.first.epoch == eEpoch) begin
        let eInst   = decodeExecute(irpc, inst, rf);
        let memData <- dMemAction(eInst, dMem);
        regUpdate(eInst, memData, rf);
        if (eInst.brTaken) begin
          nextPC.enq(eInst.addr);
          eEpoch <= !eEpoch;
        end
      end
      ir.deq;
    endrule
  endmodule

Slide 22

Two-Stage pipeline with a Branch Predictor

[Figure: the same two-stage pipeline, but a Branch Predictor now sits next to the PC in the Fetch stage and supplies the ppc (predicted pc), which is carried through the ir FIFO; the nextPC FIFO, fEpoch, and eEpoch are as before.]

Slide 23

Branch Predictor Interface

  interface NextAddressPredictor;
    method Addr prediction(Addr pc);
    method Action update(Addr pc, Addr target);
  endinterface
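The next two slides give two implementations of this interface, mkNeverTaken and mkBTB; either one can be plugged into the fetch stage (Slide 27) without changing the pipeline rules.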

Slide 24

Null Branch Prediction

  module mkNeverTaken(NextAddressPredictor);
    method Addr prediction(Addr pc);
      return pc+4;
    endmethod
    method Action update(Addr pc, Addr target);
      noAction;
    endmethod
  endmodule

Replaces PC+4 with ...
Already implemented in the pipeline
Right most of the time. Why?
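(One answer, from the instruction-mix data on Slide 3: branches are only about 8-16 % of dynamic instructions, and not all of them are taken, so pc+4 is the correct next pc most of the time.)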

Slide 25

Branch Target Prediction (BTB)

  module mkBTB(NextAddressPredictor);
    RegFile#(LineIdx, Addr) tagArr    <- mkRegFileFull;
    RegFile#(LineIdx, Addr) targetArr <- mkRegFileFull;

    method Addr prediction(Addr pc);
      LineIdx index = truncate(pc >> 2);
      let tag    = tagArr.sub(index);
      let target = targetArr.sub(index);
      if (tag == pc) return target; else return (pc+4);
    endmethod

    method Action update(Addr pc, Addr target);
      LineIdx index = truncate(pc >> 2);
      tagArr.upd(index, pc);
      targetArr.upd(index, target);
    endmethod
  endmodule
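The LineIdx type is not defined on these slides; a minimal assumption, for example a 256-entry BTB indexed by the low-order word-address bits, would be:

  typedef Bit#(8) LineIdx;   // assumed width; the slides do not give the BTB size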

Slide 26

Two-stage pipeline + BP

  module mkProc(Proc);
    Reg#(Addr)   pc   <- mkRegU;
    RFile        rf   <- mkRFile;
    IMemory      iMem <- mkIMemory;
    DMemory      dMem <- mkDMemory;
    PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
    Reg#(Bool)   fEpoch <- mkReg(False);
    Reg#(Bool)   eEpoch <- mkReg(False);
    FIFOF#(Tuple2#(Addr,Addr)) nextPC <- mkBypassFIFOF;
    NextAddressPredictor bpred <- mkNeverTaken;   // some target predictor

The definition of TypeFetch2Decode is changed to include the predicted pc:

  typedef struct {
    Addr pc;
    Addr ppc;
    Bool epoch;
    Data inst;
  } TypeFetch2Decode deriving (Bits, Eq);

Slide 27

Two-stage pipeline + BP: Fetch rule

  rule doFetch (ir.notFull);
    let ppc  = bpred.prediction(pc);
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode{pc:pc, ppc:ppc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      match {.ipc, .ippc} = nextPC.first;
      pc <= ippc;
      fEpoch <= !fEpoch;
      nextPC.deq;
      bpred.update(ipc, ippc);
    end
    else pc <= ppc;
  endrule

Slide 28

Two-stage pipeline + BP: Execute rule

  rule doExecute (ir.notEmpty);
    let irpc  = ir.first.pc;
    let inst  = ir.first.inst;
    let irppc = ir.first.ppc;
    if (ir.first.epoch == eEpoch) begin
      let eInst   = decodeExecute(irpc, irppc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.missPrediction) begin
        nextPC.enq(tuple2(irpc, eInst.brTaken ? eInst.addr : irpc+4));
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
  endmodule
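Note that, compared with the version without a predictor (Slide 21), Execute now redirects Fetch only when the prediction was actually wrong (eInst.missPrediction) rather than on every taken branch, and it sends back the (irpc, corrected target) pair so that the predictor can be trained.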

Slide 29

Execute Function

  function ExecInst exec(DecodedInst dInst, Data rVal1, Data rVal2,
                         Addr pc, Addr ppc);
    ExecInst einst = ?;
    let aluVal2 = dInst.immValid ? dInst.imm : rVal2;
    let aluRes  = alu(rVal1, aluVal2, dInst.aluFunc);
    let brAddr  = brAddrCal(pc, rVal1, dInst.iType, dInst.imm);
    einst.itype = dInst.iType;
    einst.addr  = memType(dInst.iType) ? aluRes : brAddr;
    einst.data  = dInst.iType==St ? rVal2 : aluRes;
    einst.brTaken = aluBr(rVal1, aluVal2, dInst.brComp);
    einst.missPrediction = einst.brTaken ? (brAddr != ppc) : ((pc+4) != ppc);
    einst.rDst  = dInst.rDst;
    return einst;
  endfunction
