
Lecture 5: Interrupts, Superscalar

Professor Alvin R. Lebeck

Computer Science 220 / ECE 252

Fall 2008

Admin

Homework #1 Due Today

Homework #2 Assigned

Reading

H&P Chapter 2 & 3 (suggested)

Research papers (not yet ready to read, but will be soon!):

Hinton et al: “The Microarchitecture of the Pentium 4 Processor”

Palacharla, Jouppi, and Smith: “Complexity-Effective Superscalar Processors”

Akkary, Rajwar, and Srinivasan: “Checkpoint Processing and Recovery”

Review: Hazards

Data Hazards

RAW

only one that can occur in simple 5-stage pipeline

WAR, WAW

Data Forwarding (Register Bypassing)

send data from one stage to another bypassing the register file

Still have load-use delay

Structural Hazards

Replicate Hardware, scheduling

Control Hazards

Compute condition and target early (delayed branch)

Review: Dynamic Branch Prediction

Solution: 2-bit saturating counter, where the prediction changes only after mispredicting twice (sketch below):

Increment for taken, decrement for not-taken

States: 00, 01, 10, 11

Helps when target is known before condition

[Figure: state diagram of the 2-bit counter, with two “Predict Taken” states and two “Predict Not Taken” states and T/NT transitions between them]
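A minimal C sketch of the 2-bit saturating counter described above; the function names (predict_taken, update) are illustrative, not from the slides:

```c
#include <stdbool.h>

/* One 2-bit saturating counter: 0,1 = predict not-taken; 2,3 = predict taken. */
typedef unsigned char counter2_t;

static bool predict_taken(counter2_t c) {
    return c >= 2;
}

/* Increment on a taken branch, decrement on a not-taken branch,
 * saturating at 0 and 3.  Starting from a saturated state, the
 * prediction flips only after two consecutive mispredictions. */
static counter2_t update(counter2_t c, bool taken) {
    if (taken)
        return (c < 3) ? (counter2_t)(c + 1) : (counter2_t)3;
    else
        return (c > 0) ? (counter2_t)(c - 1) : (counter2_t)0;
}
```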

Review: Correlating Branches

Idea: the taken/not-taken behavior of recently executed branches is correlated with the behavior of the next branch (in addition to that branch’s own history)

Tournament

Choose between alternative predictors

How do you choose?

[Figure: the branch address indexes a table of 2-bit-per-branch predictors; a 2-bit global branch history selects which entry supplies the prediction (see the sketch below)]
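A hedged C sketch of a correlating predictor in the spirit of the diagram: a table of 2-bit counters indexed by low branch-address bits concatenated with a 2-bit global history. The table size and all names are illustrative assumptions, not taken from the slides:

```c
#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 2                      /* 2-bit global branch history        */
#define PC_BITS   8                      /* low branch-address bits used       */
#define TABLE_SIZE (1u << (PC_BITS + HIST_BITS))

static uint8_t counters[TABLE_SIZE];     /* one 2-bit counter per (PC, history) */
static uint8_t ghist;                    /* last HIST_BITS branch outcomes      */

static unsigned index_of(uint32_t pc) {
    return ((pc & ((1u << PC_BITS) - 1)) << HIST_BITS) | ghist;
}

static bool predict(uint32_t pc) {
    return counters[index_of(pc)] >= 2;
}

static void train(uint32_t pc, bool taken) {
    uint8_t *c = &counters[index_of(pc)];     /* index with the old history */
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    ghist = (uint8_t)(((ghist << 1) | (taken ? 1 : 0)) & ((1u << HIST_BITS) - 1));
}
```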

Review: Need Address @ Same Time as Prediction

Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken)

Note: must check for a branch match now, since can’t use the wrong branch address

Procedure return addresses predicted with a stack

[Figure: the PC of the instruction to fetch indexes the BTB and is compared against the stored branch PC; on a match (“Yes”), use the predicted PC and the taken/not-taken prediction; on a miss (“No”), it is not a predicted branch (see the sketch below)]
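A rough C sketch of the BTB lookup described above. The entry layout, table size, and function names are assumptions for illustration only:

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 1024          /* illustrative size */

struct btb_entry {
    bool     valid;
    uint32_t tag;                 /* PC of the branch                 */
    uint32_t target;              /* predicted target if taken        */
    uint8_t  counter;             /* 2-bit taken/not-taken counter    */
};

static struct btb_entry btb[BTB_ENTRIES];

/* Look up the fetch PC.  On a tag match with a "taken" prediction,
 * fetch from the stored target; otherwise fall through to PC + 4.
 * A miss means the instruction is not a (predicted) branch. */
static uint32_t next_fetch_pc(uint32_t pc) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->tag == pc && e->counter >= 2)
        return e->target;
    return pc + 4;
}
```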

Review: Multicycle Ops in Pipeline

Interrupts and Exceptions

Unnatural change in control flow

warning: varying terminology

“exception” sometimes refers to all cases

“trap” can mean a software trap or a hardware trap

Exception

is a potential problem with the program

condition occurs within the processor:

segmentation fault

bus error

divide by 0

Don’t want my bug to crash the entire machine

page fault (virtual memory…)

Interrupts and Exceptions

Interrupt

is external event

devices: disk, network, keyboard, etc.

clock for timeslicing

These are useful events, must do something when they occur.

Trap is a user-requested exception

operating system call (syscall)

Handling an Exception/Interrupt

[Figure: a user program’s instruction stream (ld, add, st, div, beq, ld, sub, bne) is interrupted; control transfers to the interrupt handler, which returns with RETT]

Invoke specific kernel routine based on type of interrupt

interrupt/exception handler

Must determine what caused the interrupt

could use software to examine each device: PC = interrupt_handler

Vectored interrupts: PC = interrupt_table[i] (see the sketch below)

Kernel initializes the table at boot time

Clear the interrupt

May return from interrupt (RETT) to a different process (e.g., context switch)

A similar mechanism is used to handle interrupts, exceptions, and traps
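A simplified C sketch of vectored interrupt dispatch as described above (a table filled by the kernel at boot, indexed by the interrupt number). All names and sizes are illustrative and not tied to any particular OS:

```c
#include <stddef.h>

#define NUM_VECTORS 256                  /* illustrative */

typedef void (*isr_t)(int vector);

static isr_t interrupt_table[NUM_VECTORS];

static void unexpected_interrupt(int vector) {
    (void)vector;                        /* log and ignore, or panic */
}

void init_interrupt_table(void) {        /* called once at boot time */
    for (size_t i = 0; i < NUM_VECTORS; i++)
        interrupt_table[i] = unexpected_interrupt;
}

void register_handler(int vector, isr_t handler) {
    interrupt_table[vector] = handler;
}

void dispatch_interrupt(int vector) {    /* "PC = interrupt_table[i]" */
    interrupt_table[vector](vector);
    /* clear the interrupt, then return from the interrupt (RETT),
     * possibly to a different process after a context switch */
}
```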

Execution Mode

What if interrupt occurs while in interrupt handler?

Problem: could lose information for one interrupt

the clear of interrupt #1 clears both #1 and #2

Solution: disable interrupts

Disabling interrupts is a protected operation

Only the kernel can execute it

user vs. kernel mode

mode bit in CPU status register

Other protected operations:

installing interrupt handlers

manipulating CPU state (saving/restoring status registers)

Changing modes:

interrupts

system calls (syscall instruction)

A System Call (syscall)

[Figure: a user program (ld, add, st, “TA 6”, beq, ld, sub, bne) executes a trap instruction; the trap handler in the kernel dispatches to service routines and returns to the user program with RETT]

Special Instruction to change modes and invoke service

read/write I/O device

create new process

Invokes specific kernel routine based on argument

kernel defined interface

May return from trap to a different process (e.g., context switch)

RETT: instruction to return to the user process

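For concreteness, a small user-mode example (assuming Linux and glibc) that crosses into the kernel through the syscall interface; this is just one OS’s realization of the mechanism on the slide:

```c
/* The C library's syscall() wrapper executes the mode-changing trap
 * instruction (int 0x80 / syscall / svc, depending on the ISA) and the
 * kernel dispatches on the system-call number. */
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "hello from user mode\n";
    /* Equivalent to write(1, msg, sizeof msg - 1), but spelled as an
     * explicit kernel-defined entry point. */
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    return 0;
}
```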

Interrupts/exceptions

classifying interrupts

terminal (fatal) vs. restartable (control returned to program)

synchronous (internal) vs. asynchronous (external)

user vs. coerced

maskable (ignorable) vs. non-maskable

between instructions vs. within instruction

Precise Exceptions

“unobserved system can exist in any intermediate state, upon observation system collapses to well-defined state”

2nd postulate of quantum mechanics

system → processor, observation → interrupt

what is the “well-defined” state?

von Neumann: “sequential, instruction atomic execution”

precise state at interrupt:

all instructions older than the interrupt are complete

all instructions younger than the interrupt haven’t started

implies interrupts are taken in program order

necessary for VM (why?), “highly recommended” by IEEE

Pipelining Complications

Interrupts (Exceptions)

5 instructions executing in 5 stage pipeline

How to stop the pipeline?

How to restart the pipeline?

Who caused the interrupt?

Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation

Pipelining Complications

Simultaneous exceptions in > 1 pipeline stage

Load with data page fault in MEM stage

Add with instruction page fault in IF stage

Solution #1: interrupt status vector per instruction

Defer the check until the last stage; kill the state update if there is an exception (see the sketch below)

Solution #2: interrupt ASAP

Restart everything that is incomplete

Another advantage for state update late in the pipeline!
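A minimal C sketch of Solution #1: an interrupt status vector carried with each instruction and checked only at the final stage, which keeps exceptions in program order. The struct layout and names are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

enum stage { IF, ID, EX, MEM, WB, NUM_STAGES };

struct in_flight {
    uint32_t pc;
    bool     exception[NUM_STAGES];  /* set by the stage that detects it  */
    uint8_t  cause[NUM_STAGES];      /* e.g., page fault, illegal opcode  */
};

/* Called when the instruction reaches the last stage: the exception from
 * the earliest stage wins, and the state update is suppressed. */
static bool check_and_raise(const struct in_flight *inst, uint8_t *cause) {
    for (int s = IF; s < NUM_STAGES; s++) {
        if (inst->exception[s]) {
            *cause = inst->cause[s];
            return true;             /* squash writeback, take interrupt */
        }
    }
    return false;                    /* safe to commit */
}
```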


Interrupts/Exceptions are Nasty

odd bits of state must be precise (e.g., condition codes)

delayed branches

what if instruction in delay slot takes an interrupt?

Out-of-order writes (e.g., auto-increment, multicycle ops)

must undo write (e.g., future-file, history-file)

some machines had precise interrupts only in integer pipe

sufficient for implementing VM (e.g., VAX/Alpha)

Lucky for us, there’s a nice, clean way to handle precise state

We’ll see how this is done in a couple of lectures ...

Pipelining x86

The x86 ISA has some really nasty instructions - how did Intel ever figure out how to build a pipelined x86 microprocessor?

Solution: at runtime, “crack” x86 instructions (macro-ops) into RISC-like micro-ops

First used in P6 (Pentium Pro)

Used in all subsequent x86 processors, including those from AMD

What are the potential challenges for implementing this solution?
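A purely illustrative sketch (not Intel’s actual design) of what cracking a macro-op into micro-ops might look like, representing micro-ops as simple structs; the example macro-op and register numbers are hypothetical:

```c
#include <stdio.h>

enum uop_kind { UOP_LOAD, UOP_ALU_ADD, UOP_STORE };

struct uop {
    enum uop_kind kind;
    int dst, src1, src2;            /* (renamed) register identifiers */
};

/* Hypothetical cracking of an x86 read-modify-write memory add,
 * "add [rbx], eax", into three RISC-like micro-ops: */
static const struct uop cracked[] = {
    { UOP_LOAD,    /*tmp*/100, /*rbx*/3,   0        },  /* tmp <- mem[rbx]  */
    { UOP_ALU_ADD, /*tmp*/100, /*tmp*/100, /*eax*/0 },  /* tmp <- tmp + eax */
    { UOP_STORE,   0,          /*rbx*/3,   /*tmp*/100 } /* mem[rbx] <- tmp  */
};

int main(void) {
    printf("macro-op cracks into %zu micro-ops\n",
           sizeof cracked / sizeof cracked[0]);
    return 0;
}
```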

Where are We

principles of pipelining

pipeline depth: clock rate vs. number of stalls (CPI)

hazards

structural

data (RAW, WAR, WAW)

control

Branch prediction

multi-cycle operations

structural hazards, WAW hazards

interrupts

precise state

Next up: CPI < 1

Getting CPI < 1: Issuing Multiple Instructions/Cycle

“Flynn bottleneck”

single issue performance limit is CPI = IPC = 1

hazards + overhead → CPI >= 1 (IPC <= 1)

diminishing returns from deep pipelines

solution: issue multiple instructions per cycle

Superscalar: varying number of instructions/cycle (1 to 8), scheduled by the compiler (statically scheduled) or by HW (Tomasulo; dynamically scheduled)

First superscalar: IBM America → RS/6000 → Power1

Pentium 4, IBM PowerPC, Sun SuperSPARC, DEC Alpha, HP PA-8000

Base Implementation

statically scheduled (in-order) superscalar

executes unmodified sequential programs

Figures out on its own what can be done in parallel

e.g., Sun UltraSPARC, Alpha 21164

we’ll start with this one

What has to change from single issue to multiple issue?

CPI < 1: Issuing Multiple Instructions/Cycle

Ex 2-way superscalar: 1 FP & 1 anything else

– Fetch 64 bits/clock cycle; Int on left, FP on right

Can only issue 2nd instruction if 1st instruction issues

– More ports for FP registers to do FP load & FP op in a pair

Type               Pipe stages
Int. instruction   IF  ID  EX  MEM WB
FP instruction     IF  ID  EX  MEM WB
Int. instruction       IF  ID  EX  MEM WB
FP instruction         IF  ID  EX  MEM WB
Int. instruction           IF  ID  EX  MEM WB
FP instruction             IF  ID  EX  MEM WB

1-cycle load delay expands to 3 instructions in SS

the instruction in the right half can’t use it, nor can the instructions in the next issue slot
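A small C sketch of the pairing rule implied above: the pair issues together only if the older (left) slot holds the non-FP instruction, the younger (right) slot holds the FP instruction, and the older one actually issues. The opcode classes are illustrative:

```c
#include <stdbool.h>

enum op_class { INT_ALU, LOAD, STORE, BRANCH, FP };

static bool is_int_side(enum op_class c) {
    return c != FP;                       /* "1 anything else" slot */
}

/* older/younger are the two instructions fetched together this cycle. */
static bool can_dual_issue(enum op_class older, enum op_class younger,
                           bool older_issues) {
    return older_issues && is_int_side(older) && younger == FP;
}
```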

Implications of Superscalar

what is involved in

fetching two instructions per cycle?

decoding two instructions per cycle?

executing two ALU operations per cycle?

accessing the data cache twice per cycle?

writing back two results per cycle?

what about 4 or 8 instructions per cycle?

Wide Fetch

Fetch N instructions per cycle

if instructions are sequential...

and on same cache line → nothing really

and on different cache lines → banked I$ + combining network

if instructions are not sequential... more difficult

two serial I$ accesses (access1 → predict target → access2)? no

note: embedded branches OK as long as predicted NT

serial access + prediction in parallel

if prediction is T, discard serial part after branch

Trace Cache…

Wide Decode

Decode N instructions per cycle

actually decoding instructions?

easy if fixed length instructions (multiple decoders)

harder (but possible) if variable length

reading input register values?

2N register read ports (register file read latency ~2N)

actually less than 2N, since most values come from bypasses

what about the stall logic to enforce RAW dependences?

N² Dependence Check Logic

remember stall logic for single issue pipeline

rs1(D) == rd(D/X) || rs1(D) == rd(X/M) || rs1(D) == rd(M/W)

same for rs2(D)

full bypassing reduces the stall condition to rs1(D) == rd(D/X) && op(D/X) == LOAD

doubling issue width (N) quadruples stall logic!

not only 2 instructions in D, but two instructions in every stage:

(rs1(D1) == rd(D/X1) && op(D/X1) == LOAD) || (rs1(D1) == rd(D/X2) && op(D/X2) == LOAD)

repeat for rs1(D2), rs2(D1), rs2(D2)

also check dependence of the 2nd instruction on the 1st: rs1(D2) == rd(D1)

the “N² dependence cross-check”: for an N-wide pipeline, stall (and bypass) circuits grow as N² (see the sketch below)
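A C sketch of the 2-wide cross-check, assuming full bypassing and a 1-cycle load-use delay; the struct and names are illustrative, but the comparisons mirror the conditions above:

```c
#include <stdbool.h>

struct instr { int rs1, rs2, rd; bool is_load; bool writes_rd; };

/* True if d would use the result of a load x that is one stage ahead. */
static bool load_use(const struct instr *d, const struct instr *x) {
    return x->is_load && x->writes_rd &&
           (d->rs1 == x->rd || d->rs2 == x->rd);
}

/* D1/D2: older/younger instructions in decode; X1/X2: the pair in execute. */
static bool stall_D1(const struct instr *D1,
                     const struct instr *X1, const struct instr *X2) {
    return load_use(D1, X1) || load_use(D1, X2);
}

static bool stall_D2(const struct instr *D1, const struct instr *D2,
                     const struct instr *X1, const struct instr *X2) {
    /* same checks as D1, plus the intra-group RAW on D1's result
     * (the slide's rs1(D2) == rd(D1) check), which keeps the pair
     * from issuing together */
    bool raw_on_D1 = D1->writes_rd &&
                     (D2->rs1 == D1->rd || D2->rs2 == D1->rd);
    return load_use(D2, X1) || load_use(D2, X2) || raw_on_D1;
}
```

Doubling the width doubles both the number of sources to check and the number of potential producers, which is where the N² growth comes from.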

Superscalar Stalls

invariant: stalls propagate upstream to younger instructions

what if older instruction in issue “pair” (inst0) stalls?

younger instruction (inst1) stalls too, cannot pass it

what if younger instruction (inst1) stalls?

can older instruction from next group (inst2) move up?

Rigid pipeline: No

Fluid pipeline: Yes

Wide Execute

What does it take to execute N instructions per cycle?

multiple execution units...N of every kind?

N ALUs? OK, ALUs are small

N FP dividers? no, FP dividers are huge (and fdiv is uncommon)

typically have some mix (proportional to instruction mix)

RS/6000: 1 ALU/memory/branch + 1 FP

Pentium: 1 any + 1 ALU

Pentium II: 1 ALU/FP + 1 ALU + 1 load + 1 store + 1 branch

Alpha 21164: 1 ALU/FP/branch + 2 ALU + 1 load/store

N² Bypass

N² bypass logic... OK

only 5-bit quantities

compare to generate 1-bit outcomes

similar to stall logic

N² bypass buses... not even close to OK

32-bit or 64-bit quantities

broadcast, route, and multiplex (mux)

difficult to lay out and route all the wires

wide (SLOW) muxes

big design problem today

One Solution to N² Bypass: Clustering

group functional units into clusters

full bypass within cluster

no bypass between clusters

~(N/k) inputs at each mux

~(N/k)² routed buses in each cluster

steer instructions to different clusters

dependent instructions go to the same cluster (see the sketch below)

exploit the intra-cluster bypass

static or dynamic steering is possible

e.g., Alpha 21264: 4-wide, 300MHz

full bypass didn’t fit into 1 clock cycle

2 clusters with full intra-cluster bypass
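A toy C sketch of dependence-based steering in the spirit of clustered designs like the 21264: send an instruction to the cluster that produced one of its sources so the value arrives through the fast intra-cluster bypass, otherwise rotate for load balance. The heuristic and all names are illustrative:

```c
#define NUM_CLUSTERS 2
#define NUM_REGS     64

/* Which cluster last produced each architectural register (0 by default). */
static int producer_cluster[NUM_REGS];

/* rs1/rs2/rd are register numbers, or -1 if the operand is unused. */
static int steer(int rs1, int rs2, int rd) {
    static int next = 0;
    int c;
    if (rs1 >= 0)      c = producer_cluster[rs1];   /* follow a source     */
    else if (rs2 >= 0) c = producer_cluster[rs2];
    else               c = next++ % NUM_CLUSTERS;   /* no sources: rotate  */
    if (rd >= 0) producer_cluster[rd] = c;          /* record the producer */
    return c;
}
```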

Wide Memory Access

what is involved in accessing memory for multiple instructions per cycle?

multi-banked D$

requires bank assignment and conflict-detection logic

(rough) instruction mix: 20% loads, 15% stores

for width N, we need about 0.2*N load ports and 0.15*N store ports

Wide Writeback

what is involved in writing back multiple instructions per cycle?

nothing too special, just another port on the register file

everything else is taken care of earlier in pipeline

adding ports isn’t free, though

increases area

increases access latency

Multiple Issue Summary

superscalar problem spots

fetch, branch prediction → trace cache?

decode (N² dependence cross-check)

execute (N² bypass) → clustering?

Can we do better?

Problem: Stall in ID stage if any data hazard.

Your task: Teams of two, propose a design to eliminate these stalls.

MULD F2, F3, F4 (long latency…)

ADDD F1, F2, F3

ADDD F3, F4, F5

ADDD F1, F4, F5

Next Time

Dynamic Scheduling

Read papers

HW #2 Assigned