Slide1
Lecture 5: Interrupts, Superscalar
Professor Alvin R. Lebeck
Computer Science 220 / ECE 252
Fall 2008
Slide2
Admin
Homework #1 Due Today
Homework #2 Assigned
Reading
H&P Chapter 2 & 3 (suggested)
Research papers (not yet ready to read, but will be soon!):
Hinton et al: “The Microarchitecture of the Pentium 4 Processor”
Palacharla, Jouppi, and Smith: “Complexity-Effective Superscalar Processors”
Akkary, Rajwar, and Srinivasan: “Checkpoint Processing and Recovery”
Slide3
Review: Hazards
Data Hazards
RAW
only one that can occur in simple 5-stage pipeline
WAR, WAW
Data Forwarding (Register Bypassing)
send data from one stage to another bypassing the register file
Still have load use delay
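The three data hazard types can be made concrete with a small sketch. The tuple encoding of instructions and the function name `classify_hazards` are assumptions for illustration, and the four-instruction sequence assumes destination-first operand order. The check is deliberately naive: it compares every older/younger pair and does not ask whether an intervening write removes the dependence.

```python
from typing import List, Tuple

# Hypothetical encoding: (destination register, [source registers]).
Instr = Tuple[str, List[str]]

def classify_hazards(program: List[Instr]) -> List[Tuple[int, int, str, str]]:
    """Return (older index, younger index, hazard kind, register) per pair."""
    hazards = []
    for i, (dst_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dst_j, srcs_j = program[j]
            if dst_i in srcs_j:
                hazards.append((i, j, "RAW", dst_i))  # younger reads what older writes
            if dst_j in srcs_i:
                hazards.append((i, j, "WAR", dst_j))  # younger overwrites what older reads
            if dst_j == dst_i:
                hazards.append((i, j, "WAW", dst_i))  # both write the same register
    return hazards

program = [
    ("F2", ["F3", "F4"]),  # MULD F2, F3, F4
    ("F1", ["F2", "F3"]),  # ADDD F1, F2, F3
    ("F3", ["F4", "F5"]),  # ADDD F3, F4, F5
    ("F1", ["F4", "F5"]),  # ADDD F1, F4, F5
]
for hazard in classify_hazards(program):
    print(hazard)
```

In a simple in-order 5-stage pipeline only the RAW entry can cause a stall; the WAR and WAW entries matter once instructions can complete out of order.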
Structural Hazards
Replicate Hardware, scheduling
Control Hazards
Compute condition and target early (delayed branch)
Slide4
Review: Dynamic Branch Prediction
Solution: 2-bit counter where prediction changes only if mispredicted twice:
Increment for taken, decrement for not-taken
00,01,10,11
Helps when target is known before condition
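A minimal sketch of the 2-bit counter, assuming a saturating counter in which states 00 and 01 predict not taken and states 10 and 11 predict taken. The class and function names are illustrative only.

```python
class TwoBitCounter:
    """Saturating 2-bit counter: 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state: int = 3):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Increment for taken, decrement for not-taken, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(outcomes, start_state=3):
    counter = TwoBitCounter(start_state)
    wrong = 0
    for taken in outcomes:
        if counter.predict() != taken:
            wrong += 1
        counter.update(taken)
    return wrong

# A loop branch: taken four times, then not taken at loop exit, run twice.
outcomes = [True, True, True, True, False] * 2
print(count_mispredictions(outcomes))
```

With this outcome pattern the counter mispredicts only at each loop exit; a single not-taken outcome moves the state from 11 to 10, so the branch is still predicted taken when the loop is re-entered.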
[Figure: state diagram of the 2-bit counter, with two Predict Taken states and two Predict Not Taken states connected by T (taken) and NT (not taken) transitions]
Slide5
Review: Correlating Branches
Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)
Tournament
Choose between alternative predictors
How do you choose?
[Figure: correlating predictor; the branch address and the 2-bit global branch history select among the 2-bit per-branch predictors to produce the prediction]
Slide6
Review: Need Address @ Same Time as Prediction
Branch Target Buffer (BTB): the address of the branch is the index used to get the prediction AND the branch target address (if taken)
Note: must check for a branch match now, since we can’t use the wrong branch address
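The BTB lookup, including the tag match that guards against using another branch's target, might be sketched as follows. The direct-mapped organization, the 16-entry size, the 4-byte instruction assumption, and all names are hypothetical choices for illustration.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch; each entry holds (branch PC, target, predict_taken)."""

    def __init__(self, num_entries: int = 16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.num_entries  # assumes 4-byte instructions

    def install(self, branch_pc: int, target: int, predict_taken: bool = True) -> None:
        self.entries[self._index(branch_pc)] = (branch_pc, target, predict_taken)

    def next_fetch_pc(self, pc: int) -> int:
        entry = self.entries[self._index(pc)]
        if entry is not None:
            tag, target, predict_taken = entry
            # The full-PC tag check is the "branch match": an aliasing PC
            # must not pick up another branch's target.
            if tag == pc and predict_taken:
                return target
        return pc + 4  # not a known branch, or predicted not taken

btb = BranchTargetBuffer()
btb.install(0x40, 0x100)
print(hex(btb.next_fetch_pc(0x40)), hex(btb.next_fetch_pc(0x80)))
```

Here 0x40 and 0x80 map to the same entry, but only 0x40 matches the stored tag, so 0x80 falls through to the sequential PC.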
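Procedure Return Addresses Predicted with a Stack
[Figure: BTB with entries 0 to n-1; the PC of the instruction to fetch is compared (=) against the stored branch addresses; each entry supplies a predicted PC and a taken/not-taken branch prediction; on a match (Yes), use the predicted PC; otherwise (No), the instruction is not a branch]
Slide7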
Review: Multicycle Ops in Pipeline
Slide8
Interrupts and Exceptions
Unnatural change in control flow
warning: varying terminology
“exception” sometimes refers to all cases
“Trap” software trap, hardware trap
Exception
is potential problem with program
condition occurs within the processor
segmentation fault
bus error
divide by 0
Don’t want my bug to crash the entire machine
page fault (virtual memory…)
Slide9
Interrupts and Exceptions
Interrupt
is external event
devices: disk, network, keyboard, etc.
clock for timeslicing
These are useful events, must do something when they occur.
Trap is user-requested exception
operating system call (syscall)
Slide10
Handling an Exception/Interrupt
[Figure: a User Program instruction stream (ld, add, st, div, beq, ld, sub, bne) alongside an Interrupt Handler that ends with RETT]
Invoke specific kernel routine based on type of interrupt
interrupt/exception handler
Must determine what caused interrupt
could use software to examine each device
PC = interrupt_handler
Vectored Interrupts
PC = interrupt_table[i]
Kernel initializes table at boot time
Clear the interrupt
May return from interrupt (RETT) to different process (e.g., context switch)
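Vectored dispatch can be sketched in a few lines. The handlers, their return strings, and the `pending` set are hypothetical stand-ins for kernel routines and an interrupt-pending register; only the `interrupt_table[i]` lookup comes from the slide.

```python
# Hypothetical handlers standing in for kernel routines.
def timer_handler():
    return "timer: pick next process"

def disk_handler():
    return "disk: transfer complete"

def divide_by_zero_handler():
    return "exception: divide by zero"

# The kernel initializes the table once, at boot time.
INTERRUPT_TABLE = [timer_handler, disk_handler, divide_by_zero_handler]

def take_interrupt(cause: int, pending: set) -> str:
    handler = INTERRUPT_TABLE[cause]  # PC = interrupt_table[i]
    result = handler()                # run the specific kernel routine
    pending.discard(cause)            # clear only the interrupt that was serviced
    return result

pending = {0, 1}
print(take_interrupt(1, pending), pending)
```

The table lookup replaces the software polling loop: the hardware supplies the cause index, so no device has to be examined to find out who raised the interrupt.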
Similar mechanism is used to handle interrupts, exceptions, traps
Slide11
Execution Mode
What if interrupt occurs while in interrupt handler?
Problem: Could lose information for one interrupt
clearing interrupt #1 clears both #1 and #2
Solution: disable interrupts
Disabling interrupts is a protected operation
Only the kernel can execute it
user vs. kernel mode
mode bit in CPU status register
Other protected operations
installing interrupt handlers
manipulating CPU state (saving/restoring status registers)
Changing modes
interrupts
system calls (syscall instruction)
Slide12
A System Call (syscall)
Special instruction to change modes and invoke service
read/write I/O device
create new process
Invokes specific kernel routine based on argument
kernel-defined interface
May return from trap to different process (e.g., context switch)
RETT, instruction to return to user process
[Figure: a User Program (ld, add, st, TA 6, beq, ld, sub, bne) traps into the Kernel's Trap Handler, which invokes Service Routines and returns with RETT]
Slide13
Interrupts/exceptions
classifying interrupts
terminal (fatal) vs. restartable (control returned to program)
synchronous (internal) vs. asynchronous (external)
user vs. coerced
maskable (ignorable) vs. non-maskable
between instructions vs. within instruction
Slide14
Precise Exceptions
“unobserved system can exist in any intermediate state, upon observation system collapses to well-defined state”
2nd postulate of quantum mechanics
system → processor, observation → interrupt
what is the “well-defined” state?
von Neumann: “sequential, instruction atomic execution”
precise state at interrupt
all instructions older than interrupt are complete
all instructions younger than interrupt haven’t started
implies interrupts are taken in program order
necessary for VM (why?), “highly recommended” by IEEE
Slide15
Pipelining Complications
Interrupts (Exceptions)
5 instructions executing in 5 stage pipeline
How to stop the pipeline?
How to restart the pipeline?
Who caused the interrupt?
Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation
Slide16
Pipelining Complications
Simultaneous exceptions in > 1 pipeline stage
Load with data page fault in MEM stage
Add with instruction page fault in IF stage
Solution #1
Interrupt status vector per instruction
Defer check until last stage, kill state update if exception
Solution #2
Interrupt ASAP
Restart everything that is incomplete
Another advantage for state update late in pipeline!
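Solution #1 can be illustrated with a toy timing model. It assumes one instruction enters IF per cycle with no stalls, represents each status vector as a dictionary from stage name to cause, and uses hypothetical names throughout. The point it shows is that the fault detected first in time (the add's instruction page fault) is not the one taken first; the older load's fault is.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def simulate(program):
    """program: list of (name, faults) in program order; faults maps a
    stage name to a cause string (the per-instruction status vector).
    Returns (detected, taken): detected lists (cycle, name, cause) in the
    order faults are noticed; taken is the one exception actually raised."""
    detected = sorted(
        (start + STAGES.index(stage), name, cause)
        for start, (name, faults) in enumerate(program)
        for stage, cause in faults.items()
    )
    taken = None
    for start, (name, faults) in enumerate(program):
        if faults:  # status vector is checked only when the instruction reaches WB
            earliest_stage = min(faults, key=STAGES.index)
            taken = (start + STAGES.index("WB"), name, faults[earliest_stage])
            break   # younger instructions have their state updates killed
    return detected, taken

program = [
    ("ld",  {"MEM": "data page fault"}),
    ("add", {"IF": "instruction page fault"}),
]
detected, taken = simulate(program)
print(detected)
print(taken)
```

Because the check happens at the last stage, exceptions are taken in program order even though they are detected out of order.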
Slide17
Interrupts/Exceptions are Nasty
odd bits of state must be precise (e.g., condition codes)
delayed branches
what if instruction in delay slot takes an interrupt?
Out of order Writes (e.g., autoinc, multicycle ops)
must undo write (e.g., future-file, history-file)
some machines had precise interrupts only in integer pipe
sufficient for implementing VM (e.g., VAX/Alpha)
Lucky for us, there’s a nice, clean way to handle precise state
We’ll see how this is done in a couple of lectures ...
Slide18
Pipelining x86
The x86 ISA has some really nasty instructions - how did Intel ever figure out how to build a pipelined x86 microprocessor?
Solution: at runtime, “crack” x86 instructions (macro-ops) into RISC-like micro-ops
First used in P6 (Pentium Pro)
Used in all subsequent x86 processors, including those from AMD
What are the potential challenges for implementing this solution?
Slide19
Where are We
principles of pipelining
pipeline depth: clock rate vs. number of stalls (CPI)
hazards
structural
data (RAW, WAR, WAW)
control
Branch prediction
multi-cycle operations
structural hazards, WAW hazards
interrupts
precise state
Next up: CPI < 1
Slide20
Getting CPI < 1: Issuing Multiple Instructions/Cycle
“Flynn bottleneck”
single issue performance limit is CPI = IPC = 1
hazards + overhead mean CPI >= 1 (IPC <= 1) in practice
diminishing returns from deep pipelines
solution: issue multiple instructions per cycle
Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler (statically scheduled) or by HW (Tomasulo; dynamically scheduled)
First superscalar: IBM America → RS6000 → Power1
Pentium4, IBM PowerPC, Sun SuperSparc, DEC Alpha, HP PA-8000
Slide21
Base Implementation
statically scheduled (in-order) superscalar
executes unmodified sequential programs
Figures out on its own what can be done in parallel
e.g., Sun UltraSPARC, Alpha 21164
we’ll start with this one
What has to change from single issue to multiple issue?
Slide22
CPI < 1: Issuing Multiple Instructions/Cycle
Ex: 2-way superscalar: 1 FP & 1 anything else
– Fetch 64-bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair
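The issue rules for this 2-way example can be sketched as a small function. The dictionary encoding, the `ready` flag (standing in for "no hazard blocks this instruction"), and the function name are assumptions; the sketch also assumes the fetch layout stated above, with the non-FP instruction on the left and the FP instruction on the right.

```python
def instructions_issued(left, right):
    """Return how many of the two fetched instructions issue this cycle.

    left, right: dicts with 'kind' ('int' or 'fp') and 'ready' (bool).
    """
    if not left["ready"]:
        return 0  # the 2nd instruction can only issue if the 1st issues
    if right["kind"] != "fp" or not right["ready"]:
        return 1  # right slot only accepts a ready FP instruction
    return 2

int_ready = {"kind": "int", "ready": True}
int_stalled = {"kind": "int", "ready": False}
fp_ready = {"kind": "fp", "ready": True}
fp_stalled = {"kind": "fp", "ready": False}

print(instructions_issued(int_ready, fp_ready))
print(instructions_issued(int_stalled, fp_ready))
```

The in-order constraint is what the first test captures: a stalled left instruction blocks the right one even when the right one is ready.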
Type              Pipe Stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB
1 cycle load delay expands to 3 instructions in SS
instruction in right half can’t use it, nor instructions in next slot
Slide23
Implications of Superscalar
what is involved in
fetching two instructions per cycle?
decoding two instructions per cycle?
executing two ALU operations per cycle?
accessing the data cache twice per cycle?
writing back two results per cycle?
what about 4 or 8 instructions per cycle?
Slide24
Wide Fetch
Fetch N instructions per cycle
if instructions are sequential...
and on same cache line
nothing really
and on different cache lines
banked I$ + combining network
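For the sequential case, a short sketch shows how to tell whether a fetch group stays within one cache line or spills into the next. The 32-byte line, 4-byte instructions, two banks, and line-interleaved bank mapping are all assumed parameters, not values from the lecture.

```python
def fetch_plan(pc, n, line_bytes=32, inst_bytes=4, num_banks=2):
    """Return (lines, banks) touched when fetching n sequential instructions."""
    first_line = pc // line_bytes
    last_line = (pc + n * inst_bytes - 1) // line_bytes
    lines = list(range(first_line, last_line + 1))
    # Assumed mapping: consecutive lines live in consecutive banks.
    banks = [line % num_banks for line in lines]
    return lines, banks

print(fetch_plan(0x00, 4))  # whole group inside one line
print(fetch_plan(0x18, 4))  # group straddles two lines
```

When the group straddles two lines, the interleaved mapping puts them in different banks, so both can be read in one cycle and the combining network stitches the pieces together.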
if instructions are not sequential...
more difficult
two serial I$ accesses (access1 → predict target → access2)? no
note: embedded branches OK as long as predicted NT
serial access + prediction in parallel
if prediction is T, discard serial part after branch
Trace Cache…
Slide25
Wide Decode
Decode N instructions per cycle
actually decoding instructions?
easy if fixed length instructions (multiple decoders)
harder (but possible) if variable length
reading input register values?
2N register read ports (register file read latency ~2N)
actually less than 2N, since most values come from bypasses
what about the stall logic to enforce RAW dependences?
Slide26
N² Dependence Check Logic
remember stall logic for single issue pipeline
rs1(D) == rd(D/X) || rs1(D) == rd(X/M) || rs1(D) == rd(M/W)
same for rs2(D)
full-bypassing reduces to rs1(D) == rd(D/X) && op(D/X) == LOAD
doubling issue width (N) quadruples stall logic!
not only 2 instructions in D, but two instructions in every stage
(rs1(D1) == rd(D/X1) && op(D/X1) == LOAD)
(rs1(D1) == rd(D/X2) && op(D/X2) == LOAD)
repeat for rs1(D2), rs2(D1), rs2(D2)
also check dependence of 2nd instruction on 1st: rs1(D2) == rd(D1)
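The checks above generalize to an N-wide decode group. The sketch below assumes full bypassing (so only load-use dependences against D/X cause a stall), treats any dependence inside the decode group as a stall, and uses a hypothetical `Inst` record; `comparator_count` simply counts the register comparisons this particular scheme would need.

```python
from collections import namedtuple

# Hypothetical instruction record; register fields are numbers or None.
Inst = namedtuple("Inst", ["op", "rd", "rs1", "rs2"])

def must_stall(decode_group, dx_group):
    """decode_group: instructions in D, oldest first; dx_group: instructions in D/X."""
    for position, inst in enumerate(decode_group):
        for src in (inst.rs1, inst.rs2):
            if src is None:
                continue
            # Load-use check against every instruction in the D/X latch.
            for older in dx_group:
                if older.op == "LOAD" and src == older.rd:
                    return True
            # Cross-check against older instructions in the same decode group.
            for older in decode_group[:position]:
                if older.rd is not None and src == older.rd:
                    return True
    return False

def comparator_count(n):
    load_use = 2 * n * n                  # 2 sources x N in D x N in D/X
    cross_check = 2 * (n * (n - 1) // 2)  # 2 sources x each older/younger pair in D
    return load_use + cross_check

for width in (1, 2, 4):
    print(width, comparator_count(width))
```

The load-use term alone quadruples each time the width doubles (2, 8, 32), and the cross-check adds a further term that also grows quadratically.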
“N² dependence cross-check”
for N-wide pipeline, stall (and bypass) circuits grow as N²
Slide27
Superscalar Stalls
invariant: stalls propagate upstream to younger instructions
what if older instruction in issue “pair” (inst0) stalls?
younger instruction (inst1) stalls too, cannot pass it
what if younger instruction (inst1) stalls?
can older instruction from next group (inst2) move up?
Rigid pipeline: No
Fluid pipeline: Yes
Slide28
Wide Execute
What does it take to execute N instructions per cycle?
multiple execution units...N of every kind?
N ALUs? OK, ALUs are small
N FP dividers? no, FP dividers are huge (and fdiv is uncommon)
typically have some mix (proportional to instruction mix)
RS/6000: 1 ALU/memory/branch + 1 FP
Pentium: 1 any + 1 ALU
Pentium II: 1 ALU/FP + 1 ALU + 1 load + 1 store + 1 branch
Alpha 21164: 1 ALU/FP/branch + 2 ALU + 1 load/store
Slide29
N² Bypass
N² bypass logic... OK
only 5-bit quantities
compare to generate 1-bit outcomes
similar to stall logic
N² bypass buses... not even close to OK
32-bit or 64-bit quantities
broadcast, route, and multiplex (mux)
difficult to lay out and route all the wires
wide (SLOW) muxes
big design problem today
Slide30
One Solution to N² Bypass: Clustering
group functional units into clusters
full bypass within cluster
no bypass between clusters
~(N/k) inputs at each mux
~(N/k)² routed buses in each cluster
steer instructions to different clusters
dependent instructions to same cluster
exploit intra-cluster bypass
static or dynamic steering is possible
e.g., Alpha 21264
4-wide, 300MHz
full bypass didn’t fit into 1 clock cycle
2 clusters with full intra-cluster bypass
Slide31
Wide Memory Access
what is involved in accessing memory for multiple instructions per cycle?
multi-banked D$
requires bank assignment and conflict-detection logic
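Bank assignment and conflict detection can be sketched as below. The four banks, 32-byte lines, line-interleaved mapping, and oldest-first priority are assumptions chosen for illustration.

```python
def schedule_accesses(addresses, num_banks=4, line_bytes=32):
    """Grant at most one access per bank this cycle.

    addresses: memory addresses in program order (older first).
    Returns (granted, deferred); deferred accesses retry next cycle.
    """
    busy = set()
    granted, deferred = [], []
    for addr in addresses:
        bank = (addr // line_bytes) % num_banks  # bank assignment
        if bank in busy:
            deferred.append(addr)                # bank conflict detected
        else:
            busy.add(bank)
            granted.append(addr)
    return granted, deferred

print(schedule_accesses([0x000, 0x020, 0x080]))
```

Addresses 0x000 and 0x080 fall in the same bank under this mapping, so the younger of the two is deferred while 0x020 proceeds in parallel.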
(rough) instruction mix: 20% loads, 15% stores
for width N, we need about 0.2*N load ports, 0.15*N store ports
Slide32
Wide Writeback
what is involved in writing back multiple instructions per cycle?
nothing too special, just another port on the register file
everything else is taken care of earlier in pipeline
adding ports isn’t free, though
increases area
increases access latency
Slide33
Multiple Issue Summary
superscalar problem spots
fetch, branch prediction
trace cache?
decode (N² dependence cross-check)
execute (N² bypass)
clustering?
Slide34
Can we do better?
Problem: Stall in ID stage if any data hazard.
Your task: Teams of two, propose a design to eliminate these stalls.
MULD F2, F3, F4 Long latency…
ADDD F1, F2, F3
ADDD F3, F4, F5
ADDD F1, F4, F5
Slide35
Next Time
Dynamic Scheduling
Read papers
HW #2 Assigned