Slide1
Processor Design & Implementation
Slide2
Review: MIPS (RISC) Design Principles
Simplicity favors regularity
fixed size instructions
small number of instruction formats
opcode always the first 6 bits
Smaller is faster
limited instruction set
limited number of registers in register file
limited number of addressing modes
Make the common case fast
arithmetic operands from the register file (load-store machine)
allow instructions to contain immediate operands
Good design demands good compromises
three instruction formats
Slide3
Sequential vs Combinational Circuits
Combinational logic circuits
output is a function of the present value of the inputs only.
When inputs are changed, the information about the previous inputs is lost
memoryless
e.g., multiplexors.
Sequential logic circuits
outputs are also dependent upon past inputs
has memory
e.g., flip flops/latches
basically combinational circuits with the additional properties of storage (to remember past inputs) and feedback
Slide5
RS Latches
An RS latch is a memory element with
- 2 inputs: Reset (R) and Set (S)
- 2 outputs: Q and its complement Q’
Note: if inputs don’t change, outputs are held indefinitely.
Slide6
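The set, hold, and reset behaviour of an RS latch can be sketched in a few lines of Python. This is only an illustration: it assumes the latch is built from two cross-coupled NOR gates (the slides do not say which gate type is used), and the names `nor` and `rs_latch` are made up for this sketch.

```python
def nor(a: int, b: int) -> int:
    """Two-input NOR gate on 0/1 values."""
    return 0 if (a or b) else 1


def rs_latch(r: int, s: int, q: int, q_bar: int) -> tuple[int, int]:
    """Settle a cross-coupled NOR latch starting from its previous state."""
    for _ in range(4):  # a few passes let the feedback loop settle
        q_new = nor(r, q_bar)
        q_bar_new = nor(s, q_new)
        if (q_new, q_bar_new) == (q, q_bar):
            break
        q, q_bar = q_new, q_bar_new
    return q, q_bar


state = (0, 1)                      # start with Q = 0 and its complement = 1
after_set = rs_latch(0, 1, *state)  # S = 1 drives Q to 1
after_hold = rs_latch(0, 0, *after_set)   # R = S = 0 keeps the stored value
after_reset = rs_latch(1, 0, *after_hold)  # R = 1 drives Q back to 0
```

With R = S = 0 the function returns the state it was given, which is the "outputs are held indefinitely" property noted above.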
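RS Latches - Hold
[Figure: RS latch circuit annotated with signal values in the hold state]
Slide7
RS Latches - Set
[Figure: RS latch circuit annotated with signal values for set and for reset]
Slide8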
Clocks and Synchronous Circuits
• Asynchronous operation:
- the output state of an RS latch changes directly in response to changes in the inputs.
• Virtually all sequential circuits currently employ the notion of synchronous operation:
- the output of a sequential circuit is constrained to change only at a time specified by a global enabling signal.
- this signal is generally known as the system clock
Slide9
Transparent D Latches
• Modify the RS latch such that its output state is only permitted to change when a valid enable signal (system clock) is present
• Add a couple of AND gates in cascade with the R and S inputs that are controlled by an additional input known as the enable (EN) input
Slide10Slide11Slide12Slide13Slide14
J-K Flip Flops
Slide15
Race Around Condition only when J = K = clk = 1
[Figure: J-K flip flop with signal values toggling between 0 and 1 while J = K = clk = 1]
Clk J K Q(n+1) (Q(n+1))’
0 x x Q(n) (Q(n))’
1 0 0 Q(n) (Q(n))’
1 0 1 0 1 (reset)
1 1 0 1 0 (set)
1 1 1 ? ?
Reset to 0 when input = 1
Slide16
Master-Slave Flip Flops
• Easy to design sequential circuits if outputs change on the
- rising (positive trending) or
- falling (negative trending)
edges of a clock (i.e., enable) signal
• Can be done by combining two transparent D latches in a Master-Slave configuration.
Slide17Slide18Slide19
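A rough behavioural model of the master-slave idea, under stated assumptions: the master latch is taken to be transparent while the clock is 1 and the slave while it is 0, so the visible output changes on the falling edge. The slides do not fix the polarity, and the class names below are invented for the sketch.

```python
class DLatch:
    """Transparent D latch: output follows d while enable is 1, else holds."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, d: int, enable: int) -> int:
        if enable:
            self.q = d
        return self.q


class MasterSlaveDFlipFlop:
    """Two D latches on opposite clock phases (assumed polarity, see above)."""

    def __init__(self) -> None:
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, d: int, clk: int) -> int:
        m = self.master.update(d, clk)        # master open while clk = 1
        return self.slave.update(m, 1 - clk)  # slave open while clk = 0


ff = MasterSlaveDFlipFlop()
inputs = [(1, 0), (1, 1), (1, 0), (0, 0), (0, 1), (0, 0)]  # (d, clk) pairs
trace = [ff.update(d, clk) for d, clk in inputs]
```

In the trace, the change of d while the clock is low (fourth step) does not reach the output; only the value captured during the high phase appears after the clock falls.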
The Processor: Datapath & Control
Our implementation of the MIPS is simplified
memory-reference instructions: lw, sw
arithmetic-logical instructions: add, sub, and, or, slt
control flow instructions: beq, j
Generic implementation
use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
decode the instruction (and read registers)
execute the instruction
All instructions (except j) use the ALU after reading the registers
How? memory-reference? arithmetic? control flow?
[Figure: Fetch (PC = PC+4) -> Decode -> Exec]
Slide20
Aside: Clocking Methodologies
The clocking methodology defines when data in a state element is valid and stable relative to the clock
State element - a memory element such as a register
Edge-triggered – all state changes occur on a clock edge
Typical execution
read contents of state elements -> send values through combinational logic -> write results to one or more state elements
[Figure: State element 1 -> Combinational logic -> State element 2, both driven by the clock, over one clock cycle]
Assumes state elements are written on every clock cycle; if not, need explicit write control signal
write occurs only when both the write control is asserted and the clock edge occurs
Slide21
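The "both the write control and the clock edge" rule can be illustrated with a small model of one state element. This is a sketch only: it assumes a rising-edge-triggered register, and `Register` and `tick` are hypothetical names.

```python
class Register:
    """Edge-triggered state element with an explicit write control signal."""

    def __init__(self, value: int = 0) -> None:
        self.value = value
        self._prev_clk = 0

    def tick(self, clk: int, write_enable: int, data_in: int) -> int:
        rising_edge = self._prev_clk == 0 and clk == 1  # assumed active edge
        if rising_edge and write_enable:
            self.value = data_in
        self._prev_clk = clk
        return self.value


reg = Register()
edge_only = reg.tick(clk=1, write_enable=0, data_in=7)    # edge, no write control
enable_only = reg.tick(clk=0, write_enable=1, data_in=7)  # write control, no edge
both = reg.tick(clk=1, write_enable=1, data_in=7)         # edge and write control
```

Only the third call changes the stored value, which is why a state element that is not written every cycle needs its own write signal.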
Fetch Phase
Fetching instructions involves
reading the instruction from the Instruction Memory
updating the PC value to be the address of the next (sequential) instruction
[Figure: the clocked PC feeds the Read Address of the Instruction Memory and an adder that computes PC + 4; Fetch (PC = PC+4) -> Decode -> Exec]
PC is updated every clock cycle, so it does not need an explicit write control signal, just a clock signal
Reading from the Instruction Memory is a combinational activity, so it doesn’t need an explicit read control signal
Slide22
Decoding Instructions
Decoding instructions involves
sending the fetched instruction’s opcode and function field bits to the control unit
reading two values from the Register File
Register addresses (Read Addr 1 & Read Addr 2) are contained in the instruction
[Figure: the Instruction feeds the Control Unit and the Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data; outputs Read Data 1, Read Data 2); Fetch (PC = PC+4) -> Decode -> Exec]
Slide23Slide24
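As a sketch of what decoding extracts, the function below slices a 32-bit MIPS instruction word into its fields using the standard bit positions (op in bits 31-26, rs 25-21, rt 20-16, rd 15-11, shamt 10-6, funct 5-0, 16-bit immediate 15-0). The name `decode` and the dictionary layout are illustrative choices, not part of the slides.

```python
def decode(instr: int) -> dict[str, int]:
    """Split a 32-bit MIPS instruction word into its bit fields."""
    return {
        "op": (instr >> 26) & 0x3F,     # bits 31-26, sent to the control unit
        "rs": (instr >> 21) & 0x1F,     # bits 25-21, Read Addr 1
        "rt": (instr >> 16) & 0x1F,     # bits 20-16, Read Addr 2
        "rd": (instr >> 11) & 0x1F,     # bits 15-11
        "shamt": (instr >> 6) & 0x1F,   # bits 10-6
        "funct": instr & 0x3F,          # bits 5-0
        "imm16": instr & 0xFFFF,        # bits 15-0 for I-type instructions
    }


# 0x012A4020 encodes add $t0, $t1, $t2 ($t0 = 8, $t1 = 9, $t2 = 10)
fields = decode(0x012A4020)
```

All fields are extracted unconditionally; which of them are meaningful depends on the instruction format selected by the opcode.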
Executing R Format Operations
R format operations (add, sub, slt, and, or)
perform operation (op and funct) on values in rs and rt
store the result back into the Register File (into location rd)
R-type: op (bits 31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
[Figure: the Register File outputs Read Data 1 and Read Data 2 feed the ALU (outputs: overflow, zero); ALU control is a 3 bit code; the ALU result returns to Write Data under RegWrite; Fetch (PC = PC+4) -> Decode -> Exec]
Register File is not written every cycle (e.g., sw)
need an explicit write control signal (RegWrite) for it.
Slide25Slide26
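A minimal software model of the ALU for the five R format operations, returning the result together with the zero and overflow outputs. It is a sketch under assumptions: operations are selected by name rather than by the actual ALU control code, and `alu` and `to_signed` are hypothetical helper names.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op: str, a: int, b: int) -> tuple[int, int, int]:
    """Return (result, zero, overflow) for 32-bit operands a and b."""
    sa, sb = to_signed(a), to_signed(b)
    if op == "add":
        full = sa + sb
    elif op == "sub":
        full = sa - sb
    elif op == "and":
        full = a & b
    elif op == "or":
        full = a | b
    elif op == "slt":
        full = 1 if sa < sb else 0
    else:
        raise ValueError(f"unsupported ALU operation: {op}")
    result = full & MASK32
    in_range = -(1 << 31) <= full < (1 << 31)
    overflow = int(op in ("add", "sub") and not in_range)
    return result, int(result == 0), overflow
```

For example, `alu("sub", 5, 5)` sets the zero output, which is the signal a beq comparison relies on.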
Executing Load and Store Operations
Load and store operations involve
compute memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction, e.g., sw $s3, 4($t5)
store value (read from the Register File during decode) written to the Data Memory
load value, read from the Data Memory, written to the Register File
[Figure: Register File, Sign Extend (16 to 32 bits), ALU (overflow, zero) and Data Memory (Address, Write Data, Read Data); control signals ALU control, RegWrite, MemWrite, MemRead; example base register $t5 with offset 4]
Slide27Slide28
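The address calculation for loads and stores can be sketched as follows. The helper names and the base register value in the usage lines are made up for illustration.

```python
def sign_extend16(imm16: int) -> int:
    """Sign-extend a 16-bit field to 32 bits (returned as an unsigned int)."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if imm16 & 0x8000 else imm16


def effective_address(base: int, offset16: int) -> int:
    """Base register plus sign-extended offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


# sw $s3, 4($t5) with a hypothetical $t5 value of 0x10010000
addr_pos = effective_address(0x10010000, 4)
# the same base with offset -4 (encoded as 0xFFFC)
addr_neg = effective_address(0x10010000, 0xFFFC)
```

Sign extension is what lets a 16-bit offset reach addresses below the base register as well as above it.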
Branch Addressing
Branch instructions specify
opcode, two registers, target address
Most branch targets are near branch
- Forward or backward
op (6 bits) | rs (5 bits) | rt (5 bits) | constant or address (16 bits)
PC-relative addressing
Target address = PC + offset × 4
PC already incremented by 4 by this time
Slide29
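A small sketch of the PC-relative calculation. Here `pc` is taken to be the address of the branch instruction itself, so the function adds 4 first to model the already-incremented PC; the function name and the example addresses are illustrative.

```python
def branch_target(pc: int, offset16: int) -> int:
    """Target = (PC + 4) + sign-extended offset * 4, wrapped to 32 bits."""
    offset16 &= 0xFFFF
    signed = offset16 - 0x10000 if offset16 & 0x8000 else offset16
    return (pc + 4 + signed * 4) & 0xFFFFFFFF


forward = branch_target(0x00400000, 3)        # skip ahead 3 instructions
backward = branch_target(0x00400010, 0xFFFF)  # offset -1: back to the branch
```

Multiplying the offset by 4 works because the offset counts instructions while addresses count bytes, and every instruction is 4 bytes long.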
Executing Branch Operations
Branch operations involve
compare the operands read from the Register File during decode for equality (zero ALU output)
compute the branch target address by adding the updated PC to the 16-bit sign-extended offset field in the instr
Why << 2?
[Figure: Register File outputs feed the ALU (zero output goes to the branch control logic; ALU control is a 3 bit code); the offset passes through Sign Extend (16 to 32 bits) and Shift left 2 into an adder with PC + 4 to produce the branch target address]
Slide30Slide31
Jump Addressing
Jump (j and jal) targets could be anywhere in text segment
Encode full address in instruction
op (6 bits) | address (26 bits)
(Pseudo)Direct jump addressing
Target address = PC[31…28] : (address × 4)
Slide32
Executing Jump Operations
Jump operation involves
replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
Target address = PC[31…28] : (address × 4)
[Figure: PC, adder (+4) and Instruction Memory; the 26-bit jump field is shifted left 2 to 28 bits and joined with the upper 4 bits of the incremented PC to form the jump address]
Slide33
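The concatenation can be sketched in Python as below. The function name and example values are illustrative, and `pc` is assumed to be the address of the jump instruction, with the upper four bits taken from PC + 4.

```python
def jump_target(pc: int, address26: int) -> int:
    """Join the upper 4 bits of PC + 4 with the 26-bit field shifted left 2."""
    upper = (pc + 4) & 0xF0000000
    return upper | ((address26 & 0x03FFFFFF) << 2)


near = jump_target(0x00400000, 0x0100004)  # 26-bit field 0x0100004
high = jump_target(0x90000000, 1)          # upper bits 0x9 are preserved
```

Because only 28 bits come from the instruction, a jump cannot leave the 256 MB region selected by the upper four PC bits.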
Creating a Single Datapath from the Parts
Assemble the datapath segments and add control lines and multiplexors as needed
Single cycle design – fetch, decode and execute each instruction in one clock cycle
no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
multiplexors needed at the input of shared elements with control lines to do the selection
write signals to control writing to the Register File and Data Memory
Cycle time is determined by length of the longest path
Slide34
Multiplexors
2 Input 1 Bit Selector Device (2x1 MUX)
Here is a truth table definition of a “function” we wish to implement:
When S = 0, A is “selected” for output
When S = 1, B is “selected” for output
S A B output
0 0 0 0
0 0 1 0
0 1 0 1
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 0
1 1 1 1
Slide35
2x1 MUX (Multiplexor)
What is the Boolean expression for a 2x1 MUX?
Output = S • B + S’ • A
How do you implement this using gates?
[Figure: gate-level 2x1 MUX with data inputs A and B, control signal S, and one output]
Slide36
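One way to check a candidate Boolean expression is to evaluate it for every input combination. The sketch below does this for a sum-of-products form in which A is gated by the complement of S and B by S; `mux2` is a hypothetical name.

```python
from itertools import product


def mux2(s: int, a: int, b: int) -> int:
    """2x1 multiplexor: selects a when s = 0 and b when s = 1."""
    not_s = 1 - s
    return (s & b) | (not_s & a)  # two AND gates feeding one OR gate


# Rows are (S, A, B, output), enumerated with S as the most significant input.
truth_table = [(s, a, b, mux2(s, a, b)) for s, a, b in product((0, 1), repeat=3)]
outputs = [row[3] for row in truth_table]
```

The expression maps directly to gates: one inverter for S, two AND gates, and one OR gate.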
Multiplexors (MUX) and ALUs
- To select a source input for ALU
[Figure: a MUX, steered by a control signal, selects between a value from a register and a value from an instruction field as an input to the ALU]
Slide37
Fetch, R, and Memory Access Portions
[Figure: combined datapath for fetch, R-type and memory access: PC, ALU used as adder (+4), Instruction Memory, Register File, Sign Extend (16 to 32 bits), ALU (ovf, zero) and Data Memory; a multiplexor (mux) controlled by ALUSrc and one controlled by MemtoReg; control signals ALU control, RegWrite, MemWrite, MemRead]
Slide38
Adding the Control
Selecting the operations to perform (ALU, Register File and Memory read/write)
Controlling the flow of data (multiplexor inputs)
R-type: op (bits 31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
I-Type: op (bits 31-26) | rs (25-21) | rt (20-16) | address offset (15-0)
J-type: op (bits 31-26) | target address (25-0)
Observations
op field always in bits 31-26
addr of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and sw rs is the base register
addr. of register to be written is in one of two places – in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
offset for beq, lw, and sw always in bits 15-0
Slide39
Control
The control unit is responsible for setting all the control signals so that each instruction is executed properly.
- The control unit’s input is the 32-bit instruction word.
- The outputs are values for the control signals in the datapath.
Most of the signals can be generated from the instruction opcode alone, and not the entire 32-bit word.
To illustrate the relevant control signals, we will show the route that is taken through the datapath by R-type, lw, sw and beq instructions.
Slide40Slide41Slide42
ALU Control Unit
[Figure: single-cycle datapath with control: the Control Unit takes Instr[31-26] and drives RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; the ALU control takes ALUOp (2 bits) and Instr[5-0]; PCSrc selects between PC + 4 and the branch target]
Slide43
Can ignore
use XX don’t cares
Slide44
Slide45Slide46
Bit I/O for ALU Control Unit
[Figure: single-cycle datapath with control, highlighting the ALU control inputs ALUOp (2 bits) and Instr[5-0] and its output to the ALU]
Slide47
R-type Instruction
R-type: op (bits 31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
Slide48Slide49
R-type Dataflow
[Figure: single-cycle datapath highlighting the R-type dataflow: Instr[25-21] (rs) and Instr[20-16] (rt) select the registers read, Instr[15-11] (rd) selects the register written, and ALUOp with Instr[5-0] drives the ALU control]
Slide50
R type - Control Lines
0 0 0 0 1
Slide51
0010
Slide52
Load Word Instruction Data/Control Flow
[Figure: single-cycle datapath highlighting the data and control flow for lw $s1, 32($t0): base register $t0, offset 32, destination register $s1]
Slide54
lw - Control Lines
1 1 0 0 1 1
Slide55
[Figure: labels $a0 and $sp with bus widths 16 and 32]
Slide56
Branching
I-Type: op (bits 31-26) | rs (25-21) | rt (20-16) | address offset (15-0)
Slide57Slide58
Branch Instruction Data/Control Flow
[Figure: single-cycle datapath highlighting the data and control flow for beq $t0,$t1,addr: registers $t0 and $t1 are read and compared in the ALU while the branch target is computed from PC + 4 and the shifted, sign-extended offset]
Slide60
Main Control Lines
1 0
Slide61
J-type: op (bits 31-26) | target address (25-0)
Slide62
Jump
[Figure: single-cycle datapath extended for jump: Instr[25-0] (26 bits) is shifted left 2 to 28 bits and combined with PC+4[31-28] to form the 32-bit jump address, selected by a mux controlled by the Jump signal]
j addr
if addr = A (26 bits), A<<2 to make it 28 bits
Target addr = PC[31-28]:A00
Slide63
Single Cycle Disadvantages & Advantages
Uses the clock cycle inefficiently – the clock cycle must be timed to accommodate the slowest instruction
especially problematic for more complex instructions like floating point multiply
[Figure: two clock cycles (Cycle 1, Cycle 2) on Clk; lw takes 800 ps, sw takes 700 ps and leaves the rest of its cycle as waste]
May be wasteful of area since some functional units (e.g., adders) must be duplicated since they cannot be shared during a clock cycle
but is simple and easy to understand
Slide64
Instruction Critical Paths
What is the clock cycle time assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times except:
Instruction and Data Memory access (200 ps)
ALU and adders (200 ps)
Register File access (reads or writes) (100 ps)
Slide65
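The question above can be worked through by summing the delays along each instruction's path. The delays are the ones given; the list of units each instruction class passes through is an assumption based on the usual single-cycle datapath, not something stated on the slide.

```python
DELAY_PS = {"mem": 200, "alu": 200, "reg": 100}  # delays given above

# Assumed unit sequence per instruction class (instruction fetch comes first).
PATHS = {
    "R-type": ["mem", "reg", "alu", "reg"],
    "lw": ["mem", "reg", "alu", "mem", "reg"],
    "sw": ["mem", "reg", "alu", "mem"],
    "beq": ["mem", "reg", "alu"],
    "j": ["mem"],
}

latency_ps = {name: sum(DELAY_PS[u] for u in units) for name, units in PATHS.items()}
cycle_time_ps = max(latency_ps.values())  # single-cycle clock fits the slowest path
```

Under these assumptions lw is the critical instruction, so every instruction pays for the longest path even when it finishes earlier.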
How Can We Make It Faster?
Start fetching and executing the next instruction before the current one has completed
Pipelining – (all?) modern processors are pipelined for performance
The performance metric: CPU time = Cycles/Instr * Clk Cycle * Total # Instr
Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
A five stage pipeline is nearly five times faster because the clk cycle is nearly five times faster
Fetch (and execute) more than one instruction at a time
Slide66
Pipelining Analogy
Pipelined laundry: overlapping execution
Parallelism improves performance
Four loads:
Sequential = 8 hrs
Pipelined = 3.5 hrs
Speedup = 8/3.5 = 2.3
Slide67
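The laundry numbers can be reproduced with a short calculation. The sketch assumes four stages of half an hour each, which is consistent with the 8 hr and 3.5 hr totals but is not stated on the slide; the function names are illustrative.

```python
def sequential_time(n_tasks: int, n_stages: int, stage_time: float) -> float:
    """Each task runs through every stage before the next task starts."""
    return n_tasks * n_stages * stage_time


def pipelined_time(n_tasks: int, n_stages: int, stage_time: float) -> float:
    """The first task fills the pipeline, then one task finishes per stage time."""
    return (n_stages + n_tasks - 1) * stage_time


seq = sequential_time(4, 4, 0.5)   # four loads, assumed 4 stages of 0.5 hr
pipe = pipelined_time(4, 4, 0.5)
speedup = seq / pipe
many = sequential_time(1000, 4, 0.5) / pipelined_time(1000, 4, 0.5)
```

With only four loads the fill time dominates, so the speedup stays well below the stage count; with many loads it approaches the number of stages.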
Pipelining Speedup
A five-stage pipeline should offer nearly a fivefold improvement over the nonpipelined time, or a 160 ps clock cycle.
Pipelining has some issues
actual speedup < # stages.
Slide68Slide69
f = Fetch
r = Register read
a = ALU op
d = Data access
w = WritebackSlide70
Pipeline Control
IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
ID Stage: no optional control signals to set
EX Stage: RegDst, ALUOp1, ALUOp0, ALUSrc
MEM Stage: Brch, MemRead, MemWrite
WB Stage: RegWrite, MemtoReg
Instr RegDst ALUOp1 ALUOp0 ALUSrc Brch MemRead MemWrite RegWrite MemtoReg
R 1 1 0 0 0 0 0 1 0
lw 0 0 0 1 0 1 0 1 1
sw X 0 0 1 0 0 1 0 X
beq X 0 1 0 1 0 0 0 X
Slide71Slide72Slide73Slide74Slide75Slide76Slide77Slide78Slide79Slide80Slide81Slide82Slide83Slide84Slide85Slide86Slide87Slide88Slide89
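The control table can be captured as a lookup keyed by instruction class. This is a sketch: the signal values are those in the table above, while the grouping of signals by pipeline stage follows the usual textbook layout and should be read as an assumption. The names `ROWS`, `STAGE_OF` and `control` are invented here.

```python
SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc",
           "Branch", "MemRead", "MemWrite", "RegWrite", "MemtoReg")

# Assumed stage in which each signal is consumed.
STAGE_OF = {"RegDst": "EX", "ALUOp1": "EX", "ALUOp0": "EX", "ALUSrc": "EX",
            "Branch": "MEM", "MemRead": "MEM", "MemWrite": "MEM",
            "RegWrite": "WB", "MemtoReg": "WB"}

# One character per signal, in SIGNALS order; X marks a don't care.
ROWS = {"R": "110000010", "lw": "000101011", "sw": "X0010010X", "beq": "X0101000X"}


def control(instr_class: str) -> dict[str, dict[str, str]]:
    """Return the control values for one instruction class, grouped by stage."""
    grouped: dict[str, dict[str, str]] = {"EX": {}, "MEM": {}, "WB": {}}
    for name, value in zip(SIGNALS, ROWS[instr_class]):
        grouped[STAGE_OF[name]][name] = value
    return grouped


lw_control = control("lw")
```

Grouping by stage mirrors how the values travel through the pipeline registers: each stage uses its own group and passes the rest along.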
The desired data is only available after the 4th stage for lw instead of the 3rd stage for add.
Need to stall one cycle (a bubble).
lw $s0, 20($t1)
sub $t2, $s0, $t3
Slide90Slide91
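A common way to detect this load-use case is to compare the destination of a load in the EX stage with the source registers of the instruction being decoded. The check below is a sketch of that textbook condition, not necessarily the exact logic used in these slides; the parameter names are illustrative.

```python
def must_stall(id_ex_mem_read: bool, id_ex_rt: int, if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in decode needs the result of a load in EX."""
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)


# lw $s0, 20($t1) followed by sub $t2, $s0, $t3
# register numbers: $s0 = 16, $t3 = 11
stall_needed = must_stall(True, 16, 16, 11)
# an add in EX (no memory read) writing $s0 would not force a stall
no_stall = must_stall(False, 16, 16, 11)
```

When the check fires, the pipeline holds the dependent instruction for one cycle and inserts a bubble behind the load.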
W RSlide92Slide93Slide94Slide95Slide96
Output from ALU of 1st instruction has to be forwarded to Instr#2 and #3
Since only one output (from ALU) can be forwarded, it is sent to Instr#2
Instr#3’s ALU must wait till the next stage to get output of 1st Instr’s ALU
Slide97
?
?
IF ID EX MEM WBSlide98Slide99
Why?Slide100
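Why 13 cycles?
Cycles: 1 2 3 4 5 6 7 8 9 10 11 12 13
Instr: 1, 2, (stall), 3, 4, 5, (stall), 6, 7
In a five-stage pipeline the first instruction takes 5 cycles, and each of the remaining 6 instructions and 2 stalls adds one more: 5 + 6 + 2 = 13.
Slide101Slide102Slide103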
Forwarding UnitSlide104Slide105
Forwarding Unit Inputs
For ALU result
For WB (from Memory)Slide106
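Forwarding Example
[Figure: pipelined datapath with forwarding, annotated with registers $t0 and $t1]
Slide107
[Figure: forwarding example continued, annotated with registers $t0 and $t1]
Slide108Slide109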
Add $t0,
add $t0,
or $t1,
Х Х
sub _ , _, _
$t0,$t0
(if one instr away, can forward at MEM stage)
Slide110Slide111Slide112
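The two forwarding sources named above (the ALU result and the value headed for write-back) suggest a selection rule like the one sketched here. It follows the usual textbook conditions, with the newer ALU result taking priority; the function name, parameter names and return labels are assumptions made for illustration.

```python
def forward_select(src_reg: int,
                   ex_mem_reg_write: bool, ex_mem_rd: int,
                   mem_wb_reg_write: bool, mem_wb_rd: int) -> str:
    """Choose where an ALU operand for register src_reg should come from."""
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"   # ALU result of the previous instruction
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"   # value about to be written back
    return "REG"          # no hazard: use the Register File value


# two earlier instructions both write $t0 (register 8): the newer one wins
newest = forward_select(8, True, 8, True, 8)
older = forward_select(8, False, 0, True, 8)
plain = forward_select(0, True, 0, False, 0)  # register 0 is never forwarded
```

Checking the EX/MEM source first ensures that when two earlier instructions write the same register, the most recent value is the one forwarded.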
Hazard Unit
Slide113Slide114Slide115Slide116Slide117Slide118Slide119Slide120Slide121Slide122Slide123Slide124Slide125Slide126Slide127Slide128Slide129Slide130