Processor Design & Implementation

Uploaded by liane-varnes on 2018-10-21.



Review: MIPS (RISC) Design Principles

- Simplicity favors regularity
  - fixed-size instructions
  - small number of instruction formats
  - opcode always the first 6 bits
- Smaller is faster
  - limited instruction set
  - limited number of registers in register file
  - limited number of addressing modes
- Make the common case fast
  - arithmetic operands from the register file (load-store machine)
  - allow instructions to contain immediate operands
- Good design demands good compromises
  - three instruction formats

Sequential vs Combinational Circuits

- Combinational logic circuits
  - output is a function of the present value of the inputs only
  - when inputs are changed, the information about the previous inputs is lost: memoryless
  - e.g., multiplexors
- Sequential logic circuits
  - outputs also depend on past inputs: has memory
  - e.g., flip flops/latches
  - basically combinational circuits with the additional properties of storage (to remember past inputs) and feedback
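As a minimal Python sketch (the names `and_gate` and `ToggleCell` are illustrative, not from the slides), a combinational block is a pure function of its present inputs. A sequential block instead stores state and feeds it back, so the same input can give different outputs at different times:

```python
def and_gate(a: int, b: int) -> int:
    """Combinational: the output depends only on the present inputs."""
    return a & b


class ToggleCell:
    """Sequential: the output depends on the present input and on stored state."""

    def __init__(self) -> None:
        self.q = 0  # remembered value (the "memory")

    def step(self, t: int) -> int:
        # Feedback: the new state is computed from the old state.
        if t:
            self.q ^= 1
        return self.q
```

Calling `and_gate(1, 1)` always returns 1. By contrast, two successive `step(1)` calls on one `ToggleCell` return 1 and then 0.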

RS Latches

An RS latch is a memory element with:
- 2 inputs: Reset (R) and Set (S)
- 2 outputs: Q and Q' (the complement of Q)

Note: if the inputs don't change, the outputs are held indefinitely.

RS Latches - Hold

[Figure: RS latch holding its state: with R = S = 0, Q and Q' keep their previous values]

RS Latches - Set

[Figure: RS latch Set operation (S = 1, R = 0 drives Q to 1 and Q' to 0), followed by a Reset]
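A small gate-level simulation can check the hold, set and reset behaviour. This sketch assumes the common cross-coupled NOR implementation (Q = NOR(R, Q'), Q' = NOR(S, Q)). It re-evaluates the two gates until the outputs settle, and it ignores real gate delays and the forbidden input S = R = 1. The function names are illustrative.

```python
from typing import Tuple


def nor(a: int, b: int) -> int:
    """NOR gate on 0/1 values."""
    return int(not (a or b))


def rs_latch(s: int, r: int, q: int, q_bar: int) -> Tuple[int, int]:
    """Cross-coupled NOR latch: Q = NOR(R, Q'), Q' = NOR(S, Q).

    Starting from the stored outputs (q, q_bar), re-evaluate both gates
    until the outputs stop changing, i.e. until the latch has settled.
    """
    for _ in range(4):
        new_q = nor(r, q_bar)
        new_q_bar = nor(s, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar
```

For example, `rs_latch(1, 0, 0, 1)` (set) settles to `(1, 0)`, and `rs_latch(0, 0, 1, 0)` (hold) keeps `(1, 0)`.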

Clocks and Synchronous Circuits

- Asynchronous operation: the output state of RS latches changes directly in response to changes in the inputs.
- Virtually all sequential circuits currently employ the notion of synchronous operation: the output of a sequential circuit is constrained to change only at a time specified by a global enabling signal. This signal is generally known as the system clock.

Transparent D Latches

- Modify the RS latch so that its output state is only permitted to change when a valid enable signal (system clock) is present.
- Add a couple of AND gates in cascade with the R and S inputs, controlled by an additional input known as the enable (EN) input.
- For a D latch, S is driven by D and R by D'. While EN = 1 the output follows D (the latch is "transparent"); while EN = 0 it holds.
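Here is a behavioural sketch of the gated latch described above. It assumes the D form (S = D·EN, R = D'·EN), and the class name `DLatch` is illustrative:

```python
class DLatch:
    """Transparent D latch: a gated RS latch with S = D·EN and R = D'·EN."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, d: int, en: int) -> int:
        s = d & en          # AND gate on the S input
        r = (1 - d) & en    # AND gate on the R input (R driven by D')
        if s:
            self.q = 1      # set
        elif r:
            self.q = 0      # reset
        # s == r == 0 (EN = 0): hold the stored value
        return self.q
```

Since S and R can never both be 1, the forbidden RS input combination cannot occur.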

J-K Flip Flops

Race-around condition: occurs only when J = K = clk = 1.

[Figure: race-around timing: with J = K = 1, Q toggles 0, 1, 0, 1, ... for as long as clk = 1]

Clk  J  K | Q(n+1)  Q(n+1)'
 0   x  x | Q(n)    Q(n)'
 1   0  0 | Q(n)    Q(n)'
 1   0  1 | 0       1        (reset)
 1   1  0 | 1       0        (set)
 1   1  1 | ?       ?        (race-around: Q keeps toggling while clk = 1)

Reset: Q goes to 0 when K = 1 (with J = 0).
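When clk = 1, the table corresponds to the characteristic equation Q(n+1) = J·Q' + K'·Q. The sketch below encodes that equation. It then makes a simplifying assumption, that each gate delay re-evaluates the flip flop once, to show why J = K = clk = 1 makes the output race around. The function names are illustrative.

```python
from typing import List


def jk_next(q: int, j: int, k: int, clk: int = 1) -> int:
    """J-K flip flop: Q(n+1) = J·Q' + K'·Q while clk = 1, hold otherwise."""
    if not clk:
        return q                      # clock low: hold
    return (j & (1 - q)) | ((1 - k) & q)


def race_around(q: int, gate_delays: int) -> List[int]:
    """Clock held at 1 with J = K = 1 for several gate delays: each
    re-evaluation toggles Q again, so the final value is unpredictable."""
    trace = [q]
    for _ in range(gate_delays):
        q = jk_next(q, 1, 1)
        trace.append(q)
    return trace
```

For instance, `race_around(0, 4)` returns `[0, 1, 0, 1, 0]`.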

Master-Slave Flip Flops

Sequential circuits are easy to design if outputs change only on the rising (positive-going) or falling (negative-going) edges of a clock (i.e., enable) signal.

This can be done by combining two transparent D latches in a master-slave configuration.
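This is a minimal sketch of the master-slave idea. It assumes the master latch is enabled while clk = 0 and the slave while clk = 1, which gives a rising-edge-triggered D flip flop. The class names are illustrative.

```python
class DLatch:
    """Level-sensitive D latch: transparent while en = 1, holds while en = 0."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, d: int, en: int) -> int:
        if en:
            self.q = d
        return self.q


class MasterSlaveDFF:
    """D flip flop from two D latches; Q changes only on a rising clock edge."""

    def __init__(self) -> None:
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, d: int, clk: int) -> int:
        m = self.master.update(d, 1 - clk)   # master open while clk = 0
        return self.slave.update(m, clk)     # slave open while clk = 1
```

Take the (D, clk) sequence (1,0), (1,1), (0,1), (0,0), (0,1). The outputs are 0, 1, 1, 1, 0. D drops to 0 while clk is still 1 (third step), but Q does not follow until the next rising edge (fifth step).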

The Processor: Datapath & Control

Our implementation of the MIPS is simplified:
- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j

Generic implementation:
- use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
- decode the instruction (and read registers)
- execute the instruction

All instructions (except j) use the ALU after reading the registers. How? Memory-reference instructions use it for address calculation, arithmetic-logical instructions for the operation itself, and beq for the equality comparison.

[Figure: instruction cycle: Fetch (PC = PC+4) -> Decode -> Exec]

Aside: Clocking Methodologies

The clocking methodology defines when data in a state element is valid and stable relative to the clock.
- State element: a memory element such as a register
- Edge-triggered: all state changes occur on a clock edge

Typical execution: read contents of state elements -> send values through combinational logic -> write results to one or more state elements

[Figure: State element 1 -> Combinational logic -> State element 2, all within one clock cycle]

This assumes state elements are written on every clock cycle. If not, an explicit write control signal is needed, and a write occurs only when both the write control is asserted and the clock edge occurs.
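The sketch below models the read -> combinational logic -> write pattern together with the write-control rule. `StateElement` and `one_cycle` are hypothetical names, and a clock edge is modelled as a method call:

```python
from typing import Callable


class StateElement:
    """Edge-triggered register: its value changes only at a clock edge."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def clock_edge(self, d: int, write: bool = True) -> int:
        # Written only when the clock edge occurs AND the write control is asserted.
        if write:
            self.value = d
        return self.value


def one_cycle(src: StateElement, dst: StateElement,
              logic: Callable[[int], int], write: bool = True) -> int:
    """Read state element 1 -> combinational logic -> write state element 2."""
    return dst.clock_edge(logic(src.value), write)
```

With `write=False`, the destination keeps its old value even though the combinational logic still computed a new one.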

Fetch Phase

Fetching instructions involves
- reading the instruction from the Instruction Memory
- updating the PC value to be the address of the next (sequential) instruction

[Figure: PC -> Instruction Memory (Read Address -> Instruction), with an adder computing PC + 4]

- The PC is updated every clock cycle, so it does not need an explicit write control signal, just a clock signal.
- Reading from the Instruction Memory is a combinational activity, so it doesn't need an explicit read control signal.
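Here is a sketch of the fetch step. It assumes instruction memory is modelled as a Python list of 32-bit words indexed by PC / 4, and the name `fetch` is illustrative:

```python
from typing import List, Tuple


def fetch(imem: List[int], pc: int) -> Tuple[int, int]:
    """Read the instruction at byte address pc and compute PC + 4.

    imem is a list of 32-bit words, so the word index is pc // 4.
    """
    assert pc % 4 == 0, "instructions are word-aligned"
    instr = imem[pc // 4]     # combinational read: no read control needed
    next_pc = pc + 4          # dedicated adder; written to the PC every cycle
    return instr, next_pc
```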

Decoding Instructions

Decoding instructions involves
- sending the fetched instruction's opcode and function field bits to the control unit
- reading two values from the Register File
  - Register addresses (Read Addr 1 & Read Addr 2) are contained in the instruction

[Figure: the instruction feeds the Control Unit and the Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data; outputs Read Data 1, Read Data 2)]
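The sketch below shows the field extraction performed during decode, using the R-format bit positions given on the following slides. `decode` and `read_registers` are illustrative names. The test word `0x01095020` is the standard MIPS encoding of `add $t2, $t0, $t1`.

```python
from typing import Dict, List, Tuple


def decode(instr: int) -> Dict[str, int]:
    """Split a 32-bit MIPS instruction word into its R-format fields."""
    return {
        "op":    (instr >> 26) & 0x3F,  # bits 31-26, to the control unit
        "rs":    (instr >> 21) & 0x1F,  # bits 25-21, Read Addr 1
        "rt":    (instr >> 16) & 0x1F,  # bits 20-16, Read Addr 2
        "rd":    (instr >> 11) & 0x1F,  # bits 15-11
        "shamt": (instr >> 6) & 0x1F,   # bits 10-6
        "funct": instr & 0x3F,          # bits 5-0, to the control unit
    }


def read_registers(regs: List[int], fields: Dict[str, int]) -> Tuple[int, int]:
    """Register File read: Read Data 1 = regs[rs], Read Data 2 = regs[rt]."""
    return regs[fields["rs"]], regs[fields["rt"]]
```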

Executing R Format Operations

R format operations (add, sub, slt, and, or)
- perform the operation (selected by op and funct) on the values in rs and rt
- store the result back into the Register File (into location rd)

[Figure: Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data; Read Data 1, Read Data 2) feeding the ALU (ALU control (3-bit code) input; overflow and zero outputs), with the RegWrite control signal]

R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]

The Register File is not written every cycle (e.g., sw), so it needs an explicit write control signal (RegWrite).
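Here is a behavioural sketch of R-format execution. It assumes the standard MIPS funct codes (add 0x20, sub 0x22, and 0x24, or 0x25, slt 0x2A) and ignores overflow detection. The names are illustrative.

```python
from typing import Callable, Dict, List

MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


# funct codes (standard MIPS) for the R-type subset on the slide
FUNCT_OPS: Dict[int, Callable[[int, int], int]] = {
    0x20: lambda a, b: a + b,                             # add
    0x22: lambda a, b: a - b,                             # sub
    0x24: lambda a, b: a & b,                             # and
    0x25: lambda a, b: a | b,                             # or
    0x2A: lambda a, b: int(to_signed(a) < to_signed(b)),  # slt
}


def execute_rtype(regs: List[int], instr: int, reg_write: bool = True) -> int:
    rs = (instr >> 21) & 0x1F
    rt = (instr >> 16) & 0x1F
    rd = (instr >> 11) & 0x1F
    funct = instr & 0x3F
    result = FUNCT_OPS[funct](regs[rs], regs[rt]) & MASK32  # overflow not modeled
    if reg_write and rd != 0:       # RegWrite asserted; $zero stays 0
        regs[rd] = result
    return result
```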

Executing Load and Store Operations

Load and store operations involve
- computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction, e.g., sw $s3, 4($t5)
- store: the value (read from the Register File during decode) is written to the Data Memory
- load: the value read from the Data Memory is written to the Register File

[Figure: load/store datapath: Register File and ALU as before, plus a Sign Extend unit (16 -> 32 bits) and a Data Memory (Address, Write Data, Read Data) with MemWrite and MemRead controls; in the example the ALU adds $t5 and 4]
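The sketch below covers the address calculation and the two memory operations. It assumes Data Memory is a Python dict keyed by byte address, and the function names are illustrative. For the slide's example, `sw $s3, 4($t5)` uses registers 19 and 13 in the standard MIPS numbering.

```python
from typing import Dict, List


def sign_extend16(imm: int) -> int:
    """Sign-extend a 16-bit offset field to a (Python) signed integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(base: int, offset16: int) -> int:
    """ALU add: base register + sign-extended offset, kept to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


def sw(regs: List[int], mem: Dict[int, int], rt: int, rs: int, offset16: int) -> None:
    mem[effective_address(regs[rs], offset16)] = regs[rt]   # MemWrite


def lw(regs: List[int], mem: Dict[int, int], rt: int, rs: int, offset16: int) -> None:
    regs[rt] = mem[effective_address(regs[rs], offset16)]   # MemRead, then RegWrite
```

A negative offset such as 0xFFFC (-4) reaches below the base address. That is why the 16-bit field is sign-extended rather than zero-extended.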

Branch Addressing

Branch instructions specify
- opcode, two registers, target address

Most branch targets are near the branch
- forward or backward

I-type: op (6 bits) | rs (5 bits) | rt (5 bits) | constant or address (16 bits)

PC-relative addressing
- Target address = PC + offset × 4
- PC already incremented by 4 by this time

Executing Branch Operations

Branch operations involve
- comparing the operands read from the Register File during decode for equality (zero ALU output)
- computing the branch target address by adding the updated PC to the 16-bit sign-extended offset field in the instruction, shifted left by 2

Why << 2? Instructions are 4 bytes long and word-aligned, so the offset counts words; shifting it left by 2 multiplies it by 4 to give a byte offset.

[Figure: branch datapath: the ALU compares Read Data 1 and Read Data 2 (zero output); the sign-extended offset (16 -> 32 bits) is shifted left 2 and added to PC + 4 to form the branch target address, which goes to the branch control logic]
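The sketch below computes the branch target and performs the equality test (function names are illustrative). The offset is sign-extended, shifted left 2 and added to PC + 4. The branch is taken when the ALU's zero output is 1.

```python
def sign_extend16(imm: int) -> int:
    """Sign-extend a 16-bit offset field."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, offset16: int) -> int:
    """Target = (PC + 4) + (sign-extended offset << 2), kept to 32 bits."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def beq_next_pc(pc: int, rs_val: int, rt_val: int, offset16: int) -> int:
    # The ALU subtracts the operands; zero = 1 means they are equal.
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0
    return branch_target(pc, offset16) if zero else pc + 4
```

An offset of 0xFFFF (-1) sends the branch back to the `beq` itself, because the offset is relative to PC + 4.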

Chapter 2 — Instructions: Language of the Computer —

Jump Addressing

Jump (j and jal) targets could be anywhere in the text segment
- encode the full address in the instruction

J-type: op (6 bits) | address (26 bits)

(Pseudo)direct jump addressing
- Target address = PC[31…28] : (address × 4)

Executing Jump Operations

The jump operation involves
- replacing the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
- Target address = PC[31…28] : (address × 4)

[Figure: jump datapath: the 26-bit address field of the instruction is shifted left 2 to give 28 bits, which are joined with the upper 4 bits of the PC (after the PC + 4 adder) to form the jump address]
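Here is a sketch of the pseudo-direct target calculation. It takes the already-incremented PC, as the later Jump slide does (PC+4[31-28]). `jump_target` is an illustrative name. The test word `0x08100000` is a `j` whose 26-bit address field is `0x100000`.

```python
def jump_target(pc_plus_4: int, instr: int) -> int:
    """Pseudo-direct jump: upper 4 PC bits concatenated with (26-bit address << 2)."""
    address26 = instr & 0x03FFFFFF          # Instr[25-0]
    return (pc_plus_4 & 0xF0000000) | (address26 << 2)
```

The target can never leave the 256 MB region selected by the upper 4 PC bits. That limit is why this mode is only "pseudo"-direct.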

Creating a Single Datapath from the Parts

- Assemble the datapath segments and add control lines and multiplexors as needed
- Single-cycle design: fetch, decode and execute each instruction in one clock cycle
  - no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
  - multiplexors are needed at the input of shared elements, with control lines to do the selection
  - write signals control writing to the Register File and Data Memory
- Cycle time is determined by the length of the longest path

Multiplexors

2-input, 1-bit selector device (2x1 MUX)

Here is a truth table definition of a "function" we wish to implement:
- When S = 0, A is "selected" for output
- When S = 1, B is "selected" for output

S  A  B | output
0  0  0 |   0
0  0  1 |   0
0  1  0 |   1
0  1  1 |   1
1  0  0 |   0
1  0  1 |   1
1  1  0 |   0
1  1  1 |   1

2x1 MUX (Multiplexor)

What is the Boolean expression for a 2x1 MUX?

Output = S·B + S'·A

How do you implement this using gates?

[Figure: gate-level 2x1 MUX with data inputs A and B, control signal S, and one output]
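The expression can be checked against the truth table by evaluating both for all eight input combinations. In this sketch, `mux2_gates` models the NOT/AND/OR implementation and `mux2_spec` the selection rule; both names are illustrative.

```python
from itertools import product


def mux2_gates(s: int, a: int, b: int) -> int:
    """Gate-level 2x1 MUX: Output = S·B + S'·A (one NOT, two ANDs, one OR)."""
    not_s = 1 - s
    return (not_s & a) | (s & b)


def mux2_spec(s: int, a: int, b: int) -> int:
    """Selection rule: S = 0 selects A, S = 1 selects B."""
    return b if s else a


# Rows (S, A, B, output) in the same order as the truth table
TRUTH_TABLE = [(s, a, b, mux2_gates(s, a, b)) for s, a, b in product((0, 1), repeat=3)]
```

The output column of `TRUTH_TABLE` reproduces the table above: 0, 0, 1, 1, 0, 1, 0, 1.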

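Multiplexors (MUX) and ALUs

- To select a source input for the ALU: either from a register or from an instruction field, chosen by a control signal

[Figure: a MUX, driven by a control signal, feeds the ALU with either a register value or an instruction field]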

Fetch, R, and Memory Access Portions

[Figure: combined datapath for fetch, R-type and memory-access instructions: PC and adder (PC + 4), Instruction Memory, Register File, ALU (ovf, zero), Sign Extend (16 -> 32), Data Memory (MemWrite, MemRead); the ALUSrc multiplexor selects the ALU's second operand, the MemtoReg multiplexor selects the Register File write data, and RegWrite controls the write]

Adding the Control

- Selecting the operations to perform (ALU, Register File and Memory read/write)
- Controlling the flow of data (multiplexor inputs)

R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
I-type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
J-type: op [31-26] | target address [25-0]

Observations
- op field always in bits 31-26
- addresses of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and sw, rs is the base register
- address of the register to be written is in one of two places: in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
- offset for beq, lw, and sw always in bits 15-0
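These observations translate directly into bit-field extraction plus one 2-way choice for the write register (the RegDst multiplexor in the full datapath). Names are illustrative. The test word `0x8D110020` is the standard encoding of `lw $s1, 32($t0)`.

```python
from typing import Dict


def fields(instr: int) -> Dict[str, int]:
    """Extract the fields used by the R-, I- and J-type formats."""
    return {
        "op":     (instr >> 26) & 0x3F,   # always bits 31-26
        "rs":     (instr >> 21) & 0x1F,   # read register 1 (base for lw/sw)
        "rt":     (instr >> 16) & 0x1F,   # read register 2, or write register for lw
        "rd":     (instr >> 11) & 0x1F,   # write register for R-type
        "offset": instr & 0xFFFF,         # beq/lw/sw offset, bits 15-0
        "target": instr & 0x03FFFFFF,     # J-type address, bits 25-0
    }


def write_register(instr: int) -> int:
    """RegDst multiplexor: rd for R-type (op == 0), rt otherwise.

    For sw and beq the choice is irrelevant because RegWrite is 0.
    """
    f = fields(instr)
    return f["rd"] if f["op"] == 0 else f["rt"]
```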

Control

- The control unit is responsible for setting all the control signals so that each instruction is executed properly.
  - The control unit's input is the 32-bit instruction word.
  - The outputs are values for the control signals in the datapath.
- Most of the signals can be generated from the instruction opcode alone, not the entire 32-bit word.
- To illustrate the relevant control signals, we will show the route that is taken through the datapath by R-type, lw, sw and beq instructions.

ALU Control Unit

[Figure: complete single-cycle datapath with the main Control Unit (input Instr[31-26]; outputs RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite) and the ALU control unit (inputs: 2-bit ALUOp and Instr[5-0]; 4-bit output to the ALU); PCSrc selects PC + 4 or the branch target]

Can ignore: use XX (don't cares).
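Here is a sketch of the ALU control unit. It assumes the 4-bit encoding used in Patterson & Hennessy (0010 add, 0110 sub, 0000 and, 0001 or, 0111 slt), with ALUOp = 00 for lw/sw, 01 for beq and 1X for R-type. Within this subset the two high funct bits are always 10. That is where the XX don't cares come from.

```python
def alu_control(alu_op: int, funct: int) -> int:
    """Map the 2-bit ALUOp and the 6-bit funct field to a 4-bit ALU control code."""
    if alu_op == 0b00:       # lw / sw: add base + offset
        return 0b0010
    if alu_op == 0b01:       # beq: subtract to compare
        return 0b0110
    # ALUOp = 1X: R-type, so the funct field decides. Bits 5-4 are always 10
    # in this subset, so only funct[3:0] is examined (the XX don't cares).
    by_low_bits = {
        0b0000: 0b0010,  # add  (funct 100000)
        0b0010: 0b0110,  # sub  (funct 100010)
        0b0100: 0b0000,  # and  (funct 100100)
        0b0101: 0b0001,  # or   (funct 100101)
        0b1010: 0b0111,  # slt  (funct 101010)
    }
    return by_low_bits[funct & 0b1111]
```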

Bit I/O for ALU Control Unit

[Figure: the same single-cycle datapath, showing the ALU control unit's inputs (2-bit ALUOp from the Control Unit and Instr[5-0]) and its 4-bit output to the ALU]

R-type Instruction

R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]

R-type Dataflow

[Figure: R-type data flow through the single-cycle datapath: Instr[25-21] (rs) and Instr[20-16] (rt) select the registers read, the ALU (controlled via ALUOp and Instr[5-0]) combines them, and the result is written to the register selected by Instr[15-11] (rd)]

R type - Control Lines

[Table: R-type control line settings]

0010 (ALU control input for add)

Load Word Instruction Data/Control Flow

[Figure: load word data/control flow through the single-cycle datapath]

[Figure: the same flow for lw $s1, 32($t0): $t0 (Instr[25-21]) plus the sign-extended offset 32 forms the Data Memory address, and the loaded word is written to $s1 (Instr[20-16])]

lw - Control Lines

[Table: lw control line settings]


Branching

I-type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]

Branch Instruction Data/Control Flow

[Figure: data/control flow for beq $t0, $t1, addr through the single-cycle datapath]

[Figure: the same flow for beq $t0, $t1, addr, highlighting $t0 (Instr[25-21]) and $t1 (Instr[20-16]) being read from the Register File and compared by the ALU]

Main Control Lines

[Table: main control line settings]

J-type: op [31-26] | target address [25-0]

Jump

[Figure: single-cycle datapath extended for j: Instr[25-0] (26 bits) is shifted left 2 to give 28 bits, concatenated with PC+4[31-28] to form the 32-bit jump address, and a multiplexor controlled by the Jump signal selects it as the next PC]

j addr: if addr = A (26 bits), A << 2 makes it 28 bits
Target addr = PC[31-28] : A : 00

Single Cycle Disadvantages & Advantages

- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  - especially problematic for more complex instructions like floating-point multiply
- May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
- But it is simple and easy to understand

[Figure: timing diagram: lw uses the whole 800 ps of cycle 1; sw needs only 700 ps of cycle 2, and the rest of that cycle is wasted]

Instruction Critical Paths

What is the clock cycle time, assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times, except:
- Instruction and Data Memory access (200 ps)
- ALU and adders (200 ps)
- Register File access (reads or writes) (100 ps)
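This sketch adds up the listed latencies along each instruction class's path. The path compositions are the usual ones for this single-cycle datapath; they are an assumption, not taken from the slide.

```python
# Latencies in ps from the slide; muxes, control, sign extend, PC access,
# shift left 2, wires, setup and hold are treated as 0.
IMEM = DMEM = 200   # Instruction / Data Memory access
ALU = 200           # ALU and adders
REG = 100           # Register File read or write

# Assumed units on each class's path through the single-cycle datapath.
# The PC + 4 and branch-target adders work in parallel with these units,
# so they do not lengthen any path here.
PATHS = {
    "R-type": [IMEM, REG, ALU, REG],
    "lw":     [IMEM, REG, ALU, DMEM, REG],
    "sw":     [IMEM, REG, ALU, DMEM],
    "beq":    [IMEM, REG, ALU],
    "j":      [IMEM],
}

latency = {name: sum(units) for name, units in PATHS.items()}
clock_cycle_ps = max(latency.values())   # the slowest instruction sets the cycle
```

The results (lw 800 ps, sw 700 ps) match the timing on the Single Cycle Disadvantages slide. Every instruction gets an 800 ps cycle even though beq needs only 500 ps.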

How Can We Make It Faster?

- Start fetching and executing the next instruction before the current one has completed
  - Pipelining: (all?) modern processors are pipelined for performance
  - The performance metric: CPU time = (cycles/instruction) × (clock cycle time) × (total # of instructions)
- Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
  - A five-stage pipeline is nearly five times faster because the clock cycle is nearly five times shorter
- Fetch (and execute) more than one instruction at a time

Pipelining Analogy

- Pipelined laundry: overlapping execution
  - Parallelism improves performance
- Four loads:
  - Sequential = 8 hrs
  - Pipelined = 3.5 hrs
  - Speedup = 8/3.5 ≈ 2.3

Pipelining Speedup

- A five-stage pipeline should offer nearly a fivefold improvement over the 800 ps nonpipelined time, or a 160 ps clock cycle.
- Pipelining has some issues, so the actual speedup < # stages.
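The speedup numbers follow from a simple model. With k stages of equal length t, n tasks take n·k·t sequentially and (k + n - 1)·t pipelined. The sketch assumes the laundry has 4 stages of 0.5 h each (2 h per load), and that the 800 ps single-cycle path splits into 5 equal stages. The function names are illustrative.

```python
def sequential_time(n_tasks: int, n_stages: int, stage_time: float) -> float:
    """Each task runs through all stages before the next one starts."""
    return n_tasks * n_stages * stage_time


def pipelined_time(n_tasks: int, n_stages: int, stage_time: float) -> float:
    """Ideal pipeline: n_stages steps to fill, then one task finishes per step."""
    return (n_stages + n_tasks - 1) * stage_time


# Laundry: 4 loads, assuming 4 stages of 0.5 h each (2 h per load)
laundry_seq = sequential_time(4, 4, 0.5)       # 8.0 h
laundry_pipe = pipelined_time(4, 4, 0.5)       # 3.5 h
laundry_speedup = laundry_seq / laundry_pipe   # about 2.29

# Processor: the 800 ps single-cycle path split into 5 equal stages
stage_ps = 800 / 5                             # 160 ps
many = 1_000_000
ideal_speedup = sequential_time(many, 5, stage_ps) / pipelined_time(many, 5, stage_ps)
```

With only 4 loads, the fill time keeps the speedup at about 2.3. With a million instructions, it approaches the 5 stages, in line with "speedup ≈ number of pipe stages".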

f = Fetch, r = Register read, a = ALU op, d = Data access, w = Writeback

Pipeline Control

- IF stage: read Instruction Memory (always asserted) and write PC (on the system clock)
- ID stage: no optional control signals to set

       | EX stage                     | MEM stage               | WB stage
Instr  | RegDst ALUOp1 ALUOp0 ALUSrc  | Brch MemRead MemWrite   | RegWrite MemtoReg
R      |   1      1      0      0     |  0      0       0       |    1        0
lw     |   0      0      0      1     |  0      1       0       |    1        1
sw     |   X      0      0      1     |  0      0       1       |    0        X
beq    |   X      0      1      0     |  1      0       0       |    0        X
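The table can be encoded as data and indexed by the opcode. This sketch assumes the standard MIPS opcodes (0x00 R-type, 0x23 lw, 0x2B sw, 0x04 beq). It writes ALUOp as the 2-bit value ALUOp1 ALUOp0 and uses `None` for X. The function names are illustrative.

```python
from typing import Dict, Optional, Tuple

Signals = Dict[str, Optional[int]]

# One row per instruction class, copied from the Pipeline Control table.
CONTROL: Dict[str, Signals] = {
    "R":   dict(RegDst=1, ALUOp=0b10, ALUSrc=0, Branch=0, MemRead=0,
                MemWrite=0, RegWrite=1, MemtoReg=0),
    "lw":  dict(RegDst=0, ALUOp=0b00, ALUSrc=1, Branch=0, MemRead=1,
                MemWrite=0, RegWrite=1, MemtoReg=1),
    "sw":  dict(RegDst=None, ALUOp=0b00, ALUSrc=1, Branch=0, MemRead=0,
                MemWrite=1, RegWrite=0, MemtoReg=None),
    "beq": dict(RegDst=None, ALUOp=0b01, ALUSrc=0, Branch=1, MemRead=0,
                MemWrite=0, RegWrite=0, MemtoReg=None),
}

# Standard MIPS opcodes (Instr[31-26]) for this subset
OPCODES = {0x00: "R", 0x23: "lw", 0x2B: "sw", 0x04: "beq"}


def main_control(instr: int) -> Signals:
    """Main control: the opcode alone selects a row of the table."""
    return CONTROL[OPCODES[(instr >> 26) & 0x3F]]


def stage_signals(sig: Signals) -> Tuple[Signals, Signals, Signals]:
    """Split the signals by the pipeline stage that uses them (EX, MEM, WB)."""
    ex = {k: sig[k] for k in ("RegDst", "ALUOp", "ALUSrc")}
    mem = {k: sig[k] for k in ("Branch", "MemRead", "MemWrite")}
    wb = {k: sig[k] for k in ("RegWrite", "MemtoReg")}
    return ex, mem, wb
```

In a pipeline, the three groups travel with the instruction through the pipeline registers, and each stage uses only its own group.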

The desired data is only available after the 4th stage for lw, instead of the 3rd stage for add. We need to stall one cycle (a bubble).

lw $s0, 20($t1)
sub $t2, $s0, $t3
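Here is a sketch of the check that triggers the bubble. The `Instr` record is hypothetical, and registers use standard MIPS numbers ($s0 = 16, $t1 = 9, $t2 = 10, $t3 = 11):

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    """Hypothetical decoded-instruction record, used only for this sketch."""
    name: str
    dest: Optional[int]        # register written, or None
    srcs: Tuple[int, ...]      # registers read
    mem_read: bool = False     # True for lw


def needs_load_use_stall(prev: Instr, cur: Instr) -> bool:
    """True if cur reads the register loaded by the lw immediately ahead of it.

    The loaded word exists only after MEM (stage 4), one cycle too late for
    cur's EX stage (stage 3), so one bubble has to be inserted.
    """
    return (prev.mem_read
            and prev.dest is not None
            and prev.dest != 0
            and prev.dest in cur.srcs)
```

For the slide's pair, `lw $s0, 20($t1)` followed by `sub $t2, $s0, $t3` needs the stall. If the first instruction were an add writing $s0, ordinary forwarding would suffice.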


The output from the ALU of the 1st instruction has to be forwarded to Instr #2 and #3.
- Since only one output (from the ALU) can be forwarded, it is sent to Instr #2.
- Instr #3's ALU must wait until the next stage to get the output of the 1st instruction's ALU.

IF ID EX MEM WB

Why?

Why 13 cycles?

Cycle:        1  2  3  4  5  6  7  8  9  10 11 12 13
Instructions: 1, 2, (stall), 3, 4, 5, (stall), 6, 7
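The count can be reproduced under one assumption: in a 5-stage pipeline, each stall delays the next fetch by one cycle. The function names are illustrative.

```python
from typing import List, Optional


def total_cycles(n_instr: int, n_stages: int = 5, stalls: int = 0) -> int:
    """n_stages cycles for the first instruction, +1 per later instruction, +1 per bubble."""
    return n_stages + (n_instr - 1) + stalls


def completion_cycle(fetch_slots: List[Optional[str]], n_stages: int = 5) -> int:
    """fetch_slots[i] is what enters IF in cycle i + 1 (None = stall bubble)."""
    last_fetch = max(i for i, x in enumerate(fetch_slots) if x is not None) + 1
    return last_fetch + n_stages - 1   # the last instruction leaves WB here
```

With 7 instructions and 2 stalls, both functions give 5 + 6 + 2 = 13. Without the stalls, the same 7 instructions would finish in 11 cycles.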

Forwarding Unit

Forwarding Unit Inputs
- For the ALU result
- For WB (from Memory)

Forwarding Example

[Figure: forwarding example with dependences through $t0 and $t1]

add $t0, …
add $t0, …
or $t1, …
sub _, _, _   ($t0, $t0)

(if one instr away, can forward at MEM stage)
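Here is a sketch of the forwarding decision for one ALU operand. It follows the usual conditions from Patterson & Hennessy; the 10/01/00 mux encoding and the argument names are assumptions. EX/MEM is checked first, so the most recent producer wins. That matters when two consecutive instructions both write $t0.

```python
def forward_select(src_reg: int,
                   ex_mem_reg_write: bool, ex_mem_rd: int,
                   mem_wb_reg_write: bool, mem_wb_rd: int) -> int:
    """Mux select for one ALU source register in the EX stage.

    0b10: forward the ALU result held in EX/MEM (producer one instruction ahead)
    0b01: forward the value in MEM/WB (ALU result or loaded data, two ahead)
    0b00: no hazard, use the value read from the Register File
    """
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return 0b10     # most recent result takes priority
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return 0b01
    return 0b00
```

Register 0 ($zero) is never forwarded, because writes to it are discarded.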

Hazard Unit