Uploaded On 2018-11-07


Slide1

CS 152 Computer Architecture and Engineering

Lecture 3 - From CISC to RISC

Dr. George Michelogiannakis

EECS, University of California at Berkeley

CRD, Lawrence Berkeley National Laboratory

http://inst.eecs.berkeley.edu/~cs152

Slide2

Last Time in Lecture 2

ISA is the hardware/software interface
  Defines set of programmer visible state
  Defines instruction format (bit encoding) and instruction semantics
  Examples: IBM 360, MIPS, RISC-V, x86, JVM

Many possible implementations of one ISA
  360 implementations: model 30 (c. 1964), z12 (c. 2012)
  x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4 (c. 2000), Core 2 Duo, Nehalem, Sandy Bridge, Ivy Bridge, Atom, AMD Athlon, Transmeta Crusoe, SoftPC
  MIPS implementations: R2000, R4000, R10000, R18K, …
  JVM: HotSpot, PicoJava, ARM Jazelle, …

Microcoding: straightforward methodical way to implement machines with low logic gate count and complex instructions

Slide3

Question of the Day

Do you think a CISC or RISC single-cycle processor would be faster?

Slide4

“Iron Law” of Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

Instructions per program depends on source code, compiler technology, and ISA

Cycles per instruction (CPI) depends on ISA and µarchitecture

Time per cycle depends upon the µarchitecture and base technology

Slide5

CPI for Microcoded Machine

[Timeline figure: Inst 1 takes 7 cycles, Inst 2 takes 5 cycles, Inst 3 takes 10 cycles]

Total clock cycles = 7+5+10 = 22

Total instructions = 3

CPI = 22/3 = 7.33

CPI is always an average over a large number of instructions
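The CPI arithmetic above, and the way it plugs into the Iron Law, can be sketched in a few lines of Python. This is only an illustration: the function names and the 100 ns cycle time are assumptions made for the example, not values from the lecture.

```python
def average_cpi(cycles_per_inst):
    """Average cycles per instruction over an executed instruction stream."""
    return sum(cycles_per_inst) / len(cycles_per_inst)


def execution_time(instructions, cpi, cycle_time_s):
    """Iron Law: time/program = inst/program * cycles/inst * time/cycle."""
    return instructions * cpi * cycle_time_s


# The three instructions from the slide: 7, 5 and 10 cycles.
cpi = average_cpi([7, 5, 10])

# Hypothetical 100 ns clock period, chosen only for illustration.
t = execution_time(3, cpi, 100e-9)

print(f"CPI = {cpi:.2f}, time = {t * 1e9:.0f} ns")
```

Lowering any one of the three factors helps only if the other two do not grow by more, which is the trade-off the rest of the lecture explores.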

Slide6

Technology Influence

When microcode appeared in 50s, different technologies for:
  Logic: Vacuum Tubes
  Main Memory: Magnetic cores
  Read-Only Memory: Diode matrix, punched metal cards, …

Logic very expensive compared to ROM or RAM

ROM cheaper than RAM

ROM much faster than RAM

But seventies brought advances in integrated circuit technology and semiconductor memory…

Slide7

First Microprocessor

Intel 4004, 1971
  4-bit accumulator architecture
  8µm pMOS
  2,300 transistors
  3 x 4 mm²
  750kHz clock
  8-16 cycles/inst.

Made possible by new integrated circuit technology

Slide8

Microprocessors in the Seventies

Initial target was embedded control
  First micro, 4-bit 4004 from Intel, designed for a desktop printing calculator

Constrained by what could fit on single chip
  Accumulator architectures, similar to earliest computers
  Hardwired state machine control

8-bit micros (8085, 6800, 6502) used in hobbyist personal computers
  Micral, Altair, TRS-80, Apple-II
  Usually had 16-bit address space (up to 64KB directly addressable)
  Often came with simple BASIC language interpreter built into ROM or loaded from cassette tape.

Slide9

VisiCalc – the first “killer” app for micros

Microprocessors had little impact on conventional computer market until VisiCalc spreadsheet for Apple-II

Apple-II used MOS Technology 6502 microprocessor running at 1MHz

[ Personal Computing Ad, 1979 ]

Floppy disks were originally invented by IBM as a way of shipping IBM 360 microcode patches to customers!

Slide10

DRAM in the Seventies

Dramatic progress in semiconductor memory technology

1970, Intel introduces first DRAM, 1Kbit 1103

1979, Fujitsu introduces 64Kbit DRAM

=> By mid-Seventies, obvious that PCs would soon have >64KBytes physical memory

Slide11

Microprocessor Evolution

Rapid progress in 70s, fueled by advances in MOSFET technology and expanding markets

Intel i432
  Most ambitious seventies’ micro; started in 1975 - released 1981
  32-bit capability-based object-oriented architecture
  Instructions variable number of bits long
  Severe performance, complexity, and usability problems

Motorola 68000 (1979, 8MHz, 68,000 transistors)
  Heavily microcoded (and nanocoded)
  32-bit general-purpose register architecture (24 address pins)
  8 address registers, 8 data registers

Intel 8086 (1978, 8MHz, 29,000 transistors)
  “Stopgap” 16-bit processor, architected in 10 weeks
  Extended accumulator architecture, assembly-compatible with 8080
  20-bit addressing through segmented addressing scheme
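As a rough illustration of the 8086's segmented scheme mentioned above, the sketch below forms a 20-bit physical address from a 16-bit segment and a 16-bit offset. The function name is made up for this example; the shift-by-4 and the wrap at 20 bits follow the 8086's real-mode behaviour.

```python
def physical_address(segment, offset):
    """8086 address: 16-bit segment shifted left by 4, plus a 16-bit offset."""
    assert 0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF
    # Only 20 address lines exist, so the sum wraps around at 1 MiB.
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(physical_address(0x1234, 0x0010)))  # segment base 0x12340 plus 0x10
print(hex(physical_address(0xFFFF, 0x0010)))  # wraps past the top of memory
```

Many different segment:offset pairs name the same physical byte, which is one reason the scheme was regarded as a stopgap.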

Slide12

IBM PC, 1981

Hardware
  Team from IBM building PC prototypes in 1979
  Motorola 68000 chosen initially, but 68000 was late
  IBM builds “stopgap” prototypes using 8088 boards from Display Writer word processor
  8088 is 8-bit bus version of 8086 => allows cheaper system
  Estimated sales of 250,000
  100,000,000s sold

Software
  Microsoft negotiates to provide OS for IBM. Later buys and modifies QDOS from Seattle Computer Products.

Open System
  Standard processor, Intel 8088
  Standard interfaces
  Standard OS, MS-DOS
  IBM permits cloning and third-party software

Slide13

[ Personal Computing Ad, 11/81 ]

Slide14

Microprogramming: early Eighties

Evolution bred more complex micro-machines
  Complex instruction sets led to need for subroutine and call stacks in µcode
  Need for fixing bugs in control programs was in conflict with read-only nature of µROM
  Writable Control Store (WCS) (B1700, QMachine, Intel i432, …)

With the advent of VLSI technology assumptions about ROM & RAM speed became invalid => more complexity

Better compilers made complex instructions less important.

Use of numerous micro-architectural innovations, e.g., pipelining, caches and buffers, made multiple-cycle execution of reg-reg instructions unattractive

Slide15

Analyzing Microcoded Machines

John Cocke and group at IBM
  Working on a simple pipelined processor, 801, and advanced compilers inside IBM
  Ported experimental PL.8 compiler to IBM 370, and only used simple register-register and load/store instructions similar to 801
  Code ran faster than other existing compilers that used all 370 instructions! (up to 6MIPS whereas 2MIPS considered good before)

Emer, Clark, at DEC
  Measured VAX-11/780 using external hardware
  Found it was actually a 0.5MIPS machine, although usually assumed to be a 1MIPS machine
  Found 20% of VAX instructions responsible for 60% of microcode, but only account for 0.2% of execution time!

VAX8800
  Control Store: 16K*147b RAM, Unified Cache: 64K*8b RAM
  4.5x more microstore RAM than cache RAM!

Slide16

IC Technology Changes Tradeoffs

Logic, RAM, ROM all implemented using MOS transistors

Semiconductor RAM ~ same speed as ROM

Slide17

Nanocoding

MC68000 had 17-bit µcode containing either 10-bit µjump or 9-bit nanoinstruction pointer

Nanoinstructions were 68 bits wide, decoded to give 196 control signals

[Figure labels: µPC (state), µaddress, µcode ROM, µcode next-state, nanoaddress, nanoinstruction ROM, data]

Exploits recurring control signal patterns in µcode, e.g.,

ALU0   A <= Reg[rs1]
...
ALUi0  A <= Reg[rs1]
...
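One way to picture how nanocoding exploits recurring control patterns is a toy two-level store: each distinct wide control word is kept once, and the microprogram holds only narrow pointers to it. The sketch below is an assumption-laden illustration, not the MC68000's actual layout; the function name, the eight-step microprogram, and the step labels are all hypothetical, and only the 68-bit width comes from the slide.

```python
def two_level_store(control_words, width_bits):
    """Split a flat control store into narrow pointers plus unique wide words."""
    nano_rom = []    # each distinct wide control word, stored once
    index_of = {}    # control word -> its position in nano_rom
    micro_rom = []   # one narrow pointer per microinstruction
    for word in control_words:
        if word not in index_of:
            index_of[word] = len(nano_rom)
            nano_rom.append(word)
        micro_rom.append(index_of[word])
    ptr_bits = max(1, (len(nano_rom) - 1).bit_length())
    flat_bits = len(control_words) * width_bits
    split_bits = len(micro_rom) * ptr_bits + len(nano_rom) * width_bits
    return micro_rom, nano_rom, flat_bits, split_bits


# Hypothetical microprogram: 8 steps that reuse 3 distinct 68-bit control words.
program = ["A<=Reg[rs1]", "B<=Reg[rs2]", "ALU", "A<=Reg[rs1]",
           "ALU", "A<=Reg[rs1]", "B<=Reg[rs2]", "ALU"]
micro, nano, flat, split = two_level_store(program, width_bits=68)
print(micro, len(nano), flat, split)
```

The saving grows with the amount of repetition; if every step used a different control word, the extra pointer table would make the two-level form larger, not smaller.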

[Figure labels (RISC): User PC, Inst. Cache, Hardwired Decode]

Slide18

From CISC to RISC

Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microroutines
  Contents of fast instruction memory change to fit what application needs right now

Use simple ISA to enable hardwired pipelined implementation
  Most compiled code only used a few of the available CISC instructions
  Simpler encoding allowed pipelined implementations

Further benefit with integration
  In early ‘80s, could finally fit 32-bit datapath + small caches on a single chip
  No chip crossings in common case allows faster operation

Slide19

Berkeley RISC Chips

RISC-I (1982) contains 44,420 transistors, fabbed in 5 µm NMOS, with a die area of 77 mm², ran at 1 MHz. This chip is probably the first VLSI RISC.

RISC-II (1983) contains 40,760 transistors, was fabbed in 3 µm NMOS, ran at 3 MHz, and the size is 60 mm².

Stanford built some too…

Slide20

“Iron Law” of Processor Performance

Microarchitecture           CPI    cycle time
Microcoded
Single-cycle unpipelined
Pipelined

[Slide annotation: This lecture]

Slide21

“Iron Law” of Processor Performance

Microarchitecture           CPI    cycle time
Microcoded                  >1     short
Single-cycle unpipelined    1      long
Pipelined                   1      short

[Slide annotation: This lecture]

Slide22

Hardware Elements

Combinational circuits
  Mux, Decoder, ALU, ...

Synchronous state elements
  Flipflop, Register, Register file, SRAM, DRAM
  Edge-triggered: Data is sampled at the rising edge
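The difference between the two kinds of element can be sketched behaviourally in Python: combinational blocks are plain functions of their inputs, while a state element changes only when a clock edge is applied. The class and function names are illustrative assumptions, and the model ignores real timing such as setup and hold.

```python
def mux(sel, inputs):
    """n-way multiplexer: a lg(n)-bit select picks one of n inputs."""
    return inputs[sel]


def decoder(a, n):
    """lg(n)-to-n decoder: exactly one of the n outputs is 1."""
    return [1 if i == a else 0 for i in range(n)]


class Register:
    """Edge-triggered register with enable; q changes only in rising_edge()."""

    def __init__(self, init=0):
        self.q = init

    def rising_edge(self, d, en):
        # D is sampled at the rising clock edge, and only if En is asserted.
        if en:
            self.q = d


r = Register()
r.rising_edge(d=5, en=0)   # enable low: old value is held
held = r.q
r.rising_edge(d=5, en=1)   # enable high: D is captured
print(mux(2, [10, 11, 12, 13]), decoder(1, 4), held, r.q)
```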

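[Figure labels: flip-flop ff with D, Q, En, Clk and timing waveforms for Clk, D, Q, En; ALU with inputs A, B and OpSelect (Add, Sub, ...; And, Or, Xor, Not, ...; GT, LT, EQ, Zero, ...) and outputs Result, Comp?; Mux with inputs A0, A1, ..., An-1, a lg(n)-bit Sel and output O; Decoder with lg(n)-bit input A and outputs O0, O1, ..., On-1]

Slide23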

Register Files

Reads are combinational

[Figure labels: a register built from flip-flops ff with D0, D1, D2, ..., Dn-1 and Q0, Q1, Q2, ..., Qn-1 sharing Clk and En; a 2R+1W register file with ReadSel1 (rs1), ReadSel2 (rs2), WriteSel (ws), WriteData (wd), WE (we), Clock, ReadData1 (rd1), ReadData2 (rd2)]

Slide24

Register File Implementation

RISC-V integer instructions have at most 2 register source operands
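A two-read, one-write register file of this kind can be modelled as below. This is a behavioural sketch under stated assumptions: the class name and method names are invented for the example, and the treatment of register 0 as always zero reflects the RISC-V convention for x0 rather than anything drawn on the slide.

```python
class RegisterFile:
    """32 x 32-bit register file with two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, rs1, rs2):
        # Combinational read: the outputs reflect the current contents.
        return self.regs[rs1], self.regs[rs2]

    def rising_edge(self, we, rd, wdata):
        # Synchronous write. x0 reads as zero in RISC-V, so writes to it are dropped.
        if we and rd != 0:
            self.regs[rd] = wdata & 0xFFFFFFFF


rf = RegisterFile()
rf.rising_edge(we=1, rd=5, wdata=42)   # x5 <= 42
rf.rising_edge(we=1, rd=0, wdata=99)   # ignored: x0 stays 0
rf.rising_edge(we=0, rd=6, wdata=7)    # ignored: write enable is low
print(rf.read(5, 0), rf.read(6, 5))
```

Two read ports are enough because, as the slide notes, no integer instruction names more than two source registers.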

[Figure labels: registers reg 0, reg 1, ..., reg 31; 5-bit selects rd, rs1, rs2; 32-bit wdata, rdata1, rdata2; we; clk]

Slide25

A Simple Memory Model

[Figure labels: MAGIC RAM with inputs Address, WriteData, WriteEnable, Clock and output ReadData]

Reads and writes are always completed in one cycle
  a Read can be done any time (i.e. combinational)
  a Write is performed at the rising clock edge if it is enabled
  => the write address and data must be stable at the clock edge
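The idealised memory described above can be captured in a few lines. The sketch assumes a word-addressed store backed by a dictionary and assumes that unwritten locations read as zero; both choices, like the class name, are simplifications made for this example.

```python
class MagicRAM:
    """Idealised memory: combinational reads, writes at the rising clock edge."""

    def __init__(self):
        self.cells = {}

    def read(self, address):
        # Available at any time; unwritten locations read as 0 (an assumption).
        return self.cells.get(address, 0)

    def rising_edge(self, write_enable, address, write_data):
        # Address and data are taken as they stand at the clock edge.
        if write_enable:
            self.cells[address] = write_data


mem = MagicRAM()
before = mem.read(0x100)
mem.rising_edge(write_enable=1, address=0x100, write_data=0xABCD)
after = mem.read(0x100)
mem.rising_edge(write_enable=0, address=0x100, write_data=0)  # no effect
print(before, hex(after), hex(mem.read(0x100)))
```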

Later in the course we will present a more realistic model of memory

Slide26

Implementing RISC-V

Without a bus

Single-cycle per instruction datapath & control logic

(Should be review of CS61C)

Slide27

Instruction Execution

Execution of an instruction involves
  Instruction fetch
  Decode and register fetch
  ALU operation
  Memory operation (optional)
  Write back (optional)
and compute address of next instruction

Slide28

Datapath: Reg-Reg ALU Instructions

[Datapath figure labels: PC, 0x4, Add, Inst. Memory (addr, inst), GPRs (rs1, rs2, wa, wd, we, rd1, rd2), ALU, ALU Control, OpCode, RegWriteEn, clk; Inst<19:15>, Inst<24:20>, Inst<11:7>, Inst<14:12>; annotation: RegWrite Timing?]

 31     25 24   20 19   15 14   12 11    7 6       0
   func7     rs2     rs1    func3     rd    opcode
     7        5       5       3       5       7

rd <= (rs1) func (rs2)
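The field boundaries in the format above translate directly into shifts and masks. The sketch below is an illustration with assumed helper names (`bits`, `decode_rtype`); the example word is the RV32I encoding of `add x3, x1, x2`.

```python
def bits(inst, hi, lo):
    """Extract inst<hi:lo> as an unsigned integer."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_rtype(inst):
    """Split a 32-bit register-register instruction into its six fields."""
    return {
        "func7": bits(inst, 31, 25),
        "rs2": bits(inst, 24, 20),
        "rs1": bits(inst, 19, 15),
        "func3": bits(inst, 14, 12),
        "rd": bits(inst, 11, 7),
        "opcode": bits(inst, 6, 0),
    }


# add x3, x1, x2
fields = decode_rtype(0x002081B3)
print(fields)
```

Because rs1, rs2 and rd sit at the same bit positions in every format that uses them, the register file can be addressed before the opcode has been decoded.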

Slide29

Datapath: Reg-Imm ALU Instructions

[Datapath figure labels: PC, 0x4, Add, Inst. Memory (addr, inst), GPRs (rs1, rs2, wa, wd, we, rd1, rd2), Imm Select, ImmSel, ALU, ALU Control, OpCode, RegWriteEn, clk; Inst<31:20>, Inst<19:15>, Inst<11:7>, Inst<14:12>]

 31          20 19   15 14   12 11    7 6       0
  immediate12     rs1    func3     rd    opcode
      12           5       3       5       7

rd <= (rs1) op immediate
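For register-immediate instructions the only new step is sign-extending the 12-bit immediate before it reaches the ALU. The following sketch uses assumed helper names and the RV32I encoding of `addi x5, x6, -1` as its example; the value 10 for x6 is arbitrary.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_itype(inst):
    """Split a 32-bit register-immediate instruction into its fields."""
    return {
        "imm": sign_extend(inst >> 20, 12),   # Inst<31:20>, sign-extended
        "rs1": (inst >> 15) & 0x1F,
        "func3": (inst >> 12) & 0x7,
        "rd": (inst >> 7) & 0x1F,
        "opcode": inst & 0x7F,
    }


# addi x5, x6, -1
f = decode_itype(0xFFF30293)
x6 = 10                                   # arbitrary example register value
result = (x6 + f["imm"]) & 0xFFFFFFFF     # rd <= (rs1) + immediate
print(f, result)
```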

Slide30

Conflicts in Merging Datapath

[Datapath figure labels: PC, 0x4, Add, Inst. Memory (addr, inst), GPRs (rs1, rs2, wa, wd, we, rd1, rd2), Imm Select, ImmSel, ALU, ALU Control, OpCode, RegWrite, clk; Inst<19:15>, Inst<24:20>, Inst<11:7>, Inst<31:20>, Inst<14:12>; annotation: Introduce muxes]

func7[31:25] rs2[24:20] rs1[19:15] func3[14:12] rd[11:7] opcode[6:0]      rd <= (rs1) func (rs2)

immediate12[31:20] rs1[19:15] func3[14:12] rd[11:7] opcode[6:0]           rd <= (rs1) op immediate

Slide31

Datapath for ALU Instructions

[Datapath figure labels: PC, 0x4, Add, Inst. Memory (addr, inst), GPRs (rs1, rs2, wa, wd, we, rd1, rd2), Imm Select, ImmSel, Op2Sel (Reg / Imm), ALU, ALU Control, OpCode, RegWriteEn, clk; Inst<19:15>, Inst<24:20>, Inst<11:7>, Inst<6:0>, Inst<14:12>, Inst<31:20>]

func7[31:25] rs2[24:20] rs1[19:15] func3[14:12] rd[11:7] opcode[6:0]      rd <= (rs1) func (rs2)

immediate12[31:20] rs1[19:15] func3[14:12] rd[11:7] opcode[6:0]           rd <= (rs1) op immediate

Slide32

Load/Store Instructions

rs1 is the base register

rd is the destination of a Load, rs2 is the data source for a Store

[Datapath figure labels: PC, 0x4, Add, Inst. Memory (addr, inst), GPRs (rs1, rs2, wa, wd, we, rd1, rd2), Imm Select, ImmSel, Op2Sel, “base”, disp, ALU, ALU Control, OpCode, RegWriteEn, Data Memory (addr, wdata, rdata, we), MemWrite, WBSel (ALU / Mem), clk]

Store:  imm[31:25] rs2[24:20] rs1[19:15] func3[14:12] imm[11:7] opcode[6:0]

Load:   immediate12[31:20] rs1[19:15] func3[14:12] rd[11:7] opcode[6:0]

(rs1) + displacement
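The two formats above differ only in where the 12-bit displacement lives: a load keeps it in one field, while a store splits it around the register fields. A small sketch of the address calculation follows; the function names are assumptions, and the example words are the RV32I encodings of `sw x2, -4(x1)` and `lw x3, 8(x1)`.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def load_offset(inst):
    """Load: the displacement is Inst<31:20>."""
    return sign_extend(inst >> 20, 12)


def store_offset(inst):
    """Store: upper 7 bits in Inst<31:25>, lower 5 bits in Inst<11:7>."""
    imm = (((inst >> 25) & 0x7F) << 5) | ((inst >> 7) & 0x1F)
    return sign_extend(imm, 12)


def effective_address(base, offset):
    """(rs1) + displacement, wrapped to 32 bits."""
    return (base + offset) & 0xFFFFFFFF


x1 = 0x1000                                            # arbitrary base value
print(hex(effective_address(x1, store_offset(0xFE20AE23))))  # sw x2, -4(x1)
print(hex(effective_address(x1, load_offset(0x0080A183))))   # lw x3, 8(x1)
```

Splitting the store immediate keeps rs1 and rs2 in the same bit positions as in the register-register format, so the register-fetch wiring does not depend on the instruction type.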

Slide33

RISC-V Conditional Branches

Compare two integer registers for equality (BEQ/BNE) or signed magnitude (BLT/BGE) or unsigned magnitude (BLTU/BGEU)

12-bit immediate encodes branch target address as a signed offset from PC, in units of 16-bits (i.e., shift left by 1 then add to PC).

imm[31:25] rs2[24:20] rs1[19:15] func3[14:12] imm[11:7] opcode[6:0]

(field widths 7, 5, 5, 3, 5, 7 bits)
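The branch behaviour described above can be sketched as a next-PC function. This is a simplified illustration: it assumes the 12-bit offset field has already been gathered from the instruction word into `imm12`, and the names `CONDITIONS` and `next_pc` are invented for the example.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def signed32(x):
    return sign_extend(x, 32)


# Each condition takes two 32-bit register values as unsigned integers.
CONDITIONS = {
    "BEQ": lambda a, b: a == b,
    "BNE": lambda a, b: a != b,
    "BLT": lambda a, b: signed32(a) < signed32(b),
    "BGE": lambda a, b: signed32(a) >= signed32(b),
    "BLTU": lambda a, b: a < b,
    "BGEU": lambda a, b: a >= b,
}


def next_pc(pc, op, a, b, imm12):
    """Taken: PC + (signed offset << 1). Not taken: PC + 4."""
    taken = CONDITIONS[op](a & 0xFFFFFFFF, b & 0xFFFFFFFF)
    target = (pc + (sign_extend(imm12, 12) << 1)) & 0xFFFFFFFF
    return target if taken else (pc + 4) & 0xFFFFFFFF


# 0xFFFFFFFF is -1 when signed but the largest value when unsigned.
print(hex(next_pc(0x100, "BLT", 0xFFFFFFFF, 1, 0xFFE)))   # taken, offset -4 bytes
print(hex(next_pc(0x100, "BLTU", 0xFFFFFFFF, 1, 0xFFE)))  # not taken
```

The same pair of register values takes the branch under BLT but not under BLTU, which is why the ISA provides both signed and unsigned comparisons.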

BEQ/BNE, BLT/BGE, BLTU/BGEU

Slide34

Conditional Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU)

[Datapath figure labels: PC, 0x4, Add (pc+4), Add (br), PCSel, Inst. Memory (addr, inst), GPRs (rs1, rs2, wa, wd, we, rd1, rd2), Imm Select, ImmSel, Op2Sel, ALU, ALU Control, OpCode, Br Logic, Bcomp?, Data Memory (addr, wdata, rdata, we), MemWrite, WBSel, RegWrEn, clk]

Slide35

RISC-V Unconditional Jumps

20-bit immediate encodes jump target address as a signed offset from PC, in units of 16-bits (i.e., shift left by 1 then add to PC). (+/- 1MiB)

JAL is a subroutine call that also saves return address (PC+4) in register rd

JAL: Jump Offset[19:0] in bits 31:12, rd in bits 11:7, opcode in bits 6:0

Slide36

RISC-V Register Indirect Jumps

Jumps to target address given by adding 12-bit offset (not shifted by 1 bit) to register rs1

The return address (PC+4) is written to rd (can be x0 if value not needed)

JALR: 12-bit offset in bits 31:20, rs1 in bits 19:15, func3 in bits 14:12, rd in bits 11:7, opcode in bits 6:0
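The two jump flavours, PC-relative JAL and register-indirect JALR, differ only in how the target is formed; both write PC+4 as the return address. The sketch below uses assumed function names and takes the offset fields as already-extracted integers. Clearing the lowest target bit in `jalr` follows the RISC-V base ISA and is not stated on the slide.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def jal(pc, offset20):
    """PC-relative jump: 20-bit signed offset in units of 2 bytes (+/- 1 MiB)."""
    target = (pc + (sign_extend(offset20, 20) << 1)) & MASK32
    link = (pc + 4) & MASK32          # return address, written to rd
    return target, link


def jalr(pc, rs1_value, imm12):
    """Register-indirect jump: rs1 plus an unshifted 12-bit signed offset."""
    # The base ISA clears the least-significant bit of the computed target.
    target = (rs1_value + sign_extend(imm12, 12)) & MASK32 & ~1
    link = (pc + 4) & MASK32
    return target, link


print([hex(v) for v in jal(0x1000, 0x00008)])          # forward by 16 bytes
print([hex(v) for v in jal(0x1000, 0xFFFFE)])          # backward by 4 bytes
print([hex(v) for v in jalr(0x1000, 0x2001, 0xFFF)])   # 0x2001 - 1
```

A plain jump that needs no return address can name x0 as rd, so the link value is computed and then discarded.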

Slide37

Full RISCV1Stage Datapath

Slide38

Hardwired Control is pure Combinational Logic

[Figure labels: combinational logic with inputs op code, Equal? and outputs ImmSel, Op2Sel, FuncSel, MemWrite, WBSel, WASel, RegWriteEn, PCSel]

Slide39

ALU Control & Immediate Extension

[Figure labels: Inst<6:0> (Opcode), Decode Map, Inst<14:12> (Func3), ALUop, 0?, +, FuncSel (Func, Op, +, 0?), ImmSel (IType12, SType12, UType20)]

Slide40

Hardwired Control Table

Opcode      ImmSel     Op2Sel  FuncSel  MemWr  RFWen  WBSel  WASel  PCSel
ALU         *          Reg     Func     no     yes    ALU    rd     pc+4
ALUi        IType12    Imm     Op       no     yes    ALU    rd     pc+4
LW          IType12    Imm     +        no     yes    Mem    rd     pc+4
SW          SType12    Imm     +        yes    no     *      *      pc+4
BEQ true    SBType12   *       *        no     no     *      *      br
BEQ false   SBType12   *       *        no     no     *      *      pc+4
J           *          *       *        no     no     *      *      jabs
JAL         *          *       *        no     yes    PC     X1     jabs
JALR        *          *       *        no     yes    PC     rd     rind

Op2Sel = Reg / Imm
WBSel = ALU / Mem / PC
WASel = rd / X1
PCSel = pc+4 / br / rind / jabs
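Since hardwired control is pure combinational logic, the table above amounts to a lookup from opcode class (plus the branch comparison result) to a tuple of control signals. The following sketch mirrors the table row for row; the `Ctl` tuple, the `control` function, and the use of "*" for a don't-care are conventions assumed for this example.

```python
from collections import namedtuple

Ctl = namedtuple("Ctl", "ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel")
X = "*"  # don't care

CONTROL = {
    "ALU":       Ctl(X,          "Reg", "Func", False, True,  "ALU", "rd", "pc+4"),
    "ALUi":      Ctl("IType12",  "Imm", "Op",   False, True,  "ALU", "rd", "pc+4"),
    "LW":        Ctl("IType12",  "Imm", "+",    False, True,  "Mem", "rd", "pc+4"),
    "SW":        Ctl("SType12",  "Imm", "+",    True,  False, X,     X,    "pc+4"),
    "BEQ true":  Ctl("SBType12", X,     X,      False, False, X,     X,    "br"),
    "BEQ false": Ctl("SBType12", X,     X,      False, False, X,     X,    "pc+4"),
    "J":         Ctl(X,          X,     X,      False, False, X,     X,    "jabs"),
    "JAL":       Ctl(X,          X,     X,      False, True,  "PC",  "X1", "jabs"),
    "JALR":      Ctl(X,          X,     X,      False, True,  "PC",  "rd", "rind"),
}


def control(opcode, equal=False):
    """Combinational control: a pure function of the opcode and Equal?."""
    if opcode == "BEQ":
        opcode = "BEQ true" if equal else "BEQ false"
    return CONTROL[opcode]


print(control("LW"))
print(control("BEQ", equal=True).PCSel, control("BEQ", equal=False).PCSel)
```

Only the branch rows depend on anything other than the opcode, which is why the control block needs the single extra input Equal?.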

Slide41

Single-Cycle Hardwired Control

We will assume clock period is sufficiently long for all of the following steps to be “completed”:
  Instruction fetch
  Decode and register fetch
  ALU operation
  Data fetch if required
  Register write-back setup time

$$t_C > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$$

At the rising edge of the following clock, the PC, register file and memory are updated
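The clock-period constraint above is just a sum over the steps on the longest path. A quick sketch with made-up stage delays shows how the bound and the resulting maximum frequency follow; every number here is hypothetical and chosen only to make the arithmetic easy to check.

```python
# Hypothetical stage delays in nanoseconds (not from the lecture).
delays = {
    "tIFetch": 2.0,   # instruction fetch
    "tRFetch": 1.0,   # decode and register fetch
    "tALU": 2.0,      # ALU operation
    "tDMem": 2.0,     # data fetch, if required
    "tRWB": 1.0,      # register write-back setup time
}


def min_clock_period(stage_delays):
    """Single-cycle design: the clock must cover every step back to back."""
    return sum(stage_delays.values())


t_c = min_clock_period(delays)
print(f"tC must exceed {t_c} ns, so f < {1e3 / t_c:.0f} MHz")
```

Every instruction pays for the slowest possible path, including the data memory access that only loads need, which is the "long cycle time" entry for the single-cycle unpipelined design.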

Slide42

Question of the Day

Do you think a CISC or RISC single-cycle processor would be faster?

Slide43

Summary

Microcoding became less attractive as gap between RAM and ROM speeds reduced, and logic implemented in same technology as memory

Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew

Iron Law explains architecture design space
  Trade instructions/program, cycles/instruction, and time/cycle

Load-Store RISC ISAs designed for efficient pipelined implementations
  Very similar to vertical microcode
  Inspired by earlier Cray machines (CDC 6600/7600)

RISC-V ISA will be used in lectures, problems, and labs
  Berkeley RISC chips: RISC-I, RISC-II, SOAR (RISC-III), SPUR (RISC-IV)

Slide44

Acknowledgements

These slides contain material developed and copyright by:
  Arvind (MIT)
  Krste Asanovic (MIT/UCB)
  Joel Emer (Intel/MIT)
  James Hoe (CMU)
  John Kubiatowicz (UCB)
  David Patterson (UCB)

MIT material derived from course 6.823

UCB material derived from course CS252