Slide1
CS 152 Computer Architecture and Engineering Lecture 3 - From CISC to RISC
Dr. George Michelogiannakis
EECS, University of California at Berkeley
CRD, Lawrence Berkeley National Laboratory
http://inst.eecs.berkeley.edu/~cs152
Slide2
Last Time in Lecture 2
ISA is the hardware/software interface
  Defines set of programmer visible state
  Defines instruction format (bit encoding) and instruction semantics
  Examples: IBM 360, MIPS, RISC-V, x86, JVM
Many possible implementations of one ISA
  360 implementations: model 30 (c. 1964), z12 (c. 2012)
  x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4 (c. 2000), Core 2 Duo, Nehalem, Sandy Bridge, Ivy Bridge, Atom, AMD Athlon, Transmeta Crusoe, SoftPC
  MIPS implementations: R2000, R4000, R10000, R18K, …
  JVM: HotSpot, PicoJava, ARM Jazelle, …
Microcoding: straightforward methodical way to implement machines with low logic gate count and complex instructions
Slide3
Question of the Day
Do you think a CISC or RISC single-cycle processor would be faster?
Slide4
“Iron Law” of Processor Performance
$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
Instructions per program depends on source code, compiler technology, and ISA
Cycles per instruction (CPI) depends on ISA and µarchitecture
Time per cycle depends upon the µarchitecture and base technology
Slide5
CPI for Microcoded Machine
[Figure: three instructions laid out along a time axis: Inst 1 (7 cycles), Inst 2 (5 cycles), Inst 3 (10 cycles)]
Total clock cycles = 7+5+10 = 22
Total instructions = 3
CPI = 22/3 = 7.33
CPI is always an average over a large number of instructions
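As a quick check of the arithmetic above, here is a minimal sketch that computes the average CPI for the three instructions and plugs it into the Iron Law. The function names and the 1 MHz clock are illustrative assumptions; the slide gives only the cycle counts.

```python
def average_cpi(cycles_per_instruction):
    """Average CPI: total cycles divided by the number of instructions."""
    return sum(cycles_per_instruction) / len(cycles_per_instruction)


def execution_time(instructions, cpi, cycle_time_s):
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_s


# Cycle counts from the slide, in order Inst 1, Inst 2, Inst 3
cycles = [7, 5, 10]
cpi = average_cpi(cycles)

# Hypothetical 1 MHz clock (1 microsecond per cycle); not stated on the slide
time_s = execution_time(len(cycles), cpi, 1e-6)
print(f"CPI = {cpi:.2f}, time = {time_s * 1e6:.0f} us")
```

With only three instructions the average is dominated by the 10-cycle instruction, which is why CPI is normally quoted over a long instruction stream.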
Slide6
Technology Influence
When microcode appeared in 50s, different technologies for:
  Logic: Vacuum Tubes
  Main Memory: Magnetic cores
  Read-Only Memory: Diode matrix, punched metal cards, …
Logic very expensive compared to ROM or RAM
ROM cheaper than RAM
ROM much faster than RAM
But seventies brought advances in integrated circuit technology and semiconductor memory…
Slide7
First Microprocessor
Intel 4004, 1971
4-bit accumulator architecture
8µm pMOS
2,300 transistors
3 x 4 mm²
750kHz clock
8-16 cycles/inst.
Made possible by new integrated circuit technology
Slide8
Microprocessors in the Seventies
Initial target was embedded control
  First micro, 4-bit 4004 from Intel, designed for a desktop printing calculator
Constrained by what could fit on single chip
  Accumulator architectures, similar to earliest computers
  Hardwired state machine control
8-bit micros (8085, 6800, 6502) used in hobbyist personal computers
  Micral, Altair, TRS-80, Apple-II
  Usually had 16-bit address space (up to 64KB directly addressable)
  Often came with simple BASIC language interpreter built into ROM or loaded from cassette tape.
Slide9
VisiCalc – the first “killer” app for micros
Microprocessors had little impact on conventional computer market until VisiCalc spreadsheet for Apple-II
Apple-II used MOS Technology 6502 microprocessor running at 1MHz
[ Personal Computing Ad, 1979 ]
Floppy disks were originally invented by IBM as a way of shipping IBM 360 microcode patches to customers!
Slide10
DRAM in the Seventies
Dramatic progress in semiconductor memory technology
1970, Intel introduces first DRAM, 1Kbit 1103
1979, Fujitsu introduces 64Kbit DRAM
=> By mid-Seventies, obvious that PCs would soon have >64KBytes physical memory
Slide11
Microprocessor Evolution
Rapid progress in 70s, fueled by advances in MOSFET technology and expanding markets
Intel i432
  Most ambitious seventies’ micro; started in 1975 - released 1981
  32-bit capability-based object-oriented architecture
  Instructions variable number of bits long
  Severe performance, complexity, and usability problems
Motorola 68000 (1979, 8MHz, 68,000 transistors)
  Heavily microcoded (and nanocoded)
  32-bit general-purpose register architecture (24 address pins)
  8 address registers, 8 data registers
Intel 8086 (1978, 8MHz, 29,000 transistors)
  “Stopgap” 16-bit processor, architected in 10 weeks
  Extended accumulator architecture, assembly-compatible with 8080
  20-bit addressing through segmented addressing scheme
Slide12
IBM PC, 1981
Hardware
  Team from IBM building PC prototypes in 1979
  Motorola 68000 chosen initially, but 68000 was late
  IBM builds “stopgap” prototypes using 8088 boards from Display Writer word processor
  8088 is 8-bit bus version of 8086 => allows cheaper system
  Estimated sales of 250,000
  100,000,000s sold
Software
  Microsoft negotiates to provide OS for IBM. Later buys and modifies QDOS from Seattle Computer Products.
Open System
  Standard processor, Intel 8088
  Standard interfaces
  Standard OS, MS-DOS
  IBM permits cloning and third-party software
Slide13
[ Personal Computing Ad, 11/81 ]
Slide14
Microprogramming: early Eighties
Evolution bred more complex micro-machines
  Complex instruction sets led to need for subroutine and call stacks in µcode
  Need for fixing bugs in control programs was in conflict with read-only nature of µROM
  => Writable Control Store (WCS) (B1700, QMachine, Intel i432, …)
With the advent of VLSI technology, assumptions about ROM & RAM speed became invalid => more complexity
Better compilers made complex instructions less important.
Use of numerous micro-architectural innovations, e.g., pipelining, caches and buffers, made multiple-cycle execution of reg-reg instructions unattractive
Slide15
Analyzing Microcoded Machines
John Cocke and group at IBM
  Working on a simple pipelined processor, 801, and advanced compilers inside IBM
  Ported experimental PL.8 compiler to IBM 370, and only used simple register-register and load/store instructions similar to 801
  Code ran faster than other existing compilers that used all 370 instructions! (up to 6MIPS whereas 2MIPS considered good before)
Emer, Clark, at DEC
  Measured VAX-11/780 using external hardware
  Found it was actually a 0.5MIPS machine, although usually assumed to be a 1MIPS machine
  Found 20% of VAX instructions responsible for 60% of microcode, but only account for 0.2% of execution time!
VAX8800
  Control Store: 16K*147b RAM, Unified Cache: 64K*8b RAM
  4.5x more microstore RAM than cache RAM!
Slide16
IC Technology Changes Tradeoffs
Logic, RAM, ROM all implemented using MOS transistors
Semiconductor RAM ~ same speed as ROM
Slide17
Nanocoding
MC68000 had 17-bit µcode containing either 10-bit µjump or 9-bit nanoinstruction pointer
Nanoinstructions were 68 bits wide, decoded to give 196 control signals
[Figure: the µPC (state) supplies a µaddress to the µcode ROM, which produces the µcode next-state and a nanoaddress; the nanoaddress indexes the nanoinstruction ROM, which outputs data]
Exploits recurring control signal patterns in µcode, e.g.,
  ALU0   A <= Reg[rs1] ...
  ALUi0  A <= Reg[rs1] ...
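The storage argument behind nanocoding can be sketched with a little arithmetic. The word widths below come from the slide; the entry counts are only the upper bounds implied by the 10-bit and 9-bit pointers (an assumption, since the slide does not give the actual ROM sizes), and the tiny `micro_rom`/`nano_rom` tables are hypothetical.

```python
MICRO_WORD_BITS = 17   # width of a µcode word (from the slide)
NANO_WORD_BITS = 68    # width of a nanoinstruction (from the slide)


def flat_store_bits(num_states):
    """Single-level store: one wide control word per microcode state."""
    return num_states * NANO_WORD_BITS


def two_level_store_bits(num_states, num_unique_patterns):
    """Two-level store: a narrow word per state plus one wide word per distinct pattern."""
    return num_states * MICRO_WORD_BITS + num_unique_patterns * NANO_WORD_BITS


# Upper bounds implied by the pointer widths; real ROM sizes are not on the slide
states = 2 ** 10     # 10-bit µaddress
patterns = 2 ** 9    # 9-bit nanoinstruction pointer

flat = flat_store_bits(states)
nano = two_level_store_bits(states, patterns)
print(f"flat: {flat} bits, two-level: {nano} bits")

# Two µcode states with the same control pattern share one nanoinstruction
nano_rom = ["A <= Reg[rs1]"]          # hypothetical one-entry nano ROM
micro_rom = {"ALU0": 0, "ALUi0": 0}   # µcode state -> nanoaddress
```

The saving grows as more states reuse the same wide control word, which is the recurring-pattern observation made on the slide.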
[Figure overlay labels: User PC, Inst. Cache, Hardwired Decode, RISC]
Slide18
From CISC to RISC
Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microroutines
  Contents of fast instruction memory change to fit what application needs right now
Use simple ISA to enable hardwired pipelined implementation
  Most compiled code only used a few of the available CISC instructions
  Simpler encoding allowed pipelined implementations
Further benefit with integration
  In early ‘80s, could finally fit 32-bit datapath + small caches on a single chip
  No chip crossings in common case allows faster operation
Slide19
Berkeley RISC Chips
RISC-I (1982) contains 44,420 transistors, fabbed in 5 µm NMOS, with a die area of 77 mm², ran at 1 MHz. This chip is probably the first VLSI RISC.
RISC-II (1983) contains 40,760 transistors, was fabbed in 3 µm NMOS, ran at 3 MHz, and the size is 60 mm².
Stanford built some too…
Slide20
“Iron Law” of Processor Performance
$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
Instructions per program depends on source code, compiler technology, and ISA
Cycles per instruction (CPI) depends on ISA and µarchitecture
Time per cycle depends upon the µarchitecture and base technology

Microarchitecture          CPI   cycle time
Microcoded                 >1    short
Single-cycle unpipelined   1     long         (this lecture)
Pipelined                  1     short
Slide22
Hardware Elements
Combinational circuits
  Mux, Decoder, ALU, ...
  [Figure: n-input Mux (inputs A0 … An-1, lg(n)-bit Sel, output O); Decoder (lg(n)-bit input A, outputs O0 … On-1); ALU (inputs A and B, OpSelect, outputs Result and Comp?)]
  ALU OpSelect:
    - Add, Sub, ...
    - And, Or, Xor, Not, ...
    - GT, LT, EQ, Zero, ...
Synchronous state elements
  Flipflop, Register, Register file, SRAM, DRAM
  Edge-triggered: Data is sampled at the rising edge
  [Figure: flip-flop (ff) with inputs D, Clk, En and output Q; timing diagram of Clk, En, D, Q]
Slide23
Register Files
Reads are combinational
[Figure: a register built from n flip-flops (inputs D0 … Dn-1, outputs Q0 … Qn-1) sharing Clk and En]
[Figure: 2R+1W register file with inputs ReadSel1 (rs1), ReadSel2 (rs2), WriteSel (ws), WriteData (wd), WE (we), Clock and outputs ReadData1 (rd1), ReadData2 (rd2)]
Slide24
Register File Implementation
RISC-V integer instructions have at most 2 register source operands
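A behavioural sketch of such a two-read, one-write register file follows: reads are modelled as combinational and the write takes effect only at the clock edge. The class and method names are hypothetical, and forcing register 0 to stay zero is an assumption taken from the RISC-V ISA rather than from this slide.

```python
class RegisterFile:
    """2R+1W register file: combinational reads, write on the clock edge."""

    def __init__(self, num_regs=32, width=32):
        self.mask = (1 << width) - 1
        self.regs = [0] * num_regs

    def read(self, rs1, rs2):
        """Combinational read ports: return (rdata1, rdata2)."""
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, rd, wdata):
        """Model the rising clock edge: latch wdata into reg[rd] if we is set."""
        # Assumption from the RISC-V ISA (not shown on the slide): x0 always reads as zero
        if we and rd != 0:
            self.regs[rd] = wdata & self.mask


rf = RegisterFile()
rf.clock_edge(we=True, rd=5, wdata=0x1234)
rf.clock_edge(we=False, rd=6, wdata=0xFFFF)   # write disabled: reg 6 unchanged
rf.clock_edge(we=True, rd=0, wdata=0xDEAD)    # ignored: x0 stays zero
print(rf.read(5, 6))
```

Two read selects and one write select match the instruction formats used later, where an instruction names at most rs1, rs2 and rd.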
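[Figure: 32 registers (reg 0 … reg 31), each 32 bits wide; 5-bit rd with we and clk selects the register that latches the 32-bit wdata; 5-bit rs1 and rs2 select the 32-bit outputs rdata1 and rdata2]
Slide25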
A Simple Memory Model
[Figure: MAGIC RAM with inputs Address, WriteData, WriteEnable, Clock and output ReadData]
Reads and writes are always completed in one cycle
  a Read can be done any time (i.e. combinational)
  a Write is performed at the rising clock edge if it is enabled
  => the write address and data must be stable at the clock edge
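The idealized memory described above can be modelled in a few lines. This is only a sketch: the class name, the sparse dictionary storage, and the choice that unwritten addresses read as 0 are assumptions made for illustration.

```python
class MagicRAM:
    """Idealized memory: combinational read, write at the rising clock edge."""

    def __init__(self):
        # Sparse storage; unwritten addresses read as 0 (an assumption)
        self.cells = {}

    def read(self, address):
        """Combinational read: available at any time within the cycle."""
        return self.cells.get(address, 0)

    def clock_edge(self, write_enable, address, write_data):
        """Rising clock edge: store write_data only if the write is enabled."""
        if write_enable:
            self.cells[address] = write_data


ram = MagicRAM()
before = ram.read(0x100)
ram.clock_edge(write_enable=True, address=0x100, write_data=42)
after = ram.read(0x100)
ram.clock_edge(write_enable=False, address=0x100, write_data=7)   # no effect
print(before, after, ram.read(0x100))
```

Separating `read` from `clock_edge` mirrors the slide's distinction between combinational reads and edge-triggered writes.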
Later in the course we will present a more realistic model of memory
Slide26
Implementing RISC-V
Without a bus
Single-cycle per instruction datapath & control logic
(Should be review of CS61C)
Slide27
Instruction Execution
Execution of an instruction involves
  Instruction fetch
  Decode and register fetch
  ALU operation
  Memory operation (optional)
  Write back (optional)
and compute address of next instruction
Slide28
Datapath: Reg-Reg ALU Instructions
[Figure: PC drives addr of Inst. Memory and an Add unit (PC + 0x4); Inst<19:15> and Inst<24:20> drive GPR read ports rs1 and rs2, Inst<11:7> drives wa; rd1 and rd2 feed the ALU, whose result returns to wd; Inst<14:12> and OpCode drive ALU Control; RegWriteEn drives we. Annotation: RegWrite Timing?]
  field:  func7   rs2     rs1     func3   rd     opcode
  bits:   31:25   24:20   19:15   14:12   11:7   6:0
  width:  7       5       5       3       5      7
rd <= (rs1) func (rs2)
Slide29
Datapath: Reg-Imm ALU Instructions
[Figure: same structure as the Reg-Reg datapath, except that Inst<31:20> passes through Imm Select (controlled by ImmSel) to the second ALU input; Inst<19:15> drives rs1, Inst<11:7> drives wa, Inst<14:12> and OpCode drive ALU Control; RegWriteEn drives we]
  field:  immediate12   rs1     func3   rd     opcode
  bits:   31:20         19:15   14:12   11:7   6:0
  width:  12            5       3       5      7
rd <= (rs1) op immediate
Slide30
Conflicts in Merging Datapath
[Figure: the Reg-Reg and Reg-Imm datapaths overlaid; the second ALU input is driven both by rd2 (rs2 = Inst<24:20>) and by Imm Select (Inst<31:20>). Annotation: Introduce muxes]
Reg-Reg format:
  field:  func7   rs2     rs1     func3   rd     opcode
  bits:   31:25   24:20   19:15   14:12   11:7   6:0
  rd <= (rs1) func (rs2)
Reg-Imm format:
  field:  immediate12   rs1     func3   rd     opcode
  bits:   31:20         19:15   14:12   11:7   6:0
  rd <= (rs1) op immediate
Slide31
Datapath for ALU Instructions
[Figure: merged datapath; a mux controlled by Op2Sel (Reg / Imm) selects the second ALU operand from rd2 or from Imm Select (Inst<31:20>, controlled by ImmSel); <6:0> is the OpCode, <14:12> goes to ALU Control, <19:15> to rs1, <24:20> to rs2, <11:7> to wa; RegWriteEn drives we]
Reg-Reg format:
  field:  func7   rs2     rs1     func3   rd     opcode
  bits:   31:25   24:20   19:15   14:12   11:7   6:0
  rd <= (rs1) func (rs2)
Reg-Imm format:
  field:  immediate12   rs1     func3   rd     opcode
  bits:   31:20         19:15   14:12   11:7   6:0
  rd <= (rs1) op immediate
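The field extraction and the Op2Sel mux in this datapath can be illustrated with a small decoder. This is a sketch under stated assumptions: the helper names are hypothetical, and the two example encodings (`add x3, x1, x2` and `addi x3, x1, -1`) are assembled by hand from the RISC-V base encoding rather than taken from the slides.

```python
def bits(inst, hi, lo):
    """Extract inst<hi:lo> as an unsigned integer."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret a width-bit value as a two's-complement signed integer."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_alu_fields(inst):
    """Slice out every field used by Reg-Reg and Reg-Imm ALU instructions."""
    return {
        "opcode": bits(inst, 6, 0),
        "rd": bits(inst, 11, 7),
        "func3": bits(inst, 14, 12),
        "rs1": bits(inst, 19, 15),
        "rs2": bits(inst, 24, 20),
        "func7": bits(inst, 31, 25),
        "imm12": sign_extend(bits(inst, 31, 20), 12),
    }


def second_operand(op2sel, rd2, imm):
    """Op2Sel mux: choose the register value (Reg) or the immediate (Imm)."""
    return rd2 if op2sel == "Reg" else imm


ADD_X3_X1_X2 = 0x002081B3    # add x3, x1, x2 (hand-assembled example)
ADDI_X3_X1_M1 = 0xFFF08193   # addi x3, x1, -1 (hand-assembled example)

add_fields = decode_alu_fields(ADD_X3_X1_X2)
addi_fields = decode_alu_fields(ADDI_X3_X1_M1)
print(add_fields["rs1"], add_fields["rs2"], add_fields["rd"], addi_fields["imm12"])
```

Because rs1, func3, rd and opcode sit at the same bit positions in both formats, only the second ALU operand needs a mux.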
Slide32
Load/Store Instructions
[Figure: ALU datapath extended with a Data Memory (addr, wdata, rdata, we, clk); the ALU adds the “base” (rd1) and the displacement (disp, from Imm Select) to form addr; rd2 supplies wdata; MemWrite drives we; a mux controlled by WBSel (ALU / Mem) selects the value written back to the GPRs]
Address = (rs1) + displacement
Store format:
  field:  imm     rs2     rs1     func3   imm    opcode
  bits:   31:25   24:20   19:15   14:12   11:7   6:0
  width:  7       5       5       3       5      7
Load format:
  field:  immediate12   rs1     func3   rd     opcode
  bits:   31:20         19:15   14:12   11:7   6:0
  width:  12            5       3       5      7
rs1 is the base register
rd is the destination of a Load, rs2 is the data source for a Store
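The address calculation for loads and stores can be sketched as follows. The split of the store displacement (upper seven bits in Inst<31:25>, lower five in Inst<11:7>) follows the RISC-V S-type encoding; the helper names and the two hand-assembled example words are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF


def bits(inst, hi, lo):
    """Extract inst<hi:lo> as an unsigned integer."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret a width-bit value as a two's-complement signed integer."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def load_offset(inst):
    """Load displacement: the 12-bit immediate in inst<31:20>."""
    return sign_extend(bits(inst, 31, 20), 12)


def store_offset(inst):
    """Store displacement: inst<31:25> holds imm[11:5], inst<11:7> holds imm[4:0]."""
    return sign_extend((bits(inst, 31, 25) << 5) | bits(inst, 11, 7), 12)


def effective_address(base, offset):
    """(rs1) + displacement, wrapped to 32 bits."""
    return (base + offset) & MASK32


LW_X5_8_X1 = 0x0080A283     # lw x5, 8(x1)  (hand-assembled example)
SW_X5_M4_X1 = 0xFE50AE23    # sw x5, -4(x1) (hand-assembled example)

print(load_offset(LW_X5_8_X1), store_offset(SW_X5_M4_X1))
print(hex(effective_address(0x1000, store_offset(SW_X5_M4_X1))))
```

Splitting the store immediate keeps rs1 and rs2 in the same bit positions as in the Reg-Reg format, so the register-file read ports need no extra muxing.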
Slide33
RISC-V Conditional Branches
Compare two integer registers for equality (BEQ/BNE) or signed magnitude (BLT/BGE) or unsigned magnitude (BLTU/BGEU)
12-bit immediate encodes branch target address as a signed offset from PC, in units of 16-bits (i.e., shift left by 1 then add to PC).
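A minimal sketch of the branch decision and target calculation described above. It takes the already-assembled 12-bit immediate as an input and does not model how the bits are laid out in the instruction word; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, width):
    """Interpret a width-bit value as a two's-complement signed integer."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def branch_taken(op, a, b):
    """Evaluate the branch condition on two 32-bit register values."""
    a &= MASK32
    b &= MASK32
    if op == "BEQ":
        return a == b
    if op == "BNE":
        return a != b
    if op == "BLT":
        return sign_extend(a, 32) < sign_extend(b, 32)
    if op == "BGE":
        return sign_extend(a, 32) >= sign_extend(b, 32)
    if op == "BLTU":
        return a < b
    if op == "BGEU":
        return a >= b
    raise ValueError(f"unknown branch op: {op}")


def next_pc(pc, op, a, b, imm12):
    """Taken: PC + (signed 12-bit immediate << 1). Not taken: PC + 4."""
    if branch_taken(op, a, b):
        offset = sign_extend(imm12, 12) << 1   # immediate is in units of 16 bits
        return (pc + offset) & MASK32
    return (pc + 4) & MASK32


print(hex(next_pc(0x100, "BEQ", 3, 3, 0xFFE)))   # taken, immediate = -2 -> offset -4
print(hex(next_pc(0x100, "BNE", 3, 3, 8)))       # not taken -> PC + 4
```

The same register values can compare differently under BLT and BLTU: 0xFFFFFFFF is -1 when signed but the largest value when unsigned.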
  field:  imm     rs2     rs1     func3   imm    opcode
  bits:   31:25   24:20   19:15   14:12   11:7   6:0
  width:  7       5       5       3       5      7
Instructions: BEQ/BNE, BLT/BGE, BLTU/BGEU
Slide34
Conditional Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU)
[Figure: load/store datapath extended with a second Add unit that computes the branch target (br); a mux controlled by PCSel selects the next PC from pc+4 or br; Br Logic evaluates the comparison (Bcomp?) on the register operands; control signals shown: ImmSel, Op2Sel, OpCode, ALU Control, MemWrite, WBSel, RegWrEn]
Slide35
RISC-V Unconditional Jumps
20-bit immediate encodes jump target address as a signed offset from PC, in units of 16-bits (i.e., shift left by 1 then add to PC). (+/- 1MiB)
JAL is a subroutine call that also saves return address (PC+4) in register rd
JAL format:
  field:  Jump Offset[19:0]   rd     opcode
  bits:   31:12               11:7   6:0
  width:  20                  5      7
Slide36
RISC-V Register Indirect Jumps
Jumps to target address given by adding 12-bit offset (not shifted by 1 bit) to register rs1
The return address (PC+4) is written to rd (can be x0 if value not needed)
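The two jump flavours differ only in where the base comes from and whether the offset is shifted. The sketch below follows the descriptions on these slides; the function names are hypothetical, and it takes the assembled immediates as inputs rather than decoding them from an instruction word.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, width):
    """Interpret a width-bit value as a two's-complement signed integer."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def jal(pc, imm20):
    """JAL: PC-relative target, 20-bit offset in units of 16 bits. Returns (target, link)."""
    target = (pc + (sign_extend(imm20, 20) << 1)) & MASK32
    return target, (pc + 4) & MASK32


def jalr(pc, rs1_value, imm12):
    """JALR: target = (rs1) + 12-bit offset, not shifted. Returns (target, link)."""
    # The RISC-V specification additionally clears the target's least-significant
    # bit; that step is omitted because the slide does not describe it.
    target = (rs1_value + sign_extend(imm12, 12)) & MASK32
    return target, (pc + 4) & MASK32


# Extremes of the JAL range: just under +1 MiB and exactly -1 MiB
print([hex(v) for v in jal(0x1000, 0x7FFFF)])
print([hex(v) for v in jal(0x200000, 0x80000)])
print([hex(v) for v in jalr(0x1000, 0x2000, 0xFFF)])
```

In both cases the link value PC+4 is what gets written to rd, so a later JALR through that register returns to the instruction after the call.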
JALR format:
  field:  offset (12-bit immediate)   rs1     func3   rd     opcode
  bits:   31:20                       19:15   14:12   11:7   6:0
  width:  12                          5       3       5      7
Slide37
Full RISC-V 1-Stage Datapath
Slide38
Hardwired Control is pure Combinational Logic
[Figure: combinational logic block; inputs: op code, Equal?; outputs: ImmSel, Op2Sel, FuncSel, MemWrite, WBSel, WASel, RegWriteEn, PCSel]
Slide39
ALU Control & Immediate Extension
[Figure: Inst<6:0> (Opcode) feeds a Decode Map that produces ALUop and the 0? and + selections; Inst<14:12> (Func3) supplies the function; outputs are FuncSel (Func, Op, +, 0?) and ImmSel (IType12, SType12, UType20)]
Slide40
Hardwired Control Table

Opcode      ImmSel     Op2Sel   FuncSel   MemWr   RFWen   WBSel   WASel   PCSel
ALU         *          Reg      Func      no      yes     ALU     rd      pc+4
ALUi        IType12    Imm      Op        no      yes     ALU     rd      pc+4
LW          IType12    Imm      +         no      yes     Mem     rd      pc+4
SW          SType12    Imm      +         yes     no      *       *       pc+4
BEQ true    SBType12   *        *         no      no      *       *       br
BEQ false   SBType12   *        *         no      no      *       *       pc+4
J           *          *        *         no      no      *       *       jabs
JAL         *          *        *         no      yes     PC      X1      jabs
JALR        *          *        *         no      yes     PC      rd      rind

Op2Sel = Reg / Imm
WBSel = ALU / Mem / PC
WASel = rd / X1
PCSel = pc+4 / br / rind / jabs
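Since hardwired control is purely combinational, the table can be written directly as a lookup from instruction class (plus the branch comparison result) to control signals. The sketch below transcribes the table; the Python names, the `BEQ_true`/`BEQ_false` keys, and the use of booleans for yes/no are illustrative choices, and "*" marks a don't-care.

```python
FIELDS = ("ImmSel", "Op2Sel", "FuncSel", "MemWr", "RFWen", "WBSel", "WASel", "PCSel")

# One row per instruction class, in the same column order as FIELDS
CONTROL = {
    "ALU":       ("*",        "Reg", "Func", False, True,  "ALU", "rd", "pc+4"),
    "ALUi":      ("IType12",  "Imm", "Op",   False, True,  "ALU", "rd", "pc+4"),
    "LW":        ("IType12",  "Imm", "+",    False, True,  "Mem", "rd", "pc+4"),
    "SW":        ("SType12",  "Imm", "+",    True,  False, "*",   "*",  "pc+4"),
    "BEQ_true":  ("SBType12", "*",   "*",    False, False, "*",   "*",  "br"),
    "BEQ_false": ("SBType12", "*",   "*",    False, False, "*",   "*",  "pc+4"),
    "J":         ("*",        "*",   "*",    False, False, "*",   "*",  "jabs"),
    "JAL":       ("*",        "*",   "*",    False, True,  "PC",  "X1", "jabs"),
    "JALR":      ("*",        "*",   "*",    False, True,  "PC",  "rd", "rind"),
}


def control_signals(opcode_class, branch_equal=False):
    """Combinational lookup: instruction class and Equal? in, control signals out."""
    key = opcode_class
    if opcode_class == "BEQ":
        key = "BEQ_true" if branch_equal else "BEQ_false"
    return dict(zip(FIELDS, CONTROL[key]))


print(control_signals("LW"))
print(control_signals("BEQ", branch_equal=True)["PCSel"])
```

Only the BEQ rows depend on a datapath result; every other row is a function of the opcode alone.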
Slide41
Single-Cycle Hardwired Control
We will assume clock period is sufficiently long for all of the following steps to be “completed”:
  Instruction fetch
  Decode and register fetch
  ALU operation
  Data fetch if required
  Register write-back setup time
=> $t_C > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$
At the rising edge of the following clock, the PC, register file and memory are updated
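The clock-period constraint above can be turned into a quick calculation. The delay values below are hypothetical placeholders in nanoseconds (the slides give none); the point is only that the single-cycle period must exceed the sum of every stage on the longest path.

```python
def min_clock_period(t_ifetch, t_rfetch, t_alu, t_dmem, t_rwb):
    """Lower bound on tC: the sum of all step delays along the load path."""
    return t_ifetch + t_rfetch + t_alu + t_dmem + t_rwb


# Hypothetical delays in nanoseconds; not taken from the slides
delays_ns = {
    "t_ifetch": 2.0,   # instruction fetch
    "t_rfetch": 1.0,   # decode and register fetch
    "t_alu": 2.0,      # ALU operation
    "t_dmem": 2.0,     # data fetch
    "t_rwb": 0.5,      # register write-back setup
}

lower_bound_ns = min_clock_period(**delays_ns)
max_frequency_mhz = 1e3 / lower_bound_ns   # 1000 / ns gives MHz

print(f"tC must exceed {lower_bound_ns} ns (below {max_frequency_mhz:.1f} MHz)")
```

Every instruction pays this full period, even one that never touches data memory, which is why the single-cycle design has CPI of 1 but a long cycle time.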
Slide42
Question of the Day
Do you think a CISC or RISC single-cycle processor would be faster?
Slide43
Summary
Microcoding became less attractive as gap between RAM and ROM speeds reduced, and logic implemented in same technology as memory
Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew
Iron Law explains architecture design space
  Trade instruction/program, cycles/instruction, and time/cycle
Load-Store RISC ISAs designed for efficient pipelined implementations
  Very similar to vertical microcode
  Inspired by earlier Cray machines (CDC 6600/7600)
RISC-V ISA will be used in lectures, problems, and labs
  Berkeley RISC chips: RISC-I, RISC-II, SOAR (RISC-III), SPUR (RISC-IV)
Slide44
Acknowledgements
These slides contain material developed and copyright by:
  Arvind (MIT)
  Krste Asanovic (MIT/UCB)
  Joel Emer (Intel/MIT)
  James Hoe (CMU)
  John Kubiatowicz (UCB)
  David Patterson (UCB)
MIT material derived from course 6.823
UCB material derived from course CS252