Slide 1
Samira Khan
University of Virginia
Sep 4, 2017

COMPUTER ARCHITECTURE
CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs

The content and concept of this course are adapted from CMU ECE 740
Slide 2: AGENDA
- Logistics
- Review from last lecture
- Fundamental concepts
  - Computing models
  - ISA tradeoffs
Slide 3: LOGISTICS
- Review 2 due Wednesday
  - Dennis and Misunas, "A Preliminary Architecture for a Basic Data Flow Processor," ISCA 1974.
  - Kung, H. T., "Why Systolic Architectures?," IEEE Computer 1982.
- Participation: discuss on Piazza and in class (5% of grade)
- Project list will be open Wednesday; start early
- Be prepared to spend time on the project
Slide 4: FLYNN'S TAXONOMY OF COMPUTERS
- Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
- SISD: Single instruction operates on a single data element
- SIMD: Single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: Multiple instructions operate on a single data element
  - Closest forms: systolic array processor, streaming processor
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor
Slide 5: SIMD PROCESSING
- Single instruction operates on multiple data elements
  - In time or in space
- Multiple processing elements
- Time-space duality
  - Array processor: instruction operates on multiple data elements at the same time
  - Vector processor: instruction operates on multiple data elements in consecutive time steps
Slide 6: SCALAR CODE EXAMPLE

for i = 0 to 49
    C[i] = (A[i] + B[i]) / 2

Scalar code (cycle count at right):
    MOVI   R0 = 50            ; 1
    MOVA   R1 = A             ; 1
    MOVA   R2 = B             ; 1
    MOVA   R3 = C             ; 1
X:  LD     R4 = MEM[R1++]     ; 11  (autoincrement addressing)
    LD     R5 = MEM[R2++]     ; 11
    ADD    R6 = R4 + R5       ; 4
    SHFR   R7 = R6 >> 1       ; 1
    ST     MEM[R3++] = R7     ; 11
    DECBNZ --R0, X            ; 2   (decrement and branch if not zero)

304 dynamic instructions (4 setup + 50 x 6 per iteration)
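The scalar loop and its dynamic instruction count can be sketched in plain Python. This is a minimal sketch, not the slide's machine code: the array contents are arbitrary, and the shift-right implements the divide-by-2 exactly as SHFR does.

```python
# Scalar version of C[i] = (A[i] + B[i]) / 2, mirroring the slide's loop.
A = list(range(50))
B = list(range(50))
C = [0] * 50

for i in range(50):
    # SHFR R7 = R6 >> 1 implements the divide-by-2
    C[i] = (A[i] + B[i]) >> 1

# 4 setup instructions (MOVI + 3x MOVA) plus 6 instructions per iteration
dynamic_instructions = 4 + 50 * 6
print(dynamic_instructions)  # 304
```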
Slide 7: VECTOR CODE EXAMPLE
- A loop is vectorizable if each iteration is independent of any other

for i = 0 to 49
    C[i] = (A[i] + B[i]) / 2

Vectorized loop (cycle count at right):
    MOVI   VLEN = 50      ; 1
    MOVI   VSTR = 1       ; 1
    VLD    V0 = A         ; 11 + VLEN - 1
    VLD    V1 = B         ; 11 + VLEN - 1
    VADD   V2 = V0 + V1   ; 4 + VLEN - 1
    VSHFR  V3 = V2 >> 1   ; 1 + VLEN - 1
    VST    C = V3         ; 11 + VLEN - 1

7 dynamic instructions
Slide 8: SCALAR CODE EXECUTION TIME
- Scalar execution time on an in-order processor with 1 bank
  - First two loads in the loop cannot be pipelined: 2 x 11 cycles
  - 4 + 50 x 40 = 2004 cycles
- Scalar execution time on an in-order processor with 16 banks (word-interleaved)
  - First two loads in the loop can be pipelined
  - 4 + 50 x 30 = 1504 cycles
- Why 16 banks?
  - 11-cycle memory access latency
  - Having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency
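The two cycle totals follow directly from the per-instruction latencies on the scalar code slide; the arithmetic below reproduces them, assuming (as the slide does) that banking hides 10 of the second load's 11 cycles.

```python
# Scalar execution time, using the slide's latencies:
# LD/ST = 11 cycles, ADD = 4, SHFR = 1, DECBNZ = 2, setup moves = 1 each.
setup = 4
iteration_1_bank = 11 + 11 + 4 + 1 + 11 + 2   # loads serialize: 40 cycles
total_1_bank = setup + 50 * iteration_1_bank

# With 16 word-interleaved banks, the second load pipelines behind the
# first, hiding 10 of its 11 cycles: 30 cycles per iteration.
iteration_16_banks = iteration_1_bank - 10
total_16_banks = setup + 50 * iteration_16_banks

print(total_1_bank, total_16_banks)  # 2004 1504
```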
Slide 9: VECTOR CODE EXECUTION TIME
- No chaining, i.e., the output of a vector functional unit cannot be used as the input of another (no vector data forwarding)
- 16 memory banks (word-interleaved)
- 285 cycles
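The 285-cycle figure can be reconstructed under the no-chaining assumption: a pipelined unit with startup latency L finishes a VLEN-element operation in L + VLEN - 1 cycles, and (as in the lecture's baseline) the instructions execute back-to-back with no overlap.

```python
VLEN = 50

def vector_op(startup):
    # A pipelined unit delivers one element per cycle after the startup
    # latency, so a full vector takes startup + VLEN - 1 cycles.
    return startup + VLEN - 1

total = (1 + 1                # MOVI VLEN, MOVI VSTR
         + vector_op(11)      # VLD  V0 = A
         + vector_op(11)      # VLD  V1 = B
         + vector_op(4)       # VADD V2 = V0 + V1
         + vector_op(1)       # VSHFR V3 = V2 >> 1
         + vector_op(11))     # VST  C = V3
print(total)  # 285
```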
Slide 10: VECTOR PROCESSOR DISADVANTAGES
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
      -- How about searching for a key in a linked list?
Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983.
Slide 11: VECTOR PROCESSOR LIMITATIONS
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks
Slide 12: FURTHER READING: SIMD
- Recommended: H&P, appendix on vector processors
- Russell, "The CRAY-1 Computer System," CACM 1978.
Slide 13: VECTOR MACHINE EXAMPLE: CRAY-1
- Russell, "The CRAY-1 Computer System," CACM 1978.
- Scalar and vector modes
- 8 64-element vector registers
  - 64 bits per element
- 16 memory banks
- 8 64-bit scalar registers
- 8 24-bit address registers
Slide 14: AMDAHL'S LAW: BOTTLENECK ANALYSIS
- Speedup = time_without_enhancement / time_with_enhancement
- Suppose an enhancement speeds up a fraction f of a task by a factor of S:

  time_enhanced = time_original x (1 - f) + time_original x (f / S)

  Speedup_overall = 1 / ((1 - f) + f / S)

[Figure: time_original divided into fractions (1 - f) and f; time_enhanced divided into (1 - f) and f/S]

- Focus on bottlenecks with large f (and large S)
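The overall-speedup formula is easy to explore numerically. The function below is a direct transcription; the example values are illustrative, not from the slide.

```python
def speedup(f, S):
    """Amdahl's Law: fraction f of the task is sped up by factor S."""
    return 1.0 / ((1.0 - f) + f / S)

# Speeding up half the task, even enormously, caps overall speedup near 2x.
print(round(speedup(0.5, 1_000_000), 3))  # 2.0
# A large f beats a huge S applied to a small f.
print(speedup(0.9, 10) > speedup(0.1, 1000))  # True
```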
Slide 15: FLYNN'S TAXONOMY OF COMPUTERS (revisited)
- Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
- SISD: Single instruction operates on a single data element
- SIMD: Single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: Multiple instructions operate on a single data element
  - Closest forms: systolic array processor, streaming processor
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor
Slide 16: SYSTOLIC ARRAYS
Slide 17: WHY SYSTOLIC ARCHITECTURES?
- Idea: data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
- Similar to an assembly line of processing elements
  - Different people work on the same car
  - Many cars are assembled simultaneously
- Why? Special-purpose accelerators/architectures need:
  - Simple, regular design (keep the number of unique parts small and regular)
  - High concurrency -> high performance
  - Balanced computation and I/O (memory) bandwidth
Slide 18: SYSTOLIC ARRAYS
- H. T. Kung, "Why Systolic Architectures?," IEEE Computer 1982.
[Figure: the memory is the heart, the PEs are the cells; the memory "pulses" data through the PEs]
Slide 19: SYSTOLIC ARCHITECTURES
- Basic principle: replace one PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs
  - Balance computation and memory bandwidth
- Differences from pipelining:
  - These are individual PEs
  - Array structure can be non-linear and multi-dimensional
  - PE connections can be multidirectional (and of different speeds)
  - PEs can have local memory and execute kernels (rather than a piece of the instruction)
Slide 20: SYSTOLIC COMPUTATION EXAMPLE
- Convolution
  - Used in filtering, pattern matching, correlation, polynomial evaluation, etc.
  - Many image processing tasks
Slide 21: CONVOLUTION

y1 = w1*x1 + w2*x2 + w3*x3
y2 = w1*x2 + w2*x3 + w3*x4
y3 = w1*x3 + w2*x4 + w3*x5
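The three equations follow the sliding-window pattern y_i = w1*x_i + w2*x_{i+1} + w3*x_{i+2}. A minimal sketch of that computation in a weight-stationary style (each loop over j plays the role of one fixed-weight PE; this is a functional sketch, not a cycle-accurate systolic simulation):

```python
def convolve(w, x):
    # PE j holds weight w[j] (weight-stationary); partial sums for all
    # output positions accumulate as the inputs stream past each PE.
    n_out = len(x) - len(w) + 1
    y = [0] * n_out
    for j, wj in enumerate(w):
        for i in range(n_out):
            y[i] += wj * x[i + j]
    return y

w = [2, 3, 4]
x = [1, 2, 3, 4, 5]
print(convolve(w, x))  # [20, 29, 38]; y1 = 2*1 + 3*2 + 4*3 = 20
```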
Slide 22: SYSTOLIC ARRAYS: PROS AND CONS
- Advantage: specialized (computation needs to fit the PE organization/functions)
  -> improved efficiency, simple design, high concurrency/performance
  -> does more with a smaller memory bandwidth requirement
- Downside: specialized
  -> not generally applicable, because the computation needs to fit the PE functions/organization
Slide 23: AGENDA
- Logistics
- Review from last lecture
- Fundamental concepts
  - Computing models
  - ISA tradeoffs
Slide 24: LEVELS OF TRANSFORMATION
- ISA: agreed-upon interface between software and hardware
  - SW/compiler assumes, HW promises
  - What the software writer needs to know to write system/user programs
- Microarchitecture: specific implementation of an ISA
  - Not visible to the software
- Microprocessor: ISA, uarch, circuits
  - "Architecture" = ISA + microarchitecture
[Figure: levels of transformation, top to bottom: Problem -> Algorithm -> Program/Language -> ISA -> Microarchitecture -> Logic -> Circuits]
Slide 25: ISA VS. MICROARCHITECTURE
- What is part of the ISA vs. the uarch?
  - Gas pedal: interface for "acceleration"
  - Internals of the engine: implement "acceleration"
- Add instruction vs. adder implementation
  - The implementation (uarch) can vary as long as it satisfies the specification (ISA)
  - Bit-serial, ripple-carry, carry-lookahead adders
  - The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, ...
- Uarch usually changes faster than ISA
  - Few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs
  - Why?
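The adder example can be made concrete. The snippet below is an illustrative bit-level model of a ripple-carry adder (not any particular processor's design); a carry-lookahead or bit-serial adder would be a different microarchitecture producing the same ISA-visible result.

```python
def ripple_carry_add(a, b, width=8):
    # Microarchitectural choice: compute the sum bit by bit, propagating
    # the carry from the least to the most significant position.
    carry, result = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i            # sum bit
        carry = (ai & bi) | (carry & (ai ^ bi))     # carry out
    return result  # wraps modulo 2**width, as the hardware would

# ISA-level contract: ADD produces a + b (mod 2^width), however implemented.
print(ripple_carry_add(100, 55))   # 155
print(ripple_carry_add(200, 100))  # 44 (300 mod 256)
```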
Slide 26: ISA
- Instructions
  - Opcodes, addressing modes, data types
  - Instruction types and formats
  - Registers, condition codes
- Memory
  - Address space, addressability, alignment
  - Virtual memory management
- Call, interrupt/exception handling
- Access control, priority/privilege
- I/O
- Task management
- Power and thermal management
- Multi-threading support, multiprocessor support
Slide 27: MICROARCHITECTURE
- Implementation of the ISA under specific design constraints and goals
- Anything done in hardware without exposure to software
  - Pipelining
  - In-order versus out-of-order instruction execution
  - Memory access scheduling policy
  - Speculative execution
  - Superscalar processing (multiple instruction issue?)
  - Clock gating
  - Caching? Levels, size, associativity, replacement policy
  - Prefetching?
  - Voltage/frequency scaling?
  - Error correction?