Samira Khan, University of Virginia


CS 6354 Computer Architecture, Sep 4, 2017. Fundamental Concepts: Computing Models and ISA Tradeoffs. The content and concept of this course are adapted from CMU ECE 740.



Presentation Transcript

Slide1

Samira Khan, University of Virginia
Sep 4, 2017

COMPUTER ARCHITECTURE

CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs

The content and concept of this course are adapted from CMU ECE 740

Slide2

AGENDA
- Logistics
- Review from last lecture
- Fundamental concepts
  - Computing models
  - ISA Tradeoffs

Slide3

LOGISTICS
- Review 2 due Wednesday
  - Dennis and Misunas, "A Preliminary Architecture for a Basic Data-Flow Processor," ISCA 1974.
  - Kung, "Why Systolic Architectures?," IEEE Computer 1982.
- Participation: discuss on Piazza and in class (5% of grade)
- Project list
  - Will be open Wednesday; start early
  - Be prepared to spend time on the project

Slide4

FLYNN'S TAXONOMY OF COMPUTERS

Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
- SISD: Single instruction operates on a single data element
- SIMD: Single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: Multiple instructions operate on a single data element
  - Closest forms: systolic array processor, streaming processor
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor

Slide5

SIMD PROCESSING
- Single instruction operates on multiple data elements, in time or in space
- Multiple processing elements
- Time-space duality
  - Array processor: instruction operates on multiple data elements at the same time
  - Vector processor: instruction operates on multiple data elements in consecutive time steps

Slide6

SCALAR CODE EXAMPLE

for i = 0 to 49
    C[i] = (A[i] + B[i]) / 2

Scalar code (cycle counts on the right):
    MOVI R0 = 50              1
    MOVA R1 = A               1
    MOVA R2 = B               1
    MOVA R3 = C               1
X:  LD   R4 = MEM[R1++]      11   ; autoincrement addressing
    LD   R5 = MEM[R2++]      11
    ADD  R6 = R4 + R5         4
    SHFR R7 = R6 >> 1         1
    ST   MEM[R3++] = R7      11
    DECBNZ --R0, X            2   ; decrement and branch if not zero

304 dynamic instructions
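The 304 figure follows from four setup instructions plus six instructions per loop iteration; a quick arithmetic check (an illustrative sketch, not part of the slides):

```python
# Dynamic instruction count for the scalar loop on the slide.
setup = 4           # MOVI R0, MOVA R1, MOVA R2, MOVA R3
per_iteration = 6   # LD, LD, ADD, SHFR, ST, DECBNZ
iterations = 50

dynamic_instructions = setup + per_iteration * iterations
print(dynamic_instructions)  # 304
```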

Slide7

VECTOR CODE EXAMPLE

A loop is vectorizable if each iteration is independent of any other.

for i = 0 to 49
    C[i] = (A[i] + B[i]) / 2

Vectorized loop (cycle counts on the right):
    MOVI  VLEN = 50           1
    MOVI  VSTR = 1            1
    VLD   V0 = A             11 + VLEN - 1
    VLD   V1 = B             11 + VLEN - 1
    VADD  V2 = V0 + V1        4 + VLEN - 1
    VSHFR V3 = V2 >> 1        1 + VLEN - 1
    VST   C = V3             11 + VLEN - 1

7 dynamic instructions

Slide8

SCALAR CODE EXECUTION TIME
- Scalar execution time on an in-order processor with 1 memory bank
  - First two loads in the loop cannot be pipelined: 2*11 cycles
  - 4 + 50*40 = 2004 cycles
- Scalar execution time on an in-order processor with 16 banks (word-interleaved)
  - First two loads in the loop can be pipelined
  - 4 + 50*30 = 1504 cycles
- Why 16 banks?
  - 11-cycle memory access latency
  - Having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency
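The two totals can be reproduced from the per-instruction cycle counts on the scalar code slide (an illustrative check; the overlap model for the banked case is my reading of the slide):

```python
setup = 4  # four single-cycle setup instructions (MOVI, MOVA x3)

# 1 bank: the two 11-cycle loads serialize, so each iteration costs
# 11 + 11 + 4 + 1 + 11 + 2 = 40 cycles.
per_iter_1_bank = 11 + 11 + 4 + 1 + 11 + 2
# 16 banks: the second load is pipelined behind the first and adds
# only one visible cycle, so each iteration costs 30 cycles.
per_iter_16_banks = 11 + 1 + 4 + 1 + 11 + 2

print(setup + 50 * per_iter_1_bank)    # 2004
print(setup + 50 * per_iter_16_banks)  # 1504
```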

Slide9

VECTOR CODE EXECUTION TIME
- No chaining, i.e., the output of a vector functional unit cannot be used as the input of another (no vector data forwarding)
- 16 memory banks (word-interleaved)
- 285 cycles
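With no chaining, the seven vector instructions run back to back, so summing the per-instruction cycle counts from the vector code slide (VLEN = 50) reproduces the total (an illustrative check, assuming fully serialized execution):

```python
VLEN = 50
cycles = [
    1,               # MOVI  VLEN = 50
    1,               # MOVI  VSTR = 1
    11 + VLEN - 1,   # VLD   V0 = A
    11 + VLEN - 1,   # VLD   V1 = B
    4 + VLEN - 1,    # VADD  V2 = V0 + V1
    1 + VLEN - 1,    # VSHFR V3 = V2 >> 1
    11 + VLEN - 1,   # VST   C = V3
]
print(sum(cycles))  # 285
```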

Slide10

VECTOR PROCESSOR DISADVANTAGES
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
   -- How about searching for a key in a linked list?

Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983.

Slide11

VECTOR PROCESSOR LIMITATIONS
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. the compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks

Slide12

FURTHER READING: SIMD
Recommended:
- H&P, Appendix on Vector Processors
- Russell, "The CRAY-1 Computer System," CACM 1978.

Slide13

VECTOR MACHINE EXAMPLE: CRAY-1

Russell, "The CRAY-1 Computer System," CACM 1978.
- Scalar and vector modes
- 8 64-element vector registers
  - 64 bits per element
- 16 memory banks
- 8 64-bit scalar registers
- 8 24-bit address registers

Slide14

AMDAHL'S LAW: BOTTLENECK ANALYSIS

Speedup = time_without_enhancement / time_with_enhancement

Suppose an enhancement speeds up a fraction f of a task by a factor of S:

    time_enhanced = time_original * (1 - f) + time_original * (f / S)

    Speedup_overall = 1 / ((1 - f) + f / S)

[Figure: two bars compare time_original and time_enhanced; the unenhanced portion (1 - f) is unchanged, while the enhanced portion f shrinks to f/S.]

Focus on bottlenecks with large f (and large S)
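The speedup formula can be sketched as a small helper (illustrative; the function name and example numbers are mine, not from the slides):

```python
def overall_speedup(f, s):
    """Amdahl's Law: overall speedup when a fraction f of the original
    execution time is accelerated by a factor of s."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up 90% of a task by 10x yields only about 5.3x overall,
# and no matter how large s gets, speedup is capped at 1/(1-f) = 10x.
print(overall_speedup(0.9, 10))    # ~5.26
print(overall_speedup(0.9, 1e12))  # ~10.0
```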

Slide15

FLYNN'S TAXONOMY OF COMPUTERS

Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
- SISD: Single instruction operates on a single data element
- SIMD: Single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: Multiple instructions operate on a single data element
  - Closest forms: systolic array processor, streaming processor
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor

Slide16

SYSTOLIC ARRAYS

Slide17

WHY SYSTOLIC ARCHITECTURES?
- Idea: data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
- Similar to an assembly line of processing elements
  - Different people work on the same car
  - Many cars are assembled simultaneously
- Why? Special-purpose accelerators/architectures need:
  - Simple, regular design (keep the number of unique parts small and regular)
  - High concurrency -> high performance
  - Balanced computation and I/O (memory) bandwidth

Slide18

SYSTOLIC ARRAYS

H. T. Kung, "Why Systolic Architectures?," IEEE Computer 1982.
- Memory: heart
- PEs: cells
- Memory pulses data through the cells

Slide19

SYSTOLIC ARCHITECTURES
- Basic principle: replace one PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs
  - Balance computation and memory bandwidth
- Differences from pipelining:
  - These are individual PEs
  - The array structure can be non-linear and multi-dimensional
  - PE connections can be multidirectional (and of different speeds)
  - PEs can have local memory and execute kernels (rather than a piece of the instruction)

Slide20

SYSTOLIC COMPUTATION EXAMPLE
- Convolution
  - Used in filtering, pattern matching, correlation, polynomial evaluation, etc.
  - Many image processing tasks

Slide21

CONVOLUTION

    y1 = w1*x1 + w2*x2 + w3*x3
    y2 = w1*x2 + w2*x3 + w3*x4
    y3 = w1*x3 + w2*x4 + w3*x5
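Each output on the slide is a dot product of the weights with a sliding window of the inputs; a minimal sketch of that computation (function and variable names are mine, not from the slides):

```python
def convolve(w, x):
    """Sliding-window convolution as on the slide:
    y[i] = w[0]*x[i] + w[1]*x[i+1] + ... (0-indexed)."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

w = [1, 2, 3]
x = [1, 1, 1, 1, 1]
print(convolve(w, x))  # [6, 6, 6]
```

In a systolic array each output y[i] would be accumulated by one cell as the weights and inputs pulse through it.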

Slide22

SYSTOLIC ARRAYS: PROS AND CONS
- Advantage: specialized (the computation needs to fit the PE organization/functions) -> improved efficiency, simple design, high concurrency/performance -> good at doing more with a smaller memory bandwidth requirement
- Downside: specialized -> not generally applicable, because the computation needs to fit the PE functions/organization

Slide23

AGENDA
- Logistics
- Review from last lecture
- Fundamental concepts
  - Computing models
  - ISA Tradeoffs

Slide24

LEVELS OF TRANSFORMATION
- ISA: the agreed-upon interface between software and hardware
  - SW/compiler assumes, HW promises
  - What the software writer needs to know to write system/user programs
- Microarchitecture: a specific implementation of an ISA
  - Not visible to the software
- Microprocessor: ISA, uarch, circuits
  - "Architecture" = ISA + microarchitecture

[Figure: levels of transformation, top to bottom: Problem, Algorithm, Program/Language, ISA, Microarchitecture, Logic, Circuits.]

Slide25

ISA VS. MICROARCHITECTURE
What is part of the ISA vs. the uarch?
- Gas pedal: interface for "acceleration"
- Internals of the engine: implement "acceleration"
- Add instruction vs. adder implementation
- The implementation (uarch) can vary as long as it satisfies the specification (ISA)
  - Bit-serial, ripple-carry, carry-lookahead adders
  - The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, ...
- Uarch usually changes faster than the ISA
  - Few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs
  - Why?
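To make the add-instruction example concrete: the ISA fixes what "add" means, while the adder microarchitecture is free to vary. A bit-level ripple-carry implementation, for instance, might look like this (an illustrative sketch, not from the slides):

```python
def ripple_carry_add(a, b, width=8):
    """Add two unsigned integers by chaining full adders: each stage
    computes one sum bit and passes its carry to the next stage."""
    result, carry = 0, 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i          # sum bit of this stage
        carry = (ai & bi) | (carry & (ai ^ bi))   # carry-out to next stage
    return result  # final carry-out is dropped: result wraps mod 2**width

print(ripple_carry_add(200, 100))  # 44 == (200 + 100) % 256
```

A carry-lookahead adder would produce the same ISA-visible result with less carry-propagation delay; only the implementation differs.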

Slide26

ISA
- Instructions
  - Opcodes, addressing modes, data types
  - Instruction types and formats
  - Registers, condition codes
- Memory
  - Address space, addressability, alignment
  - Virtual memory management
- Call, interrupt/exception handling
- Access control, priority/privilege
- I/O
- Task management
- Power and thermal management
- Multi-threading support, multiprocessor support

Slide27

MICROARCHITECTURE
- Implementation of the ISA under specific design constraints and goals
- Anything done in hardware without exposure to software
  - Pipelining
  - In-order versus out-of-order instruction execution
  - Memory access scheduling policy
  - Speculative execution
  - Superscalar processing (multiple instruction issue?)
  - Clock gating
  - Caching? Levels, size, associativity, replacement policy
  - Prefetching?
  - Voltage/frequency scaling?
  - Error correction?

Slide28

Samira Khan, University of Virginia
Sep 4, 2017

COMPUTER ARCHITECTURE

CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs

The content and concept of this course are adapted from CMU ECE 740