Presentation Transcript

Slide1

Samira Khan
University of Virginia
Jan 28, 2016

COMPUTER ARCHITECTURE

CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs

The content and concept of this course are adapted from CMU ECE 740

Slide2

AGENDA

Project Proposal and Ideas
Review from last lecture
Fundamental concepts
    Computing models
    ISA Tradeoffs

Slide3

RESEARCH PROJECT

Your chance to explore in depth a computer architecture topic that interests you
    Perhaps even publish your innovation in a top computer architecture conference
Start thinking about your project topic now!
    Interact with me and the TA
    Read the project topics handout carefully
Groups of 2-3 students (to be finalized later)
Proposal due: Feb 18

Slide4

RESEARCH PROJECT

Goal: Develop (new) insight
    Solve a problem in a new way, or evaluate/analyze systems/ideas
Type 1: Develop new ideas to solve an important problem
    Rigorously evaluate the benefits and limitations of the ideas
Type 2: Derive insight from rigorous analysis and understanding of existing systems or previously proposed ideas
    Propose potential new solutions based on the new insight
The problem and ideas need to be concrete
Problem and goals need to be very clear

Slide5

RESEARCH PROPOSAL OUTLINE

The Problem: What is the problem you are trying to solve?
    Define it very clearly. Explain why it is important.
Novelty: Why has previous research not solved this problem? What are its shortcomings?
    Describe/cite all relevant works you know of and explain why they are inadequate to solve the problem. This will be your literature survey.
Idea: What is your initial idea/insight? What new solution are you proposing? Why does it make sense? How does/could it solve the problem better?
Hypothesis: What is the main hypothesis you will test?
Methodology: How will you test the hypothesis/ideas? Describe what simulator or model you will use and what initial experiments you will do.
Plan: Describe the steps you will take. What will you accomplish by Milestone 1, Milestone 2, and the Final Report? Give 75%, 100%, 125%, and moonshot goals.

All research projects can be (and should be) described in this fashion.

Slide6

HEILMEIER’S CATECHISM (VERSION 1)

What are you trying to do? Articulate your objectives using absolutely no jargon.
How is it done today, and what are the limits of current practice?
What’s new in your approach, and why do you think it will be successful?
Who cares?
If you’re successful, what difference will it make?
What are the risks and the payoffs?
How much will it cost?
How long will it take?
What are the midterm and final “exams” to check for success?

Slide7

HEILMEIER’S CATECHISM (VERSION 2)

What is the problem?
Why is it hard?
How is it solved today?
What is the new technical idea?
Why can we succeed now?
What is the impact if successful?

http://en.wikipedia.org/wiki/George_H._Heilmeier

Slide8

SUPPLEMENTARY READINGS ON RESEARCH, WRITING, REVIEWS

Hamming, “You and Your Research,” Bell Communications Research Colloquium Seminar, 7 March 1986.
    http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
Levin and Redell, “How (and how not) to write a good systems paper,” OSR 1983.
Smith, “The Task of the Referee,” IEEE Computer 1990.
    Read this to get an idea of the publication process
SP Jones, “How to Write a Great Research Paper”
Fong, “How to Write a CS Research Paper: A Bibliography”

Slide9

WHERE TO GET PROJECT TOPICS/IDEAS FROM

Project topics handout
Assigned readings
    Mutlu and Subramanian, “Research Problems and Opportunities in Memory Systems”
Recent conference proceedings:
    ISCA: http://www.informatik.uni-trier.de/~ley/db/conf/isca/
    MICRO: http://www.informatik.uni-trier.de/~ley/db/conf/micro/
    HPCA: http://www.informatik.uni-trier.de/~ley/db/conf/hpca/
    ASPLOS: http://www.informatik.uni-trier.de/~ley/db/conf/asplos/

Slide10

LAST LECTURE RECAP

Why Study Computer Architecture?
Von Neumann Model
Data Flow Architecture
SIMD: Array and Vector Processors

Slide11

REVIEW: THE DATA FLOW MODEL

Von Neumann model: An instruction is fetched and executed in control flow order
    As specified by the instruction pointer
    Sequential unless explicit control flow instruction
Dataflow model: An instruction is fetched and executed in data flow order
    i.e., when its operands are ready
    i.e., there is no instruction pointer
    Instruction ordering specified by data flow dependence
        Each instruction specifies “who” should receive the result
        An instruction can “fire” whenever all operands are received
    Potentially many instructions can execute at the same time
        Inherently more parallel
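The firing rule can be sketched in a few lines of Python (an illustrative toy, not from the lecture): keep a pool of instructions and execute any whose operands have arrived, with no instruction pointer at all.

```python
# Illustrative sketch of dataflow firing: an "instruction" fires as soon as
# all of its operands are available; ordering comes only from dependences.
def dataflow_execute(nodes, values):
    """nodes: name -> (function, operand names); values: initially-ready data."""
    done = set()
    while len(done) < len(nodes):
        fired = False
        for name, (fn, deps) in nodes.items():
            if name not in done and all(d in values for d in deps):
                values[name] = fn(*(values[d] for d in deps))  # fire: send result
                done.add(name)
                fired = True
        if not fired:
            raise RuntimeError("deadlock: some instruction never receives its operands")
    return values

# (x + y) and (x - y) are independent, so both can fire in the same pass.
program = {
    "s": (lambda a, b: a + b, ["x", "y"]),
    "d": (lambda a, b: a - b, ["x", "y"]),
    "p": (lambda a, b: a * b, ["s", "d"]),
}
result = dataflow_execute(program, {"x": 5, "y": 3})
```

Here "s" and "d" could execute simultaneously on real dataflow hardware; only "p" must wait, which is the inherent parallelism the slide describes.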

Slide12

REVIEW: FLYNN’S TAXONOMY OF COMPUTERS

Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966

SISD: Single instruction operates on a single data element
SIMD: Single instruction operates on multiple data elements
    Array processor
    Vector processor
MISD: Multiple instructions operate on a single data element
    Closest forms: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
    Multiprocessor
    Multithreaded processor

Slide13

REVIEW: SIMD PROCESSING

Single instruction operates on multiple data elements
    In time or in space
Multiple processing elements
Time-space duality
    Array processor: Instruction operates on multiple data elements at the same time
    Vector processor: Instruction operates on multiple data elements in consecutive time steps

Slide14

REVIEW: VECTOR PROCESSOR ADVANTAGES

+ No dependencies within a vector
    Pipelining and parallelization work well
    Can have very deep pipelines: no dependencies!
+ Each instruction generates a lot of work
    Reduces instruction fetch bandwidth
+ Highly regular memory access pattern
    Interleaving multiple banks for higher memory bandwidth
    Prefetching
+ No need to explicitly code loops
    Fewer branches in the instruction sequence

Slide15

SCALAR CODE EXAMPLE

For i = 1 to 50
    C[i] = (A[i] + B[i]) / 2

Scalar code:                          (latency in cycles)
    MOVI R0 = 50               1
    MOVA R1 = A                1
    MOVA R2 = B                1
    MOVA R3 = C                1
X:  LD R4 = MEM[R1++]          11   ; autoincrement addressing
    LD R5 = MEM[R2++]          11
    ADD R6 = R4 + R5           4
    SHFR R7 = R6 >> 1          1
    ST MEM[R3++] = R7          11
    DECBNZ --R0, X             2    ; decrement and branch if NZ

304 dynamic instructions
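The 304 figure follows directly from the instruction mix above; a quick check, counting only dynamic instructions (latencies aside):

```python
# 4 setup instructions, then 6 instructions per iteration for 50 iterations.
setup = 4                  # MOVI + 3x MOVA, executed once
per_iteration = 6          # LD, LD, ADD, SHFR, ST, DECBNZ
iterations = 50
dynamic_instructions = setup + iterations * per_iteration
assert dynamic_instructions == 304
```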

Slide16

VECTOR PROCESSORS

A vector is a one-dimensional array of numbers
Many scientific/commercial programs use vectors
    for (i = 0; i <= 49; i++)
        C[i] = (A[i] + B[i]) / 2
A vector processor is one whose instructions operate on vectors rather than scalar (single-data) values
Basic requirements:
    Need to load/store vectors → vector registers (contain vectors)
    Need to operate on vectors of different lengths → vector length register (VLEN)
    Elements of a vector might be stored apart from each other in memory → vector stride register (VSTR)
        Stride: distance between two elements of a vector
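As an illustrative sketch (the function name and the flat-list "memory" are simplifications, not from the slides), a strided vector load gathers VLEN elements spaced VSTR apart:

```python
# Sketch of what a strided vector load gathers, governed by the VLEN and
# VSTR registers the slide introduces.
def vector_load(memory, base, vlen, vstr):
    """Gather vlen elements starting at memory[base], vstr apart."""
    return [memory[base + i * vstr] for i in range(vlen)]

memory = list(range(24))            # stand-in for a word-addressable memory
row = vector_load(memory, 0, 4, 1)  # unit stride: consecutive elements
col = vector_load(memory, 0, 4, 6)  # stride 6: a column of a 4x6 row-major matrix
```

With a 4x6 row-major matrix stored at address 0, stride 1 walks along a row while stride 6 (the row width) walks down a column, which is exactly why elements of a logical vector can sit apart in memory.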

Slide17

VECTOR CODE EXAMPLE

A loop is vectorizable if each iteration is independent of any other

For i = 0 to 49
    C[i] = (A[i] + B[i]) / 2

Vectorized loop:                      (latency in cycles)
    MOVI VLEN = 50             1
    MOVI VSTR = 1              1
    VLD V0 = A                 11 + VLEN - 1
    VLD V1 = B                 11 + VLEN - 1
    VADD V2 = V0 + V1          4 + VLEN - 1
    VSHFR V3 = V2 >> 1         1 + VLEN - 1
    VST C = V3                 11 + VLEN - 1

7 dynamic instructions

Slide18

SCALAR CODE EXECUTION TIME

Scalar execution time on an in-order processor with 1 bank:
    First two loads in the loop cannot be pipelined: 2*11 cycles
    4 + 50*40 = 2004 cycles
Scalar execution time on an in-order processor with 16 banks (word-interleaved):
    First two loads in the loop can be pipelined
    4 + 50*30 = 1504 cycles
Why 16 banks?
    11-cycle memory access latency
    Having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency
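The arithmetic can be checked directly, using the per-instruction latencies assumed in the scalar code example (the 12-cycle cost for the overlapped load pair is an inference from the stated 30-cycle iteration, not given explicitly on the slide):

```python
# Latencies from the scalar code example: LD 11, ADD 4, SHFR 1, ST 11, DECBNZ 2.
setup = 4                                            # four 1-cycle MOVI/MOVA
one_bank = setup + 50 * (11 + 11 + 4 + 1 + 11 + 2)   # the two loads serialize
# With 16 word-interleaved banks, the second load can issue one cycle after
# the first and overlap with it: the load pair costs 12 cycles instead of 22.
sixteen_banks = setup + 50 * (12 + 4 + 1 + 11 + 2)
assert one_bank == 2004
assert sixteen_banks == 1504
```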

Slide19

VECTOR CODE EXECUTION TIME

No chaining: the output of a vector functional unit cannot be used as the input of another (i.e., no vector data forwarding)
16 memory banks (word-interleaved)

285 cycles
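One serialization model that reproduces the 285 (an assumption consistent with "no chaining": each vector instruction runs to completion before the next starts, costing its startup latency plus VLEN - 1 cycles to stream 50 elements):

```python
# Startup latencies from the vector code example; VLEN = 50.
VLEN = 50
scalar_setup = 1 + 1                  # MOVI VLEN, MOVI VSTR
startups = [11, 11, 4, 1, 11]         # VLD, VLD, VADD, VSHFR, VST
total_cycles = scalar_setup + sum(s + VLEN - 1 for s in startups)
assert total_cycles == 285
```

Compare with the 1504 scalar cycles above: even without chaining, amortizing startup latency over 50 elements per instruction is a large win.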

Slide20

VECTOR PROCESSOR DISADVANTAGES

-- Works (only) if parallelism is regular (data/SIMD parallelism)
    ++ Vector operations
    -- Very inefficient if parallelism is irregular
        -- How about searching for a key in a linked list?

Fisher, “Very Long Instruction Word Architectures and the ELI-512,” ISCA 1983.

Slide21

VECTOR PROCESSOR LIMITATIONS

-- Memory (bandwidth) can easily become a bottleneck, especially if
    1. compute/memory operation balance is not maintained
    2. data is not mapped appropriately to memory banks

Slide22

FURTHER READING: SIMD

Recommended:
    H&P, Appendix on Vector Processors
    Russell, “The CRAY-1 computer system,” CACM 1978.

Slide23

VECTOR MACHINE EXAMPLE: CRAY-1

Russell, “The CRAY-1 computer system,” CACM 1978.
Scalar and vector modes
8 64-element vector registers
    64 bits per element
16 memory banks
8 64-bit scalar registers
8 24-bit address registers

Slide24

AMDAHL’S LAW: BOTTLENECK ANALYSIS

Speedup = time without enhancement / time with enhancement
Suppose an enhancement speeds up a fraction f of a task by a factor of S

    time_enhanced = time_original · (1 - f) + time_original · (f / S)
    Speedup_overall = 1 / ((1 - f) + f / S)

(Figure: time_original splits into an unenhanced part (1 - f) and an enhanced part f; in time_enhanced the latter shrinks to f/S.)

Focus on bottlenecks with large f (and large S)
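The law drops straight into code; the second assertion illustrates the cap imposed by the untouched fraction, the reason to focus on bottlenecks with large f:

```python
# Amdahl's Law: f is the enhanced fraction of the task, S its speedup factor.
def speedup(f, S):
    return 1.0 / ((1.0 - f) + f / S)

# Halving the time of half the task yields only a 1.33x overall speedup...
assert abs(speedup(0.5, 2) - 4 / 3) < 1e-12
# ...and even an enormous S is capped near 1 / (1 - f): here, 10x.
assert abs(speedup(0.9, 1e12) - 10.0) < 1e-3
```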

Slide25

FLYNN’S TAXONOMY OF COMPUTERS

Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966

SISD: Single instruction operates on a single data element
SIMD: Single instruction operates on multiple data elements
    Array processor
    Vector processor
MISD: Multiple instructions operate on a single data element
    Closest forms: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
    Multiprocessor
    Multithreaded processor

Slide26

SYSTOLIC ARRAYS

Slide27

WHY SYSTOLIC ARCHITECTURES?

Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
Similar to an assembly line of processing elements
    Different people work on the same car
    Many cars are assembled simultaneously
    Can be two-dimensional
Why? Special-purpose accelerators/architectures need:
    Simple, regular design (keep # unique parts small and regular)
    High concurrency → high performance
    Balanced computation and I/O (memory) bandwidth

Slide28

SYSTOLIC ARRAYS

H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982.
Memory: heart
PEs: cells
Memory pulses data through cells

Slide29

SYSTOLIC ARCHITECTURES

Basic principle: Replace one PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs
    Balance computation and memory bandwidth
Differences from pipelining:
    These are individual PEs
    Array structure can be non-linear and multi-dimensional
    PE connections can be multidirectional (and of different speeds)
    PEs can have local memory and execute kernels (rather than a piece of an instruction)
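A minimal sketch of the arithmetic a 1-D systolic array performs (the convolution example from Kung's paper); the beat-by-beat PE timing is abstracted away, so this shows only what each partial sum accumulates, not when:

```python
# Each output corresponds to a partial sum that has flowed through every PE,
# picking up one weight * input product per PE; here the per-beat movement
# is collapsed into a plain sliding dot product.
def systolic_convolve(weights, xs):
    n = len(weights)
    return [sum(weights[i] * xs[t + i] for i in range(n))
            for t in range(len(xs) - n + 1)]

ys = systolic_convolve([1, 2, 3], [1, 1, 1, 1])
```

With n PEs each input element is read from memory once but used n times as it passes through the array, which is the balanced-bandwidth advantage of the next slide.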

Slide30

SYSTOLIC ARRAYS: PROS AND CONS

Advantage: Specialized (computation needs to fit the PE organization/functions)
    → improved efficiency, simple design, high concurrency/performance
    → can do more with less memory bandwidth

Downside: Specialized
    → not generally applicable, because the computation needs to fit the PE functions/organization

Slide31

AGENDA

Project Proposal and Ideas
Review from last lecture
Fundamental concepts
    Computing models
    ISA Tradeoffs

Slide32

LEVELS OF TRANSFORMATION

ISA: Agreed-upon interface between software and hardware
    SW/compiler assumes, HW promises
    What the software writer needs to know to write system/user programs
Microarchitecture: Specific implementation of an ISA
    Not visible to the software
Microprocessor: ISA, uarch, circuits
    “Architecture” = ISA + microarchitecture

(Figure: levels of transformation — Problem → Algorithm → Program/Language → ISA → Microarchitecture → Logic → Circuits)

Slide33

ISA VS. MICROARCHITECTURE

What is part of the ISA vs. the uarch?
    Gas pedal: interface for “acceleration”
    Internals of the engine: implement “acceleration”
    Add instruction vs. adder implementation
Implementation (uarch) can vary as long as it satisfies the specification (ISA)
    Bit-serial, ripple-carry, carry-lookahead adders
    The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, …
Uarch usually changes faster than ISA
    Few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs
    Why?

Slide34

ISA

Instructions
    Opcodes, Addressing Modes, Data Types
    Instruction Types and Formats
    Registers, Condition Codes
Memory
    Address space, Addressability, Alignment
    Virtual memory management
Call, Interrupt/Exception Handling
Access Control, Priority/Privilege
I/O
Task Management
Power and Thermal Management
Multi-threading support, Multiprocessor support

Slide35

MICROARCHITECTURE

Implementation of the ISA under specific design constraints and goals
Anything done in hardware without exposure to software
    Pipelining
    In-order versus out-of-order instruction execution
    Memory access scheduling policy
    Speculative execution
    Superscalar processing (multiple instruction issue?)
    Clock gating
    Caching? Levels, size, associativity, replacement policy
    Prefetching?
    Voltage/frequency scaling?
    Error correction?

Slide36

DESIGN POINT

A set of design considerations and their importance
    leads to tradeoffs in both ISA and uarch
Considerations:
    Cost
    Performance
    Maximum power consumption
    Energy consumption (battery life)
    Availability
    Reliability and Correctness (or is it?)
    Time to Market
Design point determined by the “Problem” space (application space)

(Figure: levels of transformation — Problem → Algorithm → Program/Language → ISA → Microarchitecture → Logic → Circuits)

Slide37

TRADEOFFS: SOUL OF COMPUTER ARCHITECTURE

ISA-level tradeoffs
Uarch-level tradeoffs
System- and task-level tradeoffs
    How to divide the labor between hardware and software

Slide38

ISA-LEVEL TRADEOFFS: SEMANTIC GAP

Where to place the ISA? Semantic gap
    Closer to high-level language (HLL) or closer to hardware control signals? → Complex vs. simple instructions
    RISC vs. CISC vs. HLL machines
        FFT, QUICKSORT, POLY, FP instructions?
        VAX INDEX instruction (array access with bounds checking)
Tradeoffs:
    Simple compiler, complex hardware vs. complex compiler, simple hardware
        Caveat: Translation (indirection) can change the tradeoff!
    Burden of backward compatibility
    Performance?
        Optimization opportunity: Example of the VAX INDEX instruction: who (compiler vs. hardware) puts more effort into optimization?
    Instruction size, code size
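A hypothetical sketch of what an instruction like VAX INDEX folds into one operation (the real instruction's semantics differ in detail; the function and parameter names here are invented for illustration):

```python
# One "complex instruction": bounds check + effective-address arithmetic,
# work a RISC machine would express as several simple instructions.
def indexed_address(base, i, lo, hi, size):
    if not (lo <= i <= hi):               # the hardware-performed bounds check
        raise IndexError(f"subscript {i} outside [{lo}, {hi}]")
    return base + (i - lo) * size         # effective address of element i

addr = indexed_address(100, 5, 0, 9, 4)   # 4-byte elements, valid subscripts 0..9
```

Whether such a check belongs in one instruction (simple compiler, complex hardware) or in several (complex compiler, simple hardware) is exactly the semantic-gap tradeoff above.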

Slide39

Samira Khan
University of Virginia
Jan 28, 2016

COMPUTER ARCHITECTURE

CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs

The content and concept of this course are adapted from CMU ECE 740