Slide 1
Samira Khan
University of Virginia
Jan 28, 2016

COMPUTER ARCHITECTURE
CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs

The content and concept of this course are adapted from CMU ECE 740.
Slide 2: AGENDA
- Project Proposal and Ideas
- Review from last lecture
- Fundamental concepts
  - Computing models
  - ISA tradeoffs
Slide 3: RESEARCH PROJECT
- Your chance to explore a computer architecture topic that interests you in depth
- Perhaps even publish your innovation in a top computer architecture conference
- Start thinking about your project topic now!
- Interact with me and the TA
- Read the project topics handout carefully
- Groups of 2-3 students (will be finalized later)
- Proposal due: Feb 18
Slide 4: RESEARCH PROJECT
- Goal: Develop (new) insight
  - Solve a problem in a new way, or evaluate/analyze systems/ideas
- Type 1: Develop new ideas to solve an important problem
  - Rigorously evaluate the benefits and limitations of the ideas
- Type 2: Derive insight from rigorous analysis and understanding of existing systems or previously proposed ideas
  - Propose potential new solutions based on the new insight
- The problem and ideas need to be concrete
- The problem and goals need to be very clear
Slide 5: RESEARCH PROPOSAL OUTLINE
- The Problem: What is the problem you are trying to solve? Define it very clearly and explain why it is important.
- Novelty: Why has previous research not solved this problem? What are its shortcomings? Describe/cite all relevant works you know of and explain why they are inadequate to solve the problem. This will be your literature survey.
- Idea: What is your initial idea/insight? What new solution are you proposing? Why does it make sense? How does/could it solve the problem better?
- Hypothesis: What is the main hypothesis you will test?
- Methodology: How will you test the hypothesis/ideas? Describe what simulator or model you will use and what initial experiments you will do.
- Plan: Describe the steps you will take. What will you accomplish by Milestone 1, Milestone 2, and the Final Report? Give 75%, 100%, 125%, and moonshot goals.

All research projects can be (and should be) described in this fashion.
Slide 6: HEILMEIER'S CATECHISM (VERSION 1)
- What are you trying to do? Articulate your objectives using absolutely no jargon.
- How is it done today, and what are the limits of current practice?
- What's new in your approach and why do you think it will be successful?
- Who cares?
- If you're successful, what difference will it make?
- What are the risks and the payoffs?
- How much will it cost?
- How long will it take?
- What are the midterm and final "exams" to check for success?
Slide 7: HEILMEIER'S CATECHISM (VERSION 2)
- What is the problem?
- Why is it hard?
- How is it solved today?
- What is the new technical idea?
- Why can we succeed now?
- What is the impact if successful?

http://en.wikipedia.org/wiki/George_H._Heilmeier
Slide 8: SUPPLEMENTARY READINGS ON RESEARCH, WRITING, AND REVIEWS
- Hamming, "You and Your Research," Bell Communications Research Colloquium Seminar, 7 March 1986. http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
- Levin and Redell, "How (and How Not) to Write a Good Systems Paper," OSR 1983.
- Smith, "The Task of the Referee," IEEE Computer 1990.
  - Read this to get an idea of the publication process
- SP Jones, "How to Write a Great Research Paper"
- Fong, "How to Write a CS Research Paper: A Bibliography"
Slide 9: WHERE TO GET PROJECT TOPICS/IDEAS
- Project topics handout
- Assigned readings
  - Mutlu and Subramanian, "Research Problems and Opportunities in Memory Systems"
- Recent conference proceedings
  - ISCA: http://www.informatik.uni-trier.de/~ley/db/conf/isca/
  - MICRO: http://www.informatik.uni-trier.de/~ley/db/conf/micro/
  - HPCA: http://www.informatik.uni-trier.de/~ley/db/conf/hpca/
  - ASPLOS: http://www.informatik.uni-trier.de/~ley/db/conf/asplos/
Slide 10: LAST LECTURE RECAP
- Why Study Computer Architecture?
- Von Neumann Model
- Data Flow Architecture
- SIMD
  - Array
  - Vector
Slide 11: REVIEW: THE DATA FLOW MODEL
- Von Neumann model: an instruction is fetched and executed in control flow order
  - As specified by the instruction pointer
  - Sequential unless an explicit control flow instruction is encountered
- Dataflow model: an instruction is fetched and executed in data flow order
  - i.e., when its operands are ready
  - i.e., there is no instruction pointer
  - Instruction ordering is specified by data flow dependence
  - Each instruction specifies "who" should receive the result
  - An instruction can "fire" whenever all of its operands are received
  - Potentially many instructions can execute at the same time
  - Inherently more parallel
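The firing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's notation: the program representation (a map from destination names to operations and source names) is made up for the example.

```python
# Dataflow execution sketch: an instruction "fires" as soon as all of its
# operands have arrived; there is no instruction pointer ordering execution.

def dataflow_execute(program, inputs):
    """program: dest -> (op, src_names); inputs: name -> value."""
    values = dict(inputs)
    pending = dict(program)
    order = []                        # firing order, for illustration only
    while pending:
        # find every instruction whose operands are all ready
        ready = [d for d, (op, srcs) in pending.items()
                 if all(s in values for s in srcs)]
        assert ready, "deadlock: no instruction can fire"
        for dest in ready:            # all ready instructions may fire together
            op, srcs = pending.pop(dest)
            values[dest] = op(*(values[s] for s in srcs))
            order.append(dest)
    return values, order

# t1 and t2 have no dependence, so both can fire in the same step;
# t3 must wait until both of its operands have been produced.
prog = {
    "t1": (lambda a, b: a + b, ("x", "y")),
    "t2": (lambda a, b: a * b, ("x", "y")),
    "t3": (lambda a, b: a - b, ("t1", "t2")),
}
vals, order = dataflow_execute(prog, {"x": 3, "y": 4})
print(vals["t3"])    # (3+4) - (3*4) = -5
```

Note how t3 always fires last, not because of any program counter, but purely because its operands arrive last.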
Slide 12: REVIEW: FLYNN'S TAXONOMY OF COMPUTERS
- Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
- SISD: single instruction operates on a single data element
- SIMD: single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: multiple instructions operate on a single data element
  - Closest forms: systolic array processor, streaming processor
- MIMD: multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor
Slide 13: REVIEW: SIMD PROCESSING
- Single instruction operates on multiple data elements
  - In time or in space
- Multiple processing elements
- Time-space duality
  - Array processor: the instruction operates on multiple data elements at the same time
  - Vector processor: the instruction operates on multiple data elements in consecutive time steps
Slide 14: REVIEW: VECTOR PROCESSOR ADVANTAGES
+ No dependencies within a vector
  - Pipelining and parallelization work well
  - Can have very deep pipelines: no dependencies!
+ Each instruction generates a lot of work
  - Reduces instruction fetch bandwidth
+ Highly regular memory access pattern
  - Interleave multiple banks for higher memory bandwidth
  - Prefetching
+ No need to explicitly code loops
  - Fewer branches in the instruction sequence
Slide 15: SCALAR CODE EXAMPLE
For i = 1 to 50: C[i] = (A[i] + B[i]) / 2

Scalar code (cycles per instruction on the right):
   MOVI   R0 = 50            1
   MOVA   R1 = A             1
   MOVA   R2 = B             1
   MOVA   R3 = C             1
X: LD     R4 = MEM[R1++]     11   ; autoincrement addressing
   LD     R5 = MEM[R2++]     11
   ADD    R6 = R4 + R5       4
   SHFR   R7 = R6 >> 1       1
   ST     MEM[R3++] = R7     11
   DECBNZ --R0, X            2    ; decrement and branch if not zero

304 dynamic instructions
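The 304 figure follows directly from the listing above: four setup instructions, then a six-instruction loop body executed 50 times.

```python
# Checking the slide's dynamic instruction count for the scalar listing.
setup = 4                 # MOVI, MOVA, MOVA, MOVA
loop_body = 6             # LD, LD, ADD, SHFR, ST, DECBNZ
iterations = 50
print(setup + loop_body * iterations)   # 304 dynamic instructions
```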
Slide 16: VECTOR PROCESSORS
- A vector is a one-dimensional array of numbers
- Many scientific/commercial programs use vectors:
    for (i = 0; i <= 49; i++) C[i] = (A[i] + B[i]) / 2
- A vector processor is one whose instructions operate on vectors rather than scalar (single-data) values
- Basic requirements:
  - Need to load/store vectors: vector registers (contain vectors)
  - Need to operate on vectors of different lengths: vector length register (VLEN)
  - Elements of a vector might be stored apart from each other in memory: vector stride register (VSTR)
    - Stride: the distance between two elements of a vector
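What a strided vector load does can be sketched in a few lines. This is an illustration only: the toy memory layout is made up, while the parameter names mirror the VLEN and VSTR registers above.

```python
# Sketch of a strided vector load (VLD): gather vlen elements that sit
# vstr words apart in memory into one dense vector register.

def vld(mem, base, vlen, vstr):
    """Load vlen elements starting at address base, vstr words apart."""
    return [mem[base + i * vstr] for i in range(vlen)]

mem = list(range(100))        # toy word-addressable memory
print(vld(mem, 0, 4, 1))      # unit stride: [0, 1, 2, 3]
print(vld(mem, 0, 4, 10))     # e.g. a column of a 10-wide matrix: [0, 10, 20, 30]
```

A stride of 1 walks consecutive words; a larger stride picks up, for example, one column of a row-major matrix.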
Slide 17: VECTOR CODE EXAMPLE
- A loop is vectorizable if each iteration is independent of any other
For i = 0 to 49: C[i] = (A[i] + B[i]) / 2

Vectorized loop (cycles per instruction on the right):
   MOVI  VLEN = 50         1
   MOVI  VSTR = 1          1
   VLD   V0 = A            11 + VLEN - 1
   VLD   V1 = B            11 + VLEN - 1
   VADD  V2 = V0 + V1      4 + VLEN - 1
   VSHFR V3 = V2 >> 1      1 + VLEN - 1
   VST   C = V3            11 + VLEN - 1

7 dynamic instructions
Slide 18: SCALAR CODE EXECUTION TIME
- Scalar execution time on an in-order processor with 1 bank
  - The first two loads in the loop cannot be pipelined: 2*11 cycles
  - 4 + 50*40 = 2004 cycles
- Scalar execution time on an in-order processor with 16 banks (word-interleaved)
  - The first two loads in the loop can be pipelined
  - 4 + 50*30 = 1504 cycles
- Why 16 banks?
  - 11-cycle memory access latency
  - Having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency
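The cycle counts above follow from the per-instruction latencies in the scalar listing. The 10-cycle saving in the banked case is an inference from the slide's totals: the second load overlaps the first and so adds only 1 cycle instead of 11.

```python
# Reproducing the slide's cycle counts from the scalar listing's latencies
# (LD 11, ADD 4, SHFR 1, ST 11, DECBNZ 2; four 1-cycle setup instructions).
setup = 4                                   # MOVI + three MOVAs
body_1_bank = 11 + 11 + 4 + 1 + 11 + 2      # loads serialize: 40 cycles
print(setup + 50 * body_1_bank)             # 2004 cycles

# With 16 word-interleaved banks, the second load overlaps the first,
# adding only 1 cycle instead of 11: 40 - 10 = 30 cycles per iteration.
body_16_banks = body_1_bank - 10
print(setup + 50 * body_16_banks)           # 1504 cycles
```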
Slide 19: VECTOR CODE EXECUTION TIME
- No chaining
  - i.e., the output of a vector functional unit cannot be used as the input of another (i.e., no vector data forwarding)
- 16 memory banks (word-interleaved)
- 285 cycles
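One breakdown that reproduces the 285-cycle figure from the per-instruction latencies in the vector listing is shown below. This is an inference, not taken from the slide, which gives only the total: it assumes the two VLDs serialize (a single memory pipeline) and that, with no chaining, each dependent operation waits for its producer to finish completely.

```python
# A plausible accounting for the slide's 285-cycle total, with VLEN = 50,
# assuming serialized loads and fully serialized dependent operations.
VLEN = 50
cycles = (1 + 1                 # MOVI VLEN, MOVI VSTR
          + (11 + VLEN - 1)     # VLD  V0 = A        (60)
          + (11 + VLEN - 1)     # VLD  V1 = B        (60)
          + (4 + VLEN - 1)      # VADD V2 = V0 + V1  (53)
          + (1 + VLEN - 1)      # VSHFR              (50)
          + (11 + VLEN - 1))    # VST  C = V3        (60)
print(cycles)                   # 285
```

Either way, 285 cycles for 50 elements is a large improvement over the 1504-cycle banked scalar version, despite the handicap of no chaining.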
Slide 20: VECTOR PROCESSOR DISADVANTAGES
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
      -- How about searching for a key in a linked list?

Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983.
Slide 21: VECTOR PROCESSOR LIMITATIONS
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. the compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks
Slide 22: FURTHER READING: SIMD
Recommended:
- H&P, Appendix on Vector Processors
- Russell, "The CRAY-1 Computer System," CACM 1978.
Slide 23: VECTOR MACHINE EXAMPLE: CRAY-1
- Russell, "The CRAY-1 Computer System," CACM 1978.
- Scalar and vector modes
- 8 64-element vector registers
  - 64 bits per element
- 16 memory banks
- 8 64-bit scalar registers
- 8 24-bit address registers
Slide 24: AMDAHL'S LAW: BOTTLENECK ANALYSIS
- Speedup = time_without_enhancement / time_with_enhancement
- Suppose an enhancement speeds up a fraction f of a task by a factor of S:

    time_enhanced   = time_original * (1 - f) + time_original * (f / S)
    Speedup_overall = 1 / ((1 - f) + f / S)

- Picture: the original time splits into an unaffected fraction (1 - f) and an enhanced fraction f; after the enhancement, the f portion shrinks to f/S while the (1 - f) portion is unchanged.
- Focus on bottlenecks with large f (and large S)
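The formula above is easy to turn into a two-line function, and the example values below show why the unenhanced fraction dominates.

```python
# Amdahl's Law, exactly as in the Speedup_overall formula above.
def speedup(f, s):
    """f: fraction of original time that is enhanced; s: its speedup factor."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up 90% of the time by 10x yields only about 5.3x overall:
print(round(speedup(0.9, 10), 2))    # 5.26
# Even an effectively infinite speedup of that 90% is capped at 1/(1-f) = 10x:
print(round(speedup(0.9, 1e9), 2))   # 10.0
```

The second call is the "large f" lesson on the slide: once S is big, further effort on the same fraction buys nothing, and the bottleneck moves to the remaining (1 - f).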
Slide 25: FLYNN'S TAXONOMY OF COMPUTERS
- Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
- SISD: single instruction operates on a single data element
- SIMD: single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: multiple instructions operate on a single data element
  - Closest forms: systolic array processor, streaming processor
- MIMD: multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor
Slide 26: SYSTOLIC ARRAYS
Slide 27: WHY SYSTOLIC ARCHITECTURES?
- Idea: data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
- Similar to an assembly line of processing elements
  - Different people work on the same car
  - Many cars are assembled simultaneously
  - Can be two-dimensional
- Why? Special-purpose accelerators/architectures need:
  - Simple, regular design (keep the number of unique parts small and the structure regular)
  - High concurrency → high performance
  - Balanced computation and I/O (memory) bandwidth
Slide 28: SYSTOLIC ARRAYS
- H. T. Kung, "Why Systolic Architectures?," IEEE Computer 1982.
- Analogy: the memory is the heart, the PEs are the cells
- The memory pulses data through the cells
Slide 29: SYSTOLIC ARCHITECTURES
- Basic principle: replace one PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs
  - Balance computation and memory bandwidth
- Differences from pipelining:
  - These are individual PEs
  - The array structure can be non-linear and multi-dimensional
  - PE connections can be multidirectional (and of different speeds)
  - PEs can have local memory and execute kernels (rather than a piece of the instruction)
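The assembly-line analogy from Slide 27 can be simulated directly. Below is a minimal sketch, not from the lecture: a linear array of PEs through which data items "pulse" one step per cycle, so that one memory access per pulse keeps every PE busy and several items are in flight at once. The three example PE operations are made up for illustration.

```python
# A 1-D systolic array sketch: each cycle, every item shifts one PE to the
# right (the memory "pulse") and every PE applies its local operation to
# the item it currently holds, so many items are processed concurrently.

def simulate(pes, stream):
    """pes: one function per PE; stream: items entering from memory."""
    n = len(pes)
    slots = [None] * n                       # item currently held by each PE
    out, inputs = [], list(stream)
    for _ in range(len(inputs) + n):         # enough cycles to drain the array
        last = slots[-1]                     # item leaving the last PE
        slots = [inputs.pop(0) if inputs else None] + slots[:-1]
        if last is not None:
            out.append(last)                 # write result back to memory
        # every PE works on its current item in the same cycle
        slots = [f(x) if x is not None else None
                 for f, x in zip(pes, slots)]
    return out

# Three PEs acting like stations on an assembly line: scale, offset, clamp.
pes = [lambda x: x * 2, lambda x: x + 1, lambda x: min(x, 10)]
print(simulate(pes, [1, 2, 3, 4, 5]))   # [3, 5, 7, 9, 10]
```

Note how the throughput is one result per cycle once the array fills, even though each item takes three cycles end to end; that latency/throughput split is the point of the rhythmic data movement.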
Slide 30: SYSTOLIC ARRAYS: PROS AND CONS
+ Advantage: specialized; the computation fits the PE organization/functions
  → improved efficiency, simple design, high concurrency/performance
  → can do more with less memory bandwidth
-- Downside: specialized; not generally applicable, because the computation needs to fit the PE functions/organization
Slide 31: AGENDA
- Project Proposal and Ideas
- Review from last lecture
- Fundamental concepts
  - Computing models
  - ISA tradeoffs
Slide 32: LEVELS OF TRANSFORMATION
- ISA: the agreed-upon interface between software and hardware
  - SW/compiler assumes, HW promises
  - What the software writer needs to know to write system/user programs
- Microarchitecture: a specific implementation of an ISA
  - Not visible to the software
- Microprocessor: ISA, uarch, circuits
- "Architecture" = ISA + microarchitecture

Levels of transformation: Problem → Algorithm → Program/Language → ISA → Microarchitecture → Logic → Circuits
Slide 33: ISA VS. MICROARCHITECTURE
- What is part of the ISA vs. the uarch?
  - Gas pedal: the interface for "acceleration"
  - Internals of the engine: the implementation of "acceleration"
- Add instruction vs. adder implementation
- The implementation (uarch) can vary as long as it satisfies the specification (ISA)
  - Bit-serial, ripple-carry, carry-lookahead adders
  - The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, ...
- The uarch usually changes faster than the ISA
  - Few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs
  - Why?
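The adder example above can be made concrete. Below is an illustrative Python model (not hardware, and not from the slides) of a ripple-carry adder: one of several microarchitectures that all satisfy the same architectural ADD contract, differing only in how (and how fast) they compute it.

```python
# The ISA specifies only the contract: ADD produces a + b (mod 2**width).
# A ripple-carry adder is one microarchitectural implementation of it,
# computing one sum bit per position while the carry ripples upward.

def ripple_carry_add(a, b, width=8):
    """Add two unsigned integers bit by bit, like a chain of full adders."""
    result, carry = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry                      # sum bit of a full adder
        carry = (ai & bi) | (carry & (ai ^ bi))  # carry out of a full adder
        result |= s << i
    return result                                # wraps modulo 2**width

# Architecturally indistinguishable from any other adder implementation:
assert ripple_carry_add(100, 57) == (100 + 57) % 256
print(ripple_carry_add(100, 57))   # 157
```

A carry-lookahead design would produce the identical result with a shorter critical path; software cannot tell them apart, which is exactly the ISA/uarch split the slide describes.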
Slide 34: ISA
- Instructions
  - Opcodes, addressing modes, data types
  - Instruction types and formats
  - Registers, condition codes
- Memory
  - Address space, addressability, alignment
  - Virtual memory management
- Call, interrupt/exception handling
- Access control, priority/privilege
- I/O
- Task management
- Power and thermal management
- Multi-threading support, multiprocessor support
Slide 35: MICROARCHITECTURE
- Implementation of the ISA under specific design constraints and goals
- Anything done in hardware without exposure to software
  - Pipelining
  - In-order versus out-of-order instruction execution
  - Memory access scheduling policy
  - Speculative execution
  - Superscalar processing (multiple instruction issue?)
  - Clock gating
  - Caching? Levels, size, associativity, replacement policy
  - Prefetching?
  - Voltage/frequency scaling?
  - Error correction?
Slide 36: DESIGN POINT
- A set of design considerations and their importance leads to tradeoffs in both the ISA and the uarch
- Considerations:
  - Cost
  - Performance
  - Maximum power consumption
  - Energy consumption (battery life)
  - Availability
  - Reliability and correctness (or is it?)
  - Time to market
- The design point is determined by the "Problem" space (the application space)

Levels of transformation: Problem → Algorithm → Program/Language → ISA → Microarchitecture → Logic → Circuits
Slide 37: TRADEOFFS: SOUL OF COMPUTER ARCHITECTURE
- ISA-level tradeoffs
- Uarch-level tradeoffs
- System- and task-level tradeoffs
- How to divide the labor between hardware and software
Slide 38: ISA-LEVEL TRADEOFFS: SEMANTIC GAP
- Where to place the ISA? The semantic gap
  - Closer to a high-level language (HLL) or closer to hardware control signals?
- Complex vs. simple instructions
  - RISC vs. CISC vs. HLL machines
  - FFT, QUICKSORT, POLY, FP instructions?
  - VAX INDEX instruction (array access with bounds checking)
- Tradeoffs:
  - Simple compiler, complex hardware vs. complex compiler, simple hardware
  - Caveat: translation (indirection) can change the tradeoff!
  - Burden of backward compatibility
  - Performance?
    - Optimization opportunity, e.g. the VAX INDEX instruction: who (compiler vs. hardware) puts more effort into optimization?
  - Instruction size, code size