Presentation Transcript

Slide 1

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 3

Parallel Architectures and Performance Analysis

Slide 2

The conventional von Neumann architecture consists of a processor executing a program stored in a (main) memory.

Each main memory location is identified by its address. Addresses start at zero and extend to 2^n – 1 when there are n bits (binary digits) in the address.

Topic 1: Parallel Architectures

Slide 3

Parallel computer: a multiple-processor system supporting parallel programming. Three principal types of architecture:

Vector computers, in particular processor arrays
Shared memory multiprocessors: specially designed and manufactured systems
Distributed memory multicomputers: message-passing systems readily formed from a cluster of workstations

Parallel Computers

Slide 4

Vector computer: the instruction set includes operations on vectors as well as scalars. Two ways to implement vector computers:

Pipelined vector processor (e.g. the Cray): streams data through pipelined arithmetic units
Processor array: many identical, synchronized arithmetic processing elements

Type 1: Vector Computers

Slide 5

Historically, high cost of a control unit
Scientific applications have data parallelism

Why Processor Arrays?

Slide 6

Front-end computer (standard uniprocessor): holds the program and the data that is manipulated sequentially.
Processor array (individual processor/memory pairs): holds the data that is manipulated in parallel.
Performance depends on the speed of the processing elements, the utilization of the processing elements, and the size of the data structure.

Data/Instruction Storage

Slide 7

Each VLSI chip has 16 processing elements

2-D Processor Interconnection Network

Slide 8

Not all problems are data parallel
Speed drops for conditionally executed code
Do not adapt to multiple users well
Do not scale down well to “starter” systems
Rely on custom VLSI for processors
Expense of control units has dropped

Processor Array Shortcomings

Slide 9

A natural way to extend the single-processor model: have multiple processors connected to multiple memory modules such that each processor can access any memory module.

This is the so-called shared memory configuration:

Type 2: Shared Memory Multiprocessor Systems

Slide 10

Ex: Quad Pentium Shared Memory Multiprocessor

Slide 11

Any memory location is accessible by any of the processors.

A single address space exists, meaning that each memory location is given a unique address within a single range of addresses.

Generally, shared memory programming is more convenient, although it does require access to shared data to be controlled by the programmer (using critical sections, etc.).

Shared Memory Multiprocessor Systems

Slide 12

Alternatively known as a tightly coupled architecture; there is no local memory associated with the processors.

Such systems avoid three problems of processor arrays:
They can be built from commodity CPUs
They naturally support multiple users
They maintain efficiency in conditionally executed code

Shared Memory Multiprocessor Systems (cont.)

Slide 13

There are several alternatives for programming shared memory multiprocessors:

Using threads (pthreads, Java, …), in which the programmer decomposes the program into individual parallel sequences, each being a thread, and each able to access variables declared outside the threads (see the sketch after this list).

Using a sequential programming language with user-level libraries to declare and access shared variables.
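A minimal sketch of the thread-based alternative, using POSIX threads in C. The shared counter, the mutex, and the thread count are illustrative choices rather than anything from the slides; the point is simply that every thread can touch a variable declared outside the threads, so access is guarded by a critical section.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4                 /* illustrative thread count */

static long shared_sum = 0;           /* variable declared outside the threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    pthread_mutex_lock(&lock);        /* critical section protects the shared data */
    shared_sum += id;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    printf("shared_sum = %ld\n", shared_sum);   /* 0 + 1 + 2 + 3 = 6 */
    return 0;
}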

Shared Memory Multiprocessor Systems (cont.)

Slide 14

Several alternatives for programming shared memory multiprocessors (cont.):

Using a sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism.

Ex: OpenMP, the industry standard: an API for shared-memory systems that supports higher-performance parallel programming of symmetrical multiprocessors.
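A minimal OpenMP sketch in C of this directive-based alternative; the loop, the problem size, and the reduction variable are illustrative assumptions rather than anything from the slides. Compile with an OpenMP-enabled compiler (e.g. cc -fopenmp).

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000;              /* illustrative problem size */
    double sum = 0.0;

    /* Compiler directive declares the parallel loop and how 'sum' is combined. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += (double)i;

    printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}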

Shared Memory Multiprocessor Systems (cont.)

Slide 15

Several alternatives for programming shared memory multiprocessors (cont.):

Using a parallel programming language with syntax for parallelism, in which the compiler creates the appropriate executable code for each processor.

Using a sequential programming language and asking a parallelizing compiler to convert it into parallel executable code.

Neither of these is now common.

Shared Memory Multiprocessor Systems (cont.)

Slide 16

Type 1: Centralized Multiprocessor

A straightforward extension of the uniprocessor: add CPUs to the bus, and all processors share the same primary memory.
Memory access time is the same for all CPUs.
An example of a uniform memory access (UMA) multiprocessor, also called a symmetrical multiprocessor (SMP).

Fundamental Types of Shared Memory Multiprocessor

Slide 17

Centralized Multiprocessor

Slide 18

Private data: items used only by a single processor.
Shared data: values used by multiple processors.
In a centralized multiprocessor, processors communicate via shared data values.

Problems associated with shared data:
Cache coherence: replicating data across multiple caches reduces contention, but how do we ensure that different processors have the same value for the same address?
Synchronization: mutual exclusion and barriers.

Private and Shared Data

Slide 19

Making the main memory of a cluster of computers look as though it is a single memory with a single address space (via hidden message passing).

Shared memory programming techniques can then be used.

Distributed Shared Memory

Slide 20

Type 2: Distributed Multiprocessor

Distribute primary memory among the processors:
Increases the aggregate memory bandwidth and lowers the average memory access time
Allows a greater number of processors
Also called a non-uniform memory access (NUMA) multiprocessor

Fundamental Types of Shared Memory Multiprocessor

Slide 21

Distributed Multiprocessor

Slide 22

Some NUMA multiprocessors do not support cache coherence in hardware; only instructions and private data are kept in cache.
Memory access times have a large variance.
Implementations are more difficult: there is no shared memory bus to “snoop”, so a directory-based protocol is needed.

Cache Coherence

Slide 23

A distributed directory contains information about the cacheable memory blocks, with one directory entry for each cache block. Each entry records:

The sharing status:
Uncached: block not in any processor's cache
Shared: cached by one or more processors; read only
Exclusive: cached by exactly one processor, which has written the block, so the copy in memory is obsolete

Which processors have copies
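To make the entry concrete, here is a hypothetical C sketch of such a directory record; the field names, the fixed processor limit, and the bit-vector representation of the sharer set are assumptions for illustration, not part of any particular machine's protocol.

#include <stdint.h>

#define MAX_PROCESSORS 64            /* assumed upper bound on system size */

/* Sharing status values, as listed on the slide */
enum sharing_status {
    UNCACHED,     /* block not in any processor's cache                    */
    SHARED,       /* cached by one or more processors; read only           */
    EXCLUSIVE     /* cached by exactly one processor, which has written it */
};

/* One directory entry per cache block */
struct directory_entry {
    enum sharing_status status;
    uint64_t sharers;    /* bit i set if processor i holds a copy */
};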

Directory-Based Protocol

Slide 24

Complete computers connected through an interconnection network

Type 3: Message-Passing Multicomputers

Slide 25

A distributed memory, multiple-CPU computer: the same address on different processors refers to different physical memory locations.
Processors interact through message passing.
Commercial multicomputers
Commodity clusters

Multicomputers

Slide 26

An alternate name for message-passing multicomputer systems. Each processor has its own memory, accessible only to that processor.

A message-passing interconnection network provides point-to-point connections among processors. Memory access varies between processors.

Loosely Coupled Architectures

Slide 27

Asymmetrical Multicomputer

Slide 28

Advantages:
Back-end processors dedicated to parallel computations
Easier to understand, model, and tune performance
Only a simple back-end operating system needed
Easy for a vendor to create

Disadvantages:
Front-end computer is a single point of failure
Single front-end computer limits scalability of the system
Primitive operating system in back-end processors makes debugging difficult
Every application requires development of both front-end and back-end programs

Asymmetrical Multicomputer

Slide 29

Symmetrical Multicomputer

Slide 30

Advantages:
Alleviates the performance bottleneck caused by a single front-end computer
Better support for debugging
Every processor executes the same program

Disadvantages:
More difficult to maintain the illusion of a single “parallel computer”
No simple way to balance the program development workload among processors
More difficult to achieve high performance when multiple processes run on each processor

Symmetrical Multicomputer

Slide 31

ParPar Cluster: A Mixed Model

Slide 32

Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams.

Also important are number of processors, number of programs which can be executed, and the memory structure.

Alternate System: Flynn's Taxonomy

Slide 33

Single instruction stream, single data stream (SISD) computer: in a single-processor computer, a single stream of instructions is generated from the program. The instructions operate upon a single stream of data items.

The single CPU executes one instruction at a time and fetches or stores one item of data at a time.

Flynn's Taxonomy: SISD

Slide 34


Flynn’s Taxonomy: SISD (cont.)

[Block diagram: a control unit issues control signals to an arithmetic processor; a single instruction stream and a single data stream pass between memory and the processor, producing results.]

Slide 35

Single instruction stream, multiple data stream (SIMD) computer

A specially designed computer in which a single instruction stream comes from a single program, but multiple data streams exist. The instructions from the program are broadcast to more than one processor.

Each processor executes the same instruction in synchronism, but using different data. SIMD machines were developed because a number of important applications mostly operate upon arrays of data.

Flynn's Taxonomy: SIMD

Slide 36


Flynn’s Taxonomy: SIMD (cont.)

[Block diagram: a single control unit broadcasts one control signal to processing elements PE 1 through PE n, each of which operates on its own data stream (Data Stream 1 through Data Stream n).]

Slide 37

Processing distributed over a large amount of hardware.

Operates concurrently on many different data elements.
Performs the same computation on all data elements.
Processors operate synchronously.
Examples: pipelined vector processors (e.g. the Cray-1) and processor arrays (e.g. the Connection Machine).

SIMD Architectures

Slide 38


SISD vs. SIMD Execution

[Figure: executing a conditional test (a = 0?) followed by statements X1–X4. On an SISD machine the test selects one branch: X1 and X2 on the “yes” path, X3 and X4 on the “no” path. On a SIMD machine all PEs evaluate the test; the PEs satisfying a = 0 execute X1 and X2 while the others are idle, then the PEs satisfying a ≠ 0 execute X3 and X4 while the others are idle, after which all PEs continue together.]

Slide 39

Multiple instruction stream, single data stream (MISD) computer

MISD machines may execute several different programs on the same data item. There are two categories:
Distinct processing units perform distinct instructions on the same data; currently there is no such machine.
Pipelined architectures, where data flows through a series of processing elements.

Flynn's Taxonomy: MISD

Slide 40


Flynn’s Taxonomy: MISD (cont.)

[Block diagram: control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, all of which operate on a single shared data stream.]

Slide 41

A pipeline processor works according to the principle of pipelining: a process can be broken down into several stages (segments). While one stage is executing, another stage is being loaded, and the input of one stage is the output of the previous stage.

The processor carries out many different computations concurrently. Example: a systolic array.

MISD Architectures

Slide 42


MISD Architectures (cont.)

Serial execution of two processes with 4 stages each: the time to execute is T = 8t, where t is the time to execute one stage. Pipelined execution of the same two processes: T = 5t.

[Figure: timing diagram of stages S1–S4 for the two processes, executed back-to-back versus overlapped in a pipeline.]

Slide 43

Multiple instruction stream, multiple data stream (MIMD) computer

A general-purpose multiprocessor system: multiple processors, each with a separate (different) program operating on its own data.

One instruction stream is generated from each program for each processor. Each instruction operates upon different data. Both the shared memory and the message-passing multiprocessors described so far fall into the MIMD classification.

Flynn's Taxonomy: MIMD

Slide 44


Flynn’s Taxonomy: MIMD (cont.)

[Block diagram: control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, each of which operates on its own data stream (Data Stream 1 through Data Stream n).]

Slide 45

Processing distributed over a number of processors operating independently and concurrently.

Resources (memory) are shared among the processors.
Each processor runs its own program.
MIMD systems execute operations in a parallel, asynchronous fashion.

MIMD Architectures

Slide 46

MIMD architectures differ with regard to:
Interconnection networks
Memory addressing techniques
Synchronization
Control structures

A high throughput can be achieved if the processing can be broken into parallel streams keeping all the processors active concurrently.

MIMD Architectures (cont.)

Slide 47

Multiple Program Multiple Data (MPMD) Structure

Within the MIMD classification, which we are concerned with, each processor will have its own program to execute.

Two MIMD Structures: MPMD

Slide 48

Single Program Multiple Data (SPMD) Structure

A single source program is written, and each processor executes its personal copy of this program, although independently and not in synchronism. The source program can be constructed so that parts of the program are executed by certain computers and not others, depending upon the identity of the computer.

Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware.
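A minimal SPMD sketch in C using MPI: every process runs the same source program and branches on its own identity (rank). The choice of MPI and of rank 0 as the distinguished computer is an illustrative assumption, not something prescribed by the slides.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* identity of this copy of the program */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* part of the program executed only by the designated processor */
        printf("coordinator: %d copies of the same program are running\n", size);
    } else {
        /* all other processors take this branch, each working on its own data */
        printf("worker %d doing its share of the work\n", rank);
    }

    MPI_Finalize();
    return 0;
}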

Two MIMD Structures: SPMD

Slide 49

SIMD needs less hardware (only one control unit); in MIMD each processor has its own control unit.
SIMD needs less memory than MIMD (SIMD needs only one copy of the instructions); in MIMD the program and operating system need to be stored at each processor.
SIMD has implicit synchronization of the PEs; in contrast, explicit synchronization may be required in MIMD.

SIMD vs. MIMD

Slide 50

MIMD allows different operations to be performed on different processing elements simultaneously (functional parallelism). SIMD is limited to data parallelism.

For MIMD it is possible to use a general-purpose microprocessor as the processing unit; such a processor may be cheaper and more powerful.

SIMD vs. MIMD (cont.)

Slide 51

Time to execute a sequence of instructions in which the execution time is data dependent is less for MIMD than for SIMD.

MIMD allows each instruction to execute independently; in SIMD each processing element must wait until all the others have finished the execution of one instruction. Thus, taking the maximum over the processing elements,

T(MIMD) = MAX { t1 + t2 + … + tn }
T(SIMD) = MAX { t1 } + MAX { t2 } + … + MAX { tn }

so T(MIMD) ≤ T(SIMD).

SIMD vs. MIMD (cont.)

Slide 52

In MIMD each processing element can independently follow either path when executing an if-then-else statement; this requires two phases on SIMD.

MIMD can operate in SIMD mode.

SIMD vs. MIMD (cont.)

Slide 53

Architectures
Vector computers
Shared memory multiprocessors: tightly coupled
Centralized/symmetrical multiprocessor (SMP): UMA
Distributed multiprocessor: NUMA
Distributed memory/message-passing multicomputers: loosely coupled
Asymmetrical vs. symmetrical

Flynn's Taxonomy
SISD, SIMD, MISD, MIMD (MPMD, SPMD)

Topic 1 Summary

Slide 54

A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input.

The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements.

Topic 2: Performance Measures and Analysis

Slide 55

The degree of parallelism is a measure of the number of operations that an algorithm can perform in parallel for a problem of size W, and it is independent of the parallel architecture.

If P(W) is the degree of parallelism of a parallel algorithm, then for a problem of size W no more than P(W) processors can be employed effectively.

We want to be able to do two things: predict the performance of parallel programs, and understand the barriers to higher performance.

Performance Measures and Analysis (cont.)

Slide 56

General speedup formula
Amdahl's Law: decide if a program merits parallelization
Gustafson-Barsis' Law: evaluate the performance of a parallel program

Topic 2 Outline

Slide 57

The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel.

The speedup factor S(p) of a parallel computation utilizing p processors is defined as the following ratio: the sequential processing time divided by the parallel processing time.
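The ratio itself appeared as an image on the original slide; written out, with t_s and t_p denoting the sequential and parallel execution times (a notation assumed here):

S(p) = \frac{t_s}{t_p}
     = \frac{\text{execution time using one processor}}{\text{execution time using } p \text{ processors}}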

Speedup Factor

Slide 58

The speedup factor can also be cast in terms of computational steps: the number of computational steps using one processor divided by the number of parallel computational steps using p processors.

The maximum speedup is (usually) p with p processors (linear speedup).

Speedup Factor (cont.)

Slide 59

It is assumed that the processor used in the parallel computation is identical to the one used by the sequential algorithm.

S(p) gives the increase in speed obtained by using a multiprocessor.

The underlying algorithm for the parallel implementation might be (and usually is) different.

Speedup Factor (cont.)

Slide 60

The sequential algorithm has to be the best algorithm known for a particular computation problem.

This means that it is fair to judge the performance of parallel computation with respect to the fastest sequential algorithm for solving the same problem in a single processor architecture.

Several issues such as synchronization and communication are involved in the parallel computation.

Speedup Factor (cont.)

Slide 61

Given a problem of size n on p processors, let
σ(n) denote the inherently sequential computations,
φ(n) the potentially parallel computations, and
κ(n,p) the communication operations.

Then:
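The bound that followed “Then:” was an image on the slide; a reconstruction consistent with the notation just introduced, writing the speedup as ψ(n,p):

\psi(n,p) \;\le\; \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)}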

Execution Time Components

Slide 62

Speedup Plot

[Plot: speedup versus the number of processors; the curve “elbows out” as the processor count grows.]

Slide 63

The efficiency of a parallel computation is defined as a ratio between the speedup factor and the number of processing elements in a parallel system:

Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation.
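The defining ratio was shown as an image; written out using the speedup factor S(p) from before:

E(p) = \frac{S(p)}{p} = \frac{t_s}{p\, t_p}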

Efficiency

Slide 64

In an ideal parallel system the speedup factor is equal to p and the efficiency is equal to one. In practice ideal behavior is not achieved, since processors cannot devote 100 percent of their time to the computation.

Every parallel program has overhead factors such as creating processes, process synchronization and communication. In practice the efficiency is between zero and one, depending on the degree of effectiveness with which the processing elements are utilized.

Efficiency (cont.)

Slide 65

Since E = S(p)/p, by what we did earlier

E ≤ (σ(n) + φ(n)) / (p·σ(n) + φ(n) + p·κ(n,p))

Since all terms are positive, E > 0. Furthermore, since the denominator is larger than the numerator, E < 1.

Analysis of Efficiency

Slide 66

Consider the problem of adding n numbers on a p-processor system. Initial brute-force approach: all tasks send their values to one processor, which adds them all up.

Example: Reduction

Slide 67

Parallel algorithm: find the global sum by using a binomial tree.
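A hedged C/MPI sketch of the binomial-tree idea: each process first forms a local partial sum, then the p partial sums are combined in about log p pairwise steps. The local data (here just rank + 1) and the use of MPI point-to-point calls are illustrative assumptions; the slides do not prescribe an implementation.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Step 1: each processor adds its n/p local numbers (stand-in value here). */
    long partial = rank + 1;

    /* Step 2: combine the p partial sums in log p pairwise steps (binomial tree). */
    for (int step = 1; step < p; step *= 2) {
        if (rank % (2 * step) == 0 && rank + step < p) {
            long other;
            MPI_Recv(&other, 1, MPI_LONG, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            partial += other;                    /* one addition per step */
        } else if (rank % (2 * step) == step) {
            MPI_Send(&partial, 1, MPI_LONG, rank - step, 0, MPI_COMM_WORLD);
            break;                               /* this process is done */
        }
    }

    if (rank == 0)
        printf("global sum = %ld\n", partial);   /* 1 + 2 + ... + p */
    MPI_Finalize();
    return 0;
}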

Example: Reduction (cont.)

Slide 68

Assume it takes one unit of time for two directly connected processors to add two numbers and one unit of time to communicate with each other. Adding the n/p numbers locally on each processor takes n/p – 1 units of time.

The p partial sums may be added in log p steps, each consisting of one addition and one communication.

Example: Reduction (cont.)

Slide 69

The total parallel computation time Tp is n/p – 1 + 2 log p. For large values of p and n this can be approximated by Tp = n/p + 2 log p.

The serial computation time can be approximated by Ts = n.

Example: Reduction (cont.)

Slide 70

The expressions for speedup and efficiency follow from these approximations (see below).
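Both expressions appeared as images on the slide; reconstructions based on Ts ≈ n and Tp ≈ n/p + 2 log p as given above:

S(p) \approx \frac{n}{\,n/p + 2\log p\,},
\qquad
E(p) = \frac{S(p)}{p} \approx \frac{n}{\,n + 2p\log p\,}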

Speedup and efficiency can be calculated for any p and n.

Example: Reduction (cont.)

Slide 71

Computational efficiency as a function of n and p.


Example: Reduction (cont.)

Efficiency for problem size n (rows) and number of processors p (columns):

n \ p      1       2       4       8       16      32
  64       1     .980    .930    .815    .623    .399
 192       1     .990    .975    .930    .832    .665
 320       1     .995    .985    .956    .892    .768
 512       1     .995    .990    .972    .930    .841

Slide 72


Example: Reduction (cont.)

[Plot: speedup versus the number of processors (up to about 30) for n = 64, 192, 320 and 512.]

Slide 73

Maximum Speedup: Amdahl's Law

Slide 74

As before, and since the communication time must be non-trivial, dropping κ(n,p) from the denominator can only increase the bound on the speedup.

Let f represent the inherently sequential portion of the computation; then
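The formulas on this slide appeared as images; a reconstruction in the earlier notation, with the communication term dropped and f expressed in terms of σ(n) and φ(n):

\psi(n,p) \;\le\; \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p},
\qquad
f = \frac{\sigma(n)}{\sigma(n) + \varphi(n)}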

Amdahl's Law (cont.)

Slide 75

Substituting f into the bound gives, in short, the maximum speedup factor below, where f is the fraction of the computation that cannot be divided into concurrent tasks.
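The closed form shown on the slide as an image, reconstructed:

S(p) \;\le\; \frac{1}{\,f + (1 - f)/p\,} \;=\; \frac{p}{\,1 + (p - 1)f\,}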

Amdahl's Law (cont.)

Slide 76

Limitations:
Ignores communication time
Overestimates the speedup achievable

Amdahl Effect: typically κ(n,p) has lower complexity than φ(n)/p, so as the problem size n increases, φ(n)/p dominates κ(n,p); thus as n increases, the speedup increases.

Amdahl's Law (cont.)

Slide 77

Even with an infinite number of processors, the maximum speedup is limited to 1/f.

Ex: With only 5% of a computation being serial, the maximum speedup is 20, irrespective of the number of processors.

Speedup against number of processors

Slide 78

So Amdahl's Law:
Treats problem size as a constant
Shows how execution time decreases as the number of processors increases

However, we often use faster computers to solve larger problem instances.

Let’s treat time as a constant and allow the problem size to increase with the number of processors

Amdahl's Law: The Last Word

Slide 79

As before, start from the speedup bound without the communication term.

Let s represent the fraction of time spent in the parallel computation performing inherently sequential operations; then
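The definition that followed was an image; in the earlier notation, s is the sequential fraction of the parallel execution time:

s = \frac{\sigma(n)}{\,\sigma(n) + \varphi(n)/p\,}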

Gustafson-Barsis' Law

Slide 80

Then
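The resulting scaled-speedup bound, shown on the slide as an image and reconstructed from the definition of s above:

\psi(n,p) \;\le\; p + (1 - p)\,s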

Gustafson-Barsis' Law (cont.)

Slide 81

Begin with the parallel execution time instead of the sequential time.
Estimate the sequential execution time to solve the same problem.
Problem size is an increasing function of p.
Predicts scaled speedup.

Gustafson-Barsis' Law (cont.)

Slide 82

An application running on 10 processors spends 3% of its time in serial code. The maximum speedup according to Amdahl's Law, and the scaled speedup according to Gustafson-Barsis' Law, are worked out below.
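The two values appeared as images on the slide; evaluating the formulas above with f = s = 0.03 and p = 10:

S \;\le\; \frac{1}{\,0.03 + (1 - 0.03)/10\,} \approx 7.87,
\qquad
\psi \;\le\; 10 + (1 - 10)(0.03) = 9.73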

Example

Slide 83

Both Amdahl's Law and Gustafson-Barsis' Law ignore communication time; both therefore overestimate the speedup or scaled speedup achievable.

[Photos: Gene Amdahl and John L. Gustafson.]

Limitations

Slide 84

Performance terms: speedup, efficiency

Model of speedup: serial, parallel and communication components

What prevents linear speedup?
Serial and communication operations
Process start-up
Imbalanced workloads
Architectural limitations

Analyzing parallel performance:
Amdahl's Law
Gustafson-Barsis' Law

Topic 2 Summary

Slide 85

Based on original material from:
The University of Akron: Tim O’Neil, Kathy Liszka
Hiram College: Irena Lomonosov
The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
Oregon State University: Michael Quinn

Revision history: last updated 7/28/2011.


End Credits