
Slide 1: Parallel Architectures & Performance Analysis

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Parallel computer: a multiple-processor system supporting parallel programming. There are three principal types of architecture:

- Vector computers, in particular processor arrays
- Shared memory multiprocessors: specially designed and manufactured systems
- Distributed memory multicomputers: message-passing systems readily formed from a cluster of workstations

Parallel Architectures and Performance Analysis – Slide 2: Parallel Computers

Vector computer: a machine whose instruction set includes operations on vectors as well as scalars. There are two ways to implement vector computers:

- Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units
- Processor array: many identical, synchronized arithmetic processing elements

Slide 3: Type 1: Vector Computers

A natural way to extend the single-processor model is to connect multiple processors to multiple memory modules such that each processor can access any memory module, the so-called shared memory configuration:

Slide 4: Type 2: Shared Memory Multiprocessor Systems

Slide 5: Ex: Quad Pentium Shared Memory Multiprocessor
[Diagram of a quad Pentium shared memory multiprocessor.]

Type 2: Distributed Multiprocessor. Distribute primary memory among processors to:

- Increase aggregate memory bandwidth and lower average memory access time
- Allow a greater number of processors

Also called a non-uniform memory access (NUMA) multiprocessor.

Slide 6: Fundamental Types of Shared Memory Multiprocessor

Slide 7: Distributed Multiprocessor
[Diagram of a distributed multiprocessor.]

Complete computers connected through an interconnection network

Slide 8: Type 3: Message-Passing Multicomputers

A distributed memory multiple-CPU computer: the same address on different processors refers to different physical memory locations. Processors interact through message passing.

- Commercial multicomputers
- Commodity clusters

Slide 9: Multicomputers

Slide 10: Asymmetrical Multicomputer
[Diagram of an asymmetrical multicomputer.]

Slide 11: Symmetrical Multicomputer
[Diagram of a symmetrical multicomputer.]

Slide 12: ParPar Cluster: A Mixed Model
[Diagram of the ParPar cluster, a mixed model.]

Michael Flynn (1966) created a classification for computer architectures based on their characteristics, specifically instruction streams and data streams. Also important are the number of processors, the number of programs that can be executed, and the memory structure.

Slide 13: Alternate System: Flynn’s Taxonomy

Slide 14: Flynn’s Taxonomy: SISD (cont.)
[Diagram: a single control unit sends control signals and an instruction stream to one arithmetic processor, which reads a single data stream from memory and writes results back.]

Slide 15: Flynn’s Taxonomy: SIMD (cont.)
[Diagram: one control unit broadcasts a single control signal to processing elements PE 1 through PE n, each of which operates on its own data stream (data streams 1 through n).]

Slide 16: Flynn’s Taxonomy: MISD (cont.)
[Diagram: control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, all of which operate on a single shared data stream.]

Slide 17: MISD Architectures (cont.)
Serial execution of two processes with 4 stages each: the time to execute is T = 8t, where t is the time to execute one stage. Pipelined execution of the same two processes: T = 5t.
[Diagram: stages S1–S4 of the two processes, first executed back to back, then overlapped in a pipeline.]
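The pipeline timing above generalizes: with s stages of t time units each, n processes finish in (s + n - 1)t rather than the serial n * s * t, because each later process trails its predecessor by one stage. A small sketch (function names are illustrative, not from the slides):

```python
# Pipeline timing sketch: n processes of s stages each, t time units per stage.
# Serial execution runs the processes back to back; pipelined execution
# overlaps them, so after the first process fills the pipeline each later
# process finishes just one stage (t time units) behind its predecessor.

def serial_time(processes: int, stages: int, t: float) -> float:
    return processes * stages * t

def pipelined_time(processes: int, stages: int, t: float) -> float:
    return (stages + processes - 1) * t

# The slide's example: two processes with 4 stages each, t = 1.
print(serial_time(2, 4, 1))     # 8 time units
print(pipelined_time(2, 4, 1))  # 5 time units
```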

Slide 18: Flynn’s Taxonomy: MIMD (cont.)
[Diagram: control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, each of which operates on its own data stream (data streams 1 through n).]

Multiple Program Multiple Data (MPMD) structure: within the MIMD classification, with which we are concerned, each processor has its own program to execute.

Slide 19: Two MIMD Structures: MPMD

Single Program Multiple Data (SPMD) structure: a single source program is written, and each processor executes its personal copy of this program, although independently and not in synchrony. The source program can be constructed so that parts of the program are executed by certain computers and not others, depending on the identity of the computer. SPMD is the software equivalent of SIMD; it can perform SIMD calculations on MIMD hardware.

Slide 20: Two MIMD Structures: SPMD
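A minimal sketch of the SPMD idea: every "processor" runs the same program but branches on its own identity (rank). Real SPMD codes would use MPI; here the ranks are simulated with a plain loop, and all names (NUM_PROCS, spmd_program, the coordinator/worker roles) are invented for illustration:

```python
# Toy SPMD sketch: one program, executed by every simulated processor,
# branching on rank. Rank 0 acts as coordinator; the others each work on
# their own slice of the data.
NUM_PROCS = 4

def spmd_program(rank, data):
    """The single program every processor executes."""
    if rank == 0:
        # Coordinator: computes the reference total.
        return ("coordinator", sum(data))
    # Workers: each sums a disjoint strided slice of the data.
    chunk = data[rank - 1::NUM_PROCS - 1]
    return ("worker", sum(chunk))

data = list(range(10))
results = [spmd_program(r, data) for r in range(NUM_PROCS)]
```

The worker slices partition the data, so the workers' partial sums add up to the coordinator's total.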

Architectures:
- Vector computers
- Shared memory multiprocessors: tightly coupled
  - Centralized/symmetrical multiprocessor (SMP): UMA
  - Distributed multiprocessor: NUMA
- Distributed memory/message-passing multicomputers: loosely coupled
  - Asymmetrical vs. symmetrical

Flynn’s Taxonomy: SISD, SIMD, MISD, MIMD (MPMD, SPMD)

Slide 21: Topic 1 Summary

A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input. The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of the parallel computer and the number of available processing elements.

Slide 22: Topic 2: Performance Measures and Analysis

The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel. The speedup factor of a parallel computation utilizing p processors is defined as the ratio

S(p) = t_s / t_p,

where t_s is the sequential processing time and t_p is the parallel processing time. In other words, S(p) is the ratio of the sequential processing time to the parallel processing time.

Slide 23: Speedup Factor
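As a quick illustration of the definition (the run times below are made up, not from the slides):

```python
# Speedup factor S(p) = t_s / t_p, computed from measured run times.

def speedup(t_seq: float, t_par: float) -> float:
    """Ratio of sequential processing time to parallel processing time."""
    return t_seq / t_par

t_seq = 64.0   # hypothetical one-processor run time, in seconds
t_par = 10.0   # hypothetical run time on p = 8 processors
print(speedup(t_seq, t_par))  # 6.4: sublinear, below the ideal S(8) = 8
```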

Speedup factor can also be cast in terms of computational steps:

S(p) = (number of computational steps using one processor) / (number of parallel computational steps using p processors).

Maximum speedup is (usually) p with p processors (linear speedup).

Slide 24: Speedup Factor (cont.)

Given a problem of size n on p processors, let

- σ(n) denote the inherently sequential computations,
- φ(n) the potentially parallel computations, and
- κ(n, p) the communication operations.

Then the speedup is bounded by

ψ(n, p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n, p)).

Slide 25: Execution Time Components
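The bound above can be evaluated numerically. A short sketch; the component values plugged in are invented for illustration, only the formula comes from the slides:

```python
# Speedup bound from execution-time components:
#   sigma: inherently sequential time, phi: parallelizable time,
#   kappa: communication time, p: number of processors.

def speedup_bound(sigma: float, phi: float, kappa: float, p: int) -> float:
    return (sigma + phi) / (sigma + phi / p + kappa)

# Example: 10 s sequential, 990 s parallelizable, 5 s communication, p = 8.
print(speedup_bound(10.0, 990.0, 5.0, 8))  # about 7.2, below the ideal 8
```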

Slide 26: Speedup Plot
[Plot: speedup versus number of processors; the curve rises and then levels off, “elbowing out” as overhead grows.]

The efficiency of a parallel computation is defined as the ratio between the speedup factor and the number of processing elements in a parallel system:

E = S(p) / p.

Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation.

Slide 27: Efficiency

Since E = S(p)/p, by what we did earlier

E ≤ (σ(n) + φ(n)) / (p σ(n) + φ(n) + p κ(n, p)).

Since all terms are positive, E > 0. Furthermore, since the denominator is larger than the numerator, E < 1.

Slide 28: Analysis of Efficiency
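A quick numerical check of the efficiency bound and its range (component values are illustrative):

```python
# Efficiency E = S(p)/p expressed directly in the execution-time
# components: E = (sigma + phi) / (p*sigma + phi + p*kappa).

def efficiency(sigma: float, phi: float, kappa: float, p: int) -> float:
    return (sigma + phi) / (p * sigma + phi + p * kappa)

e = efficiency(10.0, 990.0, 5.0, 8)
print(e)            # a fraction strictly between 0 and 1
assert 0 < e < 1    # the bounds derived on the slide
```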

Slide 29: Maximum Speedup: Amdahl’s Law
[Diagram illustrating the maximum speedup achievable when part of the computation is inherently sequential.]

As before,

ψ(n, p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n, p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p),

since the communication time is non-negative. Let f = σ(n) / (σ(n) + φ(n)) represent the inherently sequential portion of the computation; then

S(p) ≤ 1 / (f + (1 - f)/p).

Slide 30: Amdahl’s Law (cont.)
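Amdahl's bound is easy to explore numerically; the sequential fraction used below is a made-up example:

```python
# Amdahl's Law: with inherently sequential fraction f, the maximum
# speedup on p processors is S(p) <= 1 / (f + (1 - f) / p).

def amdahl_max_speedup(f: float, p: int) -> float:
    return 1.0 / (f + (1.0 - f) / p)

# With f = 5% sequential work, no number of processors can beat 1/f = 20.
print(amdahl_max_speedup(0.05, 8))      # about 5.9
print(amdahl_max_speedup(0.05, 1000))   # about 19.6, approaching 20
```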

Limitations:
- Ignores communication time
- Overestimates the speedup achievable

Amdahl Effect: typically κ(n, p) has lower complexity than φ(n)/p, so as n increases, φ(n)/p dominates κ(n, p). Thus as n increases, speedup increases.

Slide 31: Amdahl’s Law (cont.)
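The Amdahl Effect can be demonstrated with toy component functions; the particular choices below (φ(n) = n², κ(n, p) = n) are invented to satisfy the "lower complexity" assumption, not taken from the slides:

```python
# Amdahl Effect sketch: communication kappa(n, p) grows more slowly in n
# than phi(n)/p, so the speedup bound improves as problem size n grows.

def psi(n: int, p: int) -> float:
    sigma = 100.0        # fixed inherently sequential time
    phi = float(n * n)   # parallelizable work grows as n^2
    kappa = float(n)     # communication grows only linearly in n
    return (sigma + phi) / (sigma + phi / p + kappa)

speedups = [psi(n, 8) for n in (100, 1000, 10000)]
assert speedups == sorted(speedups)  # speedup increases with n
```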

As before,

ψ(n, p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n, p)).

Let s represent the fraction of time spent in the parallel computation performing inherently sequential operations,

s = σ(n) / (σ(n) + φ(n)/p).

Slide 32: Gustafson-Barsis’ Law

Then the scaled speedup is

ψ(n, p) ≤ p + (1 - p) s.

Slide 33: Gustafson-Barsis’ Law (cont.)
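A small numerical illustration of scaled speedup (the 5% sequential fraction is a made-up example):

```python
# Gustafson-Barsis' Law: with s the sequential fraction of the *parallel*
# execution time, the scaled speedup is S(p) <= p + (1 - p) * s.

def gustafson_scaled_speedup(s: float, p: int) -> float:
    return p + (1 - p) * s

# 5% of the parallel run time spent in sequential code, on 8 processors:
print(gustafson_scaled_speedup(0.05, 8))  # 7.65
```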

- Begin with parallel execution time instead of sequential time
- Estimate the sequential execution time to solve the same problem
- Problem size is an increasing function of p
- Predicts scaled speedup

Slide 34: Gustafson-Barsis’ Law (cont.)

Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time, so both overestimate the speedup or scaled speedup achievable.

[Photos: Gene Amdahl, John L. Gustafson]

Slide 35: Limitations

Performance terms: speedup, efficiency.
Model of speedup: serial, parallel and communication components.
What prevents linear speedup?
- Serial and communication operations
- Process start-up
- Imbalanced workloads
- Architectural limitations
Analyzing parallel performance:
- Amdahl’s Law
- Gustafson-Barsis’ Law

Slide 36: Topic 2 Summary

Based on original material from:
- The University of Akron: Tim O’Neil, Kathy Liszka
- Hiram College: Irena Lomonosov
- The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
- Oregon State University: Michael Quinn

Revision history: last updated 7/28/2011.

Slide 37: End Credits