Parallel Architectures & Performance Analysis

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Parallel Computers

Parallel computer: a multiple-processor system supporting parallel programming. There are three principal types of architecture:

- Vector computers, in particular processor arrays
- Shared memory multiprocessors: specially designed and manufactured systems
- Distributed memory multicomputers: message-passing systems readily formed from a cluster of workstations
Type 1: Vector Computers

Vector computer: a computer whose instruction set includes operations on vectors as well as scalars. There are two ways to implement vector computers:

- Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units
- Processor array: many identical, synchronized arithmetic processing elements
Type 2: Shared Memory Multiprocessor Systems

A natural way to extend the single-processor model is to have multiple processors connected to multiple memory modules, such that each processor can access any memory module: the so-called shared memory configuration.
Ex: Quad Pentium Shared Memory Multiprocessor

[Figure: block diagram of a quad Pentium shared memory multiprocessor.]
Fundamental Types of Shared Memory Multiprocessor

Type 2 (cont.): the distributed multiprocessor distributes primary memory among the processors.

- Increases aggregate memory bandwidth and lowers the average memory access time
- Allows a greater number of processors
- Also called a non-uniform memory access (NUMA) multiprocessor
Distributed Multiprocessor

[Figure: block diagram of a distributed multiprocessor, with a memory module local to each processor.]
Type 3: Message-Passing Multicomputers

Complete computers connected through an interconnection network.
Multicomputers

- Distributed memory multiple-CPU computer
- The same address on different processors refers to different physical memory locations
- Processors interact through message passing
- Examples: commercial multicomputers and commodity clusters
Asymmetrical Multicomputer

[Figure: an asymmetrical multicomputer, with a front-end computer mediating between users and the back-end compute nodes.]

Symmetrical Multicomputer

[Figure: a symmetrical multicomputer, in which all nodes run the same software and participate equally.]

ParPar Cluster: A Mixed Model

[Figure: the ParPar cluster, which mixes features of the asymmetrical and symmetrical designs.]
Alternate System: Flynn’s Taxonomy

Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams. Also important are the number of processors, the number of programs which can be executed, and the memory structure.
Flynn’s Taxonomy: SISD (cont.)

[Figure: SISD organization. A control unit sends control signals to a single arithmetic processor; one instruction stream and one data stream flow between memory and the processor, which writes its results back to memory.]
Flynn’s Taxonomy: SIMD (cont.)

[Figure: SIMD organization. A single control unit broadcasts one control signal to processing elements PE 1 through PE n, each of which operates on its own data stream (data streams 1 through n).]
Flynn’s Taxonomy: MISD (cont.)

[Figure: MISD organization. Control units 1 through n each issue a distinct instruction stream (instruction streams 1 through n) to processing elements 1 through n, all of which operate on a single shared data stream.]
MISD Architectures (cont.)

- Serial execution of two processes with 4 stages each: the time to execute is T = 8t, where t is the time to execute one stage.
- Pipelined execution of the same two processes: T = 5t.

[Figure: space-time diagram of stages S1 through S4 for the two processes, serial versus pipelined; in the pipelined version, process 2 enters stage S1 as soon as process 1 advances to stage S2.]
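In general, under the same assumptions, m processes of k stages each take m·k·t when run serially but only (k + m − 1)·t when pipelined, since a new process enters the first stage every t units. A minimal Python sketch (the function names are mine, not from the slides):

    def serial_time(m, k, t=1.0):
        """Time for m processes of k stages each, run back to back."""
        return m * k * t

    def pipelined_time(m, k, t=1.0):
        """Time when a new process enters the first stage every t units."""
        return (k + m - 1) * t

    # The slide's example: two processes of four stages each.
    print(serial_time(2, 4))     # 8.0, i.e. T = 8t
    print(pipelined_time(2, 4))  # 5.0, i.e. T = 5t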
Flynn’s Taxonomy: MIMD (cont.)

[Figure: MIMD organization. Control units 1 through n each issue their own instruction stream (instruction streams 1 through n) to processing elements 1 through n, and each processing element operates on its own data stream (data streams 1 through n).]
Two MIMD Structures: MPMD

Multiple Program Multiple Data (MPMD) structure: within the MIMD classification, which is the one we are concerned with, each processor has its own program to execute.
Two MIMD Structures: SPMD

Single Program Multiple Data (SPMD) structure: a single source program is written, and each processor executes its personal copy of this program, although independently and not in synchronism. The source program can be constructed so that parts of the program are executed by certain computers and not others, depending upon the identity of the computer.

SPMD is the software equivalent of SIMD; it can perform SIMD calculations on MIMD hardware.
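As an illustration, here is a minimal SPMD sketch using mpi4py (my choice of library; the slides do not prescribe one). Every processor runs the same program and branches on its own identity (rank):

    # Run as e.g.: mpiexec -n 4 python spmd_demo.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()        # this copy's identity

    if rank == 0:
        # Only the processor with rank 0 executes this branch.
        data = [x * x for x in range(comm.Get_size())]
    else:
        data = None

    # All copies reach this line and take part in the collective operation.
    mine = comm.scatter(data, root=0)
    print("rank", rank, "received", mine)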
Topic 1 Summary

Architectures:
- Vector computers
- Shared memory multiprocessors: tightly coupled
  - Centralized/symmetrical multiprocessor (SMP): UMA
  - Distributed multiprocessor: NUMA
- Distributed memory/message-passing multicomputers: loosely coupled
  - Asymmetrical vs. symmetrical

Flynn’s Taxonomy:
- SISD, SIMD, MISD, MIMD (MPMD, SPMD)
Topic 2: Performance Measures and Analysis

A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input. The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of the parallel computer and the number of available processing elements.
Speedup Factor

The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel. The speedup factor of a parallel computation utilizing p processors is defined as the ratio

    S(p) = t_s / t_p,

where t_s is the sequential processing time and t_p is the parallel processing time. In other words, S(p) is defined as the ratio of the sequential processing time to the parallel processing time.
Speedup Factor (cont.)

The speedup factor can also be cast in terms of computational steps:

    S(p) = (steps using one processor) / (parallel steps using p processors).

The maximum speedup is (usually) p with p processors (linear speedup).
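A minimal sketch of the definition in Python (the timings below are invented for illustration):

    def speedup(t_seq, t_par):
        """S(p): ratio of sequential to parallel processing time."""
        return t_seq / t_par

    # Invented timings: a 120 s sequential run finishing in 20 s on p = 8.
    print(speedup(120.0, 20.0))  # 6.0 -- sublinear, below the usual maximum of p = 8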
Execution Time Components

Given a problem of size n on p processors, let

- σ(n) denote the inherently sequential computations,
- φ(n) denote the potentially parallel computations, and
- κ(n,p) denote the communication operations.

Then:

    S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)).
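A minimal sketch of this bound (the component values below are invented for illustration):

    def speedup_bound(sigma, phi, kappa, p):
        """Upper bound on S(p) from the three execution time components."""
        return (sigma + phi) / (sigma + phi / p + kappa)

    # Invented components: sigma = 10, phi = 90, kappa = 5 time units, p = 8.
    print(speedup_bound(10.0, 90.0, 5.0, 8))  # ~3.81, well below p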
Speedup Plot

[Plot: speedup versus number of processors, with the curve “elbowing out”: speedup climbs at first, then levels off and drops as more processors are added.]
Efficiency

The efficiency of a parallel computation is defined as the ratio between the speedup factor and the number of processing elements in a parallel system:

    E = S(p) / p.

Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation.
Analysis of Efficiency

Since E = S(p)/p, by what we did earlier

    E ≤ (σ(n) + φ(n)) / (p·σ(n) + φ(n) + p·κ(n,p)).

Since all terms are positive, E > 0. Furthermore, since the denominator is larger than the numerator, E < 1. Hence 0 < E < 1.
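A minimal sketch, reusing the invented timings from the speedup example:

    def efficiency(t_seq, t_par, p):
        """E = S(p) / p: fraction of time each processor is usefully employed."""
        return (t_seq / t_par) / p

    # Same invented timings as before, on p = 8 processors:
    print(efficiency(120.0, 20.0, 8))  # 0.75, i.e. 0 < E < 1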
Maximum Speedup: Amdahl’s Law

[Figure: maximum speedup under Amdahl’s Law as the number of processors grows, limited by the inherently sequential fraction of the computation.]
Amdahl’s Law (cont.)

As before,

    S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p),

since the communication time must be non-trivial. Let f = σ(n) / (σ(n) + φ(n)) represent the inherently sequential portion of the computation; then

    S(p) ≤ 1 / (f + (1 − f)/p).
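A minimal sketch of the bound (the serial fractions below are invented):

    def amdahl_bound(f, p):
        """Amdahl's Law: maximum speedup with inherently sequential fraction f."""
        return 1.0 / (f + (1.0 - f) / p)

    # With 5% inherently sequential work, no processor count beats 1/f = 20:
    print(amdahl_bound(0.05, 8))     # ~5.93
    print(amdahl_bound(0.05, 1024))  # ~19.64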
Amdahl’s Law (cont.)

Limitations:
- Ignores communication time
- Overestimates the speedup achievable

Amdahl Effect:
- Typically κ(n,p) has lower complexity than φ(n)/p
- So as n increases, φ(n)/p dominates κ(n,p)
- Thus as n increases, the speedup increases (see the numeric sketch below)
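A numeric illustration of the Amdahl Effect under invented complexities (φ(n) = n² and κ(n,p) = n·log₂ p; these functional forms are assumptions, chosen only so that κ grows more slowly than φ(n)/p):

    import math

    def bounded_speedup(n, p, sigma=1000.0):
        """S(p) bound with assumed phi(n) = n**2 and kappa(n,p) = n*log2(p)."""
        phi = float(n * n)
        kappa = n * math.log2(p)
        return (sigma + phi) / (sigma + phi / p + kappa)

    # Fix p = 16 and grow the problem size n: the bound climbs toward p.
    for n in (100, 1000, 10000):
        print(n, round(bounded_speedup(n, 16), 2))  # 5.43, 14.83, 15.9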
Gustafson-Barsis’ Law

As before,

    S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p).

Let s represent the fraction of time spent in the parallel computation performing inherently sequential operations, i.e.

    s = σ(n) / (σ(n) + φ(n)/p).
Gustafson-Barsis’ Law (cont.)

Then

    S(p) ≤ p + (1 − p)·s.
Gustafson-Barsis’ Law (cont.)

- Begin with the parallel execution time instead of the sequential time
- Estimate the sequential execution time to solve the same problem
- Problem size is an increasing function of p
- Predicts scaled speedup (a small sketch follows)
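A minimal sketch of the law (the fraction s below is invented):

    def scaled_speedup(s, p):
        """Gustafson-Barsis: scaled speedup given the fraction s of the
        parallel execution time spent on inherently sequential operations."""
        return p + (1 - p) * s

    # If 2% of the parallel execution time is sequential on p = 64 processors:
    print(scaled_speedup(0.02, 64))  # ~62.74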
Limitations

- Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time
- Both therefore overestimate the speedup or scaled speedup achievable

[Photos: Gene Amdahl and John L. Gustafson.]
Topic 2 Summary

- Performance terms: speedup, efficiency
- Model of speedup: serial, parallel and communication components
- What prevents linear speedup?
  - Serial and communication operations
  - Process start-up
  - Imbalanced workloads
  - Architectural limitations
- Analyzing parallel performance: Amdahl’s Law and Gustafson-Barsis’ Law
End Credits

Based on original material from:
- The University of Akron: Tim O’Neil, Kathy Liszka
- Hiram College: Irena Lomonosov
- The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
- Oregon State University: Michael Quinn

Revision history: last updated 7/28/2011.