Starting Parallel Algorithm Design
David Monismith
Based on notes from Introduction to Parallel Programming, 2nd Edition, by Grama, Gupta, Karypis, and Kumar
Outline
Decomposition
Tasks and Interaction
Load Balancing
Managing Overhead
Parallel Models
Decomposition
Decomposition - dividing a computation into parts that may be executed in parallel
Tasks - programmer-defined units of computation into which the main computation is subdivided
Task-dependency graph - an abstraction used to express dependencies between tasks and their relative order of execution
Granularity
Granularity - number/size of tasks that a computation can be divided into
Fine-grained - the computation is divided into many small tasks
Coarse-grained - the computation is divided into a few large tasks
Degree of concurrency - maximum number of tasks that can be executed in parallel in a program at any time
The average degree of concurrency can be more useful, as it provides a better indication of performance
Example
Matrix-Vector Multiplication
A figure will be drawn on the board
Generally considered fine-grained if parallelized with one task per dot product
Could be considered coarse-grained if, e.g., each task on a dual-core processor computes half of the dot products (see the sketch below)
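As a brief illustration (not from the original slides), a minimal OpenMP sketch of the fine-grained decomposition might look as follows, assuming an N x N matrix A, input vector x, output vector y, and loop indices i and j declared elsewhere:

    /* Fine-grained decomposition: each iteration (one dot product) is a task */
    #pragma omp parallel for private(i, j)
    for (i = 0; i < N; i++) {
        double dot = 0.0;              /* per-thread local accumulator */
        for (j = 0; j < N; j++)
            dot += A[i][j] * x[j];
        y[i] = dot;                    /* y[i] is the dot product of row i and x */
    }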
Task Graphs
Critical path - the longest directed path between a pair of start and finish nodes in the task graph
Critical path length - the sum of the weights of the nodes along a critical path
The weight of a node is the size of the task, i.e., the amount of work associated with it
Aside from these factors, interaction between tasks running on different processors may add to the runtime
An example of a task-dependency graph will be drawn in class to aid in understanding these concepts; a small worked example follows below
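As a small worked example (not from the slides), consider tasks T1 (weight 4), T2 (weight 3), T3 (weight 1), and T4 (weight 2), where T1 feeds both T2 and T3, and both feed T4. The path T1-T2-T4 has length 4 + 3 + 2 = 9 and the path T1-T3-T4 has length 4 + 1 + 2 = 7, so the critical path is T1-T2-T4 with critical path length 9. The total work is 4 + 3 + 1 + 2 = 10, giving

    average degree of concurrency = total work / critical path length = 10/9 ≈ 1.11

while the maximum degree of concurrency is 2 (T2 and T3 may run simultaneously).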
Processes and Threads vs. Processors
Mapping - the mechanism by which tasks are assigned to processes and/or threads for execution
Threads and processes are logical units that perform tasks
Processors physically perform the computations
This distinction is important because we may have multiple stages of computation with different costs
For example, internode communication vs. shared-memory communication
Drawing a task-dependency or task-interaction graph may help us understand how tasks interact with one another and will aid in developing a parallel algorithm
Decomposition Techniques
Embarrassingly parallel tasks
Recursive decomposition
Data decomposition
Exploratory decomposition
Speculative decomposition
Embarrassingly Parallel Tasks
Some tasks lend themselves to direct parallelization
Such tasks are said to be embarrassingly parallel and can be directly mapped to processes or threads
A subset of these types of tasks represent the map pattern
Note that the map pattern represents a function that can be “replicated and applied to all elements in a collection” (source: https://software.intel.com/en-us/blogs/2009/06/10/parallel-patterns-3-map)
Map operations occur in independent loop iterations
Embarrassingly Parallel (Map)
Performing array (or matrix) addition is a straightforward example that is easily parallelized
The serial example of this follows:

    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

Three OpenMP parallel versions follow on the next slides
OpenMP First Try
We could parallelize the loop on the last slide directly as follows:

    #pragma omp parallel private(i) shared(A, B, C)
    {
        int start = omp_get_thread_num() * (N / omp_get_num_threads());
        int end   = start + (N / omp_get_num_threads());
        for (i = start; i < end; i++)
            C[i] = A[i] + B[i];
    }

Notice that i is declared private so that it is not shared between threads - each thread gets its own copy of i
Arrays A, B, and C are declared shared because they are shared between threads
Note that this manual partitioning assumes N is evenly divisible by the number of threads; otherwise the trailing elements are never computed
OpenMP for Directive

It is preferred to let OpenMP parallelize loops directly, using the for directive as follows:

    #pragma omp parallel private(i) shared(A, B, C)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

Notice that the loop can be written in serial fashion; its iterations are automatically partitioned and assigned to threads
Shortened OpenMP for

When a parallel region contains a single for loop, the parallel and for directives may be combined:

    #pragma omp parallel for private(i) \
        shared(A, B, C)
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];
Recursive Decomposition
Used to introduce concurrency into problems that can be solved with divide-and-conquer
Such a problem is solved by dividing it into independent sub-problems
A special case of this decomposition is the reduction pattern, wherein the elements of a collection are combined with a binary associative operator (e.g., +, *, min, max) (source: https://software.intel.com/en-us/blogs/2009/07/23/parallel-pattern-7-reduce)
Example
To find a minimum serially, given an array A of size N, use the following algorithm:

    min = A[0];
    for (i = 1; i < N; i++)
        if (A[i] < min)
            min = A[i];
Example
This task can be decomposed for parallelism with a recursive solution:

    /* Returns the minimum of A[i..i+n-1]; min() is assumed to be defined
       elsewhere, e.g., as a macro */
    int findMinRec(int A[], int i, int n)
    {
        if (n == 1)
            return A[i];
        else
        {
            int lmin = findMinRec(A, i, n/2);
            int rmin = findMinRec(A, i + n/2, n - n/2);
            return min(lmin, rmin);
        }
    }
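The recursion above expresses the decomposition but still runs serially; as a hedged sketch (not in the original slides), the independent sub-problems could be executed concurrently with OpenMP tasks. Here findMinTask and CUTOFF are hypothetical names, and min() and findMinRec() are as defined above:

    int findMinTask(int A[], int i, int n)
    {
        if (n < CUTOFF)                  /* hypothetical cutoff keeps tasks coarse enough */
            return findMinRec(A, i, n);  /* fall back to the serial recursion */

        int lmin, rmin;
        #pragma omp task shared(lmin)    /* left half runs as a child task */
        lmin = findMinTask(A, i, n/2);

        rmin = findMinTask(A, i + n/2, n - n/2);  /* right half in the current task */

        #pragma omp taskwait             /* wait for the child before combining */
        return min(lmin, rmin);
    }

It would be invoked from inside a parallel region, e.g., #pragma omp parallel followed by #pragma omp single around the initial call findMinTask(A, 0, N).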
OpenMP Implementation
    for (i = 0; i < N; i++)
        A[i] = rand() % 100;

    small = A[0];
    #pragma omp parallel for reduction(min:small)
    for (i = 0; i < N; i++) {
        if (A[i] < small)
            small = A[i];
    }

Note that the min and max reduction operators for C/C++ were added in OpenMP 3.1
OpenMP Sum Reduction
    for (i = 0; i < N; i++)
        A[i] = i + 1;

    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += A[i];

    printf("The sum is %d\n", sum);
Data Decomposition
Commonly used on algorithms that operate on large data structures
Involves two steps
Data is partitioned
This partitioning is then used to induce a partitioning of the computation into tasks
Operations on different data partitions are typically similar or are chosen from a small set of operations
Partitioning
Partitioning output data - each output is computed independently of the others as a function of the input
Example - matrix multiplication can be partitioned into submatrices of the output
Partitioning input data - a task is created for each partition of the input data
Example - finding a minimum or maximum
Partitioning input and output data - a combination of the two cases above
Partitioning intermediate data
A sketch of output-data partitioning follows below
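As a hedged sketch of partitioning output data (not from the original slides), the loop below computes C = A * B by dividing the output matrix C into blocks of rows, one block per task; N, NUM_BLOCKS, and the matrix names are assumptions for the example, with NUM_BLOCKS evenly dividing N:

    /* Output-data partitioning: each task owns one block of rows of C.
       Blocks are independent because each C[i][j] depends only on A and B. */
    #pragma omp parallel for private(i, j, k)
    for (b = 0; b < NUM_BLOCKS; b++) {
        int first = b * (N / NUM_BLOCKS);
        int last  = first + (N / NUM_BLOCKS);
        for (i = first; i < last; i++)
            for (j = 0; j < N; j++) {
                C[i][j] = 0;
                for (k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }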
Next Time
More decompositions
Exploratory Decomposition
Speculative Decomposition
Tasks and Interactions
Load Balancing
Managing Overhead
Parallel Algorithm Models