
Starting Parallel Algorithm Design

David Monismith

Based on notes from Introduction to Parallel Programming, 2nd Edition, by Grama, Gupta, Karypis, and Kumar

Outline

Decomposition

Tasks and Interaction

Load Balancing

Managing Overhead

Parallel Models

Decomposition

Decomposition - dividing a computation into parts that may be executed in parallel

Tasks - programmer defined units of computation into which the main computation is subdivided

Task-dependency graphs - abstraction used to express dependencies between tasks and their relative order of execution

Granularity

Granularity - number/size of tasks that a computation can be divided into

Fine grained - task divided into many small tasks

Coarse grained - task divided into few large tasks

Degree of concurrency - maximum number of tasks that can be executed in parallel in a program at any time

Average degree of concurrency can be more useful as it provides a better indication of performance
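As a rough illustration (not from the original slides), granularity shows up in OpenMP as the chunk size handed to each thread; here i, N, and work() are placeholders for an independent per-element computation.

/* Fine grained: chunks of a single iteration, so many small tasks. */
#pragma omp parallel for schedule(static, 1)
for (i = 0; i < N; i++)
    work(i);

/* Coarse grained: the default static schedule gives each thread one large
   block of roughly N / num_threads iterations.                            */
#pragma omp parallel for schedule(static)
for (i = 0; i < N; i++)
    work(i);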

Example

Matrix-Vector Multiplication

The figure will be drawn on the board

Generally considered fine-grained if parallelizing based upon each dot product

Could be considered coarse-grained if using a dual-core processor and each task computes half of the dot products
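Since the figure is drawn in class, here is a minimal sketch of the fine-grained version (matrix A, input vector x, output vector y, and size N are assumed names, not from the slides); each loop iteration is one task that computes one dot product.

/* Fine-grained decomposition: one loop iteration (one task) per dot product. */
#pragma omp parallel for
for (int row = 0; row < N; row++) {
    double dot = 0.0;
    for (int col = 0; col < N; col++)
        dot += A[row][col] * x[col];
    y[row] = dot;
}

The coarse-grained variant would instead give each thread a block of rows, i.e., several dot products per task.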

Task Graphs

Critical path - longest directed path between a pair of start and finish nodes in the task graph

Critical path length – sum of the weights of the nodes along a critical path

Weight of a node is the size of the task or amount of work associated with the task

Aside from these factors, the interaction between tasks running on different processors may cost additional runtime

An example of a task dependency graph will be drawn in class to aid in the understanding of these concepts
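As a hedged code illustration of a small task-dependency graph (not part of the original slides), OpenMP's task construct with depend clauses can express the same ordering information; taskA(), taskB(), and taskC() are placeholder functions.

int x, y;                          /* shared results of the two producer tasks */

#pragma omp parallel shared(x, y)
#pragma omp single
{
    /* taskA and taskB have no dependence on each other and may run in parallel. */
    #pragma omp task depend(out: x)
    x = taskA();

    #pragma omp task depend(out: y)
    y = taskB();

    /* taskC may not start until both predecessors finish, so the critical path
       is the heavier of taskA/taskB followed by taskC.                          */
    #pragma omp task depend(in: x, y)
    taskC(x, y);

    #pragma omp taskwait
}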

Processes and Threads vs. Processors

Mapping - mechanism by which tasks are assigned to processes and/or threads for execution

Threads and processes are logical units that perform tasks

Processors physically perform the computations

Important to realize this because we may have multiple stages of computation

For example, internode communication vs. shared memory communication

Drawing a task dependency or task interaction graph may help us to understand how tasks interact with one another and will aid in the development of a parallel algorithm

Decomposition Techniques

Embarrassingly Parallel

Recursive decomposition

Data Decomposition

Exploratory decomposition

Speculative decomposition

Embarrassingly Parallel Tasks

Some tasks lend themselves to direct parallelization

Such tasks are said to be embarrassingly parallel and can be directly mapped to processes or threads

A subset of these types of tasks represent the map pattern

Note that the map pattern represents a function that can be "replicated and applied to all elements in a collection" – source: https://software.intel.com/en-us/blogs/2009/06/10/parallel-patterns-3-map

Map operations occur in independent loop iterations

Embarrassingly Parallel (Map)

Performing array (or matrix) addition is a straightforward example that is easily parallelized

The serial example of this follows:

for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];

Three OpenMP parallel versions follow on the next slides

OpenMP First Try

We could parallelize the loop on the last slide directly as follows:

#pragma omp parallel private(i) shared(A,B,C)
{
    int start = omp_get_thread_num() * (N / omp_get_num_threads());
    int end = start + (N / omp_get_num_threads());

    for (i = start; i < end; i++)
        C[i] = A[i] + B[i];
}

Notice that i is declared private because it is not shared between threads – each thread gets its own copy of i

Arrays A, B, and C are declared shared because they are shared between threads

Note that this manual partitioning assumes N is evenly divisible by the number of threads; otherwise the last few elements are never computed

OpenMP for clause

It is preferred to allow OpenMP to directly parallelize loops using the for clause as follows:

#pragma omp parallel private(i) shared(A,B,C)
{
    #pragma omp for
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

Notice that the loop can be written in a serial fashion; its iterations will be automatically partitioned and assigned to the threads

Shortened OpenMP for

When using a single for loop, the parallel and for directives may be combined

#pragma omp parallel for private(i) \
    shared(A,B,C)
for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];

Recursive Decomposition

Used to include concurrency in problems that can be solved with divide-and-conquer

Such a problem is solved by dividing it into independent sub-problems

A special type of this decomposition is the Reduction Pattern, wherein elements of a collection are combined with a binary associative operator (e.g., +, *, min, max), source -

https://software.intel.com/en-us/blogs/2009/07/23/parallel-pattern-7-reduce


Example

To find a minimum serially, given an array A of size N, use the following algorithm:

min = A[0];
for (i = 1; i < N; i++)
    if (A[i] < min)
        min = A[i];

Example

One way to decompose this task for parallelism is to restructure it as a recursive (divide-and-conquer) solution

int findMinRec(int A[], int i, int n)
{
    if (n == 1)
        return A[i];
    else
    {
        int lmin = findMinRec(A, i, n/2);
        int rmin = findMinRec(A, i + n/2, n - n/2);
        return min(lmin, rmin);   /* min() assumed to be a helper returning the smaller int */
    }
}
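Not on the original slides, but one way to actually run the two recursive halves in parallel is with OpenMP tasks; this sketch assumes a cutoff (1000 here, chosen arbitrarily) below which the serial findMinRec is used to keep task overhead small, and assumes min() is available as above.

int findMinRecPar(int A[], int i, int n)
{
    if (n < 1000)                         /* cutoff: fall back to the serial version */
        return findMinRec(A, i, n);

    int lmin, rmin;

    /* Each half of the range becomes an independent task. */
    #pragma omp task shared(lmin)
    lmin = findMinRecPar(A, i, n/2);

    #pragma omp task shared(rmin)
    rmin = findMinRecPar(A, i + n/2, n - n/2);

    #pragma omp taskwait                  /* wait for both halves before combining */
    return min(lmin, rmin);
}

/* Typically started from one thread inside a parallel region:
   #pragma omp parallel
   #pragma omp single
   result = findMinRecPar(A, 0, N);                                           */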

OpenMP Implementation

for (i = 0; i < N; i++)
    A[i] = rand() % 100;

small = A[0];

/* reduction(min:...) requires OpenMP 3.1 or later */
#pragma omp parallel for reduction(min:small)
for (i = 0; i < N; i++) {
    if (A[i] < small)
        small = A[i];
}

OpenMP Sum Reduction

for (i = 0; i < N; i++)
    A[i] = i + 1;

sum = 0;

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++)
    sum += A[i];

printf("The sum is %d\n", sum);

Data Decomposition

Commonly used on algorithms that operate on large data structures

Involves two steps

Data is partitioned

Data partitioning is used to cause partitioning of computations into tasks

Operations on different data partitions are typically similar or are chosen from a small set of operations

Partitioning

Partitioning output data – outputs computed independently of others as a function of input

Example – matrix multiplication can be partitioned into submatrices (see the sketch after this list)

Partitioning input data – task is created for each partition of the input data

Example – finding a minimum or maximum

Partitioning input and output – combination of the two cases above

Partitioning intermediate data
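A minimal sketch of output partitioning for matrix multiplication (not from the slides; the names A, B, C for N x N matrices and the row-wise partition are assumptions): each thread owns a block of rows of the output C and computes it independently from the shared inputs.

/* Output-data partitioning: each thread computes its own rows of C;
   A and B are only read.                                             */
#pragma omp parallel for
for (int row = 0; row < N; row++)
    for (int col = 0; col < N; col++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++)
            sum += A[row][k] * B[k][col];
        C[row][col] = sum;
    }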

Next Time

More decompositions

Exploratory Decomposition

Speculative Decomposition

Tasks and Interactions

Load balancing

Handling overhead

Parallel Algorithm Models