Introduction to Parallel Processing – presentation transcript

Slide1

Introduction to Parallel Processing

Dr. Guy Tel-Zur

Lecture 10

Slide2

Agenda

Administration

Final presentations

Demos

Theory

Next week plan

Home assignment #4 (last)

Slide3

Final Projects

Next Sunday: Groups 1-16 will present
Next Monday: Groups 17+ will present
10-minute presentation per group
All group members should present
Send your presentation to gtelzur@gmail.com by midnight of the previous day

Attendance is mandatory.

Slide4

Final Presentations

The division into groups is fixed.

A group that does not present will lose 5 points from its grade.

Rehearse and make sure you stay within the time limit.

The presentation should include:

the project name, its goal, the challenge the problem poses for parallel computing, and approaches to a solution.

Presentations will not be accepted during class! Be sure to send them to the lecturer ahead of time.

Slide5

The Course Roadmap

Introduction

Message Passing

HTC

HPC

Shared Memory

Condor

Grid Computing

Cloud Computing

MPI

OpenMP

Cilk++ – today

GPU Computing – new, today

Slide6

Advanced Parallel Computing and Distributed Computing course

A new course at the department:

Distributed Computing: Advanced Parallel Processing course + Grid Computing + Cloud Computing

Course number: 361-1-4691

If you are interested in this course, please send me an email.

Slide7

Today

Algorithms

Numerical Algorithms

(“slides11.ppt”)

Introduction to Grid Computing

Some demos

Home assignment #4

Slide8

Futuristic Asymmetric Multi-Core Chip

SACC – Sequential Accelerator

Slide9

Theory

Numerical Algorithms

Slides from:

UNIVERSITY OF NORTH CAROLINA AT CHARLOTTE

 

Department of Computer Science

ITCS 4145/5145 Parallel Programming

 

Spring 2009

 

Dr. Barry Wilkinson

Matrix multiplication, solving a system of linear equations, iterative methods


Slide10

Demos

Hybrid Parallel Programming – MPI + OpenMP

Cloud Computing

Setting up an HPC cluster

Setting up a Condor machine (a separate presentation)

StarHPC

Cilk++

GPU Computing (a separate presentation)

Eclipse PTP

Kepler workflow

Slide11

Hybrid MPI + OpenMP Demo

Machine file:
hobbit1
hobbit2
hobbit3
hobbit4

Each hobbit has 8 cores

mpicc -o mpi_out mpi_test.c -fopenmp

MPI

OpenMP

An Idea for a final project!!!

cd ~/mpi

Program name: hybridpi.c
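A plausible way to build and launch the hybrid code on this setup, as a sketch only: the exact mpirun flags depend on the MPI installation, and the machine-file name and thread count below are assumptions.

mpicc -o hybridpi hybridpi.c -fopenmp
export OMP_NUM_THREADS=8                          # one OpenMP thread per core on each hobbit
mpirun -np 4 -machinefile machines ./hybridpi     # one MPI rank per hobbit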

Slide12

MPI is not installed yet on the hobbits; in the meantime, use:

vdwarf5

vdwarf6

vdwarf7

vdwarf8

Slide13

top -u tel-zur -H -d 0.05

-H – show threads, -d – refresh delay (seconds), -u – user

Slide14

Hybrid MPI+OpenMP continued

Slide15

Hybrid Pi (MPI + OpenMP)

#include <stdio.h>
#include <mpi.h>
#include <omp.h>
#define NBIN 100000
#define MAX_THREADS 8

int main(int argc, char **argv) {
    int nbin, myid, nproc, nthreads, tid;
    double step, sum[MAX_THREADS] = {0.0}, pi = 0.0, pig;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    nbin = NBIN / nproc;
    step = 1.0 / (nbin * nproc);

Slide16

    #pragma omp parallel private(tid)
    {
        int i;
        double x;
        nthreads = omp_get_num_threads();
        tid = omp_get_thread_num();
        for (i = nbin * myid + tid; i < nbin * (myid + 1); i += nthreads) {
            x = (i + 0.5) * step;
            sum[tid] += 4.0 / (1.0 + x * x);
        }
        printf("rank tid sum = %d %d %e\n", myid, tid, sum[tid]);
    }
    for (tid = 0; tid < nthreads; tid++)
        pi += sum[tid] * step;
    MPI_Allreduce(&pi, &pig, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (myid == 0) printf("PI = %f\n", pig);
    MPI_Finalize();
    return 0;
}

Slide17

Cilk++

Simple, powerful expression of task parallelism:

cilk_for – parallelize for loops

cilk_spawn – specify the start of parallel execution

cilk_sync – specify the end of parallel execution

http://software.intel.com/en-us/articles/intel-cilk-plus/

Slide18

17/8/2011

Slide19

Slide20

Fibonacci

Try:

http://www.wolframalpha.com/input/?i=fibonacci+number

Slide21

Fibonacci Numbers – serial version

// 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
// Serial version
// Credit: http://myxman.org/dp/node/182
long fib_serial(long n) {
    if (n < 2) return n;
    return fib_serial(n-1) + fib_serial(n-2);
}

Slide22

Cilk++ Fibonacci

#include <cilk.h>
#include <stdio.h>

long fib_parallel(long n)
{
    long x, y;
    if (n < 2) return n;
    x = cilk_spawn fib_parallel(n-1);
    y = fib_parallel(n-2);
    cilk_sync;
    return (x + y);
}

int cilk_main()
{
    int N = 50;
    long result;
    result = fib_parallel(N);
    printf("fib of %d is %ld\n", N, result);
    return 0;
}

Slide23

Cilk_spawn

Add parallelism using cilk_spawn

We are now ready to introduce parallelism into our qsort program. The cilk_spawn keyword indicates that a function (the child) may be executed in parallel with the code that follows the cilk_spawn statement (the parent). Note that the keyword allows but does not require parallel operation; the Cilk++ scheduler will dynamically determine what actually gets executed in parallel when multiple processors are available. The cilk_sync statement indicates that the function may not continue until all cilk_spawn requests in the same function have completed. cilk_sync does not affect parallel strands spawned in other functions.
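The course's actual qsort code is not reproduced in this transcript; the following is only a minimal sketch of how cilk_spawn and cilk_sync might be added to a plain C quicksort (the Lomuto partition helper and function names are assumptions):

#include <cilk.h>

static int partition(int *a, int lo, int hi) {      /* Lomuto partition, pivot = a[hi] */
    int pivot = a[hi], i = lo, t;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot) { t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    t = a[i]; a[i] = a[hi]; a[hi] = t;
    return i;
}

void qsort_parallel(int *a, int lo, int hi) {
    if (lo >= hi) return;
    int p = partition(a, lo, hi);
    cilk_spawn qsort_parallel(a, lo, p - 1);   /* child: may run in parallel with the parent */
    qsort_parallel(a, p + 1, hi);              /* parent sorts the other half meanwhile */
    cilk_sync;                                 /* wait for the spawned child before returning */
}

int cilk_main() {
    int a[] = {5, 2, 9, 1, 7, 3};
    qsort_parallel(a, 0, 5);
    return 0;
}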

Slide24

Cilkview Fn(30)

Slide25

Strands and Knots – a Cilk++ program fragment

...
do_stuff_1();         // execute strand 1
cilk_spawn func_3();  // spawn strand 3 at knot A
do_stuff_2();         // execute strand 2
cilk_sync;            // sync at knot B
do_stuff_4();         // execute strand 4
...

DAG with two spawns (labeled A and B) and one sync (labeled C)

Slide26

Let's add labels to the strands of a more complex Cilk++ program (DAG) to indicate the number of milliseconds it takes to execute each strand.

In ideal circumstances (e.g., no scheduling overhead), and if an unlimited number of processors are available, this program should run for 68 milliseconds.

Slide27

Work and Span

Work: The total amount of processor time required to complete the program is the sum of all the numbers; we call this the work. In this DAG, the work is 181 milliseconds for the 25 strands shown, so if the program is run on a single processor it should take 181 milliseconds.

Span: Another useful concept is the span, sometimes called the critical path length. The span is the most expensive path from the beginning to the end of the program. In this DAG, the span is 68 milliseconds, as shown below:
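A useful derived quantity, standard in the Cilk work/span model although not stated on the slide, is the parallelism: the ratio of work to span, here 181 ms / 68 ms ≈ 2.66. It bounds the speedup achievable on any number of processors, which is why the ideal running time quoted above can never drop below 68 milliseconds.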

Slide28

cilk_for: divide-and-conquer strategy

Shown here: 8 threads and 8 iterations

Here is the DAG for a serial loop that spawns each iteration. In this case, the work is not well balanced, because each child does the work of only one iteration before incurring the scheduling overhead inherent in entering a sync.
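As an illustration, here is a hedged sketch of the two loop styles being compared; the per-iteration function do_work is a hypothetical placeholder:

#include <cilk.h>

void do_work(int i) { (void)i; /* placeholder for one iteration's work */ }

/* Serial loop that spawns each iteration: one cheap child per iteration,
   so scheduling overhead dominates and the DAG is poorly balanced. */
void spawn_per_iteration(int n) {
    for (int i = 0; i < n; i++)
        cilk_spawn do_work(i);
    cilk_sync;
}

/* cilk_for: the runtime splits the iteration range by divide and conquer,
   so each strand gets a comparable share of the work. */
void balanced_loop(int n) {
    cilk_for (int i = 0; i < n; i++)
        do_work(i);
}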

Slide29

Race conditions

Check the “qsort-race” program with cilkscreen:
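The qsort-race source itself is not included in this transcript; below is a minimal sketch, under that assumption, of the kind of data race cilkscreen reports: two strands update the same global counter without synchronization, so the result depends on timing.

#include <cilk.h>
#include <stdio.h>

int counter = 0;                    /* shared and unprotected */

void bump(int times) {
    for (int i = 0; i < times; i++)
        counter++;                  /* read-modify-write races with the other strand */
}

int cilk_main() {
    cilk_spawn bump(1000000);       /* strand 1 */
    bump(1000000);                  /* strand 2 runs concurrently with strand 1 */
    cilk_sync;
    printf("counter = %d (expected 2000000)\n", counter);
    return 0;
}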

Slide30

StarHPC on the Cloud

Will be ready for PP201X?

Slide31

Eclipse PTP – Parallel Tools Platform

http://www.eclipse.org/ptp/

Will be ready for PP201X?

Slide32

Recursion in OpenMP

long fib_parallel(long n)
{
    long x, y;
    if (n < 2) return n;
    #pragma omp task default(none) shared(x,n)
    {
        x = fib_parallel(n-1);
    }
    y = fib_parallel(n-2);
    #pragma omp taskwait
    return (x + y);
}

#pragma omp parallel
#pragma omp single
{
    r = fib_parallel(n);
}

Reference: http://myxman.org/dp/node/182

Use the taskwait pragma to wait for completion of the child tasks generated by the current task.

The task pragma can be useful for parallelizing irregular algorithms, such as recursive algorithms, for which other OpenMP worksharing constructs are inadequate.
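A sketch of how the task-based version might be built and run with GCC; the source file name and thread count are assumptions (OpenMP tasks require GCC 4.4 or later):

gcc -fopenmp -o fib_task fib_task.c
export OMP_NUM_THREADS=8
./fib_task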

Slide33

Intel® Parallel Studio

Use Parallel Composer to create and compile a parallel application

Use Parallel Inspector to improve reliability by finding memory and threading errors

Use Parallel Amplifier to improve parallel performance by tuning threaded code

Slide34

Intel® Parallel Studio

Slide35

Slide36

Slide37

Slide38

Slide39

Slide40

Slide41

Parallel Studio adds new features to Visual Studio

Slide42

Intel’s Parallel Amplifier – Execution Bottlenecks

Slide43

Slide44

Intel’s Parallel Inspector – Threading Errors

Slide45

Intel’s Parallel Inspector – Threading Errors

Slide46

Slide47

Error – Data Race

Slide48

Slide49

Slide50

Intel Parallel Studio - Composer

The installation of this part failed for me, probably because I didn't install Intel's C++ compiler first.

Sorry, I can't give a demo here…

Slide51

Slide52

Slide53

Slide54

Slide55

Slide56