
Slide 1

Parallel Performance

6/16/2010

Slide 2

Performance Considerations

I parallelized my code and …

The code is slower than the sequential version, or I don't get enough speedup

What should I do?

Slide 3

Performance Considerations

Algorithmic

Fine-grained parallelism

True contention

False contention

Other Bottlenecks

Slide 4

Algorithmic Bottlenecks

Parallel algorithms might sometimes be completely different from their sequential counterparts

Might need different design and implementation

Slide 5

Algorithmic Bottlenecks

Parallel algorithms might sometimes be completely different from their sequential counterparts

Might need different design and implementation

Example: MergeSort (once again)

Slide 6

Recall from Previous Lecture

Merge was the bottleneck

[Diagram: MergeSort merge tree annotated with per-merge costs (16, 8, 1, ...)]

Slide 7

Most Efficient Sequential Merge is not Parallelizable

Merge(int* a, int* b, int* result) {
    while( <end condition> ) {
        if (*a <= *b) {
            *result++ = *a++;
        } else {
            *result++ = *b++;
        }
    }
}
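For reference, a complete version of this loop might look like the sketch below (C#, using array indices instead of pointers; the explicit bounds checks and tail copies are assumptions standing in for the slide's <end condition>).

// Sketch: sequential two-way merge of sorted arrays a and b into result.
static void SequentialMerge(int[] a, int[] b, int[] result)
{
    int ia = 0, ib = 0, ir = 0;
    while (ia < a.Length && ib < b.Length)
    {
        if (a[ia] <= b[ib])
            result[ir++] = a[ia++];
        else
            result[ir++] = b[ib++];
    }
    // Copy whichever input still has elements left.
    while (ia < a.Length) result[ir++] = a[ia++];
    while (ib < b.Length) result[ir++] = b[ib++];
}

Each iteration consumes one element chosen by comparing the current heads, so every step depends on the previous one and the loop itself offers no parallelism.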

Slide 8

Parallel Merge Algorithm

Merge two sorted arrays A and B using divide and conquer

Slide 9

Parallel Merge Algorithm

Merge two sorted arrays A and B:

Let A be the larger array (else swap A and B)

Let n be the size of A

Split A into A[0…n/2-1], A[n/2], A[n/2+1…n]

Do binary search to find the smallest m such that B[m] >= A[n/2]

Split B into B[0…m-1], B[m…]

Return Merge(A[0…n/2-1], B[0…m-1]), A[n/2], Merge(A[n/2+1…n], B[m…])
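A sketch of how these steps might look in C# with Parallel.Invoke (the index-range representation, the method names, and the sequential cutoff are illustrative choices, not from the slides):

// Sketch: merge a[aLo..aHi) and b[bLo..bHi) into result starting at rLo,
// following the divide-and-conquer steps above. Needs System.Threading.Tasks.
static void ParallelMerge(int[] a, int aLo, int aHi,
                          int[] b, int bLo, int bHi,
                          int[] result, int rLo)
{
    if (aHi - aLo < bHi - bLo)
    {
        // Make a the larger range (i.e., swap A and B).
        ParallelMerge(b, bLo, bHi, a, aLo, aHi, result, rLo);
        return;
    }
    if (aHi - aLo + bHi - bLo <= 4096)
    {
        // Small ranges: plain sequential merge (the cutoff value is arbitrary).
        int i = aLo, j = bLo, k = rLo;
        while (i < aHi && j < bHi) result[k++] = a[i] <= b[j] ? a[i++] : b[j++];
        while (i < aHi) result[k++] = a[i++];
        while (j < bHi) result[k++] = b[j++];
        return;
    }
    int mid = (aLo + aHi) / 2;                     // position of A[n/2]
    int split = LowerBound(b, bLo, bHi, a[mid]);   // smallest m with B[m] >= A[n/2]
    int pivot = rLo + (mid - aLo) + (split - bLo); // final position of A[n/2]
    result[pivot] = a[mid];
    Parallel.Invoke(                               // the two sub-merges are independent
        () => ParallelMerge(a, aLo, mid, b, bLo, split, result, rLo),
        () => ParallelMerge(a, mid + 1, aHi, b, split, bHi, result, pivot + 1));
}

static int LowerBound(int[] b, int lo, int hi, int key)
{
    // Smallest index in [lo, hi) whose element is >= key; returns hi if none.
    while (lo < hi)
    {
        int m = (lo + hi) / 2;
        if (b[m] >= key) hi = m; else lo = m + 1;
    }
    return lo;
}

Calling ParallelMerge(A, 0, A.Length, B, 0, B.Length, result, 0) fills result with the merged contents of A and B.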

Slide 10

Assignment 1 Extra Credit

Implement the Parallel Merge algorithm and measure performance improvement with your work stealing queue implementation

Slide 11

Fine-Grained Parallelism

Overheads of Tasks

Each Task uses some memory (though not as many resources as a thread)

Work stealing queue operations

If the work done in each task is small, then the overheads might not justify the improvements in parallelism
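To illustrate, the sketch below contrasts a per-element loop body with chunked ranges via Partitioner.Create; the chunked version pays the scheduling and delegate-call overhead once per range instead of once per element (the array size and the trivial body are arbitrary).

// Needs System.Threading.Tasks (Parallel) and System.Collections.Concurrent (Partitioner).
int[] data = new int[20000000];

// Fine-grained: the loop-body delegate runs once per element, so its overhead
// is paid 20,000,000 times for a very cheap body.
Parallel.For(0, data.Length, i => data[i] = i + 1);

// Coarser-grained: Partitioner.Create(0, n) hands out contiguous [from, to) ranges,
// so each delegate call processes a whole chunk in a plain inner loop.
Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        data[i] = i + 1;
});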

Slide 12

False Sharing

Data Locality & Cache Behavior

Performance of computation depends HUGELY on how well the cache is working

Too many cache misses occur if processors are “fighting” for the same cache lines

Even if they don’t access the same data

Slide 13

Cache Coherence

Each cache line, on each processor, has one of these states:

i - invalid: not cached here

s - shared: cached, but immutable

x - exclusive: cached, and can be read or written

State transitions require communication between caches (cache coherence protocol)

If a processor writes to a line, it removes it from all other caches

[Diagram: snapshots of one cache line's state on processors P1, P2, P3 as it moves between invalid (i), shared (s), and exclusive (x)]

Slide 14

Ping-Pong & False Sharing

Ping-Pong

If two processors both keep writing to the same location, the cache line has to go back and forth

Very inefficient (lots of cache misses)

False Sharing

Two processors writing to two different variables may happen to write to the same cache line

If both variables are allocated on the same cache line, we get the ping-pong effect as above, and horrible performance

Slide 15

False Sharing Example

void WithFalseSharing()
{
    Random rand1 = new Random(), rand2 = new Random();
    int[] results1 = new int[20000000],
          results2 = new int[20000000];

    Parallel.Invoke(
        () => {
            for (int i = 0; i < results1.Length; i++)
                results1[i] = rand1.Next();
        },
        () => {
            for (int i = 0; i < results2.Length; i++)
                results2[i] = rand2.Next();
        });
}

Slide 16

False Sharing Example

(Same WithFalseSharing code as on the previous slide.)

Call to Next() writes to the Random object => Ping-Pong Effect

rand1, rand2 are allocated at the same time => likely on the same cache line.

Slide 17

False Sharing, Eliminated?

void WithoutFalseSharing()
{
    int[] results1, results2;

    Parallel.Invoke(
        () => {
            Random rand1 = new Random();
            results1 = new int[20000000];
            for (int i = 0; i < results1.Length; i++)
                results1[i] = rand1.Next();
        },
        () => {
            Random rand2 = new Random();
            results2 = new int[20000000];
            for (int i = 0; i < results2.Length; i++)
                results2[i] = rand2.Next();
        });
}


rand1, rand2 are allocated by different tasks => not likely on the same cache line.
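To observe the difference, one might time both versions with Stopwatch, for example (an illustrative harness; actual numbers depend on the machine and core count):

// Times the two methods from the slides above. Needs System.Diagnostics.
var sw = System.Diagnostics.Stopwatch.StartNew();
WithFalseSharing();
Console.WriteLine("WithFalseSharing:    " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
WithoutFalseSharing();
Console.WriteLine("WithoutFalseSharing: " + sw.ElapsedMilliseconds + " ms");

On a multicore machine the second version is typically noticeably faster, since rand1 and rand2 are no longer likely to share a cache line.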