Slide 1: Parallel Performance
6/16/2010
Slide 2: Performance Considerations
I parallelized my code and …
- The code is slower than the sequential version, or
- I don't get enough speedup
What should I do?
Slide 3: Performance Considerations
- Algorithmic bottlenecks
- Fine-grained parallelism
- True contention
- False contention
- Other bottlenecks
Slide 4: Algorithmic Bottlenecks
Parallel algorithms can be completely different from their sequential counterparts, and might need a different design and implementation.
Slide 5: Algorithmic Bottlenecks
Parallel algorithms can be completely different from their sequential counterparts, and might need a different design and implementation.
Example: MergeSort (once again)
Slide 6: Recall from Previous Lecture
Merge was the bottleneck.
[Figure: merge-sort recursion tree showing the cost of the merge steps at each level.]
Slide 7: Most Efficient Sequential Merge is not Parallelizable

Merge(int* a, int* b, int* result) {
    while( <end condition> ) {       // until one of the inputs is exhausted
        if (*a <= *b) {
            *result++ = *a++;        // each step depends on the previous one
        } else {
            *result++ = *b++;
        }
    }
}
Slide 8: Parallel Merge Algorithm
Merge two sorted arrays A and B using divide and conquer.
Slide 9: Parallel Merge Algorithm
Merge two sorted arrays A and B:
1. Let A be the larger array (else swap A and B).
2. Let n be the size of A. Split A into A[0…n/2-1], A[n/2], A[n/2+1…n].
3. Do a binary search to find the smallest m such that B[m] >= A[n/2].
4. Split B into B[0…m-1], B[m…].
5. Return Merge(A[0…n/2-1], B[0…m-1]), A[n/2], Merge(A[n/2+1…n], B[m…])
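The steps above can be sketched in C++ as follows. This is a hedged sketch, not the lecture's reference code: `std::async` stands in for whatever task mechanism the course uses, and the function names are ours.

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Merge sorted vectors a and b into out: split the larger array at its
// median, binary-search the split point in the other, and merge the two
// independent halves in parallel.
void ParallelMerge(const std::vector<int>& a, const std::vector<int>& b,
                   std::vector<int>& out) {
    if (a.size() < b.size()) { ParallelMerge(b, a, out); return; }  // A is larger
    if (a.empty()) { out.clear(); return; }       // both empty
    if (a.size() == 1) {                          // base case: trivial merge
        out.resize(1 + b.size());
        std::merge(a.begin(), a.end(), b.begin(), b.end(), out.begin());
        return;
    }
    size_t half = a.size() / 2;                   // pivot A[n/2]
    // smallest m such that B[m] >= A[n/2]
    size_t m = std::lower_bound(b.begin(), b.end(), a[half]) - b.begin();

    std::vector<int> loA(a.begin(), a.begin() + half);      // A[0..n/2-1]
    std::vector<int> loB(b.begin(), b.begin() + m);         // B[0..m-1]
    std::vector<int> hiA(a.begin() + half + 1, a.end());    // A[n/2+1..]
    std::vector<int> hiB(b.begin() + m, b.end());           // B[m..]

    std::vector<int> lo, hi;
    // the two sub-merges are independent, so one can run asynchronously
    auto fut = std::async(std::launch::async,
                          [&] { ParallelMerge(loA, loB, lo); });
    ParallelMerge(hiA, hiB, hi);
    fut.get();

    out.clear();
    out.insert(out.end(), lo.begin(), lo.end());
    out.push_back(a[half]);                       // pivot sits between halves
    out.insert(out.end(), hi.begin(), hi.end());
}
```

The copies into `loA`/`hiA` keep the sketch simple; a real implementation would recurse on index ranges instead.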
Slide 10: Assignment 1 Extra Credit
Implement the parallel merge algorithm and measure the performance improvement with your work-stealing queue implementation.
Slide 11: Fine-Grained Parallelism
Overheads of tasks:
- Each task uses some memory (though not as many resources as a thread)
- Work-stealing queue operations
If the work done in each task is small, the overheads might not justify the improvements in parallelism.
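To see why granularity matters, compare one task per element with a handful of chunked tasks. A sketch (the function names are ours, and `std::async` stands in for the task system):

```cpp
#include <algorithm>
#include <future>
#include <numeric>
#include <vector>

// Too fine-grained: one task per element, so scheduling overhead dwarfs
// the single addition each task performs.
long SumPerElement(const std::vector<int>& v) {
    std::vector<std::future<long>> tasks;
    for (int x : v)
        tasks.push_back(std::async(std::launch::async,
                                   [x] { return static_cast<long>(x); }));
    long total = 0;
    for (auto& t : tasks) total += t.get();
    return total;
}

// Coarser: a few chunks, each doing substantial work per task.
long SumChunked(const std::vector<int>& v, size_t chunks = 4) {
    std::vector<std::future<long>> tasks;
    size_t step = (v.size() + chunks - 1) / chunks;
    for (size_t lo = 0; lo < v.size(); lo += step) {
        size_t hi = std::min(lo + step, v.size());
        tasks.push_back(std::async(std::launch::async, [&v, lo, hi] {
            return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
        }));
    }
    long total = 0;
    for (auto& t : tasks) total += t.get();
    return total;
}
```

Both compute the same sum; for large inputs the chunked version amortizes the per-task overhead over many elements.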
Slide 12: False Sharing
Data locality & cache behavior:
- The performance of a computation depends hugely on how well the cache is working.
- There are too many cache misses if processors are "fighting" for the same cache lines, even if they don't access the same data.
Slide 13: Cache Coherence
Each cache line, on each processor, is in one of these states:
- i (invalid): not cached here
- s (shared): cached, but immutable
- x (exclusive): cached, and can be read or written
State transitions require communication between caches (the cache coherence protocol). If a processor writes to a line, it removes that line from all other caches.
[Figure: three snapshots of one cache line's state on processors P1–P3: initially invalid everywhere (i/i/i); shared on several processors after reads (s/i/s); exclusive on one processor after a write, invalid on the rest (i/x/i).]
Slide 14: Ping-Pong & False Sharing
Ping-pong: if two processors both keep writing to the same location, the cache line has to go back and forth between them. Very inefficient (lots of cache misses).
False sharing: two processors writing to two different variables may happen to write to the same cache line, if both variables are allocated on the same cache line. You get the same ping-pong effect as above, and horrible performance.
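The effect is easy to reproduce in C++: two threads incrementing adjacent counters share a cache line, while padded counters do not. A sketch (the 64-byte line size is an assumption about the target CPU, and the function names are ours):

```cpp
#include <atomic>
#include <thread>

struct Padded {
    alignas(64) std::atomic<long> value{0};  // each counter on its own line
};

// Adjacent counters: c[0] and c[1] sit on the same cache line, so the
// two threads ping-pong that line between their caches on every write.
long RunAdjacent(long iters) {
    std::atomic<long> c[2] = {{0}, {0}};
    std::thread t1([&] { for (long i = 0; i < iters; i++) c[0]++; });
    std::thread t2([&] { for (long i = 0; i < iters; i++) c[1]++; });
    t1.join(); t2.join();
    return c[0] + c[1];
}

// Padded counters: identical work, but no false sharing.
long RunPadded(long iters) {
    Padded c[2];
    std::thread t1([&] { for (long i = 0; i < iters; i++) c[0].value++; });
    std::thread t2([&] { for (long i = 0; i < iters; i++) c[1].value++; });
    t1.join(); t2.join();
    return c[0].value + c[1].value;
}
```

Both return the same answer; timing them with a large iteration count typically shows the padded version running much faster on a multicore machine.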
Slide 15: False Sharing Example

void WithFalseSharing()
{
    Random rand1 = new Random(), rand2 = new Random();
    int[] results1 = new int[20000000],
          results2 = new int[20000000];
    Parallel.Invoke(
        () => {
            for (int i = 0; i < results1.Length; i++)
                results1[i] = rand1.Next();
        },
        () => {
            for (int i = 0; i < results2.Length; i++)
                results2[i] = rand2.Next();
        });
}
Slide 16: False Sharing Example
(Same code as the previous slide, annotated.)
- The call to Next() writes to the Random object => ping-pong effect.
- rand1 and rand2 are allocated at the same time => likely on the same cache line.
Slide 17: False Sharing, Eliminated?

void WithoutFalseSharing()
{
    int[] results1, results2;
    Parallel.Invoke(
        () => {
            Random rand1 = new Random();
            results1 = new int[20000000];
            for (int i = 0; i < results1.Length; i++)
                results1[i] = rand1.Next();
        },
        () => {
            Random rand2 = new Random();
            results2 = new int[20000000];
            for (int i = 0; i < results2.Length; i++)
                results2[i] = rand2.Next();
        });
}

rand1 and rand2 are allocated by different tasks => not likely on the same cache line.
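Separating the allocations makes same-line placement unlikely but does not guarantee it. An explicit fix is to align each writer's hot state to a full cache line, sketched here in C++ (the 64-byte line size is an assumed target, and the xorshift generator is a self-contained stand-in for Random):

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// One cache-line-sized, cache-line-aligned slot per worker: concurrent
// writers can never land on the same line, regardless of allocation order.
struct alignas(64) PerThreadState {
    uint64_t rngState;   // per-thread PRNG state
};

// Tiny xorshift64 generator so the example needs no library RNG.
static uint32_t NextRand(uint64_t& s) {
    s ^= s << 13; s ^= s >> 7; s ^= s << 17;
    return static_cast<uint32_t>(s);
}

void WithoutFalseSharing(std::vector<uint32_t>& r1, std::vector<uint32_t>& r2) {
    PerThreadState s[2] = {{0x1234u}, {0x5678u}};  // one aligned slot each
    std::thread t1([&] { for (auto& x : r1) x = NextRand(s[0].rngState); });
    std::thread t2([&] { for (auto& x : r2) x = NextRand(s[1].rngState); });
    t1.join(); t2.join();
}
```

With the alignment, the two generator states are on different cache lines even though they live in the same array, so neither thread's writes invalidate the other's line.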