Slide 1: Multithreaded Algorithms
Dr. Yingwu Zhu
Chapter 27

Slide 2: Motivation
We have discussed serial algorithms that are suitable for running on a uniprocessor computer. We will now extend our model to parallel algorithms that can run on a multiprocessor computer.
Divide-and-conquer algorithms are good candidates for parallelism.

Slide 3: Computational Models
Parallel machines are getting cheaper and are in fact now ubiquitous:
- supercomputers: custom architectures and networks
- computer clusters with dedicated networks (distributed memory)
- multi-core integrated circuit chips (shared memory)
- GPUs (graphics processing units)

Slide 4: Dynamic Multithreading
Static threading: an abstraction of virtual processors.
Our model is dynamic multithreading: programmers specify opportunities for parallelism, and a concurrency platform manages the decisions of mapping these to static threads (load balancing, communication, etc.).

Slide 5: Concurrency Constructs
- parallel: added to a loop construct such as for to indicate that each iteration can be executed in parallel.
- spawn: create a parallel subprocess, then keep executing the current procedure (a parallel procedure call).
- sync: wait here until all active parallel threads created by this instance of the procedure finish.
These keywords specify opportunities for parallelism without affecting whether the corresponding sequential program, obtained by removing them, is correct. A sketch of how these constructs might map onto a real language follows.

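To make the constructs concrete, here is a minimal sketch of how a parallel for might map onto Go's goroutines and sync.WaitGroup; parallelFor is an illustrative helper, not part of any standard library, and one goroutine per iteration is for exposition only (a real concurrency platform would balance load).

package main

import (
    "fmt"
    "sync"
)

// parallelFor mimics "parallel for i = lo to hi: body(i)" by spawning one
// goroutine per iteration and syncing on a WaitGroup.
func parallelFor(lo, hi int, body func(i int)) {
    var wg sync.WaitGroup
    for i := lo; i <= hi; i++ {
        wg.Add(1)
        go func(i int) { // "spawn" the iteration
            defer wg.Done()
            body(i)
        }(i)
    }
    wg.Wait() // "sync": wait for all spawned iterations to finish
}

func main() {
    squares := make([]int, 8)
    parallelFor(0, 7, func(i int) { squares[i] = i * i })
    fmt.Println(squares) // [0 1 4 9 16 25 36 49]
}

Each iteration writes a distinct element of the slice, so removing the parallelism leaves an equivalent sequential loop, exactly as the slide describes.
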
Slide 6: Fibonacci Numbers
The Fibonacci numbers are defined by the recurrence:
F_0 = 0
F_1 = 1
F_i = F_(i-1) + F_(i-2) for i > 1.

Slide 7: Naïve Algorithm: Compute Fib Numbers
(figure: pseudocode for the naive recursive FIB; a runnable sketch follows)

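The slide's pseudocode did not survive extraction; as a stand-in, here is the standard naive recursive algorithm as a Go sketch.

package main

import "fmt"

// fib is the naive serial algorithm: it recomputes the same subproblems
// over and over, so its running time is Θ(F_n), exponential in n.
func fib(n int) int {
    if n < 2 {
        return n
    }
    x := fib(n - 1)
    y := fib(n - 2)
    return x + y
}

func main() {
    fmt.Println(fib(10)) // 55
}
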
Slide 8: Caveat: Running Time
T(n) = T(n-1) + T(n-2) + Θ(1)
T(n) = Θ(F_n)
Since F_n grows exponentially in n, this is a particularly bad way to calculate Fibonacci numbers. (See p. 776 for the analysis.)

Slide 9: Fibonacci Numbers
The matrix identity
[[F_(n+1), F_n], [F_n, F_(n-1)]] = [[1, 1], [1, 0]]^n
allows you to calculate F_n in O(lg n) steps by repeated squaring of the matrix. This is how you would calculate the Fibonacci numbers with a good serial algorithm, as sketched below.
To illustrate the principles of parallel programming, though, we will use the naive (bad) algorithm.

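A sketch of the O(lg n) serial method, assuming the standard repeated-squaring formulation; mat2, mul, and fib are illustrative names, and uint64 limits how large n can get before overflow.

package main

import "fmt"

type mat2 [2][2]uint64

// mul multiplies two 2x2 matrices.
func mul(a, b mat2) mat2 {
    var c mat2
    for i := 0; i < 2; i++ {
        for j := 0; j < 2; j++ {
            for k := 0; k < 2; k++ {
                c[i][j] += a[i][k] * b[k][j]
            }
        }
    }
    return c
}

// fib computes F_n in O(lg n) matrix multiplications: the top-right entry
// of [[1,1],[1,0]]^n is F_n.
func fib(n int) uint64 {
    result := mat2{{1, 0}, {0, 1}} // identity matrix
    base := mat2{{1, 1}, {1, 0}}
    for ; n > 0; n >>= 1 { // repeated squaring
        if n&1 == 1 {
            result = mul(result, base)
        }
        base = mul(base, base)
    }
    return result[0][1] // F_n
}

func main() {
    fmt.Println(fib(10)) // 55
}
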
Slide 10: Parallel Algorithm for Fib Numbers
Logical parallelism: the spawn keyword does not force parallelism; it just says that it is permissible. A scheduler will make the decision concerning allocation to processors. If parallelism is used, sync must be respected. A runnable sketch of the parallel algorithm follows.

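A minimal Go sketch of the parallel Fibonacci algorithm (CLRS's P-FIB), with spawn rendered as a goroutine and sync as a channel receive:

package main

import "fmt"

// pFib mirrors P-FIB: the first recursive call is "spawned" as a goroutine,
// the second runs in the current strand, and the channel receive is the "sync".
func pFib(n int) int {
    if n < 2 {
        return n
    }
    ch := make(chan int, 1)
    go func() { ch <- pFib(n - 1) }() // spawn P-FIB(n-1)
    y := pFib(n - 2)                  // continue in the current strand
    x := <-ch                         // sync: wait for the spawned call
    return x + y
}

func main() {
    fmt.Println(pFib(20)) // 6765
}

In practice, spawning a goroutine per recursive call costs far more than the addition it wraps; a real implementation would fall back to serial code below some threshold.
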
Slide 11: Computation DAG
Model a multithreaded computation as a computation dag (directed acyclic graph) G = (V, E).
Vertices in V are instructions, or strands = sequences of non-parallel instructions.
Edges in E represent dependencies between instructions or strands: (u, v) ∈ E means u must execute before v.
- Continuation edges (u, v) are drawn horizontally and indicate that v is the successor to u in the sequential procedure.
- Call edges (u, v) point downwards, indicating that u called v as a normal subprocedure call.
- Spawn edges (u, v) point downwards, indicating that u spawned v in parallel.
- Return edges point upwards to indicate the next strand executed after returning from a normal procedure call, or after parallel spawning at a sync point.
If G has a directed path from u to v, they are logically in series; otherwise they are logically in parallel.

Slide 12: Computation DAG for Fib(4)
(figure)

Slide 13: Performance Measures: Work
Work: T_1 = the total time to execute an algorithm on one processor, i.e., the total amount of computational work that gets done.
T_P = the running time of the algorithm on P processors.
Work law: P · T_P ≥ T_1, equivalently T_P ≥ T_1 / P.
The speedup on P processors can be no better than the time with one processor divided by P.

Slide 14: Performance Measures: Span
Span: T_∞ = the total time to execute an algorithm on an infinite number of processors (as many processors as are needed to allow parallelism wherever it is possible).
The span corresponds to the longest time to execute the strands along any path in the computation dag.
Span law: T_P ≥ T_∞. A P-processor ideal parallel computer cannot run faster than one with an infinite number of processors.

Slide 15: Performance Measures: Speedup
Speedup = T_1 / T_P.
By the work law, T_1 / T_P ≤ P: one cannot have more speedup than the number of processors.
Notes:
- Parallelism provides only a constant-factor improvement (the constant being the number of processors) to any algorithm!
- Parallelism cannot move an algorithm from a higher to a lower complexity class (e.g., exponential to polynomial, or quadratic to linear).
- Parallelism is not a silver bullet: good algorithm design and analysis are still needed.

Slide 16: Performance Measures: Parallelism
Parallelism = T_1 / T_∞: the potential parallelism of the computation.
Upper bound: the maximum possible speedup that can be achieved on any number of processors.
Limit: the limit on the possibility of attaining perfect linear speedup. Once the number of processors exceeds the parallelism, the computation cannot possibly achieve perfect linear speedup; the more processors we use beyond the parallelism, the less perfect the speedup.

Slide 17: Analysis of Multithreaded Algorithms
Analyzing span:
- If subcomputations are in series, the span is the sum of the spans of the subcomputations.
- If they are in parallel, the span is the maximum of the spans of the subcomputations.

Slide 18: Analysis of Fib Number Computation
Work: T_1(n) = Θ(((1 + √5)/2)^n)
Span: T_∞(n) = max(T_∞(n-1), T_∞(n-2)) + Θ(1) = T_∞(n-1) + Θ(1) = Θ(n)
Parallelism: T_1(n) / T_∞(n) = Θ(F_n / n), which grows dramatically, since F_n grows much faster than n.

Slide 19: Parallel Loops
The parallel keyword is used with loop constructs such as for.
Problem: multiply an n x n matrix A = (a_ij) by an n-vector x = (x_j). This yields an n-vector y = (y_i) where:
y_i = Σ_(j=1..n) a_ij · x_j,  for i = 1, 2, ..., n.

Slide 20: Parallel Loops
(figure: MAT-VEC pseudocode, with the outer loop written as a parallel for)
The parallel for keywords indicate that each iteration of the loop can be executed concurrently.
Notice that the inner for loop is not parallel; this is a possible point of improvement, to be discussed.
But it is not realistic to think that all n subcomputations in these loops can be spawned immediately with no extra work. A goroutine-based sketch follows.

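A Go sketch of the same structure, assuming one goroutine per row stands in for the parallel for (again, a real platform would chunk the iterations rather than spawn n goroutines):

package main

import (
    "fmt"
    "sync"
)

// matVec computes y = A·x with the outer loop over rows run in parallel
// while the inner summation stays serial, mirroring MAT-VEC.
func matVec(a [][]float64, x []float64) []float64 {
    n := len(a)
    y := make([]float64, n)
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func(i int) { // parallel for i = 1 to n
            defer wg.Done()
            for j := 0; j < n; j++ { // serial inner loop
                y[i] += a[i][j] * x[j]
            }
        }(i)
    }
    wg.Wait()
    return y
}

func main() {
    a := [][]float64{{1, 2}, {3, 4}}
    x := []float64{1, 1}
    fmt.Println(matVec(a, x)) // [3 7]
}

There is no race here: each goroutine writes only its own y[i].
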
Slide 21: Implementing Parallel Loops
This can be accomplished by a compiler with a divide-and-conquer approach, itself implemented with parallelism, as sketched below.

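A sketch of that divide-and-conquer scheme in Go, modeled on CLRS's MAT-VEC-MAIN-LOOP; recurFor is an illustrative name. The spawn tree it builds has height Θ(lg n), which is where the Θ(lg n) term in the next slide's span comes from.

package main

import (
    "fmt"
    "sync"
)

// recurFor runs body(i) for lo <= i <= hi by recursive halving,
// spawning one half as a goroutine; this is how a compiler could
// implement "parallel for" without spawning all n iterations at once.
func recurFor(lo, hi int, body func(i int)) {
    if lo == hi {
        body(lo)
        return
    }
    mid := (lo + hi) / 2
    var wg sync.WaitGroup
    wg.Add(1)
    go func() { // spawn the left half
        defer wg.Done()
        recurFor(lo, mid, body)
    }()
    recurFor(mid+1, hi, body) // run the right half in the current strand
    wg.Wait()                 // sync
}

func main() {
    y := make([]int, 8)
    recurFor(0, 7, func(i int) { y[i] = i * i })
    fmt.Println(y) // [0 1 4 9 16 25 36 49]
}
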
Slide 22: Analysis of Mat-Vec
Work: T_1(n) = Θ(n^2)
Span: T_∞(n) = Θ(lg n) + Θ(n) = Θ(n)
(Θ(lg n) is the height of the recursive spawn tree; Θ(n) comes from the serial inner for loop run at each leaf.)
Parallelism: T_1(n) / T_∞(n) = Θ(n)
Even with full utilization of parallelism, the inner for loop still requires Θ(n).
Can we improve it?

Slide 23: Improvement: Nested Parallelism?
Make the inner for loop parallel as well? Does it work?

Slide 24: Race Condition
Deterministic algorithms do the same thing on the same input.
Nondeterministic algorithms may give different results on different runs.
Race condition: multiple threads access a shared data structure or memory location concurrently, and at least one of the accesses is a write. A small demonstration follows.

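A minimal Go demonstration of such a race, in the spirit of CLRS's RACE-EXAMPLE: two goroutines increment a shared counter without synchronization. Running it with Go's -race flag (go run -race) reports the race.

package main

import (
    "fmt"
    "sync"
)

func main() {
    x := 0
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            x = x + 1 // unsynchronized read-modify-write of shared x
        }()
    }
    wg.Wait()
    fmt.Println(x) // usually 2, but 1 is possible: nondeterministic
}

The result depends on how the two reads and writes interleave, which is exactly why parallelizing the inner loop of MAT-VEC (all iterations updating the same y[i]) would be incorrect.
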
Slide 27: Example: Matrix Multiplication
Work: T_1(n) = Θ(n^3)
Span: T_∞(n) = Θ(n), due to the path for spawning the outer and inner parallel loop executions and then the n executions of the innermost for loop.
Parallelism: T_1(n) / T_∞(n) = Θ(n^3) / Θ(n) = Θ(n^2)
Questions:
- Is it also subject to a race condition? Why or why not?
- Could we get the span down to Θ(1) if we also parallelized the innermost for loop with parallel for?
A sketch of the algorithm follows.

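A Go sketch of the doubly nested parallel loops, with the innermost loop kept serial; pMatMul is an illustrative name. All k iterations update the same c[i][j], so parallelizing the innermost loop naively would create exactly the race condition the questions ask about.

package main

import (
    "fmt"
    "sync"
)

// pMatMul computes C = A·B with the two outer loops run in parallel
// (one goroutine per entry of C) and the innermost k loop serial.
func pMatMul(a, b [][]int) [][]int {
    n := len(a)
    c := make([][]int, n)
    for i := range c {
        c[i] = make([]int, n)
    }
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        for j := 0; j < n; j++ {
            wg.Add(1)
            go func(i, j int) { // parallel over the entries of C
                defer wg.Done()
                for k := 0; k < n; k++ { // serial innermost loop
                    c[i][j] += a[i][k] * b[k][j]
                }
            }(i, j)
        }
    }
    wg.Wait()
    return c
}

func main() {
    a := [][]int{{1, 2}, {3, 4}}
    fmt.Println(pMatMul(a, a)) // [[7 10] [15 22]]
}
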
Slide 28: Multithreading the Divide-and-Conquer Algorithm
Refer to Chapter 4, p. 75, for the divide-and-conquer matrix multiplication algorithm.
Work: T_1(n) = Θ(n^3)
Span: T_∞(n) = Θ(lg^2 n)
Parallelism: Θ(n^3) / Θ(lg^2 n) = Θ(n^3 / lg^2 n)

Slide 29: Example: Merge Sort
MERGE remains a serial algorithm, so its work and span are Θ(n) as before.
Work of MERGE-SORT': MS'_1(n) = Θ(n lg n)
Span of MERGE-SORT': MS'_∞(n) = MS'_∞(n/2) + Θ(n) = Θ(n)
Parallelism: MS'_1(n) / MS'_∞(n) = Θ(n lg n / n) = Θ(lg n)
Low parallelism!
How about speeding up the serial MERGE?

Slide 30: Parallelizing Merge
Divide-and-conquer on merge: break the two sorted lists into four lists, two of which will be merged to form the head of the final list and the other two merged to form the tail.
1. Choose the longer list to be the first list, T[p1..r1].
2. Find the middle element (median) of the first list (x at q1).
3. Use binary search to find the position (q2) at which x would be inserted in the second list T[p2..r2].
4. Recursively merge:
   - the first list up to just before the median, T[p1..q1-1], with the second list up to the insertion point, T[p2..q2-1];
   - the first list from just after the median, T[q1+1..r1], with the second list from the insertion point, T[q2..r2].
5. Assemble the results with the median element placed between them, as shown on the next slide.

Slide 31: Parallelizing Merge
(figure: the recursive decomposition of the merge; a code sketch follows)

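A Go sketch of P-MERGE under the scheme described above; indices follow the slide's T[p1..r1], T[p2..r2] notation, and sort.SearchInts plays the role of the binary search. The spawn becomes a goroutine and the channel receive is the sync.

package main

import (
    "fmt"
    "sort"
)

// pMerge merges the sorted ranges t[p1..r1] and t[p2..r2] into b starting
// at index p3, using the four-way split from the previous slide.
func pMerge(t []int, p1, r1, p2, r2 int, b []int, p3 int) {
    n1, n2 := r1-p1+1, r2-p2+1
    if n1 < n2 { // ensure the first range is the longer one
        p1, p2 = p2, p1
        r1, r2 = r2, r1
        n1, n2 = n2, n1
    }
    if n1 == 0 { // both ranges empty
        return
    }
    q1 := (p1 + r1) / 2 // median of the longer range
    // binary search: q2 is where t[q1] would be inserted in t[p2..r2]
    q2 := p2 + sort.SearchInts(t[p2:r2+1], t[q1])
    q3 := p3 + (q1 - p1) + (q2 - p2)
    b[q3] = t[q1] // place the median between the two sub-merges
    done := make(chan struct{})
    go func() { // spawn: merge the two head pieces
        pMerge(t, p1, q1-1, p2, q2-1, b, p3)
        close(done)
    }()
    pMerge(t, q1+1, r1, q2, r2, b, q3+1) // merge the two tail pieces
    <-done // sync
}

func main() {
    t := []int{2, 5, 8, 1, 3, 9} // t[0..2] and t[3..5] are each sorted
    b := make([]int, len(t))
    pMerge(t, 0, 2, 3, 5, b, 0)
    fmt.Println(b) // [1 2 3 5 8 9]
}
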
Slide 32: Parallelizing Merge Sort
P-MERGE-SORT(A, p, r, B, s)
  n = r - p + 1
  if n == 1
      B[s] = A[p]
  else
      let T[1..n] be a new array
      q = floor((p + r) / 2)
      q' = q - p + 1
      spawn P-MERGE-SORT(A, p, q, T, 1)
      P-MERGE-SORT(A, q+1, r, T, q'+1)
      sync
      P-MERGE(T, 1, q', q'+1, n, B, s)
A Go rendering of this pseudocode follows.

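A sketch of the same procedure in Go, assuming the pMerge function from the previous slide's sketch is in the same file (replacing that sketch's main):

// pMergeSort mirrors P-MERGE-SORT above: it sorts a[p..r] into b[s..s+n-1],
// spawning the first recursive sort as a goroutine and syncing before merging.
func pMergeSort(a []int, p, r int, b []int, s int) {
    n := r - p + 1
    if n == 1 {
        b[s] = a[p]
        return
    }
    t := make([]int, n+1) // scratch array T[1..n]; index 0 unused, as in the pseudocode
    q := (p + r) / 2      // q = floor((p+r)/2)
    qq := q - p + 1       // q'
    done := make(chan struct{})
    go func() { // spawn P-MERGE-SORT(A, p, q, T, 1)
        pMergeSort(a, p, q, t, 1)
        close(done)
    }()
    pMergeSort(a, q+1, r, t, qq+1)
    <-done                          // sync
    pMerge(t, 1, qq, qq+1, n, b, s) // P-MERGE(T, 1, q', q'+1, n, B, s)
}

func main() {
    a := []int{5, 2, 9, 1, 7, 3}
    b := make([]int, len(a))
    pMergeSort(a, 0, len(a)-1, b, 0)
    fmt.Println(b) // [1 2 3 5 7 9]
}

The two recursive sorts write disjoint halves of the scratch array, so the spawn introduces no race.
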
Slide 33: Analysis of Parallelizing Merge Sort
Refer to pp. 801-804 for details.
Parallelism achieved: Θ(n / lg^2 n)