2018-03-17 18K 18 0 0

##### Description

Dr. Yingwu Zhu. Chapter 27. Motivation. We have discussed serial algorithms that are suitable for running on a uniprocessor computer. We will now extend our model to parallel algorithms that can run on a ... ID: 654633


## Presentations text content in Multithreaded Algorithms

Multithreaded Algorithms

Dr. Yingwu Zhu

Chapter 27

Slide 2: Motivation

We have discussed serial algorithms that are suitable for running on a uniprocessor computer. We will now extend our model to parallel algorithms that can run on a multiprocessor computer.

Divide-and-conquer algorithms are good candidates for parallelism.

Slide 3: Computational Models

Parallel machines are getting cheaper and are in fact now ubiquitous:

supercomputers: custom architectures and networks

computer clusters with dedicated networks (distributed memory)

multi-core integrated circuit chips (shared memory)

GPUs (graphics processing units)

Slide 4: Dynamic Multithreading

Static threading: abstraction of virtual processors.

Our model is dynamic multithreading. Programmers specify opportunities for parallelism, and a concurrency platform manages the decisions of mapping these to static threads (load balancing, communication, etc.).

Slide 5: Concurrency Constructs

parallel: add to a loop construct such as for to indicate that each iteration can be executed in parallel.

spawn: create a parallel subprocess, then keep executing the current process (parallel procedure call).

sync: wait here until all active parallel threads created by this instance of the program finish.

These keywords specify opportunities for parallelism without affecting whether the corresponding sequential program obtained by removing them is correct.

Slide 6: Fibonacci Numbers

The Fibonacci numbers are defined by the recurrence:

$F_0 = 0$

$F_1 = 1$

$F_i = F_{i-1} + F_{i-2}$ for $i > 1$.

Slide 7: Naïve Algorithm: Compute Fib Numbers
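The pseudocode on this slide was an image and did not survive extraction. As a minimal sketch, assuming the slide shows the usual recursive procedure that follows the recurrence directly, the naive algorithm might look like this in Python (the name `fib` is chosen here for illustration):

```python
def fib(n):
    """Naive recursive Fibonacci, following the recurrence directly."""
    if n <= 1:
        return n          # F_0 = 0 and F_1 = 1
    x = fib(n - 1)        # first recursive call
    y = fib(n - 2)        # second recursive call, independent of the first
    return x + y


first_values = [fib(i) for i in range(10)]
print(first_values)
```

The two recursive calls do not depend on each other, which is what makes this procedure a convenient example for parallelism later on.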

Slide 8: Caveat: Running Time

$T(n) = T(n-1) + T(n-2) + \Theta(1)$

$T(n) = \Theta(F_n)$

Since $F_n$ grows exponentially in $n$, this is a particularly bad way to calculate Fibonacci numbers (see p. 776 for analysis).

Slide 9: Fibonacci Numbers

This allows you to calculate $F_n$ in $O(\log n)$ steps by repeated squaring of the matrix. This is how you can calculate the Fibonacci numbers with a serial algorithm.

To illustrate the principles of parallel programming, we will use the naive (bad) algorithm, though.
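The matrix identity on this slide was an image and is missing from the extracted text. Assuming it is the standard identity, in which the $n$-th power of the matrix with rows (1, 1) and (1, 0) has $F_n$ as its off-diagonal entry, repeated squaring can be sketched as follows (`mat_mul` and `fib_matrix` are illustrative names):

```python
def mat_mul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]


def fib_matrix(n):
    """Return F_n using O(log n) 2x2 matrix multiplications."""
    result = [[1, 0], [0, 1]]   # identity matrix
    base = [[1, 1], [1, 0]]     # assumed Fibonacci matrix
    while n > 0:
        if n & 1:               # this bit of n is set: fold base into result
            result = mat_mul(result, base)
        base = mat_mul(base, base)   # square for the next bit
        n >>= 1
    return result[0][1]         # off-diagonal entry holds F_n


print(fib_matrix(10))
```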

Slide 10: Parallel Algorithm for Fib Numbers

Logical parallelism: The spawn keyword does not force parallelism: it just says that it is permissible.

A scheduler will make the decision concerning allocation to processors.

If parallelism is used, sync must be respected.
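The parallel pseudocode for this slide was not preserved. The sketch below imitates spawn and sync with Python's `threading` module: starting a thread plays the role of spawn, and `join` plays the role of sync. This is only an analogy, since a real concurrency platform schedules work far more cheaply than one operating-system thread per spawn, and the name `p_fib` is an assumption.

```python
import threading


def p_fib(n):
    """Fibonacci with the first recursive call run in a separate thread."""
    if n <= 1:
        return n
    result = {}

    def child():
        result["x"] = p_fib(n - 1)

    t = threading.Thread(target=child)
    t.start()              # "spawn": the child may run in parallel
    y = p_fib(n - 2)       # the parent keeps executing
    t.join()               # "sync": wait for the spawned child
    return result["x"] + y


print(p_fib(10))
```

Removing the thread and calling `p_fib(n - 1)` directly gives back the serial program, which mirrors the point that the keywords only mark where parallelism is permitted.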

Slide 11: Computation DAG

Model a multithreaded computation as a computation dag (directed acyclic graph) $G = (V, E)$.

Vertices in $V$ are instructions, or strands (sequences of non-parallel instructions).

Edges in $E$ represent dependencies between instructions or strands: $(u, v) \in E$ means $u$ must execute before $v$.

Continuation edges $(u, v)$ are drawn horizontally and indicate that $v$ is the successor to $u$ in the sequential procedure.

Call edges $(u, v)$ point downwards, indicating that $u$ called $v$ as a normal subprocedure call.

Spawn edges $(u, v)$ point downwards, indicating that $u$ spawned $v$ in parallel.

Return edges point upwards to indicate the next strand executed after returning from a normal procedure call, or after parallel spawning at a sync point.

If $G$ has a directed path from $u$ to $v$, they are logically in series; otherwise they are logically parallel.
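To make the series/parallel distinction concrete, here is a small sketch that stores a dag as an adjacency dictionary and tests reachability. The four-strand dag (a spawn at `a`, a sync at `d`) and the function names are hypothetical, chosen only for illustration.

```python
def reachable(edges, u, v):
    """Return True if the dag has a directed path from u to v."""
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == v:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, []))
    return False


def in_series(edges, u, v):
    """Two strands are in series if a path connects them in either direction."""
    return reachable(edges, u, v) or reachable(edges, v, u)


# Hypothetical dag: a spawns b and continues to c; both feed the sync strand d.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(in_series(dag, "a", "d"))   # a path a -> b -> d exists
print(in_series(dag, "b", "c"))   # no path either way: logically parallel
```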

Slide 12: Computation DAG for Fib(4)

Slide 13: Performance Measures: Work

Work $T_1$ = the total time to execute an algorithm on one processor, that is, the total amount of computational work that gets done.

$T_P$ = the running time of an algorithm on $P$ processors.

Work law: $P \cdot T_P \ge T_1$, or equivalently $T_P \ge T_1 / P$.

The running time on $P$ processors can be no better than the time with one processor divided by $P$.

Slide 14: Performance Measures: Span

Span $T_\infty$ = the total time to execute an algorithm on an infinite number of processors (as many processors as are needed to allow parallelism wherever it is possible).

Span corresponds to the longest time to execute the strands along any path in the computation dag.

Span law: $T_P \ge T_\infty$. A $P$-processor ideal parallel computer cannot run faster than one with an infinite number of processors.
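Work and span can be computed directly from a computation dag: work is the sum of all strand costs, and span is the cost of the most expensive path. The sketch below assumes a small hypothetical dag with unit-cost strands; `work_and_span` is an illustrative name.

```python
from functools import lru_cache


def work_and_span(edges, cost):
    """Return (T_1, T_inf) for a dag given per-strand costs."""
    work = sum(cost.values())

    @lru_cache(maxsize=None)
    def longest_from(u):
        # Cost of the most expensive path that starts at strand u.
        return cost[u] + max((longest_from(v) for v in edges.get(u, ())), default=0)

    span = max(longest_from(u) for u in cost)
    return work, span


# Hypothetical dag: a spawns b and continues to c; both feed the sync strand d.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cost = {"a": 1, "b": 1, "c": 1, "d": 1}
t1, t_inf = work_and_span(dag, cost)

# With P = 2 processors, both laws give lower bounds on T_P.
p = 2
lower_bound = max(t1 / p, t_inf)
print(t1, t_inf, lower_bound)
```

Here the span law is the tighter of the two bounds, because the path through three strands cannot be shortened by adding processors.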

Slide 15: Performance Measures: Speedup

Speedup = $T_1 / T_P$

By the work law, $T_1 / T_P \le P$: one cannot have any more speedup than the number of processors.

Notes:

Parallelism provides only constant-factor improvements (the constant being the number of processors) to any algorithm!

Parallelism cannot move an algorithm from a higher to a lower complexity class (e.g., exponential to polynomial, or quadratic to linear).

Parallelism is not a silver bullet: good algorithm design and analysis are still needed.

Slide 16: Performance Measures: Parallelism

Parallelism = $T_1 / T_\infty$, the potential parallelism of the computation.

Upper bound: the maximum possible speedup that can be achieved on any number of processors.

Limit: the limit on the possibility of attaining perfect linear speedup.

Once the number of processors exceeds the parallelism, the computation cannot possibly achieve perfect linear speedup. The more processors we use beyond the parallelism, the less perfect the speedup.
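Combining the work law and the span law gives $T_1 / T_P \le \min(P, T_1 / T_\infty)$. The short sketch below evaluates that bound for made-up values of $T_1$ and $T_\infty$; the numbers and the name `max_speedup` are purely illustrative.

```python
def max_speedup(t1, t_inf, p):
    """Upper bound on speedup from the work law and the span law."""
    return min(p, t1 / t_inf)


t1, t_inf = 1000, 10          # hypothetical work and span; parallelism is 100
for p in (10, 100, 1000):
    print(p, max_speedup(t1, t_inf, p))
```

Up to 100 processors the bound equals the processor count, so linear speedup is at least possible; beyond that, the bound stays at the parallelism.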

Slide 17: Analysis of Multithreaded Algorithms

Analyzing span:

If in series, the span is the sum of the spans of the subcomputations.

If in parallel, the span is the maximum of the spans of the subcomputations.

Slide 18: Analysis of Fib Number Computation

Work: $T_1(n) = \Theta\left(\left(\frac{1+\sqrt{5}}{2}\right)^n\right)$

Span: $T_\infty(n) = \max(T_\infty(n-1), T_\infty(n-2)) + \Theta(1) = T_\infty(n-1) + \Theta(1) = \Theta(n)$

Parallelism: $T_1(n) / T_\infty(n) = \Theta(F_n / n)$, which grows dramatically, as $F_n$ grows much faster than $n$.
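The work and span recurrences can be evaluated numerically to see how quickly the parallelism grows. This sketch assumes every call costs exactly one unit outside its recursive calls, which stands in for the $\Theta(1)$ term; `work` and `span` are illustrative names.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def work(n):
    """T_1(n): both recursive calls are paid for, plus one unit."""
    if n <= 1:
        return 1
    return work(n - 1) + work(n - 2) + 1


@lru_cache(maxsize=None)
def span(n):
    """T_inf(n): the calls run in parallel, so only the longer one counts."""
    if n <= 1:
        return 1
    return max(span(n - 1), span(n - 2)) + 1


for n in (10, 20, 30):
    print(n, work(n), span(n), work(n) / span(n))
```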

Slide 19: Parallel Loops

The parallel keyword is used with loop constructs such as for.

Problem: multiply an $n \times n$ matrix $A = (a_{ij})$ by an $n$-vector $x = (x_j)$. This yields an $n$-vector $y = (y_i)$ where:

$y_i = \sum_{j=1}^{n} a_{ij} x_j$ for $i = 1, 2, \ldots, n$.

Slide 20: Parallel Loops

The parallel for keywords indicate that each iteration of the loop can be executed concurrently.

Notice that the inner for loop is not parallel; this is a possible point of improvement, to be discussed.

But it is not realistic to think that all $n$ subcomputations in these loops can be spawned immediately with no extra work.
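The matrix-vector pseudocode on this slide was an image. As a rough sketch of the same structure, the outer loop over rows can be handed to a thread pool while the inner loop over columns stays serial. `ThreadPoolExecutor` stands in for parallel for here, the name `mat_vec` is an assumption, and in CPython the global interpreter lock means this shows the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def mat_vec(a, x):
    """Compute y = A x with one task per row of a square matrix A."""
    n = len(a)
    y = [0] * n

    def row(i):
        total = 0
        for j in range(n):        # inner loop stays serial
            total += a[i][j] * x[j]
        y[i] = total              # each task writes a different y[i]

    with ThreadPoolExecutor() as pool:
        list(pool.map(row, range(n)))   # "parallel for i = 1 to n"
    return y


print(mat_vec([[1, 2], [3, 4]], [1, 1]))
```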

Slide 21: Implementing Parallel Loops

This can be accomplished by a compiler with a divide and conquer approach, itself implemented with parallelism.
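The slide's code was not preserved, so the following is only a sketch of the divide-and-conquer idea: split the index range in half, spawn one half, recurse on the other, then sync. The recursion tree has height $\Theta(\lg n)$, which is where the logarithmic term in the span of a parallel loop comes from. The name `parallel_for` and the use of threads as a stand-in for spawn are assumptions.

```python
import threading


def parallel_for(lo, hi, body):
    """Run body(i) for lo <= i < hi by recursive halving of the range."""
    if hi - lo <= 0:
        return
    if hi - lo == 1:
        body(lo)                  # a leaf: one loop iteration
        return
    mid = (lo + hi) // 2
    left = threading.Thread(target=parallel_for, args=(lo, mid, body))
    left.start()                  # "spawn" the left half
    parallel_for(mid, hi, body)   # handle the right half ourselves
    left.join()                   # "sync"


squares = [0] * 8


def set_square(i):
    squares[i] = i * i


parallel_for(0, 8, set_square)
print(squares)
```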

Slide 22: Analysis of Mat-Vec

Work: $T_1(n) = \Theta(n^2)$

Span: $T_\infty(n) = \Theta(\log n) + \Theta(n) = \Theta(n)$ (tree height for spawning, plus the for loop on a leaf)

Parallelism: $\Theta(n)$

Even with full utilization of parallelism, the inner for loop still requires $\Theta(n)$.

Can we improve it?

Slide 23: Improvement: Nested Parallelism?

Make the inner for loop parallel as well? Does it work?

Slide 24: Race Condition

Deterministic algorithms do the same thing on the same input.

Nondeterministic algorithms may give different results on different runs.

Race condition: multiple threads access the same shared data structure or memory location, and at least one of the accesses is a write.
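A race is easiest to see by fixing the order in which two threads' steps are executed. The sketch below does not use real threads; it replays two hand-written schedules for two logical threads that each perform `x = x + 1` as separate read, increment, and write steps. The schedules and names are hypothetical.

```python
def run(schedule):
    """Execute the steps of two logical threads in the given order."""
    shared = {"x": 0}
    regs = {1: None, 2: None}     # each thread's private register
    for thread, step in schedule:
        if step == "read":
            regs[thread] = shared["x"]
        elif step == "incr":
            regs[thread] += 1
        elif step == "write":
            shared["x"] = regs[thread]
    return shared["x"]


serial = [(1, "read"), (1, "incr"), (1, "write"),
          (2, "read"), (2, "incr"), (2, "write")]
interleaved = [(1, "read"), (2, "read"), (1, "incr"),
               (2, "incr"), (1, "write"), (2, "write")]

print(run(serial))        # both increments take effect
print(run(interleaved))   # one increment is lost
```

Both threads read the old value in the interleaved schedule, so the second write overwrites the first and the final result depends on scheduling.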

Slide 25: Race Condition

Slide 26: Race Condition

Slide 27: Example: Matrix Multiplication

Work: $T_1(n) = \Theta(n^3)$

Span: $T_\infty(n) = \Theta(n)$, due to the path for spawning the outer and inner parallel loop executions and then the $n$ executions of the innermost for loop.

Parallelism: $T_1(n) / T_\infty(n) = \Theta(n^3) / \Theta(n) = \Theta(n^2)$

Questions:

Is it also subject to a race condition? Why or why not?

Could we get the span down to $\Theta(1)$ if we parallelized the inner for with parallel for?
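The multiplication pseudocode was an image on the slide. A hedged sketch of the structure described by the analysis, with the two outer loops parallel and the innermost loop serial, might look like this. `ThreadPoolExecutor` stands in for the nested parallel for loops, and the function name is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def p_square_matrix_multiply(a, b):
    """Compute C = A B for square matrices with one task per entry of C."""
    n = len(a)
    c = [[0] * n for _ in range(n)]

    def cell(ij):
        i, j = ij
        total = 0
        for k in range(n):            # innermost loop stays serial
            total += a[i][k] * b[k][j]
        c[i][j] = total               # this task is the only writer of c[i][j]

    pairs = [(i, j) for i in range(n) for j in range(n)]
    with ThreadPoolExecutor() as pool:
        list(pool.map(cell, pairs))   # outer two loops run as parallel tasks
    return c


print(p_square_matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

When thinking about the first question, it may help to note which memory locations each task in this sketch writes.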

Slide 28: Multithreading the divide and conquer algorithm

Refer to Chapter 4, p. 75 for divide and conquer.

Span: $T_\infty(n) = \Theta(\lg^2 n)$

Parallelism: $\Theta(n^3) / \Theta(\lg^2 n)$
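The recurrences behind these bounds are not shown in the extracted text. Assuming the slide follows the textbook's recursive algorithm, in which the eight half-size products run in parallel and the partial results are then added with nested parallel loops of span $\Theta(\lg n)$, the derivation would read:

```latex
\begin{align*}
T_1(n)      &= 8\,T_1(n/2) + \Theta(n^2) = \Theta(n^3) \\
T_\infty(n) &= T_\infty(n/2) + \Theta(\lg n) = \Theta(\lg^2 n) \\
T_1(n) / T_\infty(n) &= \Theta(n^3 / \lg^2 n)
\end{align*}
```

Only one of the eight subproblems appears in the span recurrence because they are logically parallel, so the span takes their maximum rather than their sum.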

Slide 29: Example: Merge Sort

MERGE remains a serial algorithm, so its work and span are $\Theta(n)$ as before.

The work $MS'_1(n)$ of MERGE-SORT′ is $\Theta(n \lg n)$.

The span $MS'_\infty(n)$ of MERGE-SORT′ is $\Theta(n)$.

The parallelism is thus $MS'_1(n) / MS'_\infty(n) = \Theta(n \lg n / n) = \Theta(\lg n)$.

Low parallelism.

How about speeding up the serial MERGE?
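The MERGE-SORT′ pseudocode was not preserved. A minimal sketch, assuming it spawns one recursive call, runs the other in the parent, syncs, and then calls the ordinary serial merge, could look like this (threads stand in for spawn and sync; the function names are illustrative):

```python
import threading


def merge(left, right):
    """Serial merge of two sorted lists: Theta(n) work and Theta(n) span."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def merge_sort_prime(a):
    """Merge sort with parallel recursive calls but a serial merge."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    box = {}

    def child():
        box["left"] = merge_sort_prime(a[:mid])

    t = threading.Thread(target=child)
    t.start()                              # "spawn" the left half
    right = merge_sort_prime(a[mid:])      # sort the right half ourselves
    t.join()                               # "sync"
    return merge(box["left"], right)       # the serial bottleneck


print(merge_sort_prime([5, 2, 4, 7, 1, 3, 2, 6]))
```

The final merge at the top level alone takes linear time on a single thread, which is why the span stays at $\Theta(n)$.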

Slide 30: Parallelizing Merge

Divide-and-conquer on merge: break the sorted lists into four lists, two of which will be merged to form the head of the final list and the other two merged to form the tail.

Choose the longer list to be the first list, $T[p_1 .. r_1]$.

Find the middle element (median) of the first list ($x$ at $q_1$).

Use binary search to find the position ($q_2$) of this element if it were to be inserted in the second list $T[p_2 .. r_2]$.

Recursively merge:

The first list up to just before the median, $T[p_1 .. q_1 - 1]$, and the second list up to the insertion point, $T[p_2 .. q_2 - 1]$.

The first list from just after the median, $T[q_1 + 1 .. r_1]$, and the second list from the insertion point, $T[q_2 .. r_2]$.

Assemble the results with the median element placed between them.
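The steps above can be sketched in Python as follows. This version uses 0-based inclusive indices, `bisect_left` from the standard library as the binary search, and a thread as a stand-in for spawn; the output array `a`, the start position `p3`, and the name `p_merge` are assumptions made for illustration.

```python
import threading
from bisect import bisect_left


def p_merge(t, p1, r1, p2, r2, a, p3):
    """Merge sorted t[p1..r1] and t[p2..r2] (inclusive) into a, starting at p3."""
    n1 = r1 - p1 + 1
    n2 = r2 - p2 + 1
    if n1 < n2:                       # make the first list the longer one
        p1, p2, r1, r2, n1, n2 = p2, p1, r2, r1, n2, n1
    if n1 == 0:                       # both lists are empty
        return
    q1 = (p1 + r1) // 2               # median position of the first list
    q2 = bisect_left(t, t[q1], p2, r2 + 1)   # insertion point in the second list
    q3 = p3 + (q1 - p1) + (q2 - p2)   # final position of the median
    a[q3] = t[q1]
    left = threading.Thread(
        target=p_merge, args=(t, p1, q1 - 1, p2, q2 - 1, a, p3))
    left.start()                      # "spawn" the merge of the two heads
    p_merge(t, q1 + 1, r1, q2, r2, a, q3 + 1)   # merge the two tails
    left.join()                       # "sync"


source = [1, 4, 6, 9, 2, 3, 5, 8]     # two sorted runs: [0..3] and [4..7]
merged = [None] * 8
p_merge(source, 0, 3, 4, 7, merged, 0)
print(merged)
```

Because the median's final position is known before either recursive merge starts, the two halves write to disjoint parts of the output and can proceed independently.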

Slide 31: Parallelizing Merge

Slide 32: Parallelizing Merge Sort

    P-MERGE-SORT(A, p, r, B, s) {
        n = r - p + 1
        if n == 1
            B[s] = A[p]
        else let T[1..n] be a new array
            q = floor((p + r) / 2)
            q' = q - p + 1
            spawn P-MERGE-SORT(A, p, q, T, 1)
            P-MERGE-SORT(A, q + 1, r, T, q' + 1)
            sync
            P-MERGE(T, 1, q', q' + 1, n, B, s)
    }

Slide 33: Analysis of Parallelizing Merge Sort

Refer to pp. 801-804 for details.

Parallelism achieved: $\Theta(n / \lg^2 n)$
