Presentation Transcript

Slide 1

On-the-Fly Pipeline Parallelism

I-Ting Angelina Lee*, Charles E. Leiserson*, Tao B. Schardl*, Jim Sukha†, and Zhunping Zhang*

SPAA 2013

*MIT CSAIL    †Intel Corporation

Slide 2

Dedup PARSEC Benchmark [BKS08]

Dedup compresses a stream of data by compressing unique elements and removing duplicates.

int fd_out = open_output_file();
bool done = false;
while(!done) {
  chunk_t *chunk = get_next_chunk();
  if(chunk == NULL) {
    done = true;
  } else {
    chunk->is_dup = deduplicate(chunk);
    if(!chunk->is_dup) compress(chunk);
    write_to_file(fd_out, chunk);
  }
}

Stage 0: While there is more data, read the next chunk from the stream.
Stage 1: Check for duplicates.
Stage 2: Compress first-seen chunk.
Stage 3: Write to output file.

Slide 3

Parallelism in Dedup

The same loop, with each operation labeled by its pipeline stage:

while(!done) {
  chunk_t *chunk = get_next_chunk();       // Stage 0
  if(chunk == NULL) {
    done = true;
  } else {
    chunk->is_dup = deduplicate(chunk);    // Stage 1
    if(!chunk->is_dup) compress(chunk);    // Stage 2
    write_to_file(fd_out, chunk);          // Stage 3
  }
}

Let's model Dedup's execution as a pipeline dag. A node denotes the execution of a stage in an iteration. Edges denote dependencies between nodes. Dedup exhibits pipeline parallelism.

[Figure: pipeline dag for iterations i0 through i5; each column is an iteration, each row a stage (0-3). Horizontal "cross edges" connect the same serial stage in consecutive iterations.]

Slide 4

Pipeline Parallelism

We can measure parallelism in terms of work and span [CLRS09].

Work T1: The sum of the weights of the nodes in the dag.
Span T∞: The length of a longest path in the dag.
Parallelism T1/T∞: The maximum possible speedup.

Example: [Figure: a weighted pipeline dag in which light nodes have weight 1 and heavy nodes have weight 8.]
  T1 = 75
  T∞ = 20
  T1/T∞ = 3.75

Slide 5

Executing a Parallel Pipeline

[Figure: the pipeline dag for iterations i0 through i5, stages 0-3.]

To execute Dedup in parallel, we must answer two questions:
1. How do we assign work to parallel processors to execute this computation efficiently?
2. How do we encode the parallelism in Dedup?

while(!done) {
  chunk_t *chunk = get_next_chunk();       // Stage 0
  if(chunk == NULL) {
    done = true;
  } else {
    chunk->is_dup = deduplicate(chunk);    // Stage 1
    if(!chunk->is_dup) compress(chunk);    // Stage 2
    write_to_file(fd_out, chunk);          // Stage 3
  }
}

Slide 6

On-the-Fly Pipeline Parallelism

I-Ting Angelina Lee*, Charles E. Leiserson*, Tao B. Schardl*, Jim Sukha†, and Zhunping Zhang*

SPAA, July 24, 2013

*MIT CSAIL    †Intel Corporation

Slide 7

Construct-and-Run Pipelining

Ex: TBB [MRR12], StreamIt [GTA06], GRAMPS [SLY+11]

A construct-and-run pipeline specifies the stages and their dependencies a priori, before execution.

tbb::pipeline pipeline;
GetChunk_Filter    filter1(SERIAL, item);
Deduplicate_Filter filter2(SERIAL);
Compress_Filter    filter3(PARALLEL);
WriteToFile_Filter filter4(SERIAL, out_item);
pipeline.add_filter(filter1);
pipeline.add_filter(filter2);
pipeline.add_filter(filter3);
pipeline.add_filter(filter4);
pipeline.run(pipeline_depth);
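A note on the run() call (a fact about TBB's API, not stated on the slide): the argument to tbb::pipeline::run() is the maximum number of tokens in flight, so pipeline_depth caps how many chunks the pipeline processes concurrently, much like the throttling limit discussed later.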

Slide 8

On-the-Fly Pipelining of X264

An on-the-fly pipeline is constructed dynamically as the program executes.

[Figure: x264 pipeline dag over a sequence of frames I P P P I P P I P P P.]

Not easily expressible using TBB's pipeline construct [RCJ11].

Slide 9

On-the-Fly Pipeline Parallelism in Cilk-P

We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system, named Cilk-P, which features:
- simple linguistics for specifying on-the-fly pipeline parallelism that are composable with Cilk's existing fork-join primitives; and
- PIPER, a theoretically sound randomized work-stealing scheduler that handles both pipeline and fork-join parallelism.

We hand-compiled 3 applications with pipeline parallelism (ferret, dedup, and x264 from PARSEC [BKS08]) to run on Cilk-P. Empirical results indicate that Cilk-P exhibits low serial overhead and good scalability.

Slide 10

Outline

- On-the-Fly Pipeline Parallelism
- The Pipeline Linguistics in Cilk-P
- The PIPER Scheduler
- Empirical Evaluation
- Concluding Remarks

Slide 11

The Pipeline Linguistics in Cilk-P

int fd_out = open_output_file();
bool done = false;
while(!done) {
  chunk_t *chunk = get_next_chunk();
  if(chunk == NULL) {
    done = true;
  } else {
    chunk->is_dup = deduplicate(chunk);
    if(!chunk->is_dup) compress(chunk);
    write_to_file(fd_out, chunk);
  }
}

Slide 12

The Pipeline Linguistics in Cilk-P

int fd_out = open_output_file();
bool done = false;
pipe_while(!done) {
  chunk_t *chunk = get_next_chunk();
  if(chunk == NULL) {
    done = true;
  } else {
    pipe_wait(1);
    chunk->is_dup = deduplicate(chunk);
    pipe_continue(2);
    if(!chunk->is_dup) compress(chunk);
    pipe_wait(3);
    write_to_file(fd_out, chunk);
  }
}

pipe_while: Loop iterations may execute in parallel in a pipelined fashion, where stage 0 executes serially.
pipe_wait(1): End the current stage, advance to stage 1, and wait for the previous iteration to finish stage 1.
pipe_continue(2): End the current stage and advance to stage 2.

Slide 13

The Pipeline Linguistics in Cilk-P

pipe_while(!done) {
  chunk_t *chunk = get_next_chunk();
  if(chunk == NULL) {
    done = true;
  } else {
    pipe_wait(1);
    chunk->is_dup = deduplicate(chunk);
    pipe_continue(2);
    if(!chunk->is_dup) compress(chunk);
    pipe_wait(3);
    write_to_file(fd_out, chunk);
  }
}

[Figure: the resulting pipeline dag, stages 0-3, with horizontal cross edges.]

These keywords denote the logical parallelism of the computation. The pipe_while enforces that stage 0 executes serially. The pipe_wait(1) enforces cross edges across stage 1. The pipe_wait(3) enforces cross edges across stage 3.

Slide 14

The Pipeline Linguistics in Cilk-P

int fd_out = open_output_file();
bool done = false;
pipe_while(!done) {
  chunk_t *chunk = get_next_chunk();
  if(chunk == NULL) {
    done = true;
  } else {
    pipe_wait(1);
    chunk->is_dup = deduplicate(chunk);
    pipe_continue(2);
    if(!chunk->is_dup) compress(chunk);
    pipe_wait(3);
    write_to_file(fd_out, chunk);
  }
}

These keywords have serial semantics: when they are elided or replaced with their serial counterparts, a legal serial program results, whose semantics is one of the legal interpretations of the parallel code [FLR98].
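The serial elision can be made concrete. As a minimal sketch (these macros are hypothetical illustrations, not Cilk-P's actual headers), replacing each pipeline keyword with its serial counterpart recovers the original serial loop:

/* Hypothetical elision macros: compiling the pipelined loop with these
   definitions yields exactly the serial Dedup loop shown earlier. */
#define pipe_while(cond) while (cond)  /* serial counterpart of pipe_while */
#define pipe_wait(s)     /* elided: stage boundaries vanish in serial code */
#define pipe_continue(s) /* elided */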

Slide 15

On-the-Fly Pipelining of X264

The program controls the execution of pipe_wait and pipe_continue statements, thus supporting on-the-fly pipeline parallelism.

[Figure: x264 pipeline dag over the frame sequence I P P P I P P I P P P.]

Program control can thus:
- skip stages;
- make cross edges data dependent; and
- vary the number of stages across iterations.

We can pipeline the x264 video encoder using Cilk-P, as the sketch below illustrates.
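For instance, a minimal sketch of a data-dependent cross edge, written in Cilk-P's linguistics (frame_t, more_frames, read_frame, encode_frame, and P_FRAME are hypothetical helpers, not x264's actual code): an iteration issues pipe_wait only when its frame actually depends on the previous one.

pipe_while(more_frames()) {
  frame_t *f = read_frame();    /* stage 0 executes serially */
  if (f->type == P_FRAME)
    pipe_wait(1);               /* P-frame: wait on the previous iteration's stage 1 */
  else
    pipe_continue(1);           /* I-frame: no cross edge; just advance to stage 1 */
  encode_frame(f);              /* stage 1 */
}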

Slide 16

Pipelining X264 with Pthreads

The scheduling logic is embedded in the application code.

[Figure: the main control thread spawns a worker per frame with pthread_create and collects them with pthread_join.]

Slide 17

Pipelining X264 with Pthreads

The scheduling logic is embedded in the application code.

// Producer side: publish progress.
pthread_mutex_lock(...);
// update my_var
pthread_cond_broadcast(...);
pthread_mutex_unlock(...);

// Consumer side: wait for progress.
pthread_mutex_lock(...);
while (my_var < value) {
  pthread_cond_wait(...);
}
pthread_mutex_unlock(...);

Slide 18

Pipelining X264 with Pthreads

The scheduling logic is embedded in the application code.

[Figure: the same x264 dag; each cross edge is realized by the lock-and-condition-variable protocol above.]

The cross-edge dependencies are enforced via data synchronization with locks and condition variables.
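Concretely, a cross edge from stage 1 of iteration i to stage 1 of iteration i+1 might be enforced like this. This is a self-contained sketch of the pattern the slides show, with hypothetical names (stage1_done_iter, signal_stage1_done, wait_for_stage1), not x264's actual code:

#include <pthread.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int stage1_done_iter = -1;   /* last iteration to finish stage 1 */

/* Producer: iteration i announces that it finished stage 1. */
void signal_stage1_done(int i) {
  pthread_mutex_lock(&m);
  stage1_done_iter = i;             /* update my_var */
  pthread_cond_broadcast(&cv);
  pthread_mutex_unlock(&m);
}

/* Consumer: iteration i+1 blocks until iteration i has finished stage 1. */
void wait_for_stage1(int i) {
  pthread_mutex_lock(&m);
  while (stage1_done_iter < i)      /* while(my_var < value) */
    pthread_cond_wait(&cv, &m);
  pthread_mutex_unlock(&m);
}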

Slide 19

X264 Performance Comparison

[Plot: speedup over serial execution vs. number of processors P.]

Cilk-P achieves performance comparable to Pthreads on x264 without explicit data synchronization.

Slide 20

Outline

- On-the-Fly Pipeline Overview
- The Pipeline Linguistics in Cilk-P
- The PIPER Scheduler
  - A Work-Stealing Scheduler
  - Handling Runaway Pipelines
  - Avoiding Synchronization Overhead
- Concluding Remarks

Slide 21

Guarantees of a Standard Work-Stealing Scheduler [BL99, ABP01]

Definitions:
  TP — execution time on P processors
  T1 — work
  T∞ — span
  T1/T∞ — parallelism
  SP — stack space on P processors
  S1 — stack space of a serial execution

The Work-First Principle [FLR98]: Minimize the scheduling overhead borne by the work path (T1) and amortize it against the steal path (T∞).

Given a computation dag with fork-join parallelism, a standard work-stealing scheduler achieves:
  Time bound: TP ≤ T1/P + O(T∞ + lg P) expected time
    ⇒ linear speedup when P ≪ T1/T∞
  Space bound: SP ≤ P·S1

Slide 22

A Work-Stealing Scheduler (Based on [BL99, ABP01])

[Figure: three workers P execute the pipeline dag for iterations i0 through i5; nodes are numbered 1-24 and marked executing, done, not done, or ready.]

Each worker maintains its own set of ready nodes. If executing a node enables:
- two nodes: mark one ready and execute the other one;
- one node: execute the enabled node;
- zero nodes: execute a node in its ready set.
If the ready set is empty, steal from a randomly chosen worker. A code sketch of this worker loop follows.
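In code, the rule just described might look like the following worker loop. This is a schematic sketch only: the node and worker types and the helpers (ready_set_pop/push, steal_from, random_victim, execute_and_get_enabled) are hypothetical stand-ins for runtime internals.

#include <stdbool.h>
#include <stddef.h>

typedef struct node   node_t;
typedef struct worker worker_t;

/* Hypothetical runtime hooks, named for exposition only. */
bool      computation_done(void);
node_t   *ready_set_pop(worker_t *w);
void      ready_set_push(worker_t *w, node_t *n);
worker_t *random_victim(worker_t *self);
node_t   *steal_from(worker_t *victim);
int       execute_and_get_enabled(node_t *n, node_t *enabled[2]);

void worker_loop(worker_t *self) {
  while (!computation_done()) {
    node_t *n = ready_set_pop(self);
    if (n == NULL)
      n = steal_from(random_victim(self));   /* ready set empty: steal */
    while (n != NULL) {
      node_t *enabled[2];
      int k = execute_and_get_enabled(n, enabled);
      if (k == 2) {
        ready_set_push(self, enabled[0]);    /* mark one ready... */
        n = enabled[1];                      /* ...and execute the other */
      } else if (k == 1) {
        n = enabled[0];                      /* execute the enabled node */
      } else {
        n = ready_set_pop(self);             /* draw from my own ready set */
      }
    }
  }
}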

Slide 23

A Work-Stealing Scheduler (Based on [BL99, ABP01])

[Figure: the same dag one step later; executing node 17 enabled node 18, which the worker executes next. The scheduling rules are as above.]

Slide 24

A Work-Stealing Scheduler (Based on [BL99, ABP01])

[Figure: the same dag; a worker with an empty ready set steals from a randomly chosen victim.]

A node has at most two outgoing edges, so a standard work-stealing scheduler just works ... well, almost.

Slide 25

Outline

- On-the-Fly Pipeline Overview
- The Pipeline Linguistics in Cilk-P
- The PIPER Scheduler
  - A Work-Stealing Scheduler
  - Handling Runaway Pipelines
  - Avoiding Synchronization Overhead
- Concluding Remarks

Slide 26

Runaway Pipeline

[Figure: two workers race ahead, starting stage 0 of many new iterations while earlier iterations remain unfinished.]

A runaway pipeline: one where the scheduler allows many new iterations to be started before finishing old ones.

Problem: unbounded space usage!

Slide 27

Runaway Pipeline

A runaway pipeline: one where the scheduler allows many new iterations to be started before finishing old ones.

Problem: unbounded space usage!

Cilk-P automatically throttles pipelines by inserting a throttling edge between iteration i and iteration i+K, where K is the throttling limit.

[Figure: the same dag with K = 4; a throttling edge from iteration i to iteration i+4 prevents a thief from starting too far ahead.]
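Operationally, a throttling edge behaves much like a counting semaphore with K tokens: an iteration acquires a token before its first stage runs, and a token is returned only when an old iteration fully finishes. A minimal sketch of that analogy (more_work and spawn_iteration are hypothetical; this is not Cilk-P's implementation, which encodes the edge directly in the dag):

#include <semaphore.h>

/* Hypothetical helpers standing in for the runtime's iteration machinery. */
int  more_work(void);
void spawn_iteration(int i);   /* its final stage calls sem_post(&tokens) */

sem_t tokens;                  /* K tokens bound the number of live iterations */

void run_pipeline(int K) {
  sem_init(&tokens, 0, K);
  for (int i = 0; more_work(); i++) {
    sem_wait(&tokens);         /* blocks until an iteration K back has finished */
    spawn_iteration(i);
  }
}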

Slide 28

Outline

- On-the-Fly Pipeline Overview
- The Pipeline Linguistics in Cilk-P
- The PIPER Scheduler
  - A Work-Stealing Scheduler
  - Handling Runaway Pipelines
  - Avoiding Synchronization Overhead
- Concluding Remarks

Slide 29

Synchronization Overhead

[Figure: the pipeline dag; two nodes finishing on different workers both precede the same node.]

If the two predecessors of a node are executed by different workers, synchronization is necessary: whoever finishes last enables the node.

Slide 30

Synchronization Overhead

If the two predecessors of a node are executed by different workers, synchronization is necessary: whoever finishes last enables the node.

[Figure: the pipeline dag, with arrows marking the check-left and check-right operations.]

At pipe_wait(j), iteration i must check left to see if stage j is done in iteration i-1. At the end of a stage, iteration i must check right to see if it enabled a node in iteration i+1.

Cilk-P implements "lazy enabling" to mitigate the check-right overhead and "dependency folding" to mitigate the check-left overhead.
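As a small illustration (the iteration fields left and stage_done are hypothetical, not PIPER's actual data structures), the check-left test at pipe_wait(j) amounts to:

#include <stdbool.h>
#include <stddef.h>

typedef struct iter iter_t;
struct iter {
  iter_t *left;       /* iteration i-1, or NULL once it has retired */
  int stage_done;     /* highest stage this iteration has completed */
};

/* Sketch: may iteration i proceed past pipe_wait(j)?  True when the
   left neighbor is gone or has already finished stage j. */
bool can_proceed(const iter_t *i, int j) {
  return i->left == NULL || i->left->stage_done >= j;
}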

Slide 31

Lazy Enabling

[Figure: a thief steals from a worker and performs the deferred check on iteration i2.]

Idea: Be really, really lazy about the check-right operation. Punt the responsibility of checking right onto a thief that steals from the worker, or defer it until the worker runs out of nodes to execute in its iteration.

Lazy enabling is in accordance with the work-first principle [FLR98].

Slide 32

PIPER's Guarantees

Time bound: TP ≤ T1/P + O(T∞ + lg P) expected time
  ⇒ linear speedup when P ≪ T1/T∞

Space bound: SP ≤ P(S1 + fDK)

Definitions:
  TP — execution time on P processors
  T1 — work
  T∞ — span of the throttled dag
  T1/T∞ — parallelism
  SP — stack space on P processors
  S1 — stack space of a serial execution
  K — throttling limit
  f — maximum frame size
  D — depth of nested pipelines

Slide 33

Outline

- On-the-Fly Pipeline Parallelism
- The Pipeline Linguistics in Cilk-P
- The PIPER Scheduler
- Empirical Evaluation
- Concluding Remarks

Slide 34

Experimental Setup

All experiments were run on an AMD Opteron system with 4 quad-core 2 GHz CPUs and a total of 8 GBytes of memory. Code was compiled using GCC (or G++ for TBB) 4.4.5 with -O3 optimization (except for x264, which uses -O4 by default). The Pthreaded implementations of ferret and dedup employ the oversubscription method, which creates more than one thread per pipeline stage. We limit the number of cores used by the Pthreaded implementations using taskset, but experimented to find the best configuration.

All benchmarks are throttled similarly. Each data point shown is the average of 10 runs, typically with standard deviation less than a few percent.

Slide 35

Ferret Performance Comparison

Throttling limit = 10P. [Plot: speedup over serial execution vs. number of processors P.]

No performance penalty is incurred for using the more general on-the-fly pipeline instead of a construct-and-run pipeline.

Slide 36

Dedup Performance Comparison

Throttling limit = 4P. [Plot: speedup over serial execution vs. number of processors P.]

The measured parallelism of the Cilk-P (and TBB) pipeline is merely 7.4. The Pthreaded implementation has more parallelism due to unordered stages.

Slide 37

X264 Performance Comparison

[Plot: speedup over serial execution vs. number of processors P.]

Cilk-P achieves performance comparable to Pthreads on x264 without explicit data synchronization.

Slide 38

On-the-Fly Pipeline Parallelism in Cilk-P

We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system, named Cilk-P, which features:

simple linguistics that:
- are composable with Cilk's fork-join primitives;
- specify on-the-fly pipelines;
- have serial semantics; and
- allow users to synchronize via control constructs.

the PIPER scheduler, which:
- supports both pipeline and fork-join parallelism;
- is asymptotically efficient;
- uses bounded space; and
- empirically demonstrates low serial overhead and good scalability.

AND Intel has created an experimental branch of its Cilk Plus runtime with support for on-the-fly pipelines based on Cilk-P:
https://intelcilkruntime@bitbucket.org/intelcilkruntime/intel-cilk-runtime.git

Slide 39

Slide 40

Impact of Throttling

How does throttling a pipeline computation affect its performance?

[Figure: a regular pipeline dag and its throttled counterpart.]

If the dag is regular, then the work and span of the throttled dag asymptotically match those of the unthrottled dag.

We automatically throttle to save space, but the user shouldn't worry about throttling affecting performance.

Slide 41

Impact of Throttling

How does throttling a pipeline computation affect its performance?

If the dag is irregular, then there are pipelines where no throttling scheduler can achieve speedup.

[Figure: an irregular pipeline dag, annotated with quantities T1^{1/3} + 1 and (T1^{2/3} + T1^{1/3})/2.]

We automatically throttle to save space, but the user shouldn't worry about throttling affecting performance.

Slide 42

Dedup Performance Comparison

Throttling limit = 4P. [Plot: speedup over serial execution vs. number of processors P.]

Modified Cilk-P uses a single worker thread for writing output, like the Pthreaded implementation, which helps performance as well.

Slide 43

The Cilk Programming Model

int fib(int n) {
  if(n < 2) { return n; }
  int x = cilk_spawn fib(n-1);  // The named child function may execute
                                // in parallel with the parent caller.
  int y = fib(n-2);
  cilk_sync;                    // Control cannot pass this point until
                                // all spawned children have returned.
  return (x + y);
}

Cilk keywords grant permission for parallel execution. They do not command parallel execution.

Slide 44

Pipelining with TBB

Create a pipeline object, then execute:

tbb::parallel_pipeline(
  INNER_PIPELINE_NUM_TOKENS,
  tbb::make_filter< void, one_chunk* >(
    tbb::filter::serial, get_next_chunk)
  & tbb::make_filter< one_chunk*, one_procd_chunk* >(
    tbb::filter::parallel, deduplicate)
  & tbb::make_filter< one_procd_chunk*, one_procd_chunk* >(
    tbb::filter::parallel, compress)
  & tbb::make_filter< one_procd_chunk*, void >(
    tbb::filter::serial, write_to_file)
);
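Two details worth noting about this API (facts about TBB, not stated on the slide): the first argument to tbb::parallel_pipeline() is the maximum number of live tokens, playing the same throttling role as K in Cilk-P, and the & operator concatenates the typed filters into one pipeline, with the output type of each filter matching the input type of the next.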

Slide 45

Pipelining with Pthreads

Encode each stage in its own thread, assign threads to workers, and execute:

void *Fragment(void *targs) {
  ...
  chunk = get_next_chunk();
  buf_insert(&send_buf, chunk);
  ...
}

void *Deduplicate(void *targs) {
  ...
  chunk = buf_remove(&recv_buf);
  is_dup = deduplicate(chunk);
  if (!is_dup)
    buf_insert(&send_buf_compress, chunk);
  else
    buf_insert(&send_buf_reorder, chunk);
  ...
}

void *Compress(void *targs) {
  ...
  chunk = buf_remove(&recv_buf);
  compress(chunk);
  buf_insert(&send_buf, chunk);
  ...
}

void *Reorder(void *targs) {
  ...
  chunk = buf_remove(&recv_buf);
  write_or_enqueue(chunk);
  ...
}

Slide 46

Pipelining X264 with Pthreads

The cross-edge dependencies are enforced via data synchronization with locks and condition variables.

[Figure: the x264 pipeline dag over the frame sequence I P P P I P P I P P P.]

Encoding a video with 512 frames on 16 processors, the total number of invocations in application code is:
  pthread_mutex_lock:      202776
  pthread_cond_broadcast:   34816
  pthread_cond_wait:        10068
