On-the-Fly Pipeline Parallelism. I-Ting Angelina Lee, Charles E. Leiserson, Tao B. Schardl, Jim Sukha, and Zhunping Zhang. SPAA 2013. MIT CSAIL and Intel Corporation.
On-the-Fly Pipeline Parallelism
I-Ting Angelina Lee*, Charles E. Leiserson*, Tao B. Schardl*, Jim Sukha†, and Zhunping Zhang*
SPAA 2013
*MIT CSAIL    †Intel Corporation
Dedup PARSEC Benchmark [BKS08]

int fd_out = open_output_file();
bool done = false;
while (!done) {
  chunk_t *chunk = get_next_chunk();
  if (chunk == NULL) {
    done = true;
  } else {
    chunk->is_dup = deduplicate(chunk);
    if (!chunk->is_dup) compress(chunk);
    write_to_file(fd_out, chunk);
  }
}

Stage 0: While there is more data, read the next chunk from the stream.
Stage 1: Check for duplicates.
Stage 2: Compress first-seen chunks.
Stage 3: Write to output file.

Dedup compresses a stream of data by compressing unique elements and removing duplicates.
Parallelism in Dedup

while (!done) {
  chunk_t *chunk = get_next_chunk();        // Stage 0
  if (chunk == NULL) {
    done = true;
  } else {
    chunk->is_dup = deduplicate(chunk);     // Stage 1
    if (!chunk->is_dup) compress(chunk);    // Stage 2
    write_to_file(fd_out, chunk);           // Stage 3
  }
}

[Figure: pipeline dag over iterations i0–i5, one row per stage 0–3; dashed arrows mark cross edges.]

Let's model Dedup's execution as a pipeline dag. A node denotes the execution of a stage in an iteration. Edges denote dependencies between nodes. Dedup exhibits pipeline parallelism.
Pipeline Parallelism

We can measure parallelism in terms of work and span [CLRS09].

Work T1: The sum of the weights of the nodes in the dag.
Span T∞: The length of a longest path in the dag.
Parallelism T1/T∞: The maximum possible speedup.

Example (small nodes have weight 1, large nodes weight 8):
  T1 = 75
  T∞ = 20
  T1/T∞ = 3.75
Executing a Parallel Pipeline

[Figure: pipeline dag over iterations i0–i5 and stages 0–3.]

while (!done) {
  chunk_t *chunk = get_next_chunk();        // Stage 0
  if (chunk == NULL) {
    done = true;
  } else {
    chunk->is_dup = deduplicate(chunk);     // Stage 1
    if (!chunk->is_dup) compress(chunk);    // Stage 2
    write_to_file(fd_out, chunk);           // Stage 3
  }
}

To execute Dedup in parallel, we must answer two questions:
1. How do we assign work to parallel processors to execute this computation efficiently?
2. How do we encode the parallelism in Dedup?
On-the-Fly Pipeline Parallelism
I-Ting Angelina Lee*, Charles E. Leiserson*, Tao B. Schardl*, Jim Sukha†, and Zhunping Zhang*
SPAA, July 24, 2013
*MIT CSAIL    †Intel Corporation
Construct-and-Run Pipelining

A construct-and-run pipeline specifies the stages and their dependencies a priori, before execution. Ex: TBB [MRR12], StreamIt [GTA06], GRAMPS [SLY+11].

tbb::pipeline pipeline;
GetChunk_Filter     filter1(SERIAL, item);
Deduplicate_Filter  filter2(SERIAL);
Compress_Filter     filter3(PARALLEL);
WriteToFile_Filter  filter4(SERIAL, out_item);
pipeline.add_filter(filter1);
pipeline.add_filter(filter2);
pipeline.add_filter(filter3);
pipeline.add_filter(filter4);
pipeline.run(pipeline_depth);
On-the-Fly Pipelining of X264

An on-the-fly pipeline is constructed dynamically as the program executes.

[Figure: x264 pipeline dag over the frame sequence I P P P I P P I P P P; the dependency structure varies with frame type.]

Not easily expressible using TBB's pipeline construct [RCJ11].
On-the-Fly Pipeline Parallelism in Cilk-P

We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system, named Cilk-P, which features:
- simple linguistics for specifying on-the-fly pipeline parallelism that are composable with Cilk's existing fork-join primitives; and
- PIPER, a theoretically sound randomized work-stealing scheduler that handles both pipeline and fork-join parallelism.

We hand-compiled 3 applications with pipeline parallelism (ferret, dedup, and x264 from PARSEC [BKS08]) to run on Cilk-P. Empirical results indicate that Cilk-P exhibits low serial overhead and good scalability.
Outline

- On-the-Fly Pipeline Parallelism
- The Pipeline Linguistics in Cilk-P
- The PIPER Scheduler
- Empirical Evaluation
- Concluding Remarks
The Pipeline Linguistics in Cilk-P

int fd_out = open_output_file();
bool done = false;
while (!done) {
  chunk_t *chunk = get_next_chunk();
  if (chunk == NULL) {
    done = true;
  } else {
    chunk->is_dup = deduplicate(chunk);
    if (!chunk->is_dup) compress(chunk);
    write_to_file(fd_out, chunk);
  }
}
The Pipeline Linguistics in Cilk-P

int fd_out = open_output_file();
bool done = false;
pipe_while (!done) {                     // Loop iterations may execute in
  chunk_t *chunk = get_next_chunk();     // parallel in a pipelined fashion,
  if (chunk == NULL) {                   // where stage 0 executes serially.
    done = true;
  } else {
    pipe_wait(1);                        // End the current stage, advance to
                                         // stage 1, and wait for the previous
    chunk->is_dup = deduplicate(chunk);  // iteration to finish stage 1.
    pipe_continue(2);                    // End the current stage and advance
    if (!chunk->is_dup) compress(chunk); // to stage 2.
    pipe_wait(3);
    write_to_file(fd_out, chunk);
  }
}
The Pipeline Linguistics in Cilk-P

pipe_while (!done) {
  chunk_t *chunk = get_next_chunk();
  if (chunk == NULL) {
    done = true;
  } else {
    pipe_wait(1);
    chunk->is_dup = deduplicate(chunk);
    pipe_continue(2);
    if (!chunk->is_dup) compress(chunk);
    pipe_wait(3);
    write_to_file(fd_out, chunk);
  }
}

[Figure: pipeline dag with stages 0–3; dashed arrows mark cross edges.]

These keywords denote the logical parallelism of the computation. The pipe_while enforces that stage 0 executes serially. The pipe_wait(1) enforces cross edges across stage 1. The pipe_wait(3) enforces cross edges across stage 3.
The Pipeline Linguistics in Cilk-P

int fd_out = open_output_file();
bool done = false;
pipe_while (!done) {
  chunk_t *chunk = get_next_chunk();
  if (chunk == NULL) {
    done = true;
  } else {
    pipe_wait(1);
    chunk->is_dup = deduplicate(chunk);
    pipe_continue(2);
    if (!chunk->is_dup) compress(chunk);
    pipe_wait(3);
    write_to_file(fd_out, chunk);
  }
}

These keywords have serial semantics: when elided or replaced with their serial counterparts, a legal serial program results, whose semantics is one of the legal interpretations of the parallel code [FLR98].
On-the-Fly Pipelining of X264

The program controls the execution of pipe_wait and pipe_continue statements, thus supporting on-the-fly pipeline parallelism.

[Figure: x264 pipeline dag over the frame sequence I P P P I P P I P P P.]

Program control can thus:
- skip stages;
- make cross edges data dependent; and
- vary the number of stages across iterations.

We can pipeline the x264 video encoder using Cilk-P.
Pipelining X264 with Pthreads

The scheduling logic is embedded in the application code.

[Figure: the main control thread spawns a worker per frame with pthread_create and collects results with pthread_join.]
Pipelining X264 with Pthreads

The scheduling logic is embedded in the application code. One worker publishes its progress while another waits on it:

  pthread_mutex_lock
  update my_var
  pthread_cond_broadcast
  pthread_mutex_unlock

  pthread_mutex_lock
  while (my_var < value) { pthread_cond_wait }
  pthread_mutex_unlock
Pipelining X264 with Pthreads

The scheduling logic is embedded in the application code.

[Figure: x264 pipeline dag over the frame sequence I P P P I P P I P P P.]

The cross-edge dependencies are enforced via data synchronization with locks and condition variables.
X264 Performance Comparison

[Plot: speedup over serial execution vs. number of processors (P).]

Cilk-P achieves comparable performance to Pthreads on x264 without explicit data synchronization.
Outline

- On-the-Fly Pipeline Overview
- The Pipeline Linguistics in Cilk-P
- The PIPER Scheduler
  - A Work-Stealing Scheduler
  - Handling Runaway Pipelines
  - Avoiding Synchronization Overhead
- Concluding Remarks
Guarantees of a Standard Work-Stealing Scheduler [BL99, ABP01]

Definitions:
  TP: execution time on P processors
  T1: work
  T∞: span
  T1/T∞: parallelism
  SP: stack space on P processors
  S1: stack space of a serial execution

Given a computation dag with fork-join parallelism, it achieves:
  Time bound: TP ≤ T1/P + O(T∞ + lg P) expected time,
    giving linear speedup when P ≪ T1/T∞.
  Space bound: SP ≤ P·S1.

The Work-First Principle [FLR98]. Minimize the scheduling overhead borne by the work path (T1) and amortize it against the steal path (T∞).
A Work-Stealing Scheduler (Based on [BL99, ABP01])

[Figure: pipeline dag over iterations i0–i5 with nodes numbered in execution order and marked executing, done, not done, or ready; three workers are executing nodes 17, 18, and 21.]

Each worker maintains its own set of ready nodes. If executing a node enables:
- two nodes: mark one ready and execute the other one;
- one node: execute the enabled node;
- zero nodes: execute a node in its ready set.
If the ready set is empty, steal from a randomly chosen worker.
A Work-Stealing Scheduler (Based on [BL99, ABP01])

[Figure: the same dag one step later; a worker now executes node 18. The ready-node rules repeat.]
A Work-Stealing Scheduler (Based on [BL99, ABP01])

[Figure: a fourth worker with an empty ready set steals from a randomly chosen victim. The ready-node rules repeat.]

A node has at most two outgoing edges, so a standard work-stealing scheduler just works ... well, almost.
Outline

- On-the-Fly Pipeline Overview
- The Pipeline Linguistics in Cilk-P
- The PIPER Scheduler
  - A Work-Stealing Scheduler
  - Handling Runaway Pipelines
  - Avoiding Synchronization Overhead
- Concluding Remarks
Runaway Pipeline

[Figure: dag in which two workers have raced ahead to nodes 17 and 21 while early iterations remain unfinished.]

A runaway pipeline: one in which the scheduler allows many new iterations to be started before finishing old ones.

Problem: unbounded space usage!
Runaway Pipeline

A runaway pipeline: one in which the scheduler allows many new iterations to be started before finishing old ones. Problem: unbounded space usage!

Cilk-P automatically throttles pipelines by inserting a throttling edge between iteration i and iteration i+K, where K is the throttling limit.

[Figure: dag with throttling limit K = 4; a worker that would start iteration i+4 must steal instead until iteration i completes.]
Outline

- On-the-Fly Pipeline Overview
- The Pipeline Linguistics in Cilk-P
- The PIPER Scheduler
  - A Work-Stealing Scheduler
  - Handling Runaway Pipelines
  - Avoiding Synchronization Overhead
- Concluding Remarks
Synchronization Overhead

[Figure: pipeline dag in which the two predecessors of a node are executed by different workers.]

If the two predecessors of a node are executed by different workers, synchronization is necessary: whoever finishes last enables the node.
Synchronization Overhead

If the two predecessors of a node are executed by different workers, synchronization is necessary: whoever finishes last enables the node.

At pipe_wait(j), iteration i must check left to see if stage j is done in iteration i-1. At the end of a stage, iteration i must check right to see if it enabled a node in iteration i+1.

Cilk-P implements "lazy enabling" to mitigate the check-right overhead and "dependency folding" to mitigate the check-left overhead.

[Figure: the check-left and check-right operations around node 10 of the dag.]
Lazy Enabling

Idea: Be really, really lazy about the check-right operation. Punt the responsibility of checking right onto a thief, or defer it until the worker runs out of nodes to execute in its iteration.

Lazy enabling is in accordance with the work-first principle [FLR98].

[Figure: a stealing thief performs the deferred check ("check i2?") on the victim's iteration.]
PIPER's Guarantees

Definitions:
  TP: execution time on P processors
  T1: work
  T∞: span of the throttled dag
  T1/T∞: parallelism
  SP: stack space on P processors
  S1: stack space of a serial execution
  K: throttling limit
  f: maximum frame size
  D: depth of nested pipelines

Time bound: TP ≤ T1/P + O(T∞ + lg P) expected time,
  giving linear speedup when P ≪ T1/T∞.
Space bound: SP ≤ P(S1 + fDK).
Outline

- On-the-Fly Pipeline Parallelism
- The Pipeline Linguistics in Cilk-P
- The PIPER Scheduler
- Empirical Evaluation
- Concluding Remarks
Experimental Setup

All experiments were run on an AMD Opteron system with 4 quad-core 2GHz CPUs and a total of 8 GBytes of memory. Code was compiled using GCC (or G++ for TBB) 4.4.5 with -O3 optimization (except for x264, which uses -O4 by default).

The Pthreaded implementations of ferret and dedup employ the oversubscription method, which creates more than one thread per pipeline stage. We limit the number of cores used by the Pthreaded implementations using taskset, but experimented to find the best configuration.

All benchmarks are throttled similarly. Each data point shown is the average of 10 runs, typically with standard deviation less than a few percent.
Ferret Performance Comparison

[Plot: speedup over serial execution vs. number of processors (P); throttling limit = 10P.]

No performance penalty is incurred for using the more general on-the-fly pipeline instead of a construct-and-run pipeline.
Dedup Performance Comparison

[Plot: speedup over serial execution vs. number of processors (P); throttling limit = 4P.]

Measured parallelism for Cilk-P's (and TBB's) pipeline is merely 7.4. The Pthreaded implementation has more parallelism due to unordered stages.
X264 Performance Comparison

[Plot: speedup over serial execution vs. number of processors (P).]

Cilk-P achieves comparable performance to Pthreads on x264 without explicit data synchronization.
On-the-Fly Pipeline Parallelism in Cilk-P

We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system, named Cilk-P, which features:

simple linguistics that:
- are composable with Cilk's fork-join primitives;
- specify on-the-fly pipelines;
- have serial semantics; and
- allow users to synchronize via control constructs;

AND the PIPER scheduler that:
- supports both pipeline and fork-join parallelism;
- is asymptotically efficient;
- uses bounded space; and
- empirically demonstrates low serial overhead and good scalability.

Intel has created an experimental branch of its Cilk Plus runtime with support for on-the-fly pipelines based on Cilk-P:
https://intelcilkruntime@bitbucket.org/intelcilkruntime/intel-cilk-runtime.git
Impact of Throttling

How does throttling a pipeline computation affect its performance?

[Figure: a regular pipeline dag and its throttled counterpart.]

If the dag is regular, then the work and span of the throttled dag asymptotically match those of the unthrottled dag. We automatically throttle to save space, but the user shouldn't worry about throttling affecting performance.
Impact of Throttling

How does throttling a pipeline computation affect its performance? We automatically throttle to save space, but the user shouldn't worry about throttling affecting performance.

If the dag is irregular, then there are pipelines where no throttling scheduler can achieve speedup.

[Figure: an irregular pipeline dag whose stages have weights T1^(1/3) + 1, T1^(1/3) + 1, and (T1^(2/3) + T1^(1/3))/2.]
Dedup Performance Comparison

[Plot: speedup over serial execution vs. number of processors (P); throttling limit = 4P.]

Modified Cilk-P uses a single worker thread for writing out output, like the Pthreaded implementation, which helps performance as well.
The Cilk Programming Model

int fib(int n) {
  if (n < 2) { return n; }
  int x = cilk_spawn fib(n-1);  // The named child function may execute
  int y = fib(n-2);             // in parallel with the parent caller.
  cilk_sync;                    // Control cannot pass this point until
  return (x + y);               // all spawned children have returned.
}

Cilk keywords grant permission for parallel execution. They do not command parallel execution.
Pipelining with TBB

Create a pipeline object. Execute.

…
tbb::parallel_pipeline(
  INNER_PIPELINE_NUM_TOKENS,
  tbb::make_filter< void, one_chunk* >(
    tbb::filter::serial, get_next_chunk)
  &
  tbb::make_filter< one_chunk*, one_procd_chunk* >(
    tbb::filter::parallel, deduplicate)
  &
  tbb::make_filter< one_procd_chunk*, one_procd_chunk* >(
    tbb::filter::parallel, compress)
  &
  tbb::make_filter< one_procd_chunk*, void >(
    tbb::filter::serial, write_to_file)
);
…
Pipelining with Pthreads

Encode each stage in its own thread. Assign threads to workers. Execute.

void *Fragment(void *targs) {
  … chunk = get_next_chunk();
  buf_insert(&send_buf, chunk); …
}

void *Deduplicate(void *targs) {
  … chunk = buf_remove(&recv_buf);
  is_dup = deduplicate(chunk);
  if (!is_dup) buf_insert(&send_buf_compress, chunk);
  else buf_insert(&send_buf_reorder, chunk); …
}

void *Compress(void *targs) {
  … chunk = buf_remove(&recv_buf);
  compress(chunk);
  buf_insert(&send_buf, chunk); …
}

void *Reorder(void *targs) {
  … chunk = buf_remove(&recv_buf);
  write_or_enqueue(chunk); …
}
Pipelining X264 with Pthreads

The cross-edge dependencies are enforced via data synchronization with locks and condition variables.

[Figure: x264 pipeline dag over the frame sequence I P P P I P P I P P P.]

Encoding a video with 512 frames on 16 processors, the total number of invocations in application code:
  pthread_mutex_lock:      202776
  pthread_cond_broadcast:  34816
  pthread_cond_wait:       10068