Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs

Michael I. Gordon, William Thies, and Saman Amarasinghe
Massachusetts Institute of Technology
Computer Science and Artificial Intelligence Laboratory
{mgordon, thies, saman}@mit.edu

Abstract

As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications (image, video, DSP, etc.) are naturally represented by independent filters that communicate over explicit data channels.

In this paper, we demonstrate an end-to-end stream compiler that attains robust multicore performance in the face of varying application characteristics. As benchmarks exhibit different amounts of task, data, and pipeline parallelism, we exploit all types of parallelism in a unified manner in order to achieve this generality. Our compiler, which maps from the StreamIt language to the 16-core Raw architecture, attains an 11.2x mean speedup over a single-core baseline, and a 1.84x speedup over our previous work.

Categories and Subject Descriptors: D.3.2 [Programming Languages]: Language Classifications: Data-flow languages; D.3.3 [Programming Languages]: Language Constructs and Features: Concurrent programming structures; D.3.4 [Programming Languages]: Processors: Compilers, Optimization

General Terms: Design, Languages, Performance

Keywords: coarse-grained dataflow, multicore, Raw, software pipelining, StreamIt, streams

1. Introduction

As centralized microprocessors are ceasing to scale effectively, multicore architectures are becoming the industry standard. For example, the IBM/Toshiba/Sony Cell processor has 9 cores [17], the Sun Niagara has 8 cores [21], the RMI XLR732 has 8 cores [1], the IBM/Microsoft Xbox 360 CPU has 3 cores [4], and most vendors are shipping dual-core chips. Cisco has described a next-generation network processor containing 192 Tensilica Xtensa cores [14]. This trend has pushed the performance burden to the compiler, as future application-level performance gains depend on effective parallelization across the cores. Unfortunately, traditional programming models such as C, C++, and FORTRAN are ill-suited to multicore architectures because they assume a single instruction stream and a monolithic memory. Extracting the coarse-grained parallelism suitable for multicore execution amounts to a heroic compiler analysis that remains largely intractable.

The stream programming paradigm offers a promising approach for exposing parallelism suitable for multicore architectures. Stream languages such as StreamIt [39], Brook [6], SPUR [42], Cg [27], Baker [9], and Spidle [10] are motivated not only by trends in computer architecture, but also by trends in the application space, as network, image, voice, and multimedia programs are becoming only more prevalent. In the StreamIt language, a program is represented as a set of autonomous actors that communicate through FIFO data channels (see Figure 1). During program execution, actors fire repeatedly in a periodic schedule. As each actor has a separate program counter and an independent address space, all dependences between actors are made explicit by the communication channels. Compilers can leverage this dependence information to orchestrate parallel execution.

Despite the abundance of parallelism in stream programs, it is nonetheless a challenging problem to obtain an efficient mapping to a multicore architecture. Often the gains from parallel execution can be overshadowed by the costs of communication and synchronization. In addition, not all parallelism has equal benefits, as there is sometimes a critical path that can only be reduced by running certain actors in parallel. Due to these concerns, it is critical to leverage the right combination of task, data, and pipeline parallelism while avoiding the hazards associated with each.

Task parallelism refers to pairs of actors that are on different parallel branches of the original stream graph, as written by the programmer. That is, the output of each actor never reaches the input of the other. In stream programs, task parallelism reflects logical parallelism in the underlying algorithm. It is easy to exploit by mapping each task to an independent processor and splitting or joining the data stream at the endpoints (see Figure 2b). The hazards associated with task parallelism are the communication and synchronization associated with the splits and joins. Also, as the granularity of task parallelism depends on the application (and the programmer), it is not sufficient as the only source of parallelism.

Data parallelism refers to an actor that has no dependences between one execution and the next. Such "stateless" actors offer unlimited data parallelism, as different instances of the actor can be spread across any number of computation units (see Figure 2c). However, while data parallelism is well-suited to vector machines, on coarse-grained multicore architectures it can introduce excessive communication overhead. Previous data-parallel streaming architectures have focused on designing a special memory hierarchy to support this communication [18]. However, data parallelism has the hazard of increasing buffering and latency, and the limitation of being unable to parallelize actors with state. (A stateless actor may still have read-only state.)

Pipeline parallelism applies to chains of producers and consumers that are directly connected in the stream graph. In our previous work [15], we exploited pipeline parallelism by mapping clusters of producers and consumers to different cores and using an on-chip network for direct communication between actors (see Figure 2d). Compared to data parallelism, this approach offers reduced latency, reduced buffering, and good locality. It does not introduce any extraneous communication, and it provides the ability to execute any pair of stateful actors in parallel. However, this form of pipelining introduces extra synchronization, as producers and consumers must stay tightly coupled in their execution. In addition, effective load balancing is critical, as the throughput of the stream graph is equal to the minimum throughput across all of the processors.

In this paper, we describe a robust compiler system that leverages the right combination of task, data, and pipeline parallelism to achieve good multicore performance across a wide range of input programs. Because no single type of parallelism is a perfect fit for all situations, a unified approach is needed to obtain consistent results. Using the StreamIt language as our input and targeting the 16-core Raw architecture, our compiler demonstrates a mean speedup of 11.2x over a single-core baseline; several of our 12 benchmarks speed up by over 12x. This also represents a 1.84x improvement over our previous work [15].

As part of this effort, we have developed two new compiler techniques that are generally applicable to any coarse-grained multicore architecture. The first technique leverages data parallelism, but avoids the communication overhead by first increasing the granularity of the stream graph. Using program analysis, we fuse actors in the graph as much as possible so long as the result is stateless. Each fused actor has a significantly higher computation to communication ratio, and thus incurs significantly reduced communication overhead in being duplicated across cores. To further reduce the communication costs, the technique also leverages task parallelism; for example, two balanced task-parallel actors need only be split across half of the cores in order to obtain high utilization. On Raw, coarse-grained data parallelism achieves a mean speedup of 9.9x over a single core and 4.4x over a task-parallel baseline.

The second technique leverages pipeline parallelism. However, to avoid the pitfall of synchronization, it employs software pipelining techniques to execute actors from different iterations in parallel. While software pipelining is traditionally applied at the instruction level, we leverage powerful properties of the stream programming model to apply the same technique at a coarse level of granularity. This effectively removes all dependences between actors scheduled in a steady-state iteration of the stream graph, greatly increasing the scheduling freedom. Like hardware-based pipelining, software pipelining allows stateful actors to execute in parallel. However, it avoids the synchronization overhead, because processors are reading and writing into a buffer rather than directly communicating with another processor. On Raw, coarse-grained software pipelining achieves a 7.7x speedup over a single core and a 3.4x speedup over a task-parallel baseline.

Combining the techniques yields the most general results, as data parallelism offers good load balancing for stateless actors, while software pipelining enables stateful actors to execute in parallel. Any task parallelism in the application is also naturally utilized, or judiciously collapsed during granularity adjustment. This integrated treatment of coarse-grained parallelism leads to an overall speedup of 11.2x over a single core and 5.0x over a task-parallel baseline.

2. The StreamIt Language

StreamIt is an architecture-independent programming language for high-performance streaming applications [39, 2]. As described previously, it represents programs as a set of independent actors (referred to as filters in StreamIt) that use explicit data channels for all communication. Each filter contains a work function that executes a single step of the filter. From within work, filters can push items onto the output channel, pop items from the input channel, or peek at an input item without removing it from the channel. While peeking requires special care in parts of our analysis, it is critical for exposing data parallelism in sliding-window filters (e.g., FIR filters), as they would otherwise need internal state.

[Figure 1. Stream graph for a simplified subset of our Vocoder benchmark. Following a set of sliding DFTs, the signal is converted to polar coordinates. Node S2 sends the magnitude component to the left and the phase component to the right. In this simplified example, no magnitude adjustment is needed.]

StreamIt provides three hierarchical primitives for composing filters into larger stream graphs. A pipeline connects streams sequentially; for example, there is a four-element pipeline beginning with UnwrapPhase in the Vocoder example (Figure 1). A splitjoin specifies independent, task-parallel streams that diverge from a common splitter and merge into a common joiner. For example, the AdaptiveDFT filters in Figure 1 form a two-element splitjoin (each filter is configured with different parameters). The final hierarchical primitive in StreamIt is the feedbackloop, which provides a way to create cycles in the graph. In practice, feedbackloops are rare, and we do not consider them in this paper.

In this paper, we require that the push, pop, and peek rates of each filter are known at compile time. This enables the compiler to calculate a steady-state for the stream graph: a repetition of each filter that does not change the number of items buffered on any data channel [26, 19]. In combination with a simple program analysis that estimates the number of operations performed on each invocation of a given work function, the steady-state repetitions offer an estimate of the work performed by a given filter as a fraction of the overall program execution. This estimate is important for our software pipelining technique.
[Figure 2. Parallel execution models for stream programs: (a) sequential; (b) task parallel; (c) task and data parallel; (d) task, data, and pipeline parallel. Each block corresponds to a filter in the Vocoder example (Figure 1); the height of a block reflects the amount of work contained in the filter.]

3. Coarse-Grained Data Parallelism

There is typically widespread data parallelism in a stream graph, as stateless filters can be applied in parallel to different parts of the data stream. For example, Figure 3a depicts our FilterBank benchmark, in which all of the filters are stateless. We detect stateless filters using a simple program analysis that tests whether there are any filter fields (state variables) that are written during one iteration of the work function and read during another (a sketch of this test appears below). To leverage the implicit data parallelism of stateless filters, we convert the filters into an explicit splitjoin of many filters, such that each filter can be mapped to a separate processor. This process, which is called filter fission, causes the steady-state work of the original filter to be evenly split across the components of the splitjoin (though the work function is largely unchanged, each fission product executes less frequently than the original filter). Fission is described in more detail elsewhere [15].

On a multicore architecture, it can be expensive to distribute data to and from the parallel products of filter fission. As a starting point, our technique only fisses filters in which the estimated computation to communication ratio is above a given threshold. (In our experiments, we use a threshold of 10 compute instructions per item of communication.) To further mitigate the communication cost, we introduce two new techniques: coarsening the granularity and complementing task parallelism.
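As a rough illustration of the statelessness test described above, the sketch below checks a per-filter summary of field accesses. The access sets are hypothetical inputs that a real implementation would compute from the work function's body, and the test is conservative: a field that is always written before it is read within one iteration need not be state, a refinement this sketch omits.

    def is_stateless(fields, read_in_work, written_in_work):
        # A filter field carries state across iterations if work() both
        # writes it (in one iteration) and reads it (in a later one).
        # Conservatively flag any field that is both read and written;
        # a filter with such a field is not safe to fiss.
        carried = [f for f in fields if f in read_in_work and f in written_in_work]
        return not carried

    # FIR-style filter: coefficients are read-only, hence not state; a
    # running accumulator that is read and written is state.
    print(is_stateless({"coef", "sum"},
                       read_in_work={"coef", "sum"},
                       written_in_work={"sum"}))   # False: `sum` carries state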

3.1 Coarsening the Granularity

Stateless filters are often connected together in a pipeline (see Figure 3a). Using naive filter fission, each filter is converted to a splitjoin that scatters and gathers data to and from the data-parallel components. This corresponds to fine-grained data parallelism at the loop level. However, as this introduces excessive communication, we instead aim to fiss the entire pipeline into many parallel pipelines, such that each data-parallel unit maintains local communication. We implement this functionality by first fusing the pipeline into a single filter and then fissing the filter into a data-parallel splitjoin. We apply fusion first because we assume that each data-parallel product will execute on a single core; fusion enables powerful inter-node optimizations such as scalar replacement [35] and algebraic simplification [3, 24].

Some pipelines in the application cannot be fully fused without introducing internal state, thereby eliminating the data parallelism. For example, in Figure 3a, the LowPassFilter and HighPassFilter perform peeking (a sliding window computation) and always require a number of data items to be present on the input channel. If either filter is fused with filters above it, the data items will become state of the fused filter, thereby prohibiting data parallelism. In the general case, the number of persistent items buffered between filters depends on the initialization schedule [20].

Thus, our algorithm for coarsening the granularity of data-parallel regions operates by fusing pipeline segments as much as possible, so long as the result of each fusion is stateless. For every pipeline in the application, the algorithm identifies the largest sub-segments that contain neither stateful filters nor buffered items, and fuses those sub-segments into single filters. It is important to note that pipelines may contain splitjoins in addition to filters, and thus stateless splitjoins may be fused during this process. While such fusion temporarily removes task parallelism, this parallelism will be restored in the form of data parallelism once the resulting filter is fissed. The output of the algorithm on the FilterBank benchmark is illustrated in Figure 3b.
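A sketch of this coarsening pass over a single pipeline follows. The Node fields and the fuse helper are hypothetical stand-ins for the compiler's IR, and the handling of nested splitjoins is omitted.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        name: str
        stateful: bool = False
        buffered_input: bool = False  # items persist on input across steady states

    def fuse(nodes):
        return Node("Fused(" + ",".join(n.name for n in nodes) + ")")

    def coarsen_pipeline(nodes):
        # Fuse the largest contiguous segments that contain neither a
        # stateful filter nor persistently buffered input items (e.g.,
        # from peeking), so every fused result remains stateless.
        segments, current = [], []

        def flush():
            if current:
                segments.append(fuse(current) if len(current) > 1 else current[0])
                current.clear()

        for node in nodes:
            if node.stateful:
                flush()
                segments.append(node)   # left alone; cannot be fissed
            elif node.buffered_input:
                flush()                 # fusing upward would trap the items as state
                current.append(node)    # node may still begin a new segment
            else:
                current.append(node)
        flush()
        return segments

    # Mirrors one FilterBank pipeline (Figure 3): the peeking HighPassFilter
    # stays unfused from the LowPassFilter above it, yielding the
    # LowPass | Fused(HighPass, Compressor, Process, Expander) split of Figure 3b.
    pipe = [Node("LowPass", buffered_input=True), Node("HighPass", buffered_input=True),
            Node("Compressor"), Node("Process"), Node("Expander")]
    print([n.name for n in coarsen_pipeline(pipe)])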
[Figure 3. Exploiting coarse-grained data parallelism in the FilterBank benchmark: (a) original stream graph; (b) after granularity coarsening; (c) followed by judicious filter fission. Only one pipeline of the top-level splitjoin is shown; the other parallel streams are identical and are transformed in the same way.]

3.2 Complementing Task Parallelism

Even if every filter in an application is data-parallel, it may not be desirable to fiss each filter across all of the cores. Doing so would eliminate all task parallelism from the execution schedule, as only one filter from the original application could execute at a given time. An alternate approach is to preserve the task parallelism in the original application, and only introduce enough data parallelism to fill any idle processors. This serves to reduce the synchronization imposed by filter fission, as filters are fissed to a smaller extent and will span a more local area of the chip. Also, the task-parallel filters are a natural part of the algorithm and avoid any computational overhead imposed by filter fission (e.g., fission of peeking filters introduces a decimation stage on each fission product).

In order to balance task and data parallelism, we employ a "judicious fission" heuristic that estimates the amount of work that is task-parallel to a given filter and fisses the filter accordingly. Depicted in Algorithm 1, this algorithm works by ascending through the hierarchy of the stream graph. Whenever it reaches a splitjoin, it calculates the ratio of work done by the stream containing the filter of interest to the work done by the entire splitjoin (per steady-state execution). Rather than summing the work within a stream, it considers the average work per filter in each stream, so as to mitigate the effects of imbalanced pipelines. After estimating the filter's work as a fraction of those running in parallel, the algorithm attempts to fiss the filter the minimum number of times needed to ensure that none of the fission products contains more than 1/N of the total task-parallel work (where N is the total number of cores).

Algorithm 1: Heuristic for fissing a filter F as little as possible while filling all N cores with task- or data-parallel work.

    JudiciousFission(filter F, int N):
      // Estimate the work done by F as a fraction of everything
      // running task-parallel to F
      fraction = 1
      Stream child = F
      Stream parent = F.parent
      while parent != null do
        if parent is a splitjoin then
          total-work = average work per filter in parent
          my-work = average work per filter in child
          fraction = fraction * my-work / total-work
        end if
        child = parent
        parent = parent.parent
      end while
      // Fiss F according to its weight in the task-parallel unit
      Fiss F into ceil(fraction * N) filters

Note that if several filters are being fissed, the work estimation is calculated ahead of time and is not updated during the course of fission. Figure 3c illustrates the outcome of performing judicious fission on the coarsened-granularity stream graph from Figure 3b. Because there is 8-way task parallelism between all of the pipelines, the filters in each pipeline are fissed a maximum of 2 ways, so as not to overwhelm the communication resources. As described in Section 6, the combination of granularity coarsening and judicious fission offers a 6.8x mean speedup over the naive fission policy.
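The same heuristic in runnable form (our rendering of Algorithm 1; the parent and is_splitjoin attributes and the avg_work hook are hypothetical IR accessors):

    import math

    def judicious_fission_ways(f, n_cores, avg_work):
        # Ascend the stream hierarchy; at each enclosing splitjoin, scale
        # f's share by the ratio of its branch's average work per filter
        # to that of the whole splitjoin.
        fraction = 1.0
        child, parent = f, f.parent
        while parent is not None:
            if parent.is_splitjoin:
                fraction *= avg_work(child) / avg_work(parent)
            child, parent = parent, parent.parent
        # Minimum number of fission products such that none holds more
        # than 1/n_cores of the total task-parallel work.
        return max(1, math.ceil(fraction * n_cores))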
[Figure 4. Potential speedups of pure pipeline parallelism over pure data parallelism, for varying amounts of stateful work in the application. Each line represents a different amount of work in the heaviest stateful filter. The graph assumes 16 cores and does not consider task parallelism or communication costs.]

4. Coarse-Grained Software Pipelining

4.1 Benefits of Pipeline Parallelism

Pipeline parallelism is an important mechanism for parallelizing filters that have dependences from one iteration to another. Such "stateful" filters are not data parallel and do not benefit from the techniques described in the previous section. While many streaming applications have abundant data parallelism, even a small number of stateful filters can greatly limit the performance of a purely data-parallel approach on a large multicore architecture.

The potential benefits of pipeline parallelism are straightforward to quantify. Consider that the sequential execution of an application requires unit time, and let $s$ denote the fraction of work (sequential execution time) that is spent within stateful filters. Also let $m$ denote the maximum work performed by an individual stateful filter. Given $n$ processing cores, we model the execution time achieved by two scheduling techniques: 1) data parallelism, and 2) data parallelism plus pipeline parallelism. In this exercise, we assume that execution time is purely a function of load balancing; we do not model the costs of communication, synchronization, locality, etc. We also do not model the impact of task parallelism.

1. Using data parallelism, $1-s$ parts of the work are data-parallel and can be spread across all $n$ cores, yielding a parallel execution time of $(1-s)/n$. The stateful work must be run as a separate stage on a single core, adding $s$ units to the overall execution. The total execution time is $(1-s)/n + s$.

2. Using data and pipeline parallelism, any set of filters can execute in parallel during the steady state. (That is, each stateful filter can now execute in parallel with others; the stateful filter itself is not parallelized.) The stateful filters can be assigned to the processors so as to minimize the maximum amount of work allocated to any processor. Even a greedy assignment (filling up one processor at a time) guarantees that no processor exceeds the lower-bound work balance of $s/n$ by more than the heaviest stateful filter, $m$. Thus, the stateful work can always complete in $s/n + m$ time. Remaining data parallelism can be freely distributed across processors. If it fills each processor to $s/n + m$ or beyond, then there is perfect utilization and execution completes in $1/n$ time; otherwise, the state is the bottleneck. Thus the general execution time is $\max(s/n + m,\ 1/n)$.

Using these modeled runtimes, Figure 4 illustrates the potential speedup of adding pipeline parallelism to a data-parallel execution model for various values of $s$ and $m$ on a 16-core architecture. In the best case, $m$ approaches $s/n$ and the speedup is $((1-s)/n + s) \,/\, \max(2s/n,\ 1/n)$. For example, if there are 16 cores and even as little as 1/16th of the work is stateful, then pipeline parallelism offers potential gains of 2x. For these parameters, the worst-case gain is 1.8x. The best and worst cases diverge further for larger values of $s$.
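The arithmetic behind these numbers, as a short check of the model (no communication or synchronization costs, per the assumptions above):

    def modeled_speedup(s, m, n):
        # Data parallelism alone: (1-s)/n for the stateless work, plus a
        # serial stage of s for the stateful work.
        data_only = (1 - s) / n + s
        # Data + pipeline parallelism: stateful work finishes within
        # s/n + m; perfect utilization would finish in 1/n.
        data_plus_pipe = max(s / n + m, 1 / n)
        return data_only / data_plus_pipe

    n, s = 16, 1 / 16
    print(modeled_speedup(s, m=s / n, n=n))  # best case (m -> s/n): ~1.94x
    print(modeled_speedup(s, m=s, n=n))      # worst case (m = s):   ~1.82x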

4.2 Exploiting Pipeline Parallelism

At any given time, pipeline-parallel actors are executing different iterations from the original stream program. However, the distance between active iterations must be bounded, as otherwise the amount of buffering required would grow toward infinity. To leverage pipeline parallelism, one needs to provide mechanisms both for decoupling the schedule of each actor and for bounding the buffer sizes. This can be done in either hardware or software.

In coarse-grained hardware pipelining, groups of filters are assigned to independent processors that proceed at their own rate (see Figure 5a). As the processors have decoupled program counters, filters early in the pipeline can advance to a later iteration of the program. Buffer size is limited either by blocking FIFO communication, or by other synchronization primitives (e.g., a shared-memory data structure). However, hardware pipelining entails a performance tradeoff: if each processor executes its filters in a single repeating pattern, then it is only beneficial to map a contiguous set of filters to a given processor. (In an acyclic stream graph, a set of filters is contiguous if, in traversing a directed path between any two filters in the set, the only filters encountered are also within the set.) Since filters on the processor will always be at the same iteration of the steady state, any filter missing from the contiguous group and executing at a remote location would only increase the latency of the processor's schedule. The requirement of contiguity can greatly constrain the partitioning options and thereby worsen the load balancing.

To avoid the constraints of a contiguous mapping, processors could execute filters in a dynamic, data-driven manner. Each processor monitors several filters and fires any that has data available. This allows filters to advance to different iterations of the original stream graph even if they are assigned to the same processing node. However, because filters are executing out-of-order, the communication pattern is no longer static, and a more complex flow-control mechanism (e.g., using credits) may be needed. There is also some overhead due to the dynamic dispatching step.

Coarse-grained software pipelining offers an alternative that does not have the drawbacks of either of the above approaches (see Figure 5b). Software pipelining provides decoupling by executing two distinct schedules: a loop prologue and a steady-state loop. The prologue serves to advance each filter to a different iteration of the stream graph, even if those filters are mapped to the same core. Because there are no dependences between filters within an iteration of the steady-state loop, any set of filters (contiguous or non-contiguous) can be assigned to a core. This offers a new degree of freedom to the partitioner, thereby enhancing the load balancing. Also, software pipelining avoids the overhead of the demand-driven model by executing filters in a fixed and repeatable pattern on each core. Buffering can be bounded by the on-chip communication networks, without needing to resort to software-based flow control.
[Figure 5. Comparison of hardware pipelining and software pipelining for the Vocoder example (see Figure 1). For clarity, the same assignment of filters to processors is used in both cases, though software pipelining admits a more flexible set of assignments than hardware pipelining. In software pipelining, filters read and write directly into buffers, and communication is done at steady-state boundaries. The prologue schedule for software pipelining is not shown.]

4.3 Software Pipelining Implementation

Our software pipelining algorithm maps filters to cores in a similar manner that traditional algorithms map instructions to ALUs. This transformation is enabled by an important property of the StreamIt programming model, namely that the entire stream graph is wrapped with an implicit outer loop. As the granularity of software pipelining increases from instructions to filters, one needs to consider the implications for managing buffers and scheduling communication. We also describe our algorithm for mapping filters to cores, and compare the process to conventional software pipelining.

We construct the loop prologue so as to buffer at least one steady state of data items between each pair of dependent filters. This allows each filter to execute completely independently during each subsequent iteration of the stream graph, as they are reading and writing to buffers rather than communicating directly. The buffers could be stored in a variety of places, such as the local memory of the core, a hardware FIFO, a shared on-chip cache, or an off-chip DRAM. On Raw, off-chip DRAM offers higher throughput than core-local memory, so we decided to store the buffers there. However, we envision that on-chip storage would be the better choice for most commodity multicores.
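A sketch of the prologue construction under a simple depth-based schedule follows. The depth and reps values are assumed inputs (longest path from a source in filter hops, and steady-state repetitions), and a real prologue needs only enough firings to satisfy one steady state of buffering per channel; this uniform scheme over-approximates that.

    def prologue_firings(filters):
        # filters: list of (name, depth, reps). Run each filter
        # (max_depth - depth) extra steady states in the prologue, so
        # every producer is a full steady-state iteration ahead of each
        # of its consumers before the steady-state loop begins.
        max_depth = max(d for _, d, _ in filters)
        return {name: (max_depth - d) * reps for name, d, reps in filters}

    # Example: Src -> Mid -> Sink with steady-state reps 3, 2, 1.
    # Src fires 2 steady states (6 firings), Mid one (2), Sink none.
    print(prologue_firings([("Src", 0, 3), ("Mid", 1, 2), ("Sink", 2, 1)]))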

As filters are reading and writing into buffers in distributed memory banks, there needs to be a separate communication stage to shuffle data between buffers. Some of this communication is a direct transfer of data, while the rest performs scatter or gather operations corresponding to the splitjoins in StreamIt. The communication stage could be implemented by DMA engines, on-chip networks, vector permutations, or other mechanisms. On Raw, we leverage the programmable static network to perform all communication in a single stage, which is situated between iterations of the steady state. On architectures with DMA engines, it would be possible to parallelize the communication stage with the computation stage by double-buffering the I/O of each filter.

In assigning filters to cores, the goal is to optimize the load balancing across cores while minimizing the synchronization needed during the communication stage. We address these criteria in two passes, first optimizing load balancing and then optimizing the layout. As the load-balancing problem is NP-complete (by reduction from SUBSET-SUM [29]), we use a greedy partitioning heuristic that assigns each filter to one of the $n$ processors. The algorithm considers filters in order of decreasing work, assigning each one to the processor that has the least amount of work so far. As described in Section 4.1, this heuristic ensures that the bottleneck processor does not exceed the optimum by more than the amount of work in the heaviest filter.
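A minimal rendering of the greedy pass (the work hook stands in for our static work estimation):

    import heapq

    def greedy_partition(filters, n_cores, work):
        # Place each filter, in decreasing order of estimated work, on the
        # currently least-loaded core. After software pipelining, any
        # assignment is legal, since the steady state carries no
        # cross-filter dependences.
        heap = [(0.0, core, []) for core in range(n_cores)]
        heapq.heapify(heap)
        for f in sorted(filters, key=work, reverse=True):
            load, core, assigned = heapq.heappop(heap)
            assigned.append(f)
            heapq.heappush(heap, (load + work(f), core, assigned))
        return {core: assigned for _, core, assigned in heap}

    # Toy usage: five filters with work estimates, packed onto two cores.
    print(greedy_partition(["F1", "F2", "F3", "F4", "F5"], 2,
                           {"F1": 5, "F2": 4, "F3": 3, "F4": 2, "F5": 1}.get))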

To minimize synchronization, we wrap the partitioning algorithm with a selective fusion pass. This pass repeatedly fuses the two adjacent filters in the graph that have the smallest combined work. After each fusion step, the partitioning algorithm is re-executed; if the bottleneck partition increases by more than a given threshold (10%), then the fusion is reversed and the process terminates. This process increases the computation to communication ratio of the stream graph, while also leveraging the inter-node fusion optimizations mentioned previously. It improves performance by up to 2x on the Radar benchmark, with a geometric mean of 15% across all benchmarks.
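A sketch of the selective fusion loop wrapped around that partitioner; adjacent_pairs, fuse, and unfuse are hypothetical IR operations, and the bottleneck hook re-runs the greedy partitioner and returns the most loaded core's work.

    def selective_fusion(graph, bottleneck, work, threshold=0.10):
        base = bottleneck(graph)
        while graph.adjacent_pairs():
            # Fuse the adjacent pair with the smallest combined work.
            a, b = min(graph.adjacent_pairs(),
                       key=lambda p: work(p[0]) + work(p[1]))
            graph.fuse(a, b)
            new = bottleneck(graph)
            if new > base * (1 + threshold):
                graph.unfuse()   # reverse the offending fusion and stop
                break
            base = new
        return graph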

Overall, coarse-grained software pipelining on a multicore architecture avoids many of the complications, and exposes many new optimization opportunities, versus traditional software pipelining. In traditional software pipelining, the limited size of the register file is always an adversary (register pressure), but here there is ample memory available for buffering stream data. Another recurring issue traditionally is the length of the prologue to the software-pipelined loop, but this is less problematic in the streaming domain because the steady state executes for longer. Lastly, and most importantly, we can exploit these properties to fully pipeline the graph, removing all dependences and thus removing all the constraints on our scheduling algorithm.

5. Implementation and Methodology

5.1 The Raw Architecture

We target the Raw microprocessor [37, 40], which addresses the wire delay problem by providing direct instruction set architecture (ISA) analogs to three underlying physical resources of the processor: gates, wires, and pins. The architecture exposes the gate resources as a scalable 2-D array of identical, programmable cores that are connected to their immediate neighbors by four on-chip networks. Values routed through the networks off of the side of the array appear on the pins, and values placed on the pins by external devices (wide-word A/Ds, DRAMs, etc.) appear on the networks. Each of the cores contains a compute processor, some memory, and two types of routers (one static, one dynamic) that control the flow of data over the networks as well as into the compute processor. The compute processor interfaces to the network through a bypassed, register-mapped interface [37] that allows instructions to use the networks and the register files interchangeably.

Because we generate bulk DRAM transfers, we do not want these optimizable accesses to become the bottleneck of the hardware configuration. So, we employ a simulation of CL2 PC 3500 DDR DRAM, which provides enough bandwidth to saturate both directions of a Raw port [38]. Additionally, each chipset contains a streaming memory controller that supports a number of simple streaming memory requests. In our configuration, 16 such DRAMs are attached to the 16 logical ports of the chip. The chipset receives request messages over the dynamic network for bulk transfers to and from the DRAMs. The transfers themselves can use either the static network or the general dynamic network (the desired network is encoded in the request).

The results in this paper were generated using btl, a cycle-accurate simulator that models arrays of Raw cores identical to those in the .15 micron 16-core Raw prototype ASIC chip, with a target clock rate of 450 MHz. The core employs as compute processor an 8-stage, single-issue, in-order MIPS-style pipeline that has a 32 KB data cache, 32 KB of instruction memory, and 64 KB of static router memory. The simulator includes a 2-way set-associative hardware instruction caching mechanism (not present in the hardware) that is serviced over the dynamic network, with resource contention modeled accordingly.

5.2 StreamIt Compiler Infrastructure

The techniques presented in this paper are evaluated in the context of the StreamIt compiler infrastructure. The system includes a high-level stream IR with a host of graph transformations, including graph canonicalization, synchronization removal, refactoring, fusion, and fission [15]. Also included are domain-specific optimizations for linear filters (e.g., FIR, FFT, and DCT) [24], state-space analysis [3], and cache optimizations [35]. We leverage StreamIt's spatially-aware Raw back end for this work.

Previously, we described hardware and software pipelining as two distinct techniques for exploiting pipeline parallelism. The StreamIt compiler has full support for each. It also implements a hybrid approach where hardware-pipelined units are scheduled in a software-pipelined loop. While we were excited by the possibilities of the hybrid approach, it does not provide a benefit on Raw due to the tight coupling between processors and the limited FIFO buffering of the network. Although these are not fundamental limits of the architecture, they enforce a fine-grained orchestration of communication and computation that is a mismatch for our coarse-grained execution model.

In the compiler, we elected to buffer all streaming data off-chip. Given the Raw configuration we are simulating, and for the regular bulk memory traffic we generate, it is more expensive to stream data onto the network from a core's local data cache than to stream the data from the streaming memory controllers. A load hit in the data cache incurs a 3-cycle latency. So although the networks are register-mapped, two instructions must be performed to hide the latency of the load, implying a maximum bandwidth of 1/2 word per cycle, while each streaming memory controller has a load bandwidth of 1 word per cycle for unit-stride memory accesses. When targeting an architecture with more modest off-chip memory bandwidth, the stream buffers could reside completely in on-chip memory. For example, the total buffer allocation for each of our benchmarks in Section 6 would fit in Cell's 512 KB L2 cache.

Streaming computation requires that the target architecture provides an efficient mechanism for implementing split and join (scatter and gather) operations. The StreamIt compiler programs the switch processors to form a network to perform the splitting (including duplication) and the joining of data streams. Since Raw's DRAM ports are banked, we must read all the data participating in the reorganization from off-chip memory and write it back to off-chip memory. The disadvantages of this scheme are that all the data required for joining must be available before the join can commence, and that the compute processors of the cores involved are idle during the reorganization. Architectures that include decoupled DMA engines (e.g., Cell) can overlap splitting/joining communication with useful computation.

5.2.1 Baseline Scheduler

To facilitate evaluation of coarse-grained software pipelining, we implemented a separate scheduling path for a non-software-pipelined schedule that executes the steady state respecting the dataflow dependencies of the stream graph. This scheduling path ignores the presence of the encompassing outer loop in the stream graph, and is employed for the task and task + data parallel configurations of the evaluation section. This scheduling problem is equivalent to static scheduling of a coarse-grained dataflow graph to a multiprocessor, which has been a well-studied problem over the last 40 years (see [23] for a good review). We leverage simulated annealing, as randomized solutions to this static scheduling problem are superior to other heuristics [22].

Briefly, we generate an initial layout by assigning the filters of the stream graph to processors in dataflow order, assigning a random processing core to each filter. The simulated annealing perturbation function randomly selects a new core for a filter, inserting the filter at the correct slot in the core's schedule of filters based on the dataflow dependencies of the graph. The cost function of the annealer uses our static work estimation to calculate the maximum critical path length (measured in cycles) from source to sink in the graph. After the annealer is finished, we use the configuration that achieved the minimum critical path length over the course of the search.
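A compressed sketch of that annealer follows; the critical_path hook stands in for the static cost model, and the cooling schedule and perturbation are simplified relative to the real implementation (in particular, re-slotting within a core's schedule is folded into the cost hook).

    import math, random

    def anneal_layout(filters, n_cores, critical_path, steps=10000, t0=1.0):
        # Initial layout: a random core for each filter (filters are
        # given in dataflow order).
        layout = {f: random.randrange(n_cores) for f in filters}
        cost = critical_path(layout)
        best, best_cost = dict(layout), cost
        for step in range(steps):
            temp = max(t0 * (1 - step / steps), 1e-9)  # linear cooling
            trial = dict(layout)
            trial[random.choice(filters)] = random.randrange(n_cores)
            delta = critical_path(trial) - cost
            # Accept improvements always; accept regressions with a
            # probability that decays as the temperature drops.
            if delta < 0 or random.random() < math.exp(-delta / temp):
                layout, cost = trial, cost + delta
            if cost < best_cost:
                best, best_cost = dict(layout), cost
        return best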

5.2.2 Instruction-Level Optimizations

For the results detailed in the next section, we ran a host of optimizations including function inlining, constant propagation, constant folding, array scalarization, and loop unrolling (with a factor of 4). These optimizations are especially important for fused filters, as we can often unroll enough to scalarize constituent splitter and joiner buffers, eliminating the shuffling operations. We do not enable StreamIt's aggressive cache optimizations [35] or linear optimizations [24], as they conflate too many factors into the experiments. Finally, we produce a mix of C and assembly code that is compiled with GCC 3.4 at optimization level 3.
[Figure 6. Benchmark descriptions and characteristics: for each of the 12 benchmarks, the table lists the number of filters, the number of peeking filters, the computation-to-communication ratio, the task-parallel critical path percentage, and the stateful work fractions $s$ and $m$.]

6. Experimental Evaluation

In this section, we present an evaluation of our full compiler system and compare it to previous techniques for compiling stream programs to multicore architectures. This section elaborates on the following contributions and conclusions:

- Our compiler achieves consistent and excellent parallelization of our benchmark suite to a 16-core architecture, with a mean speedup of 11.2x over sequential, single-core performance.
- Our technique for exploiting coarse-grained data parallelism achieves a mean performance gain of 9.9x over a sequential, single-core baseline. This also represents a 6.8x speedup over a fine-grained data-parallel approach.
- Coarse-grained software pipelining is an effective technique for extracting parallelism beyond task and data parallelism, with an additional mean speedup of 1.45x for our benchmarks with stateful computation and 1.13x across all of our benchmarks. On its own, our software pipelining technique affords a 7.7x performance gain over a sequential single-core baseline.
- Our compiler, employing the combination of the techniques presented in this paper, improves upon our previous work that exploited a combination of task and hardware pipeline parallelism, with a mean speedup of 1.84x.

In the evaluation, the speedup of configuration A over B is calculated as the throughput for an average steady state of A divided by that of B; the initialization and prologue schedules, if present, are not included. Figure 7 presents a table of throughput speedups normalized to a single core for most configurations. Previous work has shown that sequential StreamIt executing on a single Raw core outperformed hand-written implementations executing on a single core over a benchmark suite similar to ours [38]. Furthermore, we have increased the performance of sequential compilation since that previous evaluation.

6.1 Benchmark Suite

We evaluate our techniques using the benchmark suite given in Figure 6. The benchmark suite consists of 12 StreamIt applications. MPEG2Decoder implements the block decoding and the motion vector decoding of an MPEG-2 decoder, containing approximately one-third of the computation of the entire MPEG-2 decoder. The DCT benchmark implements a 16x16 IEEE reference DCT, while the MPEG2Decoder benchmark includes a fast 8x8 DCT as a component. For additional information on MPEG2Decoder, Vocoder, and Radar, please refer to [11], [34], and [25], respectively.

In the table, the measurements given in each column are obtained from the stream graph as conceived by the programmer, before it is transformed by our techniques. The "Filters" column gives the total number of filters in the stream (including file input filters and file output filters that are not mapped to cores). The number of filters that perform peeking is important because peeking filters cannot be fused with upstream neighbors without introducing internal state. "Comp/Comm" gives the static estimate of the computation to communication ratio of each benchmark for one steady-state execution. This is calculated by totaling the computation estimates across all filters and dividing by the number of dynamic push or pop statements executed in the steady state (all items pushed and popped are 32 bits). Notice that although the computation to communication ratio is much larger than one across our benchmarks, we will demonstrate that inter-core synchronization is an important factor to consider. "Task Parallel Critical Path" calculates, using static work estimates, the work that is on the critical path for a task-parallel model, assuming infinite processors, as a percentage of the total work. Smaller percentages indicate the presence of more task parallelism. $s$ is defined in Section 4.1 as the fraction of total work that is stateful. $m$ is the maximum fraction of total work performed by an individual stateful filter.

Referring to Figure 6, we see that three of our benchmarks include stateful computation. Radar repeatedly operates on long columns of an array, requiring special behavior at the boundaries; thus, the state tracks the position in the column and does some internal buffering. Vocoder performs an adaptive DFT that uses stateful decay to ensure stability; it also needs to retain the previous output across one iteration within a phase transformation. MPEGDecoder has negligible state in retaining predicted motion vectors across one iteration of work.

6.2 Task Parallelism

To motivate the necessity of our parallelism extraction techniques, let us first consider the task-parallel execution model. This model closely approximates a thread model of execution where the only form of coarse-grained parallelism exploited is fork/join parallelism. In our implementation, the sole form of parallelism exploited in this model is the parallelism across the children of a splitjoin. The first bar of Figure 8 gives the speedup for each of our benchmarks running in the task-parallel model executing on 16-core Raw, normalized to sequential StreamIt executing on a single core of Raw. For the remainder of the paper, unless otherwise noted, we target all 16 cores of Raw. The mean performance speedup for task parallelism is 2.27x over sequential performance.

We can see that for most of our benchmarks, little parallelism is exploited; notable exceptions are Radar, ChannelVocoder, and FilterBank, each of which contains wide splitjoins of load-balanced children. In the case of BitonicSort, the task parallelism is expressed at too fine a granularity for the communication system. Given that we are targeting a 16-core processor, a mean speedup of 2.27x is inadequate.
[Figure 7. Throughput speedup comparison, and Task + Data + Software Pipelining performance results (per-benchmark compute utilization and MFLOPS).]

[Figure 8. Task, Task + Data, Task + Software Pipelining, and Task + Data + Software Pipelining parallelism, normalized to single-core StreamIt throughput.]

6.3 Coarse-Grained Data Parallelism

The StreamIt programming model facilitates a relatively simple analysis to determine opportunities for data parallelism. But the granularity of the transformations must account for the additional synchronization incurred by data-parallelizing a filter. If we attempt to exploit data parallelism at a fine granularity, by simply replicating each stateless filter across the cores of the architecture, we run the risk of overwhelming the communication substrate of the target architecture. To study this, we implemented a simple algorithm for exposing data parallelism: replicate each filter by the number of cores, mapping each fission product to its own core. We call this fine-grained data parallelism. In Figure 9, we show this technique normalized to single-core performance. Fine-grained data parallelism achieves a mean speedup of only 1.40x over sequential StreamIt. Note that FilterBank is not included in Figure 9, because the size of the fine-grained data-parallel stream graph stressed our infrastructure. For four of our benchmarks, fine-grained duplication on 16 cores has lower throughput than a single core. This motivates the need for a more intelligent approach for exploiting data parallelism in streaming applications when targeting multicore architectures.

The second bar of Figure 8 gives the speedup of coarse-grained data parallelism over single-core StreamIt. The mean speedup across our suite is 9.9x over a single core and 4.36x over our task-parallel baseline. BitonicSort, whose original granularity was too fine, now achieves an 8.4x speedup over a single core. Six of our 12 applications are stateless and non-peeking (BitonicSort, DCT, DES, FFT, Serpent, and TDE) and thus fuse to one filter that is fissed 16 ways. For these benchmarks the mean speedup is 11.1x over the single core. For DCT, the algorithm data-parallelizes the bottleneck of the application (a single filter that performs more than 6x the work of each of the other filters). Coarse-grained data parallelism achieves a 14.6x speedup over single-core, while fine-grained achieves only 4.0x because it fisses at too fine a granularity, ignoring synchronization. Coarsening and then parallelizing reduces the synchronization costs of data parallelizing. For Radar and Vocoder, data parallelism is paralyzed by the preponderance of stateful computation.

6.4 Coarse-Grained Software Pipelining

Our technique for coarse-grained software pipelining is effective for exploiting coarse-grained pipelined parallelism (though it under-performs when compared to coarse-grained data parallelism). More importantly, combining software pipelining with our data parallelism techniques provides a cumulative performance gain, especially for applications with stateful computation.
[Figure 9. Fine-grained data parallelism normalized to single-core StreamIt.]

The third bar of Figure 8 gives the speedup for software pipelining over a single core. On average, software pipelining has a speedup of 7.7x over a single core (compare to 9.9x for data parallelism) and a speedup of 3.4x over task parallelism. Software pipelining performs well when it can effectively load-balance the packing of the dependence-free steady state. In the case of Radar, TDE, FilterBank, and FFT, software pipelining achieves comparable or better performance compared to data parallelism (see Figure 8). For these applications, the workload is not dominated by a single filter, and the resultant schedules are statically load-balanced across cores. For the Radar application, software pipelining achieves a 2.3x speedup over data parallelism and task parallelism, because there is little coarse-grained data parallelism to exploit and it can more effectively schedule the dependence-free steady state.

However, when compared to data parallelism, software pipelining is hampered by its inability to reduce the bottleneck filter when the bottleneck filter contains stateless work (e.g., DCT, MPEGDecoder). Also, our data parallelism techniques tend to coarsen the stream graph more than the selective fusion stage of software pipelining, removing more synchronization. For example, in DES, selective fusion makes a greedy decision such that it cannot remove communication affecting the critical path workload. Software pipelining performs poorly for this application when compared to data parallelism, 6.9x versus 13.9x over a single core, although it calculates a load-balanced mapping. Another consideration when comparing software pipelining to data parallelism is that the software pipelining techniques rely more heavily on the accuracy of the static work estimation strategy, although it is difficult to quantify this effect.

6.5 Combining the Techniques

When we software pipeline the data-parallelized stream graph, we achieve a 13% mean speedup over data parallelism alone. The cumulative effect is most prominent when the application in question contains stateful computation; for such benchmarks, there is a 45% mean speedup over data parallelism. For example, the combined technique achieves a 69% speedup over each individual technique for Vocoder. For ChannelVocoder, FilterBank, and FM, software pipelining further coarsens the stream graph without affecting the critical path work (as estimated statically) and performs splitting and joining in parallel. Each reduces the synchronization encountered on the critical path.

The combined technique depresses the performance of MPEG by 6%, because the selective fusion component of the software pipeliner fuses one step too far. In most circumstances, fusion will help to reduce inter-core synchronization by using the local memory of the core for buffering.

[Figure 10. Task + Data + Software Pipelining normalized to Hardware Pipelining.]

Consequently, the algorithm does not model the communication costs of each fusion step. In the case of MPEG, it fuses too far and adds synchronization. The combined technique also hurts Radar as compared to software pipelining alone, because we fiss too aggressively and create synchronization across the critical path.

In Figure 7, we report the compute utilization and the MFLOPS performance (N/A for integer benchmarks) for each benchmark employing the combination of our techniques: task plus data plus software pipeline parallelism. Note that for our target architecture, the maximum number of MFLOPS achievable is 7200. The compute utilization is calculated as the number of instructions issued on each compute processor divided by the total number possible for the steady state. The utilization accurately models pipeline hazards and stalls of Raw's single-issue, in-order processing cores. We achieve generally excellent compute utilization; in most cases the utilization is 60% or greater.

6.6 Comparison to our Previous Work: Hardware Pipelining

In Figure 10, we show our combined technique normalized to our previous work for compiling streaming applications to multicore architectures. This baseline configuration is a maturation of the ideas presented in [15] and implements a task plus hardware pipeline parallel execution model, relying solely on on-chip buffering and the on-chip static network for communication and synchronization. (Please note that due to a bug in our tools, the MFLOPS numbers reported in the proceedings version of [15] were inaccurate.) In this hardware pipelining model, we require that the number of filters in the stream graph be less than or equal to the number of processing cores of the target architecture. To achieve this, we repeatedly apply fusion and fission transformations as directed by a dynamic programming algorithm.

Our new techniques achieve a mean speedup of 1.84x over hardware pipelining. For most of our benchmarks, the combined techniques presented in this paper offer improved data parallelism, improved scheduling flexibility, and reduced synchronization compared to our previous work. This comparison demonstrates that combining our techniques is important for generalization to stateful benchmarks. For Radar, data parallelism loses to hardware pipelining by 19%, while the combined technique enjoys a 38% speedup. For Vocoder, data parallelism is 18% slower, while the combined technique is 30% faster.

Hardware pipelining performs well in 3 out of 12 benchmarks (FFT, TDE, and Serpent). This is because these applications contain long pipelines that can be load-balanced. For example, the stream graph for Serpent is a pipeline of identical splitjoins that is fused down to a balanced pipeline. Hardware pipelining incurs less synchronization than usual in this case because the I/O rates of the filters are matched; consequently, its compute utilization is higher than our combined technique's (64% versus 57%). The combined approach fuses Serpent to a single filter and then fisses it 16 ways, converting the pipeline parallelism into data parallelism. In this case, the data-parallel communication is more expensive than the hardware-pipelined communication.

57%). The com- bined approach fuses Serpent to single filter and then fisses it 16 ays, con erting the pipeline parallelism into data parallelism. In this case, data-parallel communication is more xpensi than the hardw are pipelined communication. 7. Related ork In addition to StreamIt, there are number of stream-oriented languages dra wing from domains such as functional, dataflo CSP and synchronous programming [36 ]. The Brook language is architecture-independent and focuses on data parallelism [6]. Stream ernels are required to be stateless, though there is special support

7. Related Work

In addition to StreamIt, there are a number of stream-oriented languages, drawing from domains such as functional, dataflow, CSP, and synchronous programming [36]. The Brook language is architecture-independent and focuses on data parallelism [6]. Stream kernels are required to be stateless, though there is special support for reducing streams to a single value. StreamC/KernelC is lower level than Brook; kernels written in KernelC are stitched together in StreamC and mapped to the data-parallel Imagine processor [18]. SPUR adopts a similar decomposition between "microcode" stream kernels and skeleton programs to expose data parallelism [42]. Cg exploits pipeline parallelism and data parallelism, though the programmer must write algorithms to exactly match the two pipeline stages of a graphics processor [27].

Compared to these languages, StreamIt places more emphasis on exposing task and pipeline parallelism (all of the languages expose data parallelism). By adopting the synchronous dataflow model of execution [26], StreamIt focuses on well-structured programs that can be aggressively optimized. The implicit infinite loop around programs is also a StreamIt characteristic that enables the transformations in this paper. Spidle [10] is also a recent stream language that was influenced by StreamIt.

Liao et al. map Brook to multicore processors by leveraging the affine partitioning model [41]. While affine partitioning is a powerful technique for parameterized loop-based programs, in StreamIt we simplify the problem by fully resolving the program structure at compile time. This allows us to schedule a single steady state using flexible, non-affine techniques (e.g., simulated annealing) and to repeat the found schedule for an indefinite period at runtime.
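As background for the steady-state scheduling that the SDF model [26] permits, the repetition vector of a filter chain follows from the balance equations r[i] * push[i] = r[i+1] * pop[i+1]. The sketch below (our illustration; Python 3.9+ for math.lcm) computes the smallest integer solution:

    from fractions import Fraction
    from math import lcm

    def repetitions(pushes, pops):
        """Smallest integer repetition vector for a chain of SDF filters.

        pushes[i]/pops[i]: items produced/consumed per firing of filter i;
        the entries at the ends of the chain are unused (set to 0).
        """
        r = [Fraction(1)]
        for i in range(len(pushes) - 1):
            r.append(r[-1] * pushes[i] / pops[i + 1])
        scale = lcm(*(f.denominator for f in r))
        return [int(f * scale) for f in r]

    # A pushes 2 items; B pops 3 and pushes 1; C pops 2.
    print(repetitions(pushes=[2, 1, 0], pops=[0, 3, 2]))  # [3, 2, 1]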

Gummaraju and Rosenblum map stream programs to a general-purpose hyperthreaded processor [16]. Such techniques could be integrated with our spatial partitioning to optimize per-core performance. Gu et al. expose data and pipeline parallelism in a Java-like language and use compiler analysis to efficiently extract coarse-grained filter boundaries [12]. Ottoni et al. also extract decoupled threads from sequential code, using hardware-based software pipelining to distribute the resulting threads across cores [30]. By embedding pipeline-parallel filters in the programming model, we focus on the mapping step.

Previous work in scheduling computation graphs to parallel targets has focused on partitioning and scheduling techniques that exploit task and pipeline parallelism [33, 32, 28, 23, 13]. The application of loop-conscious transformations to coarse-grained dataflow graphs has also been investigated. Unrolling (or "unfolding" in this domain) is employed for synchronous dataflow (SDF) graphs to reduce the initiation interval, but these works do not evaluate mappings to actual architectures [7, 31]. Software pipelining techniques have been applied to map SDF graphs onto various embedded and DSP targets [5, 8], but this has required programmer knowledge of both the application and the architecture. To our knowledge, none of these systems automatically exploit the combination of task, data, and pipeline parallelism.

Furthermore, these systems do not provide a robust end-to-end path for application parallelization from a high-level, portable programming language.

8. Conclusions

As multicore architectures become ubiquitous, it will be critical to develop a high-level programming model that can automatically exploit the coarse-grained parallelism of the underlying machine without requiring heroic efforts on the part of the programmer. Stream programming represents a promising approach to this problem, as high-level descriptions of streaming applications naturally expose task, data, and pipeline parallelism. In this paper, we develop general techniques for automatically bridging the gap between the original granularity of the program and the underlying granularity of the architecture. To bolster the benefits of data parallelism on a multicore architecture, we build coarse-grained data-parallel units that are duplicated as few times as needed. And to leverage the benefits of pipeline parallelism, we employ software pipelining techniques, traditionally applied at the instruction level, to coarse-grained filters in the program.
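The following sketch (ours, on a toy linear pipeline) illustrates what coarse-grained software pipelining buys: after a prologue primes the inter-stage buffers, every stage fires in each steady-state step, with stage s working on iteration k - s. The firings within a step are mutually independent, so each stage could run on its own core even if it is stateful.

    def software_pipeline(filters, inputs):
        """Run a linear pipeline with all stages firing in each step."""
        n = len(filters)
        bufs = [list(inputs)] + [[] for _ in range(n)]
        for step in range(len(inputs) + n - 1):  # prologue, steady state, drain
            # Stages are visited downstream-first so that a stage never
            # consumes an item produced in the same step.
            for s in reversed(range(n)):
                if bufs[s]:
                    bufs[s + 1].append(filters[s](bufs[s].pop(0)))
        return bufs[n]

    print(software_pipeline([lambda x: x + 1, lambda x: 2 * x], [1, 2, 3]))
    # [4, 6, 8]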

A detailed evaluation in the context of the StreamIt language and the 16-core Raw microprocessor offers very favorable results that are also quite consistent across diverse applications. Coarse-grained data parallelism offers a 4.4x speedup over a task-parallel baseline and a 9.9x speedup over sequential code. Without our granularity-coarsening pass, these reduce to 0.7x and 1.4x, respectively. Coarse-grained software pipelining improves the generality of the compiler, as it is able to parallelize stateful filters with dependences from one iteration to the next. Our two techniques are complementary and offer a combined speedup of 11.2x over the baseline (and 1.84x over our previous work).

Though data parallelism is responsible for the greater speedups on a 16-core chip, pipeline parallelism may become more important as multicore architectures scale. Data parallelism requires global communication, and it keeps resources sitting idle when it encounters stateful filters (or feedback loops). According to our analysis in Section 4.1, leveraging pipeline parallelism on a 64-core chip when only 10% of the filters have state could offer up to a 6.4x speedup (an improvement in load balancing).
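One way to reconstruct that figure, reading the Section 4.1 analysis as an Amdahl-style bound (our arithmetic, with stateful fraction s = 0.10 serialized under data parallelism alone and N = 64 cores):

\[
S_{\text{data}} \le \frac{1}{s} = 10\times, \qquad
S_{\text{ideal}} = N = 64\times, \qquad
\frac{S_{\text{ideal}}}{S_{\text{data}}} = \frac{64}{10} = 6.4\times .
\]

That is, pipeline parallelism can recover at most the load-balancing headroom that stateful filters deny to pure data parallelism.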

Exposing pipeline parallelism in combination with data parallelism for the stateful benchmarks in our suite provided a 1.45x speedup over data parallelism alone. As our techniques rely on specific features of the StreamIt programming model, the results suggest that these features are a good match for multicore architectures. Of particular importance are the following two language features:

1. Exposing producer-consumer relationships between filters. This enables us to coarsen the computation-to-communication ratio via filter fusion, and it also enables pipeline parallelism.

2. Exposing the outer loop around the entire stream graph. This is central to the formulation of software pipelining; it also enables data parallelism, as the products of filter fission may span multiple steady-state iterations.

While our implementation targets Raw, the techniques developed should be applicable to other multicore architectures. As Raw has relatively high communication bandwidth, coarsening the granularity of data parallelism may benefit commodity multicores even more. In porting this transformation to a new architecture, one may need to adjust the threshold computation-to-communication ratio that justifies filter fission (see the sketch below). As for coarse-grained software pipelining, the scheduling freedom it affords should benefit many multicore systems. One should consider the most efficient location for intermediate buffers (local memory, shared memory, FIFOs, etc.) as well as the best mechanism for shuffling data (DMA, on-chip network, etc.). The basic algorithms (for coarsening granularity, judicious fission, partitioning, and selective fusion) are largely architecture-independent.
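As a sketch of such a port-time test (all names and the default threshold are illustrative, not part of our compiler):

    def should_fiss(work_per_item, bytes_per_item, cost_per_byte, threshold=2.0):
        """Fiss a stateless filter only if its computation-to-communication
        ratio exceeds a machine-specific threshold; a target with slower
        core-to-core communication would raise the threshold."""
        return work_per_item / (bytes_per_item * cost_per_byte) >= threshold

    # 400 cycles of work per item vs. 16 bytes at 4 cycles/byte: ratio 6.25.
    print(should_fiss(work_per_item=400, bytes_per_item=16, cost_per_byte=4))
    # True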
Acknowledgments

We would like to thank the members of the StreamIt team, both past and present, and especially Jasper Lin, Rodric Rabbah, and Allyn Dimock, for their contributions to this work. We are very grateful to Michael Taylor, Jonathan Eastep, and Samuel Larsen for their help with the Raw infrastructure, and to Ronny Krashinsky for his comments on this paper. This work is supported in part by DARPA grants PCA-F29601-03-2-0065 and HPCA/PERCS-W0133890, and NSF awards CNS-0305453 and EIA-0071841.

References

[1] Raza Microelectronics, Inc. http://www.razamicroelectronics.com/products/xlr.htm.
[2] StreamIt Language Specification. http://cag.lcs.mit.edu/streamit/papers/streamit-lang-spec.pdf.

[3] S. Agrawal, W. Thies, and S. Amarasinghe. Optimizing Stream Programs Using Linear State Space Analysis. In CASES, San Francisco, CA, Sept. 2005.
[4] J. Andrews and N. Baker. Xbox 360 System Architecture. IEEE Micro, 26(2), 2006.
[5] S. Bakshi and D. D. Gajski. Partitioning and pipelining for performance-constrained hardware/software systems. IEEE Trans. Very Large Scale Integr. Syst., 7(4):419-432, 1999.
[6] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. In SIGGRAPH, 2004.
[7] L.-F. Chao and E. H.-M. Sha. Scheduling Data-Flow Graphs via Retiming and Unfolding. IEEE Trans. on Parallel and Distributed Systems, 8(12), 1997.
[8] K. S. Chatha and R. Vemuri. Hardware-Software partitioning and pipelined scheduling of transformative applications. IEEE Trans. Very Large Scale Integr. Syst., 10(3), 2002.
[9] M. K. Chen, X. Li, R. Lian, J. H. Lin, L. Liu, T. Liu, and R. Ju. Shangri-La: Achieving High Performance from Compiled Network Applications While Enabling Ease of Programming. In PLDI, New York, NY, USA, 2005.
[10] C. Consel, H. Hamdi, L. Réveillère, L. Singaravelu, H. Yu, and C. Pu. Spidle: A DSL Approach to Specifying Streaming Applications. In 2nd Int. Conf. on Generative Programming and Component Engineering, 2003.
[11] M. Drake, H. Hoffmann, R. Rabbah, and S. Amarasinghe. MPEG-2 Decoding in a Stream Programming Language. In IPDPS, Rhodes Island, Greece, April 2006.
[12] W. Du, R. Ferreira, and G. Agrawal. Compiler Support for Exploiting Coarse-Grained Pipelined Parallelism. In Supercomputing, 2005.
[13] E. Lee and D. Messerschmitt. Pipeline interleaved programmable DSPs: Synchronous data flow programming. IEEE Trans. on Signal Processing, 35(9), 1987.
[14] W. Eatherton. The Push of Network Processing to the Top of the Pyramid. Keynote Address, Symposium on Architectures for Networking and Communications Systems, 2005.
[15] M. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J. Wong, H. Hoffmann, D. Maze, and S. Amarasinghe. A Stream Compiler for Communication-Exposed Architectures. In ASPLOS, 2002.
[16] J. Gummaraju and M. Rosenblum. Stream Programming on General-Purpose Processors. In MICRO, 2005.
[17] H. P. Hofstee. Power Efficient Processor Architecture and The Cell Processor. In HPCA, pages 258-262, 2005.
[18] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens. Programmable stream processors. IEEE Computer, 2003.
[19] M. Karczmarek, W. Thies, and S. Amarasinghe. Phased scheduling of stream programs. In LCTES, San Diego, CA, June 2003.
[20] M. A. Karczmarek. Constrained and Phased Scheduling of Synchronous Data Flow Graphs for the StreamIt Language. Master's thesis, MIT, 2002.
[21] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-Way Multithreaded Sparc Processor. IEEE Micro, 25(2):21-29, 2005.
[22] Y.-K. Kwok and I. Ahmad. FASTEST: A Practical Low-Complexity Algorithm for Compile-Time Assignment of Parallel Programs to Multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 10(2), 1999.
[23] Y.-K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Comput. Surv., 31(4):406-471, 1999.
[24] A. A. Lamb, W. Thies, and S. Amarasinghe. Linear Analysis and Optimization of Stream Programs. In PLDI, San Diego, CA, June 2003.
[25] J. Lebak. Polymorphous Computing Architecture (PCA) Example Applications and Description. External Report, Lincoln Laboratory, Mass. Inst. of Technology, 2001.
[26] E. A. Lee and D. G. Messerschmitt. Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing. IEEE Trans. Comput., 36(1):24-35, 1987.
[27] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: A System for Programming Graphics Hardware in a C-like Language. In SIGGRAPH, 2003.
[28] D. May, R. Shepherd, and C. Keane. Communicating Process Architecture: Transputers and Occam. Future Parallel Computers: An Advanced Course, Pisa, Lecture Notes in Computer Science 272, June 1987.
[29] M. Sipser. Introduction to the Theory of Computation. PWS Publishing Company, 1997.
[30] G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic Thread Extraction with Decoupled Software Pipelining. In MICRO, 2005.
[31] K. Parhi and D. Messerschmitt. Static Rate-Optimal Scheduling of Iterative Data-Flow Programs via Optimum Unfolding. IEEE Transactions on Computers, 40(2), 1991.
[32] J. L. Pino, S. S. Bhattacharyya, and E. A. Lee. A Hierarchical Multiprocessor Scheduling Framework for Synchronous Dataflow Graphs. Technical Report UCB/ERL M95/36, May 1995.
[33] J. L. Pino and E. A. Lee. Hierarchical Static Scheduling of Dataflow Graphs onto Multiple Processors. In Proc. of the IEEE Conference on Acoustics, Speech, and Signal Processing, 1995.
[34] S. Seneff. Speech transformation system (spectrum and/or excitation) without pitch extraction. Master's thesis, MIT, 1980.
[35] J. Sermulins, W. Thies, R. Rabbah, and S. Amarasinghe. Cache Aware Optimization of Stream Programs. In LCTES, Chicago, 2005.
[36] R. Stephens. A Survey of Stream Processing. Acta Informatica, 34(7), 1997.
[37] M. B. Taylor et al. The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs. IEEE Micro, 22(2), 2002.
[38] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, et al. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. In ISCA, Munich, Germany, June 2004.
[39] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications. In CC, France, 2002.
[40] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, et al. Baring It All to Software: Raw Machines. IEEE Computer, 30(9), 1997.
[41] S.-W. Liao, Z. Du, G. Wu, and G.-Y. Lueh. Data and Computation Transformations for Brook Streaming Applications on Multiprocessors. In CGO, 2006.
[42] D. Zhang, Z.-Z. Li, H. Song, and L. Liu. A Programming Model for an Embedded Media Processing Architecture. In SAMOS, 2005.