
Fine-grain Task Aggregation and Coordination on GPUs

Marc S. Orr†§, Bradford M. Beckmann§, Steven K. Reinhardt§, David A. Wood†§

ISCA, June 16, 2014


Executive Summary

SIMT languages (e.g., CUDA & OpenCL) restrict GPU programmers to regular parallelism
- Compare to Pthreads, Cilk, MapReduce, TBB, etc.
Goal: enable irregular parallelism on GPUs
- Why? More GPU applications
- How? Fine-grain task aggregation
- What? Cilk on GPUs

Outline

Background
- GPUs
- Cilk
- Channel Abstraction
Our Work
- Cilk on Channels
- Channel Design
Results/Conclusion

GPUs Today

GPU tasks are scheduled by a control processor (CP), a small, in-order programmable core
Today's GPU abstractions are coarse-grain

[Figure: a GPU containing a CP and four SIMD units, connected to system memory]

+ Maps well to SIMD hardware
- Limits fine-grain scheduling

Cilk Background

Cilk extends C for divide-and-conquer parallelism
Adds keywords
- spawn: schedule a thread to execute a function
- sync: wait for prior spawns to complete

1: int fib(int n) {
2:   if (n <= 2) return 1;
3:   int x = spawn fib(n - 1);
4:   int y = spawn fib(n - 2);
5:   sync;
6:   return (x + y);
7: }
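The slide's fib example becomes plain C once the two Cilk keywords are erased; a minimal sequential sketch, with comments marking where spawn and sync sat:

```c
#include <assert.h>

/* Sequential version of the slide's fib example: without Cilk's
 * spawn/sync keywords, the two recursive calls simply run one
 * after the other instead of as parallel tasks. */
int fib(int n) {
    if (n <= 2) return 1;
    int x = fib(n - 1);  /* "spawn fib(n - 1)" in Cilk */
    int y = fib(n - 2);  /* "spawn fib(n - 2)" in Cilk */
    /* Cilk's "sync" would wait here for both spawns to finish. */
    return x + y;
}
```

In Cilk, the two spawns may run concurrently, so the sync before `return (x + y)` is what guarantees both `x` and `y` are ready.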

Prior Work on Channels

The CP, or aggregator (agg), manages channels
Channels are finite task queues, except:
- User-defined scheduling
- Dynamic aggregation
- One consumption function

[Figure: an aggregator managing channels in system memory, feeding a GPU with four SIMD units]

Dynamic aggregation enables "CPU-like" scheduling abstractions on GPUs
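A channel can be pictured as a bounded, array-based task queue bound to a single consumption function. A single-threaded sketch, assuming illustrative names (channel_t, chan_push, chan_drain, consume_add) that are not the paper's API; the real channels are lock-free and managed by the aggregator:

```c
#include <assert.h>
#include <stddef.h>

#define CHAN_CAP 64

/* A channel: a finite task queue plus its one consumption function. */
typedef struct {
    int tasks[CHAN_CAP];       /* task arguments (here: plain ints) */
    size_t head, tail;         /* consume at head, produce at tail */
    void (*consume)(int arg);  /* the channel's one consumption function */
} channel_t;

/* Producer side: returns 0 when the finite queue is full. */
static int chan_push(channel_t *c, int arg) {
    if (c->tail - c->head == CHAN_CAP) return 0;
    c->tasks[c->tail++ % CHAN_CAP] = arg;
    return 1;
}

/* Consumer side: dynamic aggregation means the aggregator lets tasks
 * accumulate, then dispatches the whole batch at once. Returns the
 * number of tasks consumed. */
static size_t chan_drain(channel_t *c) {
    size_t n = 0;
    while (c->head != c->tail) {
        c->consume(c->tasks[c->head++ % CHAN_CAP]);
        n++;
    }
    return n;
}

/* Example consumption function: sums the task arguments. */
static int g_sum = 0;
static void consume_add(int arg) { g_sum = g_sum + arg; }
```

Batching many small pushes into one drain is what lets the aggregator amortize a GPU dispatch over an entire wavefront's worth of tasks.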

Outline

Background
- GPUs
- Cilk
- Channel Abstraction
Our Work
- Cilk on Channels
- Channel Design
Results/Conclusion

Enable Cilk on GPUs via Channels

Step 1: Cilk routines are split at each sync into sub-routines

Original:
1: int fib(int n) {
2:   if (n <= 2) return 1;
3:   int x = spawn fib(n - 1);
4:   int y = spawn fib(n - 2);
5:   sync;
6:   return (x + y);
7: }

After splitting:

"pre-sync":
1: int fib(int n) {
2:   if (n <= 2) return 1;
3:   int x = spawn fib(n - 1);
4:   int y = spawn fib(n - 2);
5: }

"continuation":
6: int fib_cont(int x, int y) {
7:   return (x + y);
8: }
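The sync-splitting transformation can be exercised on the host with explicit continuation records: the "pre-sync" half spawns the two children, and the "continuation" half runs once both results have arrived. A sequential C sketch, assuming the cont_t bookkeeping and function names are illustrative; on the GPU the two halves would be enqueued into the fib and fib_cont channels rather than called directly:

```c
#include <assert.h>
#include <stdlib.h>

/* One continuation record per sync point: it waits for both
 * spawned children, then forwards x + y to its own parent. */
typedef struct cont {
    int x, y;        /* results of the two spawned children */
    int pending;     /* children still outstanding at the sync */
    struct cont *up; /* continuation waiting on this one */
    int slot;        /* 0: deliver result into up->x, 1: into up->y */
} cont_t;

static int g_result;  /* where the root's result lands */

static void deliver(cont_t *c, int slot, int val);

/* "pre-sync" half: resolve the base case, or spawn fib(n-1) and
 * fib(n-2) against a fresh continuation record. */
static void fib_task(int n, cont_t *up, int slot) {
    if (n <= 2) { deliver(up, slot, 1); return; }
    cont_t *c = malloc(sizeof *c);
    c->pending = 2; c->up = up; c->slot = slot;
    fib_task(n - 1, c, 0);
    fib_task(n - 2, c, 1);
}

/* "continuation" half (fib_cont on the slide): fires at the sync
 * point, once both child results are in. */
static void deliver(cont_t *c, int slot, int val) {
    if (c == NULL) { g_result = val; return; }  /* root finished */
    if (slot == 0) c->x = val; else c->y = val;
    if (--c->pending == 0) {
        int r = c->x + c->y;  /* return (x + y); */
        cont_t *up = c->up;
        int s = c->slot;
        free(c);
        deliver(up, s, r);
    }
}

static int fib_split(int n) {
    fib_task(n, NULL, 0);
    return g_result;
}
```

This host sketch runs depth-first; the point of the channels is that the same pre-sync and continuation tasks can instead be aggregated and scheduled breadth-first, as Step 2 describes.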

Enable Cilk on GPUs via Channels

Step 2: Channels are instantiated for breadth-first traversal
- Quickly populates the GPU's tens of thousands of lanes
- Facilitates coarse-grain dependency management

[Figure: the fib(5) call tree laid out breadth-first. Nodes are "pre-sync" tasks (ready or done) and "continuation" tasks; an edge from task A to task B means A spawned B, so B depends on A. Pre-sync tasks flow into a fib channel; continuations go into a stack of fib_cont channels, drained from the top of the stack]

Bound Cilk's Memory Footprint

Bound memory to the depth of the Cilk tree by draining the channels closest to the base case
The amount of work generated dynamically is not known a priori
We propose that GPUs allow SIMT threads to yield
- Facilitates resolving conflicts on shared resources like memory

[Figure: the fib(5) call tree; channels nearest the base case are drained first]
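One way to picture the draining policy: keep one channel per Cilk-tree depth and always drain the deepest non-empty one, i.e., the channel closest to the base case, so pending work stays bounded by the tree depth. A toy sketch, assuming the depth-indexed counters and names are illustrative rather than the paper's design:

```c
#include <assert.h>

#define MAX_DEPTH 32

/* Number of tasks queued at each depth of the Cilk tree; a real
 * implementation would hold a channel per depth, not a counter. */
static int pending[MAX_DEPTH];

/* Scheduling policy: pick the deepest depth with queued tasks
 * (closest to the base case); -1 if everything is empty. Draining
 * deep tasks first retires subtrees before their parents can keep
 * expanding, which is what bounds the memory footprint. */
static int next_depth_to_drain(void) {
    for (int d = MAX_DEPTH - 1; d >= 0; d--)
        if (pending[d] > 0) return d;
    return -1;
}
```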

Channel Implementation

Our design accommodates SIMT access patterns:
+ array-based
+ lock-free
+ non-blocking

See paper for details
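In the spirit of those three bullets, producers can claim array slots with a single fetch-and-add, which keeps the fast path non-blocking and lets the lanes of a SIMT wavefront reserve adjacent slots together. A heavily simplified producer-side sketch using C11 atomics; this is an assumption-laden illustration, not the paper's implementation (a full design must also recycle slots and repair the counter drift left by failed reservations):

```c
#include <assert.h>
#include <stdatomic.h>

#define SLOTS 128

static int slots[SLOTS];                /* array-based task storage */
static atomic_uint reserve_idx;         /* next free slot to claim */

/* Claim one slot with a single atomic fetch-and-add: no locks, and
 * no thread ever waits on another (non-blocking). Returns the slot
 * index, or -1 when the channel is full (caller should yield/retry;
 * the lost increment is a simplification in this sketch). */
static int channel_reserve(int task) {
    unsigned i = atomic_fetch_add(&reserve_idx, 1);
    if (i >= SLOTS) return -1;
    slots[i] = task;
    return (int)i;
}
```

Because fetch-and-add serializes only the counter update, all lanes in a wavefront can reserve slots concurrently, matching the SIMT access pattern the slide calls out.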

Outline

Background
- GPUs
- Cilk
- Channel Abstraction
Our Work
- Cilk on Channels
- Channel Design
Results/Conclusion

Methodology

- Implemented Cilk on channels on a simulated APU
- Caches are sequentially consistent
- The aggregator schedules Cilk tasks

Cilk Scales with the GPU Architecture

More compute units → faster execution

Conclusion

- We observed that dynamic aggregation enables new GPU programming languages and abstractions
- We enabled dynamic aggregation by extending the GPU's control processor to manage channels
- We found that breadth-first scheduling works well for Cilk on GPUs
- We proposed that GPUs allow SIMT threads to yield for breadth-first scheduling
- Future work should focus on how the control processor can enable more GPU applications

Backup

Divergence and Channels

Branch divergence
Memory divergence:
+ Data in channels: good
- Pointers to data in channels: bad

GPU NOT Blocked on Aggregator

GPU Cilk vs. Standard GPU Workloads

Cilk is more succinct than SIMT languages
Channels trigger more GPU dispatches

Benchmark | LOC reduction | Dispatch rate | Speedup
Strassen  | 42%           | 13x           | 1.06
Queens    | 36%           | 12.5x         | 0.98

Same performance, easier to program

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.