Fine-grain Task Aggregation and Coordination on GPUs
Marc S. Orr†§, Bradford M. Beckmann§, Steven K. Reinhardt§, David A. Wood†§
ISCA, June 16, 2014

Executive Summary
SIMT languages (e.g. CUDA & OpenCL) restrict GPU programmers to regular parallelism
  Compare to Pthreads, Cilk, MapReduce, TBB, etc.
Goal: enable irregular parallelism on GPUs
  Why? More GPU applications
  How? Fine-grain task aggregation
  What? Cilk on GPUs

Outline
Background
  GPUs
  Cilk
  Channel Abstraction
Our Work
  Cilk on Channels
  Channel Design
Results/Conclusion

GPUs Today
GPU tasks scheduled by control processor (CP), a small, in-order programmable core
Today’s GPU abstractions are coarse-grain
  + Maps well to SIMD hardware
  - Limits fine-grain scheduling
[Figure: GPU containing a CP and four SIMD units, attached to System Memory]

Cilk Background
Cilk extends C for divide and conquer parallelism
Adds keywords
  spawn: schedule a thread to execute a function
  sync: wait for prior spawns to complete

    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
        sync;
        return (x + y);
    }
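
The routine above can also be read sequentially. The following is a minimal C sketch, assuming only that erasing the two keywords leaves valid C (the usual serial reading of a Cilk program). The empty `spawn` and `sync` macros are illustrative stand-ins, not a Cilk runtime; they let the slide's code compile and run on one thread so its result can be checked.

```c
#include <assert.h>

/* Assumption: erase the Cilk keywords so the body runs as plain sequential C. */
#define spawn
#define sync

static int fib(int n) {
    assert(n >= 1);               /* the slide's base case covers n = 1 and n = 2 */
    if (n <= 2) return 1;
    int x = spawn fib(n - 1);     /* under Cilk: scheduled as a separate thread */
    int y = spawn fib(n - 2);
    sync;                         /* under Cilk: wait for both spawns */
    return (x + y);
}
```

Calling `fib(n)` here gives the value the parallel version is expected to compute, which makes it a convenient reference when checking a GPU implementation.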

Prior Work on Channels
CP, or aggregator (agg), manages channels
Finite task queues, except:
  User-defined scheduling
  Dynamic aggregation
  One consumption function
Dynamic aggregation enables “CPU-like” scheduling abstractions on GPUs
[Figure: GPU with an aggregator (Agg) and four SIMD units; channels reside in System Memory]
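
The channel idea can be sketched in plain C: a finite queue of tasks bound to a single consumption function, plus an aggregator step that gathers queued tasks into one dispatch. Every name below (`channel`, `channel_push`, `aggregator_dispatch`, the capacity and dispatch width) is hypothetical and chosen for illustration; this is a sequential model of the abstraction, not the hardware mechanism.

```c
#include <assert.h>
#include <stddef.h>

#define CHANNEL_CAPACITY 8   /* hypothetical: channels are finite */
#define DISPATCH_WIDTH   4   /* hypothetical: tasks gathered per dispatch */

/* The one consumption function bound to a channel. */
typedef void (*consume_fn)(const int *args, size_t count, int *out);

typedef struct {
    int        tasks[CHANNEL_CAPACITY]; /* one int argument per queued task */
    size_t     count;                   /* tasks currently queued */
    consume_fn consume;
} channel;

/* Enqueue a task; returns 0 when the finite queue is full. */
static int channel_push(channel *ch, int arg) {
    if (ch->count == CHANNEL_CAPACITY) return 0;
    ch->tasks[ch->count++] = arg;
    return 1;
}

/* Aggregator step: hand up to DISPATCH_WIDTH queued tasks to the
 * consumption function as a single dispatch. Returns the number dispatched. */
static size_t aggregator_dispatch(channel *ch, int *out) {
    size_t n = ch->count < DISPATCH_WIDTH ? ch->count : DISPATCH_WIDTH;
    if (n == 0) return 0;
    assert(ch->consume != NULL);
    ch->consume(ch->tasks, n, out);
    for (size_t i = n; i < ch->count; i++)   /* keep the remaining tasks in order */
        ch->tasks[i - n] = ch->tasks[i];
    ch->count -= n;
    return n;
}

/* Example consumption function: square every task argument. */
static void square_all(const int *args, size_t count, int *out) {
    for (size_t i = 0; i < count; i++) out[i] = args[i] * args[i];
}
```

Pushing six tasks and dispatching twice would yield one full group of four followed by a partial group of two, which is the kind of dynamic grouping the slide refers to.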

Enable Cilk on GPUs via Channels
Step 1: Cilk routines split by sync into sub-routines

Original routine:

    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
        sync;
        return (x + y);
    }

After the split:

    /* "pre-sync" */
    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
    }

    /* "continuation" */
    int fib_cont(int x, int y) {
        return (x + y);
    }
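
The split in Step 1 can be mimicked in standard C. This is a sketch under stated assumptions: the "pre-sync" half returns a small record saying whether it hit the base case or which two children it wants spawned, and a stand-in driver (`run_fib`, hypothetical) plays the role of the runtime by running the children and then the continuation. The record type `presync_result` is likewise invented for illustration.

```c
#include <assert.h>

/* What the "pre-sync" half hands back to the runtime. */
typedef struct {
    int done;          /* 1: base case reached, value is final */
    int value;         /* valid only when done == 1 */
    int spawn_arg[2];  /* arguments of the two spawned children when done == 0 */
} presync_result;

/* "pre-sync" sub-routine: everything before the sync. */
static presync_result fib_presync(int n) {
    presync_result r = {0, 0, {0, 0}};
    assert(n >= 1);
    if (n <= 2) { r.done = 1; r.value = 1; return r; }
    r.spawn_arg[0] = n - 1;
    r.spawn_arg[1] = n - 2;
    return r;
}

/* "continuation" sub-routine: everything after the sync. */
static int fib_cont(int x, int y) {
    return x + y;
}

/* Hypothetical stand-in for the runtime: run both children to completion,
 * then invoke the continuation with their results. */
static int run_fib(int n) {
    presync_result r = fib_presync(n);
    if (r.done) return r.value;
    int x = run_fib(r.spawn_arg[0]);
    int y = run_fib(r.spawn_arg[1]);
    return fib_cont(x, y);
}
```

The point of the split is that neither sub-routine ever waits: the waiting implied by sync moves into whatever schedules `fib_cont`.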

Enable Cilk on GPUs via Channels
Step 2: Channels instantiated for breadth-first traversal
  Quickly populates GPU’s tens of thousands of lanes
  Facilitates coarse-grain dependency management
[Figure: fib task tree mapped onto a fib channel and a fib_cont channel stack, with the top of stack marked. Legend: “pre-sync” task ready; “pre-sync” task done; “continuation” task; one arrow style from A to B means task A spawned task B, the other means task B depends on task A.]
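
A sequential C model of the breadth-first scheme in Step 2, offered as a sketch rather than the authors' implementation. It assumes one level of the task tree is drained from the fib channel at a time, that each drained level is recorded on a stack standing in for the fib_cont channel stack, and that continuations then run from the top of that stack downward. The `task` layout, the array bounds, and all function names are hypothetical.

```c
#include <assert.h>

#define MAX_TASKS  1024   /* hypothetical bound for this sketch */
#define MAX_LEVELS 32

typedef struct {
    int n;           /* argument of fib */
    int parent;      /* index of the spawning task, -1 for the root */
    int slot;        /* which operand of the parent's continuation: 0 (x) or 1 (y) */
    int is_leaf;     /* pre-sync half hit the base case */
    int operand[2];  /* x and y, written by the two children */
    int result;
} task;

static task tasks[MAX_TASKS];
static int  num_tasks;

static int spawn_task(int n, int parent, int slot) {
    assert(num_tasks < MAX_TASKS);
    task t = { n, parent, slot, 0, {0, 0}, 0 };
    tasks[num_tasks] = t;
    return num_tasks++;
}

static int fib_breadth_first(int n) {
    int level_start[MAX_LEVELS + 1];  /* stack of levels awaiting continuations */
    int levels = 0, begin = 0;
    num_tasks = 0;
    spawn_task(n, -1, 0);
    /* Phase 1: drain the fib channel one whole level at a time. */
    while (begin < num_tasks) {
        int end = num_tasks;
        assert(levels < MAX_LEVELS);
        level_start[levels++] = begin;          /* push this level */
        for (int i = begin; i < end; i++) {
            if (tasks[i].n <= 2) { tasks[i].is_leaf = 1; tasks[i].result = 1; }
            else { spawn_task(tasks[i].n - 1, i, 0); spawn_task(tasks[i].n - 2, i, 1); }
        }
        begin = end;
    }
    level_start[levels] = num_tasks;
    /* Phase 2: pop levels from the top of the stack and run continuations. */
    for (int l = levels - 1; l >= 0; l--) {
        for (int i = level_start[l]; i < level_start[l + 1]; i++) {
            task *t = &tasks[i];
            if (!t->is_leaf) t->result = t->operand[0] + t->operand[1];  /* fib_cont */
            if (t->parent >= 0) tasks[t->parent].operand[t->slot] = t->result;
        }
    }
    return tasks[0].result;
}
```

Because every child sits exactly one level below its parent, popping the stack deepest level first guarantees both operands are present before a continuation runs. That is the coarse-grain, per-level form of dependency tracking this sketch relies on.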

Bound Cilk’s Memory Footprint
Bound memory to the depth of the Cilk tree by draining channels closer to the base case
  The amount of work generated dynamically is not known a priori
We propose that GPUs allow SIMT threads to yield
  Facilitates resolving conflicts on shared resources like memory
[Figure: fib task tree with node labels 5, 4, 3, 2, 2, 1, 2, 1, 3]
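
To make the effect of drain order concrete, here is a small C simulation, again only a sketch. It assumes one channel per fib argument, a scheduler that dispatches at most `width` tasks from a single channel per step, and it counts only queued pre-sync tasks (continuation storage is ignored). The `policy` enum and `peak_queued` function are hypothetical names.

```c
#include <assert.h>

#define MAX_N 32

typedef enum { SHALLOWEST_FIRST, DEEPEST_FIRST } policy;

/* Peak number of queued fib tasks while evaluating fib(n). */
static long peak_queued(int n, int width, policy p) {
    long pending[MAX_N + 1] = {0};   /* pending[k]: queued tasks with argument k */
    long total = 1, peak = 1;
    assert(n >= 1 && n <= MAX_N && width >= 1);
    pending[n] = 1;
    while (total > 0) {
        int k = -1;
        if (p == DEEPEST_FIRST) {            /* channel closest to the base case */
            for (int i = 1; i <= n; i++) if (pending[i] > 0) { k = i; break; }
        } else {                             /* channel farthest from the base case */
            for (int i = n; i >= 1; i--) if (pending[i] > 0) { k = i; break; }
        }
        long batch = pending[k] < width ? pending[k] : width;
        pending[k] -= batch;
        total -= batch;
        if (k > 2) {                         /* each task spawns fib(k-1) and fib(k-2) */
            pending[k - 1] += batch;
            pending[k - 2] += batch;
            total += 2 * batch;
        }
        if (total > peak) peak = total;
    }
    return peak;
}
```

Comparing `peak_queued(n, 1, DEEPEST_FIRST)` with `peak_queued(n, 1, SHALLOWEST_FIRST)` for the same `n` shows the trade-off: preferring channels near the base case keeps the queue short, while the breadth-first order lets it grow with the width of the tree.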

Channel Implementation
Our design accommodates SIMT access patterns
  + array-based
  + lock-free
  + non-blocking
See Paper
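
The slide defers the actual design to the paper, so the following is not that design. It is a generic C11 sketch of what an array-based, lock-free, non-blocking enqueue can look like: producers reserve a slot with a compare-and-swap, write their data, then publish it with a per-slot flag, and a full channel makes the call fail instead of waiting. The type `lf_channel`, its functions, and the fill-once behavior (no wrap-around) are all assumptions made for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CHAN_SLOTS 4   /* hypothetical capacity */

typedef struct {
    int         data[CHAN_SLOTS];
    atomic_bool ready[CHAN_SLOTS];  /* set once the slot's data is written */
    atomic_uint reserved;           /* number of slots handed out so far */
} lf_channel;

static void lf_init(lf_channel *ch) {
    atomic_init(&ch->reserved, 0);
    for (int i = 0; i < CHAN_SLOTS; i++) atomic_init(&ch->ready[i], false);
}

/* Non-blocking enqueue: returns false when the array is full. */
static bool lf_try_push(lf_channel *ch, int value) {
    unsigned slot = atomic_load(&ch->reserved);
    do {
        if (slot >= CHAN_SLOTS) return false;   /* full: give up, do not spin */
        /* On failure the CAS reloads `slot`, so the full check is repeated. */
    } while (!atomic_compare_exchange_weak(&ch->reserved, &slot, slot + 1));
    ch->data[slot] = value;                     /* this slot is now exclusively ours */
    atomic_store(&ch->ready[slot], true);       /* publish to the consumer */
    return true;
}

/* Consumer side: length of the contiguous prefix of published slots. */
static unsigned lf_ready_prefix(lf_channel *ch) {
    unsigned n = 0;
    while (n < CHAN_SLOTS && atomic_load(&ch->ready[n])) n++;
    return n;
}
```

Separating reservation from publication means no producer ever holds a lock, and a consumer only reads slots whose `ready` flag is set.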

Methodology
Implemented Cilk on channels on a simulated APU
  Caches are sequentially consistent
  Aggregator schedules Cilk tasks

Cilk scales with the GPU Architecture
More Compute Units
Faster execution

Conclusion
We observed that dynamic aggregation enables new GPU programming languages and abstractions
We enabled dynamic aggregation by extending the GPU’s control processor to manage channels
We found that breadth first scheduling works well for Cilk on GPUs
We proposed that GPUs allow SIMT threads to yield for breadth first scheduling
Future work should focus on how the control processor can enable more GPU applications

Backup

Divergence and Channels
Branch divergence
Memory divergence
  + Data in channels good
  - Pointers to data in channels bad

GPU NOT Blocked on Aggregator

GPU Cilk vs. standard GPU workloads
Cilk is more succinct than SIMT languages
Channels trigger more GPU dispatches

            LOC reduction   Dispatch rate   Speedup
Strassen    42%             13x             1.06
Queens      36%             12.5x           0.98

Same performance, easier to program

Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.