Slide 1: Communication and Coordination Paradigms for Highly-parallel Accelerators
Marc S. Orr, PhD Defense, December 5, 2016
Advisor: David A. Wood
Committee: Brad Beckmann, Mark Hill, Nam Sung Kim, Karu Sankaralingam, and Mike Swift
Slide 2: Motivation
- Highly-parallel accelerators, like GPUs, are becoming ubiquitous.
- GPUs are performant and energy-efficient for data-parallel applications that accommodate the GPU's peculiarities.
- However, many data-parallel applications cannot accommodate the GPU's peculiarities.
Slide 3: Research Statement
- I focus on one of the GPU's peculiarities: it is difficult to synchronize on GPUs.
- My first goal is to enable GPU threads to coordinate efficiently.
- My second goal is to identify novel use cases for efficient GPU thread coordination:
  - Task Aggregation for GPUs (ISCA 2014)
  - Message Aggregation for GPUs (under review)
Slide 4: Outline
- Motivation & Background
- Coordinating System-level Operations
- Fine-grain System-level Operations
- Task Aggregation for GPUs
- Message Aggregation for GPUs
- Conclusions
Slide 5: Highly-parallel Accelerators
- Lots of simple threads on data-parallel hardware.
- Examples: Intel Xeon Phi, GPUs.
- My focus: GPUs.
  - An abundant commodity sold by many companies.
  - Other accelerators are less common and more niche.
Slide 6: GPUs are Everywhere
- Bottom tier: cell phones, tablets, and netbooks.
- Mid tier: laptops and desktops.
- Top tier: supercomputers and cloud computing.
- New programming abstractions are making GPUs more accessible: CUDA, OpenCL, C++ AMP, OpenACC, Java Aparapi, etc.
- Integration is reducing their performance "tariff".
Slide 7: GPU Primer
- Work-item: a GPU thread executed by a SIMD lane.
- Wavefront: work-items executed by the same SIMD unit.
- Work-group: wavefronts executing on the same compute unit (CU).
- SIMT group: a wavefront or a work-group (a CUDA-terminology mapping is sketched below).
[Diagram: a compute unit (CU) contains SIMD units and an L1 cache; wavefronts of work-items map onto SIMD units, work-groups map onto a CU, and CUs share the L2/memory.]
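For readers who know these concepts under CUDA's names, here is a minimal sketch of the correspondence as indices a kernel can compute. The kernel name and output buffer are illustrative only, and the 32-lane warpSize is NVIDIA-specific (AMD wavefronts have 64 lanes):

    // Hedged sketch: the slide's OpenCL/HSA terms expressed with CUDA indices.
    // work-item ~ CUDA thread, wavefront ~ warp, work-group ~ thread block.
    __global__ void who_am_i(int *out) {
      int workItem  = blockIdx.x * blockDim.x + threadIdx.x; // global work-item id
      int lane      = threadIdx.x % warpSize;                // SIMD lane within the wavefront/warp
      int wavefront = threadIdx.x / warpSize;                // wavefront/warp within the work-group/block
      int workGroup = blockIdx.x;                            // work-group/block id
      out[workItem] = (workGroup << 16) | (wavefront << 8) | lane; // pack ids for inspection
    }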
Slide 8: Paradigms that Map Poorly to GPUs
- System-level operations: task scheduling, I/O, system calls, memory allocation.
- More generally, communication & coordination: synchronization, resource sharing.
- Result: many applications cannot easily benefit from the GPU's higher compute capability, higher memory bandwidth, and lower energy.
Communication & Coordination
What do I mean?Locks? RMWs? Transactions? etc.Focus: producer/consumer queuesProducers: GPU threadsConsumers: CPUs, GPUs, network interface, etc.Goal: arbitrate system-level operationsNarrow enough for a well-defined problemBroad enough for novel research12/5/20169Communication and Coordination Paradigms for Highly-parallel AcceleratorsSlide10
Slide 10: What Makes Coordination Hard?
- Many threads lead to contention.
  - AMD Kaveri (iGPU): 20,480 threads.
  - AMD FirePro S9170 (dGPU): 112,640 threads.
- Intra-wavefront dependencies can cause deadlock (sketched below):
    while (trylock(&L)) ; // spin
    unlock();
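To make that deadlock concrete, here is a minimal, hypothetical CUDA sketch (the lock variable and kernel are illustrative, not from the dissertation). All lanes of a wavefront contend for the same lock; on lockstep SIMT hardware the lanes that lose keep spinning, and the winning lane, which sits on the other side of the branch, may never be scheduled to reach the unlock, so the whole wavefront hangs:

    #include <cuda_runtime.h>

    // Hypothetical per-lane spin lock; shows why naive per-work-item locking
    // can deadlock inside a single wavefront on lockstep SIMT hardware.
    __device__ int lock_var = 0;

    __global__ void per_lane_lock_kernel(int *counter) {
      // Every lane in the wavefront tries to take the same lock.
      while (atomicCAS(&lock_var, 0, 1) != 0) {
        ; // losers spin; the winner may be masked off and never reach the unlock
      }
      *counter += 1;             // critical section
      atomicExch(&lock_var, 0);  // unlock
    }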
Slide 11: Outline (repeat of Slide 4)
Slide 12: Prior Work
- Coprocessor model: GPU threads don't execute system-level operations; instead, the CPU executes them.
- Operation-per-lane model: GPU threads independently execute system-level operations.
- Coalesced APIs: GPU work-groups (not threads) execute system-level operations.
Slide 13: Examples of Prior Work
  Model               | Tasking                                | Network I/O                   | File I/O
  Coprocessor         | (none)                                 | GPUDirect RDMA, MVAPICH2-GPU  | (none)
  Operation-per-lane  | CUDA & OpenCL 2.0 dynamic parallelism  | NVSHMEM, DCGN, GGAS           | (none)
  Coalesced APIs      | (none)                                 | GPUnet, GPUrdma               | GPUfs
Slide 14: Coprocessor Model
- System-level operations are executed by the host CPU.
[Diagram: an SoC containing a GPU (SIMT groups 0 and 1, work-items wi0-wi7), a CPU, and memory; the CPU performs operations on the GPU's behalf through a single producer/consumer queue.]
Slide 15: Operation-per-lane Model
- System-level operations are executed by individual GPU threads.
[Diagram: the same SoC, but each work-item writes directly into the producer/consumer queues in memory, each queue with its own write pointer (wrPtr).]
Slide 16: Coalesced APIs
- System-level operations are executed by GPU work-groups.
[Diagram: the same SoC, with each SIMT group, rather than each work-item, updating the producer/consumer queues' write pointers (wrPtr).]
Slide 17: Takeaways
  Model               | Programmable? | Prod/cons overhead | Pipelined execution?
  Coprocessor         | poor          | low                | requires programmer effort
  Operation-per-lane  | good          | high               | yes
  Coalesced APIs      | fair          | medium             | yes
- MY GOAL: the operation-per-lane model's programmability with the coprocessor model's performance.
Slide 18: Outline (repeat of Slide 4)
Slide 19: Implementation Overview
- Introduce leader-level synchronization.
  - Amortize communication cost across a SIMT group.
  - Eliminate dependencies within a SIMT group.
- Describe two communication schemes that build on leader-level synchronization:
  - SIMT-direct aggregation.
  - Indirect aggregation.
Slide 20: Leader-level Synchronization
- The SIMT group elects one work-item to synchronize on its behalf (a sketch follows below).
- Microbenchmark: ~118x speedup over work-item-level synchronization.
[Diagram: in each SIMT group, a single leader work-item updates the producer/consumer queue's write pointer (wrPtr) for the whole group.]
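A minimal sketch of the idea at wavefront granularity. The intrinsics below are CUDA's; the dissertation targets OpenCL/HSA, where the equivalent wavefront operations would be used, and the queue pointer and helper name are hypothetical. The lanes vote on who has work, one leader lane reserves all the slots with a single atomic, and the base index is broadcast back so each lane can compute its own slot:

    // Hedged sketch of leader-level synchronization at wavefront granularity.
    // Assumes at least one calling lane passes have_work == true.
    __device__ unsigned wrPtr;          // hypothetical queue write pointer

    __device__ unsigned enqueue_slot(bool have_work) {
      unsigned active = __activemask();
      unsigned ballot = __ballot_sync(active, have_work);   // which lanes enqueue
      int      lane   = threadIdx.x % warpSize;
      int      leader = __ffs(ballot) - 1;                   // lowest participating lane
      unsigned base   = 0;
      if (lane == leader)
        base = atomicAdd(&wrPtr, __popc(ballot));            // one reservation per wavefront
      base = __shfl_sync(active, base, leader);              // broadcast the base index
      // Each lane gets a distinct slot: base + (# participating lanes below it).
      return base + __popc(ballot & ((1u << lane) - 1));
    }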
Slide 21: SIMT-direct Aggregation
- The SIMT group invokes leader-level synchronization once for each producer/consumer queue that it targets.
[Diagram: each SIMT group's leader updates the write pointer (wrPtr) of every queue the group writes to.]
Slide 22: Indirect Aggregation
- The SIMT group uses leader-level synchronization to export its operations to a dedicated aggregator (sketched below).
[Diagram: SIMT groups append their operations to a single staging queue (one wrPtr); an aggregator consumes that queue (rdPtr) and distributes the operations to the per-consumer producer/consumer queues.]
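A hedged sketch of the indirect scheme, reusing the hypothetical layout from the Slide 20 sketch: the wavefront performs one leader-level reservation into a single staging buffer regardless of how many destination queues its lanes target, and a separate aggregator (the CP or a CPU thread) later bins the records by destination. The buffer, record format, and function name are assumptions:

    // Hedged sketch: indirect aggregation. Work-items export (dest, payload)
    // records with one leader-level reservation; the aggregator sorts them later.
    struct OpRecord { int dest_queue; int payload; };

    __device__ OpRecord staging[1 << 20];   // hypothetical staging buffer
    __device__ unsigned stagingWrPtr;

    __device__ void export_op(int dest_queue, int payload) {
      unsigned active = __activemask();
      int  lane   = threadIdx.x % warpSize;
      int  leader = __ffs(active) - 1;
      unsigned base = 0;
      if (lane == leader)
        base = atomicAdd(&stagingWrPtr, __popc(active));   // one atomic per wavefront
      base = __shfl_sync(active, base, leader);
      unsigned slot = base + __popc(active & ((1u << lane) - 1));
      staging[slot] = OpRecord{ dest_queue, payload };
    }

    // The aggregator drains `staging` and appends each record to its destination
    // producer/consumer queue; that loop is omitted here.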
Slide 23: SIMT-direct vs. Indirect
[Chart: performance of SIMT-direct vs. indirect aggregation.]
Slide 24: Wavefront vs. Work-group
[Chart: wavefront-granularity vs. work-group-granularity aggregation.]
Slide 25: More on Work-groups
- The leader is elected by communicating across the threads of a work-group.
- Problem: intra-work-group communication is not defined when a work-group's wavefronts diverge.
- Two high-level solutions (details in the dissertation):
  - Track control flow at work-group granularity.
  - Track which wavefronts are executing together.
Slide 26: Takeaways
- Leverage leader-level synchronization for an efficient operation-per-lane model.
  - SIMT-direct: scales to a small number of queues.
  - Indirect: scales to a large number of queues.
- A SIMT group can be defined as a wavefront or a work-group.
  - Wavefronts have well-defined behavior.
  - Work-groups amortize communication across more work-items.
Slide 27: Outline (repeat of Slide 4)
Slide 28: Task Aggregation for GPUs: Executive Summary
- SIMT languages (e.g., CUDA and OpenCL) restrict GPU programmers to regular parallelism.
- Goal: enable irregular parallelism on GPUs.
- Why? More GPU applications.
- How? Fine-grain task aggregation.
- What? Cilk on GPUs.
Slide 29: GPUs Today
- GPU tasks are scheduled by a control processor (CP), a small, in-order programmable core.
- Today's GPU abstractions are coarse-grain: they map well to SIMD hardware, but they limit fine-grain scheduling.
[Diagram: a GPU containing a CP and several SIMD units, attached to system memory.]
Slide 30: Fine-grain Scheduling via Channels
- The CP, or aggregator (agg), manages channels.
- Channels are finite task queues, except for:
  - User-defined scheduling.
  - Dynamic aggregation.
  - One consumption function.
- Dynamic aggregation enables "CPU-like" scheduling abstractions on GPUs.
[Diagram: aggregators manage channels in system memory and dispatch work to the GPU's SIMD units.]
Slide 31: Contributions
- First channel implementation (in software).
  - Lock-free, non-blocking, array-based.
  - SIMT-direct, wavefront granularity.
  - Implementation details in the backup slides.
- Enable Cilk on GPUs by using channels as an intermediate representation.
  - Breadth-first task traversal.
  - Coarse-grain dependency resolution.
  - Memory bounded with wavefront yield.
  - Cilk-to-channels formulation.
Slide 32: Methodology
- Implemented Cilk on channels on a simulated APU.
- Caches are coherent.
- The aggregator schedules Cilk tasks.
Slide 33: Cilk Scales with the GPU Architecture
- More compute units give faster execution.
- More results in the backup slides.
[Chart: speedup as the number of compute units grows.]
Slide 34: Takeaways
- Dynamic aggregation enables "CPU-like" scheduling abstractions for GPUs.
- Achieved by extending the GPU's control processor to manage channels.
- Channels can be used as an intermediate layer for higher-level abstractions like Cilk.
- Dynamic aggregation brings fine-grain task scheduling to GPUs.
Slide 35: Outline (repeat of Slide 4)
Slide 36: Message Aggregation for GPUs: Executive Summary
- Irregular data-intensive workloads have small and hard-to-predict data access patterns.
- Prior frameworks for these workloads focused on CPUs (GraphChi), CPU-based clusters (Grappa, GraphLab, etc.), and single GPUs (GasCL).
- My focus: irregular data-intensive workloads on a cluster of GPUs.
- Approach: aggregate fine-grain GPU-initiated messages.
Slide 37: GasCL: Vertex-Centric Graph Model for GPUs
- Gather: a vertex receives messages from its neighbors.
- Apply: compute and update the vertex's value.
- Scatter: send messages to the vertex's neighbors.
- PageRank example (a worked instance of the formula follows below):
    def scatter:
        foreach n in neighbors:
            sendMsg(n, my_value / my_edge_cnt)
    def gather-and-apply:
        my_value = 0.85 * combine_msg_sum() + 0.15 / vertex_cnt
- Limited to one GPU.
[Diagram: PageRank iterations on a five-vertex example graph; every vertex starts at 0.2 and the values evolve over iterations (e.g., through intermediate values such as 0.07 and 0.1 toward values like 0.31, 0.23, 0.14, and 0.09).]
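As a quick sanity check on the gather-and-apply formula, consider a hypothetical vertex in that five-vertex graph with two in-neighbors, each currently holding 0.2 and each having two outgoing edges (the neighbor and edge counts are assumed for illustration):

    combine_msg_sum() = 0.2/2 + 0.2/2 = 0.2
    my_value = 0.85 * 0.2 + 0.15/5 = 0.17 + 0.03 = 0.20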
Slide 38: Distributing GasCL
- Work-items initiate network messages.
- One-sided communication: read data (GET), write data (PUT), and higher-level operations (active messages).
- A scatter that updates a local neighbor with a plain store,
    *dest = val;
  becomes a one-sided write when the neighbor lives on another node:
    post_write(dest, &val, sizeof(float), ...);
[Diagram: the example graph partitioned across Node 0, Node 1, and Node 2; messages whose destination vertex is on another node are sent over the network with post_write.]
Slide 39: Approach
- HSA unified virtual memory can help offload network requests to the CPU (a sketch of this path follows below).
[Diagram: on Node 0, GPU work-items 0-3 store put requests (for elements of int out[4]) into a request queue in shared memory; the CPU drains the queue, issues the PUTs over the network to Node 1's memory, and stores acknowledgments back for the work-items.]
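A hedged sketch of that offload path, assuming a coherent, HSA-style shared address space. The descriptor format, queue, and flag names are hypothetical: GPU work-items publish put descriptors into a shared ring, and a CPU proxy thread polls the ring, issues the network PUT, and acknowledges completion.

    // Hedged sketch: GPU-initiated puts offloaded to a CPU proxy thread through
    // a request queue in HSA-style shared memory. Names and layout are illustrative.
    struct PutReq {
      void  *remote_dst;   // destination address on the remote node
      float  value;        // payload (a single float for simplicity)
      int    ready;        // set by the GPU when the request is fully written
      int    ack;          // set by the CPU when the network PUT completes
    };

    // GPU side: a work-item publishes one request (slot reservation as on Slide 20).
    __device__ void post_put(PutReq *queue, unsigned slot, void *dst, float v) {
      queue[slot].remote_dst = dst;
      queue[slot].value      = v;
      queue[slot].ack        = 0;
      __threadfence_system();                 // publish the fields before the flag
      atomicExch(&queue[slot].ready, 1);      // hand the slot to the CPU
    }

    // CPU side (not shown): poll the ready flags, issue the network PUT for each
    // request, then write ack = 1 so the producing work-item can wait on it.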
Slide 40: Message Size vs. Synchronization Performance
- Fine-grain producer/consumer synchronization is expensive!
[Chart: synchronization overhead vs. message size.]
Slide 41: Message Size vs. Network Performance
- Small messages degrade network performance.
- Bidirectional bandwidth test (http://mvapich.cse.ohio-state.edu/benchmarks/) run on a pair of AMD Kaveris connected over a 10 Gb/s Ethernet link.
Slide 42: Aggregate Network Messages
- Indirect aggregation at work-group granularity.
[Diagram: on Node 0, work-groups push messages into a GPU-producer/multi-consumer queue; CPU threads t0 and t1 each drain the queue into their own per-destination bin buffers (bins for Node 1 and Node 2) and send the filled bins over the network.]
Slide 43: GPU-Producer/Multi-Consumer Queue Organization
- Array-based queue.
- Synchronization occurs at work-group granularity.
- The payload is organized so that adjacent work-items activate the memory coalescer (see the sketch below).
[Layout: global writeIdx and readIdx indices select a slot; each slot [0..N-1] carries a full bit, per-slot tickets (writeTicket, readTicket, numServed), and a payload laid out as one element per lane (Lane 0 .. Lane N-1).]
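A hedged sketch of a slot layout along these lines. Only the field names come from the slide; the sizes, grouping, and enqueue logic are assumptions. Laying the payload out as payload[lane] within each slot means that when one work-group fills a slot, lane i writes word i of a contiguous region, so the hardware memory coalescer can merge the stores:

    // Hedged sketch of the GPU-producer/multi-consumer queue layout.
    // WG_SIZE lanes of one work-group cooperatively fill one slot.
    constexpr int WG_SIZE   = 256;   // assumed work-group size (== blockDim.x)
    constexpr int NUM_SLOTS = 64;    // assumed queue depth

    struct Slot {
      unsigned writeTicket;          // orders producers contending for this slot
      unsigned readTicket;           // orders consumers contending for this slot
      unsigned numServed;            // completed produce/consume rounds
      int      full;                 // 1 = payload valid, 0 = slot empty
      int      payload[WG_SIZE];     // lane i writes payload[i]: coalesced stores
    };

    struct GpmcQueue {
      unsigned writeIdx;             // next slot for producers (work-groups)
      unsigned readIdx;              // next slot for consumers (CPU threads)
      Slot     slots[NUM_SLOTS];
    };

    // Producer side, one work-group filling its assigned slot. The ticket-based
    // ordering among competing work-groups and CPU consumers is omitted here;
    // a reconstruction is sketched with the backup slides (Slide 64).
    __device__ void wg_enqueue(GpmcQueue *q, int my_element, unsigned slot) {
      Slot *s = &q->slots[slot % NUM_SLOTS];
      if (threadIdx.x == 0)
        while (atomicAdd(&s->full, 0) != 0) { }          // wait for the slot to drain
      __syncthreads();
      s->payload[threadIdx.x] = my_element;              // coalesced payload write
      __syncthreads();
      if (threadIdx.x == 0) { __threadfence_system(); atomicExch(&s->full, 1); }
    }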
Slide 44: GPU-Producer/Multi-Consumer Queue Performance
[Chart: queue throughput results.]
Slide 45: Compared to a CPU
- Expensive synchronization (e.g., acquiring a lock) happens once per work-group on the GPU, while each CPU thread acquires the lock separately.
- Adjacent work-items write the same cache lines.
- CPU threads write different (padded) cache lines: less false sharing, but lower cache utilization.
[Diagram: on the CPU, each core acquires the lock and writes padded values (v1.val, pad, v2.val, pad, ...); on the GPU, one work-group acquires the lock once and its work-items write packed values (v1.val, v2.val, v3.val, v4.val) in a CU's cache.]
Slide 46: Prototype, Called Gravel
- Hardware: 8 AMD Kaveri processors (AMD A10-7850K), 56 Gb/s InfiniBand interconnect.
- Software: mpic++ (Open MPI 1.6.5) and snack.sh.
- Useful HSA features: unified virtual memory, C11 atomics.
- Supports a subset of OpenSHMEM: work-items can initiate PUT and ATOMIC_INC (a hypothetical call site is sketched below).
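To illustrate what "work-items can initiate PUT" might look like at the call site, here is a hedged kernel fragment in the spirit of OpenSHMEM's put. The gravel_put name, its signature, and the kernel arguments are illustrative assumptions; Gravel's actual API may differ:

    // Hypothetical device-side wrapper in the spirit of OpenSHMEM's put.
    // The stub body stands in for Gravel's real request-posting machinery.
    __device__ void gravel_put(float *remote_dst, const float *src,
                               size_t nelems, int dest_node) {
      // Placeholder: in Gravel this would post a fine-grain message that the
      // runtime aggregates per destination node before touching the network.
      (void)remote_dst; (void)src; (void)nelems; (void)dest_node;
    }

    __global__ void scatter_kernel(const int *neighbor_node, float **neighbor_addr,
                                   const float *contrib, int num_edges) {
      int e = blockIdx.x * blockDim.x + threadIdx.x;
      if (e >= num_edges) return;
      // Fine-grain, GPU-initiated message: one float per edge.
      gravel_put(neighbor_addr[e], &contrib[e], 1, neighbor_node[e]);
    }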
Slide 47: Gravel vs. CPU-only
[Chart: performance comparison.]
Slide 48: Gravel vs. Prior Work
[Chart: performance comparison.]
Slide 49: Takeaways
- Gravel's focus: small and unpredictable messages.
- Dynamic aggregation amortizes network overhead, and the aggregation overhead itself is amortized across work-groups.
- Gravel enables work-items to initiate network transactions.
- Network transactions are offloaded to the CPU, which aggregates messages.
Slide 50: Outline (repeat of Slide 4)
Slide 51: Publications
Thesis chapters:
- Marc S. Orr, Bradford M. Beckmann, Steven K. Reinhardt, and David A. Wood, "Fine-grain Task Aggregation and Coordination on GPUs," ISCA 2014.
- Marc S. Orr, Shuai Che, Bradford M. Beckmann, Mark Oskin, Steven K. Reinhardt, and David A. Wood, "Gravel: Fine-grain GPU-initiated Network Messages," under review.
Other PhD research:
- Jason Power, Joel Hestness, Marc S. Orr, Mark D. Hill, and David A. Wood, "gem5-gpu: A Heterogeneous CPU-GPU Simulator," CAL 2014.
- Marc S. Orr, Shuai Che, Ayse Yilmazer, Bradford M. Beckmann, Mark D. Hill, and David A. Wood, "Synchronization Using Remote-Scope Promotion," ASPLOS 2015.
- Johnathan Alsop, Marc S. Orr, Bradford M. Beckmann, and David A. Wood, "Lazy Release Consistency for GPUs," MICRO 2016.
Slide 52: Conclusion
- Three main contributions:
  - Proposed efficient implementations for GPU-initiated system-level operations that follow the operation-per-lane model.
  - Explored task aggregation for GPUs.
  - Explored message aggregation for GPUs.
- Questions?
Slide 53: Backup
GPU Performance & Bandwidth
12/5/201654Communication and Coordination Paradigms for Highly-parallel Accelerators
NVIDIA,
CUDA
C Programming
Guide, 2014.Slide55
Slide 55: GPU Energy
- The GPU is 7.6x more energy efficient!
[Source: Keckler, keynote presentation at MICRO, 2011.]
Slide 56: First Channel Implementation
- Our array-based, lock-free, non-blocking design accommodates SIMT access patterns (a possible descriptor layout is sketched below).
[Diagram: channels A and B live in system memory and are managed by the CP on behalf of the GPU's SIMD units; each channel tracks data head/tail pointers, control head/tail/reserveTail pointers, and a done count.]
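A hedged sketch of what such a channel descriptor could look like, based only on the fields named on this slide (head, tail, reserveTail, done count) and on Slide 30's "one consumption function." The struct name, types, and reservation protocol are assumptions: producers would first bump reserveTail to claim space, fill the array, and then advance the published control tail, while the done count supports coarse-grain dependency resolution.

    // Hedged sketch of a software channel descriptor; not the dissertation's
    // exact design.
    struct Channel {
      // Data region: a bounded array of fixed-size task records.
      void     *buffer;        // task storage in system memory
      unsigned  dataHead;      // oldest valid task
      unsigned  dataTail;      // one past the newest valid task
      // Control region: producers reserve space before publishing it.
      unsigned  ctrlHead;
      unsigned  ctrlTail;      // published entries, visible to the consumer
      unsigned  reserveTail;   // entries claimed but not yet fully written
      // Dependency tracking for coarse-grain (e.g., Cilk continuation) resolution.
      unsigned  doneCount;
      // One consumption function per channel, run on dequeued tasks.
      void    (*consume)(void *task);
    };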
Slide 57: Cilk Background
- Cilk extends C for divide-and-conquer parallelism.
- Added keywords:
  - spawn: schedule a thread to execute a function.
  - sync: wait for prior spawns to complete.
    int fib(int n) {
      if (n <= 2) return 1;
      int x = spawn fib(n - 1);
      int y = spawn fib(n - 2);
      sync;
      return (x + y);
    }
Slide 58: Enable Cilk on GPUs via Channels (Step 1)
- Cilk routines are split at each sync into sub-routines: a "pre-sync" routine and a "continuation".
Original:
    int fib(int n) {
      if (n <= 2) return 1;
      int x = spawn fib(n - 1);
      int y = spawn fib(n - 2);
      sync;
      return (x + y);
    }
After splitting:
    // "pre-sync"
    int fib(int n) {
      if (n <= 2) return 1;
      int x = spawn fib(n - 1);
      int y = spawn fib(n - 2);
    }
    // "continuation"
    int fib_cont(int x, int y) {
      return (x + y);
    }
Slide 59: Enable Cilk on GPUs via Channels (Step 2)
- Channels are instantiated for breadth-first traversal.
  - Quickly populates the GPU's tens of thousands of lanes.
  - Facilitates coarse-grain dependency management.
[Diagram: the fib spawn tree (e.g., fib(5) spawning fib(4) and fib(3), down to the base cases) mapped onto a fib channel and a stack of fib_cont channels; the legend distinguishes ready pre-sync tasks, completed pre-sync tasks, continuation tasks, spawn edges, and dependence edges.]
Bound Cilk’s Memory Footprint
Bound memory to the depth of the Cilk tree by draining channels closer to the base caseThe amount of work generated dynamically is not known a prioriProposal: GPUs allow SIMT threads to yieldFacilitates resolving conflicts on shared resources like memory5
4
3
2
2
1
2
1
3
Back
12/5/2016
Communication and Coordination Paradigms for Highly-parallel Accelerators
60Slide61
Slide 61: Divergence and Channels
- Both branch divergence and memory divergence affect channel performance.
- Storing data directly in channels is good; storing pointers to data in channels is bad.
Slide 62: GPU Not Blocked on the Aggregator
[Chart: GPU progress while the aggregator runs.]
Slide 63: GPU Cilk vs. Standard GPU Workloads
- Cilk is more succinct than SIMT languages.
- Channels trigger more GPU dispatches.
  Benchmark | LOC reduction | Dispatch rate | Speedup
  Strassen  | 42%           | 13x           | 1.06
  Queens    | 36%           | 12.5x         | 0.98
- Same performance, easier to program.
Slide 64: GPMC Queue Synchronization (part 1)
- Lots of synchronization, but only once per work-group.
- Producers are ordered through a per-slot write ticket.
- Consumers are ordered through a per-slot read ticket.
- A producer and a consumer, each holding a valid ticket, are ordered through the full bit.
- Example: a queue with one slot and its per-slot state (a reconstruction is sketched below).
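A hedged sketch of that per-slot handshake, continuing the hypothetical Slot layout from the Slide 43 sketch. The exact protocol below is my reconstruction from the bullets above, not the dissertation's verified design: a producer claims a write ticket, waits until numServed reaches its ticket (the slot is then empty and belongs to it), writes the payload, and publishes it through the full bit; the consumer mirrors this with a read ticket and advances numServed when it is done.

    // Producer (work-group leader) protocol, split around the payload write.
    __device__ unsigned producer_acquire(Slot *s) {
      unsigned t = atomicAdd(&s->writeTicket, 1);      // claim my order for this slot
      while (atomicAdd(&s->numServed, 0) != t) { }      // wait until prior rounds finish
      return t;                                         // slot is now empty and mine
    }

    // ... the work-group writes s->payload[] here (as in the Slide 43 sketch) ...

    __device__ void producer_publish(Slot *s) {
      __threadfence_system();                           // payload writes go first
      atomicExch(&s->full, 1);                          // the consumer may now read
    }

    // Consumer (CPU thread in Gravel) protocol, as equivalent pseudocode:
    //   t = fetch_add(&s->readTicket, 1)
    //   wait until s->numServed == t && s->full == 1
    //   copy s->payload out of the slot
    //   s->full = 0
    //   fetch_add(&s->numServed, 1)   // unblocks the producer holding ticket t+1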
Slide 65: GPMC Queue Synchronization (part 2)
- Example: a queue with one slot and its per-slot state.
[Diagram: GPU work-groups and CPU consumer threads t0 and t1 step through the slot's state (full, numServed, writeTicket, readTicket); each work-group and consumer holds a ticket that determines when it may access the slot.]