Verification of Producer-Consumer Synchronization in GPU Programs
Rahul Sharma (Stanford), Michael Bauer (NVIDIA Research), Alex Aiken (Stanford)
June 15, 2015
Presentation Transcript

Verification of Producer-Consumer Synchronization in GPU Programs
Rahul Sharma (Stanford), Michael Bauer (NVIDIA Research), Alex Aiken (Stanford)
June 15, 2015

Outline
- GPU background
- Motivating examples
- Verification algorithm and implementation
- Results

GPU background
A GPU has off-chip global memory and a set of streaming multiprocessors (SMs), each with many ALUs and on-chip shared memory (up to 48 KB). A threadblock (CTA) contains ~100s of threads; a warp is 32 threads. A typical kernel follows this pattern:
- Load data from global to shared memory
- __syncthreads() barrier
- Compute on data in shared memory
- __syncthreads() barrier
- Store data from shared back to global memory
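The pattern above can be sketched with ordinary host threads, a minimal illustrative model in which `threading.Barrier` stands in for `__syncthreads()` (all names and the toy computation are assumptions, not GPU code):

```python
import threading

N = 4  # toy "threadblock" size
syncthreads = threading.Barrier(N)   # stands in for __syncthreads()
global_mem = list(range(N))
shared_mem = [None] * N
out = [None] * N

def gpu_thread(tid):
    shared_mem[tid] = global_mem[tid]  # load global -> shared
    syncthreads.wait()                 # __syncthreads() barrier
    val = sum(shared_mem)              # compute on data in shared
    syncthreads.wait()                 # __syncthreads() barrier
    out[tid] = val                     # store shared -> global

threads = [threading.Thread(target=gpu_thread, args=(t,)) for t in range(N)]
for t in threads: t.start()
for t in threads: t.join()
```

Without the first barrier, a thread could read `shared_mem` before every load has completed; the barrier is what makes the compute phase see all the loaded data.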

Named barriers
- Synchronization primitive built into the hardware; 16 named barriers per SM
- Two instructions: Sync (blocking) and Arrive (non-blocking), each specifying a participating thread count
- __syncthreads() is a special case of named barriers: Sync 0, N
- Named barriers encode producer-consumer patterns: a producer warp executes Arrive 0,64 while a consumer warp executes Sync 0,64 on named barrier 0
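The arrive/sync semantics can be modeled in Python, a sketch in which a condition variable stands in for the hardware barrier (the class and all names are illustrative, not an NVIDIA API):

```python
import threading

class NamedBarrier:
    """Toy model of a GPU named barrier with a participating count."""
    def __init__(self, count):
        self.count = count            # participants per generation
        self.arrived = 0
        self.generation = 0
        self.cond = threading.Condition()

    def _register(self):
        self.arrived += 1
        if self.arrived == self.count:  # generation complete
            self.arrived = 0
            self.generation += 1
            self.cond.notify_all()

    def arrive(self):
        """Non-blocking: signal arrival and continue immediately."""
        with self.cond:
            self._register()

    def sync(self):
        """Blocking: signal arrival, then wait for the generation to complete."""
        with self.cond:
            gen = self.generation
            self._register()
            while self.generation == gen:
                self.cond.wait()

# Producer-consumer handshake on named barrier 0 with 2 participants.
b0 = NamedBarrier(2)
shared, result = {}, []

def producer():
    shared["a"] = 42   # write a
    b0.arrive()        # Arrive 0,2: producer does not block

def consumer():
    b0.sync()          # Sync 0,2: blocks until the producer arrives
    result.append(shared["a"])

threads = [threading.Thread(target=f) for f in (producer, consumer)]
for t in threads: t.start()
for t in threads: t.join()
```

The asymmetry is the point: the producer's Arrive lets it race ahead to other work, while the consumer's Sync guarantees it cannot read before the write.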

CudaDMA library (SC 2011)
- Simple library to abstract data movement between GPU memories: global to shared, shared to global
- Specializes warps: compute warps do math, DMA warps move data
- Uses named barriers to synchronize transfers, and additional named barriers for double buffering

The handshake on the slide:
- Compute warps: start_xfer (arrive 0,N); wait_finish (sync 1,N); compute on shared buffer; start_xfer (arrive 0,N); wait_finish (sync 1,N); ...
- DMA warps: wait_start (sync 0,N); load data into shared buffer; finish_xfer (arrive 1,N); wait_start (sync 0,N); load data into shared buffer; finish_xfer (arrive 1,N); ...
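The compute/DMA handshake can be sketched with one event per barrier generation, a simplification assuming a single compute thread and a single DMA thread (the operation names start_xfer, wait_start, etc. come from the slide; the event-based model and data values are illustrative):

```python
import threading

ITERS = 3
# One event per generation of each named barrier: barrier 0 signals
# "start transfer", barrier 1 signals "transfer finished".
# Event.set() models arrive (non-blocking); Event.wait() models sync.
start = [threading.Event() for _ in range(ITERS)]
finish = [threading.Event() for _ in range(ITERS)]
shared_buffer, computed = [], []

def compute_warp():
    for i in range(ITERS):
        start[i].set()                         # start_xfer: arrive 0,N
        finish[i].wait()                       # wait_finish: sync 1,N
        computed.append(shared_buffer[i] * 2)  # compute on shared buffer

def dma_warp():
    for i in range(ITERS):
        start[i].wait()                        # wait_start: sync 0,N
        shared_buffer.append(i + 10)           # load data into shared buffer
        finish[i].set()                        # finish_xfer: arrive 1,N

threads = [threading.Thread(target=f) for f in (compute_warp, dma_warp)]
for t in threads: t.start()
for t in threads: t.join()
```

Two barriers are needed because the dependency runs in both directions: barrier 0 tells the DMA warp the buffer is free, barrier 1 tells the compute warp the data has landed.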

Singe compiler (PPoPP 2014)
- DSL compiler for combustion chemistry; up to 4X speedup; kernels contain 10K lines
- Maps static dataflow graphs onto warps; uses shared memory for communication
- Assigns synchronization points to named barriers, analogous to register allocation
- Manages the passing of data through shared memory
(Slide figure: a dataflow graph of tasks A-J mapped across warps 0-3, with edges assigned named-barrier numbers.)

Named barrier challenges
Three challenges:
- Named barrier reuse: must prove that it is safe to recycle named barriers; needs a happens-before relationship, which must be self-consistent
- Deadlock: e.g., warp 0 executes sync 0; arrive 1 while warp 1 executes sync 1; arrive 0, and both block forever
- Shared memory races: two accesses to the same location with at least one being a write

WEFT architecture
A GPU kernel for a threadblock of n threads is compiled into thread programs 0 … n-1, one per thread. WEFT constructs a happens-before relation over these programs and checks for deadlocks, improper barrier recycling, and shared memory data races.

Thread programs
- Omit statements irrelevant to the properties being checked
- Straight-line programs: sequences of commands
- Commands: sync b [m], arrive b [m], read a, write a
- Restrictive, but followed by the majority of GPU code

Well synchronization
- "Synchronization pattern is deterministic": the same commands synchronize together, and no command does double duty
- Commands obey barrier generations
- Subsumes deadlock freedom and safe recycling

Example: with producer sync 0; write a; arrive 1; sync 0 and consumer sync 0; sync 1; read a; sync 0, the first sync 0 pair forms generation 1 of barrier 0, the arrive 1 / sync 1 pair forms generation 1 of barrier 1, and the final sync 0 pair forms generation 2 of barrier 0.

Check well synchronization
Need to know:
- which commands synchronize together
- the generation of the corresponding barrier
First challenge: how to infer this information? Key observation: generations are invariant over all executions. So:
- statically emulate one execution
- record the synchronization
- check that all executions respect the recorded generations
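The emulation step can be sketched as follows, a simplified model (not WEFT's actual implementation) that runs straight-line thread programs, blocking a thread at sync, letting it continue past arrive, and emitting a barrier generation whenever m participants have registered:

```python
from collections import defaultdict

def emulate(programs):
    """Run thread programs once, recording barrier generations.
    Commands: ("sync", b, m), ("arrive", b, m), ("read", a), ("write", a).
    Returns (generations per barrier, deadlocked?)."""
    n = len(programs)
    pc = [0] * n                     # program counter per thread
    blocked = [False] * n
    pending = defaultdict(list)      # barrier -> [(thread, index, kind)]
    generations = defaultdict(list)  # barrier -> completed generations

    progress = True
    while progress:
        progress = False
        for t in range(n):
            while not blocked[t] and pc[t] < len(programs[t]):
                cmd = programs[t][pc[t]]
                if cmd[0] in ("read", "write"):
                    pc[t] += 1
                    progress = True
                    continue
                kind, b, m = cmd
                pending[b].append((t, pc[t], kind))
                progress = True
                if kind == "arrive":
                    pc[t] += 1       # arrive is non-blocking
                else:
                    blocked[t] = True
                if len(pending[b]) == m:
                    # Generation complete: record it, release the syncers.
                    generations[b].append(list(pending[b]))
                    for tt, ii, kk in pending[b]:
                        if kk == "sync":
                            blocked[tt] = False
                            pc[tt] = ii + 1
                    pending[b] = []
    # If any thread could not run to completion, the programs deadlock.
    deadlocked = any(pc[t] < len(programs[t]) for t in range(n))
    return dict(generations), deadlocked
```

On the producer/consumer example this records two generations of barrier 0 and one of barrier 1; on the cyclic sync 0 / sync 1 example from the challenges slide it reports a deadlock.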

Happens before
- The HB relation is reachability: A happens before B if there is a path from A to B with at least one synchronization (black) edge
- Check that successive generations of the same barrier have an HB relationship
- Main result: the HB relation is sound and precise
(Slide figure: the producer/consumer programs drawn as a graph, with program-order edges within each warp and synchronization edges between matched barrier commands; generation 1 of barrier 0 happens before generation 2.)
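The reachability query can be sketched as a DFS over (node, seen-a-sync-edge) states, an illustrative model of the HB check (the edge lists and node names in the test are assumptions modeled on the producer/consumer example):

```python
from collections import defaultdict

def happens_before(prog_edges, sync_edges, a, b):
    """Does a path from a to b exist that uses at least one
    synchronization edge? prog_edges are program-order edges within a
    thread; sync_edges connect matched barrier commands."""
    graph = defaultdict(list)
    for u, v in prog_edges:
        graph[u].append((v, False))   # ordinary edge
    for u, v in sync_edges:
        graph[u].append((v, True))    # synchronization ("black") edge

    stack, seen = [(a, False)], set()
    while stack:
        node, synced = stack.pop()
        if (node, synced) in seen:
            continue
        seen.add((node, synced))
        if node == b and synced:
            return True
        for v, is_sync in graph[node]:
            stack.append((v, synced or is_sync))
    return False
```

Tracking whether a sync edge has been crossed as part of the search state keeps the query a plain graph reachability problem.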

Data races
- For every two commands that can race, check for an HB relationship in either direction
- The check is sound and complete for race detection
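A sketch of that check, assuming a happens-before oracle hb(a, b) is available (the access encoding and all names are illustrative):

```python
def find_races(accesses, hb):
    """Report conflicting access pairs not ordered by happens-before.
    accesses: command id -> (thread, location, is_write)
    hb(a, b): happens-before oracle, assumed given."""
    races = []
    ids = list(accesses)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            ta, la, wa = accesses[a]
            tb, lb, wb = accesses[b]
            # Two commands conflict if different threads touch the same
            # location and at least one access is a write.
            conflict = ta != tb and la == lb and (wa or wb)
            if conflict and not hb(a, b) and not hb(b, a):
                races.append((a, b))
    return races
```

With a sound and precise HB relation, every reported pair is a real race and no race is missed, which is the sense in which the slide's claim of completeness holds.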

Implementation
- A naïve implementation does not scale
- Extensive optimizations give a four-orders-of-magnitude improvement
- Memory and time bounds are given in terms of the total number of commands across all thread programs

Evaluation (Singe kernels)

Discovered bugs
- Write-after-read bugs
- Benign data races
- All kernels were well synchronized

Conclusion
- GPUs are much more flexible than people realize; named barriers enable using GPUs in new ways
- Use of named barriers can create many complications: deadlock, improper recycling, data races
- Providing good software verification is important, and necessary to make named barriers easy to use
- WEFT verifies code with named barriers: the algorithm is both sound and complete, and handles real production code efficiently
- https://github.com/lightsighter/Weft