DCompute:
Native & Convenient Heterogeneous Computing for D

Outline
Introduction
Compiler
Libraries
Using DCompute (present and future)
Future directions

State of Hardware
X86 – all compilers
ARM – GDC, LDC
MIPS, PPC – LDC
DSPs, FPGAs, GPUs – ?

State of Hardware
DSPs – C, OpenCL
FPGAs – HDLs (Verilog, VHDL, DHDL?), OpenCL
GPUs – shaders (GLSL, HLSL) for graphics; CUDA, OpenCL for compute

CUDA
NVidia only
CUDA C++/Fortran => PTX => SASS
Well integrated
Reasonably nice to use (from C/C++)

OpenCL
Many targets – GPUs, DSPs, FPGAs
SPIR-V
OpenCL C/C++ (kernel languages) are OK
The API is horrible

So…
CUDA has vendor lock-in
OpenCL isn’t very nice
LLVM targets SPIR-V and PTX
We have a D compiler that targets LLVM (LDC)
But...

Enter DCompute
Compiler – enables writing kernels in D
Library – automates using kernels from D
Kernels – prewritten kernels for heterogeneous acceleration with less effort

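As a sketch of what "writing kernels in D" looks like, a minimal SAXPY kernel might read as below. The `@compute`, `@kernel`, and `GlobalPointer` names come from the slides themselves; the `ldc.dcompute` import and the `GlobalIndex` helper are assumptions, not confirmed API:

```d
@compute(CompileFor.deviceOnly) module saxpy;
import ldc.dcompute;  // assumed module name for the compiler support types

// res = a * x + y, one element per work-item
@kernel void saxpy(GlobalPointer!float res,
                   GlobalPointer!float x,
                   GlobalPointer!float y,
                   float a)
{
    auto i = GlobalIndex.x;  // assumed indexing helper (see later slides)
    res[i] = a * x[i] + y[i];
}
```

The same source compiles to both PTX (CUDA) and SPIR-V (OpenCL), which is the whole point of the compiler extension.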
Compiler
Core functionality done
Work in progress:
Images / pipes
Globals
Better errors / error reporting
SPIR-V optimisations (not very critical)

Targeting CUDA & OpenCL
Tacking bits of metadata onto the modules
Calling convention
Address-space-qualified pointers
Images & other special types (WiP)
Indexing (get_local_id / threadIdx)
Other “intrinsics”

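To make the indexing bullet concrete: one D-level call should lower to the matching intrinsic on each target. A hedged sketch, with the helper name assumed:

```d
// One abstraction, two lowerings (GlobalIndex is an illustrative name):
auto i = GlobalIndex.x;
// OpenCL target lowers to:  get_global_id(0)
// CUDA target lowers to:    blockIdx.x * blockDim.x + threadIdx.x
```
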
Compilation Process Models
OpenCL-like separate compilation – no compile-time info
SYCL-like hybrid source – kernel is one lambda long, highly nested, hides a lot of runtime magic
CUDA-like hybrid source – runtime magic, semantic validation harder
We want something that fits well with modules and retains compile-time information

Compilation Process

```d
// ldc2 -mdcompute-targets=ocl-220,cuda-620 files.d …

module normal;

@compute(CompileFor.hostAndDevice) module shared_code;

@compute(CompileFor.deviceOnly) module my_kernels;

@kernel void foo(GlobalPointer!float f) {
    …
}

if (__dcompute_reflect(target, version)) {
    …
}

// GlobalPointer!T  ->  { T addrspace(n)* }   (ABI & metadata)
```

Codegen conditional compilation

```cpp
if (stmt->condition->op == TOKcall) {
    auto ce = (CallExp *)stmt->condition;
    if (ce->f && ce->f->ident &&
        !strcmp(ce->f->ident->toChars(), "__dcompute_reflect")) {
        if (match(ce->arguments))
            stmt->ifbody->accept(this);
        else if (stmt->elsebody)
            stmt->elsebody->accept(this);
        return;
    }
}
```

Benefits
Done in one compilation (host, CUDA & OpenCL)
No need to worry about templates
Get compile-time info on kernels

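As a sketch of what "compile-time info on kernels" buys you, ordinary D introspection (plain `std.traits`, no DCompute required) can recover a kernel's parameter list at compile time, which is what lets a driver layer generate argument-marshalling code automatically. The kernel signature here is illustrative:

```d
import std.traits : Parameters;

// Illustrative kernel signature; only the introspection is the point.
void myKernel(float* data, int n) {}

// The driver layer can inspect the parameter list at compile time
// and generate the matching clSetKernelArg / cuLaunchKernel calls.
alias Ps = Parameters!myKernel;
static assert(Ps.length == 2 && is(Ps[0] == float*) && is(Ps[1] == int));
```
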
DCompute
Standard library for compute operations – for use within kernels
Driver – abstraction over the OpenCL and CUDA runtimes
Handles device and host interactions:
Launching kernels
Managing memory
Standard collection of kernels

Standard Library
Indexation
Synchronisation primitives
Vectors (SIMD & geometric)
Math (the usual)
Images (1D, 2D, 3D + arrays, cubes)
Packing (colour operations)
Atomics
Work-group operations (reduce, …)
Backed by:
CUDA: libdevice + LLVM PTX intrinsics
OpenCL: intrinsic operations (magic)

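A hedged sketch of how the pieces above might combine inside a kernel (indexation plus a work-group reduce). The helper names here are assumptions for illustration, not DCompute's confirmed API:

```d
// Hypothetical: sum each work-group's elements into one partial result.
@kernel void sumBlocks(GlobalPointer!float input,
                       GlobalPointer!float partials)
{
    auto i = GlobalIndex.x;                       // indexation (assumed name)
    auto s = workGroupReduce!"a + b"(input[i]);   // work-group reduce (assumed name)
    if (LocalIndex.x == 0)
        partials[GroupIndex.x] = s;               // one result per work-group
}
```
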
Driver (WiP)
Allocate & manage device memory
Data transfer
Kernels: load, launch
Device synchronisation
Events

Driver API Automation
For launching kernels we want something like:

```d
@kernel void my_kernel(T)(GlobalPointer!T p, int args) { ... }

void main(string[] args)
{
    auto dev = getDefaultDevice(getConfig());
    auto q   = dev.getDefaultQueue();

    float[] arr = someData();
    Buffer!float b = dev.makeBuffer(arr);

    Event v = q.enqueue!(my_kernel!float)(b.length)(b, 42);
    v.wait();
    b.read(arr).writeln;
}
```

```d
struct Queue
{
    auto enqueue(alias kernel)(LaunchParams lp)
    {
        return Call!(typeof(kernel), kernel.mangleof)(lp, this);
    }
}

struct Call(F, string mangle)
{
    LaunchParams lp;
    Queue q;

    Event opCall(KernelArgsOf!F args)
    {
        // Get type correctness for free!
        // Use Parameters!F to call clSetKernelArg /
        // clEnqueueNDRangeKernel or cuLaunchKernel
    }
}
```

Collection of Kernels
Showcase of how to do things (and how not to)
Functional examples
Covers common use cases:
Convolutions (DFT/FFT)
Reductions
Filter, sort
Thrust-like (except ranges)
Precompilation (for use with DMD/GDC)

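A hedged sketch of how a Thrust-like collection of prewritten kernels might be invoked from the host side; the `reduce`/`filter` entry points and their placement on the device handle are hypothetical, shown only to illustrate the intended convenience:

```d
// Hypothetical usage of the planned kernel collection (names assumed):
auto total = dev.reduce!"a + b"(buffer);             // reduction kernel
auto evens = dev.filter!(x => x % 2 == 0)(buffer);   // filter kernel
```
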
Future work
Make SPIR-V use intrinsics (LLVM, WiP)
Compiler:
Tests (WiP)
Relax DCompute constraints
Images & pipes (integrate with Phobos #2845)
Globals
Std lib – add missing functions -> intrinsics
Driver:
Finish integrating clWrap
CUDA
High-level API over OpenCL + CUDA
Testing framework
Library integration, e.g. with mir (ndslice, GLAS, CV), scid
Kernels: standard algorithms, NEW algorithms

Conclusion
DCompute is a compiler extension of LDC to target OpenCL and CUDA (and the host) all at the same time! (working but not feature complete)
Runtime libraries are a work in progress. Building kernels works; demo to follow.
This is D, so we are able to make this nice to use (in spite of the horribleness of the underlying APIs) thanks to awesome templates and introspection!
World domination of HPC will be within grasp!

Acknowledgments
John Colvin
David Nadlinger
Kai Nacke
Kinke
Johan Engelen

Questions?
LLVM: https://github.com/thewilsonator/llvm/tree/compute
LDC: https://github.com/ldc-developers/ldc/tree/dcompute (master soon™)
DCompute: https://github.com/libmir/dcompute