Ocelot: supported devices
Overview
Ocelot PTX Emulator
Multicore CPU Backend
NVIDIA GPU Backend
AMD GPU Backend
Multicore CPU Backend: Introduction
Target: efficient execution of PTX kernels on CPUs
ISA translation from PTX to LLVM
Execution-model translation from the PTX thread hierarchy to serialized PTX threads
Light-weight thread scheduler
LLVM just-in-time compilation to x86
LLVM transformations applied before code generation
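The execution-model translation can be sketched in plain C++: instead of running a block's threads concurrently, a serialized kernel iterates a "thread loop" over the thread IDs on one CPU core. This is a minimal illustration only; the function name and signature are hypothetical, not Ocelot's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a CUDA-style vector-add thread block, serialized.
// Instead of 'blockDim' concurrent threads, a thread loop runs the kernel
// body once per thread ID on a single CPU core.
void vecAddSerialized(const std::vector<float>& a, const std::vector<float>& b,
                      std::vector<float>& c, std::size_t blockDim,
                      std::size_t blockIdx)
{
    for (std::size_t tid = 0; tid < blockDim; ++tid) // the "thread loop"
    {
        std::size_t i = blockIdx * blockDim + tid;   // global thread index
        if (i < c.size())
            c[i] = a[i] + b[i];
    }
}
```

The light-weight scheduler then amounts to invoking such a serialized kernel once per thread block across the worker threads.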
Some Interesting Features
Serialization Transforms
JIT for Parallel Code
Utilize all resources
Translation to CPUs: Thread Fusion
Execution Manager
thread scheduling
context management
Thread Blocks
Multicore Host Threads
Thread serialization
Execution Model Translation
Thread scheduling
Dealing with specialized operations
e.g. custom hardware
Control flow restructuring
Resource management (multiple cores)
One worker pthread per CPU core executes the kernel
J. Stratton, S. Stone, and W.-m. W. Hwu, "MCUDA: An efficient implementation of CUDA kernels on multi-cores," University of Illinois at Urbana-Champaign, Tech. Rep. IMPACT-08-01, March 2008.
G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: A Dynamic Optimizing Compiler for Bulk-Synchronous Applications in Heterogeneous Systems," PACT, October 2010.
Ocelot Source Code: Multicore CPU Backend
ocelot/
  executive/
    interface/MulticoreCPUDevice.h
    interface/LLVMContext.h
    interface/LLVMExecutableKernel.h
    interface/LLVMCooperativeThreadArray.h
    interface/LLVMModuleManager.h
    interface/TextureOperations.h
  ir/
    interface/LLVMInstruction.h
  translator/
    interface/PTXToLLVMTranslator.h
  transforms/
    interface/SubkernelFormationPass.h
    interface/RemoveBarrierPass.h
Multicore CPU: ISA Translation
Translate PTX IR to the LLVM internal representation
Arithmetic instructions have a one-to-few mapping
Special instructions and registers are handled by LLVM intrinsics (e.g. cos, clock64, bar.sync)
Texture sampling calls Ocelot's texture library
LLVMContext contains pointers to the address spaces, the next entry ID, and the thread ID
A custom LLVM IR implementation insulates Ocelot from LLVM changes
LLVM requires SSA form, so Ocelot converts PTX to SSA
Remove predication
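As a rough illustration of predication removal (illustrative C++, not Ocelot's actual code): a predicated PTX instruction such as `@p add.s32 d, a, b` can be lowered to an unconditional add followed by a select that keeps the old destination value when the predicate is false, which matches the select-based form LLVM's SSA IR expects.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: lowering '@p add.s32 d, a, b' without predicated execution.
// The add always executes; a select keeps the old value of d when the
// predicate is false, mirroring an LLVM 'select' instruction.
std::int32_t predicatedAdd(bool p, std::int32_t a, std::int32_t b,
                           std::int32_t dOld)
{
    std::int32_t sum = a + b;  // unconditional add
    return p ? sum : dOld;     // select on the predicate
}
```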
PTX to LLVM ISA Translation
//
// ocelot/translation/implementation/PTXToLLVMTranslator.cpp
//
void PTXToLLVMTranslator::_translateAdd( const ir::PTXInstruction& i )
{
	if( ir::PTXOperand::isFloat( i.type ) )
	{
		ir::LLVMFadd add;
		ir::LLVMInstruction::Operand result = _destination( i );
		add.a = _translate( i.a );
		add.b = _translate( i.b );
		add.d = result;
		_llvmKernel->_statements.push_back( ir::LLVMStatement( add ) );
	}
	else
	{
		...
	}
}

Translate each PTX instruction to an LLVM IR instruction sequence
Special PTX registers and instructions are mapped to LLVM intrinsics: llvm.readcyclecounter(), llvm.sqrt.f32()
The result is an LLVM function implementing the PTX kernel
Should be invertible if coupled to an LLVM-to-PTX code generator (not implemented)
Thread Serialization
Thread loops
Enter the next executable region via a scheduler block
Barriers: store live values into thread-local memory, then return to the thread scheduler
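A minimal sketch of this barrier handling, with hypothetical names and a toy kernel: the kernel is split at the barrier into two regions; the scheduler runs region 1 for every thread, spilling values that are live across the barrier into per-thread storage, then runs region 2 for every thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of barrier serialization. The toy kernel
//   x = tid * 2;  __syncthreads();  out[tid] = x + 1;
// is split at the barrier; 'x' is live across it, so it is spilled to
// per-thread local storage between the two subkernel regions.
void runKernelWithBarrier(std::vector<int>& out, std::size_t blockDim)
{
    std::vector<int> liveX(blockDim);                 // thread-local spill slots

    for (std::size_t tid = 0; tid < blockDim; ++tid)  // region before barrier
        liveX[tid] = static_cast<int>(tid) * 2;

    // ---- barrier: every thread has finished region 1 ----

    for (std::size_t tid = 0; tid < blockDim; ++tid)  // region after barrier
        out[tid] = liveX[tid] + 1;
}
```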
Using the Multicore Backend
Edit configure.ocelot
executive:
  devices: llvm – efficient execution of PTX on a multicore CPU
  optimizationLevel – basic, none, full, memory, or debug
  workerThreadLimit – number of worker threads
optimizations:
  subkernelSize – size of subkernels in instructions
  simplifyCFG – whether to apply the CFG simplification pass
  hoistSpecialValues – whether to load LLVMContext values at kernel launch

executive: {
	devices: [ llvm ],
	asynchronousKernelLaunch: true,
	optimizationLevel: none,
	workerThreadLimit: 1,
	warpSize: 1
},
optimizations: {
	subkernelSize: 1000,
	simplifyCFG: true,
	hoistSpecialValues: true
},
Overview
Ocelot PTX Emulator
Multicore CPU Backend
NVIDIA GPU Backend
AMD GPU Backend
NVIDIA GPU: Introduction
Executes PTX kernels on NVIDIA GPUs via the CUDA Driver API
Thin layer on top of the CUDA Driver API
Ocelot enables rewriting of PTX kernels:
Register reallocation
Runtime optimizations
Instrumentation
Ocelot Source Code: NVIDIA GPU Device Backend
ocelot/
  executive/
    interface/NVIDIAGPUDevice.h
    interface/NVIDIAExecutableKernel.h
Using the NVIDIA GPU Backend
Edit configure.ocelot
executive:
  devices: nvidia – invokes the NVIDIA GPU backend

executive: {
	devices: [ nvidia ],
},
Dynamic Instrumentation
Run-time generation of user-defined, custom instrumentation code for CUDA kernels
Harness chip-level instrumentation when possible
Instrumentation data to drive:
Off-line workload characterization
On-line debugging and program optimization
On-line resource management
Inspired in part by the Pin¹ infrastructure
Naila Farooqui, Andrew Kerr, Greg Eisenhauer, Karsten Schwan, and Sudhakar Yalamanchili. "Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures," ISPASS, April 2012.
PhD student: Naila Farooqui, joint with K. Schwan and A. Gavrilovska
¹ C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. "Pin: building customized program analysis tools with dynamic instrumentation," PLDI '05.
Instrumentation Support in Ocelot
High-level C constructs to define instrumentation + a C-to-PTX JIT
Integration with system management software and the dynamic compiler
Online resource management based on profiling
Additional Instrumentor APIs to provide criteria for instrumentation
Selectively perform instrumentation on kernels
Custom Instrumentation
Transparent profiling and characterization of library implementations
[Figure: Lynx toolchain – nvcc emits PTX into the Ocelot Run Time alongside the CUDA libraries; an Instrumentor drives the Instrumentation APIs, a C-on-Demand JIT, a C-to-PTX translator, and a PTX-to-PTX transformer to inject the example instrumentation code]
Instrumentation: Instruction Count
[Figure: dynamic instruction counts for Scan (CUDA SDK)]
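As a rough sketch of what an instruction-count instrumentation does (illustrative only, not Lynx's actual generated PTX): each basic block is tagged with its static instruction count, and the instrumented kernel adds that count to a counter every time the block executes, yielding the dynamic instruction count at kernel exit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: each basic block of a kernel carries its static
// instruction count; the inserted instrumentation adds that count to a
// counter once per dynamic execution of the block.
struct BasicBlock
{
    std::size_t instructionCount; // static instructions in this block
};

std::uint64_t countDynamicInstructions(
    const std::vector<BasicBlock>& trace /* blocks in execution order */)
{
    std::uint64_t counter = 0;
    for (const BasicBlock& bb : trace)
        counter += bb.instructionCount; // the inserted per-block increment
    return counter;
}
```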
Remote Device Layer
Remote procedure call layer for Ocelot device calls
Execute local applications that run kernels remotely
Multi-GPU applications can become multi-node
Switchable Compute
Switch devices at runtime
Load balancing
Remote execution
Overview
Ocelot PTX Emulator
Multicore CPU Backend
NVIDIA GPU Backend
AMD GPU Backend

Rodrigo Dominguez, Dana Schaa, and David Kaeli. "Caracal: Dynamic Translation of Runtime Environments for GPUs." In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4.
AMD GPU Backend
Executes PTX kernels on AMD GPUs via the CAL Driver API
Rewriting of PTX kernels (for optimization, instrumentation, etc.) also carries over to the AMD backend
Ocelot Device Interface:
Module registration
Memory management
Global/Shared/Constant/Parameter memory allocation
Kernel launches
Translation from PTX to IL
Texture management
OpenGL interoperability
Streams and Events
AMD Evergreen Architecture
AMD Radeon HD 5870
20 SIMD cores
16 Stream Cores (SC) per SIMD core
Each SC is VLIW-5
A total of 1600 ALUs
Wavefronts of 64 threads
Peak is 2.72 TFLOPS (SP) and 544 GFLOPS (DP)
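The peak numbers follow from the ALU counts above combined with the HD 5870's engine clock (the clock is not stated on the slide; 850 MHz is the published figure, an assumption here): 20 SIMD cores × 16 stream cores × 5 lanes × 2 FLOPs per fused multiply-add × 0.85 GHz = 2.72 TFLOPS single precision, and with 1 DP fma per stream core, 320 × 2 × 0.85 GHz = 544 GFLOPS double precision.

```cpp
#include <cassert>
#include <cmath>

// Sketch: deriving the Radeon HD 5870 peak throughput from the slide's
// figures. Assumes the published 850 MHz engine clock, which is not
// stated on the slide itself.
double peakGflops(int simdCores, int streamCoresPerSimd, int lanes,
                  int flopsPerLanePerCycle, double clockGhz)
{
    return simdCores * streamCoresPerSimd * lanes
         * flopsPerLanePerCycle * clockGhz;
}
```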
AMD Evergreen Architecture
[Figure: one SIMD core and one Stream Core – General Purpose Registers, Processing Elements, a T-Processing Element, a Branch Execution Unit, and the instruction and control flow logic. Source: AMD OpenCL University Kit]
Each Stream Core includes:
4 Processing Elements
  4 independent SP or integer operations
  2 DP operations
  1 DP fma or mult operation
1 Special Function Unit
  1 SP or integer operation
  SP or DP transcendental
Branch Execution Unit
GPR = 5.24 MB
AMD Evergreen Architecture
Local Data Share: 32 KB per SIMD, 2 TB/s
Global Data Share: shared between all threads in a kernel, low-latency global reductions
L1 cache: 8 KB
L2 cache: 512 KB, 450 GB/s
Global Memory: GDDR5, 153 GB/s
Translation from PTX to IL
PTX
RISC-style syntax
Load-store instruction set
Registers are typed and scalar
Unlimited virtual registers
Predicate registers
Control flow based on branches and labels
Designed for compute (GPGPU)

.entry vecAdd (
	.param .u64 A,
	.param .u64 B,
	.param .u64 C,
	.param .s32 N )
{
	mov.u16 rh1, ctaid.x;
	mov.u16 rh2, ntid.x;
	mul.wide.u16 r1, rh1, rh2;
	cvt.u32.u16 r2, tid.x;
	add.u32 r3, r2, r1;
	ld.param.s32 r4, [N];
	setp.le.s32 p1, r4, r3;
	@p1 bra Label_1;
	...
}
Translation from PTX to IL
IL
Registers are 32-bit, 4-component vectors
Registers have no type
Swizzles
Resources are globally scoped
Structured control flow (if-end, while-end)
Designed for graphics, not compute (see FSAIL)

il_cs_2_0
dcl_raw_uav_id(0)
dcl_cb cb0[2]
dcl_cb cb1[4]
dcl_literal l0, 4, 4, 4, 4
mov r0.x, vThreadGrpId.x
mov r1.x, cb0[0].x
imul r2.x, r0.x, r1.x
mov r3.x, vTidInGrp.x
iadd r4.x, r3.x, r2.x
mov r5.x, cb1[3].x
ige r6.x, r4.x, r5.x
if_logicalz r6.x
...
endif
end
AMD GPU Backend
Validated over 30 applications from the CUDA SDK
Support for pre-compiled libraries
Device selection can be made at runtime
What is supported?
Global memory (cudaMalloc, cudaMemcpy)
Shared memory (including extern)
Constant memory
Atomics (global and shared)
Barriers and fences
30+ PTX instructions
Ocelot Source Code: AMD GPU Device Backend
ocelot/
  analysis/
    interface/StructuralAnalysis.h
  executive/
    interface/ATIGPUDevice.h
    interface/ATIExecutableKernel.h
  transforms/
    interface/StructuralTransform.h
Using the AMD GPU Backend
Edit configure.ocelot
executive:
  devices: amd – invokes the AMD GPU backend

executive: {
	devices: [ amd ],
},
Unstructured to Structured Control Flow*
Branch divergence is key to high performance on GPUs
Its impact differs depending on whether the control flow is structured or unstructured
Not all GPUs support an unstructured CFG directly
Ocelot uses dynamic translation to support AMD GPUs**

* H. Wu, G. Diamos, S. Li, and S. Yalamanchili. "Characterization and Transformation of Unstructured Control Flow in GPU Applications," CACHES, 2011.
** R. Dominguez, D. Schaa, and D. Kaeli. "Caracal: Dynamic translation of runtime environments for GPUs." In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 5–11. ACM, 2011.
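As a toy illustration of the unstructured-to-structured transformation (not Ocelot's StructuralTransform itself): a goto-based loop with a side exit can be rewritten into the single-entry, single-exit if/while forms that AMD's IL supports, at the cost of an extra guard variable.

```cpp
#include <cassert>
#include <vector>

// Unstructured form (as a CFG with a branch out of the loop body):
//   L0: if (i >= n) goto EXIT;
//       if (v[i] == key) goto FOUND;   // side exit -> unstructured
//       i++; goto L0;
// Structured form: the side exit becomes a guard variable tested by the
// single loop condition, matching if-end/while-end control flow.
int findIndexStructured(const std::vector<int>& v, int key)
{
    int i = 0;
    bool found = false;                       // guard replaces the side exit
    while (i < static_cast<int>(v.size()) && !found)
    {
        if (v[i] == key)
            found = true;                     // no goto: set the guard instead
        else
            ++i;
    }
    return found ? i : -1;
}
```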