Ocelot: supported devices

lindy-dunigan
Uploaded On 2017-08-20

Presentation Transcript

Slide1

Ocelot: supported devices

Slide2

Overview

Ocelot PTX Emulator

Multicore-Backend

NVIDIA GPU Backend

AMD GPU Backend

Slide3

Multicore CPU Backend: Introduction

Target: Efficient execution of PTX kernels on CPUs

ISA Translation from PTX to LLVM

Execution-model translation from PTX thread hierarchy to serialized PTX threads

Light-weight thread scheduler

LLVM just-in-time compilation to x86

LLVM transformations applied before code generation

Slide4

Some Interesting Features

Serialization Transforms

JIT for Parallel Code

Utilize all resources

Slide5

Translation to CPUs: Thread Fusion

Execution Manager

thread scheduling

context management

Thread Blocks

Multicore Host Threads

Thread serialization

Execution Model Translation

Thread scheduling

Dealing with specialized operations, e.g. custom hardware

Control flow restructuring

Resource management (multiple cores)

One worker pthread per CPU core

Execute a kernel
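The execution-model translation on this slide can be sketched in a few lines of C++. This is an illustration under stated assumptions, not Ocelot's implementation: ThreadContext, runBlock, launch, and vecAdd are hypothetical names, std::thread stands in for the worker pthreads, and the kernel is assumed to contain no barriers, so each PTX thread can run to completion inside the serialization loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-thread state; the fields mirror PTX special registers.
struct ThreadContext {
    unsigned tid;    // thread index within the block (%tid.x)
    unsigned ntid;   // threads per block (%ntid.x)
    unsigned ctaid;  // block index within the grid (%ctaid.x)
};

using Kernel = std::function<void(const ThreadContext&)>;

// Thread serialization: one host thread runs every PTX thread of a block in turn.
// Valid only because this sketch assumes a barrier-free kernel.
void runBlock(const Kernel& kernel, unsigned ctaid, unsigned ntid) {
    for (unsigned tid = 0; tid < ntid; ++tid) {
        kernel(ThreadContext{tid, ntid, ctaid});
    }
}

// Resource management: blocks are dealt round-robin to `workers` host threads.
// The calling thread acts as worker 0, so workers == 1 spawns no extra thread.
void launch(const Kernel& kernel, unsigned gridSize, unsigned blockSize,
            unsigned workers) {
    if (workers == 0) workers = 1;
    auto work = [&](unsigned id) {
        for (unsigned cta = id; cta < gridSize; cta += workers) {
            runBlock(kernel, cta, blockSize);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned id = 1; id < workers; ++id) pool.emplace_back(work, id);
    work(0);
    for (auto& t : pool) t.join();
}

// Example kernel launch: element-wise vector addition.
std::vector<int> vecAdd(const std::vector<int>& a, const std::vector<int>& b,
                        unsigned blockSize, unsigned workers) {
    assert(a.size() == b.size() && blockSize > 0);
    std::vector<int> c(a.size());
    const unsigned gridSize =
        static_cast<unsigned>((a.size() + blockSize - 1) / blockSize);
    launch([&](const ThreadContext& t) {
        const std::size_t i = std::size_t(t.ctaid) * t.ntid + t.tid;
        if (i < a.size()) c[i] = a[i] + b[i];  // guard the partial last block
    }, gridSize, blockSize, workers);
    return c;
}
```

Passing std::thread::hardware_concurrency() as the worker count would give one worker per reported core. Kernels that contain barriers cannot be serialized this simply, because a thread must pause at the barrier until the rest of its block arrives.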

J. Stratton, S. Stone, and W.-m. Hwu, "MCUDA: An efficient implementation of CUDA kernels on multi-cores," University of Illinois at Urbana-Champaign, Tech. Rep. IMPACT-08-01, March 2008.

G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: A Dynamic Optimizing Compiler for Bulk-Synchronous Applications in Heterogeneous Systems,” PACT, October 2010.

Slide6

Ocelot Source Code: Multicore CPU Backend

ocelot/
    executive/
        interface/MulticoreCPUDevice.h
        interface/LLVMContext.h
        interface/LLVMExecutableKernel.h
        interface/LLVMCooperativeThreadArray.h
        interface/LLVMModuleManager.h
        interface/TextureOperations.h
    ir/
        interface/LLVMInstruction.h
    translator/
        interface/PTXToLLVMTranslator.h
    transforms/
        interface/SubkernelFormationPass.h
        interface/RemoveBarrierPass.h

Slide7

Multicore CPU: ISA Translation

Translate PTX IR to LLVM Internal Representation

Arithmetic instructions have one-to-few mapping

Special instructions and registers handled by LLVM intrinsics (e.g. cos, clock64, bar.sync)

Texture sampling calls Ocelot’s texture library

LLVMContext contains pointers to address spaces, next entry ID, thread ID

Custom LLVM IR implementation insulates Ocelot from LLVM changes

LLVM requires SSA form -> Ocelot converts PTX to SSA

Remove predication

Slide8

PTX to LLVM ISA Translation

//
// ocelot/translator/implementation/PTXToLLVMTranslator.cpp
//
void PTXToLLVMTranslator::_translateAdd( const ir::PTXInstruction& i ) {
    if( ir::PTXOperand::isFloat( i.type ) ) {
        // Floating-point add maps to a single LLVM fadd
        ir::LLVMFadd add;
        ir::LLVMInstruction::Operand result = _destination( i );
        add.a = _translate( i.a );
        add.b = _translate( i.b );
        add.d = result;
        _llvmKernel->_statements.push_back( ir::LLVMStatement( add ) );
    } else {
        ..
    }
}

Translate each PTX instruction to LLVM IR instruction sequence

Special PTX registers and instructions mapped to LLVM intrinsics:

llvm.readcyclecounter()

llvm.sqrt.f32()

Result is LLVM function implementing PTX kernel

Should be invertible if coupled to LLVM->PTX code generator (not implemented)

Slide9

Thread Serialization

Thread loops

Enter next executable region via scheduler block

Barriers: store live values into thread-local memory, return to thread scheduler

Slide10

Using the Multicore Backend

Edit configure.ocelot

executive:

devices: llvm – efficient execution of PTX on multicore CPU

optimizationLevel – basic, none, full, memory, debug

workerThreadLimit – number of worker threads

optimizations:

subkernelSize – size of subkernels in instructions

simplifyCFG – whether to apply CFG simplification pass

hoistSpecialValues – whether to load LLVMContext values at launch of kernel

executive: {
    devices: [ llvm ],
    asynchronousKernelLaunch: true,
    optimizationLevel: none,
    workerThreadLimit: 1,
    warpSize: 1
},
optimizations: {
    subkernelSize: 1000,
    simplifyCFG: true,
    hoistSpecialValues: true
},

Slide11

Overview

Ocelot PTX Emulator

Multicore-Backend

NVIDIA GPU Backend

AMD Backend

Slide12

NVIDIA GPU: Introduction

Executes PTX kernels on GPUs via the CUDA Driver API

Thin layer on top of CUDA Driver API

Ocelot enables rewriting of PTX kernels

Register reallocation

Runtime optimizations

Instrumentation

Slide13

Ocelot Source Code: NVIDIA GPU Device Backend

ocelot/
    executive/
        interface/NVIDIAGPUDevice.h
        interface/NVIDIAExecutableKernel.h

Slide14

Using the NVIDIA GPU Backend

Edit configure.ocelot

executive:

devices: nvidia – invokes NVIDIA GPU backend

executive: {
    devices: [ nvidia ],
},

Slide15

Dynamic Instrumentation

Run-time generation of user-defined, custom instrumentation code for CUDA kernels

Harness chip-level instrumentation when possible

Instrumentation data to drive:

Off-line workload characterization

On-line debugging & program optimization

On-line resource management

Inspired in part by the PIN [1] infrastructure

Naila Farooqui, Andrew Kerr, Greg Eisenhauer, Karsten Schwan, Sudhakar Yalamanchili. Lynx: A Dynamic Instrumentation System for Data-Parallel Applications on GPGPU Architectures. ISPASS. April 2012.

PhD Student: Naila Farooqui, Joint with K. Schwan and A. Gavrilovska

[1] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. “Pin: building customized program analysis tools with dynamic instrumentation,” PLDI '05

Slide16

Instrumentation Support in Ocelot

High-level C constructs to define instrumentation + (C-to-PTX) JIT

Integration with system management software and dynamic compiler

Online resource management based on profiling

Additional Instrumentor APIs to provide criteria for instrumentation

Selectively perform instrumentation on kernels
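To make selective instrumentation concrete, here is a minimal sketch of an instruction counter written as a PTX-to-PTX transform over a toy IR. Everything in it is an assumption made for illustration: BasicBlock, KernelIR, KernelFilter, and instrumentInstructionCount are hypothetical names rather than the Lynx or Ocelot API, and %count is a made-up counter register that a real tool would declare per thread and write back to global memory.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical toy IR: a kernel is a named list of basic blocks of PTX text.
struct BasicBlock {
    std::string label;
    std::vector<std::string> instructions;  // one PTX instruction per entry
};

struct KernelIR {
    std::string name;
    std::vector<BasicBlock> blocks;
};

// User-supplied criterion deciding which kernels get instrumented.
using KernelFilter = std::function<bool(const KernelIR&)>;

// Prepend one counter update to every basic block of a selected kernel.
// The immediate is the block's original instruction count, so a single add
// per block execution accounts for every instruction in that block.
KernelIR instrumentInstructionCount(KernelIR kernel, const KernelFilter& selected) {
    if (!selected(kernel)) return kernel;  // unselected kernels pass through unchanged
    for (BasicBlock& block : kernel.blocks) {
        const std::string update = "add.u64 %count, %count, " +
                                   std::to_string(block.instructions.size()) + ";";
        block.instructions.insert(block.instructions.begin(), update);
    }
    return kernel;
}
```

Counting per basic block rather than per instruction keeps the overhead at one extra instruction per block, at the cost of assuming every instruction in a block executes whenever the block is entered.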

Slide17

Custom Instrumentation

Transparent profiling and characterization of library implementations

[Figure: instrumentation workflow, with labels nvcc, PTX, Ocelot Run Time, CUDA, Libraries, Instrumentation APIs, Instrumentor, C-on-Demand JIT, C-PTX Translator, PTX-PTX Transformer, Lynx, and Example Instrumentation Code]

Slide18

Instrumentation: Instruction count

* Scan (CUDA SDK)

Slide19

Remote Device Layer

Remote procedure call layer for Ocelot device calls

Execute local applications that run kernels remotely

Multi-GPU applications can become multi-node

Slide20

Switchable Compute

Switch devices at runtime

Load balancing

Remote execution

Slide21

Overview

Ocelot PTX Emulator

Multicore-Backend

NVIDIA Backend

AMD GPU Backend

Rodrigo Dominguez, Dana Schaa, and David Kaeli. “Caracal: Dynamic Translation of Runtime Environments for GPUs.” In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4

Slide22

AMD GPU Backend

Executes PTX kernels on GPUs via the CAL Driver API

Rewritten PTX kernels (for optimization, instrumentation, etc.) are also translated to the AMD backend

Ocelot Device Interface:

Module registration

Memory management

Global/Shared/Constant/Parameter memory allocation

Kernel launches

Translation from PTX to IL

Texture management

OpenGL interoperability

Streams and Events
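The role of a common device interface can be illustrated with a small C++ sketch. It is not Ocelot's actual class hierarchy: Device, ToyBackend, makeDevice, and every method name are hypothetical, and the toy backend only records what it is given instead of translating or executing anything. The device names llvm, nvidia, and amd are taken from the configure.ocelot examples in this deck.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical device abstraction covering three of the listed responsibilities.
class Device {
public:
    virtual ~Device() = default;
    virtual std::string target() const = 0;
    virtual void registerModule(const std::string& name, const std::string& ptx) = 0;
    virtual std::size_t allocate(std::size_t bytes) = 0;  // returns an opaque handle
    virtual bool launch(const std::string& module, const std::string& kernel) = 0;
};

// Toy backend: stores modules and allocation sizes; a real backend would
// translate the PTX for its target and run the kernel.
class ToyBackend : public Device {
public:
    explicit ToyBackend(std::string target) : target_(std::move(target)) {}
    std::string target() const override { return target_; }
    void registerModule(const std::string& name, const std::string& ptx) override {
        modules_[name] = ptx;
    }
    std::size_t allocate(std::size_t bytes) override {
        allocations_.push_back(bytes);
        return allocations_.size() - 1;
    }
    bool launch(const std::string& module, const std::string& kernel) override {
        // "Launch" succeeds only if the module is registered and declares the entry.
        const auto it = modules_.find(module);
        return it != modules_.end() &&
               it->second.find(".entry " + kernel) != std::string::npos;
    }
private:
    std::string target_;
    std::map<std::string, std::string> modules_;
    std::vector<std::size_t> allocations_;
};

// Runtime device selection by configuration name; nullptr for unknown names.
std::unique_ptr<Device> makeDevice(const std::string& name) {
    if (name == "llvm") return std::unique_ptr<Device>(new ToyBackend("x86 via LLVM"));
    if (name == "nvidia") return std::unique_ptr<Device>(new ToyBackend("PTX via CUDA Driver API"));
    if (name == "amd") return std::unique_ptr<Device>(new ToyBackend("IL via CAL"));
    return nullptr;
}
```

Because callers hold only a Device pointer, swapping the name passed to makeDevice is enough to redirect the same module registration and launch calls to another backend.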

Slide23

AMD Evergreen Architecture

AMD Radeon HD 5870

20 SIMD cores

16 Stream Cores (SC) per SIMD core

Each SC is VLIW-5

A total of 1600 ALUs

Wavefronts of 64 threads

Peak is 2.72 TFLOPS (SP) and 544 GFLOPS (DP)

Slide24

AMD Evergreen Architecture

[Figure: one SIMD core and one stream core, with labels General Purpose Registers, T-Processing Element, Processing Elements, Branch Execution Unit, and Instruction and Control Flow. Source: AMD OpenCL University Kit]

Each Stream Core includes:

4 Processing Elements

4 independent SP or integer operations

2 DP operations

1 DP fma or mult operation

1 Special Function Unit

1 SP or integer operation

SP or DP transcendental

Branch Execution Unit

GPR = 5.24 MB

Slide25

AMD Evergreen Architecture

Local Data Share

2 TB/s

32 KB per SIMD

Global Data Share

Shared between all threads in a kernel

Low latency global reductions

L1 (8 KB)

L2

512 KB

450 GB/s

Global Memory

GDDR5 153 GB/s

Slide26

Translation from PTX to IL

PTX

RISC style syntax

Load-Store instruction set

Registers are typed and scalar

Unlimited virtual registers

Predicate registers

Control flow based on branches and labels

Designed for compute (GPGPU)

.entry vecAdd ( .param .u64 A,
                .param .u64 B,
                .param .u64 C,
                .param .s32 N )
{
    mov.u16 rh1, ctaid.x;
    mov.u16 rh2, ntid.x;
    mul.wide.u16 r1, rh1, rh2;
    cvt.u32.u16 r2, tid.x;
    add.u32 r3, r2, r1;
    ld.param.s32 r4, [N];
    setp.le.s32 p1, r4, r3;
    @p1 bra Label_1;
    ...
}

Slide27

Translation from PTX to IL

IL

Registers are 32-bit and vectors (4 components)

Registers have no type

Swizzles

Resources are globally scoped

Structured control flow (if-end, while-end)

Designed for graphics, not compute (see FSAIL)

il_cs_2_0
dcl_raw_uav_id(0)
dcl_cb cb0[2]
dcl_cb cb1[4]
dcl_literal l0, 4, 4, 4, 4
mov r0.x, vThreadGrpId.x
mov r1.x, cb0[0].x
imul r2.x, r0.x, r1.x
mov r3.x, vTidInGrp.x
iadd r4.x, r3.x, r2.x
mov r5.x, cb1[3].x
ige r6.x, r4.x, r5.x
if_logicalz r6.x
    ...
endif
end

Slide28

AMD GPU Backend

Validated over 30 applications from the CUDA SDK

Support for pre-compiled libraries

Device selection can be made at runtime

What is supported?

Global memory (cudaMalloc, cudaMemcpy)

Shared memory (including extern)

Constant memory

Atomics (global and shared)

Barriers and Fences

30+ PTX instructions

Rodrigo Dominguez, Dana Schaa, and David Kaeli. Caracal: Dynamic Translation of Runtime Environments for GPUs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4

Slide29

Ocelot Source Code: AMD GPU Device Backend

ocelot/
    analysis/
        interface/StructuralAnalysis.h
    executive/
        interface/ATIGPUDevice.h
        interface/ATIExecutableKernel.h
    transforms/
        interface/StructuralTransform.h

Slide30

Using the AMD GPU Backend

Edit configure.ocelot

executive:

devices: amd – invokes AMD GPU backend

executive: {
    devices: [ amd ],
},

Slide31

Unstructured to Structured Control Flow*

Branch divergence is a key factor in GPU performance

Its impact is different depending upon whether the control flow is structured or unstructured

Not all GPUs support unstructured CFGs directly

Using dynamic translation to support AMD GPUs**
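A small C++ example shows what restructuring means in practice. It is a hand-written illustration, not output from Ocelot's transformation passes; the function names are hypothetical. The first function has an unstructured CFG because the block at the label shared is entered from two different nesting levels. The second expresses the same behavior with nested if-else only, by duplicating that shared block.

```cpp
#include <cassert>

// Unstructured: `shared` has two predecessors at different nesting depths,
// so the CFG cannot be written as nested if/else without changing the code.
int unstructured(bool a, bool b) {
    int r = 0;
    if (a) goto shared;
    r += 1;
    if (b) goto shared;
    r += 10;
    goto done;
shared:
    r += 100;
done:
    return r;
}

// Structured equivalent: the shared block is duplicated into both branches,
// which is the form a target with only if-end / while-end constructs accepts.
int structured(bool a, bool b) {
    int r = 0;
    if (a) {
        r += 100;      // first copy of the shared block
    } else {
        r += 1;
        if (b) {
            r += 100;  // second copy of the shared block
        } else {
            r += 10;
        }
    }
    return r;
}

// Exhaustively compare both versions over all four inputs.
void checkEquivalence() {
    for (int a = 0; a < 2; ++a) {
        for (int b = 0; b < 2; ++b) {
            assert(unstructured(a != 0, b != 0) == structured(a != 0, b != 0));
        }
    }
}
```

The price of this approach is code growth: each duplicated block appears once per path that reaches it, which is one reason the impact of restructuring on performance needs to be characterized.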

* Wu H, Diamos G, Li S, Yalamanchili S. Characterization and Transformation of Unstructured Control Flow in GPU Applications. CACHES. 2011.

** R. Dominguez, D. Schaa, and D. Kaeli. Caracal: Dynamic translation of runtime environments for GPUs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 5–11. ACM, 2011.