OPENCL OVERVIEW

Uploaded by briana-ranney on 2017-10-12


Presentation Transcript


OPENCL OVERVIEW

OpenCL Commercial Objectives

Grow the market for parallel computing

For vendors of systems, silicon, middleware, tools and applications

Open, royalty-free standard for heterogeneous parallel computing

Unified programming model for CPUs, GPUs, Cell, DSP and other processors in a system

Cross-vendor software portability to a wide range of silicon and systems

HPC servers, desktop systems and handheld devices covered in one specification

Support for a wide diversity of applications

From embedded and mobile software through consumer applications to HPC solutions

Create a foundation layer for a parallel computing ecosystem

Close-to-the-metal interface to support a rich diversity of middleware and applications

Rapid deployment in the market

Designed to run on the latest generations of GPU hardware

OpenCL Working Group

Diverse industry participation

Processor vendors, system OEMs, middleware vendors, application developers

Many industry-leading experts involved in OpenCL’s design

A healthy diversity of industry perspectives

Apple initially proposed and is very active in the working group

Serving as specification editor

Here are some of the other companies in the OpenCL working group

OpenCL Timeline

Six months from proposal to released specification

Due to a strong initial proposal and a shared commercial incentive to work quickly

Apple’s Mac OS X Snow Leopard includes OpenCL

Multiple OpenCL implementations are now out

Apple and NVIDIA/AMD on Windows and Linux, to name a few

Apple works with AMD, Intel, NVIDIA and others on draft proposal

Apple proposes OpenCL working group and contributes draft specification to Khronos

OpenCL working group develops draft into cross-vendor specification

Working Group sends completed draft to Khronos Board for Ratification

Khronos publicly releases OpenCL as royalty-free specification

Khronos to release conformance tests to ensure high-quality implementations

Timeline markers: Jun 08, Oct 08, Dec 08, Mar 09

OpenCL 1.0

Technical Overview

OpenCL Design Requirements

Use all computational resources in system

Program GPUs, CPUs, and other processors as peers

Support both data- and task-parallel compute models

Efficient C-based parallel programming model

Abstract the specifics of underlying hardware

Abstraction is low-level, high-performance, but device-portable

Approachable – but primarily targeted at expert developers

Ecosystem foundation – no middleware or “convenience” functions

Implementable on a range of embedded, desktop, and server systems

HPC, desktop, and handheld profiles in one specification

Drive future hardware requirements

Floating point precision requirements

Applicable to both consumer and HPC applications

Anatomy of OpenCL

Language Specification

C-based cross-platform programming interface

Subset of ISO C99 with language extensions - familiar to developers

Well-defined numerical accuracy - IEEE 754 rounding behavior with specified maximum error

Online or offline compilation and build of compute kernel executables

Includes a rich set of built-in functions

Platform Layer API

A hardware abstraction layer over diverse computational resources

Query, select and initialize compute devices

Create compute contexts and work-queues

Runtime API

Execute compute kernels

Manage scheduling, compute, and memory resources

Hierarchy of Models

Platform Model

Memory Model

Execution Model

Programming Model

OpenCL Platform Model

(Section 3.1)

One Host + one or more Compute Devices

Each Compute Device is composed of one or more Compute Units

Each Compute Unit is further divided into one or more Processing Elements

OpenCL Execution Model

(Section 3.2)

OpenCL Program:

Kernels

Basic unit of executable code — similar to a C function

Data-parallel or task-parallel

Host Program

Collection of compute kernels and internal functions

Analogous to a dynamic library

Kernel Execution

The host program invokes a kernel over an index space called an NDRange

NDRange = “N-Dimensional Range”; it can be a 1-, 2-, or 3-dimensional space

A single kernel instance at a point in the index space is called a work-item

Work-items have unique global IDs from the index space

Work-items are further grouped into work-groups

Work-groups have a unique work-group ID

Work-items have a unique local ID within a work-group

Kernel Execution

Total number of work-items = Gx × Gy

Size of each work-group = Sx × Sy

Global ID can be computed from work-group ID and local ID

Contexts and Queues

(Section 3.2.1)

Contexts are used to contain and manage the state of the “world”

Kernels are executed in contexts defined and manipulated by the host

Devices

Kernels - OpenCL functions

Program objects - kernel source and executable

Memory objects

Command-queue - coordinates execution of kernels

Kernel execution commands

Memory commands - transfer or mapping of memory object data

Synchronization commands - constrain the order of commands

Applications queue compute kernel execution instances

Queued in-order

Executed in-order or out-of-order

Events are used to implement appropriate synchronization of execution instances

[Diagram: a Compute Device contains Compute Units 1…N; each Compute Unit holds Work-Items 1…M, each with its own Private Memory, plus a shared Local Memory; a Global/Constant Memory Data Cache sits in front of Global Memory]

OpenCL Memory Model

(Section 3.3)

Shared memory model

Relaxed consistency

Multiple distinct address spaces

Address spaces can be collapsed depending on the device’s memory subsystem

Address spaces

Private - private to a work-item

Local - local to a work-group

Global - accessible by all work-items in all work-groups

Constant - read only global space

Implementations map this hierarchy to available physical memories

Memory Consistency

(Section 3.3.1)

“OpenCL uses a relaxed consistency memory model; i.e., the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.”

Within a work-item, memory has load/store consistency

Within a work-group at a barrier, local memory has consistency across work-items

Global memory is consistent within a work-group at a barrier, but not guaranteed across different work-groups

Consistency of memory shared between commands is enforced through synchronization
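The barrier rule above is easiest to see in device code. The following OpenCL C kernel is a minimal sketch (the kernel name and access pattern are illustrative, not from the original deck): each work-item writes one value to local memory, then reads a neighbour's slot, which is only safe after the barrier.

```c
__kernel void shift_left(__global const float *in,
                         __global float *out,
                         __local  float *tmp)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int lsz = get_local_size(0);

    tmp[lid] = in[gid];               // write to local memory

    barrier(CLK_LOCAL_MEM_FENCE);     // local memory is now consistent
                                      // across the whole work-group

    out[gid] = tmp[(lid + 1) % lsz];  // safe to read a neighbour's slot
}
```

Without the barrier, the neighbour's write is not guaranteed to be visible, exactly as the quoted relaxed-consistency rule warns.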

Data-Parallel Programming Model

(Section 3.4.1)

Define an N-dimensional computation domain

Each independent element of execution in an N-dimensional domain is called a work-item

The N-dimensional domain defines the total number of work-items that execute in parallel = global work size

Work-items can be grouped together — work-group

Work-items in a group can communicate with each other

Can synchronize execution among work-items in a group to coordinate memory access

Execute multiple work-groups in parallel

Mapping of global work size to work-group can be implicit or explicit

Task-Parallel Programming Model

(Section 3.4.2)

The data-parallel execution model must be implemented by all OpenCL compute devices

Some compute devices such as CPUs can also execute task-parallel compute kernels

Executes as a single work-item

A compute kernel written in OpenCL C

A “native” function

Host program

Query platform

Query compute devices

Create contexts

Create memory objects associated to contexts

Compile and create kernel program objects

Issue commands to command-queue

Synchronization of commands

Clean up OpenCL resources

Kernels

OpenCL C code

Basic OpenCL Program Structure

Platform Layer

Runtime

Language

Platform Layer

(Chapter 4)

Platform layer allows applications to query for platform specific features

Querying platform info (i.e., OpenCL profile) (Chapter 4.1)

Querying devices (Chapter 4.2)

clGetDeviceIDs()

Find out what compute devices are on the system

Classified device types include CPUs, GPUs, or Accelerators

clGetDeviceInfo()

Queries the capabilities of the discovered compute devices
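A device-enumeration sketch using the two calls above; the GPU-first fallback and the fixed-size array are illustrative choices, not part of the original deck, and error handling is abbreviated:

```c
cl_device_id devices[8];
cl_uint num_devices;

// Ask the chosen platform for its GPU devices; fall back to CPUs if none.
cl_int err = clGetDeviceIDs(platform_ids[i], CL_DEVICE_TYPE_GPU,
                            8, devices, &num_devices);
if (err != CL_SUCCESS || num_devices == 0)
    err = clGetDeviceIDs(platform_ids[i], CL_DEVICE_TYPE_CPU,
                         8, devices, &num_devices);

// Query a capability of the first device, e.g. its compute-unit count.
cl_uint units;
err = clGetDeviceInfo(devices[0], CL_DEVICE_MAX_COMPUTE_UNITS,
                      sizeof(units), &units, NULL);
```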

Creating contexts (Chapter 4.3)

Contexts are used by the OpenCL runtime to manage objects and execute kernels on one or more devices

Contexts are associated to one or more devices

Multiple contexts could be associated to the same device

clCreateContext() and clCreateContextFromType() return a handle to the created context

Command-Queues

(Section 5.1)

Command-queues store a set of operations to perform

Command-queues are associated to a context

Multiple command-queues can be created to handle independent commands that don’t require synchronization

Execution of the command-queue is guaranteed to be completed at sync points
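The VecAdd example later in this deck uses a cmd_queue handle without showing its creation; a minimal sketch using the OpenCL 1.0 API (the device handle is assumed to come from an earlier clGetDeviceIDs() query):

```c
// Create an in-order command-queue for one device in the context.
// Passing CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE as the properties
// argument would request out-of-order execution instead.
cl_command_queue cmd_queue = clCreateCommandQueue(context,
                                                  device,  // from clGetDeviceIDs()
                                                  0,       // properties
                                                  &err);
```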

Memory Objects

(Section 5.2)

Buffer objects

One-dimensional collection of objects (like C arrays)

Valid elements include scalar and vector types as well as user-defined structures

Buffer objects can be accessed via pointers in the kernel

Image objects

Two- or three-dimensional textures, frame-buffers, or images

Must be addressed through built-in functions

Sampler objects

Describe how to sample an image in the kernel

Addressing modes

Filtering modes

Creating Memory Objects

clCreateBuffer(), clCreateImage2D(), and clCreateImage3D()

Memory objects are created with an associated context

Memory can be created as read-only, write-only, or read-write

Where objects are created in the platform memory space can be controlled

Device memory

Device memory with data copied from a host pointer

Host memory

Host memory associated with a pointer

Memory at that pointer is guaranteed to be valid at synchronization points

Image objects are also created with a channel format

Channel order (e.g., RGB, RGBA, etc.)

Channel type (e.g., UNORM INT8, FLOAT, etc.)

Manipulating Object Data

Object data can be copied to host memory, from host memory, or to other objects

Memory commands are enqueued in the command buffer and processed when the command is executed

clEnqueueReadBuffer(), clEnqueueReadImage()

clEnqueueWriteBuffer(), clEnqueueWriteImage()

clEnqueueCopyBuffer(), clEnqueueCopyImage()

Data can be copied between Image and Buffer objects

clEnqueueCopyImageToBuffer()

clEnqueueCopyBufferToImage()

Regions of the object data can be accessed by mapping into the host address space

clEnqueueMapBuffer(), clEnqueueMapImage()

clEnqueueUnmapMemObject()
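The map/modify/unmap pattern mentioned above can be sketched as follows; the buffer and queue names follow the VecAdd example later in the deck, and the zero-fill is just an illustration:

```c
// Map the first n floats of a buffer into host memory (blocking map).
cl_float *ptr = (cl_float *)clEnqueueMapBuffer(cmd_queue, memobjs[0],
                                               CL_TRUE,        // blocking
                                               CL_MAP_WRITE,
                                               0,              // offset
                                               n * sizeof(cl_float),
                                               0, NULL, NULL,  // no events
                                               &err);
for (size_t j = 0; j < n; ++j)
    ptr[j] = 0.0f;              // touch the data through the host pointer

// Return the region to the runtime; the pointer is invalid afterwards.
clEnqueueUnmapMemObject(cmd_queue, memobjs[0], ptr, 0, NULL, NULL);
```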

Program Objects

(Section 5.4)

Program objects encapsulate:

An associated context

Program source or binary

Latest successful program build, list of targeted devices, build options

Number of attached kernel objects

Build process

Create program object

clCreateProgramWithSource()

clCreateProgramWithBinary()

Build program executable

Compile and link from source or binary for all devices or specific devices in the associated context

clBuildProgram()

Build options

Preprocessor

Math intrinsics (floating-point behavior)

Optimizations
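When clBuildProgram() fails (for example, on a kernel compile error), the compiler log can be retrieved with clGetProgramBuildInfo(). A common sketch; the device handle is assumed from an earlier query:

```c
if (clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS) {
    size_t log_size;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &log_size);       // ask for log length
    char *log = malloc(log_size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          log_size, log, NULL);      // fetch the log text
    fprintf(stderr, "build failed:\n%s\n", log);
    free(log);
}
```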

Kernel Objects

(Section 5.5)

Kernel objects encapsulate

Specific kernel functions declared in a program

Argument values used for kernel execution

Creating kernel objects

clCreateKernel() - creates a kernel object for a single function in a program

clCreateKernelsInProgram() - creates objects for all kernels in a program

Setting arguments

clSetKernelArg(<kernel>, <argument index>)

Each argument’s data must be set for the kernel function

Argument values are copied and stored in the kernel object

Kernel vs. program objects

Kernels are related to program execution

Programs are related to program source

Kernel Execution

(Section 5.6)

A command to execute a kernel must be enqueued to the command-queue

clEnqueueNDRangeKernel()

Data-parallel execution model

Describes the index space for kernel execution

Requires information on NDRange dimensions and work-group size

clEnqueueTask()

Task-parallel execution model (multiple queued tasks)

Kernel is executed on a single work-item

clEnqueueNativeKernel()

Task-parallel execution model

Executes a native C/C++ function not compiled using the OpenCL compiler

This mode does not use a kernel object, so arguments must be passed in

Command-Queues and Synchronization

Command-queue execution

Execution model signals when commands are complete or data is ready

Command-queue could be explicitly flushed to the device

Command-queues execute in-order or out-of-order

In-order - commands complete in the order queued and memory is consistent

Out-of-order - no guarantee when commands are executed or memory is consistent without synchronization

Synchronization

Blocking calls

Commands that do not return until complete

clEnqueueReadBuffer() can be called as blocking and will block until complete

Event objects

Tracks execution status of a command

Some commands can be blocked until event objects signal a completion of previous command

clEnqueueNDRangeKernel() can take an event object as an argument and wait until a previous command (e.g., clEnqueueWriteBuffer()) is complete

Profiling

Queue barriers - queued commands that can block command execution
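The event-based chaining described above can be sketched as follows; the buffer, source pointer, and queue names follow the VecAdd example, and error handling is abbreviated:

```c
// Chain a kernel launch after a non-blocking write using an event.
cl_event write_done;
err = clEnqueueWriteBuffer(cmd_queue, memobjs[0],
                           CL_FALSE,               // non-blocking write
                           0, n * sizeof(cl_float), srcA,
                           0, NULL, &write_done);  // signals on completion

// The kernel will not start until write_done signals, even on an
// out-of-order queue.
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                             global_work_size, NULL,
                             1, &write_done,       // wait list
                             NULL);
```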

OpenCL C for Compute Kernels

(Chapter 6)

Derived from ISO C99

A few restrictions: recursion, function pointers, functions in C99 standard headers ...

Preprocessing directives defined by C99 are supported

Built-in Data Types

Scalar and vector data types, pointers

Data-type conversion functions: convert_type<_sat><_roundingmode>

Image types: image2d_t, image3d_t and sampler_t

Built-in Functions — Required

Work-item functions, math.h, read and write image

Relational, geometric functions, synchronization functions

Built-in Functions — Optional

double precision, atomics to global and local memory

Selection of rounding mode, writes to image3d_t surfaces

OpenCL C Language Highlights

Function qualifiers

“__kernel” qualifier declares a function as a kernel

Kernels can call other kernel functions

Address space qualifiers

__global, __local, __constant, __private

Pointer kernel arguments must be declared with an address space qualifier

Work-item functions

Query work-item identifiers

get_work_dim()

get_global_id(), get_local_id(), get_group_id()

Image functions

Images must be accessed through built-in functions

Reads/writes performed through sampler objects from host or defined in source

Synchronization functions

Barriers - all work-items within a work-group must execute the barrier function before any work-item can continue

Memory fences - provide ordering between memory operations

OpenCL C Language Restrictions

Pointers to functions are not allowed

Pointers to pointers allowed within a kernel, but not as an argument

Bit-fields are not supported

Variable-length arrays and structures are not supported

Recursion is not supported

Writes to a pointer of types less than 32 bits are not supported*

Double types are not supported*

3D image writes are not supported*

*Supported by official extension

OpenCL Example

Host program

Query platform

Query compute devices

Create contexts

Create memory objects associated to contexts

Compile and create kernel program objects

Issue commands to command-queue

Synchronization of commands

Clean up OpenCL resources

Kernels

OpenCL C code

Basic OpenCL Program Structure

Platform Layer

Runtime

Language

VecAdd: Get the Platform (Win/Lin)

cl_platform_id platform_ids[num_entries];
cl_uint num_platforms;

// Query for supported platforms
cl_int err = clGetPlatformIDs(num_entries,       // max ids to return
                              platform_ids,      // platforms supported
                              &num_platforms);   // number returned

// Choose a platform based on its properties (vendor, version, etc.)
char vendor_name[128];
size_t vendor_name_length;
err = clGetPlatformInfo(platform_ids[i],         // for a platform
                        CL_PLATFORM_VENDOR,      // get vendor name
                        sizeof(char)*128,        // value storage size
                        vendor_name,             // value storage
                        &vendor_name_length);    // returned data length

Querying Platform Info: Section 4.1 of the spec

VecAdd: Create a Context

Skipping device enumeration, which proper apps should do

cl_context_properties context_props[3];
cl_context context;

context_props[0] = CL_CONTEXT_PLATFORM;                     // set the platform property
context_props[1] = (cl_context_properties)platform_ids[i]; // to the chosen platform id
context_props[2] = 0;                                       // properties list is 0-terminated

// Create the context using the shortcut “from type” API
context = clCreateContextFromType(context_props,
                                  CL_DEVICE_TYPE_GPU,  // the device type
                                  NULL,                // error callback
                                  NULL,                // callback user data
                                  &err);               // error return

Contexts: Section 4.3 of the spec

VecAdd: Create Memory Objects

cl_mem memobjs[3];

// allocate input buffer memory objects
memobjs[0] = clCreateBuffer(context,
                            CL_MEM_READ_ONLY |       // flags
                            CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n,      // size
                            srcA,                    // host pointer
                            NULL);                   // error code

memobjs[1] = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n,
                            srcB, NULL);

// allocate output buffer memory object
memobjs[2] = clCreateBuffer(context,
                            CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

Creating buffer objects: Section 5.2.1 of the spec

VecAdd: Program and Kernel

// create the program
cl_program program = clCreateProgramWithSource(context,
                                               1,                // string count
                                               &program_source,  // program strings
                                               NULL,             // string lengths
                                               NULL);            // error code

// build the program
err = clBuildProgram(program,
                     0,      // num devices in device list
                     NULL,   // device list
                     NULL,   // options
                     NULL,   // notifier callback function ptr
                     NULL);  // user data

// create the kernel
cl_kernel kernel = clCreateKernel(program, "vec_add", NULL);

Creating program objects: Section 5.4.1
Building program executables: Section 5.4.2
Creating kernel objects: Section 5.5.1
of the spec

VecAdd: Set Kernel Arguments

// set “a” vector argument
err = clSetKernelArg(kernel,
                     0,                     // argument index
                     sizeof(cl_mem),        // argument data size
                     (void *)&memobjs[0]);  // argument data

// set “b” vector argument
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);

// set “c” vector argument
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

Setting kernel arguments: Section 5.5.2
Executing kernels: Section 6.1
Reading, writing, and copying buffer objects: Section 5.2.2
of the spec

VecAdd: Invoke Kernel, Read Output

size_t global_work_size[1] = { n };  // set work-item dimensions

// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             1,                 // work dimensions
                             NULL,              // must be NULL (work offset)
                             global_work_size,
                             NULL,              // automatic local work size
                             0,                 // no events to wait on
                             NULL,              // event list
                             NULL);             // event for this kernel

// read output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2],
                          CL_TRUE,              // blocking
                          0,                    // offset
                          n*sizeof(cl_float),   // size
                          dst,                  // pointer
                          0, NULL, NULL);       // events

Setting kernel arguments: Section 5.5.2
Executing kernels: Section 6.1
Reading, writing, and copying buffer objects: Section 5.2.2
of the spec

OpenCL Vector Addition Kernel

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global       float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

__kernel: Section 6.7.1
__global: Section 6.5.1
get_global_id(): Section 6.11.1
Data types: Section 6.1
of the spec

OPENCL’S EVOLUTION

An insider’s view and some thoughts for discussion

The challenge of a multi-vendor standard

Different companies on different schedules

Usually this means hewing to the lowest common denominator

Compromise is key to forward progress

Supporting many architectures means a big tent

CPU, GPU, Cell, DSP all engaged

Potentially differing priorities and vision

Supporting them all is nice, but be wary of lack of focus

Precise language saves many headaches later

Spend a lot of time and effort getting the basics down

Can’t depend on platform-specific behavior to fill in the gaps

An expert API programmer’s compute standard

Targeted at experts from the start

No integrated runtime, all API based on the Host

Explicit control, low-level control in API and language*

Influenced by key workloads (e.g., image processing, games)

Typically have their own domain-specific runtimes (e.g., CoreImage, UnrealEngine)

Built by experts for use by developers wanting more abstraction

A Khronos standard

Rich public API history to learn from, both bad and good

Engages a developer community already familiar and comfortable programming complex APIs

*Except for memory; it’s handle-based, no affinity. Why?

The legacy of OS GPU driver design on memory models

GPU-accelerated window managers (Quartz, Aero) have new QoS requirements

Priority to keep the display interactive

Graceful degradation when GPU resources are overcommitted

OSes require the capability to manage GPU memory as they see fit to meet QoS

Low-level in other ways, OpenCL must leave memory management to other layers in the stack

Hence an opaque handle-based API, rather than pointers

The API-Language split

Standards must rely on supported interfaces

C APIs are well supported in the Host environment

OpenCL C specifies device environment

So:

The Host program uses an API to manipulate the execution and memory model

The Device uses the language to interact with the execution and memory model

OpenCL Host and Device have very different views

Host/Device may need to be different types (multi-vendor support), maybe even different endianness

But this has costs: user types (structs) don’t magically work on both Host and Device