OPENCL OVERVIEW
OpenCL Commercial Objectives
Grow the market for parallel computing
For vendors of systems, silicon, middleware, tools and applications
Open, royalty-free standard for heterogeneous parallel computing
Unified programming model for CPUs, GPUs, Cell, DSPs, and other processors in a system
Cross-vendor software portability to a wide range of silicon and systems
HPC servers, desktop systems, and handheld devices covered in one specification
Support for a wide diversity of applications
From embedded and mobile software through consumer applications to HPC solutions
Create a foundation layer for a parallel computing ecosystem
Close-to-the-metal interface to support a rich diversity of middleware and applications
Rapid deployment in the market
Designed to run on current latest-generation GPU hardware
OpenCL Working Group
Diverse industry participation
Processor vendors, system OEMs, middleware vendors, application developers
Many industry-leading experts involved in OpenCL’s design
A healthy diversity of industry perspectives
Apple initially proposed OpenCL and is very active in the working group
Serving as specification editor
Here are some of the other companies in the OpenCL working group
OpenCL Timeline
Six months from proposal to released specification
Due to a strong initial proposal and a shared commercial incentive to work quickly
Apple’s Mac OS X Snow Leopard includes OpenCL
Multiple OpenCL implementations are now out: Apple, and NVIDIA/AMD on Windows and Linux, to name a few
Jun08: Apple, having worked with AMD, Intel, NVIDIA, and others on a draft proposal, proposes the OpenCL working group and contributes the draft specification to Khronos
The OpenCL working group develops the draft into a cross-vendor specification
Oct08: the working group sends the completed draft to the Khronos Board for ratification
Dec08: Khronos publicly releases OpenCL as a royalty-free specification
Mar09: Khronos to release conformance tests to ensure high-quality implementations
OpenCL 1.0
Technical Overview
OpenCL Design Requirements
Use all computational resources in system
Program GPUs, CPUs, and other processors as peers
Support both data- and task-parallel compute models
Efficient C-based parallel programming model
Abstract the specifics of underlying hardware
Abstraction is low-level and high-performance, but device-portable
Approachable, but primarily targeted at expert developers
Ecosystem foundation - no middleware or “convenience” functions
Implementable on a range of embedded, desktop, and server systems
HPC, desktop, and handheld profiles in one specification
Drive future hardware requirements
Floating point precision requirements
Applicable to both consumer and HPC applications
Anatomy of OpenCL
Language Specification
C-based cross-platform programming interface
Subset of ISO C99 with language extensions - familiar to developers
Well-defined numerical accuracy - IEEE 754 rounding behavior with specified maximum error
Online or offline compilation and build of compute kernel executables
Includes a rich set of built-in functions
Platform Layer API
A hardware abstraction layer over diverse computational resources
Query, select, and initialize compute devices
Create compute contexts and work-queues
Runtime API
Execute compute kernels
Manage scheduling, compute, and memory resources
Hierarchy of Models
Platform Model
Memory Model
Execution Model
Programming Model
OpenCL Platform Model
(Section 3.1)
One Host plus one or more Compute Devices
Each Compute Device is composed of one or more Compute Units
Each Compute Unit is further divided into one or more Processing Elements
OpenCL Execution Model
(Section 3.2)
OpenCL Program:
Kernels
Basic unit of executable code - similar to a C function
Data-parallel or task-parallel
Host Program
Collection of compute kernels and internal functions
Analogous to a dynamic library
Kernel Execution
The host program invokes a kernel over an index space called an NDRange
NDRange = “N-Dimensional Range”
NDRange can be a 1-, 2-, or 3-dimensional space
A single kernel instance at a point in the index space is called a work-item
Work-items have unique global IDs from the index space
Work-items are further grouped into work-groups
Work-groups have a unique work-group ID
Work-items have a unique local ID within a work-group
Kernel Execution
Total number of work-items = Gx × Gy
Size of each work-group = Sx × Sy
Global ID can be computed from the work-group ID and local ID
Contexts and Queues
(Section 3.2.1)
Contexts are used to contain and manage the state of the “world”
Kernels are executed in contexts defined and manipulated by the host
Devices
Kernels - OpenCL functions
Program objects - kernel source and executable
Memory objects
Command-queue - coordinates execution of kernels
Kernel execution commands
Memory commands - transfer or mapping of memory object data
Synchronization commands - constrain the order of commands
Applications queue compute kernel execution instances
Queued in-order
Executed in-order or out-of-order
Events are used to implement appropriate synchronization of execution instances
OpenCL Memory Model
(Section 3.3)
[Diagram: a Compute Device contains Compute Units 1..N; each Compute Unit holds Work-Items 1..M, each with its own Private Memory, sharing a Local Memory; the Compute Units share a Global/Constant Memory Data Cache backed by Global Memory in Compute Device Memory]
Shared memory model
Relaxed consistency
Multiple distinct address spaces
Address spaces can be collapsed depending on the device’s memory subsystem
Address spaces
Private - private to a work-item
Local - local to a work-group
Global - accessible by all work-items in all work-groups
Constant - read-only global space
Implementations map this hierarchy to available physical memories
Memory Consistency
(Section 3.3.1)
“OpenCL uses a relaxed consistency memory model; i.e., the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.”
Within a work-item, memory has load/store consistency
Within a work-group at a barrier, local memory has consistency across work-items
Global memory is consistent within a work-group at a barrier, but not guaranteed across different work-groups
Consistency of memory shared between commands is enforced through synchronization
Data-Parallel Programming Model
(Section 3.4.1)
Define an N-dimensional computation domain
Each independent element of execution in the N-dimensional domain is called a work-item
The N-dimensional domain defines the total number of work-items that execute in parallel = global work size
Work-items can be grouped together into a work-group
Work-items in a group can communicate with each other
Can synchronize execution among work-items in a group to coordinate memory access
Execute multiple work-groups in parallel
Mapping of global work size to work-groups can be implicit or explicit
Task-Parallel Programming Model
(Section 3.4.2)
The data-parallel execution model must be implemented by all OpenCL compute devices
Some compute devices, such as CPUs, can also execute task-parallel compute kernels
Executed as a single work-item
A compute kernel written in OpenCL C
A “native” function
Basic OpenCL Program Structure
Host program
Platform Layer:
Query platform
Query compute devices
Create contexts
Runtime:
Create memory objects associated to contexts
Compile and create kernel program objects
Issue commands to command-queue
Synchronization of commands
Clean up OpenCL resources
Kernels
Language:
OpenCL C code
Platform Layer
(Chapter 4)
The platform layer allows applications to query for platform-specific features
Querying platform info (i.e., the OpenCL profile) (Section 4.1)
Querying devices (Section 4.2)
clGetDeviceIDs()
Find out what compute devices are on the system
Classified device types include CPUs, GPUs, and Accelerators
clGetDeviceInfo()
Queries the capabilities of the discovered compute devices
Creating contexts (Section 4.3)
Contexts are used by the OpenCL runtime to manage objects and execute kernels on one or more devices
Contexts are associated with one or more devices
Multiple contexts could be associated with the same device
clCreateContext() and clCreateContextFromType() return a handle to the created context
Command-Queues
(Section 5.1)
Command-queues store a set of operations to perform
Command-queues are associated with a context
Multiple command-queues can be created to handle independent commands that don’t require synchronization
Execution of the command-queue is guaranteed to be complete at sync points
Memory Objects
(Section 5.2)
Buffer objects
One-dimensional collection of objects (like C arrays)
Valid elements include scalar and vector types as well as user-defined structures
Buffer objects can be accessed via pointers in the kernel
Image objects
Two- or three-dimensional textures, frame-buffers, or images
Must be addressed through built-in functions
Sampler objects
Describe how to sample an image in the kernel
Addressing modes
Filtering modes
Creating Memory Objects
clCreateBuffer(), clCreateImage2D(), and clCreateImage3D()
Memory objects are created with an associated context
Memory can be created as read-only, write-only, or read-write
Where objects are created in the platform memory space can be controlled
Device memory
Device memory with data copied from a host pointer
Host memory
Host memory associated with a pointer
Memory at that pointer is guaranteed to be valid at synchronization points
Image objects are also created with a channel format
Channel order (e.g., RGB, RGBA, etc.)
Channel type (e.g., UNORM INT8, FLOAT, etc.)
Manipulating Object Data
Object data can be copied to host memory, from host memory, or to other objects
Memory commands are enqueued in the command buffer and processed when the command is executed
clEnqueueReadBuffer(), clEnqueueReadImage()
clEnqueueWriteBuffer(), clEnqueueWriteImage()
clEnqueueCopyBuffer(), clEnqueueCopyImage()
Data can be copied between image and buffer objects
clEnqueueCopyImageToBuffer()
clEnqueueCopyBufferToImage()
Regions of the object data can be accessed by mapping into the host address space
clEnqueueMapBuffer(), clEnqueueMapImage()
clEnqueueUnmapMemObject()
Program Objects
(Section 5.4)
Program objects encapsulate:
An associated context
Program source or binary
Latest successful program build, list of targeted devices, build options
Number of attached kernel objects
Build process
Create program object
clCreateProgramWithSource()
clCreateProgramWithBinary()
Build program executable
Compile and link from source or binary, for all devices or specific devices in the associated context
clBuildProgram()
Build options
Preprocessor
Math intrinsics (floating-point behavior)
Optimizations
Kernel Objects
(Section 5.5)
Kernel objects encapsulate
Specific kernel functions declared in a program
Argument values used for kernel execution
Creating kernel objects
clCreateKernel() - creates a kernel object for a single function in a program
clCreateKernelsInProgram() - creates objects for all kernels in a program
Setting arguments
clSetKernelArg(<kernel>, <argument index>, ...)
Each argument’s data must be set for the kernel function
Argument values are copied and stored in the kernel object
Kernel vs. program objects
Kernels are related to program execution
Programs are related to program source
Kernel Execution
(Section 5.6)
A command to execute a kernel must be enqueued to the command-queue
clEnqueueNDRangeKernel()
Data-parallel execution model
Describes the index space for kernel execution
Requires information on NDRange dimensions and work-group size
clEnqueueTask()
Task-parallel execution model (multiple queued tasks)
Kernel is executed on a single work-item
clEnqueueNativeKernel()
Task-parallel execution model
Executes a native C/C++ function not compiled using the OpenCL compiler
This mode does not use a kernel object, so arguments must be passed in
Command-Queues and Synchronization
Command-queue execution
Execution model signals when commands are complete or data is ready
Command-queue could be explicitly flushed to the device
Command-queues execute in-order or out-of-order
In-order - commands complete in the order queued and correct memory is consistent
Out-of-order - no guarantee when commands are executed or memory is consistent without synchronization
Synchronization
Blocking calls
Commands that do not return until complete
clEnqueueReadBuffer() can be called as blocking and will block until complete
Event objects
Track the execution status of a command
Some commands can be blocked until event objects signal completion of a previous command
clEnqueueNDRangeKernel() can take an event object as an argument and wait until a previous command (e.g., clEnqueueWriteBuffer()) is complete
Profiling
Queue barriers - queued commands that can block command execution
OpenCL C for Compute Kernels
(Chapter 6)
Derived from ISO C99
A few restrictions: recursion, function pointers, functions in C99 standard headers, ...
Preprocessing directives defined by C99 are supported
Built-in Data Types
Scalar and vector data types, pointers
Data-type conversion functions: convert_type<_sat><_roundingmode>
Image types: image2d_t, image3d_t, and sampler_t
Built-in Functions - Required
Work-item functions, math.h functions, read and write image
Relational, geometric functions, synchronization functions
Built-in Functions - Optional
Double precision, atomics to global and local memory
Selection of rounding mode, writes to image3d_t surfaces
OpenCL C Language Highlights
Function qualifiers
“__kernel” qualifier declares a function as a kernel
Kernels can call other kernel functions
Address space qualifiers
__global, __local, __constant, __private
Pointer kernel arguments must be declared with an address space qualifier
Work-item functions
Query work-item identifiers
get_work_dim()
get_global_id(), get_local_id(), get_group_id()
Image functions
Images must be accessed through built-in functions
Reads/writes performed through sampler objects, either passed from the host or defined in source
Synchronization functions
Barriers - all work-items within a work-group must execute the barrier function before any work-item can continue
Memory fences - provide ordering between memory operations
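A hypothetical kernel (not from the slides) showing several of these features together: __global and __local qualifiers, the work-item query functions, and barrier() for work-group synchronization. It assumes a power-of-two work-group size.

```c
// Hypothetical OpenCL C kernel: each work-group sums its slice of
// the input into one partial result using __local scratch memory.
__kernel void partial_sum(__global const float *in,
                          __global float *out,
                          __local  float *scratch)
{
    int lid  = get_local_id(0);
    int size = get_local_size(0);     // assumed power of two

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);     // all local stores now visible

    for (int s = size / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE); // synchronize each reduction step
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}
```

Every work-item in the group must reach each barrier, which is why the barrier sits outside the `if (lid < s)` test.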
OpenCL C Language Restrictions
Pointers to functions are not allowed
Pointers to pointers allowed within a kernel, but not as an argument
Bit-fields are not supported
Variable-length arrays and structures are not supported
Recursion is not supported
Writes to a pointer of types less than 32 bits are not supported*
Double types are not supported*
3D image writes are not supported*
*Supported by official extension
OpenCL Example
Basic OpenCL Program Structure
Host program
Platform Layer:
Query platform
Query compute devices
Create contexts
Runtime:
Create memory objects associated to contexts
Compile and create kernel program objects
Issue commands to command-queue
Synchronization of commands
Clean up OpenCL resources
Kernels
Language:
OpenCL C code
VecAdd: Get the Platform (Win/Lin)

cl_platform_id platform_ids[num_entries];
cl_uint num_platforms;

// Query for supported platforms
cl_int err = clGetPlatformIDs(num_entries,      // max IDs to return
                              platform_ids,     // platforms supported
                              &num_platforms);  // number returned

// Choose a platform based on its properties (vendor, version, etc.)
char vendor_name[128];
size_t vendor_name_length;
err = clGetPlatformInfo(platform_ids[i],        // for a platform
                        CL_PLATFORM_VENDOR,     // get vendor name
                        sizeof(char)*128,       // value storage size
                        vendor_name,            // value storage
                        &vendor_name_length);   // returned data length

Querying platform info: Section 4.1 of the spec
VecAdd: Create a Context
Skipping device enumeration, which proper apps should do

cl_context_properties context_props[3];
cl_context context;

context_props[0] = CL_CONTEXT_PLATFORM;                     // set the platform property
context_props[1] = (cl_context_properties)platform_ids[i]; // to the chosen platform ID
context_props[2] = 0;                                       // zero-terminate the property list

// Create the context using the shortcut “from type” API
context = clCreateContextFromType(context_props,
                                  CL_DEVICE_TYPE_GPU,  // the device type
                                  NULL,                // error callback
                                  NULL,                // callback user data
                                  &err);               // error return

Contexts: Section 4.3 of the spec
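One step the slides skip: the later VecAdd slides enqueue to cmd_queue, but its creation is never shown. A hedged sketch (not from the slides; error handling omitted) of how a queue could be created in OpenCL 1.0, pulling a device back out of the context:

```c
// Sketch only: fetch one device from the context, then create an
// in-order command-queue on it.
cl_device_id device;
err = clGetContextInfo(context, CL_CONTEXT_DEVICES,
                       sizeof(device), &device, NULL);

cl_command_queue cmd_queue =
    clCreateCommandQueue(context, device,
                         0,      // properties: in-order, no profiling
                         &err);  // error return
```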
VecAdd: Create Memory Objects

cl_mem memobjs[3];

// allocate input buffer memory objects
memobjs[0] = clCreateBuffer(context,
                            CL_MEM_READ_ONLY |     // flags
                            CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n,    // size
                            srcA,                  // host pointer
                            NULL);                 // error code
memobjs[1] = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n,
                            srcB, NULL);

// allocate output buffer memory object
memobjs[2] = clCreateBuffer(context,
                            CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n,
                            NULL, NULL);

Creating buffer objects: Section 5.2.1 of the spec
VecAdd: Program and Kernel

// create the program
cl_program program =
    clCreateProgramWithSource(context,
                              1,                // string count
                              &program_source,  // program strings
                              NULL,             // string lengths
                              NULL);            // error code

// build the program
err = clBuildProgram(program,
                     0,       // num devices in device list
                     NULL,    // device list
                     NULL,    // options
                     NULL,    // notifier callback function ptr
                     NULL);   // user data

// create the kernel
cl_kernel kernel = clCreateKernel(program, "vec_add", NULL);

Creating program objects: Section 5.4.1
Building program executables: Section 5.4.2
Creating kernel objects: Section 5.5.1
VecAdd: Set Kernel Arguments

// set “a” vector argument
err = clSetKernelArg(kernel,
                     0,                    // argument index
                     sizeof(cl_mem),       // argument data size
                     (void *)&memobjs[0]); // argument data

// set “b” vector argument
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);

// set “c” vector argument
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

Setting kernel arguments: Section 5.5.2 of the spec
VecAdd: Invoke Kernel, Read Output

size_t global_work_size[1] = { n };  // set work-item dimensions

// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
                             1,                 // work dimensions
                             NULL,              // must be NULL (work offset)
                             global_work_size,
                             NULL,              // automatic local work size
                             0,                 // no events to wait on
                             NULL,              // event wait list
                             NULL);             // event for this kernel

// read output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2],
                          CL_TRUE,             // blocking
                          0,                   // offset
                          n*sizeof(cl_float),  // size
                          dst,                 // host pointer
                          0, NULL, NULL);      // events

Executing kernels: Section 5.6
Reading, writing, and copying buffer objects: Section 5.2.2
OpenCL Vector Addition Kernel

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global       float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

__kernel: Section 6.7.1
__global: Section 6.5.1
get_global_id(): Section 6.11.1
Data types: Section 6.1
OpenCL’s Evolution
An insider’s view and some thoughts for discussion
The challenge of a multi-vendor standard
Different companies on different schedules
Usually this means hewing to the lowest common denominator
Compromise is key to forward progress
Supporting many architectures means a big tent
CPU, GPU, Cell, and DSP vendors all engaged
Potentially differing priorities and visions
Supporting them all is nice, but be wary of lack of focus
Precise language saves many headaches later
Spend a lot of time and effort getting the basics down
Can’t depend on platform-specific behavior to fill in the gaps
An expert API programmer’s compute standard
Targeted at experts from the start
No integrated runtime, all API based on the Host
Explicit control, low-level control in API and language*
Influenced by key workloads (e.g., image processing, games)
These typically have their own domain-specific runtimes (e.g., CoreImage, UnrealEngine)
Built by experts, for use by developers wanting more abstraction
A Khronos standard
Rich public API history to learn from, both bad and good
Engages a developer community already familiar and comfortable programming complex APIs
*Except for memory: it’s handle-based, with no affinity. Why?
The legacy of OS GPU driver design on memory models
GPU-accelerated window managers (Quartz, Aero) have new QoS requirements
Priority is to keep the display interactive
Graceful degradation when GPU resources are overcommitted
OSes require the capability to manage GPU memory as they see fit to meet QoS
Though low-level in other ways, OpenCL must leave memory management to other layers in the stack
Hence an opaque handle-based API, rather than pointers
The API-Language split
Standards must rely on supported interfaces
C APIs are well supported in the Host environment
OpenCL C specifies the device environment
So:
The Host program uses an API to manipulate the execution and memory model
The Device uses the language to interact with the execution and memory model
OpenCL Host and Device have very different views
Host and Device may be different processor types (multi-vendor support), maybe even different endianness
But this has costs: user types (structs) don’t magically work on both Host and Device