/
Introduction to GPU Programming Introduction to GPU Programming

Introduction to GPU Programming - PowerPoint Presentation

karlyn-bohler
karlyn-bohler . @karlyn-bohler
Follow
343 views
Uploaded On 2019-06-19

Introduction to GPU Programming - PPT Presentation

Volodymyr Vlad Kindratenko Innovative Systems Laboratory NCSA Institute for Advanced Computing Applications and Technologies IACAT Tutorial Goals Become familiar with NVIDIA GPU architecture ID: 759151

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "Introduction to GPU Programming" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

Introduction to GPU Programming

Volodymyr (

Vlad

) Kindratenko

Innovative Systems Laboratory @ NCSA

Institute for Advanced Computing Applications and Technologies (IACAT)

Slide2

Tutorial Goals

Become familiar with NVIDIA GPU architectureBecome familiar with the NVIDIA GPU application development flowBe able to write and run a simple NVIDIA GPU application in CUDA

2

7/26/2009

Slide3

Tutorial Outline

Introduction (15 minutes)Why use Graphics Processing Units (GPUs) for general-purpose computingModern GPU architectureNVIDIAGPU programmingLibraries, CUDA C, OpenCL, PGI x64+GPUHands-on: getting started with NCSA GPU cluster (15 minutes)Cluster architecture overviewHow to login and check out a nodeHow to compile and run an existing applicationHands-on: Anatomy of a GPU application (25 minutes)Host sideDevice sideCUDA programming modelHands-on: Porting matrix multiplier to GPU (25 minutes)More on CUDA programming (40 minutes)

3

7/26/2009

Slide4

Introduction

Why use Graphics Processing Units (GPUs) for general-purpose computingModern GPU architectureNVIDIAGPU programmingLibrariesCUDA COpenCLPGI x64+GPU

4

7/26/2009

Slide5

Why GPUs?Raw Performance Trends

5

5800

5950 Ultra

6800 Ultra

7800 GTX

Graph is courtesy of NVIDIA

7/26/2009

Slide6

5800

5950 Ultra

6800 Ultra

7800 GTX

7900 GTX

8800 GTX

8800 Ultra

GTX 285

GTX 280

Why GPUs?

Memory Bandwidth Trends

6

Graph is courtesy of NVIDIA

7/26/2009

Slide7

GPU vs. CPU Silicon Use

7

7/26/2009

Graph is courtesy of NVIDIA

Slide8

NVIDIA GPU Architecture

A scalable array of multithreaded Streaming Multiprocessors (SMs), each SM consists of8 Scalar Processor (SP) cores2 special function units for transcendentalsA multithreaded instruction unitOn-chip shared memoryGDDR3 SDRAMPCIe interface

8

Figure

is courtesy of NVIDIA

7/26/2009

Slide9

NVIDIA GeForce9400M G GPU

16 streaming processors arranged as 2 streaming multiprocessors At 0.8 GHz this provides54 GFLOPS in single-precision (SP)128-bit interface to off-chip GDDR3 memory21 GB/s bandwidth

TPC

Geometry controller

SMC

SM

Shared memory

SFU

SFU

SP

SP

SP

SP

SP

SP

SP

SP

C cache

MT

issue

I cache

SM

Shared memory

SFU

SFU

SP

SP

SP

SP

SP

SP

SP

SP

C cache

MT

issue

I cache

Texture units

Texture L1

L2

ROP

L2

ROP

128-bit interconnect

DRAM

DRAM

9

7/26/2009

Slide10

NVIDIA Tesla C1060 GPU

240 streaming processors arranged as 30 streaming multiprocessors At 1.3 GHz this provides1 TFLOPS SP86.4 GFLOPS DP512-bit interface to off-chip GDDR3 memory102 GB/s bandwidth

TPC 1

Geometry controller

SMC

SM

Shared memory

SFU

SFU

SP

SP

SP

SP

SP

SP

SP

SP

C cache

MT

issue

I cache

SM

Shared memory

SFU

SFU

SP

SP

SP

SP

SP

SP

SP

SP

C cache

MT

issue

I cache

SM

Shared memory

SFU

SFU

SP

SP

SP

SP

SP

SP

SP

SP

C cache

MT issue

I cache

Texture units

Texture L1

TPC 10

Geometry controller

SMC

SM

Shared memory

SFU

SFU

SP

SP

SP

SP

SP

SP

SP

SP

C cache

MT issue

I cache

SM

Shared memory

SFU

SFU

SP

SP

SP

SP

SP

SP

SP

SP

C cache

MT issue

I cache

SM

Shared memory

SFU

SFU

SP

SP

SP

SP

SP

SP

SP

SP

C cache

MT issue

I cache

Texture units

Texture L1

L2

ROP

L2

ROP

512-bit memory interconnect

DRAM

DRAM

DRAM

DRAM

DRAM

DRAM

DRAM

DRAM

10

7/26/2009

Slide11

NVIDIA Tesla S1070 Computing Server

4 T10 GPUs

Tesla GPU

Tesla GPU

Tesla GPU

Tesla GPU

4 GB GDDR3 SDRAM

4 GB GDDR3 SDRAM

4 GB GDDR3 SDRAM

4 GB GDDR3 SDRAM

NVIDIA SWITCH

NVIDIA SWITCH

Power supply

Thermal management

System monitoring

PCI x16

PCI x16

11

7/26/2009

Graph is courtesy of NVIDIA

Slide12

GPU Use/Programming

GPU librariesNVIDIA’s CUDA BLAS and FFT librariesMany 3rd party librariesLow abstraction lightweight GPU programming toolkitsCUDA COpenCLHigh abstraction compiler-based toolsPGI x64+GPU

12

7/26/2009

Slide13

Getting Started with NCSA GPU Cluster

Cluster architecture overviewHow to login and check out a nodeHow to compile and run an existing application

13

7/26/2009

Slide14

NCSA AC GPU Cluster

14

7/26/2009

Slide15

GPU Cluster Architecture

Servers: 32CPU cores: 128

Accelerator Units: 32GPUs: 128

15

ac01

(compute node)

ac02

(compute node)

ac32

(compute node)

ac

(head node)

7/26/2009

Slide16

GPU Cluster Node Architecture

HP xw9400 workstation2216 AMD Opteron 2.4 GHz dual socket dual core8 GB DDR2InfiniBand QDR S1070 1U GPU Computing Server1.3 GHz Tesla T10 processors4x4 GB GDDR3 SDRAM

16

IB

Tesla S1070

T10

T10

PCIe

interface

DRAM

DRAM

T10

T10

PCIe

interface

DRAM

DRAM

HP xw9400 workstation

PCIe

x16

PCIe

x16

QDR IB

Compute node

7/26/2009

Slide17

Accessing the GPU Cluster

Use Secure Shell (SSH) client to access AC`ssh USER@ac.ncsa.uiuc.edu` (User: gpu001 - gpu200; Password: CHiPS-09)You will see something like this printed out:See machine details and a technical report at: http://www.ncsa.uiuc.edu/Projects/GPUcluster/Machine Description and HOW TO USE. See: /usr/local/share/ac.readmeCUDA wrapper readme: /usr/local/share/cuda_wrapper.readme*IMPORTANT* If you are using multiple GPU devices per host, be sure to understand how the cuda_wrapper changes this system!!…July 03, 2009Nvidia compute exclusive mode made default. If this breaks your application, "touch ~/FORCE_NORMAL" to create an override for all your jobs.Questions? Contact Jeremy Enos jenos@ncsa.uiuc.edu[gpuXYZ@ac ~]$ _

17

7/26/2009

Slide18

Installing Tutorial Examples

Run this sequence to retrieve and install tutorial examples:cdcp /tmp/chips_tutorial.tgz .tar -xvzf chips_tutorial.tgzcd chips_tutorialls src1 src2 src3

18

7/26/2009

Slide19

Accessing the GPU Cluster

19

ac01

(compute node)

ac02

(compute node)

ac32(compute node)

Laptop 1

Laptop 30

Laptop 2

ac

(head node)

You are here

7/26/2009

Slide20

Requesting a Cluster Node for Interactive Use

Run `qstat` to see what other users do, just for the fun of itRun `qsub -I -l walltime=02:00:00` to request a node with a single GPU for 2 hours of interactive useYou will see something like this printed out:qsub: waiting for job 64424.acm to startqsub: job 64424.acm ready[gpuXYZ@acAB ~]$ _

20

7/26/2009

Slide21

Requesting a Cluster Node

21

ac01

(compute node)

ac02

(compute node)

ac32(compute node)

Laptop 1

Laptop 30

Laptop 2

ac

(head node)

You are here

7/26/2009

Slide22

Checking GPU Characteristics

Run `deviceQuery`CUDA Device Query (Runtime API) version (CUDART static linking)There is 1 device supporting CUDADevice 0: "Tesla C1060" CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 3 Total amount of global memory: 4294705152 bytes Number of multiprocessors: 30 Number of cores: 240 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes…Clock rate: 1.30 GHzCompute mode: Exclusive (only one host thread at a time can use this device)

22

7/26/2009

Slide23

Compiling and Running and Existing Application

cd chips_tutorial/src1vecadd.c - reference C implementationvecadd.cu – CUDA implementationCompile & run CPU versiongcc vecadd.c -o vecadd_cpu./vecadd_cpuRunning CPU vecAdd for 16384 elementsC[0]=2147483648.00 ...Compile & run GPU versionnvcc vecadd.cu -o vecadd_gpu./vecadd_gpuRunning GPU vecAdd for 16384 elementsC[0]=2147483648.00 ...

23

7/26/2009

Slide24

nvcc

Any source file containing CUDA C language extensions must be compiled with nvccnvcc is a compiler driver that invokes many other tools to accomplish the jobBasic nvcc usagenvcc <filename>.cu [-o <executable>]Builds release modenvcc -deviceemu <filename>.cuBuilds device emulation mode (all code runs on CPU)-g flag allows to build debug mode for gdb debugger

24

7/26/2009

Slide25

Anatomy of a GPU Application

Host sideDevice sideCUDA programming model

25

7/26/2009

Slide26

CPU-Only Version

void vecAdd(int N, float* A, float* B, float* C) { for (int i = 0; i < N; i++) C[i] = A[i] + B[i];}int main(int argc, char **argv) { int N = 16384; // default vector size float *A = (float*)malloc(N * sizeof(float)); float *B = (float*)malloc(N * sizeof(float)); float *C = (float*)malloc(N * sizeof(float)); vecAdd(N, A, B, C); // call compute kernel free(A); free(B); free(C); }

26

Computational kernel

Memory allocation

Kernel invocation

Memory de-allocation

7/26/2009

Slide27

Adding GPU support

int main(int argc, char **argv) { int N = 16384; // default vector size float *A = (float*)malloc(N * sizeof(float)); float *B = (float*)malloc(N * sizeof(float)); float *C = (float*)malloc(N * sizeof(float)); float *devPtrA, *devPtrB, *devPtrC; cudaMalloc((void**)&devPtrA, N * sizeof(float)); cudaMalloc((void**)&devPtrB, N * sizeof(float)); cudaMalloc((void**)&devPtrC, N * sizeof(float)); cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);

27

Memory allocation on the GPU card

Copy data from the CPU (host) memory to the GPU (device) memory

7/26/2009

Slide28

Adding GPU support

vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC); cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost); cudaFree(devPtrA); cudaFree(devPtrB); cudaFree(devPtrC); free(A); free(B); free(C); }

28

Kernel invocation

Copy results from device memory to the host memory

Device memory de-allocation

7/26/2009

Slide29

GPU Kernel

CPU versionvoid vecAdd(int N, float* A, float* B, float* C) { for (int i = 0; i < N; i++) C[i] = A[i] + B[i];}GPU version__global__ void vecAdd(float* A, float* B, float* C) { int i = blockIdx.x * blockDim.x + threadIdx.x; C[i] = A[i] + B[i]; }

29

7/26/2009

Slide30

CUDA Programming Model

A CUDA kernel is executed by an array of threadsAll threads run the same code (SPMD)‏Each thread has an ID that it uses to compute memory addresses and make control decisionsThreads are arranged as a grid of thread blocksThreads within ablock have accessto a segment ofshared memory

30

float x = input[

threadID

];

float y =

func

(x);

output[

threadID

] = y;

threadID

Grid

Thread Block 0

Shared memory

Thread Block 1

Shared memory

Thread Block

N-1

Shared memory

7/26/2009

Slide31

Kernel Invocation Syntax

31

Grid

Thread Block 0

Shared memory

Thread Block 1

Shared memory

Thread Block

N-1

Shared memory

grid

&

thread block

dimensionality

vecAdd

<<<

32

, 512>>>

(

devPtrA

,

devPtrB

,

devPtrC

);

int

i

=

blockIdx.x

*

blockDim.x

+

threadIdx.x

;

thread ID within a thread block

number of

theards

per block

block ID within a grid

7/26/2009

Slide32

Mapping Threads to the Hardware

Blocks of threads are transparently assigned to SMsA block of threads executes on one SM & does not migrateSeveral blocks can reside concurrently on one SM

32

Device

Block 0

Block 1

Block 2

Block 3

Block 4

Block 5

Block 6

Block 7

Kernel grid

Block 0

Block 1

Block 2

Block 3

Block 4

Block 5

Block 6

Block 7

Device

Block 0

Block 1

Block 2

Block 3

Block 4

Block 5

Block 6

Block 7

Each block can execute in any order relative to other blocks.

time

Slide is courtesy of NVIDIA

7/26/2009

Slide33

CUDA Programming Model

A kernel is executed as a grid of thread blocksGrid of blocks can be 1 or 2-dimentionalThread blocks can be 1, 2, or 3-dimensionalDifferent kernels can have different grid/block configurationThreads from the same block have access to a shared memory and their execution can be synchronized

33

Slide is courtesy of NVIDIA

Device

Grid 2

Host

Kernel 1

Kernel 2

Block (1, 1)

Thread

(0,1,0)

Thread

(1,1,0)

Thread

(2,1,0)

Thread

(3,1,0)

Thread

(0,0,0)

Thread

(1,0,0)

Thread

(2,0,0)

Thread

(3,0,0)

(0,0,1)

(1,0,1)

(2,0,1)

(3,0,1)

Grid 1

Block

(0, 0)

Block

(1, 0)

Block

(2, 0)

Block

(0, 1)

Block

(1, 1)

Block

(2, 1)

7/26/2009

Slide34

GPU Memory Hierarchy

34

MemoryLocationCachedAccessScopeLifetimeRegisterOn-chipN/AR/WOne threadThreadLocalOff-chipNoR/WOne threadThreadSharedOn-chipN/AR/WAll threads in a blockBlockGlobalOff-chipNoR/WAll threads + hostApplicationConstantOff-chipYesRAll threads + hostApplicationTextureOff-chipYesRAll threads + hostApplication

Host

CPU

chipset

DRAM

Device

DRAM

local

global

constanttexture

GPU

Multiprocessor

Multiprocessor

Multiprocessor

registers

shared memory

constant and texture caches

7/26/2009

Slide35

Porting matrix multiplier to GPU

cd ../chips_tutorial/src2Compile & run CPU versionicc -O3 mmult.c -o mmult./mmult1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ......msec = 2215478 GFLOPS = 0.969

35

7/26/2009

Slide36

36

int main(int argc, char* argv[]){ int N = 1024; struct timeval t1, t2, ta, tb; long msec1, msec2; float flop, mflop, gflop; float *a = (float *)malloc(N*N*sizeof(float)); float *b = (float *)malloc(N*N*sizeof(float)); float *c = (float *)malloc(N*N*sizeof(float)); minit(a, b, c, N); gettimeofday(&t1, NULL); mmult(a, b, c, N); // a = b * c gettimeofday(&t2, NULL); mprint(a, N, 5); free(a); free(b); free(c); msec1 = t1.tv_sec * 1000000 + t1.tv_usec; msec2 = t2.tv_sec * 1000000 + t2.tv_usec; msec2 -= msec1; flop = N*N*N*2.0f; mflop = flop / msec2; gflop = mflop / 1000.0f; printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);}

// a = b * cvoid mmult(float *a, float *b, float *c, int N) { for (int j = 0; j < N; j++) for (int k = 0; k < N; k++) for (int i = 0; i < N; i++) a[i+j*N] += b[i+k*N]*c[k+j*N];}void minit(float *a, float *b, float *c, int N) { for (int j = 0; j < N; j++) for (int i = 0; i < N; i++) { a[i+N*j] = 0.0f; b[i+N*j] = 1.0f; c[i+N*j] = 1.0f; }}void mprint(float *a, int N, int M){ int i, j; for (int j = 0; j < M; j++) { for (int i = 0; i < M; i++) printf("%.2f ", a[i+N*j]); printf("...\n"); } printf("...\n");}

7/26/2009

Slide37

for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) for (k = 0; k < n; ++k) a[i+n*j] += b[i+n*k] * c[k+n*j];

Matrix Representation in Memory

Matrices are stored in column-major orderFor reference, jki-ordered version runs at 1.7 GFLOPS on 3 GHz Intel Xeon (single core)

37

7/26/2009

Slide38

for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) for (k = 0; k < n; ++k) a[i+n*j] += b[i+n*k] * c[k+n*j];

Grid of thread blocks

0

0 1 2 3 4 5

1

0 1 2 3 4 5

2

0 1 2 3 4 5

blockIdx.x

blockDim.x

threadIdx.x

blockIdx.x

*

blockDim.x

+

threadIdx.x

Map this code:

into this (logical) architecture:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

b

1,1

b

1,2

b

2,1

b

2,2

b

3,1

b

3,2

c

1,1

c

1,2

c

1,3c2,1c2,2c2,3

a1,1a1,2a1,3a2,1a2,2a2,3a3,1a3,2a3,3

B

C

A=B*C

a

1,2

=b1,1*c1,2+b1,2*c2,2

38

7/26/2009

Slide39

32x1024 grid of thread blocks

Block of 32x1x1 threads(blockIdx.x, blockIdx.y)

(threadIdx.x){ int i = blockIdx.x*32 + threadIdx.x; int j = blockIdx.y; float sum = 0.0f; for (int k = 0; k < n; k++) sum += b[i+n*k] * c[k+n*j]; a[i+n*j] = sum;}

32 threads per block (

i

)

32 thread blocks (

i

)

1024 thread blocks (

j

)

dim3 grid(1024/32, 1024);

dim3 threads (32);

39

7/26/2009

Slide40

Kernel

Original CPU kernel

GPU Kernel

40

__global__ void mmult(float *a, float *b, float *c, int N){ int i = blockIdx.x * blockDim.x + threadIdx.x; int j = blockIdx.y; float sum = 0.0; for (int k = 0; k < N; k++) sum += b[i+N*k] * c[k+N*j]; a[i+N*j] = sum;}

void mmult(float *a, float *b, float *c, int N) { for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) for (int k = 0; k < N; k++) a[i+j*N] += b[i+k*N]*c[k+j*N];}

dim3 dimGrid(32, 1024);dim3 dimBlock(32);mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);

7/26/2009

Slide41

41

int main(int argc, char* argv[]){ int N = 1024; struct timeval t1, t2; long msec1, msec2; float flop, mflop, gflop; float *a = (float *)malloc(N*N*sizeof(float)); float *b = (float *)malloc(N*N*sizeof(float)); float *c = (float *)malloc(N*N*sizeof(float)); minit(a, b, c, N); // allocate device memory float *devPtrA, *devPtrB, *devPtrC; cudaMalloc((void**)&devPtrA, N*N*sizeof(float)); cudaMalloc((void**)&devPtrB, N*N*sizeof(float)); cudaMalloc((void**)&devPtrC, N*N*sizeof(float)); // copu input arrays to the device meory cudaMemcpy(devPtrB, b, N*N*sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(devPtrC, c, N*N*sizeof(float), cudaMemcpyHostToDevice);

7/26/2009

Slide42

42

gettimeofday(&t1, NULL); msec1 = t1.tv_sec * 1000000 + t1.tv_usec; // define grid and thread block sizes dim3 dimGrid(32, 1024); dim3 dimBlock(32); // launch GPU kernel mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N); // check for errors cudaError_t err = cudaGetLastError(); if (cudaSuccess != err) { fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString( err) ); exit(EXIT_FAILURE); } // wait until GPU kernel is done cudaThreadSynchronize(); gettimeofday(&t2, NULL); msec2 = t2.tv_sec * 1000000 + t2.tv_usec;

7/26/2009

Slide43

43

// copy results to host cudaMemcpy(a, devPtrA, N*N*sizeof(float), cudaMemcpyDeviceToHost); mprint(a, N, 5); // free device memory cudaFree(devPtrA); cudaFree(devPtrB); cudaFree(devPtrC); free(a); free(b); free(c); msec2 -= msec1; flop = N*N*N*2.0f; mflop = flop / msec2; gflop = mflop / 1000.0f; printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);}

7/26/2009

Slide44

Porting matrix multiplier to GPU

Compile & run GPU versionnvcc mmult.cu -o mmult./mmult1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ......msec = 91363 GFLOPS = 23.505

44

7/26/2009

Slide45

More on CUDA Programming

Language extensionsFunction type qualifiersVariable type qualifiersExecution configurationBuilt-in variablesCommon runtime componentsBuilt-in vector typesDevice runtime componentsIntrinsic functionsSynchronization and memory fencing functionsAtomic functionsHost runtime components (runtime API only)Device managementMemory managementError handlingDebugging in the device emulation modeExercise

45

7/26/2009

Slide46

Function Type Qualifiers

46

host

host

__host__

float HostFunc()‏

host

device

__global__ void KernelFunc()‏

device

device

__device__ float DeviceFunc()‏

Only callable from the:

Executed on the:

__device__

and

__global__

functions do not support recursion, cannot declare static variables inside their body, cannot have a variable number of arguments

__device__

functions cannot have their address taken

__host__

and

__device__

qualifiers can be used together, in which case the function is compiled for both

__global__ and __host__ qualifiers cannot be used together__global__ function must have void return type, its execution configuration must be specified, and the call is asynchronous

7/26/2009

Slide47

Variable Type Qualifiers

47

MemoryScopeLifetime__device__ int GlobalVar;globalgridapplication__device__ __shared__ int SharedVar;sharedblockblock__device__ __constant__ int ConstantVar;constantgridApplicationvolatile int GlobarVar or SharedVar;

__shared__ and __constant__ variables have implied static storage__device__, __shared__ and __constant__ variables cannot be defined using external keyword__device__ and __constant__ variables are only allowed at file scope__constant__ variables cannot be assigned to from the devices, they are initialized from the host only__shared__ variables cannot have an initialization as part of their declaration

7/26/2009

Slide48

Execution Configuration

48

Function declared as __global__ void kernel(float* param);must be called like this:kernel<<<Dg, Db, Ns, S>>>(param);where Dg (type dim3) specifies the dimension and size of the grid, such that Dg.x*Dg.y equals the number of blocks being launched; Db (type dim3) spesifies the dimension abd size of each block of threads, such that Db.x*Db.y*Db.z equals the number of threads per block; optional Ns (type size_z) specifies the number of bytes of shared memory dynamically allocated per block for this call in addition to the statically allocated memory optional S (type cudaStream_t) specifies the stream associated with this kernel call

7/26/2009

Slide49

Built-in Variables

49

variabletypedescriptiongridDimdim3dimensions of the gridblockIDunit3block index within the gridblockDimdim3dimensions of the blockthreadIdxuint3thread index within the blockwarpSizeintwarp size in threads

It is not allowed to take addresses of any of the built-in variablesIt is not allowed to assign values to any of the built-in variables

7/26/2009

Slide50

Built-in Vector Types

50

Vector types derived from basic integer and float types char1, char2, char3, char4 uchar1, uchar2, uchar3, uchar4 short1, short2, short3, short4 ushort1, ushort2, ushort3, ushort4 int1, int2, int3, int4 uint1, uint2, uint3 (dim3), uint4 long1, long2, long3, long4 ulong1, ulong2, ulong3, ulong4 longlong1, longlong2 float1, float2, float3, float4 double1, double2

They are all structures, like this:typedef struct { float x,y,z,w;} float4;They all come with a constructor function in the form make_<type name>, e.g.,int2 make_int2(int x, int y);

7/26/2009

Slide51

Intrinsic Functions

51

Supported on the device onlyStart with __, as in __sinf(x)End with _rn (round-to-nearest-even rounding mode)_rz (round-towards-zero rounding mode)_ru (round-up rounding mode)_rd (round-down rounding mode)as in __fadd_rn(x,y);There are mathematical (__log10f(x)), type conversion (__int2float_rn(x)), type casting (__int_as_float(x)), and bit manipulation (__ffs(x)) functions

7/26/2009

Slide52

Synchronization and Memory Fencing Functions

52

functiondescriptionvoid __threadfence()wait until all global and shared memory accesses made by the calling thread become visible to all threads in the device for global memory accesses and all threads in the thread block for shared memory accessesvoid __threadfence_block()Waits until all global and shared memory accesses made by the calling thread become visible to all threads in the thread blockvoid __syncthreads()Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads become visible to all threads in the block

7/26/2009

Slide53

Atomic Functions

53

An atomic function performs read-modify-write atomic operation on one 32-bit or one 64-bit word residing in global or shared memory. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.

functionDescriptionatomicAdd()new = old + valatomicSub()new = old – valatomicExch()new = valatomicMin()new = min(old, val)atomicMax()new = max(old, val)atomicInc()new = ((old >= val) ? 0 : (old+1))atomicDec()new = (((old==0) | (old > val)) ? val : (old-1))atomicCAS()new = (old == compare ? val : old)Atomic{And, Or, Xor}()new = {(old & val), (old | val), (old^val)}

7/26/2009

Slide54

CUDA APIs

higher-level API called the CUDA runtime APImyKernelunsigned char*)devPtr, width, <<<Dg, Db>>>(( height, pitch);

low-level API called the CUDA driver APIcuModuleLoad( &module, binfile );cuModuleGetFunction( &func, module, "mmkernel" );…cuParamSetv( func, 0, &args, 48 );cuParamSetSize( func, 48 );cuFuncSetBlockShape( func, ts[0], ts[1], 1 );cuLaunchGrid( func, gs[0], gs[1] );

54

7/26/2009

Slide55

Device Management

55

functiondescriptioncudaGetDeviceCount()Returns the number of compute-capable devicescudaGetDeviceProperties()Returns information on the compute devicecudaSetDevice()Sets device to be used for GPU executioncudaGetDevice()Returns the device currently being usedcudaChooseDevice()Selects device that best matches given criteria

7/26/2009

Slide56

Device Management Example

56

void cudaDeviceInit() { int devCount, device; cudaGetDeviceCount(&devCount); if (devCount == 0) { printf("No CUDA capable devices detected.\n"); exit(EXIT_FAILURE); } for (device=0; device < devCount; device++) { cudaDeviceProp props; cudaGetDeviceProperties(&props, device); // If a device of compute capability >= 1.3 is found, use it if (props.major > 1 || (props.major == 1 && props.minor >= 3)) break; } if (device == devCount) { printf("No device above 1.2 compute capability detected.\n"); exit(EXIT_FAILURE); } else cudaSetDevice(device);}

7/26/2009

Slide57

Memory Management

57

functiondescriptioncudaMalloc()Allocates memory on the GPUcudaMallocPitch()Allocates memory on the GPU device for 2D arrays, may pad the allocated memory to ensure alignment requirementscudaFree()Frees the memory allocated on the GPUcudaMallocArray()Allocates an array on the GPUcudaFreeArray()Frees an array allocated on the GPUcudaMallocHost()Allocates page-locked memory on the hostcudaFreeHost()Frees page-locked memory in the host

7/26/2009

Slide58

More on Memory Alignment

a

1,1

a1,2a1,3a2,1a2,2a2,3a3,1a3,2a3,3

a1,1a2,1a3,1a1,2a2,2a3,2a1,3a2,3a3,3

cudaMalloc(&dev_a, m*n*sizeof(float));

Matrix columns are not aligned at 64-bit boundary

a1,1a2,1a3,1a1,2a2,2a3,2a1,3a2,3a3,3

cudaMallocPitch(&dev_a, &n, n*sizeof(float), m);

Matrix columns are aligned at 64-bit boundary

n is the allocated (aligned) size for the first dimension (the pitch), given the requested sizes of the two dimensions.

58

7/26/2009

Slide59

Memory Management Example

59

cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height); myKernel<<<100, 192>>>(devPtr, pitch); // device code __global__ void myKernel(float* devPtr, int pitch) { for (int r = 0; r < height; ++r) { float* row = (float*)((char*)devPtr + r * pitch); for (int c = 0; c < width; ++c) { float element = row[c]; } }}

7/26/2009

Slide60

Memory Management

60

functiondescriptioncudaMemset()Initializes or sets GPU memory to a valuecudaMemCpy()Copies data between host and the devicecudaMemcpyToArray()cudaMemcpyFromArray()cudaMemcpyArrayToArray()cudaMemcpyToSymbol()cudaMemcpyFromSymbol()cudaGetSymbolAddress()Finds the address associated with a CUDA symbolcudaGetSymbolSize()Finds the size of the object associated with a CUDA symbol

7/26/2009

Slide61

Error Handling

61

All CUDA runtime API functions return an error code. The runtime maintains an error variable for each host thread that is overwritten by the error code every time an error concurs.

functiondescriptioncudaGetLastError()Returns error variable and resets it to cudaSuccesscudaGetErrorString()Returns the message string from an error code

cudaError_t err = cudaGetLastError();if (cudaSuccess != err) { fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString( err) ); exit(EXIT_FAILURE);}

7/26/2009

Slide62

Exercise

Port Mandelbrot set fractal renderer to CUDA

Source is in ~/chips_tutorial/src3fractal.c – reference C implementationMakefile – make filefractal.cu.reference – CUDA implementation for reference

62

7/26/2009

Slide63

Reference C Implementation

7/26/2009

63

void makefractal_cpu(unsigned char *image, int width, int height, double xupper, double xlower, double yupper, double ylower){ int x, y; double xinc = (xupper - xlower) / width; double yinc = (yupper - ylower) / height; for (y = 0; y < height; y++) { for (x = 0; x < width; x++) { image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc)); } }}

Slide64

CUDA Kernel Implementation

7/26/2009

64

__global

__

void

makefractal_cpu

(unsigned char *image,

int

width,

int

height, double

xupper

, double

xlower

, double

yupper

, double

ylower

)

{

int

x =

blockIdx.x

;

int

y =

blockIdx.y

;

int

width =

gridDim.x

;

int

height =

gridDim.y

;

double

xupper

=-0.74624,

xlower

=-0.74758,

yupper

=0.10779,

ylower

=0.10671

;

double

xinc

= (

xupper

-

xlower

) / width;

double

yinc

= (

yupper

-

ylower

) / height;

image[y*

width+x

] =

iter

((

xlower

+ x*

xinc

), (

ylower

+ y*

yinc

));

}

Slide65

Reference C Implementation

7/26/2009

65

inline unsigned char iter(double a, double b){ unsigned char i = 0; double c_x = 0, c_y = 0; double c_x_tmp, c_y_tmp; double D = 4.0; while ((c_x*c_x+c_y*c_y < D) && (i++ < 255)) { c_x_tmp = c_x * c_x - c_y * c_y; c_y_tmp = 2* c_y * c_x; c_x = a + c_x_tmp; c_y = b + c_y_tmp; } return i;}

The Mandelbrot set is generated by iterating complex function

z

2

+ c

, where

c

is a constant:

z

1

= (z

0

)

2

+ c

z

2

= (z

1

)

2

+ c

z

3

= (z

2

)

2

+ c

and so forth. Sequence

z

0

, z

1

, z

2

,...

is called the

orbit

of

z

0

under iteration of

z

2

+ c

. We stop iteration when the orbit starts to diverge, or when a maximum number of iterations is done.

Slide66

CUDA Kernel Implementation

7/26/2009

66

inline

__device__

unsigned

char

iter

(double a, double b)

{

unsigned char

i

= 0;

double

c_x

= 0,

c_y

= 0;

double

c_x_tmp

,

c_y_tmp

;

double D = 4.0;

while ((

c_x

*

c_x+c_y

*

c_y

< D) && (

i

++ < 255))

{

c_x_tmp

=

c_x

*

c_x

-

c_y

*

c_y

;

c_y_tmp

= 2*

c_y

*

c_x

;

c_x

= a +

c_x_tmp

;

c_y

= b +

c_y_tmp

;

}

return

i

;

}

Slide67

Host Code

7/26/2009

67

int

width = 1024;

int

height = 768;

unsigned char *image = NULL;

unsigned char *

devImage

;

image

= (unsigned char*)

malloc

(width*height*

sizeof

(unsigned char));

cudaMalloc

((void**)&

devImage

, width*height*

sizeof

(unsigned char));

dim3

dimGrid

(width, height);

dim3

dimBlock

(1

);

makefractal_gpu

<<<

dimGrid

,

dimBlock

>>>(

devImage

);

cudaMemcpy

(image

,

devImage

, width*height*

sizeof

(unsigned char

),

cudaMemcpyDeviceToHost

);

free(image);

cudaFree

(

devImage

);

Slide68

Few Examples

xupper=-0.74624xlower=-0.74758yupper=0.10779ylower=0.10671CPU time: 2.27 secGPU time: 0.29 sec

xupper=-0.754534912109xlower=-.757077407837yupper=0.060144042969ylower=0.057710774740CPU time: 1.5 secGPU time: 0.25 sec

7/26/2009

68