
Slide1

Introduction to GPU Programming

Volodymyr (Vlad) Kindratenko

Innovative Systems Laboratory @ NCSA

Institute for Advanced Computing Applications and Technologies (IACAT)

July 26, 2009

Slide2

Tutorial Goals

- Become familiar with NVIDIA GPU architecture
- Become familiar with the NVIDIA GPU application development flow
- Be able to write and run a simple NVIDIA GPU application in CUDA

Slide3

Tutorial Outline

- Introduction (15 minutes)
  - Why use Graphics Processing Units (GPUs) for general-purpose computing
  - Modern GPU architecture
  - NVIDIA GPU programming: libraries, CUDA C, OpenCL, PGI x64+GPU
- Hands-on: getting started with NCSA GPU cluster (15 minutes)
  - Cluster architecture overview
  - How to login and check out a node
  - How to compile and run an existing application
- Hands-on: Anatomy of a GPU application (25 minutes)
  - Host side
  - Device side
  - CUDA programming model
- Hands-on: Porting matrix multiplier to GPU (25 minutes)
- More on CUDA programming (40 minutes)

Slide4

Introduction

- Why use Graphics Processing Units (GPUs) for general-purpose computing
- Modern GPU architecture
- NVIDIA GPU programming
  - Libraries
  - CUDA C
  - OpenCL
  - PGI x64+GPU

Slide5

Why GPUs? Raw Performance Trends

[Figure: raw performance trends across NVIDIA GPU generations, from the GeForce FX 5800 and 5950 Ultra through the 6800 Ultra and 7800 GTX; graph is courtesy of NVIDIA]

Slide6

Why GPUs? Memory Bandwidth Trends

[Figure: memory bandwidth trends across NVIDIA GPU generations, from the GeForce FX 5800 and 5950 Ultra through the 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, and GTX 285; graph is courtesy of NVIDIA]

Slide7

GPU vs. CPU Silicon Use

[Figure: comparison of how GPU and CPU silicon is allocated; graph is courtesy of NVIDIA]

Slide8

NVIDIA GPU Architecture

A scalable array of multithreaded Streaming Multiprocessors (SMs); each SM consists of:

- 8 Scalar Processor (SP) cores
- 2 special function units for transcendentals
- a multithreaded instruction unit
- on-chip shared memory

The GPU connects to off-chip GDDR3 SDRAM and to the host through a PCIe interface.

[Figure is courtesy of NVIDIA]

Slide9

NVIDIA GeForce 9400M G GPU

- 16 streaming processors arranged as 2 streaming multiprocessors
- At 0.8 GHz this provides 54 GFLOPS in single-precision (SP)
- 128-bit interface to off-chip GDDR3 memory, 21 GB/s bandwidth

[Figure: block diagram of one TPC containing a geometry controller, an SMC, and two SMs; each SM has an instruction cache, MT issue unit, constant cache, 8 SPs, 2 SFUs, and shared memory; texture units with a texture L1 cache connect through a 128-bit interconnect to two L2/ROP partitions and DRAM]

Slide10

NVIDIA Tesla C1060 GPU

- 240 streaming processors arranged as 30 streaming multiprocessors
- At 1.3 GHz this provides 1 TFLOPS SP and 86.4 GFLOPS DP
- 512-bit interface to off-chip GDDR3 memory, 102 GB/s bandwidth

[Figure: block diagram of 10 TPCs (TPC 1 through TPC 10), each with a geometry controller, an SMC, and three SMs; each SM has an instruction cache, MT issue unit, constant cache, 8 SPs, 2 SFUs, and shared memory; texture units with texture L1 caches connect through a 512-bit memory interconnect to L2/ROP partitions and eight DRAM channels]

Slide11

NVIDIA Tesla S1070 Computing Server

4 T10 GPUs in a 1U enclosure.

[Figure: four Tesla GPUs, each with 4 GB of GDDR3 SDRAM, paired behind two NVIDIA switches that connect to two PCI x16 links; the chassis also houses the power supply, thermal management, and system monitoring; diagram is courtesy of NVIDIA]

Slide12

GPU Use/Programming

- GPU libraries
  - NVIDIA's CUDA BLAS and FFT libraries
  - Many 3rd-party libraries
- Low-abstraction, lightweight GPU programming toolkits
  - CUDA C
  - OpenCL
- High-abstraction, compiler-based tools
  - PGI x64+GPU

Slide13

Getting Started with NCSA GPU Cluster

- Cluster architecture overview
- How to login and check out a node
- How to compile and run an existing application

Slide14

NCSA AC GPU Cluster


Slide15

GPU Cluster Architecture

- Servers: 32
- CPU cores: 128
- Accelerator units: 32
- GPUs: 128

[Figure: head node ac in front of compute nodes ac01 through ac32]

Slide16

GPU Cluster Node Architecture

HP xw9400 workstation:

- 2216 AMD Opteron, 2.4 GHz, dual socket, dual core
- 8 GB DDR2
- InfiniBand QDR

S1070 1U GPU Computing Server:

- 1.3 GHz Tesla T10 processors
- 4x4 GB GDDR3 SDRAM

[Figure: a compute node pairs the HP xw9400 workstation (with QDR IB) over two PCIe x16 links with a Tesla S1070; inside the S1070, each PCIe interface serves two T10 processors with their own DRAM]

Slide17

Accessing the GPU Cluster

Use a Secure Shell (SSH) client to access AC:

`ssh USER@ac.ncsa.uiuc.edu` (User: gpu001 - gpu200; Password: CHiPS-09)

You will see something like this printed out:

See machine details and a technical report at:
http://www.ncsa.uiuc.edu/Projects/GPUcluster/

Machine Description and HOW TO USE. See: /usr/local/share/ac.readme
CUDA wrapper readme: /usr/local/share/cuda_wrapper.readme

*IMPORTANT* If you are using multiple GPU devices per host, be sure to understand how the cuda_wrapper changes this system!!
…
July 03, 2009
Nvidia compute exclusive mode made default. If this breaks your application, "touch ~/FORCE_NORMAL" to create an override for all your jobs.

Questions? Contact Jeremy Enos jenos@ncsa.uiuc.edu

[gpuXYZ@ac ~]$ _

Slide18

Installing Tutorial Examples

Run this sequence to retrieve and install the tutorial examples:

cd
cp /tmp/chips_tutorial.tgz .
tar -xvzf chips_tutorial.tgz
cd chips_tutorial
ls src1 src2 src3

Slide19

Accessing the GPU Cluster

[Figure: laptops 1 through 30 connect to the head node ac; compute nodes ac01 through ac32 sit behind it. You are here: the head node]

Slide20

Requesting a Cluster Node for Interactive Use

Run `qstat` to see what other users do, just for the fun of it.

Run `qsub -I -l walltime=02:00:00` to request a node with a single GPU for 2 hours of interactive use. You will see something like this printed out:

qsub: waiting for job 64424.acm to start
qsub: job 64424.acm ready

[gpuXYZ@acAB ~]$ _

Slide21

Requesting a Cluster Node

[Figure: the same cluster layout; after qsub, you are here: one of the compute nodes ac01 through ac32]

Slide22

Checking GPU Characteristics

Run `deviceQuery`:

CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "Tesla C1060"
  CUDA Capability Major revision number: 1
  CUDA Capability Minor revision number: 3
  Total amount of global memory: 4294705152 bytes
  Number of multiprocessors: 30
  Number of cores: 240
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  …
  Clock rate: 1.30 GHz
  Compute mode: Exclusive (only one host thread at a time can use this device)

Slide23

Compiling and Running an Existing Application

cd chips_tutorial/src1

- vecadd.c - reference C implementation
- vecadd.cu - CUDA implementation

Compile & run the CPU version:

gcc vecadd.c -o vecadd_cpu
./vecadd_cpu
Running CPU vecAdd for 16384 elements
C[0]=2147483648.00 ...

Compile & run the GPU version:

nvcc vecadd.cu -o vecadd_gpu
./vecadd_gpu
Running GPU vecAdd for 16384 elements
C[0]=2147483648.00 ...

Slide24

nvcc

Any source file containing CUDA C language extensions must be compiled with nvcc.

nvcc is a compiler driver that invokes many other tools to accomplish the job.

Basic nvcc usage:

nvcc <filename>.cu [-o <executable>]
  Builds in release mode.

nvcc -deviceemu <filename>.cu
  Builds in device emulation mode (all code runs on the CPU).

The -g flag builds in debug mode for the gdb debugger.

Slide25

Anatomy of a GPU Application

- Host side
- Device side
- CUDA programming model

Slide26

CPU-Only Version

// computational kernel
void vecAdd(int N, float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main(int argc, char **argv)
{
    int N = 16384; // default vector size

    // memory allocation
    float *A = (float*)malloc(N * sizeof(float));
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    // kernel invocation
    vecAdd(N, A, B, C); // call compute kernel

    // memory de-allocation
    free(A); free(B); free(C);
}

Slide27

Adding GPU support

int main(int argc, char **argv)
{
    int N = 16384; // default vector size

    float *A = (float*)malloc(N * sizeof(float));
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    // memory allocation on the GPU card
    float *devPtrA, *devPtrB, *devPtrC;
    cudaMalloc((void**)&devPtrA, N * sizeof(float));
    cudaMalloc((void**)&devPtrB, N * sizeof(float));
    cudaMalloc((void**)&devPtrC, N * sizeof(float));

    // copy data from the CPU (host) memory to the GPU (device) memory
    cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);

Slide28

Adding GPU support (continued)

    // kernel invocation
    vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

    // copy results from the device memory to the host memory
    cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

    // device memory de-allocation
    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);

    free(A); free(B); free(C);
}

Slide29

GPU Kernel

CPU version:

void vecAdd(int N, float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

GPU version:

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}
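Note that the <<<N/512, 512>>> launch only covers every element because N = 16384 is an exact multiple of 512. For arbitrary N, a common pattern is to round the block count up and guard the out-of-range threads; a minimal sketch (vecAddSafe is a hypothetical name, not part of the tutorial code):

__global__ void vecAddSafe(int N, float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)                     // the last block may extend past N
        C[i] = A[i] + B[i];
}

// round the block count up so every element is covered
int threads = 512;
int blocks = (N + threads - 1) / threads;
vecAddSafe<<<blocks, threads>>>(N, devPtrA, devPtrB, devPtrC);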

Slide30

CUDA Programming Model

A CUDA kernel is executed by an array of threads:

- All threads run the same code (SPMD)
- Each thread has an ID that it uses to compute memory addresses and make control decisions

float x = input[threadID];
float y = func(x);
output[threadID] = y;

Threads are arranged as a grid of thread blocks; threads within a block have access to a segment of shared memory.

[Figure: a grid of thread blocks 0 through N-1, each block with its own shared memory]

Slide31

Kernel Invocation Syntax

vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC);

The <<<32, 512>>> execution configuration gives the grid and thread block dimensionality: 32 thread blocks per grid, 512 threads per block.

Inside the kernel, each thread computes its global index as

int i = blockIdx.x * blockDim.x + threadIdx.x;

where blockIdx.x is the block ID within the grid, blockDim.x is the number of threads per block, and threadIdx.x is the thread ID within the thread block. For example, thread 5 of block 3 computes i = 3*512 + 5 = 1541.

[Figure: a grid of thread blocks 0 through N-1, each block with its own shared memory]

Slide32

Mapping Threads to the Hardware

- Blocks of threads are transparently assigned to SMs
- A block of threads executes on one SM and does not migrate
- Several blocks can reside concurrently on one SM
- Each block can execute in any order relative to other blocks

[Figure: the same 8-block kernel grid (blocks 0 through 7) scheduled over time onto two different devices, with the blocks distributed among however many SMs each device has]

Slide is courtesy of NVIDIA

Slide33

CUDA Programming Model

- A kernel is executed as a grid of thread blocks
- A grid of blocks can be 1- or 2-dimensional
- Thread blocks can be 1-, 2-, or 3-dimensional
- Different kernels can have different grid/block configurations
- Threads from the same block have access to a shared memory, and their execution can be synchronized

[Figure: the host launches Kernel 1 on Grid 1, a 3x2 grid of blocks, and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its threads indexed (x, y, z); slide is courtesy of NVIDIA]

Slide34

GPU Memory Hierarchy

Memory   | Location | Cached | Access | Scope                  | Lifetime
---------|----------|--------|--------|------------------------|------------
Register | On-chip  | N/A    | R/W    | One thread             | Thread
Local    | Off-chip | No     | R/W    | One thread             | Thread
Shared   | On-chip  | N/A    | R/W    | All threads in a block | Block
Global   | Off-chip | No     | R/W    | All threads + host     | Application
Constant | Off-chip | Yes    | R      | All threads + host     | Application
Texture  | Off-chip | Yes    | R      | All threads + host     | Application

[Figure: the host (CPU, chipset, DRAM) connects to the device; device DRAM holds local, global, constant, and texture memory; each GPU multiprocessor has registers, shared memory, and constant and texture caches]
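To see where these memory spaces show up in code, here is a minimal sketch (scaleAndCopy and scale are hypothetical names, not from the tutorial; assumes the kernel is launched with 256-thread blocks):

__constant__ float scale;                    // constant memory: written by the host, cached, read-only on the device

__global__ void scaleAndCopy(float *gdata)   // gdata points into global memory
{
    __shared__ float tile[256];              // shared memory: one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float t = gdata[i] * scale;              // t lives in a register (one per thread)
    tile[threadIdx.x] = t;
    __syncthreads();                         // make the block's writes to tile[] visible
    gdata[i] = tile[threadIdx.x];
}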

Slide35

Porting matrix multiplier to GPU

cd ../chips_tutorial/src2

Compile & run the CPU version:

icc -O3 mmult.c -o mmult
./mmult
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
...
msec = 2215478 GFLOPS = 0.969

Slide36

int main(int argc, char* argv[])
{
    int N = 1024;
    struct timeval t1, t2;
    long msec1, msec2;
    float flop, mflop, gflop;

    float *a = (float *)malloc(N*N*sizeof(float));
    float *b = (float *)malloc(N*N*sizeof(float));
    float *c = (float *)malloc(N*N*sizeof(float));

    minit(a, b, c, N);

    gettimeofday(&t1, NULL);
    mmult(a, b, c, N); // a = b * c
    gettimeofday(&t2, NULL);

    mprint(a, N, 5);
    free(a); free(b); free(c);

    msec1 = t1.tv_sec * 1000000 + t1.tv_usec;
    msec2 = t2.tv_sec * 1000000 + t2.tv_usec;
    msec2 -= msec1;
    flop = N*N*N*2.0f;
    mflop = flop / msec2;
    gflop = mflop / 1000.0f;
    printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);
}

// a = b * c
void mmult(float *a, float *b, float *c, int N)
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                a[i+j*N] += b[i+k*N]*c[k+j*N];
}

void minit(float *a, float *b, float *c, int N)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) {
            a[i+N*j] = 0.0f;
            b[i+N*j] = 1.0f;
            c[i+N*j] = 1.0f;
        }
}

void mprint(float *a, int N, int M)
{
    for (int j = 0; j < M; j++) {
        for (int i = 0; i < M; i++)
            printf("%.2f ", a[i+N*j]);
        printf("...\n");
    }
    printf("...\n");
}

Slide37

Matrix Representation in Memory

Matrices are stored in column-major order:

for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j)
        for (k = 0; k < n; ++k)
            a[i+n*j] += b[i+n*k] * c[k+n*j];

For reference, the jki-ordered version runs at 1.7 GFLOPS on a 3 GHz Intel Xeon (single core).

Slide38

Map this code:

for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j)
        for (k = 0; k < n; ++k)
            a[i+n*j] += b[i+n*k] * c[k+n*j];

into this (logical) architecture: a grid of thread blocks in which each thread's global index is

blockIdx.x * blockDim.x + threadIdx.x

[Figure: 18 threads numbered 0-17 arranged as three blocks (blockIdx.x = 0, 1, 2) of six threads each (threadIdx.x = 0-5, blockDim.x = 6); matrices B, C, and A = B*C are shown element by element, e.g. a1,2 = b1,1*c1,2 + b1,2*c2,2]

Slide39

A 32x1024 grid of thread blocks, each a block of 32x1x1 threads:

dim3 grid(1024/32, 1024);
dim3 threads(32);

Each thread, identified by (blockIdx.x, blockIdx.y) and (threadIdx.x), computes one element of a:

{
    int i = blockIdx.x*32 + threadIdx.x;
    int j = blockIdx.y;

    float sum = 0.0f;
    for (int k = 0; k < n; k++)
        sum += b[i+n*k] * c[k+n*j];
    a[i+n*j] = sum;
}

This gives 32 threads per block (i), 32 thread blocks (i), and 1024 thread blocks (j).

Slide40

Kernel

Original CPU kernel:

void mmult(float *a, float *b, float *c, int N)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                a[i+j*N] += b[i+k*N]*c[k+j*N];
}

GPU kernel:

__global__ void mmult(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y;

    float sum = 0.0;
    for (int k = 0; k < N; k++)
        sum += b[i+N*k] * c[k+N*j];
    a[i+N*j] = sum;
}

Invocation:

dim3 dimGrid(32, 1024);
dim3 dimBlock(32);
mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);

Slide41

int main(int argc, char* argv[])
{
    int N = 1024;
    struct timeval t1, t2;
    long msec1, msec2;
    float flop, mflop, gflop;

    float *a = (float *)malloc(N*N*sizeof(float));
    float *b = (float *)malloc(N*N*sizeof(float));
    float *c = (float *)malloc(N*N*sizeof(float));

    minit(a, b, c, N);

    // allocate device memory
    float *devPtrA, *devPtrB, *devPtrC;
    cudaMalloc((void**)&devPtrA, N*N*sizeof(float));
    cudaMalloc((void**)&devPtrB, N*N*sizeof(float));
    cudaMalloc((void**)&devPtrC, N*N*sizeof(float));

    // copy input arrays to the device memory
    cudaMemcpy(devPtrB, b, N*N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrC, c, N*N*sizeof(float), cudaMemcpyHostToDevice);

Slide42

    gettimeofday(&t1, NULL);
    msec1 = t1.tv_sec * 1000000 + t1.tv_usec;

    // define grid and thread block sizes
    dim3 dimGrid(32, 1024);
    dim3 dimBlock(32);

    // launch GPU kernel
    mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);

    // check for errors
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // wait until GPU kernel is done
    cudaThreadSynchronize();

    gettimeofday(&t2, NULL);
    msec2 = t2.tv_sec * 1000000 + t2.tv_usec;

Slide43

    // copy results to host
    cudaMemcpy(a, devPtrA, N*N*sizeof(float), cudaMemcpyDeviceToHost);

    mprint(a, N, 5);

    // free device memory
    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);

    free(a); free(b); free(c);

    msec2 -= msec1;
    flop = N*N*N*2.0f;
    mflop = flop / msec2;
    gflop = mflop / 1000.0f;
    printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);
}

Slide44

Porting matrix multiplier to GPU

Compile & run the GPU version:

nvcc mmult.cu -o mmult
./mmult
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
...
msec = 91363 GFLOPS = 23.505

This is about a 24x speedup over the 0.969 GFLOPS CPU version (91363 µs vs. 2215478 µs).

Slide45

More on CUDA Programming

- Language extensions
  - Function type qualifiers
  - Variable type qualifiers
  - Execution configuration
  - Built-in variables
- Common runtime components
  - Built-in vector types
- Device runtime components
  - Intrinsic functions
  - Synchronization and memory fencing functions
  - Atomic functions
- Host runtime components (runtime API only)
  - Device management
  - Memory management
  - Error handling
  - Debugging in the device emulation mode
- Exercise

Slide46

Function Type Qualifiers

Declaration                   | Executed on the | Only callable from the
------------------------------|-----------------|-----------------------
__device__ float DeviceFunc() | device          | device
__global__ void KernelFunc()  | device          | host
__host__ float HostFunc()     | host            | host

- __device__ and __global__ functions do not support recursion, cannot declare static variables inside their body, and cannot have a variable number of arguments
- __device__ functions cannot have their address taken
- __host__ and __device__ qualifiers can be used together, in which case the function is compiled for both
- __global__ and __host__ qualifiers cannot be used together
- a __global__ function must have a void return type, its execution configuration must be specified, and the call is asynchronous
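For illustration, a small sketch combining the qualifiers (square and squareAll are hypothetical names, not from the tutorial):

// compiled for both the CPU and the GPU
__host__ __device__ float square(float x) { return x * x; }

// void return type; called with an execution configuration; the call is asynchronous
__global__ void squareAll(float *data, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) data[i] = square(data[i]);
}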

Slide47

Variable Type Qualifiers

Declaration                              | Memory   | Scope | Lifetime
-----------------------------------------|----------|-------|------------
__device__ int GlobalVar;                | global   | grid  | application
__device__ __shared__ int SharedVar;     | shared   | block | block
__device__ __constant__ int ConstantVar; | constant | grid  | application
volatile int GlobalVar or SharedVar;     |          |       |

- __shared__ and __constant__ variables have implied static storage
- __device__, __shared__ and __constant__ variables cannot be defined using the extern keyword
- __device__ and __constant__ variables are only allowed at file scope
- __constant__ variables cannot be assigned to from the device; they are initialized from the host only
- __shared__ variables cannot have an initialization as part of their declaration

Slide48

Execution Configuration

A function declared as

__global__ void kernel(float* param);

must be called like this:

kernel<<<Dg, Db, Ns, S>>>(param);

where:

- Dg (type dim3) specifies the dimension and size of the grid, such that Dg.x*Dg.y equals the number of blocks being launched
- Db (type dim3) specifies the dimension and size of each block of threads, such that Db.x*Db.y*Db.z equals the number of threads per block
- optional Ns (type size_t) specifies the number of bytes of shared memory dynamically allocated per block for this call, in addition to the statically allocated memory
- optional S (type cudaStream_t) specifies the stream associated with this kernel call
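A minimal sketch of a launch using the optional Ns argument (blockSum, devIn, and devOut are hypothetical names; each block receives blockDim.x floats of dynamically allocated shared memory):

__global__ void blockSum(float *in, float *out)
{
    extern __shared__ float buf[];   // sized at launch time by the Ns argument
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();
    if (t == 0) {
        float s = 0.0f;
        for (int k = 0; k < blockDim.x; k++) s += buf[k];
        out[blockIdx.x] = s;
    }
}

dim3 Dg(64);    // 64 blocks
dim3 Db(256);   // 256 threads per block
blockSum<<<Dg, Db, Db.x * sizeof(float)>>>(devIn, devOut);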

Slide49

Built-in Variables

variable  | type  | description
----------|-------|-------------------------------
gridDim   | dim3  | dimensions of the grid
blockIdx  | uint3 | block index within the grid
blockDim  | dim3  | dimensions of the block
threadIdx | uint3 | thread index within the block
warpSize  | int   | warp size in threads

- It is not allowed to take the address of any of the built-in variables
- It is not allowed to assign values to any of the built-in variables
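For example, a kernel launched on a 2D grid of 2D blocks can combine these variables into a linear index (an illustrative snippet, not from the slides):

int x = blockIdx.x * blockDim.x + threadIdx.x;   // global x index
int y = blockIdx.y * blockDim.y + threadIdx.y;   // global y index
int idx = y * (gridDim.x * blockDim.x) + x;      // offset into a row-major 2D array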

Slide50

Built-in Vector Types

Vector types derived from the basic integer and float types:

char1, char2, char3, char4
uchar1, uchar2, uchar3, uchar4
short1, short2, short3, short4
ushort1, ushort2, ushort3, ushort4
int1, int2, int3, int4
uint1, uint2, uint3 (dim3), uint4
long1, long2, long3, long4
ulong1, ulong2, ulong3, ulong4
longlong1, longlong2
float1, float2, float3, float4
double1, double2

They are all structures, like this:

typedef struct {
    float x, y, z, w;
} float4;

They all come with a constructor function of the form make_<type name>, e.g.,

int2 make_int2(int x, int y);
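For example (an illustrative snippet):

float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
float norm2 = v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w;   // fields are accessed as .x, .y, .z, .w
int2 p = make_int2(10, 20);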

Slide51

Intrinsic Functions

Supported on the device only.

They start with __, as in __sinf(x), and may end with

- _rn (round-to-nearest-even rounding mode)
- _rz (round-towards-zero rounding mode)
- _ru (round-up rounding mode)
- _rd (round-down rounding mode)

as in __fadd_rn(x, y).

There are mathematical (__log10f(x)), type conversion (__int2float_rn(x)), type casting (__int_as_float(x)), and bit manipulation (__ffs(x)) functions.
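For example (an illustrative device-code snippet, assuming x is a float):

float a = sinf(x);          // standard device function: more accurate
float b = __sinf(x);        // intrinsic: faster, reduced precision
float s = __fadd_rn(a, b);  // addition with explicit round-to-nearest-even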

Slide52

Synchronization and Memory Fencing Functions

function                   | description
---------------------------|------------
void __threadfence()       | waits until all global and shared memory accesses made by the calling thread become visible: to all threads in the device for global memory accesses, and to all threads in the thread block for shared memory accesses
void __threadfence_block() | waits until all global and shared memory accesses made by the calling thread become visible to all threads in the thread block
void __syncthreads()       | waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads become visible to all threads in the block
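A typical __syncthreads() use, as a sketch (reverseBlock is a hypothetical kernel, assuming 256-thread blocks): every thread writes one shared-memory element, the barrier guarantees all writes are visible, and then every thread reads an element written by a different thread.

__global__ void reverseBlock(float *d)
{
    __shared__ float s[256];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    s[t] = d[base + t];
    __syncthreads();    // all writes to s[] complete before any thread reads
    d[base + t] = s[blockDim.x - 1 - t];
}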

Slide53

Atomic Functions

An atomic function performs a read-modify-write atomic operation on one 32-bit or one 64-bit word residing in global or shared memory. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.

function               | description
-----------------------|------------
atomicAdd()            | new = old + val
atomicSub()            | new = old - val
atomicExch()           | new = val
atomicMin()            | new = min(old, val)
atomicMax()            | new = max(old, val)
atomicInc()            | new = ((old >= val) ? 0 : (old+1))
atomicDec()            | new = (((old == 0) | (old > val)) ? val : (old-1))
atomicCAS()            | new = (old == compare ? val : old)
atomic{And, Or, Xor}() | new = {(old & val), (old | val), (old ^ val)}
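A common use is a histogram, where many threads may increment the same bin concurrently; a minimal sketch (hypothetical kernel, assuming bins[] was zeroed beforehand):

__global__ void histogram(const unsigned char *in, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);   // safe concurrent increment, no interference
}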

Slide54

CUDA APIs

A higher-level API called the CUDA runtime API:

myKernel<<<Dg, Db>>>((unsigned char*)devPtr, width, height, pitch);

A low-level API called the CUDA driver API:

cuModuleLoad(&module, binfile);
cuModuleGetFunction(&func, module, "mmkernel");
…
cuParamSetv(func, 0, &args, 48);
cuParamSetSize(func, 48);
cuFuncSetBlockShape(func, ts[0], ts[1], 1);
cuLaunchGrid(func, gs[0], gs[1]);

Slide55

Device Management

function                  | description
--------------------------|------------
cudaGetDeviceCount()      | returns the number of compute-capable devices
cudaGetDeviceProperties() | returns information on the compute device
cudaSetDevice()           | sets the device to be used for GPU execution
cudaGetDevice()           | returns the device currently being used
cudaChooseDevice()        | selects the device that best matches the given criteria

Slide56

Device Management Example

void cudaDeviceInit()
{
    int devCount, device;

    cudaGetDeviceCount(&devCount);
    if (devCount == 0) {
        printf("No CUDA capable devices detected.\n");
        exit(EXIT_FAILURE);
    }

    for (device = 0; device < devCount; device++) {
        cudaDeviceProp props;
        cudaGetDeviceProperties(&props, device);
        // if a device of compute capability >= 1.3 is found, use it
        if (props.major > 1 || (props.major == 1 && props.minor >= 3))
            break;
    }

    if (device == devCount) {
        printf("No device above 1.2 compute capability detected.\n");
        exit(EXIT_FAILURE);
    }
    else
        cudaSetDevice(device);
}

Slide57

Memory Management

function          | description
------------------|------------
cudaMalloc()      | allocates memory on the GPU
cudaMallocPitch() | allocates memory on the GPU device for 2D arrays; may pad the allocated memory to satisfy alignment requirements
cudaFree()        | frees the memory allocated on the GPU
cudaMallocArray() | allocates an array on the GPU
cudaFreeArray()   | frees an array allocated on the GPU
cudaMallocHost()  | allocates page-locked memory on the host
cudaFreeHost()    | frees page-locked memory on the host
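For example, transfers can be staged through page-locked memory, which generally speeds up host-device copies (a sketch; h_buf and devPtr are hypothetical names):

float *h_buf;
cudaMallocHost((void**)&h_buf, N * sizeof(float));  // page-locked host memory
// ... fill h_buf ...
cudaMemcpy(devPtr, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
cudaFreeHost(h_buf);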

Slide58

More on Memory Alignment

cudaMalloc(&dev_a, m*n*sizeof(float));

Matrix columns are not aligned at a 64-bit boundary.

cudaMallocPitch(&dev_a, &n, n*sizeof(float), m);

Matrix columns are aligned at a 64-bit boundary. Here n is the allocated (aligned) size for the first dimension (the pitch), given the requested sizes of the two dimensions.

[Figure: a 3x3 column-major matrix a1,1 through a3,3; with cudaMalloc the nine elements are packed contiguously, so columns may straddle alignment boundaries; with cudaMallocPitch each column is padded out to the pitch so that every column starts on an aligned boundary]

Slide59

Memory Management Example

// host code
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 192>>>(devPtr, pitch);

// device code
__global__ void myKernel(float* devPtr, int pitch)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

Slide60

Memory Management

function                 | description
-------------------------|------------
cudaMemset()             | initializes or sets GPU memory to a value
cudaMemcpy()             | copies data between the host and the device
cudaMemcpyToArray()      | copies data to a CUDA array
cudaMemcpyFromArray()    | copies data from a CUDA array
cudaMemcpyArrayToArray() | copies data between CUDA arrays
cudaMemcpyToSymbol()     | copies data to a __constant__ or __device__ variable
cudaMemcpyFromSymbol()   | copies data from a __constant__ or __device__ variable
cudaGetSymbolAddress()   | finds the address associated with a CUDA symbol
cudaGetSymbolSize()      | finds the size of the object associated with a CUDA symbol
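For example, a __constant__ array can be filled from the host with cudaMemcpyToSymbol (an illustrative sketch; coeff and h_coeff are hypothetical names):

__constant__ float coeff[16];   // device constant memory

float h_coeff[16];
// ... fill h_coeff on the host ...
cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));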

Slide61

Error Handling

All CUDA runtime API functions return an error code. The runtime maintains an error variable for each host thread that is overwritten by the error code every time an error occurs.

function             | description
---------------------|------------
cudaGetLastError()   | returns the error variable and resets it to cudaSuccess
cudaGetErrorString() | returns the message string for an error code

cudaError_t err = cudaGetLastError();
if (cudaSuccess != err) {
    fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
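Since every runtime call returns an error code, a common convenience is to wrap calls in a checking macro; a sketch (CUDA_CHECK is a hypothetical helper, not part of the tutorial code):

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t e = (call);                                           \
        if (cudaSuccess != e) {                                           \
            fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString(e));  \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

CUDA_CHECK(cudaMalloc((void**)&devPtrA, N * sizeof(float)));
CUDA_CHECK(cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice));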

Slide62

Exercise

Port the Mandelbrot set fractal renderer to CUDA.

Source is in ~/chips_tutorial/src3:

- fractal.c - reference C implementation
- Makefile - make file
- fractal.cu.reference - CUDA implementation for reference

Slide63

Reference C Implementation

void makefractal_cpu(unsigned char *image, int width, int height,
                     double xupper, double xlower, double yupper, double ylower)
{
    int x, y;
    double xinc = (xupper - xlower) / width;
    double yinc = (yupper - ylower) / height;

    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
            image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
        }
    }
}

Slide64

CUDA Kernel Implementation

__global__ void makefractal_gpu(unsigned char *image)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int width = gridDim.x;
    int height = gridDim.y;
    double xupper = -0.74624, xlower = -0.74758;
    double yupper = 0.10779, ylower = 0.10671;

    double xinc = (xupper - xlower) / width;
    double yinc = (yupper - ylower) / height;

    image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
}

Slide65

Reference C Implementation

inline unsigned char iter(double a, double b)
{
    unsigned char i = 0;
    double c_x = 0, c_y = 0;
    double c_x_tmp, c_y_tmp;
    double D = 4.0;

    while ((c_x*c_x + c_y*c_y < D) && (i++ < 255)) {
        c_x_tmp = c_x * c_x - c_y * c_y;
        c_y_tmp = 2 * c_y * c_x;
        c_x = a + c_x_tmp;
        c_y = b + c_y_tmp;
    }

    return i;
}

The Mandelbrot set is generated by iterating the complex function z^2 + c, where c is a constant:

z_1 = (z_0)^2 + c
z_2 = (z_1)^2 + c
z_3 = (z_2)^2 + c

and so forth. The sequence z_0, z_1, z_2, ... is called the orbit of z_0 under iteration of z^2 + c. We stop iterating when the orbit starts to diverge, or when a maximum number of iterations is reached.

Slide66

CUDA Kernel Implementation

inline __device__ unsigned char iter(double a, double b)
{
    unsigned char i = 0;
    double c_x = 0, c_y = 0;
    double c_x_tmp, c_y_tmp;
    double D = 4.0;

    while ((c_x*c_x + c_y*c_y < D) && (i++ < 255)) {
        c_x_tmp = c_x * c_x - c_y * c_y;
        c_y_tmp = 2 * c_y * c_x;
        c_x = a + c_x_tmp;
        c_y = b + c_y_tmp;
    }

    return i;
}

Slide67

Host Code

int width = 1024;
int height = 768;
unsigned char *image = NULL;
unsigned char *devImage;

image = (unsigned char*)malloc(width*height*sizeof(unsigned char));
cudaMalloc((void**)&devImage, width*height*sizeof(unsigned char));

dim3 dimGrid(width, height);
dim3 dimBlock(1);
makefractal_gpu<<<dimGrid, dimBlock>>>(devImage);

cudaMemcpy(image, devImage, width*height*sizeof(unsigned char), cudaMemcpyDeviceToHost);

free(image);
cudaFree(devImage);

Slide68

Few Examples

xupper = -0.74624, xlower = -0.74758
yupper = 0.10779, ylower = 0.10671
CPU time: 2.27 sec
GPU time: 0.29 sec

xupper = -0.754534912109, xlower = -0.757077407837
yupper = 0.060144042969, ylower = 0.057710774740
CPU time: 1.5 sec
GPU time: 0.25 sec

About DocSlides
DocSlides allows users to easily upload and share presentations, PDF documents, and images.Share your documents with the world , watch,share and upload any time you want. How can you benefit from using DocSlides? DocSlides consists documents from individuals and organizations on topics ranging from technology and business to travel, health, and education. Find and search for what interests you, and learn from people and more. You can also download DocSlides to read or reference later.