Volodymyr Vlad Kindratenko Innovative Systems Laboratory NCSA Institute for Advanced Computing Applications and Technologies IACAT Tutorial Goals Become familiar with NVIDIA GPU architecture ID: 759151
Download Presentation The PPT/PDF document "Introduction to GPU Programming" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
Slide1
Introduction to GPU Programming
Volodymyr (
Vlad
) Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)
Slide2Tutorial Goals
Become familiar with NVIDIA GPU architectureBecome familiar with the NVIDIA GPU application development flowBe able to write and run a simple NVIDIA GPU application in CUDA
2
7/26/2009
Slide3Tutorial Outline
Introduction (15 minutes)Why use Graphics Processing Units (GPUs) for general-purpose computingModern GPU architectureNVIDIAGPU programmingLibraries, CUDA C, OpenCL, PGI x64+GPUHands-on: getting started with NCSA GPU cluster (15 minutes)Cluster architecture overviewHow to login and check out a nodeHow to compile and run an existing applicationHands-on: Anatomy of a GPU application (25 minutes)Host sideDevice sideCUDA programming modelHands-on: Porting matrix multiplier to GPU (25 minutes)More on CUDA programming (40 minutes)
3
7/26/2009
Slide4Introduction
Why use Graphics Processing Units (GPUs) for general-purpose computingModern GPU architectureNVIDIAGPU programmingLibrariesCUDA COpenCLPGI x64+GPU
4
7/26/2009
Slide5Why GPUs?Raw Performance Trends
5
5800
5950 Ultra
6800 Ultra
7800 GTX
Graph is courtesy of NVIDIA
7/26/2009
Slide65800
5950 Ultra
6800 Ultra
7800 GTX
7900 GTX
8800 GTX
8800 Ultra
GTX 285
GTX 280
Why GPUs?
Memory Bandwidth Trends
6
Graph is courtesy of NVIDIA
7/26/2009
Slide7GPU vs. CPU Silicon Use
7
7/26/2009
Graph is courtesy of NVIDIA
Slide8NVIDIA GPU Architecture
A scalable array of multithreaded Streaming Multiprocessors (SMs), each SM consists of8 Scalar Processor (SP) cores2 special function units for transcendentalsA multithreaded instruction unitOn-chip shared memoryGDDR3 SDRAMPCIe interface
8
Figure
is courtesy of NVIDIA
7/26/2009
Slide9NVIDIA GeForce9400M G GPU
16 streaming processors arranged as 2 streaming multiprocessors At 0.8 GHz this provides54 GFLOPS in single-precision (SP)128-bit interface to off-chip GDDR3 memory21 GB/s bandwidth
TPC
Geometry controller
SMC
SM
Shared memory
SFU
SFU
SP
SP
SP
SP
SP
SP
SP
SP
C cache
MT
issue
I cache
SM
Shared memory
SFU
SFU
SP
SP
SP
SP
SP
SP
SP
SP
C cache
MT
issue
I cache
Texture units
Texture L1
L2
ROP
L2
ROP
128-bit interconnect
DRAM
DRAM
9
7/26/2009
Slide10NVIDIA Tesla C1060 GPU
240 streaming processors arranged as 30 streaming multiprocessors At 1.3 GHz this provides1 TFLOPS SP86.4 GFLOPS DP512-bit interface to off-chip GDDR3 memory102 GB/s bandwidth
TPC 1
Geometry controller
SMC
SM
Shared memory
SFU
SFU
SP
SP
SP
SP
SP
SP
SP
SP
C cache
MT
issue
I cache
SM
Shared memory
SFU
SFU
SP
SP
SP
SP
SP
SP
SP
SP
C cache
MT
issue
I cache
SM
Shared memory
SFU
SFU
SP
SP
SP
SP
SP
SP
SP
SP
C cache
MT issue
I cache
Texture units
Texture L1
TPC 10
Geometry controller
SMC
SM
Shared memory
SFU
SFU
SP
SP
SP
SP
SP
SP
SP
SP
C cache
MT issue
I cache
SM
Shared memory
SFU
SFU
SP
SP
SP
SP
SP
SP
SP
SP
C cache
MT issue
I cache
SM
Shared memory
SFU
SFU
SP
SP
SP
SP
SP
SP
SP
SP
C cache
MT issue
I cache
Texture units
Texture L1
L2
ROP
L2
ROP
512-bit memory interconnect
DRAM
DRAM
DRAM
DRAM
DRAM
DRAM
DRAM
DRAM
10
7/26/2009
Slide11NVIDIA Tesla S1070 Computing Server
4 T10 GPUs
Tesla GPU
Tesla GPU
Tesla GPU
Tesla GPU
4 GB GDDR3 SDRAM
4 GB GDDR3 SDRAM
4 GB GDDR3 SDRAM
4 GB GDDR3 SDRAM
NVIDIA SWITCH
NVIDIA SWITCH
Power supply
Thermal management
System monitoring
PCI x16
PCI x16
11
7/26/2009
Graph is courtesy of NVIDIA
Slide12GPU Use/Programming
GPU librariesNVIDIA’s CUDA BLAS and FFT librariesMany 3rd party librariesLow abstraction lightweight GPU programming toolkitsCUDA COpenCLHigh abstraction compiler-based toolsPGI x64+GPU
12
7/26/2009
Slide13Getting Started with NCSA GPU Cluster
Cluster architecture overviewHow to login and check out a nodeHow to compile and run an existing application
13
7/26/2009
Slide14NCSA AC GPU Cluster
14
7/26/2009
Slide15GPU Cluster Architecture
Servers: 32CPU cores: 128
Accelerator Units: 32GPUs: 128
15
ac01
(compute node)
ac02
(compute node)
ac32
(compute node)
ac
(head node)
7/26/2009
Slide16GPU Cluster Node Architecture
HP xw9400 workstation2216 AMD Opteron 2.4 GHz dual socket dual core8 GB DDR2InfiniBand QDR S1070 1U GPU Computing Server1.3 GHz Tesla T10 processors4x4 GB GDDR3 SDRAM
16
IB
Tesla S1070
T10
T10
PCIe
interface
DRAM
DRAM
T10
T10
PCIe
interface
DRAM
DRAM
HP xw9400 workstation
PCIe
x16
PCIe
x16
QDR IB
Compute node
7/26/2009
Slide17Accessing the GPU Cluster
Use Secure Shell (SSH) client to access AC`ssh USER@ac.ncsa.uiuc.edu` (User: gpu001 - gpu200; Password: CHiPS-09)You will see something like this printed out:See machine details and a technical report at: http://www.ncsa.uiuc.edu/Projects/GPUcluster/Machine Description and HOW TO USE. See: /usr/local/share/ac.readmeCUDA wrapper readme: /usr/local/share/cuda_wrapper.readme*IMPORTANT* If you are using multiple GPU devices per host, be sure to understand how the cuda_wrapper changes this system!!…July 03, 2009Nvidia compute exclusive mode made default. If this breaks your application, "touch ~/FORCE_NORMAL" to create an override for all your jobs.Questions? Contact Jeremy Enos jenos@ncsa.uiuc.edu[gpuXYZ@ac ~]$ _
17
7/26/2009
Slide18Installing Tutorial Examples
Run this sequence to retrieve and install tutorial examples:cdcp /tmp/chips_tutorial.tgz .tar -xvzf chips_tutorial.tgzcd chips_tutorialls src1 src2 src3
18
7/26/2009
Slide19Accessing the GPU Cluster
19
ac01
(compute node)
ac02
(compute node)
ac32(compute node)
Laptop 1
Laptop 30
Laptop 2
ac
(head node)
You are here
7/26/2009
Slide20Requesting a Cluster Node for Interactive Use
Run `qstat` to see what other users do, just for the fun of itRun `qsub -I -l walltime=02:00:00` to request a node with a single GPU for 2 hours of interactive useYou will see something like this printed out:qsub: waiting for job 64424.acm to startqsub: job 64424.acm ready[gpuXYZ@acAB ~]$ _
20
7/26/2009
Slide21Requesting a Cluster Node
21
ac01
(compute node)
ac02
(compute node)
ac32(compute node)
Laptop 1
Laptop 30
Laptop 2
ac
(head node)
You are here
7/26/2009
Slide22Checking GPU Characteristics
Run `deviceQuery`CUDA Device Query (Runtime API) version (CUDART static linking)There is 1 device supporting CUDADevice 0: "Tesla C1060" CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 3 Total amount of global memory: 4294705152 bytes Number of multiprocessors: 30 Number of cores: 240 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes…Clock rate: 1.30 GHzCompute mode: Exclusive (only one host thread at a time can use this device)
22
7/26/2009
Slide23Compiling and Running and Existing Application
cd chips_tutorial/src1vecadd.c - reference C implementationvecadd.cu – CUDA implementationCompile & run CPU versiongcc vecadd.c -o vecadd_cpu./vecadd_cpuRunning CPU vecAdd for 16384 elementsC[0]=2147483648.00 ...Compile & run GPU versionnvcc vecadd.cu -o vecadd_gpu./vecadd_gpuRunning GPU vecAdd for 16384 elementsC[0]=2147483648.00 ...
23
7/26/2009
Slide24nvcc
Any source file containing CUDA C language extensions must be compiled with nvccnvcc is a compiler driver that invokes many other tools to accomplish the jobBasic nvcc usagenvcc <filename>.cu [-o <executable>]Builds release modenvcc -deviceemu <filename>.cuBuilds device emulation mode (all code runs on CPU)-g flag allows to build debug mode for gdb debugger
24
7/26/2009
Slide25Anatomy of a GPU Application
Host sideDevice sideCUDA programming model
25
7/26/2009
Slide26CPU-Only Version
void vecAdd(int N, float* A, float* B, float* C) { for (int i = 0; i < N; i++) C[i] = A[i] + B[i];}int main(int argc, char **argv) { int N = 16384; // default vector size float *A = (float*)malloc(N * sizeof(float)); float *B = (float*)malloc(N * sizeof(float)); float *C = (float*)malloc(N * sizeof(float)); vecAdd(N, A, B, C); // call compute kernel free(A); free(B); free(C); }
26
Computational kernel
Memory allocation
Kernel invocation
Memory de-allocation
7/26/2009
Slide27Adding GPU support
int main(int argc, char **argv) { int N = 16384; // default vector size float *A = (float*)malloc(N * sizeof(float)); float *B = (float*)malloc(N * sizeof(float)); float *C = (float*)malloc(N * sizeof(float)); float *devPtrA, *devPtrB, *devPtrC; cudaMalloc((void**)&devPtrA, N * sizeof(float)); cudaMalloc((void**)&devPtrB, N * sizeof(float)); cudaMalloc((void**)&devPtrC, N * sizeof(float)); cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);
27
Memory allocation on the GPU card
Copy data from the CPU (host) memory to the GPU (device) memory
7/26/2009
Slide28Adding GPU support
vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC); cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost); cudaFree(devPtrA); cudaFree(devPtrB); cudaFree(devPtrC); free(A); free(B); free(C); }
28
Kernel invocation
Copy results from device memory to the host memory
Device memory de-allocation
7/26/2009
Slide29GPU Kernel
CPU versionvoid vecAdd(int N, float* A, float* B, float* C) { for (int i = 0; i < N; i++) C[i] = A[i] + B[i];}GPU version__global__ void vecAdd(float* A, float* B, float* C) { int i = blockIdx.x * blockDim.x + threadIdx.x; C[i] = A[i] + B[i]; }
29
7/26/2009
Slide30CUDA Programming Model
A CUDA kernel is executed by an array of threadsAll threads run the same code (SPMD)Each thread has an ID that it uses to compute memory addresses and make control decisionsThreads are arranged as a grid of thread blocksThreads within ablock have accessto a segment ofshared memory
30
…
float x = input[
threadID
];
float y =
func
(x);
output[
threadID
] = y;
…
threadID
Grid
Thread Block 0
Shared memory
Thread Block 1
Shared memory
Thread Block
N-1
Shared memory
…
7/26/2009
Slide31Kernel Invocation Syntax
31
Grid
Thread Block 0
Shared memory
Thread Block 1
Shared memory
Thread Block
N-1
Shared memory
…
grid
&
thread block
dimensionality
vecAdd
<<<
32
, 512>>>
(
devPtrA
,
devPtrB
,
devPtrC
);
int
i
=
blockIdx.x
*
blockDim.x
+
threadIdx.x
;
thread ID within a thread block
number of
theards
per block
block ID within a grid
7/26/2009
Slide32Mapping Threads to the Hardware
Blocks of threads are transparently assigned to SMsA block of threads executes on one SM & does not migrateSeveral blocks can reside concurrently on one SM
32
Device
Block 0
Block 1
Block 2
Block 3
Block 4
Block 5
Block 6
Block 7
Kernel grid
Block 0
Block 1
Block 2
Block 3
Block 4
Block 5
Block 6
Block 7
Device
Block 0
Block 1
Block 2
Block 3
Block 4
Block 5
Block 6
Block 7
Each block can execute in any order relative to other blocks.
time
Slide is courtesy of NVIDIA
7/26/2009
Slide33CUDA Programming Model
A kernel is executed as a grid of thread blocksGrid of blocks can be 1 or 2-dimentionalThread blocks can be 1, 2, or 3-dimensionalDifferent kernels can have different grid/block configurationThreads from the same block have access to a shared memory and their execution can be synchronized
33
Slide is courtesy of NVIDIA
Device
Grid 2
Host
Kernel 1
Kernel 2
Block (1, 1)
Thread
(0,1,0)
Thread
(1,1,0)
Thread
(2,1,0)
Thread
(3,1,0)
Thread
(0,0,0)
Thread
(1,0,0)
Thread
(2,0,0)
Thread
(3,0,0)
(0,0,1)
(1,0,1)
(2,0,1)
(3,0,1)
Grid 1
Block
(0, 0)
Block
(1, 0)
Block
(2, 0)
Block
(0, 1)
Block
(1, 1)
Block
(2, 1)
7/26/2009
Slide34GPU Memory Hierarchy
34
MemoryLocationCachedAccessScopeLifetimeRegisterOn-chipN/AR/WOne threadThreadLocalOff-chipNoR/WOne threadThreadSharedOn-chipN/AR/WAll threads in a blockBlockGlobalOff-chipNoR/WAll threads + hostApplicationConstantOff-chipYesRAll threads + hostApplicationTextureOff-chipYesRAll threads + hostApplication
Host
CPU
chipset
DRAM
Device
DRAM
local
global
constanttexture
GPU
Multiprocessor
Multiprocessor
Multiprocessor
registers
shared memory
constant and texture caches
7/26/2009
Slide35Porting matrix multiplier to GPU
cd ../chips_tutorial/src2Compile & run CPU versionicc -O3 mmult.c -o mmult./mmult1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ......msec = 2215478 GFLOPS = 0.969
35
7/26/2009
Slide3636
int main(int argc, char* argv[]){ int N = 1024; struct timeval t1, t2, ta, tb; long msec1, msec2; float flop, mflop, gflop; float *a = (float *)malloc(N*N*sizeof(float)); float *b = (float *)malloc(N*N*sizeof(float)); float *c = (float *)malloc(N*N*sizeof(float)); minit(a, b, c, N); gettimeofday(&t1, NULL); mmult(a, b, c, N); // a = b * c gettimeofday(&t2, NULL); mprint(a, N, 5); free(a); free(b); free(c); msec1 = t1.tv_sec * 1000000 + t1.tv_usec; msec2 = t2.tv_sec * 1000000 + t2.tv_usec; msec2 -= msec1; flop = N*N*N*2.0f; mflop = flop / msec2; gflop = mflop / 1000.0f; printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);}
// a = b * cvoid mmult(float *a, float *b, float *c, int N) { for (int j = 0; j < N; j++) for (int k = 0; k < N; k++) for (int i = 0; i < N; i++) a[i+j*N] += b[i+k*N]*c[k+j*N];}void minit(float *a, float *b, float *c, int N) { for (int j = 0; j < N; j++) for (int i = 0; i < N; i++) { a[i+N*j] = 0.0f; b[i+N*j] = 1.0f; c[i+N*j] = 1.0f; }}void mprint(float *a, int N, int M){ int i, j; for (int j = 0; j < M; j++) { for (int i = 0; i < M; i++) printf("%.2f ", a[i+N*j]); printf("...\n"); } printf("...\n");}
7/26/2009
Slide37for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) for (k = 0; k < n; ++k) a[i+n*j] += b[i+n*k] * c[k+n*j];
Matrix Representation in Memory
Matrices are stored in column-major orderFor reference, jki-ordered version runs at 1.7 GFLOPS on 3 GHz Intel Xeon (single core)
37
7/26/2009
Slide38for (i = 0; i < n; ++i) for (j = 0; j < n; ++j) for (k = 0; k < n; ++k) a[i+n*j] += b[i+n*k] * c[k+n*j];
Grid of thread blocks
0
0 1 2 3 4 5
1
0 1 2 3 4 5
2
0 1 2 3 4 5
blockIdx.x
blockDim.x
threadIdx.x
blockIdx.x
*
blockDim.x
+
threadIdx.x
Map this code:
into this (logical) architecture:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
b
1,1
b
1,2
b
2,1
b
2,2
b
3,1
b
3,2
c
1,1
c
1,2
c
1,3c2,1c2,2c2,3
a1,1a1,2a1,3a2,1a2,2a2,3a3,1a3,2a3,3
B
C
A=B*C
a
1,2
=b1,1*c1,2+b1,2*c2,2
38
7/26/2009
Slide3932x1024 grid of thread blocks
Block of 32x1x1 threads(blockIdx.x, blockIdx.y)
(threadIdx.x){ int i = blockIdx.x*32 + threadIdx.x; int j = blockIdx.y; float sum = 0.0f; for (int k = 0; k < n; k++) sum += b[i+n*k] * c[k+n*j]; a[i+n*j] = sum;}
32 threads per block (
i
)
32 thread blocks (
i
)
1024 thread blocks (
j
)
dim3 grid(1024/32, 1024);
dim3 threads (32);
39
7/26/2009
Slide40Kernel
Original CPU kernel
GPU Kernel
40
__global__ void mmult(float *a, float *b, float *c, int N){ int i = blockIdx.x * blockDim.x + threadIdx.x; int j = blockIdx.y; float sum = 0.0; for (int k = 0; k < N; k++) sum += b[i+N*k] * c[k+N*j]; a[i+N*j] = sum;}
void mmult(float *a, float *b, float *c, int N) { for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) for (int k = 0; k < N; k++) a[i+j*N] += b[i+k*N]*c[k+j*N];}
dim3 dimGrid(32, 1024);dim3 dimBlock(32);mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);
7/26/2009
Slide4141
int main(int argc, char* argv[]){ int N = 1024; struct timeval t1, t2; long msec1, msec2; float flop, mflop, gflop; float *a = (float *)malloc(N*N*sizeof(float)); float *b = (float *)malloc(N*N*sizeof(float)); float *c = (float *)malloc(N*N*sizeof(float)); minit(a, b, c, N); // allocate device memory float *devPtrA, *devPtrB, *devPtrC; cudaMalloc((void**)&devPtrA, N*N*sizeof(float)); cudaMalloc((void**)&devPtrB, N*N*sizeof(float)); cudaMalloc((void**)&devPtrC, N*N*sizeof(float)); // copu input arrays to the device meory cudaMemcpy(devPtrB, b, N*N*sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(devPtrC, c, N*N*sizeof(float), cudaMemcpyHostToDevice);
7/26/2009
Slide4242
gettimeofday(&t1, NULL); msec1 = t1.tv_sec * 1000000 + t1.tv_usec; // define grid and thread block sizes dim3 dimGrid(32, 1024); dim3 dimBlock(32); // launch GPU kernel mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N); // check for errors cudaError_t err = cudaGetLastError(); if (cudaSuccess != err) { fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString( err) ); exit(EXIT_FAILURE); } // wait until GPU kernel is done cudaThreadSynchronize(); gettimeofday(&t2, NULL); msec2 = t2.tv_sec * 1000000 + t2.tv_usec;
7/26/2009
Slide4343
// copy results to host cudaMemcpy(a, devPtrA, N*N*sizeof(float), cudaMemcpyDeviceToHost); mprint(a, N, 5); // free device memory cudaFree(devPtrA); cudaFree(devPtrB); cudaFree(devPtrC); free(a); free(b); free(c); msec2 -= msec1; flop = N*N*N*2.0f; mflop = flop / msec2; gflop = mflop / 1000.0f; printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);}
7/26/2009
Slide44Porting matrix multiplier to GPU
Compile & run GPU versionnvcc mmult.cu -o mmult./mmult1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ...1024.00 1024.00 1024.00 1024.00 1024.00 ......msec = 91363 GFLOPS = 23.505
44
7/26/2009
Slide45More on CUDA Programming
Language extensionsFunction type qualifiersVariable type qualifiersExecution configurationBuilt-in variablesCommon runtime componentsBuilt-in vector typesDevice runtime componentsIntrinsic functionsSynchronization and memory fencing functionsAtomic functionsHost runtime components (runtime API only)Device managementMemory managementError handlingDebugging in the device emulation modeExercise
45
7/26/2009
Slide46Function Type Qualifiers
46
host
host
__host__
float HostFunc()
host
device
__global__ void KernelFunc()
device
device
__device__ float DeviceFunc()
Only callable from the:
Executed on the:
__device__
and
__global__
functions do not support recursion, cannot declare static variables inside their body, cannot have a variable number of arguments
__device__
functions cannot have their address taken
__host__
and
__device__
qualifiers can be used together, in which case the function is compiled for both
__global__ and __host__ qualifiers cannot be used together__global__ function must have void return type, its execution configuration must be specified, and the call is asynchronous
7/26/2009
Slide47Variable Type Qualifiers
47
MemoryScopeLifetime__device__ int GlobalVar;globalgridapplication__device__ __shared__ int SharedVar;sharedblockblock__device__ __constant__ int ConstantVar;constantgridApplicationvolatile int GlobarVar or SharedVar;
__shared__ and __constant__ variables have implied static storage__device__, __shared__ and __constant__ variables cannot be defined using external keyword__device__ and __constant__ variables are only allowed at file scope__constant__ variables cannot be assigned to from the devices, they are initialized from the host only__shared__ variables cannot have an initialization as part of their declaration
7/26/2009
Slide48Execution Configuration
48
Function declared as __global__ void kernel(float* param);must be called like this:kernel<<<Dg, Db, Ns, S>>>(param);where Dg (type dim3) specifies the dimension and size of the grid, such that Dg.x*Dg.y equals the number of blocks being launched; Db (type dim3) spesifies the dimension abd size of each block of threads, such that Db.x*Db.y*Db.z equals the number of threads per block; optional Ns (type size_z) specifies the number of bytes of shared memory dynamically allocated per block for this call in addition to the statically allocated memory optional S (type cudaStream_t) specifies the stream associated with this kernel call
7/26/2009
Slide49Built-in Variables
49
variabletypedescriptiongridDimdim3dimensions of the gridblockIDunit3block index within the gridblockDimdim3dimensions of the blockthreadIdxuint3thread index within the blockwarpSizeintwarp size in threads
It is not allowed to take addresses of any of the built-in variablesIt is not allowed to assign values to any of the built-in variables
7/26/2009
Slide50Built-in Vector Types
50
Vector types derived from basic integer and float types char1, char2, char3, char4 uchar1, uchar2, uchar3, uchar4 short1, short2, short3, short4 ushort1, ushort2, ushort3, ushort4 int1, int2, int3, int4 uint1, uint2, uint3 (dim3), uint4 long1, long2, long3, long4 ulong1, ulong2, ulong3, ulong4 longlong1, longlong2 float1, float2, float3, float4 double1, double2
They are all structures, like this:typedef struct { float x,y,z,w;} float4;They all come with a constructor function in the form make_<type name>, e.g.,int2 make_int2(int x, int y);
7/26/2009
Slide51Intrinsic Functions
51
Supported on the device onlyStart with __, as in __sinf(x)End with _rn (round-to-nearest-even rounding mode)_rz (round-towards-zero rounding mode)_ru (round-up rounding mode)_rd (round-down rounding mode)as in __fadd_rn(x,y);There are mathematical (__log10f(x)), type conversion (__int2float_rn(x)), type casting (__int_as_float(x)), and bit manipulation (__ffs(x)) functions
7/26/2009
Slide52Synchronization and Memory Fencing Functions
52
functiondescriptionvoid __threadfence()wait until all global and shared memory accesses made by the calling thread become visible to all threads in the device for global memory accesses and all threads in the thread block for shared memory accessesvoid __threadfence_block()Waits until all global and shared memory accesses made by the calling thread become visible to all threads in the thread blockvoid __syncthreads()Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads become visible to all threads in the block
7/26/2009
Slide53Atomic Functions
53
An atomic function performs read-modify-write atomic operation on one 32-bit or one 64-bit word residing in global or shared memory. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.
functionDescriptionatomicAdd()new = old + valatomicSub()new = old – valatomicExch()new = valatomicMin()new = min(old, val)atomicMax()new = max(old, val)atomicInc()new = ((old >= val) ? 0 : (old+1))atomicDec()new = (((old==0) | (old > val)) ? val : (old-1))atomicCAS()new = (old == compare ? val : old)Atomic{And, Or, Xor}()new = {(old & val), (old | val), (old^val)}
7/26/2009
Slide54CUDA APIs
higher-level API called the CUDA runtime APImyKernelunsigned char*)devPtr, width, <<<Dg, Db>>>(( height, pitch);
low-level API called the CUDA driver APIcuModuleLoad( &module, binfile );cuModuleGetFunction( &func, module, "mmkernel" );…cuParamSetv( func, 0, &args, 48 );cuParamSetSize( func, 48 );cuFuncSetBlockShape( func, ts[0], ts[1], 1 );cuLaunchGrid( func, gs[0], gs[1] );
54
7/26/2009
Slide55Device Management
55
functiondescriptioncudaGetDeviceCount()Returns the number of compute-capable devicescudaGetDeviceProperties()Returns information on the compute devicecudaSetDevice()Sets device to be used for GPU executioncudaGetDevice()Returns the device currently being usedcudaChooseDevice()Selects device that best matches given criteria
7/26/2009
Slide56Device Management Example
56
void cudaDeviceInit() { int devCount, device; cudaGetDeviceCount(&devCount); if (devCount == 0) { printf("No CUDA capable devices detected.\n"); exit(EXIT_FAILURE); } for (device=0; device < devCount; device++) { cudaDeviceProp props; cudaGetDeviceProperties(&props, device); // If a device of compute capability >= 1.3 is found, use it if (props.major > 1 || (props.major == 1 && props.minor >= 3)) break; } if (device == devCount) { printf("No device above 1.2 compute capability detected.\n"); exit(EXIT_FAILURE); } else cudaSetDevice(device);}
7/26/2009
Slide57Memory Management
57
functiondescriptioncudaMalloc()Allocates memory on the GPUcudaMallocPitch()Allocates memory on the GPU device for 2D arrays, may pad the allocated memory to ensure alignment requirementscudaFree()Frees the memory allocated on the GPUcudaMallocArray()Allocates an array on the GPUcudaFreeArray()Frees an array allocated on the GPUcudaMallocHost()Allocates page-locked memory on the hostcudaFreeHost()Frees page-locked memory in the host
7/26/2009
Slide58More on Memory Alignment
a
1,1
a1,2a1,3a2,1a2,2a2,3a3,1a3,2a3,3
a1,1a2,1a3,1a1,2a2,2a3,2a1,3a2,3a3,3
cudaMalloc(&dev_a, m*n*sizeof(float));
Matrix columns are not aligned at 64-bit boundary
a1,1a2,1a3,1a1,2a2,2a3,2a1,3a2,3a3,3
cudaMallocPitch(&dev_a, &n, n*sizeof(float), m);
Matrix columns are aligned at 64-bit boundary
n is the allocated (aligned) size for the first dimension (the pitch), given the requested sizes of the two dimensions.
58
7/26/2009
Slide59Memory Management Example
59
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height); myKernel<<<100, 192>>>(devPtr, pitch); // device code __global__ void myKernel(float* devPtr, int pitch) { for (int r = 0; r < height; ++r) { float* row = (float*)((char*)devPtr + r * pitch); for (int c = 0; c < width; ++c) { float element = row[c]; } }}
7/26/2009
Slide60Memory Management
60
functiondescriptioncudaMemset()Initializes or sets GPU memory to a valuecudaMemCpy()Copies data between host and the devicecudaMemcpyToArray()cudaMemcpyFromArray()cudaMemcpyArrayToArray()cudaMemcpyToSymbol()cudaMemcpyFromSymbol()cudaGetSymbolAddress()Finds the address associated with a CUDA symbolcudaGetSymbolSize()Finds the size of the object associated with a CUDA symbol
7/26/2009
Slide61Error Handling
61
All CUDA runtime API functions return an error code. The runtime maintains an error variable for each host thread that is overwritten by the error code every time an error concurs.
functiondescriptioncudaGetLastError()Returns error variable and resets it to cudaSuccesscudaGetErrorString()Returns the message string from an error code
cudaError_t err = cudaGetLastError();if (cudaSuccess != err) { fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString( err) ); exit(EXIT_FAILURE);}
7/26/2009
Slide62Exercise
Port Mandelbrot set fractal renderer to CUDA
Source is in ~/chips_tutorial/src3fractal.c – reference C implementationMakefile – make filefractal.cu.reference – CUDA implementation for reference
62
7/26/2009
Slide63Reference C Implementation
7/26/2009
63
void makefractal_cpu(unsigned char *image, int width, int height, double xupper, double xlower, double yupper, double ylower){ int x, y; double xinc = (xupper - xlower) / width; double yinc = (yupper - ylower) / height; for (y = 0; y < height; y++) { for (x = 0; x < width; x++) { image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc)); } }}
Slide64CUDA Kernel Implementation
7/26/2009
64
__global
__
void
makefractal_cpu
(unsigned char *image,
int
width,
int
height, double
xupper
, double
xlower
, double
yupper
, double
ylower
)
{
int
x =
blockIdx.x
;
int
y =
blockIdx.y
;
int
width =
gridDim.x
;
int
height =
gridDim.y
;
double
xupper
=-0.74624,
xlower
=-0.74758,
yupper
=0.10779,
ylower
=0.10671
;
double
xinc
= (
xupper
-
xlower
) / width;
double
yinc
= (
yupper
-
ylower
) / height;
image[y*
width+x
] =
iter
((
xlower
+ x*
xinc
), (
ylower
+ y*
yinc
));
}
Slide65Reference C Implementation
7/26/2009
65
inline unsigned char iter(double a, double b){ unsigned char i = 0; double c_x = 0, c_y = 0; double c_x_tmp, c_y_tmp; double D = 4.0; while ((c_x*c_x+c_y*c_y < D) && (i++ < 255)) { c_x_tmp = c_x * c_x - c_y * c_y; c_y_tmp = 2* c_y * c_x; c_x = a + c_x_tmp; c_y = b + c_y_tmp; } return i;}
The Mandelbrot set is generated by iterating complex function
z
2
+ c
, where
c
is a constant:
z
1
= (z
0
)
2
+ c
z
2
= (z
1
)
2
+ c
z
3
= (z
2
)
2
+ c
and so forth. Sequence
z
0
, z
1
, z
2
,...
is called the
orbit
of
z
0
under iteration of
z
2
+ c
. We stop iteration when the orbit starts to diverge, or when a maximum number of iterations is done.
Slide66CUDA Kernel Implementation
7/26/2009
66
inline
__device__
unsigned
char
iter
(double a, double b)
{
unsigned char
i
= 0;
double
c_x
= 0,
c_y
= 0;
double
c_x_tmp
,
c_y_tmp
;
double D = 4.0;
while ((
c_x
*
c_x+c_y
*
c_y
< D) && (
i
++ < 255))
{
c_x_tmp
=
c_x
*
c_x
-
c_y
*
c_y
;
c_y_tmp
= 2*
c_y
*
c_x
;
c_x
= a +
c_x_tmp
;
c_y
= b +
c_y_tmp
;
}
return
i
;
}
Slide67Host Code
7/26/2009
67
int
width = 1024;
int
height = 768;
unsigned char *image = NULL;
unsigned char *
devImage
;
image
= (unsigned char*)
malloc
(width*height*
sizeof
(unsigned char));
cudaMalloc
((void**)&
devImage
, width*height*
sizeof
(unsigned char));
dim3
dimGrid
(width, height);
dim3
dimBlock
(1
);
makefractal_gpu
<<<
dimGrid
,
dimBlock
>>>(
devImage
);
cudaMemcpy
(image
,
devImage
, width*height*
sizeof
(unsigned char
),
cudaMemcpyDeviceToHost
);
free(image);
cudaFree
(
devImage
);
Slide68Few Examples
xupper=-0.74624xlower=-0.74758yupper=0.10779ylower=0.10671CPU time: 2.27 secGPU time: 0.29 sec
xupper=-0.754534912109xlower=-.757077407837yupper=0.060144042969ylower=0.057710774740CPU time: 1.5 secGPU time: 0.25 sec
7/26/2009
68