Introduction to CUDA Programming



Presentation Transcript

Introduction to CUDA Programming. Andreas Moshovos, Winter 2009. Some slides/material from: the UIUC course by Wen-Mei Hwu and David Kirk; the UCSB course by Andrea Di Blas; Universitat Jena by Waqar Saleem; NVIDIA by Simon Green.

Understanding Semiconductor Technology Limitations. Computation: calculations (A + B, decide what to do next). Data communication/storage. What we would like to have: tons of compute engines, tons of storage, unlimited bandwidth, zero/low latency.

Let’s see what we can get: Calculation Capability. How many calculation units can be built? Today’s silicon chips have about 1B+ transistors; at 30K transistors for a 52-bit multiplier, that is ~30K multipliers. A 260mm^2 die (mid-range) at 112 microns^2 per FP unit (an overestimate) gives ~2K FP units. With frequencies of ~3GHz common today, TFLOPs are possible. Disclaimer: back-of-the-envelope calculations, take with a grain of salt. Bottom line: we can build lots of calculation units. Tons of compute engines?

How about communication/storage? We need a data feed and storage. The larger the storage, the slower it is: it takes time to get there and back, multiple cycles even on the same die. So what we can realistically get is tons of compute engines and tons of slow storage; the unlimited bandwidth and zero/low latency of the wish list are not attainable.

What if we could build it? Is there enough parallelism to keep all those compute engines busy? It needs lots of independent calculations (parallelism/concurrency), but much of what we do is sequential: first do 1, then do 2, then if X do 3 else do 4.

Today’s high-end general-purpose processors localize communication and computation and try to automatically extract some parallelism. A hierarchy of faster and slower caches sits in front of tons of slow storage: large on-die caches tolerate off-chip memory latency. Application-driven design: optimize the common case. There is some reuse of data (actually a lot, in the short term), giving 90%+ hit rates on first-level caches.

Some things are naturally parallel

Sequential Execution Model: int a[N]; // N is large; for (i = 0; i < N; i++) a[i] = a[i] * fade; A single flow of control (thread) executes one instruction at a time; optimizations are possible at the machine level.

Data Parallel Execution Model / SIMD: int a[N]; // N is large; for all elements do in parallel: a[i] = a[i] * fade; This has been tried before: ILLIAC III, UIUC, 1966 (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4038028&tag=1, http://ed-thelen.org/comp-hist/vs-illiac-iv.html).

Single Program Multiple Data / SPMD: int a[N]; // N is large; for all elements do in parallel: if (a[i] > threshold) a[i] *= fade; The code is statically identical across all threads, but the execution path may differ. This is the model used in today’s graphics processors.

CPU vs. GPU overview. CPU: handles sequential code well; latency optimized (do everything very fast); can’t take advantage of massively parallel code; lower off-chip bandwidth (narrow); lower peak computation capability. GPU: requires massively parallel computation; bandwidth optimized (do lots concurrently); handles some control flow; higher off-chip bandwidth (wide); higher peak computation capability.

Why do GPUs exist now? Why not before (1966)? 3D graphics applications: games, and to a much lesser extent engineering/CAD. The nature of 3D graphics computation: start with triangles (points in 3D space); transform (move, rotate, scale); paint / texture map; rasterize (convert into pixels); light; hidden-surface elimination. Bottom line: tons of independent calculations, and lots of identical calculations.

Programmer’s view: the GPU as a co-processor (CPU data is from 2008 and matches our lab machines). The CPU and its memory connect to the GPU and its memory (1GB on our systems) over a 3GB/s – 8GB/s link; CPU memory bandwidth is 6.4GB/sec – 31.92GB/sec at 8B per transfer; GPU memory bandwidth is 177.4GB/sec (GTX480 characteristics, top of the line in 2010). Key suppliers: Nvidia and AMD.

But what about performance? Focus on PEAK performance first: what the manufacturer guarantees you’ll never exceed. Two aspects: data access rate capability (bandwidth) and data processing capability (how many ops per second).

Data Processing Capability. Focus on floating-point data: GFLOPS, billion (giga) floating-point operations per second. Caveat: FLOPs can be different, but today things are not as bad as before.

GFLOPS. High-end CPU in 2008: 3.4GHz x 8 FLOPs/cycle = 27 GFLOPS (assumes SSE). High-end GPU in 2008 (GTX280): 933.1 GFLOPS, or 34x the capability. High-end CPU in 2011: 3.4GHz x 32 FLOPs/cycle x 4 cores = 435.2 GFLOPS (assumes AVX). High-end GPU in 2011 (GTX580): 1581 GFLOPS, or 3.6x the capability. Our GPUs: 1345 GFLOPS.

Data Access Capability. High-end CPU (2008): 31.92 GB/sec (Nehalem) to 12.8 GB/sec (Harpertown), bus width 64-bit. GPU / GTX280 (2008-2009): 141.7 GB/sec, bus width 512-bit; a 4.39x – 11x advantage. High-end CPU in 2011: four channels of DDR3-1600, each 12.8 GB/s, for 51.2 GB/s total; bus width 64-bit per channel. GPU / GTX480 (2010): 177.4 GB/sec (the GTX580 in 2011 is 192.4GB/s), bus width 384-bit; a 3.75x advantage. Our machines (I think): 12.8 GB/s CPU vs. 177.4 GB/s GPU, or a 13.8x difference.

GPU vs. CPU: GFLOPS (chart)

GPU vs. CPU: Memory Bandwidth, GBytes/sec (chart)

Target Applications: int a[N]; // N is large; for all elements of an array: a[i] = a[i] * fade. Lots of independent computations; CUDA threads need not be completely independent. The loop body becomes the kernel, and each element is handled by a thread.

Programmer’s View of the GPU. The GPU is a compute device that: is a coprocessor to the CPU or host; has its own DRAM (device memory); runs many threads in parallel. Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads.

Why are threads useful? Parallelism/concurrency: do multiple things in parallel. Using more hardware (more functional units) gets higher performance, but the application must have parallelism.

Why are threads useful #2 – Tolerating stalls Often a thread stalls, e.g., memory access Multiplex the same functional unit Get more performance at a fraction of the cost

GPU: bandwidth optimized, so latencies are long. A GPU ADD takes 24 GPU cycles (true of the GTX280); a CPU ADD takes 1 cycle. The GPU cycle is roughly 1/4 of a CPU cycle for the systems in the lab (GTX480). You need ~100 threads to break even, and 1000s of threads for the GPU to be better.

GPU vs. CPU Threads. GPU threads are extremely lightweight: very little creation overhead (on the order of microseconds), all done in hardware. A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few.

Execution Timeline (CPU/host on one side, GPU/device on the other): 1. Copy data to GPU memory. 2. Launch the GPU kernel. 2'. Synchronize with the GPU. 3. Copy results from GPU memory.

Programmer’s view: first, create the data in CPU memory.

Programmer’s view: then, copy it to GPU memory.

Programmer’s view: the GPU starts computation (runs a kernel); the CPU can also continue working.

Programmer’s view: the CPU and GPU synchronize.

Programmer’s view: copy the results back to CPU memory.

Programming Languages. CUDA: NVidia; has the market lead. OpenCL: backed by many vendors including Nvidia; roughly a CUDA superset with somewhat different syntax; can target many different devices, e.g., CPUs + programmable accelerators; fairly new. We’ll focus on CUDA for now; both are evolving.

Computation partitioning: at the highest level, think of the computation as a series of loops, each of which becomes a kernel: for (i = 0; i < big_number; i++) a[i] = some function; for (i = 0; i < big_number; i++) a[i] = some other function; for (i = 0; i < big_number; i++) a[i] = some other function.

Computation Partitioning -- Kernel. CUDA exposes the hardware to the programmer; the programmer must manually partition the work appropriately. The programmer’s view is hierarchical: think of the data as an array.

Per-Kernel Computation Partitioning, computation grid, 2D case. Threads within a block can communicate/synchronize; they run on the same multiprocessor. Threads across blocks can’t communicate and shouldn’t touch each other’s data; the behavior is undefined. (Figure: a grid of blocks, each block containing threads.)

Per-Kernel Computation Partitioning, computation grid, 2D case. One thread can process multiple data elements; other mappings are possible and often desirable. More on this when we talk about how to optimize for performance.

GBT: Grids of Blocks of Threads. Why? The realities of integrated circuits: computation and storage must be clustered to achieve high speeds. The philosophy is: we’ll tell you about the hardware, and you figure out how to make the best of it. This is the programmer’s view of data and computation partitioning.

Programmer’s view: Memory Model

Grids of Blocks of Threads: Dimension Limits. (Figure: a device runs a grid of blocks, e.g., Block (0,0) through Block (2,1); each block contains threads, e.g., Thread (0,0) through Thread (4,2).) Grid of blocks: 1D, 2D, or 3D; max x, y, and z: 65535. Block of threads: 1D, 2D, or 3D; max number of threads: 1024; max x: 1024, max y: 1024, max z: 64. These limits apply to Compute Capability 2.0 (GTX480 = 2.0).

Block and Thread IDs. Threads and blocks have IDs so that each thread can decide what data to work on. Block ID: 1D, 2D, or 3D. Thread ID: 1D, 2D, or 3D. The combination is unique. This simplifies memory addressing when processing multidimensional data; it is a convenience, not a necessity. IDs and dimensions are accessible through predefined “variables”, e.g., blockDim.x and threadIdx.x.
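
As a concrete illustration of how the built-in IDs combine into a unique per-thread coordinate, here is a minimal sketch (the kernel name, array name, and launch dimensions are invented for the example, not taken from the slides):

__global__ void scale2d (float *data, float factor, int width, int height)
{
    // Each thread computes its own (x, y) coordinate from its block and thread IDs.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // guard against threads past the array edge
        data[y * width + x] *= factor;    // a unique element per thread
}
// Possible launch: dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale2d<<<grid, block>>>(d_data, 0.5f, width, height);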

Thread Batching. A kernel is executed as a grid of thread blocks. All threads share the global data memory space but cannot communicate through it. A thread block is a group of threads that can cooperate with each other by: synchronizing their execution (for hazard-free shared memory accesses) and efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate. (Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid consists of blocks, and each block of threads.)
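
A minimal sketch of within-block cooperation (the kernel name and the single-block launch are invented for illustration): each thread loads one element into shared memory, the block synchronizes, and then each thread reads an element that a different thread wrote.

__global__ void reverse_block (float *d, int n)
{
    __shared__ float s[256];      // assumes n <= 256 and a single block of n threads
    int t = threadIdx.x;
    s[t] = d[t];                  // each thread loads one element
    __syncthreads();              // barrier: all loads complete before any read
    d[t] = s[n - 1 - t];          // read the element loaded by thread n-1-t, hazard-free
}
// Possible launch (one block of n threads, n <= 256): reverse_block<<<1, n>>>(d_a, n);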

Thread Coordination Overview Race-free access to data Only across threads within the same block No communication across blocks

Programmer’s view: Memory Model: Thread vs. Host Arrows show whether read and/or write is possible

Programmer’s View: Memory Detail, Thread and Host. Each thread can: read/write per-thread registers; read/write per-thread local memory; read/write per-block shared memory; read/write per-grid global memory; read only per-grid constant memory; read only per-grid texture memory. The host can read/write the global, constant, and texture memories.

Memory Model: Global, Constant, and Texture Memories. Global memory: the main means of communicating R/W data between host and device; contents visible to all threads; officially not cached on the GTX280 (little locality, a legacy of its 3D graphics origin), but the GTX480 caches it. Texture and constant memories: constants initialized by the host; contents visible to all threads; cached (GTX280).

Memory Model Summary

Memory     Location   Cached   Access   Scope
Local      off-chip   No       R/W      thread
Shared     on-chip    N/A      R/W      all threads in a block
Global     off-chip   No       R/W      all threads + host
Constant   off-chip   Yes      RO       all threads + host
Texture    off-chip   Yes      RO       all threads + host

Execution Model: Ordering. Execution order is undefined. Do not assume or rely on: block 0 executing before block 1, thread 10 executing before thread 20, or any other ordering, even if you can observe it. Future implementations may break that ordering; it is not part of the CUDA definition. Why? More flexible hardware options.

CUDA Software Architecture. (Figure: the software stack layers, from libraries such as FFT, down to the runtime API cuda…() calls, down to the driver API cu…() calls.)

Reasoning about CUDA call ordering. GPU communication happens via cuda…() calls and kernel invocations, e.g., cudaMalloc and cudaMemcpy. These are asynchronous from the CPU’s perspective: the CPU places a request in a “CUDA” queue, and requests are handled in-order. Streams allow for multiple queues; order within each queue is honored, but there is no ordering across queues. More on this much later on.
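
For reference, a minimal sketch of the multiple-queue idea using streams (this anticipates later material; mykernel, d_a, d_b, h_a, h_b, nbytes, blocks, and threads are invented names): operations issued into the same stream execute in order, while operations in different streams may overlap.

cudaStream_t s0, s1;
cudaStreamCreate (&s0);
cudaStreamCreate (&s1);

// Two independent copy+kernel sequences, one per queue (stream).
// For the copies to actually overlap, h_a and h_b should be pinned (cudaMallocHost).
cudaMemcpyAsync (d_a, h_a, nbytes, cudaMemcpyHostToDevice, s0);
cudaMemcpyAsync (d_b, h_b, nbytes, cudaMemcpyHostToDevice, s1);
mykernel<<<blocks, threads, 0, s0>>> (d_a, N);   // ordered after the copy in s0
mykernel<<<blocks, threads, 0, s1>>> (d_b, N);   // ordered after the copy in s1

cudaStreamSynchronize (s0);   // wait for everything queued in s0
cudaStreamSynchronize (s1);
cudaStreamDestroy (s0);
cudaStreamDestroy (s1);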

Execution Model Summary (for your reference). A grid of blocks of threads: 1D/2D/3D grid of blocks, 1D/2D/3D blocks of threads. All blocks are identical: same structure and number of threads. Block execution order is undefined. Threads in the same block can synchronize and share data fast (shared memory). Threads from different blocks cannot cooperate; communication is through global memory. Threads and blocks have IDs, which simplifies data indexing; they can be 1D, 2D, or 3D (threads). Blocks do not migrate: they execute on the same processor. Several blocks may run on the same processor.

CUDA API: Example. We want to compute: int a[N]; for (i = 0; i < N; i++) a[i] = a[i] + x; The steps: 1. Allocate the CPU data structure. 2. Initialize the data on the CPU. 3. Allocate the GPU data structure. 4. Copy the data from CPU to GPU. 5. Define the execution configuration. 6. Run the kernel. 7. The CPU synchronizes with the GPU. 8. Copy the data from GPU to CPU. 9. De-allocate GPU and CPU memory.

My first CUDA Program / Skeleton (the kernel runs on the GPU; main runs on the CPU):
__global__ void arradd (float *a, float f, int N)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) a[i] = a[i] + f;
}

int main()
{
  float h_a[N];
  float *d_a;
  cudaMalloc ((void **) &d_a, SIZE);
  cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
  arradd <<<n_blocks, block_size>>> (d_a, 10.0f, N);
  cudaThreadSynchronize ();
  cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
  CUDA_SAFE_CALL (cudaFree (d_a));
}

1. Allocate the CPU data container:
float *ha;
main (int argc, char *argv[])
{
  int N = atoi (argv[1]);
  ha = (float *) malloc (sizeof (float) * N);
  ...
}
No memory is allocated on the GPU side yet. Pinned memory allocation (cudaMallocHost (…)) results in faster CPU to/from GPU copies, but pinned memory cannot be paged out.
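
A minimal sketch of the pinned-memory alternative (same variable names as above; error checking omitted): cudaMallocHost allocates page-locked host memory, which must be released with cudaFreeHost rather than free.

float *ha;
// Pinned (page-locked) host allocation: faster DMA transfers to/from the GPU.
cudaMallocHost ((void **) &ha, sizeof (float) * N);
... // use ha exactly like malloc'ed memory
cudaFreeHost (ha);   // not free()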

2. Initialize CPU Data (dummy) float *ha; int i; for (i = 0; i < N; i++) ha[i] = i;

3. Allocate the GPU data container:
float *da;
cudaMalloc ((void **) &da, sizeof (float) * N);
Notice: no assignment on the left-hand side. It is NOT da = cudaMalloc (…); the assignment is done internally, which is why we pass &da. The space is allocated in global memory on the GPU.

GPU Memory Allocation. The host manages GPU memory allocation:
cudaMalloc (void **ptr, size_t nbytes) -- must explicitly cast to (void **), e.g., cudaMalloc ((void **) &da, sizeof (float) * N);
cudaFree (void *ptr), e.g., cudaFree (da);
cudaMemset (void *ptr, int value, size_t nbytes), e.g., cudaMemset (da, 0, N * sizeof (float));
Check the CUDA Reference Manual.

4. Copy the initialized CPU data to the GPU:
float *da;
float *ha;
cudaMemcpy ((void *) da,              // DESTINATION
            (void *) ha,              // SOURCE
            sizeof (float) * N,       // #bytes
            cudaMemcpyHostToDevice);  // DIRECTION

Host/Device Data Transfers. The host initiates all transfers: cudaMemcpy (void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction). Asynchronous from the CPU’s perspective: the CPU thread continues; in-order processing with other CUDA requests. enum cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.

5. Define the execution configuration: how many blocks and how many threads per block.
int threads_block = 64;
int blocks = N / threads_block;
if (N % threads_block != 0) blocks += 1;
Alternatively: blocks = (N + threads_block - 1) / threads_block;

6. Launch the kernel & 7. CPU/GPU synchronization. This instructs the GPU to launch blocks x threads_block threads:
darradd <<<blocks, threads_block>>> (da, 10.0f, N);
cudaThreadSynchronize (); // forces the CPU to wait
darradd: the kernel name; <<<…>>>: the execution configuration; (da, 10.0f, N): the arguments (256-byte limit, no variable arguments).

CPU/GPU Synchronization. The CPU does not block on cuda…() calls: kernels/requests are queued and processed in-order, and control returns to the CPU immediately. This is good if there is other work to be done, e.g., preparing for the next kernel invocation. Eventually, the CPU must know when the GPU is done; then it can safely copy the GPU results. cudaThreadSynchronize () blocks the CPU until all preceding cuda…() and kernel requests have completed.

8. Copy data from GPU to CPU & 9. De-allocate memory:
float *da;
float *ha;
cudaMemcpy ((void *) ha,              // DESTINATION
            (void *) da,              // SOURCE
            sizeof (float) * N,       // #bytes
            cudaMemcpyDeviceToHost);  // DIRECTION
cudaFree (da);
// display or process results here
free (ha);

The GPU Kernel:
__global__ void darradd (float *da, float x, int N)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) da[i] = da[i] + x;
}
blockIdx: the unique block ID, numerically ascending: 0, 1, …
blockDim: the dimensions of the block = how many threads it has (blockDim.x, blockDim.y, blockDim.z); unused dimensions default to 1.
threadIdx: the unique per-block thread index: 0, 1, … within each block.

Array Index Calculation Example (assuming blockDim.x = 64):
int i = blockIdx.x * blockDim.x + threadIdx.x;

blockIdx.x   threadIdx.x   i           elements
0            0 … 63        0 … 63      a[0] … a[63]
1            0 … 63        64 … 127    a[64] … a[127]
2            0 … 63        128 … 191   a[128] … a[191]
3            0 …           192 …       a[192] …

Generic Unique Thread and Block Index Calculations #1
1D Grid / 1D Blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x + threadIdx.x;
1D Grid / 2D Blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
1D Grid / 3D Blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y * blockDim.z + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
Source: http://forums.nvidia.com/lofiversion/index.php?t82040.html

Generic Unique Thread and Block Index Calculations #2
2D Grid / 1D Blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.x + threadIdx.x;
2D Grid / 2D Blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
2D Grid / 3D Blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.z * blockDim.y * blockDim.x + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
UniqueThreadIndex means unique per grid.

CUDA Function Declarations. __global__ defines a kernel function: it must return void and can only call __device__ functions. __device__ and __host__ can be used together; two different versions are generated.

Declaration                      Executed on   Only callable from
__device__ float DeviceFunc()    device        device
__global__ void KernelFunc()     device        host
__host__ float HostFunc()        host          host
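
As an illustrative sketch of the combined qualifiers (the function names are invented for the example), the same helper can be compiled for both sides and called from a kernel and from host code:

// Compiled twice: once for the device, once for the host.
__host__ __device__ float clampf (float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

__global__ void clamp_all (float *a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] = clampf (a[i], 0.0f, 1.0f);   // device-side call
}
// Host code can call clampf (x, 0.0f, 1.0f) directly as an ordinary function.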

__device__ Example: add x to a[i] multiple times.
__device__ float addmany (float a, float b, int count)
{
  while (count--) a += b;
  return a;
}
__global__ void darradd (float *da, float x, int N)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) da[i] = addmany (da[i], x, 10);
}

Kernel and Device Function Restrictions. __device__ functions cannot have their address taken, e.g., f = &addmany; (*f)(…); For functions executed on the device: no recursion, e.g., darradd (…) { darradd (…); } (this may be changing in newer versions); no static variable declarations inside the function, e.g., darradd (…) { static int canthavethis; }; no variable number of arguments, e.g., something like printf (…).

My first CUDA Program (the kernel runs on the GPU; main runs on the CPU):
__global__ void arradd (float *a, float f, int N)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) a[i] = a[i] + f;
}

int main()
{
  float h_a[N];
  float *d_a;
  cudaMalloc ((void **) &d_a, SIZE);
  cudaThreadSynchronize ();
  cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
  arradd <<<n_blocks, block_size>>> (d_a, 10.0f, N);
  cudaThreadSynchronize ();
  cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
  CUDA_SAFE_CALL (cudaFree (d_a));
}

How to get high performance #1. Programmer-managed scratchpad memory (“shared memory”): bring data in from global memory and reuse it. 16KB or 48KB per multiprocessor, banked, accessed in parallel by 32 threads. The programmer needs to decide what to bring in and when, and which thread accesses what and when; coordination is paramount.
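
A minimal sketch of the reuse idea (the kernel name, sizes, and the 3-point averaging filter are invented, not from the slides): each block stages a tile of the input in shared memory once, then every thread reads three staged values instead of making three global loads.

__global__ void stencil3 (const float *in, float *out, int N)
{
    // Stage a tile plus a one-element halo on each side in the scratchpad,
    // so each global value is loaded once but read up to three times.
    __shared__ float tile[256 + 2];                  // assumes blockDim.x == 256
    int i  = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int li = threadIdx.x + 1;                        // local index, leaving room for the halo

    tile[li] = (i < N) ? in[i] : 0.0f;               // out-of-range treated as 0
    if (threadIdx.x == 0)              tile[0]      = (i > 0)     ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) tile[li + 1] = (i + 1 < N) ? in[i + 1] : 0.0f;
    __syncthreads();                                 // tile fully loaded before any reads

    if (i < N)
        out[i] = (tile[li - 1] + tile[li] + tile[li + 1]) / 3.0f;
}
// Possible launch: stencil3<<<(N + 255) / 256, 256>>>(d_in, d_out, N);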

How to get high performance #2. Global memory accesses: 32 threads access memory together and can coalesce into a single reference; e.g., a[threadID] works well. Control flow: 32 threads run together; if they diverge there is a performance penalty. Texture cache: use it when you think there is locality.
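
A hedged sketch contrasting the two access patterns mentioned above (the kernel names are invented): consecutive threads touching consecutive elements coalesce; a large stride typically does not.

__global__ void coalesced (float *a, float f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] *= f;      // threads 0..31 of a warp touch a[i]..a[i+31]: one wide access
}

__global__ void strided (float *a, float f, int N, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < N) a[i] *= f;      // neighbouring threads touch far-apart elements: many accesses
}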

Numerical Accuracy. Single-precision FP works; mostly OK, with some minor discrepancies. Double precision works at 1/8 the bandwidth; better on newer hardware. Mixed methods break numbers into two single-precision values but must be carefully checked for stability/correctness. This will get better with next-generation hardware.

Are GPUs really that much faster than CPUs? 50x – 200x speedups are typically reported. Recent work found that not enough effort goes into optimizing code for CPUs: see the Intel paper (ISCA 2010), http://portal.acm.org/ft_gateway.cfm?id=1816021&type=pdf. But the learning curve and expertise needed for CPUs is much larger; then again, so is the potential and flexibility.

Predefined Vector Datatypes. Can be used both in host and in device code: [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]. These are structures accessed with the .x, .y, .z, .w fields, with default constructors “make_TYPE (…)”, e.g., float4 f4 = make_float4 (1.0f, 10.0f, 1.2f, 0.5f); The dim3 type is built on uint3, is used to specify dimensions, and defaults to (1, 1, 1).
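
A small hedged sketch of using a vector type end to end (the kernel name and scale factors are invented):

__global__ void scale_points (float4 *p, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float4 v = p[i];                                                // one 16-byte load
        p[i] = make_float4 (v.x * 2.0f, v.y * 2.0f, v.z * 2.0f, v.w);   // leave .w unchanged
    }
}
// Host side: dim3 block (256); dim3 grid ((N + 255) / 256);
// scale_points<<<grid, block>>>(d_points, N);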

Execution Configuration. Must be specified when calling a __global__ function: <<< Dg, Db [, Ns [, S]] >>> where: dim3 Dg: grid dimensions in blocks; dim3 Db: block dimensions in threads; size_t Ns: additional number of shared-memory bytes to allocate per block (optional, defaults to 0; more on this much later on); cudaStream_t S: the request stream (queue) (optional, defaults to 0; compute capability >= 1.1).
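
A minimal sketch of the Ns parameter (the kernel and variable names are invented): the bytes requested in the launch show up inside the kernel as an unsized extern __shared__ array.

__global__ void uses_dynamic_smem (float *a, int N)
{
    extern __shared__ float buf[];           // sized by the Ns launch parameter
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) buf[threadIdx.x] = a[i];
    __syncthreads();
    if (i < N) a[i] = buf[threadIdx.x] * 2.0f;
}
// Launch, asking for one float per thread of dynamic shared memory:
// uses_dynamic_smem<<<blocks, threads, threads * sizeof (float)>>>(d_a, N);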

Built-in Variables. dim3 gridDim: the number of blocks per grid, usually in 2D (.z is 1 unless a 3D grid is used). uint3 blockIdx: the block ID (blockIdx.z is 0 unless a 3D grid is used). dim3 blockDim: the number of threads per block, in 3D. uint3 threadIdx: the thread ID within the block, in 3D.

Execution Configuration Examples.
1D grid / 1D blocks: dim3 gd(1024); dim3 bd(64); akernel<<<gd, bd>>>(...); so gridDim.x = 1024, gridDim.y = 1, blockDim.x = 64, blockDim.y = 1, blockDim.z = 1.
2D grid / 3D blocks: dim3 gd(4, 128); dim3 bd(64, 16, 4); akernel<<<gd, bd>>>(...); so gridDim.x = 4, gridDim.y = 128, blockDim.x = 64, blockDim.y = 16, blockDim.z = 4.

Error Handling. Most cuda…() functions return a cudaError_t; cudaSuccess means the request completed without a problem. cudaGetLastError() returns the last error to the CPU; use it with cudaThreadSynchronize():
cudaError_t code;
cudaThreadSynchronize ();
code = cudaGetLastError ();
const char *cudaGetErrorString (cudaError_t code); returns a human-readable description of the error code.

Error Handling Utility Function:
void cudaDie (const char *msg)
{
  cudaError_t err;
  cudaThreadSynchronize ();
  err = cudaGetLastError ();
  if (err == cudaSuccess) return;
  fprintf (stderr, "CUDA error: %s: %s.\n", msg, cudaGetErrorString (err));
  exit (EXIT_FAILURE);
}
Adapted from: http://www.ddj.com/hpc-high-performance-computing/207603131

Error Handling Macros. CUDA_SAFE_CALL (some cuda call), e.g., CUDA_SAFE_CALL (cudaMemcpy (a_h, a_d, arr_size, cudaMemcpyDeviceToHost)); prints the error and exits on error. You must #define _DEBUG; no checking code is emitted when it is undefined (for performance). Use make dbg=1 under NVIDIA_CUDA_SDK.
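
For reference, a minimal hand-rolled equivalent (a sketch, not the SDK macro; the name CHECK_CUDA is invented) in case the cutil SDK headers are not available:

#include <stdio.h>
#include <stdlib.h>

// Wrap any cuda...() call: print a readable message and abort on failure.
#define CHECK_CUDA(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf (stderr, "CUDA error %s at %s:%d\n",                 \
                     cudaGetErrorString (err_), __FILE__, __LINE__);     \
            exit (EXIT_FAILURE);                                         \
        }                                                                \
    } while (0)

// Usage: CHECK_CUDA (cudaMemcpy (ha, da, nbytes, cudaMemcpyDeviceToHost));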

Measuring Time -- gettimeofday (Unix-based):
#include <sys/time.h>
#include <time.h>
struct timeval start, end;
gettimeofday (&start, NULL);
/* WHAT WE ARE INTERESTED IN */
gettimeofday (&end, NULL);
timeCpu = (float)(end.tv_sec - start.tv_sec);
if (end.tv_usec < start.tv_usec) {
  timeCpu -= 1.0;
  timeCpu += (double)(1000000.0 + end.tv_usec - start.tv_usec) / 1000000.0;
} else
  timeCpu += (double)(end.tv_usec - start.tv_usec) / 1000000.0;

Using CUDA clock(). clock_t clock(); can be used in device code and returns a counter value; there is one counter per multiprocessor, incremented every clock cycle. Sample at the beginning and end of the code; this gives an upper bound, since threads are time-sliced:
uint start = clock();
... compute (less than 3 sec) ...
uint end = clock();
if (end > start) time = end - start;
else time = end + (0xffffffff - start);
Look at the clock example under projects in the SDK. Using it takes some effort: every thread measures start and end, and then you must find the min start and the max end. Cycle accurate.

Using the cutTimer…() library calls:
#include <cuda.h>
#include <cutil.h>
unsigned int htimer;
cutCreateTimer (&htimer);
cudaThreadSynchronize ();
cutStartTimer (htimer);
/* WHAT WE ARE INTERESTED IN */
cudaThreadSynchronize ();
cutStopTimer (htimer);
printf ("time: %f\n", cutGetTimerValue (htimer));

Code Overview: Host side
#include <cuda.h>
#include <cutil.h>
unsigned int htimer;
float *ha, *da;
main (int argc, char *argv[])
{
  int N = atoi (argv[1]);
  int threads_block = 64;
  int blocks;
  ha = (float *) malloc (sizeof (float) * N);
  for (int i = 0; i < N; i++) ha[i] = i;
  cutCreateTimer (&htimer);
  cudaMalloc ((void **) &da, sizeof (float) * N);
  cudaMemcpy ((void *) da, (void *) ha, sizeof (float) * N, cudaMemcpyHostToDevice);
  blocks = (N + threads_block - 1) / threads_block;
  cudaThreadSynchronize ();
  cutStartTimer (htimer);
  darradd <<<blocks, threads_block>>> (da, 10.0f, N);
  cudaThreadSynchronize ();
  cutStopTimer (htimer);
  cudaMemcpy ((void *) ha, (void *) da, sizeof (float) * N, cudaMemcpyDeviceToHost);
  cudaFree (da);
  free (ha);
  printf ("processing time: %f\n", cutGetTimerValue (htimer));
}

Code Overview: Device Side
__device__ float addmany (float a, float b, int count)
{
  while (count--) a += b;
  return a;
}
__global__ void darradd (float *da, float x, int N)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) da[i] = addmany (da[i], x, 10);
}

Variable Declarations (will revisit next time).
__device__: stored in device memory (large, high latency, no cache); allocated with cudaMalloc (the __device__ qualifier implied); accessible by all threads; lifetime: application.
__constant__: same as __device__, but cached and read-only by the GPU; written by the CPU via a cudaMemcpyToSymbol(...) call; lifetime: application.
__shared__: stored in on-chip shared memory (very low latency); accessible by all threads in the same thread block; lifetime: kernel launch.
Unqualified variables: scalars and built-in vector types are stored in registers; arrays of more than 4 elements or with run-time indices are stored in device memory.
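
A minimal sketch putting the qualifiers together (the variable and kernel names are invented; __constant__ data is filled by the host with cudaMemcpyToSymbol):

__constant__ float c_coeffs[4];              // cached, read-only on the GPU

__global__ void apply_coeffs (float *a, int N)
{
    __shared__ float s_scratch[256];         // per-block, lifetime of the kernel launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;                          // unqualified scalar: lives in a register
    if (i < N) {
        s_scratch[threadIdx.x] = a[i];
        __syncthreads();
        v = s_scratch[threadIdx.x] * c_coeffs[i % 4];
        a[i] = v;
    }
}

// Host side, before launching the kernel:
// float h_coeffs[4] = { 1.0f, 0.5f, 0.25f, 0.125f };
// cudaMemcpyToSymbol (c_coeffs, h_coeffs, sizeof (h_coeffs));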

Measurement Methodology. You will not get exactly the same time measurements every time: other processes are running, and there are external events (e.g., network activity) you cannot control (“non-determinism”). You must take sufficient samples, say 10 or more (there is theory on what the number of samples must be), and measure the average. We will discuss this next time or provide a handout online.

Handling Large Input Data Sets, 1D Example. Recall that gridDim.[xy] <= 65535, so the host calls the kernel multiple times:
float *dac = da; // starting offset for the current kernel
while (n_blocks) {
  int bn = n_blocks;
  int elems; // array elements processed in this kernel
  if (bn > 65535) bn = 65535;
  elems = bn * block_size;
  darradd <<<bn, block_size>>> (dac, 10.0f, elems);
  n_blocks -= bn;
  dac += elems;
}
Better alternative: each thread processes multiple elements (see the sketch below).
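
A minimal sketch of that alternative, a grid-stride loop (this is not the slide's darradd kernel, just an illustration): each thread steps through the array by the total number of launched threads, so a single launch with a bounded grid covers any N.

__global__ void arradd_stride (float *da, float x, int N)
{
    int stride = gridDim.x * blockDim.x;           // total number of launched threads
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += stride)
        da[i] = da[i] + x;                         // each thread handles about N / stride elements
}
// One launch is enough even for very large N, e.g.:
// arradd_stride<<<1024, 256>>>(da, 10.0f, N);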

Course Structure. Lectures: January to the end of March. Assignments: 2-3, starting next week. Project: propose by the end of the first week of March; finish by the end of April; give a presentation (if not too many, in class; otherwise in my office); report: up to 10 pages. Must deliver: presentation, report, and code by the end of the course.

Project. Ideal scenario: team up people with interesting compute problems and people with a strong computer eng./sci. background, on an algorithm/application that has not been converted already; or, try existing solutions and re-create the results, ideally improving on them. The emphasis is on learning and reporting the experience: what went well, and what didn’t and why.

Material. Programming Massively Parallel Processors: A Hands-on Approach, D. Kirk and W.-M. Hwu, http://www.elsevierdirect.com/morgan_kaufmann/kirk/. The OpenCL Programming Book: Parallel Programming for MultiCore CPU and GPU, R. Tsuchiyama, T. Nakamura, and T. Lizuka, http://www.fixstars.com/en/company/books/opencl/. We’ll cover CUDA for the GTX480; at the end we’ll talk about the newest Fermi architecture and AMD’s offerings.

TO DO today. Course site: www.eecg.toronto.edu/~moshovos/CUDA12 (not ready yet); lecture notes will be posted there; try CUDA10 for a recent set of slides. Name and e-mail: e-mail me at moshovos@eecg.toronto.edu, subject: CUDA11. Time: is this slot OK for everyone? Maybe post a doodle to check what other times might work.