
Introduction to CUDA: heterogeneous programming

Katia Oleinik (koleinik@bu.edu)
Scientific Computing and Visualization
Boston University

Architecture

NVIDIA Tesla M2070:
- Core clock: 1.15 GHz
- Single instruction, 448 CUDA cores
- 1.15 GHz x 1 x 448 = 515 Gigaflops double precision (peak)
- 1.03 Tflops single precision (peak)
- 6 GB total dedicated memory
- Delivers performance at about 10% of the cost and 5% of the power of a CPU

Architecture

CUDA: Compute Unified Device Architecture
- A general-purpose parallel computing architecture by NVIDIA
- Supports traditional OpenGL graphics

Architecture

Memory bandwidth: the rate at which data can be read from or stored into memory, expressed in bytes per second.
- Intel Xeon X5650: 32 GB/s
- Tesla M2070: 148 GB/s

Architecture

Tesla M2070 processor:
- Streaming Multiprocessors (SMs): 14
- Streaming Processors on each SM: 32
- Total: 14 x 32 = 448 cores
- Each Streaming Multiprocessor supports up to 1536 concurrent threads (at most 1024 per block).

Architecture

CUDA follows the SIMT philosophy: Single Instruction, Multiple Thread.

Problems that map well onto the GPU are:
- Computationally intensive: the time spent on computation significantly exceeds the time spent on transferring data to and from GPU memory.
- Massively parallel: the computations can be broken down into hundreds or thousands of independent units of work.

Architecture

# Copy tutorial files
scc1 % cp -r /scratch/katia/cuda .

# Request an interactive session on a node with a GPU
scc1 % qrsh -l gpus=1

# Change directory
scc1-ha1 % cd deviceQuery

# Set environment variables to link to CUDA 5.0
scc1-ha1 % module load cuda/5.0

# Execute the deviceQuery program
scc1-ha1 % ./deviceQuery

Architecture

Information that we will need later in this tutorial:

CUDA Driver Version / Runtime Version          5.0 / 5.0
CUDA Capability Major/Minor version number:    2.0
Total amount of global memory:                 5375 MBytes
(14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 32768

CUDA Architecture

Information that we will need later in this tutorial:

Warp size:                                     32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block:           1024
Maximum sizes of each dimension of a block:    1024 x 1024 x 64
Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
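These limits can also be queried from inside a program with cudaGetDeviceProperties(), one of the runtime calls introduced later in this tutorial. A minimal sketch (not part of the original deviceQuery sample), compiled with nvcc:

#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;                 // structure filled in by the CUDA runtime
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device name:             %s\n", prop.name);
    printf("Warp size:               %d\n", prop.warpSize);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
    return 0;
}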

CUDA Architecture

# Change directory
scc1-ha1 % cd bandwidthTest

# Execute the bandwidthTest program
scc1-ha1 % ./bandwidthTest

Query device capabilities and measure GPU/CPU bandwidth. This is a simple test program that measures the memcopy bandwidth of the GPU and the memcpy bandwidth across PCIe.

CUDA Terminology

Device: the GPU and its memory (device memory)
Host: the CPU and its memory (host memory)

CUDA: C Language Extensions

CUDA:
- Based on industry-standard C
- Language extensions allow heterogeneous programming
- APIs for memory and device management

Hello, Cuda!

CUDA: Basic example HelloCuda1.cu

#include <stdio.h>

int main(void) {
    printf("Hello, Cuda! \n");
    return(0);
}

To build the program, use the nvcc compiler:

scc-he1: % nvcc -o helloCuda1 helloCuda1.cu

Hello, Cuda!

CUDA closely follows C/C++ syntax with a minimum set of extensions. A function that executes on the device (GPU) and is called from host code is a kernel and is declared with __global__. A function declared with __device__, e.g.

    __device__ void foo(){ . . . }

executes on the device and is called from other device code.

The NVCC compiler compiles the functions that run on the device; the host compiler (gcc) takes care of all other functions that run on the host (e.g. main()).
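A small side-by-side sketch of the qualifiers (the function names square, scale, and kernel are made up for illustration and are not part of the tutorial files):

// Runs on the GPU, callable only from other GPU code.
__device__ float square(float x) { return x * x; }

// Compiled for both CPU and GPU, callable from either side.
__host__ __device__ float scale(float x) { return 2.0f * x; }

// Kernel: runs on the GPU, launched from the CPU with <<< >>>.
__global__ void kernel(float *out) {
    out[threadIdx.x] = scale(square((float)threadIdx.x));
}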

Hello, Cuda!

CUDA: Basic example HelloCuda2.cu

#include <stdio.h>

__global__ void cudakernel(void) {
    printf("Hello, I am CUDA kernel ! Nice to meet you!\n");
}

Hello, Cuda!

CUDA: Basic example HelloCuda2.cu

int main(void) {
    printf("Hello, Cuda! \n");
    cudakernel<<<1,1>>>();
    cudaDeviceSynchronize();
    printf("Nice to meet you too! Bye, CUDA\n");
    return(0);
}

Hello, Cuda!

CUDA: Basic example HelloCuda2.cu

cudakernel<<<N,M>>>();
cudaDeviceSynchronize();

The triple angle brackets indicate that the function will be executed on the device (GPU); such a function is called a kernel. A kernel always has return type void.

A kernel launch is asynchronous: the program continues immediately after launching the kernel. To prevent the program from finishing before the kernel completes, we have to call cudaDeviceSynchronize().
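In the launch configuration <<<N,M>>>, N is the number of blocks and M is the number of threads per block, so N * M threads run the kernel in total. A small sketch (this combined variant is not one of the tutorial files):

__global__ void cudakernel(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// cudakernel<<<4, 8>>>(); launches 4 blocks of 8 threads: 32 messages, printed in no guaranteed order.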

CUDA: C Language Extensions

There are a number of CUDA functions:

- Device management: cudaGetDeviceCount(), cudaGetDeviceProperties()
- Error management: cudaGetLastError(), cudaSafeCall(), cudaCheckError()
- Device memory management: cudaMalloc(), cudaFree(), cudaMemcpy()
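cudaSafeCall() and cudaCheckError() are helper macros from SDK/utility headers rather than core runtime calls; a common hand-rolled equivalent (a sketch with a made-up macro name) wraps each runtime call and checks the returned cudaError_t:

#include <stdio.h>
#include <stdlib.h>

// Made-up helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void **)&d_A, size));
//   cudakernel<<<blocks, threads>>>();   // a launch returns no status itself,
//   CUDA_CHECK(cudaGetLastError());      // so check the last error afterwards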

Hello, Cuda!

CUDA: Basic example HelloCuda2.cu

To build the program, use the nvcc compiler:

scc-he1: % nvcc -o helloCuda2 helloCuda2.cu -arch sm_20

The ability to print from within a kernel was added in a later generation of the architecture. To request support for Compute Capability 2.0, we need to add this option to the compilation command line.

Hello, Cuda!

CUDA: Basic example HelloCudaBlock.cu

#include <stdio.h>

__global__ void cudakernel(void) {
    printf("Hello, I am CUDA block %d !\n", blockIdx.x);
}

int main(void) {
    . . .
    cudakernel<<<16,1>>>();
    . . .
}

To simplify the compilation process we will use a Makefile:

% make HelloCudaBlock

CUDA: C Language Extensions

CUDA provides special variables for thread identification in the kernel:

dim3 threadIdx;   // thread ID within the block
dim3 blockIdx;    // block ID within the grid
dim3 blockDim;    // number of threads per block
dim3 gridDim;     // number of blocks in the grid

In the simple 1-dimensional case, we use only the first component of each variable, e.g. threadIdx.x
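For two-dimensional data the .y components come into play as well. A sketch (the kernel name and row-major matrix layout are illustrative, not part of the tutorial code):

// Scale every element of a width x height matrix stored in row-major order.
__global__ void scale2D(float *m, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        m[row * width + col] *= 2.0f;
}

// Launched with dim3 configurations, e.g.:
//   dim3 threads(16, 16);
//   dim3 blocks((width + 15) / 16, (height + 15) / 16);
//   scale2D<<<blocks, threads>>>(d_m, width, height);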

CUDA: Blocks and Threads

[Diagram: execution alternates between host and device. Serial code runs on the host, Kernel A is launched as a grid of thread blocks on the device, more serial code runs on the host, then Kernel B is launched as another grid on the device.]

CUDA: C Language Extensions

CUDA: Basic example HelloCudaThread.cu

#include <stdio.h>

__global__ void cudakernel(void) {
    printf("Hello, I am CUDA thread %d !\n", threadIdx.x);
}

int main(void) {
    . . .
    cudakernel<<<1,16>>>();
    . . .
}

CUDA: Blocks and Threads

- One kernel is executed on the device at a time
- Many threads execute each kernel
- Each thread executes the same code (SPMD)
- Threads are grouped into thread blocks
- A kernel is a grid of thread blocks
- Threads are scheduled as sets of warps
- A warp is a group of 32 threads
- An SM executes the same instruction on all threads in a warp
- Blocks cannot synchronize with each other and can run in any order

Vector Addition Example

CUDA: vectorAdd.cu

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}

Vector Addition Example

CUDA: vectorAdd.cu

[Diagram: four blocks (blockIdx.x = 0, 1, 2, 3), each with eight threads (threadIdx.x = 0 .. 7), together covering one contiguous array.]

int i = blockDim.x * blockIdx.x + threadIdx.x;

Unlike blocks, threads have mechanisms to communicate and synchronize.

Vector Addition Example

CUDA: vectorAdd.cu, device memory allocation

int main(void)
{
    . . .
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);
    . . .
}

Vector Addition Example

CUDA: vectorAdd.cu

int main(void)
{
    . . .
    // Copy the input values to the device
    cudaMemcpy(d_A, &A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, &B, size, cudaMemcpyHostToDevice);
    . . .
}

Vector Addition Example

CUDA: vectorAdd.cu

int main(void)
{
    . . .
    // Launch the Vector Add CUDA kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    err = cudaGetLastError();
    . . .
}

Vector Addition Example

CUDA: vectorAdd.cu

int main(void)
{
    . . .
    // Copy the result back to the host
    cudaMemcpy(&C, d_C, size, cudaMemcpyDeviceToHost);

    // Clean up
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    . . .
}
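Putting the pieces from the last few slides together, a minimal end-to-end host program might look like the sketch below. The host array names (h_A, h_B, h_C), the element count, and the omission of error checks are assumptions made for brevity, not the exact contents of vectorAdd.cu:

#include <stdio.h>
#include <stdlib.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) C[i] = A[i] + B[i];
}

int main(void) {
    int numElements = 50000;                     // assumed problem size
    size_t size = numElements * sizeof(float);

    // Host allocation and initialization
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < numElements; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    // Device allocation
    float *d_A = NULL, *d_B = NULL, *d_C = NULL;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy inputs to the device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    // Copy the result back and clean up
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

    printf("h_C[10] = %f (expected %f)\n", h_C[10], h_A[10] + h_B[10]);
    free(h_A); free(h_B); free(h_C);
    return 0;
}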

Timing CUDA kernel

CUDA: vectorAddTime.cu

float memsettime;
cudaEvent_t start, stop;

// initialize CUDA timer
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

// CUDA kernel
. . .

// stop CUDA timer
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&memsettime, start, stop);
printf(" *** CUDA execution time: %f *** \n", memsettime);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Timing CUDA kernel

CUDA: vectorAddTime.cu

scc-ha1 % make

# specify the number of threads per block
scc-ha1 % vectorAddTime 128

Explore the CUDA kernel execution time based on the block size (one way to script this exploration is sketched after the list). Remember:
- The CUDA Streaming Multiprocessor executes threads in warps (32 threads)
- There is a maximum of 1024 threads per block (for our GPU)
- There is a maximum of 1536 threads per multiprocessor (for our GPU)
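A sketch of such a loop, reusing the event-timing calls from the previous slide; it assumes the vectorAdd kernel, the device pointers d_A, d_B, d_C, and numElements have already been set up, so it is a fragment rather than a complete program:

// Time the same kernel with several block sizes.
float ms;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int threadsPerBlock = 32; threadsPerBlock <= 1024; threadsPerBlock *= 2) {
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

    cudaEventRecord(start, 0);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);
    printf("%4d threads per block: %f ms\n", threadsPerBlock, ms);
}

cudaEventDestroy(start);
cudaEventDestroy(stop);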

Dot Product

CUDA: dotProd1.cu

[Diagram: the element-wise products a_i * b_i are computed in parallel and then summed into the single value C.]

C = A * B = (a0, a1, a2, a3) * (b0, b1, b2, b3) = a0*b0 + a1*b1 + a2*b2 + a3*b3

Dot Product

CUDA: dotProd1.cu

- A block of threads shares common memory, called shared memory
- Shared memory is extremely fast, on-chip memory
- To declare shared memory, use the __shared__ keyword (a declaration sketch follows the list)
- Shared memory is not visible to threads in other blocks
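The next slide uses a fixed-size declaration; shared memory can also be sized at launch time with the extern __shared__ form. A sketch with made-up kernel names, not part of dotProd1.cu:

// Static shared memory: size fixed at compile time.
__global__ void staticShared(int *out) {
    __shared__ int temp[512];
    temp[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = temp[blockDim.x - 1 - threadIdx.x];
}

// Dynamic shared memory: size supplied as the third launch parameter.
__global__ void dynamicShared(int *out) {
    extern __shared__ int buf[];       // size chosen at launch time
    buf[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
}

// Launch, e.g.:  dynamicShared<<<1, 512, 512 * sizeof(int)>>>(d_out);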

Dot Product

CUDA: dotProd1.cu

#define N 512

__global__ void dot(int *a, int *b, int *c) {
    // Shared memory for the results of the multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // Thread 0 sums the pairwise products
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < N; i++) sum += temp[i];
        *c = sum;
    }
}

What if thread 0 starts to calculate the sum before the other threads have completed their calculations?

Thread Synchronization

CUDA: dotProd1.cu

#define N 512

__global__ void dot(int *a, int *b, int *c) {
    // Shared memory for the results of the multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    __syncthreads();

    // Thread 0 sums the pairwise products
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < N; i++) sum += temp[i];
        *c = sum;
    }
}

Thread Synchronization

CUDA: dotProd1.cu

int main(void) {
    . . .
    // copy the input vectors to the device
    . . .
    // Launch the CUDA kernel
    dotProductKernel<<<1, N>>>(dev_A, dev_B, dev_C);
    . . .
    // copy the result back from the device
    . . .
}

But our vector length is limited to the maximum block size. Can we use blocks?

Race Condition

CUDA: dotProd2.cu

[Diagram: Block 0 computes a0*b0 + a1*b1 + a2*b2 + a3*b3 into its partial sum, Block 1 computes a4*b4 + a5*b5 + a6*b6 + a7*b7 into its partial sum, and both blocks then add their partial sums into the single result C.]

Race Condition

CUDA: dotProd2.cu

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

__global__ void dotProductKernel(int *a, int *b, int *c) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];

    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < THREADS_PER_BLOCK; i++) sum += temp[i];
        *c += sum;
    }
}

Blocks interfere with each other: a race condition.

Race Condition

CUDA: dotProd2.cu

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

__global__ void dotProductKernel(int *a, int *b, int *c) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];

    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < THREADS_PER_BLOCK; i++) sum += temp[i];
        atomicAdd(c, sum);
    }
}

Atomic Operations

Race conditions: the behavior depends upon the relative timing of multiple event sequences. They can occur when an implied read-modify-write is interruptible.

A read-modify-write that is uninterruptible is atomic. CUDA provides:

atomicAdd()   atomicInc()
atomicSub()   atomicDec()
atomicMin()   atomicExch()
atomicMax()   atomicCAS()
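A minimal sketch of the difference (the kernels and launch configuration are illustrative, not part of dotProd2.cu): with many threads updating one counter, the plain read-modify-write loses updates, while atomicAdd() does not:

// Every thread adds 1 to a single counter.
__global__ void racyCount(int *counter)   { *counter += 1; }           // read-modify-write can interleave
__global__ void atomicCount(int *counter) { atomicAdd(counter, 1); }   // uninterruptible update

// Launched as, say, racyCount<<<1024, 256>>>(d_counter) on a zeroed counter,
// the racy version typically ends far below 1024 * 256; the atomic version
// ends at exactly 1024 * 256.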

CUDA Best Practices

NVIDIA's guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

- Locate the slowest part of the code:
  gcc -O2 -g -pg myprog.c
  gprof ./a.out > profile.txt
- Use CUDA to parallelize the code; use the optimized cu* libraries if possible
- Overlap data transfers and fine-tune operation sequences
- Compare the outcome with the original expectations

CUDA Debugging

CUDA-GDB: a GNU debugger that runs on Linux and Mac:
http://developer.nvidia.com/cuda-gdb

The NVIDIA Parallel Nsight debugging and profiling tool for Microsoft Windows Vista and Windows 7 is available as a free plugin for Microsoft Visual Studio:
http://developer.nvidia.com/nvidia-parallel-nsight

This tutorial has been made possible by the Scientific Computing and Visualization group at Boston University.

Katia Oleinik
koleinik@bu.edu
http://www.bu.edu/tech/research/training/tutorials/list/