Introduction to CUDA: Heterogeneous Programming

Katia Oleinik (koleinik@bu.edu)
Scientific Computing and Visualization
Boston University
Architecture

NVIDIA Tesla M2070:

- Core clock: 1.15 GHz
- Single-instruction execution across 448 CUDA cores
- 1.15 GHz x 448 cores = 515 Gigaflops double precision (peak); 1.03 Teraflops single precision (peak)
- 6 GB total dedicated memory
- Delivers performance at about 10% of the cost and 5% of the power of a comparable CPU-based system
Architecture

CUDA: Compute Unified Device Architecture

- A general-purpose parallel computing architecture by NVIDIA
- Supports traditional OpenGL graphics
Architecture

Memory bandwidth: the rate at which data can be read from or stored into memory, expressed in bytes per second.

- Intel Xeon X5650: 32 GB/s
- Tesla M2070: 148 GB/s
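Where does a number like that come from? As a rough worked example (the clock and bus-width figures here are assumptions about the M2070's GDDR5 interface, not taken from the slides): a 384-bit bus delivering an effective 3.132 GT/s moves 3.132e9 x (384/8) bytes/s, or about 150 GB/s peak, in line with the 148 GB/s quoted above.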
Architecture

Tesla M2070 processor:

- Streaming Multiprocessors (SMs): 14
- Streaming Processors per SM: 32
- Total: 14 x 32 = 448 cores
- Each Streaming Multiprocessor supports up to 1536 resident threads (at most 1024 per block)
Architecture

CUDA follows the SIMT philosophy: Single Instruction, Multiple Thread.

It pays off for problems that are:

- Computationally intensive: the time spent on computation significantly exceeds the time spent transferring data to and from GPU memory.
- Massively parallel: the computation can be broken down into hundreds or thousands of independent units of work.
Architecture

    # Copy tutorial files
    scc1 % cp -r /scratch/katia/cuda .

    # Request an interactive session on a node with a GPU
    scc1 % qrsh -l gpus=1

    # Change directory
    scc1-ha1 % cd deviceQuery

    # Set environment variables to link to CUDA 5.0
    scc1-ha1 % module load cuda/5.0

    # Execute the deviceQuery program
    scc1-ha1 % ./deviceQuery
Architecture

Information that we will need later in this tutorial:

    CUDA Driver Version / Runtime Version          5.0 / 5.0
    CUDA Capability Major/Minor version number:    2.0
    Total amount of global memory:                 5375 MBytes
    (14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores
    Total amount of constant memory:               65536 bytes
    Total amount of shared memory per block:       49152 bytes
    Total number of registers available per block: 32768
CUDA Architecture

Information that we will need later in this tutorial:

    Warp size:                                     32
    Maximum number of threads per multiprocessor:  1536
    Maximum number of threads per block:           1024
    Maximum sizes of each dimension of a block:    1024 x 1024 x 64
    Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
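These limits constrain legal launch configurations on this device: for example, <<<70000, 512>>> would fail because the grid dimension exceeds 65535, and <<<64, 2048>>> would fail because a block may hold at most 1024 threads.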
CUDA Architecture

    # Change directory
    scc1-ha1 % cd bandwidthTest

    # Execute the bandwidthTest program
    scc1-ha1 % ./bandwidthTest

This simple test program queries device capabilities and measures the memcopy bandwidth of the GPU as well as the memcpy bandwidth across the PCI-e bus.
CUDA Terminology

Device: the GPU and its memory (device memory)
Host:   the CPU and its memory (host memory)
CUDA: C Language Extensions

- Based on industry-standard C
- Language extensions allow heterogeneous programming
- APIs for memory and device management
Hello, Cuda!

CUDA: basic example, helloCuda1.cu

    #include <stdio.h>

    int main(void) {
        printf("Hello, Cuda!\n");
        return 0;
    }

To build the program, use the nvcc compiler:

    scc-he1: % nvcc -o helloCuda1 helloCuda1.cu
Hello, Cuda!

CUDA closely follows C/C++ syntax, with a minimal set of extensions. A function to be executed on the device (GPU) and called from device code is declared with the __device__ qualifier:

    __device__ void foo() { . . . }

The nvcc compiler compiles the functions that run on the device; the host compiler (gcc) takes care of all the other functions that run on the host (e.g. main()).
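To make the distinction concrete, here is a minimal sketch (not from the slides): a __device__ helper callable only from device code, used by a __global__ kernel that the host launches.

    __device__ float square(float x) {        // runs on the GPU, called from GPU code
        return x * x;
    }

    __global__ void squareKernel(float *v) {  // runs on the GPU, launched from the host
        v[threadIdx.x] = square(v[threadIdx.x]);
    }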
Hello, Cuda!

CUDA: basic example, helloCuda2.cu

    #include <stdio.h>

    __global__ void cudakernel(void) {
        printf("Hello, I am CUDA kernel! Nice to meet you!\n");
    }
Hello, Cuda!

CUDA: basic example, helloCuda2.cu (continued)

    int main(void) {
        printf("Hello, Cuda!\n");
        cudakernel<<<1,1>>>();
        cudaDeviceSynchronize();
        printf("Nice to meet you too! Bye, CUDA\n");
        return 0;
    }
Hello, Cuda!

    cudakernel<<<N,M>>>();
    cudaDeviceSynchronize();

The triple angle brackets indicate that the function will be executed on the device (GPU); such a function is called a kernel. A kernel always has return type void.

The host program returns immediately after launching the kernel. To prevent the program from finishing before the kernel completes, we have to call cudaDeviceSynchronize().
CUDA: C Language Extensions

There are a number of CUDA functions:

- Device management: cudaGetDeviceCount(), cudaGetDeviceProperties()
- Error management: cudaGetLastError(), cudaSafeCall(), cudaCheckError()
- Device memory management: cudaMalloc(), cudaFree(), cudaMemcpy()
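A minimal sketch of the device-management calls listed above (the fields printed are standard members of the cudaDeviceProp structure):

    #include <stdio.h>

    int main(void) {
        int count;
        cudaGetDeviceCount(&count);            // number of CUDA devices visible
        for (int d = 0; d < count; d++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d); // fill in properties of device d
            printf("Device %d: %s, compute capability %d.%d\n",
                   d, prop.name, prop.major, prop.minor);
        }
        return 0;
    }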
Hello, Cuda!

CUDA: basic example, helloCuda2.cu

To build the program, use the nvcc compiler:

    scc-he1: % nvcc -o helloCuda2 helloCuda2.cu -arch sm_20

The ability to print from within a kernel was added in a later generation of the architecture. To request support for Compute Capability 2.0, we need to add the -arch sm_20 option to the compilation command line.
Hello, Cuda!

CUDA: basic example, HelloCudaBlock.cu

    #include <stdio.h>

    __global__ void cudakernel(void) {
        printf("Hello, I am CUDA block %d!\n", blockIdx.x);
    }

    int main(void) {
        . . .
        cudakernel<<<16,1>>>();
        . . .
    }

To simplify the compilation process we will use a Makefile:

    % make HelloCudaBlock
CUDA: C Language Extensions

CUDA provides special variables for thread identification in the kernel:

    dim3 threadIdx;  // thread ID within the block
    dim3 blockIdx;   // block ID within the grid
    dim3 blockDim;   // number of threads per block
    dim3 gridDim;    // number of blocks in the grid

In the simple one-dimensional case, we use only the first component of each variable, e.g. threadIdx.x (illustrated in the sketch below).
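A small sketch (not from the slides) that prints all four variables for a <<<2,3>>> launch; each of the six threads reports where it sits in the grid:

    #include <stdio.h>

    __global__ void whereAmI(void) {
        printf("block %d of %d, thread %d of %d\n",
               blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
    }

    int main(void) {
        whereAmI<<<2,3>>>();       // 2 blocks x 3 threads each
        cudaDeviceSynchronize();   // wait for the kernel's printf output
        return 0;
    }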
CUDA: Blocks and Threads

[Diagram: execution alternates between host and device. Serial code runs on the host; Kernel A runs on the device; more serial code runs on the host; Kernel B runs on the device.]
CUDA: C Language Extensions

CUDA: basic example, HelloCudaThread.cu

    #include <stdio.h>

    __global__ void cudakernel(void) {
        printf("Hello, I am CUDA thread %d!\n", threadIdx.x);
    }

    int main(void) {
        . . .
        cudakernel<<<1,16>>>();
        . . .
    }
CUDA: Blocks and Threads

- One kernel is executed on the device at a time
- Many threads execute each kernel
- Each thread executes the same code (SPMD)
- Threads are grouped into thread blocks
- A kernel is launched as a grid of thread blocks
- Threads are scheduled as sets of warps; a warp is a group of 32 threads (see the sketch after this list)
- The SM executes the same instruction on all threads in a warp
- Blocks cannot synchronize with each other and can run in any order
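For example (a sketch, not from the slides), a thread can work out which warp it belongs to from its index; warpSize is a built-in device variable, equal to 32 on this hardware:

    #include <stdio.h>

    __global__ void warpInfo(void) {
        int warp = threadIdx.x / warpSize;  // which warp within the block
        int lane = threadIdx.x % warpSize;  // position within the warp
        printf("thread %d: warp %d, lane %d\n", threadIdx.x, warp, lane);
    }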
Vector Addition Example

CUDA: vectorAdd.cu

    __global__ void vectorAdd(const float *A, const float *B,
                              float *C, int numElements) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < numElements) {
            C[i] = A[i] + B[i];
        }
    }
Vector Addition Example

CUDA: vectorAdd.cu

    int i = blockDim.x * blockIdx.x + threadIdx.x;

[Diagram: four blocks (blockIdx.x = 0..3), each with eight threads (threadIdx.x = 0..7); the formula above maps every (block, thread) pair to a unique global index i = 0..31.]

Unlike blocks, threads have mechanisms to communicate and synchronize.
Vector Addition Example

CUDA: vectorAdd.cu, device memory allocation

    int main(void)
    {
        . . .
        float *d_A = NULL;
        err = cudaMalloc((void **)&d_A, size);

        float *d_B = NULL;
        err = cudaMalloc((void **)&d_B, size);

        float *d_C = NULL;
        err = cudaMalloc((void **)&d_C, size);
        . . .
    }
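The examples store each return code in err but never inspect it. A minimal check (a sketch; assumes <stdio.h> and <stdlib.h> are included) could follow each call:

    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }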
Vector Addition Example

CUDA: vectorAdd.cu

    int main(void)
    {
        . . .
        // Copy input values to the device
        cudaMemcpy(d_A, &A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, &B, size, cudaMemcpyHostToDevice);
        . . .
    }
Vector Addition Example

CUDA: vectorAdd.cu

    int main(void)
    {
        . . .
        // Launch the vector-add CUDA kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
        err = cudaGetLastError();
        . . .
    }
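The blocksPerGrid expression is a ceiling division, so every element gets a thread even when numElements is not a multiple of the block size: with numElements = 1000 and threadsPerBlock = 256, blocksPerGrid = (1000 + 255) / 256 = 4, i.e. 1024 threads, and the if (i < numElements) guard in the kernel leaves the extra 24 threads idle.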
Vector Addition Example

CUDA: vectorAdd.cu

    int main(void)
    {
        . . .
        // Copy the result back to the host
        cudaMemcpy(&C, d_C, size, cudaMemcpyDeviceToHost);

        // Clean up
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        . . .
    }
Timing CUDA kernel

CUDA: vectorAddTime.cu

    float memsettime;
    cudaEvent_t start, stop;

    // initialize CUDA timer
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // CUDA kernel
    . . .

    // stop CUDA timer
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&memsettime, start, stop);
    printf(" *** CUDA execution time: %f *** \n", memsettime);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
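cudaEventElapsedTime() returns the elapsed time in milliseconds. From it one can estimate the kernel's effective bandwidth; a sketch (vectorAdd reads two float arrays and writes one, hence the factor of 3):

    // bytes moved = 3 arrays x numElements x sizeof(float);
    // memsettime is in ms, so bytes / (ms * 1e6) gives GB/s
    float gbps = 3.0f * numElements * sizeof(float) / (memsettime * 1.0e6f);
    printf("Effective bandwidth: %f GB/s\n", gbps);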
Timing CUDA kernel

CUDA: vectorAddTime.cu

    scc-ha1 % make

    # specify the number of threads per block
    scc-ha1 % ./vectorAddTime 128

Explore the CUDA kernel execution time as a function of the block size. Remember:

- The CUDA Streaming Multiprocessor executes threads in warps (32 threads)
- There is a maximum of 1024 threads per block (for our GPU)
- There is a maximum of 1536 threads per multiprocessor (for our GPU)
Dot Product

CUDA: dotProd1.cu

[Diagram: the pairwise products a_i * b_i are computed in parallel and then summed into the scalar C.]

C = A · B = (a0, a1, a2, a3) · (b0, b1, b2, b3) = a0*b0 + a1*b1 + a2*b2 + a3*b3
Dot Product

CUDA: dotProd1.cu

- A block of threads shares common memory, called shared memory
- Shared memory is extremely fast on-chip memory
- To declare shared memory, use the __shared__ keyword
- Shared memory is not visible to threads in other blocks
Dot Product

CUDA: dotProd1.cu

    #define N 512

    __global__ void dot(int *a, int *b, int *c) {
        // Shared memory for the results of the multiplication
        __shared__ int temp[N];
        temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

        // Thread 0 sums the pairwise products
        if (threadIdx.x == 0) {
            int sum = 0;
            for (int i = 0; i < N; i++)
                sum += temp[i];
            *c = sum;
        }
    }

What if thread 0 starts to calculate the sum before the other threads have completed their multiplications?
Thread Synchronization

CUDA: dotProd1.cu

    #define N 512

    __global__ void dot(int *a, int *b, int *c) {
        // Shared memory for the results of the multiplication
        __shared__ int temp[N];
        temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

        __syncthreads();

        // Thread 0 sums the pairwise products
        if (threadIdx.x == 0) {
            int sum = 0;
            for (int i = 0; i < N; i++)
                sum += temp[i];
            *c = sum;
        }
    }
Thread Synchronization

CUDA: dotProd1.cu

    int main(void) {
        . . .
        // copy input vectors to the device
        . . .
        // Launch CUDA kernel
        dotProductKernel<<<1, N>>>(dev_A, dev_B, dev_C);
        . . .
        // copy the result back from the device
        . . .
    }

But our vector is limited to the maximum block size. Can we use blocks?
Race Condition

CUDA: dotProd2.cu

[Diagram: Block 0 computes a0*b0 + a1*b1 + a2*b2 + a3*b3 and Block 1 computes a4*b4 + a5*b5 + a6*b6 + a7*b7; each block then adds its partial sum into the single result C.]
Race Condition

CUDA: dotProd2.cu

    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512

    __global__ void dotProductKernel(int *a, int *b, int *c) {
        __shared__ int temp[THREADS_PER_BLOCK];
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        temp[threadIdx.x] = a[index] * b[index];

        __syncthreads();

        if (threadIdx.x == 0) {
            int sum = 0;
            for (int i = 0; i < THREADS_PER_BLOCK; i++)
                sum += temp[i];
            *c += sum;
        }
    }

Blocks interfere with each other: race condition.
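For example (an illustrative interleaving): block 0 and block 1 may both read *c while it still holds 0, each add their own partial sum to that stale value, and write the result back, so one block's contribution is silently lost. The += on *c is a read-modify-write that is not atomic across blocks.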
Race Condition

CUDA: dotProd2.cu

    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512

    __global__ void dotProductKernel(int *a, int *b, int *c) {
        __shared__ int temp[THREADS_PER_BLOCK];
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        temp[threadIdx.x] = a[index] * b[index];

        __syncthreads();

        if (threadIdx.x == 0) {
            int sum = 0;
            for (int i = 0; i < THREADS_PER_BLOCK; i++)
                sum += temp[i];
            atomicAdd(c, sum);
        }
    }
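Note the design choice: each block first reduces its 512 products in fast shared memory and then issues a single atomicAdd() per block, rather than one per thread. Atomic operations on the same address serialize, so keeping their number low keeps contention low.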
Atomic Operations

Race condition: the behavior depends upon the relative timing of multiple event sequences. It can occur when an implied read-modify-write is interruptible.

A read-modify-write that is uninterruptible is called atomic. CUDA provides:

    atomicAdd()    atomicInc()
    atomicSub()    atomicDec()
    atomicMin()    atomicExch()
    atomicMax()    atomicCAS()
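A tiny sketch of the difference (kernel names are illustrative): with atomicAdd() every increment is counted, while with a plain ++ concurrent increments can be lost to exactly the race described above.

    __global__ void countAtomic(int *counter) {
        atomicAdd(counter, 1);   // safe: uninterruptible read-modify-write
    }

    __global__ void countRacy(int *counter) {
        (*counter)++;            // unsafe: concurrent read-modify-writes collide
    }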
CUDA Best Practices

NVIDIA's guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

- Locate the slowest part of the code:

    gcc -O2 -g -pg myprog.c
    gprof ./a.out > profile.txt

- Use CUDA to parallelize the code
- Use optimized cu* libraries where possible
- Overlap data transfers with computation; fine-tune operation sequences
- Compare the outcome with the original expectations
CUDA Debugging

CUDA-GDB, a GNU debugger that runs on Linux and Mac:
http://developer.nvidia.com/cuda-gdb

NVIDIA Parallel Nsight, a debugging and profiling tool for Microsoft Windows Vista and Windows 7, is available as a free plugin for Microsoft Visual Studio:
http://developer.nvidia.com/nvidia-parallel-nsight
This tutorial has been made possible by the Scientific Computing and Visualization group at Boston University.

Katia Oleinik
koleinik@bu.edu
http://www.bu.edu/tech/research/training/tutorials/list/