Introduction to CUDA Programming
Introduction to Programming Massively Parallel Graphics processors
Andreas Moshovos
moshovos@eecg.toronto.edu
ECE, Univ. of Toronto
Summer 2010
Some slides/material from:
UIUC course by Wen-Mei Hwu and David Kirk
UCSB course by Andrea Di Blas
Universität Jena course by Waqar Saleem
NVIDIA by Simon Green
and others as noted on slides
How to Get High Performance
Computation
Calculations
Data communication/Storage
Tons of Compute Engines
Tons of Storage
Unlimited Bandwidth
Zero/Low Latency
Calculation capabilities
How many calculation units can be built?
Today’s silicon chips
About 1B transistors; ~30K transistors for a 52b multiplier → ~30K multipliers
260mm^2 area (mid-range); ~112 microns^2 per FP unit (overestimated) → ~2K FP units
Frequency ~3GHz is common today → TFLOPs possible
Disclaimer: back-of-the-envelope calculations – take with a grain of salt
Can build lots of calculation units (ALUs)
Tons of Compute Engines?
How about Communication/Storage
Need data feed and storage
The larger the slower
Takes time to get there and back
Multiple cycles even on the same die
Tons of Compute Engines
Tons of Slow Storage
Unlimited Bandwidth
Zero/Low Latency
Is there enough parallelism?
Keep this busy?
Needs lots of independent calculations
Parallelism/Concurrency
Much of what we do is sequential:
First do 1, then do 2, then if X do 3 else do 4
Tons of Compute Engines
Tons of Storage
Unlimited Bandwidth
Zero/Low Latency
Today’s High-End General Purpose Processors
Localize Communication and Computation
Try to automatically extract parallelism
[Figure: memory hierarchy over time – tons of slow storage, a slower cache, and a faster cache next to the processor]
Automatically extract instruction level parallelism
Large on-die caches to tolerate off-chip memory latency
Some things are naturally parallel
Sequential Execution Model
int a[N];   // N is large

for (i = 0; i < N; i++)
    a[i] = a[i] * fade;

Flow of control / Thread
One instruction at a time
Optimizations possible at the machine level
Data Parallel Execution Model / SIMD
int a[N];   // N is large

for all elements do in parallel
    a[index] = a[index] * fade;

This has been tried before: ILLIAC III, UIUC, 1966
Single Program Multiple Data / SPMD
int a[N];   // N is large

for all elements do in parallel
    if (a[i] > threshold) a[i] *= fade;

The model used in today's Graphics Processors
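A minimal sketch of what this SPMD computation looks like as a CUDA kernel, ahead of the material that follows (the kernel name and parameters are illustrative, not from the original slides):

__global__ void fadeKernel (float *a, float threshold, float fade, int N)
{
    // Every thread runs the same program on its own element (SPMD);
    // only threads whose element exceeds the threshold take the branch.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N && a[i] > threshold)
        a[i] *= fade;
}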
CPU vs. GPU overview
CPU:
Handles sequential code well
Can't take advantage of massively parallel code
Off-chip bandwidth lower
Peak computation capability lower
GPU:
Requires massively parallel computation
Handles some control flow
Higher off-chip bandwidth
Higher peak computation capability
Programmer’s view
GPU as a co-processor (2008)
[Figure: CPU with its memory and GPU with its GPU memory (1GB on our systems). CPU↔memory: 6.4GB/s – 31.92GB/s, 8B per transfer; CPU↔GPU link: 3GB/s – 8GB/s; GPU↔GPU memory: 141GB/s]
Target Applications
int a[N];   // N is large

for all elements of a compute
    a[i] = a[i] * fade

Lots of independent computations
CUDA threads need not be independent
Programmer’s View of the GPU
GPU: a compute device that:
Is a coprocessor to the CPU or host
Has its own DRAM (device memory)
Runs many threads in parallel
Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
Why are threads useful? Parallelism
Concurrency:
Do multiple things in parallel
Uses more hardware
Gets higher performance
Needs more functional units
Why are threads useful #2 – Tolerating stalls
Often a thread stalls, e.g., memory access
Multiplex the same functional unit
Get more performance at a fraction of the cost
GPU vs. CPU Threads
GPU threads are extremely lightweight
Very little creation overhead
In the order of microseconds
All done in hardware
GPU needs 1000s of threads for full efficiency
Multi-core CPU needs only a few
Execution Timeline
[Figure: execution timeline. CPU/Host: 1. Copy to GPU mem → 2. Launch GPU kernel → 2'. Synchronize with GPU → 3. Copy from GPU mem. GPU/Device: the kernel runs between steps 2 and 2'.]
Programmer’s view
First create data on CPU memory
Programmer’s view
Then Copy to GPU
Programmer’s view
GPU starts computation: runs a kernel
CPU can also continue
Programmer’s view
CPU and GPU Synchronize
Programmer’s view
Copy results back to CPU
Computation partitioning:
At the highest level:
Think of computation as a series of loops:
for (i = 0; i < big_number; i++)
    a[i] = some function

for (i = 0; i < big_number; i++)
    a[i] = some other function

for (i = 0; i < big_number; i++)
    a[i] = some other function

These loops become Kernels
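A sketch of the mapping for the first loop (illustrative names; each loop body becomes one kernel, and big_number becomes the total thread count):

__global__ void step1 (float *a, int big_number)
{
    // One thread per loop iteration
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < big_number)
        a[i] = a[i] * 2.0f;   // stands in for "some function"
}

// Host side: step1 <<<(big_number + 255) / 256, 256>>> (a, big_number);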
Computation Partitioning -- Kernel
CUDA exposes the hardware to the programmer
Programmer must manually partition work appropriately
Programmer's view is hierarchical:
Think of data as an array
Per Kernel Computation Partitioning
Computation Grid: 2D Case
Threads within a block can communicate/synchronize
Run on the same multiprocessor
Threads across blocks can't communicate
Shouldn't touch each other's data
Behavior undefined
[Figure: a 2D grid of blocks, each block containing a 2D array of threads]
Thread Coordination Overview
Race-free access to data
GBT: Grids of Blocks of Threads
Why? Realities of integrated circuits: need to cluster computation and storage to achieve high speeds
Programmer's view of data and computation partitioning
Block and Thread IDs
Threads and blocks have IDs
So each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
Convenience, not necessity
[Figure: Device, Grid 1 – Blocks (0,0), (1,0), (2,0), (0,1), (1,1), (2,1). Block (1,1) expanded into Threads (0,0) … (4,0), (0,1) … (4,1), (0,2) … (4,2)]
IDs and dimensions are accessible through predefined "variables", e.g., blockDim.x and threadIdx.x
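A sketch of how these variables become array indices in the 2D case (assuming a row-major array; the kernel name is illustrative):

__global__ void scale2D (float *a, int width, int height, float f)
{
    // Combine block and thread IDs to get this thread's (row, col)
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        a[row * width + col] *= f;   // row-major flat index
}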
Execution Model: Ordering
Execution order is undefined
Do not assume or rely on any ordering, e.g.:
Block 0 executes before block 1
Thread 10 executes before thread 20
Or any other ordering, even if you can observe it
Future implementations may break this ordering
It's not part of the CUDA definition
Why? More flexible hardware options
Programmer’s view: Memory Model
Different memories with different uses and performance
Some managed by the compiler
Some must be managed by the programmer
Arrows show whether read and/or write is possible
Execution Model
Summary (for your reference)
Grid of blocks of threads
1D/2D grid of blocks
1D/2D/3D blocks of threads
All blocks are identical: same structure and # of threads
Block execution order is undefined
Threads in the same block: can synchronize and share data fast (shared memory)
Threads from different blocks: cannot cooperate
Communication only through global memory
Threads and blocks have IDs
Simplifies data indexing
Can be 1D or 2D (blocks); 1D, 2D, or 3D (threads)
Blocks do not migrate: they execute on the same processor
Several blocks may run over the same processor
CUDA Software Architecture
cuda…(): runtime API
cu…(): driver API
Libraries, e.g., fft()
Reasoning about CUDA call ordering
GPU communication via cuda…() calls and kernel invocations
e.g., cudaMalloc, cudaMemcpy
Asynchronous from the CPU's perspective
CPU places a request in a "CUDA" queue
Requests are handled in-order
Streams allow for multiple queues
Order within each queue is honored
No order across queues
More on this much later on
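A minimal sketch of the queue/stream idea, ahead of the full treatment later (assumptions: the variable names are illustrative, and cudaMemcpyAsync requires page-locked host memory):

void two_queues (float *h_a, float *h_b, float *d_a, float *d_b, size_t nbytes)
{
    cudaStream_t s1, s2;
    cudaStreamCreate (&s1);
    cudaStreamCreate (&s2);

    // Requests within one stream are handled in-order;
    // s1's copy is unordered with respect to s2's copy.
    cudaMemcpyAsync (d_a, h_a, nbytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync (d_b, h_b, nbytes, cudaMemcpyHostToDevice, s2);

    cudaThreadSynchronize ();   // CPU blocks until both queues drain
    cudaStreamDestroy (s1);
    cudaStreamDestroy (s2);
}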
My first CUDA Program
__global__ void arradd (float *a, float f, int N)   // GPU code
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N) a[i] = a[i] + f;
}

int main()                                          // CPU code
{
   float h_a[N];
   float *d_a;

   cudaMalloc ((void **) &d_a, SIZE);
   cudaThreadSynchronize ();
   cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

   arradd <<<n_blocks, block_size>>> (d_a, 10.0, N);

   cudaThreadSynchronize ();
   cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
   CUDA_SAFE_CALL (cudaFree (d_a));
}
CUDA API: Example
int a[N];

for (i = 0; i < N; i++)
    a[i] = a[i] + x;

1. Allocate CPU data structure
2. Initialize data on CPU
3. Allocate GPU data structure
4. Copy data from CPU to GPU
5. Define execution configuration
6. Run kernel
7. CPU synchronizes with GPU
8. Copy data from GPU to CPU
9. De-allocate GPU and CPU memory
1. Allocate CPU Data
float *ha;

main (int argc, char *argv[])
{
   int N = atoi (argv[1]);
   ha = (float *) malloc (sizeof (float) * N);
   ...
}

No memory allocated on the GPU side
Pinned memory allocation results in faster CPU to/from GPU copies
But pinned memory cannot be paged out
More on this later: cudaMallocHost (…)
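A sketch of the pinned-memory variant of this step (same shape as above, swapping malloc/free for the CUDA calls; whether it pays off depends on transfer sizes):

float *ha;

// Page-locked (pinned) allocation: faster DMA copies to/from the GPU,
// but this memory can never be paged out, so allocate sparingly.
cudaMallocHost ((void **) &ha, sizeof (float) * N);
...
cudaFreeHost (ha);   // release with cudaFreeHost, not free()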
2. Initialize CPU Data (dummy)

float *ha;
int i;

for (i = 0; i < N; i++)
   ha[i] = i;
3. Allocate GPU Data
float *da;

cudaMalloc ((void **) &da, sizeof (float) * N);

Notice: no assignment side
NOT: da = cudaMalloc (…)
The assignment is done internally: that's why we pass &da
Space is allocated in Global Memory on the GPU
GPU Memory Allocation
The host manages GPU memory allocation:

cudaMalloc (void **ptr, size_t nbytes)
Must explicitly cast to (void **):
cudaMalloc ((void **) &da, sizeof (float) * N);

cudaFree (void *ptr);
cudaFree (da);

cudaMemset (void *ptr, int value, size_t nbytes);
cudaMemset (da, 0, N * sizeof (float));

Check the CUDA Reference Manual
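A short sketch putting the three calls together, with basic error checking added (the checking style is our addition, not from the slides; cudaGetErrorString converts an error code into a message):

float *da = NULL;
size_t nbytes = N * sizeof (float);

cudaError_t err = cudaMalloc ((void **) &da, nbytes);
if (err != cudaSuccess)
    printf ("cudaMalloc failed: %s\n", cudaGetErrorString (err));

cudaMemset (da, 0, nbytes);   // zero the array in GPU global memory
...
cudaFree (da);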
4. Copy Initialized CPU data to GPU
float *da;
float *ha;
cudaMemcpy ((void *) da,          // DESTINATION
            (void *) ha,          // SOURCE
            sizeof (float) * N,   // #bytes
            cudaMemcpyHostToDevice);  // DIRECTION
Host/Device Data Transfers
The host initiates all transfers:
cudaMemcpy (void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)

Asynchronous from the CPU's perspective
CPU thread continues
In-order processing with other CUDA requests

enum cudaMemcpyKind:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
5. Define Execution Configuration
How many blocks and threads/block
int threads_block = 64;
int blocks = N / threads_block;
if (N % threads_block != 0) blocks += 1;

Alternatively:
blocks = (N + threads_block - 1) / threads_block;
6. Launch Kernel & 7. CPU/GPU Synchronization
Instructs the GPU to launch blocks x threads_block threads:

darradd <<<blocks, threads_block>>> (da, 10.0f, N);
cudaThreadSynchronize ();   // forces CPU to wait

darradd: kernel name
<<<…>>>: execution configuration (more on this soon)
(da, 10.0f, N): arguments
256-byte limit / no variable arguments
CPU/GPU Synchronization
CPU does not block on cuda…() calls
Kernel/requests are queued and processed in-order
Control returns to the CPU immediately
Good if there is other work to be done
e.g., preparing for the next kernel invocation
Eventually, the CPU must know when the GPU is done
Then it can safely copy the GPU results
cudaThreadSynchronize ()
Blocks the CPU until all preceding cuda…() and kernel requests have completed
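A sketch of the overlap this enables, using the running darradd example (prepare_next_input is a hypothetical host-side function standing in for "other work"):

darradd <<<blocks, threads_block>>> (da, x, N);   // returns immediately

prepare_next_input ();        // CPU work overlaps the GPU's execution

cudaThreadSynchronize ();     // now block: after this, results are ready
cudaMemcpy (ha, da, sizeof (float) * N, cudaMemcpyDeviceToHost);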
8. Copy Data from GPU to CPU & 9. De-allocate Memory

float *da;
float *ha;

cudaMemcpy ((void *) ha,          // DESTINATION
            (void *) da,          // SOURCE
            sizeof (float) * N,   // #bytes
            cudaMemcpyDeviceToHost);  // DIRECTION

cudaFree (da);
// display or process results here
free (ha);
The GPU Kernel
__global__ void darradd (float *da, float x, int N)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N) da[i] = da[i] + x;
}

blockIdx: unique block ID, numerically ascending: 0, 1, …
blockDim: dimensions of the block = how many threads it has
   blockDim.x, blockDim.y, blockDim.z
   Unused dimensions default to 1
threadIdx: unique per-block index: 0, 1, … per block
Array Index Calculation Example
int i = blockIdx.x * blockDim.x + threadIdx.x;

Assuming blockDim.x = 64:

blockIdx.x   threadIdx.x   i           elements
0            0 … 63        0 … 63      a[0] … a[63]
1            0 … 63        64 … 127    a[64] … a[127]
2            0 … 63        128 … 191   a[128] … a[191]
3            0 …           192 …       a[192] …
CUDA Function Declarations
__global__ defines a kernel function
Must return void
Can only call __device__ functions
__device__ and __host__ can be used together
Two different versions are generated

                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void  KernelFunc()     device             host
__host__   float HostFunc()       host               host
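A minimal sketch of the combined __device__ __host__ qualifier (one source function, two compiled versions; the function names are illustrative):

__host__ __device__ float clampf (float v, float lo, float hi)
{
    // The same source compiles to a CPU version and a GPU version
    return v < lo ? lo : (v > hi ? hi : v);
}

__global__ void clampKernel (float *a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] = clampf (a[i], 0.0f, 1.0f);   // device version used here
}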
__device__ Example
Add x to a[i] multiple times
__device__ float addmany (float a, float b, int count)
{
   while (count--) a += b;
   return a;
}

__global__ void darradd (float *da, float x, int N)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N) da[i] = addmany (da[i], x, 10);
}
Kernel and Device Function Restrictions
__device__ functions cannot have their address taken
e.g., f = &addmany; (*f)(…);
For functions executed on the device:
No recursion:
   darradd (…) { darradd (…); }
No static variable declarations inside the function:
   darradd (…) { static int canthavethis; }
No variable number of arguments
e.g., something like printf (…)
My first CUDA Program
__global__ void arradd (float *a, float f, int N)   // GPU code
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N) a[i] = a[i] + f;
}

int main()                                          // CPU code
{
   float h_a[N];
   float *d_a;

   cudaMalloc ((void **) &d_a, SIZE);
   cudaThreadSynchronize ();
   cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

   arradd <<<n_blocks, block_size>>> (d_a, 10.0, N);

   cudaThreadSynchronize ();
   cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
   CUDA_SAFE_CALL (cudaFree (d_a));
}
How to get high-performance #1
Programmer-managed scratchpad memory ("shared memory")
Bring data in from global memory
Reuse it
16KB, banked
Accessed in parallel by 16 threads
Programmer needs to:
Decide what to bring in and when
Decide which thread accesses what and when
Coordination is paramount
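A minimal sketch of the pattern (staging one block's slice of global memory in the scratchpad, which CUDA calls shared memory; BLOCK and the kernel name are illustrative):

#define BLOCK 64

__global__ void fadeShared (float *a, float fade, int N)
{
    __shared__ float tile[BLOCK];          // programmer-managed scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < N) tile[threadIdx.x] = a[i];   // bring in from global memory
    __syncthreads ();                      // coordinate: tile fully loaded

    if (i < N) a[i] = tile[threadIdx.x] * fade;   // reuse from fast memory
}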
How to get high-performance #2
Global memory accesses:
32 threads access memory together
Can coalesce into a single reference
e.g., a[threadID] works well
Control flow:
32 threads run together
If they diverge there is a performance penalty
Texture cache:
When you think there is locality
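A sketch contrasting the two access patterns (the strided kernel is the anti-pattern; stride is an illustrative parameter):

// Coalesced: consecutive threads touch consecutive addresses, so one
// group of 32 loads combines into very few memory transactions.
__global__ void goodAccess (float *a, float f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] *= f;                  // a[threadID] pattern
}

// Strided: consecutive threads touch addresses far apart, so the same
// 32 loads scatter into many transactions.
__global__ void badAccess (float *a, float f, int N, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < N) a[i] *= f;                  // poor coalescing
}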
Are GPUs really that much faster than CPUs?
50x – 200x speedups are typically reported
Recent work found that not enough effort goes into optimizing code for CPUs
But: the learning curve and expertise needed for CPUs is much larger
ECE Overview
ECE Research Profile
Personnel and budget
Partnerships with industry
Our areas of expertise:
Biomedical Engineering
Communications
Computer Engineering
Electromagnetics
Electronics
Energy Systems
Photonics
Systems Control

Slides from F. Najm (Chair) and T. Sargent (Research Vice-Chair)
About our group
Computer Architecture
How to build the best possible system
Best: performance, power, cost, etc.
Expertise in high-end systems
Micro-architecture
Multi-processor and multi-core systems
Current research support: AMD, IBM, NSERC, Qualcomm (planned)
Claims to fame:
Memory dependence prediction
Commercially implemented and licensed
Snoop filtering: IBM Blue Gene
UofT-DRDC Partnership
Examples of industry research contracts with ECE in the past 8 years
AMD
Agile Systems Inc
Altera
ARISE Technologies
Asahi Kasei Microsystems
Bell Canada
Bell Mobility Cellular
Bioscrypt Inc
Broadcom Corporation
Ciclon Semiconductor
Cybermation Inc
Digital Predictive Systems Inc.
DPL Science
Eastman Kodak
Electro Scientific Industries
EMS Technologies
Exar Corp
FOX-TEK
Firan Technology Group
Fuji Electric
Fujitsu
Gennum
H2Green Energy Corporation
Honeywell ASCa, Inc.
Hydro One Networks Inc.
IBM Canada Ltd.
IBM
IMAX Corporation
Intel Corporation
Jazz Semiconductor
KT Micro
LG Electronics
Maxim
MPB Technologies
Microsoft
Motorola
Northrop Grumman
NXP Semiconductors
ON Semiconductor
Ontario Lottery and Gaming Corp
Ontario Power Generation Inc.
Panasonic Semiconductor Singapore
Peraso Technologies Inc.
Philips Electronics North America
Redline Communications Inc.
Research in Motion Ltd.
Right Track CAD
Robert Bosch Corporation
Samsung Thales Co., Ltd
Semiconductor Research Corporation
Siemens Aktiengesellschaft
Sipex Corporation
STMicroelectronics Inc.
Sun Microsystems of Canada Inc.
Telus Mobility
Texas Instruments
Toronto Hydro-Electric System
Toshiba Corporation
Xilinx Inc.
Eight Research Groups
Biomedical Engineering
Communications
Computer Engineering
Electromagnetics
Electronics
Energy Systems
Photonics
Systems Control
Computer Engineering Group
Human-Computer Interaction
Willy Wong, Steve Mann
Multi-sensor information systems: Parham Aarabi
Computer Hardware: Jonathan Rose, Steve Brown, Paul Chow, Jason Anderson
Computer Architecture: Greg Steffan, Andreas Moshovos, Tarek Abdelrahman, Natalie Enright Jerger
Computer Security: David Lie, Ashvin Goel
Biomedical Engineering
Neurosystems
Neurosystems: Berj L. Bardakjian, Roman Genov, Willy Wong, Hans Kunov, Moshe Eizenman
Rehabilitation: Milos Popovic, Tom Chau
Medical Imaging: Michael Joy, Adrian Nachman, Richard Cobbold, Ofer Levi
Proteomics: Brendan Frey, Kevin Truong
Communications Group
Study of the principles, mathematics and algorithms that underpin how information is encoded, exchanged and processed
Three Sub-Groups:
Networks
Signal Processing
Information Theory
Sequence Analysis
Image Analysis and Computer Vision
Computer vision and graphics
Embedded computer vision
Pattern recognition and detection
Networks
Quantum Cryptography and Computing
Computer Engineering
System Software
Michael Stumm, H-A. Jacobsen, Cristiana Amza, Baochun Li
Computer-Aided Design of Circuits: Farid Najm, Andreas Veneris, Jianwen Zhu, Jonathan Rose
Electronics Group
UofT-IBM Partnership
14 active professors; largest electronics group in Canada.
Breadth of research topics:
Electronic device modelling
Semiconductor technology
VLSI CAD and Systems
FPGAs
DSP and Mixed-mode ICs
Biomedical microsystems
High-speed and mm-wave ICs and SoCs
Lab for (on-wafer) SoC and IC testing through 220 GHz
Intelligent Sensory Microsystems
Mixed-signal VLSI circuits
Low-power, low-noise signal processing, computing and ADCs
On-chip micro-sensors
Electrical, chemical, optical
Project examples
Brain-chip interfaces
On-chip biochemical sensors
CMOS imagers
mm-Wave and 100+GHz systems on chip
Modelling mm-wave and noise performance of active and passive devices past 300 GHz.
60-120GHz multi-gigabit data rate phased-array radios
Single-chip 76-79 GHz automotive radar
170 GHz transceiver with on-die antennas
Electromagnetics Group
Metamaterials: From microwaves to optics
Super-resolving lenses for imaging and sensing
Small antennas
Multiband RF components
CMOS phase shifters
Electromagnetics of High-Speed Circuits:
Signal integrity in high-speed digital systems
Microwave integrated circuit design, modeling and characterization
Computational Electromagnetics
Interaction of Electromagnetic Fields with Living Tissue
Antennas: Telecom and Wireless Systems
Reflectarrays
Wave electronics
Integrated antennas
Controlled-beam antennas
Adaptive and diversity antennas
Super-lens capable of resolving details down to λ/6
Small and broadband antennas
Scanning antennas with CMOS MTM chips
METAMATERIALS (MTMs)
Computational Electromagnetics
Fast CAD for RF/optical structures
Modeling of Metamaterials
Plasmonic Left-Handed Media
Leaky-Wave Antennas
Microstrip spiral inductor
Optical power splitter
Energy Systems Group
Power Electronics
High power (> 1.2 MW) converters
Modeling, control, and digital control realization
Micro-Power Grids:
Converters for distributed resources, dc distribution systems, and HVdc systems
Low-Power Electronics:
Integrated power supplies and power management
Systems-on-chip for low-power electronics: computers, cell phones, PDAs, MP3 players, body implants
Harvesting energy from humans
Energy Systems Research
[Figure captions: IC for cell phone power supplies (UofT); Matrix Converter for Micro-Turbine Generator; Voltage Control System for Wind Power Generators]
Photonics Group
Photonics Group: Bio-Photonics
Systems Control Group
Basic & applied research in control engineering
World-leading group in Control theory
Optical Signal-to-Noise Ratio optimization with game theory
Erbium-doped fibre amplifier design
Analysis and design of digital watermarks for authentication
Nonlinear control theory:
Application to magnetic levitation, micro-positioning systems
Distributed control of mobile autonomous robots
Formations, collision avoidance