Introduction to CUDA Programming
Presentation Transcript

Slide 1

Introduction to CUDA Programming

Introduction to Programming Massively Parallel Graphics processors

Andreas Moshovos
moshovos@eecg.toronto.edu
ECE, Univ. of Toronto
Summer 2010

Some slides/material from:
UIUC course by Wen-Mei Hwu and David Kirk
UCSB course by Andrea Di Blas
Universität Jena by Waqar Saleem
NVIDIA by Simon Green
and others as noted on slides

Slide 2

How to Get High Performance

Computation

Calculations

Data communication/Storage

Tons of Compute Engines

Tons of Storage

Unlimited Bandwidth

Zero/Low Latency

Slide 3

Calculation capabilities

How many calculation units can be built?

Today’s silicon chips

About 1B transistors
30K transistors for a 52-bit multiplier
~30K multipliers

260 mm^2 area (mid-range)
112 microns^2 for an FP unit (overestimated)
~2K FP units
Frequency ~3 GHz common today
TFLOPs possible
Disclaimer: back-of-the-envelope calculations; take with a grain of salt
Can build lots of calculation units (ALUs)

Tons of Compute Engines?

Slide 4

How about Communication/Storage

Need data feed and storage

The larger the slower

Takes time to get there and back
Multiple cycles even on the same die

Tons of Compute Engines

Tons of Slow Storage

Unlimited Bandwidth

Zero/Low Latency

Slide 5

Is there enough parallelism?

Keep this busy?

Needs lots of independent calculations

Parallelism/Concurrency
Much of what we do is sequential
First do 1, then do 2, then if X do 3 else do 4

Tons of Compute Engines

Tons of Storage

Unlimited Bandwidth

Zero/Low Latency

Slide 6

Today’s High-End General Purpose Processors

Localize Communication and Computation

Try to automatically extract parallelism

[Diagram: memory hierarchy over time: Tons of Slow Storage, Slower Cache, Faster cache]

Automatically extract instruction level parallelism

Large on-die caches to tolerate off-chip memory latency

Slide 7

Some things are naturally parallel

Slide 8

Sequential Execution Model

int a[N];    // N is large
for (i = 0; i < N; i++)
    a[i] = a[i] * fade;

Flow of control / Thread

One instruction at a time

Optimizations possible at the machine level

Slide 9

Data Parallel Execution Model / SIMD

int a[N];    // N is large
for all elements do in parallel
    a[index] = a[index] * fade;

This has been tried before: ILLIAC III, UIUC, 1966

Slide 10

Single Program Multiple Data / SPMD

int a[N];    // N is large
for all elements do in parallel
    if (a[i] > threshold) a[i] *= fade;

The model used in today's Graphics Processors

Slide 11

CPU vs. GPU overview

CPU:

Handles sequential code well

Can't take advantage of massively parallel code
Off-chip bandwidth lower
Peak computation capability lower

GPU:

Requires massively parallel computation
Handles some control flow
Higher off-chip bandwidth
Higher peak computation capability

Slide 12

Programmer’s view

GPU as a co-processor (2008)

[Diagram: CPU, CPU memory, GPU, GPU memory]
CPU ↔ CPU memory: 6.4 GB/sec – 31.92 GB/sec (8B per transfer)
CPU ↔ GPU: 3 GB/s – 8 GB/s
GPU ↔ GPU memory: 141 GB/sec
GPU memory: 1 GB on our systems

Slide 13

Target Applications

int a[N];    // N is large
for all elements of a, compute
    a[i] = a[i] * fade;

Lots of independent computations
CUDA threads need not be independent

Slide 14

Programmer’s View of the GPU

GPU: a compute device that:

Is a coprocessor to the CPU or host
Has its own DRAM (device memory)
Runs many threads in parallel

Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads.

Slide 15

Why are threads useful? Parallelism

Concurrency:

Do multiple things in parallel

Uses more hardware → Gets higher performance

Needs more functional units

Slide 16

Why are threads useful #2 – Tolerating stalls

Often a thread stalls, e.g., memory access

Multiplex the same functional unit

Get more performance at a fraction of the cost

Slide 17

GPU vs. CPU Threads

GPU threads are extremely lightweight

Very little creation overhead

On the order of microseconds
All done in hardware

GPU needs 1000s of threads for full efficiency
Multi-core CPU needs only a few

Slide 18

Execution Timeline

The CPU / Host, over time:
1. Copy to GPU mem
2. Launch GPU Kernel
2'. Synchronize with GPU
3. Copy from GPU mem

The kernel itself runs on the GPU / Device.

Slide 19

Programmer’s view

First create data on CPU memory

[Diagram: CPU, CPU memory, GPU, GPU memory]

Slide 20

Programmer’s view

Then Copy to GPU

[Diagram: CPU, CPU memory, GPU, GPU memory]

Slide 21

Programmer’s view

GPU starts computation → runs a kernel
CPU can also continue

[Diagram: CPU, CPU memory, GPU, GPU memory]

Slide 22

Programmer’s view

CPU and GPU Synchronize

[Diagram: CPU, CPU memory, GPU, GPU memory]

Slide 23

Programmer’s view

Copy results back to CPU

[Diagram: CPU, CPU memory, GPU, GPU memory]

Slide 24

Computation partitioning:

At the highest level:

Think of computation as a series of loops:

for (i = 0; i < big_number; i++)
    a[i] = some function;
for (i = 0; i < big_number; i++)
    a[i] = some other function;
for (i = 0; i < big_number; i++)
    a[i] = some other function;

Kernels
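As a rough sketch of this mapping (not from the slides): each loop above becomes a kernel, and each iteration becomes one CUDA thread. The kernel name doSomething, the body, and the launch configuration below are made up for illustration.

    // Hypothetical kernel: one thread per iteration of
    //   for (i = 0; i < big_number; i++) a[i] = some function;
    __global__ void doSomething (float *a, int big_number)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's iteration
        if (i < big_number)
            a[i] = 2.0f * a[i];                          // "some function", picked arbitrarily
    }

    // Launch enough 256-thread blocks to cover big_number iterations:
    // doSomething <<< (big_number + 255) / 256, 256 >>> (d_a, big_number);

Slide 25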

Computation Partitioning -- Kernel

CUDA exposes the hardware to the programmer

Programmer must manually partition work appropriately

Programmer's view is hierarchical:
Think of data as an array

Slide 26

Per Kernel Computation Partitioning

Computation Grid: 2D Case

Threads within a block can communicate/synchronize

Run on the same multiprocessor

Threads across blocks can't communicate
Shouldn't touch each other's data
Behavior undefined

[Diagram: 2D grid of blocks; each block is a 2D array of threads]

Slide 27

Thread Coordination Overview

Race-free access to data

Slide 28

GBT: Grids of Blocks of Threads

Why? Realities of integrated circuits: need to cluster computation and storage to achieve high speeds

Programmer's view of data and computation partitioning

Slide 29

Block and Thread IDs

Threads and blocks have IDs

So each thread can decide what data to work on

Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D

Simplifies memory addressing when processing multidimensional data
Convenience, not necessity

[Diagram: Device running Grid 1 with Blocks (0,0) through (2,1); Block (1,1) expanded into Threads (0,0) through (4,2)]

IDs and dimensions are accessible through predefined "variables", e.g., blockDim.x and threadIdx.x
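A small sketch (not from the slides) of how a thread can combine its block and thread IDs to index a 2D array; the kernel name, matrix layout, and launch dimensions are assumptions for illustration.

    // Hypothetical: each thread scales one element of an N x M matrix
    // stored in row-major order. Grid and blocks are both 2D.
    __global__ void scale2D (float *m, int N, int M, float f)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // x direction covers columns
        int row = blockIdx.y * blockDim.y + threadIdx.y;   // y direction covers rows
        if (row < N && col < M)
            m[row * M + col] *= f;
    }

    // Launch with 16x16-thread blocks covering the whole matrix:
    // dim3 block (16, 16);
    // dim3 grid ((M + 15) / 16, (N + 15) / 16);
    // scale2D <<<grid, block>>> (d_m, N, M, 0.5f);

Slide 30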

Execution Model: Ordering

Execution order is undefined

Do not assume and use:
Block 0 executes before block 1
Thread 10 executes before thread 20
And any other ordering, even if you can observe it

Future implementations may break this ordering
It's not part of the CUDA definition
Why? More flexible hardware options

Slide 31

Programmer’s view: Memory Model

Different memories with different uses and performance

Some managed by the compiler

Some must be managed by the programmer

Arrows show whether read and/or write is possible

Slide 32

Execution Model

Summary (for your reference)

Grid of blocks of threads

1D/2D grid of blocks
1D/2D/3D blocks of threads

All blocks are identical: same structure and # of threads
Block execution order is undefined
Same-block threads: can synchronize and share data fast (shared memory)
Threads from different blocks: cannot cooperate
Communication through global memory

Threads and blocks have IDs
Simplifies data indexing
Can be 1D, 2D, or 3D (threads)

Blocks do not migrate: execute on the same processor
Several blocks may run over the same processor

Slide 33

CUDA Software Architecture

cuda…()   (CUDA runtime API)
cu…()     (CUDA driver API)
e.g., fft()   (CUDA libraries)

Slide 34

Reasoning about CUDA call ordering

GPU communication via cuda…() calls and kernel invocations
cudaMalloc, cudaMemcpy

Asynchronous from the CPU's perspective
CPU places a request in a "CUDA" queue
Requests are handled in-order
Streams allow for multiple queues
Order within each queue honored
No order across queues
More on this much later on
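Streams are deferred to later in the course; purely as a hedged sketch of the "multiple queues" idea, the fragment below uses two streams (d_a, d_b, h_a, h_b, SIZE, and the kernels k1/k2 are assumptions, not from the slides).

    cudaStream_t s0, s1;                 // two independent in-order queues
    cudaStreamCreate (&s0);
    cudaStreamCreate (&s1);

    // Requests in the same stream execute in order; across streams no order is guaranteed.
    // (h_a / h_b should be pinned, e.g. via cudaMallocHost, for the async copies to overlap.)
    cudaMemcpyAsync (d_a, h_a, SIZE, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync (d_b, h_b, SIZE, cudaMemcpyHostToDevice, s1);
    k1 <<<blocks, threads, 0, s0>>> (d_a);   // runs after the s0 copy
    k2 <<<blocks, threads, 0, s1>>> (d_b);   // runs after the s1 copy

    cudaStreamSynchronize (s0);          // wait for everything queued in s0
    cudaStreamSynchronize (s1);
    cudaStreamDestroy (s0);
    cudaStreamDestroy (s1);

Slide 35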

My first CUDA Program

// GPU code
__global__ void arradd (float *a, float f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] = a[i] + f;
}

// CPU code (N, SIZE, n_blocks, block_size assumed defined elsewhere)
int main()
{
    float h_a[N];
    float *d_a;

    cudaMalloc ((void **) &d_a, SIZE);
    cudaThreadSynchronize ();
    cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

    arradd <<<n_blocks, block_size>>> (d_a, 10.0, N);

    cudaThreadSynchronize ();
    cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
    CUDA_SAFE_CALL (cudaFree (d_a));
}

Slide 36

CUDA API: Example

int a[N];
for (i = 0; i < N; i++)
    a[i] = a[i] + x;

1. Allocate CPU data structure
2. Initialize data on CPU
3. Allocate GPU data structure
4. Copy data from CPU to GPU
5. Define execution configuration
6. Run kernel
7. CPU synchronizes with GPU
8. Copy data from GPU to CPU
9. De-allocate GPU and CPU memory

Slide 37

1. Allocate CPU Data

float *ha;

main (int argc, char *argv[])
{
    int N = atoi (argv[1]);
    ha = (float *) malloc (sizeof (float) * N);
    ...
}

No memory allocated on the GPU side

Pinned memory allocation results in faster CPU to/from GPU copies
But pinned memory cannot be paged out
More on this later
cudaMallocHost (…)
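A hedged sketch of the pinned alternative mentioned above: same shape as the malloc version, with cudaMallocHost/cudaFreeHost instead of malloc/free.

    // Pageable allocation (as above):
    // ha = (float *) malloc (sizeof (float) * N);

    // Pinned (page-locked) allocation: faster CPU<->GPU copies,
    // but the memory cannot be paged out, so use it sparingly.
    float *ha;
    cudaMallocHost ((void **) &ha, sizeof (float) * N);

    // ... use ha exactly like a malloc'd buffer ...

    cudaFreeHost (ha);   // pinned memory is released with cudaFreeHost, not free()

Slide 38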

2. Initialize CPU

Data (dummy)

float *ha;

int i;

for (i = 0; i < N; i++)
    ha[i] = i;

Slide 39

3. Allocate GPU Data

float *da;

cudaMalloc ((void **) &da, sizeof (float) * N);

Notice: no assignment side
NOT: da = cudaMalloc (…)
Assignment is done internally: that's why we pass &da

Space is allocated in Global Memory on the GPU

Slide 40

GPU Memory Allocation

The host manages GPU memory allocation:

cudaMalloc (void **ptr, size_t nbytes)
Must explicitly cast to (void **)
cudaMalloc ((void **) &da, sizeof (float) * N);

cudaFree (void *ptr);
cudaFree (da);

cudaMemset (void *ptr, int value, size_t nbytes);
cudaMemset (da, 0, N * sizeof (int));

Check the CUDA Reference Manual
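These calls return error codes; below is a small sketch (not from the slides) of the three calls together with a basic check, using standard CUDA runtime functions and the da/N names from the slides. It is a fragment, not a complete program.

    float *da = NULL;
    cudaError_t err;

    err = cudaMalloc ((void **) &da, sizeof (float) * N);   // allocate in GPU global memory
    if (err != cudaSuccess)
        printf ("cudaMalloc failed: %s\n", cudaGetErrorString (err));

    cudaMemset (da, 0, sizeof (float) * N);                 // zero the allocation (byte-wise)

    /* ... kernels and copies using da ... */

    cudaFree (da);                                          // release GPU memory

Slide 41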

4. Copy Initialized CPU data to GPU

float *da;

float *ha;

cudaMemcpy ((void *) da,            // DESTINATION
            (void *) ha,            // SOURCE
            sizeof (float) * N,     // #bytes
            cudaMemcpyHostToDevice);   // DIRECTION

Slide 42

Host/Device Data Transfers

The host initiates all transfers:

cudaMemcpy (void *dst, void *src, size_t nbytes,
            enum cudaMemcpyKind direction)

Asynchronous from the CPU's perspective
CPU thread continues
In-order processing with other CUDA requests

enum cudaMemcpyKind:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice

Slide 43

5. Define Execution Configuration

How many blocks and threads/block

int threads_block = 64;

int blocks = N / threads_block;

if (N % threads_block != 0) blocks += 1;

Alternatively:
blocks = (N + threads_block - 1) / threads_block;

Slide 44

6. Launch Kernel & 7. CPU/GPU Synchronization

Instructs the GPU to launch blocks x threads_block threads:

darradd <<<blocks, threads_block>>> (da, 10.0f, N);
cudaThreadSynchronize ();   // forces CPU to wait

darradd: kernel name
<<<…>>>: execution configuration (more on this soon)
(da, x, N): arguments
256 – 8 byte limit / No variable arguments

Slide 45

CPU/GPU Synchronization

CPU does not block on cuda…() calls
Kernels/requests are queued and processed in-order
Control returns to the CPU immediately
Good if there is other work to be done
e.g., preparing for the next kernel invocation

Eventually, the CPU must know when the GPU is done
Then it can safely copy the GPU results

cudaThreadSynchronize ()
Blocks the CPU until all preceding cuda…() and kernel requests have completed
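A short sketch of that pattern, reusing names from the earlier example; the CPU-side work (prepare_next_input) is a hypothetical placeholder.

    arradd <<<n_blocks, block_size>>> (d_a, 10.0f, N);   // returns immediately; kernel is queued

    prepare_next_input (h_b);        // hypothetical CPU work done while the GPU computes

    cudaThreadSynchronize ();        // now block until the kernel (and earlier requests) finish
    cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);   // safe: results are ready

Slide 46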

8. Copy data from GPU to CPU & 9. DeAllocate Memory

float *da;

float *ha;

cudaMemcpy ((void *) ha,            // DESTINATION
            (void *) da,            // SOURCE
            sizeof (float) * N,     // #bytes
            cudaMemcpyDeviceToHost);   // DIRECTION

cudaFree (da);

// display or process results here
free (ha);

Slide 47

The GPU Kernel

__global__ void darradd (float *da, float x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) da[i] = da[i] + x;
}

blockIdx: Unique Block ID. Numerically ascending: 0, 1, …

blockDim: Dimensions of the Block = how many threads it has
blockDim.x, blockDim.y, blockDim.z
Unused dimensions default to 1

threadIdx: Unique per-Block index
0, 1, … per Block

Slide 48

Array Index Calculation Example

int i = blockIdx.x * blockDim.x + threadIdx.x;

Assuming blockDim.x = 64:

blockIdx.x = 0, threadIdx.x = 0..63  →  i = 0..63     (a[0]..a[63])
blockIdx.x = 1, threadIdx.x = 0..63  →  i = 64..127   (a[64]..a[127])
blockIdx.x = 2, threadIdx.x = 0..63  →  i = 128..191  (a[128]..a[191])
blockIdx.x = 3, threadIdx.x = 0      →  i = 192       (a[192])

Slide 49

CUDA Function Declarations

__global__

defines a kernel function

Must return void
Can only call __device__ functions

__device__ and __host__ can be used together
Two different versions are generated

                                  Executed on the:    Only callable from the:
__device__ float DeviceFunc()     device              device
__global__ void KernelFunc()      device              host
__host__   float HostFunc()       host                host
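A minimal sketch of the combined-qualifier case from the table: one function compiled for both sides and callable from both. The function and kernel below are made up for illustration.

    // Compiled once for the CPU and once for the GPU.
    __host__ __device__ float clampf (float v, float lo, float hi)
    {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    __global__ void clampAll (float *da, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) da[i] = clampf (da[i], 0.0f, 1.0f);   // device-side call
    }

    // On the host: float y = clampf (x, 0.0f, 1.0f);    // host-side call to the same source

Slide 50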

__device__ Example

Add x to a[i] multiple times

__device__ float addmany (float a, float b, int count)
{
    while (count--) a += b;
    return a;
}

__global__ void darradd (float *da, float x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) da[i] = addmany (da[i], x, 10);
}

Slide 51

Kernel and Device Function Restrictions

__device__ functions cannot have their address taken
e.g., f = &addmany; (*f)(…);

For functions executed on the device:

No recursion
darradd (…) { darradd (…); }

No static variable declarations inside the function
darradd (…) { static int canthavethis; }

No variable number of arguments
e.g., something like printf (…)

Slide 52

My first CUDA Program

// GPU code
__global__ void arradd (float *a, float f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] = a[i] + f;
}

// CPU code (N, SIZE, n_blocks, block_size assumed defined elsewhere)
int main()
{
    float h_a[N];
    float *d_a;

    cudaMalloc ((void **) &d_a, SIZE);
    cudaThreadSynchronize ();
    cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

    arradd <<<n_blocks, block_size>>> (d_a, 10.0, N);

    cudaThreadSynchronize ();
    cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
    CUDA_SAFE_CALL (cudaFree (d_a));
}

Slide 53

How to get high-performance #1

Programmer managed Scratchpad memory

Bring data in from global memory

Reuse
16KB / banked
Accessed in parallel by 16 threads

Programmer needs to:
Decide what to bring and when
Decide which thread accesses what and when
Coordination paramount
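A small sketch (not from the slides) of the "bring data in, reuse it" pattern: each block stages a tile of the array in shared memory, synchronizes, then works out of the scratchpad. This toy kernel has no real reuse across threads, so it only illustrates the mechanics; it assumes 256-thread blocks.

    __global__ void fadeShared (float *a, float fade, int N)
    {
        __shared__ float tile[256];                    // programmer-managed scratchpad
        int i = blockIdx.x * blockDim.x + threadIdx.x; // assumes blockDim.x == 256

        if (i < N) tile[threadIdx.x] = a[i];           // 1. bring data in from global memory
        __syncthreads ();                              // 2. wait until the whole block has loaded

        if (i < N) tile[threadIdx.x] *= fade;          // 3. compute out of shared memory
        __syncthreads ();

        if (i < N) a[i] = tile[threadIdx.x];           // 4. write results back to global memory
    }

Slide 54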

How to get high-performance #2

Global memory accesses

32 threads access memory together

Can coalesce into a single reference
E.g., a[threadID] works well

Control flow
32 threads run together
If they diverge there is a performance penalty

Texture cache
When you think there is locality

Are GPUs really that much faster than CPUs?

50x – 200x speedups typically reported

Recent work found

Not enough effort goes into optimizing code for CPUs
But: the learning curve and expertise needed for CPUs is much larger

Slide 56

ECE Overview

ECE research Profile

Personnel and budget

Partnerships with industry

Our areas of expertise:
Biomedical Engineering
Communications
Computer Engineering
Electromagnetics
Electronics
Energy Systems
Photonics
Systems Control

Slides from F. Najm (Chair) and T. Sargent (Research Vice Chair)

Slide 57

About our group

Computer Architecture

How to build the best possible system

Best: performance, power, cost, etc.
Expertise in high-end systems
Micro-architecture
Multi-processor and multi-core systems

Current Research Support: AMD, IBM, NSERC, Qualcomm (planned)

Claims to fame:
Memory Dependence Prediction: commercially implemented and licensed
Snoop Filtering: IBM Blue Gene

Slide 58
Slide 59

UofT-DRDC Partnership

Slide 60
Slide 61

Slide 62

Examples of industry research contracts with ECE in the past 8 years

AMD
Agile Systems Inc
Altera
ARISE Technologies
Asahi Kasei Microsystems
Bell Canada
Bell Mobility Cellular
Bioscrypt Inc
Broadcom Corporation
Ciclon Semiconductor
Cybermation Inc
Digital Predictive Systems Inc.
DPL Science
Eastman Kodak
Electro Scientific Industries
EMS Technologies
Exar Corp
FOX-TEK
Firan Technology Group
Fuji Electric

Fujitsu

Gennum

H2Green Energy Corporation

Honeywell ASCa, Inc.

Hydro One Networks Inc.

IBM Canada Ltd.

IBM

IMAX Corporation

Intel Corporation

Jazz Semiconductor

KT Micro

LG Electronics

Maxim

MPB Technologies

Microsoft

Motorola

Northrop Grumman

NXP Semiconductors

ON Semiconductor

Ontario Lottery and Gaming Corp

Ontario Power Generation Inc.

Panasonic Semiconductor Singapore

Peraso Technologies Inc.

Philips Electronics North America

Redline Communications Inc.

Research in Motion Ltd.

Right Track CAD

Robert Bosch Corporation

Samsung Thales Co., Ltd
Semiconductor Research Corporation
Siemens Aktiengesellschaft
Sipex Corporation
STMicroelectronics Inc.
Sun Microsystems of Canada Inc.
Telus Mobility

Texas Instruments

Toronto Hydro-Electric System

Toshiba Corporation

Xilinx Inc.

Slide 63


Eight Research Groups

Biomedical Engineering

Communications

Computer Engineering
Electromagnetics
Electronics
Energy Systems
Photonics
Systems Control

Slide 64

Computer Engineering Group

Human-Computer Interaction

Willy Wong, Steve Mann

Multi-sensor information systems
Parham Aarabi

Computer Hardware
Jonathan Rose, Steve Brown, Paul Chow, Jason Anderson

Computer Architecture
Greg Steffan, Andreas Moshovos, Tarek Abdelrahman, Natalie Enright Jerger

Computer Security
David Lie, Ashvin Goel

Slide 65


Biomedical Engineering

Neurosystems

Berj L. Bardakjian, Roman Genov.

Willy Wong, Hans Kunov, Moshe Eizenman

Rehabilitation
Milos Popovic, Tom Chau

Medical Imaging
Michael Joy, Adrian Nachman, Richard Cobbold, Ofer Levi

Proteomics
Brendan Frey, Kevin Truong

Slide 66

Communications Group

Study of the principles, mathematics and algorithms that underpin how information is encoded, exchanged and processed

Three Sub-Groups:

Networks

Signal Processing
Information Theory

Slide 67

Sequence Analysis

Slide 68

Image Analysis and Computer Vision

Computer vision and graphics

Embedded computer vision
Pattern recognition and detection

Slide 69

Networks

Slide 70

Quantum Cryptography and Computing

Slide 71

Computer Engineering

System Software

Michael Stumm, H-A. Jacobsen, Cristiana Amza, Baochun Li

Computer-Aided Design of Circuits
Farid Najm, Andreas Veneris, Jianwen Zhu, Jonathan Rose

Slide 72

Electronics Group

UofT-IBM Partnership


14 active professors; largest electronics group in Canada.

Breadth of research topics:

Electronic device modelling

Semiconductor technology

VLSI CAD and Systems

FPGAs

DSP and Mixed-mode ICs

Biomedical microsystems

High-speed and mm-wave ICs and SoCs

Lab for (on-wafer) SoC and IC testing through 220 GHz

Slide 73


Intelligent Sensory Microsystems

Mixed-signal VLSI circuits

Low-power, low-noise signal processing, computing and ADCs

On-chip micro-sensors

Electrical, chemical, optical

Project examples

Brain-chip interfaces

On-chip biochemical sensors

CMOS imagers

Slide 74


mm-Wave and 100+GHz systems on chip

Modelling mm-wave and noise performance of active and passive devices past 300 GHz.

60-120GHz multi-gigabit data rate phased-array radios

Single-chip 76-79 GHz automotive radar

170 GHz transceiver with on-die antennas

Slide 75

Electromagnetics Group

Metamaterials: From microwaves to optics

Super-resolving lenses for imaging and sensing

Small antennas
Multiband RF components
CMOS phase shifters

Electromagnetics of High-Speed Circuits
Signal integrity in high-speed digital systems
Microwave integrated circuit design, modeling and characterization

Computational Electromagnetics

Interaction of Electromagnetic Fields with Living Tissue

Antennas, Telecom and Wireless Systems
Reflectarrays
Wave electronics
Integrated antennas
Controlled-beam antennas
Adaptive and diversity antennas

Slide 76

Super-lens capable of resolving details down to λ/6

Small and broadband antennas

Scanning antennas with CMOS MTM chips

METAMATERIALS (MTMs)

Slide 77

Computational Electromagnetics

Fast CAD for RF/optical structures

Modeling of Metamaterials

Plasmonic Left-Handed Media

Leaky-Wave Antennas

Microstrip spiral inductor
Optical power splitter

Slide 78


Energy Systems Group

Power Electronics

High power (> 1.2 MW) converters

Modeling, control, and digital control realization

Micro-Power Grids
Converters for distributed resources, DC distribution systems, and HVDC systems

Low-Power Electronics
Integrated power supplies and power management
Systems-on-chip for low-power electronics: computers, cell phones, PDAs, MP3 players, body implants
Harvesting energy from humans

Slide 79

Energy Systems Research:
IC for cell phone power supplies (U of T)
Matrix Converter for Micro-Turbine Generator
Voltage Control System for Wind Power Generators

Slide 80

Photonics Group

Slide 81

Photonics Group

Slide 82

Photonics Group

Slide 83

Photonics Group: Bio-Photonics

Slide 84

Basic & applied research in control engineering

World-leading group in Control theory

Optical Signal-to-Noise Ratio optimization with game theory
Erbium-doped fibre amplifier design
Analysis and design of digital watermarks for authentication
Nonlinear control theory: application to magnetic levitation, micro-positioning systems, distributed control of mobile autonomous robots
Formations, collision avoidance

Systems Control Group