Presentation Transcript

Slide1

Intro to GPU’s for Parallel Computing

Slide2

Goals for Rest of Course

Learn how to program massively parallel processors and achieve:
  high performance
  functionality and maintainability
  scalability across future generations
Acquire the technical knowledge required to achieve the above goals:
  principles and patterns of parallel programming
  processor architecture features and constraints
  programming APIs, tools, and techniques
Overview of the architecture first; more architectural detail introduced as we go

Slide3

Equipment

Your own machine, if CUDA-enabled; we will use the CUDA SDK in C
CUDA = Compute Unified Device Architecture
NVIDIA G80 or newer required; the G80 emulator won't quite work
Lab machine: uaa-csetesla.duckdns.org
  Ubuntu
  Two Intel Xeon E5-2609 CPUs @ 2.4 GHz, four cores each
  128 GB memory
  Two NVIDIA Quadro 4000s, each with:
    256 CUDA cores
    1 GHz clock
    2 GB memory

Slide4

Why Massively Parallel Processors

A quiet revolution and potential build-up

2006 calculation rates: 367 GFLOPS (G80 GPU) vs. 32 GFLOPS (CPU)

G80 memory bandwidth: 86.4 GB/s vs. 8.4 GB/s (CPU)

Until recently, programmed through graphics API

GPU in every PC and workstation – massive volume and potential impact

Slide5

[Diagram: a CPU die dominated by control logic and cache with a few ALUs, versus a GPU die dominated by many ALUs with minimal control and cache; each chip attached to its own DRAM.]

CPUs and GPUs have fundamentally different design philosophies

Slide6

Architecture of a CUDA-capable GPU

[Diagram: the host feeds an input assembler and a thread execution manager; an array of streaming multiprocessors, each containing streaming processors and a parallel data cache, connects through texture units and load/store paths to global memory.]

Streaming Processor (SP)
Streaming Multiprocessor (SM): the building block
A Quadro 4000 has 8 SMs, each with 32 SPs (256 CUDA cores total)

Slide7

GT200 Characteristics

1 TFLOPS peak performance (25-50 times that of current high-end microprocessors)

265 GFLOPS sustained for apps such as Visual Molecular Dynamics (VMD)

Massively parallel, 128 cores, 90W

Massively threaded, sustains 1000s of threads per app

30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny, so I have to be careful what I say publicly until I triple check those numbers." - John Stone, VMD group, Physics, UIUC

Slide8

Future Apps Reflect a Concurrent World

Exciting applications in future mass computing markets have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products

These "super-apps" represent and model a physical, concurrent world

Various granularities of parallelism exist, but:
  the programming model must not hinder parallel implementation
  data delivery needs careful management

Slide9

Sample of Previous GPU Projects

Application  Description                                                                          Source (LOC)  Kernel (LOC)  % time in kernel
H.264        SPEC '06 version, change in guess vector                                             34,811        194           35%
LBM          SPEC '06 version, change to single precision and print fewer reports                 1,481         285           >99%
RC5-72       Distributed.net RC5-72 challenge client code                                         1,979         218           >99%
FEM          Finite element modeling, simulation of 3D graded materials                           1,874         146           99%
RPES         Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion                   1,104         281           99%
PNS          Petri Net simulation of a distributed system                                         322           160           >99%
SAXPY        Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine   952           31            >99%
TRACF        Two Point Angular Correlation Function                                               536           98            96%
FDTD         Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation        1,365         93            16%
MRI-Q        Computing a matrix Q, a scanner's configuration in MRI reconstruction                490           33            >99%

Slide10

Speedup of Applications

GeForce 8800 GTX vs. 2.2GHz Opteron 248

10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads

25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized

Slide11

GPU History and CUDA

Slide12

Graphics Pipeline Elements

A scene description: vertices, triangles, colors, lighting

Transformations that map the scene to a camera viewpoint

“Effects”: texturing, shadow mapping, lighting calculations

Rasterizing: converting geometry into pixels

Pixel processing: depth tests, stencil tests, and other per-pixel operations.

Slide13

A Fixed Function GPU Pipeline

[Diagram: the CPU host connects through a host interface to the GPU pipeline: vertex control with a vertex cache, VS/T&L (vertex shading, transform and lighting), triangle setup, raster, shader, ROP (raster operations), and FBI (frame buffer interface), backed by a texture cache and frame buffer memory.]

Slide14

Texture Mapping Example

Texture mapping example: painting a world map texture image onto a globe object.

Slide15

Anti-Aliasing Example

[Figure: a triangle's geometry rendered aliased (jagged edges) vs. anti-aliased (smoothed edges).]

Slide16

Programmable Vertex and Pixel Processors

[Diagram: an example of separate vertex and fragment processors in a programmable graphics pipeline. A 3D application or game issues 3D API commands through OpenGL or Direct3D; across the CPU-GPU boundary, the GPU command and data stream reaches the GPU front end. A vertex index stream of pre-transformed vertices flows through the programmable vertex processor; transformed vertices go to primitive assembly, and the assembled polygons, lines, and points go to rasterization & interpolation. The resulting pixel location stream of rasterized pre-transformed fragments flows through the programmable fragment processor; transformed fragments reach raster ops, which write pixel updates to the framebuffer.]

Slide17

GeForce 8800 GPU

2006 – Mapped the separate programmable graphics stages to an array of unified processors

Logical graphics pipeline visits processors three times with fixed-function graphics logic between visits

Load balancing possible: different rendering algorithms present different loads among the programmable stages, which are dynamically allocated from the unified processors

Functionality of vertex and pixel shaders became identical from the programmer's point of view; a geometry shader was added to process all vertices of a primitive instead of vertices in isolation

Slide18

Unified Graphics Pipeline GeForce 8800

[Diagram: the host feeds a data assembler; vertex, geometry, and pixel thread issue units (with setup/raster/ZCull logic) dispatch work onto a unified thread processor array of SP pairs, each pair sharing an L1 cache and texture filter (TF), all backed by six L2 cache and frame buffer (FB) partitions.]

Slide19

What is (Historical) GPGPU?

General Purpose computation using GPU and graphics API in applications other than 3D graphics

GPU accelerates critical path of application

Data parallel algorithms leverage GPU attributes

Large data arrays, streaming throughput

Fine-grain SIMD parallelism

Low-latency floating point (FP) computation

Applications – see http://gpgpu.org

Game effects (FX) physics, image processing

Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

Slide20

Previous GPGPU Constraints

Dealing with graphics API

Working with the corner cases of the graphics API

Addressing modes

Limited texture size/dimension

Shader capabilities

Limited outputs

Instruction sets: lack of integer & bit ops

Communication limited: between pixels; scatter (a[i] = p) not supported

[Diagram: the early GPGPU execution model: input registers feed a fragment program, which also reads constants, texture, and temp registers (scoped per thread, per shader, and per context) and writes output registers destined for FB memory.]

Slide21

Tesla GPU

NVIDIA developed a more general-purpose GPU

Can program it like a regular processor

Must explicitly declare the data-parallel parts of the workload

Shader processors became fully programmable processors with instruction memory, cache, and sequencing logic

Memory load/store instructions with random byte-addressing capability

Parallel programming model primitives: threads, barrier synchronization, atomic operations

Slide22

CUDA

"Compute Unified Device Architecture"

General-purpose programming model
  User kicks off batches of threads on the GPU
  GPU = dedicated super-threaded, massively data-parallel co-processor

Targeted software stack
  Compute-oriented drivers, language, and tools

Driver for loading computation programs into the GPU
  Standalone driver, optimized for computation
  Interface designed for compute – graphics-free API
  Data sharing with OpenGL buffer objects
  Guaranteed maximum download & readback speeds
  Explicit GPU memory management

Slide23

Parallel Computing on a GPU

8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications

Available in laptops, desktops, and clusters

GPU parallelism is doubling every year

Programming model scales transparently

Programmable in C with CUDA tools

Multithreaded SPMD model uses application data parallelism and thread parallelism

[Photos: GeForce 8800, Tesla S870, Tesla D870]

Slide24

Overview

CUDA programming model – basic concepts and data types

CUDA application programming interface - basic

Simple examples to illustrate basic concepts and functionalities

Performance features will be covered later

Slide25

CUDA – C with no shader limitations!

Integrated host+device app C program
  Serial or modestly parallel parts in host C code
  Highly parallel parts in device SPMD/SIMT kernel C code

Execution alternates between the two, as in the figure and the sketch below:

Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
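For instance, a minimal sketch of such an integrated program (not from the slides; doubleElements and the sizes are hypothetical, and it uses the memory-management calls introduced on later slides):

#include <stdio.h>

// Hypothetical kernel: each of the nBlk*nTid threads doubles one element
__global__ void doubleElements(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 2.0f * data[i];
}

int main() {
    const int nBlk = 100, nTid = 10, n = nBlk * nTid;
    float h[1000];
    for (int i = 0; i < n; i++) h[i] = (float)i;      // serial code (host)

    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    doubleElements<<<nBlk, nTid>>>(d);                // parallel kernel (device)

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[7] = %f\n", h[7]);                      // serial code (host) again
    return 0;
}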

Slide26

CUDA Devices and Threads

A compute device:
  is a coprocessor to the CPU, or host
  has its own DRAM (device memory)
  runs many threads in parallel
  is typically a GPU but can also be another type of parallel processing device

Data-parallel portions of an application are expressed as device kernels which run on many threads

Differences between GPU and CPU threads: GPU threads are extremely lightweight
  Very little creation overhead
  GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few

Slide27

G80 CUDA mode – A Device Example

Processors execute computing threads

New operating mode/HW interface for computing

[Diagram: the same CUDA-capable GPU organization as before: host, input assembler, and thread execution manager feeding an SM array with parallel data caches and texture units, with load/store paths to global memory.]

Slide28

Extended C

Type Qualifiers: global, device, shared, local, host

Keywords: threadIdx, blockIdx

Intrinsics: __syncthreads()

Runtime API: memory, symbol, and execution management; function launch

__device__ float filter[N];

__global__ void convolve(float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];   // each thread stages one element
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);

Slide29

CUDA Platform

Slide30

CUDA Platform


Slide31

Arrays of Parallel Threads

A CUDA kernel is executed by an array of threads

All threads run the same code (SPMD)

Each thread has an ID that it uses to compute memory addresses and make control decisions

[Figure: threads with threadID 0 through 7, each executing:]
float x = input[threadID];
float y = func(x);
output[threadID] = y;

Slide32

Thread Blocks: Scalable Cooperation

[Figure: Thread Block 0, Thread Block 1, ..., Thread Block N - 1, each containing threads 0 through 7 running the same snippet: float x = input[threadID]; float y = func(x); output[threadID] = y;]

Divide the monolithic thread array into multiple blocks

Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)

Threads in different blocks cannot cooperate

Up to 65535 blocks, 512 threads/block
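A minimal sketch of block-level cooperation (not from the slides; reverseInBlock is an illustrative name, and the block size is assumed to be 8 to match the figure): each thread stages one element in shared memory, waits at a barrier, then reads a neighbor's element.

__global__ void reverseInBlock(float *data) {
    __shared__ float buf[8];                     // one slot per thread in the block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[t] = data[base + t];                     // each thread loads one element
    __syncthreads();                             // barrier: all loads must finish first
    data[base + t] = buf[blockDim.x - 1 - t];    // now safe to read another thread's slot
}

// Launch over N blocks of 8 threads: reverseInBlock<<<N, 8>>>(d_data);
// Blocks cannot see each other's buf: cooperation stops at the block boundary.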

Slide33

Block IDs and Thread IDs

We launch a “grid” of “blocks” of “threads”

Each thread uses IDs to decide what data to work on

Block ID: 1D, 2D, or 3D

Usually 1D or 2D

Thread ID: 1D, 2D, or 3D

Simplifies memory addressing when processing multidimensional data:
  image processing
  solving PDEs on volumes
A sketch of the typical index computation is below.
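A minimal sketch (hypothetical kernel and names) of a 2D grid of 2D blocks mapped onto image pixels:

__global__ void brighten(float *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column from block and thread IDs
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row from block and thread IDs
    if (x < width && y < height)                     // guard against partial edge blocks
        img[y * width + x] += 0.1f;
}

// Possible launch: dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_img, width, height);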

Slide34

CUDA Memory Model Overview

Global memory
  Main means of communicating R/W data between host and device
  Contents visible to all threads
  Long latency access

We will focus on global memory for now; constant and texture memory will come later

[Diagram: CUDA memory model: the host reads and writes the grid's global memory; each block has its own shared memory; each thread has its own registers.]

Slide35

CUDA Device Memory Allocation

cudaMalloc()
  Allocates an object in device global memory
  Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object

cudaFree()
  Frees an object from device global memory
  Takes a pointer to the freed object


DON’T use a CPU pointer in a GPU function !

Slide36

CUDA Device Memory Allocation (cont.)

Code example:
  Allocate a 64 * 64 single-precision float array
  Attach the allocated storage to Md
  "d" is often used to indicate a device data structure

int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);

Slide37

CUDA Host-Device Data Transfer

cudaMemcpy()
  Memory data transfer
  Requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer
  Transfer types: host to host, host to device, device to host, device to device
  Non-blocking/asynchronous transfer is available separately (cudaMemcpyAsync())


Slide38

CUDA Host-Device Data Transfer

(cont.)

Code example:

Transfer a 64 * 64 single precision float array

M is in host memory and Md is in device memory

cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // copy M from host to device

cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);   // copy Md from device back to host
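Putting the allocation and transfer calls together (a sketch; M, Md, and size follow the slides' naming, while the scale kernel is hypothetical):

__global__ void scale(float *d) {                   // hypothetical kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d[i] *= 2.0f;
}

// ...on the host:
float M[64 * 64];
for (int i = 0; i < 64 * 64; i++) M[i] = 1.0f;      // fill the host array
float *Md;
int size = 64 * 64 * sizeof(float);

cudaMalloc((void**)&Md, size);                      // allocate device global memory
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);    // host to device
scale<<<64, 64>>>(Md);                              // one thread per element
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);    // device to host
cudaFree(Md);                                       // release device memory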

Slide39

CUDA Keywords

Slide40

CUDA Function Declarations

Declaration                      Executed on the:   Only callable from the:
__device__ float DeviceFunc()    device             device
__global__ void KernelFunc()     device             host
__host__ float HostFunc()        host               host

__global__ defines a kernel function; it must return void

Slide41

CUDA Function Declarations (cont.)

__device__ functions cannot have their address taken

For functions executed on the device:
  no recursion
  no static variable declarations inside the function
  no variable number of arguments
A short sketch showing the qualifiers together follows.
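A minimal sketch (illustrative names) of the three qualifiers working together; the __device__ helper is called from the kernel by an ordinary function call, never through a pointer:

__device__ float square(float x) {            // device-only helper
    return x * x;
}

__global__ void squareAll(float *v) {         // kernel: runs on device, launched from host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    v[i] = square(v[i]);                      // direct call; no address taken
}

__host__ void launchSquareAll(float *d_v, int n) {   // host function (the default)
    squareAll<<<n / 32, 32>>>(d_v);           // assumes n is a multiple of 32
}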

Slide42

Calling a Kernel Function – Thread Creation

A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);            // 5000 thread blocks
dim3 DimBlock(4, 8, 8);           // 256 threads per block
size_t SharedMemBytes = 64;       // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking, as sketched below.
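A sketch of explicit blocking (the fill kernel and d_out pointer are hypothetical; cudaDeviceSynchronize() is the modern call, and older code used cudaThreadSynchronize()):

__global__ void fill(float *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

// ...on the host, with d_out already allocated on the device:
fill<<<100, 256>>>(d_out);     // launch returns to the host immediately
cudaDeviceSynchronize();       // host blocks here until the kernel has finished

Note that a blocking cudaMemcpy() back to the host also waits for prior kernel launches to complete, which is why the earlier transfer examples needed no explicit synchronization call.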

Slide43

Next Time

Code example