
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

1

GPU Programming with CUDA

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

2

A quiet revolution and potential build-up

Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU)

Memory Bandwidth: 86.4 GB/s (GPU) vs. 8.4 GB/s (CPU)

Until recently, programmed through graphics API
GPU in every PC and workstation – massive volume and potential impact

[Chart: peak GFLOPS over time for NVIDIA GPUs vs. CPUs. Caption: GPU: A Massively Parallel Processor]
G80 = GeForce 8800 GTX
G71 = GeForce 7900 GTX
G70 = GeForce 7800 GTX
NV40 = GeForce 6800 Ultra
NV35 = GeForce FX 5950 Ultra
NV30 = GeForce FX 5800

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

3

What is GPGPU?

General Purpose computation using GPU and graphics API in applications other than 3D graphics

GPU accelerates critical path of application
Data parallel algorithms leverage GPU attributes
  Large data arrays, streaming throughput
  Fine-grain SIMD parallelism
  Low-latency floating point (FP) computation
Applications – see //GPGPU.org
  Game effects (FX) physics, image processing
  Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

4

Previous GPGPU Constraints

Dealing with graphics API

Working with the corner cases of the graphics API

Addressing modes
  Limited texture size/dimension
Shader capabilities
  Limited outputs
Instruction sets
  Lack of integer & bit ops
Communication limited
  Between pixels
  Scatter a[i] = p

[Diagram: fragment program model: input registers, constants, texture, and temp registers feed a fragment program (resources per thread / per shader / per context); output registers write to FB memory]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

5

CUDA

“Compute Unified Device Architecture”
General purpose programming model
  User kicks off batches of threads on the GPU
  GPU = dedicated super-threaded, massively data parallel co-processor
Targeted software stack
  Compute oriented drivers, language, and tools
Driver for loading computation programs into GPU
  Standalone Driver – Optimized for computation
  Interface designed for compute – graphics-free API
  Data sharing with OpenGL buffer objects
  Guaranteed maximum download & readback speeds
  Explicit GPU memory management

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

6

Parallel Computing on a GPU

8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
  Available in laptops, desktops, and clusters
GPU parallelism is doubling every year
  Programming model scales transparently
Programmable in C with CUDA tools
  Multithreaded SPMD model uses application data parallelism and thread parallelism

[Images: GeForce 8800, Tesla S870, Tesla D870]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

7

CUDA – C with no shader limitations!

Integrated host+device app C program

Serial or modestly parallel parts in host C code
Highly parallel parts in device SPMD kernel C code

Serial Code (host)
  . . .
Parallel Kernel (device)
  KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
  . . .
Parallel Kernel (device)
  KernelB<<< nBlk, nTid >>>(args);
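The slides do not spell out the surrounding host program; a minimal integrated .cu file with this serial/parallel structure could look like the following sketch (KernelA, nBlk, and nTid are the names used above; the kernel body, the data size, and the cudaMemset initialization are illustrative assumptions):

#include <cuda_runtime.h>

__global__ void KernelA(float *data, int n)          // highly parallel part (device)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i];
}

int main(void)
{
    const int n = 1024;                              // serial host code
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int nTid = 256;                                  // threads per block
    int nBlk = (n + nTid - 1) / nTid;                // blocks in the grid
    KernelA<<< nBlk, nTid >>>(d_data, n);            // parallel kernel on the device

    cudaFree(d_data);                                // serial host code again
    return 0;
}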

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

8

CUDA Devices and Threads

A compute device
  Is a coprocessor to the CPU or host
  Has its own DRAM (device memory)
  Runs many threads in parallel
  Is typically a GPU but can also be another type of parallel processing device
Data-parallel portions of an application are expressed as device kernels, which run on many threads
Differences between GPU and CPU threads
  GPU threads are extremely lightweight
    Very little creation overhead
  GPU needs 1000s of threads for full efficiency
    Multi-core CPU needs only a few

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

9

The future of GPUs is programmable processing
So – build the architecture around the processor

[Diagram: G80 – Graphics Mode: host, input assembler, vertex/geometry/pixel thread issue, setup/raster/ZCull, and a thread processor feeding arrays of SPs with L1 caches and texture fetch (TF) units, backed by L2 caches and frame buffer (FB) partitions]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

10

G80 CUDA mode – A Device Example

Processors execute computing threads
New operating mode/HW interface for computing

[Diagram: host, input assembler, and thread execution manager feed an array of processors, each with a parallel data cache and texture unit; load/store paths connect to global memory]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

11

Extended C

Declspecs
  global, device, shared, local, constant
Keywords
  threadIdx, blockIdx
Intrinsics
  __syncthreads
Runtime API
  Memory, symbol, execution management
Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
  __shared__ float region[M];
  ...
  region[threadIdx.x] = image[i];
  __syncthreads();
  ...
  image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

12

[Compilation flow] Integrated source (foo.cu) → cudacc (EDG C/C++ frontend):
  CPU Host Code (foo.cpp) → gcc / cl
  Open64 Global Optimizer → GPU Assembly (foo.s) → OCG → G80 SASS (foo.sass)

Mark Murphy, “NVIDIA’s Experience with Open64,”
www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

13

CUDA API Highlights: Easy and Lightweight

The API is an extension to the ANSI C programming language
  Low learning curve
The hardware is designed to enable lightweight runtime and driver
  High performance

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana-Champaign

14

CUDA Thread Block

All threads in a block execute the same kernel program (SPMD)

Programmer declares block:
  Block size 1 to 512 concurrent threads
  Block shape 1D, 2D, or 3D
  Block dimensions in threads
Threads have thread id numbers within block
  Thread program uses thread id to select work and address shared data
Threads in the same block share data and synchronize while doing their share of the work
Threads in different blocks cannot cooperate
Each block can execute in any order relative to other blocks!

[Figure: CUDA Thread Block: Thread Id #: 0 1 2 3 … m, all running the same thread program]

Courtesy: John Nickolls, NVIDIA

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

15

Thread Blocks: Scalable Cooperation

Divide monolithic thread array into multiple blocks
  Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
  Threads in different blocks cannot cooperate

[Figure: Thread Block 0 … Thread Block N-1, each with threads 0…7 running the same thread program]

float x = input[threadID];
float y = func(x);
output[threadID] = y;
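The threadID used in the snippet above is not a built-in variable; it is conventionally computed from the built-in blockIdx, blockDim, and threadIdx values so that every thread in the grid gets a unique index. A minimal sketch (the kernel and array names are illustrative, and x * x stands in for the slide's func(x)):

__global__ void apply_func(const float *input, float *output, int n)
{
    // Global thread ID: block offset plus thread offset within the block
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {                  // guard against a partial last block
        float x = input[threadID];
        float y = x * x;                 // stands in for func(x)
        output[threadID] = y;
    }
}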

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana-Champaign

16

Transparent Scalability

Hardware is free to assign blocks to any processor at any time
  A kernel scales across any number of parallel processors
Each block can execute in any order relative to other blocks.

[Figure: the same kernel grid of Blocks 0–7 mapped onto two different devices; a smaller device runs fewer blocks at a time than a larger one, with time running downward]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana-Champaign

17

G80 Example: Executing Thread Blocks

Threads are assigned to Streaming Multiprocessors in block granularity
  Up to 8 blocks to each SM as resource allows
  SM in G80 can take up to 768 threads
    Could be 256 (threads/block) * 3 blocks
    Or 128 (threads/block) * 6 blocks, etc.
Threads run concurrently
  SM maintains thread/block id #s
  SM manages/schedules thread execution

[Diagram: SM 0 and SM 1, each with an MT issue unit, SPs, and shared memory, holding blocks of threads t0 t1 t2 … tm; flexible resource allocation]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana-Champaign

18

G80 Example: Thread Scheduling

Each Block is executed as 32-thread Warps

An implementation decision, not part of the CUDA programming model
Warps are scheduling units in SM
If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM?

Each Block is divided into 256/32 = 8 Warps

There are 8 * 3 = 24 Warps

[Diagram: Streaming Multiprocessor with instruction L1, instruction fetch/dispatch, shared memory, SPs, and SFUs, holding Block 1 and Block 2 warps of threads t0 t1 t2 … t31]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana-Champaign

19

G80 Example: Thread Scheduling

SM implements zero-overhead warp scheduling
  At any time, only one of the warps is executed by an SM
  Warps whose next instruction has its operands ready for consumption are eligible for execution
  Eligible Warps are selected for execution on a prioritized scheduling policy
  All threads in a warp execute the same instruction when selected

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

20

Block IDs and Thread IDs

Each thread uses IDs to decide what data to work on

Block ID: 1D or 2D

Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data (see the sketch below)
  Image processing
  Solving PDEs on volumes
  …
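For instance, an image-processing kernel can use 2D block and thread IDs directly as pixel coordinates. A sketch (the kernel name, parameters, and row-major layout are illustrative assumptions):

__global__ void brighten(float *image, int width, int height, float delta)
{
    // 2D global coordinates built from 2D block IDs and thread IDs
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        image[y * width + x] += delta;   // row-major pixel addressing
}

Such a kernel would be launched with a 2D configuration, e.g. dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16);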

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

21

Terminology

Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
  The unit of parallelism in CUDA
Warp: a group of threads executed physically in parallel in G80
Block: a group of threads that are executed together and form the unit of resource assignment
Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana Champaign

22

Memories

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

23

CUDA Memory Model Overview

Global memory

Main means of communicating R/W data between host and device
Contents visible to all threads

Long latency access

We will focus on global memory for now

Constant and texture memory will come later

[Diagram: CUDA memory model: the Host reads/writes the Grid's Global Memory; each Block (0, 0) and (1, 0) has its own Shared Memory; each Thread (0, 0) and (1, 0) has its own Registers]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

24

CUDA Device Memory Allocation

cudaMalloc()

Allocates object in the device Global Memory

Requires two parameters

Address of a pointer to the allocated object

Size of allocated object

cudaFree()
  Frees object from device Global Memory
  Pointer to freed object


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

25

CUDA Device Memory Allocation (cont.)‏

Code example:

Allocate a 64 * 64 single precision float array

Attach the allocated storage to Md

“d” is often used to indicate a device data structure

const int TILE_WIDTH = 64;
float *Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

26

CUDA Host-Device Data Transfer

cudaMemcpy()
  Memory data transfer

Requires four parameters

Pointer to destination

Pointer to source

Number of bytes copied

Type of transfer
  Host to Host
  Host to Device
  Device to Host
  Device to Device

Asynchronous transfer


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

27

CUDA Host-Device Data Transfer (cont.)

Code example:

Transfer a 64 * 64 single precision float array
M is in host memory and Md is in device memory

cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
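Putting the allocation and transfer calls together, a typical host-side sequence around a kernel launch might look like this sketch (the host-side malloc/free, the initialization, and the elided kernel launch are illustrative; M, Md, and size follow the slides):

int size = 64 * 64 * sizeof(float);
float *M = (float *)malloc(size);                  // host array, assumed to be filled with input data
float *Md;
cudaMalloc((void**)&Md, size);                     // device array

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host to device
// ... launch a kernel that reads and writes Md here ...
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);   // device to host

cudaFree(Md);
free(M);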

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

28

CUDA Function Declarations

                                Executed on the:    Only callable from the:
__host__   float HostFunc()     host                host
__global__ void  KernelFunc()   device              host
__device__ float DeviceFunc()   device              device

__global__ defines a kernel function
  Must return void
__device__ and __host__ can be used together
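For example, tagging a small helper with both __host__ and __device__ makes the compiler emit one version for the CPU and one for the GPU. A sketch (the function and kernel names are illustrative, not from the slides):

__host__ __device__ float clampf(float x, float lo, float hi)
{
    return (x < lo) ? lo : (x > hi) ? hi : x;
}

__global__ void clampKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clampf(data[i], 0.0f, 1.0f);   // device-side call
}

// The same clampf() can also be called from ordinary host code.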

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

29

CUDA Function Declarations (cont.)‏

__device__ functions cannot have their address taken
For functions executed on the device:

No recursion

No static variable declarations inside the function

No variable number of arguments

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

30

Calling a Kernel Function – Thread Creation

A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);

dim3 DimGrid(100, 50);           // 5000 thread blocks
dim3 DimBlock(4, 8, 8);          // 256 threads per block
size_t SharedMemBytes = 64;      // 64 bytes of shared memory

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synch is needed for blocking (see the sketch below)
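Because the launch returns immediately, a host thread that needs the kernel's results must synchronize explicitly before using them. A sketch, assuming the CUDA runtime of this era where cudaThreadSynchronize() is the blocking call (later toolkits renamed it cudaDeviceSynchronize()):

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);
cudaThreadSynchronize();    // block the host until the kernel has completed
// from here on it is safe to read results back, e.g. with cudaMemcpy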

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana Champaign

31

G80 Implementation of CUDA Memories

Each thread can:

Read/write per-thread registers
Read/write per-thread local memory
Read/write per-block shared memory
Read/write per-grid global memory
Read-only per-grid constant memory

[Diagram: the Host reads/writes the Grid's Global Memory and Constant Memory; each Block has Shared Memory; each Thread has Registers]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana Champaign

32

CUDA Variable Type Qualifiers

__device__ is optional when used with __local__, __shared__, or __constant__
Automatic variables without any qualifier reside in a register
  Except arrays, which reside in local memory

Variable declaration                          Memory      Scope     Lifetime
__device__ __local__    int LocalVar;         local       thread    thread
__device__ __shared__   int SharedVar;        shared      block     block
__device__              int GlobalVar;        global      grid      application
__device__ __constant__ int ConstantVar;      constant    grid      application
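A sketch of where each kind of declaration typically appears in a .cu file (all names and sizes are illustrative, not from the slides):

__constant__ float coeffs[16];         // constant memory: grid scope, application lifetime
__device__   float scratch[1024];      // global memory: grid scope, application lifetime

__global__ void example_kernel(const float *in)   // assumes blockDim.x == 128 and at most 8 blocks
{
    __shared__ float tile[128];        // shared memory: one copy per block, block lifetime
    int   tid = threadIdx.x;           // automatic scalar: lives in a register
    float acc[2];                      // automatic array: resides in local memory
    int   gid = blockIdx.x * blockDim.x + tid;

    tile[tid]    = in[gid];
    acc[0]       = tile[tid] * coeffs[tid % 16];
    acc[1]       = acc[0] + 1.0f;
    scratch[gid] = acc[1];
}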

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana Champaign

33

Where to Declare Variables?

[Figure: variables accessible by the host (global, constant) are declared outside of any function; device-only variables (register/automatic, shared, local) are declared in the kernel]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana Champaign

34

Variable Type Restrictions

Pointers can only point to memory allocated or declared in global memory:
  Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
  Obtained as the address of a global variable: float* ptr = &GlobalVar;

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana Champaign

35

A Common Programming Strategy

Global memory resides in device memory (DRAM) - much slower access than shared memory

So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory (see the sketch below):
  Partition data into subsets that fit into shared memory
  Handle each data subset with one thread block by:
    Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    Copying results from shared memory to global memory
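A minimal sketch of this load / compute / write-back pattern: a hypothetical kernel that subtracts each tile's mean from its elements, launched with TILE_SIZE threads per block (names, sizes, and the computation itself are illustrative):

#define TILE_SIZE 256                  // must match the block size used at launch

__global__ void subtract_tile_mean(float *data, int n)
{
    __shared__ float tile[TILE_SIZE];  // the per-block subset

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Load the subset from global memory into shared memory
    tile[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();                   // wait until the whole tile is loaded

    // 2. Compute on the subset: every thread can cheaply multi-pass over the tile
    float sum = 0.0f;
    for (int j = 0; j < TILE_SIZE; ++j)
        sum += tile[j];
    float mean = sum / TILE_SIZE;      // zero padding slightly biases a partial last block

    // 3. Copy results from shared memory back to global memory
    if (i < n)
        data[i] = tile[threadIdx.x] - mean;
}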

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana Champaign

36

A Common Programming Strategy (Cont.)

Constant memory also resides in device memory (DRAM) - much slower access than shared memory

But… cached!

Highly efficient access for read-only data
Carefully divide data according to access patterns
  R/Only → constant memory (very fast if in cache)
  R/W shared within Block → shared memory (very fast)
  R/W within each thread → registers (very fast)
  R/W inputs/results → global memory (very slow)
For texture memory usage, see the NVIDIA documentation.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE498AL, University of Illinois, Urbana Champaign

37

GPU Atomic Integer Operations

Atomic operations on integers in global memory:

Associative operations on signed/unsigned ints
  add, sub, min, max, ...
  and, or, xor
  increment, decrement
  exchange, compare and swap
Requires hardware with compute capability 1.1 and above (see the sketch below).
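For example, many threads can safely increment shared counters in global memory with atomicAdd. A sketch of a hypothetical histogram kernel (names are illustrative; it needs a compute capability 1.1+ device, matching the note above):

__global__ void histogram(const unsigned int *input, int n,
                          unsigned int *bins, int numBins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int b = input[i] % numBins;
        atomicAdd(&bins[b], 1u);   // atomic read-modify-write: safe even when many threads hit the same bin
    }
}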

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

38

SM Register File

Register File (RF)

32 KB (8K entries) for each SM in G80

TEX pipe can also read/write RF
  2 SMs share 1 TEX
Load/Store pipe can also read/write RF

[Diagram: SM datapath: instruction L1 cache, multithreaded instruction buffer, constant L1 cache, register file (RF), shared memory, operand select, MAD and SFU units]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

39

Programmer View of Register File

There are 8192 registers in each SM in G80

This is an implementation decision, not part of CUDA

Registers are dynamically partitioned across all blocks assigned to the SM
Once assigned to a block, a register is NOT accessible by threads in other blocks
Each thread in the same block only accesses registers assigned to itself

[Figure: the SM register file split among 4 blocks vs. 3 blocks]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

40

Example

If each Block has 16 × 16 threads and each thread uses 10 registers, how many threads can run on each SM?

Each block requires 10*256 = 2560 registers

8192 = 3 * 2560 + change
So, three blocks can run on an SM as far as registers are concerned
How about if each thread increases the use of registers by 1?
  Each Block now requires 11 * 256 = 2816 registers
  8192 < 2816 * 3
  Only two Blocks can run on an SM, a 1/3 reduction of parallelism!

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

41

More on Dynamic Partitioning

Dynamic partitioning gives more flexibility to compilers/programmers

One can run a smaller number of threads that require many registers each or a large number of threads that require few registers each

This allows for finer grain threading than traditional CPU threading models
The compiler can trade off between instruction-level parallelism and thread-level parallelism

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

42

ILP vs. TLP Example

Assume that a kernel has 256-thread Blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers; global loads take 200 cycles
  3 Blocks can run on each SM
If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load
  Only two Blocks can run on each SM
  However, one only needs 200/(8*4) = 7 Warps to tolerate the memory latency
  Two Blocks have 16 Warps, so the performance can actually be higher!

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

43

Memory Coalescing

When accessing global memory, peak performance utilization occurs when all threads in a half-warp access contiguous memory locations.

[Figure: Thread 1 and Thread 2 accessing matrices Md and Nd (each WIDTH wide); the access pattern into Md is not coalesced, the pattern into Nd is coalesced]
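A sketch contrasting the two patterns (kernel names and shapes are illustrative): in the first, thread k of a half-warp reads word base + k, so the 16 accesses are contiguous and coalesce; in the second, consecutive threads read words that are stride apart and the accesses do not coalesce:

// Coalesced: consecutive threads read consecutive 32-bit words
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Not coalesced: consecutive threads read words that are stride elements apart
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}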

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

44

Parallel Memory Architecture

In a parallel machine, many threads access memory

Therefore, memory is divided into banks
  Essential to achieve high bandwidth
Each bank can service one address per cycle
  A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
  Conflicting accesses are serialized

[Figure: shared memory divided into Bank 0 … Bank 15]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

45

Bank Addressing Examples

No Bank Conflicts
  Linear addressing, stride == 1
No Bank Conflicts
  Random 1:1 permutation

[Figure: in both cases, Threads 0…15 map one-to-one onto Banks 0…15]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

46

Bank Addressing Examples

2-way Bank Conflicts
  Linear addressing, stride == 2
8-way Bank Conflicts
  Linear addressing, stride == 8

[Figure: with stride 2, two threads map to each used bank; with stride 8, eight threads (x8) map to each used bank]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

47

How addresses map to banks on G80

Each bank has a bandwidth of 32 bits per clock cycle

Successive 32-bit words are assigned to successive banks

G80 has 16 banks
  So bank = (word address) % 16 (e.g., words 5, 21, and 37 all fall in bank 5)
  Same as the size of a half-warp
No bank conflicts between different half-warps, only within a single half-warp

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

48

Shared memory bank conflicts

Shared memory is as fast as registers if there are no bank conflicts

The fast case:
  If all threads of a half-warp access different banks, there is no bank conflict
  If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
The slow case:
  Bank Conflict: multiple threads in the same half-warp access the same bank
  Must serialize the accesses
  Cost = max # of simultaneous accesses to a single bank
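Illustrating the fast and slow cases on a shared array (names are illustrative; assumes a 16-bank part such as G80 and a block of 128 threads):

__global__ void bank_demo(const float *in, float *out)
{
    __shared__ float buf[256];
    int tid = threadIdx.x;          // 0..127, so 2 * tid stays within buf
    buf[tid]       = in[tid];
    buf[tid + 128] = in[tid + 128];
    __syncthreads();

    float a = buf[tid];             // stride 1: each thread of a half-warp hits a different bank (fast)
    float b = buf[0];               // identical address for all threads: broadcast, no conflict (fast)
    float c = buf[2 * tid];         // stride 2: pairs of threads share a bank, a 2-way conflict (slow)
    out[tid] = a + b + c;
}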

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009ECE 498AL, University of Illinois, Urbana-Champaign

49

Linear Addressing

Given:

__shared__ float shared[256];

float foo = shared[baseIndex + s * threadIdx.x];

This is only bank-conflict-free if s shares no common factors with the number of banks (16 on G80), so s must be odd

[Figure: thread-to-bank mappings for s = 1 and s = 3; both strides are odd, so each thread maps to a distinct bank]