© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
GPU Programming with CUDA
A quiet revolution and potential build-up
- Calculation: 367 GFLOPS vs. 32 GFLOPS
- Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
- Until recently, programmed through the graphics API
- GPU in every PC and workstation – massive volume and potential impact
[Chart: GFLOPS over time, GPU vs. CPU – G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
GPU: A Massively Parallel Processor
What is GPGPU?
- General Purpose computation using GPU and graphics API in applications other than 3D graphics
- GPU accelerates critical path of application
- Data parallel algorithms leverage GPU attributes
  - Large data arrays, streaming throughput
  - Fine-grain SIMD parallelism
  - Low-latency floating point (FP) computation
- Applications – see //GPGPU.org
  - Game effects (FX) physics, image processing
  - Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
Previous GPGPU Constraints
- Dealing with graphics API: working with the corner cases of the graphics API
- Addressing modes: limited texture size/dimension
- Shader capabilities: limited outputs
- Instruction sets: lack of integer & bit ops
- Communication limited: between pixels; scatter a[i] = p
[Diagram: the fragment program model – per-thread input, temp, and output registers; per-shader constants; per-context texture; outputs written to FB memory]
CUDA
- "Compute Unified Device Architecture"
- General purpose programming model
  - User kicks off batches of threads on the GPU
  - GPU = dedicated super-threaded, massively data parallel co-processor
- Targeted software stack: compute oriented drivers, language, and tools
- Driver for loading computation programs into GPU
  - Standalone driver, optimized for computation
  - Interface designed for compute – graphics-free API
  - Data sharing with OpenGL buffer objects
  - Guaranteed maximum download & readback speeds
  - Explicit GPU memory management
Parallel Computing on a GPU
- 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
- Available in laptops, desktops, and clusters
- GPU parallelism is doubling every year
- Programming model scales transparently
- Programmable in C with CUDA tools
- Multithreaded SPMD model uses application data parallelism and thread parallelism
[Images: GeForce 8800, Tesla S870, Tesla D870]
CUDA – C with no shader limitations!
- Integrated host+device app C program
  - Serial or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code
[Figure: execution alternates between serial host code and parallel device kernels:
    Serial Code (host)
    Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
    Serial Code (host)
    Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
]
CUDA Devices and Threads
- A compute device:
  - Is a coprocessor to the CPU or host
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
  - Is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels, which run on many threads
- Differences between GPU and CPU threads:
  - GPU threads are extremely lightweight: very little creation overhead
  - GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
G80 – Graphics Mode
- The future of GPUs is programmable processing
- So – build the architecture around the processor
[Diagram: G80 graphics pipeline – host and input assembler feed vertex, geometry, and pixel thread issue units plus setup/raster/ZCull; a thread processor dispatches to an array of SP pairs, each with an L1 cache and texture filter (TF) unit; L2 cache and frame buffer (FB) partitions below]
G80 CUDA mode – A Device Example
- Processors execute computing threads
- New operating mode/HW interface for computing
[Diagram: host and input assembler feed a thread execution manager; an array of SP groups, each with a parallel data cache and texture unit, performs load/store to global memory]
Extended C
- Declspecs: global, device, shared, local, constant
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads
- Runtime API: memory, symbol, execution management
- Function launch

    __device__ float filter[N];

    __global__ void convolve (float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>> (myimage);
[Diagram: CUDA compilation flow – integrated source (foo.cu) goes through cudacc (EDG C/C++ frontend, Open64 global optimizer), producing CPU host code (foo.cpp) compiled by gcc / cl, and GPU assembly (foo.s) compiled by OCG into G80 SASS (foo.sass)]
Mark Murphy, "NVIDIA's Experience with Open64," www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
CUDA API Highlights: Easy and Lightweight
- The API is an extension to the ANSI C programming language: low learning curve
- The hardware is designed to enable a lightweight runtime and driver: high performance
CUDA Thread Block
- All threads in a block execute the same kernel program (SPMD)
- Programmer declares block:
  - Block size 1 to 512 concurrent threads
  - Block shape 1D, 2D, or 3D
  - Block dimensions in threads
- Threads have thread id numbers within block
  - Thread program uses thread id to select work and address shared data
- Threads in the same block share data and synchronize while doing their share of the work
- Threads in different blocks cannot cooperate
- Each block can execute in any order relative to other blocks!
[Figure: a CUDA thread block with thread ids 0, 1, 2, 3, …, m, all running the same thread program]
Courtesy: John Nickolls, NVIDIA
Thread Blocks: Scalable Cooperation
- Divide monolithic thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
- Threads in different blocks cannot cooperate
[Figure: thread blocks 0 through N-1, each with threads 0–7, all running the same program:
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
]
Transparent Scalability
- Hardware is free to assign blocks to any processor at any time
- A kernel scales across any number of parallel processors
- Each block can execute in any order relative to other blocks.
[Figure: the same kernel grid of blocks 0–7 scheduled over time onto two devices with different numbers of processors]
G80 Example: Executing Thread Blocks
- Threads are assigned to Streaming Multiprocessors in block granularity
  - Up to 8 blocks to each SM as resource allows
  - SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) * 3 blocks
    - Or 128 (threads/block) * 6 blocks, etc.
- Threads run concurrently
  - SM maintains thread/block id #s
  - SM manages/schedules thread execution
[Figure: SM 0 and SM 1, each with an MT issue unit, SPs, and shared memory, each holding several blocks of threads t0 t1 t2 … tm – flexible resource allocation]
G80 Example: Thread Scheduling
- Each block is executed as 32-thread warps
  - An implementation decision, not part of the CUDA programming model
  - Warps are scheduling units in SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 * 3 = 24 warps
[Figure: Block 1 and Block 2 warps (t0 t1 t2 … t31) scheduled onto a streaming multiprocessor with an instruction L1 cache, instruction fetch/dispatch unit, 8 SPs, 2 SFUs, and shared memory]
G80 Example: Thread Scheduling
- SM implements zero-overhead warp scheduling
  - At any time, only one of the warps is executed by the SM
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
Block IDs and Thread IDs
- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - …
Terminology
- Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads); the unit of parallelism in CUDA
- Warp: a group of threads executed physically in parallel in G80
- Block: a group of threads that are executed together and form the unit of resource assignment
- Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect
Memories
CUDA Memory Model Overview
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Long latency access
- We will focus on global memory for now
  - Constant and texture memory will come later
[Figure: CUDA memory model – the host talks to a grid's global memory; each block has its own shared memory; each thread has its own registers]
CUDA Device Memory Allocation
- cudaMalloc()
  - Allocates object in the device Global Memory
  - Requires two parameters
    - Address of a pointer to the allocated object
    - Size of allocated object
- cudaFree()
  - Frees object from device Global Memory
  - Pointer to freed object
CUDA Device Memory Allocation (cont.)
- Code example:
  - Allocate a 64 * 64 single precision float array
  - Attach the allocated storage to Md
  - "d" is often used to indicate a device data structure

    int TILE_WIDTH = 64;
    float* Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

    cudaMalloc((void**)&Md, size);
    cudaFree(Md);
CUDA Host-Device Data Transfer
- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device
- Asynchronous transfer
CUDA Host-Device Data Transfer (cont.)
- Code example:
  - Transfer a 64 * 64 single precision float array
  - M is in host memory and Md is in device memory
  - cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
CUDA Function Declarations

                                      Executed on the:   Only callable from the:
    __device__ float DeviceFunc()     device             device
    __global__ void KernelFunc()      device             host
    __host__ float HostFunc()         host               host

- __global__ defines a kernel function
  - Must return void
- __device__ and __host__ can be used together
CUDA Function Declarations (cont.)
- __device__ functions cannot have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments
Calling a Kernel Function – Thread Creation
- A kernel function must be called with an execution configuration:

    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);        // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);       // 256 threads per block
    size_t SharedMemBytes = 64;   // 64 bytes of shared memory
    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

- Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking
G80 Implementation of CUDA Memories
- Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read-only per-grid constant memory
[Figure: the CUDA memory model, now with per-grid constant memory accessible from the host and all threads]
CUDA Variable Type Qualifiers

    Variable declaration                        Memory     Scope    Lifetime
    __device__ __local__    int LocalVar;       local      thread   thread
    __device__ __shared__   int SharedVar;      shared     block    block
    __device__              int GlobalVar;      global     grid     application
    __device__ __constant__ int ConstantVar;    constant   grid     application

- __device__ is optional when used with __local__, __shared__, or __constant__
- Automatic variables without any qualifier reside in a register
  - Except arrays, which reside in local memory
Where to Declare Variables?
[Figure: global and constant variables are declared outside of any function; register (automatic), shared, and local variables are declared inside the kernel]
Variable Type Restrictions
- Pointers can only point to memory allocated or declared in global memory:
  - Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
  - Obtained as the address of a global variable: float* ptr = &GlobalVar;
A Common Programming Strategy
- Global memory resides in device memory (DRAM) – much slower access than shared memory
- So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  - Partition data into subsets that fit into shared memory
  - Handle each data subset with one thread block by:
    - Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    - Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    - Copying results from shared memory to global memory
A Common Programming Strategy (Cont.)
- Constant memory also resides in device memory (DRAM) – much slower access than shared memory
  - But… cached!
  - Highly efficient access for read-only data
- Carefully divide data according to access patterns
  - R/Only: constant memory (very fast if in cache)
  - R/W shared within block: shared memory (very fast)
  - R/W within each thread: registers (very fast)
  - R/W inputs/results: global memory (very slow)
- For texture memory usage, see the NVIDIA documentation.
GPU Atomic Integer Operations
- Atomic operations on integers in global memory:
  - Associative operations on signed/unsigned ints: add, sub, min, max, …
  - and, or, xor
  - Increment, decrement
  - Exchange, compare and swap
- Requires hardware with compute capability 1.1 and above.
SM Register File
- Register File (RF): 32 KB (8K entries) for each SM in G80
- TEX pipe can also read/write RF
  - 2 SMs share 1 TEX
- Load/Store pipe can also read/write RF
[Diagram: SM datapath – instruction L1 cache (I$ L1), multithreaded instruction buffer, register file (RF), constant L1 cache (C$ L1), shared memory, operand select, MAD and SFU units]
Programmer View of Register File
- There are 8192 registers in each SM in G80
  - This is an implementation decision, not part of CUDA
  - Registers are dynamically partitioned across all blocks assigned to the SM
  - Once assigned to a block, the register is NOT accessible by threads in other blocks
  - Each thread in the same block only accesses registers assigned to itself
[Figure: the same register file partitioned among 4 blocks vs. 3 blocks]
Example
- If each block has 16×16 threads and each thread uses 10 registers, how many threads can run on each SM?
  - Each block requires 10 * 256 = 2560 registers
  - 8192 = 3 * 2560 + change
  - So, three blocks can run on an SM as far as registers are concerned
- How about if each thread increases the use of registers by 1?
  - Each block now requires 11 * 256 = 2816 registers
  - 8192 < 2816 * 3
  - Only two blocks can run on an SM, a 1/3 reduction of parallelism!
More on Dynamic Partitioning
- Dynamic partitioning gives more flexibility to compilers/programmers
  - One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each
  - This allows for finer grain threading than traditional CPU threading models
- The compiler can trade off between instruction-level parallelism and thread-level parallelism
ILP vs. TLP Example
- Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers; global loads take 200 cycles
  - 3 blocks can run on each SM
- If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load
  - Only two blocks can run on each SM
  - However, one only needs 200/(8*4) = 7 warps to tolerate the memory latency
  - Two blocks have 16 warps. The performance can be actually higher!
Memory Coalescing
- When accessing global memory, peak performance utilization occurs when all threads in a half warp access continuous memory locations.
[Figure: WIDTH × WIDTH matrices Md and Nd with the access patterns of Thread 1 and Thread 2, one pattern not coalesced and one coalesced]
Parallel Memory Architecture
- In a parallel machine, many threads access memory
  - Therefore, memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized
[Figure: shared memory banks 0 through 15]
Bank Addressing Examples
- No bank conflicts: linear addressing, stride == 1
- No bank conflicts: random 1:1 permutation
[Figure: threads 0–15 mapped onto banks 0–15; left, each thread to its own bank in order; right, a random one-to-one permutation]
Bank Addressing Examples
- 2-way bank conflicts: linear addressing, stride == 2
- 8-way bank conflicts: linear addressing, stride == 8
[Figure: left, stride-2 addressing sends pairs of threads to the same bank; right, stride-8 addressing sends eight threads each to banks 0 and 8 (x8)]
How addresses map to banks on G80
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - So bank = address % 16
  - Same as the size of a half-warp
    - No bank conflicts between different half-warps, only within a single half-warp
Shared memory bank conflicts
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - Must serialize the accesses
  - Cost = max # of simultaneous accesses to a single bank
Linear Addressing
- Given:

    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];

- This is only bank-conflict-free if s shares no common factors with the number of banks (16 on G80), so s must be odd
[Figure: thread-to-bank mappings for s=1 and s=3, both conflict-free]