
Slide1

CS/EE 217 – GPU Architecture and Parallel Programming

Midterm Review

Slide2

Material on exam

Modules 1-10 (Module 5 is mixed in there somewhere)
Chapters 3-6, 8 and 9
Understand the CUDA C programming model
Understand the architecture limitations and how to navigate them to improve the performance of your code
Parallel programming patterns: analyze for run-time, memory performance (global memory traffic; memory coalescing), work efficiency, resource efficiency, ...


Slide3

Review Problems

Problem 3.5 from the book: If we use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to the data index?

i = blockIdx.x * blockDim.x + threadIdx.x
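A minimal vector-addition kernel using this mapping (a sketch; the kernel and parameter names are illustrative, not from the slides):

__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output element
    if (i < n)                                      // boundary check for the last block
        C[i] = A[i] + B[i];
}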


Slide4

Problem 3.6: We want to use each thread to calculate two adjacent elements of a vector addition. Assume that variable i is the index of the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index?

i = (blockIdx.x * blockDim.x + threadIdx.x) * 2
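A corresponding sketch (hypothetical names) in which each thread produces elements i and i+1:

__global__ void vecAddTwoAdjacent(const float *A, const float *B, float *C, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;  // first of the two adjacent elements
    if (i < n)     C[i]     = A[i]     + B[i];
    if (i + 1 < n) C[i + 1] = A[i + 1] + B[i + 1];        // second element, guarded separately
}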


Slide5

Assume that the vector length is 2000, and each thread calculates one output element, with a block size of 512. How many threads will there be in the grid?

2048. The grid needs ceil(2000/512) = 4 blocks, and 4 * 512 = 2048 threads.

How many warps will have divergence?

1. Only the warp covering elements 1984-2015 diverges (16 threads pass the boundary check, 16 fail); the final warp (threads 2016-2047) fails the check uniformly, so it does not diverge.
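A host-side launch configuration that matches this arithmetic (a sketch; vecAdd and the device pointers are assumed from the earlier example):

int n = 2000;
int blockSize = 512;
int gridSize = (n + blockSize - 1) / blockSize;   // ceil(2000/512) = 4 blocks
// 4 blocks * 512 threads = 2048 threads; elements past 1999 are masked off by the
// kernel's boundary check, and only the warp covering 1984-2015 diverges.
vecAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, n);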


Slide6

4.4: You need to write a kernel that operates on an image of size 400x900. You would like to allocate one thread to each pixel. You would like the thread blocks to be square and to use the maximum number of threads per block possible on the device (assume max # of threads per block is 1024). How would you select the grid and block dimensions?

Block dim = 32x32 (= 1024 threads). Grid dim = ceil(400/32) x ceil(900/32) = 13 x 29.

Assuming next that we use blocks of size 16x16, how many warps would experience thread divergence?

200 warps. A warp spans 2 rows x 16 columns of a block. Only the rightmost column of blocks straddles the image edge (900 % 16 = 4 active columns; 400 % 16 = 0, so the bottom edge lines up exactly). That column contains ceil(400/16) = 25 blocks with 256/32 = 8 warps each, giving 25 * 8 = 200 divergent warps.
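A host-side configuration for the 32x32 case (a sketch; pictureKernel and its argument order are hypothetical, assuming 900 is the image width and 400 the height):

dim3 blockDim(32, 32);                            // 32*32 = 1024 threads per block
dim3 gridDim((900 + 31) / 32, (400 + 31) / 32);   // 29 blocks in x, 13 blocks in y
pictureKernel<<<gridDim, blockDim>>>(d_img, 900, 400);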


Slide7

For the simple reduction kernel, if the block size is 1024, how many warps will have thread divergence during the fifth iteration?

All 32 warps in the thread block will be divergent: at the fifth iteration the stride is 16, so only threads with t % 32 == 0 are active, i.e., one active thread per warp.

How many for the improved kernel? 0: at the fifth iteration the stride is 64, so threads 0-63 (warps 0 and 1) are fully active and all remaining warps are fully inactive.
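For reference, a sketch of the simple reduction loop being discussed (assuming the textbook version, with t = threadIdx.x and partialSum in shared memory). At the fifth iteration stride = 16, so the condition leaves exactly one active thread per warp:

for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2 * stride) == 0)                  // active threads grow 2*stride apart
        partialSum[t] += partialSum[t + stride];
}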


Slide8

Recall the more efficient reduction kernel

for (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}

A bright engineer wanted to optimize this kernel by unrolling the last five steps as follows.


Slide9

for (unsigned int stride = blockDim.x; stride >= 32; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
__syncthreads();
if (t < 32) {
    partialSum[t] += partialSum[t + 16];
    partialSum[t] += partialSum[t + 8];
    partialSum[t] += partialSum[t + 4];
    partialSum[t] += partialSum[t + 2];
    partialSum[t] += partialSum[t + 1];
}

What are they thinking? Will this work? Will performance be better?

They eliminated __syncthreads() for the last 5 iterations and rely on implicit warp synchronization: once the stride drops below 32, all active threads belong to the same warp.


Slide10

8.5: Consider performing a 2D convolution on a square matrix of size n x n with a mask of size m x m.

How many halo elements will there be?
(n+m-1)(n+m-1) - n*n

What percentage of the multiplications involves halo elements?
There are a total of m*m*n*n multiplications: each of the n*n elements does m*m multiplications. Use algebra to remove the corner and edge cases... (too time consuming).

What is the saving in memory accesses for an internal tile (no ghost elements) vs. an untiled implementation?
O_TILE_WIDTH^2 * MASK_WIDTH^2 / (O_TILE_WIDTH + MASK_WIDTH - 1)^2, from slide 50 in Module8-Stencil.pdf.

Assuming the implementation where every element has a thread to load it into shared memory, how many warps will there be per block?
Every thread loads one element of the input tile. A t x t output tile requires (t+m-1)(t+m-1) input elements, so each thread block requires ceil((t+m-1)(t+m-1)/32) warps.
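A quick host-side check of these formulas with small example values (n, m and the tile width t are arbitrary choices, not from the problem):

#include <stdio.h>

int main(void) {
    int n = 8, m = 3, t = 4;                                   // matrix size, mask size, output tile width
    int halo = (n + m - 1) * (n + m - 1) - n * n;              // halo (ghost) elements
    double saving = (double)(t * t * m * m)
                  / ((t + m - 1) * (t + m - 1));               // access-reduction ratio per internal tile
    int warps = ((t + m - 1) * (t + m - 1) + 31) / 32;         // warps per block, rounded up
    printf("halo = %d, saving = %.2fx, warps per block = %d\n", halo, saving, warps);
    return 0;
}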


Slide11

9.7: Consider the following array: [4 6 7 1 2 8 5 2]

Perform an inclusive prefix sum on the array using the work-inefficient algorithm. Report the intermediate results at every step.

Initial array:  4  6  7  1  2  8  5  2
Work-inefficient scan:
  stride 1:     4 10 13  8  3 10 13  7
  stride 2:     4 10 17 18 16 18 16 17
  stride 4:     4 10 17 18 20 28 33 35

Repeat with the work-efficient kernel.

Initial array:  4  6  7  1  2  8  5  2
Reduction steps:
  stride 1:     4 10  7  8  2 10  5  7
  stride 2:     4 10  7 18  2 10  5 17
  stride 4:     4 10  7 18  2 10  5 35
Reverse phase:
  stride 2:     4 10  7 18  2 28  5 35
  stride 1:     4 10 17 18 20 28 33 35
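A sketch of a single-block Kogge-Stone (work-inefficient) inclusive scan kernel that produces the steps shown above (SECTION_SIZE and the names are illustrative; assumes the input fits in one thread block):

#define SECTION_SIZE 1024

__global__ void koggeStoneScan(const float *X, float *Y, int n) {
    __shared__ float XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? X[i] : 0.0f;      // load input, pad tail with zeros

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        float temp = 0.0f;
        if (threadIdx.x >= stride)
            temp = XY[threadIdx.x] + XY[threadIdx.x - stride];   // read before anyone overwrites
        __syncthreads();
        if (threadIdx.x >= stride)
            XY[threadIdx.x] = temp;                              // write after all reads complete
    }
    if (i < n) Y[i] = XY[threadIdx.x];
}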
