CS 179: GPU Programming Lecture 9 / Homework 3


valerie
Uploaded On 2023-06-22





Presentation Transcript

1. CS 179: GPU Programming
   Lecture 9 / Homework 3

2. Recap
   Some algorithms are “less obviously parallelizable”:
      Reduction
      Sorts
      FFT (and certain recursive algorithms)

3. Parallel FFT structure (radix-2)
   Bit-reversed access
   [Figure: parallel FFT structure, Stage 1 through Stage 3]
   Source: http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm
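The bit-reversed access pattern mentioned on this slide can be sketched on the host as follows. This is a minimal illustration, assuming an input length that is a power of two (n = 2^bits); the function names `bit_reverse` and `bit_reversed_order` are made up for this example and are not part of cuFFT or the course code.

```cpp
#include <cstddef>
#include <vector>

// Reverse the lowest `bits` bits of index i (e.g. bits = 3: 0b001 -> 0b100).
std::size_t bit_reverse(std::size_t i, unsigned bits) {
    std::size_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1);  // shift the lowest bit of i into r
        i >>= 1;
    }
    return r;
}

// For a transform of length n = 2^bits, list the bit-reversed index of each
// position 0..n-1. In one common radix-2 arrangement this is the order in
// which the first stage reads its input.
std::vector<std::size_t> bit_reversed_order(unsigned bits) {
    const std::size_t n = std::size_t(1) << bits;
    std::vector<std::size_t> order(n);
    for (std::size_t i = 0; i < n; ++i) {
        order[i] = bit_reverse(i, bits);
    }
    return order;
}
```

For bits = 3, positions 0..7 map to indices 0, 4, 2, 6, 1, 5, 3, 7.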

4. cuFFT 1D example
   Correction: Remember to call cufftDestroy(plan) when finished with transforms.

5. Today
   Homework 3
      Large-kernel convolution
   Project Introductions

6. Systems
   Given input signal(s), produce output signal(s)

7. LTI system review (Week 1)
   “Linear time-invariant” (LTI) systems
      Lots of them!
      Can be characterized entirely by the “impulse response” h[n]
      Output y[n] is given from input x[n] by convolution:
         $y[n] = \sum_{k} x[k]\, h[n-k]$

8. Parallelization
   Convolution is parallelizable!
   Sequential pseudocode (ignoring boundary conditions):
      (set all y[i] to 0)
      for (i from 0 through x.length - 1)
         for (j from 0 through h.length - 1)
            y[i] += (appropriate terms from x and h)
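The sequential pseudocode can be fleshed out into a host-side reference, which is handy for checking GPU output. This is a sketch only: unlike the slide, it handles the boundaries explicitly and produces the full x.length + h.length - 1 outputs, and the name `linear_convolve` is an assumption of this example. Each output y[i] is computed independently, which is what makes a one-thread-per-output GPU mapping possible.

```cpp
#include <cstddef>
#include <vector>

// Direct (O(n*m)) linear convolution: y[i] = sum over j of x[i - j] * h[j].
// Terms whose x index falls outside 0..x.size()-1 are treated as zero.
std::vector<float> linear_convolve(const std::vector<float>& x,
                                   const std::vector<float>& h) {
    if (x.empty() || h.empty()) return {};
    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (std::size_t i = 0; i < y.size(); ++i) {       // one output sample
        for (std::size_t j = 0; j < h.size(); ++j) {   // one kernel tap
            if (j <= i && i - j < x.size()) {          // boundary check
                y[i] += x[i - j] * h[j];
            }
        }
    }
    return y;
}
```

With x = {1, 2, 3, 4} and h = {1, 1}, the result is {1, 3, 5, 7, 4}.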

9. A problem…
   This worked for small impulse responses
      E.g. h[n], 0 ≤ n ≤ 20 in HW 1
   Homework 1 was “small-kernel convolution”
      (Vocab alert: Impulse responses are often called “kernels”!)

10. A problem…
   Sequential runtime: O(n*m)
      (n: size of x)
      (m: size of h)
   Troublesome for large m! (i.e. large impulse responses)
      (set all y[i] to 0)
      for (i from 0 through x.length - 1)
         for (j from 0 through h.length - 1)
            y[i] += (appropriate terms from x and h)

11. DFT/FFT
   Same problem with the Discrete Fourier Transform!
   Successfully optimized and GPU-accelerated!
      O(n^2) to O(n log n)
   Can we do the same here?

12. “Circular” convolution

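13. “Circular” convolution
   Linear convolution:
      $y[n] = \sum_{k} x[k]\, h[n-k]$
   Circular convolution (signals of length N, indices taken mod N):
      $y[n] = \sum_{k=0}^{N-1} x[k]\, h[(n-k) \bmod N]$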

14. Example: x[0..3], h[0..1]
   Linear convolution:
      y[0] = x[0]h[0]
      y[1] = x[0]h[1] + x[1]h[0]
      y[2] = x[1]h[1] + x[2]h[0]
      y[3] = x[2]h[1] + x[3]h[0]
      y[4] = x[3]h[1] + x[4]h[0]
   Circular convolution (N = 4):
      y[0] = x[0]h[0] + x[3]h[1] + x[2]h[2] + x[1]h[3]
      y[1] = x[0]h[1] + x[1]h[0] + x[2]h[3] + x[3]h[2]
      y[2] = x[1]h[1] + x[2]h[0] + x[3]h[3] + x[0]h[2]
      y[3] = x[2]h[1] + x[3]h[0] + x[0]h[3] + x[1]h[2]
   Terms involving x[4], h[2], or h[3] are 0, since x and h are zero where not defined.
   Note the wrapped term x[3]h[1] in circular y[0], which linear y[0] does not have.
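The circular sums in this example can be reproduced with a direct host-side loop. This is a minimal sketch, assuming both inputs already have the same length N (h padded with zeros); `circular_convolve` is a name invented for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Direct circular convolution of two length-N signals:
// y[n] = sum over k of x[k] * h[(n - k) mod N].
// Returns an empty vector if the lengths differ.
std::vector<float> circular_convolve(const std::vector<float>& x,
                                     const std::vector<float>& h) {
    const std::size_t N = x.size();
    if (h.size() != N) return {};
    std::vector<float> y(N, 0.0f);
    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t k = 0; k < N; ++k) {
            // Adding N before subtracting keeps the unsigned index non-negative.
            y[n] += x[k] * h[(n + N - k) % N];
        }
    }
    return y;
}
```

With x = {1, 2, 3, 4} and h = {1, 1, 0, 0}, this gives {5, 3, 5, 7}: y[0] picks up the wrapped term x[3]h[1], so it differs from the linear result y[0] = 1.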

15. Circular Convolution Theorem*
   Circular convolution can be calculated by: IFFT( FFT(x) .* FFT(h) )
   i.e. with X = FFT(x) and H = FFT(h):
      For all i: Y[i] = X[i] H[i]
      Then: y = IFFT(Y)
   * DFT case

16. Circular Convolution Theorem*
   Circular convolution can be calculated by: IFFT( FFT(x) .* FFT(h) )
   i.e. (assume n > m):
      X = FFT(x): O(n log n)
      H = FFT(h): O(m log m)
      For all i: Y[i] = X[i] H[i]: O(n)
      Then: y = IFFT(Y): O(n log n)
      Total: O(n log n)
   * DFT case
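The theorem can be checked numerically on the host. The sketch below uses a naive O(N^2) DFT as a stand-in for the FFT, so it demonstrates the identity but not the O(n log n) runtime; in the homework, cuFFT would take the place of `dft`. The names `dft` and `circular_convolve_dft` are assumptions of this example, and both inputs are assumed to have the same length.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive DFT (forward) or inverse DFT (with the 1/N factor applied).
std::vector<cd> dft(const std::vector<cd>& a, bool inverse) {
    const std::size_t N = a.size();
    const double pi = std::acos(-1.0);
    const double sign = inverse ? 1.0 : -1.0;
    std::vector<cd> out(N);
    for (std::size_t k = 0; k < N; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle = sign * 2.0 * pi * double((k * n) % N) / double(N);
            sum += a[n] * std::polar(1.0, angle);
        }
        out[k] = inverse ? sum / double(N) : sum;
    }
    return out;
}

// Circular convolution via the theorem: y = IDFT( DFT(x) .* DFT(h) ).
std::vector<cd> circular_convolve_dft(const std::vector<cd>& x,
                                      const std::vector<cd>& h) {
    if (x.size() != h.size()) return {};
    std::vector<cd> X = dft(x, false);
    const std::vector<cd> H = dft(h, false);
    for (std::size_t i = 0; i < X.size(); ++i) {
        X[i] *= H[i];  // pointwise product in the frequency domain
    }
    return dft(X, true);
}
```

For x = {1, 2, 3, 4} and h = {1, 1, 0, 0} the real parts come out as approximately 5, 3, 5, 7, matching the direct circular sums up to floating-point rounding.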

17. What if x[n] and h[n] are different lengths?
   How do we linearly convolve using circular convolution?

18. Padding
   x[n] and h[n] are presumed zero where not defined
   Computationally: store x[n] and h[n] as larger arrays
   Pad both to at least x.length + h.length - 1

19. Example: (Padding) x[0..3], h[0..1]
   Linear convolution:
      y[0] = x[0]h[0]
      y[1] = x[0]h[1] + x[1]h[0]
      y[2] = x[1]h[1] + x[2]h[0]
      y[3] = x[2]h[1] + x[3]h[0]
      y[4] = x[3]h[1] + x[4]h[0]
   Circular convolution:
      y[0] = x[0]h[0] + x[1]h[4] + x[2]h[3] + x[3]h[2] + x[4]h[1]
      y[1] = x[0]h[1] + x[1]h[0] + x[2]h[4] + x[3]h[3] + x[4]h[2]
      y[2] = x[1]h[1] + x[2]h[0] + x[3]h[4] + x[4]h[3] + x[0]h[2]
      y[3] = x[2]h[1] + x[3]h[0] + x[4]h[4] + x[0]h[3] + x[1]h[2]
      y[4] = x[3]h[1] + x[4]h[0] + x[0]h[4] + x[1]h[3] + x[2]h[2]
   N is now (4 + 2 - 1) = 5
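The padding step can be sketched on the host as follows: both signals are zero-padded to N = x.length + h.length - 1 and then circularly convolved, which reproduces the linear result. The circular sum is computed directly here for brevity; in the homework an FFT-based circular convolution would replace the inner loops. `linear_via_circular` is a hypothetical name for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Linear convolution computed as a circular convolution of zero-padded inputs.
std::vector<float> linear_via_circular(const std::vector<float>& x,
                                       const std::vector<float>& h) {
    if (x.empty() || h.empty()) return {};
    const std::size_t N = x.size() + h.size() - 1;  // minimum padded length

    std::vector<float> xp(x), hp(h);
    xp.resize(N, 0.0f);  // zero-pad x to length N
    hp.resize(N, 0.0f);  // zero-pad h to length N

    std::vector<float> y(N, 0.0f);
    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t k = 0; k < N; ++k) {
            y[n] += xp[k] * hp[(n + N - k) % N];  // wrapped terms hit zeros
        }
    }
    return y;
}
```

With x = {1, 2, 3, 4} and h = {1, 1}, N = 5 and the output is {1, 3, 5, 7, 4}, the same as direct linear convolution.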

20. Summary
   Alternate algorithm for large impulse response convolution!
      Serial: O(n log n) vs. O(mn)
   Small vs. large m determines algorithm choice
   Runtime does “carry over” to parallel situations (to some extent)

21. Homework 3, Part 1
   Implement FFT (“large-kernel”) convolution
   Use cuFFT for FFT/IFFT (if brave, try your own)
      Use “batch” variable to save FFT calculations
      Correction: Good practice in general, but results in poor performance on Homework 3
   Complex multiplication kernel: Week 1-style
      (HW1 difference: Consider right-hand boundary region)

22. Complex numbers
   cufftComplex: cuFFT complex number type
   Example usage:
      cufftComplex a;
      a.x = 3; // Real part
      a.y = 4; // Imaginary part
   Element-wise multiplying:
      (a + bi)(c + di) = (ac - bd) + (ad + bc)i
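The element-wise multiplication formula can be sketched on the host like this. `Complex2f` is a hypothetical stand-in for cufftComplex that mirrors only the .x (real) and .y (imaginary) fields shown on the slide, so the example compiles without the CUDA toolkit; `complex_mul` and `pointwise_multiply` are likewise names made up here. In a GPU kernel, the loop body would be executed by one thread per index.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cufftComplex: x is the real part, y the imaginary part.
struct Complex2f {
    float x;
    float y;
};

// (a + bi)(c + di) = (ac - bd) + (ad + bc)i
Complex2f complex_mul(Complex2f p, Complex2f q) {
    return { p.x * q.x - p.y * q.y, p.x * q.y + p.y * q.x };
}

// out[i] = a[i] * b[i] for every index both inputs share.
void pointwise_multiply(const std::vector<Complex2f>& a,
                        const std::vector<Complex2f>& b,
                        std::vector<Complex2f>& out) {
    const std::size_t n = std::min(a.size(), b.size());
    out.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = complex_mul(a[i], b[i]);
    }
}
```

For example, (3 + 4i)(1 + 2i) = (3 - 8) + (6 + 4)i = -5 + 10i.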

23. Homework 3, Part 2

24. Normalization
   Amplitudes must lie in range [-1, 1]
      Normalize s.t. maximum magnitude is 1 (or 1 - ε)
   How to find maximum amplitude?

25. Reduction
   This time, maximum (instead of sum)
   Lecture 7 strategies
   “Optimizing Parallel Reduction in CUDA” (Harris)

26. Homework 3, Part 2
   Implement GPU-accelerated normalization
      Find maximum (reduction)
      Divide by maximum to normalize
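The two steps can be sketched on the host as below. The loop mimics the shape of a tree reduction (each pass halves the number of active elements, as a block of GPU threads would), but it runs sequentially and is only a reference for checking results, not the homework's GPU implementation. `max_abs_tree` and `normalize` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Maximum absolute value, computed with a pairwise (tree-shaped) reduction.
float max_abs_tree(const std::vector<float>& data) {
    if (data.empty()) return 0.0f;
    std::vector<float> buf(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        buf[i] = std::fabs(data[i]);
    }
    // Each pass folds the upper half of the active range onto the lower half.
    for (std::size_t active = buf.size(); active > 1;) {
        const std::size_t half = (active + 1) / 2;
        for (std::size_t i = 0; i + half < active; ++i) {
            buf[i] = std::max(buf[i], buf[i + half]);
        }
        active = half;
    }
    return buf[0];
}

// Scale the signal in place so its maximum magnitude is 1.
void normalize(std::vector<float>& data) {
    const float m = max_abs_tree(data);
    if (m == 0.0f) return;  // all-zero (or empty) signal: nothing to scale
    for (float& v : data) {
        v /= m;
    }
}
```

For the input {0.5, -2, 1, 0.25, -0.5}, the maximum magnitude is 2 and the normalized signal is {0.25, -1, 0.5, 0.125, -0.25}.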

27. (Demonstration)
   Rooms can be modeled as LTI systems!

28. Other notes
   Machines:
      Normal mode: haru, mx, minuteman
      Audio mode: haru
   Due date: Friday (4/24), 3 PM (Correction: 11:59 PM)
   Extra office hours: Thursday (4/23), 8-10 PM

29. Projects