EECS 583 – Class 17: Automatic Parallelization Via Decoupled Software Pipelining



Presentation Transcript

1. EECS 583 – Class 17: Automatic Parallelization Via Decoupled Software Pipelining
University of Michigan, November 10, 2021

2. Announcements + Reading Material
Research paper presentations: 3 more today + my last presentation today. Fill out quizzes (feedback forms) on Canvas.
Reading material:
“Revisiting the Sequential Programming Model for Multi-Core,” M. J. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. I. August, Proc. 40th IEEE/ACM International Symposium on Microarchitecture, December 2007.
“Automatic Thread Extraction with Decoupled Software Pipelining,” G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proc. 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.

3. Moore’s Law (figure; source: Intel/Wikipedia)

4. Compilers are the Answer? – Proebsting’s Law
“Compiler Advances Double Computing Power Every 18 Years.”
Run your favorite set of benchmarks with your favorite state-of-the-art optimizing compiler. Run the benchmarks both with and without optimizations enabled. The ratio of those numbers represents the entirety of the contribution of compiler optimizations to speeding up those benchmarks. Let’s assume that this ratio is about 4x for typical real-world applications, and let’s further assume that compiler optimization work has been going on for about 36 years. Therefore, compiler optimization advances double computing power every 18 years. QED.
Conclusion – compilers are not about performance!

5. DOALL Coverage – Provable and Profiled (figure). Still not good enough!

6. What About Non-Scientific Codes???
Scientific codes (FORTRAN-like) suit Independent Multithreading (IMT), e.g. DOALL parallelization:
    for(i=1; i<=N; i++)           // C
        a[i] = a[i] + 1;          // X
General-purpose codes (legacy C/C++) need Cyclic Multithreading (CMT), e.g. DOACROSS [Cytron, ICPP 86]:
    while(ptr = ptr->next)        // LD
        ptr->val = ptr->val + 1;  // X
(Timing diagrams, cycles 0–5: under DOALL, Core 1 and Core 2 each run whole iterations C:i/X:i independently; under DOACROSS, LD:i and X:i alternate between the two cores.)
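The IMT/DOALL side of the slide can be sketched directly: because iteration i touches only a[i], the iteration space splits into disjoint chunks with no cross-thread communication. A minimal pthreads sketch — N, NTHREADS, and the function names are illustrative, not from the slides:

```c
#include <pthread.h>

#define N 1000
#define NTHREADS 2

static int a[N + 1];

/* Each worker owns a disjoint chunk of the iteration space; because
   iteration i touches only a[i], no cross-thread communication is needed. */
static void *doall_worker(void *arg) {
    long tid = (long)arg;
    int lo = 1 + (int)tid * (N / NTHREADS);
    int hi = (tid == NTHREADS - 1) ? N : lo + N / NTHREADS - 1;
    for (int i = lo; i <= hi; i++)
        a[i] = a[i] + 1;          /* X: independent across iterations */
    return NULL;
}

int doall_run(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, doall_worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return a[1] + a[N];           /* one element from each chunk */
}
```

This is exactly the transformation the slide says general-purpose pointer-chasing loops do not admit, since there the next iteration's address depends on the current one.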

7. Alternative Parallelization Approaches
    while(ptr = ptr->next)        // LD
        ptr->val = ptr->val + 1;  // X
Cyclic Multithreading (CMT): LD:i and X:i ping-pong between Core 1 and Core 2, one iteration's dependence cycle in flight at a time.
Pipelined Multithreading (PMT), e.g. DSWP [PACT 2004]: Core 1 runs all the LD:i, Core 2 runs all the X:i, and iterations stream through the two-stage pipeline.
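The PMT column is essentially what DSWP emits: one thread runs the pointer chase (LD) and streams node pointers to a second thread that runs X, through an inter-core queue. A sketch under simplifying assumptions — the blocking mutex/condvar queue stands in for the hardware or lock-free software queue DSWP assumes, and all names are illustrative:

```c
#include <pthread.h>
#include <stddef.h>

struct node { int val; struct node *next; };

/* A tiny blocking single-producer/single-consumer queue standing in
   for DSWP's inter-core communication queue. */
#define QCAP 8
static struct node *q[QCAP];
static int qhead, qtail, qcount;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qnotfull = PTHREAD_COND_INITIALIZER;
static pthread_cond_t qnotempty = PTHREAD_COND_INITIALIZER;

static void produce(struct node *p) {
    pthread_mutex_lock(&qlock);
    while (qcount == QCAP) pthread_cond_wait(&qnotfull, &qlock);
    q[qtail] = p; qtail = (qtail + 1) % QCAP; qcount++;
    pthread_cond_signal(&qnotempty);
    pthread_mutex_unlock(&qlock);
}

static struct node *consume(void) {
    pthread_mutex_lock(&qlock);
    while (qcount == 0) pthread_cond_wait(&qnotempty, &qlock);
    struct node *p = q[qhead]; qhead = (qhead + 1) % QCAP; qcount--;
    pthread_cond_signal(&qnotfull);
    pthread_mutex_unlock(&qlock);
    return p;
}

/* Stage 1 (Core 1): the loop-carried pointer chase, LD. */
static void *ld_stage(void *arg) {
    struct node *ptr = arg;
    while ((ptr = ptr->next))
        produce(ptr);
    produce(NULL);                /* sentinel: end of list */
    return NULL;
}

/* Stage 2 (Core 2): the off-critical-path work, X. */
static void *x_stage(void *arg) {
    (void)arg;
    struct node *p;
    while ((p = consume()))
        p->val = p->val + 1;
    return NULL;
}

int pmt_run(struct node *head) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, ld_stage, head);
    pthread_create(&t2, NULL, x_stage, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    int sum = 0;
    for (struct node *p = head->next; p; p = p->next) sum += p->val;
    return sum;
}
```

Note that the LD recurrence never leaves thread 1: communication flows only forward, which is why PMT tolerates queue latency where CMT does not.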

8. Comparison: IMT, PMT, CMT (timing diagrams, cycles 0–5, Cores 1 and 2)
IMT: 1 iter/cycle. With lat(comm) = 1, all three achieve 1 iter/cycle. With lat(comm) = 2, CMT drops to 0.5 iter/cycle while IMT and PMT stay at 1 iter/cycle: the communication latency sits on CMT's per-iteration critical path, but PMT pays it only once as pipeline fill.

9. Comparison: IMT, PMT, CMT (continued)
Cross-thread dependences → wide applicability.
Thread-local recurrences → fast execution.

10. Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP
Example (197.parser): Find English Sentences → Parse Sentences (95%) → Emit Results
Decoupled Software Pipelining; PS-DSWP (speculative DOALL middle stage).

11. Decoupled Software Pipelining

12. Decoupled Software Pipelining (DSWP) [MICRO 2005]
A: while(node)
B:     ncost = doit(node);
C:     cost += ncost;
D:     node = node->next;
The dependence graph (edges: intra-iteration, loop-carried, register, control) is collapsed into its DAG of SCCs (DAG_SCC); Thread 1 gets {A, D}, Thread 2 gets {B, C}, connected by a communication queue. Inter-thread communication latency is a one-time cost.
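The two-thread partition can be simulated without any threading machinery: thread 1's slice keeps the {A, D} recurrence local and produces node pointers; thread 2's slice runs B and C. A deliberately sequential sketch — doit() is a stand-in (here val * 2), and the whole queue is buffered up front so the decoupling is visible:

```c
#include <stddef.h>

struct node { int val; struct node *next; };

#define QMAX 64

/* Stand-in for the slide's opaque work function. */
static int doit(const struct node *n) { return n->val * 2; }

/* Thread 1's slice: the {A, D} SCC (loop test + pointer chase).
   Instead of running B and C, it streams node pointers into the queue. */
static int stage1(struct node *node, const struct node *qbuf[QMAX]) {
    int n = 0;
    while (node && n < QMAX) {     /* A */
        qbuf[n++] = node;          /* produce(node) */
        node = node->next;         /* D: recurrence stays thread-local */
    }
    return n;
}

/* Thread 2's slice: B and C, consuming nodes from the queue. */
static int stage2(const struct node *qbuf[], int n) {
    int cost = 0;
    for (int i = 0; i < n; i++) {
        int ncost = doit(qbuf[i]); /* B */
        cost += ncost;             /* C: recurrence stays thread-local */
    }
    return cost;
}

int dswp_cost(struct node *head) {
    const struct node *qbuf[QMAX];
    int n = stage1(head, qbuf);
    return stage2(qbuf, n);
}
```

Both recurrences (D's pointer chase, C's running sum) stay inside a single stage, so neither ever waits on the queue round-trip.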

13. Implementing DSWP
(Figure: loop L1 and auxiliary code Aux shown as a dataflow graph; dependence legend: intra-iteration, loop-carried, register, memory, control.)

14. Optimization: Node Splitting to Eliminate Cross-Thread Control
(Figure: loops L1 and L2; dependence legend: intra-iteration, loop-carried, register, memory, control.)

15. Optimization: Node Splitting to Reduce Communication
(Figure: loops L1 and L2; dependence legend: intra-iteration, loop-carried, register, memory, control.)

16. Constraint: Strongly Connected Components
Solution: DAG_SCC. Consider the cycles of the dependence graph: splitting a cycle across threads would force communication in both directions, which eliminates the pipelined/decoupled property — so each SCC must stay within a single thread.
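The SCC constraint is mechanical to check: run any SCC algorithm on the dependence graph, and each resulting component must land in one thread. A Tarjan sketch over the cost loop's four statements — the edge set here is an assumption reconstructed from the earlier slide's picture (intra-iteration A→B, A→C, A→D, B→C; loop-carried D→A and C→C):

```c
#include <string.h>

/* Tarjan's SCC algorithm on the 4-node dependence graph of the cost
   loop: 0 = A (loop test), 1 = B (doit), 2 = C (cost +=),
   3 = D (node = node->next). Nodes in one SCC must share a thread. */
#define NN 4
static const int adj[NN][NN] = {
    /* A */ {0, 1, 1, 1},
    /* B */ {0, 0, 1, 0},
    /* C */ {0, 0, 1, 0},   /* C->C: loop-carried running sum   */
    /* D */ {1, 0, 0, 0},   /* D->A: loop-carried pointer chase */
};
static int idx[NN], low[NN], onstk[NN], stk[NN], sp, counter, nsccs;
int scc_id[NN];

static void tarjan(int v) {
    idx[v] = low[v] = ++counter;
    stk[sp++] = v; onstk[v] = 1;
    for (int w = 0; w < NN; w++) {
        if (!adj[v][w]) continue;
        if (!idx[w]) { tarjan(w); if (low[w] < low[v]) low[v] = low[w]; }
        else if (onstk[w] && idx[w] < low[v]) low[v] = idx[w];
    }
    if (low[v] == idx[v]) {          /* v is the root of an SCC */
        int w;
        do { w = stk[--sp]; onstk[w] = 0; scc_id[w] = nsccs; } while (w != v);
        nsccs++;
    }
}

int count_sccs(void) {
    memset(idx, 0, sizeof idx);
    counter = nsccs = sp = 0;
    for (int v = 0; v < NN; v++)
        if (!idx[v]) tarjan(v);
    return nsccs;
}
```

On this graph the algorithm finds three SCCs — {A, D}, {B}, and {C} — matching the two-thread partition on slide 12 (A and D are inseparable; B and C merely need to end up downstream of them).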

17. Two Extensions to the Basic Transformation
Speculation: break statistically unlikely dependences; form better-balanced pipelines.
Parallel stages: execute multiple copies of certain “large” stages; stages that contain inner loops are perfect candidates.

18. Why Speculation?
A: while(node)
B:     ncost = doit(node);
C:     cost += ncost;
D:     node = node->next;
(Dependence graph and DAG_SCC as on slide 12; edge legend: intra-iteration, loop-carried, register, control, communication queue.)

19. Why Speculation?
A: while(cost < T && node)
B:     ncost = doit(node);
C:     cost += ncost;
D:     node = node->next;
Adding the cost < T test to the loop condition merges A, B, C, and D into a single SCC in the DAG_SCC — but the new dependences are predictable.

20. Why Speculation?
A: while(cost < T && node)
B:     ncost = doit(node);
C:     cost += ncost;
D:     node = node->next;
Speculating the predictable dependences removes them from the graph, breaking the single large SCC back into small SCCs that can be pipelined.

21. Execution Paradigm
(Figure: the speculative pipeline runs iterations ahead; when misspeculation is detected — here in iteration 4 — misspeculation recovery reruns iteration 4 and execution resumes.)

22. Understanding PMT Performance
(Timing diagrams: a balanced two-stage A/B pipeline has a slowest thread of 1 cycle/iter and runs at 1 iter/cycle; an unbalanced pipeline whose slowest thread takes 2 cycles/iter leaves idle time and runs at 0.5 iter/cycle.)
The iteration rate is set by the slowest thread, and each thread's per-iteration time t_i is at least as large as its longest dependence recurrence. Finding the longest recurrence is NP-hard, and large loops make the problem difficult in practice.
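The rate rule on this slide reduces to a one-liner: steady-state throughput is set by the slowest stage. A trivial helper, expressed in cycles per iteration to stay in integers:

```c
/* Steady-state cost of a software pipeline: the slowest stage paces
   everyone, so cycles-per-iteration = max over stages of that stage's
   per-iteration cycle count (iteration rate is its reciprocal). */
int pipeline_cycles_per_iter(const int stage_cycles[], int nstages) {
    int worst = 0;
    for (int i = 0; i < nstages; i++)
        if (stage_cycles[i] > worst)
            worst = stage_cycles[i];
    return worst;
}
```

With stages of 1 and 2 cycles this gives 2 cycles/iter (0.5 iter/cycle), matching the slide's unbalanced case.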

23. Selecting Dependences To Speculate
A: while(cost < T && node)
B:     ncost = doit(node);
C:     cost += ncost;
D:     node = node->next;
(Dependence graph and DAG_SCC as before; after speculation, each of the four statements can be assigned its own thread: Thread 1 through Thread 4.)

24. Detecting Misspeculation
Thread 1:
A1: while(consume(4))
D :     node = node->next;
        produce({0,1}, node);
Thread 2:
A2: while(consume(5))
B :     ncost = doit(node);
        produce(2, ncost);
D2:     node = consume(0);
Thread 3:
A3: while(consume(6))
B3:     ncost = consume(2);
C :     cost += ncost;
        produce(3, cost);
Thread 4:
A :  while(cost < T && node)
B4:     cost = consume(3);
C4:     node = consume(1);
        produce({4,5,6}, cost < T && node);

25. Detecting Misspeculation
Threads 1–3 speculate the loop predicate, replacing the consumed condition with while(TRUE):
Thread 1:
A1: while(TRUE)
D :     node = node->next;
        produce({0,1}, node);
Thread 2:
A2: while(TRUE)
B :     ncost = doit(node);
        produce(2, ncost);
D2:     node = consume(0);
Thread 3:
A3: while(TRUE)
B3:     ncost = consume(2);
C :     cost += ncost;
        produce(3, cost);
Thread 4:
A :  while(cost < T && node)
B4:     cost = consume(3);
C4:     node = consume(1);
        produce({4,5,6}, cost < T && node);

26. Detecting Misspeculation
Thread 4 no longer produces the predicate; it checks it and flags misspeculation:
Thread 1:
A1: while(TRUE)
D :     node = node->next;
        produce({0,1}, node);
Thread 2:
A2: while(TRUE)
B :     ncost = doit(node);
        produce(2, ncost);
D2:     node = consume(0);
Thread 3:
A3: while(TRUE)
B3:     ncost = consume(2);
C :     cost += ncost;
        produce(3, cost);
Thread 4:
A :  while(cost < T && node)
B4:     cost = consume(3);
C4:     node = consume(1);
        if (!(cost < T && node)) FLAG_MISSPEC();
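Thread 4's check can be miniaturized as an offline replay: re-evaluate the real exit predicate against the committed values and report the first iteration where the while(TRUE) speculation was wrong. A sketch — the array-based replay is illustrative; the real check runs online inside the pipeline:

```c
#include <stddef.h>

/* Miniature version of Thread 4's job: B4/C4 consume the committed
   cost and node for each iteration, and A re-evaluates the real exit
   predicate (cost < T && node). The workers speculated while(TRUE),
   so the first iteration where the predicate is false misspeculated
   and must be rolled back. Returns that iteration's index, or n if
   all n iterations commit cleanly. */
int first_misspeculated_iter(const int cost[], const int node_nonnull[],
                             int n, int T) {
    for (int i = 0; i < n; i++)
        if (!(cost[i] < T && node_nonnull[i]))
            return i;            /* FLAG_MISSPEC(); rerun from here */
    return n;
}
```

This is the recovery point the slide-21 execution paradigm refers to: everything up to the flagged iteration commits, and that iteration reruns non-speculatively.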

27. Adding Parallel Stages to DSWP
    while(ptr = ptr->next)        // LD = 1 cycle
        ptr->val = ptr->val + 1;  // X = 2 cycles
Comm. latency = 2 cycles. Throughput — DSWP: 1/2 iteration/cycle; DOACROSS: 1/2 iteration/cycle; PS-DSWP: 1 iteration/cycle.
(Timing diagram: Core 1 runs LD:1–LD:8 back to back; Cores 2 and 3 each run alternate X iterations.)
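The throughput numbers on this slide follow from a small calculation: replicating stage i over k cores divides its effective per-iteration cost by k, and the pipeline still runs at the pace of its slowest effective stage. A sketch (doubles, since replication produces fractional costs):

```c
/* Steady-state cycles per iteration when stage i is replicated over
   replicas[i] cores: a replicated stage's effective cost is
   cycles[i] / replicas[i], and the pipeline is paced by the slowest
   effective stage. */
double cycles_per_iter(const int cycles[], const int replicas[], int n) {
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double eff = (double)cycles[i] / replicas[i];
        if (eff > worst) worst = eff;
    }
    return worst;
}
```

With LD = 1 cycle, X = 2 cycles, and the X stage replicated on two cores, this reproduces the slide's numbers: plain DSWP runs at 2 cycles/iter (1/2 iteration/cycle), PS-DSWP at 1 cycle/iter.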

28. Things to Think About
How do you decide which dependences to speculate? Look solely at profile data? How do you ensure enough profile coverage? What about code structure? What if you are wrong — undo speculation decisions at run time?
How do you manage speculation in a pipeline? The traditional definition of a transaction is broken: transaction execution is spread out across multiple cores.
How many cores can DSWP realistically scale to? Can a pipeline be adjusted when the number of available cores increases/decreases, or based on what else is running on the processor?