1. EECS 583 – Class 17: Automatic Parallelization via Decoupled Software Pipelining
University of Michigan, November 10, 2021
2. Announcements + Reading Material
Research paper presentations: 3 more today + my last presentation today
Fill out quizzes (feedback forms) on Canvas
Reading material:
“Revisiting the Sequential Programming Model for Multi-Core,” M. J. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. I. August, Proc. 40th IEEE/ACM International Symposium on Microarchitecture, December 2007.
“Automatic Thread Extraction with Decoupled Software Pipelining,” G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proc. 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.
3. Moore’s Law (Source: Intel/Wikipedia)
4. Compilers are the Answer? – Proebsting’s Law
“Compiler Advances Double Computing Power Every 18 Years.”
Run your favorite set of benchmarks with your favorite state-of-the-art optimizing compiler, both with and without optimizations enabled. The ratio of those numbers represents the entire contribution of compiler optimizations to speeding up those benchmarks. Let’s assume that this ratio is about 4x for typical real-world applications, and further assume that compiler optimization work has been going on for about 36 years. Therefore, compiler optimization advances double computing power every 18 years. QED.
Conclusion – compilers are not about performance!
5. DOALL Coverage – Provable and Profiled
Still not good enough!
6. What About Non-Scientific Codes???
Scientific codes (FORTRAN-like) suit Independent Multithreading (IMT), e.g., DOALL parallelization:

    for (i = 1; i <= N; i++)      // C
        a[i] = a[i] + 1;          // X

General-purpose codes (legacy C/C++) need Cyclic Multithreading (CMT), e.g., DOACROSS [Cytron, ICPP 86]:

    while (ptr = ptr->next)       // LD
        ptr->val = ptr->val + 1;  // X

[Figure: two-core execution schedules. IMT: core 1 runs C:1/X:1, C:3/X:3, … while core 2 runs C:2/X:2, C:4/X:4, … fully independently. CMT: whole iterations LD:i/X:i alternate between cores, passing ptr across cores every iteration.]
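The DOALL case above can be sketched with POSIX threads: each worker increments a disjoint chunk of the array, so the iterations carry no dependence and need no synchronization beyond the final joins. This is an illustrative sketch; `doall_increment`, `worker`, and the chunking scheme are made up for the example, not part of the lecture's infrastructure.

```c
#include <pthread.h>

#define N 1000
#define NTHREADS 2

static int a[N + 1];               /* a[1..N], zero-initialized */

struct range { int lo, hi; };

/* Each worker runs the X statement over its own disjoint chunk. */
static void *worker(void *arg) {
    struct range *r = arg;
    for (int i = r->lo; i <= r->hi; i++)
        a[i] = a[i] + 1;           /* X */
    return 0;
}

/* Split iterations 1..N evenly across NTHREADS workers (DOALL). */
int doall_increment(void) {
    pthread_t t[NTHREADS];
    struct range r[NTHREADS];
    int chunk = N / NTHREADS;
    for (int k = 0; k < NTHREADS; k++) {
        r[k].lo = 1 + k * chunk;
        r[k].hi = (k == NTHREADS - 1) ? N : r[k].lo + chunk - 1;
        pthread_create(&t[k], 0, worker, &r[k]);
    }
    for (int k = 0; k < NTHREADS; k++)
        pthread_join(t[k], 0);
    return a[1] + a[N];            /* both chunks were processed */
}
```

The pointer-chasing loop admits no such split: iteration i+1 cannot even begin its LD until iteration i's LD has produced ptr.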
7. Alternative Parallelization Approaches

    while (ptr = ptr->next)       // LD
        ptr->val = ptr->val + 1;  // X

Cyclic Multithreading (CMT) alternates whole iterations between cores. Pipelined Multithreading (PMT), e.g., DSWP [PACT 2004], instead assigns LD to core 1 and X to core 2 for every iteration.
[Figure: two-core schedules. CMT: LD:1/X:1 on core 1, LD:2/X:2 on core 2, LD:3/X:3 back on core 1, … PMT: core 1 streams LD:1, LD:2, LD:3, … while core 2 runs X:1, X:2, X:3, … one stage behind.]
8. Comparison: IMT, PMT, CMT
[Figure: two-core schedules for IMT (C:i/X:i on alternating cores), PMT (LD stream on core 1, X stream on core 2), and CMT (iterations alternating between cores).]
Iteration rate vs. communication latency:
lat(comm) = 1: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 1 iter/cycle
lat(comm) = 2: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 0.5 iter/cycle
9. Comparison: IMT, PMT, CMT
[Figure: the same two-core schedules as the previous slide.]
Handling cross-thread dependences gives wide applicability (CMT, PMT); keeping loop-carried recurrences thread-local gives fast execution (IMT, PMT). PMT combines both.
10. Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP
Example: 197.parser as a pipeline — Find English Sentences → Parse Sentences (95% of execution) → Emit Results
Decoupled Software Pipelining; PS-DSWP (speculative DOALL middle stage)
11. Decoupled Software Pipelining
12. Decoupled Software Pipelining (DSWP) [MICRO 2005]

    A: while (node) {
    B:     ncost = doit(node);
    C:     cost += ncost;
    D:     node = node->next;
       }

[Figure: the dependence graph over A–D (intra-iteration, loop-carried, register, and control dependences) is collapsed into its DAG of strongly connected components (DAGSCC); thread 1 gets {A, D}, thread 2 gets {B, C}, connected by a communication queue.]
Inter-thread communication latency is a one-time cost.
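A minimal software sketch of the partition above, assuming a mutex-protected bounded buffer stands in for the hardware communication queue: thread 1 runs the critical-path recurrence (A, D) and forwards each node's value; thread 2 runs the off-path work (B, C). The names `produce`, `consume`, `doit`, and `dswp_run`, and the queue implementation, are invented for this sketch.

```c
#include <pthread.h>

struct node { int val; struct node *next; };

/* Bounded queue standing in for the inter-core communication queue. */
#define QSIZE 8
static int q[QSIZE];
static int qhead, qtail, qcount, done;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcv   = PTHREAD_COND_INITIALIZER;

static void produce(int v) {
    pthread_mutex_lock(&qlock);
    while (qcount == QSIZE) pthread_cond_wait(&qcv, &qlock);
    q[qtail] = v; qtail = (qtail + 1) % QSIZE; qcount++;
    pthread_cond_broadcast(&qcv);
    pthread_mutex_unlock(&qlock);
}

static int consume(int *v) {        /* returns 0 once the stream ends */
    pthread_mutex_lock(&qlock);
    while (qcount == 0 && !done) pthread_cond_wait(&qcv, &qlock);
    int ok = qcount > 0;
    if (ok) { *v = q[qhead]; qhead = (qhead + 1) % QSIZE; qcount--; }
    pthread_cond_broadcast(&qcv);
    pthread_mutex_unlock(&qlock);
    return ok;
}

static int doit(int v) { return v * 2; }   /* stand-in for B's work */

static void *stage1(void *arg) {    /* A + D: traverse, forward values */
    for (struct node *n = arg; n; n = n->next)
        produce(n->val);
    pthread_mutex_lock(&qlock);
    done = 1;                       /* signal end of stream */
    pthread_cond_broadcast(&qcv);
    pthread_mutex_unlock(&qlock);
    return 0;
}

static long cost;

static void *stage2(void *arg) {    /* B + C: compute, accumulate */
    (void)arg;
    int v;
    while (consume(&v))
        cost += doit(v);
    return 0;
}

long dswp_run(struct node *list) {
    pthread_t t1, t2;
    cost = 0; done = 0; qhead = qtail = qcount = 0;
    pthread_create(&t1, 0, stage1, list);
    pthread_create(&t2, 0, stage2, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return cost;
}
```

Note how the queue decouples the stages: stage 1 never waits on stage 2's work, which is why the communication latency is paid once to fill the pipeline rather than on every iteration.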
13. Implementing DSWP
[Figure: loop L1 and auxiliary code Aux, with the dataflow graph (DFG) annotated by dependence type — intra-iteration, loop-carried, register, memory, control.]
14. Optimization: Node Splitting to Eliminate Cross-Thread Control
[Figure: DFGs of loops L1 and L2, annotated with intra-iteration, loop-carried, register, memory, and control dependences; splitting a node removes a control dependence that would otherwise cross threads.]
15. Optimization: Node Splitting to Reduce Communication
[Figure: DFGs of loops L1 and L2, same annotation; splitting a node lets a value stay thread-local instead of being communicated.]
16. Constraint: Strongly Connected Components
Consider: [Figure: a dependence graph with a cycle spanning the proposed thread partition.] Cutting a dependence cycle across threads eliminates the pipelined/decoupled property. Solution: partition the DAGSCC, so every strongly connected component stays within one thread.
17. Two Extensions to the Basic Transformation
Speculation: break statistically unlikely dependences; form better-balanced pipelines.
Parallel stages: execute multiple copies of certain “large” stages; stages that contain inner loops are perfect candidates.
18. Why Speculation?

    A: while (node) {
    B:     ncost = doit(node);
    C:     cost += ncost;
    D:     node = node->next;
       }

[Figure: dependence graph and DAGSCC — threads {A, D} and {B, C}, joined by a communication queue; intra-iteration, loop-carried, register, and control dependences shown.]
19. Why Speculation?

    A: while (cost < T && node) {
    B:     ncost = doit(node);
    C:     cost += ncost;
    D:     node = node->next;
       }

[Figure: the exit condition now depends on C, so A, B, C, and D collapse into a single SCC in the DAGSCC — but the new dependences are predictable (the early-exit branch almost never fires).]
20. Why Speculation?

    A: while (cost < T && node) {
    B:     ncost = doit(node);
    C:     cost += ncost;
    D:     node = node->next;
       }

[Figure: speculating the predictable dependences removes them from the graph, breaking the big SCC apart so D, B, and C can again form separate pipeline stages with A's check off the recurrence.]
21. Execution Paradigm
[Figure: the DAGSCC stages D, B, C, A run on separate cores connected by communication queues; when misspeculation is detected on iteration 4, recovery reruns iteration 4.]
22. Understanding PMT Performance
[Figure: two-core schedules — a balanced pipeline where the slowest thread takes 1 cycle/iter, giving 1 iter/cycle; and an unbalanced one with idle time where the slowest thread takes 2 cycles/iter, giving 0.5 iter/cycle.]
The iteration rate is set by the slowest thread, and each thread's rate t_i is at least as large as its longest dependence recurrence. Finding the longest recurrence is NP-hard, and large loops make the problem difficult in practice.
23. Selecting Dependences To Speculate

    A: while (cost < T && node) {
    B:     ncost = doit(node);
    C:     cost += ncost;
    D:     node = node->next;
       }

[Figure: dependence graph and DAGSCC after speculation, with stages D, B, C, and the check A assigned to threads 1–4, connected by communication queues.]
24. Detecting Misspeculation
[Figure: DAGSCC with stages D, B, C and the loop condition A, one per thread.]

Thread 1:
    A1: while (consume(4)) {
    D :     node = node->next;
            produce({0,1}, node);
        }

Thread 2:
    A2: while (consume(5)) {
    B :     ncost = doit(node);
            produce(2, ncost);
    D2:     node = consume(0);
        }

Thread 3:
    A3: while (consume(6)) {
    B3:     ncost = consume(2);
    C :     cost += ncost;
            produce(3, cost);
        }

Thread 4:
    A : while (cost < T && node) {
    B4:     cost = consume(3);
    C4:     node = consume(1);
            produce({4,5,6}, cost < T && node);
        }
25. Detecting Misspeculation
With the branch outcome speculated, threads 1–3 no longer consume it and simply spin:

Thread 1:
    A1: while (TRUE) {
    D :     node = node->next;
            produce({0,1}, node);
        }

Thread 2:
    A2: while (TRUE) {
    B :     ncost = doit(node);
            produce(2, ncost);
    D2:     node = consume(0);
        }

Thread 3:
    A3: while (TRUE) {
    B3:     ncost = consume(2);
    C :     cost += ncost;
            produce(3, cost);
        }

Thread 4:
    A : while (cost < T && node) {
    B4:     cost = consume(3);
    C4:     node = consume(1);
            produce({4,5,6}, cost < T && node);
        }
26. Detecting Misspeculation
Threads 1–3 are unchanged from the previous slide; thread 4 replaces the produced branch outcome with an explicit misspeculation check:

Thread 4:
    B4: cost = consume(3);
    C4: node = consume(1);
        if (!(cost < T && node)) FLAG_MISSPEC();
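The recovery idea can be illustrated single-threaded, assuming a per-iteration checkpoint of the live state stands in for version memory: the body B, C, D runs eagerly, A's condition is checked afterwards on the pre-iteration state, and a failed check rolls back the speculative iteration. `spec_loop` and the checkpointing scheme are invented for this sketch.

```c
#define T 100

struct node { int val; struct node *next; };

static int doit(struct node *n) { return n->val; }

/* Speculate that A's condition (cost < T && node) holds, run
 * B, C, D eagerly, then verify and roll back on misspeculation. */
int spec_loop(struct node *node) {
    int cost = 0;
    for (;;) {
        int saved_cost = cost;                    /* checkpoint */
        struct node *saved_node = node;
        int ncost = node ? doit(node) : 0;        /* B (speculative) */
        cost += ncost;                            /* C (speculative) */
        struct node *next = node ? node->next : 0;/* D (speculative) */
        /* Deferred A check on the pre-iteration state; failure
         * plays the role of FLAG_MISSPEC + recovery. */
        if (!(saved_cost < T && saved_node)) {
            cost = saved_cost;                    /* discard spec state */
            break;
        }
        node = next;                              /* commit iteration */
    }
    return cost;
}
```

In the real pipeline the same check runs in thread 4, and recovery must discard the in-flight speculative state of all stages before rerunning the iteration, which is why the transactional bookkeeping spans multiple cores.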
27. Adding Parallel Stages to DSWP

    while (ptr = ptr->next)       // LD: 1 cycle
        ptr->val = ptr->val + 1;  // X: 2 cycles

Communication latency = 2 cycles. Throughput:
DSWP: 1/2 iteration/cycle
DOACROSS: 1/2 iteration/cycle
PS-DSWP: 1 iteration/cycle
[Figure: three-core PS-DSWP schedule — core 1 streams LD:1 … LD:8 while the replicated X stage alternates: core 2 runs X:1, X:3, X:5 and core 3 runs X:2, X:4.]
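The slide's numbers can be checked with a back-of-the-envelope model (an assumption of this sketch, not a formula from the paper): steady-state cycles per iteration are the slowest stage for DSWP, the cross-core recurrence max(LD, comm) for DOACROSS, and the slowest stage after replicating X over k cores for PS-DSWP. The helper names below are invented.

```c
static int max2(int a, int b) { return a > b ? a : b; }

/* Steady-state cycles per iteration under the simple model. */
int dswp_cpi(int ld, int x)           { return max2(ld, x); }
int doacross_cpi(int ld, int comm)    { return max2(ld, comm); }
int ps_dswp_cpi(int ld, int x, int k) { return max2(ld, (x + k - 1) / k); }
```

With LD = 1, X = 2, comm = 2, k = 2 this reproduces the slide: DSWP and DOACROSS both take 2 cycles/iter (1/2 iteration/cycle), while PS-DSWP takes 1 cycle/iter. It also explains the earlier comparison slide: only DOACROSS/CMT degrades as comm grows.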
28. Things to Think About
How do you decide which dependences to speculate? Look solely at profile data? How do you ensure enough profile coverage? What about code structure? What if you are wrong — undo speculation decisions at run-time?
How do you manage speculation in a pipeline? The traditional definition of a transaction is broken: transaction execution is spread out across multiple cores.
How many cores can DSWP realistically scale to? Can a pipeline be adjusted when the number of available cores increases/decreases, or based on what else is running on the processor?