Performance Scalability on Embedded Many-Core Processors
Michael Champigny, Mercury Computer Systems, 2010 HPEC Workshop

Presentation Transcript

1. Performance Scalability on Embedded Many-Core Processors
Michael Champigny, Research Scientist
Advanced Computing Solutions, Mercury Computer Systems
2010 HPEC Workshop, September 15, 2010

2. Outline
- Motivation
  - Single-chip parallelism and convergence
  - Variability challenges
  - Dynamic scheduling
- Task parallelism
  - Load balancing
  - Work stealing
  - Runtime design
- Parallel runtime
  - Scalability study
  - Data structure considerations

3. Many-Core in Embedded HPC
- Large-scale parallel chip multiprocessors are here
  - Power efficient
  - Small form factors
  - e.g., Tilera TILEPro64
- Convergence is inevitable for many workloads
  - Multi-board solutions became multi-socket solutions...
  - ...and multi-socket solutions will become single-socket solutions
  - e.g., ISR tasks will share a processor
- Software is a growing challenge
  - How do I scale my algorithms and applications?
  - ...without rewriting them?
  - ...and improve productivity?

4. Sources of Variability
- Chip multiprocessors introduce variability to workloads
  - cc-NUMA
  - Memory hierarchies and block sizes
  - Asymmetries in processing elements due to thermal conditions, process variation, and faults
- Workloads themselves are increasingly data-driven
  - Data dependencies lead to processor stalls
  - Complex state machines, branching, pointer chasing
- Convergence compounds the problem
  - Adversarial behavior of software components sharing resources

5. Importance of Load Balancing
- Mapping algorithms to physical resources is painful
  - Requires significant analysis on a particular architecture
  - Doesn't translate well to different architectures
  - Mapping must be revisited as processing elements increase
- Static partitioning is no longer effective for many problems
  - Variability due to convergence and data-driven applications
  - Processing resources are not optimally utilized
  - e.g., processor cores can become idle while work remains
- Load balancing must be performed dynamically, whether in the language, the compiler, or the runtime

6. Task Parallelism & Cache-Oblivious Algorithms
- Load balancing requires small units of work to fill idle "gaps"
  - Fine-grained task parallelism
  - Exposing all fine-grained parallelism at once is problematic: excessive memory pressure
- Cache-oblivious algorithms have provably low cache complexity
  - Minimize the number of memory transactions
  - Scale well unmodified on any cache-coherent parallel architecture
- Based on the divide-and-conquer method of algorithm design (see the sketch below)
  - Tasks are subdivided on demand, only when a processor idles
  - Tasks create subtasks recursively until a cutoff
  - Leaf tasks fit in the private caches of all processors
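
The cutoff-based divide-and-conquer pattern on this slide can be made concrete with a minimal C++ sketch. This is an illustration only, not code from the presentation: CUTOFF, process_leaf, and recurse are invented names, and a task runtime would fork the two recursive calls rather than run them serially.

// Minimal sketch of a cache-oblivious divide-and-conquer kernel.
// The cutoff is chosen so that a leaf's working set fits in a core's
// private cache; parallelism is exposed recursively, not all at once.
#include <cstddef>

const std::size_t CUTOFF = 128;              // leaf size tuned to the private cache

void process_leaf (double* data, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)      // base case: small, cache-resident work
        data[i] *= 2.0;
}

void recurse (double* data, std::size_t n)
{
    if (n <= CUTOFF)
        process_leaf (data, n);              // leaf task fits in a private cache
    else {
        std::size_t half = n / 2;            // subdivide on demand into two subtasks;
        recurse (data, half);                //   a task runtime would fork these two
        recurse (data + half, n - half);     //   calls (as in the later FFT examples)
    }
}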

7. Scheduling Tasks on Many-Cores
- Runtime schedulers assign tasks to processing resources
  - Greedy: make decisions only when required (i.e., when a processor is idle)
  - Ensure maximum utilization of available compute resources
  - Have knowledge of the instantaneous system state
- The scheduler must be highly optimized for use by many threads
  - Limit sharing of data structures to ensure scalability
  - Any overhead in the scheduler will impact algorithm performance
- Work-stealing schedulers are provably efficient
  - Provide dynamic load balancing
  - Idle cores look for work to "steal" from other cores
  - Employ heuristics to improve locality and cache reuse

8. Designing a Parallel Runtime for Many-Core
- Re-architected our dynamic scheduler for many-core
  - Chimera Parallel Programming Platform
  - Exposes parallelism in C/C++ code incrementally using a C++ compiler
  - Ported to several many-core architectures from different vendors
- Insights gained improved general performance scalability
  - Affinity-based work-stealing policy optimized for cc-NUMA
  - Virtual NUMA topology used to improve data locality
  - Core data structures adapt to current runtime conditions
  - Tasks are grouped into NUMA-friendly clusters to amortize steal cost
  - Dynamic load balancing across OpenCL- and CUDA-supported devices
  - No performance penalty for low numbers of cores (i.e., multi-core)

9. Work-Stealing Scheduling Basics
- Cores operate on local tasks (i.e., work) until they run out
  - A core operating on local work is in the work state
- When a core becomes idle it looks for work at a victim core
  - This operation is called stealing and the perpetrator is labeled a thief
  - The cycle repeats until work is found or no more work exists
  - A thief looking for work is in the idle state
  - When all cores are idle the system reaches the quiescent state (see the sketch below)
- Basic principles of optimizing a work-stealing scheduler
  - Keep cores in the work state for as long as possible; this is good for locality, as local work stays in private caches
  - Stealing is expensive, so minimize it and amortize its cost; stealing larger-grained work is preferable
  - Choose your victim wisely; stealing from a NUMA neighbor is preferable
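
The toy scheduler below is a deliberately simplified illustration of these states, not the Chimera runtime. Each core owns a mutex-protected deque: it pops local work from the back of its own deque (work state), steals the oldest task from a victim's front when it runs dry (idle state), and all threads exit once nothing remains pending (quiescent state). A production scheduler would use lock-free deques, NUMA-aware victim selection, and larger steal amounts.

// Illustrative toy work-stealing scheduler (C++11): one mutex-protected
// deque per core, linear victim order, steal-one policy.
#include <atomic>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Worker {
    std::mutex m;
    std::deque<std::function<void()>> tasks;    // this core's local work
};

int main()
{
    const int P = 4;                            // number of cores
    std::vector<Worker> w(P);
    std::atomic<int> pending{0};

    // Seed all tasks on core 0; the other cores must steal to stay busy.
    for (int i = 0; i < 1000; ++i) {
        w[0].tasks.push_back([i] { volatile int x = i * i; (void)x; });
        ++pending;
    }

    auto run = [&](int self) {
        while (pending.load() > 0) {            // quiescent once nothing is pending
            std::function<void()> t;
            {                                   // work state: pop from the back of the local deque
                std::lock_guard<std::mutex> g(w[self].m);
                if (!w[self].tasks.empty()) {
                    t = std::move(w[self].tasks.back());
                    w[self].tasks.pop_back();
                }
            }
            if (!t) {                           // idle state: become a thief
                int victim = (self + 1) % P;    // linear victim order (simplest policy)
                std::lock_guard<std::mutex> g(w[victim].m);
                if (!w[victim].tasks.empty()) { // steal the oldest (coarsest) task from the front
                    t = std::move(w[victim].tasks.front());
                    w[victim].tasks.pop_front();
                }
            }
            if (t) { t(); --pending; }          // back to the work state
            else   std::this_thread::yield();   // nothing to steal: back off briefly
        }
    };

    std::vector<std::thread> threads;
    for (int p = 0; p < P; ++p) threads.emplace_back(run, p);
    for (auto& th : threads) th.join();
    std::printf("all cores idle: system is quiescent\n");
    return 0;
}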

10. Work-Stealing Implications on Scheduler Design
- The work-stealing algorithm leads to many design decisions
  - What criteria to apply to choose a victim?
  - How to store pending work (i.e., tasks)?
  - What to do when the system enters the quiescent state?
  - How much work to steal?
  - Distribute work proactively (i.e., load sharing)?
  - Periodically rebalance work?
  - Actively monitor/sample the runtime state?

11. Example: Victim Selection Policy on Many-Core
- Victim selection policy: when a core becomes idle, which core does it try to steal from?
- Several choices are available (see the sketch below)
  - Randomized order
  - Linear order
  - NUMA order
- We found that NUMA ordering provided better scalability
  - The benefit became more pronounced with larger numbers of cores
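
One way the three orderings might be built is sketched below; it is illustrative only. numa_node_of() is an assumed topology query (eight cores per node here), standing in for the virtual NUMA topology mentioned on slide 8.

// Illustrative victim-ordering policies: random, linear, and NUMA order.
#include <algorithm>
#include <cstdlib>
#include <random>
#include <vector>

enum class Policy { Random, Linear, Numa };

int numa_node_of (int core) { return core / 8; }     // illustrative mapping only

// Build the order in which an idle core `self` probes potential victims.
std::vector<int> victim_order (int self, int ncores, Policy policy)
{
    std::vector<int> order;
    for (int c = 0; c < ncores; ++c)
        if (c != self) order.push_back(c);

    switch (policy) {
    case Policy::Random: {                           // classic randomized stealing
        std::mt19937 rng(self);
        std::shuffle(order.begin(), order.end(), rng);
        break;
    }
    case Policy::Linear:                             // fixed scan order: 0, 1, 2, ...
        break;
    case Policy::Numa: {                             // probe NUMA neighbors first
        const int home = numa_node_of(self);
        std::stable_sort(order.begin(), order.end(), [home](int a, int b) {
            return std::abs(numa_node_of(a) - home) < std::abs(numa_node_of(b) - home);
        });
        break;
    }
    }
    return order;
}

An idle core would then probe victims in this order until a steal succeeds.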

12. Optimal Amount of Tasks to Steal
- When work is stolen, how much do we take from the victim?
  - If we take too much, the victim begins looking for work too soon
  - If we don't take enough, the thief begins looking for work again too soon
- We conducted an empirical study to determine the best strategy
  - Intuitively, stealing half the available work should be optimal

13. Impact of Steal Amount Policy on Data Structures
- Steal a single task at a time
  - Can be implemented with any linear structure (e.g., a dynamic array)
  - Allows concurrent operation at both ends, without locks in some cases
- Steal a block of tasks at a time
  - Implemented with a linear structure of blocks, each containing at most a fixed number of tasks
  - Can lead to load imbalance in some situations: if few tasks exist in the system, one core could own them all
- Steal a fraction of the available tasks at a time (all three policies are sketched below)
  - We picked 0.5 as the fraction to steal
  - The data structure is a more complex list of trees
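
The three policies can be sketched against a simple deque of pending tasks, as below. This is a simplification for illustration: as the slide notes, the single-task policy maps to a plain linear structure, the block policy to a list of fixed-size blocks, and the fraction policy (0.5 in the study) to a more complex list of trees in the actual runtime. The function names are invented.

// Illustrative steal-amount policies applied to a victim's pending-task deque.
#include <cstddef>
#include <deque>

using Task = int;                               // stand-in for a real task descriptor

// Move up to `count` of the victim's oldest tasks over to the thief.
static void transfer (std::deque<Task>& victim, std::deque<Task>& thief,
                      std::size_t count)
{
    for (std::size_t i = 0; i < count && !victim.empty(); ++i) {
        thief.push_back(victim.front());        // steal oldest (largest-grained) work first
        victim.pop_front();
    }
}

void steal_one (std::deque<Task>& victim, std::deque<Task>& thief)
{
    transfer(victim, thief, 1);                 // classic policy: one task per steal
}

void steal_block (std::deque<Task>& victim, std::deque<Task>& thief,
                  std::size_t block_size)
{
    transfer(victim, thief, block_size);        // fixed-size block per steal
}

void steal_fraction (std::deque<Task>& victim, std::deque<Task>& thief,
                     double fraction = 0.5)     // 0.5 = steal half, as in the study
{
    transfer(victim, thief,
             static_cast<std::size_t>(victim.size() * fraction));
}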

14. Empirical Study of Steal Amount on Many-Core
- Goal: determine the impact of the steal-amount policy on performance scalability
  - Scalability defined as the ratio of single-core latency to P-core latency
- Experiments run on an existing many-core embedded processor
  - Tilera TILEPro64 using 56 cores
  - GNU compiler 4.4.3
  - SMP Linux 2.6.26
  - Mercury Chimera used as the parallel runtime platform
- Existing industry-standard benchmarks modified for task parallelism
  - Barcelona OpenMP Task Suite 1.1
  - MIT Cilk 5.4.6
- Best-of-10 latency used for the scalability calculation (see the sketch below)
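
For clarity, the scalability metric reduces to the small calculation below. This is an editorial sketch, not benchmark code; the function names are invented.

// The scalability metric as defined on the slide: best-of-10 single-core
// latency divided by best-of-10 P-core latency (ideal value on P cores is P).
#include <algorithm>
#include <vector>

double best_of (const std::vector<double>& latencies)   // assumes at least one run
{
    return *std::min_element(latencies.begin(), latencies.end());
}

double scalability (const std::vector<double>& runs_one_core,
                    const std::vector<double>& runs_p_cores)
{
    return best_of(runs_one_core) / best_of(runs_p_cores);
}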

15. Tilera TILEPro64 Processor Architecture

16. Tilera TILEPro64 Processor Features
- Processing
  - 64 tiles arranged in an 8 × 8 grid @ 23 W
  - 866 MHz clock
  - 32-bit VLIW ISA with 64-bit instruction bundles (3 ops/cycle)
- Communication
  - iMesh 2D on-chip interconnect fabric
  - 1-cycle latency per tile-to-tile hop
- Memory
  - Dynamic Distributed Cache aggregates the L2 caches into a coherent 4 MB L3 cache
  - 5.6 MB combined on-chip cache

17. Task Parallel Benchmarks

Benchmark     Source  Domain         Cutoff    Description
FFT           BOTS    Spectral       128       1M point, FFTW generated
Fibonacci     BOTS    Micro          10        Compute 45th number
Heat          Cilk    Solver         512       Diffusion, 16M point mesh
MatrixMult    Cilk    Dense Linear   16        512×512 square matrices
NQueens       BOTS    Search         3         13×13 chessboard
PartialLU     Cilk    Dense Linear   32        1M point matrix
SparseLU      BOTS    Sparse Linear  20        2K×2K sparse matrix
Sort          BOTS    Sort           2048, 20  20M 4-byte integers
StrassenMult  BOTS    Dense Linear   64, 3     1M point matrices

18. Example: FFT Twiddle Factor Generator (Serial)

void fft_twiddle_gen (int i, int i1, COMPLEX* in, COMPLEX* out,
                      COMPLEX* W, int n, int nW, int r, int m)
{
    if (i == (i1 - 1))
        fft_twiddle_gen1 (in + i, out + i, W, r, m, n, nW * i, nW * m);
    else {
        int i2 = (i + i1) / 2;
        fft_twiddle_gen (i, i2, in, out, W, n, nW, r, m);
        fft_twiddle_gen (i2, i1, in, out, W, n, nW, r, m);
    }
}

19. Example: FFT Twiddle Factor Generator (OpenMP)

void fft_twiddle_gen (int i, int i1, COMPLEX* in, COMPLEX* out,
                      COMPLEX* W, int n, int nW, int r, int m)
{
    if (i == (i1 - 1))
        fft_twiddle_gen1 (in + i, out + i, W, r, m, n, nW * i, nW * m);
    else {
        int i2 = (i + i1) / 2;
        #pragma omp task untied
        fft_twiddle_gen (i, i2, in, out, W, n, nW, r, m);
        #pragma omp task untied
        fft_twiddle_gen (i2, i1, in, out, W, n, nW, r, m);
        #pragma omp taskwait
    }
}

20. Example: FFT Twiddle Factor Generator (Chimera)

void fft_twiddle_gen parallel (int i, int i1, COMPLEX* in, COMPLEX* out,
                               COMPLEX* W, int n, int nW, int r, int m)
{
    if (i == (i1 - 1))
        fft_twiddle_gen1 (in + i, out + i, W, r, m, n, nW * i, nW * m);
    else join {
        int i2 = (i + i1) / 2;
        fork (fft_twiddle_gen, i, i2, in, out, W, n, nW, r, m);
        fork (fft_twiddle_gen, i2, i1, in, out, W, n, nW, r, m);
    }
}

21. BOTS: Fibonacci

22. BOTS: Fast Fourier Transform

23. Cilk: Matrix-Matrix Multiply

24. BOTS: Strassen Matrix-Matrix Multiply

25. BOTS: Sparse LU Factorization

26. Cilk: Partial Pivoting LU Decomposition

27. Cilk: Heat

28. BOTS: N-Queens

29. BOTS: Sort

30. Conclusions
- The popular choice of stealing a single task at a time is suboptimal
  - Stealing a fraction of the available tasks led to improved scalability
- The popular choice of randomized victim selection is suboptimal
  - We found that NUMA ordering improved scalability slightly
- Cache-oblivious algorithms are a good fit for many-core platforms
  - Many implementations are available in the literature
  - They scale well across a wide range of processors
- ...but research continues and questions remain
  - What about thousands of cores?
  - How far can we scale algorithms on cc-NUMA architectures?

31. Questions?
Michael Champigny
mchampig@mc.com
Thank you!