ShyLU and Thread Scalable Subdomain Solvers
Siva Rajamanickam. Joint work: Joshua Booth, Mehmet Deveci, Kyungjoo Kim, Andrew Bradley, Erik Boman. Collaborators: Clark Dohrmann, Heidi Thornquist.
ShyLU and Subdomain Solvers: Overview
MPI+X based subdomain solvers. Decouple the notion of one MPI rank as one subdomain: subdomains can span multiple MPI ranks, each with its own subdomain solver using X or MPI+X.
Subpackages of ShyLU provide multiple Kokkos-based options for on-node parallelism:
- Basker: LU or ILU(t) factorization
- Tacho: incomplete Cholesky, IC(k)
- Fast-ILU: fast ILU factorization for GPUs
- KokkosKernels: coloring-based Gauss-Seidel (M. Deveci), triangular solves (A. Bradley)
Lots more work (w/ Christian Trott). Under active development. Jointly funded by ASC, ATDM, FASTMath, LDRD.
[Diagram: the ShyLU software stack. Amesos2 and Ifpack2 interfaces over the subdomain solvers Basker, Tacho, FAST-ILU, and KLU2, with KokkosKernels providing SGS and Tri-Solve (HTS).]
Themes for Architecture Aware Solvers and Kernels: Data Layouts
- Specialized, architecture-aware data layouts: coalesced memory access, padding, Array of Structures vs. Structure of Arrays
- Kokkos-based abstractions (H. C. Edwards and C. Trott)
- Two-dimensional layouts for matrices: allows 2D algorithms for solvers and kernels. Bonus: fewer synchronizations with 2D algorithms. Cons: much harder to design correctly. Better utilization of hierarchical memory such as High Bandwidth Memory (HBM) on Intel Xeon Phi, or NVRAM
- Hybrid layouts: better for very heterogeneous problems
Themes for Architecture Aware Solvers and Kernels: Fine-Grained Synchronization
- Synchronizations are expensive. 1D algorithms for factorizations and solvers, such as nested-dissection (ND) based solvers, have a huge synchronization bottleneck at the final separator, and are impossible to run efficiently on architectures designed for massive data parallelism (GPUs). This is true only for global synchronizations in a fork/join style model.
- Fine-grained synchronizations: between a handful of threads (teams of threads); point-to-point synchronizations instead of global synchronizations. Park et al. (ISC14) showed this for the triangular solve.
- Thread-parallel reductions wherever possible
- Atomics are cheap, but only when used judiciously
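As a toy illustration of the point-to-point style (not the solvers' actual implementation), the sketch below lets each task wait only on its direct predecessor via a per-task flag, rather than joining all tasks at a global barrier. The name `run_p2p_chain` and the chain-of-prefix-sums workload are invented for this example.

```python
# Toy point-to-point synchronization: task i waits only on the one
# task it depends on (its predecessor), signaled via a per-task
# threading.Event, instead of a global fork/join barrier.
import threading

def run_p2p_chain(values):
    n = len(values)
    done = [threading.Event() for _ in range(n)]
    out = [0] * n

    def task(i):
        if i > 0:
            done[i - 1].wait()   # wait only on the direct predecessor
        out[i] = values[i] + (out[i - 1] if i > 0 else 0)
        done[i].set()            # signal exactly the tasks that need us

    threads = [threading.Thread(target=task, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out                   # running prefix sums of `values`
```

Event set/wait gives the needed happens-before edge, so each consumer sees its producer's write without any global synchronization point.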
Themes for Architecture Aware Solvers and Kernels: Task Parallelism
- Statically scheduled tasks: determine the static schedule of tasks from a task graph; eliminate unnecessary synchronizations (tasks scheduled on the same thread do not need to synchronize); find transitive relationships to reduce synchronization even further (Jongsoo Park et al.)
- Dynamically scheduled tasks: use a tasking model that allows fine-grained synchronizations; requires support for futures; not the fork/join model where the parent forks a set of tasks and blocks until they finish
- Kokkos Tasking API: joint work with Carter Edwards, Stephen Olivier, Kyungjoo Kim, Jon Berry, George Stelle
Themes for Architecture Aware Solvers and Kernels: Asynchronous Algorithms
- System-level algorithms: communication-avoiding (s-step) methods, not truly asynchronous but can be done asynchronously as well (multiple authors from the early 1980s); pipelined Krylov methods (recently Ghysels, Vanroose et al., and others)
- Node-level algorithms: fine-grained asynchronous iterative ILU factorizations, i.e., an iterative algorithm to compute the ILU factorization (Chow et al.) that is asynchronous in the updates; fine-grained asynchronous iterative triangular solves (Jacobi iterations for the triangular solve); Kaczmarz, Cimmino type methods and their block variants (Boman's talk, Monday)
ShyLU/Basker: (I)LU Factorization
Basker: sparse (I)LU factorization (w/ Joshua Booth)
- Block triangular form (BTF) based LU factorization; nested dissection on the large BTF components
- 2D layout of coarse- and fine-grained blocks (previous work by Sherry Li, Rothberg & Gupta)
- Data-parallel, Kokkos-based implementation
- Fine-grained parallel algorithm with P2P synchronizations
- Parallel version of the Gilbert-Peierls algorithm (as in KLU); the left-looking 2D algorithm requires careful synchronization between the threads; all-reduce operations between threads avoid atomic updates
See "Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts" (J. Booth, S. Rajamanickam, and H. Thornquist).
ShyLU/Basker: Steps in a Left-Looking Factorization
Different colors show different threads; grey means not active at that step. Each left-looking factorization for the final separator shown here involves four independent triangular solves, a mat-vec, and updates (P2P communication); then two independent triangular solves, a mat-vec, and updates; then a final triangular solve (walking up the nested-dissection tree).
ShyLU/Basker: Performance Results
Speedup of 5.9x (CPU) and 7.4x (Xeon Phi) over KLU (geometric mean), and up to 53x (CPU) and 13x (Xeon Phi) over MKL Pardiso. Performance profile shown for a matrix set containing both high-fill and low-fill matrices (16 threads on CPUs, 32 threads on Xeon Phi). On low-fill matrices Basker is consistently the better solver; on high-fill matrices MKL Pardiso is consistently the better solver.
Kokkos Tasking API
H. C. Edwards, S. Olivier, J. Berry, S. Rajamanickam, et al. Supports Pthreads, Qthreads, and CUDA (still very experimental) backends.
ShyLU/Tacho: Task-Based Cholesky Factorization
Fine-grained, task-based, right-looking Cholesky factorization (w/ K. Kim)
- 2D layout of blocks based on nested dissection and the fill pattern
- Task-parallel, Kokkos-based implementation
- Fine-grained parallel algorithm with synchronizations represented as a task DAG
- Algorithm-by-blocks style, originally used for parallel out-of-core factorizations (Quintana et al., Buttari et al.); a block-based rather than scalar-based algorithm
See "Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout" (K. Kim, S. Rajamanickam, G. Stelle, H. C. Edwards, S. Olivier), arXiv.
ShyLU/Tacho: Steps in the Factorization
- The degree of concurrency still depends on the nested dissection ordering
- The parallelism, however, is not tied to the nested dissection ordering
ShyLU/Tacho: More Realistic Task DAG
The complete task DAG is never formed; it is shown here only to demonstrate the degree of concurrency. The concurrency comes from fine-grained tasking and a 2D right-looking algorithm.
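To make the algorithm-by-blocks idea concrete, here is a minimal sequential sketch (not Tacho's code): each POTRF/TRSM/GEMM-style block operation below would become one task in the DAG, with edges given by which blocks each task reads and writes. The name `chol_by_blocks` is illustrative, and the blocks are stored dense for simplicity.

```python
# Right-looking Cholesky by blocks on a dense SPD matrix A (list of
# lists); the lower triangle of A is overwritten by L. Each commented
# phase is one family of tasks in the 2D task DAG.
import math

def chol_by_blocks(A, nb):
    n = len(A)
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # POTRF task: factor the diagonal block in place (unblocked)
        for j in range(k0, k1):
            A[j][j] = math.sqrt(A[j][j] - sum(A[j][p] ** 2 for p in range(k0, j)))
            for i in range(j + 1, k1):
                A[i][j] = (A[i][j] - sum(A[i][p] * A[j][p] for p in range(k0, j))) / A[j][j]
        # TRSM tasks: solve for the panel below the diagonal block
        for i in range(k1, n):
            for j in range(k0, k1):
                A[i][j] = (A[i][j] - sum(A[i][p] * A[j][p] for p in range(k0, j))) / A[j][j]
        # SYRK/GEMM tasks: rank-nb update of the trailing submatrix
        for i in range(k1, n):
            for j in range(k1, i + 1):
                A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k0, k1))
    return A
```

The TRSM tasks of one panel are mutually independent, as are the trailing-update tasks, which is exactly where the block-level concurrency in the DAG comes from.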
ShyLU/Tacho: Experimental Results
- Results shown for two matrices chosen for very different nnz/n (levels of fill)
- Euclid results shown for reference: it is an MPI code, run with RCM ordering (best for Euclid); there are not many parallel IC(k) codes
- Speedup numbers are relative to single-threaded Cholesky; there is a small overhead for single-threaded Cholesky over serial Cholesky
- Results shown for both CPU and Xeon Phi architectures
Asynchronous ILU Factorizations
The factorization is exact on the sparsity pattern S. ILU methods were interpreted this way before the current "traditional" ILU (Varga 1959, Oliphant 1961, Buleev 1960). Chow & Patel (2015) use this property to compute ILU factorizations: solve (constraint equations reconstructed here in their standard form)

  sum_{k <= min(i,j)} l_ik u_kj = a_ij   for all (i,j) in S,

with the constraints l_ii = 1 and L, U nonzero only on S. These equations are nonlinear, and there are more of them than in the original system! They can be solved in a fine-grained, inexact manner given a good initial guess.
Asynchronous ILU Factorizations
Write each unknown in terms of the other unknowns (the standard Chow-Patel form):

  l_ij = (a_ij - sum_{k<j} l_ik u_kj) / u_jj   for i > j
  u_ij =  a_ij - sum_{k<i} l_ik u_kj           for i <= j

This is of the form x = G(x); solve it with a fixed-point iteration. Depending on how the fixed-point iteration is carried out, we obtain a nonlinear Jacobi or a nonlinear Gauss-Seidel method.
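A minimal dense-storage sketch of one such sweep follows (nonlinear Jacobi flavor: every update reads only the previous iterate, so all updates could run concurrently). The names `sweep_ilu0` and `fast_ilu0` are illustrative, not the ShyLU API, and the sparsity pattern is simply taken as the pattern of A, i.e., ILU(0).

```python
# One Jacobi-style sweep of the Chow-Patel fixed-point ILU iteration.
# A: dense n x n matrix (list of lists); S: list of nonzero positions;
# L has unit diagonal (stored explicitly), U holds the upper part.

def sweep_ilu0(A, L, U, S):
    n = len(A)
    newL = [row[:] for row in L]
    newU = [row[:] for row in U]
    for (i, j) in S:
        # partial inner product over k < min(i, j)
        s = sum(L[i][k] * U[k][j] for k in range(min(i, j)))
        if i > j:                        # strictly lower part -> L
            newL[i][j] = (A[i][j] - s) / U[j][j]
        else:                            # upper part incl. diagonal -> U
            newU[i][j] = A[i][j] - s
    return newL, newU

def fast_ilu0(A, sweeps=10):
    n = len(A)
    S = [(i, j) for i in range(n) for j in range(n) if A[i][j] != 0.0]
    # initial guess: L = unit lower part of A, U = upper part of A
    L = [[A[i][j] if i > j else (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    U = [[A[i][j] if i <= j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        L, U = sweep_ilu0(A, L, U, S)
    return L, U
```

Because the update of each unknown only needs unknowns that sit "earlier" in the elimination, the iteration settles in a bounded number of sweeps on small problems, and a few sweeps already give a usable preconditioner.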
Test Set and Exact (Sequential) ILU Factorization
Test set from the University of Florida sparse matrix collection. Three platforms: AMD Magny-Cours CPUs (8x8 cores), Intel Xeon Phi (KNC) with 228 threads, and a Tesla K20x with 5.6 GB memory. Exact ILU factorization uses Trilinos/Ifpack with RCM reordering of the matrix, and Trilinos/Belos for the solvers.

Exact ILU(k), fill levels k = 0..5 (* = failure):

  Matrix         n      k=0   k=1   k=2   k=3   k=4   k=5
  thermal2       1.2M   1934  1225   856   637   507   440
  af_shell3      504K   1248   788   583   462   369   309
  ecology2       999K   1625   988   696   576   467   414
  apache2        715K   1294   619   394   289   235   188
  offshore       259K    485     *     *     *     *     *
  G3_circuit     1.5M   1414   757   546   421   341   303
  Parabolic_fem  525K    313   238   164   129   106    91
Asynchronous ILU Factorization + Triangular Solves vs. Exact ILU Factorization
FastILU (10 sweeps, RCM ordering, GPUs), fill levels k = 0..5 (* = failure):

  Matrix         k=0   k=1   k=2   k=3   k=4   k=5
  thermal2       1312   869   692   635   630   626
  af_shell3         *     *     *     *     *     *
  ecology2       1708  1082   832   761   738   723
  apache2         967   541   326   299   293   294
  offshore        334     *     *     *     *     *
  G3_circuit      860   524   426   360   312   254
  Parabolic_fem   334   271   255   249   263   292

Exact ILU baseline (repeated for comparison):

  Matrix         k=0   k=1   k=2   k=3   k=4   k=5
  thermal2       1934  1225   856   637   507   440
  af_shell3      1248   788   583   462   369   309
  ecology2       1625   988   696   576   467   414
  apache2        1294   619   394   289   235   188
  offshore        485     *     *     *     *     *
  G3_circuit     1414   757   546   421   341   303
  Parabolic_fem   313   238   164   129   106    91
Stabilization Techniques: FastILU Factorization
There is a lot of concurrency on the GPUs, so the iteration operates closer to the nonlinear Jacobi regime than to the nonlinear Gauss-Seidel regime. A common technique to stabilize convergence is damping (under-relaxation): a damping factor in (0, 1) controls the effect of numerical overflow and of the enormous concurrency. Choosing the damping factor, however, is much trickier.
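The damped update can be sketched generically: with damping factor omega in (0, 1], the new iterate blends the previous iterate with the raw sweep. The helper name `damped_sweep` and the scalar map used in the usage note are invented for illustration.

```python
# Under-relaxation on a fixed-point sweep x <- G(x):
#   x_new = x + omega * (G(x) - x)
# omega = 1 recovers the undamped sweep; smaller omega damps the
# effect of stale reads under high concurrency.

def damped_sweep(G, x, omega):
    gx = G(x)
    return [xi + omega * (gi - xi) for xi, gi in zip(x, gx)]
```

Note that any fixed point of G is unchanged by damping; only the path toward it (and its stability) is affected.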
Restart/Blocking Techniques
- Use a warm restart for ILU(k), k > 0: instead of using A as the initial guess, use a few sweeps of ILU(k-1) as the initial guess. This is a continuation-like method, recursively finding initial guesses from lower fill levels.
- Block Jacobi iteration instead of Jacobi iteration on the triangular solves, with small block sizes for each thread. One needs to be careful about performance vs. convergence; assigning blocks to a warp/thread block balances both, and Kokkos allows teams of threads to do this naturally.
- Another stabilization approach, Manteuffel shifting, did not help these methods.
Asynchronous ILU Factorization + Triangular Solves vs. Exact ILU Factorization
FastILU (10 sweeps, RCM ordering, GPUs, damping factor = 0.5, continuation initial guess), fill levels k = 0..5:

  Matrix         k=0   k=1   k=2   k=3   k=4   k=5
  thermal2       1343   904   727   646   623   608
  af_shell3       902   666   569   534   547   424
  ecology2       1704  1114   879   799   759   725
  apache2        1043   592   370   291   267   267
  offshore        350   205   178   178   282   274
  G3_circuit      903   567   498   419   357   291
  Parabolic_fem   358   296   253   246   251   256

(Exact ILU baseline as in the earlier table.) This is still a big win, as the per-iteration cost of the parallel algorithm is a fraction of that of the serial algorithm.
KokkosKernels/SGS: Coloring-Based Symmetric Gauss-Seidel
KokkosKernels/SGS (w/ M. Deveci, E. Boman): a symmetric Gauss-Seidel that uses coloring to expose parallelism.
- A highly parallel coloring algorithm that returns 4x (geometric mean) and up to 48x fewer colors; #colors == #synchronization calls
- Thread-team based implementation of SGS
- Reduces the overall CG solve time by 33% compared to cuSPARSE
See "Parallel Graph Coloring for Manycore Architectures", M. Deveci, E. Boman, K. Devine, S. Rajamanickam.
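An illustrative sketch of the idea (not the KokkosKernels implementation): greedily color the matrix graph so that adjacent rows get different colors, then sweep Gauss-Seidel color by color. Rows of one color share no edges, so each color is a fully parallel batch, and the number of colors equals the number of synchronization points. The function names are invented for this example.

```python
# Greedy distance-1 coloring followed by a colored Gauss-Seidel sweep.

def greedy_color(adj):
    """adj[i] = neighbors of vertex i; returns a color per vertex."""
    color = [-1] * len(adj)
    for v in range(len(adj)):
        used = {color[u] for u in adj[v]}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def colored_gs_sweep(A, b, x, color):
    """Gauss-Seidel sweep over a dense-stored A, one color at a time.
    All rows of a given color are independent -> a parallel batch."""
    n = len(b)
    for c in range(max(color) + 1):
        for i in range(n):
            if color[i] == c:
                s = sum(A[i][j] * x[j] for j in range(n) if j != i)
                x[i] = (b[i] - s) / A[i][i]
    return x
```

With two colors on a tridiagonal matrix this reduces to the classic red-black Gauss-Seidel ordering.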
HTS Triangular Solve
Hybrid triangular solves (A. Bradley): the algorithm is a combination of a level-set triangular solve and a recursive blocked triangular solve, with an implementation based on P2P synchronizations similar to Park et al.
- The level-set portion of the algorithm works on the low-fill portions of the problem
- The recursive blocked algorithm is used in the denser portions (separators, supernodes)
- Scales well on CPU and Xeon Phi architectures
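The level-set part can be sketched as follows (a minimal sketch, with an invented helper name): level(i) is one more than the maximum level over the off-diagonal nonzeros row i depends on. Rows within a level have no mutual dependencies and can be solved in parallel, with one synchronization between consecutive levels.

```python
# Level-set scheduling for a sparse lower triangular solve.
# L_pattern[i] = column indices j < i of the nonzeros in row i of L.
# Returns the rows grouped by level, in the order they must be solved.

def build_levels(L_pattern, n):
    level = [0] * n
    for i in range(n):
        # row i can start once all rows it reads from are done
        level[i] = 1 + max((level[j] for j in L_pattern[i]), default=0)
    levels = {}
    for i in range(n):
        levels.setdefault(level[i], []).append(i)
    return [levels[k] for k in sorted(levels)]
```

The number of levels bounds the number of global synchronizations; on low-fill portions the levels are wide (lots of parallelism), which is why denser separator blocks are better handled by the recursive blocked algorithm.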
Conclusions
Themes around thread-scalable subdomain solvers: data layouts, fine-grained synchronizations, task parallelism, asynchronous algorithms.
Current work:
- Basker: thread-parallel (I)LU factorization
- Tacho: task-parallel Cholesky factorization
- Fast-ILU: asynchronous ILU of Chow et al., with damping and warm restart
- SGS: multithreaded coloring-based symmetric Gauss-Seidel
- HTS: multithreaded triangular solver for CPUs and Xeon Phis
Questions?
Iterative Inexact Triangular Solves
Computing the LU factors is only part of the cost; we also need to apply the factors in a fine-grained fashion. Solve the triangular systems with a fixed number of sweeps of Jacobi iteration. Use the matrix splitting L = D + E, where D is the diagonal and E the strictly triangular part. Given an initial guess x^0, use the update

  x^{k+1} = D^{-1} (b - E x^k)

This converges if the spectral radius of D^{-1} E is less than one, which is always true for triangular matrices (the iteration matrix is strictly triangular, hence has spectral radius zero). Note: the spectral radius argument gives only asymptotic convergence.
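The sweep above can be sketched directly (illustrative name `jacobi_lower_solve`, dense storage for brevity). Since the iteration matrix -D^{-1}E is strictly lower triangular and therefore nilpotent, at most n sweeps reproduce the exact solution in exact arithmetic; a fixed small number of sweeps gives the inexact apply.

```python
# Jacobi sweeps for a lower triangular solve L x = b, L split as
# D + E (diagonal plus strictly lower part):
#   x^{k+1} = D^{-1} (b - E x^k)

def jacobi_lower_solve(L, b, sweeps):
    n = len(b)
    x = [b[i] / L[i][i] for i in range(n)]   # initial guess D^{-1} b
    for _ in range(sweeps):
        # each component reads only the previous iterate -> fully parallel
        x = [(b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
             for i in range(n)]
    return x
```

Each sweep is a parallel loop over the rows, with one synchronization per sweep, in contrast with the serial forward substitution.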
Asynchronous ILU Factorization + Triangular Solves vs. Exact ILU Factorization
FastILU (10 sweeps, RCM ordering, GPUs, damping factor = 0.5), fill levels k = 0..5:

  Matrix         k=0   k=1   k=2   k=3   k=4   k=5
  thermal2       1343   924   840   815   819   811
  af_shell3       901   653   565   589   554   599
  ecology2       1704  1103   925   910   893   922
  apache2        1043   629   432   484   427   497
  offshore        350   211   184   175   172   172
  G3_circuit      904   607   512   471   431   410
  Parabolic_fem   356   328   295   288   285   286

(Exact ILU baseline as in the earlier table.)
Asynchronous ILU Factorization vs. Exact ILU Factorization
FastILU (5 sweeps, RCM ordering, GPUs), fill levels k = 0..5 (* = failure):

  Matrix         k=0   k=1   k=2   k=3   k=4   k=5
  thermal2       1421  1110  1086  1145  1172  1178
  af_shell3         *     *     *     *     *     *
  ecology2       1807  1311  1271  1300  1344  1308
  apache2        1001   768   815   818   827   847
  offshore          *     *     *     *     *     *
  G3_circuit      868   612   586   574   568   562
  Parabolic_fem   425   467   421   474   480   527

(Exact ILU baseline as in the earlier table.)