ShyLU - PowerPoint Presentation

Uploaded by trish-goza on 2017-03-21.




Presentation Transcript

Slide 1

ShyLU and Thread Scalable Subdomain Solvers

Siva Rajamanickam
Joint work: Joshua Booth, Mehmet Deveci, Kyungjoo Kim, Andrew Bradley, Erik Boman
Collaborators: Clark Dohrmann, Heidi Thornquist

Slide 2

MPI+X based subdomain solvers
Decouple the notion of one MPI rank as one subdomain: subdomains can span multiple MPI ranks, each with its own subdomain solver using X or MPI+X.
Subpackages of ShyLU: multiple Kokkos-based options for on-node parallelism
Basker: LU or ILU(t) factorization
Tacho: incomplete Cholesky, IC(k)
Fast-ILU: Fast-ILU factorization for GPUs
KokkosKernels: coloring-based Gauss-Seidel (M. Deveci), triangular solves (A. Bradley)
Lots more work (w/ Christian Trott). Under active development. Jointly funded by ASC, ATDM, FASTMath, LDRD.

ShyLU and Subdomain Solvers: Overview

[Overview diagram: the ShyLU package with subpackages Tacho, Basker, FAST-ILU, and KLU2, accessed through the Amesos2 and Ifpack2 interfaces, alongside KokkosKernels with SGS and Tri-Solve (HTS).]

Slide 3

Themes for Architecture Aware Solvers and Kernels: Data Layouts

Specialized, architecture-aware data layouts: coalesced memory access, padding, Array of Structures vs. Structure of Arrays.
Kokkos-based abstractions (H. C. Edwards and C. Trott).
Two-dimensional layouts for matrices allow using 2D algorithms for solvers and kernels. Bonus: fewer synchronizations with 2D algorithms. Cons: much harder to design correctly.
Better utilization of hierarchical memory, like the High Bandwidth Memory (HBM) in Intel Xeon Phi, or NVRAM.
Hybrid layouts: better for very heterogeneous problems.

Slide 4
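A minimal illustration of the AoS vs. SoA distinction described above (Python stands in for the C++/Kokkos layouts; the field names are illustrative):

```python
# AoS: one record per entity. Reading a single field across all
# entities strides over the other fields -> poor coalescing on GPUs.
aos = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
xs_aos = [p[0] for p in aos]

# SoA: one contiguous array per field. Reading a single field is a
# unit-stride (coalescable) access.
soa = {"x": [1.0, 3.0, 5.0], "y": [2.0, 4.0, 6.0]}
xs_soa = soa["x"]
```

Kokkos hides this choice behind its `View` abstraction, picking the layout appropriate for the target architecture.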

Themes for Architecture Aware Solvers and Kernels: Fine-grained Synchronization

Synchronizations are expensive. 1D algorithms for factorizations and solvers, such as ND-based solvers, have a huge synchronization bottleneck at the final separator, and are impossible to run efficiently on architectures designed for massive data parallelism (GPUs). This is true only for global synchronizations in a fork/join style model.
Fine-grained synchronizations: synchronize between a handful of threads (teams of threads); use point-to-point synchronizations instead of global synchronizations. Park et al. (ISC14) showed this for the triangular solve.
Use thread-parallel reductions wherever possible. Atomics are cheap, but only when used judiciously.

Slide 5
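A toy illustration of point-to-point synchronization: each worker waits only on the one event its predecessor sets, rather than on a global barrier across all threads (this is a sketch of the idea in Python, not how the solvers are implemented):

```python
import threading

def run_chain(n):
    # One event per task; task i waits only on task i-1's event.
    done = [threading.Event() for _ in range(n)]
    out = [0] * n
    def work(i):
        if i > 0:
            done[i - 1].wait()        # P2P wait on the predecessor only
            out[i] = out[i - 1] + 1   # safe: predecessor finished
        done[i].set()                 # signal any successors
    threads = [threading.Thread(target=work, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

With a global barrier every thread would pay for the slowest one at every step; with P2P events each thread proceeds as soon as its own dependency is satisfied.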

Themes for Architecture Aware Solvers and Kernels: Task Parallelism

Statically scheduled tasks: determine the static schedule of tasks from a task graph and eliminate unnecessary synchronizations; tasks scheduled on the same thread do not need to synchronize. Find transitive relationships to reduce synchronization even further (Jongsoo Park et al.).
Dynamically scheduled tasks: use a tasking model that allows fine-grained synchronizations. This requires support for futures, not the fork-join model where the parent forks a set of tasks and blocks till they finish.
Kokkos Tasking API: joint work with Carter Edwards, Stephen Olivier, Kyungjoo Kim, Jon Berry, George Stelle.

Slide 6
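The "transitive relationships" idea can be sketched as follows: if a path u -> ... -> w already exists in the task graph, the direct edge (u, w) adds no ordering information, so its synchronization can be dropped. A minimal Python sketch (the function name is ours, not from the Park et al. implementation):

```python
from collections import defaultdict

def transitive_reduction(edges):
    # Keep edge (u, w) only if w is NOT reachable from u by some other
    # path; the surviving edges are the synchronizations a statically
    # scheduled task graph actually needs.
    succ = defaultdict(set)
    for u, w in edges:
        succ[u].add(w)
    def reachable_indirectly(u, w):
        stack = [v for v in succ[u] if v != w]   # skip the direct edge
        seen = set(stack)
        while stack:
            v = stack.pop()
            if v == w:
                return True
            for nxt in succ[v]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False
    return [(u, w) for u, w in edges if not reachable_indirectly(u, w)]
```

For example, with edges (0,1), (1,2), (0,2), the synchronization 0 -> 2 is implied by 0 -> 1 -> 2 and can be removed.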

Themes for Architecture Aware Solvers and Kernels: Asynchronous Algorithms

System-level algorithms:
Communication-avoiding methods (s-step methods): not truly asynchronous, but can be done asynchronously as well; multiple authors from the early 1980s.
Pipelined Krylov methods: recently Ghysels, Vanroose et al. and others.
Node-level algorithms:
Fine-grained asynchronous iterative ILU factorizations: an iterative algorithm to compute an ILU factorization (Chow et al.), asynchronous in the updates.
Fine-grained asynchronous iterative triangular solves: Jacobi iterations for the triangular solve.
Kaczmarz, Cimmino type methods and their block variants (Boman's talk, Monday).

Slide 7

ShyLU/Basker: (I)LU factorization

Basker: sparse (I)LU factorization (w/ Joshua Booth).
Block triangular form (BTF) based LU factorization, with nested dissection on large BTF components.
2D layout of coarse- and fine-grained blocks (previous work by Sherry Li, Rothberg & Gupta).
Data-parallel, Kokkos-based implementation.
Fine-grained parallel algorithm with P2P synchronizations: a parallel version of Gilbert-Peierls' algorithm (as in KLU). The left-looking 2D algorithm requires careful synchronization between the threads; all-reduce operations between threads avoid atomic updates.
See "Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts" (J. Booth, S. Rajamanickam and H. Thornquist).

Slide 8
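As an illustration of the left-looking structure that Basker parallelizes over 2D blocks, here is a minimal dense, scalar Python sketch of Gilbert-Peierls-style left-looking LU (no pivoting and no sparsity handling; the function name is ours):

```python
def left_looking_lu(A):
    # Column j is formed by a forward solve against the already-computed
    # columns 0..j-1 of L, then scaled by the pivot (left-looking).
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x = [A[i][j] for i in range(n)]       # work column = A(:, j)
        for k in range(j):                    # forward solve with L(:, 0:j)
            xk = x[k]                         # x[k] is final here: u_kj
            for i in range(k + 1, n):
                x[i] -= L[i][k] * xk
        for k in range(j + 1):
            U[k][j] = x[k]
        for i in range(j + 1, n):
            L[i][j] = x[i] / x[j]             # scale by the pivot u_jj
        L[j][j] = 1.0
    return L, U
```

In Basker the same dependency structure is expressed over 2D blocks, which is what makes the P2P synchronization between threads both necessary and sufficient.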

ShyLU/Basker: Steps in a left-looking factorization

Different colors show different threads; grey means not active at that step. The left-looking factorization of the final separator shown here involves four independent triangular solves, a mat-vec and updates (P2P communication); then two independent triangular solves, a mat-vec and updates; then a final triangular solve (walking up the nested-dissection tree).

Slide 9

ShyLU/Basker: Performance Results

Speedup of 5.9x (CPU) and 7.4x (Xeon Phi) over KLU (geometric mean), and up to 53x (CPU) and 13x (Xeon Phi) over MKL Pardiso.
Performance profile shown for a matrix set containing high-fill and low-fill matrices (16 threads on CPUs / 32 threads on Xeon Phi). On low-fill matrices, Basker is consistently the better solver; on high-fill matrices, MKL Pardiso is consistently the better solver.

Slide 10

Kokkos Tasking API

H. C. Edwards, S. Olivier, J. Berry, S. Rajamanickam et al.
Supports Pthreads, Qthreads, and CUDA (highly experimental) backends.

Slide 11

ShyLU/Tacho: Task-Based Cholesky Factorization

Fine-grained, task-based, right-looking Cholesky factorization (w/ K. Kim).
2D layout of blocks based on nested dissection and the fill pattern.
Task-parallel, Kokkos-based implementation: a fine-grained parallel algorithm with synchronizations represented as a task DAG.
Algorithm-by-blocks style: a block-based rather than scalar-based algorithm, originally used for parallel out-of-core factorizations (Quintana et al., Buttari et al.).
See "Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout" (K. Kim, S. Rajamanickam, G. Stelle, H. C. Edwards, S. Olivier), arXiv.

Slide 12
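As an illustration of the algorithm-by-blocks style, here is a serial, dense Python sketch of right-looking blocked Cholesky; each POTRF/TRSM/SYRK block operation below corresponds to one task in the DAG (Tacho itself is a sparse, Kokkos-based C++ implementation; names here are ours):

```python
def chol_scalar(A):
    # Unblocked dense Cholesky (lower) of a small block.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = d ** 0.5
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def blocked_cholesky(A, bs):
    # Right-looking algorithm-by-blocks; A is partitioned into bs x bs blocks.
    n = len(A)
    nb = n // bs
    A = [row[:] for row in A]                 # work on a copy, in place
    blk = lambda i, j: [[A[i*bs+r][j*bs+c] for c in range(bs)] for r in range(bs)]
    def setblk(i, j, B):
        for r in range(bs):
            for c in range(bs):
                A[i*bs+r][j*bs+c] = B[r][c]
    for k in range(nb):
        Lkk = chol_scalar(blk(k, k))          # POTRF task on the diagonal block
        setblk(k, k, Lkk)
        for i in range(k + 1, nb):            # TRSM tasks, independent in i
            B = blk(i, k)                     # solve X * Lkk^T = B
            X = [[0.0] * bs for _ in range(bs)]
            for r in range(bs):
                for c in range(bs):
                    X[r][c] = (B[r][c] - sum(X[r][t] * Lkk[c][t]
                                             for t in range(c))) / Lkk[c][c]
            setblk(i, k, X)
        for i in range(k + 1, nb):            # SYRK/GEMM update tasks
            for j in range(k + 1, i + 1):
                Aik, Ajk, C = blk(i, k), blk(j, k), blk(i, j)
                for r in range(bs):
                    for c in range(bs):
                        C[r][c] -= sum(Aik[r][t] * Ajk[c][t] for t in range(bs))
                setblk(i, j, C)
    return [[A[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
```

The task DAG is exactly the dependency structure among these block operations: the TRSMs of step k depend on the POTRF of step k, each update depends on its two TRSMs, and so on.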

ShyLU/Tacho: Steps in the factorization

The degree of concurrency still depends on the nested-dissection ordering, but the parallelism itself is not tied to the nested-dissection ordering.

Slide 13

ShyLU/Tacho: A more realistic task DAG

The complete task DAG is never formed; it is shown here only to demonstrate the degree of concurrency. The concurrency comes from fine-grained tasking and a 2D right-looking algorithm.

Slide 14

ShyLU/Tacho: Experimental Results

Results are shown for two matrices chosen for very different nnz/n, on both CPU and Xeon Phi architectures.
Euclid results are shown for reference; it is an MPI code, run with RCM ordering (best for Euclid). There are not many parallel IC(k) codes.
Speedup numbers are relative to single-threaded Cholesky; there is only a small overhead for single-threaded Cholesky over serial Cholesky.

Slide 15

Asynchronous ILU factorizations

The factorization is exact on the sparsity pattern S. ILU methods were interpreted this way before the current "traditional" ILU (Varga 1959, Oliphant 1961, Buleev 1960). Chow & Patel (2015) use this property to compute ILU factorizations:

Solve (LU)_ij = a_ij for all (i, j) in S, with the constraints l_ii = 1.

The equations are nonlinear, with more equations than the original system! But they can be solved in a fine-grained, inexact manner given a good initial guess.

Slide 16
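Written out per entry (following the Chow & Patel formulation; indices here are 1-based), the constraints and their unknown-wise rewriting are:

```latex
% Match A on the pattern S, with a unit diagonal on L:
(LU)_{ij} = a_{ij} \quad \forall (i,j) \in S, \qquad l_{ii} = 1.
% Solving each constraint for "its" unknown gives the fixed-point form:
l_{ij} = \frac{1}{u_{jj}}\Bigl( a_{ij} - \sum_{k=1}^{j-1} l_{ik}\, u_{kj} \Bigr), \quad i > j,
\qquad
u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik}\, u_{kj}, \quad i \le j.
```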

Asynchronous ILU factorizations

Write each unknown in terms of the other unknowns: the system takes the form x = G(x), and we solve it with a fixed-point iteration. Depending on how the fixed-point iteration is done, we obtain a nonlinear Jacobi or a nonlinear Gauss-Seidel method.

Slide 17
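A minimal dense Python sketch of the nonlinear-Jacobi variant of these sweeps (the real FastILU implementation is sparse and runs on GPUs; the function name and initial guess here are illustrative):

```python
def ilu_fixedpoint_sweeps(A, S, nsweeps):
    # Nonlinear-Jacobi sweeps: solve (LU)_ij = a_ij for (i, j) in the
    # pattern S, with l_ii = 1. Each sweep updates every unknown using
    # the values from the previous sweep, so all updates in a sweep are
    # independent (fine-grained parallelism).
    n = len(A)
    # Initial guess: L = lower part of A (unit diagonal), U = upper part.
    Lf = [[A[i][j] if (i, j) in S and j < i else (1.0 if i == j else 0.0)
           for j in range(n)] for i in range(n)]
    Uf = [[A[i][j] if (i, j) in S and j >= i else 0.0
           for j in range(n)] for i in range(n)]
    for _ in range(nsweeps):
        newL = [row[:] for row in Lf]
        newU = [row[:] for row in Uf]
        for (i, j) in S:
            s = sum(Lf[i][k] * Uf[k][j] for k in range(min(i, j)))
            if i > j:
                newL[i][j] = (A[i][j] - s) / Uf[j][j]
            else:
                newU[i][j] = A[i][j] - s
        Lf, Uf = newL, newU
    return Lf, Uf
```

With a full pattern S this converges to the exact LU factorization; with a sparse S it converges toward the (exact-on-pattern) ILU factors.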

Test Set and Exact (sequential) ILU factorization

Exact ILU(k) results by fill level k = 0..5 ('*' = failed):

                    n      0     1     2     3     4     5
thermal2          1.2M  1934  1225   856   637   507   440
af_shell3         504K  1248   788   583   462   369   309
ecology2          999K  1625   988   696   576   467   414
apache2           715K  1294   619   394   289   235   188
offshore          259K   485     *     *     *     *     *
G3_circuit        1.5M  1414   757   546   421   341   303
Parabolic_fem     525K   313   238   164   129   106    91

Test set from the University of Florida sparse matrix collection.
Three platforms: AMD Magny-Cours CPUs (8x8 cores), Intel Xeon Phi (KNC) with 228 threads, Tesla K20x with 5.6 GB memory.
Exact ILU factorization using Trilinos/Ifpack with RCM reordering of the matrix, and Trilinos/Belos for the solvers.

Slide 18

Asynchronous ILU factorization + Tri-Solves vs Exact ILU factorization

FastILU, 10 sweeps, RCM ordering, GPUs (fill level k = 0..5; '*' = failed):

                 0     1     2     3     4     5
thermal2      1312   869   692   635   630   626
af_shell3        *     *     *     *     *     *
ecology2      1708  1082   832   761   738   723
apache2        967   541   326   299   293   294
offshore       334     *     *     *     *     *
G3_circuit     860   524   426   360   312   254
Parabolic_fem  334   271   255   249   263   292

(The slide also repeats the exact ILU table from Slide 17 for comparison.)

Slide 19

Stabilization Techniques: FastILU Factorization

There is a lot of concurrency on GPUs: the iteration is closer to the nonlinear Jacobi regime than to the nonlinear Gauss-Seidel regime.
A common technique to stabilize convergence is damping (under-relaxation). A damping factor between 0 and 1 controls the effect of numerical overflow and the enormous concurrency. Choosing the damping factor is much trickier.

Slide 20
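Damping is easiest to see on a generic fixed-point iteration x = G(x); the sketch below is illustrative only (G, omega, and the example are ours, not the FastILU code):

```python
def damped_fixed_point(G, x0, omega, sweeps):
    # Damped (under-relaxed) update: x_{k+1} = (1 - omega) x_k + omega G(x_k).
    # omega in (0, 1]: smaller omega trades per-sweep progress for
    # stability of the highly concurrent (Jacobi-like) update.
    x = x0
    for _ in range(sweeps):
        gx = G(x)
        x = [(1 - omega) * xi + omega * gi for xi, gi in zip(x, gx)]
    return x
```

With omega = 1 this reduces to the plain fixed-point sweep; values below 1 pull each iterate only part of the way toward G(x), which is the stabilization used in FastILU.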

Restart/Blocking Techniques

Use a warm restart for ILU(k), k > 0: instead of using A as the initial guess, use a few sweeps of ILU(k-1) as the initial guess. This is a continuation-like method that recursively finds initial guesses from lower fill levels.
Use a block Jacobi iteration instead of a Jacobi iteration for the triangular solves, with small block sizes for each thread. One needs to be careful about performance vs. convergence; assigning blocks to a warp / thread block balances both, and Kokkos teams of threads support this naturally.
Another stabilization approach, Manteuffel shifting, did not help these methods.

Slide 21
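The block Jacobi triangular solve mentioned above can be sketched as follows (a serial Python sketch with an illustrative function name; in the actual code each row block would map to a warp or thread team):

```python
def block_jacobi_tri_solve(L, b, bs, sweeps):
    # Solve the lower-triangular system L x = b. Each sweep: every row
    # block solves its own diagonal block exactly (forward substitution),
    # using the previous sweep's iterate for couplings to earlier blocks,
    # so all blocks are independent within a sweep.
    n = len(b)
    x = [b[i] / L[i][i] for i in range(n)]            # initial guess
    for _ in range(sweeps):
        xo, xn = x, [0.0] * n
        for s in range(0, n, bs):                     # blocks: parallelizable
            for i in range(s, min(s + bs, n)):
                r = b[i] - sum(L[i][j] * xo[j] for j in range(s))   # old values
                r -= sum(L[i][j] * xn[j] for j in range(s, i))      # in-block, exact
                xn[i] = r / L[i][i]
        x = xn
    return x
```

Because the block-off-diagonal part is block strictly lower triangular, the iteration is exact after at most ceil(n / bs) sweeps, which is why larger blocks improve convergence at the cost of less parallelism.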

Asynchronous ILU factorization + Tri-Solves vs Exact ILU factorization

FastILU, 10 sweeps, RCM ordering, GPUs, damping factor = 0.5, continuation guess (fill level k = 0..5):

                 0     1     2     3     4     5
thermal2      1343   904   727   646   623   608
af_shell3      902   666   569   534   547   424
ecology2      1704  1114   879   799   759   725
apache2       1043   592   370   291   267   267
offshore       350   205   178   178   282   274
G3_circuit     903   567   498   419   357   291
Parabolic_fem  358   296   253   246   251   256

(The slide also repeats the exact ILU table from Slide 17 for comparison.)

This is still a big win, as the per-iteration cost of the parallel algorithm is a fraction of that of the serial algorithm.

Slide 22

KokkosKernels/SGS: Coloring-based Symmetric Gauss-Seidel (w/ M. Deveci, E. Boman)

Symmetric Gauss-Seidel that uses coloring to expose parallelism; the number of colors equals the number of synchronization calls.
A highly parallel coloring algorithm returns 4x (geometric mean) and up to 48x fewer colors.
A thread-team based implementation of SGS reduces the overall CG solve time by 33% compared to cuSPARSE.
See "Parallel Graph Coloring for Manycore Architectures", M. Deveci, E. Boman, K. Devine, S. Rajamanickam.

Slide 23
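A sketch of the idea using a greedy (sequential) coloring; the actual KokkosKernels coloring algorithm is itself parallel, and the function names here are ours:

```python
def greedy_coloring(adj):
    # Color the matrix graph so adjacent vertices get different colors;
    # rows of one color share no unknowns and can be updated in parallel.
    colors = [-1] * len(adj)
    for v in range(len(adj)):
        used = {colors[u] for u in adj[v]}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors

def colored_gauss_seidel(A, b, x, sweeps):
    n = len(b)
    adj = [[j for j in range(n) if j != i and A[i][j] != 0.0] for i in range(n)]
    colors = greedy_coloring(adj)
    ncolors = max(colors) + 1          # == number of synchronizations per sweep
    for _ in range(sweeps):
        for c in range(ncolors):
            # All rows of color c are independent -> one parallel step,
            # with a single synchronization between color classes.
            for i in range(n):
                if colors[i] == c:
                    x[i] = (b[i] - sum(A[i][j] * x[j] for j in adj[i])) / A[i][i]
    return x
```

Fewer colors therefore mean directly fewer synchronization calls per sweep, which is the point of the parallel coloring work above.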

HTS Triangular Solve: Hybrid Triangular Solves (A. Bradley)

The algorithm is a combination of a level-set triangular solve and a recursive blocked triangular solve; the implementation is based on P2P synchronizations, similar to Park et al.
The level-set portion of the algorithm works on the low-fill portions of the problem, while the recursive blocked algorithm is used on the denser portions (separators, supernodes). Scales well on CPU and Xeon Phi architectures.

Slide 24
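The level-set portion can be sketched as follows: rows are grouped into "levels" such that every row in a level depends only on rows in earlier levels, so all rows of one level can be solved concurrently (a serial Python sketch; function names are ours):

```python
def level_sets(L):
    # Row i of lower-triangular L depends on rows j < i with L[i][j] != 0.
    # level[i] = 1 + max level of its dependencies.
    n = len(L)
    level = [0] * n
    for i in range(n):
        deps = [level[j] for j in range(i) if L[i][j] != 0.0]
        level[i] = 1 + max(deps) if deps else 0
    sets = {}
    for i, lv in enumerate(level):
        sets.setdefault(lv, []).append(i)
    return [sets[lv] for lv in sorted(sets)]

def levelset_solve(L, b):
    n = len(b)
    x = [0.0] * n
    for rows in level_sets(L):
        # Every row in this level is independent -> parallelizable step.
        for i in rows:
            s = sum(L[i][j] * x[j] for j in range(i) if L[i][j] != 0.0)
            x[i] = (b[i] - s) / L[i][i]
    return x
```

On low-fill portions the levels are wide and this parallelizes well; on dense portions (separators, supernodes) the levels collapse toward sequential chains, which is why HTS switches to the recursive blocked algorithm there.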

Conclusions

Themes around thread-scalable subdomain solvers: data layouts, fine-grained synchronizations, task parallelism, asynchronous algorithms.
Current work:
Basker: thread-parallel (I)LU factorization
Tacho: task-parallel Cholesky factorization
Fast-ILU: asynchronous ILU of Chow et al. with damping and warm restart
SGS: multithreaded coloring-based symmetric Gauss-Seidel
HTS: multithreaded triangular solver for CPUs and Xeon Phis

Slide 25

Questions?

Slide 26

Iterative Inexact Triangular Solves

Computing the LU factors is only part of the cost; we also need to apply the factors in a fine-grained fashion. Solve with the triangular factors using a fixed number of Jacobi sweeps.
Use the matrix splitting T = D + E, where D is the diagonal and E is the strictly triangular part. Given an initial guess x_0, use the update x_{k+1} = D^{-1} (b - E x_k).
This will converge if the spectral radius of D^{-1} E is less than one. This is always true for triangular matrices: D^{-1} E is strictly triangular, so its spectral radius is zero.
Note: this is asymptotic convergence.

Slide 27
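The sweep described above can be sketched in a few lines (serial Python; since D^{-1} E is strictly triangular, hence nilpotent, the iterate is exact after at most n sweeps):

```python
def jacobi_tri_solve(L, b, sweeps):
    # Solve L x = b for lower-triangular L via Jacobi sweeps.
    # Split L = D + E (D diagonal, E strictly lower) and iterate
    # x_{k+1} = D^{-1} (b - E x_k); every entry updates independently.
    n = len(b)
    x = [b[i] / L[i][i] for i in range(n)]   # initial guess D^{-1} b
    for _ in range(sweeps):
        x = [(b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
             for i in range(n)]
    return x
```

In practice only a small fixed number of sweeps is used, trading exactness of the triangular solve for fine-grained parallelism.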

Asynchronous ILU factorization + Tri-Solves vs Exact ILU factorization

FastILU, 10 sweeps, RCM ordering, GPUs, damping factor = 0.5 (fill level k = 0..5):

                 0     1     2     3     4     5
thermal2      1343   924   840   815   819   811
af_shell3      901   653   565   589   554   599
ecology2      1704  1103   925   910   893   922
apache2       1043   629   432   484   427   497
offshore       350   211   184   175   172   172
G3_circuit     904   607   512   471   431   410
Parabolic_fem  356   328   295   288   285   286

(The slide also repeats the exact ILU table from Slide 17 for comparison.)

Slide 28

Asynchronous ILU factorization vs Exact ILU factorization

FastILU, 5 sweeps, RCM ordering, GPUs (fill level k = 0..5; '*' = failed):

                 0     1     2     3     4     5
thermal2      1421  1110  1086  1145  1172  1178
af_shell3        *     *     *     *     *     *
ecology2      1807  1311  1271  1300  1344  1308
apache2       1001   768   815   818   827   847
offshore         *     *     *     *     *     *
G3_circuit     868   612   586   574   568   562
Parabolic_fem  425   467   421   474   480   527

(The slide also repeats the exact ILU table from Slide 17 for comparison.)
