
Slide1

Massively parallel implementation of Total-FETI DDM with application to medical image registration

Michal Merta, Alena Vašatová, Václav Hapla, David Horák

DD21, Rennes, France

Slide2

Motivation

- solution of large-scale scientific and engineering problems, possibly hundreds of millions of DOFs
- linear problems
- non-linear problems
- non-overlapping FETI methods with up to tens of thousands of subdomains
- usage of PRACE Tier-1 and Tier-0 HPC systems

Slide3

PETSc (Portable, Extensible Toolkit for Scientific computation)

- developed by Argonne National Laboratory
- data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
- coded primarily in C, but with good Fortran support; can also be called from C++ and Python codes
- current version is 3.2, www.mcs.anl.gov/petsc
- petsc-dev (development branch) is intensively evolving
- code and mailing lists are open to anybody

Slide4

PETSc components

(component diagram; packages marked as seq. / par.)

Slide5

Trilinos

- developed by Sandia National Laboratories
- collection of relatively independent packages
- toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
- object-oriented design, high modularity, use of modern C++ features (templating)
- mainly in C++ (Fortran and Python bindings)
- current version 10.10, trilinos.sandia.gov

Slide6

Trilinos components

Slide7

Both PETSc and Trilinos…

- are parallelized on the data level (vectors & matrices) using MPI
- use BLAS and LAPACK, the de facto standard for dense linear algebra
- have their own implementation of sparse BLAS
- include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
- can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
- support CUDA and hybrid parallelization
- are licensed as open source

Slide8

Problem of elastostatics

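For orientation only (not reproduced from the slide): the underlying boundary value problem of linear elastostatics, in standard notation with displacement u, small-strain stress sigma(u), body force f and prescribed surface traction g, reads

\[
-\operatorname{div}\sigma(u) = f \ \text{ in } \Omega, \qquad
u = 0 \ \text{ on } \Gamma_D, \qquad
\sigma(u)\,n = g \ \text{ on } \Gamma_N,
\]

with Hooke's law for an isotropic material characterised by Young's modulus E and Poisson's ratio ν (the values used later in the benchmark).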

Slide9

TFETI decomposition

Slide10

Primal discretized formulation

The FEM discretization with a suitable numbering of nodes results in the QP problem (sketched in the usual notation below):
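A sketch of the usual form of this QP, assuming standard TFETI notation (K the block-diagonal stiffness matrix assembled subdomain by subdomain, f the load vector, B the constraint matrix enforcing inter-subdomain continuity together with the Dirichlet conditions, c the corresponding right-hand side):

\[
\min_{u}\ \tfrac12\, u^{T} K u \;-\; f^{T} u
\qquad \text{subject to} \qquad B u = c .
\]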

Slide11

Dual discretized formulation (homogenized)

QP problem again, but with lower dimension and simpler constraints
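Written out in the usual TFETI notation (again a sketch, not copied from the slide), the dual problem obtained by eliminating u is

\[
\min_{\lambda}\ \tfrac12\,\lambda^{T} F \lambda \;-\; \lambda^{T} d
\qquad \text{subject to} \qquad G\lambda = e,
\]
\[
F = B K^{+} B^{T}, \qquad G = R^{T} B^{T}, \qquad d = B K^{+} f - c, \qquad e = R^{T} f,
\]

where K^{+} denotes a generalized inverse of K and the columns of R span the kernel of K (the rigid body modes of the floating subdomains). Homogenization shifts λ by a particular solution of Gλ = e, so the equality constraint of the problem actually solved becomes Gλ = 0.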

Slide12

Primal data distribution, F action

- straightforward matrix distribution, given by the decomposition
- very sparse, block-diagonal structure → the F action is embarrassingly parallel (a matrix-free sketch follows below)
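The following is not the authors' code, only an illustrative sketch of how such a matrix-free F action can be wired up on the PETSc side (the Trilinos implementation would wrap the same three steps in its own operator class). It assumes B is stored as a parallel sparse matrix and that K^{+} is applied through a KSP holding the LU factorization of the regularized K, as described on the Benchmark slide; the names FCtx, FMult and kspK are invented for the sketch.

#include <petscksp.h>

/* Context of the shell matrix: everything needed to apply F = B K^+ B^T.    */
typedef struct {
  Mat B;      /* constraint ("jump") matrix, very sparse                      */
  KSP kspK;   /* direct (LU) solve with the regularized block-diagonal K      */
  Vec u, v;   /* primal work vectors                                          */
} FCtx;

/* y = F * lambda, computed matrix-free as B ( K^+ ( B^T lambda ) ).          */
static PetscErrorCode FMult(Mat F, Vec lambda, Vec y)
{
  FCtx          *ctx;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatShellGetContext(F, (void**)&ctx);CHKERRQ(ierr);
  ierr = MatMultTranspose(ctx->B, lambda, ctx->u);CHKERRQ(ierr);  /* B^T lambda */
  ierr = KSPSolve(ctx->kspK, ctx->u, ctx->v);CHKERRQ(ierr);       /* K^+ (...)  */
  ierr = MatMult(ctx->B, ctx->v, y);CHKERRQ(ierr);                /* B   (...)  */
  PetscFunctionReturn(0);
}

/* The shell matrix is then registered with
     MatCreateShell(comm, m, n, M, N, &fctx, &F);
     MatShellSetOperation(F, MATOP_MULT, (void (*)(void)) FMult);
   and can be handed directly to the CG solver via KSPSetOperators.           */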

Slide13

Coarse projector action

- can easily take 85 % of the computation time if not properly parallelized!
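For reference (standard TFETI notation, not reproduced from the slide): the coarse projector applied in every CG iteration is

\[
P = I - Q, \qquad Q = G^{T} (G G^{T})^{-1} G ,
\]

so each application consists of two sparse multiplications with G and one solve with the comparatively small but globally coupled matrix G Gᵀ; this solve is exactly the part that dominates the run time when it is not parallelized carefully.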

Slide14

G preprocessing and action

- preprocessing
- action

Slide15

Coarse problem preprocessing and action

- preprocessing
- action
- parallelization strategies 1, 2, 3 compared
- currently used variant: B2 (PPAM 2011)
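Again only for orientation, not taken from the slide: the coarse problem hidden inside every projector application is the solve

\[
(G G^{T})\,\mu = G \lambda ,
\]

whose dimension equals the total number of rigid body modes of the floating subdomains (3 per subdomain in 2D, 6 in 3D) and therefore grows with the number of subdomains. The strategies compared here presumably differ in how G Gᵀ (or its factor) is assembled, distributed and applied; see the PPAM 2011 reference.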

Slide16

Coarse problem

Slide17

HECToR phase 3 (XE6)

- the UK's largest, fastest and most powerful supercomputer, supplied by Cray Inc., operated by EPCC
- uses the latest AMD "Bulldozer" multicore processor architecture
- 704 compute blades, each blade with 4 compute nodes, giving a total of 2816 compute nodes
- each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node, a total of 90 112 cores
- each 16-core processor shares 16 GB of memory, 60 TB in total
- theoretical peak performance over 800 Tflops

www.hector.ac.uk

Slide18

Benchmark

- K^+ implemented as a direct solve (LU) of the regularized K
- built-in CG routines used (PETSc KSP, Trilinos Belos)
- E = 1e6, ν = 0.3, g = 9.81 m s^-2
- computed @ HECToR

(a sketch of the corresponding PETSc solver setup follows below)
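A hedged sketch of the PETSc-side solver setup suggested by this slide and by the stopping criterion on the next one (plain CG, no preconditioner, relative tolerance 1e-5); the function name SolveDual and the assumption that F, d and lambda already exist are mine, this is not the benchmark code itself.

#include <petscksp.h>

/* Solve F * lambda = d with unpreconditioned CG, ||r_k||/||r_0|| < 1e-5.     */
static PetscErrorCode SolveDual(Mat F, Vec d, Vec lambda)
{
  KSP            ksp;
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, F, F);CHKERRQ(ierr);   /* PETSc 3.2-era KSPSetOperators also took a MatStructure flag */
  ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);       /* built-in conjugate gradient routine                          */
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCNONE);CHKERRQ(ierr);        /* without preconditioning                                      */
  ierr = KSPSetTolerances(ksp, 1.0e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, d, lambda);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}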

Slide19

Results

# subdomains = # cores        1           4           16          64          256         1024
Prim. dim.                    31 752      127 008     508 032     2 032 128   8 128 512   32 514 048
Dual dim.                     252         1 512       7 056       30 240      124 992     508 032
Solution time [s]   Trilinos  1.39        3.01        4.80        6.25        10.31       28.05
                    PETSc     1.14        2.66        4.16        4.74        4.92        5.84
# iterations        Trilinos  34          63          96          105         105         102
                    PETSc     33          68          94          105         105         102
1 iter. time [s]    Trilinos  4.48e-2     4.76e-2     5.00e-2     5.95e-2     9.81e-2     2.75e-1
                    PETSc     3.46e-2     3.92e-2     4.42e-2     4.52e-2     4.69e-2     5.73e-2

stopping criterion: ||r_k|| / ||r_0|| < 1e-5, without preconditioning

Slide20

Application to image registration

- the process of integrating information from two (or more) different images
- images from different sensors, from different angles and/or times

Slide21

Application to image registration

In medicine:
- monitoring of the growth of a tumour
- therapy evaluation
- comparison of patient data with an anatomical atlas
- data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)

Slide22

Elastic registration

The task is to minimize the distance between the two images.
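A common form of the elastic registration functional (a sketch assuming the sum-of-squared-differences distance measure; the symbols on the original slide may differ) seeks the displacement u minimizing

\[
J[u] \;=\; \mathcal{D}[R, T; u] \;+\; \alpha\, \mathcal{S}[u],
\qquad
\mathcal{D}[R,T;u] \;=\; \tfrac12 \int_{\Omega} \bigl(T(x + u(x)) - R(x)\bigr)^{2}\, dx ,
\]

where R is the reference image, T the template image, S[u] the linear elastic potential of u (which is what makes the TFETI elasticity machinery applicable here), and α > 0 a regularization parameter.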


Slide23

Elastic registration

- parallelization using the TFETI method

Slide24

Results

# of subdomains        1          4          16
Primal variables       20 402     81 608     326 432
Dual variables         903        2 641      8 254
Solution time [s]      41         34.54      57.44
# of iterations        2 467      990        665
Time/iteration [s]     0.01       0.03       0.08

stopping criterion: ||r_k|| / ||r_0|| < 1e-5

Slide25

Solution

Slide26

Conclusion and future work

- to consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
- to further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
- to extend the image registration to 3D data

Slide27

References

- KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
- HORAK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012.
- ZITOVA, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977-1000.

Slide28

Thank you for your attention!