Slide1
Massively parallel implementation of Total-FETI DDM with application to medical image registration
Michal Merta, Alena Vašatová, Václav Hapla, David Horák
DD21, Rennes, France
Slide2
Motivation
- solution of large-scale scientific and engineering problems
- possibly hundreds of millions of DOFs
- linear problems
- non-linear problems
- non-overlapping (FETI) methods with up to tens of thousands of subdomains
- usage of PRACE Tier-1 and Tier-0 HPC systems
Slide3
PETSc (Portable, Extensible Toolkit for Scientific computation)
- developed by Argonne National Laboratory
- data structures and routines for the scalable parallel solution of scientific applications modelled by PDEs
- coded primarily in C, but with good Fortran support; can also be called from C++ and Python codes (a minimal usage sketch follows below)
- current version is 3.2, www.mcs.anl.gov/petsc
- petsc-dev (the development branch) evolves intensively
- code and mailing lists open to anybody
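The slides contain no code; as a hedged illustration of the PETSc workflow (a toy distributed 1D Laplacian, not the TFETI operators; the size n is a placeholder), a minimal CG solve through the KSP interface might look as follows. The calls use the current API; the 3.2-era API differs slightly (e.g. MatGetVecs, and KSPSetOperators taking an extra MatStructure argument).

// Hedged sketch: assemble a distributed tridiagonal matrix and solve with CG.
// Assumes a recent PETSc built with MPI; error checking abbreviated.
#include <petscksp.h>

int main(int argc, char **argv) {
  PetscInitialize(&argc, &argv, NULL, NULL);

  const PetscInt n = 100;                  /* global size (placeholder) */
  Mat A;
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSeqAIJSetPreallocation(A, 3, NULL);
  MatMPIAIJSetPreallocation(A, 3, NULL, 1, NULL);

  PetscInt rstart, rend;
  MatGetOwnershipRange(A, &rstart, &rend); /* rows owned by this process */
  for (PetscInt i = rstart; i < rend; ++i) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  Vec x, b;
  MatCreateVecs(A, &x, &b);                /* vectors with matching layout */
  VecSet(b, 1.0);

  KSP ksp;
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetType(ksp, KSPCG);
  KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
  KSPSetFromOptions(ksp);                  /* allow -ksp_* run-time options */
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
  PetscFinalize();
  return 0;
}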
Slide4
PETSc components
(component diagram: sequential / parallel layers)
Slide5
Trilinos
- developed by Sandia National Laboratories
- collection of relatively independent packages
- toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
- object-oriented design, high modularity, use of modern C++ features (templating)
- written mainly in C++, with Fortran and Python bindings (a minimal usage sketch follows below)
- current version 10.10, trilinos.sandia.gov
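For comparison, a hedged sketch of the same toy solve with the Trilinos packages Epetra (linear algebra) and Belos (Krylov solvers, used later in the benchmark); again the problem and its size are placeholders, not the TFETI operators.

// Hedged sketch: distributed 1D Laplacian in Epetra, solved with Belos CG.
// Assumes Trilinos built with MPI, Epetra and Belos enabled.
#include <mpi.h>
#include "Epetra_MpiComm.h"
#include "Epetra_Map.h"
#include "Epetra_CrsMatrix.h"
#include "Epetra_Vector.h"
#include "BelosLinearProblem.hpp"
#include "BelosBlockCGSolMgr.hpp"
#include "BelosEpetraAdapter.hpp"
#include "Teuchos_RCP.hpp"
#include "Teuchos_ParameterList.hpp"

typedef Epetra_MultiVector MV;
typedef Epetra_Operator    OP;

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  Epetra_MpiComm comm(MPI_COMM_WORLD);

  const int n = 100;                        // global size (placeholder)
  Epetra_Map map(n, 0, comm);               // linear row distribution
  Epetra_CrsMatrix A(Copy, map, 3);

  // assemble the locally owned rows of a 1D Laplacian
  for (int lid = 0; lid < map.NumMyElements(); ++lid) {
    int gid = map.GID(lid);
    double vals[3]; int cols[3]; int nnz = 0;
    if (gid > 0)     { vals[nnz] = -1.0; cols[nnz++] = gid - 1; }
    vals[nnz] = 2.0; cols[nnz++] = gid;
    if (gid < n - 1) { vals[nnz] = -1.0; cols[nnz++] = gid + 1; }
    A.InsertGlobalValues(gid, nnz, vals, cols);
  }
  A.FillComplete();

  Teuchos::RCP<Epetra_Vector> x = Teuchos::rcp(new Epetra_Vector(map));
  Teuchos::RCP<Epetra_Vector> b = Teuchos::rcp(new Epetra_Vector(map));
  b->PutScalar(1.0);

  Teuchos::RCP<Belos::LinearProblem<double, MV, OP> > problem =
      Teuchos::rcp(new Belos::LinearProblem<double, MV, OP>(
          Teuchos::rcpFromRef(A), x, b));
  problem->setProblem();

  Teuchos::RCP<Teuchos::ParameterList> params =
      Teuchos::rcp(new Teuchos::ParameterList());
  params->set("Convergence Tolerance", 1e-5);
  params->set("Maximum Iterations", 1000);

  Belos::BlockCGSolMgr<double, MV, OP> solver(problem, params);
  solver.solve();                           // x now holds the solution

  MPI_Finalize();
  return 0;
}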
Slide6
Trilinos components
Slide7
Both PETSc and Trilinos…
- are parallelized on the data level (vectors & matrices) using MPI
- use BLAS and LAPACK, the de facto standard for dense linear algebra
- have their own implementation of sparse BLAS
- include robust preconditioners, linear solvers (direct and iterative) and non-linear solvers
- can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
- support CUDA and hybrid parallelization
- are licensed as open source
Slide8
Problem of elastostatics
(figure: elastic body Ω loaded by volume forces f; a standard formulation is recalled below)
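The slide body is a figure; for context, a standard energy formulation of linear elastostatics (an editorial addition, in the notation usual for TFETI) is

\min_{u}\; J(u) \;=\; \tfrac12 \int_{\Omega} \sigma(u):\varepsilon(u)\,\mathrm{d}\Omega
\;-\; \int_{\Omega} f\cdot u\,\mathrm{d}\Omega
\;-\; \int_{\Gamma_N} g\cdot u\,\mathrm{d}\Gamma,
\qquad u = 0 \ \text{on } \Gamma_D,

with \sigma(u) the stress, \varepsilon(u) the small-strain tensor, f the volume forces and g the surface tractions.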
Slide9
TFETI decomposition
Slide10
Primal discretized formulation
The FEM discretization with a suitable numbering of nodes results in the QP problem (shown on the slide as an image; a standard form is given below):
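In the notation commonly used for TFETI (cf. the Kozubek et al. reference), the primal QP reads

\min_{u}\; \tfrac12\, u^{T} K u - u^{T} f
\quad\text{subject to}\quad B u = c,

where K = \operatorname{diag}(K_1,\dots,K_s) is the block-diagonal stiffness matrix of the floating subdomains, f the load vector, and B the constraint matrix enforcing the inter-subdomain gluing together with the Dirichlet boundary conditions.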
Slide11
Dual discretized formulation (homogenized)
A QP problem again, but with a lower dimension and simpler constraints (a standard form is given below).
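Only pictured on the slide; the standard dual TFETI problem reads

\min_{\lambda}\; \tfrac12\, \lambda^{T} F \lambda - \lambda^{T} d
\quad\text{subject to}\quad G\lambda = e,
\qquad
F = B K^{+} B^{T},\quad d = B K^{+} f,\quad G = R^{T} B^{T},\quad e = R^{T} f,

where K^{+} is a generalized inverse of K and the columns of R span the kernel of K (the rigid body modes); homogenization shifts \lambda by a particular solution of G\lambda = e, so that the constraint becomes G\lambda = o.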
Slide12
Primal data distribution, F action
- the matrix distribution is straightforward, given by the decomposition
- B is very sparse and K is block diagonal, so the F action is embarrassingly parallel (see the decomposition of the F action below)
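Spelled out in the usual TFETI notation (an editorial addition, not on the slide), the F action splits into purely local and sparse operations:

F\lambda \;=\; B K^{+} B^{T}\lambda \;=\; B\bigl(K^{+}(B^{T}\lambda)\bigr),
\qquad K^{+} = \operatorname{diag}\bigl(K_1^{+},\dots,K_{s}^{+}\bigr),

so one application amounts to a sparse multiplication by B^{T}, independent per-subdomain solves with K_i^{+}, and a sparse multiplication by B.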
Slide13
Coarse projector action
The application of the coarse projector can easily take 85 % of the computation time if not properly parallelized! (Its structure is recalled below.)
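For reference (usual TFETI notation, not taken from the slide), the coarse projector is

P \;=\; I - Q, \qquad Q \;=\; G^{T}\,(G G^{T})^{-1} G,

and each application of P involves a multiplication by G, a solve with the coarse problem matrix G G^{T}, and a multiplication by G^{T}; these operations couple all subdomains, which is why they dominate the runtime unless parallelized carefully.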
Slide14
G preprocessing and action
- preprocessing
- action
Slide15
Coarse problem preprocessing and action
- preprocessing and action of the coarse problem solve; several parallelization variants (1, 2, 3) are compared
- currently used variant: B2 (PPAM 2011)
Slide16
Coarse problem
Slide17
HECToR phase 3 (XE6)
- the UK's largest, fastest and most powerful supercomputer, supplied by Cray Inc. and operated by EPCC
- uses the latest AMD "Bulldozer" multi-core processor architecture
- 704 compute blades, each blade with 4 compute nodes, giving a total of 2816 compute nodes
- each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node
- total of 90 112 cores
- each 16-core processor shares 16 GB of memory, about 90 TB in total
- theoretical peak performance over 800 Tflops
- www.hector.ac.uk
Slide18
Benchmark
- K⁺ implemented as a direct solve (LU) of the regularized K (a sketch of this idea follows below)
- built-in CG routines used (PETSc KSP, Trilinos Belos)
- E = 1e6, ν = 0.3, g = 9.81 m s^-2
- computed @ HECToR
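As a hedged, self-contained sketch of the K⁺ idea mentioned above: a singular model stiffness matrix (a 1D Neumann Laplacian) is regularized and factorized by LU, and the factorization is then applied as the generalized inverse. The real TFETI regularization exploits the known rigid-body modes of each subdomain; the diagonal shift below is a simplification to keep the example small.

// Hedged sketch (not the authors' code): K+ action via LU of a regularized K.
#include <petscksp.h>

int main(int argc, char **argv) {
  PetscInitialize(&argc, &argv, NULL, NULL);

  const PetscInt n = 50;                     // subdomain size (placeholder)
  Mat K;
  MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 3, NULL, &K);
  for (PetscInt i = 0; i < n; ++i) {
    PetscScalar diag = (i == 0 || i == n - 1) ? 1.0 : 2.0;  // Neumann ends -> singular K
    if (i > 0)     MatSetValue(K, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(K, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(K, i, i, diag, INSERT_VALUES);
  }
  MatAssemblyBegin(K, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(K, MAT_FINAL_ASSEMBLY);

  Mat Kreg;
  MatDuplicate(K, MAT_COPY_VALUES, &Kreg);
  MatShift(Kreg, 1e-6);                      // simplified regularization

  KSP kplus; PC pc;
  KSPCreate(PETSC_COMM_SELF, &kplus);        // per-subdomain (sequential) solve
  KSPSetOperators(kplus, Kreg, Kreg);
  KSPSetType(kplus, KSPPREONLY);             // a single application of the PC ...
  KSPGetPC(kplus, &pc);
  PCSetType(pc, PCLU);                       // ... which is a direct LU solve
  KSPSetUp(kplus);

  Vec x, y;
  MatCreateVecs(K, &x, &y);
  VecSet(x, 1.0);
  KSPSolve(kplus, x, y);                     // y ~ K+ x

  KSPDestroy(&kplus); MatDestroy(&K); MatDestroy(&Kreg);
  VecDestroy(&x); VecDestroy(&y);
  PetscFinalize();
  return 0;
}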
Slide19
Results

# subds = # cores       1         4        16        64       256      1024
Prim. dim.          31752    127008    508032   2032128   8128512  32514048
Dual dim.             252      1512      7056     30240    124992    508032
Solution time [s]
  Trilinos           1.39      3.01      4.80      6.25     10.31     28.05
  PETSc              1.14      2.66      4.16      4.74      4.92      5.84
# iterations
  Trilinos             34        63        96       105       105       102
  PETSc                33        68        94       105       105       102
1 iter. time [s]
  Trilinos        4.48e-2   4.76e-2   5.00e-2   5.95e-2   9.81e-2   2.75e-1
  PETSc           3.46e-2   3.92e-2   4.42e-2   4.52e-2   4.69e-2   5.73e-2

stopping criterion: ||r_k|| / ||r_0|| < 1e-5, without preconditioning
Slide20
Application to image registration
- the process of integrating information from two (or more) different images
- images from different sensors, from different angles and/or times
Slide21
Application to image registration
In medicine:
- monitoring of tumour growth
- therapy evaluation
- comparison of patient data with an anatomical atlas
- data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)
Slide22
Elastic registration
The task is to minimize the distance between two images.

Slide23
Elastic registration
- parallelization using the TFETI method (a common form of the functional is sketched below)
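The functional appears on the slides only graphically; a common form of elastic registration (an assumption about the exact functional, consistent with the elastostatics setting above) is

\min_{u}\; \mathcal{J}(u) \;=\; \tfrac12\int_{\Omega}\bigl(T(x + u(x)) - R(x)\bigr)^{2}\,\mathrm{d}x
\;+\; \alpha\, \mathcal{S}_{\mathrm{elast}}(u),

where R is the reference image, T the template image to be deformed, \mathcal{S}_{\mathrm{elast}} the linear elastic potential of the displacement u, and \alpha > 0 a regularization parameter; the resulting elasticity problems are then decomposed and solved by TFETI.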
Slide24
Results

# of subdomains          1        4       16
Primal variables     20402    81608   326432
Dual variables         903     2641     8254
Solution time [s]       41    34.54    57.44
# of iterations       2467      990      665
Time/iteration [s]    0.01     0.03     0.08

stopping criterion: ||r_k|| / ||r_0|| < 1e-5
Slide25
Solution
Slide26
Conclusion and future work
- consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
- further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
- extend the image registration to 3D data
Slide27
References
- KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
- HORÁK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012.
- ZITOVÁ, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977-1000.
Slide28
Thank you for your attention!