Scalable Spectral Transforms at Petascale
Dmitry Pekurovsky
San Diego Supercomputer Center
UC San Diego
dmitry@sdsc.edu
Presented at XSEDE'13, July 22-25, San Diego
Introduction: Fast Fourier Transforms and related spectral transforms
Project scope: algorithms operating on structured grids in three dimensions that are computationally demanding and process data in each dimension independently of the others. Examples: Fourier, Chebyshev, high-order compact finite difference schemes
Heavily used in many areas of computational science
Computationally demanding
Typically not a cache-friendly algorithm; memory bandwidth is stressed
Communication-intensive: the all-to-all exchange is an expensive operation, stressing the bisection bandwidth of the host's network
1D decomposition vs. 2D decomposition
[Figure: a 3D grid (axes x, y, z) partitioned among processors P1-P4; the 1D (slab) decomposition splits the grid along one dimension, while the 2D (pencil) decomposition splits it along two dimensions.]
Algorithm scalability
1D decomposition: concurrency is limited to N (linear grid size). Not enough parallelism for O(10^4)-O(10^5) cores. This is the approach of most libraries to date (FFTW 3.3, PESSL).
2D decomposition: concurrency is up to N^2. Scaling to ultra-large core counts is possible (see the worked example below).
The answer to the petascale challenge
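To make the concurrency limits concrete, here is a short worked example (the 4096^3 grid size is the one used in the benchmarks later in this deck):

```latex
% Maximum usable MPI task counts for a cubic grid of linear size N = 4096:
%   1D (slab) decomposition:   P_max = N
%   2D (pencil) decomposition: P_max = N^2
\[
P^{\mathrm{1D}}_{\max} = N = 4096,
\qquad
P^{\mathrm{2D}}_{\max} = N^{2} = 4096^{2} \approx 1.7\times10^{7}
\]
% so a slab decomposition cannot occupy O(10^4)-O(10^5) cores,
% while a pencil decomposition leaves ample headroom.
```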
Need for a general-purpose scalable library for spectral transforms
Requirements for the library:
Scalable to large core counts on significantly large problem sizes (implies 2D decomposition)
Achieves performance and scalability reasonably close to upper hardware capability
Has a simple user interface
Is sufficiently versatile to be of use to many research groups
P3DFFT
Open source library for efficient, highly scalable spectral transforms on parallel platforms
Uses 2D decomposition; includes a 1D option
Available at http://code.google.com/p/p3dfft
Historically grew out of a Teragrid Advanced User Support project (now called ECSS)
P3DFFT 2.6.1: features
Currently implements:
Real-to-complex (R2C) and complex-to-real (C2R) 3D FFT transforms (see the storage note below)
Complex-to-complex 3D FFT transforms
Cosine/sine/Chebyshev transforms in the third dimension (FFT in the first two dimensions)
Empty transform in the third dimension: the user can substitute their custom algorithm
Fortran and C interfaces
Single or double precision
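A side note on the R2C/C2R transforms above (this is the standard real-to-complex storage convention, not a statement about P3DFFT's internal layout): Hermitian symmetry means only about half of the complex output modes need to be stored.

```latex
% Real-to-complex 3D FFT of an Nx x Ny x Nz real array (transform taken
% first along x, even Nx): the output occupies
\[
\underbrace{N_x \times N_y \times N_z}_{\text{real input}}
\;\longrightarrow\;
\underbrace{\left(\tfrac{N_x}{2}+1\right) \times N_y \times N_z}_{\text{complex output}}
\]
% complex values, because modes k and Nx - k are complex conjugates.
```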
P3DFFT 2.6.1 features (cont'd)
Arbitrary dimensions
Handles many uneven cases (Ni does not have to be divisible by Mj; see the sketch below)
Can do either in-place or out-of-place transforms
Can do pruned input/output (when only a subset of output or input modes is needed). This can save substantial time, as shown later.
Includes installation instructions, extensive documentation, and example programs in Fortran and C
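A minimal sketch (not P3DFFT's actual code, only an illustration) of the uneven case: a dimension of N grid points split over M tasks when M does not divide N, with the first N mod M tasks holding one extra point.

```c
#include <stdio.h>

/* Illustrative block decomposition of N points over M tasks when
 * N is not divisible by M: local sizes differ by at most one element. */
static void local_range(int N, int M, int rank, int *start, int *size)
{
    int base = N / M;                  /* minimum points per task        */
    int rem  = N % M;                  /* tasks that get one extra point */

    *size  = base + (rank < rem ? 1 : 0);
    *start = rank * base + (rank < rem ? rank : rem);
}

int main(void)
{
    /* Example: N = 10 points over M = 4 tasks gives sizes 3, 3, 2, 2. */
    for (int rank = 0; rank < 4; ++rank) {
        int start, size;
        local_range(10, 4, rank, &start, &size);
        printf("task %d: start=%d size=%d\n", rank, start, size);
    }
    return 0;
}
```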
3D FFT algorithm with 2D decomposition
1. Perform 1D FFT in X (data is local in X)
2. X-Y plane exchange (transpose) within row subgroups
3. Perform 1D FFT in Y
4. Y-Z plane exchange (transpose) within column subgroups
5. Perform 1D FFT in Z (see the outline below)
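A compact outline of these steps in C (fft_1d_* and transpose_* are hypothetical helper names, not the P3DFFT interface; the point is only the order of local FFTs and exchanges):

```c
#include <complex.h>
#include <mpi.h>

/* Hypothetical helpers: local 1D FFTs along the currently contiguous
 * dimension, and all-to-all transposes inside the row/column groups. */
void fft_1d_x(const double *in, double _Complex *out);
void fft_1d_y(double _Complex *buf);
void fft_1d_z(double _Complex *buf);
void transpose_xy(double _Complex *buf, MPI_Comm row_comm);
void transpose_yz(double _Complex *buf, MPI_Comm col_comm);

/* Forward real-to-complex 3D FFT with a 2D (pencil) decomposition. */
void forward_3dfft(const double *real_in, double _Complex *work,
                   MPI_Comm row_comm, MPI_Comm col_comm)
{
    fft_1d_x(real_in, work);        /* 1. local FFTs along X            */
    transpose_xy(work, row_comm);   /* 2. X-Y exchange in row subgroups */
    fft_1d_y(work);                 /* 3. local FFTs along Y            */
    transpose_yz(work, col_comm);   /* 4. Y-Z exchange in col subgroups */
    fft_1d_z(work);                 /* 5. local FFTs along Z            */
}
```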
P3DFFT implementation
Baseline version implemented in Fortran90 with MPI
1D FFT: call FFTW or ESSL
Transpose implementation in 2D decomposition:
Set up 2D Cartesian subcommunicators using MPI_COMM_SPLIT (rows and columns)
Two transposes are needed: 1. in rows 2. in columns
Baseline version: exchange data using MPI_Alltoall or MPI_Alltoallv (see the sketch below)
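A minimal sketch, in C, of the subcommunicator setup mentioned above (the names m1, m2, row_comm, col_comm are illustrative, not P3DFFT internals):

```c
#include <mpi.h>

/* Build row and column subcommunicators for an m1 x m2 virtual
 * processor grid.  Each rank belongs to one row group (m2 ranks)
 * and one column group (m1 ranks); the two transposes then run as
 * MPI_Alltoall(v) inside these groups. */
void make_proc_grid(MPI_Comm comm, int m1, int m2,
                    MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    (void)m1;              /* grid is m1 rows by m2 columns */

    int row = rank / m2;   /* this rank's row index in the grid    */
    int col = rank % m2;   /* this rank's column index in the grid */

    /* Second argument is the "color" (ranks with equal color end up in
     * the same new communicator), third is the ordering key within it. */
    MPI_Comm_split(comm, row, col, row_comm);   /* groups of size m2 */
    MPI_Comm_split(comm, col, row, col_comm);   /* groups of size m1 */
}
```

The X-Y transpose is then an MPI_Alltoall (or MPI_Alltoallv for uneven local sizes) over row_comm, and the Y-Z transpose over col_comm.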
Computation performance
1D FFT, three times: 1. Stride-1 2. Small stride 3. Large stride (out of cache)
Strategy:
Use an established library (ESSL, FFTW); see the strided-plan sketch below
An option to keep data in the original layout, or transpose so that the stride is always 1
The results are then laid out as (Z,Y,X) instead of (X,Y,Z)
Use loop blocking to optimize cache use
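For illustration, FFTW's advanced interface can plan batched strided 1D transforms, which is one way to handle the non-stride-1 dimensions when the data are not transposed first (a hedged sketch with assumed sizes and layout, not the library's internal code):

```c
#include <complex.h>
#include <fftw3.h>

int main(void)
{
    /* One y-x plane of a larger array, stored row-major with x fastest:
     * element (y, x) lives at index y*nx + x.                          */
    const int nx = 256, ny = 256;
    fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * nx * ny);
    for (int i = 0; i < nx * ny; ++i) data[i] = 0.0;

    /* Plan nx transforms of length ny along the strided y direction:
     * elements within one transform are nx apart (stride), and
     * successive transforms start one element apart (dist).            */
    int n[1] = { ny };
    fftw_plan p = fftw_plan_many_dft(1, n, nx,
                                     data, NULL, nx, 1,   /* in:  stride, dist */
                                     data, NULL, nx, 1,   /* out: stride, dist */
                                     FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(data);
    return 0;
}
```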
Communication performance
A large portion of total time (up to 80%) is all-to-all
Highly dependent on an optimal implementation of MPI_Alltoall (varies with vendor)
Buffers for exchange are close in size
Good load balance, predictable pattern
Performance can be sensitive to the choice of 2D virtual processor grid (M1, M2)
Performance dependence on processor grid shape M1 x M2
Communication scaling and networks
All-to-all exchanges are directly affected by the bisection bandwidth of the interconnect
Increasing P decreases buffer size
Expect 1/P scaling on fat trees and other networks with full bisection bandwidth (until buffer size gets below the latency threshold)
On torus topologies (Cray XT5, XE6) bisection bandwidth scales as P^(2/3), so expect P^(-2/3) scaling (see the estimate below)
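A back-of-the-envelope sketch of where these exponents come from (assuming a fixed N^3 grid, so each all-to-all moves a roughly constant total data volume):

```latex
% Transpose time ~ (data crossing the network) / (bisection bandwidth B(P)):
\[
T_{\mathrm{comm}} \sim \frac{N^{3}}{B(P)},
\qquad
B(P) \propto
\begin{cases}
P       & \text{full bisection bandwidth (fat tree)}\\[2pt]
P^{2/3} & \text{3D torus}
\end{cases}
\quad\Longrightarrow\quad
T_{\mathrm{comm}} \propto
\begin{cases}
P^{-1}\\[2pt]
P^{-2/3}
\end{cases}
\]
```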
Strong scaling on Cray XT5 (Kraken) at NICS/ORNL
4096^3 grid, double precision, best M1/M2 combination
Weak Scaling (Kraken)
N^3 grid, double precision
2D vs. 1D decomposition
Applications of P3DFFT
P3DFFT has already been applied in a number of codes, in science fields including the following:
Turbulence
Astrophysics
Oceanography
Other potential areas include:
Material science
Chemistry
Aerospace engineering
X-ray crystallography
Medicine
Atmospheric science
DNS turbulence
Direct Numerical Simulations (DNS) code from Georgia Tech (P.K. Yeung et al.) to simulate isotropic turbulence on a cubic periodic domain
Characterized by disorderly, nonlinear fluctuations in 3D space and time that span a wide range of interacting scales
DNS is an important tool for a first-principles understanding of turbulence in great detail
Vital for new concepts and models as well as improved engineering devices
Areas of application include aeronautics, environment, combustion, meteorology, oceanography
One of three Model Problems for NSF's Track 1 solicitation
DNS algorithm
It is crucial to simulate grids with high resolution to minimize discretization effects and study a wide range of length scales
Uses Runge-Kutta 2nd- or 4th-order time stepping
Uses a pseudospectral method to solve the Navier-Stokes equations
3D FFT is the most time-consuming part
A 2D decomposition based on the P3DFFT framework has been implemented
DNS performance (Cray XT5)
[Figure: timing vs. number of cores for 4096^3 and 8192^3 grids]
P3DFFT Development: Motivation
3D FFT is a very common algorithm. In pseudospectral algorithms simulating turbulence it is mostly used for isotropic conditions in homogeneous domains with periodic boundary conditions; in that case only real-to-complex and complex-to-real 3D FFTs are needed.
For many researchers it is more interesting to study inhomogeneous systems, for example wall-bounded flows (Dirichlet or other non-periodic boundary conditions in one dimension). There, Chebyshev transforms or higher-order compact finite difference schemes are more appropriate.
In simulating compressible turbulence, higher-order compact schemes are again used.
In other applications, a complex-to-complex transform may be needed, or a custom user transform.
P3DFFT Development: Motivation (cont’d)
Many CFD/turbulence codes use the 2/3 dealiasing technique, where only 2/3 of the modes in each dimension are kept after the forward Fourier transform. This is a potential time- and memory-saving opportunity (a rough estimate follows below).
Many codes have several independent arrays (variables) that need to be transformed. This can be implemented in a staggered fashion so as to overlap communication with computation.
Some codes employ 3D rather than 2D domain decomposition. They need utilities to go between 2D and 3D.
In some cases, the usage scenario does not fall into the common fold, and the user might need access to isolated transposes.
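A rough estimate of the saving from pruned output with 2/3 dealiasing (illustrative arithmetic only, not a measured result):

```latex
% Keeping only 2N/3 modes in each of the three dimensions retains
\[
\left(\tfrac{2}{3}\right)^{3} = \tfrac{8}{27} \approx 0.30
\]
% of the full N^3 spectral array, shrinking both the later transform
% stages and the corresponding transpose (all-to-all) message volumes.
```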
P3DFFT - Ongoing and planned work
Part 1: Interface and flexibility
Added other types of transforms (e.g. complex-to-complex, Chebyshev, empty) – DONE in P3DFFT 2.5.1
Added pruned input/output feature (allows implementing 2/3 dealiasing) – DONE in P3DFFT 2.6.1
Expanding the memory layout options, including 3D decomposition utilities
Adding the ability to isolate transposes so the user can design their own transform
P3DFFT 2.6.1 performance for a large problem (8192^3)
P3DFFT - Ongoing and planned work
Part 2: Performance improvements
One-sided/nonblocking communication: MPI-2, MPI-3, OpenSHMEM, Co-Array Fortran
Communication/computation overlap – requires RDMA
Hybrid MPI/OpenMP implementation
[Diagram: in the default scheme, the communication phases (Comm. 1-4) and computation phases (Comp. 1-4) run strictly one after another, leaving idle time; with overlap, communication for one block proceeds while computation for another block runs, hiding most of the idle time.]
Coarse-grain overlap
Suitable for computing several FFTs at once
Independent variables, e.g. velocity components
Overlap the communication stage of one variable with the computation stage of another variable
Uses large send buffers due to message aggregation
Uses a pairwise exchange algorithm, implemented through either MPI-2, SHMEM, or Co-Array Fortran
Alternatively, as of recently, MPI-3 nonblocking collectives have become available (see the sketch below)
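A minimal sketch of the nonblocking-collective flavor of this idea, using MPI-3's MPI_Ialltoall to overlap the exchange for one variable with computation on another (compute_on() and the buffer arguments are hypothetical placeholders, not P3DFFT code):

```c
#include <mpi.h>

/* Hypothetical local work on variable B (e.g. its 1D FFT stage). */
static void compute_on(double *buf, int n)
{
    for (int i = 0; i < n; ++i) buf[i] *= 2.0;   /* stand-in computation */
}

/* Overlap the all-to-all exchange of variable A with computation on B. */
void exchange_with_overlap(const double *send_a, double *recv_a, int count,
                           double *work_b, int nb, MPI_Comm comm)
{
    MPI_Request req;

    /* Start the exchange for A without blocking (MPI-3 collective). */
    MPI_Ialltoall(send_a, count, MPI_DOUBLE,
                  recv_a, count, MPI_DOUBLE, comm, &req);

    /* While the network moves A's data, work on B locally. */
    compute_on(work_b, nb);

    /* Make sure A's data has arrived before it is used. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```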
Coarse-grain overlap, results on Mellanox ConnectX-2 cluster (64 and 128 cores)
K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur, D. Panda, "High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT", ISC'11, Germany. Computer Science – Research and Development, v. 26, i. 3, 237-246 (2011)
Hybrid MPI/OpenMP preliminary results (Kraken)
Conclusions
P3DFFT is an efficient, scalable, and versatile library (available as open source at http://code.google.com/p/p3dfft)
Performance consistent with hardware capability is achieved on leading platforms
Great potential for enabling petascale science
An excellent testing tool for future platforms' capabilities:
Bisection bandwidth
MPI implementation
One-sided protocols implementation
MPI/OpenMP hybrid performance
Conclusions (cont'd)
An example of a project that came out of an Advanced User Support collaboration, now benefiting a wider community
Incorporated into a number of codes (~25 citations as of today, hundreds of downloads)
A future XSEDE community code
Work is under way to expand capability and improve parallel performance even further
WHAT ARE YOUR PETASCALE ALGORITHMIC NEEDS?
Send me an e-mail: dmitry@sdsc.edu
Acknowledgements
P.K. Yeung
D.A. Donzis
G. Chukkappalli
J. Goebbert
G. Brethouser
N. Prigozhina
K. Tomko
K. Kandalla
H. Subramoni
S. Sur
D. Panda

Work supported by XSEDE, NSF grants OCI-0850684 and CCF-0833155
Benchmarks run on Teragrid resources Ranger (TACC), Kraken (NICS), and DOE resources Jaguar (NCCS/ORNL), Hopper (NERSC), Blue Waters (NCSA)