
Scalable Spectral Transforms at Petascale

Dmitry Pekurovsky
San Diego Supercomputer Center, UC San Diego
dmitry@sdsc.edu

Presented at XSEDE'13, July 22-25, San Diego

Introduction: Fast Fourier Transforms and related spectral transforms

Project scope: algorithms operating on structured grids in three dimensions that are computationally demanding and process data in each dimension independently of the others. Examples: Fourier, Chebyshev, and high-order compact finite-difference schemes.

- Heavily used in many areas of computational science
- Computationally demanding
- Typically not a cache-friendly algorithm
- Memory bandwidth is stressed
- Communication intensive: the all-to-all exchange is an expensive operation, stressing the bisection bandwidth of the host's network

1D decomposition vs. 2D decomposition

[Figure: a 3D grid with axes x, y, z divided among processors P1-P4, as slabs of planes in the 1D decomposition and as pencils in the 2D decomposition]

Algorithm scalability

- 1D decomposition: concurrency is limited to N (the linear grid size). Not enough parallelism for O(10^4)-O(10^5) cores. This is the approach of most libraries to date (FFTW 3.3, PESSL).
- 2D decomposition: concurrency is up to N^2, so scaling to ultra-large core counts is possible.
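The concurrency limits can be made concrete by counting how an N^3 grid splits into local pieces. The sketch below is our own illustration (not P3DFFT code); the slab/pencil orientations are one common convention.

```python
# Illustrative sketch (not P3DFFT code): local subgrid shapes for an
# N^3 grid under 1D (slab) and 2D (pencil) decomposition. The 1D
# decomposition runs out of parallelism at P = N, while the 2D
# decomposition supports up to P = N^2 cores.
def local_shape_1d(n, p):
    assert p <= n, "1D decomposition: at most N slabs"
    return (n // p, n, n)          # each rank owns a slab of planes

def local_shape_2d(n, m1, m2):
    assert m1 * m2 <= n * n, "2D decomposition: at most N^2 pencils"
    return (n // m1, n // m2, n)   # each rank owns a pencil

print(local_shape_1d(4096, 4096))      # (1, 4096, 4096): slabs exhausted
print(local_shape_2d(4096, 128, 128))  # (32, 32, 4096): room to grow
```

At P = N the slabs are one plane thick and 1D decomposition cannot use any more cores, while the pencil shapes still have two dimensions left to subdivide.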

The answer to the petascale challenge.

The need for a general-purpose, scalable library for spectral transforms

Requirements for the library:
- Scalable to large core counts on sufficiently large problem sizes (implies 2D decomposition)
- Achieves performance and scalability reasonably close to the hardware's upper capability
- Has a simple user interface
- Is sufficiently versatile to be of use to many research groups

P3DFFT

Open-source library for efficient, highly scalable spectral transforms on parallel platforms
- Uses 2D decomposition; also includes a 1D option
- Available at http://code.google.com/p/p3dfft
- Historically grew out of a TeraGrid Advanced User Support project (now called ECSS)

P3DFFT 2.6.1: features

Currently implements:
- Real-to-complex (R2C) and complex-to-real (C2R) 3D FFT transforms
- Complex-to-complex 3D FFT transforms
- Cosine/sine/Chebyshev transforms in the third dimension (FFT in the first two dimensions)
- Empty transform in the third dimension (the user can substitute a custom algorithm)
- Fortran and C interfaces
- Single or double precision

P3DFFT 2.6.1 features (cont'd)

- Arbitrary dimensions
- Handles many uneven cases (the grid dimension Ni does not have to be divisible by the processor grid dimension Mj)
- Can do either in-place or out-of-place transforms
- Can do pruned input/output (when only a subset of input or output modes is needed); this can save substantial time, as shown later
- Includes installation instructions, extensive documentation, and example programs in Fortran and C

3D FFT algorithm with 2D decomposition

1. Perform 1D FFT in X
2. X-Y plane exchange in row subgroups
3. Perform 1D FFT in Y
4. Y-Z plane exchange in column subgroups
5. Perform 1D FFT in Z
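The same transpose-based algorithm can be sketched serially in NumPy (our own illustration; in P3DFFT the transposes are the inter-processor all-to-all exchanges, here they are local array transposes):

```python
import numpy as np

# Serial sketch of the transpose-based 3D FFT: transform one axis at a
# time, re-transposing so the active axis is always the stride-1 axis.
def fft3d_by_passes(a):                # input axes ordered (Z, Y, X)
    a = np.fft.fft(a, axis=2)          # 1D FFT in X
    a = a.transpose(0, 2, 1)           # bring Y to the stride-1 position
    a = np.fft.fft(a, axis=2)          # 1D FFT in Y
    a = a.transpose(2, 1, 0)           # bring Z to the stride-1 position
    a = np.fft.fft(a, axis=2)          # 1D FFT in Z; output axes are (Y, X, Z)
    return a

rng = np.random.default_rng(0)
u = rng.standard_normal((4, 5, 6))
out = fft3d_by_passes(u.astype(complex))
# permute back and compare with the standard all-at-once 3D FFT
print(np.allclose(out.transpose(2, 0, 1), np.fft.fftn(u)))  # True
```

The result comes out in a permuted layout, mirroring P3DFFT's option of returning data as (Z,Y,X) instead of (X,Y,Z) to keep every 1D FFT stride-1.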

P3DFFT implementation

Baseline version implemented in Fortran 90 with MPI.
- 1D FFT: call FFTW or ESSL
- Transpose implementation in the 2D decomposition: set up 2D Cartesian subcommunicators (rows and columns) using MPI_COMM_SPLIT
- Two transposes are needed: 1. in rows, 2. in columns
- Baseline version: exchange data using MPI_Alltoall or MPI_Alltoallv
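The row/column split can be illustrated with plain arithmetic (our own sketch, no MPI; the row-major rank ordering is one common convention, not necessarily P3DFFT's exact choice):

```python
# How ranks map onto a 2D M1 x M2 processor grid, and the "color"
# arguments MPI_Comm_split would use to carve out row and column
# subcommunicators. Ranks sharing a color end up in the same
# subcommunicator.
def grid_colors(rank, m1, m2):
    row = rank // m2      # color for the row subcommunicator
    col = rank % m2       # color for the column subcommunicator
    return row, col

# With MPI this would correspond to:
#   row_comm = MPI_Comm_split(MPI_COMM_WORLD, color=row, key=col)
#   col_comm = MPI_Comm_split(MPI_COMM_WORLD, color=col, key=row)

m1, m2 = 2, 4             # 8 ranks in a 2 x 4 grid
for r in range(m1 * m2):
    print(r, grid_colors(r, m1, m2))
```

The first transpose is then an all-to-all within each row subcommunicator, and the second within each column subcommunicator.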

Computation performance

1D FFT is performed three times: 1. stride-1; 2. small stride; 3. large stride (out of cache).

Strategy:
- Use an established library (ESSL, FFTW)
- Option to keep data in the original layout, or to transpose so that the stride is always 1; the results are then laid out as (Z,Y,X) instead of (X,Y,Z)
- Use loop blocking to optimize cache use
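The loop-blocking idea can be sketched as a tiled transpose (our own illustration; the block size is an arbitrary choice, not a P3DFFT parameter):

```python
import numpy as np

# Cache-blocked transpose: copy tile by tile so both the reads and the
# writes stay within cache-sized blocks, instead of one side striding
# across the whole array. Block size 64 is illustrative only.
def blocked_transpose(a, bs=64):
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            out[j:j + bs, i:i + bs] = a[i:i + bs, j:j + bs].T
    return out

a = np.arange(300 * 200, dtype=float).reshape(300, 200)
print(np.array_equal(blocked_transpose(a), a.T))  # True
```

In compiled code the same tiling keeps both source and destination tiles resident in cache while they are touched, which is what makes the large-stride pass affordable.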

Communication performance

- A large portion of total time (up to 80%) is the all-to-all exchange
- Highly dependent on an optimal implementation of MPI_Alltoall (varies with vendor)
- Buffers for exchange are close in size
- Good load balance, predictable pattern
- Performance can be sensitive to the choice of the 2D virtual processor grid (M1, M2)
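For a given core count P the user must pick one factorization P = M1 x M2. A tiny helper (our own, hypothetical) enumerates the candidate shapes one might benchmark:

```python
# Enumerate all 2D processor grids M1 x M2 with M1 * M2 = P.
# Performance can differ significantly between these shapes, so in
# practice one benchmarks a few and picks the best.
def grid_shapes(p):
    return [(m1, p // m1) for m1 in range(1, p + 1) if p % m1 == 0]

print(grid_shapes(16))  # [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]
```

Note that (1, P) and (P, 1) recover the two orientations of the 1D decomposition as special cases.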

Performance dependence on processor grid shape M1 x M2

Communication scaling and networks

- All-to-all exchanges are directly affected by the bisection bandwidth of the interconnect
- Increasing P decreases the buffer size
- Expect 1/P scaling on fat trees and other networks with full bisection bandwidth (until the buffer size drops below the latency threshold)
- On torus topologies (Cray XT5, XE6) bisection bandwidth scales as P^(2/3), so expect P^(-2/3) scaling
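These scaling expectations follow from a back-of-envelope model (our own simplification, ignoring constants and latency):

```python
# Toy model of all-to-all time vs. core count P: the total exchanged
# volume is fixed (~N^3 elements), and the exchange is limited by
# bisection bandwidth, which grows like P on a full-bisection fat tree
# but only like P^(2/3) on a 3D torus.
def alltoall_time(n, p, topology):
    bisection = p if topology == "fat-tree" else p ** (2.0 / 3.0)
    return n**3 / bisection

# Doubling P halves the time on a fat tree, but only cuts it by a
# factor of 2^(2/3) ~ 1.59 on a torus:
print(alltoall_time(4096, 2048, "fat-tree") / alltoall_time(4096, 1024, "fat-tree"))  # 0.5
print(alltoall_time(4096, 2048, "torus") / alltoall_time(4096, 1024, "torus"))        # ~0.63
```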

Strong scaling on Cray XT5 (Kraken) at NICS/ORNL

4096^3 grid, double precision, best M1/M2 combination

Weak Scaling (Kraken)

N^3 grid, double precision

2D vs. 1D decomposition

Applications of P3DFFT

P3DFFT has already been applied in a number of codes, in science fields including turbulence, astrophysics, and oceanography. Other potential areas include materials science, chemistry, aerospace engineering, X-ray crystallography, medicine, and atmospheric science.

DNS turbulence

- Direct Numerical Simulation (DNS) code from Georgia Tech (P. K. Yeung et al.) simulates isotropic turbulence on a cubic periodic domain
- Turbulence is characterized by disorderly, nonlinear fluctuations in 3D space and time that span a wide range of interacting scales
- DNS is an important tool for a first-principles understanding of turbulence in great detail, vital for new concepts and models as well as improved engineering devices
- Areas of application include aeronautics, the environment, combustion, meteorology, and oceanography
- One of three Model Problems for NSF's Track 1 solicitation

DNS algorithm

- It is crucial to simulate grids at high resolution to minimize discretization effects and to study a wide range of length scales
- Uses 2nd- or 4th-order Runge-Kutta for time stepping
- Uses a pseudospectral method to solve the Navier-Stokes equations
- The 3D FFT is the most time-consuming part
- A 2D decomposition based on the P3DFFT framework has been implemented
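The pseudospectral-plus-Runge-Kutta combination can be illustrated on a toy problem (our own 1D example, far simpler than the Navier-Stokes solver described here): advance the heat equation u_t = nu * u_xx on a periodic domain, taking derivatives exactly in Fourier space and stepping in time with 2nd-order Runge-Kutta.

```python
import numpy as np

n, nu, dt = 64, 0.1, 1e-3
x = 2 * np.pi * np.arange(n) / n
k = np.fft.fftfreq(n, d=1.0 / n)        # integer wavenumbers
u = np.sin(x)

def rhs(u):
    # spectral evaluation of nu * u_xx: multiply by -nu * k^2 in Fourier space
    return np.real(np.fft.ifft(-nu * k**2 * np.fft.fft(u)))

for _ in range(1000):                    # integrate to t = 1.0
    k1 = rhs(u)
    u = u + dt * rhs(u + 0.5 * dt * k1)  # RK2 midpoint step

exact = np.exp(-nu * 1.0) * np.sin(x)    # analytic decay of the sine mode
print(np.max(np.abs(u - exact)))         # tiny: spectral in space, RK2 in time
```

Every right-hand-side evaluation requires forward and inverse transforms, which is why the 3D FFT dominates the cost in the full 3D code.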

DNS performance (Cray XT5)

[Figure: DNS code performance vs. number of cores, for 4096^3 and 8192^3 grids]

P3DFFT Development: Motivation

3D FFT is a very common algorithm. In pseudospectral turbulence simulations it is mostly used for isotropic conditions in homogeneous domains with periodic boundary conditions, in which case only real-to-complex and complex-to-real 3D FFTs are needed. Many researchers are more interested in inhomogeneous systems, for example wall-bounded flows (Dirichlet or other non-periodic boundary conditions in one dimension), where Chebyshev transforms or higher-order compact finite-difference schemes are more appropriate. Simulations of compressible turbulence likewise use higher-order compact schemes. In other applications, a complex-to-complex transform may be needed, or a custom user transform.

P3DFFT Development: Motivation (cont’d)

- Many CFD/turbulence codes use the 2/3 dealiasing technique, in which only 2/3 of the modes in each dimension are kept after the forward Fourier transform. This is a potential time- and memory-saving opportunity.
- Many codes have several independent arrays (variables) that need to be transformed. This can be implemented in a staggered fashion so as to overlap communication with computation.
- Some codes employ 3D rather than 2D domain decomposition, and need utilities to go between 2D and 3D.
- In some cases the usage scenario does not fall into the common fold, and the user might need access to isolated transposes.
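The 2/3 rule mentioned above can be sketched as a mode mask (a standard technique; this is our own illustration, not P3DFFT code):

```python
import numpy as np

# 2/3-rule dealiasing mask: after a forward FFT of length n, keep only
# the modes with |k| <= n/3 and zero the rest, so that quadratic
# nonlinearities cannot alias back into the retained modes. This keeps
# roughly 2n/3 of the n modes.
def dealias_mask(n):
    k = np.fft.fftfreq(n, d=1.0 / n)   # integer wavenumbers
    return np.abs(k) <= n // 3

mask = dealias_mask(12)
print(mask.sum())  # 9 of 12 modes retained (|k| <= 4)
```

With pruned input/output, the discarded third of the modes in each dimension never needs to be transformed or communicated at all, which is the source of the savings.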

P3DFFT - Ongoing and planned work

Part 1: Interface and flexibility
- Added other types of transforms (e.g. complex-to-complex, Chebyshev, empty) – DONE in P3DFFT 2.5.1
- Added a pruned input/output feature (allows implementing 2/3 dealiasing) – DONE in P3DFFT 2.6.1
- Expanding the memory layout options, including 3D decomposition utilities
- Adding the ability to isolate transposes so the user can design their own transform

P3DFFT 2.6.1 performance for a large problem (8192^3)

P3DFFT - Ongoing and planned work

Part 2: Performance improvements
- One-sided/nonblocking communication: MPI-2, MPI-3, OpenSHMEM, Co-Array Fortran
- Communication/computation overlap – requires RDMA
- Hybrid MPI/OpenMP implementation

[Figure: execution timelines. Default: computation steps Comp. 1-4 alternate with communication steps Comm. 1-4, leaving idle time in both the compute and network resources. Overlap: each communication step proceeds during the next computation step, reducing idle time]

Coarse-grain overlap

- Suitable for computing several FFTs at once (independent variables, e.g. velocity components)
- Overlaps the communication stage of one variable with the computation stage of another variable
- Uses large send buffers due to message aggregation
- Uses a pairwise exchange algorithm, implemented through either MPI-2, SHMEM, or Co-Array Fortran
- Alternatively, as of recently, MPI-3 nonblocking collectives have become available
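The benefit of staggering the variables can be seen in a toy schedule model (our own illustration; the times are arbitrary units, not measured P3DFFT costs):

```python
# With V independent variables, communication of variable i can proceed
# while variable i+1 computes. Without overlap the stages simply
# alternate; with overlap they form a pipeline whose steady-state step
# costs the larger of the two stages.
def total_time(v, t_comp, t_comm, overlap):
    if not overlap:
        return v * (t_comp + t_comm)
    return t_comp + (v - 1) * max(t_comp, t_comm) + t_comm

print(total_time(4, 1.0, 1.0, False))  # 8.0: stages strictly alternate
print(total_time(4, 1.0, 1.0, True))   # 5.0: pipeline hides most comm
```

When computation and communication costs are comparable, the pipeline hides nearly all of the communication except for the first and last stages.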

Coarse-grain overlap, results on Mellanox ConnectX-2 cluster (64 and 128 cores)

K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur, D. Panda, "High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT", ISC'11, Germany. Computer Science - Research and Development, v. 26, i. 3, 237-246 (2011)

Hybrid MPI/OpenMP preliminary results (Kraken)

Conclusions

- P3DFFT is an efficient, scalable, and versatile library (available as open source at http://code.google.com/p/p3dfft)
- Performance consistent with hardware capability is achieved on leading platforms
- Great potential for enabling petascale science
- An excellent tool for testing future platforms' capabilities: bisection bandwidth, MPI implementation, one-sided protocol implementations, and hybrid MPI/OpenMP performance

Conclusions (cont'd)

- An example of a project that came out of an Advanced User Support collaboration, now benefiting a wider community
- Incorporated into a number of codes (~25 citations as of today, hundreds of downloads)
- A future XSEDE community code
- Work is under way to expand capability and improve parallel performance even further

WHAT ARE YOUR PETASCALE ALGORITHMIC NEEDS? Send me an e-mail: dmitry@sdsc.edu

Acknowledgements

P. K. Yeung, D. A. Donzis, G. Chukkappalli, J. Goebbert, G. Brethouser, N. Prigozhina, K. Tomko, K. Kandalla, H. Subramoni, S. Sur, D. Panda

Work supported by XSEDE and NSF grants OCI-0850684 and CCF-0833155. Benchmarks were run on TeraGrid resources Ranger (TACC) and Kraken (NICS), on DOE resources Jaguar (NCCS/ORNL) and Hopper (NERSC), and on Blue Waters (NCSA).