Non-uniformly Communicating Non-contiguous Data: A Case Study with PETSc and MPI
P. Balaji, D. Buntinas, S. Balay, B. Smith, R. Thakur and W. Gropp
Mathematics and Computer Science
Argonne National Laboratory
Numerical Libraries in HEC
Developing parallel applications is a complex task
Discretizing physical equations to numerical forms
Representing the domain of interest as data points
Libraries allow developers to abstract low-level details
E.g., Numerical Analysis, Communication, I/O
Numerical libraries (e.g., PETSc, ScaLAPACK, PESSL)
Parallel data layout and processing
Tools for distributed data layout (matrix, vector)
Tools for data processing (SLES, SNES)
Overview of PETSc
Portable, Extensible Toolkit for Scientific Computing
Software tools for solving PDEs
Suite of routines to create vectors, matrices and distributed arrays
Sequential/parallel data layout
Linear and nonlinear numerical solvers
Widely used in nanosimulations, molecular dynamics, etc.
Uses MPI for communication
[Figure: PETSc software layers by increasing level of abstraction: BLAS, LAPACK and MPI at the bottom; Matrices, Vectors and Index Sets; KSP (Krylov subspace methods), PC (preconditioners) and Draw; SLES (linear equation solvers), SNES (nonlinear equation solvers) and TS (time stepping); PDE solvers; and application codes at the top]
Handling Parallel Data Layouts in PETSc
Grid layout exposed to the application
Structured or Unstructured (1D, 2D, 3D)
Internally managed as a single vector of data elements
The representation is often chosen to optimize PETSc's own operations
Impact on communication:
Data representation and communication pattern might not be ideal for MPI communication operations
Non-uniformity and non-contiguity in communication are the primary culprits
Presentation Layout
Introduction
Impact of PETSc Data Layout and Processing on MPI
MPI Enhancements and Optimizations
Experimental Evaluation
Concluding Remarks and Future Work
Data Layout and Processing in PETSc
Grid layouts: data is divided among processes
Ghost data points are shared between neighboring processes (a PETSc sketch of this exchange follows below)
Non-contiguous data communication: e.g., the 2nd dimension of the grid
Non-uniform communication: depends on the structure of the grid and the stencil type used; sides exchange more data than corners
[Figure: box-type and star-type stencils on a 2-D grid split between Proc 0 and Proc 1, showing local data points, ghost data points and the process boundary]
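To make the ghost-point exchange concrete, here is a minimal sketch using PETSc's distributed-array (DMDA) interface. It is written against the present-day API (the talk predates it and used DACreate2d/DAGlobalToLocal); the grid size, stencil width and the omission of error checking are illustrative only.

```c
/* Sketch: a 2-D structured grid distributed across processes with a
 * box-type stencil.  The ghost (halo) points are filled by a
 * global-to-local scatter, which is where the non-contiguous,
 * non-uniform communication arises.  Error checking omitted. */
#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM  da;
  Vec global, local;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* 128x128 grid, 1 degree of freedom per point, stencil width 1 */
  DMDACreate2d(PETSC_COMM_WORLD,
               DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
               DMDA_STENCIL_BOX,                 /* corners included */
               128, 128, PETSC_DECIDE, PETSC_DECIDE,
               1, 1, NULL, NULL, &da);
  DMSetUp(da);

  DMCreateGlobalVector(da, &global);   /* owned points only        */
  DMCreateLocalVector(da, &local);     /* owned plus ghost points  */

  /* Fill the ghost points from the neighboring processes */
  DMGlobalToLocalBegin(da, global, INSERT_VALUES, local);
  DMGlobalToLocalEnd(da, global, INSERT_VALUES, local);

  VecDestroy(&local);
  VecDestroy(&global);
  DMDestroy(&da);
  PetscFinalize();
  return 0;
}
```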
Non-contiguous Communication in MPI
MPI Derived Datatypes
Application describes the noncontiguous data layout to MPI
Data is either packed into contiguous buffers and pipelined (sparse layouts) or sent individually (dense layouts), as sketched below
Good for simple algorithms, but very restrictive
E.g., looking up upcoming content to pre-decide which algorithm to use
Multiple parses on the datatype lose the context!
[Figure: a non-contiguous data layout packed into a buffer and sent, with the traversal context saved at each step]
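The packing step can be made concrete with a small sketch. The code below explicitly packs a strided grid column, described to MPI as a derived datatype, into a contiguous buffer before sending it, which is roughly what an MPI implementation does internally for sparse layouts. The function name, grid dimensions and tag are illustrative assumptions, not part of PETSc or the paper.

```c
/* Sketch: explicitly pack a strided grid column (described to MPI as a
 * derived datatype) into a contiguous buffer and send it.  This mirrors
 * what an MPI implementation does internally for sparse layouts, where
 * the pack is pipelined with the send. */
#include <mpi.h>
#include <stdlib.h>

void send_column(const double *grid, int ny, int nx, int col,
                 int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    int packed_size, position = 0;
    char *buf;

    /* one double from each of the ny rows, with a stride of nx doubles */
    MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Pack_size(1, column, comm, &packed_size);
    buf = malloc(packed_size);

    /* explicit packing step; a real implementation pipelines this copy */
    MPI_Pack(&grid[col], 1, column, buf, packed_size, &position, comm);
    MPI_Send(buf, position, MPI_PACKED, dest, /*tag=*/0, comm);

    free(buf);
    MPI_Type_free(&column);
}
```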
Issues with Lost Datatype Context
Rollback of context not possible
Datatypes could be recursive
Duplication of context not possible
Context information might be large
When datatype elements are small, the context could be larger than the datatype itself
Searching for the context is possible, but very expensive
Search time grows quadratically with increasing datatype size
This is the currently used mechanism!
Non-uniform Collective Communication
Existing algorithms are optimized for "uniform" communication, even when the communication is non-uniform
Case Studies
Allgatherv uses a ring algorithm
Causes idleness if data volumes are very different
Alltoallw sends data to nodes in round-robin manner
MPI processing is sequential
[Figure: processes 0 through 6 exchanging a mix of large and small messages]
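For reference, the kind of non-uniform collective discussed above looks like the following sketch: a single MPI_Allgatherv call in which one rank contributes a very large buffer while the rest contribute small ones. The sizes and the choice of rank 0 as the large contributor are illustrative.

```c
/* Sketch: a non-uniform Allgatherv in which rank 0 contributes a large
 * buffer while every other rank contributes a small one.  Ring-style
 * algorithms tuned for equal contributions cause idleness here. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mycount = (rank == 0) ? (1 << 20) : 16;     /* very unequal sizes */
    double *mine = malloc(mycount * sizeof(double));
    for (i = 0; i < mycount; i++) mine[i] = rank;

    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    int total = 0;
    for (i = 0; i < size; i++) {
        counts[i] = (i == 0) ? (1 << 20) : 16;
        displs[i] = total;
        total += counts[i];
    }
    double *all = malloc((size_t)total * sizeof(double));

    MPI_Allgatherv(mine, mycount, MPI_DOUBLE,
                   all, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    free(all); free(displs); free(counts); free(mine);
    MPI_Finalize();
    return 0;
}
```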
Dual-context Approach for Non-contiguous Communication
Previous approaches are inefficient in complex designs
E.g., if a look-ahead is performed to understand the structure of the upcoming data, the saved context is lost
Dual-context approach retains the data context
Look-aheads are performed using a separate context
Completely eliminates the search time
[Figure: dual-context packing, with one context saved for packing and sending the data and a separate context for the look-ahead over the non-contiguous layout; a simplified sketch follows below]
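The following is a minimal sketch of the dual-context idea on a deliberately simplified, flattened layout (a list of offset/length segments) rather than the real MPICH2 dataloop machinery. One cursor is advanced only by packing; the look-ahead uses an independent cursor, so the saved packing context is never lost and no search is needed.

```c
/* Sketch of the dual-context idea on a simplified, flattened layout:
 * the layout is a list of (offset, length) segments.  One cursor is the
 * packing context; look-aheads use an independent cursor, so peeking at
 * upcoming segments never disturbs the saved packing position. */
#include <stddef.h>
#include <string.h>

typedef struct { size_t offset, length; } segment_t;

typedef struct {
    const segment_t *segs;     /* flattened non-contiguous layout      */
    int    nsegs;
    int    cur;                /* current segment                      */
    size_t within;             /* bytes already consumed in segs[cur]  */
} cursor_t;

/* Look-ahead: bytes in the next n segments, using a copy of the packing
 * cursor so the packing context itself is never lost. */
static size_t peek_bytes(const cursor_t *pack_ctx, int n)
{
    cursor_t look = *pack_ctx;          /* independent look-ahead context */
    size_t bytes = 0;
    while (n-- > 0 && look.cur < look.nsegs) {
        bytes += look.segs[look.cur].length - look.within;
        look.within = 0;
        look.cur++;
    }
    return bytes;
}

/* Pack up to max bytes into buf, advancing only the packing context. */
static size_t pack_some(cursor_t *pack_ctx, const char *src,
                        char *buf, size_t max)
{
    size_t done = 0;
    while (done < max && pack_ctx->cur < pack_ctx->nsegs) {
        const segment_t *s = &pack_ctx->segs[pack_ctx->cur];
        size_t n = s->length - pack_ctx->within;
        if (n > max - done) n = max - done;
        memcpy(buf + done, src + s->offset + pack_ctx->within, n);
        done += n;
        pack_ctx->within += n;
        if (pack_ctx->within == s->length) { pack_ctx->cur++; pack_ctx->within = 0; }
    }
    return done;
}
```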
Non-uniform Communication: Allgatherv
Single point of distribution is the primary bottleneck
Identify if a small fraction of messages are very large
Floyd and Rivest Algorithm
Linear-time detection of outliers (a simplified stand-in is sketched below)
Binomial Algorithms
Recursive doubling or dissemination
Logarithmic time
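The slides cite the Floyd and Rivest selection algorithm for the linear-time outlier detection. As a stand-in, the sketch below shows a simple single-pass check of the same flavor: decide whether a small fraction of the per-rank message sizes carries most of the bytes, and hence whether the binomial algorithm should be preferred over the ring. The thresholds are illustrative, not those used in the paper.

```c
/* Sketch: decide whether a small fraction of the per-rank message sizes
 * are outliers, and hence whether a binomial algorithm should be used
 * instead of the ring.  Single-pass heuristic for illustration only; the
 * paper uses Floyd-Rivest selection for this step.  Thresholds assumed. */
static int few_large_outliers(const int *counts, int nprocs)
{
    long long total = 0, large_bytes = 0;
    int i, nlarge = 0;

    for (i = 0; i < nprocs; i++) total += counts[i];
    long long avg = total / nprocs;

    for (i = 0; i < nprocs; i++) {
        if (counts[i] > 4 * avg) {     /* "very large" relative to the rest */
            nlarge++;
            large_bytes += counts[i];
        }
    }
    /* outliers: few ranks, but they carry most of the data */
    return nlarge > 0 && nlarge <= nprocs / 8 && 2 * large_bytes > total;
}
```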
Non-uniform Communication: Alltoallw
Distributing the messages to be sent out into bins (based on message size) allows differential treatment of nodes
Send out small messages first (sketched below)
Nodes waiting for small messages have to wait less
The relative increase in wait time for nodes waiting for larger messages is much smaller
No skew for zero-byte data and less synchronization
Most helpful for non-contiguous messages
MPI processing (e.g., packing) is sequential for non-contiguous messages
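A minimal sketch of the small-messages-first ordering follows. It fully sorts the outgoing messages by size and posts nonblocking sends in that order; the actual design bins messages by size rather than sorting, and the struct layout, tag and omission of the receive side are assumptions made for brevity.

```c
/* Sketch: post the sends of an Alltoallw-style exchange in increasing
 * order of message size, so peers waiting only on small messages are
 * served first.  A full sort is used here for brevity; the actual design
 * bins messages by size.  Receive side and MPI_Waitall omitted. */
#include <mpi.h>
#include <stdlib.h>

struct out_msg { int dest; int bytes; const void *buf; MPI_Datatype type; };

static int by_size(const void *a, const void *b)
{
    return ((const struct out_msg *)a)->bytes - ((const struct out_msg *)b)->bytes;
}

/* msgs[] describes one outgoing message per peer */
void send_small_first(struct out_msg *msgs, int n, MPI_Comm comm,
                      MPI_Request *reqs)
{
    int i;
    qsort(msgs, n, sizeof(*msgs), by_size);          /* smallest first */
    for (i = 0; i < n; i++)
        MPI_Isend(msgs[i].buf, 1, msgs[i].type, msgs[i].dest,
                  /*tag=*/0, comm, &reqs[i]);
}
```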
Experimental Testbed
64-node Cluster
32 nodes with dual Intel EM64T 3.6GHz processors
2MB L2 Cache, 2GB DDR2 400MHz SDRAM
Intel E7520 (Lindenhurst) Chipset
32 nodes with dual Opteron 2.8GHz processors
1MB L2 Cache, 4GB DDR 400MHz SDRAM
NVidia 2200/2050 Chipset
RedHat AS4 with kernel.org kernel 2.6.16
InfiniBand DDR (16Gbps) network: MT25208 adapters connected through a 144-port switch
MVAPICH2-0.9.6 MPI implementation
Non-contiguous Communication Evaluation
Search time can dominate performance if the working context is lost!
Allgatherv Evaluation
Alltoallw Evaluation
Our algorithm reduces the skew introduced by the Alltoallw operation by sending out smaller messages first, allowing the corresponding applications to make progress
PETSc Vector Scatter
3-D Laplacian Multigrid Solver
Concluding Remarks and Future Work
Non-uniform and Non-contiguous communication is inherent in several libraries and applications
Current algorithms deal with non-uniform communication in the same way as uniform communication
Demonstrated that more sophisticated algorithms can give close to 10x improvements in performance
Designs are a part of MPICH2-1.0.5 and 1.0.6
To be picked up by MPICH2 derivatives in later releases
Future Work:
Skew tolerance in non-uniform communication
Other libraries and applications
Thank You
Group Web-page:
http://www.mcs.anl.gov/radix
Home-page:
http://www.mcs.anl.gov/~balaji
Email: balaji@mcs.anl.gov
Backup Slides
Noncontiguous Communication in PETSc
Data might not always be contiguously laid out in memory
E.g., the second dimension of a structured grid
Communication is performed by packing the data
Pipelining the copy and the communication is important for performance
[Figure: the second-dimension layout expressed as a derived datatype, a vector (count = 8, stride = 8) of contiguous (count = 3) blocks of doubles, packed into a copy buffer; a sketch of this datatype follows below]
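The derived datatype in the figure can be expressed directly with MPI calls; a minimal sketch follows. The helper name is illustrative and the counts simply mirror the figure (8 blocks of 3 doubles with a stride of 8 doubles).

```c
/* Sketch of the datatype in the figure: 8 blocks of 3 contiguous doubles,
 * each block separated by a stride of 8 doubles (the second dimension of
 * a structured grid).  MPI packs this into a contiguous buffer before
 * sending; pipelining that copy with the transfer hides its cost. */
#include <mpi.h>

void build_second_dim_type(MPI_Datatype *second_dim)
{
    /* contiguous (count = 3) blocks inside a vector (count = 8, stride = 8) */
    MPI_Type_vector(/*count=*/8, /*blocklength=*/3, /*stride=*/8,
                    MPI_DOUBLE, second_dim);
    MPI_Type_commit(second_dim);
}
```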
Hand-tuning vs. Automated optimization
Nonuniformity and noncontiguity in data communication is inherent in several applications
Communicating unequal amounts of data to the different peer processes
Communicating data from noncontiguous memory locations
Previous research has primarily focused on uniform and contiguous data communication
Accordingly, applications and libraries resorted to hand-tuning to convert the communication formats
Manually packing noncontiguous data (sketched below)
Re-implementing collective operations in the application
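As an example of the hand-tuned route, the sketch below manually gathers a strided grid column into a scratch buffer and sends it as contiguous data, instead of describing the layout to MPI with a derived datatype. The function name and parameters are illustrative.

```c
/* Sketch of the hand-tuned alternative: manually gather a strided grid
 * column into a scratch buffer and send it as contiguous data, instead of
 * describing the layout to MPI with a derived datatype. */
#include <mpi.h>
#include <stdlib.h>

void send_column_by_hand(const double *grid, int ny, int nx, int col,
                         int dest, MPI_Comm comm)
{
    double *scratch = malloc(ny * sizeof(double));
    int i;

    for (i = 0; i < ny; i++)              /* gather the strided column */
        scratch[i] = grid[i * nx + col];

    MPI_Send(scratch, ny, MPI_DOUBLE, dest, /*tag=*/0, comm);
    free(scratch);
}
```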
Non-contiguous Communication in MPI
MPI Derived Datatypes
Common approach for non-contiguous communication
Application describes noncontiguous data layout to MPI
Data is either packed into contiguous memory (sparse layouts) or sent as independent segments (dense layouts)
Pipelining of packing and communication improves performance, but requires context information!
[Figure: a non-contiguous data layout packed into a buffer and sent, with the traversal context saved at each step]
Issues with Non-contiguous Communication
Current approach is simple and works as long as there is a single parse on the noncontiguous data
More intelligent algorithms might suffer:
E.g., looking up upcoming datatype content to pre-decide which algorithm to use
Multiple parses on the datatype lose the context!
Searching for the lost context every time requires quadratically increasing time with datatype size
PETSc non-contiguous communication suffers from such high search times
MPI-level Evaluation
Experimental Results
MPI-level Micro-benchmarks
Non-contiguous data communication time
Non-uniform collective communication
Allgatherv Operation
Alltoallw Operation
PETSc Vector Scatter Benchmark
Performs communication only
3-D Laplacian Multigrid Solver Application
Partial differential equation solver
Utilizes PETSc numerical solver operations