
Non-uniformly Communicating Non-contiguous Data: A Case Study with PETSc and MPI

P. Balaji,

D. Buntinas, S. Balay, B. Smith,

R. Thakur and W. Gropp

Mathematics and Computer Science

Argonne National Laboratory

Numerical Libraries in HEC

Developing parallel applications is a complex task

Discretizing physical equations to numerical forms

Representing the domain of interest as data points

Libraries allow developers to abstract low-level details

E.g., Numerical Analysis, Communication, I/O

Numerical libraries (e.g., PETSc, ScaLAPACK, PESSL)

Parallel data layout and processing

Tools for distributed data layout (matrix, vector)

Tools for data processing (SLES, SNES)

Overview of PETSc

Portable, Extensible Toolkit for Scientific Computing

Software tools for solving PDEs

Suite of routines to create vectors, matrices and distributed arrays

Sequential/parallel data layout

Linear and nonlinear numerical solvers

Widely used in nanosimulations, molecular dynamics, etc.

Uses MPI for communication

[Figure: PETSc component stack, ordered by level of abstraction: Application Codes; PDE Solvers; TS (Time Stepping); SNES (Nonlinear Equation Solvers); SLES (Linear Equation Solvers); KSP (Krylov Subspace Methods); PC (Preconditioners); Draw; Matrices; Vectors; Index Sets; BLAS; LAPACK; MPI]

Handling Parallel Data Layouts in PETSc

Grid layout exposed to the application

Structured or Unstructured (1D, 2D, 3D)

Internally managed as a single vector of data elements

Representation often suited to optimize its operations

Impact on communication:

Data representation and communication pattern might not be ideal for MPI communication operations

Non-uniformity and non-contiguity in communication are the primary culprits

Presentation Layout

Introduction

Impact of PETSc Data Layout and Processing on MPI

MPI Enhancements and Optimizations

Experimental Evaluation

Concluding Remarks and Future Work

Data Layout and Processing in PETSc

Grid layouts: data is divided among processes

Ghost data points shared

Non-contiguous data communication: e.g., the 2nd dimension of the grid

Non-uniform communication: depends on the structure of the grid and the stencil type used; sides are larger than corners

[Figure: grid partitioned between Proc 0 and Proc 1, showing local data points, ghost data points, and the process boundary, for box-type and star-type stencils]
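These grid layouts and stencils are what PETSc's distributed-array machinery manages. As a minimal sketch only, assuming a recent PETSc release (the talk predates this interface and used the older DA API), creating a 2-D distributed grid with a star-type stencil of width 1 looks roughly as follows; the grid size and stencil width are arbitrary illustration values:

/* Sketch, assuming the modern PETSc DMDA interface (PetscCall needs a
 * recent PETSc): a 2-D structured grid whose ghost points are exchanged
 * with a star-type stencil. DMDA_STENCIL_BOX would also exchange the
 * corner ghost points. */
#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM da;
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(DMDACreate2d(PETSC_COMM_WORLD,
                         DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                         DMDA_STENCIL_STAR,
                         128, 128,                   /* global grid size   */
                         PETSC_DECIDE, PETSC_DECIDE, /* process grid       */
                         1,                          /* dof per grid point */
                         1,                          /* stencil width      */
                         NULL, NULL, &da));
  PetscCall(DMSetUp(da));
  /* ... create vectors, perform ghost updates, solve ... */
  PetscCall(DMDestroy(&da));
  PetscCall(PetscFinalize());
  return 0;
}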

Non-contiguous Communication in MPI

MPI Derived Datatypes

Application describes the noncontiguous data layout to MPI

Data is either packed into contiguous buffers and pipelined (sparse layouts) or sent individually (dense layouts)

Good for simple algorithms, but very restrictive

Look up upcoming content to pre-decide which algorithm to use

Multiple parses on the datatype lose the context!

[Figure: non-contiguous data layout packed piecewise into a packing buffer; the context is saved between partial packs before the data is sent]

Issues with Lost Datatype Context

Rollback of context not possible

Datatypes could be recursive

Duplication of context not possible

Context information might be large

When datatype elements are small, context could be larger than the datatype itself

Search of context possible, but very expensive

Quadratically increasing search time with increasing datatype size

Currently used mechanism!

Non-uniform Collective Communication

Collective communication algorithms are optimized for “uniform” communication

Case Studies

Allgatherv uses a ring algorithm

Causes idleness if data volumes are very different

Alltoallw sends data to nodes in a round-robin manner

MPI processing is sequential

[Figure: ring of processes 0–6 exchanging one large message and many small messages]

Presentation Layout

Introduction

Impact of PETSc Data Layout and Processing on MPI

MPI Enhancements and Optimizations

Experimental Evaluation

Concluding Remarks and Future Work

Dual-context Approach for Non-contiguous Communication

Previous approaches are inefficient in complex designs

E.g., if a look-ahead is performed to understand the structure of the upcoming data, the saved context is lost

Dual-context approach retains the data context

Look-aheads are performed using a separate context

Completely eliminates the search time

[Figure: non-contiguous data layout packed into a packing buffer; the main pack context is retained while a look-ahead uses a separate context before the data is sent]
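As an illustration only (this is not the MPICH2 dataloop code; all names here are made up), the following self-contained C program mimics the dual-context idea: the ongoing pack keeps its own cursor, and a look-ahead peeks at upcoming data through an independent copy, so the main position is never lost and never has to be searched for:

/* Illustrative sketch of the dual-context idea -- not MPICH2 internals. */
#include <stdio.h>

#define ROWS 4
#define COLS 6

typedef struct { int row; } Cursor;     /* "context": next row to pack */

/* Pack up to 'max' elements of column 'col' of a ROWS x COLS matrix,
 * resuming from the position stored in 'ctx'. */
static int pack_next(const double m[ROWS][COLS], int col, Cursor *ctx,
                     double *out, int max)
{
    int n = 0;
    while (ctx->row < ROWS && n < max)
        out[n++] = m[ctx->row++][col];
    return n;
}

int main(void)
{
    double m[ROWS][COLS], buf[ROWS], peek[ROWS];
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 10.0 * i + j;

    Cursor main_ctx = {0};

    /* Pack the first part of column 2. */
    int n = pack_next(m, 2, &main_ctx, buf, 2);

    /* Look ahead using a *separate* context; main_ctx is untouched. */
    Cursor peek_ctx = main_ctx;
    int p = pack_next(m, 2, &peek_ctx, peek, ROWS);

    /* Resume the main pack exactly where it left off -- no search needed. */
    n += pack_next(m, 2, &main_ctx, buf + n, ROWS);

    printf("packed %d elements, looked ahead at %d elements\n", n, p);
    return 0;
}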

Non-Uniform Communication: AllGatherv

Single point of distribution is the primary bottleneck

Identify if a small fraction of messages are very large

Floyd and Rivest Algorithm

Linear time detection of outliers

Binomial Algorithms

Recursive doubling or dissemination: logarithmic time

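For concreteness, a minimal MPI program showing the non-uniform pattern this slide targets (counts are arbitrary illustration values): one rank contributes a large buffer to MPI_Allgatherv while the others contribute a few elements. A ring algorithm forwards the large block through every process in turn; the binomial/dissemination schemes proposed above spread it in a logarithmic number of steps instead.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank 0 contributes 1 MB of doubles; every other rank contributes 8. */
    int mycount = (rank == 0) ? (1 << 17) : 8;

    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    int total = 0;
    for (int i = 0; i < size; i++) {
        counts[i] = (i == 0) ? (1 << 17) : 8;
        displs[i] = total;
        total += counts[i];
    }

    double *sendbuf = calloc(mycount, sizeof(double));
    double *recvbuf = malloc(total * sizeof(double));

    MPI_Allgatherv(sendbuf, mycount, MPI_DOUBLE,
                   recvbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}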

Non-uniform Communication: Alltoallw

Distributing the messages to be sent out into bins (based on message size) allows differential treatment of nodes

Send out small messages first

Nodes waiting for small messages wait less

The relative increase in time for nodes waiting for the larger messages is much smaller

No skew for zero-byte data, with less synchronization

Most helpful for non-contiguous messages

MPI processing (e.g., packing) is sequential for non-contiguous messages
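A rough sketch of the small-messages-first idea, not the MPICH2 Alltoallw implementation: each process posts its receives, then issues its sends ordered by ascending message size, so destinations waiting only for small messages are released early. The message sizes and the use of MPI_BYTE point-to-point calls are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

static const int *g_sizes;                     /* for the qsort comparator */
static int by_size(const void *a, const void *b)
{
    return g_sizes[*(const int *)a] - g_sizes[*(const int *)b];
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Arbitrary non-uniform pattern: everyone sends a large message to
     * rank 0 and small messages to all other ranks. */
    int *scount = malloc(size * sizeof(int));
    int *rcount = malloc(size * sizeof(int));
    for (int d = 0; d < size; d++) scount[d] = (d == 0) ? 100000 : 16;
    for (int s = 0; s < size; s++) rcount[s] = (rank == 0) ? 100000 : 16;

    char **sbuf = malloc(size * sizeof(char *));
    char **rbuf = malloc(size * sizeof(char *));
    MPI_Request *req = malloc(2 * size * sizeof(MPI_Request));
    int nreq = 0;

    for (int s = 0; s < size; s++) {           /* post all receives first */
        rbuf[s] = malloc(rcount[s]);
        MPI_Irecv(rbuf[s], rcount[s], MPI_BYTE, s, 0, MPI_COMM_WORLD, &req[nreq++]);
    }

    int *order = malloc(size * sizeof(int));   /* destinations, smallest first */
    for (int d = 0; d < size; d++) order[d] = d;
    g_sizes = scount;
    qsort(order, size, sizeof(int), by_size);

    for (int i = 0; i < size; i++) {
        int d = order[i];
        sbuf[d] = calloc(scount[d], 1);
        MPI_Isend(sbuf[d], scount[d], MPI_BYTE, d, 0, MPI_COMM_WORLD, &req[nreq++]);
    }

    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

    for (int i = 0; i < size; i++) { free(sbuf[i]); free(rbuf[i]); }
    free(sbuf); free(rbuf); free(req); free(order); free(scount); free(rcount);
    MPI_Finalize();
    return 0;
}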

Presentation Layout

Introduction

Impact of PETSc Data Layout and Processing on MPI

MPI Enhancements and Optimizations

Experimental Evaluation

Concluding Remarks and Future Work

Experimental Testbed

64-node Cluster

32 nodes with dual Intel EM64T 3.6GHz processors

2MB L2 Cache, 2GB DDR2 400MHz SDRAM

Intel E7520 (Lindenhurst) Chipset

32 nodes with dual Opteron 2.8GHz processors

1MB L2 Cache, 4GB DDR 400MHz SDRAM

NVidia 2200/2050 Chipset

RedHat AS4 with kernel.org kernel 2.6.16

InfiniBand DDR (16Gbps) Network:

MT25208 adapters connected through a 144-port switch

MVAPICH2-0.9.6 MPI implementation

Non-contiguous Communication Evaluation

Search time can dominate performance if the working context is lost!

AllGatherv Evaluation

Alltoallw Evaluation

Our algorithm reduces the skew introduced by the Alltoallw operation by sending out smaller messages first, allowing the corresponding applications to progress

PETSc Vector Scatter

3-D Laplacian Multigrid Solver

Presentation Layout

Introduction

Impact of PETSc Data Layout and Processing on MPI

MPI Enhancements and Optimizations

Experimental Evaluation

Concluding Remarks and Future Work

Concluding Remarks and Future Work

Non-uniform and Non-contiguous communication is inherent in several libraries and applications

Current algorithms deal with non-uniform communication in the same way as uniform communication

Demonstrated that more sophisticated algorithms can give close to 10x improvements in performance

Designs are a part of MPICH2-1.0.5 and 1.0.6

To be picked up by MPICH2 derivatives in later releases

Future Work:

Skew tolerance in non-uniform communication

Other libraries and applications

Thank You

Group Web-page:

http://www.mcs.anl.gov/radix

Home-page:

http://www.mcs.anl.gov/~balaji

Email: balaji@mcs.anl.gov

Backup Slides

Noncontiguous Communication in PETSc

[Figure: derived datatype describing the layout: a vector (count = 8, stride = 8) whose blocks are contiguous (count = 3) runs of doubles, packed into a copy buffer; offsets 0, 8, 16, 192, 384]

Data might not always be contiguously laid out in memory

E.g., Second dimension of a structured grid

Communication is performed by packing data

Pipelining copy and communication is important for performance
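As a minimal sketch matching the layout in the figure above: the vector (count = 8, stride = 8) of contiguous (count = 3) doubles can be described to MPI with a single derived datatype, letting the library do the packing. The ranks and the send/receive pairing are illustrative.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double grid[8 * 8] = {0};            /* e.g., 8 rows of 8 doubles */
    MPI_Datatype vtype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* count = 8 blocks, blocklength = 3 doubles, stride = 8 doubles */
    MPI_Type_vector(8, 3, 8, MPI_DOUBLE, &vtype);
    MPI_Type_commit(&vtype);

    if (rank == 0)
        MPI_Send(grid, 1, vtype, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(grid, 1, vtype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&vtype);
    MPI_Finalize();
    return 0;
}

Inside the library the noncontiguous elements are still copied into a packing buffer; overlapping that copy with the wire transfer is the pipelining the bullet above refers to.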

Hand-tuning vs. Automated optimization

Nonuniformity and noncontiguity in data communication are inherent in several applications

Communicating unequal amounts of data to the different peer processes

Communicating data from noncontiguous memory locations

Previous research has primarily focused on uniform and contiguous data communication

Accordingly, applications and libraries have attempted hand-tuning to convert communication formats:

Manually packing noncontiguous data

Re-implementing collective operations in the application
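A minimal sketch of the manual packing the slide mentions (the sizes, the column index, and the two-rank exchange are illustrative): the application copies the strided elements into a scratch buffer itself and ships one contiguous message, rather than describing the layout to MPI with a derived datatype.

#include <mpi.h>

#define ROWS 8
#define COLS 8

int main(int argc, char **argv)
{
    int rank;
    double grid[ROWS][COLS] = {{0}};
    double packed[ROWS];                     /* one column, packed by hand */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < ROWS; i++)       /* manual pack of column 2 */
            packed[i] = grid[i][2];
        MPI_Send(packed, ROWS, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(packed, ROWS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < ROWS; i++)       /* manual unpack */
            grid[i][2] = packed[i];
    }

    MPI_Finalize();
    return 0;
}

Hand-packing like this forfeits the overlap of packing and communication that the MPI datatype engine can provide, which is part of the motivation for the automated optimizations in this work.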

Non-contiguous Communication in MPI

MPI Derived Datatypes

Common approach for non-contiguous communication

Application describes noncontiguous data layout to MPI

Data is either packed into contiguous memory (sparse layouts) or sent as independent segments (dense layouts)

Pipelining of packing and communication improves performance, but requires context information!

[Figure: non-contiguous data layout packed piecewise into a packing buffer, with the context saved between partial packs before the data is sent]

Issues with Non-contiguous Communication

Current approach is simple and works as long as there is a single parse on the noncontiguous data

More intelligent algorithms might suffer:

E.g., looking up upcoming datatype content to pre-decide which algorithm to use

Multiple parses on the datatype lose the context!

Searching for the lost context every time requires quadratically increasing time with datatype size

PETSc non-contiguous communication suffers from such high search times

MPI-level Evaluation

Experimental Results

MPI-level Micro-benchmarks

Non-contiguous data communication time

Non-uniform collective communication

Allgatherv Operation

Alltoallw Operation

PETSc Vector Scatter Benchmark

Performs communication only

3-D Laplacian Multigrid Solver Application

Partial differential equation solver

Utilizes PETSc numerical solver operations