Presentation Transcript

Slide 1

Improving the maximum attainable accuracy of communication-avoiding Krylov subspace methods

Erin Carson and James Demmel
Householder Symposium XIX
June 8-13, 2014, Spa, Belgium

Slide 2

Model Problem: 2D Poisson on a square grid. Equilibration (diagonal scaling) used; the right-hand side is set s.t. the elements of the true solution are prescribed.

Roundoff error can cause a decrease in the attainable accuracy of Krylov subspace methods.

This effect can be worse for "communication-avoiding" (or $s$-step) Krylov subspace methods, which limits their practical applicability!

The residual replacement strategy of van der Vorst and Ye (1999) improves attainable accuracy for classical Krylov methods.

We extend this strategy to communication-avoiding variants. Accuracy can be improved for minimal performance cost!
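For concreteness, a minimal NumPy/SciPy sketch of such a model problem (the grid size here is a placeholder, not the value used in the experiments):

```python
import numpy as np
import scipy.sparse as sp

def poisson_2d(n):
    """Standard 5-point 2D Poisson matrix on an n-by-n grid."""
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

n = 32                              # hypothetical grid size
A = poisson_2d(n)

# Equilibration (diagonal scaling): D^{-1/2} A D^{-1/2} with D = diag(A).
Dinv = sp.diags(1.0 / np.sqrt(A.diagonal()))
A_eq = Dinv @ A @ Dinv

# Right-hand side chosen so that the true solution is known exactly.
x_true = np.ones(A_eq.shape[0])
b = A_eq @ x_true
```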

Slide 3

Communication bottleneck in KSMs

A Krylov subspace method is a projection process onto the Krylov subspace $\mathcal{K}_m(A, r_0) = \operatorname{span}\{r_0, Ar_0, \ldots, A^{m-1}r_0\}$, orthogonal to some $\mathcal{L}_m$. For linear systems, the approximate solution $x_m$ to $Ax = b$ is formed by imposing $x_m \in x_0 + \mathcal{K}_m(A, r_0)$ and $r_m \perp \mathcal{L}_m$, where $r_m = b - Ax_m$.

Projection process in each iteration:

1. Add a dimension to $\mathcal{K}_m$: sparse matrix-vector multiplication (SpMV).
   Parallel: communicate vector entries with neighbors.
   Sequential: read $A$/vectors from slow memory.

2. Orthogonalize with respect to $\mathcal{L}_m$: inner products.
   Parallel: global reduction.
   Sequential: multiple reads/writes to slow memory.

Dependencies between the communication-bound kernels (SpMV, orthogonalize) in each iteration limit performance!

Slide 4

Example: Classical conjugate gradient (CG) for solving $Ax = b$:

Let $r_0 = b - Ax_0$, $p_0 = r_0$.
for $m = 0, 1, \ldots$ until convergence do
    $\alpha_m = (r_m^T r_m)/(p_m^T A p_m)$
    $x_{m+1} = x_m + \alpha_m p_m$
    $r_{m+1} = r_m - \alpha_m A p_m$
    $\beta_{m+1} = (r_{m+1}^T r_{m+1})/(r_m^T r_m)$
    $p_{m+1} = r_{m+1} + \beta_{m+1} p_m$
end for

SpMVs and inner products require communication in each iteration!
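A short, self-contained NumPy sketch of the CG loop above (my own illustration, not the authors' code), tracking both the updated residual $r_m$ and the true residual $b - Ax_m$ so their deviation can be observed:

```python
import numpy as np

def cg(A, b, x0, tol=1e-12, maxit=1000):
    """Classical CG; returns the iterate and (updated, true) residual norm history."""
    x = x0.copy()
    r = b - A @ x                      # updated (recursively computed) residual
    p = r.copy()
    rr = r @ r
    history = []
    for _ in range(maxit):
        Ap = A @ p                     # SpMV: communication-bound kernel #1
        alpha = rr / (p @ Ap)          # inner products: global reduction in parallel
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        history.append((np.sqrt(rr_new), np.linalg.norm(b - A @ x)))
        if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x, history
```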

Slide 5

Communication-avoiding KSMs (CA-KSMs)

Krylov methods can be reorganized to reduce communication cost by a factor of $O(s)$. Many CA-KSMs (or $s$-step KSMs) have been derived in the literature: CG, GMRES, Orthomin, MINRES, Lanczos, Arnoldi, CGS, Orthodir, BICG, BICGSTAB, QMR.

Related references: (Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel, 1991), (Erhel, 1995), (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kinkaid, 2001), (Hoemmen, 2010), (Philippe and Reichel, 2012), (Carson, Knight, Demmel, 2013), (Feuerriegel and Bücker, 2013).

Reduction in communication can translate to speedups on practical problems. Recent results: speedup with CA-BICGSTAB on a GMG bottom-solve (on a Cray XE6) (Williams et al., 2014).

Slide 6

CA-CG overview

Starting at iteration $sk$ for $k \geq 0$, it can be shown that the next $s$ steps (iterations $sk + j$, where $1 \leq j \leq s$) remain in a low-dimensional subspace that can be computed up front.

1. Compute the basis matrix $Y_k$, of dimension $n$-by-$(2s+1)$, giving the recurrence relation $A\underline{Y}_k = Y_k B_k$, where $B_k$ is $(2s+1)$-by-$(2s+1)$ and is 2x2 block diagonal with upper Hessenberg blocks, and $\underline{Y}_k$ is $Y_k$ with columns $s+1$ and $2s+1$ set to 0. Represent the length-$n$ iterates by length-$(2s+1)$ coordinates in $Y_k$.

Communication cost: same latency cost as one SpMV using the matrix powers kernel (e.g., Hoemmen et al., 2007), assuming sufficient sparsity structure (low diameter).

2. Compute the Gram matrix $G_k = Y_k^T Y_k$, of dimension $(2s+1)$-by-$(2s+1)$.

Communication cost: one reduction.
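A sketch of this outer-loop setup, assuming a monomial basis (the names krylov_basis and ca_setup, and the [P, R] column layout, are my reconstruction from the dimensions above, not the authors' code):

```python
import numpy as np

def krylov_basis(A, v, k):
    """Monomial basis [v, A v, ..., A^k v] as columns (simplest choice;
    in a CA-KSM this is produced by the matrix powers kernel)."""
    V = np.zeros((v.shape[0], k + 1))
    V[:, 0] = v
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]
    return V

def ca_setup(A, p, r, s):
    """One outer-loop setup: n-by-(2s+1) basis and its (2s+1)-by-(2s+1) Gram matrix."""
    Y = np.hstack([krylov_basis(A, p, s), krylov_basis(A, r, s - 1)])
    G = Y.T @ Y          # one global reduction in a parallel implementation
    return Y, G
```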

Slide 7

CG vs. CA-CG

In CA-CG, each outer loop computes the basis $Y_k$ (such that $A\underline{Y}_k = Y_k B_k$) and the Gram matrix $G_k = Y_k^T Y_k$, then performs $s$ iterations on short coordinate vectors:

for $j = 1, \ldots, s$ do
    $\alpha_{sk+j-1} = (r_{j-1}'^T G_k r_{j-1}') / (p_{j-1}'^T G_k B_k p_{j-1}')$
    $x_j' = x_{j-1}' + \alpha_{sk+j-1} p_{j-1}'$
    $r_j' = r_{j-1}' - \alpha_{sk+j-1} B_k p_{j-1}'$
    $\beta_{sk+j-1} = (r_j'^T G_k r_j') / (r_{j-1}'^T G_k r_{j-1}')$
    $p_j' = r_j' + \beta_{sk+j-1} p_{j-1}'$
end for

No communication in the inner loop!

Slide 8

CA-KSMs in finite precision

CA-KSMs are mathematically equivalent to classical KSMs, but have different behavior in finite precision! Roundoff errors have two discernible effects:

1. Loss of attainable accuracy (due to deviation of the true residual $b - Ax_m$ and the updated residual $r_m$).
2. Delay of convergence (due to perturbation of the Lanczos recurrence).

Both are generally worse with increasing $s$, and both are obstacles to solving practical problems:

- Decrease in attainable accuracy: some problems that the KSM can solve can't be solved with the CA variant.
- Delay of convergence: if the number of iterations increases more than the time per iteration decreases due to CA techniques, no speedup can be expected!

Slide 9

Maximum attainable accuracy of CG

In classical CG, iterates are updated by
$x_{m+1} = x_m + \alpha_m p_m$  and  $r_{m+1} = r_m - \alpha_m A p_m$.

The formulas for $x_{m+1}$ and $r_{m+1}$ do not depend on each other, so rounding errors cause the true residual, $b - Ax_m$, and the updated residual, $r_m$, to deviate. The size of the deviation, $\delta_m = b - Ax_m - r_m$, determines the maximum attainable accuracy:

Write the true residual as $b - Ax_m = r_m + \delta_m$. Then its size is bounded by
$\|b - Ax_m\| \leq \|r_m\| + \|\delta_m\|$.

When $\|r_m\| \gg \|\delta_m\|$, $\|b - Ax_m\|$ and $\|r_m\|$ have similar magnitude. When $\|r_m\| \to 0$, $\|b - Ax_m\|$ depends on $\|\delta_m\|$.

Many results on attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997), Sleijpen, van der Vorst and Fokkema (1994), Sleijpen, van der Vorst and Modersitzki (2001), Björck, Elfving and Strakoš (1998), and Gutknecht and Strakoš (2000).
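A quick demonstration of the resulting plateau (assumed setup, reusing the cg and poisson_2d sketches above): run CG far past convergence and compare the final updated and true residual norms.

```python
import numpy as np

rng = np.random.default_rng(0)
A = poisson_2d(32).toarray()           # small dense copy for the demo
x_star = rng.standard_normal(A.shape[0])
b = A @ x_star

x, hist = cg(A, b, np.zeros_like(b), tol=0.0, maxit=400)
upd, true_res = hist[-1]
print(f"updated residual: {upd:.3e}   true residual: {true_res:.3e}")
# The updated residual keeps shrinking, while the true residual stagnates
# at a level governed by ||delta_m||.
```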

Slide 10

Example: Comparison of convergence of the true and updated residuals for CG vs. CA-CG using a monomial basis, for various $s$ values, on the model problem (2D Poisson on a square grid). [Convergence plots.]

Slide 11

Better-conditioned polynomial bases can be used instead of the monomial basis. Two common choices are Newton and Chebyshev; see, e.g., (Philippe and Reichel, 2012).

- Behavior is closer to that of classical CG.
- But some loss of attainable accuracy compared to CG can still be seen, more clearly for higher $s$ values.
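For illustration, a sketch of a Newton-basis generator (shift selection, e.g. Leja-ordered Ritz values, is a separate concern and is assumed given here):

```python
import numpy as np

def newton_basis(A, v, shifts):
    """Newton basis: v, (A - t_1 I)v, (A - t_2 I)(A - t_1 I)v, ...
    Better conditioned than the monomial basis when the shifts t_i
    approximate eigenvalues of A."""
    V = np.zeros((v.shape[0], len(shifts) + 1))
    V[:, 0] = v
    for j, t in enumerate(shifts):
        V[:, j + 1] = A @ V[:, j] - t * V[:, j]
    return V
```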

Slide 12

Residual replacement strategy for CG

Van der Vorst and Ye (1999): improve accuracy by replacing the updated residual by the true residual $b - Ax_m$ in certain iterations, combined with a group update of the solution. Choose when to replace so as to meet two constraints:

1. Replace often enough that at termination the deviation is small relative to the residual norm.
2. Don't replace so often that the original convergence mechanism of the updated residuals is destroyed (avoid large perturbations to the finite precision CG recurrence).

Use a computable bound for the deviation to update an error estimate $d_m$ in each iteration. If the replacement condition holds, perform a group update of the solution and set the residual to the explicitly computed true residual.

If the updated residual converges to $O(\varepsilon)\|A\|\|x\|$, the true residual is reduced to $O(\varepsilon)\|A\|\|x\|$.
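A hedged sketch of CG with group update and residual replacement in the spirit of van der Vorst and Ye (1999). The deviation estimate d below uses a simplified update rule (a stand-in for the paper's exact recurrence), and the tolerance sqrt(eps) follows their suggested choice; dense A is assumed for the norm estimate.

```python
import numpy as np

def cg_rr(A, b, x0, tol=1e-14, maxit=2000):
    """CG with group update + residual replacement (simplified illustration)."""
    eps = np.finfo(float).eps
    eps_tol = np.sqrt(eps)                 # replacement tolerance
    normA = np.linalg.norm(A, 1)           # cheap norm estimate (dense A assumed)
    z = x0.copy()                          # group-updated part of the solution
    x = np.zeros_like(b)                   # correction since the last replacement
    r = b - A @ z
    p = r.copy()
    d = eps * (np.linalg.norm(b) + normA * np.linalg.norm(z))
    rr = r @ r
    for _ in range(maxit):
        d_prev, rr_prev = d, rr
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr = r @ r
        d += eps * (normA * np.linalg.norm(x) + np.sqrt(rr))   # simplified estimate
        if d_prev <= eps_tol * np.sqrt(rr_prev) and d > eps_tol * np.sqrt(rr):
            z += x                         # group update
            x[:] = 0.0
            r = b - A @ z                  # replace with the true residual
            rr = r @ r
            d = eps * (np.linalg.norm(b) + normA * np.linalg.norm(z))
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
        p = r + (rr / rr_prev) * p
    return z + x
```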

Slide 13

Sources of roundoff error in CA-CG

1. Error in computing the $s$-step Krylov basis.
2. Error in updating the coordinate vectors in the inner loop.
3. Error in the basis change (recovering the CG vectors for use in the next outer loop).

Slide 14

Maximum attainable accuracy of CA-CG

We can write the deviation of the true and updated residuals in terms of these errors. Using standard rounding error results, this allows us to obtain an upper bound on the deviation $\|b - Ax_m - r_m\|$.

Slide 15

A computable bound

We extend van der Vorst and Ye's residual replacement strategy to CA-CG, making use of the bound for the deviation in CA-CG to update the error estimate $d_{sk+j}$ in each iteration (with a separate update in the last inner iteration of each outer loop).

Costs of the terms in the estimate:
- Some are estimated only once.
- Some cost extra flops per $s$ iterations, with no communication.
- Some cost extra flops per $s$ iterations and 1 reduction per $s$ iterations.
- All are lower-order terms in CA-CG.

Residual replacement does not asymptotically increase communication or computation!

Slide 16

Residual replacement for CA-CG

Use the same replacement condition as van der Vorst and Ye (1999):

$d_{sk+j-1} \leq \hat{\varepsilon}\|r_{sk+j-1}\|$  and  $d_{sk+j} > \hat{\varepsilon}\|r_{sk+j}\|$,

where $\hat{\varepsilon}$ is a tolerance parameter, and the estimate $d_0$ is initialized from the starting residual. Pseudo-code for residual replacement with group update for CA-CG:

if the replacement condition holds,
    perform the group update, compute the true residual,
    break from inner loop and begin new outer loop
end

Slide 17

Slide 18

Residual Replacement Indices

       | CACG Mono.          | CACG Newt. | CACG Cheb. | CG
  s=4  | 354                 | 353        | 365        | 355
  s=8  | 224, 334, 401, 517  | 340        | 353        |
  s=12 | 135, 2119           | 326        | 346        |

The number of replacements is small compared to the total number of iterations, so residual replacement doesn't significantly affect the communication savings! In addition to attainable accuracy, the convergence rate is incredibly important in practical implementations, and the convergence rate depends on the basis:

Total Number of Reductions

       | CACG Mono. | CACG Newt. | CACG Cheb. | CG
  s=4  | 203        | 196        | 197        | 669
  s=8  | 157        | 102        | 99         |
  s=12 | 557        | 68         | 71         |

Can have both speed and accuracy!

Slide 19

Current work: CA-Lanczos convergence analysis

Classic Lanczos rounding error result of Paige (1976): in finite precision, the computed Lanczos vectors and tridiagonal matrix satisfy the Lanczos recurrence up to perturbation terms whose norms are bounded by quantities of size $O(\varepsilon n)\|A\|$, and the local loss of orthogonality among the computed vectors is bounded in terms of the same quantities. These results form the basis for Paige's influential results in (Paige, 1980).

Slide 20

Current work: CA-Lanczos convergence analysis

For CA-Lanczos, we have analogous bounds in which the classical Lanczos values of $\varepsilon_0$ and $\varepsilon_1$ are multiplied by a factor depending on $\Gamma$, a measure of the conditioning of the computed $s$-step bases (vs. $\varepsilon_0, \varepsilon_1 = O(\varepsilon n)$ for classical Lanczos).

If the computed basis is numerically full rank for every outer loop, and if $\varepsilon_0$ remains sufficiently small, the results of (Paige, 1980) apply to CA-Lanczos (with these new values of $\varepsilon_0$ and $\varepsilon_1$). This confirms the empirically observed effect of basis conditioning on convergence.

Slide 21

Thank you!
Contact: erin@cs.berkeley.edu
http://www.cs.berkeley.edu/~ecc2z/

Slide 22

Extra Slides

Slide 23

Replacement Iterations

Newton: 240, 374
Chebyshev: 271, 384

Slide 24

Summary:

- Upper bound on the maximum attainable accuracy in communication-avoiding Krylov subspace methods in finite precision; applies to CA-CG and CA-BICG.
- Implicit residual replacement strategy for maintaining agreement between the residuals to within $O(\varepsilon)\|A\|\|x\|$.
- The strategy can be implemented within the method without asymptotically increasing communication or computation.

Slide 25

Slide 26

Slide 27

Total Iterations

       | CACG Mono. | CACG Newt. | CACG Cheb. | CG
  s=4  | 813        | 782        | 785        | 669
  s=8  | 1246       | 817        | 785        |
  s=12 | 6675       | 813        | 850        |

Slide 28

Total Number of Communication Steps

       | CACG Mono. | CACG Newt. | CACG Cheb. | CG
  s=4  | 203        | 196        | 197        | 669
  s=8  | 157        | 102        | 99         |
  s=12 | 557        | 68         | 71         |

Slide 29

Talk outline notes:

- Motivation: communication-avoiding, KSM bottleneck, how CA-KSMs work (1). Speedups, but numerical properties bad (2): attainable accuracy, convergence.
- Today: focus on attainable accuracy, but mention current work related to convergence. Include plots. In order to be practical, numerical properties can't negate CA benefits.
- Related work: CA-KSMs (1); bounds on attainable accuracy, residual replacement strategies, VdV and Ye (2).
- CA-CG derivation (2)
- Maximum attainable accuracy bound (1)
- Residual replacement strategy of VdV and Ye (1)
- Residual replacement strategy for CA-CG (1)
- How the bound can be computed cheaply (1)
- Plots (1)
- Current work (2): results of Paige and analogous results for the CA case

Slide 30

Improving Maximum Attainable Accuracy in CA-KSMs

Van der Vorst and Ye (1999): residual replacement used in combination with a group update of the solution vector to improve the maximum attainable accuracy. Given a computable upper bound for the deviation, replacement steps are chosen to satisfy two constraints:

(1) The deviation must not grow so large that the attainable accuracy is limited.
(2) Replacement must not perturb the Lanczos recurrence relation for the computed residuals so much that convergence deteriorates.

When the computed residual converges but the true residual deviates, the strategy reduces the true residual to the level of the computed residual.

We devise an analogous method for CA-KSMs. This requires determining a computable upper bound for the deviation of the residuals; in CA-KSMs, we can write the deviation of the true and computed residuals in terms of the roundoff errors made in the basis computation, the coordinate updates, and the basis change.

Slide 31

A Computable Bound

We can bound the deviation of the true and computed residuals in CA-CG with residual replacement by updating the estimate $d_{sk+j}$ in each iteration, with one update rule for the last inner iteration of an outer loop and another otherwise.

Slide 32

Residual Replacement Strategy

Based on the perturbed Lanczos recurrence (see, e.g., Van der Vorst and Ye, 1999), we should replace the residual when

$d_{sk+j-1} \leq \hat{\varepsilon}\|r_{sk+j-1}\|$  and  $d_{sk+j} > \hat{\varepsilon}\|r_{sk+j}\|$,

where $\hat{\varepsilon}$ is a tolerance parameter, chosen as $\hat{\varepsilon} = \sqrt{\varepsilon}$, and the estimate $d_0$ is initialized from the starting residual. Pseudo-code for residual replacement with group update for CA-CG:

if the replacement condition holds,
    perform the group update, compute the true residual,
    break from inner loop
end

Slide 33

Communication-avoiding KSMs

Krylov methods can be "reorganized" to reduce communication cost by a factor of $O(s)$. Main idea:

Outer loop $k$: 1 communication step.
- Expand the Krylov basis by $O(s)$ dimensions up front, stored in a matrix $Y_k$. Same cost as a single SpMV (for well-partitioned $A$) using the matrix powers kernel (see, e.g., Hoemmen et al., 2007).
- Compute the Gram matrix $G_k = Y_k^T Y_k$: one global reduction.

Inner loop $j$: $s$ computation steps.
- Update the short vectors of coordinates for the iterates in $Y_k$.
- No communication! Quantities are either local (parallel) or fit in fast memory (sequential).

Many CA-KSMs (or $s$-step KSMs) derived in the literature: (Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel, 1991), (Erhel, 1995), (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kinkaid, 2001), (Hoemmen, 2010).

Slide 34

CA-CG Derivation

Starting at iteration $sk$ for $k \geq 0$, it can be shown that for iterations $sk + j$, where $0 \leq j \leq s$, the iterates satisfy
$x_{sk+j} - x_{sk},\; r_{sk+j},\; p_{sk+j} \in \mathcal{K}_{s+1}(A, p_{sk}) + \mathcal{K}_s(A, r_{sk})$.

Define the basis matrices
$P_k = [\rho_0(A)p_{sk}, \ldots, \rho_s(A)p_{sk}]$, where $\operatorname{span}(P_k) = \mathcal{K}_{s+1}(A, p_{sk})$,
$R_k = [\rho_0(A)r_{sk}, \ldots, \rho_{s-1}(A)r_{sk}]$, where $\operatorname{span}(R_k) = \mathcal{K}_s(A, r_{sk})$,
and where $\rho_i$ is a polynomial of degree $i$ satisfying a three-term recurrence. This allows us to write each of $x_{sk+j} - x_{sk}$, $r_{sk+j}$, and $p_{sk+j}$ as $Y_k = [P_k, R_k]$ times a short coordinate vector.

Slide 35

CA-CG Derivation (continued)

Let $Y_k = [P_k, R_k]$, and let $\underline{Y}_k$ be $Y_k$ with columns $s+1$ and $2s+1$ set to 0, so that $A\underline{Y}_k = Y_k B_k$.

Computing $Y_k$: matrix powers kernel, same communication cost as 1 SpMV for well-partitioned $A$.
Computing $G_k = Y_k^T Y_k$: one global reduction.

Then we can represent the iterates by their coordinates in $Y_k$:
$x_{sk+j} - x_{sk} = Y_k x_j'$, $r_{sk+j} = Y_k r_j'$, $p_{sk+j} = Y_k p_j'$,
and multiplication by $A$ can be written as multiplication of the coordinates by $B_k$.

After computing $Y_k$ and $G_k$, for iterations $sk + j$ with $1 \leq j \leq s$:
- Inner products can be written via the Gram matrix, e.g., $r^T r = r'^T G_k r'$ and $p^T A p = p'^T G_k B_k p'$.
- Vector updates are done implicitly by updating the coordinates.

No communication for $s$ iterations! All small $(2s+1)$-dimensional quantities ($B_k$, $G_k$, coordinate vectors) are local / fit in cache.
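A hedged sketch of this coordinate-space inner loop (the names G, B, x_c, r_c, p_c follow the recurrences above; this is my reconstruction, not the authors' code):

```python
import numpy as np

def ca_cg_inner_loop(G, B, x_c, r_c, p_c, s):
    """Run s CG steps on (2s+1)-dimensional coordinate vectors.
    G = Y^T Y (Gram matrix); B satisfies A*Y' = Y*B for the basis Y.
    No communication: every quantity here is small and local."""
    rr = r_c @ (G @ r_c)                 # represents r^T r
    for _ in range(s):
        Bp = B @ p_c                     # represents A p in coordinates
        alpha = rr / (p_c @ (G @ Bp))    # represents (r^T r) / (p^T A p)
        x_c = x_c + alpha * p_c
        r_c = r_c - alpha * Bp
        rr_new = r_c @ (G @ r_c)
        p_c = r_c + (rr_new / rr) * p_c
        rr = rr_new
    return x_c, r_c, p_c
```

At the end of the outer loop, the full-length iterates are recovered as, e.g., $x_{sk+s} = x_{sk} + Y_k x_s'$.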

Slide 36

Example: CA-CG for solving $Ax = b$:

Let $r_0 = b - Ax_0$, $p_0 = r_0$.
for $k = 0, 1, \ldots$ until convergence do
    Compute $Y_k$ and $B_k$ (via the CA matrix powers kernel)
    Compute $G_k = Y_k^T Y_k$ (global reduction)
    Let $x_0' = 0$, and let $r_0'$, $p_0'$ be the coordinates of $r_{sk}$, $p_{sk}$ in $Y_k$
    for $j = 1, \ldots, s$ do
        update the coordinate vectors as in the inner loop above
    end for
    Recover $x_{sk+s}$, $r_{sk+s}$, $p_{sk+s}$ from their coordinates
end for

Local computations within the inner loop require no communication!

Slide 37

Related Work: $s$-step methods

Authors                       | KSM                    | Basis     | Precond?   | Mtx Pwrs? | TSQR?
Van Rosendale, 1983           | CG                     | Monomial  | Polynomial | No        | No
Leland, 1989                  | CG                     | Monomial  | Polynomial | No        | No
Walker, 1988                  | GMRES                  | Monomial  | None       | No        | No
Chronopoulos and Gear, 1989   | CG                     | Monomial  | None       | No        | No
Chronopoulos and Kim, 1990    | Orthomin, GMRES        | Monomial  | None       | No        | No
Chronopoulos, 1991            | MINRES                 | Monomial  | None       | No        | No
Kim and Chronopoulos, 1991    | Symm. Lanczos, Arnoldi | Monomial  | None       | No        | No
de Sturler, 1991              | GMRES                  | Chebyshev | None       | No        | No

Slide 38

Related Work: $s$-step methods (continued)

Authors                             | KSM              | Basis     | Precond?   | Mtx Pwrs? | TSQR?
Joubert and Carey, 1992             | GMRES            | Chebyshev | No         | Yes*      | No
Chronopoulos and Kim, 1992          | Nonsymm. Lanczos | Monomial  | No         | No        | No
Bai, Hu, and Reichel, 1991          | GMRES            | Newton    | No         | No        | No
Erhel, 1995                         | GMRES            | Newton    | No         | No        | No
de Sturler and van der Vorst, 2005  | GMRES            | Chebyshev | General    | No        | No
Toledo, 1995                        | CG               | Monomial  | Polynomial | Yes*      | No
Chronopoulos and Swanson, 1990      | CGR, Orthomin    | Monomial  | No         | No        | No
Chronopoulos and Kinkaid, 2001      | Orthodir         | Monomial  | No         | No        | No

Slide 39

(Repeat of Slide 3, "Communication bottleneck in KSMs.")

Slide 40

Residual replacement strategy for CG

Van der Vorst and Ye (1999): improve accuracy by replacing the updated residual by the true residual in certain iterations, combined with a group update.

Bounding the error in the residual update under the residual replacement scheme gives an upper bound for the deviation. We want to replace the updated residual with the true residual whenever the deviation has grown significantly since the last replacement step, and the deviation is not much larger than the residual norm (to avoid large perturbations to the finite precision CG recurrence).

Update $d_m$, an estimate of (and bound for) the deviation, in each iteration:
$d_m = d_{m-1} + \varepsilon\,(N\|A\|\,\|x_m\| + \|r_m\|)$,
where $N$ is the maximal number of non-zeros per row of $A$.

If the replacement condition holds, perform the group update and set the residual to the explicitly computed true residual. If the updated residual converges to $O(\varepsilon)\|A\|\|x\|$, the true residual is reduced to $O(\varepsilon)\|A\|\|x\|$.

Slide 41

Communication and computation costs

We can compute the terms needed in the calculation of the error estimate as follows:

- Norms of iterates can be estimated using the Gram matrix: $\|Y_k z\|_2 = (z^T G_k z)^{1/2}$, where $G_k$ is available from a previous iteration. No additional communication is needed, and the additional computation cost is a lower-order term per outer loop.
- $\|A\|$ is estimated only once, at the start of the algorithm.
- Quantities computed once per outer loop require no communication.
- If we compute the extra terms at the start of each outer loop, the quantities needed in each inner loop can be computed in the 2-norm for a lower-order amount of additional communication per outer loop.
- If the extra terms can be computed in the same reduction step used to compute $G_k$, the latency cost is not affected (otherwise, 1 extra message).

All are lower-order terms in the context of the CA-CG algorithm: the residual replacement strategy does not asymptotically increase communication or computation costs!
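A small sketch of the Gram-matrix trick above: once $G = Y^T Y$ is available, the 2-norm of any vector represented in the basis $Y$ costs only a small local computation, with no further communication (illustrative names, not the authors' code):

```python
import numpy as np

def norm_via_gram(G, z):
    """||Y z||_2 = sqrt(z^T (Y^T Y) z) = sqrt(z^T G z), using only the
    small (2s+1)-by-(2s+1) Gram matrix G -- no communication needed."""
    return np.sqrt(z @ (G @ z))

# Sanity check with random data (n >> 2s+1):
rng = np.random.default_rng(1)
Y = rng.standard_normal((1000, 9))   # e.g., s = 4 gives 2s+1 = 9 columns
G = Y.T @ Y                          # one global reduction in parallel
z = rng.standard_normal(9)
assert np.isclose(norm_via_gram(G, z), np.linalg.norm(Y @ z))
```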