Improving the maximum attainable accuracy of communication-avoiding Krylov subspace methods

Erin Carson and James Demmel
Householder Symposium XIX, June 8-13, 2014, Spa, Belgium
Model problem: 2D Poisson on an $n \times n$ grid. Equilibration (diagonal scaling) used; the right-hand side is set from a prescribed true solution.

Roundoff error can cause a decrease in the attainable accuracy of Krylov subspace methods. This effect can be worse for "communication-avoiding" (or $s$-step) Krylov subspace methods, which limits their practical applicability!

The residual replacement strategy of van der Vorst and Ye (1999) improves attainable accuracy for classical Krylov methods. We extend this strategy to communication-avoiding variants: accuracy can be improved for minimal performance cost!
Communication bottleneck in KSMs

A Krylov subspace method is a projection process onto the Krylov subspace $\mathcal{K}_m(A, r_0) = \operatorname{span}\{r_0, Ar_0, \dots, A^{m-1}r_0\}$ orthogonal to some $\mathcal{L}_m$. For linear systems, it computes an approximate solution $x_m$ to $Ax = b$ by imposing $x_m \in x_0 + \mathcal{K}_m(A, r_0)$ and $r_m \perp \mathcal{L}_m$, where $r_m = b - Ax_m$.

The projection process in each iteration:
1. Add a dimension to $\mathcal{K}_m$: sparse matrix-vector multiplication (SpMV).
   Parallel: communicate vector entries with neighbors. Sequential: read $A$ and vectors from slow memory.
2. Orthogonalize with respect to $\mathcal{L}_m$: inner products.
   Parallel: global reduction. Sequential: multiple reads/writes to slow memory.

Dependencies between these communication-bound kernels (SpMV, orthogonalize) in each iteration limit performance!
Example: Classical conjugate gradient (CG) for solving $Ax = b$

Let $r_0 = b - Ax_0$, $p_0 = r_0$.
for $m = 0, 1, \dots$, until convergence do
    $\alpha_m = (r_m^T r_m) / (p_m^T A p_m)$
    $x_{m+1} = x_m + \alpha_m p_m$
    $r_{m+1} = r_m - \alpha_m A p_m$
    $\beta_{m+1} = (r_{m+1}^T r_{m+1}) / (r_m^T r_m)$
    $p_{m+1} = r_{m+1} + \beta_{m+1} p_m$
end for

SpMVs and inner products require communication in each iteration!
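To make the communication pattern concrete, here is a minimal CG sketch in Python/NumPy (our own illustration, not code from the talk): each iteration performs one SpMV (`A @ p`) and two inner products, exactly the kernels whose communication costs are described above.

```python
import numpy as np

def cg(A, b, x0, tol=1e-12, maxiter=1000):
    """Classical conjugate gradient for symmetric positive definite A."""
    x = x0.copy()
    r = b - A @ x                      # initial residual
    p = r.copy()
    rr = r @ r
    for _ in range(maxiter):
        Ap = A @ p                     # SpMV: neighbor communication
        alpha = rr / (p @ Ap)          # inner product: global reduction
        x = x + alpha * p
        r = r - alpha * Ap             # updated (recursively computed) residual
        rr_new = r @ r                 # inner product: global reduction
        if np.sqrt(rr_new) <= tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```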
Communication-avoiding CA-KSMs

Krylov methods can be reorganized to reduce communication cost by a factor of $O(s)$. Many CA-KSMs (or $s$-step KSMs) have been derived in the literature: CG, GMRES, Orthomin, MINRES, Lanczos, Arnoldi, CGS, Orthodir, BICG, BICGSTAB, QMR.

Related references: (Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel, 1991), (Erhel, 1995), (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kinkaid, 2001), (Hoemmen, 2010), (Philippe and Reichel, 2012), (Carson, Knight, Demmel, 2013), (Feuerriegel and Bücker, 2013).

The reduction in communication can translate to speedups on practical problems. Recent results: speedups for CA-BICGSTAB used as the geometric multigrid (GMG) bottom-solve on a Cray XE6 (Williams et al., 2014).
CA-CG overview

Starting at iteration $sk$ for $k = 0, 1, \dots$, it can be shown that to compute the next $s$ steps (iterations $sk + j$, where $j \in \{1, \dots, s+1\}$), the iterates remain in a small Krylov subspace generated from $p_{sk}$ and $r_{sk}$.

1. Compute the basis matrix $Y_k$ of dimension $n$-by-$(2s+1)$, giving the recurrence relation $A\underline{Y}_k = Y_k B_k$, where $B_k$ is $(2s+1)$-by-$(2s+1)$ and is 2x2 block diagonal with upper Hessenberg blocks, and $\underline{Y}_k$ is $Y_k$ with columns $s+1$ and $2s+1$ set to 0. Represent length-$n$ iterates by length-$(2s+1)$ coordinates in $Y_k$.
Communication cost: same latency cost as one SpMV using the matrix powers kernel (e.g., Hoemmen et al., 2007), assuming sufficient sparsity structure (low diameter).

2. Compute the Gram matrix $G_k = Y_k^T Y_k$ of dimension $(2s+1)$-by-$(2s+1)$.
Communication cost: one reduction.
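As a rough illustration of these two kernels, the following sketch builds a monomial $s$-step basis and its Gram matrix with NumPy. The function names and the plain monomial basis are our assumptions for simplicity; an actual CA implementation would use the matrix powers kernel (one latency cost) and possibly a better-conditioned basis.

```python
import numpy as np

def monomial_basis(A, v, k):
    """Columns [v, Av, ..., A^{k-1} v]. In a CA implementation this is the
    matrix powers kernel, computed for one latency cost given enough
    sparsity structure."""
    V = np.empty((v.shape[0], k))
    V[:, 0] = v
    for j in range(1, k):
        V[:, j] = A @ V[:, j - 1]
    return V

def build_Y_and_G(A, p, r, s):
    Y = np.hstack([monomial_basis(A, p, s + 1),   # P_k: n-by-(s+1)
                   monomial_basis(A, r, s)])      # R_k: n-by-s
    G = Y.T @ Y                                   # Gram matrix: one reduction
    return Y, G
```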
CG vs. CA-CG

CG: for $m = 0, 1, \dots$ do [one SpMV and inner products in every iteration] end for.

CA-CG: for $k = 0, 1, \dots$ do
    Compute $Y_k$ such that $A\underline{Y}_k = Y_k B_k$; compute $G_k = Y_k^T Y_k$
    for $j = 1, \dots, s$ do [update the coordinate vectors] end for
end for

No communication in the inner loop!
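The reason the inner loop needs no communication: once $G_k$ is known, inner products of the length-$n$ iterates reduce to small $(2s+1)$-dimensional operations. A one-line sketch in our notation: if $u = Y u_c$ and $w = Y w_c$, then $u^T w = u_c^T G w_c$.

```python
import numpy as np

def coord_dot(G, u_c, w_c):
    """Inner product of Y @ u_c and Y @ w_c via the precomputed Gram matrix:
    (Y u_c)^T (Y w_c) = u_c^T (Y^T Y) w_c = u_c^T G w_c."""
    return u_c @ (G @ w_c)
```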
CA-KSMs in finite precision

CA-KSMs are mathematically equivalent to classical KSMs, but have different behavior in finite precision! Roundoff errors have two discernible effects:
- Loss of attainable accuracy (due to deviation of the true residual $b - Ax_m$ and the updated residual $r_m$)
- Delay of convergence (due to perturbation of the Lanczos recurrence)

Both effects are generally worse with increasing $s$, and they are an obstacle to solving practical problems:
- Decrease in attainable accuracy: some problems that the classical KSM can solve can't be solved with the CA variant.
- Delay of convergence: if the number of iterations increases by more than the time per iteration decreases due to CA techniques, no speedup can be expected!
Maximum attainable accuracy of CG

In classical CG, iterates are updated by $x_{m+1} = x_m + \alpha_m p_m$ and $r_{m+1} = r_m - \alpha_m A p_m$. The formulas for $x_{m+1}$ and $r_{m+1}$ do not depend on each other, so rounding errors cause the true residual, $b - Ax_m$, and the updated residual, $r_m$, to deviate. The size of the deviation, $\|b - Ax_m - r_m\|$, determines the maximum attainable accuracy:

Write the true residual as $b - Ax_m = r_m + (b - Ax_m - r_m)$. Then the size of the true residual is bounded by $\|b - Ax_m\| \le \|r_m\| + \|b - Ax_m - r_m\|$.

- When $\|r_m\| \gg \|b - Ax_m - r_m\|$, the true and updated residuals have similar magnitude.
- When $\|r_m\| \to 0$, $\|b - Ax_m\|$ depends on $\|b - Ax_m - r_m\|$.
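A small instrumented-CG sketch (our own, not from the talk) makes this observable: it records both $\|r_m\|$ and the deviation $\|b - Ax_m - r_m\|$, and on an ill-conditioned SPD matrix the true residual stagnates at the level of the deviation while the updated residual keeps decreasing.

```python
import numpy as np

def cg_with_deviation(A, b, x0, maxiter=500):
    """CG that tracks (||r_m||, ||b - A x_m - r_m||) per iteration."""
    x, r = x0.copy(), b - A @ x0
    p, rr = r.copy(), r @ r
    history = []
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap                       # updated residual (recurrence)
        true_r = b - A @ x                    # true residual (recomputed)
        history.append((np.linalg.norm(r),
                        np.linalg.norm(true_r - r)))   # the deviation
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x, history
```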
Many results on attainable accuracy exist, e.g.: Greenbaum (1989, 1994, 1997); Sleijpen, van der Vorst and Fokkema (1994); Sleijpen, van der Vorst and Modersitzki (2001); Björck, Elfving and Strakoš (1998); Gutknecht and Strakoš (2000).
[Figure: comparison of convergence of the true and updated residuals for CG vs. CA-CG using a monomial basis, for various $s$ values; model problem (2D Poisson on an $n \times n$ grid).]
Better-conditioned polynomial bases can be used instead of the monomial basis. Two common choices are Newton and Chebyshev; see, e.g., (Philippe and Reichel, 2012). With these bases, behavior is closer to that of classical CG, but some loss of attainable accuracy compared to CG can still be seen, and it becomes clearer for higher $s$ values.
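As an illustration, a shifted-and-scaled Chebyshev basis can be generated with the standard three-term recurrence. This sketch assumes estimates `lmin` and `lmax` of the extreme eigenvalues of $A$ (the slides do not specify how the spectral interval would be obtained).

```python
import numpy as np

def chebyshev_basis(A, v, k, lmin, lmax):
    """Columns T_0(t(A))v, ..., T_{k-1}(t(A))v with t mapping [lmin, lmax]
    to [-1, 1], via T_{j+1}(x) = 2x T_j(x) - T_{j-1}(x)."""
    theta = (lmax + lmin) / 2.0   # center of the spectral interval
    delta = (lmax - lmin) / 2.0   # half-width
    V = np.empty((v.shape[0], k))
    V[:, 0] = v
    if k > 1:
        V[:, 1] = (A @ V[:, 0] - theta * V[:, 0]) / delta
    for j in range(2, k):
        V[:, j] = 2.0 * (A @ V[:, j - 1] - theta * V[:, j - 1]) / delta - V[:, j - 2]
    return V
```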
Residual replacement strategy for CG

van der Vorst and Ye (1999): improve accuracy by replacing the updated residual $r_m$ by the true residual $b - Ax_m$ in certain iterations, combined with a group update of the solution. Choose when to replace so as to meet two constraints:
1. Replace often enough so that at termination, the deviation $\|b - Ax_m - r_m\|$ is small relative to the residual level.
2. Don't replace so often that the original convergence mechanism of the updated residuals is destroyed (avoid large perturbations to the finite precision CG recurrence).

A computable bound $d_m$ for $\|b - Ax_m - r_m\|$ is used to update an error estimate in each iteration. If $d_{m-1} \le \hat\varepsilon \|r_{m-1}\|$ and $d_m > \hat\varepsilon \|r_m\|$, the solution is accumulated (group update), the updated residual is replaced by the true residual, and $d_m$ is reset.

If the updated residual converges to $O(\varepsilon)\|A\|\|x\|$, the true residual is reduced to $O(\varepsilon)\|A\|\|x\|$.
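A sketch of the replacement step in Python (our own rendering of the logic above). Here `d_m` and `d_prev` stand for the computable bound in consecutive iterations, `eps_hat` is the tolerance parameter, and the reset of the bound after replacement is simplified to zero, ignoring the small error of evaluating `b - A @ z`.

```python
import numpy as np

def maybe_replace(A, b, x, z, r, d_m, d_prev, rnorm, rnorm_prev, eps_hat):
    """Group update + residual replacement when d crosses eps_hat * ||r||."""
    if d_prev <= eps_hat * rnorm_prev and d_m > eps_hat * rnorm:
        z = z + x                  # group update: fold x into accumulated solution
        x = np.zeros_like(x)       # continue CG on the correction equation
        r = b - A @ z              # replace updated residual by true residual
        d_m = 0.0                  # simplified reset of the deviation bound
    return x, z, r, d_m
```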
Sources of roundoff error in CA-CG

1. Error in computing the $s$-step Krylov basis (the basis change $A\underline{Y}_k = Y_k B_k$).
2. Error in updating the coordinate vectors in the inner loop (the length-$(2s+1)$ recurrences that replace the length-$n$ vector updates).
3. Error in recovering the CG vectors for use in the next outer loop (multiplications by $Y_k$).
Maximum attainable accuracy of CA-CG

We can write the deviation of the true and updated residuals in terms of these errors. Using standard rounding error results, this allows us to obtain an upper bound on $\|b - Ax_{sk+j} - r_{sk+j}\|$.
A computable bound

We extend van der Vorst and Ye's residual replacement strategy to CA-CG. Making use of the bound for $\|b - Ax_{sk+j} - r_{sk+j}\|$ in CA-CG, we update the error estimate $d_{sk+j}$ in each iteration (with one formula in the last inner iteration and another otherwise). The required quantities are cheap:
- One quantity is estimated only once.
- Most terms cost a small number of extra flops per $s$ iterations, with no communication.
- The remaining terms cost a small number of extra flops and 1 reduction per $s$ iterations.

All are lower order terms in CA-CG: residual replacement does not asymptotically increase communication or computation!
Residual replacement for CA-CG

We use the same replacement condition as van der Vorst and Ye (1999): replace when $d_{sk+j-1} \le \hat\varepsilon \|r_{sk+j-1}\|$ and $d_{sk+j} > \hat\varepsilon \|r_{sk+j}\|$, where $\hat\varepsilon$ is a tolerance parameter, and we initially set $d_0$ to the rounding error of computing the initial residual.

Pseudo-code for residual replacement with group update for CA-CG: if the condition holds, perform the group update and replacement, then break from the inner loop and begin a new outer loop.
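A control-flow sketch of where the test sits in CA-CG (our own illustration): `d_vals[j]` and `r_norms[j]` stand in for the running bound $d_{sk+j}$ and the residual-norm estimates computed from the Gram matrix, both assumed to be produced by the inner-loop updates.

```python
def inner_loop_with_replacement(d_vals, r_norms, eps_hat, s):
    """Return the inner iteration j at which to replace, or None."""
    for j in range(1, s + 1):
        if d_vals[j - 1] <= eps_hat * r_norms[j - 1] and \
           d_vals[j] > eps_hat * r_norms[j]:
            return j   # break from inner loop; begin a new outer loop with a
                       # group update and a freshly computed true residual
    return None
```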
Residual Replacement Indices

|      | CACG Mono.         | CACG Newt. | CACG Cheb. | CG  |
|------|--------------------|------------|------------|-----|
| s=4  | 354                | 353        | 365        | 355 |
| s=8  | 224, 334, 401, 517 | 340        | 353        |     |
| s=12 | 135, 2119          | 326        | 346        |     |

The number of replacements is small compared to the total number of iterations, so residual replacement doesn't significantly affect the communication savings! We can have both speed and accuracy.

In addition to attainable accuracy, the convergence rate is critically important in practical implementations, and the convergence rate depends on the basis.

Total Number of Reductions

|      | CACG Mono. | CACG Newt. | CACG Cheb. | CG  |
|------|------------|------------|------------|-----|
| s=4  | 203        | 196        | 197        | 669 |
| s=8  | 157        | 102        | 99         |     |
| s=12 | 557        | 68         | 71         |     |
Current work: CA-Lanczos convergence analysis

The classic Lanczos rounding error result of Paige (1976) bounds the local error terms of the finite precision Lanczos recurrence in terms of quantities $\varepsilon_0$ and $\varepsilon_1$, which depend on the machine precision $\varepsilon$, the problem size, and $\|A\|$. These results form the basis for Paige's influential results in (Paige, 1980).
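For reference, a minimal Lanczos sketch (our own, not from the talk) that exposes the quantities this analysis concerns; measuring $\max |V_m^T V_m - I|$ after a run shows the rounding-error-driven loss of orthogonality Paige's results quantify.

```python
import numpy as np

def lanczos(A, v0, m):
    """Plain three-term Lanczos; assumes no breakdown (beta stays nonzero)."""
    n = v0.shape[0]
    V = np.zeros((n, m + 1))
    alpha, beta = np.zeros(m), np.zeros(m + 1)
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]
        if j > 0:
            w = w - beta[j] * V[:, j - 1]
        alpha[j] = w @ V[:, j]
        w = w - alpha[j] * V[:, j]
        beta[j + 1] = np.linalg.norm(w)
        V[:, j + 1] = w / beta[j + 1]
    return V, alpha, beta[1:]

# loss of orthogonality: np.abs(V.T @ V - np.eye(V.shape[1])).max()
```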
Current work: CA-Lanczos convergence analysis (continued)

Let $\Gamma$ measure the conditioning of the computed $s$-step bases $Y_k$. For CA-Lanczos, we obtain analogous bounds in which $\varepsilon_0$ and $\varepsilon_1$ are amplified by a factor depending on $\Gamma$ (vs. the corresponding values for classical Lanczos).

If $Y_k$ is numerically full rank for all $k$, and if the amplified $\varepsilon_0$ remains small enough, the results of (Paige, 1980) apply to CA-Lanczos (with these new values of $\varepsilon_0$ and $\varepsilon_1$). This confirms the empirically observed effect of basis conditioning on convergence.
Thank you!
contact: erin@cs.berkeley.edu
http://www.cs.berkeley.edu/~ecc2z/
Extra Slides
Replacement Iterations
Newton: 240, 374
Chebyshev: 271, 384
- Upper bound on the maximum attainable accuracy of communication-avoiding Krylov subspace methods in finite precision; applies to CA-CG and CA-BICG.
- Implicit residual replacement strategy for maintaining agreement between the true and updated residuals.
- The strategy can be implemented within the method without asymptotically increasing communication or computation.
Total Iterations

|      | CACG Mono. | CACG Newt. | CACG Cheb. | CG  |
|------|------------|------------|------------|-----|
| s=4  | 813        | 782        | 785        | 669 |
| s=8  | 1246       | 817        | 785        |     |
| s=12 | 6675       | 813        | 850        |     |
Total Number of Communication Steps

|      | CACG Mono. | CACG Newt. | CACG Cheb. | CG  |
|------|------------|------------|------------|-----|
| s=4  | 203        | 196        | 197        | 669 |
| s=8  | 157        | 102        | 99         |     |
| s=12 | 557        | 68         | 71         |     |
Talk outline (speaker notes)

- Motivation: communication-avoiding methods, the KSM bottleneck, how CA-KSMs work (1)
- Speedups, but bad numerical properties (2): attainable accuracy, convergence; today's focus is attainable accuracy, with mention of current work on convergence; include plots; to be practical, numerical properties can't negate the CA benefits
- Related work: CA-KSMs (1); bounds on attainable accuracy, residual replacement strategies, VdV and Ye (2)
- CA-CG derivation (2)
- Maximum attainable accuracy bound (1)
- Residual replacement strategy of VdV and Ye (1)
- Residual replacement strategy for CA-CG (1); how the bound can be computed cheaply (1); plots (1)
- Current work (2): results of Paige and analogous results for the CA case
Improving Maximum Attainable Accuracy in CA-KSMs

Van der Vorst and Ye (1999): residual replacement used in combination with a group update of the solution vector to improve the maximum attainable accuracy. Given a computable upper bound for the deviation, replacement steps are chosen to satisfy two constraints:
1. The deviation must not grow so large that attainable accuracy is limited.
2. Replacement must not perturb the Lanczos recurrence relation for the computed residuals so much that convergence deteriorates.

When the computed residual converges to the $O(\varepsilon)\|A\|\|x\|$ level but the true residual deviates, the strategy reduces the true residual to that level.

We devise an analogous method for CA-KSMs. This requires determining a computable upper bound for the deviation of the residuals; in CA-KSMs, we can write the deviation of the true and computed residuals in terms of the basis, coordinate-update, and recovery errors of each outer loop.
A Computable Bound

We can bound the deviation of the true and computed residuals in CA-CG with residual replacement in each iteration by updating the quantity $d_{sk+j}$, using one formula in the last inner iteration and another otherwise.
Residual Replacement Strategy

Based on the perturbed Lanczos recurrence (see, e.g., Van der Vorst and Ye, 1999), we should replace the residual when $d_{sk+j-1} \le \hat\varepsilon \|r_{sk+j-1}\|$ and $d_{sk+j} > \hat\varepsilon \|r_{sk+j}\|$, where $\hat\varepsilon$ is a tolerance parameter, and we initially set $d_0$ to the rounding error of computing the initial residual.

Pseudo-code for residual replacement with group update for CA-CG: if the condition holds, perform the group update and replacement, then break from the inner loop.
Communication-avoiding CA-KSMs

Krylov methods can be "reorganized" to reduce communication cost by a factor of $O(s)$. Main idea:

Outer loop k (1 communication step):
- Expand the Krylov basis by $O(s)$ dimensions up front, stored in a matrix $Y_k$. Same cost as a single SpMV (for well-partitioned $A$) using the matrix powers kernel (see, e.g., Hoemmen et al., 2007).
- Compute the Gram matrix $G_k = Y_k^T Y_k$: one global reduction.

Inner loop j (s computation steps):
- Update the $(2s+1)$-vectors of coordinates for $x_{sk+j}$, $r_{sk+j}$, and $p_{sk+j}$ in $Y_k$.
- No communication! Quantities are either local (parallel) or fit in fast memory (sequential).

Many CA-KSMs (or $s$-step KSMs) derived in the literature: (Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel, 1991), (Erhel, 1995), (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kinkaid, 2001), (Hoemmen, 2010).
CA-CG Derivation

Starting at iteration $sk$ for $k = 0, 1, \dots$, it can be shown that for iterations $sk + j$, where $j \in \{1, \dots, s+1\}$, the iterates $p_{sk+j}$, $r_{sk+j}$, and $x_{sk+j} - x_{sk}$ lie in $\mathcal{K}_{s+1}(A, p_{sk}) + \mathcal{K}_s(A, r_{sk})$.

Define the basis matrices $P_k = [\rho_0(A)p_{sk}, \dots, \rho_s(A)p_{sk}]$ and $R_k = [\rho_0(A)r_{sk}, \dots, \rho_{s-1}(A)r_{sk}]$, where $\rho_i$ is a polynomial of degree $i$ satisfying a three-term recurrence. This allows us to write the recurrence $A\underline{Y}_k = Y_k B_k$ for $Y_k = [P_k, R_k]$.
CA-CG Derivation (continued)

Let $x'_{k,j}$, $r'_{k,j}$, and $p'_{k,j}$ denote coordinate vectors. We can represent the iterates by their coordinates in $Y_k$: $x_{sk+j} - x_{sk} = Y_k x'_{k,j}$, $r_{sk+j} = Y_k r'_{k,j}$, $p_{sk+j} = Y_k p'_{k,j}$, and multiplication by $A$ can be written $A Y_k p'_{k,j} = Y_k B_k p'_{k,j}$ (using the zero structure of the coordinates).

Then, after computing $Y_k$ and $G_k = Y_k^T Y_k$, for iterations $j = 1, \dots, s$:
- Inner products can be written using $G_k$ (e.g., $r^T r = r'^T G_k r'$).
- Vector updates are done implicitly by updating coordinates.

No communication for $s$ iterations!
- All small $(2s+1)$-dimensional quantities: local / fit in cache.
- Computing $Y_k$: matrix powers kernel, same communication cost as 1 SpMV for well-partitioned $A$.
- Computing $G_k$: one global reduction.
Example: CA-CG for solving $Ax = b$

Let $r_0 = b - Ax_0$, $p_0 = r_0$.
for $k = 0, 1, \dots$, until convergence do
    Compute $Y_k$ and $B_k$ (via the CA matrix powers kernel)
    Compute $G_k = Y_k^T Y_k$ (global reduction)
    Let $p'_{k,0} = e_1$, $r'_{k,0} = e_{s+2}$, $x'_{k,0} = 0$
    for $j = 0, \dots, s-1$ do
        [update the coordinate vectors using $B_k$ and $G_k$]
    end for
    Recover $x_{s(k+1)}$, $r_{s(k+1)}$, $p_{s(k+1)}$ from the coordinates
end for

Local computations within the inner loop require no communication!
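Putting the pieces together, here is a runnable CA-CG sketch with a monomial basis (our own illustration following the structure above; a production implementation would use the matrix powers kernel, a better-conditioned basis, and the residual replacement strategy).

```python
import numpy as np

def ca_cg(A, b, x0, s=4, n_outer=200, tol=1e-10):
    n = b.shape[0]
    x = x0.copy()
    r = b - A @ x
    p = r.copy()

    # B_k for the monomial basis: shift operators within the P_k and R_k blocks.
    m = 2 * s + 1
    B = np.zeros((m, m))
    for j in range(s):
        B[j + 1, j] = 1.0              # A * (A^j p) = A^{j+1} p
    for j in range(s - 1):
        B[s + 2 + j, s + 1 + j] = 1.0  # A * (A^j r) = A^{j+1} r

    for _ in range(n_outer):
        # Basis Y_k = [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r]
        Y = np.empty((n, m))
        Y[:, 0] = p
        for j in range(1, s + 1):
            Y[:, j] = A @ Y[:, j - 1]
        Y[:, s + 1] = r
        for j in range(1, s):
            Y[:, s + 1 + j] = A @ Y[:, s + j]

        G = Y.T @ Y                    # Gram matrix: one global reduction

        # Coordinates of p, r, and x - x_{sk} in the basis Y_k.
        p_c = np.zeros(m); p_c[0] = 1.0
        r_c = np.zeros(m); r_c[s + 1] = 1.0
        x_c = np.zeros(m)

        for _ in range(s):             # inner loop: no communication
            v = B @ p_c                # coordinates of A @ (Y @ p_c)
            rGr = r_c @ G @ r_c
            alpha = rGr / (p_c @ G @ v)
            x_c += alpha * p_c
            r_c -= alpha * v
            beta = (r_c @ G @ r_c) / rGr
            p_c = r_c + beta * p_c

        x += Y @ x_c                   # recover iterates for next outer loop
        r = Y @ r_c
        p = Y @ p_c
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
    return x
```

In exact arithmetic this reproduces classical CG; in finite precision, as the earlier slides show, the monomial basis degrades attainable accuracy and convergence as $s$ grows.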
Related Work: s-step methods

| Authors | KSM | Basis | Precond? | Mtx Pwrs? | TSQR? |
|---|---|---|---|---|---|
| Van Rosendale, 1983 | CG | Monomial | Polynomial | No | No |
| Leland, 1989 | CG | Monomial | Polynomial | No | No |
| Walker, 1988 | GMRES | Monomial | None | No | No |
| Chronopoulos and Gear, 1989 | CG | Monomial | None | No | No |
| Chronopoulos and Kim, 1990 | Orthomin, GMRES | Monomial | None | No | No |
| Chronopoulos, 1991 | MINRES | Monomial | None | No | No |
| Kim and Chronopoulos, 1991 | Symm. Lanczos, Arnoldi | Monomial | None | No | No |
| de Sturler, 1991 | GMRES | Chebyshev | None | No | No |
Related Work: s-step methods (continued)

| Authors | KSM | Basis | Precond? | Mtx Pwrs? | TSQR? |
|---|---|---|---|---|---|
| Joubert and Carey, 1992 | GMRES | Chebyshev | No | Yes* | No |
| Chronopoulos and Kim, 1992 | Nonsymm. Lanczos | Monomial | No | No | No |
| Bai, Hu, and Reichel, 1991 | GMRES | Newton | No | No | No |
| Erhel, 1995 | GMRES | Newton | No | No | No |
| de Sturler and van der Vorst, 2005 | GMRES | Chebyshev | General | No | No |
| Toledo, 1995 | CG | Monomial | Polynomial | Yes* | No |
| Chronopoulos and Swanson, 1990 | CGR, Orthomin | Monomial | No | No | No |
| Chronopoulos and Kinkaid, 2001 | Orthodir | Monomial | No | No | No |
Residual replacement strategy for CG (details)

Van der Vorst and Ye (1999): improve accuracy by replacing the updated residual $r_m$ by the true residual $b - Ax_m$ in certain iterations, combined with a group update.

Bounding the error in the residual update within the replacement scheme gives an upper bound $d_m$ for $\|b - Ax_m - r_m\|$. We want to replace $r_m$ with $b - Ax_m$ whenever the deviation has grown significantly since the last replacement step, and $d_m$ is not much larger than $\hat\varepsilon \|r_m\|$ (to avoid large perturbations to the finite precision CG recurrence).

Update $d_m$, an estimate of (and bound for) the deviation, in each iteration, using quantities such as $\varepsilon$, $\|A\|$, $\|x_m\|$, $\|r_m\|$, and $N$, where $N$ is the maximal number of nonzeros per row of $A$.

If $d_{m-1} \le \hat\varepsilon \|r_{m-1}\|$ and $d_m > \hat\varepsilon \|r_m\|$: accumulate the solution (group update), replace the updated residual by the true residual computed from the accumulated solution, and reset $d_m$.

If the updated residual converges to $O(\varepsilon)\|A\|\|x\|$, the true residual is reduced to $O(\varepsilon)\|A\|\|x\|$.
Communication and computation costs

We can compute the terms needed in the calculation of $d_{sk+j}$ as follows:
- Norms of the iterates can be estimated using the Gram matrix, via $\|Y_k c\|_2^2 = c^T G_k c$, where $G_k$ is available from a previous iteration. No additional communication is needed, and the additional computation cost per outer loop is low order.
- If we compute the needed quantities at the start of each outer loop, we can evaluate the 2-norms in each inner loop for a small factor of additional communication per outer loop (a lower order term).
- If these can be computed in the same reduction step used to compute $G_k$, the latency cost is not affected (otherwise, 1 extra message).
- $\|A\|$ is estimated only once, at the start of the algorithm.
- The remaining quantities are computed once per outer loop, with no communication.

All of these are lower order terms in the context of the CA-CG algorithm: the residual replacement strategy does not asymptotically increase communication or computation costs!
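The Gram-matrix trick in one line (our notation): for any coordinate vector $c$, $\|Y_k c\|_2 = \sqrt{c^T G_k c}$, so the norms appearing in $d_{sk+j}$ cost only $O(s^2)$ local flops once $G_k$ is available.

```python
import numpy as np

def coord_norm(G, c):
    """2-norm of Y @ c computed from coordinates alone: ||Y c||^2 = c^T G c.
    The max() guards against tiny negative values caused by roundoff."""
    return np.sqrt(max(c @ (G @ c), 0.0))
```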