Slide 1

idempotent (ī-dəm-pō-tənt) adj.
1. of, relating to, or being a mathematical quantity which when applied to itself equals itself;
2. of, relating to, or being an operation under which a mathematical quantity is idempotent.
idempotent processing (ī-dəm-pō-tənt prə-ses-iŋ) n.
the application of only idempotent operations in sequence; said of the execution of computer programs in units of only idempotent computations, typically to achieve restartable behavior.

Slide 2
Static Analysis and Compiler Design for Idempotent Processing
Marc de Kruijf, Karthikeyan Sankaralingam, Somesh Jha
PLDI 2012, Beijing

Slide 3
Example

    int sum(int *array, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += array[i];
        return x;
    }

source code

Slide 4
Example

    R2 = load [R1]
    R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

assembly code

(figure: faults, exceptions, and mis-speculations can strike mid-execution)

Slide 5
Example

    R2 = load [R1]
    R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

assembly code

Bad Stuff Happens!

Slide 6
Example

R0 and R1 are unmodified:

    R2 = load [R1]
    R3 = 0
    LOOP: R4 = load [R0 + R2]
          R3 = add R3, R4
          R2 = sub R2, 1
          bnez R2, LOOP
    EXIT: return R3

assembly code

just re-execute! (the convention: use checkpoints/buffers)

Slide 7
It’s Idempotent!

idempoh… what…?

    int sum(int *data, int len) {
        int x = 0;
        for (int i = 0; i < len; ++i)
            x += data[i];
        return x;
    }
Slide 8
Idempotent Processing

idempotent regions, All The Time

Slide 9
Idempotent Processing

executive summary:
idempotence is inhibited by clobber antidependences.
how? a custom compiler cuts the semantic clobber antidependences that a normal compiler leaves in place.
result: low runtime overhead (typically 2-12%).

Slide 10
Presentation Overview
❶ Idempotence
❷ Algorithm
❸ Results

Slide 11
What is Idempotence?

is this idempotent? (figure: example code) Yes

Slide 12
What is Idempotence?

how about this? (figure: example code) No

Slide 13
What is Idempotence?

maybe this? (figure: example code) Yes

Slide 14
What is Idempotence?

it's all about the data dependences (dependence chains shown graphically on the slide):

    operation sequence    idempotent?
    write                 Yes
    read, write           No
    write, read, write    Yes

Slide 15
What is Idempotence?

it's all about the data dependences:

    operation sequence    idempotent?
    write, read           Yes
    read, write           No
    write, read, write    Yes

Clobber antidependence: an antidependence with an exposed read.
Slide 16

Semantic Idempotence

two types of program state:
(1) local ("pseudoregister") state: can be renamed to remove clobber antidependences; does not semantically constrain idempotence.
(2) non-local ("memory") state: cannot be "renamed" to avoid clobber antidependences; semantically constrains idempotence.

semantic idempotence = no non-local clobber antidependences.
preserve local state by renaming and careful allocation.
Slide 17

Presentation Overview
❶ Idempotence
❷ Algorithm
❸ Results

Slide 18
Region Construction Algorithm

steps one, two, and three:
Step 1: transform the function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior)

Slide 19
Step 1: Transform

not one, but two transformations.
Transformation 1: SSA for pseudoregister antidependences.

But we still have a problem: region identification depends on clobber antidependences, and clobber antidependences depend on region boundaries.

Slide 20
Step 1: Transform

not one, but two transformations.
Transformation 1: SSA for pseudoregister antidependences.
Transformation 2: scalar replacement of memory variables.

before:

    [x] = a;
    b = [x];
    [x] = c;

after:

    [x] = a;
    b = a;
    [x] = c;

non-clobber antidependences … GONE!
Slide 21

Step 1: Transform

not one, but two transformations.
Transformation 1: SSA for pseudoregister antidependences.
Transformation 2: scalar replacement of memory variables.

(the cycle from before: region identification depends on clobber antidependences, which depend on region boundaries)

Slide 22
Region Construction Algorithm

steps one, two, and three:
Step 1: transform the function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior)

Slide 23
Step 2: Cut the CFG

cut, cut, cut…
construct regions by "cutting" non-local antidependences.

Slide 24
Step 2: Cut the CFG

larger is (generally) better: large regions amortize the cost of input preservation.
(figure, rough sketch: overhead vs. region size; what is the optimal region size?)

but where to cut…?

Slide 25
Step 2: Cut the CFG

but where to cut…? goal: the minimum set of cuts that cuts all antidependence paths.
intuition: minimum cuts → fewest regions → large regions.
approach: a series of reductions:
    minimum vertex multi-cut (NP-complete)
    → minimum hitting set among paths
    → minimum hitting set among "dominating nodes"
details in paper…
Slide 26

Region Construction Algorithm

steps one, two, and three:
Step 1: transform the function (remove artificial dependences, remove non-clobbers)
Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG)
Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior)

Slide 27
Step 3: Loop-Related Refinements

loops affect correctness and performance.
correctness: not all local antidependences are removed by SSA; loop-carried antidependences may clobber. This depends on boundary placement and is handled as a post-pass.
performance: loops tend to execute multiple times, so to maximize region size, place cuts outside of loops; the algorithm is modified to prefer such cuts.
details in paper…
Slide 28

Presentation Overview
❶ Idempotence
❷ Algorithm
❸ Results

Slide 29
Results

compiler implementation:
– paper compiler implementation in LLVM v2.9
– LLVM v3.1 source code release in the July timeframe

experimental data: (1) runtime overhead (2) region size (3) use case

Slide 30
Runtime Overhead

(figure: percent overhead by benchmark suite; 7.6 and 7.7 percent overhead (gmean))

Slide 31
Region Size

(figure: average dynamic region size in instructions by benchmark suite; 28 instructions (gmean))

Slide 32
Use Case

hardware fault recovery
(figure: percent overhead by benchmark suite; 8.2, 24.0, and 30.5 percent overhead (gmean))

Slide 33
Presentation Overview
❶ Idempotence
❷ Algorithm
❸ Results

Slide 34
Summary & Conclusions

summary:
– idempotent processing: large (low-overhead) idempotent regions, all the time
– static analysis, compiler algorithm: (a) remove artifacts (b) partition (c) compile
– low overhead: 2-12% runtime overhead typical

Slide 35
Summary & Conclusions

conclusions:
– several applications already demonstrated: CPU hardware simplification (MICRO ’11), GPU exceptions and speculation (ISCA ’12), hardware fault recovery (this paper)

future work:
– more applications, hybrid techniques
– optimal region size?
– enabling even larger region sizes

Slide 36
Back-up Slides

Slide 37
Error Recovery

dealing with side-effects:
– mis-speculation (e.g. branch misprediction): compiler handles pseudoregister state; for non-local memory, a store buffer is assumed
– arbitrary failure (e.g. hardware fault): ECC and other verification assumed; a variety of existing techniques, details in paper
– exceptions: generally no side-effects beyond out-of-order-ness; fairly easy to handle

Slide 38
Optimal Region Size?

it depends… (figure, rough sketch not to scale: overhead vs. region size; detection latency, register pressure, and re-execution time all contribute)

Slide 39
Prior Work

relating to idempotence:

    Technique                  Year  Domain
    Sentinel Scheduling        1992  Speculative memory re-ordering
    Fast Mutual Exclusion      1992  Uniprocessor mutual exclusion
    Multi-Instruction Retry    1995  Branch and hardware fault recovery
    Atomic Heap Transactions   1999  Atomic memory allocation
    Reference Idempotency      2006  Reducing speculative storage
    Restart Markers            2006  Virtual memory in vector machines
    Data-Triggered Threads     2011  Data-triggered multi-threading
    Idempotent Processors      2011  Hardware simplification for exceptions
    Encore                     2011  Hardware fault recovery
    iGPU                       2012  GPU exception/speculation support

Slide 40
Detailed Runtime Overhead

(figure: percent overhead across suites (gmean) and outliers (gmean); 7.6 and 7.7 percent suite gmeans)
outliers: non-idempotent inner loops + high register pressure

Slide 41
Detailed Region Size

(figure: average number of instructions per dynamic region across suites (gmean) and outliers (gmean); values shown include 28 / 116, 45, and >1,000,000)
outliers explained by limited aliasing information