Chimera: Hybrid Program Analysis for Determinism

Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy
University of Michigan, Ann Arbor

* Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html
Deterministic Replay

Goal: record and reproduce a multithreaded execution
- Debugging concurrency bugs
- Offline heavyweight dynamic analysis
- Forensics and intrusion detection
- ... and many more uses

Problem
- Multithreaded record-and-replay is too slow (>2x) or requires custom hardware
Multithreaded Record-and-Replay is Slow

[Figure: three threads (Thread 1-3) exchanging writes and reads through shared memory. To replay, the recorder must (1) checkpoint memory and register state, (2) log non-deterministic program input (interrupts, I/O values, DMA, etc.), and (3) log shared-memory dependencies between threads.]
Replay for Data-Race-Free Programs is Cheap

Data-race-free programs
- Shared memory accesses are well ordered by synchronization operations
- Recording the happens-before order of synchronization operations is sufficient

Problem: programs with data races

[Figure: threads T1-T3 performing racy accesses to X, Y, and Z interleaved with Lock(l)/Unlock(l) and Signal(c)/Wait(c). With data races, replay needs the order of memory operations, not just the order of synchronization operations.]
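For data-race-free programs, a total order over synchronization operations is all the recorder needs. A minimal sketch of that idea (the names and log layout are illustrative, not Chimera's actual runtime):

```c
#include <assert.h>

/* Illustrative sketch: record the happens-before order of sync ops
 * during recording, and gate threads on that order during replay. */
typedef struct { int thread_id; int sync_id; } sync_event;

static sync_event g_record_log[1024];
static int g_record_len = 0;

/* Called just after a synchronization operation (lock acquire,
 * signal, barrier) completes: append it to a totally ordered log. */
void record_sync(int thread_id, int sync_id) {
    g_record_log[g_record_len].thread_id = thread_id;
    g_record_log[g_record_len].sync_id = sync_id;
    g_record_len++;
}

/* During replay, a thread may perform sync op `sync_id` only when it
 * is that thread's turn in the recorded order. */
int replay_may_proceed(int next, int thread_id, int sync_id) {
    return g_record_log[next].thread_id == thread_id
        && g_record_log[next].sync_id == sync_id;
}
```

The point of the slide is that this log is small and cheap to produce; racy memory accesses are exactly what escapes it.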
Our Contribution: A Hybrid Analysis

[Diagram: Chimera transforms a potentially racy program P into a data-race-free program P'.]
- Sound static data race analysis: add synchronization for every potential data race
- Problem: too many false positives
- Chimera prunes them with (1) profiling of non-concurrent code regions and (2) symbolic bounds analysis
Roadmap
- Motivation
- Chimera Analysis
  - Static data race analysis
  - Profiling non-concurrent code regions
  - Symbolic bounds analysis
- Weak-lock Design
- Evaluation
- Conclusion
Roadmap (next: Chimera Analysis)
Static Data Race Analysis
- Find potential data races using a sound static data race detector: RELAY [Voung et al., FSE'07]
- Protect all potential data races using weak-locks
  - A new time-out lock that may be preempted (discussed later)
- Record and replay the happens-before order of weak-locks
Protect Potential Races Using Weak-locks

    void foo() {
      X = 0;                /* potential racy pair with X in bar() */
      for (i = ...) {
        Y[tid][i] = 0;      /* potential racy pair with Y[tid][i] in bar() */
      }
    }

    void bar() {
      X = 1;
      for (i = ...) {
        Y[tid][i] = 1;
        Z = 1;              /* no race report */
      }
    }

Static analysis helps avoid instrumentation: the access to Z has no race report, so it needs no weak-lock.
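The instrumentation the slide implies can be sketched as follows. `weak_lock`/`weak_unlock` and the lock ids are hypothetical names, not Chimera's actual API; here the "weak-locks" merely count acquisitions so the per-access instrumentation cost is visible:

```c
#include <assert.h>

/* Hypothetical instruction-level instrumentation sketch. */
enum { LOCK_X, LOCK_Y, NUM_LOCKS };
static int g_acquires[NUM_LOCKS];

static void weak_lock(int id)   { g_acquires[id]++; }
static void weak_unlock(int id) { (void)id; }

#define N 16
static int X;
static int Y[2][N];

/* foo() from the slide, with each potentially racy access bracketed
 * by an instruction-level weak-lock. The race-free access to Z in
 * bar() would stay uninstrumented. */
void foo_instrumented(int tid) {
    weak_lock(LOCK_X);          /* racy pair: X in foo()/bar() */
    X = 0;
    weak_unlock(LOCK_X);
    for (int i = 0; i < N; i++) {
        weak_lock(LOCK_Y);      /* racy pair: Y[tid][i] in foo()/bar() */
        Y[tid][i] = 0;
        weak_unlock(LOCK_Y);
    }
}
```

Note that the loop pays one acquire/release per iteration; this per-access cost is what the coarser-grained weak-locks on the later slides eliminate.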
Sources of False Positives in RELAY

A sound data-race detector reports too many false data races: instrumenting every report gives 53x recording overhead.

Source 1: non-mutex synchronization is ignored
- Lockset-based analysis ignores fork-join, barrier, signal-wait, etc.
- May report a false data race between memory instructions that can never execute concurrently
- Solution: profiling non-concurrent code regions

Source 2: conservative pointer analysis
- Overestimates the variables accessed by a memory instruction
- May report a false data race between memory instructions that can never access the same location
- Solution: symbolic bounds analysis
Roadmap (next: Profiling non-concurrent code regions)
Profiling Non-concurrent Code Regions

Problem
- Lockset-based analysis ignores non-mutex synchronization operations

Solution
- Profile non-concurrent code regions (e.g., functions)
- Increase the granularity of weak-locks to protect a larger code region instead of each potentially racy instruction
- Parallelism is preserved unless a region is mis-profiled

[Figure: T1 runs foo() before a BARRIER; T2 runs bar() after the BARRIER. The race reported between them is false because the barrier already separates them.]
Function-Level Weak-locks

If the profiler says foo() and bar() are not likely to run concurrently, a single function-level weak-lock covers each function:

    void foo() {
      X = 0;
      for (i = ...) {
        Y[tid][i] = 0;
      }
    }

    void bar() {
      X = 1;
      for (i = ...) {
        Y[tid][i] = 1;
        Z = 1;
      }
    }

[Figure: the BARRIER separates foo() in T1 from bar() in T2, so the reported race is false; one weak-lock per function replaces the per-access weak-locks.]
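A sketch of the function-level alternative (illustrative names again): one weak-lock acquire covers the whole function, so the per-iteration cost inside the loop disappears entirely.

```c
#include <assert.h>

/* Hypothetical function-level instrumentation sketch; the counting
 * weak-lock makes the reduced acquire count observable. */
enum { LOCK_FOOBAR, NUM_LOCKS };
static int g_acquires[NUM_LOCKS];

static void weak_lock(int id)   { g_acquires[id]++; }
static void weak_unlock(int id) { (void)id; }

#define N 16
static int X;
static int Y[2][N];

void foo_fn_level(int tid) {
    weak_lock(LOCK_FOOBAR);     /* one acquire for the whole function */
    X = 0;
    for (int i = 0; i < N; i++)
        Y[tid][i] = 0;
    weak_unlock(LOCK_FOOBAR);
}
```

Compare with the instruction-level sketch earlier: N+1 acquires collapse into one, at the risk of serializing foo() and bar() if the profile was wrong.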
Roadmap (next: Symbolic bounds analysis)
Imprecision in Conservative Pointer Analysis

[Figure: T1 runs foo() and T2 runs bar() after a BARRIER; the two may run concurrently, so profiling non-concurrent regions does not help here.]
Imprecision in Conservative Pointer Analysis

RELAY uses Steensgaard's and Andersen's pointer analyses
- Flow-insensitive and context-insensitive (FICI) analysis
- Naming of heap objects is conservative
- Overestimates the variables accessed by a memory instruction

    void foo() {
      ...
      for (i = 0 to N) {
        Y[tid][i] = 0;
        ...
      }
    }

    void bar() {
      ...
      for (i = 0 to N) {
        Y[tid][i] = 1;
        ...
      }
    }

[Figure: Thread 1 and Thread 2 each write a different row of Y[][], but the pointer analysis treats all of Y as one object, so the pair is reported as a potential racy pair: a false race.]
Symbolic Bounds Analysis

Our solution
- Derive the symbolic lower and upper bounds of the addresses a racy code region (e.g., a loop) may access [Rugina and Rinard, PLDI'00]
- Increase the granularity of weak-locks to protect a larger code region for the set of addresses specified by a symbolic expression
- Parallelism is preserved if the bounds are precise enough

    void foo() {
      ...
      for (i = 0 to N) {
        Y[tid][i] = 0;
      }
      ...
    }

Symbolic bounds analysis derives: &Y[tid][0] to &Y[tid][N]
Loop-Level Weak-locks

Symbolic bounds: &Y[tid][0] to &Y[tid][N]. Each loop acquires a weak-lock parameterized by this address range:

    void foo() {
      X = 0;
      for (i = 0 to N) {
        Y[tid][i] = 0;
      }
    }

    void bar() {
      X = 1;
      for (i = 0 to N) {
        Y[tid][i] = 1;
        Z = 1;
      }
    }

[Figure: each loop is protected by a weak-lock on the range (&Y[tid][0], &Y[tid][N]); threads with different tid values lock disjoint ranges and run in parallel.]
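The range-keyed weak-lock can be sketched as an interval check (`addr_range`, `loop_bounds`, and `ranges_conflict` are illustrative names, not Chimera's API). Two loop instances need to be ordered only if their address intervals overlap; different tid rows of Y give disjoint intervals, so those loops run in parallel:

```c
#include <assert.h>

/* Sketch of loop-level weak-locks keyed by the address interval
 * produced by symbolic bounds analysis. */
#define N 8
static int Y[2][N];

typedef struct { const char *lo, *hi; } addr_range;

/* Interval the loop body of thread `tid` may touch:
 * &Y[tid][0] .. &Y[tid][N-1]. */
static addr_range loop_bounds(int tid) {
    addr_range r = { (const char *)&Y[tid][0],
                     (const char *)&Y[tid][N - 1] };
    return r;
}

/* Closed-interval overlap test: the two loops conflict (and must be
 * serialized by the weak-lock) only if their ranges intersect. */
static int ranges_conflict(addr_range a, addr_range b) {
    return a.lo <= b.hi && b.lo <= a.hi;
}
```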
Imprecise Symbolic Bounds

Sources of imprecision
- Bounds that depend on a value computed inside the code region
- Bounds that depend on arithmetic operations not supported in the analysis (e.g., modulo operations, logical AND/OR)

Choosing the optimal granularity
- If the bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks to preserve parallelism

    void qux() {
      ...
      for (i = 0 to N) {
        prev = Z[prev];   /* next index depends on a value computed in the loop */
      }
      ...
    }

Symbolic bounds analysis derives: -INF to +INF
Roadmap (next: Weak-lock Design)
Deadlock due to Weak-locks

- No deadlocks among weak-locks themselves: they nest in a fixed order (function-level > loop-level > instruction-level)
- Deadlock between a weak-lock and an original synchronization operation is possible

[Figure: T1 holds a weak-lock and blocks in wait(cv); T2 must acquire the same weak-lock before it can reach signal(cv). The weak-lock times out to break the deadlock.]
Weak-lock Time-out

- A weak-lock might time out; a special system call is invoked to handle it
- Weak-lock guarantee: only one thread holds a given weak-lock at any given time
- Mutual exclusion may be compromised, but the logged order of weak-locks is sufficient for replay

[Figure: after T1's weak-lock times out while T1 blocks in wait(cv), ownership transfers to T2, which can then reach signal(cv); the order of weak-lock ownership is logged.]
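The time-out behavior can be sketched as follows. This is an illustrative single-threaded model with a fake clock, not the real kernel-assisted implementation: ownership is preempted once the holder has exceeded the time-out, mutual exclusion may then be broken, but the recorded ownership order remains replayable.

```c
#include <assert.h>

static long g_now = 0;            /* fake clock for the sketch */
#define WEAK_LOCK_TIMEOUT 100

/* owner == 0 means free; thread ids start at 1. */
typedef struct { int owner; long acquired_at; } weak_lock_t;

/* Returns 1 on acquire, 0 if the caller must keep waiting. If the
 * current owner has held the lock past the time-out, ownership is
 * preempted and handed to the caller. */
int weak_lock_acquire(weak_lock_t *l, int tid) {
    if (l->owner == 0) {                               /* free */
        l->owner = tid; l->acquired_at = g_now; return 1;
    }
    if (g_now - l->acquired_at >= WEAK_LOCK_TIMEOUT) { /* preempt */
        l->owner = tid; l->acquired_at = g_now; return 1;
    }
    return 0;                                          /* still held */
}

/* A release after preemption is stale and has no effect. */
void weak_lock_release(weak_lock_t *l, int tid) {
    if (l->owner == tid) l->owner = 0;
}
```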
Roadmap (next: Evaluation)
Implementation

Source-to-source instrumentation
- Implemented in OCaml using CIL as a front end

Static analysis
- Data race detection: RELAY [Voung et al., FSE'07]
  - Includes all library source code for soundness (uClibc's libc, libm, etc.)
- Symbolic bounds analysis: [Rugina and Rinard, PLDI'00]
  - Intra-procedural analysis, for racy loops only

Runtime system
- Modified Linux kernel to record/replay program input
- Modified pthread library to record/replay the happens-before order of original synchronization operations and weak-locks
Evaluation Setup

Test environment
- 2.66 GHz 8-core Xeon processor with 4 GB of RAM
- Different sets of inputs for profiling and for performance evaluation
- Average of five trials with 4 worker threads (2, 4, and 8 threads for scalability results)

Benchmarks
- Desktop applications: aget, pfscan, and pbzip2
- Server programs: knot and apache
- SPLASH-2 suite: ocean, water-nsq, fft, and radix
Record and Replay Performance

- Recording: 39% slowdown on average (ranging from 2.4% to 86% across benchmarks)
- Replay: similar to recording; much lower for I/O-intensive programs
Effectiveness of Coarse-grained Weak-locks

- Instruction/basic-block-level weak-locks alone: 53x recording overhead
- Coarse-grained weak-locks reduce the cost of instrumentation to 1.39x
- Exception: control-flow dependency (e.g., pfscan)
Breakdown of Recording Overhead

Weak-lock overhead = contention (waiting) cost + logging cost

[Chart: per-benchmark stacked bars splitting the overhead into function-, loop-, and instruction/basic-block-level lock waiting and logging costs, plus the synchronization-operation and system log. Some benchmarks show high loop-lock contention; others show high instruction/basic-block-lock contention.]
Scalability

- Scientific applications scale worse due to imprecise symbolic bounds analysis
Conclusion

Goal: a software-only deterministic multiprocessor replay system

Chimera analysis
- Static data race analysis: find and protect potential data races with instruction/basic-block-level weak-locks
- Profiling non-concurrent code regions: addresses the inadequacy of the lockset-based algorithm (function-level weak-locks)
- Symbolic bounds analysis: addresses the imprecision of conservative pointer analysis (loop-level weak-locks)

Low recording overhead
- 39% recording overhead with 4 worker threads
Thank you