Slide1
Lazy Diagnosis of In-Production Concurrency Bugs
Baris Kasikci, Weidong Cui, Xinyang Ge, Ben Niu
Slide2
Why Does In-Production Bug Diagnosis Matter?
Potential to fix bugs that impact users
Short release cycles make in-house testing challenging
Release cycles can be as frequent as a few times a day [1]
[1] https://code.facebook.com/posts/270314900139291/rapid-release-at-massive-scale
Slide3
Concurrency Bug Diagnosis
Concurrency bug diagnosis requires knowing the order of key events (e.g., memory accesses)
Atomicity violation (time flows downward; R = read, W = write):
Thread 1                 Thread 2
if (*x) {          R
                         free(x);
                         x = NULL;      W
  y = *x;          R
}
Slide4
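The read-write-read interleaving in the atomicity violation example can be checked mechanically once a global order of accesses is known. The sketch below is only an illustration under that assumption; the event tuples, thread names, and the function `find_rwr_violations` are hypothetical and do not come from the talk.

```python
def find_rwr_violations(events):
    """Find read-write-read atomicity violations in an ordered access list.

    events: list of (thread, op, addr) in observed global order,
            where op is "R" or "W" (hypothetical input format).
    Returns a list of (i, j, k) index triples: two consecutive reads of the
    same address by one thread (i, k) with a write by another thread (j)
    in between.
    """
    violations = []
    for i, (t1, op1, a1) in enumerate(events):
        if op1 != "R":
            continue
        for k in range(i + 1, len(events)):
            t3, op3, a3 = events[k]
            if t3 == t1 and a3 == a1:
                # k is the next access by the same thread to the same address.
                if op3 == "R":
                    for j in range(i + 1, k):
                        t2, op2, a2 = events[j]
                        if t2 != t1 and op2 == "W" and a2 == a1:
                            violations.append((i, j, k))
                            break
                break
    return violations


# The interleaving from the slide: Thread 1 reads x, Thread 2 writes x,
# Thread 1 reads x again.
buggy = [("T1", "R", "x"), ("T2", "W", "x"), ("T1", "R", "x")]
serial = [("T1", "R", "x"), ("T1", "R", "x"), ("T2", "W", "x")]
print(find_rwr_violations(buggy))
print(find_rwr_violations(serial))
```

The check needs the relative order of all three accesses, which is why the rest of the talk focuses on recovering event order cheaply.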
Challenges of Concurrency Bug Diagnosis
Diagnosis requires reproducing bugs [PBI, ASPLOS’13] [Gist, SOSP’15]
Practitioners report that they can fix reproducible bugs [PLATEAU’14]
It may not be possible to reproduce in-production concurrency bugs
Inputs for reproducing bugs may not be available
Exposing bugs in production may incur high overhead [RaceMob, SOSP’13]
Slide5
Record/Replay
Tracing fine-grained interleavings incurs high overhead
State-of-the-art record/replay has 28% overhead [DoublePlay, ASPLOS’11]
[Figure: atomicity violation timeline (Thread 1: R, R; Thread 2: W), with ΔT1 and ΔT2 marking the time between key events]
In theory, ΔT can be on the order of a nanosecond
Slide6
Coarse Interleaving Hypothesis
A lightweight, coarse-grained time tracking mechanism can help infer ordering
Study with 54 bugs in 13 systems
Smallest ΔT is 91 microseconds
[Figure: atomicity violation timeline (Thread 1: R, R; Thread 2: W) with ΔT1 and ΔT2; the smallest measured ΔT (91 us) is about 10^5 times the theoretical ~1 ns]
Slide7
Lazy Diagnosis
Snorlax
Lazy Diagnosis Prototype
Fully Accurate Concurrency Bug Diagnosis (11 bugs in 7 systems)
Low overhead (always below 2%)
Leverages the coarse interleaving hypothesis
Hybrid dynamic/static root cause diagnosis technique
Slide8
Outline
Usage model
Design
Evaluation
Slide9
Current Bug Diagnosis Model
Root cause diagnosis
Slide10
Lazy Diagnosis Usage Model
[Figure labels: Lazy Diagnosis; Root cause; + Control-flow trace & Timing Info]
Control flow trace speeds up static analysis
Coarse-grained timing information helps determine ordering
Root cause diagnosis
Slide11
Outline
Usage model
Design
Evaluation
Slide12
Lazy Diagnosis
Hybrid Points-to Analysis → Type-based Ranking → Bug Pattern Computation → Statistical Diagnosis
Slide14
Hybrid Points-to Analysis
Finds instructions with operands pointing to the same location as the failing instruction’s operand
I_F: load %Queue*, %fifo          FAILURE (CRASH)
I_1: store i32* %21, %bufSize
I_2: store %Queue* %1, %q
Slide15
Hybrid Points-To Analysis
Uses the control flow traces to limit the scope of static analysis
Runs fast, scales to large programs (e.g., httpd, MySQL)
Lazy: Control flow traces trigger the analysis
Interprocedural: Bug patterns may span multiple functions
Flow-insensitive: Discards execution order of instructions for scalability
Slide16
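One way to picture how a control flow trace narrows a flow-insensitive points-to analysis is the sketch below. The slides do not name the underlying algorithm, so this uses a simple inclusion-based (Andersen-style) fixpoint as a stand-in; the statement encoding, the `executed` set, and the names `points_to` and `may_alias` are all hypothetical.

```python
from collections import defaultdict


def points_to(statements, executed):
    """Flow-insensitive, inclusion-based points-to analysis.

    statements: list of (stmt_id, kind, dst, src), with kind one of
        "addr"  (dst = &src), "copy" (dst = src),
        "load"  (dst = *src), "store" (*dst = src).
    executed: set of stmt_ids seen in the control flow trace; statements
        outside this set are ignored (assumed form of trace information).
    """
    stmts = [s for s in statements if s[0] in executed]
    pts = defaultdict(set)
    changed = True
    while changed:  # iterate to a fixpoint; statement order is irrelevant
        changed = False
        for _, kind, dst, src in stmts:
            if kind == "addr":
                updates = {dst: {src}}
            elif kind == "copy":
                updates = {dst: set(pts[src])}
            elif kind == "load":
                vals = set()
                for obj in list(pts[src]):
                    vals |= pts[obj]
                updates = {dst: vals}
            elif kind == "store":
                updates = {obj: set(pts[src]) for obj in list(pts[dst])}
            else:
                raise ValueError(kind)
            for target, vals in updates.items():
                if not vals <= pts[target]:
                    pts[target] |= vals
                    changed = True
    return pts


def may_alias(pts, a, b):
    """Two pointers may alias if their points-to sets overlap."""
    return bool(pts[a] & pts[b])


program = [
    (1, "copy", "q", "p"),  # q = p
    (2, "addr", "p", "A"),  # p = &A
    (3, "addr", "r", "B"),  # r = &B
    (4, "copy", "q", "r"),  # q = r  (never executed in the traced run)
]
traced = points_to(program, executed={1, 2, 3})
whole = points_to(program, executed={1, 2, 3, 4})
print(may_alias(traced, "q", "r"), may_alias(whole, "q", "r"))
```

Restricting the analysis to executed statements both shrinks the work and removes the spurious alias between `q` and `r` that the whole-program run reports.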
Lazy Diagnosis
Hybrid Points-to Analysis → Type-based Ranking → Bug Pattern Computation → Statistical Diagnosis
Slide18
Type-Based Ranking
Highly ranks instructions operating on types that match the failing instruction's operand type
Failing instruction: load %Queue*, %fifo          FAILURE (CRASH)
Ranked candidates:
1. store %Queue* %1, %q
2. store i32* %21, %bufSize
Slide19
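The ranking idea can be sketched in a few lines, assuming each candidate instruction comes with the type of the operand it accesses. The tuple layout and the function `rank_by_type` are hypothetical, and the real tool may use a richer notion of type matching than string equality.

```python
def rank_by_type(candidates, failing_type):
    """Order candidates so that type matches come first.

    candidates: list of (instruction_text, operand_type) pairs.
    failing_type: operand type of the failing instruction.
    sorted() is stable, so candidates with equal rank keep their input order.
    """
    return sorted(candidates, key=lambda c: c[1] != failing_type)


candidates = [
    ("store i32* %21, %bufSize", "i32*"),
    ("store %Queue* %1, %q", "%Queue*"),
]
# The failing instruction is "load %Queue*, %fifo", so %Queue* ranks first.
for rank, (instr, _) in enumerate(rank_by_type(candidates, "%Queue*"), start=1):
    print(rank, instr)
```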
Lazy Diagnosis
Hybrid Points-to Analysis → Type-based Ranking → Bug Pattern Computation → Statistical Diagnosis
Slide21
Bug Pattern Computation
[Figure: from the ranked candidates (store %Queue* %1, %q; store i32* %21, %bufSize) and the failing instruction (load %Queue*, %fifo, FAILURE), two candidate bug patterns are computed across Thread 1 and Thread 2 (Bug Pattern I, Bug Pattern II), each pairing one store with the failing load]
Slide22
Bug Pattern Computation
Leverages the coarse interleaving hypothesis to establish instruction orders
Our implementation uses timing packets in Intel Processor Trace
Granularity of a few tens of microseconds
We measured the smallest ΔT between key events as 91 microseconds
Slide23
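A minimal sketch of how coarse timestamps could establish instruction order is given below. It assumes each event carries the time of the most recent timing packet, so the true time lies within one granularity interval after the recorded value. The `Event` type, the field names, and the 30 us granularity are illustrative assumptions, and no Intel Processor Trace decoding is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    thread: int
    instr: str
    time_us: float  # time of the latest timing packet before the event


def order(a, b, granularity_us):
    """Return "before", "after", or "unknown" for a relative to b.

    If the recorded times differ by at least one granularity interval,
    the true times cannot overlap, so the order is certain. Otherwise
    the coarse clock cannot tell the two events apart.
    """
    delta = b.time_us - a.time_us
    if delta >= granularity_us:
        return "before"
    if -delta >= granularity_us:
        return "after"
    return "unknown"


store = Event(thread=1, instr="store %Queue* %1, %q", time_us=1000.0)
load = Event(thread=2, instr="load %Queue*, %fifo", time_us=1091.0)
print(order(store, load, granularity_us=30.0))  # gap of 91 us
```

With key events at least 91 microseconds apart and a granularity of a few tens of microseconds, this rule would resolve the order of such events rather than returning "unknown".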
Lazy Diagnosis
Hybrid Points-to Analysis → Type-based Ranking → Bug Pattern Computation → Statistical Diagnosis
Slide25
Statistical identification of failure-predicting patterns
[Figure: six executions, each showing store %Queue* %1, %q and load %Queue*, %fifo across Thread 1 and Thread 2; two end in FAILURE (CRASH) and four in SUCCESS]
Slide26
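To make the statistical step concrete, the sketch below scores each observed pattern by how well its presence predicts failure across runs. The slides do not state the exact statistic, so the F-measure (harmonic mean of precision and recall) is an assumption here, as are the pattern labels, the run data, and the function names.

```python
def f_measure(pattern, runs):
    """Score how well a pattern predicts failure.

    runs: list of (patterns_observed, failed) pairs, where patterns_observed
          is a set of pattern labels and failed is a bool.
    precision = failing runs with the pattern / runs with the pattern
    recall    = failing runs with the pattern / all failing runs
    """
    with_pattern = [failed for patterns, failed in runs if pattern in patterns]
    failing_total = sum(1 for _, failed in runs if failed)
    hits = sum(with_pattern)
    if hits == 0 or failing_total == 0:
        return 0.0
    precision = hits / len(with_pattern)
    recall = hits / failing_total
    return 2 * precision * recall / (precision + recall)


def rank_patterns(runs):
    """Return (score, pattern) pairs, best failure predictor first."""
    patterns = set().union(*(p for p, _ in runs))
    return sorted(((f_measure(p, runs), p) for p in patterns), reverse=True)


# Hypothetical data: two failing and four successful runs.
runs = [
    ({"store->load"}, True),
    ({"store->load"}, True),
    ({"store->load"}, False),
    ({"load->store"}, False),
    ({"load->store"}, False),
    ({"load->store"}, False),
]
for score, pattern in rank_patterns(runs):
    print(f"{pattern}: {score:.2f}")
```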
Outline
Usage model
Design
Evaluation
Slide27
Evaluation of Snorlax
Is Snorlax effective?
Is Snorlax accurate?
Is Snorlax efficient?
How does Snorlax compare to its competition?
Slide28
Experimental Setup
Real-world C/C++ programs
11 concurrency bugs
Workloads from the programs’ test cases and test cases by other researchers
Slide29
Snorlax’s Effectiveness
Snorlax correctly identified the root causes of 11 bugs
Determined after manual investigation of developer fixes
A single failure recurrence is enough for root cause diagnosis
In practice, for concurrency bugs, “event orders” = “root cause”
Snorlax can effectively diagnose concurrency bugs
Slide30
Snorlax’s Accuracy
[Figure: accuracy contribution chart]
All stages of Lazy Diagnosis are necessary for full accuracy
Slide31
Snorlax’s Efficiency
Snorlax has low runtime performance overhead (always below 2%)
[Figure: percentage overhead chart; labeled value 0.97%]
Slide32
Snorlax vs. Gist
Snorlax scales better than Gist with the increasing number of application threads
[Figure: percentage overhead chart; labeled values 39%, 3%, 1.9%, 0.9%]
Slide33
Lazy Diagnosis
Snorlax
Lazy Diagnosis Prototype
Fully Accurate Concurrency Bug Diagnosis (11 bugs in 7 systems)
Low overhead (always below 2%)
Scales well with the increasing number of threads
Leverages the coarse interleaving hypothesis
Hybrid dynamic/static root cause diagnosis technique
Michigan is hiring!