Slide1
SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems
Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi*, Jeffrey F. Lukman†, and Haryadi S. Gunawi
Slide2
Internet Services
Slide3
Internet Services
Cloud Systems
Slide4
Reliability
Complex failures
“Deep bugs”
Slide5
Deep Bug Example
ZooKeeper (synchronization service), Issue #335
1. Nodes A, B, C start (w/ latest txid: 10)
2. B becomes leader
3. B crashes
4. C becomes leader
5. C commits new txid-value pair (11, X)
6. A crashes before committing the new txid 11
7. C loses quorum and C crashes
8. A and B are back online after C crashes
9. A becomes leader
10. A commits new txid-value pair (11, Y)
11. C is back online after A's new tx commit
12. C announces to B (11, X)
13. B replies with a diff starting with tx 12
14. Inconsistency: A has (11, Y), C has (11, X)
[Figure: timelines of nodes A, B, and C, showing leader (L) and follower (F) roles and the committed values x and y]
PERMANENT INCONSISTENT REPLICA
Slide6
Deep Bug Characteristics
ZooKeeper (synchronization service), Issue #335
1. Nodes A, B, C start (w/ latest txid: 10)
2. B becomes leader
3. B crashes
4. C becomes leader
5. C commits new txid-value pair (11, X)
6. A crashes before committing the new txid 11
7. C loses quorum and C crashes
8. A and B are back online after C crashes
9. A becomes leader
10. A commits new txid-value pair (11, Y)
11. C is back online after A's new tx commit
12. C announces to B (11, X)
13. B replies with a diff starting with tx 12
14. Inconsistency: A has (11, Y), C has (11, X)
Specific Order
1. Out-of-order messages
2. Multiple crashes
3. Multiple reboots
HAPPEN IN ANY ORDER
Slide7
Study of Deep Bugs
[Figure: charts of # crashes / # reboots per bug; x-axis is bug number, y-axis is number of crashes and reboots]
SEVERE IMPLICATIONS: INCONSISTENT REPLICAS, DATA LOSS, DOWNTIMES, ETC.
Slide8
How do we catch deep bugs in distributed systems?
Slide9
How to Catch Deep Bugs
Distributed system model checker
  Re-ordering all non-deterministic events
  Find which specific orderings lead to bugs
Example re-orderings of the 14 events:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14
  2 7 1 4 5 6 3 8 11 10 9 12 14 13
  6 9 3 4 5 1 7 8 2 10 11 13 12 14
  3 4 1 5 2 6 7 8 9 12 11 10 14 13
  2 1 3 4 5 7 6 8 9 10 11 14 12 13
  . . .
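The exhaustive re-ordering idea on this slide can be sketched in Java as follows. This is only an illustration under stated assumptions: the class `OrderingExplorer`, the method `findBuggyOrderings`, the four toy events, and the toy bug rule are all hypothetical and are not taken from SAMC or any existing model checker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: try every ordering of a set of events and keep the
// orderings that a caller-supplied check reports as buggy.
final class OrderingExplorer {

    // Returns every permutation of `events` for which `isBuggy` holds.
    static <E> List<List<E>> findBuggyOrderings(List<E> events, Predicate<List<E>> isBuggy) {
        List<List<E>> buggy = new ArrayList<>();
        permute(new ArrayList<>(events), 0, isBuggy, buggy);
        return buggy;
    }

    // In-place swap permutation; visits all n! orderings of `order`.
    private static <E> void permute(List<E> order, int fixed,
                                    Predicate<List<E>> isBuggy, List<List<E>> out) {
        if (fixed == order.size()) {
            if (isBuggy.test(order)) {
                out.add(new ArrayList<>(order));
            }
            return;
        }
        for (int i = fixed; i < order.size(); i++) {
            Collections.swap(order, fixed, i);
            permute(order, fixed + 1, isBuggy, out);
            Collections.swap(order, fixed, i); // undo before trying the next choice
        }
    }

    public static void main(String[] args) {
        // Toy rule (assumed): the "bug" shows up only when D is handled before A.
        List<String> events = List.of("A", "B", "C", "D");
        List<List<String>> buggy =
                findBuggyOrderings(events, o -> o.indexOf("D") < o.indexOf("A"));
        System.out.println(buggy.size() + " of 24 orderings trigger the toy bug");
    }
}
```

The sketch visits n! orderings, so it only works for tiny inputs: the 14 events of the ZooKeeper example already give 14! = 87,178,291,200 orderings, which is the cost the rest of the talk tries to reduce.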
Slide10
What’s Wrong with Existing Model Checkers?
Last 7 years: MaceMC [NSDI’07], Modist [NSDI’09], dBug [SSV’10], Demeter [SOSP’13], etc.
BUT
Too many events
  Multiple crashes and reboots
  Create more messages
No model checker incorporates multiple crashes and reboots
Cannot find deep bugs!
100 events
Slide11
How do we catch deep bugs
REALLY FAST?
Slide12
Black-Box Approach
Existing model checkers are so slow
  They treat target systems as black boxes
  A large number of event orderings are generated
[Figure: events A, B, C, D pass through a black-box model checker, which generates the orderings below]
ABCD
ABDC
ACBD
ACDB
ADBC
. . .
(24 total)
Slide13
Semantic Knowledge
How can we make a model checker fast?
  Exploit semantic knowledge
  Semantic-aware model checker (SAMC)
[Figure: events A, B, C, D pass through SAMC instead of a black box]
Slide14
Black Box vs. SAMC
Black-box model checker: ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, BADC, BCAD, BCDA, BDAC, . . .
SAMC with message processing semantic: of the same orderings (ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, BADC, BCAD, BCDA, . . .), the unnecessary re-orderings are removed because they lead to the same state
[Figure: events A, B, C, D feed both checkers; SAMC additionally uses the message processing semantic]
Slide15
SAMC with Crashes
Black-box model checker: ABCDX, ABCXD, ABXCD, AXBCD, XABCD, ABDCX, ABDXC, . . .
SAMC with crash recovery semantic: of the same orderings (ABCDX, ABCXD, ABXCD, AXBCD, XABCD, ABDCX, ABDXC, . . .), the unnecessary re-orderings are removed
[Figure: nodes N1, N2, N3, N4 with messages A,B and C,D and a crash X; SAMC uses the crash recovery semantic]
Slide16
Generic Reduction Algorithms
Principle of Semantic Awareness
  Local-Message Independence (LMI)
  Crash-Message Independence (CMI)
  Crash Recovery Symmetry (CRS)
  Reboot Synchronization Symmetry (RSS)
Slide17
SAMC Implementation and Integration
Cloud systems   Protocol
Cassandra       Gossiper, Hinted handoff, Read/write
Hadoop 2.0      Cluster management, Speculative execution
ZooKeeper       Atomic broadcast, Leader election
SAMC implementation
  10,000 LOC from scratch
Apply SAMC to 3 cloud systems
  7 protocols
  10 versions
Slide18
Result
Reproduced 12 old bugs
Compared to state-of-the-art techniques
  Dynamic Partial Order Reduction (DPOR)
  Random-DPOR
Finds bugs 2x to 340x faster
  49x on average
Found 2 new bugs
  Submitted them to developers
Slide19
Outline
Intuition
SAMC
  Local-Message Independence
  Crash-Message Independence
  Crash Recovery Symmetry
  Reboot Synchronization Symmetry
Evaluation
Slide20
Dependency vs. Independency
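[Figure: a node in state S receives two messages, A and B]
A, B = Dependent: the order A, B leads from S to S’, while the order B, A leads from S to S’’
A, B = Independent: both orders, A, B and B, A, lead from S to S’
INDEPENDENT = NO REORDERING
2X SPEED-UP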
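The dependent/independent distinction on this slide can be checked mechanically for a given state: apply the two messages in both orders and compare the results. The Java sketch below does that. The class `IndependenceCheck`, the `independent` helper, and the two toy handlers are assumptions made for illustration and do not come from SAMC.

```java
import java.util.function.BiFunction;

// Hypothetical sketch: two messages are independent at a given state when
// processing them in either order produces the same resulting state.
final class IndependenceCheck {

    static <S, M> boolean independent(S state, M a, M b, BiFunction<S, M, S> handler) {
        S afterAB = handler.apply(handler.apply(state, a), b); // S --A--> . --B--> ?
        S afterBA = handler.apply(handler.apply(state, b), a); // S --B--> . --A--> ?
        return afterAB.equals(afterBA);
    }

    public static void main(String[] args) {
        // Toy handler 1 (assumed): keep the largest value seen; order does not matter.
        BiFunction<Integer, Integer, Integer> keepMax = (s, m) -> Math.max(s, m);
        // Toy handler 2 (assumed): the last message overwrites the state; order matters.
        BiFunction<Integer, Integer, Integer> overwrite = (s, m) -> m;

        System.out.println("keepMax independent?   " + independent(0, 1, 2, keepMax));
        System.out.println("overwrite independent? " + independent(0, 1, 2, overwrite));
    }
}
```

When the check returns true, only one of the two orders has to be explored, which is where the 2X figure for a pair of messages comes from.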
Unnecessary: the B, A re-ordering when A and B are independent
Slide21
Black Box vs. SAMC
Black-box model checker (all of A, B, C, D treated as dependent):
  ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, BADC, BCAD, BCDA, . . .
  4! = 24 orderings
SAMC (semantic info marks two groups of messages as dependent):
  ABCD, ABDC, BACD, BADC
6X SPEED-UP
Slide22
Reduction Speed-up
[Figure: two search trees rooted at state S, each expanding into orderings ABCD, ABDC, ACBD, . . . and further states]
Slide23
How to Declare Message Independency?
Q: Which concurrent messages are independent?
A: Use message processing semantic
Slide24
Message Processing Semantic in
Simplified Leader Election
    if (vote <= belief) {
        // do nothing
    } else {
        belief = vote;
    }
[Figure: a node with Belief = n3 (B = 3) receives votes]
Vote=1: B = 3 stays B = 3
Vote=2: B = 3 stays B = 3
Vote=4: B = 3 becomes B = 4
Slide25
Removing Re-ordering via
Message Processing Semantic
Belief=4; incoming messages: Vote=1, Vote=2, Vote=3
    if (vote <= belief) {
        // do nothing
    } else {
        belief = vote;
    }
Ordering 1, 2, 3: B = 4, then V = 1 gives B = 4, V = 2 gives B = 4, V = 3 gives B = 4
Ordering 1, 3, 2: B = 4, then V = 1 gives B = 4, V = 3 gives B = 4, V = 2 gives B = 4
Ordering 2, 1, 3: B = 4, then V = 2 gives B = 4, V = 1 gives B = 4, V = 3 gives B = 4
. . .
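To see why these re-orderings add nothing, the handler above can be run over every ordering of the votes while recording the sequence of beliefs the node passes through. The Java sketch below does this. The class `DiscardOrderings` and the `traces` helper are hypothetical names, and the second run with a starting belief of 0 is an added contrast case that does not appear on the slide.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: enumerate all orderings of the votes and collect the
// distinct sequences of belief values the node goes through.
final class DiscardOrderings {

    // Handler from the slide: ignore the vote unless it is larger than the belief.
    static int process(int belief, int vote) {
        if (vote <= belief) {
            return belief; // do nothing
        }
        return vote;
    }

    static Set<List<Integer>> traces(int belief, int[] votes) {
        Set<List<Integer>> out = new HashSet<>();
        explore(belief, votes, new boolean[votes.length], new ArrayList<>(), out);
        return out;
    }

    private static void explore(int belief, int[] votes, boolean[] used,
                                List<Integer> trace, Set<List<Integer>> out) {
        if (trace.size() == votes.length) {
            out.add(new ArrayList<>(trace));
            return;
        }
        for (int i = 0; i < votes.length; i++) {
            if (used[i]) continue;
            used[i] = true;
            int next = process(belief, votes[i]);
            trace.add(next);
            explore(next, votes, used, trace, out);
            trace.remove(trace.size() - 1); // backtrack
            used[i] = false;
        }
    }

    public static void main(String[] args) {
        int[] votes = {1, 2, 3};
        // Belief 4: every vote is discarded, so all 3! orderings look identical.
        System.out.println("belief 4: " + traces(4, votes).size() + " distinct trace(s)");
        // Belief 0 (contrast case, assumed): votes are not all discarded.
        System.out.println("belief 0: " + traces(0, votes).size() + " distinct trace(s)");
    }
}
```

With belief 4 all six orderings collapse into a single trace, so exploring one of them is enough. With belief 0 the orderings produce different traces and cannot be skipped on these grounds.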
Unnecessary: re-orderings such as 1, 3, 2 and 2, 1, 3 (the belief stays B = 4)
Slide26
Formalizing the Intuition
MESSAGE PROCESSING SEMANTIC
    if (vote <= belief) {
        // do nothing
    } else {
        belief = vote;
    }
DISCARD PATTERN
    if (isDiscard(msg, ls)) {
        // do nothing;
    }
DISCARD PREDICATE
    boolean discardPredicate(msg, ls) {
        if (msg.vote <= ls.belief)
            return true;
        else
            return false;
    }
Slide27
Formalizing the Intuition
vote  belief  discardPredicate
1     4       true
2     4       true
3     4       true

mx  my  discard(mx)  discard(my)  Independent
1   2   true         true         ✔
1   3   true         true         ✔
2   3   true         true         ✔
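The tables above suggest a simple rule: two messages are treated as independent when the discard predicate holds for both at the current local state. A minimal Java sketch of that reading follows. The class name `DiscardIndependence` and the flattened `int` parameters (instead of `msg` and `ls` objects) are assumptions made to keep the example self-contained.

```java
// Hypothetical sketch of the rule read off the tables: a pair of messages is
// independent when both would be discarded at the current local state.
final class DiscardIndependence {

    // Same condition as discardPredicate on the previous slide, with the
    // message's vote and the local state's belief passed as plain ints.
    static boolean discardPredicate(int vote, int belief) {
        return vote <= belief;
    }

    static boolean independent(int voteX, int voteY, int belief) {
        return discardPredicate(voteX, belief) && discardPredicate(voteY, belief);
    }

    public static void main(String[] args) {
        int belief = 4;
        int[] votes = {1, 2, 3};
        // Print one row per unordered pair, mirroring the second table.
        for (int i = 0; i < votes.length; i++) {
            for (int j = i + 1; j < votes.length; j++) {
                System.out.println(votes[i] + " " + votes[j] + " "
                        + discardPredicate(votes[i], belief) + " "
                        + discardPredicate(votes[j], belief) + " "
                        + independent(votes[i], votes[j], belief));
            }
        }
    }
}
```

A vote larger than the belief, for example 5 against belief 4, fails the predicate, so any pair containing it would still have to be re-ordered under this rule.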
[Figure: a node with Belief=4 receives Vote=1, Vote=2, Vote=3]
Slide28
Outline
Intuition
SAMC
  Local-Message Independence
  Crash-Message Independence
  Crash Recovery Symmetry
  Reboot Synchronization Symmetry
Evaluation
Slide29
SAMC Architecture
[Figure: nodes with local states ls1 and ls2, each behind an interceptor that holds messages (a, b and c, d); SAMC chooses which message to release, e.g. release(c)]
SAMC layers:
  Basic Reduction Techniques: Dynamic Partial Order Reduction (DPOR), Symmetry
  Generic Reduction Policies: LMI, CMI, CRS, RSS
  Protocol Specific Rules: Leader Election, Atomic Broadcast, …
Slide30
Local-Message Independence
SAMC Generic Reduction Policies
  Local-Message Independence (LMI)
  Crash-Message Independence (CMI)
  Crash Recovery Symmetry (CRS)
  Reboot Synchronization Symmetry (RSS)
Slide31
Local-Message Independence
Discard pattern
Increment pattern
Constant pattern

Increment pattern:
    if (msg.type == ack) {
        node.ackCount++;
    }

Increment predicate:
    boolean incrementPredicate(msg, ls) {
        if (msg.type == ack)
            return true;
        else
            return false;
    }
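The increment pattern can be illustrated with a small Java sketch: every ack only bumps a counter, so two acks commute. The names `IncrementPattern`, `Type`, and `Node`, the extra `VOTE` message type, and the pairwise `independent` rule (modelled on the discard case) are assumptions made for this example and are not SAMC's actual API.

```java
// Hypothetical sketch of the increment pattern: acks only increment a counter,
// so the order in which two acks arrive does not change the final state.
final class IncrementPattern {

    enum Type { ACK, VOTE } // VOTE is an assumed non-ack type used for contrast

    static final class Node {
        int ackCount;
    }

    // Handler from the slide.
    static void handle(Node node, Type msgType) {
        if (msgType == Type.ACK) {
            node.ackCount++;
        }
    }

    // Predicate from the slide: does this message match the increment pattern?
    static boolean incrementPredicate(Type msgType) {
        return msgType == Type.ACK;
    }

    // Assumed rule: two messages are independent when both only increment.
    static boolean independent(Type x, Type y) {
        return incrementPredicate(x) && incrementPredicate(y);
    }

    public static void main(String[] args) {
        Node node = new Node();
        handle(node, Type.ACK);
        handle(node, Type.ACK);
        System.out.println("ackCount after two acks: " + node.ackCount);
        System.out.println("ack/ack independent:  " + independent(Type.ACK, Type.ACK));
        System.out.println("ack/vote independent: " + independent(Type.ACK, Type.VOTE));
    }
}
```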
[Figure: two ack messages move the counter from C=0 to C=1 to C=2]
Slide32
Crash-Message Independence
SAMC Generic Reduction Policies
  Crash-Message Independence (CMI)
  Local-Message Independence (LMI)
  Crash Recovery Symmetry (CRS)
  Reboot Synchronization Symmetry (RSS)
Slide33
Crash-Message Independence
Black Box: ABCDX, ABCXD, ABXCD, AXBCD, XABCD, ABDCX, …
    void handleCrash() {
        if (X == follower && isQuorum())
            followerCount--;
        // No new messages
    }

    boolean localImpact(X, ls) {
        if (X == follower && isQuorum())
            return true;
        else
            return false;
    }
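One way to read the local-impact case: when a crash changes only local state and creates no new messages, the crash event X does not need to be tried at every position among the outstanding messages. The Java sketch below contrasts the two placements. The names `CrashPlacement`, `Role`, and `crashInterleavings`, the `stillHasQuorum` flag (standing in for `isQuorum()`), and the choice to place a local-impact crash after all messages are assumptions made for illustration, not SAMC's documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: decide where a crash event X must be placed relative to
// a fixed sequence of messages, depending on whether its impact is local.
final class CrashPlacement {

    enum Role { LEADER, FOLLOWER }

    // Same condition as localImpact on the slide, with the result of
    // isQuorum() passed in as a flag.
    static boolean localImpact(Role crashed, boolean stillHasQuorum) {
        return crashed == Role.FOLLOWER && stillHasQuorum;
    }

    // Black-box style: X at every position. Local impact: one position only.
    static List<String> crashInterleavings(String messages, boolean local) {
        List<String> out = new ArrayList<>();
        if (local) {
            out.add(messages + "X"); // assumed canonical position
            return out;
        }
        for (int i = messages.length(); i >= 0; i--) {
            out.add(messages.substring(0, i) + "X" + messages.substring(i));
        }
        return out;
    }

    public static void main(String[] args) {
        boolean local = localImpact(Role.FOLLOWER, true);
        System.out.println("black box:    " + crashInterleavings("ABCD", false));
        System.out.println("local impact: " + crashInterleavings("ABCD", local));
    }
}
```

For one message ordering, the black-box placement yields five executions (ABCDX, ABCXD, ABXCD, AXBCD, XABCD) where the local-impact case yields one.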
[Figure: a leader (L) and followers (F) with messages A,B and C,D, shown before and after a follower crash]
Local Impact
Slide34
Crash-Message Independence
Global Impact
    void handleCrash() {
        if (X == leader || !isQuorum())
            electLeader();
        // New messages created
    }

    boolean globalImpact(X, ls) {
        if (X == leader || !isQuorum())
            return true;
        else
            return false;
    }
[Figure: a leader (L) and followers (F) with messages A,B and C,D; after the crash the nodes are shown in state S]
Slide35
Crash Recovery Symmetry
Reboot Synchronization Symmetry
SAMC Generic Reduction Policies
  Crash Recovery Symmetry (CRS)
  Local-Message Independence (LMI)
  Crash-Message Independence (CMI)
  Reboot Synchronization Symmetry (RSS)
Slide36
Outline
Intuition
SAMC
  Local-Message Independence
  Crash-Message Independence
  Crash Recovery Symmetry
  Reboot Synchronization Symmetry
Evaluation
Slide37
Evaluation
Cloud systems   Protocol
Cassandra       Gossiper, Hinted handoff, Read/write
Hadoop 2.0      Cluster management, Speculative execution
ZooKeeper       Atomic broadcast, Leader election
SAMC implementation
  10,000 LOC from scratch
Apply SAMC to 3 cloud systems
  7 protocols
  10 versions
Slide38
Protocol-Specific Rules
(e.g. ZooKeeper Leader Election)
Guide SAMC to remove re-orderings
35 LOC on average per protocol
Slide39
Catching Old Bugs
Table: number of executions (#exe) to reach each bug
Issue#            SAMC   Black-Box DPOR   Random   Random DPOR
ZooKeeper-335     117    5000+            1057     5000+
ZooKeeper-790     7      14               225      82
ZooKeeper-975     53     967              71       163
ZooKeeper-1075    16     1081             86       250
ZooKeeper-1419    100    924              2514     987
ZooKeeper-1492    576    5000+            5000+    5000+
ZooKeeper-1653    11     945              3756     3462
MapReduce-4748    4      22               6        6
MapReduce-5489    53     5000+            5000+    5000+
MapReduce-5505    40     1212             5000+    1210
Cassandra-3395    104    2552             191      550
Cassandra-3626    96     5000+            5000+    5000+
Slide40
Catching Old Bugs
Table: number of executions (#exe) to reach each bug, and SAMC's speedup over each technique
                  SAMC   Black-Box DPOR    Random            Random DPOR
Issue#            #exe   #exe    speedup   #exe    speedup   #exe    speedup
ZooKeeper-335     117    5000+   43+       1057    9         5000+   43+
ZooKeeper-790     7      14      2         225     32        82      12
ZooKeeper-975     53     967     18        71      1         163     3
ZooKeeper-1075    16     1081    68        86      5         250     16
ZooKeeper-1419    100    924     9         2514    25        987     10
ZooKeeper-1492    576    5000+   9+        5000+   9+        5000+   9+
ZooKeeper-1653    11     945     86        3756    341       3462    315
MapReduce-4748    4      22      6         6       2         6       2
MapReduce-5489    53     5000+   94+       5000+   94+       5000+   94+
MapReduce-5505    40     1212    30        5000+   125+      1210    30
Cassandra-3395    104    2552    25        191     2         550     5
Cassandra-3626    96     5000+   52+       5000+   52+       5000+   52
Slide41
Reduction Ratio
ZooKeeper leader election protocol
Run black-box DPOR
After each execution, check whether SAMC would also execute it
Count DPOR’s executions that are executed by SAMC too

Reduction Ratio
#Crash  #Reboot  ALL   LMI  CMI  CRS  RSS
1       1        37X   18X  5X   4X   3X
2       2        63X   35X  6X   5X   5X
3       3        103X  37X  9X   9X   14X
Slide42
Conclusion
Deep bugs live in the cloud
Model checker needs to incorporate complex failures to reach deep bugs
  State space explosion
Semantic-aware model checking
  LMI, CMI, CRS, RSS
Raises future research questions
  What other semantic knowledge is useful?
  How to extract it from the code automatically?
Slide43
Thank You
Questions?
http://ucare.cs.uchicago.edu