Lazy Diagnosis of In-Production Concurrency Bugs - PowerPoint Presentation

Uploaded On 2018-11-21

Presentation Transcript

Slide1

Lazy Diagnosis of In-Production Concurrency Bugs

Baris Kasikci, Weidong Cui, Xinyang Ge, Ben Niu

Slide2

Why Does In-Production Bug Diagnosis Matter?

Potential to fix bugs that impact users

Short release cycles make in-house testing challenging

Release cycles can be as frequent as a few times a day [1]

[1] https://code.facebook.com/posts/270314900139291/rapid-release-at-massive-scale

Slide3

Concurrency Bug Diagnosis

Concurrency bug diagnosis requires knowing the order of key events (e.g., memory accesses)

[Figure: timeline of an atomicity violation across two threads]

Thread 1 (R, R):

if (*x) {
  y = *x;
}

Thread 2 (W):

free(x);
x = NULL;

Slide4
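The atomicity violation on the Concurrency Bug Diagnosis slide can be replayed without any real threads. The sketch below is only an illustration: the `Step` names and the `write_splits_reads` helper are hypothetical and do not come from the talk. It treats the three key events as steps in a single ordered sequence and reports whether Thread 2's write falls between Thread 1's two reads.

```cpp
#include <vector>

// Hypothetical names for the three key events on the slide.
enum class Step {
    T1_FirstRead,   // Thread 1: if (*x)
    T1_SecondRead,  // Thread 1: y = *x;
    T2_Write        // Thread 2: free(x); x = NULL;
};

// Returns true when Thread 2's write lands between Thread 1's two reads
// (the R, W, R interleaving). Only the first write in `order` is examined.
bool write_splits_reads(const std::vector<Step>& order) {
    bool first_read_done = false;
    bool second_read_done = false;
    for (Step s : order) {
        switch (s) {
        case Step::T1_FirstRead:
            first_read_done = true;
            break;
        case Step::T1_SecondRead:
            second_read_done = true;
            break;
        case Step::T2_Write:
            return first_read_done && !second_read_done;
        }
    }
    return false;  // no write observed
}
```

Of the three possible positions of the write, only the one between the two reads matches the pattern. That is why diagnosis needs the order of the events and not just the fact that they occurred.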

Challenges of Concurrency Bug Diagnosis

Diagnosis requires reproducing bugs [PBI, ASPLOS’13] [Gist, SOSP’15]

Practitioners report that they can fix reproducible bugs [PLATEAU’14]

It may not be possible to reproduce in-production concurrency bugs

Inputs for reproducing bugs may not be available

Exposing bugs in production may incur high overhead [RaceMob, SOSP’13]

Slide5

Record/Replay

Tracing fine-grained interleavings incurs high overhead

State-of-the-art record/replay has 28% overhead [DoublePlay, ASPLOS’11]

[Figure: atomicity violation timeline (Thread 1, Thread 2; accesses R, R, W) with time gaps labeled ΔT1 and ΔT2]

In theory, ΔT can be on the order of a nanosecond

Slide6

Coarse Interleaving Hypothesis

A lightweight, coarse-grained time tracking mechanism can help infer ordering

Study with 54 bugs in 13 systems

Smallest ΔT is 91 microseconds

[Figure: atomicity violation timeline (Thread 1, Thread 2; accesses R, R, W) with time gaps labeled ΔT1 and ΔT2, comparing 91 us against ~1 ns, a factor of ~10^5]

Slide7

Lazy Diagnosis

Snorlax

Lazy Diagnosis Prototype

Fully Accurate Concurrency Bug Diagnosis (11 bugs in 7 systems)

Low overhead (always below 2%)

Leverages the coarse interleaving hypothesis

Hybrid dynamic/static root cause diagnosis technique

Slide8

Outline

Usage model

Design

Evaluation

Slide9

Current Bug Diagnosis Model

Root cause diagnosis

Slide10

Lazy Diagnosis Usage Model

Lazy Diagnosis

Root cause

+ Control-flow trace & Timing Info

Control flow trace speeds up static analysis

Coarse-grained timing information helps determine ordering

Root cause diagnosis

Slide11

Outline

Usage model

Design

Evaluation

Slide12

Lazy Diagnosis

Hybrid Points-to Analysis

Type-based Ranking

Bug Pattern Computation

Statistical Diagnosis

Slide13

Lazy Diagnosis

Hybrid Points-to Analysis

Type-based Ranking

Bug Pattern Computation

Statistical Diagnosis

Slide14

Hybrid Points-to Analysis

Instruction labels: I_F, I_1, I_2

load %Queue*, %fifo

store i32* %21, %bufSize

store %Queue* %1, %q

FAILURE (CRASH)

Finds instructions with operands pointing to the same location as the failing instruction’s operand

Slide15

Hybrid Points-To Analysis

Uses the control flow traces to limit the scope of static analysis

Runs fast, scales to large programs (e.g., httpd, MySQL)

Lazy: Control flow traces trigger the analysis

Interprocedural: Bug patterns may span multiple functions

Flow-insensitive: Discards execution order of instructions for scalability

Slide16
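The candidate selection described on the Hybrid Points-To Analysis slides can be sketched as a filter over instructions. This is a simplified illustration under stated assumptions, not Snorlax's implementation: the `Instr` record, the integer location IDs, and `candidate_instructions` are hypothetical, and the points-to sets are supplied as input, whereas the real analysis computes them. The sketch keeps an instruction only if it appears in the control flow trace and may reference a location that the failing instruction's operand may reference.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical record for one instruction; this is not an LLVM type.
struct Instr {
    int id;                    // unique instruction ID
    std::string text;          // printable form, as shown on the slides
    std::set<int> points_to;   // abstract locations the operand may reference
};

// Keeps instructions that (a) were executed according to the trace and
// (b) share at least one possible location with the failing instruction.
// Instruction order is ignored, mirroring the flow-insensitive design.
std::vector<Instr> candidate_instructions(const std::vector<Instr>& program,
                                          const std::set<int>& executed_ids,
                                          const Instr& failing) {
    std::vector<Instr> result;
    for (const Instr& ins : program) {
        if (ins.id == failing.id) continue;             // skip the failure itself
        if (executed_ids.count(ins.id) == 0) continue;  // not in the trace
        bool may_alias = false;
        for (int loc : ins.points_to) {
            if (failing.points_to.count(loc) != 0) {
                may_alias = true;
                break;
            }
        }
        if (may_alias) result.push_back(ins);
    }
    return result;
}
```

Checking the trace first is what keeps the scope small: instructions that never ran are discarded before any alias test is made.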

Lazy Diagnosis

Hybrid Points-to Analysis

Type-based Ranking

Bug Pattern Computation

Statistical Diagnosis

Slide17

Lazy Diagnosis

Hybrid Points-to Analysis

Type-based Ranking

Bug Pattern Computation

Statistical Diagnosis

Slide18

Type-Based Ranking

Highly ranks instructions operating on types that match the failing instruction's operand type

Candidate instructions and the failing instruction:

store %Queue* %1, %q

store i32* %21, %bufSize

load %Queue*, %fifo    FAILURE (CRASH)

Ranked candidates:

1. store %Queue* %1, %q

2. store i32* %21, %bufSize

Slide19
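The type-based ranking step can be sketched as a stable reordering of the candidates. The `Candidate` record, the use of plain strings for operand types, and `rank_by_type` are assumptions made for this illustration; the slides do not show how Snorlax represents types. Candidates whose operand type equals the failing instruction's operand type move to the front, and nothing is discarded.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical candidate record; types are compared as plain strings here.
struct Candidate {
    std::string text;          // printable instruction
    std::string operand_type;  // e.g. "%Queue*" or "i32*"
};

// Moves type-matching candidates ahead of the rest while preserving the
// relative order inside each group. No candidate is dropped.
std::vector<Candidate> rank_by_type(std::vector<Candidate> candidates,
                                    const std::string& failing_type) {
    std::stable_partition(candidates.begin(), candidates.end(),
                          [&failing_type](const Candidate& c) {
                              return c.operand_type == failing_type;
                          });
    return candidates;
}
```

For the slide's example, ranking against the failing load's `%Queue*` operand puts `store %Queue* %1, %q` ahead of `store i32* %21, %bufSize`.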

Lazy Diagnosis

Hybrid Points-to Analysis

Type-based Ranking

Bug Pattern Computation

Statistical Diagnosis

Slide20

Lazy Diagnosis

Hybrid Points-to Analysis

Type-based Ranking

Bug Pattern Computation

Statistical Diagnosis

Slide21

Bug Pattern Computation

Ranked candidates and the failing instruction:

store i32* %21, %bufSize

store %Queue* %1, %q

load %Queue*, %fifo    FAILURE

[Figure: two candidate patterns, Bug Pattern I and Bug Pattern II, each pairing one of the stores (store i32* %21, %bufSize or store %Queue* %1, %q) with the load %Queue*, %fifo across Thread 1 and Thread 2]

Slide22

Bug Pattern Computation

Our implementation uses timing packets in Intel Processor Trace

Granularity of a few 10s of microseconds

We measured the smallest ΔT between key events as 91 microseconds

Leverages the coarse interleaving hypothesis to establish instruction orders

Slide23
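The ordering argument on the Bug Pattern Computation slide can be sketched with a small helper. It assumes a simple model that the slides do not spell out: each coarse timestamp `t` only says that the event happened somewhere in the interval from `t` up to, but not including, `t + granularity`. The `Order` enum and `infer_order` are hypothetical names, and decoding of actual Intel Processor Trace timing packets is not modeled.

```cpp
#include <cstdint>

enum class Order { Before, After, Unknown };

// Timestamps and granularity are in microseconds. An event stamped `t` is
// assumed to have occurred in the half-open interval [t, t + granularity).
// Two events can be ordered only when their intervals do not overlap.
Order infer_order(std::uint64_t a_us, std::uint64_t b_us,
                  std::uint64_t granularity_us) {
    if (a_us + granularity_us <= b_us) return Order::Before;  // a ends before b starts
    if (b_us + granularity_us <= a_us) return Order::After;   // b ends before a starts
    return Order::Unknown;                                    // intervals overlap
}
```

With a granularity of 30 microseconds, two events stamped 91 microseconds apart are always orderable, while events stamped 20 microseconds apart are not. This is the sense in which a smallest observed ΔT of 91 microseconds makes coarse timing sufficient.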

Lazy Diagnosis

Hybrid Points-to Analysis

Type-based Ranking

Bug Pattern Computation

Statistical Diagnosis

Slide24

Lazy Diagnosis

Hybrid Points-to Analysis

Type-based Ranking

Bug Pattern Computation

Statistical Diagnosis

Slide25

[Figure: six recorded executions, each showing Thread 1: store %Queue* %1, %q and Thread 2: load %Queue*, %fifo. Two end in FAILURE (CRASH) and four end in SUCCESS.]

Statistical identification of failure-predicting patterns

Slide26
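One way to score how well a pattern predicts failure is sketched below. The slide does not name the statistic, so the F1 score used here is an assumption, and the `Run` record and `f1_score` function are hypothetical. Each run records whether the pattern was observed and whether the run failed.

```cpp
#include <vector>

// Hypothetical summary of one monitored execution.
struct Run {
    bool pattern_seen;  // the candidate pattern occurred in this run
    bool failed;        // the run ended in the failure being diagnosed
};

// F1 score of "pattern seen" as a predictor of "run failed".
// Returns 0.0 when the pattern never coincides with a failure.
double f1_score(const std::vector<Run>& runs) {
    int tp = 0, fp = 0, fn = 0;
    for (const Run& r : runs) {
        if (r.pattern_seen && r.failed) ++tp;        // predicted and failed
        else if (r.pattern_seen && !r.failed) ++fp;  // predicted but succeeded
        else if (!r.pattern_seen && r.failed) ++fn;  // failed without the pattern
    }
    if (tp == 0) return 0.0;
    double precision = static_cast<double>(tp) / (tp + fp);
    double recall = static_cast<double>(tp) / (tp + fn);
    return 2.0 * precision * recall / (precision + recall);
}
```

A pattern seen in exactly the two failing runs out of six scores 1.0, whereas a pattern seen in all six runs scores 0.5. Ranking candidate patterns by such a score separates the root cause from orderings that also occur in successful runs.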

Outline

Usage model

Design

Evaluation

Slide27

Evaluation of Snorlax

Is Snorlax effective?

Is Snorlax accurate?

Is Snorlax efficient?

How does Snorlax compare to its competition?

Slide28

Experimental Setup

Real-world C/C++ programs

11 concurrency bugs

Workloads from the programs’ test cases and test cases by other researchers

Slide29

Snorlax’s Effectiveness

Snorlax correctly identified the root causes of 11 bugs

Determined after manual investigation of developer fixes

A single failure recurrence is enough for root cause diagnosis

In practice, for concurrency bugs, “event orders” = “root cause”

Snorlax can effectively diagnose concurrency bugs

Slide30

Snorlax’s Accuracy

[Figure: chart with axis label "Accuracy Contribution"]

All stages of Lazy Diagnosis are necessary for full accuracy

Slide31

Snorlax’s Efficiency

Snorlax has low runtime performance overhead (always below 2%)

[Figure: chart of percentage overhead; labeled value 0.97%]

Slide32

Snorlax vs. Gist

Snorlax scales better than Gist with the increasing number of application threads

[Figure: chart of percentage overhead; labeled values 39%, 3%, 1.9%, 0.9%]

Slide33

Lazy Diagnosis

Snorlax

Lazy Diagnosis Prototype

Fully Accurate Concurrency Bug Diagnosis (11 bugs in 7 systems)

Low overhead (always below 2%)

Scales well with the increasing number of threads

Leverages the coarse interleaving hypothesis

Hybrid dynamic/static root cause diagnosis technique

Michigan is hiring!