
Dongyoon - PowerPoint Presentation

Uploaded On 2016-06-26



Presentation Transcript

Slide1

Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy
University of Michigan, Ann Arbor

Chimera: Hybrid Program Analysis for Determinism

* Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html

Slide2

Deterministic Replay

Goal: record and reproduce multithreaded execution
- Debugging concurrency bugs
- Offline heavyweight dynamic analysis
- Forensics and intrusion detection
- … and many more uses

Problem
- Multithreaded record-and-replay is too slow (>2x) or requires custom hardware

Slide3

Multithreaded Record-and-Replay is Slow

[Figure: timelines of Thread 1, Thread 2, and Thread 3 with Write, Write, and Read operations on shared memory.]

- Checkpoint memory and register state
- Log non-deterministic program input (interrupts, I/O values, DMA, etc.)
- Log shared memory dependencies

Slide4

Replay for Data-Race-Free Programs is Cheap

Data-race-free programs
- Shared memory accesses are well ordered by synchronization ops.
- Recording happens-before order of sync. ops. is sufficient

Problem: Programs with data races

[Figure: threads T1, T2, and T3 perform the memory operations X=0, Y=0, X=1, Y=1, Y=2, Z=1, X=2, Z=2 and the synchronization operations Unlock(l), Lock(l), Unlock(l), Signal(c), Wait(c). Arrows show the order of mem. ops. and the order of sync. ops.]

Slide5
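The point made under "Replay for Data-Race-Free Programs is Cheap" (recording the order of synchronization operations is enough) can be illustrated with a small replay gate. This is only a sketch under stated assumptions: `replay_log_t` and `replay_may_acquire` are hypothetical names, the log holds one thread id per recorded lock acquisition, and a real system would block the thread instead of returning 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical replay log: thread ids in the order they acquired a lock
   during recording. Not taken from Chimera's implementation. */
typedef struct {
    const int *order;   /* recorded thread ids, in acquisition order */
    size_t len;         /* number of recorded acquisitions */
    size_t next;        /* index of the next acquisition to replay */
} replay_log_t;

/* Returns 1 and advances the log if it is `tid`'s turn to acquire.
   Returns 0 if another thread must go first or the log is exhausted. */
static int replay_may_acquire(replay_log_t *rl, int tid) {
    assert(rl->next <= rl->len);
    if (rl->next == rl->len || rl->order[rl->next] != tid)
        return 0;
    rl->next++;
    return 1;
}
```

During replay, each thread would call the gate before a synchronization operation and wait until it returns 1, which reproduces the recorded happens-before order.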

Our Contribution: A Hybrid Analysis

[Figure: a potentially racy program P is transformed into a data-race-free program P’.]

- Sound static data race analysis: add synchronizations for potential data races
- Problem: too many false positives
- Chimera adds profiling of non-concurrent code regions and symbolic bounds analysis

Slide6

Roadmap

- Motivation
- Chimera Analysis
  - Static data race analysis
  - Profiling non-concurrent code regions
  - Symbolic bounds analysis
- Weak-lock Design
- Evaluation
- Conclusion

Slide7

Roadmap

Slide8

Static Data Race Analysis

- Find potential data-races using a sound static data race detector
  - RELAY [Voung et al., FSE’07]
- Protect all potential data-races using weak-locks
  - A new time-out lock which may be preempted (discussed later)
- Record and replay the happens-before order of weak-locks

Slide9

Protect Potential Races using Weak-locks

- Potential racy-pairs: the writes to X, and the writes to Y[tid][i], in foo() and bar()
- No race report for Z: static analysis helps avoid instrumentation for access to Z

void foo() {
  X = 0;
  for (i = ...) {
    Y[tid][i] = 0;
  }
}

void bar() {
  X = 1;
  for (i = …) {
    Y[tid][i] = 1;
    Z = 1;
  }
}

Slide10
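As a rough illustration of "Protect Potential Races using Weak-locks", the sketch below guards only the racy writes to X in foo() and bar() and appends each acquisition to a log. It is a single-threaded model under stated assumptions: `weak_lock_t`, `wl_acquire`, `wl_release`, and `wl_log` are hypothetical names rather than Chimera's API, the thread id is passed explicitly, the accesses to Y and Z are left out, and waiting and time-outs are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical weak-lock; names and layout are illustrative only. */
#define WL_LOG_CAP 64

typedef struct {
    int held;   /* 1 while some thread owns the lock */
    int owner;  /* thread id of the current owner */
} weak_lock_t;

/* Recorded happens-before order: the thread id of each acquisition. */
static int    wl_log[WL_LOG_CAP];
static size_t wl_log_len = 0;

static void wl_acquire(weak_lock_t *l, int tid) {
    assert(!l->held);               /* a real lock would wait here */
    assert(wl_log_len < WL_LOG_CAP);
    l->held = 1;
    l->owner = tid;
    wl_log[wl_log_len++] = tid;     /* record the order for replay */
}

static void wl_release(weak_lock_t *l, int tid) {
    assert(l->held && l->owner == tid);
    l->held = 0;
}

static int X;                       /* shared variable in a potential racy-pair */
static weak_lock_t wl_X = {0, 0};

/* foo() and bar() with the write to X wrapped in an instruction-level lock. */
static void foo(int tid) {
    wl_acquire(&wl_X, tid);
    X = 0;
    wl_release(&wl_X, tid);
}

static void bar(int tid) {
    wl_acquire(&wl_X, tid);
    X = 1;
    wl_release(&wl_X, tid);
}
```

Calling foo(1) and then bar(2) leaves two entries in `wl_log`, which is the information a replayer would need to reproduce the order of the two writes.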

Sources of False Positives in RELAY

Sound data-race detector reports too many false data-races
- 53x overhead

Source 1: Non-mutex synchronizations are ignored
- Lockset based analysis ignores fork-join, barrier, signal-wait, etc.
- May report a false data-race between memory instructions that can never execute concurrently
- Solution: Profiling non-concurrent code regions

Source 2: Conservative pointer analysis
- Overestimates the variables accessed by a memory instruction
- May report a false data-race between memory instructions that can never access the same location
- Solution: Symbolic bounds analysis

Slide11

Roadmap

Slide12

Profiling Non-concurrent Code Regions

Problem
- Lockset based analysis ignores non-mutex synchronization ops.

Solution
- Profile non-concurrent code regions (e.g., functions)
- Increase the granularity of weak-locks to protect a larger code region instead of each potential racy instruction
- Parallelism is preserved unless mis-profiled

[Figure: T1 runs foo() before a BARRIER and T2 runs bar() after the BARRIER; the race reported between them is a False Race.]

Slide13

Function-Level Weak-Locks

- Used if the profiler says foo() and bar() are not likely to run concurrently

[Figure: foo(), BARRIER, BARRIER, bar(); the reported race is a False Race.]

void foo() {
  X = 0;
  for (i = …) {
    Y[tid][i] = 0;
  }
}

void bar() {
  X = 1;
  for (i = …) {
    Y[tid][i] = 1;
    Z = 1;
  }
}

Slide14

Roadmap

Slide15

Imprecision in Conservative Pointer Analysis

[Figure: T1 and T2 timelines with BARRIERs; foo() and bar() may run concurrently.]

Slide16

Imprecision in Conservative Pointer Analysis

RELAY uses Steensgaard’s and Andersen’s pointer analysis
- Flow-Insensitive and Context-Insensitive (FICI) analysis
- Naming heap objects is conservative
- Overestimates the variables accessed by a memory instruction

void foo() {
  …
  for (i = 0 to N) {
    Y[tid][i] = 0;
    …
  }
}

void bar() {
  …
  for (i = 0 to N) {
    Y[tid][i] = 1;
  }
}

[Figure: Thread 1 and Thread 2 access Y[][]; the potential racy-pair on Y is a False Race.]

Slide17

Symbolic Bounds Analysis

Our Solution
- Derive the symbolic lower and upper bounds that a racy code region (e.g., loops) may access [Rugina and Rinard, PLDI’00]
- Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression
- Parallelism is preserved if the bounds are precise enough

void foo() {
  …
  for (i = 0 to N) {
    Y[tid][i] = 0;
  }
  …
}

Bounds: &Y[tid][0] to &Y[tid][N]
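To make the bounds above concrete, the following sketch evaluates the symbolic expression &Y[tid][0] to &Y[tid][N] at loop entry and checks whether two such ranges overlap. It rests on assumptions not stated in the slides: the loop is taken to include i = N (so each row has N + 1 elements), and `addr_range_t`, `foo_loop_bounds`, `ranges_overlap`, `NTHREADS`, and the value of `N` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 4   /* assumed number of threads, for illustration */
#define N 8          /* assumed loop bound, for illustration */

static int Y[NTHREADS][N + 1];   /* N + 1 columns so that i = 0..N is in range */

/* Inclusive address range [lo, hi] that a code region may access. */
typedef struct {
    uintptr_t lo;
    uintptr_t hi;
} addr_range_t;

/* Evaluate the symbolic bounds &Y[tid][0] .. &Y[tid][N] for one thread. */
static addr_range_t foo_loop_bounds(size_t tid) {
    assert(tid < NTHREADS);
    addr_range_t r;
    r.lo = (uintptr_t)&Y[tid][0];
    r.hi = (uintptr_t)&Y[tid][N];
    return r;
}

/* Two regions can only conflict if their address ranges overlap. */
static int ranges_overlap(addr_range_t a, addr_range_t b) {
    return a.lo <= b.hi && b.lo <= a.hi;
}
```

With distinct `tid` values the ranges are disjoint rows of Y, so loops in different threads would not need to exclude each other, which is how precise bounds preserve parallelism.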

Slide18

Loop-level Weak-locks

Symbolic bounds: &Y[tid][0] ~ &Y[tid][N]

[Figure: the loops in foo() and bar() are each bracketed by weak-lock operations annotated with the range (&Y[tid][0], &Y[tid][N]).]

void foo() {
  X = 0;
  for (i = 0 to N) {
    Y[tid][i] = 0;
  }
}

void bar() {
  X = 1;
  for (i = 0 to N) {
    Y[tid][i] = 1;
    Z = 1;
  }
}

Slide19
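One way to picture the loop-level weak-locks from "Loop-level Weak-locks" is a lock table keyed by address range: an acquisition succeeds only if no other thread holds an overlapping range. This is a single-threaded sketch of the bookkeeping only, not Chimera's runtime. `held_range_t`, `loop_lock_try_acquire`, `loop_lock_release`, and `MAX_HELD` are hypothetical names, and logging and time-outs are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HELD 16   /* assumed table size, for illustration */

typedef struct {
    uintptr_t lo, hi;   /* inclusive address range protected by the lock */
    int tid;            /* thread that holds this range */
    int in_use;         /* 1 if the slot describes a held range */
} held_range_t;

static held_range_t held[MAX_HELD];

/* Try to take a loop-level lock for [lo, hi]. Returns 1 on success and
   0 if another thread currently holds an overlapping range. */
static int loop_lock_try_acquire(int tid, uintptr_t lo, uintptr_t hi) {
    assert(lo <= hi);
    int free_slot = -1;
    for (int i = 0; i < MAX_HELD; i++) {
        if (!held[i].in_use) {
            if (free_slot < 0) free_slot = i;
            continue;
        }
        if (held[i].tid != tid && held[i].lo <= hi && lo <= held[i].hi)
            return 0;   /* overlapping range held by another thread */
    }
    assert(free_slot >= 0);   /* sketch: table assumed large enough */
    held[free_slot] = (held_range_t){lo, hi, tid, 1};
    return 1;
}

/* Release the range previously acquired by `tid` with the same bounds. */
static void loop_lock_release(int tid, uintptr_t lo, uintptr_t hi) {
    for (int i = 0; i < MAX_HELD; i++) {
        if (held[i].in_use && held[i].tid == tid &&
            held[i].lo == lo && held[i].hi == hi) {
            held[i].in_use = 0;
            return;
        }
    }
}
```

Threads whose bounds cover different rows of Y acquire their ranges independently, while imprecise bounds such as -INF to +INF would overlap with everything and serialize the loops.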

Imprecise Symbolic Bounds

Sources
- Depend on the value computed inside the code region
- Depend on arithmetic operations not supported in the analysis
  - e.g., modulo operations, logical AND/OR, etc.

Choosing the optimal granularity
- If bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks for parallelism

void qux() {
  …
  for (i = 0 to N) {
    prev = Z[prev];
  }
  …
}

Bounds: -INF to +INF

Slide20

Roadmap

Slide21

Deadlock due to Weak-locks

- No deadlocks between weak-locks
  - function-level > loop-level > instruction-level
- Deadlock between weak-locks and original sync. ops. is possible

[Figure: timelines of T1 (… wait(cv)) and T2 (signal(cv)), with a "Time-out !!" marker.]

Slide22

Weak-lock Time-out

A weak-lock might time-out
- Invoke a special system call to handle it

Weak-lock guarantee
- Only one thread holds a given weak-lock at any given time
- Mutual exclusion may be compromised, but this is sufficient for replay
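The guarantee above can be modelled with logical timestamps. In this sketch a waiter that exceeds the time-out takes ownership from the current holder, and the hand-over is logged, so there is still exactly one owner at any time even though the preempted thread's critical section is no longer exclusive. The hand-over policy is an assumption: the slides only say that a special system call handles the time-out. All names (`weak_lock_t`, `wl_try_acquire`, `wl_still_owner`, `wl_events`) are hypothetical, and the model tracks a single waiter per lock.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER (-1)
#define LOG_CAP 32

typedef enum { WL_ACQUIRED, WL_WAITING, WL_PREEMPTED_OWNER } wl_result_t;

typedef struct {
    int  owner;        /* NO_OWNER or the single current holder */
    long wait_start;   /* logical time the waiter started, -1 if none */
} weak_lock_t;

/* Logged order of weak-lock ownership changes. */
typedef struct { int tid; int forced; } wl_event_t;

static wl_event_t wl_events[LOG_CAP];
static size_t     wl_nevents = 0;

static void wl_record(int tid, int forced) {
    assert(wl_nevents < LOG_CAP);
    wl_events[wl_nevents].tid = tid;
    wl_events[wl_nevents].forced = forced;
    wl_nevents++;
}

/* One acquisition attempt by `tid` at logical time `now`. */
static wl_result_t wl_try_acquire(weak_lock_t *l, int tid, long now, long timeout) {
    if (l->owner == NO_OWNER) {
        l->owner = tid;
        l->wait_start = -1;
        wl_record(tid, 0);
        return WL_ACQUIRED;
    }
    if (l->wait_start < 0)
        l->wait_start = now;          /* start timing this wait */
    if (now - l->wait_start < timeout)
        return WL_WAITING;
    /* Time-out: ownership moves to the waiter (assumed policy), so the
       lock still has exactly one holder, and the event is logged. */
    l->owner = tid;
    l->wait_start = -1;
    wl_record(tid, 1);
    return WL_PREEMPTED_OWNER;
}

/* A preempted thread can check whether it still owns the lock. */
static int wl_still_owner(const weak_lock_t *l, int tid) {
    return l->owner == tid;
}
```

Because forced hand-overs appear in the log alongside normal acquisitions, a replayer could repeat them at the same points, which is why weakened mutual exclusion can still be enough for replay.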

[Figure: timelines of T2 (signal(cv)) and T1 (wait(cv)) with a "Time-out !!" marker, "Current owner" labels, and the logged order of weak-locks.]

Slide23

Roadmap

Slide24

Implementation

Source-to-source instrumentation
- Implemented in OCaml using CIL as a front end

Static analysis
- Data race detection: RELAY [Voung et al., FSE’07]
  - Include all library source code for soundness (uClibc’s libc, libm, etc.)
- Symbolic bounds analysis: [Rugina and Rinard, PLDI’00]
  - Intra-procedural analysis for racy loops only

Runtime system
- Modified Linux kernel to record/replay program input
- Modified pthread library to record/replay happens-before order of original synchronization operations and weak-locks

Slide25

Evaluation Setup

Test Environment
- 2.66 GHz 8-core Xeon processor with 4 GB of RAM
- Different set of inputs for profiling and performance evaluation
- Average of five trials with 4 worker threads
  - 2, 4, 8 threads for scalability results

Benchmarks
- Desktop applications: aget, pfscan, and pbzip2
- Server programs: knot and apache
- SPLASH-2 suite: ocean, water-nsq, fft, and radix

Slide26

Record and Replay Performance

- Recording overhead: 39% on average
- Replay overhead: similar to recording; much lower for I/O intensive programs

[Figure: slowdown chart with callouts at 2.4% slowdown, 86% slowdown, and 39%.]

Slide27

Effectiveness of Coarse-grained Weak-locks

[Figure callout: 53x]

Slide28

Effectiveness of Coarse-grained Weak-locks

- Coarse-grained weak-locks reduce the cost of instrumentation

Slide29

Effectiveness of Coarse-grained Weak-locks

- Coarse-grained weak-locks reduce the cost of instrumentation
- Exception: control-flow dependency (e.g., pfscan)

Slide30

Effectiveness of Coarse-grained Weak-locks

- Coarse-grained weak-locks reduce the cost of instrumentation
- Exception: control-flow dependency (e.g., pfscan)

Slide31

Effectiveness of Coarse-grained Weak-locks

- Coarse-grained weak-locks reduce the cost of instrumentation
- Exception: control-flow dependency (e.g., pfscan)

[Figure callout: 1.39x]

Slide32

Breakdown of Recording Overhead

Weak-lock overhead = contention (waiting) cost + logging cost

[Figure legend: func locks, loop locks, instr/bb locks, sync op & system log]

Slide33

Breakdown of Recording Overhead

Weak-lock overhead = contention (waiting) cost + logging cost

[Figure legend: func wait, loop wait, instr/bb wait, func log, loop log, instr/bb log, sync op & system log]

- High loop-lock contention
- High instr/bb-lock contention

Slide34

Scalability

- Scientific applications scale worse due to imprecise symbolic bounds analysis

Slide35

Conclusion

Goal: Software-only deterministic multiprocessor replay systems

Chimera Analysis
- Static data race analysis
  - Find and protect potential data races with weak-locks
  - Instruction/basic-block-level weak-locks
- Profiling non-concurrent code regions
  - Address the inadequacy of lockset-based algorithm
  - Function-level weak-locks
- Symbolic bounds analysis
  - Address the imprecision of conservative pointer analysis
  - Loop-level weak-locks

Low Recording Overhead
- 39% recording overhead for 4 worker threads

Slide36

Thank you