
Dongyoon - PowerPoint Presentation

Uploaded On 2016-12-11

Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn (University of Michigan, Ann Arbor). Respec: Efficient Online Multiprocessor Replay.




Presentation Transcript

Slide1

Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism

Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn
University of Michigan, Ann Arbor

Slide2

Deterministic Replay

Record and reproduce non-deterministic events.

1) Offline uses: replay repeatedly after the original run
- Debugging
- Forensics

2) Online uses: record and replay concurrently
- Fault tolerance
- Decoupled runtime checks

We focus on online replay for multiprocessors.

Slide3

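Online Deterministic Replay Uses

[Figure: Fault tolerance. The server handles each Request and Response and sends a log to a replica. The replica replays the log to keep the same state and takes over when the server faults.]

[Figure: Decoupled runtime checks. The application (App) runs on P1 while replays with checks (A + Check, B + Check, C + Check) run on P2, P3, and P4.]

- Need to record and replay concurrently
- Both recording and replaying should be efficient

Slide4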

Past Solutions for Deterministic Replay

Uniprocessor replay records:
- Program input (e.g. system calls, signals)
- Thread scheduling

Multiprocessor replay must also record shared memory dependencies. Past approaches and their costs:
- Instrument every memory operation: PinSEL [Pereira, IISWC’08], iDNA [Bhansali, VEE’06] → 10-100x slowdown
- Page protection: SMP-ReVirt [Dunlap, VEE’08] → 2-9x slowdown
- Offline search: ODR [Altekar, SOSP’09], PRES [Park, SOSP’09], Replay-SAT [Lee, MICRO’09] → slow replay
- Hardware support: FDR [Xu, ISCA’03], Strata [Narayanasamy, ASPLOS’06], ReRun [Hower, ISCA’08], DeLorean [Montesinos, ISCA’08] → custom hardware

Slide5

Overview of Our Approach

Goal: efficient online software-only multiprocessor replay

Key idea: speculation + check
1) Speculate that the program is data-race free
2) Detect mis-speculation using a cheap check
3) Roll back and retry on mis-speculation

[Figure: at Checkpoint A, a multi-threaded fork creates the replayed process (threads T1’, T2’, state A’) from the recorded process (threads T1, T2). Both processes run Lock(l), Unlock(l), Lock(l) in the same order while speculating that the epoch is race free. At Checkpoint B, a check compares B’ == B.]

Slide6

Roadmap

- Motivation/Overview
- Respec Design
  - Speculate data race free
  - Detect mis-speculation
  - Rollback and retry on mis-speculation
- Evaluation
- Conclusion

Slide7

Deterministic Replay of Data-race-free Programs

Observation: reproducing program input and the happens-before order of synchronization operations guarantees deterministic replay of data-race-free programs [Ronsse and Bosschere ’99].

1) Program input (e.g. system calls, signals)
- Record: log system call effects
- Replay: emulate system calls

2) Synchronization operations
- Record and replay the happens-before order
- Instrument common (not all) synchronization primitives in glibc

[Slide annotations: "+ total order", shown twice]

Slide8
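
As an illustration of the data-race-free replay idea on the Deterministic Replay of Data-race-free Programs slide, here is a minimal Python sketch of recording and then enforcing the acquisition order of one lock. The class names `RecordingLock` and `ReplayLock` and the helper `run` are hypothetical; Respec itself instruments glibc primitives, which this sketch does not attempt to model.

```python
import threading


class RecordingLock:
    """Lock wrapper that logs which thread acquired it, in order."""

    def __init__(self):
        self._lock = threading.Lock()
        self.log = []  # thread names, in acquisition order

    def __enter__(self):
        self._lock.acquire()
        self.log.append(threading.current_thread().name)
        return self

    def __exit__(self, *exc):
        self._lock.release()


class ReplayLock:
    """Lock wrapper that admits threads only in a previously logged order."""

    def __init__(self, log):
        self._cond = threading.Condition()
        self._log = list(log)
        self._next = 0      # index of the next logged acquisition
        self._held = False

    def _my_turn(self):
        name = threading.current_thread().name
        return (not self._held and self._next < len(self._log)
                and self._log[self._next] == name)

    def __enter__(self):
        with self._cond:
            self._cond.wait_for(self._my_turn)
            self._held = True
        return self

    def __exit__(self, *exc):
        with self._cond:
            self._held = False
            self._next += 1
            self._cond.notify_all()


def run(lock, names):
    """Start one thread per name; each appends its name under the lock."""
    order = []

    def worker():
        with lock:
            order.append(threading.current_thread().name)

    threads = [threading.Thread(target=worker, name=n) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


recording = RecordingLock()
recorded_order = run(recording, ["T1", "T2", "T3"])
# Start the replay threads in a different order; the log still decides.
replayed_order = run(ReplayLock(recording.log), ["T3", "T2", "T1"])
print(recorded_order == replayed_order)
```

Because the only shared data is touched under the lock, enforcing the logged acquisition order is enough to reproduce the result in this toy program.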

What if a program is NOT race free?

Problem
- Need to detect mis-speculation
- A data race detector is too heavyweight

Insight: external determinism is sufficient
- It is not necessary to replay data races
- Ensure that the replayed process produces the same visible effects as the recorded process to an external observer
- Visible effects = system output + final program state

Solution: divergence checks
- Detect mis-speculation when the replay is not externally deterministic

Slide9

Divergence Check #1 – System Output

1) System output check
- For every system call, compare the system call arguments
- Ensure that the replay produces the same output as the recorded process

[Figure: the recorded process (T1, T2, Start A) and the replayed process (T1’, T2’, Start A’) each run Lock(l), Unlock(l), Lock(l). The recorded process issues SysRead X and SysWrite O; the replayed process issues SysRead X’ and SysWrite O’. The check compares O’ == O.]
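
The system output check above can be sketched as a replay-side system call layer that compares each call against the recorded log and returns the logged effect instead of performing the call. This is a simplified Python illustration; `SyscallRecord`, `ReplaySyscalls`, and `DivergenceError` are hypothetical names, and the two-entry log is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyscallRecord:
    name: str
    args: tuple
    result: object = None


class DivergenceError(Exception):
    """Raised when the replay issues a system call the log does not expect."""


class ReplaySyscalls:
    def __init__(self, log):
        self._pending = list(log)

    def call(self, name, *args):
        if not self._pending:
            raise DivergenceError(f"unexpected extra call {name}{args}")
        expected = self._pending.pop(0)
        if (expected.name, expected.args) != (name, args):
            raise DivergenceError(f"expected {expected}, got {name}{args}")
        return expected.result  # emulate the call using the logged effect


log = [
    SyscallRecord("read", (0, 4), b"ping"),
    SyscallRecord("write", (1, b"PING"), 4),
]

# A faithful replay: same arguments, so the output check passes.
sys_ok = ReplaySyscalls(log)
data = sys_ok.call("read", 0, 4)
written = sys_ok.call("write", 1, data.upper())

# A divergent replay: the write argument differs from the recorded one.
sys_bad = ReplaySyscalls(log)
sys_bad.call("read", 0, 4)
try:
    sys_bad.call("write", 1, b"pong")
    diverged = False
except DivergenceError:
    diverged = True
```

In this sketch a mismatch raises an exception; in the design described in the slides, a failed check triggers rollback instead.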

Slide10

Benign Data Races

- Not all races cause divergence checks to fail
- A data race is inconsequential if the system output matches

[Figure: after a multi-threaded fork from Start A to Start A’, T1 writes x=1 while T2 polls x!=0? in the recorded process, and T1’ and T2’ do the same in the replayed process. Both processes issue the same SysWrite(x), so the check succeeds.]

Slide11

Divergence due to Data Races

[Figure: after a multi-threaded fork from Start A to Start A’, the recorded process (T1, T2) and the replayed process (T1’, T2’) execute the racing writes x=1 and x=2 in different orders, so SysWrite(x) differs and the check fails.]

1) Need to roll back to the beginning
2) Need to buffer system output until the end

Slide12

Divergence Check #2 – Program State

2) Program state check
- Compare register and memory state at semi-regular intervals (epochs)
- Construct a safe intermediate point
  - To release buffered output
  - To roll back to in case of mis-speculation

[Figure: the recorded process (T1, T2, Start A) and the replayed process (T1’, T2’, Start A’) run in epochs. When the state check at the end of an epoch succeeds, buffered output is released. In a later epoch the racing writes x=1 and x=2 cause SysWrite(x) to differ, and the check B’ == B at Checkpoint B fails.]
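
A rough sketch of the program state check at an epoch boundary: digest the register state and the modified memory pages of both processes and compare the digests. The page size, the use of SHA-256, the dictionary of registers, and the function names are all assumptions made for this Python illustration, not details taken from the slides.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def state_digest(registers, memory, dirty_pages):
    """Hash the registers plus only the pages listed in dirty_pages."""
    h = hashlib.sha256()
    for name in sorted(registers):
        h.update(f"{name}={registers[name]};".encode())
    for page in sorted(dirty_pages):
        start = page * PAGE_SIZE
        h.update(page.to_bytes(8, "little"))
        h.update(memory[start:start + PAGE_SIZE])
    return h.hexdigest()


def epoch_check(recorded, replayed):
    """Each argument is (registers, memory, dirty_pages); True means commit."""
    dirty = recorded[2] | replayed[2]  # a page dirtied by either side counts
    return (state_digest(recorded[0], recorded[1], dirty)
            == state_digest(replayed[0], replayed[1], dirty))


def make_state(x):
    memory = bytearray(2 * PAGE_SIZE)
    memory[5] = x  # the racy variable lives on page 0
    return ({"pc": 100}, memory, {0})


same = epoch_check(make_state(2), make_state(2))      # states match: commit
diverged = epoch_check(make_state(2), make_state(1))  # states differ: roll back
```

Taking the union of both dirty sets matters: a page written by only one process must still be compared, otherwise a divergence on that page would go unnoticed.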

Slide13

Recovery from Mis-speculation

Rollback
- Roll back both the recorded and replayed processes to the previous checkpoint

Re-execute
- Optimistically re-run the failed epoch
- On repeated failure, switch to a uniprocessor execution model
  - Record and replay only one thread at a time
  - Parallel execution resumes after the failed interval

[Figure: both processes start from Checkpoint A (A == A’). The racing writes x=1 and x=2 execute in different orders, so the check B’ == B fails at Checkpoint B. Both processes roll back to Checkpoint A and re-execute; the check B’ == B then passes and execution continues to the check C’ == C at Checkpoint C.]
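
The recovery policy above can be sketched as a small retry loop. This is a toy Python model, not the kernel implementation: `execute_epoch`, its callback parameters, and the limit of two parallel attempts are assumptions (the slide says only "repeated failure").

```python
def execute_epoch(run_parallel, run_uniprocessor, max_parallel_attempts=2):
    """Run one epoch and return (mode, committed_state).

    run_parallel() restarts both processes from the last checkpoint and
    returns (recorded_state, replayed_state) at the end of the epoch.
    run_uniprocessor() restarts from the same checkpoint, runs one thread
    at a time, and returns the single resulting state.
    """
    for attempt in range(1, max_parallel_attempts + 1):
        recorded, replayed = run_parallel()
        if recorded == replayed:            # divergence check passed
            return f"parallel attempt {attempt}", recorded
        # Check failed: both states are discarded (rollback) and we retry.
    return "uniprocessor", run_uniprocessor()


# Toy epoch: a race on x makes the two processes disagree twice in a row.
outcomes = iter([({"x": 2}, {"x": 1}), ({"x": 1}, {"x": 2})])
mode, state = execute_epoch(lambda: next(outcomes), lambda: {"x": 2})

# Toy epoch where the optimistic re-run succeeds.
outcomes2 = iter([({"x": 2}, {"x": 1}), ({"x": 2}, {"x": 2})])
mode2, state2 = execute_epoch(lambda: next(outcomes2), lambda: {"x": 2})
```

Falling back to one thread at a time guarantees progress, because a serialized epoch cannot diverge on a data race.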

Slide14

Speculative Execution

Speculator [Nightingale et al. SOSP’05]
- Buffer output during speculation
- Block execution if speculative execution is not feasible
- Release buffered output on commit
- Undo speculative changes and squash buffered output on mis-speculation

Slide15

Roadmap

- Motivation/Overview
- Respec Design
- Evaluation
  - Performance results
  - Breakdown of performance overhead
  - Rollback frequency and overhead
- Conclusion

Slide16

Evaluation Setup

Test environment
- 2 GHz 8-core Xeon processor with 3 GB of RAM
- Run 1 to 4 worker threads (excluding control threads)
- Report the average of 10 trials (except pbzip2 and aget)

Benchmarks
- PARSEC suite: blackscholes, bodytrack, fluidanimate, swaptions, streamcluster
- SPLASH-2 suite: ocean, raytrace, volrend, water-nsq, fft, and radix
- Real applications: pbzip2, pfscan, aget, and Apache

Slide17

Record and Replay Performance

- Overhead: 18% for 2 threads, 55% for 4 threads
- Real applications (including Apache) showed <50% overhead for 4 threads

Slide18

1) Redundant Execution Overhead (25%)

- Cost of running two executions (a lower bound for online replay)
- Mainly due to sharing limited resources (the memory system)
- Contributes 25% of the total cost for 4 threads

[Chart legend: Redundant execution overhead (25%)]

Slide19

2) Epoch Overhead (17%)

- Due to checkpoint cost
- Due to artificial epoch barrier cost
- Contributes 17% of the total cost for 4 threads

[Chart legend: Epoch overhead (17%), Redundant execution overhead (25%)]

Slide20

3) Memory Comparison Overhead (16%)

- Optimization 1: compare dirty pages only
- Optimization 2: parallelize the comparison
- Contributes 16% of the total cost for 4 threads

[Chart legend: Memory comparison overhead (16%), Epoch overhead (17%), Redundant execution overhead (25%)]

Slide21

4) Logging Overhead (42%)

- Overhead of logging synchronization operations and system calls
- The main cost for applications with fine-grained synchronization
- Contributes 42% of the total cost for 4 threads

[Chart legend: Logging and other overhead (42%), Memory comparison overhead (16%), Epoch overhead (17%), Redundant execution overhead (25%)]

Slide22

Rollback Frequency and Overhead

App.               Threads  Rollback Frequency  Overhead  Avg. Overhead
Pbzip2 (100 runs)  4        84% none            41%       45%
                            15% once            66%
                            1% twice            105%
Aget (50 runs)     4        80% none            6%        6%
                            18% once            6%
                            2% twice            6%

- Pbzip2 (16%) and Aget (20%) invoke one or more rollbacks
- Pbzip2: rollbacks contribute <10% of total overhead
- Aget: rollback overhead is negligible (frequent checkpoints => short epochs => small amount of work to be redone)
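
The average overhead figures for the Pbzip2 and Aget rows above are consistent with a frequency-weighted mean of the per-case overheads. The short Python check below assumes that is how the averages were obtained, which the slide does not state; the variable and function names are illustrative.

```python
# (fraction of runs, overhead in percent) for zero, one, and two rollbacks
pbzip2 = [(0.84, 41), (0.15, 66), (0.01, 105)]
aget = [(0.80, 6), (0.18, 6), (0.02, 6)]


def average_overhead(rows):
    """Frequency-weighted mean overhead, in percent."""
    return sum(fraction * overhead for fraction, overhead in rows)


pbzip2_avg = round(average_overhead(pbzip2))  # 34.44 + 9.90 + 1.05 = 45.39
aget_avg = round(average_overhead(aget))
print(pbzip2_avg, aget_avg)
```

Under that assumption the script reproduces the reported 45% for Pbzip2 and 6% for Aget.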

Slide23

Conclusion

Goal: deterministic replay for multithreaded programs
- Software-only: no custom hardware
- Online: record and replay concurrently

Contributions to replay
- Speculation: speculate race free, and roll back and retry if needed
- External determinism: match system output and program states

Results
- Performance overhead when recording and replaying concurrently: 18% for 2 threads, 55% for 4 threads

Thank you

Slide24

Thank you

Slide25

Benign Data Races

- Benign data races could cause frequent rollbacks
  - This is a performance (NOT correctness) issue
- The latest Java and C++ memory models prohibit benign races => there are only harmful races [Manson et al. POPL’05], [Boehm et al. PLDI’08]
- Programmers should explicitly annotate intentionally racy variables (e.g. handcrafted synchronization) using the volatile/atomic keywords
  - Such variables could be detected and instrumented automatically

Slide26

Implementation

Modify the Linux 2.6.27 kernel
- Deterministic replay
  - Multithreaded fork
  - Record/replay program input (e.g. system calls, signals, …)
  - Compare program state (memory and register contents)
- Speculator [Nightingale et al. SOSP’05]
  - Checkpoint and rollback
  - Buffer system output or propagate speculative states

Modify glibc 2.5.1
- Support recording/replaying low-level synchronization operations, e.g. lock, unlock, futex wait, futex wake

Slide27

Handling System Calls

Replayed process
1) Emulate most system calls
- Feed the logged return value and the data copied into the process
2) Re-execute some system calls
- Create or delete threads: clone, exit, …
- Modify the address space: mmap2, mprotect, …

Problem
- Does NOT recreate most kernel state associated with the replayed process (e.g. the file descriptor table)
- The process can NOT transition from replaying to live execution

Solution
- Recreate the OS state by re-executing native/virtualized system calls: ReVirt [Dunlap et al. OSDI’02], Zap [Osman et al. OSDI’02]

Slide28

Multi-threaded Fork (Checkpoint)

Copy-on-write fork
- Linux’s fork duplicates only a single thread
- Need a new copy-on-write primitive for checkpointing multithreaded processes
- Should checkpoint a thread at a safe point: kernel entry/exit (system call)

Multi-threaded fork
1) The thread that initiates a multi-threaded fork creates a barrier on which it waits until all other threads reach a safe point
2) Once all threads reach the barrier, the initiating thread creates the checkpoint, then lets the other threads continue execution

Semi-regular checkpoints
- Adaptive epoch length: to bound the amount of work that must be redone on rollback
- Output-triggered commit: to provide acceptable latency for interactive tasks

Slide29
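
The barrier protocol from the Multi-threaded Fork (Checkpoint) slide can be sketched in user-level Python. This is only an analogy: the class `Checkpointer` and its method names are hypothetical, `copy.deepcopy` stands in for the copy-on-write fork, and the main thread plays the role of the initiating thread.

```python
import copy
import threading


class Checkpointer:
    def __init__(self, n_workers):
        parties = n_workers + 1  # the workers plus the initiating thread
        self._arrive = threading.Barrier(parties)
        self._resume = threading.Barrier(parties)

    def safe_point(self):
        """Called by a worker when it reaches a safe point."""
        self._arrive.wait()   # park until every thread has stopped
        self._resume.wait()   # continue once the checkpoint exists

    def checkpoint(self, state):
        """Called by the initiating thread; returns a snapshot of state."""
        self._arrive.wait()               # wait for all workers to park
        snapshot = copy.deepcopy(state)   # stand-in for copy-on-write fork
        self._resume.wait()               # let the workers continue
        return snapshot


state = {"T1": 0, "T2": 0}
cp = Checkpointer(n_workers=2)


def worker(name):
    state[name] = 1      # work done before the checkpoint
    cp.safe_point()
    state[name] = 2      # work done after the checkpoint


threads = [threading.Thread(target=worker, args=(n,)) for n in state]
for t in threads:
    t.start()
snapshot = cp.checkpoint(state)   # main thread plays the initiating thread
for t in threads:
    t.join()
```

Two barriers are used so that no worker can resume and modify the state until the snapshot is complete.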

Benefits of Program State Check

1) Allows Respec to commit epochs and release system output
- Buffer output during speculation
- Safe to release output on commit after matching program state

2) Reduces the amount of execution that must be redone when a check fails

3) Allows broader uses of the replay system
- Tolerating non-fail-stop faults (e.g. transient hardware faults): need to detect latent faults
- Parallelizing security and reliability checks

Slide30

Offline Replay with Respec

Respec log
- Kernel-level system calls + user-level synchronization operations
- MD5 checksum of address space and register state

Problem: not all races are logged
- Offline replay is NOT guaranteed to succeed
- Since the recorded process has been replayed successfully at least once, it is likely that offline replay will eventually succeed

Solution
- Offline replay search tools can be used, e.g. ODR [Altekar et al. SOSP’09], PRES [Park et al. SOSP’09], Replay-SAT [Lee et al. MICRO’09]

Slide31

Non-Deterministic Program Input

e.g. I/O, DMA, interrupts, signals, RDTSC, context switches, page faults

- Asynchronous interrupts (caused by external sources), e.g. I/O, timer, disk read completion
- Synchronous interrupts (= traps), e.g. arithmetic overflow exceptions, invoking system calls, page faults, TLB misses
- x86 instructions that can return non-deterministic results but do not normally trap when running in user mode, e.g. rdtsc (read timestamp counter), rdpmc (read performance monitoring counter)

Slide32

Rollback Frequency and Overhead (Pbzip2)

Threads  Rollback Frequency   Original Time (sec)  Type          Respec Time (sec)  Slowdown
1        0%                   4.59                 overall       4.83               5%
2        13% once             2.35                 w/o rollback  2.70               15%
                                                   w/ rollback   2.97               26%
                                                   overall       2.73               16%
3        9% once, 2% twice    1.64                 w/o rollback  2.00               22%
                                                   w/ rollback   2.29               40%
                                                   overall       2.03               24%
4        84% no rollback,     1.33                 w/o rollback  1.88               41%
         15% once, 1% twice                        w/ rollback   2.24               68%
                                                   overall       1.93               45%

- Out of 100 runs, 13-16% of executions invoke one or more rollbacks
- Rollbacks contribute 8% of Respec's total overhead
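
As a worked check of the 4-thread Pbzip2 figures above, the Python snippet below recomputes the slowdown column from the reported times and shows one plausible reading of the 8% figure: the share of Respec's added time that is attributable to rollbacks. That interpretation is an assumption, since the slide does not spell out the calculation.

```python
original = 1.33  # seconds, Pbzip2 with 4 threads, without Respec
respec = {"w/o rollback": 1.88, "w/ rollback": 2.24, "overall": 1.93}

# Slowdown in percent relative to the original run time.
slowdown = {kind: round(100 * (t / original - 1)) for kind, t in respec.items()}

# Share of the added time that disappears when no rollback occurs.
rollback_share = round(
    100 * (respec["overall"] - respec["w/o rollback"])
    / (respec["overall"] - original)
)
print(slowdown, rollback_share)
```

The recomputed slowdowns match the 41%, 68%, and 45% entries, and the rollback share comes out at about 8% (0.05 s of the 0.60 s added).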

Slide33

Rollback Frequency and Overhead (Aget)

Threads  Rollback Frequency   Original Time (sec)  Type          Respec Time (sec)  Slowdown
1        10% once, 2% twice   2.05                 w/o rollback  2.19               7%
                                                   w/ rollback   2.21               8%
                                                   overall       2.19               7%
2        20% once, 2% twice   1.93                 w/o rollback  2.17               13%
                                                   w/ rollback   2.17               13%
                                                   overall       2.17               13%
3        24% once             1.94                 w/o rollback  2.08               7%
                                                   w/ rollback   2.09               8%
                                                   overall       2.08               7%
4        18% once, 2% twice   1.96                 w/o rollback  2.07               6%
                                                   w/ rollback   2.08               6%
                                                   overall       2.08               6%

- Out of 50 runs, 14-24% of executions invoke one or more rollbacks
- Performance impact is negligible (due to very frequent checkpoints)