mSWAT - PowerPoint Presentation

Uploaded On 2016-07-24

Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems. Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi, Sarita Adve. Department of Computer Science.

Presentation Transcript

Slide1

mSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems

Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi, Sarita Adve

Department of Computer Science
University of Illinois at Urbana-Champaign

swat@cs.uiuc.edu

Slide2

Motivation

Hardware will fail in the field for several reasons
  Transient errors (high-energy particles)
  Wear-out (devices are weaker)
  Design bugs
  ... and so on

Need in-field detection, diagnosis, repair, and recovery

Reliability problem pervasive across many markets
  Traditional redundancy solutions (e.g., nMR) too expensive
  Need low-cost solutions for multiple failure sources
  Must incur low area, performance, and power overhead

Slide3

SWAT: Low-Cost Hardware Reliability

SWAT Observations
  Need to handle only hardware faults that propagate to software
  Fault-free case remains common, must be optimized

SWAT Approach
  => Watch for software anomalies (symptoms)
  Zero to low overhead "always-on" monitors
  Diagnose cause after symptom detected
  May incur high overhead, but rarely invoked

Slide4

SWAT Framework Components

[Figure: timeline with checkpoints: Fault -> Error -> Symptom detected, followed by Recovery, Diagnosis, and Repair]

Detectors with simple hardware [Li et al. ASPLOS’08]
µarch-level Fault Diagnosis (TBFD) [Li et al. DSN’09]

Slide5

Challenge

[Figure: timeline with checkpoints: Fault -> Error -> Symptom detected, followed by Recovery, Diagnosis, and Repair]

Detectors with simple hardware [Li et al. ASPLOS’08]
µarch-level Fault Diagnosis (TBFD) [Li et al. DSN’09]

Shown to work well for single-threaded apps
Does the SWAT approach work on multithreaded apps?

Slide6

Challenge: Data sharing in multithreaded apps

Multithreaded apps share data among threads
  Does symptom detection work?
  Symptom-causing core may not be faulty
  How to diagnose the faulty core?

[Figure: Core 1 and Core 2 share memory through a store and a load; a fault and the resulting error on one core lead to symptom detection on a fault-free core]

Slide7

Contributions

Evaluate SWAT detectors on multithreaded apps
  Low Silent Data Corruption rate for multithreaded apps
  Observed symptoms from fault-free cores

Novel fault diagnosis for multithreaded apps
  Identifies the faulty core despite error propagation
  Provides high diagnosability

Slide8

Outline

Motivation
mSWAT Detection
mSWAT Diagnosis
Results
Summary and Future Work

Slide9

mSWAT Fault Detection

SWAT Detectors:
  Low-cost monitors to detect anomalous software behavior
  Incur near-zero performance overhead in fault-free operation
  Symptom detectors provide a low Silent Data Corruption rate

Detectors:
  Fatal Traps: division by zero, RED state, etc.
  Kernel Panic: OS enters panic state due to fault
  High OS: high contiguous OS activity
  Hangs: simple HW hang detector
  App Abort: application abort due to fault

[Figure label: SWAT firmware]

Slide10

SWAT Fault Diagnosis

Rollback/replay on same/different core
Single-threaded application on multicore

[Figure: diagnosis decision tree]
Symptom detected: roll back and replay on the (possibly faulty) core
  No symptom: transient or non-deterministic s/w bug; continue execution
  Symptom: deterministic s/w or permanent h/w bug; roll back/replay on a good core
    Symptom: deterministic s/w bug, send to s/w layer
    No symptom: permanent h/w fault, needs repair!
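
The decision tree above can be written down as a short sketch. This is only an illustration under stated assumptions: the two callables `replay_on_same_core` and `replay_on_good_core` are hypothetical stand-ins for "roll back to the checkpoint, replay, and report whether the symptom recurs"; the slides do not define such an interface.

```python
from enum import Enum
from typing import Callable


class Diagnosis(Enum):
    TRANSIENT_OR_NONDETERMINISTIC_SW = "transient or non-deterministic s/w bug"
    DETERMINISTIC_SW_BUG = "deterministic s/w bug"
    PERMANENT_HW_FAULT = "permanent h/w fault"


def diagnose_single_threaded(
    replay_on_same_core: Callable[[], bool],
    replay_on_good_core: Callable[[], bool],
) -> Diagnosis:
    """Hypothetical helper: each callable returns True if the symptom recurs."""
    # Step 1: replay on the core where the symptom was seen.
    if not replay_on_same_core():
        # The symptom did not repeat, so execution can continue.
        return Diagnosis.TRANSIENT_OR_NONDETERMINISTIC_SW
    # Step 2: the symptom is repeatable; replay on a known good core.
    if replay_on_good_core():
        # Repeats even on good hardware: hand over to the software layer.
        return Diagnosis.DETERMINISTIC_SW_BUG
    # Repeats only on the original core: that core needs repair.
    return Diagnosis.PERMANENT_HW_FAULT
```

The second replay is only attempted when the first one reproduces the symptom, which matches the slide's point that diagnosis may be expensive but is rarely invoked.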

Slide11

Challenges

Rollback/replay on same/different core
Single-threaded application on multicore

[Figure: the single-threaded SWAT diagnosis decision tree (rollback on the faulty core, then rollback/replay on a good core), annotated with the challenges below]

No known good cores available
Faulty core is unknown
How to replay multithreaded apps?

Slide12

Extending SWAT Diagnosis to Multithreaded Apps

Assumptions: in-core faults, single-core fault model

Naïve extension: N known good cores to replay the trace
  Too expensive: area
  Requires full-system deterministic replay

Simple optimization: one spare core
  Not scalable, requires N full-system deterministic replays
  High hardware overhead: requires a spare core
  Single point of failure: the spare core

[Figure: cores C1, C2, C3 and a spare core shown in three replay configurations, one labeled "No Symptom Detected" and two labeled "Symptom Detected"; conclusion: faulty core is C2]

Slide13

mSWAT Diagnosis - Key Ideas

Challenges
  Multithreaded applications
  Full-system deterministic replay
  No known good core

Key Ideas
  Isolated deterministic replay
  Emulated TMR

[Figure: threads TA, TB, TC, TD on cores A, B, C, D; thread TA shown being replayed]

Slide14

mSWAT Diagnosis - Key Ideas

Challenges
  Multithreaded applications
  Full-system deterministic replay
  No known good core

Key Ideas
  Isolated deterministic replay
  Emulated TMR

[Figure: threads TA, TB, TC, TD on cores A, B, C, D; threads TA and TB shown being replayed]

Slide15

mSWAT Diagnosis - Key Ideas

Challenges
  Multithreaded applications
  Full-system deterministic replay
  No known good core

Key Ideas
  Isolated deterministic replay
  Emulated TMR

[Figure: threads TA, TB, TC, TD on cores A, B, C, D; threads TA, TB, and TC shown being replayed]

Slide16

mSWAT Diagnosis - Key Ideas

Challenges
  Multithreaded applications
  Full-system deterministic replay
  No known good core

Key Ideas
  Isolated deterministic replay
  Emulated TMR

[Figure: threads TA, TB, TC, TD on cores A, B, C, D; all four threads shown being replayed in rotated thread-to-core assignments]

Maximum 3 replays

Slide17

Multicore Fault Diagnosis Algorithm Overview

Diagnosis:
  Symptom detected
  Replay & capture fault-activating trace
  Deterministically replay captured trace

[Figure: Example: threads TA, TB, TC, TD on cores A, B, C, D]

Slide18

Multicore Fault Diagnosis Algorithm Overview

Diagnosis:
  Symptom detected
  Replay & capture fault-activating trace
  Deterministically replay captured trace

[Figure: Example: threads TA, TB, TC, TD on cores A, B, C, D, shown for capture and for replay]

Slide19

Multicore Fault Diagnosis Algorithm Overview

Diagnosis:
  Symptom detected
  Replay & capture fault-activating trace
  Deterministically replay captured trace
  Look for divergence

[Figure: Example: threads TA, TB, TC, TD captured on cores A, B, C, D, then replayed as TD, TA, TB, TC on cores A, B, C, D]

Slide20

Multicore Fault Diagnosis Algorithm Overview

Diagnosis:
  Symptom detected
  Replay & capture fault-activating trace
  Deterministically replay captured trace
  Look for divergence

[Figure: Example: threads TA, TB, TC, TD captured on cores A, B, C, D, then replayed as TD, TA, TB, TC on cores A, B, C, D. Divergences are marked, and a further replay of TA separates the fault-free cores from the faulty one.]

Faulty core is A
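
The example above can be mimicked with a toy model. The sketch below is an assumption-laden illustration, not the authors' algorithm: cores are numbered 0..N-1, each thread is replayed on the next core, and a third core breaks the tie when two executions of the same thread disagree (the "emulated TMR" idea, under the single-faulty-core assumption). The names `make_executor` and `find_faulty_core` and the toy fault model (flipping the low bit of every result) are hypothetical, and the loop does not model replays running in parallel across cores.

```python
from typing import Callable, Dict, List, Optional

Trace = List[int]
Executor = Callable[[int, List[int]], Trace]


def make_executor(faulty_core: Optional[int]) -> Executor:
    """Toy cores: a trace is the running sum of a thread's inputs.
    The faulty core (if any) flips the low bit of every result."""
    def execute(core: int, inputs: List[int]) -> Trace:
        trace, acc = [], 0
        for value in inputs:
            acc += value
            trace.append(acc ^ 1 if core == faulty_core else acc)
        return trace
    return execute


def find_faulty_core(num_cores: int, inputs: Dict[int, List[int]],
                     execute: Executor) -> Optional[int]:
    """Assumes at most one faulty core and at least three cores."""
    # Capture: thread i runs on its home core i.
    captured = {c: execute(c, inputs[c]) for c in range(num_cores)}
    for home in range(num_cores):
        other = (home + 1) % num_cores
        # Isolated replay of thread `home` on a different core.
        if execute(other, inputs[home]) != captured[home]:
            # One of {home, other} is faulty; a third core votes.
            third = (home + 2) % num_cores
            vote = execute(third, inputs[home])
            return home if vote != captured[home] else other
    return None  # no divergence observed
```

For instance, `find_faulty_core(4, inputs, make_executor(faulty_core=0))` with any non-empty per-thread input lists identifies core 0 in this toy model.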

Slide21

Digging Deeper

[Figure: Symptom detected -> Replay & capture fault-activating trace -> Isolated deterministic replay -> Look for divergence -> Faulty core]

What info to capture to enable isolated deterministic replay?
How to identify divergence?
Hardware costs?

Slide22

Enabling Isolated Deterministic Replay

[Figure: a thread whose inputs are its loads (Ld)]

Recording thread inputs is sufficient, similar to BugNet
Record the values of all retiring loads

Slide23

Digging Deeper (Contd.)

[Figure: Symptom detected -> Replay & capture fault-activating trace -> Trace Buffer -> Isolated deterministic replay -> Look for divergence -> Faulty core]

What info to capture to enable isolated deterministic replay?
How to identify divergence?
Hardware costs?

Slide24

Identifying Divergence

Comparing all instructions => large buffer requirement
Faults corrupt software through memory and control instructions
Capture memory and control instructions

[Figure: a thread with its Store, Load, and Branch instructions highlighted]
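
A minimal sketch of the two ideas, recording load values so a thread can be replayed in isolation and comparing only memory and control instructions, might look as follows. Everything here is a toy assumption for illustration: the four-operation accumulator "ISA", the `alu_fault` flag that flips one bit after every add, and the names `execute`, `replay_from_log`, and `first_divergence` are hypothetical and do not come from the slides.

```python
from typing import Callable, List, Optional, Tuple

Instr = Tuple[str, int]


def execute(program: List[Instr], read_load: Callable[[int], int],
            alu_fault: bool = False):
    """Run a toy accumulator program.
    Returns (trace of memory/control instructions, log of load values)."""
    acc, trace, load_log = 0, [], []
    for op, arg in program:
        if op == "load":
            acc = read_load(arg)
            load_log.append(acc)              # thread input: record it
            trace.append(("load", arg, acc))
        elif op == "add":
            acc += arg                        # ALU op: not traced
            if alu_fault:
                acc ^= 0x4                    # toy stuck-bit style corruption
        elif op == "store":
            trace.append(("store", arg, acc))
        elif op == "branch_gt":
            trace.append(("branch", acc > arg))
        else:
            raise ValueError(f"unknown op {op!r}")
    return trace, load_log


def replay_from_log(program: List[Instr], load_log: List[int],
                    alu_fault: bool = False):
    """Isolated replay: loads come from the log, not from shared memory."""
    values = iter(load_log)
    trace, _ = execute(program, lambda _addr: next(values), alu_fault)
    return trace


def first_divergence(a, b) -> Optional[int]:
    """Index of the first differing trace entry, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


PROGRAM = [("load", 0x10), ("add", 3), ("store", 0x20), ("branch_gt", 5)]
shared_memory = {0x10: 4}
# Capture on a (toy) faulty core while other threads may touch memory.
captured_trace, captured_loads = execute(
    PROGRAM, shared_memory.__getitem__, alu_fault=True)
shared_memory[0x10] = 99                      # another thread overwrites it
# Replay on a fault-free core using only the recorded loads.
replayed_trace = replay_from_log(PROGRAM, captured_loads)
```

In this toy run the replay still sees the originally loaded value even though shared memory has changed, and the corrupted add first becomes visible at the store, which is why tracing only loads, stores, and branches is enough to expose it here.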

Slide25

Digging Deeper (Contd.)

[Figure: Symptom detected -> Replay & capture fault-activating trace -> Trace Buffer -> Isolated deterministic replay -> Look for divergence -> Faulty core]

What info to capture to enable isolated deterministic replay?
How to identify divergence?
Hardware costs?

Slide26

Hardware Costs

[Figure: Symptom detected -> Replay & capture fault-activating trace -> Trace Buffer -> Isolated deterministic replay -> Look for divergence -> Faulty core]

Annotations on the figure:
  Native execution
  Memory-backed log
  Small hardware support
  Firmware emulation
  Minor support for firmware reliability
  Trace buffer: how big?

What if the faulty core subverts the process?
Key idea: on a divergence, two cores take over

Slide27

Trace Buffer Size

Long detection latency => large trace buffers (8MB/core)
Need to reduce the size requirement => Iterative Diagnosis Algorithm
  Repeatedly execute on short traces, e.g., 100,000 instructions

[Figure: Symptom detected -> Replay & capture fault-activating trace -> Isolated deterministic replay -> Look for divergence -> Faulty core]
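
The iterative scheme can be sketched as a simple loop over short windows. This is a hedged illustration only: `diagnose_window` is a hypothetical callback standing for one capture-and-replay pass over a window of instructions (returning the faulty core, or None if no divergence was seen), and the 20M-instruction budget is taken from the evaluation setup described later in the deck rather than from this slide.

```python
from typing import Callable, Optional, Tuple


def iterative_diagnosis(
    diagnose_window: Callable[[int, int], Optional[str]],
    window: int = 100_000,        # instructions per iteration (from the slide)
    budget: int = 20_000_000,     # assumed stop limit (evaluation setup)
) -> Tuple[Optional[str], int]:
    """Run short capture/replay passes until a divergence or the budget.
    Returns (faulty core or None, number of iterations used)."""
    executed, iterations = 0, 0
    while executed < budget:
        iterations += 1
        # Hypothetical: capture `window` instructions starting at `executed`,
        # replay them in isolation, and look for a divergence.
        culprit = diagnose_window(executed, window)
        executed += window
        if culprit is not None:
            return culprit, iterations
    return None, iterations
```

Because each pass only needs a trace for one window, the buffer is sized for the window rather than for the full detection latency, which is the motivation the slide gives for iterating.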

Slide28

Experimental Methodology

Microarchitecture-level fault injection
  GEMS timing models + Simics full-system simulation
Six multithreaded applications on OpenSolaris
  4 multimedia apps and 1 each from SPLASH and PARSEC
4-core system running 4-threaded apps
Faults in latches of 7 arch units
Permanent (stuck-at) and transient faults

Slide29

Experimental Methodology

Detection:
  [Figure: timing simulation for 10M instructions; if no symptom in 10M instructions, run to completion with functional simulation; outcome is then either fault masked or Silent Data Corruption (SDC)]
  Metrics: SDC rate, detection latency

Diagnosis:
  Iterative algorithm with 100,000 instructions in each iteration
    Until divergence or 20M instructions
  Deterministic replay is native execution
    Not firmware emulated
  Metrics: diagnosability, overheads

Slide30

Results: mSWAT Detection Summary

SDC rate: only 0.2% for permanents and 0.55% for transients
Detection latency: over 99% detected within 10M instructions

[Figure: detection results chart; label: Detected]

Slide31

Results: mSWAT Detection Summary

SDC rate: only 0.2% for permanents and 0.55% for transients
Detection latency: over 99% detected within 10M instructions

4.5% detected in a good core

Slide32

Results: mSWAT Diagnosability

Over 95% of detected faults are successfully diagnosed
All faults detected in a fault-free core are diagnosed
Undiagnosed faults: 88% did not activate faults

[Figure: diagnosability chart (%); values shown: 99, 99, 99, 86, 100, 80, 99, 95.9]

Slide33

Results: mSWAT Diagnosis Overheads

Diagnosis latency
  98% diagnosed in <10 million cycles (10 ms in a 1 GHz system)
  93% were diagnosed in 1 iteration
    Iterative approach is effective

Trace buffer size
  96% require <400KB/core
    Trace buffer can easily fit in L2 or L3 cache

Slide34

mSWAT Summary

Detection: low SDC rate, detection latency

Diagnosis: identifying the faulty core
  Challenges: no known good core, deterministic replay
  High diagnosability with low diagnosis latency
  Low hardware overhead: firmware-based implementation
  Scalable: maximum 3 replays for any system

Future work:
  Reducing SDCs, detection latency, recovery overheads
  Extending to server apps; off-core faults
  Validation on FPGAs (w/ Michigan)
