LowCost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari ManLap Alex Li Pradeep Ramachandran Byn Choi Sarita Adve Department of Computer Science ID: 417481
Slide1
mSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems
Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi, Sarita Adve
Department of Computer Science
University of Illinois at Urbana-Champaign
swat@cs.uiuc.edu
Slide2
Motivation
Hardware will fail in the field due to several reasons:
Transient errors (high-energy particles)
Wear-out (devices are weaker)
Design bugs
… and so on
Need in-field detection, diagnosis, repair, and recovery
Reliability problem pervasive across many markets
Traditional redundancy solutions (e.g., nMR) too expensive
Need low-cost solutions for multiple failure sources
Must incur low area, performance, power overhead
Slide3
SWAT: Low-Cost Hardware Reliability
SWAT Observations
Need to handle only hardware faults that propagate to software
Fault-free case remains common, must be optimized
SWAT Approach
Watch for software anomalies (symptoms)
Zero to low overhead “always-on” monitors
Diagnose cause after symptom detected
May incur high overhead, but rarely invoked
Slide4
SWAT Framework Components
[Figure: execution timeline labeled Checkpoint, Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]
Detectors with simple hardware [Li et al. ASPLOS’08]
µarch-level Fault Diagnosis (TBFD) [Li et al. DSN’09]
Slide5
Challenge
[Figure: SWAT framework components, as on the previous slide: Checkpoint, Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]
Detectors with simple hardware [Li et al. ASPLOS’08]
µarch-level Fault Diagnosis (TBFD) [Li et al. DSN’09]
Shown to work well for single-threaded apps
Does the SWAT approach work on multithreaded apps?
Slide6
Challenge: Data sharing in multithreaded apps
Multithreaded apps share data among threads
Does symptom detection work?
Symptom-causing core may not be faulty
How to diagnose faulty core?
[Figure: Core 1 and Core 2 share Memory; a Fault causes an Error that propagates through a Store and a Load, leading to Symptom Detection on a fault-free core]
Slide7
Contributions
Evaluate SWAT detectors on multithreaded apps
Low Silent Data Corruption rate for multithreaded apps
Observed symptoms from fault-free cores
Novel fault diagnosis for multithreaded apps
Identifies the faulty core despite error propagation
Provides high diagnosability
Slide8
Outline
Motivation
mSWAT Detection
mSWAT Diagnosis
Results
Summary and Future Work
Slide9
mSWAT Fault Detection
SWAT Detectors:
Low-cost monitors to detect anomalous software behavior
Incur near-zero performance overhead in fault-free operation
Symptom detectors provide low Silent Data Corruption rate
[Figure: detectors reporting to SWAT firmware]
Fatal Traps: division by zero, RED state, etc.
Kernel Panic: OS enters panic state due to fault
High OS: high contiguous OS activity
Hangs: simple HW hang detector
App Abort: application abort due to fault
Slide10
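As a rough illustration of the "High OS" detector listed on the mSWAT Fault Detection slide, the sketch below counts consecutive retired instructions executed in OS mode and raises a symptom when the run exceeds a limit. The class name, the `retire` interface, and the threshold value are assumptions made for this example; the slides do not specify how the detector is built.

```python
from dataclasses import dataclass


@dataclass
class HighOSDetector:
    """Hypothetical monitor: flags a symptom on long contiguous OS activity."""

    threshold: int       # assumed limit on consecutive OS-mode instructions
    run_length: int = 0  # current count of consecutive OS-mode instructions

    def retire(self, in_os_mode: bool) -> bool:
        """Account for one retired instruction; return True if a symptom fires."""
        # Any user-mode instruction breaks the contiguous OS run.
        self.run_length = self.run_length + 1 if in_os_mode else 0
        return self.run_length > self.threshold


detector = HighOSDetector(threshold=3)
modes = [True, True, True, True, False, True]
symptoms = [detector.retire(mode) for mode in modes]
```

In this toy run the fourth consecutive OS-mode instruction exceeds the threshold of 3, and the counter resets once a user-mode instruction retires.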
SWAT Fault Diagnosis
Rollback/replay on same/different core
Single-threaded application on multicore
[Figure: diagnosis decision tree]
Symptom detected: rollback on faulty core
No symptom on faulty core: transient or non-deterministic s/w bug, continue execution
Symptom on faulty core: deterministic s/w or permanent h/w bug, rollback/replay on good core
No symptom on good core: permanent h/w fault, needs repair!
Symptom on good core: deterministic s/w bug, send to s/w layer
Slide11
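The single-threaded SWAT diagnosis decision tree can be sketched as a small function. This is only an illustration of the decision logic on the slide: the two callables stand in for "rollback and replay on the original core" and "rollback and replay on a known good core", and their names, like the `Diagnosis` enum, are assumptions made here.

```python
from enum import Enum
from typing import Callable


class Diagnosis(Enum):
    TRANSIENT_OR_NONDETERMINISTIC_SW = "transient or non-deterministic s/w bug"
    PERMANENT_HW = "permanent h/w fault, needs repair"
    DETERMINISTIC_SW = "deterministic s/w bug, send to s/w layer"


def diagnose(replay_on_same_core: Callable[[], bool],
             replay_on_good_core: Callable[[], bool]) -> Diagnosis:
    """Each callable performs one rollback/replay and returns True if the symptom recurs."""
    if not replay_on_same_core():
        # The symptom did not recur: continue execution.
        return Diagnosis.TRANSIENT_OR_NONDETERMINISTIC_SW
    if replay_on_good_core():
        # The symptom recurs even on a good core, so the hardware is not to blame.
        return Diagnosis.DETERMINISTIC_SW
    # The symptom recurs only on the original core.
    return Diagnosis.PERMANENT_HW
```

The second replay runs only when the first one reproduces the symptom, so the expensive path is taken only when needed.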
Challenges
Rollback/replay on same/different core
Single-threaded application on multicore
[Figure: the single-threaded SWAT diagnosis decision tree from the previous slide, annotated with the challenges below]
No known good cores available
Faulty core is unknown
How to replay multithreaded apps?
Slide12
Extending SWAT Diagnosis to Multithreaded Apps
Assumptions: In-core faults, single core fault model
Naïve extension – N known good cores to replay the trace
Too expensive – area
Requires full-system deterministic replay
Simple optimization – One spare core
Not scalable, requires N full-system deterministic replays
High hardware overhead – requires a spare core
Single point of failure – spare core
[Figure: cores C1, C2, C3 and a Spare core in three replay configurations, one ending in No Symptom Detected and two in Symptom Detected; faulty core is C2]
Slide13
mSWAT Diagnosis - Key Ideas
Challenges
Multithreaded applications
Full-system deterministic replay
No known good core
Key Ideas
Isolated deterministic replay
Emulated TMR
[Figure: animation step showing threads TA, TB, TC, TD on cores A, B, C, D, with TA replayed]
Slide14
mSWAT Diagnosis - Key Ideas
Challenges
Multithreaded applications
Full-system deterministic replay
No known good core
Key Ideas
Isolated deterministic replay
Emulated TMR
[Figure: animation step showing threads TA, TB, TC, TD on cores A, B, C, D, with TA and TB replayed]
Slide15
mSWAT Diagnosis - Key Ideas
Challenges
Multithreaded applications
Full-system deterministic replay
No known good core
Key Ideas
Isolated deterministic replay
Emulated TMR
[Figure: animation step showing threads TA, TB, TC, TD on cores A, B, C, D, with TA, TB, and TC replayed]
Slide16
mSWAT Diagnosis - Key Ideas
Challenges
Multithreaded applications
Full-system deterministic replay
No known good core
Key Ideas
Isolated deterministic replay
Emulated TMR
[Figure: threads TA, TB, TC, TD on cores A, B, C, D; the replays rotate the threads across the cores as TD, TA, TB, TC and then TC, TD, TA, TB]
Maximum 3 replays
Slide17
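The emulated TMR idea from the Key Ideas slides, where each thread ends up executed on three different cores without any dedicated spare, can be sketched as a rotation schedule. The function below is an assumption-laden illustration: thread index 0 stands for TA and core index 0 for core A, and the rotate-by-one rule is read off the slide figure rather than taken from a specification.

```python
from typing import List


def rotation_schedule(num_cores: int, num_runs: int = 3) -> List[List[int]]:
    """Return schedule[run][core] = index of the thread that core executes in that run."""
    if num_cores < num_runs:
        raise ValueError("need at least as many cores as runs for distinct placements")
    # In run r, every thread moves r cores to the right (wrapping around).
    return [[(core - run) % num_cores for core in range(num_cores)]
            for run in range(num_runs)]


schedule = rotation_schedule(4)
# Run 0: TA TB TC TD, run 1: TD TA TB TC, run 2: TC TD TA TB (on cores A B C D).
cores_used_by_thread_0 = [run.index(0) for run in schedule]
```

With four cores, thread TA runs on cores A, B, and C across the three runs, which gives the three independent executions a majority vote needs, and the number of runs stays at three regardless of the core count.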
Multicore Fault Diagnosis Algorithm Overview
Symptom detected
Diagnosis
Replay & capture fault activating trace
Deterministically replay captured trace
Example
[Figure: threads TA, TB, TC, TD on cores A, B, C, D]
Slide18
Multicore Fault Diagnosis Algorithm Overview
Symptom detected
Diagnosis
Replay & capture fault activating trace
Deterministically replay captured trace
Example
[Figure: threads TA, TB, TC, TD captured on cores A, B, C, D, with a second row of cores A, B, C, D for the replay]
Slide19
Multicore Fault Diagnosis Algorithm Overview
Symptom detected
Diagnosis
Replay & capture fault activating trace
Deterministically replay captured trace
Look for divergence
Example
[Figure: threads TA, TB, TC, TD captured on cores A, B, C, D and replayed as TD, TA, TB, TC on cores A, B, C, D]
Slide20
Multicore Fault Diagnosis Algorithm Overview
Symptom detected
Diagnosis
Replay & capture fault activating trace
Deterministically replay captured trace
Look for divergence
Faulty core
Example
[Figure: threads TA, TB, TC, TD captured on cores A, B, C, D and replayed as TD, TA, TB, TC on cores A, B, C, D; divergences are marked, TA is replayed once more, and the remaining cores are labeled fault-free]
Faulty core is A
Slide21
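Once a thread has been executed on three cores, picking out the faulty core under the single-core fault model reduces to a majority vote over the traces. The sketch below assumes each trace has already been reduced to a hashable value such as a tuple of strings; the function name and data layout are hypothetical and are not taken from the mSWAT firmware.

```python
from collections import Counter
from typing import Dict, Hashable, Optional


def find_faulty_core(executions: Dict[str, Dict[str, Hashable]]) -> Optional[str]:
    """executions[thread][core] is the trace that `core` produced for `thread`.

    Returns the core that disagrees with the majority, or None if nothing diverged.
    """
    for thread, trace_by_core in executions.items():
        majority_trace, votes = Counter(trace_by_core.values()).most_common(1)[0]
        if votes == len(trace_by_core):
            continue  # every core produced the same trace for this thread
        if 2 * votes <= len(trace_by_core):
            raise ValueError(f"no majority for thread {thread}; another replay is needed")
        for core, trace in trace_by_core.items():
            if trace != majority_trace:
                return core
    return None


executions = {
    "TA": {"A": ("st 0x10=5", "br 0x400"),
           "B": ("st 0x10=7", "br 0x400"),
           "C": ("st 0x10=7", "br 0x400")},
    "TB": {"B": ("ld 0x20=1",), "C": ("ld 0x20=1",), "D": ("ld 0x20=1",)},
}
faulty = find_faulty_core(executions)
```

Here thread TA diverges only on core A while cores B and C agree, so core A is reported, matching the outcome of the slide's example.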
Digging Deeper
[Figure: Symptom detected, Replay & capture fault activating trace, Isolated deterministic replay, Look for divergence, Faulty core]
What info to capture to enable isolated deterministic replay?
How to identify divergence?
Hardware costs?
Slide22
Enabling Isolated Deterministic Replay
[Figure: a Thread with "Input to thread" arrows, each a load (Ld)]
Recording thread inputs sufficient – similar to BugNet
Record the values of all retiring loads
Slide23
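The claim that recording thread inputs is sufficient can be illustrated with a toy thread whose only inputs are loads. In this sketch, which uses hypothetical names throughout, the capture run logs every loaded value and the replay run feeds the log back without touching shared memory, so the thread can be re-executed in isolation on another core.

```python
from typing import Callable, Dict, List, Tuple


def run_thread(load: Callable[[int], int]) -> int:
    """Toy thread: its result depends only on the values its loads return."""
    return sum(load(address) for address in (0x10, 0x14, 0x18))


def capture(memory: Dict[int, int]) -> Tuple[int, List[int]]:
    """Run natively against shared memory, logging each retiring load's value."""
    log: List[int] = []

    def load(address: int) -> int:
        value = memory[address]
        log.append(value)
        return value

    return run_thread(load), log


def replay(log: List[int]) -> int:
    """Re-run in isolation: loads are served from the log, not from memory."""
    values = iter(log)
    return run_thread(lambda _address: next(values))


shared_memory = {0x10: 1, 0x14: 2, 0x18: 3}
result, load_log = capture(shared_memory)
shared_memory[0x10] = 99  # another thread overwrites the location later
replayed_result = replay(load_log)
```

Because the replay never reads `shared_memory`, later writes by other threads do not affect it, which is what removes the need for full-system deterministic replay in this toy setting.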
Digging Deeper (Contd.)
[Figure: Symptom detected, Replay & capture fault activating trace, Isolated deterministic replay, Look for divergence, Faulty core, Trace Buffer]
What info to capture to enable isolated deterministic replay?
How to identify divergence?
Hardware costs?
Slide24
Identifying Divergence
Comparing all instructions: large buffer requirement
Faults corrupt software through memory and control instructions
Capture memory and control instructions
[Figure: a Thread's instruction stream with Store, Load, Branch, and Store marked]
Slide25
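Comparing only memory and control instructions might look like the following sketch, which walks two captured traces and reports the first position where they differ. The `TraceEntry` fields are assumptions chosen for illustration; the slides do not define the trace format.

```python
from typing import List, NamedTuple, Optional


class TraceEntry(NamedTuple):
    kind: str    # "load", "store", or "branch"
    target: int  # memory address, or branch target for a branch
    value: int   # data value (0 for branches in this sketch)


def first_divergence(original: List[TraceEntry],
                     replayed: List[TraceEntry]) -> Optional[int]:
    """Return the index of the first differing entry, or None if the traces match."""
    for index, (expected, actual) in enumerate(zip(original, replayed)):
        if expected != actual:
            return index
    if len(original) != len(replayed):
        # One trace is a strict prefix of the other: they diverge where it ends.
        return min(len(original), len(replayed))
    return None


captured = [TraceEntry("store", 0x10, 5), TraceEntry("branch", 0x400, 0)]
replayed = [TraceEntry("store", 0x10, 7), TraceEntry("branch", 0x400, 0)]
divergence_index = first_divergence(captured, replayed)
```

Non-memory, non-control instructions never enter the trace, which is what keeps the buffer requirement smaller than comparing every instruction.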
Digging Deeper (Contd.)
[Figure: Symptom detected, Replay & capture fault activating trace, Isolated deterministic replay, Look for divergence, Faulty core, Trace Buffer]
What info to capture to enable isolated deterministic replay?
How to identify divergence?
Hardware costs?
Slide26
Hardware Costs
[Figure: Symptom detected, Replay & capture fault activating trace, Isolated deterministic replay, Look for divergence, Faulty core]
Native Execution
Memory Backed Log
Small hardware support
Firmware Emulation
Minor support for firmware reliability
Trace Buffer: How big?
What if the faulty core subverts the process?
Key Idea: On a divergence two cores take over
Slide27
Trace Buffer Size
Long detection latency means large trace buffers (8MB/core)
Need to reduce the size requirement
Iterative Diagnosis Algorithm
Repeatedly execute on short traces, e.g. 100,000 instructions
[Figure: Symptom detected, Replay & capture fault activating trace, Isolated deterministic replay, Look for divergence, Faulty core, Trace Buffer]
Slide28
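The iterative diagnosis idea, repeatedly capturing and replaying short traces instead of one long one, can be sketched as a simple loop. The window of 100,000 instructions comes from the slides; the 20M-instruction cap is the limit used later in the experimental methodology; the `diverges_in_window` callback is a hypothetical stand-in for one capture, replay, and compare step.

```python
from typing import Callable, Optional


def iterative_diagnosis(diverges_in_window: Callable[[int, int], bool],
                        window: int = 100_000,
                        limit: int = 20_000_000) -> Optional[int]:
    """Return the 1-based iteration in which a divergence is found, or None.

    diverges_in_window(start, end) captures and replays instructions
    [start, end) and reports whether the traces diverged.
    """
    for iteration, start in enumerate(range(0, limit, window), start=1):
        if diverges_in_window(start, start + window):
            return iteration
    return None  # no divergence within the instruction limit


# Toy fault that is first activated at instruction 250,000 after rollback.
fault_instruction = 250_000
found_in = iterative_diagnosis(lambda start, end: start <= fault_instruction < end)
```

Each iteration only needs buffer space for one window, so the trace buffer is sized by the window length rather than by the detection latency.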
Experimental Methodology
Microarchitecture-level fault injection
GEMS timing models + Simics full-system simulation
Six multithreaded applications on OpenSolaris
4 Multimedia apps and 1 each from SPLASH and PARSEC
4-core system running 4-threaded apps
Faults in latches of 7 arch units
Permanent (stuck-at) and transient faults
Slide29
Experimental Methodology
Detection:
[Figure: after the Fault is injected, 10M instructions of Timing simulation; if no symptom in 10M instructions, run to completion in Functional simulation to classify the fault as Masked or Silent Data Corruption (SDC)]
Metrics: SDC rate, detection latency
Diagnosis:
Iterative algorithm with 100,000 instructions in each iteration
Until divergence or 20M instructions
Deterministic replay is native execution, not firmware emulated
Metrics: Diagnosability, overheads
Slide30
Results: mSWAT Detection Summary
SDC Rate: Only 0.2% for permanents & 0.55% for transients
Detection Latency: Over 99% detected within 10M instructions
[Figure: detection results chart; label: Detected]
Slide31
Results: mSWAT Detection Summary
SDC Rate: Only 0.2% for permanents & 0.55% for transients
Detection Latency: Over 99% detected within 10M instructions
4.5% detected in a good core
Slide32
Results: mSWAT Diagnosability
Over 95% of detected faults are successfully diagnosed
All faults detected in a fault-free core are diagnosed
Undiagnosed faults: in 88% of them the fault was not activated
[Figure: diagnosability chart with values (%) 99, 99, 99, 86, 100, 80, 99, 95.9]
Slide33
Results: mSWAT Diagnosis Overheads
Diagnosis Latency
98% diagnosed in <10 million cycles (10ms in 1GHz system)
93% were diagnosed in 1 iteration
Iterative approach is effective
Trace Buffer size
96% require <400KB/core
Trace buffer can easily fit in L2 or L3 cache
Slide34
mSWAT Summary
Detection: Low SDC rate, detection latency
Diagnosis – identifying the faulty core
Challenges: no known good core, deterministic replay
High diagnosability with low diagnosis latency
Low hardware overhead – firmware-based implementation
Scalable – maximum 3 replays for any system
Future Work:
Reducing SDCs, detection latency, recovery overheads
Extending to server apps; off-core faults
Validation on FPGAs (w/ Michigan)