Slide1
Survey of Detection, Diagnosis, and Fault Tolerance Methods in FPGAs
Dan Fisher, Addison Floyd
Slide2
Outline
Introduction
Fault Detection - Motivation, Methods, etc.
Fault Diagnosis - Motivation, Methods, etc.
Fault Tolerance
Single FPGA
Multiple FPGAs
Single Faults
Multiple Faults
Conclusion
Slide3
Introduction
FPGA Background
Importance
Applications
Motivation for Fault Tolerance
http://en.wikipedia.org/wiki/Field-programmable_gate_array
Slide4
Fault Detection - Motivation
Main Causes of Faults
Degradation
Manufacturing Defects
Single Event Upsets (SEUs)
Slide5
Fault Detection - Judgement Criteria
Detection Methods are judged on:
Speed of Detection
Coverage
Resource Overhead
Performance Overhead
Detection Granularity
Slide6
Fault Detection - Criteria In-Depth
Detection Granularity - how precisely a method can pinpoint the location of a detected fault.
An FPGA is made up of tiles containing:
Logic Blocks
Connection Blocks - connect tiles
Switch Blocks - connect tiles and allow routing to change direction
Slide7
Fault Detection - Comparison
Slide8
Fault Detection - SEDC Method
The Method Explained
Partition data and Encode with SEDC codes
Calculate and Store check bits
Generate check bits as circuit operates
Compare calculated and generated values
Better than Berger codes and TMR
Slide9
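The SEDC steps listed above (partition, store check bits, regenerate them during operation, compare) can be sketched roughly as follows. This is only an illustration of the flow: the actual SEDC encoding is not given on the slide, so a single even-parity bit per partition is used as a stand-in, and all function names are hypothetical.

```python
from typing import List


def partition(bits: List[int], width: int) -> List[List[int]]:
    """Split a flat bit vector into groups of `width` bits."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def check_bits(bits: List[int], width: int) -> List[int]:
    """Stand-in code (NOT real SEDC): one even-parity bit per partition."""
    return [sum(group) % 2 for group in partition(bits, width)]


def detect(stored: List[int], observed: List[int], width: int) -> List[int]:
    """Regenerate check bits from observed data and return the indices
    of the partitions whose check bit disagrees with the stored one."""
    generated = check_bits(observed, width)
    return [i for i, (s, g) in enumerate(zip(stored, generated)) if s != g]


golden = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
stored = check_bits(golden, 4)   # calculated once and stored

faulty = golden.copy()
faulty[5] ^= 1                   # single bit error in the second partition

clean_report = detect(stored, golden, 4)
fault_report = detect(stored, faulty, 4)
```

With one parity bit per partition, the mismatch index also says which partition holds the error, which is one way detection granularity shows up in practice.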
Fault Detection - Nazar Method
CED method providing single error detection
Takes advantage of properties of LUTs
Major Drawback - LUT insertion
Area Improvement over DWC
Slide10
Nazar Method - LUT Properties Explained*
1st Advantage: A LUT can be viewed as a combinational circuit independent of the others. Area overhead is avoided because the sub-expressions that form the circuit outputs do not need to be replicated.
2nd Advantage: A K-input LUT can compute any function of up to K inputs. So as long as the selected group depends on no more than K distinct inputs, its parity can be calculated using just one LUT. If the selected group also has no more than K-1 distinct outputs, then the checker can be made of just one LUT (with the last input being the parity bit).
This picture shows upside-down triangles as LUTs, with one parity LUT for each group of K-1 outputs. Also shown is the checker, which would be composed of just one LUT. LUTs in the same checker group can’t overlap (otherwise they wouldn’t be independent), but LUTs from different checker groups can overlap in order to provide coverage.
*Note: This slide wasn’t in the original presentation but was added to better explain the method, since some people mentioned wanting to know more.
Slide11
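The LUT properties described above can be illustrated with a small truth-table model. This is a sketch under assumptions, not the authors' implementation: K = 4, the three example functions are made up, and LUTs are modeled as Python dictionaries. One K-input parity LUT predicts the parity of K-1 outputs, and one K-input checker LUT compares that prediction against the actual outputs.

```python
from itertools import product

K = 4  # assumed LUT size


def make_lut(func, n_inputs):
    """Model a LUT as a truth table keyed by the input tuple."""
    return {bits: func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)}


# Three hypothetical functions over the same K inputs (K-1 = 3 outputs).
f0 = make_lut(lambda a, b, c, d: a & b, K)
f1 = make_lut(lambda a, b, c, d: b ^ c, K)
f2 = make_lut(lambda a, b, c, d: (c | d) & a, K)
group = [f0, f1, f2]

# Parity LUT: predicted parity of the group's outputs, as one K-input LUT.
parity_lut = {bits: f0[bits] ^ f1[bits] ^ f2[bits] for bits in f0}

# Checker LUT: K-1 outputs plus the parity bit; outputs 1 on a mismatch.
checker_lut = make_lut(lambda o0, o1, o2, p: o0 ^ o1 ^ o2 ^ p, K)


def check(luts, inputs):
    """Return 1 if the checker flags an error for this input vector."""
    outputs = tuple(lut[inputs] for lut in luts)
    return checker_lut[outputs + (parity_lut[inputs],)]


def inject_fault(lut, entry):
    """Return a copy of the LUT with one truth-table bit flipped."""
    faulty = dict(lut)
    faulty[entry] ^= 1
    return faulty


faulty_group = [f0, inject_fault(f1, (1, 0, 1, 0)), f2]
```

A single flipped truth-table bit changes exactly one output of the group, so the parity changes and the checker fires, but only when the affected entry is actually addressed.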
Fault Detection - Roving Stars
New method for online detection
Detected faults do not affect working logic
STARs and BISTERs
Better than other methods
*Picture added after the presentation to help clear up any confusion.
Slide12
Fault Detection - Injection Topic 1
Which modules are most sensitive to SEUs
1.4% sensitive (83% routing / 16% logic)
Density matrix
Slide13
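To make the notion of a "sensitive" bit from the injection slide above concrete, here is a toy fault-injection campaign. It is a sketch under assumptions, not the setup used in the surveyed work: the "configuration memory" is a single 4-input LUT, the design ties one input to 0, and a bit counts as sensitive if flipping it changes any observable output. The resulting fraction is a property of this toy model only and is not comparable to the 1.4% figure.

```python
from itertools import product


def evaluate(config, a, b, c, d=0):
    """4-input LUT: the inputs select one configuration bit."""
    return config[(a << 3) | (b << 2) | (c << 1) | d]


def run_design(config):
    """The toy design ties input d to 0, so only half the entries are reachable."""
    return [evaluate(config, a, b, c) for a, b, c in product((0, 1), repeat=3)]


def sensitive_bits(config):
    """Flip each configuration bit in turn and record those that change the output."""
    golden = run_design(config)
    sensitive = []
    for i in range(len(config)):
        upset = list(config)
        upset[i] ^= 1              # emulate a single event upset
        if run_design(upset) != golden:
            sensitive.append(i)
    return sensitive


config = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0]
hits = sensitive_bits(config)
fraction = len(hits) / len(config)
```

Only the bits the design can actually address show up as sensitive, which is why the sensitive fraction of a real device can be small compared with the total number of configuration bits.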
Fault Detection - Injection Topic 2
HW module to test efficiency of SEU mitigation schemes
How to emulate SEUs - 2 step process
Example Results
Scrubbing Rate
Slide14
Fault Diagnosis - Roving Stars
Diagnose both interconnect and PLB faults
Partial Reuse
Future - Do we allow for retest of a fault?
Slide15
Fault Diagnosis - More Abramovici
BIST-based method in 2000
2004 paper further extending Roving Stars
Slide16
Fault Diagnosis - Niamat - MATS++
Diagnose multiple stuck-at faults
Use of MATS++ algorithm
Goal of speeding up diagnosis
Slide17
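For reference, the MATS++ march test named above consists of three march elements: write 0 to every cell in any order; then, in ascending address order, read 0 and write 1; then, in descending order, read 1, write 0, read 0. The sketch below runs it on a generic bit-addressable memory with injected stuck-at faults. The `Memory` class is hypothetical, and how the surveyed method maps the test onto FPGA resources is not stated on the slide.

```python
class Memory:
    """Bit-addressable memory with optional stuck-at faults (address -> value)."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}

    def write(self, addr, value):
        # A stuck-at cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def mats_pp(mem, size):
    """Run MATS++ and return the sorted addresses that failed a read."""
    failures = set()
    for addr in range(size):                 # M0: write 0 (any order)
        mem.write(addr, 0)
    for addr in range(size):                 # M1: ascending, read 0 then write 1
        if mem.read(addr) != 0:
            failures.add(addr)
        mem.write(addr, 1)
    for addr in reversed(range(size)):       # M2: descending, read 1, write 0, read 0
        if mem.read(addr) != 1:
            failures.add(addr)
        mem.write(addr, 0)
        if mem.read(addr) != 0:
            failures.add(addr)
    return sorted(failures)


healthy = mats_pp(Memory(8), 8)
diagnosed = mats_pp(Memory(8, stuck_at={2: 1, 5: 0}), 8)
```

Because every cell is read while expected to hold 0 and while expected to hold 1, both stuck-at-0 and stuck-at-1 cells are reported, including several at once.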
Fault Diagnosis - Tahoori’s Method
Diagnose a single fault in interconnect or logic
Application Dependent
Basic Idea
Slide18
Fault Tolerance
Single FPGA platform
Multi FPGA platform
Single Fault
Multiple Faults
Slide19
Fault Tolerance - Single FPGA
Dynamic Fault Tolerance via Partial Reconfiguration
online - handles faulty PLBs without system stopping
uses spare logic cells
Stroud et al.
Slide20
Fault Tolerance - Single FPGA
Online Fault Tolerance for FPGA Logic Blocks
reuse defective blocks to increase the number of spares and extend mission life
uses commercial CAD tools to implement
Stroud et al.
Slide21
Fault Tolerance - Single FPGA
Using Relocatable Bitstreams for Fault Tolerance
combines passive and active techniques
standardized relocatable modules, which are copied and stored
Montminy et al.
Slide22
Fault Tolerance - Multi FPGA
A Reliable Reconfiguration Controller for Fault-Tolerant Embedded Systems on Multi-FPGA platforms
multiple FPGAs in a mesh topology
hardening achieved by TMR
distributed solution
Bolchini et al.
Slide23
Fault Tolerance - Single Fault
Designing Fault Tolerant Systems into SRAM-based FPGAs
for use in space
Duplication with Comparison and Concurrent Error Detection
Lima et al.
Slide24
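Duplication with Comparison, mentioned above, can be sketched in a few lines: run two copies of a module on the same input and raise an error flag when their outputs differ. The module `parity8` and the faulty copy are hypothetical examples, and this is not the specific design from the cited work.

```python
def dwc(primary, duplicate, x):
    """Return the primary output and an error flag set when the copies disagree."""
    out_primary = primary(x)
    out_duplicate = duplicate(x)
    return out_primary, out_primary != out_duplicate


def parity8(x):
    """Example module: parity of the low 8 bits."""
    return bin(x & 0xFF).count("1") % 2


def upset_parity8(x):
    """Same module with an emulated fault that inverts its output."""
    return parity8(x) ^ 1


ok_result = dwc(parity8, parity8, 0b1011)
bad_result = dwc(parity8, upset_parity8, 0b1011)
```

DWC can detect a mismatch but cannot tell which copy is wrong, which is why it is usually paired with an additional detection or recovery mechanism.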
Fault Tolerance - Single Fault
TMR and Partial Dynamic Reconfiguration to Mitigate SEU Faults in FPGAs
passive Triple Modular Redundancy
Bolchini et al.
Slide25
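The Triple Modular Redundancy idea above can be illustrated with a bitwise majority voter over three replicas. This is a minimal sketch, not the cited design: the replica functions are made up, and reporting the disagreeing replica is an assumption about how a voter could point out which module to reconfigure.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def tmr(replicas, x):
    """Run three replicas, vote on the result, and list replicas that disagree."""
    outputs = [replica(x) for replica in replicas]
    voted = majority(*outputs)
    suspects = [i for i, out in enumerate(outputs) if out != voted]
    return voted, suspects


def adder(x):
    """Example module: add 3, keep 8 bits."""
    return (x + 3) & 0xFF


def upset_adder(x):
    """Same module with an emulated SEU flipping output bit 2."""
    return adder(x) ^ 0b100


voted, suspects = tmr([adder, upset_adder, adder], 10)
```

The voter masks the single faulty replica, and the list of suspects is the kind of information a partial reconfiguration step could use to repair only the affected module.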
Fault Tolerance - Single Fault
IPR: In-Place Reconfiguration for FPGA Fault Tolerance
preserves function and topology of LUT-based logic network
algorithm applied post-layout
Zhe et al.
Slide26
Fault Tolerance - Single Fault
A Novel SRAM-Based FPGA Architecture for Efficient TMR Fault Tolerance Support
Architectural level
augments LUTs with TMR
minimize number of reconfigurations
Kyriakoulakos et al.
Slide27
Fault Tolerance - Multiple Faults
Placement of Repair Circuits for In-Field FPGA Repair
utilize unused FPGA resources
repair circuits identified before faults occur
alternate repair circuits cached locally or remotely
Wirthlin et al.
Slide28
Fault Tolerance - Multiple Faults
Reconfigurable Fault Tolerance: A Comprehensive Framework for Reliable and Adaptive FPGA-Based Space Computing
dynamic self-adaptation
high reliability vs. high performance
Jacobs et al.
Slide29
Fault Tolerance - Multiple Faults
Exploiting Partially Defective LUTs: Why You Don’t Need Perfect Fabrication
because of shrinking feature size, transistor variability and failure rates are going up
identifies partially defective LUTs for reuse
DeHon et al.
Slide30
Conclusion
Importance of FPGAs
FPGA applications
Future of FPGA fault tolerance
Slide31
Questions?