Slide1
Part 3: Research directions
Alvaro Moreira & Luigi Carro, Instituto de Informática, UFRGS, Brasil
1
Slide2
Outline – Part III
- Work done at UFRGS on detection/correction of Control Flow Errors (CFEs) with LLVM
- Similarities and differences with Security
- The need for a new computational stack
- Work done at UFRGS towards a new stack
- Challenges and research directions
2
Slide3
Control Flow Error (CFE)
[CFG figure: basic blocks 1–6. A CFE creates a branch not present in the CFG: an illegal branch.]
3
Slide4
Fault Model
We assume a single bit flip in one of the words that constitute a program; the bit flip can occur anywhere in the program. Among the variety of consequences:
- A non-jump instruction becomes a jump
- A jump instruction is no longer a jump
- A spurious opcode is produced
- The address part of a jump instruction is modified
- ...
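As a rough sketch of this fault model, a single bit flip can be simulated by XOR-ing one bit of a 32-bit instruction word. The "opcode in the top 6 bits" layout below is a made-up toy encoding, not a real ISA:

```python
# Toy model of the single-bit-flip fault model on a 32-bit instruction word.
# The "opcode = top 6 bits" layout is a hypothetical encoding for illustration.

def flip_bit(word: int, pos: int) -> int:
    """Return `word` with the bit at position `pos` inverted."""
    return word ^ (1 << pos)

def opcode(word: int) -> int:
    """Extract the (toy) 6-bit opcode field."""
    return (word >> 26) & 0x3F

instr = 0x20410001            # a non-jump instruction in the toy encoding
faulty = flip_bit(instr, 27)  # the flip lands inside the opcode field

# The opcode changed, so the instruction may now decode as a jump
# (or as a spurious, illegal opcode).
print(hex(opcode(instr)), hex(opcode(faulty)))
```

A flip in the address field of a jump would instead retarget the branch, which is the case that produces the CFEs discussed next.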
4
Slide5
Fault Model and CFE
A bit flip can create a branch that does not exist in the original CFG of the program:
- The target destination is within the same BB (intra-block)
- The target is another BB of the program (inter-block)
- The target is none of the BBs of the program
The work done with LLVM detects (and "corrects") only inter-block CFEs.
5
Slide6
SW-Based Approach for CFEs
Several SW-based techniques have been proposed to deal with CFEs. Most are program transformations that add extra code to program basic blocks; this extra code is responsible for checking block signatures.
6
Slide7
Signature checking and updating
[CFG figure: basic blocks 1–6, each augmented with extra code for signature checking and updating]
7
Slide8
ACCE and CEDA as starting points
The implementation done in LLVM was based on:
- VEMU, R.; GURUMURTHY, S.; ABRAHAM, J. A. ACCE: Automatic Correction of Control-flow Errors. IEEE International Test Conference (ITC 2007), p. 1–10, Oct. 2007.
- VEMU, R.; ABRAHAM, J. A. CEDA: Control-flow Error Detection Using Assertions. IEEE Transactions on Computers, v. 60, p. 1233–1245, 2011.
8
Slide9
The LLVM Framework
9
Slide10
LLVM Framework
Several optimizations available; several front ends and back ends.
[Diagram: front ends (C, C++, Haskell, ...) -> LLVM IR -> optimizations/transformations, program analysis, JIT (just-in-time) -> code generator -> back ends (ARM, x86, MIPS, ...)]
10
Slide11
Advantages of Using LLVM
- The hardening mechanism is implemented once and becomes available for a series of processors
- LLVM has a library that facilitates the implementation of new LLVM-IR program transformations
- Lots of program optimizations to experiment with
11
Slide12
Signature Checking and Updating in LLVM
- The LLVM C compiler produces an intermediate representation in LLVM-IR
- Static analysis of the input LLVM-IR program collects entry and exit signatures of every basic block
- A new transformation step transforms LLVM-IR programs by adding signature checking and updating code at the beginning and end of each basic block
12
Slide13
Detection of CFE
- Compile to a target processor (x86, for instance)
- Detection of a CFE during program execution: code at the beginning of a block starts by executing signature checking code
- If the signature of the source block obtained during execution is different from the expected signature collected statically, then we have a CFE
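The check can be sketched as follows (a simplified signature-register scheme in the spirit of CEDA; the block names and signature values are made up, and the real implementation instruments LLVM-IR basic blocks rather than Python functions):

```python
# Sketch: inter-block CFE detection with a runtime signature register G.
# At compile time every basic block gets a signature; code at the end of a
# block updates G for the intended successor, and code at the beginning of
# the block actually reached compares G with its statically expected value.

SIG = {"entry": 0b0001, "bb2": 0b0010, "bb3": 0b0100}  # static signatures

def take_edge(G: int, src: str, intended: str, actual: str):
    G ^= SIG[src] ^ SIG[intended]   # exit code of `src`: update for successor
    return G, G == SIG[actual]      # entry code of the reached block: check

G = SIG["entry"]
G, ok = take_edge(G, "entry", "bb2", "bb2")   # legal CFG edge: check passes
assert ok
G, ok = take_edge(G, "bb2", "bb3", "entry")   # bit flip retargets the branch
assert not ok                                 # signature mismatch: CFE detected
```

Intra-block CFEs escape this scheme because the signature update and check of the (single) affected block still execute in order, which is why only inter-block CFEs are detected.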
13
Slide14
Correction is not Satisfactory
- The ACCE conference paper of 2007 has both detection and correction; the CEDA journal paper of 2011 has only detection
- The correction mechanism re-executes the "faulty" block from the beginning, but with no guarantees about data that has already been modified
14
Slide15
Experiments: Interaction with Optimizations
- Baseline: the original program with only the protection transformation
- Various experimental scenarios with LLVM optimizations in order to pinpoint corner cases
- Fault injection according to the fault model defined
15
Slide16
Experiments – one optimization with different benchmarks
16
Slide17
Experiments – same benchmark with several optimizations
17
Slide18
Results/Recommendations
First, phi instructions have to be removed (the LLVM reg2mem transformation does that), because the protection code should come first in each block. Example LLVM-IR with a phi in bb2:

define i32 @main() {
entry:
  %1 = load i32* @i
  %2 = add i32 %1, 1
  store i32 %2, i32* @i
  br label %bb2
bb2:
  %3 = phi i32 [ %2, %entry ], [ 0, %bb2 ]
  %4 = icmp eq i32 %3, 2
  br i1 %4, label %bb2, label %bb3
bb3:
  ret i32 1
}
18
Slide19
Results/Recommendations
- Some optimizations might destroy what the protecting transformation produced; to avoid this, the transformation that adds protecting code should be the last one to be applied
- Some optimizations improve the protection of some programs and reduce the protection of others
- The effectiveness of protection depends on the program structure
19
Slide20
Protection vs. Size and Number of BBs
Quote from the original ACCE paper: "If more coverage needs to be obtained, a node can be divided into subnodes and the proposed technique applied, thus detecting intra-node CFEs which now become inter-node CFEs. This will lead to increased performance and memory overhead along with increased coverage."
Our experiments showed that this claim has to be refined:
- Coverage can be reduced when there are too many very small blocks
- It is even worse when they are inside inner loops
20
Slide21
Problem with too many small BBs
- The extra protection code added is not itself protected!
- Some transformations add lots of what we call label-br basic blocks
- When too many very small protected blocks are inside nested loops, the chances increase that the (unprotected) protection code is affected by a single bit flip
21
Slide22
Improving Signature Checking Mechanisms
- Ongoing work by a Delft PhD student (with contributions from a UFRGS Master's student)
- Current signature checking/updating approaches add extra code to all basic blocks of the program, even to those not reachable by a CFE caused by a single bit flip
- Static analysis can inform exactly which BBs have to be protected
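One way such an analysis can be approximated is with a Hamming-distance test on branch-target addresses: under the single-bit-flip model, only a block whose start address differs from some legal target in exactly one bit can be entered illegally. The block names and addresses below are invented for illustration:

```python
# Sketch: which basic blocks are reachable by a single bit flip in a
# branch-target address, and therefore actually need checking code.

def hamming1(a: int, b: int) -> bool:
    """True iff a and b differ in exactly one bit."""
    x = a ^ b
    return x != 0 and (x & (x - 1)) == 0

blocks = {"bb0": 0x1000, "bb1": 0x1010, "bb2": 0x1040, "bb3": 0x2000}
legal_targets = [0x1010]   # targets encoded in the program's branches

# A block needs protection only if some legal target is one flip away.
need_protection = {name for name, addr in blocks.items()
                   if any(hamming1(addr, t) for t in legal_targets)}
print(sorted(need_protection))   # only bb0 is one flip away from 0x1010
```

Blocks outside this set can skip the checking/updating code, reducing the overhead that the previous slide showed can itself become a vulnerability.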
22
Slide23
Security vs. Hardening against CFEs (I)
In both cases there is the notion of an attack/invasion and the need to detect it.
Detection:
- In the context of security: detection should be done before the invasion actually occurs
- In the context of soft errors: detection is only possible after the "invasion" has occurred
23
Slide24
Security vs. Hardening to Soft Errors (I)
The attacker:
- In soft errors caused by radiation, nature is the attacker! Attacks by nature are inevitable, but nature won't try to find new ways to break the defense mechanism
- In security: a diversity of attackers, always trying to find new ways to invade
24
Slide25
Redesigning the Stack for Security
- The effectiveness of defense mechanisms for security tends to be short-lived, since attackers keep trying to find a way to break them
- This difficulty motivates research on redesigning the stack of abstractions to take security issues into consideration; see for instance the SAFE project
25
Slide26
Redesigning the Stack for Fault Tolerance to Soft Errors - I
In comparison, the redesign of the stack for hardening against soft errors has not attracted as much attention. Why?
- The problem is still perceived as exclusive to certain very specialized domains (today, everyone can relate to security issues)
- The regularity of the way nature attacks computer systems should, in theory at least, facilitate the design of effective (and stable) detection and correction mechanisms
26
Slide27
Redesigning the Stack for Fault Tolerance to Soft Errors - II
- Soft errors tend to become more pervasive in everyday computer systems
- Detection/correction mechanisms should be cost-effective, fast, light, and not energy drainers
- The solution may reside in redesigning the stack!
27
Slide28
Recent work across the stack
[Stack diagram: System, Algorithm, Architecture, Circuit, Component, Technology levels; highlighted: compiler optimizations]
Reliability and high performance. Strategy: better use of available code optimizations.
28
Slide29
Recent work across the stack
[Stack diagram: System, Algorithm, Architecture, Circuit, Component, Technology levels; highlighted: compiler optimizations and signature control]
Reliability and high performance. Strategy: better use of the available signature space.
29
Slide30
Recent work across the stack
[Stack diagram: System, Algorithm, Architecture, Circuit, Component, Technology levels; highlighted: coding for GPUs]
Reliability and high performance. Strategy: how can massive parallelism be useful?
30
Slide31
Can we have it all?
31
Slide32
Can we have it all?
A tribute to Nigella Lawson
32
Slide33
Trends in transistor scaling
The same aggressive scaling that allows:
- High performance
- High energy efficiency
- High integration
- Parallelism exploitation
33
Slide34
Trends in transistor scaling
...also increases:
- Power density
- Sub-optimal VDD scaling
- Susceptibility to transient faults
- Reduced critical charge
34
Slide35
Trends in transistor scaling
Can we have it all? High performance, energy efficiency, high integration, and parallelism exploitation, while still reducing power dissipation and providing high reliability?
35
Slide36
We can have it all!
The same structure that provides acceleration provides error correction!
36
Slide37
Recent work across the stack
[Stack diagram: System, Algorithm, Architecture, Circuit, Component, Technology levels; highlighted: reliable accelerator over unreliable fabric]
Reliability, high performance, low energy. Strategy: use an unreliable but self-correcting fabric.
37
Slide39
MPSoC
In an MPSoC with concurrent tasks and heterogeneous components, each component should aim at finishing its tasks on time with minimum power; a larger share of the budget then remains for the rest of the system.
New application: many matrix operations! What if we had a generic accelerator? Adaptive and resilient!
39
Slide40
Algorithm-Based Fault Tolerance
[Equations from the slide omitted in this transcript]
Error detection: the difference between the calculated checksum vectors determines the wrong position (C(1,1)).
Error correction:
- Step 1: compute the residue
- Step 2: recompute the corrupted element
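The detection and correction steps can be sketched in pure Python with checksum sums (the 2x2 matrices, the injected -50, and the row=1/column=0 location follow the RA3 example on the next slide; in ABFT proper the reference checksums come from an augmented checksum row/column of the inputs, and step 2 would recompute the element, which here is equivalent to patching it with the residue):

```python
# Sketch of ABFT for matrix multiplication: reference row/column checksums
# are derived from the inputs; after an error, the mismatching row and
# column intersect at the corrupted element, and the residue corrects it.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)        # correct product: [[19, 22], [43, 50]]
C[1][0] -= 50           # inject a single corrupted element

# Expected row/column sums of C, computed from the (trusted) inputs.
chk_rows = [sum(A[i][k] * sum(B[k]) for k in range(2)) for i in range(2)]
chk_cols = [sum(sum(A[i][k] for i in range(2)) * B[k][j] for k in range(2))
            for j in range(2)]

# Detection: the mismatching row and column locate the error.
bad_row = next(i for i in range(2) if sum(C[i]) != chk_rows[i])
bad_col = next(j for j in range(2)
               if sum(C[i][j] for i in range(2)) != chk_cols[j])

# Step 1: compute the residue.  Step 2: restore the corrupted element.
residue = chk_rows[bad_row] - sum(C[bad_row])
C[bad_row][bad_col] += residue
print((bad_row, bad_col), residue, C)
```

Because any single corrupted element perturbs exactly one row sum and one column sum, one checksum row plus one checksum column suffice to locate and correct it.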
40
Slide41
Resilient Adaptive Algebraic Architecture (RA3)
[Worked example: during C = A x B, an S.E.T. corrupts one element (-50), which propagates to memory. The corrupted element lies at the intersection of row = 1 and column = 0; the address of C(1,0) is calculated, the residue is computed, and the correct value is written to the memory.]
41
Slide42
Matrix-Matrix Multiplier
- Written in VHDL; maximum size defined at synthesis time
- Separate scratchpads to exploit regularity of accesses
- Units can be powered on or off at run time, providing adaptive parallelism exploitation
42
Slide43
RA3 Results – Performance
Matrix multiplication protected by ABFT: software (GPU, Nvidia Tesla S1070) [*] vs. hardware (RA3)
[*] C. Ding, C. Karlsson, H. Liu, T. Davies and Z. Chen, "Matrix Multiplication on GPUs with On-line Fault Tolerance," in ISPA '11: Int. Symp. on Parallel and Distributed Processing with Applications, IEEE, 2011, pp. 311–317.
43
Slide44
RA3 Results – Fault Coverage
50,000 faults injected in each RA3 configuration.

# of RA3 multipliers:      1         2          4          8          16         32
# of errors (a):        900/948  1563/1614  2727/2782  4345/4414  6693/6753  9236/9289
Coverage:               94.94%    96.84%     98.02%     98.44%     99.11%     99.43%
(a) Number of corrected/injected errors.

Control becomes less significant in overall area!
44
Slide45
RA3 Results – Memory Wall
Memory access latency and execution time; LP-DDR2 pins
45
Slide46
Our Case Studies
- MIMO systems: bottleneck = QR decomposition of the H matrix; many matrices to be decomposed
- Voice over IP (VoIP) systems: bottleneck = Acoustic Echo Canceller (AEC); uses adaptive filters with a large number of taps
Both operations can be realized with matrix multiplications.
[Figures: MIMO decoding; AEC for VoIP]
46
Slide47
MIMO Results
Execution time and performance constraint for QR decomposition; the best choice meets the deadline with minimal power consumption.
[Plot: Time (s)]
47
Slide48
AEC Results
Execution time and performance constraints; typical time frame of voice (codec G.711). As AEC is not the only operation in a frame decoding...
[Plot: Time (s) vs. length of AEC filter (matrix dimensions)]
~8 times power gap! Variable resources for a workload that varies at runtime!
48
Slide49
Conclusions about RA3
- The RA3 architecture is able to adaptively exploit parallelism in order to meet real-time constraints and respect the limits imposed by memory bandwidth, with minimum power consumption
- The RA3 also provides low-cost error correction and very predictable execution times
- PARADIGM: aggressive scaling together with a reliable processor
49
Slide50
Time to cross borders
[Stack diagram: System, Algorithm, Architecture, Circuit, Component, Technology levels; highlighted: transactional processor]
Reliability, high performance, low energy. Strategy: use a fast unreliable but self-correcting fabric, and only write when you're sure about it.
50
Slide51
Adaptive Low-Power Architecture for High-Performance and Reliable Embedded Computing
Ronaldo Ferreira, Jean da Rolt, Gabriel Nazar, Álvaro Moreira, Luigi Carro
Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
51
Slide52
Architecture block diagram
52
Slide53
Transactional Core
Getting rid of checkpointing skews:
- Data sharing between basic blocks using registers is prohibited; all communication occurs in main memory
- This virtually eliminates register liveness between basic blocks
Loose lockstep execution:
- Two RISC cores execute the same instructions, but we allow timing errors
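The resulting commit discipline can be modeled in a few lines (a toy Python model with invented function names and fault hook; the real design is two RISC cores plus comparator hardware):

```python
# Toy model of a transactional basic block (TBB): both cores buffer their
# stores; the streams are compared, and main memory is updated only when
# the two cores agree, otherwise the TBB is re-executed.

def commit(memory, tbb, fault=None):
    """Run `tbb` on both (model) cores and commit the agreed stores."""
    stores1 = tbb(dict(memory))          # core 1's buffered store stream
    stores2 = tbb(dict(memory))          # core 2's buffered store stream
    if fault is not None:                # model an SEU hitting core 2's buffer
        i, bitmask = fault
        addr, val = stores2[i]
        stores2[i] = (addr, val ^ bitmask)
    if stores1 != stores2:               # mismatch caught by the comparator
        return commit(memory, tbb)       # roll back and re-execute the TBB
    for addr, val in stores1:            # agreement: stores reach main memory
        memory[addr] = val
    return memory

mem = {0x10: 1, 0x14: 2}
tbb = lambda m: [(0x18, m[0x10] + m[0x14])]  # TBB: mem[0x18] = mem[0x10] + mem[0x14]
commit(mem, tbb, fault=(0, 0b100))           # SEU caught; re-execution commits 3
print(mem[0x18])
```

Because no state escapes a TBB except through the compared store stream, re-executing the block is always safe: this is precisely what the register-based-sharing ban buys.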
53
Slide54
Transactional Core – The SW Stack
[Figure: transactional basic block, with its register definition segment; two scenarios in case of errors]
54
Slide55
Transactional Core – The SW Stack
[Figure: transactional basic block, with its register termination segment; two scenarios in case of errors]
55
Slide56
Transactional Core – Error Latency
[Timeline: RISC Core 1 and RISC Core 2 both execute the TBB instruction stream: store r1, add r1 r2 r3, load r3, load r2]
56
Slide57
Transactional Core – Error Latency
[Same timeline, now with an error striking one of the cores]
57
Slide58
Transactional Core – Error Latency
[Same timeline] While the register file is in read-only mode, the error latency is the TBB length.
58
Slide59
Transactional Core – Error Latency
[Same timeline] When the register file has write permission, the error latency is 2 instructions. The average case is better than the state of the practice: instruction replay (e.g., Intel Itanium).
Slide60
Transactional Core – The HW Stack
60
Slide61
Error Coverage – Transactional Core
- Six benchmark programs: bubble sort, least squares, CRC32, Kruskal, Floyd-Warshall, matrix multiply
- 672,348,891 SEUs injected in a Virtex-5 FPGA
61
Slide62
Error Coverage – Transactional Core
62
Slide63
Error Coverage – Transactional Core
Undetected errors are due to SEUs injected in the memory address that will be written right after the last comparator
63
Slide64
Error Coverage – RA3 Core
64
Slide65
Power – Transactional Core vs. Single RISC Core
65
Slide66
Area – Transactional Core vs. Single RISC Core
66
Slide67
Area – Transactional vs. RA3
67
Slide68
Area – Transactional vs. RA3
The Transactional Core can be used as a small fault-tolerant unit that connects accelerators together
68
Slide69
Transactional Core vs. MIPS
69
Slide70
Error Recovery Latency
70
Slide71
Conclusions for the Transactional Core
- Natural heterogeneity
- A model of aggressive scaling + conservative design already in use for 20 years. Guess where?
- Compiler optimizations are being developed to reduce the 20% overhead caused by the extra LW/SW instructions
71
Slide72
Hidden problems around the corner?
Performance, area, energy overhead... Are there some ...
72
Slide73
Propagation delay (*) vs. Technologies
[Figure: clocked inverter chain (in, out, clk); propagation delays for the 32 nm, 90 nm, 130 nm, and 180 nm technologies]
(*) simulated using parameters from the PTM web site and the HSPICE tool
73
Slide74
Transient width studies
[Figures from DODD, 2004 and FERLET-CAVROIS, 2006]
74
Slide75
Transient widths vs. Propagation delays
[Chart: cycle time (ps, 0–600) and transient width scaling across technologies (180 nm, 130 nm, 100 nm, 90 nm, 70 nm, 32 nm); series: transient widths at 20 MeV and 10 MeV, cycle times for 10-, 8-, 6-, and 4-inverter chains; annotated ratio: 6.39x]
(*) 180, 130, and 100 nm from [DODD, 2004]; 70 nm from [FERLET-CAVROIS, 2006]
75
Slide76
Conclusions for the Tutorial (I)
- Crossing borders might provide performance and low energy, together with reliability
- Redesigning the stack: an ambitious but high-reward task
- Pay only once: the same structure that provides acceleration must provide reliability
76
Slide77
Conclusions for the Tutorial (II)
- Adaptivity is a major tool
- Programming for GPUs: choose the amount of parallelism for maximum reliability; memory bandwidth: be as fast as memory allows
- Algorithm vs. external conditions: Razor, RA3, and adaptive filters as role models
77
Slide78
Adaptive platform
[Diagram: source code -> programming language -> compiler -> binary; an adaptive solver, guided by optimization plans and an optimization function, produces an adapted binary for the runtime system on a hardware platform with adaptive units]
Maintain the SW abstraction; adapt SW and HW as a function of the QoS and of the required FT.
78
Slide79
Exciting times ahead
Thank you!
carro@inf.ufrgs.br
afmoreira@inf.ufrgs.br
www.inf.ufrgs.br/~carro
www.inf.ufrgs.br/~lse
We will answer questions, but NOT about the world cup.
79
Slide81
Recent work across the stack
[Stack diagram: System, Algorithm, Architecture, Circuit, Component, Technology levels; highlighted: computing with memories]
Reliability, high performance, low energy. Strategy: use regularity.
81
Slide82
Using Memory to Cope with Transient Faults [Rhod et al., DATE, 2007]
- Memory comes with intrinsic protection against manufacturing errors (spare columns and spare rows)
- Protection techniques with low area and latency overhead, like Reed-Solomon, can be applied
- Use Reed-Solomon-protected memory to replace combinational circuits, reducing the area sensitive to faults and hence the SER (soft error rate) of the circuit
82
Slide83
Replacing Combinational Circuit by Memory (ROM / Magnetic Memory) [Rhod et al., DATE, 2007]
Example: 4x4-bit multiplier
- Fully combinational: total area = 304 transistors
- Fully memory: inputs A and B (4 bits each) address the memory, which outputs an 8-bit result; 8 inputs and 8 outputs give 2^8 x 8 = 2,048 bits, i.e. a total area of 2,048 transistors (considering 1 transistor per bit): EXPENSIVE!
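The bit-count arithmetic behind the "fully memory" variant can be checked with a quick lookup-table sketch (a Python stand-in for the ROM; the transistor-per-bit figure is the slide's assumption):

```python
# Sketch: a 4x4-bit multiplier as a lookup memory.  Two 4-bit operands form
# an 8-bit address into a ROM of 2**8 words x 8 bits = 2,048 bits
# (~2,048 transistors at 1 transistor per bit, vs. 304 combinational).

IN_BITS, OUT_BITS = 8, 8
rom = [((addr >> 4) * (addr & 0xF)) & 0xFF for addr in range(2 ** IN_BITS)]

def mult_rom(a: int, b: int) -> int:
    """4x4-bit multiply via table lookup instead of combinational logic."""
    return rom[(a << 4) | b]

total_bits = 2 ** IN_BITS * OUT_BITS
print(mult_rom(7, 9), total_bits)   # 63 2048
```

The cost grows exponentially in the number of input bits, which is exactly why the next slide replaces only part of the circuit.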
Slide84
Replacing Combinational Circuit by Memory (ROM / Magnetic Memory) [Rhod et al., DATE, 2007]
Example: 4x4-bit multiplier; fully combinational total area = 304 transistors.
Let's replace just some part of the circuit!
- Replacing 1 column: 7 inputs and 4 outputs give 2^7 x 4 = 512 bits of memory
- Area cost = 512 transistors; latency = 7 cycles
Slide85
Results [Rhod et al., DATE, 2007]

FIR Filter Fault Rate Results for SINGLE Fault Injection
Circuit       | Total Area | # gates that fail | Latency (ns) | Proportional fault rate (%)
Combinational |    6524    |       1631        |      69      |   48.21
Memory        |    1832    |         50        |     56.8     |    2.58

FIR Filter Fault Rate Results for DOUBLE Fault Injection
Circuit       | Total Area | # gates that fail | Latency (ns) | Proportional fault rate (%)
Combinational |    6524    |       1631        |      69      |   67.35
Memory        |    1832    |         50        |     56.8     |    2.96