
Slide1

Part 3: Research directions

Alvaro Moreira & Luigi Carro
Instituto de Informática – UFRGS, Brasil

1

Slide2

Outline – Part III

Work done at UFRGS on detection/correction of Control Flow Errors (CFEs) with LLVM
Similarities and differences with Security
The need for a new computational stack
Work done at UFRGS towards a new stack
Challenges and research directions

2

Slide3

Control Flow Error - CFE

[Figure: a CFG with six basic blocks (1–6); the CFE is a branch between blocks that is not an edge of the CFG]

A CFE creates a branch not present in the CFG: an illegal branch

3

Slide4

Fault Model

We assume a single bit flip in one of the words that constitute a program
The bit flip can occur anywhere in the program
A variety of consequences. Among them:
A non-jump instruction becomes a jump
A jump instruction is not a jump anymore
A spurious opcode is produced
The address part of a jump instruction is modified
.....

4
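The single-bit-flip model above can be made concrete with a toy sketch. Everything specific in it, the 32-bit word layout, the position of the opcode field, and the opcode values, is invented for illustration; it is not a real ISA:

```python
# Illustrative single-bit-flip fault model over 32-bit program words.
# The toy encoding (opcode in the top byte) is an assumption for
# demonstration only, not a real instruction set.

JUMP_OPCODE = 0x01  # hypothetical opcode value for "jump"

def opcode(word: int) -> int:
    """Top 8 bits of a 32-bit word, in our toy encoding."""
    return (word >> 24) & 0xFF

def flip_bit(word: int, bit: int) -> int:
    """Inject a single-event upset: flip one bit of a 32-bit word."""
    return word ^ (1 << bit)

# A non-jump instruction (hypothetical ALU op with opcode 0x41).
instr = 0x41_00_00_10
assert opcode(instr) != JUMP_OPCODE

# Flipping bit 30 turns opcode 0x41 into 0x01: a spurious jump appears.
faulty = flip_bit(instr, 30)
assert opcode(faulty) == JUMP_OPCODE

# Flipping a low-order bit instead corrupts the address/operand field,
# changing a jump's target rather than its opcode.
```

Depending on which bit the upset hits, the same mechanism yields each consequence on the list: a created jump, a destroyed jump, a spurious opcode, or a modified target address.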

Slide5

Fault Model and CFE

A bit flip can create a branch that does not exist in the original CFG of the program:
The target destination is the same BB (intra-block)
The target is another BB of the program (inter-block)
The target is none of the BBs of the program

The work done with LLVM detects (and “corrects”) only inter-block CFEs

5

Slide6

SW-Based Approach for CFEs

Several SW-based techniques have been proposed to deal with CFEs
Most are program transformations that add extra code to program basic blocks
This extra code is responsible for checking block signatures

6

Slide7

Signature checking and updating

[Figure: the same six-block CFG, now with extra code for signature checking and updating added to each basic block]

7

Slide8

ACCE and CEDA as starting points

The implementation done in LLVM was based on:

VEMU, R.; GURUMURTHY, S.; ABRAHAM, J. A. ACCE: Automatic Correction of Control-flow Errors. IEEE International Test Conference (ITC), pp. 1–10, Oct. 2007.

VEMU, R.; ABRAHAM, J. A. CEDA: Control-flow Error Detection Using Assertions. IEEE Transactions on Computers, v. 60, pp. 1233–1245, 2011.

8

Slide9

The LLVM Framework

9

Slide10

LLVM Framework

Several optimizations available
Several front ends and back ends

[Diagram: front ends (C, C++, Haskell, ...) → LLVM IR → Optimizations / Transformations → Code Generator → back ends (ARM, x86, MIPS, ...); Program Analysis and a JIT (just-in-time) compiler operate on the LLVM IR]

10

Slide11

Advantages of Using LLVM

The hardening mechanism is implemented once and it is available for a series of processors
LLVM has a library that facilitates the implementation of new LLVM-IR program transformations
Lots of program optimizations to experiment with

11

Slide12

Signature Checking and Updating in LLVM

The LLVM C compiler produces an intermediate representation in LLVM-IR
Static analysis of the input LLVM-IR program collects entry and exit signatures of every basic block
A new transformation step transforms LLVM-IR programs by adding signature checking and updating code at the beginning and end of each basic block

12

Slide13

Detection of CFE

Compile to a target processor (x86, for instance)
Detection of a CFE during program execution: code at the beginning of a block starts executing the signature checking code
If the signature of the source block obtained during execution is different from the expected signature collected statically, then we have a CFE

13
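The detection scheme above can be sketched as a small simulation. The signatures, the update rule, and the `run` helper below are simplified illustrations in the spirit of CEDA/ACCE, not the actual encoding used in those papers:

```python
# Minimal sketch of inter-block signature checking: each basic block has a
# statically assigned signature; a runtime register G holds the signature
# of the previously executed block and is checked on entry to each block.
# A CFE that jumps to a block from an illegal predecessor is detected.

# Static information collected by analysis (values are arbitrary here).
SIGNATURES = {"entry": 0x11, "bb2": 0x22, "bb3": 0x33}
LEGAL_PREDS = {"bb2": {"entry", "bb2"}, "bb3": {"bb2"}}

def run(trace):
    """Execute a trace of basic-block names; return the block where a CFE
    is detected, or None if the control flow is legal."""
    G = None      # runtime signature register
    prev = None   # dynamically observed predecessor
    for block in trace:
        if prev is not None:
            # Checking code at block entry: was the predecessor legal,
            # and does G carry the expected signature?
            if prev not in LEGAL_PREDS.get(block, set()) or G != SIGNATURES[prev]:
                return block  # signature mismatch -> CFE detected here
        # Updating code: record this block's signature before leaving it.
        G = SIGNATURES[block]
        prev = block
    return None

# A legal execution entry -> bb2 -> bb2 -> bb3 raises no alarm.
assert run(["entry", "bb2", "bb2", "bb3"]) is None
# A bit flip creating an illegal inter-block branch entry -> bb3 is caught.
assert run(["entry", "bb3"]) == "bb3"
```

As on the slides, only inter-block CFEs are visible to this mechanism: a wrong branch that stays inside the same block never reaches a checking point.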

Slide14

Correction is not Satisfactory

The ACCE conference paper of 2007 has both detection and correction
The CEDA journal paper of 2011 has only detection

The correction mechanism re-executes the “faulty” block from the beginning
But there are no guarantees about data that has already been modified

14

Slide15

Experiments: Interaction with Optimizations

Baseline: the original program with only the protection transformation
Various experimental scenarios with LLVM optimizations in order to pinpoint corner cases
Fault injection according to the fault model defined

15

Slide16

Experiments - one optimization with different benchmarks

16

Slide17

Experiments – same benchmark with several optimizations

17

Slide18

Results / Recommendations

First, phi instructions have to be removed (the LLVM reg2mem transformation does that)

define i32 @main() {
entry:
  %1 = load i32* @i
  %2 = add i32 %1, 1
  store i32 %2, i32* @i
  br label %bb2
bb2:
  %3 = phi i32 [ %2, %entry ], [ 0, %bb2 ]
  %4 = icmp eq i32 %3, 2
  br i1 %4, label %bb2, label %bb3
bb3:
  ret i32 1
}

Protection code should come first

18

Slide19

Results / Recommendations

Some optimizations might destroy what the protecting transformation produced
To avoid this, the transformation that adds protecting code should be the last one to be applied
Some optimizations improve the protection of some programs and reduce the protection of others

The effectiveness of protection depends on the program structure

19

Slide20

Protection x Size and Number of BBs

Quote from the original ACCE paper: “If more coverage needs to be obtained, a node can be divided into subnodes and the proposed technique applied, thus detecting intra-node CFEs which now become inter-node CFEs. This will lead to increased performance and memory overhead along with increased coverage”

Our experiments showed that this claim has to be refined
Coverage can be reduced when there are too many very small blocks
It is even worse when they are inside inner loops

20

Slide21

Problem with too many small BBs

The extra protection code that is added is itself not protected!
Some transformations add lots of what we call label-br basic blocks
When too many very small protected blocks are inside nested loops, the chances that the (unprotected) protection code is affected by a single bit flip increase

21

Slide22

Improving Signature Checking Mechanisms

Ongoing work by a Delft PhD student (with contributions from a UFRGS Master's student)
Current signature checking/updating approaches add extra code to all basic blocks of the program
Even to those not reachable by a CFE caused by a single bit flip....

Static analysis informs exactly which BBs have to be protected

22

Slide23

Security x Hardening against CFEs (I)

In both cases there is the notion of an attack/invasion and the need to detect it
Detection:
In the context of security: detection should be done before the invasion actually occurs
In the context of soft errors: detection is only possible after the “invasion” has occurred

23

Slide24

Security x Hardening to Soft Errors (II)

The attacker:
In soft errors caused by radiation, nature is the attacker!
Attacks by nature are inevitable, but nature won't try to find new ways to break the defense mechanism
In security: a diversity of attackers, always trying to find new ways to invade

24

Slide25

Redesigning the Stack for Security

The effectiveness of defense mechanisms for security tends to be short-lived, since attackers keep trying to find a way to break them
This difficulty motivates research on redesigning the Stack of Abstractions to take security issues into consideration
See for instance the SAFE project

25

Slide26

Redesigning the Stack for Fault Tolerance to Soft Errors - I

In comparison, the redesign of the stack for hardening against soft errors has not attracted so much attention. Why?
The problem is still perceived as exclusive to certain very specialized domains (today everyone can relate to security issues)
The regularity of the way nature attacks computer systems should, in theory at least, facilitate the design of effective (and stable) detection and correction mechanisms

26

Slide27

Redesigning the Stack for Fault Tolerance to Soft Errors - II

Soft errors tend to become more pervasive in everyday computer systems
Detection/correction mechanisms should be cost-effective, fast, light, and not energy drainers
The solution may reside in redesigning the Stack!!

27

Slide28

System Level

Algorithm Level

Architecture Level

Circuit Level

Component Level

Technology Level

Recent work across the stack

Reliability and high performance

Strategy: better use of available code optimizations

Compiler optimizations

28

Slide29

System Level

Algorithm Level

Architecture Level

Circuit Level

Component Level

Technology Level

Recent work across the stack

Reliability and high performance

Strategy: better use of available signature space

Compiler optimizations
Signature control

29

Slide30

System Level

Algorithm Level

Architecture Level

Circuit Level

Component Level

Technology Level

Recent work across the stack

Reliability and high performance
Strategy: how can massive parallelism be useful?

Coding for GPUs

30

Slide31

Can we have it all?

31

Slide32

Can we have it all?

A tribute to Nigella Lawson

32

Slide33

The same aggressive scaling that allows:
High performance
High energy efficiency
High integration
Parallelism exploitation

Trends in transistor scaling

33

Slide34

…also increases:
Power density
Sub-optimal VDD scaling
Susceptibility to transient faults
Reduced critical charge

Trends in transistor scaling

34

Slide35

Can we have it all?
High performance, energy efficiency, high integration, parallelism exploitation
While still reducing power dissipation and providing high reliability?

Trends in transistor scaling

35

Slide36

We can have it all!

The same structure that provides acceleration provides error correction!

36

Slide37

System Level

Algorithm Level

Architecture Level

Circuit Level

Component Level

Technology Level

Recent work across the stack

Reliability, high performance, low energy:
Strategy: use an unreliable but self-correcting fabric

Reliable Accelerator over Unreliable fabric

37


Slide39

In an MPSoC with:
Concurrent tasks
Heterogeneous components
Each component should aim at finishing its tasks on time with minimum power
A larger share of the budget remains for the rest of the system

MPSoC

39

New application: many matrix operations!
What if we had a generic accelerator?
Adaptive and resilient!

Slide40

Algorithm-Based Fault Tolerance

Error Detection:
Error Correction:
- Step 1 (computes the residue)
- Step 2 (recomputation of the corrupted element)
[The detection and correction equations on the original slide were lost in extraction]

The difference between the calculated checksum vectors determines the wrong position (C(1,1))

40

Slide41

Resilient Adaptive Algebraic Architecture (RA3)

[Figure: worked example of a checksum-protected multiplication A × B. A single-event transient (S.E.T.) corrupts one element, and the corrupted value (-50) propagates to memory. The checksum residues single out row = 1 and column = 0; the corrupted element lies at the intersection of that row and column, the address of C(1,0) is calculated, the residue is calculated, and the correct value is written to the memory.]

41

Slide42

Written in VHDLMaximum size defined at synthesis time

Separated scratchpads to exploit regularity of accessesUnits can be powered on or off during run-timeProviding adaptive parallelism exploitation

Matrix-Matrix Multiplier

42

Slide43

RA3 Results – Performance

Matrix Multiplication Protected by ABFT: Software (GPU Nvidia Tesla S1070) [*] x Hardware (RA3)

[*] C. Ding, C. Karlsson, H. Liu, T. Davies and Z. Chen, “Matrix Multiplication on GPUs with On-line Fault Tolerance,” in ISPA ’11: Int. Symp. on Parallel and Distributed Processing with Applications, IEEE, 2011, pp. 311–317.

43

Slide44

RA3 Results – Fault Coverage

50,000 faults injected in each RA3 configuration.

# of RA3 multipliers |    1    |     2     |     4     |     8     |    16     |    32
# of Errors (a)      | 900/948 | 1563/1614 | 2727/2782 | 4345/4414 | 6693/6753 | 9236/9289
Coverage             | 94.94%  | 96.84%    | 98.02%    | 98.44%    | 99.11%    | 99.43%
(a) Number of corrected/injected errors.

Control becomes less significant in overall area!

44

Slide45

RA3 Results – Memory Wall

Memory access latency and execution time
LP-DDR2 pins

45

Slide46

Our Case Studies

MIMO Systems
Bottleneck = QR decomposition of the H matrix
Many matrices to be decomposed

Voice over IP (VoIP) Systems
Bottleneck = Acoustic Echo Canceller (AEC)
Uses adaptive filters with a large number of taps

Both operations can be realized with matrix multiplications

MIMO Decoding
AEC for VoIP

46

Slide47

MIMO Results

Execution time and performance constraint for QR decomposition

[Figure: execution time (s); the best choice meets the deadline with minimal power consumption]

47

Slide48

AEC Results – Execution time and performance constraints

Typical time frame of voice (codec G.711)
As AEC is not the only operation in a frame decoding…

[Figure: execution time (s) vs. length of the AEC filter (matrix dimensions); ~8 times power gap!]

Variable resources for a workload that varies at runtime!

48

Slide49

Conclusions about RA3

The architecture of RA3 is able to adaptively exploit parallelism in order to:
meet real-time constraints
respect limits imposed by memory bandwidth
with minimum power consumption

The RA3 also provides:
low-cost error correction
very predictable execution times

PARADIGM: aggressive scaling together with a reliable processor

49

Slide50

System Level

Algorithm Level

Architecture Level

Circuit Level

Component Level

Technology Level

Time to cross borders

Reliability, high performance, low energy:

Strategy: use a fast unreliable but self-correcting fabric & only write when you're sure about it

Transactional processor

50

Slide51

Adaptive Low-Power Architecture for

High-Performance and Reliable

Embedded Computing

Ronaldo Ferreira

, Jean da Rolt, Gabriel Nazar,

Álvaro Moreira, Luigi Carro

Universidade Federal do Rio Grande do Sul (UFRGS)

Porto Alegre, Brazil

51

Slide52

Architecture block diagram

52

Slide53

Transactional Core

Getting rid of checkpointing skews

Data sharing between basic blocks using registers is prohibited

All communication occurs in main memory

Virtually eliminates register liveness between basic blocks

Loose lockstep execution

Two RISC cores execute the same instruction but we allow timing errors

53
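The commit discipline above, buffer all stores, compare the two cores, and write to main memory only when they agree, can be sketched in a few lines. The `run_tbb` helper, the dict-based store buffers, and the retry policy are illustrative assumptions, not the actual hardware protocol:

```python
# Sketch of a transactional basic block (TBB): two cores execute the same
# block against a read-only memory snapshot; their buffered stores are
# committed only if identical, otherwise the block is re-executed.

def run_tbb(execute, memory, max_retries=3):
    """Run one transactional basic block.

    'execute' models one core running the block: it takes a snapshot of
    memory and returns the block's buffered stores as {address: value}.
    Transient faults are modeled by 'execute' misbehaving."""
    for _ in range(max_retries):
        stores1 = execute(dict(memory))   # core 1, private store buffer
        stores2 = execute(dict(memory))   # core 2, loose-lockstep copy
        if stores1 == stores2:
            memory.update(stores1)        # commit: both cores agree
            return True
        # Mismatch: a transient error hit one core; discard and retry.
    return False

# A block that increments memory[0] and copies the result to memory[1].
def block(mem):
    return {0: mem[0] + 1, 1: mem[0] + 1}

mem = {0: 41, 1: 0}
assert run_tbb(block, mem)
assert mem == {0: 42, 1: 42}

# A faulty execution: the first run returns a corrupted store buffer,
# which is detected at commit time and re-executed.
calls = {"n": 0}
def flaky_block(mem):
    calls["n"] += 1
    if calls["n"] == 1:
        return {0: 999, 1: 0}   # SEU corrupts one core's buffer once
    return {0: mem[0] + 1, 1: mem[0] + 1}

mem = {0: 41, 1: 0}
assert run_tbb(flaky_block, mem)  # detected, re-executed, committed
assert mem == {0: 42, 1: 42}
```

Because all inter-block communication goes through memory and commits are all-or-nothing, a mismatched block can simply be re-run from its snapshot, which is what makes the checkpointing cheap.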

Slide54

Transactional Core – The SW Stack

Transactional Basic Block

Two scenarios in case of errors

Register definition segment

54

Slide55

Transactional Core – The SW Stack

Transactional Basic Block

Two scenarios in case of errors

Register termination segment

55

Slide56

Transactional Core – Error Latency

store r1 add r1 r2 r3 load r3 load r2

RISC Core 1

store r1 add r1 r2 r3 load r3 load r2

RISC Core 2

time

56

Slide57

Transactional Core – Error Latency

store r1 add r1 r2 r3 load r3 load r2

RISC Core 1

store r1 add r1 r2 r3 load r3 load r2

RISC Core 2

time

error

57

Slide58

58

Transactional Core – Error Latency

store r1 add r1 r2 r3 load r3 load r2

RISC Core 1

store r1 add r1 r2 r3 load r3 load r2

RISC Core 2

time

Error latency is TBB length

Register file is in read-only mode

Slide59

Transactional Core – Error Latency

store r1 add r1 r2 r3 load r3 load r2

RISC Core 1

store r1 add r1 r2 r3 load r3 load r2

RISC Core 2

time

Error latency is 2 instructions

Register file has write permission

Average case is better than state of the practice

Instruction replay (e.g., Intel Itanium)

Slide60

60

Transactional Core – The HW Stack

Slide61

Error Coverage – Transactional Core

Six programs as benchmark

Bubble sort, least squares, CRC32, Kruskal, Floyd-Warshall, matrix multiply

672,348,891 injected SEUs in a Virtex-5 FPGA

61

Slide62

Error Coverage – Transactional Core

62

Slide63

Error Coverage – Transactional Core

Undetected errors are due to SEUs injected in the memory address that will be written right after the last comparator

63

Slide64

Error Coverage – RA3 Core

64

Slide65

Power - Transactional Core vs. Single RISC Core

65

Slide66

Area – Transactional Core vs Single RISC Core

66

Slide67

Area – Transactional vs RA3

67

Slide68

Area – Transactional vs RA3

The Transactional Core can be used as a small fault tolerant unit that connects accelerators together

68

Slide69

Transactional Core vs. MIPS

69

Slide70

Error Recovery Latency

70

Slide71

Conclusions for Transactional core

Natural heterogeneity
A model of aggressive scaling + conservative design has already been in use for 20 years. Guess where?
Compiler optimizations are being developed to reduce the 20% overhead caused by extra LW/SW instructions

71

Slide72

Hidden problems around the corner?

Performance, area, energy overhead... Are there some ...

72

Slide73

Propagation delay (*) vs. Technologies

[Figure: clocked inverter chain (in → out) and propagation delays at 32 nm, 90 nm, 130 nm, and 180 nm]

(*) simulated using parameters from the PTM web site and the HSPICE tool

73

Slide74

Transient width studies

[Figures: measured transient widths from DODD, 2004 and FERLET-CAVROIS, 2006]

74

Slide75

Transient widths vs. Propagation delays

Cycle time and transient width scaling across technologies

[Figure: cycle time and transient width (ps), 0–600, vs. technology node (180 nm, 130 nm, 100 nm, 90 nm, 70 nm, 32 nm); curves: Width 20MeV, Width 10MeV, Cycle 10 Inv, Cycle 8 Inv, Cycle 6 Inv, Cycle 4 Inv; annotation: 6.39x] (*)

(*) 180, 130, and 100 nm from [DODD, 2004], 70 nm from [FERLET-CAVROIS, 2006]

75

Slide76

Conclusions for the Tutorial (I)

Crossing borders might provide performance and low energy, together with reliability
Redesigning the stack: an ambitious but high-reward task
Pay only once: the same structure that provides acceleration must provide reliability

76

Slide77

Conclusions for the Tutorial (II)

Adaptivity is a major tool
Programming for GPUs: amount of parallelism for maximum reliability
Memory bandwidth: be as fast as memory allows
Algorithm x external conditions: Razor, RA3, and adaptive filters as role models

77

Slide78

Adaptive platform

[Diagram components: Programming Language, Source Code, Compiler, Binary, Adaptive Solver, Optimization Plans, Optimization Function, Adapted Binary, Runtime System, Hardware Platform with Adaptive Units]

Maintain SW abstraction
Adapt SW and HW, as a function of the QoS and as a function of required FT

78

Slide79

Exciting times ahead

Thank you!

carro@inf.ufrgs.br

afmoreira@inf.ufrgs.br

www.inf.ufrgs.br

/~carro

www.inf.ufrgs.br/~lse

We will answer questions, but NOT about the world cup

79


Slide81

System Level

Algorithm Level

Architecture Level

Circuit Level

Component Level

Technology Level

Recent work across the stack

Reliability, high performance, low energy:

Strategy: use regularity

Computing with memories

81

Slide82

Using Memory to Cope with Transient Faults

[Rhod et al, DATE, 2007]

Memory comes with intrinsic protection against manufacturing errors (spare columns and spare rows);
There are protection techniques with low area and latency overhead, like Reed-Solomon, that can be applied;
Use Reed-Solomon-protected memory to replace combinational circuits;
Reducing the area sensitive to faults;
Reducing the SER (soft error rate) of the circuit;

82

Slide83

Replacing Combinational Circuit by Memory (ROM / Magnetic Memory)

Example: 4x4 bit multiplier
Fully combinational:
Total area = 304 transistors
Fully memory:
Memory with Input A (4 bits) and Input B (4 bits) producing an 8-bit result
8 inputs and 8 outputs: 2^8 x 8 = 2,048 bits
Total area = 2,048 transistors, considering 1 transistor per bit
EXPENSIVE

[Rhod et al, DATE, 2007]

Slide84

Replacing Combinational Circuit by Memory (ROM / Magnetic Memory)

Example: 4x4 bit multiplier
Fully combinational:
Total area = 304 transistors
Let's replace just some part of the circuit!!!
1 column: memory of 512 bits (7 inputs and 4 outputs: 2^7 x 4 = 512 bits)
Area cost = 512 transistors
Latency = 7 cycles

[Rhod et al, DATE, 2007]

Slide85

Results

FIR Filter Fault Rate Results for SINGLE Fault Injection
Circuit       | Total Area | # of gates that fail | Latency (ns) | Proportional fault rate (%)
Combinational | 6524       | 1631                 | 69           | 48.21
Memory        | 1832       | 50                   | 56.8         | 2.58

FIR Filter Fault Rate Results for DOUBLE Fault Injection
Circuit       | Total Area | # of gates that fail | Latency (ns) | Proportional fault rate (%)
Combinational | 6524       | 1631                 | 69           | 67.35
Memory        | 1832       | 50                   | 56.8         | 2.96

[Rhod et al, DATE, 2007]