Presentation Transcript

Slide1

Overcoming Hard-Faults in High-Performance Microprocessors

I2PC Talk

Sept 15, 2011

Presented by: Amin Ansari

Slide2

Significance of Reliability

Mission-Critical Systems
● Engine or Brake Control Unit (Full Authority Digital Engine Control)
● Financial Analysis or Transactions (HP Tandem NonStop, IBM zSeries)
● Equipped with: Triple Modular Redundancy, Watchdog Timer, Error Correction Code, Fault-Tolerant Scheduling

Commodity Systems
● Desktop or Server Processor
● ECC, RAID
● Hard-Faults: core disabling

Slide3

Hard-Faults

Main sources:
● Manufacturing defects
● Process variation induced
● In-field wearout
● Ultra-low-power operation

Have a direct impact on:
● Manufacturing yield
● Performance
● Lifetime throughput
● Dependability of semiconductor parts

Slide4

Manufacturing Defects

Happen due to:
● Silicon crystal defects
● Random particles on the wafer
● Fabrication impreciseness

ITRS expects one defect per five 100 mm² dies.
A real threat to yield.

Slide5

Challenges with High-Performance µPs

Protecting high-performance µPs against hard-faults is more challenging:
● They contain billions of transistors
  ● ↑ transistors per core (no core disabling)
  ● Complex connectivity + many stages: fine-grained redundancy is not cost-effective
● Higher clock frequency, voltage, and temperature
  ● ↑ operational stress accelerates aging
● Operating at the most aggressive V/F curve
  ● Usage of high V/F guard-bands
● Large on-chip caches
  ● The bit-cell with the worst timing characteristics dictates the V and F of the SRAM array

[Die photos: AMD Phenom, Intel Nehalem, IBM POWER7]

Slide6

Outline

Objective: overcome hard-faults in high-performance µPs with comprehensive, low-cost solutions for protecting the on-chip caches and the non-cache parts of the core.

Archipelago [HPCA'11] protects on-chip caches against:
● Near-threshold failures
● Process variation
● Wearout and defects

Necromancer [ISCA'10, IEEE Micro'10] protects the general core area (non-cache parts) against:
● Manufacturing defects
● Wearout failures

Slide7

NT Operation: SRAM Bit-Error Rate

Extremely fast growth in failure rate with decreasing Vdd.

Slide8

Our Goal

Enabling DVS to push the core's Vdd down to the ultra-low-voltage region (< 650 mV) while preserving correct functionality of the on-chip caches.

● Proposing a highly flexible and fault-tolerant cache architecture that can efficiently tolerate these SRAM failures
● Minimizing our overheads in high-power mode

Slide9

Archipelago (AP)

[Figure: an 8-line cache with faulty data chunks marked; two lines are set aside as sacrificial lines, forming Island 1 and Island 2]

This particular cache has only a single fully functional line. By forming autonomous islands, AP saves 6 out of 8 lines.

Slide10

Baseline AP Architecture

[Figure: first and second cache banks behind a MUXing layer; a fault map (10T) and a memory map (10T) are indexed by the input address and select among data lines, sacrificial lines, and the functional line of group G3]

Added modules:
● Memory map
● Fault map
● MUXing layer

Two types of lines: data lines and sacrificial lines.

Two lines have a collision if they have at least one faulty chunk in the same position (the blue and orange lines are collision free). There should be no collision between lines within a group [Group 3 (G3) contains the green, blue, and orange lines].
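To make the collision rule concrete, here is a minimal Python sketch (not the AP hardware): fault maps are boolean lists with one entry per chunk, and the green/blue/orange patterns are made up for illustration.

```python
# Two lines collide if they share a faulty chunk position; a valid AP group
# must be collision free so that every chunk can be served by some member.

def collides(fault_map_a, fault_map_b):
    """True if both lines have a faulty chunk at the same position."""
    return any(a and b for a, b in zip(fault_map_a, fault_map_b))

def group_is_collision_free(fault_maps):
    """A valid AP group: no pair of member lines shares a faulty chunk position."""
    return all(not collides(fault_maps[i], fault_maps[j])
               for i in range(len(fault_maps))
               for j in range(i + 1, len(fault_maps)))

# Example with four chunks per line (patterns are illustrative):
green  = [False, True,  False, False]   # chunk 1 faulty
blue   = [True,  False, False, False]   # chunk 0 faulty
orange = [False, False, True,  False]   # chunk 2 faulty
assert group_is_collision_free([green, blue, orange])
```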

Slide11

AP with Relaxed Group Formation

Sacrificial lines do not contribute to the effective capacity, so we want to minimize the total number of groups.

[Figure: group formation across the first and second banks, before and after relaxing group formation; S marks the sacrificial lines]

Slide12

Semi-Sacrificial Lines

[Figure: first and second banks behind the MUXing layer; a semi-sacrificial line lends chunks to the accessed line and reclaims them afterward]

A semi-sacrificial line guarantees parallel access. In contrast to a sacrificial line, it also contributes to the effective cache capacity.

Slide13

AP with Semi-Sacrificial Lines

[Figure: the baseline AP architecture extended with a semi-sacrificial line; the fault map and memory map steer the input address to way 0 or way 1 of the first and second banks for group G3]

Slide14

AP Configuration

We model the problem as a graph:
● Each node is a line of the cache
● An edge connects two nodes when there is no collision between them
● A collision-free group forms a clique
● Group formation = finding the cliques

To maximize the number of functional lines, we need to minimize the number of groups: this is minimum clique cover (MCC).
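Minimum clique cover is NP-hard, so any practical configuration step is a heuristic. The sketch below is a simple first-fit greedy grouping under the collision rule from the previous slide; it only illustrates the objective (fewer groups, hence fewer sacrificial lines) and is not the configuration algorithm from the paper.

```python
# First-fit greedy grouping: put each line into the first existing group it
# does not collide with; otherwise open a new group. Fewer groups means fewer
# sacrificial lines and therefore more effective cache capacity.

def collides(fault_map_a, fault_map_b):
    return any(a and b for a, b in zip(fault_map_a, fault_map_b))

def form_groups(fault_maps):
    groups = []                                   # each group is a list of line indices
    for line_id, fmap in enumerate(fault_maps):
        for group in groups:
            if all(not collides(fmap, fault_maps[member]) for member in group):
                group.append(line_id)
                break
        else:                                     # no compatible group found
            groups.append([line_id])
    return groups

# Example: four 2-chunk lines, two of which collide, end up in two groups.
lines = [[True, False], [False, True], [True, False], [False, False]]
print(form_groups(lines))                         # -> [[0, 1, 3], [2]]
```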

Slide15

AP Configuration Example

[Figure: ten cache lines across the first and second banks packed into two islands (groups); Group 1 holds G1(1), G1(2), and G1(3) plus a sacrificial line G1(S), Group 2 holds G2(1) through G2(4) plus G2(S), and one line (D) is disabled]

Slide16

Operation Modes

High power mode (AP is turned off):
● There are no non-functional lines in this case
● Clock gating reduces the dynamic power of the added SRAM structures

Low power mode:
● During boot time, BIST scans the cache for potential faulty cells
● Groups are formed and the hardware is configured
● The processor then switches back to high power mode

Slide17

Minimum Achievable Vdd

[Figure: minimum achievable Vdd results]

Slide18

Performance Loss

One extra cycle of latency for L1 and two extra cycles for L2.

[Figure: performance loss results]

Slide19

Comparison with Alternative Methods

[Figure: comparison against conventional and recently proposed schemes]

Recently proposed schemes:
● 10T: [Verma, ISSCC'08]
● ZC: [Ansari, MICRO'09]
● BF: [Wilkerson, ISCA'08]

Slide20

Archipelago: Summary

● DVS is widely used to deal with high power dissipation
● The minimum achievable voltage is bounded by SRAM structures
● We proposed a highly flexible cache architecture to tolerate failures when operating in the near-threshold region
● Using our approach:
  ● Vdd of the processor can be reduced to 375 mV
  ● 79% dynamic power saving and 51% leakage power saving
  ● < 10% area and performance overheads

Slide21

Outline

Archipelago [HPCA'11] protects on-chip caches against:
● Near-threshold failures
● Process variation
● Wearout and defects

Necromancer [ISCA'10, IEEE Micro'10] protects the general core area (non-cache parts) against:
● Manufacturing defects
● Wearout failures

Slide22

Necromancer (NM)

Given a CMP system, Necromancer:
● Utilizes a dead core (i.e., a core with a hard-fault) to do useful work
● Enhances system throughput

There are proper techniques to protect caches; to maintain an acceptable level of yield, the processing cores also need to be protected. This is more challenging due to their inherent irregularity.

Slide23

Impact of Hard-Faults on Program Execution

[Figure: distribution of injected hard-faults that manifest as architectural state mismatches across different latencies, measured by the number of committed instructions before the mismatch occurs when starting from a valid architectural state]

More than 40% of the injected faults cause an immediate (less than 10K instructions) architectural state mismatch. Thus, a faulty core cannot be trusted to provide correct functionality even for short periods of program execution.

Slide24

Relaxing Absolute Correctness Constraint

[Figure: distribution of injected faults resulting in a similarity-index mismatch across different latencies]

Similarity Index (SI): the percentage of PCs matching between the faulty and the golden execution, sampled at 1K-instruction intervals. For an SI threshold of 90%, in more than 85% of cases the dead core can successfully commit at least 100K instructions before its execution differs by more than 10%.
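One way to read that definition in code, assuming PC traces are plain lists and one PC is sampled every 1K committed instructions (both assumptions for illustration):

```python
# Similarity index: percentage of sampled PCs that match between the golden
# and the faulty execution. Traces are lists of committed-instruction PCs.

def similarity_index(golden_pcs, faulty_pcs, interval=1000):
    points = range(0, min(len(golden_pcs), len(faulty_pcs)), interval)
    if len(points) == 0:
        return 100.0
    matches = sum(golden_pcs[i] == faulty_pcs[i] for i in points)
    return 100.0 * matches / len(points)

# An SI threshold of 90% then reads: keep trusting the dead core's execution
# as long as similarity_index(golden, faulty) stays at or above 90.
```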

Slide25

Using the Undead Core to Generate Hints

The execution behavior of a dead core coarsely matches the intact program execution for long time periods. How can we exploit the program execution on the dead core? By accelerating the execution of another core: we extract useful information from the execution of the program on the dead core and send this information (hints) to the other core (the animator core), which runs the same program.

[Figure: the undead core, despite its hard-fault, sends hints that improve the animator core's performance]

Slide26

Opportunities for Acceleration

Perfect hints: perfect branch prediction and no L1 cache misses.

[Figure: IPC of several Alpha cores, in order of increasing complexity/resources, normalized to the EV4's IPC]

In most cases, by providing perfect hints to the simpler cores (EV4, EV5, and EV4 (OoO)), these cores can achieve performance comparable to that of a 6-issue OoO EV6.

Slide27

Necromancer Architecture

A robust heterogeneous core coupling execution technique.

[Figure: the undead core (pipeline stages FET, DEC, REN, DIS, EXE, MEM, COM) and the animator core (FE, DE, RE, DI, EX, ME, CO), each with L1-Inst and L1-Data caches, share the L2 cache (one connection is marked read-only); a hint-gathering unit on the undead core pushes hints and cache fingerprints into a queue, a hint-distribution and hint-disabling unit on the animator core consumes them, and a resynchronization signal with hint-disabling information flows back to the undead core]

The undead core:
● Executes the same program to provide hints for the animator core (AC); it works as "an external run-ahead engine for the AC"
● A 6-issue OoO EV6 (in our evaluation)
● I$ hints: PCs of committed instructions
● D$ hints: addresses of committed loads/stores
● Branch prediction hints: BP updates
● D$ dirty lines are dropped when they need to be replaced
● It can proceed on data L2 misses

The animator core:
● An older-generation core with the same ISA and fewer resources
● A 2-issue OoO EV4 (in our evaluation)
● Handles exceptions in NM coupled cores
● Treats cache hints as prefetching information
● Fuzzy hint disabling based on continuous monitoring of hint effectiveness
● PC and architectural registers are used for resynchronization

Communication:
● A single queue for sending hints and cache fingerprints
● Most communication is from the undead core to the animator core, except the resynchronization and hint-disabling signals
● No communication is needed for L2 warm-up
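A small sketch of the hint traffic on the undead-core side; the field names, queue depth, and the on_commit interface are illustrative assumptions that mirror the hint format (PC, NPC, type, age) shown on the next slide.

```python
# Hint gathering on the undead core: for every committed instruction, emit an
# I-cache hint; for committed loads/stores, a D-cache hint (treated as
# prefetch info by the animator core); for branches, a BP-update hint.
# All hints share one bounded queue toward the animator core.

from collections import deque, namedtuple

Hint = namedtuple("Hint", "pc npc kind age")      # kind: "icache" | "dcache" | "branch"

class HintGathering:
    def __init__(self, capacity=64):
        self.queue = deque(maxlen=capacity)        # oldest hints dropped when full
        self.committed = 0                         # age tag = commit count on the undead core

    def on_commit(self, pc, npc, is_branch=False, mem_addr=None):
        self.committed += 1
        self.queue.append(Hint(pc, npc, "icache", self.committed))
        if mem_addr is not None:                   # committed load/store address
            self.queue.append(Hint(mem_addr, None, "dcache", self.committed))
        if is_branch:                              # branch predictor update
            self.queue.append(Hint(pc, npc, "branch", self.committed))
```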

Slide28

Example: Branch Prediction Hints

[Figure: the same coupled-core datapath as before; the undead core's hint-gathering unit pushes branch-prediction hints through the queue to the animator core's hint-distribution and hint-disabling units]

Hint format: PC*, NPC, Type, Age. Hints wait in a buffer and are released into the animator core's predictor only when

Age tag ≤ number of committed instructions + BP release window size

Hint disabling: the animator core's original branch predictor and the NM predictor (indexed by PC*, holding the hinted NPC with saturating counters SC1 and SC2) are combined through a tournament-style predictor. The prediction outcomes of the original BP and the NM BP are compared: when the two agree, no action is taken; when they disagree, a counter tracks how the disagreements resolve, and once the counter exceeds a threshold the hint is disabled.
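A minimal sketch of the release and disabling rules above, on the animator-core side. The window size, the threshold, and the choice to count only the cases where the original predictor was right while the hint-based prediction was wrong are illustrative assumptions, not the exact policy from the paper.

```python
# Fuzzy hint disabling for branch-prediction hints: release a buffered hint
# only while its age tag is within the BP release window, and keep a counter
# of commits where the hint-based prediction lost to the animator core's own
# predictor; past a threshold, BP hints are switched off.

class BranchHintDisabler:
    def __init__(self, release_window=256, disable_threshold=64):
        self.release_window = release_window
        self.threshold = disable_threshold
        self.counter = 0
        self.hints_enabled = True

    def may_release(self, hint_age, committed_instructions):
        # Age tag <= number of committed instructions + BP release window size
        return (self.hints_enabled and
                hint_age <= committed_instructions + self.release_window)

    def observe_outcome(self, original_bp_correct, nm_bp_correct):
        # Count only the case where the hint made things worse.
        if original_bp_correct and not nm_bp_correct:
            self.counter += 1
            if self.counter > self.threshold:
                self.hints_enabled = False         # disable BP hints
```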

Slide29

NM Design for CMP Systems

[Figure: Necromancer design for CMP systems]

Slide30

Impact of Hard-Fault Location

[Figure: performance impact of hard-faults located in the program counter, the instruction fetch queue, and an integer ALU]

Slide31

Overheads

[Figure: area and power overhead results]

Slide32

Performance Gain

[Figure: performance gain results; the highlighted points are 88% and 71%]

Slide33

Necromancer: Summary

● Enhancing system throughput by exploiting dead cores
● Necromancer leverages a set of microarchitectural techniques to provide:
  ● Intrinsically robust hints
  ● Fine- and coarse-grained hint disabling
  ● Online monitoring of hint effectiveness
  ● Dynamic state resynchronization between cores
● Applying Necromancer to a 4-core CMP:
  ● On average, 88% of the original performance of the undead core can be retrieved
  ● Modest area and power overheads of 5.3% and 8.5%

Slide34

Takeaways

To achieve efficient, reliable solutions:
● Runtime adaptability
● High degree of re-configurability
● Fine-grained spare substitution

● Mission-critical and conventional reliability solutions are too expensive for modern high-performance processors
● AP: low-cost cache protection against major reliability threats in nanometer technologies
● For the processing core, rather than redundancy, NM is an alternative that utilizes dead cores

Slide35

Thank You

?