Slide1
Overcoming Hard-Faults in High-Performance Microprocessors
I2PC Talk
Sept 15, 2011
Presented by: Amin Ansari
Slide2
Significance of Reliability
2
Mission Critical Systems
Commodity Systems
Engine or Brake Control Unit
Full Authority Digital Engine Control
Financial Analysis or Transactions
HP Tandem
NonStop
IBM z
series
Equipped with:
Triple Modular
Redundancy
Watchdog Timer
Error Correction Code
Fault-Tolerant Scheduling
Desktop
or Server Processor
ECC
RAID
Hard-Faults
Core disabling
Slide3
Main sources
Manufacturing defects
Process variation induced
In-field wearout
Ultra low-power operation
Have a direct impact on
Manufacturing yield
Performance
Lifetime throughput
Dependability of semiconductor parts
Hard-Faults
3
Slide4
Happen due to:
Silicon crystal defects
Random particles on the wafer
Fabrication imprecision
ITRS: one defect expected per five 100 mm² dies
A real threat to yield
Manufacturing Defects
4
Slide5
Protecting high-performance µPs against hard-faults is more challenging:
They contain billions of transistors
↑ transistors per core (no core disabling)
Complex connectivity + many stages
Fine-grained redundancy is not cost-effective
Higher clock frequency, voltage, and temperature
↑ operational stress accelerates aging
Operating at the most aggressive V/F curve
Usage of high V/F guard-bands
Large on-chip caches
The bit-cell with the worst timing characteristics dictates the V and F of the SRAM array
Challenges with High-Performance µPs
5
[AMD,
Phenom
]
[Intel, Nehalem]
[IBM, POWER7]
Slide6
Archipelago
[HPCA’11]
Necromancer
[ISCA’10, IEEE Micro’10]
Archipelago
protects
on-chip caches against:
● Near-threshold failures●
Process variation●
Wearout and defects
Outline
6
Necromancer
protects
general core area (non-
cache parts) against:
●
Manufacturing defects
●
Wearout failures
Objective:
overcome hard-faults in high-performance µPs,
with comprehensive, low-cost solutions for protecting the
on-chip caches and also the non-cache parts of the core.
Archipelago
[HPCA’11]
Necromancer
[ISCA’10, IEEE Micro’10]
Slide7
NT Operation: SRAM Bit-Error-Rate
Extremely fast growth in failure rate with decreasing Vdd
7
Slide8
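This explosion in failure rate follows directly from the exponent in per-line fault probability. A back-of-the-envelope sketch (the bit-cell failure probabilities and the 512-bit line size are illustrative assumptions, not the talk's data):

```python
# Illustrative sketch: why SRAM failure rates explode at low Vdd.
# If each bit-cell fails independently with probability p_bit, an
# n-bit cache line is fault-free with probability (1 - p_bit)**n.

def line_fail_prob(p_bit, n_bits=512):
    """Probability that at least one bit of an n-bit line is faulty."""
    return 1.0 - (1.0 - p_bit) ** n_bits

# As p_bit rises from ~1e-9 (nominal Vdd) toward ~1e-3 (near-threshold),
# per-line failure probability grows by many orders of magnitude.
for p in (1e-9, 1e-6, 1e-3):
    print(f"p_bit={p:.0e}  ->  p_line={line_fail_prob(p):.6f}")
```

At a near-threshold bit-cell failure rate of 1e-3, roughly 40% of 512-bit lines contain at least one faulty cell, which is why naive row/column sparing stops working and a more flexible scheme is needed.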
Our Goal
Enabling DVS to push the core’s Vdd down to the ultra-low-voltage region (< 650mV)
While preserving the correct functionality of on-chip caches
8
Proposing a highly flexible and fault-tolerant (FT) cache architecture that can efficiently tolerate these SRAM failures
Minimizing our overheads in high-power mode
Slide9
Archipelago (AP)
(Figure: eight cache lines, each divided into data chunks; one line per bank serves as a sacrificial line.)
This particular cache has only a single fully functional line. By forming autonomous islands (Island 1 and Island 2), AP saves 6 out of 8 lines.
9
Slide10
Baseline AP
Architecture
10
(Figure: the input address indexes the memory map and the fault map, both built from 10T bit-cells; a MUXing layer steers chunks between a data line and its group’s (e.g., G3’s) sacrificial line across the first and second banks.)
Added modules:
●
Memory map
●
Fault map
●
MUXing layer
Two types of lines:
●
data line
●
sacrificial line
Two lines have a collision if they have at least one faulty chunk in the same position (the blue and orange lines are collision-free).
There must be no collision between the lines within a group [Group 3 (G3) contains the green, blue, and orange lines].
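The collision test is a simple per-chunk intersection. A minimal sketch (the fault patterns below are made up for illustration, not the ones in the figure):

```python
# Illustrative sketch of AP's collision test: each cache line is a tuple
# of booleans, one per data chunk (True = faulty chunk).

def collide(line_a, line_b):
    """Two lines collide if some chunk position is faulty in both."""
    return any(a and b for a, b in zip(line_a, line_b))

# Collision-free lines can share one sacrificial line, because each
# line's faulty chunk maps to a different chunk of that sacrificial line.
blue   = (False, True,  False, False)   # faulty in chunk 1
orange = (False, False, False, True)    # faulty in chunk 3
green  = (True,  False, False, True)    # faulty in chunks 0 and 3

assert not collide(blue, orange)   # disjoint faulty positions
assert collide(orange, green)      # both faulty in chunk 3
```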
Slide11
AP
with Relaxed Group Formation
11
Sacrificial lines do not contribute to the effective capacity
We want to minimize the total number of groups
(Figure: two group-formation alternatives across the first and second banks; sacrificial lines are marked S.)
Slide12
Semi-Sacrificial Lines
12
First Bank
Second Bank
Sacrificial line
Semi-sacrificial line
MUXing Layer
Accessed Line
Lending
Reclaiming
Semi-sacrificial line guarantees the parallel access
In contrast to a sacrificial line, it also contributes to the effective cache capacity
Slide13
AP with Semi-Sacrificial Lines
13
(Figure: the baseline AP datapath with the memory map, fault map, and MUXing layer as before, but the sacrificial line in the second bank is replaced by a semi-sacrificial line; ways 0 and 1 are shown for each bank.)
Slide14
AP Configuration
We model the problem as a graph:
Each node is a line of the cache
An edge connects two nodes when there is no collision between them
A collision-free group forms a clique
Group formation = finding the cliques
14
To maximize the number of functional lines, we need to minimize the number of groups: minimum clique cover (MCC)
Slide15
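Minimum clique cover is NP-hard, so a practical configuration step would approximate it. A hedged sketch (the greedy first-fit heuristic and names here are my illustration, not the paper's exact algorithm): since collision-freedom means disjoint fault positions, a group can be summarized by the OR of its members' fault masks.

```python
# Greedy first-fit approximation of minimum clique cover for AP group
# formation (illustrative, not the paper's exact algorithm). Each line
# is a fault bitmask: bit i set = chunk i is faulty. Two lines are
# collision-free iff their masks share no set bit, so a whole group
# (clique) can be tracked as the OR of its members' masks.

def form_groups(fault_masks):
    """Pack collision-free lines into as few groups as possible."""
    groups = []  # each entry: [combined_fault_mask, member_indices]
    for idx, mask in enumerate(fault_masks):
        for group in groups:
            if group[0] & mask == 0:       # no faulty chunk in common
                group[0] |= mask
                group[1].append(idx)
                break
        else:
            groups.append([mask, [idx]])   # start a new group
    return [members for _, members in groups]

# Lines 0 and 1 have disjoint faulty chunks and share a group;
# line 2 collides with both and opens a second group.
print(form_groups([0b0010, 0b0100, 0b0110]))  # -> [[0, 1], [2]]
```

Fewer groups means fewer sacrificial lines, which is exactly the objective the MCC formulation captures.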
AP Configuration Example
15
(Figure: ten cache lines spread across two banks are partitioned into two islands. Island/Group 1 holds lines G1(1)–G1(3) plus its sacrificial line G1(S); Island/Group 2 holds G2(1)–G2(4) plus G2(S); line D is disabled.)
Slide16
Operation Modes
16
High power mode (
AP is
turned off
)
There are no non-functional lines in this case
Clock gating to reduce the dynamic power of AP’s SRAM structures
Low power mode
During boot time in low-power mode, BIST scans the cache for potential faulty cells
AP forms the groups and configures the hardware
The processor can then switch back to high power mode
Slide17
Minimum Achievable Vdd
17
Slide18
Performance
Loss
One extra cycle of latency for L1 and two extra cycles for L2
18
Slide19
19
Comparison with Alternative Methods
Conventional
Recently Proposed
10T : [
Verma
, ISSCC’08]
ZC : [Ansari, MICRO’09]
BF : [Wilkerson, ISCA’08]
Slide20
Archipelago: Summary
DVS is widely used to deal with high power dissipation
Minimum achievable voltage is bounded by SRAM structures
We
proposed a
highly flexible cache architecture
To tolerate failures when operating in near-threshold region
Using our approach:
Vdd of the processor can be reduced to 375mV
79% dynamic power saving and 51% leakage power saving
< 10% area and performance overheads
20
Slide21
Archipelago
protects
on-chip caches against:
●
Near-threshold failures
● Process variation
● Wearout and defects
Outline
21
Necromancer
protects
general core area (non-
cache parts) against:
●
Manufacturing defects
●
Wearout failures
Archipelago
[HPCA’11]
Necromancer
[ISCA’10, IEEE Micro’10]
Slide22
Necromancer (NM)
Given a CMP system, Necromancer
Utilizes a dead core (i.e., a core with a hard-fault) to do useful work
Enhances system throughput
22
There are proper techniques to protect caches
To maintain an acceptable level of yield, the processing cores also need to be protected
This is more challenging due to their inherent irregularity
Slide23
Impact of Hard-Faults on Program Execution
23
Distribution of injected hard-faults that manifest as architectural state mismatches, across different latencies
Latency is measured as the number of committed instructions before the mismatch happens, starting from a valid architectural state
More than 40% of the injected faults cause an immediate (less than 10K instructions) architectural state mismatch. Thus, a faulty core cannot be trusted to provide correct functionality even for short periods of program execution.
Slide24
Relaxing Absolute Correctness Constraint
24
Distribution of injected faults resulting in a similarity index mismatch, across different latencies
Similarity Index (SI): the % of PCs matching between the faulty and golden executions (sampled at 1K-instruction intervals)
For an SI threshold of 90%, in more than 85% of cases, the dead core can successfully commit at least 100K instructions before its execution differs by more than 10%
Slide25
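The SI metric itself is straightforward to state in code. A minimal sketch (the sampling scheme is simplified; the function name and PC values are mine, not the paper's):

```python
# Illustrative computation of the similarity index (SI): the percentage
# of PC samples (one per 1K committed instructions) that match between
# the faulty ("dead") execution and the fault-free ("golden") one.

def similarity_index(dead_pcs, golden_pcs):
    """% of paired PC samples that agree between the two executions."""
    samples = list(zip(dead_pcs, golden_pcs))
    matches = sum(1 for d, g in samples if d == g)
    return 100.0 * matches / len(samples)

golden = [0x400, 0x408, 0x410, 0x418, 0x420]
dead   = [0x400, 0x408, 0x999, 0x418, 0x420]   # diverges at one sample
print(similarity_index(dead, golden))           # -> 80.0
```

The point of relaxing absolute correctness to an SI threshold is that a coarsely-matching execution is still useful, even when exact architectural state has long since diverged.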
Using the Undead Core to Generate Hints
25
The execution behavior of a dead core
coarsely
matches the intact program execution for long time periods
How can we exploit the program execution on the dead core?
By accelerating the execution of another core!
We extract useful information from the execution of the program on the dead core and send this information (hints) to another core (the animator core) that runs the same program.
Undead Core
Animator
Core
Hints
Hard-fault
Performance
Slide26
Opportunities for Acceleration
26
Perfect hints:
Perfect branch prediction and
No L1 cache miss
Increasing complexity/resources
IPC of several Alpha cores, normalized to EV4’s IPC.
In most cases, by providing perfect hints to the simpler cores (EV4, EV5, and OoO EV4), these cores can achieve performance comparable to that of a 6-issue OoO EV6.
Slide27
Necromancer Architecture
27
A robust heterogeneous core coupling execution technique
(Figure: the undead core’s pipeline (FET–COM) feeds a hint-gathering queue; hints and cache fingerprints flow through hint distribution to the animator core’s pipeline (FE–CO); the resynchronization signal and hint-disabling information flow back; both cores access the memory hierarchy through the shared L2 cache, the undead core’s L2 access being read-only.)
The Undead Core / The Animator Core:
● No communication for L2 warm-up
● Most communications flow from the undead core to the animator core, except the resynchronization and hint-disabling signals
● A single queue for sending hints and cache fingerprints
● The animator core is an older-generation core with the same ISA and fewer resources
● A 2-issue OoO EV4 (in our evaluation)
● Handles exceptions in NM-coupled cores
● Treats $ hints as prefetching info
● Fuzzy hint disabling approach based on continuous monitoring of hint effectiveness
● PC & architectural registers used for resynchronization
● The undead core executes the same program to provide hints for the AC
● It works as “an external run-ahead engine for the AC”
● A 6-issue OoO EV6 (in our evaluation)
● I$ hints: PCs of committed instructions
● D$ hints: addresses of committed loads/stores
● Branch prediction hints: BP updates
● D$ dirty lines are dropped when they need to be replaced
● It can proceed on L2 data misses
Slide28
Example: Branch Prediction Hints
28
(Figure: the same NM architecture as the previous slide, highlighting the branch-prediction hint path from the undead core’s hint-gathering queue to the animator core.)
Hint format: PC*, NPC, Type, Age
A buffered hint is released when: Age tag ≤ num committed instructions + BP release window size
Hint disabling compares the prediction outcomes of the AC’s original BP and the NM predictor (PC*-indexed, with saturating counters SC1 and SC2, forming a tournament predictor):
Original BP | NM BP | Action
r | r | --
a | a | --
a | r | r
r | a | a
Counter > Threshold → Disable Hint
Slide29
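The disabling logic above amounts to a saturating disagreement counter. A hedged sketch (a deliberate simplification of the fuzzy scheme; the class, threshold value, and `observe` interface are mine, not the paper's):

```python
# Illustrative sketch of fuzzy hint disabling: count disagreements
# between the animator core's original branch predictor and the NM
# (hint-driven) predictor; past a threshold, stop trusting the hints.

class HintDisabler:
    def __init__(self, threshold=8):
        self.counter = 0
        self.threshold = threshold
        self.hints_enabled = True

    def observe(self, original_pred, nm_pred):
        """Record one branch outcome pair; disable hints if too many disagree."""
        if original_pred != nm_pred:
            self.counter += 1
            if self.counter > self.threshold:
                self.hints_enabled = False   # coarse (fuzzy) disable, not per-hint
        return self.hints_enabled

hd = HintDisabler(threshold=2)
for orig, nm in [("a", "a"), ("a", "r"), ("r", "a"), ("a", "r")]:
    hd.observe(orig, nm)
print(hd.hints_enabled)   # -> False (3 disagreements exceed threshold of 2)
```

Disabling on an aggregate counter rather than per hint keeps the mechanism cheap and tolerant of occasional divergence between the two executions.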
NM Design for CMP Systems
29
Slide30
Impact of Hard-Fault Location
30
Program Counter
Instruction Fetch Queue
Integer ALU
Slide31
Overheads
31
Slide32
Performance Gain
32
88%
71%
Slide33
Necromancer: Summary
Enhancing system throughput by exploiting dead cores
Necromancer leverages a set of microarchitectural techniques:
Intrinsically robust hints
Fine and coarse-grained hint disabling
Online monitoring of hints effectiveness
Dynamic state resynchronization between cores
Applying Necromancer to a 4-core CMP
On average, 88% of the original performance of the undead core can be retrieved
Modest area and power overheads of 5.3% and 8.5%, respectively
33
Slide34
Takeaways
34
To achieve efficient, reliable solutions, we need:
Runtime adaptability
A high degree of reconfigurability
Fine-grained spare substitution
Mission-critical and conventional reliability solutions are too expensive for modern high-performance processors
AP: low-cost cache protection against the major reliability threats in nanometer technologies
For the processing core, instead of redundancy, NM offers an alternative that utilizes dead cores
Slide35
Thank You
35
?