Slide1
Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery
Yu Cai, Yixin Luo, Erich F. Haratsch*, Ken Mai, Onur Mutlu
Carnegie Mellon University, *LSI Corporation
Slide2
You Probably Know
Many use cases:
+ High performance, low energy consumption
Slide3
NAND Flash Memory Challenges
– Requires erase before program (write)
– High raw bit error rate
[Figure: CPU, Flash Controller, ECC Controller, Raw Flash Memory Chips]
Slide4
Limited Flash Memory Lifetime
[Figure: raw bit error rate (RBER) vs. Program/Erase (P/E) cycles (or writes per cell), with the ECC-correctable RBER level; labels: "Newer generation", P/E cycle lifetime of ~3000 and ~2000]
Goal: Extend flash memory lifetime at low cost
Slide5
Retention Loss
Charge leakage over time
One dominant source of flash memory errors [DATE ‘12, ICCD ‘12]
[Figure: flash cell with stored bit values (1, 0) and a retention error]
Slide6
NAND Flash 101
Before I show you how we extend flash lifetime …
Slide7
Threshold Voltage (Vth)
[Figure: two flash cells, storing 0 and 1, placed on a normalized Vth axis]
Slide8
Threshold Voltage (Vth) Distribution
[Figure: probability density function (PDF) of normalized Vth for the values 0 and 1]
Slide9
Read Reference Voltage (Vref)
[Figure: PDF of normalized Vth for the values 0 and 1, with Vref marked]
Slide10
Multi-Level Cell (MLC)
[Figure: PDF of normalized Vth with four states, Erased (11), P1 (10), P2 (00), and P3 (01), separated by the ER-P1 Vref, P1-P2 Vref, and P2-P3 Vref]
Slide11
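As a rough illustration of the Multi-Level Cell slide above: a read locates a cell's threshold voltage among the three read reference voltages and maps the resulting region to a two-bit value. Only the state-to-bits mapping (Erased = 11, P1 = 10, P2 = 00, P3 = 01) comes from the slide; the numeric Vref values and the name `decode_mlc` are assumptions made for this sketch.

```python
from bisect import bisect_right

# States and bit patterns as listed on the slide, from lowest to highest Vth.
STATES = [("ER", "11"), ("P1", "10"), ("P2", "00"), ("P3", "01")]

# Hypothetical normalized read reference voltages: ER-P1, P1-P2, P2-P3.
VREFS = [100.0, 200.0, 300.0]

def decode_mlc(vth, vrefs=VREFS):
    """Return (state, bits) by counting how many Vrefs lie at or below vth."""
    region = bisect_right(vrefs, vth)
    return STATES[region]

print(decode_mlc(50.0))    # below every Vref
print(decode_mlc(250.0))   # between the P1-P2 and P2-P3 Vref
```

Because the decision depends only on which side of each Vref the cell falls, any downward drift of Vth past a fixed Vref changes the decoded bits, which is the failure mode the following slides describe.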
Threshold Voltage Reduces Over Time
[Figure: PDF of normalized Vth for P1 (10), P2 (00), and P3 (01), before retention loss and after some retention loss]
Slide12
Fixed Read Reference Voltage Becomes Suboptimal
[Figure: PDF of normalized Vth for P1 (10), P2 (00), and P3 (01) with the P1-P2 Vref and P2-P3 Vref, before retention loss and after some retention loss; raw bit errors are marked]
Slide13
Optimal Read Reference Voltage (OPT)
[Figure: PDF of normalized Vth for P1 (10), P2 (00), and P3 (01) after some retention loss, showing the P1-P2 Vref and P2-P3 Vref alongside the P1-P2 OPT and P2-P3 OPT; reading at OPT gives minimal raw bit errors]
Slide14
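The effect shown on the two slides above can be reproduced with a toy model. The sketch below assumes that two neighboring states are equally populated Gaussian distributions with the same spread, and that retention loss lowers both means; the distributions, the numbers, and the names `raw_error_rate` and `find_opt` are illustrative assumptions, not measurements from the talk.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def raw_error_rate(vref, lo_mean, hi_mean, std):
    """Fraction of misread cells between two neighboring, equally likely states."""
    lo_misread = 1.0 - normal_cdf(vref, lo_mean, std)  # lower state read as higher
    hi_misread = normal_cdf(vref, hi_mean, std)        # higher state read as lower
    return 0.5 * (lo_misread + hi_misread)

def find_opt(lo_mean, hi_mean, std, step=0.5):
    """Scan candidate Vrefs between the two means; return the one with fewest errors."""
    n = int((hi_mean - lo_mean) / step)
    candidates = [lo_mean + i * step for i in range(n + 1)]
    return min(candidates, key=lambda v: raw_error_rate(v, lo_mean, hi_mean, std))

# Hypothetical P2/P3 distributions (normalized Vth).
fixed_vref = find_opt(250.0, 350.0, 15.0)   # chosen for freshly programmed data
opt_after = find_opt(240.0, 330.0, 15.0)    # both states have shifted down

print("fixed Vref:", fixed_vref, "OPT after retention loss:", opt_after)
print("errors at fixed Vref:", raw_error_rate(fixed_vref, 240.0, 330.0, 15.0))
print("errors at OPT:       ", raw_error_rate(opt_after, 240.0, 330.0, 15.0))
```

In this model OPT moves down together with the distributions, and reading the aged data at the old Vref produces noticeably more raw bit errors than reading at OPT.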
Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage
Slide15
Retention Failure
[Figure: PDF of normalized Vth for P1 (10), P2 (00), and P3 (01) with the P1-P2 Vref and P2-P3 Vref; after some retention loss: correctable errors; after significant retention loss: uncorrectable errors]
Slide16
Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage
Goal 2: Design an offline mechanism to recover data after detecting uncorrectable errors
Slide17
To understand the effects of retention loss:
- Characterize retention loss using real chips
Slide18
To understand the effects of retention loss:
- Characterize retention loss using real chips
Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage
Goal 2: Design an offline mechanism to recover data after detecting uncorrectable errors
Slide19
Characterization Methodology
FPGA-based flash memory testing platform [Cai+, FCCM ‘11]
Slide20
Characterization Methodology
- FPGA-based flash memory testing platform
- Real 20- to 24-nm MLC NAND flash chips
- 0 to 40 days' worth of retention loss
- Room temperature (20 °C)
- 0 to 50k P/E cycles
Slide21
Characterize the effects of retention loss
1. Threshold Voltage Distribution
2. Optimal Read Reference Voltage
3. RBER and P/E Cycle Lifetime
Slide22
1. Threshold Voltage (Vth) Distribution
[Figure: PDF of normalized Vth for P1, P2, and P3]
Slide23
1. Threshold Voltage (Vth) Distribution
[Figure: Vth distributions of P1, P2, and P3 at 0-day and 40-day retention age]
Finding: Cell’s threshold voltage decreases over time
Slide24
2. Optimal Read Reference Voltage (OPT)
[Figure: Vth distributions of P1, P2, and P3 with the 0-day OPT and 40-day OPT marked between neighboring states]
Finding: OPT decreases over time
Slide25
3. RBER and P/E Cycle Lifetime
[Figure: RBER vs. P/E cycles]
Slide26
3. RBER and P/E Cycle Lifetime
Reading data with 7 days' worth of retention loss.
[Figure: RBER vs. P/E cycles against the ECC-correctable RBER, for Vref values closer to the actual OPT and for the actual OPT; the nominal lifetime and extended lifetime are marked]
Finding: Using actual OPT achieves the longest lifetime
Slide27
Characterization Summary
Due to retention loss
- Cell’s threshold voltage (Vth) decreases over time
- Optimal read reference voltage (OPT) decreases over time
Using the actual OPT for reading
- Achieves the longest lifetime
Slide28
To understand the effects of retention loss:
- Characterize retention loss using real chips
Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage
Goal 2: Design an offline mechanism to recover data after detecting uncorrectable errors
Slide29
Naïve Solution: Sweeping Vref
Key idea: Read the data multiple times with different read reference voltages until the raw bit errors are correctable by ECC
- Finds the optimal read reference voltage
- Requires many read-retries → higher read latency
Slide30
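The sweeping approach on the slide above amounts to a retry loop. The sketch below is a toy model: `read_page` and `ecc_decode` are hypothetical stand-ins for the controller's read and ECC operations, and the correctability rule (within 5 units of the actual OPT) is invented for illustration.

```python
def sweep_vref(read_page, ecc_decode, vref_candidates):
    """Retry the read at each candidate Vref until ECC succeeds.

    read_page(vref) returns raw data; ecc_decode(raw) returns corrected data,
    or None when the raw bit errors exceed the ECC capability.
    Returns (data, vref_used, number_of_reads).
    """
    for attempts, vref in enumerate(vref_candidates, start=1):
        data = ecc_decode(read_page(vref))
        if data is not None:
            return data, vref, attempts
    return None, None, len(vref_candidates)

# Toy model (assumption): a read is correctable only within 5 units of OPT.
ACTUAL_OPT = 270

def toy_read(vref):
    return abs(vref - ACTUAL_OPT)   # stands in for the raw bit error count

def toy_ecc(error_count):
    return "data" if error_count <= 5 else None

result = sweep_vref(toy_read, toy_ecc, list(range(300, 239, -2)))
print(result)
```

Every failed candidate costs one full page read, which is where the higher read latency comes from when the sweep starts far from the actual OPT.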
Comparison of Flash Read Techniques
[Table: Flash Read Techniques (Fixed Vref, Sweeping Vref, Our Goal) compared by Lifetime (P/E Cycle) and Performance (Read Latency)]
Slide31
Observations
1. The optimal read reference voltage gradually decreases over time
Key idea: Record the old OPT as a prediction (Vpred) of the actual OPT
Benefit: Close to actual OPT → Fewer read retries
2. The amount of retention loss is similar across pages within a flash block
Key idea: Record only one Vpred for each block
Benefit: Small storage overhead (768KB out of 512GB)
Slide32
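The Observations slide above quotes 768KB of storage for a 512GB drive without giving the underlying parameters. The arithmetic below shows one combination that reproduces the figure; the 2MB block size and the three one-byte Vpred values per block are assumptions for this sketch, not values stated in the talk.

```python
CAPACITY_BYTES = 512 * 2**30     # 512GB drive, from the slide
BLOCK_BYTES = 2 * 2**20          # assumption: 2MB flash block
VPRED_BYTES_PER_BLOCK = 3        # assumption: three 1-byte Vpred values per block

num_blocks = CAPACITY_BYTES // BLOCK_BYTES
overhead_kb = num_blocks * VPRED_BYTES_PER_BLOCK // 2**10
print(num_blocks, "blocks,", overhead_kb, "KB of Vpred storage")
```

Whatever the exact parameters, keeping the prediction per block rather than per page is what keeps the table small relative to the drive capacity.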
Retention Optimized Reading (ROR)
Components:
1. Online pre-optimization algorithm
- Periodically records a Vpred for each block
2. Improved read-retry technique
- Utilizes the recorded Vpred to minimize read-retry count
Slide33
1. Online Pre-Optimization Algorithm
- Triggered periodically (e.g., per day)
- Find and record an OPT as per-block Vpred
- Performed in background
- Small storage overhead
[Figure: PDF of normalized Vth showing the old Vpred and the new Vpred]
Slide34
2. Improved Read-Retry Technique
- Performed as normal read
- Vpred already close to actual OPT
- Decrease Vref if Vpred fails, and retry
[Figure: PDF of normalized Vth showing Vpred very close to OPT]
Slide35
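The two ROR components described on the preceding slides can be sketched together as a small toy model. The class name `RorReader`, the step size, and the correctability rule in the usage part are assumptions made for illustration; the slides only state that a per-block Vpred is recorded periodically and that a read starts at Vpred and decreases Vref on failure.

```python
class RorReader:
    """Toy model of Retention Optimized Reading with one Vpred per block."""

    def __init__(self, initial_vref, step=2):
        self.initial_vref = initial_vref   # used until a Vpred is recorded
        self.step = step                   # hypothetical Vref decrement per retry
        self.vpred = {}                    # block id -> recorded Vpred

    def pre_optimize(self, block, find_opt):
        """Background step: record the block's current OPT as its Vpred."""
        self.vpred[block] = find_opt(block)

    def read(self, block, try_read, min_vref=0):
        """Start at Vpred; decrease Vref and retry whenever the read fails."""
        vref = self.vpred.get(block, self.initial_vref)
        attempts = 0
        while vref >= min_vref:
            attempts += 1
            data = try_read(block, vref)
            if data is not None:
                return data, attempts
            vref -= self.step
        return None, attempts

# Toy usage (assumption): a read succeeds within 3 units of the block's OPT.
opt_now = {7: 280}

def toy_find_opt(block):
    return opt_now[block]

def toy_try_read(block, vref):
    return "data" if abs(vref - opt_now[block]) <= 3 else None

reader = RorReader(initial_vref=300)
cold = reader.read(7, toy_try_read)      # no Vpred recorded yet
reader.pre_optimize(7, toy_find_opt)     # periodic background step
opt_now[7] = 276                         # OPT keeps decreasing afterwards
warm = reader.read(7, toy_try_read)      # starts from the recorded Vpred
print(cold, warm)
```

Only downward retries are needed in this model because OPT decreases over time, so the recorded Vpred is never below the actual OPT.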
Retention Optimized Reading: Summary
Flash Read Techniques | Lifetime (P/E Cycle) | Performance (Read Latency)
Fixed Vref | |
Sweeping Vref | 64% ↑ |
ROR | 64% ↑ | Nom. Life: 2.4% ↓; Ext. Life: 70.4% ↓
Slide36
To understand the effects of retention loss:
- Characterize retention loss using real chips
Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage
Goal 2: Design an offline mechanism to recover data after detecting uncorrectable errors
Slide37
Retention Failure
[Figure: PDF of normalized Vth for P1 (10), P2 (00), and P3 (01) with the P1-P2 Vref and P2-P3 Vref; after some retention loss: correctable errors; after significant retention loss: uncorrectable errors]
Slide38
Leakage Speed Variation
[Figure: PDF of normalized Vth with cells labeled S (slow-leaking cell) and F (fast-leaking cell)]
Slide39
Initially, Right After Programming
[Figure: PDF of normalized Vth for P2 and P3, each containing slow-leaking (S) and fast-leaking (F) cells]
Slide40
After Some Retention Loss
[Figure: PDF of normalized Vth for P2 and P3 with slow-leaking (S) and fast-leaking (F) cells]
Fast-leaking cells have lower Vth
Slow-leaking cells have higher Vth
Slide41
Eventually: Retention Failure
[Figure: PDF of normalized Vth for P2 and P3 with slow-leaking (S) and fast-leaking (F) cells on both sides of OPT]
Slide42
Retention Failure Recovery (RFR)
Key idea: Guess original state of the cell from its leakage speed property
Three steps
1. Identify risky cells
2. Identify fast-/slow-leaking cells
3. Guess original states
Slide43
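1. Identify Risky Cells
[Figure: PDF of normalized Vth for P2 and P3 with slow-leaking (S) and fast-leaking (F) cells; risky cells lie between OPT–σ and OPT+σ]
[Key Formula, shown graphically: one rule for slow-leaking (S) cells and one for fast-leaking (F) cells]
Slide44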
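2. Identifying Fast- vs. Slow-Leaking Cells
[Figure: PDF of normalized Vth for P2 and P3; the risky cells between OPT–σ and OPT+σ are marked "?" because their leakage speed is not yet known]
[Key Formula, shown graphically: one rule for slow-leaking (S) cells and one for fast-leaking (F) cells]
Slide45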
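2. Identifying Fast- vs. Slow-Leaking Cells
[Figure: PDF of normalized Vth for P2 and P3; some risky cells between OPT–σ and OPT+σ are now labeled S or F, the rest are still marked "?"]
[Key Formula, shown graphically: one rule for slow-leaking (S) cells and one for fast-leaking (F) cells]
Slide46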
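3. Guess Original States
[Figure: PDF of normalized Vth for P2 and P3; the risky cells are labeled S or F]
[Key Formula, shown graphically: one rule for slow-leaking (S) cells and one for fast-leaking (F) cells]
Slide47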
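The three RFR steps from the preceding slides can be sketched as follows. Several things here are assumptions rather than content of the talk: leakage speed is estimated from the Vth drop between two readings taken some time apart, `shift_threshold` is a hypothetical cut-off, and the mapping (risky and fast-leaking → P3, risky and slow-leaking → P2) is inferred from the statement that fast-leaking cells have lower Vth and slow-leaking cells have higher Vth.

```python
def rfr_guess_states(vth_first, vth_later, opt, sigma, shift_threshold):
    """Guess P2 or P3 for each cell from two Vth readings taken some time apart."""
    guesses = []
    for v0, v1 in zip(vth_first, vth_later):
        if abs(v0 - opt) > sigma:
            # Step 1: not a risky cell, so trust the ordinary read at OPT.
            guesses.append("P3" if v0 > opt else "P2")
        elif (v0 - v1) >= shift_threshold:
            # Steps 2-3: risky and fast-leaking, likely a P3 cell that drifted down.
            guesses.append("P3")
        else:
            # Steps 2-3: risky and slow-leaking, likely a P2 cell that stayed high.
            guesses.append("P2")
    return guesses

# Hypothetical readings (normalized Vth) for five cells.
first = [260, 295, 298, 304, 340]
later = [258, 288, 297, 303, 333]
guessed = rfr_guess_states(first, later, opt=300, sigma=10, shift_threshold=4)
print(guessed)
```

In this example the second and fourth cells are guessed differently from what a plain read at OPT would return, which is how the scheme can remove retention errors near the state boundary.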
RFR Evaluation
[Timeline: program with random data; 28 days; detect failure, back up data; 12 additional days; recover data]
- Expect to eliminate 50% of raw bit errors
- ECC can correct remaining errors
Slide48
To understand the effects of retention loss:
- Characterize retention loss using real chips
Goal 1: Design a low-cost mechanism that dynamically finds the optimal read reference voltage
Goal 2: Design an offline mechanism to recover data after detecting uncorrectable errors
Slide49
Conclusion
Problem: Retention loss reduces flash lifetime
Overall Goal: Extend flash lifetime at low cost
Flash Characterization: Developed an understanding of the effects of retention loss in real chips
Retention Optimized Reading: A low-cost mechanism that dynamically finds the optimal read reference voltage
- 64% lifetime ↑, 70.4% read latency ↓
Retention Failure Recovery: An offline mechanism that recovers data after detecting uncorrectable errors
- Raw bit error rate 50% ↓, reduces data loss
Slide50
Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery
Yu Cai, Yixin Luo, Erich F. Haratsch*, Ken Mai, Onur Mutlu
Carnegie Mellon University, *LSI Corporation
Slide51
Backup Slides
Slide52
RFR Motivation
Data loss can happen in many ways
- High P/E cycle
- High temperature accelerates retention loss
- High retention age (lost power for a long time)
Slide53
What if there are other errors?
Key: RFR does not have to correct all errors
Example:
- ECC can correct 40 errors in a page
- Corrupted page has 20 retention errors, 25 other errors (45 total errors)
- After RFR: 10 retention errors, 30 other errors (40 total errors → ECC correctable)
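The example's arithmetic can be checked directly. The helper name `ecc_correctable` is made up for this sketch; the error counts and the 40-error ECC limit are the ones given in the example.

```python
def ecc_correctable(retention_errors, other_errors, ecc_limit):
    """A page is correctable when its total error count fits within the ECC limit."""
    return retention_errors + other_errors <= ecc_limit

ECC_LIMIT = 40
before_rfr = ecc_correctable(20, 25, ECC_LIMIT)   # 45 total errors
after_rfr = ecc_correctable(10, 30, ECC_LIMIT)    # 40 total errors
print(before_rfr, after_rfr)
```

RFR only has to bring the total under the ECC limit; the remaining errors, whatever their source, are then handled by ECC.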