Slide1
Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness
Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, Onur Mutlu
HeatWatch
Slide2
Storage Technology Drivers - 2018
- Store large amounts of data reliably for months to years
- 3D NAND Flash Memory: stacked layers
Slide3
Executive Summary
- 3D NAND flash memory is susceptible to retention errors
  - Charge leaks out of the flash cell
  - Two unreported factors: self-recovery and temperature
- We study self-recovery and temperature effects
  - Experimental characterization of real 3D NAND chips
  - Unified Self-Recovery and Temperature (URT) Model
    - Predicts impact of retention loss, wearout, self-recovery, and temperature on flash cell voltage
    - Low prediction error rate: 4.9%
- We develop a new technique to improve flash reliability
  - HeatWatch
    - Uses URT model to find optimal read voltages for 3D NAND flash
    - Improves flash lifetime by 3.85x
Slide4
Outline
- Executive Summary
- Background on NAND Flash Reliability
- Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips
- URT: Unified Self-Recovery and Temperature Model
- HeatWatch Mechanism
- Conclusion
Slide5
3D NAND Flash Memory Background
[Figure: a 3D NAND flash cell. The charge in the cell sets its threshold voltage. Relative to the read reference voltage, the higher voltage state holds data value = 0 and the lower voltage state holds data value = 1.]
Slide6
Flash Wearout
- Program/Erase (P/E) cycling causes wearout of the insulator
- Wearout effects:
  1. Retention loss (voltage shift over time)
  2. Program variation (initial voltage difference between states)
- Wearout introduces errors
Slide7
Improving Flash Lifetime
- Errors introduced by wearout limit flash lifetime (measured in P/E cycles)
- Two ways to improve flash lifetime:
  - Exploiting the self-recovery effect
  - Exploiting the temperature effect
Slide8
Exploiting the Self-Recovery Effect
- Dwell time: idle time between P/E cycles
- Longer dwell time: more self-recovery
  - Partially repairs damage due to wearout
  - Reduces retention loss
[Figure: two sequences of P/E cycles, one with longer dwell time between cycles]
Slide9
Exploiting the Temperature Effect
- High program temperature: increases program variation
- High storage temperature: accelerates retention loss
Slide10
Prior Studies of Self-Recovery/Temperature
                    Self-Recovery Effect    Temperature Effect
Planar (2D) NAND    Mielke 2006             JEDEC 2010 (no characterization)
3D NAND             x                       x
Slide11
Outline
- Executive Summary
- Background on NAND Flash Reliability
- Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips
- URT: Unified Self-Recovery and Temperature Model
- HeatWatch Mechanism
- Conclusion
Slide12
Characterization Methodology
- Modified firmware version in the flash controller
  - Control the read reference voltage of the flash chip
  - Bypass ECC to get raw NAND data (with raw bit errors)
- Control temperature with a heat chamber
[Figure: server, SSD, and heat chamber]
Slide13
Characterized Devices
- Real 30- to 39-layer 3D MLC NAND flash chips
  - 2-bit MLC
  - 30- to 39-layer
Slide14
MLC Threshold Voltage Distribution Background
[Figure: threshold voltage distribution of MLC NAND. x-axis: threshold voltage; y-axis: probability. Four states (11, 10, 00, 01), ranging from the lowest voltage state to the highest voltage state, separated by three read reference voltages.]
Slide15
Characterization Goal
- Characterized phenomena:
  - Self-recovery effect
  - Temperature effect
- Characterized metrics:
  - Retention loss speed (how fast voltage shifts over time)
  - Program variation (initial voltage difference between states)
Slide16
Self-Recovery Effect Characterization Results
- Dwell time: idle time between P/E cycles
- Increasing dwell time from 1 minute to 2.3 hours slows down retention loss speed by 40%
[Figure legend: dwell time of 1 minute vs. 2.3 hours]
Slide17
Program Temperature Effect Characterization Results
- Increasing program temperature from 0°C to 70°C improves program variation by 21%
[Figure legend: program temperature of 0°C vs. 70°C]
Slide18
Storage Temperature Effect Characterization Results
- Lowering storage temperature from 70°C to 0°C slows down retention loss speed by 58%
[Figure legend: storage temperature of 0°C vs. 70°C]
Slide19
Characterization Summary
Major results (inputs to the unified model):
- Self-recovery affects retention loss speed
- Program temperature affects program variation
- Storage temperature affects retention loss speed
Other characterizations in the paper:
- More detailed results on self-recovery and temperature
  - Effects on error rate
  - Effects on threshold voltage distribution
- Effects of recovery cycles (P/E cycles with long dwell time) on retention loss speed
Slide20
Outline
- Executive Summary
- Background on NAND Flash Reliability
- Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips
- URT: Unified Self-Recovery and Temperature Model
- HeatWatch Mechanism
- Conclusion
Slide21
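Minimizing 3D NAND Errors
[Figure: threshold voltage distributions of neighboring states (00, 01, ...). x-axis: threshold voltage; y-axis: probability. Marked: the read reference voltage, the retention errors it causes, and the optimal read reference voltage.]
- Optimal read reference voltage minimizes 3D NAND errors
Slide22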
Predicting the Mean Threshold Voltage
Our URT model: V = V0 + ΔV
- V: mean threshold voltage
- V0: initial voltage before retention (program variation)
- ΔV: voltage shift due to retention loss
Slide23
URT Model Overview
V = V0 + ΔV
1. Program variation component: (PEC, T_p) -> V0, the initial voltage before retention
2. Self-recovery and retention component: (PEC, t_r,eff, t_d,eff) -> ΔV, the voltage shift due to retention loss
3. Temperature scaling component: (t_r, T_r) -> t_r,eff and (t_d, T_d) -> t_d,eff
Slide24
1. Program Variation Component
- Inputs: PEC (P/E cycle), T_p (program temperature)
- Output: V0 (initial voltage)
- Validation: R² = 91.7%
Slide25
2. Self-Recovery and Retention Component
- Inputs: t_r (retention time), t_d (dwell time), PEC (P/E cycle)
- Output: ΔV (retention shift)
- Validation: 3x more accurate than state-of-the-art model
Slide26
3. Temperature Scaling Component
- (t_d, T_d) -> t_d,eff: actual dwell time and dwell temperature give the effective dwell time
- (t_r, T_r) -> t_r,eff: actual retention time and storage temperature give the effective retention time
- Scaling follows the Arrhenius equation
- Validation: adjust an important parameter, E_a, from 1.1 eV to 1.04 eV
Slide27
URT Model Summary
V = V0 + ΔV
1. Program variation component: (PEC, T_p) -> V0, the initial voltage before retention
2. Self-recovery and retention component: (PEC, t_r,eff, t_d,eff) -> ΔV, the voltage shift due to retention loss
3. Temperature scaling component: (t_r, T_r) -> t_r,eff and (t_d, T_d) -> t_d,eff
Validation: prediction error rate = 4.9%
Slide28
Outline
- Executive Summary
- Background on NAND Flash Reliability
- Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips
- URT: Unified Self-Recovery and Temperature Model
- HeatWatch Mechanism
- Conclusion
Slide29
HeatWatch Mechanism
Key idea:
- Predict the change in threshold voltage distribution by using the URT model
- Adapt the read reference voltage to near-optimal (Vopt) based on the predicted change in voltage distribution
Slide30
HeatWatch Mechanism Overview
- Tracking components:
  - SSD temperature
  - Dwell time
  - P/E cycles & retention time
- Prediction components (using URT):
  - Vopt prediction
  - Fine-tuning URT parameters
Slide31
Tracking SSD Temperature
- Use existing sensors in the SSD
- Precompute temperature scaling factor at logarithmic time intervals
Slide32
Tracking Dwell Time
- Only need to log the timestamps of the last 20 full drive writes
  - Self-recovery effect diminishes after 20 P/E cycles
Slide33
Tracking P/E Cycles and Retention Time
- P/E cycle count already recorded by SSD
- Log write timestamp for each block
  - Retention time = read timestamp - write timestamp
Slide34
Predicting Optimal Read Reference Voltage
- Calculate URT using tracked information
- Modeling error: 4.9%
Slide35
Fine-Tuning URT Parameters Online
- Accommodates chip-to-chip variation
- Uses periodic sampling
Slide36
HeatWatch Mechanism Summary
- Tracking components: SSD temperature, dwell time, P/E cycles & retention time
- Prediction components (using URT): Vopt prediction, fine-tuning URT parameters
- Storage overhead: 0.16% of DRAM in 1TB SSD
- Latency overhead: < 1% of flash read latency
Slide37
HeatWatch Evaluation Methodology
- 28 real workload storage traces
  - MSR-Cambridge
- We use real dwell time and retention time values obtained from traces
- Temperature model: trigonometric function + Gaussian noise
  - Represents periodic temperature variation in each day
  - Includes small transient temperature variation
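The slide gives only the shape of the temperature model, not its parameters. A minimal sketch of such a model, assuming a sine wave with a one-day period plus zero-mean Gaussian noise; the function name, mean, amplitude, and noise level are all illustrative assumptions, not values from the evaluation:

```python
import math
import random

SECONDS_PER_DAY = 24 * 3600

def ssd_temperature_c(t_seconds, mean_c=40.0, amplitude_c=10.0,
                      noise_std_c=1.0, rng=random):
    """Synthetic SSD temperature in °C at time t_seconds.

    All default parameter values are assumptions for illustration only.
    """
    # Periodic component: one full cycle per day.
    periodic = amplitude_c * math.sin(2.0 * math.pi * t_seconds / SECONDS_PER_DAY)
    # Transient component: small zero-mean Gaussian noise.
    transient = rng.gauss(0.0, noise_std_c)
    return mean_c + periodic + transient
```

Passing `rng=random.Random(seed)` makes a generated trace reproducible; setting `noise_std_c=0.0` leaves only the daily cycle.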
Slide38
HeatWatch Greatly Improves Flash Lifetime
[Figure: error rate vs. lifetime (P/E cycles), with the ECC limit marked, for Fixed Vref, State-of-the-art, HeatWatch, and Oracle]
- HeatWatch lifetime improvement: 3.85x over Fixed Vref, 24% over state-of-the-art
- HeatWatch improves lifetime by capturing the effect of retention, wearout, self-recovery, and temperature
Slide39
Outline
- Executive Summary
- Background on NAND Flash Reliability
- Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips
- URT: Unified Self-Recovery and Temperature Model
- HeatWatch Mechanism
- Conclusion
Slide40
Conclusion
- 3D NAND flash memory is susceptible to retention errors
  - Charge leaks out of the flash cell
  - Two unreported factors: self-recovery and temperature
- We study self-recovery and temperature effects
  - Experimental characterization of real 3D NAND chips
  - Unified Self-Recovery and Temperature (URT) Model
    - Predicts impact of retention loss, wearout, self-recovery, and temperature on flash cell voltage
    - Low prediction error rate: 4.9%
- We develop a new technique to improve flash reliability
  - HeatWatch
    - Uses URT model to find optimal read voltages for 3D NAND flash
    - Improves flash lifetime by 3.85x
Slide41
Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness
Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, Onur Mutlu
HeatWatch
Slide42
Backup Slides
Slide43
SSD Architecture
[Figure: a host connected to an SSD; the SSD contains an SSD controller, DRAM, and NAND chips]
Slide44
3D vs. 2D Flash Cell Design
[Figure: cross-sections of two cell designs, each with substrate, source (S), drain (D), tunnel oxide, gate oxide, and control gate]
- Floating-gate cell: electrons stored in a floating gate (conductor)
- 3D charge-trap cell: electrons stored in a charge trap (insulator)
  - Charges stored in insulator, thinner tunnel oxide
  - Faster data retention
Slide45
3D vs. 2D Retention Characteristics
- 2D NAND very sensitive to wearout
- 3D NAND uniformly affected by wearout
Source: K. Mizoguchi, et al., “Data-Retention Characteristics Comparison of 2D and 3D TLC NAND Flash Memories,” IMW, 2017.
Slide46
Limitations
- Vendor-to-vendor variation
  - Self-recovery and temperature effect should be similar for 3D charge trap NAND (Samsung, Hynix, Toshiba, Sandisk)
- Chip-to-chip variation
  - Each of our experiments takes several months
  - Expect future large-scale study on 3D NAND errors
- Not our limitation: any process variation within a chip
  - Our results include tens of randomly selected flash blocks
  - ~1 million cells
Slide47
Generalizability of Results
- Should apply to other 3D NAND flash memory that uses charge trap cells (Samsung, Hynix, Toshiba, Sandisk)
Slide48
Self-Recovery and Temperature in Planar NAND
- UDM [Mielke 2006]
  - Only models retention shift, no initial voltage
  - Exponential P/E cycle effect
  - Activation energy for planar NAND
- 3 other works propose mechanisms and speculate different lifetime improvements
  - 211x [Mohan+ HotStorage10]
  - 5.8x [Wu+ HotStorage11]
  - 2.8x [Lee+ FAST12]
Slide49
Novelty vs. UDM
- 3D charge trap cells are more resilient to P/E cycling than floating-gate cells in planar NAND
- Different activation energy
- Program temperature effect not discussed in planar NAND
Slide50
Ideal SSD Temperature
- It depends!
  - High program temperature increases program variation (good)
  - High dwell temperature accelerates self-recovery (good)
  - High retention temperature accelerates retention loss (bad)
Slide51
URT Fine Tuning
- Randomly sample 10 wordlines in each chip
- Learn Vopt by sweeping Vref
- Fit URT model with newly learned Vopt
Slide52
HeatWatch Overhead
- Storage overhead:
  - Tracking SSD temperature: 26 logarithmic intervals, 208 B
  - Program temperature, dwell time, program time per block: 1.5 MB
  - Dwell time: timestamps for last 20 full drive writes, 85 B
- Latency overhead: <1% of flash read latency (25 us)
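As a quick sanity check on the storage figures above, the three components can be summed directly. This is only arithmetic on the numbers listed on the slide; reading "1.5 MB" as decimal megabytes is an assumption, and the constant and function names are made up for illustration:

```python
# Storage overhead components as listed on the slide.
TEMPERATURE_LOG_BYTES = 208        # 26 logarithmic intervals
PER_BLOCK_METADATA_BYTES = 1.5e6   # 1.5 MB, assumed to mean decimal megabytes
DWELL_TIMESTAMP_BYTES = 85         # timestamps of the last 20 full drive writes

def total_overhead_bytes():
    """Sum of all three tracking structures."""
    return TEMPERATURE_LOG_BYTES + PER_BLOCK_METADATA_BYTES + DWELL_TIMESTAMP_BYTES

def bytes_per_temperature_interval(intervals=26):
    """Average storage per logarithmic temperature interval."""
    return TEMPERATURE_LOG_BYTES / intervals

def per_block_share():
    """Fraction of the total taken by the per-block metadata."""
    return PER_BLOCK_METADATA_BYTES / total_overhead_bytes()
```

Under these assumptions the temperature log works out to 8 bytes per interval, and the per-block metadata dominates the total.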
Slide53
HeatWatch: Tracking Components
1. Tracking SSD temperature
- Use existing sensors in the SSD
- Precompute temperature scaling factor at logarithmic time intervals
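The slide does not spell out the scaling factor itself. A minimal sketch, assuming it is the standard Arrhenius acceleration factor AF = exp((E_a / k_B) * (1/T_ref - 1/T)), with the E_a = 1.04 eV value quoted for the temperature scaling component. The 40°C reference temperature and all function names are hypothetical:

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def arrhenius_factor(temp_c, ref_temp_c=40.0, activation_energy_ev=1.04):
    """Speed-up of a thermally activated process at temp_c relative to ref_temp_c.

    ref_temp_c is an assumed reference point, not a value from the slides.
    """
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    return math.exp(activation_energy_ev / BOLTZMANN_EV_PER_K
                    * (1.0 / ref_k - 1.0 / temp_k))

def effective_time(intervals, ref_temp_c=40.0):
    """Effective time for a list of (duration_seconds, temp_c) intervals.

    Each interval contributes duration * scaling factor, i.e. the area
    under the scaling-factor curve over the actual time axis.
    """
    return sum(duration * arrhenius_factor(temp_c, ref_temp_c)
               for duration, temp_c in intervals)
```

With interval lengths that double each step (n, 2n, 4n, 8n, ...), a long retention period is covered by a small number of stored factors, which is presumably why logarithmic intervals keep the log compact.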
[Figure: temperature effect vs. actual retention time, divided into logarithmic intervals of length n, 2n, 4n, 8n; the area under the curve is the effective retention time]
Slide54
HeatWatch: Tracking Components
2. Tracking dwell time
- Only need to track write frequency for last 20 P/E cycles
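One way to keep only the most recent 20 writes is a fixed-size ring buffer of timestamps. A minimal sketch, assuming one timestamp is logged per full drive write; the class and method names are hypothetical and not taken from HeatWatch:

```python
from collections import deque

class DwellTimeTracker:
    """Keeps the timestamps of the most recent writes (20 by default)."""

    def __init__(self, history=20):
        # A deque with maxlen drops the oldest timestamp automatically.
        self._stamps = deque(maxlen=history)

    def record_write(self, timestamp_s):
        """Log the timestamp (in seconds) of one full drive write."""
        self._stamps.append(timestamp_s)

    def dwell_times(self):
        """Idle time between consecutive logged writes, oldest first."""
        stamps = list(self._stamps)
        return [later - earlier for earlier, later in zip(stamps, stamps[1:])]

    def mean_dwell_time(self):
        """Average dwell time over the logged window, or None if unknown."""
        dwell = self.dwell_times()
        return sum(dwell) / len(dwell) if dwell else None
```

Because the buffer never grows past 20 entries, the storage cost stays constant no matter how long the drive has been in service.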
[Figure; axis annotation: Faster Retention Loss]
- Self-recovery effect plateaus after 20 P/E cycles
Slide55
URT vs. Conventional Model
V = V0 + ΔV
- Conventional: inputs are PEC and t_r
- URT: inputs are PEC, T_p, t_r, T_r, t_r,eff, t_d, T_d, and t_d,eff
- URT adds self-recovery and temperature to the conventional model
Slide56
Threshold Voltage Distribution Shifts
- Shifts occur over time due to multiple factors (e.g., retention)
- Can cause the distribution of one state to cross over the read reference voltage boundary
  - Some cells get misread
  - Introduces raw bit errors
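To make the effect of a shift concrete, the sketch below models two neighboring states as equally likely Gaussian distributions and sweeps the read reference voltage to find the value with the fewest misreads. The Gaussian shape, the example means and standard deviations, and the function names are all assumptions for illustration; real distributions are measured from the chip:

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def raw_error_rate(v_ref, low, high):
    """Misread fraction for two equally likely states, each given as (mean, std)."""
    low_misread = 1.0 - normal_cdf(v_ref, *low)   # low-state cells read above v_ref
    high_misread = normal_cdf(v_ref, *high)       # high-state cells read below v_ref
    return 0.5 * (low_misread + high_misread)

def sweep_v_opt(low, high, steps=1000):
    """Sweep v_ref between the two state means and return the best value."""
    start, stop = low[0], high[0]
    candidates = [start + (stop - start) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda v: raw_error_rate(v, low, high))

# Example (made-up numbers): the higher state drifts down after retention.
low_state = (1.0, 0.2)
high_before = (2.0, 0.2)
high_after = (1.8, 0.2)
v_fixed = sweep_v_opt(low_state, high_before)      # tuned before the shift
v_adapted = sweep_v_opt(low_state, high_after)     # tuned after the shift
```

In this toy setup, reading the shifted distribution with `v_fixed` gives a higher error rate than reading it with `v_adapted`, which is the gap an adaptive read reference voltage aims to close.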
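[Figure: original and shifted threshold voltage distributions. x-axis: threshold voltage (Vth); y-axis: probability density. States ER (11), P1 (01), P2 (00), and P3 (10), separated by read reference voltages Va, Vb, and Vc. Raw bit errors are marked where a shifted distribution crosses a read reference voltage.]
Slide57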
Per-Workload Flash Lifetime Improvements
Slide58
Dwell Time Impact on Error Rate After Retention
Slide59
Dwell Time Impact on Threshold Voltage Distributions
Slide60
Mean Distribution Voltage vs. Retention for Different Dwell Times
Slide61
Impact of Dwell Time on Error Rate and Threshold Voltage Distribution Means
Slide62
Temperature Impact on Error Rate After Retention
Slide63
Impact of Programming Temperature on Threshold Voltage Distributions
Slide64
Impact of Programming Temperature on Error Rate and Threshold Voltage Distribution Means
Slide65
SRRM Prediction Accuracy
Slide66
Change in Flash Lifetime Due to Programming Temperature and Write Intensity
Slide67
Optimal Read Reference Voltage: Measured vs. Predicted by URT
Slide68
Inaccurate Read Reference Voltages Increase Error Rate