Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness

Uploaded On 2018-11-09



Presentation Transcript

Slide1

Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness

Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, Onur Mutlu

HeatWatch

Slide2

Storage Technology Drivers - 2018

Store large amounts of data reliably for months to years
3D NAND Flash Memory
  Stacked layers

Slide3

Executive Summary

3D NAND flash memory susceptible to retention errors
  Charge leaks out of flash cell
  Two unreported factors: self-recovery and temperature
We study self-recovery and temperature effects
We develop a new technique to improve flash reliability

Experimental characterization of real 3D NAND chips

Unified Self-Recovery and Temperature (URT) Model
  Predicts impact of retention loss, wearout, self-recovery, temperature on flash cell voltage
  Low prediction error rate: 4.9%

HeatWatch
  Uses URT model to find optimal read voltages for 3D NAND flash
  Improves flash lifetime by 3.85x

Slide4

Outline

Executive Summary
Background on NAND Flash Reliability
Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips
URT: Unified Self-Recovery and Temperature Model
HeatWatch Mechanism
Conclusion

Slide5

3D NAND Flash Memory Background

[Figure: a flash cell in 3D NAND flash memory. Charge = Threshold Voltage. The Read Reference Voltage separates the Higher Voltage State (Data Value = 0) from the Lower Voltage State (Data Value = 1).]

Slide6

Flash Wearout

Program/Erase (P/E) Wearout
Wearout Effects:
  1. Retention Loss (voltage shift over time)
  2. Program Variation (initial voltage difference between states)
Wearout Introduces Errors

[Figure labels: Insulator, Voltage]

Slide7

Improving Flash Lifetime

Errors introduced by wearout limit flash lifetime (measured in P/E cycles)

Two Ways to Improve Flash Lifetime
  Exploiting the Self-Recovery Effect
  Exploiting the Temperature Effect

Slide8

Exploiting the Self-Recovery Effect

Dwell Time: Idle Time Between P/E Cycles
Longer Dwell Time: More Self-Recovery
  Reduces Retention Loss
  Partially repairs damage due to wearout

[Figure: two timelines of five P/E cycles each, illustrating dwell time between cycles]

Slide9

Exploiting the Temperature Effect

High Program Temperature: Increases Program Variation
High Storage Temperature: Accelerates Retention Loss

[Figure label: Voltage]

Slide10

Prior Studies of Self-Recovery/Temperature

                    Self-Recovery Effect    Temperature Effect
Planar (2D) NAND    Mielke 2006             JEDEC 2010 (no characterization)
3D NAND             x                       x

Slide11

Outline

Executive Summary
Background on NAND Flash Reliability
Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips
URT: Unified Self-Recovery and Temperature Model
HeatWatch Mechanism
Conclusion

Slide12

Characterization Methodology

Modified firmware version in the flash controller
  Control the read reference voltage of the flash chip
  Bypass ECC to get raw NAND data (with raw bit errors)
Control temperature with a heat chamber

[Figure labels: Heat Chamber, SSD, Server]

Slide13

Characterized Devices

Real 30-39 Layer 3D MLC NAND Flash Chips
  2-bit MLC
  30- to 39-layer

Slide14

MLC Threshold Voltage Distribution Background

[Figure: Threshold Voltage Distribution. Axes: Probability vs. Threshold Voltage. Four states labeled 11, 10, 00, 01, with the Lowest Voltage State and Highest Voltage State marked, separated by three Read Reference Voltages.]

Slide15

Characterization Goal

Characterized Phenomena
  Self-Recovery Effect
  Temperature Effect
Characterized Metrics
  Retention Loss Speed (how fast voltage shifts over time)
  Program Variation (initial voltage difference between states)

Slide16

Self-Recovery Effect Characterization Results

Increasing dwell time from 1 minute to 2.3 hours slows down retention loss speed by 40%
Dwell time: Idle time between P/E cycles

[Figure legend: 2.3 hour, 1 minute]

Slide17

Program Temperature Effect Characterization Results

Increasing program temperature from 0°C to 70°C improves program variation by 21%

[Figure legend: 70°C, 0°C]

Slide18

Storage Temperature Effect Characterization Results

Lowering storage temperature from 70°C to 0°C slows down retention loss speed by 58%

[Figure legend: 70°C, 0°C]

Slide19

Characterization Summary

Major Results:
  Self-recovery affects retention loss speed
  Program temperature affects program variation
  Storage temperature affects retention loss speed
  -> Unified Model
Other Characterization Methods in the Paper:
  More detailed results on self-recovery and temperature
    Effects on error rate
    Effects on threshold voltage distribution
  Effects of recovery cycle (P/E cycles with long dwell time) on retention loss speed

Slide20

Outline

Executive Summary
Background on NAND Flash Reliability
Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips
URT: Unified Self-Recovery and Temperature Model
HeatWatch Mechanism
Conclusion

Slide21

Minimizing 3D NAND Errors

[Figure: Probability vs. Threshold Voltage for states 00 and 01, showing Retention Errors at the Read Ref. Voltage and the Optimal Read Ref. Voltage]

Optimal read reference voltage minimizes 3D NAND errors

Slide22

Predicting the Mean Threshold Voltage

Our URT Model: V = V0 + ΔV
  V: Mean Threshold Voltage
  V0: Initial Voltage Before Retention (Program Variation)
  ΔV: Voltage Shift Due to Retention Loss

Slide23

URT Model Overview

V = V0 + ΔV
  V0: Initial Voltage Before Retention
  ΔV: Voltage Shift Due to Retention Loss
1. Program Variation Component: PEC, T_p -> V0
2. Self-Recovery and Retention Component: PEC, t_r,eff, t_d,eff -> ΔV
3. Temperature Scaling Component: t_r, T_r -> t_r,eff; t_d, T_d -> t_d,eff

Slide24

1. Program Variation Component

Inputs: PEC (P/E Cycle), T_p (Program Temperature)
Output: V0 (Initial Voltage)
Validation: R^2 = 91.7%

Slide25

2. Self-Recovery and Retention Component

Inputs: t_r (Retention Time), t_d (Dwell Time), PEC (P/E Cycle)
Output: ΔV (Retention Shift)
Validation: 3x more accurate than state-of-the-art model

Slide26

3. Temperature Scaling Component

Inputs: t_d (Actual Dwell Time), T_d (Dwell Temp.); output: t_d,eff (Effective Dwell Time)
Inputs: t_r (Actual Retention Time), T_r (Storage Temp.); output: t_r,eff (Effective Retention Time)
Validation: Adjust an important parameter, E_a, from 1.1 eV to 1.04 eV
Arrhenius Equation: [equation not included in transcript]

Slide27

URT Model Summary

V = V0 + ΔV
  V0: Initial Voltage Before Retention
  ΔV: Voltage Shift Due to Retention Loss
1. Program Variation Component: PEC, T_p -> V0
2. Self-Recovery and Retention Component: PEC, t_r,eff, t_d,eff -> ΔV
3. Temperature Scaling Component: t_r, T_r -> t_r,eff; t_d, T_d -> t_d,eff
Validation: Prediction Error Rate = 4.9%
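The structure of the URT model can be written compactly. The block below is only a sketch of the dependencies shown in the model diagram: the function names f_1, f_2, and f_3 are placeholders introduced here, the slides do not give their closed forms, and using the same f_3 for both effective times is an assumption.

```latex
\begin{aligned}
V &= V_0 + \Delta V \\
V_0 &= f_1(\mathrm{PEC},\, T_p) \\
\Delta V &= f_2(\mathrm{PEC},\, t_{r,\mathrm{eff}},\, t_{d,\mathrm{eff}}) \\
t_{r,\mathrm{eff}} &= f_3(t_r,\, T_r), \qquad t_{d,\mathrm{eff}} = f_3(t_d,\, T_d)
\end{aligned}
```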

Slide28

Outline

Executive Summary
Background on NAND Flash Reliability
Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips
URT: Unified Self-Recovery and Temperature Model
HeatWatch Mechanism
Conclusion

Slide29

HeatWatch Mechanism

Key Idea
  Predict change in threshold voltage distribution by using the URT model
  Adapt read reference voltage to near-optimal (Vopt) based on predicted change in voltage distribution
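The key idea can be illustrated with a minimal sketch. It assumes the two adjacent states have symmetric distributions of equal width, in which case the crossing point is simply the midpoint of the two predicted means. The class `StateMean`, the function `near_optimal_vref`, and all numbers are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class StateMean:
    """Mean threshold voltage of one state, split as in V = V0 + dV."""
    v0: float       # initial voltage before retention
    delta_v: float  # predicted shift due to retention loss

    @property
    def mean(self) -> float:
        return self.v0 + self.delta_v


def near_optimal_vref(lower: StateMean, upper: StateMean) -> float:
    """Midpoint of two adjacent predicted state means.

    Assumption: equal-width, symmetric distributions, so the midpoint is
    where the two distributions cross.
    """
    return 0.5 * (lower.mean + upper.mean)


# Hypothetical values in arbitrary voltage units.
fresh = near_optimal_vref(StateMean(2.0, 0.0), StateMean(3.0, 0.0))
aged = near_optimal_vref(StateMean(2.0, -0.10), StateMean(3.0, -0.30))
print(fresh, aged)  # the read reference voltage follows the predicted shift
```

When the predicted means move down after retention, the returned read reference voltage moves down with them instead of staying fixed.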

Slide30

HeatWatch Mechanism Overview

Tracking Components
  SSD Temperature
  Dwell Time
  P/E Cycles & Retention Time
Prediction Components (URT)
  Vopt Prediction
  Fine-Tuning URT Parameters
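The tracking components can be pictured as a small amount of bookkeeping state. The sketch below is illustrative only: the class and field names are hypothetical, counting one P/E cycle per program call is a simplification, and treating the gap between consecutive full drive writes as the dwell time is an assumption. The limit of 20 logged full drive writes and the retention time formula come from the tracking slides.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BlockInfo:
    """Per-block metadata (field names are illustrative)."""
    pe_cycles: int = 0            # P/E cycle count, already recorded by the SSD
    write_timestamp: float = 0.0  # logged when the block is programmed (seconds)


@dataclass
class Tracker:
    blocks: dict = field(default_factory=dict)
    # Timestamps of the most recent full drive writes; older entries drop out.
    full_drive_writes: deque = field(default_factory=lambda: deque(maxlen=20))

    def on_program(self, block_id: int, now: float) -> None:
        info = self.blocks.setdefault(block_id, BlockInfo())
        info.pe_cycles += 1  # simplification: one P/E cycle per program
        info.write_timestamp = now

    def on_full_drive_write(self, now: float) -> None:
        self.full_drive_writes.append(now)

    def retention_time(self, block_id: int, read_timestamp: float) -> float:
        # Retention time = read timestamp - write timestamp.
        return read_timestamp - self.blocks[block_id].write_timestamp

    def dwell_times(self) -> list:
        # Assumed proxy: idle time between consecutive full drive writes.
        ts = list(self.full_drive_writes)
        return [b - a for a, b in zip(ts, ts[1:])]


tracker = Tracker()
tracker.on_program(block_id=7, now=100.0)
for i in range(25):
    tracker.on_full_drive_write(now=i * 10.0)
print(tracker.retention_time(7, read_timestamp=460.0))
print(len(tracker.full_drive_writes))
```

A bounded `deque` keeps the log at a fixed size, which matches the observation that only the most recent 20 entries matter.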

Slide31

Tracking SSD Temperature

Use existing sensors in the SSD
Precompute temperature scaling factor at logarithmic time intervals

Slide32

Tracking Dwell Time

Only need to log the timestamps of last 20 full drive writes
Self-recovery effect diminishes after 20 P/E cycles

Slide33

Tracking P/E Cycles and Retention Time

P/E cycle count already recorded by SSD
Log write timestamp for each block
Retention time = read timestamp - write timestamp

Slide34

Predicting Optimal Read Reference Voltage

Calculate URT using tracked information
Modeling error: 4.9%

Slide35

Fine-Tuning URT Parameters Online

Accommodates chip-to-chip variation
Uses periodic sampling

Slide36

HeatWatch Mechanism Summary

Tracking Components
  SSD Temperature
  Dwell Time
  P/E Cycles & Retention Time
Prediction Components (URT)
  Vopt Prediction
  Fine-Tuning URT Parameters
Storage Overhead: 0.16% of DRAM in 1TB SSD

Latency Overhead: < 1% of flash read latency

Slide37

HeatWatch Evaluation Methodology

28 real workload storage traces
  MSR-Cambridge
We use real dwell time, retention time values obtained from traces
Temperature Model: Trigonometric function + Gaussian noise
  Represents periodic temperature variation in each day
  Includes small transient temperature variation
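A temperature model of this shape can be sketched in a few lines. The slide only states the form (a trigonometric function plus Gaussian noise), so the choice of a sine with a 24-hour period and every parameter value below (mean, amplitude, noise level) are assumptions made for illustration, as is the function name `simulated_temperature`.

```python
import math
import random


def simulated_temperature(t_hours, mean_c=40.0, amplitude_c=10.0,
                          noise_std_c=1.0, rng=random):
    """Temperature in degrees C at time t_hours (all defaults are hypothetical)."""
    # Periodic daily variation: one full sine cycle every 24 hours.
    periodic = amplitude_c * math.sin(2.0 * math.pi * t_hours / 24.0)
    # Small transient variation: zero-mean Gaussian noise.
    transient = rng.gauss(0.0, noise_std_c)
    return mean_c + periodic + transient


rng = random.Random(0)  # seeded so the trace is reproducible
trace = [simulated_temperature(h, rng=rng) for h in range(48)]
print(min(trace), max(trace))
```

Passing a seeded `random.Random` instance keeps a simulated trace repeatable across runs.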

Slide38

HeatWatch Greatly Improves Flash Lifetime

[Figure: Error Rate vs. Lifetime (P/E Cycles) for Fixed Vref, State-of-the-art, HeatWatch, and Oracle, with the ECC limit marked]

HeatWatch improves lifetime by capturing the effect of retention, wearout, self-recovery, temperature
  3.85x over Fixed Vref
  24% over state-of-the-art

Slide39

Outline

Executive Summary
Background on NAND Flash Reliability
Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips
URT: Unified Self-Recovery and Temperature Model
HeatWatch Mechanism
Conclusion

Slide40

Conclusion

3D NAND flash memory susceptible to retention errors
  Charge leaks out of flash cell
  Two unreported factors: self-recovery and temperature
We study self-recovery and temperature effects
We develop a new technique to improve flash reliability

Experimental characterization of real 3D NAND chips

Unified Self-Recovery and Temperature (URT) Model
  Predicts impact of retention loss, wearout, self-recovery, temperature on flash cell voltage
  Low prediction error rate: 4.9%

HeatWatch
  Uses URT model to find optimal read voltages for 3D NAND flash
  Improves flash lifetime by 3.85x

Slide41

Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness

Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, Onur Mutlu

HeatWatch

Slide42

Backup Slides

Slide43

SSD Architecture

[Figure: HOST connected to an SSD that contains the SSD Controller, NAND chips, and DRAM]

Slide44

3D vs. 2D Flash Cell Design

[Figure: cross-sections of a Floating-Gate Cell (Control Gate, Gate Oxide, Floating Gate (Conductor), Tunnel Oxide, Substrate with S and D) and a 3D Charge-Trap Cell (Control Gate, Gate Oxide, Charge Trap (Insulator), Tunnel Oxide, Substrate with S and D), with stored electrons marked "e"]

Charges stored in insulator, thinner tunnel oxide -> Faster data retention

Slide45

3D vs. 2D Retention Characteristics

Source: K. Mizoguchi, et al., “Data-Retention Characteristics Comparison of 2D and 3D TLC NAND Flash Memories,” IMW, 2017.

2D NAND very sensitive to wearout
3D NAND uniformly affected by wearout

Slide46

Limitations

Vendor-to-vendor variation
  Self-recovery and temperature effect should be similar for 3D charge trap NAND (Samsung, Hynix, Toshiba, Sandisk)
Chip-to-chip variation
  Each of our experiments takes several months
  Expect future large-scale study on 3D NAND errors
Not our limitation:
  Any process variation within a chip
    Our results include tens of randomly selected flash blocks
    ~1 million cells

Slide47

Generalizability of Results

Should apply to other 3D NAND flash memory that uses charge trap cells (Samsung, Hynix, Toshiba, Sandisk)

Slide48

Self-Recovery and Temperature in Planar NAND

UDM [Mielke 2006]
  Only models retention shift, no initial voltage
  Exponential P/E cycle effect
  Activation energy for planar NAND
3 other works propose mechanism and speculate different lifetime improvements
  211x [Mohan+ HotStorage10]
  5.8x [Wu+ HotStorage11]
  2.8x [Lee+ FAST12]

Slide49

Novelty vs. UDM

3D charge trap cells are more resilient to P/E cycling than floating-gate cells in planar NAND
Different activation energy
Program temperature effect not discussed in planar NAND

Slide50

Ideal SSD Temperature

It depends!
  High program temperature increases program variation (good)
  High dwell temperature accelerates self-recovery (good)
  High retention temperature accelerates retention loss (bad)

Slide51

URT Fine Tuning

Randomly sample 10 wordlines in each chip
Learn Vopt by sweeping Vref
Fit URT model with newly learned Vopt

Slide52

HeatWatch Overhead

Storage Overhead:
  Tracking SSD Temperature
    26 logarithmic intervals: 208 B
    Program temperature, dwell time, program time per block: 1.5 MB
  Dwell time
    Timestamp for last 20 full drive writes: 85 B
Latency Overhead:
  <1% of flash read latency (25 us)
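The storage figures can be sanity-checked with a little arithmetic. The sketch below only divides the quoted totals; the per-entry sizes it derives are inferences, not numbers stated on the slides. Reading "1.5 MB" as decimal megabytes is an assumption, and the 0.16% DRAM fraction is the one quoted on the mechanism summary slide.

```python
# Figures quoted on the slides.
temp_intervals = 26
temp_bytes = 208
dwell_timestamps = 20
dwell_bytes = 85
per_block_bytes = 1.5e6   # "1.5 MB", read here as decimal megabytes (assumption)
dram_fraction = 0.0016    # "0.16% of DRAM in 1TB SSD"

# Derived values (inferences from the quoted totals).
bytes_per_interval = temp_bytes / temp_intervals         # bytes per temperature interval
bits_per_timestamp = dwell_bytes * 8 / dwell_timestamps  # bits per logged timestamp
total_bytes = per_block_bytes + temp_bytes + dwell_bytes
implied_dram_bytes = total_bytes / dram_fraction         # DRAM size implied by 0.16%

print(bytes_per_interval, bits_per_timestamp)
print(round(implied_dram_bytes / 1e9, 2), "GB of DRAM implied")
```

The totals are consistent with 8 bytes per temperature interval, 34-bit timestamps, and a DRAM size a little under 1 GB for the 1TB SSD, with the per-block metadata dominating the overhead.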

Slide53

HeatWatch: Tracking Components

1. Tracking SSD temperature
  Use existing sensors in the SSD
  Precompute temperature scaling factor at logarithmic time intervals

[Figure: Temperature Effect vs. Actual Retention Time, with time steps 1 to 15 grouped into intervals of length n, 2n, 4n, 8n. Area = Effective Ret. Time]
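The "area" view of effective retention time can be sketched with the standard Arrhenius acceleration factor. This is a hedged illustration and not the exact formulation from the talk: the reference temperature, the function names, and the sample temperatures are assumptions, while the activation energy of 1.04 eV is the value quoted on the temperature scaling slide.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(temp_c, ref_temp_c, activation_energy_ev=1.04):
    """Arrhenius acceleration of time at temp_c relative to ref_temp_c."""
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    return math.exp(activation_energy_ev / BOLTZMANN_EV_PER_K
                    * (1.0 / t_ref - 1.0 / t))


def effective_time(samples, ref_temp_c, activation_energy_ev=1.04):
    """Sum duration * scaling factor over (duration, temp_c) samples.

    Each sample is one interval; this sum is the area under the
    temperature-effect curve for a piecewise-constant temperature.
    """
    return sum(d * acceleration_factor(temp, ref_temp_c, activation_energy_ev)
               for d, temp in samples)


# Intervals of length n, 2n, 4n, 8n with n = 1 hour (hypothetical temperatures).
samples = [(1.0, 55.0), (2.0, 45.0), (4.0, 40.0), (8.0, 40.0)]
print(effective_time(samples, ref_temp_c=40.0))
```

Time spent above the reference temperature counts for more than its actual duration, so the effective retention time here exceeds the 15 actual hours. Doubling interval lengths keeps the number of stored scaling factors small for long retention times.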

Slide54

HeatWatch: Tracking Components

2. Tracking dwell time
  Only need to track write frequency for last 20 P/E cycles
  Self-recovery effect plateaus after 20 P/E cycles

[Figure label: Faster Retention Loss]

Slide55

URT vs. Conventional Model

V = V0 + ΔV
Conventional: PEC, t_r
URT: PEC, T_p, t_r, T_r, t_r,eff, t_d, T_d, t_d,eff
URT adds self-recovery, temperature to conventional model

Slide56

Threshold Voltage Distribution Shifts

Shifts occur over time due to multiple factors (e.g., retention)
Can cause distribution of one state to cross over the read reference voltage boundary
  Some cells get misread
  Introduces raw bit errors

[Figure: Probability Density vs. Threshold Voltage (Vth) for states ER (11), P1 (01), P2 (00), P3 (10), separated by read reference voltages Va, Vb, Vc. Original and Shifted distributions are shown, with raw bit errors where a shifted distribution crosses a read reference voltage.]
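How a shift turns into a misread can be shown with a small sketch. It assumes the states are ordered ER, P1, P2, P3 by increasing threshold voltage and that a read compares the cell voltage against Va, Vb, Vc. The function names and all voltages are hypothetical, and the sketch counts misread cells rather than individual bit errors.

```python
from bisect import bisect_right

STATES = ("ER", "P1", "P2", "P3")  # assumed order of increasing threshold voltage


def read_state(vth, v_refs):
    """Return the state a cell reads as; v_refs = (Va, Vb, Vc), ascending."""
    # The number of read reference voltages at or below vth selects the state.
    return STATES[bisect_right(v_refs, vth)]


def misread_cells(programmed, vths, v_refs):
    """Count cells whose read state differs from the programmed state."""
    return sum(1 for want, vth in zip(programmed, vths)
               if read_state(vth, v_refs) != want)


v_refs = (1.0, 2.0, 3.0)              # hypothetical Va, Vb, Vc
programmed = ["ER", "P1", "P2", "P3"]
original = [0.4, 1.5, 2.1, 3.6]       # threshold voltages right after programming
shifted = [0.4, 1.4, 1.9, 3.2]        # after a hypothetical retention shift

print(misread_cells(programmed, original, v_refs))
print(misread_cells(programmed, shifted, v_refs))
```

In this example the P2 cell drops below Vb after the shift and reads as P1, which is the kind of error that moving the read reference voltage is meant to avoid.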

Slide57

Per-Workload Flash Lifetime Improvements

Slide58

Dwell Time Impact on Error Rate After Retention

Slide59

Dwell Time Impact on Threshold Voltage Distributions

Slide60

Mean Distribution Voltage vs. Retention for Different Dwell Times

Slide61

Impact of Dwell Time on Error Rate and Threshold Voltage Distribution Means

Slide62

Temperature Impact on Error Rate After Retention

Slide63

Impact of Programming Temperature on Threshold Voltage Distributions

Slide64

Impact of Programming Temperature on Error Rate and Threshold Voltage Distribution Means

Slide65

SRRM Prediction Accuracy

Slide66

Change in Flash Lifetime Due to Programming Temperature and Write Intensity

Slide67

Optimal Read Reference Voltage: Measured vs. Predicted by URT

Slide68

Inaccurate Read Reference Voltages Increase Error Rate
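The effect named in this last slide title can be reproduced with a toy model. The sketch assumes two adjacent states with Gaussian threshold voltage distributions and equal cell counts, and finds the best read reference voltage by a simple sweep. The distribution shapes, the parameters, and the function names are assumptions for illustration and are not measured data.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * math.erfc(-(x - mean) / (std * math.sqrt(2.0)))


def error_rate(v_ref, lower, upper):
    """Fraction of misread cells; lower and upper are (mean, std) pairs."""
    lower_misread = 1.0 - normal_cdf(v_ref, *lower)  # lower-state cells above v_ref
    upper_misread = normal_cdf(v_ref, *upper)        # upper-state cells below v_ref
    return 0.5 * (lower_misread + upper_misread)     # equal cell counts assumed


def sweep_vopt(lower, upper, steps=400):
    """Sweep v_ref between the two means and keep the lowest-error value."""
    lo, hi = lower[0], upper[0]
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda v: error_rate(v, lower, upper))


lower, upper = (2.0, 0.2), (3.0, 0.2)  # hypothetical (mean, std) in volts
v_opt = sweep_vopt(lower, upper)
print(v_opt, error_rate(v_opt, lower, upper))
print(error_rate(v_opt - 0.2, lower, upper))  # an inaccurate read reference voltage
```

Moving the read reference voltage away from the swept optimum in either direction raises the error rate in this model.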