Presentation Transcript

Slide 1: Memory Controller Innovations for High-Performance Systems

Rajeev Balasubramonian
School of Computing, University of Utah
Sep 25th, 2013

Slide 2: Micron Road Trip

[Map: Micron (Boise) and Salt Lake City]

Slide 3: DRAM Chip Innovations

Slide 4: Feedback - I

Don’t bother modifying the DRAM chip.

Slide 5: Feedback - II

We love what you're doing with the memory controller and OS.

Slide 6: Academic Research Agendas

Not giving up on memory device innovations
Several examples of academic papers resonating with commercial innovations
Greater focus on memory controller improvements

Slide 7: This Talk's Focus: The Memory Controller

More relevant to Intel, Micron
Cores are being commoditized, but memory controller features are still evolving – new devices (buffer chips, HMC), chipkill, compression
Lots of room for improvement – MCs haven't seen the same innovation frenzy as the cores

Slide 8: Example IBM Server

Source: P. Bose, WETI Workshop, 2012

Slide 9: Power Contributions

[Chart: processor and memory shares of total server power]

Slide 10: Power Contributions

[Chart: processor and memory shares of total server power]

Slide 11: Memory Basics

[Figure: a multi-core processor host with four memory controllers (MC), each connected to x8 DRAM chips]

Slide 12: Outline

Background
Focusing on the memory controller
Memory basics
Implementing memory compression (MemZip)
Implementing chipkill (LESS-ECC)
Voltage and current aware scheduling (MICRO 2013)

Slide 13: Making a Case for Compression

Prior work: IBM MXT, Ekman and Stenstrom, LCP, Alameldeen and Wood, etc.
Can improve several metrics:
  primarily memory capacity
  secondary benefit in apps with locality: bandwidth, energy
Typically worsens access complexity and introduces data copies
The MemZip approach: focus on other metrics
  no change in memory capacity
  improvements in energy, bandwidth, reliability, complexity

Slide 14: MemZip

[Figure: host multi-core processor with memory controllers (MC) driving x8 DRAM channels]

Rank subsetting
Data fetch in 8-byte increments
Need metadata (MDC)
Modified data layout
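MemZip fetches only as many 8-byte bursts as a line's compressed size requires, so the controller must first learn that size from per-line metadata; MDC presumably names the on-chip structure that caches it. A minimal sketch of that lookup, with hypothetical names and a fallback to a full fetch on a metadata miss:

```python
# Sketch: metadata-guided fetch sizing for a MemZip-style controller.
# Assumes the per-line metadata records the compressed size in bytes.

BURST_BYTES = 8          # data is fetched in 8-byte increments
LINE_BYTES = 64          # uncompressed cache line size

def bursts_needed(compressed_size_bytes):
    """Number of 8-byte bursts the controller must issue for this line."""
    size = min(compressed_size_bytes, LINE_BYTES)
    return (size + BURST_BYTES - 1) // BURST_BYTES

class MetadataCache:
    """Hypothetical on-chip cache of per-line compressed sizes (MDC)."""
    def __init__(self):
        self.sizes = {}

    def lookup(self, line_addr):
        # Fall back to a full 64-byte fetch if the metadata is not cached.
        return self.sizes.get(line_addr, LINE_BYTES)

mdc = MetadataCache()
mdc.sizes[0x1000] = 26    # example: this line compresses to 26 bytes
print(bursts_needed(mdc.lookup(0x1000)))   # -> 4 bursts (32 bytes)
print(bursts_needed(mdc.lookup(0x2000)))   # -> 8 bursts (uncompressed)
```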

Slide 15: Cache Line Format

[Figure: compressed cache line formats using Base-Delta-Immediate and Frequent Pattern Compression]
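Base-Delta-Immediate (BDI) is one of the formats shown: a line whose words cluster near a common base is stored as that base plus small deltas. A minimal illustrative sketch of one BDI configuration (8-byte base, 1-byte deltas); real BDI tries several base and delta widths and also handles immediate values, so this is only a simplified example:

```python
import struct

def bdi_compress(line, delta_bytes=1):
    """Try Base-Delta-Immediate on a 64-byte line viewed as 8-byte words.

    Returns (base, deltas) if every word lies within a small signed delta
    of the first word, else None (the line stays uncompressed)."""
    words = struct.unpack("<8Q", line)
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas          # 8 B base + eight 1 B deltas
    return None

# Example: eight pointers into the same region compress well.
line = struct.pack("<8Q", *[0x7FFF0000 + 16 * i for i in range(8)])
print(bdi_compress(line))
```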

Slide 16: Using Spare Space for Energy and Reliability

[Figure: a 26-byte compressed cache line is rounded up to the 32 B boundary (fetch granularities of 8 B, 16 B, 24 B, 32 B); the leftover space is room for ECC and DBI codes]
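Since data moves in 8-byte bursts, a compressed line is padded up to the next burst boundary, and that padding is what holds the ECC and DBI codes for free. A small sketch of the calculation, using the slide's 26-byte example:

```python
BURST_BYTES = 8

def rounded_fetch(compressed_bytes):
    """Bytes actually fetched after rounding up to an 8-byte burst boundary."""
    return -(-compressed_bytes // BURST_BYTES) * BURST_BYTES

def spare_bytes(compressed_bytes):
    """Padding left in the last burst, usable for ECC and DBI codes."""
    return rounded_fetch(compressed_bytes) - compressed_bytes

print(rounded_fetch(26), spare_bytes(26))   # 32 bytes fetched, 6 bytes to spare
```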

Slide 17: Making the ECC Access More Efficient

Baseline ECC: ECC code is fetched in parallel from the 9th chip
Subranking with embedded-ECC: no extra chip; ECC is located in the same row as data; need extra COL-RDs to fetch ECC codes
MemZip with embedded-ECC: in many cases, the ECC is fetched with no additional COL-RD

Slide 18: DBI for Energy Efficiency

Data Bus Inversion: to save energy, either send data or the inverse of data
Break the cache line into small words; each word needs an inversion bit; the inversion bits make up the DBI code
We use either 0, 1, 2, or 3 bytes of DBI codes (2 bits record the DBI code size)

Original data:
  Transfer 1: 11111111
  Transfer 2: 00000000
With DBI encoding (the appended bit is the DBI code):
  Transfer 1: 11111111 0
  Transfer 2: 11111111 1
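The example above minimizes bus transitions: the second transfer is sent inverted so the wires do not all toggle. A minimal sketch of that per-byte decision, assuming a transition-minimizing (DBI-AC style) variant and an all-ones initial bus state; both assumptions are mine, chosen to reproduce the slide's example:

```python
def dbi_encode(transfers, bus_idle=0xFF):
    """Transition-minimizing DBI over a sequence of 8-bit transfers.

    Each byte is sent as-is or inverted, whichever toggles fewer wires
    relative to what is already on the bus; the chosen inversion bits
    form the DBI code. bus_idle is an assumed initial bus state."""
    encoded, dbi_bits = [], []
    prev = bus_idle
    for byte in transfers:
        if bin(byte ^ prev).count("1") > 4:   # more than half the wires would toggle
            byte ^= 0xFF                      # so send the inverse instead
            dbi_bits.append(1)
        else:
            dbi_bits.append(0)
        encoded.append(byte)
        prev = byte
    return encoded, dbi_bits

# The slide's example: 11111111 followed by 00000000.
enc, dbi = dbi_encode([0b11111111, 0b00000000])
print([f"{b:08b}" for b in enc], dbi)   # ['11111111', '11111111'] [0, 1]
```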

Slide 19: Methodology

Simics (8 out-of-order cores) and USIMM memory system timing
Micron power calculator for DRAM power estimates
Collection of workloads from SPEC2k6, NASPB, Parsec, CloudSuite; multi-programmed and multi-threaded

Slide 20: Effect of Sub-Ranking

2-way sub-ranking has the best performance
8-way sub-ranking is worse than the baseline

Slide 21: Effect of Compression on Performance

20% performance improvement

With compression, 4-way is the best, but only slightly better than 8x2-way

Slide 22: Effect on Memory Energy

8x2-way has the lowest traffic and energy
Additional 17% reduction in activity with DBI

Slide 23: Outline

Background
Focusing on the memory controller
Memory basics
Implementing memory compression (MemZip)
Implementing chipkill (LESS-ECC)
Voltage and current aware scheduling (MICRO 2013)

Slide 24: Chipkill Overview

Chipkill: the ability to recover from an entire memory chip failure
Commercial symbol-based chipkill: 4 check symbols are required to recover from 1 data symbol corruption; hence needs 32+4 x4 DRAM chips per access (two channels)
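Those numbers translate directly into cost; a quick check of the access width and storage overhead implied by the slide:

```python
data_chips, check_chips = 32, 4              # x4 DRAM devices per access
chips_per_access = data_chips + check_chips  # i.e. two ganged channels
storage_overhead = check_chips / data_chips

print(chips_per_access)                      # 36 chips touched per access
print(f"{storage_overhead:.1%}")             # 12.5% extra storage for check symbols
```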

Slide 25: LOT-ECC

1st level: checksums for error detection and location
2nd and 3rd levels: parity for error recovery

[Figure: data blocks A0-A8, their checksums CA0-CA8, and parity blocks PA / PPA for the second and third levels]

Slide 26: LESS-ECC

[Figure: data blocks A0-A7 with checksums CA0-CA7, plus a parity block XA and its checksum CXA]

1st level: parity for error detection and recovery
2nd level: checksums for detecting the error location
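The division of labor can be sketched concretely: parity across the data chips detects and repairs a bad chip, while the small per-chip checksums only have to say which chip went bad. A simplified, illustrative model follows; the checksum function, block sizes, and layout are stand-ins, not the paper's actual design:

```python
from functools import reduce

def checksum(block):
    """Toy 8-bit additive checksum standing in for LESS-ECC's checksums."""
    return sum(block) & 0xFF

def write_codes(chips):
    """chips: eight equal-length byte blocks, one per data chip."""
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chips))
    return parity, [checksum(c) for c in chips], checksum(parity)

def read_and_repair(chips, parity, chip_sums, parity_sum):
    """1st level: parity detects/repairs; 2nd level: checksums locate the bad chip."""
    syndrome = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(parity, *chips))
    if not any(syndrome):
        return chips                  # parity matches: no error detected
    if checksum(parity) != parity_sum:
        return chips                  # the parity block itself failed; data is intact
    bad = next(i for i, c in enumerate(chips) if checksum(c) != chip_sums[i])
    rebuilt = bytes(s ^ b for s, b in zip(syndrome, chips[bad]))
    return chips[:bad] + [rebuilt] + chips[bad + 1:]

# Example: eight 8-byte blocks; chip 3 returns garbage.
chips = [bytes([i] * 8) for i in range(8)]
parity, sums, psum = write_codes(chips)
corrupted = chips[:3] + [bytes(8)] + chips[4:]
print(read_and_repair(corrupted, parity, sums, psum) == chips)   # True
```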

Slide 27: Reducing Storage

Checksums can be made large or small
Small checksums can also be effectively cached on chip
In LESS-ECC, the checksum can be designed several ways (overheads computed below):
  Basic: 8-bit checksum for 64 bits of data
  ES1: 8-bit checksum for 512 bits of data
  ES2: 64-bit checksum for 8 Kb of data
  ES3: 64-bit checksum for 4 Gb of data

Storage Overhead:
        LOT-ECC   LESS-ECC
  x8      26%       13%
  x16     52%       26%
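The checksum share of storage shrinks quickly as the protected block grows, which is what makes the coarser designs cheap to store and easy to cache. A quick calculation of the checksum-only overhead for the four designs above (parity storage is not included):

```python
# (checksum bits, protected data bits) for each checksum design on the slide
designs = {
    "Basic": (8, 64),
    "ES1":   (8, 512),
    "ES2":   (64, 8 * 1024),        # 8 Kb of data
    "ES3":   (64, 4 * 2**30),       # 4 Gb of data
}
for name, (c, d) in designs.items():
    print(f"{name}: checksum/data = {c}/{d} = {c / d:.2e}")
```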

Slide 28: Error Rates

Checksums have a small probability of failing to detect an error
LOT-ECC uses checksums in the 1st level and hence causes SDC (silent data corruption)
LESS-ECC uses checksums in the 2nd level and hence causes DUE (detected but unrecoverable error); DUEs are more favorable

Slide 29: LESS-ECC Summary

Benefits: energy, parallelism, storage, avoids SDC (errors become DUEs instead)
Disadvantage: checksum cache and more logic at the memory controller

Slide 30: LESS-ECC Performance

Slide 31: LESS-ECC Memory Energy

Slide 32: LESS-ECC Energy Efficiency

LESS-ECC-x8 has 0.5% lower energy than LOT-ECC-x8, but 15% less energy per usable byte
LESS-ECC-x16 has 26% lower energy than LOT-ECC-x8 (both have a similar storage overhead of 26%)

Slide 33: Outline

Background
Focusing on the memory controller
Memory basics
Implementing memory compression (MemZip)
Implementing chipkill (LESS-ECC)
Voltage and current aware scheduling (MICRO 2013)

Slide 34: Current Constraints and IR-Drop

The MC ensures that requests are scheduled appropriately; many timing constraints, such as tFAW
A new constraint emerges in future 3D-stacked devices: IR-drop

Slide 35: Many Possible Solutions

Note that charge pumps and decaps scale with dies, but IR-drop gets worse
Provide higher voltage (power increase!)
Provide more pins and TSVs (cost increase!)
Alternative: use an architectural solution; the MC schedules requests in a manner that does not violate IR-drop limits
Similar to the power tokens used in some PCM papers, but we have to be aware of where activity is happening
Place data such that IR-drop-prone regions are avoided

Slide 36: Example Voltage Map

[Figure: voltage map across the die (X/Y coordinates)]

Slide 37: IR-Drop Regions

Slide 38: Scheduling Constraints

For a given part of the device, identify the worst-case set of requests that will cause an IR-drop violation
For example, for region A-Top, if you issue one COL-RD to the furthest bank, that region cannot handle any other request
Continue to widen the list of constraints by considering larger regions on the device

Slide 39: Scheduling Constraints

1 COL-RD = 1 COL-WR = 2 ACTs = 6 PREs
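The equivalences above suggest a token-style budget: each command consumes a region's current headroom in proportion to its draw. A minimal sketch of such an admission check, assuming (purely for illustration) per-region budgets expressed in "PRE-equivalents", so a COL-RD or COL-WR costs 6 units, an ACT 3 units, and a PRE 1 unit:

```python
# Command costs in "PRE-equivalents", from 1 COL-RD = 1 COL-WR = 2 ACTs = 6 PREs.
COST = {"COL-RD": 6, "COL-WR": 6, "ACT": 3, "PRE": 1}

class RegionBudget:
    """Hypothetical per-region current budget checked before issuing a command.

    budget is the worst-case activity the region tolerates without an
    IR-drop violation (e.g. 6 units if one COL-RD saturates region A-Top)."""
    def __init__(self, budget):
        self.budget = budget
        self.in_flight = 0

    def can_issue(self, cmd):
        return self.in_flight + COST[cmd] <= self.budget

    def issue(self, cmd):
        assert self.can_issue(cmd), "would violate the region's IR-drop limit"
        self.in_flight += COST[cmd]

    def retire(self, cmd):
        self.in_flight -= COST[cmd]

a_top = RegionBudget(budget=6)      # one COL-RD uses up the whole region
a_top.issue("COL-RD")
print(a_top.can_issue("PRE"))       # False: A-Top can take nothing else for now
```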

Slide 40: Overcoming Scheduling Limitations

Starvation: if B-Top always has 2 accesses, A-Top requests will be starved; prioritize requests that have much longer than average wait times (sketched below)
Page placement: dynamically identify frequently accessed pages and move them to favorable regions
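One simple way to realize the starvation rule above: track each request's wait time and force the IR-drop-aware scheduler to serve any request that has waited far longer than the current average. A hedged sketch; the threshold multiplier and the policy details are assumptions, not numbers from the talk:

```python
STARVATION_FACTOR = 4.0   # assumption: "much longer than average" taken as 4x the mean wait

def starved_request(queue, now):
    """queue: list of (arrival_time, request) pairs.

    Returns a request that must be prioritized because its wait time far
    exceeds the average, or None if the IR-drop-aware scheduler may pick
    whichever feasible request it prefers."""
    if not queue:
        return None
    waits = [now - t for t, _ in queue]
    avg = sum(waits) / len(waits)
    for (t, req), w in zip(queue, waits):
        if avg > 0 and w > STARVATION_FACTOR * avg:
            return req
    return None

# An A-Top request stuck behind a steady stream of B-Top requests:
queue = [(0.0, "A-Top read")] + [(90.0 + i, f"B-Top read {i}") for i in range(9)]
print(starved_request(queue, now=100.0))   # "A-Top read"
```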

Slide 41: Performance Impact

With all constraints (Real PDN), performance falls by 4.6X
With starvation management, the gap is reduced to 1.47X
Profiled page placement with starvation control is within 15% of the unrealistic Ideal PDN

Slide 42: Summary

Many features expected of future memory controllers: handling compression, errors, new devices
Lots of low-hanging fruit
Significant energy/performance benefits from compression
Energy-efficient and storage-efficient chipkill is possible, but requires some effort in the MC
More scheduling constraints are being imposed as technology evolves; we show in an IR-drop case study for 3D-stacked devices that the performance impacts can be large

Slide 43: Acks

Students in the Utah Arch Lab (Amirali Boroumand, Nil Chatterjee, Seth Pugsley, Ali Shafiee, Manju Shevgoor, Meysam Taassori)
Other collaborators from Samsung (Jung-Sik Kim), HP Labs (Naveen Muralimanohar), ARM (Ani Udipi), U. Nebrija (Pedro Reviriego), Utah (Al Davis)
Funding sources: NSF, Samsung, HP, IBM