Slide1
Memory Controller Innovations for High-Performance Systems
Rajeev Balasubramonian
School of Computing
University of Utah
Sep 25th 2013
Slide2
Micron Road Trip
MICRON
BOISE
SALT LAKE CITY
Slide3
DRAM Chip Innovations
Slide4
Feedback - I
Don’t bother modifying the DRAM chip.
Slide5
Feedback - II
We love what you’re doing with the memory controller and OS.
Slide6
Academic Research Agendas
Not giving up on memory device innovations
Several examples of academic papers resonating with commercial innovations
Greater focus on memory controller improvements
Slide7
This Talk’s Focus: The Memory Controller
More relevant to Intel, Micron
Cores are being commoditized, but memory controller features are still evolving: new devices (buffer chips, HMC), chipkill, compression
Lots of room for improvement: MCs haven’t seen the same innovation frenzy as the cores
Slide8
Example IBM Server
Source: P. Bose, WETI Workshop, 2012
Slide9
Power Contributions
[Chart: percentage of total server power, processor vs. memory]
Slide10
Power Contributions
[Chart: percentage of total server power, processor vs. memory]
Slide11
Memory Basics
[Diagram labels: HOST MULTI-CORE PROCESSOR; four memory controllers (MC); x8 DRAM chips]
Slide12
Outline
Background
Focusing on the memory controller
Memory basics
Implementing memory compression (MemZip)
Implementing chipkill (LESS-ECC)
Voltage and current aware scheduling (MICRO 2013)
Slide13
Making a Case for Compression
Prior work: IBM MXT, Ekman and Stenstrom, LCP, Alameldeen and Wood, etc.
Can improve several metrics:
primarily memory capacity
secondary benefit in apps with locality: bandwidth, energy
Typically worsens access complexity and introduces data copies
The MemZip approach: focus on other metrics
no change in memory capacity
improvements in energy, bandwidth, reliability, complexity
Slide14
MemZip
[Diagram labels: HOST MULTI-CORE PROCESSOR; four memory controllers (MC); x8 DRAM chips; MDC]
Rank subsetting
Data fetch in 8-byte increments
Need metadata
Modified data layout
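The link between rank subsetting and the 8-byte fetch increment can be sketched with a little arithmetic. The sketch below assumes a rank of eight x8 chips, a burst length of 8, and a 64-byte cache line; these values and the helper name `subrank_fetch` are illustrative assumptions, not figures stated on the slide.

```python
# Assumed parameters (not stated on the slide): a rank of eight x8 chips,
# DDR3-style burst length of 8, and 64-byte cache lines.
RANK_CHIPS = 8
CHIP_WIDTH_BITS = 8
BURST_LENGTH = 8
CACHE_LINE_BYTES = 64


def subrank_fetch(ways: int) -> tuple[int, int, int]:
    """Return (chips per sub-rank, bytes per burst, bursts per cache line)."""
    if ways <= 0 or RANK_CHIPS % ways != 0:
        raise ValueError("ways must evenly divide the number of chips in a rank")
    chips = RANK_CHIPS // ways
    bytes_per_burst = chips * CHIP_WIDTH_BITS * BURST_LENGTH // 8
    bursts_per_line = CACHE_LINE_BYTES // bytes_per_burst
    return chips, bytes_per_burst, bursts_per_line


for ways in (1, 2, 4, 8):
    chips, per_burst, bursts = subrank_fetch(ways)
    print(f"{ways}-way: {chips} chip(s), {per_burst} B per burst, {bursts} burst(s) per line")
```

Under these assumptions, 8-way sub-ranking activates a single chip and moves 8 bytes per burst, which is consistent with the 8-byte increment on the slide; a compressed line then needs fewer bursts than an uncompressed one.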
Slide15
Cache Line Format
BASE-DELTA-IMMEDIATE
FREQUENT PATTERN COMPRESSION
Slide16
Using Spare Space for Energy and Reliability
[Diagram: a 26-byte compressed cache line shown against 8 B, 16 B, 24 B, and 32 B boundaries; the space left before the 32 B boundary is room for ECC and DBI codes]
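The spare-space arithmetic in this example can be sketched as follows. The 8-byte fetch granularity is taken from the MemZip slide and the 26-byte line from this one; the helper names `fetched_bytes` and `spare_bytes` are hypothetical.

```python
FETCH_GRANULARITY = 8  # bytes per fetch increment, as on the MemZip slide


def fetched_bytes(compressed_size: int, granularity: int = FETCH_GRANULARITY) -> int:
    """Round a compressed line up to the next multiple of the fetch granularity."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return -(-compressed_size // granularity) * granularity  # ceiling division


def spare_bytes(compressed_size: int, granularity: int = FETCH_GRANULARITY) -> int:
    """Bytes fetched anyway that are left over for ECC and DBI codes."""
    return fetched_bytes(compressed_size, granularity) - compressed_size


print(fetched_bytes(26), spare_bytes(26))  # 32 6
```

For the 26-byte line on the slide, four 8-byte increments are fetched and 6 bytes remain for ECC and DBI codes at no extra transfer cost.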
Slide17
Making the ECC Access More Efficient
Baseline ECC: ECC code is fetched in parallel from 9th chip
Subranking with embedded-ECC: no extra chip; ECC is located in the same row as data; need extra COL-RDs to fetch ECC codes
MemZip with embedded-ECC: in many cases, the ECC is fetched with no additional COL-RD
Slide18
DBI for Energy Efficiency
Data Bus Inversion: to save energy, either send data or the inverse of data
Break the cache line into small words; each word needs an inversion bit; the inversion bits make up the DBI code
We use either 0, 1, 2, or 3 bytes of DBI codes
ORIGINAL DATA
Transfer 1: 11111111
Transfer 2: 00000000
WITH DBI ENCODING
Transfer 1: 11111111 0
Transfer 2: 11111111 1
2 bits for DBI code size
DBI code
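A minimal sketch of the encoding in the example above. The slide does not state the inversion criterion; this sketch assumes a word is inverted when that causes fewer wire toggles than sending it as-is, that the first word is sent unmodified, and that toggles on the DBI wire itself are ignored. The function names are hypothetical.

```python
def dbi_encode(words, width=8):
    """Return (sent_word, inversion_bit) pairs for a sequence of bus words.

    Assumed policy: invert a word when more than half of the wires would
    otherwise toggle relative to the previous transfer.
    """
    mask = (1 << width) - 1
    encoded = []
    prev = None  # value currently driven on the data wires
    for word in words:
        toggles = 0 if prev is None else bin(prev ^ word).count("1")
        if toggles > width // 2:
            sent, flag = ~word & mask, 1  # send the inverse, set the DBI bit
        else:
            sent, flag = word, 0
        encoded.append((sent, flag))
        prev = sent
    return encoded


def dbi_decode(encoded, width=8):
    """Undo dbi_encode: re-invert every word whose inversion bit is set."""
    mask = (1 << width) - 1
    return [(~sent & mask) if flag else sent for sent, flag in encoded]


for sent, flag in dbi_encode([0b11111111, 0b00000000]):
    print(f"{sent:08b} {flag}")
```

With the two transfers from the slide, the second word is sent inverted, so the data wires never toggle and only the inversion bit changes.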
Slide19
Methodology
Simics (8 out-of-order cores) and USIMM memory system timing
Micron power calculator for DRAM power estimates
Collection of workloads from SPEC2k6, NASPB, Parsec, CloudSuite; multi-programmed and multi-threaded
Slide20
Effect of Sub-Ranking
2-way sub-ranking has best performance
8-way sub-ranking is worse than baseline
Slide21
Effect of Compression on Performance
20% performance improvement
With compression, 4-way is the best, but only slightly better than 8x2-way
Slide22
Effect on Memory Energy
8x2-way has lowest traffic and energy
Additional 17% reduction in activity with DBI
Slide23
Outline
Background
Focusing on the memory controller
Memory basics
Implementing memory compression (MemZip)
Implementing chipkill (LESS-ECC)
Voltage and current aware scheduling (MICRO 2013)
Slide24
Chipkill Overview
Chipkill: the ability to recover from an entire memory chip failure
Commercial symbol-based chipkill: 4 check symbols are required to recover from 1 data symbol corruption; hence needs 32+4 x4 DRAM chips per access (two channels)
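The chip count above translates directly into access width and storage overhead. The short sketch below works that out from the slide's numbers (32 data chips, 4 check chips, x4 devices); the variable and function names are illustrative.

```python
CHIP_WIDTH_BITS = 4   # x4 DRAM devices
DATA_CHIPS = 32       # chips holding data symbols
CHECK_CHIPS = 4       # chips holding check symbols


def chipkill_footprint(data_chips=DATA_CHIPS, check_chips=CHECK_CHIPS,
                       width=CHIP_WIDTH_BITS):
    """Return (chips activated, data bits, check bits, storage overhead)."""
    data_bits = data_chips * width
    check_bits = check_chips * width
    return data_chips + check_chips, data_bits, check_bits, check_bits / data_bits


chips, data_bits, check_bits, overhead = chipkill_footprint()
print(f"{chips} chips per access, {data_bits}+{check_bits} bits, {overhead:.1%} overhead")
```

Every access therefore activates 36 chips, which is the energy and parallelism cost that the schemes on the following slides try to reduce.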
Slide25
LOT-ECC
1st level: checksums for error detection and location
2nd and 3rd levels: parity for error recovery
[Diagram labels: data segments A0, A1, …, A7, A8, each with a checksum CA0, CA1, …, CA7, CA8; parity PA and PPA]
Slide26
LESS-ECC
1st level: parity for error detection and recovery
2nd level: checksums for error location detection
[Diagram labels: data segments A0, A1, …, A7 with checksums CA0, CA1, …, CA7; parity XA with checksum CXA]
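The two levels can be illustrated with a small sketch: XOR parity across the data segments detects an error and rebuilds the faulty segment, while per-segment checksums are consulted only to locate it. This is a simplified illustration, not the actual LESS-ECC code: the additive `checksum` is a stand-in, the checksum of the parity segment (CXA) is omitted, and all function names are hypothetical.

```python
from functools import reduce


def xor_parity(segments):
    """Byte-wise XOR of equally sized segments."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*segments))


def checksum(segment):
    """Stand-in 8-bit checksum (assumption; the real code is not given here)."""
    return sum(segment) % 256


def read_with_recovery(segments, parity, checksums):
    """Return the data segments, repairing one faulty segment if needed."""
    # Level 1: parity over data plus the stored parity is all-zero when clean.
    if xor_parity(list(segments) + [parity]) == bytes(len(parity)):
        return list(segments)
    # Level 2: checksums are consulted only now, to locate the faulty segment.
    bad = [i for i, seg in enumerate(segments) if checksum(seg) != checksums[i]]
    if len(bad) != 1:
        raise RuntimeError("detected but unrecoverable error (DUE)")
    i = bad[0]
    others = list(segments[:i]) + list(segments[i + 1:]) + [parity]
    repaired = list(segments)
    repaired[i] = xor_parity(others)  # rebuild the faulty segment from the rest
    return repaired


data = [bytes([i] * 8) for i in range(8)]          # A0 .. A7
parity = xor_parity(data)                          # XA
sums = [checksum(seg) for seg in data]             # CA0 .. CA7
corrupted = list(data)
corrupted[3] = bytes([0xFF] * 8)                   # simulate a failed chip
print(read_with_recovery(corrupted, parity, sums) == data)
```

Because the checksums are needed only after parity has flagged an error, a checksum that fails to pinpoint the segment leads to a detected error rather than silently wrong data.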
Slide27
Reducing Storage
Checksums can be made large/small
Small checksums can also be effectively cached on chip
In LESS-ECC, the checksum can be sized in several ways:
basic: 8-bit checksum for 64 bits of data
ES1: 8-bit checksum for 512 bits of data
ES2: 64-bit checksum for 8Kb of data
ES3: 64-bit checksum for 4Gb of data
Storage Overhead
        LOT-ECC   LESS-ECC
X8      26%       13%
X16     52%       26%
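The checksum options listed on this slide differ widely in how much storage the checksum itself costs. The sketch below computes the checksum-to-data ratio for each option, assuming binary prefixes (1 Kb = 1024 bits, 1 Gb = 2^30 bits). These ratios cover only the checksum, not the parity, so they are not the totals in the storage overhead table.

```python
from fractions import Fraction

KBIT = 1024      # assumption: binary prefixes
GBIT = 2 ** 30

# (checksum bits, data bits covered) for each option named on the slide
CONFIGS = {
    "basic": (8, 64),
    "ES1": (8, 512),
    "ES2": (64, 8 * KBIT),
    "ES3": (64, 4 * GBIT),
}


def checksum_overhead(name: str) -> Fraction:
    """Checksum bits stored per data bit for the named configuration."""
    checksum_bits, data_bits = CONFIGS[name]
    return Fraction(checksum_bits, data_bits)


for name in CONFIGS:
    print(f"{name}: {float(checksum_overhead(name)):.3e}")
```

The larger the block a checksum covers, the smaller its share of storage, which is also what makes the small checksums practical to cache on chip.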
Slide28
Error Rates
Checksums have a small probability of failing to detect an error
LOT-ECC uses checksums in the 1st level and hence causes SDC (silent data corruption)
LESS-ECC uses checksums in the 2nd level and hence causes DUE (detected but unrecoverable error); DUEs are more favorable
Slide29
LESS-ECC Summary
Benefits: energy, parallelism, storage, SDC
Disadvantage: checksum cache and more logic at the memory controller
Slide30
LESS-ECC Performance
Slide31
LESS-ECC Memory Energy
Slide32
LESS-ECC Energy Efficiency
LESS-ECC-x8 has 0.5% lower energy than LOT-ECC-x8 but 15% less energy per usable byte
LESS-ECC-x16 has 26% lower energy than LOT-ECC-x8 (both have similar storage overhead of 26%)
Slide33
Outline
Background
Focusing on the memory controller
Memory basics
Implementing memory compression (MemZip)
Implementing chipkill (LESS-ECC)
Voltage and current aware scheduling (MICRO 2013)
Slide34
Current Constraints and IR-Drop
MC ensures that requests are scheduled appropriately; many timing constraints, such as tFAW
A new constraint emerges in future 3D-stacked devices: IR-drop
Slide35
Many Possible Solutions
Note that charge pumps and decaps scale with dies, but IR-drop gets worse
Provide higher voltage (power increase!)
Provide more pins and TSVs (cost increase!)
Alternative: Use an architectural solution; the MC schedules requests in a manner that does not violate IR-drop limits
Similar to the power tokens used in some PCM papers, but we have to be aware of where activity is happening
Place data such that IR-drop-prone regions are avoided
Slide36
Example Voltage Map
[Figure: voltage map of the device; axis label: Y Coordinate]
Slide37
IR-Drop Regions
Slide38
Scheduling Constraints
For a given part of the device, identify the worst-case set of requests that will cause an IR-drop violation
For example, for region A-Top, if you issue one COL-RD to the furthest bank, that region cannot handle any other request
Continue to widen the list of constraints by considering larger regions on the device
Slide39
Scheduling Constraints
1 COL-RD = 1 COL-WR = 2 ACTs = 6 PREs
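One way to read the equivalence above is as a shared per-region budget in which each command consumes a fixed number of units. The sketch below encodes that reading; the unit costs follow from the stated ratios, while the budget of 6 units per region and the function `can_issue` are assumptions for illustration (real limits would depend on the region and on how far the bank is from the power source).

```python
# Unit costs chosen so that 1 COL-RD = 1 COL-WR = 2 ACTs = 6 PREs.
COST = {"COL-RD": 6, "COL-WR": 6, "ACT": 3, "PRE": 1}
REGION_BUDGET = 6  # hypothetical: one region tolerates one COL-RD at a time


def can_issue(in_flight, command, budget=REGION_BUDGET):
    """True if adding `command` keeps the region within its IR-drop budget."""
    used = sum(COST[c] for c in in_flight)
    return used + COST[command] <= budget


print(can_issue([], "COL-RD"))           # True: the region is idle
print(can_issue(["COL-RD"], "PRE"))      # False: the read already uses the budget
print(can_issue(["ACT"], "ACT"))         # True: two ACTs fit together
```

A memory controller would evaluate a check of this kind for every region a command touches before issuing it, alongside the usual timing constraints.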
Slide40
Overcoming Scheduling Limitations
Starvation: If B-Top always has 2 accesses, A-Top requests will be starved; prioritize requests that have much longer than average wait times
Page placement: dynamically identify frequently accessed pages and move them to favorable regions
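The starvation rule can be sketched as a small selection policy. This is an illustrative sketch only: the `Request` fields, the threshold factor of 4, and the function `pick_next` are assumptions, since the slide says only that requests waiting much longer than average are prioritized.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: int      # cycle at which the request entered the queue
    issuable: bool    # True if issuing now violates no IR-drop constraint


def pick_next(queue, now, avg_wait, factor=4.0):
    """Pick the next request, favouring any that has waited far longer than average.

    A starved request is returned even if it is not issuable yet; the caller
    is expected to hold back other requests until it can be issued.
    """
    starved = [r for r in queue if now - r.arrival > factor * avg_wait]
    if starved:
        return min(starved, key=lambda r: r.arrival)
    ready = [r for r in queue if r.issuable]
    return min(ready, key=lambda r: r.arrival) if ready else None


queue = [Request("A-Top read", 0, False), Request("B-Top read", 90, True)]
print(pick_next(queue, now=100, avg_wait=10).name)  # A-Top read
print(pick_next(queue, now=100, avg_wait=50).name)  # B-Top read
```

Without the starvation branch, the B-Top request would win every time it is issuable and the A-Top request could wait indefinitely.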
Slide41
Performance Impact
With All Constraints (Real PDN), performance falls by 4.6X
With Starvation management, gap is reduced to 1.47X
Profiled Page Placement with Starvation Control is within 15% of unrealistic Ideal PDN
Slide42
Summary
Many features expected of future memory controllers: handling compression, errors, new devices
Lots of low-hanging fruit
Significant energy/performance benefits from compression
Energy-efficient and storage-efficient chipkill possible, but requires some effort in the MC
More scheduling constraints being imposed as technology evolves; we show in an IR-drop case study for 3D-stacked devices that the performance impacts can be large
Slide43
Acks
Students in the Utah Arch Lab (Amirali Boroumand, Nil Chatterjee, Seth Pugsley, Ali Shafiee, Manju Shevgoor, Meysam Taassori)
Other collaborators from Samsung (Jung-Sik Kim), HP Labs (Naveen Muralimanohar), ARM (Ani Udipi), U. Nebrija (Pedro Reviriego), Utah (Al Davis)
Funding sources: NSF, Samsung, HP, IBM