Slide 1: DICE: Compressing DRAM Caches for Bandwidth and Capacity
Vinson Young, Prashant Nair, Moinuddin Qureshi
Slide 2: MOORE'S LAW HITS BANDWIDTH WALL
Moore's-Law scaling encounters the Bandwidth Wall.
Slide 3: 3D-DRAM MITIGATES BANDWIDTH WALL
3D-DRAM: Hybrid Memory Cube (HMC) from Micron, High Bandwidth Memory (HBM) from Samsung.
3D-DRAM improves bandwidth, but does not have the capacity to replace conventional DIMM memory.
Slide 4: 3D-DRAM AS A CACHE (3D-DRAM CACHE)
[Diagram: memory hierarchy, fast to slow - CPU, L1$, L2$, L3$, 3D-DRAM Cache, System Memory (OS-visible space).]
Architecting 3D-DRAM as a cache can improve memory bandwidth (and avoid OS/software changes).
Examples: MCDRAM from Intel, HBC from AMD.
Slide 5: PRACTICAL 3D-DRAM CACHE: ALLOY CACHE
A practical DRAM cache must be low latency and bandwidth-efficient.
Alloy Cache makes tags "part-of-line": each access returns one "Tag+Data" unit, avoiding tag serialization.
Similar to the DRAM cache in KNL: direct-mapped, with tags in the ECC bits.
Slide 6: 3D-DRAM CACHE BANDWIDTH IS IMPORTANT
On an 8-CPU, 1GB DRAM cache configuration, a 2x-capacity cache improves performance by 10%, and an additional 2x bandwidth increases the speedup to 22%. Improving both bandwidth and capacity is valuable.
Slide 7: INTRODUCTION: DRAM CACHE
Baseline: direct-mapped, one data block per access.
[Diagram: lines A, B, C, D laid out in the baseline cache, contrasted with Traditional Compression and Spatial Indexing layouts for both compressible and incompressible lines.]
Slide 8: INTRODUCTION: COMPRESSED DRAM CACHE
Compression adds capacity, but does it improve bandwidth?
[Diagram: Traditional Compression with compressible lines packs extra lines W, X, Y, Z alongside A, B, C, D, but each access still fetches one line: 1x bandwidth.]
Slide 9: INTRODUCTION: COMPRESSED DRAM CACHE
Compression adds capacity, but does it improve bandwidth?
[Diagram: Traditional Compression with incompressible lines leaves the layout unchanged: 1x bandwidth.]
Slide 10: INTRODUCTION: COMPRESSED DRAM CACHE
Compression adds capacity, but does it improve bandwidth?
[Diagram: with Traditional Compression, fetching A, B, C, D still takes 4 accesses, at 1x-2x capacity.]
Slide 11: INTRODUCTION: COMPRESSED DRAM CACHE
Compression adds capacity, but does it improve bandwidth?
[Diagram: with Spatial Indexing, compressible neighbors share a set, so one access fetches a pair of lines: 2x bandwidth.]
Slide 12: INTRODUCTION: COMPRESSED DRAM CACHE
Compression adds capacity, but does it improve bandwidth?
[Diagram: with Spatial Indexing, incompressible lines no longer fit together (where are B and D?), costing extra accesses: <1x bandwidth.]
Slide 13: INTRODUCTION: COMPRESSED DRAM CACHE
Compression adds capacity, but does it improve bandwidth?

                          Compressible    Incompressible
Traditional Compression   1x Bandwidth    1x Bandwidth
Spatial Indexing          2x Bandwidth    <1x Bandwidth
Slide 14: INTRODUCTION: TRADITIONAL COMPRESSION
Improves capacity, with no degradation, but sees little speedup (7%): compression for capacity (TSI) hits diminishing returns on giga-scale caches.
Slide 15: INTRODUCTION: SPATIAL INDEXING
Improves bandwidth, but can degrade performance. Spatial-Indexing compression gets both the bandwidth and capacity benefits when lines are compressible, but hurts performance when lines are incompressible.
Slide 16: INTRODUCTION: COMPRESSED DRAM CACHE
Goal: compression for capacity AND bandwidth.

                          Compressible    Incompressible
Traditional Compression   1x Bandwidth    1x Bandwidth
Spatial Indexing          2x Bandwidth    <1x Bandwidth
DICE (Dynamic Index)      19% Speedup + 36% EDP improvement
Slide 17: DICE OVERVIEW
- Compressed DRAM Cache Organization
- Flexible Mapping for Quick Switching
- Dynamic Indexing ComprEssion (DICE)
  - Insertion Policy
  - Index Prediction
Slide 18: PRACTICAL DRAM CACHE COMPRESSION
Compression requires only simple changes within the controller.
[Diagram: on-chip, the L3 Cache and L4 Cache Controller hold Compression Logic on the Install/Write path and Decompression Logic on the Read path; off-chip sit the DRAM Cache (compressed) and Memory, reached via Writeback.]
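The slide's point, that compression and decompression live entirely in the cache controller, can be sketched as follows. This is a hedged illustration, not the talk's hardware: zlib stands in for the FPC/BDI compressors, and a Python dict stands in for the off-chip DRAM cache.

```python
import zlib


class CompressedCacheController:
    """Sketch of the controller on this slide: compression on the
    install/write path, decompression on the read path.  The DRAM
    cache itself only ever holds opaque (compressed) bytes, so it
    needs no modification."""

    def __init__(self):
        self.dram_cache = {}  # stands in for the off-chip DRAM cache

    def install(self, addr, line):
        # Compression Logic sits between the L3 and the DRAM cache.
        self.dram_cache[addr] = zlib.compress(line)

    def read(self, addr):
        # Decompression Logic restores the original line on a hit.
        blob = self.dram_cache.get(addr)
        return None if blob is None else zlib.decompress(blob)
```

Reads and writebacks from the L3 see uncompressed 64B lines on both sides; only the stored representation changes.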
Slide 19: DRAM CACHE TAG FORMAT
[Diagram: one access returns Tag A (8 bytes) plus Data A (64 bytes), separated by the tag boundary.]
The cache controller receives 72B of tag+data; it can flexibly interpret bits as tag bits or data bits.
Slide 20: PROPOSED FLEXIBLE TAG FORMAT
[Diagram: "Is Tag?" bits mark where tags end and data begins, so the tag region can grow into the data space as more compressed lines (A, B, I, X) are stored.]
We create tag space as needed, for up to 28 lines, achieving 1.6x effective capacity.
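The flexible tag format can be illustrated with a small packing sketch. The 8B-per-tag cost and greedy packing into a single 72B tag+data unit below are simplifying assumptions for clarity, not the exact layout (which extends across a row to reach up to 28 lines).

```python
TAG_BYTES = 8     # one Alloy-style tag
UNIT_BYTES = 72   # tag + data returned by one access

def lines_that_fit(compressed_sizes):
    """Flexible tag format (illustrative): tag space is created as
    needed, so every extra line packed into the 72B unit costs one
    more 8B tag carved out of the data space.  Returns how many of
    the given compressed lines fit."""
    used = 0
    for count, size in enumerate(compressed_sizes, start=1):
        used += TAG_BYTES + size  # each stored line pays for its own tag
        if used > UNIT_BYTES:
            return count - 1
    return len(compressed_sizes)
```

An uncompressed 64B line exactly fills the unit (8 + 64 = 72), matching the baseline one-tag-one-line format; well-compressed lines let several tags and lines share the same space.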
Slide 21: DICE OVERVIEW
- Compressed DRAM Cache Organization
- Flexible Mapping for Quick Switching
- Dynamic Indexing ComprEssion (DICE)
  - Insertion Policy
  - Index Prediction
Slide 22: FLEXIBLE MAPPING (TSI OR BAI)
[Diagram: eight cache sets mapped under Traditional Set Indexing (TSI), Naive Spatial Indexing, and Bandwidth-Aware Indexing (BAI).]
Bandwidth-Aware Indexing (BAI) facilitates quick switching between the two indices, TSI and BAI.
Slide 25: DICE OVERVIEW
- Compressed DRAM Cache Organization
- Flexible Mapping for Quick Switching
- Dynamic Indexing ComprEssion (DICE)
  - Insertion Policy
  - Index Prediction
Slide 26: DICE: DYNAMIC-INDEXED COMPRESSED CACHE
[Diagram: on Install, Compressibility-Based Insertion chooses between the Bandwidth-Aware Index and the Traditional Set Index; on Read, Cache Index Prediction chooses which one to probe (for some lines TSI = BAI).]
DICE, Dynamic-Indexing Cache comprEssion, decides the index on install and predicts the index on read.
Slide 27: COMPRESSIBILITY-BASED INSERTION
Compressibility-based insertion uses Bandwidth-Aware Indexing when lines are compressible (<= 1/2-size) and the Traditional Set Index otherwise (> 1/2-size). There are no explicit swaps: eviction and install decide the policy.
On a read, however, checking both possible locations would waste bandwidth.
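The insertion rule on this slide is a one-line decision; a minimal sketch, assuming 64B lines and the half-size threshold stated above:

```python
LINE_BYTES = 64
HALF_LINE = LINE_BYTES // 2

def choose_index(compressed_size):
    """Compressibility-based insertion: a line that compresses to
    half size or less is installed with the Bandwidth-Aware Index
    (BAI), so it and its neighbor can be fetched in one access;
    anything larger keeps the Traditional Set Index (TSI)."""
    return "BAI" if compressed_size <= HALF_LINE else "TSI"
```

Because the decision is made per install, no stored line ever needs to be explicitly moved between the two indexings.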
Slide 28: SIMILAR INTRA-PAGE COMPRESSIBILITY
Lines within a page have similar compressibility. For a compressible page (lines <= 1/2-size), DICE is likely to install all of the page's lines with the same index, the Bandwidth-Aware Index, so reads can simply probe BAI.
Slide 29: SIMILAR INTRA-PAGE COMPRESSIBILITY
Likewise, for an incompressible page (> 1/2-size), nearly all lines are installed with the Traditional Set Index. Thus a page-based last-time prediction of the index can be accurate (94%); a second access is needed only on a mispredict.
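The read path implied by these two slides can be sketched as below. The `probe(addr, index)` callable is a hypothetical stand-in for one DRAM-cache access returning `(hit, data)`; the second return value counts accesses to make the mispredict cost explicit.

```python
def read_line(line_addr, predicted_index, probe):
    """Read path (sketch): probe the predicted index first; only a
    misprediction pays for a second DRAM-cache access."""
    hit, data = probe(line_addr, predicted_index)
    if hit:
        return data, 1                      # one access on a correct prediction
    other = "TSI" if predicted_index == "BAI" else "BAI"
    hit, data = probe(line_addr, other)
    return (data if hit else None), 2       # second access only on mispredict
```

With 94% prediction accuracy, the average cost stays close to one access per read.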
Slide 30: PAGE-BASED CACHE INDEX PREDICTOR (CIP)
[Diagram: the page number of a demand access is hashed into a Last-Time Table (LTT) of 1-bit entries; 0 predicts the Traditional Set Index, 1 predicts the Bandwidth-Aware Index.]
Page-based last-time prediction exploits similar intra-page compressibility to achieve high prediction accuracy (94%).
Slide 31: DICE OVERVIEW
- Compressed DRAM Cache Organization
- Flexible Mapping for Quick Switching
- Dynamic Indexing (DICE)
  - Insertion Policy
  - Index Prediction
- Results
Slide 32: METHODOLOGY (1/8TH KNIGHTS LANDING)
Core chip: 3.2GHz, 4-wide out-of-order cores; 8 cores with an 8MB shared last-level cache.
Compression: FPC + BDI.
[Diagram: CPU connected to both Stacked DRAM and Commodity DRAM.]
Slide 33: METHODOLOGY (1/8TH KNIGHTS LANDING)

            Stacked DRAM          Commodity DRAM
Capacity    1GB                   32GB
Bus         DDR 1.6GHz, 128-bit   DDR 1.6GHz, 64-bit
Channels    4 channels            1 channel
Bandwidth   100 GBps              12.5 GBps
Latency     35ns                  35ns

Other sensitivities are in the paper.
Slide 34: DICE RESULTS
DICE improves performance over both Spatial Indexing and Traditional Indexing through its fine-grain decision (19% speedup): it performs like Spatial Indexing where that wins, like Traditional Indexing where that wins, and outperforms both overall.
Slide 35: INTRODUCTION: COMPRESSED DRAM CACHE (RECAP)
Goal: compression for capacity AND bandwidth.

                          Compressible    Incompressible
Traditional Compression   1x Bandwidth    1x Bandwidth
Spatial Indexing          2x Bandwidth    <1x Bandwidth
DICE (Dynamic Index)      19% Speedup + 36% EDP improvement
Slide 36: Thank You
Slide 37: Extra Slides
Slide 38: Different Cache Sensitivities
Slide 39: Comparison to Prefetch
Slide 40: Comparison to SRAM/Memory Compression
Slide 41: Full Results (Mixed Compressibility)
Slide 42: SRAM Cache Compression on DRAM Cache
Slide 43: Distribution for Index Decision
Slide 44: DICE Insertion Threshold
Slide 45: Effective Capacity
Slide 46: L3 Hit Rate Improvement
Slide 47: Larger TSI vs. BAI Example