Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch
Slide 1: Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry
Phillip B. Gibbons*, Michael A. Kozuch*
Slide 2: Executive Summary
- Off-chip memory latency is high. Large caches can help, but at significant cost.
- Compressing data in the cache enables a larger cache at low cost.
- Problem: decompression is on the execution critical path.
- Goal: design a new compression scheme with (1) low decompression latency, (2) low cost, (3) high compression ratio.
- Observation: many cache lines contain low-dynamic-range data.
- Key idea: encode cache lines as a base + multiple differences.
- Solution: Base-Delta-Immediate compression, with low decompression latency and high compression ratio.
- Outperforms three state-of-the-art compression mechanisms.
Slide 3: Motivation for Cache Compression
Significant redundancy in data: 0x00000000, 0x0000000B, 0x00000003, 0x00000004, ...
How can we exploit this redundancy? Cache compression helps: it provides the effect of a larger cache without making it physically larger.
Slide 4: Background on Cache Compression
Key requirements:
- Fast (low decompression latency)
- Simple (avoid complex hardware changes)
- Effective (good compression ratio)
(Figure: the CPU and L1 cache hold uncompressed data; the L2 cache holds compressed data, which is decompressed on a hit.)
Slides 5-8: Shortcomings of Prior Work
(Table comparing compression mechanisms on decompression latency, complexity, and compression ratio, built up one row per slide.)
Mechanisms compared: Zero, Frequent Value, Frequent Pattern, and our proposal, BΔI.
Slide 9: Outline
Motivation & Background, Key Idea & Our Mechanism, Evaluation, Conclusion
Slide 10: Key Data Patterns in Real Applications
- Zero Values (initialization, sparse matrices, NULL pointers): 0x00000000, 0x00000000, 0x00000000, 0x00000000, ...
- Repeated Values (common initial values, adjacent pixels): 0x000000FF, 0x000000FF, 0x000000FF, 0x000000FF, ...
- Narrow Values (small values stored in a big data type): 0x00000000, 0x0000000B, 0x00000003, 0x00000004, ...
- Other Patterns (pointers to the same memory region): 0xC04039C0, 0xC04039C8, 0xC04039D0, 0xC04039D8, ...
Slide 11: How Common Are These Patterns?
SPEC2006, databases, web workloads, 2MB L2 cache. ("Other Patterns" include Narrow Values.) 43% of the cache lines belong to the key patterns.
Slide 12: Key Data Patterns in Real Applications (continued)
(Same example lines as Slide 10.)
Low Dynamic Range: differences between values are significantly smaller than the values themselves.
Slide 13: Key Idea: Base+Delta (B+Δ) Encoding
32-byte uncompressed cache line: 0xC04039C0, 0xC04039C8, 0xC04039D0, ..., 0xC04039F8
12-byte compressed cache line: 4-byte base (0xC04039C0) + 1-byte deltas (0x00, 0x08, 0x10, ..., 0x38)
20 bytes saved.
- Fast decompression: vector addition
- Simple hardware: arithmetic and comparison
- Effective: good compression ratio
Slide 14: Can We Do Better?
Uncompressible cache line (with a single base): 0x00000000, 0x09A40178, 0x0000000B, 0x09A4A838, ...
Key idea: use more bases, e.g., two instead of one.
- Pro: more cache lines can be compressed.
- Cons: unclear how to find these bases efficiently; higher overhead (due to additional bases).
Slide 15: B+Δ with Multiple Arbitrary Bases
Based on evaluation, 2 bases is the best option.
Slide 16: How to Find Two Bases Efficiently?
- First base: the first element in the cache line (the Base+Delta part)
- Second base: an implicit base of 0 (the Immediate part)
Advantages over 2 arbitrary bases: better compression ratio and simpler compression logic.
This is Base-Delta-Immediate (BΔI) Compression.
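A sketch of this two-base idea, with the first element as one base and an implicit zero as the other. The function names and the sample line mixing pointers with small integers are illustrative, not taken from the paper.

```python
def bdi_compress(values, delta_size=1):
    """Encode each word against whichever fixed base gives a narrow delta:
    the line's first element ('B') or the implicit zero base ('Z')."""
    base = values[0]
    limit = 1 << (8 * delta_size - 1)
    encoded = []
    for v in values:
        for b, tag in ((base, 'B'), (0, 'Z')):
            d = v - b
            if -limit <= d < limit:
                encoded.append((tag, d))
                break
        else:
            return None            # neither base works: uncompressible
    return base, encoded

def bdi_decompress(base, encoded):
    return [(base if tag == 'B' else 0) + d for tag, d in encoded]

# A line mixing pointers and small values defeats a single base, but
# compresses once the implicit zero base covers the small values.
line = [0xC04039C0, 0x00000001, 0xC04039C8, 0x00000004]
base, encoded = bdi_compress(line)
assert bdi_decompress(base, encoded) == line
```

Because both bases are fixed (first element and zero), no search over candidate bases is needed, which is the source of the simpler compression logic.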
Slide 17: B+Δ (with two arbitrary bases) vs. BΔI
Average compression ratio is close, but BΔI is simpler.
Slide 18: BΔI Implementation
- Decompressor design: low latency
- Compressor design: low cost and complexity
- BΔI cache organization: modest complexity
Slide 19: BΔI Decompressor Design
Compressed cache line: B0, Δ0, Δ1, Δ2, Δ3.
Each output word is computed as Vi = B0 + Δi; all four additions happen in parallel (a vector addition), producing the uncompressed cache line V0, V1, V2, V3.
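The decompressor can be modeled on a packed byte layout. The name `unpack_bdi` and the little-endian layout are assumptions for illustration, not the paper's wire format.

```python
def unpack_bdi(packed, base_size=8, delta_size=1, nwords=4):
    """Decompress a packed (B0, Δ0..Δ3) line: read the base, then add
    every signed delta to it (the loop stands in for parallel adders)."""
    base = int.from_bytes(packed[:base_size], 'little')
    mask = (1 << (8 * base_size)) - 1
    words = []
    for i in range(nwords):
        off = base_size + i * delta_size
        d = int.from_bytes(packed[off:off + delta_size], 'little', signed=True)
        words.append((base + d) & mask)      # wrap to the word width
    return words

packed = (0xC04039C0).to_bytes(8, 'little') + bytes([0x00, 0x08, 0x10, 0x18])
assert unpack_bdi(packed) == [0xC04039C0, 0xC04039C8, 0xC04039D0, 0xC04039D8]
```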
Slide 20: BΔI Compressor Design
The 32-byte uncompressed cache line is fed in parallel to eight compression units (CUs):
- 8-byte B0 with 1-byte Δ; 8-byte B0 with 2-byte Δ; 8-byte B0 with 4-byte Δ
- 4-byte B0 with 1-byte Δ; 4-byte B0 with 2-byte Δ
- 2-byte B0 with 1-byte Δ
- Zero CU; Repeated Values CU
Each unit outputs a compression flag (CFlag) and a compressed cache line (CCL). Compression selection logic picks the winning unit based on compressed size.
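The selection logic can be sketched as follows. This is a software model: the Zero and Repeated-Values units are omitted for brevity, and `try_cu`/`select` are illustrative names.

```python
def try_cu(line_bytes, base_size, delta_size):
    """One (base size, delta size) compression unit: base = first element;
    report the compressed size if every delta fits, else None (CFlag=0)."""
    words = [int.from_bytes(line_bytes[i:i + base_size], 'little')
             for i in range(0, len(line_bytes), base_size)]
    base = words[0]
    limit = 1 << (8 * delta_size - 1)
    if all(-limit <= w - base < limit for w in words):
        return base_size + delta_size * len(words)   # bytes in the CCL
    return None

def select(line_bytes):
    """Run all units 'in parallel' and keep the smallest compressed line."""
    configs = [(8, 1), (8, 2), (8, 4), (4, 1), (4, 2), (2, 1)]
    ok = {cfg: s for cfg in configs
          if (s := try_cu(line_bytes, *cfg)) is not None}
    return min(ok, key=ok.get) if ok else None   # None: store uncompressed

line = b''.join((0xC04039C0 + 8 * i).to_bytes(4, 'little') for i in range(8))
assert select(line) == (4, 1)          # 4-byte base, 1-byte deltas wins
```

On the Slide 13 example line, only the 4-byte-base units succeed, and the 1-byte-delta variant wins with a 12-byte compressed line.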
Slide 21: BΔI Compression Unit: 8-byte B0, 1-byte Δ
Take the 32-byte uncompressed cache line as four 8-byte elements V0, V1, V2, V3. Set B0 = V0 and compute Δi = Vi - B0 for every element in parallel.
Is every Δ within 1-byte range? If yes, emit the compressed line B0, Δ0, Δ1, Δ2, Δ3; if no, this unit reports the line as uncompressible.
Slide 22: BΔI Cache Organization
Conventional: a 2-way cache with 32-byte cache lines (tag storage: Tag0, Tag1 per set; data storage: 32-byte lines).
BΔI: a 4-way cache with 8-byte segmented data storage.
- Tag storage holds twice as many tags (Tag0-Tag3 per set), each extended with compression-encoding bits (C).
- Data storage is divided into 8-byte segments (S0-S7 per set); each tag maps to multiple adjacent segments.
2.3% overhead for a 2MB cache.
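The tag-to-segment mapping reduces to a ceiling division over the 8-byte segment size. A small sketch (the 12-byte figure comes from the Slide 13 example; `segments_for` is an illustrative helper, not from the paper):

```python
SEGMENT_BYTES = 8   # BΔI data storage is divided into 8-byte segments

def segments_for(compressed_bytes):
    """Number of adjacent 8-byte segments a compressed line occupies."""
    return -(-compressed_bytes // SEGMENT_BYTES)     # ceiling division

assert segments_for(12) == 2   # a 12-byte BΔI line uses 2 segments
assert segments_for(32) == 4   # an uncompressed 32-byte line uses all 4
```

Segmentation is what lets a set hold more lines than a conventional cache: compressed lines occupy fewer segments, freeing segments for the extra tags' data.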
Slide 23: Qualitative Comparison with Prior Work
Zero-based designs:
- ZCA [Dusser+, ICS'09]: zero-content augmented cache
- ZVC [Islam+, PACT'09]: zero-value cancelling
- Limited applicability (only zero values)
Frequent-value design:
- FVC [Yang+, MICRO'00]: frequent value compression; high decompression latency and complexity
Pattern-based compression designs:
- FPC [Alameldeen+, ISCA'04]: frequent pattern compression; high decompression latency (5 cycles) and complexity
- C-Pack [Chen+, T-VLSI Systems'10]: practical implementation of an FPC-like algorithm; high decompression latency (8 cycles)
Slide 24: Outline
Motivation & Background, Key Idea & Our Mechanism, Evaluation, Conclusion
Slide 25: Methodology
- Simulator: x86 event-driven simulator based on Simics [Magnusson+, Computer'02]
- Workloads: SPEC2006 benchmarks, TPC, Apache web server; 1-4 core simulations for 1 billion representative instructions
- System parameters: L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA'08]; 4GHz x86 in-order core; 512kB-16MB L2; simple memory model (300-cycle latency for row misses)
Slide 26: Compression Ratio: BΔI vs. Prior Work
SPEC2006, databases, web workloads, 2MB L2. BΔI achieves the highest compression ratio (1.53).
Slide 27: Single-Core: IPC and MPKI
(Chart annotations: IPC figures of 8.1%, 5.2%, 5.1%, 4.9%, 5.6%, 3.6% and MPKI figures of 16%, 24%, 21%, 13%, 19%, 14%.)
BΔI achieves the performance of a 2X-size cache. Performance improves due to the decrease in MPKI.
Slide 28: Multi-Core Workloads
Application classification based on:
- Compressibility (effective cache size increase): Low Compr. (LC) < 1.40, High Compr. (HC) >= 1.40
- Sensitivity (performance gain with more cache): Low Sens. (LS) < 1.10, High Sens. (HS) >= 1.10 (512kB -> 2MB)
Three classes of applications occur: LCLS, HCLS, HCHS (no LCHS applications).
For 2 cores: random mixes of each possible class pair (20 each, 120 total workloads).
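The classification thresholds above translate directly into a small helper (illustrative; `classify` is not from the paper):

```python
def classify(compressibility, sensitivity):
    """Label an application by the slide's thresholds:
    compressibility >= 1.40 is High Compr. (HC), else LC;
    sensitivity >= 1.10 is High Sens. (HS), else LS."""
    c = 'HC' if compressibility >= 1.40 else 'LC'
    s = 'HS' if sensitivity >= 1.10 else 'LS'
    return c + s

assert classify(1.53, 1.25) == 'HCHS'
assert classify(1.10, 1.05) == 'LCLS'
```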
Slide 29: Multi-Core: Weighted Speedup
BΔI's performance improvement is the highest (9.5%). If at least one application is sensitive, then performance improves.
Slide 30: Other Results in Paper
- IPC comparison against upper bounds: BΔI almost achieves the performance of the 2X-size cache
- Sensitivity study of having more than 2X tags: up to 1.98 average compression ratio
- Effect on bandwidth consumption: 2.31X decrease on average
- Detailed quantitative comparison with prior work
- Cost analysis of the proposed changes: 2.3% L2 cache area increase
Slide 31: Conclusion
- A new Base-Delta-Immediate compression mechanism
- Key insight: many cache lines can be efficiently represented using base + delta encoding
- Key properties: low-latency decompression, simple hardware implementation, high compression ratio with high coverage
- Improves cache hit ratio and performance of both single-core and multi-core workloads
- Outperforms state-of-the-art cache compression techniques: FVC and FPC
Slide 32: Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry
Phillip B. Gibbons*, Michael A. Kozuch*
Slide 33: Backup Slides
Slide 34: B+Δ: Compression Ratio
SPEC2006, databases, web workloads, 2MB L2 cache. Good average compression ratio (1.40), but some benchmarks have a low compression ratio.
Slide 35: Single-Core: Effect on Cache Capacity
(Chart annotations: 1.3%, 1.7%, 2.3%.) Fixed L2 cache latency. BΔI achieves performance close to the upper bound.
Slide 36: Multiprogrammed Workloads - I
Slide 37: Cache Compression Flow
(Flow diagram: the CPU and L1 data cache hold uncompressed data, the L2 cache holds compressed data, and memory holds uncompressed data. An L1 miss that hits in L2 decompresses the line; L1 writebacks into L2 compress it; an L2 miss fetches uncompressed data from memory and compresses it; L2 writebacks to memory decompress it.)
Slide 38: Example of Base+Delta Compression
Narrow values (taken from h264ref):
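The slide's h264ref data is cut off in this transcript. As an illustration, the narrow-value line from Slide 10 compresses the same way (assumed 4-byte words, base equal to the first element):

```python
# Narrow values: small integers stored in a big (4-byte) data type.
values = [0x00000000, 0x0000000B, 0x00000003, 0x00000004]
base = values[0]                    # here the base happens to be 0
deltas = [v - base for v in values]
assert all(0 <= d < 256 for d in deltas)   # every delta fits in 1 byte
# 4-byte base + four 1-byte deltas: 8 bytes instead of 16 uncompressed.
```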