Base-Delta-Immediate Compression:


Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch



Presentation Transcript

Slide1

Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches

Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons*, Michael A. Kozuch*

Slide2

Executive Summary

Off-chip memory latency is high
Large caches can help, but at significant cost
Compressing data in cache enables larger cache at low cost
Problem: Decompression is on the execution critical path
Goal: Design a new compression scheme that has (1) low decompression latency, (2) low cost, (3) high compression ratio
Observation: Many cache lines have low dynamic range data
Key Idea: Encode cache lines as a base + multiple differences
Solution: Base-Delta-Immediate compression with low decompression latency and high compression ratio
Outperforms three state-of-the-art compression mechanisms

Slide3

Motivation for Cache Compression

Significant redundancy in data: 0x00000000 0x0000000B 0x00000003 0x00000004
How can we exploit this redundancy?
Cache compression helps
Provides effect of a larger cache without making it physically larger

Slide4

Background on Cache Compression

Key requirements:
  Fast (low decompression latency)
  Simple (avoid complex hardware changes)
  Effective (good compression ratio)

[Figure: CPU, L1 Cache (uncompressed), L2 Cache (compressed); a hit in the compressed L2 passes through decompression to deliver uncompressed data]

Slide5

Shortcomings of Prior Work

[Table: Compression Mechanisms rated on Decompression Latency, Complexity, Compression Ratio. Row shown: Zero]

Slide6

Shortcomings of Prior Work

[Table: Compression Mechanisms rated on Decompression Latency, Complexity, Compression Ratio. Rows shown: Zero, Frequent Value]

Slide7

Shortcomings of Prior Work

[Table: Compression Mechanisms rated on Decompression Latency, Complexity, Compression Ratio. Rows shown: Zero, Frequent Value, Frequent Pattern]

Slide8

Shortcomings of Prior Work

[Table: Compression Mechanisms rated on Decompression Latency, Complexity, Compression Ratio. Rows shown: Zero, Frequent Value, Frequent Pattern, Our proposal: BΔI]

Slide9

Outline

Motivation & Background
Key Idea & Our Mechanism
Evaluation
Conclusion

Slide10

Key Data Patterns in Real Applications

Zero Values: initialization, sparse matrices, NULL pointers
  0x00000000 0x00000000 0x00000000 0x00000000 …

Repeated Values: common initial values, adjacent pixels
  0x000000FF 0x000000FF 0x000000FF 0x000000FF

Narrow Values: small values stored in a big data type
  0x00000000 0x0000000B 0x00000003 0x00000004

Other Patterns: pointers to the same memory region
  0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8

Slide11

How Common Are These Patterns?

[Figure: SPEC2006, databases, web workloads, 2MB L2 cache. “Other Patterns” include Narrow Values]

43% of the cache lines belong to key patterns

Slide12

Key Data Patterns in Real Applications

Zero Values: initialization, sparse matrices, NULL pointers
  0x00000000 0x00000000 0x00000000 0x00000000 …

Repeated Values: common initial values, adjacent pixels
  0x000000FF 0x000000FF 0x000000FF 0x000000FF

Narrow Values: small values stored in a big data type
  0x00000000 0x0000000B 0x00000003 0x00000004

Other Patterns: pointers to the same memory region
  0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8

Low Dynamic Range: Differences between values are significantly smaller than the values themselves
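The low-dynamic-range observation can be checked numerically on the pointer example from this slide. The following is a minimal sketch; the helper name `bytes_needed` is hypothetical and not part of the presentation.

```python
def bytes_needed(n):
    """Smallest number of whole bytes that can hold the non-negative integer n."""
    return max(1, (n.bit_length() + 7) // 8)

# Pointer values from the "Other Patterns" row of the slide.
line = [0xC04039C0, 0xC04039C8, 0xC04039D0, 0xC04039D8]

spread = max(line) - min(line)          # largest difference between any two values
value_bytes = bytes_needed(max(line))   # bytes needed to store a full value
delta_bytes = bytes_needed(spread)      # bytes needed to store a difference

print(f"spread={spread:#x}, value: {value_bytes} bytes, difference: {delta_bytes} byte(s)")
```

Each value needs four bytes, while any difference between two of them fits in one byte. That gap is what the encoding on the following slides exploits.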

Slide13

Key Idea: Base+Delta (B+Δ) Encoding

32-byte Uncompressed Cache Line (4-byte values): 0xC04039C0 0xC04039C8 0xC04039D0 … 0xC04039F8
12-byte Compressed Cache Line: Base 0xC04039C0 (4 bytes), then 1-byte deltas 0x00 0x08 0x10 … 0x38
20 bytes saved

Fast Decompression: vector addition
Simple Hardware: arithmetic and comparison
Effective: good compression ratio
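The arithmetic on this slide (a 32-byte line reduced to 12 bytes) can be reproduced with a short sketch. The function name `bdelta_compress` is hypothetical. Using the first value as the base and treating deltas as signed are assumptions made here for illustration.

```python
def bdelta_compress(values, base_size=4, delta_size=1):
    """Encode values as one base plus per-value deltas.

    Returns (base, deltas, compressed_bytes), or None if any delta
    does not fit in delta_size bytes (signed range, an assumption).
    """
    base = values[0]
    limit = 1 << (8 * delta_size - 1)          # signed range is [-limit, limit)
    deltas = [v - base for v in values]
    if not all(-limit <= d < limit for d in deltas):
        return None
    return base, deltas, base_size + delta_size * len(values)

# Eight 4-byte values, 0xC04039C0 through 0xC04039F8 in steps of 8 (32 bytes total).
line = [0xC04039C0 + 8 * i for i in range(8)]
base, deltas, size = bdelta_compress(line)
saved = 4 * len(line) - size
```

With these inputs the base is 0xC04039C0, the deltas run from 0x00 to 0x38, and the compressed size is 4 + 8 × 1 = 12 bytes, which matches the 20 bytes saved on the slide.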

Slide14

Can We Do Better?

Uncompressible cache line (with a single base): 0x00000000 0x09A40178 0x0000000B 0x09A4A838 …
Key idea: Use more bases, e.g., two instead of one
Pro: More cache lines can be compressed
Cons:
  Unclear how to find these bases efficiently
  Higher overhead (due to additional bases)

Slide15

B+Δ with Multiple Arbitrary Bases

→ 2 bases – the best option based on evaluations

Slide16

How to Find Two Bases Efficiently?

First base - first element in the cache line → Base+Delta part
Second base - implicit base of 0 → Immediate part
Advantages over 2 arbitrary bases:
  Better compression ratio
  Simpler compression logic

Base-Delta-Immediate (BΔI) Compression
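A minimal sketch of the two-base idea follows. Values that are small enough are stored relative to the implicit base 0 (the immediate part), and the rest are stored relative to one explicit base. The name `bdi_compress` is hypothetical. The explicit base here is the first value that does not fit as an immediate, which is an assumption made for this sketch. Signed 1-byte deltas are also an assumption.

```python
def bdi_compress(values, delta_size=1):
    """Encode each value as (uses_zero_base, delta), or return None if impossible."""
    limit = 1 << (8 * delta_size - 1)              # signed range is [-limit, limit)

    def fits(d):
        return -limit <= d < limit

    base = None
    encoded = []
    for v in values:
        if fits(v):                                # immediate: delta from implicit base 0
            encoded.append((True, v))
            continue
        if base is None:                           # assumption: first non-immediate value
            base = v
        if not fits(v - base):
            return None                            # not compressible with this delta size
        encoded.append((False, v - base))
    return base, encoded

def bdi_decompress(base, encoded):
    """Add each delta back to its base (0 for immediates)."""
    return [(0 if uses_zero else base) + d for uses_zero, d in encoded]

# Hypothetical line mixing narrow values and pointers taken from earlier slides.
line = [0x00000000, 0xC04039C0, 0x0000000B, 0xC04039C8]
base, encoded = bdi_compress(line)
```

A single-base scheme cannot encode this line with 1-byte deltas, because the narrow values and the pointers are far apart. With the implicit zero base, every element fits.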

Slide17

B+Δ (with two arbitrary bases) vs. BΔI

Average compression ratio is close, but BΔI is simpler

Slide18

BΔI Implementation

Decompressor Design: Low latency
Compressor Design: Low cost and complexity
BΔI Cache Organization: Modest complexity

Slide19

BΔI Decompressor Design

[Figure: the Compressed Cache Line holds B0, Δ0, Δ1, Δ2, Δ3. Each value of the Uncompressed Cache Line is produced by an adder: V0 = B0 + Δ0, V1 = B0 + Δ1, V2 = B0 + Δ2, V3 = B0 + Δ3]

Vector addition
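In software, the decompressor reduces to adding the base to every delta. The sketch below is illustrative only: the name `bdelta_decompress` and the `value_size` parameter are hypothetical, and hardware would perform the additions side by side rather than in a loop.

```python
def bdelta_decompress(base, deltas, value_size=4):
    """Rebuild V_i = B0 + delta_i, wrapped to value_size bytes."""
    mask = (1 << (8 * value_size)) - 1
    return [(base + d) & mask for d in deltas]

# Base and deltas shown on the Base+Delta encoding slide.
values = bdelta_decompress(0xC04039C0, [0x00, 0x08, 0x10, 0x38])
```

Because each output depends only on the base and its own delta, the additions are independent of one another.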

Slide20

BΔI Compressor Design

[Figure: the 32-byte Uncompressed Cache Line feeds eight compression units (CUs):
  8-byte B0, 1-byte Δ
  8-byte B0, 2-byte Δ
  8-byte B0, 4-byte Δ
  4-byte B0, 1-byte Δ
  4-byte B0, 2-byte Δ
  2-byte B0, 1-byte Δ
  Zero
  Rep. Values
Each CU outputs a CFlag & CCL (Compression Flag & Compressed Cache Line). The Compression Selection Logic (based on compr. size) outputs the final Compression Flag & Compressed Cache Line]
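The selection step can be sketched as: try every compression unit and keep the smallest result. This is a simplified illustration, not the presented hardware. The function names are hypothetical. It uses a single base per unit (no immediate part), little-endian values and signed deltas, and it assumes 1 byte for an all-zero line and 8 bytes for a repeated-value line.

```python
def split(line, size):
    """Split a byte string into little-endian unsigned integers of `size` bytes."""
    return [int.from_bytes(line[i:i + size], "little") for i in range(0, len(line), size)]

def select_encoding(line):
    """Return (name, size_in_bytes) of the smallest applicable encoding."""
    candidates = [("uncompressed", len(line))]
    if not any(line):
        candidates.append(("zero", 1))                     # assumed size
    words = split(line, 8)
    if all(w == words[0] for w in words):
        candidates.append(("rep_values", 8))               # assumed size
    # (base bytes, delta bytes) pairs listed on the slide.
    for base_size, delta_size in [(8, 1), (8, 2), (8, 4), (4, 1), (4, 2), (2, 1)]:
        values = split(line, base_size)
        limit = 1 << (8 * delta_size - 1)
        if all(-limit <= v - values[0] < limit for v in values):
            size = base_size + delta_size * len(values)
            candidates.append((f"base{base_size}_delta{delta_size}", size))
    return min(candidates, key=lambda c: c[1])

pointer_line = b"".join((0xC04039C0 + 8 * i).to_bytes(4, "little") for i in range(8))
zero_line = bytes(32)
repeated_line = (0xFF).to_bytes(8, "little") * 4
```

For the pointer line from the earlier slides, the 4-byte base with 1-byte deltas is selected at 12 bytes. The all-zero and repeated-value lines fall to their dedicated units.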

Slide21

BΔI Compression Unit: 8-byte B0 1-byte Δ

[Figure: the 32-byte Uncompressed Cache Line is split into four 8-byte values V0, V1, V2, V3. B0 = V0. Four subtractors compute Δ0 = V0 - B0, Δ1 = V1 - B0, Δ2 = V2 - B0, Δ3 = V3 - B0, and each Δ is checked: Within 1-byte range? Is every element within 1-byte range? Yes: output B0, Δ0, Δ1, Δ2, Δ3. No: this unit does not compress the line]
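A single compression unit can be sketched as below for the 8-byte base, 1-byte delta case. The function name `cu_base8_delta1`, the little-endian layout and the signed delta range are assumptions made for illustration.

```python
def cu_base8_delta1(line):
    """Return the 12-byte encoding (8-byte B0 + four 1-byte deltas), or None."""
    if len(line) != 32:
        raise ValueError("expected a 32-byte cache line")
    # Split into four 8-byte values V0..V3.
    values = [int.from_bytes(line[i:i + 8], "little") for i in range(0, 32, 8)]
    b0 = values[0]                                   # B0 = V0
    deltas = [v - b0 for v in values]                # delta_i = V_i - B0
    if not all(-128 <= d <= 127 for d in deltas):    # "within 1-byte range?"
        return None                                  # the "No" path
    # The "Yes" path: B0 followed by deltas stored as two's-complement bytes.
    return b0.to_bytes(8, "little") + bytes(d & 0xFF for d in deltas)

# Hypothetical input: four 8-byte values that differ only in their low byte.
line = b"".join(v.to_bytes(8, "little") for v in (0x1000, 0x1004, 0x0FFF, 0x1008))
compressed = cu_base8_delta1(line)
```

The output is 12 bytes for a 32-byte input. The third value is below the base, so its delta is stored as 0xFF.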

Slide22

BΔI Cache Organization

Conventional 2-way cache with 32-byte cache lines
  Tag Storage: Set0, Set1, …; Way0, Way1 (Tag0, Tag1)
  Data Storage: Set0, Set1, …; Way0, Way1 (Data0, Data1, 32 bytes each)

BΔI: 4-way cache with 8-byte segmented data
  Tag Storage: Set0, Set1, …; Way0, Way1, Way2, Way3 (Tag0, Tag1, Tag2, Tag3), each with C (Compr. encoding bits)
  Data Storage: Set0, Set1, …; segments S0, S1, S2, S3, S4, S5, S6, S7 (8 bytes each)

Twice as many tags
Tags map to multiple adjacent segments
2.3% overhead for 2 MB cache
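The segment arithmetic implied by this organization can be sketched as follows. Rounding a compressed line up to whole 8-byte segments is an assumption made here, and the names are hypothetical.

```python
import math

SEGMENT_BYTES = 8        # segment size from the slide
SEGMENTS_PER_SET = 8     # S0..S7 in the slide's example set

def segments_needed(compressed_bytes):
    """Number of adjacent 8-byte segments a line of this size occupies."""
    return math.ceil(compressed_bytes / SEGMENT_BYTES)

def fits_in_set(compressed_sizes):
    """True if all lines fit in one set's data segments (ignores the tag limit)."""
    return sum(segments_needed(s) for s in compressed_sizes) <= SEGMENTS_PER_SET
```

Under this assumption, four 12-byte compressed lines need two segments each and fill the eight segments of a set, so all four tags are used. Two uncompressed 32-byte lines fill the same space.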

Slide23

Qualitative Comparison with Prior Work

Zero-based designs
  ZCA [Dusser+, ICS’09]: zero-content augmented cache
  ZVC [Islam+, PACT’09]: zero-value cancelling
  Limited applicability (only zero values)
FVC [Yang+, MICRO’00]: frequent value compression
  High decompression latency and complexity
Pattern-based compression designs
  FPC [Alameldeen+, ISCA’04]: frequent pattern compression
    High decompression latency (5 cycles) and complexity
  C-pack [Chen+, T-VLSI Systems’10]: practical implementation of FPC-like algorithm
    High decompression latency (8 cycles)

Slide24

Outline

Motivation & Background
Key Idea & Our Mechanism
Evaluation
Conclusion

Slide25

Methodology

Simulator
  x86 event-driven simulator based on Simics [Magnusson+, Computer’02]
Workloads
  SPEC2006 benchmarks, TPC, Apache web server
  1 – 4 core simulations for 1 billion representative instructions
System Parameters
  L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA’08]
  4GHz, x86 in-order core, 512kB - 16MB L2, simple memory model (300-cycle latency for row-misses)

Slide26

Compression Ratio: BΔI vs. Prior Work

[Figure: SPEC2006, databases, web workloads, 2MB L2; chart label: 1.53]

BΔI achieves the highest compression ratio

Slide27

Single-Core: IPC and MPKI

[Figure: chart labels 8.1%, 5.2%, 5.1%, 4.9%, 5.6%, 3.6% and 16%, 24%, 21%, 13%, 19%, 14%]

BΔI achieves the performance of a 2X-size cache
Performance improves due to the decrease in MPKI

Slide28

Multi-Core Workloads

Application classification based on
  Compressibility: effective cache size increase (Low Compr. (LC) < 1.40, High Compr. (HC) >= 1.40)
  Sensitivity: performance gain with more cache (Low Sens. (LS) < 1.10, High Sens. (HS) >= 1.10; 512kB -> 2MB)
Three classes of applications: LCLS, HCLS, HCHS; no LCHS applications
For 2-core: random mixes of each possible class pair (20 each, 120 total workloads)
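The classification thresholds and the workload count can be written out as a short sketch. The function name `classify` and its argument names are hypothetical. Counting unordered class pairs, including a class paired with itself, is an assumption that reproduces the stated total.

```python
from itertools import combinations_with_replacement

def classify(compression_ratio, gain_with_more_cache):
    """Label an application using the thresholds on the slide."""
    compr = "HC" if compression_ratio >= 1.40 else "LC"
    sens = "HS" if gain_with_more_cache >= 1.10 else "LS"
    return compr + sens

# The three classes observed in the study (no LCHS applications).
classes = ["LCLS", "HCLS", "HCHS"]

# Unordered pairs of classes, a class may pair with itself.
pairs = list(combinations_with_replacement(classes, 2))
total_workloads = 20 * len(pairs)
```

Three classes give six possible pairs, and 20 random mixes per pair yields the 120 two-core workloads mentioned above.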

Slide29

Multi-Core: Weighted Speedup

BΔI performance improvement is the highest (9.5%)
If at least one application is sensitive, then the performance improves

Slide30

Other Results in Paper

IPC comparison against upper bounds
  BΔI almost achieves performance of the 2X-size cache
Sensitivity study of having more than 2X tags
  Up to 1.98 average compression ratio
Effect on bandwidth consumption
  2.31X decrease on average
Detailed quantitative comparison with prior work
Cost analysis of the proposed changes
  2.3% L2 cache area increase

Slide31

Conclusion

A new Base-Delta-Immediate compression mechanism
Key insight: many cache lines can be efficiently represented using base + delta encoding
Key properties:
  Low latency decompression
  Simple hardware implementation
  High compression ratio with high coverage
Improves cache hit ratio and performance of both single-core and multi-core workloads
Outperforms state-of-the-art cache compression techniques: FVC and FPC

Slide32

Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches

Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons*, Michael A. Kozuch*

Slide33

Backup Slides

Slide34

B+Δ: Compression Ratio

[Figure: SPEC2006, databases, web workloads, L2 2MB cache]

Good average compression ratio (1.40)
But some benchmarks have low compression ratio

Slide35

Single-Core: Effect on Cache Capacity

[Figure: chart labels 1.3%, 1.7%, 2.3%; fixed L2 cache latency]

BΔI achieves performance close to the upper bound

Slide36

Multiprogrammed Workloads - I

Slide37

Cache Compression Flow

[Figure: CPU, L1 Data Cache (Uncompressed), L2 Cache (Compressed), Memory (Uncompressed). Edge labels: Hit L1, Miss, Hit L2, Miss, Decompress, Compress, Writeback]

Slide38

Example of Base+Delta Compression

Narrow values (taken from h264ref):