Presentation Transcript

Slide1

Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches

Gennady Pekhimenko§, Vivek Seshadri§, Onur Mutlu§, Michael A. Kozuch†, Phillip B. Gibbons†, Todd C. Mowry§

§Carnegie Mellon University   †Intel Labs Pittsburgh

Published at PACT 2012

Presented by Marc-Philippe Bartholomä

Slide2

Problem & Goal

2

Slide3

Large Cache Improves Performance

Larger capacity ⇒ fewer misses ⇒ better performance

Larger capacity ⇒ fewer off-chip cache misses
Avoids the memory bandwidth bottleneck
Especially important for multi-core with shared memory
But increasing capacity by scaling the conventional design means:
Slower caches
More power consumption
More area required

3

Idea: Compress the data in caches to save on hardware costs

Slide4

Goals of Cache Compression

Compression/decompression need to be very fast
Decompression is on the critical path
Simple compression logic avoids large power and area costs
Must compress the data effectively
Otherwise there isn't much gain in capacity

4

Slide5

Background

5

Slide6

Data Patterns in Applications: Zeroes

6

0x00000000  0x00000000  0x00000000  0x00000000   (16-byte cache line)

Slide7

Data Pattern: Repeated Values

7

0xCAFE4A11  0xCAFE4A11  0xCAFE4A11  0xCAFE4A11

Slide8

Data Pattern: Narrow Values

8

0x000000CA  0x000000FE  0x0000004A  0x00000011

Values have more storage allocated than necessary

Slide9

Data Patterns are Frequent

9

Narrow Values are included in Other Patterns

43% of application cache lines can be compressed on average
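To make the patterns concrete, here is a minimal sketch (not part of the original slides) that classifies a cache line, viewed as 4-byte words, as all-zero, repeated, or narrow; the single-byte threshold for "narrow" is an illustrative assumption, since the real design considers several widths.

```python
import struct

def classify_line(line: bytes, word_size: int = 4) -> str:
    """Classify a cache line as 'zero', 'repeated', 'narrow', or 'other'."""
    words = [int.from_bytes(line[i:i + word_size], "little")
             for i in range(0, len(line), word_size)]
    if all(w == 0 for w in words):
        return "zero"
    if all(w == words[0] for w in words):
        return "repeated"
    if all(w <= 0xFF for w in words):          # illustrative 1-byte "narrow" threshold
        return "narrow"
    return "other"

# The 16-byte example lines from the slides above
zeros    = bytes(16)
repeated = struct.pack("<4I", *([0xCAFE4A11] * 4))
narrow   = struct.pack("<4I", 0xCA, 0xFE, 0x4A, 0x11)
print(classify_line(zeros), classify_line(repeated), classify_line(narrow))
```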

Slide10

Data Patterns: Low Dynamic Range

10

0x41000004  0x41000108  0x4100004C  0x41000130
0x000000CA  0x000000FE  0x0000004A  0x00000011
0xCAFE4A11  0xCAFE4A11  0xCAFE4A11  0xCAFE4A11
0x00000000  0x00000000  0x00000000  0x00000000

The values are larger than the difference between them

Slide11

Base+Delta Encoding

11

Uncompressed 16-byte cache lines:
0x41000004  0x41000108  0x4100004C  0x41000130
0x000000CA  0x000000FE  0x0000004A  0x00000011
0xCAFE4A11  0xCAFE4A11  0xCAFE4A11  0xCAFE4A11
0x00000000  0x00000000  0x00000000  0x00000000

Base+Delta encoded (base followed by one delta per value):
0x41000004  +0x0   +0x104  +0x48  +0x12C
0x00000011  +0xB9  +0xED   +0x39  +0x0
0xCAFE4A11  +0x0   +0x0    +0x0   +0x0
0x00000000  +0x0   +0x0    +0x0   +0x0
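The encoding above can be sketched in a few lines of Python (an illustration, not the paper's hardware; the function name and the unsigned-delta simplification are assumptions). The first value of the line is taken as the base, and the line compresses only if every delta fits in the chosen delta width:

```python
def base_delta_compress(words, delta_bytes=1):
    """Encode equal-width words as (base, deltas) if every delta fits
    in `delta_bytes` bytes (treated as unsigned here for simplicity)."""
    base = words[0]
    limit = 1 << (8 * delta_bytes)
    deltas = [w - base for w in words]
    if all(0 <= d < limit for d in deltas):
        return base, deltas
    return None

line = [0x41000004, 0x41000108, 0x4100004C, 0x41000130]
base, deltas = base_delta_compress(line, delta_bytes=2)
print(hex(base), [hex(d) for d in deltas])   # 0x41000004 ['0x0', '0x104', '0x48', '0x12c']
```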

Slide12

Novelty

12

Slide13

Novelty

Compress at cache-line granularity
Previous approaches work on individual words
View the data patterns as Low Dynamic Range
Apply Base+Delta compression to caches
Instead of general-purpose compression
Instead of special-case handling for some patterns

13

Slide14

Key Approach and Ideas

14

Slide15

Base+Delta Compression

15

Fast decompression (vector addition)

Simple hardware (addition/subtraction and comparison)

Effectively compresses observed patterns

0xC04039C0 + 0x38 = 0xC04039F8
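Decompression is the reverse of the example above: add the base to every delta, which hardware can do as a single wide vector addition. A minimal sketch (the deltas beyond +0x38 are made up for illustration):

```python
def base_delta_decompress(base, deltas):
    """Reconstruct the original words: one addition per word (a vector add in hardware)."""
    return [base + d for d in deltas]

print([hex(w) for w in base_delta_decompress(0xC04039C0, [0x00, 0x08, 0x10, 0x38])])
# ['0xc04039c0', '0xc04039c8', '0xc04039d0', '0xc04039f8']
```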

Slide16

Room for Improvement

Multiple bases allow compression of more cache lines

Need to encode multiple bases in compressed line

16

Slide17

Finding Bases?

Example cache line (2-byte values):
0x0000  0xA478  0x000B  0x0001  0xA438  0x000A  0x000B  0xA438

The same values grouped around two candidate bases:
0x0000  0x000B  0x000B  0x0001  0x000A   (small values, near base 0x0000)
0xA438  0xA478  0xA438                   (large values, near base 0xA438)

Gets more difficult with more bases

Slide18

Base Delta Immediate (BΔI)

2 bases:
One base is always 0x00000000 ⇒ no need to store it
One base is arbitrary
Values relative to the zero base are the "immediates"
Slightly better than Base+Delta with 2 arbitrary bases
Which in turn compresses better than Base+Delta with any other number of bases

18

Slide19

Mechanism

19

Slide20

Finding base for

BΔI

20

Cache line (2-byte values):
0x0000  0x000B  0xA438  0x0001  0xA470  0x000A  0x000B  0xA478

Compressed with the zero base (values that do not fit are left uncompressed):
+0x00  +0x0B  0xA438  +0x01  0xA470  +0x0A  +0x0B  0xA478

After adding the non-trivial base 0xA438 and compressing the rest against it:
+0x00  +0x0B  +0x00  +0x01  +0x38  +0x0A  +0x0B  +0x40

Try compression with the zero base
Choose the first non-compressible value as the non-trivial base
Compress the rest relative to that base
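The base-selection steps above can be sketched as follows (a simplification of the paper's mechanism; the function name and the 1-byte unsigned-delta check are assumptions):

```python
def find_bdi_bases(words, delta_limit=0x100):
    """Pick the bases for BDI-style compression: the implicit base is always 0,
    and the first word whose zero-base delta does not fit becomes the second base."""
    base2 = None
    for w in words:
        if w >= delta_limit:          # not compressible with the zero base
            base2 = w                 # first such value becomes the non-trivial base
            break
    return 0, base2

line = [0x0000, 0x000B, 0xA438, 0x0001, 0xA470, 0x000A, 0x000B, 0xA478]
print(tuple(hex(b) for b in find_bdi_bases(line)))   # ('0x0', '0xa438')
```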

Slide21

Attribute Deltas to Bases

21

Base: 0xA438
Deltas:  +0x00  +0x0B  +0x00  +0x01  +0x38  +0x0A  +0x0B  +0x40
Bitmap:   0      0      1      0      1      0      0      1
(bit = 1: delta is relative to 0xA438; bit = 0: delta is relative to the zero base)

Generate and save a bitmap
Note: Decompression becomes a masked vector addition
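A minimal sketch of the bitmap idea under the same assumptions as the sketches above: bit i records which base delta i belongs to, so decompression is a masked vector addition.

```python
def bdi_compress_two_bases(words, base2, delta_limit=0x100):
    """Encode each word against base 0 or `base2`; return (deltas, bitmap)."""
    deltas, bitmap = [], []
    for w in words:
        if w < delta_limit:               # fits against the zero base
            deltas.append(w)
            bitmap.append(0)
        else:                             # attributed to the second base
            deltas.append(w - base2)
            bitmap.append(1)
    return deltas, bitmap

def bdi_decompress_two_bases(deltas, bitmap, base2):
    """Masked vector addition: add base2 only where the bitmap bit is set."""
    return [d + (base2 if b else 0) for d, b in zip(deltas, bitmap)]

line = [0x0000, 0x000B, 0xA438, 0x0001, 0xA470, 0x000A, 0x000B, 0xA478]
deltas, bitmap = bdi_compress_two_bases(line, base2=0xA438)
assert bdi_decompress_two_bases(deltas, bitmap, 0xA438) == line
print(bitmap)   # [0, 0, 1, 0, 1, 0, 0, 1], as on the slide
```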

Slide22

Determining Base and Delta Sizes

22

[Figure: the same example data compressed with different combinations of base size (2 or 4 bytes) and delta size (1 or 2 bytes); the resulting compressed size depends on the chosen combination]

Slide23

Determining Base and Delta Sizes

23

[Chart: byte usage (8/16/24/32 bytes) of each base/delta combination, plus the free special cases (zero, repeated) and the uncompressed fallback]

Zero lines and repeated values are special cases
Everything is attempted in parallel and the shortest encoding is chosen
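The selection could look roughly like this (an illustration, not the paper's exact encoder): every base-size/delta-size combination plus the zero and repeated special cases is evaluated, and the smallest result wins. The size accounting is simplified (it ignores the encoding bits and the bitmap).

```python
from itertools import product

def candidate_size(line: bytes, base_size: int, delta_size: int):
    """Compressed size in bytes with one base of `base_size` bytes and one
    signed `delta_size`-byte delta per value, or None if it does not fit."""
    values = [int.from_bytes(line[i:i + base_size], "little")
              for i in range(0, len(line), base_size)]
    base = values[0]
    limit = 1 << (8 * delta_size - 1)
    if all(-limit <= v - base < limit for v in values):
        return base_size + delta_size * len(values)
    return None

def best_encoding(line: bytes):
    """Try the special cases and all base/delta combinations; keep the smallest."""
    if line == bytes(len(line)):
        return ("zero", 1)                          # illustrative 1-byte token
    if line == line[:8] * (len(line) // 8):
        return ("repeated", 8)
    best = ("uncompressed", len(line))
    for b, d in product((8, 4, 2), (1, 2, 4)):      # base and delta sizes in bytes
        if d < b:
            size = candidate_size(line, b, d)
            if size is not None and size < best[1]:
                best = (f"base{b}-delta{d}", size)
    return best

print(best_encoding(bytes(32)))                     # ('zero', 1)
```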

Slide24

Changes in Cache Organization

Double the number of cache tags
Add encoding bits for the compression cases and a bitmask for base attribution
Segment the cache lines and add segment pointers to the tags (a small sketch of the segment accounting follows below)

24
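To illustrate the segmentation, a tiny sketch of the segment accounting; the 8-byte segment size and the helper itself are assumptions for illustration only.

```python
def segments_needed(compressed_size_bytes: int, segment_size: int = 8) -> int:
    """Number of fixed-size data segments a compressed line occupies;
    the tag then stores a pointer to the line's first segment."""
    return -(-compressed_size_bytes // segment_size)   # ceiling division

print(segments_needed(12))   # a 12-byte compressed line occupies 2 segments
```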

Slide25

Key Results:

Methodology and Evaluation

25

Slide26

Methodology

x86-based Simulation

1-4 cores

SPEC2006, TPC-H and Apache web server workloads

L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA’08]

26

Slide27

BΔI vs Baseline ⇒ Capacity Nearly Doubled

27

[Charts: Instructions per Cycle and Misses per Kilo Instruction, BΔI vs. baseline]

Slide28

BΔI vs Other Approaches ⇒ Best Compression Ratio

28

ZCA (Zero-Content Augmented cache): exploits only zeroes

FVC (Frequent Value Compression): zeroes and common words

FPC (Frequent Pattern Compression): patterns including repeated values and narrow values

Slide29

Multi-Core Benefits Even More

29

LC/HC: low/high compressibility; LS/HS: low/high cache-size sensitivity
(uses 2 cores, 2MB L2 cache)
The LCHS category is missing because it was absent from the sample workloads

Slide30

Summary

30

Slide31

Summary

Goal: Increase cache capacity using data compression at lower cost
Key Insight: A significant fraction (43%) of real-world cache lines can be compressed
Key Mechanism: Base+Delta encoding fits well to exploit low-dynamic-range patterns
Key Results:
BΔI yields nearly the performance gain of a cache with double the capacity, without the same costs in area and power
5.1% avg. performance increase on single-core over baseline
9.5% avg. performance increase on dual-core over baseline

31

Slide32

Strengths

32

Slide33

Strengths of the Paper

Novel approach leading to significant improvement

Thorough analysis and evaluation of patterns, previous approaches and variants

Elegant solution and principled design

Easy-to-understand and well-structured paper

Transparent to the OS and applications

The compression mechanism is predictable for the user

33

Slide34

Weaknesses

34

Slide35

Weaknesses/Limitations of the Paper

Requires double the number of cache tags
Potential bottleneck
Adds the possibility of an eviction on a write that hits in the cache
Because the real capacity is unknown, it is harder to optimize applications
Missing category for the multi-core workloads
Analysis of cache size only for Base+Delta (no bitmap)
The compressed data patterns don't capture floating-point values
Too much latency for the L1 cache

35

Slide36

Thoughts and Ideas

36

Slide37

Extensions

Special case with only the zero base
The base-finding approach generalizes to 2 arbitrary bases
Analyze the benefit of switching between BΔI and Base+Delta with 1 or 2 bases
To save on cache tags, you could load 2 contiguous cache lines
Include the base bitmap in the deltas
For repeated values of size up to 4 bytes, you could store them using the bitmask for base attribution

37

Slide38

Takeaways

38

Slide39

Key Takeaways

Paper is a prime example of principled design

Carefully examines the potential

Thoroughly analyzes the tradeoffs

Picks the best variant

Data compression is viable for on-chip caches

39

Slide40

Questions

40

Slide41

Discussion

Cache Replacement Policy

The paper uses a slightly modified LRU policy and leaves a detailed study for future work

For uncompressed caches, a theoretically optimal cache replacement policy (adapted from Computer Systems 2018):
For eviction: choose the entry that will not be referenced again for the longest period of time (a small sketch follows below).

Is the shown CRP also optimal for caches with compression? Why or why not?
What aspects need to be considered to adapt it?
Ideas about what an actual cache replacement policy should do?

41
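For reference, a minimal sketch of that optimal (Belady/MIN) eviction rule over an access trace; it requires full knowledge of future accesses, and a compressed cache would additionally have to weigh how many segments each candidate frees. Names are illustrative.

```python
def belady_evict(cache, future_accesses):
    """Return the cached entry whose next use lies farthest in the future
    (or one that is never referenced again)."""
    def next_use(entry):
        try:
            return future_accesses.index(entry)
        except ValueError:
            return float("inf")          # never referenced again
    return max(cache, key=next_use)

print(belady_evict({"A", "B", "C"}, ["B", "A", "B", "C"]))   # 'C'
```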

Slide42

Discussion

Patterns in floating point values? Exploitable with BΔI?

42

Systems Programming and Computer Architecture 2017

[Figure: IEEE 754 single-precision layout: 1 sign bit, 8 exponent bits, 23 mantissa bits]
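To explore that question, a small sketch (an illustration, not from the slides) that splits a 32-bit float into the sign, exponent, and mantissa fields of the layout shown above:

```python
import struct

def float_fields(x: float):
    """Return (sign, exponent, mantissa) of a value stored as an IEEE 754 single."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & ((1 << 23) - 1)
    return sign, exponent, mantissa

print(float_fields(1.5))    # (0, 127, 4194304)
```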

Slide43

Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches

Gennady Pekhimenko§, Vivek Seshadri§, Onur Mutlu§, Michael A. Kozuch†, Phillip B. Gibbons†, Todd C. Mowry§

§Carnegie Mellon University   †Intel Labs Pittsburgh

Published at PACT 2012

Presented by Marc-Philippe Bartholomä

Slide44

Backup Slides

44

Slide45

Quantitative Comparison with Prior Work

45

Slide46

Different Number of Bases in Base+Delta

46

Slide47

Compression Mechanism: Cases

47

Slide48

Compression Mechanism: Single Case

48

Slide49

Decompression Mechanism

49

Slide50

Cache Size Analysis

50

Slide51

Multicore Workload Categories

51

Slide52

Quad-Core Results

52

Slide53

Performance “Basically” Doubles

53

Slide54

Number of Cache Tags

54

Slide55

Effect on Bandwidth

55

Slide56

Performance Comparison with Prior Work

56