Uploaded On 2016-03-07


Presentation Transcript

Slide1

An Optimized AMPM-based Prefetcher Coupled with Configurable Cache Line Sizing

Qi Jia, Maulik Bakulbhai Padia, Kashyap Amboju and Huiyang Zhou

Department of Electrical and Computer Engineering

North Carolina State University

Slide2

Presentation Outline

Access Map Pattern Matching (AMPM) Prefetcher

Problems with AMPM

Cold zone

Inaccurate states within zones

Proposed Optimizations

Configurable Block Sizing (CBS)

Two-Level Prefetching

Hardware Overhead

Experimental Results

Conclusion

Slide3

AMPM

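[Figure: AMPM example. A memory access map covers cache lines 0xAAFF through 0xABFF; Access 1, Access 2 and Access 3 mark earlier accesses around the current access, and a neighbouring line is marked Prefetch. State diagram: Init/0, Prefetch/1 and Access/2, with transitions labeled Access, Access and Prefetch.]

Slide4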

Problems with AMPM

Cold Zone

No pattern is detected before the zone's bitmap is evicted from the zone table.
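
To make the cold zone problem concrete, the sketch below applies the usual AMPM stride test to one zone's access map: offset t + k becomes a prefetch candidate only when offsets t - k and t - 2k have both been accessed. The 64-block zone, the maximum stride, and the name `match_candidates` are assumptions chosen for this illustration; they are not taken from the slides.

```python
INIT, PREFETCH, ACCESS = 0, 1, 2   # state encoding shown on the AMPM slide
ZONE_BLOCKS = 64                   # assumed number of cache lines per zone


def match_candidates(zone_map, t, max_stride=16):
    """Return offsets to prefetch after a demand access to offset t.

    Forward-only stride matching: offset t + k is a candidate when
    t - k and t - 2k are both ACCESS and t + k is still INIT.
    """
    candidates = []
    for k in range(1, max_stride + 1):
        lo, mid, target = t - 2 * k, t - k, t + k
        if lo < 0 or target >= len(zone_map):
            continue  # the stride would leave the zone
        if (zone_map[lo] == ACCESS and zone_map[mid] == ACCESS
                and zone_map[target] == INIT):
            candidates.append(target)
    return candidates


# Warm zone: offsets 0, 2 and 4 were accessed, so stride 2 matches at t = 4.
warm = [INIT] * ZONE_BLOCKS
for off in (0, 2, 4):
    warm[off] = ACCESS
warm_result = match_candidates(warm, 4)

# Cold zone: only two scattered accesses arrive before the map is evicted.
cold = [INIT] * ZONE_BLOCKS
for off in (3, 40):
    cold[off] = ACCESS
cold_result = match_candidates(cold, 40)
```

With only one or two accesses recorded, no stride can satisfy the two-history-point test, so the cold zone produces no candidates at all.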

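[Figure: access map states for blocks 0x480, 0x4c0, 0x500, 0x540, 0x580, ..., 0x9c0, 0xa00, 0xa40. At the last access before zone eviction only a few scattered blocks are in state 2 (Access); after the eviction every state is 0 again. No pattern detected.]

Slide5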

Problems with AMPM Cont.

Inaccurate States in Zone

The bits in the zone bitmaps cannot reflect the actual cache state (e.g., block evictions).

[Figure: access map states for blocks 0x480 through 0xa40. The bitmap marks a block as “Access” (state 2), but that block was evicted from the cache earlier.]

AMPM cannot prefetch this block, since it treats the block as accessed and assumes it remains in the cache. The prefetch opportunity is lost.

Slide6

Proposed Optimizations

Common Offset Table (COT)

Record the most frequently accessed offsets across different pages

Update on every demand access

Only initiate prefetches from the COT when the COT achieves high accuracy
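
The three rules above can be sketched roughly as follows. This is a toy model, not the authors' design: the flat per-offset counter, the "shadow prediction" accuracy estimate, and the parameters `top_n` and `min_accuracy` are all assumptions made for illustration (the slides describe an 8-entry table with 6-bit counters but do not spell out how accuracy is measured).

```python
from collections import Counter


class CommonOffsetTable:
    """Toy COT: counts how often each page offset is demand-accessed."""

    def __init__(self, top_n=2, min_accuracy=0.5):
        self.counts = Counter()   # offset -> number of demand accesses
        self.predicted = {}       # page -> offsets predicted at first touch
        self.num_predicted = 0
        self.num_correct = 0
        self.top_n = top_n                # hypothetical: offsets guessed per page
        self.min_accuracy = min_accuracy  # hypothetical accuracy gate

    def accuracy(self):
        if self.num_predicted == 0:
            return 0.0
        return self.num_correct / self.num_predicted

    def on_demand_access(self, page, offset):
        """Update the table; return offsets to prefetch (possibly empty)."""
        prefetches = []
        if page not in self.predicted:
            # First touch of a page: guess the currently most common offsets.
            common = self.counts.most_common(self.top_n)
            guess = {off for off, _ in common} - {offset}
            # Gate on the accuracy of earlier guesses, not this one.
            if self.accuracy() >= self.min_accuracy:
                prefetches = sorted(guess)
            self.predicted[page] = guess
            self.num_predicted += len(guess)
        elif offset in self.predicted[page]:
            # An earlier guess for this page turned out to be right.
            self.predicted[page].discard(offset)
            self.num_correct += 1
        self.counts[offset] += 1  # updated on every demand access
        return prefetches


# Pages 0, 1 and 2 each touch offset 1 and then offset 5.
cot = CommonOffsetTable()
history = []
for page in range(3):
    history.append(cot.on_demand_access(page, 1))
    cot.on_demand_access(page, 5)
```

In this run the table stays silent for the first two pages while it learns and checks its guesses, and only the first touch of page 2 returns a prefetch for offset 5.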

[Figure: the access maps of page 1 and page 2 feed the Common Offset Table, whose entries hold Pref, Counter, Offset and LRU fields.]

Slide7

Proposed Optimizations Cont.

Conflict Table

Records how inaccurate the current access map information is

Each entry in the table corresponds to one page

The entry counter is incremented when an inaccuracy is detected

The entry counter is reset when the page is evicted
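
A minimal sketch of this bookkeeping is shown below. It assumes that an "inaccuracy" means a cache miss on a block the access map still marks as accessed, and it uses a hypothetical `threshold` to decide when a page's map is no longer trusted; the slides do not state how the counter value is consumed. The 6-bit counter width comes from the hardware overhead slide.

```python
ACCESS = 2  # access map state for "accessed", as on the AMPM slide


class ConflictTable:
    """Toy model: one saturating counter per tracked page."""

    def __init__(self, counter_bits=6, threshold=4):
        self.max_count = (1 << counter_bits) - 1  # saturate at 63 for 6 bits
        self.threshold = threshold                # hypothetical cut-off
        self.counters = {}                        # page -> conflict count

    def on_cache_miss(self, page, offset, access_map):
        """Count a conflict when the map says ACCESS but the cache missed."""
        if access_map[offset] == ACCESS:
            count = self.counters.get(page, 0)
            self.counters[page] = min(count + 1, self.max_count)

    def on_page_evicted(self, page):
        """Reset the counter when the page leaves the access map table."""
        self.counters.pop(page, None)

    def is_unreliable(self, page):
        return self.counters.get(page, 0) >= self.threshold


table = ConflictTable(threshold=2)
page_map = [ACCESS] * 4 + [0] * 60       # offsets 0..3 marked as accessed
table.on_cache_miss(7, 1, page_map)      # stale entry: conflict
table.on_cache_miss(7, 2, page_map)      # stale entry: conflict
table.on_cache_miss(7, 10, page_map)     # block never marked: no conflict
count_before = table.counters[7]
flag_before = table.is_unreliable(7)
table.on_page_evicted(7)
flag_after = table.is_unreliable(7)
```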

[Figure: a cache miss on an access map page updates the Conflict Table; the counter of the corresponding entry goes from 17 to 18.]

Slide8

Configurable Cache Line Sizing

A block size monitor is used to select the best block size for the LLC.

Block size selection algorithm (considers bandwidth and performance):

Score = hit - A * (access - hit) * block_size

The selected block size is then used to guide the LLC prefetch.

Slide9

Two-Level Prefetching

Specific to the DPC2 framework.

Change the state “Prefetch” in the access map to “L2 Prefetch” and “LLC Prefetch”.

Our main goal is to hide the long main memory latency first, and then to hide the LLC latency.

During prefetch candidate selection, we first choose the blocks that have not been prefetched. If such candidates do not fill up the prefetch degree, we choose blocks in the “LLC Prefetch” state and transfer them into the L2 cache.

Slide10

Hardware Overhead

Components | Storage | Entries | Size
Memory Access Map Table | Address Tag (64 b), LRU (6 b), Access Map (3*64 b) | 64 entries | 2.047 KB
CBS monitor | ATD | 4 ATD | 2.872 KB
Common Offset Table | Counter (6 b), LRU status (6 b), Offset Map (64*6 b + 64*1 b) | 8 entries | 0.45 KB
Conflict Table | Counter (6 b) | 64 entries | 0.046 KB
Prefetch Bit | Prefetch (1 b) | 4096 blocks | 0.5 KB
Cold Zone MSHR | Tags (64 b), LRU status (5 b) | 32 entries | 0.27 KB
Total | | | 6.185 KB

Slide11

Experimental Results

The optimized prefetcher outperforms the baseline without prefetching by 10.8%. Compared with the original AMPM, it achieves a speedup of 0.76% on average.

Slide12

Conclusions

We optimize the AMPM prefetcher by introducing two hardware components: the common offset table and the conflict table.

We combine the AMPM prefetcher with configurable block sizing and a two-level prefetching mechanism.

Slide13

Question