Slide1
An Optimized AMPM-based Prefetcher Coupled with Configurable Cache Line Sizing
Qi Jia, Maulik Bakulbhai Padia, Kashyap Amboju and Huiyang Zhou
Department of Electrical and Computer Engineering
North Carolina State University
Slide2
Presentation Outline
Access Map Pattern Matching (AMPM) Prefetcher
Problems with AMPM
  Cold zone
  Inaccurate states within zones
Proposed Optimizations
  Configurable Block Sizing (CBS)
  Two-Level Prefetching
Hardware Overhead
Experimental Results
Conclusion
Slide3
AMPM
[Figure: access map over consecutive cache lines (0xAAFF ... 0xABFF) marking Access 1, Access 2, Access 3, the current access and the resulting prefetch; state diagram with the states Init (0), Prefetch (1) and Access (2)]
Slide4
Problems with AMPM
Cold Zone
No pattern is detected before the zone bitmap is evicted from the zone table.
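The cold-zone effect can be illustrated with a minimal sketch of a zone table under LRU replacement. The table capacity, the zone size and the class name `ZoneTable` are illustrative assumptions, not values from the slides. When accesses are spread over more zones than the table can hold, each bitmap is evicted before it collects enough accesses to form a pattern.

```python
from collections import OrderedDict

BLOCKS_PER_ZONE = 64  # assumed zone size in cache blocks


class ZoneTable:
    """LRU-managed table of per-zone access bitmaps (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.zones = OrderedDict()  # zone id -> set of accessed block offsets
        self.evicted_sizes = []     # accesses held by each bitmap at eviction

    def access(self, block_addr):
        zone, offset = divmod(block_addr, BLOCKS_PER_ZONE)
        if zone in self.zones:
            self.zones.move_to_end(zone)          # mark as most recently used
        else:
            if len(self.zones) >= self.capacity:  # evict the LRU zone bitmap
                _, bitmap = self.zones.popitem(last=False)
                self.evicted_sizes.append(len(bitmap))
            self.zones[zone] = set()
        self.zones[zone].add(offset)
        return len(self.zones[zone])  # accesses recorded for this zone so far


# Round-robin over 4 zones with a 2-entry table: every revisit finds a cold zone.
table = ZoneTable(capacity=2)
history = [table.access(zone * BLOCKS_PER_ZONE + step)
           for step in range(3) for zone in range(4)]
```

In this run every bitmap is evicted while holding a single access, which is too few for any stride pattern to be matched.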
[Figure: zone bitmap over block addresses 0x480 through 0xa40 with a few scattered Access (2) states; after the last access before zone eviction, the bitmap contains only 0 states. No pattern detected.]
Slide5
Problems with AMPM Cont.
Inaccurate States in Zone
The bits in the zone bitmaps cannot reflect the actual cache state (for example, after block evictions).
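A minimal sketch of how a stale bitmap entry suppresses a prefetch is given here. The state encoding (Init = 0, Prefetch = 1, Access = 2) follows the slide figures. The matching rule, which treats offset t+k as a candidate when t-k and t-2k are both in the Access state, is the commonly described AMPM stride check and is an assumption here, as are the function name and the `max_stride` parameter.

```python
INIT, PREFETCH, ACCESS = 0, 1, 2  # state encoding used in the slide figures


def prefetch_candidates(access_map, t, max_stride=4):
    """Return offsets t+k that a simple AMPM-style stride match would prefetch.

    Assumed rule: t+k is a candidate when t-k and t-2k are both in the
    ACCESS state and t+k is still in the INIT state.
    """
    candidates = []
    for k in range(1, max_stride + 1):
        if t - 2 * k < 0 or t + k >= len(access_map):
            continue
        pattern = access_map[t - k] == ACCESS and access_map[t - 2 * k] == ACCESS
        if pattern and access_map[t + k] == INIT:
            candidates.append(t + k)
    return candidates


# Offsets 0..3 were accessed earlier; offset 3 has since been evicted from the
# cache, but the stale map still records it as ACCESS.
stale_map = [ACCESS, ACCESS, ACCESS, ACCESS, INIT, INIT]
accurate_map = [ACCESS, ACCESS, ACCESS, INIT, INIT, INIT]  # eviction reflected

stale_result = prefetch_candidates(stale_map, t=2)
accurate_result = prefetch_candidates(accurate_map, t=2)
```

With the stale map no candidate is produced for offset 3, while the map that reflects the eviction yields it as a prefetch candidate.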
[Figure: zone bitmap over block addresses 0x480 through 0xa40; one block is marked Access (2) in the bitmap although it was evicted previously]
The bitmap indicates "Access", but the block was evicted previously.
AMPM cannot prefetch the block because it treats it as accessed and assumes it remains in the cache. Prefetch chance lost!
Slide6
Proposed Optimizations
Common Offset Table (COT)
Record the most frequently accessed offsets across different pages
Update on every demand access
Only initiate prefetches from the COT when the COT achieves high accuracy
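The three points above can be sketched with a simplified model. Everything beyond those points is an assumption made for illustration: the class and method names, the way accuracy is measured (the fraction of demand accesses that fall on a currently common offset), the `top_n` parameter and the 0.75 threshold. Only the 64 offsets per page and the 6-bit counter width are taken from the hardware overhead slide.

```python
OFFSETS_PER_PAGE = 64   # 64 blocks per page, matching the 64-entry offset map
COUNTER_MAX = 63        # 6-bit saturating counter


class CommonOffsetTable:
    """Simplified, hypothetical model of a common offset table."""

    def __init__(self, top_n=4, accuracy_threshold=0.75):
        self.counts = [0] * OFFSETS_PER_PAGE  # how often each offset is demanded
        self.top_n = top_n                    # assumed number of offsets to suggest
        self.accuracy_threshold = accuracy_threshold  # assumed confidence cutoff
        self.predictions = 0
        self.correct = 0

    def common_offsets(self):
        """Offsets with the highest non-zero counts, most frequent first."""
        ranked = sorted(range(OFFSETS_PER_PAGE), key=lambda o: -self.counts[o])
        return [o for o in ranked[:self.top_n] if self.counts[o] > 0]

    def on_demand_access(self, offset):
        """Update on every demand access and score the current prediction."""
        common = self.common_offsets()
        if common:
            self.predictions += 1
            self.correct += offset in common
        self.counts[offset] = min(self.counts[offset] + 1, COUNTER_MAX)

    def accuracy(self):
        return self.correct / self.predictions if self.predictions else 0.0

    def prefetch_offsets(self):
        """Suggest offsets for a cold page only when accuracy is high enough."""
        if self.accuracy() >= self.accuracy_threshold:
            return self.common_offsets()
        return []


cot = CommonOffsetTable()
for page in range(10):            # ten pages that all touch offsets 0, 8, 16
    for offset in (0, 8, 16):
        cot.on_demand_access(offset)
suggested = cot.prefetch_offsets()
```

A freshly created table returns no offsets, so a page that enters the access map table is only prefetched from the COT once the shared offsets have proven reliable.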
[Figure: the access maps of page 1 and page 2 feed the Common Offset Table; each COT entry holds a prefetch counter, an offset map and LRU status]
Slide7
Proposed Optimizations Cont.
Conflict Table
Record how inaccurate the current access map information is
Each entry in the table corresponds to one page
The entry counter is incremented when an inaccuracy is detected
The entry counter is reset when the page is evicted
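A minimal sketch of this bookkeeping follows. The slides do not spell out how an inaccuracy is detected, so the rule used here (a cache miss on a block whose map state is not Init) is an assumption, as are the class and method names. The 6-bit counter width comes from the hardware overhead slide.

```python
INIT, PREFETCH, ACCESS = 0, 1, 2
COUNTER_MAX = 63  # 6-bit counter, as listed in the hardware overhead table


class ConflictTable:
    """Hypothetical model: one inaccuracy counter per tracked page."""

    def __init__(self):
        self.counters = {}  # page id -> inaccuracy count

    def on_cache_miss(self, page, access_map, offset):
        """Count a conflict when the map claims the missing block is cached.

        Assumption: a miss on a block whose state is ACCESS or PREFETCH means
        the block was evicted and the access map is stale for that block.
        """
        if access_map[offset] != INIT:
            current = self.counters.get(page, 0)
            self.counters[page] = min(current + 1, COUNTER_MAX)
        return self.counters.get(page, 0)

    def on_page_evicted(self, page):
        """Reset the counter when the page leaves the access map table."""
        self.counters.pop(page, None)


conflicts = ConflictTable()
page_map = [ACCESS, ACCESS, INIT, PREFETCH]
first = conflicts.on_cache_miss(page=7, access_map=page_map, offset=0)   # stale
second = conflicts.on_cache_miss(page=7, access_map=page_map, offset=2)  # normal miss
third = conflicts.on_cache_miss(page=7, access_map=page_map, offset=3)   # stale
conflicts.on_page_evicted(7)
after_eviction = conflicts.counters.get(7, 0)
```

Only misses on blocks that the map believes are cached raise the counter, so a high value signals that the page's access map can no longer be trusted.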
[Figure: a cache miss updates the Conflict Table entry that belongs to the access map page; in the example, one counter goes from 17 to 18]
Slide8
Configurable Cache Line Sizing
A block size monitor is used to select the best block size for the LLC.
Block size selection algorithm (considers bandwidth and performance):
Score = hit - A * (access - hit) * block_size
The selected block size is used to guide LLC prefetching.
Slide9
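The block size score from the Configurable Cache Line Sizing slide, Score = hit - A * (access - hit) * block_size, can be worked through with a small sketch. The hit and access counts, the two values of the weight A and the byte unit for block_size are made-up assumptions. The slide gives only the formula, and the function names are hypothetical.

```python
def block_size_score(hits, accesses, block_size, a):
    """Score = hit - A * (access - hit) * block_size, as given on the slide."""
    misses = accesses - hits
    return hits - a * misses * block_size


def select_block_size(stats, a):
    """Pick the block size with the highest score.

    stats maps block size in bytes -> (hits, accesses), for example as counted
    by one monitor per candidate size.
    """
    return max(stats, key=lambda size: block_size_score(*stats[size], size, a))


# Made-up counts: larger blocks hit more often but each miss moves more data.
stats = {64: (700, 1000), 128: (800, 1000), 256: (820, 1000)}
best_low_penalty = select_block_size(stats, a=0.001)
best_high_penalty = select_block_size(stats, a=0.02)
```

With a small A the extra hits of 128-byte blocks win. With a larger A the bandwidth term dominates and the 64-byte block is selected, which is how the score trades performance against bandwidth.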
Two-Level Prefetching
Specific to the DPC2 framework.
Replace the state "Prefetch" in the access map with two states, "L2 Prefetch" and "LLC Prefetch".
Our main goal is to hide the long main memory latency first, and then to hide the LLC latency.
During prefetch candidate selection, we first choose blocks that have not been prefetched. If such candidates do not fill up the prefetch degree, we choose blocks in the "LLC Prefetch" state and transfer them into the L2 cache.
Slide10
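The candidate selection order described on the Two-Level Prefetching slide can be sketched as two passes over the pattern-matched candidates. The numeric state encoding, the function name and the action labels are assumptions for illustration. Only the ordering (blocks not yet prefetched first, then "LLC Prefetch" blocks moved into the L2) comes from the slide.

```python
INIT, ACCESS, L2_PREFETCH, LLC_PREFETCH = range(4)  # assumed state encoding


def select_prefetches(access_map, candidates, degree):
    """Choose up to `degree` prefetches from pattern-matched candidates.

    First pass: blocks that have not been prefetched at all (INIT).
    Second pass: blocks already in the LLC, promoted into the L2.
    """
    chosen = [(c, "fetch_from_memory") for c in candidates
              if access_map[c] == INIT][:degree]
    if len(chosen) < degree:
        promote = [(c, "promote_to_l2") for c in candidates
                   if access_map[c] == LLC_PREFETCH]
        chosen += promote[:degree - len(chosen)]
    return chosen


access_map = [ACCESS, ACCESS, LLC_PREFETCH, INIT, LLC_PREFETCH, L2_PREFETCH]
picked = select_prefetches(access_map, candidates=[2, 3, 4, 5], degree=2)
single = select_prefetches(access_map, candidates=[2, 3, 4, 5], degree=1)
```

Offset 3 is fetched first because it hides the longest latency. The remaining slot of the prefetch degree is then spent on promoting offset 2 from the LLC into the L2.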
Hardware Overhead
Components                Fields                                                          Entries       Storage
Memory Access Map Table   Address Tag (64 b), LRU (6 b), Access Map (3*64 b)              64 entries    2.047 KB
CBS monitor               ATD                                                             4 ATDs        2.872 KB
Common Offset Table       Counter (6 b), LRU status (6 b), Offset Map (64*6 b + 64*1 b)   8 entries     0.45 KB
Conflict Table            Counter (6 b)                                                   64 entries    0.046 KB
Prefetch Bit              Prefetch (1 b)                                                  4096 blocks   0.5 KB
Cold Zone MSHR            Tags (64 b), LRU status (5 b)                                   32 entries    0.27 KB
Total                                                                                                   6.185 KB
Slide11
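The storage figures on the Hardware Overhead slide can be cross-checked from the listed field widths and entry counts. The sketch assumes 1 KB = 1024 bytes and takes the CBS monitor size directly from the slide, because its field widths are not given. The helper name `storage_kb` is hypothetical.

```python
def storage_kb(bits_per_entry, entries):
    """Storage in KB (1 KB = 1024 bytes) for a table of fixed-width entries."""
    return bits_per_entry * entries / 8 / 1024


computed = {
    "Memory Access Map Table": storage_kb(64 + 6 + 3 * 64, 64),
    "Common Offset Table": storage_kb(6 + 6 + 64 * 6 + 64 * 1, 8),
    "Conflict Table": storage_kb(6, 64),
    "Prefetch Bit": storage_kb(1, 4096),
    "Cold Zone MSHR": storage_kb(64 + 5, 32),
}
CBS_MONITOR_KB = 2.872  # taken directly from the slide; field widths not given

total_kb = sum(computed.values()) + CBS_MONITOR_KB
```

Under these assumptions the per-component values agree with the slide to within rounding, and the total comes to about 6.18 KB, consistent with the reported 6.185 KB.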
Experimental Results
The optimized prefetcher outperforms the baseline without prefetching by 10.8%. Compared with the original AMPM, it achieves a speedup of 0.76% on average.
Slide12
Conclusions
We optimize the AMPM prefetcher by introducing two hardware components: the common offset table and the conflict table.
We combine the AMPM prefetcher with configurable block sizing and a two-level prefetching mechanism.
Slide13
Question