Slide 1
Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy
Snehasish Kumar, Hongzhou Zhao†, Arrvindh Shriraman, Eric Matthews∗, Sandhya Dwarkadas†, Lesley Shannon∗
School of Computing Sciences, Simon Fraser University
†Department of Computer Science, University of Rochester
∗School of Engineering Science, Simon Fraser University
2012 IEEE/ACM 45th Annual International Symposium on Microarchitecture
Slide 2: Overview
Problem being addressed
Prior work
Challenges encountered
Proposed solution
Results
Review of the paper
Slide 3: Problem Statement
Current caches are designed around a fixed block size
Block granularity is chosen for the average spatial locality across a general mix of workloads
Many common applications exhibit far lower spatial locality than this design point
Unused words occupy 17%–80% of a 64K L1 cache and 1%–79% of a 1MB private LLC
Unused-word transfers comprise 11% of on-chip cache hierarchy energy consumption.
Slide 4: Prior Work
Sector Cache:
Aims at minimizing bandwidth
Fetches sub-blocks on demand
Works well for applications with low to moderate spatial locality
Reduces mispredicted spatial prefetches thus reducing bandwidth usage and energy consumption
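The sub-block fetching described above can be sketched in a few lines of Python; this is a toy model, not the original hardware design, and `fetch_from_memory` is a hypothetical stand-in for the next level of the hierarchy.

```python
def fetch_from_memory(tag, sub):
    """Hypothetical next-level fetch; returns a placeholder word."""
    return (tag, sub)

class SectorLine:
    """Toy model of a sector cache line: one tag, a valid bit per sub-block."""

    def __init__(self, tag, n_subblocks):
        self.tag = tag
        self.valid = [False] * n_subblocks
        self.data = [None] * n_subblocks

    def access(self, sub):
        """Return (hit, data); on a miss, fetch only the requested sub-block."""
        if self.valid[sub]:
            return True, self.data[sub]
        self.data[sub] = fetch_from_memory(self.tag, sub)
        self.valid[sub] = True
        return False, self.data[sub]
```

Because only referenced sub-blocks are fetched, mispredicted spatial prefetches (and their bandwidth) are avoided, at the cost of extra misses when spatial locality is high.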
Slide 5: Prior Work
Slide 6: Challenges to Static Design
Small cache lines fetch fewer unused words
but impose significant performance penalties by forgoing spatial prefetching in applications with high spatial locality
Larger cache lines effectively prefetch neighboring words
but increase the number of unused words and the network bandwidth consumed
Determining a single fixed cache line granularity that is optimal at hardware design time is a challenge
Slide 7: Amoeba-Cache
A novel cache architecture that supports fine-grained (per-miss) dynamic adjustment of cache block size and the number of blocks per set
Filters out unused words in a block and prevents them from being inserted into the cache, allowing the resulting free space to hold other useful blocks
Adapts to the available spatial locality
Slide 8: Amoeba-Cache
Slide 9: Amoeba-Cache
How can the number of tags grow and shrink as the number of blocks per set varies with block granularity?
Eliminates the dedicated tag array
Tags and data are kept together in a single data array
A tag bitmap indicates which words in the data array are tags
Valid bits are also stored as a bitmap
Slide 10: Amoeba-Cache
Slide 11: Amoeba-Cache
Data lookup:
The tag bitmap activates only the words in the data array that hold tags for comparison
The minimum Amoeba block size is 2 words (1 tag + 1 data), so adjacent words cannot both be tags
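A minimal sketch of this lookup, assuming a simplified model in which each set is a flat word array and each tag word encodes a (tag, start, size) triple with its data words following contiguously; the names and encoding are illustrative, not taken from the paper's hardware.

```python
WORDS_PER_SET = 8  # toy set size in words

def lookup(set_words, tag_bitmap, addr_tag, offset):
    """Return the data word on a hit, or None on a miss."""
    for i in range(WORDS_PER_SET):
        if not (tag_bitmap >> i) & 1:
            continue  # the tag bitmap selects only tag words for comparison
        tag, start, size = set_words[i]
        if tag == addr_tag and start <= offset < start + size:
            # data words follow their tag contiguously in the data array
            return set_words[i + 1 + (offset - start)]
    return None
```

Since the minimum block is 2 words (a tag plus one data word), two adjacent words are never both tags, which bounds how many comparisons the bitmap can activate per set.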
Slide 12: Amoeba-Cache
Block Insertion:
The valid bitmap is used to find empty slots within the set
1 means allocated, 0 means empty
For an incoming block of m words, a run of m consecutive 0s is searched for
The replacement algorithm is triggered repeatedly until enough space is created
To reclaim space from an Amoeba block, the tag and valid bits in the bitmaps corresponding to that block are cleared
Uses an LRU policy to choose a way within the cache and randomly picks a candidate block from within the set for replacement
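The bitmap bookkeeping for insertion and reclamation might look like the following sketch, assuming the bitmaps are plain integers with bit i corresponding to word i of the set; the helper names are hypothetical.

```python
WORDS_PER_SET = 8  # toy set size in words

def find_free_run(valid_bitmap, m):
    """Find m consecutive 0 bits (empty words); return the start index or -1."""
    run = 0
    for i in range(WORDS_PER_SET):
        if (valid_bitmap >> i) & 1:
            run = 0          # word is allocated, restart the run
        else:
            run += 1
            if run == m:
                return i - m + 1
    return -1  # no room: the caller would trigger replacement and retry

def reclaim(valid_bitmap, tag_bitmap, start, size):
    """Evict a block spanning [start, start+size): clear its tag and valid bits."""
    mask = ((1 << size) - 1) << start
    return valid_bitmap & ~mask, tag_bitmap & ~(1 << start)
```

On eviction, `reclaim` frees the block's words, and `find_free_run` is retried until the incoming block's m words fit.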
Slide 13: Partial Misses
Low probability (5 in 1K accesses)
Identify the overlapping blocks
Evict them to the MSHR
Allocate space for the entire requested block
Issue the miss request
Copy the merged block into the cache
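At word granularity, the net effect of the steps above is that only words not already present need to be fetched; a toy sketch under that assumption (the helper is illustrative, not from the paper):

```python
def partial_miss_words(cached_ranges, req_start, req_size):
    """Words of the requested range that must still be fetched from the next
    level, given the (start, size) ranges of the overlapping cached
    sub-blocks that were evicted into the MSHR."""
    requested = set(range(req_start, req_start + req_size))
    present = set()
    for start, size in cached_ranges:
        present |= set(range(start, start + size)) & requested
    return sorted(requested - present)
```

On refill, the words held in the MSHR are merged with the fetched words to form the full block before it is copied into the newly allocated space.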
Slide 14: Results
Result 1:
Amoeba-Cache increases cache capacity by harvesting space from unused words and can achieve an 18% reduction in both L1 and L2 miss rate.
Result 2:
Amoeba-Cache adaptively sizes the cache block granularity and reduces L1 ↔ L2 bandwidth by 46% and L2 ↔ Memory bandwidth by 38%.
Slide 15: Results
Slide 16: Results
Result 3: Boosts performance by 10% on commercial applications while saving 11% of on-chip memory hierarchy energy. The off-chip L2 ↔ Memory interface sees a mean energy reduction of 41% across all workloads.
Slide 17: Review of the Paper
Connects proposed work with prior work
Builds up the proposed idea gradually with sufficient examples
Algorithms explained well with control flow diagrams
Lots of comparative graphs to support the results
The maximum region size (RMAX) is stated differently in the text (bytes) and in the diagram (words)
The rationale for using the metric 1/(MissRate × Bandwidth) to determine block granularity could have been supported better
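As a toy illustration of that metric: for each candidate granularity the predictor computes 1/(MissRate × Bandwidth) and picks the maximum, trading extra misses against extra traffic. The numbers below are invented for illustration, not measurements from the paper.

```python
# candidate granularity (words) -> (miss rate, bandwidth in words/access);
# these figures are made up for the example
candidates = {
    1: (0.20, 1.0),
    4: (0.07, 2.5),
    8: (0.06, 4.0),
}

def utility(miss_rate, bandwidth):
    """Higher is better: penalizes both extra misses and extra traffic."""
    return 1.0 / (miss_rate * bandwidth)

best = max(candidates, key=lambda g: utility(*candidates[g]))
```

Here the mid-size block wins: the 1-word block suffers too many misses, while the 8-word block moves too many unused words.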
Slide 18: Thank you!