Slide1
Snoop Filtering and Coarse-Grain Memory Tracking
Andreas Moshovos
Univ. of Toronto/ECE
Short Course at the University of Zaragoza, July 2009
Some slides by J. Zebchuk or the original paper authors
Slide2
JETTY
Snoop-Filtering for Reduced Power in SMP Servers
Andreas Moshovos
Babak Falsafi, ECE, Carnegie Mellon
Gokhan Memik, ECE, Northwestern
Alok Choudhary, ECE, Northwestern
Int’l Symposium on High-Performance Computer Architecture (HPCA), 2001
Slide3
Power is Becoming Important
Architecture is a science of tradeoffs
Thus far: Performance vs. Cost vs. Complexity
Today: vs. Power
Where?
Mobile Devices
Desktops/Servers (our focus)
Slide4
Power-Aware Servers
Revisit the design of SMP servers
2 or more CPUs per machine
Snoop coherence-based
Why?
File, web, databases, your typical desktop
Cost effective too
This work - a first step: Power-Aware Snoopy-Coherence
Slide5
Power-Aware Snoop-Coherence
Conventional
All L2 caches snoop all memory traffic
Power expended by all on any memory access
Jetty-Enhanced
Tiny structure on L2-backside
Filters most “would-be-misses”
Less power expended on most snoop misses
No changes to protocol necessary
No performance loss
Slide6
Roadmap
Why is Power a Concern for Servers?
Snoopy-Coherence Basics
An Opportunity for Reducing Power
JETTY
Results
Summary
Slide7
Why is Power Important?
Power Could Ultimately Limit Performance
Power Demands have been increasing
Deliver Energy to and on chip
Dissipate Heat
Limit: Amount of resources & frequency
Feasibility
Cooling as a solution: Cost & Integration?
Reducing Power Demands is much more convenient
Slide8
What can be done?
Redesign Circuits
Clock Gating and Frequency Scaling
A lot has been done thus far
Still active
Rethink Architectural Decisions
Orthogonal to others
Reduce Power Under Performance Constraints
Slide9
The “Silver Bullet” Solution
Good if there was one
However, till one is found...
Look at all structures
Rethink Design
Propose Power-Optimized versions
This is what we’re doing for performance
Slide10
Snoopy Cache Coherence
All L2 tags see all bus accesses
Intervene when necessary
[Figure: CPU cores, each with an L1 and an L2, sharing main memory; a snoop hits in a remote L2]
Slide11
How About Power?
All L2 tags see all bus accesses
Perf. & Complexity: Have L2 tags, why not use them?
Power: All L2 tags consume power on all accesses
[Figure: three CPU cores with L1/L2 caches and main memory; the snoop misses in both remote L2s]
Slide12
JETTY: A Would-be Snoop-Miss Filter
Imprecise: May filter a would-be miss; never filters snoop-hits
Would-be Snoop-Miss: the JETTY at CPU n sees addr and answers “Not here!”
Would-be Snoop-Hit: the JETTY at CPU n sees addr and answers “Don’t Know”
Detect most misses using fewer resources
Slide13
Potential for Savings Exists
Most Snoops miss: 91% AVG
Many L2 accesses are due to Snoop Misses: 55% AVG
Sizeable Potential Power Savings: 20% - 50% of total L2 power
Slide14
Exclude-Jetty
Subset of what is not cached
How? Cache recent snoop-misses locally
[Figure: address space split into cached and not cached; the Exclude-JETTY covers part of the not-cached portion]
Slide15
Exclude-Jetty
Subset of what you don’t have
Works well for producer-consumer
Slide16
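The Exclude-Jetty idea (cache recent snoop-misses locally) can be sketched as a small Python model. The class name, the FIFO replacement, the table size, and the `local_fill` hook are assumptions made for illustration; the slides only state that recent snoop misses are remembered.

```python
class ExcludeJetty:
    """Toy Exclude-Jetty: remembers blocks that recently missed on a snoop."""

    def __init__(self, entries=4):
        self.entries = entries      # assumed capacity, not taken from the slides
        self.table = []             # blocks known NOT to be in the local L2

    def snoop(self, block, l2_tags):
        """Return 'filtered', 'hit', or 'miss' for a snooped block address."""
        if block in self.table:
            return "filtered"       # guaranteed miss: skip the L2 tag lookup
        if block in l2_tags:
            return "hit"            # snoop hits are never filtered
        self.table.append(block)    # remember this snoop miss
        if len(self.table) > self.entries:
            self.table.pop(0)       # FIFO replacement (assumption)
        return "miss"

    def local_fill(self, block):
        """The local L2 now caches `block`, so it must leave the table."""
        if block in self.table:
            self.table.remove(block)
```

The second snoop to the same uncached block is the case that saves a tag lookup; removing an entry on a local fill is what keeps the table a subset of what is not cached.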
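Include-Jetty
Superset of what is cached
How? Well...
[Figure: address space split into cached and not cached; the Include-JETTY covers all of the cached portion and part of the not-cached portion]
Slide17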
Include-Jetty
[Figure: the address goes through three index functions, f( ), h( ), and g( ), each selecting one bit in bit vectors 0, 1, and 2]
Not-Cached: any zero bit
May be Cached: all bits set
Later I was told this is a Bloom filter…
Slide18
Include-Jetty
Superset of what you have
This is a counting Bloom filter:
L-CBF: A Low Power, Fast Counting Bloom Filter Implementation
Elham Safi, Andreas Moshovos and Andreas Veneris,
In Proc. Annual International Symposium on Low Power Electronics and Design (ISLPED), Oct. 2006.
Partial overlapping indexes worked better
Slide19
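A minimal sketch of the Include-Jetty as a counting Bloom filter, assuming three index functions that are simply partially overlapping bit fields of the block address. The field width, the shift amounts, and all names are illustrative assumptions, not the configuration evaluated in the paper.

```python
class IncludeJetty:
    """Toy Include-Jetty: a counting Bloom filter over the cached blocks."""

    def __init__(self, index_bits=5, shifts=(0, 3, 6)):
        # Each index is `index_bits` wide; shifts of 0, 3, 6 make the
        # three fields partially overlap (assumed values).
        self.mask = (1 << index_bits) - 1
        self.shifts = shifts
        self.counts = [[0] * (1 << index_bits) for _ in shifts]

    def _indexes(self, block):
        return [(block >> s) & self.mask for s in self.shifts]

    def allocate(self, block):
        """Called when the L2 allocates `block`."""
        for vec, i in zip(self.counts, self._indexes(block)):
            vec[i] += 1

    def evict(self, block):
        """Called when the L2 evicts `block`."""
        for vec, i in zip(self.counts, self._indexes(block)):
            vec[i] -= 1

    def may_be_cached(self, block):
        """False means 'not cached' for sure; True means 'may be cached'."""
        return all(vec[i] > 0
                   for vec, i in zip(self.counts, self._indexes(block)))
```

A snoop whose `may_be_cached` answer is False can skip the L2 tag lookup; a True answer may be a false positive, which only costs a lost filtering opportunity.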
Hybrid-Jetty
In some cases Exclude-J works well
In others Include-J is better
Combine
Access in parallel on snoop
Allocation
IJ always
If IJ fails to filter then to EJ
EJ coverage increases
Slide20
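The Hybrid-Jetty allocation policy (IJ always tracks the cache; EJ is allocated only when IJ fails to filter a real miss) can be sketched as follows. This is a simplified model under assumptions: the include part uses a single modulo-indexed counter array instead of several bit vectors, and all sizes and names are hypothetical.

```python
class HybridJetty:
    """Toy Hybrid-Jetty: an include filter and an exclude table side by side."""

    def __init__(self, ij_entries=8, ej_entries=4):
        self.ij = [0] * ij_entries   # include part: counters over cached blocks
        self.ej = []                 # exclude part: recent snoop misses
        self.ej_entries = ej_entries

    def l2_allocate(self, block):
        self.ij[block % len(self.ij)] += 1   # IJ is always updated
        if block in self.ej:
            self.ej.remove(block)            # block is cached now

    def l2_evict(self, block):
        self.ij[block % len(self.ij)] -= 1

    def snoop(self, block, l2_tags):
        # Both structures are consulted (in parallel in hardware).
        ij_absent = self.ij[block % len(self.ij)] == 0
        ej_absent = block in self.ej
        if ij_absent or ej_absent:
            return "filtered"
        if block in l2_tags:
            return "hit"
        # IJ failed to filter a real miss: hand the block to EJ.
        self.ej.append(block)
        if len(self.ej) > self.ej_entries:
            self.ej.pop(0)
        return "miss"
```

Because EJ only stores the misses that IJ could not catch, its few entries are spent where they add coverage.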
Latency?
Jetty may increase snoop-response time
Can only be determined on a design by design basis
Largest Jetty: Five 32x32 bit register files
Slide21
Results
Used SPLASH-II
Scientific applications
“Large” Datasets
e.g., 4-80 Megs of main memory allocated
Access Counts: 60M-1.7B
4-way SMP, MOESI
1M direct-mapped L2, 64-byte blocks with 32-byte subblocks
32k direct-mapped L1, 32-byte blocks
Coverage & Power (analytical model)
Slide22
Coverage: Hybrid-Jetty
Can capture 74% of all snoop-misses
Slide23
Power-Savings
28% of overall L2 power
Slide24
Summary
Power is becoming important
Performance, Reliability and Feasibility
Unique Opportunities Exist for Servers
JETTY: Filter Snoops that would miss
74% of all snoop-misses filtered
28% of L2 power saved
No protocol changes
No performance loss
Slide25
Power efficient cache coherence
C. Saldanha, M. Lipasti
Workshop on Memory Performance Issues (in conjunction with ISCA), June 2001.
Slide26
Serial Snooping
Avoids speculative transmission of snoop packets.
Check the nearest neighbor
Data supplied with minimum latency and power
Slide27
TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip-Multiprocessors
Magnus Ekman, *Fredrik Dahlgren, and Per Stenström
Chalmers University of Technology
*Ericsson Mobile Platforms
Int’l Symposium on Low Power Electronics and Design (ISLPED), Aug. 2002
Slide28
Page Sharing Tables
On a snoop, the requesting node gets a page-level sharing vector
A paper by the same authors demonstrates that Jetty is not beneficial for small-scale CMPs
If a PST entry is evicted, the whole page must be evicted
Slide29
RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence
Andreas Moshovos
moshovos@eecg.toronto.edu
Int’l Symposium on Computer Architecture (ISCA), 2005
Slide30
Improving Snoop Coherence
[Figure: three CPUs, each with an I$ and a D$, connected through an interconnect to main memory]
Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth
Can we: (1) Reduce Power/bandwidth (2) Leverage snoop coherence?
Remains Attractive: Simple / Design Re-use
Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping
Slide31
RegionScout: Avoid Some Snoops
[Figure: three CPUs, each with an I$ and a D$, connected through an interconnect to main memory]
Frequent case: non-sharing even at a coarse level/Region
RegionScout: Dynamically Identify Non-Shared Regions
First Request to a Region Identifies it as not Shared
Subsequent Requests do not need to be broadcast
Uses Imprecise Information
Small structures
Layer on top of conventional coherence
No additional constraints
Slide32
Roadmap
Conventional Coherence: The need for power-aware designs
Potential: Program Behavior
RegionScout: What and How
Implementation
Evaluation
Summary
Slide33
Coherence Basics
Given request for memory block X (address)
Detect where its current value resides
[Figure: a request for X is snooped by the other CPUs and by main memory; one CPU hits]
Slide34
Conventional Coherence not Power-Aware/Bandwidth-Effective
All L2 tags see all accesses
Perf. & Complexity: Have L2 tags, why not use them?
Power: All L2 tags consume power on all accesses
Bandwidth: broadcast all coherent requests
[Figure: a request from one CPU misses in the L2s of the other two CPUs and goes to main memory]
Slide35
RegionScout Motivation: Sharing is Coarse
Region: large contiguous memory area, power of 2 size
CPU X asks for data block in region R
No one else has X
No one else has any block in R
RegionScout Exploits this Behavior
Layered Extension over Snoop Coherence
[Figure: typical memory space snapshot, addresses colored by owner(s)]
Slide36
Optimization Opportunities
Power and Bandwidth
Originating node: avoid asking others
Remote node: avoid tag lookup
[Figure: three CPUs, each with an I$ and a D$, connected through a switch to memory]
Slide37
Potential: Region Miss Frequency
[Chart: Global Region Misses as a % of all requests vs. Region Size]
Even with a 16K Region ~45% of requests miss in all remote nodes
Slide38
RegionScout at Work: Non-Shared Region Discovery
First request detects a non-shared region
[Figure: three CPUs and main memory; (1) a CPU issues a request, (2) each remote CPU reports a Region Miss, (3) the requester observes a Global Region Miss. Each node keeps two records: Non-Shared Regions and Locally Cached Regions]
Slide39
RegionScout at Work: Avoiding Snoops
Subsequent request avoids snoops
[Figure: three CPUs and main memory with steps 1 and 2 of a later request to a region already recorded as a Global Region Miss; records shown: Non-Shared Regions and Locally Cached Regions]
Slide40
RegionScout is Self-Correcting
Request from another node invalidates non-shared record
[Figure: three CPUs and main memory with steps 1 and 2; records shown: Non-Shared Regions and Locally Cached Regions]
Slide41
Implementation: Requirements
Requesting Node provides address:
At Originating Node – from CPU: Have I discovered that this region is not shared?
At Remote Nodes – from Interconnect: Do I have a block in the region?
[Figure: the CPU address splits into a Region Tag and an offset of lg(Region Size) bits]
Slide42
Remembering Non-Shared Regions
Non-Shared Region Table
Records non-shared regions
Lookup by Region portion prior to issuing a request
Snoop requests look up the table and invalidate matching entries
Few entries: 16x4 in most experiments
[Figure: the address splits into Region Tag and offset; each table entry holds a valid bit and a Region Tag]
Slide43
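A minimal sketch of the Non-Shared Region Table behavior described above: look up by region before issuing a request, record a region after a global region miss, and invalidate the record when another node's request touches the region. The 16 sets by 4 ways and the 16K region follow the slides; the FIFO replacement, the modulo set index, and the method names are assumptions.

```python
class NonSharedRegionTable:
    """Toy NSRT: a small set-associative table of region tags."""

    def __init__(self, region_size=16 * 1024, sets=16, ways=4):
        self.shift = region_size.bit_length() - 1   # lg(region size)
        self.sets = [[] for _ in range(sets)]
        self.ways = ways

    def _locate(self, addr):
        region = addr >> self.shift                 # region tag
        return region, self.sets[region % len(self.sets)]

    def needs_broadcast(self, addr):
        """Checked at the originating node before issuing a request."""
        region, entries = self._locate(addr)
        return region not in entries

    def record_global_region_miss(self, addr):
        """No remote node holds any block of this region."""
        region, entries = self._locate(addr)
        if region not in entries:
            entries.append(region)
            if len(entries) > self.ways:
                entries.pop(0)                      # FIFO (assumption)

    def observe_remote_request(self, addr):
        """A request from another node makes the region potentially shared."""
        region, entries = self._locate(addr)
        if region in entries:
            entries.remove(region)
```

Losing an entry to replacement or invalidation is always safe: the next request to that region is simply broadcast again.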
What Regions are Locally Cached?
If we had as many counters as regions:
Block Allocation: counter[region]++
Block Eviction: counter[region]--
Region cached only if counter[region] non-zero
Not Practical: E.g., 16K Regions and 4G Memory: 256K counters
[Figure: the Region Tag portion of the address selects a counter]
Slide44
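The counter arithmetic and the exact (impractical) scheme can be checked with a few lines of Python. The function names are hypothetical; the region and memory sizes are the ones from the slide's example.

```python
from collections import Counter

REGION_SIZE = 16 * 1024        # 16K regions
MEMORY_SIZE = 4 * 1024 ** 3    # 4G of memory

# One counter per region: 2**32 / 2**14 = 2**18 = 256K counters.
counters_needed = MEMORY_SIZE // REGION_SIZE

region_count = Counter()       # exact number of cached blocks per region


def region_of(addr):
    return addr // REGION_SIZE


def block_allocated(addr):
    region_count[region_of(addr)] += 1


def block_evicted(addr):
    region_count[region_of(addr)] -= 1


def region_cached(addr):
    # A region is locally cached only while its counter is non-zero.
    return region_count[region_of(addr)] != 0
```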
Moshovos ©
What Regions are Locally Cached?
Use few Counters
Imprecise: Records a superset of locally cached Regions
False positives: lost opportunity, correctness preserved
Cached Region Hash
Few entries, e.g., 256
“Counter”: + on block allocation, - on block eviction
P-bit: 1 if counter non-zero, used for lookups
[Figure: a hash of the Region Tag selects one of the counters; each counter has a p-bit]
Slide45
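A minimal sketch of the Cached Region Hash: a few counters indexed by a hash of the region tag, with the p-bit derived from the counter. The modulo hash and the method names are assumptions; the slides do not specify the hash function.

```python
class CachedRegionHash:
    """Toy CRH: hashed counters recording a superset of cached regions."""

    def __init__(self, entries=256, region_size=16 * 1024):
        self.shift = region_size.bit_length() - 1   # lg(region size)
        self.counters = [0] * entries

    def _index(self, addr):
        # Hash of the region tag; a plain modulo is assumed here.
        return (addr >> self.shift) % len(self.counters)

    def block_allocated(self, addr):
        self.counters[self._index(addr)] += 1

    def block_evicted(self, addr):
        self.counters[self._index(addr)] -= 1

    def p_bit(self, addr):
        """True if some cached block may belong to this address's region."""
        return self.counters[self._index(addr)] != 0
```

Two regions that hash to the same counter make `p_bit` report a region as cached when it is not. That false positive only forfeits a filtering opportunity; a clear p-bit is always a correct region miss.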
Roadmap
Conventional Coherence
Program Behavior: Region Miss Frequency
RegionScout
Evaluation
Summary
Slide46
Evaluation Overview
Methodology
Filter rates
Practical Filters can capture many Region Misses
Interconnect bandwidth reduction
Slide47
Methodology
In-House simulator based on Simplescalar
Execution driven
All instructions simulated – MIPS like ISA
System calls faked by passing them to host OS
Synchronization using load-linked/store-conditional
Simple in-order processors
Memory requests complete instantaneously
MESI snoop coherence
1 or 2 level memory hierarchy
WATTCH power models
SPLASH II benchmarks
Scientific workloads
Feasibility study
Slide48
Filter Rates
[Chart: identified Global Region Misses vs. CRH Size]
For small CRH better to use large regions
Practical RegionScout filters capture a lot of the potential
Slide49
Bandwidth Reduction
[Chart: Messages vs. Region Size]
Moderate Bandwidth Savings for SMP (15%-22%)
More so for CMP (>25%)
Slide50
Related Work
RegionScout: Technical Report, Dec. 2003
Jetty: Moshovos, Memik, Falsafi, Choudhary, HPCA 2001
PST: Ekman, Dahlgren, and Stenström, ISLPED 2002
Coarse-Grain Coherence: Cantin, Lipasti and Smith, ISCA 2005
Slide51
Summary
Exploit program behavior/optimize a frequent case
Many requests result in a global region miss
RegionScout
Practical filter mechanism
Dynamically detect would-be region misses
Avoid broadcasts
Save tag lookup power and interconnect bandwidth
Small structures
Layered extension over existing mechanisms
Invisible to programmer and the OS
Slide52
Coarse-Grain Coherence
J. Cantin, M. Lipasti and J. E. Smith
ISCA 2005
Slide53
Coarse-Grain Coherence
Exploits the same phenomenon as RegionScout
Protocol extended to keep track of region state as well
Additional optimizations
Uses an additional region tag array to do so
Region replacements: must scan for the region's blocks and evict them
Slide54
Flexible snooping: adaptive forwarding and filtering of snoops in embedded-ring multiprocessors
K. Strauss, X. Shen, J. Torrellas
International Symposium on Computer Architecture, June 2006.
Slide55
Karin Strauss, Flexible Snooping
Predictors and algorithms
Set of addresses: those the node can supply vs. those in the predictor
predictor / algorithm: action on positive prediction; action on negative prediction
Subset: snoop; forward then snoop
Superset Con: snoop then forward; forward
Superset Agg: forward then snoop; forward
Exact: snoop; forward
Ring-specific
Slide56
Predictor implementation
Subset
associative table: subset of addresses that can be supplied by node
Superset
bloom filter: superset of addresses that can be supplied by node
associative table (exclude cache): addresses that recently suffered false positives
Exact
associative table: all addresses that can be supplied by node
downgrading: if address has to be evicted from predictor table, corresponding line in node has to be downgraded
Slide57
Design and Implementation of the Blue Gene/P Snoop Filter
Valentina Salapura, Matthias Blumrich, Alan Gara
Int’l Symposium on High-Performance Computer Architecture (HPCA), 2008
Slide58
Slide59
Three Mechanisms
Stream registers
Contiguous data areas
Adaptive to cache arbitrarily sized contiguous regions with a single register
Stream registers track strided and sequential streams
Snoop caches
Cache of recently executed snoop requests
Multiple requests to same line do not have to cause multiple snoop lookups
Snoop caches track locality
Range filter
Identify regions of known non-shared data
Configured by software
Slide60
Stream Registers
Base = where the block starts
Mask = which bits are common
Example: base 0111 mask 1101
01X1 may be in the cache
Over time Mask becomes all zeros
How to reset?
Cache Wrap
Each set uses Round-Robin replacement
Count replacements per set
Cache wrap when all counters > ways
Copy all streams to history and use combination
Next time throw out history
Slide61
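The base/mask mechanism can be sketched with a single stream register. This is an illustration under assumptions: the real filter has several registers and a policy for choosing which one to update, neither of which is modeled here, and the class and method names are hypothetical.

```python
class StreamRegister:
    """Toy stream register: a base address plus a mask of bits still in common."""

    def __init__(self, width=4):
        self.full = (1 << width) - 1
        self.valid = False
        self.base = 0
        self.mask = self.full

    def update(self, addr):
        """Record that the cache loaded `addr`."""
        if not self.valid:
            self.base, self.mask, self.valid = addr, self.full, True
        else:
            # Clear every mask bit where the new address differs from base.
            self.mask &= ~(self.base ^ addr) & self.full

    def may_contain(self, addr):
        """False means the address was never loaded through this register."""
        return self.valid and ((addr ^ self.base) & self.mask) == 0
```

Loading 0111 and then 0101 leaves base 0111 and mask 1101, the slide's example: only addresses of the form 01X1 can match. Each differing address clears more mask bits, which is how the mask drifts toward all zeros and the register stops filtering.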
Stream Registers: An Example
Direct mapped cache with two blocks
[Figure: timeline of the cache contents and the two stream registers as blocks 001, 011, 101, and 111 are brought in; the registers end at 001 / 1X1 and 101 / 1X1]
At this point the filter reports that the cache contains:
001 and 011
101 and 111
The first two are not there
Eventually the filter becomes saturated and cannot filter much
How can we get rid of the 011 / 1x1?
Slide62
Avoiding Saturation: Exploiting Cache Wrapping
[Figure: the same timeline with a Shadow copy of the stream registers; at the Cache Wrap the shadow holds 001 / 1X1 while the stream registers restart and end at 101 / 1X1]
Can discard Shadow
Slide63
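Cache-wrap detection can be sketched with one replacement counter per set. The threshold used below (`>= ways`, meaning every line resident at the previous wrap has been replaced under round-robin) is an assumption; the earlier slide writes the condition as "all counters > ways". The class name and the reset-on-wrap behavior are also hypothetical.

```python
class CacheWrapDetector:
    """Toy cache-wrap detector: counts replacements in each cache set."""

    def __init__(self, num_sets, ways):
        self.ways = ways
        self.replacements = [0] * num_sets

    def line_replaced(self, set_index):
        """Count one replacement; return True when a cache wrap is detected."""
        self.replacements[set_index] += 1
        # Assumed condition: every set has replaced at least `ways` lines.
        if all(count >= self.ways for count in self.replacements):
            self.replacements = [0] * len(self.replacements)
            return True
        return False
```

On a wrap, the filter would move the current stream registers into the shadow (history) copy and start fresh ones; on the following wrap the shadow can be discarded, since none of the lines it described remain in the cache.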
Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors
Chinnakrishnan S. Ballapuram, Ahmad Sharif, Hsien-Hsin S. Lee
ASPLOS 2008
Slide64
Software-Hardware Hybrid
Software Directs hardware what to do
Mechanisms very similar to Jetty and RegionScout
Paper incorrectly states that:
Jetty does not work for CMPs (in fact, it does not work well for small scale CMPs)
RegionScout works only for busses (in fact, it is interconnect agnostic)
Slide65
RegionTracker: A Framework for Coarse-Grain Optimizations in the On-chip Memory Hierarchy
Jason Zebchuk, Elham Safi and Andreas Moshovos
Int’l Symposium on Microarchitecture, 2007
Slide66
EPFL, Jan. 2008
66
Aenao Group/Toronto
Future Caches: Just Larger?
[Figure: three CPUs, each with an I$ and a D$, connected through an interconnect to main memory]
“Big Picture” Management
Store Metadata
10s – 100s of MB
Slide67
Conventional Block Centric Cache
“Small” Blocks
Optimizes Bandwidth and Performance
Large L2/L3 caches especially
Fine-Grain View of Memory
L2 Cache
Big Picture Lost
Slide68
“Big Picture” View
Region: 2^n sized, aligned area of memory
Patterns and behavior exposed
Spatial locality
Exploit for performance/area/power
Coarse-Grain View of Memory
L2 Cache
Slide69
Exploiting Coarse-Grain Patterns
Many existing coarse-grain optimizations
Add new structures to track coarse-grain information
[Figure: a CPU and its L2 Cache surrounded by optimizations: Stealth Prefetching, Run-time Adaptive Cache Hierarchy Management via Reference Analysis, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence, Virtual Tree Coherence, Power-Efficient DRAM Speculation]
Hard to justify for a commercial design
Coarse-Grain Framework
Embed coarse-grain information in tag array
Support many different optimizations with less area overhead
Adaptable optimization FRAMEWORK
Slide70
RegionTracker Solution
Manage blocks, but also track and manage regions
[Figure: an L2 Cache below four L1s. Labels: Tag Array, Region Tracker, Data Array, Block Requests, Data Blocks, Region Probes, Region Responses]
Slide71
RegionTracker Summary
Replace conventional tag array:
4-core CMP with 8MB shared L2 cache
Within 1% of original performance
Up to 20% less tag area
Average 33% less energy consumption
Optimization Framework:
Stealth Prefetching: same performance, 36% less area
RegionScout: 2x more snoops avoided, no area overhead
Slide72
Road Map
Introduction
Goals
Coarse-Grain Cache Designs
RegionTracker: A Tag Array Replacement
RegionTracker: An Optimization Framework
Conclusion
Slide73
Goals
Conventional Tag Array Functionality
Identify data block location and state
Leave data array un-changed
Optimization Framework Functionality
Is Region X cached?
Which blocks of Region X are cached? Where?
Evict or migrate Region X
Easy to assign properties to each Region
Slide74
Coarse-Grain Cache Designs
Large Block Size
Increased BW, Decreased hit-rates
[Figure: Region X in the Tag Array and the Data Array]
Slide75
Sector Cache
Decreased hit-rates
[Figure: Region X in the Tag Array and the Data Array]
Slide76
Sector Pool Cache
High Associativity (2 - 4 times)
[Figure: Region X in the Tag Array and the Data Array]
Slide77
Decoupled Sector Cache
Region information not exposed
Region replacement requires scanning multiple entries
[Figure: Region X in the Tag Array, the Status Table, and the Data Array]
Slide78
Design Requirements
Small block size (64B)
Miss-rate does not increase
Lookup associativity does not increase
No additional access latency
(i.e., No scanning, no multiple block evictions)
Does not increase latency, area, or energy
Allows banking and interleaving
Fit in conventional tag array “envelope”
Slide79
RegionTracker: A Tag Array Replacement
3 SRAM arrays, combined smaller than tag array
Region Vector Array
Block Status Table
Evicted Region Buffer
[Figure: four L1s above the three arrays and the Data Array]
Slide80
Common Case: Hit
Address fields (bit boundaries 49, 21, 10, 6, 0): Region Tag | RVA Index | Region Offset | Block Offset
Region Vector Array (RVA): each entry holds a Region Tag plus a way and a valid bit (V) for each of block 0 … block 15
Block Status Table (BST): holds the status; indexed by the Data Array + BST Index (bit boundaries 19, 6, 0), which also goes to the Data Array
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
Slide81
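The address split for this example configuration can be written out as a short Python helper. The field widths follow from the slide's bit boundaries (49, 21, 10, 6, 0) with 64-byte blocks and 1KB regions; the function name and constants are hypothetical.

```python
BLOCK_OFFSET_BITS = 6    # 64-byte blocks: bits 5..0
REGION_OFFSET_BITS = 4   # 1KB region / 64B blocks = 16 blocks: bits 9..6
RVA_INDEX_BITS = 11      # bits 20..10


def split_address(addr):
    """Return (region_tag, rva_index, region_offset, block_offset)."""
    block_offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    addr >>= BLOCK_OFFSET_BITS
    region_offset = addr & ((1 << REGION_OFFSET_BITS) - 1)
    addr >>= REGION_OFFSET_BITS
    rva_index = addr & ((1 << RVA_INDEX_BITS) - 1)
    region_tag = addr >> RVA_INDEX_BITS       # remaining bits, 49..21
    return region_tag, rva_index, region_offset, block_offset
```

The RVA index picks the set in the Region Vector Array, the region tag is compared against the entries in that set, and the region offset selects which of the 16 per-block fields in the matching entry to read.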
Worst Case (Rare): Region Miss
[Figure: the same lookup path as the hit case, except that the Region Tag finds no match in the RVA (“No Match!”); the diagram adds the Evicted Region Buffer (ERB) and Ptr fields]
Slide82
Methodology
Flexus simulator from CMU SimFlex group
Based on Simics full-system simulator
4-core CMP modeled after Piranha
Private 32KB, 4-way set-associative L1 caches
Shared 8MB, 16-way set-associative L2 cache
64-byte blocks
Miss-rates: Functional simulation of 2 billion instructions per core
Performance and Energy: Timing simulation using SMARTS sampling methodology
Area and Power: Full custom implementation on 130nm commercial technology
9 commercial workloads:
WEB: SpecWEB on Apache and Zeus
OLTP: TPC-C on DB2 and Oracle
DSS: 5 TPC-H queries on DB2
[Figure: four processors (P), each with a D$ and an I$, connected through an interconnect to the L2]
Slide83
Miss-Rates vs. Area
Sector Cache: 512KB sectors, SPC and RT: 1KB regions
[Chart: Relative Miss-Rate vs. Relative Tag Array Area; labeled points include Sector Cache (0.25, 1.26), 14-way, 15-way, 48-way, and 52-way]
Trade-offs comparable to conventional cache
Slide84
Performance & Energy
12-way set-associative RegionTracker: 20% less area
[Charts: Normalized Execution Time (Performance) and Reduction in Tag Energy (Energy)]
Error bars: 95% confidence interval
Performance within 1%, with 33% tag energy reduction
Slide85
Road Map
Introduction
Goals
Coarse-Grain Cache Designs
RegionTracker: A Tag Array Replacement
RegionTracker: An Optimization Framework
Conclusion
Slide86
RegionTracker: An Optimization Framework
[Figure: four L1s above the RVA, ERB, BST, and Data Array]
Stealth Prefetching:
Average 20% performance improvement
Drop-in RegionTracker for 36% less area overhead
RegionScout:
In-depth analysis
Slide87
Snoop Coherence: Common Case
[Figure: one CPU issues Read x, Read x+1, Read x+2, … Read x+n; each snoop misses in the other CPUs]
Many snoops are to non-shared regions
Slide88
RegionScout
Eliminate broadcasts for non-shared regions
[Figure: Read x misses in the other CPUs, each reporting a Region Miss from its Locally Cached Regions; the requester records the Global Region Miss in its Non-Shared Regions]
Slide89
RegionTracker Implementation
Non-Shared Regions: Add 1 bit to each RVA entry
Locally Cached Regions: Already provided by RVA
Minimal overhead to support RegionScout optimization
Still uses less area than conventional tag array
Slide90
RegionTracker + RegionScout
4 processors, 512KB L2 Caches
1KB regions
[Chart: Reduction in Snoop Broadcasts]
Avoid 41% of Snoop Broadcasts, no area overhead compared to conventional tag array
Slide91
Result Summary
Replace Conventional Tag Array:
20% Less tag area
33% Less tag energy
Within 1% of original performance
Coarse-Grain Optimization Framework:
36% reduction in area overhead for Stealth Prefetching
Filter 41% of snoop broadcasts with no area overhead compared to conventional cache