Slide1

Snoop Filtering and Coarse-Grain Memory Tracking

Andreas Moshovos
Univ. of Toronto/ECE
Short Course at the University of Zaragoza, July 2009
Some slides by J. Zebchuk or the original paper authors

Slide2

JETTY

Snoop-Filtering for Reduced Power in SMP Servers

Andreas Moshovos
Babak Falsafi, ECE, Carnegie Mellon
Gokhan Memik, ECE, Northwestern
Alok Choudhary, ECE, Northwestern
Int’l Conference on High-Performance Architecture, 2001

Slide3

Power is Becoming Important

Architecture is a science of tradeoffs
Thus far: Performance vs. Cost vs. Complexity
Today: vs. Power
Where?
  Mobile Devices
  Desktops/Servers
Our Focus

Slide4

Power-Aware Servers

Revisit the design of SMP servers
  2 or more CPUs per machine
  Snoop coherence-based
Why?
  File, web, databases, your typical desktop
  Cost effective too
This work - a first step: Power-Aware Snoopy-Coherence

Slide5

Power-Aware Snoop-Coherence

Conventional
  All L2 caches snoop all memory traffic
  Power expended by all on any memory access
Jetty-Enhanced
  Tiny structure on L2-backside
  Filters most “would-be-misses”
  Less power expended on most snoop misses
  No changes to protocol necessary
  No performance loss

Slide6

Roadmap

Why Power is a Concern for Servers?
Snoopy-Coherence Basics
An Opportunity for Reducing Power
JETTY
Results
Summary

Slide7

Why is Power Important?

Power Could Ultimately Limit Performance
Power Demands have been increasing
  Deliver Energy to and on chip
  Dissipate Heat
Limit: Amount of resources & frequency
Feasibility
  Cooling a solution: Cost & Integration?
Reducing Power Demands is much more convenient

Slide8

What can be done?

Redesign Circuits
  Clock Gating and Frequency Scaling
  A lot has been done thus far
  Still active
Rethink Architectural Decisions
  Orthogonal to others
  Reduce Power Under Performance Constraints

Slide9

The “Silver Bullet” Solution

Good if there was one
However, till one is found...
  Look at all structures
  Rethink Design
  Propose Power-Optimized versions
This is what we’re doing for performance

Slide10

Snoopy Cache Coherence

All L2 tags see all bus accesses
Intervene when necessary

[Figure: CPU Cores with L1 and L2 caches and Main Memory; label: Hit]

Slide11

How About Power?

All L2 tags see all bus accesses
Perf. & Complexity: Have L2 tags, why not use them
Power: All L2 tags consume power on all accesses

[Figure: CPU Cores with L1 and L2 caches and Main Memory; labels: miss, miss]

Slide12

JETTY: A Would-be Snoop-Miss Filter

Imprecise: may filter a would-be miss, never filters snoop-hits
Would-be Snoop-Miss: CPU n presents addr to JETTY, which answers “Not here!”
Would-be Snoop-Hit: CPU n presents addr to JETTY, which answers “Don’t Know”
Detect most misses using fewer resources

Slide13

Potential for Savings Exists

Most Snoops miss: 91% AVG
Many L2 accesses are due to Snoop Misses: 55% AVG
Sizeable Potential Power Savings: 20% - 50% of total L2 power

Slide14

Exclude-Jetty

Subset of what is not cached
How? Cache recent snoop-misses locally
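The idea of caching recent snoop-misses can be sketched as a small table of block addresses known to be absent from the local L2. This is a minimal illustration, not the design from the JETTY paper: the class name, the table size, and the LRU replacement are all assumptions. The one property it must keep is that an entry is removed as soon as the block is filled locally, so a snoop-hit is never filtered.

```python
from collections import OrderedDict


class ExcludeJetty:
    """Small table of block addresses known NOT to be in the local L2."""

    def __init__(self, entries=32):  # table size is an assumption
        self.entries = entries
        self.absent = OrderedDict()  # block address -> None, in LRU order

    def filters(self, block):
        """True means 'guaranteed snoop miss': the L2 tag lookup can be skipped."""
        if block in self.absent:
            self.absent.move_to_end(block)
            return True
        return False

    def record_snoop_miss(self, block):
        """Call after the L2 tags confirmed that a snoop missed."""
        self.absent[block] = None
        self.absent.move_to_end(block)
        if len(self.absent) > self.entries:
            self.absent.popitem(last=False)  # drop the least recently used entry

    def on_local_fill(self, block):
        """The block is now cached locally, so it may no longer be filtered."""
        self.absent.pop(block, None)
```

A snoop first calls `filters(block)`; only when that returns False does it pay for the L2 tag lookup, and a confirmed miss is then recorded with `record_snoop_miss`.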

[Figure labels: cached, not cached, Exclude-JETTY]

Slide15

Exclude-Jetty

Subset of what you don’t have
Works well for producer-consumer

Slide16

Include-Jetty

Superset of what is cached
How? Well...

[Figure labels: cached, not cached, Include-JETTY]

Slide17

Include-Jetty

Functions f( ), g( ) and h( ) of the address each index one of bit vector 0, bit vector 1 and bit vector 2
Not-Cached: any zero bit
May be Cached: all bits set
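The lookup rule above (any zero bit means not cached, all bits set means maybe cached) can be sketched as follows. This is an illustrative model under stated assumptions, not the hardware design. The index functions standing in for f, g and h are hypothetical non-overlapping address slices, the table sizes are arbitrary, and counters replace single bits so that an eviction can undo a fill.

```python
class IncludeJetty:
    """Counting Bloom filter over the blocks held in the local L2 (illustrative)."""

    def __init__(self, index_bits=5, num_tables=3):
        self.index_bits = index_bits
        self.num_tables = num_tables
        # One counter array per index function; a nonzero counter plays
        # the role of a set bit in the slide's bit vectors.
        self.counters = [[0] * (1 << index_bits) for _ in range(num_tables)]

    def _indexes(self, block):
        # Hypothetical stand-ins for f, g, h: table t uses the t-th
        # index_bits-wide slice of the block address.
        mask = (1 << self.index_bits) - 1
        return [(block >> (t * self.index_bits)) & mask
                for t in range(self.num_tables)]

    def on_fill(self, block):
        for t, i in enumerate(self._indexes(block)):
            self.counters[t][i] += 1

    def on_evict(self, block):
        for t, i in enumerate(self._indexes(block)):
            self.counters[t][i] -= 1

    def may_be_cached(self, block):
        # Any zero counter proves "not cached"; all nonzero means "maybe".
        return all(self.counters[t][i] > 0
                   for t, i in enumerate(self._indexes(block)))
```

A False answer is always safe to act on. A True answer may be a false positive when other blocks happen to cover all three indexes, in which case the snoop simply proceeds to the L2 tags as usual.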

Later I was told this is a Bloom filter…

Slide18

Include-Jetty

Superset of what you have
This is a counting Bloom filter:
  L-CBF: A Low Power, Fast Counting Bloom Filter Implementation
  Elham Safi, Andreas Moshovos and Andreas Veneris,
  In Proc. Annual International Symposium on Low Power Electronics and Design (ISLPED), Oct. 2006.
Partial overlapping indexes worked better

Slide19

Hybrid-Jetty

In some cases Exclude-J works well
In some others Include-J is better
Combine
  Access in parallel on snoop
  Allocation
    IJ always
    If IJ fails to filter then to EJ
  EJ coverage increases

Slide20

Latency?

Jetty may increase snoop-response time
Can only be determined on a design by design basis
Largest Jetty: Five 32x32 bit register files

Slide21

Results

Used SPLASH-II
  Scientific applications
  “Large” Datasets, e.g., 4-80 Megs of main memory allocated
  Access Counts: 60M-1.7B
4-way SMP, MOESI
  1M direct-mapped L2, 64b blocks, 32b subblocks
  32k direct-mapped L1, 32b blocks
Coverage & Power (analytical model)

Slide22

Coverage: Hybrid-Jetty

Can capture 74% of all snoop-misses

[Figure: coverage chart; "better" direction marked]

Slide23

Power-Savings

28% of overall L2 power

[Figure: power-savings chart; "better" direction marked]

Slide24

Summary

Power is becoming important
  Performance, Reliability and Feasibility
Unique Opportunities Exist for Servers
JETTY: Filter Snoops that would miss
  74% of all snoops
  28% of L2 power saved
  No protocol changes
  No performance loss

Slide25

Power efficient cache coherence

C. Saldanha, M. Lipasti
Workshop on Memory Performance Issues (in conjunction with ISCA), June 2001.

Slide26

Serial Snooping

Avoids speculative transmission of snoop packets.
Check the nearest neighbor
Data supplied with minimum latency and power

[Figure label: MEMORY]

Slide27

TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip-Multiprocessors

Magnus Ekman, *Fredrik Dahlgren, and Per Stenström
Chalmers University of Technology
Ericsson Mobile Platforms
Int’l Symposium on Low Power Electronic Design and Devices, Aug. 2002

Slide28

Page Sharing Tables

On a snoop, the requesting node gets a page-level sharing vector
Paper by same authors demonstrates that Jetty is not beneficial for small-scale CMPs
If a PST entry is evicted the whole page must be evicted

Slide29

RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence

Andreas Moshovos
moshovos@eecg.toronto.edu
Int’l Conference on Computer Architecture 2005

Slide30

Improving Snoop Coherence

Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth
Remains Attractive: Simple / Design Re-use
Can we: (1) Reduce Power/bandwidth (2) Leverage snoop coherence?
Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping

[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]

Slide31

RegionScout: Avoid Some Snoops

Frequent case: non-sharing even at a coarse level/Region
RegionScout: Dynamically Identify Non-Shared Regions
  First Request to a Region Identifies it as not Shared
  Subsequent Requests do not need to be broadcast
Uses Imprecise Information
  Small structures
  Layer on top of conventional coherence
  No additional constraints

[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]

Slide32

Roadmap

Conventional Coherence: The need for power-aware designs
Potential: Program Behavior
RegionScout: What and How
Implementation
Evaluation
Summary

Slide33

Coherence Basics

Given request for memory block X (address)
Detect where its current value resides

[Figure: three CPUs and Main Memory; labels: X, snoop, snoop, hit]

Slide34

Conventional Coherence not Power-Aware/Bandwidth-Effective

All L2 tags see all accesses
Perf. & Complexity: Have L2 tags, why not use them
Power: All L2 tags consume power on all accesses
Bandwidth: broadcast all coherent requests

[Figure: three CPUs with L2 caches and Main Memory; labels: miss, miss]

Slide35

RegionScout Motivation: Sharing is Coarse

Region: large contiguous memory area, power of 2 size
CPU X asks for data block in region R
  No one else has X
  No one else has any block in R
RegionScout Exploits this Behavior
  Layered Extension over Snoop Coherence

[Figure: typical memory space snapshot; addresses colored by owner(s)]

Slide36

Optimization Opportunities

Power and Bandwidth
  Originating node: avoid asking others
  Remote node: avoid tag lookup

[Figure: three CPUs, each with I$ and D$, connected through a SWITCH to Memory]

Slide37

Potential: Region Miss Frequency

Even with a 16K Region ~45% of requests miss in all remote nodes

[Figure: Global Region Misses, % of all requests vs. Region Size; "better" direction marked]

Slide38

RegionScout at Work: Non-Shared Region Discovery

First request detects a non-shared region

[Figure: three CPUs and Main Memory, steps 1 to 3; labels: Region Miss, Region Miss, Global Region Miss; Record: Non-Shared Regions; Record: Locally Cached Regions]

Slide39

RegionScout at Work: Avoiding Snoops

Subsequent request avoids snoops

[Figure: three CPUs and Main Memory, steps 1 and 2; labels: Global Region Miss; Record: Non-Shared Regions; Record: Locally Cached Regions]

Slide40

RegionScout is Self-Correcting

Request from another node invalidates non-shared record

[Figure: three CPUs and Main Memory, steps 1 and 2; Record: Non-Shared Regions; Record: Locally Cached Regions]

Slide41

Implementation: Requirements

Requesting Node provides address:
  At Originating Node – from CPU: Have I discovered that this region is not shared?
  At Remote Nodes – from Interconnect: Do I have a block in the region?
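Both questions are asked per region, so each node only needs the upper bits of the address. A minimal sketch of that split, assuming a power-of-two region size; the 16 KB value is just an example and the function names are illustrative:

```python
REGION_SIZE = 16 * 1024                       # bytes; example value, must be a power of two
REGION_SHIFT = REGION_SIZE.bit_length() - 1   # lg(Region Size) = 14 for 16 KB


def region_tag(addr):
    """Upper address bits: identifies the region the address falls in."""
    return addr >> REGION_SHIFT


def region_offset(addr):
    """Lower lg(Region Size) bits: position of the byte inside its region."""
    return addr & (REGION_SIZE - 1)
```

Two addresses belong to the same region exactly when `region_tag` returns the same value for both.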

[Figure: the CPU address is split into a Region Tag and an offset of lg(Region Size) bits]

Slide42

Remembering Non-Shared Regions

Records non-shared regions
Lookup by Region portion prior to issuing a request
Snoops requests from other nodes and invalidates matching records
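The three behaviours above (record, look up before issuing, invalidate on a remote request) can be modelled with a tiny set-associative table. This is a sketch under assumptions: 16 sets of 4 ways is one possible organization, and the indexing by low region-tag bits, the oldest-first replacement, and the method names are all made up for illustration.

```python
class NonSharedRegionTable:
    """Remembers regions that a previous broadcast found to be non-shared."""

    def __init__(self, sets=16, ways=4, region_shift=14):
        self.sets = sets
        self.ways = ways
        self.region_shift = region_shift
        self.table = [[] for _ in range(sets)]  # each set: region tags, oldest first

    def _lookup(self, addr):
        region = addr >> self.region_shift
        return region, self.table[region % self.sets]

    def is_non_shared(self, addr):
        """Checked before issuing a request; True means no broadcast is needed."""
        region, entries = self._lookup(addr)
        return region in entries

    def record_global_region_miss(self, addr):
        """All remote nodes reported a region miss for this request."""
        region, entries = self._lookup(addr)
        if region not in entries:
            entries.append(region)
            if len(entries) > self.ways:
                entries.pop(0)  # evict the oldest record in the set

    def on_remote_request(self, addr):
        """Another node touched the region: the record is no longer valid."""
        region, entries = self._lookup(addr)
        if region in entries:
            entries.remove(region)
```

Losing a record through eviction is harmless. The next request to that region is simply broadcast again and may rediscover that the region is non-shared.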

[Figure: Non-Shared Region Table; labels: address split into Region Tag and offset, valid]

Few entries: 16x4 in most experiments

Slide43

What Regions are Locally Cached?

If we had as many counters as regions:
  Block Allocation: counter[region]++
  Block Eviction: counter[region]--
  Region cached only if counter[region] non-zero
Not Practical: e.g., 16K Regions and 4G Memory → 256K counters
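The counter scheme and the arithmetic behind the 256K figure can be written out as below. The dictionary-backed `Counter` stands in for a hardware array with one counter per region, and the names are illustrative only.

```python
from collections import Counter

MEMORY_BYTES = 4 * 2**30    # 4G of memory
REGION_BYTES = 16 * 2**10   # 16K regions
NUM_REGIONS = MEMORY_BYTES // REGION_BYTES   # 2**32 / 2**14 = 2**18 = 256K counters


class ExactRegionCounters:
    """One counter per region: precise, but needs NUM_REGIONS counters in hardware."""

    def __init__(self):
        self.count = Counter()

    def on_block_allocation(self, addr):
        self.count[addr // REGION_BYTES] += 1

    def on_block_eviction(self, addr):
        self.count[addr // REGION_BYTES] -= 1

    def region_cached(self, addr):
        # The region is locally cached only while its counter is non-zero.
        return self.count[addr // REGION_BYTES] != 0
```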

[Figure: the Region Tag of the address (Region Tag, offset) selects a counter]

Slide44

Moshovos ©

What Regions are Locally Cached?

Use few Counters
Imprecise: Records a superset of locally cached Regions
False positives: lost opportunity, correctness preserved
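Why a few shared counters still give a safe answer can be seen in a short sketch. The hash used here (region tag modulo the table size) and the default sizes are assumptions for illustration. Regions that alias to the same counter produce false positives, but a zero counter always proves that no block of the region is cached.

```python
class CachedRegionHash:
    """A few counters shared by many regions: a superset of the cached regions."""

    def __init__(self, entries=256, region_shift=14):
        self.entries = entries
        self.region_shift = region_shift
        self.counters = [0] * entries

    def _index(self, addr):
        # Hypothetical hash: low bits of the region tag.
        return (addr >> self.region_shift) % self.entries

    def on_block_allocation(self, addr):
        self.counters[self._index(addr)] += 1

    def on_block_eviction(self, addr):
        self.counters[self._index(addr)] -= 1

    def may_cache_region(self, addr):
        # Non-zero test on the counter. False is always correct; True may be
        # a false positive caused by another region sharing the counter.
        return self.counters[self._index(addr)] != 0
```

A remote node answers "region miss" only when `may_cache_region` is False, so aliasing can cost a filtering opportunity but never hides a cached block.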

Cached Region Hash
  Few entries, e.g., 256
  “Counter”: + on block allocation, - on block eviction
  P-bit: 1 if counter non-zero, used for lookups

[Figure labels: Region Tag, offset, hash, counter, p bits]

Slide45

Roadmap

Conventional Coherence
Program Behavior: Region Miss Frequency
RegionScout
Evaluation
Summary

Slide46

Evaluation Overview

Methodology
Filter rates
  Practical Filters can capture many Region Misses
Interconnect bandwidth reduction

Slide47

Methodology

In-House simulator based on Simplescalar
  Execution driven
  All instructions simulated – MIPS like ISA
  System calls faked by passing them to host OS
  Synchronization using load-linked/store-conditional
  Simple in-order processors
  Memory requests complete instantaneously
  MESI snoop coherence
  1 or 2 level memory hierarchy
  WATTCH power models
SPLASH II benchmarks
  Scientific workloads
  Feasibility study

Slide48

Filter Rates

For small CRH better to use large regions
Practical RegionScout filters capture a lot of the potential

[Figure: Identified Global Region Misses vs. CRH Size; "better" direction marked]

Slide49

Bandwidth Reduction

Moderate Bandwidth Savings for SMP (15%-22%)
More so for CMP (>25%)

[Figure: Messages vs. Region Size; "better" direction marked; panel label: CMP]

Slide50

Related Work

RegionScout: Technical Report, Dec. 2003
Jetty: Moshovos, Memik, Falsafi, Choudhary, HPCA 2001
PST: Ekman, Dahlgren, and Stenström, ISLPED 2002
Coarse-Grain Coherence: Cantin, Lipasti and Smith, ISCA 2005

Slide51

Summary

Exploit program behavior/optimize a frequent case
  Many requests result in a global region miss
RegionScout
  Practical filter mechanism
  Dynamically detect would-be region misses
  Avoid broadcasts
  Save tag lookup power and interconnect bandwidth
  Small structures
  Layered extension over existing mechanisms
  Invisible to programmer and the OS

Slide52

Coarse-Grain Coherence

J. Cantin, M. Lipasti and J. E. Smith
ISCA 2005

Slide53

Coarse-Grain Coherence

Exploits the same phenomenon as RegionScout
Protocol extended to keep track of region state as well
  Additional optimizations
Uses an additional region tag array to do so
Region replacements
  Must scan and find the blocks and evict them

Slide54

Flexible snooping: adaptive forwarding and filtering of snoops in embedded-ring multiprocessors

K. Strauss, X. Shen, J. Torrellas
International Symposium on Computer Architecture, June 2006.

Slide55

Karin Strauss, Flexible Snooping

Predictors and algorithms

predictor / algorithm   action on positive prediction   action on negative prediction
Subset                  snoop                           forward then snoop
Superset Con            snoop then forward              forward
Superset Agg            forward then snoop              forward
Exact                   snoop                           forward

[Figure legend, set of addresses: node can supply, in predictor]

Ring-specific

Slide56

Predictor implementation

Subset
  associative table: subset of addresses that can be supplied by node
Superset
  bloom filter: superset of addresses that can be supplied by node
  associative table (exclude cache): addresses that recently suffered false positives
Exact
  associative table: all addresses that can be supplied by node
  downgrading: if address has to be evicted from predictor table, corresponding line in node has to be downgraded

Slide57

Design and Implementation of the Blue Gene/P Snoop Filter

Valentina Salapura, Matthias Blumrich, Alan Gara
Int’l Conf. on High-Performance Computer Architecture, 2008

Slide58
Slide59

Three Mechanisms

Stream registers
  Contiguous data areas
  Adaptive to cache arbitrarily sized contiguous regions with a single register
  Stream registers track strided and sequential streams
Snoop caches
  Cache of recently executed snoop requests
  Multiple requests to same line do not have to cause multiple snoop lookups
  Snoop caches track locality
Range filter
  Identify regions of known non-shared data
  Configured by software

Slide60

Stream Registers

Base = where the block starts
Mask = which bits are common
Example: base 0111 mask 1101 → 01X1 may be in the cache
Over time Mask becomes all zeros
How to reset? Cache Wrap
  Each set uses Round-Robin replacement
  Count replacements per set
  Cache wrap when all counters > ways
  Copy all streams to history and use combination
  Next time throw out history
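The base/mask example can be reproduced with a single register. This is a sketch with an assumed update rule that is consistent with the example above, not necessarily the exact Blue Gene/P logic: the first fill sets the base, and every later fill clears the mask bits in which the new address differs from the base. The class name and the 4-bit width are illustrative.

```python
class StreamRegister:
    """One base/mask pair summarizing the addresses loaded into the cache."""

    def __init__(self, width=4):
        self.width = width
        self.full = (1 << width) - 1
        self.valid = False
        self.base = 0
        self.mask = self.full   # a 1 bit means "this bit is common to all fills"

    def record_fill(self, addr):
        if not self.valid:
            self.valid, self.base, self.mask = True, addr, self.full
        else:
            # Bits where the new address differs from the base become don't-cares.
            self.mask &= ~(self.base ^ addr) & self.full

    def may_be_cached(self, addr):
        # The address matches if it agrees with the base on every masked bit.
        return self.valid and ((addr ^ self.base) & self.mask) == 0

    def pattern(self):
        # Render the register with X for each don't-care bit.
        return "".join(
            str((self.base >> b) & 1) if (self.mask >> b) & 1 else "X"
            for b in reversed(range(self.width)))
```

Recording 0111 and then 0101 leaves base 0111 and mask 1101, which `pattern()` renders as 01X1. Because the mask can only lose bits, the register eventually matches everything, which is why a reset mechanism such as the cache wrap is needed.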

Slide61

Stream Registers: An Example

Direct mapped cache with two blocks

Time   cache          Stream registers
0      empty, empty   empty, empty
1      001, empty     001 / 111, empty
2      001, 011       001 / 1X1, empty
3      101, 011       001 / 111, 101 / 111
4      101, 111       001 / 1X1, 101 / 1X1

At this point the filter reports that the cache contains: 001 and 011, 101 and 111
The first two are not there
Eventually the filter becomes saturated and cannot filter much
How can we get rid of the 011 / 1x1?

Slide62

Avoiding Saturation: Exploiting Cache Wrapping

Time   cache          Stream registers        Shadow
0      empty, empty   empty, empty            empty, empty
1      001, empty     001 / 111, empty        empty, empty
2      001, 011       001 / 1X1, empty        001 / 1X1, empty
3      101, 011       empty, 101 / 111        001 / 1X1, empty
4      101, 111       empty, 101 / 1X1        001 / 1X1, empty

Cache Wrap → Can discard Shadow

Slide63

Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Chinnakrishnan S. Ballapuram
Ahmad Sharif
Hsien-Hsin S. Lee
ASPLOS 2008

Slide64

Software-Hardware Hybrid

Software directs hardware what to do
Mechanisms very similar to Jetty and RegionScout
Paper incorrectly states that:
  Jetty does not work for CMPs
    In fact, it does not work well for small scale CMPs
  RegionScout works only for busses
    In fact, it is interconnect agnostic

Slide65

RegionTracker: A Framework for Coarse-Grain Optimizations in the On-chip Memory Hierarchy

Jason Zebchuk, Elham Safi and Andreas Moshovos
Int’l Symposium on Microarchitecture, 2007

Slide66

EPFL, Jan. 2008
Aenao Group/Toronto

Future Caches: Just Larger?

“Big Picture” Management
Store Metadata
10s – 100s of MB

[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]

Slide67

Conventional Block Centric Cache

“Small” Blocks
  Optimizes Bandwidth and Performance
  Large L2/L3 caches especially
Big Picture Lost

[Figure: L2 Cache, Fine-Grain View of Memory]

Slide68

“Big Picture” View

Region: 2^n sized, aligned area of memory
Patterns and behavior exposed
  Spatial locality
Exploit for performance/area/power

[Figure: L2 Cache, Coarse-Grain View of Memory]

Slide69

Exploiting Coarse-Grain Patterns

Many existing coarse-grain optimizations
  Stealth Prefetching
  Run-time Adaptive Cache Hierarchy Management via Reference Analysis
  Destination-Set Prediction
  Spatial Memory Streaming
  Coarse-Grain Coherence Tracking
  RegionScout
  Circuit-Switched Coherence
  Virtual Tree Coherence
  Power-Efficient DRAM Speculation
Add new structures to track coarse-grain information
  Hard to justify for a commercial design
Coarse-Grain Framework
  Embed coarse-grain information in tag array
  Support many different optimizations with less area overhead
  Adaptable optimization FRAMEWORK

[Figure labels: CPU, L2 Cache]

Slide70

RegionTracker Solution

Manage blocks, but also track and manage regions

[Figure: L2 Cache with Tag Array and Data Array, and four L1s; labels: Block Requests, Data Blocks, Region Tracker, Region Probes, Region Responses]

Slide71

RegionTracker Summary

Replace conventional tag array:
  4-core CMP with 8MB shared L2 cache
  Within 1% of original performance
  Up to 20% less tag area
  Average 33% less energy consumption
Optimization Framework:
  Stealth Prefetching: same performance, 36% less area
  RegionScout: 2x more snoops avoided, no area overhead

Slide72

Road Map

Introduction
Goals
Coarse-Grain Cache Designs
RegionTracker: A Tag Array Replacement
RegionTracker: An Optimization Framework
Conclusion

Slide73

Goals

Conventional Tag Array Functionality
  Identify data block location and state
  Leave data array un-changed
Optimization Framework Functionality
  Is Region X cached?
  Which blocks of Region X are cached? Where?
  Evict or migrate Region X
  Easy to assign properties to each Region

Slide74

Coarse-Grain Cache Designs

Large Block Size
  Increased BW, Decreased hit-rates

[Figure: Region X in the Tag Array and Data Array]

Slide75

Sector Cache

Decreased hit-rates

[Figure: Region X in the Tag Array and Data Array]

Slide76

Sector Pool Cache

High Associativity (2 - 4 times)

[Figure: Region X in the Tag Array and Data Array]

Slide77

Decoupled Sector Cache

Region information not exposed
Region replacement requires scanning multiple entries

[Figure: Region X in the Tag Array, Status Table and Data Array]

Slide78

Design Requirements

Small block size (64B)
Miss-rate does not increase
Lookup associativity does not increase
No additional access latency (i.e., No scanning, no multiple block evictions)
Does not increase latency, area, or energy
Allows banking and interleaving
Fit in conventional tag array “envelope”

Slide79

RegionTracker: A Tag Array Replacement

3 SRAM arrays, combined smaller than tag array
  Region Vector Array
  Block Status Table
  Evicted Region Buffer

[Figure: four L1s and the Data Array]

Slide80

Common Case: Hit

[Figure: lookup on a hit.
 Address (bit positions labelled 49, 21, 10, 6, 0): Region Tag, RVA Index, Region Offset, Block Offset.
 Region Vector Array (RVA) entry: Region Tag, then block 0 … block 15, each with a way field and a V bit.
 Data Array + BST Index (bit positions labelled 19, 6, 0) followed by the Block Offset; goes To Data Array.
 Block Status Table (BST): status.
 Other labels: 1, 4, 3, 2.]
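The address fields named in the figure can be extracted with shifts and masks. The sketch below assumes that the labels 6, 10 and 21 are the starting bit positions of the Region Offset, RVA Index and Region Tag fields, that blocks are 64 bytes, and that regions are 1 KB (16 blocks). The function and constant names are illustrative.

```python
BLOCK_OFFSET_BITS = 6    # 64-byte blocks: address bits 0..5
REGION_OFFSET_BITS = 4   # 16 blocks per 1 KB region: bits 6..9
RVA_INDEX_BITS = 11      # assumed from the boundaries at bits 10 and 21: bits 10..20


def split_address(addr):
    """Return (region_tag, rva_index, region_offset, block_offset)."""
    block_offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    addr >>= BLOCK_OFFSET_BITS
    region_offset = addr & ((1 << REGION_OFFSET_BITS) - 1)   # which block of the region
    addr >>= REGION_OFFSET_BITS
    rva_index = addr & ((1 << RVA_INDEX_BITS) - 1)           # selects the RVA set
    region_tag = addr >> RVA_INDEX_BITS                      # compared against the RVA entry
    return region_tag, rva_index, region_offset, block_offset
```

The region offset picks one of the 16 per-block slots inside the matching RVA entry, which is where the way and valid information for that block would be read.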

Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region

Slide81

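Worst Case (Rare): Region Miss

[Figure: lookup on a region miss.
 Address (bit positions labelled 49, 21, 10, 6, 0): Region Tag, RVA Index, Region Offset, Block Offset.
 Region Vector Array (RVA) entry: Region Tag, then block 0 … block 15, each with a way field and a V bit; the tag comparison reports “No Match!”.
 Data Array + BST Index (bit positions labelled 19, 6, 0) followed by the Block Offset.
 Block Status Table (BST): status, Ptr.
 Evicted Region Buffer (ERB): Ptr.
 Other labels: 3, 2.]

Slide82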

Methodology

Flexus simulator from CMU SimFlex group
  Based on Simics full-system simulator
4-core CMP modeled after Piranha
  Private 32KB, 4-way set-associative L1 caches
  Shared 8MB, 16-way set-associative L2 cache
  64-byte blocks
Miss-rates: Functional simulation of 2 billion instructions per core
Performance and Energy: Timing simulation using SMARTS sampling methodology
Area and Power: Full custom implementation on 130nm commercial technology
9 commercial workloads:
  WEB: SpecWEB on Apache and Zeus
  OLTP: TPC-C on DB2 and Oracle
  DSS: 5 TPC-H queries on DB2

[Figure: four processors (P), each with D$ and I$, an Interconnect, and the L2]

Slide83

Miss-Rates vs. Area

Sector Cache: 512KB sectors, SPC and RT: 1KB regions
Trade-offs comparable to conventional cache

[Figure: Relative Miss-Rate vs. Relative Tag Array Area; "better" direction marked; point labels: Sector Cache (0.25, 1.26), 14-way, 15-way, 52-way, 48-way]

Slide84

Performance & Energy

12-way set-associative RegionTracker: 20% less area
Performance within 1%, with 33% tag energy reduction

[Figure: two panels, Performance (Normalized Execution Time) and Energy (Reduction in Tag Energy); "better" direction marked on each; error bars: 95% confidence interval]

Slide85

Road Map

Introduction
Goals
Coarse-Grain Cache Designs
RegionTracker: A Tag Array Replacement
RegionTracker: An Optimization Framework
Conclusion

Slide86

RegionTracker: An Optimization Framework

Stealth Prefetching:
  Average 20% performance improvement
  Drop-in RegionTracker for 36% less area overhead
RegionScout:
  In-depth analysis

[Figure: four L1s; RVA, ERB, BST and the Data Array]

Slide87

Snoop Coherence: Common Case

Many snoops are to non-shared regions

[Figure: three CPUs and Main Memory; labels: Read x, Read x+1, Read x+2, Read x+n, miss, miss]

Slide88

RegionScout

Eliminate broadcasts for non-shared regions

[Figure: three CPUs and Main Memory; labels: Read x, Miss, Region Miss, Global Region Miss, Non-Shared Regions, Locally Cached Regions]

Slide89

RegionTracker Implementation

Minimal overhead to support RegionScout optimization
Still uses less area than conventional tag array
Non-Shared Regions: Add 1 bit to each RVA entry
Locally Cached Regions: Already provided by RVA

Slide90

RegionTracker + RegionScout

4 processors, 512KB L2 Caches, 1KB regions
Avoid 41% of Snoop Broadcasts, no area overhead compared to conventional tag array

[Figure: Reduction in Snoop Broadcasts; "better" direction marked]

Slide91

Result Summary

Replace Conventional Tag Array:
  20% Less tag area
  33% Less tag energy
  Within 1% of original performance
Coarse-Grain Optimization Framework:
  36% reduction in area overhead for Stealth Prefetching
  Filter 41% of snoop broadcasts with no area overhead compared to conventional cache