Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures
Nikos Hardavellas, Northwestern University
Team: M. Ferdman, B. Falsafi, A. Ailamaki
Northwestern, Carnegie Mellon, EPFL
Moore's Law Is Alive And Well
[Figure: a 90nm transistor (Intel, 2005) shown alongside the swine flu A/H1N1 virus (CDC) for scale]
Process roadmap: 65nm (2007), 45nm (2010), 32nm (2013), 22nm (2016), 16nm (2019)
Device scaling continues for at least another 10 years
Moore's Law Is Alive And Well
"Good days ended Nov. 2002" [Yelick09]
"New" Moore's Law: 2x cores with every generation
On-chip cache grows commensurately to supply all cores with data
Larger Caches Are Slower Caches
[Figure: access latency grows with cache size]
Increasing access latency forces caches to be distributed
Cache Design Trends
As caches become bigger, they get slower: split the cache into smaller "slices"
Balance cache slice access latency with network latency
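To make the slice-size trade-off concrete, here is a rough back-of-the-envelope latency model in C. This is only a sketch: the 6-cycle slice access, 2-cycle router, and 1-cycle link values are assumptions loosely echoing the methodology slide, not measurements.

#include <stdio.h>

/* Rough NUCA access-latency model: reaching a slice costs its own access
   time plus a round trip over the on-chip network. Smaller slices are
   faster to access, but blocks on remote slices pay a per-hop penalty. */
int nuca_latency(int slice_cycles, int hops, int router_cycles, int link_cycles)
{
    return slice_cycles + 2 * hops * (router_cycles + link_cycles);
}

int main(void)
{
    /* assumed: 6-cycle slice access, 2-cycle router, 1-cycle link */
    for (int hops = 0; hops <= 4; hops++)
        printf("hops=%d  latency=%d cycles\n", hops, nuca_latency(6, hops, 2, 1));
    return 0;
}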
Modern Caches: Distributed
Split cache into "slices", distribute across the die
[Diagram: tiled chip, one L2 slice per core]
Data Placement Determines Performance
[Diagram: tiled multicore, each tile pairing a core with a cache slice (L2)]
Goal: place data on chip close to where they are used
Our Proposal: R-NUCA (Reactive Nonuniform Cache Architecture)
Data may exhibit arbitrarily complex behaviors... but few that matter!
Learn the behaviors at run time & exploit their characteristics
Make the common case fast, the rare case correct
Resolve conflicting requirements
Reactive Nonuniform Cache Architecture
Cache accesses can be classified at run time; each class is amenable to a different placement
Per-class block placement: simple, scalable, transparent; no need for HW coherence mechanisms at the LLC
Up to 32% speedup (17% on average); within 5% on average of an ideal cache organization
Rotational interleaving: data replication and fast single-probe lookup
[Hardavellas et al., ISCA 2009] [Hardavellas et al., IEEE Micro Top Picks 2010]
Outline
Introduction
Why do Cache Accesses Matter?
Access Classification and Block Placement
Reactive NUCA Mechanisms
Evaluation
Conclusion
Cache Accesses Dominate Execution
Bottleneck shifts from memory to L2-hit stalls
[Chart: execution-time breakdown; 4-core CMP, DSS: TPC-H on DB2, 1GB database; lower is better; "Ideal" shown for reference]
[Hardavellas et al., CIDR 2007]
How Much Do We Lose?
We lose half the potential throughput
[Chart: throughput; 4-core CMP, DSS: TPC-H on DB2, 1GB database; higher is better]
Outline
Introduction
Why do Cache Accesses Matter?
Access Classification and Block Placement
Reactive NUCA Mechanisms
Evaluation
Conclusion
Terminology: Data Types
Private: read or written by a single core
Shared Read-Only: read by multiple cores, never written
Shared Read-Write: read and written by multiple cores
[Diagram: cores and L2 slices illustrating the three access patterns]
Distributed Shared L2
[Diagram: tiled multicore; blocks are address-interleaved across all L2 slices]
Block placement: address mod <#slices>
Unique location for any block (private or shared)
Maximum capacity, but slow access (30+ cycles)
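As a rough illustration of the "address mod <#slices>" placement rule, a minimal C sketch follows. The block size, slice count, and names are assumptions for the example, not the actual hardware.

#include <stdint.h>

#define BLOCK_BITS 6    /* assumed 64-byte cache blocks */
#define NUM_SLICES 16   /* assumed one L2 slice per tile on a 16-tile CMP */

/* Every block has exactly one home slice, derived from its address. */
static inline unsigned home_slice(uint64_t paddr)
{
    uint64_t block_addr = paddr >> BLOCK_BITS;
    return (unsigned)(block_addr % NUM_SLICES);   /* address mod <#slices> */
}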
Distributed Private L2
[Diagram: tiled multicore; on every access, data are allocated at the local L2 slice]
Private data: allocated at the local L2 slice
Fast access to core-private data
Distributed Private L2: Shared Read-Only Access
[Diagram: on every access, data are allocated at the local L2 slice]
Shared read-only data: replicated across L2 slices
Wastes capacity due to replication
Distributed Private L2: Shared Read-Write Access
[Diagram: on every access, data are allocated at the local L2 slice; writes are kept coherent through a directory (dir)]
Shared read-write data: maintain coherence via indirection (directory)
Slow for shared read-write data
Wastes capacity (directory overhead) and bandwidth
Conventional Multi-Core Caches
Shared: address-interleave blocks; high capacity, but slow access
Private: each block cached locally; fast (local) access, but low capacity (replicas) and coherence via indirection (distributed directory)
We want: high capacity (shared) + fast access (private)
Where to Place the Data?
Close to where they are used!
Accessed by a single core: migrate locally
Accessed by many cores: replicate (?)
  If read-only, replication is OK
  If read-write, coherence becomes a problem
  Low reuse: evenly distribute (share) across sharers
[Chart: placement decision (migrate / replicate / share) as a function of number of sharers and read-write behavior]
Methodology
Flexus: full-system cycle-accurate timing simulation
[Hardavellas et al., SIGMETRICS-PER 2004; Wenisch et al., IEEE Micro 2006]
Model parameters:
  Tiled CMP, LLC = L2
  Server/scientific workloads: 16 cores, 1MB/core
  Multi-programmed workloads: 8 cores, 3MB/core
  OoO cores, 2GHz, 96-entry ROB
  Folded 2D torus; 2-cycle router, 1-cycle link; 45ns memory
Workloads:
  OLTP: TPC-C 3.0, 100 warehouses (IBM DB2 v8, Oracle 10g)
  DSS: TPC-H queries 6, 8, 13 (IBM DB2 v8)
  Web: SPECweb99 on Apache 2.0
  Multi-programmed: SPEC CPU2000
  Scientific: em3d
Cache Access Classification Example
[Bubble chart: x axis = number of sharers, y axis = % read-write blocks in bubble]
Each bubble: cache blocks shared by x cores
Size of bubble proportional to % of L2 accesses
y axis: % of blocks in the bubble that are read-write
Cache Access Clustering
[Bubble charts for server apps and scientific/multi-programmed apps: % read-write blocks vs. number of sharers, annotated with migrate / share (address-interleave) / replicate regions]
Accesses naturally form 3 clusters: migrate locally (R/W, single sharer), share via address-interleaving (R/W, many sharers), replicate (R/O)
Instruction Replication
Instruction working set too large for one cache slice
Distribute instructions within a cluster of neighboring slices, replicate across clusters
[Diagram: instruction blocks spread across clusters of neighboring L2 slices]
Reactive NUCA in a Nutshell
Classify accesses:
  Private data: like the private scheme (migrate to local slice)
  Shared data: like the shared scheme (address-interleave)
  Instructions: controlled replication (middle ground)
To place cache blocks, we first need to classify them (see the placement sketch below)
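A minimal sketch of the per-class placement decision, assuming the class is already known from the OS/TLB classification. The names, the neighbor_in_cluster helper, and the tile count are hypothetical, not the paper's implementation.

#include <stdint.h>

typedef enum { CLASS_PRIVATE, CLASS_SHARED, CLASS_INSTRUCTION } access_class_t;

/* Hypothetical helper: picks one of the requester's nearest-neighbor slices
   from low-order block-address bits (see the rotational-interleaving sketch). */
extern unsigned neighbor_in_cluster(unsigned requesting_tile, uint64_t block_addr);

unsigned place_block(access_class_t cls, uint64_t block_addr,
                     unsigned requesting_tile, unsigned num_tiles)
{
    switch (cls) {
    case CLASS_PRIVATE:       /* private data: keep in the requester's local slice */
        return requesting_tile;
    case CLASS_SHARED:        /* shared data: address-interleave across all slices */
        return (unsigned)(block_addr % num_tiles);
    case CLASS_INSTRUCTION:   /* instructions: interleave within a small neighbor cluster */
        return neighbor_in_cluster(requesting_tile, block_addr);
    }
    return requesting_tile;
}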
Outline
Introduction
Access Classification and Block Placement
Reactive NUCA Mechanisms
Evaluation
Conclusion
Classification Granularity
Per-block classification:
  High area/power overhead (would cut L2 size by half)
  High latency (indirection through a directory)
Per-page classification (utilize the OS page table):
  Persistent structure
  Core accesses the page table for every access anyway (via the TLB)
  Utilizes already existing SW/HW structures and events
  Page classification is accurate (<0.5% error)
Classify entire data pages; use the page table/TLB for bookkeeping
Classification Mechanisms
Instruction classification: all accesses from the L1-I (per-block)
Data classification: private/shared, per page, at TLB-miss time
On 1st access (TLB miss by core i): OS marks page A as Private to "i"
On access by another core (TLB miss by core j ≠ i): OS re-marks page A as Shared
Bookkeeping through the OS page table and TLB (a sketch of the handler follows)
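A hedged sketch of what the OS-side classification step might look like in the TLB-miss handler. The structure and names are assumptions; the real OS integration differs.

#include <stdbool.h>

typedef enum { PAGE_INVALID, PAGE_PRIVATE, PAGE_SHARED } page_class_t;

typedef struct {
    page_class_t cls;    /* the 2-bit P/S/I class kept in the page table entry */
    unsigned     owner;  /* L2 id of the first core to touch the page */
} page_info_t;

/* Called during a TLB miss by core core_id. Returns true when the page
   flips from private to shared, i.e. the previous owner's TLB entry must
   be invalidated and its cached copies evicted/moved. */
bool classify_on_tlb_miss(page_info_t *page, unsigned core_id)
{
    if (page->cls == PAGE_INVALID) {        /* first access: page becomes private */
        page->cls   = PAGE_PRIVATE;
        page->owner = core_id;
        return false;
    }
    if (page->cls == PAGE_PRIVATE && page->owner != core_id) {
        page->cls = PAGE_SHARED;            /* touched by another core: re-classify */
        return true;
    }
    return false;                           /* classification already correct */
}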
Page Table and TLB Extensions
Page table entry: vpage, ppage, P/S/I class (2 bits), L2 id (log(n) bits)
TLB entry: vpage, ppage, P/S bit (1 bit)
Page granularity allows simple + practical HW
Core accesses the page table for every access anyway (TLB)
Pass information from the "directory" to the core
Utilize already existing SW/HW structures and events (a C rendering of the entries follows)
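A rough C rendering of the extended entries. The class width (2 bits) and L2 id width (log(n) bits, 4 for 16 tiles) follow the slide; the page-number widths, field names, and use of bit-fields are assumptions for illustration only.

#include <stdint.h>

struct rnuca_pte {               /* extended page table entry */
    uint64_t vpage : 36;         /* virtual page number (width assumed) */
    uint64_t ppage : 36;         /* physical page number (width assumed) */
    uint64_t cls   : 2;          /* P/S/I: private, shared, or instruction page */
    uint64_t l2_id : 4;          /* owning slice; log2(16) = 4 bits for 16 tiles */
};

struct rnuca_tlb_entry {         /* extended TLB entry */
    uint64_t vpage  : 36;
    uint64_t ppage  : 36;
    uint64_t shared : 1;         /* P/S: 0 = private, 1 = shared */
};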
Data Class Bookkeeping and Lookup
[Diagram: page table and TLB entries (vpage, ppage, class, L2 id); physical address split into tag, cache index, and offset]
Private data: place in the local L2 slice
Shared data: place in the aggregate L2 (address-interleave); a lookup sketch follows
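Putting the two placement rules together, a minimal lookup sketch: the P/S bit delivered by the TLB decides whether the access goes to the local slice or to the address-interleaved home slice. Names and parameters are assumptions.

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BITS 6
#define NUM_SLICES 16

unsigned data_lookup_slice(uint64_t paddr, bool page_is_shared, unsigned local_slice)
{
    if (!page_is_shared)
        return local_slice;                                  /* private: local slice */
    return (unsigned)((paddr >> BLOCK_BITS) % NUM_SLICES);   /* shared: aggregate L2 */
}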
Coherence: No Need for HW Mechanisms at the LLC
Private data: local slice
Shared data: address-interleaved
Reactive NUCA placement guarantee: each R/W datum is in a unique & known location
Fast access, eliminates HW overhead, simple
Instructions Lookup: Rotational Interleaving
[Diagram: grid of tiles labeled with rotational IDs (RIDs) 0-3; lookup combines the requesting tile's RID with log2(k) address bits to pick a neighbor]
Size-4 clusters: local slice + 3 neighbors
Each slice caches the same blocks on behalf of any cluster it belongs to
Equal capacity pressure at overlapped slices
Fast access (nearest-neighbor, simple lookup)
Balance access latency with capacity constraints (a lookup sketch follows)
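A hedged sketch of a rotational-interleaving lookup for size-4 clusters. It assumes a torus where RIDs are assigned so that every tile plus three of its neighbors cover RIDs 0-3 exactly once; the particular RID formula and cluster shape below are illustrative, and the paper's exact assignment may differ.

#include <stdint.h>

#define WIDTH  4
#define HEIGHT 4

/* One RID assignment with the required property: within any 2x2 window of
   tiles (a size-4 cluster), the four RIDs 0..3 each appear exactly once. */
static unsigned rid_of(unsigned row, unsigned col)
{
    return (col + 2 * row) % 4;
}

/* Returns the tile (row * WIDTH + col) that caches the instruction block:
   two address bits select the target RID, and the requester probes the
   single slice in its cluster that carries that RID. */
unsigned instr_lookup_tile(unsigned row, unsigned col, uint64_t block_addr)
{
    unsigned target_rid = (unsigned)(block_addr & 3);   /* log2(k) = 2 bits for k = 4 */
    unsigned cand[4][2] = {                             /* assumed cluster: self + E, S, SE */
        { row, col },
        { row, (col + 1) % WIDTH },
        { (row + 1) % HEIGHT, col },
        { (row + 1) % HEIGHT, (col + 1) % WIDTH },
    };
    for (int i = 0; i < 4; i++)
        if (rid_of(cand[i][0], cand[i][1]) == target_rid)
            return cand[i][0] * WIDTH + cand[i][1];
    return row * WIDTH + col;   /* unreachable when RIDs cover 0..3 */
}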
Outline
Introduction
Access Classification and Block Placement
Reactive NUCA Mechanisms
Evaluation
Conclusion
Evaluation
[Chart: speedup of Shared (S), R-NUCA (R), and Ideal (I) configurations across workloads]
Delivers robust performance across workloads
vs. Shared: same for Web, DSS; 17% better for OLTP, MIX
vs. Private: 17% better for OLTP, Web, DSS; same for MIX
Conclusions
Data may exhibit arbitrarily complex behaviors... but few that matter!
Learn the behaviors that matter at run time
Make the common case fast, the rare case correct
Reactive NUCA: near-optimal cache block placement
  Simple, scalable, low-overhead, transparent, no coherence mechanisms at the LLC
  Robust performance: matches the best alternative, or 17% better; up to 32%
  Near-optimal placement (within 5% on average of ideal)
Thank You!
For more information: http://www.eecs.northwestern.edu/~hardav/
N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. IEEE Micro Top Picks, Vol. 30(1), pp. 20-28, January/February 2010.
N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. ISCA 2009.
BACKUP SLIDES
Why Are Caches Growing So Large?
Increasing number of cores: cache grows commensurately (fewer but faster cores have the same effect)
Increasing datasets: growing faster than Moore's Law!
Power/thermal efficiency: caches are "cool", cores are "hot", so it's easier to fit more cache in a power budget
Limited bandwidth: a larger cache keeps more data on chip, so off-chip pins are used less frequently
Backup Slides: ASR
ASR vs. R-NUCA Configurations
                      ASR-1      ASR-2      R-NUCA
Core Type             In-Order   OoO        OoO
L2 Size (MB)          4          16         16
Memory (cycles)       150        500        90
Local L2 (cycles)     12         20         16
Avg. Shared L2 (cyc)  25         44         22
(Slide annotations: memory-to-local-L2 latency ratios of 12.5x, 25.0x, 5.6x; avg.-shared-to-local ratios of 2.1x, 2.2x, and 38% higher, respectively)
ASR Design Space Search
[Chart: ASR design space search results]
Backup Slides: Prior Work
Prior Work
Several proposals for CMP cache management: ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA...
...but they suffer from shortcomings:
  Complex, high-latency lookup/coherence
  Don't scale
  Lower effective cache capacity
  Optimize only for a subset of accesses
We need: a simple, scalable mechanism for fast access to all data
Shortcomings of Prior Work
L2-Private: wastes capacity; high latency (3 slice accesses + 3 hops on shared data)
L2-Shared: high latency
Cooperative Caching: doesn't scale (centralized tag structure)
CMP-NuRapid: high latency (pointer dereference, 3 hops on shared data)
OS-managed L2: wastes capacity (migrates all blocks); spilling to neighbors is useless (all cores run the same code)
Shortcomings of Prior Work (cont.)
D-NUCA: no practical implementation (lookup?)
Victim Replication: high latency (like L2-Private); wastes capacity (home always stores the block)
Adaptive Selective Replication (ASR): high latency (like L2-Private); capacity pressure (replicates at slice granularity); complex (4 separate HW structures to bias the coin)
Backup Slides: Classification and Lookup
Data Classification Timeline
[Diagram: Core i misses in its TLB on "Ld A"; the OS marks the page private to i (vpage, ppage, i, P) and A is allocated in i's local L2 slice. Later, Core j (i ≠ j) misses on "Ld A"; the OS re-marks the page shared (vpage, ppage, S), invalidates the entry in Core i's TLB, and A is evicted from i's slice. Subsequent accesses (e.g., by Core k) allocate A at its address-interleaved location and are served from there.]
Fast & simple lookup for data
Misclassifications at Page Granularity
[Chart: accesses from pages with multiple access types vs. actual access misclassifications]
A page may service multiple access types
But one type always dominates accesses
Classification at page granularity is accurate
Backup Slides: Placement
Private Data Placement
Store in the local L2 slice (like in a private cache)
Spill to neighbors if the working set is too large? NO!!! Each core runs similar threads
Private Data Working Set
OLTP: small per-core working set (3MB / 16 cores ≈ 200KB/core)
Web: primary working set <6KB/core; remaining data account for <1.5% of L2 references
DSS: policy doesn't matter much (>100MB working set, <13% of L2 references, very low reuse on private data)
Shared Data Placement
Address-interleave in the aggregate L2 (like a shared cache)
Read-write + large working set + low reuse: unlikely to still be in the local slice for reuse
Also, the next sharer is random [WMPI'04]
Shared Data Working Set
[Chart: shared data working set sizes]
Instruction Placement
Working set too large for one slice
Slices store private & shared data too!
Sufficient capacity with 4 L2 slices
Share in clusters of neighbors, replicate across clusters
Instructions Working Set
[Chart: instruction working set sizes]
Backup Slides: Rotational Interleaving
Instruction Classification and Lookup
Identification: all accesses from the L1-I
But the working set is too large to fit in one cache slice
Share within a cluster of neighbors, replicate across clusters
[Diagram: instruction blocks distributed over clusters of neighboring L2 slices]
Rotational Interleaving
[Diagram: 32-tile grid labeled with TileIDs 0-31 and RotationalIDs 0-3; RIDs advance by +1 along one dimension and by +log2(k) along the other]
Fast access (nearest-neighbor, simple lookup)
Equalize capacity pressure at overlapping slices
Nearest-Neighbor Size-8 Clusters
[Diagram: size-8 clusters of nearest-neighbor slices]