Presentation Transcript

Slide 1

Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures

Nikos Hardavellas, Northwestern University

Team: M. Ferdman, B. Falsafi, A. Ailamaki

Northwestern, Carnegie Mellon, EPFL

Slide 2

Moore’s Law Is Alive And Well

[Figure: a 90nm transistor (Intel, 2005) shown at comparable scale to the Swine Flu A/H1N1 virus (CDC)]

Process roadmap: 65nm (2007), 45nm (2010), 32nm (2013), 22nm (2016), 16nm (2019)

Device scaling continues for at least another 10 years

Slide 3

Moore's Law Is Alive And Well

Good days ended Nov. 2002 [Yelick09]

"New" Moore's Law: 2x cores with every generation

On-chip cache grows commensurately to supply all cores with data

Slide 4

Larger Caches Are Slower Caches

[Plot: access latency vs. cache size — large caches mean slow access]

Increasing access latency forces caches to be distributed

Slide 5

Cache Design Trends

As caches become bigger, they get slower. Trend: split the cache into smaller "slices", balancing slice access time against network latency.

Slide 6

Modern Caches: Distributed

Split cache into "slices", distribute across die

[Diagram: tiled die with cores and per-tile L2 slices]

Slide 7

Data Placement Determines Performance

[Diagram: 8x4 tiled multicore; each tile pairs a core with an L2 cache slice]

Goal: place data on chip close to where they are used

Slide 8

Our Proposal: R-NUCA (Reactive Nonuniform Cache Architecture)

Data may exhibit arbitrarily complex behaviors... but few that matter!

Learn the behaviors at run time & exploit their characteristics

Make the common case fast, the rare case correct

Resolve conflicting requirements

Slide 9

Reactive Nonuniform Cache Architecture

Cache accesses can be classified at run time; each class is amenable to a different placement.

Per-class block placement: simple, scalable, transparent; no need for HW coherence mechanisms at the LLC.

Up to 32% speedup (17% on average); within 5% on average of an ideal cache organization.

Rotational interleaving: data replication and fast single-probe lookup.

[Hardavellas et al, ISCA 2009] [Hardavellas et al, IEEE Micro Top Picks 2010]

Slide 10

Outline

Introduction
Why do Cache Accesses Matter?
Access Classification and Block Placement
Reactive NUCA Mechanisms
Evaluation
Conclusion

Slide 11

Cache accesses dominate execution

[Chart: execution-time breakdown on a 4-core CMP running DSS (TPC-H on DB2, 1GB database), with an Ideal bar; lower is better. Hardavellas et al, CIDR 2007]

Bottleneck shifts from memory stalls to L2-hit stalls

Slide 12

How much do we lose?

[Chart: achieved vs. potential throughput on a 4-core CMP running DSS (TPC-H on DB2, 1GB database); higher is better]

We lose half the potential throughput

Slide 13

Outline

Introduction
Why do Cache Accesses Matter?
Access Classification and Block Placement
Reactive NUCA Mechanisms
Evaluation
Conclusion

Slide 14

Terminology: Data Types

Private: read or written by a single core
Shared Read-Only: read by multiple cores, never written
Shared Read-Write: read and written by multiple cores

[Diagram: cores and L2 slices illustrating each access pattern]

Slide 15

Distributed shared L2

[Diagram: tiled multicore with one logically-shared L2 spread across all slices]

Unique location for any block (private or shared): home slice = address mod <#slices>

Maximum capacity, but slow access (30+ cycles)
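To make the indexing concrete, here is a minimal C sketch of the home-slice computation; the slice count and block size are our assumptions, not values from the talk.

#include <stdint.h>

#define NUM_SLICES 16   /* one L2 slice per tile; 16-tile CMP assumed */
#define BLOCK_BITS  6   /* 64-byte cache blocks assumed */

/* Shared NUCA: each block has a unique home slice, chosen by
 * interleaving block addresses across all slices ("address mod #slices"). */
static inline unsigned home_slice(uint64_t paddr)
{
    uint64_t block = paddr >> BLOCK_BITS;   /* strip the block offset */
    return (unsigned)(block % NUM_SLICES);
}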

Slide 16

Distributed private L2

[Diagram: tiled multicore; each core allocates data in its own L2 slice]

On every access, allocate the data at the local L2 slice

Fast access to core-private data: private data live in the local slice

Slide 17

Distributed private L2: shared-RO access

[Diagram: the same shared read-only block replicated in multiple L2 slices]

Shared read-only data get replicated across L2 slices: wastes capacity due to replication

Slide 18

Distributed private L2: shared-RW access

[Diagram: a shared read-write access indirecting through a directory (dir) tile to reach the current copy]

Shared read-write data: maintain coherence via indirection through a directory (dir)

Slow for shared read-write; wastes capacity (directory overhead) and bandwidth

Slide 19

Conventional Multi-Core Caches

[Diagram: shared organization vs. private organization with a distributed directory]

Shared: address-interleave blocks across slices. High capacity, but slow access.

Private: each block cached locally. Fast (local) access, but low capacity (replicas) and coherence via indirection (distributed directory).

We want: high capacity (shared) + fast access (private)

Slide 20

Where to Place the Data?

Close to where they are used!

Accessed by a single core: migrate locally
Accessed by many cores: replicate (?)
If read-only, replication is OK
If read-write, coherence becomes a problem
Low reuse: evenly distribute across sharers

[Chart: policy regions (migrate, replicate, share) over number of sharers vs. read-write/read-only]

Slide 21

Methodology

Flexus: full-system cycle-accurate timing simulation [Hardavellas et al, SIGMETRICS-PER 2004; Wenisch et al, IEEE Micro 2006]

Model parameters:
Tiled CMP, LLC = L2
Server/scientific workloads: 16 cores, 1MB/core
Multi-programmed workloads: 8 cores, 3MB/core
OoO cores, 2GHz, 96-entry ROB
Folded 2D torus; 2-cycle router, 1-cycle link
45ns memory

Workloads:
OLTP: TPC-C 3.0, 100 warehouses (IBM DB2 v8, Oracle 10g)
DSS: TPC-H queries 6, 8, 13 (IBM DB2 v8)
Web: SPECweb99 on Apache 2.0
Multi-programmed: SPEC CPU2000
Scientific: em3d

Slide 22

Cache Access Classification Example

[Bubble chart: each bubble is the set of cache blocks shared by x cores (x axis); bubble size is proportional to % of L2 accesses; y axis is the % of blocks in the bubble that are read-write]
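For readers who want to reproduce this kind of chart, a toy C sketch of the underlying bookkeeping (not the actual Flexus methodology; array sizes, core count, and block size are our assumptions):

#include <stdint.h>

enum { MAX_BLOCKS = 1 << 20, BLOCK_BITS = 6, NUM_CORES = 16 };

static uint16_t sharers[MAX_BLOCKS];  /* bitmask of cores touching each block */
static uint8_t  written[MAX_BLOCKS];  /* block ever written? */

/* Replay one trace record. After the whole trace, a block's class is:
 * popcount(sharers) == 1 -> private; else written ? shared-RW : shared-RO.
 * Bucketing blocks by popcount(sharers) yields the bubbles. */
static void record(unsigned core, uint64_t paddr, int is_write)
{
    uint64_t b = (paddr >> BLOCK_BITS) & (MAX_BLOCKS - 1);
    sharers[b] |= (uint16_t)(1u << core);
    written[b] |= (uint8_t)(is_write != 0);
}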

Slide 23

Cache Access Clustering

[Bubble charts (% RW blocks in bubble vs. number of sharers) for server apps and for scientific/multi-programmed apps, annotated with the policies: migrate locally, share (addr-interleave), replicate]

Accesses naturally form 3 clusters, one per placement policy

Slide 24

Instruction Replication

[Diagram: instruction blocks distributed over clusters of neighboring slices, with the pattern replicated across clusters]

Instruction working set too large for one cache slice: distribute within a cluster of neighbors, replicate across clusters

Slide 25

Reactive NUCA in a Nutshell

To place cache blocks, we first need to classify them. Classify accesses:
private data: like the private scheme (migrate)
shared data: like the shared scheme (interleave)
instructions: controlled replication (middle ground)
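As a compact summary, a hypothetical C sketch of the per-class placement decision (the instruction case is simplified to fixed aligned clusters; real R-NUCA uses rotational interleaving, covered later):

#include <stdint.h>

typedef enum { CLASS_PRIVATE, CLASS_SHARED, CLASS_INSTRUCTION } access_class_t;

enum { NUM_SLICES = 16, BLOCK_BITS = 6, CLUSTER = 4 };  /* assumed sizes */

/* R-NUCA per-class placement: private data go to the local slice, shared
 * data interleave across all slices, instructions interleave within a
 * small cluster of neighboring slices (replicated across clusters). */
unsigned place_block(access_class_t cls, uint64_t paddr, unsigned local_tile)
{
    uint64_t block = paddr >> BLOCK_BITS;
    switch (cls) {
    case CLASS_PRIVATE:
        return local_tile;                 /* migrate locally */
    case CLASS_SHARED:
        return (unsigned)(block % NUM_SLICES);  /* address-interleave */
    case CLASS_INSTRUCTION:
        /* pick one of the CLUSTER slices in this core's neighborhood */
        return (local_tile & ~(CLUSTER - 1)) | (unsigned)(block % CLUSTER);
    }
    return local_tile;  /* unreachable */
}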

Slide 26

Outline

Introduction
Access Classification and Block Placement
Reactive NUCA Mechanisms
Evaluation
Conclusion

Slide 27

Classification Granularity

Per-block classification:
High area/power overhead (cuts effective L2 size by half)
High latency (indirection through a directory)

Per-page classification (utilizing the OS page table):
Persistent structure
The core accesses the page table on every access anyway (via the TLB)
Utilizes already existing SW/HW structures and events
Page-level classification is accurate (<0.5% error)

Classify entire data pages, using the page table/TLB for bookkeeping

Slide 28

Classification Mechanisms

Instruction classification: all accesses from the L1-I (per-block)
Data classification: private/shared, per page, at TLB-miss time

[Diagram: on core i's first TLB miss to page A, the OS marks A "private to i"; on a later TLB miss to A from another core j, the OS reclassifies A as shared]

Bookkeeping through the OS page table and TLB
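The private-to-shared transition reads as a small state machine in the OS TLB-miss handler. The sketch below is ours (hypothetical structures and field names); the shootdown and eviction steps are only indicated in comments.

#include <stdint.h>
#include <stdbool.h>

typedef enum { PAGE_PRIVATE, PAGE_SHARED } page_class_t;

struct pte {             /* hypothetical software PTE with R-NUCA fields */
    uint64_t     ppage;
    page_class_t cls;
    unsigned     owner;  /* L2 id of the owning tile while private */
};

/* Called from the OS TLB-miss handler: classify the page for R-NUCA.
 * First touch marks the page private to the requesting core; a later
 * touch by a different core reclassifies it as shared (which also
 * requires shooting down the old owner's TLB entry and evicting the
 * page's blocks from its slice, omitted here). */
void classify_on_tlb_miss(struct pte *p, unsigned core, bool first_touch)
{
    if (first_touch) {
        p->cls   = PAGE_PRIVATE;
        p->owner = core;
    } else if (p->cls == PAGE_PRIVATE && p->owner != core) {
        p->cls = PAGE_SHARED;   /* private -> shared, one-way transition */
        /* tlb_shootdown(p->owner); flush_slice_blocks(p->owner, p); */
    }
    /* The TLB fill then carries the class bits down to the core. */
}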

Slide 29

Page Table and TLB Extensions

Page table entry: vpage, ppage, P/S/I class (2 bits), L2 id (log(n) bits)
TLB entry: vpage, ppage, P/S bit (1 bit)

The core accesses the page table on every access anyway (via the TLB), so the classification passes from the "directory" (page table) to the core for free, utilizing already existing SW/HW structures and events.

Page granularity allows simple + practical HW
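In C, the extensions might look like the sketch below; the field widths (16 tiles, so log2(n) = 4 bits of L2 id) are our assumptions, and a real implementation would pack these as bitfields in the existing entry formats.

#include <stdint.h>

struct rnuca_pte {          /* page table entry, extended for R-NUCA */
    uint64_t ppage;         /* physical page number */
    uint8_t  cls;           /* 2 bits: Private / Shared / Instruction */
    uint8_t  l2_id;         /* log2(n) bits: owner slice while private */
};

struct rnuca_tlb_entry {    /* TLB entry needs just one extra class bit */
    uint64_t vpage, ppage;
    uint8_t  shared;        /* 1 bit: P/S */
    uint8_t  l2_id;         /* owner slice, filled from the PTE */
};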

Slide 30

Data Class Bookkeeping and Lookup

Physical address: tag | cache index | offset

On a private page (TLB entry marked P): place the block in the local L2 slice
On a shared page (TLB entry marked S): place the block in the aggregate L2 (address-interleaved across slices)
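Putting the class bit and the address together, lookup is a single pure function; a minimal sketch with assumed geometry:

#include <stdint.h>

enum { NUM_SLICES = 16, BLOCK_BITS = 6 };   /* assumed CMP geometry */

/* Where to look for (or allocate) the block holding paddr, given the
 * private/shared bit the TLB already supplies with the translation. */
unsigned data_lookup_slice(uint64_t paddr, int shared, unsigned local_tile)
{
    if (!shared)
        return local_tile;  /* private: always the local slice */
    return (unsigned)((paddr >> BLOCK_BITS) % NUM_SLICES);  /* interleave */
}

Because the class bit arrives with the translation, lookup needs no directory indirection: the requesting core probes exactly one slice.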

Slide 31

Coherence: No Need for HW Mechanisms at the LLC

[Diagram: private data in the local slice; shared data address-interleaved across slices]

Reactive NUCA placement guarantee: each R/W datum lives in a unique & known location.

Private data: local slice. Shared data: address-interleaved. Fast access, eliminates HW overhead, SIMPLE.

Slide 32

Instructions Lookup: Rotational Interleaving

[Diagram: tile grid with per-tile rotational IDs (RIDs) 0-3; RIDs advance by +1 along a row and by +log2(k) between rows, so each slice caches the same blocks on behalf of any cluster that overlaps it]

Size-4 clusters: local slice + 3 neighbors. The instruction block address (e.g., PC 0xfa480) selects a RID, and the core probes the single nearby slice holding that RID.

Fast access (nearest-neighbor, simple lookup); balances access latency with capacity constraints; equal capacity pressure at overlapped slices
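The RID math can be sketched in C for k = 4 (our simplification on a small torus; the talk's grid is larger and the real assignment works on the folded 2D torus):

#include <stdint.h>

enum { K = 4, WIDTH = 4, HEIGHT = 4, BLOCK_BITS = 6 };  /* assumed 4x4 tiles */

/* Rotational ID of a tile: +1 per column, +log2(K) = +2 per row (mod K),
 * so any 2x2 neighborhood contains all K RIDs exactly once. */
static unsigned rid(unsigned x, unsigned y) { return (x + 2 * y) % K; }

/* Instruction lookup: the block address names a RID; the core probes the
 * single slice in its 2x2 neighborhood (local slice + 3 neighbors, with
 * torus wraparound) that carries that RID. */
static void tile_for_iblock(uint64_t paddr, unsigned cx, unsigned cy,
                            unsigned *tx, unsigned *ty)
{
    unsigned want = (unsigned)((paddr >> BLOCK_BITS) % K);
    *tx = cx; *ty = cy;                    /* fallback; always overwritten */
    for (unsigned dy = 0; dy < 2; dy++)
        for (unsigned dx = 0; dx < 2; dx++) {
            unsigned x = (cx + dx) % WIDTH, y = (cy + dy) % HEIGHT;
            if (rid(x, y) == want) { *tx = x; *ty = y; return; }
        }
}

Since every 2x2 window holds each RID exactly once, each core finds its copy one hop away at most, and overlapping clusters spread blocks evenly, which is what equalizes capacity pressure.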

Slide 33

Outline

Introduction
Access Classification and Block Placement
Reactive NUCA Mechanisms
Evaluation
Conclusion

Slide 34

Evaluation

[Chart: speedup of Shared (S), R-NUCA (R), and Ideal (I) organizations across workloads]

vs. Shared: same for Web and DSS; 17% faster for OLTP and MIX
vs. Private: 17% faster for OLTP, Web, and DSS; same for MIX

Delivers robust performance across workloads

Slide 35

Conclusions

Data may exhibit arbitrarily complex behaviors... but few that matter! Learn the behaviors that matter at run time; make the common case fast, the rare case correct.

Reactive NUCA: near-optimal cache block placement. Simple, scalable, low-overhead, transparent, no LLC coherence mechanisms.

Robust performance: matches the best alternative or beats it by 17% on average (up to 32%); near-optimal placement (within 5% of ideal on average).

Slide 36

Thank You!

For more information: http://www.eecs.northwestern.edu/~hardav/

N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. IEEE Micro Top Picks, Vol. 30(1), pp. 20-28, January/February 2010.

N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. ISCA 2009.

Slide 37

BACKUP SLIDES

Slide 38

Why Are Caches Growing So Large?

Increasing number of cores: cache grows commensurately (fewer but faster cores have the same effect)
Increasing datasets: growing faster than Moore's Law!
Power/thermal efficiency: caches are "cool", cores are "hot", so it's easier to fit more cache in a power budget
Limited bandwidth: a larger cache means more data on chip, so off-chip pins are used less frequently

Slide 39

Backup Slides: ASR

Slide 40

ASR vs. R-NUCA Configurations

                                ASR-1      ASR-2    R-NUCA
Core Type                       In-Order   OoO      OoO
L2 Size (MB)                    4          16       16
Memory latency (cycles)         150        500      90
Local L2 latency (cycles)       12         20       16
Avg. Shared L2 latency (cycles) 25         44       22
Memory / Local L2               12.5x      25.0x    5.6x
Avg. Shared L2 / Local L2       2.1x       2.2x     1.38x (38%)

Slide 41

ASR design space search

Slide 42

Backup Slides: Prior Work

Slide 43

Prior Work

Several proposals for CMP cache management: ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA... but they suffer from shortcomings:
complex, high-latency lookup/coherence
don't scale
lower effective cache capacity
optimize only for a subset of accesses

We need: a simple, scalable mechanism for fast access to all data

Slide 44

Shortcomings of Prior Work

L2-Private: wastes capacity; high latency (3 slice accesses + 3 hops on shared data)
L2-Shared: high latency
Cooperative Caching: doesn't scale (centralized tag structure)
CMP-NuRapid: high latency (pointer dereference, 3 hops on shared data)
OS-managed L2: wastes capacity (migrates all blocks); spilling to neighbors is useless (all cores run the same code)

Slide 45

Shortcomings of Prior Work

D-NUCA: no practical implementation (lookup?)
Victim Replication: high latency (like L2-Private); wastes capacity (home always stores the block)
Adaptive Selective Replication (ASR): high latency (like L2-Private); capacity pressure (replicates at slice granularity); complex (4 separate HW structures to bias a coin)

Slide 46

Backup Slides: Classification and Lookup

Slide 47

Data Classification Timeline

[Timeline diagram: core i takes a TLB miss on Ld A; the OS fills the page table entry as (vpage, ppage, i, P) and A is allocated in i's local slice. Core j (i≠j) later takes a TLB miss on Ld A; the OS rewrites the entry as shared (vpage, ppage, x, S), invalidates the entry in core i's TLB, and A is evicted from i's slice. Subsequent accesses (e.g., by core k) allocate A at its shared location and are served from there.]

Fast & simple lookup for data

Slide 48

Misclassifications at Page Granularity

[Chart: fraction of accesses coming from pages with multiple access types, i.e., access misclassifications]

A page may service multiple access types, but one type always dominates its accesses: classification at page granularity is accurate

Slide 49

Backup Slides: Placement

Slide 50

Private Data Placement

Store in the local L2 slice (as in a private cache)

Spill to neighbors if the working set is too large? NO!!! Each core runs similar threads, so neighboring slices face the same pressure.

Slide 51

Private Data Working Set

OLTP: small per-core working set (3MB / 16 cores ≈ 200KB/core)
Web: primary working set <6KB/core; the remainder is <1.5% of L2 refs
DSS: policy doesn't matter much (>100MB working set, <13% of L2 refs, very low reuse on private data)

Slide 52

Shared Data Placement

Address-interleave in the aggregate L2 (as in a shared cache)

Shared data are read-write with a large working set and low reuse, so a block is unlikely to still be in the local slice on reuse. Also, the next sharer is effectively random [WMPI'04].

Slide 53

Shared Data Working Set

Slide 54

Instruction Placement

Share in clusters of neighbors, replicate across clusters

The instruction working set is too large for one slice, and slices store private & shared data too. 4 L2 slices provide sufficient capacity.

Slide 55

Instructions Working Set

Slide 56

Backup Slides: Rotational Interleaving

Slide 57

Instruction Classification and Lookup

Identification: all accesses from the L1-I

[Diagram: instruction blocks shared within each cluster of neighboring slices and replicated across clusters]

But the working set is too large to fit in one cache slice: share within a neighbors' cluster, replicate across clusters

Slide 58

Rotational Interleaving

[Diagram: 8x4 tile grid annotated with TileIDs 0-31 and rotational IDs 0-3; RIDs advance by +1 along a row and by +log2(k) between rows, so overlapping size-k clusters see every RID exactly once]

Fast access (nearest-neighbor, simple lookup); equalizes capacity pressure at overlapping slices

Slide 59

Nearest-neighbor size-8 clusters