Presentation Transcript

Slide 1

Samira Khan, University of Virginia, April 26, 2016

COMPUTER ARCHITECTURE

CS 6354: Prefetching

The content and concept of this course are adapted from CMU ECE 740.

Slide 2

AGENDA
- Logistics
- Review from last lecture
- Prefetching

Slide 3

LOGISTICS
- Final Presentation: April 28 and May 3
- Final Exam: May 6, 9:00 am
- Final Report Due: May 7
- Format: same as a regular paper (12 pages or less) with Introduction, Background, Related Work, Key Idea, Key Mechanism, Results, Conclusion

Slide 4

SHARED CACHES BETWEEN CORES

Advantages:
- High effective capacity
- Dynamic partitioning of available cache space
- No fragmentation due to static partitioning
- Easier to maintain coherence (a cache block is in a single location)
- Shared data and locks do not ping-pong between caches

Disadvantages:
- Slower access
- Cores incur conflict misses due to other cores' accesses
- Misses due to inter-core interference
- Some cores can destroy the hit rate of other cores
- Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)

Slide 5

Fair Shared Cache Partitioning

- Goal: Equalize the slowdowns of multiple threads sharing the cache
- Idea: Dynamically estimate slowdowns due to sharing and assign cache blocks to balance slowdowns
- Approximate slowdown with change in miss rate
  + Simple
  - Not accurate. Why?
- Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.

Slide 6

Problem with Shared Caches

[Figure: two processor cores, each with a private L1 cache, sharing an L2 cache; thread t1 runs on Core 1.]

Slide 7

Problem with Shared Caches

[Figure: the same two-core system; thread t2 runs on Core 2.]

Slide 8

Problem with Shared Caches

[Figure: t1 and t2 run together and contend for the shared L2 cache.]

t2's throughput is significantly reduced due to unfair cache sharing.

Slide 9

Problem with Shared Caches

[Figure: results illustrating the throughput loss from unfair cache sharing.]

Slide 10

Fairness Metrics

- Uniform slowdown: every thread sharing the cache should be slowed down by the same factor, i.e., T_shared_i / T_alone_i should be equal across threads i
- Minimize the difference between the threads' relative slowdowns
- Ideally the slowdowns would be measured with execution times; in practice they are approximated with miss rates (the M3 metric used in the following example compares MissRate_shared_i / MissRate_alone_i across threads)

Slide 11

Block-Granularity Partitioning

- Modified LRU cache replacement policy (G. Suh et al., HPCA 2002)
- [Figure: Current Partition P1: 448B, P2: 576B; Target Partition P1: 384B, P2: 640B. On a P2 miss, P1 holds more than its target, so the LRU block belonging to P1 is chosen as the victim.]

Slide 12

Block-Granularity Partitioning

- Modified LRU cache replacement policy (G. Suh et al., HPCA 2002)
- [Figure: the same example after the replacement; the Current Partition becomes P1: 384B, P2: 640B, matching the Target Partition P1: 384B, P2: 640B.]
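The victim choice in the figure can be approximated in a few lines of C. This is a minimal sketch under assumptions (the per-way owner/LRU fields and the per-partition block counters are hypothetical bookkeeping), not the exact hardware from Suh et al., HPCA 2002.

    /* Sketch of modified-LRU victim selection for block-granularity partitioning. */
    #include <stddef.h>

    #define WAYS 8

    typedef struct {
        int owner;      /* partition (thread) that last brought this block in */
        unsigned lru;   /* higher value = more recently used                  */
    } way_t;

    typedef struct { way_t way[WAYS]; } set_t;

    /* blocks_held[p] and target_blocks[p] are assumed per-partition counters. */
    int pick_victim(const set_t *set, int requester,
                    const size_t *blocks_held, const size_t *target_blocks)
    {
        int victim = -1;
        unsigned oldest = ~0u;
        /* If the requester already holds its target share, it replaces its own
         * LRU block; otherwise evict the LRU block of an over-target partition. */
        int replace_own = blocks_held[requester] >= target_blocks[requester];
        for (int w = 0; w < WAYS; w++) {
            int owner = set->way[w].owner;
            int eligible = replace_own ? (owner == requester)
                                       : (blocks_held[owner] > target_blocks[owner]);
            if (eligible && set->way[w].lru < oldest) {
                oldest = set->way[w].lru;
                victim = w;
            }
        }
        return victim;   /* -1: no eligible way in this set; fall back to plain LRU */
    }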

Slide 13

Dynamic Fair Caching Algorithm

Example: optimizing the M3 metric. For each thread (P1, P2) the mechanism tracks its MissRate alone, its MissRate shared, and a Target Partition, and repartitions the cache at the end of every repartitioning interval.

Slide 14

Dynamic Fair Caching Algorithm

1st interval:
- MissRate alone: P1: 20%, P2: 5%
- MissRate shared (measured during the interval): P1: 20%, P2: 15%
- Target Partition: P1: 256KB, P2: 256KB

Slide 15

Dynamic Fair Caching Algorithm

Repartition! Evaluate M3:
- P1: 20% / 20%
- P2: 15% / 5%
P2 suffers the larger relative miss-rate increase, so the Target Partition changes from P1: 256KB, P2: 256KB to P1: 192KB, P2: 320KB.
Partition granularity: 64KB.

Slide 16

Dynamic Fair Caching Algorithm

2nd interval:
- MissRate alone: P1: 20%, P2: 5%
- MissRate shared: previously P1: 20%, P2: 15%; now measured as P1: 20%, P2: 10%
- Target Partition: P1: 192KB, P2: 320KB

Slide 17

Dynamic Fair Caching Algorithm

Repartition! Evaluate M3:
- P1: 20% / 20%
- P2: 10% / 5%
P2 is still slowed down more, so the Target Partition changes from P1: 192KB, P2: 320KB to P1: 128KB, P2: 384KB.

Slide 18

Dynamic Fair Caching Algorithm

3rd interval:
- MissRate alone: P1: 20%, P2: 5%
- MissRate shared: previously P1: 20%, P2: 10%; now measured as P1: 25%, P2: 9%
- Target Partition: P1: 128KB, P2: 384KB

Slide 19

Dynamic Fair Caching Algorithm

Repartition! Do rollback if (for P2): Δ < T_rollback, where Δ = MR_old − MR_new.
Here P2's shared miss rate improved only from 10% to 9%, so the change is rolled back and the Target Partition returns from P1: 128KB, P2: 384KB to P1: 192KB, P2: 320KB.
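The repartitioning loop walked through on slides 13-19 can be summarized in a short C sketch. Everything below is an illustration under assumptions (two threads, a 64KB step, an assumed rollback threshold); it is not code from Kim et al., PACT 2004.

    #define STEP       (64 * 1024)   /* repartitioning granularity: 64KB          */
    #define T_ROLLBACK 0.02          /* required miss-rate improvement (assumed)  */

    typedef struct {
        double mr_alone;    /* stand-alone miss rate (e.g., P1: 20%, P2: 5%)      */
        double mr_shared;   /* miss rate measured during the last interval        */
        double mr_prev;     /* miss rate measured during the interval before that */
        long   target;      /* target partition size in bytes                     */
    } thread_t;

    /* Relative miss-rate increase: the quantity compared by the M3-style metric. */
    static double slowdown(const thread_t *t) { return t->mr_shared / t->mr_alone; }

    /* Rollback check for the thread that was given extra space last time:
     * if its miss rate did not improve by at least T_ROLLBACK, undo the move. */
    int should_rollback(const thread_t *helped)
    {
        return (helped->mr_prev - helped->mr_shared) < T_ROLLBACK;
    }

    /* Move one step of cache space toward the thread with the larger slowdown. */
    void repartition(thread_t *a, thread_t *b)
    {
        thread_t *slower = (slowdown(a) >= slowdown(b)) ? a : b;
        thread_t *faster = (slower == a) ? b : a;
        slower->target += STEP;
        faster->target -= STEP;
    }

At the end of each interval a controller would first apply should_rollback() to the thread helped by the previous step (as on slide 19) and otherwise call repartition().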

Slide 20

Dynamic Fair Caching Results

Improves both fairness and throughput

Slide 21

Effect of Partitioning Interval

Fine-grained partitioning is important for both fairness and throughput

Slide 22

Benefits of Fair Caching

Problems of unfair cache sharing:
- Sub-optimal throughput
- Thread starvation
- Priority inversion
- Thread-mix dependent performance

Benefits of fair caching:
- Better fairness
- Better throughput
- Fair caching likely simplifies OS scheduler design

Slide 23

Advantages/Disadvantages of the Approach

Advantages:
+ No (reduced) starvation
+ Better average throughput

Disadvantages:
- Scalable to many cores?
- Is this the best (or a good) fairness metric?
- Does this provide performance isolation in cache?
- Alone miss rate estimation can be incorrect (estimation interval different from enforcement interval)

Slide 24

Memory Latency Tolerance

Slide 25

Cache Misses Responsible for Many Stalls

[Figure: breakdown of execution-time stalls, highlighting L2 Misses.]
- 512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher
- Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model

Slide 26

Memory Latency Tolerance Techniques

Caching [initially by Wilkes, 1965]
- Widely used, simple, effective, but inefficient, passive
- Not all applications/phases exhibit temporal or spatial locality

Prefetching [initially in IBM 360/91, 1967]
- Works well for regular memory access patterns
- Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive

Multithreading [initially in CDC 6600, 1964]
- Works well if there are multiple threads
- Improving single-thread performance using multithreading hardware is an ongoing research effort

Out-of-order execution [initially by Tomasulo, 1967]
- Tolerates irregular cache misses that cannot be prefetched
- Requires extensive hardware resources for tolerating long latencies
- Runahead execution alleviates this problem (as we will see today)

Slide 27

Prefetching

Slide 28

Outline of Prefetching Lecture
- Why prefetch? Why could/does it work?
- The four questions: What (to prefetch), when, where, how
- Software prefetching
- Hardware prefetching algorithms
- Execution-based prefetching
- Prefetching performance: coverage, accuracy, timeliness; bandwidth consumption, cache pollution

Slide 29

Prefetching

- Idea: Fetch the data before it is needed (i.e., pre-fetch) by the program
- Why? Memory latency is high. If we can prefetch accurately and early enough, we can reduce/eliminate that latency.
- Can eliminate compulsory cache misses
- Can it eliminate all cache misses? Capacity, conflict?
- Involves predicting which address will be needed in the future
- Works if programs have predictable miss address patterns

Slide 30

Prefetching and Correctness

- Does a misprediction in prefetching affect correctness?
- No, prefetched data at a "mispredicted" address is simply not used
- There is no need for state recovery, in contrast to branch misprediction or value misprediction

Slide 31

Basics

- In modern systems, prefetching is usually done at cache block granularity
- Prefetching is a technique that can reduce both miss rate and miss latency
- Prefetching can be done by hardware, compiler, or programmer

Slide 32

How a HW Prefetcher Fits in the Memory System

[Figure not captured in transcript.]

Slide 33

Prefetching: The Four Questions

- What: what addresses to prefetch
- When: when to initiate a prefetch request
- Where: where to place the prefetched data
- How: software, hardware, execution-based, cooperative

Slide 34

Challenges in Prefetching: What

- What addresses to prefetch
- Prefetching useless data wastes resources: memory bandwidth, cache or prefetch buffer space, energy consumption; these could all be utilized by demand requests or more accurate prefetch requests
- Accurate prediction of addresses to prefetch is important
- Prefetch accuracy = used prefetches / sent prefetches
- How do we know what to prefetch?
  - Predict based on past access patterns
  - Use the compiler's knowledge of data structures
- The prefetching algorithm determines what to prefetch

Slide 35

Challenges in Prefetching: When

- When to initiate a prefetch request
- Prefetching too early: prefetched data might not be used before it is evicted from storage
- Prefetching too late: might not hide the whole memory latency
- When a data item is prefetched affects the timeliness of the prefetcher
- The prefetcher can be made more timely by:
  - Making it more aggressive: try to stay far ahead of the processor's access stream (hardware)
  - Moving the prefetch instructions earlier in the code (software)

Slide 36

Challenges in Prefetching: Where (I)

Where to place the prefetched data?

In the cache:
+ Simple design, no need for separate buffers
-- Can evict useful demand data → cache pollution

In a separate prefetch buffer:
+ Demand data protected from prefetches → no cache pollution
-- More complex memory system design
   - Where to place the prefetch buffer
   - When to access the prefetch buffer (parallel vs. serial with cache)
   - When to move the data from the prefetch buffer to the cache
   - How to size the prefetch buffer
   - Keeping the prefetch buffer coherent

Many modern systems place prefetched data into the cache: Intel Pentium 4, Core2's, AMD systems, IBM POWER4,5,6, ...

Slide 37

Challenges in Prefetching: Where (II)

- Which level of cache to prefetch into?
  - Memory to L2, memory to L1. Advantages/disadvantages?
  - L2 to L1? (a separate prefetcher between levels)
- Where to place the prefetched data in the cache?
  - Do we treat prefetched blocks the same as demand-fetched blocks?
  - Prefetched blocks are not known to be needed
  - With LRU, a demand block is placed into the MRU position
- Do we skew the replacement policy such that it favors the demand-fetched blocks?
  - E.g., place all prefetches into the LRU position in a way?

Slide 38

Challenges in Prefetching: Where (III)

- Where to place the hardware prefetcher in the memory hierarchy? In other words, what access patterns does the prefetcher see?
  - L1 hits and misses
  - L1 misses only
  - L2 misses only
- Seeing a more complete access pattern:
  + Potentially better accuracy and coverage in prefetching
  -- Prefetcher needs to examine more requests (bandwidth intensive, more ports into the prefetcher?)

Slide 39

Challenges in Prefetching: How

Software prefetching:
- ISA provides prefetch instructions
- Programmer or compiler inserts prefetch instructions (effort)
- Usually works well only for "regular access patterns"

Hardware prefetching:
- Hardware monitors processor accesses
- Memorizes or finds patterns/strides
- Generates prefetch addresses automatically

Execution-based prefetchers:
- A thread is executed to prefetch data for the main program
- Can be generated by either software/programmer or hardware

Slide 40

Software Prefetching (I)

- Idea: Compiler/programmer places prefetch instructions into appropriate places in code
- Mowry et al., "Design and Evaluation of a Compiler Algorithm for Prefetching," ASPLOS 1992.
- Prefetch instructions prefetch data into caches
- Compiler or programmer can insert such instructions into the program

Slide 41

X86 PREFETCH Instruction

[Figure: excerpt from the instruction specification.]
- Microarchitecture-dependent specification
- Different instructions for different cache levels
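As a concrete illustration of "different instructions for different cache levels", the snippet below (mine, not from the slides) uses the SSE _mm_prefetch intrinsic, which compiles down to the PREFETCHT0/T1/T2/NTA instructions. Which cache levels each hint actually targets is, as the slide notes, microarchitecture dependent.

    #include <xmmintrin.h>

    void touch_ahead(const double *a, long i)
    {
        _mm_prefetch((const char *)&a[i], _MM_HINT_T0);  /* prefetcht0: closest cache levels      */
        _mm_prefetch((const char *)&a[i], _MM_HINT_T1);  /* prefetcht1: typically L2 and below    */
        _mm_prefetch((const char *)&a[i], _MM_HINT_T2);  /* prefetcht2: typically last-level cache */
        _mm_prefetch((const char *)&a[i], _MM_HINT_NTA); /* prefetchnta: non-temporal, minimize pollution */
    }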

Slide 42

Software Prefetching (II)

Can work for very regular array-based access patterns. Issues:
-- Prefetch instructions take up processing/execution bandwidth
-- How early to prefetch? Determining this is difficult
   -- Prefetch distance depends on hardware implementation (memory latency, cache size, time between loop iterations) → portability?
   -- Going too far back in code reduces accuracy (branches in between)
-- Need "special" prefetch instructions in ISA?
   - Alpha: load into register 31 treated as prefetch (r31 == 0)
   - PowerPC: dcbt (data cache block touch) instruction
-- Not easy to do for pointer-based data structures

    for (i = 0; i < N; i++) {
        __prefetch(a[i+8]);
        __prefetch(b[i+8]);
        sum += a[i] * b[i];
    }

    while (p) {
        __prefetch(p->next);
        work(p->data);
        p = p->next;
    }

    while (p) {
        __prefetch(p->next->next->next);
        work(p->data);
        p = p->next;
    }

Which one is better?

Slide 43

Software Prefetching (III)

- Where should a compiler insert prefetches?
  - Prefetch for every load access? Too bandwidth intensive (both memory and execution bandwidth)
  - Profile the code and determine loads that are likely to miss. What if the profile input set is not representative?
- How far ahead of the miss should the prefetch be inserted?
  - Profile and determine the probability of use for various prefetch distances from the miss. What if the profile input set is not representative?
  - Usually need to insert a prefetch far in advance to cover 100s of cycles of main memory latency → reduced accuracy

Slide 44

Hardware Prefetching (I)

- Idea: Specialized hardware observes load/store access patterns and prefetches data based on past access behavior
- Tradeoffs:
  + Can be tuned to system implementation
  + Does not waste instruction execution bandwidth
  -- More hardware complexity to detect patterns
  -- Software can be more efficient in some cases

Slide 45

Next-Line Prefetchers

- Simplest form of hardware prefetching: always prefetch the next N cache lines after a demand access (or a demand miss)
  - Next-line prefetcher (or next-sequential prefetcher)
- Tradeoffs:
  + Simple to implement. No need for sophisticated pattern detection
  + Works well for sequential/streaming access patterns (instructions?)
  -- Can waste bandwidth with irregular patterns
  -- And, even with regular patterns:
     - What is the prefetch accuracy if access stride = 2 and N = 1?
     - What if the program is traversing memory from higher to lower addresses?
     - Also prefetch the previous N cache lines?
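A next-line prefetcher is simple enough to sketch directly. The sketch below assumes 64-byte lines and a hypothetical issue_prefetch() hook into the memory system; it illustrates the idea, not any specific design.

    #include <stdint.h>

    #define BLOCK_SIZE 64   /* cache line size in bytes (assumed) */
    #define N_NEXT      1   /* how many next lines to prefetch    */

    extern void issue_prefetch(uint64_t addr);   /* hypothetical hook */

    void on_demand_miss(uint64_t miss_addr)
    {
        uint64_t block = miss_addr & ~(uint64_t)(BLOCK_SIZE - 1);
        for (int i = 1; i <= N_NEXT; i++)
            issue_prefetch(block + (uint64_t)i * BLOCK_SIZE);  /* next N lines */
    }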

Slide 46

Stride Prefetchers

Two kinds:
- Instruction program counter (PC) based
- Cache block address based

Instruction based:
- Baer and Chen, "An effective on-chip preloading scheme to reduce data access penalty," SC 1991.
- Idea: Record the distance between the memory addresses referenced by a load instruction (i.e., the stride of the load) as well as the last address referenced by the load
- Next time the same load instruction is fetched, prefetch last address + stride

Slide 47

Instruction Based Stride Prefetching

What is the problem with this? How far can the prefetcher get ahead of the demand access stream?
- Initiating the prefetch when the load is fetched the next time can be too late → the load will access the data cache soon after it is fetched!
- Solutions:
  - Use a lookahead PC to index the prefetcher table (decouple the frontend of the processor from the backend)
  - Prefetch ahead (last address + N*stride)
  - Generate multiple prefetches

[Table, indexed by Load Inst. PC: | Load Inst. PC (tag) | Last Address Referenced | Last Stride | Confidence |]
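A minimal sketch of the PC-indexed table above: each entry holds the last address, last stride, and a confidence counter, and a prefetch for last address + stride is issued once the stride repeats. The table size, confidence threshold, and issue_prefetch() hook are illustrative assumptions, not details from Baer and Chen.

    #include <stdint.h>

    #define TABLE_SIZE 256

    typedef struct {
        uint64_t pc_tag;
        uint64_t last_addr;
        int64_t  stride;
        int      confidence;   /* small saturating counter */
    } stride_entry_t;

    static stride_entry_t table[TABLE_SIZE];

    extern void issue_prefetch(uint64_t addr);   /* hypothetical hook */

    void on_load(uint64_t pc, uint64_t addr)
    {
        stride_entry_t *e = &table[pc % TABLE_SIZE];

        if (e->pc_tag != pc) {            /* new load instruction: reset entry */
            e->pc_tag = pc;
            e->last_addr = addr;
            e->stride = 0;
            e->confidence = 0;
            return;
        }

        int64_t new_stride = (int64_t)(addr - e->last_addr);
        if (new_stride == e->stride && new_stride != 0) {
            if (e->confidence < 3) e->confidence++;
        } else {
            e->stride = new_stride;
            e->confidence = 0;
        }
        e->last_addr = addr;

        if (e->confidence >= 2)           /* stride has repeated: prefetch ahead */
            issue_prefetch(addr + (uint64_t)e->stride);
    }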

Slide 48

Cache-Block Address Based Stride Prefetching

- Can detect A, A+N, A+2N, A+3N, ...
- Stream buffers are a special case of cache-block-address-based stride prefetching where N = 1

[Table, indexed by block address: | Address tag | Stride | Control/Confidence |]

Slide 49

Stream Buffers (Jouppi, ISCA 1990)

- Each stream buffer holds one stream of sequentially prefetched cache lines
- On a load miss, check the head of all stream buffers for an address match
  - If hit, pop the entry from the FIFO and update the cache with the data
  - If not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)
- Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy

[Figure: several stream buffer FIFOs sitting between the data cache and the memory interface.]

Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA 1990.
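The lookup/allocation policy described above might look roughly like the following C sketch. The buffer count, FIFO depth, and the deliver_to_cache()/refill_from() hooks are hypothetical; this illustrates the policy, not Jouppi's actual design.

    #include <stdint.h>

    #define N_BUFFERS 4
    #define DEPTH     4
    #define BLOCK     64

    typedef struct {
        uint64_t fifo[DEPTH];   /* prefetched block addresses; fifo[head] is oldest */
        int      head, count;
        unsigned last_used;     /* for LRU recycling */
    } stream_buffer_t;

    static stream_buffer_t buf[N_BUFFERS];
    static unsigned now;

    extern void deliver_to_cache(uint64_t block_addr);                   /* hypothetical */
    extern void refill_from(stream_buffer_t *b, uint64_t next_block);    /* hypothetical */

    void on_load_miss(uint64_t addr)
    {
        uint64_t block = addr & ~(uint64_t)(BLOCK - 1);
        now++;

        for (int i = 0; i < N_BUFFERS; i++) {
            stream_buffer_t *b = &buf[i];
            if (b->count > 0 && b->fifo[b->head] == block) {
                deliver_to_cache(block);             /* hit at the head: use it */
                b->head = (b->head + 1) % DEPTH;
                b->count--;
                b->last_used = now;
                return;                              /* FIFO is topped off later */
            }
        }

        /* Miss in all buffers: recycle the LRU buffer for the new stream. */
        stream_buffer_t *lru = &buf[0];
        for (int i = 1; i < N_BUFFERS; i++)
            if (buf[i].last_used < lru->last_used) lru = &buf[i];
        lru->head = lru->count = 0;
        lru->last_used = now;
        refill_from(lru, block + BLOCK);             /* start prefetching the next lines */
    }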

Slide 50

Stream Buffer Design

[Figure not captured in transcript.]

Slide 51

Stream Buffer Design

[Figure not captured in transcript.]

Slide 52

These slides were not covered in the class. They are for your benefit.

Slide 53

Prefetcher Performance (I)

- Accuracy (used prefetches / sent prefetches)
- Coverage (prefetched misses / all misses)
- Timeliness (on-time prefetches / used prefetches)
- Bandwidth consumption
  - Memory bandwidth consumed with prefetcher / without prefetcher
  - Good news: can utilize idle bus bandwidth (if available)
- Cache pollution
  - Extra demand misses due to prefetch placement in cache
  - More difficult to quantify, but affects performance
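For concreteness, the metrics above computed from simple event counters (my illustration; the counter names are assumptions about what a simulator or performance-monitoring unit might expose):

    typedef struct {
        long sent_prefetches;
        long used_prefetches;
        long ontime_prefetches;    /* used prefetches that arrived before the demand access */
        long prefetched_misses;    /* demand misses eliminated by prefetching                */
        long all_misses;           /* demand misses without prefetching                      */
    } prefetch_counters_t;

    double accuracy(const prefetch_counters_t *c)
    { return c->sent_prefetches ? (double)c->used_prefetches / c->sent_prefetches : 0.0; }

    double coverage(const prefetch_counters_t *c)
    { return c->all_misses ? (double)c->prefetched_misses / c->all_misses : 0.0; }

    double timeliness(const prefetch_counters_t *c)
    { return c->used_prefetches ? (double)c->ontime_prefetches / c->used_prefetches : 0.0; }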

Slide 54

Prefetcher Performance (II)

- Prefetcher aggressiveness affects all performance metrics
- Aggressiveness is dependent on the prefetcher type
- For most hardware prefetchers:
  - Prefetch distance: how far ahead of the demand stream
  - Prefetch degree: how many prefetches per demand access

[Figure: for an access to X in the access stream, the prefetcher covers a window of the predicted stream starting at X+1. The prefetch distance (up to P_max) sets how far ahead that window reaches (very conservative, middle of the road, very aggressive), and the prefetch degree (1, 2, 3, ...) sets how many prefetches are issued per access.]
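A small sketch of how distance and degree shape the requests issued for one access to a predicted stream (my illustration; issue_prefetch() is a hypothetical hook, and an aggressive vs. conservative prefetcher would simply set these two knobs differently):

    #include <stdint.h>

    #define BLOCK 64

    extern void issue_prefetch(uint64_t addr);   /* hypothetical hook */

    void on_stream_access(uint64_t x, int distance, int degree)
    {
        /* Issue `degree` prefetches, starting `distance` blocks ahead of X. */
        for (int i = 0; i < degree; i++)
            issue_prefetch(x + (uint64_t)(distance + i) * BLOCK);
    }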

Slide 55

Prefetcher Performance (III)

How do these metrics interact?

Very Aggressive Prefetcher (large prefetch distance & degree):
- Well ahead of the load access stream
- Hides memory access latency better
- More speculative
+ Higher coverage, better timeliness
-- Likely lower accuracy, higher bandwidth and pollution

Very Conservative Prefetcher (small prefetch distance & degree):
- Closer to the load access stream
- Might not hide memory access latency completely
- Reduces potential for cache pollution and bandwidth contention
+ Likely higher accuracy, lower bandwidth, less polluting
-- Likely lower coverage and less timely

Slide 56

Prefetcher Performance (IV)

[Figure not captured in transcript.]

Slide 57

Prefetcher Performance (V)

Srinath et al., "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007.

[Figure: performance results, annotated with 48% and 29%.]

Slide 58

Feedback-Directed Prefetcher Throttling (I)

Idea:
- Dynamically monitor prefetcher performance metrics
- Throttle the prefetcher aggressiveness up/down based on past performance
- Change the location prefetches are inserted in the cache based on past performance

Decision heuristic (accuracy, lateness, pollution → aggressiveness):
- High Accuracy
  - Not-Late, Polluting → Decrease
  - Late → Increase
- Med Accuracy
  - Not-Poll, Late → Increase
  - Polluting → Decrease
- Low Accuracy
  - Not-Poll, Not-Late → No Change
  - Otherwise → Decrease
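The decision heuristic above can be written as a small function. This is a rendering of the tree on the slide, not code from Srinath et al.; combinations not shown on the slide default to No Change (for high/medium accuracy) in this sketch, and the accuracy/lateness/pollution classifications are assumed to come from hardware counters and thresholds computed elsewhere.

    typedef enum { ACC_LOW, ACC_MED, ACC_HIGH } accuracy_t;
    typedef enum { DECREASE, NO_CHANGE, INCREASE } throttle_t;

    throttle_t fdp_decision(accuracy_t acc, int late, int polluting)
    {
        switch (acc) {
        case ACC_HIGH:
            if (late) return INCREASE;
            return polluting ? DECREASE : NO_CHANGE;
        case ACC_MED:
            if (polluting) return DECREASE;
            return late ? INCREASE : NO_CHANGE;
        case ACC_LOW:
        default:
            if (!polluting && !late) return NO_CHANGE;
            return DECREASE;
        }
    }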

Slide 59

Feedback-Directed Prefetcher Throttling (II)

Srinath et al., "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007.

[Figure: results annotated with 11% and 13%.]

Slide 60

Samira Khan, University of Virginia, Apr 26, 2016

COMPUTER ARCHITECTURE

CS 6354: Prefetching

The content and concept of this course are adapted from CMU ECE 740.