Samira Khan, University of Virginia, April 26, 2016
COMPUTER ARCHITECTURE
CS 6354: Prefetching
The content and concept of this course are adapted from CMU ECE 740
AGENDA
Logistics
Review from last lecture
Prefetching
LOGISTICS
Final Presentation: April 28 and May 3
Final Exam: May 6, 9:00 am
Final Report Due: May 7
  Format: same as a regular paper (12 pages or less)
  Introduction, Background, Related Work, Key Idea, Key Mechanism, Results, Conclusion
SHARED CACHES BETWEEN CORES
Advantages:
  High effective capacity
  Dynamic partitioning of available cache space
  No fragmentation due to static partitioning
  Easier to maintain coherence (a cache block is in a single location)
  Shared data and locks do not ping pong between caches
Disadvantages:
  Slower access
  Cores incur conflict misses due to other cores' accesses
  Misses due to inter-core interference
  Some cores can destroy the hit rate of other cores
  Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)
Fair Shared Cache Partitioning
Goal: Equalize the slowdowns of multiple threads sharing the cache
Idea: Dynamically estimate slowdowns due to sharing and assign cache blocks to balance slowdowns
  Approximate slowdown with change in miss rate
  + Simple
  - Not accurate. Why?
Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.
Problem with Shared Caches
[Diagram: two processor cores, each with a private L1 cache, share a single L2 cache; thread t1 runs on Core 1 and thread t2 runs on Core 2, and both compete for the shared L2 space]
t2's throughput is significantly reduced due to unfair cache sharing.
Fairness Metrics
Uniform slowdown: every thread sharing the cache should see the same slowdown relative to running alone
Minimize the difference in slowdowns across threads; ideally, all threads are slowed down equally
[Equations defining the fairness metrics (Kim et al., PACT 2004) appeared here as an image]
Block-Granularity Partitioning
Modified LRU cache replacement policy (G. Suh et al., HPCA 2002)
Example: Current Partition is P1: 448B, P2: 576B; Target Partition is P1: 384B, P2: 640B
On a P2 miss, the replacement policy evicts one of P1's LRU blocks (marked *) instead of one of P2's
After the replacement, the Current Partition (P1: 384B, P2: 640B) matches the Target Partition
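A minimal C sketch of how such a partition-aware victim choice could work; the helpers current_alloc(), target_alloc(), and lru_way_of_thread() are hypothetical hooks into the cache model, and this is an illustration of the idea rather than Suh et al.'s exact mechanism:

#define NUM_THREADS 2

/* Assumed helpers provided by the cache model (hypothetical) */
long current_alloc(int thread);          /* blocks the thread currently holds */
long target_alloc(int thread);           /* blocks the thread should hold */
int  lru_way_of_thread(int set, int t);  /* LRU way in this set owned by thread t */

/* On a miss by missing_thread, pick a victim so the current partition
   drifts toward the target partition. */
int choose_victim_way(int set, int missing_thread) {
    for (int t = 0; t < NUM_THREADS; t++)
        if (current_alloc(t) > target_alloc(t))
            return lru_way_of_thread(set, t);          /* over-allocated thread: shrink it */
    return lru_way_of_thread(set, missing_thread);     /* otherwise replace within own partition */
}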
Dynamic Fair Caching Algorithm
Example: optimizing the M3 metric (the ratio of MissRate shared to MissRate alone per thread)
State kept per thread (P1, P2): MissRate alone, MissRate shared, and a Target Partition
Repartitioning is performed at the end of each repartitioning interval
Dynamic Fair Caching Algorithm
1st Interval
MissRate alone: P1: 20%, P2: 5%
Target Partition: P1: 256KB, P2: 256KB
MissRate shared, measured during the interval: P1: 20%, P2: 15%
Dynamic Fair Caching Algorithm
Repartition! Evaluate M3:
  P1: 20% / 20% = 1.0
  P2: 15% / 5% = 3.0
P2 suffers the larger relative slowdown, so capacity shifts toward P2:
Target Partition changes from P1: 256KB, P2: 256KB to P1: 192KB, P2: 320KB
Partition granularity: 64KB
Dynamic Fair Caching Algorithm
2nd Interval
MissRate alone: P1: 20%, P2: 5%
Target Partition: P1: 192KB, P2: 320KB
MissRate shared, newly measured: P1: 20%, P2: 10% (previous interval: P1: 20%, P2: 15%)
Dynamic Fair Caching Algorithm
Repartition! Evaluate M3:
  P1: 20% / 20% = 1.0
  P2: 10% / 5% = 2.0
P2 is still slowed down more, so more capacity shifts to P2:
Target Partition changes from P1: 192KB, P2: 320KB to P1: 128KB, P2: 384KB
Dynamic Fair Caching Algorithm
3rd Interval
MissRate alone: P1: 20%, P2: 5%
Target Partition: P1: 128KB, P2: 384KB
MissRate shared, newly measured: P1: 25%, P2: 9% (previous interval: P1: 20%, P2: 10%)
Dynamic Fair Caching Algorithm
Repartition! Do Rollback if, for the thread that gained capacity (P2), Δ = MR_old - MR_new < T_rollback
Here P2 improved only from 10% to 9% (Δ = 1%), below the rollback threshold, while P1's miss rate rose from 20% to 25%
Roll back: Target Partition reverts from P1: 128KB, P2: 384KB to P1: 192KB, P2: 320KB
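A minimal C sketch of one repartitioning step for this two-thread example; the 64KB granule follows the slides, while T_ROLLBACK, the struct layout, and the exact form of the rollback check are assumptions made for illustration:

#define GRANULE    (64 * 1024)   /* partition granularity from the slides: 64KB */
#define T_ROLLBACK 0.02          /* assumed rollback threshold (2 percentage points) */

struct tstate {
    double mr_alone;   /* miss rate when running alone */
    double mr_shared;  /* miss rate measured in the last interval */
    double mr_prev;    /* miss rate measured in the interval before that */
    long   target;     /* target partition size in bytes */
};

/* One repartitioning step for two threads. `gainer` is the thread that
   received extra capacity at the previous repartitioning (NULL if none). */
void repartition(struct tstate *p1, struct tstate *p2, struct tstate *gainer)
{
    /* Rollback: the thread that got more space improved too little */
    if (gainer && (gainer->mr_prev - gainer->mr_shared) < T_ROLLBACK) {
        struct tstate *other = (gainer == p1) ? p2 : p1;
        gainer->target -= GRANULE;
        other->target  += GRANULE;
        return;
    }
    /* M3-style metric: shared/alone miss-rate ratio approximates slowdown */
    double x1 = p1->mr_shared / p1->mr_alone;
    double x2 = p2->mr_shared / p2->mr_alone;
    if (x1 > x2)      { p1->target += GRANULE; p2->target -= GRANULE; }
    else if (x2 > x1) { p2->target += GRANULE; p1->target -= GRANULE; }
}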
Dynamic Fair Caching Results
Improves both fairness and throughput
Effect of Partitioning Interval
Fine-grained partitioning is important for both fairness and throughput
Benefits of Fair Caching
Problems of unfair cache sharing
  Sub-optimal throughput
  Thread starvation
  Priority inversion
  Thread-mix dependent performance
Benefits of fair caching
  Better fairness
  Better throughput
  Fair caching likely simplifies OS scheduler design
Advantages/Disadvantages of the Approach
Advantages
  + No (reduced) starvation
  + Better average throughput
Disadvantages
  - Scalable to many cores?
  - Is this the best (or a good) fairness metric?
  - Does this provide performance isolation in cache?
  - Alone miss rate estimation can be incorrect (estimation interval different from enforcement interval)
Memory Latency Tolerance
Cache Misses Responsible for Many Stalls
512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model
[Figure: breakdown of execution time showing the large fraction of cycles the processor stalls on L2 misses]
Memory Latency Tolerance Techniques
Caching [initially by Wilkes, 1965]
  Widely used, simple, effective, but inefficient, passive
  Not all applications/phases exhibit temporal or spatial locality
Prefetching [initially in IBM 360/91, 1967]
  Works well for regular memory access patterns
  Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive
Multithreading [initially in CDC 6600, 1964]
  Works well if there are multiple threads
  Improving single-thread performance using multithreading hardware is an ongoing research effort
Out-of-order execution [initially by Tomasulo, 1967]
  Tolerates irregular cache misses that cannot be prefetched
  Requires extensive hardware resources for tolerating long latencies
  Runahead execution alleviates this problem (as we will see today)
Prefetching
Outline of Prefetching Lecture
Why prefetch? Why could/does it work?
The four questions: what (to prefetch), when, where, how
Software prefetching
Hardware prefetching algorithms
Execution-based prefetching
Prefetching performance: coverage, accuracy, timeliness, bandwidth consumption, cache pollution
Prefetching
Idea: Fetch the data before it is needed (i.e., pre-fetch) by the program
Why? Memory latency is high. If we can prefetch accurately and early enough we can reduce/eliminate that latency.
Can eliminate compulsory cache misses
Can it eliminate all cache misses? Capacity, conflict?
Involves predicting which address will be needed in the future
Works if programs have predictable miss address patterns
Prefetching and Correctness
Does a misprediction in prefetching affect correctness?
No, prefetched data at a "mispredicted" address is simply not used
There is no need for state recovery
In contrast to branch misprediction or value misprediction
Basics
In modern systems, prefetching is usually done at cache block granularity
Prefetching is a technique that can reduce both
  Miss rate
  Miss latency
Prefetching can be done by
  Hardware
  Compiler
  Programmer
How a HW Prefetcher Fits in the Memory System
Prefetching: The Four Questions
What: what addresses to prefetch
When: when to initiate a prefetch request
Where: where to place the prefetched data
How: software, hardware, execution-based, cooperative
Challenges in Prefetching: What
What addresses to prefetch
Prefetching useless data wastes resources
  Memory bandwidth
  Cache or prefetch buffer space
  Energy consumption
  These could all be utilized by demand requests or more accurate prefetch requests
Accurate prediction of addresses to prefetch is important
  Prefetch accuracy = used prefetches / sent prefetches
How do we know what to prefetch?
  Predict based on past access patterns
  Use the compiler's knowledge of data structures
Prefetching algorithm determines what to prefetch
Challenges in Prefetching: When
When to initiate a prefetch request
Prefetching too early
  Prefetched data might not be used before it is evicted from storage
Prefetching too late
  Might not hide the whole memory latency
When a data item is prefetched affects the timeliness of the prefetcher
Prefetcher can be made more timely by
  Making it more aggressive: try to stay far ahead of the processor's access stream (hardware)
  Moving the prefetch instructions earlier in the code (software)
Challenges in Prefetching: Where (I)
Where to place the prefetched data
In cache
  + Simple design, no need for separate buffers
  -- Can evict useful demand data: cache pollution
In a separate prefetch buffer
  + Demand data protected from prefetches: no cache pollution
  -- More complex memory system design
    - Where to place the prefetch buffer
    - When to access the prefetch buffer (parallel vs. serial with cache)
    - When to move the data from the prefetch buffer to cache
    - How to size the prefetch buffer
    - Keeping the prefetch buffer coherent
Many modern systems place prefetched data into the cache
  Intel Pentium 4, Core 2, AMD systems, IBM POWER4/5/6, ...
Challenges in Prefetching: Where (II)
Which level of cache to prefetch into?
  Memory to L2, memory to L1. Advantages/disadvantages?
  L2 to L1? (a separate prefetcher between levels)
Where to place the prefetched data in the cache?
  Do we treat prefetched blocks the same as demand-fetched blocks?
  Prefetched blocks are not known to be needed
  With LRU, a demand block is placed into the MRU position
Do we skew the replacement policy such that it favors the demand-fetched blocks?
  E.g., place all prefetches into the LRU position?
Challenges in Prefetching: Where (III)
Where to place the hardware prefetcher in the memory hierarchy?
In other words, what access patterns does the prefetcher see?
  L1 hits and misses
  L1 misses only
  L2 misses only
Seeing a more complete access pattern:
  + Potentially better accuracy and coverage in prefetching
  -- Prefetcher needs to examine more requests (bandwidth intensive, more ports into the prefetcher?)
Challenges in Prefetching: How
Software prefetching
  ISA provides prefetch instructions
  Programmer or compiler inserts prefetch instructions (effort)
  Usually works well only for "regular access patterns"
Hardware prefetching
  Hardware monitors processor accesses
  Memorizes or finds patterns/strides
  Generates prefetch addresses automatically
Execution-based prefetchers
  A "thread" is executed to prefetch data for the main program
  Can be generated by either software/programmer or hardware
Software Prefetching (I)
Idea: Compiler/programmer places prefetch instructions into appropriate places in code
Mowry et al., "Design and Evaluation of a Compiler Algorithm for Prefetching," ASPLOS 1992.
Prefetch instructions prefetch data into caches
Compiler or programmer can insert such instructions into the program
X86 PREFETCH Instruction
Microarchitecture-dependent specification
Different instructions for different cache levels (PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA)
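As a concrete illustration, the SSE intrinsic _mm_prefetch exposes these hints to C code; the 16-element lookahead below is an arbitrary choice for the sketch:

#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_T0/T1/T2/NTA constants */

void scale(const float *a, float *b, int n) {
    for (int i = 0; i < n; i++) {
        /* Hint the hardware to bring a[i+16] into the caches (T0 = all levels);
           _MM_HINT_NTA would instead mark the line as non-temporal.
           Prefetching past the end of the array is harmless: a prefetch hint
           does not fault. */
        _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
        b[i] = 2.0f * a[i];
    }
}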
Software Prefetching (II)
Can work for very regular array-based access patterns. Issues:
-- Prefetch instructions take up processing/execution bandwidth
-- How early to prefetch? Determining this is difficult
  -- Prefetch distance depends on hardware implementation (memory latency, cache size, time between loop iterations) -> portability?
  -- Going too far back in code reduces accuracy (branches in between)
-- Need "special" prefetch instructions in ISA?
  Alpha: load into register 31 treated as prefetch (r31 == 0)
  PowerPC: dcbt (data cache block touch) instruction
-- Not easy to do for pointer-based data structures

for (i = 0; i < N; i++) {
    __prefetch(a[i+8]);
    __prefetch(b[i+8]);
    sum += a[i]*b[i];
}

while (p) {
    __prefetch(p->next);
    work(p->data);
    p = p->next;
}

while (p) {
    __prefetch(p->next->next->next);
    work(p->data);
    p = p->next;
}

Which one is better?
Software Prefetching (III)
Where should a compiler insert prefetches?
  Prefetch for every load access?
    Too bandwidth intensive (both memory and execution bandwidth)
  Profile the code and determine loads that are likely to miss
    What if the profile input set is not representative?
How far ahead of the miss should the prefetch be inserted?
  Profile and determine the probability of use for various prefetch distances from the miss
    What if the profile input set is not representative?
  Usually need to insert a prefetch far in advance to cover 100s of cycles of main memory latency -> reduced accuracy
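For example, with roughly a 300-cycle memory latency and about 10 cycles per loop iteration (assumed numbers), the compiler would need to prefetch about 300 / 10 = 30 iterations ahead, sketched here with the slides' __prefetch pseudo-intrinsic:

#define PREFETCH_DIST 30   /* ~ memory latency (300 cycles) / cycles per iteration (10) */

for (i = 0; i < N; i++) {
    __prefetch(a[i + PREFETCH_DIST]);
    __prefetch(b[i + PREFETCH_DIST]);
    sum += a[i] * b[i];
}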
Hardware Prefetching (I)
Idea: Specialized hardware observes load/store access patterns and prefetches data based on past access behavior
Tradeoffs:
  + Can be tuned to system implementation
  + Does not waste instruction execution bandwidth
  -- More hardware complexity to detect patterns
  -- Software can be more efficient in some cases
Next-Line Prefetchers
Simplest form of hardware prefetching: always prefetch the next N cache lines after a demand access (or a demand miss)
  Next-line prefetcher (or next sequential prefetcher)
Tradeoffs:
  + Simple to implement. No need for sophisticated pattern detection
  + Works well for sequential/streaming access patterns (instructions?)
  -- Can waste bandwidth with irregular patterns
  -- And, even with regular patterns:
    - What is the prefetch accuracy if access stride = 2 and N = 1?
    - What if the program is traversing memory from higher to lower addresses?
    - Also prefetch the "previous" N cache lines?
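A minimal C sketch of such a prefetcher's miss handler, assuming 64-byte lines and a hypothetical issue_prefetch() hook into the memory system:

#define LINE_SIZE 64
#define DEGREE    1     /* N: how many sequential next lines to prefetch */

void issue_prefetch(unsigned long addr);   /* assumed memory-system hook */

void on_demand_miss(unsigned long miss_addr) {
    unsigned long line = miss_addr & ~((unsigned long)LINE_SIZE - 1);   /* align to line */
    for (int k = 1; k <= DEGREE; k++)
        issue_prefetch(line + (unsigned long)k * LINE_SIZE);            /* next N lines */
}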
Stride Prefetchers
Two kinds
  Instruction program counter (PC) based
  Cache block address based
Instruction based:
  Baer and Chen, "An effective on-chip preloading scheme to reduce data access penalty," SC 1991.
  Idea: Record the distance between the memory addresses referenced by a load instruction (i.e., the stride of the load) as well as the last address referenced by the load
  Next time the same load instruction is fetched, prefetch last address + stride
Instruction Based Stride Prefetching
Prefetcher table indexed by the load instruction PC; each entry holds: Load Inst. PC (tag), Last Address Referenced, Last Stride, Confidence
What is the problem with this?
  How far can the prefetcher get ahead of the demand access stream?
  Initiating the prefetch when the load is fetched the next time can be too late: the load will access the data cache soon after it is fetched!
Solutions:
  Use lookahead PC to index the prefetcher table (decouple the frontend of the processor from the backend)
  Prefetch ahead (last address + N*stride)
  Generate multiple prefetches
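A simplified C sketch of the PC-indexed table update; the table size, confidence threshold, and the helpers hash_pc()/issue_prefetch() are assumptions made for illustration:

#define TABLE_SIZE     256
#define CONF_THRESHOLD 2

unsigned hash_pc(unsigned long pc);       /* assumed index function */
void issue_prefetch(unsigned long addr);  /* assumed memory-system hook */

struct stride_entry {
    unsigned long tag;        /* load instruction PC */
    unsigned long last_addr;  /* last address referenced by this load */
    long          stride;     /* last observed stride */
    int           confidence; /* saturating counter */
};

static struct stride_entry table[TABLE_SIZE];

void on_load_access(unsigned long pc, unsigned long addr) {
    struct stride_entry *e = &table[hash_pc(pc) % TABLE_SIZE];
    if (e->tag != pc) {                              /* new load: allocate entry */
        e->tag = pc; e->last_addr = addr;
        e->stride = 0; e->confidence = 0;
        return;
    }
    long stride = (long)(addr - e->last_addr);
    if (stride != 0 && stride == e->stride) {
        if (e->confidence < 3) e->confidence++;      /* same stride observed again */
    } else {
        e->confidence = 0;
        e->stride = stride;                          /* learn the new stride */
    }
    e->last_addr = addr;
    if (e->confidence >= CONF_THRESHOLD)
        issue_prefetch(addr + (unsigned long)e->stride);  /* or addr + N*stride to run further ahead */
}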
Cache-Block Address Based Stride Prefetching
Can detect A, A+N, A+2N, A+3N, ...
Stream buffers are a special case of cache block address based stride prefetching where N = 1
Table indexed by cache block address; each entry holds: Address tag, Stride, Control/Confidence
Stream Buffers (Jouppi, ISCA 1990)
Each stream buffer holds one stream of sequentially prefetched cache lines
On a load miss, check the head of all stream buffers for an address match
  If hit, pop the entry from the FIFO and update the cache with the data
  If not, allocate a new stream buffer for the new miss address (may have to recycle a stream buffer following an LRU policy)
Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy
[Figure: several stream buffer FIFOs between the data cache and the memory interface]
Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA 1990.
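A rough C sketch of the lookup/allocation policy just described, with four buffers, 64-byte lines, and hypothetical FIFO/LRU helpers:

#define NUM_BUFS  4
#define LINE_SIZE 64

struct stream_buf {
    unsigned long head_addr;   /* address of the line at the FIFO head */
    int           valid;
    unsigned long last_used;   /* for LRU recycling */
    /* ... FIFO of prefetched line data, refilled in the background ... */
};

static struct stream_buf bufs[NUM_BUFS];

/* Assumed helpers (hypothetical) */
void fifo_pop_into_cache(struct stream_buf *b);
void fifo_reset(struct stream_buf *b, unsigned long start_addr);
struct stream_buf *lru_buffer(void);
unsigned long now(void);

int on_load_miss(unsigned long miss_addr) {
    for (int i = 0; i < NUM_BUFS; i++) {
        if (bufs[i].valid && bufs[i].head_addr == miss_addr) {
            fifo_pop_into_cache(&bufs[i]);     /* hit: move head line into the cache */
            bufs[i].last_used = now();
            return 1;
        }
    }
    /* Miss in all buffers: recycle the LRU buffer for the new stream,
       starting to prefetch the line after the missing one. */
    struct stream_buf *v = lru_buffer();
    fifo_reset(v, miss_addr + LINE_SIZE);
    v->valid = 1;
    v->last_used = now();
    return 0;
}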
Stream Buffer Design
These slides were not covered in the class. They are for your benefit.
Prefetcher Performance (I)
Accuracy (used prefetches / sent prefetches)
Coverage (prefetched misses / all misses)
Timeliness (on-time prefetches / used prefetches)
Bandwidth consumption
  Memory bandwidth consumed with prefetcher / without prefetcher
  Good news: can utilize idle bus bandwidth (if available)
Cache pollution
  Extra demand misses due to prefetch placement in cache
  More difficult to quantify, but affects performance
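The three ratios can be expressed directly from (assumed) hardware counters collected over an interval:

struct pf_counters {
    unsigned long sent;       /* prefetches issued */
    unsigned long used;       /* prefetched lines later touched by a demand access */
    unsigned long on_time;    /* used prefetches that completed before the demand access */
    unsigned long all_misses; /* demand misses the program would incur without prefetching */
    unsigned long pf_covered; /* of those misses, how many a prefetch eliminated */
};

double accuracy(const struct pf_counters *c)   { return (double)c->used       / c->sent;       }
double coverage(const struct pf_counters *c)   { return (double)c->pf_covered / c->all_misses; }
double timeliness(const struct pf_counters *c) { return (double)c->on_time    / c->used;       }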
Prefetcher Performance (II)
Prefetcher aggressiveness affects all performance metrics
Aggressiveness dependent on prefetcher type
For most hardware prefetchers:
  Prefetch distance: how far ahead of the demand access stream the prefetches are sent
  Prefetch degree: how many prefetches per demand access
[Figure: access stream X, X+1, ... versus predicted stream; prefetch distance up to P_max shown for very conservative, middle-of-the-road, and very aggressive settings, and prefetch degree shown as 1, 2, 3 prefetches per access]
Prefetcher Performance (III)
How do these metrics interact?
Very Aggressive Prefetcher (large prefetch distance & degree)
  Well ahead of the load access stream
  Hides memory access latency better
  More speculative
  + Higher coverage, better timeliness
  -- Likely lower accuracy, higher bandwidth and pollution
Very Conservative Prefetcher (small prefetch distance & degree)
  Closer to the load access stream
  Might not hide memory access latency completely
  Reduces potential for cache pollution and bandwidth contention
  + Likely higher accuracy, lower bandwidth, less polluting
  -- Likely lower coverage and less timely
Prefetcher Performance (IV)
Prefetcher Performance (V)
Srinath et al., "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007.
[Figure: performance results across prefetcher aggressiveness settings, with 48% and 29% annotations]
Feedback-Directed Prefetcher Throttling (I)
Idea:
  Dynamically monitor prefetcher performance metrics
  Throttle the prefetcher aggressiveness up/down based on past performance
  Change the location prefetches are inserted in cache based on past performance
Throttling decisions based on the last interval's accuracy, lateness, and pollution:
  High Accuracy: Late -> Increase; Not-Late and Polluting -> Decrease
  Med Accuracy: Not-Polluting and Late -> Increase; Polluting -> Decrease
  Low Accuracy: Not-Polluting and Not-Late -> No Change; otherwise -> Decrease
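A sketch of that decision logic in C; the accuracy thresholds (0.75 and 0.40) and the default actions for cases not listed above are assumptions, not the exact values from Srinath et al.:

enum action { INCREASE, DECREASE, NO_CHANGE };

enum action fdp_decide(double accuracy, int late, int polluting) {
    if (accuracy >= 0.75) {                       /* high accuracy (assumed threshold) */
        if (late) return INCREASE;
        if (polluting) return DECREASE;
        return NO_CHANGE;                         /* assumed default for not-late, not-polluting */
    } else if (accuracy >= 0.40) {                /* medium accuracy (assumed threshold) */
        if (!polluting && late) return INCREASE;
        if (polluting) return DECREASE;
        return NO_CHANGE;                         /* assumed default */
    } else {                                      /* low accuracy */
        if (!polluting && !late) return NO_CHANGE;
        return DECREASE;
    }
}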
Feedback-Directed Prefetcher Throttling (II)
Srinath et al., "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007.
[Figure: results of feedback-directed throttling, with 11% and 13% annotations]
Samira Khan, University of Virginia, Apr 26, 2016
COMPUTER ARCHITECTURE
CS 6354: Prefetching
The content and concept of this course are adapted from CMU ECE 740