Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter

Motivation
Memory is a shared resource. Threads' requests contend for memory:
- Degradation in single-thread performance
- Can even lead to starvation
How to schedule memory requests to increase both system throughput and fairness?
[Figure: multiple cores contending for a shared memory]

Previous Scheduling Algorithms are Biased
[Plot: fairness vs. system throughput; prior algorithms cluster toward either a system-throughput bias or a fairness bias, away from the ideal point]
No previous memory scheduling algorithm provides both the best fairness and system throughput.

Why do Previous Algorithms Fail?
Fairness-biased approach: threads take turns accessing memory
- Does not starve any thread
- BUT: less memory-intensive threads are not prioritized → reduced throughput
Throughput-biased approach: prioritize less memory-intensive threads
- Good for throughput
- BUT: starvation → unfairness
A single policy for all threads is insufficient.

Insight: Achieving Best of Both Worlds
For throughput: prioritize memory-non-intensive threads.
For fairness:
- Unfairness is caused by memory-intensive threads being prioritized over each other → shuffle threads
- Memory-intensive threads have different vulnerability to interference → shuffle asymmetrically

Outline
- Motivation & Insights
- Overview
- Algorithm
- Bringing it All Together
- Evaluation
- Conclusion

Overview: Thread Cluster Memory Scheduling
1. Group threads into two clusters: memory-non-intensive and memory-intensive
2. Prioritize the non-intensive cluster over the intensive cluster
3. Apply a different policy to each cluster: the non-intensive cluster is managed for throughput, the intensive cluster for fairness

TCM Outline
1. Clustering

Clustering Threads
Step 1: Sort threads by MPKI (misses per kilo-instruction); higher MPKI = more memory-intensive.
Step 2: Memory bandwidth usage α·T divides the clusters, where T = total memory bandwidth usage and α < 10% (the ClusterThreshold). The least intensive threads whose combined bandwidth usage stays within α·T form the non-intensive cluster; the remaining threads form the intensive cluster.

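The two clustering steps above can be sketched as follows. This is an illustrative sketch, not the hardware implementation; the function and tuple layout are assumptions.

```python
def cluster_threads(threads, cluster_threshold=0.10):
    """Split threads into (non_intensive, intensive) clusters.

    threads: list of (thread_id, mpki, bandwidth_usage) tuples, where
    bandwidth_usage is the thread's memory bandwidth use last quantum.
    cluster_threshold: the ClusterThreshold parameter alpha (< 10%).
    """
    total_bw = sum(bw for _, _, bw in threads)   # T: total bandwidth usage
    budget = cluster_threshold * total_bw        # alpha * T

    # Step 1: sort threads by MPKI, least memory-intensive first
    ordered = sorted(threads, key=lambda t: t[1])

    # Step 2: the least intensive threads whose combined bandwidth
    # usage stays within alpha * T form the non-intensive cluster
    non_intensive, used = [], 0.0
    for i, t in enumerate(ordered):
        if used + t[2] > budget:
            return non_intensive, ordered[i:]
        non_intensive.append(t)
        used += t[2]
    return non_intensive, []
```

With α = 10%, only the lightest threads that together consume at most 10% of total memory bandwidth are classified as non-intensive.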
TCM Outline
1. Clustering
2. Between Clusters

Prioritization Between Clusters
Prioritize the non-intensive cluster over the intensive cluster.
- Increases system throughput: non-intensive threads have greater potential for making progress
- Does not degrade fairness: non-intensive threads are "light" and rarely interfere with intensive threads

TCM Outline
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster → Throughput

Non-Intensive Cluster
Prioritize threads according to MPKI: lowest MPKI → highest priority.
- Increases system throughput: the least intensive thread has the greatest potential for making progress in the processor

TCM Outline
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster → Throughput
4. Intensive Cluster → Fairness

Intensive Cluster
Periodically shuffle the priority of threads.
- Increases fairness
Is treating all threads equally good enough?
BUT: equal turns ≠ same slowdown

Case Study: A Tale of Two Threads
Two intensive threads contending: a random-access thread and a streaming thread. Which is slowed down more easily?
- Prioritize random-access: streaming is slowed down 7x (random-access: 1x)
- Prioritize streaming: random-access is slowed down 11x (streaming: 1x)
The random-access thread is more easily slowed down.

Why are Threads Different?
random-access: all requests are issued in parallel to different banks → high bank-level parallelism. Its requests easily get stuck behind other threads' requests → vulnerable to interference.
streaming: all requests go to the same (activated) row → high row-buffer locality → causes interference.
[Figure: memory with Banks 1-4; the random-access thread's requests spread across banks, while the streaming thread's requests line up on one activated row]

TCM Outline
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster → Throughput
4. Intensive Cluster → Fairness

Niceness
How to quantify the difference between threads?
- High bank-level parallelism → vulnerable to interference → increases niceness (+)
- High row-buffer locality → causes interference → decreases niceness (−)

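One plausible formalization of this metric is to rank the intensive threads on both measures; the exact formula is an assumption here, the point is only that high bank-level parallelism raises niceness and high row-buffer locality lowers it.

```python
def niceness(blp, rbl):
    """blp, rbl: dicts mapping thread id -> measured bank-level
    parallelism and row-buffer locality. Returns thread id -> niceness."""
    # Rank each thread on both axes (rank 0 = lowest value).
    blp_rank = {t: r for r, t in enumerate(sorted(blp, key=blp.get))}
    rbl_rank = {t: r for r, t in enumerate(sorted(rbl, key=rbl.get))}
    # High BLP (vulnerable to interference) adds to niceness;
    # high RBL (causes interference) subtracts from it.
    return {t: blp_rank[t] - rbl_rank[t] for t in blp}
```

Under this sketch, the random-access thread from the case study (high BLP, low RBL) scores nicer than the streaming thread (low BLP, high RBL).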
Shuffling: Round-Robin vs. Niceness-Aware
Round-Robin shuffling: every ShuffleInterval, rotate the priority order (e.g., with threads A, B, C, D ordered from nicest to least nice: A B C D → D A B C → C D A B → B C D A).
- GOOD: each thread is prioritized once per round
- What can go wrong? BAD: nice threads receive lots of interference

Niceness-Aware shuffling: threads are ranked by niceness (nicest thread most prioritized, least nice thread least prioritized), and the shuffle is asymmetric.
- GOOD: each thread is still prioritized once per round
- GOOD: the least nice thread stays mostly deprioritized

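A toy version of an asymmetric shuffle with the two properties above (a simplification; TCM's actual shuffle differs in detail): each interval, the next thread in niceness order is promoted to the top, while everyone else falls back to the niceness ranking, so the least nice thread spends most intervals at the bottom.

```python
def niceness_aware_rounds(threads_nicest_first):
    """threads_nicest_first: thread ids sorted from nicest to least nice.
    Yields one priority ranking (most -> least prioritized) per interval."""
    for t in threads_nicest_first:
        # Promote one thread to the top for this interval; the rest
        # keep the niceness order, leaving the least nice thread low.
        yield [t] + [x for x in threads_nicest_first if x != t]
```

With four threads A (nicest) through D (least nice), the rounds are [A,B,C,D], [B,A,C,D], [C,A,B,D], [D,A,B,C]: every thread is topmost exactly once, yet D sits at the bottom in three of the four intervals, whereas round-robin would give D as much time near the top as everyone else.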
TCM Outline
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster → Throughput
4. Intensive Cluster → Fairness

Quantum-Based Operation
During a quantum (~1M cycles), monitor each thread's behavior:
- Memory intensity
- Bank-level parallelism
- Row-buffer locality
At the beginning of the next quantum:
- Perform clustering
- Compute niceness of intensive threads
Within a quantum, the intensive cluster is shuffled every shuffle interval (~1K cycles).

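The quantum structure can be sketched as a loop. Cycle counts come from the slide; the monitor/recluster/shuffle callbacks are stand-ins for the real hardware mechanisms.

```python
QUANTUM = 1_000_000       # ~1M cycles per quantum
SHUFFLE_INTERVAL = 1_000  # ~1K cycles per shuffle interval

def run_quantum(monitor, recluster, shuffle,
                quantum=QUANTUM, shuffle_interval=SHUFFLE_INTERVAL):
    # Beginning of quantum: cluster threads and compute niceness from
    # the statistics monitored during the previous quantum.
    recluster()
    for cycle in range(quantum):
        # During the quantum: track MPKI, bank-level parallelism,
        # and row-buffer locality.
        monitor(cycle)
        if cycle > 0 and cycle % shuffle_interval == 0:
            shuffle()  # reshuffle intensive-cluster priorities
```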
TCM Scheduling Algorithm
1. Highest-rank: requests from higher-ranked threads are prioritized
   - Non-intensive cluster > intensive cluster
   - Non-intensive cluster: lower intensity → higher rank
   - Intensive cluster: rank shuffling
2. Row-hit: row-buffer-hit requests are prioritized
3. Oldest: older requests are prioritized

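The three rules can be expressed as a sort key, with the smallest key served first. The field names and dict-based request shape are illustrative assumptions, not the controller's actual data layout.

```python
def request_key(req, thread_rank, row_is_open):
    """req: dict with 'thread' and 'arrival_time' fields.
    thread_rank: thread id -> rank, smaller = higher-ranked (every
    non-intensive-cluster thread ranks above every intensive one).
    row_is_open: predicate, True if the request hits the open row."""
    return (
        thread_rank[req["thread"]],     # 1. Highest-rank first
        0 if row_is_open(req) else 1,   # 2. Row-buffer hits next
        req["arrival_time"],            # 3. Oldest first
    )
```

For example, a higher-ranked thread's row-miss request is still served before a lower-ranked thread's row-hit request, because rank is the most significant component of the key.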
Implementation Costs
Required storage at the memory controller (24 cores):

Thread memory behavior   | Storage
MPKI                     | ~0.2 kbit
Bank-level parallelism   | ~0.6 kbit
Row-buffer locality      | ~2.9 kbit
Total                    | < 4 kbits

No computation is on the critical path.

Metrics & Methodology
Metrics: system throughput and unfairness.
Methodology:
- Core model: 4 GHz processor, 128-entry instruction window, 512 KB/core L2 cache
- Memory model: DDR2
- 96 multiprogrammed SPEC CPU2006 workloads

Previous Work
FRFCFS [Rixner et al., ISCA'00]: prioritizes row-buffer hits
- Thread-oblivious → low throughput & low fairness
STFM [Mutlu et al., MICRO'07]: equalizes thread slowdowns
- Non-intensive threads not prioritized → low throughput
PAR-BS [Mutlu et al., ISCA'08]: prioritizes the oldest batch of requests while preserving bank-level parallelism
- Non-intensive threads not always prioritized → low throughput
ATLAS [Kim et al., HPCA'10]: prioritizes threads with less memory service
- Most intensive thread starves → low fairness

Results: Fairness vs. Throughput
Averaged over 96 workloads.
[Plot: fairness vs. system throughput. Compared to ATLAS, TCM provides 5% better system throughput and 39% better fairness; compared to PAR-BS, 8% better system throughput and 5% better fairness.]
TCM provides the best fairness and system throughput.

Results: Fairness-Throughput Tradeoff
When each algorithm's configuration parameter is varied...
[Plot: fairness vs. system throughput curves for STFM, PAR-BS, ATLAS, and TCM; TCM's curve is swept by adjusting ClusterThreshold]
TCM allows a robust fairness-throughput tradeoff.

Operating System Support
- ClusterThreshold is a tunable knob: the OS can trade off between fairness and throughput
- Enforcing thread weights: the OS assigns weights to threads; TCM enforces thread weights within each cluster

Conclusion
No previous memory scheduling algorithm provides both high system throughput and fairness.
- Problem: they use a single policy for all threads
TCM groups threads into two clusters:
- Prioritize the non-intensive cluster → throughput
- Shuffle priorities within the intensive cluster → fairness
- Shuffling should favor nice threads → fairness
TCM provides the best system throughput and fairness.

Thank You

Thread Weight Support
Even if the heaviest-weighted thread happens to be the most intensive thread, it is not prioritized over the least intensive thread.

Harmonic Speedup
[Plot: fairness vs. system throughput with throughput measured as harmonic speedup]

Shuffling Algorithm Comparison
With Niceness-Aware shuffling, both the average and the variance of maximum slowdown are lower.

Shuffling Algorithm    | Round-Robin | Niceness-Aware
E(Maximum Slowdown)    | 5.58        | 4.84
VAR(Maximum Slowdown)  | 1.61        | 0.85

Sensitivity Results

ShuffleInterval (cycles) | 500  | 600  | 700  | 800
System Throughput        | 14.2 | 14.3 | 14.2 | 14.7
Maximum Slowdown         | 6.0  | 5.4  | 5.9  | 5.5

Number of Cores                | 4   | 8    | 16   | 24   | 32
System Throughput (vs. ATLAS)  | 0%  | 3%   | 2%   | 1%   | 1%
Maximum Slowdown (vs. ATLAS)   | -4% | -30% | -29% | -30% | -41%