Parallel Application Memory Scheduling

Eiman Ebrahimi*, Rustam Miftakhutdinov*, Chris Fallin‡, Chang Joo Lee*+, Jose Joao*, Onur Mutlu‡, Yale N. Patt*

* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
+ Intel Corporation Austin
Background

[Figure: Cores 0 through N share a cache and a memory controller on-chip; DRAM banks 0 through K sit off-chip, beyond the chip boundary. Together these form the shared memory resources.]
Background

Memory requests from different cores interfere in shared memory resources. Prior work targets multi-programmed workloads, optimizing system performance and fairness. What about a single multi-threaded application?

[Figure: the same shared-memory-resources diagram: cores 0 through N, shared cache, memory controller, and DRAM banks 0 through K.]
Memory System Interference in a Single Multi-Threaded Application

Inter-dependent threads from the same application slow each other down. Most importantly, the critical path of execution can be significantly slowed down.

The problem and goal are very different from interference between independent applications:
- Threads are inter-dependent
- Goal: reduce the execution time of a single application
- No notion of fairness among the threads of the same application
Potential in a Single Multi-Threaded Application

If all main-memory-related interference is ideally eliminated, execution time is reduced by 45% on average.

[Chart: execution times normalized to a system using FR-FCFS memory scheduling.]
Outline

- Problem Statement
- Parallel Application Memory Scheduling
- Evaluation
- Conclusion
Parallel Application Memory Scheduler

- Identify the set of threads likely to be on the critical path as limiter threads; prioritize requests from limiter threads
- Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
- Among non-limiter threads: shuffle priorities of non-limiter threads to reduce inter-thread memory interference; prioritize requests from threads falling behind others in a parallel for-loop
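The prioritization order above can be sketched as a comparison key for pending memory requests. This is a hypothetical illustration, not the hardware implementation; the field names, the `shuffle_rank` encoding, and the use of completed loop iterations as a progress measure are assumptions.

```python
# Hypothetical sketch of the PAMS request-prioritization order; field names
# and the exact comparison key are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    is_limiter: bool
    mpki: float          # lower MPKI -> latency-sensitive
    shuffle_rank: int    # smaller rank -> higher priority this interval
    loop_progress: int   # iterations completed in a parallel for-loop

def priority_key(req: Request):
    # Limiter threads come first; among limiters, lower MPKI
    # (latency-sensitive) wins. Among non-limiters, the current shuffle
    # rank decides, and threads falling behind in a parallel loop
    # (fewer completed iterations) are favored.
    if req.is_limiter:
        return (0, req.mpki)
    return (1, req.shuffle_rank, req.loop_progress)

# Usage: sort pending requests so the scheduler serves the head first.
# pending.sort(key=priority_key)
```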
Runtime System Limiter Identification

Contended critical sections are often on the critical path of execution. We extend the runtime system to identify the thread executing the most contended critical section as the limiter thread:
- Track the total amount of time all threads wait on each lock in a given interval
- Identify the lock with the largest waiting time as the most contended
- The thread holding the most contended lock is a limiter, and this information is exposed to the memory controller
Prioritizing Requests from Limiter Threads

[Figure: execution timelines of threads A through D, showing critical sections 1 and 2, non-critical sections, and time spent waiting for sync or lock, ending at a barrier. With limiter thread identification, the limiter changes over time (D, then B, then C, then A) as critical section 1, the most contended, passes between threads; prioritizing the limiter's requests shortens the critical path and saves cycles.]
Time-Based Classification of Threads as Latency- vs. BW-Sensitive

[Figure: execution timelines of threads A through D between barriers, with critical sections, non-critical sections, and waiting for sync; classification is performed over fixed time intervals 1 and 2, as in Thread Cluster Memory Scheduling (TCM) [Kim et al., MICRO'10].]
Terminology

A code-segment is defined as a program region between two consecutive synchronization operations, identified with a 2-tuple: <beginning IP, lock address>. Code-segments are important for classifying threads as latency- vs. bandwidth-sensitive: classification can be time-based or code-segment based.
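The 2-tuple and the MPKI-based classification it supports can be sketched as follows. This is an illustrative assumption-laden sketch: the `CodeSegment` name, the per-segment MPKI table, and the threshold value are not from the talk.

```python
# Hypothetical sketch: a code-segment keyed by <beginning IP, lock address>,
# with a per-segment memory intensity (MPKI) used to classify the thread
# executing it. The threshold of 1.0 MPKI is an invented example value.
from collections import namedtuple

CodeSegment = namedtuple("CodeSegment", ["begin_ip", "lock_addr"])

def classify(mpki_by_segment, segment, threshold=1.0):
    # Lower MPKI -> latency-sensitive; higher MPKI -> bandwidth-sensitive.
    mpki = mpki_by_segment.get(segment, 0.0)
    return "latency-sensitive" if mpki < threshold else "bandwidth-sensitive"
```

Because the same code-segment tends to have the same memory behavior each time it executes, keying on the segment rather than on a fixed time interval lets the classification carry over across executions of that region.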
Code-Segment Based Classification of Threads as Latency- vs. BW-Sensitive

[Figure: the same thread A through D timelines, now divided at code-segment changes (code segments 1 and 2) instead of fixed time intervals; legend: critical section, barrier, non-critical section, waiting for sync.]
Shuffling Priorities of Non-Limiter Threads

Goal: reduce inter-thread interference among a set of threads with the same importance in terms of our estimation of the critical path, and prevent any of these threads from becoming new bottlenecks.

Basic idea: give each thread a chance to be high priority in the memory system and exploit intra-thread bank parallelism and row-buffer locality. Every interval, assign a set of random priorities to the threads and shuffle the priorities at the end of the interval.
Shuffling Priorities of Non-Limiter Threads

[Figure: timelines of threads A through D between barriers, comparing a baseline with no shuffling against two shuffling policies, for threads with similar memory behavior and for threads with different memory behavior. Depending on which policy matches the threads' memory behavior, shuffling saves cycles in some cases and loses cycles in others. Legend: active vs. waiting.]

PAMS dynamically monitors memory intensities and chooses the appropriate shuffling policy for non-limiter threads at runtime.
Evaluation Methodology

- x86 cycle-accurate simulator
- Baseline processor configuration:
  - Per-core: 4-wide issue, out-of-order, 64-entry ROB
  - Shared (16-core system): 128 MSHRs; 4 MB, 16-way L2 cache
  - Main memory: DDR3-1333; latency of 15 ns per command (tRP, tRCD, CL); 8-byte-wide core-to-memory bus
PAMS Evaluation

[Chart: PAMS outperforms the best previous memory scheduler by 13% and a memory scheduler using thread criticality predictors (TCP) [Bhattacharjee+, ISCA'09] by 7%.]
Sensitivity to System Parameters

Δ execution time vs. FR-FCFS:

  L2 cache size:     4 MB: -16.7%        8 MB: -15.9%        16 MB: -10.5%
  Memory channels:   1 channel: -16.7%   2 channels: -11.6%  4 channels: -10.4%
Conclusion

- Inter-thread main memory interference within a multi-threaded application increases execution time
- Parallel Application Memory Scheduling (PAMS) improves a single multi-threaded application's performance by:
  - Identifying the set of threads likely to be on the critical path and prioritizing requests from them
  - Periodically shuffling priorities of non-likely-critical threads to reduce inter-thread interference among them
- PAMS significantly outperforms:
  - The best previous memory scheduler designed for multi-programmed workloads
  - A memory scheduler that uses a state-of-the-art thread criticality predictor (TCP)