Parallel Application Memory Scheduling - PowerPoint Presentation


Presentation Transcript

Slide 1

Parallel Application
Memory Scheduling

Eiman Ebrahimi*, Rustam Miftakhutdinov*, Chris Fallin‡, Chang Joo Lee*+, Jose Joao*, Onur Mutlu‡, Yale N. Patt*

* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
+ Intel Corporation, Austin

Slide 2

Background

[Figure: Cores 0 through N share an on-chip cache and memory controller; off-chip DRAM banks 0 through K lie across the chip boundary. Together these form the shared memory resources.]

Slide 3

Background

Memory requests from different cores interfere in shared memory resources:
- Multi-programmed workloads: system performance and fairness
- A single multi-threaded application?

[Figure: the same shared-memory-resources diagram as Slide 2.]

Slide 4

Memory System Interference in a Single Multi-Threaded Application

- Inter-dependent threads from the same application slow each other down
- Most importantly, the critical path of execution can be significantly slowed down
- The problem and goal are very different from interference between independent applications:
  - Interdependence between threads
  - Goal: reduce the execution time of a single application
  - No notion of fairness among the threads of the same application

Slide 5

Potential in a Single Multi-Threaded Application

If all main-memory-related interference is ideally eliminated, execution time is reduced by 45% on average (normalized to a system using FR-FCFS memory scheduling).

Slide 6

Outline

- Problem Statement
- Parallel Application Memory Scheduling
- Evaluation
- Conclusion

Slide 7

Outline

- Problem Statement
- Parallel Application Memory Scheduling
- Evaluation
- Conclusion

Slide 8

Parallel Application Memory Scheduler

- Identify the set of threads likely to be on the critical path as limiter threads
- Prioritize requests from limiter threads
- Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
- Among non-limiter threads:
  - Shuffle priorities of non-limiter threads to reduce inter-thread memory interference
  - Prioritize requests from threads falling behind others in a parallel for-loop
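As a rough illustration only (the actual scheduler is hardware inside the memory controller), these rules can be read as a single priority ordering. A minimal Python sketch, where ThreadState and its fields loop_progress and shuffle_priority are hypothetical names invented here:

```python
# Sketch of the PAMS prioritization rules as a request-ordering key.
# All names are illustrative, not the paper's hardware design.
from dataclasses import dataclass

@dataclass(frozen=True)
class ThreadState:
    is_limiter: bool       # identified as likely on the critical path
    mpki: float            # misses per kilo-instruction (memory intensity)
    loop_progress: int     # iterations completed in a parallel for-loop
    shuffle_priority: int  # periodically re-assigned for non-limiters

def priority_key(t: ThreadState):
    """Lower key sorts first = serviced first by the scheduler."""
    if t.is_limiter:
        # Limiters beat non-limiters; among limiters, latency-sensitive
        # (lower MPKI) threads come first.
        return (0, t.mpki)
    # Among non-limiters: threads falling behind in a parallel for-loop
    # first, then the current shuffled priority.
    return (1, t.loop_progress, t.shuffle_priority)

threads = [
    ThreadState(is_limiter=False, mpki=8.0, loop_progress=12, shuffle_priority=2),
    ThreadState(is_limiter=True, mpki=0.5, loop_progress=0, shuffle_priority=0),
]
service_order = sorted(threads, key=priority_key)
```

Because limiter requests always compare lower, they win every tie against non-limiter requests; the remaining tuple elements encode the per-group tie-breaking rules from the list above.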

Slide 9

Parallel Application Memory Scheduler

- Identify the set of threads likely to be on the critical path as limiter threads
- Prioritize requests from limiter threads
- Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
- Among non-limiter threads:
  - Shuffle priorities of non-limiter threads to reduce inter-thread memory interference
  - Prioritize requests from threads falling behind others in a parallel for-loop

Slide 10

Runtime System Limiter Identification

- Contended critical sections are often on the critical path of execution
- Extend the runtime system to identify the thread executing the most contended critical section as the limiter thread:
  - Track the total amount of time all threads wait on each lock in a given interval
  - Identify the lock with the largest waiting time as the most contended
  - The thread holding the most contended lock is a limiter; this information is exposed to the memory controller
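A minimal sketch of this bookkeeping, assuming hypothetical runtime hooks on_lock_wait, on_lock_acquire, and end_interval; the slides describe the mechanism only at this level of detail:

```python
# Per-interval lock-contention accounting for limiter identification.
from collections import defaultdict
from typing import Optional

wait_time = defaultdict(float)  # lock address -> total wait time this interval
holder = {}                     # lock address -> thread currently holding it

def on_lock_wait(lock_addr: int, waited_cycles: float) -> None:
    """Called when a thread finishes waiting on a lock."""
    wait_time[lock_addr] += waited_cycles

def on_lock_acquire(lock_addr: int, thread_id: int) -> None:
    """Called when a thread acquires a lock."""
    holder[lock_addr] = thread_id

def end_interval() -> Optional[int]:
    """Return the limiter thread: the holder of the lock with the largest
    total waiting time. This is what gets exposed to the memory controller."""
    if not wait_time:
        return None
    most_contended = max(wait_time, key=wait_time.get)
    limiter = holder.get(most_contended)
    wait_time.clear()  # reset accounting for the next interval
    return limiter
```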

Slide 11

Prioritizing Requests from Limiter Threads

[Figure: execution timelines of Threads A-D between two barriers, annotated with critical sections 1 and 2, non-critical sections, and time spent waiting for a sync or lock. Limiter thread identification tracks the most contended critical section over time (here, critical section 1) and designates its holder (D, then B, then C, then A) as the limiter. Prioritizing the limiter's requests shortens the critical path and saves cycles.]

Slide 12

Parallel Application Memory Scheduler

- Identify the set of threads likely to be on the critical path as limiter threads
- Prioritize requests from limiter threads
- Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
- Among non-limiter threads:
  - Shuffle priorities of non-limiter threads to reduce inter-thread memory interference
  - Prioritize requests from threads falling behind others in a parallel for-loop

Slide 13

Time-Based Classification of Threads as Latency- vs. Bandwidth-Sensitive

[Figure: execution timelines of Threads A-D between barriers, split into fixed time intervals 1 and 2; legend: critical section, barrier, non-critical section, waiting for sync. Threads are classified per interval, following Thread Cluster Memory Scheduling (TCM) [Kim et al., MICRO'10].]

Slide 14

Terminology

- A code-segment is defined as a program region between two consecutive synchronization operations
- Identified with a 2-tuple: <beginning IP, lock address>
- Important for classifying threads as latency- vs. bandwidth-sensitive
- Time-based vs. code-segment-based classification
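A minimal sketch of the code-segment abstraction, assuming a hypothetical per-segment MPKI table and threshold. The intuition is that a thread's memory behavior is tied to the code it executes, so statistics learned for a segment apply the next time that segment runs, whereas a fixed time interval lags behind behavior changes:

```python
# Code segments keyed by <beginning IP, lock address>, with a learned
# memory intensity per segment. Names and threshold are illustrative.
from collections import defaultdict
from typing import Tuple

CodeSegment = Tuple[int, int]      # (beginning IP, lock address)

segment_mpki = defaultdict(float)  # learned MPKI per code segment
LATENCY_SENSITIVE_MPKI = 1.0       # assumed threshold, not from the slides

def current_segment(begin_ip: int, lock_addr: int) -> CodeSegment:
    """A new code segment begins at each synchronization operation."""
    return (begin_ip, lock_addr)

def classify(seg: CodeSegment) -> str:
    """Classify by the segment being executed, not by a time interval."""
    return ("latency-sensitive"
            if segment_mpki[seg] < LATENCY_SENSITIVE_MPKI
            else "bandwidth-sensitive")
```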

Slide 15

Code-Segment Based Classification of Threads as Latency- vs. Bandwidth-Sensitive

[Figure: the same timelines of Threads A-D, now split at code-segment changes (code segments 1 and 2) rather than at fixed time intervals 1 and 2, so classification tracks what each thread is currently executing; legend: critical section, barrier, non-critical section, waiting for sync.]

Slide 16

Parallel Application Memory Scheduler

- Identify the set of threads likely to be on the critical path as limiter threads
- Prioritize requests from limiter threads
- Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
- Among non-limiter threads:
  - Shuffle priorities of non-limiter threads to reduce inter-thread memory interference
  - Prioritize requests from threads falling behind others in a parallel for-loop

Slide 17

Shuffling Priorities of Non-Limiter Threads

- Goal: reduce inter-thread interference among a set of threads with the same importance in terms of our estimation of the critical path
  - Prevent any of these threads from becoming new bottlenecks
- Basic idea: give each thread a chance to be high priority in the memory system and exploit intra-thread bank parallelism and row-buffer locality
  - Every interval, assign a set of random priorities to the threads and shuffle priorities at the end of the interval
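A minimal sketch of the shuffling step itself; all names are illustrative, and interval boundary detection is left out:

```python
# Re-assign a random permutation of priorities to the non-limiter threads
# at the end of each shuffling interval, so every thread periodically gets
# a turn at high priority (and can exploit its own bank parallelism and
# row-buffer locality while prioritized).
import random

def shuffle_priorities(non_limiters: list) -> dict:
    """Return a fresh thread -> priority map (0 = highest priority)."""
    order = list(non_limiters)
    random.shuffle(order)
    return {thread: rank for rank, thread in enumerate(order)}

# Called at the end of every shuffling interval:
priorities = shuffle_priorities(["A", "B", "C", "D"])
```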

Slide 18

Shuffling Priorities of Non-Limiter Threads

[Figure: execution timelines of Threads A-D between barriers under three schemes: a baseline with no shuffling, shuffling Policy 1, and shuffling Policy 2, each shown for two cases, threads with similar memory behavior and threads with different memory behavior. Per-thread priority levels 1-4 rotate over time; the legend distinguishes active cycles from waiting cycles. Depending on how well the policy matches the threads' memory behavior, shuffling saves or loses cycles relative to the baseline.]

PAMS dynamically monitors memory intensities and chooses the appropriate shuffling policy for non-limiter threads at runtime.
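How that runtime choice might look in pseudocode. The similarity test below (coefficient of variation of thread MPKIs against an assumed threshold) is an invented stand-in; the slides only state that memory intensities are monitored:

```python
# Pick a shuffling policy based on how similar the non-limiter threads'
# memory intensities are. Threshold and test are illustrative assumptions.
from statistics import mean, pstdev

SIMILARITY_THRESHOLD = 0.3  # assumed value, not from the slides

def threads_have_similar_intensity(mpkis: list) -> bool:
    m = mean(mpkis)
    if m == 0:
        return True  # no memory traffic: trivially similar
    return pstdev(mpkis) / m < SIMILARITY_THRESHOLD

def choose_shuffling_policy(mpkis: list) -> str:
    return ("policy-for-similar-threads"
            if threads_have_similar_intensity(mpkis)
            else "policy-for-different-threads")
```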

Slide 19

Outline

- Problem Statement
- Parallel Application Memory Scheduling
- Evaluation
- Conclusion

Slide 20

Evaluation Methodology

- x86 cycle-accurate simulator
- Baseline processor configuration:
  - Per-core: 4-wide issue, out-of-order, 64-entry ROB
  - Shared (16-core system): 128 MSHRs; 4MB, 16-way L2 cache
  - Main memory: DDR3-1333; latency of 15 ns per command (tRP, tRCD, CL); 8B-wide core-to-memory bus
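For reference, the same baseline restated as a structured config sketch (field names invented here; the evaluation used an x86 cycle-accurate simulator, not this structure):

```python
# Baseline system parameters from the slide, as a config dictionary.
BASELINE_CONFIG = {
    "cores": 16,
    "per_core": {
        "issue_width": 4,
        "out_of_order": True,
        "rob_entries": 64,
    },
    "shared": {
        "mshrs": 128,
        "l2_cache": {"size_mb": 4, "ways": 16},
    },
    "main_memory": {
        "dram": "DDR3-1333",
        "command_latency_ns": {"tRP": 15, "tRCD": 15, "CL": 15},
        "core_to_memory_bus_width_bytes": 8,
    },
}
```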

Slide 21

PAMS Evaluation

[Figure: normalized execution time results. PAMS outperforms the best previous memory scheduler by 13% and a scheduler based on thread criticality predictors (TCP) [Bhattacharjee+, ISCA'09] by 7%.]

Slide 22

Sensitivity to System Parameters

Execution time change (Δ) relative to FR-FCFS:

L2 Cache Size      4 MB      8 MB      16 MB
Δ vs. FR-FCFS      -10.5%    -15.9%    -16.7%

Memory Channels    1 Channel   2 Channels   4 Channels
Δ vs. FR-FCFS      -10.4%      -11.6%       -16.7%

Slide 23

Conclusion

- Inter-thread main memory interference within a multi-threaded application increases execution time
- Parallel Application Memory Scheduling (PAMS) improves a single multi-threaded application's performance by:
  - Identifying a set of threads likely to be on the critical path and prioritizing requests from them
  - Periodically shuffling priorities of non-likely-critical threads to reduce inter-thread interference among them
- PAMS significantly outperforms:
  - The best previous memory scheduler designed for multi-programmed workloads
  - A memory scheduler that uses a state-of-the-art thread criticality predictor (TCP)

Slide 24

Parallel Application
Memory Scheduling

Eiman Ebrahimi*, Rustam Miftakhutdinov*, Chris Fallin‡, Chang Joo Lee*+, Jose Joao*, Onur Mutlu‡, Yale N. Patt*

* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
+ Intel Corporation, Austin