Presentation Transcript

Slide1

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter

Slide2

Motivation

Memory is a shared resource: threads' requests contend for memory. This degrades single-thread performance and can even lead to starvation. How should memory requests be scheduled to increase both system throughput and fairness?

[Figure: four cores sharing a single memory]

Slide3

Previous Scheduling Algorithms are Biased

[Figure: fairness vs. system throughput scatter plot; prior algorithms show either a system-throughput bias or a fairness bias, away from the ideal corner of better fairness and better system throughput]

No previous memory scheduling algorithm provides both the best fairness and system throughput.

Slide4

Why do Previous Algorithms Fail?

Fairness-biased approach: threads A, B, and C take turns accessing memory. No thread starves, but less memory-intensive threads are not prioritized, which reduces throughput.

Throughput-biased approach: prioritize less memory-intensive threads. Good for throughput, but the deprioritized threads suffer starvation and unfairness.

A single policy for all threads is insufficient.

Slide5

Insight: Achieving the Best of Both Worlds

For throughput: prioritize memory-non-intensive threads over memory-intensive ones.

For fairness: unfairness is caused by memory-intensive threads being prioritized over each other, so shuffle the memory-intensive threads. Because memory-intensive threads have different vulnerability to interference, shuffle them asymmetrically.

Slide6

Outline

- Motivation & Insights
- Overview
- Algorithm
- Bringing it All Together
- Evaluation
- Conclusion

Slide7

Overview: Thread Cluster Memory Scheduling

1. Group the threads in the system into two clusters: memory-non-intensive threads form the non-intensive cluster, memory-intensive threads form the intensive cluster.
2. Prioritize the non-intensive cluster over the intensive cluster.
3. Apply a different policy within each cluster: rank the non-intensive cluster for throughput, and shuffle the intensive cluster for fairness.

Slide8

Slide9

TCM Outline

1. Clustering

Slide10

Clustering Threads

Step 1: Sort threads by MPKI (misses per kilo-instruction); higher MPKI means more memory-intensive.

Step 2: Memory bandwidth usage divides the clusters. Let T be the total memory bandwidth usage and let ClusterThreshold be a fraction α < 10%. Starting from the least intensive thread, threads whose summed bandwidth usage stays within α·T form the non-intensive cluster; the remaining threads form the intensive cluster.

Slide11
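The two clustering steps can be sketched in Python. The dict fields, the default α, and the greedy fill order are illustrative assumptions, not TCM's hardware logic:

```python
def cluster_threads(threads, alpha=0.10):
    """Sketch of TCM clustering. threads: list of dicts with 'mpki'
    and 'bw' (memory bandwidth usage); alpha is ClusterThreshold."""
    total_bw = sum(t["bw"] for t in threads)            # T
    by_mpki = sorted(threads, key=lambda t: t["mpki"])  # least intensive first
    non_intensive, intensive = [], []
    used_bw = 0.0
    for t in by_mpki:
        # Fill the non-intensive cluster until its bandwidth share
        # would exceed alpha * T; everything else is intensive.
        if used_bw + t["bw"] <= alpha * total_bw:
            non_intensive.append(t)
            used_bw += t["bw"]
        else:
            intensive.append(t)
    return non_intensive, intensive
```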

TCM Outline

1. Clustering
2. Between Clusters

Slide12

Prioritization Between Clusters

Prioritize the non-intensive cluster over the intensive cluster.

- Increases system throughput: non-intensive threads have greater potential for making progress.
- Does not degrade fairness: non-intensive threads are "light" and rarely interfere with intensive threads.

Slide13

TCM Outline

1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (throughput)

Slide14

Non-Intensive Cluster

Prioritize threads according to MPKI: the lowest-MPKI thread gets the highest priority, the highest-MPKI thread the lowest. This increases system throughput, because the least intensive thread has the greatest potential for making progress in the processor.

Slide15
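A minimal sketch of this ranking, assuming per-thread MPKI values are available; the numeric rank encoding (larger number = higher priority, matching the "highest-rank" rule used later) is an illustrative choice:

```python
def rank_non_intensive(cluster):
    """cluster: list of (thread_id, mpki) pairs.
    Returns thread_id -> rank, where a larger rank means
    higher scheduling priority (lowest MPKI ranks highest)."""
    by_intensity = sorted(cluster, key=lambda t: t[1], reverse=True)
    return {tid: rank for rank, (tid, _) in enumerate(by_intensity)}
```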

TCM Outline

1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (throughput)
4. Intensive Cluster (fairness)

Slide16

Intensive Cluster

Periodically shuffle the priority of the intensive threads; giving each thread a turn as the most prioritized increases fairness. But is treating all threads equally good enough? No: equal turns ≠ same slowdown.

Slide17

Case Study: A Tale of Two Threads

Two intensive threads contend: a random-access thread and a streaming thread. Which is slowed down more easily?

- Prioritize random-access: random-access (prioritized) 1x slowdown, streaming 7x.
- Prioritize streaming: streaming (prioritized) 1x slowdown, random-access 11x.

The random-access thread is more easily slowed down.

Slide18

Why are Threads Different?

The random-access thread spreads its requests across the memory banks: all requests proceed in parallel, so it has high bank-level parallelism. The streaming thread sends all its requests to the same row: they hit the activated row, so it has high row-buffer locality.

When the streaming thread's row hits occupy the banks, the random-access thread's requests get stuck behind them: its high bank-level parallelism makes it vulnerable to interference.

[Figure: four memory banks; random-access requests fan out across Banks 1-4 in parallel, streaming requests queue on a single activated row]

Slide19

Slide20

Niceness

How can the difference between intensive threads be quantified? Define a thread's niceness from two metrics:

- Bank-level parallelism (vulnerability to interference) increases niceness.
- Row-buffer locality (causes interference) decreases niceness.

A thread with high bank-level parallelism and low row-buffer locality is nice; one with the opposite profile is not.

Slide21
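The slide gives only the direction of each term; one plausible sketch, treated here as an assumption, combines rank-normalized metrics so that a higher bank-level-parallelism (BLP) rank raises niceness and a higher row-buffer-locality (RBL) rank lowers it:

```python
def niceness_ranks(threads):
    """threads: dict thread_id -> {'blp': ..., 'rbl': ...}.
    Returns thread_id -> niceness score (higher = nicer).
    BLP rank contributes positively, RBL rank negatively."""
    ids = list(threads)
    blp_rank = {tid: r for r, tid in
                enumerate(sorted(ids, key=lambda t: threads[t]["blp"]))}
    rbl_rank = {tid: r for r, tid in
                enumerate(sorted(ids, key=lambda t: threads[t]["rbl"]))}
    return {tid: blp_rank[tid] - rbl_rank[tid] for tid in ids}
```

With this sketch, the random-access thread of the case study (high BLP, low RBL) comes out nicer than the streaming thread.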

Shuffling: Round-Robin vs. Niceness-Aware

Round-Robin shuffling rotates the priority order of threads A, B, C, D every ShuffleInterval (D is the least nice thread, the others are nicer). GOOD: each thread is prioritized once per round. What can go wrong? BAD: the nice threads receive lots of interference, because the least nice thread regularly reaches the top of the priority order.

Slide23

Shuffling: Round-Robin vs. Niceness-Aware

Niceness-Aware shuffling permutes the priority order asymmetrically every ShuffleInterval. GOOD: each thread is still prioritized once per round. GOOD: the least nice thread stays mostly deprioritized, so the nicer threads see far less interference.

Slide25
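The contrast can be simulated with a toy model. The thread names and the exact niceness-aware permutation schedule below are assumptions for illustration; the slides only establish the asymmetry, not this particular policy:

```python
def round_robin(order):
    """Each interval, the current top thread drops to the bottom."""
    return order[1:] + [order[0]]

def make_niceness_aware(nice_order):
    """nice_order: threads from nicest to least nice (assumed known).
    Returns a shuffle that rotates the top slot mostly among the
    nicer threads, giving the least nice thread only rare turns."""
    state = {"i": 0}
    n = len(nice_order)
    def shuffle(order):
        state["i"] += 1
        if state["i"] % n == 0:
            top = nice_order[-1]                    # rare turn for least nice
        else:
            top = nice_order[state["i"] % (n - 1)]  # cycle among nicer threads
        return [top] + [t for t in nice_order if t != top]
    return shuffle

def top_slot_counts(shuffle, order, intervals):
    """Count how often each thread holds the most-prioritized slot."""
    counts = {t: 0 for t in order}
    for _ in range(intervals):
        counts[order[0]] += 1
        order = shuffle(order)
    return counts
```

Over eight intervals starting from [A, B, C, D], round-robin gives every thread two turns at the top, while this niceness-aware policy gives the least nice thread D only one.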

Slide26

Slide27

Quantum-Based Operation

Time is divided into quanta of ~1M cycles.

During a quantum, monitor each thread's behavior: memory intensity, bank-level parallelism, and row-buffer locality.

At the beginning of each quantum, using the statistics from the previous quantum: perform clustering and compute the niceness of the intensive threads.

Within a quantum, the intensive cluster's priorities are shuffled every shuffle interval (~1K cycles).

Slide28
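The control loop above can be sketched as follows; the cycle counts come from the slide, while the helper functions are hypothetical placeholders for the monitoring, clustering, niceness, and shuffling steps:

```python
QUANTUM_CYCLES = 1_000_000   # ~1M cycles per quantum
SHUFFLE_CYCLES = 1_000       # ~1K cycles per shuffle interval

def run_quantum(threads, monitor, cluster, compute_niceness, shuffle):
    """One quantum of TCM-style operation (sketch).
    Clustering and niceness act on last quantum's statistics;
    monitoring continues while the intensive ranks are shuffled."""
    non_intensive, intensive = cluster(threads)      # beginning of quantum
    niceness = compute_niceness(intensive)
    for _cycle in range(0, QUANTUM_CYCLES, SHUFFLE_CYCLES):
        intensive = shuffle(intensive, niceness)     # periodic rank shuffle
        monitor(threads, SHUFFLE_CYCLES)             # keep measuring MPKI/BLP/RBL
    return non_intensive, intensive
```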

TCM Scheduling Algorithm

1. Highest-rank: requests from higher-ranked threads are prioritized.
   - Non-intensive cluster > intensive cluster.
   - Within the non-intensive cluster: lower intensity → higher rank.
   - Within the intensive cluster: rank shuffling.
2. Row-hit: row-buffer hit requests are prioritized.
3. Oldest: older requests are prioritized.

Slide29
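The three rules form a lexicographic priority, which can be sketched as a sort key; the request field names are illustrative assumptions:

```python
def request_priority(req):
    """req: dict with 'rank' (higher = more important thread),
    'row_hit' (bool), and 'arrival' (cycle number).
    Lower tuple sorts first: highest rank, then row hits, then oldest."""
    return (-req["rank"], not req["row_hit"], req["arrival"])

def pick_next(requests):
    """Select the next request to serve under the three TCM rules."""
    return min(requests, key=request_priority)
```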

Implementation Costs

Required storage at the memory controller (24 cores); no computation is on the critical path.

Thread memory behavior     Storage
MPKI                       ~0.2 kbit
Bank-level parallelism     ~0.6 kbit
Row-buffer locality        ~2.9 kbit
Total                      < 4 kbit

Slide30

Slide31

Metrics & Methodology

Metrics: system throughput and unfairness.

Methodology:
- Core model: 4 GHz processor, 128-entry instruction window, 512 KB per-core L2 cache
- Memory model: DDR2
- 96 multiprogrammed SPEC CPU2006 workloads

Slide32
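The slide's metric formulas were lost in extraction. In memory-scheduling studies these metrics are typically weighted speedup (throughput) and maximum slowdown (unfairness); treat the exact definitions below as an assumption:

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """System throughput: sum of per-thread speedups, where each
    thread's speedup is its shared-run IPC over its alone-run IPC."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def maximum_slowdown(ipc_shared, ipc_alone):
    """Unfairness: the slowdown of the worst-hit thread."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))
```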

Previous Work

- FR-FCFS [Rixner et al., ISCA'00]: prioritizes row-buffer hits. Thread-oblivious → low throughput and low fairness.
- STFM [Mutlu et al., MICRO'07]: equalizes thread slowdowns. Non-intensive threads not prioritized → low throughput.
- PAR-BS [Mutlu et al., ISCA'08]: prioritizes the oldest batch of requests while preserving bank-level parallelism. Non-intensive threads not always prioritized → low throughput.
- ATLAS [Kim et al., HPCA'10]: prioritizes threads with less memory service. The most intensive thread starves → low fairness.

Slide33

Results: Fairness vs. Throughput

[Figure: fairness vs. system throughput, averaged over 96 workloads; TCM lies in the better-fairness, better-throughput corner relative to the prior schedulers, with annotated gaps of 5%, 39%, 8%, and 5%]

TCM provides the best fairness and system throughput.

Slide34

Results: Fairness-Throughput Tradeoff

When each scheduler's configuration parameter is varied (for TCM, adjusting ClusterThreshold):

[Figure: fairness vs. system throughput curves for STFM, PAR-BS, ATLAS, and TCM as each parameter is swept]

TCM allows a robust fairness-throughput tradeoff.

Slide35

Operating System Support

- ClusterThreshold is a tunable knob: the OS can trade off between fairness and throughput.
- Enforcing thread weights: the OS assigns weights to threads, and TCM enforces thread weights within each cluster.

Slide36

Slide37

Conclusion

No previous memory scheduling algorithm provides both high system throughput and fairness. The problem: they use a single policy for all threads.

TCM groups threads into two clusters:
- Prioritize the non-intensive cluster → throughput
- Shuffle priorities within the intensive cluster → fairness
- Shuffling should favor nice threads → fairness

TCM provides the best system throughput and fairness.

Slide38

Thank You

Slide39

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter

Slide40

Thread Weight Support

Even if the heaviest-weighted thread happens to be the most intensive thread, it is not prioritized over the least intensive thread.

Slide41

Harmonic Speedup

[Figure: fairness vs. system throughput, with throughput measured as harmonic speedup]

Slide42

Shuffling Algorithm Comparison

With Niceness-Aware shuffling, both the average and the variance of maximum slowdown are lower than with Round-Robin:

Shuffling Algorithm       Round-Robin    Niceness-Aware
E(Maximum Slowdown)       5.58           4.84
VAR(Maximum Slowdown)     1.61           0.85

Slide43

Sensitivity Results

ShuffleInterval (cycles)    500     600     700     800
System Throughput           14.2    14.3    14.2    14.7
Maximum Slowdown            6.0     5.4     5.9     5.5

Number of Cores                    4      8      16     24     32
System Throughput (vs. ATLAS)      0%     3%     2%     1%     1%
Maximum Slowdown (vs. ATLAS)       -4%    -30%   -29%   -30%   -41%