
Presentation Transcript

Slide 1

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors

Abhishek Bhattacharjee

Margaret Martonosi

Princeton University

Slide 2

Why Thread Criticality Prediction?

[Figure: instructions executed by threads T0–T3 up to a barrier; the threads that suffer D-cache and I-cache misses fall behind while the others stall waiting.]

Threads 1 & 3 are critical → performance degradation, energy inefficiency

Sources of variability: algorithm, process variations, thermal emergencies etc.

With thread criticality prediction:

Task stealing for performance

DVFS for energy efficiency

Many others …

Slide 3

Related Work

Instruction criticality [Fields et al. 2001, Tune et al. 2001, etc.]

Thrifty barrier [Li et al. 2005]

Faster cores transitioned into low-power mode based on prediction of barrier stall time

DVFS for energy-efficiency at barriers [Liu et al. 2005]

Meeting points [Cai et al. 2008]

DVFS non-critical threads by tracking loop-iteration completion rate across cores (parallel loops)

Our Approach:

Also handles non-barrier code

Works on constant or variable loop iteration size

Predicts criticality at any point in time, not just barriers

Slide 4

Thread Criticality Prediction Goals

Design Goals

1. Accuracy

Absolute TCP accuracy

Relative TCP accuracy

2. Low-overhead implementation

Simple HW (allow SW policies to be built on top)

3. One predictor, many uses

Design Decisions

1. Find suitable arch. metric

2. History-based local approach versus thread-comparative approach

3. This paper: TBB, DVFS

Other uses: Shared LLC management, SMT and memory priority, …

Slide 5

Outline of this Talk

Thread Criticality Predictor Design

Methodology

Identify µarchitectural events impacting thread criticality

Introduce basic TCP hardware

Thread Criticality Predictor Uses

Apply to Intel’s Threading Building Blocks (TBB)

Apply for energy efficiency in barrier-based programs

Slide 6

Methodology

Evaluations on a range of architectures: high-performance and embedded domains

Full-system including OS

Detailed power/energy studies using FPGA emulator

Infrastructure | Domain | System | Cores | Caches
GEMS Simulator | High-performance, wide-issue, out-of-order | 16-core CMP with Solaris 10 | 4-issue SPARC | 32KB L1, 4MB L2
ARM Simulator | Embedded, in-order | 4-32 core CMP | 2-issue ARM | 32KB L1, 4MB L2
FPGA Emulator | Embedded, in-order | 4-core CMP with Linux 2.6 | 1-issue SPARC | 4KB I-Cache, 8KB D-Cache

Slide 7

Why not History-Based TCPs?

+ Info local to core: no communication
-- Requires repetitive barrier behavior
-- Problem for in-order pipelines: variant IPCs

Slide 8

Thread-Comparative Metrics for TCP: Instruction Counts

Slide 9

Thread-Comparative Metrics for TCP: L1 D-Cache Misses

Slide 10

Thread-Comparative Metrics for TCP: L1 I- & D-Cache Misses

Slide 11

Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

Slide 12

Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

Slide 13

Outline of this Talk

Thread Criticality Predictor Design

Methodology

Identify µarchitectural events impacting thread criticality

Introduce basic TCP hardware

Thread Criticality Predictor Uses

Apply to Intel’s Threading Building Blocks (TBB)

Apply for energy efficiency in barrier-based programs

Slide 14

Basic TCP Hardware

[Animation: four cores, each with an L1 I-cache and D-cache, share an L2 cache; the TCP hardware sits at the L2 controller and keeps one criticality counter per core, all starting at 0. As the threads execute, a core that stalls on an L1 cache miss has its counter incremented (by 1 in the example), and a core that stalls on an L2 cache miss has it incremented by a much larger, penalty-weighted amount (by 10 in the example), while well-cached cores race ahead in instruction count.]

Per-core criticality counters track poorly cached, slow threads

Periodically refresh criticality counters with the Interval Bound Register
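As a software-level illustration of the counter scheme in the animation above, here is a minimal C model, using hypothetical names; the miss weights and the refresh interval are illustrative, not the paper's exact values:

```c
#include <stdint.h>
#include <string.h>

#define NUM_CORES 4

/* Illustrative penalty weights: an L2 miss stalls a thread far longer
 * than an L1 miss, so it adds more to the criticality count. */
enum { L1_MISS_WEIGHT = 1, L2_MISS_WEIGHT = 10 };

typedef struct {
    uint16_t counter[NUM_CORES];   /* per-core criticality counters */
    uint64_t interval_bound;       /* refresh period, in cycles     */
    uint64_t last_refresh;         /* cycle of the last refresh     */
} tcp_state_t;

/* Called when core 'core' suffers a cache miss (the TCP sits by the
 * shared L2, so this information is already visible there). */
static void tcp_record_miss(tcp_state_t *tcp, int core, int is_l2_miss)
{
    tcp->counter[core] += is_l2_miss ? L2_MISS_WEIGHT : L1_MISS_WEIGHT;
}

/* Interval Bound Register: periodically clear the counters so stale
 * history does not dominate the prediction. */
static void tcp_maybe_refresh(tcp_state_t *tcp, uint64_t now_cycles)
{
    if (now_cycles - tcp->last_refresh >= tcp->interval_bound) {
        memset(tcp->counter, 0, sizeof tcp->counter);
        tcp->last_refresh = now_cycles;
    }
}

/* The most critical (slowest, most poorly cached) thread is the one
 * whose counter is largest. */
static int tcp_most_critical_core(const tcp_state_t *tcp)
{
    int best = 0;
    for (int c = 1; c < NUM_CORES; c++)
        if (tcp->counter[c] > tcp->counter[best])
            best = c;
    return best;
}
```

In the actual design this bookkeeping is done in hardware next to the L2 controller, where the cache-miss information is already available, so maintaining the counters needs no extra inter-core communication.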

Slide 15

Outline of this Talk

Thread Criticality Predictor (TCP) Design

Methodology

Identify µarchitectural events impacting thread criticality

Introduce basic TCP hardware

Thread Criticality Predictor Uses

Apply to Intel’s Threading Building Blocks (TBB)

Apply for energy efficiency in barrier-based programs

Slide 16

TBB Task Stealing & Thread Criticality

TBB dynamic scheduler distributes tasks

Each thread maintains software queue filled with tasks

Empty queue – thread “steals” task from another thread’s queue

Approach 1: Default TBB uses random task stealing

More failed steals at higher core counts → poor performance

Approach 2: Occupancy-based task stealing [Contreras, Martonosi, 2008]

Steal based on number of items in SW queue

Must track and compare max. occupancy counts
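For reference, a minimal sketch of victim selection under the two baseline policies above (hypothetical names and structures, not TBB's actual internals):

```c
#include <stdlib.h>

#define NUM_CORES 4

/* Per-thread software task queue; 'occupancy' counts the queued tasks. */
typedef struct {
    int occupancy;
} task_queue_t;

/* Approach 1: pick a victim uniformly at random. With many cores the
 * chance of hitting an empty queue (a failed steal) grows. */
static int pick_victim_random(int self)
{
    int victim;
    do {
        victim = rand() % NUM_CORES;
    } while (victim == self);
    return victim;
}

/* Approach 2: steal from the queue with the most tasks, which requires
 * tracking and comparing occupancy counts across all threads. */
static int pick_victim_occupancy(const task_queue_t q[NUM_CORES], int self)
{
    int victim = -1;
    for (int c = 0; c < NUM_CORES; c++) {
        if (c == self)
            continue;
        if (victim < 0 || q[c].occupancy > q[victim].occupancy)
            victim = c;
    }
    return victim;
}
```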

Slide 17

TCP-Guided TBB Task Stealing

[Animation: four cores, each with a software task queue (SW Q0–Q3), share an L2 cache whose controller holds the per-core criticality counters, the Interval Bound Register, and the TCP control logic. Tasks 0–7 are distributed across the queues. As the clock advances, Core 3 suffers L1 and L2 misses, so its criticality counter grows largest. When Core 2's queue empties, it sends a steal request to the TCP, which scans for the maximum counter value and directs the steal to Core 3's queue (Task 7).]

TCP initiates steals from critical thread

Modest message overhead: L2 access latency

Scalable: 14-bit criticality counters → 114 bytes of storage @ 64 cores
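A matching sketch of the TCP-guided policy: when a core's queue is empty it asks the L2-resident TCP for the core with the largest criticality counter and steals from that core's queue. Names and the interface are assumptions for illustration, not the paper's exact hardware/software API:

```c
#include <stdint.h>

#define NUM_CORES 4

/* TCP-guided victim selection: the TCP scans its per-core criticality
 * counters and returns the most critical core, i.e. the slowest, most
 * poorly cached thread, whose queue is the best place to take work from.
 * The cost of the request is roughly one L2 access latency. */
static int pick_victim_tcp(const uint16_t criticality[NUM_CORES], int self)
{
    int victim = -1;
    for (int c = 0; c < NUM_CORES; c++) {
        if (c == self)
            continue;
        if (victim < 0 || criticality[c] > criticality[victim])
            victim = c;
    }
    return victim;
}
```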

Slide 18

TCP-Guided TBB Performance

TCP access penalized with L2 latency

[Chart: % perf. improvement versus random task stealing]

Avg. Improvement over Random (32 cores) = 21.6 %

Avg. Improvement over Occupancy (32 cores) = 13.8 %

Slide 19

Outline of this Talk

Thread Criticality Predictor Design

Methodology

Identify µarchitectural events impacting thread criticality

Introduce basic TCP hardware

Thread Criticality Predictor Uses

Apply to Intel’s Threading Building Blocks (TBB)

Apply for energy efficiency in barrier-based programs

Slide 20

Adapting TCP for Energy Efficiency in Barrier-Based Programs

[Figure: instructions executed by threads T0–T3; T1 falls behind during an L2 D-cache miss, so T1 is critical → DVFS T0, T2, T3]

Approach: DVFS non-critical threads to eliminate barrier stall time

Challenges:

Relative criticalities

Misprediction costs

DVFS overheads
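A minimal sketch of the decision implied above: compare the per-core criticality counters and scale down cores that are clearly non-critical, so they reach the barrier just in time instead of stalling there. The threshold and the two frequency levels are illustrative assumptions, not the paper's actual DVFS policy:

```c
#include <stdint.h>

#define NUM_CORES 4

/* Illustrative DVFS levels; real platforms expose a small discrete set. */
typedef enum { FREQ_LOW, FREQ_NOMINAL } freq_level_t;

/* Slow down every core whose criticality counter is well below that of
 * the most critical core; the critical thread keeps running at nominal
 * frequency so the barrier itself is not delayed. */
static void dvfs_noncritical(const uint16_t criticality[NUM_CORES],
                             freq_level_t freq[NUM_CORES])
{
    uint16_t max = 0;
    for (int c = 0; c < NUM_CORES; c++)
        if (criticality[c] > max)
            max = criticality[c];

    for (int c = 0; c < NUM_CORES; c++) {
        /* Illustrative threshold: at least 25% "less critical" than the
         * slowest thread -> scale down. */
        freq[c] = (criticality[c] * 4 < max * 3) ? FREQ_LOW : FREQ_NOMINAL;
    }
}
```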

Slide 21

TCP for DVFS: Results

FPGA platform with 4 cores, 50% fixed leakage cost

See paper for details: TCP mispredictions, DVFS overheads, etc.

Average 15% energy savings

Slide 22

Conclusions

Goal 1: Accuracy

Accurate TCPs based on simple cache statistics

Goal 2: Low-overhead hardware

Scalable per-core criticality counters used

TCP in central location where cache info. is already available

Goal 3: Versatility

TBB improved by 13.8% over best known approach @ 32 cores

DVFS used to achieve 15% energy savings

Two uses shown, many others possible…