Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors
Abhishek Bhattacharjee
Margaret Martonosi
Princeton University
Why Thread Criticality Prediction?
[Figure: execution timeline (instructions executed vs. time) for threads T0-T3; threads 1 & 3 suffer I-cache and D-cache misses while the other threads stall at the barrier, so threads 1 & 3 are critical]
Performance degradation, energy inefficiency
Sources of variability: algorithm, process variations, thermal emergencies, etc.
With thread criticality prediction:
  Task stealing for performance
  DVFS for energy efficiency
  Many others …
Related Work
Instruction criticality [Fields et al. 2001, Tune et al. 2001, etc.]
Thrifty barrier [Li et al. 2005]
  Faster cores transitioned into low-power mode based on prediction of barrier stall time
DVFS for energy efficiency at barriers [Liu et al. 2005]
Meeting points [Cai et al. 2008]
  DVFS non-critical threads by tracking loop iteration completion rate across cores (parallel loops)
Our Approach:
  Also handles non-barrier code
  Works on constant or variable loop iteration sizes
  Predicts criticality at any point in time, not just at barriers
Thread Criticality Prediction Goals
Design Goals
1. Accuracy
Absolute TCP accuracy
Relative TCP accuracy
2. Low-overhead implementation
Simple HW (allow SW policies to be built on top)
3. One predictor, many uses
Design Decisions
1. Find suitable arch. metric
2. History-based local approach versus thread-comparative approach
3. This paper: TBB, DVFS
Other uses: shared LLC management, SMT and memory priority, …
Outline of this Talk
Thread Criticality Predictor Design
  Methodology
  Identify µarchitectural events impacting thread criticality
  Introduce basic TCP hardware
Thread Criticality Predictor Uses
  Apply to Intel's Threading Building Blocks (TBB)
  Apply for energy efficiency in barrier-based programs
Methodology
Evaluations on a range of architectures: high-performance and embedded domains
Full-system including OS
Detailed power/energy studies using FPGA emulator

Infrastructure   Domain                                    System                      Cores           Caches
GEMS Simulator   High-performance, wide-issue, out-of-order  16-core CMP with Solaris 10  4-issue SPARC   32KB L1, 4MB L2
ARM Simulator    Embedded, in-order                        4-32 core CMP               2-issue ARM     32KB L1, 4MB L2
FPGA Emulator    Embedded, in-order                        4-core CMP with Linux 2.6   1-issue SPARC   4KB I-Cache, 8KB D-Cache
Why not History-Based TCPs?
+ Info local to core: no communication
-- Requires repetitive barrier behavior
-- Problem for in-order pipelines: variant IPCs
Thread-Comparative Metrics for TCP: Instruction Counts
Thread-Comparative Metrics for TCP: L1 D-Cache Misses
Thread-Comparative Metrics for TCP: L1 I- & D-Cache Misses
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses
Outline of this Talk
Thread Criticality Predictor Design
  Methodology
  Identify µarchitectural events impacting thread criticality
  Introduce basic TCP hardware
Thread Criticality Predictor Uses
  Apply to Intel's Threading Building Blocks (TBB)
  Apply for energy efficiency in barrier-based programs
Basic TCP Hardware
[Figure: 4-core CMP, each core with private L1 I-$ and D-$, all sharing an L2 cache; the TCP hardware sits at the L2 controller and maintains one criticality counter per core]
Animated example: as the four threads execute, Core 1's L1 D-$ miss increments its criticality counter by 1; Core 2's L1 I-$ miss increments its counter by 1; Core 1's subsequent L2 miss then adds 10 more (counter: 1 → 11), reflecting the much larger L2 miss penalty
Per-core Criticality Counters track poorly cached, slow threads
Periodically refresh criticality counters with Interval Bound Register
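The counter update and refresh logic above can be sketched in software. This is a minimal sketch: the +1/+10 miss weights follow the slide's animated example, and the `INTERVAL_BOUND` value is an illustrative assumption, not a figure from the talk.

```python
# Sketch of the per-core criticality counters maintained by the TCP
# hardware at the L2 controller. Miss weights follow the animated
# example (L1 miss: +1, L2 miss: +10); the refresh interval is an
# assumed value for illustration only.

L1_MISS_WEIGHT = 1
L2_MISS_WEIGHT = 10
INTERVAL_BOUND = 100_000  # cycles between counter refreshes (assumed)

class CriticalityCounters:
    def __init__(self, num_cores):
        self.counters = [0] * num_cores
        self.cycles_since_refresh = 0

    def on_l1_miss(self, core):
        self.counters[core] += L1_MISS_WEIGHT

    def on_l2_miss(self, core):
        self.counters[core] += L2_MISS_WEIGHT

    def tick(self, cycles=1):
        # Interval Bound Register: periodically zero the counters so
        # stale miss history does not dominate the prediction.
        self.cycles_since_refresh += cycles
        if self.cycles_since_refresh >= INTERVAL_BOUND:
            self.counters = [0] * len(self.counters)
            self.cycles_since_refresh = 0

    def most_critical_core(self):
        # The poorly cached, slow thread has the largest counter.
        return max(range(len(self.counters)), key=lambda c: self.counters[c])
```

Replaying the slide's animation (L1 D-$ miss on core 1, L1 I-$ miss on core 2, L2 miss on core 1) yields the counters 0, 11, 1, 0 shown on the slide.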
Outline of this Talk
Thread Criticality Predictor (TCP) Design
  Methodology
  Identify µarchitectural events impacting thread criticality
  Introduce basic TCP hardware
Thread Criticality Predictor Uses
  Apply to Intel's Threading Building Blocks (TBB)
  Apply for energy efficiency in barrier-based programs
TBB Task Stealing & Thread Criticality
TBB dynamic scheduler distributes tasks
Each thread maintains a software queue filled with tasks
Empty queue: thread "steals" a task from another thread's queue
Approach 1: Default TBB uses random task stealing
  More failed steals at higher core counts → poor performance
Approach 2: Occupancy-based task stealing [Contreras, Martonosi 2008]
  Steal based on number of items in SW queue
  Must track and compare max. occupancy counts
TCP-Guided TBB Task Stealing
[Figure: 4 cores, each with a software task queue (SW Q0-Q3), sharing an L2 cache; the TCP control logic at the L2 holds the per-core criticality counters and the Interval Bound Register]
Animated example: Core 3's L1 and L2 misses drive its criticality counter to the maximum (e.g., counters 14, 5, 2, 21); when Core 2's queue empties and it issues a steal request, the TCP scans for the maximum counter value and directs the steal at Core 3, the critical thread
TCP initiates steals from critical thread
Modest message overhead: L2 access latency
Scalable: 14-bit criticality counters → 114 bytes of storage @ 64 cores
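The victim-selection step above can be sketched as follows. The queue and counter structures are illustrative stand-ins, not the actual TBB or TCP interfaces.

```python
from collections import deque

# Sketch of TCP-guided victim selection for TBB task stealing: an idle
# core asks the TCP for the core with the highest criticality counter
# and steals a task from that core's software queue, relieving the
# slowest (critical) thread of work.

def choose_victim(criticality_counters, thief):
    # Scan for the maximum counter value, excluding the thief itself.
    candidates = [c for c in range(len(criticality_counters)) if c != thief]
    return max(candidates, key=lambda c: criticality_counters[c])

def steal_task(queues, criticality_counters, thief):
    victim = choose_victim(criticality_counters, thief)
    if queues[victim]:
        task = queues[victim].popleft()
        queues[thief].append(task)
        return task
    return None  # failed steal: the critical thread's queue was empty
```

With the slide's example state (counters 14, 5, 2, 21 and Task 7 queued at Core 3), an idle Core 2 steals Task 7 from Core 3.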
TCP-Guided TBB Performance
TCP access penalized with L2 latency
[Graph: % performance improvement versus random task stealing]
Avg. improvement over Random (32 cores) = 21.6%
Avg. improvement over Occupancy (32 cores) = 13.8%
Outline of this Talk
Thread Criticality Predictor Design
  Methodology
  Identify µarchitectural events impacting thread criticality
  Introduce basic TCP hardware
Thread Criticality Predictor Uses
  Apply to Intel's Threading Building Blocks (TBB)
  Apply for energy efficiency in barrier-based programs
Adapting TCP for Energy Efficiency in Barrier-Based Programs
[Figure: execution timeline (instructions executed vs. time) for threads T0-T3; T1 suffers an L2 D-$ miss and reaches the barrier last, so T1 is critical → DVFS T0, T2, T3]
Approach: DVFS non-critical threads to eliminate barrier stall time
Challenges:
  Relative criticalities
  Misprediction costs
  DVFS overheads
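A simple policy along these lines can be sketched as follows. The frequency levels and the counter-to-slack mapping are illustrative assumptions, not the talk's actual policy; they only show the direction of the decision (threads further ahead of the critical thread are slowed more).

```python
# Sketch of criticality-guided DVFS for a barrier region: non-critical
# threads are slowed so they arrive at the barrier closer to when the
# critical thread does. FREQ_LEVELS and the ratio heuristic below are
# assumed for illustration.

FREQ_LEVELS = [1.0, 0.8, 0.6, 0.4]  # normalized DVFS settings (assumed)

def pick_dvfs_levels(criticality_counters):
    critical = max(criticality_counters)
    levels = []
    for count in criticality_counters:
        # Higher counter -> more cache misses -> further behind -> run
        # faster. The critical thread stays at full frequency; threads
        # that are ahead scale down in proportion to the counter ratio.
        ratio = count / critical if critical else 1.0
        # Slowest available level that still keeps pace with the ratio.
        level = min((f for f in FREQ_LEVELS if f >= ratio),
                    default=FREQ_LEVELS[0])
        levels.append(level)
    return levels
```

For the slide's example (T1 critical), the critical thread keeps full frequency while the other three are scaled down, trading their barrier stall time for energy savings.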
TCP for DVFS: Results
FPGA platform with 4 cores, 50% fixed leakage cost
See paper for details: TCP mispredictions, DVFS overheads, etc.
Average 15% energy savings
Conclusions
Goal 1: Accuracy
  Accurate TCPs based on simple cache statistics
Goal 2: Low-overhead hardware
  Scalable per-core criticality counters used
  TCP in central location where cache info is already available
Goal 3: Versatility
  TBB improved by 13.8% over best known approach @ 32 cores
  DVFS used to achieve 15% energy savings
Two uses shown, many others possible…