Self-tuning HTM
Paolo Romano
Based on the ICAC'14 paper:
N. Diegues and Paolo Romano, "Self-Tuning Intel Transactional Synchronization Extensions",
11th USENIX International Conference on Autonomic Computing (ICAC), June 2014. Best paper award.
Best-Effort Nature of HTM

No progress guarantees: a transaction may always abort, due to a number of reasons:
- forbidden instructions
- capacity of caches (L1 for writes, L2 for reads)
- faults and signals
- contending transactions, aborting each other

Hence the need for a fallback path, typically a lock or an STM.
When and how to activate the fallback?

How many retries before triggering the fallback?
- Ranges from never retrying to insisting many times

How to cope with capacity aborts?
- GiveUp – exhaust all retries left
- Half – drop half of the retries left
- Stubborn – drop only one retry left

How to implement the fallback synchronization?
- Wait – single lock should be free before retrying
- None – retry immediately and hope the lock will be freed
- Aux – serialize conflicting transactions on an auxiliary lock
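The three capacity-abort policies can be captured in a single budget-update rule. The sketch below is a minimal Python illustration; the function and parameter names are ours, not from the talk:

```python
def remaining_budget(retries_left, abort_was_capacity, policy):
    """Return the retry budget after one aborted HTM attempt.

    On a non-capacity abort, one retry is consumed. On a capacity
    abort the policy decides how aggressively to give up, since a
    transaction that overflowed the cache is likely to do so again:
      - 'giveup':   exhaust all retries left (fall back immediately)
      - 'half':     drop half of the retries left
      - 'stubborn': drop only one retry, insist on HTM
    """
    if not abort_was_capacity:
        return retries_left - 1
    if policy == "giveup":
        return 0
    if policy == "half":
        return retries_left // 2
    if policy == "stubborn":
        return retries_left - 1
    raise ValueError(f"unknown policy: {policy}")
```

For example, with 8 retries left, a capacity abort leaves 0, 4, or 7 retries under giveup, half, and stubborn respectively.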
Is static tuning enough?

Focus on the single global lock fallback.

Heuristic: tune the parameters according to best practices
- empirical work in recent papers [SC13, HPCA14]
- Intel optimization manual

GCC: use the existing support in GCC out of the box
Why Static Tuning is not enough

Benchmark     GCC    Heuristic   Best Tuning
genome        1.54   3.14        3.36  wait-giveup-4
intruder      2.03   1.81        3.02  wait-giveup-4
kmeans-h      2.73   2.66        3.03  none-stubborn-10
rbt-l-w       2.48   2.43        2.95  aux-stubborn-3
ssca2         1.71   1.69        1.78  wait-giveup-6
vacation-h    2.12   1.61        2.51  aux-half-5
yada          0.19   0.47        0.81  wait-stubborn-15

Speedup with 4 threads (vs 1 thread non-instrumented).
Intel Haswell Xeon with 4 cores (8 hyperthreads).
There is clear room for improvement over static tuning.
No one size fits all

[Figure: Intruder from the STAMP benchmarks]
Are all optimization dimensions relevant?

How many retries before triggering the fallback?
- Ranges from never retrying to insisting many times

How to cope with capacity aborts?
- GiveUp – exhaust all retries left
- Half – drop half of the retries left
- Stubborn – drop only one retry left

How to implement the fallback synchronization?
- Wait – single lock should be free before retrying
- None – retry immediately and hope the lock will be freed
- Aux – serialize conflicting transactions on an auxiliary lock

Observations: aux and wait perform similarly, and when none is best, it is by a marginal amount. We can therefore drop this dimension from the optimization problem.
Self-tuning design choices

3 key choices:
- How should we learn?
- At what granularity should we adapt?
- What metrics should we optimize for?
How should we learn?

Off-line learning:
- test with some mix of applications & characterize their workload
- infer a model (e.g., based on decision trees) mapping workload → optimal configuration
- monitor the workload of your target application, feed the model with this info, and tune the system accordingly

On-line learning:
- no preliminary training phase
- explore the search space while the application is running
- exploit the knowledge acquired via exploration for tuning
How should we learn?

Off-line learning
PRO:
- no exploration costs
CONs:
- initial training phase is time-consuming and "critical"
- accuracy is strongly affected by how representative the training set is
- non-trivial to incorporate new knowledge from the target application

On-line learning
PROs:
- no training phase
- plug-and-play effect
- naturally incorporates newly available knowledge
CONs:
- exploration costs

With HTM, reconfiguration cost is low, so exploring is affordable.
Which on-line learning techniques?

Uses 2 on-line reinforcement learning techniques in synergy:
- Upper Confidence Bounds: how to cope with capacity aborts?
- Gradient Descent: how many retries in hardware?

Key features:
- both techniques are extremely lightweight, i.e., practical
- they are coupled in a hierarchical fashion, since they optimize non-independent parameters; coupling avoids ping-pong effects
Self-tuning design choices

3 key choices:
- How should we learn?
- At what granularity should we adapt?
- What metrics should we optimize for?
At what granularity should we adapt?

Per thread & atomic block
PRO: exploits diversity and maximizes flexibility
CONs:
- possibly large number of optimizers running in parallel
- redundancy → larger overheads
- interplay of multiple local optimizers

Whole application
PRO: lower overhead, simpler convergence dynamics
CON: reduced flexibility
Self-tuning design choices

3 key choices:
- How should we learn?
- At what granularity should we adapt?
- What metrics should we optimize for?
What metrics should we optimize for?

Performance? Power? A combination of the two?

Key issues/questions:
- cost and accuracy of monitoring the target metric:
  - Performance: RDTSC allows for lightweight, fine-grained measurement of latency
  - Energy: RAPL offers only coarse granularity (msec) and requires system calls
- how correlated are the two metrics?
Energy and performance in (H)TM: two sides of the same coin?

How correlated are energy consumption and throughput?

480 different configurations (number of retries, capacity-abort handling, number of threads) per benchmark, including both optimal and sub-optimal configurations.
Energy and performance in (H)TM: two sides of the same coin?

How suboptimal is the energy consumption if we use a configuration that is optimal performance-wise?
(G)Tuner

Performance measured through processor cycles (RDTSC).

Supports fine- and coarse-grained optimization granularity:
- Tuner: per atomic block, per thread; no synchronization among threads
- G(lobal)-Tuner: application-wide configuration; threads collect statistics privately, and an optimizer thread periodically gathers the stats and decides a (possibly) new configuration

Periodic profiling and re-optimization to minimize overhead.
Integrated in GCC.
Evaluation

Intel Haswell Xeon with 4 cores (8 hyper-threads).

Variants, for both the RTM-SGL and RTM-NOrec fallbacks:
- Idealized "Best" variant
- Tuner
- G-Tuner
- Heuristic: GiveUp-5

Additional baselines: GCC, NOrec (STM), Adaptive Locks [PACT09]
RTM-SGL

Intruder from the STAMP benchmarks.
[Figure: speedup vs number of threads; annotations: "4% avg offset", "+50%"]
RTM-NOrec

Intruder from the STAMP benchmarks.
[Figure: speedup vs number of threads; G-Tuner is better with the NOrec fallback]
Evaluating the granularity trade-off

Genome from the STAMP benchmarks, 8 threads.
[Figure annotations: "adapting over time"; "also adapting, but large constant overheads"; "static configuration"]
Take home messages

- Tuning of the fallback policy strongly impacts performance.
- Self-tuning of HTM via on-line learning is feasible:
  - plug & play: no training phase
  - gains largely outweigh exploration overheads
- Tuning granularity hides subtle trade-offs: flexibility vs overhead vs convergence speed.
- Optimize for performance or for energy?
  - strong correlation between the 2 metrics
  - how general is this claim? It seems to be the case also for STM.
Thank you!

Questions?
BACKUP SLIDES
Dagstuhl Seminar 2015
Single lock fallback path

After "some" failed attempts using HTM, acquire a single global lock and execute the transaction pessimistically.

How to couple transactions executing in hardware with the fallback? Subscribe the lock in the HTM transaction:
- read the state of the global lock from within the HTM transaction
- activating the fallback path then aborts any concurrent hardware transaction

This mechanism has a strong impact on performance: better tune it properly!
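The retry loop with lock subscription can be sketched as follows. This is a Python simulation, not real HTM: `attempt_tx` stands in for one hardware attempt via RTM intrinsics (`_xbegin`/`_xend`), and all names are illustrative:

```python
import threading

fallback_lock = threading.Lock()  # the single global lock

def run_transaction(tx_body, attempt_tx, budget=5):
    """Execute tx_body: up to `budget` HTM attempts, then fall back.

    attempt_tx(tx_body) simulates one hardware attempt and returns
    None on commit or an abort reason. A real attempt must read
    ("subscribe to") fallback_lock inside the transaction, so that
    acquiring the lock aborts every concurrent hardware transaction.
    """
    for _ in range(budget):
        # 'wait' policy: only start when the lock is free; otherwise
        # the subscription would abort us immediately
        while fallback_lock.locked():
            pass
        if attempt_tx(tx_body) is None:
            return "htm"              # committed in hardware
    with fallback_lock:               # pessimistic fallback path
        tx_body()
    return "fallback"
```

A transaction whose attempts all abort (e.g., on capacity) ends up executing under the global lock; one that commits in hardware never touches it.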
Why Static Tuning is not enough
Self-Tuning Intel RTM

[Figure: per-benchmark results]
How to handle capacity aborts?

Reduction to the "Bandit Problem": a 3-lever slot machine with unknown reward distributions.

Strategy:  Lever A = giveup    Lever B = half    Lever C = stubborn
Reward:    ?                   ?                 ?

Exploitation vs exploration dilemma: how often to test apparently unfavorable levers?
- Too little: convergence to a wrong solution
- Too much: many suboptimal choices
Upper Confidence Bounds (UCB)

Solution to the exploration vs exploitation dilemma:
- on-line estimation of the "uncertainty" of each strategy
- upper confidence bound on the expected reward
- amplify the bound of rarely explored strategies

Appealing theoretical guarantees: logarithmic bound on the optimization error.
Very lightweight and efficient: practical!
Upper Confidence Bounds (UCB)

Basic reward function for each strategy i:
  x_i = 1 / (avg. #cycles using strategy i)

Estimate an upper bound on the reward of each strategy, amplifying the confidence bound of rarely explored levers (the standard UCB1 bound, with n total plays and n_i plays of strategy i):
  x̄_i + sqrt(2 ln(n) / n_i)
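Concretely, selecting a strategy by its bound might look like this. A Python sketch: the inverse-cycles reward follows the slide, while the exploration constant is the textbook UCB1 value, not necessarily the one used in the paper:

```python
import math

def pick_strategy(stats, total_plays):
    """Return the capacity-abort strategy with the highest UCB.

    stats maps strategy name -> (times_played, avg_cycles). Reward is
    the inverse of the average cycle count (fewer cycles = higher
    reward); the sqrt term amplifies the bound of rarely explored
    strategies, as in standard UCB1.
    """
    def bound(times_played, avg_cycles):
        if times_played == 0:
            return float("inf")   # never tried: explore it first
        reward = 1.0 / avg_cycles                       # x_i
        explore = math.sqrt(2.0 * math.log(total_plays) / times_played)
        return reward + explore
    return max(stats, key=lambda name: bound(*stats[name]))
```

With equal average costs, a strategy played once gets a much larger bound than one played a hundred times, so it is the one explored next.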
How many attempts using HTM?

UCB is not a good fit here: too many levers to explore!
Gradient Descent

[Figure: hill-climbing over the retry budget, steps 1-4]

Problems:
1. unnecessary oscillations
2. getting stuck in local maxima
Gradient Descent

[Figure: steps 1-5]

Problems and solutions:
1. unnecessary oscillations → stabilization threshold
2. getting stuck in local maxima → random jumps
Gradient Descent

[Figure: steps 1-8]

Problems and solutions:
1. unnecessary oscillations → stabilization threshold
2. getting stuck in local maxima → random jumps, reverting to the current maximum upon "unlucky" jumps
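One step of this search can be sketched as follows. A Python illustration: the threshold, jump probability, and budget bounds are made-up values, and reverting after an unlucky jump is left to the caller:

```python
import random

def next_budget(curr, prev, perf_curr, perf_prev,
                stabilization=0.05, jump_prob=0.05,
                lo=1, hi=16, rng=random):
    """Hill-climbing step over the HTM retry budget.

    - Stabilization threshold: if the relative performance change is
      tiny, keep the current budget to avoid unnecessary oscillations.
    - Random jumps: with small probability, probe a random budget to
      escape local maxima (the caller reverts to the best-known
      budget if the jump turns out to be "unlucky").
    - Otherwise, keep moving in the direction that improved
      performance, and turn around if the last move hurt.
    """
    if rng.random() < jump_prob:
        return rng.randint(lo, hi)            # random jump
    if perf_prev > 0 and abs(perf_curr - perf_prev) / perf_prev < stabilization:
        return curr                           # stable: stop oscillating
    direction = 1 if curr >= prev else -1
    if perf_curr < perf_prev:
        direction = -direction                # last move hurt: reverse
    return min(hi, max(lo, curr + direction))
```

For instance, moving from 3 to 4 retries and seeing performance improve yields 5 next; seeing it drop yields 3 again, and a near-zero change keeps the budget at 4.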
Optimizers in action

One atomic block in the Yada benchmark (8 threads).
[Figure: the two optimizers are *not* independent]
Coupling the Optimizers

UCB and Gradient Descent overlap in responsibilities: one optimizes the consumption of attempts upon capacity aborts, the other the allocation of the budget for attempts.

Minimize interference via a hierarchical organization in which UCB rules over Grad:
- UCB can force Grad to explore with a random jump
- direction and length are defined by the UCB belief

More details in the paper.
Coupling the Optimizers

[Figure: speedup of the coupled techniques vs the individual ones]
Overhead of self-tuning

Profiling and decision-making are performed, but their outcome is discarded: the system runs a static configuration (and compares against it).
Integration in GCC

- Workload-oblivious
- Transparent to the programmer
- Lightweight for general-purpose use
- Ideal candidate for integration at the compiler level

(The current prototype does not support G-Tuner yet.)
Integration in GCC

[Figure: our extensions]