Self-tuning HTM

Paolo Romano. Based on the ICAC'14 paper: N. Diegues and Paolo Romano, "Self-Tuning Intel Transactional Synchronization Extensions", 11th USENIX International Conference on Autonomic Computing (ICAC), 2014.

Presentation Transcript

Slide1

Self-tuning HTM

Paolo Romano

Slide2

Based on ICAC’14 paper

N. Diegues and Paolo Romano
Self-Tuning Intel Transactional Synchronization Extensions
11th USENIX International Conference on Autonomic Computing (ICAC), June 2014
Best paper award

2

Slide3

Best-Effort Nature of HTM

3

No progress guarantees: a transaction may always abort

…due to a number of reasons:
Forbidden instructions
Capacity of caches (L1 for writes, L2 for reads)
Faults and signals
Contending transactions, aborting each other

Need for a fallback path, typically a lock or an STM

Slide4

When and how to activate the fallback?

4

How many retries before triggering the fallback?
Ranges from never retrying to insisting many times

How to cope with capacity aborts?
GiveUp – exhaust all retries left
Half – drop half of the retries left
Stubborn – drop only one retry left

How to implement the fallback synchronization?
Wait – wait for the single global lock to be free before retrying
None – retry immediately and hope the lock will be freed
Aux – serialize conflicting transactions on an auxiliary lock

Slide5
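The three capacity-abort policies listed above can be sketched as budget-consumption rules. This is a minimal illustration, not the paper's API: the function names and the always-aborting HTM stub are ours.

```c
#include <stdbool.h>

/* Hypothetical sketch of the three capacity-abort policies: how the
 * retry budget shrinks after one capacity abort. */
enum capacity_policy { GIVEUP, HALF, STUBBORN };

/* Returns the retry budget left after one capacity abort. */
static int consume_budget(enum capacity_policy p, int budget) {
    switch (p) {
    case GIVEUP:   return 0;          /* exhaust all retries left */
    case HALF:     return budget / 2; /* drop half of the retries left */
    case STUBBORN: return budget - 1; /* drop only one retry left */
    }
    return 0;
}

/* Stub simulating an HTM attempt that always ends in a capacity abort. */
static bool htm_attempt(void) { return false; }

/* Count how many HTM attempts are made before taking the fallback path. */
static int attempts_before_fallback(enum capacity_policy p, int budget) {
    int attempts = 0;
    while (budget > 0) {
        attempts++;
        if (htm_attempt()) return attempts; /* committed in hardware */
        budget = consume_budget(p, budget);
    }
    return attempts; /* budget exhausted: fall back to the lock/STM */
}
```

With a budget of 5 and every attempt capacity-aborting, GiveUp falls back after 1 attempt, Half after 3, and Stubborn after 5.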

Is static tuning enough?

5

Focus on single global lock fallback

Heuristic:
Try to tune the parameters according to best practices
Empirical work in recent papers [SC13, HPCA14]
Intel optimization manual

GCC:
Use the existing support in GCC out of the box

Slide6

Why Static Tuning is not enough

6

Benchmark     GCC    Heuristic   Best Tuning
genome        1.54   3.14        3.36  wait-giveup-4
intruder      2.03   1.81        3.02  wait-giveup-4
kmeans-h      2.73   2.66        3.03  none-stubborn-10
rbt-l-w       2.48   2.43        2.95  aux-stubborn-3
ssca2         1.71   1.69        1.78  wait-giveup-6
vacation-h    2.12   1.61        2.51  aux-half-5
yada          0.19   0.47        0.81  wait-stubborn-15

Speedup with 4 threads (vs 1 thread non-instrumented)
Intel Haswell Xeon with 4 cores (8 hyperthreads)
room for improvement

Slide7

No one size fits all

7

Intruder from STAMP benchmarks

Slide8

Are all optimization dimensions relevant?

8

How many retries before triggering the fallback?
Ranges from never retrying to insisting many times

How to cope with capacity aborts?
GiveUp – exhaust all retries left
Half – drop half of the retries left
Stubborn – drop only one retry left

How to implement the fallback synchronization?
Wait – wait for the single global lock to be free before retrying
None – retry immediately and hope the lock will be freed
Aux – serialize conflicting transactions on an auxiliary lock

aux and wait perform similarly
When none is best, it is by a marginal amount
Reduce this dimension in the optimization problem

Slide9

Self-tuning design choices

3 key choices:
How should we learn?
At what granularity should we adapt?
What metrics should we optimize for?

9

Slide10

How should we learn?

Off-line learning
test with some mix of applications & characterize their workload
infer a model (e.g., based on decision trees) mapping: workload → optimal configuration
monitor the workload of your target application, feed the model with this info, and tune the system accordingly

On-line learning
no preliminary training phase
explore the search space while the application is running
exploit the knowledge acquired via exploration for tuning

10

Slide11

How should we learn?

Off-line learning
PRO:
no exploration costs
CONs:
initial training phase is time-consuming and "critical"
accuracy is strongly affected by training-set representativeness
non-trivial to incorporate new knowledge from the target application

On-line learning
PROs:
no training phase → plug-and-play effect
naturally incorporates newly available knowledge
CONs:
exploration costs

11

reconfiguration cost is low with HTM → exploring is affordable

Slide12

Which on-line learning techniques?

12

Uses 2 on-line reinforcement learning techniques in synergy:

Upper Confidence Bounds: how to cope with capacity aborts?
Gradient Descent: how many retries in hardware?

Key features:
both techniques are extremely lightweight → practical
coupled in a hierarchical fashion:
they optimize non-independent parameters
avoids ping-pong effects

Slide13

Self-tuning design choices

3 key choices:
How should we learn?
At what granularity should we adapt?
What metrics should we optimize for?

13

Slide14

At what granularity should we adapt?

Per thread & atomic block
PRO: exploit diversity and maximize flexibility
CONs:
possibly large number of optimizers running in parallel
redundancy → larger overheads
interplay of multiple local optimizers

Whole application
PRO: lower overhead, simpler convergence dynamics
CON: reduced flexibility

14

Slide15

Self-tuning design choices

3 key choices:
How should we learn?
At what granularity should we adapt?
What metrics should we optimize for?

15

Slide16

What metrics should we optimize for?

Performance? Power? A combination of the two?

Key issues/questions:
Cost and accuracy of monitoring the target metric
Performance: RDTSC allows for lightweight, fine-grained measurement of latency
Energy: RAPL has coarse granularity (msec) and requires system calls

How correlated are the two metrics?
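The RDTSC-based latency measurement mentioned above can be sketched as follows. This is an illustrative probe, not the Tuner's actual profiling code, and it is x86-specific (`__rdtsc()` from `<x86intrin.h>`); the helper names are ours.

```c
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc(): read the time-stamp counter (x86 only) */

/* Measure the latency of a code region in processor cycles. This is the
 * kind of cheap, fine-grained probe RDTSC enables, in contrast to RAPL's
 * millisecond-granularity energy readings. */
static uint64_t cycles_elapsed(void (*fn)(void)) {
    uint64_t start = __rdtsc();
    fn();
    return __rdtsc() - start;
}

/* Dummy workload so the probe has something to time. */
static void work(void) {
    volatile int sink = 0;
    for (int i = 0; i < 1000; i++) sink += i;
    (void)sink;
}
```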

16

Slide17

Energy and performance in (H)TM: two sides of the same coin?

How correlated are energy consumption and throughput?

480 different configurations (number of retries, capacity aborts handling, no. threads) per each benchmark:

includes both optimal and sub-optimal configurations

17

Slide18

Energy and performance in (H)TM: two sides of the same coin?

How suboptimal is the energy consumption if we use a configuration that is optimal performance-wise?

18

Slide19

(G)Tuner

19

Performance measured through processor cycles (RDTSC)

Supports fine- and coarse-grained optimization granularity:

Tuner: per atomic block, per thread
no synchronization among threads

G(lobal)-Tuner: application-wide configuration
Threads collect statistics privately
An optimizer thread periodically gathers stats & decides a (possibly) new configuration

Periodic profiling and re-optimization to minimize overhead

Integrated in GCC

Slide20

Evaluation

20

Intel Haswell Xeon with 4 cores (8 hyper-threads)

Compared variants, with RTM-SGL and RTM-NOrec fallbacks:
Idealized "Best" variant
Tuner
G-Tuner
Heuristic: GiveUp-5
NOrec (STM)
GCC
Adaptive Locks [PACT09]

Slide21

RTM-SGL

21

Intruder from STAMP benchmarks

[Chart: speedup vs. number of threads; annotations: 4% avg offset, +50%]

Slide22

RTM-NORec

22

Intruder from STAMP benchmarks

G-Tuner better with NOrec fallback

[Chart: speedup vs. number of threads]

Slide23

Evaluating the granularity trade-off

23

Genome from STAMP benchmarks, 8 threads

[Chart annotations: adapting over time; also adapting, but large constant overheads; static configuration]

Slide24

Take home messages

24

Tuning of fallback policy strongly impacts performance

Self-tuning of HTM via on-line learning is feasible:
plug & play: no training phase
gains largely outweigh exploration overheads

Tuning granularity hides subtle trade-offs:
flexibility vs overhead vs convergence speed

Optimize for performance or for energy?
Strong correlation between the 2 metrics
How general is this claim? Seems to be the case also for STM

Slide25

Thank you!

25

Questions?

Slide26

BACKUP SLIDES

Dagstuhl Seminar 2015

26

Slide27

Single lock fallback path

After "some" failed attempts using HTM, acquire a single global lock and execute the transaction pessimistically

How to couple transactions executing in hardware with the fallback?
Subscribe the lock in the HTM transaction:
read the state of the global lock from within the HTM transaction
activating the fallback path aborts any concurrent hw transaction

STRONG IMPACT ON PERFORMANCE
BETTER TUNE THIS MECHANISM PROPERLY!

ICAC 2014

27

Slide28

Why Static Tuning is not enough

Self-Tuning Intel RTM

28

Slide29

How to handle capacity aborts?

29

Reduction to a "Bandit Problem":
3-lever slot machine with unknown reward distributions

Exploitation vs Exploration dilemma:
how often to test apparently unfavorable levers?
Too little: convergence to wrong solution
Too much: many suboptimal choices

Lever:     A        B      C
Strategy:  giveup   half   stubborn
Reward:    ?        ?      ?

Slide30

Upper Confidence Bounds (UCB)

30

Solution to the exploration vs exploitation dilemma

Online estimation of the "uncertainty" of each strategy:
upper confidence bound on expected reward
amplify the bound of rarely explored strategies

Appealing theoretical guarantees:
logarithmic bound on optimization error

Very lightweight and efficient: …practical!

Slide31

Upper Confidence Bounds (UCB)

31

Basic reward function for each strategy i:

x_i = 1 / (avg. #cycles using strategy i)

Estimate an upper bound on the reward of each strategy (the standard UCB1 form):

UCB_i = x̄_i + sqrt(2 ln n / n_i)

where n is the total number of trials and n_i the number of trials of strategy i; the second term amplifies the confidence bound of rarely explored levers.

Slide32

How many attempts using HTM?

32

UCB not a good fit: too many levers to explore!

Slide33

Gradient Descent

33

[Figure: gradient-descent search over retry counts; steps 1, 2, 3, 4, next step "?"]

Problems:
1- unnecessary oscillations
2- stuck in local maxima

Slide34

Gradient Descent

34

[Figure: gradient-descent search over retry counts; steps 1–5, next step "?"]

Problems:
1- unnecessary oscillations
* stabilization threshold
2- stuck in local maxima
* random jumps

Slide35

Gradient Descent

35

[Figure: gradient-descent search over retry counts; steps 1–8, next step "?"]

Problems:
1- unnecessary oscillations
* stabilization threshold
2- stuck in local maxima
* random jumps

revert to curr. maximum upon "unlucky" jumps

Slide36
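One step of the hill-climbing search over the retry budget, with the two fixes above, can be sketched as follows. The stabilization threshold and the random jump follow the slides, but the function names, the threshold value, and the toy performance curve are our assumptions.

```c
#include <stdlib.h>

/* Relative improvement required before moving to a neighbor; this value
 * is assumed for illustration, not taken from the paper. */
#define THRESH 0.05

/* One hill-climbing step over the HTM retry budget. perf(r) is the
 * measured reward for r retries (higher = better). Returns the next
 * retry count to try; staying put ("stabilization threshold") avoids
 * unnecessary oscillations around a plateau. */
static int grad_step(int cur, double cur_perf, double (*perf)(int)) {
    int up = cur + 1, down = cur > 1 ? cur - 1 : 1;
    double pu = perf(up), pd = perf(down);
    if (pu > cur_perf * (1.0 + THRESH) && pu >= pd) return up;
    if (pd > cur_perf * (1.0 + THRESH)) return down;
    return cur; /* stable: neither neighbor is meaningfully better */
}

/* Escape local maxima: jump to a random retry count in [1, max]. The
 * caller reverts to the current maximum if the jump is "unlucky". */
static int random_jump(int max) { return 1 + rand() % max; }

/* Toy performance curve peaking at 5 retries, for illustration only. */
static double toy_perf(int r) { int d = r - 5; return 1.0 / (1.0 + d * d); }
```

On the toy curve, the search climbs from 2 toward 5 and then stays at 5, since neither neighbor clears the threshold there.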

Optimizers in action

36

One atomic block in the Yada benchmark (8 threads).

the two optimizers are *not* independent

Slide37

Coupling the Optimizers

37

UCB and Gradient Descent overlap in responsibilities:
Optimize consumption of attempts upon capacity aborts
Optimize allocation of budget for attempts

Minimize interference via hierarchical organization:
UCB rules over Grad:
UCB can force Grad to explore with a random jump
Direction and length defined by UCB belief

More details in the paper

Slide38

Coupling the Optimizers

38

Speedup of coupled techniques vs individual ones

Slide39

Overhead of self-tuning

39

Profiling and decision-making are performed, but discarded.
Uses a static configuration (and compares with it).

Slide40

Integration in GCC

40

Workload-oblivious
Transparent to the programmer
Lightweight for general-purpose use
Ideal candidate for integration at the compiler level
(current prototype does not support G-Tuner yet)

Slide41

Integration in GCC

41

our extensions