Smartlocks: Self-Aware Synchronization
Jonathan Eastep, David Wingate, Marco D. Santambrogio, Anant Agarwal
Multicores are Complex
The good: performance scaling is back on track with Moore's Law
The bad: system complexities are skyrocketing
It is difficult to program multicores and to utilize their performance
Asymmetric Multicore is Worse
The problem: different capabilities and clock speeds add a new layer of complexity
Programmers aren't used to reasoning about asymmetry
[Figure: an asymmetric multicore with Core 0 through Core 3 of different capabilities]
Why asymmetric multicore?
Improving power / performance
Increasing manufacturing yield
Self-Aware Computing Can Help
A promising recent approach to systems complexity management
Self-aware systems monitor themselves, adapting as necessary to meet their goals
Examples:
Goal-Oriented Computing, Ward et al., CSAIL
IBM K42 Operating System (OS with online reconfiguration)
Oracle Automatic Workload Repository (DB tuning)
Intel RAS Technologies for Enterprise (hardware fault tolerance)
Smartlocks Overview
Self-aware technology applied to synchronization, resource sharing, and programming models
C/C++ spin-lock library for multicores
Uses heuristics and machine learning to internally adapt its algorithms and behaviors
Reward signal provided by an application monitor
Key innovation: Lock Acquisition Scheduling
[Figure: a lock scheduler choosing among waiting threads t1, t2, t3]
Lock Acquisition Scheduling is the Key!
Thought experiment: 2 slow cores, 1 fast core
[Figure: two lock-acquisition schedules of the same 4 critical sections (CS); the schedule that favors the fast core finishes sooner, marked as the improvement]
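To make the intuition concrete, here is an illustrative back-of-the-envelope calculation (the numbers are hypothetical, not from the slides). Suppose each critical section (CS) takes time T on a slow core and T/2 on the fast core, the work items are interchangeable as in a work-pile model, and lock-holder execution is the bottleneck. Then for the same 4 critical sections:

\[
t_{\text{all on fast core}} = 4 \cdot \tfrac{T}{2} = 2T,
\qquad
t_{\text{2 fast, 2 slow}} = 2 \cdot \tfrac{T}{2} + 2 \cdot T = 3T,
\]

so a lock acquisition schedule that preferentially grants the lock to the fast core finishes the serialized work sooner.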
Talk Outline
Motivation
Smartlocks Architecture
Smartlocks Interface
Smartlock Design
Results
Conclusion
Smartlocks Architecture
Each Smartlock self-optimizes as the application runs
It takes a reward from an application monitoring framework
Reinforcement learning adapts the lock scheduling policy
[Figure: the application runs on Pthreads and Smartlocks; an Application Monitor (Heartbeats) supplies the reward (heart rate) to each Smartlock's ML engine, which drives lock scheduling by writing priorities into a PR lock]
Do Scheduling with PR Locks
Priority Lock (PR Lock):
Releases the lock to waiters preferentially, ordered by priority
Each potential lock holder (thread) has a priority
To acquire, a thread registers in a wait priority queue
Usually, priority settings are set statically
Lock Acquisition Scheduling:
Augments the PR Lock with an ML engine that dynamically controls the priority settings
Scheduling policy = the set of thread priorities
[Figure: the ML engine feeds priorities p_t1 ... p_tn into the PR lock, where t_i is thread i and p_ti is its priority]
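As a rough illustration of the PR lock mechanism just described, here is a minimal sketch. It is a blocking mutex/condition-variable version for brevity; the actual Smartlocks PR lock is a spin lock, and all names here (pr_lock, set_priority, ...) are hypothetical.

```cpp
// Minimal priority-lock sketch: release hands the lock to the waiting thread
// with the highest priority. Illustrative only, not the Smartlocks source.
#include <condition_variable>
#include <mutex>
#include <set>
#include <vector>

class pr_lock {
public:
    explicit pr_lock(int max_lockers) : priority_(max_lockers, 0) {}

    // Priorities are usually static; Smartlocks' ML engine rewrites them at
    // runtime, which is what turns the PR lock into a scheduling mechanism.
    void set_priority(int id, int prio) {
        {
            std::lock_guard<std::mutex> g(m_);
            priority_[id] = prio;
        }
        cv_.notify_all();
    }

    void acquire(int id) {
        std::unique_lock<std::mutex> g(m_);
        waiters_.insert(id);  // register in the wait queue
        cv_.wait(g, [&] { return !held_ && next_waiter() == id; });
        waiters_.erase(id);
        held_ = true;
    }

    void release(int /*id*/) {
        {
            std::lock_guard<std::mutex> g(m_);
            held_ = false;
        }
        cv_.notify_all();  // the highest-priority waiter proceeds
    }

private:
    // Pick the waiter with the highest priority (ties broken by lowest id).
    int next_waiter() const {
        int best = -1;
        for (int id : waiters_)
            if (best < 0 || priority_[id] > priority_[best]) best = id;
        return best;
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::vector<int> priority_;
    std::set<int> waiters_;
    bool held_ = false;
};
```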
Smartlocks Interface
Similar to pthread mutexes
The difference is the interface to an external monitor: the Smartlock queries the monitor for a reward signal

Function Prototype                                        Description
smartlock::smartlock(int max_lockers, monitor *m_ptr)     Creates a Smartlock
smartlock::~smartlock()                                    Destroys a Smartlock
void smartlock::acquire(int id)                            Acquires the lock
void smartlock::release(int id)                            Releases the lock
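A minimal usage sketch of the interface above. The monitor construction and the critical-section body are placeholders: the slides only say that a Smartlock queries an external application monitor (e.g. Heartbeats) for a reward signal such as a heart rate.

```cpp
// Hypothetical usage of the Smartlocks interface shown in the table above.
// #include "smartlock.h"   // hypothetical header providing smartlock and monitor
#include <functional>
#include <thread>
#include <vector>

void worker(smartlock &lock, int id, int iters) {
    for (int i = 0; i < iters; ++i) {
        lock.acquire(id);   // id identifies this potential lock holder
        // ... critical section, e.g. pop an item from a shared work queue ...
        lock.release(id);
        // ... process the item outside the lock ...
    }
}

int main() {
    const int num_threads = 4;
    monitor m;                          // assumed application monitor
    smartlock lock(num_threads, &m);    // max_lockers, monitor pointer

    std::vector<std::thread> threads;
    for (int id = 0; id < num_threads; ++id)
        threads.emplace_back(worker, std::ref(lock), id, 1000);
    for (auto &t : threads)
        t.join();
    return 0;
}
```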
Smartlocks Design Challenges
Major scheduling challenges:
The Timeliness Challenge
Scheduling too slowly could negate the benefit of scheduling
Where do you get compute resources to optimize?
The Quality Challenge
Finding policies with the best long-term effects
No model of the system to guide direct optimization methods
Efficiently searching an exponential policy space
Overcoming stochastic / partially observable dynamics
Meeting the Timeliness Challenge
Run adaptation algorithms in decoupled helper thread
Relax scheduling frequency to once every few locks
For efficiency, use PR locks as scheduling mechanism
ML engine updates priorities; PR lock runs decoupled
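A minimal sketch of that decoupling, reusing the pr_lock sketch from the PR-lock slide; monitor::get_heart_rate and update_policy are hypothetical stand-ins for the Heartbeats query and the policy-gradient update described later.

```cpp
// Decoupled adaptation loop: a helper thread periodically reads the reward
// from the application monitor and rewrites the PR lock's priorities, while
// the lock itself keeps running undisturbed. All names below are
// illustrative assumptions, not the Smartlocks source.
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

struct monitor { double get_heart_rate(); };             // hypothetical reward source
void update_policy(std::vector<int> &prios, double r);   // hypothetical ML update

void adaptation_loop(monitor &m, pr_lock &lock, int num_threads,
                     std::atomic<bool> &stop) {
    std::vector<int> priorities(num_threads, 0);
    while (!stop.load()) {
        double reward = m.get_heart_rate();          // e.g. smoothed heart rate
        update_policy(priorities, reward);           // learn a new scheduling policy
        for (int id = 0; id < num_threads; ++id)
            lock.set_priority(id, priorities[id]);   // PR lock applies it on later releases
        // Relaxed scheduling frequency: adapt only every so often rather than
        // on every lock acquisition (approximated here with a short sleep).
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```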
Meeting the Quality Challenge
Machine Learning, Reinforcement Learning:
Need not know *how* to accomplish the task, just *when* you have
Good at learning actions that maximize long-term benefit
Natural for application engineers to construct a reward signal
Addresses issues like stochastic / partially observable dynamics
Policy Gradients:
Computationally cheap, fast, and straightforward to implement
Need no model of the system (we don't have one!)
Stochastic Soft-Max Policy:
Relaxes the exponential discrete action space into a differentiable one
Effective, natural way to balance exploration vs. exploitation
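A concrete form such a soft-max policy can take (assumed here for illustration; the exact parameterization in Smartlocks may differ): with a learned parameter θ_{i,p} for giving thread i priority level p out of k levels,

\[
\pi(a_i = p \mid \theta) \;=\; \frac{\exp(\theta_{i,p})}{\sum_{p'=1}^{k} \exp(\theta_{i,p'})},
\]

so each thread's priority is sampled from a smooth, differentiable distribution rather than searched over the k^n discrete joint settings, and the randomness of the sampling itself balances exploration against exploitation.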
The RL Problem Formulation
Goal: learn a policy π(action | θ)
Action = PR lock priority settings (an exponential space): k priority levels, n threads → k^n possible priority settings
θ are learned parameters
Reward is, e.g., the heart rate smoothed over a small window
Thus π is a distribution over thread prioritizations
At each timestep, we sample and execute a prioritization
Optimization objective: maximize the average reward η, which depends on the policy, which depends on θ
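In standard average-reward notation (assumed here; the slide only names the quantities):

\[
\eta(\theta) \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=1}^{T} r_t \right],
\qquad
\theta^{*} \;=\; \arg\max_{\theta} \eta(\theta),
\]

where r_t is the reward (e.g. the smoothed heart rate) observed at timestep t.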
Use Policy Gradients Approach
Approach: policy gradients
Idea: estimate the gradient of the average reward η with respect to the policy parameters θ
Approximate the gradient with importance sampling
Take a step in the gradient direction
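For illustration, a standard likelihood-ratio form of that gradient estimate (the precise importance-sampling estimator used by Smartlocks is not reproduced on the slides): given sampled prioritizations a_1, ..., a_N with observed rewards r_1, ..., r_N,

\[
\widehat{\nabla_{\theta}\,\eta} \;\approx\; \frac{1}{N} \sum_{j=1}^{N} r_j \, \nabla_{\theta} \log \pi(a_j \mid \theta),
\qquad
\theta \;\leftarrow\; \theta + \alpha \, \widehat{\nabla_{\theta}\,\eta},
\]

with α a step size.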
Experimental Setup 1
Simulated 6-core single-ISA asymmetric multicore with dynamic clock speeds
Throughput benchmark:
Work-pile programming model (no stealing), 1 producer, 4 workers
Record how long it takes to perform n total work items
Fast cores finish work faster; if they spin, it's bad
Two thermal throttling events
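A rough reconstruction of the benchmark's structure under a smartlock (illustrative only; work_item, do_work, and the exact pile layout are placeholders, not the code used in the paper):

```cpp
// Work-pile throughput benchmark sketch: one producer pushes items, workers
// pop them under a smartlock, with no stealing. Placeholders throughout.
#include <deque>

struct work_item { int payload; };
void do_work(const work_item &item);      // placeholder for the per-item work

std::deque<work_item> work_pile;          // shared pile, guarded by the smartlock
bool producer_done = false;

void producer(smartlock &lock, int id, int n_items) {
    for (int i = 0; i < n_items; ++i) {
        lock.acquire(id);
        work_pile.push_back({i});
        lock.release(id);
    }
    lock.acquire(id);
    producer_done = true;
    lock.release(id);
}

void worker(smartlock &lock, int id) {
    for (;;) {
        lock.acquire(id);                 // a fast core stuck waiting here wastes its speed
        if (work_pile.empty()) {
            bool finished = producer_done;
            lock.release(id);
            if (finished) return;
            continue;
        }
        work_item item = work_pile.front();
        work_pile.pop_front();
        lock.release(id);
        do_work(item);                    // work happens outside the critical section
    }
}
```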
Performance as a Function of Time
Workload 1: Worker 0 @ 3.16 GHz, others @ 2.11 GHz
Workload 2: Worker 3 @ 3.16 GHz, others @ 2.11 GHz
Workload 3: same as Workload 1
[Figure: performance over time; Smartlock tracks the best static policy with priorities, with a gap above the best policy without priorities and TAS; an annotation marks the adaptation time-scale]
Policy as a Learned Function of Time
Workload 1: Worker 0 @ 3.16 GHz, others @ 2.11 GHz
Workload 2: Worker 3 @ 3.16 GHz, others @ 2.11 GHz
Workload 3: same as Workload 1
[Figure: the learned priority settings plotted over time for the three workloads]
Experimental Setup 2
Hardware asymmetry using cpufrequtils
8-core Intel Xeon machine: {2.11, 2.11, 2.11, 2.11, 2.11, 2.11, 3.16, 3.16} GHz
1 core reserved for machine learning (not required: the helper thread could share a core)
Splash2, first results: Radiosity
Computes the equilibrium distribution of light in a scene
Parallelism via work queues with stealing
Work items are imbalanced (a function of the input scene)
A heartbeat is issued for every work item completed
Radiosity Performance vs. Policy
Benchmark: study how lock scheduling affects performance
~20% difference between the best and worst policy
TAS (uniformly random) is in the middle
Smartlock within 3% of the best policy
[Figure: Radiosity execution time in seconds for each policy and for Smartlocks; lower is better]
Smartlocks is Bigger Than This
Smartlock adapts each aspect of a lock:
Protocol: picks from {TAS, TASEB, Ticket, MCS, PR Lock}
Wait strategy: picks from {spin, spin with backoff}
Scheduling policy: arbitrary, optimized by the RL engine
Smartlocks has an adaptation component for each; this talk focuses on the Lock Acquisition Scheduler
[Figure: a Smartlock contains a Protocol Selector, a Lock Acquisition Scheduler, and a Wait Strategy Selector, all driven by the Application Monitor]
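Purely as a schematic of how those three adaptation components and their option sets might be laid out (not the Smartlocks source; names are illustrative):

```cpp
// Option sets from the slide, one per adaptation component. The actual
// Smartlocks selectors and their heuristics are not shown here.
#include <vector>

enum class protocol { tas, taseb, ticket, mcs, pr_lock };   // Protocol Selector options
enum class wait_strategy { spin, spin_with_backoff };       // Wait Strategy Selector options

struct adaptation_state {
    protocol proto = protocol::pr_lock;            // chosen lock protocol
    wait_strategy waiting = wait_strategy::spin;   // chosen wait strategy
    std::vector<int> priorities;                   // Lock Acquisition Scheduler output (RL)
};
```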
Conclusion
Smartlocks is a self-aware software library for synchronization / resource sharing
Ideal for multicores and applications with dynamic asymmetry
Lock Acquisition Scheduling is the key innovation
Smartlocks is open source (COMING SOON!)
Code: http://github.com/Smartlocks/Smartlocks
Project web page: http://groups.csail.mit.edu/carbon/smartlocks