Slide 1

Multi-Resource Packing for Cluster Schedulers
Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, Aditya Akella
Tetris
Slide 2

Performance of cluster schedulers
We find that:
- Resources are fragmented, i.e., machines run below capacity
- Even at 100% usage, goodput is smaller due to over-allocation
- Pareto-efficient multi-resource fair schemes do not lead to good avg. performance

Tetris: up to 40% improvement in makespan(1) and job completion time, with near-perfect fairness
(1) makespan = time to finish a set of jobs
Slide 3

Findings from analysis of Bing and Facebook traces
- Tasks need varying amounts of each resource
- Demands for resources are weakly correlated
- Applications have (very) diverse resource needs
- Multiple resources become tight

This matters because there is no single bottleneck resource in the cluster:
e.g., there is not enough cross-rack network bandwidth to use all cores

Upper bound on potential gains:
- Makespan reduces by ≈ 49%
- Avg. job completion time reduces by ≈ 46%
Slide 4

Why so bad #1
Production schedulers neither pack tasks nor consider all their relevant resource demands:
#1 Resource Fragmentation
#2 Over-allocation
Slide 5

#1 Resource Fragmentation (RF)

Example: machines A and B each have 4 GB of memory; tasks T1 (2 GB), T2 (2 GB), and T3 (4 GB) are waiting.

Current schedulers allocate resources per slot, with fairness, and are not explicit about packing: T1 and T2 land on different machines, leaving 2 GB free on each, so T3 (4 GB) fits nowhere and must wait.
Avg. task completion time = 1.33 t

A "packer" scheduler places T1 and T2 together on machine A, so T3 runs immediately on machine B.
Avg. task completion time = 1 t

RF increases with the number of resources being allocated!
Slide 6

#2 Over-Allocation

Example: machine A has 4 GB of memory and 20 MB/s of network; T1 and T2 each need 2 GB of memory and 20 MB/s of network, while T3 needs only 2 GB of memory.

Current schedulers: not all resources are explicitly allocated; e.g., disk and network can be over-allocated. Running T1 and T2 together fits in memory but demands 40 MB/s of network on a 20 MB/s machine, slowing both tasks down.
Avg. task completion time = 2.33 t

A "packer" scheduler pairs T1 with the network-free T3 and runs T2 afterwards, so the network is never over-allocated.
Avg. task completion time = 1.33 t
Slide 7

Why so bad #2
Multi-resource fairness schemes do not solve the problem:
- Work conserving != no fragmentation, no over-allocation
- They treat the cluster as one big bag of resources, which hides the impact of resource fragmentation
- They assume a job has a fixed resource profile, but different tasks in the same job have different demands

How the job is scheduled impacts jobs' current resource profiles; a scheduler can create complementarity (example in paper).
Packer vs. DRF: makespan and avg. completion time improve by over 30%

Pareto-efficient(1) != performant
(1) no job can increase its share without decreasing the share of another
Slide 8

Competing objectives
- Job completion time vs. cluster efficiency
- Fairness vs. cluster efficiency

Current schedulers:
1. Resource fragmentation
2. Over-allocation
3. Fair allocations sacrifice performance
Slide 9

Tetris #1
Pack tasks along multiple resources to improve cluster efficiency and reduce makespan
Slide 10

Multi-Resource Packing of Tasks

Theory: similar to multi-dimensional bin packing
- Balls are tasks; bins are machines (over time)
- APX-Hard(1)
(1) APX-Hard is a strict subset of NP-hard

Practice: existing heuristics do not directly apply
- They assume balls of a fixed size, but task demands are elastic and vary with time and with the machine where the task is placed
- They assume balls are known a priori, but a scheduler must cope with online arrival of jobs, dependencies, and cluster activity

Avoiding fragmentation looks like tight bin packing: reducing the number of bins reduces makespan.
Slide 11

Tetris #1: A packing heuristic

Packing tasks to machines = multi-dimensional bin packing
- Ball = task resource demand vector
- Bin = machine available resource vector

1. Check for fit (task demand vector ≤ machine available resource vector) to ensure no over-allocation
2. Compute an alignment score A between the task's demands and the machine's available resources

"A" works because:
- Bigger balls get bigger scores
- Abundant resources get used first
Fit checking avoids over-allocation; alignment avoids resource fragmentation.
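The fit check and alignment score can be sketched as follows. This is a minimal illustration: the function names and the use of a plain dot product over raw, unnormalized demand vectors are assumptions for clarity, not the paper's exact formulation.

```python
def fits(demand, free):
    # A task fits only if every resource demand is within what is free:
    # this rules out over-allocation.
    return all(d <= f for d, f in zip(demand, free))

def alignment(demand, free):
    # Dot product of demand and free resources: bigger tasks get bigger
    # scores, and machines whose abundant resources match the task's
    # demands are preferred, which fights fragmentation.
    return sum(d * f for d, f in zip(demand, free))

def pick_task(tasks, free):
    # Among tasks that fit on this machine, pick the best-aligned one.
    feasible = [t for t in tasks if fits(t["demand"], free)]
    if not feasible:
        return None
    return max(feasible, key=lambda t: alignment(t["demand"], free))

# Machine with 2 free cores, 4 GB memory, 20 MB/s network.
free = (2, 4, 20)
tasks = [
    {"id": "T1", "demand": (1, 2, 20)},  # network-heavy
    {"id": "T2", "demand": (2, 4, 0)},   # CPU/memory-heavy
    {"id": "T3", "demand": (4, 8, 0)},   # does not fit: no over-allocation
]
best = pick_task(tasks, free)
```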
Slide 12

Tetris #2
Faster average job completion time
Slide 13

Tetris challenge #2: A job completion time heuristic

Shortest Remaining Time First (SRTF)(1) schedules jobs in ascending order of their remaining time.
(1) SRTF – M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS'99]

Q: What is the shortest "remaining time"?
"Remaining work" = remaining # of tasks & tasks' durations & tasks' resource demands

The heuristic gives a score P to every job: SRTF extended to incorporate multiple resources.
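A multi-resource notion of "remaining work" can be sketched as below. The exact aggregation (here, each remaining task's duration times its summed resource demand) is an assumption for illustration; the slide only lists the three ingredients.

```python
def remaining_work(job):
    # Combine the slide's three ingredients: the remaining # of tasks
    # (the sum runs over them), each task's duration, and its resource
    # demands. Smaller remaining work => schedule earlier (SRTF-style).
    return sum(t["duration"] * sum(t["demand"]) for t in job["remaining_tasks"])

jobs = [
    {"name": "A", "remaining_tasks": [{"duration": 2, "demand": (1, 2)}]},
    {"name": "B", "remaining_tasks": [{"duration": 3, "demand": (2, 2)}] * 4},
]
# Ascending remaining work gives the SRTF-like order: A before B.
order = [j["name"] for j in sorted(jobs, key=remaining_work)]
```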
Slide 14

Tetris challenge #2: Combining the A and P scores

1: among J runnable jobs
2:   score(j) = A(t, R) + P(j)
3:   maximized over tasks t in j with demand(t) ≤ R (resources free)
4: pick j*, t* = argmax score(j)

Why combine the two? Relying on A alone delays job completion time; relying on P alone loses packing efficiency.
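The four-line pseudocode can be fleshed out as one scheduling step. The helper names are illustrative, and the unweighted sum A + P mirrors the slide (weighting the two terms is discussed in the backup slides).

```python
def fits(demand, free):
    return all(d <= f for d, f in zip(demand, free))

def alignment(demand, free):
    # Packing score A: dot product of task demand and free resources.
    return sum(d * f for d, f in zip(demand, free))

def pick(jobs, free, p_score):
    # Among J runnable jobs, score each feasible task as A(t, R) + P(j)
    # and return the (job, task) pair with the highest combined score.
    best, best_score = None, float("-inf")
    for job in jobs:
        for task in job["tasks"]:
            if not fits(task["demand"], free):
                continue  # demand(t) must be <= R (resources free)
            score = alignment(task["demand"], free) + p_score(job)
            if score > best_score:
                best, best_score = (job, task), score
    return best

free = (4, 8)
jobs = [
    {"name": "J1", "tasks": [{"demand": (2, 4)}]},
    {"name": "J2", "tasks": [{"demand": (2, 4)}]},
]
# With equal alignment, P breaks the tie toward the job closer to completion.
job, task = pick(jobs, free, p_score=lambda j: 10 if j["name"] == "J2" else 0)
```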
Slide 15

Tetris #3
Achieve performance and fairness
Slide 16

Tetris #3: Fairness heuristic

- The packer says: "task T should go next to improve packing efficiency"
- SRTF says: "schedule job J to improve avg. completion time"
- Fairness says: "this set of jobs should be scheduled next"

Performance and fairness do not mix well in general. But it is possible to satisfy all three, and in fact this happens often in practice: we can get "perfect fairness" and much better performance.
Slide 17

Tetris #3: Fairness heuristic

Fairness knob F ∈ [0, 1): pick the best-for-performance task from among the 1−F fraction of jobs furthest from their fair share.
- F = 0: most unfair, most efficient scheduling
- F → 1: close to perfect fairness

Fairness is not a tight constraint: aim for long-term rather than short-term fairness, and lose a bit of fairness for a lot of gain in performance.
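The fairness knob can be sketched as a pre-filter. How "furthest from fair share" is measured (here, allocation minus fair share) and the names are assumptions for illustration.

```python
def eligible_jobs(jobs, F):
    # Rank jobs by how far below their fair share they are (most deprived
    # first) and keep the (1 - F) fraction; the packing and SRTF heuristics
    # then pick the best-for-performance task among only these jobs.
    ranked = sorted(jobs, key=lambda j: j["allocated"] - j["fair_share"])
    k = max(1, round((1 - F) * len(ranked)))
    return ranked[:k]

jobs = [
    {"name": "A", "fair_share": 10, "allocated": 2},   # far below fair share
    {"name": "B", "fair_share": 10, "allocated": 9},
    {"name": "C", "fair_share": 10, "allocated": 12},  # above fair share
]
# F = 0: every job is eligible (most efficient scheduling).
# F -> 1: only the job furthest from its fair share remains (near-perfect fairness).
```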
Slide 18

Putting it all together

We saw: packing efficiency; preferring small remaining work; the fairness knob.
Other things in the paper: estimating task demands; dealing with inaccuracies and barriers; other cluster activities.

YARN architecture, with the changes that add Tetris (shown in orange):
- Job Manager: sends multi-resource asks and a barrier hint
- Node Manager: tracks resource usage, enforces allocations, and sends resource availability reports
- Cluster-wide Resource Manager: new logic to match tasks to machines (+packing, +SRTF, +fairness), exchanging asks, offers, and allocations
Slide 19

Evaluation
- Implemented in YARN 2.4
- 250-machine cluster deployment
- Bing and Facebook workloads
Slide 20

Efficiency

Tetris gains, from avoiding fragmentation and avoiding over-allocation:

                                 Makespan   Avg. Job Compl. Time
vs. Multi-resource Scheduler       28%            35%
vs. Single Resource Scheduler      29%            30%

[Utilization-over-time plot, 0-15000 s: the single-resource scheduler's utilization exceeds 100% when resources are over-allocated, while low values indicate high fragmentation.]
Slide 21

Fairness

The fairness knob quantifies the extent to which Tetris adheres to fair allocation.

                                     No fairness   F = 0.25   Full fairness
                                       (F = 0)                  (F → 1)
Makespan improvement                     50%          25%          10%
Job compl. time improvement              40%          35%          23%
Avg. slowdown [over impacted jobs]       25%           5%           2%
Slide 22

Tetris
- Packs efficiently along multiple resources
- Prefers jobs with less "remaining work"
- Incorporates fairness

Combining heuristics that improve packing efficiency with those that lower average job completion time works well, and achieving desired amounts of fairness can coexist with improving cluster performance.

Implemented inside YARN; deployment and trace-driven simulations show encouraging initial results. We are working towards a YARN check-in.

http://research.microsoft.com/en-us/UM/redmond/projects/tetris/
Slide 23

Backup slides
Slide 24

Estimating Resource Demands

We estimate peak usage from:
- statistics collected from recurring jobs
- the input size/location of tasks
- observed usage of finished tasks in the same phase (used to backfill estimates)

Placement impacts the usage of network and disk.
A Resource Tracker reports the resource usage of tasks and other cluster activity, e.g., evacuation.
Slide 25

Incorporating task placement

1. Disk and network demands depend on task placement.
2. Remote resources cannot directly be included in the dot product: the resource vectors would grow with the number of machines in the cluster, and the score could even prefer remote placement!

Instead, compute the packing score on local resources and apply a fractional penalty to reduce the use of remote resources.

Sensitivity analysis: makespan and completion time change little for a remote penalty in [6%, 15%].
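The penalized score can be sketched as below. The structure (score over local resources, scaled down for remote placements) follows the slide; the function names and the 10% default, chosen from within the reported [6%, 15%] range, are illustrative assumptions.

```python
def alignment(demand, free):
    return sum(d * f for d, f in zip(demand, free))

def placement_score(demand, free_local, is_remote, remote_penalty=0.10):
    # Score only against the machine's local resources; if the task's data
    # is remote, scale the score down by a fractional penalty so local
    # placements win unless a remote machine is much better aligned.
    score = alignment(demand, free_local)
    return score * (1 - remote_penalty) if is_remote else score

# A local and a remote machine with identical free resources:
local = placement_score((1, 2), (2, 4), is_remote=False)
remote = placement_score((1, 2), (2, 4), is_remote=True)
```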
Slide 26

Alternative Packing Heuristics

[Table of alternative packing heuristics. Notation: d_i = task demand along dimension i; a_i = available resources along dimension i.]
Slide 27

Virtual Machine Packing != Tetris

VM packing also consolidates entities with multi-dimensional resource requirements onto the fewest number of servers, but it focuses on different challenges and not on task packing:
- balance load across servers
- ensure VM availability in spite of failures
- allow for quick software and hardware updates
- there is no entity corresponding to a job, so job completion time is inexpressible
- explicit resource requirements (e.g., a "small" VM) make VM packing simpler
Slide 28

Barrier knob, b ∈ [0, 1)

Tetris gives preference to the last tasks in a stage: it offers resources to tasks in a stage preceding a barrier once a b fraction of that stage's tasks have finished.
b = 1: no tasks are preferentially treated.
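The knob's condition can be sketched as a simple predicate; counting finished vs. remaining tasks per stage is an assumed representation, not the paper's data structures.

```python
def barrier_preferred(stage, b):
    # A stage that precedes a barrier gets preferential offers for its
    # remaining (last) tasks once a b fraction of its tasks have finished.
    total = stage["finished"] + stage["remaining"]
    return total > 0 and stage["finished"] / total >= b

stage = {"finished": 9, "remaining": 1}
# With b = 0.8 the stage's last task is preferred; as b -> 1, preferential
# treatment effectively disappears.
```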
Slide 29

Weighting Alignment vs. SRTF

Recall the combined heuristic: among J runnable jobs, score(j) = A(t, R) + P(j), maximized over tasks t in j with demand(t) ≤ R (resources free); pick j*, t* = argmax score(j).

While the best choice of the weight between A and P depends on the workload, our sensitivity analysis found that:
- gains from packing efficiency are only moderately sensitive to it
- improving avg. job completion time requires a weight > 0, though gains stabilize quickly
Slide 30

Cluster load vs. Tetris performance
Slide 31

Starvation Prevention

It could take a long time to accommodate large tasks (we are working on a more principled solution). But:
- most tasks have demands within one order of magnitude of one another
- free resources become available in "large clumps": through periodic availability reports, the scheduler learns in one shot about the resources freed by all tasks that finished in the preceding period
Slide 32

Workload analysis
Slide 33

Ingestion / Evacuation

Other cluster activities produce background traffic:
- ingestion = storing incoming data for later analytics
- evacuation = data evacuated and re-replicated before maintenance operations, e.g., rack decommissioning for machine re-imaging
Some clusters report volumes of up to 10 TB per hour.

Tetris uses the Resource Tracker's reports to avoid contention between its tasks and these activities.
Slide 34

Fairness vs. Efficiency

Slide 35

Fairness vs. Efficiency
Slide 36

Packer Scheduler vs. DRF

Cluster: [18 cores, 36 GB memory]. Jobs ([task profile], # tasks):
A: [1 core, 2 GB], 18
B: [3 cores, 1 GB], 6
C: [3 cores, 1 GB], 6

Dominant Resource Fairness (DRF) computes the dominant share (DS) of every user and seeks to maximize the minimum DS across all users. With q_A, q_B, q_C concurrent tasks:
  maximize allocations subject to
  1 q_A + 3 q_B + 3 q_C ≤ 18 (CPU constraint)
  2 q_A + 1 q_B + 1 q_C ≤ 36 (memory constraint)

DRF schedule: in each of the intervals t, 2t, 3t, run 6 tasks of A, 2 of B, and 2 of C (18 cores, 16 GB in use). Durations: A: 3t, B: 3t, C: 3t.

Packer schedule: run all 18 tasks of A in the first interval (18 cores, 36 GB), then B's 6 tasks by 2t and C's 6 tasks by 3t (18 cores, 6 GB each). Durations: A: t, B: 2t, C: 3t.

Avg. job completion time improves by 33%.
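A quick arithmetic check of the example, using the completion times read off the two schedules:

```python
# Completion times (in units of t) from the DRF and packer schedules above.
drf = {"A": 3, "B": 3, "C": 3}
packer = {"A": 1, "B": 2, "C": 3}

def avg(times):
    return sum(times.values()) / len(times)

improvement = 1 - avg(packer) / avg(drf)  # (3t - 2t) / 3t = 1/3
```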
Slide 37

Packing efficiency does not achieve everything

Machines 1, 2: [2 cores, 4 GB]. Jobs ([task profile], # tasks):
A: [2 cores, 3 GB], 6
B: [1 core, 2 GB], 2

Pack: run A's tasks first, two at a time (4 cores, 6 GB per interval), then B's two tasks. Durations: A: 3t, B: 4t. Avg. completion time = 3.5t.

No pack: run B's two tasks in the first interval alongside one task of A, then finish A. Durations: A: 4t, B: t. Avg. completion time = 2.5t, a 29% improvement.

Achieving packing efficiency does not necessarily improve job completion time.