Slide1
Altruistic Scheduling in Multi-Resource Clusters
Robert Grandl, Mosharaf Chowdhury, Aditya Akella, Ganesh Ananthanarayanan
Carbyne
Slide2
Performance of Cluster Schedulers
We observe that:
Existing cluster schedulers focus on instantaneous fairness
Data-parallel jobs provide ample opportunities for long-term optimizations
Long-term fairness enables larger scheduling flexibility
Carbyne: 1.3x higher cluster efficiency; 1.6x lower average job completion time; near-perfect fairness
Slide3
Scheduling in Data Analytics Clusters
[Diagram labels: Jobs; Intra-Job Scheduler; Cluster-Wide Resource Manager; Inter-Job Scheduler; Cluster resources]
Objectives:
#1: Fairness
#2: Job Performance
#3: Cluster Efficiency
Achieving all three at once is difficult!
Slide4
Scheduling in Data Analytics Clusters
Objectives:
#1: Fairness
#2: Job Performance
#3: Cluster Efficiency
Achieving all three at once is difficult!
TPC-DS workload on a 100-machine cluster
[Figure: three bar-chart panels, not yet populated. Inter-job fairness (Jain’s Fairness Index, 0 to 1); Job performance (avg. JCT, seconds); Cluster efficiency (makespan, seconds)]
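Jain’s Fairness Index, the fairness metric used on these charts, has a standard closed form: for allocations x_1 ... x_n it is (sum of x_i)^2 / (n * sum of x_i^2), which ranges from 1/n (one job holds everything) to 1 (all jobs get the same). A minimal Python sketch follows; the function name and the sample allocations are illustrative and not from the slides, and the slides do not say exactly which per-job quantity the index is computed over.

```python
def jains_index(allocations):
    """Jain's Fairness Index: (sum x)^2 / (n * sum x^2)."""
    n = len(allocations)
    total = sum(allocations)
    squares = sum(x * x for x in allocations)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero allocation")
    return (total * total) / (n * squares)


# Hypothetical allocations for four jobs.
equal = jains_index([0.25, 0.25, 0.25, 0.25])   # perfectly even shares
skewed = jains_index([1.0, 0.0, 0.0, 0.0])      # one job holds everything
print(equal, skewed)
```

An index near 1 therefore means the jobs received close to equal shares, which is how the fairness values on the following slides should be read.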
Slide5
Scheduling in Data Analytics Clusters
Objectives:
#1: Fairness: DRF [NSDI’11]
#2: Job Performance
#3: Cluster Efficiency
Achieving all three at once is difficult!
TPC-DS workload on a 100-machine cluster
[Figure: bar charts of inter-job fairness (Jain’s Fairness Index), job performance (avg. JCT, seconds), and cluster efficiency (makespan, seconds)]
DRF: Jain’s Fairness Index 0.86; avg. JCT 1224 s; makespan 5478 s
Dominant Resource Fairness (DRF): max-min fair sharing across multiple dimensions
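Dominant Resource Fairness can be sketched as progressive filling: each job's dominant share is its largest fractional use of any single resource, and the next task always goes to the job with the smallest dominant share. The Python sketch below assumes a two-resource cluster and two jobs with made-up per-task demands (none of these numbers come from the slides). It also simplifies one detail: when a job's next task no longer fits, the sketch stops growing that job and continues with the others.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Progressive filling: repeatedly give one task to the job with the
    smallest dominant share, until no job's next task fits."""
    used = [Fraction(0)] * len(capacity)
    tasks = {job: 0 for job in demands}
    dominant = {job: Fraction(0) for job in demands}
    active = set(demands)
    while active:
        # Ties are broken by job name so the result is deterministic.
        job = min(sorted(active), key=lambda j: dominant[j])
        demand = demands[job]
        if any(u + d > c for u, d, c in zip(used, demand, capacity)):
            active.discard(job)  # next task does not fit; stop growing this job
            continue
        used = [u + d for u, d in zip(used, demand)]
        tasks[job] += 1
        dominant[job] = max(
            tasks[job] * Fraction(d) / c for d, c in zip(demand, capacity)
        )
    return tasks, dominant


# Hypothetical cluster: 9 CPUs and 18 GB of memory.
# Job A tasks need (1 CPU, 4 GB); job B tasks need (3 CPUs, 1 GB).
tasks, shares = drf_allocate(capacity=(9, 18), demands={"A": (1, 4), "B": (3, 1)})
print(tasks, shares)  # A runs 3 tasks, B runs 2; both reach a dominant share of 2/3
```

Exact fractions are used so that ties between dominant shares are detected reliably rather than being lost to floating-point rounding.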
Slide6
Scheduling in Data Analytics Clusters
Objectives:
#1: Fairness: DRF [NSDI’11]
#2: Job Performance: Shortest Job First (SJF)
#3: Cluster Efficiency
Achieving all three at once is difficult!
Shortest Job First (SJF): schedule jobs near completion first
TPC-DS workload on a 100-machine cluster
[Figure: bar charts of inter-job fairness (Jain’s Fairness Index), job performance (avg. JCT, seconds), and cluster efficiency (makespan, seconds)]
DRF: Jain’s Fairness Index 0.86; avg. JCT 1224 s; makespan 5478 s
SJF: Jain’s Fairness Index 0.64; avg. JCT 769 s; makespan 6210 s
Slide7
Scheduling in Data Analytics Clusters
Objectives:
#1: Fairness: DRF [NSDI’11]
#2: Job Performance: Shortest Job First (SJF)
#3: Cluster Efficiency: Tetris [SIGCOMM’14]
Achieving all three at once is difficult!
Tetris: efficiently pack resources
TPC-DS workload on a 100-machine cluster
[Figure: bar charts of inter-job fairness (Jain’s Fairness Index), job performance (avg. JCT, seconds), and cluster efficiency (makespan, seconds)]
DRF: Jain’s Fairness Index 0.86; avg. JCT 1224 s; makespan 5478 s
SJF: Jain’s Fairness Index 0.64; avg. JCT 769 s; makespan 6210 s
Tetris: Jain’s Fairness Index 0.74; avg. JCT 1123 s; makespan 4356 s
Slide8
Scheduling in Data Analytics Clusters
TPC-DS workload on a 100-machine cluster
[Figure: bar charts of inter-job fairness (Jain’s Fairness Index), job performance (avg. JCT, seconds), and cluster efficiency (makespan, seconds), with each scheduler's preferred metric and secondary metrics highlighted]
DRF [NSDI’11] (preferred metric: #1 Fairness): Jain’s Fairness Index 0.86; avg. JCT 1224 s; makespan 5478 s
SJF (preferred metric: #2 Job Performance): Jain’s Fairness Index 0.64; avg. JCT 769 s; makespan 6210 s
Tetris [SIGCOMM’14] (preferred metric: #3 Cluster Efficiency): Jain’s Fairness Index 0.74; avg. JCT 1123 s; makespan 4356 s
Each scheduler outperforms on its preferred metric and underperforms on the secondary metrics
Achieving all three at once is difficult!
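To make "efficiently pack resources" concrete: one common packing heuristic, and the one Tetris is generally described as using, scores each candidate task by the dot product between its demand vector and the machine's free-resource vector, then places the highest-scoring task that fits. The Python sketch below assumes that scoring rule along with made-up resource vectors and task names; it is an illustration, not the Tetris implementation.

```python
def best_fit_task(free, tasks):
    """Return the name of the task that fits in `free` and has the largest
    dot product with it, or None if nothing fits."""
    best_name, best_score = None, -1.0
    for name, demand in tasks.items():
        if any(d > f for d, f in zip(demand, free)):
            continue  # would exceed the free amount of some resource
        score = sum(d * f for d, f in zip(demand, free))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical machine with a lot of CPU and little memory free: (cpu, mem).
free = (0.8, 0.2)
tasks = {"cpu_heavy": (0.6, 0.1), "mem_heavy": (0.1, 0.6), "small": (0.1, 0.1)}
choice = best_fit_task(free, tasks)
print(choice)
```

The dot product favors tasks whose demands line up with what the machine has in abundance, which is why this kind of packing improves makespan but pays no attention to fairness.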
Slide12
Scheduling in Data Analytics Clusters
[Figure: the same three bar charts (inter-job fairness, job performance, cluster efficiency) for Tetris, DRF, and SJF, each with a "?" in place of a fourth scheduler]
Is it possible to ensure fairness and still be competitive with the best approaches for the secondary metrics (job performance and cluster efficiency)?
Slide13
Key observation #1
Modern cluster schedulers focus on instantaneous fairness and force short-term optimizations
Users care less about instantaneous fairness
Assumption: task demands and durations are known
Example jobs, written as Stage ID: #Tasks x <Res. req> @Dur
Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1
Job 2: S0: 2 x <.29> @1
[Figure: capacity (0 to 1.0) vs. time (0 to 2) for stages S0, S1, and S2 of the two jobs, with the fair allocation marked, under traditional and altruistic scheduling]
Traditional scheduling: greedy decisions within the fair allocation. Avg. JCT: 2.05
Leftover: unnecessary resources that a job donates
Altruism: an action to contribute leftover resources
Altruistic scheduling: Avg. JCT: 1.33x better
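The averages in this example follow from the per-job completion times. The check below assumes the values given on later slides of this deck: traditional scheduling yields JCTs of 2.1 (Job 1) and 2.0 (Job 2), and altruistic scheduling brings Job 2 down to 1.0 while Job 1 stays at 2.1. The variable names are illustrative.

```python
traditional = {"Job 1": 2.1, "Job 2": 2.0}
altruistic = {"Job 1": 2.1, "Job 2": 1.0}  # Job 2 finishes earlier on donated leftover


def average(jcts):
    """Average job completion time over all jobs."""
    return sum(jcts.values()) / len(jcts)


avg_traditional = average(traditional)   # about 2.05, as on the slide
avg_altruistic = average(altruistic)     # about 1.55
improvement = avg_traditional / avg_altruistic
print(f"{avg_traditional:.2f} -> {avg_altruistic:.2f} ({improvement:.2f}x better)")
```

With these inputs the ratio comes out at roughly 1.3x, in line with the 1.33x figure on the slide; the small gap presumably comes from rounding in the slide's numbers. Note that Job 1 is not slowed down: it finishes at 2.1 either way.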
Slide17
Key observation #2
Jobs in data analytics clusters have ample opportunities for altruism
What increases opportunities?
Complex DAG structures
Longer DAGs
How many opportunities are there?
50% of the time, at least 20% of the resources can be used as leftover
Slide18
Altruistic multi-resource scheduling technique
#1. How to maximize the amount of leftover resources?
#2. How much leftover should a job contribute?
#3. How to redistribute the leftover?
[Diagram labels: Carbyne; Inter-Job Scheduler; Intra-Job Scheduler; Leftover]
Slide19
Inter-Job Scheduler; Intra-Job Scheduler; Leftover
Maximize the amount of leftover resources
Instantaneous fairness elongates job completion time the most and increases altruism opportunities
Carbyne uses DRF for inter-job scheduling
Any fair scheduling technique can be used
[Figure: capacity (0 to 1.0) vs. time (0 to 2) with the fair allocation marked]
Example jobs, written as Stage ID: #Tasks x <Res. req> @Dur
Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1
Job 2: S0: 2 x <.29> @1
Slide20
Inter-Job Scheduler; Intra-Job Scheduler; Leftover
How much leftover to contribute?
Traditional scheduling to compute expected completion time
JCT Job 1: 2.1
JCT Job 2: 2.0
Scheduling in the future, from finish to current time: move tasks into the future
Donate leftover resources through altruism
[Figure: capacity (0 to 1.0) vs. time (0 to 2) for stages S0, S1, and S2 within the fair allocation, before and after tasks are moved into the future; a leftover of 0.29 is marked]
Leftover: 0.29 + 0.21 (one amount per job); Total: 0.50
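"Scheduling in the future, from finish to current time" can be read as a backward pass over a job's DAG: fix the expected completion time, then push every stage as late as it can go without moving that finish time. The Python sketch below computes latest start times this way for Job 1. It assumes that S0 and S1 both feed S2 (the slides do not spell out the DAG edges), treats each stage as a single block with its task duration, and ignores resource limits, so it illustrates the idea rather than Carbyne's actual procedure.

```python
def latest_starts(durations, children, deadline):
    """Backward pass: latest start of each stage so the job still ends by `deadline`."""
    latest = {}

    def visit(stage):
        if stage not in latest:
            # A stage must finish before the earliest latest-start among its children;
            # a stage without children only has to finish by the deadline.
            finish = min((visit(c) for c in children.get(stage, [])), default=deadline)
            latest[stage] = finish - durations[stage]
        return latest[stage]

    for stage in durations:
        visit(stage)
    return latest


# Job 1 from the slides; the edges S0 -> S2 and S1 -> S2 are an assumption.
durations = {"S0": 0.5, "S1": 1.0, "S2": 0.1}
children = {"S0": ["S2"], "S1": ["S2"]}
starts = latest_starts(durations, children, deadline=2.1)

# S0 has no parents in this sketch, so it could start at time 0 and its
# slack equals its latest start time.
slack_s0 = starts["S0"]
print(starts, slack_s0)
```

Whatever a job would have used during a stage's slack is a candidate for donation: resources the job holds under its fair allocation but does not need yet in order to finish on time.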
Slide25
Inter-Job Scheduler; Intra-Job Scheduler; Leftover
How to redistribute the leftover resources?
Goal 1: Improve average JCT
Schedule jobs closest to completion time first
Goal 2: Maximize efficiency
Pack as many unscheduled tasks as possible
Goals 1 and 2 can be interchanged
[Figure: capacity (0 to 1.0) vs. time (0 to 2) for stages S0, S1, and S2 within the fair allocation as the leftover is handed out]
Leftover total: 0.50 before redistribution; 0.21 after scheduling the job closest to completion (JCT Job 2 drops from 2.0 to 1.0); 0 after packing unscheduled tasks
JCT Job 1: 2.1
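The two redistribution goals can be sketched as one greedy pass: visit jobs in order of remaining time (closest to completion first) and give each pending task that fits a slice of the leftover. In the Python sketch below, the 0.50 total and the task sizes (0.29 for Job 2, 0.21 for Job 1) are read off the slides; which tasks are pending, the use of the expected JCT as remaining time, the function name, and the small tolerance are assumptions made for illustration.

```python
EPS = 1e-9  # tolerance for floating-point comparisons


def redistribute(leftover, pending):
    """pending: list of (job, remaining_time, task_demand).
    Returns the grants in the order they were made and the leftover that remains."""
    grants = []
    # Jobs closest to completion come first (shortest remaining time).
    for job, _, demand in sorted(pending, key=lambda p: p[1]):
        if demand <= leftover + EPS:
            leftover -= demand
            grants.append((job, demand))
    return grants, max(leftover, 0.0)


pending = [("Job 1", 2.1, 0.21), ("Job 2", 2.0, 0.29)]
grants, remaining = redistribute(0.50, pending)
print(grants, remaining)
```

Job 2 is served first and takes 0.29, leaving 0.21, which is then just enough for Job 1's task; this mirrors the 0.50, 0.21, 0 sequence on the slides.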
Slide29
Putting it all together
We saw:
Increase leftover via Inter-Job Scheduling: provide fairness by adopting the best fair schedulers
Compute leftover via Intra-Job Scheduling
Leftover redistribution: improve JCT and cluster efficiency
Other things in the paper:
Bounding altruism with P(Altruism)
Resource estimation: bounding resource misestimations
Data locality
Straggler mitigation
Task failures
[Diagram: YARN architecture, with the changes that add Carbyne shown in orange. Labels: Node Manager 1 ...; Running tasks; Report available resources; Job Manager 1 ...; Compute leftover; Asks; Offers; Allocations; Resource Manager; Resource availability; Leftover redistribution]
Slide31
Evaluation
Implemented in YARN and Tez
100-machine cluster deployment
Replayed Bing and Facebook traces and TPC-DS and TPC-H workloads
Slide32
Fairness vs. Performance vs. Efficiency
[Figure: bar charts of inter-job fairness (Jain’s Fairness Index), job performance (avg. JCT, seconds), and cluster efficiency (makespan, seconds)]
DRF: Jain’s Fairness Index 0.86; avg. JCT 1224 s; makespan 5478 s
SJF: Jain’s Fairness Index 0.64; avg. JCT 769 s; makespan 6210 s
Tetris: Jain’s Fairness Index 0.74; avg. JCT 1123 s; makespan 4356 s
Carbyne: Jain’s Fairness Index 0.81; avg. JCT 814 s; makespan 4492 s
Comparable performance with the best approaches in each metric
Gains from: altruism helps in the long run
Slide34
Job Performance
[Figure: snapshot of the execution of a TPC-DS query]
DRF: tasks from a DRF allocation
Carbyne w/o leftover: only the tasks that run as a result of being altruistic
Carbyne: tasks from being altruistic and from leftover redistribution
Low resource contention: DRF takes greedy decisions; Carbyne is altruistic and approximates the DRF allocation due to leftover
High resource contention: DRF progress is slowed down; Carbyne progresses faster due to receiving leftover
Gains from: altruism helps in the long run
Slide37
Job Performance
Performance
Better job completion times
16% of jobs slowed down; only 4% by more than 0.8x
Which jobs slow down?
Longer jobs
No bias towards shorter jobs
Most jobs benefit from leftover
Slide38
Impact of a Better Intra-Job Scheduler
Default intra-job scheduler: Tetris. Limited view of the job’s DAG
Better intra-job scheduler: Graphene. DAG-wide view
Extracts more leftover
Further increases performance
[Figure: fraction of jobs vs. factor of improvement (w.r.t. DRF)]
Slide41
Carbyne
Increase Leftover via Inter-Job Scheduling
Maximize Leftover for Individual Jobs
Redistribution via Leftover Scheduling
Fairness; Performance; Efficiency
The long-term altruistic view of Carbyne outperforms existing cluster schedulers, which focus on instantaneous fairness
Implemented inside YARN and Tez
Performance comparable with the best approaches in terms of fairness, job completion time, and cluster efficiency
Slide42
Backup slides
Slide43
Impact of Altruism Levels
[Figure: factor of improvement vs. probability of making altruistic choices]
Increasing levels of altruism increase performance
Comparable when P(Altruism) == 0
Slide44
Impact of Misestimations
[Figure: factor of improvement vs. error in resource misestimations [%]]
Consistently better performance
Comparable when resources are underestimated
Slide45
Impact of Contention
More contention increases the need for carefully rearranging tasks
Too much contention saturates the cluster; not much room for leftover allocations
Slide46
Fairness vs. Performance vs. Efficiency - Online
Even in the online case, Carbyne comes close to the best approach in each metric
Slide47
Data Locality vs. Straggler Mitigation vs. Task Failures
Data Locality
Altruistically giving up resources for a data-local task may have adverse effects
An altruistically delayed data-local task is likely to find data locality when it is eventually scheduled
Straggler Mitigation
Carbyne is likely to prioritize speculative tasks during leftover scheduling because it selects jobs in SRTF order
Handling Task Failures
Carbyne does not distinguish between new and restarted tasks
In case of task failures, it has to recalculate the expected completion time for the job