Altruistic Scheduling in Multi-Resource Clusters

Uploaded On 2018-09-23


Presentation Transcript

Slide1

Altruistic Scheduling in Multi-Resource Clusters

Robert Grandl, Mosharaf Chowdhury, Aditya Akella, Ganesh Ananthanarayanan

Carbyne

Slide2

Performance of Cluster Schedulers

We observe that:

Existing cluster schedulers focus on instantaneous fairness

Data-parallel jobs provide ample opportunities for long-term optimizations

Long-term fairness enables larger scheduling flexibility

Carbyne: 1.3x higher cluster efficiency; 1.6x lower average job completion time; near-perfect fairness

Slide3

Scheduling in Data Analytics Clusters

Diagram labels: Jobs; Cluster-Wide Resource Manager (Inter-Job Scheduler, Intra-Job Scheduler); Cluster resources

Objectives:

#1: Fairness

#2: Job Performance

#3: Cluster Efficiency

Meeting all three objectives together is difficult!

Slide4

Scheduling in Data Analytics Clusters

Objectives:

#1: Fairness

#2: Job Performance

#3: Cluster Efficiency

Meeting all three objectives together is difficult!

TPC-DS workload on a 100-machine cluster

Figure (empty axes): Inter-job fairness (Jain’s Fairness Index, 0 to 1); Job performance (Avg. JCT, seconds); Cluster efficiency (Makespan, seconds)
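For reference, Jain's fairness index over n allocations x_1..x_n is (sum of x_i)^2 / (n * sum of x_i^2). It equals 1 when every allocation is equal and 1/n when a single party receives everything. The sketch below is only an illustration: the function name `jain_index` and the sample inputs are assumptions, and the slides do not state which per-job quantity was fed into the index.

```python
def jain_index(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(allocations)
    total = sum(allocations)
    squares = sum(x * x for x in allocations)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero allocation")
    return (total * total) / (n * squares)


# Hypothetical inputs, chosen only to show the two extremes of the index.
equal_shares = jain_index([1, 1, 1, 1])  # every job gets the same share
one_winner = jain_index([1, 0, 0, 0])    # a single job gets everything
```

Values between these extremes, such as the 0.64 to 0.86 range on the fairness axis, indicate partially unequal allocations.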

Slide5

Scheduling in Data Analytics Clusters

Objectives:

#1: Fairness: DRF [NSDI’11]

#2: Job Performance

#3: Cluster Efficiency

Meeting all three objectives together is difficult!

TPC-DS workload on a 100-machine cluster

Inter-job fairness (Jain’s Fairness Index, 0 to 1): DRF 0.86

Job performance (Avg. JCT, seconds): DRF 1224

Cluster efficiency (Makespan, seconds): DRF 5478

Dominant Resource Fairness: max-min fair sharing across multiple dimensions
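To make "max-min fair sharing across multiple dimensions" concrete, here is a minimal progressive-filling sketch of Dominant Resource Fairness. It is a simplification, not the scheduler used in the talk: the function name `drf_allocate` is hypothetical, every task of a job is assumed to have the same demand vector, and a job is dropped from consideration as soon as its next task no longer fits. The 9-CPU, 18-GB inputs follow the standard illustration from the DRF paper.

```python
def drf_allocate(capacity, demands):
    """Grant one task at a time to the job with the lowest dominant share."""
    used = [0] * len(capacity)
    tasks = {job: 0 for job in demands}
    active = set(demands)

    def dominant_share(job):
        # Largest fraction of any single resource currently held by the job.
        return max(tasks[job] * d / c for d, c in zip(demands[job], capacity))

    while active:
        job = min(sorted(active), key=dominant_share)
        need = demands[job]
        if all(u + n <= c for u, n, c in zip(used, need, capacity)):
            used = [u + n for u, n in zip(used, need)]
            tasks[job] += 1
        else:
            active.discard(job)  # simplification: stop serving a job that no longer fits
    return tasks


# Cluster with 9 CPUs and 18 GB; A's tasks need <1 CPU, 4 GB>, B's need <3 CPU, 1 GB>.
result = drf_allocate((9, 18), {"A": (1, 4), "B": (3, 1)})
```

In this run both jobs end with a dominant share of 2/3: A holds 12 of 18 GB and B holds 6 of 9 CPUs.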

Slide6

Scheduling in Data Analytics Clusters

Objectives:

#1: Fairness: DRF [NSDI’11]

#2: Job Performance: Shortest Job First (SJF), which schedules jobs near completion first

#3: Cluster Efficiency

Meeting all three objectives together is difficult!

TPC-DS workload on a 100-machine cluster

Inter-job fairness (Jain’s Fairness Index, 0 to 1): DRF 0.86, SJF 0.64

Job performance (Avg. JCT, seconds): DRF 1224, SJF 769

Cluster efficiency (Makespan, seconds): DRF 5478, SJF 6210

Slide7

Scheduling in Data Analytics Clusters

Objectives:

#1: Fairness: DRF [NSDI’11]

#2: Job Performance: Shortest Job First (SJF)

#3: Cluster Efficiency: Tetris [SIGCOMM’14], which efficiently packs resources

Meeting all three objectives together is difficult!

TPC-DS workload on a 100-machine cluster

Inter-job fairness (Jain’s Fairness Index, 0 to 1): DRF 0.86, SJF 0.64, Tetris 0.74

Job performance (Avg. JCT, seconds): DRF 1224, SJF 769, Tetris 1123

Cluster efficiency (Makespan, seconds): DRF 5478, SJF 6210, Tetris 4356

Slide8

Scheduling in Data Analytics Clusters

Objectives:

#1: Fairness: DRF [NSDI’11]

#2: Job Performance: Shortest Job First (SJF)

#3: Cluster Efficiency: Tetris [SIGCOMM’14]

Meeting all three objectives together is difficult!

TPC-DS workload on a 100-machine cluster

Inter-job fairness (Jain’s Fairness Index, 0 to 1): DRF 0.86, SJF 0.64, Tetris 0.74

Job performance (Avg. JCT, seconds): DRF 1224, SJF 769, Tetris 1123

Cluster efficiency (Makespan, seconds): DRF 5478, SJF 6210, Tetris 4356

Legend: Outperforms; Underperforms; Preferred metric; Secondary metrics

Slide12

Scheduling in Data Analytics Clusters

Inter-job fairness (Jain’s Fairness Index, 0 to 1): DRF 0.86, SJF 0.64, Tetris 0.74, ?

Job performance (Avg. JCT, seconds): DRF 1224, SJF 769, Tetris 1123, ?

Cluster efficiency (Makespan, seconds): DRF 5478, SJF 6210, Tetris 4356, ?

Is it possible to ensure fairness and still be competitive with the best approaches for the secondary metrics (job performance and cluster efficiency)?

Slide13

Key observation #1

Modern cluster schedulers focus on instantaneous fairness and force short-term optimizations

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Figure: Traditional scheduling (Capacity 0 to 1.0 vs. Time 0 to 2), showing the fair allocation and the greedy decisions for stages S0 and S1 of each job

Assumption: task demands and durations are known

Slide14

Key observation #1

Modern cluster schedulers focus on instantaneous fairness and force short-term optimizations

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Figure: Traditional scheduling (Capacity 0 to 1.0 vs. Time 0 to 2), with all stages of both jobs placed

Avg. JCT: 2.05

Users care less about instantaneous fairness

Assumption: task demands and durations are known

Slide15

Key observation #1

Modern cluster schedulers focus on instantaneous fairness and force short-term optimizations

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Figures: Traditional scheduling and Altruistic scheduling (Capacity 0 to 1.0 vs. Time 0 to 2)

Traditional scheduling: Avg. JCT: 2.05

Users care less about instantaneous fairness

Leftover: donated unnecessary resources

Altruism: an action to contribute leftover resources

Slide16

Key observation #1

Modern cluster schedulers focus on instantaneous fairness and force short-term optimizations

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Figures: Traditional scheduling and Altruistic scheduling (Capacity 0 to 1.0 vs. Time 0 to 2)

Traditional scheduling: Avg. JCT: 2.05

Users care less about instantaneous fairness

Leftover: donated unnecessary resources

Altruism: an action to contribute leftover resources

Altruistic scheduling: Avg. JCT: 1.33x better
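The averages behind this comparison can be recomputed from the per-job completion times that appear later in the deck (Job 1: 2.1 and Job 2: 2.0 under traditional scheduling; Job 2 drops to 1.0 once it receives leftover). The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
def average(values):
    """Arithmetic mean of a list of job completion times."""
    return sum(values) / len(values)


traditional = average([2.1, 2.0])  # both jobs under fair, greedy scheduling
altruistic = average([2.1, 1.0])   # Job 2 finishes early thanks to leftover
improvement = traditional / altruistic

print(f"traditional={traditional:.2f} altruistic={altruistic:.2f} "
      f"improvement={improvement:.2f}x")
```

The ratio comes out at roughly 1.32, in line with the 1.33x figure on the slide, while Job 1's completion time stays at 2.1.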

Slide17

Key observation #2

Jobs in data analytics clusters have ample opportunities for altruism

What increases opportunities?

Complex DAG structures

Longer DAGs

How many opportunities?

50% of the time, at least 20% of the resources can be used as leftover

Slide18

Altruistic multi-resource scheduling technique

#1. How to maximize the amount of leftover resources?

#2. How much leftover should a job contribute?

#3. How to redistribute the leftover?

Diagram labels: Carbyne; Inter-Job Scheduler; Intra-Job Scheduler; Leftover

Slide19

Components: Inter-Job Scheduler; Intra-Job Scheduler; Leftover

Maximize the amount of leftover resources

Instantaneous fairness elongates job completion time the most and increases altruism opportunities

Carbyne uses DRF for inter-job scheduling

Any fair scheduling technique can be used

Figure: Fair Allocation (Capacity 0 to 1.0 vs. Time 0 to 2)

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Slide20

Components: Inter-Job Scheduler; Intra-Job Scheduler; Leftover

How much leftover to contribute?

Traditional scheduling to compute expected completion time

Figure: Fair Allocation (Capacity 0 to 1.0 vs. Time 0 to 2), with the stages of both jobs placed by traditional scheduling

JCT Job 1: 2.1

JCT Job 2: 2.0

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Slide21

Components: Inter-Job Scheduler; Intra-Job Scheduler; Leftover

How much leftover to contribute?

Traditional scheduling to compute expected completion time

Scheduling in the future, from finish to current time (move tasks into the future)

Figure: Fair Allocation (Capacity 0 to 1.0 vs. Time 0 to 2)

JCT Job 1: 2.1

JCT Job 2: 2.0

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Slide22

Components: Inter-Job Scheduler; Intra-Job Scheduler; Leftover

How much leftover to contribute?

Traditional scheduling to compute expected completion time

Scheduling in the future, from finish to current time (move tasks into the future)

Figure: Fair Allocation (Capacity 0 to 1.0 vs. Time 0 to 2), with tasks moved toward the finish time

JCT Job 1: 2.1

JCT Job 2: 2.0

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Slide23

Components: Inter-Job Scheduler; Intra-Job Scheduler; Leftover

How much leftover to contribute?

Traditional scheduling to compute expected completion time

Scheduling in the future, from finish to current time

Donate leftover resources through altruism

Figure: Fair Allocation (Capacity 0 to 1.0 vs. Time 0 to 2), with a leftover region of 0.29

JCT Job 1: 2.1

JCT Job 2: 2.0

Leftover: Job 1: 0.29; Total: 0.29

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Slide24

Components: Inter-Job Scheduler; Intra-Job Scheduler; Leftover

How much leftover to contribute?

Traditional scheduling to compute expected completion time

Scheduling in the future, from finish to current time

Donate leftover resources through altruism

Figure: Fair Allocation (Capacity 0 to 1.0 vs. Time 0 to 2), with leftover regions

JCT Job 1: 2.1

JCT Job 2: 2.0

Leftover: Job 1: 0.29; Job 2: 0.21; Total: 0.50
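The three steps above (compute the expected completion time, schedule backwards from that finish time, donate what is not needed now) can be sketched as a reverse packing pass. This is only an illustration under stated assumptions, not Carbyne's implementation: the function name `latest_usage` is hypothetical, Job 1's S2 is assumed to depend on both S0 and S1 (the slides do not spell out the DAG edges), each job's fair share is taken to be 0.5 of the cluster, capacity is expressed in hundredths and time in tenths to avoid floating-point noise.

```python
def latest_usage(stages, order, share, deadline):
    """Pack every task as late as possible; return capacity used per time slot.

    stages maps a stage name to (num_tasks, demand, duration, child_stages);
    order lists the stages parents-first.
    """
    usage = [0] * deadline
    stage_start = {}
    for name in reversed(order):  # place children before their parents
        num_tasks, demand, duration, children = stages[name]
        finish_by = min((stage_start[c] for c in children), default=deadline)
        first_start = finish_by
        for _ in range(num_tasks):
            start = finish_by - duration
            # Slide the task earlier until it fits under the fair share.
            while start >= 0 and any(
                usage[t] + demand > share for t in range(start, start + duration)
            ):
                start -= 1
            if start < 0:
                raise ValueError(f"{name} cannot finish by the deadline")
            for t in range(start, start + duration):
                usage[t] += demand
            first_start = min(first_start, start)
        stage_start[name] = first_start
    return usage


# Example jobs from the slides; the S0/S1 -> S2 dependency is an assumption.
job1 = {"S0": (2, 8, 5, ["S2"]), "S1": (3, 21, 10, ["S2"]), "S2": (1, 10, 1, [])}
job2 = {"S0": (2, 29, 10, [])}
leftover1 = 50 - latest_usage(job1, ["S0", "S1", "S2"], share=50, deadline=21)[0]
leftover2 = 50 - latest_usage(job2, ["S0"], share=50, deadline=20)[0]
```

Under these assumptions the capacity left unused at the current time is 29 and 21 hundredths for the two jobs, which matches the 0.29 and 0.21 leftover values and the 0.50 total on the slide.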

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Slide25

Components: Inter-Job Scheduler; Intra-Job Scheduler; Leftover

How to redistribute the leftover resources?

Goal 1: Improve average JCT: schedule jobs closest to completion time first

Goal 2: Maximize efficiency

Goals 1 and 2 can be interchanged

Figure: Fair Allocation (Capacity 0 to 1.0 vs. Time 0 to 2)

JCT Job 1: 2.1

JCT Job 2: 2.0

Leftover: Total: 0.50

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Slide26

Components: Inter-Job Scheduler; Intra-Job Scheduler; Leftover

How to redistribute the leftover resources?

Goal 1: Improve average JCT: schedule jobs closest to completion time first

Goal 2: Maximize efficiency

Goals 1 and 2 can be interchanged

Figure: Fair Allocation (Capacity 0 to 1.0 vs. Time 0 to 2)

Leftover: Total: 0.50, then 0.21

JCT Job 1: 2.1

JCT Job 2: 2.0, then 1.0

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Slide27

Components: Inter-Job Scheduler; Intra-Job Scheduler; Leftover

How to redistribute the leftover resources?

Goal 1: Improve average JCT: schedule jobs closest to completion time first

Goal 2: Maximize efficiency: pack as many unscheduled tasks as possible

Goals 1 and 2 can be interchanged

Figure: Fair Allocation (Capacity 0 to 1.0 vs. Time 0 to 2)

Leftover: Total: 0.21

JCT Job 1: 2.1

JCT Job 2: 1.0

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Slide28

Components: Inter-Job Scheduler; Intra-Job Scheduler; Leftover

How to redistribute the leftover resources?

Goal 1: Improve average JCT: schedule jobs closest to completion time first

Goal 2: Maximize efficiency: pack as many unscheduled tasks as possible

Goals 1 and 2 can be interchanged

Figure: Fair Allocation (Capacity 0 to 1.0 vs. Time 0 to 2)

JCT Job 1: 2.1

JCT Job 2: 2.0

Leftover: Total: 0.21, then 0
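The redistribution walk-through (leftover going from 0.50 to 0.21 to 0) can be mimicked with a small greedy pass. It is a sketch under assumptions rather than the scheduler's actual logic: the function name `redistribute` is hypothetical, jobs are served in order of expected completion time, each job's pending tasks are tried largest-first, and Job 1's list of pending task demands is assumed for illustration. Capacity is in hundredths of the cluster.

```python
def redistribute(leftover, jobs):
    """Hand out leftover capacity; jobs is a list of (name, expected_jct, pending_demands)."""
    grants = []
    # Goal 1: jobs closest to completion are served first.
    for name, _, pending in sorted(jobs, key=lambda job: job[1]):
        # Goal 2: pack whatever still fits, trying larger tasks first (assumed order).
        for demand in sorted(pending, reverse=True):
            if demand <= leftover:
                leftover -= demand
                grants.append((name, demand))
    return grants, leftover


# Job 2 still has one <.29> task waiting; Job 1's pending tasks are an assumption.
jobs = [("Job 1", 2.1, [21, 21, 8, 8]), ("Job 2", 2.0, [29])]
grants, remaining = redistribute(50, jobs)
```

With these inputs Job 2 receives 29 hundredths first, which lets its second task start immediately, and the remaining 21 hundredths go to one of Job 1's S1 tasks, leaving no leftover.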

Job 1: S0: 2 x <.08> @.5; S1: 3 x <.21> @1; S2: 1 x <.1> @.1

Job 2: S0: 2 x <.29> @1

Notation: Stage ID: #Tasks x <Res. req> @Dur

Slide29

Putting it all together

We saw

Increase leftover via Inter-Job Scheduling: adopting best fair schedulers

Compute leftover via Intra-Job Scheduling

Leftover redistribution: improve JCT and cluster efficiency

Other things in the paper

Bounding altruism with P(Altruism)

Resource estimation

Data locality

Straggler mitigation

Task failures
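Of the items listed above, bounding altruism with P(Altruism) lends itself to a tiny illustration. The slides only name the knob and describe it as the probability of making altruistic choices, so the sketch below is an assumption about how such a probability could gate a donation: the function name `donated_leftover` and its parameters are hypothetical.

```python
import random


def donated_leftover(leftover, p_altruism, rng):
    """Donate the computed leftover with probability p_altruism, otherwise keep it."""
    return leftover if rng.random() < p_altruism else 0.0


rng = random.Random(0)  # seeded only to make the example repeatable
never = donated_leftover(0.29, 0.0, rng)   # P(Altruism) = 0: nothing is donated
always = donated_leftover(0.29, 1.0, rng)  # P(Altruism) = 1: leftover is always donated
```

Intermediate probabilities would donate on only a fraction of scheduling decisions.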

Slide30

Putting it all together

We saw

Increase leftover via Inter-Job Scheduling: provide fairness

Compute leftover via Intra-Job Scheduling

Leftover redistribution: improve JCT and cluster efficiency

Other things in the paper

Bounding altruism with P(Altruism)

Bounding resource misestimations

Data locality

Straggler mitigation

Task failures

Figure: YARN architecture, with the changes to add Carbyne shown in orange. Labels: Node Manager 1; Running tasks; Report available resources; Job Manager 1; Asks; Offers; Allocations; Resource availability; Resource Manager; Compute leftover; Leftover redistribution

Slide31

Evaluation

Implemented in YARN and Tez

100-machine cluster deployment

Replay Bing / Facebook traces and TPC-DS / TPC-H workloads

Slide32

Fairness vs. Performance vs. Efficiency

Inter-job fairness (Jain’s Fairness Index, 0 to 1): DRF 0.86, SJF 0.64, Tetris 0.74, ?

Job performance (Avg. JCT, seconds): DRF 1224, SJF 769, Tetris 1123, ?

Cluster efficiency (Makespan, seconds): DRF 5478, SJF 6210, Tetris 4356, ?

Slide33

Fairness vs. Performance vs. Efficiency

Inter-job fairness (Jain’s Fairness Index, 0 to 1): DRF 0.86, SJF 0.64, Tetris 0.74, Carbyne 0.81

Job performance (Avg. JCT, seconds): DRF 1224, SJF 769, Tetris 1123, Carbyne 814

Cluster efficiency (Makespan, seconds): DRF 5478, SJF 6210, Tetris 4356, Carbyne 4492

Comparable performance with best approaches in each metric

Gains from: altruism helps in the long run
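The "comparable performance" claim can be checked directly against the numbers on the chart. The snippet below only redoes that arithmetic; the dictionary layout and variable names are illustrative.

```python
# (Jain's fairness index, average JCT in seconds, makespan in seconds) from the chart.
results = {
    "DRF": (0.86, 1224, 5478),
    "SJF": (0.64, 769, 6210),
    "Tetris": (0.74, 1123, 4356),
    "Carbyne": (0.81, 814, 4492),
}

baselines = {name: row for name, row in results.items() if name != "Carbyne"}
carbyne = results["Carbyne"]

best_fairness = max(row[0] for row in baselines.values())  # higher is better
best_jct = min(row[1] for row in baselines.values())       # lower is better
best_makespan = min(row[2] for row in baselines.values())  # lower is better

fairness_ratio = carbyne[0] / best_fairness
jct_ratio = carbyne[1] / best_jct
makespan_ratio = carbyne[2] / best_makespan
```

On this workload Carbyne is within roughly 6% of the best baseline on every metric: about 0.94 of DRF's fairness index, about 1.06 times SJF's average JCT, and about 1.03 times Tetris's makespan.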

Slide34

Job Performance

Figure: snapshot of the execution of a TPC-DS query

DRF: tasks from a DRF allocation

Carbyne w/o leftover: only running tasks from being altruistic

Carbyne: tasks from being altruistic and from leftover redistribution

Gains from: altruism helps in the long run

Slide35

Job Performance

Figure: snapshot of the execution of a TPC-DS query (regions marked Altruistic and Greedy)

DRF: tasks from a DRF allocation

Carbyne w/o leftover: only running tasks from being altruistic

Carbyne: tasks from being altruistic and from leftover redistribution

Low resource contention

DRF takes greedy decisions

Approximates DRF allocation due to leftover

Gains from: altruism helps in the long run

Slide36

Job Performance

Figure: snapshot of the execution of a TPC-DS query (region marked Leftover)

DRF: tasks from a DRF allocation

Carbyne w/o leftover: only running tasks from being altruistic

Carbyne: tasks from being altruistic and from leftover redistribution

High resource contention

DRF progress is slowed down

Carbyne progresses faster due to receiving leftover

Gains from: altruism helps in the long run

Slide37

Job Performance

Performance

Better job completion time

16% of jobs slowed down; only 4% by more than 0.8x

Which jobs slow down?

Longer jobs

No bias towards shorter jobs

Most jobs benefit from leftover

Figure: snapshot of the execution of a TPC-DS query

Slide38

Impact of a Better Intra-Job Scheduler

Default intra-job scheduler: Tetris (limited view of the job’s DAG)

Figure: Fraction of Jobs vs. Factor of Improvement (w.r.t. DRF)

Slide39

Impact of a Better Intra-Job Scheduler

Default intra-job scheduler: Tetris (limited view of the job’s DAG)

Better intra-job scheduler: Graphene (DAG-wide view)

Figure: Fraction of Jobs vs. Factor of Improvement (w.r.t. DRF)

Slide40

Impact of a Better Intra-Job Scheduler

Default intra-job scheduler: Tetris (limited view of the job’s DAG)

Better intra-job scheduler: Graphene (DAG-wide view)

Extracts more leftover

Further increases performance

Figure: Fraction of Jobs vs. Factor of Improvement (w.r.t. DRF)

Slide41

Carbyne

Maximize Leftover for Individual Jobs

Redistribution via Leftover Scheduling

Increase Leftover via Inter-Job Scheduling

Fairness; Performance; Efficiency

The long-term altruistic view of Carbyne outperforms existing cluster schedulers, which focus on instantaneous fairness

Implemented inside YARN and Tez

Performance comparable with the best approaches in terms of fairness, job completion time, and cluster efficiency

Slide42

Backup slides

Slide43

Impact of Altruism Levels

Figure: Factor of improvement vs. Probability of Making Altruistic Choices

Increasing levels of altruism increase performance

Comparable when P(Altruism) == 0

Slide44

Impact of Misestimations

Figure: Factor of improvement vs. Error in Resource Misestimations [%]

Consistently better performance

Comparable when resources are underestimated

Slide45

Impact of Contention

More contention increases the need for carefully rearranging tasks

Too much contention saturates the cluster; not much room for leftover allocations

Slide46

Fairness vs. Performance vs. Efficiency - Online

Even in the online case, Carbyne comes close to the best approach in each metric

Slide47

Data Locality vs. Straggler Mitigation vs. Task Failures

Data Locality

Altruistically giving up resources for a data-local task may have adverse effects

An altruistically delayed data-local task is likely to find data locality when it is eventually scheduled

Straggler Mitigation

Likely to prioritize speculative tasks during leftover scheduling because it selects jobs in the SRTF order

Handling Task Failures

Does not distinguish between new and restarted tasks

In case of task failures, it has to recalculate the expected completion time for the job