Multi-Resource Packing for Cluster Schedulers

Presentation Transcript

Slide 1

Multi-Resource Packing for Cluster Schedulers

Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, Aditya Akella

Tetris

Slide 2

Performance of cluster schedulers

We find that:

- Resources are fragmented, i.e., machines run below capacity
- Even at 100% usage, goodput is smaller due to over-allocation
- Pareto-efficient multi-resource fair schemes do not lead to good avg. performance

Tetris: up to 40% improvement in makespan¹ and job completion time, with near-perfect fairness

¹ Time to finish a set of jobs

Slide 3

Findings from analysis of Bing and Facebook traces

- Tasks need varying amounts of each resource
- Demands for resources are weakly correlated
- Applications have (very) diverse resource needs
- Multiple resources become tight

This matters because there is no single bottleneck resource in the cluster: e.g., there is enough cross-rack network bandwidth to use all cores.

Upper bound on potential gains:
- Makespan reduces by ≈ 49%
- Avg. job completion time reduces by ≈ 46%

Slide 4

Why so bad? #1

Production schedulers neither pack tasks nor consider all of their relevant resource demands:

#1 Resource Fragmentation
#2 Over-allocation

Slide 5

#1 Resource Fragmentation (RF)

Machines A and B each have 4 GB of memory; tasks T1 (2 GB), T2 (2 GB), and T3 (4 GB) arrive.

Current schedulers: T1 and T2 are placed on different machines, leaving 2 GB free on each, so T3 (4 GB) fits on neither machine and must wait. Avg. task completion time = 1.33t.

"Packer" scheduler: T1 and T2 are packed onto machine A, leaving machine B free for T3. Avg. task completion time = 1t.

Current schedulers allocate resources in terms of slots and fairness; they are not explicit about packing. RF increases with the number of resources being allocated!

Slide 6

#2 Over-Allocation

Not all resources are explicitly allocated; e.g., disk and network can be over-allocated.

Machine A has 4 GB memory and 20 MB/s of network. T1 and T2 each need 2 GB memory and 20 MB/s of network; T3 needs 2 GB memory only.

Current schedulers: only memory is explicitly allocated, so T1 and T2 are scheduled together and over-allocate the network (40 MB/s demanded vs. 20 MB/s available), slowing both down. Avg. task completion time = 2.33t.

"Packer" scheduler: never over-allocates the network, running the network-bound tasks in sequence while T3 (memory only) runs alongside. Avg. task completion time = 1.33t.

Slide 7

Why so bad? #2

Multi-resource fairness schemes do not solve the problem:

- Work-conserving != no fragmentation, no over-allocation
- They treat the cluster as one big bag of resources, which hides the impact of resource fragmentation
- They assume a job has a fixed resource profile, but different tasks in the same job have different demands

How a job is scheduled impacts the job's current resource profile; the scheduler can schedule to create complementarity (example in paper).

Pareto¹-efficient != performant. Packer vs. DRF: makespan and avg. completion time improve by over 30%.

¹ Pareto-efficient: no job can increase its share without decreasing the share of another

Slide 8

Competing objectives

- Job completion time vs. cluster efficiency
- Fairness vs. both

Current schedulers:
1. Resource fragmentation
2. Over-allocation
3. Fair allocations sacrifice performance

Slide 9

Tetris #1: Pack tasks along multiple resources to improve cluster efficiency and reduce makespan

Slide 10

Theory vs. practice

Multi-resource packing of tasks is similar to multi-dimensional bin packing: balls could be tasks; a bin could be a machine at a point in time. Multi-dimensional bin packing is APX-hard¹.

Existing heuristics do not directly apply in practice:
- They assume balls of a fixed size, but task demands are elastic and vary with time and with the machine where the task is placed
- They assume balls are known a priori, but a scheduler must cope with online arrival of jobs, dependencies, and cluster activity

Avoiding fragmentation looks like tight bin packing: reducing the number of bins reduces makespan.

¹ APX-Hard is a strict subset of NP-hard

Slide 11

Tetris #1: A packing heuristic

Packing tasks to machines = multi-dimensional bin packing:
- Ball = task resource demand vector
- Bin = machine available resource vector

Fit: task resource demand vector < machine available resource vector.

Alignment score (A): the dot product of the task's demand vector with the machine's available-resource vector, computed only over machines where the task fits.

"A" works because:
1. Checking for fit ensures no over-allocation
2. Bigger balls get bigger scores, which counters resource fragmentation
3. Abundant resources get used first
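A minimal sketch of the fit check and alignment score just described, assuming per-resource demand and availability maps (the names Resources, fits, and alignment_score are illustrative, not from the Tetris implementation):

```python
# Sketch of Tetris's fit check and alignment (dot-product) score.
from typing import Dict, Optional

Resources = Dict[str, float]  # e.g., {"cpu": 2, "mem_gb": 4, "net_mbps": 20}

def fits(demand: Resources, available: Resources) -> bool:
    """Fit check: a task may only be placed where no resource is over-allocated."""
    return all(demand.get(r, 0.0) <= available.get(r, 0.0) for r in demand)

def alignment_score(demand: Resources, available: Resources) -> Optional[float]:
    """Dot product of the task's demand vector with the machine's availability.

    Bigger tasks get bigger scores, and machines whose abundant resources
    match the task's large demands are preferred, which fights fragmentation.
    Returns None when the task does not fit (no over-allocation)."""
    if not fits(demand, available):
        return None
    return sum(demand[r] * available.get(r, 0.0) for r in demand)
```

A scheduler using this sketch would place each task on the machine that maximizes its alignment score among machines where it fits.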

Slide 12

Tetris #2: Faster average job completion time

Slide 13

Tetris challenge #2: A job completion time heuristic

Shortest Remaining Time First (SRTF)¹ schedules jobs in ascending order of their remaining time.

Q: What is the "shortest remaining time"?

remaining work = remaining # tasks & tasks' durations & tasks' resource demands

The heuristic extends SRTF to incorporate multiple resources and gives a score P to every job.

¹ SRTF – M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS '99]
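A sketch of a multi-resource "remaining work" score in the spirit of the extended SRTF described above; the exact formula is in the paper, and this illustrative version simply sums duration-weighted demands of a job's remaining tasks (all names here are hypothetical):

```python
# Sketch: a job's remaining work combines its remaining # of tasks,
# their durations, and their resource demands into one scalar.
from typing import Dict, List

Resources = Dict[str, float]

def remaining_work(remaining_demands: List[Resources],
                   remaining_durations: List[float]) -> float:
    """Sum over remaining tasks of (duration x total resource demand)."""
    return sum(dur * sum(demand.values())
               for demand, dur in zip(remaining_demands, remaining_durations))

def srtf_score(remaining_demands: List[Resources],
               remaining_durations: List[float]) -> float:
    """P score: jobs with less remaining work score higher."""
    return -remaining_work(remaining_demands, remaining_durations)
```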

Slide 14

Tetris challenge #2: Combining packing efficiency and completion time

Combine the A and P scores:

1: among J runnable jobs
2:   score(j) = A(t, R) + ε·P(j)
3:     max over tasks t in j with demand(t) ≤ R (resources free)
4: pick j*, t* = argmax score(j)

A alone delays job completion time; P alone loses packing efficiency.
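A sketch of this combined scheduling step, reusing alignment_score and srtf_score from the earlier sketches; the job/task fields and the ε default of 0.3 are illustrative assumptions, not values from the paper:

```python
# Sketch: pick the (job, task) pair maximizing A(t, R) + epsilon * P(j).
def pick_task(runnable_jobs, machine_available, epsilon=0.3):
    """Scores every pending task that fits within the machine's free
    resources R, and returns the argmax (job, task), or None."""
    best, best_score = None, float("-inf")
    for job in runnable_jobs:
        p = srtf_score(job.remaining_demands, job.remaining_durations)
        for task in job.pending_tasks:
            a = alignment_score(task.demand, machine_available)
            if a is None:            # task does not fit: no over-allocation
                continue
            score = a + epsilon * p  # packing efficiency + remaining-work bias
            if score > best_score:
                best, best_score = (job, task), score
    return best
```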

Slide 15

Tetris #3: Achieve performance and fairness

Slide 16

Tetris #3: Fairness heuristic

- Packer says: "task T should go next, to improve packing efficiency"
- SRTF says: "schedule job J next, to improve avg. completion time"
- Fairness says: "this set of jobs should be scheduled next"

Performance and fairness do not mix well in general. But it is possible to satisfy all three, and in fact this happens often in practice: we can get "perfect fairness" and much better performance.

Slide 17

Tetris #3: Fairness heuristic

Fairness is not a tight constraint: aim for long-term fairness, not short-term fairness. Lose a bit of fairness for a lot of gain in performance.

Heuristic: fairness knob F ∈ [0, 1). Pick the best-for-performance task from among the 1−F fraction of jobs furthest from their fair share.

F = 0: most unfair, most efficient scheduling
F → 1: close to perfect fairness
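A sketch of the fairness knob, restricting the performance-driven choice to the 1−F fraction of jobs furthest below their fair share; the fair_share and allocation fields are illustrative, and pick_task is the earlier sketch:

```python
# Sketch: fairness knob F in [0, 1).
def pick_task_with_fairness(runnable_jobs, machine_available, F=0.25):
    """F = 0 considers every job (most efficient, most unfair);
    F -> 1 considers only the most deprived jobs (near-perfect fairness)."""
    # Sort jobs by how far they sit below fair share (most deprived first).
    by_deficit = sorted(runnable_jobs,
                        key=lambda j: j.allocation - j.fair_share)
    k = max(1, int(round((1.0 - F) * len(by_deficit))))
    candidates = by_deficit[:k]
    # Among the candidates, fall back to the best-for-performance choice.
    return pick_task(candidates, machine_available)
```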

Slide 18

Putting it all together

We saw:
- Packing efficiency
- Prefer small remaining work
- Fairness knob

Other things in the paper:
- Estimating task demands
- Dealing with inaccuracies, barriers
- Other cluster activities

[Figure: YARN architecture, with the changes that add Tetris shown in orange]
- Job Managers send multi-resource asks and barrier hints
- Node Managers track resource usage, enforce allocations, and send resource availability reports
- The cluster-wide Resource Manager gains new logic to match tasks to machines (+packing, +SRTF, +fairness), exchanging asks, offers, and allocations with the Job Managers

Slide 19

Evaluation

- Implemented in YARN 2.4
- 250-machine cluster deployment
- Bing and Facebook workloads

Slide 20

Efficiency

Gains come from avoiding fragmentation and avoiding over-allocation:

Tetris vs. multi-resource scheduler:  makespan improves 28%; avg. job compl. time improves 35%
Tetris vs. single-resource scheduler: makespan improves 29%; avg. job compl. time improves 30%

[Figure: utilization (%) over time (s) under the single-resource scheduler; values above 100% indicate over-allocation, low values indicate high fragmentation]

Slide 21

Fairness

The fairness knob F quantifies the extent to which Tetris adheres to fair allocation:

                                     No fairness (F = 0)   F = 0.25   Full fairness (F → 1)
Makespan improvement                         50%              25%            10%
Job compl. time improvement                  40%              35%            23%
Avg. slowdown [over impacted jobs]           25%               5%             2%

Slide 22

Tetris

- Pack efficiently along multiple resources
- Prefer jobs with less "remaining work"
- Incorporate fairness

Combining heuristics that improve packing efficiency with those that lower average job completion time works; achieving desired amounts of fairness can coexist with improving cluster performance.

Implemented inside YARN; deployment and trace-driven simulations show encouraging initial results. We are working towards a YARN check-in.

http://research.microsoft.com/en-us/UM/redmond/projects/tetris/

Slide 23

Backup slides

Slide 24

Estimating resource demands

We estimate peak usage from:
- Statistics collected from recurring jobs
- Input size/location of tasks
- Observed usage, backfilled from finished tasks in the same phase

Placement impacts usage of network and disk.

The Resource Tracker reports resource usage of tasks and other cluster activity, e.g., evacuation.
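As one illustration of the backfilling idea, a toy estimator under the assumption that tasks in the same phase behave alike (all names are hypothetical, not Tetris's actual estimator):

```python
# Toy sketch: estimate a pending task's peak demand from finished tasks
# in the same phase, falling back to recurring-job statistics.
def estimate_peak(finished_peer_usages, recurring_job_default):
    """Max observed peak usage among finished same-phase tasks, else the
    default learned from prior runs of this recurring job."""
    return max(finished_peer_usages, default=recurring_job_default)
```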

Slide 25

Incorporating task placement

1. Disk and network demands depend on task placement
2. Remote resources cannot directly be included in the dot product: the resource vectors would grow with the number of machines in the cluster, and the score could come to prefer remote placement!

Instead, compute the packing score on local resources and use a fractional penalty to reduce use of remote resources.

Sensitivity analysis: makespan and completion time change little for a remote penalty in [6%, 15%].
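A sketch of the remote-penalty idea: score only local resources with the dot product, then discount the score by how much of the task's input must be accessed remotely. The remote_fraction argument and the 10% default penalty are illustrative (the slide reports low sensitivity for penalties in [6%, 15%]):

```python
# Sketch: alignment score on local resources, discounted for remote access.
def penalized_score(demand, available, remote_fraction, penalty=0.10):
    """Packing score computed on local resources, reduced by a fractional
    penalty proportional to the task's remote resource use."""
    a = alignment_score(demand, available)  # from the earlier sketch
    if a is None:
        return None
    return a * (1.0 - penalty * remote_fraction)
```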

Slide 26

Alternative packing heuristics

Notation: d_i = task demand along dimension i; a_i = available resources along dimension i.

[Table of alternative packing heuristics, expressed in terms of d_i and a_i; the formulas did not survive extraction]

Slide 27

Virtual machine packing != Tetris

VM packing consolidates VMs, with multi-dimensional resource requirements, onto the fewest number of servers. But it focuses on different challenges, not task packing:

- Balance load across servers
- Ensure VM availability in spite of failures
- Allow for quick software and hardware updates

There is no entity corresponding to a job, so job completion time is inexpressible; and explicit resource requirements (e.g., a "small" VM) make VM packing simpler.

Slide 28

Barrier knob, b ∈ [0, 1)

Tetris gives preference to the last tasks in a stage: it preferentially offers resources to tasks in a stage preceding a barrier once a b fraction of that stage's tasks have finished. b = 1 means no tasks are preferentially treated.
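A sketch of the trigger condition, with illustrative stage fields (not from the Tetris code):

```python
# Sketch: a stage that precedes a barrier gets preferential offers once
# at least a b fraction of its tasks have finished.
def barrier_boost(stage, b=0.9):
    """True when the stage precedes a barrier and is at least b complete."""
    return (stage.precedes_barrier
            and stage.finished_tasks / stage.total_tasks >= b)
```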

Slide 29

Weighting alignment vs. SRTF

1: among J runnable jobs
2:   score(j) = A(t, R) + ε·P(j)
3:     max over tasks t in j with demand(t) ≤ R (resources free)
4: pick j*, t* = argmax score(j)

Sensitivity analysis: while the best choice of ε depends on the workload, we found that gains from packing efficiency are only moderately sensitive to ε, and that improving avg. job completion time requires ε > 0, though gains stabilize quickly.

Slide 30

Cluster load vs. Tetris performance

Slide 31

Starvation prevention

It could take a long time to accommodate large tasks. We are working on a more principled solution. But:

- Most tasks have demands within one order of magnitude of one another
- Free resources become available in "large clumps": with periodic availability reports, the scheduler learns about the resources freed up by all tasks that finished in the preceding period in one shot

Slide 32

Workload analysis

Slide 33

Ingestion / evacuation

- Ingestion: storing incoming data for later analytics; e.g., some clusters report volumes of up to 10 TB per hour
- Evacuation: data evacuated and re-replicated before maintenance operations
- Other cluster activities also produce background traffic, e.g., rack decommissioning for machine re-imaging

The Resource Tracker reports these activities, and Tetris uses the reports to avoid contention between its tasks and these activities.

Slide 34

Fairness vs. efficiency

Slide 35

Fairness vs. efficiency

Slide 36

Packer scheduler vs. DRF

Cluster: [18 cores, 36 GB memory]

Job: [task profile], # tasks
A: [1 core, 2 GB], 18
B: [3 cores, 1 GB], 6
C: [3 cores, 1 GB], 6

Dominant Resource Fairness (DRF) computes the dominant share (DS) of every user and seeks to maximize the minimum DS across all users. With q_A, q_B, q_C tasks allocated, a job's DS is its largest resource share, e.g. DS_A = max(q_A/18, 2·q_A/36):

maximize (q_A, q_B, q_C)         (maximize allocations)
subject to
  1·q_A + 3·q_B + 3·q_C ≤ 18     (CPU constraint)
  2·q_A + 1·q_B + 1·q_C ≤ 36     (memory constraint)

DRF schedule: in each of the intervals t, 2t, 3t, run 6 A tasks, 2 B tasks, and 2 C tasks, using 18 cores and 16 GB. Durations: A: 3t, B: 3t, C: 3t.

Packer schedule: run all 18 A tasks at t (18 cores, 36 GB), then B's 6 tasks at 2t (18 cores, 6 GB), then C's 6 tasks at 3t (18 cores, 6 GB). Durations: A: t, B: 2t, C: 3t.

33% improvement in avg. job completion time.
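To make the DRF allocation above concrete, here is the equal-dominant-share derivation (standard DRF algebra applied to the slide's numbers, not shown on the slide itself):

```latex
% A's per-task shares are 1/18 CPU and 2/36 = 1/18 memory, so DS_A = q_A/18.
% B and C are CPU-dominant: DS_B = 3 q_B / 18 = q_B / 6, likewise DS_C = q_C / 6.
\begin{align*}
\text{Equalize dominant shares: } & \frac{q_A}{18} = \frac{q_B}{6} = \frac{q_C}{6} = s \\
\text{CPU: } & 18s + 18s + 18s \le 18 \;\Rightarrow\; s \le \tfrac{1}{3} \\
\text{Memory: } & 36s + 6s + 6s \le 36 \;\Rightarrow\; s \le \tfrac{3}{4} \\
\text{Hence } & s = \tfrac{1}{3}: \quad q_A = 6,\; q_B = 2,\; q_C = 2
\end{align*}
```

This matches the 18 cores and 16 GB used in each interval of the DRF schedule.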

Slide 37

Packing efficiency does not achieve everything

Machines 1, 2: [2 cores, 4 GB] each

Job: [task profile], # tasks
A: [2 cores, 3 GB], 6
B: [1 core, 2 GB], 2

Pack: run 2 A tasks in each of t, 2t, 3t (4 cores, 6 GB used), then B's 2 tasks at 4t (2 cores, 4 GB). Durations: A: 3t, B: 4t.

No pack: run B's 2 tasks at t (2 cores, 4 GB), then 2 A tasks in each of 2t, 3t, 4t (4 cores, 6 GB). Durations: A: 4t, B: t.

Makespan¹ is 4t either way, but "no pack" improves avg. job completion time by 29%: achieving packing efficiency does not necessarily improve job completion time.

¹ Time to finish a set of jobs

Slide 38