Jockey - PowerPoint Presentation



Presentation Transcript

Slide 1

Jockey: Guaranteed Job Latency in Data Parallel Clusters

Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca

Slides 2-6

Data parallel clusters

Predictability

Deadline

Slides 7-14

Variable latency

The same job's run time varies by up to 4.3x across executions.

Slide 15

Why does latency vary?

Pipeline complexity

Noisy execution environment

Slides 16-20

Cosmos: Microsoft's data parallel clusters

CosmosStore (storage)

Dryad (execution engine)

SCOPE (query language)

Slides 21-27

Dryad's DAG workflow

A Cosmos cluster runs pipelines of jobs, each with its own deadline. A job is a DAG of stages, and each stage is a set of parallel tasks.

Slides 29-32

Expressing performance targets

Priorities? Not expressive enough.

Weights? Difficult for users to set.

Utility curves? Capture deadline & penalty.

Slides 33-36

Our goal

Maximize utility while minimizing resources by dynamically adjusting the allocation.

Slides 37-40

Jockey

Large clusters

Many users

Prior execution

Slides 41-44

Jockey – Model

f(job state, allocation) -> remaining run time

Slides 45-47

Jockey – Control loop

Slides 48-49

Jockey – Model

f(progress, allocation) -> remaining run time
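Read as code, the model is just a function from job state and allocation to a time estimate. Here is a minimal sketch of that interface; the function name, parameters, and the trivial stand-in body are my assumptions, not the deck's (the deck later replaces the body with a simulator):

```python
# A minimal sketch of f(progress, allocation) -> remaining run time.
# Names and the perfectly-parallel stand-in body are assumptions of mine;
# Jockey's real f is a simulator (see slide 75).
def remaining_run_time(progress: float, allocation: int) -> float:
    """Predict remaining minutes given progress in [0, 1] and nodes."""
    total_work_minutes = 600.0                 # assumed total node-minutes
    work_left = total_work_minutes * (1.0 - progress)
    return work_left / allocation              # stand-in: perfect scaling
```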

Slides 50-63

Jockey – Progress indicator

For each stage, track the total running time plus the total queuing time of its tasks, and the fraction of tasks complete (# complete / total tasks). The job-wide indicator sums these per-stage terms (Stage 1 + Stage 2 + Stage 3).
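One plausible reading of the per-stage arithmetic above, as a sketch: each stage's completion fraction is weighted by its total running + queuing time, and the weighted terms are summed. The dataclass, field names, and the exact combination are my assumptions; the paper's precise formula may differ.

```python
# A sketch of the per-stage progress indicator from slides 50-57.
# Field names and the time-weighted combination are my assumptions.
from dataclasses import dataclass

@dataclass
class Stage:
    complete: int        # tasks finished in this stage
    total: int           # tasks in this stage
    running_secs: float  # total running time accrued by the stage's tasks
    queuing_secs: float  # total queuing time accrued by the stage's tasks

def job_progress(stages: list[Stage]) -> float:
    """Sum per-stage completion fractions, each weighted by the stage's
    total running + queuing time, so heavy stages dominate the indicator."""
    weights = [s.running_secs + s.queuing_secs for s in stages]
    if sum(weights) == 0:
        return 0.0
    done = sum(w * (s.complete / s.total) for w, s in zip(weights, stages))
    return done / sum(weights)
```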

Slides 64-71

Jockey – Control loop

The job model maps (progress, allocation) to a predicted remaining run time:

             10 nodes   20 nodes   30 nodes
1% complete  60 min     40 min     25 min
2% complete  59 min     39 min     24 min
3% complete  58 min     37 min     22 min
4% complete  56 min     36 min     21 min
5% complete  54 min     34 min     20 min

The deck steps through three readings of this table: a 50-minute deadline at 1% completion, a 40-minute deadline at 3% completion, and a 30-minute deadline at 5% completion. In each case the control loop looks up the row for the current progress and chooses an allocation whose prediction meets the deadline.
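The selection step the deck animates can be written directly against the table. A minimal sketch, with the table values taken from the slides and the function names mine:

```python
# Pick the cheapest allocation whose predicted remaining run time
# still meets the deadline, using the job-model table above.
JOB_MODEL = {  # progress (%) -> {nodes: predicted remaining minutes}
    1: {10: 60, 20: 40, 30: 25},
    2: {10: 59, 20: 39, 30: 24},
    3: {10: 58, 20: 37, 30: 22},
    4: {10: 56, 20: 36, 30: 21},
    5: {10: 54, 20: 34, 30: 20},
}

def pick_allocation(progress_pct: int, minutes_to_deadline: float) -> int:
    """Return the smallest node count predicted to finish in time;
    fall back to the largest allocation if none meets the deadline."""
    row = JOB_MODEL[progress_pct]
    for nodes in sorted(row):
        if row[nodes] <= minutes_to_deadline:
            return nodes
    return max(row)

# The deck's three examples:
assert pick_allocation(1, 50) == 20   # 40 min at 20 nodes beats 50
assert pick_allocation(3, 40) == 20   # 37 min at 20 nodes beats 40
assert pick_allocation(5, 30) == 30   # only 30 nodes (20 min) beats 30
```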

Slides 72-75

Jockey – Model

f(progress, allocation) -> remaining run time

An analytic model? Machine learning? Jockey uses a simulator.

Slides 76-80

Jockey

Problem               Solution
Pipeline complexity   Use a simulator
Noisy environment     Dynamic control

Slides 81-93

Jockey in Action

A real job on a production cluster with ~80% CPU load.

Initial deadline: 140 minutes. New deadline: 70 minutes. Jockey releases resources due to excess pessimism.

"Oracle" allocation: total allocation-hours / deadline, i.e., the constant allocation that would finish exactly at the deadline. Early in the run, available parallelism is less than the allocation; otherwise the allocation stays above the oracle.
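As a worked example of the oracle, with numbers of my own choosing rather than the deck's: a job whose tasks consume 100 allocation-hours in total, facing a 2-hour deadline, has an oracle allocation of 100 / 2 = 50 nodes, held constant from start to deadline. The slides compare Jockey's actual allocation against this flat line.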

Slides 94-108

Evaluation

Setup: 21 jobs on a production cluster. Questions: was the SLO met, and what was the impact on the cluster?

Results, from the "Jobs which met the SLO" chart: Jockey missed only 1 of 94 deadlines, at the cost of sometimes allocating too many resources (a 1.4x marker on the chart). The simulator made good predictions: 80% of jobs finished before the deadline. The control loop is stable and successful.

Slides 109-114

Evaluation (figures)

Slides 115-119

Conclusion

Data parallel jobs are complex, yet users demand deadlines. Jobs run in shared, noisy clusters, making simple models inaccurate.

Slides 120-124

Recap figures: Jockey, its simulator, and its control loop keep pipeline jobs within their deadlines.

Slides 125-126

Questions?

Andrew Ferguson, adf@cs.brown.edu

Co-authors: Peter Bodík (Microsoft Research), Srikanth Kandula (Microsoft Research), Eric Boutin (Microsoft), and Rodrigo Fonseca (Brown)

Slide 127

Backup Slides

Slide 128

Utility Curves

The curve encodes the deadline. For single jobs, the scale doesn't matter; for multiple jobs, use financial penalties.
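A minimal sketch of such a curve, assuming a step at the deadline followed by a linearly growing financial penalty; the constants and shape past the deadline are mine:

```python
# A deadline utility curve: full reward up to the deadline, then a
# linearly growing financial penalty. Constants are illustrative only.
def utility(finish_min: float, deadline_min: float,
            reward: float = 1.0, penalty_per_min: float = 0.1) -> float:
    if finish_min <= deadline_min:
        return reward
    return reward - penalty_per_min * (finish_min - deadline_min)
```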

Slide 129

Jockey – Resource allocation control loop

Each iteration turns a prediction into a run-time estimate and then a utility value. Three control-theory techniques keep the loop stable:

1. Slack
2. Hysteresis
3. Dead zone
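The deck only names the three techniques; here is a minimal sketch of how they might wrap the raw allocation the model suggests. The thresholds and exact formulation are my assumptions:

```python
# Slack, hysteresis, and dead zone applied to a raw allocation target.
# All constants and the precise formulation are my assumptions.
def stabilized_allocation(raw_target: int, current: int,
                          slack: float = 1.2,
                          dead_zone: float = 0.05,
                          shrink_rate: float = 0.25) -> int:
    target = int(raw_target * slack)      # 1. slack: over-provision a little
    if abs(target - current) <= dead_zone * max(current, 1):
        return current                    # 3. dead zone: ignore tiny changes
    if target > current:
        return target                     # scale up immediately when behind
    # 2. hysteresis: release resources gradually rather than all at once
    return current - int((current - target) * shrink_rate)
```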

Slide 130

Resource sharing in Cosmos

Resources are allocated with a form of fair sharing across business groups and their jobs (like the Hadoop FairScheduler or CapacityScheduler). Each job is guaranteed a number of tokens as dictated by cluster policy; each running or initializing task uses one token, and the token is released on task completion. A token is a guaranteed share of CPU and memory. To increase efficiency, unused tokens are re-allocated to jobs with available work.
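A minimal sketch of the token accounting described above; the class and method names are mine:

```python
# Token accounting in the spirit of slide 130. One token per running or
# initializing task; spare guaranteed tokens can be lent to other jobs.
class JobTokens:
    def __init__(self, guaranteed: int):
        self.guaranteed = guaranteed  # tokens promised by cluster policy
        self.in_use = 0               # tokens held by running/init tasks

    def start_task(self) -> None:
        self.in_use += 1              # task acquires one token

    def finish_task(self) -> None:
        self.in_use -= 1              # token released on completion

    def spare(self) -> int:
        """Guaranteed tokens not backing a task, available for
        re-allocation to jobs that still have runnable work."""
        return max(0, self.guaranteed - self.in_use)
```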

Slide 131

Jockey – Progress indicator

Many features of the job can be used to build a progress indicator. Earlier work (ParaTimer) concentrated on the fraction of tasks completed. Our indicator is very simple, but we found it performs best for Jockey's needs. It combines:

Total vertex initialization time
Total vertex run time
Fraction of completed vertices

Slide 132

Comparison with ARIA

ARIA uses analytic models designed for three stages (Map, Shuffle, Reduce), and was tested on a small (66-node) cluster without a network bottleneck. Jockey's control loop is robust thanks to its control-theory improvements, and we believe Jockey is a better match for production DAG frameworks such as Hive, Pig, etc.

Slide 133

Jockey – Latency prediction: C(p, a)

Event-based simulator: uses the same scheduling logic as the actual Job Manager and captures the important features of job progress, but does not model input size variation or speculative re-execution of stragglers. Inputs: the job algebra, distributions of task timings, probabilities of failures, and the allocation.

Analytic model: inspired by Amdahl's Law, T = S + P/N, where S is the remaining work on the critical path, P is all remaining work, and N is the number of machines.
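The analytic alternative is small enough to transcribe directly; variable names follow the slide:

```python
# T = S + P/N, per the slide: S is remaining critical-path work,
# P is all remaining work, N is the number of machines.
def analytic_remaining_time(S: float, P: float, N: int) -> float:
    return S + P / N
```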

Slide 134

Jockey – Resource allocation control loop

Executes in Dryad's Job Manager. Inputs: the fraction of completed tasks in each stage, the time the job has spent running, the utility function, and precomputed values (for speedup). Output: the number of tokens to allocate. Improved with techniques from control theory.

Slides 135-140

Jockey architecture

Offline: the simulator and the job profile built from prior runs.

During job runtime: the running job emits job stats; the simulator combines the job profile with those stats to produce latency predictions; and the resource allocation control loop uses the latency predictions and the utility function to set the running job's allocation.