Jockey: Guaranteed Job Latency in Data Parallel Clusters
Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca
Slide 1: Title
Jockey: Guaranteed Job Latency in Data Parallel Clusters
Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca
Slides 2–6: Data parallel clusters
Users want predictability: jobs have deadlines.
Slides 7–14: Variable latency
The same job's latency varies across runs, by as much as 4.3x.
Slide 15: Why does latency vary?
- Pipeline complexity
- Noisy execution environment
Slides 16–20: Cosmos, Microsoft's data parallel clusters
The stack is built up in layers: CosmosStore (storage), Dryad (execution), SCOPE (query language).
Slides 21–27: Dryad's DAG workflow
Jobs run on the Cosmos cluster. A pipeline is composed of jobs, and each job must meet its deadline. A job is a DAG of stages; each stage consists of parallel tasks.
Slides 29–32: Expressing performance targets
- Priorities? Not expressive enough.
- Weights? Difficult for users to set.
- Utility curves? Capture deadline & penalty.
Slides 33–36: Our goal
Maximize utility while minimizing resources, by dynamically adjusting the allocation.
Slides 37–40: Jockey
Designed for large clusters with many users, exploiting data from prior executions.
Slides 41–44: Jockey – model
f(job state, allocation) -> remaining run time
Slides 45–47: Jockey – control loop (overview diagram)
Slides 48–49: Jockey – model, refined
f(job state, allocation) -> remaining run time
becomes
f(progress, allocation) -> remaining run time
Slides 50–57: Jockey – progress indicator
For each stage, two per-stage terms are tracked:
- total running time + total queuing time
- # complete tasks / total tasks
The per-stage values (Stage 1, Stage 2, Stage 3, ...) are summed to produce a single job-wide progress indicator.
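The per-stage combination on slides 56–57 can be sketched as follows. This is one plausible reading for illustration only: the `Stage` fields and the exact weighting (time terms scaled by the completed fraction) are assumptions, not the paper's precise formula.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    # Hypothetical per-stage statistics; field names are illustrative,
    # not the actual Jockey/Dryad interface.
    total_running: float   # accumulated task running time (seconds)
    total_queuing: float   # accumulated task queuing time (seconds)
    complete: int          # tasks finished
    total: int             # tasks in the stage

def progress(stages):
    """Combine per-stage terms into one job-wide progress number.

    Each stage contributes its (running + queuing) time weighted by
    the fraction of its tasks that have completed; the per-stage
    values are then summed across stages.
    """
    return sum(
        (s.total_running + s.total_queuing) * (s.complete / s.total)
        for s in stages if s.total > 0
    )

stages = [Stage(120.0, 30.0, 10, 10),   # Stage 1: done
          Stage(80.0, 40.0, 5, 10),     # Stage 2: halfway
          Stage(0.0, 0.0, 0, 10)]       # Stage 3: not started
print(progress(stages))  # 150.0 + 60.0 + 0.0 = 210.0
```

A single scalar like this is what lets the control loop index a precomputed model by "how far along" the job is.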
Slides 58–63: Jockey – progress indicator (charts)
Slides 64–71: Jockey – control loop
The job model is a lookup table: given the job's current completion and a candidate allocation, it predicts the remaining run time.

Job model (predicted remaining run time):

              10 nodes    20 nodes    30 nodes
1% complete   60 minutes  40 minutes  25 minutes
2% complete   59 minutes  39 minutes  24 minutes
3% complete   58 minutes  37 minutes  22 minutes
4% complete   56 minutes  36 minutes  21 minutes
5% complete   54 minutes  34 minutes  20 minutes

The control loop repeatedly consults this table as the job runs:
- Deadline: 50 min, completion: 1% — 20 nodes suffice (40 minutes).
- Deadline: 40 min, completion: 3% — 20 nodes still suffice (37 minutes).
- Deadline: 30 min, completion: 5% — must grow to 30 nodes (20 minutes).
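The table-driven allocation step on slides 67–71 can be sketched directly. The dictionary layout and function name are illustrative, not Jockey's actual interface; the numbers are the ones from the slides.

```python
# model[completion_percent][nodes] -> predicted remaining minutes
MODEL = {
    1: {10: 60, 20: 40, 30: 25},
    2: {10: 59, 20: 39, 30: 24},
    3: {10: 58, 20: 37, 30: 22},
    4: {10: 56, 20: 36, 30: 21},
    5: {10: 54, 20: 34, 30: 20},
}

def pick_allocation(completion, minutes_to_deadline):
    """Return the smallest allocation predicted to meet the deadline,
    or the largest available one if none is predicted to meet it."""
    row = MODEL[completion]
    for nodes in sorted(row):
        if row[nodes] <= minutes_to_deadline:
            return nodes
    return max(row)

print(pick_allocation(1, 50))  # 20 nodes: 40 min <= 50 min deadline
print(pick_allocation(3, 40))  # 20 nodes: 37 min <= 40 min
print(pick_allocation(5, 30))  # 30 nodes: only 20 min <= 30 min
```

Choosing the smallest sufficient allocation is what lets Jockey meet the deadline while minimizing the resources it holds.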
Slides 72–75: Jockey – model
How to implement f(progress, allocation) -> remaining run time?
- Analytic model?
- Machine learning?
- Simulator (Jockey's choice)
Slides 76–80: Jockey
Problem              Solution
Pipeline complexity  Use a simulator
Noisy environment    Dynamic control
Slides 81–93: Jockey in action
A real job on a production cluster (CPU load: ~80%).
- Initial deadline: 140 minutes; then a new deadline of 70 minutes.
- Jockey later releases resources due to excess pessimism.
- "Oracle" allocation: total allocation-hours divided by the deadline. At times the available parallelism is less than the allocation; at times the allocation is above the oracle.
Slides 94–108: Evaluation
Setup: production cluster, 21 jobs. Questions: was the SLO met? What was the cluster impact?
Results (CDF of jobs which met the SLO):
- Missed only 1 of 94 deadlines.
- 1.4x: the gap to runs that allocated too many resources.
- The simulator made good predictions: 80% finish before the deadline.
- The control loop is stable and successful.
Slides 109–114: Evaluation (charts)
Slides 115–123: Conclusion
Data parallel jobs are complex, yet users demand deadlines. Jobs run in shared, noisy clusters, making simple models inaccurate. Jockey combines a simulator with a control loop to meet deadlines.
Slides 125–126: Questions?
Andrew Ferguson, adf@cs.brown.edu
Co-authors: Peter Bodík (Microsoft Research), Srikanth Kandula (Microsoft Research), Eric Boutin (Microsoft), Rodrigo Fonseca (Brown)
Slide 127: Backup slides
Slide 128: Utility curves
- A deadline is expressed as a step in the utility curve.
- For single jobs, the curve's scale doesn't matter.
- For multiple jobs, use financial penalties to make utilities comparable.
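A deadline-with-penalty utility curve as described on slide 128 might look like the following sketch. The shape (linear decay past the deadline) and all numbers are illustrative assumptions, not values from the talk.

```python
def make_utility(deadline_min, reward, penalty_per_min):
    """Full reward at or before the deadline; afterwards the utility
    decays linearly, eventually becoming a financial penalty."""
    def utility(finish_min):
        if finish_min <= deadline_min:
            return reward
        return reward - penalty_per_min * (finish_min - deadline_min)
    return utility

u = make_utility(deadline_min=60, reward=100.0, penalty_per_min=5.0)
print(u(50))   # 100.0  (met the deadline)
print(u(70))   # 50.0   (10 minutes late)
print(u(90))   # -50.0  (penalty dominates)
```

With money as the common unit, utilities of different jobs sharing the cluster become directly comparable.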
Slide 129: Jockey – resource allocation control loop
Control-theory refinements: 1. Slack  2. Hysteresis  3. Dead zone
Loop: prediction -> run time -> utility
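Slide 129 lists slack, hysteresis, and a dead zone as Jockey's control-theory refinements. A minimal sketch of two such damping ideas, a dead zone plus a capped per-step change, is below; the thresholds, names, and structure are assumptions for illustration, and Jockey's actual mechanisms are not reproduced here.

```python
def next_allocation(current, desired, dead_zone=2, step_limit=5):
    """Ignore small corrections (dead zone) and cap the per-step
    change (rate limiting) so the loop does not oscillate when the
    model's predictions jitter."""
    delta = desired - current
    if abs(delta) <= dead_zone:        # inside the dead zone: hold steady
        return current
    if delta > step_limit:             # grow gradually
        delta = step_limit
    elif delta < -step_limit:          # shrink gradually
        delta = -step_limit
    return current + delta

print(next_allocation(20, 21))  # 20: within the dead zone, no change
print(next_allocation(20, 40))  # 25: growth capped at step_limit
print(next_allocation(20, 12))  # 15: shrink capped at step_limit
```

Damping like this trades some responsiveness for stability, which matters when the allocation feeds back into the measurements the model consumes.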
Slide 130: Resource sharing in Cosmos
- Resources are allocated with a form of fair sharing across business groups and their jobs (like the Hadoop Fair Scheduler or Capacity Scheduler).
- Each job is guaranteed a number of tokens as dictated by cluster policy; each running or initializing task uses one token, which is released on task completion.
- A token is a guaranteed share of CPU and memory.
- To increase efficiency, unused tokens are re-allocated to jobs with available work.
Slide 131: Jockey – progress indicator
- Many features of the job could be used to build a progress indicator; earlier work (ParaTimer) concentrated on the fraction of tasks completed.
- Our indicator is very simple, but we found it performs best for Jockey's needs. It combines:
  - total vertex initialization time
  - total vertex run time
  - fraction of completed vertices
Slide 132: Comparison with ARIA
- ARIA uses analytic models, designed for 3 stages: Map, Shuffle, Reduce.
- Jockey's control loop is robust due to control-theory improvements.
- ARIA was tested on a small (66-node) cluster without a network bottleneck.
- We believe Jockey is a better match for production DAG frameworks such as Hive, Pig, etc.
Slide 133: Jockey – latency prediction C(p, a)
Event-based simulator:
- Same scheduling logic as the actual Job Manager.
- Captures important features of job progress; does not model input size variation or speculative re-execution of stragglers.
- Inputs: job algebra, distributions of task timings, probabilities of failures, allocation.
Analytic model:
- Inspired by Amdahl's Law: T = S + P/N, where S is the remaining work on the critical path, P is all remaining work, and N is the number of machines.
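The Amdahl-inspired model on slide 133 is simple enough to state directly. The function name and the example numbers are illustrative; the formula T = S + P/N is the one from the slide.

```python
def remaining_time(critical_path_work, total_work, machines):
    """Estimate remaining run time as T = S + P/N: the critical path
    (S) cannot be parallelized, while the remaining work (P) divides
    across N machines."""
    assert machines > 0 and total_work >= 0 and critical_path_work >= 0
    return critical_path_work + total_work / machines

# 10 minutes of serial critical path, 600 machine-minutes of work:
print(remaining_time(10.0, 600.0, 20))   # 40.0 minutes
print(remaining_time(10.0, 600.0, 60))   # 20.0 minutes
```

Note the diminishing returns: tripling the machines from 20 to 60 only halves the estimate, because the critical-path term S is unaffected by N.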
Slide 134: Jockey – resource allocation control loop
- Executes in Dryad's Job Manager.
- Inputs: fraction of completed tasks in each stage, time the job has spent running, utility function, precomputed values (for speedup).
- Output: number of tokens to allocate.
- Improved with techniques from control theory.
Slides 135–140: Jockey architecture
Offline: the simulator consumes a job profile to precompute latency predictions.
During job runtime: the running job reports job stats; combining these with the latency predictions and the utility function, the resource allocation control loop sets the allocation.