Slide 1
Reining in the Outliers in MapReduce Jobs using Mantri
Ganesh Ananthanarayanan†, Srikanth Kandula*, Albert Greenberg*, Ion Stoica†, Yi Lu*, Bikas Saha*, Ed Harris*
† UC Berkeley   * Microsoft
Slide 2
MapReduce Jobs
- Basis of analytics in modern Internet services, e.g., Dryad, Hadoop
- Job → {Phase} → {Task}
- Graph flow consists of pipelines as well as strict blocks

Slide 3
Example Dryad Job Graph
[Figure: example Dryad job graph. Data flows from the distributed file system through EXTRACT, AGGREGATE_PARTITION, FULL_AGGREGATE, PROCESS, and COMBINE phases and back to the distributed file system. Some phase boundaries are pipelined; others are blocked until their input is done. Phases map to Map.1, Reduce.1, Map.2, Reduce.2, and a Join.]

Slide 4
Log Analysis from Production
Logs from a production cluster with thousands of machines, sampled over six months:
- 10,000+ jobs, 80 PB of data, 4 PB of network transfers
- Task-level details
- Production and experimental jobs

Slide 5
Outliers hurt!
Outliers are tasks that run longer than the rest of the phase.

- The median phase has 10% outliers, each running >10x longer
- Outliers slow jobs down by 35% at the median
- Operational inefficiency: unpredictability in completion times affects SLAs, hurts development productivity, and wastes compute cycles

Slide 6
Why do outliers occur?
Mantri: a system that mitigates outliers based on root-cause analysis.

[Figure: a task's lifetime (read input, then execute) annotated with causes of outliers: input unavailable, network congestion, local contention, workload imbalance.]

Slide 7
Mantri’s Outlier Mitigation
- Avoid recomputation
- Network-aware task placement
- Duplicate outliers
- Cognizant of workload imbalance

Slide 8
Recomputes: Illustration
[Figure: ideal vs. actual completion timelines, showing the inflation that recompute tasks add, for (a) barrier phases and (b) cascading recomputes.]

Slide 9
What causes recomputes? [1]
Faulty machines: bad disks and non-persistent hardware quirks (4%).
The set of faulty machines varies with time; it is not constant.

Slide 10
What causes recomputes? [2]
Transient machine load:
- Recomputes correlate with machine load
- Requests for data access get dropped

Slide 11
Replicate costly outputs

Consider tasks Task1 → Task2 → Task3, where MR_i is the recompute probability of the machine running Task_i and T_i is its runtime. If Task3's output is lost, either only Task3 is recomputed, or both Task3 and Task2:

  T_Recomp = MR3 · (1 − MR2) · T3  +  (MR3 · MR2) · (T3 + T2)

Replicate the output when T_Rep < T_Recomp.
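The replication decision above can be sketched in a few lines (illustrative code, not Mantri's implementation; names are made up):

```python
def should_replicate(mr2, mr3, t2, t3, t_rep):
    """Replicate Task3's output iff the expected recompute cost
    exceeds the cost of replicating it.

    mr2, mr3: recompute probabilities of the machines running Task2, Task3
    t2, t3:   runtimes of Task2 and Task3
    t_rep:    time needed to replicate Task3's output
    """
    # Only Task3 must be redone if Task2's output is still available;
    # both must be redone if Task2's machine also recomputes.
    t_recomp = mr3 * (1 - mr2) * t3 + (mr3 * mr2) * (t3 + t2)
    return t_rep < t_recomp
```

For example, with MR2 = MR3 = 0.5 and T2 = T3 = 10, the expected recompute cost is 7.5, so replication pays off whenever it costs less than that.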
Slide 12
Transient Failure Causes
- Recomputes manifest in clutches: a machine keeps causing recomputes until the problem is fixed (load abates, a critical process restarts, etc.)
- Clue: at least r recomputes within a time window t on a machine
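A minimal sketch of this clue as a sliding-window detector (class and parameter names are illustrative, not from Mantri):

```python
from collections import defaultdict, deque

class RecomputeDetector:
    """Flag a machine as recompute-prone once it causes at least
    `r` recomputes within a `window`-second interval."""

    def __init__(self, r=3, window=600.0):
        self.r = r
        self.window = window
        self.events = defaultdict(deque)  # machine -> recompute timestamps

    def record(self, machine, now):
        """Record a recompute on `machine` at time `now`; return True
        if the machine should be treated as faulty for now."""
        q = self.events[machine]
        q.append(now)
        # Drop events that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) >= self.r
```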
Slide 13
Speculative Recomputes
Anticipatorily recompute tasks whose outputs are still unread.

[Figure: when a task's input read fails, speculative recomputes regenerate the unread data.]

Slide 14
Mantri’s Outlier Mitigation
- Avoid recomputation: preferential replication + speculative recompute
- Network-aware task placement
- Duplicate outliers
- Cognizant of workload imbalance

Slide 15
Reduce Tasks
Reduce tasks access the outputs of tasks from previous phases; the reduce phase accounts for 74% of total traffic.

[Figure: reduce tasks pull map outputs over the network from the distributed file system and other racks; a reduce task behind a congested link becomes an outlier.]

Slide 16
Variable Congestion
[Figure: reduce tasks pulling map outputs across racks see variable cross-rack congestion.]

Smart placement smoothens hotspots.

Slide 17
Traffic-based Allotment
For every rack i:
- d_i: data on the rack
- u_i: available uplink bandwidth
- v_i: available downlink bandwidth

Goal: minimize phase completion time. Solve for the task allocation fractions a_i.

Slide 18
Local Control is a good approx.
Let rack i hold d_i of the phase's D total data and receive an a_i fraction of its tasks, with available uplink bandwidth u_i and downlink bandwidth v_i:

  Time uploading:    T_u = d_i · (1 − a_i) / u_i
  Time downloading:  T_d = (D − d_i) · a_i / v_i
  Time_i = max{ T_u, T_d }

Goal: minimize the phase completion time, max_i Time_i.
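One way to compute the a_i (a sketch, not Mantri's implementation) is a binary search on the deadline T: rack i can meet T iff 1 − T·u_i/d_i ≤ a_i ≤ T·v_i/(D − d_i), and T is feasible iff those intervals admit fractions summing to 1:

```python
def allot_fractions(d, u, v, iters=60):
    """Split tasks across racks to minimize max_i Time_i, where
    Time_i = max(d[i]*(1-a[i])/u[i], (D-d[i])*a[i]/v[i])."""
    D = float(sum(d))

    def bounds(T):
        # Per-rack interval [lo, hi] for a[i] so that rack i meets deadline T.
        los, his = [], []
        for di, ui, vi in zip(d, u, v):
            lo = max(0.0, 1.0 - T * ui / di) if di > 0 else 0.0  # upload bound
            hi = min(1.0, T * vi / (D - di)) if D > di else 1.0  # download bound
            if lo > hi:
                return None
            los.append(lo)
            his.append(hi)
        if sum(los) <= 1.0 <= sum(his):
            return los, his
        return None

    lo_T, hi_T = 0.0, D * (1.0 / min(u) + 1.0 / min(v))  # hi_T always feasible
    for _ in range(iters):
        mid = (lo_T + hi_T) / 2.0
        if bounds(mid) is not None:
            hi_T = mid
        else:
            lo_T = mid

    los, his = bounds(hi_T)
    a = los[:]                   # start at the lower bounds...
    slack = 1.0 - sum(a)
    for i in range(len(a)):      # ...and hand out the remaining fraction
        give = min(his[i] - a[i], slack)
        a[i] += give
        slack -= give
    return a
```

For instance, if one rack already holds all the map output, the search pushes all tasks onto that rack (no network transfer); with data split evenly across identical racks, it returns an even split.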
Link utilizations average out over the long term and are steady over the short term.

Slide 19
Mantri’s Outlier Mitigation
- Avoid recomputation: preferential replication + speculative recompute
- Network-aware task placement: traffic on each link proportional to its bandwidth
- Duplicate outliers
- Cognizant of workload imbalance

Slide 20
Contentions cause outliers
- Tasks contend for local resources: processor, memory, etc.
- Remedy: duplicate tasks elsewhere in the cluster
- Current schemes duplicate only toward the end of the phase (e.g., LATE [OSDI 2008])
- Duplicate the outlier, or schedule a pending task instead?

Slide 21
Resource-Aware Restart
[Figure: a running task with remaining time t_rem at time now, versus a potential restart that would take t_new.]

With c copies of a task running, schedule a duplicate only when it saves both time and resources:

  P( t_new < t_rem · c / (c + 1) )  is high

i.e., c + 1 copies running for t_new cost less than c copies running for t_rem.
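A sketch of this duplicate-scheduling test, estimating the probability from sampled predictions of t_new (function and parameter names are illustrative, not Mantri's):

```python
def should_restart(t_new_samples, t_rem, c, delta=0.25):
    """Start one more copy of a task only if it likely saves both time
    and resources: with c copies running, a duplicate pays off when
    (c + 1) * t_new < c * t_rem, i.e., t_new < t_rem * c / (c + 1).

    t_new_samples: predicted runtimes of a fresh copy (e.g., drawn from
                   runtimes of this phase's completed tasks)
    t_rem:         estimated remaining time of the current copies
    delta:         required probability threshold
    """
    wins = sum(1 for t_new in t_new_samples
               if (c + 1) * t_new < c * t_rem)
    return wins / len(t_new_samples) > delta
```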
Continuously observe running copies and kill wasteful ones.

Slide 22
Mantri’s Outlier Mitigation
- Avoid recomputation: preferential replication + speculative recompute
- Network-aware task placement: traffic on each link proportional to its bandwidth
- Duplicate outliers: resource-aware restart
- Cognizant of workload imbalance

Slide 23
Workload Imbalance
- A quarter of the outlier tasks simply have more data to process (unequal key partitions for reduce tasks); ignoring these is better than duplicating them
- Schedule tasks in descending order of data to process: time ∝ data to process
- [Graham ’69] At worst, 33% worse than optimal
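The descending-order rule is Graham's Longest Processing Time (LPT) heuristic; a minimal sketch (the slot layout is illustrative):

```python
import heapq

def lpt_schedule(sizes, num_slots):
    """Assign tasks (sized by data to process) in descending order,
    always to the least-loaded slot. Graham (1969): the resulting
    makespan is at most (4/3 - 1/(3m)) x optimal, i.e., never more
    than ~33% worse."""
    heap = [(0.0, slot) for slot in range(num_slots)]  # (load, slot id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_slots)]
    for size in sorted(sizes, reverse=True):
        load, slot = heapq.heappop(heap)
        assignment[slot].append(size)
        heapq.heappush(heap, (load + size, slot))
    return assignment
```

For sizes [5, 4, 3, 3, 3] on two slots, LPT yields loads 8 and 10 against an optimal makespan of 9, within the Graham bound.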
Slide 24
Mantri’s Outlier Mitigation

- Avoid recomputation: preferential replication + speculative recompute
- Network-aware task placement: traffic on each link proportional to its bandwidth
- Duplicate outliers: resource-aware restart
- Cognizant of workload imbalance: schedule in descending order of size

Proactive where possible (predict to act early), reactive otherwise; be resource-aware; act based on the cause.

Slide 25
Results
- Deployed in production Bing clusters
- Trace-driven simulations that mimic workflow, failures, and data skew
- Compared against existing and idealized schemes

Slide 26
Jobs in the Wild
- Act early: duplicates issued when tasks are 42% done (vs. 77% for Dryad)
- Light: issues fewer copies (0.47x as many as Dryad)
- Accurate: 2.8x higher success rate for copies

Jobs are faster by 32% at the median, while consuming fewer resources.

Slide 27
Recomputation Avoidance
- Eliminates most recomputes with minimal extra resources
- Replication and speculation work well in tandem

Slide 28
Network-Aware Placement
- Mantri well-approximates the ideal placement, despite relying on bandwidth approximations

Slide 29
Summary
From measurements in a production cluster:
- Outliers are a significant problem
- They arise from an interplay between storage, network, and map-reduce computation

Mantri is a cause-aware, resource-aware mitigation; deployment shows encouraging results.

“Reining in the Outliers in MapReduce Clusters using Mantri”, USENIX OSDI 2010