Slide1
DryadInc: Reusing work in large-scale computations
Lucian Popa*+, Mihai Budiu+, Yuan Yu+, Michael Isard+
+ Microsoft Research Silicon Valley
* UC Berkeley
Slide2
Problem Statement
Goal: Reuse (part of) prior computations to:
  Speed up the current job
  Increase cluster throughput
  Reduce energy and costs
[Figure: Inputs (append-only data), Distributed Computation, Outputs]
Slide3
Propose Two Approaches
1. IDE: Reuse IDEntical computations from the past (like make or memoization)
2. MER: Do only incremental computation on the new data and MERge results with the previous ones (like patch)
Slide4
Context
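Implemented for Dryad
Dryad Job = Computational DAG
  Vertex: arbitrary computation + inputs/outputs
  Edge: data flows
Simple Example: Record Count
[Figure: Record Count DAG. Inputs (partitions) I1 and I2 each feed a Count vertex (C); an Add vertex (A) combines the counts into Outputs.]
Slide5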
IDE – IDEntical Computation
Record Count: first execution DAG
[Figure: Inputs (partitions) I1 and I2 each feed a Count vertex (C); an Add vertex (A) produces Outputs.]
Slide6
IDE – IDEntical Computation
Record Count: second execution DAG
[Figure: Inputs (partitions) I1, I2 and the New Input I3 each feed a Count vertex (C); an Add vertex (A) produces Outputs.]
Slide7
IDE – IDEntical Computation
Record Count: second execution DAG
[Figure: Inputs (partitions) I1, I2, I3 each feed a Count vertex (C); an Add vertex (A) produces Outputs. Part of the DAG is marked "Identical subDAG".]
Slide8
IDE – IDEntical Computation
IDE Modified DAG
Replace identical computational subDAG with edge data cached from previous execution
[Figure: IDE modified DAG. Input (partition) I3 feeds a Count vertex (C); the Add vertex (A) produces Outputs; the rest is labeled "Replaced with Cached Data".]
Slide9
IDE – IDEntical Computation
IDE Modified DAG
Replace identical computational subDAG with edge data cached from previous execution
Use DAG fingerprints to determine if computations are identical
[Figure: IDE modified DAG. Input (partition) I3 feeds a Count vertex (C); the Add vertex (A) produces Outputs.]
Slide10
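The IDE steps described on the preceding slides (fingerprint the DAG, then replace identical subDAGs with cached edge data) can be illustrated with a minimal Python sketch of the Record Count example. Everything here is an assumption made for illustration: the dictionary-based DAG format, the names `OPS`, `fingerprint` and `run`, and the use of SHA-256 are hypothetical and are not Dryad's API or the fingerprinting scheme DryadInc actually uses.

```python
import hashlib

# Hypothetical toy DAG format (not Dryad's API):
#   vertex name -> (operation name, [upstream vertex or partition names])
OPS = {
    "count": lambda args: len(args[0]),  # number of records in one partition
    "add": lambda args: sum(args),       # sum of partial counts
}

def fingerprint(node, dag, partitions, memo):
    """Hash of the computation rooted at `node`: the operation name plus the
    fingerprints of its inputs, bottoming out in hashes of partition data.
    The operation name stands in for the identity of the vertex code."""
    if node in memo:
        return memo[node]
    h = hashlib.sha256()
    if node in partitions:  # leaf: an input partition
        h.update(b"data:")
        for record in partitions[node]:
            h.update(repr(record).encode())
    else:
        op, upstream = dag[node]
        h.update(("op:" + op).encode())
        for u in upstream:
            h.update(fingerprint(u, dag, partitions, memo).encode())
    memo[node] = h.hexdigest()
    return memo[node]

def run(node, dag, partitions, cache, stats, memo=None):
    """Evaluate `node`, reusing cached outputs of identical subDAGs."""
    memo = {} if memo is None else memo
    if node in partitions:
        return partitions[node]
    key = fingerprint(node, dag, partitions, memo)
    if key in cache:  # identical computation on identical data
        stats["reused"].append(node)
        return cache[key]
    op, upstream = dag[node]
    args = [run(u, dag, partitions, cache, stats, memo) for u in upstream]
    result = OPS[op](args)
    stats["executed"].append(node)
    cache[key] = result  # store <hash, outputs> for later runs
    return result

# First execution: two partitions.
cache = {}
partitions = {"I1": ["a", "b"], "I2": ["c"]}
dag1 = {"C1": ("count", ["I1"]), "C2": ("count", ["I2"]),
        "A": ("add", ["C1", "C2"])}
stats1 = {"reused": [], "executed": []}
first = run("A", dag1, partitions, cache, stats1)

# Second execution: partition I3 has been appended.
partitions["I3"] = ["d", "e", "f"]
dag2 = {**dag1, "C3": ("count", ["I3"]), "A": ("add", ["C1", "C2", "C3"])}
stats2 = {"reused": [], "executed": []}
second = run("A", dag2, partitions, cache, stats2)
```

In this toy run the second execution reuses the two Count vertices over I1 and I2 and executes only the new Count vertex and the Add vertex. The sketch still reads the old partitions in order to hash them; a real system would presumably keep data fingerprints rather than recompute them.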
Semantic Knowledge Can Help
[Figure: Record Count DAG over I1 and I2 (Count vertices C, Add vertex A); label: Reuse Output]
Slide11
Semantic Knowledge Can Help
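[Figure: Record Count DAG over I1 and I2 (Count vertices C, Add vertex A), labeled Previous Output; a Count vertex (C) over I3, labeled Incremental DAG; both feed a Merge (Add) vertex (A).]
Slide12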
MER – MERgeable Computation
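[Figure: Record Count DAG over I1 and I2 (Count vertices C, Add vertex A); a Count vertex (C) over I3; a Merge (Add) vertex (A). Labels: Automatically Inferred, Automatically Built, User-specified.]
Slide13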
MER – MERgeable Computation
Incremental DAG – Remove Old Inputs
[Figure: Record Count DAGs over I1, I2 and I3 (Count vertices C, Add vertices A). Labels: Empty, Save to Cache, Merge Vertex.]
Slide14
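The MER idea from the preceding slides (run the computation only on the new inputs, then merge with the previous output) can be sketched for Record Count as follows. This is a minimal illustration under stated assumptions: the function names `count_records`, `merge_add` and `run_mer`, and the dictionary used as a cache, are hypothetical and do not reflect how DryadInc builds its incremental DAG.

```python
def count_records(partitions):
    """The original computation: count records over the given partitions."""
    return sum(len(p) for p in partitions)

def merge_add(previous_output, incremental_output):
    """User-specified merge function for Record Count: counts simply add."""
    return previous_output + incremental_output

def run_mer(all_partitions, cache):
    """Compute only over partitions not seen before, then merge with the
    cached previous output. Assumes append-only data, so the old inputs
    are always a prefix of `all_partitions`."""
    seen = cache.get("num_partitions", 0)
    new_partitions = all_partitions[seen:]       # old inputs are removed
    incremental = count_records(new_partitions)  # incremental computation
    result = merge_add(cache.get("output", 0), incremental)
    cache["num_partitions"] = len(all_partitions)
    cache["output"] = result                     # save to cache for next run
    return result

cache = {}
data = [["a", "b"], ["c"]]
r1 = run_mer(data, cache)     # first run: everything is new

data.append(["d", "e", "f"])  # new partition appended
r2 = run_mer(data, cache)     # second run: only the new partition is counted
```

The merged result matches a full recomputation over all partitions, which is the property a merge function has to provide for this approach to be correct.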
IDE in practice
[Figure: 6 input DAG; IDE 6 (4+2) input DAG]
Slide15
MER in practice
[Figure: 9 input DAG; MER 9 (5+4) input DAG, with two vertices labeled Empty]
Slide16
Evaluation – Running time
Word Histogram Application – 8 nodes
[Figure: running time chart; series: No Cache, IDE, MER]
Slide17
Discussion
MapReduce: just a particular case
  IDE reuses the output of Mappers
  MER requires combined Reduce function
Combine IDE with MER: benefits don't add up
  IDE can be used for the incremental DAG in MER
More semantic knowledge: further opportunities
  Generate merge function automatically
  Improve incremental DAG
Slide18
Conclusions & Questions
Problem: reuse work in distributed computations on append-only data
Two methods:
  IDE – reuse IDEntical past computations
    No user effort
  MER – MERge past results with new ones
    Small user effort, potentially larger gains
Implemented for Dryad
Slide19
Backup Slides
Slide20
Architecture
[Figure: components: Dryad Job Manager, IDE/MER Rerun Logic, Cache Server; steps: Modify DAG before run, RUN, Update cache after run]
Slide21
IDE – IDEntical Computation
Use fingerprints to identify identical computational DAGs
  Guarantee that both computation and data are unchanged
[Figure: First DAG (Store to Cache: <hash, outputs>); Second DAG]
Slide22
IDE – IDEntical Computation
Use a heuristic to identify the vertices to cache
[Figure: First DAG; Second DAG; label: cached]
Slide23
Word Histogram Application
[Figure: Inputs -> Hash Distribute (Map) -> Merge Sort -> Count (Reduce) -> Sort -> Outputs; stage widths labeled n and m]
Slide24
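To make the Word Histogram application from the preceding slide concrete, here is a small single-process Python sketch of the count and sort stages, plus a merge function of the kind MER would need. It is an assumption-laden illustration: the names `word_histogram`, `merge_histograms` and `sorted_output` are hypothetical, the distributed Hash Distribute and Merge Sort stages are collapsed into a single `Counter`, and merging before the final sort is a choice made for this sketch, not something the slides specify.

```python
from collections import Counter

def word_histogram(partitions):
    """Count word occurrences across text partitions
    (stands in for the Map and Count (Reduce) stages)."""
    counts = Counter()
    for text in partitions:
        counts.update(text.split())
    return counts

def merge_histograms(previous, incremental):
    """Hypothetical merge function: per-word counts add up."""
    return previous + incremental

def sorted_output(hist):
    """Final Sort stage: most frequent words first, ties broken by word."""
    return sorted(hist.items(), key=lambda kv: (-kv[1], kv[0]))

old_inputs = ["a b a", "b c"]
new_inputs = ["a c c"]

previous = word_histogram(old_inputs)      # result cached from the earlier run
incremental = word_histogram(new_inputs)   # computed on appended data only
merged = merge_histograms(previous, incremental)

full = word_histogram(old_inputs + new_inputs)  # recomputation from scratch
result = sorted_output(merged)
```

The merge operates on the unsorted counts and the sort is reapplied afterwards, because two sorted outputs cannot simply be concatenated into a correct histogram.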
Evaluation – Running time
Results similar if 20 more inputs appended (~1.2GB)
[Figure: running time chart; series: No Cache, IDE (4 ext), MER (4 ext), IDE (20 ext), MER (20 ext)]