
DryadInc - PowerPoint Presentation

Uploaded On 2017-06-09


Reusing work in large-scale computations. Lucian Popa, Mihai Budiu, Yuan Yu, Michael Isard. Microsoft Research Silicon Valley; UC Berkeley.



Presentation Transcript

Slide1

DryadInc: Reusing work in large-scale computations

Lucian Popa*+, Mihai Budiu+, Yuan Yu+, Michael Isard+

+ Microsoft Research Silicon Valley

* UC Berkeley

Slide2

Problem Statement

Goal: Reuse (part of) prior computations to:

Speed up the current job

Increase cluster throughput

Reduce energy and costs

[Figure: Inputs (append-only data), Distributed Computation, Outputs]

Slide3

Propose Two Approaches

1. IDE: Reuse IDEntical computations from the past (like make or memoization)

2. MER: Do only incremental computation on the new data and MERge results with the previous ones (like patch)

Slide4

Context

Implemented for Dryad

Dryad Job = Computational DAG

Vertex: arbitrary computation + inputs/outputs

Edge: data flows

Simple Example: Record Count

[Figure: input partitions I1 and I2 each feed a Count vertex (C); an Add vertex (A) combines the counts into the Outputs]

Slide5
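The Record Count example can be sketched in plain Python as a tiny two-stage DAG. This is only an illustration of the shape of the job, not Dryad code: the function names `count`, `add`, and `record_count`, and the sample partitions, are assumptions made for this sketch.

```python
from typing import List


def count(partition: List[str]) -> int:
    """Count vertex (C): number of records in one input partition."""
    return len(partition)


def add(counts: List[int]) -> int:
    """Add vertex (A): sum the per-partition counts."""
    return sum(counts)


def record_count(partitions: List[List[str]]) -> int:
    # One Count vertex per input partition feeds a single Add vertex.
    return add([count(p) for p in partitions])


# Hypothetical input partitions I1 and I2.
i1 = ["a", "b", "c"]
i2 = ["d", "e"]
total = record_count([i1, i2])
```

Each call to `count` depends only on its own partition, which is what makes the per-partition results candidates for reuse in later runs.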

IDE – IDEntical Computation

Record Count

First execution DAG

[Figure: input partitions I1 and I2 each feed a Count vertex (C); an Add vertex (A) combines the counts into the Outputs]

Slide6

IDE – IDEntical Computation

Second execution DAG

Record Count

[Figure: the same DAG with a New Input partition I3 and its own Count vertex (C) feeding the Add vertex (A)]

Slide7

IDE – IDEntical Computation

Second execution DAG

Record Count

[Figure: the second-execution DAG over I1, I2 and I3, with the part already computed in the first execution marked "Identical subDAG"]

Slide8

IDE – IDEntical Computation

IDE Modified DAG

Replace identical computational subDAG with edge data cached from previous execution

[Figure: the modified DAG, in which the identical subDAG is "Replaced with Cached Data" and only I3 passes through a Count vertex (C) before the Add vertex (A)]

Slide9
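A minimal sketch of the IDE idea applied to Record Count: per-partition Count outputs are cached under a fingerprint, and a second run with one appended partition only computes the new Count. The `fingerprint` scheme here (a SHA-256 over the vertex name and the records) and the in-memory `cache` dictionary are assumptions for illustration, not the mechanism used by DryadInc.

```python
import hashlib
from typing import Dict, List

cache: Dict[str, int] = {}   # fingerprint -> cached Count output
executed: List[str] = []     # fingerprints that were actually computed


def fingerprint(vertex_name: str, partition: List[str]) -> str:
    """Hypothetical fingerprint covering both the computation and its data."""
    h = hashlib.sha256(vertex_name.encode())
    for record in partition:
        h.update(b"\x00" + record.encode())
    return h.hexdigest()


def cached_count(partition: List[str]) -> int:
    key = fingerprint("Count", partition)
    if key not in cache:
        # Cache miss: run the Count vertex and remember its output.
        executed.append(key)
        cache[key] = len(partition)
    return cache[key]


def record_count(partitions: List[List[str]]) -> int:
    # The Add vertex always runs; Count vertices run only on a cache miss.
    return sum(cached_count(p) for p in partitions)


i1, i2, i3 = ["a", "b", "c"], ["d", "e"], ["f"]
first = record_count([i1, i2])        # first execution: two Counts run
runs_after_first = len(executed)
second = record_count([i1, i2, i3])   # second execution: only I3 is counted
runs_after_second = len(executed)
```

In this sketch the user changes nothing in the job itself; the reuse comes entirely from the lookup wrapped around the vertex.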

IDE – IDEntical Computation

IDE Modified DAG

Replace identical computational subDAG with edge data cached from previous execution

Use DAG fingerprints to determine if computations are identical

[Figure: the modified DAG, with Inputs (partitions), I3, Count (C), Add (A) and Outputs]

Slide10

Semantic Knowledge Can Help

[Figure: the Record Count DAG over I1 and I2 (Count vertices C, Add vertex A), labeled "Reuse Output"]

Slide11

Semantic Knowledge Can Help

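[Figure: the Previous Output of the Record Count DAG over I1 and I2, and an Incremental DAG over I3 (Count vertex C, Add vertex A), combined by a Merge (Add) vertex]

Slide12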
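For Record Count, the merge-based approach can be sketched as follows: the previous output is kept, an incremental DAG runs over the new partition only, and a merge function (here, addition) combines the two. The names `merge_add`, `previous_output`, and `incremental_output` are assumptions for this sketch.

```python
from typing import List


def count(partition: List[str]) -> int:
    """Count vertex (C) for a single partition."""
    return len(partition)


def merge_add(previous_output: int, incremental_output: int) -> int:
    """Merge function for Record Count: counts simply add up."""
    return previous_output + incremental_output


i1, i2, i3 = ["a", "b", "c"], ["d", "e"], ["f"]

# First execution over I1 and I2; its output is saved.
previous_output = count(i1) + count(i2)

# Second execution: the incremental DAG reads only the appended partition I3.
incremental_output = count(i3)

new_output = merge_add(previous_output, incremental_output)
```

Unlike caching individual vertex outputs, this relies on semantic knowledge of the job: it is only correct because counting distributes over appended partitions.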

MER – MERgeable Computation

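[Figure: the Record Count DAG over I1 and I2, the incremental DAG over I3, and the Merge (Add) vertex; parts of the figure are labeled "Automatically Inferred", "Automatically Built" and "User-specified"]

Slide13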

MER – MERgeable Computation

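Incremental DAG – Remove Old Inputs

Merge Vertex

Save to Cache

[Figure: the original Record Count DAG over I1 and I2 beside a DAG over I1, I2 and I3; figure labels include Count (C), Add (A) and Empty]

Slide14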

IDE in practice

[Figure: a 6 input DAG, and the IDE 6 (4+2) input DAG]

Slide15

MER in practice

[Figure: a 9 input DAG, and the MER 9 (5+4) input DAG, in which two inputs are labeled Empty]

Slide16

Evaluation – Running time

Word Histogram Application – 8 nodes

[Chart legend: No Cache, IDE, MER]

Slide17

Discussion

MapReduce: just a particular case

  IDE reuses the output of Mappers

  MER requires combined Reduce function

Combine IDE with MER: benefits don’t add up

  IDE can be used for the incremental DAG at MER

More semantic knowledge: further opportunities

  Generate merge function automatically

  Improve incremental DAG

Slide18

Conclusions & Questions

Problem: reuse work in distributed computations on append-only data

Two methods:

IDE – reuse IDEntical past computations

  No user effort

MER – MERge past results with new ones

  Small user effort, potentially larger gains

Implemented for Dryad

Slide19

Backup Slides

Slide20

Architecture

Modify DAG before run

Update cache after run

[Figure labels: Dryad Job Manager, Cache Server, IDE/MER Rerun Logic, RUN]

Slide21

IDE – IDEntical Computation

Use fingerprints to identify identical computational DAGs

Guarantee that both computation and data are unchanged

[Figure: First DAG, with "Store to Cache: <hash, outputs>", and Second DAG]

Slide22
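One way to obtain a fingerprint that changes whenever either the computation or the data changes is to hash each vertex together with the fingerprints of everything below it. The sketch below does this for a toy DAG; the `Node` class, the use of SHA-256, and the `<hash, outputs>` dictionary are assumptions for illustration and say nothing about how DryadInc actually computes its fingerprints.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    code: str                       # identifier of the vertex program ("" for an input)
    data: bytes = b""               # raw bytes of an input partition
    children: List["Node"] = field(default_factory=list)


def fingerprint(node: Node) -> str:
    """Hash the vertex code, its data, and the fingerprints of its children."""
    h = hashlib.sha256()
    h.update(node.code.encode())
    h.update(b"|")
    h.update(hashlib.sha256(node.data).digest())
    for child in node.children:
        h.update(fingerprint(child).encode())
    return h.hexdigest()


def count_dag(partitions: List[bytes]) -> Node:
    # Add vertex on top of one Count vertex per input partition.
    counts = [Node("Count", children=[Node("", p)]) for p in partitions]
    return Node("Add", children=counts)


first = count_dag([b"a\nb\nc", b"d\ne"])
same = count_dag([b"a\nb\nc", b"d\ne"])
changed_data = count_dag([b"a\nb\nc", b"d\nX"])

# Store to cache: <hash, outputs>. The output value 5 is filled in by hand here.
cache: Dict[str, int] = {fingerprint(first): 5}
hit: Optional[int] = cache.get(fingerprint(same))
miss: Optional[int] = cache.get(fingerprint(changed_data))
```

Because fingerprints are computed per subDAG, the untouched Count over the first partition still matches across `first` and `changed_data` even though the whole-DAG fingerprints differ.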

IDE – IDEntical Computation

Use a heuristic to identify the vertices to cache

[Figure: First DAG and Second DAG, with some vertices labeled "cached"]

Slide23

Word Histogram Application

Stages: Hash Distribute (Map), Merge Sort, Count (Reduce), Sort

[Figure labels: Inputs, Outputs, n, m]

Slide24

Evaluation – Running time

[Chart legend: No Cache, IDE (4 ext), MER (4 ext), IDE (20 ext), MER (20 ext)]

Results similar if 20 more inputs appended ~ 1.2GB
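For a word histogram, a merge function in the MER style could look like the sketch below: the previous histogram is kept, a histogram is computed over the appended inputs only, and the two are added key by key. This uses Python's `collections.Counter` purely for illustration; the function names and the sample partitions are assumptions, and the code does not reproduce the Dryad application that was measured.

```python
from collections import Counter
from typing import List


def histogram(partitions: List[str]) -> Counter:
    """Word histogram over a list of text partitions."""
    result: Counter = Counter()
    for text in partitions:
        result.update(text.split())
    return result


def merge(previous: Counter, incremental: Counter) -> Counter:
    """Merge function: per-word counts add up."""
    return previous + incremental


old_inputs = ["a b a", "b c"]
new_inputs = ["c c d"]

previous = histogram(old_inputs)       # saved from the earlier run
incremental = histogram(new_inputs)    # computed on appended data only
merged = merge(previous, incremental)
```

The merge is valid here because word counts over disjoint partitions can be summed, which is the same property a combinable Reduce function has.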
