
Restore - PowerPoint Presentation

Uploaded On 2015-09-20



Presentation Transcript

Slide1

ReStore: Reusing Results of MapReduce Jobs

Jun Fan

Slide2

Outline

Slide3

Introduction to MapReduce
- MapReduce and its implementations, such as Hadoop, are widely used at Facebook, Yahoo, and Google as well as at smaller companies.
- Users express their complex analysis tasks in high-level query languages such as Pig.
- Queries are translated into workflows of MapReduce jobs, and the output of each job is stored in the distributed file system.

Slide4

Introduction to MapReduce

Slide5

Introduction to MapReduce
- A high-level language translates a query into physical operators (Join, Select).
- All physical operators are embedded into the mapper and reducer stages.
- The compiler generates code for each MapReduce job and passes it to the MapReduce system.
- ReStore extends this dataflow to reuse the outputs of physical operators.

Slide6
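To illustrate how physical operators end up embedded in the mapper and reducer stages, as described on the preceding slide, here is a minimal single-process sketch. It is not Hadoop or Pig code: `run_job`, the record layout, and the sample operators are hypothetical names chosen for illustration.

```python
from collections import defaultdict

def run_job(records, map_ops, reduce_op):
    """Run one toy 'MapReduce job' in a single process (illustration only)."""
    groups = defaultdict(list)
    for rec in records:
        # Map stage: apply the embedded operators in order.
        # An operator returns None to drop the record (like a filter).
        for op in map_ops:
            rec = op(rec)
            if rec is None:
                break
        else:
            key, value = rec           # surviving records are (key, value) pairs
            groups[key].append(value)  # stands in for the shuffle phase
    # Reduce stage: one call of the reduce operator per key.
    return {key: reduce_op(values) for key, values in groups.items()}

# Hypothetical query: count successful requests per URL.
records = [("a.html", 200), ("b.html", 404), ("a.html", 200)]
keep_ok = lambda r: r if r[1] == 200 else None   # Filter operator
to_pair = lambda r: (r[0], 1)                    # Project operator
result = run_job(records, [keep_ok, to_pair], sum)
```

In this sketch the filter and project operators run in the map stage and the aggregation runs in the reduce stage; a real compiler decides that placement per job.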

Overview of ReStore
- ReStore improves the performance of workflows by storing intermediate results and reusing them.
- It enables queries submitted at different times to share results.
- It is built on top of dataflow language processors.

Slide7

Overview of ReStore
- Reuses job outputs that were previously stored.
- Stores the outputs of executed jobs for future reuse.
- Creates more reuse opportunities by storing the outputs of sub-jobs.
- Selects the outputs of jobs to
- Rewrites a MapReduce job and submits it to the MapReduce system to be executed.

Slide8

Overview of ReStore

Slide9

Overview of ReStore

Slide10

Overview of ReStore

Slide11

Types of Result Reuse

Slide12

Types of Result Reuse
- If the outputs of all the jobs that the Join job depends on are stored in the system, $T_{total}(\mathrm{Join}) = ET(\mathrm{Job}_n)$.
- Parts of the query execution plan are stored in the system.

Slide13
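The reuse condition on the preceding "Types of Result Reuse" slide can be made concrete with a small sketch. It rests on assumptions the slide does not spell out: ET is taken to be a job's own execution time, a job's total time is assumed to be its ET plus the largest total time among the jobs it depends on, and a stored job is assumed to cost nothing. The job names and timings are invented for illustration.

```python
from typing import Dict, List, Set

def total_time(job: str,
               exec_time: Dict[str, float],
               deps: Dict[str, List[str]],
               stored: Set[str]) -> float:
    """Total time to obtain `job`'s output under the assumed cost model."""
    if job in stored:
        return 0.0  # assumption: a stored output does not need to be recomputed
    # Assumption: dependencies run in parallel, so only the slowest one counts.
    waiting = max((total_time(d, exec_time, deps, stored)
                   for d in deps.get(job, [])), default=0.0)
    return exec_time[job] + waiting

# Hypothetical workflow: a join that depends on two upstream jobs.
exec_time = {"job_a": 40.0, "job_b": 25.0, "join": 60.0}
deps = {"join": ["job_a", "job_b"]}

no_reuse = total_time("join", exec_time, deps, stored=set())
full_reuse = total_time("join", exec_time, deps, stored={"job_a", "job_b"})
```

With both upstream outputs stored, the total collapses to the execution time of the join alone, which is the case the slide's formula describes.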

Types of Result Reuse

Slide14

ReStore System Architecture
- Input: a workflow of MapReduce jobs generated by a dataflow system.
- Outputs:
  - A modified MapReduce workflow that exploits previously executed jobs stored by ReStore.
  - A new set of job outputs to store in the distributed file system.

Slide15

ReStore System Architecture
- A repository manages the stored MapReduce job outputs. For each stored output it records:
  - The physical query execution plan of the MapReduce job (input, output, operators).
  - The filename of the output in the distributed file system.
  - Statistics about the MapReduce job (size of input and output, execution time) and the frequency of use of this output by different workflows.

Slide16
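One way to picture a repository record from the preceding slide is as a small data class. This is a sketch only: the class name, field names, and the representation of a plan as a list of operator strings are assumptions, not ReStore's actual data structures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RepositoryEntry:
    # Physical query execution plan (simplified here to inputs plus operators).
    inputs: List[str]        # input data sets read by the job
    operators: List[str]     # e.g. ["Filter", "Project"]
    # Where the stored output lives in the distributed file system.
    output_file: str
    # Statistics about the job.
    input_bytes: int
    output_bytes: int
    exec_seconds: float
    # How often other workflows have reused this output.
    reuse_count: int = 0

    def record_reuse(self) -> None:
        """Count one more workflow that was rewritten to use this output."""
        self.reuse_count += 1

# Hypothetical entry; the path and numbers are invented.
entry = RepositoryEntry(
    inputs=["/data/logs"],
    operators=["Filter", "Project"],
    output_file="/restore/out_0001",
    input_bytes=10_000_000,
    output_bytes=500_000,
    exec_seconds=42.0,
)
entry.record_reuse()
```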

ReStore System Architecture

Slide17

Plan Matcher and Rewriter
- The goal is to find physical plans in the repository that can be used to rewrite the input workflow.
- Before a job is matched against the repository, all other jobs that it depends on have to be matched and rewritten to use the job outputs stored in the repository.

Slide18

Plan Matcher and Rewriter

Slide19

Plan Matcher and Rewriter
The flow:
- Scan sequentially through the physical plans in the repository.
- When a match is found, rewrite the input job to use the matched physical plan in the repository.
- After rewriting, start a new sequential scan through the repository.
- If a scan does not find any matches, ReStore proceeds to matching the next MapReduce job in the workflow.

Slide20
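The matching flow on the preceding "Plan Matcher and Rewriter" slide can be sketched as a scan-and-restart loop. The sketch makes a strong simplification that is an assumption, not part of the source: a physical plan is a linear chain of operator strings over one input, and a repository plan matches when it is a prefix of the job's chain on the same input. `Plan`, `Stored`, and `rewrite_job` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple

class Plan(NamedTuple):
    source: str             # input data set
    ops: Tuple[str, ...]    # linear chain of operators applied to it

class Stored(NamedTuple):
    plan: Plan
    output_file: str        # stored output of that plan

def rewrite_job(job: Plan, repository: List[Stored]) -> Plan:
    """Rewrite `job` to read stored outputs wherever a repository plan matches."""
    matched = True
    while matched:                    # every rewrite triggers a fresh scan
        matched = False
        for entry in repository:      # sequential scan; the first match is used
            n = len(entry.plan.ops)
            if (n > 0 and job.source == entry.plan.source
                    and job.ops[:n] == entry.plan.ops):
                # Replace the matched part with a load of the stored output.
                job = Plan(entry.output_file, job.ops[n:])
                matched = True
                break
    return job                        # no match left: move on to the next job

repository = [
    Stored(Plan("logs", ("filter:status=200",)), "out1"),
    Stored(Plan("out1", ("project:url",)), "out2"),
]
job = Plan("logs", ("filter:status=200", "project:url", "group:url"))
rewritten = rewrite_job(job, repository)
```

The second stored plan only becomes applicable after the first rewrite, which is why the scan restarts from the beginning after each rewrite.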

Plan Matcher and Rewriter
Two operators are equivalent if:
- Their inputs are pipelined from operators that are equivalent or from the same data sets.
- They perform functions that produce the same output data.

Slide21
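The two equivalence conditions on the preceding slide suggest a recursive check. In the sketch below, "perform functions that produce the same output data" is approximated by comparing the function name and arguments syntactically, which is an assumption; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class DataSet:
    path: str

@dataclass(frozen=True)
class Operator:
    func: str                                        # e.g. "filter"
    args: Tuple[str, ...]                            # e.g. ("status=200",)
    inputs: Tuple[Union["Operator", DataSet], ...]   # upstream operators or data sets

def equivalent(a, b) -> bool:
    """Check the two equivalence conditions recursively."""
    if isinstance(a, DataSet) or isinstance(b, DataSet):
        # Base case: inputs must come from the same data set.
        return (isinstance(a, DataSet) and isinstance(b, DataSet)
                and a.path == b.path)
    # Same function (approximated syntactically) ...
    same_function = a.func == b.func and a.args == b.args
    # ... and inputs pipelined from pairwise equivalent sources.
    same_inputs = (len(a.inputs) == len(b.inputs)
                   and all(equivalent(x, y) for x, y in zip(a.inputs, b.inputs)))
    return same_function and same_inputs

logs = DataSet("logs")
f1 = Operator("filter", ("status=200",), (logs,))
f2 = Operator("filter", ("status=200",), (DataSet("logs"),))
f3 = Operator("filter", ("status=200",), (DataSet("other"),))
```

Here `f1` and `f2` are equivalent, while `f3` is not because it reads a different data set.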

Plan Matcher and Rewriter
- ReStore uses the first match that it finds in the repository.
- The physical plans in the repository are therefore ordered by these rules:
  - Plan A is preferred to plan B if all the operators in plan B have equivalent operators in plan A.
  - Plans are ordered by the ratio between the sizes of their input and output data and by the execution time of the MapReduce job.

Slide22
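The ordering rules on the preceding slide can be sketched as a pairwise preference test. Several details are assumptions made for illustration: operators are compared as plain strings, a larger input-to-output ratio and a longer execution time are assumed to rank a plan earlier, and `StoredPlan` and `preferred` are hypothetical names.

```python
from typing import FrozenSet, NamedTuple

class StoredPlan(NamedTuple):
    ops: FrozenSet[str]
    input_bytes: int
    output_bytes: int
    exec_seconds: float

def subsumes(a: StoredPlan, b: StoredPlan) -> bool:
    """True if every operator of plan b has an equivalent operator in plan a."""
    return b.ops <= a.ops

def preferred(a: StoredPlan, b: StoredPlan) -> bool:
    """True if plan a should be tried before plan b."""
    a_over_b, b_over_a = subsumes(a, b), subsumes(b, a)
    if a_over_b != b_over_a:
        return a_over_b              # rule 1: the subsuming plan goes first
    # Rule 2 (direction assumed): larger size reduction, then longer job, first.
    def key(p: StoredPlan):
        return (p.input_bytes / max(p.output_bytes, 1), p.exec_seconds)
    return key(a) > key(b)

big = StoredPlan(frozenset({"filter", "project"}), 1000, 100, 30.0)
small = StoredPlan(frozenset({"filter"}), 1000, 500, 10.0)
```

Because the first match is used, placing `big` ahead of `small` lets a job reuse the larger stored plan when both match.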

The Repository
- Can we treat all possible sub-jobs as candidates? No.
- Storing all of them would require a substantial amount of storage.
- The overhead of storing all the intermediate data would considerably slow down the execution of the input MapReduce job.

Slide23

The Repository
Two heuristics for choosing candidate sub-jobs:
- Conservative heuristic: use the outputs of operators that are known to reduce their input size (Project, Filter).
- Aggressive heuristic: use the outputs of operators that are known to be expensive (Project, Filter, Join, Group).

Slide24
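The two heuristics on the preceding slide amount to choosing a set of operator types whose outputs are worth materializing. A minimal sketch, assuming a plan is just a list of operator type names; the function name and the sample plan are hypothetical.

```python
# Operator types taken from the slide.
CONSERVATIVE = {"Project", "Filter"}
AGGRESSIVE = {"Project", "Filter", "Join", "Group"}

def candidate_sub_jobs(plan_ops, heuristic="conservative"):
    """Return the positions of operators whose outputs would be stored."""
    allowed = CONSERVATIVE if heuristic == "conservative" else AGGRESSIVE
    return [i for i, op in enumerate(plan_ops) if op in allowed]

# Hypothetical physical plan of one MapReduce job.
plan = ["Load", "Filter", "Join", "Group", "Store"]
conservative = candidate_sub_jobs(plan, "conservative")
aggressive = candidate_sub_jobs(plan, "aggressive")
```

The aggressive heuristic produces more candidates, and therefore more reuse opportunities, at the price of more materialization during execution.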

The Repository

Slide25

The Repository
Rules to keep a candidate job in the repository:
- The size of its output data is smaller than the size of its input data.
- There will be a reduction in execution time for workflows reusing this job.

Slide26
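The two keep rules on the preceding slide can be written as one predicate. How the "reduction in execution time" is estimated is not given on the slide; the sketch assumes it means that loading the stored output is expected to be faster than re-running the job, and `est_load_seconds` is a hypothetical estimate supplied by the caller.

```python
def should_keep(input_bytes: int, output_bytes: int,
                exec_seconds: float, est_load_seconds: float) -> bool:
    """Decide whether a candidate job output stays in the repository."""
    reduces_size = output_bytes < input_bytes        # rule 1
    saves_time = est_load_seconds < exec_seconds     # rule 2 (assumed estimate)
    return reduces_size and saves_time

# Invented numbers for two candidates.
keep_filter = should_keep(10_000_000, 500_000, exec_seconds=40.0, est_load_seconds=2.0)
keep_expand = should_keep(1_000_000, 5_000_000, exec_seconds=40.0, est_load_seconds=20.0)
```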

The Repository
Rules to evict a candidate job from the repository:
- Evict a job if it has not been reused within a window of time.
- Evict a job if one or more of its inputs is deleted or modified.

Slide27
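The eviction rules on the preceding slide can be sketched as a check run periodically over the repository. The way modification is detected here, by comparing recorded modification times against current ones, is an assumption, as are the names `Entry` and `should_evict`.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Entry:
    output_file: str
    last_reused: float               # timestamp of the most recent reuse
    input_mtimes: Dict[str, float]   # input path -> modification time when the job ran

def should_evict(entry: Entry, now: float, window_seconds: float,
                 current_mtimes: Dict[str, float]) -> bool:
    """Apply both eviction rules to one repository entry."""
    # Rule 1: not reused within the time window.
    if now - entry.last_reused > window_seconds:
        return True
    # Rule 2: an input was deleted (missing) or modified (different mtime).
    for path, seen in entry.input_mtimes.items():
        current = current_mtimes.get(path)
        if current is None or current != seen:
            return True
    return False

entry = Entry("out1", last_reused=1000.0, input_mtimes={"logs": 500.0})
fresh = should_evict(entry, now=1100.0, window_seconds=3600.0,
                     current_mtimes={"logs": 500.0})
stale = should_evict(entry, now=9000.0, window_seconds=3600.0,
                     current_mtimes={"logs": 500.0})
```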

THANK YOU

Questions?