A Hadoop MapReduce Performance Prediction Method

Presentation Transcript

Slide1

A Hadoop MapReduce Performance Prediction Method

Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China


Slide2

Background

Hadoop MapReduce: a Job consists of Map tasks and Reduce tasks.

[Diagram] Input data stored in HDFS is split; each split is processed by a Map task, which emits (Key, Value) pairs. The pairs are grouped into partitions (Partition 1, Partition 2, ...) and sent to the Reduce tasks, whose output is written back to HDFS.
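The data flow above can be sketched as a minimal in-memory simulation (plain Python for illustration, not the Hadoop API; the word-count map and reduce functions are hypothetical examples):

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, n_partitions=2):
    """Simulate split -> map -> partition/shuffle -> reduce."""
    partitions = [defaultdict(list) for _ in range(n_partitions)]
    for split in splits:                      # each split feeds one Map task
        for key, value in map_fn(split):      # Map emits (key, value) pairs
            p = hash(key) % n_partitions      # the partitioner picks a Reduce task
            partitions[p][key].append(value)
    output = {}
    for part in partitions:                   # each partition feeds one Reduce task
        for key, values in part.items():
            output[key] = reduce_fn(key, values)
    return output

# Illustrative word count
wc_map = lambda line: [(w, 1) for w in line.split()]
wc_reduce = lambda key, values: sum(values)

result = run_mapreduce(["a b a", "b c"], wc_map, wc_reduce)
# result == {'a': 2, 'b': 2, 'c': 1}
```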

Slide3

Background

Hadoop: there are many steps within the Map stage and the Reduce stage, and different steps may consume different types of resources.

[Diagram] Map task pipeline: READ → Map → SORT → MERGE → OUTPUT

Slide4

Motivation

Problems
- Scheduling: no consideration of execution time or of the different types of resources consumed.
- Hadoop parameter tuning: numerous parameters, and the default values are not optimal.

[Diagram] Two CPU-intensive jobs scheduled together on Hadoop; a job submitted to Hadoop runs with the default configuration (Default Conf).

Slide5

Motivation

Solution: predict the performance of Hadoop jobs, to support
- Scheduling (which currently takes no account of execution time or of the different types of resources consumed), and
- Hadoop parameter tuning (numerous parameters, whose default values are not optimal).

Slide6

Related Work

Existing Prediction Method 1: Black-Box Based

[Diagram] Job features → statistic/learning models → execution time

- Lack of analysis of Hadoop itself.
- The model is hard to choose.

Slide7

Related Work

Existing Prediction Method 2: Cost-Model Based

Job feature:
F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)
→ Execution time

- Difficult to ensure accuracy.
- Lots of concurrent processes.
- Hard to divide the stages.

[Diagram] Hadoop data flow: Read → map → Output, and Read → reduce → Output.

Slide8

Related Work

A Brief Summary of Existing Prediction Methods

Black Box
- Advantages: simple and effective; high accuracy; high isomorphism.
- Shortcomings: lack of job feature extraction; lack of analysis of Hadoop; only a simple prediction of execution time.

Cost Model
- Advantages: detailed analysis of Hadoop processing; flexible division (by stage, by resource); multiple predictions.
- Shortcomings: hard to divide each step and resource; many concurrent processes, hard to model; better for theoretical analysis, not suitable for prediction.

Both approaches lack an analysis of the job itself (jar package + input data).

Slide9

Goal

Design a Hadoop MapReduce performance prediction system to:
- predict the job's consumption of various types of resources (CPU, disk I/O, network);
- predict the execution time of the Map phase and the Reduce phase.

[Diagram] Job → Prediction System → Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time.

Slide10

Design - 1

Cost Model

[Diagram] Job → COST MODEL → Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time.

Slide11

Cost Model [1]

Analysis of Map
- Modeling the consumption of resources (CPU, disk, network).
- Each stage involves only one type of resource.

[Diagram] Map stages (Initiation, Read Data, Network Transfer, Create Object, Map Function, Sort in Memory, Read/Write Disk, Merge Sort, Write Disk, Serialization), each assigned to one resource class (CPU, Disk, Network).

[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for hadoop mapreduce," in CLUSTER Workshops, 2012, pp. 231–239.
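Since each stage is assumed to consume exactly one resource type, the occupation time of each resource is simply the sum of the costs of the stages assigned to it. A minimal sketch of this bookkeeping (the stage-to-resource assignment and the cost values are illustrative, not the calibrated model from [1]):

```python
# Each Map stage consumes exactly one resource class, so the occupation
# time of a resource is the sum of its stages' costs.
STAGE_RESOURCE = {           # illustrative assignment, not the paper's
    "initiation": "cpu", "create_object": "cpu", "map_function": "cpu",
    "sort_in_memory": "cpu", "serialization": "cpu",
    "read_data": "disk", "merge_sort": "disk", "write_disk": "disk",
    "network_transfer": "net",
}

def resource_occupation(stage_costs):
    """stage_costs: dict stage -> seconds; returns dict resource -> seconds."""
    occupation = {"cpu": 0.0, "disk": 0.0, "net": 0.0}
    for stage, cost in stage_costs.items():
        occupation[STAGE_RESOURCE[stage]] += cost
    return occupation

occ = resource_occupation({"initiation": 0.5, "map_function": 2.0,
                           "read_data": 1.0, "network_transfer": 0.3})
# occ == {"cpu": 2.5, "disk": 1.0, "net": 0.3}
```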

Slide12

Cost Model [1]

Cost Function Parameter Analysis
- Type One: constants (Hadoop system overhead, initialization overhead).
- Type Two: job-related parameters (the computational complexity of the map function, the number of map input records).
- Type Three: parameters defined by the cost model (sorting coefficient, complexity factor).

[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for hadoop mapreduce," in CLUSTER Workshops, 2012, pp. 231–239.

Slide13

Parameters Collection

Type One and Type Three
- Type One: run empty map tasks and calculate the system overhead from the logs.
- Type Three: extract the sort code from the Hadoop source and sort a certain number of records.

Type Two
- Naive approach: run a new job and analyze its log. High latency, large overhead.
- Our approach: sample the data and analyze only the behavior of the map and reduce functions. Almost no latency, very low extra overhead. This is the Job Analyzer.
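The sampling idea behind the Job Analyzer can be sketched as follows: take a small sample of the input (a few percent at most), run the user's map function on it, and derive the relative computational complexity and the data conversion rate. The function names and the sampling scheme below are illustrative assumptions, not the paper's implementation:

```python
import time

def analyze_job(records, map_fn, sample_rate=0.05):
    """Sample <= 5% of the input, run map_fn on it, and report job features."""
    step = max(1, round(1 / sample_rate))
    sample = records[::step]                      # simple systematic sample
    in_bytes = sum(len(r) for r in sample)
    start = time.perf_counter()
    out = [kv for r in sample for kv in map_fn(r)]
    elapsed = time.perf_counter() - start
    out_bytes = sum(len(str(k)) + len(str(v)) for k, v in out)
    return {
        "records_sampled": len(sample),
        "time_per_record": elapsed / len(sample),  # relative complexity proxy
        "conversion_rate": out_bytes / in_bytes,   # output/input data ratio
    }

features = analyze_job(["a b"] * 1000, lambda r: [(w, 1) for w in r.split()])
```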

Slide14

Job Analyzer - Implementation

Hadoop virtual execution environment: accepts the job's Jar file and input data.
- Sampling Module: samples the input data at a fixed percentage (less than 5%).
- MR Module: instantiates the user job's classes using Java reflection.
- Analyze Module: measures the input data (amount and number of records), the relative computational complexity, and the data conversion rate (output/input).

[Diagram] Jar file + input data → Sampling Module → MR Module → Analyze Module → job features.

Slide15

Job Analyzer - Feasibility

- Data similarity: logs have a uniform format.
- Execution similarity: every record is processed repeatedly by the same map and reduce functions.

[Diagram] Input data → split → Map tasks → Reduce tasks.

Slide16

Design - 2

Parameters Collection

[Diagram] Job → Job Analyzer (collects the Type Two parameters) and Static Parameters Collection Module (collects the Type One and Type Three parameters) → COST MODEL → Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time.

Slide17

Prediction Model

Problem analysis: many steps run concurrently, so the total time cannot be obtained by adding up the time of each part.

[Diagram] Map stages (Initiation, Read Data, Network Transfer, Create Object, Map Function, Sort in Memory, Read/Write Disk, Merge Sort, Write Disk, Serialization) overlapping on the CPU, Disk, and Network timelines.

Slide18

Prediction Model

Main Factors (according to the performance model): Map Stage

Tmap = α0 + α1 * MapInput + α2 * N + α3 * N*log(N) + α4 * (complexity of the map function) + α5 * (conversion rate of the map data)

The main factors are: the amount of input data (MapInput), the number of input records (N), N*log(N), the complexity of the Map function, and the conversion rate of the Map data.

[Diagram] Map stages as on the cost-model slide.
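Once the coefficients αi have been fitted, the model above is a plain linear combination of the job features. A sketch of its evaluation (the coefficient values are made up for illustration, not fitted from real logs):

```python
import math

def t_map(features, alpha):
    """Tmap = a0 + a1*MapInput + a2*N + a3*N*log(N) + a4*complexity + a5*conv_rate."""
    map_input, n, complexity, conv_rate = features
    basis = [1.0, map_input, n, n * math.log(n), complexity, conv_rate]
    return sum(a * x for a, x in zip(alpha, basis))

# Hypothetical coefficients; in the system they are fitted from history logs.
alpha = [0.5, 1e-8, 1e-4, 2e-5, 0.3, 0.1]
t = t_map((64 * 2**20, 10_000, 1.5, 0.8), alpha)  # 64 MB input, 10k records
```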

Slide19

Prediction Model

Experimental Analysis
- Test 4 kinds of jobs (0–10,000 records).
- Extract the features for linear regression.
- Calculate the correlation coefficient (R²).

R² per job type: Dedup 0.9982, WordCount 0.9992, Project 0.9991, Grep 0.9949; all jobs pooled: 0.6157.
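The R² values above come from fitting a line per job type. A self-contained sketch of least-squares fitting and R² on synthetic data (not the paper's measurements):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def r_squared(xs, ys, a, b):
    """Coefficient of determination of the fitted line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [0, 1000, 2000, 4000, 8000]   # number of records (synthetic)
ys = [1.0, 2.1, 2.9, 5.2, 9.0]     # map execution time (synthetic)
a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)       # close to 1 within one job type
```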

Slide20

Prediction Model

[Figure] Execution time of Map vs. number of records: a very good linear relationship within the same kind of job, but no linear relationship across different kinds of jobs.

Slide21

Find the nearest jobs!

Instance-Based Linear Regression
- Find the samples nearest to the job to be predicted in the history logs ("nearest" means similar jobs; take the Top-K nearest, with K = 10%–15%).
- Do a linear regression over the samples found.
- Calculate the predicted value.

"Nearest" is the weighted distance over job features (weights w):
- High contribution to job classification: map/reduce complexity, map/reduce data conversion rate.
- Low contribution to job classification: data amount, number of records.
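The nearest-job search can be sketched as a weighted distance over the four job features, with heavier weights on the high-contribution features. The history data, weights, and K value below are illustrative, not the paper's calibration:

```python
import math

def nearest_jobs(history, query, weights, k_frac=0.12):
    """Return the Top-K history samples by weighted feature distance."""
    def dist(feat):
        return math.sqrt(sum(w * (a - b) ** 2
                             for w, a, b in zip(weights, feat, query)))
    ranked = sorted(history, key=lambda s: dist(s["features"]))
    k = max(2, round(len(ranked) * k_frac))   # K as a fraction of the history
    return ranked[:k]

# History samples: features = (complexity, conversion_rate, data_amount, n_records)
history = [{"features": (1.0 + i * 0.01, 0.8, i * 100.0, i * 10.0),
            "map_time": 1.0 + i * 0.1} for i in range(50)]
# High-contribution features (complexity, conversion rate) get large weights;
# raw size features are down-weighted so they matter less for classification.
weights = (10.0, 10.0, 1e-8, 1e-6)
neighbours = nearest_jobs(history, (1.2, 0.8, 2000.0, 200.0), weights)
# A regression is then fitted over `neighbours` to predict the new job's time.
```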

Slide22

Prediction Module

Procedure
1. Cost model.
2. Main factors.
3. Tmap = α0 + α1 * MapInput + α2 * N + α3 * N*log(N) + α4 * (complexity of the map function) + α5 * (conversion rate of the map data).
4. Job features.
5. Search for the nearest samples.
6. Prediction function.
7. Prediction results.

Slide23

Prediction Module

Procedure

[Diagram] Training set → Find-Neighbor Module → Prediction Function (built from the Cost Model) → Prediction Results.

Slide24

Design - 3

Parameters Collection

[Diagram] Job → Job Analyzer (collects the Type Two parameters) and Static Parameters Collection Module (collects the Type One and Type Three parameters) → COST MODEL → CPU occupation time, disk occupation time, network occupation time; Prediction Module → Map execution time, Reduce execution time.

Slide25

Experiments

Task execution time (error rate), for 4 kinds of jobs with input sizes from 64 MB to 8 GB:
- K = 12%, with a different weight w for each feature;
- K = 12%, with the same weight w for each feature;
- K = 25%, with a different weight w for each feature.

[Figure] Error rate per Job ID under each setting.

Slide26

Conclusion

Job Analyzer:
- analyzes the job (Jar file + input data);
- collects the parameters.

Prediction Module:
- finds the main factors;
- proposes a linear equation;
- classifies jobs;
- makes multiple predictions.

Slide27

Thank you! Questions?

Slide28

Cost Model [1]

Analysis of Reduce
- Modeling the consumption of resources (CPU, disk, network).
- Each stage involves only one type of resource.

[Diagram] Reduce stages (Initiation, Read Data, Network Transfer, Create Object, Reduce Function, Merge Sort, Read/Write Disk, Network, Write Disk, Serialization, Deserialization), each assigned to one resource class (CPU, Disk, Network).

Slide29

Prediction Model

Main Factors (according to the performance model): Reduce Stage

Treduce = α0 + α1 * MapInput + α2 * N + α3 * N*log(N) + α4 * (complexity of the reduce function) + α5 * (conversion rate of the map data) + α6 * (conversion rate of the reduce data)

The main factors are: the amount of input data (MapInput), the number of input records (N), N*log(N), the complexity of the Reduce function, the conversion rate of the Map data, and the conversion rate of the Reduce data.

[Diagram] Reduce stages as on the cost-model slide.