Slide 1: A Hadoop MapReduce Performance Prediction Method
Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China
Slide 2: Background
Hadoop MapReduce: the input data stored in HDFS is split and processed by Map tasks, which emit (Key, Value) pairs; the pairs are grouped into partitions and consumed by Reduce tasks, whose output goes back to HDFS.
[Figure: a MapReduce job: INPUT DATA → Split → Map tasks → (Key, Value) pairs → Partition 1 / Partition 2 → Reduce tasks → HDFS]
Slide 3: Background
Hadoop: there are many steps within the Map stage and the Reduce stage, and different steps may consume different types of resources.
[Figure: steps within a Map task: READ → Map → SORT → MERGE → OUTPUT]
Slide 4: Motivation
Problems:
- Scheduling: no consideration of execution time or of the different types of resources consumed.
- Hadoop parameter tuning: numerous parameters, and the default values are not optimal.
[Figure: a CPU-intensive job submitted to Hadoop with the default configuration]
Slide 5: Motivation
Solution: predict the performance of Hadoop jobs, to support
- scheduling, which currently takes no account of execution time or of the different types of resources consumed, and
- Hadoop parameter tuning, whose numerous parameters have non-optimal default values.
Slide 6: Related Work
Existing prediction method 1: black-box based. Job features are fed into statistical or machine-learning models that directly output the execution time, treating Hadoop itself as a black box.
Drawbacks: no analysis of Hadoop internals; the model and features are hard to choose.
Slide 7: Related Work
Existing prediction method 2: cost-model based. Job features are plugged into analytic cost functions over the internal stages:
F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)
Drawbacks: difficult to ensure accuracy; many concurrent processes; the stages are hard to divide cleanly.
[Figure: Map pipeline (read → map → … → output) and Reduce pipeline (read → … → reduce → output)]
Slide 8: Related Work
A brief summary of the existing prediction methods:
Black Box. Advantages: simple and effective; high accuracy; high isomorphism. Shortcomings: no job feature extraction; no analysis of Hadoop; only simple prediction.
Cost Model. Advantages: detailed analysis of Hadoop processing; flexible division (by stage or by resource); multiple prediction targets. Shortcomings: hard to divide each step and resource; many concurrent processes, hard to model; better suited to theoretical analysis than to prediction.
Common shortcoming: neither analyzes the job itself (jar package + input data).
Slide 9: Goal
Design a Hadoop MapReduce performance prediction system to:
- predict a job's consumption of various types of resources (CPU, disk I/O, network);
- predict the execution time of the Map phase and the Reduce phase.
[Figure: Job → Prediction System → Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time]
Slide 10: Design - 1
At the core of the system is a cost model.
[Figure: Job → COST MODEL → Map execution time, Reduce execution time, CPU / disk / network occupation times]
Slide 11: Cost Model [1]
Analysis of the Map task:
- model the consumption of each resource (CPU, disk, network);
- each stage involves only one type of resource.
Map stages: Initiation → Read Data → Network Transfer → Create Object → Map Function → Serialization → Sort in Memory → Read/Write Disk → Merge Sort → Write Disk, each attributed to CPU, disk, or network (see the sketch below).
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231–239.
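To make the attribution concrete, here is a minimal Java sketch; the stage names and cost values are ours rather than the paper's. Each stage is mapped to the single resource it consumes, and a resource's occupation time is the sum over its stages:

```java
import java.util.EnumMap;
import java.util.Map;

// A minimal sketch (our own naming, not the paper's code) of the cost model's
// premise: each Map-task stage consumes exactly one resource, so a resource's
// occupation time is the sum of the costs of the stages attributed to it.
public class MapTaskCostSketch {
    enum Resource { CPU, DISK, NETWORK }

    // Hypothetical per-stage costs in seconds; in the real model each value
    // is a cost function of the job parameters (input size, record count, ...).
    static final Map<String, Double> STAGE_COST = Map.of(
            "initiation", 0.5, "readData", 2.0, "networkTransfer", 1.0,
            "createObject", 0.2, "mapFunction", 3.0, "serialization", 0.4,
            "sortInMemory", 1.5, "readWriteDisk", 2.5, "mergeSort", 1.0,
            "writeDisk", 1.8);

    // Which resource each stage occupies (one resource per stage).
    static final Map<String, Resource> STAGE_RESOURCE = Map.of(
            "initiation", Resource.CPU, "readData", Resource.DISK,
            "networkTransfer", Resource.NETWORK, "createObject", Resource.CPU,
            "mapFunction", Resource.CPU, "serialization", Resource.CPU,
            "sortInMemory", Resource.CPU, "readWriteDisk", Resource.DISK,
            "mergeSort", Resource.CPU, "writeDisk", Resource.DISK);

    public static void main(String[] args) {
        // Sum each resource's stages: the per-resource occupation times.
        Map<Resource, Double> occupation = new EnumMap<>(Resource.class);
        STAGE_COST.forEach((stage, cost) ->
                occupation.merge(STAGE_RESOURCE.get(stage), cost, Double::sum));
        occupation.forEach((r, t) ->
                System.out.printf("%s occupation: %.1f s%n", r, t));
    }
}
```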
Slide 12: Cost Model [1]
Cost-function parameter analysis:
- Type One: constants (Hadoop system overhead, initialization overhead).
- Type Two: job-related parameters (computational complexity of the map function, number of map input records).
- Type Three: parameters defined by the cost model (sorting coefficient, complexity factor).
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231–239.
Slide 13: Parameters Collection
Type One and Type Three:
- Type One: run empty map tasks and compute the system overhead from the logs.
- Type Three: extract the sort code from the Hadoop source and time it on a known number of records (see the sketch below).
Type Two:
- Naive approach: run the job once and analyze the logs (high latency, large overhead).
- Our approach: sample the input data and analyze only the behavior of the map and reduce functions (almost no latency, very low extra overhead). This is the Job Analyzer.
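As an illustration of the Type Three measurement, the following sketch times a plain Java array sort (the paper uses the sort code extracted from Hadoop itself) and derives the sorting coefficient from t ≈ c*N*log2(N):

```java
import java.util.Arrays;
import java.util.Random;

// A rough illustration of measuring a Type Three parameter (the paper extracts
// Hadoop's own sort code; here we time a plain array sort instead): estimate
// the sorting coefficient c in t ~ c * N * log2(N) for several values of N.
public class SortingCoefficient {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int n : new int[]{100_000, 1_000_000, 10_000_000}) {
            int[] records = rnd.ints(n).toArray();
            long start = System.nanoTime();
            Arrays.sort(records);
            double seconds = (System.nanoTime() - start) / 1e9;
            // Observed time divided by N * log2(N) gives the coefficient.
            double c = seconds / (n * (Math.log(n) / Math.log(2)));
            System.out.printf("N=%,d  t=%.3f s  c=%.3e%n", n, seconds, c);
        }
    }
}
```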
Slide 14: Job Analyzer - Implementation
The Job Analyzer is a Hadoop virtual execution environment that accepts the job's jar file and input data. It has three modules:
- Sampling Module: samples the input data at a fixed percentage (less than 5%).
- MR Module: instantiates the user job's classes using Java reflection (see the sketch below) and runs the sampled records through them.
- Analyze Module: extracts the job features: the input data (amount and number of records), the relative computational complexity, and the data conversion rates (output/input).
[Figure: jar file + input data → Sampling Module → MR Module → Analyze Module → job features]
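The slide only states that the MR Module instantiates the user job's classes via Java reflection; a minimal sketch of what that step could look like, with a hypothetical jar path and class name, is:

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

// A minimal sketch of the MR Module's reflection step, under our own
// assumptions: load the submitted jar and create an instance of the mapper
// class by name, much as Hadoop does when launching a task.
public class MrModuleSketch {
    public static Object instantiateMapper(String jarPath, String mapperClassName)
            throws Exception {
        // Load classes from the submitted jar file.
        URL[] urls = {new File(jarPath).toURI().toURL()};
        ClassLoader loader = new URLClassLoader(urls,
                MrModuleSketch.class.getClassLoader());
        // User mappers must expose a no-argument constructor.
        Class<?> mapperClass = Class.forName(mapperClassName, true, loader);
        return mapperClass.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical jar path and class name, for illustration only.
        Object mapper = instantiateMapper("/tmp/wordcount.jar",
                "example.WordCount$TokenizerMapper");
        System.out.println("Instantiated " + mapper.getClass().getName());
    }
}
```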
Slide 15: Job Analyzer - Feasibility
Why is a small sample enough?
- Data similarity: logs have a uniform format, so a sample is representative of the whole input.
- Execution similarity: every record is processed by the same map and reduce functions, so per-record behavior observed on the sample generalizes. A sketch of the sampling step follows.
[Figure: INPUT DATA → Split → Map tasks → Reduce tasks]
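A possible reading of the Sampling Module, sketched under our own assumptions (one record per line, Bernoulli sampling at the stated rate of less than 5%):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// A simple sketch of the Sampling Module's task (our own illustration):
// because records are uniform, a small random sample (< 5%) of the input
// lines is enough to observe the map function's per-record behavior.
public class SamplingSketch {
    public static List<String> sample(String path, double fraction) throws Exception {
        Random rnd = new Random();
        List<String> kept = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (rnd.nextDouble() < fraction) { // keep ~fraction of records
                    kept.add(line);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input path, for illustration only.
        List<String> records = sample("/tmp/input.log", 0.05);
        System.out.println("Sampled " + records.size() + " records");
    }
}
```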
Slide 16: Design - 2
Parameters collection feeding the cost model:
- Job Analyzer: collects the Type Two parameters.
- Static Parameters Collection Module: collects the Type One and Type Three parameters.
[Figure: Job → Job Analyzer + Static Parameters Collection Module → COST MODEL → Map/Reduce execution times and CPU/disk/network occupation times]
Slide 17: Prediction Model
Problem analysis: many steps run concurrently, so the total time cannot be obtained by adding up the times of the individual parts.
[Figure: Map stages (Initiation, Read Data, Network Transfer, Create Object, Map Function, Serialization, Sort in Memory, Read/Write Disk, Merge Sort, Write Disk) overlapping on the CPU, disk, and network timelines]
Slide 18: Prediction Model
Main factors for the Map stage (according to the performance model): the amount of input data (MapInput), the number of input records (N), the N*log(N) sorting term, the complexity of the map function, and the conversion rate of the map data. This gives the linear form (evaluated in the sketch below):
Tmap = α0 + α1*MapInput + α2*N + α3*N*log(N) + α4*(complexity of the map function) + α5*(conversion rate of the map data)
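The linear form translates directly into code. In this sketch the coefficients are made-up placeholders; in the system they are produced by the regression described on the following slides:

```java
// A direct transcription of the slide's linear model for the Map stage.
// The coefficients alpha[] are hypothetical here; in the system they come
// from the regression over the nearest history samples (Slides 21-23).
public class MapTimeModel {
    public static double predictMapTime(double[] alpha, double mapInput,
                                        double n, double complexity,
                                        double conversionRate) {
        return alpha[0]
                + alpha[1] * mapInput        // amount of input data
                + alpha[2] * n               // number of input records
                + alpha[3] * n * Math.log(n) // N*log(N) sorting term
                + alpha[4] * complexity      // complexity of the map function
                + alpha[5] * conversionRate; // conversion rate of the map data
    }

    public static void main(String[] args) {
        double[] alpha = {1.2, 0.03, 1e-5, 2e-7, 0.8, 0.5}; // made-up values
        System.out.printf("Tmap ~ %.2f s%n",
                predictMapTime(alpha, 64.0, 1_000_000, 1.5, 1.1));
    }
}
```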
Slide 19: Prediction Model
Experimental analysis: test 4 kinds of jobs (0–10,000 records), extract the features for linear regression, and compute the coefficient of determination (R², see the helper below):

Jobs:  Dedup   WordCount  Project  Grep    Total
R²:    0.9982  0.9992     0.9991   0.9949  0.6157
Slide 20: Prediction Model
[Figure: execution time of Map vs. number of records, one line per job type]
There is a very good linear relationship within the same kind of job, but no linear relationship across different kinds of jobs; hence the low overall R² on the previous slide.
Slide21Find the
nearest jobs!Instance-Based Linear RegressionFind the nearest samples to the jobs to be predicted in history logs “nearest”-> similar jobs (Top K nearest, with K=10%-15%)Do linear regression to the samples we have foundCalculate the prediction valueNearest:The weighted distance of job features (weight w)High contribution for job classification:map/reduce complexity,map/reduce data conversion rate
Low
contribution
for
job
classification
:
Data
amount
、
Number
of
records
21
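A sketch of the neighbor search under our own naming and with hypothetical weights: weight the classification-relevant features (complexity, conversion rates) heavily, the rest lightly, and keep the top K fraction of the history samples:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// A sketch of the "find the nearest jobs" step under our own naming: rank the
// history-log samples by a weighted distance over the job features and keep
// the top K (10%-15% of the samples); a linear regression is then fitted
// over the neighbors that are kept.
public class FindNeighbors {
    // Feature vector layout: {complexity, conversionRate, dataAmount, numRecords}.
    // Classification-relevant features get large weights, the others small
    // ones; the concrete weight values here are hypothetical.
    static final double[] W = {1.0, 1.0, 0.1, 0.1};

    static double weightedDistance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            d += W[i] * (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(d);
    }

    static List<double[]> nearest(List<double[]> history, double[] query,
                                  double kRatio) {
        int k = Math.max(1, (int) (history.size() * kRatio));
        return history.stream()
                .sorted(Comparator.comparingDouble(
                        (double[] s) -> weightedDistance(s, query)))
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<double[]> history = List.of(
                new double[]{1.5, 1.1, 64, 1e6},
                new double[]{1.4, 1.0, 128, 2e6},
                new double[]{9.0, 3.0, 64, 1e6});
        double[] job = {1.5, 1.05, 96, 1.5e6};
        System.out.println(nearest(history, job, 0.12).size() + " neighbor(s) kept");
    }
}
```

Expressing K as a fraction of the history size lets the neighborhood grow with the volume of logged jobs, matching the slide's K = 10%–15% rule.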
Slide 22: Prediction Module
Procedure: the cost model identifies the main factors, giving Tmap = α0 + α1*MapInput + α2*N + α3*N*log(N) + α4*(complexity of the map function) + α5*(conversion rate of the map data); the job features are then used to search for the nearest samples, the prediction function is fitted over them, and the prediction results are computed.
[Figure: the seven numbered steps of the prediction procedure]
Slide 23: Prediction Module
[Figure: procedure overview: training set → Find-Neighbor Module → prediction function (derived from the cost model) → prediction results]
Slide 24: Design - 3
Complete architecture: the Job Analyzer collects the Type Two parameters and the Static Parameters Collection Module collects the Type One and Type Three parameters; the cost model covers the CPU, disk, and network occupation times, while the Prediction Module produces the Map and Reduce execution times.
[Figure: Job → Job Analyzer + Static Parameters Collection Module → COST MODEL + Prediction Module → predictions]
Slide 25: Experiments
Task execution time (error rate) for 4 kinds of jobs, with input sizes from 64 MB to 8 GB, under three settings:
- K = 12%, with a different weight w for each feature;
- K = 12%, with the same weight w for each feature;
- K = 25%, with a different weight w for each feature.
[Figure: error rate per job ID under each setting]
Slide 26: Conclusion
- Job Analyzer: analyzes the job (jar + input file) and collects the parameters.
- Prediction Module: finds the main factors, proposes a linear equation, classifies jobs, and predicts multiple metrics.
Slide 27
Thank you! Questions?
Slide 28: Cost Model [1]
Analysis of the Reduce task:
- model the consumption of each resource (CPU, disk, network);
- each stage involves only one type of resource.
Reduce stages: Initiation → Read Data → Network Transfer → Create Object → Deserialization → Reduce Function → Merge Sort → Read/Write Disk → Serialization → Write Disk, each attributed to CPU, disk, or network.
Slide 29: Prediction Model
Main factors for the Reduce stage (according to the performance model): the amount of input data, the number of input records (N), the N*log(N) term, the complexity of the reduce function, and the conversion rates of the map and reduce data. This gives:
Treduce = β0 + β1*MapInput + β2*N + β3*N*log(N) + β4*(complexity of the reduce function) + β5*(conversion rate of the map data) + β6*(conversion rate of the reduce data)