Presentation Transcript

Slide1

Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs

Herodotos Herodotou, Shivnath Babu

Duke University

Slide2

Analysis in the Big Data Era

Popular option: the Hadoop software stack

[Figure: Hadoop software stack — a MapReduce execution engine on top of a distributed file system (Hadoop), programmed in Java / C++ / R / Python, with higher-level systems such as Oozie, Hive, Pig, Elastic MapReduce, Jaql, and HBase layered above]

Slide3

Analysis in the Big Data Era

Popular option: the Hadoop software stack

Who are the users? Data analysts, statisticians, computational scientists… researchers, developers, testers… You!

Who performs setup and tuning? The users! They usually lack the expertise to tune the system.

Slide4

Problem Overview

Goal: Enable Hadoop users and applications to get good performance automatically

Part of the Starfish system; this talk: tuning individual MapReduce jobs

Challenges:

Heavy use of programming languages for MapReduce programs and UDFs (e.g., Java/Python)

Data loaded/accessed as opaque files

Large space of tuning choices

Slide5

MapReduce Job Execution

[Figure: four map tasks process splits 0-3, running in two map waves; their outputs are shuffled to two reduce tasks, which run in one reduce wave and write out 0 and out 1]

job j = <program p, data d, resources r, configuration c>

Slide6

Optimizing MapReduce Job Execution

Space of configuration choices:

Number of map tasks

Number of reduce tasks

Partitioning of map outputs to reduce tasks

Memory allocation to task-level buffers

Multiphase external sorting in the tasks

Whether output data from tasks should be compressed

Whether the combine function should be used

job j = <program p, data d, resources r, configuration c>
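For concreteness, here is a minimal sketch of how a few of these choices map onto Hadoop job configuration settings of that era; the parameter names match the defaults table at the end of the deck, while the values and the placeholder combiner are arbitrary examples rather than recommendations from the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Illustrative only: a few of the many job configuration parameters that
// make up the configuration c in job j = <program p, data d, resources r, configuration c>.
public class ExampleJobConfiguration {

    public static JobConf configure(Configuration base) {
        JobConf conf = new JobConf(base);
        conf.setNumReduceTasks(20);                           // number of reduce tasks
        conf.setInt("io.sort.mb", 200);                       // memory allocated to the map-side sort buffer (MB)
        conf.setFloat("io.sort.record.percent", 0.15f);       // buffer fraction reserved for record metadata
        conf.setInt("io.sort.factor", 50);                    // streams merged at once during multiphase external sorting
        conf.setBoolean("mapred.compress.map.output", true);  // compress intermediate map output
        conf.setCombinerClass(IdentityReducer.class);         // whether (and which) combine function is used; placeholder class
        return conf;
    }
}
```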

Slide7

Optimizing MapReduce Job Execution

Use defaults or set manually (rules-of-thumb). Rules-of-thumb may not suffice.

[Figure: 2-dim projection of the 13-dim performance surface, with the rules-of-thumb settings marked]

Slide8

Applying Cost-based Optimization

Goal: automatically find the configuration settings that give good job performance

Just-in-Time Optimizer: searches through the space S of parameter settings

What-if Engine: estimates perf using the properties of p, d, r, and c

Challenge: How to capture the properties of an arbitrary MapReduce program p?

Slide9

Job Profile

Concise representation of program execution as a job

Records information at the level of "task phases"

Generated by the Profiler through measurement, or by the What-if Engine through estimation

[Figure: map task phases — Read the split from the DFS, Map (user-defined map function), Collect (serialize, partition) into the memory buffer, Spill (sort, [combine], [compress]) to disk, and Merge of the spill files]

Slide10

Job Profile Fields

Dataflow: amount of data flowing through task phases (e.g., map output bytes, number of map-side spills, number of records in buffer per spill)

Costs: execution times at the level of task phases (e.g., Read, Map, and Spill phase times in the map task)

Dataflow Statistics: statistical information about the dataflow (e.g., Map func's selectivity (output / input), map output compression ratio, size of records (keys and values))

Cost Statistics: statistical information about the costs (e.g., I/O cost for reading from local disk per byte, CPU cost for executing the Map func per record, CPU cost for uncompressing the input per byte)
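To make the four groups of fields concrete, here is a minimal sketch of how a map-task profile could be represented; the class, enum, and field names are illustrative, not the Starfish schema.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative sketch of a map-task profile with the four field groups above.
public class MapTaskProfile {
    enum DataflowField { MAP_OUTPUT_BYTES, NUM_SPILLS, RECORDS_PER_SPILL }
    enum CostField { READ_PHASE_MS, MAP_PHASE_MS, SPILL_PHASE_MS }
    enum DataflowStatistic { MAP_SELECTIVITY, MAP_OUTPUT_COMPRESSION_RATIO, KEY_VALUE_BYTES }
    enum CostStatistic { LOCAL_IO_COST_PER_BYTE, MAP_CPU_COST_PER_RECORD, UNCOMPRESS_CPU_COST_PER_BYTE }

    // Dataflow and Costs describe one observed (or predicted) execution, while the
    // Statistics are largely independent of the configuration c; that independence is
    // what lets the What-if Engine reuse them when estimating hypothetical jobs.
    final Map<DataflowField, Long> dataflow = new EnumMap<>(DataflowField.class);
    final Map<CostField, Long> costs = new EnumMap<>(CostField.class);
    final Map<DataflowStatistic, Double> dataflowStatistics = new EnumMap<>(DataflowStatistic.class);
    final Map<CostStatistic, Double> costStatistics = new EnumMap<>(CostStatistic.class);
}
```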

Slide11

Generating Profiles by Measurement

Goals:

Have zero overhead when profiling is turned off

Require no modifications to Hadoop

Support unmodified MapReduce programs written in Java or Hadoop Streaming/Pipes (Python/Ruby/C++)

Dynamic instrumentation:

Monitors task phases of MapReduce job execution

Event-condition-action rules are specified, leading to run-time instrumentation of Hadoop internals

We currently use BTrace (Hadoop internals are in Java)
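As a rough illustration of such an event-condition-action rule, here is a minimal BTrace-style probe; the probed Hadoop class and method are placeholders rather than the actual probe points used by the Profiler.

```java
import com.sun.btrace.annotations.BTrace;
import com.sun.btrace.annotations.Duration;
import com.sun.btrace.annotations.Kind;
import com.sun.btrace.annotations.Location;
import com.sun.btrace.annotations.OnMethod;
import static com.sun.btrace.BTraceUtils.println;
import static com.sun.btrace.BTraceUtils.str;
import static com.sun.btrace.BTraceUtils.strcat;

// Event: a probed Hadoop-internal method returns.
// Action: record how long it ran, without modifying Hadoop itself.
@BTrace
public class MapPhaseProbe {
    // "MapRunner.run" is a placeholder probe point for the Map phase.
    @OnMethod(clazz = "org.apache.hadoop.mapred.MapRunner",
              method = "run",
              location = @Location(Kind.RETURN))
    public static void onMapPhaseDone(@Duration long durationNanos) {
        println(strcat("MAP phase took (ns): ", str(durationNanos)));
    }
}
```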

Slide12

Generating Profiles by Measurement

[Figure: profiling is enabled on individual map and reduce tasks; the raw monitoring data from each task is turned into map profiles and reduce profiles, which are then combined into the job profile]

Use of sampling: applies both to profiling (which tasks to profile) and to task execution (which tasks to run)

Slide13

What-if Engine

[Figure: the What-if Engine consists of a Job Oracle and a Task Scheduler Simulator. Inputs: the job profile for <p, d1, r1, c1>, input data properties <d2>, cluster resources <r2>, and configuration settings <c2>. Output: a virtual job profile for <p, d2, r2, c2>, i.e., the properties of the hypothetical job]
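Read as an API, the picture above suggests an interface roughly like the following; the interface and type names are assumptions for exposition, not actual Starfish classes.

```java
// Illustrative interface only; the types echo the <p, d, r, c> notation from the slides.
public interface WhatIfEngine {

    /** Estimate the profile of the hypothetical job <p, d2, r2, c2>
     *  from the measured profile of <p, d1, r1, c1>. */
    JobProfile virtualProfile(JobProfile measuredProfile,
                              DataProperties d2,
                              ClusterResources r2,
                              Configuration c2);

    /** Predicted running time of the hypothetical job, obtained by feeding
     *  the virtual profile to the Task Scheduler Simulator. */
    double predictRunningTime(JobProfile measuredProfile,
                              DataProperties d2,
                              ClusterResources r2,
                              Configuration c2);

    // Placeholder types for exposition.
    interface JobProfile {}
    interface DataProperties {}
    interface ClusterResources {}
    interface Configuration {}
}
```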

Slide14

Virtual Profile Estimation

Given the profile for job j = <p, d1, r1, c1>, estimate the profile for job j' = <p, d2, r2, c2>

[Figure: the (virtual) profile for j' is built from the profile for j together with the input data d2, configuration c2, and resources r2. Its Dataflow Statistics are estimated with cardinality models, its Cost Statistics with relative black-box models, and its Dataflow and Costs with white-box models]

Slide15

White-box Models

Detailed set of equations for Hadoop

Example: calculate the dataflow in each task phase of a map task, from the input data properties, the dataflow statistics, and the configuration parameters

[Figure: the map task phases again — Read the split from the DFS, Map, Collect (serialize, partition) into the memory buffer, Spill (sort, [combine], [compress]), and Merge]
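To give the flavor of one such equation, here is a sketch of estimating the number of map-side spills from the dataflow fields and the io.sort.* parameters listed in the appendix table; it assumes standard Hadoop 0.20 spill behavior (16 bytes of accounting metadata per record) rather than the exact formulas used in the talk.

```java
// Sketch of one white-box dataflow equation: estimated number of map-side spills.
public final class SpillModel {
    static long estimateNumSpills(long mapOutputBytes,        // dataflow: serialized map output size
                                  long mapOutputRecords,      // dataflow: number of map output records
                                  double ioSortMb,            // configuration: io.sort.mb
                                  double ioSortRecordPercent, // configuration: io.sort.record.percent
                                  double ioSortSpillPercent)  // configuration: io.sort.spill.percent
    {
        double bufferBytes = ioSortMb * 1024 * 1024;
        // Data part of the buffer and its spill threshold.
        double dataCapacity = (1 - ioSortRecordPercent) * bufferBytes * ioSortSpillPercent;
        // Accounting part: 16 bytes per record, with the same spill threshold.
        double recordCapacity = (ioSortRecordPercent * bufferBytes / 16) * ioSortSpillPercent;

        long spillsByData = (long) Math.ceil(mapOutputBytes / dataCapacity);
        long spillsByRecords = (long) Math.ceil(mapOutputRecords / recordCapacity);
        return Math.max(spillsByData, spillsByRecords); // whichever part fills first drives the spills
    }

    public static void main(String[] args) {
        // Example: 512 MB of map output, 4M records, default configuration from the appendix table.
        System.out.println(estimateNumSpills(512L << 20, 4_000_000L, 100, 0.05, 0.8));
    }
}
```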

Slide16

Just-in-Time Optimizer

[Figure: the Just-in-Time Optimizer takes the job profile for <p, d1, r1, c1>, the input data properties <d2>, and the cluster resources <r2>; it enumerates (sub)spaces of the configuration space, runs recursive random search over them, and issues what-if calls to find the best configuration settings <c_opt> for <p, d2, r2>]

Slide17

Recursive Random Search

[Figure: random space points (configuration settings) sampled from the parameter space; the search recursively zooms into promising subregions]

Use the What-if Engine to cost each space point
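A compact sketch of how recursive random search can drive those what-if calls; the recursion depth, shrink factor, and sample counts are illustrative choices, not the settings used in the talk.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Illustrative recursive random search over a dims-dimensional parameter space.
// cost(point) stands in for one What-if Engine call that returns predicted running time.
public final class RecursiveRandomSearch {
    private static final Random RNG = new Random(42);

    static double[] search(ToDoubleFunction<double[]> cost, int dims,
                           double[] lo, double[] hi, int samplesPerLevel, int levels) {
        double[] best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        // Exploration: sample random points in the current subspace.
        for (int i = 0; i < samplesPerLevel; i++) {
            double[] p = new double[dims];
            for (int d = 0; d < dims; d++) {
                p[d] = lo[d] + RNG.nextDouble() * (hi[d] - lo[d]);
            }
            double c = cost.applyAsDouble(p);  // one what-if call per sampled point
            if (c < bestCost) { bestCost = c; best = p; }
        }
        if (levels == 0) return best;
        // Exploitation: recurse into a shrunken subspace around the best point found so far.
        double[] newLo = new double[dims], newHi = new double[dims];
        for (int d = 0; d < dims; d++) {
            double radius = 0.25 * (hi[d] - lo[d]);
            newLo[d] = Math.max(lo[d], best[d] - radius);
            newHi[d] = Math.min(hi[d], best[d] + radius);
        }
        double[] refined = search(cost, dims, newLo, newHi, samplesPerLevel, levels - 1);
        return cost.applyAsDouble(refined) < bestCost ? refined : best;
    }
}
```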

Slide18

Experimental Methodology

15-30 Amazon EC2 nodes, various instance types

Cluster-level configurations based on rules of thumb

Data sizes: 10-180 GB

Rule-based Optimizer vs. Cost-based Optimizer

Abbr. | MapReduce Program   | Domain                | Dataset
CO    | Word Co-occurrence  | NLP                   | Wikipedia
WC    | WordCount           | Text Analytics        | Wikipedia
TS    | TeraSort            | Business Analytics    | TeraGen
LG    | LinkGraph           | Graph Processing      | Wikipedia (compressed)
JO    | Join                | Business Analytics    | TPC-H
TF    | TF-IDF              | Information Retrieval | Wikipedia

Slide19

Job Optimizer Evaluation

Hadoop cluster: 30 nodes, m1.xlarge; data sizes: 60-180 GB

Slide20

Job Optimizer Evaluation

Hadoop cluster: 30 nodes, m1.xlarge; data sizes: 60-180 GB

Slide21

Estimates from the What-if Engine

Hadoop cluster: 16 nodes, c1.medium; MapReduce program: Word Co-occurrence; data set: 10 GB Wikipedia

[Figure: true surface vs. estimated surface]

Slide22

Estimates from the What-if Engine

Profiling on the Test cluster, prediction on the Production cluster

Test cluster: 10 nodes, m1.large, 60 GB; Production cluster: 30 nodes, m1.xlarge, 180 GB

Slide23

Profiling Overhead vs. Benefit

Hadoop cluster: 16 nodes, c1.medium; MapReduce program: Word Co-occurrence; data set: 10 GB Wikipedia

Slide24

Conclusion

What have we achieved?

Perform in-depth job analysis with profiles

Predict the behavior of hypothetical job executions

Optimize arbitrary MapReduce programs

What's next?

Optimize job workflows/workloads

Address the cluster sizing (provisioning) problem

Perform data layout tuning

Slide25

Starfish: Self-tuning Analytics System

www.cs.duke.edu/starfish

Software Release: Starfish v0.2.0

Demo Session C: Thursday, 10:30-12:00, Grand Crescent

Slide26

Hadoop Configuration Parameters

Parameter                               | Default Value
io.sort.mb                              | 100
io.sort.record.percent                  | 0.05
io.sort.spill.percent                   | 0.8
io.sort.factor                          | 10
mapreduce.combine.class                 | null
min.num.spills.for.combine              | 3
mapred.compress.map.output              | false
mapred.reduce.tasks                     | 1
mapred.job.shuffle.input.buffer.percent | 0.7
mapred.job.shuffle.merge.percent        | 0.66
mapred.inmem.merge.threshold            | 1000
mapred.job.reduce.input.buffer.percent  | 0
mapred.output.compress                  | false
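For reference, any of these defaults can be overridden per job. One common route, assuming the job's driver goes through Hadoop's ToolRunner as in the sketch below, is to pass -D flags on the command line; the jar name, driver class, and values are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Minimal driver sketch: because it goes through ToolRunner, any parameter in the
// table above can be overridden per job with -D flags, e.g.
//   hadoop jar myapp.jar ExampleDriver -D io.sort.mb=200 -D mapred.reduce.tasks=40 ...
public class ExampleDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();  // already contains the -D overrides
        System.out.println("io.sort.mb = " + conf.get("io.sort.mb", "100"));
        // ... set up and submit the MapReduce job here ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ExampleDriver(), args));
    }
}
```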

Slide27

Amazon EC2 Node Types

Node Type  | CPU (EC2 Units) | Mem (GB) | Storage (GB) | Cost ($/hour) | Map Slots per Node | Reduce Slots per Node | Max Mem per Slot (MB)
m1.small   | 1               | 1.7      | 160          | 0.085         | 2                  | 1                     | 300
m1.large   | 4               | 7.5      | 850          | 0.343         | 3                  | 2                     | 1024
m1.xlarge  | 8               | 15       | 1690         | 0.68          | 4                  | 4                     | 1536
c1.medium  | 5               | 1.7      | 350          | 0.172         | 2                  | 2                     | 300
c1.xlarge  | 20              | 7        | 1690         | 0.68          | 8                  | 6                     | 400

Slide28

Input Data & Cluster Properties

Input Data Properties: data size, block size, compression

Cluster Properties: number of nodes, number of map slots per node, number of reduce slots per node, maximum memory per task slot