Presentation Transcript

Slide 1

Data-Intensive Computing with MapReduce

Jimmy Lin, University of Maryland
Thursday, January 31, 2013

Session 2: Hadoop Nuts and Bolts

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Slide 2

Source: Wikipedia (The Scream)

Slide 3

Source: Wikipedia (Japanese rock garden)

Slide 4

Source: Wikipedia (Keychain)

Slide 5

How will I actually learn Hadoop?
- This class session
- Hadoop: The Definitive Guide
- RTFM
- RTFC(!)

Slide 6

Materials in the course:
- The course homepage
- Hadoop: The Definitive Guide
- Data-Intensive Text Processing with MapReduce
- Cloud9

Take advantage of GitHub! Clone, branch, send pull requests.

Slide 7

Source: Wikipedia (Mahout)

Slide 8

Basic Hadoop API*

Mapper
- void setup(Mapper.Context context): called once at the beginning of the task
- void map(K key, V value, Mapper.Context context): called once for each key/value pair in the input split
- void cleanup(Mapper.Context context): called once at the end of the task

Reducer/Combiner
- void setup(Reducer.Context context): called once at the start of the task
- void reduce(K key, Iterable<V> values, Reducer.Context context): called once for each key
- void cleanup(Reducer.Context context): called once at the end of the task

*Note that there are two versions of the API!
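To make the lifecycle concrete, here is a minimal sketch (not from the deck) of the in-mapper combining pattern described in Data-Intensive Text Processing with MapReduce: setup() creates per-task state, map() accumulates partial counts, and cleanup() emits them once at the end. Class and field names are illustrative; as with the word-count classes on the following slides, this would sit inside a driver class with the usual java.util, org.apache.hadoop.io, and org.apache.hadoop.mapreduce imports.

private static class InMapperCountingMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Per-task state, created once per task in setup().
  private Map<String, Integer> counts;

  @Override
  public void setup(Context context) {
    counts = new HashMap<String, Integer>();
  }

  @Override
  public void map(LongWritable key, Text value, Context context) {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();
      Integer c = counts.get(token);
      counts.put(token, c == null ? 1 : c + 1);
    }
  }

  @Override
  public void cleanup(Context context)
      throws IOException, InterruptedException {
    // Emit the aggregated partial counts once, at the end of the task.
    Text word = new Text();
    IntWritable count = new IntWritable();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      word.set(e.getKey());
      count.set(e.getValue());
      context.write(word, count);
    }
  }
}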

Slide 9

Basic Hadoop API*

Partitioner
- int getPartition(K key, V value, int numPartitions): get the partition number given the total number of partitions

Job
- Represents a packaged Hadoop job for submission to the cluster
- Need to specify input and output paths
- Need to specify input and output formats
- Need to specify mapper, reducer, combiner, partitioner classes
- Need to specify intermediate/final key/value classes
- Need to specify the number of reducers (but not mappers, why?)
- Don't depend on defaults! (See the driver sketch below.)

*Note that there are two versions of the API!
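A hedged sketch of what such a Job setup might look like with the newer (org.apache.hadoop.mapreduce) API, wired to the MyMapper and MyReducer classes from the word-count slides later in the deck (assume they are nested in, or visible to, this driver class); the job name, paths, and reducer count are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Input and output paths (taken from the command line here).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Input and output formats.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    // Mapper, combiner, reducer, partitioner classes.
    job.setMapperClass(MyMapper.class);
    job.setCombinerClass(MyReducer.class);
    job.setReducerClass(MyReducer.class);
    job.setPartitionerClass(HashPartitioner.class);

    // Intermediate and final key/value classes.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Number of reducers (illustrative); the number of mappers follows from the input splits.
    job.setNumReduceTasks(4);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}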

Slide 10

A tale of two packages…
- org.apache.hadoop.mapreduce (the "new" API)
- org.apache.hadoop.mapred (the "old" API)

Source: Wikipedia (Budapest)

Slide 11

Data Types in Hadoop: Keys and Values
- Writable: defines a de/serialization protocol. Every data type in Hadoop is a Writable.
- WritableComparable: defines a sort order. All keys must be of this type (but not values).
- IntWritable, LongWritable, Text, …: concrete classes for different data types.
- SequenceFiles: binary-encoded sequences of key/value pairs.
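As a quick, hedged illustration of the last point, something along these lines writes a SequenceFile of Text/IntWritable pairs (using the older SequenceFile.createWriter signature; the path and the pairs are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("demo.seq");  // placeholder path

    // Keys and values are Writables; keys are typically WritableComparable.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    try {
      writer.append(new Text("hello"), new IntWritable(1));
      writer.append(new Text("world"), new IntWritable(2));
    } finally {
      writer.close();
    }
  }
}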

Slide 12

"Hello World": Word Count

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);

Slide 13

"Hello World": Word Count

private static class MyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final static Text WORD = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = ((Text) value).toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      WORD.set(itr.nextToken());
      context.write(WORD, ONE);
    }
  }
}

Slide 14

"Hello World": Word Count

private static class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final static IntWritable SUM = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    Iterator<IntWritable> iter = values.iterator();
    int sum = 0;
    while (iter.hasNext()) {
      sum += iter.next().get();
    }
    SUM.set(sum);
    context.write(key, SUM);
  }
}

Slide 15

Three Gotchas
- Avoid object creation at all costs: reuse Writable objects, change the payload
- The execution framework reuses the value object in the reducer (see the sketch below)
- Passing parameters via class statics (doesn't work across JVMs)
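A hedged sketch of the second gotcha: the framework hands the reducer the same IntWritable object on every iteration, so caching references gives wrong results; copy the payload out instead. These are two alternative fragments inside a reduce() method (you can only iterate values once per call); variable names are illustrative.

// Inside reduce(Text key, Iterable<IntWritable> values, Context context):

// BROKEN: every list element ends up pointing at the same reused object,
// so after the loop every cached "value" holds the last payload seen.
List<IntWritable> cached = new ArrayList<IntWritable>();
for (IntWritable v : values) {
  cached.add(v);
}

// OK: copy the payload out of the reused object (or into a fresh Writable).
List<Integer> copied = new ArrayList<Integer>();
for (IntWritable v : values) {
  copied.add(v.get());
}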

Slide 16

Getting Data to Mappers and Reducers
- Configuration parameters: set directly on the Job object (see the sketch below)
- "Side data": DistributedCache, or have mappers/reducers read from HDFS in the setup method
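A hedged sketch of the first option, using a made-up parameter name (wordcount.caseSensitive): the driver sets it on the job's Configuration, and the mapper reads it back in setup().

// In the driver, before submitting the job:
job.getConfiguration().setBoolean("wordcount.caseSensitive", false);

// In the mapper (or reducer):
private boolean caseSensitive;

@Override
public void setup(Context context) {
  // Parameter name and default value are illustrative.
  caseSensitive = context.getConfiguration().getBoolean("wordcount.caseSensitive", true);
}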

Slide 17

Complex Data Types in Hadoop

How do you implement complex data types?
- The easiest way: encode them as Text, e.g., (a, b) = "a:b", and use regular expressions to parse and extract the data. Works, but pretty hack-ish.
- The hard way: define a custom implementation of Writable(Comparable). Must implement readFields, write, (compareTo). Computationally efficient, but slow for rapid prototyping. Implement the WritableComparator hook for performance.
- Somewhere in the middle: Cloud9 offers JSON support and lots of useful Hadoop types.

Quick tour…
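As a hedged illustration of "the hard way", a minimal pair-of-ints type might look like this. Cloud9 ships a similar PairOfInts; this sketch is illustrative and is not its actual source.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class PairOfInts implements WritableComparable<PairOfInts> {
  private int left;
  private int right;

  public PairOfInts() {}  // no-arg constructor required for deserialization
  public PairOfInts(int left, int right) { this.left = left; this.right = right; }

  @Override
  public void write(DataOutput out) throws IOException {    // serialization
    out.writeInt(left);
    out.writeInt(right);
  }

  @Override
  public void readFields(DataInput in) throws IOException {  // deserialization
    left = in.readInt();
    right = in.readInt();
  }

  @Override
  public int compareTo(PairOfInts other) {                    // sort order (keys only)
    if (left != other.left) return left < other.left ? -1 : 1;
    if (right != other.right) return right < other.right ? -1 : 1;
    return 0;
  }

  // In practice you would also override hashCode()/equals() so the default
  // HashPartitioner routes equal keys to the same reducer.
}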

Slide 18

Basic Cluster Components*
- One of each:
  - Namenode (NN): master node for HDFS
  - Jobtracker (JT): master node for job submission
- Set of each per slave machine:
  - Tasktracker (TT): contains multiple task slots
  - Datanode (DN): serves HDFS data blocks

* Not quite… leaving aside YARN for now

Slide 19

Putting everything together…

[Diagram: each slave node runs a tasktracker and a datanode daemon on top of the Linux file system; the namenode daemon runs on the namenode, and the jobtracker runs on the job submission node.]

Slide 20

Anatomy of a Job
- A MapReduce program in Hadoop = a Hadoop job
- Jobs are divided into map and reduce tasks
- An instance of a running task is called a task attempt (occupies a slot)
- Multiple jobs can be composed into a workflow
- Job submission: the client (i.e., driver program) creates a job, configures it, and submits it to the jobtracker
- That's it! The Hadoop cluster takes over…

Slide 21

Anatomy of a Job

Behind the scenes:
- Input splits are computed (on the client end)
- Job data (jar, configuration XML) are sent to the JobTracker
- The JobTracker puts the job data in a shared location and enqueues tasks
- TaskTrackers poll for tasks
- Off to the races…

Slide 22

[Diagram: the InputFormat divides the input files into InputSplits; each InputSplit is consumed by a RecordReader, which feeds a Mapper, which produces Intermediates. Source: redrawn from a slide by Cloudera, cc-licensed.]

Slide 23

[Diagram: the client computes the InputSplits; for each split, a RecordReader produces records that are passed to a Mapper.]

Slide 24

[Diagram: each Mapper's output passes through a Partitioner to produce Intermediates, which are routed to the appropriate Reducer (combiners omitted here). Source: redrawn from a slide by Cloudera, cc-licensed.]

Slide 25

[Diagram: each Reducer writes its results through a RecordWriter, provided by the OutputFormat, to an Output File. Source: redrawn from a slide by Cloudera, cc-licensed.]

Slide 26

Input and Output
- InputFormat: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, …
- OutputFormat: TextOutputFormat, SequenceFileOutputFormat, …

Slide 27

Shuffle and Sort in Hadoop

Probably the most complex aspect of MapReduce.

Map side:
- Map outputs are buffered in memory in a circular buffer
- When the buffer reaches a threshold, contents are "spilled" to disk
- Spills are merged into a single, partitioned file (sorted within each partition); the combiner runs during the merges

Reduce side:
- First, map outputs are copied over to the reducer machine
- "Sort" is a multi-pass merge of map outputs (happens in memory and on disk); the combiner runs during the merges
- The final merge pass goes directly into the reducer
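The buffer size and spill threshold are tunable. A hedged sketch using the Hadoop 1.x property names (io.sort.mb and io.sort.spill.percent; later releases renamed these, so check your version's documentation) with illustrative values:

// Set on the job's Configuration in the driver; values are illustrative.
Configuration conf = new Configuration();
conf.setInt("io.sort.mb", 200);                 // map-side circular buffer size, in MB
conf.setFloat("io.sort.spill.percent", 0.90f);  // spill to disk when the buffer is 90% full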

Slide 28

Shuffle and Sort

[Diagram: Mapper output goes into a circular buffer (in memory), is spilled to disk, and the spills are merged (with the Combiner running) into intermediate files (on disk); these are copied to the Reducer (and to other reducers, alongside output from other mappers), with the Combiner again running during the reduce-side merge.]

Slide 29

Hadoop Workflow (between you and the Hadoop cluster):
1. Load data into HDFS
2. Develop code locally
3. Submit the MapReduce job
3a. Go back to Step 2
4. Retrieve data from HDFS

Slide 30

Recommended Workflow

Here's how I work:
- Develop code in Eclipse on the host machine
- Build the distribution on the host machine
- Check out a copy of the code on the VM
- Copy (i.e., scp) jars over to the VM (in the same directory structure)
- Run the job on the VM
- Iterate…
  - Commit code on the host machine and push
  - Pull from inside the VM, verify
- Avoid using the UI of the VM; directly ssh into the VM

Slide 31

Actually Running the Job

$HADOOP_CLASSPATH

hadoop jar MYJAR.jar -D k1=v1 ... -libjars foo.jar,bar.jar \
    my.class.to.run arg1 arg2 arg3 …
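Generic options such as -D and -libjars are parsed by Hadoop's GenericOptionsParser, which kicks in when the driver implements Tool and is launched through ToolRunner. A hedged sketch (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already reflects any -D k=v settings passed on the command line.
    Configuration conf = getConf();
    // ... build and submit the Job as in the earlier driver sketch ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options before handing the remaining args to run().
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}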

Slide 32

Debugging Hadoop
- First, take a deep breath
- Start small, start locally
- Build incrementally

Slide 33

Code Execution Environments

Different ways to run code:
- Plain Java
- Local (standalone) mode
- Pseudo-distributed mode
- Fully-distributed mode

Learn what's good for what.

Slide 34

Hadoop Debugging Strategies
- Good ol' System.out.println
  - Learn to use the webapp to access logs
  - Logging is preferred over System.out.println
  - Be careful how much you log!
- Fail on success
  - Throw RuntimeExceptions and capture state
- Programming is still programming
  - Use Hadoop as the "glue"
  - Implement core functionality outside mappers and reducers
  - Independently test (e.g., unit testing; see the sketch below)
  - Compose (tested) components in mappers and reducers
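A hedged illustration of the last strategy: pull the tokenization out of the mapper into a plain static helper and unit-test it with no Hadoop in sight. The tokenize helper and its behavior are assumptions for the sake of the example.

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class TokenizerTest {
  // Hypothetical helper that the mapper would also call;
  // the core logic lives outside the mapper so it can be tested directly.
  static List<String> tokenize(String line) {
    return Arrays.asList(line.toLowerCase().split("\\s+"));
  }

  @Test
  public void splitsOnWhitespaceAndLowercases() {
    assertEquals(Arrays.asList("hello", "hadoop", "world"),
        tokenize("Hello   Hadoop\tworld"));
  }
}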

Slide 35

Source: Wikipedia (Japanese rock garden)

Questions?

Assignment 2 due in two weeks…