Slide1
Map reduce
5/24/2011
Map Reduce
Slide2
Word Count Example
We have a large file of words, one word to a line
Count the number of times each distinct word appears in the file
Sample application: analyze web server logs to find popular URLs
Slide3
Word Count (2)
Case 1: Entire file fits in memory
Slide4
Word Count (2)
Case 1: Entire file fits in memory
Case 2: File too large for memory, but all <word, count> pairs fit in memory
Slide5
Word Count (2)
Case 1: Entire file fits in memory
Case 2: File too large for memory, but all <word, count> pairs fit in memory
Case 3: File on disk, too many distinct words to fit in memory
  sort datafile | uniq -c
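The cases above can be sketched in Python. This is an illustrative sketch rather than anything from the slides: the function names `count_in_memory` and `count_sorted_stream` are hypothetical. The first covers Cases 1 and 2 (only the <word, count> pairs are held in memory); the second mirrors `uniq -c` for Case 3 and assumes the input has already been sorted, for example by an external on-disk sort.

```python
from collections import Counter
from itertools import groupby


def count_in_memory(lines):
    """Cases 1 and 2: stream the words, keep only <word, count> pairs in memory."""
    counts = Counter()
    for line in lines:
        word = line.strip()
        if word:
            counts[word] += 1
    return counts


def count_sorted_stream(sorted_lines):
    """Case 3: like `uniq -c`, count runs of equal words in already-sorted input.

    Only the current word and its running count are held in memory.
    """
    words = (line.strip() for line in sorted_lines)
    for word, run in groupby(w for w in words if w):
        yield word, sum(1 for _ in run)


if __name__ == "__main__":
    data = ["b\n", "a\n", "b\n", "c\n", "a\n", "b\n"]
    print(count_in_memory(data))
    # sorted() stands in for an external sort such as the `sort` command.
    print(list(count_sorted_stream(sorted(data))))
```

The streaming version never needs the full set of distinct words at once, which is why sorting first makes Case 3 tractable.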
Slide6
Word Count (3)
A large corpus of documents, sharded across many disks in many machines
Machines, disks, networks can fail
Motivation for Google's MapReduce
Slide7
Input: A set of files
[Figure: input files with contents CNN, …, BBC, FOX]
Slide8
Map: Generate Word Count Per File
[Figure: per-file counts produced by map]
CNN: C:1, N:2
…
BBC: B:2, C:1
FOX: F:1, O:1, X:1
Slide9
Partition (Optional)
[Figure: the map output pairs (C:1, N:2; B:2, C:1; F:1, O:1, X:1) shown again after partitioning]
Slide10
Reduce
[Figure: the partitioned pairs are reduced to the final counts]
C:2, N:2, B:2, F:1, O:1, X:1
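The CNN / BBC / FOX walk-through on Slides 7 to 10 can be reproduced in a few lines of Python. This is a sketch under assumptions: the file names, the choice of R = 2, and the partition rule `ord(key) % R` are all made up for illustration; the slides do not say how the pairs are partitioned.

```python
from collections import Counter, defaultdict

# Hypothetical file names; the contents match the slides.
files = {"file1": "CNN", "file2": "BBC", "file3": "FOX"}

# Map: one list of (letter, count) pairs per file.
mapped = {name: sorted(Counter(text).items()) for name, text in files.items()}

# Partition (assumed rule): route each pair to one of R buckets by key,
# so that all counts for a given key land in the same bucket.
R = 2
partitions = [defaultdict(list) for _ in range(R)]
for pairs in mapped.values():
    for key, count in pairs:
        partitions[ord(key) % R][key].append(count)

# Reduce: sum the counts collected for each key.
totals = {key: sum(counts) for part in partitions for key, counts in part.items()}
print(totals)
```

Because every occurrence of a key goes to the same bucket, each bucket can be reduced independently of the others.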
Slide11
MapReduce
Input: a set of key/value pairs
User supplies two functions:
  map(k, v) → list(k1, v1)
  reduce(k1, list(v1)) → v2
(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs
Slide12
Word Count using MapReduce
map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(result)
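The map and reduce signatures from Slide 11 and the word count pseudocode can be tried out with a small single-process driver. This is a minimal sketch, not how a real MapReduce system runs: `map_reduce`, `wc_map`, and `wc_reduce` are hypothetical names, and the grouping step that a distributed system does with a shuffle is done here with an in-memory dictionary.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run mapper over (k, v) inputs, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for k, v in inputs:
        for k1, v1 in mapper(k, v):
            groups[k1].append(v1)
    # Output is the set of (k1, v2) pairs, returned here as a dict.
    return {k1: reducer(k1, values) for k1, values in groups.items()}


def wc_map(key, value):
    # key: document name; value: text of the document
    for word in value.split():
        yield word, 1


def wc_reduce(key, values):
    # key: a word; values: an iterable of counts
    return sum(values)


docs = [("d1", "to be or not to be"), ("d2", "to do")]
counts = map_reduce(docs, wc_map, wc_reduce)
print(counts)
```

Only `wc_map` and `wc_reduce` are specific to word count; the driver works for any pair of functions with the same shapes.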
Slide13
Map and Reduce vs MapReduce
The map and reduce operations in MapReduce are inspired by similar operations in functional programming
Slide14
Map
Given a function f: (A) => B
A collection a: A[]
Generates a collection b: B[], where b[i] = f(a[i])
Parallel.For, Parallel.ForEach
  Where each loop iteration is independent
[Figure: f applied independently to each element of A, producing the corresponding element of B]
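The functional map on this slide can be illustrated in Python. This is a sketch: the slide names the .NET constructs `Parallel.For` and `Parallel.ForEach`, and `ThreadPoolExecutor.map` is used here only as a Python stand-in; the squaring function `f` is an arbitrary example.

```python
from concurrent.futures import ThreadPoolExecutor


def f(x):
    """Example f: (A) => B; any function without shared state would do."""
    return x * x


a = [1, 2, 3, 4]

# Sequential map: b[i] = f(a[i])
b = [f(x) for x in a]

# Parallel map: iterations are independent, so they may run concurrently.
# Executor.map returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    b_parallel = list(pool.map(f, a))

print(b, b_parallel)
```

Independence of iterations is what makes the parallel version safe: no call to `f` reads or writes anything another call depends on.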
Slide15
Reduce
Given a function f: (A, B) => B
A collection a: A[]
An initial value b0: B
Generate a final value b: B
  Where b = f(a[n-1], … f(a[1], f(a[0], b0)) … )
[Figure: a chain of f applications that starts from b0, consumes the elements of A one at a time, and ends with b]
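The functional reduce on this slide corresponds to a fold. A minimal Python sketch follows, with addition as an arbitrary example of f. Note one assumption about argument order: `functools.reduce` calls its function as (accumulator, element), so the arguments are swapped to match the slide's f(a[i], b).

```python
from functools import reduce


def f(a_i, b):
    """Combine one element with the running value: (A, B) => B."""
    return a_i + b


a = [1, 2, 3, 4]
b0 = 0

# b = f(a[n-1], ... f(a[1], f(a[0], b0)))
b = reduce(lambda acc, x: f(x, acc), a, b0)

# The same fold written as an explicit loop.
acc = b0
for x in a:
    acc = f(x, acc)

print(b, acc)
```

Unlike map, each step depends on the previous result, so this form is inherently sequential unless f has extra properties such as associativity.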
Slide16
Relationship to SQL
Implementing word count in SQL

SELECT word, COUNT(*) AS wordCount
FROM files
GROUP BY word;
-- where files is a (distributed) relation <name, posn, word>
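The query can be tried against a small local table using Python's built-in `sqlite3` module. This is a sketch with assumptions: the table is an ordinary in-memory SQLite table rather than a distributed relation, the column types and sample rows are invented, and an `ORDER BY` is added only to make the printed result deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Columns follow the relation <name, posn, word>; the types are assumed.
conn.execute("CREATE TABLE files (name TEXT, posn INTEGER, word TEXT)")
rows = [
    ("d1", 0, "to"), ("d1", 1, "be"), ("d1", 2, "or"),
    ("d1", 3, "not"), ("d1", 4, "to"), ("d1", 5, "be"),
    ("d2", 0, "to"),
]
conn.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)

query = """
    SELECT word, COUNT(*) AS wordCount
    FROM files
    GROUP BY word
    ORDER BY word
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

GROUP BY plays the role of the shuffle (bringing equal keys together) and COUNT(*) plays the role of the reduce function.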
Slide17
Signs of a Good Abstraction
Hides important details
  But not too much
Simple for lay programmers to use
Not necessarily general
  But not very restricted
  Can be application/domain specific
Allows efficient implementations
  Automatic optimizations
  Manual optimizations (by experts)
Slide18
Distributed Execution Overview
[Figure: the User Program forks the Master and the Workers. The Master assigns map and reduce tasks. Map workers read the Input Data splits (Split 0, Split 1, Split 2) and write locally; reduce workers do a remote read and sort, then write Output File 0 and Output File 1.]
Slide19
Data flow
Input, final output are stored on a distributed file system
  Scheduler tries to schedule map tasks "close" to physical storage location of input data
Intermediate results are stored on local FS of map and reduce workers
Output is often input to another map reduce task
Slide20
Coordination
Master data structures
  Task status: (idle, in-progress, completed)
  Idle tasks get scheduled as workers become available
  When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
  Master pushes this info to reducers
Master pings workers periodically to detect failures
Slide21
Failures
Map worker failure
  Map tasks completed or in-progress at worker are reset to idle
  Reduce workers are notified when task is rescheduled on another worker
Reduce worker failure
  Only in-progress tasks are reset to idle
Master failure
  MapReduce task is aborted and client is notified
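The task states from the Coordination slide and the reset rules above can be sketched as a small task table. This is a hypothetical, single-process model: the names `Status`, `Task`, `assign`, and `handle_worker_failure` are invented for illustration, and pinging, notification of reducers, and master failure are left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    status: Status = Status.IDLE
    worker: Optional[str] = None


def assign(tasks, worker):
    """Give the first idle task to a worker that has become available."""
    for task in tasks:
        if task.status is Status.IDLE:
            task.status, task.worker = Status.IN_PROGRESS, worker
            return task
    return None


def handle_worker_failure(tasks, worker):
    """Reset tasks that were placed on a failed worker."""
    for task in tasks:
        if task.worker != worker:
            continue
        # Map tasks: completed or in-progress tasks are reset to idle.
        # Reduce tasks: only in-progress tasks are reset.
        if task.kind == "map" or task.status is Status.IN_PROGRESS:
            task.status, task.worker = Status.IDLE, None


tasks = [Task("map"), Task("map"), Task("reduce")]
assign(tasks, "w1").status = Status.COMPLETED   # first map task finished on w1
assign(tasks, "w1")                             # second map task running on w1
assign(tasks, "w2").status = Status.COMPLETED   # reduce task finished on w2
handle_worker_failure(tasks, "w1")
handle_worker_failure(tasks, "w2")
print([(t.kind, t.status.value) for t in tasks])
```

The asymmetry follows from the Data flow slide: final output is stored on the distributed file system and so outlives the worker, while the slides imply that a completed map task's intermediate output on the worker's local FS does not.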
Slide22
How many Map and Reduce jobs?
M map tasks, R reduce tasks
Rule of thumb:
  Make M and R much larger than the number of nodes in cluster
  One DFS chunk per map is common
  Improves dynamic load balancing and speeds recovery from worker failure
Usually R is smaller than M, because output is spread across R files
Slide23
Combiners
Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
  E.g., popular words in Word Count
Can save network time by pre-aggregating at mapper
  combine(k1, list(v1)) → v2
  Usually same as reduce function
Works only if reduce function is commutative and associative
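The effect of a combiner on word count can be sketched as follows. The helper names (`map_words`, `combine`, `reduce_counts`, `flatten`) and the two sample splits are made up for illustration; the point is only that pre-aggregating at each mapper ships fewer pairs while the reduced result stays the same, which holds here because addition is commutative and associative.

```python
from collections import Counter


def map_words(text):
    """Map without a combiner: one (word, 1) pair per occurrence."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Pre-aggregate at the mapper: one (word, partial_count) per distinct word."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def reduce_counts(pairs):
    """Sum all counts per word, whether or not they were pre-aggregated."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def flatten(lists):
    return [pair for lst in lists for pair in lst]


splits = ["the cat and the dog", "the end"]
raw = [map_words(s) for s in splits]
combined = [combine(p) for p in raw]

sent_raw = sum(len(p) for p in raw)            # pairs shipped without a combiner
sent_combined = sum(len(p) for p in combined)  # pairs shipped with a combiner

print(sent_raw, sent_combined)
print(reduce_counts(flatten(combined)))
```

The saving grows with how often a key repeats within a single map task, which is why popular words benefit most.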
Slide24
Partition Function
Inputs to map tasks are created by contiguous splits of input file
For reduce, we need to ensure that records with the same intermediate key end up at the same worker
System uses a default partition function, e.g., hash(key) mod R
Sometimes useful to override
  E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
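Both partition functions can be sketched in Python. Assumptions in this sketch: R = 4 is arbitrary, the function names are hypothetical, and CRC32 from `zlib` is used as the hash because Python's built-in `hash()` for strings is randomized per process and would not give repeatable partitions; the slides do not specify a hash function.

```python
import zlib
from urllib.parse import urlparse

R = 4  # number of reduce tasks (arbitrary for this sketch)


def stable_hash(text):
    """Repeatable hash of a string (assumed choice: CRC32)."""
    return zlib.crc32(text.encode("utf-8"))


def default_partition(key, r=R):
    """hash(key) mod R"""
    return stable_hash(key) % r


def host_partition(url, r=R):
    """hash(hostname(URL)) mod R: all URLs from one host share a partition."""
    return stable_hash(urlparse(url).hostname or "") % r


urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/a",
]
print([default_partition(u) for u in urls])
print([host_partition(u) for u in urls])
```

With the default function, two URLs from the same host may land in different partitions; the host-based override guarantees they do not.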
Slide25
Avoiding Stragglers
A slow-running task (straggler) can prolong overall execution
  Overloaded machines
  Slow disk
Kill stragglers
Fork redundant tasks and take the first to finish
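The "fork redundant tasks and take the first" idea can be sketched with Python threads. This is a toy model under stated assumptions: `run_task` and `run_with_backup` are hypothetical names, `time.sleep` simulates a slow or fast machine, and the slow copy is ignored rather than killed because Python threads cannot be terminated from outside.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task(task_id, delay):
    """Stand-in for a map or reduce task; delay simulates machine speed."""
    time.sleep(delay)
    return task_id, delay


def run_with_backup(task_id, delays):
    """Launch one copy of the task per delay and return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(run_task, task_id, d) for d in delays]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    # Do not block on the slower copies; their results are discarded.
    pool.shutdown(wait=False)
    return next(iter(done)).result()


# One copy lands on a "slow machine" (0.5 s), the backup on a fast one (0.01 s).
result = run_with_backup("map-7", delays=[0.5, 0.01])
print(result)
```

The cost is extra work for the duplicated task; the benefit is that overall completion time is no longer bounded by the slowest machine.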
Slide26
Example: Sorting
Slide27
Example: Database Join
Slide28
Can Mappers Push instead of Reducers Pulling Data?