
Map reduce 5/24/2011 Map Reduce - PowerPoint Presentation

celsa-spraggs
Uploaded On 2019-06-26



Presentation Transcript

Slide1

Map reduce

5/24/2011

Map Reduce

1

Slide2

Word Count Example

We have a large file of words, one word to a line
Count the number of times each distinct word appears in the file
Sample application: analyze web server logs to find popular URLs



Slide5

Word Count (2)

Case 1: Entire file fits in memory
Case 2: File too large for memory, but all <word, count> pairs fit in memory
Case 3: File on disk, too many distinct words to fit in memory
    sort datafile | uniq -c
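Case 2 can be sketched in a few lines of Python. This is only an illustration under the slide's assumption that the table of <word, count> pairs fits in memory even though the file does not; the function name `count_words` and the sample data are made up for the example.

```python
from collections import Counter
import io

def count_words(lines):
    """Stream one word per line; only the <word, count> table stays in memory."""
    counts = Counter()
    for line in lines:
        word = line.strip()
        if word:                 # skip blank lines
            counts[word] += 1
    return counts

# A small in-memory stand-in for a large file with one word per line.
sample = io.StringIO("cat\ndog\ncat\nbird\ncat\n")
counts = count_words(sample)
print(counts.most_common(1))
```

Case 3 is exactly where this approach stops working: once the table itself no longer fits in memory, a disk-based sort followed by counting adjacent duplicates (the `sort | uniq -c` pipeline) is needed instead.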


Slide6

Word Count (3)

A large corpus of documents, sharded across many disks in many machines
Machines, disks, networks can fail
Motivation for Google’s MapReduce


Slide7

Input: A set of files

[Figure: three input files containing the text CNN, BBC, and FOX]

Slide8

Map: Generate Word Count Per File

[Figure: each input file is mapped to per-file letter counts. CNN gives C:1, N:2; BBC gives B:2, C:1; FOX gives F:1, O:1, X:1]

Slide9

Partition (Optional)

[Figure: the input files CNN, BBC, FOX; their per-file counts (C:1, N:2; B:2, C:1; F:1, O:1, X:1); and the same counts after the optional partition step]

Slide10

Reduce

[Figure: the input files, their per-file counts, the partitioned counts, and the reduce output C:2, N:2, B:2, F:1, O:1, X:1]

Slide11

MapReduce

Input: a set of key/value pairs
User supplies two functions:
    map(k, v) → list(k1, v1)
    reduce(k1, list(v1)) → v2
(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs


Slide12

Word Count using MapReduce

map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(result)
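The pseudocode can be exercised on a single machine with a small Python sketch. The driver `run_mapreduce` is hypothetical and stands in for the whole framework: it groups intermediate pairs by key in memory, with no distribution, sorting, or fault tolerance. The three sample documents reuse the CNN/BBC/FOX letters from the earlier slides.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: text of document
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: an iterable of counts
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework (an assumption, not the real system)."""
    groups = defaultdict(list)
    for k, v in inputs:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)          # group intermediate values by key
    return {k1: reduce_fn(k1, vs) for k1, vs in groups.items()}

docs = [("doc1", "C N N"), ("doc2", "B B C"), ("doc3", "F O X")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result)
```

The result pairs each intermediate key with its reduced value, which corresponds to the set of (k1, v2) pairs described on the previous slide.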


Slide13

Map and Reduce vs MapReduce

The map and reduce operations in MapReduce are inspired by similar operations in functional programming


Slide14

Map

Given a function f: (A) => B
A collection a: A[]
Generates a collection b: B[], where b[i] = f(a[i])
Parallel.For, Parallel.ForEach
    Where each loop iteration is independent
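A minimal Python sketch of the same idea follows. The slide names the .NET constructs Parallel.For and Parallel.ForEach; here `ThreadPoolExecutor.map` from the standard library is used as a rough stand-in, and the squaring function `f` is just a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    # Placeholder for any function whose result depends only on its own input.
    return x * x

a = [1, 2, 3, 4]

# Sequential map: b[i] = f(a[i])
b_sequential = [f(x) for x in a]

# Parallel map: iterations are independent, so they can run in any order.
# pool.map still returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    b_parallel = list(pool.map(f, a))

print(b_parallel)
```

Because no iteration reads another iteration's result, the parallel and sequential versions produce the same collection.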

[Figure: f is applied independently to each element of A to produce the corresponding element of B]

Slide15

Reduce

Given a function f: (A, B) => B
A collection a: A[]
An initial value b0: B
Generate a final value b: B
    Where b = f(a[n-1], … f(a[1], f(a[0], b0)))
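The fold above can be written with Python's `functools.reduce`, as a small sketch. Note that `functools.reduce` passes the accumulator first, while the slide's f takes the element first, so a lambda swaps the arguments. The function `f` here (add the length of a string to a running total) is an arbitrary example.

```python
from functools import reduce

def f(a_i, b):
    # f: (A, B) => B, here A = str and B = int
    return b + len(a_i)

a = ["map", "reduce", "fold"]
b0 = 0

# functools.reduce calls its function as (accumulator, element),
# so the arguments are swapped to match the slide's f(element, accumulator).
b = reduce(lambda acc, x: f(x, acc), a, b0)

# The same value written out as the nested expression from the slide.
b_nested = f(a[2], f(a[1], f(a[0], b0)))
print(b, b_nested)
```

Unlike map, each step depends on the previous accumulator, so this form is inherently sequential unless f has extra algebraic properties.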

[Figure: f is applied to each element of A in turn, threading the accumulator from b0 to the final value b]

Slide16

Relationship to SQL

Implementing word count in SQL


SELECT word, COUNT(*) AS wordCount
FROM files
GROUP BY word;

-- where files is a (distributed) relation <name, posn, word>
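To try the query, the sketch below runs it against an in-memory SQLite database from Python. This is a single-node illustration only: the slide's relation is distributed, the column types are assumptions, and the three rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema follows the slide's relation <name, posn, word>; types are assumed.
conn.execute("CREATE TABLE files (name TEXT, posn INTEGER, word TEXT)")
rows = [("doc1", 0, "map"), ("doc1", 1, "reduce"), ("doc2", 0, "map")]
conn.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)

# Each result row is a (word, wordCount) pair, so dict() builds the count table.
word_counts = dict(conn.execute(
    "SELECT word, COUNT(*) AS wordCount FROM files GROUP BY word"))
conn.close()
print(word_counts)
```

GROUP BY plays the role of the shuffle (bringing equal keys together) and COUNT(*) plays the role of the reduce function.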

Slide17

Signs of a Good Abstraction

Hides important details
    But not too much
Simple for lay programmers to use
Not necessarily general
    But not very restricted
    Can be application/domain specific
Allows efficient implementations
    Automatic optimizations
    Manual optimizations (by experts)


Slide18

Distributed Execution Overview

[Figure: the user program forks a master and several workers. The master assigns map and reduce tasks. Map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate data locally; reduce workers do remote reads, sort, and write Output File 0 and Output File 1.]

Slide19

Data flow

Input, final output are stored on a distributed file system
    Scheduler tries to schedule map tasks “close” to physical storage location of input data
Intermediate results are stored on local FS of map and reduce workers
Output is often input to another MapReduce task


Slide20

Coordination

Master data structures
    Task status: (idle, in-progress, completed)
    Idle tasks get scheduled as workers become available
    When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
    Master pushes this info to reducers
Master pings workers periodically to detect failures


Slide21

Failures

Map worker failure
    Map tasks completed or in-progress at worker are reset to idle
    Reduce workers are notified when task is rescheduled on another worker
Reduce worker failure
    Only in-progress tasks are reset to idle
Master failure
    MapReduce task is aborted and client is notified
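The worker-failure rules can be sketched as a small update to the master's task table. Everything here is hypothetical (the `Task` class, the status strings, and `handle_worker_failure` are names invented for the illustration), and notification of reduce workers is left out.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    status: str = IDLE
    worker: Optional[str] = None   # worker the task is (or was) assigned to

def handle_worker_failure(tasks, failed_worker):
    """Apply the slide's reset rules; return the tasks that went back to idle."""
    reset = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output lives on the failed worker's local disk, so it is lost.
        lost_map_output = task.kind == "map" and task.status == COMPLETED
        if task.status == IN_PROGRESS or lost_map_output:
            task.status, task.worker = IDLE, None
            reset.append(task)
    return reset

tasks = [
    Task("map", COMPLETED, "w1"),
    Task("map", IN_PROGRESS, "w1"),
    Task("reduce", COMPLETED, "w1"),
    Task("reduce", IN_PROGRESS, "w1"),
    Task("map", COMPLETED, "w2"),
]
reset = handle_worker_failure(tasks, "w1")
print([t.status for t in tasks])
```

The asymmetry follows from where results are stored: intermediate map output sits on the worker's local file system, while completed reduce output is already on the distributed file system and survives the failure.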


Slide22

How many Map and Reduce jobs?

M map tasks, R reduce tasks
Rule of thumb:
    Make M and R much larger than the number of nodes in cluster
    One DFS chunk per map is common
    Improves dynamic load balancing and speeds recovery from worker failure
Usually R is smaller than M, because output is spread across R files


Slide23

Combiners

Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
    E.g., popular words in Word Count
Can save network time by pre-aggregating at mapper
    combine(k1, list(v1)) → v2
    Usually same as reduce function
Works only if reduce function is commutative and associative
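A combiner for word count can be sketched as follows. The helper `combine` is a hypothetical local pre-aggregation step that runs on one mapper's output before anything is sent over the network; the sample sentence is made up.

```python
from collections import Counter

def map_fn(key, value):
    # Emit (word, 1) for every word in the document text.
    for word in value.split():
        yield (word, 1)

def combine(pairs):
    """Pre-aggregate one mapper's output: sum the values for each key locally."""
    combined = Counter()
    for k, v in pairs:
        combined[k] += v
    return list(combined.items())

raw = list(map_fn("doc1", "the cat and the dog and the bird"))
sent = combine(raw)
print(len(raw), "pairs before combining,", len(sent), "after")
```

Summation is commutative and associative, so summing partial sums gives the same total. A function such as the arithmetic mean does not have this property, which is why it cannot be used directly as a combiner.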


Slide24

Partition Function

Inputs to map tasks are created by contiguous splits of input file
For reduce, we need to ensure that records with the same intermediate key end up at the same worker
System uses a default partition function, e.g., hash(key) mod R
Sometimes useful to override
    E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
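Both partition functions can be sketched in Python. The value of `R`, the helper `stable_hash`, and the sample URLs are assumptions for the illustration. CRC32 is used instead of Python's built-in `hash()` because string hashes are randomized per process, and every worker must agree on the partition of a key.

```python
import zlib
from urllib.parse import urlparse

R = 4  # number of reduce tasks (illustrative value)

def stable_hash(s):
    # Deterministic across processes and machines, unlike built-in hash() for str.
    return zlib.crc32(s.encode("utf-8"))

def default_partition(key, r=R):
    # hash(key) mod R
    return stable_hash(key) % r

def host_partition(url, r=R):
    # hash(hostname(URL)) mod R: all URLs from one host share a partition.
    return stable_hash(urlparse(url).hostname) % r

urls = ["http://example.com/a", "http://example.com/b/c", "http://example.org/a"]
for u in urls:
    print(u, default_partition(u), host_partition(u))
```

With the default function the two example.com URLs may land in different partitions, while the host-based override always sends them to the same reduce task and therefore the same output file.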


Slide25

Avoiding Stragglers

A slow-running task (straggler) can prolong overall execution
    Overloaded machines
    Slow disk
Kill stragglers
Fork redundant tasks and take the first to finish
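The "fork redundant tasks and take the first" idea can be illustrated with a toy Python sketch. The worker names and delays are invented, `time.sleep` stands in for real work, and threads stand in for machines. Python threads cannot be killed, so the slow copy is simply ignored here rather than terminated.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_task(worker, delay):
    time.sleep(delay)   # stands in for the actual work
    return worker       # report which copy produced the result

with ThreadPoolExecutor(max_workers=2) as pool:
    # The same task is launched twice: the original copy is a straggler.
    original = pool.submit(run_task, "slow-worker", 0.5)
    backup = pool.submit(run_task, "backup-worker", 0.01)
    # Block until at least one copy finishes, then use its result.
    done, pending = wait([original, backup], return_when=FIRST_COMPLETED)
    winner = next(iter(done)).result()

print("result taken from", winner)
```

This only works because the task is deterministic and side-effect free, so either copy's output is acceptable.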


Slide26

Example: Sorting


Slide27

Example: Database Join


Slide28

Can Mappers Push Instead of Reducers Pulling Data?
