Designing MapReduce Algorithms - PowerPoint Presentation

Uploaded by marina-yarberry on 2017-10-28

Presentation Transcript

Slide1

Designing MapReduce Algorithms

Ch. 3 Lin and Dyer’s text

http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

Pages 43-73 (39-69)

Slide2

Word count:

Local aggregation, as opposed to an external combiner, which is NOT guaranteed by the Hadoop framework.

It may not work all the time: what if we wanted word "mean" instead of word "count"? We may have to adjust the <k,v> types at the output of map.

Word co-occurrence (matrix):

Very important, since many (many) problems are expressed and solved using matrices.

Pairs and stripes approaches, and a comparison of these two methods, p. 60 (56).

Improvements

Slide3
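The local-aggregation idea above can be sketched outside Hadoop as plain Python. This is only an illustrative sketch (function name and sample data are made up); a real in-mapper combiner would live inside a Hadoop Mapper class and emit once per input split:

```python
from collections import defaultdict

def map_with_local_aggregation(lines):
    """Word-count mapper with in-mapper combining: aggregate counts
    locally instead of emitting (word, 1) for every occurrence."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    # Emit one (word, partial_count) pair per distinct word.
    return list(counts.items())

pairs = map_with_local_aggregation(["the dog saw the cat", "the dog barked"])
print(sorted(pairs))
# [('barked', 1), ('cat', 1), ('dog', 2), ('saw', 1), ('the', 3)]
```

Unlike an external combiner, this aggregation always runs, which is the point of the slide: the framework may skip combiners entirely.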

First version: simplistic counts.

Then "relative frequency" instead of counts.

What is relative frequency? Instead of absolute counts:

f(wj | wi) = N(wi, wj) / ∑w′ N(wi, w′)

For example, if the word "dog" co-occurred with "food" 23 times, and "dog" co-occurred with all words 460 times, then the relative frequency is 23/460 = 1/20 = 0.05.

Also, the 460 could come from many mappers, over many documents across the entire corpus. These co-occurrences from every mapper are delivered to the "corresponding reducer" with a special key. This is delivered as the special key item <(wi, *), count>, as the first <k,v> pair. The magic is that the reducer processes <(wi, *), count> first.

Co-Occurrence

Slide4
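The pairs-approach mapper with the special marginal key can be sketched as follows. This is an illustrative Python sketch (names and the window parameter are assumptions), showing how each co-occurrence also contributes a (w, '*') pair toward the marginal:

```python
def cooccurrence_pairs_mapper(tokens, window=2):
    """For each word w, emit ((w, u), 1) for each co-occurring word u
    within the window, plus a special ((w, '*'), 1) pair that
    contributes to the marginal sum over w' of N(w, w')."""
    for i, w in enumerate(tokens):
        neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for u in neighbors:
            yield ((w, u), 1)
            yield ((w, '*'), 1)   # contribution to the marginal

pairs = list(cooccurrence_pairs_mapper(["dog", "food", "dog"], window=1))
```

Summing the (w, '*') pairs across all mappers yields the denominator ∑w′ N(w, w′) used in the relative frequency.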

| Key         | Value                                           | Reducer operation/compute           | Result              |
|-------------|-------------------------------------------------|-------------------------------------|---------------------|
| (dog, *)    | [200, 350, 650] (one per mapper, with combiner) | Marginal: ∑w′ N(dog, w′) = 1200     |                     |
| (dog, bark) | 60                                              | Relative frequency                  | <(dog, bark), 0.05> |
| (dog, cat)  | 12                                              | Relative frequency                  | <(dog, cat), 0.01>  |
| (dog, food) | 600                                             | Relative frequency                  | <(dog, food), 0.5>  |
| …           | …                                               |                                     |                     |

At the reducer: the (dog, …) keys above are handled by reducer 1, and the (tiger, …) keys below by reducer 2.

| Key           | Value           | Reducer operation                    | Result                    |
|---------------|-----------------|--------------------------------------|---------------------------|
| (tiger, *)    | [100, 300, 600] | Compute marginal: ∑w′ N(tiger, w′) = 1000 |                      |
| (tiger, cub)  | 10              | Compute r.freq                       | <(tiger, cub), 10/1000>   |
| (tiger, hunt) | 100             | Compute r.freq                       | <(tiger, hunt), 100/1000> |
| (tiger, prey) | 200             | Compute r.freq                       | <(tiger, prey), 200/1000> |
| …             | …               |                                      |                           |

Slide5
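A reducer that exploits this key ordering can be sketched in Python. This is an illustrative sketch (function name and input layout are assumptions): it assumes the framework delivers the special (w, '*') key before any joint pair, so the marginal is known before any division, mirroring the (tiger, …) table above:

```python
def relative_frequency_reducer(sorted_pairs):
    """Reducer for one left word, relying on order inversion: the
    special (w, '*') key arrives first, so the marginal is known
    before any joint count. Yields ((w, u), relative_frequency)."""
    marginal = 0
    for (w, u), counts in sorted_pairs:
        if u == '*':
            marginal = sum(counts)       # one partial sum per mapper
        else:
            yield (w, u), sum(counts) / marginal

# Input mirrors the (tiger, ...) table; '*' sorts before letters.
out = dict(relative_frequency_reducer([
    (("tiger", "*"),    [100, 300, 600]),
    (("tiger", "cub"),  [10]),
    (("tiger", "hunt"), [100]),
    (("tiger", "prey"), [200]),
]))
print(out)
# {('tiger', 'cub'): 0.01, ('tiger', 'hunt'): 0.1, ('tiger', 'prey'): 0.2}
```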

4 different reducers

Slide6

Emitting a special key-value pair for each co-occurring word pair in the mapper to capture its contribution to the marginal.

Controlling the sort order of the intermediate key so that the key-value pairs representing the marginal contributions are processed by the reducer before any of the pairs representing the joint word co-occurrence counts.

Defining a custom partitioner to ensure that all pairs with the same left word are shuffled to the same reducer.

Preserving state across multiple keys in the reducer to first compute the marginal based on the special key-value pairs and then dividing the joint counts by the marginal to arrive at the relative frequencies.

Requirements

Slide7
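The custom-partitioner requirement can be sketched in a few lines. This is an illustrative sketch (the function name is an assumption, and a real Hadoop partitioner would subclass `Partitioner`): hashing only the left element of the composite key guarantees that (w, '*') and every (w, u) land on the same reducer:

```python
def left_word_partitioner(key, num_reducers):
    """Route a composite key (w, u) by the left word only, so every
    (w, '*') and (w, u) pair reaches the same reducer."""
    left, _right = key
    return hash(left) % num_reducers

# All keys with left word "dog" map to the same reducer index.
r = left_word_partitioner(("dog", "*"), 4)
assert r == left_word_partitioner(("dog", "bark"), 4)
assert r == left_word_partitioner(("dog", "food"), 4)
```

The default partitioner would hash the whole (w, u) pair and scatter a word's marginal and joint counts across reducers, breaking the pattern.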

Let's generalize this:

<(var34, left), value>
<(var34, right), value>
<(var34, middle), value>

are all delivered to the same reducer. What can you do with this? The reducer can compute "middle(left's value, right's value)" → <var34, computedValue>

Some more:

<KEY complex object, VALUE complex object>

You can do anything you want for the function: "KEY.operation" on "VALUE.data". Therein lies the power of MR.

Slide8
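The generalized pattern above can be sketched as follows. This is a hypothetical illustration (the function, tags, and the idea of shipping the operation as a value are all assumptions for the sketch, not Hadoop API): the reducer receives the tagged values for one key together and applies the "middle" operation to the left and right values:

```python
def var_reducer(key, tagged_values):
    """Generalized reducer: for one key (e.g. 'var34'), receive the
    tagged values together and apply the 'middle' operation to the
    'left' and 'right' values, emitting <key, computedValue>."""
    vals = dict(tagged_values)   # {'left': ..., 'right': ..., 'middle': ...}
    op = vals['middle']          # here the operation itself is the value
    return key, op(vals['left'], vals['right'])

# Hypothetical example: 'middle' is addition.
key, result = var_reducer('var34', [('left', 2), ('right', 3),
                                    ('middle', lambda a, b: a + b)])
print(key, result)   # var34 5
```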

Text word count

Text co-occurrence → pairs and stripes

Numerical data processing with most math functions

How about sensor data? Consider m sensors, sending out readings rx at various times ty, resulting in a large volume of data of the format:

(t1; m1; r80521)
(t1; m2; r14209)
(t1; m3; r76042)
...
(t2; m1; r21823)
(t2; m2; r66508)
(t2; m3; r98347)

Suppose you wanted to know the readings by sensor; how could you process the above to get that info? Use MR to do that: <m1, (t1; r80521)> etc.

But what if you wanted that sorted by the time t that is part of the value?

Problem discussed so far

Slide9

Solution 1: Let the reducer do the in-memory sorting → memory bottleneck.

Solution 2: Move the value to be sorted into the key, and modify the shuffler and the partitioner.

In the latter, the "secondary sorting" is left to the framework, which excels at this anyway. So solution 2 is the preferred approach.

Lesson: Let the framework do what it is good at, and don't try to move that work into your code; otherwise you will be regressing to the "usual" coding practices and their ensuing disadvantages.

In-memory sorting vs. value-to-key conversion

Slide10
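Solution 2 (value-to-key conversion) can be simulated in Python. This is an illustrative sketch (names and data are assumptions; in real Hadoop the sort happens in the shuffle, with a partitioner that routes by sensor id only): the timestamp is promoted from the value into a composite key, so sorting the keys delivers each sensor's readings in time order:

```python
def value_to_key(records):
    """Move the timestamp from the value into a composite key
    (sensor_id, t); the framework's shuffle sort then delivers each
    sensor's readings already ordered by time (secondary sort)."""
    return [((m, t), r) for (t, m, r) in records]

records = [(2, "m1", "r21823"), (1, "m1", "r80521"), (1, "m2", "r14209")]
# Simulate the framework sorting composite keys before the reduce phase.
shuffled = sorted(value_to_key(records))
print(shuffled)
# [(('m1', 1), 'r80521'), (('m1', 2), 'r21823'), (('m2', 1), 'r14209')]
```

No reducer-side buffering is needed: the values arrive pre-sorted, which is exactly the lesson of the slide.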

Reduce-side join is intuitive but inefficient.

Map-side join requires a simple merge of the respective input files, with the appropriate sort done by the MR framework.

In-memory joins can be done for smaller data.

We will NOT discuss these in detail, since there are other solutions such as Hive and HBase available for warehouse data. We will look into these later.

Relational Joins / warehouse data
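To make the reduce-side join concrete, here is a minimal Python sketch (function and relation names are assumptions): the map phase tags each record with its source relation, the shuffle groups by join key, and the reduce phase pairs the two sides. Its inefficiency is visible: every record of both relations is shuffled, even keys with no match:

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Naive reduce-side join: tag each record with its source
    relation, group by join key (the shuffle), then pair the two
    tagged lists per key (the reduce)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)      # tagged: relation L
    for key, value in right:
        groups[key][1].append(value)      # tagged: relation R
    # Reduce: cross-product of the two tagged lists for each key.
    return [(k, l, r) for k, (ls, rs) in groups.items()
            for l in ls for r in rs]

out = reduce_side_join([(1, "a"), (2, "b")], [(1, "x"), (3, "y")])
print(out)   # [(1, 'a', 'x')]
```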