Presentation Transcript


2008 NSF Data-Intensive Scalable Computing in Education Workshop

Module II: Hadoop-Based Course Components

This presentation includes content © Google, Inc.

Redistributed under the Creative Commons Attribution 2.5 license.

All other contents: © Spinnaker Labs, Inc.

Overview

University of Washington Curriculum

Teaching Methods

Reflections

Student Background

Course Staff Requirements

MapReduce Algorithms

UW: Course Summary

Course title: "Problem Solving on Large Scale Clusters"

Primary purpose: developing large-scale problem-solving skills

Format: 6 weeks of lectures + labs, 4-week project

UW: Course Goals

Think creatively about large-scale problems in a parallel fashion; design parallel solutions

Manage large data sets under memory, bandwidth limitations

Develop a foundation in parallel algorithms for large-scale data

Identify and understand engineering trade-offs in real systems

Lectures

2 hours, once per week

Half formal lecture, half discussion

Mostly covered systems & background

Included group activities for reinforcement

Classroom Activities

Worksheets included pseudo-code programming, working through examples

Performed in groups of 2-3

Small-group discussions about engineering and systems design

Groups of ~10

Course staff facilitated, but mostly open-ended

Readings

No textbook

One academic paper per week

E.g., “Simplified Data Processing on Large Clusters”

Short homework covered comprehension

Formed basis for discussion

Lecture Schedule

Introduction to Distributed Computing

MapReduce: Theory and Implementation

Networks and Distributed Reliability

Real-World Distributed Systems

Distributed File Systems

Other Distributed Systems

Intro to Distributed Computing

What is distributed computing?

Flynn's Taxonomy

Brief history of distributed computing

Some background on synchronization and memory sharing

MapReduce

Brief refresher on functional programming

MapReduce slides

More detailed version of module I

Discussion on MapReduce

Networking and Reliability

Crash course in networking

Distributed systems reliability

What is reliability?

How do distributed systems fail?

ACID, other metrics

Discussion: Does MapReduce provide reliability?

Real Systems

Design and implementation of Nutch

Tech talk from Googler on Google Maps

Distributed File Systems

Introduced GFS

Discussed implementation of NFS and AndrewFS for comparison

Other Distributed Systems

BOINC: Another platform

Broader definition of distributed systems

DNS

One Laptop per Child project

Labs

Also 2 hours, once per week

Focused on applications of distributed systems

Four lab projects over six weeks

Lab Schedule

Introduction to Hadoop, Eclipse Setup, Word Count

Inverted Index

PageRank on Wikipedia

Clustering on Netflix Prize Data

Design Projects

Final four weeks of quarter

Teams of 1-3 students

Students proposed topic, gathered data, developed software, and presented solution

Example: Geozette

Image © Julia Schwartz

Example: Galaxy Simulation

Image © Slava Chernyak, Mike Hoak

Other Projects

Bayesian Wikipedia spam filter

Unsupervised synonym extraction

Video collage rendering

Ongoing research: traceroutes

Analyze time-stamped traceroute data to model changes in Internet router topology

4.5 GB of data/day * 1.5 years

12 billion traces from 200 PlanetLab sites

Calculates prevalence and persistence of routes between hosts

Ongoing research: dynamic program traces

Dynamic memory trace data from simulators can reach hundreds of GB

Existing work focuses on sampling

New capability: record all accesses and post-process with Hadoop

Common Features

Hadoop!

Used publicly-available web APIs for data

Many involved reading papers for algorithms and translating them into the MapReduce framework

Background Topics

Programming Languages

Systems:

Operating Systems

File Systems

Networking

Databases

Programming Languages

MapReduce is based on functional programming: map and fold

FP is taught in one quarter, but not reinforced

“Crash course” necessary

Worksheets to pose short problems in terms of map and fold

Immutable data a key concept

Multithreaded programming

Taught in OS course at Washington

Not a prerequisite!

Students need to understand multiple copies of same method running in parallel

File Systems

Necessary to understand GFS

Comparison to NFS, other distributed file systems relevant

Networking

TCP/IP

Concepts of "connection," network splits, other failure modes

Bandwidth issues

Other Systems Topics

Process Scheduling

Synchronization

Memory coherency

Databases

Concept of shared consistency model

Consensus

ACID characteristics

Journaling

Multi-phase commit processes

Course Staff

Instructor (me!)

Two undergrad teaching assistants

Helped facilitate discussions, directed labs

One student sys admin

Worked only about three hours/week

Preparation

Teaching assistants had taken previous iteration of course in winter

Lectures retooled based on feedback from that quarter

Added reasonably large amount of background material

Ran & solved all labs in advance

The Course: What Worked

Discussions

Often covered broad range of subjects

Hands-on lab projects

“Active learning” in classroom

Independent design projects

Things to Improve: Coverage

Algorithms were not reinforced during lecture

Students requested much more time be spent on "how to parallelize an iterative algorithm"

Background material was very fast-paced

Things to Improve: Projects

Labs could have used a moderated/scripted discussion component

Just “jumping in” to the code proved difficult

No time was devoted to Hadoop itself in lecture

Clustering lab should be split in two

Design projects could have used more time

Part 2: Algorithms

Algorithms for MapReduce

Sorting

Searching

Indexing

Classification

TF-IDF

PageRank

Clustering

MapReduce Jobs

Tend to be very short, code-wise

IdentityReducer is very common

"Utility" jobs can be composed

Represent a data flow, more so than a procedure

Sort: Inputs

A set of files, one value per line.

Mapper key is file name, line number

Mapper value is the contents of the line

Sort Algorithm

Takes advantage of reducer properties: (key, value) pairs are processed in order by key; reducers are themselves ordered

Mapper: Identity function for value

(k, v) -> (v, _)

Reducer: Identity function

(k', _) -> (k', "")

Sort: The Trick

(key, value) pairs from mappers are sent to a particular reducer based on hash(key)

Must pick the hash function for your data such that k1 < k2 => hash(k1) < hash(k2)
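Below is a minimal, Hadoop-free Java sketch of this trick, under illustrative assumptions (the keys, cutpoints, and class name are made up): an order-preserving partition function sends larger keys to higher-numbered "reducers", so concatenating each reducer's already-sorted output yields a total order.

import java.util.*;

// Sketch: an order-preserving partition function. If k1 < k2 always implies
// partition(k1) <= partition(k2), then concatenating each reducer's
// (key-sorted) output gives a globally sorted result.
public class RangePartitionSketch {
    static int partition(int key, int[] cutpoints) {
        int p = 0;
        while (p < cutpoints.length && key >= cutpoints[p]) p++;
        return p;                           // reducer index grows with the key
    }

    public static void main(String[] args) {
        int[] keys = {42, 7, 99, 13, 56, 3, 78};
        int[] cutpoints = {25, 50, 75};     // four "reducers"; real jobs would sample the key space
        List<List<Integer>> reducers = new ArrayList<>();
        for (int i = 0; i <= cutpoints.length; i++) reducers.add(new ArrayList<>());

        for (int k : keys) reducers.get(partition(k, cutpoints)).add(k);

        List<Integer> globallySorted = new ArrayList<>();
        for (List<Integer> r : reducers) {
            Collections.sort(r);            // each reducer sorts its own keys
            globallySorted.addAll(r);       // concatenation is already in order
        }
        System.out.println(globallySorted); // [3, 7, 13, 42, 56, 78, 99]
    }
}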

Final Thoughts on Sort

Used as a test of Hadoop's raw speed

Essentially "IO drag race"

Highlights utility of GFS

Search: Inputs

A set of files containing lines of text

A search pattern to find

Mapper key is file name, line number

Mapper value is the contents of the line

Search pattern sent as special parameter

Search Algorithm

Mapper: Given (filename, some text) and "pattern", if "text" matches "pattern", output (filename, _)

Reducer: Identity function
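As a rough illustration, here is an in-memory Java sketch of that mapper (not actual Hadoop API code; the file names, pattern, and class name are invented). The deduplicating set stands in for the identity reducer, and anticipates the Combiner optimization on the next slide.

import java.util.*;
import java.util.regex.*;

// Sketch of the search mapper: emit (filename, _) whenever a line matches.
public class SearchSketch {
    static List<String> map(String filename, String line, Pattern pattern) {
        List<String> out = new ArrayList<>();
        if (pattern.matcher(line).find()) out.add(filename);  // value is unused ("_")
        return out;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("Hadoop");
        Set<String> matches = new TreeSet<>();   // identity reduce + dedup of (filename, _)
        matches.addAll(map("notes.txt", "Intro to Hadoop and MapReduce", pattern));
        matches.addAll(map("todo.txt",  "buy milk", pattern));
        System.out.println(matches);             // [notes.txt]
    }
}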

Search: An Optimization

Once a file is found to be interesting, we only need to mark it that way once

Use a Combiner function to fold redundant (filename, _) pairs into a single one

Reduces network I/O

Indexing: Inputs

A set of files containing lines of text

Mapper key is file name, line number

Mapper value is the contents of the line

Inverted Index Algorithm

Mapper: For each word in (file, words), map to (word, file)

Reducer: Identity function

Index: MapReduce

map(pageName, pageText):
  foreach word w in pageText:
    emitIntermediate(w, pageName);
  done

reduce(word, values):
  foreach pageName in values:
    AddToOutputList(pageName);
  done
  emitFinal(FormattedPageListForWord);
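For reference, a runnable in-memory Java version of the same logic (no Hadoop APIs; the page data and class name are made up): the map step emits (word, pageName) pairs, and grouping plus the reduce step collects the page list for each word.

import java.util.*;

// In-memory sketch of the inverted index pseudo-code above.
public class InvertedIndexSketch {
    public static void main(String[] args) {
        Map<String, String> pages = Map.of(
                "page1", "hadoop runs mapreduce jobs",
                "page2", "mapreduce jobs sort data");

        // Map phase + shuffle: group page names by word.
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            for (String word : page.getValue().split("\\s+")) {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(page.getKey());
            }
        }

        // Reduce phase: emit the formatted page list for each word.
        index.forEach((word, pageList) -> System.out.println(word + ": " + pageList));
    }
}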

Index: Data Flow

An Aside: Word Count

Word count was described in module I

Mapper for Word Count is (word, 1) for each word in input line

Strikingly similar to inverted index

Common theme: reuse/modify existing mappers
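A small Java sketch of that Word Count mapper, again in-memory rather than real Hadoop code (tokenization and class name are illustrative choices): swapping the emitted value from 1 to the page name turns it into the inverted-index mapper above, which is exactly the reuse theme.

import java.util.*;

// Sketch of the Word Count map step: emit (word, 1) for each word in the line.
public class WordCountMapSketch {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("the cat and the hat"));
        // [the=1, cat=1, and=1, the=1, hat=1] -- the reducer would sum counts per word
    }
}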

Bayesian Classification

Files containing classification instances are sent to mappers

Map (filename, instance) -> (instance, class)

Identity Reducer


Bayesian Classification

Existing toolkits can perform Bayes classification on an instance

E.g., WEKA, already in Java!

Another example of discarding input key

TF-IDF

Term Frequency – Inverse Document Frequency

Relevant to text processing

Common web analysis algorithm

The Algorithm, Formally

tf_i = n_i / (total number of terms in the document)

idf_i = log( |D| / |{d : t_i in d}| )

tf-idf_i = tf_i * idf_i

|D| : total number of documents in the corpus

|{d : t_i in d}| : number of documents where the term t_i appears (that is, n_i != 0).

Information We Need

Number of times term X appears in a given document

Number of terms in each document

Number of documents X appears in

Total number of documents

Job 1: Word Frequency in Doc

Mapper

Input: (docname, contents)

Output: ((word, docname), 1)

Reducer

Sums counts for word in document

Outputs ((word, docname), n)

Combiner is same as Reducer

Job 2: Word Counts For Docs

Mapper

Input: ((word, docname), n)

Output: (docname, (word, n))

Reducer

Sums frequency of individual n's in same doc

Feeds original data through

Outputs ((word, docname), (n, N))
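The "feeds original data through" step is the subtle part, so here is an in-memory Java sketch of the Job 2 reducer (not actual Hadoop API code; the sample counts and class name are invented): for one docname it sums all the n's to get N, then re-emits every (word, n) pair with N attached.

import java.util.*;

// Sketch of the Job 2 reduce step; returns lines of the form "word\tdocname\tn\tN".
public class Job2ReducerSketch {
    static List<String> reduce(String docname, List<Map.Entry<String, Integer>> wordCounts) {
        int bigN = 0;
        for (Map.Entry<String, Integer> wc : wordCounts) bigN += wc.getValue();

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> wc : wordCounts) {
            // "Feeds original data through": each (word, n) is emitted again, now with N.
            out.add(wc.getKey() + "\t" + docname + "\t" + wc.getValue() + "\t" + bigN);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> counts = List.of(
                Map.entry("hadoop", 3), Map.entry("cluster", 2), Map.entry("the", 5));
        reduce("doc1", counts).forEach(System.out::println);  // N = 10 on every line
    }
}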

Job 3: Word Frequency In Corpus

Mapper

Input: ((word, docname), (n, N))

Output: (word, (docname, n, N, 1))

Reducer

Sums counts for word in corpus

Outputs ((word, docname), (n, N, m))

Job 4: Calculate TF-IDF

Mapper

Input: ((word, docname), (n, N, m))

Assume D is known (or, easy MR to find it)

Output ((word, docname), TF*IDF)

Reducer

Just the identity function
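Putting the pieces together, a plain-Java sketch of the Job 4 map computation (not Hadoop API code; the sample numbers are illustrative): it combines the (n, N, m) tuple from Jobs 1-3 with the known corpus size D using the formula from "The Algorithm, Formally".

// Sketch of the Job 4 map step as a plain function.
public class TfIdfSketch {
    static double tfIdf(long n, long N, long m, long D) {
        double tf  = (double) n / N;           // term frequency within the document
        double idf = Math.log((double) D / m); // inverse document frequency
        return tf * idf;
    }

    public static void main(String[] args) {
        // e.g. a word appearing 3 times in a 100-term doc, in 1,000 of 1,000,000 docs
        System.out.println(tfIdf(3, 100, 1_000, 1_000_000));  // ~= 0.03 * ln(1000) ~= 0.207
    }
}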

Working At Scale

Buffering (doc, n, N) counts while summing 1's into m may not fit in memory

How many documents does the word "the" occur in?

Possible solutions

Ignore very-high-frequency words

Write out intermediate data to a file

Use another MR pass

Final Thoughts on TF-IDF

Several small jobs add up to full algorithm

Lots of code reuse possible

Stock classes exist for aggregation, identity

Jobs 3 and 4 can really be done at once in same reducer, saving a write/read cycle

Very easy to handle medium-large scale, but must take care to ensure flat memory usage for largest scale


PageRank: Random Walks Over The Web

If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page?

The PageRank of a page captures this notion

More "popular" or "worthwhile" pages get a higher rank

PageRank: Visually

PageRank: Formula

Given page A, and pages T1 through Tn linking to A, PageRank is defined as:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

C(P) is the cardinality (out-degree) of page P

d is the damping ("random URL") factor

PageRank: Intuition

Calculation is iterative: PR_{i+1} is based on PR_i

Each page distributes its PR_i to all pages it links to. Linkees add up their awarded rank fragments to find their PR_{i+1}

d is a tunable parameter (usually = 0.85) encapsulating the "random jump factor"

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Graph Representations

The most straightforward representation of graphs uses references from each node to its neighbors

Direct References

Structure is inherent to object

Iteration requires linked list "threaded through" graph

Requires common view of shared memory (synchronization!)

Not easily serializable

class GraphNode {
  Object data;
  Vector<GraphNode> out_edges;
  GraphNode iter_next;
}

Adjacency Matrices

Another classic graph representation. M[i][j] = '1' implies a link from node i to j.

Naturally encapsulates iteration over nodes

(Figure: a four-node example graph and its 4x4 adjacency matrix.)

Adjacency Matrices: Sparse Representation

Adjacency matrix for most large graphs (e.g., the web) will be overwhelmingly full of zeros.

Each row of the graph is absurdly long

Sparse matrices only include non-zero elements


Sparse Matrix Representation

1: (3, 1), (18, 1), (200, 1)

2: (6, 1), (12, 1), (80, 1), (400, 1)

3: (1, 1), (14, 1)

…

Sparse Matrix Representation

1: 3, 18, 200

2: 6, 12, 80, 400

3: 1, 14

…

PageRank: First Implementation

Create two tables 'current' and 'next' holding the PageRank for each page. Seed 'current' with initial PR values

Iterate over all pages in the graph, distributing PR from 'current' into 'next' of linkees

current := next; next := fresh_table();

Go back to iteration step or end if converged
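Before distributing it, here is the sequential "first implementation" as an in-memory Java sketch (no Hadoop; the tiny three-page graph, the fixed iteration count, and the class name are illustrative assumptions): two tables, 'current' and 'next', and one distribution pass per iteration using the formula above.

import java.util.*;

// Sketch of the sequential current/next PageRank loop.
public class PageRankSketch {
    public static void main(String[] args) {
        double d = 0.85;
        Map<String, List<String>> links = Map.of(
                "A", List.of("B", "C"),
                "B", List.of("C"),
                "C", List.of("A"));

        Map<String, Double> current = new HashMap<>();
        links.keySet().forEach(p -> current.put(p, 1.0));       // seed PR values

        for (int iter = 0; iter < 20; iter++) {                  // fixed iteration count
            Map<String, Double> next = new HashMap<>();
            links.keySet().forEach(p -> next.put(p, 1.0 - d));   // the (1-d) term
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                double share = current.get(e.getKey()) / e.getValue().size(); // PR(T)/C(T)
                for (String linkee : e.getValue()) {
                    next.merge(linkee, d * share, Double::sum);  // distribute into 'next'
                }
            }
            current.clear();
            current.putAll(next);                                // current := next
        }
        System.out.println(current);
    }
}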

Distribution of the Algorithm

Key insights allowing parallelization:

The 'next' table depends on 'current', but not on any other rows of 'next'

Individual rows of the adjacency matrix can be processed in parallel

Sparse matrix rows are relatively small

Distribution of the Algorithm

Consequences of insights:

We can map each row of 'current' to a list of PageRank "fragments" to assign to linkees

These fragments can be reduced into a single PageRank value for a page by summing

Graph representation can be even more compact; since each element is simply 0 or 1, only transmit column numbers where it's 1

Phase 1: Parse HTML

Map task takes (URL, page content) pairs and maps them to (URL, (PR_init, list-of-urls))

PR_init is the "seed" PageRank for URL

list-of-urls contains all pages pointed to by URL

Reduce task is just the identity function

Phase 2: PageRank Distribution

Map task takes (URL, (cur_rank, url_list))

For each u in url_list, emit (u, cur_rank/|url_list|)

Emit (URL, url_list) to carry the points-to list along through iterations

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Phase 2: PageRank Distribution

Reduce task gets (URL, url_list) and many (URL, val) values

Sum vals and fix up with d

Emit (URL, (new_rank, url_list))

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
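To make the two Phase 2 steps concrete, here is an in-memory Java sketch of the map and reduce functions (not actual Hadoop API code; the string encoding of values, the example pages, and the class name are all illustrative assumptions): values are either a rank fragment or the carried-along url_list.

import java.util.*;

// Sketch of the Phase 2 map and reduce steps as plain functions.
public class PageRankPhase2Sketch {
    static double d = 0.85;

    // Map: (URL, (cur_rank, url_list)) -> a fragment for each linkee, plus the list itself.
    static List<String[]> map(String url, double curRank, List<String> urlList) {
        List<String[]> out = new ArrayList<>();
        for (String u : urlList) {
            out.add(new String[]{u, String.valueOf(curRank / urlList.size())});
        }
        out.add(new String[]{url, "links:" + String.join(",", urlList)}); // carry list along
        return out;
    }

    // Reduce: (URL, values) -> (URL, (new_rank, url_list)).
    static String reduce(String url, List<String> values) {
        double sum = 0;
        String links = "";
        for (String v : values) {
            if (v.startsWith("links:")) links = v.substring("links:".length());
            else sum += Double.parseDouble(v);
        }
        double newRank = (1 - d) + d * sum;   // "fix up with d"
        return url + "\t" + newRank + "\t" + links;
    }

    public static void main(String[] args) {
        // One map call for page A with rank 1.0 linking to B and C:
        map("A", 1.0, List.of("B", "C"))
            .forEach(kv -> System.out.println(kv[0] + " -> " + kv[1]));
        // One reduce call for page C that received two fragments:
        System.out.println(reduce("C", List.of("0.5", "1.0", "links:A")));
    }
}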

Finishing up...

A subsequent component determines whether convergence has been achieved (Fixed number of iterations? Comparison of key values?)

If so, write out the PageRank lists - done!

Otherwise, feed output of Phase 2 into another Phase 2 iteration

PageRank Conclusions

MapReduce runs the “heavy lifting” in iterated computation

Key element in parallelization is independent PageRank computations in a given step

Parallelization requires thinking about minimum data partitions to transmit (e.g., compact representations of graph rows)

Even the implementation shown today doesn't actually scale to the whole Internet; but it works for intermediate-sized graphs

Clustering

What is clustering?

Google News

They didn't pick all 3,400,217 related articles by hand…

Or Amazon.com

Or Netflix…

Other less glamorous things...

Hospital Records

Scientific Imaging

Related genes, related stars, related sequences

Market Research

Segmenting markets, product positioning

Social Network Analysis

Data mining

Image segmentation…

The Distance Measure

How the similarity of two elements in a set is determined, e.g.

Euclidean Distance

Manhattan Distance

Inner Product Space

Maximum Norm

Or any metric you define over the space…

Hierarchical Clustering vs. Partitional Clustering

Types of Algorithms

Hierarchical Clustering

Builds or breaks up a hierarchy of clusters.

Partitional Clustering

Partitions set into all clusters simultaneously.

K-Means Clustering

Simple Partitional Clustering

Choose the number of clusters, k

Choose k points to be cluster centers

Then…

K-Means Clustering

iterate {
  Compute distance from all points to all k-centers
  Assign each point to the nearest k-center
  Compute the average of all points assigned to each k-center
  Replace the k-centers with the new averages
}
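A minimal in-memory Java sketch of that loop on 1-D points (no Hadoop or MapReduce; the data, k, and class name are illustrative): assign each point to its nearest center, then replace each center with the mean of its assigned points.

import java.util.*;

// Sketch of the k-means iteration: assign, then recompute centers.
public class KMeansStepSketch {
    public static void main(String[] args) {
        double[] points  = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};
        double[] centers = {0.0, 5.0};                        // k = 2 initial centers

        for (int iter = 0; iter < 10; iter++) {
            double[] sum = new double[centers.length];
            int[] count  = new int[centers.length];
            for (double p : points) {
                int nearest = 0;                              // find the nearest center
                for (int c = 1; c < centers.length; c++) {
                    if (Math.abs(p - centers[c]) < Math.abs(p - centers[nearest])) nearest = c;
                }
                sum[nearest] += p;
                count[nearest]++;
            }
            for (int c = 0; c < centers.length; c++) {        // replace with the new averages
                if (count[c] > 0) centers[c] = sum[c] / count[c];
            }
        }
        System.out.println(Arrays.toString(centers));         // approx. [1.5, 10.5]
    }
}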

But!

The complexity is pretty high:

k * n * O( distance metric ) * num(iterations)

Moreover, it can be necessary to send tons of data to each Mapper Node. Depending on your bandwidth and memory available, this could be impossible.

Furthermore

There are three big ways a data set can be large:

There are a large number of elements in the set.

Each element can have many features.

There can be many clusters to discover

Conclusion – Clustering can be huge, even when you distribute it.

Canopy Clustering

Preliminary step to help parallelize computation.

Clusters data into overlapping Canopies using super cheap distance metric.

Efficient

Accurate

Canopy Clustering

While there are unmarked points {
  pick a point which is not strongly marked
  call it a canopy center
  mark all points within some threshold of it as in its canopy
  strongly mark all points within some stronger threshold
}
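As a rough illustration, an in-memory Java sketch of canopy selection on 1-D points (no MapReduce; the thresholds T1 and T2, the data, and the class name are all invented for the example): T1 is the loose "in the canopy" threshold, T2 < T1 is the "strongly marked" threshold that prevents a point from seeding another canopy.

import java.util.*;

// Sketch of canopy selection with a cheap distance metric (absolute difference).
public class CanopySketch {
    public static void main(String[] args) {
        double T1 = 4.0, T2 = 1.5;
        List<Double> points = List.of(1.0, 1.2, 2.0, 8.0, 8.5, 9.0, 20.0);
        Set<Double> stronglyMarked = new HashSet<>();
        Map<Double, List<Double>> canopies = new LinkedHashMap<>();

        for (double p : points) {
            if (stronglyMarked.contains(p)) continue;     // only not-strongly-marked points
            canopies.put(p, new ArrayList<>());           // p becomes a canopy center
            for (double q : points) {
                double dist = Math.abs(p - q);            // the "super cheap" metric
                if (dist < T1) canopies.get(p).add(q);    // q joins p's canopy
                if (dist < T2) stronglyMarked.add(q);     // q can't seed another canopy
            }
        }
        System.out.println(canopies);
        // {1.0=[1.0, 1.2, 2.0], 8.0=[8.0, 8.5, 9.0], 20.0=[20.0]}
    }
}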

After the canopy clustering…

Resume hierarchical or partitional clustering as usual.

Treat objects in separate clusters as being at infinite distances.

MapReduce Implementation:

Problem – Efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.

The Distance Metric

The Canopy Metric ($)

The K-Means Metric ($$$)

Steps!

Get Data into a form you can use (MR)

Picking Canopy Centers (MR)

Assign Data Points to Canopies (MR)

Pick K-Means Cluster Centers

K-Means algorithm (MR)

Iterate!

Selecting Canopy Centers

Assigning Points to Canopies

K-Means Map

Elbow Criterion

Choose a number of clusters s.t. adding a cluster doesn’t add interesting information.

Rule of thumb to determine what number of Clusters should be chosen.

Initial assignment of cluster seeds has bearing on final model performance.

Often required to run clustering several times to get maximal performance

Clustering Conclusions

Clustering is slick

And it can be done super efficiently

And in lots of different ways

Conclusions

Lots of high-level algorithms

Lots of deep connections to low-level systems

Discussion-based classes helped students think critically about real issues

Code labs made them real