© Spinnaker Labs, Inc.
2008 NSF Data-Intensive Scalable Computing in Education Workshop
Module II: Hadoop-Based Course Components
This presentation includes content © Google, Inc.
Redistributed under the Creative Commons Attribution 2.5 license.
All other contents: © Spinnaker Labs, Inc.
Overview
University of Washington Curriculum
Teaching Methods
Reflections
Student Background
Course Staff Requirements
MapReduce Algorithms
UW: Course Summary
Course title: “Problem Solving on Large Scale Clusters”
Primary purpose: developing large-scale problem-solving skills
Format: 6 weeks of lectures + labs, 4-week project
UW: Course Goals
Think creatively about large-scale problems in a parallel fashion; design parallel solutions
Manage large data sets under memory, bandwidth limitations
Develop a foundation in parallel algorithms for large-scale data
Identify and understand engineering trade-offs in real systems
Lectures
2 hours, once per week
Half formal lecture, half discussion
Mostly covered systems & background
Included group activities for reinforcement
Classroom Activities
Worksheets included pseudo-code programming, working through examples
Performed in groups of 2-3
Small-group discussions about engineering and systems design
Groups of ~10
Course staff facilitated, but mostly open-ended
Readings
No textbook
One academic paper per week
E.g., “Simplified Data Processing on Large Clusters”
Short homework covered comprehension
Formed basis for discussion
Lecture Schedule
Introduction to Distributed Computing
MapReduce: Theory and Implementation
Networks and Distributed Reliability
Real-World Distributed Systems
Distributed File Systems
Other Distributed Systems
Intro to Distributed Computing
What is distributed computing?
Flynn’s Taxonomy
Brief history of distributed computing
Some background on synchronization and memory sharing
MapReduce
Brief refresher on functional programming
MapReduce slides
More detailed version of module I
Discussion on MapReduce
Networking and Reliability
Crash course in networking
Distributed systems reliability
What is reliability?
How do distributed systems fail?
ACID, other metrics
Discussion: Does MapReduce provide reliability?
Real Systems
Design and implementation of Nutch
Tech talk from a Googler on Google Maps
Distributed File Systems
Introduced GFS
Discussed implementations of NFS and AndrewFS for comparison
Other Distributed Systems
BOINC: another platform
Broader definition of distributed systems
DNS
One Laptop per Child project
Labs
Also 2 hours, once per week
Focused on applications of distributed systems
Four lab projects over six weeks
Lab Schedule
Introduction to Hadoop, Eclipse Setup, Word Count
Inverted Index
PageRank on Wikipedia
Clustering on Netflix Prize Data
Design Projects
Final four weeks of quarter
Teams of 1-3 students
Students proposed topic, gathered data, developed software, and presented solution
Example: Geozette
Image © Julia Schwartz
Example: Galaxy Simulation
Image © Slava Chernyak, Mike Hoak
Other Projects
Bayesian Wikipedia spam filter
Unsupervised synonym extraction
Video collage rendering
Ongoing research: traceroutes
Analyze time-stamped traceroute data to model changes in Internet router topology
4.5 GB of data/day * 1.5 years
12 billion traces from 200 PlanetLab sites
Calculates prevalence and persistence of routes between hosts
Ongoing research: dynamic program traces
Dynamic memory trace data from simulators can reach hundreds of GB
Existing work focuses on sampling
New capability: record all accesses and post-process with Hadoop
Common Features
Hadoop!
Used publicly available web APIs for data
Many involved reading papers for algorithms and translating them into the MapReduce framework
Background Topics
Programming Languages
Systems:
Operating Systems
File Systems
Networking
Databases
Programming Languages
MapReduce is based on functional programming’s map and fold
FP is taught in one quarter, but not reinforced
“Crash course” necessary
Worksheets to pose short problems in terms of map and fold
Immutable data is a key concept
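A worksheet-style problem of this kind might look like the following sketch (Python stands in here for whatever language the class used; `sum_of_squares` is a hypothetical exercise, not from the course materials):

```python
from functools import reduce

def square(x):
    # map step: transform each element independently (no shared state)
    return x * x

def add(acc, x):
    # fold step: combine an accumulator with the next element
    return acc + x

def sum_of_squares(xs):
    # map produces a new sequence; fold collapses it to one value.
    # Neither step mutates its input -- the "immutable data" key concept.
    return reduce(add, map(square, xs), 0)
```

Because map touches each element independently, it is exactly the part that parallelizes for free; fold is where ordering and combining must be thought through.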
Multithreaded programming
Taught in OS course at Washington
Not a prerequisite!
Students need to understand multiple copies of the same method running in parallel
File Systems
Necessary to understand GFS
Comparison to NFS, other distributed file systems relevant
Networking
TCP/IP
Concepts of “connection,” network splits, other failure modes
Bandwidth issues
Other Systems Topics
Process Scheduling
Synchronization
Memory coherency
Databases
Concept of a shared consistency model
Consensus
ACID characteristics
Journaling
Multi-phase commit processes
Course Staff
Instructor (me!)
Two undergrad teaching assistants
Helped facilitate discussions, directed labs
One student sysadmin
Worked only about three hours/week
Preparation
Teaching assistants had taken the previous iteration of the course in winter
Lectures retooled based on feedback from that quarter
Added a reasonably large amount of background material
Ran & solved all labs in advance
The Course: What Worked
Discussions
Often covered a broad range of subjects
Hands-on lab projects
“Active learning” in classroom
Independent design projects
Things to Improve: Coverage
Algorithms were not reinforced during lecture
Students requested much more time be spent on “how to parallelize an iterative algorithm”
Background material was very fast-paced
Things to Improve: Projects
Labs could have used a moderated/scripted discussion component
Just “jumping in” to the code proved difficult
No time was devoted to Hadoop itself in lecture
Clustering lab should be split in two
Design projects could have used more time
Part 2: Algorithms
Algorithms for MapReduce
Sorting
Searching
Indexing
Classification
TF-IDF
PageRank
Clustering
MapReduce Jobs
Tend to be very short, code-wise
IdentityReducer is very common
“Utility” jobs can be composed
Represent a data flow, more so than a procedure
Sort: Inputs
A set of files, one value per line
Mapper key is file name, line number
Mapper value is the contents of the line
Sort Algorithm
Takes advantage of reducer properties: (key, value) pairs are processed in order by key; reducers are themselves ordered
Mapper: Identity function for value
(k, v) -> (v, _)
Reducer: Identity function
(k’, _) -> (k’, “”)
Sort: The Trick
(key, value) pairs from mappers are sent to a particular reducer based on hash(key)
Must pick the hash function for your data such that k1 < k2 => hash(k1) < hash(k2)
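A toy simulation of the trick (plain Python, not Hadoop API code; the key range and partition count are illustrative assumptions):

```python
def range_partition(key, num_reducers, min_key=0, max_key=100):
    # An order-preserving "hash": keys are split into contiguous ranges,
    # so k1 < k2 implies partition(k1) <= partition(k2).
    # min_key/max_key are assumed properties of the input data.
    span = (max_key - min_key + 1) / num_reducers
    return min(int((key - min_key) / span), num_reducers - 1)

def mapreduce_sort(values, num_reducers=4):
    # Mapper: identity on the value -- emit (v, _) so the framework
    # sorts by v during the shuffle.
    partitions = [[] for _ in range(num_reducers)]
    for v in values:
        partitions[range_partition(v, num_reducers)].append((v, None))
    # Each reducer sees its keys in sorted order, and the reducers
    # themselves are ordered, so concatenation yields a total order.
    out = []
    for p in partitions:
        for k, _ in sorted(p):
            out.append(k)
    return out
```

Hadoop's real counterpart of `range_partition` needs a sample of the key distribution to pick balanced range boundaries; a plain hash would balance load but destroy the global order.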
Final Thoughts on Sort
Used as a test of Hadoop’s raw speed
Essentially an “IO drag race”
Highlights utility of GFS
Search: Inputs
A set of files containing lines of text
A search pattern to find
Mapper key is file name, line number
Mapper value is the contents of the line
Search pattern sent as special parameter
Search Algorithm
Mapper: Given (filename, some text) and “pattern”, if “text” matches “pattern”, output (filename, _)
Reducer: Identity function
Search: An Optimization
Once a file is found to be interesting, we only need to mark it that way once
Use a Combiner function to fold redundant (filename, _) pairs into a single one
Reduces network I/O
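The whole search job, combiner included, can be simulated in a few lines of plain Python (illustrative only; `search_map`, `combiner`, and the dict-of-lines input are hypothetical stand-ins for Hadoop's input and Combiner machinery):

```python
def search_map(filename, line, pattern):
    # Mapper: emit (filename, _) when the line matches the pattern
    return [(filename, None)] if pattern in line else []

def combiner(pairs):
    # Combiner: fold redundant (filename, _) pairs into a single one
    # before anything crosses the network
    return [(f, None) for f in sorted(set(f for f, _ in pairs))]

def search(files, pattern):
    # files: dict of filename -> list of lines (stand-in for HDFS input)
    emitted = []
    for fname, lines in files.items():
        pairs = []
        for line in lines:
            pairs.extend(search_map(fname, line, pattern))
        emitted.extend(combiner(pairs))  # at most one pair per file
    # Reducer: identity -- matching filenames pass straight through
    return sorted(f for f, _ in emitted)
```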
Indexing: Inputs
A set of files containing lines of text
Mapper key is file name, line number
Mapper value is the contents of the line
Inverted Index Algorithm
Mapper: For each word in (file, words), map to (word, file)
Reducer: Identity function
Index: MapReduce
map(pageName, pageText):
  foreach word w in pageText:
    emitIntermediate(w, pageName);
  done

reduce(word, values):
  foreach pageName in values:
    AddToOutputList(pageName);
  done
  emitFinal(FormattedPageListForWord);
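The same pseudocode as a runnable plain-Python simulation (the in-memory `grouped` dict stands in for Hadoop's shuffle, which groups intermediate pairs by word):

```python
from collections import defaultdict

def index_map(page_name, page_text):
    # emitIntermediate(w, pageName) for each word on the page
    return [(w, page_name) for w in page_text.split()]

def index_reduce(word, page_names):
    # emitFinal: the formatted posting list for this word
    return (word, sorted(set(page_names)))

def inverted_index(pages):
    # pages: dict of pageName -> pageText
    grouped = defaultdict(list)
    for name, text in pages.items():
        for word, page in index_map(name, text):
            grouped[word].append(page)
    return dict(index_reduce(w, ps) for w, ps in grouped.items())
```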
Index: Data Flow
An Aside: Word Count
Word count was described in module I
Mapper for Word Count is (word, 1) for each word in input line
Strikingly similar to inverted index
Common theme: reuse/modify existing mappers
Bayesian Classification
Files containing classification instances are sent to mappers
Map: (filename, instance) -> (instance, class)
Identity Reducer
Bayesian Classification
Existing toolsets exist to perform Bayes classification on instances
E.g., WEKA, already in Java!
Another example of discarding the input key
TF-IDF
Term Frequency - Inverse Document Frequency
Relevant to text processing
Common web analysis algorithm
The Algorithm, Formally
tf-idf(t_i, d) = tf(t_i, d) * idf(t_i), where idf(t_i) = log( |D| / |{d : t_i ∈ d}| )
|D| : total number of documents in the corpus
|{d : t_i ∈ d}| : number of documents where the term t_i appears (that is, tf(t_i, d) ≠ 0)
Information We Need
Number of times term X appears in a given document
Number of terms in each document
Number of documents X appears in
Total number of documents
Job 1: Word Frequency in Doc
Mapper
Input: (docname, contents)
Output: ((word, docname), 1)
Reducer
Sums counts for word in document
Outputs ((word, docname), n)
Combiner is same as Reducer
Job 2: Word Counts For Docs
Mapper
Input: ((word, docname), n)
Output: (docname, (word, n))
Reducer
Sums frequency of individual n’s in same doc
Feeds original data through
Outputs ((word, docname), (n, N))
Job 3: Word Frequency In Corpus
Mapper
Input: ((word, docname), (n, N))
Output: (word, (docname, n, N, 1))
Reducer
Sums counts for word in corpus
Outputs ((word, docname), (n, N, m))
Job 4: Calculate TF-IDF
Mapper
Input: ((word, docname), (n, N, m))
Assume D is known (or, easy MR to find it)
Output: ((word, docname), TF*IDF)
Reducer
Just the identity function
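The four-job pipeline can be simulated end-to-end in one function (plain Python, job boundaries marked in comments; TF is taken as n/N and IDF as log(D/m), matching the standard definition):

```python
import math
from collections import defaultdict

def tf_idf(docs):
    # docs: dict of docname -> list of words; D = corpus size
    D = len(docs)
    # Job 1: ((word, docname), n) -- word frequency in each document
    n = defaultdict(int)
    for doc, words in docs.items():
        for w in words:
            n[(w, doc)] += 1
    # Job 2: carry N = total number of words in each document
    N = {doc: len(words) for doc, words in docs.items()}
    # Job 3: m = number of documents each word appears in
    m = defaultdict(int)
    for (w, doc) in n:
        m[w] += 1
    # Job 4: TF * IDF per (word, docname)
    return {
        (w, doc): (count / N[doc]) * math.log(D / m[w])
        for (w, doc), count in n.items()
    }
```

Each in-memory dict corresponds to one job's intermediate output on disk; a word appearing in every document gets IDF log(1) = 0, which is the desired behavior for stopwords like "the".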
Working At Scale
Buffering (doc, n, N) counts while summing 1’s into m may not fit in memory
How many documents does the word “the” occur in?
Possible solutions:
Ignore very-high-frequency words
Write out intermediate data to a file
Use another MR pass
Final Thoughts on TF-IDF
Several small jobs add up to the full algorithm
Lots of code reuse possible
Stock classes exist for aggregation, identity
Jobs 3 and 4 can really be done at once in the same reducer, saving a write/read cycle
Very easy to handle medium-large scale, but must take care to ensure flat memory usage for the largest scale
PageRank: Random Walks Over The Web
If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page?
The PageRank of a page captures this notion
More “popular” or “worthwhile” pages get a higher rank
PageRank: Visually
PageRank: Formula
Given page A, and pages T1 through Tn linking to A, PageRank is defined as:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
C(P) is the cardinality (out-degree) of page P
d is the damping (“random URL”) factor
PageRank: Intuition
Calculation is iterative: PR_{i+1} is based on PR_i
Each page distributes its PR_i to all pages it links to. Linkees add up their awarded rank fragments to find their PR_{i+1}
d is a tunable parameter (usually 0.85) encapsulating the “random jump factor”
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Graph Representations
The most straightforward representation of graphs uses references from each node to its neighbors
Direct References
Structure is inherent to object
Iteration requires linked list “threaded through” graph
Requires common view of shared memory (synchronization!)
Not easily serializable
class GraphNode {
  Object data;
  Vector<GraphNode> out_edges;
  GraphNode iter_next;
}
Adjacency Matrices
Another classic graph representation. M[i][j] = 1 implies a link from node i to j.
Naturally encapsulates iteration over nodes
Example adjacency matrix for a 4-node graph:
0 1 0 1
0 0 1 0
1 1 0 1
1 0 1 0
Adjacency Matrices: Sparse Representation
Adjacency matrix for most large graphs (e.g., the web) will be overwhelmingly full of zeros.
Each row of the graph is absurdly long
Sparse matrices only include non-zero elements
Sparse Matrix Representation
1: (3, 1), (18, 1), (200, 1)
2: (6, 1), (12, 1), (80, 1), (400, 1)
3: (1, 1), (14, 1)
…
Sparse Matrix Representation
1: 3, 18, 200
2: 6, 12, 80, 400
3: 1, 14
…
PageRank: First Implementation
Create two tables 'current' and 'next' holding the PageRank for each page. Seed 'current' with initial PR values
Iterate over all pages in the graph, distributing PR from 'current' into 'next' of linkees
current := next; next := fresh_table();
Go back to the iteration step, or end if converged
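A sequential sketch of this two-table loop (plain Python; assumes every page has at least one out-link, and uses the non-normalized formula from the earlier slide, whose fixed point for a symmetric graph is PR = 1):

```python
def pagerank(links, d=0.85, iterations=20):
    # links: dict of page -> list of pages it links to
    pages = list(links)
    current = {p: 1.0 for p in pages}          # seed 'current' with initial PR
    for _ in range(iterations):
        next_table = {p: 1.0 - d for p in pages}
        for p in pages:
            share = d * current[p] / len(links[p])
            for target in links[p]:            # distribute PR into 'next'
                next_table[target] += share
        current = next_table                   # current := next
    return current
```

A fixed iteration count stands in for the convergence test; a real run would compare successive tables against a tolerance.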
Distribution of the Algorithm
Key insights allowing parallelization:
The 'next' table depends on 'current', but not on any other rows of 'next'
Individual rows of the adjacency matrix can be processed in parallel
Sparse matrix rows are relatively small
Distribution of the Algorithm
Consequences of insights:
We can map each row of 'current' to a list of PageRank “fragments” to assign to linkees
These fragments can be reduced into a single PageRank value for a page by summing
Graph representation can be even more compact; since each element is simply 0 or 1, only transmit column numbers where it's 1
Phase 1: Parse HTML
Map task takes (URL, page content) pairs and maps them to (URL, (PR_init, list-of-urls))
PR_init is the “seed” PageRank for URL
list-of-urls contains all pages pointed to by URL
Reduce task is just the identity function
Phase 2: PageRank Distribution
Map task takes (URL, (cur_rank, url_list))
For each u in url_list, emit (u, cur_rank/|url_list|)
Emit (URL, url_list) to carry the points-to list along through iterations
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Phase 2: PageRank Distribution
Reduce task gets (URL, url_list) and many (URL, val) values
Sum vals and fix up with d
Emit (URL, (new_rank, url_list))
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Finishing up...
A subsequent component determines whether convergence has been achieved (Fixed number of iterations? Comparison of key values?)
If so, write out the PageRank lists - done!
Otherwise, feed output of Phase 2 into another Phase 2 iteration
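One Phase 2 pass can be simulated in plain Python (illustrative only; the `isinstance` check distinguishes the carried url_list from rank fragments, standing in for a tagged value type in real Hadoop code):

```python
from collections import defaultdict

def phase2_map(url, cur_rank, url_list):
    # Emit a rank fragment for each linkee, plus (URL, url_list)
    # so the points-to list survives into the next iteration.
    emits = [(u, cur_rank / len(url_list)) for u in url_list]
    emits.append((url, url_list))
    return emits

def phase2_reduce(url, values, d=0.85):
    # values: one url_list plus many rank fragments for this URL
    url_list, total = [], 0.0
    for v in values:
        if isinstance(v, list):
            url_list = v          # the carried points-to list
        else:
            total += v            # a rank fragment
    return (url, ((1 - d) + d * total, url_list))

def pagerank_iteration(table, d=0.85):
    # table: dict of url -> (rank, url_list); runs one full Phase 2
    grouped = defaultdict(list)   # stand-in for the shuffle
    for url, (rank, url_list) in table.items():
        for key, value in phase2_map(url, rank, url_list):
            grouped[key].append(value)
    return dict(phase2_reduce(u, vs, d) for u, vs in grouped.items())
```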
PageRank Conclusions
MapReduce runs the “heavy lifting” in iterated computation
Key element in parallelization is independent PageRank computations in a given step
Parallelization requires thinking about minimum data partitions to transmit (e.g., compact representations of graph rows)
Even the implementation shown today doesn't actually scale to the whole Internet; but it works for intermediate-sized graphs
Clustering
What is clustering?
Google News
They didn’t pick all 3,400,217 related articles by hand…
Or Amazon.com
Or Netflix…
Other less glamorous things...
Hospital Records
Scientific Imaging
Related genes, related stars, related sequences
Market Research
Segmenting markets, product positioning
Social Network Analysis
Data mining
Image segmentation…
The Distance Measure
How the similarity of two elements in a set is determined, e.g.
Euclidean Distance
Manhattan Distance
Inner Product Space
Maximum Norm
Or any metric you define over the space…
Types of Algorithms
Hierarchical Clustering vs. Partitional Clustering
Hierarchical Clustering
Builds or breaks up a hierarchy of clusters.
Partitional Clustering
Partitions set into all clusters simultaneously.
K-Means Clustering
Simple Partitional Clustering
Choose the number of clusters, k
Choose k points to be cluster centers
Then…
K-Means Clustering
iterate {
  compute distance from all points to all k centers
  assign each point to the nearest k-center
  compute the average of all points assigned to each k-center
  replace the k-centers with the new averages
}
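A direct transcription of that loop (2-D points and squared Euclidean distance are assumptions for illustration; a real Hadoop version would split the assign and average steps into map and reduce tasks):

```python
def kmeans(points, centers, iterations=10):
    # points, centers: lists of (x, y) tuples
    def dist2(p, c):
        # squared Euclidean distance (ordering-equivalent, no sqrt)
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
    for _ in range(iterations):
        # assign each point to the nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # replace each center with the average of its assigned points;
        # an empty cluster keeps its old center
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers
```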
But!
The complexity is pretty high:
k * n * O(distance metric) * num(iterations)
Moreover, it can be necessary to send tons of data to each Mapper Node. Depending on your bandwidth and memory available, this could be impossible.
Furthermore
There are three big ways a data set can be large:
There are a large number of elements in the set.
Each element can have many features.
There can be many clusters to discover
Conclusion – Clustering can be huge, even when you distribute it.
Canopy Clustering
Preliminary step to help parallelize computation.
Clusters data into overlapping Canopies using super cheap distance metric.
Efficient
Accurate
Canopy Clustering
while there are unmarked points {
  pick a point which is not strongly marked
  call it a canopy center
  mark all points within some threshold of it as in its canopy
  strongly mark all points within some stronger threshold
}
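That loop transcribes roughly as follows (illustrative sketch; the caller supplies the cheap distance metric and both thresholds, with t_tight < t_loose):

```python
def canopy_cluster(points, t_loose, t_tight, dist):
    # Returns a list of (center, canopy-member-list) pairs.
    # Canopies may overlap: a point can fall in several loose thresholds.
    canopies = []
    strongly_marked = set()
    for i, p in enumerate(points):
        if i in strongly_marked:
            continue                          # skip strongly marked points
        # p becomes a canopy center; mark nearby points as in its canopy
        members = [q for q in points if dist(p, q) < t_loose]
        canopies.append((p, members))
        for j, q in enumerate(points):
            if dist(q, p) < t_tight:          # strongly mark close points
                strongly_marked.add(j)
    return canopies
```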
After the canopy clustering…
Resume hierarchical or partitional clustering as usual.
Treat objects in separate clusters as being at infinite distances.
MapReduce Implementation:
Problem – Efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.
The Distance Metric
The Canopy Metric ($)
The K-Means Metric ($$$)
Steps!
Get Data into a form you can use (MR)
Picking Canopy Centers (MR)
Assign Data Points to Canopies (MR)
Pick K-Means Cluster Centers
K-Means algorithm (MR)
Iterate!
Selecting Canopy Centers
Assigning Points to Canopies
K-Means Map
Elbow Criterion
Choose a number of clusters s.t. adding a cluster doesn’t add interesting information.
Rule of thumb to determine what number of Clusters should be chosen.
Initial assignment of cluster seeds has bearing on final model performance.
Often required to run clustering several times to get maximal performance
Clustering Conclusions
Clustering is slick
And it can be done super efficiently
And in lots of different ways
Conclusions
Lots of high-level algorithms
Lots of deep connections to low-level systems
Discussion-based classes helped students think critically about real issues
Code labs made them real