
Presentation Transcript

In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again: The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.… This had been, of course, Roger Pollack's great fear. They had discovered Mr. Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.

Guinea PIG: A Workflow Package for Python 2

Full Implementation

    docid   term        v(d,w)
    d137    aardvark    0.645
    d137    found       0.083
    d137    I           0.004
    d137    farmville   0.356
    d138    when        …
    d138    …           …

Motivation
Integrating data is important. Data from different sources may not have consistent object identifiers, especially automatically-constructed ones. But databases will have human-readable names and/or descriptions for the objects, and matching names and descriptions is tricky….

One solution: Soft (Similarity) joins
A similarity join of two sets A and B is an ordered list of triples (s_ij, a_i, b_j) such that:
- a_i is from A
- b_j is from B
- s_ij is the similarity of a_i and b_j
- the triples are in descending order of s_ij
- the list is either the top K triples by s_ij, or ALL triples with s_ij > L
… or sometimes some approximation of these. (A small sketch of this definition follows below.)
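
As a concrete illustration of that definition, here is a minimal Python sketch; it uses token-set Jaccard similarity for brevity (the slides later use TFIDF/cosine), and the toy names are assumptions, not from the slides.

    def jaccard(a, b):
        """Token-set Jaccard similarity of two names."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb)

    def soft_join(A, B, K=None, L=0.0):
        """Return triples (s_ij, a_i, b_j) in descending order of similarity:
        either the top K triples, or all triples with s_ij > L."""
        triples = sorted(((jaccard(a, b), a, b) for a in A for b in B), reverse=True)
        if K is not None:
            return triples[:K]
        return [t for t in triples if t[0] > L]

    print(soft_join(["Acadia National Park", "Gateway NRA"],
                    ["Acadia NP", "Gateway National Recreation Area"], K=2))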

Example: soft joins/similarity joins. Input: two different lists of entity names.

Example: soft joins/similarity joins. Output: pairs of names ranked by similarity (identical, similar, less similar).

Semantic Joining with Multiscale Statistics. William Cohen, Katie Rivard, Dana Attias-Moshevitz (CMU)

A parallel workflow for TFIDF similarity joins

Parallel Inverted Index Softjoin - 1
Statistics for computing TFIDF with IDFs local to each relation (sumSquareWeights):

    rel   docid   term      v(d,w)
    a     a137    gateway   0.945
    a     a137    nwr       0.055
    a     a138    acadia    0.915
    a     a138    np        0.085
    b     b013    gateway   0.845
    b     b013    np        0.155
    b     …       …         …

Parallel Inverted Index Softjoin - 2
What's the algorithm?
Step 1: create document vectors as (C_d, d, term, weight) tuples.
Step 2: join the tuples from A and B: one sort and reduce. This gives you tuples (a, b, term, w(a,term)*w(b,term)).
Step 3: group the common terms by (a, b) and reduce to aggregate the components of the sum.
(A sketch of these three steps follows below.)
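
A minimal, single-machine sketch of those three steps; grouping is emulated with dictionaries rather than a real sort-and-reduce, and the inputs are assumed to be already TFIDF-weighted, normalized document vectors as on the previous slide.

    from collections import defaultdict

    def softjoin(vecs_a, vecs_b):
        """vecs_a, vecs_b: dicts mapping doc id -> {term: weight}.
        Returns {(a, b): similarity} for every pair of docs sharing a term."""
        # Step 1: flatten document vectors into (d, term, weight) tuples,
        # organized as an inverted index on term for each relation.
        inv_a, inv_b = defaultdict(list), defaultdict(list)
        for d, vec in vecs_a.items():
            for term, w in vec.items():
                inv_a[term].append((d, w))
        for d, vec in vecs_b.items():
            for term, w in vec.items():
                inv_b[term].append((d, w))

        # Step 2: join the A and B tuples on term, producing (a, b, term, w(a,term)*w(b,term)).
        partials = []
        for term in set(inv_a) & set(inv_b):
            for a, wa in inv_a[term]:
                for b, wb in inv_b[term]:
                    partials.append((a, b, term, wa * wb))

        # Step 3: group the common terms by (a, b) and sum the per-term components.
        sims = defaultdict(float)
        for a, b, term, prod in partials:
            sims[(a, b)] += prod
        return dict(sims)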

Making the algorithm smarter….

Inverted Index Softjoin - 2: we should make a smart choice about which terms to use.

Adding heuristics to the soft join - 1

Adding heuristics to the soft join - 2

Adding heuristics (Parks)
Without heuristics: input 40k, data 60k, docvec 102k, softjoin 539k, tokens 508k documents, 0 errors in top 50.
With heuristics: input 40k, data 60k, docvec 102k, softjoin 32k, tokens 24k documents, 3 errors in top 50, < 400 useful terms.

Adding heuristics (SO vs Wikipedia)
Without heuristics: input 612M, docvec 1050M, softjoin 67M.
With heuristics: input 612M, docvec 1050M, softjoin 9.1M.

PageRank

Google's PageRank
Inlinks are "good" (recommendations). Inlinks from a "good" site are better than inlinks from a "bad" site, but inlinks from sites with many outlinks are not as "good"... "Good" and "bad" are relative.

Google's PageRank
Imagine a "pagehopper" that always either follows a random link, or jumps to a random page.

Google’s PageRank(Brin & Page, http://www-db.stanford.edu/~backrub/google.html) web site xxx web site yyyy web site a b c d e f g web site pdq pdq .. web site yyyy web site a b c d e f g web site xxx Imagine a “ pagehopper ” that always either follows a random link, or jumps to random page PageRank ranks pages by the amount of time the pagehopper spends on a page: or, if there were many pagehoppers, PageRank is the expected “ crowd size ” 21

PageRank in Memory
Let u = (1/N, …, 1/N), where N = #nodes.
Let A = adjacency matrix: [a_ij = 1 if i links to j].
Let W = [w_ij = a_ij / outdegree(i)]; w_ij is the probability of a jump from i to j.
Let v^0 = (1, 1, …, 1), or anything else you want.
Repeat until converged:
  Let v^(t+1) = c u + (1-c) v^t W
where c is the probability of jumping "anywhere randomly".
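
A minimal numpy sketch of this in-memory iteration (the toy adjacency matrix and the convergence tolerance are assumptions):

    import numpy as np

    def pagerank_in_memory(A, c=0.15, tol=1e-8):
        """Power iteration: v <- c*u + (1-c)*v*W, with W = row-normalized A."""
        N = A.shape[0]
        outdeg = A.sum(axis=1, keepdims=True)
        W = A / np.maximum(outdeg, 1)          # w_ij = a_ij / outdegree(i)
        u = np.full(N, 1.0 / N)
        v = np.ones(N)
        while True:
            v_next = c * u + (1 - c) * (v @ W)
            if np.abs(v_next - v).sum() < tol:
                return v_next
            v = v_next

    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0]], dtype=float)
    print(pagerank_in_memory(A))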

Streaming PageRank

Streaming PageRank
Assume we can store v but not W in memory.
Repeat until converged:
  Let v^(t+1) = c u + (1-c) v^t W
Store A as a row matrix: each line is
  i  j_i,1, …, j_i,d   [the neighbors of i]
Store v' and v in memory: v' starts out as c u.
For each line "i  j_i,1, …, j_i,d":
  For each j in j_i,1, …, j_i,d:
    v'[j] += (1-c) v[i]/d
Everything needed for the update is right there in the row…. (A small sketch of this loop follows below.)
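
A minimal Python sketch of that loop, assuming the graph file format shown above (one line per node: the node id followed by its out-neighbors) and dict-based vectors:

    def streaming_pagerank_pass(graph_path, v, c=0.15):
        """One pass over the graph file; v (the old pageranks) fits in memory, the graph does not."""
        N = len(v)
        v_new = {i: c / N for i in v}                 # v' starts out as c*u, with u = (1/N, ..., 1/N)
        with open(graph_path) as f:
            for line in f:                            # each line: "i  j_1 ... j_d" (the neighbors of i)
                i, *neighbors = line.split()
                d = len(neighbors)
                for j in neighbors:
                    v_new[j] = v_new.get(j, c / N) + (1 - c) * v[i] / d
        return v_new

Iterating this pass until v stops changing gives the same fixed point as the in-memory version.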

Streaming PageRank
Assume we can store one row of W in memory at a time, but not v; i.e., the links on a web page fit in memory, but the pageRank vector for the whole web does not.
The setup and update are the same as above: repeat until converged, v^(t+1) = c u + (1-c) v^t W; store A as a row matrix (each line is i followed by its neighbors j_i,1, …, j_i,d); v' starts out as c u; for each line, for each j in j_i,1, …, j_i,d: v'[j] += (1-c) v[i]/d.
We need to convert these counter updates to messages, like we did for naïve Bayes. Like in naïve Bayes: a document fits, but the model doesn't.

Streaming PageRank
Same assumption and update as above: one row of W fits in memory at a time, but v does not; scan the graph row by row and do v'[j] += (1-c) v[i]/d for each out-neighbor j.
Streaming: if we know c and v[i] and have the linked-to j's, then we can compute d and produce messages saying how much to increment each v[j]. Then we sort the messages by j, add up all the incrementals, and somehow also add in c/N.

Streaming PageRank
Same assumption and update as above; in the streaming view, knowing c and v[i] and the linked-to j's, we can compute d and produce messages saying how much to increment each v[j]; then we sort the messages by j, add up all the incrementals, and add in c/N.
So we need to stream through some structure that has v[i] plus all the outlinks. This is a copy of the graph with the current pageRank estimates (v) for each node attached to the node. (A sketch of this message-passing formulation follows below.)
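
A minimal sketch of the message-passing version, assuming each input record already carries a node's current pagerank alongside its outlinks (a copy of the graph with v attached); the sort step stands in for the shuffle a map-reduce framework would do.

    def pagerank_messages(lines, c=0.15, N=None):
        """lines: iterable of 'i v[i] j1 j2 ...' records. Returns {node: new pagerank}."""
        # Map: emit one (j, increment) message per outlink.
        messages = []
        nodes = set()
        for line in lines:
            fields = line.split()
            i, vi, neighbors = fields[0], float(fields[1]), fields[2:]
            nodes.add(i)
            d = len(neighbors)
            for j in neighbors:
                messages.append((j, (1 - c) * vi / d))
        N = N or len(nodes)
        # Shuffle: sort the messages by destination j, so increments for one node are adjacent.
        messages.sort()
        # Reduce: add up the incrementals for each j, and add in c/N.
        v_new = {i: c / N for i in nodes}
        for j, delta in messages:
            v_new[j] = v_new.get(j, c / N) + delta
        return v_new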

PageRank in Dataflow Languages

Iteration in Dataflow
Pig and Guinea Pig are pure dataflow languages: no conditionals, no iteration. To loop, you need to embed a dataflow program into a 'real' program.

Recap: a GuineaPig program…. There's no "program" object, and no obvious way to write a main that will call a program.

Calling GuineaPig programmatically
Somewhere in the execution of the plan we will call this script with special arguments that tell it to do a substep of the plan, so we need a Python main that will do that right. But I also want to write an actual main program, so…: create a Planner object I can call (not a subclass), and remember to call setup().

Calling GuineaPig programmatically
- Convert from the initial format and assign the initial pagerank vector v
- Create the planner
- Move the initial pageranks + graph to a tmp file
- Compute updated pageranks
- Move the new pageranks to a tmp file for the next iteration of the loop
(A schematic driver loop is sketched below.)
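
Since the slide's code figure did not survive the transcript, here is a schematic driver loop in that spirit; it is not the GuineaPig API, just an illustration of embedding a dataflow step inside an ordinary Python main (the script name, file names, and format are assumptions):

    import shutil, subprocess

    def init_pageranks(graph_in, tmp_file):
        """Convert the input graph and attach an initial pagerank to every node (assumed format)."""
        with open(graph_in) as fin, open(tmp_file, "w") as fout:
            for line in fin:
                node, *neighbors = line.split()
                fout.write(" ".join([node, "1.0"] + neighbors) + "\n")

    def main(graph_in="graph.txt", tmp_file="pr_tmp.txt", iterations=10):
        init_pageranks(graph_in, tmp_file)
        for _ in range(iterations):
            # Run one dataflow step (e.g. a GuineaPig/Pig script, hypothetical here) as a subprocess;
            # it reads pr_tmp.txt and writes updated pageranks to pr_out.txt.
            subprocess.check_call(["python", "update_pageranks.py", tmp_file, "pr_out.txt"])
            # Move the new pageranks into place for the next iteration of the loop.
            shutil.move("pr_out.txt", tmp_file)

    if __name__ == "__main__":
        main()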

Lots and lots of I/O happening here… A row in initGraph is (i, [j1, …, jd]).

Note: there is lots of I/O happening here…

An example from Ron Bekkerman

Example: k-means clustering
An EM-like algorithm:
- Initialize k cluster centroids
- E-step: associate each data instance with the closest centroid (find the expected values of the cluster assignments given the data and centroids)
- M-step: recalculate centroids as an average of the associated data instances (find new centroids that maximize that expectation)
(A small sketch follows below.)
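
A minimal numpy sketch of those two steps (the random data, k=3, and the fixed iteration count are assumptions):

    import numpy as np

    def kmeans(X, k=3, iterations=20, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialize k cluster centroids
        for _ in range(iterations):
            # E-step: associate each data instance with the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # M-step: recalculate each centroid as the average of its associated instances.
            for c in range(k):
                if (assign == c).any():
                    centroids[c] = X[assign == c].mean(axis=0)
        return centroids, assign

    X = np.random.default_rng(1).normal(size=(100, 2))
    centroids, assign = kmeans(X)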

k-means clustering (figure: data points and cluster centroids)

Parallelizing k-means (figures)

k-means on MapReduce (Panda et al., Chapter 2)
- Mappers read data portions and centroids
- Mappers assign data instances to clusters
- Mappers compute new local centroids and local cluster sizes
- Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
- Reducers write the new centroids
(A mapper/reducer sketch follows below.)
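
A hedged sketch of that mapper/reducer split in plain Python; the framework plumbing, data shapes, and function names are assumptions, not from Panda et al.

    import numpy as np

    def mapper(data_portion, centroids):
        """Assign each instance to its closest centroid; emit (cluster, (local centroid, local size))."""
        dists = np.linalg.norm(data_portion[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(len(centroids)):
            members = data_portion[assign == c]
            if len(members):
                yield c, (members.mean(axis=0), len(members))

    def reducer(cluster, local_stats):
        """Aggregate local centroids, weighted by local cluster sizes, into the new global centroid."""
        total = sum(size for _, size in local_stats)
        new_centroid = sum(centroid * size for centroid, size in local_stats) / total
        return cluster, new_centroid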

k-means in Apache Pig: input data
Assume we need to cluster documents, stored in a 3-column table D:

    Document   Word       Count
    doc1       Carnegie   2
    doc1       Mellon     2

The initial centroids are k randomly chosen docs, stored in a table C in the same format as above.

k-means in Apache Pig: E-step

    D_C      = JOIN C BY w, D BY w;
    PROD     = FOREACH D_C GENERATE d, c, id * ic AS idic;

    PRODg    = GROUP PROD BY (d, c);
    DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;

    SQR      = FOREACH C GENERATE c, ic * ic AS ic2;
    SQRg     = GROUP SQR BY c;
    LEN_C    = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;

    DOT_LEN  = JOIN LEN_C BY c, DOT_PROD BY c;
    SIM      = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;

    SIMg     = GROUP SIM BY d;
    CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);


k-means in Apache Pig: M-step

    D_C_W      = JOIN CLUSTERS BY d, D BY d;

    D_C_Wg     = GROUP D_C_W BY (c, w);
    SUMS       = FOREACH D_C_Wg GENERATE c, w, SUM(id) AS sum;

    D_C_Wgg    = GROUP D_C_W BY c;
    SIZES      = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;

    SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
    C          = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS ic;

Finally, embed this in Java (or Python or …) to do the looping.

The problem with k-means in Hadoop: I/O costs

Data is read, and the model is written, with every iteration:
- Mappers read data portions and centroids
- Mappers assign data instances to clusters
- Mappers compute new local centroids and local cluster sizes
- Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
- Reducers write the new centroids
(Panda et al., Chapter 2)

Spark: Another Dataflow Language

Spark
Hadoop: too much typing (programs are not concise). Hadoop: too low level (missing abstractions; hard to specify a workflow). Pig, Guinea Pig, … address these problems; so does Spark. But Hadoop-based dataflow is not well suited to iterative operations (e.g., PageRank, E/M, k-means clustering, …); Spark lowers the cost of repeated reads.
Spark offers a set of concise dataflow operations ("transformations"), embedded in an API together with "actions". Sharded files are replaced by "RDDs": resilient distributed datasets. RDDs can be cached in cluster memory and recreated to recover from error.

Spark examples
(In the examples, spark is a SparkContext object.)

Spark examples
errors is a transformation, and thus a data structure that explains HOW to do something. count() is an action: it will actually execute the plan for errors and return a value. errors.filter() is a transformation; collect() is an action. Everything is sharded, like in Hadoop and GuineaPig.
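
Since the slide's code itself did not survive the transcript, here is a hedged reconstruction of the classic log-mining example it describes (the file name and filter strings are assumptions):

    # spark is a SparkContext object (see above)
    lines  = spark.textFile("log.txt")                       # transformation: an RDD of lines
    errors = lines.filter(lambda line: "ERROR" in line)      # transformation: describes HOW, runs nothing yet
    print(errors.count())                                    # action: executes the plan and returns a value
    mysql_errors = errors.filter(lambda l: "mysql" in l)     # another transformation
    print(mysql_errors.collect())                            # action: brings the matching lines to the driver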

Spark examples
Modify errors to be stored in cluster memory, so subsequent actions will be much faster. Everything is sharded, and the shards are stored in the memory of worker machines, not local disk (if possible). You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark's not smart about persisting data.

Spark examples
(Same caching notes as the previous slide.) You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig; Spark's not smart about persisting data. Persist-on-disk works because the RDD is read-only (immutable). (A small cache/persist sketch follows below.)
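
A tiny sketch of the caching and persisting calls being described, reusing the errors RDD from the log-mining sketch above:

    from pyspark import StorageLevel

    errors.cache()                                          # keep the shards in cluster memory (if possible)
    print(errors.count())                                   # the first action computes and caches the RDD
    print(errors.filter(lambda l: "mysql" in l).count())    # later actions reuse the cached shards

    # persisting on disk instead, roughly like opts(stored=True) in GuineaPig
    mysql_errors = errors.filter(lambda l: "mysql" in l)
    mysql_errors.persist(StorageLevel.DISK_ONLY)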

Spark examples: wordcount
Callouts on the code: the action; the transformations on (key, value) pairs, which are special. (A hedged reconstruction follows below.)
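
The wordcount code itself was in a figure; below is a hedged reconstruction of the standard Spark wordcount the callouts refer to (the file names are assumptions):

    counts = (spark.textFile("input.txt")
                   .flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))             # transformations on (key, value) pairs
                   .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("counts")                          # the action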

Spark examples: batch logistic regression
reduce is an action: it produces a numpy vector. p.x and w are vectors from the numpy package; Python overloads operations like * and + for vectors.

Spark examples: batch logistic regression
Important note: numpy vectors/matrices are not just "syntactic sugar". They are much more compact than something like a list of Python floats, and numpy operations like dot, *, and + are calls to optimized C code. A little Python logic around a lot of numpy calls is pretty efficient.

Spark examples: batch logistic regression
w is defined outside the lambda function, but used inside it. So Python builds a closure (code including the current value of w) and Spark ships it off to each worker. So w is copied, and must be read-only.

Spark examples: batch logistic regression
The dataset of points is cached in cluster memory to reduce I/O.

Spark logistic regression example
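
The code figure for this slide did not survive, so here is a hedged reconstruction of the well-known Spark batch logistic regression example the preceding notes describe; the point format, number of features D, and iteration count are assumptions.

    import numpy as np
    from collections import namedtuple

    Point = namedtuple("Point", ["x", "y"])          # p.x: numpy feature vector, p.y: +/-1 label

    def parse_point(line):
        values = np.array([float(v) for v in line.split()])
        return Point(x=values[1:], y=values[0])      # assumed input format: label first, then features

    D = 10                                           # number of features (assumption)
    points = spark.textFile("points.txt").map(parse_point).cache()   # cached in cluster memory to reduce i/o
    w = np.random.ranf(size=D)                       # initial weights; copied into each worker's closure

    for _ in range(20):
        gradient = points.map(
            lambda p: (1.0 / (1.0 + np.exp(-p.y * np.dot(w, p.x))) - 1.0) * p.y * p.x
        ).reduce(lambda a, b: a + b)                 # reduce is an action: it returns a numpy vector
        w -= gradient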

Spark