Presentation Transcript

Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services @ WalmartLabs
Paul Suganthan G. C., University of Wisconsin-Madison
Joint work with Sanjib Das, AnHai Doan, Jeffrey F. Naughton, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra, Youngchoon Park

Entity Matching

Table A:
Name       | City      | State
Dave Smith | Madison   | WI
Joe Wilson | San Jose  | CA
Dan Smith  | Middleton | WI

Table B:
Name            | City      | State
David D. Smith  | Madison   | WI
Daniel W. Smith | Middleton | WI

Recent Crowdsourced EM Work
Example: verifying predicted matches
- Table A contains tuples a, b, c; Table B contains tuples d, e
- Blocking produces the candidate pairs (a,d), (b,e), (c,d), (c,e)
- Matching predicts (a,d) Y, (b,e) N, (c,d) Y, (c,e) Y
- Verifying confirms (a,d) Y, (c,e) Y
Limitations:
- crowdsources only parts of the workflow
- needs a developer to execute the remaining parts
- does not work when doing many EM tasks: not enough developers

Our Recent Solution: Corleone [SIGMOD-14]
Introduces the idea of "hands-off crowdsourcing":
- crowdsources the entire EM workflow, requiring no developers
[Figure: the user supplies tables A and B; Blocker → candidate tuple pairs → Matcher → predicted matches → Accuracy Estimator → accuracy estimates (P, R); plus a Difficult Pairs' Locator]

Limitations of Corleone
- Does not scale to large tables (e.g., 50K-1M tuples)
  - executes mostly a single-machine in-memory EM workflow
[Figure: domain scientists (self-service EM) using an EM service; domain scientists @ UW-Madison with our developers and crowd workers: takes weeks to match]

Contributions of This Paper
- Scales up Corleone to tables of millions of tuples
  - matches tables of 1-2.5M tuples for only $54-66 in 2-14 hours
- Used extensively at multiple organizations
  - e.g., Johnson Controls, Marshfield Clinic, a non-profit organization, WalmartLabs, different UW-Madison departments
- Recently deployed as a cloud service by Yash Govind, Erik Paulson, Mukilan Ashok

Contributions of This Paper
Our technical solution:
- defines basic operators & uses them to model the EM workflow of Corleone as a DAG
- scales up operators (using MapReduce), e.g., executing complex blocking rules over large tables
- optimizes within and across operators, e.g., uses crowd time of one operator to mask machine time of another
Results can potentially be applicable to other contexts:
- e.g., scale up rules that involve string similarity measures
- scale up any DAG that involves complex rules, crowdsourcing, and machine learning

Consider a Simplified Corleone Workflow
[Figure: the user supplies tables A and B; Blocker → candidate tuple pairs → Matcher → predicted matches → Accuracy Estimator → accuracy estimates (P, R); Difficult Pairs' Locator]

Blocking
- Many solutions; hash-based blocking is very popular
  - e.g., only consider tuple pairs where A.state = B.state
  - easy to understand and implement, highly scalable
- In practice, however, it often does not work well
  - due to dirty data, variations in data values, missing values
  - results in low recall (i.e., drops many true matches)
- Corleone uses rule-based blocking
  - more powerful, gives higher recall, subsumes hash-based blocking
  - but far more difficult to learn and scale
Sample rules for blocking tables of books (a code sketch of these rules follows below):
  jaccard(a.title, b.title) < 0.7 → drop(a, b)
  exact_match(a.year, b.year) = 0 AND abs_diff(a.price, b.price) > 10 → drop(a, b)
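To make the rule format concrete, here is a minimal Python sketch (not code from the paper) of the two sample book-blocking rules as predicates over tuple pairs; the field names title, year, and price follow the book example above, and a pair survives blocking only if no rule fires.

```python
# A minimal sketch of the two sample blocking rules as Python predicates.
# Each rule returns True when the pair (a, b) should be dropped before matching.

def jaccard(s, t):
    """Jaccard similarity of the token sets of two strings."""
    x, y = set(s.lower().split()), set(t.lower().split())
    return len(x & y) / len(x | y) if (x or y) else 1.0

def rule1(a, b):
    # jaccard(a.title, b.title) < 0.7  ->  drop(a, b)
    return jaccard(a["title"], b["title"]) < 0.7

def rule2(a, b):
    # exact_match(a.year, b.year) = 0 AND abs_diff(a.price, b.price) > 10  ->  drop(a, b)
    return a["year"] != b["year"] and abs(a["price"] - b["price"]) > 10

def survives_blocking(a, b, rules=(rule1, rule2)):
    """A pair becomes a candidate only if no blocking rule fires."""
    return not any(rule(a, b) for rule in rules)
```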

Blocking in Corleone
- Takes a sample S from A x B (without materializing A x B)
- Trains a random forest F on S (to match tuple pairs) using active learning, where the crowd labels pairs
- Extracts candidate blocking rules from F
Example random forest F for matching books: one tree splits on isbn_match (N → No; Y → split on #pages_match: N → No, Y → Yes); another tree splits on title_match (N → No; Y → split on publisher_match: N → No, Y → Yes)
Extracted candidate blocking rules:
  (isbn_match = N) → No
  (isbn_match = Y) and (#pages_match = N) → No
  (title_match = N) → No
  (title_match = Y) and (publisher_match = N) → No
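The rule-extraction step can be pictured with a small sketch: every root-to-leaf path of a tree that ends in a "No" (non-match) leaf becomes a candidate blocking rule. The tree encoding below is hypothetical, not Corleone's internal representation.

```python
# A minimal sketch of rule extraction: a leaf is the string "Yes"/"No",
# an internal node is a tuple (feature, no_branch, yes_branch).

def extract_blocking_rules(node, path=()):
    """Return all predicate conjunctions (paths) that end in a 'No' leaf."""
    if isinstance(node, str):                      # leaf
        return [path] if node == "No" else []
    feature, no_branch, yes_branch = node
    rules = []
    rules += extract_blocking_rules(no_branch,  path + ((feature, "N"),))
    rules += extract_blocking_rules(yes_branch, path + ((feature, "Y"),))
    return rules

# One of the example trees from the slide.
tree1 = ("isbn_match", "No", ("#pages_match", "No", "Yes"))
print(extract_blocking_rules(tree1))
# [(('isbn_match', 'N'),), (('isbn_match', 'Y'), ('#pages_match', 'N'))]
```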

Blocking in Corleone
- Use the crowd to evaluate the precision of the extracted rules
- Select a subset of the rules & execute it on A & B
Example: from the four candidate rules
  (isbn_match = N) → No
  (isbn_match = Y) and (#pages_match = N) → No
  (title_match = N) → No
  (title_match = Y) and (publisher_match = N) → No
the selected subset is
  (isbn_match = N) → No
  (title_match = N) → No
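One simple way to picture the crowd-based evaluation (a sketch under assumptions, not Corleone's exact estimator): sample pairs that a rule would drop, have the crowd label them, and take the fraction of dropped pairs that are true non-matches as the rule's precision.

```python
# A minimal, hypothetical sketch of estimating a blocking rule's precision
# from crowd-labeled pairs.

def rule_precision(rule, labeled_pairs):
    """labeled_pairs: list of ((a, b), is_match) with crowd-provided labels."""
    dropped = [is_match for (a, b), is_match in labeled_pairs if rule(a, b)]
    if not dropped:
        return None                      # rule never fired on the sample
    return sum(1 for is_match in dropped if not is_match) / len(dropped)
```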

Modeling the Blocking Step as a DAG of Basic Operators
[Figure: a DAG over tables A and B with the operators sample → feature generation → active learn → get blocking rules → evaluate rules → select optimal sequence → execute blocking rules; intermediate results include the sample S, matcher M, rules R, evaluated rules E, rule sequence F, and candidate set C]

Modeling Both Blocking and Matching as a DAG of Operators
[Figure: the blocking DAG above, extended with a second sample / feature generation / active learn pass over the candidate set C to train a matcher N, followed by apply matcher, which produces the matches]
- Eight basic operators
- Involve complex rules, crowdsourcing, ML
- Can be used to compose a variety of EM workflows
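As a rough illustration, the workflow can be written down as an explicit DAG over the operator names from the figure; the edges below are an approximation of the figure, and the topological walk is just a stand-in for Falcon's actual execution engine.

```python
# A minimal, hypothetical sketch of the blocking + matching workflow as a DAG
# of basic operators. Each key maps an operator to its predecessors.

from graphlib import TopologicalSorter

EM_DAG = {
    "sample_1":                [],
    "feature_generation_1":    ["sample_1"],
    "active_learn_1":          ["feature_generation_1"],
    "get_blocking_rules":      ["active_learn_1"],
    "evaluate_rules":          ["get_blocking_rules"],
    "select_optimal_sequence": ["evaluate_rules"],
    "execute_blocking_rules":  ["select_optimal_sequence"],
    "sample_2":                ["execute_blocking_rules"],   # sample from candidate set C
    "feature_generation_2":    ["sample_2"],
    "active_learn_2":          ["feature_generation_2"],     # trains matcher N
    "apply_matcher":           ["active_learn_2", "execute_blocking_rules"],
}

# Any schedule respecting these edges is a valid execution order.
print(list(TopologicalSorter(EM_DAG).static_order()))
```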

Scaling Up the Operator "Executing Blocking Rules"
- Existing solutions do not work well
  - they treat rules as black boxes and hence enumerate the entire A x B, or
  - they consider only simple rules that are a single predicate
The operator takes tables A and B plus the rules and outputs the candidate set C. Sample rules for blocking tables of books:
  jaccard(a.title, b.title) < 0.7 → drop(a, b)
  exact_match(a.year, b.year) = 0 AND abs_diff(a.price, b.price) > 10 → drop(a, b)

Our Solution
- Observe that rules often contain well-known similarity functions
  - e.g., edit distance, Jaccard, cosine
- Exploit properties of these functions: build index-based filters, then use them to avoid enumerating A x B
Example of filtering using size, for the rule jaccard(a.title, b.title) < 0.7 → drop(a, b), i.e., jaccard(a.title, b.title) ≥ 0.7 → keep(a, b):
  String: "The Big Short"; number of tokens = 3; possible size range: 3*0.7 ≤ x ≤ 3/0.7
  "Spirited Away" (2 tokens): don't keep; "The Big Lebowski" (3 tokens): keep
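A minimal sketch of the size filter, assuming the token-based Jaccard from the slide: if the token counts of two titles differ too much, jaccard(a.title, b.title) ≥ 0.7 is impossible, so the pair can be skipped without computing the similarity.

```python
# A minimal sketch of size filtering for a rule of the form
# jaccard(a.title, b.title) >= t -> keep(a, b).

import math

def size_bounds(num_tokens, t):
    """Token-count range a candidate must fall in to possibly reach Jaccard >= t."""
    return math.ceil(num_tokens * t), math.floor(num_tokens / t)

def size_filter_ok(a_title, b_title, t=0.7):
    na, nb = len(a_title.split()), len(b_title.split())
    lo, hi = size_bounds(na, t)
    return lo <= nb <= hi

# "The Big Short" has 3 tokens, so 3*0.7 <= x <= 3/0.7, i.e., x in {3, 4}.
print(size_filter_ok("The Big Short", "Spirited Away"))     # False (2 tokens)
print(size_filter_ok("The Big Short", "The Big Lebowski"))  # True  (3 tokens)
```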

Our Solution
- Developed four MapReduce implementations
  - all indexes may not fit in the memory of a mapper
  - balance between the memory available for indexes at the mappers and the amount of work done at the reducers
  - use a rule-based optimizer to choose an implementation
[Figure: indexes I1, I2, I3 are built over A, one per predicate of the rules jaccard(a.title, b.title) < 0.7 → drop(a, b) and exact_match(a.year, b.year) = 0 AND abs_diff(a.price, b.price) > 10 → drop(a, b); for each b in B, probing them yields candidate sets C1, C2, C3, which are combined (via ∪ and ∩) into C; the rules are then applied only to pairs (a, b) with a ∈ C]
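The following sketch illustrates the filtering idea (hypothetical index objects, not Falcon's MapReduce code): per-predicate indexes over A are probed for each b in B, the resulting candidate sets are combined to form C, and the rules are evaluated only on pairs drawn from C. The combination C1 ∩ (C2 ∪ C3) below mirrors the boolean structure of the two example drop rules and may differ from the exact expression on the slide.

```python
# A minimal, hypothetical sketch of executing blocking rules via indexes.
# index_title / index_year / index_price stand in for filters I1, I2, I3 over
# table A; for a given tuple b each returns ids of A-tuples that could still
# survive its predicate.

def candidates_for(b, index_title, index_year, index_price):
    c1 = index_title.probe(b)      # a's that may have jaccard(title) >= 0.7
    c2 = index_year.probe(b)       # a's with a.year == b.year
    c3 = index_price.probe(b)      # a's with |a.price - b.price| <= 10
    return c1 & (c2 | c3)          # survive rule 1 AND survive rule 2

def execute_blocking_rules(A, B, rules, index_title, index_year, index_price):
    """Yield surviving (a_id, b_id) pairs without enumerating all of A x B."""
    for b_id, b in B.items():
        for a_id in candidates_for(b, index_title, index_year, index_price):
            a = A[a_id]
            if not any(rule(a, b) for rule in rules):   # no rule says drop
                yield (a_id, b_id)
```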

Optimizing the EM DAG
Key idea: use crowd time to mask machine time
- build indexes, speculatively execute rules, mask pair selection
[Figure: the blocking + matching DAG from before, annotated with where the optimizations apply: build indexes, speculatively execute rules, speculatively execute the matcher, mask pair selection]
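A toy sketch of the masking idea, using Python threads as a stand-in for Falcon's scheduler: while a crowd labeling round is outstanding, machine-side work that will be needed later (such as building the blocking indexes) runs concurrently, so its cost is hidden behind crowd time.

```python
# A minimal, hypothetical sketch of masking machine time with crowd time.

from concurrent.futures import ThreadPoolExecutor

def collect_crowd_labels(pairs):      # stand-in for a round of crowd labeling
    ...

def build_indexes(table_a):           # stand-in for machine-side index building
    ...

def labeling_round_with_masking(pairs, table_a):
    with ThreadPoolExecutor(max_workers=2) as pool:
        crowd = pool.submit(collect_crowd_labels, pairs)
        machine = pool.submit(build_indexes, table_a)   # runs while the crowd labels
        return crowd.result(), machine.result()
```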

Empirical Evaluation
- Ran Hadoop on a 10-node cluster
  - each node has an 8-core Intel Xeon 2.1GHz processor and 8GB RAM
- Mechanical Turk settings
  - turker qualifications: ≥ 100 approved HITs with ≥ 95% approval rate
  - payment: 2 cents per question

Data Set  | Table A   | Table B   | # Matches | Description
Products  | 2,554     | 22,074    | 1,154     | Electronic products at Amazon and Walmart
Songs     | 1,000,000 | 1,000,000 | 1,292,023 | Songs within a single table
Citations | 1,823,978 | 2,512,927 | 558,787   | Citations in Citeseer and DBLP

Overall Performance
- Blocking took 2m to 1h 13m
- Optimization reduces total machine time by 11-70%

Data Set  | Accuracy (%): P / R / F1 | Cost (# Questions) | Machine Time | Crowd Time | Total Time | Candidate Set Size
Products  | 90.9 / 74.5 / 81.9       | $57.6 (960)        | 52m          | 13h 7m     | 13h 25m    | 536K - 11.4M
Songs     | 96.0 / 99.3 / 97.6       | $54 (900)          | 2h 7m        | 11h 25m    | 11h 58m    | 1.6M - 51.4M
Citations | 92.0 / 98.5 / 95.2       | $65.5 (1087)       | 2h 32m       | 13h 33m    | 14h 37m    | 654K - 1.06M

Falcon "in the Wild" (CloudMatcher.io)
- Matching drug descriptions for Marshfield Clinic
  - tables of size 453K vs 451K
  - data is sensitive → labeled by a scientist, 1h 37m to label 830 pairs
  - 2h 10m of machine time → optimization reduces it by 49% to 1h 6m
- Matching organizations for a team of economists at UW
  - supervised by Yash Govind and Mukilan Ashok
  - tables of size 2,616 vs 21,530
  - crowd cost $61, precision ≥ 94%, recall ≥ 98%
  - machine time 12m, crowd time 23h 12m
  - when labeled by an economist: label time 50m, machine time 11m
  - can match in an hour vs weeks using a developer (@ $15/hour)

Related Work
- Parallel execution of DAGs of operators
  - considers each operator as a black box
  - e.g., [F. Hueske et al. ICDE'13, I. Gog et al. EuroSys'15]
- RDBMS-style solutions for data cleaning
  - define operators for EM at coarse granularities
  - e.g., [Wisteria VLDB'15, BigDansing SIGMOD'15]
- Crowdsourcing
  - crowdsourced query processing [CrowdDB SIGMOD'11]
  - verifying predicted matches [Wang et al. VLDB'12]
  - minimizing the number of questions [Whang et al. VLDB'13]
  - finding the best UI to pose questions [Marcus et al. VLDB'11]
  - minimizing crowd latency [D. Haas et al. VLDB'16]

Conclusions
- Corleone is ideally suited for building EM cloud services
  - but does not scale to large tables
- Proposed Falcon to scale up Corleone
  - matches tables of 1-2.5M tuples for $54-66 in 2-14 hours
  - scales up DAGs with complex rules, crowdsourcing, ML
- Future directions
  - exploring new optimizations
  - considering more complex workflows
  - providing CloudMatcher.io as a public service