Fig. 1. Extended Data Tamer Large-scale Data Fusion Architecture

II. ARCHITECTURE

Figure 1 shows our extended architecture for DATA TAMER. In this figure, we show modules for data ingest (accepting data from a data source and storing it in our internal RDBMS), schema integration (matching up attribute names for entities to form the composite entities in the global schema), entity consolidation (finding records from different data sources which describe the same entity and then consolidating these records into a composite entity record), data cleaning (to correct erroneous data), and data transformation (for example, to translate euros into dollars).

To extend this system to deal with text we are co-operating with Recorded Future, Inc., a Web aggregator of textual information [16]. They have more than 1 TByte of text they have ingested from the Web. Like other text applications, they are interested in specific kinds of information. Hence, they have a domain-specific parser which looks through the text for information of interest. Other text aggregators we have talked to indicate the need for a domain-specific parser. This module is shown in Figure 1 as a user-defined module.

The result of this parse is large-scale semi-structured/hierarchical data, which after flattening can be processed by DATA TAMER. A hierarchical data model is often used by Web-scale distributed semi-structured storage engines that manage Web-crawls or other datasets having large amounts of unstructured data. By flattening here we mean the process of converting hierarchical data into flat records before processing by DATA TAMER. However, the characteristics of this data are quite different from structured data sources. For example, the data is usually much dirtier than typical structured data. Also, structured data tends to have many attributes, while text usually has only a few. Since text and structured data have very dissimilar characteristics, it is a challenging undertaking to fuse both in DATA TAMER.

III. DATA SETS

Large-scale data coming from either distributed or local data sources supported by the extended DATA TAMER architecture can be structured, semi-structured, or unstructured (see Figure 1).

Large-scale Web-text: Here for demonstration purposes we used 1 TByte of Web-text primarily from Recorded Future augmented by the fragments of news-feeds, blogs, Twitter, etc. processed by a domain-dependent parser [16]. The output of the parse is entity data along with the text fragments where the data came from. Table I illustrates the statistics computed for the WEBINSTANCE dataset containing the fragments. We can see it consists of 242 distributed 2GB extents and has more than 17 million entries. In more detail, in Tables I and II ns refers to the namespace, count to the total number of entries in this collection, numExtents to the number of 2GB extents used to store the collection, nindexes to the number of indexes, lastExtentSize to the size of the last extent on disk in bytes, and lastIndexSize to the index size created for this collection. Table II has statistics for the WEBENTITIES dataset, which is the output of the domain-specific parser [16], consisting of the entity instances with their attributes. We can see it has more than 173 million entries and consists of 56 2GB distributed extents. Table III has the statistics by type of entity available for the WEBENTITIES dataset.

TABLE III. STATISTICS BY ENTITY TYPE IN WEB-ENTITIES

+------------------+----------+
| type             | cnt      |
+------------------+----------+
| Person           | 38867351 |
| OrgEntity        | 33529169 |
| GeoEntity        | 11964810 |
| URL              | 11194592 |
| IndustryTerm     |  9101781 |
| Position         |  8938934 |
| Company          |  8846692 |
| Product          |  8800019 |
| Organization     |  6301459 |
| Facility         |  4081458 |
| City             |  3621317 |
| MedicalCondition |  1313487 |
| Technology       |   940349 |
| Movie            |   260230 |
| ProvinceOrState  |   223243 |
...

Google Fusion tables: In addition to the web-scale text dataset we used 20 structured data sources found using Google Fusion Tables having Broadway show schedules, theater locations, and discounts. The structured sources on average have ...
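The flattening step described in Section II can be illustrated with a minimal Python sketch. It assumes the parser output is available as nested dictionaries and lists; the function name `flatten`, the dotted column naming, and the sample field names are illustrative assumptions and are not taken from DATA TAMER or the Recorded Future parser.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Convert one nested (hierarchical) record into one flat record.

    Nested dictionaries contribute dotted column names, list elements are
    addressed by their position, and scalar values are copied unchanged.
    Empty dictionaries and lists produce no columns.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        elif isinstance(value, list):
            # Re-express the list as a dict keyed by position and recurse.
            indexed = {str(i): item for i, item in enumerate(value)}
            flat.update(flatten(indexed, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical parser output: every field name here is invented for illustration.
parsed = {
    "type": "Movie",
    "name": "Matilda",
    "fragment": {"source": "news-feed", "text": "..which began previews on Tuesday"},
    "mentions": ["Broadway", "London"],
}

flat_record = flatten(parsed)
for column, value in flat_record.items():
    print(f"{column} = {value!r}")
```

Each nested path becomes one column of the flat record (for example `fragment.source` and `mentions.0`), which is the shape a relational ingest step expects. Other designs are equally plausible, such as emitting one flat record per list element instead of positional columns.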
TABLE V. QUERY RESULTS FOR THE "MATILDA" BROADWAY SHOW FROM WEB-TEXT

SHOW_NAME       "Matilda"
TEXT_FEED       "..which began previews on Tuesday, grossed 659,391, or... And Matilda an award-winning import from London, grossed 960,998, or 93 percent of the maximum."

TABLE VI. ENRICHED QUERY RESULTS FROM WEB-TEXT AND FUSION TABLES

SHOW_NAME       "Matilda"
THEATER         "Shubert 225 W. 44th St between 7th and 8th"
PERFORMANCE     "Tues at 7pm Wed at 8pm Thurs at 7pm Fri-Sat at 8pm Wed, Sat at 2pm Sun at 3pm"
TEXT_FEED       "..which began previews on Tuesday, grossed 659,391, or... And Matilda an award-winning import from London, grossed 960,998, or 93 percent of the maximum."
CHEAPEST_PRICE  "$27"
FIRST           "3/4/2013"

VI. RELATED AND FUTURE WORK

Much of the current research in data management, information retrieval, and search is devoted to the Web, social networks, and personal resources, and unfortunately does not apply directly to text. The most recent relevant research by Lu et al. proposes an efficient algorithm called selective-expansion and the SI-tree, a new indexing structure designed for efficient string similarity joins with synonyms [7]. In [17] Gao et al. propose two types of similarities between two probabilistic sets and design an efficient dynamic programming-based algorithm to calculate both types of similarities. Dong et al. in [18] describe generic selective information integration, which is critical in search. Another significant work [19] sheds light on the controversial decision-making process in large-scale data fusion. Halevy in [20] describes research efforts in the structural realm of large-scale information fusion. Gupta et al. in [21] give a partial overview of recent structured data research related to Web search. Many recent venues highlight text as an area of growing interest to Data Management communities [2]–[6], [8]–[10], [15], [17], [18], [22]–[35]. None of these efforts, to the best of our knowledge, systematically addresses the problem of automatic fusion of text, structured, and semi-structured data at scale.

Here we demonstrated a new scalable data integration architecture and a system capable of integrating unstructured, semi-structured, and structured data. We also showed the added value of data integration using the Broadway shows scenario.

REFERENCES

[1] M. Gubanov and L. Shapiro, "Using unified famous objects (UFO) to automate Alzheimer's disease diagnostics," in BIBM, 2012.
[2] A. Singhal, "Introducing the knowledge graph: Things, not strings," in Google Blog, 2012.
[3] M. Gubanov, L. Popa, H. Ho, H. Pirahesh, P. Chang, and L. Chen, "IBM UFO repository," in VLDB, 2009.
[4] L. Chilton, G. Little, D. Edge, D. Weld, and J. Landay, "Cascade: Crowdsourcing taxonomy creation," in CHI, 2013.
[5] C. Zhang, R. Hoffmann, and D. Weld, "Ontological smoothing for relation extraction with minimal supervision," in AAAI, 2012.
[6] M. Gubanov and M. Stonebraker, "Bootstraping synonym resolution at web scale," in DIMACS, 2013.
[7] J. Lu, C. Lin, W. Wang, C. Li, and H. Wang, "String similarity measures and joins with synonyms," in SIGMOD, 2013.
[8] Y. Cai, X. L. Dong, A. Halevy, J. M. Liu, and J. Madhavan, "Personal information management with Semex," in SIGMOD, 2005.
[9] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in SIGMOD, 2008.
[10] E. Agichtein, E. Brill, and S. Dumais, "Improving web search ranking by incorporating user behavior information," in SIGIR, 2006.
[11] M. Stonebraker, D. Bruckner, I. Ilyas, G. Beskales, M. Cherniack, S. Zdonik, A. Pagan, and S. Xu, "Data curation at scale: The Data Tamer system," in CIDR, 2013.
[12] M. Gubanov, A. Pyayt, and L. Shapiro, "Readfast: Browsing large documents through unified famous objects (UFO)," in IRI, 2011.
[13] M. Gubanov and A. Pyayt, "Readfast: High-relevance search-engine for big text," in ACM CIKM, 2013.
[14] M. Gubanov and M. Stonebraker, "Large-scale semantic profile extraction," in EDBT, 2014.
[15] R. Helaoui, D. Riboni, M. Niepert, C. Bettini, and H. Stuckenschmidt, "Towards activity recognition using probabilistic description logics," in AAAI, 2012.
[16] Recorded Future Inc. [Online]. Available: http://www.recordedfuture.com
[17] M. Gao, C. Jin, W. Wang, X. Lin, and A. Zhou, "Similarity query processing for probabilistic sets," in ICDE, 2013.
[18] X. L. Dong, B. Saha, and D. Srivastava, "Less is more: Selecting sources wisely for integration," in VLDB, 2013.
[19] "Explaining data fusion decisions," in WWW, 2013.
[20] A. Halevy, "Data publishing and sharing using fusion tables," in CIDR, 2013.
[21] N. Gupta, A. Halevy, B. Harb, H. Lam, H. Lee, J. Madhavan, F. Wu, and C. Yu, "Recent progress towards an ecosystem of structured data on the web," in ICDE, 2013.
[22] M. Gubanov and A. Pyayt, "Medreadfast: Structural information retrieval engine for big clinical text," in IRI, 2012.
[23] M. Gubanov, L. Shapiro, and A. Pyayt, "Learning unified famous objects (UFO) to bootstrap information integration," in IRI, 2011.
[24] M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, "Open information extraction from the web," in IJCAI, 2007.
[25] J. Diederich, W.-T. Balke, and U. Thaden, "Demonstrating the semantic growbag: automatically creating topic facets for faceted DBLP," in JCDL, 2007.
[26] K. R. Venkatesh Ganti, Surajit Chaudhuri, "A primitive operator for similarity joins in data cleaning," in ICDE, 2006.
[27] S. Sekine, "On-demand information extraction," in COLING/ACL, 2006.
[28] O. Udrea, L. Getoor, and R. J. Miller, "Leveraging data and structure in ontology integration," in SIGMOD, 2007.
[29] S. Amer-Yahia, L. V. S. Lakshmanan, and S. Pandit, "Flexpath: flexible structure and full-text querying for XML," in SIGMOD, 2004.
[30] N. Polyzotis, M. Garofalakis, and Y. Ioannidis, "Approximate XML query answers," 2004.
[31] Y. Li, C. Yu, and H. V. Jagadish, "Schema-free XQuery," in VLDB, 2004.
[32] X. Zhou, J. Gaugaz, W.-T. Balke, and W. Nejdl, "Query relaxation using malleable schemas," in SIGMOD, 2007.
[33] X. Dong and A. Y. Halevy, "Malleable schemas: A preliminary report," in WebDB'05, 2005.
[34] X. Dong and A. Halevy, "Indexing dataspaces," in SIGMOD, 2007.
[35] J. Park and D. Barbosa, "Adaptive record extraction from web pages," in WWW, 2007.