Fig. 1. Extended Data Tamer Large-scale Data Fusion Architecture


lindy-dunigan
Uploaded On 2015-11-11




Fig. 1. Extended Data Tamer Large-scale Data Fusion Architecture

II. ARCHITECTURE

Figure 1 shows our extended architecture for DATA TAMER. In this figure, we show modules for data ingest (accepting data from a data source and storing it in our internal RDBMS), schema integration (matching up attribute names for entities to form the composite entities in the global schema), entity consolidation (finding records from different data sources which describe the same entity and then consolidating these records into a composite entity record), data cleaning (to correct erroneous data), and data transformation (for example, to translate euros into dollars).

To extend this system to deal with text, we are co-operating with Recorded Future, Inc., a Web aggregator of textual information [16]. They have ingested more than 1 TByte of text from the Web. Like other text applications, they are interested in specific kinds of information. Hence, they have a domain-specific parser which looks through the text for information of interest. Other text aggregators we have talked to also indicate the need for a domain-specific parser. This module is shown in Figure 1 as a user-defined module.

The result of this parse is large-scale semi-structured/hierarchical data, which after flattening can be processed by DATA TAMER. A hierarchical data model is often used by Web-scale distributed semi-structured storage engines that manage Web crawls or other datasets having large amounts of unstructured data. By flattening we mean here the process of converting hierarchical data into flat records before processing by DATA TAMER. However, the characteristics of this data are quite different from those of structured data sources. For example, the data is usually much dirtier than typical structured data. Also, structured data tends to have many attributes, while text usually has only a few. Since text and structured data have very dissimilar characteristics, it is a challenging undertaking to fuse both in DATA TAMER.

III. DATASETS

Large-scale data supported by the extended DATA TAMER architecture, coming from either distributed or local data sources, can be structured, semi-structured, or unstructured (see Figure 1).

Large-scale Web-text: Here, for demonstration purposes, we used 1 TByte of Web-text, primarily from Recorded Future, augmented by the fragments of news feeds, blogs, Twitter, etc., processed by a domain-dependent parser [16]. The output of the parse is entity data along with the text fragments where the data came from. Table I illustrates the statistics computed for the WEBINSTANCE dataset containing the fragments. We can see it consists of 242 distributed 2GB extents and has more than 17 million entries. In more detail, ns in Tables I and II refers to the namespace, count to the total number of entries in this collection, numExtents to the number of 2GB extents used to store the collection, nindexes to the number of indexes, lastExtentSize to the size of the last extent on disk in bytes, and lastIndexSize to the size of the index created for this collection. Table II has statistics for the WEBENTITIES dataset, which is the output of the domain-specific parser [16], consisting of the entity instances with their attributes. We can see it has more than 173 million entries and consists of 562GB distributed extents. Table III has the statistics by type of entity available for the WEBENTITIES dataset.

TABLE III. STATISTICS BY ENTITY TYPE IN WEB-ENTITIES
+------------------+----------+
| type             | cnt      |
+------------------+----------+
| Person           | 38867351 |
| OrgEntity        | 33529169 |
| GeoEntity        | 11964810 |
| URL              | 11194592 |
| IndustryTerm     |  9101781 |
| Position         |  8938934 |
| Company          |  8846692 |
| Product          |  8800019 |
| Organization     |  6301459 |
| Facility         |  4081458 |
| City             |  3621317 |
| MedicalCondition |  1313487 |
| Technology       |   940349 |
| Movie            |   260230 |
| ProvinceOrState  |   223243 |
...

Google Fusion tables: In addition to the web-scale text dataset, we used 20 structured data sources found using Google Fusion Tables, containing Broadway show schedules, theater locations, and discounts. The structured sources on average have
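The flattening step described in the Architecture section (converting hierarchical parser output into flat records) can be illustrated with a minimal sketch. The paper does not say how flattening is implemented, so the dotted-path key scheme, the `flatten` function, and the sample `parsed` record below are all assumptions made for illustration only.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Convert a nested dict/list structure into a single flat record.

    Nested keys are joined with `sep`; list items are addressed by index.
    This naming scheme is an assumption, not the method used by DATA TAMER.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, list):
            # Treat a list as a dict keyed by position so one code path handles both.
            value = {str(i): item for i, item in enumerate(value)}
        if isinstance(value, dict):
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat


# Hypothetical parser output; the field names are invented for this example.
parsed = {
    "type": "Movie",
    "name": "Matilda",
    "source": {"kind": "news-feed"},
    "fragments": [{"text": "grossed 960,998"}],
}

flat = flatten(parsed)
for column, value in flat.items():
    print(column, "=", value)
```

Each key of the resulting dictionary could then serve as one attribute name of a flat record. Note that empty nested containers produce no attribute at all under this scheme.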
TABLE V. QUERY RESULTS FOR THE "MATILDA" BROADWAY SHOW FROM WEB-TEXT
SHOW_NAME       "Matilda"
TEXT_FEED       "..which began previews on Tuesday, grossed 659,391, or... And Matilda an award-winning import from London, grossed 960,998, or 93 percent of the maximum."

TABLE VI. ENRICHED QUERY RESULTS FROM WEB-TEXT AND FUSION TABLES
SHOW_NAME       "Matilda"
THEATER         "Shubert 225 W. 44th St between 7th and 8th"
PERFORMANCE     "Tues at 7pm Wed at 8pm Thurs at 7pm Fri-Sat at 8pm Wed, Sat at 2pm Sun at 3pm"
TEXT_FEED       "..which began previews on Tuesday, grossed 659,391, or... And Matilda an award-winning import from London, grossed 960,998, or 93 percent of the maximum."
CHEAPEST_PRICE  "$27"
FIRST           "3/4/2013"

VI. RELATED AND FUTURE WORK

Much of the current research in data management, information retrieval, and search is devoted to the Web, social networks, and personal resources, and unfortunately does not apply directly to text. The most recent relevant research, by Lu et al., proposes an efficient algorithm called selective-expansion and the SI-tree, a new indexing structure designed for efficient string similarity joins with synonyms [7]. In [17], Gao et al. propose two types of similarities between two probabilistic sets and design an efficient dynamic programming-based algorithm to calculate both types of similarities. Dong et al. in [18] describe generic selective information integration, which is critical in search. Another significant work [19] sheds light on the controversial decision-making process in large-scale data fusion. Halevy in [20] describes research efforts in the structural realm of large-scale information fusion. Gupta in [21] gives a partial overview of recent structured data research related to Web search. Many recent venues highlight text as an area of growing interest to Data Management communities [2]–[6], [8]–[10], [15], [17], [18], [22]–[35]. None of these efforts, to the best of our knowledge, systematically addresses the problem of automatic fusion of text, structured, and semi-structured data at scale.

Here we demonstrated a new scalable data integration architecture and a system capable of integrating unstructured, semi-structured, and structured data. We also showed the added value of data integration using the Broadway shows scenario.

REFERENCES

[1] M. Gubanov and L. Shapiro, "Using unified famous objects (UFO) to automate Alzheimer's disease diagnostics," in BIBM, 2012.
[2] A. Singhal, "Introducing the knowledge graph: Things, not strings," in Google Blog, 2012.
[3] M. Gubanov, L. Popa, H. Ho, H. Pirahesh, P. Chang, and L. Chen, "IBM UFO repository," in VLDB, 2009.
[4] L. Chilton, G. Little, D. Edge, D. Weld, and J. Landay, "Cascade: Crowdsourcing taxonomy creation," in CHI, 2013.
[5] C. Zhang, R. Hoffmann, and D. Weld, "Ontological smoothing for relation extraction with minimal supervision," in AAAI, 2012.
[6] M. Gubanov and M. Stonebraker, "Bootstraping synonym resolution at web scale," in DIMACS, 2013.
[7] J. Lu, C. Lin, W. Wang, C. Li, and H. Wang, "String similarity measures and joins with synonyms," in SIGMOD, 2013.
[8] Y. Cai, X. L. Dong, A. Halevy, J. M. Liu, and J. Madhavan, "Personal information management with Semex," in SIGMOD, 2005.
[9] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in SIGMOD, 2008.
[10] E. Agichtein, E. Brill, and S. Dumais, "Improving web search ranking by incorporating user behavior information," in SIGIR, 2006.
[11] M. Stonebraker, D. Bruckner, I. Ilyas, G. Beskales, M. Cherniack, S. Zdonik, A. Pagan, and S. Xu, "Data curation at scale: The Data Tamer system," in CIDR, 2013.
[12] M. Gubanov, A. Pyayt, and L. Shapiro, "Readfast: Browsing large documents through unified famous objects (UFO)," in IRI, 2011.
[13] M. Gubanov and A. Pyayt, "Readfast: High-relevance search-engine for big text," in ACM CIKM, 2013.
[14] M. Gubanov and M. Stonebraker, "Large-scale semantic profile extraction," in EDBT, 2014.
[15] R. Helaoui, D. Riboni, M. Niepert, C. Bettini, and H. Stuckenschmidt, "Towards activity recognition using probabilistic description logics," in AAAI, 2012.
[16] "Recorded Future Inc." [Online]. Available: http://www.recordedfuture.com
[17] M. Gao, C. Jin, W. Wang, X. Lin, and A. Zhou, "Similarity query processing for probabilistic sets," in ICDE, 2013.
[18] X. L. Dong, B. Saha, and D. Srivastava, "Less is more: Selecting sources wisely for integration," in VLDB, 2013.
[19] ---, "Explaining data fusion decisions," in WWW, 2013.
[20] A. Halevy, "Data publishing and sharing using fusion tables," in CIDR, 2013.
[21] N. Gupta, A. Halevy, B. Harb, H. Lam, H. Lee, J. Madhavan, F. Wu, and C. Yu, "Recent progress towards an ecosystem of structured data on the web," in ICDE, 2013.
[22] M. Gubanov and A. Pyayt, "Medreadfast: Structural information retrieval engine for big clinical text," in IRI, 2012.
[23] M. Gubanov, L. Shapiro, and A. Pyayt, "Learning unified famous objects (UFO) to bootstrap information integration," in IRI, 2011.
[24] M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, "Open information extraction from the web," in IJCAI, 2007.
[25] J. Diederich, W.-T. Balke, and U. Thaden, "Demonstrating the semantic growbag: automatically creating topic facets for faceted DBLP," in JCDL, 2007.
[26] K. R. Venkatesh Ganti, Surajit Chaudhuri, "A primitive operator for similarity joins in data cleaning," in ICDE, 2006.
[27] S. Sekine, "On-demand information extraction," in COLING/ACL, 2006.
[28] O. Udrea, L. Getoor, and R. J. Miller, "Leveraging data and structure in ontology integration," in SIGMOD, 2007.
[29] S. Amer-Yahia, L. V. S. Lakshmanan, and S. Pandit, "Flexpath: flexible structure and full-text querying for XML," in SIGMOD, 2004.
[30] N. Polyzotis, M. Garofalakis, and Y. Ioannidis, "Approximate XML query answers," 2004.
[31] Y. Li, C. Yu, and H. V. Jagadish, "Schema-free XQuery," in VLDB, 2004.
[32] X. Zhou, J. Gaugaz, W.-T. Balke, and W. Nejdl, "Query relaxation using malleable schemas," in SIGMOD, 2007.
[33] X. Dong and A. Y. Halevy, "Malleable schemas: A preliminary report," in WebDB'05, 2005.
[34] X. Dong and A. Halevy, "Indexing dataspaces," in SIGMOD, 2007.
[35] J. Park and D. Barbosa, "Adaptive record extraction from web pages," in WWW, 2007.
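As an illustration of the enrichment shown in Tables V and VI, the following sketch merges a text-derived record with structured records that describe the same show. It is not the DATA TAMER entity consolidation module: it swaps in a plain exact match on a normalized SHOW_NAME, and the `normalize` and `enrich` helpers, as well as the way the structured attributes are split across two source rows, are assumptions made for this example.

```python
from typing import Dict, List

Record = Dict[str, str]


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different names match."""
    return " ".join(name.lower().split())


def enrich(text_records: List[Record], structured_records: List[Record],
           key: str = "SHOW_NAME") -> List[Record]:
    """Attach structured attributes to each text record with the same key."""
    # Collect all structured attributes per normalized key value.
    by_key: Dict[str, Record] = {}
    for row in structured_records:
        by_key.setdefault(normalize(row[key]), {}).update(row)

    enriched = []
    for rec in text_records:
        merged = dict(by_key.get(normalize(rec[key]), {}))
        merged.update(rec)  # values from the text record win on conflicts
        enriched.append(merged)
    return enriched


# Values taken from Tables V and VI; the split into two rows is hypothetical.
web_text = [{"SHOW_NAME": "Matilda",
             "TEXT_FEED": "...grossed 960,998, or 93 percent of the maximum."}]
fusion_rows = [
    {"SHOW_NAME": "MATILDA", "THEATER": "Shubert 225 W. 44th St between 7th and 8th"},
    {"SHOW_NAME": "Matilda ", "CHEAPEST_PRICE": "$27", "FIRST": "3/4/2013"},
]

result = enrich(web_text, fusion_rows)
print(result[0])
```

An exact match on a normalized name is the simplest possible matching rule; dirty text data of the kind described in the Architecture section would generally need a more tolerant similarity measure.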