DIADEM: Thousands of Websites to a Single Database

Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, Cheng Wang
Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford OX1 3QD
firstname.lastname@cs.ox.ac.uk

ABSTRACT
The web is overflowing with implicitly structured data, spread over hundreds of thousands of sites, hidden deep behind search forms, or siloed in marketplaces, only accessible as HTML. Automatic extraction of structured data at the scale of thousands of websites has long proven elusive, despite its central role in the “web of data”.

Through an extensive evaluation spanning over 10000 websites from multiple application domains, we show that automatic, yet accurate full-site extraction is no longer a distant dream. DIADEM is the first automatic full-site extraction system that is able to extract structured data from different domains at very high accuracy. It combines automated exploration of websites, identification of relevant data, and induction of exhaustive wrappers. Automating these components is the first challenge. DIADEM overcomes this challenge by combining phenomenological and ontological knowledge. Integrating these components is the second challenge. DIADEM overcomes this challenge through a self-adaptive network of relational transducers that produces effective wrappers for a wide variety of websites.

Our extensive and publicly available evaluation shows that, for more than 90% of sites from three domains, DIADEM obtains an effective wrapper that extracts all relevant data with 97% average precision. DIADEM also tolerates noisy entity recognisers, and its components individually outperform comparable approaches.

Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: On-line Information Services - Web-based services

General Terms
Languages, Experimentation

Keywords
data extraction, deep web, wrapper induction
The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement DIADEM, no. 246858. Giorgio Orsi and Christian Schallhart have also been supported by the Oxford Martin School, Institute for the Future of Computing.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 40th International Conference on Very Large Data Bases, September 1st-5th 2014, Hangzhou, China.
Proceedings of the VLDB Endowment, Vol. 7, No. 14
Copyright 2014 VLDB Endowment 2150-8097/14/10.

1. INTRODUCTION
The web has become the largest repository of structured data. For the US alone, the number of online shopping sites with $10k+ revenue is estimated in excess of a hundred thousand [30], with a long-tail of several hundred thousand smaller shops. A significant portion of data is only available in this long-tail [11].

This data is mostly available in HTML pages, designed for humans. The automatic, yet accurate extraction of the structured data underlying such pages is a long-standing challenge [5]. Semi-supervised data extraction approaches, such as [2, 12], have been investigated extensively, but require users to supervise the induction by navigating each site and identifying relevant data. In contrast, automatic full-site extraction (AFE) operates automatically with no per-site supervision, navigates to all relevant data on the full site, yet extracts highly structured data.

Automatic full-site extraction involves three primary sub-problems, namely site exploration (with form understanding and filling), record and attribute identification, and wrapper induction. Each of these sub-problems is a significant challenge by itself, but worse, these problems have in the past been tackled in isolation with few exceptions. Successful applications of AFE have been limited to narrow settings with simple structure, such as title and body extraction for news articles [39] or search engine results (ViNTs [44]). For extracting highly structured data, these approaches are unsuitable. Furthermore, most of these approaches fail on modern sites. For example, ViNTs [44] full-site extraction identifies records (of title and body only) with only 83%-88% accuracy (Section 5), even when supervised by selecting negative and positive examples of result pages for each site.

This lack of integrated, full-site extraction approaches has significantly reduced the impact of data extraction, with some commercial applications such as Google Products moving away from extraction towards purely curated data collection. Yet, the need has only increased, in particular with the rise of big data analytics for competitive price intelligence or intelligent supply chain management. In many of these applications, the acquisition of accurate, large-scale data, e.g., about competitors' products, current deals, or supply levels, is crucial.

DIADEM is the first automatic full-site extraction system that is able to extract structured data from different domains, such as real estate or used cars, at very high accuracy. In contrast to previous AFE systems, such as ViNTs [44], DIADEM extracts, for over 90% of the 10602 evaluated websites, highly structured data with dozens of attributes at 97% accuracy. In contrast to existing approaches focusing on only one or two sub-problems of AFE, DIADEM is an integrated system that solves each of the sub-problems given just a list of sites. In contrast to most previous data extraction approaches, it requires no per-site supervision, not even the selection of result [...]

[Figure 11: Attribute and form type frequencies. Panels: (a) RE attribute type frequency, (b) UC attribute type frequency, (c) Form types per form.]
[Figure 12: Attribute quality.]

           #sites   with form   avg(#fields)   avg(#records)
RE-FULL    3404     2092        7.4            85
UC-FULL    7089     4254        5.3            77
RE-US      109      71          10.1           110
RE-OXF     172      137         7.5            108
ICQ        100      100         6.5            –
(a) Dataset characteristics

                   effective   incorrect   none
RE-RND   DIADEM    91%         7%          2%
UC-RND   DIADEM    93%         4%          3%
RE-US    DIADEM    90%         5%          5%
RE-OXF   DIADEM    90%         6%          4%
RE-OXF   ViNTs     4%          5%          91%
(b) Wrapper quality
Table 2: Datasets and wrapper quality

[...] generalised wrapper used for the extraction and evaluated in step (2)). We do not evaluate DIADEM's wrapper execution individually for space reasons, but refer to [20].

Table 2b shows the assessment of the effectiveness of the wrappers induced by DIADEM for RE-RND, UC-RND, RE-US, and RE-OXF. For the latter, we also show the corresponding numbers for ViNTs [44]. We assume that the random samples are representative for the full datasets, as indicated by the highly correlated characteristics.

The primary result is that over 90% of the wrappers are effective in each dataset, with 91% average effectiveness. To avoid bias, we use a two-step verification of the wrappers: Each wrapper is manually verified by one person. If a wrapper is considered effective, the actual extracted records are automatically compared to the SEARCH_RESULTS_NUMBER identified on the first listings page, if present. If not present, we use uniqueness of URLs and images and identical record numbers from different fillings. If this automatic verification fails, two more persons are asked to verify the wrapper and the aggregated result is reported.

Contrast this to ViNTs, a system for fully automatically generating wrappers for search engines. It provides only a few attributes that are common to many search engines, namely the title of the result and its textual description, and thus we consider a wrapper effective if it extracts the right records (whereas for DIADEM we also inspect the attributes). ViNTs performs quite well if supervised by providing a few result pages and a non-result (control) page (Figure 13a). Like DIADEM, ViNTs is also able to automatically extract a full-site wrapper starting from a form. However, this part of ViNTs has been specifically engineered for simple search forms. Thus, we remove all sites with no search forms, iFrames, or too few properties from the evaluation. ViNTs still manages to produce a wrapper for only 9% of the sites, and an effective wrapper for only 4%. In the other cases, ViNTs extracts only partial data, e.g., no rentals.

Among the most common causes for a DIADEM wrapper to be non-effective are misaligned attributes, e.g., in the presence of multiple pivot attributes or rare optional attributes, and sites that list related products more prominently. E.g., on a few sites that also offer new cars, DIADEM may extract those rather than the used cars, if neither listing contains many used-car-specific attributes and the new cars are more prominently placed on the site. There are about 3% of sites where no wrapper can be induced, typically as they contain no properties, all properties are on aggregators, or they contain no pivot attribute. For these sites, DIADEM correctly detects that there is no effective wrapper. The final case is that DIADEM fails to produce an effective wrapper, yet one exists. The most common reasons for these failures are dynamic forms (15%), result pages with dynamically rendered prices (12%), forms located in sidebar iframes (15%), prices without currencies (6%), or sites which contain only a single property (6%).

To demonstrate that DIADEM does not produce a wrapper for sites that are not in the target domain, we also run DIADEM on the set of top UK shopping websites UK-100. On this set, DIADEM induces a wrapper for only 5 sites, confusing toy cars on Amazon and Toys-R-Us for used cars.

To determine the attribute quality of the extracted data, we perform a manual evaluation on the RE-OXF, RE-RND, and UC-RND [...]

[...]izes both complex text-range selections and exploration sequences, thus avoiding expensive post-processing and programmatic combination of multiple wrappers and filling found, e.g., in [35, 40].

7. CONCLUSION
DIADEM is the first system for automatic full-site extraction capable of producing highly accurate, effective wrappers for thousands of websites in several application domains. Despite this achievement, there remain several open issues, among them:
(1) We are confident that DIADEM can be applied to most if not all product domains, such as hotel, electronics, fashion, or books. However, the performance on domains such as event announcements, where objects are less homogeneous, is an open question.
(2) Sometimes, listing pages report only a subset of the available data and the extraction of all data requires visiting individual details pages. The gain from extracting from details pages varies from site to site and would have to be quantified automatically for integration into a system like DIADEM.
(3) Although effective, form interaction is still one of the hardest problems faced in DIADEM, in particular when it comes to understanding dynamic field-dependencies and more complex widgets.

8. REFERENCES
[1] S. Abiteboul, V. Vianu, B. Fordham, and Y. Yesha. Relational transducers for electronic commerce. In PODS, p. 179–187, 1998.
[2] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In VLDB, p. 119–128, 2001.
[3] L. Bing, W. Lam, and T.-L. Wong. Robust detection of semi-structured web records using a dom structure-knowledge-driven model. TWeb, 7(4):21:1–21:32, 2013.
[4] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of partially overlapping web sources. PVLDB, 6(10):805–816, 2013.
[5] M. J. Cafarella, A. Halevy, and J. Madhavan. Structured data on the web. CACM, 54(2):72–79, 2011.
[6] M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Webtables: Exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.
[7] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific web resource discovery. In WWW, p. 1623–1640, 1999.
[8] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB, 6(10), 2013.
[9] V. Crescenzi and G. Mecca. Automatic information extraction from large websites. J. ACM, 51(5):731–779, 2004.
[10] V. Crescenzi, P. Merialdo, and D. Qiu. A framework for learning web wrappers from the crowd. In WWW, p. 261–272, 2013.
[11] N. Dalvi, A. Machanavajjhala, and B. Pang. An analysis of structured data on the web. PVLDB, 5(7):680–691, 2012.
[12] N. N. Dalvi, P. Bohannon, and F. Sha. Robust web extraction: an approach based on a probabilistic tree-edit model. In SIGMOD, p. 335–348, 2009.
[13] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic wrappers for large scale web extraction. PVLDB, 4(4):219–230, 2011.
[14] N. Derouiche, B. Cautis, and T. Abdessalem. Automatic extraction of structured web data with domain knowledge. In ICDE, p. 726–737, 2012.
[15] E. C. Dragut, T. Kabisch, C. Yu, and U. Leser. A hierarchical approach to model web query interfaces for web source integration. PVLDB, 2(1):325–336, 2009.
[16] O. Etzioni, M. Banko, S. Soderland, and D. S. Weld. Open information extraction from the web. CACM, 51(12):68–74, 2008.
[17] T. Furche, G. Gottlob, G. Grasso, O. Gunes, X. Guo, A. Kravchenko, G. Orsi, C. Schallhart, A. Sellers, and C. Wang. Diadem: Domain-centric, intelligent, automated data extraction methodology. In WWW, p. 267–270, 2012.
[18] T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, and C. Schallhart. The ontological key: automatically understanding and integrating forms to access the deep web. VLDB J., 22(5):615–640, 2013.
[19] T. Furche, G. Gottlob, G. Grasso, G. Orsi, C. Schallhart, and C. Wang. Little knowledge rules the web: Domain-centric result page extraction. In RR, p. 61–76, 2011.
[20] T. Furche, G. Gottlob, G. Grasso, C. Schallhart, and A. Sellers. OXPath: A language for scalable data extraction, automation, and crawling on the deep web. VLDB J., p. 47–72, 2013.
[21] P. Gulhane, A. Madaan, R. R. Mehta, J. Ramamirtham, R. Rastogi, S. Satpal, S. H. Sengamedu, A. Tengli, and C. Tiwari. Web-scale information extraction with vertex. In ICDE, p. 1209–1220, 2011.
[22] P. Gulhane, R. Rastogi, S. H. Sengamedu, and A. Tengli. Exploiting content redundancy for web information extraction. In WWW, p. 1105–1106, 2010.
[23] M. Kayed and C.-H. Chang. FiVaTech: Page-Level Web Data Extraction from Template Pages. TKDE, 22(2):249–263, 2010.
[24] R. Khare, Y. An, and I.-Y. Song. Understanding deep web search interfaces: A survey. SIGMOD Rec., 39(1):33–40, 2010.
[25] N. Kushmerick. Wrapper induction: efficiency and expressiveness. Artif. Intell., 118:15–68, 2000.
[26] N. Leone, G. Pfeifer, W. Faber, T. Eiter, G. Gottlob, S. Perri, and F. Scarcello. The DLV system for knowledge representation and reasoning. TOCL, 7(3):499–562, 2006.
[27] B. Liu, R. Grossman, and Y. Zhai. Mining data records in web pages. In SIGKDD, p. 601–605, 2003.
[28] W. Liu, X. Meng, and W. Meng. ViDE: A vision-based approach for deep web data extraction. TKDE, 22:447–460, 2010.
[29] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy. Google's deep web crawl. PVLDB, 1(2):1241–1252, 2008.
[30] MarketLine. Online retail in the united states, 2013. On-line at http://www.marketresearch.com/MarketLine-v3883/Online-Retail-United-States-7760207/.
[31] A. Mesbah, A. van Deursen, and S. Lenselink. Crawling ajax-based web applications through dynamic analysis of user interface state changes. TWeb, 6(1):3:1–3:30, 2012.
[32] H. Nguyen, T. Nguyen, and J. Freire. Learning to extract form labels. PVLDB, 1(1):684–694, 2008.
[33] P. Senellart, A. Mittal, D. Muschick, R. Gilleron, and M. Tommasi. Automatic wrapper induction from hidden-web sources with domain knowledge. In WIDM, p. 9–16, 2008.
[34] K. Simon and G. Lausen. ViPER: Augmenting Automatic Information Extraction with visual Perceptions. In CIKM, p. 381–388, 2005.
[35] W. Su, J. Wang, and F. H. Lochovsky. ODE: Ontology-Assisted Data Extraction. TODS, 34(2):12:1–12:35, 2009.
[36] W. Su, H. Wu, Y. Li, J. Zhao, F. H. Lochovsky, H. Cai, and T. Huang. Understanding query interfaces by statistical parsing. TWeb, 7(2):8:1–8:22, 2013.
[37] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A core of semantic knowledge. In WWW, p. 697–706, 2007.
[38] G. A. Toda, E. Cortez, A. S. da Silva, and E. de Moura. A probabilistic approach for automatically filling form-based web interfaces. PVLDB, 4(3):151–160, 2010.
[39] J. Wang, C. Chen, C. Wang, J. Pei, J. Bu, Z. Guan, and W. V. Zhang. Can we learn a template-independent wrapper for news article extraction from a single training site? In KDD, p. 1345–1354, 2009.
[40] J. Wang and F. H. Lochovsky. Data extraction and label assignment for web databases. In WWW, p. 187–196, 2003.
[41] W. Wu, A. Doan, C. Yu, and W. Meng. Modeling and extracting deep-web query interfaces. In Advances in Information & Intelligent Systems, p. 65–90, 2009.
[42] Y. Xia, H. Yu, and S. Zhang. Automatic web data extraction using tree alignment. In CIKM, p. 1645–1648, 2009.
[43] Y. Zhai and B. Liu. Structured Data Extraction from the Web Based on Partial Tree Alignment. TKDE, 18(12):1614–1628, 2006.
[44] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation For Search Engines. In WWW, p. 66–75, 2005.
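
As a supplementary illustration, the two-step wrapper verification described in the evaluation (one manual verdict, an automatic check against SEARCH_RESULTS_NUMBER or a uniqueness fallback, then two further manual verdicts if the automatic check fails) might be sketched as below. This is a minimal sketch and not DIADEM's code: the `ExtractionRun` structure, the function names, the reading of "identical record numbers from different fillings" as equal record counts, the majority-vote aggregation, and the treatment of a negative first verdict as final are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ExtractionRun:
    """Output of one wrapper on one site (hypothetical structure, not from the paper)."""
    record_urls: Sequence[str]                # one URL per extracted record
    image_urls: Sequence[str]                 # one image per extracted record
    results_number: Optional[int] = None      # SEARCH_RESULTS_NUMBER, if the page shows one
    other_filling_counts: Sequence[int] = ()  # record counts from other form fillings


def automatic_check(run: ExtractionRun) -> bool:
    """Second step: automatic plausibility check of the extracted records."""
    n = len(run.record_urls)
    if run.results_number is not None:
        # Compare the number of extracted records to the count the site reports.
        return n == run.results_number
    # Fallback when the site reports no count: URLs and images must be unique,
    # and (assumed reading) other fillings must yield the same record count.
    urls_unique = len(set(run.record_urls)) == n
    images_unique = len(set(run.image_urls)) == len(run.image_urls)
    counts_agree = all(c == n for c in run.other_filling_counts)
    return urls_unique and images_unique and counts_agree


def verify_wrapper(first_verdict: bool, run: ExtractionRun,
                   extra_verdicts: Sequence[bool] = ()) -> bool:
    """Return True if the wrapper would be reported as effective."""
    if not first_verdict:
        return False  # assumption: a negative first manual verdict is final
    if automatic_check(run):
        return True
    # Automatic check failed: two more reviewers are consulted. Aggregation by
    # majority vote over all three verdicts is an assumption of this sketch.
    if len(extra_verdicts) != 2:
        raise ValueError("two additional manual verdicts are required")
    votes = [first_verdict, *extra_verdicts]
    return sum(votes) >= 2
```

In this sketch the extra reviewers are only consulted when the automatic check disagrees with the first reviewer, which mirrors the escalation order given in the text; how the paper actually aggregates the three verdicts is not stated.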