Crawling Deep Web Entity Pages

Yeye He, Univ. of Wisconsin-Madison, Madison, WI 53706, heyeye@cs.wisc.edu
Dong Xin, Google Inc., Mountain View, CA 94043, dongxin@google.com
Venkatesh Ganti, Google Inc., Mountain View, CA 94043, vganti@google.com
Sriram Rajaraman, Google Inc., Mountain View, CA 94043, sriramr@go...


for entity-oriented sites, search forms predominantly employ one text field for keyword queries, and additional input fields with good "default value" behavior. Our URL template generation based on this observation provides a tractable solution for a large number of potentially complex search forms without significantly sacrificing content coverage.

Query generation and URL generation. Prior art in query generation for deep-web crawl focused on bootstrapping using text extracted from retrieved pages [15, 17, 22, 23]. That is, a set of seed queries is first used to crawl pages. The retrieved pages are analyzed for promising keywords, which are then used as queries to crawl more pages recursively.

There are several key reasons why existing approaches are not well suited for our purpose. First of all, most previous work [17, 22, 23] aims to optimize coverage of individual sites, that is, to retrieve as much deep-web content as possible from one or a few sites, where success is measured by the percentage of content retrieved. Authors in [3] go as far as suggesting to crawl using common stop words ("a", "the", etc.) to improve site coverage when these words are indexed. We are in line with [15] in aiming to improve content coverage for a large number of sites on the Web. Because of the sheer number of deep-web sites crawled, we trade off complete coverage of an individual site for incomplete but "representative" coverage of a large number of sites.

The second important difference is that since we are crawling entity-oriented pages, the queries we come up with should be entity names instead of arbitrary phrase segments. As such, we leverage two important data sources, namely query logs and knowledge bases. We will show that classical information retrieval and entity extraction techniques can be used effectively for entity query generation. To our knowledge neither of these data sources has been well studied for deep-web crawl purposes.

Empty page filtering. Authors in [15] developed an interesting notion of informativeness to filter search forms, which is computed by clustering signatures that summarize the content of crawled pages. If crawled pages only have a few signature clusters, then the search form is uninformative and will be pruned accordingly. This approach addresses the problem of empty pages to an extent by filtering uninformative forms. However, because this approach operates at the level of search form/URL template, it may still miss empty pages crawled using an informative URL template. Since our system generates only one high-quality URL template for each site, filtering at the granularity of URL templates is likely to be ill-suited. Instead, the approach we consider in this work filters at the page level: it can automatically distinguish empty pages from useful entity pages by utilizing intentionally generated bad queries. To our knowledge this simple yet effective approach has not been explored in the literature.

A novel page-level empty page filtering technique was described in [20], which labels a result page as empty if either certain predefined error messages are detected in the "significant portion" of result pages (e.g., the portion of the page formatted using frames, or visually laid out at the center of the page), or a large fraction of result pages are hashed to the same value. In comparison, our approach obviates the need to recognize the significant portion of result pages, and we use content signatures instead of hashing, which is more robust against minor page differences.

URL deduplication. The problem of URL deduplication has received considerable attention in the context of web crawling and indexing [2, 8, 13]. Current techniques consider two URLs as duplicates if their contents are highly similar. These approaches, referred to as content-based URL deduplication, propose to first summarize page contents using content sketches [7] so that pages with
similar content are grouped into clusters. URLs in the same cluster are then analyzed to learn URL transformation rules (for example, it can learn that www.cnn.com/story?id=num is equivalent to www.cnn.com/story_num).

Figure 2: A typical search interface (ebay.com)

In this paper, instead of looking at the traditional notion of page similarity at the content level, we view page similarity at the semantic level. That is, we view pages with entities from the same result set (but perhaps containing different portions of the result, or presented with different sorting orders) as semantically similar, which can then be deduplicated. This significantly reduces the number of crawls needed, and is in line with our goal of obtaining representative content coverage given the sheer number of websites crawled.

Using semantic similarity, our approach can analyze URL patterns and deduplicate before pages are crawled. In comparison, existing content-based deduplication not only requires pages to be crawled first for content analysis, it would also be unable to recognize the semantic similarity between URLs and would require billions more URLs to be crawled.

Authors in [15] pioneered the notion of presentation criteria, and pointed out that crawling pages with content that differs only in presentation criteria is undesirable. Their approach, however, deduplicates at the level of search forms and cannot be used to deduplicate URLs directly.

4. URL TEMPLATE GENERATION

As input to our system, we are given a list of entity-oriented deep-web sites that need to be crawled. Our first problem is to generate URL templates for each site that are equivalent to submitting search forms, so that entities can be crawled directly using URL templates.

As a concrete example, the search form from ebay.com is shown in Figure 2, which represents a typical entity-oriented deep-web search interface. Searching this form using the query "ipad 2" without changing the default value "All Categories" of the drop-down box is equivalent to using the URL template for ebay.com in Table 1, with the wild-card "{query}" replaced by "ipad+2".

The exact technique that parses search forms is developed based on techniques proposed in [15], which we will not discuss in detail in the interest of space. However, our experience with URL template generation leads to two interesting observations worth mentioning.

Our first observation is that for entity-oriented sites, the main search form is almost always on the home page instead of somewhere deep in the site. The search form is such an effective information retrieval paradigm that websites are only too eager to expose it. A manual survey suggests that only 1 out of 100 randomly sampled sites does not have the search form on the home page (www.arke.nl). This obviates the need to use sophisticated techniques to locate search forms deep in websites (e.g., [4, 5]).

The second observation is that in entity-oriented sites, search forms predominantly use one main text input field to accept keyword queries (a full 93% of sites surveyed have exactly one text field to accept keyword queries). At the same time, other non-text input fields exhibit good "default value" behavior (94% of the 100 sampled sites are judged to be able to retrieve entities using default values without sacrificing coverage).

Since enumerating values combination in multiple input fields

(a) "Related queries" on first-level pages (b) Disambiguation pages (c) Faceted search
Figure 4: Examples for which second-level crawl is desirable

In the first category, when a query is searched, an additional set of queries relevant to the original query is displayed. This is known as query expansion [19], which aims to help users reformulate their queries. Figure 4a is a screenshot of such an example. When the original query "iphone 4" is searched, the returned page displays queries related to the original query, like "iphone 3gs", "iphone 4 case", etc. Since these queries are maintained and suggested by individual sites, they can in most cases lead to valid deep-web content, thus improving content coverage.

Figure 4b shows another scenario for second-level crawl. In this example, when the query "san francisco" is searched, a disambiguation page is returned containing multiple cities with that name. Following second-level URLs on the disambiguation page is needed in order to expose rich deep-web content (lists of hotels in this case).

Second-level crawls are also desirable for sites that employ the "faceted search/browsing" paradigm [11]. In faceted search, returned entities are presented in a multi-dimensional, faceted manner. In the example shown in Figure 4c, when "camera" is searched, a "multi-faceted" entity classification is returned along with a large set of results, to allow users to drill down using additional criteria (e.g., category, brand, price, etc.). Conceptually, this is equivalent to adding an additional predicate to the original entity retrieval query, which amounts to a new query. Accordingly, such URLs are also desirable for further crawling.

We also note that crawling second-level URLs using our simplified URL template with default values can achieve effects similar to exhaustive template generation by enumerating all possible value combinations (Section 4). In the example website in Figure 4c, where the query "camera" is searched with the default category "All-categories", the second-level URL for category "Electronics" actually corresponds to searching "camera" with the sub-category "Electronics" selected. Furthermore, the previous multi-field enumeration approach would search for a query with all value combinations, e.g., searching for "camera" with the (inconsistent) sub-categories "pets" or "furnitures", which would lead to empty pages. Our approach, in contrast, is data driven: typically faceted search links are exposed only when there exists deep-web content matching the search criteria (category "pets" will not display when "camera" is searched). As a result, our approach is more likely to retrieve content successfully.

www.buy.com/sr/searchresults.aspx?qu=gps&sort=4&from=7&mfgid=-652&page=1
www.buy.com/sr/searchresults.aspx?qu=gps&sort=5&from=7&mfgid=-652&page=1
...
www.buy.com/sr/searchresults.aspx?qu=gps&sort=4&from=8&mfgid=-652&page=1
www.buy.com/sr/searchresults.aspx?qu=gps&sort=4&from=9&mfgid=-652&page=1
...
www.buy.com/sr/searchresults.aspx?qu=gps&sort=4&from=7&mfgid=-652&page=2
www.buy.com/sr/searchresults.aspx?qu=gps&sort=4&from=7&mfgid=-652&page=3
...
www.buy.com/sr/searchresults.aspx?qu=gps&sort=4&from=7&mfgid=-1755&page=1
www.buy.com/sr/searchresults.aspx?qu=gps&sort=5&from=7&mfgid=-1755&page=1
...
www.buy.com/sr/searchresults.aspx?qu=tv&sort=4&from=7&mfgid=-1001&page=1
www.buy.com/sr/searchresults.aspx?qu=tv&sort=5&from=7&mfgid=-1001&page=1
...
Table 4: Duplicate cluster of second-level URLs

7.2 URL extraction and filtering

While some second-level URLs are desirable, not all second-level URLs should be crawled, for efficiency as well as quality reasons. First, there are on average a few dozen second-level URLs for each first-level page crawled. Crawling all these second-level URLs becomes very costly at large scale. Furthermore, a significant portion of these second-level URLs are in fact entirely irrelevant, with no deep-web content (many URLs are static and navigational, for example browsing URLs, login URLs, etc.), and need to be filtered out.

We filter URLs by using keyword-query arguments. The keyword-query argument is the URL argument immediately prior to the "{query}" wild-card in URL templates. For example, in Table 1, "_nkw=" is the keyword-query argument for ebay.com. Similarly we have "search_by=" for chegg.com, and "keyword=" for beso.com. The presence of the keyword-query argument in a given domain is in general indicative that the page is dynamically generated with keyword predicates and is thus likely to be deep-web related. We observe that filtering URLs by keyword-query arguments significantly reduces the number of URLs (typically by a factor of 3-5) while still preserving desirable second-level URLs that lead to deep-web content.

7.3 URL deduplication

We observe that after URL filtering, there are groups of URLs that differ in their text strings but really lead to similar or nearly identical deep-web content. In this section we propose to deduplicate second-level URLs to further reduce the number of URLs that need to be crawled.

To take a closer look at second-level URLs, we use the URLs extracted from buy.com in Table 4 as an illustration. Dynamic URLs generated by deep-web form submission generally follow the W3C URI recommendations [1], where the part of the URL string after "?" is the so-called query string, and each component separated by "&" (or ";") is a query segment that consists of an argument and a value connected by an "=".

The observation here is that each query segment typically corresponds to a query predicate. Take the first URL in Table 4 as an example. The query segment "qu=gps" indicates that returned entities should contain the keyword "gps"; "sort=4" specifies that the list of entities should be sorted by price from low to high (where 4 is an internal encoding for that sorting criterion); "from=7" is an internal tracking parameter that records which URL was clicked to reach this page; "mfgid=-652" is a predicate that selects the manufacturer Garmin (where -652 is again an internal encoding); and finally "page=1" retrieves the first page of matching entities (typically each page only presents a limited number of entities, thus not all results can be displayed on one page). While the exact URL encodings of query strings vary wildly from site to site, such mappings from query segments to logical query predicates generally exist across different sites.

Figure 5: Second-level URLs from different sorting criteria

In this particular example, if this URL query string were to be written in SQL, it would correspond to the query below:

    SELECT * FROM db
    WHERE description LIKE '%gps%' AND manufacturer = 'Garmin'
    ORDER BY price ASC
    LIMIT 20; -- number of entities per page

The second URL in Table 4 differs from the first one in the segment "sort=5", which sorts entities by release date. The two URLs actually correspond to different sorting tabs in the result page, as illustrated in Figure 5, and there exist many additional sorting criteria, each of which corresponds to a second-level URL.

Recall that we aim to recover a representative content coverage of each site; obtaining exhaustive coverage for a large number of sites is unrealistic and also unnecessary. With that in mind, we only need to crawl one of the two example URLs discussed above (and other similar URLs with different sorting criteria) by "deduplicating" them. After all, crawling entities with similar properties but different sorting criteria only produces marginal benefit.

Similarly, the third and fourth URLs in Table 4 differ from the first URL in the segment "from=". This is only for internal tracking purposes so that the source of the click can be identified. While the URL strings are different in the "from=" part, the content they lead to is identical and thus also needs to be deduplicated.

Lastly, the fifth and sixth URLs in the table differ from the first URL in the "page=" segment. This retrieves a different portion of the result set, as the number of entities presented on each page tends to be limited. Again with the goal of recovering representative content coverage, we want to use URL deduplication to avoid iterating over the complete result set by crawling all the pages it spans.

Given this requirement, existing content-based URL deduplication (e.g., [2, 8, 13]) is clearly insufficient. To see why this is the case, consider the first and second URLs in Table 4, which retrieve the same result set but use different sorting criteria. Since the result set can span multiple pages, a different sorting order can produce a totally different result page. Existing techniques that analyze content similarity will not be able to recognize the semantic similarity between these two pages. Similarly, content-based deduplication will treat the fifth and sixth URLs as distinct pages instead of semantic duplicates, thus wasting significant crawling bandwidth. In this paper we propose an approach that analyzes URL argument patterns and deduplicates URLs even before any pages are crawled.

7.3.1 Pre-crawl URL deduplication

In this work we propose to analyze the patterns of second-level URLs before they are crawled, and use a new definition to capture both content similarity and the similarity in the semantics of the queries that are used to retrieve deep-web content.

First, we categorize query segments into three groups: (1) selection segments are query segments that correspond to selection predicates and can affect the set of result entities (e.g., "qu=gps" and "mfgid=-652" in the example URLs discussed above, which are essentially predicates in the WHERE clause of the SQL query); (2) presentation segments are query segments that do not change the result set, but only affect how the set of entities is presented (e.g., "sort=4" or "page=1" in the example URL); and lastly, (3) content irrelevant segments are query segments that have no immediate impact on the result entities (e.g., the tracking segment "from=7").

We then define two URLs as semantic duplicates if the corresponding selection queries have the same set of selection segments. More specifically, if the queries corresponding to two URL strings return the same set of entities, then irrespective of how the entities are sorted or what portion of the result set is presented, the two URLs are considered to be duplicates of each other. We can alternatively state that two URLs are considered semantic duplicates if they differ only in content irrelevant segments or presentation segments.

While the reason for disregarding "content irrelevant segments" is straightforward, the rationale behind ignoring presentation segments goes back to our goal of obtaining "representative coverage" for each site. Exhaustively crawling the complete result set in different sorting orders provides marginal benefits; crawling one page for each distinct set of selection predicates is deemed sufficient.

EXAMPLE 2. We use the example URLs in Table 4 to illustrate our definition of semantic duplicates. The first group of URLs all correspond to the same selection predicates (i.e., "qu=gps" and "mfgid=-652") and differ only in content irrelevant segments ("from=") or presentation segments ("sort=", "page="). Crawling any one URL from this group will provide representative coverage.

On the other hand, URLs from the first group and the second group differ in the selection segment "mfgid=?", where "mfgid=-652" represents "Garmin" while "mfgid=-1755" is for "Tomtom". The selection queries would retrieve two different sets of entities, and thus should not be considered semantic duplicates.

It can be seen that our semantics-based URL deduplication is a more general notion that goes beyond simple content-based similarity. Our approach is not based on any content analysis. Rather, it hinges on the correct identification of the categories of query segments by URL pattern analysis.

At a high level, our approach is based on two key observations. First, search result pages in the same site are likely to be generated from the same template and are thus highly homogeneous. That is, for the same site, the structure, layout and content of result pages share much similarity (which includes the deep-web URLs embedded in result pages that we are interested in).

Second, given page homogeneity, we observe that almost all result pages in the same site share certain presentation logic or other content-irrelevant functionality. In the example of URLs extracted from buy.com discussed above, almost all result pages can sort entities by price, or advance to the second page of entities in the result set. Each page also implements the site-specific click-tracking functionality. This presentation logic translates to the same presentation segments ("sort=4" and "page=2") and content-irrelevant segments ("from=7"), respectively, which can be found in almost every result page.

With these observations, we propose to take all URLs embedded in a result page as a unit of analysis. We then aggregate the frequency of query segments and identify segments that commonly occur across many different pages in the same site. The fact that these query segments occur in almost all pages indicates that they are not specific to the input keyword query, and are thus likely to be either presentational (sorting, page number, etc.) or content irrelevant (internal tracking, etc.).

On the other hand, selection segments, like the manufacturer name ("mfgid=-652" for "Garmin") in the previous example, are much more sensitive to the input queries. Only when queries related to GPS are searched will the segment representing manufacturer "Garmin" ("mfgid=-652") appear. Pages crawled for other entities

(a) Num. of matched pairs (b) Precision of matched pairs
Figure 7: Effects of different score threshold

the highest level Freebase data are grouped into "domains", or categories of relevant topics, like automotive, book, computers, etc., in the first column in Table 6. Within each domain, there is a list of "types", each of which consists of manually curated entities with names and relationships. For example, the domain film contains types including film (list of film names), actor (list of actor names) and performance (the relation recording which actor performed in which film).
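To make the domain/type/entity hierarchy concrete, the following minimal Python sketch shows how entity names from one type could be turned into crawl URLs by filling the "{query}" wild-card of a URL template, as in the "ipad 2" to "ipad+2" example of Section 4. It is an illustration under stated assumptions, not the system's implementation: the template www.example.com/search?q={query}, the tiny knowledge-base sample, and the function names fill_template and expand_type are all hypothetical.

```python
from urllib.parse import quote_plus

# Hypothetical miniature of the hierarchy described above:
# domain -> type -> list of entity names (made-up sample, not real Freebase data).
KB_SAMPLE = {
    "film": {
        "film": ["The Godfather", "Blade Runner"],
        "actor": ["Al Pacino", "Harrison Ford"],
    },
}


def fill_template(url_template: str, query: str) -> str:
    """Replace the "{query}" wild-card with the URL-encoded query.

    quote_plus encodes spaces as "+", so "ipad 2" becomes "ipad+2".
    """
    return url_template.replace("{query}", quote_plus(query))


def expand_type(url_template: str, domain: str, type_name: str, kb=KB_SAMPLE):
    """Generate one crawl URL per entity name in the given domain/type."""
    return [fill_template(url_template, name) for name in kb[domain][type_name]]


if __name__ == "__main__":
    template = "www.example.com/search?q={query}"  # assumed template shape
    for url in expand_type(template, "film", "film"):
        print(url)
```

Keeping the template as a plain string with a single wild-card mirrors the observation that most entity-oriented search forms have exactly one keyword text field; the other form inputs stay at their default values and are baked into the template.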
Domain name   Top types                              # of types   # of instances
Automotive    trim_level, model_year                 30           78,684
Book          book_edition, book, isbn               20           10,776,904
Computer      software, software_comparability       31           27,166
Digicam       digital_camera, camera_iso             18           6,049
Film          film, performance, actor               51           1,703,255
Food          nutrition_fact, food, beer             40           66,194
Location      location, geocode, mailing_address     167          4,150,084
Music         track, release, artist, album          63           10,863,265
TV            tv_series_episode, tv_program          41           1,728,083
Wine          wine, grape_variety_composition        11           16,125

Table 6: Freebase domains used for query expansion

Domain: type name           Matched deep-web sites
Automotive: model_year      stratmosphere.com, ebay.com
Book: book_edition          christianbook.com, netflix.com, barnesandnoble.com
Computer: software          booksprice.com
Digicam: digital_camera     rozetka.com.ua, price.ua
Food: food                  fibergourmet.com, tablespoon.com
Location: location          tripadvisor.com, hotels.com, apartmenthomeliving.com
Music: track                netflix.com, play.com, musicload.de
TV: tv_series_episode       netflix.com, cafepress.com
Wine: wine                  wineenthusiast.com

Table 7: Sample Freebase matches (incorrect ones are underlined)

Since not all Freebase domains are equally applicable for deep-web crawl purposes (e.g., chemistry ontology), for human evaluation purposes we only focus on the 5 largest Freebase types in the 10 widely applicable domains listed in Table 6. We also restrict our attention to the 100 high-traffic online retailers that we are most interested in. We asked a domain expert to manually label, for each pair of <website, Freebase type> in the result, whether the matching is correct, that is, whether entities in the Freebase type can be used to retrieve valid product entities from the matching website. In order not to overestimate the matching precision, we intentionally ignore matches for general-purpose sites that span multiple product categories (ebay.com, nextag.com, etc.), by considering such matches as neither correct nor incorrect.

Figure 7a shows the total number of matched Freebase-type/site pairs and Figure 7b illustrates the matching precision. As we can see, while the number of matched pairs increases as the threshold decreases, there is a significant drop in matching precision when the threshold decreases from 0.5 to 0.3. Empirically a threshold of 0.5 is used in our system.

(a) Vary score threshold (b) Using different sites
Figure 8: Precision/recall of empty page filtering. (8a): Each dot in the graph represents precision/recall of a different score threshold; (8b): Each dot represents results from a different website.

Table 7 illustrates example matches between Freebase types and deep-web sites. The top 3 matches of the largest type in each Freebase domain are listed (in some cases only 1 or 2 matches are above the relative threshold). Overall this produces good quality matches to Freebase types, which in turn greatly improves crawling coverage.

Empty page filtering. To evaluate the effectiveness of our empty page filtering approach (Section 6), we randomly selected 10 deep-web sites from a list of high-traffic sites (namely, booking.com, cdiscount.com, ebay.com, ebay.co.uk, marksandspencer.com, nordstrom.com, overstock.com, screwfix.com, sephora.com, tripadvisor.com), and manually identified their respective error messages (e.g., "Your search returned 0 items" is the error message used by ebay.com). This manual approach, while accurate, does not scale to a large number of websites. It does, however, enable us to build the ground truth: any page crawled from the site with that particular message is regarded as an empty page (negative instance), and pages without such a message are treated as non-empty pages (positive instances). We can then evaluate the precision and recall of our algorithm, where precision and recall are defined as

\[
\text{precision} = \frac{|\{\text{pages predicted as non-empty}\} \cap \{\text{pages that are non-empty}\}|}{|\{\text{pages predicted as non-empty}\}|}
\]

\[
\text{recall} = \frac{|\{\text{pages predicted as non-empty}\} \cap \{\text{pages that are non-empty}\}|}{|\{\text{pages that are non-empty}\}|}
\]

Figure 8a shows the precision/recall graph of empty page filtering when varying the threshold score from 0.4 to 0.95. We observe that setting the threshold to a low value, say 0.4, achieves high precision (predicted non-empty pages are indeed non-empty) at the cost of significantly reducing recall to only around 0.6 (many non-empty pages are mistakenly labeled as empty because of the low threshold). At threshold 0.85 the precision and recall are 0.89 and 0.9, respectively, which is a good empirical setting that we used in our system.

Figure 8b plots the precision/recall of individual deep-web sites for empty page filtering. Other than a cluster of points at the upper-right corner, representing sites with almost perfect precision/recall, there is only one site with relatively low precision and another one with relatively low recall.

(a) Argument level results (b) URL level recall
Figure 9: Precision/recall of URL deduplication. (9a): Each dot represents argument precision/recall at a different prevalence value; (9b): Each dot represents lost URLs at a different prevalence value.

URL deduplication. To understand the effectiveness of our semantic URL deduplication technique (Section 7.3), we used the same set of 10 entity sites used in empty page filtering, and manually labeled all URL arguments above the threshold 0.01 as either semantically relevant or irrelevant for deduplication purposes. Note that we cannot afford to inspect all possible arguments, because websites can typically use a very large number of arguments in URLs. For example, we found 1471 different arguments from overstock.com, 1243 from ebay.co.uk, etc. Furthermore, ascertaining the semantic relevance of arguments that appear very infrequently can
be increasingly hard. As a result we only evaluate arguments with a prevalence score of at least 0.01, and identified a total of 152 arguments that are semantically relevant.

Figure 9a shows the precision/recall of URL deduplication at the argument level. Each data point corresponds to a different threshold value, ranging from 0.01 to 0.5. Recall that our prevalence-based algorithm predicts an argument as irrelevant if its prevalence score is over the threshold. This prediction is deemed correct if the argument is manually labeled as irrelevant (because it is presentational or content-irrelevant). At threshold 0.1, our approach has a precision of 98% and a recall of 94%, which is a good empirical value that we use for our crawl system.

The second experiment in Figure 9b shows the recall at the URL level. An argument mistakenly predicted as irrelevant by our algorithm will cause URLs with that argument to be incorrectly deduplicated. In this experiment, in addition to using all arguments manually labeled as relevant in the ground truth, we treat unlabeled arguments with prevalence lower than 0.1 as relevant (which we cannot manually verify, however, due to the sheer number of such infrequent arguments). We then evaluate the percentage of URLs mistakenly deduplicated (the percentage of content lost that should have been crawled) due to the misprediction. The graph shows that at the 0.1 level, only 0.7% of URLs are incorrectly deduplicated.

Finally, Figure 10 compares the reduction ratio of second-level URLs between our approach and the simpler approach of using URL filtering only (which filters out static URLs and URLs without the keyword-query segment, e.g., _nkw= for ebay.com). As can be seen, URL filtering alone accounts for a reduction ratio of 3.6. Our approach of using semantic URL deduplication on top of URL filtering achieves a roughly 10-fold reduction in the number of URLs, which is 2.3 to 3.4 times more reduction than using URL filtering alone, depending on the prevalence threshold. This amounts to a significant saving given that the number of second-level URLs extracted from result pages is on the order of billions.

Figure 10: Reduction ratio of URL deduplication

9. CONCLUSION

In this work we develop a prototype system that focuses on crawling entity-oriented deep-web sites. We leverage characteristics of these entity sites, and propose optimized techniques that improve the efficiency and effectiveness of the crawl system.

While these techniques are shown to be useful, our experience points to a few areas that warrant future study. For example, in template generation, our parsing approach only handles HTML "GET" forms but not "POST" forms or JavaScript forms, which reduces site coverage. In query generation, although Freebase-based entity expansion is useful, certain sites with low traffic or diverse traffic do not get matched with Freebase types effectively using query logs alone. Utilizing additional signals (e.g., entities bootstrapped from crawled pages) for entity expansion is an interesting area. Efficiently enumerating entity queries for search forms with multiple input fields is another interesting challenge.

Given the ubiquity of entity-oriented deep-web sites and the variety of uses that entity-oriented content can enable, we believe entity-oriented crawl is a useful research effort, and we hope our initial efforts in this area can serve as a springboard for future research.

10. REFERENCES

[1] HTML 4.01 Specification, W3C recommendations. http://www.w3.org/addressing/url/4_uri_recommentations.html.
[2] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the dust: different URLs with similar text. In Proceedings of WWW, 2006.
[3] L. Barbosa and J. Freire. Siphoning hidden-web data through keyword-based interfaces. In Proceedings of SBBD, 2004.
[4] L. Barbosa and J. Freire. Searching for hidden web databases. In Proceedings of WebDB, 2005.
[5] L. Barbosa and J. Freire. An adaptive crawler for locating hidden-web entry points. In Proceedings of WWW, 2007.
[6] K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of SIGMOD, 2008.
[7] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW, 1997.
[8] A. Dasgupta, R. Kumar, and A. Sasturkar. De-duping URLs via rewrite rules. In Proceedings of KDD, 2008.
[9] J. Guo, G. Xu, X. Cheng, and H. Li. Named entity recognition in query. In Proceedings of SIGIR, 2009.
[10] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Accessing the deep web. Commun. ACM, 50, 2007.
[11] M. A. Hearst. UIs for faceted navigation: recent advances and remaining open problems. In Proceedings of HCIR, 2008.
[12] A. Jain and M. Pennacchiotti. Open entity extraction from web search query logs. In Proceedings of ICCL, 2010.
[13] H. S. Koppula, K. P. Leela, A. Agarwal, K. P. Chitrapura, S. Garg, and A. Sasturkar. Learning URL patterns for webpage de-duplication. In Proceedings of WSDM, 2010.
[14] J. Madhavan, S. R. Jeffery, S. Cohen, X. Luna Dong, D. Ko, C. Yu, and A. Halevy. Web-scale data integration: You can only afford to pay as you go. In Proceedings of CIDR, 2007.
[15] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy. Google's deep web crawl. In Proceedings of VLDB, 2008.
[16] G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proceedings of WWW, 2007.
[17] A. Ntoulas. Downloading textual hidden web content through keyword queries. In JCDL, 2005.
[18] M. Paşca. Weakly-supervised discovery of named entities using web search queries. In Proceedings of CIKM, 2007.
[19] Y. Qiu and H.-P. Frei. Concept based query expansion. In Proceedings of SIGIR, 1993.
[20] S. Raghavan and H. Garcia-Molina. Crawling the hidden web. Technical report, Stanford, 2000.
[21] P.-N. Tan and V. Kumar. Introduction to Data Mining.
[22] Y. Wang, J. Lu, and J. Chen. Crawling deep web using a new set covering algorithm. In Proceedings of ADMA, 2009.
[23] P. Wu, J.-R. Wen, H. Liu, and W.-Y. Ma. Query selection techniques for efficient crawling of structured web sources. In Proceedings of ICDE, 2006.
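
The pre-crawl deduplication of Section 7.3.1 can be sketched in Python as follows. This is a hedged illustration rather than the system's implementation: the exact prevalence formula is not given in this text, so the sketch assumes that the prevalence of an "argument=value" segment is the fraction of result pages whose embedded URLs contain it, and that an argument's prevalence is the maximum over its observed values. The function names and that aggregation rule are assumptions; the default threshold of 0.1 is the empirical value reported in the evaluation.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def _split(url):
    # The URLs in Table 4 carry no scheme; add one so urlsplit finds the query.
    return urlsplit(url if "://" in url else "http://" + url)


def irrelevant_arguments(pages, threshold=0.1):
    """Predict presentation / content-irrelevant arguments by prevalence.

    pages: one list of embedded second-level URLs per crawled result page.
    Assumption: segment prevalence = fraction of pages containing the segment;
    argument prevalence = max over that argument's observed values.
    """
    pages_with = defaultdict(set)  # (argument, value) -> indices of pages
    for i, urls in enumerate(pages):
        for url in urls:
            for seg in parse_qsl(_split(url).query, keep_blank_values=True):
                pages_with[seg].add(i)
    prevalence = defaultdict(float)
    for (arg, _value), idx in pages_with.items():
        prevalence[arg] = max(prevalence[arg], len(idx) / len(pages))
    return {arg for arg, p in prevalence.items() if p > threshold}


def dedup_key(url, irrelevant):
    """Host, path and sorted selection segments identify a semantic duplicate class."""
    parts = _split(url)
    selection = sorted(
        seg for seg in parse_qsl(parts.query, keep_blank_values=True)
        if seg[0] not in irrelevant
    )
    return (parts.netloc, parts.path, tuple(selection))


def deduplicate(urls, irrelevant):
    """Keep the first URL seen for each distinct set of selection segments."""
    kept = {}
    for url in urls:
        kept.setdefault(dedup_key(url, irrelevant), url)
    return list(kept.values())
```

Under these assumptions, segments such as "sort=4", "page=1" and "from=7" appear on nearly every result page of a site and so exceed the threshold, while "qu=gps" or "mfgid=-652" appear only on pages for related queries and are kept as selection segments. Because the key is computed from the URL string alone, no page needs to be fetched before duplicates are dropped.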