Crawling the Hidden Web, by S. Raghavan & H.G. Molina

Uploaded On 2021-07-03



Presentation Transcript

1 Crawling the Hidden Web
by S. Raghavan & H.G. Molina
Presenter: Jack Ng
Oct 30, 2002

2 Background Info
- Hidden Web: databases whose content is accessible only through search forms
- Why is it important to tap into the hidden Web?

3 Background Info
- According to "The Deep Web: Surfacing Hidden Value", 2001:
  - 500 billion documents; 500 times the PIW (publicly indexable Web)
  - 7500 TB of data; 19 TB for PIW
  - grows much faster than the PIW
  - high quality, topic-specific information
  - 95 % is publicly accessible - no fees or subscriptions

4 Background Info
- Challenges faced by crawlers to extract content from the hidden Web:
  - size of hidden Web is enormous!
  - content not reachable by following hypertext links
  - "form-filling" is a human activity
- "Training" a crawler is very difficult!!

5 Background Info
- Authors' approach to address the challenges:
  - model of hidden Web crawler
  - model of form page
  - LITE (Layout-based Information Extraction Technique) for content extraction
- Implementation - HiWE (Hidden Web Exposer)

6 HiWE Architecture

7 HiWE Data Structures - LVS Table
- Task-specific DB
- Organized by concepts
- Vocabularies for filling out forms

Task: search for game reviews
  Platform      {Xbox, PS2, GameCube, PC}
  Genre         {action, RPG, strategy, sports}
  Developer     {EA, Sega, Squaresoft, Bioware}
  Release date  {1999, 2000, 2001, 2002}

Fuzzy set: membership function assigns 'confidence' to each value

8 HiWE - Form Processing Strategy
- Given internal form representation: F = ({E1, E2, …, En}, S, M)
- For each infinite domain element, label matching algorithm finds closest match in LVS table and assigns value set to it
- Rank value assignments to ensure quality submission
  - fuzzy conjunction -- conservative
  - [...] -- aggressive
- Submit only if rank is greater than threshold

9 What is LITE?
- Label extraction heuristics based on how page is laid out for human viewing
- Label is often visually adjacent to widget (e.g., textbox) and obvious to viewer
- Partial layout is sufficient to determine adjacency -- prune unnecessary elements (see Figure 4 in paper)
- Applications in HiWE:
  - form page analysis
  - response page analysis

10 LITE Application - Form Page Analysis
[Figure: sample form page "Welcome! Movie review archive" with candidate labels "Movie Title" and "Director", annotated with their pixel distances to the form element. Label is "Movie Title".]
How LITE heuristics identify label of form element

11 LITE Application - Response Page Analysis
- Based on idea that results must be obvious to viewers
- Prune page to find visually center-most portion & interpret it as results location
- To identify error pages:
  - search center portion for common error text (e.g., "No results")
  - compute hash value for center portion
    - common hash values = error pages

12 Experimental Results
- Value assignment ranking
  - fuzzy conj. -- best submission efficiency
  - [...] -- most successful submissions
  - [...] -- poor performance
- LITE outperforms other label extraction techniques; overall 93 % accuracy

13 Strength & novelty of solution
+ flexible framework
+ works with non-cooperative DBs
+ crawler has learning capability
+ crawls both PIW and hidden Web
+ 'mines' visual layout info for semantics

14 Limitations
- LVS table - how to handle semantically ambiguous labels?
- what about image labels?
- doesn't consider relationships among elements when assigning values
- 'all-or-none' form submission policy

15 Presentation of paper
- easy to follow and understand
- right level of details
- goals & pre-conditions clearly defined
- overall, a good paper
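
As an illustration of the adjacency idea on the "What is LITE?" and "Form Page Analysis" slides, here is a minimal sketch that picks the text fragment closest to a form widget. It is not the authors' implementation: the `Box` type, the coordinates, the use of plain Euclidean distance, and the `max_distance` cutoff are all assumptions made for the example. The real heuristics work on a pruned page layout.

```python
import math
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """Hypothetical layout record: a piece of text and its position in pixels."""
    text: str
    x: float
    y: float


def nearest_label(widget: Box, candidates: Sequence[Box],
                  max_distance: float = 100.0) -> Optional[str]:
    """Return the text of the candidate visually closest to the widget.

    Candidates farther away than max_distance (an assumed cutoff) are ignored.
    Returns None when no candidate is close enough.
    """
    best_text = None
    best_distance = max_distance
    for candidate in candidates:
        distance = math.hypot(candidate.x - widget.x, candidate.y - widget.y)
        # Strict comparison keeps the earlier candidate when distances tie.
        if distance < best_distance:
            best_text = candidate.text
            best_distance = distance
    return best_text


# Made-up coordinates loosely modeled on the movie review form in the slides.
textbox = Box("", 200, 100)
texts = [
    Box("Movie Title", 150, 100),
    Box("Director", 150, 160),
    Box("Welcome! Movie review archive", 200, 20),
]
label = nearest_label(textbox, texts)
```

With these made-up positions, "Movie Title" is the nearest fragment and is returned as the label.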
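
The "Form Processing Strategy" slide can be sketched roughly as follows: match each form label to an LVS entry, enumerate value assignments, rank each one by fuzzy conjunction (the minimum of the per-value confidences), and keep only those above a threshold. The table contents, the confidence numbers, the `cutoff`, and the use of `difflib.SequenceMatcher` for label matching are assumptions for illustration, not details taken from the slides.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical LVS table: label -> fuzzy value set (value -> confidence in [0, 1]).
LVS = {
    "platform": {"Xbox": 1.0, "PS2": 1.0, "GameCube": 0.9, "PC": 0.8},
    "genre": {"action": 1.0, "RPG": 0.9, "strategy": 0.7, "sports": 0.6},
}


def closest_label(form_label, table, cutoff=0.6):
    """Return the table label most similar to form_label, or None if none reaches cutoff."""
    best, best_score = None, cutoff
    for label in table:
        score = SequenceMatcher(None, form_label.lower(), label.lower()).ratio()
        if score >= best_score:
            best, best_score = label, score
    return best


def rank_fuzzy_conjunction(confidences):
    """Fuzzy conjunction: an assignment is only as good as its weakest value."""
    return min(confidences)


def assignments_to_submit(form_labels, table, threshold):
    """Return (rank, values) pairs with rank above threshold, best first."""
    value_sets = []
    for form_label in form_labels:
        match = closest_label(form_label, table)
        if match is None:
            return []  # mirrors the all-or-none policy: skip the form entirely
        value_sets.append(list(table[match].items()))
    ranked = []
    for combo in product(*value_sets):
        rank = rank_fuzzy_conjunction([conf for _, conf in combo])
        if rank > threshold:
            ranked.append((rank, tuple(value for value, _ in combo)))
    ranked.sort(key=lambda pair: -pair[0])  # stable sort keeps enumeration order on ties
    return ranked


submissions = assignments_to_submit(["Platform:", "Game genre"], LVS, threshold=0.85)
```

Taking the minimum is what makes fuzzy conjunction conservative: a single low-confidence value drags the whole assignment below the threshold.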
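
The error-page check from the "Response Page Analysis" slide could look something like the sketch below. It assumes the visually center-most portion of each response page has already been extracted as text. The phrase list, the choice of SHA-256, the whitespace normalization, and the `min_repeats` cutoff are assumptions, since the slides do not specify them.

```python
import hashlib
from collections import Counter

# Hypothetical list of common error phrases; the slides give only "No results".
ERROR_PHRASES = ("no results", "no matches")


def center_hash(center_text):
    """Hash the center portion after lowercasing and collapsing whitespace."""
    normalized = " ".join(center_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_error_pages(center_texts, min_repeats=3):
    """Return indices of pages whose center portion looks like an error message.

    A page is flagged if it contains a known error phrase, or if its hash
    occurs at least min_repeats times across the responses (an assumed cutoff).
    """
    hashes = [center_hash(text) for text in center_texts]
    counts = Counter(hashes)
    flagged = []
    for index, (text, digest) in enumerate(zip(center_texts, hashes)):
        has_phrase = any(phrase in text.lower() for phrase in ERROR_PHRASES)
        if has_phrase or counts[digest] >= min_repeats:
            flagged.append(index)
    return flagged


pages = [
    "Review of Halo",
    "Sorry,  try again",
    "sorry, try again",
    "Sorry, try again",
    "No results found",
    "Review of Zelda",
]
error_indices = find_error_pages(pages)
```

The hash check catches error pages whose wording is not in the phrase list: real result pages tend to differ from one another, while the same error message repeats across many submissions.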