1 Crawling the Hidden Web
by S. Raghavan & H.G. Molina
Presenter: Jack Ng
Oct 30, 2002

2 Background Info
- Hidden Web: databases whose content is accessible only through search forms
- Why is it important to tap into the hidden Web?

3 Background Info
- According to "The Deep Web: Surfacing Hidden Value", 2001:
  - 500 billion documents; 500 times PIW (publicly indexable Web)
  - 7500 TB of data; 19 TB for PIW
  - grows much faster than the PIW
  - High quality, topic-specific information
  - 95 % is publicly accessible - no fees or subscriptions

4 Background Info
- Challenges faced by crawlers to extract content from the hidden Web:
  - size of hidden Web is enormous!
  - content not reachable by following hypertext links
  - "form-filling" is a human activity
  - Training a crawler is very difficult!!

5 Background Info
- Authors' approach to address the challenges:
  - model of hidden Web crawler
  - model of form page
  - LITE (Layout-based Information Extraction Technique) for content extraction
- Implementation - HiWE (Hidden Web Exposer)

6 HiWE Architecture

7 HiWE Data Structures - LVS Table
- Task-specific DB
- Organized by concepts
- Vocabularies for filling out forms

Task: search for game reviews
  Platform      {Xbox, PS2, GameCube, PC}
  Genre         {action, RPG, strategy, sports}
  Developer     {EA, Sega, Squaresoft, Bioware}
  Release date  {1999, 2000, 2001, 2002}

Fuzzy set: membership function assigns confidence to each value

8 HiWE - Form Processing Strategy
- Given internal form representation: F = ({E1, E2, ..., En}, S, M)
- For each infinite domain element, label matching algorithm finds closest match in LVS table and assigns value set to it
- Rank value assignments to ensure quality submission
  - Fuzzy conjunction
  - -- conservative
  - -- aggressive
- Submit only if rank is greater than threshold

9 What is LITE?
- Label extraction heuristics based on how page is laid out for human viewing
- Label is often visually adjacent to widget (e.g., textbox) and obvious to viewer
- Partial layout is sufficient to determine adjacency -- prune unnecessary elements (see Figure 4 in paper)
- Applications in HiWE:
  - form page analysis
  - response page analysis

10 LITE Application - Form Page Analysis
[Figure: form page "Welcome! Movie review archive" with the text items "Movie Title" and "Director"; distance annotations: 50 pixels, 50, 50, 80, 60. Label is Movie Title.]
How LITE heuristics identify label of form element

11 LITE Application - Response Page Analysis
- Based on idea that results must be obvious to viewers
- Prune page to find visually center-most portion & interpret it as results location
- To identify error pages:
  - search center portion for common error text (e.g., "No results")
  - compute hash value for center portion
  - common hash values = error pages

12 Experimental Results
- Value assignment ranking
  - fuzzy conj. -- best submission efficiency
  - -- most successful submissions
  - -- poor performance
- LITE outperforms other label extraction techniques; overall 93 % accuracy

13 Strength & novelty of solution
+ flexible framework
+ works with non-cooperative DBs
+ crawler has learning capability
+ crawls both PIW and hidden Web
+ mines visual layout info for semantics

14 Limitations
- LVS table - how to handle semantically ambiguous labels?
- what about image labels?
- doesn't consider relationships among elements when assigning values
- 'all-or-none' form submission policy

15 Presentation of paper
- easy to follow and understand
- right level of details
- goals & pre-conditions clearly defined
- overall, a good pap
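
The "HiWE - Form Processing Strategy" slide says value assignments are ranked and submitted only when the rank exceeds a threshold. A minimal sketch of that idea, assuming fuzzy conjunction means taking the minimum membership degree (the usual fuzzy-logic definition; the slides do not spell it out). The table contents, membership degrees, and all function names are hypothetical.

```python
from itertools import product

# Hypothetical LVS-style table: label -> {value: membership degree in [0, 1]}.
LVS = {
    "platform": {"Xbox": 1.0, "PS2": 0.9},
    "genre": {"RPG": 0.8, "sports": 0.4},
}


def fuzzy_conjunction_rank(assignment):
    """Rank an assignment as the minimum membership degree of its values."""
    return min(weight for _value, weight in assignment.values())


def candidate_assignments(lvs):
    """Yield every combination of one (value, weight) pair per label."""
    labels = sorted(lvs)
    for combo in product(*(lvs[label].items() for label in labels)):
        yield dict(zip(labels, combo))


def assignments_to_submit(lvs, threshold):
    """Keep only assignments whose rank is strictly greater than threshold."""
    ranked = [(fuzzy_conjunction_rank(a), a) for a in candidate_assignments(lvs)]
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [(rank, a) for rank, a in ranked if rank > threshold]


selected = assignments_to_submit(LVS, threshold=0.5)
for rank, assignment in selected:
    print(rank, {label: value for label, (value, _w) in assignment.items()})
```

With these made-up weights, the two assignments that include "sports" are filtered out because their lowest membership degree (0.4) does not exceed the threshold.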
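
The "What is LITE?" slide rests on the observation that a label is usually visually adjacent to its widget. The sketch below illustrates only that adjacency idea, assuming each page element already has a bounding box from some layout step. It picks the text item with the smallest edge-to-edge gap to the widget, which is a simplification and not the paper's actual heuristics. The `Box` type, the function names, and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    """Axis-aligned bounding box of a rendered page element (pixels)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def gap(a, b):
    """Edge-to-edge distance between two boxes (0 if they touch or overlap)."""
    dx = max(0.0, b.x - (a.x + a.width), a.x - (b.x + b.width))
    dy = max(0.0, b.y - (a.y + a.height), a.y - (b.y + b.height))
    return hypot(dx, dy)


def nearest_label(widget, candidates):
    """Return the text of the candidate that is visually closest to widget."""
    return min(candidates, key=lambda box: gap(widget, box)).text


# Made-up layout: a textbox with three pieces of text around it.
textbox = Box("", x=200, y=100, width=150, height=20)
texts = [
    Box("Welcome! Movie review archive", x=100, y=20, width=300, height=30),
    Box("Movie Title", x=100, y=100, width=80, height=20),
    Box("Director", x=100, y=160, width=80, height=20),
]
label = nearest_label(textbox, texts)
print(label)
```

Measuring the gap between box edges rather than box centers keeps a wide page heading from winning merely because it is horizontally centered over the widget.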
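
The "Response Page Analysis" slide lists two signals for error pages: common error text in the center portion, and hash values of the center portion that keep recurring. A rough sketch of how the two could be combined, assuming the center portion has already been extracted as plain text. The phrase list, the repeat threshold, the choice of SHA-256, and the class name are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

# Illustrative phrases only; a real crawler would need a broader list.
ERROR_PHRASES = ("no results", "no matches")


class ErrorPageDetector:
    """Flags a response page as an error page from its center-portion text."""

    def __init__(self, repeat_threshold=3):
        self.repeat_threshold = repeat_threshold
        self.hash_counts = Counter()

    def observe(self, center_text):
        """Record one response page; return True if it looks like an error."""
        normalized = " ".join(center_text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        self.hash_counts[digest] += 1
        has_error_text = any(p in normalized for p in ERROR_PHRASES)
        is_repeated = self.hash_counts[digest] >= self.repeat_threshold
        return has_error_text or is_repeated


detector = ErrorPageDetector(repeat_threshold=2)
print(detector.observe("No results found"))      # matched by error text
print(detector.observe("Sorry, try again"))      # first sighting of this hash
print(detector.observe("Sorry,  try again"))     # same hash seen again
```

The hash check covers error pages whose wording is not in the phrase list: different queries that all return an identical center portion are unlikely to be real results.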