ÇUKUROVA UNIVERSITY
COMPUTER NETWORKS

Web Crawling
Taner Kapucu
Electrical and Electronics Engineering / 2011514036
Contents
1. Search Engine
2. Web Crawling
3. Crawling Policy
4. Focused Web Crawling
5. Algorithms
6. Examples of Web Crawlers
7. References
Search Engine
Abstract

With the rapid development and increase of global data on the World Wide Web, and the rapid growth in web users across the globe, an acute need has arisen to improve, modify, or design search algorithms that help in effectively and efficiently finding the specific required data in the huge repository available.

Year    Websites       Internet Users
1998    2,407,067      188,023,930
2015    863,105,652    3,185,996,155
Search Engine
The significance of search engines
The significance and popularity of search engines are increasing day by day.
Surveys show that internet use is growing worldwide, both in servers and in clients. The need for an efficient tool to find what users are looking for makes the significance of search engines even more obvious. Search engines take users' queries as input and produce a sequence of URLs that match the query, ordered by a rank calculated for each document on the web.
Search Engine
Search Engine Architecture
Search Engine
Search Engine Architecture
Search Engine: major components

Crawlers: web search engines and some other sites use web crawling (spidering) software to update their own web content or their indexes of other sites' web content. Web crawlers are one of the main components of web search engine systems: they assemble a corpus of web pages, index them, and allow users to issue queries against the index to find the pages that match. Each crawler has its own policies, so the pages indexed by different search engines differ.
The Indexer: processes pages, decides which of them to index, and builds various data structures representing the pages (inverted index, web graph, etc.); the representation differs among search engines. It might also build additional structures (e.g., LSI).
The Query Processor: processes user queries and returns matching answers in an order determined by a ranking algorithm.
Web Crawling
What is Web Crawling?
Web crawling is the process (or program) used by search engines to download pages from the web for later processing: a search engine indexes the downloaded pages to provide fast searches.
A web crawler is an internet bot that systematically browses the World Wide Web, typically for the purpose of web indexing.
It is a program or automated script which browses the World Wide Web in a methodical, automated manner.
A web crawler has many names: spider, robot, web agent, wanderer, worm, offline reader; less used names include ants and bots.
Web Crawling
Why is a web crawler used?

The internet has a wide expanse of information, and finding relevant information requires an efficient mechanism. Web crawlers provide that capability to the search engine. Since the number of pages on the internet is extremely large, even the largest crawlers fall short of making a complete index.
Web Crawling
How does a web crawler work?

A crawler starts with a list of URLs to visit, called the seeds. As it visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. (The original slide shows Google's web crawler, where new URLs can be specified.)
Web Crawling
Single Web Crawling
Web Crawling
Crawling Algorithm
Crawler basic algorithm
1. Remove a URL from the unvisited URL list
2. Determine the IP address of its host name
3. Download the corresponding document
4. Extract any links contained in it
5. For each extracted URL that is new, add it to the list of unvisited URLs
6. Process the downloaded document
7. Go back to step 1
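In Python, the loop above can be sketched against a toy in-memory "web": the dict below stands in for the real DNS resolution and HTTP downloads of steps 2-3.

```python
from collections import deque

# Toy "web": URL -> list of outgoing links (stands in for DNS + HTTP).
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": [],
}

def crawl(seeds):
    """Run the basic crawler loop over the toy web, returning visit order."""
    unvisited = deque(seeds)          # the unvisited URL list
    seen = set(seeds)
    order = []
    while unvisited:
        url = unvisited.popleft()     # 1. remove a URL from the list
        links = TOY_WEB.get(url, [])  # 2-3. resolve host, download (simulated)
        order.append(url)             # 6. process the downloaded document
        for link in links:            # 4. extract the links it contains
            if link not in seen:      # 5. add new URLs to the unvisited list
                seen.add(link)
                unvisited.append(link)
    return order                      # 7. loop continues until the list is empty

print(crawl(["http://a.example/"]))
# ['http://a.example/', 'http://b.example/', 'http://c.example/']
```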
Web Crawling
Robot Exclusion
How to control those robots!

If a webmaster wishes to restrict the information on their site available to Googlebot or another well-behaved spider, they can do so with the appropriate directives. There are two components:
Robots Exclusion Protocol (robots.txt): site-wide specification of excluded directories.
Robots META tag: per-document tag to exclude indexing or following links, e.g. <meta name="Googlebot" content="nofollow" /> in the web page.
Spider: downloads the robots.txt file and the other pages that are requested by the crawler manager and permitted by the web authors. The robots.txt files are sent to the crawler manager for processing and extracting the URLs; the other downloaded files are sent to a central indexer.
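Python's standard library ships a parser for the Robots Exclusion Protocol. A sketch of a polite crawler checking permissions, with a sample robots.txt parsed from a string rather than fetched over HTTP:

```python
from urllib import robotparser

# A sample robots.txt: everyone is barred from /private/, except Googlebot,
# whose empty Disallow line allows everything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler consults the parser before fetching each URL.
print(rp.can_fetch("MyCrawler", "http://example.com/private/data.html"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/private/data.html"))  # True
```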
Structure of Web Crawler
Web Crawling
A Web Crawler Design
Any crawler must fulfil the following two requirements:
1) It must have a good crawling strategy.
2) It must have a highly optimized system architecture that can download a large number of pages per second.
Most search engines use more than one crawler and manage them in a distributed manner. This has the following benefits:
- Increased resource utilization
- Effective distribution of crawling tasks with no bottlenecks
- Configurability of the crawling tasks
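The slides do not specify how crawling tasks are divided among the crawlers; one common scheme is to assign each host to a crawler by hashing, sketched below (NUM_CRAWLERS and the URLs are illustrative):

```python
import hashlib
from urllib.parse import urlsplit

NUM_CRAWLERS = 3

def assign_crawler(url, n=NUM_CRAWLERS):
    """Map a URL's host to one of n crawler processes via a stable hash.

    Hashing by host (not full URL) keeps each site with a single crawler,
    which avoids bottlenecks and makes per-site politeness easy to enforce.
    """
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode()).hexdigest()
    return int(digest, 16) % n

# All URLs from one host land on the same crawler; no coordination needed.
a = assign_crawler("http://example.com/page1")
b = assign_crawler("http://example.com/page2")
print(a == b)  # True
```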
Search Engine
What is WebCrawler?
WebCrawler is a registered trademark of InfoSpace, Inc. It went live on April 20, 1994 and was created by Brian Pinkerton at the University of Washington.
WebCrawler is a metasearch engine that blends the top search results from Google Search and Yahoo!.
WebCrawler was the first web search engine to provide full-text search.
WebCrawler also provides users the option to search for images, audio, video, news, yellow pages and white pages.
WebCrawler formerly fetched results from Bing Search (formerly MSN Search and Live Search), Ask.com, About.com, MIVA, and LookSmart.
Search Engine
Meta Search Engine
WebCrawler, Dogpile, MetaCrawler
Web Crawling
Crawling Policy
The behavior of a web crawler is the outcome of a combination of policies:
- a selection policy, which states which pages to download,
- a re-visit policy, which states when to check for changes to the pages,
- a politeness policy, which states how to avoid overloading web sites, and
- a parallelization policy, which states how to coordinate distributed web crawlers.
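A minimal sketch of a politeness policy, assuming a fixed per-host delay (the 2-second value and the injected clock are illustrative, not from the slides):

```python
class PolitenessPolicy:
    """Track the last request time per host and enforce a minimum delay.

    The current time is passed in explicitly so the logic is testable
    without real waiting; in production it would be time.monotonic().
    """

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.last_request = {}  # host -> time of last request

    def wait_time(self, host, now):
        """Seconds the crawler must still wait before hitting `host`."""
        last = self.last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (now - last))

    def record(self, host, now):
        self.last_request[host] = now

policy = PolitenessPolicy(delay_seconds=2.0)
policy.record("example.com", now=10.0)
print(policy.wait_time("example.com", now=10.5))  # 1.5, too soon: back off
print(policy.wait_time("example.com", now=13.0))  # 0.0, safe to fetch
print(policy.wait_time("other.org", now=10.5))    # 0.0, different host
```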
Web Crawling
History
Web crawling is the process by which systems gather pages from web resources in order to index them and support a search engine that serves user queries. The primary objective of crawling is to quickly, effectively and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them, and provide the search results to the users requesting them.
Web Crawling
The Universal Crawlers
Universal crawlers support universal search engines. A universal crawler browses the World Wide Web methodically and creates an index of the documents it accesses. It first downloads the first website, then goes through the HTML, finds the link tags and retrieves the outside links. When it finds a link tag, it adds the link to the list of links it plans to crawl. Because the universal crawler crawls all the pages it finds, a huge crawl cost is incurred over the many queries from users; the universal crawler is comparatively expensive.
The Preferential Crawlers
Preferential crawlers are topic-based crawlers. They are selective about web pages: they are built to retrieve pages within a certain topic. Focused web crawlers and topical crawlers are types of preferential crawlers that search for pages related to a specific topic and only download a predefined subset of web pages from the entire web.
A variety of algorithms can be implemented on top of a focused web crawler to improve its efficiency and performance:
- Breadth-first search
- Depth-first search
- De Bra's fish search
- Shark search
- Link structure analysis: the PageRank algorithm and the HITS algorithm
- Several machine learning algorithms
Focused Web Crawlers

Adaptive topical crawlers and static crawlers are types of topical crawlers. If a focused web crawler includes learning methods in order to adapt its behavior during the crawl to the particular environment and its relationships with the given input parameters (e.g. the set of retrieved pages and the user-defined topic), the crawler is called adaptive. Static crawlers are simple crawlers that do not adapt to the environment they are given.
Challenges in Web Crawling

The major issues in web crawling faced globally are:
- Collaborative web crawling
- Crawling the large repository on the web
- Crawling multimedia data
- Execution time
- Scale: millions of pages on the web
- No central control of web pages
- Large numbers of web pages emerging daily
Various mechanisms and alternatives have been designed to address these challenges. The rate at which data on the web is booming makes web crawling an ever more challenging and uphill task. Web crawling algorithms need to constantly keep up with the demands of users and the growing data on the web.
Overview of Focused Web Crawler
A crawler called the focused web crawler can solve the problem of information retrieval from the huge content of the web more effectively and efficiently.
A focused web crawler is a hypertext system that seeks, acquires, indexes and maintains pages on a specific set of topics that represent a relatively narrow segment of the web. It entails a very small investment in hardware and network resources and yet achieves respectable coverage at a rapid rate, simply because there is relatively little to do. Thus, web content can be managed by a distributed team of focused web crawlers, each specializing in one or a few topics, by effectively prioritizing the crawl frontier and managing the exploration of hyperlinks; this also helps keep the crawl more up to date.
Structure of Focused Web Crawler
The Architecture of Focused Web Crawler
This type of web crawler makes indexing more effective, ultimately helping to achieve the basic requirement of faster and more relevant retrieval of information from the large repository of the World Wide Web. Many search engines have started using this technique to give users a richer experience while accessing web content, ultimately increasing their hit counts. The various components of the system, their inputs and outputs, their functionality and their innovative aspects are described below.
Focused Web Crawling
The Architecture of Focused Web Crawler

Seed detector: determines the seed URLs for the specific keyword by retrieving the first n URLs. The seed pages are detected and assigned a priority based on the PageRank algorithm, the HITS algorithm, or a similar algorithm.
Crawler Manager: downloads the documents from the global network. The URLs in the URL repository are fetched and assigned to the buffer in the crawler manager. The URL buffer is a priority queue. Based on the size of the URL buffer, the crawler manager dynamically generates instances of the crawlers, which download the documents. For more efficiency, the crawler manager can create a crawler pool. The manager is also responsible for limiting the speed of the crawlers and balancing the load among them; this is achieved by monitoring the crawlers.
The Architecture of Focused Web Crawler

Crawler: a multi-threaded Java program capable of downloading web pages from the web and storing the documents in the document repository. Each crawler has its own queue, which holds the list of URLs to be crawled. The crawler fetches the next URL from the queue and checks whether the same or another crawler has already sent a request to the same server; if so, sending another request would overload a server that is still busy completing the requests already sent, so access to the server is synchronized. If a request for the URL has not been sent previously, the request is forwarded to the HTTP module. This ensures that the crawler doesn't overload any server.
HTTP Protocol Module: sends the request for the document whose URL has been received from the queue. Upon receiving the document, the URL of the downloaded document is stored in the fetched-URL store along with a timestamp, and the document is stored in the document repository. Storing the downloaded URLs avoids redundant crawling, making the crawler more efficient.
The Architecture of Focused Web Crawler
Link Extractor: extracts the links from the documents present in the document repository. The component checks whether each URL is already in the fetched-URL store; if not, the text surrounding the hyperlink (preceding and succeeding it) and the heading or sub-heading under which the link appears are extracted.
Hypertext Analyzer: receives the keywords from the Link Extractor and finds the relevancy of the terms to the search keyword by referring to the taxonomy hierarchy.
Web Crawling
Breadth First Search
Web Crawling
Depth First Search
Web Crawling
Depth First & Breadth First
Depth-first search:
- goes off into one branch until it reaches a leaf node
- not good if the goal node is on another branch
- neither complete nor optimal
- uses much less space than breadth-first: far fewer visited nodes to keep track of, and a smaller fringe
Breadth-first search:
- more careful, checking all alternatives
- complete and optimal
- very memory-intensive
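As a crawling order, the two strategies differ only in which end of the frontier the next URL is taken from, as this sketch over a toy link graph shows:

```python
from collections import deque

# A small link graph: page -> outgoing links.
GRAPH = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def crawl_order(start, depth_first):
    """Return the visit order; only the pop side of the frontier changes."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # DFS pops the newest URL (stack), BFS the oldest (queue).
        page = frontier.pop() if depth_first else frontier.popleft()
        order.append(page)
        for link in GRAPH[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl_order("root", depth_first=False))  # level by level
print(crawl_order("root", depth_first=True))   # one branch at a time
```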
Best-First Search
The Best-First Search algorithm can be implemented to search for the best URL: it explores the URL queue to find the most promising one. The crawling order is determined by a heuristic function f(n). The value of the function is evaluated for each URL present in the queue; the URL with the most optimal value of f(n) is given the highest priority, a priority list is formed, and the URL with the highest priority is fetched next. The optimal URL can be fetched using best-first search, but the memory and time consumed are very high, so more enhanced algorithms have been proposed.
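A best-first frontier is naturally a priority queue. A sketch using Python's heapq, with hypothetical f(n) scores standing in for the heuristic:

```python
import heapq

def best_first_order(scores, seeds, graph):
    """Always expand the URL with the best heuristic score f(n).

    heapq is a min-heap, so scores are negated to pop the highest first.
    """
    frontier = [(-scores[u], u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier:
        _neg_score, url = heapq.heappop(frontier)
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-scores[link], link))
    return order

# Hypothetical f(n) values; higher means more promising.
scores = {"seed": 1.0, "low": 0.2, "high": 0.9, "mid": 0.5}
graph = {"seed": ["low", "high", "mid"]}
print(best_first_order(scores, ["seed"], graph))
# ['seed', 'high', 'mid', 'low']
```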
Fish Search
The Fish Search algorithm can be implemented for dynamic search: when relevant information is found, the search agents continue to look for more information, and in the absence of information they stop. The key principle of the algorithm is as follows. It takes as input a seed URL and a search query, then dynamically builds a priority list of the next URLs. At each step, the first URL is popped from the list and processed. As each document's text becomes available, it is analyzed by a scoring component that evaluates whether it is relevant or irrelevant to the search query (a 1-0 value), and based on that score a heuristic decides whether to pursue exploration in that direction. Whenever a document source is fetched, it is scanned for links. The URLs pointed to by these links (the "children") are each assigned a depth value. If the parent is relevant, the depth of the children is set to some predefined value; otherwise, the depth of the children is set to one less than the depth of the parent. When the depth reaches zero, the direction is dropped and none of its children are inserted into the list. The algorithm is helpful in forming the priority table, but its limitation is that there is very low differentiation among the priorities of the URLs: many documents end up with the same priority. The scoring is also problematic, as it is difficult to assign a precise potential score to documents that have not yet been fetched.
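The depth rule at the heart of fish search can be sketched as follows; INITIAL_DEPTH is the "predefined value" mentioned above, chosen arbitrarily here:

```python
INITIAL_DEPTH = 3  # predefined depth given to children of relevant pages

def child_depth(parent_relevant, parent_depth):
    """Fish-search depth rule for the children of a fetched page."""
    if parent_relevant:
        return INITIAL_DEPTH      # relevant parent: depth is reset
    return parent_depth - 1       # irrelevant parent: depth decays by one

def should_explore(depth):
    """A direction is dropped once its depth reaches zero."""
    return depth > 0

# An irrelevant chain dies after INITIAL_DEPTH irrelevant hops.
depth = INITIAL_DEPTH
hops = 0
while should_explore(depth):
    depth = child_depth(parent_relevant=False, parent_depth=depth)
    hops += 1
print(hops)  # 3
```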
Focused Web Crawling
Shark Search
Shark-Search is an improved version of the Fish-Search algorithm. Fish-Search did a binary evaluation of the URL to be analyzed, so the actual relevance could not be obtained. In Shark-Search, a similarity engine is called which evaluates the relevance of documents to a given query. Such an engine analyzes two documents dynamically and returns a "fuzzy" score, i.e. a score between 0 and 1 (0 for no similarity whatsoever, 1 for a perfect "conceptual" match), rather than a binary value. The similarity algorithm can be seen as orthogonal to the fish-search algorithm. The potential score of an extracted URL can be refined using the meta-information contained in the links to the documents.
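The similarity engine itself is not specified here; as an illustration, even a simple Jaccard word overlap yields a fuzzy score in [0, 1] rather than a binary value:

```python
def fuzzy_similarity(doc, query):
    """Jaccard overlap of word sets: 0.0 for no shared terms,
    1.0 for identical vocabularies, fractional in between."""
    a = set(doc.lower().split())
    b = set(query.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

print(fuzzy_similarity("web crawler indexing", "web crawler indexing"))  # 1.0
print(fuzzy_similarity("cooking recipes", "web crawler"))                # 0.0
print(round(fuzzy_similarity("web crawler design", "web crawler"), 2))   # 0.67
```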
Web Crawling
Abstract to Regional Crawler
Today, web crawlers are unable to update their huge search engine indexes concurrently with the growth of information available on the web. Most of the time they download some unimportant pages and ignore pages whose probability of being searched is noticeable. This sometimes leaves search engines unable to give up-to-date information to users. The regional crawler alleviates the problem of updating and finding new pages to some extent by gathering users' common needs and interests in a certain domain, which can be as small as a LAN in a university department or as large as a country. It crawls the pages containing information about those interests first, instead of crawling the web without any predefined order.
In this method, the crawling strategy is based on users' interests and needs in certain domains. These needs and interests are determined according to common characteristics of the users, such as geographical location, age, membership and job.
Web Crawling
Examples

Major known search engine robots:
- Googlebot, Googlebot-Image, Googlebot-Mobile, Googlebot-Video, Adsbot-Google, Mediapartners-Google, Könguló (Google)
- BingBot/MSNBot, MSRBot (Bing)
- YandexBot (Yandex)
- Baiduspider, Baiduspider-image, Baiduspider-ads (Baidu)
- FAST Crawler (Fast Search & Transfer – Alltheweb)
- Scooter, Mercator (Altavista)
- Slurp, Yahoo-Blogs (Yahoo)
- Gigabot (Gigablast)
- Scrubby (Scrub The Web)
- Robozilla (DMOZ)
- WebCrawler: used to build the first publicly available full-text index of a subset of the web
- Web Fountain: distributed, modular crawler written in C++
- Slug: semantic web crawler
Web Crawling
References
http://www.icasa.nmt.edu
http://www.krishisanskriti.org/acsit.html
https://www.promptcloud.com
https://en.wikipedia.org/wiki/Web_crawler
https://forum.thgtr.com
Web Crawling
Thank you!