
ÇUKUROVA UNIVERSITY COMPUTER NETWORKS - PowerPoint Presentation




Presentation Transcript

Slide 1

ÇUKUROVA UNIVERSITY

COMPUTER NETWORKS

Web Crawling
Taner Kapucu
Electrical and Electronics Engineering / 2011514036

Slide 2

Contents

1. Search Engine
2. Web Crawling
3. Crawling Policy
4. Focused Web Crawling
5. Algorithms
6. Examples of Web Crawlers
7. References


Slide 3

Search Engine

Abstract

With the rapid development and increase of global data on the World Wide Web, and with the rapid growth in web users across the globe, an acute need has arisen to improve, modify, or design search algorithms that help in effectively and efficiently finding the specific required data in the huge repository available.

Year    Websites       Internet Users
1998    2,407,067      188,023,930
2015    863,105,652    3,185,996,155

Slide 4

Search Engine

The significance of search engines

The significance and popularity of search engines are increasing day by day.

Surveys show that the use of the internet is growing worldwide, both in servers and in clients. The necessity of an efficient tool for finding what users are looking for makes the significance of search engines even more obvious. Search engines take users' queries as input and produce a sequence of URLs that match the query, according to a rank they calculate for each document on the web.

Slide 5

Search Engine

Search Engine Architecture

[Figure: search engine architecture diagram]

Slide 6

Search Engine

Search Engine Architecture

Search engine major components:

Crawlers: Web search engines and some other sites use web crawling or spidering software to update their own web content or their indexes of other sites' web content. Web crawlers are one of the main components of web search engine systems: they assemble a corpus of web pages and index them, so that users can issue queries against the index and find the web pages that match. Each crawler has different policies, so the pages indexed by the various search engines differ.

The Indexer: processes pages, decides which of them to index, and builds various data structures representing the pages (inverted index, web graph, etc.); the representation differs among search engines. It might also build additional structures (e.g. LSI).

The Query Processor: processes user queries and returns matching answers in an order determined by a ranking algorithm.
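As an illustration (a sketch, not the presentation's own code), the indexer's central data structure, an inverted index, and a simple query intersection might look like this in Python; the toy documents and whitespace tokenizer are simplifying assumptions:

    from collections import defaultdict

    # Inverted index: map each term to the set of document IDs containing it.
    def build_inverted_index(docs):
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    docs = {
        1: "web crawlers download pages",
        2: "search engines index pages and rank them",
    }
    index = build_inverted_index(docs)
    # Query processing: intersect the posting sets of the query terms.
    print(index["pages"] & index["rank"])  # {2}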

Slide 7

Web Crawling

What is Web Crawling?

Web crawling is the process or program used by search engines to download pages from the web for later processing by a search engine, which will index the downloaded pages to provide fast searches.

A web crawler is an internet bot that systematically browses the World Wide Web, typically for the purpose of web indexing: a program or automated script that browses the web in a methodical, automated manner.

A web crawler has many names, such as spider, robot, web agent, wanderer, and offline reader; less used names are ants, bots, and worms.

Slide 8

Web Crawling

Why is a web crawler used?

The internet has a wide expanse of information, and finding relevant information requires an efficient mechanism. Web crawlers provide that capability to the search engine. Because the number of pages on the internet is extremely large, even the largest crawlers fall short of making a complete index.

Slide 9

Web Crawling

How does a web crawler work?

The crawler starts with a list of URLs to visit, called the seeds. As it visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs still to be visited, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

[Figure: Google's web crawler; new URLs can be specified here.]

Slide 10

Web Crawling

Single Web Crawling

[Figure: workflow of a single web crawler]

Slide 11

Web Crawling

Crawling Algorithm

Crawler basic algorithm:
1. Remove a URL from the unvisited URL list.
2. Determine the IP address of its host name.
3. Download the corresponding document.
4. Extract any links contained in it.
5. If an extracted URL is new, add it to the list of unvisited URLs.
6. Process the downloaded document.
7. Go back to step 1.
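A minimal sketch of this loop follows (an illustration, not the presentation's own code; the regex-based link extraction and the seed URL are simplifying assumptions, and DNS resolution happens implicitly inside urlopen):

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def process(html):
        pass  # step 6: indexing would happen here

    def crawl(seed, max_pages=10):
        frontier = deque([seed])                  # unvisited URL list
        visited = set()
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()              # step 1
            if url in visited:
                continue
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                          # steps 2-3 failed; skip
            visited.add(url)
            for href in re.findall(r'href="([^"#]+)"', html):   # step 4
                link = urljoin(url, href)
                if link.startswith("http") and link not in visited:
                    frontier.append(link)         # step 5
            process(html)
        return visited

    crawl("https://example.com/")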

Slide 12

Web Crawling

Robot Exclusion

How to control those robots!

If webmasters wish to restrict the information on their site that is available to Googlebot or another well-behaved spider, they can do so with the appropriate directives. There are two components:
- Robots Exclusion Protocol (robots.txt): a site-wide specification of excluded directories.
- Robots META tag: a per-document tag in the web page to exclude indexing or following links, e.g. <meta name="Googlebot" content="nofollow" />.

Spider: downloads the robots.txt file and the other pages that are requested by the crawler manager and permitted by the web authors. The robots.txt files are sent to the crawler manager for processing and extraction of the URLs; the other downloaded files are sent to a central indexer.
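Python's standard library includes a parser for the Robots Exclusion Protocol; a small usage sketch (the site URL and user agent here are illustrative):

    from urllib import robotparser

    # Fetch and parse the site-wide robots.txt, then ask whether a given
    # user agent may fetch a given URL.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    if rp.can_fetch("Googlebot", "https://example.com/private/page.html"):
        print("allowed to crawl")
    else:
        print("excluded by robots.txt")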

Slide 13

Structure of Web Crawler

[Figure: structure of a web crawler]

Slide 14

Web Crawling

A Web Crawler Design

Any crawler must address the following two issues:
1. It must have a good crawling strategy.
2. It must have a highly optimized system architecture that can download a large number of pages per second.

Most search engines use more than one crawler and manage them in a distributed manner, which has the following benefits:
- increased resource utilization,
- effective distribution of crawling tasks with no bottlenecks,
- configurability of the crawling tasks.
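One common way to distribute crawling tasks (an assumption for illustration; the slides do not specify a scheme) is to partition URLs among the crawlers by hashing the host name, so that each site is owned by exactly one crawler and no bottleneck forms:

    from urllib.parse import urlsplit
    import hashlib

    NUM_CRAWLERS = 4

    # Assign each URL to a crawler by hashing its host, so all URLs of one
    # site go to the same crawler.
    def assign_crawler(url):
        host = urlsplit(url).netloc
        digest = hashlib.md5(host.encode()).hexdigest()
        return int(digest, 16) % NUM_CRAWLERS

    print(assign_crawler("https://example.com/a"))  # same crawler as below
    print(assign_crawler("https://example.com/b"))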

Slide 15

Search Engine

What is WebCrawler?

WebCrawler is a registered trademark of InfoSpace, Inc. The search engine went live on April 20, 1994 and was created by Brian Pinkerton at the University of Washington. WebCrawler is a metasearch engine that blends the top search results from Google Search and Yahoo!. It was the first web search engine to provide full-text search. WebCrawler also gives users the option to search for images, audio, video, news, yellow pages, and white pages. It formerly fetched results from Bing Search (formerly MSN Search and Live Search), Ask.com, About.com, MIVA, and LookSmart.

Slide 16

Search Engine

Meta Search Engine

WebCrawler, Dogpile, MetaCrawler

Slide 17

Web Crawling

Crawling Policy

The behavior of a web crawler is the outcome of a combination of policies:
- a selection policy, which states which pages to download;
- a re-visit policy, which states when to check for changes to the pages;
- a politeness policy, which states how to avoid overloading websites; and
- a parallelization policy, which states how to coordinate distributed web crawlers.
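As an illustration of a politeness policy (a sketch; the fixed per-host delay is an assumption, since the slides do not give one), a crawler can record the last request time per host and wait before hitting the same host again:

    import time
    from urllib.parse import urlsplit

    CRAWL_DELAY = 2.0   # assumed minimum seconds between hits to one host
    last_hit = {}

    def polite_wait(url):
        host = urlsplit(url).netloc
        last = last_hit.get(host)
        if last is not None:
            elapsed = time.monotonic() - last
            if elapsed < CRAWL_DELAY:
                time.sleep(CRAWL_DELAY - elapsed)
        last_hit[host] = time.monotonic()

    polite_wait("https://example.com/page1")
    polite_wait("https://example.com/page2")  # sleeps about 2 s first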

Slide 18

Web Crawling

History

Web crawling is the process by which systems gather pages from web resources in order to index them and support a search engine that serves user queries. The primary objective of crawling is to gather, quickly, effectively, and efficiently, as many useful web pages as possible, together with the link structure that interconnects them, and to provide the search results to the users requesting them.

Slide 19

Web Crawling

The Universal Crawlers

Universal crawlers support universal search engines. A universal crawler browses the World Wide Web in a methodical, automated manner and creates an index of the documents that it accesses. It first downloads a website, then goes through the HTML, finds the link tags, and retrieves the outside links; when it finds a link tag, it adds the link to the list of links it plans to crawl. Because the universal crawler crawls all the pages it finds, a huge crawl cost is incurred over the many queries from users, so the universal crawler is comparatively expensive.

Slide 20

The Preferential Crawlers

The preferential crawlers are topic-based crawlers. They are selective about web pages: they are built to retrieve pages within a certain topic. Focused web crawlers and topical crawlers are types of preferential crawlers that search for pages related to a specific topic and download only a predefined subset of web pages from the entire web.

A variety of algorithms can be implemented in a focused web crawler to improve its efficiency and performance, including:
- breadth-first search
- depth-first search
- De Bra's fish search
- shark search
- link structure analysis
- the PageRank algorithm
- the HITS algorithm
- several machine learning algorithms

Slide 21

Focused Web Crawlers

The adaptive topical crawlers and the static crawlers are types of topical crawlers. If a focused web crawler includes learning methods in order to adapt its behavior during the crawl to the particular environment and its relationships with the given input parameters (e.g. the set of retrieved pages and the user-defined topic), the crawler is called adaptive. Static crawlers are simple crawlers that do not adapt to the environment they are given.

Slide 22

Challenges in Web Crawling

The major issues in web crawling faced globally are as follows:
- collaborative web crawling
- crawling the large repository on the web
- crawling multimedia data
- execution time
- scale: millions of pages on the web
- no central control over web pages
- the large number of web pages emerging daily

Various mechanisms and alternatives have been designed to address these challenges. The rate at which the data on the web is booming makes web crawling an ever more challenging and uphill task; web crawling algorithms need to constantly keep up with users' demands and the growing data on the web.

Slide 23

Overview of Focused Web Crawler

The focused web crawler solves the problem of information retrieval from the huge content of the web more effectively and efficiently.

A focused web crawler is a hypertext system that seeks, acquires, indexes, and maintains pages on a specific set of topics that represent a relatively narrow segment of the web. It entails a very small investment in hardware and network resources and yet achieves respectable coverage at a rapid rate, simply because there is relatively little to do. Thus, web content can be managed by a distributed team of focused web crawlers, each specializing in one or a few topics. By effectively prioritizing the crawl frontier and managing the exploration of hyperlinks, a focused crawler also helps keep the crawl more up to date.

Slide 24

Structure of Focused Web Crawler

[Figure: structure of a focused web crawler]

Slide 25

The Architecture of Focused Web Crawler

This type of web crawler makes indexing more effective, ultimately helping to achieve the basic requirement of faster and more relevant retrieval of information from the large repository of the World Wide Web. Many search engines have started using this technique to give users a richer experience while accessing web content, ultimately increasing their hit counts. The various components of the system, their inputs and outputs, their functionality, and their innovative aspects are described below.

Slide 26

Focused Web Crawling

The Architecture of Focused Web Crawler

Seed detector: determines the seed URLs for the specific keyword by retrieving the first n URLs. The seed pages are detected and assigned a priority based on the PageRank algorithm, the HITS algorithm, or a similar algorithm.

Crawler Manager: downloads the documents from the global network. The URLs in the URL repository are fetched and assigned to the buffer in the crawler manager. The URL buffer is a priority queue; based on its size, the crawler manager dynamically creates crawler instances, which download the documents. For more efficiency, the crawler manager can create a crawler pool. The manager is also responsible for limiting the speed of the crawlers and balancing the load among them, which is achieved by monitoring the crawlers.
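A minimal sketch of the PageRank iteration a seed detector could rely on (the tiny link graph, damping factor, and iteration count are illustrative assumptions):

    # Iteratively propagate rank along links until the scores settle.
    def pagerank(graph, damping=0.85, iters=50):
        n = len(graph)
        rank = {page: 1.0 / n for page in graph}
        for _ in range(iters):
            new = {page: (1 - damping) / n for page in graph}
            for page, links in graph.items():
                share = damping * rank[page] / max(len(links), 1)
                for target in links:
                    new[target] += share
            rank = new
        return rank

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(graph))  # "c" ranks highest: it has two in-links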

Slide 27

The Architecture of Focused Web Crawler

Crawler: the crawler is a multi-threaded Java program capable of downloading web pages from the web and storing the documents in the document repository. Each crawler has its own queue, which holds the list of URLs to be crawled. The crawler fetches a URL from the queue and checks whether this crawler or another crawler has already sent a request to the same server; if so, sending another request would overload the server, which is still busy completing the requests already sent and awaiting responses. Access to each server is therefore synchronized. If a request for the URL has not been sent previously, the request is forwarded to the HTTP module. This ensures that the crawler does not overload any server.

HTTP Protocol Module: sends the request for the document whose URL has been received from the queue. Upon receiving the document, the URL of the downloaded document is stored in the URL-fetched store along with a timestamp, and the document is stored in the document repository. Storing the downloaded URLs avoids redundant crawling, making the crawler more efficient.

Slide 28

The Architecture of Focused Web Crawler

Link Extractor: extracts the links from the documents present in the document repository. The component checks whether each URL is already in the URL-fetched store; if it is not found there, the surrounding text preceding and succeeding the hyperlink, and the heading or sub-heading under which the link appears, are extracted.

Hypertext Analyzer: receives the keywords from the link extractor and determines the relevancy of the terms to the search keyword by referring to the taxonomy hierarchy.
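A sketch of the link extractor's core task using Python's standard HTML parser, collecting each hyperlink together with its anchor text (the surrounding-text and heading extraction described above is omitted for brevity):

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect (url, anchor text) pairs from an HTML document."""

        def __init__(self):
            super().__init__()
            self.links = []
            self._href = None
            self._text = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href:
                self.links.append((self._href, "".join(self._text).strip()))
                self._href = None

    parser = LinkExtractor()
    parser.feed('<p>See <a href="/crawl.html">focused crawling</a>.</p>')
    print(parser.links)  # [('/crawl.html', 'focused crawling')]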

Slide 29

Web Crawling

Breadth First Search

[Figure: breadth-first traversal order]

Slide 30

Web Crawling

Depth First Search

[Figure: depth-first traversal order]

Slide 31

Web Crawling

Depth First & Breadth First

Depth-first:
- goes off into one branch until it reaches a leaf node
- not good if the goal node is on another branch
- neither complete nor optimal
- uses much less space than breadth-first: far fewer visited nodes to keep track of, and a smaller fringe

Breadth-first:
- is more careful, checking all alternatives
- complete and optimal
- very memory-intensive
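The two orders differ only in how the frontier is managed, as this sketch shows (the toy link graph is an illustrative assumption): a FIFO queue gives breadth-first, a LIFO stack gives depth-first.

    from collections import deque

    def traverse(graph, start, breadth_first=True):
        frontier = deque([start])
        visited = []
        while frontier:
            # FIFO queue for breadth-first, LIFO stack for depth-first.
            node = frontier.popleft() if breadth_first else frontier.pop()
            if node in visited:
                continue
            visited.append(node)
            frontier.extend(graph.get(node, []))
        return visited

    graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
    print(traverse(graph, "A", breadth_first=True))   # A B C D E
    print(traverse(graph, "A", breadth_first=False))  # A C E B D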

Slide 32

Best-First Search

The best-first search algorithm can be implemented to search for the best URL: it explores the URL queue to find the most promising URL. The crawling order is determined by a heuristic function f(n). The value of the function is evaluated for each URL present in the queue, the URL with the most optimal value of f(n) is given the highest priority, a priority list is formed, and the URL with the highest priority is fetched next. The optimal URL can be fetched using best-first search, but the memory and time consumed are very high, so more enhanced algorithms have been proposed.
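A sketch of a best-first frontier as a priority queue; the keyword-counting heuristic f(n) over anchor text is an illustrative assumption, not the presentation's own scoring function:

    import heapq

    TOPIC = {"crawler", "search", "index"}   # assumed topic keywords

    # f(n): count topic keywords in the anchor text of the link to n.
    def f(anchor_text):
        return sum(word in TOPIC for word in anchor_text.lower().split())

    frontier = []
    for url, anchor in [
        ("https://example.com/a", "cooking recipes"),
        ("https://example.com/b", "web crawler and search index design"),
    ]:
        # heapq is a min-heap, so negate the score to pop the best first.
        heapq.heappush(frontier, (-f(anchor), url))

    score, url = heapq.heappop(frontier)
    print(url)  # the most promising URL: https://example.com/b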

Slide 33

Fish Search

The fish-search algorithm can be implemented for dynamic search: when relevant information is found, the search agents continue to look for more information, and in the absence of information they stop. The key principle of the algorithm is as follows. It takes as input a seed URL and a search query, then dynamically builds a priority list of the next URLs. At each step, the first URL is popped from the list and processed. As each document's text becomes available, it is analyzed by a scoring component that evaluates whether it is relevant or irrelevant to the search query (a 1-0 value), and based on that score a heuristic decides whether to pursue exploration in that direction.

Whenever a document source is fetched, it is scanned for links. The URLs pointed to by these links (denoted as "children") are each assigned a depth value. If the parent is relevant, the depth of the children is set to some predefined value; otherwise, the depth of the children is set to one less than the depth of the parent. When the depth reaches zero, the direction is dropped and none of its children are inserted into the list.

The algorithm is helpful in forming the priority table, but its limitation is that there is very little differentiation among the priorities of the URLs: many documents receive the same priority. The scoring capability is also problematic, as it is difficult to assign a precise potential score to documents that have not yet been fetched.
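A sketch of the depth rule described above (relevant() and links() are hypothetical stand-ins for the scoring component and the link scanner):

    from collections import deque

    MAX_DEPTH = 3   # the predefined depth given to children of relevant pages

    def fish_search(seed, relevant, links):
        frontier = deque([(seed, MAX_DEPTH)])
        seen = {seed}
        while frontier:
            url, depth = frontier.popleft()
            # Relevant parents reset their children's depth; irrelevant
            # parents pass down depth - 1.
            child_depth = MAX_DEPTH if relevant(url) else depth - 1
            if child_depth <= 0:
                continue   # depth exhausted: drop this direction
            for child in links(url):
                if child not in seen:
                    seen.add(child)
                    frontier.append((child, child_depth))

    # Toy run on a two-level graph:
    fish_search(
        "seed",
        relevant=lambda url: "topic" in url,
        links=lambda url: ["topic-page", "other"] if url == "seed" else [],
    )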

Slide 34

Focused Web Crawling

Shark Search

Shark-search is an improved version of the fish-search algorithm. Fish-search performs a binary evaluation of the URL being analyzed, so the actual relevance cannot be obtained. In shark-search, a similarity engine is called that evaluates the relevance of documents to a given query. Such an engine analyzes two documents dynamically and returns a "fuzzy" score, i.e., a score between 0 and 1 (0 for no similarity whatsoever, 1 for a perfect "conceptual" match), rather than a binary value. The similarity algorithm can be seen as orthogonal to the fish-search algorithm. The potential score of an extracted URL can be refined using the meta-information contained in the links to the documents.
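One common choice for such a fuzzy score is cosine similarity over term counts, sketched below (an assumption; the slides do not fix a particular similarity engine):

    from collections import Counter
    from math import sqrt

    # Return a score in [0, 1]: 0 for no shared terms, 1 for identical
    # term distributions.
    def cosine_similarity(a, b):
        va = Counter(a.lower().split())
        vb = Counter(b.lower().split())
        dot = sum(va[t] * vb[t] for t in va)
        norm = sqrt(sum(v * v for v in va.values())) * \
               sqrt(sum(v * v for v in vb.values()))
        return dot / norm if norm else 0.0

    print(cosine_similarity("focused web crawler", "web crawler design"))  # ~0.67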

Slide 35

Web Crawling

Abstract to Regional Crawler

Today, web crawlers are unable to update their huge search engine indexes concurrently with the growth of the information available on the web. Much of the time they download unimportant pages and ignore pages whose probability of being searched is noticeable; this sometimes leaves search engines unable to give up-to-date information to their users. The regional crawler eases the problem of updating and finding new pages to some extent by gathering users' common needs and interests in a certain domain, which can be as small as a LAN in a university department or as large as a country. It crawls the pages containing information about these interests first, instead of crawling the web without any predefined order.

In this method, the crawling strategy is based on users' interests and needs in certain domains. These needs and interests are determined according to common characteristics of the users, such as geographical location, age, membership, and job.

Slide 36

Web Crawling

Examples

Main known search engine robots:
- Googlebot, Googlebot-Image, Googlebot-Mobile, Googlebot-Video, Adsbot-Google, Mediapartners-Google, Könguló (Google)
- BingBot/MSNBot, MSRBot (Bing)
- YandexBot (Yandex)
- Baiduspider, Baiduspider-image, Baiduspider-ads (Baidu)
- FAST Crawler (Fast Search & Transfer, Alltheweb)
- Scooter, Mercator (AltaVista)
- Slurp, Yahoo-Blogs (Yahoo)
- Gigabot (Gigablast)
- Scrubby (Scrub The Web)
- Robozilla (DMOZ)
- WebCrawler: used to build the first publicly available full-text index of a subset of the web
- WebFountain: distributed, modular crawler written in C++
- Slug: semantic web crawler

Slide 37

Web Crawling

References

http://www.icasa.nmt.edu
http://www.krishisanskriti.org/acsit.html
https://www.promptcloud.com
https://en.wikipedia.org/wiki/Web_crawler
https://forum.thgtr.com

Slide 38

Web Crawling

Thank you!
