A  web crawler  (also known as a  - PowerPoint Presentation

alexa-scheidler
Uploaded On 2018-12-24


Presentation Transcript

Slide1

A web crawler (also known as a web spider or web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. Search engines such as Google and Bing use web crawlers to index newly created data on the Internet.

Web Crawler

Slide2

News crawlers focus on retrieving newly published news data. A news crawler monitors a predefined set of news sources and captures each news item as soon as it is published.

News Crawler

[Figure: Architecture of News Crawler at IITR. Labels: Predefined Set of News Sources, News URL Downloader, Crawl every 30 Min, New URLs, News Database, News Article Downloader, News Articles.]
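The "New URLs" step in the architecture can be illustrated with a small sketch. It assumes the crawler simply remembers every URL it has already seen and, on each crawl cycle, passes on only the URLs that are not yet known. The class name NewUrlFilter and the example.com URLs are hypothetical; the slides do not show how the IITR crawler actually detects new URLs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: remembers URLs seen in earlier crawl cycles. */
class NewUrlFilter {
    private final Set<String> seen = new HashSet<>();

    /** Returns the URLs of this cycle that did not appear in any earlier cycle. */
    List<String> filterNew(List<String> crawledUrls) {
        List<String> fresh = new ArrayList<>();
        for (String url : crawledUrls) {
            // Set.add returns false when the URL is already known.
            if (seen.add(url)) {
                fresh.add(url);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        NewUrlFilter filter = new NewUrlFilter();
        // First crawl cycle: both URLs are new.
        System.out.println(filter.filterNew(
                List.of("http://example.com/a", "http://example.com/b")));
        // Second crawl cycle: only /c has not been seen before.
        System.out.println(filter.filterNew(
                List.of("http://example.com/b", "http://example.com/c")));
    }
}
```

In a running crawler, filterNew could be called from a task scheduled every 30 minutes (for example with ScheduledExecutorService.scheduleAtFixedRate), and the returned URLs handed to the article downloader.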

Slide3

A Simple Java Program for Downloading a Web Page
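The program from this slide is not preserved in the transcript. The following is a minimal sketch of what such a program might look like, using only the Java standard library. The class name DownloadPage, the default address https://example.com/, and the choice of UTF-8 as the page encoding are assumptions, not taken from the slides.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical example: downloads a page and returns its text. */
class DownloadPage {

    /** Reads the resource behind the URL line by line, assuming UTF-8 content. */
    static String download(URL url) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // The address can be passed on the command line; the default is a placeholder.
        String address = args.length > 0 ? args[0] : "https://example.com/";
        System.out.print(download(URI.create(address).toURL()));
    }
}
```

A real crawler would usually also set a user agent and timeouts, and read the character encoding from the HTTP response instead of assuming UTF-8.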

Slide4

Given a web page, we can retrieve its different components by parsing it. Many HTML parsers are available, such as Jsoup, Xerces, and NekoHTML. The following Java program uses the Jsoup parser to extract hyperlinks from a web page.

Parsing a Web Page

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import java.io.IOException;
    import java.io.File;

    public class ExtractLinks {
        public static void main(String[] args) throws IOException {
            // Parse a locally saved HTML file. The third argument is the base URI
            // against which relative links are resolved (left empty here).
            File input = new File("data.html");
            Document doc = Jsoup.parse(input, "UTF-8", "");

            // Select every anchor element that has an href attribute.
            Elements links = doc.select("a[href]");
            System.out.println("Total Number of Links:" + links.size());

            for (Element link : links) {
                // "abs:href" asks Jsoup for the absolute form of the link URL.
                System.out.println(link.attr("abs:href"));
            }
        }
    }

Slide5

There are many APIs available for extracting the main content from web pages, such as the Boilerpipe API. The following Java program demonstrates the use of the Boilerpipe API to extract the article text from a news article.

Retrieving Article Text

    import java.io.PrintWriter;
    import java.net.URL;

    import de.l3s.boilerpipe.BoilerpipeExtractor;
    import de.l3s.boilerpipe.extractors.CommonExtractors;
    import de.l3s.boilerpipe.sax.HTMLHighlighter;

    public class BoilerplateDemo {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://www.thehindu.com/news/national/land-acquisition-ordinance-bill-gets-a-burial/article7597517.ece");
            final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;

            // choose the operation mode (i.e., highlighting or extraction)
            // final HTMLHighlighter hh = HTMLHighlighter.newHighlightingInstance();
            final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();

            // Write the extracted article as HTML to a local file.
            PrintWriter out = new PrintWriter("highlighted.html", "UTF-8");
            out.println(hh.process(url, extractor));
            out.close();

            System.out.println("Now open file highlighted.html in your web browser");
        }
    }

Slide6

Objective: To extract the article content from a given news URL.

News URL: http://www.hindustantimes.com/world-t20/amitabh-bachchan-to-sing-national-anthem-before-india-pakistan-match/story-QXxnQAvmJsisvIYtSFv33L.html

Article Extraction

Article Extractor

Bollywood superstar Amitabh Bachchan will sing the National Anthem before the start of the marquee India-Pakistan World Twenty20 cricket match at the Eden Gardens on March 19.

Bachchan has confirmed the development by retweeting a post in his official Twitter handle while sources in the Cricket Association of Bengal today said this was an effort by its president Sourav Ganguly.

“The president was involved and the plan was on for a long time,” CAB sources said.

While the ‘Big B’ will sing the National Anthem in his signature baritone, Pakistan will also make their presence felt with classical singer Shafaqat Amanat Ali who is slated to sing the Pakistani National Anthem.

Slide7

Add-ons: Noise

Article Extractor

http://timesofindia.indiatimes.com/india/India-became-3rd-largest-economy-in-2011-from-10th-in-2005/articleshow/34416429.cms

Slide8

Article Extraction

Article Extractor

Slide9

Article Extraction

Article Extractor

    String url = "input_url.html";
    String name = "CLASS or ID name";
    // Fetch the page with a 100-second timeout and a browser-like user agent.
    Document doc = Jsoup.connect(url).timeout(100*1000).userAgent("Mozilla").get();
    // Select the article container by its CSS class name.
    String article = doc.getElementsByClass(name).text();

Or, when the article container is identified by an id:

    String article = doc.getElementById(name).text();

Example

    String url = "http://www.dnaindia.com/world/report-pakistan-blast-in-peshawar-bus-kills-at-least-15-govt-employees-over-25-injured-2189902";
    String name = "body-text";
    Document doc = Jsoup.connect(url).timeout(100*1000).userAgent("Mozilla").get();
    String article = doc.getElementsByClass(name).text();

Slide10

Metadata refers to data about data. On news webpages it is typically expressed as key-value pairs, for example:

Key: name = "author"
Value: content = "TCA Sharad Raghavan"

Metadata of News Webpages

Extract Meta-Key Phrase

Slide11

Metadata content of a typical news webpage: title, description, news keywords, author name, last modified date, publishing date, etc.

News websites use various protocols to insert metadata. OGP (Open Graph Protocol) is one of them.

Some of the well-known OGP tags are:

og:title - The title of your object as it should appear within the graph, e.g., "The Rock".
og:type - The type of your object, e.g., "video.movie". Depending on the type you specify, other properties may also be required.
og:image - An image URL which should represent your object within the graph.
og:url - The canonical URL of your object that will be used as its permanent ID in the graph, e.g., "http://www.imdb.com/title/tt0117500/".

Metadata of News Webpages
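To make the OGP tags concrete, here is a minimal sketch that collects og: properties from a page's HTML into a map. It uses only the Java standard library and a regular expression, and it assumes that each tag is written as meta property="..." content="..." with double quotes and in that attribute order. A real HTML parser such as Jsoup would be more robust. The class name OpenGraphTags is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects Open Graph meta tags from raw HTML. */
class OpenGraphTags {
    // Assumes the form <meta property="og:..." content="..."> (property before content).
    private static final Pattern OG_META = Pattern.compile(
            "<meta\\s+property\\s*=\\s*\"(og:[^\"]+)\"\\s+content\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns property/content pairs in document order; the first value wins on duplicates. */
    static Map<String, String> extract(String html) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher m = OG_META.matcher(html);
        while (m.find()) {
            tags.putIfAbsent(m.group(1), m.group(2));
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<head>"
                + "<meta property=\"og:title\" content=\"The Rock\">"
                + "<meta property=\"og:type\" content=\"video.movie\">"
                + "<meta property=\"og:url\" content=\"http://www.imdb.com/title/tt0117500/\">"
                + "</head>";
        extract(html).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```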

Slide12

The Open Graph Protocol (OGP), provided by Facebook, allows web content to be embedded as Facebook social graph objects. It defines tags that web content generators can use to convert web objects into corresponding graph objects.

Open Graph Protocol

Facebook Graph Object

    <meta property="og:description" content="The government seems to have given up on the Give It Up….">
    <meta property="og:title" content="Centre changes tack on LPG subsidy campaign">
    <meta property="og:image" content="http://www.thehindu.com/multimedia/dynamic/02459/LPG_2459323c.jpg">

Slide13

The "news_keywords" tag lists the keywords that are most relevant to the article.

    <meta name="news_keywords" content="LPG subsidy, LPG subsidy campaign, Give It Up Campaign ,economy, business and finance, energy and resource">
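As a rough illustration, the keyword list can be pulled out of this tag and split into individual key phrases. The sketch below uses only the Java standard library and assumes the tag is written with double quotes and with name before content, as in the example above. The class name NewsKeywords is hypothetical, and a proper HTML parser would handle more variations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads the news_keywords meta tag from raw HTML. */
class NewsKeywords {
    // Assumes the form <meta name="news_keywords" content="..."> (name before content).
    private static final Pattern KEYWORDS_META = Pattern.compile(
            "<meta\\s+name\\s*=\\s*\"news_keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns the comma-separated keywords, trimmed, or an empty list if the tag is absent. */
    static List<String> extract(String html) {
        List<String> keywords = new ArrayList<>();
        Matcher m = KEYWORDS_META.matcher(html);
        if (m.find()) {
            for (String part : m.group(1).split(",")) {
                String keyword = part.trim();
                if (!keyword.isEmpty()) {
                    keywords.add(keyword);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String html = "<meta name=\"news_keywords\" content=\"LPG subsidy, "
                + "LPG subsidy campaign, Give It Up Campaign ,economy, "
                + "business and finance, energy and resource\">";
        for (String keyword : extract(html)) {
            System.out.println(keyword);
        }
    }
}
```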

Slide14

Twitter is an online social networking and microblogging service. It enables its registered users to read and send messages of up to 140 characters, known as tweets.

Twitter contains data in the following forms:

Tweet: A message of 140 characters or less.
Follower: A person who has chosen to read your tweets on an ongoing basis.
Reply or @: The @ symbol means you are talking to or about the person.
Retweet or RT: The act of repeating what someone else has tweeted so that your followers can see it.
HashTag or #: A hashtag provides a theme for the tweet that allows all similar tweets to be searched.
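These conventions can be recognised directly in the text of a tweet. The following sketch, which uses only the Java standard library, pulls hashtags and @-mentions out of a tweet and checks for the textual "RT @" retweet prefix. The class name TweetTokens and the sample tweet are hypothetical, and the patterns are deliberately simple (for instance, they would also match the part after @ in an e-mail address).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds hashtags, mentions and the RT prefix in tweet text. */
class TweetTokens {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    private static final Pattern MENTION = Pattern.compile("@(\\w+)");

    static List<String> hashtags(String tweet) {
        return findAll(HASHTAG, tweet);
    }

    static List<String> mentions(String tweet) {
        return findAll(MENTION, tweet);
    }

    /** Treats a tweet that starts with "RT @" as a retweet (a text convention only). */
    static boolean isRetweet(String tweet) {
        return tweet.startsWith("RT @");
    }

    // Collects the first capture group of every match, in order of appearance.
    private static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String tweet = "RT @alice: Great match today #WorldCup #cricket";
        System.out.println("Hashtags: " + hashtags(tweet));
        System.out.println("Mentions: " + mentions(tweet));
        System.out.println("Retweet:  " + isRetweet(tweet));
    }
}
```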

Twitter

Twitter Crawler

Slide15

Twitter

[Figure: Twitter Crawler. Labels: To Follow, Tweet, Reply, Persons Retweeted, Retweets, HashTag.]

Slide16

Data from Twitter can be extracted using either Twitter APIs or R packages.

Twitter APIs: REST API, Streaming API
R packages: twitteR, RTwitterAPI

Data Extraction from Twitter

Slide17

Log in to your Twitter account.
Open https://apps.twitter.com/app/new and create an application.
Generate an access token.
Create a new Java project and include the Twitter4J library from https://dl.dropboxusercontent.com/u/1737239/twitter4j-core-2.2.5.jar

Data Extraction from Twitter using a REST API: Twitter4J

Slide18

Java Code to Extract Tweets related to Query “World Cup”

Slide19

Java Code to Extract Trends from Twitter