BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs
Olivier Blanvillain¹, Nikos Kasioumis², Vangelis Banos³
¹École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
²European Organization for Nuclear Research (CERN), 1211 Geneva 23, Switzerland
³Department of Informatics, Aristotle University, Thessaloniki 54124, Greece
Contents
- Introduction: the disappearing web, blog archiving
- Our contributions
- Algorithms: motivation, blog content extraction, extraction rules, variations for authors, dates and comments
- System architecture
- Evaluation: comparison with three web article extraction systems
- Issues and future work
The disappearing web
Source: http://gigaom.com/2012/09/19/the-disappearing-web-information-decay-is-eating-away-our-history/
Blog archiving
Why archive the web? Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public.
Blog archiving is a special case of web archiving:
- The blogosphere is a live record of contemporary society, culture, science and economy.
- Some blogs contain unique data and valuable information.
- Users take action and make important decisions based on this information.
We have a responsibility to preserve the web.
[Figure: Overview of the BlogForever platform, an FP7 EC funded project (http://blogforever.eu/). Harvesting: blog crawlers with real-time monitoring, an HTML data extraction engine and spam filtering turn unstructured information into original data and XML metadata. Preserving: a blog digital repository provides digital preservation, quality assurance and collections curation. Managing and reusing: public access APIs, personalised services and information retrieval behind a public web interface for browsing, searching and exporting.]
Our Contributions
- A web crawler capable of extracting blog articles, authors, publication dates and comments.
- A new algorithm to build extraction rules from blog web feeds, with linear time complexity.
- Applications of the algorithm to extract authors, publication dates and comments.
- A new web crawler architecture, including how we use a complete web browser to render JavaScript web pages before processing them.
- Extensive evaluation of the content extraction and execution time of our algorithm against three state-of-the-art web article extraction algorithms.
Motivation
Extracting metadata and content from HTML documents is a challenging task:
- Web standards usage is low (<0.5% of websites).
- More than 95% of websites do not pass HTML validation.
Having blogs as our target websites, we made the following observations, which play a central role in the extraction process:
- Blogs provide web feeds: structured and standardized XML views of the latest posts of a blog.
- Posts of the same blog share a similar HTML structure.
- Web feeds usually reference 10-20 posts, whereas blogs contain many more, so we have to access more posts than the ones referenced in the web feeds.
Content Extraction Overview
- Use blog web feeds and the referenced HTML pages as training data to build extraction rules.
- Extraction rules are capable of locating in an HTML page all the elements referenced by the RSS feed, such as the title, author, description and publication date.
- Use the resulting extraction rules to process all blog pages.
Locate in the HTML page all RSS-referenced elements
Generic procedure to build extraction rules
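The procedure itself was presented on the slide; below is a minimal Python sketch of the idea, assuming lxml for HTML parsing. The function and variable names are illustrative, not the paper's identifiers: for each training pair (a post's HTML page and the corresponding value from the feed, e.g. the title), every node's XPath is treated as a candidate rule and scored against the feed value, and the best-scoring rule wins.

```python
from collections import defaultdict
from lxml import html

def build_extraction_rule(training_pairs, score):
    """training_pairs: list of (page_html, feed_value) tuples, e.g. a
    post's HTML and its title string taken from the blog's RSS feed.
    score: a string similarity function, such as the Sorensen-Dice
    coefficient of the next slide. Returns the best-scoring XPath."""
    totals = defaultdict(float)
    for page_html, feed_value in training_pairs:
        tree = html.fromstring(page_html).getroottree()
        for node in tree.iter():
            if not isinstance(node.tag, str):
                continue                      # skip comments and PIs
            rule = tree.getpath(node)         # the node's XPath = a rule
            totals[rule] += score(node.text_content(), feed_value)
    return max(totals, key=totals.get)
```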
Extraction rules and string similarity
- Rules are XPath queries.
- For each rule, we compute a score based on string similarity.
- The choice of ScoreFunction greatly influences the running time and precision of the extraction process.
We chose the Sørensen–Dice coefficient similarity because it:
- has low sensitivity to word ordering and length variations,
- runs in linear time.
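For illustration, a minimal Python implementation of the Sørensen–Dice bigram similarity described above (our naming; the paper's implementation may differ):

```python
from collections import Counter

def bigrams(s):
    """Multiset of character bigrams of s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def dice_similarity(a, b):
    """Sorensen-Dice coefficient 2|A ∩ B| / (|A| + |B|) over bigram
    multisets: insensitive to word order, linear in len(a) + len(b)."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0
    return 2.0 * sum((ba & bb).values()) / total
```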
Example: the best extraction rule for the blog post title
RSS feed: http://vbanos.gr/en/feed/
Task: find the RSS blog post title "volumelaser.eim.gr" in the HTML page http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/

XPath                                                HTML Element Value                     Similarity Score
/body/div[@id="page"]/header/h1                      volumelaser.eim.gr                     100%
/body/div[@id="page"]/div[@class="entry-code"]/p/a   http://volumelaser.eim.gr/             80%
/head/title                                          volumelaser.eim.gr | Βαγγέλης Μπάνος   66%
...                                                  ...                                    ...

The best extraction rule for the blog post title is: /body/div[@id="page"]/header/h1
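As a sanity check against the table, the dice_similarity sketch above reproduces the second row: the post URL contains the title as a substring, and the bigram overlap lands at roughly the table's 80%.

```python
score = dice_similarity("http://volumelaser.eim.gr/", "volumelaser.eim.gr")
assert 0.75 < score < 0.85   # 2*17 / (25+17) ≈ 0.81, shown rounded as 80%
```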
Time complexity and linear reformulation
A naive implementation recomputes the bigrams of every node's full text content from scratch, so nodes near the root repeat work already done for their descendants. The linear reformulation avoids this:
- Post-order traversal of the HTML tree.
- Compute node bigrams from their children's bigrams.
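A sketch of the bottom-up computation in Python, simplifying by ignoring attribute and tail text: each node returns its bigram multiset together with its first and last characters, so a parent can merge its children's multisets and only compute fresh the bigrams that span the boundaries. (The merge shown is the simple one; the paper's exact bookkeeping for strict linearity may differ.)

```python
from collections import Counter

def bigram_info(node):
    """Post-order traversal: returns (bigram Counter, first char,
    last char) of the text under `node`, assuming that text is the
    node's own text followed by its children's texts."""
    text = node.text or ""
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    first, last = text[:1], text[-1:]
    for child in node:
        child_bigrams, child_first, child_last = bigram_info(child)
        if last and child_first:
            bigrams[last + child_first] += 1   # bigram across the join
        bigrams.update(child_bigrams)          # merge child counts
        first = first or child_first
        last = child_last or last
    return bigrams, first, last
```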
Variations for authors, dates and comments
Authors, dates and comments are special cases, as they may appear many times throughout a post. To resolve this, we add an extra component to the score function:
- For authors: an HTML tree distance between the evaluated node and the post content node.
- For dates: in addition to the HTML tree distance, we check the alternative formats of each date. Example: "1970-01-01" == "January 1 1970".
- For comments: we use the dedicated comment RSS feed.
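For the date check, a minimal sketch of format-insensitive comparison; the python-dateutil dependency is our assumption, as the paper does not name a library:

```python
from dateutil import parser

def same_date(a, b):
    """True when two date strings denote the same calendar date,
    regardless of their formatting."""
    try:
        return parser.parse(a).date() == parser.parse(b).date()
    except (ValueError, OverflowError):
        return False

assert same_date("1970-01-01", "January 1 1970")   # the example above
```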
System Architecture
Our crawler is built on top of Scrapy (http://www.scrapy.org/).
System Architecture
Pipeline of operations:
1. Render HTML and JavaScript,
2. Extract content,
3. Extract comments,
4. Download multimedia files,
5. Propagate the resulting records to the back-end.
Interesting areas:
- Blog post page identification,
- Handling blogs with a large number of pages,
- JavaScript rendering,
- Scalability.
Blog post identification
The crawler visits all blog pages and needs to identify, for each URL, whether or not it points to a blog post. We construct a regular expression from the blog post URLs listed in the RSS feed to identify blog posts (see the sketch below):
- We assume that all posts from the same blog use the same URL pattern.
- This assumption has held for all blog platforms we have encountered.
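A minimal sketch of one way to derive such a pattern; the segment-wise generalization (digit segments become \d+, the trailing slug becomes [^/]+) is our simplification for illustration, not necessarily the paper's exact procedure:

```python
import re

def build_post_url_regex(feed_post_urls):
    """Generalize the post URLs listed in the blog's RSS feed into a
    single regular expression matching same-shaped URLs."""
    patterns = set()
    for url in feed_post_urls:
        parts = url.rstrip("/").split("/")
        generalized = [r"\d+" if p.isdigit() else re.escape(p)
                       for p in parts[:-1]] + [r"[^/]+"]
        patterns.add("/".join(generalized) + "/?")
    return re.compile("^(?:%s)$" % "|".join(patterns))

regex = build_post_url_regex(
    ["http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/"])
assert regex.match("http://vbanos.gr/blog/2015/01/01/another-post/")
```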
Handling blogs with a large number of pages
- Avoid a random walk of the pages, depth-first search or breadth-first search.
- Use a priority queue with machine-learning-defined priorities: pages containing many blog post URLs get a higher priority.
- Use a distance-weighted kNN classifier to predict priorities (sketched below).
- Whenever a new page is downloaded, it is given to the machine learning system as training data.
- When the crawler encounters a new URL, it asks the machine learning system for the expected number of blog posts behind that URL and uses the value as the URL's download priority.
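A minimal sketch of the priority model, using scikit-learn's distance-weighted k-nearest-neighbours regressor; the library choice, class name and the idea of numeric page features (e.g. URL depth and token counts) are our assumptions for illustration:

```python
from sklearn.neighbors import KNeighborsRegressor

class DownloadPriorityModel:
    """Predicts how many blog post links a URL is likely to lead to."""

    def __init__(self, k=5):
        self.X, self.y = [], []   # page feature vectors -> #post links
        self.k = k
        self.model = KNeighborsRegressor(n_neighbors=k, weights="distance")

    def learn(self, features, n_post_links):
        """Training step, called for every downloaded page."""
        self.X.append(features)
        self.y.append(n_post_links)
        if len(self.X) >= self.k:
            self.model.fit(self.X, self.y)

    def priority(self, features):
        """Download priority for a newly encountered URL."""
        if len(self.X) < self.k:
            return 0.0            # not enough training data yet
        return float(self.model.predict([features])[0])
```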
JavaScript rendering
- JavaScript is a widely used client-side language; traditional HTML-based crawlers do not see the content of web pages built with JavaScript.
- We embed PhantomJS, a headless web browser with great performance and scripting capabilities.
- We instruct the PhantomJS browser to click dynamic JavaScript pagination buttons on pages to retrieve more content (e.g. the Disqus "Show More" button that reveals additional comments); a sketch follows below.
- This crawler functionality is non-generic and requires human intervention to maintain and to extend to other cases.
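As an illustration, a hedged sketch of driving PhantomJS from Python through Selenium's WebDriver API (Selenium 3 era; webdriver.PhantomJS was removed in Selenium 4). The paper does not specify its integration, and the Disqus button selector is an assumption:

```python
import time
from selenium import webdriver

driver = webdriver.PhantomJS()   # requires the phantomjs binary on PATH
driver.get("http://example.com/a-post-with-disqus-comments/")

# Keep clicking the "Show More" pagination button while it is visible,
# so the rendered DOM contains all comments before extraction.
while True:
    buttons = driver.find_elements_by_css_selector("a.load-more__button")
    if not buttons or not buttons[0].is_displayed():
        break
    buttons[0].click()
    time.sleep(1)                # crude wait for the next comment batch

rendered_html = driver.page_source   # hand this to the extraction rules
driver.quit()
```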
Scalability
When aiming to work with a large amount of input, it is crucial to build every system layer with scalability in mind:
- The two core crawler procedures, NewCrawl and UpdateCrawl, are stateless and purely functional.
- All shared mutable state is delegated to the back-end.
Evaluation
Task: extract articles and titles from web pages.
Comparison against three open-source projects:
- Readability (JavaScript),
- Boilerpipe (Java),
- Goose (Scala).
Criteria:
- extraction success rate,
- running time.
Dataset: 2300 blog posts from 230 blogs, obtained from the Spinn3r dataset.
System: Debian Linux 7.2, Intel Core i7-3770 @ 3.4 GHz.
All data, scripts and instructions to reproduce the results are available at: https://github.com/OlivierBlanvillain/blogforever-crawler-publication
Evaluation: Extraction success rates
Note that BlogForever Crawler's competitors are generic:
- They do not use RSS feeds.
- They do not use structural similarities between web pages.
- They can be used with any HTML page.
Evaluation: Running time
Our approach spends the majority of its total running time between the initialisation and the processing of the first blog post.
Issues & Future Work
Our main causes of failure were:
- the insufficient quality of web feeds,
- the high structural variation of blog pages within the same blog.
Future work:
- Investigate hybrid extraction algorithms that combine our approach with other techniques, such as word density or spatial reasoning.
- Large-scale deployment of the software in a distributed architecture.
Thank you!
BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs
Olivier Blanvillain¹, Nikos Kasioumis², Vangelis Banos³
¹École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
²European Organization for Nuclear Research (CERN), 1211 Geneva 23, Switzerland
³Department of Informatics, Aristotle University, Thessaloniki 54124, Greece
Contact email: vbanos@gmail.com
Project code available at: https://github.com/BlogForever/crawler