Presentation Transcript

77 Massachusetts Ave., Cambridge, MA 02139, USA

The Semantic Web Initiative envisions a Web wherein information is offered free of presentation, allowing more effective exchange and mixing across web sites and across web pages. But without substantial Semantic Web content, that vision cannot be realized. To enable more flexible information access, we have created a web browser extension called Piggy Bank that lets users make use of Semantic Web content within Web content as they browse the Web. Wherever Semantic Web content is not available, Piggy Bank can invoke scrapers to re-structure the information within web pages into Semantic Web form. We have also created Semantic Bank, a web server application that lets Piggy Bank users share the Semantic Web information they have collected, enabling collaborative efforts to build sophisticated Semantic Web information repositories through simple, everyday use.

Introduction

Figure 4. Saved information items reside in "My Piggy Bank." The user can start browsing them in several ways, increasing the chances of re-finding information.

Figure 3. Like del.icio.us, Piggy Bank allows each web page to be tagged with keywords. However, this same tagging mechanism also works for "pure" information items and is indiscriminate against levels of granularity of the information being tagged.

In fact, Piggy Bank has several databases: a permanent "My Piggy Bank" database for storing saved information and several temporary databases, each created to hold information collected from a different source. The Save command causes data to be copied from a temporary database to the permanent database. Commands such as Save, Tag, and Publish are implemented as HTTP POSTs, sent from the generated DHTML-based user interface back to the embedded web server. Tag completion suggestions are supported in the same manner.

Semantic Bank shares a very similar architecture with the Java part of Piggy Bank. Both make use of the same servlet that serves their DHTML-based faceted browsing user interface, and both make use of several profiles for segregating data models.
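The storage layout and command flow described above can be sketched as follows. The actual implementation is Java with an embedded web server and a DHTML front end; this is a hedged Python model, and all class, method, and URI names here are hypothetical.

```python
# Minimal model of Piggy Bank's databases and commands, as described above:
# per-source temporary databases, a permanent "My Piggy Bank" store that the
# Save command copies into, and tagging with prefix-based tag completion.
# All names and data are invented for illustration.

class PiggyBankStore:
    def __init__(self):
        self.permanent = {}   # "My Piggy Bank": saved items keyed by URI
        self.temporary = {}   # one temporary database per source
        self.tags = {}        # item URI -> set of keyword tags

    def collect(self, source, item_uri, item):
        """Hold freshly collected items in a per-source temporary database."""
        self.temporary.setdefault(source, {})[item_uri] = item

    def save(self, source, item_uri):
        """The Save command: copy an item from a temporary database
        into the permanent database."""
        self.permanent[item_uri] = self.temporary[source][item_uri]

    def tag(self, item_uri, keyword):
        """Tagging works uniformly for pages and 'pure' information items."""
        self.tags.setdefault(item_uri, set()).add(keyword)

    def complete_tag(self, prefix):
        """Tag completion: suggest previously used tags matching a prefix."""
        used = set().union(*self.tags.values()) if self.tags else set()
        return sorted(t for t in used if t.startswith(prefix))

store = PiggyBankStore()
store.collect("http://example.org/listing", "urn:item:1", {"title": "A phone number"})
store.save("http://example.org/listing", "urn:item:1")
store.tag("urn:item:1", "contacts")
print(store.permanent["urn:item:1"]["title"])   # A phone number
print(store.complete_tag("con"))                # ['contacts']
```

In the real system each of these operations would arrive as an HTTP POST from the browser-side user interface rather than as a direct method call.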
Semantic Bank gives each of its subscribed members a different profile for persisting data, while it keeps another profile where "published" information from all members gets mixed together. Semantic Bank listens to HTTP POSTs sent by a user's Piggy Bank to upload his/her data. All of the uploaded data goes into that user's profile on the Semantic Bank, and those items marked as public are copied to the common profile. Each published item is also linked to the one or more members of the Semantic Bank who have contributed that item.

Related Work

We will now take a trip back in history to the birth of the World Wide Web and witness that even at that time, ad-hoc solutions were already being suggested to combat the highly flexible but still constraining information model of the Web. When the number of web sites started to accumulate, directories of web sites (e.g., Yahoo! [32]) were compiled to give an overview of the Web. When the number of sites continued to grow, search engines were invented to offer a way to query over all sites simultaneously, substantially reducing concerns about the physical location of information and thereby letting users experience the Web as a whole rather than as loosely connected parts. Although capable of liberating web pages from within web sites, search engines still cannot liberate individual information items (e.g., a single phone number) from within their containing pages. Furthermore, because these third-party search engines do not have direct access to the databases embedded within web sites, they cannot support structured queries based on the schemas in those databases but must resort to indexing the pages' text.

Another invention in the early days of the Web was the web portal, which provided personalizable homepages (e.g., My Netscape [14]). A user of a portal would choose which kinds of information to go on his/her portal homepage and, in doing so, aggregate information to his/her own taste.
Such an aggregation is a one-time costly effort that generates only one dynamic view of information, while aggregation through Piggy Bank happens by lightweight interactions, generating many dynamic views of information through the act of browsing. During the evolution of the web portal, the need for keeping aggregated news articles up-to-date led to the invention of RSS (originally Rich Site Summary) [21], which could encode the content of a web site chronologically, facilitating the aggregation of parts of different sites by date. RSS was the first effort to further reduce the granularity of information consumption on the Web that achieved widespread adoption. RSS feeds are now used by web sites to publish streams of chronologically ordered information for users to consume. RSS was also the first example of a pure-content format, firmly separating the concerns of data production and data consumption and allowing innovative user interfaces to exist (e.g., [16]).

Also early in the history of the World Wide Web came screen scrapers: client-side programs that extract information from within web pages (e.g., stock quotes, weather forecasts) in order to re-render it in ways customized to the needs of individual users. News aggregators (e.g., [8]) juxtaposed fragments ripped out of various news web sites to make up a customized "front page" for each user according to his/her news taste. More recently, client-side tools such as Greasemonkey [9] and Chickenfoot [33] let advanced users themselves prescribe manipulations on elements within web pages, so as to automate tedious tasks or to customize their Web experience.
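The date-based aggregation that RSS enables, described above, can be sketched in a few lines. This is an illustrative Python sketch with invented feed contents; only the RSS 2.0 element names (`channel`, `item`, `title`, `pubDate`) come from the actual format.

```python
# RSS encodes a site's content chronologically, so an aggregator can merge
# parts of different sites by date. Parse two tiny hypothetical RSS 2.0
# feeds and interleave their items newest-first.
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

FEED_A = """<rss version="2.0"><channel><title>Site A</title>
<item><title>A1</title><pubDate>Tue, 01 Mar 2005 09:00:00 GMT</pubDate></item>
<item><title>A2</title><pubDate>Thu, 03 Mar 2005 09:00:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Site B</title>
<item><title>B1</title><pubDate>Wed, 02 Mar 2005 09:00:00 GMT</pubDate></item>
</channel></rss>"""

def items(feed_xml):
    """Yield (publication date, title) pairs from an RSS 2.0 feed string."""
    for item in ET.fromstring(feed_xml).iter("item"):
        yield parsedate_to_datetime(item.findtext("pubDate")), item.findtext("title")

# Aggregate parts of different sites by date, newest first.
merged = sorted(list(items(FEED_A)) + list(items(FEED_B)), reverse=True)
print([title for _, title in merged])   # ['A2', 'B1', 'A1']
```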
Additions to web browsers such as Hunter-Gatherer [40] and Net Snippets [15] let users bookmark fragments within web pages, and Annotea [36] supports annotation on such fragments.

Piggy Bank adopts the scraping strategy, but at a platform level, and also introduces the use of RDF as a common data model wherein results from different scrapers can be mixed, thus allowing for a unified experience over data scraped from different sources by different scrapers. Piggy Bank is capable of storing more than just XPaths [28] pointing to information items, as Hunter-Gatherer [40] does, and it allows users to extract data rather than annotate documents, as Annotea [36] does. Piggy Bank does not rely on heuristics to re-structure information as Thresher [35] does, but rather requires people to write easily distributable scraping code. It is possible to make use of …

Production

On the production side, HTTP [10] natively supports posting of data to a URL, though it leaves the syntax and semantics of that data, as well as how the data is used, to the web server at that URL. Web sites have been employing this mechanism to support lightweight authoring activities, such as providing registration information, rating a product, filling out an online purchase order, signing …

A more sophisticated form of publishing is the Web log, or blog. Originally written by tech-savvy authors in text editors (e.g., [1]), blogs have morphed into automated personal content management systems used by tech-unsavvy people, mostly as online journals or for organizing short articles chronologically. Using RSS technology, blog posts from several authors can be extracted and re-aggregated to form "planets." Unlike blog planets, wikis [27] pool content from several authors together by making them collaborate on the editing of shared documents.
This form of collaborative, incremental authoring, while strongly criticized for its self-regulating nature and generally very low barrier to entry [5], has proven incredibly prolific in the creation of content and at the same time very popular. (Wikipedia [26] is visited more often than the New York Times [2].)

The effectiveness of socially scalable solutions is also evident in the more recent social bookmarking services (e.g., del.icio.us [6]), where content authoring is extremely lightweight (assigning keywords) but the benefit of such authoring effort is amplified when the information is pooled together, giving rise to overall patterns that no one user's data can show.

In adopting Piggy Bank, users immediately gain flexibility in the ways they use existing Web information without ever leaving their familiar web browser. Through the use of Piggy Bank, as they consume Web information, they automatically produce Semantic Web information. Through Semantic Bank, as they publish, the information they have collected merges together smoothly, giving rise to higher-order patterns and structures. This, we believe, is how the Semantic Web might emerge from the Web. In this section, we discuss how the rest of the story might go.

Scraping the Web

Our story is about empowering Web users, giving them control over the information that they encounter. Even in cases where web sites do not publish Semantic Web information directly, users can still extract the data using scrapers. By releasing a platform on which scrapers can be easily installed and used, and through which they can contribute their results to a common data model, we have introduced a means for users to integrate information from multiple sources on the Web at their own choosing.

In this new "scraping ecosystem," there are the end-users who want to extract Semantic Web information, the scraper writers who know how to do so, and the publishers who want to remain in control of their information.
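The idea of scrapers contributing their results to a common data model can be sketched concretely. In RDF, every statement is a (subject, predicate, object) triple, so output from independent scrapers mixes by simple union. This is a hedged Python toy, not Piggy Bank's actual scraper API (which uses JavaScript and RDF serializations); all URIs, property names, and page formats here are invented.

```python
# Two hypothetical scrapers for differently formatted sources each emit
# (subject, predicate, object) triples. Because both use the same data
# model and shared subject URIs, their outputs merge into one graph that
# can be queried uniformly.

def scrape_phone_directory(page_text):
    """Toy scraper: extract "Name: number" lines as RDF-style triples."""
    triples = set()
    for line in page_text.splitlines():
        name, phone = line.split(":")
        subject = "urn:person:" + name.strip().lower()
        triples.add((subject, "ex:name", name.strip()))
        triples.add((subject, "ex:phone", phone.strip()))
    return triples

def scrape_staff_page(page_text):
    """Toy scraper for a second source with "Name, office" lines."""
    triples = set()
    for line in page_text.splitlines():
        name, office = line.split(",")
        subject = "urn:person:" + name.strip().lower()
        triples.add((subject, "ex:office", office.strip()))
    return triples

# Results from different scrapers mix in one graph via set union.
graph = scrape_phone_directory("Alice: 555-0100") | scrape_staff_page("Alice, Room 42")

def describe(graph, subject):
    """Query the merged graph: every property of one resource."""
    return {p: o for s, p, o in graph if s == subject}

print(describe(graph, "urn:person:alice"))
```

The unified experience follows from the shared model: a browser over `graph` need not know which scraper produced which triple.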
We expect that many scraper writers will turn their creativity and expertise to scraping as many sites as they can, so as to liberate the information within.

The explosion of scrapers raises a few questions. Will there be a market where scrapers for the same site compete on the quality of data they produce? Will there be an explosion of ontologies for describing the same domain? How can a user find the "best" scraper for a site? As a possible scenario, a centralized service could host the metadata of scrapers in order to support easy or automatic discovery of scrapers for end-users while allowing scraper writers to coordinate their work. Such a centralized service, however, is a single point of failure and a single target for attack. An alternative is some form of peer-to-peer scraper-sharing network.

Information Wants to Be Free

Our system goes beyond just collecting Semantic Web information: it also enables users to publish the collected information back out to the Web. We expect that the ease with which publishing can be done will encourage people to publish more. This behavior raises a few questions. How can we build our system to encourage observance of copyright laws? How will publishers adapt to this new publishing mechanism? How will copyright laws adapt to the fine-grained nature of the information being redistributed? Is a Semantic Bank responsible for checking for copyright infringement of information published to it? Will scraper writers be held responsible for illegal use of the information their scrapers produce on a massive scale?

In order to remain in control of their information, one might expect publishers to publish Semantic Web information themselves, so as to eliminate the need for scraping their sites.
They might include copyright information in every item they publish and hold Piggy Bank and Semantic Bank responsible for keeping that information intact as the items are moved about.

Perhaps it is in the interest of publishers to publish Semantic Web information not only to retain copyright over their information but also to offer advantages over their competitors. They can claim to publish richer, purer, more standards-compliant, more up-to-date, more coherent, more reliable data that is more usable, more mixable, and more trustworthy. They can offer searching and browsing services directly on their web sites that are more sophisticated than what Piggy Bank can offer. They can even take advantage of this new publishing mechanism to spread their advertisements more easily.

Acknowledgements

This work is conducted by the Simile Project, a collaborative effort between the MIT Libraries, the Haystack group at MIT CSAIL, and the World Wide Web Consortium. We would like to thank Eric Miller, Rob Miller, MacKenzie Smith, Vineet Sinha, the Simile group, the User Interface Design group, and the Haystack group for trying out Piggy Bank and for their valuable feedback on this work. Last but not least, we are indebted to Ben Hyde for having infected us with the idea of a "scraping ecosystem."

References

[1] 9101 -- /News. http://www.w3.org/History/19921103-hypertext/hypertext/WWW/News/9201.html.
[2] Alexa Web Search - Top 500. http://www.alexa.com/site/ds/top_sites?ts_mode=lang&lang=en.
[3] Apache Lucene. http://lucene.apache.org/.
[4] CiteULike: A free online service to organise your academic papers. http://www.citeulike.org/.
[5] Criticism of Wikipedia. http://en.wikipedia.org/wiki/Criticism_of_Wikipedia.
[6] del.icio.us. http://del.icio.us/.
[7] Firefox - Rediscover the web. http://www.mozilla.org/products/firefox/.
[8] Google News. http://news.google.com/.
[9] Greasemonkey. http://greasemonkey.mozdev.org/.
[10] HTTP - Hypertext Transfer Protocol Overview. http://www.w3.org/Protocols/.
[11] Informa: RSS Library for Java. http://informa.sourceforge.net/.
[12] Jetty Java HTTP Servlet Server. http://jetty.mortbay.org/jetty/.
[13] LiveConnect Index. http://www.mozilla.org/js/liveconnect/.
[14] My Netscape. http://my.netscape.com/.
[15] Net Snippets. http://www.netsnippets.com/.
[16] NewsMap. http://www.marumushi.com/apps/newsmap/newsmap.cfm.
[17] openRDF.org - home of Sesame. http://www.openrdf.org/.
[18] Primer - Getting into the semantic web and RDF using N3. http://www.w3.org/2000/10/swap/Primer.
[19] RDF/XML Syntax Specification (Revised). http://www.w3.org/TR/rdf-syntax-grammar/.
[20] Resource Description Framework (RDF) / W3C Semantic Web Activity. http://www.w3.org/RDF/.
[21] RSS 2.0 Specification. http://blogs.law.harvard.edu/tech/rss.
[22] Semantic Web project. http://www.w3.org/2001/sw/.
[23] Velocity. http://jakarta.apache.org/velocity/.
[24] W3C Document Object Model. http://www.w3.org/DOM/.
[25] Welcome to Flickr - Photo Sharing. http://flickr.com/.
[26] Wikipedia. http://www.wikipedia.org/.
[27] Wiki Wiki Web. http://c2.com/cgi/wiki?WikiWikiWeb.
[28] XML Path Language (XPath). http://www.w3.org/TR/xpath.
[29] XML User Interface Language (XUL) Project. http://www.mozilla.org/projects/xul/.
[30] XPCOM. http://www.mozilla.org/projects/xpcom/.
[31] XSL Transformations (XSLT). http://www.w3.org/TR/xslt.
[32] Yahoo!. http://www.yahoo.com/.
[33] Bolin, M., M. Webber, P. Rha, T. Wilson, and R. Miller. Automation and Customization of Rendered Web Pages. Submitted to UIST 2005.
[34] Goodman, D. Dynamic HTML: The Definitive Reference. 2nd ed. O'Reilly & Associates, Inc., 2002.
[35] Hogue, A. and D. Karger. Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web. In Proc. WWW 2005.
[36] Kahan, J., M. Koivunen, E. Prud'Hommeaux, and R. Swick. Annotea: An Open RDF Infrastructure for Shared Web Annotations. In Proc. WWW 2001.
[37] Lansdale, M. The Psychology of Personal Information Management. Applied Ergonomics 19(1),
[38] Malone, T. How Do People Organize Their Desks? Implications for the Design of Office Information Systems. ACM Transactions on Office Information Systems 1(1), 99–112, 1983.
[39] Quan, D. and D. Karger. How to Make a Semantic Web Browser. In Proc. WWW 2004.
[40] schraefel, m.c., Y. Zhu, D. Modjeska, D. Wigdor, and S. Zhao. Hunter Gatherer: Interaction Support for the Creation and Management of Within-Web-Page Collections. In Proc. WWW 2002.
[41] Sinha, V. and D. Karger. Magnet: Supporting Navigation in Semistructured Data Environments. In
[42] Whittaker, S. and C. Sidner. Email Overload: Exploring Personal Information Management of Email.
[43] Yee, P., K. Swearingen, K. Li, and M. Hearst. Faceted Metadata for Image Search and Browsing. In