Slide1
UFCEKG-20-2 Data, Schemas & Applications
Lecture 3
Data Representation, XML & RSS
Slide2
Last week: introduction to the web
URI schemes & encoding
HTTP protocol
media types
request / response cycle
get, post, put and delete
introduction to mashups
simple mashup example with forms
Slide3
WWW: definition
The World Wide Web (abbreviated as WWW or W3, commonly known as the Web) is a system of interlinked hypertext documents accessed via the Internet. With a web browser, one can view web pages that may contain text, images, videos, and other multimedia, and navigate between them via hyperlinks.
Wikipedia: World Wide Web
Concept originally proposed by Sir Tim Berners-Lee (1989) based on earlier hypertext systems.
Berners-Lee and Belgian computer scientist Robert Cailliau proposed in 1990 to use hypertext "to link and access information of various kinds as a web of nodes in which the user can browse at will", and they publicly introduced the project in December of the same year.
Slide4
Problem: How to encode data for communication
Bank of America Market Data Mirrors
Competing constraints
Data must be serialised into a character stream
Communicate the meaning of the data as well as the data
Error-free
Minimal size
Handle multi-lingual text
Slide5
Solutions
Card file based
CSV
XLS - Excel file format
XML
SQL export
JSON - JavaScript Object Notation
The Medabar in Asmara, Eritrea
Google Map
Slide6
Card-based
Examples
ATCO-CIF for timetables
IGES for Computer-Aided Design
Characteristics
Based on old 80-column punched cards
Multiple record types
Fixed field widths
No formal language to define the format
Slide7
Examples
Alveston (Bristol) weather data
World Health Organization (WHO) - generated estimates of TB mortality, prevalence, incidence (including incidence of HIV+TB) and case detection rate.
1000 Songs - Google Spreadsheet
Characteristics
Data values separated by a common separator character - space, comma or tab
Column position is significant
Lines separated by newlines - coding depends on OS: line feed (x0A) on Unix; carriage return (x0D) followed by line feed on Windows; carriage return alone on old Macs
Separator must not occur in data values, or some other convention is needed - quotes around the value, or an escape character
Column headings may be the first line
Only tables - all lines the same
All columns required - problem for space-separated data
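The quoting convention can be illustrated with a short sketch. It assumes Python's standard csv module; the station name, note text and temperature are made-up values, chosen so that one field contains both the separator and a quote character.

```python
import csv
import io

# Hypothetical rows: the "note" value contains a comma and double quotes.
rows = [
    ["station", "note", "temp_c"],
    ["Alveston", 'windy, with "gusts"', "11.5"],
]

# Write with Unix line feeds so the output does not depend on the OS.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()

# The writer wraps the awkward value in quotes and doubles the inner quotes.
data_line = text.splitlines()[1]

# Reading the text back recovers the original values unchanged.
parsed = list(csv.reader(io.StringIO(text, newline="")))
```

Here the first line of the output carries the column headings, and every row has the same three columns, as the characteristics above require.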
CSV
Slide8
Data with optional and repeated fields needs more complex structures. Many have been developed for specific domains
MARC library catalogue records
EDIFACT for commercial Electronic Data Interchange (EDI)
EDIF LISP-based nested data
EXIF data embedded in a JPEG image
Tagged record structures
Slide9
XML
A generic data format based on tagged elements in a tree structure.
Developed from GML, via SGML.
GML, a document markup language developed by Charles Goldfarb at IBM in 1969.
Examples
Alveston
WDL config file
UWE news RSS feed
Tree with Buddhist prayer flags
Slide10
XML domain vocabularies
XML defines only the rules for a well-formed document. The allowable tags, their structuring and order in a document, the range of allowable values and the meaning of those tags depend on the XML application - called a vocabulary.
There are now hundreds of XML vocabularies designed for every sort of data
XHTML - the version of HTML which conforms to XML
SVG - graphics
TransXChange for timetables
RSS and Atom for news
Slide11
There are also vocabularies for languages for processing XML
XSLT - for transforming XML documents
XSL-FO - for formatting documents for output such as PDF
XML Schema - for defining XML vocabularies
XProc - for defining XML pipelines
XML processing vocabularies
Slide12
I want to disseminate news about my project/company, and allow interested people to read it, e.g. the university wants to spread the news about successful staff
Solution 1: HTML page
Publish a page of news on the website in HTML
Problems
how do visitors know when it's changed?
news from different universities cannot be easily combined (why?)
Problem: News dissemination
Slide13
Encourage interested users to subscribe to your company newsletter.
Problems
Subscription is a barrier
Clutters up email boxes
Can look like spam
List management and emailing overhead
Solution: email
Slide14
UWE makes up its own set of additional tags
Solution: Create XML document for news
<newsItem date='2007-10-02'>
  <newsTitle>UWE best in West</newsTitle>
  <newsBody>UWE wins tiddlywinks again</newsBody>
  <contact>press@uwe.ac.uk</contact>
</newsItem>
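Any program that displays such a home-made vocabulary needs code written specifically for its tags. A minimal sketch, assuming Python's standard xml.etree.ElementTree and html modules and the tag names newsItem, newsTitle, newsBody and contact; the function name news_to_html and the HTML layout it produces are illustrative choices, not part of any standard.

```python
import html
import xml.etree.ElementTree as ET

# A news item in the made-up vocabulary (tag case must match exactly).
NEWS_XML = """<newsItem date='2007-10-02'>
  <newsTitle>UWE best in West</newsTitle>
  <newsBody>UWE wins tiddlywinks again</newsBody>
  <contact>press@uwe.ac.uk</contact>
</newsItem>"""


def news_to_html(xml_text):
    """Translate one newsItem element into a small HTML fragment."""
    item = ET.fromstring(xml_text)
    # Escape every value so that text such as '<' or '&' stays valid HTML.
    title = html.escape(item.findtext("newsTitle", default=""))
    body = html.escape(item.findtext("newsBody", default=""))
    contact = html.escape(item.findtext("contact", default=""))
    date = html.escape(item.get("date", ""))
    return f"<h2>{title}</h2>\n<p>{body} ({date}, contact: {contact})</p>"


fragment = news_to_html(NEWS_XML)
```

A second university with different tag names would need a second translator, which is part of the argument for a shared vocabulary.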
Problems
someone has to design this language
has to be translated to HTML to display
the reader has to understand multiple new tags from different sources
needs to be distinguished from standard HTML
Slide15
Problem
How to distinguish in a document XML tags from different vocabularies?
Solution
define a (globally) unique URI for the vocabulary
use an arbitrary prefix (e.g. news:) for all tags in the same vocabulary - unique within a document
link the prefix to the vocabulary in the document
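From a parser's point of view the prefix is only shorthand; the URI is what identifies the vocabulary. A minimal sketch, assuming Python's standard xml.etree.ElementTree and a small hypothetical document that binds the news: prefix to the URI http://www.uwe.ac.uk/news with an xmlns:news attribute:

```python
import xml.etree.ElementTree as ET

NEWS_NS = "http://www.uwe.ac.uk/news"

# Hypothetical document mixing plain tags with tags from the news vocabulary.
doc = f"""<div>
  <h1>UWE news</h1>
  <news:item xmlns:news="{NEWS_NS}" date="2007-10-02">
    <news:Title>UWE best in West</news:Title>
  </news:item>
</div>"""

root = ET.fromstring(doc)

# The prefix used in a query is arbitrary ("n" here); only the URI must match.
prefixes = {"n": NEWS_NS}
item = root.find("n:item", prefixes)
title = item.findtext("n:Title", namespaces=prefixes)

# Internally the parser stores the full URI with the local name.
qualified_tag = item.tag           # "{http://www.uwe.ac.uk/news}item"
plain_tag = root.find("h1").tag    # "h1", no namespace
```

Because the query uses a different prefix from the document and still matches, the sketch shows that the prefix only needs to be unique within one document.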
Aside: Namespaces
<h1>UWE news</h1>
<p>
  <news:item xmlns:news="http://www.uwe.ac.uk/news" date="2007-10-02">
    <news:Title>UWE best in West</news:Title>
    <news:Body>UWE wins tiddlywinks again</news:Body>
    <news:Contact>press@uwe.ac.uk</news:Contact>
  </news:item>
</p>
Slide16
Standardize on one (or several!) sets of standard tags
Tags are machine-readable to identify news items in a list of web sites
RSS 2.0
Really Simple Syndication
Rich Site Summary
Atom - a more recent format
Differences - dates (RFC 822 v RFC 3339 timestamps), multi-lingual content
Characteristics
Structure: rss / channel / item tree
Items in reverse chronological order
Few mandatory tags
Namespaces allow additional vocabularies to be added
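The rss / channel / item structure and the two date styles can be sketched as follows. This assumes Python's standard xml.etree.ElementTree and email.utils modules; the function name build_feed, the example.org URLs and the story titles are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def build_feed(title, link, description, entries):
    """Build an rss / channel / item tree with only a few basic tags."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link),
                      ("description", description)):
        ET.SubElement(channel, tag).text = text
    # Items go in reverse chronological order: newest first.
    for entry in sorted(entries, key=lambda e: e["published"], reverse=True):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        # RSS 2.0 uses RFC 822 style dates.
        ET.SubElement(item, "pubDate").text = format_datetime(
            entry["published"], usegmt=True)
    return rss


entries = [
    {"title": "Older story", "link": "http://example.org/news/1",
     "published": datetime(2008, 10, 10, 9, 0, tzinfo=timezone.utc)},
    {"title": "Newer story", "link": "http://example.org/news/2",
     "published": datetime(2008, 10, 13, 15, 15, 44, tzinfo=timezone.utc)},
]
feed = build_feed("Example News", "http://example.org",
                  "Hypothetical feed", entries)
xml_text = ET.tostring(feed, encoding="unicode")

rfc822_date = feed.find("channel/item/pubDate").text   # RSS style
rfc3339_date = entries[1]["published"].isoformat()     # Atom style
```

The same instant is written as "Mon, 13 Oct 2008 15:15:44 GMT" in the RSS style and as "2008-10-13T15:15:44+00:00" in the RFC 3339 style used by Atom.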
Solution: RSS
Slide17
Example RSS - UWE news
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
  <channel>
    <title>UWE News</title>
    <link>http://www.uwe.ac.uk</link>
    <description>Latest UWE press releases</description>
    <image>
      <url>http://info.uwe.ac.uk/common/assets/2004Design/logoNoBorder.gif</url>
      <title>University of the West of England</title>
      <link>http://www.uwe.ac.uk</link>
    </image>
    <pubDate>Fri, 13 Oct 2008 15:15:44 GMT</pubDate>
    <item>
      <title>New research looks to transport users for solutions</title>
      <link>http://info.uwe.ac.uk/news/uwenews/article.asp?item=1363</link>
      <description>'Ideas in Transit' is a new initiative which will look to transport users' experiences and creativity as a source of innovation to tackle the UK's transport problems ....</description>
    </item>
Slide18
Example RSS - BBC Finance News
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet title="XSL_formatting" type="text/xsl" href="/shared/bsp/xsl/rss/nolsol.xsl"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
  <channel>
    <title>BBC News | Business | UK Edition</title>
    <link>http://news.bbc.co.uk/go/rss/-/1/hi/business/default.stm</link>
    <description>Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.</description>
    <language>en-gb</language>
    <lastBuildDate>Mon, 13 Oct 2008 14:28:30 GMT</lastBuildDate>
    <copyright>Copyright: (C) British Broadcasting Corporation, see http://news.bbc.co.uk/1/hi/help/rss/4498287.stm for terms and conditions of reuse</copyright>
    <docs>http://www.bbc.co.uk/syndication/</docs>
    <ttl>15</ttl>
    <image>
      <title>BBC News</title>
      <url>http://news.bbc.co.uk/nol/shared/img/bbc_news_120x60.gif</url>
      <link>http://news.bbc.co.uk/go/rss/-/1/hi/business/default.stm</link>
    </image>
    <item>
      <title>UK banks receive £37bn bail-out</title>
      <description>The UK government says it is to inject a total of up to £37bn into Royal …..
    </item>
Slide19
Problem
How to keep track of multiple feeds
Solution: RSS aggregation
http://www.youtube.com/watch?v=0klgLsSxGsU&feature=player_embedded#t=0s
An application is needed which
is stateful - remembers what items you have read
integrates multiple feeds into one 'magazine'
polls RSS providers on a regular basis
Feed integrators (Bloglines, Google Reader) reduce the load on the provider and provide some filtering
There is an RSS reader integrated into MyUWE
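The core of such an application can be sketched in a few lines. This is a minimal illustration, assuming Python's standard library and RSS 2.0 items that each carry title, link and pubDate; the class name Aggregator, the two sample feeds and their example.org URLs are hypothetical. Fetching the feeds over HTTP on a timer (the polling step) is left out so that the sketch runs without a network.

```python
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET


class Aggregator:
    """Merges several RSS 2.0 feeds and remembers which items were read."""

    def __init__(self):
        self.read_links = set()   # the state: links of items already read

    def merge(self, feeds):
        """Return unread items from all feeds, newest first."""
        entries = []
        for xml_text in feeds:
            channel = ET.fromstring(xml_text).find("channel")
            source = channel.findtext("title")
            for item in channel.findall("item"):
                link = item.findtext("link")
                if link in self.read_links:
                    continue
                entries.append({
                    "source": source,
                    "title": item.findtext("title"),
                    "link": link,
                    "published": parsedate_to_datetime(
                        item.findtext("pubDate")),
                })
        return sorted(entries, key=lambda e: e["published"], reverse=True)

    def mark_read(self, link):
        self.read_links.add(link)


FEED_A = """<rss version="2.0"><channel><title>Hypothetical Feed A</title>
<item><title>Story A</title><link>http://example.org/a/1</link>
<pubDate>Mon, 13 Oct 2008 15:15:44 GMT</pubDate></item></channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Hypothetical Feed B</title>
<item><title>Story B</title><link>http://example.org/b/1</link>
<pubDate>Mon, 13 Oct 2008 14:28:30 GMT</pubDate></item></channel></rss>"""

aggregator = Aggregator()
magazine = aggregator.merge([FEED_A, FEED_B])   # both items, newest first
aggregator.mark_read(magazine[0]["link"])
unread = aggregator.merge([FEED_A, FEED_B])     # only the unread item
```

A real aggregator would also persist read_links between runs and respect each feed's ttl value when deciding how often to poll.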
RSS Aggregation with Bloglines
Slide20
UWE news
BBC Finance news
Earthquakes
RSS as a tree structure
Slide21
strings enclosed in tags which provide a human-readable name for the element - so-called self-describing
elements may be nested to create hierarchical data structures
element tags may be repeated
element names can be relative to their parent
element structure can be formally defined
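Nesting, repetition and parent-relative names can be seen in a small sketch. It assumes Python's standard xml.etree.ElementTree; the library, book, journal and author tags are a made-up vocabulary used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <author> is repeated, and <title> appears
# under two different parents with a different meaning in each.
doc = """<library>
  <book>
    <title>XML Basics</title>
    <author><name>A. Writer</name></author>
    <author><name>B. Editor</name></author>
  </book>
  <journal>
    <title>Data Quarterly</title>
  </journal>
</library>"""

root = ET.fromstring(doc)


def outline(element, depth=0):
    """List tag names, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(root)

# Repeated elements come back as a list, in document order.
authors = [a.findtext("name") for a in root.findall("book/author")]

# The same tag name is resolved relative to its parent.
book_title = root.findtext("book/title")
journal_title = root.findtext("journal/title")
```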
XML Characteristics
Slide22
Element names provide a clue about the meaning of the data, but not enough
names are ambiguous
names may be misleading
what units?
what accuracy?
what origin? - leads to need for meta-data
who created
when
what license to use
why
Aside: Self-describing
Slide23
XML documents are tree structures, with each node bounded by an opening and a closing tag
Element: the opening tag, attributes, the body of the element and the closing tag. Elements are not elemental!
Tag name: the name in angle brackets - must conform to rules, may have a prefix
Attribute: a name="value" pair attached to an element. Names follow the same rules as tag names.
Parent: all elements except the root have one parent
Child: an element nested in another parent element
Root: every document has a single root element with no parent
Mixed Content: an element may contain a mixture of text and other elements
XML terminology
Slide24
A single root element
Tags must be properly nested
An element must be closed:
Open and closing tag <p>... </p>
Empty element <br/> or <hr size="3"/>
Other formatting rules
XML names are case sensitive, no spaces, restricted character set
Attribute values must be single or double-quoted
Special characters coded as references, e.g. &#10; (a line feed), &gt; (>)
Some characters have special meaning, e.g. < is the start of a tag; within XML data, & is the first character of an entity reference. In XML data these have to be encoded as &lt; and &amp; or enclosed in <![CDATA[ ....]]>
Preferably use standard formats for representing values e.g. 2008-10-14 for a date
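These rules can be checked mechanically, since a conforming parser rejects any document that breaks them. A minimal sketch, assuming Python's standard xml.etree.ElementTree and xml.sax.saxutils; the helper name is_well_formed and the sample strings are illustrative.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


raw = "Fish & chips < 5 pounds"

# escape() replaces & and < (and >) with entity references.
escaped = escape(raw)

good = f"<note>{escaped}</note>"
cdata = f"<note><![CDATA[{raw}]]></note>"

# Each of these breaks one of the rules listed above.
bad_unescaped = f"<note>{raw}</note>"   # bare & and < in data
bad_nesting = "<a><b></a></b>"          # tags not properly nested
bad_case = "<p>text</P>"                # names are case sensitive
two_roots = "<a/><b/>"                  # more than one root element

# Both the escaped and the CDATA form give back the original text.
round_trip = ET.fromstring(good).text
cdata_text = ET.fromstring(cdata).text
```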
Basic XML rules