Slide1
lis512 lecture 5
XML based metadata formats
Slide2constraints on XML
XML defines a general syntax for documents.
These documents can contain records.
Records are put together in an XML document.
But to make sense of XML, we need something more.
Slide3valid XML
XML that is well-formed may be valid against a set of constraints.
Such a set of constraints can be formally written down.
An old form of a formally written set of constraints is called a DTD.
More recently, schema files have become more popular. They do the same thing.
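A minimal sketch of the first half of this distinction: checking that a document is well-formed. It uses only Python's standard library; the sample records and the function name is_well_formed are made up for illustration. Checking validity against a DTD or schema needs an additional validator and is not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, i.e. it is well-formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample records, not taken from any real format.
good = "<record><title>Piers Plowman</title></record>"
bad = "<record><title>Piers Plowman</record>"  # <title> is never closed

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

A well-formed document can still be invalid: the parser above would happily accept a record with ten titles, even if the constraints allow only one.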
Slide4example: XHTML DTD
[some of this not quite true]
XHTML is an XML DTD.
Basically it is a set of constraints that an XML document has to follow if it is to be a web page.
These constraints are written out in a file called the XHTML DTD.
Slide5sample HTML constraints
The root element of an HTML document is called <html>.
It admits two child elements, called <head> and <body>.
There must be one <head> and one <body>.
The <head> appears first, then appears the <body>.
<html><head>...</head>
<body>...</body></html>
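The constraints on this slide can be sketched as a small hand-written check. This is only an illustration of what a validator does with a DTD, not how a DTD is actually written. The function name check_html_skeleton is made up, and the sketch assumes the document uses no XML namespace (real XHTML does).

```python
import xml.etree.ElementTree as ET


def check_html_skeleton(text: str) -> list:
    """Return a list of violated constraints (empty if all of them hold)."""
    problems = []
    root = ET.fromstring(text)
    # Constraint: the root element is called <html>.
    if root.tag != "html":
        problems.append(f"root element is <{root.tag}>, expected <html>")
    # Constraint: exactly one <head>, then exactly one <body>, nothing else.
    children = [child.tag for child in root]
    if children != ["head", "body"]:
        problems.append(f"children are {children}, expected ['head', 'body']")
    return problems


print(check_html_skeleton("<html><head></head><body></body></html>"))
print(check_html_skeleton("<html><body></body><head></head></html>"))
```

The first call reports no problems. The second reports one, because the <body> appears before the <head>.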
Slide6more HTML constraints
All elements in the body are classified as either text-level or block-level.
A text-level element has to be a child of a block-level element.
A text-level element cannot have block-level elements as children.
Slide7more HTML constraints
There is an element <img> that can be placed in the body. It is a text-level element. It is always an empty element. It requires an attribute called src= and an attribute called alt=. It may take a bunch of other attributes.
The fact that <img> creates an image is understood. It is not something that can be formally specified in the DTD.
Slide8meaning what
Meaning that a computer can check that there is a src= attribute pointing to an image file.
It can check that the image can be accessed (at this time).
It cannot check that the image is “appropriate” (e.g. that it is the American flag rather than the Russian flag).
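The machine-checkable part of the <img> constraints can be sketched as follows. The function name check_images and the sample page are made up, and the sketch assumes a document without namespaces. Whether the image can be fetched, or shows the right flag, is outside what this code looks at.

```python
import xml.etree.ElementTree as ET

REQUIRED_IMG_ATTRIBUTES = ("src", "alt")


def check_images(text: str) -> list:
    """Return a list of formal problems found on <img> elements."""
    problems = []
    root = ET.fromstring(text)
    for img in root.iter("img"):
        # Constraint: src= and alt= are both required.
        for name in REQUIRED_IMG_ATTRIBUTES:
            if name not in img.attrib:
                problems.append(f"<img> is missing the {name}= attribute")
        # Constraint: <img> is empty, so no child elements and no text.
        if len(img) > 0 or (img.text or "").strip():
            problems.append("<img> is not empty")
    return problems


# Hypothetical page fragment: the second image has no alt= attribute.
page = '<body><p><img src="flag.png" alt="a flag"/><img src="map.png"/></p></body>'
print(check_images(page))
```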
Slide9other XML formats
I will mention here a few other formats that are used in the bibliographic universe.
Contrary to HTML, which is really a document format, these formats are used to encode certain records.
One we have already seen: MARC XML.
Slide10constrain languages
There are basically three languages that are used to model constraints on XML documents.
DTDs
XML Schema files
Relax NG
They don't have the same capabilities. Some can implement constraints that others can't.
No language is able to implement all possible constraints.
Slide11DTDs
DTDs were defined with SGML, a parent format of XML.
DTDs are not written in an XML syntax, they have their own syntax.
No software has been written that fully implements SGML and its DTDs.
DTDs are now considered a legacy format.
Slide12XML Schema
It's a W3C-sponsored format that aimed to be the ultimate schema language.
It is quite abominably complicated and fails miserably in certain areas. For example, it has poor support for unordered content.
A look at http://www.w3.org/TR/xmlschema-1/ will help.
Slide13Relax NG
This is a more informal standard for constraining XML documents.
It is much easier to learn and use than XML Schema.
Free software implements the entire standard.
Slide14in the archival world
Documents and other artifacts held in archives are usually not catalogued item by item. That would be too expensive.
Instead archivists create "finding aids", documents that describe what can be found in an archival collection.
A large archive may have hundreds of finding aids, usually one per box of "stuff".
Slide15encoded archival description
http://www.archivists.org/saagroups/ead/aboutEAD.html says "EAD stands for Encoded Archival Description, and is a non-proprietary de facto standard for the encoding of finding aids for use in a networked (online) environment. Finding aids are inventories, indexes, or guides that are created by archival and manuscript repositories to provide information about specific collections. (cont.)
Slide16continuation of previous quote.
While the finding aids may vary somewhat in style, their common purpose is to provide detailed description of the content and intellectual organization of collections of archival materials. EAD allows the standardization of collection information in finding aids within and across repositories."
Slide17EAD history
EAD started in 1993 (just like RePEc) at the University of California as a project led by Daniel V. Pitti.
EAD is formally specified as an SGML DTD, but now also as an XML Schema file and a Relax NG schema.
The current version is called EAD 2002. It is maintained by the Library of Congress.
Slide18other formats
There are two modern versions of EAD around.
A Relax NG form: http://www.loc.gov/ead/ead.rng
An XML Schema: http://www.loc.gov/ead/ead.xsd
Both Relax NG and XML Schema are schema languages. They are ways to specify constraints on XML documents.
Slide19EAD problem
EAD is very difficult to implement for archivists, who generally have low IT skills.
They are then reliant on somebody else to do it for them. Such personnel do not come cheap.
See Sonia Yaco’s paper: "It’s Complicated: Barriers to EAD Implementation", based on survey work by Anlex consulting.
Library of Congress example appears invalid!
Slide20in the humanities: TEI
The TEI is the Text Encoding Initiative.
It is a DTD, Relax NG and XML Schema specification, with documentation, on how to encode texts.
The texts they talk about are mainly historic and cultural artifacts:
poetry
historical documents
Slide21problems to address
How to represent faithfully a printed text in an XML form?
imitate page appearance
character recognition and normalization
guessing formal structure from visual appearance.
Slide22example: Langland's Piers Plowman
<l>
<seg>In a somer seson,</seg>
<seg>whan softe was the sonne,</seg>
</l>
<l>
<seg>I shoop me into shroudes</seg>
<seg>as I a sheep were,</seg>
</l>
<l>
<seg>In habite as an heremite</seg>
<seg>unholy of werkes,</seg>
</l>
<l>
<seg>Went wide in this world</seg>
<seg>wondres to here.</seg>
</l>
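Once the verse is encoded this way, software can work with its structure. A minimal sketch, using Python's standard library, of pulling the two half-lines out of each <l>. The wrapping <lg> element is an assumption added so that the fragment has a single root, and the function name half_lines is made up.

```python
import xml.etree.ElementTree as ET

# First two lines of the example, wrapped in an assumed <lg> root element.
fragment = """
<lg>
<l><seg>In a somer seson,</seg><seg>whan softe was the sonne,</seg></l>
<l><seg>I shoop me into shroudes</seg><seg>as I a sheep were,</seg></l>
</lg>
"""


def half_lines(text: str) -> list:
    """Return, for every <l>, the list of its <seg> texts."""
    root = ET.fromstring(text.strip())
    return [
        [(seg.text or "").strip() for seg in line.findall("seg")]
        for line in root.iter("l")
    ]


for first, second in half_lines(fragment):
    print(first, "|", second)
```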
Slide23example from Pope's Essay on Criticism
<div type="book" n="1" met="-+|-+|-+|-+|-+/" rhyme="aa">
<lg n="1" type="paragraph">
<l>'Tis hard to say, if greater Want of Skill</l>
<l>Appear in <hi>Writing</hi> or in <hi>Judging</hi> ill;</l>
<l>But, of the two, less dang'rous is th'Offence,</l>
<l>To tire our <hi>Patience</hi>, than mis-lead our <hi>Sense</hi>:</l>
</lg>
</div>
Slide24example for manuscript description
<handDesc hands="3">
<handNote xml:id="Eirsp-1" scope="minor">
<p>The first part of the manuscript, <locus from="1v" to="72v:4">fols 1v-72v:4</locus>, is written in a practised Icelandic Gothic bookhand. This hand is not found elsewhere.</p>
</handNote>
<handNote xml:id="Eirsp-2" scope="major">
<p>The second part of the manuscript, <locus from="72v:4" to="194v">fols 72v:4-194</locus>, is written in a hand contemporary with the first; it can also be found in a fragment of <title>Knýtlinga saga</title>, <ref>AM 20b II fol.</ref>.</p>
</handNote>
<handNote xml:id="Eirsp-3" scope="minor">
<p>The third hand has written the majority of the chapter headings. This hand has been identified as the one also found in <ref>AM 221 fol.</ref>.</p>
</handNote>
</handDesc>
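A record like this can be read by software. A minimal sketch, using Python's standard library, that compares the hands= attribute with the number of <handNote> children and lists their identifiers. The function name summarize_hands is made up, and the <p> texts are shortened placeholders. Treating "one <handNote> per hand" as a rule is an assumption made for illustration, not something the slide states.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined in XML; ElementTree expands it to this name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened version of the example; the <p> texts are placeholders.
sample = """<handDesc hands="3">
<handNote xml:id="Eirsp-1" scope="minor"><p>First hand.</p></handNote>
<handNote xml:id="Eirsp-2" scope="major"><p>Second hand.</p></handNote>
<handNote xml:id="Eirsp-3" scope="minor"><p>Third hand.</p></handNote>
</handDesc>"""


def summarize_hands(text: str):
    """Return (declared number of hands, list of handNote identifiers)."""
    root = ET.fromstring(text)
    declared = int(root.get("hands"))
    ids = [note.get(XML_ID) for note in root.findall("handNote")]
    return declared, ids


declared, ids = summarize_hands(sample)
# Assumed rule: the hands= count should match the number of notes.
print(declared == len(ids), ids)
```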
Slide25ONIX
ONIX is a record format used in the book trading and publishing industry.
The London-based EDItEUR group, a membership-based organization, maintains the standard.
Slide26purpose
ONIX is essentially related to product descriptions in the publishing industry.
Publishers can use it to communicate with vendors.
Publishers can use it internally to streamline procedures.
There are three formats:
books
serials
rights
Slide27MODS
This is the Metadata Object Description Schema.
It is basically a MARC 'light'. By converting MARC to MODS, you lose some information.
The resulting information is less detailed and less complicated.
Slide28METS
http://www.loc.gov/standards/mets/METS%20Documentation%20final%20070930%20msw.pdf says
"The Metadata Encoding and Transmission Standard (METS) is a data encoding and transmission specification, expressed in XML, that provides the means to convey the metadata necessary for both the management of digital objects within a repository and the exchange of such objects between repositories (or between repositories and their users)."
Slide29METS
METS is a container structure that not only allows data about an object of interest to be encoded, but also allows the object itself to be encoded.
As a result, it is very complicated.
Slide30Journal Publishing Tag Set
This XML DTD is issued by the National Library of Medicine.
It is the format followed by records in the PubMed database.
PubMed is the largest sort-of freely available bibliographic dataset for scientific articles.
Because of its market power, the tag set is spreading through the publishing industry.