lis512 lecture 5 XML based metadata formats

reportssuper
Uploaded On 2020-10-22



Presentation Transcript

Slide1

lis512 lecture 5

XML based metadata formats

Slide2

constraints on XML

XML defines a general syntax for documents.

These documents can contain records.

Records are put together in an XML document.

But to make sense of XML, we need something more.

Slide3

valid XML

XML that is well-formed may be valid against a set of constraints.

Such a set of constraints can be formally written down.

An old form of a formally written set of constraints is called a DTD.

More recently, schema files have become more popular. They do the same thing.
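
The difference between well-formed and valid can be sketched in code. This is a minimal illustration, assuming Python and its standard library parser. That parser only checks well-formedness; checking validity against a DTD or schema would need a validating parser, which is not shown here. The function name and the sample records are made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample records, not taken from any real format.
good = "<record><title>Piers Plowman</title></record>"
bad = "<record><title>Piers Plowman</record>"  # <title> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

A well-formed document can still break every rule of a particular format. That is what the constraints are for.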

Slide4

example: XHTML DTD

[some of this not quite true]

XHTML is an XML DTD.

Basically it is a set of constraints that an XML document has to follow if it is to be a web page.

These constraints are written out in a file called the XHTML DTD.

Slide5

sample HTML constraints

The root element of an HTML document is called <html>.

It admits two child elements, called <head> and <body>.

There must be one <head> and one <body>.

The <head> appears first, then appears the <body>.

<html><head>...</head>

<body>...</body></html>
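
These four constraints are simple enough to check by hand in code. The sketch below is only an illustration in Python, not how a DTD validator works internally. It ignores the XHTML namespace, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def check_html_skeleton(text):
    """Return a list of violated constraints (empty if there are none)."""
    root = ET.fromstring(text)
    problems = []
    # Constraint: the root element is called <html>.
    if root.tag != "html":
        problems.append("root element is not <html>")
    # Constraint: exactly one <head>, then exactly one <body>.
    children = [child.tag for child in root]
    if children != ["head", "body"]:
        problems.append("children must be exactly <head> then <body>")
    return problems

print(check_html_skeleton("<html><head/><body/></html>"))
print(check_html_skeleton("<html><body/><head/></html>"))
```

A DTD states the same rules declaratively, so that any validating parser can apply them without custom code.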

Slide6

more HTML constraints

All elements in the body are classified as either text-level or block-level.

A text-level element has to be a child of a block-level element.

A text-level element cannot have block-level elements as children.
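
A rough sketch of how such nesting rules could be checked, again in Python. The two sets below are an assumed and deliberately incomplete classification, chosen only for illustration. The first rule is read here as "no text-level element directly under <body>".

```python
import xml.etree.ElementTree as ET

# Assumed, incomplete classification, for illustration only.
BLOCK_LEVEL = {"p", "div", "h1", "h2", "blockquote"}
TEXT_LEVEL = {"em", "strong", "a", "span", "img"}

def nesting_problems(body):
    """Collect violations of the two nesting rules under a <body> element."""
    problems = []
    # Rule 1: a text-level element may not sit directly under <body>.
    for child in body:
        if child.tag in TEXT_LEVEL:
            problems.append(f"<{child.tag}> is not inside a block-level element")
    # Rule 2: a text-level element may not contain a block-level element.
    for parent in body.iter():
        if parent.tag in TEXT_LEVEL:
            for child in parent:
                if child.tag in BLOCK_LEVEL:
                    problems.append(f"<{child.tag}> inside <{parent.tag}>")
    return problems

ok = ET.fromstring("<body><p>Some <em>emphasised</em> text.</p></body>")
bad = ET.fromstring("<body><em><p>wrong way round</p></em></body>")

print(nesting_problems(ok))
print(nesting_problems(bad))
```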

Slide7

more HTML constraints

There is an element <img> that can be placed in the body. It is a text-level element. It is always an empty element. It requires an attribute called src= and an attribute called alt=. It may take a number of other attributes.

The fact that <img> creates an image is understood. It is not something that can be formally specified in the DTD.

Slide8

meaning what

This means that a computer can check that there is an src= attribute pointing to an image file.

It can check that the image can be accessed (at this time).

It cannot check that the image is “appropriate” (e.g. that it is the American flag rather than the Russian flag).
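
The mechanical part of the <img> constraints can be sketched as follows. This is an illustration in Python with a hypothetical function name. It checks that src= and alt= are present and that the element is empty. It does not try to fetch the image, and nothing in it could judge whether the image is the right one.

```python
import xml.etree.ElementTree as ET

def img_problems(root):
    """Check every <img>: src= and alt= present, and no content."""
    problems = []
    for img in root.iter("img"):
        for name in ("src", "alt"):
            if name not in img.attrib:
                problems.append(f"<img> is missing {name}=")
        # An empty element has no child elements and no text.
        if len(img) > 0 or (img.text or "").strip():
            problems.append("<img> is not empty")
    return problems

# Hypothetical page fragment with two faulty images.
page = ET.fromstring(
    '<body><p><img src="flag.png"/>'
    '<img src="a.png" alt="A">x</img></p></body>'
)

print(img_problems(page))
```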

Slide9

other XML formats

I will mention here a few other formats that are used in the bibliographic universe.

Contrary to HTML, which is really a document format, these formats are used to encode certain records.

One we have already seen: MARC XML.

Slide10

constraint languages

There are basically three languages that are used to model constraints on XML documents.

DTDs

XML Schema files

Relax NG

They don't have the same capabilities. Some can implement constraints that others can't.

No language is able to implement all possible constraints.

Slide11

DTDs

DTDs were defined with SGML, a parent format of XML.

DTDs are not written in an XML syntax, they have their own syntax.

No software has been written that fully implements SGML and its DTDs.

DTDs are now considered a legacy format.

Slide12

XML Schema

It's a W3C sponsored format that aimed to be the ultimate schema language.

It is quite abominably complicated and fails miserably in certain areas. For example, it has poor support for unordered contents.

A look at http://www.w3.org/TR/xmlschema-1/ will help.

Slide13

Relax NG

This is a more informal standard for constraining XML documents.

It is much easier to learn and use than XML Schema.

Free software implements the entire standard.

Slide14

in the archival world

Documents and other artifacts held in archives are usually not catalogued item by item. That would be too expensive.

Instead archivists create "finding aids", documents that describe what can be found in an archival collection.

A large archive may have hundreds of finding aids, usually one per box of "stuff".

Slide15

encoded archival description

http://www.archivists.org/saagroups/ead/aboutEAD.html says "EAD stands for Encoded Archival Description, and is a non-proprietary de facto standard for the encoding of finding aids for use in a networked (online) environment. Finding aids are inventories, indexes, or guides that are created by archival and manuscript repositories to provide information about specific collections. (cont.)

Slide16

continuation of previous quote.

While the finding aids may vary somewhat in style, their common purpose is to provide detailed description of the content and intellectual organization of collections of archival materials. EAD allows the standardization of collection information in finding aids within and across repositories."

Slide17

EAD history

EAD started in 1993 (just as RePEc did) at the University of California as a project led by Daniel V. Pitti.

EAD is formally specified as an SGML DTD, but now also as an XML Schema file and a Relax NG schema.

The current version is called EAD 2002. It is maintained by the Library of Congress.

Slide18

other formats

There are two modern versions of EAD around.

A Relax NG form: http://www.loc.gov/ead/ead.rng

An XML Schema: http://www.loc.gov/ead/ead.xsd

Both Relax NG and XML Schema are schema languages. They are ways to specify constraints on XML documents.

Slide19

EAD problem

EAD is very difficult to implement for archivists, who generally have low IT skills.

They are then reliant on somebody else to do it for them. Such personnel do not come cheap.

See Sonia Yaco’s paper: "It’s Complicated: Barriers to EAD Implementation", based on survey work by Anlex consulting.

Library of Congress example appears invalid!

Slide20

in the humanities: TEI

The TEI is the Text Encoding Initiative.

It is a DTD, Relax NG and XML Schema specification, with documentation, on how to encode texts.

The texts they talk about are mainly historic and cultural artifacts:

poetry

historical documents

Slide21

problems to address

How to represent faithfully a printed text in an XML form?

imitate page appearance

character recognition and normalization

guessing formal structure from visual appearance.

Slide22

example: Langland's Piers Plowman

<l>
 <seg>In a somer seson,</seg>
 <seg>whan softe was the sonne,</seg>
</l>
<l>
 <seg>I shoop me into shroudes</seg>
 <seg>as I a sheep were,</seg>
</l>
<l>
 <seg>In habite as an heremite</seg>
 <seg>unholy of werkes,</seg>
</l>
<l>
 <seg>Went wide in this world </seg>
 <seg>wondres to here.</seg>
</l>
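
Once the verse is encoded this way, software can get at its structure. The sketch below, in Python, pulls the two half-lines out of each <l>. The wrapping <lg> element is an assumption added so that the sample has a single root, and the function name is hypothetical. The TEI namespace is left out, as in the slide.

```python
import xml.etree.ElementTree as ET

# Two lines from the example, wrapped in an assumed <lg> root element.
SAMPLE = """<lg>
<l><seg>In a somer seson,</seg> <seg>whan softe was the sonne,</seg></l>
<l><seg>I shoop me into shroudes</seg> <seg>as I a sheep were,</seg></l>
</lg>"""

def half_lines(xml_text):
    """Return one tuple of half-lines per <l> element."""
    root = ET.fromstring(xml_text)
    result = []
    for line in root.iter("l"):
        segs = [(seg.text or "").strip() for seg in line.findall("seg")]
        result.append(tuple(segs))
    return result

for first, second in half_lines(SAMPLE):
    print(first, "|", second)
```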

Slide23

example from Pope's Essay on Criticism

<div type="book" n="1" met="-+|-+|-+|-+|-+/" rhyme="aa">
<lg n="1" type="paragraph">
 <l>'Tis hard to say, if greater Want of Skill</l>
 <l>Appear in <hi>Writing</hi> or in <hi>Judging</hi> ill;</l>
 <l>But, of the two, less dang'rous is th'Offence,</l>
 <l>To tire our <hi>Patience</hi>, than mis-lead our <hi>Sense</hi>:</l>
</lg>
</div>
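
The markup keeps the plain text and the typographic highlighting apart, so either can be recovered. A small Python sketch, using two of the lines above as an assumed sample and ignoring the TEI namespace:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<lg n="1" type="paragraph">
<l>Appear in <hi>Writing</hi> or in <hi>Judging</hi> ill;</l>
<l>To tire our <hi>Patience</hi>, than mis-lead our <hi>Sense</hi>:</l>
</lg>"""

root = ET.fromstring(SAMPLE)

# The text of each verse line with the markup stripped.
plain_lines = ["".join(line.itertext()) for line in root.iter("l")]

# Only the words that the edition highlights.
highlighted = [hi.text for hi in root.iter("hi")]

print(plain_lines)
print(highlighted)
```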

Slide24

example for manuscript description

<handDesc hands="3">
<handNote xml:id="Eirsp-1" scope="minor">
 <p>The first part of the manuscript, <locus from="1v" to="72v:4">fols 1v-72v:4</locus>, is written in a practised Icelandic Gothic bookhand. This hand is not found elsewhere.</p>
</handNote>
<handNote xml:id="Eirsp-2" scope="major">
 <p>The second part of the manuscript, <locus from="72v:4" to="194v">fols 72v:4-194</locus>, is written in a hand contemporary with the first; it can also be found in a fragment of <title>Knýtlinga saga</title>, <ref>AM 20b II fol.</ref>.</p>
</handNote>
<handNote xml:id="Eirsp-3" scope="minor">
 <p>The third hand has written the majority of the chapter headings. This hand has been identified as the one also found in <ref>AM 221 fol.</ref>.</p>
</handNote>
</handDesc>
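
A description like this is a record that a program can read. The Python sketch below lists each hand with its scope, and compares the hands= attribute with the number of <handNote> elements. That comparison is an illustrative check of my own, not something taken from the TEI documentation. The sample shortens the paragraphs and omits the TEI namespace. The xml: prefix is predefined in XML, so the parser reports xml:id under the standard XML namespace.

```python
import xml.etree.ElementTree as ET

# Name under which ElementTree reports the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened version of the example; paragraph text is abbreviated.
SAMPLE = """<handDesc hands="3">
<handNote xml:id="Eirsp-1" scope="minor"><p>First hand.</p></handNote>
<handNote xml:id="Eirsp-2" scope="major"><p>Second hand.</p></handNote>
<handNote xml:id="Eirsp-3" scope="minor"><p>Third hand.</p></handNote>
</handDesc>"""

def summarize_hands(xml_text):
    """Return ({id: scope}, whether hands= matches the note count)."""
    root = ET.fromstring(xml_text)
    notes = root.findall("handNote")
    hands = {note.get(XML_ID): note.get("scope") for note in notes}
    declared = int(root.get("hands"))
    return hands, declared == len(notes)

hands, consistent = summarize_hands(SAMPLE)
print(hands)
print(consistent)
```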

Slide25

ONIX

ONIX is a record format used in the book trading and publishing industry.

The London-based EDItEUR group, a membership-based organization, maintains the standard.

Slide26

purpose

ONIX is essentially related to product descriptions in the publishing industry.

Publishers can use it to communicate with vendors.

Publishers can use it internally to streamline procedures.

There are three formats:

books

serials

rights

Slide27

MODS

This is the Metadata Object Description Schema.

It is basically a MARC 'light'. By converting MARC to MODS, you lose some information.

The resulting information is less detailed and less complicated.

Slide28

METS

http://www.loc.gov/standards/mets/METS%20Documentation%20final%20070930%20msw.pdf says

"The Metadata Encoding and Transmission Standard (METS) is a data encoding and transmission specification, expressed in XML, that provides the means to convey the metadata necessary for both the management of digital objects within a repository and the exchange of such objects between repositories (or between repositories and their users)."

Slide29

METS

METS is a container structure that not only allows data about an object of interest to be encoded, but also allows the object itself to be encoded.

As a result, it is very complicated.

Slide30

Journal Publishing Tag Set

This XML DTD is issued by the National Library of Medicine.

It is the format followed by records in the PubMed database.

PubMed is the largest sort-of freely available bibliographic dataset for scientific articles.

Because of its market power, the tag set is spreading through the publishing industry.