Parsing for XML Developers
Roger L. Costello
28 September 2014

Flat XML Document
You might receive an XML document that has no structure. For example, this XML document contains a flat (linear) list of Book data:
<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>

Give it structure to facilitate processing
<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>

That’s parsing!
Parsing is taking a flat (linear) sequence of items and adding structure so that the result conforms to a grammar.
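As a rough illustration of adding structure, here is a minimal Python sketch that is not part of the slides. It groups the flat list into Book elements, assuming every Book starts with a Title and that its Author elements sit next to each other. The function name add_structure is made up for this example. This is ad hoc grouping rather than a grammar-driven parser, so treat it only as a way to see the before and after.

```python
import xml.etree.ElementTree as ET

FLAT = """<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>"""


def add_structure(flat_xml: str) -> ET.Element:
    """Group a flat Books list into Book elements (hypothetical helper)."""
    flat = ET.fromstring(flat_xml)
    books = ET.Element("Books")
    book = None
    authors = None
    for child in flat:
        if child.tag == "Title":
            # Assumption: a Title always opens a new Book.
            book = ET.SubElement(books, "Book")
            authors = None
        if book is None:
            raise ValueError(f"{child.tag} appears before any Title")
        if child.tag == "Author":
            if authors is None:
                # The first Author of a Book opens the Authors wrapper.
                authors = ET.SubElement(book, "Authors")
            parent = authors
        else:
            parent = book
        ET.SubElement(parent, child.tag).text = child.text
    return books


structured = add_structure(FLAT)
print(ET.tostring(structured, encoding="unicode"))
```

The result has one Book per Title, with each Book's Author elements wrapped in a single Authors element.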
Parsing
[Diagram: the flat (linear) Books document goes in, the parse step runs, and the structured Books document, with Book and Authors elements added, comes out.]

From the book: “Parsing Techniques”
Parsing is the process of structuring a linear representation in accordance with a given grammar.
The “linear representation” may be:
  a flat sequence of XML elements
  a sentence
  a computer program
  a knitting pattern
  a sequence of geological strata
  a piece of music
  actions of ritual behavior

Grammar
A grammar is a succinct description of the structure.
Here is a grammar for Books:
Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

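One way to make the Books grammar concrete is to write it down as data. The Python sketch below is an assumption for illustration, not something from the slides: each rule becomes a regular expression over the names of an element's children, and the hypothetical conforms function checks whether a document follows the rules.

```python
import re
import xml.etree.ElementTree as ET

# Each rule maps an element name to a regular expression over the
# space-terminated names of its children. None marks a text-only leaf.
# (This encoding is an assumption made for the example.)
GRAMMAR = {
    "Books": r"(Book )+",                                # Books -> Book+
    "Book": r"Title Authors Date ISBN Publisher ",       # Book -> Title Authors Date ISBN Publisher
    "Authors": r"(Author )+",                            # Authors -> Author+
    "Title": None,
    "Author": None,
    "Date": None,
    "ISBN": None,
    "Publisher": None,
}


def conforms(element: ET.Element) -> bool:
    """Return True if the element and all its descendants follow GRAMMAR."""
    if element.tag not in GRAMMAR:
        return False
    rule = GRAMMAR[element.tag]
    if rule is None:
        # Leaf rule: text only, no child elements allowed.
        return len(element) == 0
    child_names = "".join(child.tag + " " for child in element)
    if re.fullmatch(rule, child_names) is None:
        return False
    return all(conforms(child) for child in element)


good = ET.fromstring(
    "<Books><Book><Title>T</Title><Authors><Author>A</Author></Authors>"
    "<Date>2007</Date><ISBN>1</ISBN><Publisher>P</Publisher></Book></Books>"
)
flat = ET.fromstring(
    "<Books><Title>T</Title><Author>A</Author>"
    "<Date>2007</Date><ISBN>1</ISBN><Publisher>P</Publisher></Books>"
)
print(conforms(good), conforms(flat))  # True False
```

The flat document fails because the children of Books are not Book elements, which is exactly the structure that parsing has to add.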
Parsing
[Diagram: the parser takes two inputs, the grammar (the Books grammar) and the linear representation (the flat Books document), and produces the structured representation (the Books document with Book and Authors elements added).]

Parsing Techniques
Over the last 50 years many parsing techniques have been created.
Some parsing techniques work from the starting grammar rule down to the bottom grammar rules. These are called top-down parsing techniques.
Other parsing techniques work from the bottom grammar rules up to the starting grammar rule. These are called bottom-up parsing techniques.
The following slides show how to apply a powerful bottom-up parsing technique to the Books example.

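For contrast with the bottom-up technique, here is a minimal top-down (recursive descent) sketch in Python. It is an illustration under assumptions, not the author's implementation: it works on a list of element names only, it hard-codes the Books grammar (Books → Book+, Book → Title Authors Date ISBN Publisher, Authors → Author+), and the names parse_top_down and expect are invented for the example.

```python
def parse_top_down(tags):
    """Recursive descent over element names; returns a nested (name, children) tree."""
    pos = 0

    def expect(name):
        # Consume one element name or fail with a readable message.
        nonlocal pos
        if pos >= len(tags) or tags[pos] != name:
            found = tags[pos] if pos < len(tags) else "end of input"
            raise ValueError(f"expected {name}, found {found}")
        pos += 1
        return name

    def authors():
        # Authors -> Author+
        kids = [expect("Author")]
        while pos < len(tags) and tags[pos] == "Author":
            kids.append(expect("Author"))
        return ("Authors", kids)

    def book():
        # Book -> Title Authors Date ISBN Publisher
        return ("Book", [expect("Title"), authors(), expect("Date"),
                         expect("ISBN"), expect("Publisher")])

    def books():
        # Books -> Book+  (the starting grammar rule)
        kids = [book()]
        while pos < len(tags):
            kids.append(book())
        return ("Books", kids)

    return books()


tags = ["Title", "Author", "Author", "Date", "ISBN", "Publisher",
        "Title", "Author", "Date", "ISBN", "Publisher"]
tree = parse_top_down(tags)
print(tree)
```

Each function corresponds to one grammar rule, and the calls start at Books and descend toward the leaves, which is what makes it top-down.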
What does “powerful” mean?
The previous slide said, “… following slides show how to apply a powerful bottom-up parsing technique …”
“Powerful” means the technique can be used with lots of grammars, i.e., it can be used to generate lots of different structures.

Suppose we were to structure the XML from scratch. We might follow these steps:
<Books>
</Books>

<Books>
  <Book>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
    </Authors>
  </Book>
</Books>

Follow these steps (cont.):
<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
    </Authors>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
  </Book>
</Books>

Follow these steps (cont.):
<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
  </Book>
</Books>

and so forth, filling in the second Book then the third Book

Last step: add the last Book’s Publisher
Before the last step, the third Book ends with its ISBN. The last step adds its Publisher:
<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>   <!-- last step adds this -->
  </Book>
</Books>

Alternate view of the steps (a tree view)
Books

Books
  Book

Books
  Book
    Title

Books
  Book
    Title
    Authors

Books
  Book
    Title
    Authors
      Author

Books
  Book
    Title
    Authors
      Author
      Author

Alternate view (cont.)
Books
  Book
    Title
    Authors
      Author
      Author
    Date

Books
  Book
    Title
    Authors
      Author
      Author
    Date
    ISBN

Books
  Book
    Title
    Authors
      Author
      Author
    Date
    ISBN
    Publisher

Alternate view (cont.)
Books
  Book
    Title
    Authors
      Author
      Author
    Date
    ISBN
    Publisher
  Book

and so forth, filling in the second Book then the third Book

Last step: add the last Book’s Publisher
Before the last step, the third Book has no Publisher. The last step adds it:
Books
  Book
    Title
    Authors
      Author
      Author
    Date
    ISBN
    Publisher
  Book
    Title
    Authors
      Author
    Date
    ISBN
    Publisher
  Book
    Title
    Authors
      Author
    Date
    ISBN
    Publisher   (last step adds this)

Terminology: Production Step
<Books>
</Books>

<Books>
  <Book>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
    </Authors>
  </Book>
</Books>

Each step is called a production step.

Top down
The previous slides showed the generation of the structured XML by starting from the top (root element) down to the bottom (leaf nodes).
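The top-down generation can be sketched in Python as a generator that yields a snapshot of the document after each production step. This is an illustration only: the function name production_steps is hypothetical, and the sketch assumes one production step per element, added in document order, as in the steps shown earlier.

```python
import copy
import xml.etree.ElementTree as ET


def production_steps(target: ET.Element):
    """Yield a copy of the growing document after each production step."""
    root = ET.Element(target.tag)
    yield copy.deepcopy(root)  # first step: the empty root element

    def grow(src, dst):
        for child in src:
            node = ET.SubElement(dst, child.tag)
            if len(child) == 0:
                # Leaf elements arrive together with their text.
                node.text = child.text
            yield copy.deepcopy(root)
            yield from grow(child, node)

    yield from grow(target, root)


TARGET = ET.fromstring(
    "<Books><Book><Title>Parsing Techniques</Title>"
    "<Authors><Author>Dick Grune</Author><Author>Ceriel J.H. Jacobs</Author></Authors>"
    "<Date>2007</Date><ISBN>978-0-387-20248-8</ISBN>"
    "<Publisher>Springer</Publisher></Book></Books>"
)
steps = list(production_steps(TARGET))
for step in steps:
    print(ET.tostring(step, encoding="unicode"))
```

For this one-Book target the generator yields nine snapshots: Books, Book, Title, Authors, the two Author elements, Date, ISBN, and Publisher.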
Bottom-up parsing
In bottom-up parsing we work backward: from the last step to the first step.
Let’s begin …
One production step must have been the last, and its result must be visible in the linear representation.
We recognize the rule Publisher → text in the linear representation. This gives us the final step in the production process (and the first step in bottom-up parsing), which matches the last Publisher element:
  <Publisher>Dover Publications</Publisher>

Next
We recognize the rule ISBN → text in the linear representation. This gives us the next-to-last step in the production process (and the second step in bottom-up parsing), which matches the last ISBN element:
  <ISBN>0-486-66697-2</ISBN>

Next
We recognize the rule Date → text in the linear representation. This gives us the third step in bottom-up parsing, which matches the last Date element:
  <Date>2012</Date>

Next
We recognize the rule Author → text in the linear representation. This gives us the fourth step in bottom-up parsing, which matches the last Author element:
  <Author>Gyorgy E. Revesz</Author>

Next
We recognize the rule Authors → Author+ in the linear representation. This gives us the fifth step in bottom-up parsing:
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Authors>
  <Author>Gyorgy E. Revesz</Author>
</Authors>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Next
We recognize the rule Title → text in the linear representation. This gives us the sixth step in bottom-up parsing, which matches the last Title element:
  <Title>Introduction to Formal Languages</Title>

Next
We recognize the rule Book → Title Authors Date ISBN Publisher in the linear representation. This gives us the seventh step in bottom-up parsing:
<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Book>
  <Title>Introduction to Formal Languages</Title>
  <Authors>
    <Author>Gyorgy E. Revesz</Author>
  </Authors>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Book>

See the algorithm?
See how we are working backwards, from the bottom grammar rules up to the starting grammar rule? In the process we are adding structure to the flat (linear) XML – neat!
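The backward walk can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's XSLT implementation: it hard-codes the Books grammar, it scans the flat elements from last to first, and it looks one element to the left to decide when a run of Author elements is complete. The names parse_bottom_up and reduce_to are hypothetical.

```python
import xml.etree.ElementTree as ET

LEAVES = {"Title", "Author", "Date", "ISBN", "Publisher"}   # X -> text rules
BOOK_RHS = ["Title", "Authors", "Date", "ISBN", "Publisher"]


def reduce_to(tag, children):
    """Wrap already-recognized nodes in a new parent element."""
    parent = ET.Element(tag)
    parent.extend(children)
    return parent


def parse_bottom_up(flat: ET.Element) -> ET.Element:
    items = list(flat)
    pending = []  # recognized nodes, kept in document order
    for i in range(len(items) - 1, -1, -1):   # work backward from the last element
        item = items[i]
        if item.tag not in LEAVES:
            raise ValueError(f"unexpected element {item.tag}")
        leaf = ET.Element(item.tag)           # recognize a rule of the form X -> text
        leaf.text = item.text
        pending.insert(0, leaf)
        left = items[i - 1].tag if i > 0 else None
        # Authors -> Author+ : reduce once the run cannot grow further to the left.
        if leaf.tag == "Author" and left != "Author":
            run = 0
            while run < len(pending) and pending[run].tag == "Author":
                run += 1
            pending[:run] = [reduce_to("Authors", pending[:run])]
        # Book -> Title Authors Date ISBN Publisher
        if [node.tag for node in pending[:5]] == BOOK_RHS:
            pending[:5] = [reduce_to("Book", pending[:5])]
    # Books -> Book+ : the starting grammar rule is recognized last.
    if not pending or any(node.tag != "Book" for node in pending):
        raise ValueError("input does not conform to the Books grammar")
    return reduce_to("Books", pending)


FLAT = ET.fromstring(
    "<Books><Title>Parsing Techniques</Title><Author>Dick Grune</Author>"
    "<Author>Ceriel J.H. Jacobs</Author><Date>2007</Date>"
    "<ISBN>978-0-387-20248-8</ISBN><Publisher>Springer</Publisher>"
    "<Title>Introduction to Graph Theory</Title><Author>Richard J. Trudeau</Author>"
    "<Date>1993</Date><ISBN>0-486-67870-9</ISBN>"
    "<Publisher>Dover Publications</Publisher></Books>"
)
result = parse_bottom_up(FLAT)
print(ET.tostring(result, encoding="unicode"))
```

For each Book the order of recognition is Publisher, ISBN, Date, Author, Authors, Title, Book, which is the reverse of the production order.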
Terminology: Reduction
In bottom-up parsing a collection of symbols is recognized as derived from a symbol. For example, Title, Authors, Date, ISBN, Publisher is derived from Book:
  Book
    Title
    Authors
    Date
    ISBN
    Publisher
Title, Authors, Date, ISBN, Publisher is reduced to Book.
So the bottom-up parsing process is a reduction process.

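A single reduction can be shown on a plain list of symbols. The Python helper below is a hypothetical sketch, not part of the slides: reduce_once replaces the rightmost occurrence of a rule's right-hand side with its left-hand side, and returns None when the right-hand side does not occur.

```python
def reduce_once(symbols, lhs, rhs):
    """Replace the rightmost occurrence of rhs in symbols with lhs."""
    n = len(rhs)
    for start in range(len(symbols) - n, -1, -1):
        if symbols[start:start + n] == rhs:
            return symbols[:start] + [lhs] + symbols[start + n:]
    return None  # the rule does not apply


BOOK = ["Title", "Authors", "Date", "ISBN", "Publisher"]
symbols = ["Book", "Book"] + BOOK
print(reduce_once(symbols, "Book", BOOK))  # ['Book', 'Book', 'Book']
```

Here the five symbols Title, Authors, Date, ISBN, Publisher at the end of the list are reduced to Book.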
Build your own bottom-up parser!
You now have enough knowledge to go off and build your own bottom-up parser.

I implemented a bottom-up parser
I used XSLT to implement a bottom-up parser.
If you would like to give my implementation a go, here is the XSLT program and a sample flat (linear) input XML document:
http://www.xfront.com/parsing-techniques/bottom-up-parser/bottom-up-parser-for-Books.xsl
http://www.xfront.com/parsing-techniques/bottom-up-parser/Books.xml