Slide1

Parsing for XML Developers

Roger L. Costello

28 September 2014

Slide2

Flat XML Document

You might receive an XML document that has no structure. For example, this XML document contains a flat (linear) list of Book data:

<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>

Slide3

Give it structure to facilitate processing

<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>

Slide4

That’s parsing!

Parsing is taking a flat (linear) sequence of items and adding structure so that the result conforms to a grammar.
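To make the definition concrete, here is a minimal Python sketch that adds structure to a flat list of elements. It is not a general parsing technique: the two grouping rules (a Title starts a new Book, a run of Author elements is wrapped in Authors) are hard-coded assumptions read off the example, and the function name add_structure is hypothetical.

```python
import xml.etree.ElementTree as ET

FLAT = """<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>"""


def add_structure(flat_root):
    """Group a flat list of elements into Book elements (hand-written rules)."""
    books = ET.Element("Books")
    book = None
    authors = None
    for item in flat_root:
        if item.tag == "Title":
            # Assumption: every Book starts with its Title.
            book = ET.SubElement(books, "Book")
            authors = None
        if book is None:
            raise ValueError(f"unexpected <{item.tag}> before the first <Title>")
        if item.tag == "Author":
            # Wrap a run of Author elements in a single Authors element.
            if authors is None:
                authors = ET.SubElement(book, "Authors")
            parent = authors
        else:
            parent = book
        child = ET.SubElement(parent, item.tag)
        child.text = item.text
    return books


structured = add_structure(ET.fromstring(FLAT))
print(ET.tostring(structured, encoding="unicode"))
```

The sketch works only because the grouping rules are baked into the code. The slides that follow replace such hand-written rules with a grammar.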

Slide5

Parsing

<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>

parse ↓

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>

Slide6

From the book: “Parsing Techniques”

Parsing is the process of structuring a linear representation in accordance with a given grammar.

The “linear representation” may be:

a flat sequence of XML elements
a sentence
a computer program
a knitting pattern
a sequence of geological strata
a piece of music
actions of ritual behavior

Slide7

Grammar

A grammar is a succinct description of the structure.

Here is a grammar for Books:

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

Slide8

Parsing

Grammar

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

Linear representation

<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>

parser ↓

Structured representation

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>
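The grammar on this slide can be written down as plain data, which makes it possible to check whether a document already has the structure the grammar describes. The Python sketch below is an illustration under stated assumptions: the dictionary encoding, the trailing "+" convention, and the function name conforms are all hypothetical, and the check handles only the rule shapes that occur in the Books grammar.

```python
import xml.etree.ElementTree as ET

# The Books grammar as data: each rule maps a symbol to its right-hand side.
# A trailing "+" means "one or more"; "text" marks a leaf that holds text only.
GRAMMAR = {
    "Books": ["Book+"],
    "Book": ["Title", "Authors", "Date", "ISBN", "Publisher"],
    "Authors": ["Author+"],
    "Title": "text",
    "Author": "text",
    "Date": "text",
    "ISBN": "text",
    "Publisher": "text",
}


def conforms(element):
    """Return True if element and all of its descendants follow GRAMMAR."""
    rule = GRAMMAR.get(element.tag)
    if rule is None:
        return False
    children = list(element)
    if rule == "text":
        return not children
    tags = [child.tag for child in children]
    if len(rule) == 1 and rule[0].endswith("+"):
        # One-or-more repetition of a single symbol.
        ok = len(tags) >= 1 and all(tag == rule[0][:-1] for tag in tags)
    else:
        ok = tags == rule
    return ok and all(conforms(child) for child in children)


flat = ET.fromstring(
    "<Books><Title>T</Title><Author>A</Author><Date>D</Date>"
    "<ISBN>I</ISBN><Publisher>P</Publisher></Books>"
)
structured = ET.fromstring(
    "<Books><Book><Title>T</Title><Authors><Author>A</Author></Authors>"
    "<Date>D</Date><ISBN>I</ISBN><Publisher>P</Publisher></Book></Books>"
)
print(conforms(flat))        # False: Books must contain only Book elements
print(conforms(structured))  # True
```

A check like this only answers whether a document conforms. Producing the conforming structure from the flat input is the parser's job.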

Slide9

Parsing Techniques

Over the last 50 years many parsing techniques have been created.

Some parsing techniques work from the starting grammar rule down to the bottom grammar rules. These are called top-down parsing techniques.

Other parsing techniques work from the bottom grammar rules up to the starting grammar rule. These are called bottom-up parsing techniques.

The following slides show how to apply a powerful bottom-up parsing technique to the Books example.

Slide10

What does “powerful” mean?

The previous slide said, “… following slides show how to apply a powerful bottom-up parsing technique …”

“Powerful” means the technique can be used with lots of grammars, i.e., it can be used to generate lots of different structures.

Slide11

Suppose we were to structure the XML from scratch. We might follow these steps:

<Books>
</Books>

<Books>
  <Book>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
    </Authors>
  </Book>
</Books>

continued on next slide

Slide12

Follow these steps (cont.):

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
    </Authors>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
  </Book>
</Books>

continued on next slide

Slide13

Follow these steps (cont.):

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
  </Book>
</Books>

and so forth, filling in the second Book then the third Book

Slide14

Last step: add the last Book’s Publisher

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>

The last step adds the Publisher element of the third Book.

Slide15

Alternate view of the steps (a tree view)

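[Tree diagrams, one per production step; the children of a node are listed in brackets after it]

Books
Books [ Book ]
Books [ Book [ Title ] ]
Books [ Book [ Title, Authors ] ]
Books [ Book [ Title, Authors [ Author ] ] ]
Books [ Book [ Title, Authors [ Author, Author ] ] ]

continued on next slide

Slide16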

Alternate view (cont.)

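[Tree diagrams, one per production step; the children of a node are listed in brackets after it]

Books [ Book [ Title, Authors [ Author, Author ], Date ] ]
Books [ Book [ Title, Authors [ Author, Author ], Date, ISBN ] ]
Books [ Book [ Title, Authors [ Author, Author ], Date, ISBN, Publisher ] ]

continued on next slide

Slide17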

Alternate view (cont.)

[Tree diagram; the children of a node are listed in brackets after it]

Books [ Book [ Title, Authors [ Author, Author ], Date, ISBN, Publisher ], Book ]

and so forth, filling in the second Book then the third Book

Slide18

Last step: add the last Book’s Publisher

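[Tree diagrams before and after the last step; the children of a node are listed in brackets after it]

Books [ Book [ Title, Authors [ Author, Author ], Date, ISBN, Publisher ], Book [ Title, Authors [ Author ], Date, ISBN, Publisher ], Book [ Title, Authors [ Author ], Date, ISBN ] ]

Books [ Book [ Title, Authors [ Author, Author ], Date, ISBN, Publisher ], Book [ Title, Authors [ Author ], Date, ISBN, Publisher ], Book [ Title, Authors [ Author ], Date, ISBN, Publisher ] ]

The last step adds the Publisher node of the third Book.

Slide19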

Terminology: Production Step

<Books>
</Books>

<Books>
  <Book>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
    </Authors>
  </Book>
</Books>

Each step is called a production step.

Slide20

Top down

The previous slides showed the generation of the structured XML by starting from the top (root element) down to the bottom (leaf nodes).
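The top-down production steps can be replayed mechanically. The Python sketch below walks a finished document from the root down, in document order, and yields the partial document after each production step. It only illustrates the order of the steps; the function name production_steps and the one-book sample are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

STRUCTURED = (
    "<Books><Book>"
    "<Title>Parsing Techniques</Title>"
    "<Authors><Author>Dick Grune</Author><Author>Ceriel J.H. Jacobs</Author></Authors>"
    "<Date>2007</Date><ISBN>978-0-387-20248-8</ISBN><Publisher>Springer</Publisher>"
    "</Book></Books>"
)


def production_steps(target):
    """Yield the partial document after each top-down production step."""
    partial_root = ET.Element(target.tag)

    def grow(source, partial):
        # One production step: the element appears, with its text if it has any.
        partial.text = source.text
        yield ET.tostring(partial_root, encoding="unicode", short_empty_elements=False)
        # Then its children are produced, left to right.
        for child in source:
            yield from grow(child, ET.SubElement(partial, child.tag))

    yield from grow(target, partial_root)


steps = list(production_steps(ET.fromstring(STRUCTURED)))
for number, step in enumerate(steps[:4], start=1):
    print(number, step)
```

For the one-book sample this produces nine steps, one per element, and the final step reproduces the whole document. Bottom-up parsing undoes these steps starting from the last one.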

Slide21

Bottom-up parsing

In bottom-up parsing we work backward: from the last step to the first step.

Slide22

Let’s begin …

One production step must have been the last, and its result must be visible in the linear representation.

We recognize the rule Publisher → text in the linear representation. This gives us the final step in the production process (and the first step in bottom-up parsing):

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Slide23

Next

We recognize the rule ISBN → text in the linear representation. This gives us the next-to-last step in the production process (and the second step in bottom-up parsing):

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Slide24

Next

We recognize the rule Date → text in the linear representation. This gives us the third step in bottom-up parsing:

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Slide25

Next

We recognize the rule Author → text in the linear representation. This gives us the fourth step in bottom-up parsing:

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Slide26

Next

We recognize the rule Authors → Author+ in the linear representation. This gives us the fifth step in bottom-up parsing:

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Authors>
  <Author>Gyorgy E. Revesz</Author>
</Authors>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Slide27

Next

We recognize the rule Title → text in the linear representation. This gives us the sixth step in bottom-up parsing:

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Authors>
  <Author>Gyorgy E. Revesz</Author>
</Authors>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Slide28

Next

We recognize the rule Book → Title Authors Date ISBN Publisher in the linear representation. This gives us the seventh step in bottom-up parsing:

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Book>
  <Title>Introduction to Formal Languages</Title>
  <Authors>
    <Author>Gyorgy E. Revesz</Author>
  </Authors>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Book>

Slide29

See the algorithm?

See how we are working backwards, from the bottom grammar rules up to the starting grammar rule? In the process we are adding structure to the flat (linear) XML – neat!
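One way to sketch the algorithm in Python is shown below. It is an interpretation of the slides, not the author's XSLT implementation. The assumptions: the input is a list of (name, text) pairs, so the leaf rules such as Title → text count as already recognized; the list is scanned from right to left, as in the preceding slides; and a run of Author items is reduced to Authors as soon as the next item to the left is not an Author. The function name parse_bottom_up and the tuple representation are hypothetical.

```python
def parse_bottom_up(items):
    """Reduce a flat list of (name, text) leaves to one Books node, right to left.

    Returns the tree and the names of the structural rules in the order applied.
    """
    book_rhs = ["Title", "Authors", "Date", "ISBN", "Publisher"]
    stack = []  # the part already processed, leftmost node first
    trace = []  # left-hand sides of the rules applied, in order

    def reduce_authors():
        # Authors -> Author+ : wrap the run of Author nodes at the front.
        run = 0
        while run < len(stack) and stack[run][0] == "Author":
            run += 1
        if run:
            stack[:run] = [("Authors", stack[:run])]
            trace.append("Authors")

    for item in reversed(items):
        if item[0] != "Author":
            # A run of Authors is complete once a non-Author item arrives.
            reduce_authors()
        stack.insert(0, item)
        if [node[0] for node in stack[:5]] == book_rhs:
            # Book -> Title Authors Date ISBN Publisher
            stack[:5] = [("Book", stack[:5])]
            trace.append("Book")

    if not stack or any(node[0] != "Book" for node in stack):
        raise ValueError("input does not conform to the Books grammar")
    trace.append("Books")  # Books -> Book+
    return ("Books", stack), trace


ITEMS = [
    ("Title", "Parsing Techniques"),
    ("Author", "Dick Grune"),
    ("Author", "Ceriel J.H. Jacobs"),
    ("Date", "2007"),
    ("ISBN", "978-0-387-20248-8"),
    ("Publisher", "Springer"),
    ("Title", "Introduction to Graph Theory"),
    ("Author", "Richard J. Trudeau"),
    ("Date", "1993"),
    ("ISBN", "0-486-67870-9"),
    ("Publisher", "Dover Publications"),
]

tree, trace = parse_bottom_up(ITEMS)
print(trace)  # ['Authors', 'Book', 'Authors', 'Book', 'Books']
```

The trace shows the last Book being completed first and the starting rule Books → Book+ being applied last, which is the reverse of the top-down production order.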

Slide30

Terminology: Reduction

In bottom-up parsing, a sequence of symbols is recognized as having been derived from a single symbol. For example, the sequence Title, Authors, Date, ISBN, Publisher is derived from Book:

[Tree diagram: Book with the children Title, Authors, Date, ISBN, Publisher]

Title, Authors, Date, ISBN, Publisher is reduced to Book.

So the bottom-up parsing process is a reduction process.

Slide31

Build your own bottom-up parser!

You now have enough knowledge that you can go off and build your own bottom-up parser.

Slide32

I implemented a bottom-up parser

I used XSLT to implement a bottom-up parser.

If you would like to give my implementation a go, here is the XSLT program and a sample flat (linear) input XML document:

http://www.xfront.com/parsing-techniques/bottom-up-parser/bottom-up-parser-for-Books.xsl
http://www.xfront.com/parsing-techniques/bottom-up-parser/Books.xml