Slide1

Recursive Descent Parsing for XML Developers

Roger L. Costello
15 October 2014

Slide2

Table of Contents

Introduction to parsing in general, recursive descent parsing in particular
Example #1: How to do recursive descent parsing on Book data
Example #2: How to do recursive descent parsing for a grammar that contains alternatives
Limitations of recursive descent parsing

Slide3

Flat XML Document

You might receive an XML document that has no structure. For example, this XML document contains a flat (linear) list of Book data:

<input>
    <Title>Parsing Techniques</Title>
    <Author>Dick Grune</Author>
    <Author>Ceriel J.H. Jacobs</Author>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
    <Title>Introduction to Graph Theory</Title>
    <Author>Richard J. Trudeau</Author>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
    <Title>Introduction to Formal Languages</Title>
    <Author>Gyorgy E. Revesz</Author>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
</input>

Slide4

Give it structure to facilitate processing

<input>
    <Title>Parsing Techniques</Title>
    <Author>Dick Grune</Author>
    <Author>Ceriel J.H. Jacobs</Author>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
    <Title>Introduction to Graph Theory</Title>
    <Author>Richard J. Trudeau</Author>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
    <Title>Introduction to Formal Languages</Title>
    <Author>Gyorgy E. Revesz</Author>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
</input>

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
        <Date>2007</Date>
        <ISBN>978-0-387-20248-8</ISBN>
        <Publisher>Springer</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Graph Theory</Title>
        <Authors>
            <Author>Richard J. Trudeau</Author>
        </Authors>
        <Date>1993</Date>
        <ISBN>0-486-67870-9</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Formal Languages</Title>
        <Authors>
            <Author>Gyorgy E. Revesz</Author>
        </Authors>
        <Date>2012</Date>
        <ISBN>0-486-66697-2</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
</Books>

Slide5

That’s parsing!

Parsing is taking a flat (linear) sequence of items and adding structure so that the result conforms to a grammar.

Slide6

Parsing

<input>
    <Title>Parsing Techniques</Title>
    <Author>Dick Grune</Author>
    <Author>Ceriel J.H. Jacobs</Author>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
    <Title>Introduction to Graph Theory</Title>
    <Author>Richard J. Trudeau</Author>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
    <Title>Introduction to Formal Languages</Title>
    <Author>Gyorgy E. Revesz</Author>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
</input>

parse

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
        <Date>2007</Date>
        <ISBN>978-0-387-20248-8</ISBN>
        <Publisher>Springer</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Graph Theory</Title>
        <Authors>
            <Author>Richard J. Trudeau</Author>
        </Authors>
        <Date>1993</Date>
        <ISBN>0-486-67870-9</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Formal Languages</Title>
        <Authors>
            <Author>Gyorgy E. Revesz</Author>
        </Authors>
        <Date>2012</Date>
        <ISBN>0-486-66697-2</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
</Books>

Slide7

From the book: “Parsing Techniques”

Parsing is the process of structuring a linear representation in accordance with a given grammar. The “linear representation” may be:
a flat sequence of XML elements
a sentence
a computer program
a knitting pattern
a sequence of geological strata
a piece of music
actions of ritual behavior

Slide8

Grammar

A grammar is a succinct description of the structure.

Here is a grammar for Books:

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

Slide9
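The Books grammar could be held in a program as plain data. The following Python sketch is one hypothetical encoding, not something taken from the slides: the names GRAMMAR, TERMINALS and is_terminal are assumptions, and the five leaf elements are treated as terminals because the slides treat each of them as a single token.

```python
# Hypothetical encoding of the Books grammar. Each non-terminal maps to the
# symbols on its right-hand side; a trailing "+" means "one or more".
GRAMMAR = {
    "Books": ["Book+"],
    "Book": ["Title", "Authors", "Date", "ISBN", "Publisher"],
    "Authors": ["Author+"],
}

# Assumption: the leaf elements are terminals (tokens), as in the slides.
TERMINALS = {"Title", "Author", "Date", "ISBN", "Publisher"}


def is_terminal(symbol: str) -> bool:
    """Return True if the symbol (ignoring a trailing '+') is a token name."""
    return symbol.rstrip("+") in TERMINALS
```

Keeping the grammar as data makes it easy to see which symbols need a parsing function (the keys of GRAMMAR) and which are matched directly against the input.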

Parsing

Grammar

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

Linear representation

<input>
    <Title>Parsing Techniques</Title>
    <Author>Dick Grune</Author>
    <Author>Ceriel J.H. Jacobs</Author>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
    <Title>Introduction to Graph Theory</Title>
    <Author>Richard J. Trudeau</Author>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
    <Title>Introduction to Formal Languages</Title>
    <Author>Gyorgy E. Revesz</Author>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
</input>

parser

Structured representation

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
        <Date>2007</Date>
        <ISBN>978-0-387-20248-8</ISBN>
        <Publisher>Springer</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Graph Theory</Title>
        <Authors>
            <Author>Richard J. Trudeau</Author>
        </Authors>
        <Date>1993</Date>
        <ISBN>0-486-67870-9</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Formal Languages</Title>
        <Authors>
            <Author>Gyorgy E. Revesz</Author>
        </Authors>
        <Date>2012</Date>
        <ISBN>0-486-66697-2</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
</Books>

Slide10

Alternate view of the parser output

Grammar

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

Linear representation

<input>
    <Title>Parsing Techniques</Title>
    <Author>Dick Grune</Author>
    <Author>Ceriel J.H. Jacobs</Author>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
    <Title>Introduction to Graph Theory</Title>
    <Author>Richard J. Trudeau</Author>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
    <Title>Introduction to Formal Languages</Title>
    <Author>Gyorgy E. Revesz</Author>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
</input>

parser

Parse tree

Books
    Book
        Title
            Parsing Techniques
        Authors
            Author
                Dick Grune
            Author
                Ceriel J.H. Jacobs
        Date
            2007
        ISBN
            978-0-387-20248-8
        Publisher
            Springer
    Book
        Title
            Introduction to Graph Theory
        Authors
            Author
                Richard J. Trudeau
        Date
            1993
        ISBN
            0-486-67870-9
        Publisher
            Dover Publications
    Book
        Title
            Introduction to Formal Languages
        Authors
            Author
                Gyorgy E. Revesz
        Date
            2012
        ISBN
            0-486-66697-2
        Publisher
            Dover Publications

Slide11

Parsing Techniques

Over the last 50 years many parsing techniques have been created.

Some parsing techniques work from the starting grammar rule to the bottom. Those are called top-down parsing techniques.

Other parsing techniques work from the bottom grammar rules to the starting grammar rule. Those are called bottom-up parsing techniques.

The following slides explain the “recursive descent parsing technique.” It is a top-down parsing technique.

Slide12

Terminology: Token

A token is an atomic (indivisible) unit.
Each item in the input is a token.
After parsing, the tokens will be leaf nodes.

Slide13

The input consists of a sequence of tokens

<input>
    <Title>Parsing Techniques</Title>
    <Author>Dick Grune</Author>
    <Author>Ceriel J.H. Jacobs</Author>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
    <Title>Introduction to Graph Theory</Title>
    <Author>Richard J. Trudeau</Author>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
    <Title>Introduction to Formal Languages</Title>
    <Author>Gyorgy E. Revesz</Author>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
</input>

Each of these child elements is a token.

This input consists of 16 tokens.
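To make the token count concrete, here is a minimal Python sketch that reads the flat input with the standard library's xml.etree.ElementTree and lists its tokens as (name, text) pairs. The names FLAT_INPUT and tokenize are assumptions made for illustration; the slides do not prescribe a tokenizer.

```python
import xml.etree.ElementTree as ET

FLAT_INPUT = """
<input>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</input>
"""


def tokenize(xml_text):
    """Return one (tag, text) pair per child element of the root, in order."""
    root = ET.fromstring(xml_text.strip())
    return [(child.tag, (child.text or "").strip()) for child in root]


tokens = tokenize(FLAT_INPUT)
print(len(tokens))  # number of tokens in the flat input
```

Each pair corresponds to one child element of <input>, so the list length is the token count stated on the slide.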

Slide14

After parsing the tokens will be leaf nodes

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
        <Date>2007</Date>
        <ISBN>978-0-387-20248-8</ISBN>
        <Publisher>Springer</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Graph Theory</Title>
        <Authors>
            <Author>Richard J. Trudeau</Author>
        </Authors>
        <Date>1993</Date>
        <ISBN>0-486-67870-9</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Formal Languages</Title>
        <Authors>
            <Author>Gyorgy E. Revesz</Author>
        </Authors>
        <Date>2012</Date>
        <ISBN>0-486-66697-2</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
</Books>

The leaf elements are the tokens (terminal symbols).

Slide15

Another view of the tokens, after parsing

Books
    Book
        Title
            Parsing Techniques
        Authors
            Author
                Dick Grune
            Author
                Ceriel J.H. Jacobs
        Date
            2007
        ISBN
            978-0-387-20248-8
        Publisher
            Springer
    Book
        Title
            Introduction to Graph Theory
        Authors
            Author
                Richard J. Trudeau
        Date
            1993
        ISBN
            0-486-67870-9
        Publisher
            Dover Publications
    Book
        Title
            Introduction to Formal Languages
        Authors
            Author
                Gyorgy E. Revesz
        Date
            2012
        ISBN
            0-486-66697-2
        Publisher
            Dover Publications

Slide16

Parsing structures the input by wrapping the tokens in non-terminal symbols

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
        <Date>2007</Date>
        <ISBN>978-0-387-20248-8</ISBN>
        <Publisher>Springer</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Graph Theory</Title>
        <Authors>
            <Author>Richard J. Trudeau</Author>
        </Authors>
        <Date>1993</Date>
        <ISBN>0-486-67870-9</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
    <Book>
        <Title>Introduction to Formal Languages</Title>
        <Authors>
            <Author>Gyorgy E. Revesz</Author>
        </Authors>
        <Date>2012</Date>
        <ISBN>0-486-66697-2</ISBN>
        <Publisher>Dover Publications</Publisher>
    </Book>
</Books>

The wrapper elements (Books, Book, Authors) are the non-terminal symbols.

Slide17

Recursive descent parsing

Recursive descent parsing works like this:

Start at the grammar’s start symbol and output it. In our grammar, the start symbol is Books, so output <Books>.

Progress through each grammar rule. For a non-terminal symbol, output it. For a terminal symbol (i.e., token), check the token in the input stream for a match with the terminal symbol; if it matches, output it.

Slide18

Initial

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Start with the grammar’s start symbol and the first token in the input stream.

Slide19

Output the start symbol

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Output:

<Books>
</Books>

Slide20

Grammar says there must be at least one Book

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

So the input stream must contain all the tokens for at least one Book. Let’s process the grammar rule for Book.

Slide21

Output <Book>

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Output:

<Books>
    <Book>
    </Book>
</Books>

Slide22

Grammar says the token in the input stream must be Title

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Yea, the input token matches the grammar rule.

Output:

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
    </Book>
</Books>

Slide23

Grammar: after Title must be Authors

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

So the input stream must contain Author tokens. Let’s process the rule for Authors.

Slide24

Output <Authors>

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Output:

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
        </Authors>
    </Book>
</Books>

Slide25

Grammar says the next token in the input stream must be an Author token

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Yea, the input token matches the grammar rule.

Output:

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
        </Authors>
    </Book>
</Books>

Slide26

Grammar says the next token in the input stream may be an Author token

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Another Author match.

Output:

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
    </Book>
</Books>

Slide27

The next token in the input stream is not an Author token

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

So, return to the caller (i.e., return to the Book rule).

Slide28

Grammar says the input stream token must be a Date token

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Yea, the input token matches the grammar rule.

Output:

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
        <Date>2007</Date>
    </Book>
</Books>

Slide29

Grammar says the input stream token must be an ISBN token

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Yea, the input token matches the grammar rule.

Output:

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
        <Date>2007</Date>
        <ISBN>978-0-387-20248-8</ISBN>
    </Book>
</Books>

Slide30

Grammar says the input stream token must be a Publisher token

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Yea, the input token matches the grammar rule.

Output:

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
        <Date>2007</Date>
        <ISBN>978-0-387-20248-8</ISBN>
        <Publisher>Springer</Publisher>
    </Book>
</Books>

Slide31

We’ve completed structuring the first 6 input tokens

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Output:

<Books>
    <Book>
        <Title>Parsing Techniques</Title>
        <Authors>
            <Author>Dick Grune</Author>
            <Author>Ceriel J.H. Jacobs</Author>
        </Authors>
        <Date>2007</Date>
        <ISBN>978-0-387-20248-8</ISBN>
        <Publisher>Springer</Publisher>
    </Book>
</Books>

Slide32

Completed the Book rule

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

We’ve finished processing the Book rule, so return to the caller (i.e., the Books rule).

Slide33

Begin work on structuring the next Book

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Slide34

Implementation

The following slides show, step by step, how to implement a recursive descent parser.

Slide35

Step 1

Create a function for each non-terminal symbol in the grammar:

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+

Functions

Books() { … }
Book() { … }
Authors() { … }

Slide36

Step 2

Create a global variable, Token, that is used to identify the current position in the input stream. Initialize Token to 0:

Token = 0

Slide37

Step 3

Create a function, get_next_token(). When it is called, it increments the current position in the input stream:

get_next_token() {
    Token = Token + 1
}

Slide38

Step 4

Create a function, token(), and pass it a name, tk. The purpose of this function is to answer the question: “Does the token at the current position in the input stream match tk?”

Slide39

Example of using the token() function

Suppose that during recursive descent parsing the grammar indicates that the next token in the input stream must be “Title.” Suppose the global variable, Token, indicates that we are here in the input stream:

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Slide40

Example (cont.)

The token() function determines that there is a match, so it calls get_next_token() to increment the position in the input stream and returns the token:

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>

Slide41

The token() function

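token(string tk) {
    if (tk != input[position() = Token]) then
        return ()
    else {
        matched = input[position() = Token]
        get_next_token()
        return matched
    }
}

Notice that token() returns empty if there is not a match. When there is a match, the matched token is saved before the position is advanced, so that the function returns the token that matched rather than the one after it.

Slide42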
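The token() function, together with the position counter from Steps 2 and 3, could be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the class name TokenStream is hypothetical, tokens are represented by their element names only, the position is 0-based, and None plays the role of the empty result.

```python
class TokenStream:
    """Hypothetical holder for the input tokens and the current position."""

    def __init__(self, names):
        self.names = list(names)  # element names of the flat input, in order
        self.position = 0         # plays the role of the global Token (0-based)

    def get_next_token(self):
        """Advance to the next position in the input stream."""
        self.position += 1

    def token(self, tk):
        """Return the current token if it is named tk, otherwise None."""
        at_end = self.position >= len(self.names)
        if at_end or self.names[self.position] != tk:
            return None
        matched = self.names[self.position]  # save the match before advancing
        self.get_next_token()
        return matched


stream = TokenStream(["Title", "Author", "Author", "Date", "ISBN", "Publisher"])
first = stream.token("Title")   # matches, so the position advances
second = stream.token("Date")   # current token is Author, so no match
```

A failed match leaves the position unchanged, which is what lets a caller try something else at the same place in the input.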

Motivation for Step 5

Suppose that during recursive descent parsing we are in the Book() function. The Book() function first checks, by calling the token() function, whether the current position of the input stream contains “Title.” Suppose it does. Then, according to the grammar, there must be Authors, Date, ISBN, and then Publisher:

Book → Title Authors Date ISBN Publisher

Slide43

Step 5

Create a function, require(), and pass it a token, found. If the token is empty (i.e., the token() function returned empty because there was not a match) then call the error() function. Otherwise, return the token.

require(element found) {
    if empty(found) then
        error('Invalid input')
    else
        return found
}

Slide44

Step 6

Create an error function, error(). Pass it a string. It outputs the string and then halts the parser.

error(string s) {
    output s
    stop
}

Slide45
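The require() and error() functions from Steps 5 and 6 might look like this in Python. The sketch assumes that None stands for the empty result and that raising an exception is an acceptable substitute for "output the string and stop"; the exception name ParseError is hypothetical.

```python
class ParseError(Exception):
    """Raised by error(): carries the message and stops the parse."""


def error(message):
    # Raising an exception both reports the message and halts the parser.
    raise ParseError(message)


def require(found):
    """Return found unchanged, or signal an error if nothing was matched."""
    if found is None:
        error("Invalid input")
    return found
```

Using an exception rather than exiting the process leaves the decision about how to report the failure to the caller.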

The complete implementation

Recursive descent has been around a long time and people have developed beautiful code for it.

The following two slides collect all the code from the previous slides. I recommend spending some time studying it to appreciate its beauty.

Slide46

Code for a Recursive Descent Parser

Token = 0

main() {
    get_next_token()
    require(input())
}

input() {
    return require(Books())
}

Books() {
    <Books>
        return (require(Book()), optional_additional_Books())
    </Books>
}

optional_additional_Books() {
    book = Book()
    if exists(book) then
        return (book, optional_additional_Books())
}

Book() {
    title = token('Title')
    if exists(title) then
        <Book>
            return (title, require(Authors()), require(token('Date')), require(token('ISBN')), require(token('Publisher')))
        </Book>
}

Authors() {
    <Authors>
        return (require(token('Author')), optional_additional_Authors())
    </Authors>
}

Slide47

optional_additional_Authors() {
    author = token('Author')
    if exists(author) then
        return (author, optional_additional_Authors())
}

token(string tk) {
    if (tk != input[position() = Token]) then
        return ()
    else {
        matched = input[position() = Token]
        get_next_token()
        return matched
    }
}

require(element found) {
    if empty(found) then
        error('Invalid input')
    else
        return found
}

get_next_token() {
    Token = Token + 1
}

Slide48
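For readers who would like to run the Books parser, the pseudocode can be approximated in Python with the standard library's xml.etree.ElementTree. This is a sketch under stated assumptions rather than the author's implementation: the names BooksParser, ParseError, leaf and parse_books are hypothetical, loops replace the recursive optional_additional_* helpers, and the check for leftover tokens is an addition that the slides do not include.

```python
import xml.etree.ElementTree as ET


class ParseError(Exception):
    """Stands in for the slides' error(): report the problem and stop."""


class BooksParser:
    def __init__(self, flat_root):
        self.tokens = list(flat_root)  # child elements of <input>, in order
        self.pos = 0                   # 0-based counterpart of Token

    def token(self, name):
        """Return the current token if its tag is name and advance, else None."""
        if self.pos < len(self.tokens) and self.tokens[self.pos].tag == name:
            matched = self.tokens[self.pos]
            self.pos += 1
            return matched
        return None

    def require(self, found):
        if found is None:
            raise ParseError(f"Invalid input at token {self.pos + 1}")
        return found

    @staticmethod
    def leaf(parent, source):
        """Copy a matched token under parent as a leaf element."""
        ET.SubElement(parent, source.tag).text = (source.text or "").strip()

    def books(self):  # Books → Book+
        books = ET.Element("Books")
        books.append(self.require(self.book()))
        while (book := self.book()) is not None:
            books.append(book)
        # Assumption beyond the slides: leftover tokens are treated as an error.
        if self.pos != len(self.tokens):
            raise ParseError(f"Unexpected token {self.pos + 1}")
        return books

    def book(self):  # Book → Title Authors Date ISBN Publisher
        title = self.token("Title")
        if title is None:
            return None
        book = ET.Element("Book")
        self.leaf(book, title)
        book.append(self.authors())
        for name in ("Date", "ISBN", "Publisher"):
            self.leaf(book, self.require(self.token(name)))
        return book

    def authors(self):  # Authors → Author+
        authors = ET.Element("Authors")
        self.leaf(authors, self.require(self.token("Author")))
        while (author := self.token("Author")) is not None:
            self.leaf(authors, author)
        return authors


def parse_books(xml_text):
    """Parse a flat <input> document and return the structured <Books> element."""
    return BooksParser(ET.fromstring(xml_text)).books()
```

Calling parse_books() on the flat Book data returns a Books element that can be serialized with ET.tostring(result, encoding="unicode"). Each method corresponds to one grammar rule, which is the defining trait of recursive descent.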

XSLT Implementation

I created an XSLT implementation. I tried to mirror the beautiful code shown on the previous slides.

If you would like to give my implementation a go, here is the XSLT program and a sample flat (linear) input XML document:

http://www.xfront.com/parsing-techniques/recursive-descent-parser/books-parser.xsl
http://www.xfront.com/parsing-techniques/recursive-descent-parser/books-test.xml

Slide49

Richer example

The Books example shown on the previous slides was fine for introducing recursive descent parsing.

But it glossed over an important problem: grammar rules with alternatives.

The following example shows how to do recursive descent parsing with a grammar that has alternatives.

Slide50

Expressions

Let’s parse a simple expression language that has these tokens: IDENTIFIER, addition, parentheses, and EoF.

Here are a few examples of expressions:

IDENTIFIER EoF
(IDENTIFIER) EoF
IDENTIFIER + IDENTIFIER EoF
(IDENTIFIER + IDENTIFIER) EoF
IDENTIFIER + (IDENTIFIER + IDENTIFIER) EoF
(IDENTIFIER + IDENTIFIER) + IDENTIFIER EoF
IDENTIFIER + (IDENTIFIER + (IDENTIFIER + IDENTIFIER)) EoF

Each expression ends with an end-of-file (EoF) token.

Slide51

Expression grammar

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

Slide52

Parse tree for: IDENTIFIER EoF

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

input
    expression
        term
            IDENTIFIER
        rest_expression
            ε
    EoF

Slide53

Parser selects the first alternative

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

input
    expression
        term
            IDENTIFIER
        rest_expression
            ε
    EoF

term has two alternatives. The parser selected the first alternative.

Slide54

Parse tree for: (IDENTIFIER) EoF

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

input
    expression
        term
            parenthesized_expression
                (
                expression
                    term
                        IDENTIFIER
                    rest_expression
                        ε
                )
        rest_expression
            ε
    EoF

Slide55

Parser selects the second alternative

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

input
    expression
        term
            parenthesized_expression
                (
                expression
                    term
                        IDENTIFIER
                    rest_expression
                        ε
                )
        rest_expression
            ε
    EoF

term’s second alternative is selected.

Slide56

Question

How does a recursive descent parser know that it should select the first or second alternative?

term → IDENTIFIER | parenthesized_expression

How does the parser know which alternative to select?

Slide57

Answer

The parser doesn’t know.

It tries the first alternative. If that fails, it tries the second alternative (i.e., the parser backtracks and tries the next alternative). It repeats until it finds an alternative that succeeds.

Slide58
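The try-then-backtrack strategy from the Answer slide can be made explicit by saving the input position before each alternative and restoring it when the alternative fails. The Python sketch below is illustrative only: the names Parser and first_of are hypothetical, tokens are plain strings (LP and RP stand for the two parentheses), and parenthesized() is deliberately simplified to accept only a single IDENTIFIER between the parentheses.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def token(self, name):
        """Consume and return the current token if it is name, else None."""
        if self.pos < len(self.tokens) and self.tokens[self.pos] == name:
            self.pos += 1
            return name
        return None

    def first_of(self, *alternatives):
        """Try each alternative in order, rewinding the input after a failure."""
        for alternative in alternatives:
            saved = self.pos
            result = alternative()
            if result is not None:
                return result
            self.pos = saved  # backtrack to where this alternative started
        return None

    def parenthesized(self):
        # Simplification: only '(' IDENTIFIER ')' is accepted here.
        if self.token("LP") and self.token("IDENTIFIER") and self.token("RP"):
            return ["LP", "IDENTIFIER", "RP"]
        return None

    def term(self):
        # term → IDENTIFIER | parenthesized_expression
        return self.first_of(lambda: self.token("IDENTIFIER"), self.parenthesized)
```

With the input LP IDENTIFIER RP EoF, the first alternative fails immediately and the second succeeds. With LP IDENTIFIER EoF, the second alternative consumes two tokens before failing, and first_of() rewinds the position to where it started.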

Processing the first token in the input stream

input → expression EoF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε

Input tokens: ( IDENTIFIER ) EoF

input
    expression
        term
            IDENTIFIER

Try the first alternative, which says the input token must be IDENTIFIER. However, the input token is ( so we must back up and try the next alternative.

Slide59

Implementation of the term() function

term() {
    <term>
        identifier = token('IDENTIFIER')
        if exists(identifier) then
            return (identifier)
        else
            return (require(parenthesized_expression()))
    </term>
}

Check the current token in the input stream to see if it is IDENTIFIER.

Slide60

term() function (cont.)

    if exists(identifier) then return (identifier)

If there is a match, return the token.

Slide61

term() function (cont.)

    else return (require(parenthesized_expression()))

Otherwise, try the other alternative. It is the last one, so it must succeed.

Slide62
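For readers who prefer runnable code to pseudocode, here is a rough Python sketch of term() together with the other grammar rules it depends on. The helper names token() and require() follow the pseudocode on the slides; the Parser class, the ParseError exception and the nested-list result format are assumptions made for this sketch.

```python
from typing import List, Optional


class ParseError(Exception):
    """Raised when a required piece of the grammar is missing."""


class Parser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def token(self, name: str) -> Optional[str]:
        """Return and consume the current token if it is `name`, else None."""
        if self.pos < len(self.tokens) and self.tokens[self.pos] == name:
            self.pos += 1
            return name
        return None

    def require(self, result):
        """The wrapped step must succeed; otherwise parsing stops."""
        if result is None:
            raise ParseError(f"unexpected input at position {self.pos}")
        return result

    def term(self) -> list:
        # term → IDENTIFIER | parenthesized_expression
        identifier = self.token("IDENTIFIER")
        if identifier is not None:
            return ["term", identifier]
        # Last alternative, so it is required to succeed.
        return ["term", self.require(self.parenthesized_expression())]

    def parenthesized_expression(self) -> Optional[list]:
        # parenthesized_expression → '(' expression ')'
        if self.token("(") is None:
            return None
        expression = self.expression()
        self.require(self.token(")"))
        return ["parenthesized_expression", "(", expression, ")"]

    def expression(self) -> list:
        # expression → term rest_expression
        return ["expression", self.term(), self.rest_expression()]

    def rest_expression(self) -> list:
        # rest_expression → '+' expression | ε
        if self.token("+") is None:
            return ["rest_expression"]          # the ε alternative
        return ["rest_expression", "+", self.expression()]

    def input(self) -> list:
        # input → expression EoF
        expression = self.expression()
        self.require(self.token("EoF"))
        return ["input", expression, "EoF"]


print(Parser(["(", "IDENTIFIER", ")", "EoF"]).input())
```

Note that token() only consumes a token when it matches, so a failed first alternative in term() leaves the input untouched and no explicit backtracking step is needed here.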

Let’s represent each expression as XML

Instead of this input:

    IDENTIFIER EoF

our input will be this:

    <input>
        <IDENTIFIER />
        <EoF />
    </input>

Slide63

XML representation (cont.)

Instead of this input:

    (IDENTIFIER) EoF

our input will be this:

    <input>
        <LP />
        <IDENTIFIER />
        <RP />
        <EoF />
    </input>

Slide64

XML representation (cont.)

Instead of this input:

    IDENTIFIER + IDENTIFIER EoF

our input will be this:

    <input>
        <IDENTIFIER />
        <PLUS />
        <IDENTIFIER />
        <EoF />
    </input>

Slide65

XML representation (cont.)

Instead of this input:

    IDENTIFIER + (IDENTIFIER + IDENTIFIER) EoF

our input will be this XML input:

    <input>
        <IDENTIFIER />
        <PLUS />
        <LP />
        <IDENTIFIER />
        <PLUS />
        <IDENTIFIER />
        <RP />
        <EoF />
    </input>

Slide66
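The mapping from a token string to the XML representation is mechanical, so it is easy to script. The following Python sketch is one possible way to do it, using the standard xml.etree.ElementTree module. The element names LP, RP, PLUS, IDENTIFIER and EoF come from the slides; the function name tokens_to_xml and the ELEMENT_NAMES table are made up for this example.

```python
import xml.etree.ElementTree as ET

# Token text → XML element name (element names as used on the slides).
ELEMENT_NAMES = {
    "(": "LP",
    ")": "RP",
    "+": "PLUS",
    "IDENTIFIER": "IDENTIFIER",
    "EoF": "EoF",
}


def tokens_to_xml(text: str) -> ET.Element:
    """Build an <input> element with one empty child element per token."""
    # Put spaces around the punctuation so that split() separates tokens.
    for symbol in "()+":
        text = text.replace(symbol, f" {symbol} ")
    root = ET.Element("input")
    for token in text.split():
        if token not in ELEMENT_NAMES:
            raise ValueError(f"unknown token: {token!r}")
        ET.SubElement(root, ELEMENT_NAMES[token])
    return root


root = tokens_to_xml("IDENTIFIER + (IDENTIFIER + IDENTIFIER) EoF")
print(ET.tostring(root, encoding="unicode"))
```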

Parsing

Input:

    <input>
        <IDENTIFIER />
        <EoF />
    </input>

Parser:

    input → expression EoF
    expression → term rest_expression
    term → IDENTIFIER | parenthesized_expression
    parenthesized_expression → '(' expression ')'
    rest_expression → '+' expression | ε

Output:

    <output>
        <expression>
            <term>
                <IDENTIFIER/>
            </term>
            <rest_expression/>
        </expression>
        <EoF/>
    </output>

Slide67

Parsing (cont.)

Input:

    <input>
        <LP />
        <IDENTIFIER />
        <RP />
        <EoF />
    </input>

Parser:

    input → expression EoF
    expression → term rest_expression
    term → IDENTIFIER | parenthesized_expression
    parenthesized_expression → '(' expression ')'
    rest_expression → '+' expression | ε

Output:

    <output>
        <expression>
            <term>
                <parenthesized_expression>
                    <LP/>
                    <expression>
                        <term>
                            <IDENTIFIER/>
                        </term>
                        <rest_expression/>
                    </expression>
                    <RP/>
                </parenthesized_expression>
            </term>
            <rest_expression/>
        </expression>
        <EoF/>
    </output>

Slide68

Parsing (cont.)

Input:

    <input>
        <IDENTIFIER />
        <PLUS />
        <IDENTIFIER />
        <EoF />
    </input>

Parser:

    input → expression EoF
    expression → term rest_expression
    term → IDENTIFIER | parenthesized_expression
    parenthesized_expression → '(' expression ')'
    rest_expression → '+' expression | ε

Output:

    <output>
        <expression>
            <term>
                <IDENTIFIER/>
            </term>
            <rest_expression>
                <PLUS/>
                <expression>
                    <term>
                        <IDENTIFIER/>
                    </term>
                    <rest_expression/>
                </expression>
            </rest_expression>
        </expression>
        <EoF/>
    </output>

Slide69

Parsing (cont.)

Input:

    <input>
        <IDENTIFIER />
        <PLUS />
        <LP />
        <IDENTIFIER />
        <PLUS />
        <IDENTIFIER />
        <RP />
        <EoF />
    </input>

Parser:

    input → expression EoF
    expression → term rest_expression
    term → IDENTIFIER | parenthesized_expression
    parenthesized_expression → '(' expression ')'
    rest_expression → '+' expression | ε

Output:

    <output>
        <expression>
            <term>
                <IDENTIFIER/>
            </term>
            <rest_expression>
                <PLUS/>
                <expression>
                    <term>
                        <parenthesized_expression>
                            <LP/>
                            <expression>
                                <term>
                                    <IDENTIFIER/>
                                </term>
                                <rest_expression>
                                    <PLUS/>
                                    <expression>
                                        <term>
                                            <IDENTIFIER/>
                                        </term>
                                        <rest_expression/>
                                    </expression>
                                </rest_expression>
                            </expression>
                            <RP/>
                        </parenthesized_expression>
                    </term>
                    <rest_expression/>
                </expression>
            </rest_expression>
        </expression>
        <EoF/>
    </output>

Slide70
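To make the XML-in, XML-out idea of the Parsing slides concrete, here is a compact Python sketch that reads the flat <input> document and builds the nested <output> tree. It is a stand-in written for illustration and is not a translation of the XSLT program; XmlParser, ParseError and the method names are assumptions, while the element names match the slides.

```python
import xml.etree.ElementTree as ET
from typing import Optional


class ParseError(Exception):
    """Raised when the input does not conform to the grammar."""


class XmlParser:
    def __init__(self, input_xml: str) -> None:
        # Each child element of <input> is one token.
        self.tokens = [child.tag for child in ET.fromstring(input_xml)]
        self.pos = 0

    def token(self, name: str) -> Optional[ET.Element]:
        """Consume the current token if it is `name`; return it as an element."""
        if self.pos < len(self.tokens) and self.tokens[self.pos] == name:
            self.pos += 1
            return ET.Element(name)
        return None

    def require(self, node: Optional[ET.Element]) -> ET.Element:
        if node is None:
            raise ParseError(f"unexpected input at token {self.pos}")
        return node

    def expression(self) -> ET.Element:
        # expression → term rest_expression
        node = ET.Element("expression")
        node.append(self.term())
        node.append(self.rest_expression())
        return node

    def term(self) -> ET.Element:
        # term → IDENTIFIER | parenthesized_expression
        node = ET.Element("term")
        child = self.token("IDENTIFIER")
        if child is None:  # compare with None: empty elements are falsy
            child = self.require(self.parenthesized_expression())
        node.append(child)
        return node

    def parenthesized_expression(self) -> Optional[ET.Element]:
        # parenthesized_expression → '(' expression ')'
        lp = self.token("LP")
        if lp is None:
            return None
        node = ET.Element("parenthesized_expression")
        node.append(lp)
        node.append(self.expression())
        node.append(self.require(self.token("RP")))
        return node

    def rest_expression(self) -> ET.Element:
        # rest_expression → '+' expression | ε (ε gives an empty element)
        node = ET.Element("rest_expression")
        plus = self.token("PLUS")
        if plus is not None:
            node.append(plus)
            node.append(self.expression())
        return node

    def parse(self) -> ET.Element:
        # input → expression EoF
        output = ET.Element("output")
        output.append(self.expression())
        output.append(self.require(self.token("EoF")))
        return output


result = XmlParser("<input><LP/><IDENTIFIER/><RP/><EoF/></input>").parse()
print(ET.tostring(result, encoding="unicode"))
```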

XSLT Implementation

I created an XSLT implementation of a recursive descent parser for the expression language. If you would like to give my implementation a go, here are the XSLT program and a sample flat (linear) input XML document:

http://www.xfront.com/parsing-techniques/recursive-descent-parser/expression-parser.xsl
http://www.xfront.com/parsing-techniques/recursive-descent-parser/expression-test.xml

Slide71

Limitations of Recursive Descent Parsers

Recall that in a rule containing alternatives we tried the first alternative; if it failed, we backtracked and tried the second alternative. Searching the alternatives this way is time-consuming, because the parser may re-scan the same tokens for each alternative it tries.

Slide72

Limitations (cont.)

Recursive descent parsers can’t handle left-recursive grammar rules. The parser goes into an infinite loop.

Example: suppose the grammar has this rule:

    expression → expression '-' term

That is a “left-recursive” rule: its right-hand side starts with the same symbol that appears on its left-hand side (i.e., expression). The recursive descent routine for this rule is:

    expression() {
        return expression() and require(token('-')) and require(term())
    }

The routine calls itself before it has consumed any input: (infinite) recursion!

Slide73
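The left-recursion problem is easy to reproduce. The Python sketch below transcribes the rule expression → expression '-' term directly into a routine. The function names and the position-based return values are assumptions for this sketch. In Python the runaway recursion surfaces as a RecursionError once the interpreter's recursion limit is reached, which the terminates() helper catches.

```python
from typing import List, Optional


def term(tokens: List[str], pos: int) -> Optional[int]:
    """Return the position after an IDENTIFIER, or None if there is none."""
    if pos < len(tokens) and tokens[pos] == "IDENTIFIER":
        return pos + 1
    return None


def expression(tokens: List[str], pos: int = 0) -> Optional[int]:
    # expression → expression '-' term
    # The first step calls expression() again at the SAME position,
    # so no token is ever consumed and the call never returns.
    after_expression = expression(tokens, pos)
    if after_expression is None:
        return None
    if after_expression < len(tokens) and tokens[after_expression] == "-":
        return term(tokens, after_expression + 1)
    return None


def terminates(tokens: List[str]) -> bool:
    """Report whether expression() returns instead of recursing forever."""
    try:
        expression(tokens)
    except RecursionError:
        return False
    return True


print(terminates(["IDENTIFIER", "-", "IDENTIFIER"]))
```

The input does not matter here: the routine recurses before it looks at a single token.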

Limitations (cont.)

Suppose we add an array element as a term:

    term → IDENTIFIER | indexed_element | parenthesized_expression
    indexed_element → IDENTIFIER '[' expression ']'

and create a recursive descent parser for the new grammar. The routine for indexed_element will never be tried: when the sequence IDENTIFIER '[' occurs in the input, the first alternative of term will succeed, consume the identifier, and leave the indigestible part '[' expression ']' in the input.

Slide74
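The indexed_element problem can also be seen in a small experiment. In this Python sketch, term() tries its alternatives in the order the grammar lists them and stops at the first success. To keep it short, the expression inside the brackets is simplified to a single IDENTIFIER and the parenthesized_expression alternative is left out; both simplifications, and the helper names, are assumptions of the sketch.

```python
from typing import List, Optional, Tuple


def match(tokens: List[str], pos: int, name: str) -> Optional[int]:
    """Return the position after `name` if it is the current token."""
    if pos < len(tokens) and tokens[pos] == name:
        return pos + 1
    return None


def indexed_element(tokens: List[str], pos: int) -> Optional[int]:
    # indexed_element → IDENTIFIER '[' expression ']'
    # (expression is simplified to one IDENTIFIER in this sketch)
    current: Optional[int] = pos
    for name in ("IDENTIFIER", "[", "IDENTIFIER", "]"):
        current = match(tokens, current, name)
        if current is None:
            return None
    return current


def term(tokens: List[str], pos: int) -> Optional[Tuple[str, int]]:
    # term → IDENTIFIER | indexed_element | ...
    end = match(tokens, pos, "IDENTIFIER")
    if end is not None:
        return ("IDENTIFIER", end)       # first alternative wins
    end = indexed_element(tokens, pos)   # only reached if the first fails
    if end is not None:
        return ("indexed_element", end)
    return None


tokens = ["IDENTIFIER", "[", "IDENTIFIER", "]", "EoF"]
chosen, end = term(tokens, 0)
print(chosen, tokens[end:])
```

Called on its own, indexed_element() does match the whole sequence, but term() never gets that far because its first alternative has already succeeded.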

References – Great Books

Modern Compiler Design (http://www.amazon.com/Modern-Compiler-Design-Dick-Grune/dp/1461446988/ref=sr_1_1?s=books&ie=UTF8&qid=1413408458&sr=1-1&keywords=modern+compiler+design)

Parsing Techniques (http://www.amazon.com/Parsing-Techniques-Practical-Monographs-Computer/dp/038720248X/ref=sr_1_1?s=books&ie=UTF8&qid=1413408496&sr=1-1&keywords=parsing+techniques)