Presentation Transcript


Albert Gatt

Corpora and Statistical Methods

Lecture 11

Probabilistic Context-Free Grammars and beyond

Part 1

Context-free grammars: reminder

Many NLP parsing applications rely on the CFG formalism

Definition: a CFG is a 4-tuple (N, Σ, P, S):

N = a set of non-terminal symbols (e.g. NP, VP)
Σ = a set of terminals (e.g. words); N and Σ are disjoint
P = a set of productions of the form A → β, where A ∈ N and β ∈ (N ∪ Σ)* (any string of terminals and non-terminals)
S = a designated start symbol (usually, "sentence")

CFG Example

S → NP VP
S → Aux NP VP
NP → Det Nom
NP → Proper-Noun
Det → that | the | a
…
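As a concrete illustration, here is a minimal sketch of such a grammar using NLTK's CFG class. The lexical rules for Nom, PropN (standing in for Proper-Noun), Aux and V are invented additions, included only so the toy grammar can actually parse a sentence:

```python
import nltk

# Sketch of the example CFG; the lexical entries are illustrative,
# not part of the original slide.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    S -> Aux NP VP
    NP -> Det Nom
    NP -> PropN
    VP -> V NP
    Det -> 'that' | 'the' | 'a'
    Nom -> 'man' | 'dog'
    PropN -> 'Matt'
    Aux -> 'does'
    V -> 'saw'
""")

# Enumerate all parses the grammar licenses for a sentence.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the man saw a dog".split()):
    print(tree)
```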

Probabilistic CFGs

A CFG where each production has an associated probability

A PCFG is a 5-tuple (N, Σ, P, S, D), where:

D: P → [0,1] is a function assigning each rule in P a probability
usually, probabilities are obtained from a corpus
the most widely used corpus is the Penn Treebank
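A sketch of the corresponding PCFG in NLTK follows; all probabilities here are invented for illustration (NLTK checks that the rules for each left-hand side sum to 1):

```python
import nltk

# Toy PCFG; the probabilities are invented for illustration.
pcfg = nltk.PCFG.fromstring("""
    S -> NP VP [0.8]
    S -> Aux NP VP [0.2]
    NP -> Det Nom [0.6]
    NP -> PropN [0.4]
    VP -> V NP [1.0]
    Det -> 'that' [0.4] | 'the' [0.4] | 'a' [0.2]
    Nom -> 'man' [0.5] | 'dog' [0.5]
    PropN -> 'Matt' [1.0]
    Aux -> 'does' [1.0]
    V -> 'saw' [1.0]
""")
```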

The Penn Treebank

English sentences annotated with syntax trees

built at the University of Pennsylvania

40,000 sentences, about a million words

text from the Wall Street Journal

Other treebanks exist for other languages (e.g. NEGRA for German)
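The standard way to obtain rule probabilities from such a corpus is relative-frequency estimation over treebank productions. A minimal sketch using the Penn Treebank sample shipped with NLTK:

```python
from collections import Counter
from nltk.corpus import treebank  # NLTK's sample of the Penn Treebank (WSJ)

# Maximum-likelihood estimate: P(A -> beta) = count(A -> beta) / count(A)
rule_counts = Counter()
lhs_counts = Counter()
for tree in treebank.parsed_sents():
    for prod in tree.productions():
        rule_counts[prod] += 1
        lhs_counts[prod.lhs()] += 1

rule_prob = {prod: n / lhs_counts[prod.lhs()] for prod, n in rule_counts.items()}
```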

Example tree

(tree diagram of a Penn Treebank sentence; the image is not reproduced in this transcript)

Building a tree: rules

S → NP VP
NP → NNP NNP
NNP → Mr
NNP → Vinken
…

Resulting tree (reconstructed from the slide diagram):

(S (NP (NNP Mr) (NNP Vinken))
   (VP (VBZ is)
       (NP (NN chairman)
           (PP (IN of) (NP (NNP Elsevier))))))
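The same tree can be built and inspected with nltk.Tree; tree.productions() recovers exactly the rules used in its derivation, which is what PCFG estimation counts:

```python
from nltk import Tree

t = Tree.fromstring(
    "(S (NP (NNP Mr) (NNP Vinken))"
    "  (VP (VBZ is) (NP (NN chairman) (PP (IN of) (NP (NNP Elsevier))))))")
for prod in t.productions():
    print(prod)  # S -> NP VP, NP -> NNP NNP, NNP -> 'Mr', ...
```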

Characteristics of PCFGs

In a PCFG, the probability P(A → β) expresses the likelihood that the non-terminal A will expand as β
e.g. the likelihood that S → NP VP (as opposed to S → VP, or S → NP VP PP, or …)
it can be interpreted as a conditional probability: the probability of the expansion, given the LHS non-terminal:
P(A → β) = P(A → β | A)
Therefore, for any non-terminal A, the probabilities of all rules of the form A → β must sum to 1
If this is the case, we say the PCFG is consistent
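Consistency in this sense is easy to check mechanically. A sketch, assuming the toy pcfg defined earlier:

```python
from collections import defaultdict

# For every non-terminal, the probabilities of its rules must sum to 1.
totals = defaultdict(float)
for prod in pcfg.productions():
    totals[prod.lhs()] += prod.prob()

for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, f"rules for {lhs} sum to {total}"
```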

Uses of probabilities in parsing

Disambiguation: given n legal parses of a string, which is the most likely?
e.g. PP-attachment ambiguity can be resolved this way
Speed: parsing is a search problem
search through the space of possible applicable derivations
the search space can be pruned by focusing on the most likely sub-parses of a parse
A parser can also be used as a model to determine the probability of a sentence, given a parse
typical use in speech recognition, where an input utterance can be "heard" as several possible sentences

Using PCFG probabilities

A PCFG assigns a probability to every parse tree t of a string W
e.g. every possible parse (derivation) of a sentence recognised by the grammar
Notation:
G = a PCFG
s = a sentence
t = a particular tree under our grammar
t consists of several nodes n; each node is generated by applying some rule r

Probability of a tree vs. a sentence

P(t) is simply the product of the probabilities of every rule (node) that gives rise to t (i.e. the derivation of t):

P(t) = Π over all nodes n in t of P(r(n))

this is both the joint probability of t and s, and the probability of t alone
why?
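A sketch of this computation, reusing the rule_prob table estimated from the treebank sketch above (an unseen rule would raise a KeyError, so in practice the estimates need smoothing):

```python
import math

def tree_prob(tree, rule_prob):
    """P(t): product of the probabilities of the rules in t's derivation."""
    return math.prod(rule_prob[p] for p in tree.productions())
```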

P(t,s) = P(t)

By the chain rule, P(t,s) = P(t) P(s|t)
But P(s|t) must be 1, since the tree t is a parse of all the words of s

Picking the best parse in a PCFG

A sentence will usually have several parses
we usually want them ranked, or only want the n-best parses
we need to focus on P(t|s,G): the probability of a parse, given our sentence and our grammar
definition of the best parse for s:

t* = argmax over all parses t of s of P(t|s,G)
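NLTK's ViterbiParser implements exactly this argmax search for PCFGs; a usage sketch with the toy pcfg from earlier:

```python
from nltk.parse import ViterbiParser

# Returns the single most probable parse under the PCFG.
parser = ViterbiParser(pcfg)
for tree in parser.parse("the man saw a dog".split()):
    print(tree.prob(), tree)
```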

Picking the best parse in a PCFG

Problem: t can have multiple derivations
e.g. expand left-corner nodes first, expand right-corner nodes first, etc.
so P(t|s,G) should be estimated by summing over all possible derivations
Fortunately, derivation order makes no difference to the final probabilities
we can therefore assume a "canonical derivation" d of t:
P(t) =def P(d)

Probability of a sentence

Simply the sum of the probabilities of all parses of that sentence:

P(s) = Σ over all trees t that "yield" s of P(t,s)

since s is only a sentence if it's recognised by G, i.e. if there is some t for s under G
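A sketch of computing P(s), again assuming the toy pcfg from earlier; NLTK's InsideChartParser enumerates parses of a PCFG together with their probabilities:

```python
from nltk.parse.pchart import InsideChartParser

parser = InsideChartParser(pcfg)
# P(s) = sum of P(t, s) over all trees t that yield s.
sentence_prob = sum(t.prob() for t in parser.parse("the man saw a dog".split()))
```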

Flaws I: Structural independence

The probability of a rule r expanding node n depends only on n
it is independent of other non-terminals
Example: P(NP → Pro) is independent of where the NP is in the sentence
but we know that NP → Pro is much more likely in subject position
Francis et al. (1999), using the Switchboard corpus: 91% of subjects are pronouns; only 34% of objects are pronouns

Flaws II: lexical independence

Vanilla PCFGs ignore lexical material
e.g. P(VP → V NP PP) is independent of the head of NP or PP, or the lexical head V
Examples:
prepositional phrase attachment preferences depend on lexical items; cf.:
dump [sacks into a bin]
dump [sacks] [into a bin] (preferred parse)
coordination ambiguity:
[dogs in houses] and [cats]
[dogs] [in houses and cats]

Weakening the independence assumptions in PCFGs

Lexicalised PCFGs

An attempt to weaken the lexical independence assumption.
Most common technique: mark each phrasal head (N, V, etc.) with its lexical material
this is based on the idea that the most crucial lexical dependencies are between head and dependent
E.g.: Charniak 1997, Collins 1999

Lexicalised PCFGs: Matt walks

Makes probabilities partly dependent on lexical content.
P(VP → VBD | VP) becomes: P(VP → VBD | VP, h(VP) = walk)
NB: normally, we can't assume that all heads of a phrase of category C are equally probable.

Lexicalised tree (reconstructed from the slide diagram):

(S(walks) (NP(Matt) (NNP(Matt) Matt))
          (VP(walk) (VBD(walk) walks)))
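A minimal sketch of head lexicalisation on an nltk.Tree. The head rule here (take the rightmost child's head) is a crude placeholder; real lexicalised parsers such as Collins (1999) use hand-written head tables per category:

```python
from nltk import Tree

def lexicalise(tree):
    """Return (copy of tree with labels like VP(walks), its head word)."""
    if isinstance(tree, str):
        return tree, tree  # a word is its own head
    new_children, heads = zip(*(lexicalise(c) for c in tree))
    head = heads[-1]  # placeholder head rule: rightmost child's head
    return Tree(f"{tree.label()}({head})", list(new_children)), head

t = Tree.fromstring("(S (NP (NNP Matt)) (VP (VBD walks)))")
print(lexicalise(t)[0])
# (S(walks) (NP(Matt) (NNP(Matt) Matt)) (VP(walks) (VBD(walks) walks)))
```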

Practical problems for lexicalised PCFGs

data sparseness: we don't necessarily see all heads of all phrasal categories often enough in the training data
flawed assumptions: lexical dependencies occur elsewhere, not just between head and complement
e.g. in "I got the easier problem of the two to solve", "of the two" and "to solve" become more likely because of the pre-head modifier "easier"

Structural context

The simple way: calculate P(t|s,G) based on the rules in the canonical derivation d of t
this assumes that P(t) is independent of the derivation
we could condition on more structural context
but then we could lose the notion of a canonical derivation, i.e. P(t) could really depend on the derivation!

Structural context: probability of a derivation history

How do we calculate P(t) based on a derivation d?
Observation: P(d) is the probability that a sequence of m rewrite rules r1, …, rm in a derivation yields s
we can use the chain rule for multiplication:

P(d) = P(r1) × P(r2 | r1) × … × P(rm | r1, …, rm-1)

Approach 2: parent annotation

Annotate each node with its parent in the parse tree.
E.g. if an NP has parent S, rename it NP^S
Can partly account for dependencies such as subject-of (NP^S is a subject, NP^VP is an object)

Parent-annotated tree (reconstructed from the slide diagram):

(S (NP^S (NNP^NP Matt)) (VP^S (VBD^VP walks)))
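Parent annotation is a short recursion over nltk trees; a sketch:

```python
from nltk import Tree

def parent_annotate(tree, parent=None):
    """Rename each non-terminal X with parent Y to X^Y."""
    if isinstance(tree, str):
        return tree  # leave words unchanged
    label = f"{tree.label()}^{parent}" if parent else tree.label()
    return Tree(label, [parent_annotate(c, tree.label()) for c in tree])

t = Tree.fromstring("(S (NP (NNP Matt)) (VP (VBD walks)))")
print(parent_annotate(t))  # (S (NP^S (NNP^NP Matt)) (VP^S (VBD^VP walks)))
```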

The main point

Many different parsing approaches differ in what they condition their probabilities on

Other grammar formalisms

Phrase structure vs. Dependency grammar

PCFGs are in the tradition of phrase-structure grammars
Dependency grammar describes syntax in terms of dependencies between words
no non-terminals or phrasal nodes; only lexical nodes with links between them
links are labelled, with labels drawn from a finite list

Dependency Grammar

Example: "I gave him my address" (dependency diagram reconstructed as a list)

<ROOT> → GAVE (main)
GAVE → I (subj)
GAVE → him (dat)
GAVE → address (obj)
address → MY (attr)
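Since there are no phrasal nodes, a dependency analysis is just a set of labelled head-dependent pairs. The example above as Python triples (a representation assumed here for illustration):

```python
# (head, dependent, label) triples for "I gave him my address".
deps = [
    ("<ROOT>", "gave", "main"),
    ("gave", "I", "subj"),
    ("gave", "him", "dat"),
    ("gave", "address", "obj"),
    ("address", "my", "attr"),
]
```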

Dependency grammar

Now often used in probabilistic parsing
Advantages:
directly encodes lexical dependencies
therefore, disambiguation decisions take lexical material into account directly
dependencies are a way of decomposing phrase-structure rules and their probability estimates
estimating the probability of dependencies between two words is less likely to lead to data sparseness problems (see the sketch below)
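A sketch of that last point: relative-frequency estimates over two-word dependency events, using triples like deps above, involve far fewer parameters than full phrase-structure rules:

```python
from collections import Counter

# MLE of P(dependent, label | head) from a corpus of dependency triples.
triple_counts = Counter(deps)
head_counts = Counter(h for h, _, _ in deps)
dep_prob = {(h, d, l): n / head_counts[h] for (h, d, l), n in triple_counts.items()}
```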

Summary

We’ve taken a tour of PCFGs

crucial notion: what the probability of a rule is conditioned on

flaws in PCFGs: independence assumptions

several proposals to go beyond these flaws

dependency grammars are an alternative formalism