Albert Gatt
Corpora and Statistical Methods
Lecture 11

Probabilistic Context-Free Grammars and beyond
Part 1

Context-free grammars: reminder

Many NLP parsing applications rely on the CFG formalism.

Definition: a CFG is a 4-tuple (N, Σ, P, S):
- N = a set of non-terminal symbols (e.g. NP, VP)
- Σ = a set of terminal symbols (e.g. words); N and Σ are disjoint
- P = a set of productions of the form A → β, where A ∈ N and β ∈ (N ∪ Σ)* (any string of terminals and non-terminals)
- S = a designated start symbol (usually, “sentence”)

CFG Example

S → NP VP
S → Aux NP VP
NP → Det Nom
NP → Proper-Noun
Det → that | the | a
…

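As a concrete illustration, this toy grammar can be written with NLTK's CFG class. This is only a sketch: the rules for Nom, VP, V and Aux are invented fillers to make the grammar self-contained, and Proper-Noun is shortened to PropN.

    import nltk

    # The toy grammar above, written out for NLTK's CFG class.
    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        S -> Aux NP VP
        NP -> Det Nom
        NP -> PropN
        VP -> V NP
        Det -> 'that' | 'the' | 'a'
        Nom -> 'book'
        V -> 'read'
        Aux -> 'does'
        PropN -> 'Mary'
    """)

    print(grammar.start())              # S
    for prod in grammar.productions():  # one line per rule
        print(prod)
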
Probabilistic CFGs

A PCFG is a CFG in which each production has an associated probability. Formally, a PCFG is a 5-tuple (N, Σ, P, S, D), where:
- D : P → [0,1] is a function assigning each rule in P a probability

Usually, the probabilities are estimated from a corpus; the most widely used corpus is the Penn Treebank.

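A sketch of how such probabilities can be estimated by maximum likelihood from the treebank sample that ships with NLTK (this assumes nltk and its 'treebank' data package are installed; the sample covers about 10% of the full Penn Treebank):

    import nltk
    from nltk import Nonterminal, induce_pcfg
    from nltk.corpus import treebank  # requires nltk.download('treebank')

    # Maximum-likelihood estimation of D:
    #   P(A -> beta) = count(A -> beta) / count(A)
    productions = []
    for tree in treebank.parsed_sents():
        productions += tree.productions()

    pcfg = induce_pcfg(Nonterminal('S'), productions)

    # A few estimated rules for S, each printed with its probability:
    for prod in pcfg.productions(lhs=Nonterminal('S'))[:5]:
        print(prod)
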
The Penn Treebank

- English sentences annotated with syntax trees
- built at the University of Pennsylvania
- 40,000 sentences (about a million words) of text from the Wall Street Journal
- other treebanks exist for other languages (e.g. NEGRA for German)

Example tree

[Figure: an example Penn Treebank parse tree]

Building a tree: rules

S → NP VP
NP → NNP NNP
NNP → Mr
NNP → Vinken
…

Applying such rules yields the tree for “Mr Vinken is chairman of Elsevier”:

(S (NP (NNP Mr) (NNP Vinken))
   (VP (VBZ is)
       (NP (NN chairman)
           (PP (IN of) (NP (NNP Elsevier))))))

Characteristics of PCFGs

In a PCFG, the probability P(A → β) expresses the likelihood that the non-terminal A will expand as β:
- e.g. the likelihood that S → NP VP (as opposed to S → VP, or S → NP VP PP, or …)
- it can be interpreted as a conditional probability: the probability of the expansion, given the LHS non-terminal: P(A → β) = P(A → β | A)

Therefore, for any non-terminal A, the probabilities of all rules of the form A → β must sum to 1. If this is the case, we say the PCFG is consistent.

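A minimal sketch of that constraint as a sanity check; the rule table here is a made-up example, keyed on (LHS, RHS):

    from collections import defaultdict

    # Hypothetical rule table: (lhs, rhs) -> probability.
    rules = {
        ('S',  ('NP', 'VP')):        0.8,
        ('S',  ('Aux', 'NP', 'VP')): 0.2,
        ('NP', ('Det', 'Nom')):      0.6,
        ('NP', ('PropN',)):          0.4,
    }

    # For every non-terminal A, the probabilities of all rules
    # A -> beta must sum to 1.
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p

    for lhs, total in totals.items():
        assert abs(total - 1.0) < 1e-9, f"rules for {lhs} sum to {total}"
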
Uses of probabilities in parsing

Disambiguation: given n legal parses of a string, which is the most likely?
- e.g. PP-attachment ambiguity can be resolved this way

Speed: parsing is a search problem
- search through the space of possible applicable derivations
- the search space can be pruned by focusing on the most likely sub-parses of a parse

The parser can also be used as a model to determine the probability of a sentence, given a parse
- a typical use is in speech recognition, where the input utterance can be “heard” as several possible sentences

Using PCFG probabilities

A PCFG assigns a probability to every parse tree t of a string W, i.e. to every possible parse (derivation) of a sentence recognised by the grammar.

Notation:
- G = a PCFG
- s = a sentence
- t = a particular tree under our grammar; t consists of several nodes n, and each node is generated by applying some rule r

Probability of a tree vs. a sentence

The probability of a tree t is simply the product of the probabilities of every rule (node) that gives rise to t (i.e. the derivation of t):

    P(t) = ∏_{n ∈ t} P(r(n))

This is both the joint probability of t and s, and the probability of t alone. Why?

P(t, s) = P(t)

By the chain rule, P(t, s) = P(t) P(s|t). But P(s|t) must be 1, since the tree t is a parse of all the words of s. Hence P(t, s) = P(t).

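A sketch of P(t) as this product, computed over an nltk.Tree. The rule_prob table is assumed to come from a PCFG such as the one induced earlier; log-space avoids numerical underflow on large trees.

    import math
    from nltk import Tree

    def tree_log_prob(t, rule_prob):
        """log P(t): sum the log-probability of the rule at every node of t.

        `rule_prob` is a hypothetical lookup mapping a production, written
        as (lhs, rhs-tuple of symbols), to its PCFG probability.
        """
        logp = 0.0
        for prod in t.productions():  # one production per internal node
            key = (str(prod.lhs()), tuple(str(sym) for sym in prod.rhs()))
            logp += math.log(rule_prob[key])
        return logp

    # P(t, s) = P(t) = exp(tree_log_prob(t, rule_prob)) for a suitable table.
    t = Tree.fromstring(
        "(S (NP (NNP Mr) (NNP Vinken)) (VP (VBZ is) (NP (NN chairman))))")
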
Picking the best parse in a PCFG

A sentence will usually have several parses. We usually want them ranked, or only want the n-best parses, so we need to focus on P(t | s, G): the probability of a parse, given our sentence and our grammar.

Definition of the best parse for s:

    t̂(s) = argmax_t P(t | s, G)

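A sketch of finding this argmax with NLTK's ViterbiParser. The toy PCFG below is invented for illustration; its numbers deliberately make the verb-attachment reading of the PP the more probable one.

    import nltk
    from nltk.parse import ViterbiParser

    toy = nltk.PCFG.fromstring("""
        S  -> NP VP       [0.7]
        S  -> VP          [0.3]
        NP -> Det N       [0.5]
        NP -> Det N PP    [0.5]
        VP -> V NP        [0.4]
        VP -> V NP PP     [0.6]
        PP -> P NP        [1.0]
        Det -> 'the' [0.6] | 'a' [0.4]
        N  -> 'sacks' [0.5] | 'bin' [0.5]
        V  -> 'dump'      [1.0]
        P  -> 'into'      [1.0]
    """)

    parser = ViterbiParser(toy)
    for tree in parser.parse("dump the sacks into a bin".split()):
        print(tree.prob())   # probability of the most likely parse
        tree.pretty_print()
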
Picking the best parse in a PCFG

Problem: t can have multiple derivations (e.g. expand left-corner nodes first, expand right-corner nodes first, etc.), so P(t | s, G) should be estimated by summing over all possible derivations.

Fortunately, derivation order makes no difference to the final probabilities, so we can assume a “canonical derivation” d of t:

    P(t) =def P(d)

Probability of a sentence

The probability of a sentence is simply the sum of the probabilities of all parses of that sentence, i.e. all those trees which “yield” s:

    P(s) = Σ_{t : yield(t) = s} P(t)

(s is only a sentence if it is recognised by G, i.e. if there is some t for s under G.)

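Continuing the toy example from the previous sketch (reusing `toy` and the nltk import), P(s) can be computed by enumerating every parse with a plain chart parser and summing each tree's rule probabilities:

    from functools import reduce

    # Probability lookup derived from the toy PCFG above.
    prob_of = {(p.lhs(), p.rhs()): p.prob() for p in toy.productions()}

    def tree_prob(t):
        # P(t): product over all rules used in the derivation of t.
        return reduce(lambda acc, p: acc * prob_of[(p.lhs(), p.rhs())],
                      t.productions(), 1.0)

    # nltk.ChartParser enumerates *all* parses licensed by the grammar.
    p_s = sum(tree_prob(t) for t in nltk.ChartParser(toy).parse(
        "dump the sacks into a bin".split()))
    print(p_s)  # P(s): the mass of both PP-attachment parses
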
Flaws I: structural independence

The probability of a rule r expanding node n depends only on n; it is independent of the other non-terminals in the tree.

Example: P(NP → Pro) is independent of where the NP is in the sentence, but we know that NP → Pro is much more likely in subject position. Francis et al. (1999), using the Switchboard corpus, found that 91% of subjects are pronouns, but only 34% of objects are.

Flaws II: lexical independence

Vanilla PCFGs ignore lexical material: e.g. P(VP → V NP PP) is independent of the lexical head V and of the heads of the NP and PP.

Examples:
- prepositional phrase attachment preferences depend on lexical items; cf.:
  - dump [sacks into a bin]
  - dump [sacks] [into a bin] (preferred parse)
- coordination ambiguity:
  - [dogs in houses] and [cats]
  - [dogs] [in houses and cats]

Weakening the independence assumptions in PCFGs

Lexicalised PCFGs

These attempt to weaken the lexical independence assumption. The most common technique is to mark each phrasal head (N, V, etc.) with its lexical material, based on the idea that the most crucial lexical dependencies hold between a head and its dependents. E.g. Charniak (1997), Collins (1999).

Lexicalised PCFGs: Matt walks

Lexicalisation makes probabilities partly dependent on lexical content: P(VP → VBD | VP) becomes P(VP → VBD | VP, h(VP) = walk).

NB: normally, we can't assume that all heads of a phrase of category C are equally probable.

(S(walks)
  (NP(Matt) (NNP(Matt) Matt))
  (VP(walk) (VBD(walk) walks)))

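A toy sketch of head annotation over an nltk.Tree. Real lexicalised parsers such as Collins (1999) use detailed head-percolation tables; here a tiny hand-written table marks which child supplies the head, and word forms stand in for lemmas:

    from nltk import Tree

    # Toy head table: which child category carries the head of each phrase.
    HEAD_CHILD = {'S': 'VP', 'VP': 'VBD', 'NP': 'NNP'}

    def lexicalise(tree):
        """Annotate every node with its lexical head, bottom-up."""
        if isinstance(tree[0], str):        # preterminal: its word is the head
            return Tree(f"{tree.label()}({tree[0]})", [tree[0]])
        kids = [lexicalise(kid) for kid in tree]
        want = HEAD_CHILD.get(tree.label(), '')
        head_kid = next((k for k in kids if k.label().startswith(want)),
                        kids[-1])
        head = head_kid.label().split('(', 1)[1].rstrip(')')
        return Tree(f"{tree.label()}({head})", kids)

    t = Tree.fromstring("(S (NP (NNP Matt)) (VP (VBD walks)))")
    lexicalise(t).pretty_print()   # S(walks) over NP(Matt) and VP(walks)
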
Practical problems for lexicalised PCFGs

- data sparseness: we don't necessarily see all heads of all phrasal categories often enough in the training data
- flawed assumptions: lexical dependencies occur elsewhere too, not just between head and complement
  - e.g. in “I got the easier problem of the two to solve”, “of the two” and “to solve” become more likely because of the pre-head modifier “easier”

Structural context

The simple way is to calculate P(t | s, G) based on the rules in the canonical derivation d of t; this assumes that P(t) is independent of the derivation.

We could condition on more structural context, but then we could lose the notion of a canonical derivation, i.e. P(t) could really depend on the derivation!

Structural context: probability of a derivation history

How do we calculate P(t) based on a derivation d? Observe that the probability that a sequence of m rewrite rules r_1, …, r_m in a derivation yields s can be decomposed using the chain rule for multiplication:

    P(d) = P(r_1, …, r_m) = ∏_{i=1}^{m} P(r_i | r_1, …, r_{i-1})

Approach 2: parent annotation

Annotate each node with its parent in the parse tree. E.g. if an NP has parent S, rename the NP to NP^S. This can partly account for dependencies such as subject-of (NP^S is a subject, NP^VP is an object):

(S (NP^S (NNP^NP Matt))
   (VP^S (VBD^VP walks)))

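Parent annotation is easy to express as a tree transform; a minimal sketch over an nltk.Tree:

    from nltk import Tree

    def parent_annotate(tree, parent=None):
        """Rename every non-root node X with parent Y to X^Y; words unchanged."""
        if isinstance(tree, str):          # leaf: the word itself
            return tree
        label = tree.label() if parent is None else f"{tree.label()}^{parent}"
        return Tree(label, [parent_annotate(kid, tree.label()) for kid in tree])

    t = Tree.fromstring("(S (NP (NNP Matt)) (VP (VBD walks)))")
    parent_annotate(t).pretty_print()
    # (S (NP^S (NNP^NP Matt)) (VP^S (VBD^VP walks)))
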
The main point

Many different parsing approaches differ in what they condition their probabilities on.

Other grammar formalisms

Phrase structure vs. dependency grammar

PCFGs are in the tradition of phrase-structure grammars. Dependency grammar instead describes syntax in terms of dependencies between words:
- no non-terminals or phrasal nodes, only lexical nodes with links between them
- links are labelled, with labels drawn from a finite list

Dependency Grammar

Example: the dependency structure of “I gave him my address”:

    main(<ROOT>, gave)
    subj(gave, I)
    dat(gave, him)
    obj(gave, address)
    attr(address, my)

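The same structure as a plain data structure: a list of labelled (head, dependent) arcs with a dummy root node. This sketch just groups dependents under their heads:

    # The dependency graph above as labelled arcs: (head, dependent, label).
    arcs = [
        ("<ROOT>",  "gave",    "main"),
        ("gave",    "I",       "subj"),
        ("gave",    "him",     "dat"),
        ("gave",    "address", "obj"),
        ("address", "my",      "attr"),
    ]

    # Every word except the root has exactly one head: the arcs form a tree.
    dependents = {}
    for head, dep, label in arcs:
        dependents.setdefault(head, []).append((dep, label))

    print(dependents["gave"])  # [('I', 'subj'), ('him', 'dat'), ('address', 'obj')]
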
Dependency grammar

Dependency grammar is now often used in probabilistic parsing. Advantages:
- it directly encodes lexical dependencies, so disambiguation decisions take lexical material into account directly
- dependencies are a way of decomposing phrase-structure rules and their probability estimates: estimating the probability of a dependency between two words is less likely to lead to data sparseness problems

Summary

We've taken a tour of PCFGs:
- the crucial notion: what the probability of a rule is conditioned on
- the flaws in PCFGs: their independence assumptions
- several proposals for going beyond these flaws
- dependency grammars as an alternative formalism