
Presentation Transcript

Slide 1

Parsing with Richer Formalisms

HPSG

by Sibel Ciddi

Slide 2

Major Focuses of Research in This Field:

Unification-Based Grammars

Probabilistic Approaches

Dynamic Programming

Stochastic Attribute-Value Grammars, Abney, 1997

Dynamic Programming for Parsing and Estimation of Stochastic Unification-Based Grammars, Geman & Johnson, 2002

Stochastic HPSG Parse Disambiguation Using the Redwoods Corpus, Toutanova & Flickinger, 2005

Slide 3

Stochastic Attribute-Value Grammars, Abney, 1997

INTRO

Motivation for this paper:

Insufficiencies of previous probabilistic Attribute-Value grammars in defining parameter-estimation

Other than Brew's and Eisele's attempts,** there have been no serious attempts to extend the stochastic models developed for CFGs to constraint-based grammars by assigning weights or preferences, or by introducing 'weighted logic'.*

*See Riezler (1996). **See Brew (1995), Eisele (1994).

Slide 4

Goal:

Abney’s goal in this research:

Defining a stochastic attribute-value grammar

An algorithm to compute the maximum-likelihood estimates of AVG parameters

How? By adapting Gibbs sampling* for the Improved Iterative Scaling (IIS) algorithm, to estimate certain function expectations.

For further improvement, use of the Metropolis-Hastings algorithm.

*Gibbs sampling was earlier used for an English orthography application by Della Pietra, Della Pietra & Lafferty (1995).

Slide 5

Overview of Abney’s experiments:

Based on Brew's and Eisele's earlier experiments, Abney experiments with Empirical Relative Frequency (ERF) as used for SCFGs.

Using the Expectation-Maximization (EM) algorithm

What to do in case of true context-dependencies? Random Fields (a generalization of Markov chains and stochastic branching processes)

For further improvement, the Metropolis-Hastings algorithm

Slide 6

Ideas and Methods for Parameter Estimation:

Abney’s ideas and methods for parameter-estimations in AVGs:

Empirical Relative Frequency (ERF) Method

Expectation-Maximization as used for CFGs

'Normalization' (a normalizing constant) in ERF for AVGs

Random Fields (generalized Markov chains) Method

Improved Iterative Scaling Algorithm (Della Pietra, Della Pietra & Lafferty; English orthography):
Null field (no features)
Feature selection
Weight adjustment
Iteration

Random Sampling Method: Metropolis-Hastings Algorithm

Slide 7

Probability Estimations in CFG:

For probabilistic estimation in a CFG, assume the grammar and model are as in Abney's example (the example rules and their weights appear only as an image on the slide).

The probability distribution of a parse tree x under this model is the product of the weights of the rules used in its derivation: p(x) = ∏ᵢ βᵢ^fᵢ(x), where fᵢ(x) is the number of times rule i is used in deriving x.

For example, if tree x has probability 1/3 and rule 1 is used once, it contributes 1/3 · 1 to p[f₁]; rule 3, used twice, contributes 1/3 · 2 to p[f₃].

Parameter estimation determines the values of the weights that make the stochastic grammar effective, so obtaining the correct weights is crucial.
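Since the example grammar itself appears only as an image on the slide, the following minimal Python sketch (with hypothetical rule identifiers and weights) shows how such a model assigns a probability to a tree by multiplying each rule's weight raised to its usage count:

from collections import Counter

def tree_probability(rule_counts, weights):
    """rule_counts: Counter mapping rule id -> f_i(x); weights: rule id -> beta_i."""
    p = 1.0
    for rule, count in rule_counts.items():
        p *= weights[rule] ** count
    return p

weights = {1: 1.0, 2: 1/3, 3: 1/3}          # hypothetical rule weights
x_counts = Counter({1: 1, 3: 2})            # f_1(x) = 1, f_3(x) = 2
print(tree_probability(x_counts, weights))  # 1.0 * (1/3)**2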

Slide 8

Empirical Relative Frequency in CFG:

The empirical distribution is the frequency with which a given tree* appears in the corpus. A measure of dissimilarity between distributions is needed to compare the model's probability distribution with the empirical distribution. The Kullback-Leibler divergence is used for this; minimizing it amounts to maximizing the likelihood of the probability distribution q.** CFGs use the ERF method to find the best weights for the probability distribution.

ERF: for each rule i with weight βᵢ and a function fᵢ(x) that returns the number of times rule i is used in the derivation of tree x, p[f] denotes the expectation of f under the probability distribution p:

p[f] = ∑ₓ p(x) f(x)

In this way ERF lets us compute the expectation of each rule's frequency and normalize among rules that share the same LHS.

*This is the structure generated by the grammar.

**The distribution q also accounts for missing mass that is not present in the training corpus.
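As an illustration of the ERF computation for a CFG, this sketch (mine, over a made-up toy corpus) estimates each rule's weight as its expected relative frequency among rules that share the same left-hand side:

from collections import Counter, defaultdict

def erf_weights(trees):
    """trees: list of Counters mapping (lhs, rhs) rules -> count in that tree."""
    totals = Counter()
    for tree in trees:
        totals.update(tree)                  # empirical expectation of f_i (up to a factor 1/N)
    by_lhs = defaultdict(float)
    for (lhs, _), c in totals.items():
        by_lhs[lhs] += c
    return {rule: c / by_lhs[rule[0]] for rule, c in totals.items()}

# Hypothetical corpus of two derivations.
corpus = [Counter({("S", ("A", "A")): 1, ("A", ("a",)): 2}),
          Counter({("S", ("A", "A")): 1, ("A", ("a",)): 1, ("A", ("b",)): 1})]
print(erf_weights(corpus))   # weights normalized per LHS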

Slide 9

(Table 3 is shown on the slide.)

ERF weights are the best weights because they are closest to the empirical distribution. If the training corpus is generated by an SCFG, any dependencies between the trees are assumed to be coincidental.

However:

- What if the dependencies are not just coincidental, but true context-dependencies?

- Can we apply the same method to AVGs, formalizing an AVG as a CFG and following similar rewrite rules and models?

Then we can use the same context-free grammar as an attribute-value grammar, like the following:

Slide 10

ERF in AVGs?

This involves the following:
attaching probabilities to the AVG (in the same way as for CFGs)
estimating the weights

(The slide shows the AV structure generated by the grammar, and the rule applications in the AV graph generated by the grammar, using the ERF.**)

**ϕ represents the dag* weight for the tree x1.

*Dag is short for 'directed acyclic graph'.

Slide 11

Probability Estimation Using ERF in AVGs:

However, here the weight ϕ gives the entire set of AV structures (shown on the previous slide), not a probability distribution (as in the CFG case). To obtain correct probability distributions and weights, a normalizing constant is introduced:

q(x) = (1/Z) ϕ(x), where Z is a normalizing constant, Z = ∑ₓ ϕ(x).

However, this normalization violates the conditions under which ERF gives the best weights, by changing the grammar and affecting the probabilities.

***This shows that, in order to apply ERF to AVGs, normalization is needed; yet normalization violates the ERF conditions.
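A toy illustration (my own, with made-up dag weights) of the normalization step: the AVG assigns each structure an unnormalized weight ϕ(x), and a probability is obtained only after dividing by Z = ∑ₓ ϕ(x), which assumes the set of structures can be enumerated; exactly that assumption breaks down for realistic grammars.

def normalize(phi):
    """phi: dict mapping structure id -> unnormalized weight."""
    Z = sum(phi.values())
    return {x: w / Z for x, w in phi.items()}

phi = {"x1": 1/9, "x2": 1/9, "x3": 1/3}   # made-up dag weights
q = normalize(phi)
print(q, sum(q.values()))                 # probabilities now sum to 1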

Slide 12

Solution to the Normalization Problem:

Random Fields Method: defines a probability distribution over a set of labeled graphs Ω*, using functions to represent the frequencies of features and rules (formula shown on the slide). This normalizes the weight distribution. As a result, the empirical distribution needs fewer features.

For parameter estimation with the Random Fields method, the following are needed: values for the weights, and features corresponding to those weights. The Improved Iterative Scaling algorithm determines these values, and involves the following:
Null field (no features)
Feature selection
Weight adjustment
Iteration until the best weights are obtained

IIS improves the estimation of the AV parameter weights. However, it does not give the best weights (compared with CFG parameter estimation) because of the number of iterations required for a given tree. If the given grammar is small enough, the results can be moderately good; but when the grammar is large (as with AVGs), another alternative gives better results.

*The omegas are actually the AV structures generated by the grammar.
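Below is a highly simplified sketch of the estimation loop just described (null field, feature selection, weight adjustment, iteration), assuming a set of structures small enough to enumerate; a plain gradient step on the log-likelihood stands in for the actual Improved Iterative Scaling update, and all names and data are hypothetical.

import math
from collections import Counter

def model_probs(structures, weights):
    # Random-field probability: exp(sum of weight * feature count), normalized over all structures.
    scores = {x: math.exp(sum(weights.get(f, 0.0) * c for f, c in feats.items()))
              for x, feats in structures.items()}
    Z = sum(scores.values())
    return {x: s / Z for x, s in scores.items()}

def expectation(structures, probs, f):
    return sum(probs[x] * feats.get(f, 0) for x, feats in structures.items())

def estimate(structures, empirical, features, rounds=5, steps=50, lr=0.5):
    weights = {}                                        # null field: start with no features
    for _ in range(rounds):
        probs = model_probs(structures, weights)
        # feature selection: largest gap between empirical and model expectation
        gaps = {f: abs(empirical[f] - expectation(structures, probs, f))
                for f in features if f not in weights}
        if not gaps:
            break
        weights[max(gaps, key=gaps.get)] = 0.0
        for _ in range(steps):                          # weight adjustment by gradient ascent
            probs = model_probs(structures, weights)
            for f in weights:
                weights[f] += lr * (empirical[f] - expectation(structures, probs, f))
    return weights

# Hypothetical example: two structures, one feature distinguishing them.
structures = {"x1": Counter({"f1": 1}), "x2": Counter()}
empirical = {"f1": 0.8}           # f1 fires in 80% of the (imaginary) training data
print(estimate(structures, empirical, ["f1"]))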

Slide 13

Random Sampling (stochastic decision)

Using the Metropolis-Hastings Algorithm: it converts the sampler for the initial distribution into a sampler for the field distribution; the existing sampler proposes a new item.

If the current sampler assigns the new item more probability than the new (field) distribution does, the item is overrepresented. With a stochastic decision, the item is then rejected with a probability equal to its degree of overrepresentation.

If the new item is underrepresented relative to the new sampler, it is accepted with probability 1.

If the new item is overrepresented relative to the new sampler, it is accepted with a probability that diminishes as the overrepresentation increases.
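For illustration, here is a generic sketch (not the paper's exact construction) of an independence Metropolis-Hastings acceptance step matching the description above, with a made-up target distribution and a uniform proposal sampler:

import random

def mh_sample(propose, q_weight, pi_weight, n_steps, x0):
    """propose(): draw an item from the existing sampler; *_weight(x): unnormalized weights."""
    x = x0
    samples = []
    for _ in range(n_steps):
        x_new = propose()
        # ratio < 1 when the proposal overrepresents x_new relative to the field distribution
        ratio = (pi_weight(x_new) * q_weight(x)) / (pi_weight(x) * q_weight(x_new))
        if random.random() < min(1.0, ratio):
            x = x_new                     # underrepresented items are accepted with probability 1
        samples.append(x)
    return samples

# Hypothetical target over three items, proposals drawn uniformly.
items = ["x1", "x2", "x3"]
pi = {"x1": 0.6, "x2": 0.3, "x3": 0.1}
samples = mh_sample(lambda: random.choice(items), lambda x: 1.0, lambda x: pi[x], 10000, "x1")
print({i: samples.count(i) / len(samples) for i in items})   # roughly matches pi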

Slide 14

Conclusion

Stochastic CFG methods are not easily convertible to AVGs.

Because of the constraints and true context-dependencies in AVGs, the ERF method gives wrong weights for parameter estimation. Obtaining correct weights requires various methods:
Random sampling
Feature selection
Weight adjustment

Random sampling shows promise for future experimentation, but its performance is too slow, and the output weights are not optimal because of the iterations required. This is even more difficult for larger grammars: with a large number of features, and random-sampling iterations needed for each feature, the results are not optimal.

Also, these methods assume completely parsed data, so without additional methods* incomplete data poses major challenges.

*(described in Riezler, 1997)

Slide 15

Dynamic Programming for Parsing and Estimation of Stochastic Unification-Based Grammars, Geman & Johnson, 2002

INTRO

Because the earlier algorithms for stochastic parsing and estimation, proposed by Abney (1997) and Johnson et al. (1999), required extensive enumeration and did not work well on large grammars, Geman and Johnson discuss an algorithm that does graph-based dynamic programming without the enumeration step.

Slide 16

Goal

In this research, Geman and Johnson's target is to find the most probable parse by using packed parse-set representations (a.k.a. a Truth Maintenance System, TMS).*

*A packed representation is a quadruple R = (F', X, N, α).

Overview:
Viterbi Algorithm: finds the most probable parse of a string.
Inside-Outside Algorithm: estimates a PCFG from unparsed data.
Maxwell III and Kaplan's algorithm: takes a string y and returns a packed representation R such that Ω(R) = Ω(y), where R represents the set of parses of string y.
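As a side illustration of the dynamic-programming principle behind the Viterbi algorithm (which Geman and Johnson generalize to packed representations), here is a compact CKY-style sketch for a toy PCFG in Chomsky normal form; the grammar, words, and probabilities are all hypothetical, not the paper's algorithm.

# Toy CNF grammar: rule -> probability.
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
lexical = {("NP", "she"): 0.4, ("NP", "fish"): 0.6, ("V", "eats"): 1.0}

def viterbi_parse(words):
    n = len(words)
    best = {}  # (i, j, label) -> (probability, backpointer)
    for i, w in enumerate(words):
        for (label, word), p in lexical.items():
            if word == w:
                best[(i, i + 1, label)] = (p, w)
    for width in range(2, n + 1):                 # build longer spans from shorter ones
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (parent, left, right), p in binary.items():
                    if (i, k, left) in best and (k, j, right) in best:
                        score = p * best[(i, k, left)][0] * best[(k, j, right)][0]
                        if score > best.get((i, j, parent), (0.0, None))[0]:
                            best[(i, j, parent)] = (score, (k, left, right))
    return best.get((0, n, "S"))

print(viterbi_parse(["she", "eats", "fish"]))     # probability and backpointer of the best S parse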

Slide 17

Features of Dynamic Programming in SUBGs:

Properties* must be very local with respect to features.
Finding the probability of a parse requires maximizing functions, where each function depends on a subset of the variables.
Estimating the property weights requires maximizing the conditional likelihood of the training-corpus parses given their yields.**
Conditional likelihood maximization requires calculating conditional expectations; for this, the Conjugate Gradient algorithm is used.
These calculations generalize the Viterbi algorithm and the forward-backward algorithm for HMMs used in dynamic programming.

*A property is a real-valued function of parses Ω.

**A yield is the string of words associated with a well-formed parse Ω.

Slide 18

Conclusions:

In this paper Geman and Johnson examine only a few of the algorithms that use Maxwell and Kaplan representations for dynamic-programming parsing. Because these representations provide a very compact encoding of the set of parses of a sentence, they eliminate the need to list each parse separately. By the nature of natural language, strings and features occur as substrings or sub-features of other components, and therefore parsing can be made faster via dynamic programming.

The algorithms presented have promising results for parsing unification-based grammars compared with earlier approaches that iterate through countless enumerations.

Slide 19

Stochastic HPSG Parse Disambiguation Using the Redwoods Corpus, Toutanova & Flickinger, 2005

INTRO

Supporting Abney's and other researchers' conclusions about the improvements that come from using conditional stochastic models in parsing, Toutanova and Flickinger experimented with both generative and conditional log-linear models.

Recall: Brew & Eisele had found that ERF does not yield the best weights for parameter estimation when there are true context-dependencies. This is why Abney experimented with other methods such as normalization factors, random fields, random sampling, etc.

Slide 20

Goal:

Using the Redwoods corpus, build a tagger for HPSG lexical item identifiers to be used for parse disambiguation.

Train stochastic models using:
Derivation trees
Phrase structure trees
Semantic trees (approximations to Minimal Recursion Semantics)

In order to build:
Generative models
Conditional log-linear models

HOW? Use the probabilistic models (from training) to rank unseen test sentences according to the probabilities those models attach to them.

Slide 21

Generative Model vs. Conditional Log-Linear Model:

Generative models define a probability distribution over the trees of derivation types corresponding to the HPSG analyses of sentences. The PCFG parameters used for each CFG rule correspond to the schemas of HPSG. The probability weights are obtained using relative-frequency estimation with Witten-Bell smoothing.

Conditional log-linear models have a set of features {f₁, …, fM} and a set of corresponding weights {λ₁, …, λM}. The conditional models define features over derivation and semantic trees. The models are trained by maximizing the conditional likelihood of the preferred analyses of the parses, using a Gaussian prior for smoothing. For the models in this project, the Gaussian prior is set to 1.
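As an illustration of how a conditional log-linear model scores the competing analyses of a single sentence (feature names, values, and weights below are hypothetical; the Gaussian prior only enters training, not scoring):

import math

def conditional_probs(analyses, weights):
    """analyses: list of feature dicts, one per candidate parse of the same sentence."""
    scores = [math.exp(sum(weights.get(f, 0.0) * v for f, v in feats.items()))
              for feats in analyses]
    Z = sum(scores)                     # normalize over this sentence's candidates only
    return [s / Z for s in scores]

# Two hypothetical candidate parses of one sentence.
candidates = [{"rule:head-subj": 1, "rule:head-comp": 2},
              {"rule:head-subj": 2, "rule:head-comp": 1}]
weights = {"rule:head-subj": 0.3, "rule:head-comp": -0.1}
print(conditional_probs(candidates, weights))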

Slide 22

Overview of Models:

Model Types          Generative Models                     Conditional Models
Tagger               Trigram HMM tagger                    Ltagger (Ltrigram)
Derivation Trees     PCFG-1P, PCFG-2P, PCFG-3P, PCFG-A     LPCFG-1P, LPCFG-A
Semantic Trees       PCFG-Sem                              LPCFG-Sem
Model Combination    PCFG-Combined                         LCombined

Slide 23

Comparison of Models:

Generative Model: HMM Trigram Tagger

For probability estimates, it uses maximum relative frequency over the pre-terminal sequences and yields of derivation trees. (Pre-terminals are the lexical item identifiers.)

For the probability weights, smoothing is done by linear interpolation, using Witten-Bell smoothing with a varying parameter d.

Unfortunately, it does not take advantage of lexical-type or type-hierarchy information from the HPSG.

Conditional Model: Ltagger (Ltrigram)

Includes features for all lexical-item trigrams, bigrams, and unigrams. This is based more on Naive Bayes-style probability estimation.
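To make the smoothing idea concrete, here is a small sketch of Witten-Bell-interpolated trigram estimation over lexical item identifiers; the counting scheme and the identifiers are my own simplification, not the paper's exact parameterization.

from collections import Counter, defaultdict

class WittenBellTrigram:
    def __init__(self, sequences):
        self.ctx_tokens = Counter()          # context -> total continuation tokens
        self.ctx_types = defaultdict(set)    # context -> distinct continuation types
        self.counts = Counter()              # (context, word) -> count
        self.unigrams = Counter()
        for seq in sequences:
            padded = ("<s>", "<s>") + tuple(seq)
            for i in range(2, len(padded)):
                for ctx in [(padded[i - 2], padded[i - 1]), (padded[i - 1],)]:
                    self.counts[(ctx, padded[i])] += 1
                    self.ctx_tokens[ctx] += 1
                    self.ctx_types[ctx].add(padded[i])
                self.unigrams[padded[i]] += 1
        self.total = sum(self.unigrams.values())

    def prob(self, word, context):
        # Witten-Bell: weight the maximum-likelihood estimate by c/(c+T) and back off recursively.
        if not context:
            return self.unigrams[word] / self.total
        tokens, types = self.ctx_tokens[context], len(self.ctx_types[context])
        lam = tokens / (tokens + types) if tokens else 0.0
        ml = self.counts[(context, word)] / tokens if tokens else 0.0
        return lam * ml + (1 - lam) * self.prob(word, context[1:])

lm = WittenBellTrigram([["the_det", "dog_n", "barks_v"],
                        ["the_det", "cat_n", "sleeps_v"]])
print(lm.prob("dog_n", ("<s>", "the_det")))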

Slide 24

Comparison of Models: Derivation Trees

Generative Model

PCFG-1P:
Parameters θᵢ,ⱼ = P(αⱼ | Aᵢ), chosen to maximize the likelihood of the preferred parses in the training set.
For wider context, extra parameters for additional nodes are added, and models are trained as PCFG-2P and PCFG-3P.
Probabilities are estimated by linear interpolation over linear subsets of the conditioning context; the interpolation coefficients are obtained using Witten-Bell smoothing.

PCFG-A:
Conditions each node's expansion on up to five of its ancestors. The ancestor selection is similar to context-specific independencies in Bayesian networks.
Final probability estimates are linear interpolations of relative frequencies, with coefficients estimated using Witten-Bell smoothing.

Conditional Model

LPCFG-1P:
Parameters λᵢ,ⱼ, with one feature for each expansion Aᵢ → αⱼ of each nonterminal in the tree.
Weights are obtained by maximum conditional likelihood.

LPCFG-A:
A feature is added for every path and expansion; the feature values are the sums of the feature values of the local trees.
Uses a generative component (the expansions) for feature selection, so it is not purely discriminative in construction.
Uses feature parameters for leaves and internal nodes, and multiplies its parameters.
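The relative-frequency idea behind PCFG-1P and its ancestor-annotated variants can be sketched as follows (my simplification; the Witten-Bell interpolation of different context sizes is omitted, and the derivation events are hypothetical):

from collections import Counter, defaultdict

def estimate(derivations, ancestors=0):
    """derivations: list of (path, expansion) pairs, where path is the tuple of
    node labels from the root down to the expanding node."""
    counts = Counter()
    ctx_totals = defaultdict(float)
    for path, expansion in derivations:
        ctx = path[-(ancestors + 1):]          # node label plus `ancestors` ancestor labels
        counts[(ctx, expansion)] += 1
        ctx_totals[ctx] += 1
    return {k: c / ctx_totals[k[0]] for k, c in counts.items()}

# Hypothetical derivation events: (path-to-node, schema used to expand it).
events = [(("S", "VP"), "head-comp"), (("S", "VP"), "head-comp"),
          (("NP", "VP"), "head-comp"), (("S", "VP"), "head-adj")]
print(estimate(events, ancestors=0))   # P(expansion | node label)
print(estimate(events, ancestors=1))   # P(expansion | node label + parent)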

Slide 25

Comparison of Models: Semantic Trees

Generative Model

PCFG-Sem:
Similar to markovized rule models; uses a decision-tree growing algorithm, with Witten-Bell smoothing for the final tree.
Generation order: first, create the left dependents (from right to left, given the head, the parent, the right sister, and the number of dependents to the left); second, create the right dependents from left to right; third, add stop symbols at the left and right ends of the dependents.
The probability of a dependency is estimated as a product of the probabilities of local trees. Five conditioning factors for dependent generation:
1. The parent of the node
2. The direction (left / right)
3. The number of dependents already generated in the surface string between the head and the dependent
4. The grandparent label
5. The label of the immediately preceding dependent

Example: semantic dependency tree for 'I am sorry':
P(pron_rel | be_prd_rel, left, 0, top, none) ×
P(stop | be_prd_rel, left, 0, top, pron_rel) ×
P(sorry_rel | be_prd_rel, right, 0, top, none) ×
P(stop | be_prd_rel, right, 1, top, sorry_rel)

Conditional Model

LPCFG-Sem:
Corresponds to the PCFG-Sem model. Features are defined in the same way LPCFG-A's were defined with respect to the PCFG-A model (i.e., by adding a feature for every path and expansion occurring at a node).
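A tiny sketch of the factorization shown above for 'I am sorry': the probability of the semantic dependency tree is the product of one conditional probability per generation event. The probability values here are made up; a trained PCFG-Sem model would supply the real ones.

probs = {
    ("pron_rel",  ("be_prd_rel", "left",  0, "top", "none")):      0.10,
    ("stop",      ("be_prd_rel", "left",  0, "top", "pron_rel")):  0.60,
    ("sorry_rel", ("be_prd_rel", "right", 0, "top", "none")):      0.05,
    ("stop",      ("be_prd_rel", "right", 1, "top", "sorry_rel")): 0.70,
}

p = 1.0
for outcome, context in probs:            # the four generation events in the example
    p *= probs[(outcome, context)]
print(p)                                  # probability assigned to the dependency tree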

Slide 26

Comparison of Models: Combination Trees

Generative Model

PCFG-Combination (PCFG-Combined):
Combination of PCFG-A, the HMM tagger, and PCFG-Sem.
It computes the scores of analyses given by the individual models.
Uses trigram tag-sequence probabilities (the transition probabilities of the HMM tagging model).
Uses interpolation weights of λ₁ = 0.5 and λ₂ = 0.4.
Generative models were combined by taking a weighted log sum of the probabilities they assign to trees.

Conditional Model

LPCFG-Combination (LCombined):
Log-linear combination of LPCFG-A, LPCFG-Sem, and Ltagger with their features.
Conditional models were combined by collecting all features into one model.
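A sketch (mine) of the generative combination described above: each candidate tree's combined score is a weighted sum of the log probabilities the individual models assign to it, and candidates are ranked by that score. The model names, probabilities, and weights below are illustrative only.

import math

def combined_score(model_probs, weights):
    """model_probs / weights: dicts keyed by model name."""
    return sum(weights[m] * math.log(p) for m, p in model_probs.items())

candidates = {
    "parse_1": {"PCFG-A": 1e-6, "HMM-tagger": 1e-4, "PCFG-Sem": 1e-5},
    "parse_2": {"PCFG-A": 5e-7, "HMM-tagger": 3e-4, "PCFG-Sem": 2e-5},
}
weights = {"PCFG-A": 1.0, "HMM-tagger": 0.5, "PCFG-Sem": 0.4}
best = max(candidates, key=lambda t: combined_score(candidates[t], weights))
print(best)                               # highest-scoring analysis under the combination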

Slide 27

Results:

- All sentences (ambiguous and unambiguous) may be used for training; only the ambiguous set is used for testing.

- For this work, unambiguous sentences were discarded from the test set.

- Unambiguous sentences were used for training the generative models. Conditional models did not use the unambiguous sets.*

*Unambiguous sentences contribute only a constant to the log-likelihood in conditional models.

**Accuracy is 94.6% for sentences with 2 possible analyses, and 40% for sentences with more than 100 analyses.

Method           PCFG Accuracy    LPCFG Accuracy    Method
Random           22.7             22.7              Random
HMM-Trigram      42.1             43.2              LTrigram
Perfect          48.8             48.8              Perfect
PCFG-1P          61.6             72.4              LPCFG-1P
PCFG-A           71.0             75.9              LPCFG-A
PCFG-Sem         62.8             65.4              LPCFG-Sem
PCFG-Combined    73.2             76.7              LCombined

High accuracy is already achieved even by simple statistical models.

The HMM tagger does not perform well by itself, compared with other models that have more information about the parse.

PCFG-A achieved a 24% error reduction over PCFG-1P.

PCFG-Sem has good accuracy, but not better than PCFG-A by itself.

Model combination has complementary advantages: the tagger adds left-context information to the PCFG-A model, and PCFG-Sem provides semantic information.

LPCFG-1P achieves a 28% error reduction over PCFG-1P.

Overall, there is a 13% error reduction from PCFG-Combined to LCombined.

Sentences    #       Length    Str. Ambiguity
All          6876    8.0       44.5
Ambiguous    5266    9.1       57.8

It is also important to note that the high accuracy might result from:
a low ambiguity rate in the corpus
short sentence length

Also, accuracy decreases as ambiguity increases.**

And there is overfitting between testing and training.

Slide 28

Error Analysis & Conclusion:

Ancestor annotation gives higher accuracy than the other models, and especially better results on phrase structure trees. Another combined model, based on log-probability combinations of derivation and PS trees, gives even slightly better accuracy on phrase structures. This shows:
Node labels (schema names) provide enough information for good accuracy.
Semantic models perform below expectations.*

*Data scarcity, size of corpus**

A more detailed error analysis shows that, out of 165 sentences in this study:
26% were caused by annotation errors in the tree bank
12% were caused by both the tree bank and the model
62% were real errors (the tree bank correctly captured the parse)***

For these real errors, 103 out of 165:
27 PP-attachments
21 wrong lexical-item selections
15 modifier attachments
13 coordination
9 complement / adjunct
18 other misc. errors

***This figure was 20% in 2002, for the first growth of the Redwoods corpus, which shows that the corpus has been improved in the meantime as well.

Slide 29

Further Discussion & Future Work:

Looking at the numbers in the results, one can question how meaningful the reported improvements are: are they good enough, and if so, by whose standard? What is the measure of goodness in this kind of experiment? Toutanova & Flickinger's research shows that the performance of the stochastic models was not greatly higher than that of the simpler statistical models.

It is especially surprising that the more work was put into the later models, the less improvement was obtained from them. (Recall that the highest improvement came from the first LPCFG model.)

Some of the following ideas could be researched further to improve the current models and remedy the inefficiencies:
Better use of semantic or lexical information; use of the MRS semantic representation
Automatic clustering; further examination of existing lexical resources
Semantic collocation information
Use of other features within the HPSG framework, such as syntactic categories, clause finiteness, and agreement features
An increase in corpus size

Any Questions & Comments?
