Presentation Transcript

Slide 1

Statistical Translation Language Model

Maryam Karimzadehgan (mkarimz2@illinois.edu)
University of Illinois at Urbana-Champaign

Slide 2

Outline

- Motivation & Background
  - Language model (LM) for IR
  - Smoothing methods for IR
- Statistical Machine Translation – Cross-Lingual
  - Motivation
  - IBM Model 1
- Statistical Translation Language Model – Monolingual
  - Synthetic Queries
  - Mutual Information-based approach
  - Regularization of self-translation probabilities
  - Smoothing in Statistical Translation Language Model

Slide 3

The Basic LM Approach
([Ponte & Croft 98], [Hiemstra & Kraaij 98], [Miller et al. 99])

[Figure: each document induces a language model over words, e.g., text?, mining?, association?, clustering?, …, food?, … for a text-mining paper, and food?, nutrition?, healthy?, diet?, … for a food-nutrition paper.]

Query = "data mining algorithms"

Which model would most likely have generated this query?

Slide 4

Ranking Docs by Query Likelihood

[Figure: each document d1, d2, …, dN is summarized by a document LM θ_d1, θ_d2, …, θ_dN; given a query q, documents are ranked by the query likelihood p(q|θ_d1), p(q|θ_d2), …, p(q|θ_dN).]

Slide 5

Retrieval as LM Estimation

Document ranking based on query likelihood: for q = w_1 w_2 … w_n,

log p(q|d) = Σ_{i=1..n} log p(w_i|d)

The retrieval problem is thus reduced to the estimation of the document language model p(w_i|d). Smoothing is an important issue, and distinguishes different approaches.

Slide 6

How to Estimate p(w|d)?

Simplest solution: the Maximum Likelihood Estimator

p(w|d) = relative frequency of word w in d = c(w,d) / |d|

What if a word doesn't appear in the text? Then p(w|d) = 0.

In general, what probability should we give a word that has not been observed? If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words. This is what "smoothing" is about.

Slide 7

Language Model Smoothing

[Figure: P(w) plotted against words w, comparing the maximum-likelihood estimate with a smoothed LM that discounts seen words and gives unseen words non-zero probability.]

Slide 8

Smoothing Methods for IR (Zhai & Lafferty 01)

Method 1 (Linear interpolation, Jelinek-Mercer):

p(w|d) = (1 - λ) p_ml(w|d) + λ p(w|C)

with p_ml(w|d) the ML estimate and λ the smoothing parameter.

Method 2 (Dirichlet Prior/Bayesian):

p(w|d) = (c(w,d) + μ p(w|C)) / (|d| + μ)

with μ the smoothing parameter and p(w|C) the collection language model.
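As a concrete illustration, here is a minimal sketch of both smoothing methods plugged into query-likelihood ranking. The toy corpus, parameter values, and function names are my own, not from the slides:

```python
import math
from collections import Counter

docs = {
    "d1": "text mining clustering text".split(),
    "d2": "food nutrition healthy diet".split(),
}
# Collection model p(w|C): relative frequency over the whole corpus.
coll = Counter(w for d in docs.values() for w in d)
total = sum(coll.values())

def p_jm(w, cnt, dlen, lam=0.5):
    # Jelinek-Mercer: (1-λ) p_ml(w|d) + λ p(w|C)
    return (1 - lam) * cnt[w] / dlen + lam * coll[w] / total

def p_dir(w, cnt, dlen, mu=2000):
    # Dirichlet prior: (c(w,d) + μ p(w|C)) / (|d| + μ)
    return (cnt[w] + mu * coll[w] / total) / (dlen + mu)

def score(query, doc, smooth):
    cnt, dlen = Counter(doc), len(doc)
    # Query likelihood: Σ_i log p(w_i|d)
    return sum(math.log(smooth(w, cnt, dlen)) for w in query.split())

ranked = sorted(docs, key=lambda d: score("text mining", docs[d], p_dir),
                reverse=True)
print(ranked)  # d1 should outrank d2
```

With μ large relative to |d|, the Dirichlet estimate shades toward the collection model; λ plays the analogous role for Jelinek-Mercer.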

Slide 9

Outline

- Motivation & Background
  - Language model (LM) for IR
  - Smoothing methods for IR
- Statistical Machine Translation – Cross-Lingual
  - Motivation
  - IBM Model 1
- Statistical Translation Language Model – Monolingual
  - Synthetic Queries
  - Mutual Information-based approach
  - Regularization of self-translation probabilities
  - Smoothing in Statistical Translation Language Model

Slide 10

A Brief History

Machine translation was one of the first applications envisioned for computers. Warren Weaver (1949): "I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text." It was first demonstrated by IBM in 1954 with a basic word-for-word translation system.

Slide 11

Interest in Machine Translation

Commercial interest:
- The U.S. has invested in MT for intelligence purposes.
- MT is popular on the web; it is the most used of Google's special features.
- The EU spends more than $1 billion on translation costs each year. (Semi-)automated translation could lead to huge savings.

Slide 12

Interest in Machine Translation

Academic interest:
- One of the most challenging problems in NLP research.
- Requires knowledge from many NLP sub-areas, e.g., lexical semantics, parsing, morphological analysis, statistical modeling, …
- Being able to establish links between two languages allows for transferring resources from one language to another.

Slide 13

Word-Level Alignments

Given a parallel sentence pair, we can link (align) words or phrases that are translations of each other.

[Figure: an example word alignment between a parallel sentence pair.]

Slide 14

Machine Translation – Concepts

We are trying to model p(e|f): I give you a French sentence; you give me back English. How are we going to model this?

The maximum likelihood estimate of p(e|f) is freq(e,f)/freq(f). This is way too specific to get any reasonable frequencies; the vast majority of unseen data will have zero counts!

Slide 15

Machine Translation – An Alternative Way

We could use Bayes' rule:

p(e|f) = p(f|e) p(e) / p(f) ∝ p(f|e) p(e)

Why use Bayes' rule instead of estimating p(e|f) directly? It is important that our model for p(e|f) concentrate its probability as much as possible on well-formed English sentences, but it is not important that our model for p(f|e) concentrate its probability on well-formed French sentences; the language model p(e) takes care of English well-formedness. Given a French sentence f, we can then search for an e that maximizes p(e|f), i.e., ê = argmax_e p(f|e) p(e).

Slide 16

Statistical Machine Translation

The noisy channel model:

ê = argmax_e p(e) p(f|e)

[Figure: an English sentence e (|e| = l) passes through a language model p(e) and a translation model p(f|e) to produce the French sentence f (|f| = m); a decoder inverts the channel.]

Assumptions:
- An English word can be aligned with multiple French words, while each French word is aligned with at most one English word.
- The individual word-to-word translations are independent.

Slide 17

Estimation of Probabilities – IBM Model 1

- Simplest of the IBM models (there are 5 models).
- Does not consider word order (bag-of-words approach).
- Does not model one-to-many alignments.
- Computationally inexpensive.
- Useful for parameter estimates that are passed on to more elaborate models.

Slide 18

IBM Model 1

Three important components are involved:
- Language model: gives the probability p(e).
- Translation model: estimates the translation probability p(f|e).
- Decoder: finds the best English sentence e for a given f.

Slide 19

IBM Model 1 – Translation Model

Model the joint probability p(F = f, A = a | E = e) of a French sentence and an alignment A between the two sentences. Assume each French word has exactly one connection:

p(f, a|e) = ε / (l+1)^m × Π_{j=1..m} t(f_j | e_{a_j})

Slide 20

IBM Model 1 – Translation Model

Assume |e| = l and |f| = m. The alignment can then be represented by a series a = a_1 a_2 … a_m. Each a_j takes a value between 0 and l: if the word in position j of the French sentence is connected to the word in position i of the English sentence, then a_j = i; if it is not connected to any English word, then a_j = 0. The alignment is determined by specifying the values of a_j for j from 1 to m, each of which can take any value from 0 to l.

Slide 21

IBM Model 1 – Translation Model

Summing over all possible alignments:

p(f|e) = ε / (l+1)^m × Σ_{a_1=0..l} … Σ_{a_m=0..l} Π_{j=1..m} t(f_j | e_{a_j})
       = ε / (l+1)^m × Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i)

Here e_{a_j} is the English word that the French word f_j is aligned with, and t(f_j | e_{a_j}) is the translation probability. The EM algorithm is used to estimate the translation probabilities.
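As a sketch of how that estimation proceeds, here is a minimal EM loop for IBM Model 1 on an invented toy corpus; "NULL" stands in for the empty alignment a_j = 0, and all names are illustrative:

```python
from collections import defaultdict

# Toy parallel corpus of (French, English) sentence pairs.
corpus = [
    ("la maison".split(), "NULL the house".split()),
    ("la fleur".split(), "NULL the flower".split()),
]

# Initialize t(f|e) uniformly over the French vocabulary.
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(20):                       # EM iterations
    count = defaultdict(float)            # expected counts c(f, e)
    total = defaultdict(float)            # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            # E-step: posterior over which e_i generated f, via Σ_i t(f|e_i)
            z = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    # M-step: re-normalize expected counts into new probabilities
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("maison", "house")], 3))   # rises well above the uniform 1/3
```

Because "la" is explained by "the" across both pairs, the mass of t(·|"house") concentrates on "maison" as EM iterates.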

Slide 22

Outline

- Motivation & Background
  - Language model (LM) for IR
  - Smoothing methods for IR
- Statistical Machine Translation – Cross-Lingual
  - Motivation
  - IBM Model 1
- Statistical Translation Language Model – Monolingual
  - Synthetic Queries
  - Mutual Information-based approach
  - Regularization of self-translation probabilities
  - Smoothing in Statistical Translation Language Model

Slide 23

The Problem of Vocabulary Gap

Query = "auto wash"

[Figure: three documents scored by p("auto") p("wash"): d1 contains "auto wash …", d2 contains "auto buy auto", and d3 contains "car wash vehicle". Only d1 matches both query words exactly; d3 is about washing cars but matches only "wash".]

How to support inexact matching? We would like {"car", "vehicle"} to count as matches for "auto".

Slide 24

Translation Language Models for IR [Berger & Lafferty 99]

Query = "auto wash". Document words are "translated" into query words: in d3 = {"car", "wash", "vehicle"}, both "car" and "vehicle" can translate into "auto", so

p("auto"|d3) = p("car"|d3) × p_t("auto"|"car") + p("vehicle"|d3) × p_t("auto"|"vehicle")

How can we estimate the translation probabilities p_t?

Slide 25

Estimation of Translation Model: p_t(w|u)

Basic translation model:

p(q|d) = Π_{i=1..n} Σ_u p_t(q_i|u) p(u|d)

where p_t(q_i|u) is the translation model and p(u|d) is the regular document LM.

When relevance judgments are available, (q, d) pairs serve as data to train the translation model. Without relevance judgments, we can use synthetic data [Berger & Lafferty 99] or <title, body> pairs [Jin et al. 02].
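To make the scoring concrete, here is a minimal sketch that plugs a translation table into the document model; the p_t values, the toy document, and the function names are invented for illustration:

```python
import math
from collections import Counter

# Toy translation table p_t(query_word | doc_word); missing pairs count as 0.
p_t = {
    ("auto", "auto"): 0.9, ("auto", "car"): 0.3, ("auto", "vehicle"): 0.2,
    ("wash", "wash"): 0.9,
}

def p_trans_lm(w, doc):
    # p(w|d) = Σ_u p_t(w|u) p(u|d), with p(u|d) the ML document model
    cnt, dlen = Counter(doc), len(doc)
    return sum(p_t.get((w, u), 0.0) * c / dlen for u, c in cnt.items())

def score(query, doc, p_coll, lam=0.5):
    # Jelinek-Mercer smoothing on top of the translated document model
    return sum(math.log((1 - lam) * p_trans_lm(w, doc) + lam * p_coll(w))
               for w in query.split())

d3 = "car wash vehicle".split()
print(p_trans_lm("auto", d3))                   # > 0 despite no exact match
print(score("auto wash", d3, lambda w: 0.01))   # flat collection-model stub
```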

Slide 26

Estimation of Translation Model – Synthetic Queries ([Berger & Lafferty 99])

1. Select words that are representative of a document by calculating mutual information for each word in the document:

I(w, d) = p(w, d) log [ p(w, d) / (p(w) p(d)) ]

2. Synthetic queries are sampled based on normalized mutual information.
3. The resulting (d, q) pairs of documents and synthetic queries are used to estimate the translation probabilities with the EM algorithm (IBM Model 1).
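A minimal sketch of the sampling step; the relative-frequency estimates of p(w,d), p(w), and p(d) are an illustrative choice, not necessarily the paper's exact ones:

```python
import math
import random

def mutual_info(w, doc, docs):
    # I(w,d) = p(w,d) log [ p(w,d) / (p(w) p(d)) ] with simple
    # relative-frequency estimates over the whole corpus.
    total = sum(len(d) for d in docs)
    p_wd = doc.count(w) / total
    p_w = sum(d.count(w) for d in docs) / total
    p_d = len(doc) / total
    return p_wd * math.log(p_wd / (p_w * p_d)) if p_wd else 0.0

def synthetic_query(doc, docs, length=3):
    words = list(set(doc))
    # Sample query words in proportion to (normalized) mutual information.
    weights = [max(mutual_info(w, doc, docs), 0.0) for w in words]
    return random.choices(words, weights=weights, k=length)

docs = ["text mining clustering text".split(), "food nutrition diet".split()]
print(synthetic_query(docs[0], docs))  # e.g. ['mining', 'text', 'text']
```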

Slide 27

Estimation of Translation Model – Synthetic Queries Algorithm ([Berger & Lafferty 99])

[Figure: the algorithm samples synthetic queries q from each document d to build (d, q) training data for EM.]

Limitations:
- Can't translate into words not seen in the training queries.
- Computational complexity.

Slide 28

A simpler and more efficient method for estimating p_t(w|u) with higher coverage was proposed in:

M. Karimzadehgan and C. Zhai. Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval. ACM SIGIR, pages 323-330, 2010.

Slide 29

Estimation of Translation Model Based on Mutual Information

1. Calculate mutual information for each pair of words in the collection (measuring co-occurrences):

I(w; u) = Σ_{X_w ∈ {0,1}} Σ_{X_u ∈ {0,1}} p(X_w, X_u) log [ p(X_w, X_u) / (p(X_w) p(X_u)) ]

where X_w indicates the presence/absence of word w in a document.

2. Normalize the mutual information score to obtain a translation probability:

p_t(w|u) = I(w; u) / Σ_{w'} I(w'; u)

Slide 30

Computation Detail

Tabulate the presence/absence of w and u across the N documents:

Doc   X_w   X_u
D1    0     0
D2    1     1
D3    1     0
…     …     …
DN    0     0

Then count how often X_w = 1 and X_u = 1 co-occur (and likewise for the other three combinations). Exploit the index to speed up computation.
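A minimal end-to-end sketch of steps 1-2 on a toy collection; the add-s smoothing of the 2×2 contingency table is my own choice to avoid log(0), not from the paper:

```python
import math
from collections import Counter

docs = [
    {"car", "wash"}, {"car", "vehicle", "wash"}, {"auto", "wash"},
    {"auto", "buy"}, {"food", "diet"},
]
N = len(docs)

def mi(w, u, s=0.25):
    # 2x2 contingency table of presence/absence of w and u over N documents;
    # add-s smoothing on each cell keeps every probability nonzero.
    joint = Counter((int(w in d), int(u in d)) for d in docs)
    p = {c: (joint[c] + s) / (N + 4 * s)
         for c in [(0, 0), (0, 1), (1, 0), (1, 1)]}
    total = 0.0
    for xw in (0, 1):
        for xu in (0, 1):
            p_w = p[(xw, 0)] + p[(xw, 1)]   # marginal p(X_w = xw)
            p_u = p[(0, xu)] + p[(1, xu)]   # marginal p(X_u = xu)
            total += p[(xw, xu)] * math.log(p[(xw, xu)] / (p_w * p_u))
    return total

vocab = sorted(set().union(*docs))

def p_trans(w, u):
    # p_t(w|u) = I(w;u) / Σ_w' I(w';u)
    z = sum(mi(w2, u) for w2 in vocab)
    return mi(w, u) / z

print(round(p_trans("wash", "car"), 3))  # "wash" co-occurs with "car"
```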

Slide 31

Sample Translation Probabilities (AP90)

Translation probabilities p(w | "everest") under the two estimation methods:

Mutual Information               Synthetic Query
w          p(w|"everest")        w          p(w|"everest")
everest    0.079                 everest    0.1051
climber    0.042                 climber    0.0423
climb      0.0365                mount      0.0339
mountain   0.0359                028        0.0308
mount      0.033                 expedit    0.0303
reach      0.0312                peak       0.0155
expedit    0.0314                himalaya   0.01532
summit     0.0253                nepal      0.015
whittak    0.016                 sherpa     0.01431
peak       0.0149                hillari    0.01431

Regularizing Self-Translation Probability

The self-translation probability can be under-estimated, so an exact match would be counted less than an inexact match. Solution: interpolation with "1.0 self-translation":

p_t'(w|u) = α · 1[w = u] + (1 - α) · p_mi(w|u)

α = 1 recovers the basic query likelihood model; α = 0 keeps the original MI estimate.
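A one-line sketch of the interpolation, wrapping the p_trans function from the Slide 30 sketch above (the α value and names are illustrative):

```python
def regularized(p_mi, alpha=0.5):
    # p_t'(w|u) = α·1[w=u] + (1-α)·p_mi(w|u); each row of the table still
    # sums to 1, since the indicator adds exactly α to the diagonal entry.
    return lambda w, u: alpha * float(w == u) + (1 - alpha) * p_mi(w, u)

p_t = regularized(p_trans, alpha=0.3)    # p_trans from the Slide 30 sketch
print(p_t("wash", "wash"), p_t("wash", "car"))
```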

Slide 33

Query Likelihood and Translation Language Model

Document ranking based on query likelihood:

log p(q|d) = Σ_{i=1..n} log p(w_i|d)

where p(w_i|d) is now the translation language model:

p(w|d) = Σ_u p_t(w|u) p_ml(u|d)

Do you see any problem?

Slide 34

Further Smoothing of Translation Model for Computing Query Likelihood

Linear interpolation (Jelinek-Mercer):

p(w|d) = (1 - λ) Σ_u p_t(w|u) p_ml(u|d) + λ p(w|C)

Bayesian interpolation (Dirichlet prior):

p(w|d) = [ |d| Σ_u p_t(w|u) p_ml(u|d) + μ p(w|C) ] / (|d| + μ)

where p_ml(w|d) is the maximum-likelihood document model.
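A short sketch of the Dirichlet variant on top of the translated document model, reusing the p_trans_lm function from the Slide 25 sketch (μ and the names are illustrative):

```python
def p_dir_trans(w, doc, p_coll, mu=2000):
    # p(w|d) = (|d| Σ_u p_t(w|u) p_ml(u|d) + μ p(w|C)) / (|d| + μ)
    return (len(doc) * p_trans_lm(w, doc) + mu * p_coll(w)) / (len(doc) + mu)

print(p_dir_trans("auto", "car wash vehicle".split(), lambda w: 0.01))
```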

Slide 35

Experiment Design

MI vs. synthetic-query estimation:
- Data sets: Associated Press (AP90) and San Jose Mercury News (SJMN) + TREC topics 51-100.
- Relatively small data sets, in order to compare with the synthetic queries of [Berger & Lafferty 99].

MI translation model vs. basic query likelihood:
- Larger data sets: TREC7, TREC8 (plus AP90, SJMN); TREC topics 351-400 for TREC7 and 401-450 for TREC8.

Additional issues:
- Regularization of self-translation?
- Influence of smoothing on translation models?
- Translation model + pseudo feedback?

Slide 36

Mutual information outperforms synthetic queries in both MAP and P@10

[Figure: bar charts of MAP and P@10 comparing Syn. Query vs. MI on AP90 + queries 51-100, with Dirichlet prior smoothing.]

Slide 37

Upper Bound Comparison of Mutual Information and Synthetic Queries

Dirichlet Prior Smoothing:
Data   MAP (Mutual Info)   MAP (Syn. Query)   P@10 (Mutual Info)   P@10 (Syn. Query)
AP90   0.264*              0.25               0.381                0.357
SJMN   0.197*              0.189              0.252                0.267

JM Smoothing:
Data   MAP (Mutual Info)   MAP (Syn. Query)   P@10 (Mutual Info)   P@10 (Syn. Query)
AP90   0.272*              0.251              0.423                0.404
SJMN   0.2*                0.195              0.28                 0.266

Slide 38

Mutual information translation model outperforms basic query likelihood

JM Smoothing:
Data    MAP (Basic QL)   MAP (MI Trans.)   P@10 (Basic QL)   P@10 (MI Trans.)
AP90    0.248            0.272*            0.398             0.423
SJMN    0.195            0.2*              0.266             0.28
TREC7   0.183            0.187*            0.412             0.404
TREC8   0.248            0.249             0.452             0.456

Dirichlet Prior Smoothing:
Data    MAP (Basic QL)   MAP (MI Trans.)   P@10 (Basic QL)   P@10 (MI Trans.)
AP90    0.246            0.264*            0.357             0.381
SJMN    0.188            0.197*            0.252             0.267
TREC7   0.165            0.172             0.354             0.362
TREC8   0.236            0.244*            0.428             0.436

Slide 39

Translation model appears to need less collection smoothing than basic QL

[Figure: retrieval performance as a function of the collection-smoothing parameter, with one curve for the translation model and one for basic query likelihood.]

Slide 40

Translation model and pseudo feedback exploit word co-occurrences differently

JM Smoothing; BL = baseline query likelihood, PFB = query model from pseudo feedback, PFB+TM = pseudo feedback combined with the smoothed translation model:

Data    MAP (BL)   MAP (PFB)   MAP (PFB+TM)   P@10 (BL)   P@10 (PFB)   P@10 (PFB+TM)
AP90    0.246      0.271       0.298          0.357       0.383        0.411
SJMN    0.188      0.229       0.234          0.252       0.316        0.313
TREC7   0.165      0.209       0.222          0.354       0.38         0.384
TREC8   0.236      0.240       0.281          0.428       0.4          0.452

Slide 41

Regularization of self-translation is beneficial

[Figure: retrieval performance on the AP data set with Dirichlet prior smoothing, as the self-translation regularization is varied.]

Slide 42

Summary

- Statistical translation language models are effective for bridging the vocabulary gap.
- Mutual information is more effective and more efficient than synthetic queries for estimating translation model probabilities.
- Regularization of self-translation is beneficial.
- The translation model outperforms basic query likelihood on small and large collections, and is more robust.
- The translation model and pseudo feedback exploit word co-occurrences differently and can be combined to further improve performance.

Slide 43

References

[1] A. Berger and J. Lafferty. Information Retrieval as Statistical Translation. ACM SIGIR, pages 222-229, 1999.

[2] P. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311, 1993.

[3] M. Karimzadehgan and C. Zhai. Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval. ACM SIGIR, pages 323-330, 2010.