Slide 1: Statistical Translation Language Model

Maryam Karimzadehgan
mkarimz2@illinois.edu
University of Illinois at Urbana-Champaign

Slide 2: Outline
- Motivation & Background
  - Language model (LM) for IR
  - Smoothing methods for IR
- Statistical Machine Translation: Cross-Lingual
  - Motivation
  - IBM Model 1
- Statistical Translation Language Model: Monolingual
  - Synthetic Queries
  - Mutual Information-based approach
  - Regularization of self-translation probabilities
  - Smoothing in the Statistical Translation Language Model
Slide 3: The Basic LM Approach ([Ponte & Croft 98], [Hiemstra & Kraaij 98], [Miller et al. 99])

[Figure: two documents and their language models. A text mining paper yields a model with high probabilities for "text", "mining", "association", "clustering", ...; a food nutrition paper yields high probabilities for "food", "nutrition", "healthy", "diet", ...]

Query = "data mining algorithms"

Which model would most likely have generated this query?
Slide 4: Ranking Docs by Query Likelihood

[Figure: each document d1, d2, ..., dN has its own language model; the query q is scored against each model, and documents are ranked by the query likelihoods p(q|d1), p(q|d2), ..., p(q|dN).]
Slide 5: Retrieval as LM Estimation

Document ranking based on query likelihood:

$\log p(q|d) = \sum_{i=1}^{n} \log p(w_i|d)$, where $q = w_1 w_2 \ldots w_n$ and $p(w_i|d)$ is the document language model.

The retrieval problem thus reduces to the estimation of $p(w_i|d)$. Smoothing is an important issue here, and it distinguishes the different approaches.
Slide 6: How to Estimate p(w|d)?

Simplest solution: the Maximum Likelihood Estimator
- $p(w|d)$ = relative frequency of word w in d
- What if a word doesn't appear in the text? Then $p(w|d) = 0$.

In general, what probability should we give a word that has not been observed? If we want to assign non-zero probabilities to such words, we have to discount the probabilities of the observed words. This is what "smoothing" is about...
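A minimal Python sketch of the maximum likelihood estimator; the example document and query words are illustrative:

```python
from collections import Counter

def mle_lm(doc_tokens):
    """Maximum likelihood estimate: p(w|d) = count(w, d) / |d|."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

d = "text mining and text clustering".split()
p = mle_lm(d)
print(p.get("text", 0.0))        # 0.4
print(p.get("algorithms", 0.0))  # 0.0 -- an unseen word gets zero probability
```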
Slide 7: Language Model Smoothing

[Figure: the maximum likelihood estimate of P(w) plotted against the smoothed LM; smoothing discounts the probabilities of observed words and redistributes the mass to unseen words.]
Slide 8: Smoothing Methods for IR (Zhai & Lafferty 01)

Method 1 (Linear interpolation, Jelinek-Mercer):

$p_\lambda(w|d) = (1-\lambda)\, p_{ml}(w|d) + \lambda\, p(w|C)$, where $\lambda$ is a parameter and $p_{ml}(w|d)$ is the ML estimate.

Method 2 (Dirichlet Prior/Bayesian):

$p_\mu(w|d) = \frac{c(w,d) + \mu\, p(w|C)}{|d| + \mu}$, where $\mu$ is a parameter.
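A minimal sketch of both methods, assuming the count c(w, d), the document length |d|, and the collection model p(w|C) are precomputed; the default parameter values are illustrative, not the tuned settings from [Zhai & Lafferty 01]:

```python
def jelinek_mercer(c_wd, doc_len, p_wc, lam=0.5):
    """Method 1: p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|C)."""
    return (1 - lam) * (c_wd / doc_len) + lam * p_wc

def dirichlet_prior(c_wd, doc_len, p_wc, mu=2000):
    """Method 2: p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    return (c_wd + mu * p_wc) / (doc_len + mu)

# An unseen word (c_wd = 0) now gets a small non-zero probability
# proportional to its collection probability p(w|C):
print(jelinek_mercer(0, 100, 0.001))   # 0.0005
print(dirichlet_prior(0, 100, 0.001))  # ~0.00095
```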
Slide 9: Outline

- Motivation & Background
  - Language model (LM) for IR
  - Smoothing methods for IR
- Statistical Machine Translation: Cross-Lingual
  - Motivation
  - IBM Model 1
- Statistical Translation Language Model: Monolingual
  - Synthetic Queries
  - Mutual Information-based approach
  - Regularization of self-translation probabilities
  - Smoothing in the Statistical Translation Language Model
Slide 10: A Brief History

- Machine translation was one of the first applications envisioned for computers.
- Warren Weaver (1949): "I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text."
- First demonstrated by IBM in 1954 with a basic word-for-word translation system.
Slide 11: Interest in Machine Translation

Commercial interest:
- The U.S. has invested in MT for intelligence purposes.
- MT is popular on the web; it is the most used of Google's special features.
- The EU spends more than $1 billion on translation costs each year; (semi-)automated translation could lead to huge savings.
Slide 12: Interest in Machine Translation

Academic interest:
- One of the most challenging problems in NLP research.
- Requires knowledge from many NLP sub-areas, e.g., lexical semantics, parsing, morphological analysis, statistical modeling, ...
- Being able to establish links between two languages allows for transferring resources from one language to another.
Slide 13: Word-Level Alignments

Given a parallel sentence pair, we can link (align) words or phrases that are translations of each other.

[Figure: a parallel French-English sentence pair with lines connecting aligned words.]
Slide 14: Machine Translation: Concepts

- We are trying to model p(e|f): I give you a French sentence f, you give me back an English sentence e. How are we going to model this?
- The maximum likelihood estimate of p(e|f) is $\frac{\text{freq}(e, f)}{\text{freq}(f)}$.
- This is way too specific to get any reasonable frequencies! The vast majority of unseen data will have zero counts.
Slide 15: Machine Translation: An Alternative Way

We could use Bayes' rule:

$p(e|f) = \frac{p(f|e)\, p(e)}{p(f)}$

Why use Bayes' rule instead of estimating p(e|f) directly?
- It is important that our model for p(e|f) concentrates its probability as much as possible on well-formed English sentences. But it is not important that our model for p(f|e) concentrate its probability on well-formed French sentences.
- Given a French sentence f, we can then search for the e that maximizes $p(e|f) \propto p(f|e)\, p(e)$.
Slide 16: Statistical Machine Translation

The noisy channel model:

$\hat{e} = \arg\max_e p(e|f) = \arg\max_e \underbrace{p(e)}_{\text{Language Model}} \cdot \underbrace{p(f|e)}_{\text{Translation Model}}$, with the arg max found by the Decoder.

Notation: e is an English sentence with |e| = l words; f is a French sentence with |f| = m words.

Assumptions:
- An English word can be aligned with multiple French words, while each French word is aligned with at most one English word.
- Independence of the individual word-to-word translations.
Slide 17: Estimation of Probabilities: IBM Model 1

- The simplest of the IBM models (there are 5 models).
- Does not consider word order (bag-of-words approach).
- Does not model one-to-many alignments.
- Computationally inexpensive.
- Useful for parameter estimates that are passed on to more elaborate models.
Slide 18: IBM Model 1

Three important components are involved:
- Language model: gives the probability p(e).
- Translation model: estimates the translation probability p(f|e).
- Decoder: searches for the best e.
Slide 19: IBM Model 1: Translation Model

- Model the joint probability p(F = f, E = e, A = a), where A is an alignment between the two sentences.
- Assume each French word has exactly one connection.
Slide 20: IBM Model 1: Translation Model

- Assume |e| = l and |f| = m; then an alignment can be represented by a series $a = a_1 a_2 \ldots a_m$.
- Each $a_j$ is between 0 and l: if the word in position j of the French sentence is connected to the word in position i of the English sentence, then $a_j = i$; if it is not connected to any English word, then $a_j = 0$.
- The alignment is determined by specifying the values of $a_j$ for j from 1 to m, each of which can take any value from 0 to l. For example, with l = 3 and m = 2, the alignment a = (3, 0) connects the first French word to the third English word and leaves the second French word unaligned.
Slide 21: IBM Model 1: Translation Model

Summing over all possible alignments gives

$p(f|e) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)$

where $t(f_j \mid e_i)$ is the translation probability and $e_i$ is the English word that the French word $f_j$ is aligned with. The EM algorithm is used to estimate the translation probabilities.
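The slides state that EM estimates the translation probabilities; below is a minimal sketch of that EM loop for IBM Model 1 on a toy corpus. The sentence pairs, the NULL-token convention for alignment position 0, and the uniform initialization are standard illustrative choices, not details taken from these slides:

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=20):
    """EM estimation of the translation probabilities t(f|e) for IBM Model 1.
    pairs: list of (french_tokens, english_tokens) sentence pairs."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Uniform initialization of t(f|e)
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for fs, es in pairs:
            es = ["NULL"] + es      # position 0: alignment to no English word
            for f in fs:
                # E-step: posterior over which English word generated f
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: re-normalize the expected counts
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

pairs = [(["la", "maison"], ["the", "house"]),
         (["la", "fleur"], ["the", "flower"])]
t = ibm_model1(pairs)
print(t[("la", "the")])  # high: "la" is consistently explained by "the"
```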
Slide 22: Outline

- Motivation & Background
  - Language model (LM) for IR
  - Smoothing methods for IR
- Statistical Machine Translation: Cross-Lingual
  - Motivation
  - IBM Model 1
- Statistical Translation Language Model: Monolingual
  - Synthetic Queries
  - Mutual Information-based approach
  - Regularization of self-translation probabilities
  - Smoothing in the Statistical Translation Language Model
Slide 23: The Problem of Vocabulary Gap

Query = "auto wash"

[Figure: three documents. d1 contains "auto wash ..."; d2 contains "auto buy ... auto"; d3 contains "car wash vehicle". Scoring by query likelihood multiplies P("auto") and P("wash") for each document, so d3 gets no credit for "car" and "vehicle" even though they match "auto" in meaning, while d2 matches "auto" exactly but nothing matches "wash".]

How can we support inexact matching, e.g., {"car", "vehicle"} == "auto"?
Slide 24: Translation Language Models for IR [Berger & Lafferty 99]

Query = "auto wash"

Idea: let document words "translate" into query words, e.g., "car" → "auto", so d3 = "car wash vehicle" can match the query even though it never contains "auto":

$p(\text{"auto"}|d_3) = p(\text{"car"}|d_3) \cdot p_t(\text{"auto"} \mid \text{"car"}) + p(\text{"vehicle"}|d_3) \cdot p_t(\text{"auto"} \mid \text{"vehicle"})$

How do we estimate $p_t$?
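A sketch of this scoring rule in Python; the translation table below is made up for the running example, not estimated from data:

```python
from collections import Counter

def translation_lm_prob(w, doc_tokens, p_t):
    """p(w|d) = sum_u p_t(w|u) * p_ml(u|d): a query word w can be generated
    by any document word u that 'translates' into it."""
    counts, n = Counter(doc_tokens), len(doc_tokens)
    return sum(p_t.get((w, u), 0.0) * c / n for u, c in counts.items())

# Hypothetical translation probabilities for the running example:
p_t = {("auto", "car"): 0.4, ("auto", "vehicle"): 0.3,
       ("auto", "auto"): 0.9, ("wash", "wash"): 0.9}
d3 = ["car", "wash", "vehicle"]
print(translation_lm_prob("auto", d3, p_t))  # ~0.23, despite no literal "auto" in d3
```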
Slide 25: Estimation of Translation Model: p_t(w|u)

The translation model replaces the regular document LM in the query likelihood:

$p(w|d) = \sum_u p_t(w|u)\, p_{ml}(u|d)$

- When relevance judgments are available, the (q, d) pairs serve as data to train the translation model.
- Without relevance judgments, we can use synthetic data [Berger & Lafferty 99] or <title, body> pairs [Jin et al. 02].
Slide 26: Estimation of Translation Model: Synthetic Queries ([Berger & Lafferty 99])

1. Select words that are representative of a document by calculating the mutual information of each word with the document: $I(w, d) = p(w,d) \log \frac{p(w,d)}{p(w)\,p(d)}$.
2. Sample synthetic queries based on the normalized mutual information.
3. Use the resulting (d, q) pairs of documents and synthetic queries to estimate the probabilities with the EM algorithm (IBM Model 1).
Slide 27: Estimation of Translation Model: Synthetic Queries Algorithm ([Berger & Lafferty 99])

[Figure: the synthetic-query pipeline producing (d, q) training data.]

Limitations:
- Cannot translate into words not seen in the training queries.
- Computational complexity.
Slide 28: A simpler and more efficient method for estimating p_t(w|u) with higher coverage was proposed in:

M. Karimzadehgan and C. Zhai. Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval. ACM SIGIR, pages 323–330, 2010.
Slide 29: Estimation of Translation Model Based on Mutual Information

1. Calculate the mutual information for each pair of words in the collection (measuring co-occurrences), treating the presence/absence of each word in a document as a binary variable:

$I(w; u) = \sum_{X_w \in \{0,1\}} \sum_{X_u \in \{0,1\}} p(X_w, X_u) \log \frac{p(X_w, X_u)}{p(X_w)\, p(X_u)}$

where $X_w$ indicates the presence/absence of word w in a document.

2. Normalize the mutual information score to obtain a translation probability:

$p_t(w|u) = \frac{I(w; u)}{\sum_{w'} I(w'; u)}$
Slide 30: Computation Detail

Count document-level presence/absence over the N documents in the collection:

Doc   X_w   X_u
D1     0     0
D2     1     1
D3     1     0
...   ...   ...
DN     0     0

The counts of each cell (e.g., how many documents have X_w = 1 and X_u = 1) give the joint probabilities needed for I(w; u). Exploit the inverted index to speed up this computation.
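A sketch of the MI-based estimation from the contingency counts above, under stated assumptions: the +0.5 smoothing of the cell probabilities is an assumption added to avoid log(0) and may differ from the exact estimates used in the paper, and the nested vocabulary loops stand in for the index-based computation:

```python
import math
from collections import defaultdict

def mi_translation_model(docs):
    """Estimate p_t(w|u) from document-level presence/absence statistics:
    compute I(w; u) over the binary variables X_w, X_u, then normalize
    over w for each u."""
    N = len(docs)
    doc_sets = [set(d) for d in docs]
    df = defaultdict(int)   # document frequency of each word
    co = defaultdict(int)   # number of documents containing both words
    for s in doc_sets:
        for w in s:
            df[w] += 1
        for w in s:
            for u in s:
                co[(w, u)] += 1   # in practice, computed via the inverted index

    def mi(w, u):
        total = 0.0
        for xw in (0, 1):
            for xu in (0, 1):
                # Document counts for each cell of the 2x2 contingency table
                if xw and xu:
                    n = co[(w, u)]
                elif xw:
                    n = df[w] - co[(w, u)]
                elif xu:
                    n = df[u] - co[(w, u)]
                else:
                    n = N - df[w] - df[u] + co[(w, u)]
                # +0.5 smoothing is an assumption to avoid log(0)
                p_joint = (n + 0.5) / (N + 1)
                p_w = ((df[w] if xw else N - df[w]) + 0.5) / (N + 1)
                p_u = ((df[u] if xu else N - df[u]) + 0.5) / (N + 1)
                total += p_joint * math.log(p_joint / (p_w * p_u))
        return total

    vocab = list(df)
    p_t = {}
    for u in vocab:   # O(V^2) as written; restrict to co-occurring pairs at scale
        scores = {w: max(mi(w, u), 0.0) for w in vocab}
        z = sum(scores.values())
        for w, s in scores.items():
            if z > 0:
                p_t[(w, u)] = s / z
    return p_t
```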
Slide 31: Sample Translation Probabilities (AP90): p(w | "everest")

Mutual Information:
  w          p(w | "everest")
  everest    0.079
  climber    0.042
  climb      0.0365
  mountain   0.0359
  mount      0.033
  reach      0.0312
  expedit    0.0314
  summit     0.0253
  whittak    0.016
  peak       0.0149

Synthetic Query:
  w          p(w | "everest")
  everest    0.1051
  climber    0.0423
  mount      0.0339
  028        0.0308
  expedit    0.0303
  peak       0.0155
  himalaya   0.01532
  nepal      0.015
  sherpa     0.01431
  hillari    0.01431
Slide 32: Regularizing Self-Translation Probability

- The self-translation probability can be under-estimated, so an exact match would be counted less than an inexact match.
- Solution: interpolate with "1.0 self-translation":

$p_t^{\alpha}(w|u) = \begin{cases} \alpha + (1-\alpha)\, p_{mi}(w|u) & \text{if } w = u \\ (1-\alpha)\, p_{mi}(w|u) & \text{if } w \neq u \end{cases}$

- $\alpha = 1$: basic query likelihood model; $\alpha = 0$: original MI estimate.
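A minimal sketch of this interpolation; the function name and the default alpha are illustrative, and p_t is assumed to be an MI-estimated table keyed by (w, u) pairs:

```python
def regularize_self_translation(p_t, vocab, alpha=0.5):
    """Interpolate with '1.0 self-translation': alpha = 1 recovers the basic
    query likelihood model, alpha = 0 keeps the original MI estimate."""
    reg = {}
    for u in vocab:
        for w in vocab:
            self_t = 1.0 if w == u else 0.0
            reg[(w, u)] = alpha * self_t + (1 - alpha) * p_t.get((w, u), 0.0)
    return reg
```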
Slide 33: Query Likelihood and Translation Language Model

Document ranking based on query likelihood, with the translation language model in place of the document language model:

$p(w|d) = \sum_u p_t(w|u)\, p_{ml}(u|d)$

Do you see any problem?
Slide 34: Further Smoothing of Translation Model for Computing Query Likelihood

Linear interpolation (Jelinek-Mercer):

$p(w|d) = (1-\lambda) \sum_u p_t(w|u)\, p_{ml}(u|d) + \lambda\, p(w|C)$

Bayesian interpolation (Dirichlet prior):

$p(w|d) = \frac{|d|}{|d|+\mu} \sum_u p_t(w|u)\, p_{ml}(u|d) + \frac{\mu}{|d|+\mu}\, p(w|C)$
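A self-contained sketch of the Jelinek-Mercer variant; p_t is a translation table keyed by (w, u) as in the earlier sketches, p_wc stands for p(w|C), and the default lambda is illustrative:

```python
from collections import Counter

def smoothed_translation_prob(w, doc_tokens, p_t, p_wc, lam=0.3):
    """Jelinek-Mercer on top of the translation LM:
    p(w|d) = (1 - lam) * sum_u p_t(w|u) p_ml(u|d) + lam * p(w|C)."""
    counts, n = Counter(doc_tokens), len(doc_tokens)
    p_trans = sum(p_t.get((w, u), 0.0) * c / n for u, c in counts.items())
    return (1 - lam) * p_trans + lam * p_wc
```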
Slide 35: Experiment Design

- MI vs. synthetic query estimation:
  - Data sets: Associated Press (AP90) and San Jose Mercury News (SJMN) + TREC topics 51-100.
  - Relatively small data sets, in order to compare our results with the synthetic queries of [Berger & Lafferty 99].
- MI translation model vs. basic query likelihood:
  - Larger data sets: TREC7 and TREC8 (plus AP90, SJMN); TREC topics 351-400 for TREC7 and 401-450 for TREC8.
- Additional issues:
  - Regularization of self-translation?
  - Influence of smoothing on translation models?
  - Translation model + pseudo feedback?
Slide 36: Mutual information outperforms synthetic queries in both MAP and P@10

[Figure: bar charts comparing Syn. Query vs. MI on MAP and P@10; AP90 + queries 51-100, Dirichlet prior smoothing.]
Slide 37: Upper Bound Comparison of Mutual Information and Synthetic Queries

Dirichlet prior smoothing:

Data   MAP: Mutual Info   MAP: Syn. Query   P@10: Mutual Info   P@10: Syn. Query
AP90   0.264*             0.25              0.381               0.357
SJMN   0.197*             0.189             0.252               0.267

JM smoothing:

Data   MAP: Mutual Info   MAP: Syn. Query   P@10: Mutual Info   P@10: Syn. Query
AP90   0.272*             0.251             0.423               0.404
SJMN   0.2*               0.195             0.28                0.266
Slide 38: Mutual information translation model outperforms basic query likelihood

JM smoothing:

Data    MAP: Basic QL   MAP: MI Trans.   P@10: Basic QL   P@10: MI Trans.
AP90    0.248           0.272*           0.398            0.423
SJMN    0.195           0.2*             0.266            0.28
TREC7   0.183           0.187*           0.412            0.404
TREC8   0.248           0.249            0.452            0.456

Dirichlet prior smoothing:

Data    MAP: Basic QL   MAP: MI Trans.   P@10: Basic QL   P@10: MI Trans.
AP90    0.246           0.264*           0.357            0.381
SJMN    0.188           0.197*           0.252            0.267
TREC7   0.165           0.172            0.354            0.362
TREC8   0.236           0.244*           0.428            0.436
Slide 39: Translation model appears to need less collection smoothing than basic QL

[Figure: retrieval performance as a function of the smoothing parameter, for the translation model vs. basic query likelihood.]
Slide 40: Translation model and pseudo feedback exploit word co-occurrences differently

JM smoothing; PFB = query model from pseudo feedback, TM = smoothed translation model:

Data    MAP: BL   MAP: PFB   MAP: PFB+TM   P@10: BL   P@10: PFB   P@10: PFB+TM
AP90    0.246     0.271      0.298         0.357      0.383       0.411
SJMN    0.188     0.229      0.234         0.252      0.316       0.313
TREC7   0.165     0.209      0.222         0.354      0.38        0.384
TREC8   0.236     0.240      0.281         0.428      0.4         0.452
Slide 41: Regularization of self-translation is beneficial

[Figure: retrieval performance as a function of the regularization parameter; AP data set, Dirichlet prior.]
Slide 42: Summary

- Statistical translation language models are effective for bridging the vocabulary gap.
- Mutual information is more effective and more efficient than synthetic queries for estimating translation model probabilities.
- Regularization of self-translation is beneficial.
- The translation model outperforms basic query likelihood on small and large collections, and is more robust.
- The translation model and pseudo feedback exploit word co-occurrences differently and can be combined to further improve performance.
Slide 43: References

[1] A. Berger and J. Lafferty. Information Retrieval as Statistical Translation. ACM SIGIR, pages 222–229, 1999.
[2] P. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311, 1993.
[3] M. Karimzadehgan and C. Zhai. Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval. ACM SIGIR, pages 323–330, 2010.