/
CLAIRE Makes Machine TranslationBLEU No MorebyAli MohammadBSc Mathemat CLAIRE Makes Machine TranslationBLEU No MorebyAli MohammadBSc Mathemat

CLAIRE Makes Machine TranslationBLEU No MorebyAli MohammadBSc Mathemat - PDF document

victoria
victoria . @victoria
Follow
343 views
Uploaded On 2021-10-11

CLAIRE Makes Machine TranslationBLEU No MorebyAli MohammadBSc Mathemat - PPT Presentation

to apply an apOur analysis with patience and helpful and getting led me me from 1950W ork Models Are Choosing a Conclusions and D A ei eto the score in language model model 6932 Find the opti ID: 900587

phrase translation model language translation phrase language model log machine system sentence based x0000 bleu statistical computational linguistics translations

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "CLAIRE Makes Machine TranslationBLEU No ..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

1 CLAIRE Makes Machine TranslationBLEU No
CLAIRE Makes Machine TranslationBLEU No MorebyAli MohammadB.Sc. Mathematics and Physics, Kansas State University (2003)B.Sc. Computer Engineering, Kansas State University (2003)B.Sc. Computer Science, Kansas State University (2003)S.M., Massachusetts Institute of Technology (2006)

2 Submitted to the Department ofElectrical
Submitted to the Department ofElectrical Engineering and Computer Sciencein partial fulfillment of the requirements for the degree ofDoctorate of Scienceat theMASSACHUSETTS INSTITUTE OF TECHNOLOGYARCHIVESMASSACHUSETTS INSTI EOF TECHNOLOGYUL 0 1 2012Lf.iBRARESJune 2012@ Massachuset

3 ts Institute of Technology 2012. All rig
ts Institute of Technology 2012. All rights reserved.A uthor ............ ...............Department of Electrical Engineering and Computer ScienceFebruary 14, 2012Certified by......Boris KatzPrincipal Research Scientist in Computer ScienceCThesis SupervisorAccepted by ....../ 6 Le

4 slie A. KolodziejskiChair of the Committ
slie A. KolodziejskiChair of the Committee on Graduate StudentsDepartment of Electrical Engineering and Computer Science to apply an ap-Our analysis with patience and helpful and getting led me me from ..................................................................19........

5 ........................................
.........................................................................................................................................................................................................................................................................................

6 ........................................
...................................................... ....................................................................................................................50W ork ................................................................ .. .. .... ..... .. .. ..............

7 ........................................
.................................................................................................................................................................Models Are .................................................................................... ........................

8 ................................Choosing
................................Choosing a ..................................................................Conclusions and .................................................................................................................................... ..................D A .

9 ........................................
.............................................................................. ...................ei eto the ................................................................................................. score (in language model model 693-2 Find the optimal scores based on coun

10 ts; this is a nonlinear gradientdescent
ts; this is a nonlinear gradientdescent algorithm. 7 denotes the Hessian of L. .......................................................a Gaussian German appears are indeed ............................................................................... ................in Arabic, ...

11 ........................................
.....................................................here arethese features ....... ........... optimization algorithm. ..................................................... corpus, and....................................(at the ...........proposed metric, ........................

12 ............ language processing has bee
............ language processing has been perform increasingly with others large bodies bodies an early cryptanalyst and cryptogra-pher renowned for his work decrypting the ENIGMA code in World War 2, famouslycompared translation to codebreaking in the 1940's. Early natural langu

13 age practi-tioners achieved small victor
age practi-tioners achieved small victories in the 1950's, and with exciting advancements in lin-guistics (particularly by Noam Chomsky), they promised the dreams outlined aboveto funding agencies in the US and abroad. Anyone familiar with the modern fruits ofmachine translation,

14 taking into consideration the great adva
taking into consideration the great advancements in the theoryof computation, learning theory, linguistics, as well as the massive improvements insupporting infrastructure (microprocessors and datasets), would hardly be surprisedby the demoralizing failures that were to come.After

15 pouring funding into one promising proj
pouring funding into one promising project after another for decades, theNSF, DOD, and CIA commissioned on sensible sensible As a result F. Hockett;an area could be had resulted research should as a science to automatic deserve some Series of could not more skeptical British co

16 mputational computational expressing his
mputational computational expressing his viewon the future of automatic machine translation. The statement is brief enough toinclude here contribute substantially this understanding has become being done It should surely be ones failed, solved since substantially to our assessme

17 nt deal more anaphora. But, been done ha
nt deal more anaphora. But, been done has been large enough and bounds, the Canadian used machine How can we distinguish our work from the pseudo-science that ALPAC and Kaydescribed? It is difficult to declare natural language processing to be a science sinceour goal is not to le

18 arn about an existing system, but is ins
arn about an existing system, but is instead to build usefulsystems of our own. I would argue that it is still possible to do science in this arena,but that it requires care to understand the limitations of our results. It's difficult tomake broad statements about the value of a p

19 articular approach when so much is still
articular approach when so much is stillunknown about language in general and considering how far state-of-the-art systemsfall short of the dream. It is not at all inconceivable that the best research systemsextant would bear little resemblance to a future system that fulfills the

20 promises ofour predecessors.a definitiv
promises ofour predecessors.a definitive a massive be more Collins Model Model for more details.Just as a statistical speech recognition system aims to maximize Pr (e I a), givena sentence S, the statistical parser aims to find the parse tree T that maximizesPr (T S). The Collin

21 s parsers are generative models based on
s parsers are generative models based on the probabilistic con-text free grammar formalism (PCFG), except that they are lexicalized. An unlexi-calized PCFG would be broken down as follows:arg max Pr (T I S) = arg max Pr (T, S)T T= argmax Pr(RHS|LHS,),where RHSi and LHS, denote the

22 left- and right-hand sides of the conte
left- and right-hand sides of the context-free gram-mar rule that is used at the ith step in the derivation of the tree; the probabilities ofthe rules make up the parameters of the model, and maximum are easy to obtain a corpus head generatesrequired complementson each generated

23 depending the constraint head itself Le
depending the constraint head itself Let us the translation has become simply count appear together has a a Pr (e I e2, ei) = at Prt (e I e2, ei) + ab Prb (e I e2) + am Prm (e)where at + ab + am = 1, the as called anthe three models are earlier models, ..., our analysis the lat

24 er an ordered set sentence). Notice an e
er an ordered set sentence). Notice an entire " All many French ers you, you can assume can expect poor grammar infinite numberObvious analogs a normalization since it (f I French sentence word). Model Z D I j ) T j ) French wordssentences (four-word French sentences f) in ----

25 ----------I j, 1.2.4 Phrase-Based Model
----------I j, 1.2.4 Phrase-Based ModelsThe primary unit of information in all of the systems we have described up to thispoint is the word; in a phrase-a score score 1]. That is, instead of considering probabilites of word-to-wordtranslations and word-movement, a phrase-based sy

26 stem will deal with probabilities ofphra
stem will deal with probabilities ofphrase-to-phrase translations and phrase-movement. There is a great such as as simply introduce mechanisms forphrase-to-phrase translations and invent policies to assign probability mass to phrase-to-phrase translations. Others, such as as build

27 a dictionary of phrases fromother infor
a dictionary of phrases fromother information sources. Our experiments are centered on the K6hn system, as itachieves state-of-the-art performance.There are a number of ways one can build phrase dictionaries depending on thedata available. Phrases can be built from word-based ali

28 gnments (such as those gen-erated by the
gnments (such as those gen-erated by the IBM Models). syntactic information to syntactic not syntactically are, generally, as usefulplace more these phrases syntax boundary consider as many phrases phrases Experiments by K6hn et al show that simple heuristic methods based on word

29 -based alignments phrase dictionary, ali
-based alignments phrase dictionary, alignments generated different from alignments generated other direction, the phrase done using an algorithm algorithm The output sentence is gen-erated left-to-right in the form of partial translations which are formed using are depicted and q

30 uality Surprisingly, it (3, e4; f5, f6)
uality Surprisingly, it (3, e4; f5, f6) + log Pr (e3 e1, e2)+ log Pr (e4 I e2, e3) + Distortion(5-6, 3-4)]Figure 1-2: Typical states in the Kbhn decoder. In the original state, depicted above,the decoder has hypothesised that the French phrase f3 f4 corresponds with phrase reorde

31 ring.reordering.above, that we can respo
ring.reordering.above, that we can respond tohis demands to point to the lessons that have been learned and which summarize themainstream agenda of the field.1. Noisy-channel methods, like those described above, are vastly superior to systems, and do so models are tough to We will

32 these and human assistance Kay's quest
these and human assistance Kay's question, presents makes language to permit the text, a great any valuable field of man with are several the attachment the author he saw be valuable? be able be able a first-classincrease more types directly are quick quick However, perhaps

33 the earliest paper to suggest disambigua
the earliest paper to suggest disambiguation by asking in anyformal context is Kay's description description a multilingual translationengine. Later proposals for the use of monolingual consultants in translation aredescribed in Chapter 2.In parsing, Carter [13] presents TreeBanke

34 r, a system similar to an earlier system
r, a system similar to an earlier systemby Tomita for translation [46], but targeted at expert consultants use of a corpus, to manual user may to stop question and full in to many The program the constituentsFor each a finite man with are simply be used a collection map a the

35 transducer with the rules can require a
transducer with the rules can require a child match another The transducer the transduction (a parse permits unary such a chart) instead. posed a the probability FOREST List of NODESNODE labelheadLabelheadWordspanStart, spanEnd (sentence span)List of CHILDRENsCHILDREN scoreList o

36 f NODESMATCH contentNODECHILDRENspanStar
f NODESMATCH contentNODECHILDRENspanStart, spanEnd (subspan of the CHILDREN, for partial matches)TGRAMMAR List of TRULEsrootLabelTRULE labelINPUTRULEOUTPUTRULEINPUTRULE RERULENAMEDRULENOTRULEKLEENERULEANDRULEORRULECONCATRULEmapping from (NODE, CHILDREN) F-+ MATCHRERULE regularExpr

37 ession denoted by &#xrege;&#xxp00;"NAMED
ession denoted by &#xrege;&#xxp00;"NAMEDRULE label denoted by &#xlabe;&#xl000;#NOTRULE INPUTRULE denoted by &#xrule;!KLEENERULE INPUTRULE denoted by &#xrule;ANDRULE List of INPUTRULEs denoted by &#xrule;& &#xrule;& ... & &#xrule;ORRULE List of INPUTRULEs denoted by &#xrule;| &#xru

38 le;I ... I &#xrule;CONCATRULE List of IN
le;I ... I &#xrule;CONCATRULE List of INPUTRULEs denoted by &#xrule;@ &#xrule;@ ... @ &#xrule;OUTPUTRULE mapping which accepts MATCHFigure 1-1: Types used in the Transducer39 x {(Nodes, {(Nodes, node, children]|-..node, children]|-----node, children, subspan]---node, children, sub

39 span]---node, children]|---node, childre
span]---node, children]|---node, children]|----------for each --.---.--((pRule, pNode, Transducer Pseudocode 1.2.4 Sample 4-5), (DT, the man ..., (VP, 1-4)), The man This information these trees. noun phrases be reliably head? Whenoptions with treebank; this this )We would run in

40 to even more trouble with other language
to even more trouble with other languages, like Arabic, whereverbs are conjugated based on gender as well. If we merge (via and) a masculine nounphrase and a feminine noun phrase, our discomfort here: no head assignment a general the hand-written fairly common to a a )Furthermore

41 , when we are considering head. These co
, when we are considering head. These community to first place.parse trees, can identify be singu-plural; otherwise, Arabic withgender agreement subject, but more "factored" distinct internal node labels, Arabic, compared verb gender and number. apple the the apple the girls the

42 apple the apple the apple S, D, and plur
apple the apple the apple S, D, and plural, supervised classification tree and the output phrase. Our to the our model a particular For each variables and language and given in are more according to our model to the English, using given in NN, NNPPVSUFF-DO: 3DIVSUFF.DO: 3MPhere

43 are ---+- n random values from [0, 1]--
are ---+- n random values from [0, 1]---do-E-Step:-------E)---T[i].F +- F[i]-M-Step:--------mstep(T[i], E)-------[i]-..pfunction mstep(tree T, parameters 8)-if T is not a leaf:---- at the noun phrases which yielded relatively rare not statistically English Number (parses correspo

44 ndingsimply penalized compared to to the
ndingsimply penalized compared to to the results are as follows: very quickly or two per sentence). users correspond a significant to further parser to would improve attractive next In lieu of more labeled data, we can make use of unlabeled (i.e., parsed, but notannotated) data t

45 o estimate the lexicalized rule scores v
o estimate the lexicalized rule scores via self-training. For instance,if we find that a verb does not appear in passive voice in our data set, we canconfidently hypothesize The Transduction capital letters nodes of parse trees (often accompanied S e S(-A)? -* NPB?-A * V *S "* V2

46 (NPB?-A) * V1(NPB?-A) ** VP -VB *VBA *V
(NPB?-A) * V1(NPB?-A) ** VP -VB *VBA *V --' (t) VB * VBA1(t,VB) *-2 (t) VBA2(t,VB)e VP-S VB * NPB?-A *V -1 (t) c(t) pp(VB) *byt *-27(t) NPB?-Ae VP-A -VB *VBA *VBA e -+1 (t, $) VB * VBA1(t,VB) *-2 (t, $) VBA2(t,VB)e VP-A -VB * NPB?-A *VBA =4 -I (t, $) c(tNPB?-A) p(VB) * by t *-2 (t

47 , $) NPB?-AVB = VB[^N] -not a form of to
, $) NPB?-AVB = VB[^N] -not a form of to be or to have-+ itselfTable 1.6: The active-to-passive transduction grammarCS e S(-A)? -4 (CCIINTJ)* (PREARG)* CSUB * CVB *-It c(CVB) CSUB that * CVB *CS S(-A)? -(CCIINTJ)* (PREARG)* CSUB * CVBARG *-It c(CVBARG) CSUB that * CVBARG ** VP a *

48 CSUB*CVBARG 4 -1 CSUBe VP -*CVBARG *CVBA
CSUB*CVBARG 4 -1 CSUBe VP -*CVBARG *CVBARG --1 CVBARG1-+2 * CVBARG2 *CVB e VP head is not a form of to, said, add, contend-+ itselfPREARG = !(S.*ICCIINTJ) -+ *-+ itselfCSUB [ [VCMSI].* a *-itselfTable 1.7: The cleft transduction grammar53 the granularity made; in and [14]) seek t

49 o reverse it, with some success. Neverth
o reverse it, with some success. Nevertheless,the fact that the original paper found that it did not help and that the later pa-pers (despite careful design surrounding the inclusion of the system) found limitedimprovement is suggestive.We will modify a current system to from a a

50 amultilingual translation engine. Kay po
amultilingual translation engine. Kay points out that use of the term "fully automatic"when describing translation systems is misleading, because users of fully automatictranslation systems will edit the output if they are familiar with the target language.He therefore suggests th

51 e use of a monolingual consultant to res
e use of a monolingual consultant to resolve ambiguities inthe source language, but he does not propose algorithms for producing or processingthe interaction, nor does he indicate what form the interaction (a man) man) with an emphasis on transfer (i.e., preferring to preserveamb

52 iguity in translation whenever the sourc
iguity in translation whenever the source and target languages make that possi-ble). They also point out that more sophisticated queries than those of Tomita maybe used for certain types of ambiguity.Maruyama [35] presents a system for ambiguity resolution interactively displaying

53 dependencies. system displays displays
dependencies. system displays displays )the user first selects the firstphrase, then the second phrase.you-SUJ ytd m sePyou-SUBJ yesterday meet-PAST man-OBJ see-PASTtyou-SUBJ yesterday meet-PAST man-OBJ see-PASTsee-PASTpresents an elegant variation of the target language rewritin

54 gtechnique: display a chart permit the t
gtechnique: display a chart permit the through the is a free (both gratis and libre) state-of-the-art statistical translation moses decoder search space) the output, candidate translation.Each node a source source 1]fENU{0}where E denotes the set of all edge in be visited the se

55 arch one which topological order):order)
arch one which topological order):order)::= E, E, P[n' -- n]Asking about particular edges is effectively asking a bilingual consultant whetheror not a phrase in the target only have source language; hence, houseisitthere ishqus too.Figure 2-1: A search graph produced by moses for

56 the German sentence Es gibt einHaus.59
the German sentence Es gibt einHaus.59 to estimate estimate [e ans] Iq]q-arg max Pr (ans) H [e I ans]ans= arg max Pr (ans I e) Pr (e) log Pr (ans l e) Pr (e)mane (P Pr (ans)ans e/= arg max Pr (e) script generates. tasks to = 0-.-------------------- Welche W6rter kann man in die

57 sem Satz benutzen?Wer hat das neue Haus
sem Satz benutzen?Wer hat das neue Haus verzichtet?aufvoranforWelche Ausdruck ist der beste Ersatz fOr die Kursivschrift Phrase?Das Problem ist zu schwer fOr mich.6 schwierigdickleibigFigure 2-3: Sample Mechanical Turk Questionnaire including a test question.63 on in the statistic

58 al a review Kay first where a was 0.20,
al a review Kay first where a was 0.20, are able rule-based system. system. and is the basisof Systran's current hybrid systems.3.1.3 Minimum Error-Rate TrainingIn [37], Och suggested adding of high-level arg max max 2.5]to be optimal.3.1.4 Linguistics an engineering effort on m

59 odel) needed to obtain which are scores.
odel) needed to obtain which are scores. Hence, into modern a simple simple describes this model as "almost moronic... [capturing] local tactic constraints by sheerforce of numbers, but the more well-protected bastions of semantic, pragmatic anddiscourse constraint and even morpho

60 logical and global syntactic constraint
logical and global syntactic constraint remainunscathed, in fact unnoticed". in speech speech 44].In our own experiments, we found that it is possible to remove a substantialamount of information-heavy content (as judged by a trigram model) and to obtaina high BLEU score, showing

61 that n-gram models are closely tied to t
that n-gram models are closely tied to the BLEU score(see Table 3.1).Perhaps it is because data remains sparse and take advantage sites and newspapers, and have achieved achieved Beginning with a state-of-the-art Arabic to English translation sys-tem, they found that simply incre

62 asing the amount of data available to th
asing the amount of data available to their n-gramincreasing language value of gains may Their results pomme mange Christmas. Systranwant a white car. not an Mohammad non not an not an not an some possible 69I I I bit; equivalently, to understand digit recognition, recognition,

63 researchers an all-or-nothing approach
researchers an all-or-nothing approach has an would have computing WER.recognition system are unlikely adequacy. Fluency whereas adequacy as "Ali "Ali is a standard method for evaluating machine translation sys-tem performance by comparing translations to one or many human trans

64 lations. Thetranslations are compared by
lations. Thetranslations are compared by the precision of n-grams of successively greater length;the BLEU score typically refers to a smoothed 4-gram comparison; mathematically,it can be described by the following formula:BLEU = e ce'('/c) .g p4log BLEU = Ic( -+ piwhere c is the t

65 otal length of the candidate translation
otal length of the candidate translation produced by the system beingevaluated, r is sum of the lengths of the reference sentences that most closely matchthe lengths of the candidate sentences, Icis 1 if c r and 0 otherwise, and pj refersto the j-gram precision of the test set.Eva

66 luating translation systems is a difficu
luating translation systems is a difficult task-BLEU has a number of usefulproperties that make it a popular choice: it is fast, it is cheap, and it a general precision model. The papers papers 40] report then that an n-gram precision model correlates strongly withhuman judgments

67 of translation quality. Of course, an i
of translation quality. Of course, an improvement in a correlatedvalue does not imply an improvement in the value of interest; however, it is thestrength of the correlation that is so promising. In [39], the authors report thatthe correlation between BLEU and human judgments of a

68 dequacy and fluency forFrench-English tr
dequacy and fluency forFrench-English translation systems are 0.94 and 1.00, respectively.These figures are incredibly impressive. A correlation of 1.0 implies that BLEU isa linear function of the human evaluation of fluency, which implies that BLEU canbe used to predict the human

69 evaluation of fluency perfectly.Another
evaluation of fluency perfectly.Another interpretation of this result, however, is that humans, when asked a set minor issue laboring under far less problems with with human et al al However,they conclude that, although BLEU clearly cannot be used as a substitute for humanevalua

70 tions in general, it can be problems, it
tions in general, it can be problems, it may be possiblecan be human judgments judgments Reconsidering the results of table 3.1, BLEU is effectively a detector for an n-gramlanguage model. model. monolingual evaluators favor translations that aremore fluid to translations that are

71 more adequate. On the other hand, bilin
more adequate. On the other hand, bilingualevaluators tend to be more forgiving to sentences that favor adequacy over fluency.It seems that the gold standard should be the bilingual evaluator; an evaluator thatis able to judge the source text as well as the target text generally

72 people source language to heuristics ev
people source language to heuristics evaluation. This and consistently consistently Forced-choice binary comparisons area premiere method of obtaining information from human evaluators without causingthe kind of fatigue that we described.We will withdraw from the machine translat

73 ion setting for the remainder of thissec
ion setting for the remainder of thissection to analyze the mathematical aspect of this question. This yields a new eval-uation metric which we call CLAIRE2.You are the head judge of a baking competition and you are required to announcea full ranking of the cakes that were submitt

74 ed to you. You are democratic, so you'dl
ed to you. You are democratic, so you'dlike to give a ranking that corresponds to the rankings that would be given by theaverage cake-taster. Tasters are able to compare exactly two cakes at a time, and tasters at draws a better than than �- Rans ] . preference variable var

75 iable 1], making it must be one whose ha
iable 1], making it must be one whose has a from a more convenient ically increases, can maximize maximize log F(x, -xj) + cpi log F(xj -xi)]i=1 j=i+lThis is a mapping from RN to R, so we'll compute the gradient of this quantity withrespect to Y and find the critical points.OLxOx

76 jcij F'(xi -xj)F(xi -xj)= EZ F'(xi -xj)
jcij F'(xi -xj)F(xi -xj)= EZ F'(xi -xj) F(j54i_cpiF'(xj -xi )F(xj -xi) Icii c-xi- x3) -1 -F(x, -x, )Assuming the ci are all nonzero, we can make a few observations about thisselect a ..., = .in each coordinate, and unique up include infinite to take gradient with is especially c

77 onvenient when it comes to computing the
onvenient when it comes to computing the gradient:12exiXexi x -x ex xiThis helpfully satisfies the differential equation:F' = F -F 2hence,Ox F(x -x3)(1 -F(x -x(I) [cig(1 -F(xi -x )) -cjiF(xi -x )]Sz4 F -F'(xI -x -E -x( -Fx -)) [cij -(cij + cji)F(xi -xj)]jiF (xi -xj) (1 -F (xi -xj)

78 )E [cij -(cij + cji)F(x, -xj)].We comput
)E [cij -(cij + cji)F(x, -xj)].We compute the Hessian H for use in nonlinear optimization:02L(-)a xiaxi(cij + cji)F'(xi -xy) for i 74 j-i(n~cik + ck )F'(xi -xk) otherwise.3.3.1 Active Ranking by Forced ComparisonsSuppose now that having a person taste the cakes is expensive. What

79 can we do tominimize the number of compa
can we do tominimize the number of comparisons that must be made? We would like to maximizethe amount of information that is gained by each comparison (in expectation), solet's try to minimize the entropy of the distribution P( I C).Let's begin by considering the effect of includi

80 ng an additional comparison i �-
ng an additional comparison i �- j79 �- j, j, (C U {i -j}) I C]= P(i � j �- j �- j �- j �- j �- j �- j| j| (�-j |-) CI+H(I|C)-H(i �- jC).jC).(-j �- j P( j to compute have gone gone It is rare to b

81 e able to select pairs for comparison in
e able to select pairs for comparison in these scenarios, and thesemodels are generally restricted in that only the scores of the items compared I]N-i =0-k 0-ri VL(z-).d =--8' = |r|2.0 = i'-while i imaflx and 8' &#x Tj ;&#x/T1_; 1 ;&#xTf 7;&#x 0 0;&#x 10 ;Ť.;„ 5;�

82 4.4; Tm;&#x 000; E2.-.----- samples,
4.4; Tm;&#x 000; E2.-.----- samples, an list of ----||YO1|1|I -] (augment the vector)--weight = exp (L(z)) -exp (-((Zo -Xo) A -zo))/2) /(27r)(N-)---N -2]:---+ 1, ...,N- 1]:---Z=0;h=0;f=0---.--; j= --..-- ..., ..., make use a significant the test the sentence that be noted mor

83 e than for anything * This 0.026 0.277 p
e than for anything * This 0.026 0.277 proposed metric, tion and alternative to and other strongly with metrics, and ways of Fred Jelinek's yields a second a right right Most recognizers behave, not like scientists, but like mad inventors or un-trustworthy engineers. The typica

84 l The basis does very an experiment. It
l The basis does very an experiment. It took decades for speech recognition to plateau, to exhaust the immediate grat-ification that can be had by doubling the clock-speed of a server or the size of adata-set; recent research in speech recognition begins to look like real science,

85 includ-ing explorations on the benefits
includ-ing explorations on the benefits of linguistic representation, a willingness to sacrificeperformance in the short-term for nuanced models that can capture the rare eventsas well as the common ones. It has become a respectable scientific enterprise withmany applications and

86 (with the pervasiveness of mobile phone
(with the pervasiveness of mobile phones) ubiquity. Researchagenda in hand, I have high hopes that machine translation can match and surpassthat success.89 (Ex, 8) (8, 8) H(8, E) H(E', 8)Equivalently:Proof:KL(9', 8) &#x 000; 0V X 0 = log(1) =log(lfyP (Y IX; E')log P (Y IX; ). P(

87 Xj)= log Ey (dY)P (Y X 8x)&#x 000; Ey lo
Xj)= log Ey (dY)P (Y X 8x)&#x 000; Ey log P(X;E)X-0-P (Y |X; 8) 1 )= Ey (log P (Y X;0')| X; E) -Ey (log P (Y IX; 8)| X ; 8)Ex(0) = 0 &#x 000; Ex (Ey (log P (Y I X; ') IX; E)) -Ex (Ey (log P (Y I X; 8)| X; E))SH(, 8) -H(8',E).H(8, 8) H(8', 8)A.3 EM is NondecreasingTheorem:L (e()) L

88 (e®(i+1)Proof:= log P (X; 8)xE Ey (log
(e®(i+1)Proof:= log P (X; 8)xE Ey (log P(X;x= 3 Ey (log P (X; E) I X; E()) -Q (0, E(')) + Q (0, 8('))x91Lemma:V 8, O'V E, 'dY)0) | X; 8E) ) (E(i+1), E()) -H (EP), E(8))]+ [Q (E(0+1), W~)) -Q (W(+1), e())]� 0.Thus, the likelihood of successive EM parameter vectors is non-d

89 ecreasing. (Thisis a long occurs very t
ecreasing. (Thisis a long occurs very to inherent E / = to further P (X, occurred this many times:E Ey (Count,(X, Y) I X, a).xNote that these counts are pre-normalization! When the a, are "grouped" the sameway as the 0,, the normalization does not enter into the picture (that is

90 , whenP(E,; a') = P(E,; o') Vr E Ri,wher
, whenP(E,; a') = P(E,; o') Vr E Ri,where the Ri are defined as above to be sets of 8 parameters that must be normalized);consequently, the Q-maximizing 8, can themselves be used as counts to maximizethe likelihood of a. Again, if, for any i, we were to multiply the coefficients o

91 f E,for r E Ri by a constant, the maximu
f E,for r E Ri by a constant, the maximum likelihood values do not change; thus, nonormalization is necessary.[It is easy to see that normalization can be harmful; consider, for instance, thefollowing experiment: we repeatedly select one of two biased coins to flip and recordwhich

92 coin we flipped and the outcome. Then,
coin we flipped and the outcome. Then, the maximum-likelihood probabilitythat the first coin will flip heads, for instance, is the number of heads we got from thefirst coin divided by the number of times we flipped the first coin. Suppose, however,that we add the constraint that

93 the coins are identically biased. Then t
the coins are identically biased. Then the number oftimes we flipped each coin is important; we cannot correctly estimate the probabilityof heads with the unconstrained maximum-likelihood probability of heads for eachcoin alone.]A.6 EM on IBM2+1dGHere we give the full derivation o

94 f the EM updates for the one-dimensional
f the EM updates for the one-dimensional gaussianframework described in Appendix D for the sake of the mathematically skeptical.Let's begin by defining the model:94 I j, malization constraints on finding t, a, (-N(j, f, m))fg(i pi j'm, o-ij,e,m)1jE'm i=1-C(0, j E, m) log N(j, e,

95 m) + C(i, j, , m) log(1 -N(j, f, m))ji~
m) + C(i, j, , m) log(1 -N(j, f, m))ji~m j,f,m i=1+ C(i, j, e, m) log fg(i I pj,m, of,j,m).j,f,m i=1Thus, the optimal value of N is given by:NUj, f, M) C(0,j, f, m)Ei=o C(i, j, f, m)and the y2 and o are optimized by their usual maximum-likelihood estimators:pjfm = 3 i C(i, j, f,

96 m) C(i, j, E, m)i= 1 =1o- -~m= ( -[p,m,)
m) C(i, j, E, m)i= 1 =1o- -~m= ( -[p,m,)2 C(i, j, E, m) �3C(i, j, E, m) all of tags and LSMDNNP?S?PDTPOSPRP$?RB[RS]?RPSYMTOUHVB[DGNPZ]?WDTDeterminerExistential thereForeign WordPreposition/Subordinating ConjunctionAdjectiveR ComparativeS SuperlativeListModalNoun (singular

97 or mass)P ProperS PluralPredeterminerPos
or mass)P ProperS PluralPredeterminerPossessive EndingPersonal Pronoun$ Possessive PronounAdverbR ComparativeS SuperlativeParticleSymboltoInterjectionVerb (base form)D past tenseG gerundN past participleP present, 3rd-person singular * ** & Closed Classbut either aboard about alo

98 ngside amidbesides between then though c
ngside amidbesides between then though ca can shall should'em he 't- t' back before alas amen WHADJPWHADVPWHNPWHPPXWh-adjective PhraseWh-adverb PhraseWh-noun PhraseWh-prepositional PhraseUnknown, Uncertain, or UnbracketableB.4 Extended TagsI took the next two tables from "The Pe

99 nn Treebank: Annotating Predicate Argume
nn Treebank: Annotating Predicate ArgumentStructure" by Mitch Marcus et al [34].103 put? (not dollar finished * *prices recovered, salary might toilet paper.W. Bush a year; buy programs a good in favor green with to mark of them.on Dec mouths of are coming China exported on t

100 hethe new only way like a the wheareabou
hethe new only way like a the wheareabouts higher demand.an exodus "On Sports" at another they say, to give as scheduled.what they SBAR-PRPSBAR-PRP-PRDSBAR-PUTSBAR-SBJSBAR-TMPSBAR-TMP-CLRSBAR-TMP-PRDSBAR-TPCSBAR-TTLSINVSINVSINV-ADVSINV-HLNS1NV-TPCthe charge didn't affect net fo

101 r the quarter, as it wasoffset by tax be
r the quarter, as it wasoffset by tax benefits.we expect a large market in the future, so the long termit will be profitable.that is because John ran up the hill....put our resources where they could do the most; put hismoney where his mouth ishe has made it clear that the issue i

102 s important to himpersonally.we want to
s important to himpersonally.we want to make sure we hold on to our existing cus-tomers.where they lag behind the Japanese is in turning the in-ventiveness into increased production.i will be happy when terms are fixed Oct. 26.it didn't help when she was charged with public drunke

103 n-ness.that was before the tax reform ma
n-ness.that was before the tax reform made things more compli-cated.he jailed them for several hours after they defied his order;"When Harry Met Sally"; "When Irish Eyes Are Smil-ing"Sentence, Inverted"I am hungry, " said Bob.Says Joe, "I am hungry, too."protected themselves again

104 st squalls in any area, be itstocks, bon
st squalls in any area, be itstocks, bonds, or real estateseems same as SINV; just used when it is a headline in-stead of a normal sentenceOffsetting the lower stake in Lyondell were high crude oilprices, among other things.114 SINV-TTLSQSQSQ-PRDSQ-TPCSQ-TTLS115the children sang "

105 Here Comes Santa Claus"Sentence, Questio
Here Comes Santa Claus"Sentence, Question (inverted yes/no, or the argument ofa Wh)How the hell can you live with yourself?What gets by me every time is has the milk expired?Jimmy asked, "Can I go to the store?"Is that the forecast? Is the government really not helpinganybody? Wou

106 ld I have done all those things?"Is Scie
ld I have done all those things?"Is Science, Or Private Gain, Driving Ozone Policy?"(article title)Sentence, DeclarativeA piece down, the computer resigned.Investment bonds ended 1/4 point lower.The company wouldn't elaborate, citing competitive rea-sons.It is the total relationsh

107 ip that is important."It's not very ofte
ip that is important."It's not very often something like this comes up," saidRon.It helps to be male. The farmer stands to go.Share prices closed lower in Paris, and mixed in Amster-dam.JAMAICA FIRES BACKAt the end of the third quarter McDonald's had 10Kunits operating world-wide.

108 Bonuses would be paid based on playing t
Bonuses would be paid based on playing time and perfor-mance.He began his career peddling stock to individual investors.He apologizes for sounding pushy. They don't flinch atwriting them.SS-ADVS-CLFS-CLF-TPCS-CLRS-CLR-ADVS-HLNS-LOCS-MNRS-MNR-CLRS-NOM was followed to come The theme

109 on the WHPP by how much, under what w
on the WHPP by how much, under what weight, for whomX Unknown, Uncertain, or UnbracketableX the closer they got, the more the price rosethe stock tumbled, to end atthe earthquake wasthe crowd shouted, "viva peace, viva."c- list item 3i struggled to to eat my sandwich.i am marri

110 ed, no children.it was a funny time, wha
ed, no children.it was a funny time, what with the vietnam war and all.i was hungry to begin withX-ADV the more extensive the voir dire, the easier you make it.the more he muzzles his colleagues, the more leaks willpop up.X-CLFX-DIR earnings declined by $120 million last year's ro

111 bust levels.X-EXT exports from canada ju
bust levels.X-EXT exports from canada jumped 11% while imports fromcanada rose only 2.7%X-HLNX-PUT mr. bush's veto power puts him a commanding positionin the narrowly divided houseX-TTL118 Appendix CModifying the Bikel ParserThe Collins-Bikel parser is a very heavily optimized cha

112 rt parser. The algorithm forthe parser i
rt parser. The algorithm forthe parser is described in the appendices of Collins' PhD thesis [16], and remainsthe basis of Bikel's implementation. The main modification that is done is to add anequivalentItems list to each element in the chart and to store every item and link that

113 would have been pruned away either by th
would have been pruned away either by the search or just by virtue of the dynamicprogram (which is looking for the top-scoring parse). In the Bikel parser, this changeshould occur in the add method of the Chart class.Following this, any calculations that need to be done (to comput

114 e inside andoutside probabilities, for e
e inside andoutside probabilities, for example) can be done in Decoder.parse once the entireforest has been computed. This is also the appropriate point for the parse forest tobe emitted.119 in closed closed For instance, the number of words in English sentences in the EUROPARL co

115 rpus,depicted in Figure D-1, is roughly
rpus,depicted in Figure D-1, is roughly Gaussian. One could explain this based on thatfuzzy Central Limit Theorem by imagining a number 30 35 be Gaussian Limit Theorem:the other Sentence Length German Sentences 10 15 20 25 English Sentence LengthGerman Sentence gush Sentence Len

116 gth(a) The length English sentences and
gth(a) The length English sentences and German have a are indeed English Word Alignment for German Word 13 from Sentences of Length 250306050403020 -10 -0 10 5 10Figure D-3: Our motivation: The indicesGaussian.15 20 25English Wordof aligned words can be approximated by a123 typic

117 ally results a sentence pair eaj)D(a|j,
ally results a sentence pair eaj)D(a|j, em)f, I simple change In fact, vanilla and ..., deficient, assigning this restriction no longer f) in -------------..-----.* -------- be optimized a numerical to "fitting" the intermediate our experience, | -, , our model the notational

118 422tj,#,m, j,,m) -FN(i -1 I Pjf,m, 0I,e,
422tj,#,m, j,,m) -FN(i -1 I Pjf,m, 0I,e,m)1 fg(O I pUJy,m, O-,e,m) -f I'(j,i,m, Uj,f,m)402 FN(L Iym, 0j,m) -Fg(O Pje,m, 0j,,m)We apply the conjugate gradient descent algorithm to the D values [43]:Given D E Rm-Select Yo (I, U2) E R2 at random-i = 0, #0 = Exo) ho =-0-do:---- used G

119 erman-English German-English The data wa
erman-English German-English The data was aligned at thesentence-level using the standard tools and sentences of vastly differing lengths wereremoved. Finally, we trained the the Pharoah language model model The resulting translations translations The BLEU metric is a standard met

120 hod for evaluating machine translation s
hod for evaluating machine translation sys-tem performance by comparing translations to one or many human translations. Thetranslations are compared by precision and recall on n-grams of successively greaterlength; the BLEU score typically refers to a smoothed 4-gram comparison; m

121 atically, it EUROPARL corpus corpus and
atically, it EUROPARL corpus corpus and the applied aBLEU scorer to our model's output on the standard Our resultsset) are shown in one-dimensional gaussian 160000 180000 Corpus Size This graph due to 132 Second international workshop on evaluating word sense disambiguation syste

122 ms.http://www.senseval. org/, July 2001.
ms.http://www.senseval. org/, July 2001.[2] Danit Ben-Ari, Daniel M. Berry, and Mori Rimon. Translational ambiguityrephrased. In Proceedings of Second International Conference on Theoreticaland Methodological Issues in Machine Machine Ann Bies, Mark Ferguson, Karen Katz, and Pete

123 r F. Brown, Stephen A. Della Pietra, Vin
r F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L.Mercer. The mathematics of statistical machine translation: Parameter estima-tion. Association for Computational Linguistics, 19(2):263-311, 1993.[8] Chris Callison-Burch. Linear b system description descri

124 ption Chris Callison-Burch, Cameron Ford
ption Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, andJosh Schroeder. (meta-) evaluation of machine translation. In Proceedings of theSecond Workshop on Statistical Machine Translation, pages 136-158, Prague,Czech Republic, June 2007. Association for Comput

125 ational Linguistics.[10] Chris Callison-
ational Linguistics.[10] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the roleof bleu in machine Trento, Italy, Italy, Marine Carpuat and Dekai Wu. Word sense disambiguation vs. statistical ma-chine translation. In Proceedings of the 43rd Annual meeting of

126 the Associationfor Computational Lingui
the Associationfor Computational Linguistics, pages 387-394, Ann Arbor, Arbor, Marine Carpuat and Dekai Wu. Improving statistical machine translation usingword sense Computational NaturalNaturalDavid Carter. The treebanker: a tool ing of of Kenneth W. Church and Eduard H. Hovy.

127 Good applications for crummy ma-chine tr
Good applications for crummy ma-chine translation. Machine Translation, 8:239-258, 1993.[16] Michael Collins. Head-Driven Statistical Models for Natural Language Processing.PhD thesis, University of Pennsylvania, 1999.[17] Loic Dugast, Jean Senellart, and Philipp Koehn. Statistica

128 l post-editing onsystrans rule-based tra
l post-editing onsystrans rule-based translation system. In Proceedings of the Second Second Andreas Fink and Aljoscha C. Neubauer. Speed of information processing, psy-chometric intelligence and time estimation as an index of cognitive load. Per-sonality and Individual Difference

129 s, 30:1009-1021, 2001.[19] Sandiway Fong
s, 30:1009-1021, 2001.[19] Sandiway Fong and Robert C. Berwick. New approaches to parsing conjunctionsusing prolog. 1985.[20] Ryan Gabbard, Mitchell Marcus, and Seth penn tree-tree-Ralf Herbrich and Thore Graepel. Trueskill(tm): A bayesian skill rating system.2006.[22] F. Jelinek,

130 R. L. Mercer, L. Bahl, and J. Baker. In
R. L. Mercer, L. Bahl, and J. Baker. Interpolated estimation ofMarkov source source Frederick Jelinek. Up from trigrams! the struggle for improved language Frederick Jelinek. Statistical methods for speech recognition. MIT Press, Cam-bridge, MA, USA, 1997.[251 Howard Johnson, Jo

131 el Martin, George Foster, and Roland Kuh
el Martin, George Foster, and Roland Kuhn. Improvingtranslation quality by discarding most of the phrasetable. In Proceedings of the2007 Joint Conference on Empirical Methods in Natural Language Processing andComputational Natural Language Learning (EMNLP-CoNLL), pages 967-975,Pra

132 gue, Czech Republic, June 2007. Associat
gue, Czech Republic, June 2007. Association for Computational Linguistics.[26] Martin Kay. The MIND System, pages 155-188. Algorithmics Press, New York,1973.[27] Martin Kay. Machine translation will not work. In Proceedings of the 24th Annualmeeting of the Association for Computat

133 ional Linguistics, page 268, New York,Ju
ional Linguistics, page 268, New York,July 1986. Association for Computational Linguistics.[28] Philipp Koehn. Pharaoh: a Beam Search Decoder for Phrase-Based StatisticalMachine Translation Models. In The 6th 6th Philipp Koehn. Europarl: Europarl: Philipp Koehn, Hieu Hoang, Alexan

134 dra Birch, Chris Callison-Burch, Marcell
dra Birch, Chris Callison-Burch, MarcelloFederico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, RichardZens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses:Open source toolkit for statistical machine translation. In Proceedings of the 45thAnnu

135 al Meeting of Philipp Koehn, Franz Jose
al Meeting of Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-basedtranslation. In Proceedings of the 2003 Human Language Technology Conferenceof the North American Chapter of the Association for Computational Linguistics,pages 48-54, Edmonton, AB Canada, May

136 -June 2003. Association for Compu-tation
-June 2003. Association for Compu-tational Linguistics.[32] Beth Levin. English Verb Classes And Alternations: A Preliminary Investigation.University of Chicago Press, Chicago, 1993.[33] Daniel Marcu and William Wong. A phrase-based, joint probability model forstatistical machine

137 translation. In Proceedings of the 2002
translation. In Proceedings of the 2002 Conference on Empir-ical Methods Kishore Papineni, Salim Roukos, Todd Ward, John Henderson, and FlorenceReeder. Corpus-based comprehensive and diagnostic mt evaluation: Initial ara-bic, chinese, french, and spanish results. In Proceedings o

138 f HLT 2002, SecondInternational Conferen
f HLT 2002, SecondInternational Conference on Human Language Technology Research, pages 132-137, San Francisco, March 2002. Association for Computational Linguistics.[40] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: amethod for automatic evaluation of machine

139 translation. In Proceedings of the40th
translation. In Proceedings of the40th Annual meeting of the Association for Computational Linguistics, pages311-318, Philadelphia, July 2002. Association for Computational Linguistics.[41] John R. Pierce. Pierce. John R. Pierce, John B. Carroll, Eric P. Hamp, David G. Hays, Char

140 les F. Hock-ett, Anthony G. Oettinger, a
les F. Hock-ett, Anthony G. Oettinger, and Alan Perlis. Language and Machines: Machines: E. Polak. Optimization: Algorithms and Consistent Approximations. Springer,New York, 1997.[44] Matt Post and Daniel Gildea. [48 Eric W. Weisstein. Central limit theorem. http: //mathworld. wo