pipeline of maximum a posteriori inference steps to identify hidden translation structure and estimate the parameters of their translation models. The first step in this pipeline typically involves learning word alignments (Brown et al., 1993) over parallel sentence-aligned training data. The outputs of this step are the model's most probable word-to-word correspondences within each parallel sentence pair. These alignments are used as the input to a phrase extraction step, where multi-word phrase pairs are identified and scored (with multiple features) based on statistics computed across the training data. The most successful methods extract phrases that adhere to heuristic constraints (Koehn et al., 2003; Och and Ney, 2004). Thus, errors made within the single-best alignment are propagated (1) to the identification of phrases, since errors in the alignment affect which phrases are extracted, and (2) to the estimation of phrase weights, since each extracted phrase is counted as evidence for relative frequency estimates. Methods like those described in Wu (1997), Marcu and Wong (2002), and DeNero et al. (2006) address this problem by jointly modeling alignment and phrase identification, yet have not achieved the same empirical results as surface-heuristic-based methods, or require substantially more computational effort to train.

In this work we describe an approach that widens the pipeline, rather than performing two steps jointly. We present N-best alignments and parses to the downstream phrase extraction algorithm and define a probability distribution over these alternatives to generate expected, possibly fractional, counts for the extracted translation rules under that distribution. These fractional counts are then used when assigning weights to rules. This technique is directly applicable to both flat and hierarchically structured translation models. In syntax-based translation, single-best target language parse trees (given by a statistical parser) are used to assign syntactic categories within each rule, and to constrain the combination of those rules. Decisions made during the parsing step of the pipeline affect the choice of nonterminals used for each rule in the PSCFG. Presenting N-best parse alternatives to the rule extraction process allows the identification of more diverse structures for use during translation and, perhaps, better generalization ability.

We integrated our "wider-pipeline" model into the PSCFG grammar construction process of the publicly available Syntax-Augmented Machine Translation system (Zollmann and Venugopal, 2006). We first review PSCFG grammars (Section 2), and then, in Section 3, present a method of integrating PSCFG rules extracted from N-best alignments and parses, allowing the posterior fractional counts to influence the rule weights. In Section 4, we show how the widened pipeline improves translation performance on a limited-domain speech translation task, the IWSLT Chinese-English data track (Paul, 2006).

2 Synchronous Grammars for SMT

Probabilistic synchronous context-free grammars (PSCFGs) are defined by a source terminal set (source vocabulary) T_S, a target terminal set (target vocabulary) T_T, and a shared nonterminal set N, and induce rules of the form X → ⟨γ, α, ∼, w⟩ where

- X ∈ N is a nonterminal,
- γ ∈ (N ∪ T_S)* is a sequence of nonterminals and source terminals,
- α ∈ (N ∪ T_T)* is a sequence of nonterminals and target terminals,
- the count #NT(γ) of nonterminal tokens in γ is equal to the count #NT(α) of nonterminal tokens in α,
- ∼ : {1, ..., #NT(γ)} → {1, ..., #NT(α)} is a one-to-one mapping from nonterminal tokens in γ to nonterminal tokens in α, and
- w ∈ [0, ∞) is a nonnegative real-valued weight assigned to the rule.

In our notation, we will assume ∼ to be implicitly defined by indexing the NT occurrences in γ from left to right starting with 1, and by indexing the NT occurrences in α by the indices of their corresponding counterparts in γ. Syntax-oriented PSCFG approaches often ignore source structure, instead focusing on generating syntactically well-formed target derivations. Chiang (2005) uses a single nonterminal category, Galley et al. (2006) use syntactic constituents for the PSCFG nonterminal set, and
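A PSCFG rule X → ⟨γ, α, ∼, w⟩ and its well-formedness conditions can be sketched as a small data structure. This is a hypothetical representation for illustration only (it is not the SAMT toolkit's internal format): nonterminals are modeled as (category, index) tuples and terminals as plain strings, with ∼ encoded implicitly through the indices, as in the notation above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PSCFGRule:
    lhs: str       # X, the left-hand-side nonterminal
    src: tuple     # gamma: terminals (str) and nonterminals ((category, index) tuples)
    tgt: tuple     # alpha: same token convention as src
    weight: float  # w, a nonnegative rule weight

    def is_well_formed(self) -> bool:
        src_idx = [tok[1] for tok in self.src if isinstance(tok, tuple)]
        tgt_idx = [tok[1] for tok in self.tgt if isinstance(tok, tuple)]
        return (
            len(src_idx) == len(tgt_idx)                     # #NT(gamma) == #NT(alpha)
            and src_idx == list(range(1, len(src_idx) + 1))  # source NTs indexed 1..k left to right
            and sorted(tgt_idx) == src_idx                   # '~' is a one-to-one mapping
            and self.weight >= 0.0                           # w in [0, infinity)
        )
```

For example, a rule pairing a source side with two co-indexed nonterminals against a reordered target side passes the check, while a rule whose target indices do not match the source indices fails it.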
with the language model weight λ_LM via minimum error rate training (MER) (Och, 2003). Here, we focus on the estimation of the feature values during the grammar construction process. The feature values are statistics estimated from rule counts.

2.3 Feature Value Statistics

The features represent multiple criteria by which the decoding process can judge the quality of each rule and, by extension, each derivation. We include both real-valued and boolean-valued features for each rule. The following probabilistic quantities are estimated and used as feature values:

- p̂(r | lhs(r)): probability of a rule given its left-hand-side category;
- p̂(r | src(r)): probability of a rule given its source side;
- p̂(r | tgt(r)): probability of a rule given its target side;
- p̂(ul(tgt(r)) | ul(src(r))): probability of the unlabeled target side of the rule given its unlabeled source side; and
- p̂(ul(src(r)) | ul(tgt(r))): probability of the unlabeled source side of the rule given its unlabeled target side.

In our notation, lhs returns the left-hand side of a rule, src returns the source side γ, and tgt returns the target side α of a rule r. The function ul removes all syntactic labels from its arguments, but retains ordering notation. For example, applying ul to the target side NP+AUX_1 does not go strips the label NP+AUX while retaining the nonterminal's index and the terminal words does not go. The last two features represent the same kind of relative frequency estimates commonly used in phrase-based systems. The ul function allows us to calculate these estimates for rules with nonterminals as well.

To estimate these probabilistic features, we use maximum likelihood estimates based on counts of the rules extracted from the training data. For example, p̂(r | lhs(r)) is estimated by computing #(r)/#(lhs(r)), aggregating counts from all extracted rules.

As in phrase-based translation model estimation, the feature set also contains two lexical weights p̂_w(lex(src(r)) | lex(tgt(r))) and p̂_w(lex(tgt(r)) | lex(src(r))) (Koehn et al., 2003) that are based on the lexical symbols of γ and α. These weights are estimated based on a pair of statistical lexicons representing p̂(s|t) and p̂(t|s), where s and t are single words in the source and target vocabulary. These word-level translation models are typically estimated by maximum likelihood, considering the word-to-word links from single-best alignments as evidence.

The feature set also contains several boolean features that indicate whether: (a) the rule is purely lexical in γ and α, (b) the rule is purely non-lexical in γ and α, (c) the ratio of lexical source and target words in the rule is between 1/5 and 5. It also contains a feature that reflects the number of target lexical symbols and a feature that is 1 for each rule, allowing the decoder to prefer shorter (or longer) derivations based on the corresponding feature weight.

3 N-best Evidence

The PSCFG rule extraction procedure described above relies on high-quality word alignments and parses. The quality of the alignments affects the set of phrases that can be identified by the heuristics in Koehn et al. (2003). Improving or diversifying the set of initial phrases also affects the rules with nonterminals that are identified via the procedure described above. Since PSCFG systems rely on rules with nonterminal symbols to represent reordering operations, the set of these initial phrases has the potential to have a profound impact on translation quality. The quality of the parses affects the syntactic categories assigned to the left-hand-side and nonterminal symbols of each rule. These categories play an important role in constraining the decoding process to grammatically feasible target parse trees.

Several recent studies explore the relationship between the quality of the initial models in the pipeline and final translation quality. Quirk and Corston-Oliver (2006) show improvements in translation quality when the quality of parsing is improved by adding additional training data within the treelet paradigm introduced by Quirk et al. (2005). Koehn et al. (2003) show that translation quality in a phrase-based system does not vary significantly when increasing the complexity of the model used for alignment (ranging from IBM Model 1 through 4), but that increasing the amount of parallel training
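The relative-frequency estimation of features such as p̂(r | lhs(r)) = #(r)/#(lhs(r)) can be sketched as follows. This is a minimal illustration under assumed names (the function and the dict-based rule representation are mine, not the toolkit's); the counts may be fractional when they come from N-best evidence, as in Section 3.

```python
from collections import defaultdict

def relative_frequency_features(rule_counts, key):
    """Estimate p^(r | key(r)) = #(r) / #(key(r)) from rule counts.

    rule_counts: dict mapping a rule to its (possibly fractional) count.
    key: function returning the conditioning event for a rule, e.g. its
    left-hand side, its source side, or its (unlabeled) target side.
    """
    totals = defaultdict(float)
    for rule, c in rule_counts.items():
        totals[key(rule)] += c          # aggregate counts per conditioning event
    return {rule: c / totals[key(rule)] for rule, c in rule_counts.items()}
```

The same helper serves all five probabilistic features by swapping the `key` function, which mirrors how a single counting pass can feed every relative-frequency estimate.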
instances that rely on low-probability or fewer alignments and parses will get lower counts (approaching 0 as certainty increases).

3.2 Refined Alignments

Work by Och and Ney (2004) and Koehn et al. (2003) demonstrates the value of generating word alignments in both source-to-target and target-to-source directions in order to facilitate the extraction of phrases with many-to-many word relationships. We follow Koehn et al. (2003) in generating a refined bidirectional alignment using the heuristic algorithm grow-diag-final-and described in that work. Since we require N-best alignments, we first extract N-best alignments in each direction, and then apply the refinement technique to all N² bidirectional alignment pairs. The resulting alignments are assigned the probability (p_f · p_r)^λ, where p_f is the candidate probability of the forward alignment and p_r is the candidate probability of the reverse alignment.

We then remove any duplicate refined alignments that came about due to the refinement process (the refined alignment with the highest probability is retained). Finally, we select the top N alignments from this set of refined alignments.

The selection of λ controls the entropy of the resulting distribution over candidate alignments (after normalization). Higher values λ > 1 make the distribution more peaked (affecting the estimation of features on rules from these alignments), while values 0 < λ < 1 make the distribution more uniform. A more peaked distribution favors rules from the top alignments, while a more uniform one gives rules from lower-performing alignments more of a chance to participate in translation. We can also use this same technique to control the distribution over parses.

4 Translation Results

4.1 Experimental Setup

We present results on the IWSLT 2007 and 2008 Chinese-to-English translation tasks, based on the full BTEC corpus of travel expressions with 120K parallel sentences (906K source words and 1.2M target words) as well as the evaluation corpora from the evaluation years preceding 2007. The development data consists of 489 sentences (average length of 10.6 words) from the 2006 evaluation, the 2007 test set contains 489 sentences (average length of 6.47 words), and the 2008 test set contains 507 sentences (average length of 5.59 words). Word alignment was trained using the GIZA++ toolkit, and N-best parses were generated by the Charniak (2000) parser, without additional reranking.[2] N-best alignments were generated from source to target and target to source, and refined as described above. Initial phrases of up to length 10 were identified using the heuristics proposed by Koehn et al. (2003). Rules with up to 2 nonterminals are extracted using the SAMT toolkit (Zollmann and Venugopal, 2006), modified to handle N-best alignments and parses and posterior counting. Note that lexical weights (Koehn et al., 2003) as described above are assigned based on single-best word alignments. Rules that receive a zero probability value for their lexical weights are immediately discarded, since they would then have a prohibitively high cost when used during translation. Rules extracted from single-best evidence as well as N-best evidence can be discarded in this way.

The n-gram language model is trained on the target side of the parallel training corpus,[3] and translation experiments use the decoder and MER trainer available in the same toolkit. We use the cube-pruning option (Chiang, 2007) in these experiments.

4.2 Cumulative (N, N′)-Best

We measure translation quality using the mixed-cased IBM-BLEU (Papineni et al., 2002) metric as we vary the size of N and N′ for alignments and parses, respectively. Each value of N implies that the first N alternatives have been considered when building the grammar. For each grammar we also track the number of rules relevant for the first sentence in the IWSLT 2007 test set (grammars are subsampled on a per-sentence basis to keep memory requirements low during decoding). We also note the number of seconds required to translate each test set.

[2] Reranking might be used to change the parse probability estimates, but would not change the set of rules extracted, only the fractional counts.
[3] As BTEC is a very domain-specific corpus, training the language model on larger available monolingual corpora (e.g., from the news domain) is of limited utility.

System                  #Rules (1 sent.)   Dev     2007    2008    2007 Time(s)   2008 Time(s)
N=1  (lex=1st)          400K               0.309   0.355   0.453   8108           8367
N=1  (λ=1, lex=m4)      420K               0.301   0.361   0.440   8024           8250
N=5  (λ=1, lex=m4)      680K               0.322   0.374   0.470   15376          15577
N=10 (λ=1, lex=m4)      900K               0.313   0.382   0.467   19298          19469
N=50 (λ=1, lex=m4)      1500K              0.316   0.370   0.478   29500          30894
N=10 (λ=0.5, lex=m4)    900K               0.315   0.395   0.477   20398          20118
N=50 (λ=0.5, lex=m4)    1500K              0.317   0.373   0.477   33682          34760
N=10 (λ=2, lex=m4)      900K               0.313   0.375   0.464   15117          15070
N=50 (λ=2, lex=m4)      1500K              0.315   0.373   0.488   26590          27126

Table 1: Grammar statistics and translation quality (IBM-BLEU) on development (IWSLT 2006) and test sets (IWSLT 2007, 2008) when integrating N-best alignments for alternative Syntax-Augmented grammar configurations. #Rules reflects rules that are applicable to the first sentence in IWSLT 2007. Decoding times in seconds are cumulative over all sentences in the respective test set.

System               #Rules (1 sent.)   Dev     2007    2008    2007 Time(s)   2008 Time(s)
Hier N=1             10K                0.277   0.367   0.460   895            1451
Hier N=5  (λ=1)      12K                0.286   0.374   0.472   906            1476
Hier N=10 (λ=1)      13K                0.291   0.382   0.477   944            1516
Hier N=50 (λ=1)      14K                0.282   0.384   0.463   979            1596
Hier N=10 (λ=0.5)    13K                0.285   0.399   0.476   963            1547
Hier N=50 (λ=0.5)    14K                0.283   0.376   0.470   982            1599
Hier N=10 (λ=2)      13K                0.284   0.372   0.467   965            1570
Hier N=50 (λ=2)      14K                0.290   0.374   0.459   921            1483

Table 2: Grammar statistics and translation quality (IBM-BLEU) on development (IWSLT 2006) and test sets (IWSLT 2007, 2008) when integrating N-best alignments for purely hierarchical grammar configurations. #Rules reflects rules that are applicable to the first sentence in IWSLT 2007. Decoding times in seconds are cumulative over all sentences in the respective test set.
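The λ values in the configurations above come from the alignment-combination step of Section 3.2. A minimal sketch of that step follows; the function name and the representation of alignments as hashable values are my own assumptions, and `refine` stands in for a symmetrization heuristic such as grow-diag-final-and.

```python
from itertools import product

def refined_nbest(forward, reverse, refine, n, lam=1.0):
    """Combine N-best directional alignments into top-n refined alignments.

    forward/reverse: lists of (alignment, probability) pairs per direction.
    refine: heuristic mapping a (forward, reverse) pair to one refined
    (hashable) bidirectional alignment, e.g. grow-diag-final-and.
    Each refined candidate is scored (p_f * p_r) ** lam; duplicates keep
    their highest score, the top n survive, and scores are normalized.
    """
    best = {}
    for (af, pf), (ar, pr) in product(forward, reverse):   # all N x N pairs
        refined = refine(af, ar)
        score = (pf * pr) ** lam
        if score > best.get(refined, 0.0):                 # dedup: keep the best copy
            best[refined] = score
    top = sorted(best.items(), key=lambda kv: -kv[1])[:n]  # top-n refined alignments
    z = sum(s for _, s in top)
    return [(a, s / z) for a, s in top]                    # normalized distribution
```

Raising `lam` above 1 sharpens the resulting distribution toward the top alignments; values between 0 and 1 flatten it, matching the entropy-control behavior described in Section 3.2.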
4.3 Grammar Rules

Figure 1 shows the most frequently occurring rules that exist only in the best-performing N=10, N′=1 grammar, and not in the baseline (Model-4 lexicon) grammar. We show the estimated counts on these rules as well as their source, target, and left-hand-side nonterminal symbols. These rules are particularly interesting when considering the domain of this translation task. The source side of the training data contains no punctuation (since it is transcribed speech), while the target side does (since the targets were manually generated translations). The system therefore attempts to generate punctuation during translation. Consider the first example, where the Chinese word for "please" (often found at the beginning of a sentence) is aligned to the English "please ." (at the end of the sentence, as indicated by the punctuation). This rule is extracted from a lower-probability alignment with high levels of distortion. This pattern was not seen in any single-best alignments.

[Figure 1: Top rules extracted by our method, but not the baseline.]

System   #Rules (1 sent.)   #Labels   Dev     2007    2008    2007 Time(s)   2008 Time(s)
N′=1     420K               10K       0.301   0.361   0.440   8024           8250
N′=5     800K               15K       0.300   0.358   0.447   16930          15102
N′=10    1079K              18K       0.299   0.361   0.460   26944          23662

Table 3: Grammar statistics and translation quality (IBM-BLEU) on development (IWSLT 2006) and test sets (IWSLT 2007, 2008) when integrating N-best parses with the Syntax-Augmented grammar. #Rules reflects rules that are applicable to the first sentence in IWSLT 2007. Decoding times in seconds are cumulative over all sentences in the respective test set. All experiments in this table use lex=m4, λ=1, and 1-best alignments.

5 Conclusion

In this work we have demonstrated the feasibility and benefits of widening the MT pipeline to include
additional evidence from N-best alignments and parses. We integrate this diverse knowledge under a principled model that uses a probability distribution over these alternatives. We achieve significant improvements in translation quality over grammars built on single-best evidence alone when considering N-best alignments, while N′-best parses seem to have no impact on translation quality. Using a relatively small number of additional alternative alignments results in significant improvements in quality, with minimal impact on the number of rules in the grammar and the translation runtime for a hierarchical system, but at significantly increased grammar size and runtime for a syntax-augmented system. In future work we plan to focus on methods to take better advantage of the syntactic labels from alternative parse candidates.

Acknowledgments

This work has been partly funded by GALE HR0011-06-2-0001. N. Smith is supported by NSF IIS-0836431 and an IBM faculty award.

References

Alfred V. Aho and Jeffrey D. Ullmann. 1969. Syntax directed translations and the pushdown assembler. Journal of Computer and System Sciences.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics.

John DeNero, Dan Gillick, James Zhang, and Dan Klein. 2006. Why generative phrase models underperform surface heuristics. In Proceedings of the Workshop on Statistical Machine Translation, ACL.

Christopher Dyer, Smaranda Muresan, and Philip Resnik. 2008. Generalizing word lattice translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
Michael Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

Kuzman Ganchev, Joao V. Graca, and Ben Taskar. 2008. Better alignments = better translations? In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

Daniel Marcu and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Daniel Marcu, Wei Wang, Abdessamad Echihabi, and Kevin Knight. 2006. SPMT: Statistical machine translation with syntactified target language phrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Sydney, Australia.

Haitao Mi, Liang Huang, and Qun Liu. 2008. Forest-based translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Franz J. Och and Hermann Ney. 2003. A systematic comparison of various alignment models. Computational Linguistics.