Towards a Universal Analyzer of Natural Languages

Waleed Ammar
CMU-LTI-16-010
June 2016

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Chris Dyer (co-chair), Carnegie Mellon University
Noah A. Smith (co-chair), University of Washington
Tom Mitchell, Carnegie Mellon University
Kuzman Ganchev, Google Research

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2016 Waleed Ammar

Keywords: multilingual NLP, dependency parsing, word alignment, part-of-speech tagging, low-resource languages, recurrent neural networks, word embeddings, linguistic typology, code switching, language identification, MALOPA, multiCluster, multiCCA, word embeddings evaluation, CRF autoencoders, language-universal.

For my family.

Acknowledgments

I had the pleasure and honor of interacting with great individuals who influenced my work throughout the five precious years of my Ph.D. First and foremost, I would like to express my deep gratitude to Noah Smith and Chris Dyer for their enlightening guidance and tremendous support in every step of the road. Thank you for placing your confidence in me at the times I needed it the most. I will be indebted to you for the rest of my career.

I also would like to acknowledge the following groups of talented individuals: my collaborators: Chu-Cheng Lin, Lori Levin, Yulia Tsvetkov, D. Sculley, Kristina Toutanova, Shomir Wilson, Norman Sadeh, Miguel Ballesteros, George Mulcaire, Guillaume Lample, Pradeep Dasigi, Eduard Hovy, Anders Søgaard, Zeljko Agic, Hiroaki Hayashi, Austin Matthews and Manaal Faruqui for making work much more enjoyable; my (ex-)colleagues: Brendan O'Connor, Swabha Swayamdipta, Lingpeng Kong, Chu-Cheng Lin, Dani Yogatama, Ankur Parikh, Ahmed Hefny, Prasanna Kumar, Manaal Faruqui, Shay Cohen, Kevin Gimpel, Andre Martins, Avneesh Saluja, David Bamman, Sam Thomson, Austin Matthews, Dipanjan Das, Jesse Dodge, Matt Gardner, Mohammad Gowayyed, Adona Iosif, Jesse Dunietz, Subhodeep Moitra, Adhi Kuncoro, Yan Chuan Sim, Leah Nicolich-Henkin, Eva Schlinger and Victor Chahuneau for creating a diverse and enriching environment, and I forgive you for igniting my impostor syndrome more often than I would like to admit; my Google fellowship mentors: Ryan McDonald and George Foster for many stimulating discussions; my committee members Kuzman Ganchev and Tom Mitchell for their time and valuable feedback; and other members of the CMU family: Stacey Young, Laura Alford, Mallory Deptola, Roni Rosenfeld, Mohamed Darwish, Mahmoud Bishr, Mazen Soliman who made school feel more of a home than a workplace.

I am especially grateful for my family. Thank you Abdul-mottaleb and Mabroka Ammar for giving me everything and expecting nothing in return! Thank you Amany Hasan for sacrificing your best interests so that I can pursue my passion! Thank you Salma Ammar for creating so much joy in my life! Thank you Khaled Ammar for being a great role model to follow! Thank you Samia and Walaa Ammar for your tremendous love!

Abstract

Developing NLP tools for many languages involves unique challenges not typically encountered in English NLP work (e.g., limited annotations, unscalable architectures, code switching). Although each language is unique, different languages often exhibit similar characteristics (e.g., phonetic, morphological, lexical, syntactic) which can be exploited to synergistically train analyzers for multiple languages. In this thesis, we advocate for a novel language-universal approach to multilingual NLP in which one statistical model trained on multilingual, homogeneous annotations is used to process natural language input in multiple languages. To empirically show the merits of the proposed approach, we develop MALOPA, a language-universal dependency parser which outperforms monolingually-trained parsers in several low-resource and high-resource scenarios. MALOPA is a greedy transition-based parser which uses multilingual word embeddings and other language-universal features as a homogeneous representation of the input across all languages. To address the syntactic differences between languages, MALOPA makes use of token-level language information as
well as language-specific representations such as fine-grained part-of-speech tags. MALOPA uses a recurrent neural network architecture and multi-task learning to jointly predict POS tags and labeled dependency parses.

Focusing on homogeneous input representations, we propose novel methods for estimating multilingual word embeddings and for predicting word alignments. We develop two methods for estimating multilingual word embeddings from bilingual dictionaries and monolingual corpora. The first estimation method, multiCluster, learns embeddings of word clusters which may contain words from different languages, and learns distributional similarities by pooling the contexts of all words in the same cluster in multiple monolingual corpora. The second estimation method, multiCCA, learns a linear projection of monolingually trained embeddings in each language to one vector space, extending the work of Faruqui and Dyer (2014) to more than two languages. To show the scalability of our methods, we train multilingual embeddings in 59 languages. We also develop an extensible, easy-to-use web-based evaluation portal for evaluating arbitrary multilingual word embeddings on several intrinsic and extrinsic tasks.

We develop the conditional random field autoencoder (CRF autoencoder) model for unsupervised learning of structured predictors, and use it to predict word alignments in parallel corpora. We use a feature-rich CRF model to predict the latent representation conditional on the observed input, then reconstruct the input conditional on the latent representation using a generative model which factorizes similarly to the CRF model. To reconstruct the observations, we experiment with a categorical distribution over word types (or word clusters), and a multivariate Gaussian distribution that generates pretrained word embeddings.

Contents

1 Introduction
1.1 Challenges in Multilingual NLP
1.2 Thesis Statement
1.3 Summary of Contributions

2 Background
2.1 Natural Languages
2.1.1 It is not sufficient to develop NLP tools for the few most popular languages
2.1.2 Languages often studied in NLP research
2.1.3 Characterizing similarities and dissimilarities across languages
2.2 Word Embeddings
2.2.1 Distributed word representations
2.2.2 Skip-gram model for estimating monolingual word embeddings

3 Language-universal Dependency Parsing
3.1 Overview
3.2 Approach
3.2.1 Homogeneous Annotations
3.2.2 Core Model
3.2.3 Language Unification
3.2.4 Language Differentiation
3.2.5 Multi-task Learning for POS Tagging and Dependency Parsing
3.3 Experiments
3.3.1 Target Languages with a Treebank
3.3.2 Target Languages without a Treebank
3.3.3 Parsing Code-switched Input
3.4 Open Questions
3.5 Related Work
3.6 Summary

4 Multilingual Word Embeddings
4.1 Overview
4.2 Estimation Methods
4.2.1 MultiCluster embeddings
4.2.2 MultiCCA embeddings
4.2.3 MultiSkip embeddings
4.2.4 Translation-invariant matrix factorization
4.3 Evaluation Portal
4.3.1 Word similarity
4.3.2 Word translation
4.3.3 Correlation-based evaluation tasks
4.3.4 Extrinsic tasks
4.4 Experiments
4.4.1 Correlations between intrinsic vs. extrinsic evaluation metrics
4.4.2 Evaluating multilingual estimation methods
4.5 Related Work
4.6 Summary

5 Conditional Random Field Autoencoders
5.1 Overview
5.2 Approach
5.2.1 Notation
5.2.2 Model
5.2.3 Learning
5.2.4 Optimization
5.3 Experiments
5.3.1 POS Induction
5.3.2 Word Alignment
5.3.3 Token-level Language Identification
5.4 Open Questions
5.5 Related Work
5.6 Summary

6 Conclusion
6.1 Contributions
6.2 Future Directions

A Adding New Evaluation Tasks to the Evaluation Portal

List of Figures

1.1 Instead of training one model per language, we develop one language-universal model to analyze text in multiple languages.
2.1 A histogram that shows the number of languages by population of native speakers in log scale. For instance, the figure shows that 306 languages have a population between one million and ten million native speakers. Numbers are based on Lewis et al. (2016).
2.2 Number of Internet users (in millions) per language follows a power law distribution, with a long tail (not shown) which accounts for 21.8% of Internet users. The figure was retrieved on May 10, 2016 from http://www.internetworldstats.com/stats7.htm
2.3 Reproduced from Bender (2011): Languages studied in ACL 2008 papers, by language genus and family.
3.1 Dependency parsing is a core problem in NLP which is used to inform other tasks such as coreference resolution (e.g., Durrett and Klein, 2013), semantic parsing (e.g., Das et al., 2010) and question answering (e.g., Heilman and Smith, 2010).
3.2 The parser computes vector representations for the buffer, stack and list of actions at timestep t, denoted b_t, s_t, and a_t, respectively. The three vectors feed into a hidden layer denoted as 'parser state', followed by a softmax layer that represents possible outputs at timestep t.
3.3 An illustration of the content of the buffer S-LSTM module, the actions S-LSTM module, and the stack S-LSTM module for the first three steps in parsing the sentence "You love your cat $". Upper left: the buffer is initialized with all tokens in reverse order. Upper right: simulating a 'shift' action, the head vector of the buffer S-LSTM backtracks, and two new hidden layers are computed for the actions S-LSTM and the stack S-LSTM. Lower left: simulating another 'shift' action, the head vector of the buffer S-LSTM backtracks, and two additional hidden layers are computed for the actions S-LSTM and the stack S-LSTM. Lower right: simulating a 'right-arc(nsubj)', the head vector of the buffer S-LSTM backtracks, and an additional hidden layer is computed for the actions S-LSTM. The stack S-LSTM first backtracks two steps, then a new hidden layer is computed as a function of the head, the dependent, and the relationship representations.
3.4 The vector representing language typology complements the token representation, the action representation and the input to the parser state.
4.1 Steps for estimating multiCluster embeddings: 1) identify the set of languages of interest, 2) obtain bilingual dictionaries between pairs of languages, 3) group translationally equivalent words into clusters, 4) obtain a monolingual corpus for each language of interest, 5) replace words in monolingual corpora with cluster IDs, 6) obtain cluster embeddings, 7) use the same embedding for all words in the same cluster.
4.2 Steps for estimating multiCCA embeddings: 1) pretrain monolingual embeddings for the pivot language (English). For each other language m: 2) pretrain monolingual embeddings for m, 3) estimate linear projections T_{m->m,en} and T_{en->m,en}, 4) project the embeddings of m into the embedding space of the pivot language.
4.3 Sample correlation coefficients (perfect correlation = 1.0) for each dimension in the solution found by CCA, based on a dictionary of 35,524 English-Bulgarian translations.
5.1 Left: Examples of structured observations (in black), hidden structures (in grey), and side information (underlined). Right: Model variables for POS induction and word alignment. A parallel corpus consists of pairs of sentences (source and target).
5.2 Graphical model representations of CRF autoencoders. Left: the basic autoencoder model where the observation x generates the hidden structure y (encoding), which then generates x̂ (reconstruction). Center: side information is added. Right: a factor graph showing first-order Markov dependencies among elements of the hidden structure y.
5.3 V-measure of induced parts of speech in seven languages, with multinomial emissions.
5.4 Many-to-one accuracy (%) of induced parts of speech in seven languages, with multinomial emissions.
5.5 POS induction results of 5 models (V-measure, higher is better). Models which use word embeddings (i.e., Gaussian HMM and Gaussian CRF Autoencoder) outperform all baselines on average across languages.
5.6 Left: Effect of dimension size on POS induction on a subset of the English PTB corpus. Window size is set to 1 for all configurations. Right: Effect of context window size on V-measure of POS induction on a subset of the English PTB corpus. d = 100, scale ratio = 5.
5.7 Average inference runtime per sentence pair for word alignment in seconds (vertical axis), as a function of the number of sentences used for training (horizontal axis).
5.8 Left: L-BFGS vs. SGD with a constant learning rate. Right: L-BFGS vs. SGD with a diminishing learning rate.
5.9 Left: L-BFGS vs. SGD with cyclic vs. randomized order (and with an epoch-fixed learning rate). Right: Asynchronous SGD updates with 1, 2, 4, and 8 processors.
5.10 Left: Batch vs. online EM with different values of the stickiness parameter. Right: Batch EM vs. online EM with mini-batch sizes of 10^2, 10^3, and 10^4.

List of Tables

2.1 Current number of tokens in Leipzig monolingual corpora (in millions), word pairs in printed bilingual dictionaries (in thousands), tokens in the target side of en-xx OPUS parallel corpora (in millions), and the universal dependency treebanks (in thousands) for 41 languages. More recent statistics can be found at http://opus.lingfil.uu.se/, http://www.bilingualdictionaries.com/, http://corpora2.informatik.uni-leipzig.de/download.html and http://universaldependencies.org/, respectively.
3.1 Parser transitions indicating the action applied to the stack and buffer at time t and the resulting stack and buffer at time t+1.
3.2 Number of sentences (tokens) in each treebank split in Universal Dependency Treebanks version 2.0 (UDT) and Universal Dependencies version 1.2 (UD) for the languages we experiment with. The last row gives the number of unique language-specific fine-grained POS tags used in a treebank.
3.3 Dependency parsing: unlabeled and labeled attachment scores (UAS, LAS) for monolingually-trained parsers and MALOPA. Each target language has a large treebank (see Table 3.2). In this table, we use the universal dependencies version 1.2 which only includes annotations for 13K English sentences, which explains the relatively low scores in English. When we instead use the universal dependency treebanks version 2.0 which includes annotations for 40K English sentences (originally from the English Penn Treebank), we achieve UAS score 93.0 and LAS score 91.5.
3.4 Recall of some classes of dependency attachments/relations in German.
3.5 Effect of automatically predicting language ID and POS tags with MALOPA on parsing accuracy.
3.6 Effect of multilingual embedding estimation method on multilingual parsing with MALOPA. UAS and LAS scores are macro-averaged across seven target languages.
3.7 Small (3K token) target treebank setting: language-universal dependency parser performance.
3.8 Dependency parsing: unlabeled and labeled attachment scores (UAS, LAS) for multi-source transfer parsers in the simulated low-resource scenario where L_t ∩ L_s = ∅.
3.9 UAS results on the first 300 sentences in the English development set of the Universal Dependencies v1.2, with and without simulated code-switching.
4.1 Evaluation metrics on the corpus and languages for which evaluation data are available.
4.2 Correlations between intrinsic evaluation metrics (rows) and downstream task performance (columns).
4.3 Results for multilingual embeddings that cover 59 languages. Each row corresponds to one of the embedding evaluation metrics we use (higher is better). Each column corresponds to one of the embedding estimation methods we consider; i.e., numbers in the same row are comparable. Numbers in square brackets are coverage percentages.
4.4 Results for multilingual embeddings that cover Bulgarian, Czech, Danish, Greek, English, Spanish, German, Finnish, French, Hungarian, Italian and Swedish. Each row corresponds to one of the embedding evaluation metrics we use (higher is better). Each column corresponds to one of the embedding estimation methods we consider; i.e., numbers in the same row are comparable. Numbers in square brackets are coverage percentages.
5.1 Left: AER results (%) for Czech-English word alignment. Lower values are better. Right: BLEU translation quality scores (%) for Czech-English, Urdu-English and Chinese-English. Higher values are better.
5.2 Total number of tweets, tokens, and Twitter user IDs for each language pair. For each language pair, the first line represents all data provided to shared task participants. The second and third lines represent our train/test data split for the experiments reported in this chapter. Since Twitter users are allowed to delete their tweets, the number of tweets and tokens reported in the third and fourth columns may be less than the number of tweets and tokens originally annotated by the shared task organizers.
5.3 The type-level coverage percentages of annotated data according to word embeddings (second column) and according to word lists (third column), per language.
5.4 Token-level accuracy (%) results for each of the four language pairs.
5.5 Confusion between MSA and ARZ in the Baseline configuration.
5.6 F-measures of two Arabic configurations. lang1 is MSA. lang2 is ARZ.
5.7 Number of tweets in L, U_test and U_all used for semi-supervised learning of CRF autoencoder models.

Chapter 1

Introduction

Multilingual NLP is the scientific and engineering discipline concerned with automatically analyzing written or spoken input in multiple human languages. The desired analysis depends on the application, but is often represented as a set of discrete variables (e.g., part-of-speech tags) with a presumed dependency structure (e.g., first-order Markov dependencies). A (partially) correct analysis can be used to enable natural computer interfaces such as Apple's Siri. Other applications include summarizing long articles in a few sentences (e.g., Salton et al., 1997), and discovering subtle trends in large amounts of user-generated text (e.g., O'Connor et al., 2010). The ability to process human languages has always been one of the primary goals of artificial intelligence since its conception by McCarthy et al. (1955).

1.1 Challenges in Multilingual NLP

Although some English NLP problems are far from solved, we can make a number of simplifying assumptions when developing monolingual NLP models for a high-resource language such as English. We can often assume that large annotated corpora are available. Even when they are not, it is reasonable to assume we can find qualified annotators either locally or via crowdsourcing. It is easy to iteratively design and evaluate meaningful language-specific features (e.g., [city] is the capital of [country], [POS] ends with -ing). It is also often assumed the input language matches that of the training data.

When we develop models to analyze many languages, the first challenge we often find is the lack of annotations and linguistic resources. Language-specific feature engineering and error analysis become tedious at best, assuming we are lucky enough to collaborate with researchers or linguists who know all languages of interest. More often than not, however, it is infeasible to design meaningful language-specific features because the team has insufficient collective knowledge of some languages. Configuring, training, tuning, monitoring and occasionally updating the models for each language of interest is logistically difficult and requires more human and computational resources. Low-resource languages require a different pipeline than high-resource languages and are often ignored in an industrial setup.¹ The input can be in one of many languages, and often uses multiple languages in the same discourse. These challenges motivate the work in this dissertation. Other multilingual challenges not addressed in this thesis include: identifying sentence boundaries in languages which do not use unique punctuation to end a sentence (e.g., Thai, Arabic), tokenization in languages which do not use spaces to separate words (e.g., Japanese), and finding digitized texts of some languages (e.g., Yoruba, Hieroglyphic).

¹ Although the performance on low-resource languages tends to be much worse than that of high-resource languages, even inaccurate predictions can be very useful. For example, a machine translation system trained on a small parallel corpus will produce inaccurate translations, but an inaccurate translation is often sufficient to learn what a foreign document is about.

Figure 1.1: Instead of training one model per language, we develop one language-universal model to analyze text in multiple languages.

1.2 Thesis Statement

In this thesis, we show the feasibility of a language-universal approach to multilingual NLP in which one statistical model trained on multilingual, homogeneous annotations is used to process
natural language input in multiple languages. This approach addresses several practical difficulties in multilingual NLP such as processing code-switched input and developing/maintaining a large number of models to cover the languages of interest, especially low-resource ones. Although each language is unique, different languages often exhibit similar characteristics (e.g., phonetic, morphological, lexical, syntactic) which can be exploited to synergistically train universal language analyzers. The proposed language-universal models outperform monolingually-trained models in several low-resource and high-resource scenarios.

Building on a rich literature in multilingual NLP, this dissertation enables the language-universal approach by developing statistical models for: i) a language-universal syntactic analyzer to exemplify the proposed approach, ii) estimating massively multilingual word embeddings to serve as a shared representation of natural language input in multiple languages, and iii) inducing word alignments from unlabeled examples in a parallel corpus. The models proposed in each of the three components are designed, implemented, and empirically contrasted to competitive baselines.

1.3 Summary of Contributions

In chapter 3, we describe the language-universal approach to training multilingual NLP models. Instead of training one model per language, we simultaneously train one language-universal model on annotations in multiple languages (see Fig. 1.1). We show that this approach outperforms comparable language-specific models on average, and also outperforms state-of-the-art models in low-resource scenarios where no or few annotations are available in the target language.
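The language-universal setup summarized above rests on one idea: every token, whatever its language, is mapped to a homogeneous input representation, so a single set of model parameters serves all languages. A minimal sketch of that idea (all vectors and names here are invented for illustration; this is not the actual MALOPA implementation, which uses trained multilingual embeddings, typology features, and an S-LSTM parser):

```python
# Toy sketch (hypothetical values): one model's input layer concatenates a
# shared multilingual word embedding with a per-language typology vector.

multilingual_emb = {          # shared embedding space across languages
    "dog":  [0.70, 0.10],
    "hund": [0.68, 0.12],     # German translation, nearby in the shared space
}

typology = {                  # coarse per-language features (invented values)
    "en": [1.0, 0.0],
    "de": [0.0, 1.0],
}

def input_repr(word, lang):
    """One homogeneous representation for a token of any language."""
    return multilingual_emb[word] + typology[lang]

print(input_repr("dog", "en"))   # [0.7, 0.1, 1.0, 0.0]
print(input_repr("hund", "de"))  # [0.68, 0.12, 0.0, 1.0]
```

Because translationally similar words receive similar shared-space vectors, parameters learned from English "dog" partially transfer to German "hund", while the typology vector lets the model differentiate languages when needed.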
In chapter 4, we focus on the problem of estimating distributed, multilingual word representations. Previous work on this problem assumes the availability of sizable parallel corpora which connect all languages of interest in a fully connected graph. However, in practice, publicly available parallel corpora of high quality are only available for a relatively small number of languages. In order to scale up training of multilingual word embeddings to more languages, we propose two estimation methods which use bilingual dictionaries instead of parallel corpora.

In chapter 5, we describe a feature-rich model for structured prediction problems which learns from unlabeled examples. We instantiate the model for POS induction, word alignments, and token-level language identification.

Chapter 2

Background

2.1 Natural Languages

While open-source NLP tools (e.g., TurboParser¹ and Stanford Parser²) are readily available in several languages, one can hardly find any tools for most living languages. The language-universal approach we describe in this thesis provides a practical solution for processing an arbitrary language, given a monolingual corpus and a bilingual dictionary (or a parallel corpus) to induce lexical features in that language. This section discusses some aspects of the diversity of natural languages to help the reader appreciate the full potential of our approach.

2.1.1 It is not sufficient to develop NLP tools for the few most popular languages

Ethnologue (Lewis et al., 2016), an extensive catalog of the world's languages, estimates that the world population of approximately 7.4 billion people natively speaks 6,879 languages as of 2016. Many of these languages are spoken by a large population. For instance, Fig. 2.1 shows that 306 languages have a population between one million and ten million native speakers.

Figure 2.1: A histogram that shows the number of languages by population of native speakers in log scale. For instance, the figure shows that 306 languages have a population between one million and ten million native speakers. Numbers are based on Lewis et al. (2016).

¹ http://www.cs.cmu.edu/~ark/TurboParser/
² http://nlp.stanford.edu/software/lex-parser.shtml

Written languages used on the Internet also follow a similar pattern, although the language ranking is different (e.g., Chinese has the largest number of native speakers but English has the largest number of Internet users). Fig. 2.2 gives the number of Internet users per language (for the top ten languages),³ and shows that the long tail of languages (ranked 11 or more) accounts for 21.8% of all Internet users.

Figure 2.2: Number of Internet users (in millions) per language follows a power law distribution, with a long tail (not shown) which accounts for 21.8% of Internet users. The figure was retrieved on May 10, 2016 from http://www.internetworldstats.com/stats7.htm

Beyond languages with large populations of native speakers, even endangered languages⁴ and extinct languages⁵ may be important to build NLP tools for. For example, The Endangered Languages Project⁶ aims to document, preserve and teach endangered languages in order to reduce the loss of cultural knowledge when those languages fall out of use. NLP tools have also been used to help with historical discoveries by analyzing preserved text written in ancient languages (Bamman et al., 2013).

³ Estimated by Miniwatts Marketing Group at http://www.internetworldstats.com/stats7.htm
⁴ An endangered language is one at risk of reaching a native speaking population of zero as it falls out of use.
⁵ Extinct languages have no living speaking population.
⁶ http://www.endangeredlanguages.com/

2.1.2 Languages often studied in NLP research

It is no surprise that some languages receive more attention than others in NLP research. Fig. 2.3, reproduced from Bender (2011), shows that 63% of papers in the ACL 2008 conference studied English, while only 0.7% (i.e., one paper) studied Danish, Swedish, Bulgarian, Slovene, Ukrainian, Aramaic, Turkish or Wambaya.

Figure 2.3: Reproduced from Bender (2011): Languages studied in ACL 2008 papers, by language genus and family.

The number of speakers of a language might explain the attention it receives, but only partially; e.g., Bengali, the native language of 2.5% of the world's population, is not studied by any papers in ACL 2008. Other factors which contribute to the attention a language receives in NLP research may include:

Availability of annotated datasets: Most NLP research is empirical, requiring the availability of annotated datasets. (Even unsupervised methods require an annotated dataset for evaluation.) It is customary to call a language with no or little annotations a low-resource language, but the term is loosely defined. Some languages may have plenty of resources for one NLP problem, but no resources for another. Even for a particular NLP task, there is no clear threshold for the magnitude of annotated data below which a language is considered to be a low-resource language. Table 2.1 provides statistics about the size of datasets in highly-multilingual resources: the Leipzig monolingual corpora,⁷ OPUS parallel corpora,⁸ and the universal dependency treebanks.⁹

Economic significance: The industrial arm of NLP research (e.g., Bell Labs, IBM, BBN, Microsoft, Google) has made important contributions to the field. Short of addressing all languages at the same time, more economically significant languages are often given a higher priority.

Research funding: Many NLP research studies are funded by national or regional agencies such as the National Science Foundation (NSF), European Research Council (ERC), Defense Advanced Research Projects Agency (DARPA), Intelligence Advanced Research Projects Activity (IARPA) and the European Commission. Research goals of the funding agency often partially determine which languages will be studied in a funded project. For example, EuroMatrix was a three-year research project funded by the European Commission (2009-2012) and aimed to develop machine translation systems for all pairs
38 oflanguagesintheEuro-peanUnion.10Another
oflanguagesintheEuro-peanUnion.10AnotherexampleisTransTac,ave-yearresearchprojectfundedbyDARPAwhichaimedtodevelopspeech-to-speechtranslationsystems,primarilyforEnglishIraqiandIraqiEnglish.Inbothexamples,thechoiceoflanguagesstudiedwasdrivenbystrategicgoalsofthefundingagencies.2.1.3CharacterizingsimilaritiesanddissimilaritiesacrosslanguagesLinguistictypology(Comrie,1989)isaeldoflinguisticswhichaimstoorganizelanguagesintodifferenttypes(i.e.,totypologizelanguages).Typologistsoftenusereferencegrammars11tocontrastlinguisticpropertiesacrossdifferentlanguages.Anextensivelistoftypologicalprop-ertiescanbefoundfor2,679languagesintheWorldAtlasofLanguagesStructures(WALS;DryerandHaspelmath,2013).12Studiedpropertiesinclude:synta
39 cticpatterns(e.g.,orderofsubject,verband
ctic patterns (e.g., order of subject, verb and object), morphological properties (e.g., reduplication, position of case affixes on nouns), and phonological properties (e.g., consonant-vowel ratio, uvular consonants). It is also useful to consider genealogical classification of languages, emphasizing that languages which descended from a common ancestor tend to be linguistically similar. For example, Semitic languages (e.g., Hebrew, Arabic, Amharic) share a distinct morphological system that combines a triliteral root with a pattern of vowels and consonants to construct a word. Romance languages (e.g., Spanish, Italian, French) share morphological inflections that mark person (first, second, third), number (singular, plural), tense (e.g., imperfect, future), and mood (e.g., indicative, imperative).

7 http://corpora2.informatik.uni-leipzig.de/download.html
8 http://opus.lingfil.uu.se/
9 http://universaldependencies.org
10 Note the connection to economic significance.
11 A reference grammar gives a technical description of the major linguistic features of a language with a few examples; e.g., Martin (2004) and Ryding (2005).
12 WALS typological properties can be downloaded at http://wals.info/. Syntactic Structures of the World's Languages (SSWL) is another catalog of typological properties, but it only has information about 385 languages as of May 2016. SSWL typological properties can be downloaded at http://sswl.railsplayground.net/.

[Table 2.1 covers 41 languages: Ancient Greek, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Latin, Latvian, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Portuguese (Brazilian), Romanian, Russian, Slovenian, Spanish, Swedish, Tamil and Turkish.]

Table 2.1: Current number of tokens in Leipzig monolingual corpora (in millions), word pairs in printed bilingual dictionaries (in thousands), tokens in the target side of en-xx OPUS parallel corpora (in millions), and the universal dependency treebanks (in thousands) for 41 languages. More recent statistics can be found at http://opus.lingfil.uu.se/, http://www.bilingualdictionaries.com/, http://corpora2.informatik.uni-leipzig.de/download.html and http://universaldependencies.org/, respectively.

2.2 Word Embeddings

Word embeddings (also known as vector space representations or distributed representations) provide an effective method for semi-supervised learning in monolingual NLP (Turian et al., 2010). This section gives a brief overview of monolingual word embeddings, before we discuss multilingual word embeddings in the following chapters.

2.2.1 Distributed word representations

There are several choices for representing lexical items in a computational model. The most basic representation, a sequence of characters
(e.g., thesis), helps distinguish between different words. An isomorphic representation, commonly referred to as the one-hot representation, assigns an arbitrary unique integer to each unique character sequence in a vocabulary of size V. The one-hot representation is often preferred to character sequences because modern computer architectures can manipulate integers more efficiently. This lexical representation is problematic for two reasons:
1. The number of unique n-gram lexical features is O(V^n). Since the vocabulary size V is often large,13 the increased number of parameters makes the model more prone to overfitting. This can be problematic even for n = 1 (unigram features), especially when the training data is small.
2. We miss an opportunity to share statistical strength between similar words. For example, if dissertation is an out-of-vocabulary word (i.e., does not appear in the training set), the model cannot relate it to a similar word seen in training such as thesis.

Word embeddings are an alternative representation which maps a word to a vector of real numbers, i.e., embeds the word in a vector space. This representation addresses the first problem, provided that the dimensionality of word embeddings is significantly lower than the vocabulary size, which is typical (dimensionalities of word embeddings in common use are in the range 50-500). In order to address the second problem, two approaches are commonly used to estimate the embedding of a word:
- Estimate word embeddings as extra parameters in the model trained for the downstream task. Note that this approach reintroduces a large number of parameters in the model, and also misses the opportunity to learn from unlabeled examples.
- Use the distributional hypothesis of Harris (1954) to estimate word embeddings using a corpus of raw text, before training a model for the downstream task.
In §2.2.2, we describe a popular model for estimating word embeddings using an unlabeled corpus of raw text. We use this method as a basis for multilingual embeddings in other chapters of the thesis.

13 The vocabulary size depends on the language, genre and dataset size. An English monolingual corpus of 1 billion word tokens has a vocabulary size of 4,785,862 unique word types.
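To make the first problem concrete, the parameter counts implied by the two representations can be compared directly. The sizes below are hypothetical, chosen only to illustrate the O(V^n) blow-up of one-hot n-gram features versus a d-dimensional embedding table:

```python
# Hypothetical sizes for illustration only: a modest vocabulary and an
# embedding dimensionality from the 50-500 range mentioned above.
V = 100_000   # vocabulary size
d = 100       # embedding dimensionality

# A linear model with one-hot unigram and bigram features needs one
# parameter per feature, i.e., O(V^n) parameters overall.
one_hot_params = V + V ** 2        # unigrams + bigrams

# An embedding table needs only V * d parameters, and the d dimensions
# are shared across every context in which a word appears.
embedding_params = V * d

# one_hot_params == 10_000_100_000, embedding_params == 10_000_000
```

Even this toy comparison shows three orders of magnitude fewer parameters for the embedding representation, which is why the first problem above largely disappears once d is much smaller than V.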
2.2.2 Skipgram model for estimating monolingual word embeddings

The distributional hypothesis states that the semantics of a word can be determined by the words that occur in its context (Harris, 1954). The skipgram model (Mikolov et al., 2013a) is one of several methods which implement this hypothesis. The skipgram model generates a word u that occurs in the context (of window size K) of another word v as follows:

p(u | v) = exp(E_skipgram(v) · E_context(u)) / Σ_{u' ∈ vocabulary} exp(E_skipgram(v) · E_context(u'))

where E_skipgram(u) ∈ R^d is the vector word embedding of a word u with dimensionality d. E_context(u) also embeds the word u, but the original implementation of the skipgram model in the word2vec package14 only uses it as extra model parameters. Note that this is a deficient model, since the same word token appears (and hence is generated) in more than one context (e.g., the context of the word immediately before, and the context of the word immediately after). The model is trained to maximize the log-likelihood:

Σ_{i ∈ indexes} Σ_{k ∈ {-K,...,-1,1,...,K}} log p(u_{i+k} | u_i)

where i indexes all word tokens in a monolingual corpus. To avoid the expensive summation in the partition function, the distribution p(u | v) is approximated using noise contrastive estimation (Gutmann and Hyvärinen, 2012). We use the skipgram model in chapters 3, 4 and 5.

14 https://code.google.com/archive/p/word2vec/
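As a concrete sketch of the conditional distribution defined above, the toy implementation below computes p(u | v) with the full (expensive) partition function over a five-word vocabulary; the vocabulary, dimensionality and random initialization are illustrative placeholders, not trained values:

```python
import math
import random

random.seed(0)

# Two embedding tables, as in the text: E_skipgram embeds the generating
# word v, and E_context embeds candidate context words u.
vocab = ["the", "thesis", "dissertation", "cat", "love"]
d = 4
E_skipgram = {w: [random.uniform(-0.5, 0.5) for _ in range(d)] for w in vocab}
E_context = {w: [random.uniform(-0.5, 0.5) for _ in range(d)] for w in vocab}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def p_context_given_word(u, v):
    """p(u | v): a softmax over dot products; the denominator is the
    partition function that noise contrastive estimation avoids."""
    scores = {u2: math.exp(dot(E_skipgram[v], E_context[u2])) for u2 in vocab}
    return scores[u] / sum(scores.values())

# The conditional distribution sums to one over the vocabulary.
total = sum(p_context_given_word(u, "thesis") for u in vocab)
```

Training would then maximize Σ log p(u_{i+k} | u_i) over all token-context pairs in the corpus; for a realistic vocabulary the denominator is what makes the exact computation impractical.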
Chapter 3
Language-universal Dependency Parsing

3.1 Overview

In high-resource scenarios, the mainstream approach for multilingual NLP is to develop language-specific models. For each language of interest, the resources necessary for training the model are obtained (or created), and model parameters are optimized for each language separately. This approach is simple, effective and grants the flexibility of customizing the model or features to the needs of each language independently, but it is suboptimal for theoretical as well as practical reasons. Theoretically, the study of linguistic typology reveals that many languages share morphological, phonological, and syntactic phenomena (Bender, 2011). On the practical side, it is inconvenient to deploy or distribute NLP tools that are customized for many different languages because, for each language of interest, we need to configure, train, tune, monitor and occasionally update the model. Furthermore, code-switching or code-mixing (mixing more than one language in the same discourse), which is pervasive in some genres, in particular social media, presents a challenge for monolingually-trained NLP models (Barman et al., 2014).

Can we train one language-universal model instead of training a separate model for each language of interest, without sacrificing accuracy? We address this question in the context of dependency parsing, a core problem in NLP (see Fig. 3.1). We discuss modeling tools for unifying languages to enable cross-lingual supervision, as well as tools for differentiating between the characteristics of different languages. Equipped with these modeling tools, we show that language-universal dependency parsers can outperform monolingually-trained parsers in high-resource scenarios. The same approach can also be used in low-resource scenarios (with no labeled examples or with a small number of labeled examples in the target language), as previously explored by Cohen et al. (2011), McDonald et al. (2011) and Täckström et al. (2013). We address both experimental settings (target language with and without labeled examples) and show that our model compares favorably to previous work in both settings.

In principle, the proposed approach is applicable to many NLP problems, including morphological, syntactic, and semantic analysis. However, in order to exploit the full potential of this approach, we need homogeneous annotations in several languages for the task of interest (see §3.2.1). For this reason, we focus on dependency parsing, for which homogeneous annotations are available in many languages. Most of the material in this chapter was previously published in Ammar et al. (2016a).

3.2 Approach

The availability of homogeneous syntactic annotations in many languages (Petrov et al., 2012; McDonald et al., 2013; Nivre et al., 2015b; Agić et al., 2015; Nivre et al., 2015a) presents the opportunity to develop a parser that is capable of parsing sentences in multiple languages of interest. Such a parser can potentially replace an array of language-specific monolingually-trained parsers (for languages with a large treebank). Our goal is to train a dependency parser for a set of target languages L_t, given universal dependency
annotations in a set of source languages L_s. When all languages in L_t have a large treebank, the mainstream approach has been to train one monolingual parser per target language and route sentences of a given language to the corresponding parser at test time. In contrast, our approach is to train one parsing model with the union of treebanks in L_s, then use this single trained model to parse text in any language in L_t, which we call many languages, one parser (MALOPA).

Figure 3.1: Dependency parsing is a core problem in NLP which is used to inform other tasks such as coreference resolution (e.g., Durrett and Klein, 2013), semantic parsing (e.g., Das et al., 2010) and question answering (e.g., Heilman and Smith, 2010).

3.2.1 Homogeneous Annotations

Although multilingual dependency treebanks have been available for a decade via the 2006 and 2007 CoNLL shared tasks (Buchholz and Marsi, 2006; Nivre et al., 2007), the treebank of each language was annotated independently and with its own annotation conventions. McDonald et al. (2013) designed annotation guidelines which use similar dependency labels and conventions for several languages, based on the Stanford dependencies. Two versions of this treebank were released: v1.0 (6 languages)1 and v2.0 (11 languages).2 The dependency parsing community further developed these treebanks into the Universal Dependencies,3 with a 6-month release schedule. So far, Universal Dependencies v1.0 (10 languages),4 v1.1 (18 languages)5 and v1.2 (34 languages)6 have been released.

In MALOPA, we require that all source languages have a universal dependency treebank. We transform non-projective trees in the training treebanks to pseudo-projective trees using the baseline scheme in Nivre and Nilsson (2005). In addition, we use the following data resources for each language in L = L_t ∪ L_s:
- universal POS annotations for training a POS tagger (required),7
- a bilingual dictionary with another language in L for adding cross-lingual lexical information (optional),8
- language typology information (optional),9
- language-specific POS annotations (optional),10 and
- a monolingual corpus (optional).11

1 https://github.com/ryanmcd/uni-dep-tb/blob/master/universal_treebanks_v1.0.tar.gz
2 https://github.com/ryanmcd/uni-dep-tb/blob/master/universal_treebanks_v2.0.tar.gz
3 http://universaldependencies.org/
4 http://hdl.handle.net/11234/1-1464
5 http://hdl.handle.net/11234/LRT-1478
6 http://hdl.handle.net/11234/1-1548
7 See §3.2.5 for details.
8 Our best results make use of this resource. We require that all languages in L are (transitively) connected.

  Stack_t    | Buffer_t | Action           | Dependency | Stack_{t+1} | Buffer_{t+1}
  (u, v, S)  | B        | REDUCE-RIGHT(r)  | u ->_r v   | (u, S)      | B
  (u, v, S)  | B        | REDUCE-LEFT(r)   | u <-_r v   | (v, S)      | B
  S          | (u, B)   | SHIFT            |            | (u, S)      | B

Table 3.1: Parser transitions indicating the action applied to the stack and buffer at time t and the resulting stack and buffer at time t+1.
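The transitions in Table 3.1 can be sketched with plain list operations on a buffer, a stack, and a growing arc set. The sentence, relation labels and action sequence below are illustrative only, and follow the reconstructed conventions of the table (REDUCE-LEFT removes the earlier word and attaches it to the later one; REDUCE-RIGHT does the opposite):

```python
# Arcs are recorded as (head, relation, dependent) triples.

def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))

def reduce_right(stack, buffer, arcs, r):
    # Top two items (u below, v on top): add u -r-> v and remove v.
    v = stack.pop()
    u = stack[-1]
    arcs.append((u, r, v))

def reduce_left(stack, buffer, arcs, r):
    # Top two items (u below, v on top): add u <-r- v and remove u.
    v = stack.pop()
    u = stack.pop()
    stack.append(v)
    arcs.append((v, r, u))

stack, buffer, arcs = [], ["You", "love", "your", "cat"], []
shift(stack, buffer, arcs)                  # stack: [You]
shift(stack, buffer, arcs)                  # stack: [You, love]
reduce_left(stack, buffer, arcs, "nsubj")   # love -nsubj-> You
shift(stack, buffer, arcs)                  # stack: [love, your]
shift(stack, buffer, arcs)                  # stack: [love, your, cat]
reduce_left(stack, buffer, arcs, "poss")    # cat -poss-> your
reduce_right(stack, buffer, arcs, "dobj")   # love -dobj-> cat
```

After this action sequence the buffer is empty and only the root word "love" remains on the stack, with three dependency arcs recorded.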
3.2.2 Core Model

Recent advances (e.g., Graves et al. 2013, Sutskever et al. 2014) suggest that recurrent neural networks (RNNs) are capable of learning useful representations for modeling problems of a sequential nature. Following Dyer et al. (2015), we use an RNN for transition-based dependency parsing. We describe the core model in this section, and modify it to enable language-universal parsing in the following sections. The core model can be understood as the sequential manipulation of three data structures: a buffer (from which we read the token sequence), a stack (which contains partially-built parse trees), and a list of actions previously taken by the parser. The parser uses the arc-standard transition system (Nivre, 2004). At each time step t, a transition action is applied that alters these data structures according to Table 3.1.

Along with the discrete transitions of the arc-standard system, the parser computes vector representations for the buffer, stack and list of actions at time step t, denoted b_t, s_t, and a_t, respectively (see Fig. 3.2). A stack-LSTM module is used to compute the vector representation for each data structure. Fig. 3.3 illustrates the content of each module for the first three steps in a toy example. The parser state12 at time t is given by:

p_t = max{0, W [s_t; b_t; a_t] + W_bias}    (3.1)

where the matrix W and the vector W_bias are learned parameters. The parser state p_t is then used to define a categorical distribution over possible next actions z:

p(z | p_t) = exp(g_z^T p_t + q_z) / Σ_{z'} exp(g_{z'}^T p_t + q_{z'})    (3.2)

where g_z and q_z are parameters associated with action z. The total number of actions is twice the number of unique dependency labels in the treebank used for training, plus one, but we only consider actions which meet the arc-standard preconditions in Table 3.1. The selected action is then used to update the buffer, stack and list of actions, and to compute b_{t+1}, s_{t+1} and a_{t+1} accordingly. The model is trained to maximize the log-likelihood of correct actions. At test time, the parser greedily chooses the most probable action at every time step until a complete parse tree is produced.

8 (cont.) The bilingual dictionaries we used are based on unsupervised word alignments of parallel corpora, as described in Guo et al. (2016). See §3.2.3 for details.
9 See §3.2.4 for details.
10 Our best results make use of this resource. See §3.2.4 for details.
11 This is only used for training word embeddings with 'multiCCA', 'multiCluster' and 'translation-invariance' in Table 3.6. We do not use this resource when we compare to previous work.
12 Not to be confused with the state of the transition system (i.e., the content of the stack and the buffer).

Figure 3.2: The parser computes vector representations for the buffer, stack and list of actions at time step t, denoted b_t, s_t, and a_t, respectively. The three vectors feed into a hidden layer denoted as 'parser state', followed by a softmax layer that represents possible outputs at time step t.
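A minimal numeric sketch of Eqs. 3.1-3.2 follows. In the real model s_t, b_t and a_t are produced by stack-LSTMs; here they are random placeholder vectors, and all dimensions are illustrative:

```python
import math
import random

random.seed(0)

dim = 5          # dimensionality of s_t, b_t, a_t (illustrative)
n_actions = 3    # e.g., SHIFT and two REDUCE actions

s_t = [random.gauss(0, 1) for _ in range(dim)]
b_t = [random.gauss(0, 1) for _ in range(dim)]
a_t = [random.gauss(0, 1) for _ in range(dim)]
x = s_t + b_t + a_t    # the concatenation [s_t; b_t; a_t]

W = [[random.gauss(0, 0.1) for _ in range(3 * dim)] for _ in range(dim)]
W_bias = [0.0] * dim

# Eq. 3.1: p_t = max{0, W [s_t; b_t; a_t] + W_bias} (a rectified linear layer).
p_t = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
       for row, b in zip(W, W_bias)]

g = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_actions)]
q = [0.0] * n_actions

# Eq. 3.2: a softmax over actions z, scored by g_z . p_t + q_z.
scores = [math.exp(sum(gi * pi for gi, pi in zip(gz, p_t)) + qz)
          for gz, qz in zip(g, q)]
probs = [s / sum(scores) for s in scores]
```

In the full parser, actions violating the arc-standard preconditions would be masked out before renormalizing, and the most probable remaining action would be taken greedily at test time.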
Token representations. The vector representations of input tokens feed into the stack-LSTM modules of the buffer and the stack. For monolingual parsing, we represent each token by concatenating the following vectors:
- a fixed, pretrained embedding of the word type,
- a learned embedding of the word,
- a learned embedding of the Brown cluster,
- a learned embedding of the fine-grained POS tag, and
- a learned embedding of the coarse POS tag.
The next section describes the mechanisms we use to enable cross-lingual supervision.

Figure 3.3: An illustration of the content of the buffer S-LSTM module, the actions S-LSTM module, and the stack S-LSTM module for the first three steps in parsing the sentence "You love your cat $". Upper left: the buffer is initialized with all tokens in reverse order. Upper right: simulating a 'shift' action, the head vector of the buffer S-LSTM backtracks, and two new hidden layers are computed for the actions S-LSTM and the stack S-LSTM. Lower left: simulating another 'shift' action, the head vector of the buffer S-LSTM backtracks, and two additional hidden layers are computed for the actions S-LSTM and the stack S-LSTM. Lower right: simulating a 'right-arc(nsubj)', the head vector of the buffer S-LSTM backtracks, and an additional hidden layer is computed for the actions S-LSTM. The stack S-LSTM first backtracks two steps, then a new hidden layer is computed as a function of the head, the dependent, and the relationship representations.

3.2.3 Language Unification

The key to unifying different languages in the model is to map language-specific representations of the input to language-universal representations. We apply this on two levels: part-of-speech tags and lexical items.

Coarse syntactic embeddings. We learn vector representations of multilingually-defined coarse POS tags (Petrov et al., 2012), instead of using language-specific tagsets. We train a simple delexicalized model where the token representation only consists of learned embeddings of coarse POS tags, which are shared across all languages to enable model transfer.

Lexical embeddings. Previous work has shown that sacrificing lexical features amounts to a substantial decrease in the performance of a dependency parser (Cohen et al., 2011; Täckström et al., 2012a; Tiedemann, 2015; Guo et al., 2015). Therefore, we extend the token representation in MALOPA by concatenating pretrained multilingual embeddings of word types. We also concatenate learned embeddings of multilingual word clusters. Before training the parser, we estimate Brown clusters of English words and project them via word alignments to words in other languages. This is similar to the 'projected clusters' method in Täckström et al. (2012a). To go from Brown clusters to embeddings, we ignore the hierarchy within Brown clusters and assign a unique parameter vector to each leaf.

3.2.4 Language Differentiation

Here, we describe how we tweak the behavior of MALOPA depending on the current input language.

Language embeddings. While many languages, especially ones that belong to the same family, exhibit some similar syntactic phenomena (e.g., all languages have subjects, verbs, and objects), substantial syntactic differences abound. Some of these differences are easy to characterize (e.g., subject-verb-object vs. verb-subject-object, prepositions vs. postpositions, adjective-noun vs. noun-adjective), while others are subtle (e.g., the number and positions of negation morphemes). It is not at all clear how to translate descriptive facts about a language's syntax into features for a parser.
Consequently, training a language-universal parser on treebanks in multiple source languages requires caution. While exposing the parser to a diverse set of syntactic patterns across many languages has the potential to improve its performance in each, dependency annotations in one language will, in some ways, contradict those in typologically different languages. For instance, consider a context where the next word on the buffer is a noun, and the top word on the stack is an adjective, followed by a noun. Treebanks of languages where postpositive adjectives are typical (e.g., French) will often teach the parser to predict REDUCE-LEFT, while those of languages where prepositive adjectives are more typical (e.g., English) will teach the parser to predict SHIFT.

Inspired by Naseem et al. (2012), we address this problem by informing the parser about the input language it is currently parsing. Let l be the input vector representation of a particular language. We consider three definitions for l:
- a one-hot encoding of the language ID,
- a one-hot encoding of word-order properties,13 and
- an encoding of all typological features in WALS.14
We use a hidden layer with tanh nonlinearity to compute the language embedding l' as:

l' = tanh(L l + L_bias)

where L and L_bias are additional model parameters. We modify the parsing architecture as follows:
- include l' in the token representation,
- include l' in the action vector representation, and
- let p_t = max{0, W [s_t; b_t; a_t; l'] + W_bias}.
Intuitively, the first two modifications allow the input language to influence the vector representation of the stack, the buffer and the list of actions. The third modification allows the input language to influence the parser state, which in turn is used to predict the next action. In preliminary experiments, we found that adding the language embeddings at the token and action level is important. We also experimented with computing more complex functions of (s_t, b_t, a_t, l') to define the parser state, but they did not help.
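The language-embedding computation itself is a single tanh layer over the chosen language vector l; below is a minimal sketch with illustrative sizes and placeholder weights standing in for the learned parameters L and L_bias:

```python
import math

n_languages = 7   # de, en, es, fr, it, pt, sv
embed_dim = 3     # dimensionality of l' (illustrative)

# l: here a one-hot language-ID vector; a word-order or WALS typology
# vector could be substituted without changing the computation.
l = [0.0] * n_languages
l[1] = 1.0        # say, the second language in the inventory

# Placeholder weights; in the model these are learned with the parser.
L = [[0.1 * (i + j) for j in range(n_languages)] for i in range(embed_dim)]
L_bias = [0.0] * embed_dim

# l' = tanh(L l + L_bias)
l_prime = [
    math.tanh(sum(L[i][j] * l[j] for j in range(n_languages)) + L_bias[i])
    for i in range(embed_dim)
]
# l' is then concatenated into the token representation, the action
# representation, and the parser-state input [s_t; b_t; a_t; l'].
```

Because l is one-hot here, each row of L contributes exactly one weight, so l' is simply a learned, squashed per-language vector; with a typology vector as input, languages with similar typological features would receive similar embeddings.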
Fine-grained POS tag embeddings. Tiedemann (2015) shows that omitting fine-grained POS tags significantly hurts the performance of a dependency parser. However, those fine-grained POS tagsets are defined monolingually and are only available for a subset of the languages with universal dependency treebanks. We extend the token representation to include a fine-grained POS embedding (in addition to the coarse POS embedding). During training, we stochastically drop out the fine-grained POS embedding with 50% probability (Srivastava et al., 2014) so that the parser can make use of fine-grained POS tags when available but stay reliable when the fine-grained POS tags are missing. Fig. 3.4 illustrates the parsing model with the various components which enable cross-lingual supervision and language differentiation.

Block dropout. We introduce another modification which makes the parser more robust to noisy inputs and to language-specific inputs which may or may not be provided at test time. The idea is to stochastically zero out the entire vector representation of a noisy input. While training the parser, we replace the vector representation i with another vector (of the same dimensionality) stochastically computed as i' = (1 - b) × i / β, where b is a Bernoulli-distributed random variable with parameter β, which matches the expected error rate on a development set.15 For example, we use block dropout to teach the parser to ignore the predicted POS tag embeddings all the time at first by initializing β = 1.0 (i.e., always drop out, setting i' = 0), and we dynamically update β to match the error rate of the POS tagger on the development set. At test time, we always use the original vector, i.e., i' = i. Intuitively, this method extends the dropout method (Srivastava et al., 2014) to address structured noise in the input layer.

13 The World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) is an online portal documenting typological properties of 2,679 languages (as of July 2015). We use the same set of WALS features used by Zhang and Barzilay (2015), namely 82A (order of subject and verb), 83A (order of object and verb), 85A (order of adposition and noun phrase), 86A (order of genitive and noun), and 87A (order of adjective and noun).
14 Since WALS features are not annotated for all languages, we use the average value of all languages in the same genus.
15 Dividing by β at training time alleviates the need to change the computation at test time.

Figure 3.4: The vector representing language typology complements the token representation, the action representation and the input to the parser state.
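Block dropout can be sketched as follows; the rescaling by 1/β follows the formula as reconstructed above, and β and the input vector are illustrative:

```python
import random

random.seed(0)

def block_dropout(vec, beta, training=True):
    """Stochastically zero out an entire input vector during training.

    Implements i' = (1 - b) * i / beta as reconstructed in the text, with
    b ~ Bernoulli(beta); beta = 1.0 means the vector is always dropped.
    At test time the vector is passed through unchanged (i' = i).
    """
    if not training:
        return list(vec)
    b = 1 if random.random() < beta else 0
    if b == 1:
        return [0.0] * len(vec)
    return [v / beta for v in vec]

# With beta initialized to 1.0, a noisy input such as the predicted-POS
# embedding is ignored every time; beta is then annealed toward the
# tagger's actual error rate on the development set.
pos_embedding = [0.5, -1.0, 2.0]
dropped = block_dropout(pos_embedding, beta=1.0)
```

Unlike elementwise dropout, the whole vector lives or dies together, which matches the structured nature of the noise: a mispredicted POS tag corrupts every dimension of its embedding at once.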
3.2.5 Multi-task Learning for POS Tagging and Dependency Parsing

The model discussed thus far conditions on the POS tags of words in the input sentence. However, POS tags may not be available in real applications (e.g., parsing the web). Let x_1,...,x_n, y_1,...,y_n, z_1,...,z_2n be the sequence of words, POS tags and parsing actions, respectively, for a sentence of length n. We define the joint distribution of a POS tag sequence and parsing actions given a sequence of words as follows:

p(y_1,...,y_n, z_1,...,z_2n | x_1,...,x_n) = Π_{i=1}^{n} p(y_i | x_1,...,x_n) × Π_{j=1}^{2n} p(z_j | x_1,...,x_n, y_1,...,y_n, z_1,...,z_{j-1})

where p(z_j | ...) is defined in Eq. 3.2, and p(y_i | x_1,...,x_n) uses a bidirectional LSTM (Graves et al., 2013), similar to Huang et al. (2015). The token representation that feeds into the bidirectional LSTM shares the same parameters as the token representation described earlier for the parser, but omits both POS embeddings. The output softmax layer defines a categorical distribution over possible POS tags at each position. This multi-task learning setup enables us to predict both POS tags and dependency trees with the same model.

3.3 Experiments

We evaluate our MALOPA parser in three data scenarios: when the target language has a large treebank (Table 3.3), a small treebank (Table 3.7) or no treebank (Table 3.8).

               de              en              es              fr              it              pt             sv
UDT2
  train   14118 (264906)  39832 (950028)  14138 (375180)  14511 (351233)   6389 (149145)   9600 (239012)  4447 (66631)
  dev       801 (12215)    1703 (40117)    1579 (40950)    1620 (38328)     399 (9541)     1211 (29873)    493 (9312)
  test     1001 (16339)    2416 (56684)     300 (8295)      300 (6950)      400 (9187)     1205 (29438)   1219 (20376)
UD1.2
  train   14118 (269626)  12543 (204586)  14187 (382436)  14552 (355811)  11699 (249307)   8800 (201845)  4303 (66645)
  dev       799 (12512)    2002 (25148)    1552 (41975)    1596 (39869)     489 (11656)     271 (4833)     504 (9797)
  test      977 (16537)    2077 (25096)     274 (8128)      298 (7210)      489 (11719)     288 (5867)    1219 (20377)
  tags      -               50              -               -                36              866            134

Table 3.2: Number of sentences (tokens) in each treebank split in the Universal Dependency Treebanks version 2.0 (UDT) and Universal Dependencies version 1.2 (UD) for the languages we experiment with. The last row gives the number of unique language-specific fine-grained POS tags used in a treebank.

Data. For experiments where the target language has a large treebank, we use the standard data splits for German (de), English (en), Spanish (es), French (fr), Italian (it), Portuguese (pt) and Swedish (sv) in the latest release (version 1.2) of Universal Dependencies (Nivre et al., 2015a), and experiment with both gold and predicted POS tags. For experiments where the target language has no treebank, we use the standard splits for these languages in the older universal dependency treebanks v2.0 (McDonald et al., 2013) and use gold POS tags, following the baselines (Zhang and Barzilay, 2015; Guo et al., 2016). Table 3.2 gives the number of sentences and words annotated for each language in both versions. We use the same multilingual Brown clusters and multilingual embeddings as Guo et al. (2016), kindly provided by the authors.
Optimization. We use stochastic gradient updates with an initial learning rate of η_0 = 0.1 in epoch #0 and update the learning rate in following epochs as η_t = η_0 / (1 + 0.1 t). We clip the l2 norm of the gradient to avoid exploding gradients. Unlabeled attachment score (UAS) on the development set determines early stopping. Parameters are initialized with uniform samples in ±sqrt(6 / (r + c)), where r and c are the sizes of the previous and following layer in the neural network (Glorot and Bengio, 2010). The standard deviations of the labeled attachment score (LAS) due to random initialization in individual target languages are 0.36 (de), 0.40 (en), 0.37 (es), 0.46 (fr), 0.47 (it), 0.41 (pt) and 0.24 (sv). The standard deviation of the average LAS score across languages is 0.17.

When training the parser on multiple languages in MALOPA, instead of updating the parameters with the gradient of individual sentences, we use mini-batch updates which include one sentence sampled uniformly (without replacement) from each language's treebank, until all sentences in the smallest treebank are used (which concludes an epoch). We repeat the same process in following epochs. We found this to help prevent one source language with a larger treebank (e.g., German) from dominating parameter updates at the expense of other source languages with a smaller treebank (e.g., Swedish).
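The optimization recipe above (the decaying learning rate, Glorot-style uniform initialization, and one-sentence-per-language mini-batches) can be sketched as follows; the treebank contents are placeholders:

```python
import random

random.seed(0)

eta_0 = 0.1

def learning_rate(epoch):
    """eta_t = eta_0 / (1 + 0.1 t)."""
    return eta_0 / (1.0 + 0.1 * epoch)

def glorot_uniform(r, c):
    """Uniform samples in +/- sqrt(6 / (r + c)) (Glorot and Bengio, 2010)."""
    bound = (6.0 / (r + c)) ** 0.5
    return [[random.uniform(-bound, bound) for _ in range(c)] for _ in range(r)]

def epoch_minibatches(treebanks):
    """Yield mini-batches holding one sentence sampled uniformly (without
    replacement) per language, until the smallest treebank is exhausted."""
    shuffled = {lang: random.sample(sents, len(sents))
                for lang, sents in treebanks.items()}
    for i in range(min(len(s) for s in treebanks.values())):
        yield [shuffled[lang][i] for lang in shuffled]

# Toy treebanks: German has more sentences than Swedish, so an epoch ends
# after two mini-batches and neither language dominates the updates.
treebanks = {"de": ["de1", "de2", "de3"], "sv": ["sv1", "sv2"]}
batches = list(epoch_minibatches(treebanks))
W0 = glorot_uniform(3, 4)
```

Ending the epoch at the smallest treebank is the balancing device described in the text: every mini-batch contributes exactly one gradient per source language.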
3.3.1 Target Languages with a Treebank

Here, we evaluate our MALOPA parser when the target language has a treebank.

Baseline. For each target language, the strong baseline we use is a monolingually-trained S-LSTM parser with a token representation which concatenates: pretrained word embeddings (50 dimensions),16 learned word embeddings (50 dimensions), coarse (universal) POS tag embeddings (12 dimensions), fine-grained (language-specific) POS tag embeddings (12 dimensions), and embeddings of Brown clusters (12 dimensions), and uses a two-layer S-LSTM for each of the stack, the buffer and the list of actions. We independently train one baseline parser for each target language, and share no model parameters. This baseline, denoted 'monolingual', achieves a UAS score of 93.0 and a LAS score of 91.5 when trained on the English Penn Treebank, which is comparable to Dyer et al. (2015).

MALOPA. We train MALOPA on the concatenation of the training sections of all seven languages. To balance the development set, we only concatenate the first 300 sentences of each language's development section. The first MALOPA parser we evaluate only uses coarse POS embeddings as the token representation.17 As shown in Table 3.3, this parser consistently performs much worse than the monolingual baselines, with a gap of 12.5 LAS points on average. Adding lexical embeddings to the token representation as described in §3.2.3 substantially improves the performance of MALOPA, recovering 83% of the gap in average performance.

We experimented with three ways to include language information in the token representation, namely 'language ID', 'word order' and 'full typology' (see §3.2.4 for details), and found all three to improve the performance of MALOPA, giving LAS scores of 83.5, 83.2 and 82.5, respectively. It is interesting to see that the model is capable of learning more useful language embeddings when typological properties are not specified. Using 'language ID', we have now recovered another 12% of the original gap.

Finally, the best configuration of MALOPA adds fine-grained POS embeddings to the token representation.18 Surprisingly, adding fine-grained POS embeddings improves the performance even for some languages where fine-grained POS tags are not available (e.g., Spanish), suggesting that the model is capable of predicting fine-grained POS tags for those languages via cross-lingual supervision. This parser outperforms the monolingual baseline in five out of seven target languages, and wins on average by 0.3 LAS points. We emphasize that this model is only trained once on all languages, and the same model is used to parse the test set of each language, which simplifies the distribution and deployment of multilingual parsing software.

We note that the fine-grained POS tags we used for Swedish and Portuguese encode morphological properties. Adding fine-grained POS tag embeddings for these two languages improved the results by 1.9 and 2.0 LAS points, respectively. This observation suggests that using morphological features as part of the token representation is likely to result in further improvements. This is especially relevant since recent versions of the universal dependency treebanks include universal annotations of morphological properties.

We note that, for some (model, test language) combinations, the improvements are small compared to the variance due to random initialization of model parameters. For example, the cell (+full typology, it) in Table 3.3 shows an improvement of 0.4 LAS points, while the standard deviation due to random initialization in Italian is 0.47 LAS. However, the average results across multiple languages show steady, robust improvements, each of which far exceeds the standard deviation of 0.17 LAS.

16 These embeddings are treated as fixed inputs to the parser, and are not optimized towards the parsing objective. We use the same embeddings used in Guo et al. (2016).
17 We use the same number of dimensions for the coarse POS embeddings as in the monolingual baselines. The same applies to all other types of embeddings used in MALOPA.
18 Fine-grained POS tags were only available for English, Italian, Portuguese and Swedish. Other languages reuse the coarse POS tags as fine-grained tags, instead of padding the extra dimensions in the token representation with zeros.

UAS                  de    en    es    fr    it    pt    sv    average
monolingual        84.5  88.7  87.5  85.6  91.1  89.1  87.2   87.6
MALOPA             78.8  75.6  80.6  79.7  84.7  81.6  77.6   79.8
+lexical           83.0  85.6  87.3  85.3  90.6  86.5  86.4   86.3
+full typology     83.3  86.1  87.1  85.8  90.8  87.8  86.7   86.8
+word order        84.0  87.2  87.3  86.0  91.0  87.9  87.2   87.2
+language ID       84.2  87.2  87.2  86.1  91.5  87.5  87.2   87.2
+fine-grained POS  84.7  88.6  88.1  86.4  91.1  89.4  88.2   88.0

LAS                  de    en    es    fr    it    pt    sv    average
monolingual        79.3  85.9  83.7  81.7  88.7  85.7  83.5   84.0
MALOPA             70.4  69.3  72.4  71.1  78.0  74.1  65.4   71.5
+lexical           76.7  82.0  82.7  81.2  87.6  82.1  81.2   81.9
+full typology     77.8  82.5  82.6  81.5  88.0  83.7  81.8   82.5
+word order        78.5  84.3  83.4  81.7  88.3  84.1  82.6   83.2
+language ID       78.6  84.2  83.4  82.4  89.1  84.2  82.6   83.5
+fine-grained POS  78.9  85.4  84.3  82.4  89.0  86.2  84.5   84.3

Table 3.3: Dependency parsing: unlabeled and labeled attachment scores (UAS, LAS) for monolingually-trained parsers and MALOPA. Each target language has a large treebank (see Table 3.2). In this table, we use the universal dependencies version 1.2, which only includes annotations for 13K English sentences, which explains the relatively low scores in English. When we instead use the universal dependency treebanks version 2.0, which includes annotations for 40K English sentences (originally from the English Penn Treebank), we achieve a UAS score of 93.0 and a LAS score of 91.5.

Qualitative analysis. To gain a better understanding of the model's behavior, we analyze certain classes of dependency attachments/relations in German, which has notably flexible word order, in Table 3.4. We consider the recall of left attachments (where the head word precedes the dependent word in the sentence), right attachments, root attachments, short attachments (with distance = 1), long attachments (with distance ≥ 6), as well as the following relation groups: nsubj* (nominal subjects: nsubj, nsubjpass), dobj (direct object: dobj), conj (conjunct: conj), *comp (clausal complements: ccomp, xcomp), case (clitics and adpositions: case), *mod (modifiers of a noun: nmod, nummod, amod, appos), and neg (negation modifier: neg). For each group, we report the recall of both the attachment and the relation, weighted by the number of instances in the gold annotation. A detailed description of each relation can be found at http://universaldependencies.org/u/dep/index.html

Recall %           left  right  root  short  long  nsubj*  dobj  conj  *comp  case  *mod
monolingual        89.9  95.2   86.4  92.9   81.1  77.3    75.5  66.0  45.6   93.3  77.0
MALOPA             85.4  93.3   80.2  91.2   73.3  57.3    62.7  64.2  34.0   90.7  69.6
+lexical           89.9  93.8   84.5  92.6   78.6  73.3    73.4  66.9  35.3   91.6  75.3
+language ID       89.1  94.7   86.6  93.2   81.4  74.7    73.0  71.2  48.2   92.8  76.3
+fine-grained POS  89.5  95.7   87.8  93.6   82.0  74.7    74.9  69.7  46.0   93.3  76.3

Table 3.4: Recall of some classes of dependency attachments/relations in German.

We found that each of the three improvements (lexical embeddings, language embeddings and fine-grained POS embeddings) tends to improve recall for most classes. Unfortunately, MALOPA underperforms (compared to the monolingual baseline) in some classes: nominal subjects, direct objects and modifiers of a noun. Nevertheless, MALOPA outperforms the baseline in some important classes such as root, long attachments and conjunctions.

Predicting language IDs and POS tags. In Table 3.3, we assume that both the language ID of the input and the POS tags are given at test time. However, this assumption may not be realistic in practical applications. Here, we quantify the degradation in parsing accuracy when language ID and POS tags are only given at training time, but must be predicted at test time. We do not use fine-grained POS tags in this part. In order to predict language ID, we use the langid.py library (Lui and Baldwin, 2012)19 and classify individual sentences in the test sets into one of the seven languages of interest, using the default models included in the library. The macro-average language ID prediction accuracy on the test sets across sentences is 94.7%. In order to predict POS tags, we use the model described in §3.2.5 with both input and hidden LSTM dimensions of 60, and with block dropout. The macro-average accuracy of the POS tagger is 93.3%. Table 3.5 summarizes the four configurations: {gold language ID, predicted language ID} × {gold POS tags, predicted POS tags}. The performance of the parser suffers mildly (-0.8 LAS points) when using predicted language IDs, but suffers significantly (-5.1 LAS points) when using predicted POS tags. Nevertheless, the observed degradation in parsing performance when using predicted POS tags is comparable to the degradations reported by Tiedemann (2015). The predicted POS results in Table 3.5 use block dropout. Without block dropout, we lose an extra 0.2 LAS points in both configurations using predicted POS tags, averaging over all languages.

Different multilingual embeddings. Several methods have been proposed for pretraining multilingual word embeddings. We compare three of them: multiCCA and multiCluster (Ammar et al., 2016b) and robust projection (Guo et al., 2015). All embeddings are trained on the same data and use the same number of dimensions (100). Table 3.6 illustrates that the three methods perform comparably well on this task.
84 networkparserbasedonChenand19https://git
networkparserbasedonChenand19https://github.com/saffsd/langid.py25 targetlangauge de en es fr it pt sv average languageIDaccuracy% 96.3 78.0 100.0 97.6 98.3 94.0 98.8 94.7targetlanguage de en es fr it pt sv average POStaggingaccuracy% 89.8 92.7 94.5 94.0 95.2 93.4 94.0 93.3UAS targetlanguage average languageIDcoarsePOS de en es fr it pt sv goldgold 84.2 87.2 87.2 86.1 91.5 87.5 87.2 87.2predictedgold 84.0 84.0 87.2 85.8 91.3 87.4 87.2 86.7goldpredicted 78.9 84.8 85.4 84.0 89.0 84.4 81.0 83.9predictedpredicted 78.5 79.7 85.4 83.8 88.7 83.2 80.9 82.8LAS targetlanguage average languageIDcoarsePOS de en es fr it pt sv goldgold 78.6 84.2 83.4 82.4 89.1 84.2 82.6 83.5predictedgold 78.5 80.2 83.4 82.1 88.9 83.9 82.5 82.7goldpredicted 71.2 79.9 8
85 0.5 78.5 85.0 78.4 75.5 78.4predictedpre
0.5 78.5 85.0 78.4 75.5 78.4predictedpredicted 70.8 74.1 80.5 78.2 84.7 77.1 75.5 77.2Table3.5:EffectofautomaticallypredictinglanguageIDandPOStagswithMALOPAonparsingaccuracy.multilingualembeddings ave.UAS ave.LAS multiCluster 87.7 84.1multiCCA 87.8 84.4robustprojection 87.8 84.2Table3.6:EffectofmultilingualembeddingestimationmethodonthemultilingualparsingwithMALOPA.UASandLASscoresaremacro-averagedacrossseventargetlanguages.26 LAS targetlanguage de es fr it sv Duongetal. 61.8 70.5 67.2 71.3 62.5MALOPA 63.4 70.5 69.1 74.1 63.4Table3.7:Small(3Ktoken)targettreebanksetting:language-universaldependencyparserper-formance.Manning(2014),whichsharesmostoftheparametersbetweenEnglishandthetargetlanguage,andusesanL2regularizertotiethelexicalembeddings
86 oftranslationally-equivalentwords.Whilen
oftranslationally-equivalentwords.Whilenottheprimaryfocusofthispaper,20wecompareourproposedmethodtothatofDuongetal.(2015)onvetargetlanguagesforwhichmultilinguallexicalfeaturesareavailablefromGuoetal.(2016).Foreachtargetlanguage,wetraintheparserontheEnglishtrainingdataintheUDversion1.0corpus(Nivreetal.,2015b)andasmalltreebankinthetargetlanguage.21FollowingDuongetal.(2015),wedonotuseanydevelopmentdatainthetargetlanguages,andwesubsampletheEnglishtrainingdataineachepochtothesamenumberofsentencesinthetargetlanguage.Weusethesamehyperparametersspeciedbefore.Table3.7showthatourproposedmethodoutperformsDuongetal.(2015)by1.4LASpointsonaverage.3.3.2TargetLanguageswithoutaTreebankMcDonaldetal.(2011)establishedthat,whennotreebankannotationsare
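As a quick sanity check on Table 3.5, the macro-averages can be recomputed directly from the per-language LAS scores. The snippet below is an illustrative sketch (not code from the thesis); the recomputed averages agree with the reported ones only up to rounding noise, because the per-language scores are themselves rounded to one decimal place.

```python
# Per-language LAS from Table 3.5, in the order de, en, es, fr, it, pt, sv.
las = {
    ("gold", "gold"):           [78.6, 84.2, 83.4, 82.4, 89.1, 84.2, 82.6],
    ("predicted", "gold"):      [78.5, 80.2, 83.4, 82.1, 88.9, 83.9, 82.5],
    ("gold", "predicted"):      [71.2, 79.9, 80.5, 78.5, 85.0, 78.4, 75.5],
    ("predicted", "predicted"): [70.8, 74.1, 80.5, 78.2, 84.7, 77.1, 75.5],
}
reported = {("gold", "gold"): 83.5, ("predicted", "gold"): 82.7,
            ("gold", "predicted"): 78.4, ("predicted", "predicted"): 77.2}

def macro_average(scores):
    """Unweighted mean over target languages (macro-averaging)."""
    return sum(scores) / len(scores)

for config, scores in las.items():
    # Agreement is approximate: the inputs are already rounded to one decimal.
    assert abs(macro_average(scores) - reported[config]) < 0.15
```

The headline deltas quoted in the text fall out of the reported averages: 83.5 - 82.7 = 0.8 LAS points lost with predicted language IDs, and 83.5 - 78.4 = 5.1 points lost with predicted POS tags.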
3.3.2 Target Languages without a Treebank

McDonald et al. (2011) established that, when no treebank annotations are available in the target language, training on multiple source languages outperforms training on one (i.e., multi-source model transfer outperforms single-source model transfer). In this section, we evaluate the performance of our parser in this setup. We use two strong baseline multi-source model transfer parsers with no supervision in the target language:

Zhang and Barzilay (2015) is a graph-based arc-factored parsing model with a tensor-based scoring function. It takes typological properties of a language as input. We compare to the best reported configuration (i.e., the column titled OURS in Table 5 of Zhang and Barzilay, 2015).

Guo et al. (2016) is a transition-based neural-network parsing model based on Chen and Manning (2014). It uses multilingual embeddings and Brown clusters as lexical features. We compare to the best reported configuration (i.e., the column titled MULTI-PROJ in Table 1 of Guo et al., 2016).

Following Guo et al. (2016), for each target language, we train the parser on six other languages in the Google Universal Dependency Treebanks version 2.022 (de, en, es, fr, it, pt, sv, excluding whichever is the target language). Our parser uses the same word embeddings and word clusters used in Guo et al. (2016), and does not use any typology information.23

[Footnote 20: The setup cost involved in recruiting linguists, and developing and revising annotation guidelines to annotate a new language, ought to be higher than the cost of annotating 3K tokens.]
[Footnote 21: We thank Long Duong for providing the subsampled training corpora in each target language.]
[Footnote 22: https://github.com/ryanmcd/uni-dep-tb/]
[Footnote 23: In preliminary experiments, we found language embeddings to hurt the performance of the parser for target languages without a treebank.]

UAS                         de    es    fr    it    pt    sv    average
Zhang and Barzilay (2015)   62.5  78.0  78.9  79.3  78.6  75.0  75.4
Guo et al. (2016)           65.0  79.0  77.6  78.4  81.8  78.2  76.3
this work                   65.2  80.2  80.6  80.7  81.2  79.0  77.8

LAS                         de    es    fr    it    pt    sv    average
Zhang and Barzilay (2015)   54.1  68.3  68.8  69.4  72.5  62.5  65.9
Guo et al. (2016)           55.9  73.0  71.0  71.2  78.6  69.5  69.3
MALOPA                      57.1  74.6  73.9  72.5  77.0  68.1  70.5

Table 3.8: Dependency parsing: unlabeled and labeled attachment scores (UAS, LAS) for multi-source transfer parsers in the simulated low-resource scenario where Lt ∩ Ls = ∅.

The results in Table 3.8 show that, on average, our parser outperforms both baselines by more than 1 point in LAS, and gives the best LAS results in four (out of six) languages.

3.3.3 Parsing Code-switched Input

Code-switching presents a challenge for monolingually-trained NLP models (Barman et al., 2014). We hypothesize that our language-universal approach is a good fit for code-switched text. However, it is hard to test this hypothesis due to the lack of universal dependency treebanks with naturally-occurring code-switching. Instead, we simulate an evaluation treebank with code-switching by replacing words in the English development set of the Universal Dependencies v1.2 with Spanish words. To account for the fact that Spanish words do not arbitrarily appear in code-switching with English, we only allow a Spanish word to substitute an English word under two conditions: (1) the Spanish word must be a likely translation of the English word, and (2) together with the (possibly modified) previous word in the treebank, the introduced Spanish word forms a bigram which appears in naturally-occurring code-switched tweets (from the EMNLP 2014 shared task on code switching (Lin et al., 2014)). 2.5% of English words in the development set were replaced with Spanish words. We use "pure" to refer to the original English development set, and "code-switched" to refer to the same development set after replacing 2.5% of English words with Spanish translations.

In order to quantify the degree to which monolingually-trained parsers make bad predictions when the input text is code-switched, we contrast the UAS performance of our joint model for tagging and parsing, trained on English treebanks with coarse POS tags only, and tested on pure vs. code-switched treebanks. We then repeat the same experiment with the MALOPA parser trained on seven languages, with language ID and coarse POS tags only. The results in Table 3.9 suggest that (simulated) code-switched input adversely affects the performance of monolingually-trained parsers, but hardly affects the performance of our MALOPA parser.

UAS           pure   code-switched
monolingual   85.0   82.6
MALOPA        84.7   84.8

Table 3.9: UAS results on the first 300 sentences in the English development set of the Universal Dependencies v1.2, with and without simulated code-switching.

3.4 Open Questions

Our results open the door for more research in multilingual NLP. Some of the questions triggered by our results are:

- Multilingual dependency parsing can be viewed as a domain adaptation problem where each language represents a different domain. Can we use the MALOPA approach in traditional domain adaptation settings?
- Can we combine the language-universal approach with other methods for indirect supervision (e.g., annotation projection, CRF autoencoders and co-training) to further improve performance in low-resource scenarios without hurting performance in high-resource scenarios?
- Can we obtain better results by sharing some of the model parameters among all members of the same language family?
- Can we apply the language-universal approach to more distant languages such as Arabic and Japanese?
- Can we apply the language-universal approach to more NLP problems such as named entity recognition and coreference resolution?

3.5 Related Work

Our work builds on the model transfer approach, which was pioneered by Zeman and Resnik (2008), who trained a parser on a source language treebank then applied it to parse sentences in a target language. Cohen et al. (2011) and McDonald et al. (2011) trained unlexicalized parsers on treebanks of multiple source languages and applied the parser to different languages. Naseem et al. (2012),
Täckström et al. (2013), and Zhang and Barzilay (2015) used language typology to improve model transfer. To add lexical information, Täckström et al. (2012a) used multilingual word clusters, while Xiao and Guo (2014), Guo et al. (2015), Søgaard et al. (2015) and Guo et al. (2016) used multilingual word embeddings. Duong et al. (2015) used a neural network based model, sharing most of the parameters between two languages, and used an L2 regularizer to tie the lexical embeddings of translationally-equivalent words. We incorporate these ideas in our framework, while proposing a novel neural architecture for embedding language typology (see §3.2.4) and another for consuming noisy structured inputs (block dropout). We also show how to replace an array of monolingually trained parsers with one multilingually-trained parser without sacrificing accuracy, which is related to Vilares et al. (2015).

Neural network parsing models which preceded Dyer et al. (2015) include Henderson (2003), Titov and Henderson (2007), Henderson and Titov (2010) and Chen and Manning (2014). Related to lexical features in cross-lingual parsing is Durrett et al. (2012), who defined lexico-syntactic features based on bilingual lexicons. Other related work includes Östling (2015), which may be used to induce more useful typological properties to inform multilingual parsing. Tsvetkov et al. (2016) concurrently used a similar approach to learn language-universal language models based on morphology.

Another popular approach for cross-lingual supervision is to project annotations from the source language to the target language via a parallel corpus (Yarowsky et al., 2001; Hwa et al., 2005) or via automatically-translated sentences (Schneider et al., 2013; Tiedemann et al., 2014). Ma and Xia (2014) used entropy regularization to learn from both parallel data (with projected annotations) and unlabeled data in the target language. Rasooli and Collins (2015) trained an array of target-language parsers on fully annotated trees, by iteratively decoding sentences in the target language with incomplete annotations. One research direction worth pursuing is to find synergies between the model transfer approach and the annotation projection approach.

3.6 Summary

In this chapter, we describe a general approach for training language-universal NLP models, and apply this approach to dependency parsing. The main ingredients of this approach are homogeneous annotations in multiple languages (e.g., the universal dependency treebanks), a core model with large capacity for representing complex functions (e.g., recurrent neural networks), mapping of language-specific representations into a language-universal space (e.g., multilingual word embeddings), and mechanisms for differentiating between the behavior of different languages (e.g., language embeddings and fine-grained POS tags). We show for the first time how to train language-universal models that perform competitively in multiple languages in both high- and low-resource scenarios.24 We also show for the first time, using a simulated evaluation set, that language-universal models are a viable solution for processing code-switched text.

We note the importance of lexical features that connect multiple languages via bilingual dictionaries or parallel corpora. When the multilingual lexical features are based on out-of-domain resources such as the Bible, we observe a significant drop in performance. Developing competitive
language-universal models for languages with small bilingual dictionaries remains an open problem for future research.

In principle, the proposed approach can be applied to many NLP problems such as named entity recognition, coreference resolution, question answering and textual entailment. However, in practice, the lack of homogeneous annotations in multiple languages for these tasks makes it hard to evaluate, let alone train, language-universal models. One possible approach to alleviate this practical difficulty is to use existing language-specific annotations for multiple languages and dedicate a small number of the model parameters to learn a mapping from a latent language-universal output to the language-specific annotations in a multi-task learning framework, which has been proposed in a different setup by Fang and Cohn (2016). We also hope that future multilingual annotation projects will develop annotation guidelines to encourage consistency across languages, following the example of the universal dependency treebanks.

[Footnote 24: The results in some (modification, test language) combinations were small relative to the standard deviation, suggesting that some of these modifications may not be important for those languages. Nevertheless, the average results across multiple languages show steady, robust improvements, each of which far exceeds the standard deviation.]

Chapter 4

Multilingual Word Embeddings

4.1 Overview

The previous chapter discussed how to develop language-universal models for analyzing text. In this chapter, we focus on multilingual word embeddings, one of the important mechanisms for enabling cross-lingual supervision.

Vector-space representations of words are widely used in statistical models of natural language. In addition to improvements on standard monolingual NLP tasks (Collobert and Weston, 2008), shared representation of words across languages offers intriguing possibilities (Klementiev et al., 2012). For example, in machine translation, translating a word never seen in parallel data may be overcome by seeking its vector-space neighbors, if the embeddings are learned from both plentiful monolingual corpora and more limited parallel data. A second opportunity comes from transfer learning, in which models trained in one language can be deployed in other languages. While previous work has used hand-engineered features that are cross-lingually stable as the basis for model transfer (Zeman and Resnik, 2008; McDonald et al., 2011; Tsvetkov et al., 2014), automatically learned embeddings offer the promise of better generalization at lower cost (Klementiev et al., 2012; Hermann and Blunsom, 2014; Guo et al., 2016). We therefore conjecture that developing estimation methods for massively multilingual word embeddings (i.e., embeddings for words in a large number of languages) will play an important role in the future of multilingual NLP.

Novel contributions in this chapter include two methods for estimating multilingual embeddings which only require monolingual data in each language and pairwise bilingual dictionaries, and which scale to a large number of languages. We also introduce our evaluation web portal for uploading arbitrary multilingual embeddings and evaluating them automatically using a suite of intrinsic and extrinsic evaluation methods. The material in this chapter was previously published in Ammar et al. (2016b).

4.2 Estimation Methods
Let L be a set of languages, and let V^m be the set of surface forms (word types) in m ∈ L. Let V = ∪_{m∈L} V^m. Our goal is to estimate a partial embedding function E : L × V → R^d (allowing a surface form that appears in two languages to have a different vector in each). We would like to estimate this function such that: (i) semantically similar words in the same language are nearby, (ii) translationally equivalent words in different languages are nearby, and (iii) the domain of the function covers as many words in V as possible.

We use distributional similarity in a monolingual corpus M^m to model semantic similarity between words in the same language.1 For cross-lingual similarity, either a parallel corpus P^{m,n} or a bilingual dictionary D^{m,n} ⊆ V^m × V^n can be used. In some cases, we extract the bilingual dictionary from a parallel corpus. To do this, we align the corpus using fast_align (Dyer et al., 2013) in both directions. The estimated parameters of the word translation distributions are used to select pairs: D^{m,n} = {(u, v) | u ∈ V^m, v ∈ V^n, p_{m|n}(u|v) × p_{n|m}(v|u) ≥ τ}, where the threshold τ trades off dictionary recall and precision.2

[Footnote 1: Monolingual corpora are often an order of magnitude larger than parallel corpora. Therefore, multilingual word embedding models trained on monolingual corpora tend to have higher coverage.]
[Footnote 2: We fixed τ = 0.1 for all language pairs based on manual inspection of the resulting dictionaries.]

With three notable exceptions (see §4.2.3, §4.2.4, §4.5), previous work on multilingual embeddings only considered the bilingual case, |L| = 2. In this section, we focus on estimating multilingual embeddings for |L| > 2. We first describe two novel dictionary-based methods (multiCluster and multiCCA).
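The dictionary extraction criterion above (keep a pair when the product of the two directional translation probabilities clears τ) can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis code; the toy probability tables stand in for the translation distributions fast_align would estimate.

```python
def extract_dictionary(p_src_given_tgt, p_tgt_given_src, tau=0.1):
    """Keep (u, v) when p(u|v) * p(v|u) >= tau, mirroring the bidirectional
    filter described above; tau trades off recall against precision."""
    dictionary = set()
    for (u, v), p_uv in p_src_given_tgt.items():
        p_vu = p_tgt_given_src.get((v, u), 0.0)
        if p_uv * p_vu >= tau:
            dictionary.add((u, v))
    return dictionary

# Toy translation tables (illustrative values only, not real alignment output).
p_en_given_fr = {("dog", "chien"): 0.7, ("hot", "chien"): 0.1}
p_fr_given_en = {("chien", "dog"): 0.6, ("chien", "hot"): 0.05}

# 0.7 * 0.6 = 0.42 passes tau = 0.1, while 0.1 * 0.05 = 0.005 does not.
assert extract_dictionary(p_en_given_fr, p_fr_given_en) == {("dog", "chien")}
```

Raising τ shrinks the dictionary toward high-precision pairs; the thesis fixes τ = 0.1 for all language pairs (footnote 2 above).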
Then, we review a variant of the multiSkip method (Guo et al., 2016) and the translation-invariance matrix factorization method (Gardner et al., 2015).3

4.2.1 MultiCluster embeddings

In this method, we decompose the problem into two simpler subproblems: E = E_embed ∘ E_cluster, where E_cluster : L × V → C maps words to multilingual clusters C, and E_embed : C → R^d assigns a vector to each cluster. We use a bilingual dictionary to find clusters of translationally equivalent words, then use distributional similarities of the clusters in monolingual corpora from all languages in L to estimate an embedding for each cluster. By forcing words from different languages in a cluster to share the same embedding, we create anchor points in the vector space to bridge languages. Fig. 4.1 illustrates this method with a schematic diagram.

More specifically, we define the clusters as the connected components in a graph where nodes are (language, surface form) pairs and edges correspond to translation entries in D^{m,n}. We assign arbitrary IDs to the clusters, replace each word token in each monolingual corpus with the corresponding cluster ID, and concatenate all modified corpora. The resulting corpus consists of multilingual cluster ID sequences. We can then apply any monolingual embedding estimator to obtain cluster embeddings; here, we use the skipgram model from Mikolov et al. (2013a). Our implementation of the multiCluster method is available on GitHub.4

4.2.2 MultiCCA embeddings

In this method, we first use a monolingual estimator to independently embed words in each language of interest. We then pick a pivot language and linearly project words from every other language into the vector space of the pivot language.5 In order to find the linear projection between two languages, we build on the work of Faruqui and Dyer (2014), who proposed a bilingual embedding estimation method based on canonical correlation analysis (CCA) and showed that the resulting embeddings for English words outperform monolingually-trained English embeddings on word similarity tasks. First, they use monolingual corpora to train monolingual embeddings for each language independently (E^m and E^n), capturing semantic similarity within each language separately. Then, using a bilingual dictionary D^{m,n}, they use CCA to estimate linear projections from the ranges of the monolingual embeddings E^m and E^n, yielding a bilingual embedding E^{m,n}. The linear projections are defined by T^{m→m,n} and T^{n→m,n} ∈ R^{d×d}; they are selected to maximize the correlation between the vector pairs T^{m→m,n} E^m(u) and T^{n→m,n} E^n(v), where (u, v) ∈ D^{m,n}.
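The correlation-maximizing projections can be sketched with numpy. This is a minimal illustration of CCA on synthetic paired vectors, not the implementation used in the thesis (which builds on Faruqui and Dyer's tooling); `cca_projections` and the toy data are assumptions made for the example.

```python
import numpy as np

def cca_projections(X, Y, d):
    """Estimate CCA projections for paired rows of X and Y: the returned
    T_x, T_y map the (centered) inputs into a shared d-dimensional space
    where the paired projections are maximally correlated."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Covariance blocks, with a tiny ridge for numerical stability.
    Cxx = Xc.T @ Xc / n + 1e-8 * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + 1e-8 * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    def inv_sqrt(C):  # inverse symmetric square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ U[:, :d], Wy @ Vt.T[:, :d]

# Paired "embeddings" of dictionary entries: Y is a noisy linear map of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
Y = X @ rng.normal(size=(4, 4)) + 0.01 * rng.normal(size=(300, 4))
Tx, Ty = cca_projections(X, Y, d=2)
proj_x = (X - X.mean(axis=0)) @ Tx
proj_y = (Y - Y.mean(axis=0)) @ Ty
# The first canonical pair should be almost perfectly correlated.
assert np.corrcoef(proj_x[:, 0], proj_y[:, 0])[0, 1] > 0.9
```

Here the two projections play the role of T^{m→m,n} and T^{n→m,n}: after projection, translationally equivalent pairs from D^{m,n} land near each other in the shared space.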
The bilingual embedding is then defined as E_CCA(m, u) = T^{m→m,n} E^m(u) (and likewise for E_CCA(n, v)). We start by estimating, for each m ∈ L \ {en}, the two projection matrices T^{m→m,en} and T^{en→m,en}; these are guaranteed to be non-singular, a useful property of CCA we exploit by invert-

[Footnote 3: We developed the multiSkip method independently of Guo et al. (2016).]
[Footnote 4: https://github.com/wammar/wammar-utils/blob/master/train-multilingual-embeddings.sh]
[Footnote 5: We use English as the pivot language since English typically offers the largest corpora and wide availability of bilingual dictionaries.]

Figure 4.1: Steps for estimating multiCluster embeddings: 1) identify the set of languages of interest, 2) obtain bilingual dictionaries between pairs of languages, 3) group translationally equivalent words into clusters, 4) obtain a monolingual corpus for each language of interest, 5) replace words in monoli