Towards a Universal Analyzer of Natural Languages

Waleed Ammar
CMU-LTI-16-010
June 2016

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Chris Dyer (co-chair), Carnegie Mellon University
Noah A. Smith (co-chair), University of Washington
Tom Mitchell, Carnegie Mellon University
Kuzman Ganchev, Google Research

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2016 Waleed Ammar

Keywords: multilingual NLP, dependency parsing, word alignment, part-of-speech tagging, low-resource languages, recurrent neural networks, word embeddings, linguistic typology, code switching, language identification, MALOPA, multiCluster, multiCCA, word embeddings evaluation, CRF autoencoders, language-universal

For my family.

Acknowledgments

I had the pleasure and honor of interacting with great individuals who influenced my work throughout the five precious years of my Ph.D. First and foremost, I would like to express my deep gratitude to Noah Smith and Chris Dyer for their enlightening guidance and tremendous support in every step of the road. Thank you for placing your confidence in me at the times I needed it the most. I will be indebted to you for the rest of my career. I also would like to acknowledge the following groups of talented individuals: my collaborators: Chu-Cheng Lin, Lori Levin, Yulia Tsvetkov, D. Sculley, Kristina Toutanova, Shomir Wilson, Norman Sadeh, Miguel Ballesteros, George Mulcaire, Guillaume Lample, Pradeep Dasigi, Eduard Hovy, Anders Søgaard, Zeljko Agic, Hiroaki Hayashi, Austin Matthews and Manaal Faruqui for making work much more enjoyable; my (ex-)colleagues: Brendan O'Connor, Swabha Swayamdipta, Lingpeng Kong, Chu-Cheng Lin, Dani Yogatama, Ankur Parikh, Ahmed Hefny, Prasanna Kumar, Manaal Faruqui, Shay Cohen, Kevin Gimpel, Andre Martins, Avneesh Saluja, David Bamman, Sam Thomson, Austin Matthews, Dipanjan Das, Jesse Dodge, Matt Gardner, Mohammad Gowayyed, Adona Iosif, Jesse Dunietz, Subhodeep Moitra, Adhi Kuncoro, Yan Chuan Sim, Leah Nicolich-Henkin, Eva Schlinger and Victor Chahuneau for creating a diverse and enriching environment, and I forgive you for igniting my imposter syndrome more often than I would like to admit; my Google fellowship mentors: Ryan McDonald and George Foster for many stimulating discussions; my committee members Kuzman Ganchev and Tom Mitchell for their time and valuable feedback; and other members of the CMU family: Stacey Young, Laura Alford, Mallory Deptola, Roni Rosenfeld, Mohamed Darwish, Mahmoud Bishr, Mazen Soliman who made school feel more of a home than a workplace.

I am especially grateful for my family. Thank you Abdul-mottaleb and Mabroka Ammar for giving me everything and expecting nothing in return! Thank you Amany Hasan for sacrificing your best interests so that I can pursue my passion! Thank you Salma Ammar for creating so much joy in my life! Thank you Khaled Ammar for being a great role model to follow! Thank you Samia and Walaa Ammar for your tremendous love!

Abstract

Developing NLP tools for many languages involves unique challenges not typically encountered in English NLP work (e.g., limited annotations, unscalable architectures, code switching). Although each language is unique, different languages often exhibit similar characteristics (e.g., phonetic, morphological, lexical, syntactic) which can be exploited to synergistically train analyzers for multiple languages. In this thesis, we advocate for a novel language-universal approach to multilingual NLP in which one statistical model trained on multilingual, homogeneous annotations is used to process natural language input in multiple languages.

To empirically show the merits of the proposed approach, we develop MALOPA, a language-universal dependency parser which outperforms monolingually-trained parsers in several low-resource and high-resource scenarios. MALOPA is a greedy transition-based parser which uses multilingual word embeddings and other language-universal features as a homogeneous representation of the input across all languages.

To address the syntactic differences between languages, MALOPA makes use of token-level language information as well as language-specific representations such as fine-grained part-of-speech tags. MALOPA uses a recurrent neural network architecture and multi-task learning to jointly predict POS tags and labeled dependency parses.

Focusing on homogeneous input representations, we propose novel methods for estimating multilingual word embeddings and for predicting word alignments. We develop two methods for estimating multilingual word embeddings from bilingual dictionaries and monolingual corpora. The first estimation method, multiCluster, learns embeddings of word clusters which may contain words from different languages, and learns distributional similarities by pooling the contexts of all words in the same cluster in multiple monolingual corpora. The second estimation method, multiCCA, learns a linear projection of monolingually trained embeddings in each language to one vector space, extending the work of Faruqui and Dyer (2014) to more than two languages. To show the scalability of our methods, we train multilingual embeddings in 59 languages. We also develop an extensible, easy-to-use web-based evaluation portal for evaluating arbitrary multilingual word embeddings on several intrinsic and extrinsic tasks.

We develop the conditional random field autoencoder (CRF autoencoder) model for unsupervised learning of structured predictors, and use it to predict word alignments in parallel corpora. We use a feature-rich CRF model to predict the latent representation conditional on the observed input, then reconstruct the input conditional on the latent representation using a generative model which factorizes similarly to the CRF model. To reconstruct the observations, we experiment with a categorical distribution over word types (or word clusters), and a multivariate Gaussian distribution that generates pretrained word embeddings.

Contents

1 Introduction
  1.1 Challenges in Multilingual NLP
  1.2 Thesis Statement
  1.3 Summary of Contributions

2 Background
  2.1 Natural Languages
    2.1.1 It is not sufficient to develop NLP tools for the few most popular languages
    2.1.2 Languages often studied in NLP research
    2.1.3 Characterizing similarities and dissimilarities across languages
  2.2 Word Embeddings
    2.2.1 Distributed word representations
    2.2.2 Skipgram model for estimating monolingual word embeddings

3 Language-universal Dependency Parsing
  3.1 Overview
  3.2 Approach
    3.2.1 Homogeneous Annotations
    3.2.2 Core Model
    3.2.3 Language Unification
    3.2.4 Language Differentiation
    3.2.5 Multi-task Learning for POS Tagging and Dependency Parsing
  3.3 Experiments
    3.3.1 Target Languages with a Treebank
    3.3.2 Target Languages without a Treebank
    3.3.3 Parsing Code-switched Input
  3.4 Open Questions
  3.5 Related Work
  3.6 Summary

4 Multilingual Word Embeddings
  4.1 Overview
  4.2 Estimation Methods
    4.2.1 MultiCluster embeddings
    4.2.2 MultiCCA embeddings
    4.2.3 MultiSkip embeddings
    4.2.4 Translation-invariant matrix factorization
  4.3 Evaluation Portal
    4.3.1 Word similarity
    4.3.2 Word translation
    4.3.3 Correlation-based evaluation tasks
    4.3.4 Extrinsic tasks
  4.4 Experiments
    4.4.1 Correlations between intrinsic vs. extrinsic evaluation metrics
    4.4.2 Evaluating multilingual estimation methods
  4.5 Related Work
  4.6 Summary

5 Conditional Random Field Autoencoders
  5.1 Overview
  5.2 Approach
    5.2.1 Notation
    5.2.2 Model
    5.2.3 Learning
    5.2.4 Optimization
  5.3 Experiments
    5.3.1 POS Induction
    5.3.2 Word Alignment
    5.3.3 Token-level Language Identification
  5.4 Open Questions
  5.5 Related Work
  5.6 Summary

6 Conclusion
  6.1 Contributions
  6.2 Future Directions

A Adding New Evaluation Tasks to the Evaluation Portal

List of Figures

1.1 Instead of training one model per language, we develop one language-universal model to analyze text in multiple languages.

2.1 A histogram that shows the number of languages by population of native speakers in log scale. For instance, the figure shows that 306 languages have a population between one million and ten million native speakers. Numbers are based on Lewis et al. (2016).

2.2 Number of Internet users (in millions) per language follows a power law distribution, with a long tail (not shown) which accounts for 21.8% of Internet users. The figure was retrieved on May 10, 2016 from http://www.internetworldstats.com/stats7.htm

2.3 Reproduced from Bender (2011): Languages studied in ACL 2008 papers, by language genus and family.

3.1 Dependency parsing is a core problem in NLP which is used to inform other tasks such as coreference resolution (e.g., Durrett and Klein, 2013), semantic parsing (e.g., Das et al., 2010) and question answering (e.g., Heilman and Smith, 2010).

3.2 The parser computes vector representations for the buffer, stack and list of actions at timestep t denoted b_t, s_t, and a_t, respectively. The three vectors feed into a hidden layer denoted as `parser state', followed by a softmax layer that represents possible outputs at timestep t.

3.3 An illustration of the content of the buffer S-LSTM module, the actions S-LSTM module, and the stack S-LSTM module for the first three steps in parsing the sentence "You love your cat $". Upper left: the buffer is initialized with all tokens in reverse order. Upper right: simulating a `shift' action, the head vector of the buffer S-LSTM backtracks, and two new hidden layers are computed for the actions S-LSTM and the stack S-LSTM. Lower left: simulating another `shift' action, the head vector of the buffer S-LSTM backtracks, and two additional hidden layers are computed for the actions S-LSTM and the stack S-LSTM. Lower right: simulating a `right-arc(nsubj)', the head vector of the buffer S-LSTM backtracks, and an additional hidden layer is computed for the actions S-LSTM. The stack S-LSTM first backtracks two steps, then a new hidden layer is computed as a function of the head, the dependent, and the relationship representations.

3.4 The vector representing language typology complements the token representation, the action representation and the input to the parser state.

4.1 Steps for estimating multiCluster embeddings: 1) identify the set of languages of interest, 2) obtain bilingual dictionaries between pairs of languages, 3) group translationally equivalent words into clusters, 4) obtain a monolingual corpus for each language of interest, 5) replace words in monolingual corpora with cluster IDs, 6) obtain cluster embeddings, 7) use the same embedding for all words in the same cluster.

4.2 Steps for estimating multiCCA embeddings: 1) pretrain monolingual embeddings for the pivot language (English). For each other language m: 2) pretrain monolingual embeddings for m, 3) estimate linear projections T_{m→m,en} and T_{en→m,en}, 4) project the embeddings of m into the embedding space of the pivot language.

4.3 Sample correlation coefficients (perfect correlation = 1.0) for each dimension in the solution found by CCA, based on a dictionary of 35,524 English-Bulgarian translations.

5.1 Left: Examples of structured observations (in black), hidden structures (in grey), and side information (underlined). Right: Model variables for POS induction and word alignment. A parallel corpus consists of pairs of sentences ("source" and "target").

5.2 Graphical model representations of CRF autoencoders. Left: the basic autoencoder model where the observation x generates the hidden structure y (encoding), which then generates x̂ (reconstruction). Center: side information (φ) is added. Right: a factor graph showing first-order Markov dependencies among elements of the hidden structure y.

5.3 V-measure of induced parts of speech in seven languages, with multinomial emissions.

5.4 Many-to-one accuracy % of induced parts of speech in seven languages, with multinomial emissions.

5.5 POS induction results of 5 models (V-Measure, higher is better). Models which use word embeddings (i.e., Gaussian HMM and Gaussian CRF Autoencoder) outperform all baselines on average across languages.

5.6 Left: Effect of dimension size on POS induction on a subset of the English PTB corpus. Window size is set to 1 for all configurations. Right: Effect of context window size on V-measure of POS induction on a subset of the English PTB corpus. d = 100, scale ratio = 5.

5.7 Average inference runtime per sentence pair for word alignment in seconds (vertical axis), as a function of the number of sentences used for training (horizontal axis).

5.8 Left: L-BFGS vs. SGD with a constant learning rate. Right: L-BFGS vs. SGD with diminishing η_t.

5.9 Left: L-BFGS vs. SGD with cyclic vs. randomized order (and with epoch-fixed η). Right: Asynchronous SGD updates with 1, 2, 4, and 8 processors.

5.10 Left: Batch vs. online EM with different values of α (stickiness parameter). Right: Batch EM vs. online EM with mini-batch sizes of 10^2, 10^3, 10^4.

List of Tables

2.1 Current number of tokens in Leipzig monolingual corpora (in millions), word pairs in printed bilingual dictionaries (in thousands), tokens in the target side of en-xx OPUS parallel corpora (in millions), and the universal dependency treebanks (in thousands) for 41 languages. More recent statistics can be found at http://opus.lingfil.uu.se/, http://www.bilingualdictionaries.com/, http://corpora2.informatik.uni-leipzig.de/download.html and http://universaldependencies.org/, respectively.

3.1 Parser transitions indicating the action applied to the stack and buffer at time t and the resulting stack and buffer at time t+1.

3.2 Number of sentences (tokens) in each treebank split in Universal Dependency Treebanks version 2.0 (UDT) and Universal Dependencies version 1.2 (UD) for the languages we experiment with. The last row gives the number of unique language-specific fine-grained POS tags used in a treebank.

3.3 Dependency parsing: unlabeled and labeled attachment scores (UAS, LAS) for monolingually-trained parsers and MALOPA. Each target language has a large treebank (see Table 3.2). In this table, we use the universal dependencies version 1.2 which only includes annotations for 13K English sentences, which explains the relatively low scores in English. When we instead use the universal dependency treebanks version 2.0 which includes annotations for 40K English sentences (originally from the English Penn Treebank), we achieve a UAS score of 93.0 and a LAS score of 91.5.

3.4 Recall of some classes of dependency attachments/relations in German.

3.5 Effect of automatically predicting language ID and POS tags with MALOPA on parsing accuracy.

3.6 Effect of the multilingual embedding estimation method on multilingual parsing with MALOPA. UAS and LAS scores are macro-averaged across seven target languages.

3.7 Small (3K token) target treebank setting: language-universal dependency parser performance.

3.8 Dependency parsing: unlabeled and labeled attachment scores (UAS, LAS) for multi-source transfer parsers in the simulated low-resource scenario where L_t ∩ L_s = ∅.

3.9 UAS results on the first 300 sentences in the English development set of the Universal Dependencies v1.2, with and without simulated code-switching.

4.1 Evaluation metrics on the corpus and languages for which evaluation data are available.

4.2 Correlations between intrinsic evaluation metrics (rows) and downstream task performance (columns).

4.3 Results for multilingual embeddings that cover 59 languages. Each row corresponds to one of the embedding evaluation metrics we use (higher is better). Each column corresponds to one of the embedding estimation methods we consider; i.e., numbers in the same row are comparable. Numbers in square brackets are coverage percentages.

4.4 Results for multilingual embeddings that cover Bulgarian, Czech, Danish, Greek, English, Spanish, German, Finnish, French, Hungarian, Italian and Swedish. Each row corresponds to one of the embedding evaluation metrics we use (higher is better). Each column corresponds to one of the embedding estimation methods we consider; i.e., numbers in the same row are comparable. Numbers in square brackets are coverage percentages.

5.1 Left: AER results (%) for Czech-English word alignment. Lower values are better. Right: BLEU translation quality scores (%) for Czech-English, Urdu-English and Chinese-English. Higher values are better.

5.2 Total number of tweets, tokens, and Twitter user IDs for each language pair. For each language pair, the first line represents all data provided to shared task participants. The second and third lines represent our train/test data split for the experiments reported in this chapter. Since Twitter users are allowed to delete their tweets, the number of tweets and tokens reported in the third and fourth columns may be less than the number of tweets and tokens originally annotated by the shared task organizers.

5.3 The type-level coverage percentages of annotated data according to word embeddings (second column) and according to word lists (third column), per language.

5.4 Token-level accuracy (%) results for each of the four language pairs.

5.5 Confusion between MSA and ARZ in the Baseline configuration.

5.6 F-measures of two Arabic configurations. lang1 is MSA. lang2 is ARZ.

5.7 Number of tweets in L, U_test and U_all used for semi-supervised learning of CRF autoencoder models.

Chapter 1

Introduction

Multilingual NLP is the scientific and engineering discipline concerned with automatically analyzing written or spoken input in multiple human languages. The desired analysis depends on the application, but is often represented as a set of discrete variables (e.g., part-of-speech tags) with a presumed dependency structure (e.g., first-order Markov dependencies). A (partially) correct analysis can be used to enable natural computer interfaces such as Apple's Siri. Other applications include summarizing long articles in a few sentences (e.g., Salton et al., 1997), and discovering subtle trends in large amounts of user-generated text (e.g., O'Connor et al., 2010). The ability to process human languages has always been one of the primary goals of artificial intelligence since its conception by McCarthy et al. (1955).

1.1 Challenges in Multilingual NLP

Although some English NLP problems are far from solved, we can make a number of simplifying assumptions when developing monolingual NLP models for a high-resource language such as English. We can often assume that large annotated corpora are available. Even when they are not, it is reasonable to assume we can find qualified annotators either locally or via crowdsourcing. It is easy to iteratively design and evaluate meaningful language-specific features (e.g., "[city] is the capital of [country]", "[POS] ends with -ing"). It is also often assumed that the input language matches that of the training data.

When we develop models to analyze many languages, the first challenge we often find is the lack of annotations and linguistic resources. Language-specific feature engineering and error analysis become tedious at best, assuming we are lucky enough to collaborate with researchers or linguists who know all languages of interest. More often than not, however, it is infeasible to design meaningful language-specific features because the team has insufficient collective knowledge of some languages. Configuring, training, tuning, monitoring and occasionally updating the models for each language of interest is logistically difficult and requires more human and computational resources. Low-resource languages require a different pipeline than high-resource languages and are often ignored in an industrial setup.[1] The input can be in one of many languages, and often uses multiple languages in the same discourse. These challenges motivate the work in this dissertation.

Other multilingual challenges not addressed in this thesis include: identifying sentence boundaries in languages which do not use unique punctuation to end a sentence (e.g., Thai, Arabic), tokenization in languages which do not use spaces to separate words (e.g., Japanese), and finding digitized texts of some languages (e.g., Yoruba, Hieroglyphic).

1.2 Thesis Statement

In this thesis, we show the feasibility of a language-universal approach to multilingual NLP in which one statistical model trained on multilingual, homogeneous annotations is used to process natural language input in multiple languages.

[1] Although the performance on low-resource languages tends to be much worse than that of high-resource languages, even inaccurate predictions can be very useful. For example, a machine translation system trained on a small parallel corpus will produce inaccurate translations, but an inaccurate translation is often sufficient to learn what a foreign document is about.

Figure 1.1: Instead of training one model per language, we develop one language-universal model to analyze text in multiple languages.

This approach addresses several practical difficulties in multilingual NLP such as processing code-switched input and developing/maintaining a large number of models to cover languages of interest, especially low-resource ones. Although each language is unique, different languages often exhibit similar characteristics (e.g., phonetic, morphological, lexical, syntactic) which can be exploited to synergistically train universal language analyzers. The proposed language-universal models outperform monolingually-trained models in several low-resource and high-resource scenarios.

Building on a rich literature in multilingual NLP, this dissertation enables the language-universal approach by developing statistical models for: i) a language-universal syntactic analyzer to exemplify the proposed approach, ii) estimating massively multilingual word embeddings to serve as a shared representation of natural language input in multiple languages, and iii) inducing word alignments from unlabeled examples in a parallel corpus. The models proposed in each of the three components are designed, implemented, and empirically contrasted to competitive baselines.

1.3 Summary of Contributions

In chapter 3, we describe the language-universal approach to training multilingual NLP models. Instead of training one model per language, we simultaneously train one language-universal model on annotations in multiple languages (see Fig. 1.1). We show that this approach outperforms comparable language-specific models on average, and also outperforms state-of-the-art models in low-resource scenarios where no or few annotations are available in the target language.

In chapter 4, we focus on the problem of estimating distributed, multilingual word representations. Previous work on this problem assumes the availability of sizable parallel corpora which connect all languages of interest in a fully connected graph. However, in practice, publicly available parallel corpora of high quality are only available for a relatively small number of languages. In order to scale up training of multilingual word embeddings to more languages, we propose two estimation methods which use bilingual dictionaries instead of parallel corpora.

In chapter 5, we describe a feature-rich model for structured prediction problems which learns from unlabeled examples. We instantiate the model for POS induction, word alignments, and token-level language identification.

Chapter 2

Background

2.1 Natural Languages

While open-source NLP tools (e.g., TurboParser[1] and Stanford Parser[2]) are readily available in several languages, one can hardly find any tools for most living languages. The language-universal approach we describe in this thesis provides a practical solution for processing an arbitrary language, given a monolingual corpus and a bilingual dictionary (or a parallel corpus) to induce lexical features in that language. This section discusses some aspects of the diversity of natural languages to help the reader appreciate the full potential of our approach.

2.1.1 It is not sufficient to develop NLP tools for the few most popular languages

Ethnologue (Lewis et al., 2016), an extensive catalog of the world's languages, estimates that the world population of approximately 7.4 billion people natively speaks 6,879 languages as of 2016. Many of these languages are spoken by a large population. For instance, Fig. 2.1 shows that 306 languages have a population between one million and ten million native speakers.

Figure 2.1: A histogram that shows the number of languages by population of native speakers in log scale. For instance, the figure shows that 306 languages have a population between one million and ten million native speakers. Numbers are based on Lewis et al. (2016).

[1] http://www.cs.cmu.edu/~ark/TurboParser/
[2] http://nlp.stanford.edu/software/lex-parser.shtml

Written languages used on the Internet also follow a similar pattern, although the language ranking is different (e.g., Chinese has the largest number of native speakers but English has the largest number of Internet users). Fig. 2.2 gives the number of Internet users per language (for the top ten languages),[3] and shows that the long tail of languages (ranked 11 or more) accounts for 21.8% of all Internet users.

Figure 2.2: Number of Internet users (in millions) per language follows a power law distribution, with a long tail (not shown) which accounts for 21.8% of Internet users. The figure was retrieved on May 10, 2016 from http://www.internetworldstats.com/stats7.htm

Beyond languages with a large population of native speakers, even endangered languages[4] and extinct languages[5] may be important to build NLP tools for. For example, The Endangered Languages Project[6] aims to document, preserve and teach endangered languages in order to reduce the loss of cultural knowledge when those languages fall out of use. NLP tools have also been used to help with historical discoveries by analyzing preserved text written in ancient languages (Bamman et al., 2013).

[3] Estimated by Miniwatts Marketing Group at http://www.internetworldstats.com/stats7.htm
[4] An endangered language is one at risk of reaching a native speaking population of zero as it falls out of use.
[5] Extinct languages have no living speaking population.
[6] http://www.endangeredlanguages.com/

2.1.2 Languages often studied in NLP research

It is no surprise that some languages receive more attention than others in NLP research. Fig. 2.3, reproduced from Bender (2011), shows that 63% of papers in the ACL 2008 conference studied English, while only 0.7% (i.e., one paper) studied Danish, Swedish, Bulgarian, Slovene, Ukrainian, Aramaic, Turkish or Wambaya.

Figure 2.3: Reproduced from Bender (2011): Languages studied in ACL 2008 papers, by language genus and family.

The number of speakers of a language might explain the attention it receives, but only partially; e.g., Bengali, the native language of 2.5% of the world's population, is not studied by any papers in ACL 2008.

Other factors which contribute to the attention a language receives in NLP research may include:

Availability of annotated datasets: Most NLP research is empirical, requiring the availability of annotated datasets. (Even unsupervised methods require an annotated dataset for evaluation.) It is customary to call a language with no or little annotation a "low-resource language", but the term is loosely defined. Some languages may have plenty of resources for one NLP problem, but no resources for another. Even for a particular NLP task, there is no clear threshold for the magnitude of annotated data below which a language is considered to be a low-resource language. Table 2.1 provides statistics about the size of datasets in highly multilingual resources: the Leipzig monolingual corpora,[7] OPUS parallel corpora,[8] and the universal dependency treebanks.[9]

Economic significance: The industrial arm of NLP research (e.g., Bell Labs, IBM, BBN, Microsoft, Google) has made important contributions to the field. Short of addressing all languages at the same time, more economically significant languages are often given a higher priority.

Research funding: Many NLP research studies are funded by national or regional agencies such as the National Science Foundation (NSF), the European Research Council (ERC), the Defense Advanced Research Projects Agency (DARPA), the Intelligence Advanced Research Projects Activity (IARPA) and the European Commission. The research goals of the funding agency often partially determine which languages will be studied in a funded project. For example, EuroMatrix was a three-year research project funded by the European Commission (2009-2012) which aimed to develop machine translation systems for all pairs of languages in the European Union.[10] Another example is TransTac, a five-year research project funded by DARPA which aimed to develop speech-to-speech translation systems, primarily for English-Iraqi and Iraqi-English. In both examples, the choice of languages studied was driven by strategic goals of the funding agencies.

2.1.3 Characterizing similarities and dissimilarities across languages

Linguistic typology (Comrie, 1989) is a field of linguistics which aims to organize languages into different types (i.e., to typologize languages). Typologists often use reference grammars[11] to contrast linguistic properties across different languages. An extensive list of typological properties can be found for 2,679 languages in the World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013).[12] Studied properties include: syntactic patterns (e.g., order of subject, verb and object), morphological properties (e.g., reduplication, position of case affixes on nouns), and phonological properties (e.g., consonant-vowel ratio, uvular consonants).

It is also useful to consider the genealogical classification of languages, emphasizing that languages which descended from a common ancestor tend to be linguistically similar. For example, Semitic languages (e.g., Hebrew, Arabic, Amharic) share a distinct morphological system that combines a triliteral root with a pattern of vowels and consonants to construct a word. Romance languages (e.g., Spanish, Italian, French) share morphological inflections that mark person (first, second, third), number (singular, plural), tense (e.g., imperfect, future), and mood (e.g., indicative, imperative).

[7] http://corpora2.informatik.uni-leipzig.de/download.html
[8] http://opus.lingfil.uu.se/
[9] http://universaldependencies.org
[10] Note the connection to economic significance.
[11] A reference grammar gives a technical description of the major linguistic features of a language with a few examples; e.g., Martin (2004) and Ryding (2005).
[12] WALS typological properties can be downloaded at http://wals.info/. Syntactic Structures of the World's Languages (SSWL) is another catalog of typological properties, but it only has information about 385 languages as of May 2016. SSWL typological properties can be downloaded at http://sswl.railsplayground.net/.

[Table 2.1 lists one row for each of 41 languages: Ancient Greek, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Latin, Latvian, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Portuguese (Brazilian), Romanian, Russian, Slovenian, Spanish, Swedish, Tamil and Turkish; the per-language token counts are omitted here.]

Table 2.1: Current number of tokens in Leipzig monolingual corpora (in millions), word pairs in printed bilingual dictionaries (in thousands), tokens in the target side of en-xx OPUS parallel corpora (in millions), and the universal dependency treebanks (in thousands) for 41 languages. More recent statistics can be found at http://opus.lingfil.uu.se/, http://www.bilingualdictionaries.com/, http://corpora2.informatik.uni-leipzig.de/download.html and http://universaldependencies.org/, respectively.

2.2 Word Embeddings

Word embeddings (also known as vector space representations or distributed representations) provide an effective method for semi-supervised learning in monolingual NLP (Turian et al., 2010). This section gives a brief overview of monolingual word embeddings, before we discuss multilingual word embeddings in the following chapters.

2.2.1 Distributed word representations

There are several choices for representing lexical items in a computational model. The most basic representation, a sequence of characters (e.g., t-h-e-s-i-s), helps distinguish between different words. An isomorphic representation commonly referred to as the "one-hot representation" assigns an arbitrary unique integer to each unique character sequence in a vocabulary of size V. The one-hot representation is often preferred to character sequences because modern computer architectures can manipulate integers more efficiently. This lexical representation is problematic for two reasons:

1. The number of unique n-gram lexical features is O(V^n). Since the vocabulary size V is often large,[13] the increased number of parameters makes the model more prone to overfitting. This can be problematic even for n = 1 (unigram features), especially when the training data is small.

2. We miss an opportunity to share statistical strength between similar words. For example, if "dissertation" is an out-of-vocabulary word (i.e., does not appear in the training set), the model cannot relate it to a similar word seen in training such as "thesis."

[13] The vocabulary size depends on the language, genre and dataset size. An English monolingual corpus of 1 billion word tokens has a vocabulary size of 4,785,862 unique word types.

Word embeddings are an alternative representation which maps a word to a vector of real numbers, i.e., "embeds" the word in a vector space. This representation addresses the first problem, provided that the dimensionality of the word embeddings is significantly lower than the vocabulary size, which is typical (the dimensionalities often used are in the range of 50 to 500). In order to address the second problem, two approaches are commonly used to estimate the embedding of a word:

- Estimate word embeddings as extra parameters in the model trained for the downstream task. Note that this approach reintroduces a large number of parameters in the model, and also misses the opportunity to learn from unlabeled examples.
- Use the distributional hypothesis of Harris (1954) to estimate word embeddings using a corpus of raw text, before training a model for the downstream task.

In §2.2.2, we describe a popular model for estimating word embeddings using an unlabeled corpus of raw text. We use this method as the basis for multilingual embeddings in other chapters of the thesis.
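To make the contrast concrete, here is a minimal NumPy sketch (ours, with a toy vocabulary and dimensionality; nothing in it is prescribed by the thesis) of a V-dimensional one-hot vector versus a d-dimensional embedding lookup:

```python
import numpy as np

vocab = {"the": 0, "thesis": 1, "dissertation": 2}  # toy vocabulary, V = 3
V, d = len(vocab), 4                                # realistically d << V (e.g., 50 to 500)

def one_hot(word: str) -> np.ndarray:
    """One-hot representation: a V-dimensional vector with a single 1."""
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, d))  # embedding matrix: one d-dimensional row per word type

def embed(word: str) -> np.ndarray:
    """Distributed representation: a row lookup, equivalent to one_hot(word) @ E."""
    return E[vocab[word]]

# Similar words can receive nearby embedding vectors, sharing statistical strength;
# one-hot vectors are all equidistant and share nothing.
print(one_hot("thesis"), embed("thesis"))
```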

2.2.2 Skipgram model for estimating monolingual word embeddings

The distributional hypothesis states that the semantics of a word can be determined by the words that occur in its context (Harris, 1954). The skipgram model (Mikolov et al., 2013a) is one of several methods which implement this hypothesis. The skipgram model generates a word u that occurs in the context (of window size K) of another word v as follows:

p(u \mid v) = \frac{\exp\left( E_{skipgram}(v)^\top E_{context}(u) \right)}{\sum_{u' \in vocabulary} \exp\left( E_{skipgram}(v)^\top E_{context}(u') \right)}

where E_{skipgram}(u) ∈ R^d is the vector word embedding of a word u with dimensionality d. E_{context}(u) also embeds the word u, but the original implementation of the skipgram model in the word2vec package[14] only uses it as extra model parameters. Note that this is a deficient model since the same word token appears (and hence is generated) in more than one context (e.g., the context of the word immediately before, and the context of the word immediately after). The model is trained to maximize the log-likelihood:

\sum_{i \in indexes} \; \sum_{k \in \{-K, \dots, -1, 1, \dots, K\}} \log p(u_{i+k} \mid u_i)

where i indexes all word tokens in a monolingual corpus. To avoid the expensive summation in the partition function, the distribution p(u | v) is approximated using noise contrastive estimation (Gutmann and Hyvärinen, 2012). We use the skipgram model in chapters 3, 4 and 5.

[14] https://code.google.com/archive/p/word2vec/
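The following NumPy sketch restates the two formulas above, with toy sizes of our choosing and randomly initialized stand-ins for the trained parameters E_skipgram and E_context; the full-softmax loop makes explicit the partition-function cost that noise contrastive estimation avoids:

```python
import numpy as np

# Toy vocabulary size and embedding dimensionality (illustrative, not from the thesis).
V, d = 1000, 50
rng = np.random.default_rng(0)
E_skipgram = rng.normal(scale=0.1, size=(V, d))  # "input" embeddings E_skipgram(v)
E_context = rng.normal(scale=0.1, size=(V, d))   # "output" embeddings E_context(u)

def p_u_given_v(v: int) -> np.ndarray:
    """Full softmax p(u | v) over the vocabulary (the expensive partition function)."""
    scores = E_context @ E_skipgram[v]  # one dot product per candidate context word
    scores -= scores.max()              # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def log_likelihood(corpus: list[int], K: int = 2) -> float:
    """Sum of log p(u_{i+k} | u_i) over all token positions i and offsets k in {-K..-1, 1..K}."""
    ll = 0.0
    for i, v in enumerate(corpus):
        for k in range(-K, K + 1):
            if k == 0 or not (0 <= i + k < len(corpus)):
                continue
            ll += np.log(p_u_given_v(v)[corpus[i + k]])
    return ll

print(log_likelihood(list(rng.integers(0, V, size=10))))
```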

Chapter 3

Language-universal Dependency Parsing

3.1 Overview

In high-resource scenarios, the mainstream approach for multilingual NLP is to develop language-specific models. For each language of interest, the resources necessary for training the model are obtained (or created), and model parameters are optimized for each language separately. This approach is simple, effective and grants the flexibility of customizing the model or features to the needs of each language independently, but it is suboptimal for theoretical as well as practical reasons. Theoretically, the study of linguistic typology reveals that many languages share morphological, phonological, and syntactic phenomena (Bender, 2011). On the practical side, it is inconvenient to deploy or distribute NLP tools that are customized for many different languages because, for each language of interest, we need to configure, train, tune, monitor and occasionally update the model. Furthermore, code-switching or code-mixing (mixing more than one language in the same discourse), which is pervasive in some genres, in particular social media, presents a challenge for monolingually-trained NLP models (Barman et al., 2014).

Can we train one language-universal model instead of training a separate model for each language of interest, without sacrificing accuracy? We address this question in the context of dependency parsing, a core problem in NLP (see Fig. 3.1). We discuss modeling tools for unifying languages to enable cross-lingual supervision, as well as tools for differentiating between the characteristics of different languages. Equipped with these modeling tools, we show that language-universal dependency parsers can outperform monolingually-trained parsers in high-resource scenarios. The same approach can also be used in low-resource scenarios (with no labeled examples or with a small number of labeled examples in the target language), as previously explored by Cohen et al. (2011), McDonald et al. (2011) and Täckström et al. (2013). We address both experimental settings (target language with and without labeled examples) and show that our model compares favorably to previous work in both settings.

In principle, the proposed approach is applicable to many NLP problems, including morphological, syntactic, and semantic analysis. However, in order to exploit the full potential of this approach, we need homogeneous annotations in several languages for the task of interest (see §3.2.1). For this reason, we focus on dependency parsing, for which homogeneous annotations are available in many languages. Most of the material in this chapter was previously published in Ammar et al. (2016a).

3.2 Approach

The availability of homogeneous syntactic annotations in many languages (Petrov et al., 2012; McDonald et al., 2013; Nivre et al., 2015b; Agić et al., 2015; Nivre et al., 2015a) presents the opportunity to develop a parser that is capable of parsing sentences in multiple languages of interest. Such a parser can potentially replace an array of language-specific monolingually-trained parsers (for languages with a large treebank). Our goal is to train a dependency parser for a set of target languages L_t, given universal dependency annotations in a set of source languages L_s. When all languages in L_t have a large treebank, the mainstream approach has been to train one monolingual parser per target language and route sentences of a given language to the corresponding parser at test time. In contrast, our approach is to train one parsing model with the union of treebanks in L_s, then use this single trained model to parse text in any language in L_t, which we call "many languages, one parser" (MALOPA).

Figure 3.1: Dependency parsing is a core problem in NLP which is used to inform other tasks such as coreference resolution (e.g., Durrett and Klein, 2013), semantic parsing (e.g., Das et al., 2010) and question answering (e.g., Heilman and Smith, 2010).

3.2.1 Homogeneous Annotations

Although multilingual dependency treebanks have been available for a decade via the 2006 and 2007 CoNLL shared tasks (Buchholz and Marsi, 2006; Nivre et al., 2007), the treebank of each language was annotated independently and with its own annotation conventions. McDonald et al. (2013) designed annotation guidelines which use similar dependency labels and conventions for several languages based on the Stanford dependencies. Two versions of this treebank were released: v1.0 (6 languages)[1] and v2.0 (11 languages).[2] The dependency parsing community further developed these treebanks into the Universal Dependencies,[3] with a 6-month release schedule. So far, Universal Dependencies v1.0 (10 languages),[4] v1.1 (18 languages)[5] and v1.2 (34 languages)[6] have been released.

In MALOPA, we require that all source languages have a universal dependency treebank. We transform non-projective trees in the training treebanks to pseudo-projective trees using the "baseline" scheme in Nivre and Nilsson (2005). In addition, we use the following data resources for each language in L = L_t ∪ L_s:

- universal POS annotations for training a POS tagger (required),[7]
- a bilingual dictionary with another language in L for adding cross-lingual lexical information (optional),[8]
- language typology information (optional),[9]
- language-specific POS annotations (optional),[10] and
- a monolingual corpus (optional).[11]

[1] https://github.com/ryanmcd/uni-dep-tb/blob/master/universal_treebanks_v1.0.tar.gz
[2] https://github.com/ryanmcd/uni-dep-tb/blob/master/universal_treebanks_v2.0.tar.gz
[3] http://universaldependencies.org/
[4] http://hdl.handle.net/11234/1-1464
[5] http://hdl.handle.net/11234/LRT-1478
[6] http://hdl.handle.net/11234/1-1548
[7] See §3.2.5 for details.
[8] Our best results make use of this resource. We require that all languages in L are (transitively) connected. The bilingual dictionaries we used are based on unsupervised word alignments of parallel corpora, as described in Guo et al. (2016). See §3.2.3 for details.
[9] See §3.2.4 for details.
[10] Our best results make use of this resource. See §3.2.4 for details.
[11] This is only used for training word embeddings with `multiCCA', `multiCluster' and `translation-invariance' in Table 3.6. We do not use this resource when we compare to previous work.

| Stack_t | Buffer_t | Action | Dependency | Stack_{t+1} | Buffer_{t+1} |
| u, v, S | B | REDUCE-RIGHT(r) | u →_r v | u, S | B |
| u, v, S | B | REDUCE-LEFT(r) | u ←_r v | v, S | B |
| S | u, B | SHIFT | — | u, S | B |

Table 3.1: Parser transitions indicating the action applied to the stack and buffer at time t and the resulting stack and buffer at time t+1.
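A small sketch of the discrete side of Table 3.1 (our function names; no learned scoring). Reading the table with the stack top written first, REDUCE-RIGHT keeps the top word as head and REDUCE-LEFT keeps the second word as head, which reproduces the `right-arc(nsubj)' step from the "You love your cat" example of Fig. 3.3:

```python
# Arcs are (head, dependent, relation) triples. Stacks and buffers are Python
# lists with the top/front at index 0, matching the (u, v, S) notation.

def shift(stack, buffer, arcs):
    """SHIFT: move the front word of the buffer onto the stack."""
    return [buffer[0]] + stack, buffer[1:], arcs

def reduce_right(stack, buffer, arcs, r):
    """REDUCE-RIGHT(r): pop u (top) and v; add u -r-> v; push head u back."""
    u, v = stack[0], stack[1]
    return [u] + stack[2:], buffer, arcs + [(u, v, r)]

def reduce_left(stack, buffer, arcs, r):
    """REDUCE-LEFT(r): pop u (top) and v; add v -r-> u; push head v back."""
    u, v = stack[0], stack[1]
    return [v] + stack[2:], buffer, arcs + [(v, u, r)]

# First three steps of parsing "You love your cat" (cf. Fig. 3.3): shift "You"
# and "love", then right-arc(nsubj) makes "love" the head of "You".
state = ([], ["You", "love", "your", "cat"], [])
state = shift(*state)
state = shift(*state)
state = reduce_right(*state, r="nsubj")
print(state)  # (['love'], ['your', 'cat'], [('love', 'You', 'nsubj')])
```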

3.2.2 Core Model

Recent advances (e.g., Graves et al. 2013, Sutskever et al. 2014) suggest that recurrent neural networks (RNNs) are capable of learning useful representations for modeling problems of a sequential nature. Following Dyer et al. (2015), we use an RNN for transition-based dependency parsing. We describe the core model in this section, and modify it to enable language-universal parsing in the following sections. The core model can be understood as the sequential manipulation of three data structures: a buffer (from which we read the token sequence), a stack (which contains partially-built parse trees), and a list of actions previously taken by the parser. The parser uses the arc-standard transition system (Nivre, 2004). At each timestep t, a transition action is applied that alters these data structures according to Table 3.1.

Along with the discrete transitions of the arc-standard system, the parser computes vector representations for the buffer, stack and list of actions at timestep t, denoted b_t, s_t, and a_t, respectively (see Fig. 3.2). A stack-LSTM module is used to compute the vector representation for each data structure. Fig. 3.3 illustrates the content of each module for the first three steps in a toy example. The parser state[12] at time t is given by:

p_t = \max\{0, W[s_t; b_t; a_t] + W_{bias}\}    (3.1)

where the matrix W and the vector W_{bias} are learned parameters. The parser state p_t is then used to define a categorical distribution over possible next actions z:

p(z \mid p_t) = \frac{\exp\left( g_z^\top p_t + q_z \right)}{\sum_{z'} \exp\left( g_{z'}^\top p_t + q_{z'} \right)}    (3.2)

where g_z and q_z are parameters associated with action z. The total number of actions is twice the number of unique dependency labels in the treebank used for training plus one, but we only consider actions which meet the arc-standard preconditions in Table 3.1. The selected action is then used to update the buffer, stack and list of actions, and to compute b_{t+1}, s_{t+1} and a_{t+1} accordingly. The model is trained to maximize the log-likelihood of correct actions. At test time, the parser greedily chooses the most probable action at every timestep until a complete parse tree is produced.

[12] Not to be confused with the state of the transition system (i.e., the content of the stack and the buffer).

Figure 3.2: The parser computes vector representations for the buffer, stack and list of actions at timestep t denoted b_t, s_t, and a_t, respectively. The three vectors feed into a hidden layer denoted as `parser state', followed by a softmax layer that represents possible outputs at timestep t.
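A NumPy sketch of Eqs. 3.1 and 3.2, with toy dimensions and randomly initialized stand-ins for the learned parameters W, W_bias, g_z and q_z; the boolean mask restricting the softmax to actions that meet the arc-standard preconditions is our own device:

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_b, d_a, d_p, n_actions = 100, 100, 100, 100, 3  # toy sizes, not from the thesis

# Learned parameters of Eqs. 3.1 and 3.2 (random stand-ins here).
W = rng.normal(scale=0.01, size=(d_p, d_s + d_b + d_a))
W_bias = np.zeros(d_p)
g = rng.normal(scale=0.01, size=(n_actions, d_p))  # g_z, one row per action z
q = np.zeros(n_actions)                            # q_z

def parser_state(s_t, b_t, a_t):
    """Eq. 3.1: rectified affine function of the three stack-LSTM summaries."""
    return np.maximum(0.0, W @ np.concatenate([s_t, b_t, a_t]) + W_bias)

def action_distribution(p_t, legal):
    """Eq. 3.2: softmax over actions, restricted to those meeting the
    arc-standard preconditions of Table 3.1 (boolean mask `legal`)."""
    scores = g @ p_t + q
    scores[~legal] = -np.inf        # illegal actions get zero probability
    scores -= scores[legal].max()   # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

s_t, b_t, a_t = (rng.normal(size=d) for d in (d_s, d_b, d_a))
p_t = parser_state(s_t, b_t, a_t)
probs = action_distribution(p_t, legal=np.array([True, True, False]))
print(probs)  # greedy decoding at test time would take probs.argmax()
```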

Token representations. The vector representations of input tokens feed into the stack-LSTM modules of the buffer and the stack. For monolingual parsing, we represent each token by concatenating the following vectors:

- a fixed, pretrained embedding of the word type,
- a learned embedding of the word,
- a learned embedding of the Brown cluster,
- a learned embedding of the fine-grained POS tag, and
- a learned embedding of the coarse POS tag.

The next section describes the mechanisms we use to enable cross-lingual supervision.

Figure 3.3: An illustration of the content of the buffer S-LSTM module, the actions S-LSTM module, and the stack S-LSTM module for the first three steps in parsing the sentence "You love your cat $". Upper left: the buffer is initialized with all tokens in reverse order. Upper right: simulating a `shift' action, the head vector of the buffer S-LSTM backtracks, and two new hidden layers are computed for the actions S-LSTM and the stack S-LSTM. Lower left: simulating another `shift' action, the head vector of the buffer S-LSTM backtracks, and two additional hidden layers are computed for the actions S-LSTM and the stack S-LSTM. Lower right: simulating a `right-arc(nsubj)', the head vector of the buffer S-LSTM backtracks, and an additional hidden layer is computed for the actions S-LSTM. The stack S-LSTM first backtracks two steps, then a new hidden layer is computed as a function of the head, the dependent, and the relationship representations.

3.2.3 Language Unification

The key to unifying different languages in the model is to map language-specific representations of the input to language-universal representations. We apply this on two levels: part-of-speech tags and lexical items.

Coarse syntactic embeddings. We learn vector representations of multilingually-defined coarse POS tags (Petrov et al., 2012), instead of using language-specific tagsets. We train a simple delexicalized model where the token representation only consists of learned embeddings of coarse POS tags, which are shared across all languages to enable model transfer.

Lexical embeddings. Previous work has shown that sacrificing lexical features amounts to a substantial decrease in the performance of a dependency parser (Cohen et al., 2011; Täckström et al., 2012a; Tiedemann, 2015; Guo et al., 2015). Therefore, we extend the token representation in MALOPA by concatenating pretrained multilingual embeddings of word types. We also concatenate learned embeddings of multilingual word clusters. Before training the parser, we estimate Brown clusters of English words and project them via word alignments to words in other languages. This is similar to the `projected clusters' method in Täckström et al. (2012a). To go from Brown clusters to embeddings, we ignore the hierarchy within Brown clusters and assign a unique parameter vector to each leaf.

3.2.4 Language Differentiation

Here, we describe how we tweak the behavior of MALOPA depending on the current input language.

Language embeddings. While many languages, especially ones that belong to the same family, exhibit some similar syntactic phenomena (e.g., all languages have subjects, verbs, and objects), substantial syntactic differences abound. Some of these differences are easy to characterize (e.g., subject-verb-object vs. verb-subject-object, prepositions vs. postpositions, adjective-noun vs. noun-adjective), while others are subtle (e.g., the number and positions of negation morphemes). It is not at all clear how to translate descriptive facts about a language's syntax into features for a parser. Consequently, training a language-universal parser on treebanks in multiple source languages requires caution. While exposing the parser to a diverse set of syntactic patterns across many languages has the potential to improve its performance in each, dependency annotations in one language will, in some ways, contradict those in typologically different languages. For instance, consider a context where the next word on the buffer is a noun, and the top word on the stack is an adjective, followed by a noun. Treebanks of languages where postpositive adjectives are typical (e.g., French) will often teach the parser to predict REDUCE-LEFT, while those of languages where prepositive adjectives are more typical (e.g., English) will teach the parser to predict SHIFT.

Inspired by Naseem et al. (2012), we address this problem by informing the parser about the input language it is currently parsing. Let l be the input vector representation of a particular language. We consider three definitions for l:

- a one-hot encoding of the language ID,
- a one-hot encoding of word-order properties,[13] and
- an encoding of all typological features in WALS.[14]

We use a hidden layer with tanh nonlinearity to compute the language embedding l' as:

l' = \tanh(L l + L_{bias})

where L and L_{bias} are additional model parameters. We modify the parsing architecture as follows:

- include l' in the token representation,
- include l' in the action vector representation, and
- let p_t = \max\{0, W[s_t; b_t; a_t; l'] + W_{bias}\}.

Intuitively, the first two modifications allow the input language to influence the vector representation of the stack, the buffer and the list of actions. The third modification allows the input language to influence the parser state which in turn is used to predict the next action. In preliminary experiments, we found that adding the language embeddings at the token and action level is important. We also experimented with computing more complex functions of (s_t, b_t, a_t, l') to define the parser state, but they did not help.

[13] The World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) is an online portal documenting typological properties of 2,679 languages (as of July 2015). We use the same set of WALS features used by Zhang and Barzilay (2015), namely 82A (order of subject and verb), 83A (order of object and verb), 85A (order of adposition and noun phrase), 86A (order of genitive and noun), and 87A (order of adjective and noun).
[14] Since WALS features are not annotated for all languages, we use the average value of all languages in the same genus.
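A minimal sketch of the language embedding for the one-hot language-ID variant (toy sizes; L and L_bias are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n_langs, d_l = 7, 12  # toy sizes; l below is a one-hot language ID
L = rng.normal(scale=0.01, size=(d_l, n_langs))  # learned projection (stand-in)
L_bias = np.zeros(d_l)

def language_embedding(lang_index: int) -> np.ndarray:
    """l' = tanh(L l + L_bias) for a one-hot language-ID input l.
    For the `word order' and `full typology' variants, l would instead hold
    WALS-derived features of the input language."""
    l = np.zeros(n_langs)
    l[lang_index] = 1.0
    return np.tanh(L @ l + L_bias)

l_prime = language_embedding(3)
# l' is then concatenated into the token representation, the action
# representation, and the parser-state input [s_t; b_t; a_t; l'].
print(l_prime.shape)  # (12,)
```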

Fine-grained POS tag embeddings. Tiedemann (2015) shows that omitting fine-grained POS tags significantly hurts the performance of a dependency parser. However, those fine-grained POS tag sets are defined monolingually and are only available for a subset of the languages with universal dependency treebanks. We extend the token representation to include a fine-grained POS embedding (in addition to the coarse POS embedding). During training, we stochastically drop out the fine-grained POS embedding with 50% probability (Srivastava et al., 2014) so that the parser can make use of fine-grained POS tags when available but stay reliable when the fine-grained POS tags are missing. Fig. 3.4 illustrates the parsing model with various components which enable cross-lingual supervision and language differentiation.

Block dropout. We introduce another modification which makes the parser more robust to noisy inputs and language-specific inputs which may or may not be provided at test time. The idea is to stochastically zero out the entire vector representation of a noisy input. While training the parser, we replace the vector representation i with another vector (of the same dimensionality) stochastically computed as i' = (1 − b)/(1 − ρ) × i, where b is a Bernoulli-distributed random variable with parameter ρ which matches the expected error rate on a development set.[15] For example, we use block dropout to teach the parser to ignore the predicted POS tag embeddings all the time at first by initializing ρ = 1.0 (i.e., always drop out, setting i' = 0), and dynamically update ρ to match the error rate of the POS tagger on the development set. At test time, we always use the original vector, i.e., i' = i. Intuitively, this method extends the dropout method (Srivastava et al., 2014) to address structured noise in the input layer.

[15] Dividing by (1 − ρ) at training time alleviates the need to change the computation at test time.
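A sketch of block dropout, under one assumption flagged in the comments: the scaling factor lost in the garbled formula is read as 1/(1 − ρ), the standard inverted-dropout rescaling that keeps the expectation of i' equal to i:

```python
import numpy as np

rng = np.random.default_rng(0)

def block_dropout(i: np.ndarray, rho: float, training: bool) -> np.ndarray:
    """Zero out the whole vector with probability rho at training time.
    The 1/(1 - rho) rescaling (our reading of the formula, following standard
    inverted dropout) keeps E[i'] = i, so no change is needed at test time,
    where i' = i."""
    if not training or rho == 0.0:
        return i
    if rho >= 1.0:
        return np.zeros_like(i)  # always drop, e.g., the initial rho = 1.0 phase
    b = rng.binomial(1, rho)     # Bernoulli(rho): 1 means "drop this block"
    return (1.0 - b) / (1.0 - rho) * i

pos_embedding = np.ones(12)
print(block_dropout(pos_embedding, rho=0.3, training=True))
```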

Figure 3.4: The vector representing language typology complements the token representation, the action representation and the input to the parser state.

3.2.5 Multi-task Learning for POS Tagging and Dependency Parsing

The model discussed thus far conditions on the POS tags of words in the input sentence. However, POS tags may not be available in real applications (e.g., parsing the web). Let x_1, ..., x_n, y_1, ..., y_n, z_1, ..., z_{2n} be the sequence of words, POS tags and parsing actions, respectively, for a sentence of length n. We define the joint distribution of a POS tag sequence and parsing actions given a sequence of words as follows:

p(y_1, \dots, y_n, z_1, \dots, z_{2n} \mid x_1, \dots, x_n) = \prod_{i=1}^{n} p(y_i \mid x_1, \dots, x_n) \; \prod_{j=1}^{2n} p(z_j \mid x_1, \dots, x_n, y_1, \dots, y_n, z_1, \dots, z_{j-1})

where p(z_j | ...) is defined in Eq. 3.2, and p(y_i | x_1, ..., x_n) uses a bidirectional LSTM (Graves et al., 2013), similar to Huang et al. (2015). The token representation that feeds into the bidirectional LSTM shares the same parameters of the token representation described earlier for the parser, but omits both POS embeddings. The output softmax layer defines a categorical distribution over possible POS tags at each position. This multi-task learning setup enables us to predict both POS tags and dependency trees with the same model.
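The factorization above reduces to a sum of log-probabilities, one per tagging decision and one per parsing action; a minimal sketch with toy log-probabilities standing in for model outputs:

```python
import numpy as np

def joint_log_prob(tag_log_probs: np.ndarray, action_log_probs: np.ndarray) -> float:
    """Log of the factorized joint in Section 3.2.5:
    log p(y, z | x) = sum_i log p(y_i | x) + sum_j log p(z_j | x, y, z_<j).
    `tag_log_probs` holds the n tagger terms (bidirectional LSTM softmax);
    `action_log_probs` holds the 2n parser terms (Eq. 3.2)."""
    return float(tag_log_probs.sum() + action_log_probs.sum())

# A 2-word sentence: 2 tagging decisions and 4 parsing actions (toy values).
print(joint_log_prob(np.log([0.9, 0.8]), np.log([0.7, 0.9, 0.6, 0.95])))
```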

3.3 Experiments

We evaluate our MALOPA parser in three data scenarios: when the target language has a large treebank (Table 3.3), a small treebank (Table 3.7) or no treebank (Table 3.8).

| | German (de) | English (en) | Spanish (es) | French (fr) | Italian (it) | Portuguese (pt) | Swedish (sv) |
| UDT 2.0 train | 14118 (264906) | 39832 (950028) | 14138 (375180) | 14511 (351233) | 6389 (149145) | 9600 (239012) | 4447 (66631) |
| UDT 2.0 dev | 801 (12215) | 1703 (40117) | 1579 (40950) | 1620 (38328) | 399 (9541) | 1211 (29873) | 493 (9312) |
| UDT 2.0 test | 1001 (16339) | 2416 (56684) | 300 (8295) | 300 (6950) | 400 (9187) | 1205 (29438) | 1219 (20376) |
| UD 1.2 train | 14118 (269626) | 12543 (204586) | 14187 (382436) | 14552 (355811) | 11699 (249307) | 8800 (201845) | 4303 (66645) |
| UD 1.2 dev | 799 (12512) | 2002 (25148) | 1552 (41975) | 1596 (39869) | 489 (11656) | 271 (4833) | 504 (9797) |
| UD 1.2 test | 977 (16537) | 2077 (25096) | 274 (8128) | 298 (7210) | 489 (11719) | 288 (5867) | 1219 (20377) |
| tags | - | 50 | - | - | 36 | 866 | 134 |

Table 3.2: Number of sentences (tokens) in each treebank split in Universal Dependency Treebanks version 2.0 (UDT) and Universal Dependencies version 1.2 (UD) for the languages we experiment with. The last row gives the number of unique language-specific fine-grained POS tags used in a treebank.

Data. For experiments where the target language has a large treebank, we use the standard data splits for German (de), English (en), Spanish (es), French (fr), Italian (it), Portuguese (pt) and Swedish (sv) in the latest release (version 1.2) of Universal Dependencies (Nivre et al., 2015a), and experiment with both gold and predicted POS tags. For experiments where the target language has no treebank, we use the standard splits for these languages in the older universal dependency treebanks v2.0 (McDonald et al., 2013) and use gold POS tags, following the baselines (Zhang and Barzilay, 2015; Guo et al., 2016). Table 3.2 gives the number of sentences and words annotated for each language in both versions. We use the same multilingual Brown clusters and multilingual embeddings of Guo et al. (2016), kindly provided by the authors.

Optimization. We use stochastic gradient updates with an initial learning rate of η_0 = 0.1 in epoch #0 and update the learning rate in following epochs as η_t = η_0/(1 + 0.1t). We clip the l2 norm of the gradient to avoid exploding gradients. Unlabeled attachment score (UAS) on the development set determines early stopping. Parameters are initialized with uniform samples in ±√(6/(r + c)), where r and c are the sizes of the previous and following layers in the neural network (Glorot and Bengio, 2010). The standard deviations of the labeled attachment score (LAS) due to random initialization in individual target languages are 0.36 (de), 0.40 (en), 0.37 (es), 0.46 (fr), 0.47 (it), 0.41 (pt) and 0.24 (sv). The standard deviation of the average LAS scores across languages is 0.17.
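A small sketch of these optimization details (the gradient-clipping threshold is not stated in the text, so max_norm below is a stand-in):

```python
import numpy as np

def learning_rate(t: int, eta0: float = 0.1) -> float:
    """eta_t = eta_0 / (1 + 0.1 t): the learning-rate schedule used here."""
    return eta0 / (1.0 + 0.1 * t)

def glorot_uniform(r: int, c: int, rng=np.random.default_rng(0)) -> np.ndarray:
    """Uniform samples in +/- sqrt(6 / (r + c)) (Glorot and Bengio, 2010),
    where r and c are the sizes of the previous and following layers."""
    bound = np.sqrt(6.0 / (r + c))
    return rng.uniform(-bound, bound, size=(r, c))

def clip_gradient(grad: np.ndarray, max_norm: float) -> np.ndarray:
    """Rescale the gradient if its l2 norm exceeds max_norm (threshold is a
    stand-in; the thesis does not state the value used)."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

print(learning_rate(0), learning_rate(5))  # 0.1, then 0.1 / 1.5
```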

When training the parser on multiple languages in MALOPA, instead of updating the parameters with the gradient of individual sentences, we use mini-batch updates which include one sentence sampled uniformly (without replacement) from each language's treebank, until all sentences in the smallest treebank are used (which concludes an epoch). We repeat the same process in following epochs. We found this to help prevent one source language with a larger treebank (e.g., German) from dominating parameter updates, at the expense of other source languages with a smaller treebank (e.g., Swedish).
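A minimal sketch of this language-balanced sampling scheme (our function names and toy data):

```python
import random

def balanced_epoch(treebanks: dict[str, list], seed: int = 0):
    """Yield mini-batches containing one sentence per language, sampled
    uniformly without replacement; the epoch ends when the smallest
    treebank is exhausted, as described above."""
    rng = random.Random(seed)
    pools = {lang: sents[:] for lang, sents in treebanks.items()}
    for sents in pools.values():
        rng.shuffle(sents)  # shuffled order = uniform sampling without replacement
    for _ in range(min(len(sents) for sents in pools.values())):
        yield [pools[lang].pop() for lang in pools]

toy = {"de": ["de1", "de2", "de3"], "sv": ["sv1", "sv2"]}
for batch in balanced_epoch(toy):
    print(batch)  # two mini-batches, each with one German and one Swedish sentence
```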

73 ons),ne-grained(language-specic)
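This balanced sampling scheme can be made concrete as follows; a minimal Python sketch, assuming treebanks are held in memory as lists of sentences (the names are ours, not the actual training code):

    import random

    def balanced_epoch(treebanks, seed=0):
        # treebanks: dict mapping a language code to a list of training sentences.
        # Yields mini-batches with one uniformly sampled sentence per language
        # (without replacement); the epoch ends when the smallest treebank is used up.
        rng = random.Random(seed)
        shuffled = {lang: rng.sample(sents, len(sents))
                    for lang, sents in treebanks.items()}
        for i in range(min(len(s) for s in shuffled.values())):
            yield [shuffled[lang][i] for lang in shuffled]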
3.3.1 Target Languages with a Treebank

Here, we evaluate our MALOPA parser when the target language has a treebank.

Baseline. For each target language, the strong baseline we use is a monolingually-trained S-LSTM parser with a token representation which concatenates: pretrained word embeddings (50 dimensions),[16] learned word embeddings (50 dimensions), coarse (universal) POS tag embeddings (12 dimensions), fine-grained (language-specific) POS tag embeddings (12 dimensions), and embeddings of Brown clusters (12 dimensions), and uses a two-layer S-LSTM for each of the stack, the buffer and the list of actions. We independently train one baseline parser for each target language, and share no model parameters. This baseline, denoted `monolingual', achieves a UAS score of 93.0 and a LAS score of 91.5 when trained on the English Penn Treebank, which is comparable to Dyer et al. (2015).

MALOPA. We train MALOPA on the concatenation of the training sections of all seven languages. To balance the development set, we only concatenate the first 300 sentences of each language's development section.

The first MALOPA parser we evaluate only uses coarse POS embeddings as the token representation.[17] As shown in Table 3.3, this parser consistently performs much worse than the monolingual baselines, with a gap of 12.5 LAS points on average.
Adding lexical embeddings to the token representation as described in §3.2.3 substantially improves the performance of MALOPA, recovering 83% of the gap in average performance.

We experimented with three ways to include language information in the token representation, namely `language ID', `word order' and `full typology' (see §3.2.4 for details), and found all three to improve the performance of MALOPA, giving LAS scores of 83.5, 83.2 and 82.5, respectively. It is interesting to see that the model is capable of learning more useful language embeddings when typological properties are not specified. Using `language ID', we have now recovered another 12% of the original gap.

Finally, the best configuration of MALOPA adds fine-grained POS embeddings to the token representation.[18]
Surprisingly, adding fine-grained POS embeddings improves the performance even for some languages where fine-grained POS tags are not available (e.g., Spanish), suggesting that the model is capable of predicting fine-grained POS tags for those languages via cross-lingual supervision. This parser outperforms the monolingual baseline in five out of seven target languages, and wins on average by 0.3 LAS points. We emphasize that this model is only trained once on all languages, and the same model is used to parse the test set of each language, which simplifies the distribution or deployment of multilingual parsing software.

We note that the fine-grained POS tags we used for Swedish and Portuguese encode morphological properties. Adding fine-grained POS tag embeddings for these two languages improved the results by 1.9 and 2.0 LAS points, respectively. This observation suggests that using morphological features as part of the token representation is likely to result in further improvements.
This is especially relevant since recent versions of the universal dependency treebanks include universal annotations of morphological properties.

We note that, for some (model, test language) combinations, the improvements are small compared to the variance due to random initialization of model parameters. For example, the cell (+full typology, it) in Table 3.3 shows an improvement of 0.4 LAS points, while the standard deviation due to random initialization in Italian is 0.47 LAS. However, the average results across multiple languages show steady, robust improvements, each of which far exceeds the standard deviation of 0.17 LAS.

[16] These embeddings are treated as fixed inputs to the parser, and are not optimized towards the parsing objective. We use the same embeddings used in Guo et al. (2016).
[17] We use the same number of dimensions for the coarse POS embeddings as in the monolingual baselines. The same applies to all other types of embeddings used in MALOPA.
[18] Fine-grained POS tags were only available for English, Italian, Portuguese and Swedish. Other languages reuse the coarse POS tags as fine-grained tags instead of padding the extra dimensions in the token representation with zeros.
UAS                  de     en     es     fr     it     pt     sv     average
monolingual          84.5   88.7   87.5   85.6   91.1   89.1   87.2   87.6
MALOPA               78.8   75.6   80.6   79.7   84.7   81.6   77.6   79.8
+ lexical            83.0   85.6   87.3   85.3   90.6   86.5   86.4   86.3
+ full typology      83.3   86.1   87.1   85.8   90.8   87.8   86.7   86.8
+ word order         84.0   87.2   87.3   86.0   91.0   87.9   87.2   87.2
+ language ID        84.2   87.2   87.2   86.1   91.5   87.5   87.2   87.2
+ fine-grained POS   84.7   88.6   88.1   86.4   91.1   89.4   88.2   88.0

LAS                  de     en     es     fr     it     pt     sv     average
monolingual          79.3   85.9   83.7   81.7   88.7   85.7   83.5   84.0
MALOPA               70.4   69.3   72.4   71.1   78.0   74.1   65.4   71.5
+ lexical            76.7   82.0   82.7   81.2   87.6   82.1   81.2   81.9
+ full typology      77.8   82.5   82.6   81.5   88.0   83.7   81.8   82.5
+ word order         78.5   84.3   83.4   81.7   88.3   84.1   82.6   83.2
+ language ID        78.6   84.2   83.4   82.4   89.1   84.2   82.6   83.5
+ fine-grained POS   78.9   85.4   84.3   82.4   89.0   86.2   84.5   84.3

Table 3.3: Dependency parsing: unlabeled and labeled attachment scores (UAS, LAS) for monolingually-trained parsers and MALOPA. Each target language has a large treebank (see Table 3.2). In this table, we use the Universal Dependencies version 1.2, which only includes annotations for 13K English sentences; this explains the relatively low scores in English. When we instead use the universal dependency treebanks version 2.0, which includes annotations for 40K English sentences (originally from the English Penn Treebank), we achieve a UAS score of 93.0 and a LAS score of 91.5.
Qualitative analysis. To gain a better understanding of the model behavior, we analyze certain classes of dependency attachments/relations in German, which has notably flexible word order, in Table 3.4. We consider the recall of left attachments (where the head word precedes the dependent word in the sentence), right attachments, root attachments, short attachments (with distance = 1) and long attachments (with distance > 6), as well as the following relation groups: nsubj* (nominal subjects: nsubj, nsubjpass), dobj (direct object: dobj), conj (conjunct: conj), *comp (clausal complements: ccomp, xcomp), case (clitics and adpositions: case), *mod (modifiers of a noun: nmod, nummod, amod, appos), neg (negation modifier: neg). For each group, we report the recall of both the attachment and the relation, weighted by the number of instances in the gold annotation. A detailed description of each relation can be found at http://universaldependencies.org/u/dep/index.html
Recall %             left   right   root   short   long   nsubj*   dobj   conj   *comp   case   *mod
monolingual          89.9   95.2    86.4   92.9    81.1   77.3     75.5   66.0   45.6    93.3   77.0
MALOPA               85.4   93.3    80.2   91.2    73.3   57.3     62.7   64.2   34.0    90.7   69.6
+ lexical            89.9   93.8    84.5   92.6    78.6   73.3     73.4   66.9   35.3    91.6   75.3
+ language ID        89.1   94.7    86.6   93.2    81.4   74.7     73.0   71.2   48.2    92.8   76.3
+ fine-grained POS   89.5   95.7    87.8   93.6    82.0   74.7     74.9   69.7   46.0    93.3   76.3

Table 3.4: Recall of some classes of dependency attachments/relations in German.

We found that each of the three improvements (lexical embeddings, language embeddings and fine-grained POS embeddings) tends to improve recall for most classes.
Unfortunately, MALOPA underperforms the monolingual baseline in some classes: nominal subjects, direct objects and modifiers of a noun. Nevertheless, MALOPA outperforms the baseline in some important classes such as root, long attachments and conjunctions.

Predicting language IDs and POS tags. In Table 3.3, we assume that both the language ID of the input language and the POS tags are given at test time. However, this assumption may not be realistic in practical applications. Here, we quantify the degradation in parsing accuracy when language ID and POS tags are only given at training time, but must be predicted at test time. We do not use fine-grained POS tags in this part.

In order to predict language ID, we use the langid.py library (Lui and Baldwin, 2012)[19] and classify individual sentences in the test sets to one of the seven languages of interest, using the default models included in the library. The macro-average language ID prediction accuracy on the test set across sentences is 94.7%.

[19] https://github.com/saffsd/langid.py
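For reference, restricting langid.py to the seven languages of interest and classifying a sentence looks roughly like this (a usage sketch, not our evaluation script):

    import langid

    # Restrict predictions to the seven languages used in these experiments.
    langid.set_languages(['de', 'en', 'es', 'fr', 'it', 'pt', 'sv'])

    lang, score = langid.classify("Der Hund schläft .")
    print(lang, score)  # e.g., 'de' with a confidence score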
In order to predict POS tags, we use the model described in §3.2.5 with both input and hidden LSTM dimensions of 60, and with block dropout. The macro-average accuracy of the POS tagger is 93.3%. Table 3.5 summarizes the four configurations: {gold language ID, predicted language ID} × {gold POS tags, predicted POS tags}. The performance of the parser suffers mildly (-0.8 LAS points) when using predicted language IDs, but suffers significantly (-5.1 LAS points) when using predicted POS tags. Nevertheless, the observed degradation in parsing performance when using predicted POS tags is comparable to the degradations reported by Tiedemann (2015). The predicted POS results in Table 3.5 use block dropout. Without using block dropout, we lose an extra 0.2 LAS points in both configurations using predicted POS tags, averaging over all languages.
Different multilingual embeddings. Several methods have been proposed for pretraining multilingual word embeddings. We compare three of them: multiCCA and multiCluster (Ammar et al., 2016b) and robust projection (Guo et al., 2015). All embeddings are trained on the same data and use the same number of dimensions (100). Table 3.6 illustrates that the three methods perform comparably well on this task.

Small target treebank. Duong et al. (2015) considered a setup where the target language has a small treebank of 3K tokens, and the source language (English) has a large treebank of 205K tokens. The parser proposed in Duong et al. (2015) is a neural network parser based on Chen and Manning (2014), which shares most of the parameters between English and the target language, and uses an L2 regularizer to tie the lexical embeddings of translationally-equivalent words.
target language            de     en     es      fr     it     pt     sv     average
language ID accuracy %     96.3   78.0   100.0   97.6   98.3   94.0   98.8   94.7
POS tagging accuracy %     89.8   92.7   94.5    94.0   95.2   93.4   94.0   93.3

UAS
language ID   coarse POS   de     en     es      fr     it     pt     sv     average
gold          gold         84.2   87.2   87.2    86.1   91.5   87.5   87.2   87.2
predicted     gold         84.0   84.0   87.2    85.8   91.3   87.4   87.2   86.7
gold          predicted    78.9   84.8   85.4    84.0   89.0   84.4   81.0   83.9
predicted     predicted    78.5   79.7   85.4    83.8   88.7   83.2   80.9   82.8

LAS
language ID   coarse POS   de     en     es      fr     it     pt     sv     average
gold          gold         78.6   84.2   83.4    82.4   89.1   84.2   82.6   83.5
predicted     gold         78.5   80.2   83.4    82.1   88.9   83.9   82.5   82.7
gold          predicted    71.2   79.9   80.5    78.5   85.0   78.4   75.5   78.4
predicted     predicted    70.8   74.1   80.5    78.2   84.7   77.1   75.5   77.2

Table 3.5: Effect of automatically predicting language ID and POS tags with MALOPA on parsing accuracy.
multilingual embeddings   ave. UAS   ave. LAS
multiCluster              87.7       84.1
multiCCA                  87.8       84.4
robust projection         87.8       84.2

Table 3.6: Effect of the multilingual embedding estimation method on multilingual parsing with MALOPA. UAS and LAS scores are macro-averaged across seven target languages.

LAS            de     es     fr     it     sv
Duong et al.   61.8   70.5   67.2   71.3   62.5
MALOPA         63.4   70.5   69.1   74.1   63.4

Table 3.7: Small (3K token) target treebank setting: language-universal dependency parser performance.
While not the primary focus of this paper,[20] we compare our proposed method to that of Duong et al. (2015) on five target languages for which multilingual lexical features are available from Guo et al. (2016). For each target language, we train the parser on the English training data in the UD version 1.0 corpus (Nivre et al., 2015b) and a small treebank in the target language.[21] Following Duong et al. (2015), we do not use any development data in the target languages, and we subsample the English training data in each epoch to the same number of sentences in the target language. We use the same hyperparameters specified before. Table 3.7 shows that our proposed method outperforms Duong et al. (2015) by 1.4 LAS points on average.

[20] The setup cost involved in recruiting linguists, developing and revising annotation guidelines to annotate a new language ought to be higher than the cost of annotating 3K tokens.
[21] We thank Long Duong for providing the subsampled training corpora in each target language.
3.3.2 Target Languages without a Treebank

McDonald et al. (2011) established that, when no treebank annotations are available in the target language, training on multiple source languages outperforms training on one (i.e., multi-source model transfer outperforms single-source model transfer). In this section, we evaluate the performance of our parser in this setup. We use two strong baseline multi-source model transfer parsers with no supervision in the target language:

- Zhang and Barzilay (2015) is a graph-based arc-factored parsing model with a tensor-based scoring function. It takes typological properties of a language as input. We compare to the best reported configuration (i.e., the column titled "OURS" in Table 5 of Zhang and Barzilay, 2015).
- Guo et al. (2016) is a transition-based neural-network parsing model based on Chen and Manning (2014). It uses multilingual embeddings and Brown clusters as lexical features. We compare to the best reported configuration (i.e., the column titled "MULTI-PROJ" in Table 1 of Guo et al., 2016).
Following Guo et al. (2016), for each target language, we train the parser on six other languages in the Google Universal Dependency Treebanks version 2.0[22] (de, en, es, fr, it, pt, sv, excluding whichever is the target language). Our parser uses the same word embeddings and word clusters used in Guo et al. (2016), and does not use any typology information.[23]

[22] https://github.com/ryanmcd/uni-dep-tb/
[23] In preliminary experiments, we found language embeddings to hurt the performance of the parser for target languages without a treebank.
UAS                         de     es     fr     it     pt     sv     average
Zhang and Barzilay (2015)   62.5   78.0   78.9   79.3   78.6   75.0   75.4
Guo et al. (2016)           65.0   79.0   77.6   78.4   81.8   78.2   76.3
this work                   65.2   80.2   80.6   80.7   81.2   79.0   77.8

LAS                         de     es     fr     it     pt     sv     average
Zhang and Barzilay (2015)   54.1   68.3   68.8   69.4   72.5   62.5   65.9
Guo et al. (2016)           55.9   73.0   71.0   71.2   78.6   69.5   69.3
MALOPA                      57.1   74.6   73.9   72.5   77.0   68.1   70.5

Table 3.8: Dependency parsing: unlabeled and labeled attachment scores (UAS, LAS) for multi-source transfer parsers in the simulated low-resource scenario where L_t ∩ L_s = ∅.

The results in Table 3.8 show that, on average, our parser outperforms both baselines by more than 1 point in LAS, and gives the best LAS results in four (out of six) languages.

3.3.3 Parsing Code-switched Input

Code-switching presents a challenge for monolingually-trained NLP models (Barman et al., 2014). We hypothesize that our language-universal approach is a good fit for code-switched text.
However, it is hard to test this hypothesis due to the lack of universal dependency treebanks with naturally-occurring code-switching. Instead, we simulate an evaluation treebank with code-switching by replacing words in the English development set of the Universal Dependencies v1.2 with Spanish words. To account for the fact that Spanish words do not arbitrarily appear in code-switching with English, we only allow a Spanish word to substitute an English word under two conditions: (1) the Spanish word must be a likely translation of the English word, and (2) together with the (possibly modified) previous word in the treebank, the introduced Spanish word forms a bigram which appears in naturally-occurring code-switched tweets (from the EMNLP 2014 shared task on code switching; Lin et al., 2014). 2.5% of English words in the development set were replaced with Spanish words.
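The substitution procedure can be sketched as follows; an illustrative Python sketch under simplifying assumptions (a translation table likely_translations mapping English words to Spanish candidates and a set cs_bigrams of attested code-switched bigrams are assumed to be given), not the exact script used here:

    def simulate_code_switching(tokens, likely_translations, cs_bigrams):
        # tokens: list of English tokens in a treebank sentence.
        # likely_translations: dict mapping an English word to Spanish candidates.
        # cs_bigrams: set of (previous_word, spanish_word) bigrams attested in
        # naturally-occurring code-switched tweets.
        out = []
        for i, token in enumerate(tokens):
            replaced = False
            prev = out[i - 1] if i > 0 else None
            for candidate in likely_translations.get(token, []):
                # Condition (1): candidate is a likely translation of the word.
                # Condition (2): it forms an attested bigram with the (possibly
                # already modified) previous word.
                if prev is not None and (prev, candidate) in cs_bigrams:
                    out.append(candidate)
                    replaced = True
                    break
            if not replaced:
                out.append(token)
        return out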
We use "pure" to refer to the original English development set, and "code-switched" to refer to the same development set after replacing 2.5% of English words with Spanish translations.

In order to quantify the degree to which monolingually-trained parsers make bad predictions when the input text is code-switched, we contrast the UAS performance of our joint model for tagging and parsing, trained on English treebanks with coarse POS tags only, and tested on pure vs. code-switched treebanks. We then repeat the same experiment with the MALOPA parser trained on seven languages, with language ID and coarse POS tags only. The results in Table 3.9 suggest that (simulated) code-switched input adversely affects the performance of monolingually-trained parsers, but hardly affects the performance of our MALOPA parser.
UAS           pure   code-switched
monolingual   85.0   82.6
MALOPA        84.7   84.8

Table 3.9: UAS results on the first 300 sentences in the English development set of the Universal Dependencies v1.2, with and without simulated code-switching.

3.4 Open Questions

Our results open the door for more research in multilingual NLP. Some of the questions triggered by our results are:

- Multilingual dependency parsing can be viewed as a domain adaptation problem where each language represents a different domain. Can we use the MALOPA approach in traditional domain adaptation settings?
- Can we combine the language-universal approach with other methods for indirect supervision (e.g., annotation projection, CRF autoencoders and co-training) to further improve performance in low-resource scenarios without hurting performance in high-resource scenarios?
- Can we obtain better results by sharing some of the model parameters for all members of the same language family?
- Can we apply the language-universal approach to more distant languages such as Arabic and Japanese?
- Can we apply the language-universal approach to more NLP problems such as named entity recognition and coreference resolution?

3.5 Related Work

Our work builds on the model transfer approach, which was pioneered by Zeman and Resnik (2008), who trained a parser on a source language treebank then applied it to parse sentences in a target language. Cohen et al. (2011) and McDonald et al. (2011) trained unlexicalized parsers on treebanks of multiple source languages and applied the parser to different languages. Naseem et al. (2012), Täckström et al. (2013), and Zhang and Barzilay (2015) used language typology to improve model transfer.
To add lexical information, Täckström et al. (2012a) used multilingual word clusters, while Xiao and Guo (2014), Guo et al. (2015), Søgaard et al. (2015) and Guo et al. (2016) used multilingual word embeddings. Duong et al. (2015) used a neural network based model, sharing most of the parameters between two languages, and used an L2 regularizer to tie the lexical embeddings of translationally-equivalent words. We incorporate these ideas in our framework, while proposing a novel neural architecture for embedding language typology (see §3.2.4) and another for consuming noisy structured inputs (block dropout). We also show how to replace an array of monolingually trained parsers with one multilingually-trained parser without sacrificing accuracy, which is related to Vilares et al. (2015).
Neural network parsing models which preceded Dyer et al. (2015) include Henderson (2003), Titov and Henderson (2007), Henderson and Titov (2010) and Chen and Manning (2014). Related to lexical features in cross-lingual parsing is Durrett et al. (2012), who defined lexico-syntactic features based on bilingual lexicons. Other related work includes Östling (2015), which may be used to induce more useful typological properties to inform multilingual parsing. Tsvetkov et al. (2016) concurrently used a similar approach to learn language-universal language models based on morphology.

Another popular approach for cross-lingual supervision is to project annotations from the source language to the target language via a parallel corpus (Yarowsky et al., 2001; Hwa et al., 2005) or via automatically-translated sentences (Schneider et al., 2013; Tiedemann et al., 2014). Ma and Xia (2014) used entropy regularization to learn from both parallel data (with projected annotations) and unlabeled data in the target language.
Rasooli and Collins (2015) trained an array of target-language parsers on fully annotated trees, by iteratively decoding sentences in the target language with incomplete annotations. One research direction worth pursuing is to find synergies between the model transfer approach and the annotation projection approach.

3.6 Summary

In this chapter, we describe a general approach for training language-universal NLP models, and apply this approach to dependency parsing. The main ingredients of this approach are homogeneous annotations in multiple languages (e.g., the universal dependency treebanks), a core model with large capacity for representing complex functions (e.g., recurrent neural networks), mapping language-specific representations into a language-universal space (e.g., multilingual word embeddings), and mechanisms for differentiating between the behavior of different languages (e.g., language embeddings and fine-grained POS tags).
We show for the first time how to train language-universal models that perform competitively in multiple languages in both high- and low-resource scenarios.[24] We also show for the first time, using a simulated evaluation set, that language-universal models are a viable solution for processing code-switched text.

We note the importance of lexical features that connect multiple languages via bilingual dictionaries or parallel corpora. When the multilingual lexical features are based on out-of-domain resources such as the Bible, we observe a significant drop in performance. Developing competitive language-universal models for languages with small bilingual dictionaries remains an open problem for future research.

[24] The results in some (modification, test language) combinations were small relative to the standard deviation, suggesting that some of these modifications may not be important for those languages. Nevertheless, the average results across multiple languages show steady, robust improvements, each of which far exceeds the standard deviation.
In principle, the proposed approach can be applied to many NLP problems such as named entity recognition, coreference resolution, question answering and textual entailment. However, in practice, the lack of homogeneous annotations in multiple languages for these tasks makes it hard to evaluate, let alone train, language-universal models. One possible approach to alleviate this practical difficulty is to use existing language-specific annotations for multiple languages and dedicate a small number of the model parameters to learning a mapping from a latent language-universal output to the language-specific annotations in a multi-task learning framework, which has been proposed in a different setup by Fang and Cohn (2016). We also hope that future multilingual annotation projects will develop annotation guidelines to encourage consistency across languages, following the example of the universal dependency treebanks.
Chapter 4

Multilingual Word Embeddings

4.1 Overview

The previous chapter discussed how to develop language-universal models for analyzing text. In this chapter, we focus on multilingual word embeddings, one of the important mechanisms for enabling cross-lingual supervision.

Vector-space representations of words are widely used in statistical models of natural language.
In addition to improvements on standard monolingual NLP tasks (Collobert and Weston, 2008), shared representation of words across languages offers intriguing possibilities (Klementiev et al., 2012). For example, in machine translation, translating a word never seen in parallel data may be overcome by seeking its vector-space neighbors, if the embeddings are learned from both plentiful monolingual corpora and more limited parallel data. A second opportunity comes from transfer learning, in which models trained in one language can be deployed in other languages. While previous work has used hand-engineered features that are cross-lingually stable as the basis for model transfer (Zeman and Resnik, 2008; McDonald et al., 2011; Tsvetkov et al., 2014), automatically learned embeddings offer the promise of better generalization at lower cost (Klementiev et al., 2012; Hermann and Blunsom, 2014; Guo et al., 2016). We therefore conjecture that developing estimation methods for "massively" multilingual word embeddings (i.e., embeddings for words in a large number of languages) will play an important role in the future of multilingual NLP.
Novel contributions in this chapter include two methods for estimating multilingual embeddings which only require monolingual data in each language and pairwise bilingual dictionaries, and scale to a large number of languages. We also introduce our evaluation web portal for uploading arbitrary multilingual embeddings and evaluating them automatically using a suite of intrinsic and extrinsic evaluation methods. The material in this chapter was previously published in Ammar et al. (2016b).

4.2 Estimation Methods

Let L be a set of languages, and let V_m be the set of surface forms (word types) in m ∈ L. Let V = ∪_{m ∈ L} V_m. Our goal is to estimate a partial embedding function E : L × V → R^d (allowing a surface form that appears in two languages to have different vectors in each). We would like to estimate this function such that: (i) semantically similar words in the same language are nearby, (ii) translationally equivalent words in different languages are nearby, and (iii) the domain of the function covers as many words in V as possible.
We use distributional similarity in a monolingual corpus M_m to model semantic similarity between words in the same language.[1] For cross-lingual similarity, either a parallel corpus P_{m,n} or a bilingual dictionary D_{m,n} ⊆ V_m × V_n can be used. In some cases, we extract the bilingual dictionary from a parallel corpus. To do this, we align the corpus using fast_align (Dyer et al., 2013) in both directions. The estimated parameters of the word translation distributions are used to select pairs:

    D_{m,n} = { (u, v) | u ∈ V_m, v ∈ V_n, p_{m|n}(u | v) × p_{n|m}(v | u) > τ },

where the threshold τ trades off dictionary recall and precision.[2]

[1] Monolingual corpora are often an order of magnitude larger than parallel corpora. Therefore, multilingual word embedding models trained on monolingual corpora tend to have higher coverage.
[2] We fixed τ = 0.1 for all language pairs based on manual inspection of the resulting dictionaries.
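A minimal sketch of this selection step, assuming the two translation tables have already been read from fast_align's output into Python dicts (the data structures and names here are ours):

    def extract_dictionary(p_m_given_n, p_n_given_m, tau=0.1):
        # p_m_given_n[(u, v)] = p(u | v), estimated from one alignment direction;
        # p_n_given_m[(v, u)] = p(v | u), estimated from the reverse direction.
        # Keep a pair (u, v) when the product of the two translation
        # probabilities exceeds tau, trading off recall and precision.
        return {
            (u, v)
            for (u, v), p_uv in p_m_given_n.items()
            if p_uv * p_n_given_m.get((v, u), 0.0) > tau
        }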
With three notable exceptions (see §4.2.3, §4.2.4, §4.5), previous work on multilingual embeddings only considered the bilingual case, |L| = 2. In this section, we focus on estimating multilingual embeddings for |L| > 2. We first describe two novel dictionary-based methods (multiCluster and multiCCA). Then, we review a variant of the multiSkip method (Guo et al., 2016) and the translation-invariance matrix factorization method (Gardner et al., 2015).[3]

[3] We developed the multiSkip method independently of Guo et al. (2016).
4.2.1 MultiCluster embeddings

In this method, we decompose the problem into two simpler subproblems: E = E_embed ∘ E_cluster, where E_cluster : L × V → C maps words to multilingual clusters C, and E_embed : C → R^d assigns a vector to each cluster. We use a bilingual dictionary to find clusters of translationally equivalent words, then use distributional similarities of the clusters in monolingual corpora from all languages in L to estimate an embedding for each cluster. By forcing words from different languages in a cluster to share the same embedding, we create anchor points in the vector space to bridge languages. Fig. 4.1 illustrates this method with a schematic diagram.

More specifically, we define the clusters as the connected components in a graph where nodes are (language, surface form) pairs and edges correspond to translation entries in D_{m,n}. We assign arbitrary IDs to the clusters and replace each word token in each monolingual corpus with the corresponding cluster ID, and concatenate all modified corpora. The resulting corpus consists of multilingual cluster ID sequences. We can then apply any monolingual embedding estimator to obtain cluster embeddings; here, we use the skipgram model from Mikolov et al. (2013a). Our implementation of the multiCluster method is available on GitHub.[4]

[4] https://github.com/wammar/wammar-utils/blob/master/train-multilingual-embeddings.sh
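The clustering step can be sketched with a simple union-find over dictionary entries; an illustrative Python sketch (dictionaries here is assumed to map language pairs to sets of translation pairs), not the released implementation linked above:

    def multicluster_ids(dictionaries):
        # dictionaries: dict mapping a language pair (m, n) to a set of
        # translation pairs (u, v) with u in V_m and v in V_n.
        # Returns a map from (language, surface form) to a cluster ID, where
        # clusters are connected components of the translation graph.
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x

        def union(x, y):
            parent[find(x)] = find(y)

        for (m, n), pairs in dictionaries.items():
            for u, v in pairs:
                union((m, u), (n, v))

        return {node: find(node) for node in parent}

Each word token in each monolingual corpus is then replaced by its cluster ID, and a standard skipgram model is trained on the concatenated corpora.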
4.2.2 MultiCCA embeddings

In this method, we first use a monolingual estimator to independently embed words in each language of interest. We then pick a pivot language and linearly project words from every other language to the vector space of the pivot language.[5] In order to find the linear projection between two languages, we build on the work of Faruqui and Dyer (2014), who proposed a bilingual embedding estimation method based on canonical correlation analysis (CCA) and showed that the resulting embeddings for English words outperform monolingually-trained English embeddings on word similarity tasks.

[5] We use English as the pivot language since English typically offers the largest corpora and wide availability of bilingual dictionaries.
First, they use monolingual corpora to train monolingual embeddings for each language independently (E_m and E_n), capturing semantic similarity within each language separately. Then, using a bilingual dictionary D_{m,n}, they use CCA to estimate linear projections from the ranges of the monolingual embeddings E_m and E_n, yielding a bilingual embedding E_{m,n}. The linear projections are defined by T_{m→m,n} and T_{n→m,n} ∈ R^{d×d}; they are selected to maximize the correlation between vector pairs T_{m→m,n} E_m(u) and T_{n→m,n} E_n(v), where (u, v) ∈ D_{m,n}. The bilingual embedding is then defined as E_CCA(m, u) = T_{m→m,n} E_m(u) (and likewise for E_CCA(n, v)).
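As an illustration of how such projections can be computed with off-the-shelf tools, the following Python sketch fits CCA on the dictionary pairs using scikit-learn; this is an assumption of ours for illustration, not the pipeline used here, which builds on Faruqui and Dyer's implementation and differs in details such as normalization and how the projection matrices are extracted:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def cca_project(E_m, E_n, dictionary, n_components):
        # E_m, E_n: dicts mapping surface forms to monolingual embedding vectors.
        # dictionary: list of translation pairs (u, v), u in language m, v in n.
        X = np.stack([E_m[u] for u, v in dictionary])
        Y = np.stack([E_n[v] for u, v in dictionary])
        cca = CCA(n_components=n_components).fit(X, Y)
        # cca.x_rotations_ and cca.y_rotations_ play the role of the projection
        # matrices T (up to the centering and scaling sklearn applies internally).
        return cca.transform(X, Y)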
isthendenedasECCA(m;u)=Tm!m;nEm(u)(andlikewiseforECCA(n;v)).Westartbyestimating,foreachm2Lnfeng,thetwoprojectionmatrices:Tm!m;enandTen!m;en;theseareguaranteedtobenon-singular,ausefulpropertyofCCAweexploitbyinvert-3WedevelopedthemultiSkipmethodindependentlyofGuoetal.(2016).4https://github.com/wammar/wammar-utils/blob/master/train-multilingual-embeddings.sh5WeuseEnglishasthepivotlanguagesinceEnglishtypicallyoffersthelargestcorporaandwideavailabilityofbilingualdictionaries.35 Figure4.1:StepsforestimatingmultiClusterembeddings:1)identifythesetoflanguagesofinterest,2)obtainbilingualdictionariesbetweenpairsoflanguages,3)grouptranslationallyequivalentwordsintoclusters,4)obtainamonolingualcorpusforeachlanguageofinterest,5)replacewordsinmonoli
