Sequence to Sequence Learning with Neural Networks

Ilya Sutskever, Google (ilyasu@google.com)
Oriol Vinyals, Google (vinyals@google.com)
Quoc V. Le, Google (qvl@google.com)

Abstract

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.

1 Introduction

Deep Neural Networks (DNNs) are extremely powerful machine learning models that achieve excellent performance on difficult problems such as speech recognition [13, 7] and visual object recognition [19, 6, 21, 20]. DNNs are powerful because they can perform arbitrary parallel computation for a modest number of steps. A surprising example of the power of DNNs is their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size [27]. So, while neural networks are related to conventional statistical models, they learn an intricate computation. Furthermore, large DNNs can be trained with supervised backpropagation whenever the labeled training set has enough information to specify the network's parameters. Thus, if there exists a parameter setting of a large DNN that achieves good results (for example, because humans can solve the task very rapidly), supervised backpropagation will find these parameters and solve the problem.

Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. It is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a-priori. For example, speech recognition and machine translation are sequential problems. Likewise, question answering can also be seen as mapping a sequence of words representing the question to a sequence of words representing the answer. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful.
Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed. In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence. The LSTM's ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs (fig. 1).

There have been a number of related attempts to address the general sequence to sequence learning problem with neural networks. Our approach is closely related to Kalchbrenner and Blunsom [18], who were the first to map the entire input sentence to a vector, and is very similar to Cho et al. [5]. Graves [10] introduced a novel differentiable attention mechanism that allows neural networks to focus on different parts of their input, and an elegant variant of this idea was successfully applied to machine translation by Bahdanau et al. [2]. The Connectionist Sequence Classification is another popular technique for mapping sequences to sequences with neural networks, although it assumes a monotonic alignment between the inputs and the outputs [11].

Figure 1: Our model reads an input sentence "ABC" and produces "WXYZ" as the output sentence. The model stops making predictions after outputting the end-of-sentence token. Note that the LSTM reads the input sentence in reverse, because doing so introduces many short term dependencies in the data that make the optimization problem much easier.

The main result of this work is the following. On the WMT'14 English to French translation task, we obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep LSTMs (with 380M parameters each) using a simple left-to-right beam-search decoder. This is by far the best result achieved by direct translation with large neural networks. For comparison, the BLEU score of an SMT baseline on this dataset is 33.30 [29]. The 34.81 BLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized whenever the reference translation contained a word not covered by these 80k. This result shows that a relatively unoptimized neural network architecture which has much room for improvement outperforms a mature phrase-based SMT system.

Finally, we used the LSTM to rescore the publicly available 1000-best lists of the SMT baseline on the same task [29]. By doing so, we obtained a BLEU score of 36.5, which improves the baseline by 3.2 BLEU points and is close to the previous state of the art (which is 37.0 [9]).

Surprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other researchers with related architectures [26]. We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set. By doing so, we introduced many short term dependencies that made the optimization problem much simpler (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with long sentences. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.

A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation. Given that translations tend to be paraphrases of the source sentences, the translation objective encourages the LSTM to find sentence representations that capture their meaning, as sentences with similar meanings are close to each other while different sentences' meanings will be far. A qualitative evaluation supports this claim, showing that our model is aware of word order and is fairly invariant to the active and passive voice.
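The reversal trick highlighted above (described in detail in sec. 3.3) amounts to a one-line preprocessing step applied only to the source side of each training pair. The following is a minimal illustrative sketch and not code from the paper; it assumes whitespace-tokenized (source, target) pairs, and the helper name reverse_source is hypothetical.

    # Minimal sketch (not from the paper): reverse the token order of every
    # source sentence while leaving the target sentence untouched.
    def reverse_source(pairs):
        """pairs: list of (source_tokens, target_tokens) tuples."""
        return [(list(reversed(src)), tgt) for src, tgt in pairs]

    if __name__ == "__main__":
        data = [("a b c".split(), "alpha beta gamma".split())]
        print(reverse_source(data))
        # [(['c', 'b', 'a'], ['alpha', 'beta', 'gamma'])]
        # 'a' now sits next to 'alpha' when source and target are concatenated,
        # which is exactly the short-term dependency the paper exploits.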
2 The model

The Recurrent Neural Network (RNN) [31, 28] is a natural generalization of feedforward neural networks to sequences. Given a sequence of inputs (x_1, ..., x_T), a standard RNN computes a sequence of outputs (y_1, ..., y_T) by iterating the following equation:

    h_t = \mathrm{sigm}(W^{hx} x_t + W^{hh} h_{t-1})
    y_t = W^{yh} h_t

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths with complicated and non-monotonic relationships.

A simple strategy for general sequence learning is to map the input sequence to a fixed-sized vector using one RNN, and then to map the vector to the target sequence with another RNN (this approach has also been taken by Cho et al. [5]). While it could work in principle since the RNN is provided with all the relevant information, it would be difficult to train the RNNs due to the resulting long term dependencies [14, 4] (figure 1) [16, 15]. However, the Long Short-Term Memory (LSTM) [16] is known to learn problems with long range temporal dependencies, so an LSTM may succeed in this setting.

The goal of the LSTM is to estimate the conditional probability p(y_1, ..., y_{T'} | x_1, ..., x_T), where (x_1, ..., x_T) is an input sequence and y_1, ..., y_{T'} is its corresponding output sequence whose length T' may differ from T. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation v of the input sequence (x_1, ..., x_T), given by the last hidden state of the LSTM, and then computing the probability of y_1, ..., y_{T'} with a standard LSTM-LM formulation whose initial hidden state is set to the representation v of x_1, ..., x_T:

    p(y_1, ..., y_{T'} | x_1, ..., x_T) = \prod_{t=1}^{T'} p(y_t | v, y_1, ..., y_{t-1})    (1)

In this equation, each p(y_t | v, y_1, ..., y_{t-1}) distribution is represented with a softmax over all the words in the vocabulary. We use the LSTM formulation from Graves [10]. Note that we require that each sentence ends with a special end-of-sentence symbol "<EOS>", which enables the model to define a distribution over sequences of all possible lengths. The overall scheme is outlined in figure 1, where the shown LSTM computes the representation of "A", "B", "C", "<EOS>" and then uses this representation to compute the probability of "W", "X", "Y", "Z", "<EOS>".

Our actual models differ from the above description in three important ways. First, we used two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously [18]. Second, we found that deep LSTMs significantly outperformed shallow LSTMs, so we chose an LSTM with four layers. Third, we found it extremely valuable to reverse the order of the words of the input sentence. So for example, instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to α, β, γ, where α, β, γ is the translation of a, b, c. This way, a is in close proximity to α, b is fairly close to β, and so on, a fact that makes it easy for SGD to "establish communication" between the input and the output. We found this simple data transformation to greatly boost the performance of the LSTM.
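To make eq. (1) above concrete, the sketch below (ours, not the authors' implementation) scores a candidate output sequence by summing per-step log-probabilities. The function step_fn is a stand-in for the decoder LSTM plus softmax: it only has to return a word-to-probability mapping given the encoder vector v and the prefix generated so far.

    import math

    # Sketch of eq. (1): p(y_1..y_T' | x_1..x_T) = prod_t p(y_t | v, y_1..y_{t-1}).
    # `step_fn(v, prefix)` stands in for the decoder LSTM + softmax and must
    # return a dict mapping each vocabulary word to its probability.
    def sequence_log_prob(step_fn, v, target):
        log_p = 0.0
        prefix = []
        for word in target + ["<EOS>"]:      # every sentence ends with <EOS>
            dist = step_fn(v, prefix)        # p( . | v, y_1..y_{t-1})
            log_p += math.log(dist[word])
            prefix.append(word)
        return log_p

    if __name__ == "__main__":
        vocab = ["W", "X", "Y", "Z", "<EOS>"]
        uniform = lambda v, prefix: {w: 1.0 / len(vocab) for w in vocab}
        # With a uniform toy decoder, a 4-word output plus <EOS> scores 5 * log(1/5).
        print(sequence_log_prob(uniform, v=None, target=["W", "X", "Y", "Z"]))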
3 Experiments

We applied our method to the WMT'14 English to French MT task in two ways. We used it to directly translate the input sentence without using a reference SMT system, and we used it to rescore the n-best lists of an SMT baseline. We report the accuracy of these translation methods, present sample translations, and visualize the resulting sentence representations.

3.1 Dataset details

We used the WMT'14 English to French dataset. We trained our models on a subset of 12M sentences consisting of 348M French words and 304M English words, which is a clean "selected" subset from [29]. We chose this translation task and this specific training set subset because of the public availability of a tokenized training and test set together with 1000-best lists from the baseline SMT [29]. As typical neural language models rely on a vector representation for each word, we used a fixed vocabulary for both languages. We used 160,000 of the most frequent words for the source language and 80,000 of the most frequent words for the target language. Every out-of-vocabulary word was replaced with a special "UNK" token.

3.2 Decoding and Rescoring

The core of our experiments involved training a large deep LSTM on many sentence pairs. We trained it by maximizing the log probability of a correct translation T given the source sentence S, so the training objective is

    1/|S| \sum_{(T,S) \in S} \log p(T | S)

where S is the training set. Once training is complete, we produce translations by finding the most likely translation according to the LSTM:

    \hat{T} = \arg\max_T p(T | S)    (2)

We search for the most likely translation using a simple left-to-right beam search decoder which maintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some translation. At each timestep we extend each partial hypothesis in the beam with every possible word in the vocabulary. This greatly increases the number of the hypotheses, so we discard all but the B most likely hypotheses according to the model's log probability. As soon as the "<EOS>" symbol is appended to a hypothesis, it is removed from the beam and is added to the set of complete hypotheses. While this decoder is approximate, it is simple to implement. Interestingly, our system performs well even with a beam size of 1, and a beam of size 2 provides most of the benefits of beam search (Table 1).

We also used the LSTM to rescore the 1000-best lists produced by the baseline system [29]. To rescore an n-best list, we computed the log probability of every hypothesis with our LSTM and took an even average with their score and the LSTM's score.
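The decoder of sec. 3.2 can be summarized in a few lines. The following is an illustrative Python sketch under the same assumptions as before (step_fn stands in for the decoder's next-word distribution); it is not the authors' C++ implementation, and practical details such as batching and length handling are omitted.

    import math

    def beam_search(step_fn, v, B=2, max_len=50):
        """Left-to-right beam search: keep only the B most likely partial hypotheses."""
        beam = [([], 0.0)]          # each entry is (prefix, log probability)
        complete = []               # hypotheses that have emitted <EOS>
        for _ in range(max_len):
            candidates = []
            for prefix, score in beam:
                dist = step_fn(v, prefix)                      # p( . | v, prefix)
                for word, p in dist.items():
                    candidates.append((prefix + [word], score + math.log(p)))
            # Keep only the B best extensions according to the model's log probability.
            candidates.sort(key=lambda c: c[1], reverse=True)
            beam = []
            for prefix, score in candidates[:B]:
                # A hypothesis ending in <EOS> leaves the beam and is kept as complete.
                (complete if prefix[-1] == "<EOS>" else beam).append((prefix, score))
            if not beam:
                break
        return max(complete + beam, key=lambda c: c[1])

    if __name__ == "__main__":
        # Toy decoder: prefer "W" for the first two steps, then prefer ending.
        def toy(v, prefix):
            return ({"W": 0.7, "X": 0.25, "<EOS>": 0.05} if len(prefix) < 2
                    else {"W": 0.05, "X": 0.05, "<EOS>": 0.9})
        print(beam_search(toy, v=None, B=2, max_len=10))
        # (['W', 'W', '<EOS>'], ...)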
3.3 Reversing the Source Sentences

While the LSTM is capable of solving problems with long term dependencies, we discovered that the LSTM learns much better when the source sentences are reversed (the target sentences are not reversed). By doing so, the LSTM's test perplexity dropped from 5.8 to 4.7, and the test BLEU scores of its decoded translations increased from 25.9 to 30.6.

While we do not have a complete explanation for this phenomenon, we believe that it is caused by the introduction of many short term dependencies to the dataset. Normally, when we concatenate a source sentence with a target sentence, each word in the source sentence is far from its corresponding word in the target sentence. As a result, the problem has a large "minimal time lag" [17]. By reversing the words in the source sentence, the average distance between corresponding words in the source and target language is unchanged. However, the first few words in the source language are now very close to the first few words in the target language, so the problem's minimal time lag is greatly reduced. Thus, backpropagation has an easier time "establishing communication" between the source sentence and the target sentence, which in turn results in substantially improved overall performance.

Initially, we believed that reversing the input sentences would only lead to more confident predictions in the early parts of the target sentence and to less confident predictions in the later parts. However, LSTMs trained on reversed source sentences did much better on long sentences than LSTMs trained on the raw source sentences (see sec. 3.7), which suggests that reversing the input sentences results in LSTMs with better memory utilization.

3.4 Training details

We found that the LSTM models are fairly easy to train. We used deep LSTMs with 4 layers, with 1000 cells at each layer and 1000 dimensional word embeddings, with an input vocabulary of 160,000 and an output vocabulary of 80,000. We found deep LSTMs to significantly outperform shallow LSTMs, where each additional layer reduced perplexity by nearly 10%, possibly due to their much larger hidden state. We used a naive softmax over 80,000 words at each output. The resulting LSTM has 380M parameters, of which 64M are pure recurrent connections (32M for the "encoder" LSTM and 32M for the "decoder" LSTM). The complete training details are given below:

- We initialized all of the LSTM's parameters with the uniform distribution between -0.08 and 0.08.
- We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7. After 5 epochs, we began halving the learning rate every half epoch. We trained our models for a total of 7.5 epochs.
- We used batches of 128 sequences for the gradient and divided it by the size of the batch (namely, 128).
- Although LSTMs tend to not suffer from the vanishing gradient problem, they can have exploding gradients. Thus we enforced a hard constraint on the norm of the gradient [10, 25] by scaling it when its norm exceeded a threshold. For each training batch, we compute s = ||g||_2, where g is the gradient divided by 128. If s > 5, we set g = 5g/s. (A short sketch of this rule and of the length-based batching below appears after this list.)
- Different sentences have different lengths. Most sentences are short (e.g., length 20-30) but some sentences are long (e.g., length > 100), so a minibatch of 128 randomly chosen training sentences will have many short sentences and few long sentences, and as a result, much of the computation in the minibatch is wasted. To address this problem, we made sure that all sentences within a minibatch were roughly of the same length, which gave a 2x speedup.
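Two of the details above translate directly into code: the hard constraint on the gradient norm (rescale g by 5/s whenever s = ||g||_2 exceeds 5, with g already divided by the batch size of 128) and the grouping of sentences of similar length into the same minibatch. The numpy sketch below is illustrative only and works under those stated assumptions; it is not the authors' implementation.

    import numpy as np

    def clip_gradient(g, threshold=5.0):
        """Rescale the (batch-averaged) gradient so its L2 norm never exceeds the threshold."""
        s = np.linalg.norm(g)
        return g * (threshold / s) if s > threshold else g

    def length_bucketed_batches(sentences, batch_size=128):
        """Group sentences of roughly equal length into minibatches (sec. 3.4)."""
        ordered = sorted(sentences, key=len)
        return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

    if __name__ == "__main__":
        g = np.ones(100)                               # toy gradient with norm 10 > 5
        print(np.linalg.norm(clip_gradient(g)))        # ~5.0 after rescaling
        batches = length_bucketed_batches([["w"] * n for n in (3, 30, 5, 28, 4, 29)],
                                          batch_size=2)
        print([[len(s) for s in b] for b in batches])  # [[3, 4], [5, 28], [29, 30]]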
3.5 Parallelization

A C++ implementation of deep LSTM with the configuration from the previous section on a single GPU processes at a speed of approximately 1,700 words per second. This was too slow for our purposes, so we parallelized our model using an 8-GPU machine. Each layer of the LSTM was executed on a different GPU and communicated its activations to the next GPU (or layer) as soon as they were computed. Our models have 4 layers of LSTMs, each of which resides on a separate GPU. The remaining 4 GPUs were used to parallelize the softmax, so each GPU was responsible for multiplying by a 1000 x 20000 matrix. The resulting implementation achieved a speed of 6,300 (both English and French) words per second with a minibatch size of 128. Training took about ten days with this implementation.

3.6 Experimental Results

We used the cased BLEU score [24] to evaluate the quality of our translations. We computed our BLEU scores using multi-bleu.pl (see footnote 1) on the tokenized predictions and ground truth. This way of evaluating the BLEU score is consistent with [5] and [2], and reproduces the 33.3 score of [29]. However, if we evaluate the state of the art system of [9] (whose predictions can be downloaded from statmt.org\matrix) in this manner, we get 37.0, which is greater than the 35.8 reported by statmt.org\matrix.

The results are presented in tables 1 and 2. Our best results are obtained with an ensemble of LSTMs that differ in their random initializations and in the random order of minibatches. While the decoded translations of the LSTM ensemble do not beat the state of the art, it is the first time that a pure neural translation system outperforms a phrase-based SMT baseline on a large MT task by a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is within 0.5 BLEU points of the previous state of the art by rescoring the 1000-best list of the baseline system.

Footnote 1: There are several variants of the BLEU score, and each variant is defined with a perl script.

    Method                                         test BLEU score (ntst14)
    Bahdanau et al. [2]                            28.45
    Baseline System [29]                           33.30
    Single forward LSTM, beam size 12              26.17
    Single reversed LSTM, beam size 12             30.59
    Ensemble of 5 reversed LSTMs, beam size 1      33.00
    Ensemble of 2 reversed LSTMs, beam size 12     33.27
    Ensemble of 5 reversed LSTMs, beam size 2      34.50
    Ensemble of 5 reversed LSTMs, beam size 12     34.81

Table 1: The performance of the LSTM on the WMT'14 English to French test set (ntst14). Note that an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of size 12.

    Method                                                                  test BLEU score (ntst14)
    Baseline System [29]                                                    33.30
    Cho et al. [5]                                                          34.54
    State of the art [9]                                                    37.0
    Rescoring the baseline 1000-best with a single forward LSTM             35.61
    Rescoring the baseline 1000-best with a single reversed LSTM            35.85
    Rescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs   36.5
    Oracle Rescoring of the Baseline 1000-best lists                        ~45

Table 2: Methods that use neural networks together with an SMT system on the WMT'14 English to French test set (ntst14).

3.7 Performance on long sentences

We were surprised to discover that the LSTM did well on long sentences, which is shown quantitatively in figure 3. Table 3 presents several examples of long sentences and their translations.

[Figure 3: two plots of BLEU score (20-35) for the LSTM (34.8) and the baseline (33.3); left plot x-axis: test sentences sorted by their length (lengths 4 to 79); right plot x-axis: test sentences sorted by average word frequency rank (0 to 3000).]

Figure 3: The left plot shows the performance of our system as a function of sentence length, where the x-axis corresponds to the test sentences sorted by their length and is marked by the actual sequence lengths. There is no degradation on sentences with less than 35 words, and there is only a minor degradation on the longest sentences. The right plot shows the LSTM's performance on sentences with progressively more rare words, where the x-axis corresponds to the test sentences sorted by their "average word frequency rank".

Our model: Ulrich UNK, membre du conseil d'administration du constructeur automobile Audi, affirme qu'il s'agit d'une pratique courante depuis des années pour que les téléphones portables puissent être collectés avant les réunions du conseil d'administration afin qu'ils ne soient pas utilisés comme appareils d'écoute à distance.
Truth: Ulrich Hackenberg, membre du conseil d'administration du constructeur automobile Audi, déclare que la collecte des téléphones portables avant les réunions du conseil, afin qu'ils ne puissent pas être utilisés comme appareils d'écoute à distance, est une pratique courante depuis des années.

Our model: “Les téléphones cellulaires, qui sont vraiment une question, non seulement parce qu'ils pourraient potentiellement causer des interférences avec les appareils de navigation, mais nous savons, selon la FCC, qu'ils pourraient interférer avec les tours de téléphone cellulaire lorsqu'ils sont dans l'air”, dit UNK.
Truth: “Les téléphones portables sont véritablement un problème, non seulement parce qu'ils pourraient éventuellement créer des interférences avec les instruments de navigation, mais parce que nous savons, d'après la FCC, qu'ils pourraient perturber les antennes-relais de téléphonie mobile s'ils sont utilisés à bord”, a déclaré Rosenker.

Our model: Avec la crémation, il y a un “sentiment de violence contre le corps d'un être cher”, qui sera “réduit à une pile de cendres” en très peu de temps au lieu d'un processus de décomposition “qui accompagnera les étapes du deuil”.
Truth: Il y a, avec la crémation, “une violence faite au corps aimé”, qui va être “réduit à un tas de cendres” en très peu de temps, et non après un processus de décomposition, qui “accompagnerait les phases du deuil”.

Table 3: A few examples of long translations produced by the LSTM alongside the ground truth translations. The reader can verify that the translations are sensible using Google translate.

3.8 Model Analysis

[Figure 2: two 2-dimensional scatter plots of phrase representations; left panel: "John respects Mary", "Mary respects John", "John admires Mary", "Mary admires John", "Mary is in love with John", "John is in love with Mary"; right panel: "I gave her a card in the garden", "In the garden, I gave her a card", "She was given a card by me in the garden", "She gave me a card in the garden", "In the garden, she gave me a card", "I was given a card by her in the garden".]

Figure 2: The figure shows a 2-dimensional PCA projection of the LSTM hidden states that are obtained after processing the phrases in the figures. The phrases are clustered by meaning, which in these examples is primarily a function of word order, which would be difficult to capture with a bag-of-words model. Notice that both clusters have similar internal structure.

One of the attractive features of our model is its ability to turn a sequence of words into a vector of fixed dimensionality. Figure 2 visualizes some of the learned representations. The figure clearly shows that the representations are sensitive to the order of words, while being fairly insensitive to the replacement of an active voice with a passive voice. The two-dimensional projections are obtained using PCA.
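The projections in Figure 2 are ordinary PCA applied to the fixed-dimensional sentence vectors. As an illustration (not the paper's code), the sketch below assumes a matrix whose rows are sentence representations, for example the encoder's final hidden states, and projects them onto their first two principal components.

    import numpy as np

    def pca_2d(vectors):
        """Project row vectors onto their first two principal components."""
        X = np.asarray(vectors, dtype=float)
        X = X - X.mean(axis=0)                  # center the data
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        return X @ vt[:2].T                     # coordinates in the 2-D PCA plane

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Stand-ins for 1000-dimensional LSTM hidden states of six phrases.
        states = rng.normal(size=(6, 1000))
        print(pca_2d(states).shape)             # (6, 2)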
4 Related work

There is a large body of work on applications of neural networks to machine translation. So far, the simplest and most effective way of applying an RNN-Language Model (RNNLM) [23] or a Feedforward Neural Network Language Model (NNLM) [3] to an MT task is by rescoring the n-best lists of a strong MT baseline [22], which reliably improves translation quality.

More recently, researchers have begun to look into ways of including information about the source language into the NNLM. Examples of this work include Auli et al. [1], who combine an NNLM with a topic model of the input sentence, which improves rescoring performance. Devlin et al. [8] followed a similar approach, but they incorporated their NNLM into the decoder of an MT system and used the decoder's alignment information to provide the NNLM with the most useful words in the input sentence. Their approach was highly successful and it achieved large improvements over their baseline.

Our work is closely related to Kalchbrenner and Blunsom [18], who were the first to map the input sentence into a vector and then back to a sentence, although they map sentences to vectors using convolutional neural networks, which lose the ordering of the words. Similarly to this work, Cho et al. [5] used an LSTM-like RNN architecture to map sentences into vectors and back, although their primary focus was on integrating their neural network into an SMT system. Bahdanau et al. [2] also attempted direct translations with a neural network that used an attention mechanism to overcome the poor performance on long sentences experienced by Cho et al. [5] and achieved encouraging results. Likewise, Pouget-Abadie et al. [26] attempted to address the memory problem of Cho et al. [5] by translating pieces of the source sentence in a way that produces smooth translations, which is similar to a phrase-based approach. We suspect that they could achieve similar improvements by simply training their networks on reversed source sentences.

End-to-end training is also the focus of Hermann et al. [12], whose model represents the inputs and outputs by feedforward networks, and maps them to similar points in space. However, their approach cannot generate translations directly: to get a translation, they need to do a lookup for the closest vector in the pre-computed database of sentences, or to rescore a sentence.

5 Conclusion

In this work, we showed that a large deep LSTM with a limited vocabulary can outperform a standard SMT-based system whose vocabulary is unlimited on a large-scale MT task. The success of our simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.

We were surprised by the extent of the improvement obtained by reversing the words in the source sentences. We conclude that it is important to find a problem encoding that has the greatest number of short term dependencies, as they make the learning problem much simpler. In particular, while we were unable to train a standard RNN on the non-reversed translation problem (shown in fig. 1), we believe that a standard RNN should be easily trainable when the source sentences are reversed (although we did not verify it experimentally).
We were also surprised by the ability of the LSTM to correctly translate very long sentences. We were initially convinced that the LSTM would fail on long sentences due to its limited memory, and other researchers reported poor performance on long sentences with a model similar to ours [5, 2, 26]. And yet, LSTMs trained on the reversed dataset had little difficulty translating long sentences.

Most importantly, we demonstrated that a simple, straightforward and relatively unoptimized approach can outperform a mature SMT system, so further work will likely lead to even greater translation accuracies. These results suggest that our approach will likely do well on other challenging sequence to sequence problems.

6 Acknowledgments

We thank Samy Bengio, Jeff Dean, Matthieu Devin, Geoffrey Hinton, Nal Kalchbrenner, Thang Luong, Wolfgang Macherey, Rajat Monga, Vincent Vanhoucke, Peng Xu, Wojciech Zaremba, and the Google Brain team for useful comments and discussions.

References

[1] M. Auli, M. Galley, C. Quirk, and G. Zweig. Joint language and translation modeling with recurrent neural networks. In EMNLP, 2013.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. In Journal of Machine Learning Research, pages 1137-1155, 2003.
[4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
[5] K. Cho, B. Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Arxiv preprint arXiv:1406.1078, 2014.
[6] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.
[7] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing - Special Issue on Deep Learning for Speech and Language Processing, 2012.
[8] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. Fast and robust neural network joint models for statistical machine translation. In ACL, 2014.
[9] N. Durrani, B. Haddow, P. Koehn, and K. Heafield. Edinburgh's phrase-based machine translation systems for WMT-14. In WMT, 2014.
[10] A. Graves. Generating sequences with recurrent neural networks. In Arxiv preprint arXiv:1308.0850, 2013.
[11] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.
[12] K. M. Hermann and P. Blunsom. Multilingual distributed representations without word alignment. In ICLR, 2014.
[13] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.
[14] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, Institut für Informatik, Technische Universität, München, 1991.
[15] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[17] S. Hochreiter and J. Schmidhuber. LSTM can solve hard long time lag problems. 1997.
[18] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[20] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[22] T. Mikolov. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
[23] T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045-1048, 2010.
[24] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
[25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.
[26] J. Pouget-Abadie, D. Bahdanau, B. van Merrienboer, K. Cho, and Y. Bengio. Overcoming the curse of sentence length for neural machine translation using automatic segmentation. arXiv preprint arXiv:1409.1257, 2014.
[27] A. Razborov. On small depth threshold circuits. In Proc. 3rd Scandinavian Workshop on Algorithm Theory, 1992.
[28] D. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.
[29] H. Schwenk. University Le Mans. http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/, 2014. [Online; accessed 03-September-2014].
[30] M. Sundermeyer, R. Schluter, and H. Ney. LSTM neural networks for language modeling. In INTERSPEECH, 2010.
[31] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of IEEE, 1990.
