Generating Text with Recurrent Neural Networks

Ilya Sutskever (ilya@cs.utoronto.ca)
James Martens (jmartens@cs.toronto.edu)
Geoffrey Hinton (hinton@cs.toronto.edu)
University of Toronto, King's College Rd., Toronto, ON M5S 3G4, Canada

Abstract: Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization...

...different nodes to share knowledge. For example, the character string "ing" is quite probable after "fix" and also quite probable after "break". If the hidden state vectors that represent the two histories "fix" and "break" share a common representation of the fact that this could be the stem of a verb, then this common representation can be acted upon by the character "i" to produce a hidden state that predicts an "n". For this to be a good prediction we require the conjunction of the verb-stem representation in the previous hidden state and the character "i". One or other of these alone does not provide half as much evidence for predicting an "n": it is their conjunction that is important. This strongly suggests that we need a multiplicative interaction. To achieve this goal we modify the RNN so that its hidden-to-hidden weight matrix is a (learned) function of the current input x_t:

h_t = \tanh(W_{hx} x_t + W_{hh}^{(x_t)} h_{t-1} + b_h)    (3)
o_t = W_{oh} h_t + b_o    (4)

These are identical to eqs. 1 and 2, except that W_{hh} is replaced with W_{hh}^{(x_t)}, allowing each character to specify a different hidden-to-hidden weight matrix.

It is natural to define W_{hh}^{(x_t)} using a tensor. If we store M matrices, W_{hh}^{(1)}, ..., W_{hh}^{(M)}, where M is the number of dimensions of x_t, we could define W_{hh}^{(x_t)} by the equation

W_{hh}^{(x_t)} = \sum_{m=1}^{M} x_t^{(m)} W_{hh}^{(m)}    (5)

where x_t^{(m)} is the m-th coordinate of x_t. When the input x_t is a 1-of-M encoding of a character, it is easily seen that every character has an associated weight matrix and W_{hh}^{(x_t)} is the matrix assigned to the character represented by x_t.²

² The above model, applied to discrete inputs represented with their 1-of-M encodings, is the nonlinear version of the Observable Operator Model (OOM; Jaeger, 2000), whose linear nature makes it closely related to an HMM in terms of expressive power.

3.2. The Multiplicative RNN

The above scheme, while appealing, has a major drawback: fully general 3-way tensors are not practical because of their size. In particular, if we want to use RNNs with a large number of hidden units (say, 1000) and if the dimensionality of x_t is even moderately large, then the storage required for the tensor W_{hh}^{(x_t)} becomes prohibitive.

It turns out we can remedy the above problem by factoring the tensor W_{hh}^{(x)} (e.g., Taylor & Hinton, 2009). This is done by introducing the three matrices W_{fx}, W_{hf}, and W_{fh}, and reparameterizing the matrix W_{hh}^{(x_t)} by the equation

W_{hh}^{(x_t)} = W_{hf} \,\mathrm{diag}(W_{fx} x_t)\, W_{fh}    (6)
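To make the parameter-count argument concrete, the following NumPy sketch builds the per-character matrix of eq. 5 and its factored counterpart of eq. 6 for a 1-of-M input. It is illustrative only: the sizes H, M, F, the random initialization, and all variable names are assumptions, not part of the authors' implementation.

```python
import numpy as np

# Illustrative sizes (assumed): H hidden units, M characters, F factors.
H, M, F = 100, 86, 100
rng = np.random.default_rng(0)

# Eq. 5: one full H x H matrix per character, i.e. an M x H x H tensor.
W_hh_tensor = rng.standard_normal((M, H, H)) * 0.01

# Eq. 6: W_hh^(x_t) = W_hf diag(W_fx x_t) W_fh, built from three small matrices.
W_hf = rng.standard_normal((H, F)) * 0.01
W_fx = rng.standard_normal((F, M)) * 0.01
W_fh = rng.standard_normal((F, H)) * 0.01

x_t = np.zeros(M)
x_t[5] = 1.0  # 1-of-M encoding of an arbitrary character

# With a 1-of-M input, eq. 5 simply selects that character's matrix ...
W_tensor = np.einsum('m,mij->ij', x_t, W_hh_tensor)
# ... while eq. 6 blends F shared rank-one matrices with character-specific gains.
W_factored = W_hf @ np.diag(W_fx @ x_t) @ W_fh

print("tensor parameters:  ", M * H * H)               # grows as M * H^2
print("factored parameters:", H * F + F * M + F * H)   # grows linearly in each size
```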
Figure 3. The Multiplicative Recurrent Neural Network "gates" the recurrent weight matrix with the input symbol. Each triangle symbol represents a factor that applies a learned linear filter at each of its two input vertices. The product of the outputs of these two linear filters is then sent, via weighted connections, to all the units connected to the third vertex of the triangle. Consequently every input can synthesize its own hidden-to-hidden weight matrix by determining the gains on all of the factors, each of which represents a rank-one hidden-to-hidden weight matrix defined by the outer product of its incoming and outgoing weight vectors to the hidden units. The synthesized weight matrices share "structure" because they are all formed by blending the same set of rank-one matrices. In contrast, an unconstrained tensor model ensures that each input has a completely separate weight matrix.

If the dimensionality of the vector W_{fx} x_t, denoted by F, is sufficiently large, then the factorization is as expressive as the original tensor. Smaller values of F require fewer parameters while hopefully retaining a significant fraction of the tensor's expressive power.

The Multiplicative RNN (MRNN) is the result of factorizing the Tensor RNN by expanding eq. 6 within eq. 3. The MRNN computes the hidden state sequence (h_1, ..., h_T), an additional "factor state sequence" (f_1, ..., f_T), and the output sequence (o_1, ..., o_T) by iterating the equations

f_t = \mathrm{diag}(W_{fx} x_t) \, W_{fh} h_{t-1}    (7)
h_t = \tanh(W_{hf} f_t + W_{hx} x_t)    (8)
o_t = W_{oh} h_t + b_o    (9)

which implement the neural network in fig. 3. The tensor factorization of eq. 6 has the interpretation of an additional layer of multiplicative units between each pair of consecutive layers (i.e., the triangles in fig. 3), so the MRNN actually has two steps of nonlinear processing in its hidden states for every input timestep. Each of the multiplicative units outputs the value f_t of eq. 7, which is the product of the outputs of the two linear filters connecting the multiplicative unit to the previous hidden states and to the inputs.

We experimentally verified the advantage of the MRNN over the RNN when the two have the same number of parameters. We trained an RNN with 500 hidden units and an MRNN with 350 hidden units and 350 factors (so the RNN has slightly more parameters) on the "machine learning" dataset (dataset 3 in the experimental section). After extensive training, the MRNN achieved 1.56 bits per character and the RNN achieved 1.65 bits per character on the test set.
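For concreteness, here is a minimal NumPy sketch of the MRNN forward pass of eqs. 7-9. The sizes, random initialization, and parameter names are assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np

H, M, F = 100, 86, 100   # assumed sizes: hidden units, characters, factors
rng = np.random.default_rng(0)
scale = 0.01
params = {
    "W_fx": rng.standard_normal((F, M)) * scale,
    "W_fh": rng.standard_normal((F, H)) * scale,
    "W_hf": rng.standard_normal((H, F)) * scale,
    "W_hx": rng.standard_normal((H, M)) * scale,
    "W_oh": rng.standard_normal((M, H)) * scale,
    "b_o": np.zeros(M),
}

def mrnn_forward(xs, h0, p):
    """Iterate eqs. 7-9 over a T x M array of 1-of-M character encodings."""
    h, outputs = h0, []
    for x_t in xs:
        # eq. 7: factor states, the elementwise product of two linear filters
        f_t = (p["W_fx"] @ x_t) * (p["W_fh"] @ h)
        # eq. 8: new hidden state
        h = np.tanh(p["W_hf"] @ f_t + p["W_hx"] @ x_t)
        # eq. 9: output (logits over the next character)
        outputs.append(p["W_oh"] @ h + p["b_o"])
    return np.stack(outputs), h

# A random 20-character input sequence, encoded 1-of-M.
xs = np.eye(M)[rng.integers(0, M, size=20)]
logits, h_T = mrnn_forward(xs, np.zeros(H), params)
print(logits.shape)   # (20, 86): one next-character prediction per timestep
```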
6. Qualitative experiments

In this section we qualitatively investigate the nature of the models learned by the MRNN.

6.1. Samples from the models

The simplest qualitative experiment is to inspect the samples generated by the three MRNNs. The most salient characteristic of the samples is the richness of their vocabularies. Further inspection reveals that the text is mostly grammatical, and that parentheses are usually balanced over many characters. The artifacts of the generated text, such as consecutive commas or quotes, are the result of the data preprocessing and are frequently found in the training set.

6.1.1. Samples from the Wikipedia model

We now present a sample from the Wikipedia model. We use ? to indicate the "unknown" character. The sample below was obtained by running the MRNN less than 10 times and selecting the most intriguing sample. The beginning of the paragraph and the parentheses near the end are particularly interesting. The MRNN was initialized with the phrase "The meaning of life is":

The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. In the show's agreement unanimously resurfaced. The wild pasteured with consistent street forests were incorporated by the 15th century BE. In 1996 the primary rapford undergoes an effort that the reserve conditioning, written into Jewish cities, sleepers to incorporate the .St Eurasia that activates the population. Mar??a Nationale, Kelli, Zedlat-Dukastoe, Florendon, Ptu's thought is. To adapt in most parts of North America, the dynamic fairy Dan please believes, the free speech are much related to the

6.1.2. Samples from the NYT model

Below is a sample from the model trained on the full NYT dataset, where the MRNN was initialized with a single space. The spaces surrounding the punctuation are an artifact of the preprocessing:

while he was giving attention to the second advantage of school building a 2-for-2 stool killed by the Cultures saddled with a half-suit defending the Bharatiya Fernall 's office . Ms . Claire Parters will also have a history temple for him to raise jobs until naked Prodiena to paint baseball partners , provided people to ride both of Manhattan in 1978 , but what was largely directed to China in 1946 , focusing on the trademark period is the sailboat yesterday and comments on whom they obtain overheard within the 120th anniversary , where many civil rights defined , officials said early that forms , " said Bernard J. Marco Jr. of Pennsylvania , was monitoring New York

6.1.3. Samples from the ML model

Finally, we generate text from an MRNN trained on the ML corpus conditioned on the string "Recurrent". This MRNN is also able to balance parentheses (e.g., the third line of the sample):

Recurrent network with the Stiefel information for logistic regression methods Along with either of the algorithms previously (two or more skew precision) is more similar to the model with the same average mismatched graph. Though this task is to be studied under the reward transform, such as (c) and (C) from the training set, based on target activities for articles a ? 2(6) and (4.3). The PHDPic (PDB) matrix of cav'va using the three relevant information contains for tieming measurements. Moreover, because of the theraptor, the aim is to improve the score to the best patch randomly, but for each initially four data sets. As shown in Figure 11, it is more than 100 steps, we used ?? ntoninfty with 1000

6.2. Structured sentence completion

In this section, we investigate the MRNN's response in various situations by sampling from the MRNN's distribution conditioned on a prefix. The goal is to see whether the MRNN is able to generate "plausible" continuations to the initial strings.
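The completion experiments below condition the MRNN on a prefix and then sample one character at a time from the model's predictive distribution. A rough sketch of that procedure is given here; it reuses the assumed parameter dictionary and update step from the sketch above, plus hypothetical char_to_ix / ix_to_char mappings, and is not the authors' code.

```python
import numpy as np

def sample_continuation(prefix, p, char_to_ix, ix_to_char, n_chars=100, seed=0):
    """Condition the MRNN on `prefix`, then sample `n_chars` further characters."""
    rng = np.random.default_rng(seed)
    M, H = p["W_oh"].shape

    def step(h, ix):
        x = np.zeros(M)
        x[ix] = 1.0
        f = (p["W_fx"] @ x) * (p["W_fh"] @ h)        # eq. 7
        h = np.tanh(p["W_hf"] @ f + p["W_hx"] @ x)   # eq. 8
        return h, p["W_oh"] @ h + p["b_o"]           # eq. 9

    h, logits = np.zeros(H), None
    for ch in prefix:                                # feed the prefix to set the hidden state
        h, logits = step(h, char_to_ix[ch])
    out = []
    for _ in range(n_chars):                         # sample the continuation character by character
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        ix = int(rng.choice(M, p=probs))
        out.append(ix_to_char[ix])
        h, logits = step(h, ix)
    return prefix + "".join(out)
```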
In our first experiment, we use the Wikipedia MRNN to complete the string "England, Spain, France, Germany,":

England, Spain, France, Germany, and Massachusetts.
England, Spain, France, Germany, cars, and direct schools
England, Spain, France, Germany, , or New Orleans and Uganda.
England, Spain, France, Germany, , Westchester, Jet State, Springfield, Athleaves and Sorvinhee

In the above completions, the MRNN correctly interpreted the string to be a list of locations, so the generated text was also a part of a list.

Next, we performed a similar experiment using the ML model and the pair of strings "(ABC et al" and "ABC et al". The system has never seen the string "(ABC et al" in its training set (simply because there is no machine learning author named ABC, and its capitalization is particularly uncommon for a citation), so the MRNN needed to generalize over an entirely new author name:

(ABC et al., 2003), ?13?, and for a supervised Mann-¡Whitnaguing (ABC et al., 2002), based on Lebanon and Haussler, 1995b)

ABC et al. (2003b), or Penalization of Information ABC et al. (2008) can be evaluated and motivated by providing optimal estimate

This example shows that the MRNN is sensitive to the initial bracket before "ABC", illustrating its representational power. The above effect is extremely robust. In contrast, both N-gram models and the sequence memoizer cannot make such predictions unless these exact strings (e.g., "(ABC et al., 2003)") occur in the training set, which cannot be counted on. In fact, any method which is based on precise context matches is fundamentally incapable of utilizing long contexts, because the probability that a long context occurs more than once is vanishingly small. We experimentally verified that neither the sequence memoizer nor PAQ is sensitive to the initial bracket.

7. Discussion

Modeling language at the character level seems unnecessarily difficult because we already know that morphemes are the appropriate units for making semantic and syntactic predictions. Converting large databases into sequences of morphemes, however, is non-trivial compared with treating them as character strings. Also, learning which character strings make words is a relatively easy task compared with discovering the subtleties of semantic and syntactic structure. So, given a powerful learning system like an MRNN, the convenience of using characters may outweigh the extra work of having to learn the words. All our experiments show that an MRNN finds it very easy to learn words. With the exception of proper names, the generated text contains very few non-words. At the same time, the MRNN also assigns probability to (and occasionally generates) plausible words that do not appear in the training set (e.g., "cryptoliation", "homosomalist", or "un-ameliary"). This is a desirable property which enabled the MRNN to gracefully deal with real words that it nonetheless didn't see in the training set. Predicting the next word by making a sequence of character predictions avoids having to use a huge softmax over all known words, and this is so advantageous that some word-level language models actually make up binary "spellings" of words so that they can predict them one bit at a time (Mnih & Hinton, 2009).

MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. If we could train much bigger MRNNs with millions of units and billions of connections, it is possible that brute force alone would be sufficient to achieve an even higher standard of performance. But this will of course require considerably more computational power.

Acknowledgements

This work was supported by a Google fellowship and NSERC. The experiments were implemented with software packages by Tieleman (2010) and Mnih (2009).
References

Bell, R. M., Koren, Y., and Volinsky, C. The BellKor solution to the Netflix prize. KorBell Team's Report to Netflix, 2007.

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Gasthaus, J., Wood, F., and Teh, Y. W. Lossless compression based on the Sequence Memoizer. In Data Compression Conference (DCC), 2010, pp. 337–345. IEEE, 2010.

Graves, A. and Schmidhuber, J. Offline handwriting recognition with multidimensional recurrent neural networks. Advances in Neural Information Processing Systems, 21, 2009.

Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München, 1991.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667.

Hutter, M. The Human Knowledge Compression Prize, 2006.

Jaeger, H. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371–1398, 2000.

Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78, 2004.

Mahoney, M. Adaptive weighing of context models for lossless data compression. Florida Inst. Technol., Melbourne, FL, Tech. Rep. CS-2005-16, 2005.

Martens, J. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Martens, J. and Sutskever, I. Training recurrent neural networks with Hessian-free optimization. ICML 2011, 2011.

Mikolov, T., Karafiát, M., Burget, L., Cernocky, J., and Khudanpur, S. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

Mnih, A. and Hinton, G. A scalable hierarchical distributed language model. Advances in Neural Information Processing Systems, 21:1081–1088, 2009.

Mnih, V. CUDAMat: a CUDA-based matrix class for Python. Technical Report UTML TR 2009-004, Department of Computer Science, University of Toronto, November 2009.

Murphy, K. P. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, Citeseer, 2002.

Pollastri, G., Przybylski, D., Rost, B., and Baldi, P. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Bioinformatics, 47(2):228–235, 2002.

Rissanen, J. and Langdon, G. G. Arithmetic coding. IBM Journal of Research and Development, 23(2):149–162, 1979.

Robinson, A. J. An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5(2):298–305, 2002. ISSN 1045-9227.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Sandhaus, E. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 2008.

Taylor, G. W. and Hinton, G. E. Factored conditional restricted Boltzmann machines for modeling motion style. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1025–1032. ACM, 2009.

Tieleman, T. Gnumpy: an easy way to use GPU boards in Python. Technical Report UTML TR 2010-002, University of Toronto, Department of Computer Science, 2010.

Ward, D. J., Blackwell, A. F., and MacKay, D. J. C. Dasher: a data entry interface using continuous gestures and language models. In Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology, pp. 129–137. ACM, 2000.

Werbos, P. J. Backpropagation through time: What it is and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

Wood, F., Archambeau, C., Gasthaus, J., James, L., and Teh, Y. W. A stochastic memoizer for sequence data. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1129–1136. ACM, 2009.