Generating Text with Recurrent Neural Networks

Ilya Sutskever (ILYA@CS.UTORONTO.CA), James Martens, Geoffrey Hinton
University of Toronto, Toronto, ON M5S 3G4, Canada

Abstract: Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization ...


…different nodes to share knowledge. For example, the character string "ing" is quite probable after "fix" and also quite probable after "break". If the hidden state vectors that represent the two histories "fix" and "break" share a common representation of the fact that this could be the stem of a verb, then this common representation can be acted upon by the character "i" to produce a hidden state that predicts an "n". For this to be a good prediction we require the conjunction of the verb-stem representation in the previous hidden state and the character "i". One or other of these alone does not provide half as much evidence for predicting an "n": it is their conjunction that is important. This strongly suggests that we need a multiplicative interaction. To achieve this goal we modify the RNN so that its hidden-to-hidden weight matrix is a (learned) function of the current input x_t:

    h_t = tanh(W_hx x_t + W_hh^(x_t) h_{t-1} + b_h)    (3)
    o_t = W_oh h_t + b_o                                (4)

These are identical to eqs. 1 and 2, except that W_hh is replaced with W_hh^(x_t), allowing each character to specify a different hidden-to-hidden weight matrix.

It is natural to define W_hh^(x_t) using a tensor. If we store M matrices, W_hh^(1), ..., W_hh^(M), where M is the number of dimensions of x_t, we could define W_hh^(x_t) by the equation

    W_hh^(x_t) = sum_{m=1}^{M} x_t^(m) W_hh^(m)    (5)

where x_t^(m) is the m-th coordinate of x_t. When the input x_t is a 1-of-M encoding of a character, it is easily seen that every character has an associated weight matrix and W_hh^(x_t) is the matrix assigned to the character represented by x_t.²

² The above model, applied to discrete inputs represented with their 1-of-M encodings, is the nonlinear version of the Observable Operator Model (OOM; Jaeger, 2000), whose linear nature makes it closely related to an HMM in terms of expressive power.
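To make eqs. (3)-(5) concrete, the following is a minimal NumPy sketch of one step of the unfactored tensor RNN; the sizes, random initialization, and the name tensor_rnn_step are illustrative assumptions, not the authors' implementation. With a 1-of-M input, the sum in eq. (5) simply selects the hidden-to-hidden matrix associated with the current character.

import numpy as np

# Illustrative sizes: H hidden units, M input dimensions (vocabulary size).
H, M = 8, 5
rng = np.random.default_rng(0)

W_hx = 0.1 * rng.standard_normal((H, M))   # input-to-hidden weights
W_oh = 0.1 * rng.standard_normal((M, H))   # hidden-to-output weights
b_h = np.zeros(H)
b_o = np.zeros(M)
# One hidden-to-hidden matrix per input dimension: W_hh^(1), ..., W_hh^(M).
W_hh = 0.1 * rng.standard_normal((M, H, H))

def tensor_rnn_step(x_t, h_prev):
    """One step of eqs. (3)-(4) using the input-dependent matrix of eq. (5)."""
    # eq. (5): W_hh^(x_t) = sum_m x_t^(m) W_hh^(m); for a 1-of-M x_t this
    # just picks out the matrix assigned to the current character.
    W_hh_xt = np.tensordot(x_t, W_hh, axes=1)            # shape (H, H)
    h_t = np.tanh(W_hx @ x_t + W_hh_xt @ h_prev + b_h)   # eq. (3)
    o_t = W_oh @ h_t + b_o                               # eq. (4)
    return h_t, o_t

x_t = np.eye(M)[2]                 # 1-of-M encoding of character index 2
h_t, o_t = tensor_rnn_step(x_t, np.zeros(H))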
3.2. The Multiplicative RNN

The above scheme, while appealing, has a major drawback: fully general 3-way tensors are not practical because of their size. In particular, if we want to use RNNs with a large number of hidden units (say, 1000) and if the dimensionality of x_t is even moderately large, then the storage required for the tensor W_hh^(x_t) becomes prohibitive.

It turns out we can remedy the above problem by factoring the tensor W_hh^(x) (e.g., Taylor & Hinton, 2009). This is done by introducing the three matrices W_fx, W_hf, and W_fh, and reparameterizing the matrix W_hh^(x_t) by the equation

    W_hh^(x_t) = W_hf diag(W_fx x_t) W_fh    (6)

Figure 3. The Multiplicative Recurrent Neural Network "gates" the recurrent weight matrix with the input symbol. Each triangle symbol represents a factor that applies a learned linear filter at each of its two input vertices. The product of the outputs of these two linear filters is then sent, via weighted connections, to all the units connected to the third vertex of the triangle. Consequently every input can synthesize its own hidden-to-hidden weight matrix by determining the gains on all of the factors, each of which represents a rank-one hidden-to-hidden weight matrix defined by the outer product of its incoming and outgoing weight vectors to the hidden units. The synthesized weight matrices share "structure" because they are all formed by blending the same set of rank-one matrices. In contrast, an unconstrained tensor model ensures that each input has a completely separate weight matrix.

If the dimensionality of the vector W_fx x_t, denoted by F, is sufficiently large, then the factorization is as expressive as the original tensor. Smaller values of F require fewer parameters while hopefully retaining a significant fraction of the tensor's expressive power.

The Multiplicative RNN (MRNN) is the result of factorizing the Tensor RNN by expanding eq. 6 within eq. 3. The MRNN computes the hidden state sequence (h_1, ..., h_T), an additional "factor state sequence" (f_1, ..., f_T), and the output sequence (o_1, ..., o_T) by iterating the equations

    f_t = diag(W_fx x_t) W_fh h_{t-1}       (7)
    h_t = tanh(W_hf f_t + W_hx x_t)         (8)
    o_t = W_oh h_t + b_o                    (9)

which implement the neural network in fig. 3. The tensor factorization of eq. 6 has the interpretation of an additional layer of multiplicative units between each pair of consecutive layers (i.e., the triangles in fig. 3), so the MRNN actually has two steps of nonlinear processing in its hidden states for every input timestep. Each of the multiplicative units outputs the value f_t of eq. 7, which is the product of the outputs of the two linear filters connecting the multiplicative unit to the previous hidden states and to the inputs.

We experimentally verified the advantage of the MRNN over the RNN when the two have the same number of parameters. We trained an RNN with 500 hidden units and an MRNN with 350 hidden units and 350 factors (so the RNN has slightly more parameters) on the "machine learning" dataset (dataset 3 in the experimental section). After extensive training, the MRNN achieved 1.56 bits per character and the RNN achieved 1.65 bits per character on the …
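The factored step of eqs. (7)-(9) is equally short to write down. Below is a minimal NumPy sketch under the same illustrative assumptions as the sketch above (names and sizes are not the authors' code); synthesized_W_hh shows that the factored parameterization of eq. (6) can still be expanded into a full per-input matrix if desired.

import numpy as np

# Illustrative sizes: H hidden units, M input dimensions, F factors.
H, M, F = 8, 5, 6
rng = np.random.default_rng(0)

W_fx = 0.1 * rng.standard_normal((F, M))   # input -> factor gains
W_fh = 0.1 * rng.standard_normal((F, H))   # previous hidden state -> factors
W_hf = 0.1 * rng.standard_normal((H, F))   # factors -> hidden state
W_hx = 0.1 * rng.standard_normal((H, M))
W_oh = 0.1 * rng.standard_normal((M, H))
b_o = np.zeros(M)

def mrnn_step(x_t, h_prev):
    """One MRNN step: the input gates the rank-one factors of W_hh^(x_t)."""
    f_t = (W_fx @ x_t) * (W_fh @ h_prev)     # eq. (7): diag(W_fx x_t) W_fh h_{t-1}
    h_t = np.tanh(W_hf @ f_t + W_hx @ x_t)   # eq. (8)
    o_t = W_oh @ h_t + b_o                   # eq. (9)
    return h_t, o_t

def synthesized_W_hh(x_t):
    """eq. (6): the hidden-to-hidden matrix as a blend of F rank-one matrices."""
    return W_hf @ np.diag(W_fx @ x_t) @ W_fh   # shape (H, H)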
6. Qualitative experiments

In this section we qualitatively investigate the nature of the models learned by the MRNN.

6.1. Samples from the models

The simplest qualitative experiment is to inspect the samples generated by the three MRNNs. The most salient characteristic of the samples is the richness of their vocabularies. Further inspection reveals that the text is mostly grammatical, and that parentheses are usually balanced over many characters. The artifacts of the generated text, such as consecutive commas or quotes, are the result of the data preprocessing and are frequently found in the training set.

6.1.1. Samples from the Wikipedia model

We now present a sample from the Wikipedia model. We use ? to indicate the "unknown" character. The sample below was obtained by running the MRNN fewer than 10 times and selecting the most intriguing sample. The beginning of the paragraph and the parentheses near the end are particularly interesting. The MRNN was initialized with the phrase "The meaning of life is":

The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. In the show's agreement unanimously resurfaced. The wild pastured with consistent street forests were incorporated by the 15th century BE. In 1996 the primary rapford undergoes an effort that the reserve conditioning, written into Jewish cities, sleepers to incorporate the .St Eurasia that activates the population. Mar??a Nationale, Kelli, Zedlat-Dukastoe, Florendon, Ptu's thought is. To adapt in most parts of North America, the dynamic fairy Dan please believes, the free speech are much related to the

6.1.2. Samples from the NYT model

Below is a sample from the model trained on the full NYT dataset, where the MRNN was initialized with a single space. The spaces surrounding the punctuation are an artifact of the preprocessing.

while he was giving attention to the second advantage of school building a 2-for-2 stool killed by the Cultures saddled with a half-suit defending the Bharatiya Fernall's office. Ms. Claire Parters will also have a history temple for him to raise jobs until naked Prodiena to paint baseball partners, provided people to ride both of Manhattan in 1978, but what was largely directed to China in 1946, focusing on the trademark period is the sailboat yesterday and comments on whom they obtain overheard within the 120th anniversary, where many civil rights defined, officials said early that forms, " said Bernard J. Marco Jr. of Pennsylvania, was monitoring New York

6.1.3. Samples from the ML model

Finally, we generate text from an MRNN trained on the ML corpus conditioned on the string "Recurrent". This MRNN is also able to balance parentheses (e.g., the third line of the sample):

Recurrent network with the Stiefel information for logistic regression methods Along with either of the algorithms previously (two or more skew precision) is more similar to the model with the same average mismatched graph. Though this task is to be studied under the reward transform, such as (c) and (C) from the training set, based on target activities for articles a ?2(6) and (4.3). The PHDPic (PDB) matrix of cav'va using the three relevant information contains for tieming measurements. Moreover, because of the theraptor, the aim is to improve the score to the best patch randomly, but for each initially four data sets. As shown in Figure 11, it is more than 100 steps, we used ??ntoninfty with 1000

6.2. Structured sentence completion

In this section, we investigate the MRNN's response in various situations by sampling from the MRNN's distribution conditioned on a prefix. The goal is to see whether the MRNN is able to generate "plausible" continuations to the initial strings. In our first experiment, we use the Wikipedia MRNN to complete the string "England, Spain, France, Germany,":

England, Spain, France, Germany, and Massachusetts.
England, Spain, France, Germany, cars, and direct schools
England, Spain, France, Germany, , or New Orleans and Uganda.
England, Spain, France, Germany, , Westchester, Jet State, Springfield, Athleaves and Sorvinhee

In the above completions, the MRNN correctly interpreted the string to be a list of locations, so the generated text was also a part of a list.
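These completions amount to running the trained model over the prefix characters and then drawing one character at a time from its predictive distribution. The following is a minimal sketch of such a prefix-conditioned sampling loop; it assumes a generic step function (for example the mrnn_step sketch above), a character list chars, and a softmax over the output vector, all of which are illustrative assumptions rather than the authors' code.

import numpy as np

def sample_continuation(step, chars, prefix, h0, length=100, seed=None):
    """Condition a character-level RNN on `prefix`, then sample `length` characters."""
    assert prefix, "a non-empty prefix is required"
    rng = np.random.default_rng(seed)
    index = {c: i for i, c in enumerate(chars)}
    eye = np.eye(len(chars))

    # Feed the prefix through the model, keeping the final hidden state and
    # the final output vector (the scores for the next character).
    h = h0
    for c in prefix:
        h, o = step(eye[index[c]], h)

    generated = []
    for _ in range(length):
        p = np.exp(o - o.max())
        p /= p.sum()                         # softmax over next-character scores
        i = rng.choice(len(chars), p=p)      # sample the next character
        generated.append(chars[i])
        h, o = step(eye[i], h)               # advance the model with the sample
    return prefix + "".join(generated)

Calling such a loop with the prefix "England, Spain, France, Germany," is the kind of procedure that produces completions like those listed above.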
Next, we performed a similar experiment using the ML model and the pair of strings "(ABC et al" and "ABC et al". The system has never seen the string "(ABC et al" in its training set (simply because there is no machine learning author named ABC, and its capitalization is particularly uncommon for a citation), so the MRNN needed to generalize over an entirely new author name:

(ABC et al., 2003), ?13?, and for a supervised Mann-Whitnaguing
(ABC et al., 2002), based on Lebanon and Haussler, 1995b)
ABC et al. (2003b), or Penalization of Information
ABC et al. (2008) can be evaluated and motivated by providing optimal estimate

This example shows that the MRNN is sensitive to the initial bracket before "ABC", illustrating its representational power. The above effect is extremely robust. In contrast, both N-gram models and the sequence memoizer cannot make such predictions unless these exact strings (e.g., "(ABC et al., 2003)") occur in the training set, which cannot be counted on. In fact, any method which is based on precise context matches is fundamentally incapable of utilizing long contexts, because the probability that a long context occurs more than once is vanishingly small. We experimentally verified that neither the sequence memoizer nor PAQ is sensitive to the initial bracket.

7. Discussion

Modeling language at the character level seems unnecessarily difficult because we already know that morphemes are the appropriate units for making semantic and syntactic predictions. Converting large databases into sequences of morphemes, however, is non-trivial compared with treating them as character strings. Also, learning which character strings make words is a relatively easy task compared with discovering the subtleties of semantic and syntactic structure. So, given a powerful learning system like an MRNN, the convenience of using characters may outweigh the extra work of having to learn the words. All our experiments show that an MRNN finds it very easy to learn words. With the exception of proper names, the generated text contains very few non-words. At the same time, the MRNN also assigns probability to (and occasionally generates) plausible words that do not appear in the training set (e.g., "cryptoliation", "homosomalist", or "un-ameliary"). This is a desirable property which enabled the MRNN to gracefully deal with real words that it nonetheless didn't see in the training set. Predicting the next word by making a sequence of character predictions avoids having to use a huge softmax over all known words, and this is so advantageous that some word-level language models actually make up binary "spellings" of words so that they can predict them one bit at a time (Mnih & Hinton, 2009).

MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. If we could train much bigger MRNNs with millions of units and billions of connections, it is possible that brute force alone would be sufficient to achieve an even higher standard of performance. But this will of course require considerably more computational power.

Acknowledgements

This work was supported by a Google fellowship and NSERC. The experiments were implemented with software packages by Tieleman (2010) and Mnih (2009).

References

Bell, R. M., Koren, Y., and Volinsky, C. The BellKor solution to the Netflix prize. KorBell Team's Report to Netflix, 2007.

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Gasthaus, J., Wood, F., and Teh, Y. W. Lossless compression based on the Sequence Memoizer. In Data Compression Conference (DCC), 2010, pp. 337–345. IEEE, 2010.

Graves, A. and Schmidhuber, J. Offline handwriting recognition with multidimensional recurrent neural networks. Advances in Neural Information Processing Systems, 21, 2009.

Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München, 1991.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667.

Hutter, M. The Human Knowledge Compression Prize, 2006.

Jaeger, H. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371–1398, 2000.

Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78, 2004.

Mahoney, M. Adaptive weighing of context models for lossless data compression. Florida Inst. Technol., Melbourne, FL, Tech. Rep. CS-2005-16, 2005.

Martens, J. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Martens, J. and Sutskever, I. Training Recurrent Neural Networks with Hessian-free optimization. ICML 2011, 2011.

Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., and Khudanpur, S. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

Mnih, A. and Hinton, G. A scalable hierarchical distributed language model. Advances in Neural Information Processing Systems, 21:1081–1088, 2009.

Mnih, Volodymyr. CUDAMat: a CUDA-based matrix class for Python. Technical Report UTML TR 2009-004, Department of Computer Science, University of Toronto, November 2009.

Murphy, K. P. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, Citeseer, 2002.

Pollastri, G., Przybylski, D., Rost, B., and Baldi, P. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Bioinformatics, 47(2):228–235, 2002.

Rissanen, J. and Langdon, G. G. Arithmetic coding. IBM Journal of Research and Development, 23(2):149–162, 1979.

Robinson, A. J. An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5(2):298–305, 2002. ISSN 1045-9227.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Sandhaus, E. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 2008.

Taylor, G. W. and Hinton, G. E. Factored conditional restricted Boltzmann machines for modeling motion style. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1025–1032. ACM, 2009.

Tieleman, T. Gnumpy: an easy way to use GPU boards in Python. Technical Report UTML TR 2010-002, University of Toronto, Department of Computer Science, 2010.

Ward, D. J., Blackwell, A. F., and MacKay, D. J. C. Dasher: a data entry interface using continuous gestures and language models. In Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology, pp. 129–137. ACM, 2000.

Werbos, P. J. Backpropagation through time: What it is and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

Wood, F., Archambeau, C., Gasthaus, J., James, L., and Teh, Y. W. A stochastic memoizer for sequence data. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1129–1136. ACM, 2009.