Recurrent Neural Networks (RNNs) have the ability, in theory, to cope with these temporal dependencies by virtue of the short-term memory implemented by their recurrent (feedback) connections. However, in practice they are difficult to train successfully when long-term memory is required.
Figure 1. CW-RNN architecture is similar to a simple RNN with an input, output and hidden layer. The hidden layer is partitioned into g modules, each with its own clock rate. Within each module the neurons are fully interconnected. Neurons in faster module i are connected to neurons in a slower module j only if a clock period T_i < T_j.

2. Related Work

Contributions to sequence modeling and recognition that are relevant to CW-RNNs are presented in this section. The primary focus is on RNN extensions that deal with the problem of bridging long time lags.

Hierarchical Recurrent Neural Networks (Hihi & Bengio, 1996) are the work most similar to CW-RNNs, having both delayed connections and units operating at different time-scales. Five hand-designed architectures were tested on simple long time-lag toy problems, finding that multi-time-scale units significantly help in learning those tasks.

Another similar model is the NARX RNN¹ (Lin et al., 1996). But, instead of simplifying the network, it introduces additional sets of recurrent connections with time lags of 2, 3, ..., k time steps. These additional connections help to bridge long time lags, but introduce many additional parameters that make NARX RNN training more difficult and run k times slower.

¹NARX stands for Non-linear Auto-Regressive model with eXogenous inputs.

Long Short-Term Memory (LSTM; Hochreiter & Schmidhuber, 1997) uses a specialized architecture that allows information to be stored in linear units called constant error carousels (CECs) indefinitely. CECs are contained in cells that have a set of multiplicative units (gates) connected to other cells that regulate when new information enters the CEC (input gate), when the activation of the CEC is output to the rest of the network (output gate), and when the activation decays or is forgotten (forget gate). These networks have been very successful recently in speech and handwriting recognition (Graves et al., 2005; 2009; Sak et al., 2014).

Stacking LSTMs (Fernandez et al., 2007; Graves & Schmidhuber, 2009) into a hierarchical sequence processor, equipped with Connectionist Temporal Classification (CTC; Graves et al., 2006), performs simultaneous segmentation and recognition of sequences. Its deep variant currently holds the state-of-the-art result in phoneme recognition on the TIMIT database (Graves et al., 2013).

Temporal Transition Hierarchies (TTHs; Ring, 1993) incrementally add high-order neurons in order to build a memory that is used to disambiguate an input at the current time step. This approach can, in principle, bridge time intervals of any length, but at a cost of a proportional growth in network size. The model was recently improved by adding recurrent connections (Ring, 2011) that reduce the number of high-order neurons needed to encode the temporal dependencies.

One of the earliest attempts to enable RNNs to handle long-term dependencies is the Reduced Description Network (Mozer, 1992; 1994). It uses leaky neurons whose activation changes only by a small amount in response to its inputs. This technique was recently picked up by Echo State Networks (ESN; Jaeger, 2002).

A similar technique has been used by Sutskever & Hinton (2010) to solve some serial recall tasks. These Temporal-Kernel RNNs add a weighted connection from each neuron to itself that decays exponentially in time. Its performance is still slightly inferior to LSTM.

Evolino (Schmidhuber et al., 2005; 2007) feeds the input
to an RNN (which can be e.g. LSTM to cope with long time lags) and then transforms the RNN outputs to the target sequences via an optimal linear mapping that is computed analytically by pseudo-inverse. The RNN is trained by an evolutionary algorithm, and therefore does not suffer from the vanishing gradient problem. Evolino outperformed LSTM on a set of synthetic problems and was used to perform complex robotic manipulation (Mayer et al., 2006).

A modern theory of why RNNs fail to learn long-term dependencies is that simple gradient descent fails to optimize them correctly. One attempt to mitigate this problem is Hessian-Free (HF) optimization (Martens & Sutskever, 2011), an adapted second-order training method that has been demonstrated to work well with RNNs. HF allows RNNs to solve some long-term lag problems that are considered challenging for stochastic gradient descent. Their performance on rather synthetic, long-term memory benchmarks is approaching that of LSTM, though the number of optimization steps required in HF-RNN is usually greater. Training networks by HF optimization is an approach orthogonal to the network architecture, so both LSTM and CW-RNN can still benefit from it.

HF optimization allowed for training of the Multiplicative RNN (MRNN; Sutskever et al., 2011), which ports the concept of multiplicative gating units to SRNs. The gating units are represented by a factored 3-way tensor in order to reduce the number of parameters. Extensive training of an MRNN on a GPU provided impressive results in text generation.

Training RNNs with Kalman filters (Pérez-Ortiz et al., 2003; Williams, 1992) has shown advantages in bridging long time lags as well, although this approach is computationally unfeasible for larger networks.

The methods mentioned above are strictly synchronous: all elements of the network are clocked at the same speed. The Sequence Chunker, Neural History Compressor or Hierarchical Temporal Memory (Schmidhuber, 1991; 1992) consists of a hierarchy or stack of RNNs that may run at different time scales, but, unlike the simpler CW-RNN, it requires unsupervised event predictors: a higher-level RNN receives an input only when the lower-level RNN below is unable to predict it. Hence the clock of the higher level may speed up or slow down, depending on the current predictability of the input stream. This contrasts with the CW-RNN, in which the clocks always run at the same speed, some slower, some faster.

3. A Clockwork Recurrent Neural Network

Clockwork Recurrent Neural Networks (CW-RNN), like SRNs, consist of input, hidden and output layers. There are forward connections from the input to hidden layer, and from the hidden to output layer, but, unlike the SRN,
the neurons in the hidden layer are partitioned into g modules of size k. Each of the modules is assigned a clock period T_n ∈ {T_1, ..., T_g}. Each module is internally fully-interconnected, but the recurrent connections from module j to module i exist only if the period T_i is smaller than the period T_j. Sorting the modules by increasing period, the connections between modules propagate the hidden state right-to-left, from slower modules to faster modules; see Figure 1.

Figure 2. Calculation of the hidden unit activations at time step t = 6 in CW-RNN according to Equation (1). Input and recurrent weight matrices are partitioned into blocks. Each block-row in W_H and W_I corresponds to the weights of a particular module. At time step t = 6, the first two modules with periods T_1 = 1 and T_2 = 2 get evaluated (highlighted parts of W_H and W_I are used) and the highlighted outputs are updated. Note that, while using exponential series of periods, the active parts of W_H and W_I are always contiguous.

The SRN output, y_O^{(t)}, at a time step t is calculated using the following equations:

$$y_H^{(t)} = f_H\!\left(W_H \cdot y_H^{(t-1)} + W_I \cdot x^{(t)}\right), \qquad (1)$$

$$y_O^{(t)} = f_O\!\left(W_O \cdot y_H^{(t)}\right), \qquad (2)$$

where W_H, W_I and W_O are the hidden, input and output weight matrices, x^{(t)} is the input vector, and y_H^{(t)} is a vector representing the activation of the hidden units at time step t. Functions f_H(.) and f_O(.) are non-linear activation functions. For simplicity, neuron biases are omitted from the equations.

The main difference between CW-RNN and a SRN is that at each CW-RNN time step t, only the outputs of modules i that satisfy (t MOD T_i) = 0 are active. The choice of the set of periods {T_1, ..., T_g} is arbitrary. In this paper, we use the exponential series of periods: module i has a clock period of T_i = 2^{i-1}.

Matrices W_H and W_I are partitioned into g block-rows:

$$W_H = \begin{pmatrix} W_{H_1} \\ \vdots \\ W_{H_g} \end{pmatrix}, \qquad W_I = \begin{pmatrix} W_{I_1} \\ \vdots \\ W_{I_g} \end{pmatrix}, \qquad (3)$$

and W_H is a block-upper triangular matrix, where each block-row, W_{H_i}, is partitioned into block-columns {0_1, ..., 0_{i-1}, W_{H_{i,i}}, ..., W_{H_{i,g}}}.

Figure 3. Normalized mean squared error for the sequence generation task, with one box-whisker (showing mean value, 25% and 75% quantiles, minimum, maximum and outliers) for each tested network size and type. Note that the plot for the SRN has a different scale than the other two.

At each time step of a forward pass, only the block-rows of W_H and W_I that correspond to the active modules are used for evaluation in Equation (1):

$$W_{H_i} = \begin{cases} W_{H_i} & \text{for } (t \;\mathrm{MOD}\; T_i) = 0 \\ 0 & \text{otherwise,} \end{cases} \qquad (4)$$

and the corresponding parts of the output vector, y_H, are updated. The other modules retain their output values from the previous time-step. Calculation of the hidden activation at time step t = 6 is illustrated in Figure 2.

As a result, the low-clock-rate modules process, retain and output the long-term information obtained from the input sequences (not being distracted by the high-speed modules), whereas the high-speed modules focus on the local, high-frequency information (having the context provided by the low-speed modules available).

The backward pass of the error propagation is similar to SRN as well. The only difference is that the error propagates only from modules that were executed at time step t. The error of non-activated modules gets copied back in time (similarly to copying the activations of nodes not activated at time step t during the corresponding forward pass), where it is added to the back-propagated error.
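To make the update rule of Equations (1)–(4) concrete, the following sketch performs one forward pass over a sequence. It is a minimal NumPy illustration, not the authors' implementation: the function name, the tanh hidden activation (as used in the experiments), the linear output, and the omission of biases are assumptions made for brevity; the exponential periods T_i = 2^{i-1} and the block-upper-triangular masking follow the text above.

```python
import numpy as np

def cw_rnn_forward(x_seq, W_I, W_H, W_O, k, g):
    """Forward pass of a Clockwork RNN (illustrative sketch, biases omitted).

    x_seq : (T, n_in) input sequence
    W_I   : (n, n_in) input weights, where n = k * g hidden units
    W_H   : (n, n)    block-upper-triangular recurrent weights
    W_O   : (n_out, n) output weights
    """
    n = k * g
    periods = [2 ** i for i in range(g)]          # exponential series T_i = 2**(i-1)
    y_H = np.zeros(n)                             # hidden state, initialized to 0
    outputs = []
    for t, x in enumerate(x_seq, start=1):
        # A module is active at step t only if (t MOD T_i) == 0, cf. Eq. (4).
        active = [i for i, T in enumerate(periods) if t % T == 0]
        new_y_H = y_H.copy()                      # inactive modules keep their previous output
        for i in active:
            rows = slice(i * k, (i + 1) * k)      # block-row of module i
            # Module i only receives from modules j with T_j >= T_i (columns i*k onward).
            pre = W_H[rows, i * k:] @ y_H[i * k:] + W_I[rows] @ x
            new_y_H[rows] = np.tanh(pre)          # Eq. (1) restricted to active block-rows
        y_H = new_y_H
        outputs.append(W_O @ y_H)                 # Eq. (2), with a linear f_O for simplicity
    return np.array(outputs), y_H
```

For instance, with g = 4 modules the fastest module (T_1 = 1) updates at every step, while at t = 8 all four modules tick at once; at odd time steps only the first module's block-row of W_H and W_I is touched, which is where the speed-up discussed below comes from.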
Figure 4. Classification error for the word classification task, with one box-whisker for each tested network size and type.

A CW-RNN with the same number of nodes runs much faster than a SRN since not all modules are evaluated at every time step. The lower bound for the CW-RNN speedup compared to a SRN with the same number of neurons is g/4 in the case of this exponential clock setup; see the Appendix for a detailed derivation.

4. Experiments

CW-RNNs were compared to the SRN and LSTM networks. All networks have one hidden layer with the tanh activation function, and the number of nodes in the hidden layer was chosen to obtain (approximately) the same number of parameters for all three methods (in the case of CW-RNN, the clock periods were included in the parameter count). Initial values for all the weights were drawn from a Gaussian distribution with zero mean and standard deviation of 0.1. Initial values of all internal state variables for all hidden activations were set to 0. Each setup was run 100 times with different random initialization of parameters. All networks were trained using Stochastic Gradient Descent (SGD) with Nesterov-style momentum (Sutskever et al., 2013).

4.1. Sequence Generation

The goal of this task is to train a recurrent neural network, that receives no input, to generate a target sequence as accurately as possible. The weights of the network can be seen ...

... active may not affect their output. One could use multiple low-speed modules that run at the same clock periods but interleave with a different offset, employ a buffer for the sequence and pass the low-speed modules a function of this buffer (e.g. mean), or transform the data first into some other domain (e.g. wavelet) and let the CW-RNN process the transformed information.

Appendix

CW-RNN has fewer total parameters and even fewer operations per time step than a SRN with the same number of neurons. Assume a CW-RNN consists of g modules of size k for a total of n = kg neurons. Because a neuron is only connected to other neurons with the same or larger period, the number of parameters N_H for the recurrent matrix is:

$$N_H = \sum_{i=1}^{g} \sum_{j=1}^{k} k\,(g-i+1) = k^2 \sum_{i=0}^{g-1} (g-i) = \frac{n^2}{2} + \frac{nk}{2}.$$

Compared to the n² parameters in the recurrent matrix W_H of a SRN, this results in roughly half as many parameters:

$$\frac{N_H}{n^2} = \frac{\tfrac{n^2}{2} + \tfrac{nk}{2}}{n^2} = \frac{n^2 + nk}{2n^2} = \frac{n + k}{2n} = \frac{g + 1}{2g} \approx \frac{1}{2}.$$

Each module i is evaluated only every T_i-th time step, therefore the number of operations at a time step is:

$$O_H = k^2 \sum_{i=0}^{g-1} \frac{g-i}{T_i}.$$

For exponentially scaled periods, T_i = 2^i, the upper bound for the number of operations, O_H, needed for W_H per time step is:

$$O_H = k^2 \sum_{i=0}^{g-1} \frac{g-i}{2^i} = k^2 \Biggl( g \underbrace{\sum_{i=0}^{g-1} \frac{1}{2^i}}_{\le 2} \;-\; \underbrace{\sum_{i=0}^{g-1} \frac{i}{2^i}}_{\le 2} \Biggr) \le k^2 (2g - 2) \le 2nk,$$

because g ≥ 2, this is less than or equal to n². Recurrent operations in CW-RNN are therefore faster than in a SRN with the same number of neurons by a factor of at least g/2, which, for typical CW-RNN sizes, ends up being between 2 and 5.

Similarly, the upper bound for the number of input weight operations, O_I, with m inputs is:

$$O_I = \sum_{i=0}^{g-1} \frac{km}{T_i} = km \sum_{i=0}^{g-1} \frac{1}{T_i} \le 2km.$$

Therefore, the overall CW-RNN speed-up w.r.t. a SRN is:

$$\frac{n^2 + nm + n}{O_H + O_I + 2n} = \frac{k^2 g^2 + kgm + kg}{k^2(2g-2) + 2km + 2kg} = \frac{g\,(kg + m + 1)}{2\,(k(g-1) + m + g)} = \frac{g}{2}\,\underbrace{\frac{kg + m + 1}{k(g-1) + m + g}}_{\ge \frac{1}{2}} \ge \frac{g}{4}.$$

Note that this is a conservative lower bound.
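The counts above can be checked numerically. The short sketch below is an illustrative script, not part of the paper: the function name and the example sizes (g = 8 modules of k = 16 neurons, m = 10 inputs) are assumptions. It tallies the recurrent parameters N_H and the average per-step operations, and compares the resulting speed-up with the g/4 lower bound.

```python
from fractions import Fraction

def cw_rnn_counts(k, g, m):
    """Recurrent parameters and per-step speed-up of a CW-RNN vs. a SRN (sketch)."""
    n = k * g
    periods = [2 ** i for i in range(g)]                      # T_i = 2**i as in the appendix
    # Recurrent parameters: block-row i connects to modules i..g-1, i.e. k*(g-i) columns.
    N_H = sum(k * k * (g - i) for i in range(g))
    assert N_H == n * (n + k) // 2                            # matches n^2/2 + nk/2
    # Average operations per time step: module i is evaluated only every T_i-th step.
    O_H = sum(Fraction(k * k * (g - i), T) for i, T in enumerate(periods))
    O_I = sum(Fraction(k * m, T) for T in periods)
    srn_ops = n * n + n * m + n                               # SRN recurrent + input + output
    cw_ops = O_H + O_I + 2 * n
    return N_H, float(srn_ops / cw_ops)

# Example: the measured speed-up exceeds the conservative lower bound g/4 = 2.
N_H, speedup = cw_rnn_counts(k=16, g=8, m=10)
print("recurrent parameters:", N_H, "speed-up:", round(speedup, 2))
```

Using exact fractions avoids rounding in the 1/T_i terms; for these example sizes the measured ratio comes out above 4, comfortably over the g/4 = 2 bound, illustrating how conservative the bound is.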
Acknowledgments

This research was supported by Swiss National Science Foundation grant #138219: Theory and Practice of Reinforcement Learning 2, and the EU FP7 projects NanoBioTouch, grant #228844, and NASCENCE, grant #317662.

References

Elman, J. L. Finding structure in time. CRL Technical Report 8801, Center for Research in Language, University of California, San Diego, 1988.

Fernandez, S., Graves, A., and Schmidhuber, J. Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.

Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallett, D. S., and Dahlgren, N. L. DARPA TIMIT acoustic phonetic continuous speech corpus CD-ROM, 1993.

Graves, A. Supervised sequence labelling with recurrent neural networks. PhD thesis, Technical University Munich, 2008.

Graves, A. and Schmidhuber, J. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems 21, pp. 545-552. MIT Press, Cambridge, MA, 2009.

Graves, A., Beringer, N., and Schmidhuber, J. Rapid retraining on speech data with LSTM recurrent networks. Technical Report IDSIA-09-05, IDSIA, 2005. URL http://www.idsia.ch/idsiareport/IDSIA-09-05.pdf.

Graves, A., Fernandez, S., Gomez, F. J., and Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. In ICML'06: Proceedings of the International Conference on Machine Learning, 2006.

Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., and Schmidhuber, J. A novel connectionist system for improved unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 2009.

Graves, A., Mohamed, A., and Hinton, G. E. Speech recognition with deep recurrent neural networks. In ICASSP, pp. 6645-6649. IEEE, 2013.

Hihi, S. E. and Bengio, Y. Hierarchical recurrent neural networks for long-term dependencies. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E. (eds.), Advances in Neural Information Processing Systems 8, pp. 493-499. MIT Press, 1996.