A Clockwork RNN
Jan Koutník (HKOU@IDSIA.CH), Klaus Greff (KLAUS@IDSIA.CH), Faustino Gomez (TINO@IDSIA.CH)

Recurrent Neural Networks (RNNs) have the ability, in theory, to cope with these temporal dependencies by virtue of the short-term memory implemented by their recurrent feedback connections. However, in practice they are difficult to train successfully ...

Presentation Transcript

Figure 1. The CW-RNN architecture is similar to a simple RNN with an input, output and hidden layer. The hidden layer is partitioned into g modules, each with its own clock rate. Within each module the neurons are fully interconnected. Neurons in a faster module i are connected to neurons in a slower module j only if clock period T_i < T_j.

2. Related Work

Contributions to sequence modeling and recognition that are relevant to CW-RNNs are presented in this section. The primary focus is on RNN extensions that deal with the problem of bridging long time lags.

Hierarchical Recurrent Neural Networks (Hihi & Bengio, 1996) are the most similar to the CW-RNN, having both delayed connections and units operating at different time-scales. Five hand-designed architectures were tested on simple long time-lag toy problems, finding that multi-time-scale units significantly help in learning those tasks.

Another similar model is the NARX RNN¹ (Lin et al., 1996). But, instead of simplifying the network, it introduces additional sets of recurrent connections with time lags of 2, 3, ..., k time steps. These additional connections help to bridge long time lags, but introduce many additional parameters that make NARX RNN training more difficult and run k times slower.

¹ NARX stands for Non-linear Auto-Regressive model with eXogenous inputs.

Long Short-Term Memory (LSTM; Hochreiter & Schmidhuber, 1997) uses a specialized architecture that allows information to be stored in linear units called constant error carousels (CECs) indefinitely. CECs are contained in cells that have a set of multiplicative units (gates) connected to other cells that regulate when new information enters the CEC (input gate), when the activation of the CEC is output to the rest of the network (output gate), and when the activation decays or is "forgotten" (forget gate). These networks have been very successful recently in speech and handwriting recognition (Graves et al., 2005; 2009; Sak et al., 2014).

Stacking LSTMs (Fernandez et al., 2007; Graves & Schmidhuber, 2009) into a hierarchical sequence processor, equipped with Connectionist Temporal Classification (CTC; Graves et al., 2006), performs simultaneous segmentation and recognition of sequences. Its deep variant currently holds the state-of-the-art result in phoneme recognition on the TIMIT database (Graves et al., 2013).

Temporal Transition Hierarchies (TTHs; Ring, 1993) incrementally add high-order neurons in order to build a memory that is used to disambiguate an input at the current time step. This approach can, in principle, bridge time intervals of any length, but at the cost of a proportional growth in network size. The model was recently improved by adding recurrent connections (Ring, 2011) that reduce the number of high-order neurons needed to encode the temporal dependencies.

One of the earliest attempts to enable RNNs to handle long-term dependencies is the Reduced Description Network (Mozer, 1992; 1994). It uses leaky neurons whose activation changes only by a small amount in response to its inputs. This technique was recently picked up by Echo State Networks (ESN; Jaeger, 2002).

A similar technique has been used by Sutskever & Hinton (2010) to solve some serial recall tasks. These Temporal-Kernel RNNs add a weighted connection from each neuron to itself that decays exponentially in time. Their performance is still slightly inferior to LSTM.

Evolino (Schmidhuber et al., 2005; 2007) feeds the input to an RNN (which can be, e.g., LSTM to cope with long time lags) and then transforms the RNN outputs to the target sequences via an optimal linear mapping that is computed analytically by pseudo-inverse. The RNN is trained by an evolutionary algorithm, and therefore does not suffer from the vanishing gradient problem. Evolino outperformed LSTM on a set of synthetic problems and was used to perform complex robotic manipulation (Mayer et al., 2006).

A modern theory of why RNNs fail to learn long-term dependencies is that simple gradient descent fails to optimize them correctly. One attempt to mitigate this problem is Hessian-Free (HF) optimization (Martens & Sutskever, 2011), an adapted second-order training method that has been demonstrated to work well with RNNs. HF allows RNNs to solve some long-term lag problems that are considered challenging for stochastic gradient descent. Their performance on rather synthetic, long-term memory benchmarks is approaching that of LSTM, though the number of optimization steps required by HF-RNN is usually greater. Training networks by HF optimization is an approach orthogonal to the network architecture, so both LSTM and CW-RNN can still benefit from it.

HF optimization allowed for training of the Multiplicative RNN (MRNN; Sutskever et al., 2011), which ports the concept of multiplicative gating units to SRNs. The gating units are represented by a factored 3-way tensor in order to reduce the number of parameters. Extensive training of an MRNN on a GPU provided impressive results in text generation.

Training RNNs with Kalman filters (Pérez-Ortiz et al., 2003; Williams, 1992) has shown advantages in bridging long time lags as well, although this approach is computationally unfeasible for larger networks.

The methods mentioned above are strictly synchronous: elements of the network clock at the same speed. The Sequence Chunker, Neural History Compressor or Hierarchical Temporal Memory (Schmidhuber, 1991; 1992) consists of a hierarchy or stack of RNNs that may run at different time scales, but, unlike the simpler CW-RNN, it requires unsupervised event predictors: a higher-level RNN receives an input only when the lower-level RNN below is unable to predict it. Hence the clock of the higher level may speed up or slow down, depending on the current predictability of the input stream. This contrasts with the CW-RNN, in which the clocks always run at the same speed, some slower, some faster.

3. A Clockwork Recurrent Neural Network

Clockwork Recurrent Neural Networks (CW-RNNs), like SRNs, consist of input, hidden and output layers. There are forward connections from the input to the hidden layer, and from the hidden to the output layer, but, unlike the SRN, the neurons in the hidden layer are partitioned into g modules of size k. Each of the modules is assigned a clock period T_n \in \{T_1, \dots, T_g\}. Each module is internally fully-interconnected, but the recurrent connections from module j to module i exist only if the period T_i is smaller than the period T_j. Sorting the modules by increasing period, the connections between modules propagate the hidden state right-to-left, from slower modules to faster modules; see Figure 1.

The SRN output, y_O^{(t)}, at a time step t is calculated using the following equations:

    y_H^{(t)} = f_H(W_H \cdot y_H^{(t-1)} + W_I \cdot x^{(t)}),    (1)
    y_O^{(t)} = f_O(W_O \cdot y_H^{(t)}),    (2)

where W_H, W_I and W_O are the hidden, input and output weight matrices, x^{(t)} is the input vector at time step t, and y_H^{(t)} is a vector representing the activation of the hidden units at time step t. Functions f_H(.) and f_O(.) are non-linear activation functions. For simplicity, neuron biases are omitted from the equations.

The main difference between the CW-RNN and an SRN is that at each CW-RNN time step t, only the outputs of modules i that satisfy (t MOD T_i) = 0 are active. The choice of the set of periods \{T_1, \dots, T_g\} is arbitrary. In this paper, we use the exponential series of periods: module i has clock period T_i = 2^{i-1}.

Matrices W_H and W_I are partitioned into g block-rows:

    W_H = \begin{pmatrix} W_{H_1} \\ \vdots \\ W_{H_g} \end{pmatrix}, \qquad W_I = \begin{pmatrix} W_{I_1} \\ \vdots \\ W_{I_g} \end{pmatrix},    (3)

and W_H is a block-upper triangular matrix, where each block-row, W_{H_i}, is partitioned into block-columns \{0_1, \dots, 0_{i-1}, W_{H_{i,i}}, \dots, W_{H_{i,g}}\}. At each time step of a forward pass, only the block-rows of W_H and W_I that correspond to the active modules are used for evaluation in Equation (1):

    W_{H_i} = \begin{cases} W_{H_i} & \text{for } (t \bmod T_i) = 0, \\ 0 & \text{otherwise,} \end{cases}    (4)

and the corresponding parts of the output vector, y_H, are updated. The other modules retain their output values from the previous time step. Calculation of the hidden activation at time step t = 6 is illustrated in Figure 2.

Figure 2. Calculation of the hidden unit activations at time step t = 6 in the CW-RNN according to Equation (1). Input and recurrent weight matrices are partitioned into blocks. Each block-row in W_H and W_I corresponds to the weights of a particular module. At time step t = 6, the first two modules with periods T_1 = 1 and T_2 = 2 get evaluated (highlighted parts of W_H and W_I are used) and the highlighted outputs are updated. Note that, while using the exponential series of periods, the active parts of W_H and W_I are always contiguous.

Figure 3. Normalized mean squared error for the sequence generation task, with one box-whisker (showing mean value, 25% and 75% quantiles, minimum, maximum and outliers) for each tested network size and type. Note that the plot for the SRN has a different scale than the other two.

As a result, the low-clock-rate modules process, retain and output the long-term information obtained from the input sequences (not being distracted by the high-speed modules), whereas the high-speed modules focus on the local, high-frequency information (having the context provided by the low-speed modules available).

The backward pass of the error propagation is similar to the SRN as well. The only difference is that the error propagates only from modules that were executed at time step t. The error of non-activated modules gets copied back in time (similarly to copying the activations of nodes not activated at time step t during the corresponding forward pass), where it is added to the back-propagated error.
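To make the update rule concrete, the following Python/NumPy sketch implements one hidden-layer step of Equations (1) and (4) under the paper's stated assumptions (exponential periods T_i = 2^{i-1}, biases omitted, 1-based time step t). The function and variable names (cw_rnn_step, periods, k, and so on) are illustrative choices, not taken from the paper; only the active block-rows of W_H and W_I are evaluated, which is where the claimed speed-up comes from.

    import numpy as np

    def cw_rnn_step(t, x, y_H, W_H, W_I, periods, k, f_H=np.tanh):
        """One CW-RNN hidden-layer update (a sketch of Equations (1) and (4)).

        t       : 1-based time step, tested against (t mod T_i) == 0
        x       : input vector at time t, shape (n_in,)
        y_H     : hidden activations from time t-1, shape (n,)
        W_H     : block-upper-triangular recurrent matrix, shape (n, n)
        W_I     : input weight matrix, shape (n, n_in)
        periods : clock periods, e.g. [1, 2, 4, 8] for T_i = 2**(i-1)
        k       : module size, with n = k * len(periods)
        """
        y_next = y_H.copy()                   # inactive modules keep their old output
        for i, T_i in enumerate(periods):
            if t % T_i != 0:
                continue                      # module i is not clocked at this step
            rows = slice(i * k, (i + 1) * k)  # block-row of module i
            cols = slice(i * k, None)         # module i only reads modules with T_j >= T_i
            y_next[rows] = f_H(W_H[rows, cols] @ y_H[cols] + W_I[rows] @ x)
        return y_next

    # Toy usage: g = 4 modules of size k = 2, scalar input; at t = 6 only the
    # modules with periods 1 and 2 are active, as in Figure 2.
    g, k, n_in = 4, 2, 1
    periods = [2 ** i for i in range(g)]
    rng = np.random.default_rng(0)
    W_H = rng.normal(0.0, 0.1, (g * k, g * k))   # lower blocks are simply never read
    W_I = rng.normal(0.0, 0.1, (g * k, n_in))
    W_O = rng.normal(0.0, 0.1, (1, g * k))
    y = np.zeros(g * k)
    for t in range(1, 7):
        y = cw_rnn_step(t, np.ones(n_in), y, W_H, W_I, periods, k)
    out = np.tanh(W_O @ y)                       # Equation (2); tanh as f_O is our choice here

Because each block-row of W_H has non-zero entries only from its own module onward, slicing the columns from i*k onward reproduces the block-upper-triangular structure without multiplying by stored zeros; masking a full-matrix update would be mathematically equivalent but would forfeit the computational savings.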
Figure 4. Classification error for the word classification task, with one box-whisker for each tested network size and type.

A CW-RNN with the same number of nodes runs much faster than an SRN, since not all modules are evaluated at every time step. The lower bound for the CW-RNN speed-up compared to an SRN with the same number of neurons is g/4 in the case of this exponential clock setup; see the Appendix for a detailed derivation.

4. Experiments

CW-RNNs were compared to SRN and LSTM networks. All networks have one hidden layer with the tanh activation function, and the number of nodes in the hidden layer was chosen to obtain (approximately) the same number of parameters for all three methods (in the case of the CW-RNN, the clock periods were included in the parameter count). Initial values for all the weights were drawn from a Gaussian distribution with zero mean and standard deviation of 0.1. Initial values of all internal state variables for all hidden activations were set to 0. Each setup was run 100 times with different random initialization of parameters. All networks were trained using Stochastic Gradient Descent (SGD) with Nesterov-style momentum (Sutskever et al., 2013).

4.1. Sequence Generation

The goal of this task is to train a recurrent neural network, that receives no input, to generate a target sequence as accurately as possible. The weights of the network can be seen ...

... active may not affect their output. One could use multiple low-speed modules that run at the same clock periods but interleave with a different offset, employ a buffer for the sequence and pass the low-speed modules a function of this buffer (e.g. the mean), or transform the data first into some other domain (e.g. wavelet) and let the CW-RNN process the transformed information.

Appendix

A CW-RNN has fewer total parameters and even fewer operations per time step than an SRN with the same number of neurons. Assume the CW-RNN consists of g modules of size k, for a total of n = kg neurons. Because a neuron is only connected to other neurons with the same or larger period, the number of parameters N_H for the recurrent matrix is:

    N_H = \sum_{i=1}^{g} \sum_{j=1}^{k} k(g - i + 1) = k^2 \sum_{i=0}^{g-1} (g - i) = \frac{n^2}{2} + \frac{nk}{2}.

Compared to the n^2 parameters in the recurrent matrix W_H of the SRN, this results in roughly half as many parameters:

    \frac{N_H}{n^2} = \frac{n^2/2 + nk/2}{n^2} = \frac{n^2 + nk}{2n^2} = \frac{n + k}{2n} = \frac{g + 1}{2g} \approx \frac{1}{2}.

Each module i is evaluated only every T_i-th time step, therefore the number of operations at a time step is:

    O_H = k^2 \sum_{i=0}^{g-1} \frac{g - i}{T_i}.

For exponentially scaled periods, T_i = 2^i, the upper bound for the number of operations, O_H, needed for W_H per time step is:

    O_H = k^2 \sum_{i=0}^{g-1} \frac{g - i}{2^i} = k^2 \Big( g \sum_{i=0}^{g-1} \frac{1}{2^i} - \sum_{i=0}^{g-1} \frac{i}{2^i} \Big) \approx k^2 (2g - 2) \leq 2nk,

since both sums approach 2; because g \geq 2, this is less than or equal to n^2. Recurrent operations in the CW-RNN are therefore faster than in an SRN with the same number of neurons by a factor of at least g/2, which, for typical CW-RNN sizes, ends up being between 2 and 5. Similarly, the upper bound for the number of input weight operations, O_I, with m inputs is:

    O_I = \sum_{i=0}^{g-1} \frac{km}{T_i} = km \sum_{i=0}^{g-1} \frac{1}{T_i} \leq 2km.

Therefore, the overall CW-RNN speed-up w.r.t. the SRN is:

    \frac{n^2 + nm + n}{O_H + O_I + 2n} \approx \frac{k^2 g^2 + kgm + kg}{k^2(2g - 2) + 2km + 2kg} = \frac{g(kg + m + 1)}{2(k(g - 1) + m + g)} = \frac{g}{2} \cdot \underbrace{\frac{kg + m + 1}{k(g - 1) + m + g}}_{\geq 1/2} \geq \frac{g}{4}.
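The closed-form counts above can be sanity-checked numerically. The short Python sketch below is an illustration under the same assumptions (g modules of size k, n = kg neurons, m inputs, periods T_i = 2^i); the helper names and the example values g = 5, k = 4, m = 10 are ours, not from the paper. It counts the non-zero recurrent parameters and the average operations per time step directly and compares them with the bounds derived in the Appendix.

    # Numeric check of the Appendix parameter and operation counts (illustrative only).

    def recurrent_params(g, k):
        # block-row of module i (1-based) spans modules i..g:
        # k rows times k*(g - i + 1) non-zero columns
        return sum(k * k * (g - i + 1) for i in range(1, g + 1))

    def recurrent_ops(g, k, periods):
        # module i (0-based) is evaluated every periods[i]-th step and reads only
        # modules i..g-1, giving k*k*(g - i) multiplications when active
        return sum(k * k * (g - i) / periods[i] for i in range(g))

    g, k, m = 5, 4, 10                      # modules, module size, inputs (example values)
    n = g * k                               # total hidden neurons
    periods = [2 ** i for i in range(g)]    # exponential clock periods

    N_H = recurrent_params(g, k)
    O_H = recurrent_ops(g, k, periods)
    O_I = sum(k * m / p for p in periods)   # average input-weight ops per step

    assert N_H == n * n / 2 + n * k / 2     # N_H = n^2/2 + nk/2, about half of n^2
    assert O_H <= 2 * n * k                 # O_H <= 2nk, i.e. <= n^2 for g >= 2
    assert n * n / O_H >= g / 2             # recurrent ops reduced by at least g/2
    assert O_I <= 2 * k * m                 # O_I <= 2km

    speedup = (n * n + n * m + n) / (O_H + O_I + 2 * n)
    assert speedup >= g / 4                 # conservative overall lower bound
    print(f"g={g}, k={k}, m={m}: speed-up {speedup:.2f} vs bound g/4 = {g / 4:.2f}")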
Note that g/4 is a conservative lower bound.

Acknowledgments

This research was supported by Swiss National Science Foundation grant #138219: "Theory and Practice of Reinforcement Learning 2", and the EU FP7 projects "NanoBioTouch", grant #228844, and "NASCENCE", grant #317662.

References

Elman, J. L. Finding structure in time. CRL Technical Report 8801, Center for Research in Language, University of California, San Diego, 1988.

Fernandez, S., Graves, A., and Schmidhuber, J. Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.

Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallett, D. S., and Dahlgren, N. L. DARPA TIMIT acoustic phonetic continuous speech corpus CD-ROM, 1993.

Graves, A. Supervised sequence labelling with recurrent neural networks. PhD thesis, Technical University Munich, 2008.

Graves, A. and Schmidhuber, J. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems 21, pp. 545–552. MIT Press, Cambridge, MA, 2009.

Graves, A., Beringer, N., and Schmidhuber, J. Rapid retraining on speech data with LSTM recurrent networks. Technical Report IDSIA-09-05, IDSIA, 2005. URL http://www.idsia.ch/idsiareport/IDSIA-09-05.pdf.

Graves, A., Fernandez, S., Gomez, F. J., and Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. In ICML '06: Proceedings of the International Conference on Machine Learning, 2006.

Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., and Schmidhuber, J. A novel connectionist system for improved unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 2009.

Graves, A., Mohamed, A., and Hinton, G. E. Speech recognition with deep recurrent neural networks. In ICASSP, pp. 6645–6649. IEEE, 2013.

Hihi, S. E. and Bengio, Y. Hierarchical recurrent neural networks for long-term dependencies. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E. (eds.), Advances in Neural Information Processing Systems 8, pp. 493–499. MIT Press, 1996.