On the importance of initialization and momentum in deep learning


Ilya Sutskever (ilyasu@google.com)
James Martens (jmartens@cs.toronto.edu)
George Dahl (gdahl@cs.toronto.edu)
Geoffrey Hinton (hinton@cs.toronto.edu)

Abstract: Deep and recurrent neural networks (DNNs and RNNs, respectively) are powerful models that were considered to be almost impossible to train [...]


Hinton et al., 2012; Dahl et al., 2012; Graves, 2012). Although their representational power is appealing, the difficulty of training DNNs has prevented their widespread use until fairly recently. DNNs became the subject of renewed attention following the work of Hinton et al. (2006), who introduced the idea of greedy layerwise pre-training. This approach has since branched into a family of methods (Bengio et al., 2007), all of which train the layers of the DNN in a sequence using an auxiliary objective and then "fine-tune" the entire network with standard optimization methods such as stochastic gradient descent (SGD). More recently, Martens (2010) attracted considerable attention by showing that a type of truncated-Newton method called Hessian-free Optimization (HF) is capable of training DNNs from certain random initializations without the use of pre-training, and can achieve [...]

[...] falls in our experiments, and provide a simple-to-understand and easy-to-use framework for deep learning that is surprisingly effective and can be naturally combined with techniques such as those in Raiko et al. (2011). We will also discuss the links between classical momentum and Nesterov's accelerated gradient method (which has been the subject of much recent study in convex optimization theory), arguing that the latter can be viewed as a simple modification of the former which increases stability, and can sometimes provide a distinct improvement in performance, as demonstrated in our experiments. We perform a theoretical analysis which makes clear the precise difference in the local behavior of these two algorithms. Additionally, we show how HF employs what can be viewed as a type of "momentum" through its use of special initializations to [...]

[...] Given an objective function f(θ) to be minimized, classical momentum (CM) is given by:

    v_{t+1} = µ v_t − ε ∇f(θ_t)        (1)
    θ_{t+1} = θ_t + v_{t+1}            (2)

where ε > 0 is the learning rate and µ ∈ [0, 1] is the momentum coefficient.

Since directions d of low curvature have, by definition, a slower local change in their rate of reduction (i.e., in d^T ∇f), they will tend to persist across iterations and be amplified by CM. Second-order methods also amplify steps in low-curvature directions, but instead of accumulating changes they reweight the update along each eigen-direction of the curvature matrix by the inverse of the associated curvature. And just as second-order methods enjoy improved local convergence rates, Polyak (1964) showed that CM can considerably accelerate convergence to a local minimum, requiring √R-times fewer iterations than steepest descent to reach the same level of accuracy, where R is the condition number of the curvature at the minimum and µ is set to (√R − 1)/(√R + 1).
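As a concrete illustration of eqs. (1)–(2), here is a minimal NumPy sketch of the CM recursion. The function name, its default values, and the ill-conditioned quadratic used to exercise it are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def classical_momentum(grad_fn, theta0, epsilon=0.01, mu=0.9, num_steps=500):
    """Sketch of classical momentum (CM), eqs. (1)-(2):
        v_{t+1}     = mu * v_t - epsilon * grad f(theta_t)
        theta_{t+1} = theta_t + v_{t+1}
    grad_fn, theta0, epsilon, mu and num_steps are illustrative placeholders.
    """
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                     # velocity vector
    for _ in range(num_steps):
        v = mu * v - epsilon * grad_fn(theta)    # eq. (1): accumulate persistent descent directions
        theta = theta + v                        # eq. (2): apply the velocity
    return theta

# Hypothetical ill-conditioned quadratic f(theta) = 0.5 * theta^T A theta;
# the velocity builds up along the low-curvature eigendirection (eigenvalue 1).
A = np.diag([1.0, 50.0])
print(classical_momentum(lambda th: A @ th, theta0=[1.0, 1.0]))
```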
Nesterov's accelerated gradient (NAG), by contrast, is given by:

    v_{t+1} = µ v_t − ε ∇f(θ_t + µ v_t)        (3)
    θ_{t+1} = θ_t + v_{t+1}                    (4)

While the classical convergence theories for both methods rely on noiseless gradient estimates (i.e., not stochastic), with some care they are both applicable in practice to the stochastic setting. However, the theory predicts that any advantages in terms of the asymptotic local rate of convergence will be lost (Orr, 1996; Wiegerinck et al., 1999), a result also confirmed in experiments (LeCun et al., 1998). For these reasons, interest in momentum methods diminished after they had received substantial attention in the 90's. And because of this apparent incompatibility with stochastic optimization, some authors even discourage using momentum or downplay its potential advantages (LeCun et al., 1998).

However, while local convergence is all that matters in terms of asymptotic convergence rates (and on certain very simple/shallow neural network optimization problems it may even dominate the total learning time), in practice the "transient phase" of convergence (Darken & Moody, 1993), which occurs before fine local convergence sets in, seems to matter a lot more for optimizing deep neural networks. In this transient phase of learning, directions of reduction in the objective tend to persist across many successive gradient estimates and are not completely swamped by noise.

Although the transient phase of learning is most noticeable in training deep learning models, it is still noticeable in convex objectives. The convergence rate of stochastic gradient descent on smooth convex functions is given by O(L/T + σ/√T), where σ is the variance in the gradient estimate and L is the Lipschitz coefficient of ∇f. In contrast, the convergence rate of the accelerated gradient method of Lan (2010) (which is related to, but different from, NAG, in that [...]

[...] more stably than CM in many situations, especially for higher values of µ. [...] NAG computes the gradient correction at the partially updated point θ_t + µ v_t, and if µ v_t is indeed a poor update, then ∇f(θ_t + µ v_t) will point back towards θ_t [...]

[...] Consider a quadratic objective x^T A x / 2 + b^T x. We can think of CM and NAG as operating independently over the different eigendirections of A. NAG operates along any one of these directions equivalently to CM, except with an effective value of µ that is given by µ(1 − λ ε), [...] where the λ_i > 0 are the diagonal entries of D (and thus the eigenvalues of A) and correspond to the curvature along the associated eigenvector directions. [...]

[...] µ was given by the following formula:

    µ_t = min(1 − 2^{−1−log2(⌊t/250⌋+1)}, µ_max)

[...] 0.001·N [...] For tasks that do not have many irrelevant inputs, a larger scale of the input-to-hidden weights (namely, 0.1) worked better, because the aforementioned dis- [...]

[...] µ = 0.9 for the first 1000 parameter updates, after which µ = µ_0, where µ_0 can take the following values {0, 0.9, 0.98, 0.995}. For each µ_0, we use the empirically best learning rate chosen from {10^-3, 10^-4, 10^-5, 10^-6}. The results are presented in Table 5, which are the av- [...]

[...] picture and demonstrated conclusively that a large part of the remaining performance gap that is not [...]
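The NAG update in eqs. (3)–(4) and the momentum schedule quoted above can be combined as in the following sketch. Again, the function name, defaults, and the quadratic test objective are assumptions made for illustration, not the authors' code.

```python
import math
import numpy as np

def nag(grad_fn, theta0, epsilon=0.01, mu_max=0.995, num_steps=2000):
    """Sketch of Nesterov's accelerated gradient (NAG), eqs. (3)-(4), with the
    momentum schedule mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max).
    All names and default values are illustrative placeholders.
    """
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for t in range(num_steps):
        mu = min(1.0 - 2.0 ** (-1 - math.log2(t // 250 + 1)), mu_max)
        # The gradient is evaluated at the partially updated point theta + mu * v,
        # which lets NAG correct an overly aggressive velocity sooner than CM would.
        v = mu * v - epsilon * grad_fn(theta + mu * v)   # eq. (3)
        theta = theta + v                                # eq. (4)
    return theta

# Same hypothetical ill-conditioned quadratic as in the CM sketch above.
A = np.diag([1.0, 50.0])
print(nag(lambda th: A @ th, theta0=[1.0, 1.0]))
```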
Dahl, G. E., Yu, D., Deng, L., and Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.

Darken, C. and Moody, J. Towards faster stochastic gradient search. Advances in Neural Information Processing Systems, pp. 1009–1009, 1993.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249–256, May 2010.

Graves, A. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.

Hinton, G. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313:504–507, 2006.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Jaeger, H. Personal communication, 2012.

Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304:78–80, 2004.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106–1114, 2012.

Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, [...]