(Hinton et al., 2012; Dahl et al., 2012; Graves, 2012). Although their representational power is appealing, the difficulty of training DNNs prevented their widespread use until fairly recently. DNNs became the subject of renewed attention following the work of Hinton et al. (2006), who introduced the idea of greedy layerwise pre-training. This approach has since branched into a family of methods (Bengio et al., 2007), all of which train the layers of the DNN in a sequence using an auxiliary objective and then "fine-tune" the entire network with standard optimization methods such as stochastic gradient descent (SGD). More recently, Martens (2010) attracted considerable attention by showing that a type of truncated-Newton method called Hessian-free Optimization (HF) is capable of training DNNs from certain random initializations without the use of pre-training, and can achieve lower errors for the auto-encoding tasks considered by Hinton & Salakhutdinov (2006).

[...] pitfalls in our experiments, and provide a simple-to-understand and easy-to-use framework for deep learning that is surprisingly effective and can be naturally combined with techniques such as those in Raiko et al. (2011). We will also discuss the links between classical momentum and Nesterov's accelerated gradient method (which has been the subject of much recent study in convex optimization theory), arguing that the latter can be viewed as a simple modification of the former which increases stability, and can sometimes provide a distinct improvement in performance, as we demonstrate in our experiments. We perform a theoretical analysis which makes clear the precise difference in the local behavior of these two algorithms. Additionally, we show how HF employs what can be viewed as a type of "momentum" through its use of special initializations to [...]

Given an objective function f(θ) to be minimized, classical momentum (CM) is given by:

    v_{t+1} = µ v_t − ε ∇f(θ_t)        (1)
    θ_{t+1} = θ_t + v_{t+1}            (2)

where ε > 0 is the learning rate, µ ∈ [0, 1] is the momentum coefficient, and ∇f(θ_t) is the gradient at θ_t.

Since directions of low curvature have, by definition, a slower local change in their rate of reduction (i.e., in d⊤∇f for a direction d), they will tend to persist across iterations and be amplified by CM. Second-order methods also amplify steps in low-curvature directions, but instead of accumulating changes they reweight the update along each eigen-direction of the curvature matrix by the inverse of the associated curvature. And just as second-order methods enjoy improved local convergence rates, Polyak (1964) showed that CM can considerably accelerate convergence to a local minimum, requiring √R times fewer iterations than steepest descent to reach the same level of accuracy, where R is the condition number of the curvature at the minimum and µ is set to (√R − 1)/(√R + 1).
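To make the CM update in equations (1)-(2) concrete, here is a minimal NumPy sketch that applies classical momentum to an arbitrary differentiable objective. The gradient function, learning rate, and momentum coefficient below are illustrative placeholders rather than settings from the paper.

    import numpy as np

    def classical_momentum(grad_f, theta0, eps=0.01, mu=0.9, num_steps=500):
        """Classical momentum (CM), equations (1)-(2):
        v_{t+1} = mu * v_t - eps * grad_f(theta_t);  theta_{t+1} = theta_t + v_{t+1}."""
        theta = np.asarray(theta0, dtype=float).copy()
        v = np.zeros_like(theta)
        for _ in range(num_steps):
            v = mu * v - eps * grad_f(theta)  # accumulate a velocity (eq. 1)
            theta = theta + v                 # step along the velocity (eq. 2)
        return theta

    # Hypothetical ill-conditioned quadratic f(theta) = theta^T A theta / 2,
    # with curvatures 100 and 1, so the condition number R is 100.
    A = np.diag([100.0, 1.0])
    theta_min = classical_momentum(lambda th: A @ th, np.array([1.0, 1.0]))

On this quadratic, the low-curvature eigendirection (curvature 1) is exactly the kind of slowly changing descent direction that the velocity accumulates and amplifies, as described above.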
Nesterov's accelerated gradient (NAG), by contrast, evaluates the gradient at the partially updated point θ_t + µv_t:

    v_{t+1} = µ v_t − ε ∇f(θ_t + µ v_t)    (3)
    θ_{t+1} = θ_t + v_{t+1}                (4)

While the classical convergence theories for both methods rely on noiseless gradient estimates (i.e., not stochastic), with some care they are both applicable to the stochastic setting in practice. However, the theory predicts that any advantages in terms of asymptotic local rate of convergence will be lost (Orr, 1996; Wiegerinck et al., 1999), a result also confirmed in experiments (LeCun et al., 1998). For these reasons, interest in momentum methods diminished after they had received substantial attention in the 90's. And because of this apparent incompatibility with stochastic optimization, some authors even discourage using momentum or downplay its potential advantages (LeCun et al., 1998).

However, while local convergence is all that matters in terms of asymptotic convergence rates (and on certain very simple/shallow neural network optimization problems it may even dominate the total learning time), in practice the "transient phase" of convergence (Darken & Moody, 1993), which occurs before fine local convergence sets in, seems to matter far more for optimizing deep neural networks. In this transient phase of learning, directions of reduction in the objective tend to persist across many successive gradient estimates and are not completely swamped by noise.

Although the transient phase of learning is most noticeable in training deep learning models, it is still noticeable in convex objectives. The convergence rate of stochastic gradient descent on smooth convex functions is given by O(L/T + σ/√T), where σ is the variance in the gradient estimate and L is the Lipschitz coefficient of ∇f. In contrast, the convergence rate of the accelerated gradient method of Lan (2010) (which is related to but different from NAG) is O(L/T² + σ/√T): acceleration improves the term that dominates early in the optimization, while the σ/√T noise term is unaffected.

[...] more stably than CM in many situations, especially for higher values of µ. [...] θ_t + µv_t, and if µv_t is indeed a poor update, then ∇f(θ_t + µv_t) will point back towards θ_t more strongly than ∇f(θ_t) does. [...]

[...] a quadratic objective of the form q(x) = x⊤Ax/2 + b⊤x. We can think of CM and NAG as operating independently over the different eigendirections of A. NAG operates along any one of these directions equivalently to CM, except with an effective value of µ that is given by µ(1 − ελᵢ) [...], where the λᵢ > 0 are the diagonal entries of D (and thus the eigenvalues of A) and correspond to the curvature along the associated eigenvector directions. As [...]

[...] µ was given by the following formula:

    µ_t = min(1 − 2^(−1−log₂(⌊t/250⌋+1)), µ_max)

[...] for tasks that do not have many irrelevant inputs, a larger scale of the input-to-hidden weights (namely, 0.1) worked better, because the aforementioned dis[...]

[...] µ = 0.9 for the first 1000 parameter updates, after which µ = µ₀, where µ₀ can take the following values {0, 0.9, 0.98, 0.995}. For each µ₀, we use the empirically best learning rate chosen from {10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶}. The results, presented in Table 5, are the averages [...]

[...] picture, and demonstrated conclusively that a large part of the remaining performance gap that is not [...]
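For contrast with the CM sketch, here is a minimal NumPy sketch of the NAG update in equations (3)-(4), combined with the µ_t schedule given above. The learning rate and the quadratic test objective are illustrative placeholders; the constant 250 and the cap µ_max come from the schedule formula itself.

    import numpy as np

    def mu_schedule(t, mu_max=0.995):
        """Momentum schedule: mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max)."""
        return min(1.0 - 2.0 ** (-1.0 - np.log2(np.floor(t / 250.0) + 1.0)), mu_max)

    def nag(grad_f, theta0, eps=0.01, num_steps=2000):
        """Nesterov's accelerated gradient (NAG), equations (3)-(4)."""
        theta = np.asarray(theta0, dtype=float).copy()
        v = np.zeros_like(theta)
        for t in range(num_steps):
            mu = mu_schedule(t)
            v = mu * v - eps * grad_f(theta + mu * v)  # gradient at lookahead point (eq. 3)
            theta = theta + v                          # same step rule as CM (eq. 4)
        return theta

    A = np.diag([100.0, 1.0])  # hypothetical quadratic, as in the CM sketch
    theta_min = nag(lambda th: A @ th, np.array([1.0, 1.0]))

Because the gradient is evaluated at the lookahead point θ_t + µv_t, a poor velocity is corrected one step sooner than under CM; along each eigendirection this behaves like CM with the reduced effective momentum µ(1 − ελᵢ), which is what makes the larger values of µ in the schedule tolerable.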
References

Dahl, G. E., Yu, D., Deng, L., and Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42, 2012.

Darken, C. and Moody, J. Towards faster stochastic gradient search. In Advances in Neural Information Processing Systems, pp. 1009-1016, 1993.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249-256, May 2010.

Graves, A. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.

Hinton, G. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Jaeger, H. Personal communication, 2012.

Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304:78-80, 2004.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106-1114, 2012.

Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 2010.