(Hinton et al., 2012; Dahl et al., 2012; Graves, 2012). Although their representational power is appealing, the difficulty of training DNNs prevented their widespread use until fairly recently. DNNs became the subject of renewed attention following the work of Hinton et al. (2006), who introduced the idea of greedy layerwise pre-training. This approach has since branched into a family of methods (Bengio et al., 2007), all of which train the layers of the DNN in a sequence using an auxiliary objective and then "fine-tune" the entire network with standard optimization methods such as stochastic gradient descent (SGD). More recently, Martens (2010) attracted considerable attention by showing that a type of truncated-Newton method called Hessian-free Optimization (HF) is capable of training DNNs from certain random initializations without the use of pre-training, and can achieve lower errors for the auto-encoding tasks considered by Hinton & Salakhutdinov (2006).

[...] pitfalls in our experiments, and provide a simple-to-understand and easy-to-use framework for deep learning that is surprisingly effective and can be naturally combined with techniques such as those in Raiko et al. (2011). We will also discuss the links between classical momentum and Nesterov's accelerated gradient method (which has been the subject of much recent study in convex optimization theory), arguing that the latter can be viewed as a simple modification of the former which increases stability, and can sometimes provide a distinct improvement in performance, as we demonstrate in our experiments. We perform a theoretical analysis which makes clear the precise difference in the local behavior of these two algorithms. Additionally, we show how HF employs what can be viewed as a type of "momentum" through its use of special initializations to [...]

Given an objective function f(θ) to be minimized, classical momentum (CM) is given by:

    v_{t+1} = µ v_t − ε ∇f(θ_t)        (1)
    θ_{t+1} = θ_t + v_{t+1}            (2)

where ε > 0 is the learning rate, µ ∈ [0, 1] is the momentum coefficient, and ∇f(θ_t) is the gradient at θ_t.

Since directions of low curvature have, by definition, a slower local change in their rate of reduction (i.e., in d⊤∇f for a direction d), they will tend to persist across iterations and be amplified by CM. Second-order methods also amplify steps in low-curvature directions, but instead of accumulating changes they reweight the update along each eigen-direction of the curvature matrix by the inverse of the associated curvature. And just as second-order methods enjoy improved local convergence rates, Polyak (1964) showed that CM can considerably accelerate convergence to a local minimum, requiring √R times fewer iterations than steepest descent to reach the same level of accuracy, where R is the condition number of the curvature at the minimum and µ is set to (√R − 1)/(√R + 1).
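To make the CM update in equations (1)-(2) concrete, here is a minimal NumPy sketch that applies classical momentum to an arbitrary differentiable objective. The gradient function, learning rate, and momentum coefficient below are illustrative placeholders rather than settings from the paper.

    import numpy as np

    def classical_momentum(grad_f, theta0, eps=0.01, mu=0.9, num_steps=500):
        """Classical momentum (CM), equations (1)-(2):
        v_{t+1} = mu * v_t - eps * grad_f(theta_t);  theta_{t+1} = theta_t + v_{t+1}."""
        theta = np.asarray(theta0, dtype=float).copy()
        v = np.zeros_like(theta)
        for _ in range(num_steps):
            v = mu * v - eps * grad_f(theta)  # accumulate a velocity (eq. 1)
            theta = theta + v                 # step along the velocity (eq. 2)
        return theta

    # Hypothetical ill-conditioned quadratic f(theta) = theta^T A theta / 2,
    # with curvatures 100 and 1, so the condition number R is 100.
    A = np.diag([100.0, 1.0])
    theta_min = classical_momentum(lambda th: A @ th, np.array([1.0, 1.0]))

On this quadratic, the low-curvature eigendirection (curvature 1) is exactly the kind of slowly changing descent direction that the velocity accumulates and amplifies, as described above.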
Nesterov's accelerated gradient (NAG), by contrast, evaluates the gradient at the partially updated point θ_t + µv_t:

    v_{t+1} = µ v_t − ε ∇f(θ_t + µ v_t)    (3)
    θ_{t+1} = θ_t + v_{t+1}                (4)

While the classical convergence theories for both methods rely on noiseless gradient estimates (i.e., not stochastic), with some care they are both applicable to the stochastic setting in practice. However, the theory predicts that any advantages in terms of asymptotic local rate of convergence will be lost (Orr, 1996; Wiegerinck et al., 1999), a result also confirmed in experiments (LeCun et al., 1998). For these reasons, interest in momentum methods diminished after they had received substantial attention in the 90's. And because of this apparent incompatibility with stochastic optimization, some authors even discourage using momentum or downplay its potential advantages (LeCun et al., 1998).

However, while local convergence is all that matters in terms of asymptotic convergence rates (and on certain very simple/shallow neural network optimization problems it may even dominate the total learning time), in practice the "transient phase" of convergence (Darken & Moody, 1993), which occurs before fine local convergence sets in, seems to matter far more for optimizing deep neural networks. In this transient phase of learning, directions of reduction in the objective tend to persist across many successive gradient estimates and are not completely swamped by noise.

Although the transient phase of learning is most noticeable in training deep learning models, it is still noticeable in convex objectives. The convergence rate of stochastic gradient descent on smooth convex functions is given by O(L/T + σ/√T), where σ is the variance in the gradient estimate and L is the Lipschitz coefficient of ∇f. In contrast, the convergence rate of the accelerated gradient method of Lan (2010) (which is related to but different from NAG) is O(L/T² + σ/√T): acceleration improves the term that dominates early in the optimization, while the σ/√T noise term is unaffected.

[...] more stably than CM in many situations, especially for higher values of µ. [...] θ_t + µv_t, and if µv_t is indeed a poor update, then ∇f(θ_t + µv_t) will point back towards θ_t more strongly than ∇f(θ_t) does. [...]

[...] a quadratic objective of the form q(x) = x⊤Ax/2 + b⊤x. We can think of CM and NAG as operating independently over the different eigendirections of A. NAG operates along any one of these directions equivalently to CM, except with an effective value of µ that is given by µ(1 − ελᵢ) [...], where the λᵢ > 0 are the diagonal entries of D (and thus the eigenvalues of A) and correspond to the curvature along the associated eigenvector directions. As [...]

[...] µ was given by the following formula:

    µ_t = min(1 − 2^(−1−log₂(⌊t/250⌋+1)), µ_max)

[...] for tasks that do not have many irrelevant inputs, a larger scale of the input-to-hidden weights (namely, 0.1) worked better, because the aforementioned dis[...]

[...] µ = 0.9 for the first 1000 parameter updates, after which µ = µ₀, where µ₀ can take the following values {0, 0.9, 0.98, 0.995}. For each µ₀, we use the empirically best learning rate chosen from {10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶}. The results, presented in Table 5, are the averages [...]

[...] picture, and demonstrated conclusively that a large part of the remaining performance gap that is not [...]
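For contrast with the CM sketch, here is a minimal NumPy sketch of the NAG update in equations (3)-(4), combined with the µ_t schedule given above. The learning rate and the quadratic test objective are illustrative placeholders; the constant 250 and the cap µ_max come from the schedule formula itself.

    import numpy as np

    def mu_schedule(t, mu_max=0.995):
        """Momentum schedule: mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max)."""
        return min(1.0 - 2.0 ** (-1.0 - np.log2(np.floor(t / 250.0) + 1.0)), mu_max)

    def nag(grad_f, theta0, eps=0.01, num_steps=2000):
        """Nesterov's accelerated gradient (NAG), equations (3)-(4)."""
        theta = np.asarray(theta0, dtype=float).copy()
        v = np.zeros_like(theta)
        for t in range(num_steps):
            mu = mu_schedule(t)
            v = mu * v - eps * grad_f(theta + mu * v)  # gradient at lookahead point (eq. 3)
            theta = theta + v                          # same step rule as CM (eq. 4)
        return theta

    A = np.diag([100.0, 1.0])  # hypothetical quadratic, as in the CM sketch
    theta_min = nag(lambda th: A @ th, np.array([1.0, 1.0]))

Because the gradient is evaluated at the lookahead point θ_t + µv_t, a poor velocity is corrected one step sooner than under CM; along each eigendirection this behaves like CM with the reduced effective momentum µ(1 − ελᵢ), which is what makes the larger values of µ in the schedule tolerable.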
References

Dahl, G. E., Yu, D., Deng, L., and Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42, 2012.

Darken, C. and Moody, J. Towards faster stochastic gradient search. In Advances in Neural Information Processing Systems, pp. 1009-1016, 1993.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249-256, May 2010.

Graves, A. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.

Hinton, G. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Jaeger, H. Personal communication, 2012.

Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304:78-80, 2004.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106-1114, 2012.

Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 2010.