Do Deep Nets Really Need to be Deep?

Lei Jimmy Ba, University of Toronto, jimmy@psi.utoronto.ca
Rich Caruana, Microsoft Research, rcaruana@microsoft.com

Abstract

Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow nets can learn these deep functions using the same number of parameters as the original deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional models.

1 Introduction

You are given a training set with 1M labeled points. When you train a shallow neural net with one fully connected feed-forward hidden layer on this data you obtain 86% accuracy on test data. When you train a deeper neural net as in [1] consisting of a convolutional layer, pooling layer, and three fully connected feed-forward layers on the same data you obtain 91% accuracy on the same test set. What is the source of this improvement? Is the 5% increase in accuracy of the deep net over the shallow net because: a) the deep net has more parameters; b) the deep net can learn more complex functions given the same number of parameters; c) the deep net has better inductive bias and thus learns more interesting/useful functions (e.g., because the deep net is deeper it learns hierarchical representations [5]); d) nets without convolution can't easily learn what nets with convolution can learn; e) current learning algorithms and regularization methods work better with deep architectures than shallow architectures [8]; f) all or some of the above; g) none of the above?

There have been attempts to answer this question. It has been shown that deep nets coupled with unsupervised layer-by-layer pre-training [10][19] work well. In [8], the authors show that depth combined with pre-training provides a good prior for model weights, thus improving generalization. There is well-known early theoretical work on the representational capacity of neural nets. For example, it was proved that a network with a large enough single hidden layer of sigmoid units can approximate any decision boundary [4]. Empirical work, however, shows that it is difficult to train shallow nets to be as accurate as deep nets. For vision tasks, a recent study on deep convolutional nets suggests that deeper models are preferred under a parameter budget [7]. In [5], the authors trained shallow nets on SIFT features to classify a large-scale ImageNet dataset and found that it was difficult to train large, high-accuracy, shallow nets. And in [17], the authors show that deeper models are more accurate than shallow models in speech acoustic modeling.

In this paper we provide empirical evidence that shallow nets are capable of learning the same function as deep nets, and in some cases with the same number of parameters as the deep nets. We do this by first training a state-of-the-art deep model, and then training a shallow model to mimic the deep model. The mimic model is trained using the model compression method described in the next section. Remarkably, with model compression we are able to train shallow nets to be as accurate as some deep models, even though we are not able to train these shallow nets to be as accurate as the deep nets when the shallow nets are trained directly on the original labeled training data. If a shallow net with the same number of parameters as a deep net can learn to mimic a deep net with high fidelity, then it is clear that the function learned by that deep net does not really have to be deep.
2 Training Shallow Nets to Mimic Deep Nets

2.1 Model Compression

The main idea behind model compression [3] is to train a compact model to approximate the function learned by a larger, more complex model. For example, in [3], a single neural net of modest size could be trained to mimic a much larger ensemble of models; although the small neural nets contained 1000 times fewer parameters, often they were just as accurate as the ensembles they were trained to mimic. Model compression works by passing unlabeled data through the large, accurate model to collect the scores produced by that model. This synthetically labeled data is then used to train the smaller mimic model. The mimic model is not trained on the original labels; it is trained to learn the function that was learned by the larger model. If the compressed model learns to mimic the large model perfectly, it makes exactly the same predictions and mistakes as the complex model.

Surprisingly, often it is not (yet) possible to train a small neural net on the original training data to be as accurate as the complex model, nor as accurate as the mimic model. Compression demonstrates that a small neural net could, in principle, learn the more accurate function, but current learning algorithms are unable to train a model with that accuracy from the original training data; instead, we must train the complex intermediate model first and then train the neural net to mimic it. Clearly, when it is possible to mimic the function learned by a complex model with a small net, the function learned by the complex model wasn't truly too complex to be learned by a small net. This suggests to us that the complexity of a learned model, and the size and architecture of the representation best used to learn that model, are different things.

2.2 Mimic Learning via Regressing Logits with L2 Loss

On both TIMIT and CIFAR-10 we use model compression to train shallow mimic nets using data labeled by either a deep net, or an ensemble of deep nets, trained on the original TIMIT or CIFAR-10 training data. The deep models are trained in the usual way using softmax output and a cross-entropy cost function. The shallow mimic models, however, instead of being trained with cross-entropy on the 183 values $p_k = e^{z_k} / \sum_j e^{z_j}$ output by the softmax layer from the deep model, are trained directly on the 183 log probability values $z$, also called logits, before the softmax activation.

Training on logits, which are logarithms of predicted probabilities, makes learning easier for the student model by placing equal emphasis on the relationships learned by the teacher model across all of the targets. For example, if the teacher predicts three targets with probability $[2 \times 10^{-9}, 4 \times 10^{-5}, 0.9999]$ and those probabilities are used as prediction targets and cross-entropy is minimized, the student will focus on the third target and tend to ignore the first and second targets. A student, however, trained on the logits for these targets, $[10, 20, 30]$, will better learn to mimic the detailed behaviour of the teacher model. Moreover, consider a second training case where the teacher predicts logits $[-10, 0, 10]$. After softmax, these logits yield the same predicted probabilities as $[10, 20, 30]$, yet clearly the teacher models the two cases very differently. By training the student model directly on the logits, the student is better able to learn the internal model learned by the teacher, without suffering from the information loss that occurs from passing through logits to probability space.

We formulate the SNN-MIMIC learning objective function as a regression problem given training data $\{(x^{(1)}, z^{(1)}), \ldots, (x^{(T)}, z^{(T)})\}$:

$$\mathcal{L}(W, \beta) = \frac{1}{2T} \sum_t \left\| g(x^{(t)}; W, \beta) - z^{(t)} \right\|_2^2, \qquad (1)$$

where $W$ is the weight matrix between input features $x$ and the hidden layer, $\beta$ is the weights from hidden to output units, $g(x^{(t)}; W, \beta) = \beta f(W x^{(t)})$ is the model prediction on the $t$-th training data point, and $f(\cdot)$ is the non-linear activation of the hidden units. The parameters $W$ and $\beta$ are updated using the standard error back-propagation algorithm and stochastic gradient descent with momentum. We have also experimented with other mimic loss functions, such as minimizing the KL divergence $KL(p_{teacher} \| p_{student})$ cost function and L2 loss on probabilities. Regression on logits outperforms all the other loss functions and is one of the key techniques for obtaining the results in the rest of this paper. We found that normalizing the logits from the teacher model by subtracting the mean and dividing by the standard deviation of each target across the training set can improve L2 loss slightly during training, but normalization is not crucial for obtaining good student mimic models.
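As a concrete illustration of the objective in Equation (1), the following is a minimal NumPy sketch of L2 regression on teacher logits, together with the optional per-target logit normalization mentioned above. It is not the authors' code: the array names, the reduced hidden-layer size, and the plain SGD-with-momentum step are illustrative assumptions.

```python
import numpy as np

# Sketch of the SNN-MIMIC objective (Eq. 1): L2 regression on teacher logits.
# The hidden layer is kept small here; the paper uses 8k-400k hidden units.
rng = np.random.default_rng(0)
D, H, K = 1845, 512, 183            # input dim, hidden units, output targets (TIMIT-like)
W = rng.normal(0, 0.01, (H, D))     # input -> hidden weights
beta = rng.normal(0, 0.01, (K, H))  # hidden -> output weights

def relu(a):
    return np.maximum(a, 0.0)

def mimic_loss_and_grads(x, z):
    """x: (T, D) inputs; z: (T, K) teacher logits. Returns loss and gradients."""
    T = x.shape[0]
    h = relu(x @ W.T)                # f(W x): hidden activations
    g = h @ beta.T                   # g(x; W, beta): student logits
    diff = g - z                     # residual against the teacher logits
    loss = 0.5 / T * np.sum(diff ** 2)
    dbeta = diff.T @ h / T           # gradient w.r.t. hidden -> output weights
    dh = (diff @ beta) * (h > 0)     # back-propagate through the ReLU
    dW = dh.T @ x / T                # gradient w.r.t. input -> hidden weights
    return loss, dW, dbeta

def normalize_logits(z_train):
    """Optional: standardize each teacher logit dimension across the training set."""
    return (z_train - z_train.mean(axis=0)) / (z_train.std(axis=0) + 1e-8)

# One SGD-with-momentum step on a toy batch (random stand-ins for real data).
x_batch = rng.normal(size=(128, D))
z_batch = normalize_logits(rng.normal(size=(128, K)))
vW, vbeta, lr, mom = np.zeros_like(W), np.zeros_like(beta), 0.01, 0.9
loss, dW, dbeta = mimic_loss_and_grads(x_batch, z_batch)
vW = mom * vW - lr * dW
W += vW
vbeta = mom * vbeta - lr * dbeta
beta += vbeta
```

In the real setup the teacher logits would come from the trained deep net or ensemble rather than a random generator, and the normalization statistics would be computed once over the whole training set.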
2.3 Speeding-up Mimic Learning by Introducing a Linear Layer

To match the number of parameters in a deep net, a shallow net has to have more non-linear hidden units in a single layer to produce a large weight matrix $W$. When training a large shallow neural network with many hidden units, we find it is very slow to learn the large number of parameters in the weight matrix between input and hidden layers of size $O(HD)$, where $D$ is the input feature dimension and $H$ is the number of hidden units. Because there are many highly correlated parameters in this large weight matrix, gradient descent converges slowly. We also notice that during learning, shallow nets spend most of the computation in the costly matrix multiplication of the input data vectors and the large weight matrix. The shallow nets eventually learn accurate mimic functions, but training to convergence is very slow (multiple weeks) even with a GPU.

We found that introducing a bottleneck linear layer with $k$ linear hidden units between the input and the non-linear hidden layer sped up learning dramatically: we can factorize the weight matrix $W \in \mathbb{R}^{H \times D}$ into the product of two low-rank matrices, $U \in \mathbb{R}^{H \times k}$ and $V \in \mathbb{R}^{k \times D}$, where $k \ll D, H$. The new cost function can be written as:

$$\mathcal{L}(U, V, \beta) = \frac{1}{2T} \sum_t \left\| \beta f(U V x^{(t)}) - z^{(t)} \right\|_2^2 \qquad (2)$$

The weights $U$ and $V$ can be learnt by back-propagating through the linear layer. This re-parameterization of the weight matrix $W$ not only increases the convergence rate of the shallow mimic nets, but also reduces memory space from $O(HD)$ to $O(k(H+D))$. Factorizing weight matrices has been previously explored in [16] and [20]. While these prior works focus on using matrix factorization in the last output layer, our method is applied between the input and hidden layer to improve the convergence speed during training. The reduced memory usage enables us to train large shallow models that were previously infeasible due to excessive memory usage. Note that the linear bottleneck can only reduce the representational power of the network, and it can always be absorbed into a single weight matrix $W$.
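The parameter and memory comparison behind Equation (2) is easy to see in a short sketch. The dimensions below follow the TIMIT mimic setup (D = 1845, k = 250), with H shrunk well below the paper's 400k units so the example runs quickly; this is an assumed illustration, not the authors' implementation.

```python
import numpy as np

# Linear bottleneck from Eq. 2: the H x D weight matrix W is replaced by
# U (H x k) times V (k x D), with k much smaller than D and H.
D, H, k = 1845, 4000, 250
rng = np.random.default_rng(0)
U = rng.normal(0, 0.01, (H, k))
V = rng.normal(0, 0.01, (k, D))

print(H * D)        # O(HD) parameters for a single weight matrix W: 7,380,000
print(k * (H + D))  # O(k(H + D)) parameters for the factorization: 1,461,250

x = rng.normal(size=(128, D))          # a batch of input vectors
h = np.maximum(x @ V.T @ U.T, 0.0)     # f(U V x): bottleneck, then the wide non-linear layer

# After training, the bottleneck can be absorbed into a single matrix W = U V,
# so it adds no representational power; it only speeds up and shrinks training.
W_absorbed = U @ V                     # shape (H, D)
```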
3 TIMIT Phoneme Recognition

The TIMIT speech corpus has 462 speakers in the training set, a separate development set for cross-validation that includes 50 speakers, and a final test set with 24 speakers. The raw waveform audio data were pre-processed using a 25 ms Hamming window shifting by 10 ms to extract Fourier-transform-based filter banks with 40 coefficients (plus energy) distributed on a mel scale, together with their first and second temporal derivatives. We included +/- 7 nearby frames to formulate the final 1845-dimension input vector. The data input features were normalized by subtracting the mean and dividing by the standard deviation on each dimension. All 61 phoneme labels are represented in tri-state, i.e., three states for each of the 61 phonemes, yielding target label vectors with 183 dimensions for training. At decoding time these are mapped to 39 classes as in [13] for scoring.
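For concreteness, the 1845-dimensional input follows from the numbers above: 40 filter-bank coefficients plus energy give 41 values per frame, the first and second temporal derivatives bring this to 123, and stacking the centre frame with +/- 7 neighbours (15 frames) gives 123 x 15 = 1845. Below is a rough NumPy sketch of the context-window stacking and per-dimension normalization; the upstream feature extraction (Hamming windows, mel filter banks, deltas) is assumed to have run already, and repeating edge frames at utterance boundaries is an assumption the paper does not specify.

```python
import numpy as np

def stack_context(frames, context=7):
    """frames: (num_frames, 123) per-frame features -> (num_frames, 123 * 15).
    Each row is the centre frame concatenated with +/- `context` neighbours;
    edge frames are simply repeated (an assumption)."""
    num_frames = frames.shape[0]
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

def normalize(features):
    """Zero mean, unit variance on each input dimension, as described above."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)

# Toy usage with random stand-ins for 123-dim frames (41 coefficients x 3).
frames = np.random.default_rng(0).normal(size=(500, 123))
inputs = normalize(stack_context(frames))
print(inputs.shape)   # (500, 1845)
```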
3.1 Deep Learning on TIMIT

Deep learning was first successfully applied to speech recognition in [14]. Following their framework, we train two deep models on TIMIT, DNN and CNN. DNN is a deep neural net consisting of three fully connected feedforward hidden layers of 2000 rectified linear units (ReLU) [15] per layer. CNN is a deep neural net consisting of a convolutional layer and max-pooling layer followed by three hidden layers containing 2000 ReLU units [2]. The CNN was trained using the same convolutional architecture as in [6]. We also formed an ensemble of nine CNN models, ECNN. The accuracy of DNN, CNN, and ECNN on the final test set is shown in Table 1. The error rate of the convolutional deep net (CNN) is about 2.1% better than the deep net (DNN). The table also shows the accuracy of shallow neural nets with 8000, 50,000, and 400,000 hidden units (SNN-8k, SNN-50k, and SNN-400k) trained on the original training data. Despite having up to 10X as many parameters as DNN, CNN, and ECNN, the shallow models are 1.4% to 2% less accurate than the DNN, 3.5% to 4.1% less accurate than the CNN, and 4.5% to 5.1% less accurate than the ECNN.

3.2 Learning to Mimic an Ensemble of Deep Convolutional TIMIT Models

The most accurate single model that we trained on TIMIT is the deep convolutional architecture in [6]. Because we have no unlabeled data from the TIMIT distribution, we use the same 1.1M points in the train set as unlabeled data for compression by throwing away the labels.¹ Re-using the 1.1M train set reduces the accuracy of the student mimic models, increasing the gap between the teacher and mimic models on test data: model compression works best when the unlabeled set is very large, and when the unlabeled samples do not fall on train points where the teacher model is likely to have overfit. To reduce the impact of the gap caused by performing compression with the original train set, we train the student model to mimic a more accurate ensemble of deep convolutional models.

¹ That SNNs can be trained to be as accurate as DNNs using only the original training data highlights that it should be possible to train accurate SNNs on the original training data given better learning algorithms.

We are able to train a more accurate model on TIMIT by forming an ensemble of nine deep convolutional neural nets, each trained with somewhat different train sets, and with architectures of different kernel sizes in the convolutional layers. We used this very accurate model, ECNN, as the teacher model to label the data used to train the shallow mimic nets. As described in Section 2.2, the logits (log probability of the predicted values) from each CNN in the ECNN model are averaged and the average logits are used as final regression targets to train the mimic SNNs.

We trained shallow mimic nets with 8k (SNN-MIMIC-8k) and 400k (SNN-MIMIC-400k) hidden units on the re-labeled 1.1M training points. As described in Section 2.3, to speed up learning both mimic models have 250 linear units between the input and non-linear hidden layer; preliminary experiments suggest that for TIMIT there is little benefit from using more than 250 linear units.

3.3 Compression Results for TIMIT

Model            Architecture             # Param.  # Hidden units  PER     Notes
SNN-8k           8k + dropout             12M       8k              23.1%   trained on original data
SNN-50k          50k + dropout            100M      50k             23.0%   trained on original data
SNN-400k         250L-400k + dropout      180M      400k            23.6%   trained on original data
DNN              2k-2k-2k + dropout       12M       6k              21.9%   trained on original data
CNN              c-p-2k-2k-2k + dropout   13M       10k             19.5%   trained on original data
ECNN             ensemble of 9 CNNs       125M      90k             18.5%
SNN-MIMIC-8k     250L-8k                  12M       8k              21.6%   no convolution or pooling layers
SNN-MIMIC-400k   250L-400k                180M      400k            20.0%   no convolution or pooling layers

Table 1: Comparison of shallow and deep models: phone error rate (PER) on the TIMIT core test set.

The bottom of Table 1 shows the accuracy of shallow mimic nets with 8000 ReLUs and 400,000 ReLUs (SNN-MIMIC-8k and -400k) trained with model compression to mimic the ECNN. Surprisingly, shallow nets are able to perform as well as their deep counterparts when trained with model compression to mimic a more accurate model. A neural net with one hidden layer (SNN-MIMIC-8k) can be trained to perform as well as a DNN with a similar number of parameters. Furthermore, if we increase the number of hidden units in the shallow net from 8k to 400k (the largest we could train), we see that a neural net with one hidden layer (SNN-MIMIC-400k) can be trained to perform comparably to a CNN, even though the SNN-MIMIC-400k net has no convolutional or pooling layers. This is interesting because it suggests that a large single hidden layer without a topology custom designed for the problem is able to reach the performance of a deep convolutional neural net that was carefully engineered with prior structure and weight-sharing, without any increase in the number of training examples, even though the same architecture trained on the original data could not.

Figure 1: Accuracy of SNNs, DNNs, and Mimic SNNs vs. # of parameters on the TIMIT Dev (left) and Test (right) sets. Accuracy of the CNN and target ECNN are shown as horizontal lines for reference.

Figure 1 shows the accuracy of shallow nets and deep nets trained on the original TIMIT 1.1M data, and shallow mimic nets trained on the ECNN targets, as a function of the number of parameters in the models. The accuracy of the CNN and the teacher ECNN are shown as horizontal lines at the top of the figures. When the number of parameters is small (about 1 million), the SNN, DNN, and SNN-MIMIC models all have similar accuracy. As the size of the hidden layers increases and the number of parameters increases, the accuracy of a shallow model trained on the original data begins to lag behind. The accuracy of the shallow mimic model, however, matches the accuracy of the DNN until about 4 million parameters, when the DNN begins to fall behind the mimic. The DNN asymptotes at around 10M parameters, while the shallow mimic continues to increase in accuracy. Eventually the mimic asymptotes at around 100M parameters to an accuracy comparable to that of the CNN. The shallow mimic never achieves the accuracy of the ECNN it is trying to mimic (because there is not enough unlabeled data), but it is able to match or exceed the accuracy of deep nets (DNNs) having the same number of parameters trained on the original data.

4 Object Recognition: CIFAR-10

To verify that the results on TIMIT generalize to other learning problems and task domains, we ran similar experiments on the CIFAR-10 Object Recognition Task [12]. CIFAR-10 consists of a set of natural images from 10 different object classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. The dataset is a labeled subset of the 80 million tiny images dataset [18] and is divided into 50,000 train and 10,000 test images. Each image is 32x32 pixels in 3 color channels, yielding input vectors with 3072 dimensions. We prepared the data by subtracting the mean and dividing by the standard deviation of each image vector to perform global contrast normalization. We then applied ZCA whitening to the normalized images. This pre-processing is the same as used in [9].
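The global contrast normalization and ZCA whitening described above can be sketched as follows. The paper only states that it follows the pre-processing of [9], so the epsilon constants and the exact form of the whitening here are assumptions.

```python
import numpy as np

def global_contrast_normalize(X, eps=1e-8):
    """X: (num_images, 3072) flattened images. Subtract each image's mean and
    divide by its standard deviation (per image, not per pixel)."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / (X.std(axis=1, keepdims=True) + eps)

def zca_fit(X, eps=1e-2):
    """Estimate the ZCA whitening transform from GCN'd training images."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return mean, W

def zca_apply(X, mean, W):
    return (X - mean) @ W

# Toy usage with random stand-ins for 32x32x3 images.
X = global_contrast_normalize(np.random.default_rng(0).normal(size=(256, 3072)))
mean, W = zca_fit(X)
X_white = zca_apply(X, mean, W)
```

The same mean and whitening matrix estimated on the training images would also be applied to the test images and to the additional tiny-images data described in the next subsection.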
4.1 Learning to Mimic an Ensemble of Deep Convolutional CIFAR-10 Models

We follow the same approach as with TIMIT: an ensemble of deep CNN models is used to label CIFAR-10 images for model compression. The logit predictions from this teacher model are used as regression targets to train a mimic shallow neural net (SNN). CIFAR-10 images have a higher dimension than TIMIT (3072 vs. 1845), but the size of the CIFAR-10 training set is only 50,000 compared to 1.1 million examples for TIMIT. Fortunately, unlike TIMIT, in CIFAR-10 we have access to unlabeled data from a similar distribution by using the superset of CIFAR-10: the 80 million tiny images dataset. We add the first one million images from the 80 million set to the original 50,000 CIFAR-10 training images to create a 1.05M mimic training (transfer) set.
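How the 1.05M transfer set might be labeled is sketched below: each image is passed through every CNN in the teacher ensemble and the per-model logits are averaged to form the regression target, as in Section 3.2. The function `teacher_logits` is a hypothetical stand-in for a forward pass through one trained CNN up to, but not including, its softmax; it is not an API from the paper.

```python
import numpy as np

def label_transfer_set(models, images, teacher_logits, batch_size=256):
    """Label images with averaged ensemble logits for model compression.
    `teacher_logits(model, batch)` is a hypothetical callable returning the
    pre-softmax logits of one trained CNN, shape (batch_size, 10)."""
    targets = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        # Average logits (not probabilities) across the ensemble members.
        targets.append(np.mean([teacher_logits(m, batch) for m in models], axis=0))
    return np.concatenate(targets, axis=0)   # (num_images, 10) regression targets
```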
Model                 Architecture                # Param.  # Hidden units  Error    Notes
DNN                   2000-2000 + dropout         10M       4k              57.8%
SNN-30k               128c-p-1200L-30k            70M       190k            21.8%    + dropout on input & hidden
single-layer          4000c-p                     125M      3.7B            18.4%    feature extraction followed by SVM
CNN [11]              64c-p-64c-p-64c-p-16lc      10k       110k            15.6%    (no augmentation) + dropout on lc
CNN [21]              64c-p-64c-p-128c-p-fc       56k       120k            15.13%   (no augmentation) + dropout on fc and stochastic pooling
teacher CNN           128c-p-128c-p-128c-p-1kfc   35k       210k            12.0%    (no augmentation) + dropout on fc and stochastic pooling
ECNN                  ensemble of 4 CNNs          140k      840k            11.0%    (no augmentation)
SNN-CNN-MIMIC-30k     64c-p-1200L-30k             54M       110k            15.4%    trained on a single CNN with no regularization
SNN-CNN-MIMIC-30k     128c-p-1200L-30k            70M       190k            15.1%    trained on a single CNN with no regularization
SNN-ECNN-MIMIC-30k    128c-p-1200L-30k            70M       190k            14.2%    trained on ensemble with no regularization

Table 2: Comparison of shallow and deep models: classification error rate on CIFAR-10. Key: c, convolution layer; p, pooling layer; lc, locally connected layer; fc, fully connected layer.

CIFAR-10 images are raw pixels for objects viewed from many different angles and positions, whereas TIMIT features are human-designed filter-bank features. In preliminary experiments we observed that non-convolutional nets do not perform well on CIFAR-10, no matter what their depth. Instead of raw pixels, the authors in [5] trained their shallow models on SIFT features. Similarly, [7] used a base convolution and pooling layer to study different deep architectures. We follow the approach in [7] to allow our shallow models to benefit from convolution while keeping the models as shallow as possible, and introduce a single layer of convolution and pooling in our shallow mimic models to act as a feature extractor that creates invariance to small translations in the pixel domain. The SNN-MIMIC models for CIFAR-10 thus consist of a convolution and max-pooling layer followed by fully connected 1200 linear units and 30k non-linear units. As before, the linear units are there only to speed learning; they do not increase the model's representational power and can be absorbed into the weights in the non-linear layer after learning.

Results on CIFAR-10 are consistent with those from TIMIT. Table 2 shows results for the shallow mimic models, and for much deeper convolutional nets. The shallow mimic net trained to mimic the teacher CNN (SNN-CNN-MIMIC-30k) achieves accuracy comparable to CNNs with multiple convolutional and pooling layers. And by training the shallow model to mimic the ensemble of CNNs (SNN-ECNN-MIMIC-30k), accuracy is improved an additional 0.9%. The mimic models are able to achieve accuracies previously unseen on CIFAR-10 for models with so few layers. Although the deep convolutional nets have more hidden units than the shallow mimic models, because of weight sharing, the deeper nets with multiple convolution layers have fewer parameters than the shallow fully connected mimic models. Still, it is surprising to see how accurate the shallow mimic models are, and that their performance continues to improve as the performance of the teacher model improves (see further discussion of this in Section 5.2).
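For concreteness, a PyTorch-style sketch of the SNN-CNN-MIMIC-30k shape listed in Table 2 (one convolution and pooling layer, a 1200-unit linear bottleneck, 30k non-linear units, and 10 outputs regressed onto teacher logits) might look like the following. The kernel size, padding, and pooling window are assumptions; the paper specifies only the layer widths.

```python
import torch.nn as nn

# Assumed sketch of SNN-CNN-MIMIC-30k (128c-p-1200L-30k). Convolution and
# pooling geometry is guessed; only the layer widths come from the paper.
snn_cnn_mimic_30k = nn.Sequential(
    nn.Conv2d(3, 128, kernel_size=5, padding=2),  # "128c" convolutional feature extractor
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                  # "p": 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(128 * 16 * 16, 1200, bias=False),   # "1200L" linear bottleneck (no activation)
    nn.Linear(1200, 30000),                       # "30k" non-linear hidden layer
    nn.ReLU(),
    nn.Linear(30000, 10),                         # student logits
)

mimic_loss = nn.MSELoss()   # L2 regression onto the teacher logits, as in Eq. 1
```

Under these assumptions the fully connected layers hold roughly 75M parameters, in the same ballpark as the 70M reported for this model in Table 2; the exact count depends on the convolution and pooling geometry, which the paper does not spell out.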
Figure2:Shallowmimictendsnottoovert.Themechanismsabovecanbeseenasformsofregularizationthathelppreventoverttinginthestudentmodel.Typically,shallowmodelstrainedontheoriginaltargetsaremorepronetooverttingthandeepmodels—theybegintoovertbeforelearningtheaccuratefunctionslearnedbydeepermodelsevenwithdropout(seeFigure2).Ifwehadmoreeffectiveregu-larizationmethodsforshallowmodels,someoftheperformancegapbetweenshallowanddeepmodelsmightdisappear.Modelcompressionappearstobeaformofregularizationthatisef-fectiveatreducingthisgap.5.2TheCapacityandRepresentationalPowerofShallowModels Figure3:Accuracyofstudentmodelscontinuestoimproveasaccuracyofteachermodelsimproves.Figure3showsresultsofanexperimentwithTIMITwherewetrainedshallowmimicmod-elsoftwosizes(SNN-MIMIC-8kandSNN-MIMIC-160k)onteachermodelsofdifferentaccuracies.Thetwoshallowmimicmodelsaretrainedonthesamenumberofdatapoints.Theonlydifferencebetweenthemisthesizeofthehiddenlayer.Thex-axisshowstheaccuracyoftheteachermodel,andthey-axisistheaccu-racyofthemimicmodels.Linesparalleltothediagonalsuggestthatincreasesintheaccuracyoftheteachermodelsyieldsimilarincreasesintheaccuracyofthemimicmodels.Althoughthedatadoesnotfallperfectlyonadiagonal,thereisstrongevidencethattheaccuracyofthemimicmodelscontinuestoincreaseastheac-curacyoftheteachermodelimproves,suggest-ingthatthemimicmodelsarenot(yet)runningoutofcapacity.Whentrainingonthesametar-gets,SNN-MIMIC-8kalwaysperformworsethanSNN-MIMIC-160Kthathas10timesmorepa-rameters.Althoughthereisaconsistentperformancegapbetweenthetwomodelsduetothediffer-enceinsize,thesmallershallowmodelwaseventuallyabletoachieveaperformancecomparabletothelargershallownetbylearningfromabetterteacher,andtheaccuracyofbothmodelscontinuestoincreaseasteacheraccuracyincreases.Thissuggeststhatshallowmodelswithanumberofpa-rameterscomparabletodeepmodelsprobablyarecapableoflearningevenmoreaccuratefunctions7 78 79 80 81 82 83 78 79 80 81 82 83 Accuracy of Mimic Model on Dev SetAccuracy of Teacher Model on Dev SetMimic with 8k Non-Linear Units Mimic with 160k Non-Linear Units y=x (no student-teacher gap) 74 74.5 75 75.5 76 76.5 77 77.5 0 2 4 6 8 10 12 14 Phone Recognition AccuracyNumber of EpochsSNN-8k SNN-8k + dropout SNN-Mimic-8k 
Similarly, on CIFAR-10 we saw that increasing the accuracy of the teacher model by forming an ensemble of deep CNNs yielded a commensurate increase in the accuracy of the student model. We see little evidence that shallow models have limited capacity or representational power. Instead, the main limitation appears to be the learning and regularization procedures used to train the shallow models.

5.3 Parallel Distributed Processing vs. Deep Sequential Processing

Our results show that shallow nets can be competitive with deep models on speech and vision tasks. In our experiments the deep models usually required 8-12 hours to train on Nvidia GTX 580 GPUs to reach the state-of-the-art performance on the TIMIT and CIFAR-10 datasets. Interestingly, although some of the shallow mimic models have more parameters than the deep models, the shallow models train much faster and reach similar accuracies in only 1-2 hours.

Also, given parallel computational resources, at run-time shallow models can finish computation in 2 or 3 cycles for a given input, whereas a deep architecture has to make sequential inference through each of its layers, expending a number of cycles proportional to the depth of the model. This benefit can be important in on-line inference settings where data parallelization is not as easy to achieve as it is in the batch inference setting. For real-time applications such as surveillance or real-time speech translation, a model that responds in fewer cycles can be beneficial.

6 Future Work

The tiny images dataset contains 80 million images. We are currently investigating whether, by labeling these 80M images with a teacher, it is possible to train shallow models with no convolutional or pooling layers to mimic deep convolutional models.

This paper focused on training the shallowest-possible models to mimic deep models in order to better understand the importance of model depth in learning. As suggested in Section 5.3, there are practical applications of this work as well: student models of small-to-medium size and depth can be trained to mimic very large, high-accuracy deep models, and ensembles of deep models, thus yielding better accuracy with reduced runtime cost than is currently achievable without model compression. This approach allows one to flexibly adjust the trade-off between accuracy and computational cost.

In this paper we are able to demonstrate empirically that shallow models can, at least in principle, learn more accurate functions without a large increase in the number of parameters. The algorithm we use to do this (training the shallow model to mimic a more accurate deep model), however, is awkward. It depends on the availability of either a large unlabeled dataset (to reduce the gap between teacher and mimic model) or a teacher model of very high accuracy, or both. Developing algorithms to train shallow models of high accuracy directly from the original data without going through the intermediate teacher model would, if possible, be a significant contribution.

7 Conclusions

We demonstrate empirically that shallow neural nets can be trained to achieve performance previously achievable only by deep models on the TIMIT phoneme recognition and CIFAR-10 image recognition tasks. Single-layer fully connected feedforward nets trained to mimic deep models can perform similarly to well-engineered complex deep convolutional architectures. The results suggest that the strength of deep learning may arise in part from a good match between deep architectures and current training procedures, and that it may be possible to devise better learning algorithms to train more accurate shallow feed-forward nets. For a given number of parameters, depth may make learning easier, but may not always be essential.

Acknowledgements

We thank Li Deng for generous help with TIMIT, Li Deng and Ossama Abdel-Hamid for the code for their deep convolutional TIMIT model, Chris Burges, Li Deng, Ran Gilad-Bachrach, Tapas Kanungo and John Platt for discussion that significantly improved this work, David Johnson for help with the speech model, and Mike Aultman for help with the GPU cluster.
References

[1] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4277-4280. IEEE, 2012.
[2] Ossama Abdel-Hamid, Li Deng, and Dong Yu. Exploring convolutional neural network structures and optimization techniques for speech recognition. Interspeech 2013, 2013.
[3] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535-541. ACM, 2006.
[4] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.
[5] Yann N. Dauphin and Yoshua Bengio. Big neural networks waste capacity. arXiv preprint arXiv:1301.3583, 2013.
[6] Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael Seltzer, Geoff Zweig, Xiaodong He, Jason Williams, et al. Recent advances in deep learning for speech research at Microsoft. ICASSP 2013, 2013.
[7] David Eigen, Jason Rolfe, Rob Fergus, and Yann LeCun. Understanding deep architectures using a recursive convolutional network. arXiv preprint arXiv:1312.1847, 2013.
[8] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625-660, 2010.
[9] Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of The 30th International Conference on Machine Learning, pages 1319-1327, 2013.
[10] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[11] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[12] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep., 2009.
[13] K. F. Lee and H. W. Hon. Speaker-independent phone recognition using hidden Markov models. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(11):1641-1648, 1989.
[14] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):14-22, 2012.
[15] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on Machine Learning, pages 807-814. Omnipress, Madison, WI, 2010.
[16] Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655-6659. IEEE, 2013.
[17] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using context-dependent deep neural networks. In Interspeech, pages 437-440, 2011.
[18] Antonio Torralba, Robert Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(11):1958-1970, 2008.
[19] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371-3408, 2010.
[20] Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular value decomposition. Proc. Interspeech, Lyon, France, 2013.
[21] Matthew D. Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.