Efficient Learning of Sparse Representations with an Energy-Based Model

Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann LeCun
Courant Institute of Mathematical Sciences
New York University, New York, NY 10003
{ranzato, crispy, sumit, yann}@cs.nyu.edu

Abstract

We describe a novel unsupervised method for learning sparse, overcomplete features. The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. The model produces "stroke detectors" when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. Inference and learning are very fast, requiring no preprocessing, and no expensive sampling. Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. Finally, an extension of the method is described to learn topographical filter maps.

1 Introduction

Unsupervised learning methods are often used to produce pre-processors and feature extractors for image analysis systems. Popular methods such as Wavelet decomposition, PCA, Kernel-PCA, Non-Negative Matrix Factorization [1], and ICA produce compact representations with somewhat uncorrelated (or independent) components [2]. Most methods produce representations that either preserve or reduce the dimensionality of the input. However, several recent works have advocated the use of sparse-overcomplete representations for images, in which the dimension of the feature vector is larger than the dimension of the input, but only a small number of components are non-zero for any one image [3, 4]. Sparse-overcomplete representations present several potential advantages. Using high-dimensional representations increases the likelihood that image categories will be easily (possibly linearly) separable. Sparse representations can provide a simple interpretation of the input data in terms of a small number of "parts" by extracting the structure hidden in the data. Furthermore, there is considerable evidence that biological vision uses sparse representations in early visual areas [5, 6].

It seems reasonable to consider a representation "complete" if it is possible to reconstruct the input from it, because the information contained in the input would need to be preserved in the representation itself. Most unsupervised learning methods for feature extraction are based on this principle, and can be understood in terms of an encoder module followed by a decoder module. The encoder takes the input and computes a code vector, for example a sparse and overcomplete representation. The decoder takes the code vector given by the encoder and produces a reconstruction of the input. Encoder and decoder are trained in such a way that reconstructions provided by the decoder are as similar as possible to the actual input data, when these input data have the same statistics as the training samples. Methods such as Vector Quantization, PCA, auto-encoders [7], Restricted Boltzmann Machines [8], and others [9] have exactly this architecture but with different constraints on the code and learning algorithms, and different kinds of encoder and decoder architectures. In other approaches, the encoding module is missing but its role is taken by a minimization in code
space which retrieves the representation [3]. Likewise, in non-causal models the decoding module is missing and sampling techniques must be used to reconstruct the input from a code [4]. In sec. 2, we describe an energy-based model which has both an encoding and a decoding part. After training, the encoder allows very fast inference because finding a representation does not require solving an optimization problem. The decoder provides an easy way to reconstruct input vectors, thus allowing the trainer to assess directly whether the representation extracts most of the information from the input.

Most methods find representations by minimizing an appropriate loss function during training. In order to learn sparse representations, a term enforcing sparsity is added to the loss. This term usually penalizes those code units that are active, aiming to make the distribution of their activities highly peaked at zero with heavy tails [10][4]. A drawback of these approaches is that some action might need to be taken in order to prevent the system from always activating the same few units and collapsing all the others to zero [3]. An alternative approach is to embed a sparsifying module, e.g. a non-linearity, in the system [11]. This in general forces all the units to have the same degree of sparsity, but it also makes a theoretical analysis of the algorithm more complicated. In this paper, we present a system which achieves sparsity by placing a non-linearity between encoder and decoder. Sec. 2.1 describes this module, dubbed the "Sparsifying Logistic", which is a logistic function with an adaptive bias that tracks the mean of its input. This non-linearity is parameterized in a simple way which allows us to control the degree of sparsity of the representation as well as the entropy of each code unit.

Unfortunately, learning the parameters in encoder and decoder cannot be achieved by simple back-propagation of the gradients of the reconstruction error: the Sparsifying Logistic is highly non-linear and resets most of the gradients coming from the decoder to zero. Therefore, in sec. 3 we propose to augment the loss function by considering not only the parameters of the system but also the code vectors as variables over which the optimization is performed. Exploiting the fact that 1) it is fairly easy to determine the weights in encoder and decoder when "good"
codes are given, and 2) it is straightforward to compute the optimal codes when the parameters in encoder and decoder are fixed, we describe a simple iterative coordinate descent optimization to learn the parameters of the system. The procedure can be seen as a sort of deterministic version of the EM algorithm in which the code vectors play the role of hidden variables. The learning algorithm described turns out to be particularly simple, fast and robust. No pre-processing is required for the input images, beyond a simple centering and scaling of the data. In sec. 4 we report experiments of feature extraction on handwritten numerals and natural image patches. When the system has a linear encoder and decoder (remember that the Sparsifying Logistic is a separate module), the filters resemble "object parts" for the numerals, and localized, oriented features for the natural image patches. Applying these features for the classification of the digits in the MNIST dataset, we have achieved by a small margin the best accuracy ever reported in the literature. We conclude by showing a hierarchical extension which suggests the form of simple and complex cell receptive fields, and leads to a topographic layout of the filters which is reminiscent of the topographic maps found in area V1 of the visual cortex.

2 The Model

The proposed model is based on three main components, as shown in fig. 1:

- The encoder: a set of feed-forward filters parameterized by the rows of matrix W_C, that computes a code vector from an image patch X.
- The Sparsifying Logistic: a non-linear module that transforms the code vector Z into a sparse code vector Z̄ with components in the range [0, 1].
- The decoder: a set of reverse filters parameterized by the columns of matrix W_D, that computes a reconstruction of the input image patch from the sparse code vector Z̄.

The energy of the system is the sum of two terms:

    E(X, Z; W_C, W_D) = E_C(X, Z; W_C) + E_D(X, Z; W_D)    (1)

The first term is the code prediction energy which measures the discrepancy between the output of the encoder and the code vector Z. In our experiments, it is defined as:

    E_C(X, Z; W_C) = (1/2) ||Z − Enc(X; W_C)||² = (1/2) ||Z − W_C X||²    (2)

The second term is the reconstruction energy which measures the discrepancy between the reconstructed image patch produced by the decoder and the input image patch X.
Figure 1: Architecture of the energy-based model for learning sparse-overcomplete representations. The input image patch X is processed by the encoder to produce an initial estimate of the code vector. The encoding prediction energy E_C measures the squared distance between the code vector Z and its estimate. The code vector Z is passed through the Sparsifying Logistic non-linearity which produces a sparsified code vector Z̄. The decoder reconstructs the input image patch from the sparse code. The reconstruction energy E_D measures the squared distance between the reconstruction and the input image patch. The optimal code vector Z for a given patch minimizes the sum of the two energies. The learning process finds the encoder and decoder parameters that minimize the energy for the optimal code vectors averaged over a set of training samples.

Figure 2: Toy example of sparsifying rectification produced by the Sparsifying Logistic for different choices of the parameters η and β. The input is a sequence of Gaussian random variables. The output, computed by using eq. 4, is a sequence of spikes whose rate and amplitude depend on the parameters η and β. In particular, increasing β has the effect of making the output approximately binary, while increasing η increases the firing rate of the output signal.

In our experiments, it is defined as:

    E_D(X, Z; W_D) = (1/2) ||X − Dec(Z̄; W_D)||² = (1/2) ||X − W_D Z̄||²    (3)

where Z̄ is computed by applying the Sparsifying Logistic non-linearity to Z.

2.1 The Sparsifying Logistic

The Sparsifying Logistic module is a non-linear front-end to the decoder that transforms the code vector into a sparse vector with positive components. Let us consider how it transforms the k-th training sample. Let z_i(k) be the i-th component of the code vector and z̄_i(k) be its corresponding output, with i ∈ [1..m] where m is the number of components in the code vector. The relation between these variables is given by:

    z̄_i(k) = η e^{β z_i(k)} / ζ_i(k),  i ∈ [1..m],  with  ζ_i(k) = η e^{β z_i(k)} + (1 − η) ζ_i(k−1)    (4)

where η ∈ [0, 1]. ζ_i(k) is the weighted sum of values of e^{β z_i(n)} corresponding to previous training samples n, with n ≤ k. The weights in this sum are exponentially decaying, as can be seen by unrolling the recursive equation in 4. This non-linearity can be easily understood as a weighted softmax function applied over consecutive samples of the same code unit. This produces a sequence of positive values which, for large values of β and small values of η, is characterized by brief and punctate activities in time. This behavior is reminiscent of the spiking behavior of neurons. η controls the sparseness of the code by determining the "width" of the time window over which samples are summed up. β controls the degree of "softness" of the function. Large β values yield quasi-binary outputs, while small β values produce more graded responses; fig. 2 shows how these parameters affect the output when the input is a Gaussian random variable.

Another view of the Sparsifying Logistic is as a logistic function with an adaptive bias that tracks the average input; by dividing the right hand side of eq. 4 by η e^{β z_i(k)} we have:

    z̄_i(k) = [1 + e^{−β (z_i(k) − (1/β) log((1−η)/η · ζ_i(k−1)))}]^{−1},  i ∈ [1..m]    (5)

Notice how β directly controls the gain of the logistic. Large values of this parameter will turn the non-linearity into a step function and will make Z̄(k) a binary code vector. In our experiments, ζ_i is treated as a trainable parameter and kept fixed after the learning phase. In this case, the Sparsifying Logistic reduces to a logistic function with a fixed gain and a learned bias.
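The two forms of the Sparsifying Logistic, eq. 4 and eq. 5, can be sketched and cross-checked numerically; this is a minimal numpy sketch in which the function names and the particular parameter values are ours, not from the paper:

```python
import numpy as np

def sparsifying_logistic(z, zeta_prev, eta=0.02, beta=1.0):
    """Eq. 4: zbar_i(k) = eta*exp(beta*z_i(k)) / zeta_i(k), with the running
    sum zeta_i(k) = eta*exp(beta*z_i(k)) + (1 - eta)*zeta_i(k-1)."""
    a = eta * np.exp(beta * z)
    zeta = a + (1.0 - eta) * zeta_prev
    return a / zeta, zeta

def logistic_form(z, zeta_prev, eta=0.02, beta=1.0):
    """Eq. 5: the same mapping written as a logistic with an adaptive bias."""
    bias = (1.0 / beta) * np.log((1.0 - eta) / eta * zeta_prev)
    return 1.0 / (1.0 + np.exp(-beta * (z - bias)))

# Feed a sequence of Gaussian samples through one code unit, as in fig. 2.
rng = np.random.default_rng(0)
zeta = 1.0
outputs = []
for z in rng.normal(size=1000):
    zbar, zeta_next = sparsifying_logistic(z, zeta, eta=0.02, beta=5.0)
    # eq. 5 is eq. 4 with numerator and denominator divided by eta*exp(beta*z)
    assert np.isclose(zbar, logistic_form(z, zeta, eta=0.02, beta=5.0))
    outputs.append(zbar)
    zeta = zeta_next
outputs = np.array(outputs)
# with large beta and small eta the output stays near zero most of the
# time, with occasional spikes, as described in the text
```

The assertion verifies that eq. 5 is an algebraic rewriting of eq. 4, so the adaptive-bias view and the weighted-softmax view are the same non-linearity.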
For large β, in the continuous-time limit, the spikes can be shown to follow a homogeneous Poisson process. In this framework, sparsity is a "temporal" property characterizing each single unit in the code, rather than a "spatial" property shared among all the units in a code. Spatial sparsity usually requires some sort of ad-hoc normalization to ensure that the components of the code that are "on" are not always the same ones. Our solution tackles this problem differently: each unit must be sparse when encoding different samples, independently from the activities of the other components in the code vector. Unlike other methods [10], no ad-hoc rescaling of the weights or code units is necessary.

3 Learning

Learning is accomplished by minimizing the energy in eq. 1. Indicating with superscripts the indices referring to the training samples and making explicit the dependencies on the code vectors, we can rewrite the energy of the system as:

    E(W_C, W_D, Z^1, …, Z^P) = Σ_{i=1}^{P} [E_D(X^i, Z^i; W_D) + E_C(X^i, Z^i; W_C)]    (6)

This is also the loss function we propose to minimize during training. The parameters of the system, W_C and W_D, are found by solving the following minimization problem:

    {W_C, W_D} = argmin_{W_c, W_d} min_{Z^1, …, Z^P} E(W_c, W_d, Z^1, …, Z^P)    (7)

It is easy to minimize this loss with respect to W_C and W_D when the Z^i are known and, particularly for our experiments where encoder and decoder are a set of linear filters, this is a convex quadratic optimization problem. Likewise, when the parameters in the system are fixed it is straightforward to minimize with respect to the codes Z^i. These observations suggest a coordinate descent optimization procedure. First, we find the optimal Z^i for a given set of filters in encoder and decoder. Then, we update the weights in the system fixing Z^i to the value found at the previous step. We iterate these two steps in alternation until convergence. In our experiments we used an on-line version of this algorithm which can be summarized as follows:

1. propagate the input X through the encoder to get a codeword Z_init
2. minimize the loss in eq. 6, sum of reconstruction and code prediction energy, with respect to Z by gradient descent using Z_init as the initial value
3. compute the gradient of the loss with respect to W_C and W_D, and perform a gradient step

where the superscripts have been dropped because we are referring to a generic training sample.
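The three steps above can be sketched as follows. This is a minimal single-sample numpy sketch under our own naming and dimension choices; the derivative of eq. 4 with respect to z, which works out to β z̄ (1 − z̄), is our derivation, and the regularization terms used in the experiments are omitted:

```python
import numpy as np

def sparse_code(z, zeta_prev, eta, beta):
    # eq. 4, with zeta from previous samples held fixed during optimization
    a = eta * np.exp(beta * z)
    return a / (a + (1.0 - eta) * zeta_prev)

def online_step(X, Wc, Wd, zeta, eta=0.02, beta=1.0,
                lr_z=0.1, lr_c=0.005, lr_d=0.001, n_code_iters=20):
    # 1. propagate X through the encoder to get the initial code
    Z = Wc @ X
    # 2. minimize the loss over Z by gradient descent from that initial value
    for _ in range(n_code_iters):
        zbar = sparse_code(Z, zeta, eta, beta)
        grad_Z = (Z - Wc @ X) \
            - beta * zbar * (1 - zbar) * (Wd.T @ (X - Wd @ zbar))
        Z = Z - lr_z * grad_Z
    # 3. one gradient step on the encoder and decoder weights
    zbar = sparse_code(Z, zeta, eta, beta)
    Wc = Wc + lr_c * np.outer(Z - Wc @ X, X)        # descend E_C
    Wd = Wd + lr_d * np.outer(X - Wd @ zbar, zbar)  # descend E_D
    # update the running sums zeta for the next sample (eq. 4)
    zeta = eta * np.exp(beta * Z) + (1.0 - eta) * zeta
    return Z, Wc, Wd, zeta
```

Because Z is initialized at the encoder prediction, the code-prediction term of the gradient starts at zero and the inner loop mainly pulls Z toward a code that reconstructs X well, which is the behavior described in the text.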
Since the code vector Z minimizes both energy terms, it not only minimizes the reconstruction energy, but is also as similar as possible to the code predicted by the encoder. After training, the decoder settles on filters that produce low reconstruction errors from minimum-energy, sparsified code vectors Z̄, while the encoder simultaneously learns filters that predict the corresponding minimum-energy codes Z. In other words, the system converges to a state where minimum-energy code vectors not only reconstruct the image patch but can also be easily predicted by the encoder filters. Moreover, starting the minimization over Z from the prediction given by the encoder allows convergence in very few iterations. After the first few thousand training samples, the minimization over Z requires just 4 iterations on average. When training is complete, a simple pass through the encoder will produce an accurate prediction of the minimum-energy code vector.

In the experiments, two regularization terms are added to the loss in eq. 6: a "lasso" term equal to the L1 norm of W_C and W_D, and a "ridge" term equal to their L2 norm. These have been added to encourage the filters to localize and to suppress noise.

Notice that we could differently weight the encoding and the reconstruction energies in the loss function. In particular, assigning a very large weight to the encoding energy corresponds to turning the penalty on the encoding prediction into a hard constraint. The code vector would be assigned the value predicted by the encoder, and the minimization would reduce to a mean square error minimization through back-propagation as in a standard autoencoder. Unfortunately, this autoencoder-like
Figure 3: Results of feature extraction from 12x12 patches taken from the Berkeley dataset, showing the 200 filters learned by the decoder.

learning fails because the Sparsifying Logistic is almost always highly saturated (otherwise the representation would not be sparse). Hence, the gradients back-propagated to the encoder are likely to be very small. This causes the direct minimization over encoder parameters to fail, but does not seem to adversely affect the minimization over code vectors. We surmise that the large number of degrees of freedom in code vectors (relative to the number of encoder parameters) makes the minimization problem considerably better conditioned. In other words, the alternated descent algorithm performs a minimization over a much larger set of variables than regular back-prop, and hence is less likely to fall victim to local minima. The alternated descent over code and parameters can be seen as a kind of deterministic EM. It is related to gradient descent over parameters (standard back-prop) in the same way that the EM algorithm is related to gradient ascent for maximum likelihood estimation.

This learning algorithm is not only simple but also very fast. For example, in the experiments of sec. 4.1 it takes less than 30 minutes on a 2GHz processor to learn 200 filters from 100,000 patches of size 12x12, and after just a few minutes the filters are already very similar to the final ones. This is much more efficient and robust than what can be achieved using other methods. For example, in Olshausen and Field's [10] linear generative model, inference is expensive because minimization in code space is necessary during testing as well as training. In Teh et al. [4], learning is very expensive because the decoder is missing, and sampling techniques [8] must be used to provide a reconstruction. Moreover, most methods rely on pre-processing of the input patches such as whitening, PCA and low-pass filtering in order to improve results and speed up convergence. In our experiments, we need only center the data by subtracting a global mean and scale by a constant.

4 Experiments

In this section we present some applications of the proposed energy-based model. Two standard datasets were used: natural image patches and handwritten digits. As described in sec. 2, the encoder and decoder learn linear filters. As mentioned in sec. 3, the input images were
only trivially pre-processed.

4.1 Feature Extraction from Natural Image Patches

In the first experiment, the system was trained on 100,000 gray-level patches of size 12x12 extracted from the Berkeley segmentation dataset [12]. Pre-processing of images consists of subtracting the global mean pixel value (which is about 100), and dividing the result by 125. We chose an overcompleteness factor approximately equal to 2 by representing the input with 200 code units¹. The Sparsifying Logistic parameters η and β were equal to 0.02 and 1, respectively. The learning rate for updating W_C was set to 0.005 and for W_D to 0.001. These are decreased progressively during training. The coefficients of the L1 and L2 regularization terms were about 0.001. The learning rate for the minimization in code space was set to 0.1, and was multiplied by 0.8 every 10 iterations, for at most 100 iterations. Some components of the sparse code must be allowed to take continuous values to account for the average value of a patch. For this reason, during training we saturated the running sums ζ to allow some units to be always active. Values of ζ were saturated to 10⁹. We verified empirically that subtracting the local mean from each patch eliminates the need for this saturation. However, saturation during training makes testing less expensive. Training on this dataset takes less than half an hour on a 2GHz processor.

Examples of learned encoder and decoder filters are shown in figure 3. They are spatially localized, and have different orientations, frequencies and scales. They are somewhat similar to, but more localized than, Gabor wavelets and are reminiscent of the receptive fields of V1 neurons. Interestingly, the encoder and decoder filter values are nearly identical up to a scale factor. After training, inference is extremely fast, requiring only a simple matrix-vector multiplication.

¹Overcompleteness must be evaluated by considering the number of code units and the effective dimensionality of the input as given by PCA.

Figure 4: Top: A randomly selected subset of encoder filters learned by our energy-based model when trained on the MNIST handwritten digit dataset. Bottom: An example of reconstruction of a digit randomly extracted from the test dataset. The reconstruction is made by adding "parts": it is the additive linear combination of few basis functions of the decoder with positive coefficients.

4.2 Feature Extraction from Handwritten Numerals

The energy-based model was trained on 60,000 handwritten digits from the MNIST dataset [13], which contains quasi-binary images of size 28x28 (784 pixels). The model is the same as in the previous experiment. The number of components in the code vector was 196. While 196 is less than the 784 inputs, the representation is still overcomplete, because the effective dimension of the digit dataset is considerably less than 784. Pre-processing consisted of dividing each pixel value by 255. Parameters η and β in the temporal softmax were 0.01 and 1, respectively. The other parameters of the system have been set to values similar to those of the previous experiment on natural image patches. Each one of the filters, shown in the top part of fig. 4, contains an elementary "part" of a digit. Straight stroke detectors are present, as in the previous experiment, but curly strokes can also be found. Reconstruction of most single digits can be achieved by a linear additive combination of a small number of filters since the output of the Sparsifying Logistic is sparse and positive. The bottom part of fig. 4 illustrates this reconstruction by "parts".

4.3 Learning Local Features for the MNIST dataset

Deep convolutional networks trained with backpropagation hold the current record for accuracy on the MNIST dataset [14, 15]. While back-propagation produces good low-level features, it is well known that deep networks are particularly challenging for gradient-descent learning. Hinton et al. [16] have recently shown that initializing the weights of a deep network using unsupervised learning before performing supervised learning with back-propagation can significantly improve the performance of a deep network. This section describes a similar experiment in which we used the proposed method to initialize the first layer of a large convolutional network. We used an architecture essentially identical to LeNet-5 as described in [15]. However, because our model produces sparse features, our network had a considerably larger number of feature maps: 50 for layers 1 and 2, 50 for layers 3 and 4, 200 for layer 5, and 10 for the output layer. The numbers for LeNet-5 were 6, 16, 100, and 10 respectively. We refer to our larger network as the 50-50-200-10 network. We trained this network on 55,000 samples from MNIST, keeping the remaining 5,000 training samples as a validation set. When the error on the validation set reached its minimum, an additional five sweeps were performed on the training set augmented with the validation set (unless this increased the training loss). Then the learning was stopped, and the final error rate on the test set was measured. When the weights are initialized randomly, the 50-50-200-10 achieves a test error rate of 0.7%, to be compared with the 0.95% obtained by [15] with the 6-16-100-10 network.

In the next experiment, the proposed sparse feature learning method was trained on 5x5 image patches extracted from the MNIST training set. The model had a 50-dimensional code. The encoder filters were used to initialize the first layer of the 50-50-200-10 net. The network was then trained in the usual way, except that the first layer was kept fixed for the first 10 epochs through the training set. The 50 filters after training are shown in fig. 5. The test error rate was 0.6%. To our knowledge, this is the best result ever reported with a method trained on the original MNIST set, without deskewing or augmenting the training set with distorted samples.

The training set was then augmented with samples obtained by elastically distorting the original training samples, using a method similar to [14]. The error rate of the 50-50-200-10 net with random initialization was 0.49% (to be compared to 0.40% reported in [14]). By initializing the first layer with the filters obtained with the proposed method, the test error rate dropped to 0.39%. While this is the best numerical result ever reported on MNIST, it is not statistically different from [14].
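The first-layer initialization described here amounts to reshaping the learned encoder matrix into convolution kernels. The following numpy sketch is ours (the kernel layout and helper names are assumptions, not from the paper); the naive convolution shows why patch-trained filters transfer: each kernel computes at every location exactly what the corresponding encoder row computes on a 5x5 patch.

```python
import numpy as np

def encoder_to_conv_kernels(Wc, size=5):
    # Each row of the 50 x 25 encoder matrix learned on 5x5 patches is one
    # filter; lay the rows out as (out_channels, in_channels, h, w) kernels.
    return Wc.reshape(Wc.shape[0], 1, size, size)

def conv2d_valid(img, kernel):
    # naive 'valid' convolution (cross-correlation, as in conv nets)
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * kernel)
    return out
```

Because numpy's reshape and ravel both use row-major order, `conv2d_valid(img, K[i, 0])` at position (0, 0) equals the dot product of encoder row i with the top-left 5x5 patch.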
Figure 5: Filters in the first convolutional layer after training when the network is randomly initialized (top row) and when the first layer of the network is initialized with the features learned by the unsupervised energy-based model (bottom row).

Architecture         | 20K         | 60K         | 60K + Distortions
6-16-100-10 [15]     | -           | 0.95        | 0.60
5-50-100-10 [14]     | -           | -           | 0.40
50-50-200-10         | 1.01 / 0.89 | 0.70 / 0.60 | 0.49 / 0.39

Table 1: Comparison of test error rates (%) on the MNIST dataset using convolutional networks with various training set sizes: 20,000, 60,000, and 60,000 plus 550,000 elastic distortions. For each size, results are reported with randomly initialized filters, and with first-layer filters initialized using the proposed algorithm (second number in each pair).

4.4 Hierarchical Extension: Learning Topographic Maps

It has already been observed that features extracted from natural image patches resemble Gabor-like filters, see fig. 3. It has been recently pointed out [6] that these filters produce codes with somewhat uncorrelated but not independent components. In order to capture higher order dependencies among code units, we propose to extend the encoder architecture by adding to the linear filter bank a second layer of units. In this hierarchical model of the encoder, the units produced by the filter bank are now laid out on a two-dimensional grid and filtered according to a fixed weighted-mean kernel. This assigns a larger weight to the central unit and a smaller weight to the units in the periphery. In order to activate a unit at the output of the Sparsifying Logistic, all the afferent unrectified units in the first layer must agree in giving a strong positive response to the input patch. As a consequence, neighboring filters will exhibit similar features. Also, the top-level units will encode features that are more translation and rotation invariant, de facto modeling complex cells. Using a neighborhood of size 3x3, toroidal boundary conditions, and computing code vectors with 400 units from 12x12 input patches from the Berkeley dataset, we have obtained the topographic map shown in fig. 6. Filters exhibit features that are locally similar in orientation, position, and phase. There are two low-frequency clusters and pinwheel regions similar to what is experimentally found in cortical topography.
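The fixed weighted-mean filtering over the 2-D grid of units, with toroidal boundary conditions, can be sketched as follows. The kernel weights are illustrative (the paper only states a larger central weight and smaller peripheral weights):

```python
import numpy as np

def topographic_filter(z_grid, kernel):
    # weighted mean over each unit's neighborhood on the grid, with
    # wrap-around (toroidal) boundaries implemented via np.roll
    kh, kw = kernel.shape
    out = np.zeros_like(z_grid, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            shifted = np.roll(z_grid, (kh // 2 - dy, kw // 2 - dx),
                              axis=(0, 1))
            out += kernel[dy, dx] * shifted
    return out

# 3x3 kernel with a larger central weight, normalized to sum to 1
K = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, 2.0],
              [1.0, 2.0, 1.0]])
K /= K.sum()
```

With 400 code units the grid would be 20x20. Because the kernel sums to one, a uniform grid of activities is left unchanged; a top-level unit is strongly active only when its whole neighborhood responds, which is what ties neighboring filters to similar features.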
Figure 6: Example of filter maps learned by the topographic hierarchical extension of the model. The outline of the model is shown on the right.

5 Conclusions

An energy-based model was proposed for unsupervised learning of sparse overcomplete representations. Learning to extract sparse features from data has applications in classification, compression, denoising, inpainting, segmentation, and super-resolution interpolation. The model has none of the inefficiencies and idiosyncrasies of previously proposed sparse-overcomplete feature learning methods. The decoder produces accurate reconstructions of the patches, while the encoder provides a fast prediction of the code without the need for any particular preprocessing of the input images. It seems that a non-linearity that directly sparsifies the code is considerably simpler to control than adding a sparsity term in the loss function, which generally requires ad-hoc normalization procedures [3].

In the current work, we used linear encoders and decoders for simplicity, but the model authorizes non-linear modules, as long as gradients can be computed and back-propagated through them. As briefly presented in sec. 4.4, it is straightforward to extend the original framework to hierarchical architectures in the encoder, and the same is possible in the decoder. Another possible extension would stack multiple instances of the system described in the paper, with each system as a module in a multi-layer structure where the sparse code produced by one feature extractor is fed to the input of a higher-level feature extractor.

Future work will include the application of the model to various tasks, including facial feature extraction, image denoising, image compression, inpainting, classification, and invariant feature extraction for robotics applications.

Acknowledgments

We wish to thank Sebastian Seung and Geoff Hinton for helpful discussions. This work was supported in part by the NSF under grants No. 0325463 and 0535166, and by DARPA under the LAGR program.

References

[1] Lee, D.D. and Seung, H.S. (1999) Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791.
[2] Hyvarinen, A. and Hoyer, P.O. (2001) A 2-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41:2413-2423.
[3] Olshausen, B.A. (2002) Sparse codes and spikes. In R.P.N. Rao, B.A. Olshausen and M.S. Lewicki (Eds.), MIT Press: 257-272.
[4] Teh, Y.W., Welling, M., Osindero, S. and Hinton, G.E. (2003) Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235-1260.
[5] Lennie, P. (2003) The cost of cortical computation. Current Biology, 13:493-497.
[6] Simoncelli, E.P. (2005) Statistical modeling of photographic images. Academic Press, 2nd ed.
[7] Hinton, G.E. and Zemel, R.S. (1994) Autoencoders, minimum description length, and Helmholtz free energy. Advances in Neural Information Processing Systems 6, J.D. Cowan, G. Tesauro and J. Alspector (Eds.), Morgan Kaufmann: San Mateo, CA.
[8] Hinton, G.E. (2002) Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771-1800.
[9] Doi, E., Balcan, D.C. and Lewicki, M.S. (2006) A theoretical analysis of robust coding over noisy overcomplete channels. Advances in Neural Information Processing Systems 18, MIT Press.
[10] Olshausen, B.A. and Field, D.J. (1997) Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37:3311-3325.
[11] Foldiak, P. (1990) Forming sparse representations by local anti-hebbian learning. Biological Cybernetics, 64:165-170.
[12] The Berkeley segmentation dataset, http://www.cs.berkeley.edu/projects/vision/grouping/segbench/
[13] The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/
[14] Simard, P.Y., Steinkraus, D. and Platt, J.C. (2003) Best practices for convolutional neural networks. ICDAR.
[15] LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.
[16] Hinton, G.E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554.