Deep Learning of Invariant Features via Simulated Fixations in Video


Deep Learning of Invariant Features via Simulated Fixations in Video

Will Y. Zou¹, Shenghuo Zhu³, Andrew Y. Ng², Kai Yu³
¹Department of Electrical Engineering, Stanford University, CA
²Department of Computer Science, Stanford University, CA
³NEC Laboratories America, Inc., Cupertino, CA
{wzou,ang}@cs.stanford.edu, {zsh,kyu}@sv.nec-labs.com

Abstract

We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The network encodes invariances which are increasingly complex with hierarchy. Although learned from videos, our features are spatial instead of spatial-temporal, and well suited for extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig), and observe a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve state-of-the-art recognition accuracy of 61% on the STL-10 dataset.

1 Introduction

Our visual systems are amazingly competent at recognizing patterns in images. During their development, training stimuli are not incoherent sequences of images, but natural visual streams modulated by fixations [1]. Likewise, we expect a machine vision system to learn from coherent image sequences extracted from the natural environment. Through this learning process, it is desired that features become robust to temporal transformations and perform significantly better in recognition. In this paper, we build an unsupervised deep learning system which exhibits these properties, thus achieving competitive performance on concrete computer vision benchmarks.

As a learning principle, sparsity is essential to understanding the statistics of natural images [2]. However, it remains unclear to what extent sparsity and subspace pooling [3, 4] could produce the invariance exhibited in higher levels of visual systems. Another approach to learning invariance is temporal slowness [1, 5, 6, 7]. Experimental evidence suggests that high-level visual representations become slow-changing and tolerant towards non-trivial transformations by associating low-level features which appear in a coherent sequence [5].

To learn features using slowness, a key observation is that during our visual fixations, moving objects remain in visual focus for a sustained amount of time through smooth pursuit eye movements. This mechanism ensures that the same object remains in visual exposure, avoiding rapid switching or translations. Simulation of such a mechanism forms an essential part of our proposal. In natural videos, we use spatial-temporal feature detectors to simulate fixations on salient features. At these feature locations, we apply local contrast normalization [8] and template matching [9] to find local correspondences between successive video frames. This approach produces training sequences for our unsupervised algorithm. As shown in Figure 1, training input to the neural network is free from abrupt changes but contains non-trivial motion transformations.

Figure 1: Simulating smooth pursuit eye movements. (Left) Sequences extracted from fixed spatial locations in a video. (Right) Sequences produced by our tracking algorithm.

In prior work [10, 11, 12], a single layer of features learned using temporal slowness results in translation-invariant edge detectors, reminiscent of complex cells. However, it remains unclear whether higher levels of invariance [1], such as those exhibited in IT, can be learned using temporal slowness. In this paper, we focus on developing algorithms that capture higher levels of invariance by learning multiple layers of representations. By stacking learning modules, we are able to learn features that are increasingly invariant. Using temporal slowness, the first-layer units become locally translation invariant, similar to subspace or spatial pooling; the second-layer units can then encode more complex invariances such as out-of-plane transformations and non-linear warping. Using this approach, we show a surprising result: despite being trained on videos, our features encode complex invariances which translate to recognition performance on still images.

We carry out our experiments using the self-taught learning framework [13]. We first learn a set of features using simulated fixations in unlabeled videos, and then apply the learned features to classification tasks. The learned features improve accuracy by a significant 4% to 5% across four still-image recognition datasets. In particular, we show the best classification results to date, 61%, on
the STL-10 [14] dataset. Finally, we quantify the invariance learned using temporal slowness and simulated fixations by a set of control experiments.

2 Related work

Unsupervised learning of image features from pixels is a relatively new approach in computer vision. Nevertheless, there have been successful applications of unsupervised learning algorithms such as Sparse Coding [15, 16], Independent Component Analysis [17], and even clustering algorithms [14] on a convincing range of datasets. These algorithms often use such principles as sparsity and feature orthogonality to learn good representations. Recent work in deep learning such as Le et al. [18] showed promising results for the application of deep learning to vision. At the same time, these advances suggest challenges for learning deeper layers [19] using purely unsupervised learning.

Mobahi et al. [20] showed that temporal slowness could improve recognition on the video-like COIL-100 dataset. Despite being among the first to apply temporal slowness in deep architectures, the authors trained a fully supervised convolutional network and used temporal slowness as a regularizing step in the optimization procedure. The influential work on Slow Feature Analysis (SFA) [7] was an early example of an unsupervised algorithm using temporal slowness. SFA solves a constrained problem and optimizes for temporal slowness by mapping data into a quadratic expansion and performing eigenvector decomposition. Despite its elegance, SFA's non-linear (quadratic) expansion is computationally slow when applied to high-dimensional data. Applications of SFA to computer vision have had limited success, applied primarily to artificially generated graphics data [21]. Bergstra et al. [12] proposed to train deep architectures with temporal slowness and decorrelation, and illustrated training a first layer on MNIST digits. [22, 23] proposed bi-linear models to represent natural images using a factorial code. Cadieu et al. [24] trained a two-layer algorithm to learn visual transformations in videos, with limited emphasis on temporal slowness.

The computer vision literature has a number of works which, similar to us, use the idea of video tracking to learn invariant features. Stavens et al. [25] show improvement in performance when SIFT/HOG parameters are optimized using tracked image patch sequences in specific application domains. Leistner et al. [26] used natural videos as "weakly supervised" signals to improve random forest classifiers. Lee et al. [27] introduced video-based descriptors used in hand-held visual recognition systems. In contrast to these recent examples, our algorithm learns features directly from raw image pixels, and adapts to pixel-level image statistics; in particular, it does not rely on hand-designed preprocessing such as SIFT/HOG. Further, since it is implemented by a neural network, our method can also be used in conjunction with such techniques as fine-tuning with back-propagation [28, 29].

3 Learning Architecture

In this section, we describe the basic modules and the architecture of the learning algorithm. In particular, our learning modules use a combination of temporal slowness and a non-degeneracy principle similar to orthogonality [30, 31]. Each module implements a linear transformation followed by a pooling step. The modules can be activated in a feed-forward manner, making them suitable for forming a deep architecture. To learn invariant features with temporal slowness, we use a two-layer network, where the first layer is convolutional and replicates neurons with local receptive fields across dense grid locations, and the second (non-convolutional) layer is fully connected.

3.1 Learning Module

The input data to our learning module is a coherent sequence of image frames, and all frames in the sequence are indexed by t. To learn hidden features p^(t) from data x^(t), the modules are trained by solving the following unconstrained minimization problem:

    minimize_W   λ Σ_{t=1}^{N-1} ||p^(t) − p^(t+1)||_1  +  Σ_{t=1}^{N} ||x^(t) − W^T W x^(t)||_2^2        (1)

The hidden features p^(t) are mapped from data x^(t) by a feed-forward pass in the network shown in Figure 2:

    p^(t) = sqrt( H (W x^(t))^2 )        (2)

This equation describes L2 pooling on a linear network layer. The square and square-root operations are element-wise. The pooling mechanism is implemented by a subspace pooling matrix H with a group size of two [30]. More specifically, each row of H picks and sums two adjacent feature dimensions in a non-overlapping fashion.

The second term in Equation 1 is from the Reconstruction ICA algorithm [31]. It helps avoid degeneracy in the features, and plays a role similar to orthogonalization in Independent Component Analysis [30]. The network encodes the data x^(t) by a matrix-vector multiplication z^(t) = W x^(t), and reconstructs the data with another feed-forward pass x̂^(t) = W^T z^(t). This term can also be interpreted as an auto-encoder reconstruction cost. (See [31] for details.)

Although the algorithm is driven by temporal slowness, sparsity also helps to obtain good features from natural images. Thus, in practice, we further add to Equation 1 an L1-norm sparsity regularization term γ Σ_{t=1}^{N} ||p^(t)||_1, to make sure the obtained features have sparse activations.

This basic algorithm, trained on the Hans van Hateren natural video repository [24], produced oriented edge filters. The learned features are highly invariant to local translations. The reason for this is that temporal slowness requires hidden features to be slow-changing across time. Using the visualization method of [24], in Figure 3 we vary the interpolation angle in between pairs of pooled features, and produce a motion of smooth translations. A video of this illustration is also available online.¹

¹ http://ai.stanford.edu/wzou/slow/first layer invariance.avi

3.2 Stacked Architecture

The first-layer modules described in the last section are trained on a smaller patch size (16x16 pixels) of locally tracked video sequences. To construct the set of inputs to the second stacked layer, first-layer features are replicated on a dense grid at a larger scale (32x32 pixels). The input to layer two is extracted after L2 pooling. This architecture produces an over-complete number of local 16x16 features across the larger feature area. The two-layer architecture is shown in Figure 4. Due to the high dimensionality of the first-layer outputs, we apply PCA to reduce their dimensions for the second-layer algorithm. Afterwards, a fully connected module is trained with temporal slowness on the output of the PCA. The stacked architecture learns features in a significantly larger 2-D area than the first-layer algorithm, and is able to learn invariance to larger-scale transformations seen in videos.

Figure 2: Neural network architecture of the basic learning module.

Figure 3: Translational invariance in first-layer features; columns correspond to interpolation angles at multiples of 45 degrees.

Figure 4: Two-layer architecture of our algorithm used to learn invariance from videos.

3.3 Invariance Visualization

After unsupervised training with video sequences, we visualize the features learned by the two-layer network. On the left of Figure 5, we show the optimal stimuli which maximally activate each of the first-layer pooling units. This is obtained by analytically finding the input that maximizes the output of a pooling unit (subject to the constraint that the input x has unit norm, ||x||_2 = 1). The optimal stimuli for units learned without slowness are shown at the top, and appear to give high-frequency grating-like patterns. At the bottom, we show the optimal stimuli for features learned with slowness; here, the optimal stimuli appear much smoother because the pairs of Gabor-like features being pooled over are usually a quadrature pair. This implies that the pooling unit is robust to changes in phase positions, which correspond to translations of the Gabor-like feature.

The second-layer features are learned on top of the pooled first-layer features. We visualize the second-layer features by plotting linear combinations of the first-layer features' optimal stimuli (as shown on the left of Figure 5), and varying the interpolation angle as in [24]. The result is shown on the right of Figure 5, where each row corresponds to the visualization of a single pooling unit. Each row corresponds to a motion sequence to which we would expect the second-layer features to be roughly invariant. From this visualization, non-trivial invariances are observed, such as non-linear warping, rotation, local non-affine changes and large-scale translations. A video animation of this visualization is also available online.²
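To make the learning module concrete, the following is a minimal numpy sketch of the first-layer objective (Equations 1 and 2, plus the L1 sparsity term). It is our illustrative reconstruction, not the authors' implementation: function names, the column-per-frame data layout, and the default weights λ, γ are our assumptions, and in practice this cost would be minimized over W with a gradient-based optimizer rather than merely evaluated.

```python
import numpy as np

def l2_pool_matrix(n_bases):
    """Subspace pooling matrix H with group size two: each row sums two
    adjacent feature dimensions, non-overlapping (n_bases must be even)."""
    H = np.zeros((n_bases // 2, n_bases))
    for i in range(n_bases // 2):
        H[i, 2 * i] = H[i, 2 * i + 1] = 1.0
    return H

def pooled_features(W, X, H):
    """Equation 2: p(t) = sqrt(H (W x(t))^2), square/sqrt element-wise.
    X holds one video frame x(t) per column."""
    Z = W @ X                     # linear encoding z(t) = W x(t)
    return np.sqrt(H @ Z**2)      # L2 pooling over groups of two

def module_cost(W, X, H, lam=1.0, gamma=0.1):
    """Equation 1 plus the L1 sparsity term:
    slowness + RICA reconstruction + sparsity (weights lam, gamma are ours)."""
    P = pooled_features(W, X, H)
    slowness = np.abs(P[:, 1:] - P[:, :-1]).sum()   # sum_t ||p(t) - p(t+1)||_1
    recon = ((X - W.T @ (W @ X))**2).sum()          # sum_t ||x(t) - W^T W x(t)||_2^2
    sparsity = np.abs(P).sum()                      # sum_t ||p(t)||_1
    return lam * slowness + recon + gamma * sparsity
```

Note how the slowness term vanishes for a temporally constant sequence: if every column of X is the same frame, the pooled features are identical across t and only the reconstruction and sparsity terms remain.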
² http://ai.stanford.edu/wzou/slow/second layer invariance.avi

Figure 5: (Left) Comparison of optimal stimuli of first-layer pooling units (patch size 16x16) learned without (top) and with (bottom) temporal slowness. (Right) Visualization of second-layer features (patch size 32x32), with each row corresponding to one pooling unit. We observe a few non-trivial invariances, such as warping (rows 9 and 10), rotation (first row), local non-affine changes (rows 3, 4, 6, 7), and large-scale translations (rows 2 and 5).

4 Experiments

Our experiments are carried out in a self-taught learning setting [13]. We first train the algorithm on the Hans van Hateren natural scene videos to learn a set of features. The learned features are then used to classify single images in each of four datasets. Throughout this section, we use gray-scale features to perform recognition.

4.1 Training with Tracked Sequences

To extract data from the Hans van Hateren natural video repository, we apply a spatial-temporal Difference-of-Gaussian blob detector and select areas of high response to simulate visual fixations. After the initial frame is selected, the image patch is tracked across 20 frames using a tracker we built and customized for this task. The tracker finds local correspondences by calculating the Normalized Cross Correlation (NCC) of patches across time, which are processed with local contrast normalization.

The first-layer algorithm is learned on 16x16 patches with 128 features (pooled from 256 linear bases). The bases are then convolved within the larger 32x32 image patches with a stride of 2. PCA is used to first reduce the dimensions of the response maps to 300 before learning the second layer. The second layer learns 150 features (pooled from 300 linear bases).

4.2 Vision Datasets

COIL-100 contains images of 100 objects, each with 72 views. We followed the testing protocols in [32, 20]. The videos we trained on to obtain the temporal slowness features were based on the van Hateren videos, and were thus unrelated to COIL-100. The classification experiment is performed on all 100 objects.

In Caltech 101, we followed the common experiment setup given in [33]: we pick 30 images per class as the training set, and randomly pick 50 per class (if fewer than 50 are left, we take the rest) as the test set. This is performed 10 times and we report the average classification accuracy.

The STL-10 [34] dataset contains 10 object classes with 5000 training and 8000 test images. There are 10 pre-defined folds of training images, with 500 images in each fold. In each fold, a classifier is trained on a specific set of 500 training images, and tested on all 8000 test images. Similar to prior work, the evaluation metric we report is average accuracy across the 10 folds. The dataset is suitable for developing unsupervised feature learning and self-taught learning algorithms, since the number of supervised training labels is relatively small.

PubFig [35] is a face recognition dataset with 58,797 images of 200 persons. Face images contain large variations in pose, expression, background and image conditions. Since some of the URL links provided by the authors were broken, we only compare our results using video against our own baseline result without video. 10% of the downloaded data was used as the test set.

4.3 Test Pipeline

On still images, we apply our trained network to extract features at dense grid locations. A linear SVM classifier is trained on features from both first and second layers. We did not apply fine-tuning. For COIL-100, we cross-validate the average pooling size. A simple four-quadrant pooling is used for the STL-10 and PubFig datasets. For Caltech 101, we use a three-layer spatial pyramid.

4.4 Recognition Results

We report results on the COIL-100, Caltech 101, STL-10 and PubFig datasets in Tables 1, 2, 3 and 4.

Table 1: Accuracy on COIL-100 (unrelated video)
  Method                                     | Acc.
  VTU [32]                                   | 79.1%
  ConvNet regularized with video [20]        | 79.77%
  Our results without video                  | 82.0%
  Our results using video                    | 87.0%
  Performance increase by training on video  | +5.0%

Table 2: Average accuracy on Caltech 101
  Method                                     | Ave. acc.
  Two-layer ConvNet [36]                     | 66.9%
  ScSPM [37]                                 | 73.2%
  Hierarchical sparse-coding [38]            | 74.0%
  Macrofeatures [39]                         | 75.7%
  Our results without video                  | 66.5%
  Our results using video                    | 74.6%
  Performance increase with video            | +8.1%

Table 3: Average accuracy on STL-10
  Method                                     | Ave. acc.
  Reconstruction ICA [31]                    | 52.9%
  Sparse Filtering [40]                      | 53.5%
  SC features, K-means encoding [16]         | 56.0%
  SC features, SC encoding [16]              | 59.0%
  Local receptive field selection [19]       | 60.1%
  Our result without video                   | 56.5%
  Our result using video                     | 61.0%
  Performance increase with video            | +4.5%

Table 4: Accuracy on PubFig faces
  Method                                     | Acc.
  Our result without video                   | 86%
  Our result using video                     | 90.0%
  Performance increase with video            | +4.0%

In these experiments, the hyper-parameters are cross-validated. However, performance is not particularly sensitive to the weighting between the temporal slowness objective and the reconstruction objective in Equation 1, as we will illustrate in Section 4.5.2. For each dataset, we compare results using features trained with and without the temporal slowness objective term in Equation 1. Despite the features being learned from natural videos and then being transferred to different recognition tasks (i.e., self-taught learning), they give excellent performance in our experiments. The application of temporal slowness increases recognition accuracy consistently by 4% to 5%, bringing our results to be competitive with the state-of-the-art.

4.5 Control Experiments

4.5.1 Effect of Fixation Simulation and Tracking

We carry out a control experiment to elucidate the difference between features learned using our fixation and smooth pursuit method for extracting video frames (as in Figure 1, right) and features learned using non-tracked sequences (Figure 1, left). As shown on the left of Figure 6, training on tracked sequences reduces the translation invariance learned in the second layer. In comparison to other forms of invariance, translation is less useful because it is easy to encode with spatial pooling [17]. Instead, the features encode other invariances such as different forms of non-linear warping. The advantage of using tracked data is reflected in object recognition performance on the STL-10 dataset. As shown on the right of Figure 6, recognition accuracy is increased by a considerable margin by training on tracked sequences.
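The tracking step of Section 4.1 can be sketched in a few lines. This is a simplified stand-in for the authors' customized tracker: the local contrast normalization is reduced to per-patch standardization, the search is a brute-force scan over a small window, and all function names and the search radius are our assumptions.

```python
import numpy as np

def contrast_normalize(patch, eps=1e-5):
    """Simplified local contrast normalization: zero mean, unit RMS."""
    p = patch - patch.mean()
    return p / (np.sqrt((p**2).mean()) + eps)

def ncc(a, b):
    """Normalized cross-correlation score of two same-size patches."""
    a, b = contrast_normalize(a), contrast_normalize(b)
    return float((a * b).mean())

def track(frame, template, top_left, search=4):
    """Find the template's best NCC match in `frame` by exhaustive search
    within +/- `search` pixels of the previous location `top_left`."""
    h, w = template.shape
    y0, x0 = top_left
    best, best_pos = -np.inf, top_left
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = y0 + dy, x0 + dx
            if 0 <= y <= frame.shape[0] - h and 0 <= x <= frame.shape[1] - w:
                score = ncc(frame[y:y + h, x:x + w], template)
                if score > best:
                    best, best_pos = score, (y, x)
    return best_pos
```

Applying `track` frame after frame, re-seeding `top_left` with each result, yields the 20-frame tracked sequences the learning module consumes; in practice an FFT-based NCC [9] would replace the exhaustive scan.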
Figure 6: (Left) Comparison of second-layer invariance visualization when training data was obtained with tracking and without; (Right) average accuracy on STL-10 with features trained on tracked sequences compared to non-tracked; in this plot, λ is the slowness weighting parameter from Equation 1.

4.5.2 Importance of Temporal Slowness to Recognition Performance

To understand how much the slowness principle helps to learn good features, we vary the slowness parameter λ across a range of values to observe its effect on recognition accuracy. Figure 7 shows recognition accuracy on STL-10, plotted as a function of the slowness weighting parameter λ in the first and second layers. On both layers, accuracy increases considerably with λ, and then levels off slowly as the weighting parameter becomes large. The performance also appears to be reasonably robust to the choice of λ, so long as the parameter is in the high-value regime.

Figure 7: Performance on STL-10 versus the amount of temporal slowness, on the first layer (left) and second layer (right); in these plots, λ is the slowness weighting parameter from Equation 1; different colored curves are shown for different λ values in the other layer.

4.5.3 Invariance Tests

We quantify the invariance encoded in the unsupervised learned features with invariance tests. In this experiment, we take the approach described in [4] and measure the change in features as the input image undergoes transformations. A patch is extracted from a natural image, and transformed through translation, rotation and zoom. We measure the Mean Squared Error (MSE) between the L2-normalized feature vector of the transformed patch and the feature vector of the original patch.³ The normalized MSE is plotted against the amount of translation, rotation, and zoom.

³ MSE is normalized against feature dimensions, and averaged across 100 randomly sampled patches. Since the largest distortion produces an almost completely uncorrelated patch, for all features the MSE is normalized against its value at the largest distortion.

The results of the invariance tests are
shown in Figure 8.⁴ In these plots, lower curves indicate higher levels of invariance. Our features trained with temporal slowness have better invariance properties compared to features learned only using sparsity, and to SIFT.⁵ Further, simulation of fixations with feature detection and tracking has a visible effect on feature invariance. Specifically, as shown on the left of Figure 8, feature tracking reduces translation invariance, in agreement with our analysis in Section 4.5.1. At the same time, the middle and right plots of Figure 8 show that feature tracking increases the non-trivial rotation and zoom invariance in the second layer of our temporal slowness features.

⁴ The translation test is performed with 16x16 patches and first-layer features; the rotation and zoom tests are performed with 32x32 patches and second-layer features.
⁵ We use SIFT in the VLFeat toolbox [41], http://www.vlfeat.org/

Figure 8: Invariance tests comparing our temporal slowness features using tracked and non-tracked sequences, against SIFT and features trained only with sparsity, shown for different transformations: translation (left), rotation (middle) and zoom (right).

5 Conclusion

We have described an unsupervised learning algorithm for learning invariant features from video using the temporal slowness principle. The system is improved by using simulated fixations and smooth pursuit to generate the video sequences provided to the learning algorithm. We illustrate, by virtue of visualization and invariance tests, that the learned features are invariant to a collection of non-trivial transformations. With concrete recognition experiments, we show that the features learned from natural videos not only apply to still images, but also give competitive results on a number of object recognition benchmarks. Since our features can be extracted using a feed-forward neural network, they are also easy to use and efficient to compute.

References

[1] N. Li and J. J. DiCarlo. Unsupervised natural experience rapidly alters invariant object representation in visual cortex. Science, 2008.
[2] A. Hyvarinen and P. Hoyer. Topographic independent component analysis as a model of V1 organization and receptive fields. Neural Computation, 2001.
[3] J. H. van Hateren and D. L. Ruderman. Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. Royal Society, 1998.
[4] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In CVPR, 2009.
[5] D. Cox, P. Meier, N. Oertelt, and J. DiCarlo. 'Breaking' position-invariant object recognition. Nature Neuroscience, 2005.
[6] T. Masquelier and S. J. Thorpe. Unsupervised learning of visual features through spike timing dependent plasticity. PLoS Computational Biology, 2007.
[7] P. Berkes and L. Wiskott. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 2005.
[8] S. Lyu and E. P. Simoncelli. Nonlinear image representation using divisive normalization. In CVPR, 2008.
[9] J. P. Lewis. Fast normalized cross-correlation. In Vision Interface, 1995.
[10] A. Hyvarinen, J. Hurri, and J. Vayrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America, 2003.
[11] J. Hurri and A. Hyvarinen. Temporal coherence, natural image sequences and the visual cortex. In NIPS, 2006.
[12] J. Bergstra and Y. Bengio. Slow, decorrelated features for pretraining complex cell-like networks. In NIPS, 2009.
[13] R. Raina, A. Madhavan, and A. Y. Ng. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.
[14] A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
[15] B. A. Olshausen and D. J. Field. How close are we to understanding V1? Neural Computation, 2005.
[16] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In ICML, 2011.
[17] Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. W. Koh, and A. Y. Ng. Tiled convolutional neural networks. In NIPS, 2010.
[18] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[19] A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In NIPS, 2011.
[20] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In ICML, 2009.
[21] M. Franzius, N. Wilbert, and L. Wiskott. Invariant object recognition with Slow Feature Analysis. In ICANN, 2008.
[22] B. Olshausen, C. Cadieu, J. Culpepper, and D. K. Warland. Bilinear models of natural images. In Proc. SPIE 6492, 2007.
[23] D. B. Grimes and R. P. N. Rao. Bilinear sparse coding for invariant vision.
[24] C. Cadieu and B. Olshausen. Learning transformational invariants from natural movies. In NIPS, 2009.
[25] D. Stavens and S. Thrun. Unsupervised learning of invariant features using video. In CVPR, 2010.
[26] C. Leistner, M. Godec, S. Schulter, M. Werlberger, A. Saffari, and H. Bischof. Improving classifiers with unlabeled weakly-related videos. In CVPR, 2011.
[27] T. Lee and S. Soatto. Video-based descriptors for object recognition. Image and Vision Computing, 2011.
[28] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
[29] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.
[30] A. Hyvarinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics. Springer, 2009.
[31] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, 2011.
[32] H. Wersing and E. Körner. Learning optimized features for hierarchical models of invariant object recognition. Neural Computation, 2003.
[33] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories.
[34] A. Coates, H. Lee, and A. Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14, 2010.
[35] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, 2009.
[36] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In NIPS, 2010.
[37] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[38] K. Yu, Y. Lin, and J. Lafferty. Learning image representations from the pixel level via hierarchical sparse coding. In CVPR, 2011.
[39] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In CVPR, 2010.
[40] J. Ngiam, P. W. Koh, Z. Chen, S. Bhaskar, and A. Y. Ng. Sparse filtering. In NIPS, 2011.
[41] A. Vedaldi and B. Fulkerson. VLFeat: an open and portable library of computer vision algorithms, 2008.
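As an illustration of the invariance-test protocol described in Section 4.5.3, the following sketch computes a normalized-MSE curve for an arbitrary feature function. It is our reconstruction of the protocol, not the authors' code: the function names are ours, the feature function and transform are placeholders, and the normalization against the largest distortion follows footnote 3.

```python
import numpy as np

def invariance_curve(feature_fn, patches, transform, amounts):
    """Mean squared error between L2-normalized features of original and
    transformed patches, averaged over patches and normalized by its value
    at the largest distortion (the 'uncorrelated' reference, footnote 3)."""
    def unit(v):
        return v / (np.linalg.norm(v) + 1e-8)
    mse = []
    for a in amounts:
        errs = [((unit(feature_fn(p)) -
                  unit(feature_fn(transform(p, a))))**2).mean()
                for p in patches]
        mse.append(np.mean(errs))
    mse = np.array(mse)
    return mse / mse[-1]   # lower curves = more invariance

def shift(patch, a):
    """Placeholder transform: horizontal translation by a pixels."""
    return np.roll(patch, a, axis=1)
```

With raw pixels as the "feature" and translation as the transform, the curve starts at 0 (no distortion) and ends at 1 by construction; a feature function with better translation invariance would produce a lower curve in between.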