Contractive Auto-Encoders

Pascal Vincent (vincentp@iro.umontreal.ca), Xavier Muller (mullerx@iro.umontreal.ca), Xavier Glorot (glorotxa@iro.umontreal.ca), Yoshua Bengio (bengioy@iro.umontreal.ca)
Dept. IRO, Université de Montréal, Montréal, QC H3C 3J7, Canada

Abstract. We present in […]
[…] et al., 2010).

Contribution. What principles should guide the learning of such intermediate representations? They should capture as much as possible of the information in each given example, when that example is likely under the underlying generating distribution. That is what auto-encoders (Vincent et al., 2010) and sparse coding aim to achieve when minimizing reconstruction error. We would also like these representations to be useful in characterizing the input distribution, and that is what is achieved by directly optimizing a generative model's likelihood (such as RBMs). In this paper, we introduce a penalty term that could be added in either of the above contexts, which encourages the intermediate representation to be robust to small changes of the input around the training examples. We show through comparative experiments on many benchmark datasets that this characteristic is useful to learn representations that help training better classifiers. We hypothesize that whereas the proposed penalty term encourages the learned features to be locally invariant without any preference for particular directions, when it is combined with a reconstruction error or likelihood criterion we obtain invariance in the directions that make sense in the context of the given training data, i.e., the variations that are present in the data should also be captured in the learned representation, but the other directions may be contracted in the learned representation.

2. How to extract robust features

To encourage robustness of the representation $f(x)$ obtained for a training input $x$, we propose to penalize its sensitivity to that input, measured as the Frobenius norm of the Jacobian $J_f(x)$ of the non-linear mapping. Formally, if input $x \in \mathbb{R}^{d_x}$ is mapped by encoding function $f$ to hidden representation $h \in \mathbb{R}^{d_h}$, this sensitivity penalization term is the sum of squares of all partial derivatives of the extracted features with respect to input dimensions:

\[ \|J_f(x)\|_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2. \qquad (1) \]

Penalizing $\|J_f\|_F^2$ encourages the mapping to the feature space to be contractive in the neighborhood of the training data. This geometric perspective, which gives its name to our algorithm, will be further elaborated on, in section 5.3, based on experimental evidence.
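The sensitivity term of eq. (1) can be checked numerically. Below is a minimal sketch (not code from the paper; the sigmoid encoder, sizes and weights are illustrative) that evaluates $\|J_f(x)\|_F^2$ by central finite differences:

```python
import numpy as np

def encoder(x, W, b):
    """Illustrative sigmoid encoder h = f(x) = sigmoid(W x + b)."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def jacobian_frobenius_sq(x, W, b, eps=1e-6):
    """Eq. (1): sum over i,j of (dh_j/dx_i)^2, via central finite differences."""
    d_x = x.shape[0]
    total = 0.0
    for i in range(d_x):
        e = np.zeros(d_x)
        e[i] = eps
        dh_dxi = (encoder(x + e, W, b) - encoder(x - e, W, b)) / (2 * eps)
        total += np.sum(dh_dxi ** 2)  # sum_j (dh_j / dx_i)^2
    return float(total)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))   # d_h = 5, d_x = 3 (toy sizes)
b = rng.normal(size=5)
x = rng.normal(size=3)
print(jacobian_frobenius_sq(x, W, b))
```

For a trained CAE one would of course differentiate the actual encoder analytically; the finite-difference version is only a way to see what eq. (1) measures.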
The flatness induced by having low-valued first derivatives will imply an invariance or robustness of the representation for small variations of the input. Thus, in this study, terms like invariance, (in-)sensitivity, robustness, flatness and contraction all point to the same notion.

While such a Jacobian term alone would encourage mapping to a useless constant representation, it is counterbalanced in auto-encoder training[2] by the need for the learnt representation to allow a good reconstruction of the input[3].

3. Auto-encoders variants

In its simplest form, an auto-encoder (AE) is composed of two parts, an encoder and a decoder. It was introduced in the late 80's (Rumelhart et al., 1986; Baldi and Hornik, 1989) as a technique for dimensionality reduction, where the output of the encoder represents the reduced representation and where the decoder is tuned to reconstruct the initial input from the encoder's representation through the minimization of a cost function. More specifically, when the encoding activation functions are linear and the number of hidden units is inferior to the input dimension (hence forming a bottleneck), it has been shown that the learnt parameters of the encoder span the subspace of the principal components of the input space (Baldi and Hornik, 1989). With the use of non-linear activation functions, an AE can however be expected to learn more useful feature-detectors than what can be obtained with a simple PCA (Japkowicz et al., 2000). Moreover, contrary to their classical use as dimensionality-reduction techniques, in their modern instantiation auto-encoders are often employed in a so-called over-complete setting, to extract a number of features larger than the input dimension, yielding a rich higher-dimensional representation. In this setup, using some form of regularization becomes essential to avoid uninteresting solutions where the auto-encoder could perfectly reconstruct the input without needing to extract any useful feature. This section formally defines the auto-encoder variants considered in this study.

Basic auto-encoder (AE). The encoder is a function $f$ that maps an input $x \in \mathbb{R}^{d_x}$ to a hidden representation $h(x) \in \mathbb{R}^{d_h}$. It has the form

\[ h = f(x) = s_f(Wx + b_h), \qquad (2) \]

where $s_f$ is a nonlinear activation function, typically a logistic sigmoid $\sigma(z) = \frac{1}{1+e^{-z}}$. The encoder is parametrized by a $d_h \times d_x$ weight matrix $W$ and a bias vector $b_h \in \mathbb{R}^{d_h}$.

[2] Using also the now common additional constraint of encoder and decoder sharing the same (transposed) weights, which precludes a mere global contracting scaling in the encoder and expansion in the decoder.
[3] A likelihood-related criterion would also similarly prevent a collapse of the representation.

[…] Vincent et al. (2010). The CAE and the DAE differ, however, in the following ways:

CAEs explicitly encourage robustness of the representation $f(x)$, whereas DAEs encourage robustness of the reconstruction $(g \circ f)(x)$ (which may only partially and indirectly encourage robustness of the representation, as the invariance requirement is shared between the two parts of the auto-encoder). We believe that this property makes CAEs a better choice than DAEs for learning useful feature extractors. Since we will use only the encoder part for classification, robustness of the extracted features appears more important than robustness of the reconstruction.

DAEs' robustness is obtained stochastically (eq. 6) by having several explicitly corrupted versions of a training point aim for an identical reconstruction. By contrast, CAEs' robustness to tiny perturbations is obtained analytically by penalizing the magnitude of first derivatives $\|J_f(x)\|_F^2$ at training points only (eq. 7).

Note that an analytic approximation of the DAE's stochastic robustness criterion can be obtained in the limit of very small additive Gaussian noise, by following Bishop (1995). This yields, not surprisingly, a term in $\|J_{g \circ f}(x)\|_F^2$ (Jacobian of the reconstruction) rather than the $\|J_f(x)\|_F^2$ (Jacobian of the representation) of CAEs.

Computational considerations. In the case of a sigmoid nonlinearity, the penalty on the Jacobian norm has the following simple expression:

\[ \|J_f(x)\|_F^2 = \sum_{i=1}^{d_h} \big(h_i(1-h_i)\big)^2 \sum_{j=1}^{d_x} W_{ij}^2. \]

Computing this penalty (or its gradient) is similar to, and has about the same cost as, computing the reconstruction error (or, respectively, its gradient). The overall computational complexity is $O(d_x d_h)$.

5. Experiments and results

Considered models. In our experiments, we compare the proposed Contractive Auto-Encoder (CAE) against the following models for unsupervised feature extraction: RBM-binary: Restricted Boltzmann Machine trained by Contrastive Divergence; AE:
basic auto-encoder; AE+wd: auto-encoder with weight-decay regularization; DAE-g: denoising auto-encoder with Gaussian noise; DAE-b: denoising auto-encoder with binary masking noise.

All auto-encoder variants used tied weights, a sigmoid activation function for both encoder and decoder, and a cross-entropy reconstruction error (see Section 3). They were trained by optimizing their (regularized) objective function on the training set by stochastic gradient descent. As for RBMs, they were trained by Contrastive Divergence.

These algorithms were applied on the training set, without using the labels (unsupervised), to extract a first layer of features. Optionally, the procedure was repeated to stack additional feature-extraction layers on top of the first one. Once thus trained, the learnt parameter values of the resulting feature extractors (weights and biases of the encoder) were used as the initialisation of a multilayer perceptron (MLP) with an extra randomly-initialised output layer. The whole network was then fine-tuned by gradient descent on a supervised objective appropriate for classification[4], using the labels in the training set.

Datasets used. We have tested our approach on a benchmark of image classification problems, namely CIFAR-bw: a gray-scale version of the CIFAR-10 image-classification task (Krizhevsky and Hinton, 2009), and MNIST: the well-known digit classification problem. We also used problems from the same benchmark as Vincent et al. (2010), which includes five harder digit recognition problems derived by adding extra factors of variation to MNIST digits, each with 10000 examples for training, 2000 for validation, 50000 for test, as well as two artificial shape classification problems[5].

5.1. Classification performance

Our first series of experiments focuses on the MNIST and CIFAR-bw datasets. We compare the classification performance obtained by a neural network with one hidden layer of 1000 units, initialized with each of the unsupervised algorithms under consideration. For each case, we selected the value of hyperparameters (such as the strength of regularization) that yielded, after supervised fine-tuning, the best classification performance […]
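The training setup described above (tied weights, sigmoid encoder and decoder, cross-entropy reconstruction) combines naturally with the closed-form Jacobian penalty given under "Computational considerations". The following single-example sketch is an assumption-laden illustration, not the authors' code: sizes, weights and the penalty strength lam are made up, and only the objective is computed (no SGD loop).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cae_objective(x, W, b_h, b_r, lam):
    """CAE objective on one example, for a tied-weight sigmoid auto-encoder:
    cross-entropy reconstruction error + lam * ||J_f(x)||_F^2,
    with the O(d_x d_h) closed-form penalty for a sigmoid encoder."""
    h = sigmoid(W @ x + b_h)       # encoder, eq. (2)
    r = sigmoid(W.T @ h + b_r)     # tied-weight decoder (transposed weights)
    # cross-entropy reconstruction error (x assumed to lie in [0, 1])
    rec = -np.sum(x * np.log(r) + (1 - x) * np.log(1 - r))
    # closed form: sum_i (h_i (1 - h_i))^2 * sum_j W_ij^2
    penalty = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))
    return rec + lam * penalty, h

rng = np.random.default_rng(1)
d_x, d_h = 4, 6                    # toy sizes
W = rng.normal(size=(d_h, d_x))
b_h = rng.normal(size=d_h)
b_r = rng.normal(size=d_x)
x = rng.uniform(0.05, 0.95, size=d_x)

obj, h = cae_objective(x, W, b_h, b_r, lam=0.1)

# sanity check: the closed form equals the explicit Jacobian J = diag(h(1-h)) W
J = (h * (1 - h))[:, None] * W
assert np.isclose(np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1)), np.sum(J ** 2))
print(obj)
```

In actual training this objective would be averaged over minibatches and minimized by stochastic gradient descent, as the paper describes.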
[4] We used sigmoid + cross-entropy for binary classification, and log of softmax for multi-class problems.
[5] These datasets are available at http://www.iro.umontreal.ca/~lisa/icml2007: basic: smaller subset of MNIST; rot: digits with added random rotation; bg-rand: digits with random noise background; bg-img: digits with random image background; bg-img-rot: digits with rotation and image background; rect: discriminate between tall and wide rectangles (white on black); rect-img: discriminate between tall and wide rectangular image on a different background image.

Table 2. Comparison of stacked contractive auto-encoders with 1 and 2 layers (CAE-1 and CAE-2) with other 3-layer stacked models and baseline SVM. Test error rate on all considered classification problems is reported together with a 95% confidence interval. Best performer is in bold, as well as those for which confidence intervals overlap. Clearly CAEs can be successfully used to build top-performing deep networks. 2 layers of CAE often outperformed 3 layers of other stacked models.

Data Set     SVMrbf        SAE-3         RBM-3         DAE-b-3       CAE-1         CAE-2
basic        3.03±0.15     3.46±0.16     3.11±0.15     2.84±0.15     2.83±0.15     2.48±0.14
rot          11.11±0.28    10.30±0.27    10.30±0.27    9.53±0.26     11.59±0.28    9.66±0.26
bg-rand      14.58±0.31    11.28±0.28    6.73±0.22     10.30±0.27    13.57±0.30    10.90±0.27
bg-img       22.61±0.37    23.00±0.37    16.31±0.32    16.68±0.33    16.70±0.33    15.50±0.32
bg-img-rot   55.18±0.44    51.93±0.44    47.39±0.44    43.76±0.43    48.10±0.44    45.23±0.44
rect         2.15±0.13     2.41±0.13     2.60±0.14     1.99±0.12     1.48±0.10     1.21±0.10
rect-img     24.04±0.37    24.05±0.37    22.50±0.37    21.59±0.36    21.86±0.36    21.54±0.36
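The ± half-widths in Table 2 are consistent with the standard normal-approximation 95% binomial interval, 1.96·sqrt(p(1−p)/n), on the 50000-example test sets; this formula is our assumption, as the paper does not spell it out. A quick check on one cell:

```python
import math

def ci95(error_rate_pct, n_test):
    """Half-width of a 95% normal-approximation confidence interval
    for a test error rate, in percentage points (assumed formula)."""
    p = error_rate_pct / 100.0
    return 100.0 * 1.96 * math.sqrt(p * (1 - p) / n_test)

# CAE-2 on 'basic': 2.48% error, 50000-example test set
print(round(ci95(2.48, 50000), 2))  # → 0.14, matching the ±0.14 in Table 2
```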
(DAE-g), the contraction ratio decreases (i.e., towards more contraction) as we move away from the training examples (this is due to more saturation, and was expected), whereas for the CAE the contraction ratio initially increases, up to the point where the effect of saturation takes over (the bump occurs at about the maximum distance between two training examples).

Think about the case where the training examples congregate near a low-dimensional manifold. The variations present in the data (e.g. translation and rotations of objects in images) correspond to local dimensions along the manifold, while the variations that are small or rare in the data correspond to the directions orthogonal to the manifold (at a particular point near the manifold, corresponding to a particular example). The proposed criterion is trying to make the features invariant in all directions around the training examples, but the reconstruction error (or likelihood) is making sure that the representation is faithful, i.e., can be used to reconstruct the input example. Hence the directions that resist this contracting pressure (strong invariance to input changes) are the directions present in the training set. Indeed, if the variations along these directions present in the training set were not preserved, neighboring training examples could not be distinguished and properly reconstructed. Hence the directions where the contraction is strong (small ratio, small singular values of the Jacobian matrix) are also the directions where the model believes that the input density drops quickly, whereas the directions where the contraction is weak (closer to 1, larger contraction ratio, larger singular values of the Jacobian matrix) correspond to the directions where the model believes that the input density is flat (and large, since we are near a training example).

We believe that this contraction penalty thus helps the learner carve a kind of mountain supported by the training examples, and generalizing to a ridge between them. What we would like is for these ridges to correspond to some directions of variation present in the data, associated with underlying factors of variation. How far do these ridges extend around each training example and how flat are they? This can be visualized comparatively with the analysis of Figure 1, with the contraction ratio for different distances from the training examples.

Note that different features (elements of the representation vector) would be expected to have ridges (i.e. directions of invariance) in different directions, and that the "dimensionality" of these ridges (we are in a fairly high-dimensional space) gives a hint as to the local dimensionality of the manifold near which the data examples congregate. The singular value spectrum of the Jacobian informs us about that geometry. The number of large singular values should reveal the dimensionality of these ridges, i.e., of that manifold near which examples concentrate. This is illustrated in Figure 2, showing the singular value spectrum of the encoder's Jacobian. The CAE by far does the best job at representing the data variations near a lower-dimensional manifold, and the DAE is second best, while ordinary auto-encoders (regularized or not) do not succeed at all in this respect.

What happens when we stack a CAE on top of another one, to build a deeper encoder? This is illustrated in Figure 3, which shows the average contraction ratio for different distances around each training point, for depth 1 vs depth 2 encoders. Composing two CAEs yields even more contraction and even more non-linearity, i.e. a sharper profile, with a flatter level of contraction at short and medium distances, and a delayed effect of saturation (the bump only comes up at farther distances). We would thus expect higher-level features to be more invariant in their feature-[…]
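The two quantities this section analyzes, the contraction ratio at a given distance and the singular value spectrum of the encoder's Jacobian, can be probed numerically. The sketch below uses a toy sigmoid encoder with random, untrained weights (an assumption; the paper's figures are computed on trained models), purely to show what is measured:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contraction_ratio(f, x, radius, rng, n_dirs=100):
    """Average of ||f(x + eps) - f(x)|| / ||eps|| over random
    perturbations eps of norm `radius` around x."""
    d = x.shape[0]
    ratios = []
    for _ in range(n_dirs):
        eps = rng.normal(size=d)
        eps *= radius / np.linalg.norm(eps)
        ratios.append(np.linalg.norm(f(x + eps) - f(x)) / radius)
    return float(np.mean(ratios))

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 5))        # toy encoder weights (untrained)
b = rng.normal(size=8)
f = lambda x: sigmoid(W @ x + b)
x = rng.normal(size=5)

# singular value spectrum of the encoder's Jacobian at x:
# the number of large values hints at the local dimensionality kept by the map
h = f(x)
J = (h * (1 - h))[:, None] * W
print(np.linalg.svd(J, compute_uv=False))

# at small radius, each ratio is ||J u|| for a random unit direction u,
# so it lies between the smallest and largest singular values of J
print(contraction_ratio(f, x, radius=1e-4, rng=rng))
```

Plotting the average ratio against increasing radius around many training points would reproduce the kind of distance profile discussed for Figures 1 and 3.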