Contractive Auto-Encoders: Explicit Invariance During Feature Extraction


Salah Rifai (1)  rifaisal@iro.umontreal.ca
Pascal Vincent (1)  vincentp@iro.umontreal.ca
Xavier Muller (1)  mullerx@iro.umontreal.ca
Xavier Glorot (1)  glorotxa@iro.umontreal.ca
Yoshua Bengio (1)  bengioy@iro.umontreal.ca

(1) Dept. IRO, Université de Montréal, Montréal, QC H3C 3J7, Canada

Abstract. We present in [...]


[...] et al., 2010).

Contribution. What principles should guide the learning of such intermediate representations? They should capture as much as possible of the information in each given example, when that example is likely under the underlying generating distribution. That is what auto-encoders (Vincent et al., 2010) and sparse coding aim to achieve when minimizing reconstruction error. We would also like these representations to be useful in characterizing the input distribution, and that is what is achieved by directly optimizing a generative model's likelihood (such as RBMs). In this paper, we introduce a penalty term that could be added in either of the above contexts, which encourages the intermediate representation to be robust to small changes of the input around the training examples. We show through comparative experiments on many benchmark datasets that this characteristic is useful to learn representations that help training better classifiers. We hypothesize that whereas the proposed penalty term encourages the learned features to be locally invariant without any preference for particular directions, when it is combined with a reconstruction error or likelihood criterion we obtain invariance in the directions that make sense in the context of the given training data, i.e., the variations that are present in the data should also be captured in the learned representation, but the other directions may be contracted in the learned representation.

2. How to extract robust features

To encourage robustness of the representation f(x) obtained for a training input x, we propose to penalize its sensitivity to that input, measured as the Frobenius norm of the Jacobian J_f(x) of the non-linear mapping. Formally, if input x \in \mathbb{R}^{d_x} is mapped by encoding function f to hidden representation h \in \mathbb{R}^{d_h}, this sensitivity penalization term is the sum of squares of all partial derivatives of the extracted features with respect to input dimensions:

\|J_f(x)\|_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2    (1)

Penalizing \|J_f\|_F^2 encourages the mapping to the feature space to be contractive in the neighborhood of the training data. This geometric perspective, which gives its name to our algorithm, will be further elaborated on in section 5.3, based on experimental evidence. The flatness induced by having low-valued first derivatives will imply an invariance or robustness of the representation for small variations of the input. Thus, in this study, terms like invariance, (in-)sensitivity, robustness, flatness and contraction all point to the same notion.

While such a Jacobian term alone would encourage mapping to a useless constant representation, it is counterbalanced in auto-encoder training [2] by the need for the learnt representation to allow a good reconstruction of the input [3].

[2] Using also the now common additional constraint of encoder and decoder sharing the same (transposed) weights, which precludes a mere global contracting scaling in the encoder and expansion in the decoder.
[3] A likelihood-related criterion would also similarly prevent a collapse of the representation.
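To make eq. (1) concrete, here is a minimal numpy sketch that evaluates the penalty for a small sigmoid encoder by building the Jacobian column by column with finite differences. The encoder form anticipates eq. (2) below; the sizes, random weights, and the finite-difference check itself are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 5, 3                                  # illustrative dimensions
W = rng.normal(scale=0.5, size=(d_h, d_x))
b = np.zeros(d_h)

def f(x):
    """Encoder: maps x in R^{d_x} to hidden representation h in R^{d_h}."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def jacobian_fd(f, x, eps=1e-6):
    """Numerical Jacobian J_f(x): one column per input dimension."""
    fx = f(x)
    J = np.empty((fx.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        J[:, i] = (f(x + e) - fx) / eps
    return J

x = rng.normal(size=d_x)
penalty = np.sum(jacobian_fd(f, x) ** 2)         # ||J_f(x)||_F^2, eq. (1)
print(penalty)
```

A small penalty at x means the features barely move when x is perturbed, i.e., the mapping is contractive there; section 4 gives the cheap closed form that replaces this finite-difference loop in practice.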
3. Auto-encoders variants

In its simplest form, an auto-encoder (AE) is composed of two parts, an encoder and a decoder. It was introduced in the late 80's (Rumelhart et al., 1986; Baldi and Hornik, 1989) as a technique for dimensionality reduction, where the output of the encoder represents the reduced representation and where the decoder is tuned to reconstruct the initial input from the encoder's representation through the minimization of a cost function. More specifically, when the encoding activation functions are linear and the number of hidden units is inferior to the input dimension (hence forming a bottleneck), it has been shown that the learnt parameters of the encoder span a subspace of the principal components of the input space (Baldi and Hornik, 1989). With the use of non-linear activation functions, an AE can however be expected to learn more useful feature detectors than what can be obtained with a simple PCA (Japkowicz et al., 2000). Moreover, contrary to their classical use as dimensionality-reduction techniques, in their modern instantiation auto-encoders are often employed in a so-called over-complete setting to extract a number of features larger than the input dimension, yielding a rich higher-dimensional representation. In this setup, using some form of regularization becomes essential to avoid uninteresting solutions where the auto-encoder could perfectly reconstruct the input without needing to extract any useful feature. This section formally defines the auto-encoder variants considered in this study.

Basic auto-encoder (AE). The encoder is a function f that maps an input x \in \mathbb{R}^{d_x} to hidden representation h(x) \in \mathbb{R}^{d_h}. It has the form

h = f(x) = s_f(W x + b_h),    (2)

where s_f is a nonlinear activation function, typically a logistic sigmoid \sigma(z) = \frac{1}{1 + e^{-z}}. The encoder is parametrized by a d_h \times d_x weight matrix W and a bias vector b_h \in \mathbb{R}^{d_h}.
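As a sketch of this basic AE under the tied-weights constraint of footnote [2], the following numpy snippet implements the encoder of eq. (2), a transposed-weight sigmoid decoder, and the cross-entropy reconstruction error that section 5 says all auto-encoder variants were trained with. Dimensions, initialization, and the decoder bias name b_r are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 8, 4
W = rng.normal(scale=0.1, size=(d_h, d_x))    # d_h x d_x weight matrix
b_h = np.zeros(d_h)                           # encoder bias
b_r = np.zeros(d_x)                           # decoder (reconstruction) bias

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def encode(x):
    return sigmoid(W @ x + b_h)               # h = f(x), eq. (2)

def decode(h):
    return sigmoid(W.T @ h + b_r)             # tied (transposed) weights

def reconstruction_error(x, eps=1e-12):
    """Cross-entropy between input x (entries in [0,1]) and its reconstruction."""
    r = decode(encode(x))
    return -np.sum(x * np.log(r + eps) + (1 - x) * np.log(1 - r + eps))

x = rng.uniform(size=d_x)                     # e.g. a pixel vector in [0,1]
print(reconstruction_error(x))
```

The CAE adds the penalty of eq. (1) to this reconstruction objective; the tied weights matter because, as footnote [2] notes, they preclude the trivial fix of globally shrinking the encoder while expanding the decoder.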
[...] Vincent et al. (2010). The CAE and the DAE differ, however, in the following ways:

CAEs explicitly encourage robustness of the representation f(x), whereas DAEs encourage robustness of the reconstruction (g \circ f)(x), which may only partially and indirectly encourage robustness of the representation, as the invariance requirement is shared between the two parts of the auto-encoder. We believe that this property makes CAEs a better choice than DAEs to learn useful feature extractors. Since we will use only the encoder part for classification, robustness of the extracted features appears more important than robustness of the reconstruction.

DAEs' robustness is obtained stochastically (eq. 6) by having several explicitly corrupted versions of a training point aim for an identical reconstruction. By contrast, CAEs' robustness to tiny perturbations is obtained analytically by penalizing the magnitude of the first derivatives \|J_f(x)\|_F^2 at training points only (eq. 7). Note that an analytic approximation for the DAE's stochastic robustness criterion can be obtained in the limit of very small additive Gaussian noise, by following Bishop (1995). This yields, not surprisingly, a term in \|J_{g \circ f}(x)\|_F^2 (Jacobian of the reconstruction) rather than the \|J_f(x)\|_F^2 (Jacobian of the representation) of CAEs.

Computational considerations. In the case of a sigmoid nonlinearity, the penalty on the Jacobian norm has the following simple expression:

\|J_f(x)\|_F^2 = \sum_{i=1}^{d_h} \big( h_i (1 - h_i) \big)^2 \sum_{j=1}^{d_x} W_{ij}^2

Computing this penalty (or its gradient) is similar to, and has about the same cost as, computing the reconstruction error (or, respectively, its gradient). The overall computational complexity is O(d_x d_h).
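This closed form follows because, for a sigmoid encoder, \partial h_i / \partial x_j = h_i (1 - h_i) W_{ij}, so the squared Frobenius norm factorizes into per-unit saturation terms times row norms of W. A minimal numpy sketch, with a sanity check against the explicit Jacobian (the check itself is ours, not the paper's):

```python
import numpy as np

def cae_penalty(W, h):
    """||J_f(x)||_F^2 for h = sigmoid(W x + b_h); costs O(d_x * d_h)."""
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

# Sanity check against the explicit Jacobian J_ij = h_i (1 - h_i) W_ij.
rng = np.random.default_rng(0)
W, x = rng.normal(size=(3, 5)), rng.normal(size=5)
h = 1.0 / (1.0 + np.exp(-(W @ x)))
J = (h * (1 - h))[:, None] * W
assert np.isclose(np.sum(J ** 2), cae_penalty(W, h))
```

Note that the penalty reuses the hidden activations h already computed for the reconstruction, which is why its cost matches that of the reconstruction error.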
5. Experiments and results

Considered models. In our experiments, we compare the proposed Contractive Auto-Encoder (CAE) against the following models for unsupervised feature extraction:

RBM-binary: Restricted Boltzmann Machine trained by Contrastive Divergence,
AE: Basic auto-encoder,
AE+wd: Auto-encoder with weight-decay regularization,
DAE-g: Denoising auto-encoder with Gaussian noise,
DAE-b: Denoising auto-encoder with binary masking noise.

All auto-encoder variants used tied weights, a sigmoid activation function for both encoder and decoder, and a cross-entropy reconstruction error (see Section 3). They were trained by optimizing their (regularized) objective function on the training set by stochastic gradient descent. As for RBMs, they were trained by Contrastive Divergence. These algorithms were applied on the training set without using the labels (unsupervised) to extract a first layer of features. Optionally, the procedure was repeated to stack additional feature-extraction layers on top of the first one. Once thus trained, the learnt parameter values of the resulting feature extractors (weight and bias of the encoder) were used as initialisation of a multilayer perceptron (MLP) with an extra randomly-initialised output layer. The whole network was then fine-tuned by gradient descent on a supervised objective appropriate for classification [4], using the labels in the training set.

Datasets used. We have tested our approach on a benchmark of image classification problems, namely CIFAR-bw, a gray-scale version of the CIFAR-10 image-classification task (Krizhevsky and Hinton, 2009), and MNIST, the well-known digit classification problem. We also used problems from the same benchmark as Vincent et al. (2010), which includes five harder digit recognition problems derived by adding extra factors of variation to MNIST digits, each with 10000 examples for training, 2000 for validation, 50000 for test, as well as two artificial shape classification problems [5].

5.1. Classification performance

Our first series of experiments focuses on the MNIST and CIFAR-bw datasets. We compare the classification performance obtained by a neural network with one hidden layer of 1000 units, initialized with each of the unsupervised algorithms under consideration. For each case, we selected the value of hyperparameters (such as the strength of regularization) that yielded, after supervised fine-tuning, the best classification performance. [...]

[4] We used sigmoid + cross-entropy for binary classification, and log of softmax for multi-class problems.
[5] These datasets are available at http://www.iro.umontreal.ca/~lisa/icml2007: basic: smaller subset of MNIST; rot: digits with added random rotation; bg-rand: digits with random noise background; bg-img: digits with random image background; bg-img-rot: digits with rotation and image background; rect: discriminate between tall and wide rectangles (white on black); rect-img: discriminate between tall and wide rectangular image on a different background image.

Table 2. Comparison of stacked contractive auto-encoders with 1 and 2 layers (CAE-1 and CAE-2) with other 3-layer stacked models and a baseline SVM. Test error rate on all considered classification problems is reported together with a 95% confidence interval. Best performer is in bold, as well as those for which confidence intervals overlap. Clearly CAEs can be successfully used to build top-performing deep networks. 2 layers of CAE often outperformed 3 layers of other stacked models.

Data Set     SVMrbf        SAE-3         RBM-3         DAE-b-3       CAE-1         CAE-2
basic        3.03±0.15     3.46±0.16     3.11±0.15     2.84±0.15     2.83±0.15     2.48±0.14
rot          11.11±0.28    10.30±0.27    10.30±0.27    9.53±0.26     11.59±0.28    9.66±0.26
bg-rand      14.58±0.31    11.28±0.28    6.73±0.22     10.30±0.27    13.57±0.30    10.90±0.27
bg-img       22.61±0.37    23.00±0.37    16.31±0.32    16.68±0.33    16.70±0.33    15.50±0.32
bg-img-rot   55.18±0.44    51.93±0.44    47.39±0.44    43.76±0.43    48.10±0.44    45.23±0.44
rect         2.15±0.13     2.41±0.13     2.60±0.14     1.99±0.12     1.48±0.10     1.21±0.10
rect-img     24.04±0.37    24.05±0.37    22.50±0.37    21.59±0.36    21.86±0.36    21.54±0.36

[...] (DAE-g), the contraction ratio decreases (i.e., towards more contraction) as we move away from the training examples (this is due to more saturation, and was expected), whereas for the CAE the contraction ratio initially increases, up to the point where the effect of saturation takes over (the bump occurs at about the maximum distance between two training examples).

Think about the case where the training examples congregate near a low-dimensional manifold. The variations present in the data (e.g. translations and rotations of objects in images) correspond to local dimensions along the manifold, while the variations that are small or rare in the data correspond to the directions orthogonal to the manifold (at a particular point near the manifold, corresponding to a particular example). The proposed criterion is trying to make the features invariant in all directions around the training examples, but the reconstruction error (or likelihood) is making sure that the representation is faithful, i.e., can be used to reconstruct the input example. Hence the directions that resist this contracting pressure (strong invariance to input changes) are the directions present in the training set. Indeed, if the variations along these directions present in the training set were not preserved, neighboring training examples could not be distinguished and properly reconstructed. Hence the directions where the contraction is strong (small ratio, small singular values of the Jacobian matrix) are also the directions where the model believes that the input density drops quickly, whereas the directions where the contraction is weak (closer to 1, larger contraction ratio, larger singular values of the Jacobian matrix) correspond to the directions where the model believes that the input density is flat (and large, since we are near a training example).

We believe that this contraction penalty thus helps the learner carve a kind of mountain supported by the training examples, and generalizing to a ridge between them. What we would like is for these ridges to correspond to some directions of variation present in the data, associated with underlying factors of variation. How far do these ridges extend around each training example and how flat are they? This can be visualized comparatively with the analysis of Figure 1, with the contraction ratio for different distances from the training examples.
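Figure 1 itself is not reproduced in this transcript, but one plausible way to measure the contraction ratio it plots (our reading; the exact sampling protocol is an assumption) is to sample points x' at a fixed distance r from a point x and average the ratio of feature-space to input-space distances:

```python
import numpy as np

rng = np.random.default_rng(0)

def contraction_ratio(f, x, r, n_samples=100):
    """Average ||f(x') - f(x)|| / ||x' - x|| over x' on a sphere of radius r."""
    d = rng.normal(size=(n_samples, x.size))
    d = r * d / np.linalg.norm(d, axis=1, keepdims=True)   # points at distance r
    return np.mean([np.linalg.norm(f(x + di) - f(x)) / r for di in d])

# Illustrative sigmoid encoder with made-up weights; in the paper this
# measurement would be taken at training points of a trained model.
W = rng.normal(scale=0.3, size=(10, 20))
f = lambda x: 1.0 / (1.0 + np.exp(-(W @ x)))

x = rng.normal(size=20)
for r in (0.1, 1.0, 5.0):
    print(r, contraction_ratio(f, x, r))
```

A ratio below 1 indicates contraction at that radius; sweeping r and averaging over many points yields curves of the kind discussed above.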
Note that different features (elements of the representation vector) would be expected to have ridges (i.e. directions of invariance) in different directions, and that the "dimensionality" of these ridges (we are in a fairly high-dimensional space) gives a hint as to the local dimensionality of the manifold near which the data examples congregate. The singular value spectrum of the Jacobian informs us about that geometry. The number of large singular values should reveal the dimensionality of these ridges, i.e., of that manifold near which examples concentrate. This is illustrated in Figure 2, showing the singular value spectrum of the encoder's Jacobian. The CAE by far does the best job at representing the data variations near a lower-dimensional manifold, and the DAE is second best, while ordinary auto-encoders (regularized or not) do not succeed at all in this respect.

What happens when we stack a CAE on top of another one, to build a deeper encoder? This is illustrated in Figure 3, which shows the average contraction ratio for different distances around each training point, for depth-1 vs depth-2 encoders. Composing two CAEs yields even more contraction and even more non-linearity, i.e. a sharper profile, with a flatter level of contraction at short and medium distances, and a delayed effect of saturation (the bump only comes up at farther distances). We would thus expect higher-level features to be more invariant in their feature- [...]
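The singular value analysis behind Figure 2 can be probed directly: for a sigmoid encoder the Jacobian at x is J_f(x) = diag(h (1 - h)) W, and counting its large singular values hints at the local manifold dimensionality discussed above. A sketch with made-up weights (in the paper this spectrum is evaluated at training points of learned models):

```python
import numpy as np

def encoder_jacobian(W, b_h, x):
    """J_f(x) = diag(h (1 - h)) W for a sigmoid encoder h = sigmoid(W x + b_h)."""
    h = 1.0 / (1.0 + np.exp(-(W @ x + b_h)))
    return (h * (1 - h))[:, None] * W

rng = np.random.default_rng(0)
d_x, d_h = 20, 10
W, b_h = rng.normal(scale=0.3, size=(d_h, d_x)), np.zeros(d_h)
x = rng.normal(size=d_x)

# Large singular values mark locally preserved directions; small ones,
# contracted directions (cf. the ridge discussion in section 5.3).
spectrum = np.linalg.svd(encoder_jacobian(W, b_h, x), compute_uv=False)
print(spectrum)
```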