The Manifold Tangent Classifier

Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller
Department of Computer Science and Operations Research
University of Montreal, Montreal, H3C 3J7
{rifaisal, dauphiya, vincentp, bengioy, mullerx}@iro.umontreal.ca

Abstract

We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive auto-encoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local directions of change along the manifold. Record-breaking classification results are obtained.

1 Introduction

Much of machine learning research can be viewed as an exploration of ways to compensate for scarce prior knowledge about how to solve a specific task by extracting (usually implicit) knowledge from vast amounts of data. This is especially true of the search for generic learning algorithms that are to perform well on a wide range of domains for which they were not specifically tailored. While such an outlook precludes using much domain-specific knowledge in designing the algorithms, it can however be beneficial to leverage what might be called "generic" prior hypotheses, that appear likely to hold for a wide range of problems. The approach studied in the present work exploits three such prior hypotheses:

1. The semi-supervised learning hypothesis, according to which learning aspects of the input distribution p(x) can improve models of the conditional distribution of the supervised target p(y|x), i.e., p(x) and p(y|x) share something (Lasserre et al., 2006). This hypothesis underlies not only the strict semi-supervised setting where one has many more
unlabeled examples at one's disposal than labeled ones, but also the successful unsupervised pre-training approach for learning deep architectures, which has been shown to significantly improve supervised performance even without using additional unlabeled examples (Hinton et al., 2006; Bengio, 2009; Erhan et al., 2010).

2. The (unsupervised) manifold hypothesis, according to which real world data presented in high dimensional spaces is likely to concentrate in the vicinity of non-linear sub-manifolds of much lower dimensionality (Cayton, 2005; Narayanan and Mitter, 2010).

3. The manifold hypothesis for classification, according to which points of different classes are likely to concentrate along different sub-manifolds, separated by low density regions of the input space.

The recently proposed Contractive Auto-Encoder (CAE) algorithm (Rifai et al., 2011a), based on the idea of encouraging the learned representation to be robust to small variations of the input, was shown to be very effective for unsupervised feature learning. Its successful application in the pre-training of deep neural networks is yet another illustration of what can be gained by adopting hypothesis 1. In addition, Rifai et al. (2011a) propose, and show empirical evidence for, the hypothesis that the trade-off between reconstruction error and the pressure to be insensitive to variations in input space has an interesting consequence: it yields a mostly contractive mapping that, locally around each training point, remains substantially sensitive only to a few input directions (with different directions of sensitivity for different training points). This is taken as evidence that the algorithm indirectly exploits hypothesis 2 and models a lower-dimensional manifold. Most of the directions to which the representation is substantially sensitive are thought to be directions tangent to the data-supporting manifold (those that locally define its tangent space).

The present work follows through on this interpretation, and investigates whether it is possible to use this information, presumably captured about manifold structure, to further improve classification performance by leveraging hypothesis 3. To that end, we extract a set of basis vectors for the local tangent space at each training point from the Contractive Auto-Encoder's learned parameters. This is obtained with a Singular Value Decomposition (SVD) of the Jacobian of the encoder that maps each input to its learned representation. Based on hypothesis 3, we then adopt the "generic prior" that class labels are likely to be insensitive to most directions within these local tangent spaces (e.g., small translations, rotations or scalings usually do not change an image's class). Supervised classification algorithms that have been devised to efficiently exploit tangent directions given as domain-specific prior knowledge (Simard et al., 1992, 1993) can readily be used instead with our learned tangent spaces. In particular, we will show record-breaking improvements by using TangentProp for fine-tuning CAE-pre-trained deep neural networks. To the best of our knowledge this is the first time that the implicit relationship between an unsupervised learned mapping and the tangent space of a manifold is rendered explicit and successfully exploited for the training of a classifier. This showcases a unified approach that simultaneously leverages all three "generic" prior hypotheses considered. Our experiments (see Section 6) show that this approach sets new records for domain-knowledge-free performance on several real-world classification problems. Remarkably, in some cases it even outperformed methods that use weak or strong domain-specific prior knowledge (e.g. convolutional networks and tangent distance based on a priori known transformations). Naturally, this approach is even more likely to be beneficial for datasets where no prior knowledge is readily available.

2 Contractive auto-encoders (CAE)

We consider the problem of the unsupervised learning of a non-linear feature extractor from a dataset D = {x_1, ..., x_n}. Examples x_i ∈ IR^d are i.i.d. samples from an unknown distribution p(x).

2.1 Traditional auto-encoders

The auto-encoder framework is one of the oldest and simplest techniques for the unsupervised learning of non-linear feature extractors. It learns an encoder function h, that maps an input x ∈ IR^d to a hidden representation h(x) ∈ IR^{d_h}, jointly with a decoder function g, that maps h back to the input space as r = g(h(x)), the reconstruction of x. The encoder and decoder's parameters are learned by stochastic gradient descent to minimize the average reconstruction error L(x, g(h(x))) for the examples of the training set. The objective being minimized is:

J_AE(θ) = Σ_{x∈D} L(x, g(h(x))).   (1)

We will use the most common forms of encoder, decoder, and reconstruction error:

Encoder: h(x) = s(Wx + b_h), where s is the element-wise logistic sigmoid s(z) = 1 / (1 + e^{−z}). Parameters are a d_h × d weight matrix W and bias vector b_h ∈ IR^{d_h}.

Decoder: r = g(h(x)) = s_2(W^T h(x) + b_r). Parameters are W^T (tied weights, shared with the encoder) and bias vector b_r ∈ IR^d. Activation function s_2 is either a logistic sigmoid (s_2 = s) or the identity (linear decoder).

Loss function: either the squared error L(x, r) = ‖x − r‖^2, or the Bernoulli cross-entropy L(x, r) = −Σ_{i=1}^d x_i log(r_i) + (1 − x_i) log(1 − r_i).

The set of parameters of such an auto-encoder is θ = {W, b_h, b_r}. Historically, auto-encoders were primarily viewed as a technique for dimensionality reduction, where a narrow bottleneck (i.e. d_h < d) was in effect acting as a capacity control mechanism. By contrast, recent successes (Bengio et al., 2007; Ranzato et al., 2007a; Kavukcuoglu et al., 2009; Vincent et al., 2010; Rifai et al., 2011a) tend to rely on rich, oftentimes over-complete representations (d_h > d), so that more sophisticated forms of regularization are required to pressure the auto-encoder to extract relevant features and avoid trivial solutions. Several successful techniques aim at sparse representations (Ranzato et al., 2007a; Kavukcuoglu et al., 2009; Goodfellow et al., 2009). Alternatively, denoising auto-encoders (Vincent et al., 2010) change the objective from mere reconstruction to that of denoising.

2.2 First order and higher order contractive auto-encoders

More recently, Rifai et al. (2011a) introduced the Contractive Auto-Encoder (CAE), which encourages robustness of the representation h(x) to small variations of a training input x, by penalizing its sensitivity to that input, measured as the Frobenius norm of the encoder's Jacobian J(x) = ∂h/∂x (x). The regularized objective minimized by the CAE is the following:

J_CAE(θ) = Σ_{x∈D} L(x, g(h(x))) + λ ‖J(x)‖^2,   (2)

where λ is a non-negative regularization hyper-parameter that controls how strongly the norm of the Jacobian is penalized. Note that, with the traditional sigmoid encoder form given above, one can easily obtain the Jacobian of the encoder. Its j-th row is obtained from the j-th row of W as:

J(x)_j = ∂h_j(x)/∂x = h_j(x) (1 − h_j(x)) W_j.   (3)

Computing the extra penalty term (and its contribution to the gradient) is similar to computing the reconstruction error term (and its contribution to the gradient), thus relatively cheap. It is also possible to penalize higher order derivatives (Hessian) by using a simple stochastic technique that eschews computing them explicitly, which would be prohibitive. It suffices to penalize differences between the Jacobian at x and the Jacobian at nearby points x̃ = x + ε (stochastic corruptions of x). This yields the CAE+H (Rifai et al., 2011b) variant with the following optimization objective:

J_CAE+H(θ) = Σ_{x∈D} L(x, g(h(x))) + λ ‖J(x)‖^2 + γ E_{ε∼N(0,σ^2 I)}[ ‖J(x) − J(x + ε)‖^2 ],   (4)

where γ is an additional regularization hyper-parameter that controls how strongly we penalize local variations of the Jacobian, i.e. higher order derivatives. The expectation E is over the Gaussian noise variable ε. In practice stochastic samples thereof are used for each stochastic gradient update. The CAE+H is the variant used for our experiments.

3 Characterizing the tangent bundle captured by a CAE

Rifai et al. (2011a) reason that, while the regularization term encourages insensitivity of h(x) in all input space directions, this pressure is counterbalanced by the need for accurate reconstruction, thus resulting in h(x) being substantially sensitive only to the few input directions required to distinguish close-by training points. The geometric interpretation is that these directions span the local tangent space of the underlying manifold that supports the data. The tangent bundle of a smooth manifold is the manifold along with the set of tangent planes taken at all points on it. Each such tangent plane can be equipped with a local Euclidean coordinate system or chart. In topology, an atlas is a collection of such charts (like the locally Euclidean map in each page of a geographic atlas). Even though the set of charts may form a non-Euclidean manifold (e.g., a sphere), each chart is Euclidean.
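The closed form of Eq. (3) is easy to check numerically. The following is a minimal NumPy sketch (illustrative only, not the authors' implementation; the toy dimensions and the random weight matrix are assumptions) that computes the sigmoid encoder's Jacobian in closed form, the contractive penalty of Eq. (2), a stochastic estimate of the higher-order term of Eq. (4), and verifies the Jacobian against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d, dh = 5, 3                              # toy input / hidden dimensions
W = rng.normal(scale=0.1, size=(dh, d))   # encoder weight matrix (random stand-in)
bh = np.zeros(dh)                         # encoder bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    # h(x) = s(Wx + b_h)
    return sigmoid(W @ x + bh)

def jacobian(x):
    # Eq. (3): row j of J(x) is h_j(x) (1 - h_j(x)) W_j
    h = encode(x)
    return (h * (1.0 - h))[:, None] * W

def contractive_penalty(x):
    # Eq. (2) penalty term: squared Frobenius norm of the encoder Jacobian
    return np.sum(jacobian(x) ** 2)

def higher_order_penalty(x, sigma=0.1, n_samples=10):
    # Monte Carlo estimate of the CAE+H term of Eq. (4):
    # E_{eps ~ N(0, sigma^2 I)} ||J(x) - J(x + eps)||^2
    J = jacobian(x)
    diffs = [np.sum((J - jacobian(x + rng.normal(scale=sigma, size=d))) ** 2)
             for _ in range(n_samples)]
    return np.mean(diffs)

x = rng.normal(size=d)
# Sanity check: closed-form Jacobian matches central finite differences
J = jacobian(x)
eps = 1e-6
J_fd = np.column_stack(
    [(encode(x + eps * np.eye(d)[i]) - encode(x - eps * np.eye(d)[i])) / (2 * eps)
     for i in range(d)])
assert np.allclose(J, J_fd, atol=1e-6)
```

Because the Jacobian only rescales the rows of W by h_j(1 − h_j), evaluating the penalty costs about as much as one encoder pass, which is why the paper describes the extra term as relatively cheap.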
3.1 Conditions for the feature mapping to define an atlas on a manifold

In order to obtain a proper atlas of charts, h must be a diffeomorphism. It must be smooth (C^∞) and invertible on open Euclidean balls on the manifold M around the training points. Smoothness is guaranteed because of our choice of parametrization (affine + sigmoid). Injectivity (different values of h(x) correspond to different values of x) on the training examples is encouraged by minimizing reconstruction error (otherwise we cannot distinguish training examples x_i and x_j by only looking at h(x_i) and h(x_j)). Since h(x) = s(Wx + b_h) and s is invertible, using the definition of injectivity we get (by composing h(x_i) = h(x_j) with s^{−1}):

∀i,j   h(x_i) = h(x_j) ⟺ W Δ_ij = 0,   where Δ_ij = x_i − x_j.

In order to preserve the injectivity of h, W has to form a basis spanned by its rows W_k, where ∀i,j ∃α ∈ IR^{d_h}, Δ_ij = Σ_k^{d_h} α_k W_k. With this condition satisfied, the mapping h is injective in the subspace spanned by the variations in the training set. If we limit the domain of h to h(X) ⊆ (0,1)^{d_h}, comprising values obtainable by h applied to some set X, then we obtain surjectivity by definition, hence bijectivity of h between the training set D and h(D). Let M_x be an open ball on the manifold M around training example x. By smoothness of the manifold M and of the mapping h, we obtain bijectivity locally around the training examples (on the manifold) as well, i.e., between ∪_{x∈D} M_x and h(∪_{x∈D} M_x).

3.2 Obtaining an atlas from the learned feature mapping

Now that we have necessary conditions for local invertibility of h(x) for x ∈ D, let us consider how to define the local chart around x from the nature of h. Because h must be sensitive to changes from an example x_i to one of its neighbors x_j, but insensitive to other changes (because of the CAE penalty), we expect that this will be reflected in the spectrum of the Jacobian matrix J(x) = ∂h/∂x (x) at each training point x. In the ideal case where J(x) has rank k, h(x + v) differs from h(x) only if v is in the span of the singular vectors of J(x) with non-zero singular value. In practice, J(x) has many tiny singular values. Hence, we define a local chart around x using the Singular Value Decomposition J^T(x) = U(x) S(x) V^T(x) (where U(x) and V(x) are orthogonal and S(x) is diagonal). The tangent plane H_x at x is given by the span of the set of principal singular vectors B_x:

B_x = {U_k(x) | S_kk(x) > ε}   and   H_x = {x + v | v ∈ span(B_x)},

where U_k(x) is the k-th column of U(x), and span({z_k}) = {x | x = Σ_k w_k z_k, w_k ∈ IR}. We can thus define an atlas A captured by h, based on the local linear approximation around each example:

A = {(M_x, φ_x) | x ∈ D, φ_x(x̃) = B_x (x̃ − x)}.   (5)

Note that this way of obtaining an atlas can also be applied to subsequent layers of a deep network. It is thus possible to use a greedy layer-wise strategy to initialize a network with CAEs (Rifai et al., 2011a) and obtain an atlas that corresponds to the nonlinear features computed at any layer.

4 Exploiting the learned tangent directions for classification

Using the previously defined charts for every point of the training set, we propose to use this additional information provided by unsupervised learning to improve the performance of the supervised task. In this we adopt the manifold hypothesis for classification mentioned in the introduction.

4.1 CAE-based tangent distance

One way of achieving this is to use a nearest neighbor classifier with a similarity criterion defined as the shortest distance between two hyperplanes (Simard et al., 1993). The tangents extracted at each point allow us to shrink the distance between two samples when they can approximate each other by a linear combination of their local tangents. Following Simard et al. (1993), we define the tangent distance between two points x and y as the distance between the two hyperplanes H_x, H_y ⊂ IR^d spanned respectively by B_x and B_y. Using the usual definition of distance between two spaces, d(H_x, H_y) = inf{‖z − w‖^2 | (z, w) ∈ H_x × H_y}, we obtain the solution for this convex
problem by solving a system of linear equations (Simard et al., 1993). This procedure corresponds to allowing the considered points x and y to move along the directions spanned by their associated local charts. Their distance is then evaluated at the new coordinates where the distance is minimal. We can then use a nearest neighbor classifier based on this distance.

4.2 CAE-based tangent propagation

Nearest neighbor techniques are often impractical for large scale datasets because their computational requirements scale linearly with n for each test case. By contrast, once trained, neural networks yield fast responses for test cases. We can also leverage the extracted local charts when training a neural network. Following the tangent propagation approach of Simard et al. (1992), but exploiting our learned tangents, we encourage the output o of a neural network classifier to be insensitive to variations in the directions of the local chart of x by adding the following penalty to its supervised objective function:

Ω(x) = Σ_{u∈B_x} ‖ (∂o/∂x)(x) u ‖^2.   (6)

The contribution of this term to the gradients of the network parameters can be computed in O(N_w), where N_w is the number of neural network weights.

4.3 The Manifold Tangent Classifier (MTC)

Putting it all together, here is the high level summary of how we build and train a deep network:

1. Train (unsupervised) a stack of K CAE+H layers (Eq. 4). Each is trained in turn on the representation learned by the previous layer.
2. For each x_i ∈ D, compute the Jacobian of the last layer representation J^(K)(x_i) = ∂h^(K)/∂x (x_i) and its SVD [1]. Store the leading d_M singular vectors in the set B_{x_i}.
3. On top of the K pre-trained layers, stack an output layer of size the number of classes. Fine-tune the whole network for supervised classification [2] with an added tangent propagation penalty (Eq. 6), using for each x_i the tangent directions B_{x_i}.

We call this deep learning algorithm the Manifold Tangent Classifier (MTC). Alternatively, instead of step 3, one can use the tangent vectors in B_{x_i} in a tangent distance nearest neighbors classifier.

[1] J^(K) is the product of the Jacobians of each encoder (see Eq. 3) in the stack. It suffices to compute its leading d_M SVD vectors and singular values. This is achieved in O(d_M d d_h) per training example. For comparison, the cost of a forward propagation through a single MLP layer is O(d d_h) per example.
[2] A sigmoid output layer is preferred because computing its Jacobian is straightforward and efficient (Eq. 3). The supervised cost used is the cross-entropy. Training is by stochastic gradient descent.

5 Related prior work

Many Non-Linear Manifold Learning algorithms (Roweis and Saul, 2000; Tenenbaum et al., 2000) have been proposed which can automatically discover the main directions of variation around each training point, i.e., the tangent bundle. Most of these algorithms are non-parametric and local, i.e., explicitly parametrizing the tangent plane around each training point (with a separate set of parameters for each, or derived mostly from the set of training examples in every neighborhood), as most explicitly seen in Manifold Parzen Windows (Vincent and Bengio, 2003) and manifold Charting (Brand, 2003). See Bengio and Monperrus (2005) for a critique of local non-parametric manifold algorithms: they might require a number of training examples which grows exponentially with manifold dimension and curvature (more crooks and valleys in the manifold will require more examples). One attempt to generalize the manifold shape non-locally (Bengio et al., 2006) is based on explicitly predicting the tangent plane associated to any given point x, as a parametrized function of x. Note that these algorithms all explicitly exploit training set neighborhoods (see Figure 2), i.e. they use pairs or tuples of points, with the goal to explicitly model the tangent space, while it is
modeled implicitly by the CAE's objective function (which is not based on pairs of points). More recently, the Local Coordinate Coding (LCC) algorithm (Yu et al., 2009) and its Local Tangent LCC variant (Yu and Zhang, 2010) were proposed to build a local chart around each training example (with a local low-dimensional coordinate system around it) and use it to define a representation for each input x: the responsibility of each local chart/anchor in explaining input x, and the coordinates of x in each local chart. That representation is then fed to a classifier and yields better generalization than x itself.

The tangent distance (Simard et al., 1993) and TangentProp (Simard et al., 1992) algorithms were initially designed to exploit prior domain knowledge of directions of invariance (e.g., knowledge that the class of an image should be invariant to small translations, rotations or scalings in the image plane). However, any algorithm able to output a chart for a training point might potentially be used, as we do here, to provide directions to a tangent distance or TangentProp based classifier. Our approach is nevertheless unique, as the CAE's unsupervised feature learning capabilities are used simultaneously to provide a good initialization of deep network layers and a coherent non-local predictor of tangent spaces.

TangentProp is itself closely related to the Double Backpropagation algorithm (Drucker and LeCun, 1992), in which one instead adds a penalty that is the sum of squared derivatives of the prediction error (with respect to the network input). Whereas TangentProp attempts to make the output insensitive to selected directions of change, the double backpropagation penalty term attempts to make the error at a training example invariant to changes in all directions. Since one is also trying to minimize the error at the training example, this amounts to making that minimization more robust, i.e., extending it to the neighborhood of the training examples.

Also related is the Semi-Supervised Embedding algorithm (Weston et al., 2008). In addition to minimizing a supervised prediction error, it encourages each layer of representation of a deep architecture to be invariant when the training example is changed from x to a near neighbor of x in the training set. This algorithm works implicitly under the hypothesis that the variable y to predict from x is invariant to the local directions of change present between nearest neighbors. This is consistent with the manifold hypothesis for classification (hypothesis 3 mentioned in the introduction). Instead of removing variability along the local directions of variation, the Contractive Auto-Encoder (Rifai et al., 2011a) initially finds a representation which is most sensitive to them, as we explained in Section 2.

6 Experiments

We conducted experiments to evaluate our approach and the quality of the manifold tangents learned by the CAE, using a range of datasets from different domains:

MNIST is a dataset of 28×28 images of handwritten digits. The learning task is to predict the digit contained in the images.
Reuters Corpus Volume I is a popular benchmark for document classification. It consists of 800,000 real-world newswire stories made available by Reuters. We used the 2000 most frequent words calculated on the whole dataset to create a bag-of-words vector representation. We used the LYRL2004 split to separate between a train and test set.
CIFAR-10 is a dataset of 70,000 32×32 RGB real-world images. It contains images of real-world objects (e.g. cars, animals) with all the variations present in natural images (e.g. backgrounds).
Forest CoverType is a large-scale database of cartographic variables for the prediction of forest cover types, made available by the US Forest Service.

We investigate whether leveraging the CAE-learned tangents leads to better classification performance on these problems, using the following methodology: optimal hyper-parameters for (a stack of) CAEs are selected by cross-validation on a disjoint validation set extracted from the training set. The quality of the feature extractor and tangents captured by the CAEs is evaluated by initializing a neural network (MLP) with the same parameters and fine-tuning it by backpropagation on the supervised classification task. The optimal strength of the supervised TangentProp penalty and the number of tangents d_M are also cross-validated.

Results. Figure 1 shows a visualization of the tangents learned by the CAE. On MNIST, the tangents mostly correspond to small geometrical transformations like translations and rotations. On CIFAR-10, the
Figure 1: Visualization of the tangents learned by the CAE for MNIST, CIFAR-10 and RCV1 (top to bottom). The left-most column is the example and the following columns are its tangents. On RCV1, we show the tangents of a document with the topic "Trading & Markets" (MCAT), with the negative terms in red (−) and the positive terms in green (+).

Figure 2: Tangents extracted by local PCA on CIFAR-10. This shows the limitation of approaches that rely on training set neighborhoods.

model also learns sensible tangents, which seem to correspond to changes in the parts of objects. The tangents on RCV1-v2 correspond to the addition or removal of similar words and the removal of irrelevant words. We also note that extracting the tangents of the model is a way to visualize what the model has learned about the structure of the manifold. Interestingly, we see that hypothesis 3 holds for these datasets, because most tangents do not change the class of the example.

Table 1: Classification accuracy on several datasets using KNN variants, measured on 10,000 test examples with 1,000 training examples. The KNN is trained on the raw input vector using the Euclidean distance, while the K-layer CAE + KNN is computed on the representation learned by a K-layer CAE. The KNN + Tangents uses at every sample the local charts extracted from the 1-layer CAE to compute tangent distance.

            KNN    KNN+Tangents   1-Layer CAE+KNN   2-Layer CAE+KNN
MNIST       86.9   88.7           90.55             91.15
CIFAR-10    25.4   26.5           25.1              -
COVERTYPE   70.2   70.98          69.54             67.45

We use KNN with tangent distance to evaluate the quality of the learned tangents more objectively. Table 1 shows that using the tangents extracted from a CAE always leads to better performance than a traditional KNN.

As described in Section 4.2, the tangents extracted by the CAE can be used for fine-tuning the multi-layer perceptron using tangent propagation, yielding our Manifold Tangent Classifier (MTC). As it is a semi-supervised approach, we evaluate its effectiveness with a varying amount of labeled examples on MNIST. Following Weston et al. (2008), the unsupervised feature extractor is trained on the full training set and the supervised classifier is trained on a restricted labeled set. Table 2 shows our results for a single hidden layer MLP initialized with CAE+H pretraining (noted CAE for brevity) and for the same classifier fine-tuned with tangent propagation (i.e. the manifold tangent classifier of Section 4.3, noted MTC). The methods that do not leverage the semi-supervised learning hypothesis (Support Vector Machines, traditional Neural Networks and Convolutional Neural Networks) give very poor performance when the amount of labeled data is low. In some cases, the methods that can learn from unlabeled data can reduce the classification error by half. The CAE gives better results than the other approaches across almost the whole range considered. It shows that the features extracted from the rich unlabeled data distribution give a good inductive prior for the classification task. Note that the MTC consistently outperforms the CAE on this benchmark.

Table 2: Semi-supervised classification error on the MNIST test set with 100, 600, 1000 and 3000 labeled training examples. We compare our method with results from (Weston et al., 2008; Ranzato et al., 2007b; Salakhutdinov and Hinton, 2007).

       NN      SVM     CNN     TSVM    DBN-rNCA   EmbedNN   CAE     MTC
100    25.81   23.44   22.98   16.81   -          16.86     13.47   12.03
600    11.44   8.85    7.68    6.16    8.7        5.97      6.3     5.13
1000   10.7    7.77    6.45    5.38    -          5.73      4.77    3.64
3000   6.04    4.21    3.35    3.45    3.3        3.59      3.22    2.57

Table 3: Classification error on the MNIST test set with the full training set.

K-NN    NN      SVM     DBN     CAE     DBM     CNN     MTC
3.09%   1.60%   1.40%   1.17%   1.04%   0.95%   0.95%   0.81%

Table 3 shows our results on the full MNIST dataset, with some results taken from (LeCun et al., 1999; Hinton et al., 2006). The CAE in this table is a two-layer deep network with 2000 units per layer pretrained with the CAE+H objective. The MTC uses the same stack of CAEs trained with tangent propagation using 15 tangents. The prior state of the art for the permutation-invariant version of the task was set by the Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009) at 0.95%. Using our approach, we reach 0.81% error on the test set. Remarkably, the MTC also outperforms the basic Convolutional Neural Network (CNN), even though the CNN exploits prior knowledge about vision, using convolution and pooling to enhance the results.

Table 4: Classification error on the Forest CoverType dataset.

SVM     Distributed SVM   MTC
4.11%   3.46%             3.13%

We also trained a 4-layer MTC on the Forest CoverType dataset. Following Trebar and Steele (2008), we use the data split DS2-581, which contains over 500,000 training examples. The MTC yields the best performance for the classification task, beating the previous state of the art held by the distributed SVM (a mixture of several non-linear SVMs).

7 Conclusion

In this work, we have shown a new way to characterize a manifold by extracting a local chart at each data point based on the unsupervised feature mapping built with a deep learning approach. The developed Manifold Tangent Classifier successfully leverages three common "generic prior hypotheses" in a unified manner. It learns a meaningful representation that captures the structure of the manifold, and can leverage this knowledge to reach superior classification performance. On datasets from different domains, it successfully achieves state of the art performance.

Acknowledgments

The authors would like to acknowledge the support of the following agencies for research funding and computing support: NSERC, FQRNT, Calcul Québec and CIFAR.

References

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book, Now Publishers, 2009.
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS'04, pages 129–136. MIT Press.
Bengio, Y., Larochelle, H., and Vincent, P. (2006). Non-local manifold parzen
windows. In NIPS'05, pages 115–122. MIT Press.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in NIPS 19.
Brand, M. (2003). Charting a manifold. In NIPS'02, pages 961–968. MIT Press.
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923, UCSD.
Drucker, H. and LeCun, Y. (1992). Improving generalisation performance using double back-propagation. IEEE Transactions on Neural Networks, 3(6), 991–997.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010). Why does unsupervised pre-training help deep learning? JMLR, 11, 625–660.
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS'09, pages 646–654.
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Kavukcuoglu, K., Ranzato, M., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topographic filter maps. Pages 1605–1612. IEEE.
Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative and discriminative models. Pages 87–94, Washington, DC, USA. IEEE Computer Society.
LeCun, Y., Haffner, P., Bottou, L., and Bengio, Y. (1999). Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pages 319–345. Springer.
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems 23, pages 1786–1794.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007a). Efficient learning of sparse representations with an energy-based model. In NIPS'06.
Ranzato, M., Huang, F., Boureau, Y., and LeCun, Y. (2007b). Unsupervised learning of invariant feature hierarchies with applications to object recognition. IEEE Press.
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the Twenty-eighth International Conference on Machine Learning (ICML'11).
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011b). Higher order contractive auto-encoder. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD).
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
Salakhutdinov, R. and Hinton, G. E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS'2007, San Juan, Puerto Rico. Omnipress.
Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS'2009, volume 5, pages 448–455.
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism for specifying selected invariances in an adaptive network. In NIPS'91, pages 895–903, San Mateo, CA. Morgan Kaufmann.
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In NIPS'92, pages 50–58. Morgan Kaufmann, San Mateo.
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Trebar, M. and Steele, N. (2008). Application of distributed SVM architectures in classifying forest data cover types. Computers and Electronics in Agriculture, 63(2), 119–130.
Vincent, P. and Bengio, Y. (2003). Manifold parzen windows. In NIPS'02. MIT Press.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11, 3371–3408.
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA.
Yu, K. and Zhang, T. (2010). Improved local coordinate coding using local tangents.
Yu, K., Zhang, T., and Gong, Y. (2009). Nonlinear learning using local coordinate coding. In Advances in Neural Information Processing Systems 22, pages 2223–2231.