/
Discriminative Features via Generalized Eigenvectors Nikos Karampatziakis NIKOSK MICROSOFT Discriminative Features via Generalized Eigenvectors Nikos Karampatziakis NIKOSK MICROSOFT

Discriminative Features via Generalized Eigenvectors Nikos Karampatziakis NIKOSK MICROSOFT - PDF document

briana-ranney
briana-ranney . @briana-ranney
Follow
673 views
Uploaded On 2014-12-14

Discriminative Features via Generalized Eigenvectors Nikos Karampatziakis NIKOSK MICROSOFT - PPT Presentation

In this paper we investigate scalable techniques for inducing discriminative features by taking ad vantage of simple second order structure in the data We focus on multiclass classi64257cation and show that features extracted from the generalized ei ID: 24035

this paper

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "Discriminative Features via Generalized ..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

DiscriminativeFeaturesviaGeneralizedEigenvectors Inmulticlassclassication,thetensorE[x x y]issim-plyacollectionoftheconditionalsecondmomentmatricesCi=E[xx�jy=i].Therearemanystandardwaysofextractingfeaturesfromthesematrices.Forexample,onecouldtryper-classPCA(Wold&Sjostrom,1977)whichwillnddirectionsthatmaximizev�E[xx�jy=i]v=E[(v�x)2jy=i],orVCA(Livnietal.,2013)whichwillnddirectionsthatminimizethesamequantity.Thesub-tletyhereisthatthereisnoreasontobelievethatthesedirectionsarespecictoclassi.Inotherwords,thedi-rectionswendmightbeverysimilarforallclassesand,therefore,notbediscriminative.AsimplealternativeistoworkwiththequotientRij(v)=E[(v�x)2jy=i] E[(v�x)2jy=j]=v�Civ v�Cjv;(1)whoselocalmaximizersarethegeneralizedeigenvectorssolvingCiv=Cjv.1EfcientandrobustroutinesforsolvingthesetypesofproblemsarepartofmaturesoftwarepackagessuchasLAPACK.Sinceobjective(1)ishomogeneousinv,wewillassumethateacheigenvectorvisscaledsuchthatv�Cjv=1.Thenwehavethatv�Civ=,i.e.onaverage,thesquaredprojectionofanexamplefromclassionvwillbewhilethesquaredprojectionofanexamplefromclassjwillbe1.Aslongasisfarfrom1,thisgivesusadirectionalongwhichweexpecttobeabletodiscriminatethetwoclassesbysimplyusingthemagnitudeoftheprojection.Moreover,iftherearemanyeigenvaluessubstantiallydifferentfrom1allassociatedeigenvectorscanbeusedasfeaturedetectors.2.1.UsefulPropertiesThefeaturedetectorsresultingfrommaximizingequation(1)havetwousefulpropertieswhichwelistbelow.Forsimplicitywestatetheresultsassumingfullrankexactcon-ditionalmomentmatrices,andthendiscusstheimpactofregularizationandnitesamples.Proposition1.(Invariance)Undertheaboveassumptions,theembeddingv�xisinvarianttoinvertiblelineartrans-formationsofx.Proof.LetA2Rddbeinvertibleandx0=Axbethetransformedinput.LetCm=E[xx�jy=m]bethesecondmomentmatrixgiveny=mfortheorigi-naldatawithCholeskyfactorizationCm=LmL�m.Forthetransformeddata,theconditionalsecondmomentsareE[x0x�0jy=m]=AE[xx�jy=m]A�=ACmA� 1Analternativewouldbetousethecovariancematrixinsteadofthesecondmomentinthedenominator.Thisleadstoanoffsetterminourfeaturedetectorthatsometimesleadstobetterempir-icalresults.Foreaseofexpositionwedonotexplorethisintheremainderofthispaper.andthecorrespondinggeneralizedeigenvectorv0satisesACiA�v0=ACjA�v0.Lettingv0=A��L��ju0weseethatu0alsosatisesL�1jCiL��ju0=u0.Finally,theembeddinginvolvesonlyv�0x0=u�0L�1jA�1Ax=u�L�1jxwhichisthesameastheembeddingfortheorigi-naldata. Itisworthpointingoutthattheresultsofsomepopularmethods,suchasPCA,arenotinvarianttolineartrans-formationsoftheinputs.Forsuchmethods,differencesinpreprocessingandnormalizationcanleadtovastlydif-ferentresults.Thepracticalutilityofan“offtheshelf”classierisgreatlyimprovedbythisinvariance,whichpro-videsrobustnesstodataspecication,e.g.,differingunitsofmeasurementacrosstheoriginalfeatures.Proposition2.(Diversity)Twofeaturedetectorsv1andv2extractedfromthesameorderedclasspair(i;j)haveuncorrelatedresponsesE[(v�1x)(v�2x)jy=j]=0:Proof.Thisfollowsfromtheorthogonalityoftheeigen-vectorsintheinducedproblemL�1jCiL��ju=u(c.f.proofofProposition1)andtheconnectionv=L��ju.Ifu1andu2areeigenvectorsofL�1jCiL��jthen0=u�1u2=v�1LjL�jv2=v�1E[xx�jy=j]v2=E[(v�1x)(v�2x)jy=j]: Diversityindicatesthedifferentgeneralizedeigenvectorsperclasspairprovidecomplementaryinformation,andthattechniqueswhichonlyusetherstgeneralizedeigenvectorarenotmaximallyexploitingthedata.2.2.FiniteSampleConsiderationsEventhoughwehaveshownthepropertiesofourmethodassumingknowledgeoftheexpectationsE[xx�jy=m],inpracticeweestimatethesequantitiesfromourtrainingsamples.Theempiricalaverage^Cm=Pni=1I[yi=m]xix�i Pni=1I[yi=m](2)convergestotheexpectationatarateofO(n�1=2).Hereandbelowwearesuppressingthedependenceuponthedimensionalityd,whichweconsiderxed.Typical-nitesampletailboundsbecomemeaningfuloncen=O(dlogd)(Vershynin,2010).Given^Cm=Cm+EmwithjjEmjj2=O(n�1=2),wecanuseresultsfrommatrixperturbationtheorytoestablishthatournitesampleresultscannotbetoofarfromthoseobtainedusingtheexpectedvalues.Forexample,iftheCrawfordnumberc(Ci;Cj):=minjjvjj=1(v�Civ)2+(v�Cjv)2�0; DiscriminativeFeaturesviaGeneralizedEigenvectors Method Signal Noise PCA E[xx�] I VCA I E[xx�] FisherLDA Ey[E[xjy]E[xjy]�] PyCov[xjy] SIR PyE[wjy]E[wjy]� I OrientedPCA E[xx�] E[zz�] Ourmethod E[xx�jy=i] E[xx�jy=j] Table1.Tableofrelatedmethods(assumingE[x]=0)forndingdirectionsthatmaximizethesignaltonoiseratio.Cov[xjy]referstotheconditionalcovariancematrixofxgiveny,wisawhitenedversionofx,andzisanytypeofnoisemeaningfultothetaskathand.TheaboveobservationsleadtotheGEMprocedureout-linedinAlgorithm1.AlthoughAlgorithm1hasprovensufcientlyversatilefortheexperimentsdescribedherein,itismerelyanexampleofhowtousegeneralizedeigen-valuebasedfeaturesformulticlassclassication.Otherclassicationtechniquescouldbenetfromusingtherawprojectionvalueswithoutanynonlinearmanipulation,e.g.,decisiontrees;additionallythegeneralizedeigenvectorscouldbeusedtoinitializeaneuralnetworkarchitectureasaformofpre-training.WeremarkthateachstepinAlgorithm1ishighlyamenabletodistributedimplementation:empiricalclass-conditionalsecondmomentmatricescanbecomputedusingmap-reducetechniques,thegeneralizedeigenvalueproblemscanbesolvedindependentlyinparallel,andthelogisticre-gressionoptimizationisconvexandthereforehighlyscal-able(Agarwaletal.,2011).3.RelatedWorkOurapproachresemblesmanyexistingmethodsthatworkbyndingeigenvectorsofmatricesconstructedfromdata.Onecanthinkofalltheseapproachesasproceduresforndingdirectionsvthatmaximizeasignaltonoiseratio,withsymmetricmatricesSandNchosensuchthatthequadraticformsv�Svandv�Nvrepresentthesignalandthenoise,respectively,capturedalongdirectionv,R(v)=v�Sv v�Nv:(4)InTable1wepresentmanywellknownapproachesthatcouldbecastinthisframework.PrincipalComponentAnalysis(PCA)ndsthedirectionsofmaximalvariancewithoutanyparticularnoisemodel.TherecentlyproposedVanishingComponentAnalysis(VCA)(Livnietal.,2013)ndsthedirectionsonwhichtheprojectionsvanishsoitcanbethoughtasswappingtherolesofsignalandnoiseinPCA.FisherLDAmaximizesthevariabilityintheclassmeanswhileminimizingthewithinclassvariance.SlicedInverseRegressionrstwhitensx,andthenusesthesec-ondmomentmatrixoftheconditionalwhitenedmeansas Figure1.Picturesofthetop5generalizedeigenvectorsforMNISTforclasspairs(3;2)(toprow),(8;5)(secondrow),(3;5)(thirdrow),(8;0)(fourthrow),and(4;9)(bottomrow)with =0:5.Filtershavelargeresponseontherstclassandsmallresponseonthesecondclass.Bestviewedincolor.thesignaland,likePCA,hasnoparticularnoisemodel.Finally,orientedPCA(Diamantaras&Kung,1996;Plattetal.,2010)isaverygeneralframeworkinwhichthenoisematrixcanbethecorrelationmatrixofanytypeofnoisezmeaningfultothetaskathand.Bycloselyexaminingthesignalandnoisematrices,itisclearthateachmethodcanbefurtherdistinguishedaccord-ingtotwoothercapabilities:whetheritispossibletoex-tractmanydirections,andwhetherthedirectionsaredis-criminative.Forexample,PCAandVCAcanextractmanydirectionsbutthesearenotdiscriminative.Incontrast,FisherLDAandSIRarediscriminativebuttheyworkwithrank-kmatricessothenumberofdirectionsthatcouldbeextractedislimitedbythenumberofclasses.Furthermorebothofthesemethodslosevaluabledelityaboutthedatabyusingtheconditionalmeans.OrientedPCAissufcientlygeneraltoencompassourtechniqueasaspecialcase.Nonetheless,tothebestofourknowledge,thespecicsignalandnoisemodelsinthispa-perarenoveland,asweshowinSection4,theyempiricallyworkverywell.4.Experiments4.1.MNISTWebeginwiththeMNISTdatabaseofhandwrittendig-its(LeCunetal.,1998),forwhichwecanvisualizethegeneralizedeigenvectors,providingintuitionregardingthediscriminativenatureofthecomputeddirections.Foreachofthetenclasses,weestimatedCm=E[xx�jy=m]us-ing(2)andthenextractedgeneralizedeigenvectorsforeachclasspair(i;j)bysolving^Civ=( dTrace(^Cj)I+^Cj)v.Figure1showsasampleofresultsfromthisprocedurefor DiscriminativeFeaturesviaGeneralizedEigenvectors Method Frame Cross StateError(%) Entropy GEM 41:870:073 1:6370:001T-DSN 40.9 2.02GEM(ensemble) 40.86 1.581Table4.ResultsonTIMITtestset.T-DSNisthebestresultfrom(Hutchinsonetal.,2012).13stackedlayersofnonlineartransformations,whereastheGEMprocedureproducesashallowarchitecturewithasin-glenonlinearlayer.5.DiscussionGiventhesimplicityandempiricalsuccessofourmethod,weweresurprisedtondconsiderableworkonmethodsthatonlyextracttherstgeneralizedeigenvector(Mikaetal.,2003)butverylittleworkonusingthetopmgeneral-izedeigenvectors.Ourexperienceisthatadditionaleigen-vectorsprovidecomplementaryinformation.Empirically,theirinclusioninthenalclassierfaroutweighsthenec-essaryincreaseinsamplecomplexity,especiallygiventyp-icalmoderndatasetsizes.Thuswebelievethistechniqueshouldbevaluableinotherdomains.Ofcourseourmethodwillnotbeabletoextractanythingusefulifallclasseshavethesamesecondmomentbutdif-ferenthigherorderstatistics.Whileourlimitedexperienceheresuggestssecondmomentsareinformativefornatu-raldatasets,therearepotentialbenetsinusinghigheror-dermoments.Forexample,wecouldreplaceourclass-conditionalsecondmomentmatrixwithasecondmomentmatrixconditionedonotherevents,informedbyhigheror-dermoments.Asthenumberofclasslabelsincreases,sayk1000,ourbruteforceall-pairsapproach,whichscalesasO(k2),be-comesincreasinglydifcultbothcomputationallyandsta-tistically:weneedtosolveO(k2)eigenvectorproblems(possiblyinparallel)anddealwithO(k2)featuresintheultimateclassier.Takingastepback,theobjectofourattentionisthetensorE[x x y]andinthispaperweonlystudiedonewayofselectingpairsofslicesfromit.Inparticular,ourslicesaretensorcontractionswithoneofthestandardbasisvectorsinRk.Clearly,contractingthetensorwithanyvectoruinRkispossible.Thiscontractionleadstoaddsecondmomentmatrixwhichaveragestheexamplesofthedifferentclassesinthewayprescribedbyu.Anysensible,data-dependentwayofpickingagoodsetofvectorsushouldbeabletoreducethedependenceonk2.Thesameissuesalsoarisewithacontinuousy:howtodeneandestimatethepairsofmatriceswhosegeneral-izedeigenvectorsshouldbeextractedisnotimmediatelyclear.Still,thecasewhereyismultidimensional(vectorregression)canbereducedtothecaseofunivariateyusingthesametechniqueofcontractionwithavectoru.Featureextractionfromacontinuousycanbedonebydiscretiza-tion(solelyforthepurposeoffeatureextraction),whichismucheasierintheunivariatecasethaninthemultivariatecase.Indomainswhereexamplesexhibitlargevariation,orwhenlabeleddataisscarce,incorporatingpriorknowledgeisex-tremelyimportant.Forexample,inimagerecognition,con-volutionsandlocalpoolingarepopularwaystogeneraterepresentationsthatareinvarianttolocalizeddistortions.Directlyexploitingthespatialortemporalstructureoftheinputsignal,aswellasincorporatingotherkindsofinvari-ancesinourframework,isadirectionforfuturework.Highdimensionalproblemscreatebothcomputationalandstatisticalchallenges.Computationally,whend�106,thesolutionofgeneralizedeigenvalueproblemscanonlybeperformedviaspecializedlibrariessuchasScaLA-PACK,orviarandomizedtechniques,suchasthoseout-linedin(Halkoetal.,2011;Saibaba&Kitanidis,2013).Statistically,thenite-samplesecondmomentestimatescanbeinaccuratewhenthenumberofdimensionsover-whelmsthenumberofexamples.Theeffectofthisinac-curacyontheextractedeigenvectorsneedsfurtherinvesti-gation.Inparticular,itmightbeunimportantfordatasetsencounteredinpractice,e.g.,ifthetrueclass-conditionalsecondmomentmatriceshaveloweffectiverank(Bunea&Xiao,2012).Finally,ourapproachissimpletoimplementandwellsuitedtothedistributedsetting.Althoughadistributedim-plementationisoutofthescopeofthispaper,wedonotethataspectsofAlgorithm1weremotivatedbythedesireforefcientdistributedimplementation.Therecentsuc-cessofnon-convexlearningsystemshassparkedrenewedinterestinnon-convexrepresentationlearning.However,genericdistributednon-convexoptimizationisextremelychallenging.Ourapproachrstdecomposestheprob-lemintotractablenon-convexsubproblemsandthensubse-quentlycomposeswithconvextechniques.Ultimatelywehopethatjudiciousapplicationofconvenientnon-convexobjectives,coupledwithconvexoptimizationtechniques,willyieldcompetitiveandscalablelearningalgorithms.6.ConclusionWehaveshownamethodforcreatingdiscriminativefea-turesviasolvinggeneralizedeigenvalueproblems,anddemonstratedempiricalefcacyviamultipleexperiments.Themethodhasmultiplecomputationalandstatisticaldesiderata.Computationally,generalizedeigenvalueex-tractionisamaturenumericalprimitive,andthematriceswhicharedecomposedcanbeestimatedusingmap-reducetechniques.Statistically,themethodisinvarianttoinvert- DiscriminativeFeaturesviaGeneralizedEigenvectors iblelineartransformations,estimationoftheeigenvectorsisrobustwhenthenumberofexamplesexceedsthenum-berofvariables,andestimationoftheresultingclassierparametersiseasedduetotheparsimonyofthederivedrepresentation.Duetothiscombinationofempirical,computational,andstatisticalproperties,webelievethemethodintroducedhereinhasutilityforawidevarietyofmachinelearningproblems.AcknowledgmentsWethankJohnPlattandLiDengforhelpfuldiscussionsandassistancewiththeTIMITexperiments.ReferencesAgarwal,Alekh,Chapelle,Olivier,Dud´k,Miroslav,andLangford,John.Areliableeffectiveterascalelinearlearningsystem.CoRR,abs/1110.4198,2011.Blackard,JockAandDean,DenisJ.Comparativeaccura-ciesofarticialneuralnetworksanddiscriminantanal-ysisinpredictingforestcovertypesfromcartographicvariables.ComputersandElectronicsinAgriculture,24(3):131–151,1999.Bunea,F.andXiao,L.Onthesamplecovariancematrixestimatorofreducedeffectiverankpopulationmatrices,withapplicationstofPCA.ArXive-prints,December2012.Dean,Jeffrey,Corrado,Greg,Monga,Rajat,Chen,Kai,Devin,Matthieu,Le,QuocV.,Mao,MarkZ.,Ran-zato,Marc'Aurelio,Senior,AndrewW.,Tucker,PaulA.,Yang,Ke,andNg,AndrewY.Largescaledistributeddeepnetworks.InNIPS,pp.1232–1240,2012.Demmel,James,Dongarra,Jack,Ruhe,Axel,vanderVorst,Henk,andBai,Zhaojun.Templatesforthesolu-tionofalgebraiceigenvalueproblems:apracticalguide.SocietyforIndustrialandAppliedMathematics,2000.Diamantaras,KonstantinosIandKung,SunY.Principalcomponentneuralnetworks.WileyNewYork,1996.Fisher,RonaldA.Theuseofmultiplemeasurementsintaxonomicproblems.Annalsofeugenics,7(2):179–188,1936.Fisher,W.,Doddington,G.,andMarshall,GoudieK.TheDARPAspeechrecognitionresearchdatabase:Speci-cationandstatus.InProceedingsoftheDARPASpeechRecognitionWorkshop,pp.93–100,1986.Friedman,JeromeH.Regularizeddiscriminantanalysis.JournaloftheAmericanstatisticalassociation,84(405):165–175,1989.Golub,GeneHandVanLoan,CharlesF.Matrixcomputa-tions,volume3.JHUPress,2012.Goodfellow,IanJ,Warde-Farley,David,Mirza,Mehdi,Courville,Aaron,andBengio,Yoshua.Maxoutnet-works.arXivpreprintarXiv:1302.4389,2013.Halevy,Alon,Norvig,Peter,andPereira,Fernando.Theunreasonableeffectivenessofdata.IntelligentSystems,IEEE,24(2):8–12,2009.Halko,Nathan,Martinsson,Per-Gunnar,andTropp,JoelA.Findingstructurewithrandomness:Probabilisticalgorithmsforconstructingapproximatematrixdecom-positions.SIAMreview,53(2):217–288,2011.Hinton,Geoffrey,Deng,Li,Yu,Dong,Dahl,GeorgeE,Mohamed,Abdel-rahman,Jaitly,Navdeep,Senior,An-drew,Vanhoucke,Vincent,Nguyen,Patrick,Sainath,TaraN,etal.Deepneuralnetworksforacousticmod-elinginspeechrecognition:Thesharedviewsoffourresearchgroups.SignalProcessingMagazine,IEEE,29(6):82–97,2012a.Hinton,GeoffreyE,Srivastava,Nitish,Krizhevsky,Alex,Sutskever,Ilya,andSalakhutdinov,RuslanR.Improvingneuralnetworksbypreventingco-adaptationoffeaturedetectors.arXivpreprintarXiv:1207.0580,2012b.Hutchinson,Brian,Deng,Li,andYu,Dong.Adeeparchi-tecturewithbilinearmodelingofhiddenrepresentations:Applicationstophoneticrecognition.InAcoustics,SpeechandSignalProcessing(ICASSP),2012IEEEIn-ternationalConferenceon,pp.4805–4808.IEEE,2012.Jose,Cijo,Goyal,Prasoon,Aggrwal,Parv,andVarma,Manik.Localdeepkernellearningforefcientnon-linearsvmprediction.InProceedingsofthe30thInter-nationalConferenceonMachineLearning(ICML-13),pp.486–494,2013.Koren,Yehuda,Bell,Robert,andVolinsky,Chris.Ma-trixfactorizationtechniquesforrecommendersystems.Computer,42(8):30–37,2009.Krizhevsky,Alex,Sutskever,Ilya,andHinton,Geoff.Im-agenetclassicationwithdeepconvolutionalneuralnet-works.InAdvancesinNeuralInformationProcessingSystems25,pp.1106–1114,2012.LeCun,Yann,Bottou,L´eon,Bengio,Yoshua,andHaffner,Patrick.Gradient-basedlearningappliedtodocumentrecognition.ProceedingsoftheIEEE,86(11):2278–2324,1998. DiscriminativeFeaturesviaGeneralizedEigenvectors Li,Ker-Chau.Slicedinverseregressionfordimensionre-duction.JournaloftheAmericanStatisticalAssociation,86(414):316–327,1991.Livni,Roi,Lehavi,David,Schein,Sagi,Nachliely,Hila,Shalev-Shwartz,Shai,andGloberson,Amir.Vanishingcomponentanalysis.InProceedingsofthe30thInterna-tionalConferenceonMachineLearning(ICML-13),pp.597–605,2013.Mairal,Julien,Bach,Francis,Ponce,Jean,Sapiro,Guillermo,andZisserman,Andrew.Superviseddictio-narylearning.arXivpreprintarXiv:0809.3083,2008.Mika,Sebastian,Ratsch,Gunnar,Weston,Jason,Scholkopf,B,Smola,Alex,andMuller,K-R.Con-structingdescriptiveanddiscriminativenonlinearfea-tures:Rayleighcoefcientsinkernelfeaturespaces.Pat-ternAnalysisandMachineIntelligence,IEEETransac-tionson,25(5):623–628,2003.Platt,JohnC,Toutanova,Kristina,andYih,Wen-tau.Translingualdocumentrepresentationsfromdiscrimina-tiveprojections.InProceedingsofthe2010ConferenceonEmpiricalMethodsinNaturalLanguageProcessing,pp.251–261.AssociationforComputationalLinguistics,2010.Rahimi,AliandRecht,Benjamin.Randomfeaturesforlarge-scalekernelmachines.InAdvancesinneuralin-formationprocessingsystems,pp.1177–1184,2007.Saibaba,ArvindKandKitanidis,PeterK.Randomizedsquare-rootfreealgorithmsforgeneralizedhermitianeigenvalueproblems.arXivpreprintarXiv:1307.6885,2013.Vapnik,VladimirN.Statisticallearningtheory.Wiley,1998.Vershynin,Roman.Introductiontothenon-asymptoticanalysisofrandommatrices.arXivpreprintarXiv:1011.3027,2010.Wan,Li,Zeiler,Matthew,Zhang,Sixin,Cun,YannL,andFergus,Rob.Regularizationofneuralnetworksusingdropconnect.InProceedingsofthe30thInternationalConferenceonMachineLearning(ICML-13),pp.1058–1066,2013.Wold,SvanteandSjostrom,Michael.Simca:amethodforanalyzingchemicaldataintermsofsimilarityandanal-ogy.Chemometrics:theoryandapplication,52:243–282,1977.