Kernel Mean Estimation and Stein Effect

Krikamol Muandet (KRIKAMOL@TUEBINGEN.MPG.DE), Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany
Kenji Fukumizu (FUKUMIZU@ISM.AC.JP), The Institute of Statistical Mathematics, Tokyo, Japan
Bharath Sriperumbudur (BS493@STATSLAB.CAM.AC.UK), Statistical Laboratory, University of Cambridge, Cambridge, United Kingdom
Arthur Gretton (ARTHUR.GRETTON@GMAIL.COM), Gatsby Computational Neuroscience Unit, University College London, London, United Kingdom
Bernhard Schölkopf (BS@TUEBINGEN.MPG.DE), Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is an important part of many algorithms, ranging from kernel principal component analysis to Hilbert-space embeddings of distributions. Given a finite sample, the empirical average is the standard estimate of the true kernel mean. We show that this estimator can be improved due to a well-known phenomenon in statistics called Stein's phenomenon. Our theoretical analysis reveals the existence of a wide class of estimators that are better than the standard one. Focusing on a subset of this class, we propose efficient shrinkage estimators for the kernel mean. Empirical evaluations on several applications clearly demonstrate that the proposed estimators outperform the standard kernel mean estimator.

1. Introduction

This paper aims to improve the estimation of the mean function in a reproducing kernel Hilbert space (RKHS) from a finite sample. The kernel mean of a probability distribution $P$ over a measurable space $\mathcal{X}$ is defined by

$$\mu_P := \int_{\mathcal{X}} k(x, \cdot)\, \mathrm{d}P(x) \in \mathcal{H}, \qquad (1)$$

where $\mathcal{H}$ is the RKHS associated with a reproducing kernel $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$. Conditions ensuring that this expectation exists are given in Smola et al. (2007). Unfortunately, it is not practical to compute $\mu_P$ directly because the distribution $P$ is usually unknown. Instead, given an i.i.d. sample $x_1, x_2, \ldots, x_n$ from $P$, we can easily compute the empirical kernel mean by the average

$$\hat{\mu}_P := \frac{1}{n}\sum_{i=1}^{n} k(x_i, \cdot). \qquad (2)$$

The estimate $\hat{\mu}_P$ is the most commonly used estimate of the true kernel mean. Our primary interest here is to investigate whether one can improve upon this standard estimator.

The kernel mean has recently gained attention in the machine learning community, thanks to the introduction of Hilbert-space embeddings of distributions (Berlinet and Agnan, 2004; Smola et al., 2007). Representing a distribution as a mean function in the RKHS has several advantages: 1) with an appropriate choice of kernel $k$, the representation preserves all information about the distribution (Fukumizu et al., 2004; Sriperumbudur et al., 2008; 2010); 2) basic operations on the distribution can be carried out by means of inner products in the RKHS, e.g., $\mathbb{E}_P[f(x)] = \langle f, \mu_P\rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$; 3) no intermediate density estimation is required, e.g., when testing for homogeneity from finite samples. As a result, many algorithms have benefited from the kernel mean representation, namely maximum mean discrepancy (MMD) (Gretton et al., 2007), kernel dependency measures (Gretton et al., 2005), the kernel two-sample test (Gretton et al., 2012), Hilbert-space embeddings of HMMs (Song et al., 2010), and kernel Bayes' rule (Fukumizu et al., 2011). Their performance relies directly on the quality of the empirical estimate $\hat{\mu}_P$.
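For concreteness, the following minimal NumPy sketch evaluates the empirical kernel mean (2) at query points for a Gaussian RBF kernel. It is our own illustration, not part of the original paper; the function names, bandwidth, and data are illustrative choices.

```python
import numpy as np

def rbf_gram(X, Z, sigma2):
    """Gram matrix of the Gaussian RBF kernel k(x, z) = exp(-||x-z||^2 / (2*sigma2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma2))

def empirical_kernel_mean(X, Z, sigma2):
    """Evaluate the empirical kernel mean (2) at the points Z:
    mu_hat(z) = (1/n) * sum_i k(x_i, z)."""
    return rbf_gram(X, Z, sigma2).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # i.i.d. sample from an (unknown) P
Z = rng.normal(size=(10, 5))    # query points at which to evaluate mu_hat
print(empirical_kernel_mean(X, Z, sigma2=5.0))
```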
However, it is of great importance, especially for readers who are not familiar with kernel methods, to realize a more fundamental role of the kernel mean: it serves as a foundation for most kernel-based learning algorithms. For instance, nonlinear component analyses, such as kernel PCA, kernel FDA, and kernel CCA, rely heavily on mean functions and covariance operators in the RKHS (Schölkopf et al., 1998). The kernel k-means algorithm performs clustering in feature space using mean functions as the representatives of the clusters (Dhillon et al., 2004). Moreover, the kernel mean also serves as a basis in the early development of algorithms for classification and anomaly detection (Shawe-Taylor and Cristianini, 2004, chap. 5). All of those employ (2) as the estimate of the true mean function. Thus, the fact that substantial improvement can be gained when estimating (1) may in fact raise widespread suspicion about the traditional way of learning with kernels.

We show in this work that the standard estimator (2) is, in a certain sense, not optimal, i.e., there exist better estimators (more below). In addition, we propose shrinkage estimators that outperform the standard one. At first glance, it was definitely counter-intuitive and surprising to us, and will undoubtedly also be to some of our readers, that the empirical kernel mean can be improved, and, given the simplicity of the proposed estimators, that this has remained unnoticed until now. One reason may be the common belief that the estimator $\hat{\mu}_P$ already gives a good estimate of $\mu_P$ and that, as the sample size goes to infinity, the estimation error disappears (Shawe-Taylor and Cristianini, 2004). As a result, no need is felt to improve the kernel mean estimation. However, given a finite sample, substantial improvement is in fact possible, and several factors may come into play, as will be seen later in this work.

This work was partly inspired by Stein's seminal work in 1955, which showed that the maximum likelihood estimator (MLE), i.e., the standard empirical mean, of the mean of a multivariate Gaussian distribution $\mathcal{N}(\theta, \sigma^2 I)$ is inadmissible (Stein, 1955). That is, there exists an estimator that always achieves smaller total mean squared error regardless of the true $\theta$, when the dimension is at least 3. Perhaps the best known estimator of this kind is the James-Stein estimator (James and Stein, 1961). Interestingly, the James-Stein estimator is itself inadmissible, and there exists a wide class of estimators that outperform the MLE; see, e.g., Berger (1976).

However, our work differs fundamentally from Stein's seminal work, and from those along the same line, in two aspects. First, our setting is non-parametric in the sense that we do not assume any parametric form of the distribution, whereas most of the traditional works focus on specific distributions, e.g., the Gaussian distribution. Second, our setting involves a non-linear feature map into a high-dimensional space, if not an infinite-dimensional one. As a result, higher moments of the distribution may come into play. Thus, one cannot adopt Stein's setting straightforwardly. A direct generalization of the James-Stein estimator to infinite-dimensional Hilbert spaces has already been considered (Berger and Wolpert, 1983; Mandelbaum and Shepp, 1987; Privault and Réveillac, 2008). In those works, the parameter to be estimated is assumed to be the mean of a Gaussian measure on the Hilbert space from which the samples are drawn. In our case, on the other hand, the samples are drawn from $P$, not from a Gaussian distribution whose mean is $\mu_P$.

The contribution of this paper can be summarized as follows. First, we show that the standard kernel mean estimator can be improved by providing an alternative estimator that achieves smaller risk (§2). The theoretical analysis reveals the existence of a wide class of estimators that are better than the standard one. To this end, we propose in §3 a kernel mean shrinkage estimator (KMSE), which is based on a novel motivation for regularization through the notion of shrinkage. Moreover, we propose an efficient leave-one-out cross-validation procedure to select the shrinkage parameter, which is novel in the context of kernel mean estimation. Lastly, we demonstrate the benefit of the proposed estimators in several applications (§4).
2. Motivation: Shrinkage Estimators

For an arbitrary distribution $P$, denote by $\mu$ and $\hat{\mu}$ the true kernel mean and its empirical estimate (2) from the i.i.d. sample $x_1, x_2, \ldots, x_n \sim P$ (we remove the subscript for ease of notation). The most natural loss function considered in this work is $\ell(\mu, \hat{\mu}) = \|\hat{\mu} - \mu\|^2_{\mathcal{H}}$. An estimator $\hat{\mu}$ is a mapping which is measurable w.r.t. the Borel $\sigma$-algebra of $\mathcal{H}$, and it is evaluated by its risk function $R(\mu, \hat{\mu}) = \mathbb{E}_P[\ell(\mu, \hat{\mu})]$, where $\mathbb{E}_P$ indicates the expectation over the choice of the i.i.d. sample of size $n$ from $P$.

Let us consider an alternative kernel mean estimator

$$\hat{\mu}_\alpha := \alpha f + (1 - \alpha)\hat{\mu},$$

where $0 \le \alpha \le 1$ and $f \in \mathcal{H}$. It is essentially a shrinkage estimator that shrinks the standard estimator toward a function $f$ by an amount specified by $\alpha$. If $\alpha = 0$, $\hat{\mu}_\alpha$ reduces to the standard estimator $\hat{\mu}$. The following theorem asserts that the risk of the shrinkage estimator $\hat{\mu}_\alpha$ is smaller than that of the standard estimator $\hat{\mu}$ given an appropriate choice of $\alpha$, regardless of the function $f$ (more below).

Theorem 1. For all distributions $P$ and kernels $k$, there exists $\alpha > 0$ for which $R(\mu, \hat{\mu}_\alpha) < R(\mu, \hat{\mu})$.

Proof. The risk of the standard kernel mean estimator satisfies

$$\mathbb{E}\|\hat{\mu} - \mu\|^2 = \frac{1}{n}\big(\mathbb{E}[k(x,x)] - \mathbb{E}[k(x,\tilde{x})]\big) =: \Delta,$$

where $\tilde{x}$ is an independent copy of $x$. Let $\Delta_\alpha := \mathbb{E}\|\hat{\mu}_\alpha - \mu\|^2$ denote the risk of the proposed shrinkage estimator, where $\alpha$ is a non-negative shrinkage parameter. Writing $\hat{\mu}_\alpha - \mu = (\hat{\mu} - \mu) - \alpha(\hat{\mu} - f)$ and expanding gives $\Delta_\alpha = \Delta - 2\alpha\,\mathbb{E}\langle\hat{\mu}-\mu,\, \hat{\mu}-f\rangle + \alpha^2\,\mathbb{E}\|\hat{\mu}-f\|^2$. It follows from the reproducing property of $\mathcal{H}$ that $\mathbb{E}[f(x)] = \langle f, \mu\rangle$. Moreover, using $\mathbb{E}\hat{\mu} = \mu$ and the fact that $\mathbb{E}\|\hat{\mu}\|^2 = \mathbb{E}\|\hat{\mu}-\mu\|^2 + \|\mu\|^2 = \Delta + \mathbb{E}[k(x,\tilde{x})]$, we can simplify the shrinkage risk to

$$\Delta_\alpha = \alpha^2\big(\Delta + \|f - \mu\|^2\big) - 2\alpha\Delta + \Delta.$$

Thus, we have $\Delta_\alpha - \Delta = \alpha^2(\Delta + \|f-\mu\|^2) - 2\alpha\Delta$, which is non-positive when

$$\alpha \in \left[0,\; \frac{2\Delta}{\Delta + \|f - \mu\|^2}\right] \qquad (3)$$

and is minimized at $\alpha^* = \Delta/(\Delta + \|f-\mu\|^2)$. □

As we can see in (3), there is a range of $\alpha$ for which a non-positive $\Delta_\alpha - \Delta$, i.e., $R(\mu, \hat{\mu}_\alpha) \le R(\mu, \hat{\mu})$, is guaranteed. However, Theorem 1 relies on the important assumption that the true kernel mean $\mu$ of the distribution $P$ is needed to compute $\alpha^*$. In spite of this, the theorem has an important implication: the shrinkage estimator $\hat{\mu}_\alpha$ can improve upon $\hat{\mu}$ if $\alpha$ is chosen appropriately. Later, we will exploit this result to construct more practical estimators.

Remark 1. The following observations follow immediately from Theorem 1:

- The shrinkage estimator always improves upon the standard one regardless of the direction of shrinkage, as specified by $f$. In other words, there exists a wide class of kernel mean estimators that are better than the standard one.
- The value of $\alpha^*$ also depends on the choice of $f$. The further $f$ is from $\mu$, the smaller $\alpha^*$ becomes. Thus, the shrinkage gets smaller if $f$ is chosen far from the true kernel mean. This effect is akin to the James-Stein estimator.
- The improvement can be viewed as a bias-variance trade-off: the shrinkage estimator reduces variance substantially at the expense of a little bias.

Remark 1 sheds light on how one can practically construct the shrinkage estimator: we can choose $f$ arbitrarily as long as the parameter $\alpha$ is chosen appropriately. Moreover, further improvement can be gained by incorporating prior knowledge as to the location of $\mu_P$, which can be straightforwardly integrated into the framework via $f$ (Berger and Wolpert, 1983). Inspired by the James-Stein estimator, we focus on $f = 0$. We will investigate the effect of different priors $f$ in future work.
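To make Theorem 1 concrete, the following Monte Carlo sketch (our own illustration, not from the paper) estimates $\Delta$ and $\|\mu\|^2 = \mathbb{E}[k(x,\tilde{x})]$ for a standard Gaussian $P$ under an RBF kernel, then reports the optimal shrinkage parameter for $f = 0$ and the guaranteed risk reduction. All constants are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 5, 10, 20000            # n: estimator sample size, m: Monte Carlo size

def k(x, y, sigma2=float(d)):     # Gaussian RBF kernel, elementwise over rows
    return np.exp(-((x - y) ** 2).sum(-1) / (2 * sigma2))

X, Y = rng.normal(size=(m, d)), rng.normal(size=(m, d))
Ekxx = k(X, X).mean()             # E[k(x, x)]  (equals 1 for the RBF kernel)
Ekxy = k(X, Y).mean()             # E[k(x, x_tilde)] = ||mu||^2

Delta = (Ekxx - Ekxy) / n         # risk of the standard estimator (Theorem 1)
alpha_star = Delta / (Delta + Ekxy)            # optimal alpha for f = 0
gain = alpha_star**2 * (Delta + Ekxy) - 2 * alpha_star * Delta
print(f"Delta = {Delta:.5f}, alpha* = {alpha_star:.4f}, "
      f"risk change at alpha* = {gain:.6f} (negative = improvement)")
```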
3. Kernel Mean Shrinkage Estimator

In this section we give a novel formulation of the kernel mean estimator that allows us to estimate the shrinkage parameter efficiently. In the following, let $\phi: \mathcal{X}\to\mathcal{H}$ be a feature map associated with the kernel $k$, and let $\langle\cdot,\cdot\rangle$ be the inner product in the RKHS $\mathcal{H}$, so that $k(x,x') = \langle\phi(x),\phi(x')\rangle$. Unless stated otherwise, $\|\cdot\|$ denotes the RKHS norm. The kernel mean $\mu_P$ and its empirical estimate $\hat{\mu}_P$ can be obtained as minimizers of the loss functionals

$$\mathcal{E}(g) := \mathbb{E}_{x\sim P}\|\phi(x) - g\|^2, \qquad \widehat{\mathcal{E}}(g) := \frac{1}{n}\sum_{i=1}^n \|\phi(x_i) - g\|^2,$$

respectively. We call the estimator minimizing the loss functional $\widehat{\mathcal{E}}(g)$ a kernel mean estimator (KME). Note that the loss $\mathcal{E}(g)$ differs from the one considered in §2, i.e., $\ell(\mu, g) = \|\mu - g\|^2 = \|\mathbb{E}[\phi(x)] - g\|^2$. Nevertheless, we have $\ell(\mu, g) = \mathbb{E}_{x,x'}k(x,x') - 2\,\mathbb{E}_x g(x) + \|g\|^2$. Since $\mathcal{E}(g) = \mathbb{E}_x k(x,x) - 2\,\mathbb{E}_x g(x) + \|g\|^2$, the loss $\ell(\mu, g)$ differs from $\mathcal{E}(g)$ only by $\mathbb{E}_x k(x,x) - \mathbb{E}_{x,x'}k(x,x')$, which is not a function of $g$. We introduce the new form here because it gives a more tractable cross-validation computation (§3.1). In spite of this, the resulting estimators are always evaluated w.r.t. the loss in §2 (cf. §4.1).

From the formulation above, it is natural to ask whether minimizing a regularized version of $\widehat{\mathcal{E}}(g)$ gives a better estimator. On the one hand, one can argue that, unlike in classical risk minimization, we do not really need a regularizer here: the standard estimator (2) is known to be, in a certain sense, optimal and can be estimated reliably (Shawe-Taylor and Cristianini, 2004, prop. 5.2), and the original formulation of $\widehat{\mathcal{E}}(g)$ is a well-posed problem. On the other hand, since regularization may be viewed as shrinking the solution toward zero, it can actually improve the kernel mean estimation, as suggested by Theorem 1 (cf. the discussion at the end of §2). Consequently, we minimize the modified loss functional

$$\widehat{\mathcal{E}}_\lambda(g) := \widehat{\mathcal{E}}(g) + \lambda\,\Omega(\|g\|) = \frac{1}{n}\sum_{i=1}^n \|\phi(x_i) - g\|^2 + \lambda\,\Omega(\|g\|), \qquad (4)$$

where $\Omega(\cdot)$ denotes a monotonically increasing regularization functional and $\lambda$ is a non-negative regularization parameter.[1] In what follows, we refer to the shrinkage estimator $\hat{\mu}_\lambda$ minimizing $\widehat{\mathcal{E}}_\lambda(g)$ as a kernel mean shrinkage estimator (KMSE).

[1] The parameters $\alpha$ and $\lambda$ play similar roles as shrinkage parameters: they specify an amount by which the standard estimator $\hat{\mu}$ is shrunk toward $f = 0$. Thus, the terms shrinkage parameter and regularization parameter will be used interchangeably.

It follows from the representer theorem that $g$ lies in the subspace spanned by the data, i.e., $g = \sum_{j=1}^n \beta_j\,\phi(x_j)$ for some $\beta\in\mathbb{R}^n$. By choosing $\Omega(\|g\|) = \|g\|^2$, we can rewrite (4) as

$$\frac{1}{n}\sum_{i=1}^n \Big\|\phi(x_i) - \sum_{j=1}^n \beta_j\phi(x_j)\Big\|^2 + \lambda\Big\|\sum_{j=1}^n \beta_j\phi(x_j)\Big\|^2 = \beta^\top K\beta - 2\,\beta^\top K\mathbf{1}_n + \lambda\,\beta^\top K\beta + c, \qquad (5)$$

where $c$ is a constant term, $K$ is the $n\times n$ Gram matrix with $K_{ij} = k(x_i, x_j)$, and $\mathbf{1}_n = [1/n, 1/n, \ldots, 1/n]^\top$. Taking the derivative of (5) w.r.t. $\beta$ and setting it to zero yields $\beta = (1/(1+\lambda))\,\mathbf{1}_n$. Setting $\alpha = \lambda/(1+\lambda)$, the shrinkage estimate can be written as $\hat{\mu}_\alpha = (1-\alpha)\hat{\mu}$. Since $0\le\alpha\le 1$, the estimator $\hat{\mu}_\alpha$ corresponds to the shrinkage estimator discussed in §2 with $f = 0$. We call this estimator a simple kernel mean shrinkage estimator (S-KMSE).

Using the expansion $g = \sum_{j=1}^n \beta_j\phi(x_j)$, we may also consider a regularization functional written in terms of $\beta$, e.g., $\lambda\,\beta^\top\beta$. This leads to a particularly interesting kernel mean estimator. In this case, the optimal weight vector is given by $\beta = (K + \lambda I)^{-1}K\mathbf{1}_n$, and the shrinkage estimate can be written accordingly as $\hat{\mu}_\lambda = \sum_{j=1}^n\beta_j\phi(x_j) = \Phi^\top(K+\lambda I)^{-1}K\mathbf{1}_n$, where $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]^\top$. Unlike the S-KMSE, this estimator shrinks the usual estimate differently in each coordinate (cf. Theorem 2). Hence, we call it a flexible kernel mean shrinkage estimator (F-KMSE). The following theorem characterizes the F-KMSE as a shrinkage estimator.

Theorem 2. The F-KMSE can be written as $\hat{\mu}_\lambda = \sum_{i=1}^n \frac{\gamma_i}{\gamma_i + \lambda}\langle\hat{\mu}, v_i\rangle\, v_i$, where $\{(\gamma_i, v_i)\}$ are the eigenvalue/eigenvector pairs of the empirical covariance operator $\widehat{C}_{xx}$ in $\mathcal{H}$.

In words, the effect of the F-KMSE is to reduce the high-frequency components of the expansion of $\hat{\mu}$: it expands $\hat{\mu}$ in the kernel PCA basis and shrinks the coefficients of the high-order eigenfunctions; see, e.g., Rasmussen and Williams (2006, sec. 4.3). Note that the covariance operator $\widehat{C}_{xx}$ itself does not depend on $\lambda$. As we can see, the solution to the regularized problem is indeed of the form of the shrinkage estimators with $f = 0$. That is, both S-KMSE and F-KMSE shrink the standard kernel mean estimate towards zero. The difference is that the S-KMSE shrinks equally in all coordinates, whereas the F-KMSE also adapts the amount of shrinkage to the information contained in each coordinate. Moreover, the squared RKHS norm $\|\mu - \hat{\mu}\|^2$ can be decomposed as a sum of squared losses weighted by the eigenvalues $\gamma_i$ (cf. Mandelbaum and Shepp (1987, appendix)).
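The two weight vectors derived above are simple to compute from the Gram matrix. Below is a NumPy sketch of both; the function names are ours, and the code assumes a precomputed Gram matrix K and a regularization parameter lam.

```python
import numpy as np

def s_kmse_weights(n, lam):
    """S-KMSE: beta = (1/(1+lam)) * 1_n, i.e., uniform weights shrunk
    equally in all coordinates (equivalent to (1 - alpha) * mu_hat)."""
    return np.full(n, 1.0 / (n * (1.0 + lam)))

def f_kmse_weights(K, lam):
    """F-KMSE: beta = (K + lam*I)^{-1} K 1_n, which shrinks each kernel
    PCA coordinate by gamma_i / (gamma_i + lam) (Theorem 2)."""
    n = K.shape[0]
    one_n = np.full(n, 1.0 / n)
    return np.linalg.solve(K + lam * np.eye(n), K @ one_n)

# The resulting estimate at a point z is mu_hat_lambda(z) = sum_j beta_j k(x_j, z).
```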
By the same reasoning as Stein's result in the finite-dimensional case, one would suspect that the improvement of shrinkage estimators in $\mathcal{H}$ should also depend on how fast the eigenvalues of $k$ decay. That is, one would expect greater improvement if the values of $\gamma_i$ decay very slowly. For example, a Gaussian RBF kernel with a larger bandwidth gives smaller improvement than one with a smaller bandwidth. Similarly, we should expect more improvement when applying a Laplacian kernel than when using a Gaussian RBF kernel.

In some applications of kernel mean embedding, one may want to interpret the weight $\beta$ as a probability vector (Nishiyama et al., 2012). However, the weight vector $\beta$ output by our estimators is in general not normalized; in fact, all elements will be smaller than $1/n$ as a result of shrinkage. One may impose the constraint that $\beta$ sum to one and resort to quadratic programming (Song et al., 2008). Unfortunately, this approach has an undesirable sparsifying effect, which is unlikely to improve upon the standard estimator, and post-normalizing the weights often deteriorates the estimation performance.

To the best of our knowledge, no previous attempt has been made to improve the kernel mean estimation. However, we discuss some closely related works here. Instead of the loss functional $\widehat{\mathcal{E}}(g)$, Kim and Scott (2012) consider a robust loss function, such as the Huber loss, to reduce the effect of outliers. The authors consider kernel density estimators, which differ fundamentally from kernel mean estimators: they need to reduce the kernel bandwidth with increasing sample size for the estimators to be consistent. A regularized version of MMD was adopted by Danafar et al. (2013) in the context of kernel-based hypothesis testing; the resulting formulation resembles our S-KMSE. Furthermore, the F-KMSE has a similar form to the conditional mean embedding used in Grünewälder et al. (2012), which can be viewed more generally as a regression problem in an RKHS with smooth operators (Grünewälder et al., 2013).
3.1. Choosing the Shrinkage Parameter

As discussed in §2, the amount of shrinkage plays an important role in our estimators. In this work, we propose to select the shrinkage parameter by an automatic leave-one-out cross-validation.

For a given shrinkage parameter $\lambda$, consider the observation $x_i$ as a new observation by omitting it from the dataset. Denote by $\hat{\mu}^{(-i)}_\lambda = \sum_{j\neq i}\beta^{(-i)}_j\phi(x_j)$ the kernel mean estimated from the remaining data using $\lambda$ as the shrinkage parameter, so that $\beta^{(-i)}$ is the minimizer of the corresponding leave-one-out loss $\widehat{\mathcal{E}}^{(-i)}_\lambda(g)$. We measure the quality of $\hat{\mu}^{(-i)}_\lambda$ by how well it approximates $\phi(x_i)$. The overall quality of the estimate is quantified by the cross-validation score

$$\mathrm{LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \big\|\phi(x_i) - \hat{\mu}^{(-i)}_\lambda\big\|^2_{\mathcal{H}}. \qquad (6)$$

By simple algebra, it is not difficult to show that the optimal shrinkage parameter of the S-KMSE can be calculated analytically, as stated by the following theorem.

Theorem 3. Let $\rho := \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n k(x_i, x_j)$ and $\varrho := \frac{1}{n}\sum_{i=1}^n k(x_i, x_i)$. The shrinkage parameter $\lambda = \dfrac{\varrho - \rho}{(n-1)\rho + \varrho/n - \varrho}$ of the S-KMSE is the minimizer of $\mathrm{LOOCV}(\lambda)$.

On the other hand, finding the optimal $\lambda$ for the F-KMSE is more involved. Evaluating the score (6) naïvely requires one to solve for $\hat{\mu}^{(-i)}_\lambda$ explicitly for every $i$. Fortunately, we can simplify the score so that it can be evaluated efficiently, as stated in the following theorem.

Theorem 4. The LOOCV score of the F-KMSE satisfies $\mathrm{LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n (K\beta - K_i)^\top C\,(K\beta - K_i)$, where $\beta$ is the weight vector calculated from the full dataset with shrinkage parameter $\lambda$, $K_i$ is the $i$-th column of $K$, and $C = \big(K - \tfrac{1}{n}K(K+\lambda I)^{-1}K\big)^{-1} K\,\big(K - \tfrac{1}{n}K(K+\lambda I)^{-1}K\big)^{-1}$.

Proof of Theorem 4. For fixed $\lambda$ and $i$, let $\hat{\mu}^{(-i)}_\lambda$ be the leave-one-out kernel mean estimate of the F-KMSE, and let $A := (K+\lambda I)^{-1}$. We can write the deleted residual as $\xi_i := \hat{\mu}^{(-i)}_\lambda - \phi(x_i) = \hat{\mu}_\lambda - \phi(x_i) + \frac{1}{n}\sum_{j=1}^n\sum_{l=1}^n A_{jl}\,\langle\phi(x_l),\, \hat{\mu}^{(-i)}_\lambda - \phi(x_i)\rangle\,\phi(x_j)$. Since $\xi_i$ lies in the subspace spanned by the sample $\phi(x_1),\ldots,\phi(x_n)$, we have $\xi_i = \sum_{k=1}^n \zeta_k\phi(x_k)$ for some $\zeta\in\mathbb{R}^n$. Substituting $\xi_i$ back yields $\sum_{k=1}^n\zeta_k\phi(x_k) = \hat{\mu}_\lambda - \phi(x_i) + \frac{1}{n}\sum_{j=1}^n\{AK\zeta\}_j\,\phi(x_j)$. Taking the inner product on both sides with the sample $\phi(x_1),\ldots,\phi(x_n)$ and solving for $\zeta$ gives $\zeta = \big(K - \frac{1}{n}KAK\big)^{-1}(K\beta - K_i)$. Consequently, the leave-one-out score of the sample $x_i$ can be computed as $\|\xi_i\|^2 = \zeta^\top K\zeta = (K\beta - K_i)^\top C\,(K\beta - K_i)$. Averaging $\|\xi_i\|^2$ over all samples gives $\mathrm{LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n(K\beta - K_i)^\top C\,(K\beta - K_i)$, as required. □

It is interesting to see that the leave-one-out cross-validation score in Theorem 4 depends only on the non-leave-one-out solution $\beta$, which can be obtained as a by-product of the algorithm.

Computational complexity. The S-KMSE requires $O(n^2)$ operations to select the shrinkage parameter. For the F-KMSE, there are two steps in the cross-validation. First, we need to compute $(K+\lambda I)^{-1}$ repeatedly for different values of $\lambda$. Assume we know the eigendecomposition $K = UDU^\top$, where $D$ is diagonal with $d_{ii}\ge 0$ and $UU^\top = I$. It follows that $(K+\lambda I)^{-1} = U(D+\lambda I)^{-1}U^\top$, so solving for $\beta$ takes $O(n^2)$ operations. Since the eigendecomposition itself requires $O(n^3)$ operations and is computed only once, finding $\beta$ for many values of $\lambda$ is essentially free; a low-rank approximation can also be adopted to reduce the computational cost further. Second, we need to compute the cross-validation score (6). As shown in Theorem 4, we can compute it using only the $\beta$ obtained from the previous step. The calculation of $C$ can be simplified further via the eigendecomposition of $K$ as $C = U\big(D - \tfrac{1}{n}D(D+\lambda I)^{-1}D\big)^{-1} D\,\big(D - \tfrac{1}{n}D(D+\lambda I)^{-1}D\big)^{-1}U^\top$. Since this involves only inverses of diagonal matrices, the inversion can be evaluated in $O(n)$ operations. The overall computational complexity of the cross-validation is thus only $O(n^2)$, as opposed to the naïve approach, which requires $O(n^4)$ operations. When performed as a by-product of the algorithm, the computational cost of the cross-validation procedure becomes negligible as the dataset grows. In practice, we use the fminsearch and fminbnd routines of the MATLAB optimization toolbox to find the best shrinkage parameter.
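Both selection rules can be sketched directly from Theorems 3 and 4. The NumPy sketch below follows our reconstruction of the formulas above (the source is badly garbled at this point, so treat it as a sketch rather than a definitive implementation); helper names are ours, and the F-KMSE score inverts $K - \frac{1}{n}KAK$ directly instead of using the eigendecomposition shortcut.

```python
import numpy as np

def s_kmse_lambda(K):
    """Closed-form LOOCV-optimal lambda for S-KMSE (Theorem 3)."""
    n = K.shape[0]
    rho = K.mean()                      # (1/n^2) sum_ij k(x_i, x_j)
    varrho = np.trace(K) / n            # (1/n) sum_i k(x_i, x_i)
    return (varrho - rho) / ((n - 1) * rho + varrho / n - varrho)

def f_kmse_loocv(K, lam):
    """LOOCV(lambda) for F-KMSE via Theorem 4 (assumes K is non-singular)."""
    n = K.shape[0]
    one_n = np.full(n, 1.0 / n)
    A = np.linalg.inv(K + lam * np.eye(n))
    beta = A @ (K @ one_n)              # full-data F-KMSE weights
    M = K - (K @ A @ K) / n             # K - (1/n) K (K + lam I)^{-1} K
    R = (K @ beta)[:, None] - K         # column i holds K beta - K_i
    Z = np.linalg.solve(M, R)           # zeta for every i at once
    return np.mean(np.einsum('ij,ij->j', Z, K @ Z))   # (1/n) sum_i zeta_i^T K zeta_i
```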
3.2. Covariance Operators

The covariance operator from $\mathcal{H}_X$ to $\mathcal{H}_Y$ can be viewed as a mean function in the product space $\mathcal{H}_X\otimes\mathcal{H}_Y$. Hence, we can also construct a shrinkage estimator of the covariance operator in an RKHS. Let $(\mathcal{H}_X, k_X)$ and $(\mathcal{H}_Y, k_Y)$ be RKHSs of functions on the measurable spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively, with p.d. kernels $k_X$ and $k_Y$ (with feature maps $\phi$ and $\psi$). We consider a random vector $(X, Y): \Omega\to\mathcal{X}\times\mathcal{Y}$ with distribution $P_{XY}$, and with $P_X$ and $P_Y$ as marginal distributions. Under some conditions, there exists a unique cross-covariance operator $\Sigma_{YX}: \mathcal{H}_X\to\mathcal{H}_Y$ such that

$$\langle g, \Sigma_{YX} f\rangle_{\mathcal{H}_Y} = \mathbb{E}_{XY}\big[(f(X) - \mathbb{E}_X[f(X)])\,(g(Y) - \mathbb{E}_Y[g(Y)])\big] = \mathrm{Cov}(f(X), g(Y))$$

holds for all $f\in\mathcal{H}_X$ and $g\in\mathcal{H}_Y$ (Fukumizu et al., 2004). If $X$ equals $Y$, we get the self-adjoint operator $\Sigma_{XX}$, called the covariance operator. Given an i.i.d. sample $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ from $P_{XY}$, the empirical cross-covariance operator is $\widehat{\Sigma}_{YX} := \frac{1}{n}\sum_{i=1}^n\phi(x_i)\otimes\psi(y_i) - \hat{\mu}_X\otimes\hat{\mu}_Y$, where $\hat{\mu}_X = \frac{1}{n}\sum_i\phi(x_i)$ and $\hat{\mu}_Y = \frac{1}{n}\sum_i\psi(y_i)$. Let $\tilde{\phi}$ and $\tilde{\psi}$ be the centered versions of the feature maps $\phi$ and $\psi$, respectively. Then the operator can be rewritten as $\widehat{\Sigma}_{YX} := \frac{1}{n}\sum_{i=1}^n\tilde{\phi}(x_i)\otimes\tilde{\psi}(y_i) \in \mathcal{H}_X\otimes\mathcal{H}_Y$. It follows from the inner-product property of the product space that

$$\langle\tilde{\phi}(x)\otimes\tilde{\psi}(y),\, \tilde{\phi}(x')\otimes\tilde{\psi}(y')\rangle_{\mathcal{H}_X\otimes\mathcal{H}_Y} = \langle\tilde{\phi}(x), \tilde{\phi}(x')\rangle_{\mathcal{H}_X}\,\langle\tilde{\psi}(y), \tilde{\psi}(y')\rangle_{\mathcal{H}_Y} = \tilde{k}_X(x, x')\,\tilde{k}_Y(y, y').$$

Thus, we can obtain shrinkage estimators for the covariance operator by plugging the kernel $k\big((x,y),(x',y')\big) = \tilde{k}_X(x,x')\,\tilde{k}_Y(y,y')$ into our KMSEs. We will call this estimator a covariance-operator shrinkage estimator (COSE). The same trick can easily be generalized to tensors of higher order, which have been used previously, for example, in Song et al. (2011).
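In Gram-matrix form, the product kernel above is just the Hadamard (elementwise) product of the two centered Gram matrices. A minimal sketch, reusing the f_kmse_weights helper from the earlier sketch; the function names are ours, and the standard centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is an assumption about how the centered feature maps are realized:

```python
import numpy as np

def centered_gram(K):
    """Gram matrix of the centered feature map: H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H

def cose_gram(KX, KY):
    """Gram matrix of k((x,y), (x',y')) = k~_X(x,x') * k~_Y(y,y')."""
    return centered_gram(KX) * centered_gram(KY)   # elementwise product

# Usage: K = cose_gram(KX, KY); beta = f_kmse_weights(K, lam). The weights
# beta then shrink the rank-one terms phi~(x_i) (x) psi~(y_i) that make up
# the empirical covariance operator.
```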
4. Experiments

We focus on the comparison between our shrinkage estimators and the standard estimator of the kernel mean, using both synthetic and real-world datasets.

4.1. Synthetic Data

Given the true data-generating distribution $P$, we evaluate the different estimators using the loss function $\ell(\beta) := \big\|\sum_{i=1}^n \beta_i\, k(x_i, \cdot) - \mathbb{E}_P[k(x, \cdot)]\big\|^2_{\mathcal{H}}$, where $\beta$ is the weight vector associated with the estimator. To allow for an exact calculation of $\ell(\beta)$, we consider the case when $P$ is a mixture-of-Gaussians distribution and $k$ is one of the following kernel functions: 1) the linear kernel $k(x, x') = x^\top x'$; 2) the polynomial degree-2 kernel $k(x, x') = (x^\top x' + 1)^2$; 3) the polynomial degree-3 kernel $k(x, x') = (x^\top x' + 1)^3$; and 4) the Gaussian RBF kernel $k(x, x') = \exp\big(-\|x - x'\|^2/(2\sigma^2)\big)$. We refer to these as LIN, POLY2, POLY3, and RBF, respectively.

Experimental protocol. Data are generated from a $d$-dimensional mixture of Gaussians:

$$x \sim \sum_{i=1}^{4}\pi_i\,\mathcal{N}(\theta_i, \Sigma_i) + \varepsilon, \qquad \theta_{ij}\sim\mathcal{U}(-10, 10), \quad \Sigma_i\sim\mathcal{W}(2 I_d, 7), \quad \varepsilon\sim\mathcal{N}(0,\, 0.2\, I_d),$$

where $\mathcal{U}(a, b)$ and $\mathcal{W}(\Sigma_0, \mathrm{df})$ denote the uniform and Wishart distributions, respectively. We set $\pi = [0.05, 0.3, 0.4, 0.25]$. The choice of parameters here is quite arbitrary; we have experimented with various settings, and the results are similar to those presented here. For the Gaussian RBF kernel, we set the bandwidth parameter using the median heuristic, i.e., $\sigma^2 = \mathrm{median}\{\|x_i - x_j\|^2\}$ throughout.

[Figure 1: The average loss of the KME (left), S-KMSE (middle), and F-KMSE (right) estimators with different values of the shrinkage parameter, for the kernels (a) LIN, (b) POLY2, (c) POLY3, and (d) RBF. The experiments are repeated over 30 different distributions with n = 10 and d = 30.]

Figure 1 shows the average loss of the different estimators, for the different kernels, as we increase the value of the shrinkage parameter. Here we scale the shrinkage parameter by the minimum non-zero eigenvalue $\gamma_0$ of the kernel matrix $K$. In general, we find that S-KMSE and F-KMSE tend to outperform KME. However, as $\lambda$ becomes large, there are cases where shrinkage deteriorates the estimation performance, e.g., for the LIN kernel and some outliers in the figures. This suggests that it is very important to choose the parameter $\lambda$ appropriately (cf. the discussion in §2).

[Figure 2: The average loss over 30 different distributions of KME, S-KMSE, and F-KMSE with varying sample size (n) and dimension (d), for the LIN, POLY2, POLY3, and RBF kernels. The shrinkage parameter is chosen by LOOCV.]

Similarly, Figure 2 depicts the average loss as we vary the sample size and the dimension of the data. In this case, the shrinkage parameter is chosen by the proposed leave-one-out cross-validation score. As we can see, both S-KMSE and F-KMSE outperform the standard KME, with S-KMSE performing slightly better than F-KMSE. Moreover, the improvement is more substantial in the "large d, small n" regime. In the worst cases, S-KMSE and F-KMSE perform as well as KME. Lastly, it is instructive to note that the improvement varies with the choice of kernel $k$. Briefly, the choice of kernel reflects the dimensionality of the feature space $\mathcal{H}$: one would expect more improvement in a high-dimensional feature space (e.g., with the RBF kernel) than in a low-dimensional one (e.g., with the linear kernel); cf. the discussion at the end of §3. This phenomenon can be observed in both Figures 1 and 2.
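For reference, the following sketch generates data according to the protocol above and computes the median-heuristic bandwidth. The mixture parameters follow the stated protocol; the helper names, the singular-Wishart sampling route, and the seed are our own choices.

```python
import numpy as np

def sample_mixture(n, d, rng):
    """Draw n points from the 4-component Gaussian mixture of the protocol."""
    pi = np.array([0.05, 0.30, 0.40, 0.25])
    theta = rng.uniform(-10, 10, size=(4, d))               # theta_ij ~ U(-10, 10)
    # Sigma_i ~ W(2 I_d, 7), drawn as A_i^T A_i with 7 rows a_k ~ N(0, 2 I_d);
    # for d > 7 these covariance draws are singular, as in the protocol.
    A = rng.normal(scale=np.sqrt(2.0), size=(4, 7, d))
    Sigma = np.einsum('ikd,ike->ide', A, A)
    z = rng.choice(4, size=n, p=pi)
    X = np.stack([rng.multivariate_normal(theta[c], Sigma[c]) for c in z])
    return X + rng.normal(scale=np.sqrt(0.2), size=(n, d))  # eps ~ N(0, 0.2 I_d)

def median_heuristic(X):
    """Median-heuristic bandwidth: sigma^2 = median_{i<j} ||x_i - x_j||^2."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.median(sq[np.triu_indices_from(sq, k=1)])

rng = np.random.default_rng(0)
X = sample_mixture(n=10, d=30, rng=rng)
print(X.shape, median_heuristic(X))
```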
Table 1. Average negative log-likelihood of the model Q on test points over 10 randomizations. For each kernel, the three values are KME / S-KMSE / F-KMSE; in the original, boldface marked results whose difference from the KME baseline is statistically significant.

Dataset    | LIN                         | POLY2                          | POLY3                         | RBF
ionosphere | 33.2440 / 33.0325 / 33.1436 | 53.1266 / 53.7067 / 50.8695    | 51.6800 / 49.9149 / 47.4461   | 40.8961 / 40.5578 / 39.6804
sonar      | 72.6630 / 72.8770 / 72.5015 | 120.3454 / 108.8246 / 109.9980 | 102.4499 / 90.3920 / 91.1547  | 71.3048 / 70.5721 / 70.5830
australian | 18.3703 / 18.3341 / 18.3719 | 18.5928 / 18.6028 / 18.4987    | 41.1563 / 34.4303 / 34.5460   | 17.5138 / 17.5637 / 17.4026
specft     | 56.6138 / 55.7374 / 55.8667 | 67.3901 / 65.9662 / 65.2056    | 63.9273 / 63.5571 / 62.1480   | 57.5569 / 56.1386 / 55.5808
wdbc       | 30.9778 / 30.9266 / 30.4400 | 93.0541 / 91.5803 / 87.5265    | 58.8235 / 54.1237 / 50.3911   | 30.8227 / 30.5968 / 30.2646
wine       | 15.9225 / 15.8850 / 16.0431 | 24.2841 / 24.1325 / 23.5163    | 35.2069 / 32.9465 / 32.4702   | 17.1523 / 16.9177 / 16.6312
satimage   | 19.6353 / 19.8721 / 19.7943 | 149.5986 / 143.2277 / 146.0648 | 52.7973 / 57.2482 / 45.8946   | 20.3306 / 20.5020 / 20.2226
segment    | 22.9131 / 22.8219 / 22.0696 | 61.2712 / 59.4387 / 54.8621    | 38.7226 / 38.6226 / 38.4217   | 17.6801 / 16.4149 / 15.6814
vehicle    | 16.4145 / 16.2888 / 16.3210 | 83.1597 / 79.7248 / 79.6679    | 70.4340 / 63.4322 / 48.0177   | 15.9256 / 15.8331 / 15.6516
svmguide2  | 27.1514 / 27.0644 / 27.1144 | 30.3065 / 30.2290 / 29.9875    | 37.0427 / 36.7854 / 35.8157   | 27.3930 / 27.2517 / 27.1815
vowel      | 12.4227 / 12.4219 / 12.4264 | 32.1389 / 28.0474 / 29.3492    | 25.8728 / 24.0684 / 23.9747   | 12.3976 / 12.3823 / 12.3677
housing    | 15.5249 / 15.1618 / 15.3176 | 39.9582 / 37.1360 / 32.1028    | 50.8481 / 49.0884 / 35.1366   | 14.5576 / 14.3810 / 13.9379
bodyfat    | 17.6426 / 17.0419 / 17.2152 | 44.3295 / 43.7959 / 42.3331    | 27.4339 / 25.6530 / 24.7955   | 16.2725 / 15.9170 / 15.8665
abalone    |  4.3348 /  4.3274 /  4.3187 | 14.9166 / 14.4041 / 11.4431    | 20.6071 / 23.2487 / 23.6291   |  4.6928 /  4.6056 /  4.6017
glass      | 10.4078 / 10.4451 / 10.4067 | 33.3480 / 31.6110 / 30.5075    | 45.0801 / 34.9608 / 25.5677   |  8.6167 /  8.4992 /  8.2469

4.2. Real Data

We consider three benchmark applications: density estimation via kernel mean matching (Song et al., 2008), kernel PCA using shrinkage mean and covariance operators (Schölkopf et al., 1998), and discriminative learning on distributions (Muandet and Schölkopf, 2013; Muandet et al., 2012). For the first two tasks we employ 15 datasets from the UCI repositories. We use only real-valued features, each of which is normalized to have zero mean and unit variance.

Density estimation. We perform density estimation via kernel mean matching (Song et al., 2008). That is, we fit a density $Q = \sum_{j=1}^m \pi_j\,\mathcal{N}(\theta_j, \sigma^2_j I)$ to each dataset by minimizing $\|\hat{\mu} - \mu_Q\|^2_{\mathcal{H}}$ subject to $\sum_{j=1}^m \pi_j = 1$. The kernel mean $\hat{\mu}$ is obtained from the samples using the different estimators, whereas $\mu_Q$ is the kernel mean embedding of the density $Q$. Unlike the experiments in Song et al. (2008), our goal is to compare different estimators of $\mu_P$, where $P$ is the true data distribution; that is, we replace $\hat{\mu}$ with a version obtained via shrinkage. A better estimate of $\mu_P$ should lead to better density estimation, as measured by the negative log-likelihood of $Q$ on the test set. We use 30% of each dataset as a test set and set $m = 10$ for each dataset. The model is initialized by running 50 random initializations of the k-means algorithm and returning the best. We repeat the experiments 10 times and perform the paired sign test on the results at the 5% significance level.[2]

[2] The paired sign test is a nonparametric test that can be used to examine whether two paired samples have the same distribution. In our case, we compare S-KMSE and F-KMSE against KME.
The average negative log-likelihood of the model $Q$, optimized via the different estimators, is reported in Table 1. Clearly, both S-KMSE and F-KMSE consistently achieve smaller negative log-likelihood than KME. There are, however, a few cases in which KME outperforms the proposed estimators, especially when the dataset is relatively large, e.g., satimage and abalone. We suspect that in those cases the standard KME already provides an accurate estimate of the kernel mean, so more effort is required in optimizing the shrinkage parameter to obtain a better estimate. Moreover, the improvement across different kernels is consistent with the results on the synthetic datasets.

Kernel PCA. In this experiment, we perform KPCA using different estimates of the mean and covariance operators. We compare the reconstruction error $E_{\mathrm{proj}}(z) = \|\phi(z) - \mathcal{P}\phi(z)\|^2$ on test samples, where $\mathcal{P}$ is the projection constructed from the first 20 principal components. We use a Gaussian RBF kernel for all datasets. We compare five different scenarios: 1) standard KPCA; 2) shrinkage centering with S-KMSE; 3) shrinkage centering with F-KMSE; 4) KPCA with S-COSE; and 5) KPCA with F-COSE. To perform KPCA on the shrinkage covariance operator, we solve the generalized eigenvalue problem $K_c B K_c V = K_c V D$, where $B = \mathrm{diag}(\beta)$ and $K_c$ is the centered Gram matrix. The weight vector $\beta$ is obtained from the shrinkage estimators using the kernel matrix $K_c \circ K_c$, where $\circ$ denotes the Hadamard product. We use 30% of each dataset as a test set.
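The generalized eigenvalue problem above can be sketched as follows. This is our own reading of the procedure, not the authors' code: the F-KMSE weights are computed on the Hadamard product $K_c \circ K_c$, and a small jitter is added on the right-hand side so the symmetric solver stays well-posed on the singular centered Gram matrix.

```python
import numpy as np
from scipy.linalg import eigh

def shrinkage_kpca(Kc, lam, n_components=20, jitter=1e-9):
    """KPCA directions from the shrinkage covariance operator (a sketch)."""
    n = Kc.shape[0]
    KH = Kc * Kc                                   # Hadamard product Kc o Kc
    beta = np.linalg.solve(KH + lam * np.eye(n),   # F-KMSE weights on Kc o Kc
                           KH @ np.full(n, 1.0 / n))
    L = Kc @ np.diag(beta) @ Kc                    # left operator  Kc B Kc
    R = Kc + jitter * np.eye(n)                    # right operator Kc (made PD)
    w, V = eigh(L, R)                              # solves L v = w R v
    order = np.argsort(w)[::-1][:n_components]     # keep leading components
    return w[order], V[:, order]
```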
, 2012 ).Allhyper-parametersarechosenby10-foldcross-validation.Forourunsupervisedproblem,werepeattheexperimentsusingseveralparametersettingsandreportthebestresults.Table 2 reportstheclassicationaccuracyofSMMandtheareaunderROCcurve(AUC)ofOCSMMusingdifferentTable2.TheclassicationaccuracyofSMMandtheareaunderROCcurve(AUC)ofOCSMMusingdifferentkernelmeanesti-matorstoconstructthekernelondistributions. Estimator Linear Non-linear SMMOCSMM SMMOCSMM KME 0.54320.6955 0.60170.9085 S-KMSE 0.55210.6970 0.63030.9105 F-KMSE 0.56100.6970 0.65220.9095 kernelmeanestimators.Bothshrinkageestimatorsconsis-tentlyleadtobetterperformanceonbothSMMandOC-SMMwhencomparedtoKME.Tosummarize,wendsufcientevidencetoconcludethatbothS-KMSEandF-KMSEoutperformsthestandardKME.TheperformanceofS-KMSEandF-KMSEisverycompetitive.Thedifferencedependsonthedatasetandthekernelfunction.5.ConclusionsToconclude,weshowthatthecommonlyusedkernelmeanestimatorcanbeimproved.Ourtheoreticalresultsuggeststhatthereexistsawideclassofkernelmeanestimatorsthatarebetterthanthestandardone.Todemonstratethis,wefocusontwoefcientshrinkageestimators,namely,sim-pleandexiblekernelmeanshrinkageestimators.Empir-icalstudyclearlyshowsthattheproposedestimatorsout-performthestandardoneinvariousscenarios.Mostim-portantly,theshrinkageestimatesnotonlyprovidemoreaccurateestimation,butalsoleadtosuperiorperformanceonreal-worldapplications.AcknowledgmentsTheauthorswishtothankDavidHoggandRossFedelyforread-ingtherstdraftandanonymousreviewerswhogavevaluablesuggestionthathashelpedtoimprovethemanuscript. KernelMeanEstimationandSteinEffect ReferencesJ.BergerandR.Wolpert.Estimatingthemeanfunctionofagaus-sianprocessandthesteineffect.JournalofMultivariateAnal-ysis,13(3):401–424,1983.J.O.Berger.Admissibleminimaxestimationofamultivariatenormalmeanwitharbitraryquadraticloss.AnnalsofStatistics,4(1):223–226,1976.A.BerlinetandT.C.Agnan.ReproducingKernelHilbertSpacesinProbabilityandStatistics.KluwerAcademicPublishers,2004.S.Danafar,P.M.V.Rancoita,T.Glasmachers,K.Whittingstall,andJ.Schmidhuber.Testinghypothesesbyregularizedmaxi-mummeandiscrepancy.2013.I.S.Dhillon,Y.Guan,andB.Kulis.Kernelk-means:spectralclusteringandnormalizedcuts.InProceedingsofthe10thACMSIGKDDInternationalConferenceonKnowledgeDis-coveryandDataMining(KDD),pages551–556,NewYork,NY,USA,2004.K.Fukumizu,F.R.Bach,andM.I.Jordan.Dimensionalityre-ductionforsupervisedlearningwithreproducingkernelHilbertspaces.JournalofMachineLearningResearch,5:73–99,2004.K.Fukumizu,L.Song,andA.Gretton.KernelBayes'rule.InAdvancesinNeuralInformationProcessingSystems(NIPS),pages1737–1745.2011.A.Gretton,R.Herbrich,A.Smola,B.Sch¨olkopf,andA.Hyv¨arinen.Kernelmethodsformeasuringindependence.JournalofMachineLearningResearch,6:2075–2129,2005.A.Gretton,K.M.Borgwardt,M.Rasch,B.Sch¨olkopf,andA.J.Smola.Akernelmethodforthetwo-sample-problem.InAdvancesinNeuralInformationProcessingSystems(NIPS),2007.A.Gretton,K.M.Borgwardt,M.J.Rasch,B.Sch¨olkopf,andA.Smola.Akerneltwo-sampletest.JournalofMachineLearningResearch,13:723–773,2012.S.Gr¨unew¨alder,G.Lever,A.Gretton,L.Baldassarre,S.Patter-son,andM.Pontil.Conditionalmeanembeddingsasregres-sors.InProceedingsofthe29thInternationalConferenceonMachineLearning(ICML),2012.S.Gr¨unew¨alder,A.Gretton,andJ.Shawe-Taylor.Smoothoper-ators.InProceedingsofthe30thInternationalConferenceonMachineLearning(ICML),2013.W.JamesandJ.Stein.Estimationwithquadraticloss.InPro-ceedingsoftheThirdBerkeleySymposiumonMathematicalStatisticsandProbability,pages361–379.UniversityofCali-forniaPress,1961.J.KimandC.D.Scott.Robustkerneldensityestimation.JournalofMa
A. Mandelbaum and L. A. Shepp. Admissibility as a touchstone. Annals of Statistics, 15(1):252-268, 1987.

K. Muandet and B. Schölkopf. One-class support measure machines for group anomaly detection. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2013.

K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems (NIPS), pages 10-18, 2012.

Y. Nishiyama, A. Boularias, A. Gretton, and K. Fukumizu. Hilbert space embeddings of POMDPs. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI), pages 644-653, 2012.

N. Privault and A. Réveillac. Stein estimation for the drift of Gaussian processes using the Malliavin calculus. Annals of Statistics, 36(5):2531-2550, 2008.

C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, July 1998.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.

A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT), pages 13-31. Springer-Verlag, 2007.

L. Song, X. Zhang, A. Smola, A. Gretton, and B. Schölkopf. Tailoring density estimation via reproducing kernel moment matching. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 992-999, 2008.

L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

L. Song, A. P. Parikh, and E. P. Xing. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems (NIPS), pages 2708-2716, 2011.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In COLT, 2008.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517-1561, 2010.

C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197-206. University of California Press, 1955.
