Kernel Mean Estimation and Stein Effect

Krikamol Muandet (KRIKAMOL@TUEBINGEN.MPG.DE)
Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany

Kenji Fukumizu (FUKUMIZU@ISM.AC.JP)
The Institute of Statistical Mathematics, Tokyo, Japan

Bharath Sriperumbudur (BS493@STATSLAB.CAM.AC.UK)
Statistical Laboratory, University of Cambridge, Cambridge, United Kingdom

Arthur Gretton (ARTHUR.GRETTON@GMAIL.COM)
Gatsby Computational Neuroscience Unit, University College London, London, United Kingdom

Bernhard Schölkopf (BS@TUEBINGEN.MPG.DE)
Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is an important part of many algorithms ranging from kernel principal component analysis to Hilbert-space embedding of distributions. Given a finite sample, an empirical average is the standard estimate for the true kernel mean. We show that this estimator can be improved due to a well-known phenomenon in statistics called Stein's phenomenon. Our theoretical analysis reveals the existence of a wide class of estimators that are better than the standard one. Focusing on a subset of this class, we propose efficient shrinkage estimators for the kernel mean. Empirical evaluations on several applications clearly demonstrate that the proposed estimators outperform the standard kernel mean estimator.

1. Introduction

This paper aims to improve the estimation of the mean function in a reproducing kernel Hilbert space (RKHS) from a finite sample. A kernel mean of a probability distribution P over a measurable space X is defined by

    μ_P := ∫_X k(x, ·) dP(x) ∈ H,    (1)

where H is an RKHS associated with a reproducing kernel k : X × X → R. Conditions ensuring that this expectation exists are given in Smola et al. (2007). Unfortunately, it is not practical to compute μ_P directly because the distribution P is usually unknown. Instead, given an i.i.d. sample x_1, x_2, ..., x_n from P, we can easily compute the empirical kernel mean by the average

    μ̂_P := (1/n) Σ_{i=1}^n k(x_i, ·).    (2)

The estimate μ̂_P is the most commonly used estimate of the true kernel mean. Our primary interest here is to investigate whether one can improve upon this standard estimator.

The kernel mean has recently gained attention in the machine learning community, thanks to the introduction of Hilbert space embedding for distributions (Berlinet and Agnan, 2004; Smola et al., 2007). Representing the distribution as a mean function in the RKHS has several advantages: 1) the representation with an appropriate choice of kernel k has been shown to preserve all information about the distribution (Fukumizu et al., 2004; Sriperumbudur et al., 2008; 2010); 2) basic operations on the distribution can be carried out by means of inner products in the RKHS, e.g., E_P[f(x)] = ⟨f, μ_P⟩_H for all f ∈ H; 3) no intermediate density estimation is required, e.g., when testing for homogeneity from finite samples. As a result, many algorithms have benefited from the kernel mean representation, namely, maximum mean discrepancy (MMD) (Gretton et al., 2007), kernel dependency measure (Gretton et al., 2005), kernel two-sample test (Gretton et al., 2012), Hilbert space embedding of HMMs (Song et al., 2010), and kernel Bayes' rule (Fukumizu et al., 2011). Their performances rely directly on the quality of the empirical estimate μ̂_P.
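As a point of reference for what follows, the standard estimator (2) is determined entirely by uniform weights 1/n over the sample, so its evaluations and RKHS distances reduce to Gram-matrix operations. The following minimal sketch (our illustration, not from the paper; NumPy, with an RBF kernel as a stand-in) evaluates μ̂_P at new points and computes the squared RKHS distance between two empirical kernel means via the reproducing property.

import numpy as np

def rbf_gram(X, Z, sigma2):
    # Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 * sigma2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma2))

def empirical_mean_eval(X, Z, sigma2):
    # mu_hat(z) = (1/n) sum_i k(x_i, z): the estimator (2) evaluated at the points Z
    return rbf_gram(X, Z, sigma2).mean(axis=0)

def squared_rkhs_distance(X, Y, sigma2):
    # ||mu_hat_X - mu_hat_Y||_H^2, expanded via the reproducing property
    return (rbf_gram(X, X, sigma2).mean()
            - 2 * rbf_gram(X, Y, sigma2).mean()
            + rbf_gram(Y, Y, sigma2).mean())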
However, it is of great importance, especially for readers who are not familiar with kernel methods, to realize a more fundamental role of the kernel mean: it serves as a foundation for most kernel-based learning algorithms. For instance, nonlinear component analyses, such as kernel PCA, kernel FDA, and kernel CCA, rely heavily on mean functions and covariance operators in RKHS (Schölkopf et al., 1998). The kernel k-means algorithm performs clustering in feature space using mean functions as the representatives of the clusters (Dhillon et al., 2004). Moreover, the kernel mean also served as a basis in the early development of algorithms for classification and anomaly detection (Shawe-Taylor and Cristianini, 2004, chap. 5). All of these employ (2) as the estimate of the true mean function. Thus, the fact that substantial improvement can be gained when estimating (1) may in fact raise widespread suspicion about the traditional way of learning with kernels.

We show in this work that the standard estimator (2) is, in a certain sense, not optimal, i.e., there exist better estimators (more below). In addition, we propose shrinkage estimators that outperform the standard one. At first glance, it was definitely counter-intuitive and surprising for us, and will undoubtedly also be for some of our readers, that the empirical kernel mean could be improved, and, given the simplicity of the proposed estimators, that this has remained unnoticed until now. One of the reasons may be a common belief that the estimator μ̂_P already gives a good estimate of μ_P and that, as the sample size goes to infinity, the estimation error disappears (Shawe-Taylor and Cristianini, 2004). As a result, no need is felt to improve the kernel mean estimation. However, given a finite sample, substantial improvement is in fact possible, and several factors may come into play, as will be seen later in this work.

This work was partly inspired by Stein's seminal work in 1955, which showed that the maximum likelihood estimator (MLE), i.e., the standard empirical mean, of the mean of a multivariate Gaussian distribution N(θ, σ²I) is inadmissible (Stein, 1955). That is, there exists an estimator that always achieves smaller total mean squared error regardless of the true θ, when the dimension is at least 3. Perhaps the best known estimator of this kind is the James-Stein estimator (James and Stein, 1961). Interestingly, the James-Stein estimator is itself inadmissible, and there exists a wide class of estimators that outperform the MLE; see, e.g., Berger (1976).
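To make Stein's phenomenon concrete, the following small simulation (our illustration, not part of the paper) compares the Monte Carlo risk of the MLE with that of the James-Stein estimator δ(x) = (1 − (d − 2)σ²/‖x‖²)x for a single observation x ~ N(θ, σ²I) with d ≥ 3; the James-Stein risk comes out smaller for any fixed θ.

import numpy as np

rng = np.random.default_rng(0)
d, sigma2, trials = 10, 1.0, 100_000
theta = rng.normal(size=d)                                   # arbitrary true mean

x = theta + np.sqrt(sigma2) * rng.normal(size=(trials, d))   # x ~ N(theta, sigma2 * I)
shrink = 1.0 - (d - 2) * sigma2 / (x ** 2).sum(axis=1, keepdims=True)

risk_mle = ((x - theta) ** 2).sum(axis=1).mean()             # MLE: estimate is x itself
risk_js = ((shrink * x - theta) ** 2).sum(axis=1).mean()     # James-Stein estimate
print(risk_mle, risk_js)  # James-Stein risk is smaller for every theta when d >= 3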
However, our work differs fundamentally from Stein's seminal work and those along this line in two aspects. First, our setting is non-parametric in the sense that we do not assume any parametric form of the distribution, whereas most traditional works focus on some specific distributions, e.g., the Gaussian distribution. Second, our setting involves a non-linear feature map into a high-dimensional, if not infinite-dimensional, space. As a result, higher moments of the distribution may come into play. Thus, one cannot adopt Stein's setting straightforwardly. A direct generalization of the James-Stein estimator to an infinite-dimensional Hilbert space has already been considered (Berger and Wolpert, 1983; Mandelbaum and Shepp, 1987; Privault and Réveillac, 2008). In those works, the parameter to be estimated is assumed to be the mean of a Gaussian measure on the Hilbert space from which samples are drawn. In our case, on the other hand, the samples are drawn from P and not from the Gaussian distribution whose mean is μ_P.

The contribution of this paper can be summarized as follows. First, we show that the standard kernel mean estimator can be improved by providing an alternative estimator that achieves smaller risk (§2). The theoretical analysis reveals the existence of a wide class of estimators that are better than the standard one. To this end, we propose in §3 a kernel mean shrinkage estimator (KMSE), which is based on a novel motivation for regularization through the notion of shrinkage. Moreover, we propose an efficient leave-one-out cross-validation procedure to select the shrinkage parameter, which is novel in the context of kernel mean estimation. Lastly, we demonstrate the benefit of the proposed estimators in several applications (§4).

2. Motivation: Shrinkage Estimators

For an arbitrary distribution P, denote by μ and μ̂ the true kernel mean and its empirical estimate (2) from the i.i.d. sample x_1, x_2, ..., x_n ~ P (we remove the subscript P for ease of notation). The most natural loss function considered in this work is ℓ(μ, μ̂) = ‖μ − μ̂‖²_H. An estimator μ̂ is a mapping which is measurable w.r.t. the Borel σ-algebra of H, and it is evaluated by its risk function R(μ, μ̂) = E_P[ℓ(μ, μ̂)], where E_P indicates expectation over the choice of an i.i.d. sample of size n from P. Let us consider an alternative kernel mean estimator

    μ̂_α := αf + (1 − α)μ̂,

where 0 ≤ α ≤ 1 and f ∈ H. It is essentially a shrinkage estimator that shrinks the standard estimator toward a function f by an amount specified by α. If α = 0, μ̂_α reduces to the standard estimator μ̂. The following theorem asserts that the risk of the shrinkage estimator μ̂_α is smaller than that of the standard estimator μ̂ given an appropriate choice of α, regardless of the function f (more below).

Theorem 1. For all distributions P and kernels k, there exists α > 0 for which R(μ, μ̂_α) < R(μ, μ̂).

Proof. The risk of the standard kernel mean estimator satisfies

    E‖μ̂ − μ‖² = (1/n)(E[k(x, x)] − E[k(x, x̃)]) =: Δ,

where x̃ is an independent copy of x. Let us denote the risk of the proposed shrinkage estimator by Δ_α := E‖μ̂_α − μ‖², where α is a non-negative shrinkage parameter. We can then write this in terms of the standard risk as

    Δ_α = Δ − 2αE⟨μ̂ − μ, μ̂ − f⟩ + α²E‖f‖² − 2α²E[f(x)] + α²E‖μ̂‖².

It follows from the reproducing property of H that E[f(x)] = ⟨f, μ⟩. Moreover, using the facts that E‖μ̂‖² = E‖μ̂ − μ + μ‖² = Δ + E[k(x, x̃)] and E⟨μ̂ − μ, μ̂ − f⟩ = Δ, we can simplify the shrinkage risk to

    Δ_α = α²(Δ + ‖f − μ‖²) − 2αΔ + Δ.

Thus, we have

    Δ_α − Δ = α²(Δ + ‖f − μ‖²) − 2αΔ,

which is non-positive for

    α ∈ [0, 2Δ/(Δ + ‖f − μ‖²)]    (3)

and minimized at α* = Δ/(Δ + ‖f − μ‖²).

As we can see in (3), there is a range of α for which a non-positive Δ_α − Δ, i.e., R(μ, μ̂_α) ≤ R(μ, μ̂), is guaranteed. However, Theorem 1 relies on the important assumption that the true kernel mean of the distribution P is required to estimate α*. In spite of this, the theorem has an important implication: the shrinkage estimator μ̂_α can improve upon μ̂ if α is chosen appropriately. Later, we will exploit this result in order to construct more practical estimators.

Remark 1. The following observations follow immediately from Theorem 1:

- Given an appropriate α, the shrinkage estimator improves upon the standard one regardless of the direction of shrinkage, as specified by f. In other words, there exists a wide class of kernel mean estimators that are better than the standard one.
- The value of α* depends on the choice of f. The further f is from μ, the smaller α* becomes. Thus, the shrinkage gets smaller if f is chosen such that it is far from the true kernel mean. This effect is akin to the James-Stein estimator.
- The improvement can be viewed as a bias-variance trade-off: the shrinkage estimator reduces variance substantially at the expense of a little bias.

Remark 1 sheds light on how one can practically construct the shrinkage estimator: we can choose f arbitrarily as long as the parameter α is chosen appropriately. Moreover, further improvement can be gained by incorporating prior knowledge as to the location of μ_P, which can be straightforwardly integrated into the framework via f (Berger and Wolpert, 1983). Inspired by the James-Stein estimator, we focus on f = 0. We will investigate the effect of different priors f in future work.
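Theorem 1 is easy to check numerically when P is a discrete distribution over known atoms, since then μ and the risk are exactly computable. A sketch (ours, not from the paper; NumPy, RBF kernel, f = 0):

import numpy as np

rng = np.random.default_rng(1)

def rbf_gram(X, Z, s2=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s2))

atoms = rng.normal(size=(5, 2))      # P is discrete over these atoms
p = np.full(5, 0.2)                  # with uniform probabilities
Kaa = rbf_gram(atoms, atoms)

def risk(alpha, n=10, reps=2000):
    # Monte Carlo estimate of E || (1 - alpha) * mu_hat - mu ||_H^2  (f = 0)
    total = 0.0
    for _ in range(reps):
        X = atoms[rng.choice(5, size=n, p=p)]
        w = np.full(n, (1 - alpha) / n)          # weights of the shrunk estimator
        total += (w @ rbf_gram(X, X) @ w
                  - 2 * w @ rbf_gram(X, atoms) @ p
                  + p @ Kaa @ p)
    return total / reps

print(risk(0.0), risk(0.1))  # a small alpha inside the interval (3) lowers the risk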
3. Kernel Mean Shrinkage Estimator

In this section we give a novel formulation of the kernel mean estimator that allows us to estimate the shrinkage parameter efficiently. In the following, let φ : X → H be a feature map associated with the kernel k and ⟨·,·⟩ an inner product in the RKHS H such that k(x, x') = ⟨φ(x), φ(x')⟩. Unless stated otherwise, ‖·‖ denotes the RKHS norm. The kernel mean μ_P and its empirical estimate μ̂_P can be obtained as minimizers of the loss functionals

    E(g) := E_{x∼P} ‖φ(x) − g‖²,    Ê(g) := (1/n) Σ_{i=1}^n ‖φ(x_i) − g‖²,

respectively. We will call the estimator minimizing the loss functional Ê(g) a kernel mean estimator (KME). Note that the loss E(g) is different from the one considered in §2, i.e., ℓ(μ, g) = ‖μ − g‖² = ‖E[φ(x)] − g‖². Nevertheless, we have ℓ(μ, g) = E_{x,x'} k(x, x') − 2E_x g(x) + ‖g‖². Since E(g) = E_x k(x, x) − 2E_x g(x) + ‖g‖², the loss ℓ(μ, g) differs from E(g) only by E_x k(x, x) − E_{x,x'} k(x, x'), which is not a function of g. We introduce the new form here because it gives a more tractable cross-validation computation (§3.1). In spite of this, the resulting estimators are always evaluated w.r.t. the loss in §2 (cf. §4.1).

From the formulation above, it is natural to ask whether minimizing a regularized version of Ê(g) gives a better estimator. On the one hand, one can argue that, unlike in classical risk minimization, we do not really need a regularizer here: the standard estimator (2) is known to be, in a certain sense, optimal and can be estimated reliably (Shawe-Taylor and Cristianini, 2004, prop. 5.2), and the original formulation of Ê(g) is a well-posed problem. On the other hand, since regularization may be viewed as shrinking the solution toward zero, it can actually improve the kernel mean estimation, as suggested by Theorem 1 (cf. the discussion at the end of §2). Consequently, we minimize a modified loss functional

    Ê_λ(g) := Ê(g) + λΩ(‖g‖) = (1/n) Σ_{i=1}^n ‖φ(x_i) − g‖² + λΩ(‖g‖),    (4)

where Ω(·) denotes a monotonically increasing regularization functional and λ is a non-negative regularization parameter. (The parameters λ and α play similar roles as shrinkage parameters: they specify an amount by which the standard estimator μ̂ is shrunk toward f = 0. Thus, the terms shrinkage parameter and regularization parameter will be used interchangeably.) In what follows, we refer to the shrinkage estimator μ̂_λ minimizing Ê_λ(g) as a kernel mean shrinkage estimator (KMSE).

It follows from the representer theorem that g lies in a subspace spanned by the data, i.e., g = Σ_{j=1}^n β_j φ(x_j) for some β ∈ R^n. By considering Ω(‖g‖) = ‖g‖², we can rewrite (4) as

    (1/n) Σ_{i=1}^n ‖φ(x_i) − Σ_{j=1}^n β_j φ(x_j)‖² + λ‖Σ_{j=1}^n β_j φ(x_j)‖² = β⊤Kβ − 2β⊤K1_n + λβ⊤Kβ + c,    (5)

where c is a constant term, K is the n × n Gram matrix with K_ij = k(x_i, x_j), and 1_n = [1/n, 1/n, ..., 1/n]⊤. Taking the derivative of (5) w.r.t. β and setting it to zero yields β = (1/(1+λ))1_n. Setting α = λ/(1+λ), the shrinkage estimate can be written as μ̂_λ = (1 − α)μ̂. Since 0 ≤ α ≤ 1, the estimator μ̂_λ corresponds to a shrinkage estimator of §2 with f = 0. We call this estimator a simple kernel mean shrinkage estimator (S-KMSE).

Using the expansion g = Σ_{j=1}^n β_j φ(x_j), we may also consider a regularization functional written in terms of β, e.g., Ω = β⊤β. This leads to a particularly interesting kernel mean estimator. In this case, the optimal weight vector is given by β = (K + λI)^{-1}K1_n, and the shrinkage estimate can be written accordingly as μ̂_λ = Σ_{j=1}^n β_j φ(x_j) = Φ(K + λI)^{-1}K1_n, where Φ = [φ(x_1), φ(x_2), ..., φ(x_n)]. Unlike the S-KMSE, this estimator shrinks the usual estimate differently in each coordinate (cf. Theorem 2). Hence, we call it a flexible kernel mean shrinkage estimator (F-KMSE).

The following theorem characterizes the F-KMSE as a shrinkage estimator.

Theorem 2. The F-KMSE can be written as μ̂_λ = Σ_{i=1}^n (γ_i/(γ_i + λ)) ⟨μ̂, v_i⟩ v_i, where {γ_i, v_i} are eigenvalue/eigenvector pairs of the empirical covariance operator Ĉ_xx in H.

In words, the effect of the F-KMSE is to reduce the high-frequency components of the expansion of μ̂, by expanding it in terms of the kernel PCA basis and shrinking the coefficients of the high-order eigenfunctions; see, e.g., Rasmussen and Williams (2006, sec. 4.3). Note that the covariance operator Ĉ_xx itself does not depend on λ.

As we can see, the solution to the regularized version is indeed of the form of the shrinkage estimators with f = 0. That is, both the S-KMSE and the F-KMSE shrink the standard kernel mean estimate towards zero. The difference is that the S-KMSE shrinks equally in all coordinates, whereas the F-KMSE also constrains the amount of shrinkage by the information contained in each coordinate.
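Both estimators are just weight vectors over the sample, so an implementation needs only the Gram matrix. A minimal sketch (ours; NumPy) of the two closed forms derived above:

import numpy as np

def skmse_weights(K, lam):
    # S-KMSE: beta = (1 / (1 + lambda)) * 1_n, i.e., uniform shrinkage of mu_hat
    n = K.shape[0]
    return np.full(n, 1.0 / ((1.0 + lam) * n))

def fkmse_weights(K, lam):
    # F-KMSE: beta = (K + lambda * I)^{-1} K 1_n, coordinate-wise shrinkage
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), K @ np.full(n, 1.0 / n))

With λ = 0 both reduce to the uniform weights 1/n of the standard KME; the shrunk mean is then evaluated as μ̂_λ(z) = Σ_j β_j k(x_j, z).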
Moreover, the squared RKHS norm ‖·‖² can be decomposed as a sum of squared losses weighted by the eigenvalues γ_i (cf. Mandelbaum and Shepp (1987, appendix)). By the same reasoning as Stein's result in the finite-dimensional case, one would suspect that the improvement of shrinkage estimators in H should also depend on how fast the eigenvalues of k decay; one would expect greater improvement if the values of γ_i decay very slowly. For example, the Gaussian RBF kernel with a larger bandwidth gives smaller improvement than one with a smaller bandwidth, and we should expect more improvement when applying a Laplacian kernel than when using a Gaussian RBF kernel (a small numerical illustration follows at the end of this section).

In some applications of kernel mean embedding, one may want to interpret the weight β as a probability vector (Nishiyama et al., 2012). However, the weight vector output by our estimators is in general not normalized; in fact, all elements will be smaller than 1/n as a result of shrinkage. One may impose the constraint that β must sum to one and resort to quadratic programming (Song et al., 2008). Unfortunately, this approach has the undesirable side effect of sparsity, which is unlikely to improve upon the standard estimator. Post-normalizing the weights often deteriorates the estimation performance.

To the best of our knowledge, no previous attempt has been made to improve the kernel mean estimation. However, we discuss some closely related works here. Instead of the loss functional Ê(g), Kim and Scott (2012) consider a robust loss function such as the Huber loss to reduce the effect of outliers. The authors consider kernel density estimators, which differ fundamentally from kernel mean estimators: they need to reduce the kernel bandwidth with increasing sample size for the estimators to be consistent. A regularized version of MMD was adopted by Danafar et al. (2013) in the context of kernel-based hypothesis testing; the resulting formulation resembles our S-KMSE. Furthermore, the F-KMSE is of a similar form as the conditional mean embedding used in Grünewälder et al. (2012), which can be viewed more generally as a regression problem in RKHS with smooth operators (Grünewälder et al., 2013).
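The spectral-decay illustration promised above (ours, not from the paper; NumPy): as the RBF bandwidth grows, the Gram spectrum concentrates its mass in the top eigenvalues, i.e., decays faster, which by the discussion above suggests less room for shrinkage gains.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

for s2 in (0.5, 5.0, 50.0):   # increasing RBF bandwidth
    ev = np.sort(np.linalg.eigvalsh(np.exp(-d2 / (2 * s2))))[::-1]
    # Spectral mass in the top 10 eigenvalues: closer to 1 means faster decay,
    # hence less expected benefit from shrinkage.
    print(s2, round(ev[:10].sum() / ev.sum(), 3))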
3.1. Choosing the Shrinkage Parameter

As discussed in §2, the amount of shrinkage plays an important role in our estimators. In this work we propose to select the shrinkage parameter by an automatic leave-one-out cross-validation.

For a given shrinkage parameter λ, consider the observation x_i as being a new observation by omitting it from the dataset. Denote by μ̂_λ^(−i) = Σ_{j≠i} β_j^(−i) φ(x_j) the kernel mean estimated from the remaining data, using λ as the shrinkage parameter, so that β^(−i) is the minimizer of Ê_λ^(−i)(g). We measure the quality of μ̂_λ^(−i) by how well it approximates φ(x_i). The overall quality of the estimate is quantified by the cross-validation score

    LOOCV(λ) = (1/n) Σ_{i=1}^n ‖φ(x_i) − μ̂_λ^(−i)‖²_H.    (6)

By simple algebra, it is not difficult to show that the optimal shrinkage parameter of the S-KMSE can be calculated analytically, as stated in the following theorem.

Theorem 3. Let ρ := (1/n²) Σ_{i=1}^n Σ_{j=1}^n k(x_i, x_j) and ϱ := (1/n) Σ_{i=1}^n k(x_i, x_i). The shrinkage parameter λ = (ϱ − ρ)/((n − 1)ρ + ϱ/n − ϱ) of the S-KMSE is the minimizer of LOOCV(λ).

On the other hand, finding the optimal λ for the F-KMSE is relatively more involved: evaluating the score (6) naïvely requires one to solve for μ̂_λ^(−i) explicitly for every i. Fortunately, we can simplify the score such that it can be evaluated efficiently, as stated in the following theorem.

Theorem 4. The LOOCV score of the F-KMSE satisfies LOOCV(λ) = (1/n) Σ_{i=1}^n (Kβ − K_i)⊤ C (Kβ − K_i), where β is the weight vector calculated from the full dataset with shrinkage parameter λ, K_i is the ith column of K, and C = (K − (1/n)K(K + λI)^{-1}K)^{-1} K (K − (1/n)K(K + λI)^{-1}K)^{-1}.

Proof of Theorem 4. For fixed λ and i, let μ̂_λ^(−i) be the leave-one-out kernel mean estimate of the F-KMSE and let A := (K + λI)^{-1}. Then we can write the deleted residual as

    ξ^(−i) := μ̂_λ^(−i) − φ(x_i) = μ̂_λ − φ(x_i) + (1/n) Σ_{j=1}^n Σ_{l=1}^n A_jl ⟨φ(x_l), μ̂_λ^(−i) − φ(x_i)⟩ φ(x_j).

Since ξ^(−i) lies in a subspace spanned by the sample φ(x_1), ..., φ(x_n), we have ξ^(−i) = Σ_{k=1}^n ζ_k φ(x_k) for some ζ ∈ R^n. Substituting ξ^(−i) back yields Σ_{k=1}^n ζ_k φ(x_k) = μ̂_λ − φ(x_i) + (1/n) Σ_{j=1}^n {AKζ}_j φ(x_j). By taking the inner product on both sides w.r.t. the sample φ(x_1), ..., φ(x_n) and solving for ζ, we have ζ = (K − (1/n)KAK)^{-1}(Kβ − K_i). Consequently, the leave-one-out score of the sample x_i can be computed as ‖ξ^(−i)‖² = ζ⊤Kζ = (Kβ − K_i)⊤(K − (1/n)KAK)^{-1} K (K − (1/n)KAK)^{-1}(Kβ − K_i) = (Kβ − K_i)⊤ C (Kβ − K_i). Averaging ‖ξ^(−i)‖² over all samples gives LOOCV(λ) = (1/n) Σ_{i=1}^n (Kβ − K_i)⊤ C (Kβ − K_i), as required.

It is interesting to see that the leave-one-out cross-validation score in Theorem 4 depends only on the non-leave-one-out solution β, which can be obtained as a by-product of the algorithm.

Computational complexity. The S-KMSE requires O(n²) operations to select the shrinkage parameter. For the F-KMSE, there are two steps in the cross-validation. First, we need to compute (K + λI)^{-1} repeatedly for different values of λ. Assume that we know the eigendecomposition K = UDU⊤, where D is diagonal with d_i ≥ 0 and U⊤U = I. It follows that (K + λI)^{-1} = U(D + λI)^{-1}U⊤. Consequently, solving for β takes O(n²) operations. Since the eigendecomposition requires O(n³) operations, finding β for many λ's is essentially free, and a low-rank approximation can be adopted to reduce the computational cost further. Second, we need to compute the cross-validation score (6). As shown in Theorem 4, we can compute it using only the β obtained from the previous step. The calculation of C can be simplified further via the eigendecomposition of K as C = U(D − (1/n)D(D + λI)^{-1}D)^{-1} D (D − (1/n)D(D + λI)^{-1}D)^{-1} U⊤. Since this only involves the inverse of diagonal matrices, the inversion can be evaluated in O(n) operations. The overall cross-validation therefore requires only O(n²) operations, as opposed to the naïve approach that requires O(n⁴) operations. When performed as a by-product of the algorithm, the computational cost of the cross-validation procedure becomes negligible as the dataset becomes larger. In practice, we use the fminsearch and fminbnd routines of the MATLAB optimization toolbox to find the best shrinkage parameter.
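A sketch (ours; NumPy) of both selection rules, under our reading of the reconstructed formulas in Theorems 3 and 4: the closed-form λ for the S-KMSE, and the F-KMSE LOOCV score evaluated from the full-data solution only. (In practice one would reuse a single eigendecomposition of K across λ values, as discussed above; the direct solves below keep the sketch short.)

import numpy as np

def skmse_lambda(K):
    # Theorem 3: closed-form LOOCV minimizer for the S-KMSE
    n = K.shape[0]
    rho = K.mean()               # (1/n^2) sum_ij k(x_i, x_j)
    varrho = np.trace(K) / n     # (1/n) sum_i k(x_i, x_i)
    return (varrho - rho) / ((n - 1) * rho + varrho / n - varrho)

def fkmse_loocv(K, lam):
    # Theorem 4: LOOCV score from the full-data F-KMSE solution beta
    n = K.shape[0]
    beta = np.linalg.solve(K + lam * np.eye(n), K @ np.full(n, 1.0 / n))
    M = K - (K @ np.linalg.solve(K + lam * np.eye(n), K)) / n   # K - (1/n) K A K
    C = np.linalg.solve(M, K) @ np.linalg.inv(M)                # M^{-1} K M^{-1}
    R = K @ beta[:, None] - K            # column i is (K beta - K_i)
    return np.mean(np.einsum('ij,jk,ki->i', R.T, C, R))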
3.2. Covariance Operators

The covariance operator from H_X to H_Y can be viewed as a mean function in the product space H_X ⊗ H_Y. Hence, we can also construct a shrinkage estimator of the covariance operator in RKHS. Let (H_X, k_X) and (H_Y, k_Y) be the RKHSs of functions on measurable spaces X and Y, respectively, with p.d. kernels k_X and k_Y (with feature maps φ and ψ). We consider a random vector (X, Y) : Ω → X × Y with distribution P_XY, and with P_X and P_Y as marginal distributions. Under some conditions, there exists a unique cross-covariance operator Σ_YX : H_X → H_Y such that ⟨g, Σ_YX f⟩_{H_Y} = E_XY[(f(X) − E_X[f(X)])(g(Y) − E_Y[g(Y)])] = Cov(f(X), g(Y)) holds for all f ∈ H_X and g ∈ H_Y (Fukumizu et al., 2004). If X equals Y, we get the self-adjoint operator Σ_XX called the covariance operator.

Given an i.i.d. sample from P_XY written as (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), we can write the empirical cross-covariance operator as Σ̂_YX := (1/n) Σ_{i=1}^n φ(x_i) ⊗ ψ(y_i) − μ̂_X ⊗ μ̂_Y, where μ̂_X = (1/n) Σ_i φ(x_i) and μ̂_Y = (1/n) Σ_i ψ(y_i). Let φ̃ and ψ̃ be the centered feature maps of φ and ψ, respectively. Then, it can be rewritten as Σ̂_YX := (1/n) Σ_{i=1}^n φ̃(x_i) ⊗ ψ̃(y_i) ∈ H_X ⊗ H_Y. It follows from the inner product property in the product space that ⟨φ̃(x) ⊗ ψ̃(y), φ̃(x') ⊗ ψ̃(y')⟩_{H_X ⊗ H_Y} = ⟨φ̃(x), φ̃(x')⟩_{H_X} ⟨ψ̃(y), ψ̃(y')⟩_{H_Y} = k̃_X(x, x') k̃_Y(y, y').

Figure 1. The average loss of the KME (left), S-KMSE (middle), and F-KMSE (right) estimators with different values of the shrinkage parameter, for the kernels (a) LIN, (b) POLY2, (c) POLY3, and (d) RBF. Inside boxes correspond to estimators. We repeat the experiments over 30 different distributions with n = 10 and d = 30.

Then, we can obtain the shrinkage estimators for the covariance operator by plugging the kernel k((x, y), (x', y')) = k̃_X(x, x') k̃_Y(y, y') into our KMSEs. We call this estimator a covariance-operator shrinkage estimator (COSE). The same trick can easily be generalized to tensors of higher order, which have been used previously, for example, in Song et al. (2011).
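Concretely, the COSE changes only the Gram matrix fed to the KMSE weight formulas: it is the Hadamard product of the two centered Gram matrices. A sketch (ours; NumPy, applying the F-KMSE closed form to the product-space Gram matrix):

import numpy as np

def center_gram(K):
    # Gram matrix of the centered feature map: H K H with H = I - (1/n) 1 1^T
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H

def cose_weights(Kx, Ky, lam):
    # Product-space Gram k((x,y),(x',y')) = k~_X(x,x') * k~_Y(y,y'),
    # i.e., the Hadamard product of the centered Gram matrices
    K = center_gram(Kx) * center_gram(Ky)
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), K @ np.full(n, 1.0 / n))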
4. Experiments

We focus on the comparison between our shrinkage estimators and the standard estimator of the kernel mean, using both synthetic and real-world datasets.

4.1. Synthetic Data

Given the true data-generating distribution P, we evaluate different estimators using the loss function ℓ(β) := ‖Σ_{i=1}^n β_i k(x_i, ·) − E_P[k(x, ·)]‖²_H, where β is the weight vector associated with a given estimator. To allow for an exact calculation of ℓ(β), we consider the case where P is a mixture-of-Gaussians distribution and k is one of the following kernel functions: 1) linear kernel k(x, x') = x⊤x'; 2) polynomial degree-2 kernel k(x, x') = (x⊤x' + 1)²; 3) polynomial degree-3 kernel k(x, x') = (x⊤x' + 1)³; and 4) Gaussian RBF kernel k(x, x') = exp(−‖x − x'‖²/2σ²). We refer to them as LIN, POLY2, POLY3, and RBF, respectively.

Experimental protocol. Data are generated from a d-dimensional mixture of Gaussians:

    x ~ Σ_{i=1}^4 π_i N(θ_i, Σ_i) + ε,    θ_ij ~ U(−10, 10),    Σ_i ~ W(2 × I_d, 7),    ε ~ N(0, 0.2 × I_d),

where U(a, b) and W(Σ_0, df) represent the uniform distribution and the Wishart distribution, respectively. We set π = [0.05, 0.3, 0.4, 0.25]. The choice of parameters here is quite arbitrary; we have experimented with various parameter settings and the results are similar to those presented here. For the Gaussian RBF kernel, we set the bandwidth parameter to the square root of the median squared Euclidean distance between samples in the dataset (i.e., σ² = median ‖x_i − x_j‖² throughout).

Figure 1 shows the average loss of different estimators using different kernels as we increase the value of the shrinkage parameter. Here we scale the shrinkage parameter by the minimum non-zero eigenvalue γ_0 of the kernel matrix K. In general, we find that the S-KMSE and F-KMSE tend to outperform the KME. However, as λ becomes large, there are some cases where shrinkage deteriorates the estimation performance, e.g., see the LIN kernel and some outliers in the figures. This suggests that it is very important to choose the parameter λ appropriately (cf. the discussion in §2).

Similarly, Figure 2 depicts the average loss as we vary the sample size and the dimension of the data. In this case, the shrinkage parameter is chosen by the proposed leave-one-out cross-validation score. As we can see, both the S-KMSE and F-KMSE outperform the standard KME, with the S-KMSE performing slightly better than the F-KMSE. Moreover, the improvement is more substantial in the "large d, small n" paradigm. In the worst cases, the S-KMSE and F-KMSE perform as well as the KME. Lastly, it is instructive to note that the improvement varies with the choice of kernel k. Briefly, the choice of kernel reflects the dimensionality of the feature space H: one would expect more improvement in a high-dimensional feature space, e.g., the RBF kernel, than in a low-dimensional one, e.g., the linear kernel (cf. the discussion at the end of §3). This phenomenon can be observed in both Figures 1 and 2.

Figure 2. The average loss over 30 different distributions of KME, S-KMSE, and F-KMSE with varying sample size (n, at d = 20) and varying dimension (d, at n = 20), for the LIN, POLY2, POLY3, and RBF kernels. The shrinkage parameter is chosen by LOOCV.
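For reference, a sketch (ours; NumPy) of the data-generating protocol and the median bandwidth heuristic described above; the Wishart draw W(2I_d, 7) is built from its definition as a sum of 7 Gaussian outer products.

import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, d):
    # x ~ sum_i pi_i N(theta_i, Sigma_i) + eps, per the experimental protocol
    pi = np.array([0.05, 0.3, 0.4, 0.25])
    theta = rng.uniform(-10, 10, size=(4, d))
    sigmas = []
    for _ in range(4):
        G = rng.normal(size=(7, d)) * np.sqrt(2.0)   # rows ~ N(0, 2 I_d)
        sigmas.append(G.T @ G)                       # Sigma_i ~ W(2 I_d, 7)
    comp = rng.choice(4, size=n, p=pi)
    x = np.array([rng.multivariate_normal(theta[c], sigmas[c]) for c in comp])
    return x + rng.normal(scale=np.sqrt(0.2), size=(n, d))   # additive noise eps

def median_bandwidth(X):
    # sigma^2 = median ||x_i - x_j||^2 over all distinct pairs
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.median(d2[np.triu_indices_from(d2, k=1)])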
Table 1. Average negative log-likelihood of the model Q on test points over 10 randomizations. (In the original, boldface marks results whose difference from the baseline, i.e., KME, is statistically significant.)

Dataset          LIN (KME / S-KMSE / F-KMSE)     POLY2 (KME / S-KMSE / F-KMSE)     POLY3 (KME / S-KMSE / F-KMSE)     RBF (KME / S-KMSE / F-KMSE)
1. ionosphere    33.2440 / 33.0325 / 33.1436     53.1266 / 53.7067 / 50.8695       51.6800 / 49.9149 / 47.4461       40.8961 / 40.5578 / 39.6804
2. sonar         72.6630 / 72.8770 / 72.5015     120.3454 / 108.8246 / 109.9980    102.4499 / 90.3920 / 91.1547      71.3048 / 70.5721 / 70.5830
3. australian    18.3703 / 18.3341 / 18.3719     18.5928 / 18.6028 / 18.4987       41.1563 / 34.4303 / 34.5460       17.5138 / 17.5637 / 17.4026
4. specft        56.6138 / 55.7374 / 55.8667     67.3901 / 65.9662 / 65.2056       63.9273 / 63.5571 / 62.1480       57.5569 / 56.1386 / 55.5808
5. wdbc          30.9778 / 30.9266 / 30.4400     93.0541 / 91.5803 / 87.5265       58.8235 / 54.1237 / 50.3911       30.8227 / 30.5968 / 30.2646
6. wine          15.9225 / 15.8850 / 16.0431     24.2841 / 24.1325 / 23.5163       35.2069 / 32.9465 / 32.4702       17.1523 / 16.9177 / 16.6312
7. satimage      19.6353 / 19.8721 / 19.7943     149.5986 / 143.2277 / 146.0648    52.7973 / 57.2482 / 45.8946       20.3306 / 20.5020 / 20.2226
8. segment       22.9131 / 22.8219 / 22.0696     61.2712 / 59.4387 / 54.8621       38.7226 / 38.6226 / 38.4217       17.6801 / 16.4149 / 15.6814
9. vehicle       16.4145 / 16.2888 / 16.3210     83.1597 / 79.7248 / 79.6679       70.4340 / 63.4322 / 48.0177       15.9256 / 15.8331 / 15.6516
10. svmguide2    27.1514 / 27.0644 / 27.1144     30.3065 / 30.2290 / 29.9875       37.0427 / 36.7854 / 35.8157       27.3930 / 27.2517 / 27.1815
11. vowel        12.4227 / 12.4219 / 12.4264     32.1389 / 28.0474 / 29.3492       25.8728 / 24.0684 / 23.9747       12.3976 / 12.3823 / 12.3677
12. housing      15.5249 / 15.1618 / 15.3176     39.9582 / 37.1360 / 32.1028       50.8481 / 49.0884 / 35.1366       14.5576 / 14.3810 / 13.9379
13. bodyfat      17.6426 / 17.0419 / 17.2152     44.3295 / 43.7959 / 42.3331       27.4339 / 25.6530 / 24.7955       16.2725 / 15.9170 / 15.8665
14. abalone      4.3348 / 4.3274 / 4.3187        14.9166 / 14.4041 / 11.4431       20.6071 / 23.2487 / 23.6291       4.6928 / 4.6056 / 4.6017
15. glass        10.4078 / 10.4451 / 10.4067     33.3480 / 31.6110 / 30.5075       45.0801 / 34.9608 / 25.5677       8.6167 / 8.4992 / 8.2469

4.2. Real Data

We consider three benchmark applications: density estimation via kernel mean matching (Song et al., 2008), kernel PCA using shrinkage mean and covariance operators (Schölkopf et al., 1998), and discriminative learning on distributions (Muandet and Schölkopf, 2013; Muandet et al., 2012). For the first two tasks we employ 15 datasets from the UCI repositories. We use only real-valued features, each of which is normalized to have zero mean and unit variance.

Density estimation. We perform density estimation via kernel mean matching (Song et al., 2008). That is, we fit the density Q = Σ_{j=1}^m π_j N(θ_j, σ_j² I) to each dataset by minimizing ‖μ̂ − μ_Q‖²_H subject to Σ_{j=1}^m π_j = 1. The kernel mean μ̂ is obtained from the samples using the different estimators, whereas μ_Q is the kernel mean embedding of the density Q. Unlike the experiments in Song et al. (2008), our goal is to compare different estimators of μ_P, where P is the true data distribution; that is, we replace μ̂ with a version obtained via shrinkage. A better estimate of μ_P should lead to better density estimation, as measured by the negative log-likelihood of Q on the test set. We use 30% of each dataset as a test set and set m = 10 for each dataset. The model is initialized by running 50 random initializations using the k-means algorithm and returning the best. We repeat the experiments 10 times and perform a paired sign test on the results at the 5% significance level. (The paired sign test is a nonparametric test that can be used to examine whether two paired samples have the same distribution. In our case, we compare S-KMSE and F-KMSE against KME.)
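Once the component parameters are fixed, the fitting objective is quadratic in the mixture weights π, because for the Gaussian RBF kernel both ⟨μ̂, μ_{N_j}⟩ and ⟨μ_{N_j}, μ_{N_l}⟩ have closed forms via the standard Gaussian convolution identity ∫ k(x, y) dN(y; t, vI) = (s²/(s² + v))^{d/2} exp(−‖x − t‖²/(2(s² + v))). The sketch below (our illustration, not the authors' code; NumPy) evaluates ‖μ̂_β − μ_Q‖²_H for isotropic components under that identity.

import numpy as np

def rbf_gram(X, Z, s2):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s2))

def kmm_objective(pi, beta, X, theta, comp_var, s2):
    # || mu_hat_beta - mu_Q ||_H^2 for k(x,y) = exp(-||x-y||^2 / (2 s2)) and
    # Q = sum_j pi_j N(theta_j, comp_var_j * I)
    d = X.shape[1]
    v = comp_var[None, :]
    d2_xt = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(-1)
    # b_j = <mu_hat_beta, mu_{N_j}>
    b = beta @ ((s2 / (s2 + v)) ** (d / 2) * np.exp(-d2_xt / (2 * (s2 + v))))
    # A_jl = <mu_{N_j}, mu_{N_l}>
    vv = comp_var[:, None] + comp_var[None, :]
    d2_tt = ((theta[:, None, :] - theta[None, :, :]) ** 2).sum(-1)
    A = (s2 / (s2 + vv)) ** (d / 2) * np.exp(-d2_tt / (2 * (s2 + vv)))
    return beta @ rbf_gram(X, X, s2) @ beta - 2 * pi @ b + pi @ A @ pi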
The average negative log-likelihood of the model Q, optimized via the different estimators, is reported in Table 1. Clearly, both S-KMSE and F-KMSE consistently achieve smaller negative log-likelihood when compared to KME. There are, however, a few cases in which KME outperforms the proposed estimators, especially when the dataset is relatively large, e.g., satimage and abalone. We suspect that in those cases the standard KME already provides an accurate estimate of the kernel mean, so that obtaining a better estimate requires more effort in optimizing the shrinkage parameter. Moreover, the improvement across different kernels is consistent with the results on the synthetic datasets.

Kernel PCA. In this experiment, we perform KPCA using different estimates of the mean and covariance operators. We compare the reconstruction error E_proj(z) = ‖φ(z) − P[φ(z)]‖² on test samples, where P is the projection constructed from the first 20 principal components. We use a Gaussian RBF kernel for all datasets. We compare five different scenarios: 1) standard KPCA; 2) shrinkage centering with S-KMSE; 3) shrinkage centering with F-KMSE; 4) KPCA with S-COSE; and 5) KPCA with F-COSE. To perform KPCA on the shrinkage covariance operator, we solve the generalized eigenvalue problem K_c B K_c V = K_c V D, where B = diag(β) and K_c is the centered Gram matrix. The weight vector β is obtained from the shrinkage estimators using the kernel matrix K_c ∘ K_c, where ∘ denotes the Hadamard product. We use 30% of each dataset as a test set.

Figure 3. The average reconstruction error of KPCA on hold-out test samples over 10 repetitions, across the 15 UCI datasets (ionosphere, sonar, australian, specft, wdbc, wine, satimage, segment, vehicle, svmguide2, vowel, housing, bodyfat, abalone, glass). KME represents the standard approach, whereas S-KMSE and F-KMSE use shrinkage means to perform centering. S-COSE and F-COSE directly use the shrinkage estimate of the covariance operator.

Figure 3 illustrates the results of KPCA. Clearly, the S-COSE and F-COSE consistently outperform all other estimators. Although we observe an improvement of S-KMSE and F-KMSE over KME, it is very small compared to that of S-COSE and F-COSE. This makes sense intuitively: changing the mean point or shifting the data does not change the covariance structure considerably, so it will not significantly affect the reconstruction error.

Discriminative learning on distributions. A positive semi-definite kernel between distributions can be defined via their kernel mean embeddings. That is, given a training sample (P̂_1, y_1), ..., (P̂_m, y_m) ∈ P × {−1, +1}, where P̂_i := (1/n) Σ_{k=1}^n δ_{x^i_k} and x^i_k ~ P_i, the linear kernel between two distributions is approximated by ⟨μ̂_{P_i}, μ̂_{P_j}⟩ = ⟨Σ_k β^i_k φ(x^i_k), Σ_l β^j_l φ(x^j_l)⟩ = Σ_{k,l} β^i_k β^j_l k(x^i_k, x^j_l). The weight vectors β^i and β^j come from the kernel mean estimates of μ_{P_i} and μ_{P_j}, respectively. The non-linear kernel can then be defined accordingly, e.g., κ(P_i, P_j) = exp(−‖μ̂_{P_i} − μ̂_{P_j}‖²_H / 2σ²). Our goal in this experiment is to investigate whether the shrinkage estimate of the kernel mean improves the performance of discriminative learning on distributions. To this end, we conduct experiments on natural scene categorization using the support measure machine (SMM) (Muandet et al., 2012) and group anomaly detection on a high-energy physics dataset using the one-class SMM (OCSMM) (Muandet and Schölkopf, 2013). We use both linear and non-linear kernels, where the Gaussian RBF kernel is employed as an embedding kernel (Muandet et al., 2012).
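A sketch (ours; NumPy) of the kernel-on-distributions construction with shrinkage weights: each bag of samples gets a weight vector from a KMSE, and both the linear and the RBF-on-embeddings kernels reduce to weighted Gram-matrix sums. The split of bandwidths s2 (embedding kernel) and s2_outer (level-2 kernel) is our naming.

import numpy as np

def rbf_gram(X, Z, s2):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s2))

def linear_dist_kernel(Xi, bi, Xj, bj, s2):
    # <mu_hat_Pi, mu_hat_Pj> = sum_kl bi_k bj_l k(x^i_k, x^j_l)
    return bi @ rbf_gram(Xi, Xj, s2) @ bj

def rbf_dist_kernel(Xi, bi, Xj, bj, s2, s2_outer):
    # kappa(Pi, Pj) = exp(-||mu_hat_Pi - mu_hat_Pj||_H^2 / (2 s2_outer))
    sq = (linear_dist_kernel(Xi, bi, Xi, bi, s2)
          - 2 * linear_dist_kernel(Xi, bi, Xj, bj, s2)
          + linear_dist_kernel(Xj, bj, Xj, bj, s2))
    return np.exp(-sq / (2 * s2_outer))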
All hyper-parameters are chosen by 10-fold cross-validation. For our unsupervised problem, we repeat the experiments using several parameter settings and report the best results.

Table 2 reports the classification accuracy of SMM and the area under the ROC curve (AUC) of OCSMM using different kernel mean estimators to construct the kernel on distributions.

Table 2. The classification accuracy of SMM and the area under the ROC curve (AUC) of OCSMM using different kernel mean estimators to construct the kernel on distributions.

Estimator    Linear SMM    Linear OCSMM    Non-linear SMM    Non-linear OCSMM
KME          0.5432        0.6955          0.6017            0.9085
S-KMSE       0.5521        0.6970          0.6303            0.9105
F-KMSE       0.5610        0.6970          0.6522            0.9095

Both shrinkage estimators consistently lead to better performance on both SMM and OCSMM when compared to KME. To summarize, we find sufficient evidence to conclude that both S-KMSE and F-KMSE outperform the standard KME. The performance of S-KMSE and F-KMSE is very competitive; the difference depends on the dataset and the kernel function.

5. Conclusions

To conclude, we show that the commonly used kernel mean estimator can be improved. Our theoretical result suggests that there exists a wide class of kernel mean estimators that are better than the standard one. To demonstrate this, we focus on two efficient shrinkage estimators, the simple and the flexible kernel mean shrinkage estimators. An empirical study clearly shows that the proposed estimators outperform the standard one in various scenarios. Most importantly, the shrinkage estimates not only provide more accurate estimation, but also lead to superior performance on real-world applications.

Acknowledgments

The authors wish to thank David Hogg and Ross Fedely for reading the first draft, and the anonymous reviewers who gave valuable suggestions that have helped to improve the manuscript.

References

J. Berger and R. Wolpert. Estimating the mean function of a Gaussian process and the Stein effect. Journal of Multivariate Analysis, 13(3):401–424, 1983.

J. O. Berger. Admissible minimax estimation of a multivariate normal mean with arbitrary quadratic loss. Annals of Statistics, 4(1):223–226, 1976.

A. Berlinet and T. C. Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.

S. Danafar, P. M. V. Rancoita, T. Glasmachers, K. Whittingstall, and J. Schmidhuber. Testing hypotheses by regularized maximum mean discrepancy. 2013.

I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 551–556, New York, NY, USA, 2004.

K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.

K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule. In Advances in Neural Information Processing Systems (NIPS), pages 1737–1745, 2011.

A. Gretton, R. Herbrich, A. Smola, B. Schölkopf, and A. Hyvärinen. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.

A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems (NIPS), 2007.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

S. Grünewälder, G. Lever, A. Gretton, L. Baldassarre, S. Patterson, and M. Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

S. Grünewälder, A. Gretton, and J. Shawe-Taylor. Smooth operators. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 361–379. University of California Press, 1961.

J. Kim and C. D. Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13:2529–2565, Sep 2012.
A. Mandelbaum and L. A. Shepp. Admissibility as a touchstone. Annals of Statistics, 15(1):252–268, 1987.

K. Muandet and B. Schölkopf. One-class support measure machines for group anomaly detection. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2013.

K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems (NIPS), pages 10–18, 2012.

Y. Nishiyama, A. Boularias, A. Gretton, and K. Fukumizu. Hilbert space embeddings of POMDPs. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI), pages 644–653, 2012.

N. Privault and A. Réveillac. Stein estimation for the drift of Gaussian processes using the Malliavin calculus. Annals of Statistics, 36(5):2531–2550, 2008.

C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, July 1998.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.

A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT), pages 13–31. Springer-Verlag, 2007.

L. Song, X. Zhang, A. Smola, A. Gretton, and B. Schölkopf. Tailoring density estimation via reproducing kernel moment matching. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 992–999, 2008.

L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

L. Song, A. P. Parikh, and E. P. Xing. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems (NIPS), pages 2708–2716, 2011.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In COLT, 2008.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206. University of California Press, 1955.