
Kernel Mean Estimation and Stein Effect

Krikamol Muandet (KRIKAMOL@TUEBINGEN.MPG.DE), Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany
Kenji Fukumizu (FUKUMIZU@ISM.AC.JP), The Institute of Statistical Mathematics, Tokyo, Japan
Bharath Sriperumbudur (BS493@STATSLAB.CAM.AC.UK), Statistical Laboratory, University of Cambridge, Cambridge, United Kingdom
Arthur Gretton (ARTHUR.GRETTON@GMAIL.COM), Gatsby Computational Neuroscience Unit, University College London, London, United Kingdom
Bernhard Schölkopf (BS@TUEBINGEN.MPG.DE), Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is an important part of many algorithms ranging from kernel principal component analysis to Hilbert-space embedding of distributions. Given a finite sample, an empirical average is the standard estimate for the true kernel mean. We show that this estimator can be improved due to a well-known phenomenon in statistics called Stein's phenomenon. Our theoretical analysis reveals the existence of a wide class of estimators that are better than the standard one. Focusing on a subset of this class, we propose efficient shrinkage estimators for the kernel mean. Empirical evaluations on several applications clearly demonstrate that the proposed estimators outperform the standard kernel mean estimator.

1. Introduction

This paper aims to improve the estimation of the mean function in a reproducing kernel Hilbert space (RKHS) from a finite sample. A kernel mean of a probability distribution P over a measurable space X is defined by

    μ_P := ∫_X k(x, ·) dP(x) ∈ H,    (1)

where H is an RKHS associated with a reproducing kernel k : X × X → R. Conditions ensuring that this expectation exists are given in Smola et al. (2007). Unfortunately, it is not practical to compute μ_P directly because the distribution P is usually unknown. Instead, given an i.i.d. sample x_1, x_2, ..., x_n from P, we can easily compute the empirical kernel mean by the average

    μ̂_P := (1/n) Σ_{i=1}^n k(x_i, ·).    (2)

The estimate μ̂_P is the most commonly used estimate of the true kernel mean. Our primary interest here is to investigate whether one can improve upon this standard estimator.

The kernel mean has recently gained attention in the machine learning community, thanks to the introduction of Hilbert space embeddings for distributions (Berlinet and Agnan, 2004; Smola et al., 2007). Representing the distribution as a mean function in the RKHS has several advantages: 1) the representation with an appropriate choice of kernel k has been shown to preserve all information about the distribution (Fukumizu et al., 2004; Sriperumbudur et al., 2008; 2010); 2) basic operations on the distribution can be carried out by means of inner products in the RKHS, e.g., E_P[f(x)] = ⟨f, μ_P⟩_H for all f ∈ H; 3) no intermediate density estimation is required, e.g., when testing for homogeneity from finite samples. As a result, many algorithms have benefited from the kernel mean representation, namely, maximum mean discrepancy (MMD) (Gretton et al., 2007), kernel dependency measures (Gretton et al., 2005), the kernel two-sample test (Gretton et al., 2012), Hilbert space embeddings of HMMs (Song et al., 2010), and kernel Bayes' rule (Fukumizu et al., 2011). Their performance relies directly on the quality of the empirical estimate μ̂_P.
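To make the object of study concrete before proceeding, the following minimal Python sketch evaluates the standard estimator (2) at query points under a Gaussian RBF kernel. It is an illustration only, not the code used in the experiments below; the kernel choice, the helper names, and the toy data are assumptions of the example.

import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def empirical_kernel_mean(X, sigma=1.0):
    # The empirical kernel mean of Eq. (2): a function t -> (1/n) sum_i k(x_i, t).
    def mu_hat(T):
        return rbf_kernel(T, X, sigma).mean(axis=1)
    return mu_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # i.i.d. sample standing in for P
mu_hat = empirical_kernel_mean(X, sigma=1.0)
print(mu_hat(rng.normal(size=(3, 2))))       # embedding evaluated at three query points

Every estimator discussed below, including the shrinkage estimators, is a re-weighting of exactly this kind of average, i.e., Σ_i β_i k(x_i, ·) with weights β_i that need not equal 1/n.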
However, it is of great importance, especially for readers who are not familiar with kernel methods, to realize a more fundamental role of the kernel mean. It basically serves as a foundation for most kernel-based learning algorithms. For instance, nonlinear component analyses, such as kernel PCA, kernel FDA, and kernel CCA, rely heavily on mean functions and covariance operators in the RKHS (Schölkopf et al., 1998). The kernel k-means algorithm performs clustering in feature space using mean functions as the representatives of the clusters (Dhillon et al., 2004). Moreover, the kernel mean also serves as a basis in early developments of algorithms for classification and anomaly detection (Shawe-Taylor and Cristianini, 2004, chap. 5). All of these employ (2) as the estimate of the true mean function. Thus, the fact that substantial improvement can be gained when estimating (1) may in fact raise a widespread suspicion about the traditional way of learning with kernels.

We show in this work that the standard estimator (2) is, in a certain sense, not optimal, i.e., there exist better estimators (more below). In addition, we propose shrinkage estimators that outperform the standard one. At first glance, it was definitely counter-intuitive and surprising for us, and will undoubtedly also be for some of our readers, that the empirical kernel mean can be improved, and, given the simplicity of the proposed estimators, that this has remained unnoticed until now. One of the reasons may be the common belief that the estimator μ̂_P already gives a good estimate of μ_P and that, as the sample size goes to infinity, the estimation error disappears (Shawe-Taylor and Cristianini, 2004). As a result, no need is felt to improve the kernel mean estimation. However, given a finite sample, substantial improvement is in fact possible and several factors may come into play, as will be seen later in this work.

This work was partly inspired by Stein's seminal work in 1955, which showed that the maximum likelihood estimator (MLE), i.e., the standard empirical mean, of the mean of a multivariate Gaussian distribution N(θ, σ²I) is inadmissible (Stein, 1955). That is, there exists an estimator that always achieves smaller total mean squared error regardless of the true θ, when the dimension is at least 3. Perhaps the best known estimator of this kind is the James-Stein estimator (James and Stein, 1961). Interestingly, the James-Stein estimator is itself inadmissible, and there exists a wide class of estimators that outperform the MLE, see e.g., Berger (1976).

However, our work differs fundamentally from Stein's seminal work, and from the works along this line, in two aspects. First, our setting is non-parametric in the sense that we do not assume any parametric form of the distribution, whereas most of the traditional works focus on some specific distributions, e.g., the Gaussian distribution. Second, our setting involves a non-linear feature map into a high-dimensional space, if not an infinite-dimensional one. As a result, higher moments of the distribution may come into play. Thus, one cannot adopt Stein's setting straightforwardly. A direct generalization of the James-Stein estimator to infinite-dimensional Hilbert spaces has already been considered (Berger and Wolpert, 1983; Mandelbaum and Shepp, 1987; Privault and Réveillac, 2008). In those works, θ, which is the parameter to be estimated, is assumed to be the mean of a Gaussian measure on the Hilbert space from which samples are drawn. In our case, on the other hand, the samples are drawn from P and not from a Gaussian distribution whose mean is μ_P.

The contribution of this paper can be summarized as follows: First, we show that the standard kernel mean estimator can be improved by providing an alternative estimator that achieves smaller risk (§2). The theoretical analysis reveals the existence of a wide class of estimators that are better than the standard one. To this end, we propose in §3 a kernel mean shrinkage estimator (KMSE), which is based on a novel motivation for regularization through the notion of shrinkage. Moreover, we propose an efficient leave-one-out cross-validation procedure to select the shrinkage parameter, which is novel in the context of kernel mean estimation. Lastly, we demonstrate the benefit of the proposed estimators in several applications (§4).
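For readers who want the finite-dimensional reference point in front of them, recall the James-Stein estimator (a standard result restated here, not a contribution of this paper): for a single observation x ∼ N(θ, σ²I_d) with known σ² and d ≥ 3,

    θ̂_JS = (1 − (d − 2)σ² / ‖x‖²) x,

and this estimator attains smaller total mean squared error than the MLE θ̂ = x for every θ. The data-dependent factor multiplying x plays the same role as the shrinkage parameter introduced in the next section: it pulls the raw estimate toward the origin by an amount that adapts to the data.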
2. Motivation: Shrinkage Estimators

For an arbitrary distribution P, denote by μ and μ̂ the true kernel mean and its empirical estimate (2) from the i.i.d. sample x_1, x_2, ..., x_n ∼ P (we remove the subscript for ease of notation). The most natural loss function considered in this work is ℓ(μ, μ̂) = ‖μ̂ − μ‖²_H. An estimator μ̂ is a mapping which is measurable w.r.t. the Borel σ-algebra of H and is evaluated by its risk function R(μ, μ̂) = E_P[ℓ(μ, μ̂)], where E_P indicates the expectation over the choice of an i.i.d. sample of size n from P. Let us consider an alternative kernel mean estimator:

    μ̂_α := α f + (1 − α) μ̂,

where 0 ≤ α ≤ 1 and f ∈ H. It is essentially a shrinkage estimator that shrinks the standard estimator toward a function f by an amount specified by α. If α = 0, μ̂_α reduces to the standard estimator μ̂. The following theorem asserts that the risk of the shrinkage estimator μ̂_α is smaller than that of the standard estimator μ̂ given an appropriate choice of α, regardless of the function f (more below).

Theorem 1. For all distributions P and all kernels k, there exists α > 0 for which R(μ, μ̂_α) ≤ R(μ, μ̂).

Proof. The risk of the standard kernel mean estimator satisfies E‖μ̂ − μ‖² = (1/n)(E[k(x, x)] − E[k(x, x̃)]) =: Δ, where x̃ is an independent copy of x. Let us denote the risk of the proposed shrinkage estimator by Δ_α := E‖μ̂_α − μ‖², where α is a non-negative shrinkage parameter. We can then write this in terms of the standard risk as Δ_α = Δ − 2α E⟨μ̂ − μ, μ̂ − f⟩ + α² E‖f‖² − 2α² E[f(x)] + α² E‖μ̂‖². It follows from the reproducing property of H that E[f(x)] = ⟨f, μ⟩. Moreover, using the fact that E‖μ̂‖² = E‖μ̂ − μ + μ‖² = Δ + E[k(x, x̃)], we can simplify the shrinkage risk to Δ_α = α²(Δ + ‖μ − f‖²) − 2αΔ + Δ. Thus, we have Δ_α − Δ = α²(Δ + ‖μ − f‖²) − 2αΔ, which is non-positive when

    α ∈ [0, 2Δ / (Δ + ‖μ − f‖²)]    (3)

and minimized at α = Δ / (Δ + ‖μ − f‖²).

As we can see in (3), there is a range of α for which a non-positive Δ_α − Δ, i.e., R(μ, μ̂_α) ≤ R(μ, μ̂), is guaranteed. However, Theorem 1 relies on the important assumption that the true kernel mean μ of the distribution P is required in order to estimate α. In spite of this, the theorem has an important implication: the shrinkage estimator μ̂_α can improve upon μ̂ if α is chosen appropriately. Later, we will exploit this result in order to construct more practical estimators.

Remark 1. The following observations follow immediately from Theorem 1:

- The shrinkage estimator always improves upon the standard one regardless of the direction of shrinkage, as specified by f. In other words, there exists a wide class of kernel mean estimators that are better than the standard one.
- The value of α also depends on the choice of f. The further f is from μ, the smaller α becomes. Thus, the shrinkage gets smaller if f is chosen such that it is far from the true kernel mean. This effect is akin to that of the James-Stein estimator.
- The improvement can be viewed as a bias-variance trade-off: the shrinkage estimator reduces the variance substantially at the expense of a little bias.

Remark 1 sheds light on how one can practically construct the shrinkage estimator: we can choose f arbitrarily as long as the parameter α is chosen appropriately. Moreover, further improvement can be gained by incorporating prior knowledge as to the location of μ_P, which can be straightforwardly integrated into the framework via f (Berger and Wolpert, 1983). Inspired by the James-Stein estimator, we focus on f = 0. We will investigate the effect of different priors f in future work.
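Theorem 1 is easy to probe numerically. The sketch below is an illustration under arbitrary assumptions (a standard normal P, a Gaussian RBF kernel, f = 0) rather than the protocol of §4; it estimates the risk E‖μ̂_α − μ‖²_H by Monte Carlo, approximating μ with the embedding of a large reference sample.

import numpy as np

def rbf(X, Y, s=1.0):
    # Gaussian RBF kernel matrix.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * s * s))

def shrinkage_risk(alpha, n=10, d=5, trials=200, m=1000, s=1.0, seed=0):
    # Monte Carlo estimate of E||mu_hat_alpha - mu||^2_H for f = 0, where the
    # true embedding mu is approximated by a large reference sample Z from P.
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(m, d))
    zz = rbf(Z, Z, s).mean()                 # approximates ||mu||^2
    losses = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        xx = rbf(X, X, s).mean()             # ||mu_hat||^2
        xz = rbf(X, Z, s).mean()             # approximates <mu_hat, mu>
        w = 1.0 - alpha
        losses.append(w * w * xx - 2.0 * w * xz + zz)
    return float(np.mean(losses))

# With the same seed, the two evaluations are paired (common random numbers).
print(shrinkage_risk(0.0), shrinkage_risk(0.05))

Scanning α in this way shows the risk dropping below its α = 0 value for small α and rising again once α leaves the interval in (3), mirroring the bias-variance trade-off noted in Remark 1.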
3. Kernel Mean Shrinkage Estimator

In this section we give a novel formulation of the kernel mean estimator that allows us to estimate the shrinkage parameter efficiently. In the following, let φ : X → H be a feature map associated with the kernel k and ⟨·,·⟩ the inner product in the RKHS H, such that k(x, x') = ⟨φ(x), φ(x')⟩. Unless stated otherwise, ‖·‖ denotes the RKHS norm. The kernel mean μ_P and its empirical estimate μ̂_P can be obtained as minimizers of the loss functionals

    E(g) := E_{x∼P} ‖φ(x) − g‖²,    Ê(g) := (1/n) Σ_{i=1}^n ‖φ(x_i) − g‖²,

respectively. We call the estimator minimizing the loss functional Ê(g) the kernel mean estimator (KME). Note that the loss E(g) is different from the one considered in §2, i.e., ℓ(μ, g) = ‖μ − g‖² = ‖E[φ(x)] − g‖². Nevertheless, we have ℓ(μ, g) = E_{x,x'} k(x, x') − 2 E_x g(x) + ‖g‖². Since E(g) = E_x k(x, x) − 2 E_x g(x) + ‖g‖², the loss ℓ(μ, g) differs from E(g) only by E_x k(x, x) − E_{x,x'} k(x, x'), which is not a function of g. We introduce the new form here because it will give a more tractable cross-validation computation (§3.1). In spite of this, the resulting estimators are always evaluated w.r.t. the loss in §2 (cf. §4.1).

From the formulation above, it is natural to ask if minimizing a regularized version of Ê(g) will give a better estimator. On the one hand, one can argue that, unlike in classical risk minimization, we do not really need a regularizer here. The standard estimator (2) is known to be, in a certain sense, optimal and can be estimated reliably (Shawe-Taylor and Cristianini, 2004, prop. 5.2). Moreover, the original formulation of Ê(g) is a well-posed problem. On the other hand, since regularization may be viewed as shrinking the solution toward zero, it can actually improve the kernel mean estimation, as suggested by Theorem 1 (cf. the discussion at the end of §2). Consequently, we minimize a modified loss functional

    Ê_λ(g) := Ê(g) + λ Ω(‖g‖) = (1/n) Σ_{i=1}^n ‖φ(x_i) − g‖² + λ Ω(‖g‖),    (4)

where Ω(·) denotes a monotonically increasing regularization functional and λ is a non-negative regularization parameter.¹ In what follows, we refer to the shrinkage estimator μ̂_λ minimizing Ê_λ(g) as a kernel mean shrinkage estimator (KMSE).

¹ The parameters α and λ play a similar role as a shrinkage parameter. They specify an amount by which the standard estimator μ̂ is shrunk toward f = 0. Thus, the terms shrinkage parameter and regularization parameter will be used interchangeably.

It follows from the representer theorem that g lies in a subspace spanned by the data, i.e., g = Σ_{j=1}^n β_j φ(x_j) for some β ∈ R^n. By considering Ω(‖g‖) = ‖g‖², we can rewrite (4) as

    (1/n) Σ_{i=1}^n ‖φ(x_i) − Σ_{j=1}^n β_j φ(x_j)‖² + λ ‖Σ_{j=1}^n β_j φ(x_j)‖² = β^⊤Kβ − 2β^⊤K1_n + λβ^⊤Kβ + c,    (5)

where c is a constant term, K is the n × n Gram matrix with K_ij = k(x_i, x_j), and 1_n = [1/n, 1/n, ..., 1/n]^⊤. Taking the derivative of (5) w.r.t. β and setting it to zero yields β = (1/(1+λ)) 1_n. Setting α = λ/(1+λ), the shrinkage estimate can be written as μ̂_λ = (1 − α) μ̂. Since 0 ≤ α ≤ 1, the estimator μ̂_λ corresponds to the shrinkage estimator discussed in §2 with f = 0. We call this estimator a simple kernel mean shrinkage estimator (S-KMSE).

Using the expansion g = Σ_{j=1}^n β_j φ(x_j), we may also consider a regularization functional written in terms of β, e.g., β^⊤β. This leads to a particularly interesting kernel mean estimator. In this case, the optimal weight vector is given by β = (K + λI)^{-1} K 1_n, and the shrinkage estimate can be written accordingly as μ̂_λ = Σ_{j=1}^n β_j φ(x_j) = Φ^⊤ (K + λI)^{-1} K 1_n, where Φ = [φ(x_1), φ(x_2), ..., φ(x_n)]^⊤. Unlike the S-KMSE, this estimator shrinks the usual estimate differently in each coordinate (cf. Theorem 2). Hence, we will call it a flexible kernel mean shrinkage estimator (F-KMSE). The following theorem characterizes the F-KMSE as a shrinkage estimator.

Theorem 2. The F-KMSE can be written as μ̂_λ = Σ_{i=1}^n [γ_i / (γ_i + λ)] ⟨μ̂, v_i⟩ v_i, where {γ_i, v_i} are the eigenvalue-eigenvector pairs of the empirical covariance operator Ĉ_xx in H.

In words, the effect of the F-KMSE is to reduce the high-frequency components of the expansion of μ̂, by expanding it in terms of the kernel PCA basis and shrinking the coefficients of the high-order eigenfunctions, e.g., see Rasmussen and Williams (2006, sec. 4.3). Note that the covariance operator Ĉ_xx itself does not depend on λ.

As we can see, the solution to the regularized problem is indeed of the form of the shrinkage estimators with f = 0. That is, both the S-KMSE and the F-KMSE shrink the standard kernel mean estimate towards zero. The difference is that the S-KMSE shrinks equally in all coordinates, whereas the F-KMSE also constrains the amount of shrinkage by the information contained in each coordinate. Moreover, the squared RKHS norm ‖μ − μ̂‖² can be decomposed as a sum of squared losses weighted by the eigenvalues γ_i (cf. Mandelbaum and Shepp (1987, appendix)).
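Both weight vectors derived above translate into a few lines of linear algebra. The following sketch assumes a precomputed Gram matrix and is meant only to make the formulas concrete; it is not the authors' implementation, and the function names are illustrative.

import numpy as np

def skmse_weights(K, lam):
    # S-KMSE weights: beta = (1 / (1 + lam)) * 1_n with 1_n = [1/n, ..., 1/n]^T,
    # i.e. uniform weights shrunk equally toward zero.
    n = K.shape[0]
    return np.full(n, 1.0 / n) / (1.0 + lam)

def fkmse_weights(K, lam):
    # F-KMSE weights: beta = (K + lam I)^{-1} K 1_n; shrinkage differs per coordinate.
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), K @ np.full(n, 1.0 / n))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)   # RBF Gram, sigma = 1
beta_s, beta_f = skmse_weights(K, 0.1), fkmse_weights(K, 0.1)
# Either way, the shrinkage estimate is the function mu(t) = sum_j beta_j k(x_j, t).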
By the same reasoning as for Stein's result in the finite-dimensional case, one would suspect that the improvement of shrinkage estimators in H should also depend on how fast the eigenvalues of k decay. That is, one would expect greater improvement if the values of γ_i decay very slowly. For example, the Gaussian RBF kernel with a larger bandwidth gives smaller improvement when compared to one with a smaller bandwidth. Similarly, we should expect to see more improvement when applying a Laplacian kernel than when using a Gaussian RBF kernel.

In some applications of kernel mean embeddings, one may want to interpret the weight vector β as a probability vector (Nishiyama et al., 2012). However, the weight vector β output by our estimators is in general not normalized. In fact, all elements will be smaller than 1/n as a result of shrinkage. One may impose the constraint that β must sum to one and resort to quadratic programming (Song et al., 2008). Unfortunately, this approach has an undesirable sparsifying effect which is unlikely to improve upon the standard estimator. Post-normalizing the weights often deteriorates the estimation performance.

To the best of our knowledge, no previous attempt has been made to improve the kernel mean estimation. However, we discuss some closely related works here. For example, instead of the loss functional Ê(g), Kim and Scott (2012) consider a robust loss function such as the Huber loss to reduce the effect of outliers. The authors consider kernel density estimators, which differ fundamentally from kernel mean estimators; they need to reduce the kernel bandwidth with increasing sample size for the estimators to be consistent. A regularized version of MMD was adopted by Danafar et al. (2013) in the context of kernel-based hypothesis testing; the resulting formulation resembles our S-KMSE. Furthermore, the F-KMSE is of a similar form as the conditional mean embedding used in Grünewälder et al. (2012), which can be viewed more generally as a regression problem in RKHS with smooth operators (Grünewälder et al., 2013).
3.1. Choosing the Shrinkage Parameter

As discussed in §2, the amount of shrinkage plays an important role in our estimators. In this work we propose to select the shrinkage parameter by an automatic leave-one-out cross-validation.

For a given shrinkage parameter λ, let us consider the observation x_i as being a new observation by omitting it from the dataset. Denote by μ̂_λ^(-i) = Σ_{j≠i} β_j^(-i) φ(x_j) the kernel mean estimated from the remaining data, using the value λ as a shrinkage parameter, so that β^(-i) is the minimizer of Ê_λ^(-i)(g). We measure the quality of μ̂_λ^(-i) by how well it approximates φ(x_i). The overall quality of the estimate is quantified by the cross-validation score

    LOOCV(λ) = (1/n) Σ_{i=1}^n ‖φ(x_i) − μ̂_λ^(-i)‖²_H.    (6)

By simple algebra, it is not difficult to show that the optimal shrinkage parameter of the S-KMSE can be calculated analytically, as stated by the following theorem.

Theorem 3. Let ρ := (1/n²) Σ_{i=1}^n Σ_{j=1}^n k(x_i, x_j) and ϱ := (1/n) Σ_{i=1}^n k(x_i, x_i). The shrinkage parameter λ = (ϱ − ρ) / ((n − 1)ρ + ϱ/n − ϱ) of the S-KMSE is the minimizer of LOOCV(λ).

On the other hand, finding the optimal λ for the F-KMSE is relatively more involved. Evaluating the score (6) naively requires one to solve for μ̂_λ^(-i) explicitly for every i. Fortunately, we can simplify the score such that it can be evaluated efficiently, as stated in the following theorem.

Theorem 4. The LOOCV score of the F-KMSE satisfies LOOCV(λ) = (1/n) Σ_{i=1}^n (Kβ − K_i)^⊤ C (Kβ − K_i), where β is the weight vector calculated from the full dataset with the shrinkage parameter λ, K_i is the i-th column of K, and C = (K − (1/n) K (K + λI)^{-1} K)^{-1} K (K − (1/n) K (K + λI)^{-1} K)^{-1}.

Proof of Theorem 4. For fixed λ and i, let μ̂_λ^(-i) be the leave-one-out kernel mean estimate of the F-KMSE and let A := (K + λI)^{-1}. Then, we can write an expression for the deleted residual as ξ^(-i) := μ̂_λ^(-i) − φ(x_i) = μ̂_λ − φ(x_i) + (1/n) Σ_{j=1}^n Σ_{l=1}^n A_{jl} ⟨φ(x_l), μ̂_λ^(-i) − φ(x_i)⟩ φ(x_j). Since ξ^(-i) lies in the subspace spanned by the sample φ(x_1), ..., φ(x_n), we have ξ^(-i) = Σ_{k=1}^n ψ_k φ(x_k) for some ψ ∈ R^n. Substituting ξ^(-i) back yields Σ_{k=1}^n ψ_k φ(x_k) = μ̂_λ − φ(x_i) + (1/n) Σ_{j=1}^n {AKψ}_j φ(x_j). By taking the inner product on both sides with the sample φ(x_1), ..., φ(x_n) and solving for ψ, we have ψ = (K − (1/n)KAK)^{-1} (Kβ − K_i). Consequently, the leave-one-out score of the sample x_i can be computed as ‖ξ^(-i)‖² = ψ^⊤ K ψ = (Kβ − K_i)^⊤ (K − (1/n)KAK)^{-1} K (K − (1/n)KAK)^{-1} (Kβ − K_i) = (Kβ − K_i)^⊤ C (Kβ − K_i). Averaging ‖ξ^(-i)‖² over all samples gives LOOCV(λ) = (1/n) Σ_{i=1}^n ‖ξ^(-i)‖² = (1/n) Σ_{i=1}^n (Kβ − K_i)^⊤ C (Kβ − K_i), as required.

It is interesting to see that the leave-one-out cross-validation score in Theorem 4 depends only on the non-leave-one-out solution β, which can be obtained as a by-product of the algorithm.

Computational complexity. The S-KMSE requires O(n²) operations to select the shrinkage parameter. For the F-KMSE, there are two steps in the cross-validation. First, we need to compute (K + λI)^{-1} repeatedly for different values of λ. Assume that we know the eigendecomposition K = UDU^⊤, where D is diagonal with d_i ≥ 0 and UU^⊤ = I. It follows that (K + λI)^{-1} = U(D + λI)^{-1}U^⊤. Consequently, solving for β takes O(n²) operations. Since the eigendecomposition requires O(n³) operations and is computed only once, finding β for many values of λ is essentially free afterwards. A low-rank approximation can also be adopted to reduce the computational cost further. Second, we need to compute the cross-validation score (6). As shown in Theorem 4, we can compute it using only the β obtained from the previous step. The calculation of C can be simplified further via the eigendecomposition of K as C = U(D − (1/n)D(D + λI)^{-1}D)^{-1} D (D − (1/n)D(D + λI)^{-1}D)^{-1} U^⊤. Since this only involves inverses of diagonal matrices, the inversion can be evaluated in O(n) operations. The overall cross-validation therefore requires only O(n²) operations, as opposed to the naive approach that requires O(n⁴) operations. When performed as a by-product of the algorithm, the computational cost of the cross-validation procedure becomes negligible as the dataset becomes larger. In practice, we use the fminsearch and fminbnd routines of the MATLAB optimization toolbox to find the best shrinkage parameter.
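Theorems 3 and 4 likewise reduce to simple linear algebra once the Gram matrix is available. The sketch below follows the formulas above and is illustrative rather than the MATLAB implementation mentioned in the text; it assumes K is well-conditioned (otherwise a small jitter should be added before inverting).

import numpy as np

def skmse_lambda(K):
    # Closed-form LOOCV-optimal shrinkage parameter of the S-KMSE (Theorem 3).
    n = K.shape[0]
    rho = K.mean()                 # (1/n^2) sum_{ij} k(x_i, x_j)
    varrho = np.trace(K) / n       # (1/n) sum_i k(x_i, x_i)
    return (varrho - rho) / ((n - 1) * rho + varrho / n - varrho)

def fkmse_loocv(K, lam):
    # LOOCV score of the F-KMSE (Theorem 4), using only the full-data weights.
    n = K.shape[0]
    A = np.linalg.solve(K + lam * np.eye(n), np.eye(n))    # (K + lam I)^{-1}
    beta = A @ K @ np.full(n, 1.0 / n)                     # full-data F-KMSE weights
    M = K - (K @ A @ K) / n
    C = np.linalg.solve(M, np.linalg.solve(M, K).T)        # C = M^{-1} K M^{-1}
    V = np.outer(K @ beta, np.ones(n)) - K                 # column i is K beta - K_i
    return float(np.einsum('ai,ab,bi->', V, C, V) / n)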
3.2. Covariance Operators

The covariance operator from H_X to H_Y can be viewed as a mean function in the product space H_X ⊗ H_Y. Hence, we can also construct a shrinkage estimator of the covariance operator in RKHS. Let (H_X, k_X) and (H_Y, k_Y) be RKHSs of functions on the measurable spaces X and Y, respectively, with p.d. kernels k_X and k_Y (with feature maps φ and ψ). We consider a random vector (X, Y) : Ω → X × Y with distribution P_XY, with P_X and P_Y as marginal distributions. Under some conditions, there exists a unique cross-covariance operator Σ_YX : H_X → H_Y such that ⟨g, Σ_YX f⟩_{H_Y} = E_XY[(f(X) − E_X[f(X)])(g(Y) − E_Y[g(Y)])] = Cov(f(X), g(Y)) holds for all f ∈ H_X and g ∈ H_Y (Fukumizu et al., 2004). If X equals Y, we get the self-adjoint operator Σ_XX, called the covariance operator.

Given an i.i.d. sample from P_XY written as (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), we can write the empirical cross-covariance operator as Σ̂_YX := (1/n) Σ_{i=1}^n φ(x_i) ⊗ ψ(y_i) − μ̂_X ⊗ μ̂_Y, where μ̂_X = (1/n) Σ_{i=1}^n φ(x_i) and μ̂_Y = (1/n) Σ_{i=1}^n ψ(y_i). Let φ̃ and ψ̃ be the centered feature maps of φ and ψ, respectively. Then, the operator can be rewritten as Σ̂_YX := (1/n) Σ_{i=1}^n φ̃(x_i) ⊗ ψ̃(y_i) ∈ H_X ⊗ H_Y. It follows from the inner product property in the product space that ⟨φ̃(x) ⊗ ψ̃(y), φ̃(x') ⊗ ψ̃(y')⟩_{H_X ⊗ H_Y} = ⟨φ̃(x), φ̃(x')⟩_{H_X} ⟨ψ̃(y), ψ̃(y')⟩_{H_Y} = k̃_X(x, x') k̃_Y(y, y'). Then, we can obtain the shrinkage estimators for the covariance operator by plugging the kernel k((x, y), (x', y')) = k̃_X(x, x') k̃_Y(y, y') into our KMSEs. We call this estimator a covariance-operator shrinkage estimator (COSE). The same trick can easily be generalized to tensors of higher order, which have been used previously, for example, in Song et al. (2011).
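In code, this plug-in construction amounts to forming the Hadamard product of the centered Gram matrices and reusing the F-KMSE weight formula, as in the following sketch (assuming precomputed Gram matrices Kx and Ky; the function names are illustrative, not part of the paper).

import numpy as np

def center_gram(K):
    # Gram matrix of the centered feature maps: K_tilde = H K H with H = I - (1/n) 1 1^T.
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H

def cose_weights(Kx, Ky, lam):
    # Shrinkage weights for the (cross-)covariance operator: plug the product kernel
    # k_tilde_X(x, x') * k_tilde_Y(y, y') (Hadamard product of centered Grams) into
    # the F-KMSE weight formula.
    K = center_gram(Kx) * center_gram(Ky)
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), K @ np.full(n, 1.0 / n))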
4. Experiments

We focus on the comparison between our shrinkage estimators and the standard estimator of the kernel mean, using both synthetic and real-world datasets.

4.1. Synthetic Data

Given the true data-generating distribution P, we evaluate different estimators using the loss function ℓ(β) := ‖Σ_{i=1}^n β_i k(x_i, ·) − E_P[k(x, ·)]‖²_H, where β is the weight vector associated with each estimator. To allow for an exact calculation of ℓ(β), we consider the case where P is a mixture-of-Gaussians distribution and k is one of the following kernel functions: 1) the linear kernel k(x, x') = x^⊤x'; 2) the polynomial degree-2 kernel k(x, x') = (x^⊤x' + 1)²; 3) the polynomial degree-3 kernel k(x, x') = (x^⊤x' + 1)³; and 4) the Gaussian RBF kernel k(x, x') = exp(−‖x − x'‖² / (2σ²)). We will refer to them as LIN, POLY2, POLY3, and RBF, respectively.

Experimental protocol. Data are generated from a d-dimensional mixture of Gaussians:

    x ∼ Σ_{i=1}^4 π_i N(θ_i, Σ_i) + ε,    θ_ij ∼ U(−10, 10),    Σ_i ∼ W(2 I_d, 7),    ε ∼ N(0, 0.2 I_d),

where U(a, b) and W(Σ_0, df) represent the uniform distribution and the Wishart distribution, respectively. We set π = [0.05, 0.3, 0.4, 0.25]. The choice of parameters here is quite arbitrary; we have experimented with various parameter settings and the results are similar to those presented here. For the Gaussian RBF kernel, we set the bandwidth parameter to the square root of the median Euclidean distance between samples in the dataset (i.e., σ² = median‖x_i − x_j‖² throughout).
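A sketch of this data-generating process, under the stated parameters, looks as follows. It is an illustration rather than the original experiment code; the Wishart draw is realized as a sum of outer products so that component samples can be drawn without an explicit Cholesky factor.

import numpy as np

def sample_mixture(n, d, rng):
    # 4-component Gaussian mixture: means ~ U(-10, 10), covariances ~ W(2 I_d, 7),
    # mixing weights pi = [0.05, 0.3, 0.4, 0.25], plus additive noise N(0, 0.2 I_d).
    pi = np.array([0.05, 0.3, 0.4, 0.25])
    means = rng.uniform(-10, 10, size=(4, d))
    # A Wishart(2 I_d, 7) draw is a sum of 7 outer products of N(0, 2 I_d) vectors;
    # keeping the factors lets us draw from N(0, Sigma_i) directly.
    factors = [rng.normal(scale=np.sqrt(2.0), size=(d, 7)) for _ in range(4)]
    comps = rng.choice(4, size=n, p=pi)
    X = np.empty((n, d))
    for t, c in enumerate(comps):
        X[t] = means[c] + factors[c] @ rng.normal(size=7)
    X += rng.normal(scale=np.sqrt(0.2), size=(n, d))
    return X

X = sample_mixture(100, 30, np.random.default_rng(0))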
Figure 1 shows the average loss of the different estimators for the different kernels as we increase the value of the shrinkage parameter. Here we scale the shrinkage parameter by the minimum non-zero eigenvalue γ_0 of the kernel matrix K. In general, we find that the S-KMSE and F-KMSE tend to outperform the KME. However, as λ becomes large, there are some cases where shrinkage deteriorates the estimation performance, e.g., see the LIN kernel and some outliers in the figures. This suggests that it is very important to choose the parameter appropriately (cf. the discussion in §2).

Similarly, Figure 2 depicts the average loss as we vary the sample size and the dimension of the data. In this case, the shrinkage parameter is chosen by the proposed leave-one-out cross-validation score. As we can see, both the S-KMSE and F-KMSE outperform the standard KME. The S-KMSE performs slightly better than the F-KMSE. Moreover, the improvement is more substantial in the "large d, small n" paradigm. In the worst cases, the S-KMSE and F-KMSE perform as well as the KME. Lastly, it is instructive to note that the improvement varies with the choice of kernel k. Briefly, the choice of kernel reflects the dimensionality of the feature space H. One would expect more improvement in a high-dimensional feature space, e.g., with the RBF kernel, than in a low-dimensional one, e.g., with the linear kernel (cf. the discussion at the end of §3). This phenomenon can be observed in both Figure 1 and Figure 2.

[Figure 1. The average loss of KME (left), S-KMSE (middle), and F-KMSE (right) with different values of the shrinkage parameter, shown for the LIN, POLY2, POLY3, and RBF kernels. Inside boxes correspond to estimators. We repeat the experiments over 30 different distributions with n = 10 and d = 30.]

[Figure 2. The average loss over 30 different distributions of KME, S-KMSE, and F-KMSE with varying sample size (n) and dimension (d) for the LIN, POLY2, POLY3, and RBF kernels. The shrinkage parameter is chosen by LOOCV.]

4.2. Real Data

We consider three benchmark applications: density estimation via kernel mean matching (Song et al., 2008), kernel PCA using shrinkage mean and covariance operators (Schölkopf et al., 1998), and discriminative learning on distributions (Muandet and Schölkopf, 2013; Muandet et al., 2012). For the first two tasks we employ 15 datasets from the UCI repositories. We use only real-valued features, each of which is normalized to have zero mean and unit variance.

Density estimation. We perform density estimation via kernel mean matching (Song et al., 2008). That is, we fit the density Q = Σ_{j=1}^m π_j N(θ_j, σ_j² I) to each dataset by minimizing ‖μ̂ − μ_Q‖²_H subject to Σ_{j=1}^m π_j = 1. The kernel mean μ̂ is obtained from the samples using the different estimators, whereas μ_Q is the kernel mean embedding of the density Q. Unlike the experiments in Song et al. (2008), our goal is to compare different estimators of μ_P, where P is the true data distribution. That is, we replace μ̂ with a version obtained via shrinkage. A better estimate of μ_P should lead to better density estimation, as measured by the negative log-likelihood of Q on the test set. We use 30% of the dataset as a test set. We set m = 10 for each dataset. The model is initialized by running 50 random initializations using the k-means algorithm and returning the best. We repeat the experiments 10 times and perform the paired sign test on the results at the 5% significance level.²

² The paired sign test is a nonparametric test that can be used to examine whether two paired samples have the same distribution. In our case, we compare the S-KMSE and F-KMSE against the KME.
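For orientation, the matching objective can be written out explicitly when k is the Gaussian RBF kernel, since the embedding of a Gaussian mixture is then available in closed form. The sketch below computes ‖μ̂ − μ_Q‖²_H for given mixture parameters and weights β; the optimization over (π_j, θ_j, σ_j) under the simplex constraint is omitted, and the choice of an isotropic RBF kernel is an assumption of the example.

import numpy as np

def mu_Q(X, pis, thetas, s2s, sigma2):
    # Embedding of Q = sum_j pi_j N(theta_j, s2_j I) under the RBF kernel with
    # bandwidth sigma2, evaluated at the rows of X (closed-form Gaussian integral).
    d = X.shape[1]
    out = np.zeros(len(X))
    for pj, tj, s2 in zip(pis, thetas, s2s):
        v = sigma2 + s2
        out += pj * (sigma2 / v) ** (d / 2) * np.exp(-((X - tj) ** 2).sum(1) / (2 * v))
    return out

def matching_objective(beta, X, K, pis, thetas, s2s, sigma2):
    # ||mu_hat - mu_Q||_H^2 = beta^T K beta - 2 sum_i beta_i mu_Q(x_i) + ||mu_Q||^2,
    # where K is the RBF Gram matrix of X with the same bandwidth sigma2.
    d = X.shape[1]
    qq = 0.0
    for pj, tj, s2j in zip(pis, thetas, s2s):
        for pl, tl, s2l in zip(pis, thetas, s2s):
            v = sigma2 + s2j + s2l
            qq += pj * pl * (sigma2 / v) ** (d / 2) * np.exp(-((tj - tl) ** 2).sum() / (2 * v))
    return beta @ K @ beta - 2.0 * beta @ mu_Q(X, pis, thetas, s2s, sigma2) + qq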
The average negative log-likelihood of the model Q, optimized via the different estimators, is reported in Table 1. Clearly, both the S-KMSE and F-KMSE consistently achieve smaller negative log-likelihood when compared to the KME. There are, however, a few cases in which the KME outperforms the proposed estimators, especially when the dataset is relatively large, e.g., satimage and abalone. We suspect that in those cases the standard KME already provides an accurate estimate of the kernel mean. To get a better estimate, more effort is required to optimize for the shrinkage parameter. Moreover, the improvement across different kernels is consistent with the results on the synthetic datasets.

Table 1. Average negative log-likelihood of the model Q on test points over 10 randomizations. For each kernel, the three values are for KME, S-KMSE, and F-KMSE, in that order. Boldface marks results whose difference from the baseline, i.e., the KME, is statistically significant.

Dataset     | LIN                     | POLY2                      | POLY3                      | RBF
ionosphere  | 33.2440 33.0325 33.1436 | 53.1266 53.7067 50.8695    | 51.6800 49.9149 47.4461    | 40.8961 40.5578 39.6804
sonar       | 72.6630 72.8770 72.5015 | 120.3454 108.8246 109.9980 | 102.4499 90.3920 91.1547   | 71.3048 70.5721 70.5830
australian  | 18.3703 18.3341 18.3719 | 18.5928 18.6028 18.4987    | 41.1563 34.4303 34.5460    | 17.5138 17.5637 17.4026
specft      | 56.6138 55.7374 55.8667 | 67.3901 65.9662 65.2056    | 63.9273 63.5571 62.1480    | 57.5569 56.1386 55.5808
wdbc        | 30.9778 30.9266 30.4400 | 93.0541 91.5803 87.5265    | 58.8235 54.1237 50.3911    | 30.8227 30.5968 30.2646
wine        | 15.9225 15.8850 16.0431 | 24.2841 24.1325 23.5163    | 35.2069 32.9465 32.4702    | 17.1523 16.9177 16.6312
satimage    | 19.6353 19.8721 19.7943 | 149.5986 143.2277 146.0648 | 52.7973 57.2482 45.8946    | 20.3306 20.5020 20.2226
segment     | 22.9131 22.8219 22.0696 | 61.2712 59.4387 54.8621    | 38.7226 38.6226 38.4217    | 17.6801 16.4149 15.6814
vehicle     | 16.4145 16.2888 16.3210 | 83.1597 79.7248 79.6679    | 70.4340 63.4322 48.0177    | 15.9256 15.8331 15.6516
svmguide2   | 27.1514 27.0644 27.1144 | 30.3065 30.2290 29.9875    | 37.0427 36.7854 35.8157    | 27.3930 27.2517 27.1815
vowel       | 12.4227 12.4219 12.4264 | 32.1389 28.0474 29.3492    | 25.8728 24.0684 23.9747    | 12.3976 12.3823 12.3677
housing     | 15.5249 15.1618 15.3176 | 39.9582 37.1360 32.1028    | 50.8481 49.0884 35.1366    | 14.5576 14.3810 13.9379
bodyfat     | 17.6426 17.0419 17.2152 | 44.3295 43.7959 42.3331    | 27.4339 25.6530 24.7955    | 16.2725 15.9170 15.8665
abalone     | 4.3348 4.3274 4.3187    | 14.9166 14.4041 11.4431    | 20.6071 23.2487 23.6291    | 4.6928 4.6056 4.6017
glass       | 10.4078 10.4451 10.4067 | 33.3480 31.6110 30.5075    | 45.0801 34.9608 25.5677    | 8.6167 8.4992 8.2469

Kernel PCA. In this experiment, we perform KPCA using the different estimates of the mean and covariance operators. We compare the reconstruction error E_proj(z) = ‖φ(z) − Pφ(z)‖² on test samples, where P is the projection constructed from the first 20 principal components. We use a Gaussian RBF kernel for all datasets. We compare 5 different scenarios: 1) standard KPCA; 2) shrinkage centering with the S-KMSE; 3) shrinkage centering with the F-KMSE; 4) KPCA with the S-COSE; and 5) KPCA with the F-COSE. To perform KPCA with the shrinkage covariance operator, we solve the generalized eigenvalue problem K_c B K_c V = K_c V D, where B = diag(β) and K_c is the centered Gram matrix. The weight vector β is obtained from the shrinkage estimators using the kernel matrix K_c ∘ K_c, where ∘ denotes the Hadamard product. We use 30% of the dataset as a test set.

[Figure 3. The average reconstruction error of KPCA on hold-out test samples over 10 repetitions, across the 15 datasets. The KME represents the standard approach, whereas the S-KMSE and F-KMSE use shrinkage means to perform centering. The S-COSE and F-COSE directly use the shrinkage estimate of the covariance operator.]

Figure 3 illustrates the results of KPCA. Clearly, the S-COSE and F-COSE consistently outperform all other estimators. Although we observe an improvement of the S-KMSE and F-KMSE over the KME, it is very small compared to that of the S-COSE and F-COSE. This makes sense intuitively, since changing the mean point or shifting the data does not change the covariance structure considerably, so it will not significantly affect the reconstruction error.

Discriminative learning on distributions. A positive semi-definite kernel between distributions can be defined via their kernel mean embeddings. That is, given a training sample (P̂_1, y_1), ..., (P̂_m, y_m) ∈ P × {−1, +1}, where P̂_i := (1/n) Σ_{k=1}^n δ_{x_k^i} and x_k^i ∼ P_i, the linear kernel between two distributions is approximated by ⟨μ̂_{P_i}, μ̂_{P_j}⟩ = ⟨Σ_{k=1}^n β_k^i φ(x_k^i), Σ_{l=1}^n β_l^j φ(x_l^j)⟩ = Σ_{k,l=1}^n β_k^i β_l^j k(x_k^i, x_l^j). The weight vectors β^i and β^j come from the kernel mean estimates of μ_{P_i} and μ_{P_j}, respectively. The non-linear kernel can then be defined accordingly, e.g., κ(P_i, P_j) = exp(−‖μ̂_{P_i} − μ̂_{P_j}‖²_H / (2σ²)).
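Given the per-distribution weight vectors, both kernels on distributions reduce to sums over Gram blocks, as in the following sketch (assuming an RBF embedding kernel; the function names are illustrative).

import numpy as np

def rbf(X, Y, s=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * s * s))

def linear_kernel_between_distributions(Xi, bi, Xj, bj, s=1.0):
    # <mu_hat_Pi, mu_hat_Pj> = sum_{k,l} b^i_k b^j_l k(x^i_k, x^j_l).
    return bi @ rbf(Xi, Xj, s) @ bj

def nonlinear_kernel_between_distributions(Xi, bi, Xj, bj, s=1.0, sigma=1.0):
    # exp(-||mu_hat_Pi - mu_hat_Pj||_H^2 / (2 sigma^2)), expanded via the Gram blocks.
    sq = (bi @ rbf(Xi, Xi, s) @ bi
          - 2.0 * bi @ rbf(Xi, Xj, s) @ bj
          + bj @ rbf(Xj, Xj, s) @ bj)
    return np.exp(-sq / (2.0 * sigma * sigma))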
Our goal in this experiment is to investigate whether the shrinkage estimate of the kernel mean improves the performance of discriminative learning on distributions. To this end, we conduct experiments on natural scene categorization using the support measure machine (SMM) (Muandet et al., 2012) and on group anomaly detection on a high-energy physics dataset using the one-class SMM (OCSMM) (Muandet and Schölkopf, 2013). We use both linear and non-linear kernels, where the Gaussian RBF kernel is employed as an embedding kernel (Muandet et al., 2012). All hyper-parameters are chosen by 10-fold cross-validation. For the unsupervised problem, we repeat the experiments using several parameter settings and report the best results.

Table 2 reports the classification accuracy of the SMM and the area under the ROC curve (AUC) of the OCSMM using the different kernel mean estimators. Both shrinkage estimators consistently lead to better performance on both the SMM and OCSMM when compared to the KME.

Table 2. The classification accuracy of SMM and the area under the ROC curve (AUC) of OCSMM using different kernel mean estimators to construct the kernel on distributions.

Estimator | Linear SMM | Linear OCSMM | Non-linear SMM | Non-linear OCSMM
KME       | 0.5432     | 0.6955       | 0.6017         | 0.9085
S-KMSE    | 0.5521     | 0.6970       | 0.6303         | 0.9105
F-KMSE    | 0.5610     | 0.6970       | 0.6522         | 0.9095

To summarize, we find sufficient evidence to conclude that both the S-KMSE and F-KMSE outperform the standard KME. The performance of the S-KMSE and F-KMSE is very competitive; the difference depends on the dataset and the kernel function.

5. Conclusions

To conclude, we show that the commonly used kernel mean estimator can be improved. Our theoretical result suggests that there exists a wide class of kernel mean estimators that are better than the standard one. To demonstrate this, we focus on two efficient shrinkage estimators, namely the simple and the flexible kernel mean shrinkage estimators. The empirical study clearly shows that the proposed estimators outperform the standard one in various scenarios. Most importantly, the shrinkage estimates not only provide more accurate estimation, but also lead to superior performance in real-world applications.

Acknowledgments

The authors wish to thank David Hogg and Ross Fedely for reading the first draft, and the anonymous reviewers who gave valuable suggestions that have helped to improve the manuscript.

References

J. Berger and R. Wolpert. Estimating the mean function of a Gaussian process and the Stein effect. Journal of Multivariate Analysis, 13(3):401–424, 1983.

J. O. Berger. Admissible minimax estimation of a multivariate normal mean with arbitrary quadratic loss. Annals of Statistics, 4(1):223–226, 1976.

A. Berlinet and T. C. Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.

S. Danafar, P. M. V. Rancoita, T. Glasmachers, K. Whittingstall, and J. Schmidhuber. Testing hypotheses by regularized maximum mean discrepancy. 2013.

I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 551–556, New York, NY, USA, 2004.

K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.

K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule. In Advances in Neural Information Processing Systems (NIPS), pages 1737–1745, 2011.

A. Gretton, R. Herbrich, A. Smola, B. Schölkopf, and A. Hyvärinen. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.

A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems (NIPS), 2007.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

S. Grünewälder, G. Lever, A. Gretton, L. Baldassarre, S. Patterson, and M. Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

S. Grünewälder, A. Gretton, and J. Shawe-Taylor. Smooth operators. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 361–379. University of California Press, 1961.

J. Kim and C. D. Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13:2529–2565, 2012.
A. Mandelbaum and L. A. Shepp. Admissibility as a touchstone. Annals of Statistics, 15(1):252–268, 1987.

K. Muandet and B. Schölkopf. One-class support measure machines for group anomaly detection. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2013.

K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems (NIPS), pages 10–18, 2012.

Y. Nishiyama, A. Boularias, A. Gretton, and K. Fukumizu. Hilbert space embeddings of POMDPs. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI), pages 644–653, 2012.

N. Privault and A. Réveillac. Stein estimation for the drift of Gaussian processes using the Malliavin calculus. Annals of Statistics, 36(5):2531–2550, 2008.

C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, July 1998.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.

A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT), pages 13–31. Springer-Verlag, 2007.

L. Song, X. Zhang, A. Smola, A. Gretton, and B. Schölkopf. Tailoring density estimation via reproducing kernel moment matching. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 992–999, 2008.

L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

L. Song, A. P. Parikh, and E. P. Xing. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems (NIPS), pages 2708–2716, 2011.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In COLT, 2008.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206. University of California Press, 1955.