Journal of Machine Learning Research 6 (2005) 1939-1959. Submitted 10/05; Published 12/05.

A Unifying View of Sparse Approximate Gaussian Process Regression

Joaquin Quiñonero-Candela (JQC@TUEBINGEN.MPG.DE)
Carl Edward Rasmussen (CARL@TUEBINGEN.MPG.DE)
Max Planck Institute for Biological Cybernetics, Spemannstraße 38, 72076 Tübingen, Germany

Editor: Ralf Herbrich

Abstract

We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints.

Keywords: Gaussian process, probabilistic regression, sparse approximation, Bayesian committee machine

Regression models based on Gaussian processes (GPs) are simple to implement, flexible, fully probabilistic models, and thus a powerful tool in many areas of application. Their main limitation is that memory requirements and computational demands grow as the square and cube, respectively, of the number of training cases n, effectively limiting a direct implementation to problems with at most a few thousand cases. To overcome the computational limitations numerous authors have recently suggested a wealth of sparse approximations. Common to all these approximation schemes is that only a subset of the latent variables are treated exactly, and the remaining variables are given some approximate, but computationally cheaper treatment. However, the published algorithms have widely different motivations, emphasis and exposition, so it is difficult to get an overview (see Rasmussen and Williams, 2006, chapter 8) of how they relate to each other, and which can be expected to give rise to the best algorithms.

In this paper we provide a unifying view of sparse approximations for GP regression. Our approach is simple, but powerful: for each algorithm we analyze the posterior, and compute the effective prior which it is using. Thus, we reinterpret the algorithms as "exact inference with an approximated prior", rather than the existing (ubiquitous) interpretation "approximate inference with the exact prior". This approach has the advantage of directly expressing the approximations in terms of prior assumptions about the function, which makes the consequences of the approximations much easier to understand. While our view of the approximations is not the only one possible, it has the advantage of putting all existing probabilistic sparse approximations under one umbrella, thus enabling direct comparison and revealing the relation between them.

In Section 1 we briefly introduce GP models for regression. In Section 2 we present our unifying framework and write out the key equations in preparation for the unifying analysis of sparse algorithms in Sections 4-7.

(c) 2005 Joaquin Quiñonero-Candela and Carl Edward Rasmussen.
The relation of transduction and augmentation to our sparse framework is covered in Section 8. All our approximations are written in terms of a new set of inducing variables. The choice of these variables is itself a challenging problem, and is discussed in Section 9. We comment on a few special approximations outside our general scheme in Section 10, and conclusions are drawn at the end.

1. Gaussian Processes for Regression

Probabilistic regression is usually formulated as follows: given a training set D = {(x_i, y_i), i = 1, ..., n} of n pairs of (vectorial) inputs x_i and noisy (real, scalar) outputs y_i, compute the predictive distribution of the function values f_* (or noisy y_*) at test locations x_*. In the simplest case (which we deal with here) we assume that the noise is additive, independent and Gaussian, such that the relationship between the (latent) function f(x) and the observed noisy targets y is given by

    y_i = f(x_i) + \epsilon_i, \quad \text{where} \quad \epsilon_i \sim N(0, \sigma^2_{noise}),    (1)

where \sigma^2_{noise} is the variance of the noise.

Definition 1  A Gaussian process (GP) is a collection of random variables, any finite number of which have consistent joint Gaussian distributions (by consistency is meant simply that the random variables obey the usual rules of marginalization, etc.).

Gaussian process (GP) regression is a Bayesian approach which assumes a GP prior over functions (for notational simplicity we exclusively use zero-mean priors), i.e. assumes a priori that function values behave according to

    p(f | x_1, x_2, ..., x_n) = N(0, K),    (2)

where f = [f_1, f_2, ..., f_n]^T is a vector of latent function values, f_i = f(x_i), and K is a covariance matrix, whose entries are given by the covariance function, K_{ij} = k(x_i, x_j). Note that the GP treats the latent function values f_i as random variables, indexed by the corresponding input. In the following, for simplicity, we will always neglect the explicit conditioning on the inputs; the GP model and all expressions are always conditional on the corresponding inputs. The GP model is concerned only with the conditional of the outputs given the inputs; we do not model anything about the inputs themselves.

Remark 2  Note that to adhere to a strict Bayesian formalism, the GP covariance function, which defines the prior, should not depend on the data (although it can depend on additional parameters). The covariance function itself shouldn't depend on the data, though its value at a specific pair of inputs of course will.

As we will see in later sections, some approximations are strictly equivalent to GPs, while others are not. That is, the implied prior may still be multivariate Gaussian, but the covariance function may be different for training and test cases.

Definition 3  A Gaussian process is called degenerate iff the covariance function has a finite number of non-zero eigenvalues.
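To make Definition 3 concrete, the following NumPy sketch (an illustration of ours; the specific kernels, input range and lengthscale are arbitrary choices, not prescribed by the text) compares the rank of covariance matrices built from a degenerate polynomial covariance and a non-degenerate squared exponential covariance:

```python
import numpy as np

def k_poly(a, b, degree=2):
    # degenerate: only finitely many non-zero eigenvalues, whatever n is
    return (1.0 + a * b) ** degree

def k_se(a, b, lengthscale=1.0):
    # non-degenerate (squared exponential / RBF)
    return np.exp(-0.5 * (a - b) ** 2 / lengthscale ** 2)

x = np.linspace(-3, 3, 50)
K_poly = k_poly(x[:, None], x[None, :])
K_se = k_se(x[:, None], x[None, :])

print(np.linalg.matrix_rank(K_poly))  # 3 = number of monomials (1, x, x^2), independent of len(x)
print(np.linalg.matrix_rank(K_se))    # grows with len(x); limited only by floating-point precision
```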
Degenerate GPs (such as e.g. with polynomial covariance function) correspond to finite linear (-in-the-parameters) models, whereas non-degenerate GPs (such as e.g. with squared exponential or RBF covariance function) do not. The prior for a finite m-dimensional linear model only considers a universe of at most m linearly independent functions; this may often be too restrictive when n >> m. Note however, that non-degeneracy on its own doesn't guarantee the existence of the "right kind" of flexibility for a given particular modelling task. For a more detailed background on GP models, see for example that of Rasmussen and Williams (2006).

Inference in the GP model is simple: we put a joint GP prior on training and test latent values, f and f_* (we will mostly consider a vector of test cases f_* rather than a single f_*), and combine it with the likelihood p(y | f) using Bayes rule (you may have been expecting the likelihood written as p(y | f, f_*), but since the likelihood is conditionally independent of everything else given f, this makes no difference), to obtain the joint posterior

    p(f, f_* | y) = p(f, f_*) p(y | f) / p(y).    (3)

The final step needed to produce the desired posterior predictive distribution is to marginalize out the unwanted training set latent variables:

    p(f_* | y) = \int p(f, f_* | y) \, df = \frac{1}{p(y)} \int p(y | f) p(f, f_*) \, df,    (4)

or in words: the predictive distribution is the marginal of the renormalized joint prior times the likelihood. The joint GP prior and the independent likelihood are both Gaussian,

    p(f, f_*) = N(0, \begin{bmatrix} K_{ff} & K_{f*} \\ K_{*f} & K_{**} \end{bmatrix}), \quad \text{and} \quad p(y | f) = N(f, \sigma^2_{noise} I),    (5)

where K is subscripted by the variables between which the covariance is computed (and we use the asterisk as shorthand for f_*) and I is the identity matrix. Since both factors in the integral are Gaussian, the integral can be evaluated in closed form to give the Gaussian predictive distribution

    p(f_* | y) = N(K_{*f} (K_{ff} + \sigma^2_{noise} I)^{-1} y, \; K_{**} - K_{*f} (K_{ff} + \sigma^2_{noise} I)^{-1} K_{f*}),    (6)

see the relevant Gaussian identity in Appendix A. The problem with the above expression is that it requires inversion of a matrix of size n x n, which requires O(n^3) operations, where n is the number of training cases. Thus, the simple exact implementation can handle problems with at most a few thousand training cases.
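As an illustration of (6), a minimal NumPy sketch of exact GP prediction might look as follows. The squared exponential covariance, the Cholesky-based solve and the toy data are illustrative assumptions of ours; the O(n^3) cost of the factorisation is what the sparse methods below set out to avoid.

```python
import numpy as np

def k_se(A, B, lengthscale=1.0):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * d ** 2 / lengthscale ** 2)

def gp_predict(X, y, Xstar, sigma2_noise, lengthscale=1.0):
    K_ff = k_se(X, X, lengthscale)
    K_sf = k_se(Xstar, X, lengthscale)
    K_ss = k_se(Xstar, Xstar, lengthscale)
    L = np.linalg.cholesky(K_ff + sigma2_noise * np.eye(len(X)))   # O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))            # (K_ff + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, K_sf.T)
    mean = K_sf @ alpha                                            # predictive mean of (6)
    cov = K_ss - V.T @ V                                           # predictive covariance of (6)
    return mean, cov

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
mu, Sigma = gp_predict(X, y, np.linspace(-6, 6, 5), sigma2_noise=0.01)
```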
2. A New Unifying View

We now seek to modify the joint prior p(f_*, f) from (5) in ways which will reduce the computational requirements from (6). Let us first rewrite that prior by introducing an additional set of m latent variables u = [u_1, ..., u_m]^T, which we call the inducing variables. These latent variables are values of the Gaussian process (as also f and f_*), corresponding to a set of input locations X_u, which we call the inducing inputs. Whereas the additional latent variables u are always marginalized out in the predictive distribution, the choice of inducing inputs does leave an imprint on the final solution. The inducing variables will turn out to be generalizations of variables which other authors have referred to variously as "support points", "active set" or "pseudo-inputs". Particular sparse algorithms choose the inducing variables in various different ways; some algorithms choose the inducing inputs to be a subset of the training set, others not, as we will discuss in Section 9. For now consider any arbitrary inducing variables.

Due to the consistency of Gaussian processes, we know that we can recover p(f, f_*) by simply integrating (marginalizing) out u from the joint GP prior p(f, f_*, u),

    p(f, f_*) = \int p(f, f_*, u) \, du = \int p(f, f_* | u) p(u) \, du, \quad \text{where} \quad p(u) = N(0, K_{uu}).    (7)

This is an exact expression. Now, we introduce the fundamental approximation which gives rise to almost all sparse approximations. We approximate the joint prior by assuming that f and f_* are conditionally independent given u, see Figure 1, such that

    p(f, f_*) \approx q(f, f_*) = \int q(f | u) q(f_* | u) p(u) \, du.    (8)

The name inducing variable is motivated by the fact that f and f_* can only communicate through u, and u therefore induces the dependencies between training and test cases.

[Figure 1: Graphical model of the relation between the inducing variables u, the training latent function values f = [f_1, ..., f_n]^T and the test function value f_*. The thick horizontal line represents a set of fully connected nodes. The observations y_1, ..., y_n, y_* (not shown) would dangle individually from the corresponding latent values, by way of the exact (factored) likelihood (5). Left graph: the fully connected graph corresponds to the case where no approximation is made to the full joint Gaussian process distribution between these variables. The inducing variables u are superfluous in this case, since all latent function values can communicate with all others. Right graph: assumption of conditional independence between training and test function values given u. This gives rise to the separation between training and test conditionals from (8). Notice that having cut the communication path between training and test latent function values, information from f can only be transmitted to f_* via the inducing variables u.]

As we shall detail in the following sections, the different computationally efficient algorithms proposed in the literature correspond to different additional assumptions about the two approximate inducing conditionals q(f | u), q(f_* | u) of the integral in (8). It will be useful for future reference to specify here the exact expressions for the two conditionals:

    training conditional:  p(f | u) = N(K_{fu} K_{uu}^{-1} u, \; K_{ff} - Q_{ff}),    (9a)
    test conditional:      p(f_* | u) = N(K_{*u} K_{uu}^{-1} u, \; K_{**} - Q_{**}),    (9b)

where we have introduced the shorthand notation Q_{ab} \triangleq K_{au} K_{uu}^{-1} K_{ub} (note that Q_{ab} depends on u although this is not explicit in the notation). We can readily identify the expressions in (9) as special (noise free) cases of the standard predictive equation (6), with u playing the role of (noise free) observations. Note that the (positive semi-definite) covariance matrices in (9) have the form K - Q with the following interpretation: the prior covariance K minus a (non-negative definite) matrix Q quantifying how much information u provides about the variables in question (f or f_*). We emphasize that all the sparse methods discussed in the paper correspond simply to different approximations to the conditionals in (9), and throughout we use the exact likelihood and inducing prior

    p(y | f) = N(f, \sigma^2_{noise} I), \quad \text{and} \quad p(u) = N(0, K_{uu}).    (10)
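The shorthand Q_{ab} and the exact conditionals (9) translate directly into code. The helper below is a sketch under the same illustrative squared exponential kernel assumption as before; the jitter term is a numerical convenience, not part of the model.

```python
import numpy as np

def k_se(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

def Q(K_au, K_uu, K_ub, jitter=1e-10):
    # Q_ab = K_au K_uu^{-1} K_ub, with K_uu^{-1} applied via a Cholesky solve
    L = np.linalg.cholesky(K_uu + jitter * np.eye(len(K_uu)))
    return K_au @ np.linalg.solve(L.T, np.linalg.solve(L, K_ub))

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, 100)          # training inputs (illustrative)
Xu = np.linspace(-5, 5, 10)          # inducing inputs (illustrative)
K_ff, K_fu, K_uu = k_se(X, X), k_se(X, Xu), k_se(Xu, Xu)

Q_ff = Q(K_fu, K_uu, K_fu.T)
# training conditional (9a): p(f | u) = N(K_fu K_uu^{-1} u, K_ff - Q_ff)
u = rng.standard_normal(10)          # some value of the inducing variables
cond_mean = K_fu @ np.linalg.solve(K_uu + 1e-10 * np.eye(10), u)
cond_cov = K_ff - Q_ff               # positive semi-definite "K minus Q" form
```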
3. The Subset of Data (SoD) Approximation

Before we get started with the more sophisticated approximations, we mention as a baseline method the simplest possible sparse approximation (which doesn't fall inside our general scheme): use only a subset of the data (SoD). The computational complexity is reduced to O(m^3), where m < n. We would not generally expect SoD to be a competitive method, since it would seem impossible (even with fairly redundant data and a good choice of the subset) to get a realistic picture of the uncertainties, when only a part of the training data is even considered. We include it here mostly as a baseline against which to compare better sparse approximations.

In Figure 5, top left, we see how the SoD method produces wide predictive distributions, when training on a randomly selected subset of 10 cases. A fair comparison to other methods would take into account that the computational complexity is independent of n, as opposed to other more advanced methods. These extra computational resources could be spent in a number of ways, e.g. larger m, or an active (rather than random) selection of the m points. In this paper we will concentrate on understanding the theoretical foundations of the various approximations rather than investigating the necessary heuristics needed to turn the approximation schemes into actually practical algorithms.

4. The Subset of Regressors (SoR) Approximation

The Subset of Regressors (SoR) algorithm was given by Silverman (1985), and mentioned again by Wahba et al. (1999). It was then adapted by Smola and Bartlett (2001) to propose a sparse greedy approximation to Gaussian process regression. SoR models are finite linear-in-the-parameters models with a particular prior on the weights. For any input x_*, the corresponding function value f_* is given by

    f_* = K_{*u} w_u, \quad \text{with} \quad p(w_u) = N(0, K_{uu}^{-1}),    (11)

where there is one weight associated to each inducing input in X_u. Note that the covariance matrix for the prior on the weights is the inverse of that on u, such that we recover the exact GP prior on u,
which is Gaussian with zero mean and covariance

    u = K_{uu} w_u \;\; \Rightarrow \;\; \langle u u^T \rangle = K_{uu} \langle w_u w_u^T \rangle K_{uu} = K_{uu}.    (12)

Using the effective prior on u and the fact that w_u = K_{uu}^{-1} u, we can redefine the SoR model in an equivalent, more intuitive way:

    f_* = K_{*u} K_{uu}^{-1} u, \quad \text{with} \quad u \sim N(0, K_{uu}).    (13)

We are now ready to integrate the SoR model in our unifying framework. Given that there is a deterministic relation between any f_* and u, the approximate conditional distributions in the integral in eq. (8) are given by:

    q_{SoR}(f | u) = N(K_{fu} K_{uu}^{-1} u, 0), \quad \text{and} \quad q_{SoR}(f_* | u) = N(K_{*u} K_{uu}^{-1} u, 0),    (14)

with zero conditional covariance, compare to (9). The effective prior implied by the SoR approximation is easily obtained from (8), giving

    q_{SoR}(f, f_*) = N(0, \begin{bmatrix} Q_{ff} & Q_{f*} \\ Q_{*f} & Q_{**} \end{bmatrix}),    (15)

where we recall Q_{ab} \triangleq K_{au} K_{uu}^{-1} K_{ub}. A more descriptive name for this method would be the Deterministic Inducing Conditional (DIC) approximation. We see that this approximate prior is degenerate. There are only m degrees of freedom in the model, which implies that only m linearly independent functions can be drawn from the prior; the (m+1)-th one is a linear combination of the previous ones. For example, in a very low noise regime, the posterior could be severely constrained by only m training cases.

The degeneracy of the prior causes unreasonable predictive distributions. Indeed, the approximate prior over functions is so restrictive that, given enough data, only a very limited family of functions will be plausible under the posterior, leading to overconfident predictive variances. This is a general problem of finite linear models with small numbers of weights (for more details see Rasmussen and Quiñonero-Candela, 2005). Figure 5, top right panel, illustrates the unreasonable predictive uncertainties of the SoR approximation on a toy dataset (wary of this fact, Smola and Bartlett (2001) propose using the predictive variances of the SoD, or a more accurate but computationally costly alternative; more details are given by Quiñonero-Candela, 2004, Chapter 3).

The predictive distribution is obtained by using the SoR approximate prior (15) instead of the true prior in (4). For each algorithm we give two forms of the predictive distribution, one which is easy to interpret, and the other which is economical to compute with:

    q_{SoR}(f_* | y) = N(Q_{*f} (Q_{ff} + \sigma^2_{noise} I)^{-1} y, \; Q_{**} - Q_{*f} (Q_{ff} + \sigma^2_{noise} I)^{-1} Q_{f*})    (16a)
                     = N(\sigma^{-2}_{noise} K_{*u} \Sigma K_{uf} y, \; K_{*u} \Sigma K_{u*}),    (16b)

where we have defined \Sigma = (\sigma^{-2}_{noise} K_{uf} K_{fu} + K_{uu})^{-1}. Equation (16a) is readily recognized as the regular prediction equation (6), except that the covariance K has everywhere been replaced by Q, which was already suggested by (15). This corresponds to replacing the covariance function k with k_{SoR}(x_i, x_j) = k(x_i, u) K_{uu}^{-1} k(u, x_j). The new covariance function has rank (at most) m. Thus we have the following

Remark 4  The SoR approximation is equivalent to exact inference in the degenerate Gaussian process with covariance function k_{SoR}(x_i, x_j) = k(x_i, u) K_{uu}^{-1} k(u, x_j).

The equivalent (16b) is computationally cheaper, and with (11) in mind, \Sigma is the covariance of the posterior on the weights w_u. Note that, as opposed to the subset of data method, all training cases are taken into account. The computational complexity is O(nm^2) initially, and O(m) and O(m^2) per test case for the predictive mean and variance respectively.
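A sketch of the economical form (16b). The kernel blocks are assumed to have been precomputed with whatever covariance function one prefers; the jitter term and the names are illustrative.

```python
import numpy as np

def sor_predict(K_uu, K_uf, K_su, y, sigma2, jitter=1e-10):
    """SoR predictive mean and covariance, eq. (16b).
    K_uu: (m, m), K_uf: (m, n), K_su: (n_test, m), y: (n,)."""
    m = K_uu.shape[0]
    Sigma = np.linalg.inv(K_uu + K_uf @ K_uf.T / sigma2 + jitter * np.eye(m))
    mean = K_su @ Sigma @ (K_uf @ y) / sigma2      # O(nm^2) once, O(m) per test case
    cov = K_su @ Sigma @ K_su.T                    # no K_** term: overconfident far from X_u
    return mean, cov
```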
5. The Deterministic Training Conditional (DTC) Approximation

Taking up ideas already contained in the work of Csató and Opper (2002), Seeger et al. (2003) recently proposed another sparse approximation to Gaussian process regression, which does not suffer from the nonsensical predictive uncertainties of the SoR approximation, but which interestingly leads to exactly the same predictive mean. Seeger et al. (2003), who called the method Projected Latent Variables (PLV), presented the method as relying on a likelihood approximation, based on the projection f = K_{fu} K_{uu}^{-1} u:

    p(y | f) \approx q(y | u) = N(K_{fu} K_{uu}^{-1} u, \; \sigma^2_{noise} I).    (17)

The method has also been called the Projected Process Approximation (PPA) by Rasmussen and Williams (2006, Chapter 8). One way of obtaining an equivalent model is to retain the usual likelihood, but to impose a deterministic training conditional and the exact test conditional from eq. (9b):

    q_{DTC}(f | u) = N(K_{fu} K_{uu}^{-1} u, 0), \quad \text{and} \quad q_{DTC}(f_* | u) = p(f_* | u).    (18)

This reformulation has the advantage of allowing us to stick to our view of exact inference (with exact likelihood) with approximate priors. Indeed, under this model the conditional distribution of f given u is identical to that of the SoR, given in the left of (14). A systematic name for this approximation is the Deterministic Training Conditional (DTC).

The fundamental difference with SoR is that DTC uses the exact test conditional (9b) instead of the deterministic relation between f_* and u of SoR. The joint prior implied by DTC is given by:

    q_{DTC}(f, f_*) = N(0, \begin{bmatrix} Q_{ff} & Q_{f*} \\ Q_{*f} & K_{**} \end{bmatrix}),    (19)

which is surprisingly similar to the effective prior implied by the SoR approximation (15). The fundamental difference is that under the DTC approximation f_* has a prior variance of its own, given by K_{**}. This prior variance reverses the behaviour of the predictive uncertainties, and turns them into sensible ones, see Figure 5 for an illustration. The predictive distribution is now given by:

    q_{DTC}(f_* | y) = N(Q_{*f} (Q_{ff} + \sigma^2_{noise} I)^{-1} y, \; K_{**} - Q_{*f} (Q_{ff} + \sigma^2_{noise} I)^{-1} Q_{f*})    (20a)
                     = N(\sigma^{-2}_{noise} K_{*u} \Sigma K_{uf} y, \; K_{**} - Q_{**} + K_{*u} \Sigma K_{u*}),    (20b)

where again we have defined \Sigma = (\sigma^{-2}_{noise} K_{uf} K_{fu} + K_{uu})^{-1} as in (16). The predictive mean for the DTC is identical to that of the SoR approximation (16), but the predictive variance replaces the Q_{**} from SoR with K_{**} (which is larger, since K_{**} - Q_{**} is positive definite). This added term is the predictive variance of the posterior of f_* conditioned on u. It grows to the prior variance K_{**} as x_* moves far from the inducing inputs in X_u.

Remark 5  The only difference between the predictive distribution of DTC and SoR is the variance. The predictive variance of DTC is never smaller than that of SoR.

Note that, since the covariances for training cases and test cases are computed differently, see (19), it follows that

Remark 6  The DTC approximation does not correspond exactly to a Gaussian process, as the covariance between latent values depends on whether they are considered training or test cases, violating consistency, see Definition 1.

The computational complexity has the same order as for SoR.
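The corresponding sketch for the DTC form (20b). Relative to the SoR sketch above, the only change is the K_{**} - Q_{**} term added to the predictive covariance; as before, the kernel blocks and jitter are assumptions of the illustration.

```python
import numpy as np

def dtc_predict(K_uu, K_uf, K_su, K_ss, y, sigma2, jitter=1e-10):
    """DTC predictive mean and covariance, eq. (20b)."""
    m = K_uu.shape[0]
    K_uu_j = K_uu + jitter * np.eye(m)
    Sigma = np.linalg.inv(K_uu_j + K_uf @ K_uf.T / sigma2)
    mean = K_su @ Sigma @ (K_uf @ y) / sigma2              # same mean as SoR
    Q_ss = K_su @ np.linalg.solve(K_uu_j, K_su.T)          # Q_** = K_*u K_uu^{-1} K_u*
    cov = K_ss - Q_ss + K_su @ Sigma @ K_su.T              # never smaller than the SoR variance
    return mean, cov
```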
[Figure 2: Graphical model for the FITC approximation. Compared to those in Figure 1, all edges between latent function values have been removed: the latent function values are conditionally fully independent given the inducing variables u. Although strictly speaking the SoR and DTC approximations could also be represented by this graph, note that both further assume a deterministic relation between f and u.]

6. The Fully Independent Training Conditional (FITC) Approximation

Recently Snelson and Ghahramani (2006) proposed another likelihood approximation to speed up Gaussian process regression, which they called Sparse Gaussian Processes using Pseudo-inputs (SGPP). While the DTC is based on the likelihood approximation given by (17), the SGPP proposes a more sophisticated likelihood approximation with a richer covariance

    p(y | f) \approx q(y | u) = N(K_{fu} K_{uu}^{-1} u, \; \mathrm{diag}[K_{ff} - Q_{ff}] + \sigma^2_{noise} I),    (21)

where diag[A] is a diagonal matrix whose elements match the diagonal of A. As we did in (18) for the DTC, we provide an alternative equivalent formulation called Fully Independent Training Conditional (FITC), based on the inducing conditionals:

    q_{FITC}(f | u) = \prod_{i=1}^{n} p(f_i | u) = N(K_{fu} K_{uu}^{-1} u, \; \mathrm{diag}[K_{ff} - Q_{ff}]), \quad \text{and} \quad q_{FITC}(f_* | u) = p(f_* | u).    (22)

We see that, as opposed to SoR and DTC, FITC does not impose a deterministic relation between f and u. Instead of ignoring the variance, FITC proposes an approximation to the training conditional distribution of f given u as a further independence assumption. In addition, the exact test conditional from (9b) is used in (22), although for reasons which will become clear towards the end of this section, we initially consider only a single test case, f_*. The corresponding graphical model is given in Figure 2.
The effective prior implied by the FITC is given by

    q_{FITC}(f, f_*) = N(0, \begin{bmatrix} Q_{ff} - \mathrm{diag}[Q_{ff} - K_{ff}] & Q_{f*} \\ Q_{*f} & K_{**} \end{bmatrix}).    (23)

Note that the sole difference between the DTC and FITC is that in the top left corner of the implied prior covariance, FITC replaces the approximate covariances of DTC by the exact ones on the diagonal. The predictive distribution is

    q_{FITC}(f_* | y) = N(Q_{*f} (Q_{ff} + \Lambda)^{-1} y, \; K_{**} - Q_{*f} (Q_{ff} + \Lambda)^{-1} Q_{f*})    (24a)
                      = N(K_{*u} \Sigma K_{uf} \Lambda^{-1} y, \; K_{**} - Q_{**} + K_{*u} \Sigma K_{u*}),    (24b)

where we have defined \Sigma = (K_{uu} + K_{uf} \Lambda^{-1} K_{fu})^{-1} and \Lambda = \mathrm{diag}[K_{ff} - Q_{ff} + \sigma^2_{noise} I]. The computational complexity is identical to that of SoR and DTC.

So far we have only considered a single test case. There are two options for joint predictions: either 1) use the exact full test conditional from (9b), or 2) extend the additional factorizing assumption to the test conditional. Although Snelson and Ghahramani (2006) don't explicitly discuss joint predictions, it would seem that they probably intend the second option. Whereas the additional independence assumption for the test cases is not really necessary for computational reasons, it does affect the nature of the approximation. Under option 1) the training and test covariance are computed differently, and thus this does not correspond to our strict definition of a GP model, but

Remark 7  Iff the assumption of full independence is extended to the test conditional, the FITC approximation is equivalent to exact inference in a non-degenerate Gaussian process with covariance function k_{FIC}(x_i, x_j) = k_{SoR}(x_i, x_j) + \delta_{ij} [k(x_i, x_j) - k_{SoR}(x_i, x_j)], where \delta_{ij} is Kronecker's delta.

A logical name for the method where the conditionals (training and test) are always forced to be fully independent would be the Fully Independent Conditional (FIC) approximation. The effective prior implied by FIC is:

    q_{FIC}(f, f_*) = N(0, \begin{bmatrix} Q_{ff} - \mathrm{diag}[Q_{ff} - K_{ff}] & Q_{f*} \\ Q_{*f} & Q_{**} - \mathrm{diag}[Q_{**} - K_{**}] \end{bmatrix}).    (25)
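A sketch of (24b) under the same assumptions as the previous snippets. Since \Lambda is diagonal, applying \Lambda^{-1} reduces to elementwise division.

```python
import numpy as np

def fitc_predict(K_uu, K_uf, K_ff_diag, K_su, K_ss, y, sigma2, jitter=1e-10):
    """FITC predictive mean and covariance, eq. (24b).
    K_ff_diag: (n,) diagonal of K_ff only; the full K_ff is never needed."""
    m = K_uu.shape[0]
    K_uu_j = K_uu + jitter * np.eye(m)
    Q_ff_diag = np.einsum('ij,ij->j', K_uf, np.linalg.solve(K_uu_j, K_uf))  # diag of Q_ff
    lam = K_ff_diag - Q_ff_diag + sigma2                   # Lambda = diag[K_ff - Q_ff] + sigma^2 I
    Sigma = np.linalg.inv(K_uu_j + (K_uf / lam) @ K_uf.T)  # (K_uu + K_uf Lambda^{-1} K_fu)^{-1}
    mean = K_su @ Sigma @ (K_uf @ (y / lam))
    Q_ss = K_su @ np.linalg.solve(K_uu_j, K_su.T)
    cov = K_ss - Q_ss + K_su @ Sigma @ K_su.T
    return mean, cov
```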
7. The Partially Independent Training Conditional (PITC) Approximation

In the previous section we saw how to improve the DTC approximation by approximating the training conditional with an independent distribution, i.e. one with a diagonal covariance matrix. In this section we will further improve the approximation (while remaining computationally attractive) by extending the training conditional to have a block diagonal covariance:

    q_{PITC}(f | u) = N(K_{fu} K_{uu}^{-1} u, \; \mathrm{blockdiag}[K_{ff} - Q_{ff}]), \quad \text{and} \quad q_{PITC}(f_* | u) = p(f_* | u),    (26)

where blockdiag[A] is a block diagonal matrix (where the blocking structure is not explicitly stated). We represent graphically the PITC approximation in Figure 3. Developing this analogously to the FITC approximation from the previous section, we get the joint prior

    q_{PITC}(f, f_*) = N(0, \begin{bmatrix} Q_{ff} - \mathrm{blockdiag}[Q_{ff} - K_{ff}] & Q_{f*} \\ Q_{*f} & K_{**} \end{bmatrix}),    (27)

and the predictive distribution is identical to (24), except for the alternative definition of \Lambda = \mathrm{blockdiag}[K_{ff} - Q_{ff} + \sigma^2_{noise} I]. An identical expression was obtained by Schwaighofer and Tresp (2003, Sect. 3), developing from the original Bayesian committee machine (BCM) by Tresp (2000). The relationship to the FITC was pointed out by Lehel Csató. The BCM was originally proposed as a transductive learner (i.e. where the test inputs have to be known before training), and the inducing inputs X_u were chosen to be the test inputs. We discuss transduction in detail in the next section.

[Figure 3: Graphical representation of the PITC approximation. The set of latent function values f_{I_i} indexed by the set of indices I_i is fully connected. The PITC differs from FITC (see graph in Fig. 2) in that conditional independence is now between the k groups of training latent function values. This corresponds to the block diagonal approximation to the true training conditional given in (26).]

It is important to realize that the BCM proposes two orthogonal ideas: first, the block diagonal structure of the partially independent training conditional, and second, setting the inducing inputs to be the test inputs. These two ideas can be used independently, and in Section 8 we propose using the first without the second.

The computational complexity of the PITC approximation depends on the blocking structure imposed in (26). A reasonable choice, also recommended by Tresp (2000), may be to choose k = n/m blocks, each of size m x m. The computational complexity remains O(nm^2). Since in the PITC model the covariance is computed differently for training and test cases,

Remark 8  The PITC approximation does not correspond exactly to a Gaussian process. This is because computing covariances requires knowing whether points are from the training or test set, see (27). One can obtain a Gaussian process from the PITC by extending the partial conditional independence assumption to the test conditional, as we did in Remark 7 for the FITC.
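For PITC only the application of \Lambda^{-1} changes with respect to FITC. The sketch below uses equal-sized blocks, which is an arbitrary choice of ours; the paper leaves the blocking structure open. Replacing the elementwise divisions by lam in the FITC sketch above with calls to this helper yields the PITC predictive distribution.

```python
import numpy as np

def pitc_lambda_solve(K_ff, Q_ff, sigma2, rhs, block_size):
    """Apply Lambda^{-1} to rhs, where Lambda = blockdiag[K_ff - Q_ff] + sigma^2 I.
    Each block is solved independently, so the cost stays O(n m^2) for blocks of size m."""
    n = K_ff.shape[0]
    out = np.empty_like(rhs, dtype=float)
    for start in range(0, n, block_size):
        idx = slice(start, min(start + block_size, n))
        block = K_ff[idx, idx] - Q_ff[idx, idx] + sigma2 * np.eye(idx.stop - idx.start)
        out[idx] = np.linalg.solve(block, rhs[idx])
    return out
```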
8. Transduction and Augmentation

The idea of transduction is that one should restrict the goal of learning to prediction on a pre-specified set of test cases, rather than trying to learn an entire function (induction) and then evaluate it at the test inputs. There may be no universally agreed upon definition of transduction. In this paper we use

Definition 9  Transduction occurs only if the predictive distribution depends on other test inputs.

This operational definition excludes models for which there exists an equivalent inductive counterpart. According to this definition, it is irrelevant when the bulk of the computation takes place.

There are several different possible motivations for transduction: 1) transduction is somehow easier than induction (Vapnik, 1995); 2) the test inputs may reveal important information, which should be used during training (this motivation drives models in semi-supervised learning, studied mostly in the context of classification); and 3) for approximate algorithms one may be able to limit the discrepancies of the approximation at the test points.

For exact GP models it seems that the first reason doesn't really apply. If you make predictions at the test points that are consistent with a GP, then it is trivial inside the GP framework to extend these to any other input points, and in effect we have done induction. The second reason seems more interesting. However, in a standard GP setting, it is a consequence of the consistency property, see Remark 2, that predictions at one test input are independent of the location of any other test inputs. Therefore transduction cannot be married with exact GPs:

Remark 10  Transduction cannot occur in exact Gaussian process models.

Whereas this holds for the usual setting of GPs, it could be different in non-standard situations where e.g. the covariance function depends on the empirical input densities.

Transduction can occur in the sparse approximation to GPs, by making the choice of inducing variables depend on the test inputs. The BCM from the previous section, where X_u = X_* (where X_* are the test inputs), is an example of this. Since the inducing variables are connected to all other nodes (see Figure 3), we would expect the approximation to be good at u = f_*, which is what we care about for predictions, relating to reason 3) above. While this reasoning is sound, it is not necessarily a sufficient consideration for getting a good model. The model has to be able to simultaneously explain the training targets as well, and if the choice of u makes this difficult, the posterior at the points of interest may be distorted. Thus, the choice of u should be governed by the ability to model the conditional of the latents given the inputs, and not solely by the density of the (test) inputs.

The main drawback of transduction is that by its nature it doesn't provide a predictive model in the way inductive models do.
In the usual GP model one can do the bulk of the computation involved in the predictive distributions (e.g. matrix inversion) before seeing the test cases, enabling fast computation of test predictions. It is interesting that whereas other methods spend much effort trying to optimize the inducing variables, the BCM simply uses the test set. The quality of the BCM approximation then depends on the particular location of the test inputs, upon which one usually does not have any control. We now see that there may be a better method, eliminating the drawback of transduction, namely use the PITC approximation, but choose the u's carefully (see Section 9), don't just use the test set.

8.1 Augmentation

An idea closely related to transduction, but not covered by our definition, is augmentation, which in contrast to transduction is done individually for each test case. Since in the previous sections we haven't assumed anything about u, we can simply augment the set of inducing variables by f_* (i.e. have one additional inducing variable equal to the current test latent), and see what happens in the predictive distributions for the different methods. Let's first investigate the consequences for the test conditional from (9b). Note the interpretation of the covariance matrix K_{**} - Q_{**} was "the prior covariance minus the information which u provides about f_*". It is clear that the augmented u (with f_*) provides all possible information about f_*, and consequently Q_{**} = K_{**}. An equivalent view on augmentation is that the assumption of conditional independence between f_* and f is dropped. This is seen trivially by adding edges between f_* and the f_i in the graphical model, Figure 4.

[Figure 4: Two views on augmentation. One view is to see that the test latent function value f_* is now part of the inducing variables u and therefore has access to the training latent function values. An equivalent view is to consider that we have dropped the assumption of conditional independence between f_* and the training latent function values. Even if f_* has now direct access to each of the training f_i, these still need to go through u to talk to each other if they fall in conditionally independent blocks. We have in this figure decided to recycle the graph for PITC from Figure 3 to show that all approximations we have presented can be augmented, irrespective of what the approximation for the training conditional is.]

Augmentation was originally proposed by Rasmussen (2002), and applied in detail to the SoR with RBF covariance by Quiñonero-Candela (2004). Because the SoR is a finite linear model, and the basis functions are local (Gaussian bumps), the predictive distributions can be very misleading. For example, when making predictions far away from the center of any basis function, all basis functions have insignificant magnitudes, and the prediction (averaged over the posterior) will be close to zero, with very small error-bars; this is the opposite of the desired behaviour, where we would expect the error-bars to grow as we move away from the training cases. Here augmentation makes a particularly big difference, turning the nonsensical predictive distribution into a reasonable one, by ensuring that there is always a basis function centered on the test case. Compare the non-augmented to the augmented SoR in Figure 5. An analogous Gaussian process based finite linear model that has recently been healed by augmentation is the relevance vector machine (Rasmussen and Quiñonero-Candela, 2005).

Although augmentation was initially proposed for a narrow set of circumstances, it is easily applied to any of the approximations discussed. Of course, augmentation doesn't make any sense for an exact, non-degenerate Gaussian process model (a GP with a covariance function that has a feature-space which is infinite dimensional, i.e. with basis functions everywhere).

Remark 11  A full non-degenerate Gaussian process cannot be augmented, since the corresponding f_* would already be connected to all other variables in the graphical model.

But augmentation does make sense for sparse approximations to GPs. The more general process view on augmentation has several advantages over the basis function view. It is not completely clear from the basis function view which basis function should be used for augmentation. For example, Rasmussen and Quiñonero-Candela (2005) successfully apply augmentation using basis functions that have a zero contribution at the test location! In the process view, however, it seems clear that one would choose the additional inducing variable to be f_*, to minimize the effects of the approximations.
Let us compute the effective prior for the augmented SoR. Given that f_* is in the inducing set, the test conditional is not an approximation and we can rewrite the integral leading to the effective prior:

    q_{ASoR}(f, f_*) = \int q_{SoR}(f | f_*, u) p(f_*, u) \, du.    (28)

It is interesting to notice that this is also the effective prior that would result from augmenting the DTC approximation, since q_{SoR}(f | f_*, u) = q_{DTC}(f | f_*, u).

Remark 12  Augmented SoR (ASoR) is equivalent to augmented DTC (ADTC).

Augmented DTC only differs from DTC in the additional presence of f_* among the inducing variables in the training conditional. We can only expect augmented DTC to be a more accurate approximation than DTC, since adding an additional inducing variable can only help capture information from y. Therefore

Remark 13  DTC is a less accurate (but cheaper) approximation than augmented SoR.

We saw previously in Section 5 that the DTC approximation does not suffer from the nonsensical predictive variances of the SoR. The equivalence between the augmented SoR and augmented DTC is another way of seeing how augmentation reverses the misbehaviour of SoR. The predictive distribution of the augmented SoR is obtained by adding f_* to u in (20).

Prediction with an augmented sparse model comes at a higher computational cost, since f_* now directly interacts with all of f and not just with u. For each new test case, updating the augmented \Sigma in the predictive equation (for example (20b) for DTC) implies computing the vector-matrix product K_{*f} K_{fu} with complexity O(nm). This is clearly higher than the O(m) for the mean, and O(m^2) for the predictive distribution, of all the non-augmented methods we have discussed.

Augmentation seems to be only really necessary for methods that make a severe approximation to the test conditional, like the SoR. For methods that make little or no approximation to the test conditional, it is difficult to predict the degree to which augmentation would help. However, one can see that by giving f_* access to all of the training latent function values in f, one would expect augmentation to give less under-confident predictive distributions near the training data. Figure 5 clearly shows that augmented DTC (equivalent to augmented SoR) has a superior predictive distribution (both mean and variance) to standard DTC. Note however that in the figure we have purposely chosen a too short lengthscale to enhance visualization. Quantitatively, this superiority was experimentally assessed by Quiñonero-Candela (2004, Table 3.1). Augmentation hasn't been compared to the more advanced approximations FITC and PITC, and the figure would change in the more realistic scenario where the inducing inputs and hyperparameters are learnt (Snelson and Ghahramani, 2006).

Transductive methods like the BCM can be seen as joint augmentation, and one could potentially use it for any of the methods presented. It seems that the good performance of the BCM could essentially stem from augmentation, the presence of the other test inputs in the inducing set being probably of little benefit. Joint augmentation might bring some computational advantage, but won't change the scaling: note that augmenting m times at a cost of O(nm) apiece implies the same O(nm^2) total cost as the jointly augmented BCM.
[Figure 5: Toy example with identical covariance function and hyperparameters; the six panels show SoD, SoR, DTC, ASoR/ADTC, FITC and PITC. The squared exponential covariance function is used, and a slightly too short lengthscale is chosen on purpose to emphasize the different behaviour of the predictive uncertainties. The dots are the training points, the crosses are the targets corresponding to the inducing inputs, randomly selected from the training set. The solid line is the mean of the predictive distribution, and the dotted lines show the 95% confidence interval of the predictions. Augmented DTC (ADTC) is equivalent to augmented SoR (ASoR), see Remark 12.]

9. On the Choice of the Inducing Variables

We have until now assumed that the inducing inputs X_u were given. Traditionally, sparse models have very often been built upon a carefully chosen subset of the training inputs. This concept is probably best exemplified in the popular support vector machine (Cortes and Vapnik, 1995). In sparse Gaussian processes it has also been suggested to select the inducing inputs X_u from among the training inputs. Since this involves a prohibitive combinatorial optimization, greedy optimization approaches have been suggested using various selection criteria like online learning (Csató and Opper, 2002), greedy posterior maximization (Smola and Bartlett, 2001), maximum information gain (Seeger et al., 2003), matching pursuit (Keerthi and Chu, 2006), and probably more. As discussed in the previous section, selecting the inducing inputs from among the test inputs has also been considered in transductive settings. Recently, Snelson and Ghahramani (2006) have proposed to relax the constraint that the inducing variables must be a subset of training/test cases, turning the discrete selection problem into one of continuous optimization. One may hope that finding a good solution is easier in the continuous than the discrete case, although finding the global optimum is intractable in both cases. And perhaps the less restrictive choice can lead to better performance in very sparse models.

Which optimality criterion should be used to set the inducing inputs? Departing from a fully Bayesian treatment, which would involve defining priors on X_u, one could maximize the marginal likelihood (also called the evidence) with respect to X_u, an approach also followed by Snelson and Ghahramani (2006). Each of the approximate methods proposed involves a different effective prior, and hence its own particular effective marginal likelihood conditioned on the inducing inputs

    q(y | X_u) = \int\int p(y | f) q(f | u) p(u | X_u) \, du \, df = \int p(y | f) q(f | X_u) \, df,    (29)

which of course is independent of the test conditional. We have in the above equation explicitly conditioned on the inducing inputs X_u. Using Gaussian identities, the effective marginal likelihood is very easily obtained by adding a ridge \sigma^2_{noise} I (from the likelihood) to the covariance of the effective prior on f. Using the appropriate definitions of \Lambda, the log marginal likelihood becomes

    \log q(y | X_u) = -\tfrac{1}{2} \log |Q_{ff} + \Lambda| - \tfrac{1}{2} y^T (Q_{ff} + \Lambda)^{-1} y - \tfrac{n}{2} \log(2\pi),    (30)

where \Lambda_{SoR} = \Lambda_{DTC} = \sigma^2_{noise} I, \Lambda_{FITC} = \mathrm{diag}[K_{ff} - Q_{ff}] + \sigma^2_{noise} I, and \Lambda_{PITC} = \mathrm{blockdiag}[K_{ff} - Q_{ff}] + \sigma^2_{noise} I. The computational cost of the marginal likelihood is O(nm^2) for all methods, and that of its gradient with respect to one element of X_u is O(nm). This of course implies that the complexity of computing the gradient wrt. the whole of X_u is O(dnm^2), where d is the dimension of the input space.

It has been proposed to maximize the effective posterior instead of the effective marginal likelihood (Smola and Bartlett, 2001). However this is potentially dangerous and can lead to overfitting. Maximizing the whole evidence instead is sound and comes at an identical computational cost (for a deeper analysis see Quiñonero-Candela, 2004, Sect. 3.3.5 and Fig. 3.2).

The marginal likelihood has traditionally been used to learn the hyperparameters of GPs in the non fully Bayesian treatment (see for example Williams and Rasmussen, 1996). For the sparse approximations presented here, once you are learning X_u it is straightforward to allow for learning hyperparameters (of the covariance function) during the same optimization, and there is no need to interleave optimization of u with learning of the hyperparameters as has been proposed for example by Seeger et al. (2003).
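Equation (30) can be evaluated directly. The sketch below does so naively, in O(n^3), for the DTC/SoR choice \Lambda = \sigma^2_{noise} I purely for clarity; in practice one would use the matrix inversion lemma of Appendix A to reach the O(nm^2) cost quoted above, and gradients with respect to X_u could be obtained by hand or by automatic differentiation. The names and jitter are illustrative.

```python
import numpy as np

def log_marginal_dtc(K_uu, K_uf, y, sigma2, jitter=1e-10):
    """Naive evaluation of log q(y | X_u), eq. (30), with Lambda = sigma^2 I."""
    n = K_uf.shape[1]
    Q_ff = K_uf.T @ np.linalg.solve(K_uu + jitter * np.eye(len(K_uu)), K_uf)
    A = Q_ff + sigma2 * np.eye(n)                  # Q_ff + Lambda
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-np.sum(np.log(np.diag(L)))            # -1/2 log |Q_ff + Lambda|
            - 0.5 * y @ alpha                      # -1/2 y^T (Q_ff + Lambda)^{-1} y
            - 0.5 * n * np.log(2 * np.pi))
```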
10. Other Methods

In this section we briefly mention two approximations which don't fit in our unifying scheme, since one doesn't correspond to a proper probabilistic model, and the other one uses a particular construction for the covariance function, rather than allowing any general covariance function.

10.1 The Nyström Approximation

The Nyström Approximation for speeding up GP regression was originally proposed by Williams and Seeger (2001), and then questioned by Williams et al. (2002). Like SoR and DTC, the Nyström Approximation for GP regression approximates the prior covariance of f by Q_{ff}. However, unlike these methods, the Nyström Approximation is not based on a generative probabilistic model. The prior covariance between f_* and f is taken to be exact, which is inconsistent with the prior covariance on f:

    q(f, f_*) = N(0, \begin{bmatrix} Q_{ff} & K_{f*} \\ K_{*f} & K_{**} \end{bmatrix}).    (31)

As a result we cannot derive this method from our unifying framework, nor represent it with a graphical model. Worse, the resulting prior covariance matrix is not even guaranteed to be positive definite, allowing the predictive variances to be negative. Notice that replacing K_{f*} by Q_{f*} in (31) is enough to make the prior covariance positive definite, and one obtains the DTC approximation.

Remark 14  The Nyström Approximation does not correspond to a well-formed probabilistic model.

Ignoring any quibbles about positive definiteness, the predictive distribution of the Nyström Approximation is given by:

    p(f_* | y) = N(K_{*f} [Q_{ff} + \sigma^2_{noise} I]^{-1} y, \; K_{**} - K_{*f} [Q_{ff} + \sigma^2_{noise} I]^{-1} K_{f*}),    (32)

but the predictive variance is not guaranteed to be positive. The computational cost is O(nm^2).

10.2 The Relevance Vector Machine

The relevance vector machine (RVM), introduced by Tipping (2001), is a finite linear model with an independent Gaussian prior imposed on the weights. For any input x_*, the corresponding function output f_* is given by:

    f_* = \phi_* w, \quad \text{with} \quad p(w | A) = N(0, A^{-1}),    (33)

where \phi_* = [\phi_1(x_*), ..., \phi_m(x_*)] is the (row) vector of responses of the m basis functions, and A = \mathrm{diag}(a_1, ..., a_m) is the diagonal matrix of joint prior precisions (inverse variances) of the weights. The a_i are learnt by maximizing the RVM evidence (obtained by also assuming Gaussian additive iid. noise, see (1)), and for the typical case of rich enough sets of basis functions many of the precisions go to infinity, effectively pruning out the corresponding weights (for a very interesting analysis see Wipf et al., 2004). The RVM is thus a sparse method, and the surviving basis functions are called relevance vectors.

Note that since the RVM is a finite linear model with Gaussian priors on the weights, it can be seen as a Gaussian process:

Remark 15  The RVM is equivalent to a degenerate Gaussian process with covariance function k_{RVM}(x_i, x_j) = \phi_i A^{-1} \phi_j^T = \sum_{k=1}^{m} a_k^{-1} \phi_k(x_i) \phi_k(x_j),

as was also pointed out by Tipping (2001, eq. (59)). Whereas all sparse approximations we have presented until now are totally independent of the choice of covariance function, for the RVM this choice is restricted to covariance functions that can be expressed as finite expansions in terms of some basis functions. Being degenerate GPs in exactly the same way as the SoR (presented in Section 4), the RVM also suffers from unreasonable predictive variances. Rasmussen and Quiñonero-Candela (2005) show that the predictive distributions of RVMs can also be healed by augmentation, see Section 8. Once the a_i have been learnt, denoting by m the number of surviving relevance vectors, the complexity of computing the predictive distribution of the RVM is O(m) for the mean and O(m^2) for the variance.

RVMs are often used with radial basis functions centered on the training inputs. One potentially interesting extension to the RVM would be to learn the locations of the centers of the basis functions, in the same way as proposed by Snelson and Ghahramani (2006) for the FITC approximation, see Section 6. This is a curious reminiscence of learning the centers in RBF networks.
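Remark 15 in code: a small sketch (the RBF basis, its centres and the precisions are arbitrary illustrative choices) that builds the implied RVM covariance matrix and confirms that its rank is bounded by the number of basis functions, i.e. that it is a degenerate GP.

```python
import numpy as np

def rvm_kernel(Phi_i, Phi_j, a):
    """k_RVM as in Remark 15. Phi_i: (n_i, m) basis responses, Phi_j: (n_j, m), a: (m,) precisions."""
    return (Phi_i / a) @ Phi_j.T                   # Phi_i A^{-1} Phi_j^T

centres = np.linspace(-3, 3, 7)
phi = lambda x: np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2)  # RBF basis responses
x = np.linspace(-3, 3, 20)
K_rvm = rvm_kernel(phi(x), phi(x), a=np.ones(7))
print(np.linalg.matrix_rank(K_rvm))                # at most m = 7: a degenerate GP
```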
[Table 1: Summary of the way the approximations are built. All these methods are detailed in the previous sections. The initial cost and that of the mean and variance per test case are respectively n^3, n and n^2 for the exact GP, and nm^2, m and m^2 for all other methods. The "GP?" column indicates whether the approximation is equivalent to a GP. For FITC see Remark 7.]

    Method   q(f_* | u)   q(f | u)           joint prior covariance                                      GP?
    GP       exact        exact              [K_ff, K_f*; K_*f, K_**]                                    yes
    SoR      determ.      determ.            [Q_ff, Q_f*; Q_*f, Q_**]                                    yes
    DTC      exact        determ.            [Q_ff, Q_f*; Q_*f, K_**]                                    no
    FITC     (exact)      fully indep.       [Q_ff - diag[Q_ff - K_ff], Q_f*; Q_*f, K_**]                (yes)
    PITC     exact        partially indep.   [Q_ff - blockdiag[Q_ff - K_ff], Q_f*; Q_*f, K_**]           no

11. Conclusions

We have provided a unifying framework for sparse approximations to Gaussian processes for regression. Our approach consists of two steps: first 1) we recast the approximation in terms of approximations to the prior, and second 2) we introduce inducing variables u and the idea of conditional independence given u. We recover all existing sparse methods by making further simplifications of the covariances of the training and test conditionals, see Table 1 for a summary.

Previous methods were presented based on different approximation paradigms (e.g. likelihood approximations, projection methods, matrix approximations, minimization of Kullback-Leibler divergence, etc.), making direct comparison difficult. Under our unifying view we deconstruct methods, making it clear which building blocks they are based upon.
For example, the SGPP by Snelson and Ghahramani (2006) contains two ideas: 1) a likelihood approximation and 2) the idea of varying the inducing inputs continuously; these two ideas could easily be used independently, and incorporated in other methods. Similarly, the BCM by Tresp (2000) contains two independent ideas: 1) a block diagonal assumption, and 2) the (transductive) idea of choosing the test inputs as the inducing variables. Finally we note that although all three ideas of 1) transductively setting u = f_*, 2) augmentation and 3) continuous optimization of X_u have been proposed in very specific settings, in fact they are completely general ideas, which can be applied to any of the approximation schemes considered.

We have ranked the approximations according to how close they are to the corresponding full GP. However, the performance in practical situations may not always follow this theoretical ranking, since the approximations might exhibit properties (not present in the full GP) which may be particularly suitable for specific datasets. This may make the interpretation of empirical comparisons challenging. A further complication arises when adding the necessary heuristics for turning the theoretical constructs into practical algorithms. We have not described full algorithms in this paper, but are currently working on a detailed empirical study (in preparation, see also Rasmussen and Williams, 2006, chapter 8).

We note that the order of the computational complexity is identical for all the methods considered, O(nm^2). This highlights that there is no computational excuse for using gross approximations, such as assuming deterministic relationships; in particular one should probably think twice before using SoR or even DTC. Although augmentation has attractive predictive properties, it is computationally expensive. It remains unclear whether augmentation could be beneficial on a fixed computational budget.

We have only considered the simpler case of regression in this paper, but sparseness is also commonly sought in classification settings. It should not be difficult to cast probabilistic approximation methods such as Expectation Propagation (EP) or the Laplace method (for a comparison, see Kuss and Rasmussen, 2005) into our unifying framework.

Our analysis suggests that a new interesting approximation would come from combining the best possible approximation (PITC) with the most powerful selection method for the inducing inputs. This would correspond to a non-transductive version of the BCM. We would evade the necessity of knowing the test set before doing the bulk of the computation, and we could hope to supersede the superior performance reported by Snelson and Ghahramani (2006) for very sparse approximations.

Acknowledgments

Thanks to Neil Lawrence for arranging the 2005 Gaussian Process Round Table meeting in Sheffield, which provided much inspiration to this paper. Special thanks to Olivier Chapelle, Lehel Csató, Zoubin Ghahramani, Matthias Seeger, Ed Snelson and Chris Williams for helpful discussions, and to three anonymous reviewers. Both authors were supported by the German Research Council (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.
Appendix A. Gaussian and Matrix Identities

In this appendix we provide identities used to manipulate matrices and Gaussian distributions throughout the paper. Let x and y be jointly Gaussian,

    \begin{bmatrix} x \\ y \end{bmatrix} \sim N(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} A & C \\ C^T & B \end{bmatrix}),    (34)

then the marginal and the conditional are given by

    x \sim N(\mu_x, A), \quad \text{and} \quad x | y \sim N(\mu_x + C B^{-1} (y - \mu_y), \; A - C B^{-1} C^T).    (35)

Also, the product of a Gaussian in x with a Gaussian in a linear projection P x is again a Gaussian, although unnormalized,

    N(x | a, A) \, N(P x | b, B) = z_c \, N(x | c, C),    (36)

where

    C = (A^{-1} + P^T B^{-1} P)^{-1}, \quad c = C (A^{-1} a + P^T B^{-1} b).

The normalizing constant z_c is Gaussian in the means a and b of the two Gaussians:

    z_c = (2\pi)^{-m/2} |B + P A P^T|^{-1/2} \exp(-\tfrac{1}{2} (b - P a)^T (B + P A P^T)^{-1} (b - P a)).    (37)

The matrix inversion lemma, also known as the Woodbury, Sherman & Morrison formula, states that:

    (Z + U W V^T)^{-1} = Z^{-1} - Z^{-1} U (W^{-1} + V^T Z^{-1} U)^{-1} V^T Z^{-1},    (38)

assuming the relevant inverses all exist. Here Z is n x n, W is m x m, and U and V are both of size n x m; consequently if Z^{-1} is known, and a low rank (i.e. m << n) perturbation is made to Z as in the left hand side of eq. (38), considerable speedup can be achieved.
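A quick numerical check of (38). The matrices below are arbitrary stand-ins; the speedup in the paper arises when m << n and Z^{-1} is cheap to apply, e.g. when Z is diagonal as here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 10
Z = np.diag(rng.uniform(1.0, 2.0, n))              # cheap-to-invert Z
W = np.eye(m)
U = rng.standard_normal((n, m))
V = rng.standard_normal((n, m))

lhs = np.linalg.inv(Z + U @ W @ V.T)               # direct O(n^3) inverse
Zinv = np.diag(1.0 / np.diag(Z))                   # Z^{-1} at O(n) cost
inner = np.linalg.inv(np.linalg.inv(W) + V.T @ Zinv @ U)   # only an m x m inverse
rhs = Zinv - Zinv @ U @ inner @ V.T @ Zinv
print(np.max(np.abs(lhs - rhs)))                   # agrees to round-off
```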
References

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

Lehel Csató and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14(3):641-669, 2002.

Sathiya Keerthi and Wei Chu. A Matching Pursuit approach to sparse Gaussian process regression. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, Cambridge, Massachusetts, 2006. The MIT Press.

Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679-1704, 2005.

Joaquin Quiñonero-Candela. Learning with Uncertainty - Gaussian Processes and Relevance Vector Machines. PhD thesis, Technical University of Denmark, Lyngby, Denmark, 2004.

Carl Edward Rasmussen. Reduced rank Gaussian process learning. Technical report, Gatsby Computational Neuroscience Unit, UCL, 2002.

Carl Edward Rasmussen and Joaquin Quiñonero-Candela. Healing the relevance vector machine by augmentation. In International Conference on Machine Learning, 2005.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Anton Schwaighofer and Volker Tresp. Transductive and inductive methods for approximate Gaussian process regression. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 953-960, Cambridge, Massachusetts, 2003. The MIT Press.

Matthias Seeger, Christopher K. I. Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Christopher M. Bishop and Brendan J. Frey, editors, Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, 2003.

Bernhard W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society B, 47(1):1-52, 1985. (with discussion).

Alexander J. Smola and Peter L. Bartlett. Sparse greedy Gaussian process regression. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619-625, Cambridge, Massachusetts, 2001. The MIT Press.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, Cambridge, Massachusetts, 2006. The MIT Press.

Michael E. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1:211-244, 2001.

Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719-2741, 2000.

Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.

Grace Wahba, Xiwu Lin, Fangyu Gao, Dong Xiang, Ronald Klein, and Barbara Klein. The bias-variance tradeoff and the randomized GACV. In Michael S. Kearns, Sara A. Solla, and David A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 620-626, Cambridge, Massachusetts, 1999. The MIT Press.

Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514-520, Cambridge, Massachusetts, 1996. The MIT Press.

Christopher K. I. Williams, Carl Edward Rasmussen, Anton Schwaighofer, and Volker Tresp. Observations of the Nyström method for Gaussian process prediction. Technical report, University of Edinburgh, Edinburgh, Scotland, 2002.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682-688, Cambridge, Massachusetts, 2001. The MIT Press.

David Wipf, Jason Palmer, and Bhaskar Rao. Perspectives on sparse Bayesian learning. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16, Cambridge, Massachusetts, 2004. The MIT Press.