Sparse Gaussian Processes using Pseudo-inputs
Edward Snelson, Zoubin Ghahramani
Gatsby Computational Neuroscience Unit, University College London


Abstract

We present a new Gaussian process (GP) regression model whose covariance is parameterized by the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M ≪ N, where N is the number of real data points, and ...


...and its gradients, meaning that they cannot get smooth convergence. Therefore the speed of active set selection is somewhat undermined by the difficulty of selecting hyperparameters. Inappropriately learned hyperparameters will adversely affect the quality of solution, especially if one is trying to use them for automatic relevance determination (ARD) [10].

In this paper we circumvent this problem by constructing a GP regression model that enables us to find active set point locations and hyperparameters in one smooth joint optimization. The covariance function of our GP is parameterized by the locations of pseudo-inputs – an active set not constrained to be a subset of the data, found by a continuous optimization. This is a further major advantage, since we can improve the quality of our fit by the fine tuning of their precise locations. Our model is closely related to several sparse GP approximations, in particular Seeger's method of projected latent variables (PLV) [7, 8]. We discuss these relations in section 3. In principle we could also apply our technique of moving active set points off data points to approximations such as PLV. However we empirically demonstrate that a crucial difference between PLV and our method (SPGP) prevents this idea from working for PLV.

1.1 Gaussian processes for regression

We provide here a concise summary of GPs for regression, but see [11, 12, 13, 10] for more detailed reviews. We have a data set D consisting of N input vectors X = {x_n}_{n=1}^N of dimension D and corresponding real valued targets y = {y_n}_{n=1}^N. We place a zero mean Gaussian process prior on the underlying latent function f(x) that we are trying to model. We therefore have a multivariate Gaussian distribution on any finite subset of latent variables; in particular, at X: p(f | X) = N(f | 0, K_N), where N(f | m, V) is a Gaussian distribution with mean m and covariance V. In a Gaussian process the covariance matrix is constructed from a covariance function, or kernel, K, which expresses some prior notion of smoothness of the underlying function: [K_N]_{nn'} = K(x_n, x_{n'}). Usually the covariance function depends on a small number of hyperparameters, which control these smoothness properties. For our experiments later on we will use the standard Gaussian covariance with ARD hyperparameters:

    K(x_n, x_{n'}) = c \exp\!\Big[-\tfrac{1}{2}\sum_{d=1}^{D} b_d \big(x_n^{(d)} - x_{n'}^{(d)}\big)^2\Big], \qquad \theta = \{c, \mathbf{b}\}.    (1)

In standard GP regression we also assume a Gaussian noise model or likelihood p(y | f) = N(y | f, σ²I). Integrating out the latent function values we obtain the marginal likelihood:

    p(\mathbf{y} \mid X, \theta) = \mathcal{N}(\mathbf{y} \mid \mathbf{0},\, K_N + \sigma^2 I),    (2)

which is typically used to train the GP by finding a (local) maximum with respect to the hyperparameters θ and σ². Prediction is made by considering a new input point x_* and conditioning on the observed data and hyperparameters. The distribution of the target value at the new point is then:

    p(y_* \mid \mathbf{x}_*, \mathcal{D}, \theta) = \mathcal{N}\big(y_* \mid \mathbf{k}_*^\top (K_N + \sigma^2 I)^{-1} \mathbf{y},\; K_{**} - \mathbf{k}_*^\top (K_N + \sigma^2 I)^{-1} \mathbf{k}_* + \sigma^2\big),    (3)

where [k_*]_n = K(x_n, x_*) and K_** = K(x_*, x_*). The GP is a non-parametric model, because the training data are explicitly required at test time in order to construct the predictive distribution, as is clear from the above expression. GPs are prohibitive for large data sets because training requires O(N³) time due to the inversion of the covariance matrix. Once the inversion is done, prediction is O(N) for the predictive mean and O(N²) for the predictive variance per new test case.
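As a concrete illustration of the quantities above, the following minimal sketch implements the ARD covariance (1), the log marginal likelihood (2) and the predictive distribution (3) in numpy. It is not the authors' code; the function and variable names are illustrative, and the hyperparameters are assumed to be given rather than optimized.

```python
# Minimal sketch of standard GP regression: ARD Gaussian covariance (1),
# log marginal likelihood (2) and predictive distribution (3).
# All names here are illustrative, not taken from the paper's implementation.
import numpy as np

def ard_kernel(X1, X2, c, b):
    """Gaussian (squared exponential) covariance with ARD weights b, eq. (1)."""
    diff = X1[:, None, :] - X2[None, :, :]            # (N1, N2, D)
    sqdist = np.einsum('ijd,d->ij', diff ** 2, b)     # sum_d b_d (x_d - x'_d)^2
    return c * np.exp(-0.5 * sqdist)

def gp_log_marginal_likelihood(X, y, c, b, sigma2):
    """log p(y | X, theta) = log N(y | 0, K_N + sigma^2 I), eq. (2)."""
    N = len(X)
    KN = ard_kernel(X, X, c, b) + sigma2 * np.eye(N)
    L = np.linalg.cholesky(KN)                        # O(N^3) training cost
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)

def gp_predict(X, y, Xstar, c, b, sigma2):
    """Predictive mean and variance of the zero-mean GP at test inputs Xstar, eq. (3)."""
    KN = ard_kernel(X, X, c, b) + sigma2 * np.eye(len(X))
    L = np.linalg.cholesky(KN)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Kstar = ard_kernel(X, Xstar, c, b)                # [k_*]_n = K(x_n, x_*)
    mean = Kstar.T @ alpha                            # O(N) per test case
    v = np.linalg.solve(L, Kstar)
    var = c - np.sum(v ** 2, axis=0) + sigma2         # K_** = c for this kernel
    return mean, var
```

The Cholesky factorisation makes the O(N³) training cost explicit; each predictive mean is then O(N) and each predictive variance O(N²), matching the complexities quoted above.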
Figure 1: Predictive distributions (mean and two standard deviation lines) for: (a) full GP, (b) SPGP trained using gradient ascent on (9), (c) SPGP trained using gradient ascent on (10). Initial pseudo-point positions are shown at the top as red crosses; final pseudo-point positions are shown at the bottom as blue crosses (the y location of these crosses on the plots is not meaningful).

We are left with the problem of finding the pseudo-input locations X̄ and hyperparameters Θ = {θ, σ²}. We can do this by computing the marginal likelihood from (5) and (6):

    p(\mathbf{y} \mid X, \bar{X}, \Theta) = \int \! d\bar{\mathbf{f}}\; p(\mathbf{y} \mid X, \bar{X}, \bar{\mathbf{f}})\, p(\bar{\mathbf{f}} \mid \bar{X}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0},\, K_{NM} K_M^{-1} K_{MN} + \Lambda + \sigma^2 I).    (9)

The marginal likelihood can then be maximized with respect to all these parameters {X̄, Θ} by gradient ascent. The details of the gradient calculations are long and tedious and therefore omitted here for brevity. They closely follow the derivations of hyperparameter gradients of Seeger et al. [7] (see also section 3), and as there, can be most efficiently coded with Cholesky factorisations. Note that K_M, K_MN and Λ are all functions of the M pseudo-inputs X̄ and θ. The exact form of the gradients will of course depend on the functional form of the covariance function chosen, but our method will apply to any covariance that is differentiable with respect to the input points.

It is worth saying that the SPGP can be viewed as a standard GP with a particular non-stationary covariance function parameterized by the pseudo-inputs. Since we now have MD + |θ| parameters to fit, instead of just |θ| for the full GP, one may be worried about overfitting. However, consider the case where we let M = N and X̄ = X – the pseudo-inputs coincide with the real inputs. At this point the marginal likelihood is equal to that of a full GP (2). This is because at this point K_MN = K_M = K_N and Λ = 0. Moreover the predictive distribution (8) also collapses to the full GP predictive distribution (3). These are clearly desirable properties of the model, and they give confidence that a good solution will be found when M ≪ N. However it is the case that hyperparameter learning complicates matters, and we discuss this further in section 4.

3 Relation to other methods

It turns out that Seeger's method of PLV [7, 8] uses a very similar marginal likelihood approximation and predictive distribution. If you remove Λ from all the SPGP equations you get precisely their expressions. In particular the marginal likelihood they use is:

    p(\mathbf{y} \mid X, \bar{X}, \Theta) = \mathcal{N}(\mathbf{y} \mid \mathbf{0},\, K_{NM} K_M^{-1} K_{MN} + \sigma^2 I),    (10)

which has also been used elsewhere before [1, 4, 5]. They have derived this expression from a somewhat different route, as a direct approximation to the full GP marginal likelihood.
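To make the form of (9) concrete, here is a minimal sketch of evaluating the SPGP log marginal likelihood in numpy, reusing the ard_kernel helper from the sketch above. The definition of Λ comes from a part of the paper not reproduced in this transcript; the diagonal form assumed here, Λ_nn = K_nn − [K_NM K_M^{-1} K_MN]_nn, should be read as an assumption of the sketch. The matrix inversion and determinant lemmas keep the cost dominated by M×M factorisations rather than an N×N one.

```python
# Sketch of the SPGP log marginal likelihood (9):
#   p(y | X, Xbar, Theta) = N(y | 0, Q + Lambda + sigma^2 I),  Q = K_NM K_M^{-1} K_MN,
# assuming Lambda = diag(K_N - Q). Cost is O(M^2 N) via the matrix inversion lemma.
import numpy as np

def spgp_log_marginal_likelihood(X, y, Xbar, c, b, sigma2, jitter=1e-6):
    N, M = len(X), len(Xbar)
    KM = ard_kernel(Xbar, Xbar, c, b) + jitter * np.eye(M)   # M x M
    KMN = ard_kernel(Xbar, X, c, b)                          # M x N
    LM = np.linalg.cholesky(KM)
    V = np.linalg.solve(LM, KMN)                             # V^T V = K_NM K_M^{-1} K_MN
    # Assumed diagonal correction Lambda_nn = K_nn - Q_nn (clamped against round-off)
    Lam = np.maximum(c - np.sum(V ** 2, axis=0), 0.0)
    D = Lam + sigma2                                         # diagonal of Lambda + sigma^2 I
    # Woodbury / determinant lemma through the M x M matrix B = I + V diag(1/D) V^T
    Vs = V / np.sqrt(D)
    B = np.eye(M) + Vs @ Vs.T
    LB = np.linalg.cholesky(B)
    beta = np.linalg.solve(LB, Vs @ (y / np.sqrt(D)))
    quad = np.sum(y ** 2 / D) - np.sum(beta ** 2)            # y^T (Q + diag(D))^{-1} y
    logdet = np.sum(np.log(D)) + 2 * np.sum(np.log(np.diag(LB)))
    return -0.5 * (quad + logdet + N * np.log(2 * np.pi))
```

Setting the Λ term to zero in this computation recovers the PLV marginal likelihood (10) discussed in section 3.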
Figure 2: Sample data drawn from the marginal likelihood of: (a) a full GP, (b) SPGP, (c) PLV. For (b) and (c), the blue crosses show the locations of the 10 pseudo-input points.

As discussed earlier, the major difference between our method and these other methods is that they do not use this marginal likelihood to learn the locations of active set input points – only the hyperparameters are learnt from (10). This raises the question of what would happen if we tried to use their marginal likelihood approximation (10) instead of (9) to learn pseudo-input locations by gradient ascent. We show that the Λ that appears in the SPGP marginal likelihood (9) is crucial for finding pseudo-input points by gradients.

Figure 1 shows what happens when we try to optimize these two likelihoods using gradient ascent with respect to the pseudo-inputs, on a simple 1D data set. Plotted are the predictive distributions and the initial and final locations of the pseudo-inputs. Hyperparameters were fixed to their true values for this example. The initial pseudo-input locations were chosen adversarially: all towards the left of the input space (red crosses). Using the SPGP likelihood, the pseudo-inputs spread themselves along the extent of the training data, and the predictive distribution matches the full GP very closely (Figure 1(b)). Using the PLV likelihood, the points begin to spread, but very quickly become stuck as the gradient pushing the points towards the right becomes tiny (Figure 1(c)).

Figure 2 compares data sampled from the marginal likelihoods (9) and (10), given a particular setting of the hyperparameters and a small number of pseudo-input points. The major difference between the two is that the SPGP likelihood has a constant marginal variance of K_nn + σ², whereas the PLV marginal variance decreases to σ² away from the pseudo-inputs. Alternatively, the noise component of the PLV likelihood is a constant σ², whereas the SPGP noise grows to K_nn + σ² away from the pseudo-inputs. If one is in the situation of Figure 1(c), then under the SPGP likelihood, moving the rightmost pseudo-input slightly to the right will immediately start to reduce the noise in this region from K_nn + σ² towards σ². Hence there will be a strong gradient pulling it to the right. With the PLV likelihood, the noise is fixed at σ² everywhere, and moving the point to the right does not improve the quality of fit of the mean function enough locally to provide a significant gradient. Therefore the points become stuck, and we believe this effect accounts for the failure of the PLV likelihood in Figure 1(c). It should be emphasised that the global optimum of the PLV likelihood (10) may well be a good solution, but it is going to be difficult to find with gradients. The SPGP likelihood (9) also suffers from local optima of course, but not so catastrophically. It may be interesting in the future to compare which performs better for hyperparameter optimization.
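The contrast just described can be reproduced numerically. The short sketch below, which again reuses the illustrative ard_kernel helper and the assumed diagonal Λ from the earlier sketches, evaluates the two marginal variances on a 1D grid for pseudo-inputs clustered on the left, mirroring the setup of Figure 1.

```python
# Illustrative comparison of the marginal variances discussed above. For a test input x,
# the SPGP marginal variance K_nn + sigma^2 is constant (stationary kernel), while the
# PLV marginal variance Q_nn + sigma^2 decays to sigma^2 far from the pseudo-inputs.
import numpy as np

def marginal_variances(x_grid, Xbar, c, b, sigma2):
    KM = ard_kernel(Xbar, Xbar, c, b) + 1e-6 * np.eye(len(Xbar))
    KMx = ard_kernel(Xbar, x_grid, c, b)
    V = np.linalg.solve(np.linalg.cholesky(KM), KMx)
    Q_diag = np.sum(V ** 2, axis=0)                   # [K_xM K_M^{-1} K_Mx] diagonal
    spgp_var = np.full(len(x_grid), c) + sigma2       # constant K_nn + sigma^2
    plv_var = Q_diag + sigma2                         # decays to sigma^2 away from Xbar
    return spgp_var, plv_var

# e.g. pseudo-inputs clustered adversarially on the left, as in Figure 1:
x_grid = np.linspace(-5, 5, 200)[:, None]
Xbar = np.linspace(-5, -3, 10)[:, None]
spgp_var, plv_var = marginal_variances(x_grid, Xbar, c=1.0, b=np.array([1.0]), sigma2=0.1)
```

Far from X̄ the low-rank term Q_nn is close to zero, so plv_var falls to σ² while spgp_var stays at c + σ²; it is this location-dependent noise term that supplies the gradient pulling pseudo-inputs towards unexplained data.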
4 Experiments

In the previous section we showed our gradient method successfully learning the pseudo-inputs on a 1D example. There the initial pseudo-input points were chosen adversarially, but on a real problem it is sensible to initialize by randomly placing them on real data points, and this is what we do for all of our experiments.

Figure 3: Our results have been added to plots reproduced with kind permission from [7]. The plots show mean square test error as a function of active/pseudo set size M. Top row – data set kin-40k, bottom row – pumadyn-32nm¹. We have added circles which show SPGP with both hyperparameter and pseudo-input learning from random initialisation. For kin-40k the squares show SPGP with hyperparameters obtained from a full GP and fixed. For pumadyn-32nm the squares show hyperparameters initialized from a full GP. random, info-gain and smo-bart are explained in the text. The horizontal lines are a full GP trained on a subset of the data.

To compare our results to other methods we have run experiments on exactly the same data sets as in Seeger et al. [7], following precisely their preprocessing and testing methods. In Figure 3, we have reproduced their learning curves for two large data sets¹, superimposing our test error (mean squared). Seeger et al. compare three methods: random, info-gain and smo-bart. random involves picking an active set of size M randomly from among the training data. info-gain is their own greedy subset selection method, which is extremely cheap to train – barely more expensive than random. smo-bart is Smola and Bartlett's [1] more expensive greedy subset selection method. Also shown with horizontal lines is the test error for a full GP trained on a subset of the data of size 2000 for data set kin-40k and 1024 for pumadyn-32nm. For these learning curves, they do not actually learn hyperparameters by maximizing their approximation to the marginal likelihood (10). Instead they fix them to those obtained from the full GP².

¹ kin-40k: 10000 training, 30000 test, 9 attributes, see www.igi.tugraz.at/aschwaig/data.html. pumadyn-32nm: 7168 training, 1024 test, 33 attributes, see www.cs.toronto/delve.
² Seeger et al. have a separate section testing their likelihood approximation (10) to learn hyperparameters, in conjunction with the active set selection methods. They show that it can be used to reliably learn hyperparameters with info-gain for active set sizes of 100 and above. They have more trouble reliably learning hyperparameters for very small active sets.
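As a rough indication of what the joint learning used for these curves involves, the sketch below initializes M pseudo-inputs on randomly chosen training points and then maximizes the SPGP log marginal likelihood (9) over the pseudo-inputs and log hyperparameters with an off-the-shelf optimizer. It reuses the hypothetical spgp_log_marginal_likelihood from the earlier sketch; unlike the paper, which derives analytic gradients, it falls back on L-BFGS with numerical gradients, so it is only practical for small problems.

```python
# Sketch of joint optimization of pseudo-inputs and hyperparameters, initialized on
# randomly chosen training points. The parameterization and optimizer choice are
# assumptions of this sketch, not the authors' implementation.
import numpy as np
from scipy.optimize import minimize

def fit_spgp(X, y, M, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xbar0 = X[rng.choice(N, size=M, replace=False)]           # random real data points
    theta0 = np.concatenate([Xbar0.ravel(),
                             np.zeros(1 + D + 1)])            # log c, log b (D), log sigma^2

    def neg_log_lik(params):
        Xbar = params[:M * D].reshape(M, D)
        log_c = params[M * D]
        log_b = params[M * D + 1:M * D + 1 + D]
        log_s2 = params[-1]
        return -spgp_log_marginal_likelihood(X, y, Xbar,
                                             np.exp(log_c), np.exp(log_b), np.exp(log_s2))

    res = minimize(neg_log_lik, theta0, method='L-BFGS-B')    # numerical gradients
    Xbar = res.x[:M * D].reshape(M, D)
    hyperparams = np.exp(res.x[M * D:])                       # (c, b_1..b_D, sigma^2)
    return Xbar, hyperparams
```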
For kin-40k we follow Seeger et al.'s procedure of setting the hyperparameters from the full GP on a subset. We then optimize the pseudo-input positions, and plot the results as red squares. We see the SPGP learning curve lying significantly below all three other methods in Figure 3. We rapidly approach the error of a full GP trained on 2000 points, using a pseudo set of only a few hundred points. We then try the harder task of also finding the hyperparameters at the same time as the pseudo-inputs. The results are plotted as blue circles. The method performs extremely well for small M, but we see some overfitting behaviour for large M, which seems to be caused by the noise hyperparameter being driven too small (the blue circles have higher likelihood than the red squares below them).

For data set pumadyn-32nm, we again try to jointly find hyperparameters and pseudo-inputs. Again Figure 3 shows SPGP with extremely low error for small pseudo set size – with just 10 pseudo-inputs we are already close to the error of a full GP trained on 1024 points. However, in this case increasing the pseudo set size does not decrease our error. In this problem there is a large number of irrelevant attributes, and the relevant ones need to be singled out by ARD. Although the hyperparameters learnt by our method are reasonable (2 out of the 4 relevant dimensions are found), they are not good enough to get down to the error of the full GP. However, if we initialize our gradient algorithm with the hyperparameters of the full GP, we get the points plotted as squares (this time the red likelihoods are higher than the blue likelihoods, so it is a problem of local optima, not overfitting). Now with only a pseudo set of size 25 we reach the performance of the full GP, and significantly outperform the other methods (which also had their hyperparameters set from the full GP).

Another main difference between the methods lies in training time. Our method performs optimization over a potentially large parameter space, and hence is relatively expensive to train. On the face of it, methods such as info-gain and random are extremely cheap. However, all these methods must be combined with obtaining hyperparameters in some way – either by a full GP on a subset (generally expensive), or by gradient ascent on an approximation to the likelihood. When you consider this combined task, and that all methods involve some kind of gradient based procedure, then none of the methods are particularly cheap. We believe that the gain in accuracy achieved by our method can often be worth the extra training time associated with optimizing in a larger parameter space.

5 Conclusions, extensions and future work

Figure 4: Regression on a data set with input dependent noise. Left: standard GP. Right: SPGP. Predictive mean and two standard deviation lines are shown. Crosses show final locations of pseudo-inputs for the SPGP. Hyperparameters are also learnt.

Although GPs are very flexible regression models, they are still limited by the form of the covariance function. For example it is difficult to model non-stationary processes with a GP because it is hard to construct sensible non-stationary covariance functions. Although the SPGP is not specifically designed to model non-stationarity, the extra flexibility associated with moving pseudo-inputs around can actually achieve this to a certain extent. Figure 4 shows the SPGP fit to some data with an input dependent noise variance. The SPGP achieves a much better fit to the data than the standard GP by moving almost all the pseudo-input points outside the region of data³. It will be interesting to test these capabilities further in the future. The extension to classification is also a natural avenue to explore.

³ It should be said that there are local optima in this problem, and other solutions looked closer to the standard GP. We ran the method 5 times with random initialisations. All runs had higher likelihood than the GP; the one with the highest likelihood is plotted.
We have demonstrated a significant decrease in test error over the other methods for a given small pseudo/active set size. Our method runs into problems when we consider much larger pseudo set size and/or high dimensional input spaces, because the space in which we are optimizing becomes impractically big. However we have currently only tried using an 'off the shelf' conjugate gradient minimizer, or L-BFGS, and there are certainly improvements that can be made in this area. For example we can try optimizing subsets of variables iteratively (chunking), or stochastic gradient ascent, or we could make a hybrid by picking some points randomly and optimizing others. In general though we consider our method most useful when one wants a very sparse (hence fast prediction) and accurate solution. One further way in which to deal with large D is to learn a low dimensional projection of the input space. This has been considered for GPs before [14], and could easily be applied to our model.

In conclusion, we have presented a new method for sparse GP regression, which shows a significant performance gain over other methods especially when searching for an extremely sparse solution. We have shown that the added flexibility of moving pseudo-input points which are not constrained to lie on the true data points leads to better solutions, and even some non-stationary effects can be modelled. Finally we have shown that hyperparameters can be jointly learned with pseudo-input points with reasonable success.

Acknowledgements

Thanks to the authors of [7] for agreeing to make their results and plots available for reproduction. Thanks to all at the Sheffield GP workshop for helping to clarify this work.

References

[1] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13. MIT Press, 2000.
[2] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13. MIT Press, 2000.
[3] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000.
[4] L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14:641–668, 2002.
[5] L. Csató. Gaussian Processes — Iterative Sparse Approximations. PhD thesis, Aston University, UK, 2002.
[6] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: the informative vector machine. In Advances in Neural Information Processing Systems 15. MIT Press, 2002.
[7] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In C. M. Bishop and B. J. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.
[8] M. Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations. PhD thesis, University of Edinburgh, 2003.
[9] J. Quiñonero Candela. Learning with Uncertainty — Gaussian Processes and Relevance Vector Machines. PhD thesis, Technical University of Denmark, 2004.
[10] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, NATO ASI Series, pages 133–166. Kluwer Academic Press, 1998.
[11] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems 8. MIT Press, 1996.
[12] C. E. Rasmussen. Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression. PhD thesis, University of Toronto, 1996.
[13] M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Cambridge University, 1997.
[14] F. Vivarelli and C. K. I. Williams. Discovering hidden features with Gaussian processes regression. In Advances in Neural Information Processing Systems 11. MIT Press, 1998.