## Gaussian Processes for Regression: A Quick Introduction

M. Ebden, August 2008. Comments to mark.ebden@eng.ox.ac.uk

### 1 Motivation

Figure 1 illustrates a typical example of a prediction problem: given some noisy observations of a dependent variable at certain values of the independent variable $x$, what is our best estimate of the dependent variable at a new value, $x_*$?

If we expect the underlying function $f(x)$ to be linear, and can make some assumptions about the input data, we might use a least-squares method to fit a straight line (linear regression). Moreover, if we suspect $f(x)$ may also be quadratic, cubic, or even nonpolynomial, we can use the principles of model selection to choose among the various possibilities.

Gaussian process regression (GPR) is an even finer approach than this. Rather than claiming $f(x)$ relates to some specific models (e.g. $f(x) = mx + c$), a Gaussian process can represent $f(x)$ obliquely, but rigorously, by letting the data "speak" more clearly for themselves. GPR is still a form of supervised learning, but the training data are harnessed in a subtler way.

As such, GPR is a less "parametric" tool. However, it's not completely free-form, and if we're unwilling to make even basic assumptions about $f(x)$, then more general techniques should be considered, including those underpinned by the principle of maximum entropy; Chapter 6 of Sivia and Skilling (2006) offers an introduction.

Figure 1: Given six noisy data points (error bars are indicated with vertical lines), we are interested in estimating a seventh at $x_* = 0.2$.
### 2 Definition of a Gaussian Process

Gaussian processes (GPs) extend multivariate Gaussian distributions to infinite dimensionality. Formally, a Gaussian process generates data located throughout some domain such that any finite subset of the range follows a multivariate Gaussian distribution. Now, the $n$ observations in an arbitrary data set, $\mathbf{y} = \{y_1, \ldots, y_n\}$, can always be imagined as a single point sampled from some multivariate ($n$-variate) Gaussian distribution, after enough thought. Hence, working backwards, this data set can be partnered with a GP. Thus GPs are as universal as they are simple.

Very often, it's assumed that the mean of this partner GP is zero everywhere. What relates one observation to another in such cases is just the covariance function, $k(x, x')$. A popular choice is the "squared exponential",

$$k(x, x') = \sigma_f^2 \exp\left[\frac{-(x - x')^2}{2l^2}\right], \tag{1}$$

where the maximum allowable covariance is defined as $\sigma_f^2$; this should be high for functions which cover a broad range on the $y$ axis. If $x \approx x'$, then $k(x, x')$ approaches this maximum, meaning $f(x)$ is nearly perfectly correlated with $f(x')$. This is good: for our function to look smooth, neighbours must be alike. Now if $x$ is distant from $x'$, we have instead $k(x, x') \approx 0$, i.e. the two points cannot "see" each other. So, for example, during interpolation at new $x$ values, distant observations will have negligible effect. How much effect this separation has will depend on the length parameter, $l$, so there is much flexibility built into (1).

Not quite enough flexibility though: the data are often noisy as well, from measurement errors and so on. Each observation $y$ can be thought of as related to an underlying function $f(x)$ through a Gaussian noise model:

$$y = f(x) + \mathcal{N}(0, \sigma_n^2), \tag{2}$$

something which should look familiar to those who've done regression before. Regression is the search for $f(x)$. Purely for simplicity of exposition on the next page, we take the novel approach of folding the noise into $k(x, x')$, by writing

$$k(x, x') = \sigma_f^2 \exp\left[\frac{-(x - x')^2}{2l^2}\right] + \sigma_n^2\, \delta(x, x'), \tag{3}$$

where $\delta(x, x')$ is the Kronecker delta function. (When most people use Gaussian processes, they keep $\sigma_n$ separate from $k(x, x')$. However, our redefinition of $k(x, x')$ is equally suitable for working with problems of the sort posed in Figure 1. So, given $n$ observations $\mathbf{y}$, our objective is to predict $y_*$, not the "actual" $f_*$; their expected values are identical according to (2), but their variances differ owing to the observational noise process. For example, in Figure 1, the expected value of $y_*$, and of $f_*$, is the dot at $x_*$.)

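To prepare for GPR, we calculate the covariance function, (3), among all possible combinations of these points, summarizing our findings in three matrices:

$$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{bmatrix}, \tag{4}$$

$$K_* = \begin{bmatrix} k(x_*, x_1) & k(x_*, x_2) & \cdots & k(x_*, x_n) \end{bmatrix}, \qquad K_{**} = k(x_*, x_*). \tag{5}$$

Confirm for yourself that the diagonal elements of $K$ are $\sigma_f^2 + \sigma_n^2$, and that its extreme off-diagonal elements tend to zero when $x$ spans a large enough domain.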
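The construction of the three matrices can be sketched in a few lines of NumPy. This is only an illustration, not the author's code: the function name `kernel`, the input locations and the hyperparameter values are all assumptions chosen for the example, and the Kronecker delta is implemented by comparing input values, so repeated inputs would also receive the noise term.

```python
import numpy as np

def kernel(xa, xb, sigma_f, length, sigma_n):
    """Covariance function of eq. (3), evaluated for every pair (xa[i], xb[j])."""
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)   # column of inputs
    xb = np.asarray(xb, dtype=float).reshape(1, -1)   # row of inputs
    squared_exponential = sigma_f**2 * np.exp(-(xa - xb) ** 2 / (2.0 * length**2))
    delta = (xa == xb).astype(float)                  # 1 where the inputs coincide, else 0
    return squared_exponential + sigma_n**2 * delta

# Hypothetical inputs and hyperparameters, chosen only for illustration.
x = np.array([-2.0, -1.0, 0.0, 1.5, 30.0])   # the last point is deliberately remote
x_star = 0.5
sigma_f, length, sigma_n = 1.0, 1.0, 0.1

K = kernel(x, x, sigma_f, length, sigma_n)                           # eq. (4), shape (n, n)
K_star = kernel([x_star], x, sigma_f, length, sigma_n)               # eq. (5), shape (1, n)
K_star_star = kernel([x_star], [x_star], sigma_f, length, sigma_n)   # eq. (5), shape (1, 1)

print(np.diag(K))   # every entry equals sigma_f**2 + sigma_n**2
print(K[0, -1])     # covariance between the two most distant inputs, essentially zero
```

The two printed quantities correspond to the two checks suggested in the text: the diagonal of $K$ and its extreme off-diagonal element.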
### 3 How to Regress Using Gaussian Processes

Since the key assumption in GP modelling is that our data can be represented as a sample from a multivariate Gaussian distribution, we have that

$$\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K & K_*^T \\ K_* & K_{**} \end{bmatrix}\right), \tag{6}$$

where $T$ indicates matrix transposition. We are of course interested in the conditional probability $p(y_* \mid \mathbf{y})$: "given the data, how likely is a certain prediction for $y_*$?". As explained more slowly in the Appendix, the probability follows a Gaussian distribution:

$$y_* \mid \mathbf{y} \sim \mathcal{N}\left(K_* K^{-1} \mathbf{y},\; K_{**} - K_* K^{-1} K_*^T\right). \tag{7}$$

Our best estimate for $y_*$ is the mean of this distribution:

$$\bar{y}_* = K_* K^{-1} \mathbf{y}, \tag{8}$$

and the uncertainty in our estimate is captured in its variance:

$$\operatorname{var}(y_*) = K_{**} - K_* K^{-1} K_*^T. \tag{9}$$

We're now ready to tackle the data in Figure 1.

1. There are $n = 6$ observations $\mathbf{y}$, at

   $$\mathbf{x} = \begin{bmatrix} -1.50 & -1.00 & -0.75 & -0.40 & -0.25 & 0.00 \end{bmatrix}.$$

   We know $\sigma_n = 0.3$ from the error bars. With judicious choices of $\sigma_f$ and $l$ (more on this later), we have enough to calculate a covariance matrix using (4):

   $$K = \begin{bmatrix} 1.70 & 1.42 & 1.21 & 0.87 & 0.72 & 0.51 \\ 1.42 & 1.70 & 1.56 & 1.34 & 1.21 & 0.97 \\ 1.21 & 1.56 & 1.70 & 1.51 & 1.42 & 1.21 \\ 0.87 & 1.34 & 1.51 & 1.70 & 1.59 & 1.48 \\ 0.72 & 1.21 & 1.42 & 1.59 & 1.70 & 1.56 \\ 0.51 & 0.97 & 1.21 & 1.48 & 1.56 & 1.70 \end{bmatrix}.$$

   From (5) we also have $K_{**} = 1.70$ and

   $$K_* = \begin{bmatrix} 0.38 & 0.79 & 1.03 & 1.35 & 1.46 & 1.58 \end{bmatrix}.$$

2. From (8) and (9), $\bar{y}_* = 0.95$ and $\operatorname{var}(y_*) = 0.21$.

3. Figure 1 shows a data point with a question mark underneath, representing the estimation of the dependent variable at $x_* = 0.2$.

We can repeat the above procedure for various other points spread over some portion of the $x$ axis, as shown in Figure 2. (In fact, equivalently, we could avoid the repetition by performing the above procedure once with suitably larger $K_*$ and $K_{**}$ matrices. In this case, since there are 1,000 test points spread over the $x$ axis, $K_{**}$ would be of size 1,000 × 1,000.) Rather than plotting simple error bars, we've decided to plot $\bar{y}_* \pm 1.96\sqrt{\operatorname{var}(y_*)}$, giving a 95% confidence interval.
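The worked example can be followed in code. The sketch below is an illustration under stated assumptions rather than the author's implementation: it takes $\sigma_f = 1.27$, $l = 1$ and $\sigma_n = 0.3$ as the hyperparameters, and the `y` values are hypothetical because the observations behind Figure 1 are not listed in the text. The predicted mean therefore depends on made-up data, whereas the covariance entries and the variance depend only on the inputs and hyperparameters.

```python
import numpy as np

def kernel(xa, xb, sigma_f=1.27, length=1.0, sigma_n=0.3):
    """Eq. (3): squared exponential plus noise on coincident inputs."""
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    return sigma_f**2 * np.exp(-(xa - xb) ** 2 / (2.0 * length**2)) + sigma_n**2 * (xa == xb)

x = np.array([-1.50, -1.00, -0.75, -0.40, -0.25, 0.00])
# Hypothetical observations, used only so that eq. (8) has something to act on.
y = np.array([-1.7, -1.2, -0.4, 0.1, 0.5, 0.8])
x_star = 0.2

K = kernel(x, x)                            # eq. (4)
K_star = kernel([x_star], x)                # eq. (5), shape (1, 6)
K_star_star = kernel([x_star], [x_star])    # eq. (5), shape (1, 1)

# Solve linear systems instead of forming K^-1 explicitly.
y_star_mean = (K_star @ np.linalg.solve(K, y)).item()                       # eq. (8)
y_star_var = (K_star_star - K_star @ np.linalg.solve(K, K_star.T)).item()   # eq. (9)
half_width = 1.96 * np.sqrt(y_star_var)     # half-width of the 95% interval

print(np.round(K, 2))
print(np.round(K_star, 2))
print(y_star_mean, y_star_var, half_width)
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice; for many test points, `x_star` can be replaced by an array and the same formulas applied with a larger `K_star`.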
Figure 2: The solid line indicates an estimation of $y_*$ for 1,000 values of $x_*$. Pointwise 95% confidence intervals are shaded.

### 4 GPR in the Real World

The reliability of our regression is dependent on how well we select the covariance function. Clearly if its parameters, call them $\theta = \{l, \sigma_f, \sigma_n\}$, are not chosen sensibly, the result is nonsense. Our maximum a posteriori estimate of $\theta$ occurs when $p(\theta \mid \mathbf{x}, \mathbf{y})$ is at its greatest. Bayes' theorem tells us that, assuming we have little prior knowledge about what $\theta$ should be, this corresponds to maximizing $\log p(\mathbf{y} \mid \mathbf{x}, \theta)$, given by

$$\log p(\mathbf{y} \mid \mathbf{x}, \theta) = -\frac{1}{2} \mathbf{y}^T K^{-1} \mathbf{y} - \frac{1}{2} \log |K| - \frac{n}{2} \log 2\pi. \tag{10}$$

Simply run your favourite multivariate optimization algorithm (e.g. conjugate gradients, Nelder-Mead simplex, etc.) on this equation and you've found a pretty good choice for $\theta$; in our example, $l = 1$ and $\sigma_f = 1.27$.

It's only "pretty good" because, of course, Thomas Bayes is rolling in his grave. Why commend just one answer for $\theta$, when you can integrate everything over the many different possible choices for $\theta$? Chapter 5 of Rasmussen and Williams (2006) presents the equations necessary in this case.
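As a rough illustration of maximizing the log marginal likelihood (10), the sketch below evaluates it on a coarse grid of $(l, \sigma_f)$ values with $\sigma_n$ held fixed. The grid search is a deliberately crude stand-in for the optimizers named in the text (conjugate gradients, Nelder-Mead), and the `y` values, grid ranges and function name are assumptions made for the example; the inputs are assumed distinct so that the noise term sits on the diagonal only.

```python
import numpy as np

def log_marginal_likelihood(x, y, length, sigma_f, sigma_n):
    """Eq. (10), with K built from eq. (3) for distinct inputs."""
    d = x[:, None] - x[None, :]
    K = sigma_f**2 * np.exp(-d**2 / (2.0 * length**2)) + sigma_n**2 * np.eye(len(x))
    _, log_det_K = np.linalg.slogdet(K)            # numerically stable log|K|
    data_fit = -0.5 * y @ np.linalg.solve(K, y)    # -1/2 y^T K^-1 y
    return data_fit - 0.5 * log_det_K - 0.5 * len(x) * np.log(2.0 * np.pi)

x = np.array([-1.50, -1.00, -0.75, -0.40, -0.25, 0.00])
y = np.array([-1.7, -1.2, -0.4, 0.1, 0.5, 0.8])   # hypothetical observations
sigma_n = 0.3                                     # treated as known here

lengths = np.linspace(0.2, 3.0, 15)
sigma_fs = np.linspace(0.2, 3.0, 15)
scores = np.array([[log_marginal_likelihood(x, y, l, sf, sigma_n) for sf in sigma_fs]
                   for l in lengths])

i, j = np.unravel_index(np.argmax(scores), scores.shape)
best_length, best_sigma_f = lengths[i], sigma_fs[j]
print(best_length, best_sigma_f, scores[i, j])
```

In practice the same function (negated) could be handed to a general-purpose optimizer; the grid merely makes the shape of the objective easy to inspect.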

Finally, if you feel you've grasped the toy problem in Figure 2, the next two examples handle more complicated cases. Figure 3(a), in addition to a long-term downward trend, has some fluctuations, so we might use a more sophisticated covariance function:

$$k(x, x') = \sigma_{f_1}^2 \exp\left[\frac{-(x - x')^2}{2l_1^2}\right] + \sigma_{f_2}^2 \exp\left[\frac{-(x - x')^2}{2l_2^2}\right] + \sigma_n^2\, \delta(x, x'). \tag{11}$$

The first term takes into account the small vicissitudes of the dependent variable, and the second term has a longer length parameter ($l_2 \approx 6 l_1$) to represent its long-term trend. Covariance functions can be grown in this way ad infinitum, to suit the complexity of your particular data.

Figure 3: Estimation of $y_*$ (solid line) for a function with (a) short-term and long-term dynamics, and (b) long-term dynamics and a periodic element. Observations are shown as crosses.

The function looks as if it might contain a periodic element, but it's difficult to be sure. Let's consider another function, which we're told has a periodic element. The solid line in Figure 3(b) was regressed with the following covariance function:

$$k(x, x') = \sigma_f^2 \exp\left[\frac{-(x - x')^2}{2l^2}\right] + \exp\left\{-2 \sin^2\left[\nu \pi (x - x')\right]\right\} + \sigma_n^2\, \delta(x, x'). \tag{12}$$

The first term represents the hill-like trend over the long term, and the second term gives periodicity with frequency $\nu$. This is the first time we've encountered a case where $x$ and $x'$ can be distant and yet still "see" each other (that is, $k(x, x') \not\approx 0$ for $x - x' \to \infty$).

What if the dependent variable has other dynamics which, a priori, you expect to appear? There's no limit to how complicated $k(x, x')$ can be, provided $K$ is positive definite. Chapter 4 of Rasmussen and Williams (2006) offers a good outline of the range of covariance functions you should keep in your toolkit.

"Hang on a minute," you ask, "isn't choosing a covariance function from a toolkit a lot like choosing a model type, such as linear versus cubic, which we discussed at the outset?" Well, there are indeed similarities. In fact, there is no way to perform regression without imposing at least a modicum of structure on the data set; such is the nature of generative modelling. However, it's worth repeating that Gaussian processes do allow the data to speak very clearly. For example, there exists excellent theoretical justification for the use of (1) in many settings (Rasmussen and Williams (2006), Section 4.3). You will still want to investigate carefully which covariance functions are appropriate for your data set. Essentially, choosing among alternative functions is a way of reflecting various forms of prior knowledge about the physical process under investigation.
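To make the idea of growing covariance functions by addition concrete, here is a small sketch of a two-length-scale kernel and a trend-plus-periodic kernel. The function names, input grid and parameter values are hypothetical, and the periodic term is written as $\exp\{-2\sin^2[\nu\pi(x - x')]\}$, which is an assumption about the exact form intended. The final lines check numerically that the resulting matrix is positive definite, the one condition the text imposes.

```python
import numpy as np

def squared_exponential(d, sigma_f, length):
    """Squared-exponential term for an array of differences d = x - x'."""
    return sigma_f**2 * np.exp(-d**2 / (2.0 * length**2))

def periodic(d, nu):
    """Periodic term with frequency nu; equals 1 whenever d is a whole number of periods."""
    return np.exp(-2.0 * np.sin(nu * np.pi * d) ** 2)

def k_two_scales(xa, xb, sigma_f1, l1, sigma_f2, l2, sigma_n):
    """Short-term plus long-term dynamics, with noise on coincident inputs."""
    d = xa[:, None] - xb[None, :]
    return (squared_exponential(d, sigma_f1, l1)
            + squared_exponential(d, sigma_f2, l2)
            + sigma_n**2 * (d == 0))

def k_trend_plus_periodic(xa, xb, sigma_f, length, nu, sigma_n):
    """Long-term trend plus a periodic element, with noise on coincident inputs."""
    d = xa[:, None] - xb[None, :]
    return squared_exponential(d, sigma_f, length) + periodic(d, nu) + sigma_n**2 * (d == 0)

x = np.linspace(0.0, 10.0, 21)                       # hypothetical inputs
K_scales = k_two_scales(x, x, 0.5, 0.5, 2.0, 3.0, 0.1)       # l2 = 6 * l1
K_periodic = k_trend_plus_periodic(x, x, 1.0, 2.0, 1.0, 0.1)

print(K_scales[0, -1])      # distant points barely "see" each other
print(K_periodic[0, -1])    # distant points a whole number of periods apart still do
print(np.linalg.eigvalsh(K_periodic).min())   # positive, so the matrix is positive definite
```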
### 5 Discussion

We've presented a brief outline of the mathematics of GPR, but practical implementation of the above ideas requires the solution of a few algorithmic hurdles as opposed to those of data analysis. If you aren't a good computer programmer, then the code for Figures 1 and 2 is at ftp://ftp.robots.ox.ac.uk/pub/outgoing/mebden/misc/GPtut.zip, and more general code can be found at http://www.gaussianprocess.org/gpml.

We've merely scratched the surface of a powerful technique (MacKay, 1998). First, although the focus has been on one-dimensional inputs, it's simple to accept those of higher dimension. Whereas $x$ would then change from a scalar to a vector, $k(x, x')$ would remain a scalar and so the maths overall would be virtually unchanged. Second, the zero vector representing the mean of the multivariate Gaussian distribution in (6) can be replaced with functions of $x$. Third, in addition to their use in regression, GPs are applicable to integration, global optimization, mixture-of-experts models, unsupervised learning models, and more; see Chapter 9 of Rasmussen and Williams (2006). The next tutorial will focus on their use in classification.

### References

MacKay, D. (1998). In C.M. Bishop (Ed.), Neural networks and machine learning. (NATO ASI Series, Series F, Computer and Systems Sciences, Vol. 168, pp. 133-166.) Dordrecht: Kluwer Academic Press.

Rasmussen, C. and C. Williams (2006). Gaussian Processes for Machine Learning. MIT Press.

Sivia, D. and J. Skilling (2006). Data Analysis: A Bayesian Tutorial (second ed.). Oxford Science Publications.

### Appendix

Imagine a data sample $\hat{\mathbf{x}}$ taken from some multivariate Gaussian distribution with zero mean and a covariance given by matrix $\Sigma$. Now decompose $\hat{\mathbf{x}}$ arbitrarily into two consecutive subvectors $\mathbf{x}$ and $\mathbf{y}$; in other words, writing $\hat{\mathbf{x}} \sim \mathcal{N}(\mathbf{0}, \Sigma)$ would be the same as writing

$$\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} A & C \\ C^T & B \end{bmatrix}\right), \tag{13}$$

where $A$, $B$, and $C$ are the corresponding bits and pieces that make up $\Sigma$.

Interestingly, the conditional distribution of $\mathbf{y}$ given $\mathbf{x}$ is itself Gaussian-distributed. If the covariance matrix $\Sigma$ were diagonal or even block diagonal, then knowing $\mathbf{x}$ wouldn't tell us anything about $\mathbf{y}$: specifically, $\mathbf{y} \mid \mathbf{x} \sim \mathcal{N}(\mathbf{0}, B)$. On the other hand, if $C$ were nonzero, then some matrix algebra leads us to

$$\mathbf{y} \mid \mathbf{x} \sim \mathcal{N}\left(C^T A^{-1} \mathbf{x},\; B - C^T A^{-1} C\right). \tag{14}$$

The mean, $C^T A^{-1} \mathbf{x}$, is known as the "matrix of regression coefficients", and the variance, $B - C^T A^{-1} C$, is the "Schur complement of $A$ in $\Sigma$".

In summary, if we know some of $\hat{\mathbf{x}}$, we can use that to inform our estimate of what the rest of $\hat{\mathbf{x}}$ might be, thanks to the revealing off-diagonal elements of $\Sigma$.
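The "some matrix algebra" behind the conditional Gaussian can be checked numerically. The sketch below, built on a randomly generated covariance matrix that is purely an assumption for illustration, computes the conditional mean and covariance from the blocks $A$, $B$, $C$ and compares them with the equivalent expressions obtained from the precision matrix $\Sigma^{-1}$, a standard identity for partitioned Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
Sigma = M @ M.T + 5.0 * np.eye(5)    # an arbitrary positive-definite covariance

# Partition: the first two variables play the role of x, the remaining three of y.
A = Sigma[:2, :2]
C = Sigma[:2, 2:]
B = Sigma[2:, 2:]
x_known = np.array([0.3, -1.2])      # hypothetical observed values of x

# Block form: mean C^T A^-1 x and Schur complement B - C^T A^-1 C.
cond_mean = C.T @ np.linalg.solve(A, x_known)
cond_cov = B - C.T @ np.linalg.solve(A, C)

# Cross-check via the precision matrix P = Sigma^-1:
# cov(y|x) = (P_yy)^-1 and mean(y|x) = -(P_yy)^-1 P_yx x.
P = np.linalg.inv(Sigma)
alt_cov = np.linalg.inv(P[2:, 2:])
alt_mean = -alt_cov @ P[2:, :2] @ x_known

print(np.allclose(cond_mean, alt_mean), np.allclose(cond_cov, alt_cov))
```

Setting `C` to zero in the block formulas gives a zero conditional mean and covariance `B`, which matches the block-diagonal case described in the text.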
## Gaussian Processes for Classification: A Quick Introduction

M. Ebden, August 2008. Prerequisite reading: Gaussian Processes for Regression

### 1 Overview

As mentioned in the previous document, GPs can be applied to problems other than regression. For example, if the output of a GP is squashed onto the range $[0, 1]$, it can represent the probability of a data point belonging to one of, say, two types, and voilà, we can ascertain classifications. This is the subject of the current document.

The big difference between GPR and GPC is how the output data, $y$, are linked to the underlying function outputs, $f$. They are no longer connected simply via a noise process as in (2) in the previous document, but are instead now discrete: say $y = 1$ precisely for one class and $y = -1$ for the other. In principle, we could try fitting a GP that produces an output of approximately $1$ for some values of $x$ and approximately $-1$ for others, simulating this discretization. Instead, we interpose the GP between the data and a squashing function; then, classification of a new data point $x_*$ involves two steps instead of one:

1. Evaluate a "latent function" $f$ which models qualitatively how the likelihood of one class versus the other changes over the $x$ axis. This is the GP.
2. Squash the output of this latent function onto $[0, 1]$ using any sigmoidal function, $\pi(f) \equiv \operatorname{prob}(y = 1 \mid f)$.

Writing these two steps schematically,

$$\text{data, } (\mathbf{x}, \mathbf{y}) \;\xrightarrow{\text{GP}}\; \text{latent function, } f \mid \mathbf{x} \;\xrightarrow{\text{sigmoid}}\; \text{class probability, } \pi(f).$$

The next section will walk you through more slowly how such a classifier operates. Section 3 explains how to train the classifier, so perhaps we're presenting things in reverse order! Section 4 handles classification when there are more than two classes.

Before we get started, a quick note on $\pi(f)$. Although other forms will do, here we will prescribe it to be the cumulative Gaussian distribution, $\Phi(f)$. This S-shaped function satisfies our needs, mapping high $f(x)$ into $\pi(f) \approx 1$, and low $f(x)$ into $\pi(f) \approx 0$.

A second quick note, revisiting (6) and (7) in the first document: confirm for yourself that, if there were no noise ($\sigma_n = 0$), the two equations could be rewritten as

$$\begin{bmatrix} \mathbf{f} \\ f_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K & K_*^T \\ K_* & K_{**} \end{bmatrix}\right) \tag{1}$$

and

$$f_* \mid \mathbf{f} \sim \mathcal{N}\left(K_* K^{-1} \mathbf{f},\; K_{**} - K_* K^{-1} K_*^T\right). \tag{2}$$
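The squashing function can be made concrete with a few lines of Python. This is a minimal sketch in which the function name `cumulative_gaussian` is an assumption, and the cumulative Gaussian is written via the standard identity $\Phi(f) = \tfrac{1}{2}[1 + \operatorname{erf}(f/\sqrt{2})]$.

```python
import math

def cumulative_gaussian(f):
    """Phi(f): the standard normal CDF, used as the squashing function pi(f)."""
    return 0.5 * (1.0 + math.erf(f / math.sqrt(2.0)))

# High latent values map close to 1, low values close to 0, and f = 0 maps to 0.5.
for f in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f, cumulative_gaussian(f))
```

Because $\Phi(-f) = 1 - \Phi(f)$, the probability of the class labelled $y = -1$ is simply the same function evaluated at $-f$.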
### 2 Using the Classifier

Suppose we've trained a classifier from $n$ input data, $\mathbf{x}$, and their corresponding expert-labelled output data, $\mathbf{y}$. And suppose that in the process we formed some GP outputs $\mathbf{f}$ corresponding to these data, which have some uncertainty but mean values given by $\hat{\mathbf{f}}$. We're now ready to input a new data point, $x_*$, in the left side of our schematic, in order to determine at the other end the probability $\pi_*$ of its class membership.

In the first step, finding the probability $p(f_* \mid \mathbf{y})$ is similar to GPR, i.e. we adapt (2):

$$f_* \mid \mathbf{y} \sim \mathcal{N}\left(K_* K^{-1} \hat{\mathbf{f}},\; K_{**} - K_* (K')^{-1} K_*^T\right). \tag{3}$$

($K'$ will be explained soon, but for now consider it to be very similar to $K$.) In the second step, we squash $f_*$ to find the probability of class membership, $\pi_* = \pi(f_*)$. The expected value is

$$\bar{\pi}_* = \int \pi(f_*)\, p(f_* \mid \mathbf{y})\, df_*. \tag{4}$$

This is the integral of a cumulative Gaussian times a Gaussian, which can be solved analytically. By Section 3.9 of Rasmussen and Williams (2006), the solution is:

$$\bar{\pi}_* = \Phi\left(\frac{\bar{f}_*}{\sqrt{1 + \operatorname{var}(f_*)}}\right). \tag{5}$$

An example is depicted in Figure 1.

Figure 1: (a) A classification dataset, where circles and crosses indicate class membership of the input (training) data, at locations $\mathbf{x}$. $y$ is $1$ for one class and $-1$ for another, but for illustrative purposes we pretend $y = 0$ instead of $-1$ in this figure. The solid line is the (mean) probability $\bar{\pi}_* = \operatorname{prob}(y_* = 1 \mid \mathbf{x}, \mathbf{y})$, i.e. the "answer" to our problem after successfully performing GPC. (b) The corresponding distribution of the latent function $f_*$, not constrained to lie between 0 and 1. (Panel titles: "Predictive probability" and "Latent function f(x)".)

### 3 Training the GP in the Classifier

Our objective now is to find $\hat{\mathbf{f}}$ and $K'$, so that we know everything about the GP producing (3), the first step of the classifier. The second step of the classifier does not require training as it's a fixed sigmoidal function.

Among the many GPs which could be partnered with our data set, naturally we'd like to compare their usefulness quantitatively. Considering the outputs $\mathbf{f}$ of a certain GP, how likely they are to be appropriate for the training data can be decomposed using Bayes' theorem:

$$p(\mathbf{f} \mid \mathbf{x}, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{x})}{p(\mathbf{y} \mid \mathbf{x})}. \tag{6}$$

Let's focus on the two factors in the numerator. Assuming the data set is i.i.d.,

$$p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} p(y_i \mid f_i). \tag{7}$$

Dropping the subscripts in the product, $p(y \mid f)$ is informed by our sigmoid function, $\pi(f)$. Specifically, $p(y = 1 \mid f)$ is $\pi(f)$ by definition, and to complete the picture, $p(y = -1 \mid f) = 1 - \pi(f)$. A terse way of combining these two cases is to write $p(y \mid f) = \pi(yf)$.

The second factor in the numerator is $p(\mathbf{f} \mid \mathbf{x})$. This is related to the output of the first step of our schematic drawing, but first we're interested in the value of $\mathbf{f}$ which maximizes the posterior probability $p(\mathbf{f} \mid \mathbf{x}, \mathbf{y})$. This occurs when the derivative of (6) with respect to $\mathbf{f}$ is zero, or equivalently and more simply, when the derivative of its logarithm is zero. Doing this, and using the same logic that produced (10) in the previous document, we find that

$$\hat{\mathbf{f}} = K \nabla \log p(\mathbf{y} \mid \hat{\mathbf{f}}), \tag{8}$$

where $\hat{\mathbf{f}}$ is the best $\mathbf{f}$ for our problem. Unfortunately, $\hat{\mathbf{f}}$ appears on both sides of the equation, so we make an initial guess (zero is fine) and go through a few iterations. The answer to (8) can be used directly in (3), so we've found one of the two quantities we seek therein.

The variance of $\mathbf{f}$ is given by the negative second derivative of the logarithm of (6), which turns out to be $(K^{-1} + W)^{-1}$, with $W = -\nabla\nabla \log p(\mathbf{y} \mid \mathbf{f})$. Making a Laplace approximation, we pretend $p(\mathbf{f} \mid \mathbf{x}, \mathbf{y})$ is Gaussian distributed, i.e.

$$p(\mathbf{f} \mid \mathbf{x}, \mathbf{y}) \approx q(\mathbf{f} \mid \mathbf{x}, \mathbf{y}) = \mathcal{N}\left(\hat{\mathbf{f}},\; (K^{-1} + W)^{-1}\right). \tag{9}$$

(This assumption is occasionally inaccurate, so if it yields poor classifications, better ways of characterizing the uncertainty in $\mathbf{f}$ should be considered, for example via expectation propagation.)
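The iterations for the posterior mode, and the two-step prediction that follows from them, can be sketched as below. Several choices here are assumptions made for illustration rather than anything prescribed in the text: the toy data, the hyperparameters, the function names, and in particular the use of Newton's method for the "few iterations" (the update $\mathbf{f} \leftarrow (K^{-1} + W)^{-1}(W\mathbf{f} + \nabla \log p(\mathbf{y} \mid \mathbf{f}))$, rearranged so that $K$ is never inverted). The variance uses $K' = K + W^{-1}$, the standard Laplace-approximation form of the adapted covariance.

```python
import math
import numpy as np

def Phi(z):
    """Cumulative Gaussian, applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(z, dtype=float) / math.sqrt(2.0)))

def normal_pdf(z):
    return np.exp(-0.5 * np.asarray(z, dtype=float) ** 2) / math.sqrt(2.0 * math.pi)

def sq_exp(xa, xb, sigma_f, length):
    d = xa[:, None] - xb[None, :]
    return sigma_f**2 * np.exp(-d**2 / (2.0 * length**2))

def grad_and_W(f, y):
    """Gradient of log p(y|f) = sum_i log Phi(y_i f_i), and the diagonal of W = -Hessian."""
    ratio = normal_pdf(f) / Phi(y * f)
    return y * ratio, ratio**2 + y * f * ratio

def find_mode(K, y, iterations=30):
    """Newton iterations for the mode f_hat, starting from an initial guess of zero."""
    f = np.zeros(len(y))
    for _ in range(iterations):
        grad, W = grad_and_W(f, y)
        # (K^-1 + W)^-1 b equals (I + K W)^-1 K b, which avoids forming K^-1.
        f = np.linalg.solve(np.eye(len(y)) + K * W[None, :], K @ (W * f + grad))
    return f

def predict(x_star, x, y, f_hat, sigma_f, length):
    """Mean class probability at x_star: latent mean and variance, then squashing."""
    K = sq_exp(x, x, sigma_f, length)
    K_star = sq_exp(np.atleast_1d(x_star).astype(float), x, sigma_f, length)
    grad, W = grad_and_W(f_hat, y)
    mean = (K_star @ grad).item()                     # K* K^-1 f_hat, since f_hat = K grad
    K_prime = K + np.diag(1.0 / W)                    # K' = K + W^-1
    var = sigma_f**2 - (K_star @ np.linalg.solve(K_prime, K_star.T)).item()
    return Phi(mean / math.sqrt(1.0 + var)).item()

# Hypothetical, symmetric toy data: class -1 on the left, class +1 on the right.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
sigma_f, length = 2.0, 1.5

K = sq_exp(x, x, sigma_f, length)
f_hat = find_mode(K, y)
p_right = predict(2.5, x, y, f_hat, sigma_f, length)
p_left = predict(-2.5, x, y, f_hat, sigma_f, length)
print(f_hat)
print(p_right, p_left)
```

Because the toy data are mirror-symmetric, the two printed probabilities should sum to one, which is a convenient sanity check on an implementation.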
Now for a subtle point. The fact that $\mathbf{f}$ can vary means that using (2) directly is inappropriate: in particular, its mean is correct but its variance no longer tells the whole story. This is why we use the adapted version, (3), with $K'$ instead of $K$. Since the varying quantity in (2), $\mathbf{f}$, is being multiplied by $K_* K^{-1}$, we add $K_* K^{-1} \operatorname{cov}(\mathbf{f} \mid \mathbf{x}, \mathbf{y})\, (K_* K^{-1})^T$ to the variance in (2). Simplification leads to (3), in which $K' = K + W^{-1}$.

With the GP now completely specified, we're ready to use the classifier as described in the previous section.

#### GPC in the Real World

As with GPR, the reliability of our classification is dependent on how well we select the covariance function in the GP in our first step. The parameters are $\theta = \{l, \sigma_f\}$, one fewer now because $\sigma_n = 0$. However, as usual, $\theta$ is optimized by maximizing $p(\mathbf{y} \mid \mathbf{x}, \theta)$, or (omitting $\theta$ on the right-hand side of the equation),

$$p(\mathbf{y} \mid \mathbf{x}, \theta) = \int p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{x})\, d\mathbf{f}. \tag{10}$$

This can be simplified, using a Laplace approximation, to yield

$$\log p(\mathbf{y} \mid \mathbf{x}, \theta) = -\frac{1}{2} \hat{\mathbf{f}}^T K^{-1} \hat{\mathbf{f}} + \log p(\mathbf{y} \mid \hat{\mathbf{f}}) - \frac{1}{2} \log\left(|K| \cdot |K^{-1} + W|\right). \tag{11}$$

This is the equation to run your favourite optimizer on, as performed in GPR.

### 4 Multi-Class GPC

We've described binary classification, where the number of possible classes, $C$, is just two. In the case of $C > 2$ classes, one approach is to fit an $\mathbf{f}$ for each class. In the first of the two steps of classification, our GP values are concatenated as

$$\mathbf{f} = \left(f_1^1, \ldots, f_n^1,\; f_1^2, \ldots, f_n^2,\; \ldots,\; f_1^C, \ldots, f_n^C\right)^T. \tag{12}$$

Let $\mathbf{y}$ be a vector of the same length as $\mathbf{f}$ which, for each $i = 1, \ldots, n$, is $1$ for the class which is the label and $0$ for the other $C - 1$ entries. Let $K$ grow to being block diagonal in the matrices $K^1, \ldots, K^C$. So the first change we see for $C > 2$ is a lengthening of the GP. Section 3.5 of Rasmussen and Williams (2006) offers hints on keeping the computations manageable.

The second change is that the (merely one-dimensional) cumulative Gaussian distribution is no longer sufficient to describe the squashing function in our classifier; instead we use the softmax function. For the $i$th data point,

$$p(y_i^c \mid \mathbf{f}_i) = \pi_i^c = \frac{\exp(f_i^c)}{\sum_{c'} \exp(f_i^{c'})}, \tag{13}$$

where $\mathbf{f}_i$ is a nonconsecutive subset of $\mathbf{f}$, viz. $\mathbf{f}_i = \{f_i^1, \ldots, f_i^C\}$. We can summarize our results with $\boldsymbol{\pi} = \left(\pi_1^1, \ldots, \pi_n^1,\; \pi_1^2, \ldots, \pi_n^2,\; \ldots,\; \pi_1^C, \ldots, \pi_n^C\right)^T$.

Now that we've presented the two big changes needed to go from binary- to multi-class GPC, we continue as before. Setting to zero the derivative of the logarithm of the components in (6), we replace (8) with

$$\hat{\mathbf{f}} = K(\mathbf{y} - \hat{\boldsymbol{\pi}}). \tag{14}$$

The corresponding variance is $(K^{-1} + W)^{-1}$ as before, but now $W = \operatorname{diag}(\boldsymbol{\pi}) - \Pi \Pi^T$, where $\Pi$ is a $Cn \times n$ matrix obtained by stacking vertically the diagonal matrices $\operatorname{diag}(\boldsymbol{\pi}^c)$, if $\boldsymbol{\pi}^c$ is the subvector of $\boldsymbol{\pi}$ pertaining to class $c$.
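The softmax squashing and the matrix $W = \operatorname{diag}(\boldsymbol{\pi}) - \Pi\Pi^T$ can be assembled as follows. This is a sketch with hypothetical latent values and variable names; the array `F` stores $f_i^c$ with one row per data point, and the flattened vector follows the class-major ordering of the concatenated $\mathbf{f}$.

```python
import numpy as np

def softmax_rows(F):
    """Softmax over classes for each data point; F[i, c] holds f_i^c."""
    shifted = F - F.max(axis=1, keepdims=True)   # guards against overflow; result unchanged
    E = np.exp(shifted)
    return E / E.sum(axis=1, keepdims=True)

# Hypothetical latent values for n = 4 data points and C = 3 classes.
F = np.array([[ 2.0,  0.1, -1.0],
              [ 0.0,  0.0,  0.0],
              [-0.5,  1.5,  0.3],
              [ 0.2, -0.7,  2.2]])
n, C = F.shape
P = softmax_rows(F)                                     # P[i, c] = pi_i^c

pi_vec = P.T.reshape(-1)                                # (pi_1^1..pi_n^1, pi_1^2..pi_n^2, ...)
Pi = np.vstack([np.diag(P[:, c]) for c in range(C)])    # stacked diag(pi^c), shape (C*n, n)
W = np.diag(pi_vec) - Pi @ Pi.T                         # shape (C*n, C*n)

print(P.sum(axis=1))            # each data point's class probabilities sum to 1
print(np.abs(W.sum(axis=1)).max())   # rows of W sum to zero, up to rounding
```

Each row of $W$ sums to zero because the class probabilities of a data point sum to one, so $W$ is positive semidefinite rather than positive definite; this is one reason the multi-class computations need more care than the binary case.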
With these quantities estimated, we have enough to generalize (3) to

$$f_*^c \mid \mathbf{y} \sim \mathcal{N}\left(K_*^c (K^c)^{-1} \hat{\mathbf{f}}^c,\; \operatorname{diag}(K_{**}) - K_*^c \left(K^c + W_c^{-1}\right)^{-1} (K_*^c)^T\right), \tag{15}$$

where $K^c$, $\hat{\mathbf{f}}^c$ and $W_c$ represent the class-relevant information only. Finally, (11) is replaced with

$$\log p(\mathbf{y} \mid \mathbf{x}, \theta) = -\frac{1}{2} \hat{\mathbf{f}}^T K^{-1} \hat{\mathbf{f}} + \mathbf{y}^T \hat{\mathbf{f}} - \sum_{i=1}^{n} \log\left(\sum_{c=1}^{C} \exp \hat{f}_i^c\right) - \frac{1}{2} \log\left(|K| \cdot |K^{-1} + W|\right). \tag{16}$$

We won't present an example of multi-class GPC, but hopefully you get the idea.

### 5 Discussion

As with GPR, classification can be extended to accept $x$ values with multiple dimensions, while keeping most of the mathematics unchanged. Other possible extensions include using the expectation propagation method in lieu of the Laplace approximation as mentioned previously, putting confidence intervals on the classification probabilities, calculating the derivatives of (16) to aid the optimizer, or using the variational Gaussian process classifiers described in MacKay (1998), to name but four extensions.

Second, we repeat the Bayesian call from the previous document to integrate over a range of possible covariance function parameters. This should be done regardless of how much prior knowledge is available; see for example Chapter 5 of Sivia and Skilling (2006) on how to choose priors in the most opaque situations.

Third, we've again spared you a few practical algorithmic details; computer code is available at http://www.gaussianprocess.org/gpml, with examples.

### Acknowledgments

Thanks are due to Prof Stephen Roberts and members of his Pattern Analysis Research Group, as well as the ALADDIN project (www.aladdinproject.org).

### References

MacKay, D. (1998). In C.M. Bishop (Ed.), Neural networks and machine learning. (NATO ASI Series, Series F, Computer and Systems Sciences, Vol. 168, pp. 133-166.) Dordrecht: Kluwer Academic Press.

Rasmussen, C. and C. Williams (2006). Gaussian Processes for Machine Learning. MIT Press.

Sivia, D. and J. Skilling (2006). Data Analysis: A Bayesian Tutorial (second ed.). Oxford Science Publications.