Journal of Machine Learning Research 6 (2005) 695-709. Submitted 11/04; Revised 3/05; Published 4/05.

Estimation of Non-Normalized Statistical Models by Score Matching

Aapo Hyvärinen (aapo.hyvarinen@helsinki.fi)
Helsinki Institute for Information Technology (BRU)
Department of Computer Science
FIN-00014 University of Helsinki, Finland

Editor: Peter Dayan

Abstract

One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of the log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data.

Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence

1. Introduction

In many cases, probabilistic models in machine learning, statistics, or signal processing are given in the form of non-normalized probability densities. That is, the model contains an unknown normalization constant whose computation is too difficult for practical purposes.

Assume we observe a random vector $x \in \mathbb{R}^n$ which has a probability density function (pdf) denoted by $p_x(\cdot)$. We have a parametrized density model $p(\cdot; \theta)$, where $\theta$ is an $m$-dimensional vector of parameters. We want to estimate the parameter $\theta$ from $x$, i.e. we want to approximate $p_x(\cdot)$ by $p(\cdot; \hat\theta)$ for the estimated parameter value $\hat\theta$. (We shall here consider the case of continuous-valued variables only.)

The problem we consider here is that we are only able to compute the pdf given by the model up to a multiplicative constant $Z(\theta)$:

\[ p(\xi; \theta) = \frac{1}{Z(\theta)} q(\xi; \theta). \]

That is, we do know the functional form of $q$ as an analytical expression (or any form that can be easily computed), but we do not know how to easily compute $Z$, which is given by an integral that is often analytically intractable:

\[ Z(\theta) = \int_{\xi \in \mathbb{R}^n} q(\xi; \theta) \, d\xi. \]
In higher dimensions (in fact, for almost any $n > 2$), the numerical computation of this integral is practically impossible as well.

Usually, the estimation of non-normalized models is approached by Markov Chain Monte Carlo (MCMC) methods, which are very slow, or by making some approximations, which may be quite poor (MacKay, 2003). Non-normalized models are often encountered in continuous-valued Markov random fields, which are widely used in image modelling, see e.g. (Bouman and Sauer, 1993; Li, 2001). In general, undirected graphical models cannot be normalized except in the Gaussian case. Other recent work in image modelling also includes non-normalized models (Hyvärinen and Hoyer, 2001; Teh et al., 2003). Presumably, the number of useful applications for non-normalized models is much larger than the present literature suggests. Non-normalized models have been avoided because their estimation has been considered too difficult; the advent of efficient estimation methods may significantly increase their utility.

In this paper, we propose a simple method for estimating such non-normalized models. It is based on minimizing the expected squared distance of the score function of $x$ and the score function given by the model. (By score function, we mean here the gradient of the log-density.) We show that this distance can be estimated by a very simple formula involving only sample averages of some derivatives of the logarithm of the pdf given by the model. Thus, the computations involved are essentially not more complicated than in the case where we know an analytical expression for the normalization constant. The proposed formula is exact and does not involve any approximations, which is why we are able to prove the local consistency of the resulting method. Minimization of the proposed objective function thus provides an estimation method that is computationally simple yet statistically locally consistent.

2. Estimation by Score Matching

In the following, we use extensively the gradient of the log-density with respect to the data vector. For simplicity, we call this the score function, although according to the conventional definition, it is actually the score function with respect to a hypothetical location parameter (Schervish, 1995). For the model density, we denote the score function by $\psi(\xi; \theta)$:

\[ \psi(\xi; \theta) = \begin{pmatrix} \partial \log p(\xi; \theta) / \partial \xi_1 \\ \vdots \\ \partial \log p(\xi; \theta) / \partial \xi_n \end{pmatrix} = \begin{pmatrix} \psi_1(\xi; \theta) \\ \vdots \\ \psi_n(\xi; \theta) \end{pmatrix} = \nabla_\xi \log p(\xi; \theta). \]

The point in using the score function is that it does not depend on $Z(\theta)$. In fact, we obviously have

\[ \psi(\xi; \theta) = \nabla_\xi \log q(\xi; \theta). \qquad (1) \]

Likewise, we denote by $\psi_x(\cdot) = \nabla_\xi \log p_x(\cdot)$ the score function of the distribution of the observed data $x$. This could in principle be estimated by computing the gradient of the logarithm of a non-parametric estimate of the pdf, but we will see below that no such computation is necessary.
Note that score functions are mappings from $\mathbb{R}^n$ to $\mathbb{R}^n$.

We now propose that the model is estimated by minimizing the expected squared distance between the model score function $\psi(\cdot; \theta)$ and the data score function $\psi_x(\cdot)$. We define this squared distance as

\[ J(\theta) = \frac{1}{2} \int_{\xi \in \mathbb{R}^n} p_x(\xi) \, \| \psi(\xi; \theta) - \psi_x(\xi) \|^2 \, d\xi. \qquad (2) \]

Thus, our score matching estimator of $\theta$ is given by

\[ \hat\theta = \arg\min_\theta J(\theta). \]

The motivation for this estimator is that the score function given by the model can be directly computed from $q$ as in (1), and we do not need to compute $Z$. However, this may still seem to be a very difficult way of estimating $p_x$, since we might have to compute an estimator of the data score function $\psi_x$ from the observed sample, which is basically a non-parametric estimation problem. However, no such non-parametric estimation is needed. This is because we can use a simple trick of partial integration to compute the objective function very easily, as shown by the following theorem:

Theorem 1. Assume that the model score function $\psi(\xi; \theta)$ is differentiable, as well as some weak regularity conditions (namely: the data pdf $p_x(\xi)$ is differentiable, the expectations $E\{\|\psi(x; \theta)\|^2\}$ and $E\{\|\psi_x(x)\|^2\}$ are finite for any $\theta$, and $p_x(\xi)\psi(\xi; \theta)$ goes to zero for any $\theta$ when $\|\xi\| \to \infty$). Then, the objective function $J$ in (2) can be expressed as

\[ J(\theta) = \int_{\xi \in \mathbb{R}^n} p_x(\xi) \sum_{i=1}^{n} \left[ \partial_i \psi_i(\xi; \theta) + \frac{1}{2} \psi_i(\xi; \theta)^2 \right] d\xi + \text{const.}, \qquad (3) \]

where the constant does not depend on $\theta$,

\[ \psi_i(\xi; \theta) = \frac{\partial \log q(\xi; \theta)}{\partial \xi_i} \]

is the $i$-th element of the model score function, and

\[ \partial_i \psi_i(\xi; \theta) = \frac{\partial \psi_i(\xi; \theta)}{\partial \xi_i} = \frac{\partial^2 \log q(\xi; \theta)}{\partial \xi_i^2} \]

is the partial derivative of the $i$-th element of the model score function with respect to the $i$-th variable.

The proof, given in the Appendix, is based on a simple trick of partial integration that has previously been used in the theory of independent component analysis for modelling the densities of the independent components (Pham and Garrat, 1997). We have thus proven the remarkable fact that the squared distance of the model score function from the data score function can be computed as a simple expectation of certain functions of the non-normalized model pdf $q$. If we have an analytical expression for the non-normalized density function $q$, these functions are readily obtained by derivation using (1) and taking further derivatives.

In practice, we have $T$ observations of the random vector $x$, denoted by $x(1), \ldots, x(T)$. The sample version of $J$ is obviously obtained from (3) as

\[ \tilde J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} \left[ \partial_i \psi_i(x(t); \theta) + \frac{1}{2} \psi_i(x(t); \theta)^2 \right] + \text{const.}, \qquad (4) \]

which is asymptotically equivalent to $J$ due to the law of large numbers. We propose to estimate the model by minimization of $\tilde J$ in the case of a real, finite sample.
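To make the estimator concrete, the sample objective (4) can be evaluated directly once the first and second derivatives of $\log q$ are available. The following is a minimal NumPy sketch, not code from the paper; the function names `psi` and `dpsi_diag`, their signatures, and the (T, n) array layout are illustrative assumptions.

```python
# Minimal sketch of the sample objective (4); illustrative, not from the paper.
# Assumed user-supplied callables for the non-normalized model q(.; theta):
#   psi(X, theta)       -> (T, n) array of first derivatives   d log q / d xi_i
#   dpsi_diag(X, theta) -> (T, n) array of second derivatives  d^2 log q / d xi_i^2
import numpy as np

def score_matching_objective(X, theta, psi, dpsi_diag):
    """Evaluate the sample objective (4), up to its additive constant."""
    S = psi(X, theta)        # model score evaluated at each observation
    D = dpsi_diag(X, theta)  # diagonal second derivatives at each observation
    return np.mean(np.sum(D + 0.5 * S**2, axis=1))
```

Any standard optimizer can then be applied to this quantity, since neither $Z(\theta)$ nor the data score function appears in it.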

One may wonder whether it is enough to minimize $J$ to estimate the model, or whether the distance of the score functions can be zero for different parameter values. Obviously, if the model is degenerate in the sense that two different values of $\theta$ give the same pdf, we cannot estimate $\theta$. If we assume that the model is not degenerate, and that $q > 0$ always, we have local consistency, as shown by the following theorem and the corollary:

Theorem 2. Assume the pdf of $x$ follows the model: $p_x(\cdot) = p(\cdot; \theta^*)$ for some $\theta^*$. Assume further that no other parameter value gives a pdf that is equal to $p(\cdot; \theta^*)$ (equalities of pdfs are taken in the sense of equality almost everywhere with respect to the Lebesgue measure), and that $q(\xi; \theta) > 0$ for all $\xi, \theta$. Then

\[ J(\theta) = 0 \iff \theta = \theta^*. \]

For the proof, see the Appendix.

Corollary 3. Under the assumptions of the preceding Theorems, the score matching estimator obtained by minimization of $\tilde J$ is consistent, i.e. it converges in probability towards the true value of $\theta$ when sample size approaches infinity, assuming that the optimization algorithm is able to find the global minimum.

The corollary is proven by applying the law of large numbers. As sample size approaches infinity, $\tilde J$ converges to $J$ (in probability). Thus, the estimator converges to a point where $J$ is globally minimized. By Theorem 2, the global minimum is unique and found at the true parameter value (obviously, $J$ cannot be negative).

This result of consistency assumes that the global minimum of $\tilde J$ is found by the optimization algorithm used in the estimation. In practice, this may not be true, in particular because there may be several local minima. Then, the consistency is of a local nature, i.e., the estimator is consistent if the optimization iteration is started sufficiently close to the true value. Note that consistency implies asymptotic unbiasedness.

3. Examples

Here, we provide three simulations to illustrate how score matching works, as well as to confirm its consistency and applicability to real data.
3.1 Multivariate Gaussian Density

As a very simple illustrative example, we consider estimation of the parameters of the multivariate Gaussian density.

3.1.1 Estimation

The probability density function is given by

\[ p(x; M, \mu) = \frac{1}{Z(M, \mu)} \exp\!\left( -\tfrac{1}{2} (x - \mu)^T M (x - \mu) \right), \]

where $M$ is a symmetric positive-definite matrix (the inverse of the covariance matrix). Of course, the expression for $Z$ is well known in this case, but this serves as an illustration of the method. As long as there is no chance of confusion, we use here $x$ as the general $n$-dimensional vector. Thus, here we have

\[ q(x) = \exp\!\left( -\tfrac{1}{2} (x - \mu)^T M (x - \mu) \right), \qquad (5) \]

and we obtain

\[ \psi(x; M, \mu) = -M (x - \mu) \qquad \text{and} \qquad \partial_i \psi_i(x; M, \mu) = -m_{ii}. \]

Thus, we obtain

\[ \tilde J(M, \mu) = \frac{1}{T} \sum_{t=1}^{T} \left[ \sum_{i=1}^{n} -m_{ii} + \frac{1}{2} (x(t) - \mu)^T M M (x(t) - \mu) \right]. \qquad (6) \]

To minimize this with respect to $\mu$, it is enough to compute the gradient

\[ \nabla_\mu \tilde J = M M \mu - M M \frac{1}{T} \sum_{t=1}^{T} x(t), \]

which is obviously zero if and only if $\mu$ is the sample average $\frac{1}{T} \sum_{t=1}^{T} x(t)$. This is truly a minimum because the matrix $MM$ that defines the quadratic form is positive-definite.

Next, we compute the gradient with respect to $M$, which gives

\[ \nabla_M \tilde J = -I + \frac{1}{2} M \, \frac{1}{T}\sum_{t=1}^{T} (x(t) - \mu)(x(t) - \mu)^T + \frac{1}{2} \, \frac{1}{T}\sum_{t=1}^{T} (x(t) - \mu)(x(t) - \mu)^T M, \]

which is zero if and only if $M$ is the inverse of the sample covariance matrix $\frac{1}{T}\sum_{t} (x(t) - \mu)(x(t) - \mu)^T$, which thus gives the score matching estimate.

Interestingly, we see that score matching gives exactly the same estimator as maximum likelihood estimation. In fact, the estimators are identical for any sample (and not just asymptotically). The maximum likelihood estimator is known to be consistent, so the score matching estimator is consistent as well.
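As a numerical illustration of the above derivation, the following sketch (under the same assumed NumPy conventions as before, not code from the paper) evaluates (6) and checks that the closed-form estimates, the sample mean and the inverse sample covariance, give a smaller value than a perturbed parameter.

```python
# Numerical check of the Gaussian example; illustrative, not code from the paper.
# For the model (5), psi(x) = -M (x - mu) and the diagonal second derivatives
# are -m_ii, so (6) reduces to the expression evaluated below.
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 10000
true_mu = rng.normal(size=n)
A = rng.normal(size=(n, n))
X = rng.multivariate_normal(true_mu, A @ A.T + n * np.eye(n), size=T)

def gaussian_sm_objective(X, M, mu):
    """Sample objective (6) for the multivariate Gaussian model (5)."""
    c = X - mu
    quad = 0.5 * np.einsum('ti,ij,jk,tk->t', c, M, M, c)  # 0.5 (x-mu)^T M M (x-mu)
    return np.mean(quad) - np.trace(M)

mu_hat = X.mean(axis=0)                                    # sample mean
M_hat = np.linalg.inv(np.cov(X, rowvar=False, bias=True))  # inverse sample covariance
print(gaussian_sm_objective(X, M_hat, mu_hat))             # value at the closed-form estimates
print(gaussian_sm_objective(X, 0.9 * M_hat, mu_hat))       # larger for a perturbed M
```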
3.1.2 Intuitive Interpretation

This example also gives some intuitive insight into the principle of score matching. Let us consider what would happen if we just maximized the non-normalized log-likelihood, i.e., the logarithm of $q$ in (5). It would be maximized when the scale parameters in $M$ are all zero, i.e., when the model variances are infinite and the pdf is completely flat, because then the model assigns the same probability to all possible values of $x(t)$, which is equal to 1. In fact, the same applies to the second term in (6), which thus seems to be closely connected to maximization of the non-normalized log-likelihood. Therefore, the first term in (3) and (6), involving second derivatives of the logarithm of $q$, seems to act as a kind of normalization term. Here it is equal to $-\sum_i m_{ii}$: to minimize this, the $m_{ii}$ should be made as large (and positive) as possible. Thus, this term has the opposite effect to the second term. Since the first term is linear and the second term polynomial in $M$, the minimum of the sum is different from zero.

A similar interpretation applies to the general non-Gaussian case. The second term in (3), the expectation of the squared norm of the score function, is closely related to maximization of the non-normalized likelihood: if the norm of this gradient is zero, then the data point is at a local extremum of the non-normalized log-likelihood. The first term then measures what kind of an extremum this is. If it is a minimum, the first term is positive and the value of $J$ is increased. To minimize $J$, the first term should be negative, in which case the extremum is a maximum. In fact, the extremum should be as steep a maximum (as opposed to a flat maximum) as possible to minimize $J$. This counteracts, again, the tendency to assign the same probability to all data points that is often inherent in the maximization of the non-normalized likelihood.

3.2 Estimation of Basic Independent Component Analysis Model

Next, we show the validity of score matching in estimating the following model:

\[ \log p(x; W) = \sum_{k=1}^{n} G(w_k^T x) + Z(W), \qquad (7) \]

which is the basic form of the independent component analysis (ICA) model. Again, the normalization constant is well known, here equal to $Z(W) = \log |\det W|$, where the matrix $W$ has the vectors $w_k$ as its rows, but this serves as an illustration of our method. The nice thing about this model is that we can easily generate data that follows it. In fact, if the latent variables $s_k$, $k = 1, \ldots, n$, are independently distributed and have the pdfs given by $\exp(G(s_k))$, then the linear transformation

\[ x = A s, \qquad (8) \]

with $A = W^{-1}$, follows the pdf given in (7), see e.g. (Hyvärinen et al., 2001). Thus, we will be estimating the generative model in (8) using the non-normalized likelihood in (7). Here, we choose the distribution of the components $s_k$ to be the so-called logistic distribution with

\[ G(s) = -2 \log \cosh\!\left( \frac{\pi}{2\sqrt{3}} s \right) - \log \frac{4\sqrt{3}}{\pi}. \]
This distribution is normalized to unit variance, as is typical in the theory of ICA.

The score function of the model in (7) is given by

\[ \psi(x; W) = \sum_{k=1}^{n} w_k \, g(w_k^T x), \qquad (9) \]

where the scalar nonlinear function $g$ is given by

\[ g(u) = G'(u) = -\frac{\pi}{\sqrt{3}} \tanh\!\left( \frac{\pi}{2\sqrt{3}} u \right). \]

The relevant derivatives of the score function are given by

\[ \partial_i \psi_i(x; W) = \sum_{k=1}^{n} w_{ki}^2 \, g'(w_k^T x), \]

and the sample version of the objective function is given by

\[ \tilde J(W) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} \left[ \sum_{k=1}^{n} w_{ki}^2 \, g'(w_k^T x(t)) + \frac{1}{2} \left( \sum_{k=1}^{n} w_{ki} \, g(w_k^T x(t)) \right)^2 \right]. \qquad (10) \]
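For reference, here is a sketch of how (10) might be evaluated, with the same illustrative NumPy conventions as before (rows of W are the filters $w_k$, data in a (T, n) array); it is not code from the paper, and the constant in $g$ follows the unit-variance logistic density above.

```python
# Sketch of the ICA sample objective (10); illustrative, not code from the paper.
import numpy as np

C = np.pi / (2.0 * np.sqrt(3.0))   # scale of the unit-variance logistic density

def g(u):                          # g = G', with G(s) = -2 log cosh(C s) + const
    return -2.0 * C * np.tanh(C * u)

def g_prime(u):
    return -2.0 * C**2 / np.cosh(C * u)**2

def ica_sm_objective(W, X):
    """Sample objective (10); W is (n, n) with filters w_k as rows, X is (T, n)."""
    U = X @ W.T                                 # u_{tk} = w_k^T x(t)
    term1 = (g_prime(U) @ (W**2)).sum(axis=1)   # sum_i sum_k w_ki^2 g'(w_k^T x(t))
    S = g(U) @ W                                # model score psi(x(t); W), shape (T, n)
    return np.mean(term1 + 0.5 * (S**2).sum(axis=1))
```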

We performed simulations to validate the consistency of score matching estimation, and to compare its efficiency with respect to maximum likelihood estimation. We generated data following the model as described above, where the dimension was chosen to be $n = 4$. Score matching estimation consisted of minimizing $\tilde J$ in (10) by simple gradient descent; the likelihood was maximized using a natural gradient method (Amari et al., 1996; Hyvärinen et al., 2001), using the true value of $G$. We repeated the estimation for several different sample sizes: 500, 1000, 2000, 4000, 8000, and 16000. For each sample size, the estimation was repeated 11 times using different random initial points in the optimization and different random data sets.

For each estimate, a measure of asymptotic variance was computed as follows. The matrix $\hat W A$, where $\hat W$ is the estimate, was normalized row-by-row so that the largest value on each row had an absolute value of 1. Then, the sum of squares of all the elements was computed, and $n$ (i.e. the sum of the squares of the four elements equal to one) was subtracted. This gives a measure of the squared error of the estimate (we cannot simply compare $\hat W A$ with the identity because the order of the components is not well-defined). For each sample size and estimator type (score matching vs. maximum likelihood), we then computed the median error.
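The following is an illustrative reimplementation of this error measure (not the author's code; the function name and array conventions are assumptions):

```python
# Sketch of the error measure described above; illustrative, not the author's code.
import numpy as np

def ica_squared_error(W_hat, A):
    P = W_hat @ A                                 # ideally a scaled permutation matrix
    P = P / np.abs(P).max(axis=1, keepdims=True)  # largest |entry| of each row -> 1
    return (P**2).sum() - P.shape[0]              # zero for a perfect (permuted) recovery
```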

Figure 1 shows the results. The error of score matching seems to go to zero, which validates the theoretical consistency result of Theorem 2. Score matching gives slightly larger errors than maximum likelihood, which is to be expected because of the efficiency results of maximum likelihood estimation (Pham and Garrat, 1997).

Figure 1: The estimation errors of score matching (solid line) compared with errors of maximum likelihood estimation (dashed line) for the basic ICA model. Horizontal axis: log10 of sample size. Vertical axis: log10 of estimation error.

In the preceding simulation, we knew exactly the proper function $G$ to be used in the score function. To investigate the robustness of the method to misspecification of the score function (a well-known problem in ICA estimation), we ran the same estimation methods, score matching and maximum likelihood, on data that was generated by a slightly different distribution. Specifically, we generated the data so that the independent components had Laplacian distributions of unit variance (Hyvärinen et al., 2001). We then estimated the model using exactly the same $G$ as before, which was thus not theoretically correct. The estimation errors are shown in Figure 2. We see that score matching still seems consistent.

Interestingly, it now performs slightly better than maximum likelihood estimation (which here would more properly be called quasi-maximum likelihood estimation due to the misspecification (Pham and Garrat, 1997)).

Figure 2: The estimation errors of score matching compared with errors of maximum likelihood estimation for the basic ICA model. This time, the pdf of the independent components was slightly misspecified. Legend as in Fig. 1.

3.3 Estimation of an Overcomplete Model for Image Data

Finally, we show image analysis results using an overcomplete version of the ICA model. The likelihood is defined almost as in (7), but the number of components $m$ is larger than the dimension of the data $n$, see e.g. (Teh et al., 2003), and we introduce some extra parameters. The non-normalized log-likelihood is given by

\[ \log q(x; w_1, \ldots, w_m, \alpha_1, \ldots, \alpha_m) = \sum_{k=1}^{m} \alpha_k \, G(w_k^T x), \qquad (11) \]

where the vectors $w_k$ are constrained to unit norm (unlike in the preceding example), and the $\alpha_k$ are scaling parameters. We introduce here the extra parameters $\alpha_k$ to account for different distributions for different projections. Constraining $\alpha_k = 1$ for all $k$ and allowing the $w_k$ to have any norm, this becomes the basic ICA model of the preceding subsection.
The model is related to ICA with overcomplete bases (Hyvärinen et al., 2001; Hyvärinen and Inki, 2002; Olshausen and Field, 1997), i.e. the case where there are more independent components and basis vectors than observed variables. In contrast to most ICA models, the overcompleteness is here expressed as an overcompleteness of filters, which seems to make the problem a bit simpler because no latent variables need to be inferred. However, the normalization constant is not known when $G$ is non-quadratic, i.e. when the model is non-Gaussian, which is why previous research had to resort to MCMC methods (Teh et al., 2003) or some approximations (Hyvärinen and Inki, 2002).

We have the score function

\[ \psi(x; w_1, \ldots, w_m, \alpha_1, \ldots, \alpha_m) = \sum_{k=1}^{m} \alpha_k \, w_k \, g(w_k^T x), \]

where $g$ is the first derivative of $G$. Going through similar developments as in the case of the basic ICA model, the sample version of the objective function can be shown to equal

\[ \tilde J = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} \left[ \sum_{k=1}^{m} \alpha_k \, w_{ki}^2 \, g'(w_k^T x(t)) + \frac{1}{2} \left( \sum_{k=1}^{m} \alpha_k \, w_{ki} \, g(w_k^T x(t)) \right)^2 \right]. \qquad (12) \]
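A sketch of how (12) could be evaluated, extending the earlier illustrative NumPy conventions (W is an (m, n) matrix of unit-norm filter rows, alpha a length-m vector of scales); this is not code from the paper, and as an example it uses $g(u) = -\tanh(u)$, i.e. $G(u) = -\log\cosh(u)$, the choice used in the image experiment below.

```python
# Sketch of the overcomplete sample objective (12); illustrative, not from the paper.
import numpy as np

def g(u):                              # g = G' with G(u) = -log cosh(u)
    return -np.tanh(u)

def g_prime(u):
    return -1.0 / np.cosh(u)**2

def overcomplete_sm_objective(W, alpha, X):
    """W: (m, n) unit-norm filters, alpha: (m,) scales, X: (T, n) whitened data."""
    U = X @ W.T                            # u_{tk} = w_k^T x(t), shape (T, m)
    term1 = (g_prime(U) * alpha) @ (W**2)  # sum_k alpha_k w_ki^2 g'(u_k), shape (T, n)
    S = (g(U) * alpha) @ W                 # model score, shape (T, n)
    return np.mean((term1 + 0.5 * S**2).sum(axis=1))
```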
Figure 3: The overcomplete set of filters $w_k$ estimated from natural image data. Note that no dimension reduction was performed, and we show filters instead of basis vectors, which is why the results are much less smooth and "beautiful" than some published ICA results (Hyvärinen et al., 2001).

We estimated the model for image patches of 8 x 8 pixels taken from natural images; see P. O. Hoyer's imageica package (the package can be downloaded at http://www.cs.helsinki.fi/patrik.hoyer/). As preprocessing, the DC component (i.e. the mean gray-scale value) was removed from each image patch, reducing the effective dimensionality of the data to 63. The data was also whitened, i.e. the model was applied in a linearly transformed space (the exact method of whitening is of no significance here). We set $m = 200$. We also took $g(u) = -\tanh(u)$, which corresponds to $G(u) = -\log\cosh(u)$ (we did not bother to find the right scaling constant as in the basic ICA case). The objective function in (12) was optimized by gradient descent. The $w_k$ were set to random initial values, and the $\alpha_k$ were all set to the initial value 1.5, which was found to be close to the optimal value in pilot experiments.

The obtained vectors $w_k$ are shown in Figure 3. For the purposes of visualization, the vectors were converted back to the original space from the whitened space. The optimal $\alpha_k$ were in a range of up to about 2. To show that the method correctly found $m$ different vectors $w_k$, and not duplicates of a smaller set of vectors, we computed the dot-products between the vectors, and for each $w_k$ selected the largest absolute value of dot-product with the other vectors. The dot-products were computed in the whitened space. The histogram of these maximal dot-products is shown in Figure 4. They are all much smaller than 1 in absolute value; in fact, all are smaller than 0.5. Since the vectors $w_k$ were normalized to unit norm, this shows that no two $w_k$ were close to equal, and we did find $m$ different vectors.

Figure 4: The distribution of maximal dot-products of a filter $w_k$ with all other filters, computed in the whitened space.

4. Discussion

Here, we discuss the connections of our method to some well-known methods before concluding the paper.
4.1 Comparison with Pseudo-Likelihood Estimation

A related method for estimating non-normalized models is maximization of the pseudo-likelihood (Besag, 1974). The idea is to maximize the product of marginal conditional likelihoods. The pdf is approximated by

\[ \log p_{\text{pseudo}}(x; \theta) = \sum_{j=1}^{n} \log p(x_j \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n; \theta), \qquad (13) \]

and the likelihood is computed using this approximation. The idea was originally developed in connection with Markov random fields, in which context it is quite natural because the conditional probabilities are often given as part of the model specification. The idea can still be used in the general case considered in this article. However, the conditional probabilities in (13) are not necessarily readily available and need to be computed; in particular, these conditional densities need to be normalized. The computational burden needed in the normalization is reduced from the original problem, since we only need to numerically compute $n$ one-dimensional integrals, which is far more feasible than a single $n$-dimensional integral. However, compared to score matching, this is a computationally expensive method, since score matching avoids the need for numerical integration altogether.

The question of the consistency of pseudo-likelihood estimation seems to be unclear. Some consistency proofs were provided by Besag (1974, 1977), but these only apply to special cases such as Gaussian or binary random fields. Sufficiently general consistency results on pseudo-likelihood estimation seem to be lacking. This is another disadvantage with respect to score matching, which was shown above to be (locally) consistent.
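To make the comparison concrete, here is a rough sketch of what pseudo-likelihood evaluation entails for a generic non-normalized model: each conditional in (13) is normalized by a one-dimensional numerical integral. This is an illustrative construction, not from the paper; the callable `log_q`, the integration grid, and its range are assumptions.

```python
# Sketch of the pseudo-likelihood (13) for a non-normalized model; illustrative only.
# log_q(x, theta) is an assumed callable returning the non-normalized log-density.
import numpy as np

def pseudo_log_likelihood(x, theta, log_q, grid=np.linspace(-10.0, 10.0, 2001)):
    total = 0.0
    dx = grid[1] - grid[0]
    for j in range(len(x)):
        xs = np.tile(x, (len(grid), 1))
        xs[:, j] = grid                               # vary only coordinate j
        log_qs = np.array([log_q(row, theta) for row in xs])
        log_Zj = np.log(np.sum(np.exp(log_qs)) * dx)  # 1-D numerical normalization
        total += log_q(x, theta) - log_Zj             # log p(x_j | x_{-j}; theta)
    return total
```

Even in this simple form, each evaluation needs $n$ one-dimensional quadratures per data point, whereas the score matching objective (4) needs none.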
4.2 Comparison with Contrastive Divergence

An interesting approximative MCMC method called contrastive divergence was recently proposed by Hinton (2002). The basic principle is to use an MCMC method for computing the derivative of the logarithm of the normalization factor $Z$, but the MCMC is allowed to run for only a single iteration (or a few iterations) before doing the gradient step. The method is generally biased, even asymptotically (Carreira-Perpiñán and Hinton, 2005b), except in some special cases such as the multivariate Gaussian distribution (Carreira-Perpiñán and Hinton, 2005a). Score matching is thus preferable if a consistent estimator is wanted.

The computational efficiency of contrastive divergence is difficult to evaluate, since it is not really a single method but a family of methods, depending on the MCMC method used. For the case of continuous-valued variables that we consider here, a Metropolis-type algorithm would probably be the method of choice, but there is a large number of different variants whose performances are likely to be quite different. Nevertheless, contrastive divergence is a much more general method than score matching, since it is applicable to intractable latent variable models. It can also handle binary or discrete variables; in fact, it is probably much easier to implement, using Gibbs sampling, for binary variables than for continuous-valued ones. Extension of score matching to these cases is an important problem for future research.

4.3 Conclusion

We have proposed a new method, score matching, to estimate statistical models in the case where the normalization constant is unknown. Although the estimation of the score function is computationally difficult, we showed that the distance between the data and model score functions is very easy to compute. The main assumptions in the method are: 1) all the variables are continuous-valued and defined over $\mathbb{R}^n$, and 2) the model pdf is smooth enough. Score matching provides a computationally simple yet locally consistent alternative to existing methods, such as MCMC and various approximative methods.

Acknowledgments

I am grateful to Patrik Hoyer, Jarmo Hurri, and Shohei Shimizu for comments on the manuscript, to Sam Roweis for interesting discussions, and to Miguel Carreira-Perpiñán and Geoffrey Hinton for providing access to unpublished results. The work was supported by the Academy of Finland, Academy Research Fellow position and project #48593.

Appendix A. Proof of Theorem 1

Definition (2) gives

\[ J(\theta) = \int p_x(\xi) \left[ \frac{1}{2} \|\psi_x(\xi)\|^2 + \frac{1}{2} \|\psi(\xi; \theta)\|^2 - \psi_x(\xi)^T \psi(\xi; \theta) \right] d\xi. \qquad (14) \]

(For simplicity, we omit the integration domain in what follows.) The first term in brackets does not depend on $\theta$ and can be ignored. The integral of the second term is simply the integral of the sum of the second terms in brackets in (3).
Thus, the difficult thing to prove is that the integral of the third term in brackets in (14) equals the integral of the sum of the first terms in brackets in (3). This term equals

\[ -\int p_x(\xi) \sum_{i} \psi_{x,i}(\xi) \, \psi_i(\xi; \theta) \, d\xi, \]

where $\psi_{x,i}$ denotes the $i$-th element of the vector $\psi_x(\xi)$. We can consider the integral for a single $i$ separately, which equals

\[ -\int p_x(\xi) \, \frac{\partial \log p_x(\xi)}{\partial \xi_i} \, \psi_i(\xi; \theta) \, d\xi. \]

The basic trick of partial integration needed in the proof is simple: for a one-dimensional pdf $f$ and a function $g$, we have

\[ \int f(x) (\log f(x))' \, g(x) \, dx = \int f'(x) \, g(x) \, dx = -\int f(x) \, g'(x) \, dx, \]

under some regularity assumptions that will be dealt with below. To proceed with the proof, we need to use a multivariate version of such partial integration:

Lemma 4. Assume that $f$ and $g$ are differentiable. Then

\[ \int f(\xi) \, \frac{\partial g(\xi)}{\partial \xi_1} \, d\xi = \lim_{a \to \infty,\, b \to -\infty} \int \left[ f(a, \xi_2, \ldots, \xi_n) g(a, \xi_2, \ldots, \xi_n) - f(b, \xi_2, \ldots, \xi_n) g(b, \xi_2, \ldots, \xi_n) \right] d\xi_2 \cdots d\xi_n - \int \frac{\partial f(\xi)}{\partial \xi_1} \, g(\xi) \, d\xi. \]

The same applies for all indices of $\xi$, but for notational simplicity we only write the case of $\xi_1$ here.

Proof of the lemma: We can consider the integrand as a function of $\xi_1$ alone, all other variables being fixed. Then, integrating over $\xi_1$ by ordinary one-dimensional partial integration, and subsequently integrating over the other variables, we have proven the lemma.

Now, we can apply this lemma on $f = p_x$ and $g = \psi_i(\cdot; \theta)$, which were both assumed to be differentiable in the theorem, and obtain

\[ -\int p_x(\xi) \, \frac{\partial \log p_x(\xi)}{\partial \xi_1} \, \psi_1(\xi; \theta) \, d\xi = -\int \frac{\partial p_x(\xi)}{\partial \xi_1} \, \psi_1(\xi; \theta) \, d\xi \]
\[ = -\lim_{a \to \infty,\, b \to -\infty} \int \left[ p_x(a, \xi_2, \ldots, \xi_n) \psi_1(a, \xi_2, \ldots, \xi_n; \theta) - p_x(b, \xi_2, \ldots, \xi_n) \psi_1(b, \xi_2, \ldots, \xi_n; \theta) \right] d\xi_2 \cdots d\xi_n + \int p_x(\xi) \, \frac{\partial \psi_1(\xi; \theta)}{\partial \xi_1} \, d\xi. \]
For notational simplicity, we consider the case of $i = 1$ only, but the same holds for any $i$. The limit in the above expression is zero because we assumed that $p_x(\xi)\psi(\xi; \theta)$ goes to zero at infinity. Thus, we have proven that

\[ -\int p_x(\xi) \, \psi_{x,i}(\xi) \, \psi_i(\xi; \theta) \, d\xi = \int p_x(\xi) \, \partial_i \psi_i(\xi; \theta) \, d\xi, \]

that is, the integral of the third term in brackets in (14) equals the integral of the sum of the first terms in brackets in (3), and the proof of the theorem is complete.

Appendix B. Proof of Theorem 2

Assume $J(\theta) = 0$. Then, the assumption $q > 0$ implies that $p_x(\xi) > 0$ for all $\xi$, which implies that $\psi_x(\cdot)$ and $\psi(\cdot; \theta)$ are equal. This implies $\log p_x(\cdot) = \log p(\cdot; \theta) + c$ for some constant $c$. But $c$ is necessarily 0 because both $p_x$ and $p(\cdot; \theta)$ are pdfs. Thus, $p_x(\cdot) = p(\cdot; \theta)$. By assumption, only $\theta = \theta^*$ fulfills this equality, so necessarily $\theta = \theta^*$, and we have proven the implication from left to right. The converse is trivial.

References

S.-I. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind source separation. In Advances in Neural Information Processing Systems, pages 757-763. MIT Press, 1996.

J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B, 36(2):192-236, 1974.

J. Besag. Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64(3):616-618, 1977.

C. Bouman and K. Sauer. A generalized Gaussian image model for edge-preserving MAP estimation. IEEE Transactions on Image Processing, 2(3):296-310, 1993.

M. A. Carreira-Perpiñán and G. E. Hinton. On contrastive divergence (CD) learning. Technical report, Dept of Computer Science, University of Toronto, 2005a. In preparation.

M. A. Carreira-Perpiñán and G. E. Hinton. On contrastive divergence learning. In Proceedings of the Workshop on Artificial Intelligence and Statistics (AISTATS 2005), Barbados, 2005b.

G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

A. Hyvärinen and P. O. Hoyer. A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18):2413-2423, 2001.

A. Hyvärinen and M. Inki. Estimating overcomplete independent component bases from image windows. Journal of Mathematical Imaging and Vision, 17:139-152, 2002.

A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley Interscience, 2001.

S. Z. Li. Markov Random Field Modeling in Image Analysis. Springer, 2nd edition, 2001.

D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311-3325, 1997.

D.-T. Pham and P. Garrat. Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Transactions on Signal Processing, 45(7):1712-1725, 1997.

M. Schervish. Theory of Statistics. Springer, 1995.

Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235-1260, 2003.