Download
# Nonlinear Learning using Local Coordinate Coding Kai Yu NEC Laboratories America kyusv PDF document - DocSlides

test | 2014-12-13 | General

### Presentations text content in Nonlinear Learning using Local Coordinate Coding Kai Yu NEC Laboratories America kyusv

Show

Page 1

Nonlinear Learning using Local Coordinate Coding Kai Yu NEC Laboratories America kyu@sv.nec-labs.com Tong Zhang Rutgers University tzhang@stat.rutgers.edu Yihong Gong NEC Laboratories America ygong@sv.nec-labs.com Abstract This paper introduces a new method for semi-supervised learning on high dimen- sional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difﬁcult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. 1 Introduction Consider the problem of learning a nonlinear function on a high dimensional space We are given a set of labeled data ,y ,..., ,y drawn from an unknown underlying distri- bution. Moreover, assume that we observe a set of unlabeled data from the same distribution. If the dimensionality is large compared to , then the traditional statistical theory predicts over- ﬁtting due to the so called “curse of dimensionality”. One intuitive argument for this effect is that when the dimensionality becomes larger, pairwise distances between two similar data points become larger as well. Therefore one needs more data points to adequately ﬁll in the empty space. However, for many real problems with high dimensional data, we do not observe this so-called curse of di- mensionality. This is because although data are physically represented in a high-dimensional space, they often lie on a manifold which has a much smaller intrinsic dimensionality. This paper proposes a new method that can take advantage of the manifold geometric structure to learn a nonlinear function in high dimension. The main idea is to locally embed points on the manifold into a lower dimensional space, expressed as coordinates with respect to a set of anchor points. Our main observation is simple but very important: we show that a nonlinear function on the manifold can be effectively approximated by a linear function with such an coding under appropriate localization conditions. Therefore using Local Coordinate Coding, we turn a very difﬁcult high dimensional nonlinear learning problem into a much simpler linear learning problem, which has been extensively studied in the literature. This idea may also be considered as a high dimensional generalization of low dimensional local smoothing methods in the traditional statistical literature. 2 Local Coordinate Coding We are interested in learning a smooth function deﬁned on a high dimensional space . Let k·k be a norm on . Although we do not restrict to any speciﬁc norm, in practice, one often employs the Euclidean norm (2-norm): ···

Page 2

Deﬁnition 2.1 (Lipschitz Smoothness) A function on is α,β,p -Lipschitz smooth with respect to a norm k·k if | and | 1+ , where we assume α,β > and (0 1] Note that if the Hessian of exists, then we may take = 1 . Learning an arbitrary Lipschitz smooth function on can be difﬁcult due to the curse of dimensionality. That is, the number of samples required to characterize such a function can be exponential in . However, in many practical applications, one often observe that the data we are interested in approximately lie on a manifold which is embedded into . Although is large, the intrinsic dimensionality of can be much smaller. Therefore if we are only interested in learning on , then the complexity should depend on the intrinsic dimensionality of instead of . In this paper, we approach this problem by introducing the idea of localized coordinate coding. The formal deﬁnition of (non-localized) coordinate coding is given below, where we represent a point in by a linear combination of a set of “anchor points”. Later we show it is sufﬁcient to choose a set of “anchor points” with cardinality depending on the intrinsic dimensionality of the manifold rather than Deﬁnition 2.2 (Coordinate Coding) A coordinate coding is a pair γ,C , where is a set of anchor points, and is a map of to )] such that ) = 1 . It induces the following physical approximation of in ) = . Moreover, for all , we deﬁne the corresponding coding norm as The quantity will become useful in our learning theory analysis. The condition ) = 1 follows from the shift-invariance requirement, which means that the coding should remain the same if we use a different origin of the coordinate system for representing data points. It can be shown (see the appendix ﬁle accompanying the submission) that the map is invariant under any shift of the origin for representing data points in if and only if ) = . The importance of the coordinate coding concept is that if a coordinate coding is sufﬁciently localized, then a nonlinear function can be approximate by a linear function with respect to the coding. This critical observation, illustrate in the following linearization lemma, is the foundation of our approach. Due to the space limitation, all proofs are left to the appendix that accompanies the submission. Lemma 2.1 (Linearization) Let γ,C be an arbitrary coordinate coding on . Let be an α,β,p -Lipschitz smooth function. We have for all |k 1+ To understand this result, we note that on the left hand side, a nonlinear function in is ap- proximated by a linear function with respect to the coding , where )] is the set of coefﬁcients to be estimated from data. The quality of this approximation is bounded by the right hand side, which has two terms: the ﬁrst term means should be close to its physical approximation , and the second term means that the coding should be localized. The quality of a coding with respect to can be measured by the right hand side. For convenience, we introduce the following deﬁnition, which measures the locality of a coding. Deﬁnition 2.3 (Localization Measure) Given α,β,p , and coding γ,C , we deﬁne α,β,p γ,C ) = |k 1+ Observe that in α,β,p α,β,p may be regarded as tuning parameters; we may also simply pick = 1 . Since the quality function α,β,p γ,C only depends on unlabeled data, in principle, we can ﬁnd γ,C by optimizing this quality using unlabeled data. Later, we will consider simpliﬁcations of this objective function that are easier to compute. Next we show that if the data lie on a manifold, then the complexity of local coordinate coding depends on the intrinsic manifold dimensionality instead of . We ﬁrst deﬁne manifold and its intrinsic dimensionality.

Page 3

Deﬁnition 2.4 (Manifold) A subset M is called a -smooth ( p> ) manifold with intrinsic dimensionality if there exists a constant such that given any ∈M , there exists vectors ,...,v so that ∈M inf =1 1+ This deﬁnition is quite intuitive. The smooth manifold structure implies that one can approximate a point in effectively using local coordinate coding. Note that for a typical manifold with well- deﬁned curvature, we can take = 1 Deﬁnition 2.5 (Covering Number) Given any subset M , and > . The covering number, denoted as , , is the smallest cardinality of an -cover ⊂ M . That is, sup ∈M inf k For a compact manifold with intrinsic dimensionality , there exists a constant such that its covering number is bounded by , . The following result shows that there exists a local coordinate coding to a set of anchor points of cardinality , )) such that any α,β,p -Lipschitz smooth function can be linearly approximated using local coordinate coding up to the accuracy 1+ Theorem 2.1 (Manifold Coding) If the data points lie on a compact -smooth manifold , and the norm is deﬁned as = ( Ax for some positive deﬁnite matrix . Then given any > there exist anchor points ⊂M and coding such that | (1+ )) , , Q α,β,p γ,C αc )+(1+ )+2 1+ )) 1+ Moreover, for all ∈M , we have 1 + (1 + )) The approximation result in Theorem 2.1 means that the complexity of linearization in Lemma 2.1 depends only on the intrinsic dimension of instead of . Although this result is proved for manifolds, it is important to observe that the coordinate coding method proposed in this paper does not require the data to lie precisely on a manifold, and it does not require knowing . In fact, similar results hold even when the data only approximately lie on a manifold. In the next section, we characterize the learning complexity of the local coordinate coding method. It implies that linear prediction methods can be used to effectively learn nonlinear functions on a manifold. The nonlinearity is fully captured by the coordinate coding map (which can be a nonlinear function). This approach has some great advantages because the problem of ﬁnding local coordinate coding is much simpler than direct nonlinear learning: Learning γ,C only requires unlabeled data, and the number of unlabeled data can be signiﬁcantly more than the number of labeled data. This step also prevents overﬁtting with respect to labeled data. In practice, we do not have to ﬁnd the optimal coding because the coordinates are merely features for linear supervised learning. This signiﬁcantly simpliﬁes the optimization prob- lem. Consequently, it is more robust than some standard approaches to nonlinear learning that direct optimize nonlinear functions on labeled data (e.g., neural networks). 3 Learning Theory In machine learning, we minimize the expected loss x,y ,y with respect to the underlying distribution within a function class ∈F . In this paper, we are interested in the function class α,β,p ) : ( α,β,p Lipschitz smooth function in The local coordinate coding method considers a linear approximation of functions in α,β,p on the data manifold. Given a local coordinate coding scheme γ,C , we approximate each ∈F α,β,p by γ,C ( w,x ) = , where we estimate the coefﬁcients using ridge regression as: [ ] = arg min =1 γ,C w,x ,y ) + )) (1)

Page 4

where is an arbitrary function assumed to be pre-ﬁxed. In the Bayesian interpretation, this can be regarded as the prior mean for the weights . The default values of are simply . Given a loss function p,y , let p,y ) = p,y /∂p . For simplicity, in this paper we only consider convex Lipschitz loss function, where p,y | . This includes the standard classiﬁcation loss functions such as logistic regression and SVM (hinge loss), both with = 1 Theorem 3.1 (Generalization Bound) Suppose p,y is Lipschitz: p,y | . Consider coordinate coding γ,C , and the estimation method (1) with random training examples ,y ,..., ,y . Then the expected generalization error satisﬁes the inequality: x,y γ,C ( w,x ,y inf ∈F α,β,p x,y ,y ) + )) λn BQ α,β,p γ,C We may choose the regularization parameter that optimizes the bound in Theorem 3.1. Moreover, if we pick , and ﬁnd γ,C at some > , then Theorem 2.1 im- plies the following simpliﬁed generalization bound for any ∈ F α,β,p such that (1) x,y ,y ) + /n 1+ . By optimizing over , we obtain a bound: x,y ,y ) + (1+ (2+2 )) By combining Theorem 2.1 and Theorem 3.1, we can immediately obtain the following simple consistency result. It shows that the algorithm can learn an arbitrary nonlinear function on manifold when . Note that Theorem 2.1 implies that the convergence only depends on the intrinsic dimensionality of the manifold , not Theorem 3.2 (Consistency) Suppose the data lie on a compact manifold M , and the norm k·k is the Euclidean norm in . If loss function p,y is Lipschitz. As , we choose α, α/n,β/n α, depends on ), and = 0 . Then it is possible to ﬁnd coding γ,C using unlabeled data such that /n and α,β,p γ,C . If we pick λn , and | . Then the local coordinate coding method (1) with is consistent as lim x,y ( w,x ,y ) = inf M x,y ,y 4 Practical Learning of Coding Given a coordinate coding γ,C , we can use (1) to learn a nonlinear function in . We showed that γ,C can be obtained by optimizing α,β,p γ,C . In practice, we may also consider the following simpliﬁcations of the localization term: |k 1+ |k 1+ Note that we may simply chose = 0 or = 1 . The formulation is related to sparse coding [6] which has no locality constraints with . In this representation, we may either enforce the constraint ) = 1 or for simplicity, remove it because the formulation is already shift-invariant. Putting the above together, we try to optimize the following objective function in practice: γ,C ) = inf |k 1+ We update and via alternating optimization. The step of updating can be transformed into a canonical LASSO problem, where efﬁcient algorithms exist. The step of updating is a least- squares problem in case = 1 5 Relationship to Other Methods Our work is related to several existing approaches in the literature of machine learning and statistics. The ﬁrst class of them is nonlinear manifold learning, such as LLE [8], Isomap [9], and Laplacian

Page 5

Eigenmaps [1]. These methods ﬁnd global coordinates of data manifold based on a pre-computed afﬁnity graph of data points. The use of afﬁnity graphs requires expensive computation and lacks a coherent way of generalization to new data. Our method learns a compact set of bases to form local coordinates, which has a linear complexity with respect to data size and can naturally handle unseen data. More importantly, local coordinate coding has a direct connection to nonlinear function ap- proximation on manifold, and thus provides a theoretically sound unsupervised pre-training method to facilitate further supervised learning tasks. Another set of related models are local models in statistics, such as local kernel smoothing and local regression, e.g.[4, 2], both traditionally using ﬁxed-bandwidth kernels. Local kernel smoothing can be regarded as a zero-order method; while local regression is higher-order, including local linear regression as the 1st-order case. Traditional local methods are not widely used in machine learn- ing practice, because data with non-uniform distribution on the manifold require to use adaptive- bandwidth kernels. The problem can be somehow alleviated by using K-nearest neighbors. How- ever, adaptive kernel smoothing still suffers from the high-dimensionality and noise of data. On the other hand, higher-order methods are computationally expensive and prone to overﬁtting, be- cause they are highly ﬂexible in locally ﬁtting many segments of data in high-dimension space. Our method can be seen as a generalized 1st-order local method with basis learning and adaptive local- ity. Compared to local linear regression, the learning is achieved by ﬁtting a single globally linear function with respect to a set of learned local coordinates, which is much less prone to overﬁtting and computationally much cheaper. This means that our method achieves better balance between local and global aspects of learning. The importance of such balance has been recently discussed in [10]. Finally, local coordinate coding draws connections to vector quantization (VQ) coding, e.g., [3], and sparse coding, which have been widely applied in processing of sensory data, such as acoustic and image signals. Learning linear functions of VQ codes can be regarded as a generalized zero- order local method with basis learning. Our method has an intimate relationship with sparse coding. In fact, we can regard local coordinate coding as locally constrained sparse coding. Inspired by biological visual systems, people has been arguing sparse features of signals are useful for learning [7]. However, to the best of our knowledge, there is no analysis in the literature that directly answers the question why sparse codes can help learning nonlinear functions in high dimensional space. Our work reveals an important ﬁnding — a good ﬁrst-order approximation to nonlinear function requires the codes to be local, which consequently requires the codes to be sparse. However, sparsity does not always guarantee locality conditions. Our experiments demonstrate that sparse coding is helpful for learning only when the codes are local. Therefore locality is more essential for coding, and sparsity is a consequence of such a condition. 6 Experiments Due to the space limitation, we only include two examples: one synthetic and one real, to illustrate various aspects of our theoretical results. We note that image classiﬁcation based on LCC recently achieved state-of-the-art performance in PASCAL Visual Object Classes Challenge 2009. 6.1 Synthetic Data Our ﬁrst example is based on a synthetic data set, where a nonlinear function is deﬁned on a Swiss- roll manifold, as shown in Figure 1-(1). The primary goal is to demonstrate the performance of nonlinear function learning using simple linear ridge regression based on representations obtained from traditional sparse coding and the newly suggested local coordinate coding, which are, respec- tively, formulated as the following, min γ,C |k (2) where ) = . We note that (2) is an approximation to the original formulation, mainly for the simplicity of computation. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2009/workshop/index.html

Page 6

(1) A nonlinear function (2) RMSE= 394 (3) RMSE= 499 (4)RMSE= 661 (5) RMSE= 201 (6) RMSE= 109 (7) RMSE= 669 (8) RMSE= 170 Figure 1: Experiments of nonlinear regression on Swiss-roll: (1) a nonlinear function on the Swiss- roll manifold, where the color indicates function values; (2) result of sparse coding with ﬁxed ran- dom anchor points; (3) result of local coordinate coding with ﬁxed random anchor points; 4) result of sparse coding; (5) result of local coordinate coding; (6) result of local kernel smoothing; (7) result of local coordinate coding on noisy data; (8) result of local kernel smoothing on noisy data. We randomly sample 50 000 data points on the manifold for unsupervised basis learning, and 500 labeled points for supervised regression. The number of bases is ﬁxed to be 128. The learned non- linear functions are tested on another set of 10 000 data points, with their performances evaluated by root mean square error (RMSE). In the ﬁrst setting, we let both coding methods use the same set of ﬁxed bases, which are 128 points randomly sampled from the manifold. The regression results are shown in Figure 1-(2) and (3), respectively. Sparse coding based approach fails to capture the nonlinear function, while local coordinate coding behaves much better. We take a closer look at the data representations obtained from the two different encoding methods, by visualizing the distributions of distances from encoded data to bases that have positive, negative, or zero coefﬁcients in Figure 2. It shows that sparse coding lets bases faraway from the encoded data have nonzero coefﬁcients, while local coordinate coding allows only nearby bases to get nonzero coefﬁcients. In other words, sparse coding on this data does not ensure a good locality and thus fails to facilitate the nonlinear function learning. As another interesting phenomenon, local coordinate coding seems to encourage coefﬁcients to be nonnegative, which is intuitively understandable — if we use several bases close to a data point to linearly approximate the point, each basis should have a positive contribution. However, whether there is any merit by explicitly enforcing nonnegativity will remain an interesting future work. In the next two experiments, given the random bases as a common initialization, we let the two algorithms learn bases from the 50 000 unlabeled data points. The regression results based on the learned bases are depicted in Figure 1-(4) and (5), which indicate that regression error is further reduced for local coordinate coding, but remains to be high for sparse coding. We also make a comparison with local kernel smoothing, which takes a weighted average of function values of -nearest training points to make prediction. As shown in Figure 1-(6), the method works very well on this simple low-dimensional data, even outperforming the local coordinate coding approach. However, if we increase the data dimensionality to be 256 by adding 253 -dimensional independent Gaussian noises with zero mean and unitary variance, local coordinate coding becomes superior to local kernel smoothing, as shown in Figure 1-(7) and (8). This is consistent with our theory, which suggests that local coordinate coding can work well in high dimension; on the other hand, local kernel smoothing is known to suffer from high dimensionality and noise. 6.2 Handwritten Digit Recognition Our second example is based on the MNIST handwritten digit recognition benchmark, where each data point is a 28 28 gray image, and pre-normalized into a unitary 784 -dimensional vector. In our setting, the set of anchor points is obtained from sparse coding, with the regularization on

Page 7

(a-1) (a-2) (b-1) (b-2) Figure 2: Coding locality on Swiss roll: (a) sparse coding vs. (b) local coordinate coding. replaced by inequality constraints k . Our focus here is not on anchor point learning, but rather on checking whether a good nonlinear classiﬁer can be obtained if we enforce sparsity and locality in data representation, and then apply simple one-against-all linear SVMs. Since the optimization cost of sparse coding is invariant under ﬂipping the sign of , we take a postprocessing step to change the sign of if we ﬁnd the corresponding for most of is negative. This rectiﬁcation will ensure the anchor points to be on the data manifold. With the obtained , for each data point we solve the local coordinate coding problem (2), by optimizing only, to obtain the representation )] . In the experiments we try different sizes of bases. The classiﬁcation error rates are provided in Table 1. In addition we also compare with linear classiﬁer on raw images, local kernel smoothing based on -nearest neighbors, and linear classiﬁers using representations obtained from various unsupervised learning methods, including autoencoder based on deep belief networks [5], Laplacian eigenmaps [1], locally linear embedding (LLE) [8], and VQ coding based on K-means. We note that, like most of other manifold learning approaches, Laplacian eigenmaps or LLE is a transductive method which has to incorporate both training and testing data in training. The comparison results are summarized in Table 2. Both sparse coding and local coordinate coding perform quite good for this nonlinear classiﬁcation task, signiﬁcantly outperforming linear classiﬁers on raw images. In addition, local coordinate coding is consistently better than sparse coding across various basis sizes. We further check the locality of both representations by plotting Figure-3, where the basis number is 512 , and ﬁnd that sparse coding on this data set happens to be quite local — unlike the case of Swiss-roll data — here only a small portion of nonzero coefﬁcients (again mostly negative) are assigned onto the bases whose distances to the encoded data exceed the average of basis-to-datum distances. This locality explains why sparse coding works well on MNIST data. On the other hand, local coordinate coding is able to remove the unusual coefﬁcients and further improve the locality. Among those compared methods in Table 2, we note that the error rate 2% of deep belief network reported in [5] was obtained via unsupervised pre-training followed by supervised backpropagation. The error rate based on unsupervised training of deep belief networks is about 90% Therefore our result is competitive to the-state-of-the-art results that are based on unsupervised feature learning plus linear classiﬁcation without using additional image geometric information. This is obtained via a personal communication with Ruslan Salakhutdinov at University of Toronto.

Page 8

(a-1) (a-2) (b-1) (b-2) Figure 3: Coding locality on MNIST: (a) sparse coding vs. (b) local coordinate coding. Table 1: Error rates (%) of MNIST classiﬁcation with different |C| 512 1024 2048 4096 Linear SVM with sparse coding 2.96 2.64 2.16 2.02 Linear SVM with local coordinate coding 2.64 2.44 2.08 1.90 Table 2: Error rates (%) of MNIST classiﬁcation with different methods. Methods Error Rate Linear SVM with raw images 12.0 Linear SVM with VQ coding 3.98 Local kernel smoothing 3.48 Linear SVM with Laplacian eigenmap 2.73 Linear SVM with LLE 2.38 Linear classiﬁer with deep belief network 1.90 Linear SVM with sparse coding 2.02 Linear SVM with local coordinate coding 1.90 7 Conclusion This paper introduces a new method for high dimensional nonlinear learning with data distributed on manifolds. The method can be seen as generalized local linear function approximation, but can be achieved by learning a global linear function with respect to coordinates from unsupervised local coordinate coding. Compared to popular manifold learning methods, our approach can naturally handle unseen data and has a linear complexity with respect to data size. The work also generalizes popular VQ coding and sparse coding schemes, and reveals that locality of coding is essential for supervised function learning. The generalization performance depends on intrinsic dimensionality of the data manifold. The experiments on synthetic and handwritten digit data further conﬁrm the ﬁndings of our analysis.

Page 9

References [1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation , 15:1373 – 1396, 2003. [2] Leon Bottou and Vladimir Vapnik. Local learning algorithms. Neural Computation , 4:888 900, 1992. [3] Robert M. Gray and David L. Neuhoff. Quantization. IEEE Transaction on Information The- ory , pages 2325 – 2383, 1998. [4] Trevor Hastie and Clive Loader. Local regression: Automatic kernel carpentry. Statistical Science , 8:139 – 143, 1993. [5] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science , 313:504 – 507, 2006. [6] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efﬁcient sparse coding algo- rithms. Neural Information Processing Systems (NIPS) 19 , 2007. [7] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught learning: Transfer learning from unlabeled data. International Conference on Machine Learn- ing , 2007. [8] Sam Roweis and Lawrence Saul. Nonlinear dimensionality reduction by locally linear embed- ding. Science , 290:2323 – 2326, 2000. [9] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science , 290:2319 – 2323, 2000. [10] Alon Zakai and Ya’acov Ritov. Consistency and localizability. Journal of Machine Learning Research , 10:827 – 856, 2009.

neclabscom Tong Zhang Rutgers University tzhangstatrutgersedu Yihong Gong NEC Laboratories America ygongsvneclabscom Abstract This paper introduces a new method for semisupervised learning on high dimen sional nonlinear manifolds which includes a pha ID: 23314

- Views :
**169**

**Direct Link:**- Link:https://www.docslides.com/test/nonlinear-learning-using-local
**Embed code:**

Download this pdf

DownloadNote - The PPT/PDF document "Nonlinear Learning using Local Coordinat..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.

Page 1

Nonlinear Learning using Local Coordinate Coding Kai Yu NEC Laboratories America kyu@sv.nec-labs.com Tong Zhang Rutgers University tzhang@stat.rutgers.edu Yihong Gong NEC Laboratories America ygong@sv.nec-labs.com Abstract This paper introduces a new method for semi-supervised learning on high dimen- sional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difﬁcult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. 1 Introduction Consider the problem of learning a nonlinear function on a high dimensional space We are given a set of labeled data ,y ,..., ,y drawn from an unknown underlying distri- bution. Moreover, assume that we observe a set of unlabeled data from the same distribution. If the dimensionality is large compared to , then the traditional statistical theory predicts over- ﬁtting due to the so called “curse of dimensionality”. One intuitive argument for this effect is that when the dimensionality becomes larger, pairwise distances between two similar data points become larger as well. Therefore one needs more data points to adequately ﬁll in the empty space. However, for many real problems with high dimensional data, we do not observe this so-called curse of di- mensionality. This is because although data are physically represented in a high-dimensional space, they often lie on a manifold which has a much smaller intrinsic dimensionality. This paper proposes a new method that can take advantage of the manifold geometric structure to learn a nonlinear function in high dimension. The main idea is to locally embed points on the manifold into a lower dimensional space, expressed as coordinates with respect to a set of anchor points. Our main observation is simple but very important: we show that a nonlinear function on the manifold can be effectively approximated by a linear function with such an coding under appropriate localization conditions. Therefore using Local Coordinate Coding, we turn a very difﬁcult high dimensional nonlinear learning problem into a much simpler linear learning problem, which has been extensively studied in the literature. This idea may also be considered as a high dimensional generalization of low dimensional local smoothing methods in the traditional statistical literature. 2 Local Coordinate Coding We are interested in learning a smooth function deﬁned on a high dimensional space . Let k·k be a norm on . Although we do not restrict to any speciﬁc norm, in practice, one often employs the Euclidean norm (2-norm): ···

Page 2

Deﬁnition 2.1 (Lipschitz Smoothness) A function on is α,β,p -Lipschitz smooth with respect to a norm k·k if | and | 1+ , where we assume α,β > and (0 1] Note that if the Hessian of exists, then we may take = 1 . Learning an arbitrary Lipschitz smooth function on can be difﬁcult due to the curse of dimensionality. That is, the number of samples required to characterize such a function can be exponential in . However, in many practical applications, one often observe that the data we are interested in approximately lie on a manifold which is embedded into . Although is large, the intrinsic dimensionality of can be much smaller. Therefore if we are only interested in learning on , then the complexity should depend on the intrinsic dimensionality of instead of . In this paper, we approach this problem by introducing the idea of localized coordinate coding. The formal deﬁnition of (non-localized) coordinate coding is given below, where we represent a point in by a linear combination of a set of “anchor points”. Later we show it is sufﬁcient to choose a set of “anchor points” with cardinality depending on the intrinsic dimensionality of the manifold rather than Deﬁnition 2.2 (Coordinate Coding) A coordinate coding is a pair γ,C , where is a set of anchor points, and is a map of to )] such that ) = 1 . It induces the following physical approximation of in ) = . Moreover, for all , we deﬁne the corresponding coding norm as The quantity will become useful in our learning theory analysis. The condition ) = 1 follows from the shift-invariance requirement, which means that the coding should remain the same if we use a different origin of the coordinate system for representing data points. It can be shown (see the appendix ﬁle accompanying the submission) that the map is invariant under any shift of the origin for representing data points in if and only if ) = . The importance of the coordinate coding concept is that if a coordinate coding is sufﬁciently localized, then a nonlinear function can be approximate by a linear function with respect to the coding. This critical observation, illustrate in the following linearization lemma, is the foundation of our approach. Due to the space limitation, all proofs are left to the appendix that accompanies the submission. Lemma 2.1 (Linearization) Let γ,C be an arbitrary coordinate coding on . Let be an α,β,p -Lipschitz smooth function. We have for all |k 1+ To understand this result, we note that on the left hand side, a nonlinear function in is ap- proximated by a linear function with respect to the coding , where )] is the set of coefﬁcients to be estimated from data. The quality of this approximation is bounded by the right hand side, which has two terms: the ﬁrst term means should be close to its physical approximation , and the second term means that the coding should be localized. The quality of a coding with respect to can be measured by the right hand side. For convenience, we introduce the following deﬁnition, which measures the locality of a coding. Deﬁnition 2.3 (Localization Measure) Given α,β,p , and coding γ,C , we deﬁne α,β,p γ,C ) = |k 1+ Observe that in α,β,p α,β,p may be regarded as tuning parameters; we may also simply pick = 1 . Since the quality function α,β,p γ,C only depends on unlabeled data, in principle, we can ﬁnd γ,C by optimizing this quality using unlabeled data. Later, we will consider simpliﬁcations of this objective function that are easier to compute. Next we show that if the data lie on a manifold, then the complexity of local coordinate coding depends on the intrinsic manifold dimensionality instead of . We ﬁrst deﬁne manifold and its intrinsic dimensionality.

Page 3

Deﬁnition 2.4 (Manifold) A subset M is called a -smooth ( p> ) manifold with intrinsic dimensionality if there exists a constant such that given any ∈M , there exists vectors ,...,v so that ∈M inf =1 1+ This deﬁnition is quite intuitive. The smooth manifold structure implies that one can approximate a point in effectively using local coordinate coding. Note that for a typical manifold with well- deﬁned curvature, we can take = 1 Deﬁnition 2.5 (Covering Number) Given any subset M , and > . The covering number, denoted as , , is the smallest cardinality of an -cover ⊂ M . That is, sup ∈M inf k For a compact manifold with intrinsic dimensionality , there exists a constant such that its covering number is bounded by , . The following result shows that there exists a local coordinate coding to a set of anchor points of cardinality , )) such that any α,β,p -Lipschitz smooth function can be linearly approximated using local coordinate coding up to the accuracy 1+ Theorem 2.1 (Manifold Coding) If the data points lie on a compact -smooth manifold , and the norm is deﬁned as = ( Ax for some positive deﬁnite matrix . Then given any > there exist anchor points ⊂M and coding such that | (1+ )) , , Q α,β,p γ,C αc )+(1+ )+2 1+ )) 1+ Moreover, for all ∈M , we have 1 + (1 + )) The approximation result in Theorem 2.1 means that the complexity of linearization in Lemma 2.1 depends only on the intrinsic dimension of instead of . Although this result is proved for manifolds, it is important to observe that the coordinate coding method proposed in this paper does not require the data to lie precisely on a manifold, and it does not require knowing . In fact, similar results hold even when the data only approximately lie on a manifold. In the next section, we characterize the learning complexity of the local coordinate coding method. It implies that linear prediction methods can be used to effectively learn nonlinear functions on a manifold. The nonlinearity is fully captured by the coordinate coding map (which can be a nonlinear function). This approach has some great advantages because the problem of ﬁnding local coordinate coding is much simpler than direct nonlinear learning: Learning γ,C only requires unlabeled data, and the number of unlabeled data can be signiﬁcantly more than the number of labeled data. This step also prevents overﬁtting with respect to labeled data. In practice, we do not have to ﬁnd the optimal coding because the coordinates are merely features for linear supervised learning. This signiﬁcantly simpliﬁes the optimization prob- lem. Consequently, it is more robust than some standard approaches to nonlinear learning that direct optimize nonlinear functions on labeled data (e.g., neural networks). 3 Learning Theory In machine learning, we minimize the expected loss x,y ,y with respect to the underlying distribution within a function class ∈F . In this paper, we are interested in the function class α,β,p ) : ( α,β,p Lipschitz smooth function in The local coordinate coding method considers a linear approximation of functions in α,β,p on the data manifold. Given a local coordinate coding scheme γ,C , we approximate each ∈F α,β,p by γ,C ( w,x ) = , where we estimate the coefﬁcients using ridge regression as: [ ] = arg min =1 γ,C w,x ,y ) + )) (1)

Page 4

where is an arbitrary function assumed to be pre-ﬁxed. In the Bayesian interpretation, this can be regarded as the prior mean for the weights . The default values of are simply . Given a loss function p,y , let p,y ) = p,y /∂p . For simplicity, in this paper we only consider convex Lipschitz loss function, where p,y | . This includes the standard classiﬁcation loss functions such as logistic regression and SVM (hinge loss), both with = 1 Theorem 3.1 (Generalization Bound) Suppose p,y is Lipschitz: p,y | . Consider coordinate coding γ,C , and the estimation method (1) with random training examples ,y ,..., ,y . Then the expected generalization error satisﬁes the inequality: x,y γ,C ( w,x ,y inf ∈F α,β,p x,y ,y ) + )) λn BQ α,β,p γ,C We may choose the regularization parameter that optimizes the bound in Theorem 3.1. Moreover, if we pick , and ﬁnd γ,C at some > , then Theorem 2.1 im- plies the following simpliﬁed generalization bound for any ∈ F α,β,p such that (1) x,y ,y ) + /n 1+ . By optimizing over , we obtain a bound: x,y ,y ) + (1+ (2+2 )) By combining Theorem 2.1 and Theorem 3.1, we can immediately obtain the following simple consistency result. It shows that the algorithm can learn an arbitrary nonlinear function on manifold when . Note that Theorem 2.1 implies that the convergence only depends on the intrinsic dimensionality of the manifold , not Theorem 3.2 (Consistency) Suppose the data lie on a compact manifold M , and the norm k·k is the Euclidean norm in . If loss function p,y is Lipschitz. As , we choose α, α/n,β/n α, depends on ), and = 0 . Then it is possible to ﬁnd coding γ,C using unlabeled data such that /n and α,β,p γ,C . If we pick λn , and | . Then the local coordinate coding method (1) with is consistent as lim x,y ( w,x ,y ) = inf M x,y ,y 4 Practical Learning of Coding Given a coordinate coding γ,C , we can use (1) to learn a nonlinear function in . We showed that γ,C can be obtained by optimizing α,β,p γ,C . In practice, we may also consider the following simpliﬁcations of the localization term: |k 1+ |k 1+ Note that we may simply chose = 0 or = 1 . The formulation is related to sparse coding [6] which has no locality constraints with . In this representation, we may either enforce the constraint ) = 1 or for simplicity, remove it because the formulation is already shift-invariant. Putting the above together, we try to optimize the following objective function in practice: γ,C ) = inf |k 1+ We update and via alternating optimization. The step of updating can be transformed into a canonical LASSO problem, where efﬁcient algorithms exist. The step of updating is a least- squares problem in case = 1 5 Relationship to Other Methods Our work is related to several existing approaches in the literature of machine learning and statistics. The ﬁrst class of them is nonlinear manifold learning, such as LLE [8], Isomap [9], and Laplacian

Page 5

Eigenmaps [1]. These methods ﬁnd global coordinates of data manifold based on a pre-computed afﬁnity graph of data points. The use of afﬁnity graphs requires expensive computation and lacks a coherent way of generalization to new data. Our method learns a compact set of bases to form local coordinates, which has a linear complexity with respect to data size and can naturally handle unseen data. More importantly, local coordinate coding has a direct connection to nonlinear function ap- proximation on manifold, and thus provides a theoretically sound unsupervised pre-training method to facilitate further supervised learning tasks. Another set of related models are local models in statistics, such as local kernel smoothing and local regression, e.g.[4, 2], both traditionally using ﬁxed-bandwidth kernels. Local kernel smoothing can be regarded as a zero-order method; while local regression is higher-order, including local linear regression as the 1st-order case. Traditional local methods are not widely used in machine learn- ing practice, because data with non-uniform distribution on the manifold require to use adaptive- bandwidth kernels. The problem can be somehow alleviated by using K-nearest neighbors. How- ever, adaptive kernel smoothing still suffers from the high-dimensionality and noise of data. On the other hand, higher-order methods are computationally expensive and prone to overﬁtting, be- cause they are highly ﬂexible in locally ﬁtting many segments of data in high-dimension space. Our method can be seen as a generalized 1st-order local method with basis learning and adaptive local- ity. Compared to local linear regression, the learning is achieved by ﬁtting a single globally linear function with respect to a set of learned local coordinates, which is much less prone to overﬁtting and computationally much cheaper. This means that our method achieves better balance between local and global aspects of learning. The importance of such balance has been recently discussed in [10]. Finally, local coordinate coding draws connections to vector quantization (VQ) coding, e.g., [3], and sparse coding, which have been widely applied in processing of sensory data, such as acoustic and image signals. Learning linear functions of VQ codes can be regarded as a generalized zero- order local method with basis learning. Our method has an intimate relationship with sparse coding. In fact, we can regard local coordinate coding as locally constrained sparse coding. Inspired by biological visual systems, people has been arguing sparse features of signals are useful for learning [7]. However, to the best of our knowledge, there is no analysis in the literature that directly answers the question why sparse codes can help learning nonlinear functions in high dimensional space. Our work reveals an important ﬁnding — a good ﬁrst-order approximation to nonlinear function requires the codes to be local, which consequently requires the codes to be sparse. However, sparsity does not always guarantee locality conditions. Our experiments demonstrate that sparse coding is helpful for learning only when the codes are local. Therefore locality is more essential for coding, and sparsity is a consequence of such a condition. 6 Experiments Due to the space limitation, we only include two examples: one synthetic and one real, to illustrate various aspects of our theoretical results. We note that image classiﬁcation based on LCC recently achieved state-of-the-art performance in PASCAL Visual Object Classes Challenge 2009. 6.1 Synthetic Data Our ﬁrst example is based on a synthetic data set, where a nonlinear function is deﬁned on a Swiss- roll manifold, as shown in Figure 1-(1). The primary goal is to demonstrate the performance of nonlinear function learning using simple linear ridge regression based on representations obtained from traditional sparse coding and the newly suggested local coordinate coding, which are, respec- tively, formulated as the following, min γ,C |k (2) where ) = . We note that (2) is an approximation to the original formulation, mainly for the simplicity of computation. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2009/workshop/index.html

Page 6

(1) A nonlinear function (2) RMSE= 394 (3) RMSE= 499 (4)RMSE= 661 (5) RMSE= 201 (6) RMSE= 109 (7) RMSE= 669 (8) RMSE= 170 Figure 1: Experiments of nonlinear regression on Swiss-roll: (1) a nonlinear function on the Swiss- roll manifold, where the color indicates function values; (2) result of sparse coding with ﬁxed ran- dom anchor points; (3) result of local coordinate coding with ﬁxed random anchor points; 4) result of sparse coding; (5) result of local coordinate coding; (6) result of local kernel smoothing; (7) result of local coordinate coding on noisy data; (8) result of local kernel smoothing on noisy data. We randomly sample 50 000 data points on the manifold for unsupervised basis learning, and 500 labeled points for supervised regression. The number of bases is ﬁxed to be 128. The learned non- linear functions are tested on another set of 10 000 data points, with their performances evaluated by root mean square error (RMSE). In the ﬁrst setting, we let both coding methods use the same set of ﬁxed bases, which are 128 points randomly sampled from the manifold. The regression results are shown in Figure 1-(2) and (3), respectively. Sparse coding based approach fails to capture the nonlinear function, while local coordinate coding behaves much better. We take a closer look at the data representations obtained from the two different encoding methods, by visualizing the distributions of distances from encoded data to bases that have positive, negative, or zero coefﬁcients in Figure 2. It shows that sparse coding lets bases faraway from the encoded data have nonzero coefﬁcients, while local coordinate coding allows only nearby bases to get nonzero coefﬁcients. In other words, sparse coding on this data does not ensure a good locality and thus fails to facilitate the nonlinear function learning. As another interesting phenomenon, local coordinate coding seems to encourage coefﬁcients to be nonnegative, which is intuitively understandable — if we use several bases close to a data point to linearly approximate the point, each basis should have a positive contribution. However, whether there is any merit by explicitly enforcing nonnegativity will remain an interesting future work. In the next two experiments, given the random bases as a common initialization, we let the two algorithms learn bases from the 50 000 unlabeled data points. The regression results based on the learned bases are depicted in Figure 1-(4) and (5), which indicate that regression error is further reduced for local coordinate coding, but remains to be high for sparse coding. We also make a comparison with local kernel smoothing, which takes a weighted average of function values of -nearest training points to make prediction. As shown in Figure 1-(6), the method works very well on this simple low-dimensional data, even outperforming the local coordinate coding approach. However, if we increase the data dimensionality to be 256 by adding 253 -dimensional independent Gaussian noises with zero mean and unitary variance, local coordinate coding becomes superior to local kernel smoothing, as shown in Figure 1-(7) and (8). This is consistent with our theory, which suggests that local coordinate coding can work well in high dimension; on the other hand, local kernel smoothing is known to suffer from high dimensionality and noise. 6.2 Handwritten Digit Recognition Our second example is based on the MNIST handwritten digit recognition benchmark, where each data point is a 28 28 gray image, and pre-normalized into a unitary 784 -dimensional vector. In our setting, the set of anchor points is obtained from sparse coding, with the regularization on

Page 7

(a-1) (a-2) (b-1) (b-2) Figure 2: Coding locality on Swiss roll: (a) sparse coding vs. (b) local coordinate coding. replaced by inequality constraints k . Our focus here is not on anchor point learning, but rather on checking whether a good nonlinear classiﬁer can be obtained if we enforce sparsity and locality in data representation, and then apply simple one-against-all linear SVMs. Since the optimization cost of sparse coding is invariant under ﬂipping the sign of , we take a postprocessing step to change the sign of if we ﬁnd the corresponding for most of is negative. This rectiﬁcation will ensure the anchor points to be on the data manifold. With the obtained , for each data point we solve the local coordinate coding problem (2), by optimizing only, to obtain the representation )] . In the experiments we try different sizes of bases. The classiﬁcation error rates are provided in Table 1. In addition we also compare with linear classiﬁer on raw images, local kernel smoothing based on -nearest neighbors, and linear classiﬁers using representations obtained from various unsupervised learning methods, including autoencoder based on deep belief networks [5], Laplacian eigenmaps [1], locally linear embedding (LLE) [8], and VQ coding based on K-means. We note that, like most of other manifold learning approaches, Laplacian eigenmaps or LLE is a transductive method which has to incorporate both training and testing data in training. The comparison results are summarized in Table 2. Both sparse coding and local coordinate coding perform quite good for this nonlinear classiﬁcation task, signiﬁcantly outperforming linear classiﬁers on raw images. In addition, local coordinate coding is consistently better than sparse coding across various basis sizes. We further check the locality of both representations by plotting Figure-3, where the basis number is 512 , and ﬁnd that sparse coding on this data set happens to be quite local — unlike the case of Swiss-roll data — here only a small portion of nonzero coefﬁcients (again mostly negative) are assigned onto the bases whose distances to the encoded data exceed the average of basis-to-datum distances. This locality explains why sparse coding works well on MNIST data. On the other hand, local coordinate coding is able to remove the unusual coefﬁcients and further improve the locality. Among those compared methods in Table 2, we note that the error rate 2% of deep belief network reported in [5] was obtained via unsupervised pre-training followed by supervised backpropagation. The error rate based on unsupervised training of deep belief networks is about 90% Therefore our result is competitive to the-state-of-the-art results that are based on unsupervised feature learning plus linear classiﬁcation without using additional image geometric information. This is obtained via a personal communication with Ruslan Salakhutdinov at University of Toronto.

Page 8

(a-1) (a-2) (b-1) (b-2) Figure 3: Coding locality on MNIST: (a) sparse coding vs. (b) local coordinate coding. Table 1: Error rates (%) of MNIST classiﬁcation with different |C| 512 1024 2048 4096 Linear SVM with sparse coding 2.96 2.64 2.16 2.02 Linear SVM with local coordinate coding 2.64 2.44 2.08 1.90 Table 2: Error rates (%) of MNIST classiﬁcation with different methods. Methods Error Rate Linear SVM with raw images 12.0 Linear SVM with VQ coding 3.98 Local kernel smoothing 3.48 Linear SVM with Laplacian eigenmap 2.73 Linear SVM with LLE 2.38 Linear classiﬁer with deep belief network 1.90 Linear SVM with sparse coding 2.02 Linear SVM with local coordinate coding 1.90 7 Conclusion This paper introduces a new method for high dimensional nonlinear learning with data distributed on manifolds. The method can be seen as generalized local linear function approximation, but can be achieved by learning a global linear function with respect to coordinates from unsupervised local coordinate coding. Compared to popular manifold learning methods, our approach can naturally handle unseen data and has a linear complexity with respect to data size. The work also generalizes popular VQ coding and sparse coding schemes, and reveals that locality of coding is essential for supervised function learning. The generalization performance depends on intrinsic dimensionality of the data manifold. The experiments on synthetic and handwritten digit data further conﬁrm the ﬁndings of our analysis.

Page 9

References [1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation , 15:1373 – 1396, 2003. [2] Leon Bottou and Vladimir Vapnik. Local learning algorithms. Neural Computation , 4:888 900, 1992. [3] Robert M. Gray and David L. Neuhoff. Quantization. IEEE Transaction on Information The- ory , pages 2325 – 2383, 1998. [4] Trevor Hastie and Clive Loader. Local regression: Automatic kernel carpentry. Statistical Science , 8:139 – 143, 1993. [5] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science , 313:504 – 507, 2006. [6] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efﬁcient sparse coding algo- rithms. Neural Information Processing Systems (NIPS) 19 , 2007. [7] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught learning: Transfer learning from unlabeled data. International Conference on Machine Learn- ing , 2007. [8] Sam Roweis and Lawrence Saul. Nonlinear dimensionality reduction by locally linear embed- ding. Science , 290:2323 – 2326, 2000. [9] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science , 290:2319 – 2323, 2000. [10] Alon Zakai and Ya’acov Ritov. Consistency and localizability. Journal of Machine Learning Research , 10:827 – 856, 2009.

Today's Top Docs

Related Slides