Semi-Supervised Classification by Low Density Separation

Olivier Chapelle, Alexander Zien
Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany

Abstract

We believe that the cluster assumption is key to successful semi-supervised learning. Based on this, we propose three semi-supervised algorithms: 1. deriving graph-based distances that emphasize low density regions between clusters, followed by training a standard SVM; 2. optimizing the Transductive SVM objective function, which places the decision boundary in low density regions, by gradient descent; 3. combining the

first two to make maximum use of the cluster assumption. We compare with state of the art algorithms and demonstrate superior accuracy for the latter methods.

1 INTRODUCTION

The goal of semi-supervised classification is to use unlabeled data to improve the generalization. The cluster assumption states that the decision boundary should not cross high density regions, but instead lie in low density regions. We believe that virtually all successful semi-supervised algorithms utilize the cluster assumption, though most of the time indirectly. For instance, manifold learning algorithms (e.g., [1])

construct decision functions that vary little along the manifolds occupied by the data. Often, different classes form separate manifolds. Then, manifold learning indirectly implements the cluster assumption by not cutting the manifolds.

The Transductive SVM [20] implements the cluster assumption more directly by trying to find a hyperplane which is far from the unlabeled points. In our opinion, the rationale for maximizing the margin is very different for the labeled and unlabeled points:

For the labeled points, it implements regularization [20]. Intuitively, the large margin property makes the

classication robust with resp ect to erturbations of the data oin ts [6 ]. or the unlab eled oin ts, the margin maximiza- tion implemen ts the cluster assumption. It is not directly related to regularization (in this resp ect, ha dieren view than apnik [20 ]). Con- sider for instance an example where the cluster assumption do es not hold: uniform distribution of input oin ts. Then the unlab eled oin ts con- ey almost no information, and maximizing the margin on those oin ts is useless (and can ev en harmful). TSVM migh seem to the erfect semi-sup ervised algorithm, since it com bines

the powerful regularization of SVMs with a direct implementation of the cluster assumption. However, its main drawback is that the objective function is non-convex and thus difficult to minimize. Consequently, optimization heuristics like SVMlight [12] sometimes give bad results and are often criticized. The main points of this paper are:

The objective function of TSVM is appropriate, but different ways of optimizing it can lead to very different results. Thus, it is more accurate to criticize a given implementation of the TSVM rather than the objective function itself.

The search for a low

density decision boundary is difficult. The task of the TSVM algorithm can be eased by changing the data representation.

To substantiate our claims, we develop and assess corresponding algorithms. Firstly, we propose a graph-based semi-supervised learning method exploiting the cluster assumption. Secondly, it is shown that gradient descent on the primal formulation of the TSVM objective function performs significantly better than the optimization strategy pursued in SVMlight [12]. Finally, combining these ideas in one algorithm, we are able to achieve clearly superior generalization accuracy.


2 ALGORITHMS

Let the given data consist of n labeled data points (x_1, y_1), ..., (x_n, y_n) and m unlabeled data points x_{n+1}, ..., x_{n+m}. For simplicity we assume that the labels are binary, i.e. y_i ∈ {+1, −1}; for multi-class problems, we use the one-against-rest scheme that is common for SVMs (e.g., [17]). In the following sections, we describe different ways to enforce the cluster assumption in SVM classification and how they can be implemented.

2.1 GRAPH-BASED SIMILARITIES

Let a graph be derived from the data such that the nodes are the data points. If sparsity is desired, edges are placed between nodes that are nearest neighbors (NN),

either by thresholding the degree (k-NN) or the distance (ε-NN). Many semi-supervised learning methods operate on nearest neighbor graphs, see e.g. [1, 14, 18, 23, 22]. Usually they do not require the data points themselves, but only their pairwise distances along the edges. In the following we assume that the edge weights d(i, j) are Euclidean distances, d(i, j) := ||x_i − x_j|| (missing edges correspond to d(i, j) = ∞), although other distances are possible as well.

Many graph-based semi-supervised algorithms work by enforcing smoothness of the solution with respect to the graph, i.e. that the output function varies little

between connected nodes. Here we use the graph to derive pairwise similarities between points, thereby "squeezing" the distances in high density regions while leaving them large in low density regions. This idea has been proposed before, e.g. in [5, 21] and [4, section 3]. It has been implemented and used in Isomap [19], cluster kernels [7], and connectivity clustering [10].

2.1.1 Motivation

According to the cluster assumption, the decision boundary should preferably not cut clusters. A way to enforce this for similarity-based classifiers is to assign low similarities to pairs of points that lie in

different clusters. To do so, we construct a Parzen window density estimate with a Gaussian kernel of width σ,

p(x) ∝ Σ_{i=1}^{n+m} exp( −||x_i − x||² / (2σ²) ).

If two points are in the same cluster, this means that there exists a continuous connecting curve that only goes through regions of high density; if the points are in different clusters, every such curve has to traverse a density valley. (Edges are made symmetric: j is linked to i whenever i is linked to j.) We can thus define the similarity of two points by maximizing, over all continuous connecting curves, the minimum density along the connection; but this is hard to compute. Two observations, illustrated in Figure 1, allow us to ap-

proximate the above similarity with paths on a graph:

(a) An optimal connecting curve can be well approximated by conjoining short line segments that directly connect data points.

(b) The minimum density is assumed at the middle of a line segment, and is dominated by the closest points.

[Figure 1: Optimal connecting curves are well approximated by paths of short distance edges on a graph. Panels (a) and (b); axes of the right panel: distance along path vs. density.]

2.1.2 A density-sensitive distance measure

Formally, we define p to be a path of length l = |p| on the graph (V, E) if p = (p_1, ..., p_l) with (p_k, p_{k+1}) ∈ E for 1 ≤ k < l. The path p is said to connect the nodes p_1 and p_l. Let P_{ij} denote the set of all paths connecting x_i and x_j. We obtain, as an approximation of the similarity, the maximum over all paths p ∈ P_{ij} of the minimum density along the path, which translates into the kernel

k(x_i, x_j) = exp( −(1/(2σ²)) ( min_{p ∈ P_{ij}} max_{1 ≤ k < |p|} ||p_k − p_{k+1}|| )² ).    (1)

This kernel, called the "connectivity kernel", is positive definite and was suggested for clustering previously [10]. The kernel values do not depend on the length of the paths, which may lead to the connection of otherwise separated clusters by single outliers ("bridge" points). To avoid this problem, we "soften" the max in Equation (1) by replacing it with

smax_ρ(p) := (1/ρ) ln( 1 + Σ_{k=1}^{|p|−1} ( e^{ρ ||p_k − p_{k+1}||} − 1 ) ).    (2)

Equation (1) is recovered by taking ρ → ∞. If ρ → 0, smax_ρ becomes simply the sum of the original distances along the path. Due to the triangular inequality, this is never less than d(i, j), so that in a full graph with Euclidean distances the minimum path distance becomes ||x_i − x_j||. Thus, the standard Gaussian RBF kernel is recovered, and no use is made of the unlabeled data. However, for a sparse graph, computing the minimum path distance when ρ → 0 is equivalent to Isomap [19].

The proposed method can be summarized as follows:

1. Build a nearest neighbor graph from all (labeled and unlabeled) data.

2. Compute the distance matrix D of minimal ρ-path distances,

D_ij = (1/ρ) ln( 1 + min_{p ∈ P_{ij}} Σ_{k=1}^{|p|−1} ( e^{ρ d(p_k, p_{k+1})} − 1 ) ),

from all labeled points to all points.

3. Perform a non-linear transformation on D to get the kernel, K_ij = exp( −D_ij² / (2σ²) ).

The linear case corresponds to δ = 1 and K = −½ P D P (with D containing the squared distances), P being the centering matrix (as in Multidimensional Scaling [8]): P_ij = δ_ij − 1/n.

4. Train an SVM with K and predict.
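To make steps 1-3 concrete, here is a minimal Python sketch (assuming numpy and scipy; the function name rho_path_kernel, the use of a fully connected graph as in Algorithm 1 below, and the computation of all pairs rather than only labeled-to-all distances are simplifications of this sketch, not the paper's matlab implementation). Because the logarithm is monotone, running Dijkstra on the transformed edge lengths exp(ρ d) − 1 recovers exactly the minimal softened path length of Equation (2).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import dijkstra

def rho_path_kernel(X, rho=1.0, sigma=1.0):
    """rho-path distances (Eq. 2 minimized over paths) and the derived kernel.

    Edge lengths are transformed to exp(rho * d) - 1, so that the ordinary
    shortest path found by Dijkstra corresponds to the minimal softened
    path length; the kernel is the Gaussian of the resulting distance.
    """
    d = squareform(pdist(X))                  # Euclidean distances d(i, j)
    edges = np.expm1(rho * d)                 # exp(rho * d) - 1 on the full graph
    sp = dijkstra(edges, directed=False)      # minimal transformed path lengths
    D = np.log1p(sp) / rho                    # softened rho-path distances
    K = np.exp(-D ** 2 / (2.0 * sigma ** 2))  # non-linear transformation (step 3)
    return K, D
```

An SVM with the precomputed kernel K restricted to the labeled points can then be trained as in step 4.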

2.1.3 Comments

A few comments can be made on these steps.

1- The use of a sparse graph is merely to save computation time. This is in contrast to some other graph-based methods that require sparseness for detecting the manifold structure (e.g. Isomap). In our method, the sparse graph is always seen as an approximation to the full graph. However, the accuracy of this approximation depends on the value of the softening parameter: for ρ → 0, the direct connection is always the shortest, so that every deletion of an edge can cause the corresponding distance to increase. For ρ → ∞, shortest paths almost never contain any long edge, so that long edges can safely be deleted.

2- For large values of ρ, the distances between points in the same cluster are decreased. In contrast, the distances between points from different clusters are still dominated by the gaps between the clusters and, as a result, those gaps become more pronounced. Instead of Equation (2), it is possible to use other interpolations between the max and the mean, such as the Minkowski metric,

( Σ_{k=1}^{|p|−1} d(p_k, p_{k+1})^ρ )^{1/ρ}.

3- D is in general not positive definite (p.d.), except for ρ → 0 (standard RBF) and ρ → ∞ (then D is an ultrametric and thus negative definite [10], yielding a p.d. kernel [17]). In practice, negative eigenvalues can be observed, but they are few and small in absolute value, as documented in Table 1. In our experiments, the SVM training still converges quickly. Moreover, recent papers have argued in favor of the use of non-positive definite kernels for learning [11, 16].

Table 1: Empirically found weight w of the negative eigenvalues on the Coil20 dataset, as a percentage of the weight of all eigenvalues, w := 100 Σ_i max(0, −λ_i) / Σ_i |λ_i| (values for increasing ρ: 0.5, 0.19, 2.96, 4.66, 1.89, 0.02).

2.2 MARGIN MAXIMIZATION

The Transductive Support Vector Machine (TSVM), first introduced in [20] and implemented in [3, 12], aims at minimizing the following functional,

min_{w, b}  ½ ||w||² + C Σ_{i=1}^{n} ξ_i + C* Σ_{i=n+1}^{n+m} ξ_i,

under the constraints y_i (w·x_i + b) ≥ 1 − ξ_i for the labeled points, |w·x_i + b| ≥ 1 − ξ_i for the unlabeled points, and ξ_i ≥ 0. This can be rewritten without constraints as the minimization of

½ ||w||² + C Σ_{i=1}^{n} H(y_i (w·x_i + b)) + C* Σ_{i=n+1}^{n+m} H(|w·x_i + b|)    (3)

with H(t) := max(0, 1 − t). Unfortunately, the last term makes this problem non-convex and difficult to solve [3, 12]. The implementation of TSVM that we propose in this paper is to perform standard gradient descent on (3). However, since the latter is not differentiable, we

replace it by

½ ||w||² + C Σ_{i=1}^{n} H(y_i (w·x_i + b)) + C* Σ_{i=n+1}^{n+m} exp( −s (w·x_i + b)² ),    (4)

where the exponential term, with s > 0, is a smooth approximation of H(|·|) (c.f. Figure 2). To enforce that all unlabeled data are not put in the same class, we add the additional constraint

(1/m) Σ_{i=n+1}^{n+m} (w·x_i + b) = (1/n) Σ_{i=1}^{n} y_i.    (5)
This is in analogy to the treatment of the min-cut problem in spectral clustering, which is usually replaced by the normalized cut to enforce balanced solutions [13].

Finally, note that unlike traditional SVM learning algorithms, which solve the problem in the dual, we directly solve the problem in the primal. If we want to use a non-linear kernel, it is possible to compute the coordinates of each point in the kernel PCA basis [17].
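As a concrete illustration of Equations (4) and (5), the following sketch performs plain gradient descent on the smoothed primal objective; the paper's own code uses matlab and conjugate gradients, and the smoothing constant s, the step size, and the function name grad_tsvm are assumptions of this sketch, not taken from the paper. The balancing constraint (5) is eliminated analytically by expressing the bias through the mean output on the unlabeled points.

```python
import numpy as np

def grad_tsvm(Xl, yl, Xu, C=1.0, Cstar=1.0, s=3.0, lr=1e-3, iters=2000, w0=None):
    """Gradient descent on the smoothed primal TSVM objective, Eq. (4).

    The balancing constraint (5) is eliminated analytically: with
    b = mean(yl) - mean(Xu @ w), all outputs can be written with inputs
    centered at the unlabeled mean. The smoothing constant s, the step
    size and the plain gradient loop are assumptions of this sketch.
    """
    mu = Xu.mean(axis=0)
    Xl_c, Xu_c, b0 = Xl - mu, Xu - mu, yl.mean()
    w = np.zeros(Xl.shape[1]) if w0 is None else w0.astype(float).copy()
    for _ in range(iters):
        fl = Xl_c.dot(w) + b0                    # outputs on labeled points
        fu = Xu_c.dot(w) + b0                    # outputs on unlabeled points
        active = (1.0 - yl * fl) > 0             # hinge H(t) = max(0, 1 - t) is active
        gauss = np.exp(-s * fu ** 2)             # smooth surrogate for H(|t|)
        grad = (w
                - C * (yl[active][:, None] * Xl_c[active]).sum(axis=0)
                - 2.0 * s * Cstar * ((fu * gauss)[:, None] * Xu_c).sum(axis=0))
        w -= lr * grad
    return w, b0 - mu.dot(w)                     # weight vector and actual bias b
```

Because the bias is substituted out, the constraint (5) holds by construction at every iteration, and the gradient is taken with respect to w only.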

More directly than via kernel PCA, one can compute the Cholesky decomposition K = L L^T of the Gram matrix, and minimize (4) with x_i replaced by the i-th row of L, (L_{i,1}, ..., L_{i,n+m}).

[Figure 2: TSVM cost functions for unlabeled data (signed output vs. loss): the standard TSVM loss and its Gaussian approximation.]

We decided to initially set C* to a small value and increase it exponentially to C, thereby following SVMlight. Note that the choice of setting the final value of C* to C is somewhat arbitrary. Ideally, it would be preferable to consider this value as a free parameter of the algorithm.
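To make the Cholesky representation above concrete, a minimal sketch assuming numpy (the diagonal jitter is an assumption to keep the factorization of a nearly, but not exactly, positive definite Gram matrix numerically safe):

```python
import numpy as np

def cholesky_coordinates(K, jitter=1e-8):
    """Rows of the Cholesky factor L of K = L L^T as explicit coordinates.

    Inner products of these rows reproduce K, so running the linear primal
    TSVM sketched above on them corresponds to a non-linear TSVM with kernel K.
    """
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    return L   # one row of coordinates per (labeled or unlabeled) point
```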

2.3 IMPLEMENTATION

From the methods discussed above, we derive three algorithms:

1. graph: training an SVM on a graph-distance derived kernel;
2. ∇TSVM: training a TSVM by gradient descent;
3. LDS (Low Density Separation), combining both of the previous algorithms.

For SVM we use the Spider machine learning package for matlab (available at http://www.kyb.tuebingen.mpg.de/bs/people/spider). For ∇TSVM a conjugate gradient descent method was used (available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize). The distance computation for graph can be carried out using the shortest path algorithm by Dijkstra [9]. For LDS the full matrix of pairwise distances has to be computed.

[Figure 3: Each point represents the values of the objective function reached by ∇TSVM and TSVM for some value of C*; points above the diagonal mean that ∇TSVM found a better local minimum. Left: Coil20 dataset, right: g10n (both described below). Axes (log scale): objective value of ∇TSVM vs. objective value of TSVM (Joachims).]

Since the derived kernel is (in general) not positive definite, we can apply Multidimensional Scaling (MDS) [8] to find a Euclidean embedding of D before applying ∇TSVM. The embedding found by classical MDS is given by the eigenvectors corresponding to the positive eigenvalues of A = −½ P D P (with D containing the squared distances and P the centering matrix defined above). For computational reasons, we decided to take only the first p eigenvectors, with decreasing eigenvalues λ_1 ≥ λ_2 ≥ ..., such that

Σ_{i=1}^{p} λ_i ≥ (1 − ε) Σ_i max(0, λ_i).    (6)
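A sketch of this MDS embedding with the truncation of Equation (6) might look as follows (the eigen-decomposition route and the default threshold eps are assumptions of this sketch; the paper's exact criterion may differ in detail):

```python
import numpy as np

def mds_embedding(D_squared, eps=1e-2):
    """Classical MDS of a (possibly non-Euclidean) matrix of squared distances.

    Keeps the leading p eigenvectors of A = -1/2 P D P, truncated in the
    spirit of Eq. (6): enough positive eigenvalues to cover a (1 - eps)
    fraction of the total positive spectrum.
    """
    n = D_squared.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    A = -0.5 * P.dot(D_squared).dot(P)
    lam, V = np.linalg.eigh(A)                   # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]               # reorder to decreasing
    pos = np.maximum(lam, 0.0)
    p = int(np.searchsorted(np.cumsum(pos), (1.0 - eps) * pos.sum()) + 1)
    return V[:, :p] * np.sqrt(lam[:p])           # one row of new coordinates per point
```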

We compare our algorithms to one state of the art supervised method, SVM, and to state of the art semi-supervised methods: the TSVM optimization scheme as implemented in SVMlight [12], and a graph-based manifold learning method which is closely related to those in [1, 22, 23]. More precisely, we estimate the labels ŷ of the unlabeled points by minimizing the functional

Σ_{i=1}^{n} (ŷ_i − y_i)² + γ Σ_{i,j=1}^{n+m} W_ij (ŷ_i − ŷ_j)²,    (7)

where W_ij = exp( −||x_i − x_j||² / (2σ²) ) if x_j is among the k nearest neighbors of x_i (or vice-versa), and W_ij = 0 otherwise.
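For reference, a rough sketch of how this manifold baseline can be solved as a sparse linear system; the weight gamma and the closed-form stationarity condition follow the form of Equation (7) given above and should be treated as indicative only:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

def manifold_labels(W, y_labeled, n_labeled, gamma=1.0):
    """Binary label propagation in the spirit of Eq. (7).

    W: sparse symmetric k-NN weight matrix over all n+m points.
    Minimizes sum_{i<=n}(yhat_i - y_i)^2 + gamma * sum_{ij} W_ij (yhat_i - yhat_j)^2.
    Setting the gradient to zero gives (S + 2*gamma*L) yhat = S y, where
    L = D - W is the graph Laplacian and S selects the labeled points.
    """
    n_total = W.shape[0]
    L = diags(np.asarray(W.sum(axis=1)).ravel()) - W      # graph Laplacian
    s = np.zeros(n_total)
    s[:n_labeled] = 1.0
    y = np.zeros(n_total)
    y[:n_labeled] = y_labeled
    yhat = spsolve((diags(s) + 2.0 * gamma * L).tocsr(), s * y)
    return np.sign(yhat[n_labeled:])                       # labels for the unlabeled points
```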

This method depends on the sparsity of the graph.

Figure 3 compares how both implementations of TSVM are able to minimize the cost function (3). Note that our proposed implementation does not minimize (3), but the differentiable approximation (4), and for this reason it has a disadvantage in the comparison shown in Figure 3. Nevertheless, on average it produces better values of the objective function, which translate, as we will see later, into better test errors.

2.3.1 Computational Complexity

We implement the search for the next-closest unexplored node in Dijkstra's algorithm with a priority queue based on a binary heap. This results in O( k(n+m) log(n+m) ) run time for computing the path distances of one labeled point to all other points. Thus, the entire matrix costs O( n k(n+m) log(n+m) ) on a k-NN graph.

The time complexity of the gradient descent algorithm is approximately equal to that of evaluating the cost function multiplied by the number of variables. For ∇TSVM this amounts to O( (n+m)³ ). The MDS is of the same time complexity, since it computes the eigendecomposition of an (n+m)×(n+m) matrix. For both algorithms, the complexity can be reduced if one considers only the first p eigenvectors.
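The binary-heap Dijkstra described above can be sketched as follows (a lazy-deletion variant; the dictionary-based adjacency-list format is an assumption of this sketch):

```python
import heapq

def dijkstra_from(source, adj):
    """Single-source shortest paths with a binary-heap priority queue.

    adj[u] is a list of (v, edge_length) pairs, e.g. a symmetrized k-NN graph.
    Each node is popped from the heap and each edge relaxed a bounded number
    of times, which for a k-NN graph over n+m points matches the per-source
    cost discussed above up to logarithmic factors.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                              # stale entry, already settled
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist                                   # node -> shortest path length
```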

While ∇TSVM needs to store the entire kernel matrix (on both labeled and unlabeled points), for graph an n×(n+m) part is sufficient. Memory can be reduced to the part required for SVM training, but the (worst case) time required to compute individual shortest paths is as much as is required for computing all paths from a single source to all targets. For both SVM and ∇TSVM, in practice only parts of the kernel matrices have to be (computed and) stored, because of the sparsity of the solution. For training the manifold algorithm as given in Eq. (7), a sparse matrix needs to be stored and inverted. Due to the use of a k-NN graph, this matrix has only about k(n+m) entries.

2.3.2 Parameters

For each algorithm, the values of a number of parameters have to be fixed. In practical applications, this is usually done by cross-validation (CV). While this is no major problem for 2 parameters (like the SVMs have), it is impractical for the five parameters of the graph algorithm. To reduce this number, we fix three of them in advance, as shown in the table:

    algorithm   free parameters; [fixed parameters]
    SVM         σ, C
    TSVM        σ, C, C*
    manifold    σ, k, regularization
    ∇TSVM       σ, C, C*
    graph       ρ, C;      [σ, p, δ = 1]
    LDS         ρ, C, C*;  [σ, p, δ = 1]

Figure 4 demonstrates that for LDS the parameter fixing proposed above

leads to only a minor loss in accuracy. As shown in (a), a fully connected graph is good (for the optimum value of ρ). As shown in (b), δ = 1 (i.e. no further non-linear transformation) is good (again, for the optimum value of ρ). In general the resulting kernel will not be positive definite (except for ρ → 0 and ρ → ∞; see also Table 1). As shown in (c), the SVM seems to handle negative eigenvalues reasonably well. This can be seen on the right of (c): almost the same results were obtained with and without MDS. It seems safe to discard the eigenvectors corresponding to small (positive) eigenvalues (c.f. the left side of

plot (c)). In the rest of the experiments, we set δ = 1.

To determine good values of the remaining free parameters (e.g. by CV), it is important to search on the right scale. We therefore fix default values for σ and C that have the right order of magnitude. In a c-class problem, we use the 1/c quantile of the pairwise distances d(i, j) of all data points as the default σ₀ for σ. The default for C is the inverse of the empirical variance s² of the data in feature space, which can be calculated as the mean of the diagonal entries K_ii minus the mean of all entries K_ij of a kernel matrix K. Below, all values for these parameters will be given relative to the respective default values, making them comparable for different data sets.
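A small sketch of how these default values can be computed (the quantile convention and the use of the full kernel matrix are as described in the text above; the function names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

def default_sigma(X, n_classes):
    """Default kernel width sigma_0: the 1/c quantile of all pairwise distances."""
    return np.quantile(pdist(X), 1.0 / n_classes)

def default_C(K):
    """Default penalty: inverse empirical variance of the data in feature space,
    s^2 = mean of the diagonal of K minus the mean of all entries of K."""
    s2 = np.mean(np.diag(K)) - K.mean()
    return 1.0 / s2
```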

2.3.3 LDS algorithm

The final LDS algorithm is summarized in Algorithm 1. Note that slight changes are required for the extreme settings of ρ: for ρ → 0, steps 1 to 3 have to be replaced by simply running the shortest path algorithm on the original distances d(i, j) to compute D_ij; for ρ → ∞, a modified version of Dijkstra that keeps track of maximum distances instead of sums along paths must be used. A matlab implementation of LDS can be obtained at http://www.kyb.tuebingen.mpg.de/bs/people/chapelle/lds/

3 EXPERIMENTAL RESULTS

3.1 DATA SETS

In order to get a good picture of the effectiveness of the algorithms, we compare

their generalization performance on two artificial and three real-world data sets with different properties.

    data set   classes   dims   points   labeled
    g50c           2       50      550       50
    g10n           2       10      550       50
    Coil20        20     1024     1440       40
    Text           2     7511     1946       50
    Uspst         10      256     2007       50

The artificial data sets are inspired by [2]: the data are generated from standard normal multi-variate Gaussians. In g50c the labels correspond to the Gaussians, and the means are located in 50-dimensional space such that the Bayes error is 5%. In contrast, g10n is a deterministic problem in 10 dimensions, where the decision function traverses the centers of the Gaussians (thus violating the cluster assumption) and depends on only a few of the input dimensions.
[Figure 4: Influence of parameter choices on the test error of LDS on the Coil20 data: (a) the graph structure (10-NN, 100-NN, fully connected); (b) δ; (c) the approximation accuracy of the MDS. Plot (c) shows the squared difference between the test errors achieved with and without MDS, averaged over the other parameters.]

Algorithm 1: The LDS algorithm
Require: ρ

Compute ρ-path distances:
1: Build the fully connected graph with edge lengths A_ij = exp(ρ d(i, j)) − 1.
2: Use Dijkstra's algorithm [9] to compute the shortest path lengths D^SP_ij for all pairs of points.
3: Form the matrix of squared ρ-path distances, D_ij = ( (1/ρ) log(1 + D^SP_ij) )².

Perform multidimensional scaling:
4: Compute A = −½ P D P, where P_ij = δ_ij − 1/(n+m) is the centering matrix.
5: Find the number p of leading eigenvectors such that (6) holds.
6: The new representation of x_i is x̃_i = ( √λ_1 (v_1)_i, ..., √λ_p (v_p)_i ), with (λ_k, v_k) the eigenpairs of A in order of decreasing eigenvalue.

Train ∇TSVM:
7: for i = 0 to 10 do
8:    Set C* to the i-th value of an exponentially increasing schedule ending at its final value (c.f. Section 2.2).
9:    Minimize (3), via its differentiable approximation (4), by gradient descent under constraint (5).
10: end for
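Tying the previous sketches together, an end-to-end sketch of Algorithm 1 could look as follows (it reuses rho_path_kernel, mds_embedding, and grad_tsvm from the earlier sketches; the factor-of-ten annealing schedule, the warm start, and the labeled-points-first ordering of X are assumptions):

```python
import numpy as np

def lds(X, y_labeled, n_labeled, rho=1.0, C=1.0, Cstar_final=1.0):
    """End-to-end sketch of Algorithm 1 (labeled points assumed to come first in X).

    Steps 1-3: softened rho-path distances; steps 4-6: MDS embedding;
    steps 7-10: gradient TSVM with an exponentially increasing C*.
    """
    _, D = rho_path_kernel(X, rho=rho)           # steps 1-3 (distances only)
    Z = mds_embedding(D ** 2)                    # steps 4-6, new representation
    Zl, Zu = Z[:n_labeled], Z[n_labeled:]
    w, b = np.zeros(Z.shape[1]), 0.0
    for i in range(11):                          # steps 7-10: anneal C*
        Cstar = Cstar_final * 10.0 ** (i - 10)
        w, b = grad_tsvm(Zl, y_labeled, Zu, C=C, Cstar=Cstar, w0=w)
    return np.sign(Zu.dot(w) + b)                # predicted labels of the unlabeled points
```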

The real-world data sets consist of two-class and multi-class problems. In Coil20 the data are gray-scale images of 20 different objects taken from different angles, in steps of 5 degrees [15]. The Text dataset consists of the classes mac and mswindows of the Newsgroup20 dataset, preprocessed as in [18]. Finally, our Uspst set contains the test data part of the well-known USPS data on handwritten digit recognition.

3.2 EXPERIMENTS

For each of the data sets, 10 different splits into labeled and unlabeled points were randomly generated. We took care to include at least one point of each class in the labeled

set (two for Coil20). We used a different model selection strategy for LDS than for the other algorithms. For LDS we carry out 5-fold cross-validation (CV) on the training set for each split, thereby simulating the real-world application scenario. Note that all data (training and test) can be (and is) used as unlabeled data. The reported test errors are obtained after training the selected model on the entire training set. For the other algorithms, we are interested in the best possible performance, and simply select the parameter values minimizing the test error. In both cases, we select combinations of values

on nite grid as follo ws: parameter alues width exp onen enalt 10 10 10 10 degree 10 100 all regulariz. Although LDS and graph ork with an ernel, here x the linear ernel c.f. section 2.3.2). 3.3 RESUL TS The results are presen ted in able 2. Except for the data set g10n LDS alw ys ac hiev es lo er test errors with empirically found parameter settings than all the
Page 7
0 1 2 4 8 16 Inf 0.1 0.2 0.3 0.4 0.5 0.6 Cross validation error Coil20 g50c g10n Text Usps Figure 5: Cross-v alidation error (with standard devi- ation error bars) as function of the parameter other

algorithms are capable of ac hieving, ev en when optimal parameter settings are kno wn. This clearly demonstrates the sup eriorit of LDS Although TSVM alw ys erforms etter (and usually signican tly etter) than TSVM it still fails to reac the lev el of manifold on the Coil20 data set. But this shortcoming is eliminated making use of the graph transform of the distances. etter understand the role of the distance trans- form, depict the 5-fold cross alidation error for the est alue of eraged er the 10 splits, as function of in Figure 5. can distinguish three cases: the minim um is (i) at

or close to 0; (ii) at or close to ∞; or (iii) somewhere in between.

(i) Linear classifiers are optimal by construction for the artificial data, and likely to be optimal for Text due to the high dimensionality. For g50c and Text, ρ > 0 does not substantially help ∇TSVM, but does not hurt either. Only for g10n, where the cluster assumption does not hold, increasing ρ immediately increases the test error.

(ii) In Coil20 the points of each class lie equidistantly on a ring. With large ρ, all their pairwise distances are reduced to the distance of neighboring points. Note that there is no noise which could cause un-

wanted bridging between classes.

(iii) Perhaps the most interesting case is Uspst, with an optimum of ρ = 4. While there definitely are clusters corresponding to the classes, there seem to exist outliers that would, for too large a ρ, lead to an erroneous merging of clusters.

As the optimum value of ρ seems to correspond to features of the data set, prior knowledge on the data could possibly be used to narrow the range to be searched.

4 CONCLUSIONS

The TSVM objective function could, at first sight, be interpreted as a straightforward extension of the maximum margin principle of SVM to unlabeled

data. We conjecture that it actually implements two different principles: regularization by margin maximization on the labeled points, and the cluster assumption by margin maximization on the unlabeled points. The latter does not lead to smoother decision functions, but it enforces that the decision boundary lies in low density regions.

The strength of our gradient descent approach might be that it directly optimizes the objective according to the cluster assumption: to find a decision boundary that avoids high density regions. In contrast, TSVM (the SVMlight implementation) might suffer from the com-

binatorial nature of its approach. By deciding, from the very first step, on the putative label of every point (even though with low confidence), it may lose important degrees of freedom at an early stage, and get trapped in a bad local minimum.

The pairwise distances computed by the graph algorithm attempt to reflect the cluster assumption: distances of points from the same cluster are shrunk, while for points in different clusters they are dominated by the inter-cluster distance. Used with an SVM, this clearly improves over standard (Euclidean) distances, but not over other

semi-supervised methods. The combination of the graph distance computation with the ∇TSVM training yields a clearly superior semi-supervised algorithm. Apparently, the preprocessed distances make it less likely for the ∇TSVM to get stuck in very suboptimal local minima. Probably the preprocessing widens small density valleys so that they are more readily found by local searches.

Although manifold learning indirectly exploits the cluster assumption, as argued above, another feature may contribute to its successes. If the intrinsic dimensionality of the data manifolds is much smaller than that

of the input space, restricting the learning process to the manifolds can alleviate the "curse of dimensionality". We plan to investigate how much performance can be gained in this manner.

Future work will be on a thorough comparison of discriminative semi-supervised learning methods. We observe that the time (and to some degree, also space) complexities of all methods investigated here prohibit the application to really large sets of unlabeled data, say more than a few thousand points. Thus, work should also be devoted to improvements of the computational efficiency of the algorithms, ideally of LDS.
               methods from the literature        proposed methods
    data set   SVM       manifold   TSVM      graph     ∇TSVM     LDS
    Coil20     24.64%     6.20%    26.26%     6.43%    17.56%    4.86%
    g50c        8.32%    17.30%     6.87%     8.32%     5.80%    5.62%
    g10n        9.36%    30.64%    14.36%     9.36%     9.82%    9.72%
    Text       18.87%    11.71%     7.44%    10.48%     5.71%    5.13%
    Uspst      23.18%    21.30%    26.46%    16.92%    17.61%   15.79%

Table 2: Mean test error rates. Note that model selection was done by cross-validation for LDS, whereas by minimizing the test error for the other methods. Bold numbers are statistically significantly (95% confidence) better compared to all other methods.

Acknowledgements

We thank Bernhard Schölkopf and Matthias Hein for valuable comments.

References

[1] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. In COLT, 2004.
[2] Y. Bengio and Y. Grandvalet. Semi-supervised learning by entropy minimization. In NIPS, volume 17, 2004.
[3] K. Bennett and A. Demiriz. Semi-supervised support vector machines. In NIPS, volume 12, 1998.
[4] O. Bousquet, O. Chapelle, and M. Hein. Measure based regularization. In NIPS, 2004.
[5] O. Chapelle. Support Vector Machines: Induction Principle, Adaptive Tuning and Prior Knowledge. PhD thesis, LIP6, 2003.

[6] O. Chapelle, J. Weston, L. Bottou, and V. Vapnik. Vicinal risk minimization. In NIPS, volume 13, 2000.
[7] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In NIPS, volume 15, 2002.
[8] T. F. Cox and M. A. Cox. Multidimensional Scaling. Chapman & Hall, 1994.
[9] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269-271, 1959.
[10] B. Fischer, V. Roth, and J. M. Buhmann. Clustering with the connectivity kernel. In NIPS, volume 16, 2004.
[11] B. Haasdonk. Feature space interpretation of SVMs with indefinite kernels. IEEE TPAMI, 2004. In press.

indenite ernels. IEEE TP AMI 2004. In press. [12] T. Joac hims. ransductiv inference for text classication using supp ort ector mac hines. In ICML pages 200{209, 1999. [13] T. Joac hims. ransductiv learning via sp ectral graph partitioning. In ICML 2003. [14] R. I. Kondor and J. Laert Diusion ernels on graphs and other discrete structures. In ICML 2002. [15] S. A. Nene, S. K. Na ar, and H. Murase. Colum bia ob ject image library (coil-20). ec hni- cal Rep ort CUCS-005-96, Colum bia Univ., USA, ebruary 1996. [16] C. S. Ong, X. Mary S. Can u, and A. J. Smola. Learning with non-p

[17] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[18] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In NIPS, volume 14, 2001.
[19] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.
[20] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[21] P. Vincent and Y. Bengio. Density-sensitive metrics and kernels. Presented at the Snowbird Learning Workshop, 2003.

[22] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, volume 16, 2003.
[23] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.