Journal of Machine Learning Research 7 (2006) 2399-2434. Submitted 4/05; Revised 5/06; Published 11/06.

Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

Mikhail Belkin (MBELKIN@CSE.OHIO-STATE.EDU)
Department of Computer Science and Engineering, The Ohio State University, 2015 Neil Avenue, Dreese Labs 597, Columbus, OH 43210, USA

Partha Niyogi (NIYOGI@CS.UCHICAGO.EDU)
Departments of Computer Science and Statistics, University of Chicago, 1100 E. 58th Street, Chicago, IL 60637, USA

Vikas Sindhwani (VIKASS@CS.UCHICAGO.EDU)
Department of Computer Science, University of Chicago, 1100 E. 58th Street, Chicago, IL 60637, USA

Editor: Peter Bartlett

Abstract

We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework.

Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines

1. Introduction

In this paper, we introduce a framework for data-dependent regularization that exploits the geometry of the probability distribution. While this framework allows us to approach the full range of learning problems from unsupervised to supervised (discussed in Sections 6.1 and 6.2 respectively), we focus on the problem of semi-supervised learning.

The problem of learning from labeled and unlabeled data (semi-supervised and transductive learning) has attracted considerable attention in recent years.
Some recently proposed methods include transductive SVM (Vapnik, 1998; Joachims, 1999), cotraining (Blum and Mitchell, 1998), and a variety of graph-based methods (Blum and Chawla, 2001; Chapelle et al., 2003; Szummer and Jaakkola, 2002; Kondor and Lafferty, 2002; Smola and Kondor, 2003; Zhou et al., 2004; Zhu et al., 2003, 2005; Kemp et al., 2004; Joachims, 1999; Belkin and Niyogi, 2003b). We also note the regularization based techniques of Corduneanu and Jaakkola (2003) and Bousquet et al. (2004). The latter reference is closest in spirit to the intuitions of our paper. We postpone the discussion of related algorithms and various connections until Section 4.5.

The idea of regularization has a rich mathematical history going back to Tikhonov (1963), where it is used for solving ill-posed inverse problems. Regularization is a key idea in the theory of splines (e.g., Wahba, 1990) and is widely used in machine learning (e.g., Evgeniou et al., 2000). Many machine learning algorithms, including support vector machines, can be interpreted as instances of regularization.

Our framework exploits the geometry of the probability distribution that generates the data and incorporates it as an additional regularization term. Hence, there are two regularization terms: one controlling the complexity of the classifier in the ambient space and the other controlling the complexity as measured by the geometry of the distribution. We consider in some detail the special case where this probability distribution is supported on a submanifold of the ambient space.

The points below highlight several aspects of the current paper:

1. Our general framework brings together three distinct concepts that have received some independent recent attention in machine learning:
i. The first of these is the technology of spectral graph theory (see, e.g., Chung, 1997) that has been applied to a wide range of clustering and classification tasks over the last two decades. Such methods typically reduce to certain eigenvalue problems.
ii. The second is the geometric point of view embodied in a class of algorithms that can be termed as manifold learning. (Footnote 1: See http://www.cse.msu.edu/lawhiu/manifold/ for a long list of references.) These methods attempt to use the geometry of the probability distribution by assuming that its support has the geometric structure of a Riemannian manifold.
iii. The third important conceptual framework is the set of ideas surrounding regularization in Reproducing Kernel Hilbert Spaces (RKHS). This leads to the class of kernel based algorithms for classification and regression (e.g., Scholkopf and Smola, 2002; Wahba, 1990; Evgeniou et al., 2000).
We show how these ideas can be brought together in a coherent and natural way to incorporate geometric structure in a kernel based regularization framework. As far as we know, these ideas have not been unified in a similar fashion before.

2. This general framework allows us to develop algorithms spanning the range from unsupervised to fully supervised learning. In this paper we primarily focus on the semi-supervised setting and present two families of algorithms: the Laplacian Regularized Least Squares (hereafter LapRLS) and the Laplacian Support Vector Machines (hereafter LapSVM). These are natural extensions of RLS and SVM respectively. In addition, several recently proposed transductive methods (e.g., Zhu et al., 2003; Belkin and Niyogi, 2003b) are also seen to be special cases of this general approach.
In the absence of labeled examples our framework results in new algorithms for unsupervised learning, which can be used both for data representation and clustering. These algorithms are related to spectral clustering and Laplacian Eigenmaps (Belkin and Niyogi, 2003a).

3. We elaborate on the RKHS foundations of our algorithms and show how geometric knowledge of the probability distribution may be incorporated in such a setting through an additional regularization penalty. In particular, a new Representer theorem provides a functional form of the solution when the distribution is known; its empirical version involves an expansion over labeled and unlabeled points when the distribution is unknown. These Representer theorems provide the basis for our algorithms.

4. Our framework with an ambiently defined RKHS and the associated Representer theorems results in a natural out-of-sample extension from the data set (labeled and unlabeled) to novel examples. This is in contrast to the variety of purely graph-based approaches that have been considered in the last few years. Such graph-based approaches work in a transductive setting and do not naturally extend to the semi-supervised case where novel test examples need to be classified (predicted). Also see Bengio et al. (2004) and Brand (2003) for some recent related work on out-of-sample extensions. We also note that a method similar to our regularized spectral clustering algorithm has been independently proposed in the context of graph inference in Vert and Yamanishi (2005).

The work presented here is based on the University of Chicago Technical Report TR-2004-05, a short version in the Proceedings of AI and Statistics 2005, Belkin et al. (2005), and Sindhwani (2004).

1.1 The Significance of Semi-Supervised Learning

From an engineering standpoint, it is clear that collecting labeled data is generally more involved than collecting unlabeled data. As a result, an approach to pattern recognition that is able to make better use of unlabeled data to improve recognition performance is of potentially great practical significance.

However, the significance of semi-supervised learning extends beyond purely utilitarian considerations. Arguably, most natural (human or animal) learning occurs in the semi-supervised regime. We live in a world where we are constantly exposed to a stream of natural stimuli. These stimuli comprise the unlabeled data that we have easy access to. For example, in phonological acquisition contexts, a child is exposed to many acoustic utterances. These utterances do not come with identifiable phonological markers. Corrective feedback is the main source of directly labeled examples. In many cases, a small amount of feedback is sufficient to allow the child to master the acoustic-to-phonetic mapping of any language.

The ability of humans to learn unsupervised concepts (e.g., learning clusters and categories of objects) suggests that unlabeled data can be usefully processed to learn natural invariances, to form categories, and to develop classifiers. In most pattern recognition tasks, humans have access only to a small number of labeled examples. Therefore the success of human learning in this "small sample" regime is plausibly due to effective utilization of the large amounts of unlabeled data to extract information that is useful for generalization.

Consequently, if we are to make progress in understanding how natural learning comes about, we need to think about the basis of semi-supervised learning.
Figure 1 illustrates how unlabeled examples may force us to restructure our hypotheses during learning. Imagine a situation where one is given two labeled examples, one positive and one negative, as shown in the left panel. If one is to induce a classifier on the basis of this, a natural choice would seem to be the linear separator as shown. Indeed, a variety of theoretical formalisms (Bayesian paradigms, regularization, minimum description length or structural risk minimization principles, and the like) have been constructed to rationalize such a choice. In most of these formalisms, one structures the set of one's hypothesis functions by a prior notion of simplicity and one may then justify why the linear separator is the simplest structure consistent with the data.

Figure 1: Unlabeled data and prior beliefs.

Now consider the situation where one is given additional unlabeled examples as shown in the right panel. We argue that it is self-evident that in the light of this new unlabeled set, one must re-evaluate one's prior notion of simplicity. The particular geometric structure of the marginal distribution suggests that the most natural classifier is now the circular one indicated in the right panel. Thus the geometry of the marginal distribution must be incorporated in our regularization principle to impose structure on the space of functions in nonparametric classification or regression. This is the intuition we formalize in the rest of the paper. The success of our approach depends on whether we can extract structure from the marginal distribution, and on the extent to which such structure may reveal the underlying truth.

1.2 Outline of the Paper

The paper is organized as follows: in Section 2, we develop the basic framework for semi-supervised learning where we ultimately formulate an objective function that can use both labeled and unlabeled data. The framework is developed in an RKHS setting and we state two kinds of Representer theorems describing the functional form of the solutions. In Section 3, we elaborate on the theoretical underpinnings of this framework and prove the Representer theorems of Section 2. While the Representer theorem for the finite sample case can be proved using standard orthogonality arguments, the Representer theorem for the known marginal distribution requires more subtle considerations. In Section 4, we derive the different algorithms for semi-supervised learning that arise out of our framework. Connections to related algorithms are stated. In Section 5, we describe experiments that evaluate the algorithms and demonstrate the usefulness of unlabeled data. In Section 6, we consider the cases of fully supervised and unsupervised learning. In Section 7 we conclude this paper.
2. The Semi-Supervised Learning Framework

Recall the standard framework of learning from examples. There is a probability distribution P on X × R according to which examples are generated for function learning. Labeled examples are (x, y) pairs generated according to P. Unlabeled examples are simply x ∈ X drawn according to the marginal distribution P_X of P.

One might hope that knowledge of the marginal P_X can be exploited for better function learning (e.g., in classification or regression tasks). Of course, if there is no identifiable relation between P_X and the conditional P(y|x), the knowledge of P_X is unlikely to be of much use. Therefore, we will make a specific assumption about the connection between the marginal and the conditional distributions. We will assume that if two points x_1, x_2 ∈ X are close in the intrinsic geometry of P_X, then the conditional distributions P(y|x_1) and P(y|x_2) are similar. In other words, the conditional probability distribution P(y|x) varies smoothly along the geodesics in the intrinsic geometry of P_X.

We use these geometric intuitions to extend an established framework for function learning. A number of popular algorithms such as SVM, Ridge regression, splines, and Radial Basis Functions may be broadly interpreted as regularization algorithms with different empirical cost functions and complexity measures in an appropriately chosen Reproducing Kernel Hilbert Space (RKHS).

For a Mercer kernel K : X × X → R, there is an associated RKHS H_K of functions X → R with the corresponding norm ||·||_K. Given a set of labeled examples (x_i, y_i), i = 1, ..., l, the standard framework estimates an unknown function by minimizing

\[ f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma \|f\|_K^2, \tag{1} \]

where V is some loss function, such as squared loss (y_i − f(x_i))^2 for RLS or the hinge loss function max[0, 1 − y_i f(x_i)] for SVM. Penalizing the RKHS norm imposes smoothness conditions on possible solutions. The classical Representer Theorem states that the solution to this minimization problem exists in H_K and can be written as

\[ f^*(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x). \]

Therefore, the problem is reduced to optimizing over the finite dimensional space of coefficients α_i, which is the algorithmic basis for SVM, regularized least squares and other regression and classification schemes.

We first consider the case when the marginal distribution is already known.
2.1 Marginal P_X is Known

Our goal is to extend this framework by incorporating additional information about the geometric structure of the marginal P_X. We would like to ensure that the solution is smooth with respect to both the ambient space and the marginal distribution P_X. To achieve that, we introduce an additional regularizer:

\[ f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2, \tag{2} \]

where ||f||_I^2 is an appropriate penalty term that should reflect the intrinsic structure of P_X. Intuitively, ||f||_I^2 is a smoothness penalty corresponding to the probability distribution. For example, if the probability distribution is supported on a low-dimensional manifold, ||f||_I^2 may penalize f along that manifold. γ_A controls the complexity of the function in the ambient space while γ_I controls the complexity of the function in the intrinsic geometry of P_X. It turns out that one can derive an explicit functional form for the solution f* as shown in the following theorem.

Theorem 1. Assume that the penalty term ||f||_I is sufficiently smooth with respect to the RKHS norm ||f||_K (see Section 3.2 for the exact statement). Then the solution f* to the optimization problem in Equation 2 above exists and admits the following representation

\[ f^*(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x) + \int_{\mathcal{M}} \alpha(z) K(x, z)\, dP_X(z), \tag{3} \]

where M = supp{P_X} is the support of the marginal P_X.

We postpone the proof and the formulation of the smoothness conditions on the norm ||·||_I until the next section. The Representer Theorem above allows us to express the solution f* directly in terms of the labeled data, the (ambient) kernel K, and the marginal P_X. If P_X is unknown, we see that the solution may be expressed in terms of an empirical estimate of P_X. Depending on the nature of this estimate, different approximations to the solution may be developed. In the next section, we consider a particular approximation scheme that leads to a simple algorithmic framework for learning from labeled and unlabeled data.

2.2 Marginal P_X Unknown

In most applications the marginal P_X is not known. Therefore we must attempt to get empirical estimates of P_X and ||·||_I. Note that in order to get such empirical estimates it is sufficient to have unlabeled examples.

A case of particular recent interest (for example, see Roweis and Saul, 2000; Tenenbaum et al., 2000; Belkin and Niyogi, 2003a; Donoho and Grimes, 2003; Coifman et al., 2005, for a discussion on dimensionality reduction) is when the support of P_X is a compact submanifold M ⊂ R^n. In that case, one natural choice for ||f||_I is ∫_{x∈M} ||∇_M f||^2 dP_X(x), where ∇_M is the gradient (see, for example, Do Carmo, 1992, for an introduction to differential geometry) of f along the manifold M and the integral is taken over the marginal distribution. The optimization problem becomes

\[ f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \int_{x \in \mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2\, dP_X(x). \]

The term ∫_{x∈M} ||∇_M f||^2 dP_X(x) may be approximated on the basis of labeled and unlabeled data using the graph Laplacian associated to the data.
While an extended discussion of these issues goes beyond the scope of this paper, it can be shown that under certain conditions choosing exponential weights for the adjacency graph leads to convergence of the graph Laplacian to the Laplace-Beltrami operator Δ_M (or its weighted version) on the manifold. See the Remarks below and Belkin (2003); Lafon (2004); Belkin and Niyogi (2005); Coifman et al. (2005); Hein et al. (2005) for details.

Thus, given a set of l labeled examples {(x_i, y_i)}_{i=1}^{l} and a set of u unlabeled examples {x_j}_{j=l+1}^{l+u}, we consider the following optimization problem:

\[ f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(u+l)^2} \sum_{i,j=1}^{l+u} \big(f(x_i) - f(x_j)\big)^2 W_{ij} = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(u+l)^2} \mathbf{f}^T L \mathbf{f}, \tag{4} \]

where W_ij are edge weights in the data adjacency graph, f = [f(x_1), ..., f(x_{l+u})]^T, and L is the graph Laplacian given by L = D − W. Here, the diagonal matrix D is given by D_ii = Σ_{j=1}^{l+u} W_ij. The normalizing coefficient 1/(u+l)^2 is the natural scale factor for the empirical estimate of the Laplace operator. We note that on a sparse adjacency graph it may be replaced by Σ_{i,j=1}^{l+u} W_ij.

The following version of the Representer Theorem shows that the minimizer has an expansion in terms of both labeled and unlabeled examples and is a key to our algorithms.

Theorem 2. The minimizer of optimization problem 4 admits an expansion

\[ f^*(x) = \sum_{i=1}^{l+u} \alpha_i K(x_i, x) \tag{5} \]

in terms of the labeled and unlabeled examples.

The proof is a variation of the standard orthogonality argument and is presented in Section 3.4.
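To make the empirical estimate in Equation 4 concrete, here is a minimal sketch in Python (numpy and scipy assumed) of constructing the data adjacency graph with heat kernel weights W_ij = exp(−||x_i − x_j||^2 / 4t) and the graph Laplacian L = D − W. The helper names are our own illustration, not part of the paper.

    import numpy as np
    from scipy.spatial.distance import cdist

    def graph_laplacian(X, n_neighbors=6, t=1.0):
        # Heat kernel weights W_ij = exp(-||x_i - x_j||^2 / (4t)) on a
        # symmetrized k-nearest-neighbor adjacency graph.
        n = X.shape[0]
        sq_dists = cdist(X, X, 'sqeuclidean')
        W = np.exp(-sq_dists / (4.0 * t))
        # Keep only each point's k nearest neighbors (column 0 is the point itself).
        nn = np.argsort(sq_dists, axis=1)[:, 1:n_neighbors + 1]
        mask = np.zeros((n, n), dtype=bool)
        mask[np.repeat(np.arange(n), n_neighbors), nn.ravel()] = True
        W = np.where(mask | mask.T, W, 0.0)   # symmetrize the neighborhood relation
        D = np.diag(W.sum(axis=1))            # degree matrix D_ii = sum_j W_ij
        return D - W, W                       # graph Laplacian L = D - W

The intrinsic penalty of Equation 4 is then the quadratic form f^T L f over the vector of function values at the labeled and unlabeled points, scaled by γ_I/(u+l)^2.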
Remark 1: Several natural choices of ||·||_I exist. Some examples are:

1. Iterated Laplacians (Δ_M)^k. Differential operators (Δ_M)^k and their linear combinations provide a natural family of smoothness penalties. Recall that the Laplace-Beltrami operator Δ_M can be defined as the divergence of the gradient vector field, Δ_M f = div(∇_M f), and is characterized by the equality

\[ \int_{x \in \mathcal{M}} f(x)\, \Delta_{\mathcal{M}} f(x)\, d\mu = \int_{x \in \mathcal{M}} \|\nabla_{\mathcal{M}} f(x)\|^2\, d\mu, \]

where μ is the standard measure (uniform distribution) on the Riemannian manifold. If μ is taken to be non-uniform, then the corresponding notion is the weighted Laplace-Beltrami operator (e.g., Grigor'yan, 2006).

2. Heat semigroup e^{−tΔ_M} is a family of smoothing operators corresponding to the process of diffusion (Brownian motion) on the manifold. One can take ||f||_I^2 = ∫_M f e^{−tΔ_M}(f) dP_X. We note that for small values of t the corresponding Green's function (the heat kernel of M), which is close to a Gaussian in the geodesic coordinates, can also be approximated by a sharp Gaussian in the ambient space.

3. Squared norm of the Hessian (cf. Donoho and Grimes, 2003). While the Hessian H(f) (the matrix of second derivatives of f) generally depends on the coordinate system, it can be shown that the Frobenius norm (the sum of squared eigenvalues) of H is the same in any geodesic coordinate system and hence is invariantly defined for a Riemannian manifold M. Using the Frobenius norm of H as a regularizer presents an intriguing generalization of thin-plate splines. We also note that Δ_M(f) = tr(H(f)).

Remark 2: Why not just use the intrinsic regularizer? Using ambient and intrinsic regularizers jointly is important for the following reasons:

1. We do not usually have access to M or the true underlying marginal distribution, just to data points sampled from it. Therefore regularization with respect only to the sampled manifold is ill-posed. By including an ambient term, the problem becomes well-posed.

2. There may be situations when regularization with respect to the ambient space yields a better solution, for example, when the manifold assumption does not hold (or holds to a lesser degree). Being able to trade off these two regularizers may be important in practice.

Remark 3: While we use the graph Laplacian for simplicity, the normalized Laplacian L̃ = D^{−1/2} L D^{−1/2} can be used interchangeably in all our formulas. Using L̃ instead of L provides certain theoretical guarantees (see von Luxburg et al., 2004) and seems to perform as well or better in many practical tasks. In fact, we use L̃ in all our empirical studies in Section 5. The relation of L̃ to the weighted Laplace-Beltrami operator was discussed in Lafon (2004).

Remark 4: Note that a global kernel K restricted to M (denoted by K_M) is also a kernel defined on M with an associated RKHS H_M of functions M → R. While this might suggest ||f||_I = ||f_M||_{K_M} (f_M is f restricted to M) as a reasonable choice for ||f||_I, it turns out that for the minimizer f* of the corresponding optimization problem we get ||f*||_I = ||f*||_K, yielding the same solution as standard regularization, although with a different parameter γ. This observation follows from the restriction properties of RKHS discussed in the next section and is formally stated as Proposition 6. Therefore it is impossible to have an out-of-sample extension without two different measures of smoothness. On the other hand, a different ambient kernel restricted to M can potentially serve as the intrinsic regularization term. For example, a sharp Gaussian kernel can be used as an approximation to the heat kernel on M. Thus one (sharper) kernel may be used in conjunction with unlabeled data to estimate the heat kernel on M and a wider kernel for inference.

3. Theoretical Underpinnings and Results

In this section we briefly review the theory of reproducing kernel Hilbert spaces and their connection to integral operators. We proceed to establish the Representer theorems from the previous section.
3.1 General Theory of RKHS

We start by recalling some basic properties of reproducing kernel Hilbert spaces (see the original work of Aronszajn, 1950; Cucker and Smale, 2002, for a nice discussion in the context of learning theory) and their connections to integral operators. We say that a Hilbert space H of functions X → R has the reproducing property if for all x ∈ X the evaluation functional f → f(x) is continuous. For the purposes of this discussion we will assume that X is compact. By the Riesz representation theorem it follows that for a given x ∈ X, there is a function h_x ∈ H such that for all f ∈ H, ⟨h_x, f⟩_H = f(x). We can therefore define the corresponding kernel function

\[ K(x, y) = \langle h_x, h_y \rangle_{\mathcal{H}}. \]

It follows that h_x(y) = ⟨h_x, h_y⟩_H = K(x, y) and thus ⟨K(x, ·), f⟩ = f(x). It is clear that K(x, ·) ∈ H.

It is easy to see that K(x, y) is a positive semi-definite kernel as defined below:

Definition: We say that K(x, y), satisfying K(x, y) = K(y, x), is a positive semi-definite kernel if given an arbitrary finite set of points x_1, ..., x_n, the corresponding n × n matrix K with K_ij = K(x_i, x_j) is positive semi-definite.

Importantly, the converse is also true. Any positive semi-definite kernel K(x, y) gives rise to an RKHS H_K, which can be constructed by considering the space of finite linear combinations of kernels Σ α_i K(x_i, ·) and taking completion with respect to the inner product given by ⟨K(x, ·), K(y, ·)⟩_{H_K} = K(x, y). See Aronszajn (1950) for details. We therefore see that reproducing kernel Hilbert spaces of functions on a space X are in one-to-one correspondence with positive semi-definite kernels on X.

It can be shown that if the space H_K is sufficiently rich, that is, if for any distinct points x_1, ..., x_n there is a function f such that f(x_1) = 1 and f(x_i) = 0 for i > 1, then the corresponding matrix K_ij = K(x_i, x_j) is strictly positive definite. For simplicity we will sometimes assume that our RKHS are rich (the corresponding kernels are sometimes called universal).

Notation: In what follows, we will use the kernel K to denote inner products and norms in the corresponding Hilbert space H_K, that is, we will write ⟨·,·⟩_K, ||·||_K, instead of the more cumbersome ⟨·,·⟩_{H_K}, ||·||_{H_K}.
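As a small numerical illustration of the definition above (a sketch assuming numpy, not code from the paper), one can check that the Gram matrix of the Gaussian kernel on an arbitrary finite point set has no negative eigenvalues:

    import numpy as np

    def gaussian_gram(X, sigma=1.0):
        # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    rng = np.random.default_rng(0)
    K = gaussian_gram(rng.normal(size=(50, 3)))
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True: K is positive semi-definite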
We proceed to endow X with a measure μ (supported on all of X). The corresponding L²_μ Hilbert space inner product is given by

\[ \langle f, g \rangle_\mu = \int_X f(x) g(x)\, d\mu. \]

We can now consider the integral operator L_K corresponding to the kernel K:

\[ (L_K f)(x) = \int_X f(y) K(x, y)\, d\mu. \]

It is well known that if X is a compact space, L_K is a compact operator and is self-adjoint with respect to L²_μ. By the spectral theorem, its eigenfunctions e_1(x), e_2(x), ... (scaled to norm 1) form an orthonormal basis of L²_μ. The spectrum of the operator is discrete and the corresponding eigenvalues λ_1, λ_2, ... are of finite multiplicity, with lim_{i→∞} λ_i = 0. We see that ⟨K(x, ·), e_i(·)⟩_μ = λ_i e_i(x), and therefore K(x, y) = Σ_i λ_i e_i(x) e_i(y). Writing a function f in that basis, we have f = Σ_i a_i e_i(x) and ⟨K(x, ·), f(·)⟩_μ = Σ_i λ_i a_i e_i(x).

It is not hard to show that the eigenfunctions e_i are in H_K (e.g., see the argument below). Thus we see that

\[ e_j(x) = \langle K(x, \cdot), e_j(\cdot) \rangle_K = \sum_i \lambda_i e_i(x) \langle e_i, e_j \rangle_K. \]

Therefore ⟨e_i, e_j⟩_K = 0 if i ≠ j, and ⟨e_i, e_i⟩_K = 1/λ_i. On the other hand, ⟨e_i, e_j⟩_μ = 0 if i ≠ j, and ⟨e_i, e_i⟩_μ = 1. This observation establishes a simple relationship between the Hilbert norms in H_K and L²_μ. We also see that f = Σ_i a_i e_i(x) ∈ H_K if and only if Σ_i a_i²/λ_i < ∞.

Consider now the operator L_K^{1/2}. It can be defined as the only positive definite self-adjoint operator such that L_K = L_K^{1/2} ∘ L_K^{1/2}. Assuming that the series K̃(x, y) = Σ_i √λ_i e_i(x) e_i(y) converges, we can write

\[ (L_K^{1/2} f)(x) = \int_X f(y)\, \tilde{K}(x, y)\, d\mu. \]

It is easy to check that L_K^{1/2} is an isomorphism between H_K and L²_μ, that is, for all f, g ∈ H_K,

\[ \langle f, g \rangle_\mu = \langle L_K^{1/2} f, L_K^{1/2} g \rangle_K. \]

Therefore H_K is the image of L_K^{1/2} acting on L²_μ.

Lemma 3. A function f(x) = Σ_i a_i e_i(x) can be represented as f = L_K g for some g if and only if

\[ \sum_{i=1}^{\infty} \frac{a_i^2}{\lambda_i^2} < \infty. \tag{6} \]

Proof. Suppose f = L_K g. Write g(x) = Σ_i b_i e_i(x). We know that g ∈ L²_μ if and only if Σ_i b_i² < ∞. Since L_K(Σ_i b_i e_i) = Σ_i b_i λ_i e_i = Σ_i a_i e_i, we obtain a_i = b_i λ_i. Therefore Σ_{i=1}^∞ a_i²/λ_i² < ∞. Conversely, if the condition in Inequality 6 is satisfied, then f = L_K g, where g = Σ_i (a_i/λ_i) e_i.

3.2 Proof of Theorems

Now let us recall Equation 2:

\[ f^* = \arg\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2. \]

We have an RKHS H_K and the probability distribution μ which is supported on M ⊂ X. We denote by S the linear space, which is the closure with respect to the RKHS norm of H_K, of the linear span of kernels centered at points of M:

\[ S = \overline{\mathrm{span}\{ K(x, \cdot) \mid x \in \mathcal{M} \}}. \]

Notation. By the subscript M we will denote the restriction to M. For example, by S_M we denote functions in S restricted to the manifold M. It can be shown (Aronszajn, 1950, p. 350) that the space (H_K)_M of functions from H_K restricted to M is an RKHS with the kernel K_M, in other words (H_K)_M = H_{K_M}.

Lemma 4. The following properties of S hold:
1. S with the inner product induced by H_K is a Hilbert space.
2. S_M = (H_K)_M.
3. The orthogonal complement S⊥ to S in H_K consists of all functions vanishing on M.

Proof. 1. From the definition of S it is clear that S is a complete subspace of H_K.
2. We give a convergence argument similar to the one found in Aronszajn (1950). Since (H_K)_M = H_{K_M}, any function f_M in it can be written as f_M = lim_{n→∞} f_{M,n}, where f_{M,n} = Σ_i a_{in} K_M(x_{in}, ·) is a sum of kernel functions. Consider the corresponding sum f_n = Σ_i a_{in} K(x_{in}, ·). From the definition of the norm we see that ||f_n − f_k||_K = ||f_{M,n} − f_{M,k}||_{K_M} and therefore f_n is a Cauchy sequence. Thus f = lim_{n→∞} f_n exists and its restriction to M must equal f_M. This shows that (H_K)_M ⊂ S_M. The other direction follows by a similar argument.
3. Let g ∈ S⊥. By the reproducing property, for any x ∈ M, g(x) = ⟨K(x, ·), g(·)⟩_K = 0, and therefore any function in S⊥ vanishes on M. On the other hand, if g vanishes on M it is perpendicular to each K(x, ·), x ∈ M, and is therefore perpendicular to the closure of their span S.

Lemma 5. Assume that the intrinsic norm is such that for any f, g ∈ H_K, (f − g)|_M ≡ 0 implies that ||f||_I = ||g||_I. Then assuming that the solution f* of the optimization problem in Equation 2 exists, f* ∈ S.

Proof. Any f ∈ H_K can be written as f = f_S + f_S⊥, where f_S is the projection of f to S and f_S⊥ is its orthogonal complement. For any x ∈ M we have K(x, ·) ∈ S. By the previous Lemma, f_S⊥ vanishes on M. We have f(x_i) = f_S(x_i) for all i, and by assumption ||f_S||_I = ||f||_I. On the other hand, ||f||_K² = ||f_S||_K² + ||f_S⊥||_K², and therefore ||f||_K ≥ ||f_S||_K. It follows that the minimizer f* is in S.

As a direct corollary of these considerations, we obtain the following

Proposition 6. If ||f||_I = ||f_M||_{K_M} then the minimizer of Equation 2 is identical to that of the usual regularization problem (Equation 1), although with a different regularization parameter (γ_A + γ_I).

We can now restrict our attention to the study of S. While it is clear that the right-hand side of Equation 3 lies in S, not every element in S can be written in that form. For example, K(x, ·), where x is not one of the data points x_i, cannot generally be written as

\[ \sum_{i=1}^{l} \alpha_i K(x_i, x) + \int_{\mathcal{M}} \alpha(y) K(x, y)\, d\mu. \]
We will now assume that for f ∈ S,

\[ \|f\|_I^2 = \langle f, Df \rangle_{L^2_\mu}. \]

We usually assume that D is an appropriate smoothness penalty, such as an inverse integral operator or a differential operator, for example, Df = Δ_M f. The Representer theorem, however, holds under quite mild conditions on D:

Theorem 7. Let ||f||_I² = ⟨f, Df⟩_{L²_μ}, where D is a bounded operator D : S → L²_{P_X}. Then the solution f* of the optimization problem in Equation 2 exists and can be written as

\[ f^*(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x) + \int_{\mathcal{M}} \alpha(y) K(x, y)\, dP_X(y). \tag{7} \]

Proof. For simplicity we will assume that the loss function V is differentiable. This condition can ultimately be eliminated by approximating a non-differentiable function appropriately and passing to the limit. Put

\[ H(f) = \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f(x_i)) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2. \]

We first show that the solution to Equation 2, f*, exists and by Lemma 5 belongs to S. It follows easily from Corollary 10 and standard results about compact embeddings of Sobolev spaces (e.g., Adams, 1975) that a ball B_r ⊂ H_K, B_r = {f ∈ S s.t. ||f||_K ≤ r}, is compact in L^∞_X. Therefore for any such ball the minimizer in that ball, f*_r, must exist and belong to B_r. On the other hand, by substituting the zero function,

\[ H(f^*_r) \le H(0) = \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, 0). \]

If the loss is actually zero, then the zero function is a solution; otherwise

\[ \gamma_A \|f^*_r\|_K^2 \le \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, 0), \]

and hence f*_r ∈ B_r, where

\[ r = \sqrt{\frac{\sum_{i=1}^{l} V(x_i, y_i, 0)}{\gamma_A\, l}}. \]

Therefore we cannot decrease H(f*) by increasing r beyond a certain point, which shows that f* = f*_r with r as above, which completes the proof of existence. If V is convex, such a solution will also be unique.

We proceed to derive Equation 7. As before, let e_1, e_2, ... be the basis associated to the integral operator (L_K f)(x) = ∫_M f(y) K(x, y) dP_X(y). Write f = Σ_i a_i e_i(x). By substituting f into H(f) we obtain

\[ H(f) = \frac{1}{l} \sum_{j=1}^{l} V\Big(x_j, y_j, \sum_i a_i e_i(x_j)\Big) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2. \]

Assume that V is differentiable with respect to each a_k. We have ||Σ_i a_i e_i(x)||_K² = Σ_i a_i²/λ_i. Differentiating with respect to the coefficients a_i yields the following set of equations:

\[ 0 = \frac{\partial H(f)}{\partial a_k} = \frac{1}{l} \sum_{j=1}^{l} e_k(x_j)\, \partial_3 V\Big(x_j, y_j, \sum_i a_i e_i\Big) + \frac{2 \gamma_A a_k}{\lambda_k} + \gamma_I \langle Df, e_k \rangle + \gamma_I \langle f, D e_k \rangle, \]

where ∂_3 V denotes the derivative with respect to the third argument of V. Since ⟨Df, e_k⟩ + ⟨f, De_k⟩ = ⟨(D + D*) f, e_k⟩, we have

\[ a_k = -\frac{\lambda_k}{2 \gamma_A l} \sum_{j=1}^{l} e_k(x_j)\, \partial_3 V(x_j, y_j, f) - \frac{\gamma_I}{2 \gamma_A} \lambda_k \langle (D + D^*) f, e_k \rangle. \]

Since f(x) = Σ_k a_k e_k(x), and recalling that K(x, y) = Σ_i λ_i e_i(x) e_i(y),

\[ f(x) = -\frac{1}{2 \gamma_A l} \sum_k \sum_{j=1}^{l} \lambda_k e_k(x) e_k(x_j)\, \partial_3 V(x_j, y_j, f) - \frac{\gamma_I}{2 \gamma_A} \sum_k \lambda_k \langle (D + D^*) f, e_k \rangle e_k = -\frac{1}{2 \gamma_A l} \sum_{j=1}^{l} K(x, x_j)\, \partial_3 V(x_j, y_j, f) - \frac{\gamma_I}{2 \gamma_A} \sum_k \lambda_k \langle (D + D^*) f, e_k \rangle e_k. \]

We see that the first summand is a sum of the kernel functions centered at data points. It remains to show that the second summand has an integral representation, that is, can be written as ∫_M α(y) K(x, y) dP_X(y), which is equivalent to being in the image of L_K. To verify this we apply Lemma 3. We need that

\[ \sum_k \frac{\lambda_k^2 \langle (D + D^*) f, e_k \rangle^2}{\lambda_k^2} = \sum_k \langle (D + D^*) f, e_k \rangle^2 < \infty. \]

Since D, its adjoint operator D*, and hence their sum are bounded, the inequality above is satisfied for any function in S.

3.3 Manifold Setting

(Footnote 2: We thank Peter Constantin and Todd Dupont for help with this section.)

We now show that for the case when M is a manifold and D is a differential operator, such as the Laplace-Beltrami operator Δ_M, the boundedness condition of Theorem 7 is satisfied. While we consider the case when the manifold has no boundary, the same argument goes through for a manifold with boundary, with, for example, Dirichlet's boundary conditions (vanishing at the boundary). Thus the setting of Theorem 7 is very general, applying, among other things, to arbitrary differential operators on compact domains in Euclidean space.

Let M be a C^∞ manifold without boundary with an infinitely differentiable embedding in some ambient space X, D a differential operator with C^∞ coefficients, and let μ be the measure corresponding to some C^∞ nowhere vanishing volume form on M. We assume that the kernel K(x, y) is also infinitely differentiable. (Footnote 3: While we have assumed that all objects are infinitely differentiable, it is not hard to specify the precise differentiability conditions. Roughly speaking, a degree k differential operator D is bounded as an operator H_K → L²_μ if the kernel K(x, y) has 2k derivatives.) As before, for an operator A, A* denotes the adjoint operator.
Theorem 8. Under the conditions above, D is a bounded operator S → L²_μ.

Proof. First note that it is enough to show that D is bounded on H_{K_M}, since D only depends on the restriction f_M. As before, let L_{K_M}(f)(x) = ∫_M f(y) K_M(x, y) dμ be the integral operator associated to K_M. Note that D* is also a differential operator of the same degree as D. The integral operator L_{K_M} is bounded (compact) from L²_μ to any Sobolev space H^{sob}. Therefore the operator L_{K_M} D* is also bounded. We therefore see that D L_{K_M} D* is bounded L²_μ → L²_μ. Therefore there is a constant C such that ⟨D L_{K_M} D* f, f⟩_{L²_μ} ≤ C ||f||²_{L²_μ}.

The square root T = L_{K_M}^{1/2} of the self-adjoint positive definite operator L_{K_M} is a self-adjoint positive definite operator as well. Thus (DT)* = T D*. By definition of the operator norm, for any ε > 0 there exists f ∈ L²_μ, ||f||_{L²_μ} ≤ 1 + ε, such that

\[ \|DT\|^2_{L^2_\mu} = \|T D^*\|^2_{L^2_\mu} \le \langle T D^* f, T D^* f \rangle_{L^2_\mu} = \langle D L_{K_M} D^* f, f \rangle_{L^2_\mu} \le \|D L_{K_M} D^*\|_{L^2_\mu} \|f\|^2_{L^2_\mu} \le C (1 + \varepsilon)^2. \]

Therefore the operator DT : L²_μ → L²_μ is bounded (and also ||DT||_{L²_μ} ≤ C, since ε is arbitrary). Now recall that T provides an isometry between L²_μ and H_{K_M}. That means that for any g ∈ H_{K_M} there is f ∈ L²_μ such that T f = g and ||f||_{L²_μ} = ||g||_{K_M}. Thus ||Dg||_{L²_μ} = ||DT f||_{L²_μ} ≤ C ||g||_{K_M}, which shows that D : H_{K_M} → L²_μ is bounded and concludes the proof.

Since S is a subspace of H_K, the main result follows immediately:

Corollary 9. D is a bounded operator S → L²_μ and the conditions of Theorem 7 hold.

Before finishing the theoretical discussion we obtain a useful

Corollary 10. The operator T = L_K^{1/2} on L²_μ is a bounded (and in fact compact) operator L²_μ → H^{sob}, where H^{sob} is an arbitrary Sobolev space.

Proof. Follows from the fact that DT is a bounded operator L²_μ → L²_μ for an arbitrary differential operator D and standard results on compact embeddings of Sobolev spaces (see, for example, Adams, 1975).

3.4 The Representer Theorem for the Empirical Case

In the case when M is unknown and sampled via labeled and unlabeled examples, the Laplace-Beltrami operator on M may be approximated by the Laplacian of the data adjacency graph (see Belkin, 2003; Bousquet et al., 2004, for some discussion). A regularizer based on the graph Laplacian leads to the optimization problem posed in Equation 4. We now provide a proof of Theorem 2, which states that the solution to this problem admits a representation in terms of an expansion over labeled and unlabeled points. The proof is based on a simple orthogonality argument (e.g., Scholkopf and Smola, 2002).
Proof (Theorem 2). Any function f ∈ H_K can be uniquely decomposed into a component f_∥ in the linear subspace spanned by the kernel functions {K(x_i, ·)}_{i=1}^{l+u}, and a component f_⊥ orthogonal to it. Thus,

\[ f = f_{\parallel} + f_{\perp} = \sum_{i=1}^{l+u} \alpha_i K(x_i, \cdot) + f_{\perp}. \]

By the reproducing property, as the following arguments show, the evaluation of f on any data point x_j, 1 ≤ j ≤ l+u, is independent of the orthogonal component f_⊥:

\[ f(x_j) = \langle f, K(x_j, \cdot) \rangle = \Big\langle \sum_{i=1}^{l+u} \alpha_i K(x_i, \cdot), K(x_j, \cdot) \Big\rangle + \langle f_{\perp}, K(x_j, \cdot) \rangle. \]

Since the second term vanishes, and ⟨K(x_i, ·), K(x_j, ·)⟩ = K(x_i, x_j), it follows that f(x_j) = Σ_{i=1}^{l+u} α_i K(x_i, x_j). Thus, the empirical terms involving the loss function and the intrinsic norm in the optimization problem in Equation 4 depend only on the value of the coefficients {α_i}_{i=1}^{l+u} and the gram matrix of the kernel function.

Indeed, since the orthogonal component only increases the norm of f in H_K,

\[ \|f\|_K^2 = \Big\| \sum_{i=1}^{l+u} \alpha_i K(x_i, \cdot) \Big\|_K^2 + \|f_{\perp}\|_K^2 \ge \Big\| \sum_{i=1}^{l+u} \alpha_i K(x_i, \cdot) \Big\|_K^2, \]

it follows that the minimizer of problem 4 must have f_⊥ = 0, and therefore admits a representation f*(·) = Σ_{i=1}^{l+u} α_i K(x_i, ·).

The simple form of the minimizer, given by this theorem, allows us to translate our extrinsic and intrinsic regularization framework into optimization problems over the finite dimensional space of coefficients {α_i}_{i=1}^{l+u}, and invoke the machinery of kernel based algorithms. In the next section, we derive these algorithms, and explore their connections to other related work.

4. Algorithms

We now discuss standard regularization algorithms (RLS and SVM) and present their extensions (LapRLS and LapSVM respectively). These are obtained by solving the optimization problems posed in Equation 4 for different choices of cost function V and regularization parameters γ_A, γ_I. To fix notation, we assume we have l labeled examples {(x_i, y_i)}_{i=1}^{l} and u unlabeled examples {x_j}_{j=l+1}^{l+u}. We use K interchangeably to denote the kernel function or the Gram matrix.

4.1 Regularized Least Squares

The regularized least squares algorithm is a fully supervised method where we solve:

\[ \min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2 + \gamma \|f\|_K^2. \]

The classical Representer Theorem can be used to show that the solution is of the following form:

\[ f^*(x) = \sum_{i=1}^{l} \alpha_i^* K(x, x_i). \]

Substituting this form in the problem above, we arrive at the following convex differentiable objective function of the l-dimensional variable α = [α_1 ... α_l]^T:

\[ \alpha^* = \arg\min_{\alpha} \frac{1}{l} (Y - K\alpha)^T (Y - K\alpha) + \gamma\, \alpha^T K \alpha, \]

where K is the l × l gram matrix K_ij = K(x_i, x_j) and Y is the label vector Y = [y_1 ... y_l]^T. The derivative of the objective function vanishes at the minimizer:

\[ \frac{1}{l} (Y - K\alpha^*)^T (-K) + \gamma K \alpha^* = 0, \]

which leads to the following solution:

\[ \alpha^* = (K + \gamma l\, I)^{-1} Y. \]

4.2 Laplacian Regularized Least Squares (LapRLS)

The Laplacian regularized least squares algorithm solves the optimization problem in Equation 4 with the squared loss function:

\[ \min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2 + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(u+l)^2} \mathbf{f}^T L \mathbf{f}. \]

As before, the Representer Theorem can be used to show that the solution is an expansion of kernel functions over both the labeled and the unlabeled data:

\[ f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* K(x, x_i). \]

Substituting this form in the equation above, as before, we arrive at a convex differentiable objective function of the (l+u)-dimensional variable α = [α_1 ... α_{l+u}]^T:

\[ \alpha^* = \arg\min_{\alpha \in \mathbb{R}^{l+u}} \frac{1}{l} (Y - JK\alpha)^T (Y - JK\alpha) + \gamma_A\, \alpha^T K \alpha + \frac{\gamma_I}{(u+l)^2}\, \alpha^T K L K \alpha, \]

where K is the (l+u) × (l+u) Gram matrix over labeled and unlabeled points; Y is an (l+u)-dimensional label vector given by Y = [y_1, ..., y_l, 0, ..., 0]; and J is an (l+u) × (l+u) diagonal matrix given by J = diag(1, ..., 1, 0, ..., 0) with the first l diagonal entries as 1 and the rest 0.

The derivative of the objective function vanishes at the minimizer:

\[ \frac{1}{l} (Y - JK\alpha^*)^T (-JK) + \Big( \gamma_A K + \frac{\gamma_I\, l}{(u+l)^2} K L K \Big) \alpha^* = 0, \]

which leads to the following solution:

\[ \alpha^* = \Big( JK + \gamma_A l\, I + \frac{\gamma_I\, l}{(u+l)^2} L K \Big)^{-1} Y. \tag{8} \]

Note that when γ_I = 0, Equation 8 gives zero coefficients over unlabeled data, and the coefficients over the labeled data are exactly those for standard RLS.
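The closed forms above translate directly into code. The following sketch (assuming numpy; the function names are ours) implements the standard RLS solution and Equation 8, reusing the graph Laplacian and Gram matrix helpers sketched earlier; the data are assumed to be ordered with the l labeled points first.

    import numpy as np

    def rls(K_ll, y, gamma):
        # Standard RLS (Section 4.1): alpha* = (K + gamma*l*I)^{-1} Y.
        l = len(y)
        return np.linalg.solve(K_ll + gamma * l * np.eye(l), y)

    def lap_rls(K, L, y_labeled, gamma_A, gamma_I):
        # Laplacian RLS (Equation 8):
        # alpha* = (J K + gamma_A*l*I + gamma_I*l/(u+l)^2 * L K)^{-1} Y.
        n, l = K.shape[0], len(y_labeled)
        u = n - l
        Y = np.concatenate([y_labeled, np.zeros(u)])   # zero-padded label vector
        JK = K.copy()
        JK[l:, :] = 0.0                                # J K with J = diag(1..1, 0..0)
        A = JK + gamma_A * l * np.eye(n) + (gamma_I * l / (u + l) ** 2) * (L @ K)
        return np.linalg.solve(A, Y)

    # A novel point x is then classified by the sign of f(x) = sum_i alpha_i K(x_i, x),
    # which is the out-of-sample extension of Theorem 2.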
4.3 Support Vector Machine Classification

Here we outline the SVM approach to binary classification problems. For SVMs, the following problem is solved:

\[ \min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} (1 - y_i f(x_i))_+ + \gamma \|f\|_K^2, \]

where the hinge loss is defined as (1 − y f(x))_+ = max(0, 1 − y f(x)) and the labels y_i ∈ {−1, +1}. Again, the solution is given by:

\[ f^*(x) = \sum_{i=1}^{l} \alpha_i^* K(x, x_i). \tag{9} \]

Following SVM expositions, the above problem can be equivalently written as:

\[ \min_{f \in \mathcal{H}_K,\, \xi_i \in \mathbb{R}} \frac{1}{l} \sum_{i=1}^{l} \xi_i + \gamma \|f\|_K^2 \quad \text{subject to:} \quad y_i f(x_i) \ge 1 - \xi_i,\ \ \xi_i \ge 0, \quad i = 1, \dots, l. \]

Using the Lagrange multipliers technique, and benefiting from strong duality, the above problem has a simpler quadratic dual program in the Lagrange multipliers β = [β_1, ..., β_l]^T ∈ R^l:

\[ \beta^* = \max_{\beta \in \mathbb{R}^l} \sum_{i=1}^{l} \beta_i - \frac{1}{2} \beta^T Q \beta \quad \text{subject to:} \quad \sum_{i=1}^{l} y_i \beta_i = 0,\ \ 0 \le \beta_i \le \frac{1}{l},\quad i = 1, \dots, l, \]

where the equality constraint arises due to an unregularized bias term that is often added to the sum in Equation 9, and the following notation is used:

\[ Y = \mathrm{diag}(y_1, y_2, \dots, y_l), \qquad Q = Y \frac{K}{2\gamma} Y, \qquad \alpha^* = \frac{Y \beta^*}{2\gamma}. \]

Here again, K is the gram matrix over labeled points. SVM practitioners may be familiar with a slightly different parameterization involving the C parameter: C = 1/(2γl) is the weight on the hinge loss term (instead of using a weight γ on the norm term in the optimization problem). The C parameter appears as the upper bound (instead of 1/l) on the values of β in the quadratic program. For additional details on the derivation and alternative formulations of SVMs, see Scholkopf and Smola (2002); Rifkin (2002).

4.4 Laplacian Support Vector Machines

By including the intrinsic smoothness penalty term, we can extend SVMs by solving the following problem:

\[ \min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} (1 - y_i f(x_i))_+ + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(u+l)^2} \mathbf{f}^T L \mathbf{f}. \]

By the representer theorem, as before, the solution to the problem above is given by:

\[ f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* K(x, x_i). \]

Often in SVM formulations, an unregularized bias term b is added to the above form. Again, the primal problem can easily be seen to be the following:

\[ \min_{\alpha \in \mathbb{R}^{l+u},\, \xi \in \mathbb{R}^l} \frac{1}{l} \sum_{i=1}^{l} \xi_i + \gamma_A\, \alpha^T K \alpha + \frac{\gamma_I}{(u+l)^2}\, \alpha^T K L K \alpha \]

subject to: y_i (Σ_{j=1}^{l+u} α_j K(x_i, x_j) + b) ≥ 1 − ξ_i and ξ_i ≥ 0, for i = 1, ..., l.

Introducing the Lagrangian, with β_i, ζ_i as Lagrange multipliers:

\[ L(\alpha, \xi, b, \beta, \zeta) = \frac{1}{l} \sum_{i=1}^{l} \xi_i + \frac{1}{2} \alpha^T \Big( 2\gamma_A K + \frac{2\gamma_I}{(u+l)^2} K L K \Big) \alpha - \sum_{i=1}^{l} \beta_i \Big( y_i \Big( \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b \Big) - 1 + \xi_i \Big) - \sum_{i=1}^{l} \zeta_i \xi_i. \]

Passing to the dual requires the following steps:

\[ \frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^{l} \beta_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = 0 \implies \frac{1}{l} - \beta_i - \zeta_i = 0 \implies 0 \le \beta_i \le \frac{1}{l} \quad (\beta_i, \zeta_i\ \text{are non-negative}). \]

Using the above identities, we formulate a reduced Lagrangian:

\[ L^R(\alpha, \beta) = \frac{1}{2} \alpha^T \Big( 2\gamma_A K + \frac{2\gamma_I}{(u+l)^2} K L K \Big) \alpha - \sum_{i=1}^{l} \beta_i \Big( y_i \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) - 1 \Big) = \frac{1}{2} \alpha^T \Big( 2\gamma_A K + \frac{2\gamma_I}{(u+l)^2} K L K \Big) \alpha - \alpha^T K J^T Y \beta + \sum_{i=1}^{l} \beta_i, \]
where J = [I 0] is an l × (l+u) matrix with I as the l × l identity matrix (assuming the first l points are labeled) and Y = diag(y_1, y_2, ..., y_l).

Taking the derivative of the reduced Lagrangian with respect to α:

\[ \frac{\partial L^R}{\partial \alpha} = \Big( 2\gamma_A K + \frac{2\gamma_I}{(u+l)^2} K L K \Big) \alpha - K J^T Y \beta. \]

This implies:

\[ \alpha = \Big( 2\gamma_A I + \frac{2\gamma_I}{(u+l)^2} L K \Big)^{-1} J^T Y \beta^*. \tag{10} \]

Note that the relationship between α and β is no longer as simple as in the SVM algorithm. In particular, the (l+u) expansion coefficients are obtained by solving a linear system involving the l dual variables that will appear in the SVM dual problem.

Substituting back in the reduced Lagrangian we get:

\[ \beta^* = \max_{\beta \in \mathbb{R}^l} \sum_{i=1}^{l} \beta_i - \frac{1}{2} \beta^T Q \beta \tag{11} \]

subject to:

\[ \sum_{i=1}^{l} \beta_i y_i = 0, \qquad 0 \le \beta_i \le \frac{1}{l}, \quad i = 1, \dots, l, \tag{12} \]

where

\[ Q = Y J K \Big( 2\gamma_A I + \frac{2\gamma_I}{(l+u)^2} L K \Big)^{-1} J^T Y. \]

Laplacian SVMs can be implemented by using a standard SVM solver with the quadratic form induced by the above matrix, and using the solution to obtain the expansion coefficients by solving the linear system in Equation 10.

Note that when γ_I = 0, the SVM QP and Equations 11 and 10 give zero expansion coefficients over the unlabeled data. The expansion coefficients over the labeled data and the Q matrix are as in standard SVM, in this case.

The manifold regularization algorithms are summarized in Table 1.

Efficiency Issues: It is worth noting that our algorithms compute the inverse of a dense Gram matrix, which leads to O((l+u)^3) complexity. This may be impractical for large data sets. In the case of linear kernels, instead of using Equation 5, we can directly write f*(x) = w^T x and solve for the weight vector w using a primal optimization method. This is much more efficient when the data is low-dimensional. For highly sparse data sets, for example, in text categorization problems, effective conjugate gradient schemes can be used in a large scale implementation, as outlined in Sindhwani et al. (2006). For the non-linear case, one may obtain approximate solutions (e.g., using greedy, matching pursuit techniques) where the optimization problem is solved over the span of a small set of basis functions instead of using the full representation in Equation 5. We note these directions for future work. In Section 5, we evaluate the empirical performance of our algorithms with exact computations as outlined in Table 1 with non-linear kernels.
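A compact sketch of the LapSVM training path of Equations 10-12, assuming numpy and the cvxopt QP solver (the paper itself only requires some standard SVM solver; the symmetrization and the omission of the bias recovery are our own simplifications):

    import numpy as np
    from cvxopt import matrix, solvers

    def lap_svm(K, L, y, gamma_A, gamma_I):
        # Laplacian SVM via the dual (Equations 11-12) and the linear system (10).
        # K: (l+u)x(l+u) Gram matrix, L: graph Laplacian, y: the l labels in {-1,+1},
        # with the labeled points assumed to come first.
        n, l = K.shape[0], len(y)
        u = n - l
        J = np.hstack([np.eye(l), np.zeros((l, u))])          # J = [I 0]
        Y = np.diag(y.astype(float))
        M_inv = np.linalg.inv(2 * gamma_A * np.eye(n)
                              + (2 * gamma_I / (u + l) ** 2) * (L @ K))
        Q = Y @ J @ K @ M_inv @ J.T @ Y
        Q = 0.5 * (Q + Q.T)                                   # symmetrize numerically
        # Dual: max 1'b - (1/2) b'Qb  s.t.  y'b = 0,  0 <= b_i <= 1/l.
        sol = solvers.qp(matrix(Q), matrix(-np.ones(l)),
                         matrix(np.vstack([-np.eye(l), np.eye(l)])),
                         matrix(np.hstack([np.zeros(l), np.ones(l) / l])),
                         matrix(y.astype(float).reshape(1, l)), matrix(0.0))
        beta = np.array(sol['x']).ravel()
        return M_inv @ J.T @ Y @ beta                         # Equation 10

The unregularized bias b can afterwards be recovered from the KKT conditions of the margin constraints, exactly as in a standard SVM.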
For other recent work addressing scalability issues in semi-supervised learning, see, for example, Tsang and Kwok (2005); Bengio et al. (2004).

Manifold Regularization algorithms

Input: l labeled examples {(x_i, y_i)}_{i=1}^{l}, u unlabeled examples {x_j}_{j=l+1}^{l+u}.
Output: Estimated function f : R^n → R.
Step 1: Construct a data adjacency graph with (l+u) nodes using, for example, k nearest neighbors or a graph kernel. Choose edge weights W_ij, for example, binary weights or heat kernel weights W_ij = e^{−||x_i − x_j||²/4t}.
Step 2: Choose a kernel function K(x, y). Compute the Gram matrix K_ij = K(x_i, x_j).
Step 3: Compute the graph Laplacian matrix L = D − W, where D is a diagonal matrix given by D_ii = Σ_{j=1}^{l+u} W_ij.
Step 4: Choose γ_A and γ_I.
Step 5: Compute α* using Equation 8 for squared loss (Laplacian RLS) or using Equations 11 and 10 together with the SVM QP solver for soft margin loss (Laplacian SVM).
Step 6: Output function f*(x) = Σ_{i=1}^{l+u} α*_i K(x_i, x).

Table 1: A summary of the algorithms.

4.5 Related Work and Connections to Other Algorithms

In this section we survey various approaches to semi-supervised and transductive learning and highlight connections of manifold regularization to other algorithms.

Transductive SVM (TSVM) (Vapnik, 1998; Joachims, 1999): TSVMs are based on the following optimization principle:

\[ f^* = \arg\min_{f \in \mathcal{H}_K,\; y_{l+1}, \dots, y_{l+u}} C \sum_{i=1}^{l} (1 - y_i f(x_i))_+ + C^* \sum_{i=l+1}^{l+u} (1 - y_i f(x_i))_+ + \|f\|_K^2, \]

which proposes a joint optimization of the SVM objective function over binary-valued labels on the unlabeled data and functions in the RKHS. Here, C, C* are parameters that control the relative hinge-loss over labeled and unlabeled sets. The joint optimization is implemented in Joachims (1999) by first using an inductive SVM to label the unlabeled data and then iteratively solving SVM quadratic programs, at each step switching labels to improve the objective function. However this procedure is susceptible to local minima and requires an unknown, possibly large number of label switches before converging. Note that even though TSVMs were inspired by transductive inference, they do provide an out-of-sample extension.

Semi-Supervised SVMs (S3VM) (Bennett and Demiriz, 1999; Fung and Mangasarian, 2001): S3VMs incorporate unlabeled data by including the minimum hinge-loss for the two choices of labels for each unlabeled example.
This is formulated as a mixed-integer program for linear SVMs in Bennett and Demiriz (1999) and is found to be intractable for large amounts of unlabeled data. Fung and Mangasarian (2001) reformulate this approach as a concave minimization problem which is solved by a successive linear approximation algorithm. The presentation of these algorithms is restricted to the linear case.

Measure-Based Regularization (Bousquet et al., 2004): The conceptual framework of this work is closest to our approach. The authors consider a gradient based regularizer that penalizes variations of the function more in high density regions and less in low density regions, leading to the following optimization principle:

\[ f^* = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{l} V(f(x_i), y_i) + \gamma \int_X \langle \nabla f(x), \nabla f(x) \rangle\, p(x)\, dx, \]

where p is the density of the marginal distribution P_X. The authors observe that it is not straightforward to find a kernel for arbitrary densities p whose associated RKHS norm is

\[ \int \langle \nabla f(x), \nabla f(x) \rangle\, p(x)\, dx. \]

Thus, in the absence of a representer theorem, the authors propose to perform minimization of the regularized loss on a fixed set of basis functions chosen a priori, that is, F = {Σ_{i=1}^{q} α_i φ_i}. For the hinge loss, this paper derives an SVM quadratic program in the coefficients {α_i}_{i=1}^{q} whose Q matrix is calculated by computing q² integrals over gradients of the basis functions. However the algorithm does not demonstrate performance improvements in real world experiments. It is also worth noting that while Bousquet et al. (2004) use the gradient ∇f(x) in the ambient space, we use the gradient over a submanifold ∇_M f for penalizing the function. In a situation where the data truly lies on or near a submanifold M, the difference between these two penalizers can be significant, since smoothness in the normal direction to the data manifold is irrelevant to classification or regression.

Graph-Based Approaches (see, for example, Blum and Chawla, 2001; Chapelle et al., 2003; Szummer and Jaakkola, 2002; Zhou et al., 2004; Zhu et al., 2003, 2005; Kemp et al., 2004; Joachims, 2003; Belkin and Niyogi, 2003b): A variety of graph-based methods have been proposed for transductive inference. However, these methods do not provide an out-of-sample extension. In Zhu et al. (2003), nearest neighbor labeling for test examples is proposed once unlabeled examples have been labeled by transductive learning. In Chapelle et al. (2003), test points are approximately represented as a linear combination of training and unlabeled points in the feature space induced by the kernel. For graph regularization and label propagation see Smola and Kondor (2003); Belkin et al. (2004); Zhu et al. (2003). Smola and Kondor (2003) discuss the construction of a canonical family of graph regularizers based on the graph Laplacian. Zhu et al. (2005) present a non-parametric construction of graph regularizers.

Manifold regularization provides natural out-of-sample extensions to several graph-based approaches. These connections are summarized in Table 2. We also note the recent work (Delalleau et al., 2005) on out-of-sample extensions for semi-supervised learning, where an induction formula is derived by assuming that the addition of a test point to the graph does not change the transductive solution over the unlabeled data.

Cotraining (Blum and Mitchell, 1998): The cotraining algorithm was developed to integrate abundance of unlabeled data with availability of multiple sources of information in domains like web-page classification. Weak learners are trained on labeled examples and their predictions on
subsets of unlabeled examples are used to mutually expand the training set. Note that this setting may not be applicable in several cases of practical interest where one does not have access to multiple information sources.

Bayesian Techniques (see, for example, Nigam et al., 2000; Seeger, 2001; Corduneanu and Jaakkola, 2003): An early application of semi-supervised learning to text classification appeared in Nigam et al. (2000), where a combination of the EM algorithm and Naive-Bayes classification is proposed to incorporate unlabeled data. Seeger (2001) provides a detailed overview of Bayesian frameworks for semi-supervised learning. The recent work in Corduneanu and Jaakkola (2003) formulates a new information-theoretic principle to develop a regularizer for conditional log-likelihood.

Parameters | Corresponding algorithms (square loss or hinge loss)
γ_A ≥ 0, γ_I ≥ 0 | Manifold Regularization
γ_A ≥ 0, γ_I = 0 | Standard Regularization (RLS or SVM)
γ_A → 0, γ_I > 0 | Out-of-sample extension for Graph Regularization (RLS or SVM)
γ_A → 0, γ_I → 0, γ_I ≫ γ_A | Out-of-sample extension for Label Propagation (RLS or SVM)
γ_A → 0, γ_I = 0 | Hard margin SVM or Interpolated RLS

Table 2: Connections of manifold regularization to other algorithms.

5. Experiments

We performed experiments on a synthetic data set and three real world classification problems arising in visual and speech recognition, and text categorization. Comparisons are made with inductive methods (SVM, RLS). We also compare Laplacian SVM with transductive SVM. All software and data sets used for these experiments will be made available at http://www.cs.uchicago.edu/vikass/manifoldregularization.html. For further experimental benchmark studies and comparisons with numerous other methods, we refer the reader to Chapelle et al. (2006); Sindhwani et al. (2006, 2005).

5.1 Synthetic Data: Two Moons Data Set

The two moons data set is shown in Figure 2. The data set contains 200 examples with only 1 labeled example for each class. Also shown are the decision surfaces of Laplacian SVM for increasing values of the intrinsic regularization parameter γ_I. When γ_I = 0, Laplacian SVM disregards unlabeled data and returns the SVM decision boundary, which is fixed by the location of the two labeled points. As γ_I is increased, the intrinsic regularizer incorporates unlabeled data and causes the decision surface to appropriately adjust according to the geometry of the two classes.

In Figure 3, the best decision surfaces across a wide range of parameter settings are also shown for SVM, transductive SVM and Laplacian SVM. Figure 3 demonstrates how TSVM fails to find the optimal solution, probably since it gets stuck in a local minimum. The Laplacian SVM decision boundary seems to be intuitively most satisfying.

Figure 2: Laplacian SVM with RBF kernels for various values of γ_I (panels: γ_A = 0.03125 with γ_I = 0, 0.01, 1). Labeled points are shown in color, other points are unlabeled.

Figure 3: Two Moons data set: Best decision surfaces using RBF kernels for SVM, TSVM and Laplacian SVM. Labeled points are shown in color, other points are unlabeled.
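For readers who want to reproduce the flavor of this experiment, the following usage sketch wires together the hypothetical helpers from the earlier code blocks (graph_laplacian, gaussian_gram, lap_rls) on scikit-learn's two moons generator. It uses LapRLS rather than LapSVM for simplicity, and none of the parameter values below are from the paper.

    import numpy as np
    from sklearn.datasets import make_moons

    X, y = make_moons(n_samples=200, noise=0.05, shuffle=False, random_state=0)
    y = 2 * y - 1                                  # labels in {-1, +1}
    labeled = np.array([0, 100])                   # one labeled example per class
    order = np.concatenate([labeled, np.delete(np.arange(200), labeled)])
    X, y = X[order], y[order]                      # labeled points first
    l = 2

    L, W = graph_laplacian(X, n_neighbors=6, t=0.1)          # Steps 1 and 3 of Table 1
    K = gaussian_gram(X, sigma=0.35)                         # Step 2
    alpha = lap_rls(K, L, y[:l], gamma_A=1e-6, gamma_I=1e-2) # Step 5 (squared loss)
    f = K @ alpha                                            # f at labeled + unlabeled points
    print('transductive accuracy:', (np.sign(f[l:]) == y[l:]).mean())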
5.2 Handwritten Digit Recognition

In this set of experiments we applied Laplacian SVM and Laplacian RLS algorithms to 45 binary classification problems that arise in pairwise classification of handwritten digits. The first 400 images for each digit in the USPS training set (preprocessed using PCA to 100 dimensions) were taken to form the training set. The remaining images formed the test set. 2 images for each class were randomly labeled (l = 2) and the rest were left unlabeled (u = 398). Following Scholkopf et al. (1995), we chose to train classifiers with polynomial kernels of degree 3, and set the weight on the regularization term for inductive methods as γl = 0.05 (C = 10). For manifold regularization, we chose to split the same weight in the ratio 1 : 9, so that γ_A l = 0.005 and γ_I l/(u+l)² = 0.045. The observations reported in this section hold consistently across a wide choice of parameters.

In Figure 4, we compare the error rates of manifold regularization algorithms, inductive classifiers and TSVM, at the break-even points in the precision-recall curves for the 45 binary classification problems. These results are averaged over 10 random choices of labeled examples. The following comments can be made: (a) Manifold regularization results in significant improvements over inductive classification, for both RLS and SVM, and either compares well with or significantly outperforms TSVM across the 45 classification problems. Note that TSVM solves multiple quadratic programs in the size of the labeled and unlabeled sets, whereas LapSVM solves a single QP (Equation 11) in the size of the labeled set, followed by a linear system (Equation 10). This resulted in substantially faster training times for LapSVM in this experiment. (b) Scatter plots of performance on test and unlabeled data sets, in the bottom row of Figure 4, confirm that the out-of-sample extension is good for both LapRLS and LapSVM. (c) Also shown, in the rightmost scatter plot in the bottom row of Figure 4, are standard deviations of error rates obtained by LapSVM and TSVM. We found LapSVM to be significantly more stable than the inductive methods and TSVM, with respect to the choice of the labeled data. In Figure 5, we demonstrate the benefit of unlabeled data as a function of the number of labeled examples.

Figure 4: USPS Experiment: (Top row) Error rates at precision-recall break-even points for the 45 binary classification problems (RLS vs LapRLS, SVM vs LapSVM, TSVM vs LapSVM). (Bottom row) Scatter plots of error rates on test and unlabeled data for Laplacian RLS and Laplacian SVM, and standard deviations in test errors of Laplacian SVM and TSVM.

Method | SVM | TSVM | LapSVM | RLS | LapRLS
Error | 23.6 | 26.5 | 12.7 | 23.6 | 12.7

Table 3: USPS Experiment: one-versus-rest multiclass error rates.

We also performed one-vs-rest multiclass experiments on the USPS test set with l = 50 and u = 1957, with 10 random splits as provided by Chapelle and Zien (2005).
The mean error rates in predicting labels of unlabeled data are reported in Table 3. In this experiment, TSVM actually performs worse than the SVM baseline, probably since local minima problems become severe in a multi-class setting. For several other experimental observations and comparisons on this data set, see Sindhwani et al. (2005).

Figure 5: USPS Experiment: mean error rate at precision-recall break-even points as a function of the number of labeled points, for SVM vs LapSVM and RLS vs LapRLS (T: test set, U: unlabeled set).

5.3 Spoken Letter Recognition

This experiment was performed on the Isolet database of letters of the English alphabet spoken in isolation (available from the UCI machine learning repository). The data set contains utterances of 150 subjects who spoke the name of each letter of the English alphabet twice. The speakers are grouped into 5 sets of 30 speakers each, referred to as isolet1 through isolet5. For the purposes of this experiment, we chose to train on the first 30 speakers (isolet1), forming a training set of 1560 examples, and test on isolet5 containing 1559 examples (1 utterance is missing in the database due to poor recording). We considered the task of classifying the first 13 letters of the English alphabet from the last 13. We considered 30 binary classification problems corresponding to 30 splits of the training data where all 52 utterances of one speaker were labeled and all the rest were left unlabeled. The test set is composed of entirely new speakers, forming the separate group isolet5.

We chose to train with RBF kernels of width σ = 10 (this was the best value among several settings with respect to 5-fold cross-validation error rates for the fully supervised problem using standard SVM). For SVM and RLS we set γl = 0.05 (C = 10) (this was the best value among several settings with respect to mean error rates over the 30 splits). For Laplacian RLS and Laplacian SVM we set γ_A l = γ_I l/(u+l)² = 0.005.

In Figure 6, we compare these algorithms.

Figure 6: Isolet Experiment: Error rates at precision-recall break-even points of 30 binary classification problems, as a function of the labeled speaker (RLS vs LapRLS and SVM vs TSVM vs LapSVM, on the unlabeled set and on the test set).

The following comments can be made: (a) LapSVM and LapRLS make significant performance improvements over inductive methods and TSVM, for predictions on unlabeled speakers that come from the same group as the labeled speaker, over all choices of the labeled speaker. (b) On Isolet5, which comprises a separate group of speakers,
performance improvements are smaller but consistent over the choice of the labeled speaker. This can be expected since there appears to be a systematic bias that affects all algorithms, in favor of same-group speakers. To test this hypothesis, we performed another experiment in which the training and test utterances are both drawn from Isolet1. Here, the second utterance of each letter for each of the 30 speakers in Isolet1 was taken away to form the test set containing 780 examples. The training set consisted of the first utterances for each letter. As before, we considered 30 binary classification problems arising when all utterances of one speaker are labeled and other training speakers are left unlabeled. The scatter plots in Figure 7 confirm our hypothesis, and show high correlation between in-sample and out-of-sample performance of our algorithms in this experiment.

Figure 7: Isolet Experiment: Error rates at precision-recall break-even points on the test set versus the unlabeled set, for RLS, LapRLS, SVM and LapSVM. In Experiment 1, the training data comes from Isolet1 and the test data comes from Isolet5; in Experiment 2, both training and test sets come from Isolet1.

It is encouraging to note performance improvements with unlabeled data in Experiment 1, where the test data comes from a slightly different distribution. This robustness is often desirable in real-world applications.

In Table 4 we report mean error rates over the 30 splits from one-vs-rest 26-class experiments on this data set. The parameters were held fixed as in the 2-class setting. The failure of TSVM in producing reasonable results on this data set has also been observed in Joachims (2003). With LapSVM and LapRLS we obtain around 3 to 4% improvement over their supervised counterparts.

Method | SVM | TSVM | LapSVM | RLS | LapRLS
Error (unlabeled) | 28.6 | 46.6 | 24.5 | 28.3 | 24.1
Error (test) | 36.9 | 43.3 | 33.7 | 36.3 | 33.3

Table 4: Isolet: one-versus-rest multiclass error rates.
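The experiments above repeatedly report precision-recall break-even points (PRBEP). As a reference for how such a number can be computed from real-valued classifier outputs (a sketch assuming numpy, not code from the paper): sweep cutoffs down the ranked scores and take the point where precision and recall are closest.

    import numpy as np

    def prbep(scores, y_true):
        # Precision-recall break-even point: the precision (= recall) at the
        # rank cutoff where the two curves are closest. y_true in {-1, +1}.
        order = np.argsort(-scores)                    # descending by score
        hits = (y_true[order] == 1).astype(float)
        tp = np.cumsum(hits)                           # true positives at each cutoff
        precision = tp / np.arange(1, len(scores) + 1)
        recall = tp / hits.sum()
        i = np.argmin(np.abs(precision - recall))      # break-even cutoff
        return 0.5 * (precision[i] + recall[i])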
5.4 Text Categorization

We performed text categorization experiments on the WebKB data set, which consists of 1051 web pages collected from Computer Science department web sites of various universities. The task is to classify these web pages into two categories: course or non-course. We considered learning classifiers using only textual content of the web pages, ignoring link information. A bag-of-words vector space representation for documents is built using the top 3000 words (skipping HTML headers) having highest mutual information with the class variable, followed by TFIDF mapping. (Footnote 4: TFIDF stands for Term Frequency Inverse Document Frequency. It is a common document preprocessing procedure, which combines the number of occurrences of a given term with the number of documents containing it.) Feature vectors are normalized to unit length. 9 documents were found to contain none of these words and were removed from the data set.

For the first experiment, we ran LapRLS and LapSVM in a transductive setting, with 12 randomly labeled examples (3 course and 9 non-course) and the rest unlabeled. In Table 5, we report the precision and error rates at the precision-recall break-even point averaged over 100 realizations of the data, and include results reported in Joachims (2003) for spectral graph transduction, and the cotraining algorithm (Blum and Mitchell, 1998) for comparison. We used 15 nearest neighbor graphs, weighted by cosine distances, and used iterated Laplacians of degree 3. For inductive methods, γ_A l was set to 0.01 for RLS and 1.00 for SVM. For LapRLS and LapSVM, γ_A was set as in inductive methods, with γ_I l/(l+u)² = 100 γ_A l. These parameters were chosen based on a simple grid search for best performance over the first 5 realizations of the data. Linear kernels and cosine distances were used since these have found widespread application in text classification problems, for example, in Dumais et al. (1998).

Method | PRBEP | Error
k-NN | 73.2 | 13.3
SGT | 86.2 | 6.2
Naive-Bayes | - | 12.9
Cotraining | - | 6.20
SVM | 76.39 (5.6) | 10.41 (2.5)
TSVM | 88.15 (1.0) | 5.22 (0.5)
LapSVM | 87.73 (2.3) | 5.41 (1.0)
RLS | 73.49 (6.2) | 11.68 (2.7)
LapRLS | 86.37 (3.1) | 5.99 (1.4)

Table 5: Precision and error rates at the precision-recall break-even points of supervised and transductive algorithms.

Since the exact data sets on which these algorithms were run somewhat differ in preprocessing, preparation and experimental protocol, these results are only meant to suggest that manifold regularization algorithms perform similarly to state-of-the-art methods for transductive inference in text classification problems. The following comments can be made: (a) Transductive categorization with LapSVM and LapRLS leads to significant improvements over inductive categorization with SVM and RLS. (b) Joachims (2003) reports a 91.4% precision-recall break-even point and a 4.6% error rate for TSVM. The results for TSVM reported in the table were obtained when we ran the TSVM implementation using SVM-Light software on this particular data set. The average training time for TSVM was found to be more than 10 times slower than for LapSVM. (c) The cotraining results were obtained on unseen test data sets utilizing additional hyperlink information, which was excluded in our experiments. This additional information is known to improve performance, as demonstrated in Joachims (2003) and Blum and Mitchell (1998).

In the next experiment, we randomly split the WebKB data into a test set of 263 examples and a training set of 779 examples. We noted the performance of inductive and semi-supervised classifiers on unlabeled and test sets as a function of the number of labeled examples in the training set. The performance measure is the precision-recall break-even point (PRBEP), averaged over 100 random data splits.
Figure 8: WebKB text classification experiment. The top panel presents performance in terms of precision-recall break-even points (PRBEP) of RLS, SVM, Laplacian RLS and Laplacian SVM as a function of the number of labeled examples, on the test set (marked as T) and on the unlabeled set (marked as U, of size 779 minus the number of labeled examples). The bottom panel presents performance curves of Laplacian SVM, on the unlabeled and test sets, for different numbers of unlabeled points.

Results are presented in the top panel of Figure 8. The benefit of unlabeled data can be seen by comparing the performance curves of inductive and semi-supervised classifiers.

We also performed experiments with different sizes of the training set, keeping a randomly chosen test set of 263 examples. The bottom panel in Figure 8 presents the quality of transduction and semi-supervised learning with Laplacian SVM (Laplacian RLS performed similarly) as a function of the number of labeled examples for different amounts of unlabeled data. We find that transduction improves with increasing unlabeled data. We expect this to be true for test set performance as well, but do not observe this consistently, possibly because we use a fixed set of parameters that becomes suboptimal as unlabeled data is increased. The optimal choice of the regularization parameters depends on the amount of labeled and unlabeled data, and should be adjusted by the model selection protocol accordingly.

6. Unsupervised and Fully Supervised Cases

While the previous discussion concentrated on the semi-supervised case, our framework covers both unsupervised and fully supervised cases as well. We briefly discuss each in turn.

6.1 Unsupervised Learning: Clustering and Data Representation

In the unsupervised case one is given a collection of unlabeled data points $x_1, \ldots, x_u$. Our basic algorithmic framework embodied in the optimization problem in Equation 2 has three terms: (i) fit to labeled data, (ii) extrinsic regularization and (iii) intrinsic regularization. Since no labeled data is available, the first term does not arise anymore. Therefore we are left with the following optimization problem:

\[ \min_{f \in \mathcal{H}_K} \; \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2 \]

Of course, only the ratio $\gamma = \gamma_A / \gamma_I$ matters. As before, $\|f\|_I^2$ can be approximated using the unlabeled data. Choosing $\|f\|_I^2 = \int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, \nabla_{\mathcal{M}} f \rangle$ and approximating it by the empirical Laplacian, we are left with the following optimization problem:

\[ f^\ast = \operatorname*{argmin}_{\substack{f \in \mathcal{H}_K \\ \sum_i f(x_i) = 0, \; \sum_i f(x_i)^2 = 1}} \; \gamma \|f\|_K^2 + \sum_{i,j} \big( f(x_i) - f(x_j) \big)^2. \qquad (13) \]

Note that to avoid degenerate solutions we need to impose some additional conditions (cf. Belkin and Niyogi, 2003a). It turns out that a version of the Representer theorem still holds, showing that the solution to Equation 13 admits a representation of the form

\[ f^\ast = \sum_{i=1}^{u} \alpha_i K(x_i, \cdot). \]

By substituting back in Equation 13, we come up with the following optimization problem:

\[ \alpha^\ast = \operatorname*{argmin}_{\mathbf{1}^T K \alpha = 0, \; \alpha^T K^2 \alpha = 1} \; \gamma \|f\|_K^2 + \sum_{i,j} \big( f(x_i) - f(x_j) \big)^2, \]

where $\mathbf{1}$ is the vector of all ones, $\alpha = (\alpha_1, \ldots, \alpha_u)$ and $K$ is the corresponding Gram matrix.

Letting $P$ be the projection onto the subspace of $\mathbb{R}^u$ orthogonal to $K\mathbf{1}$, one obtains the solution of the constrained quadratic problem, which is given by the generalized eigenvalue problem

\[ P(\gamma K + K L K) P v = \lambda P K^2 P v. \qquad (14) \]

The final solution is given by $\alpha^\ast = P v$, where $v$ is the eigenvector corresponding to the smallest eigenvalue.

Remark 1: The framework for clustering sketched above provides a method for regularized spectral clustering, where $\gamma$ controls the smoothness of the resulting function in the ambient space. We also obtain a natural out-of-sample extension for clustering points not in the original data set. Figures 9 and 10 show results of this method on two two-dimensional clustering problems. Unlike recent work (Bengio et al., 2004; Brand, 2003) on out-of-sample extensions, our method is based on a Representer theorem for RKHS.

Remark 2: By taking multiple eigenvectors of the system in Equation 14 we obtain a natural regularized out-of-sample extension of Laplacian Eigenmaps. This leads to a new method for dimensionality reduction and data representation. Further study of this approach is a direction of future research. We note that a similar algorithm has been independently proposed in Vert and Yamanishi (2005) in the context of supervised graph inference. A relevant discussion is also presented in Ham et al. (2005) on the interpretation of several geometric dimensionality reduction techniques as kernel methods.
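To make the procedure above concrete, the following sketch solves the generalized eigenvalue problem of Equation 14 by restricting it to the subspace orthogonal to $K\mathbf{1}$, where the constraint $\mathbf{1}^T K \alpha = 0$ holds automatically. It assumes a precomputed, positive definite Gram matrix K (for example, from a Gaussian kernel) and a graph Laplacian L; thresholding the resulting function at zero to obtain two clusters is a natural choice given the constraint $\sum_i f(x_i) = 0$, but is our assumption rather than a rule stated in the text.

```python
import numpy as np
from scipy.linalg import eigh, null_space

def regularized_spectral_clustering(K, L, gamma):
    """Solve P(gamma*K + K L K)P v = lam * P K^2 P v  (Equation 14)
    for the eigenvector with smallest eigenvalue and return alpha,
    so that f(x) = sum_i alpha[i] * K(x_i, x)."""
    u = K.shape[0]
    k1 = K @ np.ones(u)
    Q = null_space(k1[None, :])            # orthonormal basis of {v : v^T K1 = 0}
    A = Q.T @ (gamma * K + K @ L @ K) @ Q  # reduced left-hand side
    B = Q.T @ K @ K @ Q                    # reduced right-hand side, pos. def.
    lam, V = eigh(A, B)                    # generalized symmetric eigenproblem
    return Q @ V[:, 0]                     # eigenvector of the smallest eigenvalue

# In-sample cluster labels: np.sign(K @ alpha). A new point x is assigned by
# the sign of sum_i alpha[i] * K(x_i, x), the out-of-sample extension of Remark 1.
```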
Figure 9: Two Moons data set: regularized clustering (panels: $\gamma_A = 10^{-6}$, $\gamma_A = 10^{-4}$ and $\gamma_A = 0.1$, with $\gamma_I = 1$ throughout).

Figure 10: Two Spirals data set: regularized clustering (panels: $\gamma_A = 10^{-6}$, $\gamma_A = 10^{-3}$ and $\gamma_A = 0.1$, with $\gamma_I = 1$ throughout).

6.2 Fully Supervised Learning

The fully supervised case represents the other end of the spectrum of learning. Since standard supervised algorithms (SVM and RLS) are special cases of manifold regularization, our framework is also able to deal with a labeled data set containing no unlabeled examples. Additionally, manifold regularization can augment supervised learning with intrinsic regularization, possibly in a class-dependent manner, which suggests the following algorithm:

\[ f^\ast = \operatorname*{argmin}_{f \in \mathcal{H}_K} \; \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_+}{(u+l)^2} \mathbf{f}_+^T L_+ \mathbf{f}_+ + \frac{\gamma_-}{(u+l)^2} \mathbf{f}_-^T L_- \mathbf{f}_-. \]

Here we introduce two intrinsic regularization parameters $\gamma_+$, $\gamma_-$ and regularize separately for the two classes: $\mathbf{f}_+$, $\mathbf{f}_-$ are the vectors of evaluations of the function $f$, and $L_+$, $L_-$ are the graph Laplacians, on positive and negative examples respectively. The solution of the above problem for RLS and SVM can be obtained by replacing $\gamma_I L$ by the block-diagonal matrix

\[ \begin{pmatrix} \gamma_+ L_+ & 0 \\ 0 & \gamma_- L_- \end{pmatrix} \]

in the manifold regularization formulas given in Section 4. Detailed experimental study of this approach to supervised learning is left for future work.
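As an illustration, here is a minimal sketch of the class-dependent construction for the RLS case. It assumes the points are ordered with the positive examples first (so the block-diagonal regularizer applies directly), that per-class adjacency matrices W_plus and W_minus are given, and that the LapRLS closed form of Section 4 is used with $\gamma_I L$ replaced by the block-diagonal matrix above; with no unlabeled data, $u = 0$ and the intrinsic term is scaled by $l/(u+l)^2 = 1/l$.

```python
import numpy as np

def class_dependent_rls(K, y, gamma_A, gamma_plus, gamma_minus, W_plus, W_minus):
    """Fully supervised LapRLS with class-dependent intrinsic regularization.
    K: (l, l) Gram matrix; y: labels in {-1, +1}, ordered positives first;
    W_plus / W_minus: adjacency matrices over the two classes (assumed given).
    Returns alpha with f(x) = sum_i alpha[i] * K(x_i, x)."""
    l = K.shape[0]
    L_plus = np.diag(W_plus.sum(axis=1)) - W_plus      # Laplacian on positives
    L_minus = np.diag(W_minus.sum(axis=1)) - W_minus   # Laplacian on negatives
    lp = L_plus.shape[0]
    M = np.zeros((l, l))                               # block-diagonal regularizer
    M[:lp, :lp] = gamma_plus * L_plus
    M[lp:, lp:] = gamma_minus * L_minus
    # Assumed closed form with u = 0 (all points labeled), so l/(u+l)^2 = 1/l:
    A = K + gamma_A * l * np.eye(l) + (1.0 / l) * (M @ K)
    return np.linalg.solve(A, y)
```

How the per-class graphs are built (for instance, nearest neighbors restricted to each class) is not specified in the text and is left here as a modeling choice.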
7. Conclusions and Further Directions

We have provided a novel framework for data-dependent geometric regularization. It is based on a new Representer theorem that provides a basis for several algorithms for unsupervised, semi-supervised and fully supervised learning. This framework brings together ideas from the theory of regularization in reproducing kernel Hilbert spaces, manifold learning and spectral methods. There are several directions of future research:

1. Convergence and generalization error: The crucial issue of the dependence of the generalization error on the number of labeled and unlabeled examples is still very poorly understood. Some very preliminary steps in that direction have been taken in Belkin et al. (2004).

2. Model selection: Model selection involves choosing appropriate values for the extrinsic and intrinsic regularization parameters. We do not as yet have a good understanding of how to choose these parameters. More systematic procedures need to be developed.

3. Efficient algorithms: The naive implementations of our algorithms have cubic complexity in the number of labeled and unlabeled examples, which is restrictive for large scale real-world applications. Scalability issues need to be addressed.

4. Additional structure: In this paper we have shown how to incorporate the geometric structure of the marginal distribution into the regularization framework. We believe that this framework will extend to other structures that may constrain the learning task and bring about effective learnability. One important example of such structure is invariance under certain classes of natural transformations, such as invariance under lighting conditions in vision. Some ideas are presented in Sindhwani (2004).

Acknowledgments

We are grateful to Marc Coram, Steve Smale and Peter Bickel for intellectual support and to NSF funding for financial support. We would like to acknowledge the Toyota Technological Institute for its support for this work. We also thank the anonymous reviewers for helping to improve the paper.

References

R. A. Adams. Sobolev Spaces. Academic Press, New York, 1975.

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

M. Belkin. Problems of Learning on Manifolds. PhD thesis, The University of Chicago, 2003.

M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. COLT, 2004.

M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003a.

M. Belkin and P. Niyogi. Using manifold structure for partially labeled classification. Advances in Neural Information Processing Systems, 15:929–936, 2003b.

M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Proc. of COLT, 2005.

M. Belkin, P. Niyogi, and V. Sindhwani. On manifold regularization. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), 2005.

Y. Bengio, J.-F. Paiement, P. Vincent, and O. Delalleau. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Advances in Neural Information Processing Systems, 16, 2004.

K. Bennett and A. Demiriz. Semi-supervised support vector machines. Advances in Neural Information Processing Systems, 11:368–374, 1999.

A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. Proc. 18th International Conf. on Machine Learning, pages 19–26, 2001.

A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.

O. Bousquet, O. Chapelle, and M. Hein. Measure based regularization. Advances in Neural Information Processing Systems, 16, 2004.

M. Brand. Continuous nonlinear dimensionality reduction by kernel eigenmaps. Int. Joint Conf. Artif. Intel., 2003.

O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-supervised Learning. MIT Press, 2006.

O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. Advances in Neural Information Processing Systems, 15:585–592, 2003.

O. Chapelle and A. Zien. Semi-supervised classification by low density separation. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 57–64, 2005.

F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.

R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431, 2005.

A. Corduneanu and T. Jaakkola. On information regularization. Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence, 2003.

F. Cucker and S. Smale. On the mathematical foundations of learning. American Mathematical Society, 39(1):1–49, 2002.

O. Delalleau, Y. Bengio, and N. Le Roux. Efficient non-parametric function induction in semi-supervised learning. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), 2005.

M. P. Do Carmo. Riemannian Geometry. Birkhäuser, 1992.

D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003.

S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. Proceedings of the Seventh International Conference on Information and Knowledge Management, 11:16, 1998.

T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.

G. Fung and O. L. Mangasarian. Semi-supervised support vector machines for unlabeled data classification. Optimization Methods and Software, 15(1):99–105, 2001.

A. Grigor'yan. Heat kernels on weighted manifolds and applications. Cont. Math., 398:93–191, 2006.

J. Ham, D. D. Lee, and L. K. Saul. Semisupervised alignment of manifolds. Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence, Z. Ghahramani and R. Cowell, Eds., 10:120–127, 2005.

M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds - weak and strong pointwise consistency of graph Laplacians. Proceedings of the 18th Conference on Learning Theory (COLT), pages 470–485, 2005.

T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the Sixteenth International Conference on Machine Learning, pages 200–209, 1999.

T. Joachims. Transductive learning via spectral graph partitioning. Proceedings of the International Conference on Machine Learning, pages 290–297, 2003.

C. Kemp, T. L. Griffiths, S. Stromsten, and J. B. Tenenbaum. Semi-supervised learning with trees. Advances in Neural Information Processing Systems, 16, 2004.

R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. Proc. 19th International Conf. on Machine Learning, 2002.

S. Lafon. Diffusion Maps and Geometric Harmonics. PhD thesis, Yale University, 2004.

K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2):103–134, 2000.

R. M. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. PhD thesis, Massachusetts Institute of Technology, 2002.

S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323, 2000.

B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. Proceedings, First International Conference on Knowledge Discovery & Data Mining, Menlo Park, 1995.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, Mass., 2002.

M. Seeger. Learning with labeled and unlabeled data. Technical report, Institute for Adaptive and Neural Computation, University of Edinburgh, 2001.

V. Sindhwani. Kernel machines for semi-supervised learning. Master's thesis, The University of Chicago, 2004.

V. Sindhwani, M. Belkin, and P. Niyogi. The geometric basis of semi-supervised learning. In O. Chapelle, A. Zien, and B. Schölkopf, editors, Semi-supervised Learning, chapter 12, pages 217–235. MIT Press, 2006.

V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In Proceedings, Twenty-Second International Conference on Machine Learning, 2005.

A. Smola and R. Kondor. Kernels and regularization on graphs. Conference on Learning Theory, COLT/KW, 2003.

M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. Advances in Neural Information Processing Systems, 14:945–952, 2002.

J. B. Tenenbaum, V. Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319, 2000.

A. N. Tikhonov. Regularization of incorrectly posed problems. Sov. Math. Dokl., 4:1624–1627, 1963.

I. W. Tsang and J. T. Kwok. Very large scale manifold regularization using core vector machines. NIPS 2005 Workshop on Large Scale Kernel Machines, 2005.

V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

J.-P. Vert and Y. Yamanishi. Supervised graph inference. Advances in Neural Information Processing Systems, 17:1433–1440, 2005.

U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Max Planck Institute for Biological Cybernetics Technical Report TR 134, 2004.

G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1990.

D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems, 16:321–328, 2004.

X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, 2003.

X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems, 17, 2005.