Journal of Machine Learning Research 4 (2003) 713-742. Submitted 11/02; Published 10/03.

Greedy Algorithms for Classification – Consistency, Convergence Rates, and Adaptivity

Shie Mannor, SHIE@MIT.EDU
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139

Ron Meir, RMEIR@EE.TECHNION.AC.IL
Department of Electrical Engineering, Technion, Haifa 32000, Israel

Tong Zhang, TZHANG@WATSON.IBM.COM
IBM T.J. Watson Research Center, Yorktown Heights, NY 10598

Editor: Yoram Singer

©2003 Shie Mannor, Ron Meir, and Tong Zhang.

Abstract

Many regression and classification algorithms proposed over the years can be described as greedy procedures for the stagewise minimization of an appropriate cost function. Some examples include additive models, matching pursuit, and boosting. In this work we focus on the classification problem, for which many recent algorithms have been proposed and applied successfully. For a specific regularized form of greedy stagewise optimization, we prove consistency of the approach under rather general conditions. Focusing on specific classes of problems, we provide conditions under which our greedy procedure achieves the (nearly) minimax rate of convergence, implying that the procedure cannot be improved in a worst case setting. We also construct a fully adaptive procedure, which, without knowing the smoothness parameter of the decision boundary, converges at the same rate as if the smoothness parameter were known.

1. Introduction

The problem of binary classification plays an important role in the general theory of learning and estimation. While this problem is the simplest supervised learning problem one may envisage, there are still many open issues related to the best approach to solving it. In this paper we consider a family of algorithms based on a greedy stagewise minimization of an appropriate smooth loss function, and the construction of a composite classifier by combining simple base classifiers obtained by the stagewise procedure. Such procedures have been known for many years in the statistics literature as additive models (Hastie and Tibshirani, 1990, Hastie et al., 2001) and have also been used in the signal processing community under the title of matching pursuit (Mallat and Zhang, 1993). More recently, it has transpired that the boosting algorithm proposed in the machine learning community (Schapire, 1990, Freund and Schapire, 1997), which was based on a very different motivation, can also be thought of as a stagewise greedy algorithm (e.g., Breiman, 1998, Friedman et al., 2000, Schapire and Singer, 1999, Mason et al., 2000, Meir and Rätsch, 2003). In spite of the connections of these algorithms to earlier work in the field of statistics, it is only recently that certain questions
have been addressed. For example, the notion of the margin and its impact on performance (Vapnik, 1998, Schapire et al., 1998), the derivation of sophisticated finite sample bounds (e.g., Bartlett et al., 2002, Bousquet and Chapelle, 2002, Koltchinskii and Panchenko, 2002, Zhang, 2002, Antos et al., 2002), and the utilization of a range of different cost functions (Mason et al., 2000, Friedman et al., 2000, Lugosi and Vayatis, 2001, Zhang, 2002, Mannor et al., 2002a) are but a few of the recent contributions to this field.

Boosting algorithms have been demonstrated to be very effective in many applications, a success which led to some initial hopes that boosting does not overfit. However, it became clear very quickly that boosting may in fact overfit badly (e.g., Dietterich, 1999, Schapire and Singer, 1999) if applied without regularization. In order to address the issue of overfitting, several authors have recently addressed the question of statistical consistency. Roughly speaking, consistency of an algorithm with respect to a class of distributions implies that the loss incurred by the procedure ultimately converges to the lowest loss possible as the size of the sample increases without limit (a precise definition is provided in Section 2.1). Given that an algorithm is consistent, a question arises as to the rates of convergence to the minimal loss. In this context, a classic approach looks at the so-called minimax criterion, which essentially measures the performance of the best estimator for the worst possible distribution in a class. Ideally, we would like to show that an algorithm achieves the minimax (or close to minimax) rate.

Finally, we address the issue of adaptivity. In computing minimax rates one usually assumes that there is a certain parameter θ characterizing the smoothness of the target distribution. This parameter is usually assumed to be known in order to compute the minimax rates. For example, the parameter θ may correspond to the Lipschitz constant of a decision boundary. In practice, however, one usually does not know the value of θ. In this context one would like to construct algorithms which are able to achieve the minimax rates without knowing the value of θ in advance. Such procedures have been termed adaptive in the minimax sense by Barron et al. (1999). Using a boosting model that is somewhat different from ours, it was shown by Bühlmann and Yu (2003) that boosting with early stopping achieves the exact minimax rates of Sobolev classes with a linear smoothing spline weak learner, and that the procedure adapts to the unknown smoothness of the Sobolev class. The stagewise greedy minimization algorithm considered in this work is natural and closer to algorithms that are used in practice. This is in contrast to the standard approach of selecting a hypothesis from a particular hypothesis class. Thus, our approach provides a bridge between theory and practice, since we use theoretical tools to analyze a widely used practical approach.

The remainder of this paper is organized as follows. We begin in Section 2 with some formal definitions of consistency, minimaxity and adaptivity, and recall some recent tools from the theory of empirical processes. In Section 3 we introduce a greedy stagewise algorithm for classification, based on rather general loss functions, and prove the universal consistency of the algorithm. In Section 4 we then specialize to the case of the squared loss, for which recent results from the theory of empirical processes enable the establishment of fast rates of convergence. We also introduce an adaptive regularization algorithm which is shown to lead to nearly minimax rates even if we do not assume a priori knowledge of θ. We then present some numerical results in Section 5, which demonstrate the importance of regularization. We conclude the paper in Section 6 and present some open questions.
2. Background and Preliminary Results

We begin with the standard formal setup for supervised learning. Let (Z, A, P) be a probability space and let F be a class of A-measurable functions from Z to R. In the context of learning one takes Z = X × Y, where X is the input space and Y is the output space. We let S = {(X_1, Y_1), ..., (X_m, Y_m)} denote a sample generated independently at random according to the probability distribution P = P_{X,Y}; in the sequel we drop subscripts (such as X, Y) from P, as the argument of P will suffice to specify the particular probability. In this paper we consider the problem of classification, where Y = {−1, +1} and X = R^d, and where the decision is made by taking the sign of a real-valued function f(x). Consider the 0-1 loss function given by

$$ \ell(y, f(x)) = I[\,y f(x) \le 0\,], \qquad (1) $$

where I[E] is the indicator function of the event E, and the expected loss is given by

$$ L(f) = E\, \ell(Y, f(X)). \qquad (2) $$

Using the notation η(x) := P(Y = 1 | X = x), it is well known that L*, the minimum of L(f), can be achieved by setting f(x) = 2η(x) − 1 (e.g., Devroye et al., 1996). Note that the decision choice at the point f(x) = 0 is not essential in the analysis. In this paper we simply assume that ℓ(y, 0) = 1/2, so that the decision rule 2η(x) − 1 is Bayes optimal at η(x) = 1/2.

2.1 Consistency, Minimaxity and Adaptivity

Based on a sample S, we wish to construct a rule f which assigns to each new input x a (soft) label f(S, x), for which the expected loss L(f, S) = E ℓ(Y, f(S, X)) is minimal. Since S is a random variable, so is L(f, S), so that one can only expect to make probabilistic statements about this random variable. In this paper we follow standard notation within the statistics literature, and denote sample-dependent quantities by a hat above the variable. Thus, we replace f(S, x) by f̂(x). In general, one has at one's disposal only the sample S, and perhaps some very general knowledge about the problem at hand, often in the form of some regularity assumptions about the probability distribution P. Within the PAC setting (e.g., Kearns and Vazirani, 1994), one makes the very stringent assumption that Y_i = g(X_i) and that g belongs to some known function class. Later work considered the so-called agnostic setting (e.g., Anthony and Bartlett, 1999), where nothing is assumed about g, and one compares the performance of f̂ to that of the best hypothesis f* within a given model class F, namely f* = argmin_{f∈F} L(f) (in order to avoid unnecessary complications, we assume f* exists). However, in general one is interested in comparing the behavior of the empirical estimator f̂ to that of the optimal Bayes estimator, which minimizes the probability of error. The difficulty, of course, is that the determination of the Bayes classifier g_B(x) = 2η(x) − 1 requires knowledge of the underlying probability distribution. In many situations, one possesses some general knowledge about the underlying class of distributions P, usually in the form of some kind of smoothness assumption. For example, one may assume that η(x) = P(Y = 1 | x) is a Lipschitz function, namely |η(x) − η(x')| ≤ K‖x − x'‖ for all x and x'. Let us denote the class of possible distributions by P, and an empirical estimator based on a sample of size m by f̂_m.

Next, we introduce the notion of consistency. Roughly, a classification procedure leading to a classifier f̂_m is consistent with respect to a class of distributions P if the loss L(f̂_m) converges, for increasing sample sizes, to the minimal loss possible for this class. More formally, the following definition is the standard definition of strong consistency (e.g., Devroye et al., 1996).
Definition 1 A classification algorithm leading to a classifier f̂_m is strongly consistent with respect to a class of distributions P if for every P ∈ P,

$$ \lim_{m \to \infty} L(\hat f_m) = L^*, \qquad P\text{-almost surely}. $$

If X ⊆ R^d and P contains all Borel probability measures, we say that the algorithm is universally consistent.

In this work we show that algorithms based on stagewise greedy minimization of a convex upper bound on the 0-1 loss are consistent with respect to the class of distributions P, where certain regularity assumptions will be made concerning the class conditional distribution η(x). Consistency is clearly an important property for any learning algorithm, as it guarantees that the algorithm ultimately performs well, in the sense of asymptotically achieving the minimal loss possible. One should keep in mind, though, that consistent algorithms are not necessarily optimal when only a finite amount of data is available. A classic example of the lack of finite-sample optimality of consistent algorithms is the James-Stein effect (see, for example, Robert, 2001, Section 2.8.2).

In order to quantify the performance more precisely, we need to be able to say something about the speed at which convergence to L* takes place. In order to do so, we need to determine a yardstick by which to measure distance. A classic measure which we use here is the so-called minimax rate of convergence, which essentially measures the performance of the best empirical estimator on the most difficult distribution in P. Let the class of possible distributions be characterized by a parameter θ, namely P = P_θ. For example, assuming that η(x) is Lipschitz, θ could represent the Lipschitz constant. Formally, the minimax risk is given by

$$ r_m(\theta) = \inf_{\hat f_m} \sup_{P \in P_\theta} E\, \ell(Y, \hat f_m(X)) - L^*, $$

where f̂_m is any estimator based on a sample S of size m, and the expectation is taken with respect to X, Y and the m-sample S. The rate at which the minimax risk converges to zero has been computed in the context of binary classification for several classes of distributions by Yang (1999).

So far we have characterized the smoothness of the distribution P by a parameter θ. However, in general one does not possess any prior information about θ, except perhaps that it is finite. The question then arises as to whether one can design an adaptive scheme which constructs an estimator f̂_m without any knowledge of θ, and for which convergence to L* at the minimax rates (which assume knowledge of θ) can be guaranteed. Following Barron et al. (1999) we refer to such a procedure as adaptive in the minimax sense.

2.2 Some Technical Tools

We begin with a few useful results. Let {σ_i}_{i=1}^m be a sequence of binary random variables such that σ_i = ±1 with probability 1/2. The Rademacher complexity of F (e.g., van der Vaart and Wellner, 1996) is given by

$$ R_m(F) := E \sup_{f \in F} \Big| \frac{1}{m} \sum_{i=1}^m \sigma_i f(X_i) \Big|, $$

where the expectation is over {σ_i} and {X_i}. See Bartlett and Mendelson (2002) for some properties of R_m(F). The following theorem can be obtained by a slight modification of the proof of Theorem 1 of Koltchinskii and Panchenko (2002).
Theorem 2 (Adapted from Theorem 1 in Koltchinskii and Panchenko, 2002) Let {X_1, X_2, ..., X_m} ∈ X be a sequence of points generated independently at random according to a probability distribution P, and let F be a class of measurable functions from X to R. Furthermore, let φ be a non-negative Lipschitz function with Lipschitz constant κ, such that sup_{x∈X} |φ(f(x))| ≤ M for all f ∈ F. Then with probability at least 1 − δ,

$$ E\, \phi(f(X)) - \frac{1}{m} \sum_{i=1}^m \phi(f(X_i)) \le 4 \kappa R_m(F) + M \sqrt{\frac{\log(1/\delta)}{2m}} $$

for all f ∈ F.

For many function classes, the Rademacher complexity can be estimated directly. Results summarized by Bartlett and Mendelson (2002) are useful for bounding this quantity for algebraic compositions of function classes. We recall the relation between Rademacher complexity and covering numbers. For completeness we repeat the standard definition of covering numbers and entropy (e.g., van der Vaart and Wellner, 1996), which are related to the Rademacher complexity.

Definition 3 Let F be a class of functions, and let ρ be a distance measure between functions in F. The covering number N(ε, F, ρ) is the minimal number of balls {g : ρ(g, f) ≤ ε} of radius ε needed to cover the set. The entropy of F is the logarithm of the covering number.

Let X = {X_1, ..., X_m} be a set of points and let Q_m be a probability measure over these points. We define the ℓ_p(Q_m) distance between any two functions f and g as

$$ \ell_p(Q_m)(f, g) = \Big( \sum_{i=1}^m Q_m(x_i)\, |f(x_i) - g(x_i)|^p \Big)^{1/p}. $$

In this case we denote the (empirical) covering number of F by N(ε, F, ℓ_p(Q_m)). The uniform ℓ_p covering number and the uniform entropy are given respectively by

$$ N_p(\varepsilon, F, m) = \sup_{Q_m} N(\varepsilon, F, \ell_p(Q_m)), \qquad H_p(\varepsilon, F, m) = \log N_p(\varepsilon, F, m), $$

where the supremum is over all probability distributions Q_m over sets of m points sampled from X. In the special case p = 2, we abbreviate the notation, setting H(ε, F, m) ≡ H_2(ε, F, m). Let ℓ_m denote the empirical ℓ_2 norm with respect to the uniform measure on the points {X_1, X_2, ..., X_m}, namely ℓ_m(f, g) = ((1/m) Σ_{i=1}^m |f(X_i) − g(X_i)|²)^{1/2}. If F contains 0, then there exists a constant C such that (see Corollary 2.2.8 in van der Vaart and Wellner, 1996)

$$ R_m(F) \le \frac{C}{\sqrt{m}}\, E \int_0^\infty \sqrt{\log N(\varepsilon, F, \ell_m)}\, d\varepsilon, \qquad (3) $$

where the expectation is taken with respect to the choice of m points. We note that the approach of using Rademacher complexity and the ℓ_m covering number of a function class can often result in tighter bounds than some of the earlier studies that employed the ℓ_1^m covering number (for example, in Pollard, 1984). Moreover, the ℓ_2 covering numbers are directly related to the minimax rates of convergence (Yang, 1999).
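The Rademacher complexity defined in Section 2.2 is an expectation over both the sample and the random signs, so for any concrete, finite collection of functions it can be approximated by straightforward Monte Carlo. The following sketch (in Python; an illustration under assumed inputs, not code from the paper) estimates the empirical version of R_m(F) for a small hypothetical class of one-dimensional threshold functions; averaging additionally over fresh draws of the data would approximate the full expectation.

```python
# Monte Carlo estimate of the (empirical) Rademacher complexity of Section 2.2
# for a small, hypothetical class F of one-dimensional threshold functions.
# The class, the data distribution, and all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def rademacher_complexity(F_values, n_sign_draws=2000, rng=rng):
    """F_values: array of shape (num_functions, m) holding f(X_i) for each f in F.
    Returns a Monte Carlo estimate of E_sigma sup_f |(1/m) sum_i sigma_i f(X_i)|."""
    m = F_values.shape[1]
    total = 0.0
    for _ in range(n_sign_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)        # Rademacher signs
        total += np.max(np.abs(F_values @ sigma) / m)  # sup over the finite class
    return total / n_sign_draws

# A toy class: f_c(x) = sign(x - c) for thresholds c on a grid, evaluated on m points.
m = 200
X = rng.normal(size=m)
thresholds = np.linspace(-2.0, 2.0, 41)
F_values = np.sign(X[None, :] - thresholds[:, None])

print("estimated R_m(F) for m =", m, ":", rademacher_complexity(F_values))
```

For simple VC-type classes such as this one, the estimate decays roughly like 1/√m up to logarithmic factors, consistent with the entropy integral bound (3) above.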
2.3 Related Results

We discuss some previous work related to the issues studied in this work. The question of the consistency of boosting algorithms has attracted some attention in recent years. Jiang, following Breiman (2000), raised the questions of whether AdaBoost is consistent and whether regularization is needed. It was shown in Jiang (2000b) that AdaBoost is consistent at some point in the process of boosting. Since no stopping conditions were provided, this result essentially does not determine whether boosting forever is consistent or not. A one-dimensional example was provided by Jiang (2000a), where it was shown that AdaBoost is not consistent in general, since it tends to a nearest neighbor rule. Furthermore, it was shown in the example that for noiseless situations AdaBoost is in fact consistent. The conclusion from this series of papers is that boosting forever for AdaBoost is not consistent and that sometimes along the boosting process a good classifier may be found.

In a recent paper, Lugosi and Vayatis (2001) also presented an approach to establishing consistency based on the minimization of a convex upper bound on the 0-1 loss. According to this approach the convex cost function is modified depending on the sample size. By making the convex cost function sharper as the number of samples increases, it was shown that the solution to the convex optimization problem yields a consistent classifier. Finite sample bounds are also provided by Lugosi and Vayatis (2001, 2002). The major differences between our work and (Lugosi and Vayatis, 2001, 2002) are the following: (i) The precise nature of the algorithms used is different; in particular the approach to regularization is different. (ii) We establish convergence rates and provide conditions for establishing adaptive minimaxity. (iii) We consider stagewise procedures based on greedily adding a single base hypothesis at a time. The work of Lugosi and Vayatis (2002) focused on the effect of using a convex upper bound on the 0-1 loss.

A different kind of consistency result was established by Mannor and Meir (2001, 2002). In this work, geometric conditions needed to establish the consistency of boosting with linear weak learners were established. It was shown that if the Bayes error is zero (and the oppositely labelled points are well separated), then AdaBoost is consistent.

Zhang (2002) studied an approximation-estimation decomposition of binary classification methods based on minimizing some convex cost functions. The focus there was on approximation error analysis as well as behaviors of different convex cost functions. The author also studied estimation errors for kernel methods, including support vector machines, and established universal consistency results. However, the paper does not contain any specific result for boosting algorithms.

All of the work discussed above deals with the issue of consistency. This paper extends our earlier results (Mannor et al., 2002a), where we proved consistency for certain regularized greedy boosting algorithms. Here we go beyond consistency and consider rates of convergence and investigate the adaptivity of the approach.

3. Consistency of Methods Based on Greedy Minimization of a Convex Upper Bound

Consider a class of so-called base hypotheses H, and assume that it is closed under negation. We define the order-t convex hull of H as

$$ CO_t(H) = \Big\{ f : f(x) = \sum_{i=1}^t \alpha_i h_i(x),\ \alpha_i \ge 0,\ \sum_{i=1}^t \alpha_i \le 1,\ h_i \in H \Big\}. $$
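As a concrete reading of this definition, the following few lines (illustrative only; the sign ridge functions and all constants are assumptions, not the paper's choices) assemble one element of CO_t(H) and evaluate it at a point.

```python
# Constructing an element of CO_t(H): f(x) = sum_i alpha_i h_i(x) with alpha_i >= 0,
# sum_i alpha_i <= 1, h_i in H. The base hypotheses here are hypothetical sign ridge
# functions taking values in {-1, +1}, so f takes values in [-1, 1].
import numpy as np

rng = np.random.default_rng(1)
d, t = 2, 5
W = rng.normal(size=(t, d))           # directions of the ridge functions
b = rng.normal(size=t)                # offsets
alpha = rng.random(t)
alpha = 0.9 * alpha / alpha.sum()     # alpha_i >= 0 and sum_i alpha_i <= 1

def f(x):
    """One member of CO_t(H) for H = {sign(w.x + b)} (an illustrative choice of H)."""
    h = np.sign(W @ x + b)
    return float(alpha @ h)

x = rng.normal(size=d)
print(f(x), "lies in [-1, 1]")
```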
The convex hull of H, denoted by CO(H), is given by taking the limit t → ∞. The algorithms considered in this paper construct a composite hypothesis by choosing a function f from βCO(H), where for any class G, βG = {f : f = βg, g ∈ G}. The parameter β will be specified at a later stage. We assume throughout that functions in H take values in [−1, 1]. This implies that functions in βCO(H) take values in [−β, β]. Since the space βCO(H) may be huge, we consider algorithms that sequentially and greedily select a hypothesis from βH. Moreover, since minimizing the 0-1 loss is often intractable, we consider approaches which are based on minimizing a convex upper bound on the 0-1 loss. The main contribution of this section is the demonstration of the consistency of such a procedure.

To describe the algorithm, let φ(x) be a convex function which upper bounds the 0-1 loss, namely

$$ \phi(y f(x)) \ge I[\,y f(x) \le 0\,], \qquad \phi(u)\ \text{convex}. $$

Specific examples for φ are given in Section 3.3. Consider the empirical and true losses incurred by a function f based on the loss φ,

$$ \hat A(f) := \frac{1}{m} \sum_{i=1}^m \phi(y_i f(x_i)), \qquad A(f) := E_{X,Y}\, \phi(Y f(X)) = E_X \big\{ \eta(X) \phi(f(X)) + (1 - \eta(X)) \phi(-f(X)) \big\}. $$

Here E_{X,Y} is the expectation operator with respect to the measure P and E_X is the expectation with respect to the marginal on X.

3.1 Approximation by Convex Hulls of Small Classes

In order to achieve consistency with respect to a large class of distributions, one must demand that the class βCO(H) is 'large' in some well-defined sense. For example, if the class H consists only of polynomials of a fixed order, then we cannot hope to approximate arbitrary continuous functions, since CO(H) also consists solely of polynomials of a fixed order. However, there are classes of non-polynomial functions for which βCO(H) is large. As an example, consider a univariate (i.e., one-dimensional) function σ : R → [0, 1]. The class of symmetric ridge functions over R^d is defined as

$$ H_\sigma := \{ \sigma(a \cdot x + b) : a \in R^d,\ b \in R \}. $$

Recall that for a class of functions F, SPAN(F) consists of all linear combinations of functions from F. It is known from Leshno et al. (1993) that the span of H_σ is dense in the set of continuous functions over a compact set. Since SPAN(H_σ) = ∪_{β≥0} βCO(H_σ), it follows that every continuous function mapping from a compact set Ω to R can be approximated with arbitrary precision by some g in βCO(H_σ) for a large enough β. For the case where h(x) = sgn(w·x + b), Barron (1992) defines the class

$$ SPAN_C(H) = \Big\{ f : f(x) = \sum_i c_i\, \mathrm{sgn}(w_i \cdot x + b_i),\ c_i, b_i \in R,\ w_i \in R^d,\ \sum_i |c_i| \le C \Big\} $$
and refers to it as the class of functions with bounded variation with respect to half-spaces. In one dimension, this is simply the class of functions with bounded variation. Note that there are several extensions of the notion of bounded variation to multiple dimensions. We return to this class of functions in Section 4.2. Other classes of base functions which generate rich nonparametric sets of functions are free-knot splines (see Agarwal and Studden, 1980, for asymptotic properties) and radial basis functions (e.g., Schaback, 2000).

3.2 A Greedy Stagewise Algorithm and Finite Sample Bounds

Based on a finite sample S_m, we cannot hope to minimize A(f), but rather minimize its empirical counterpart Â(f). Instead of minimizing Â(f) directly, we consider a stagewise greedy algorithm, which is described in Figure 1.

Input: a sample S_m; a stopping time t*; a constant β.
Algorithm:
  1. Set f̂⁰_{β,m} = 0.
  2. For t = 1, 2, ..., t*:
       (ĥ_t, α̂_t, β̂_t) = argmin_{h ∈ H, 0 ≤ α ≤ 1, 0 ≤ β' ≤ β} Â((1 − α) f̂^{t−1}_{β,m} + α β' h),
       f̂^t_{β,m} = (1 − α̂_t) f̂^{t−1}_{β,m} + α̂_t β̂_t ĥ_t.
Output: the classifier f̂^{t*}_{β,m}.

Figure 1: A sequential greedy algorithm based on the convex empirical loss function Â.

The algorithm proposed is related to the AdaBoost algorithm in incrementally minimizing a given convex loss function. In opposition to AdaBoost, we restrict the size of the weights α and β, which serves to regularize the algorithm, a procedure that will play an important role in the sequel. We also observe that many of the additive models introduced in the statistical literature (e.g., Hastie et al., 2001) operate very similarly to Figure 1. It is clear from the description of the algorithm that f̂^t_{β,m}, the hypothesis generated by the procedure, belongs to βCO_t(H) for every t. Note also that, by the definition of φ, for fixed α and β the function Â((1 − α) f̂^{t−1}_{β,m} + αβh) is convex in h.

We observe that many recent approaches to boosting-type algorithms (e.g., Breiman, 1998, Hastie et al., 2001, Mason et al., 2000, Schapire and Singer, 1999) are based on algorithms similar to the one presented in Figure 1. Two points are worth noting. First, at each step t, the value of the previous composite hypothesis f̂^{t−1}_{β,m} is multiplied by (1 − α), a procedure which is usually not followed in other boosting-type algorithms; this ensures that the composite function at every step remains in βCO(H). Second, the parameters α and β are constrained at every stage; this serves as a regularization measure and prevents overfitting.

In order to analyze the behavior of the algorithm, we need several definitions. For η ∈ [0, 1] and f ∈ R let

$$ G(\eta, f) = \eta\, \phi(f) + (1 - \eta)\, \phi(-f). $$
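The next sketch spells out the procedure of Figure 1 in code (Python). It is a minimal illustration, not the authors' implementation: the base class is a hypothetical finite pool of sign ridge functions, the inner argmin is replaced by a grid search over (h, α, β'), and the squared loss of Sections 3.3 and 4 stands in for the convex surrogate φ.

```python
# A minimal sketch of the stagewise procedure of Figure 1 (illustrative assumptions:
# finite stump pool, grid search for the inner argmin, squared surrogate loss).
import numpy as np

def phi(u):
    """Squared surrogate loss, a convex upper bound on the 0-1 loss I[u <= 0]."""
    return (1.0 - u) ** 2

def make_stump_pool(X, n_dirs=25, rng=None):
    """Hypothetical base class: h(x) = sign(w.x + b), taking values in [-1, 1]."""
    rng = np.random.default_rng(rng)
    pool = []
    for _ in range(n_dirs):
        w = rng.normal(size=X.shape[1])
        for q in np.quantile(X @ w, np.linspace(0.1, 0.9, 7)):
            pool.append((w, -q))
    return pool

def eval_h(h, X):
    w, b = h
    return np.sign(X @ w + b + 1e-12)

def greedy_stagewise(X, y, beta=1.0, t_star=25, rng=None):
    """Figure 1: f^t = (1 - alpha) f^{t-1} + alpha * beta' * h, chosen greedily."""
    pool = make_stump_pool(X, rng=rng)
    H_vals = np.array([eval_h(h, X) for h in pool])  # precomputed h(x_i) for the pool
    f = np.zeros(X.shape[0])                         # step 1: f^0 = 0
    alphas = np.linspace(0.0, 1.0, 11)               # grid over 0 <= alpha <= 1
    betas = np.linspace(0.0, beta, 6)                # grid over 0 <= beta' <= beta
    chosen = []
    for _ in range(t_star):                          # step 2: t = 1, ..., t*
        best = None
        for j, hv in enumerate(H_vals):
            for a in alphas:
                for bp in betas:
                    risk = np.mean(phi(y * ((1.0 - a) * f + a * bp * hv)))
                    if best is None or risk < best[0]:
                        best = (risk, j, a, bp)
        _, j, a, bp = best
        f = (1.0 - a) * f + a * bp * H_vals[j]       # stays in beta * CO_t(H)
        chosen.append((pool[j], a, bp))
    return chosen

def predict(chosen, X):
    """Replay the stagewise combination on new points and take the sign."""
    f = np.zeros(X.shape[0])
    for h, a, bp in chosen:
        f = (1.0 - a) * f + a * bp * eval_h(h, X)
    return np.sign(f)
```

On data such as the two-dimensional circle-versus-ring set used in Section 5, one would call greedy_stagewise(X, y, beta=...) for a range of β values; the constraints α ∈ [0, 1] and β' ∈ [0, β] are exactly the regularization whose effect Section 5 examines empirically.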
GREEDYALGORITHMSFORCLASSIFICATION–CONSISTENCY,RATES,ANDADAPTIVITYLetRdenotetheextendedrealline(R=R[f¥;+¥g).Weextendaconvexfunctiong:R!Rtoafunctiong:R!Rbydeningg(¥)=limx!¥g(x)andg(¥)=limx!¥g(x).Notethatthisextensionismerelyfornotationalconvenience.Itensuresthat,fG(h),theminimizerofG(h;f),iswell-denedath=0or1forappropriatelossfunctions.Foreveryvalueofh2[0;1]letfG(h)4=argminf2RG(h;f);G(h)4=G(h;fG(h))=inff2RG(h;f):Itcanbeshown(Zhang,2002)thatformanychoicesoff,includingtheexamplesgiveninSection3.3,fG(h)�0whenh�1=2.WebeginwitharesultfromZhang(2002).LetfbminimizeA(f)overbCO(H),anddenotebyfopttheminimizerofA(f)overallBorelmeasurablefunctionsf.Forsimplicityweassumethatfoptexists.InotherwordsA(fopt)A(f)(forallmeasurablef):Ourdenitionimpliesthatfopt(x)=fG(h(x)).Theorem4(Zhang,2002,Theorem2.1)AssumethatfG(h)�0whenh�1=2,andthatthereexistc�0ands1suchthatforallh2[0;1],jh1=2jscs(G(h;0)G(h)):ThenforallBorelmeasurablefunctionsf(x)L(f)L2cA(f)A(fopt)1=s;(4)wheretheBayeserrorisgivenbyL=L(2h()1).TheconditionthatfG(h)�0whenh�1=2inTheorem4ensuresthattheoptimalminimizerfoptachievestheBayeserror.Thisconditioncanbesatisedbyassumingthatf(f)f(f)forallf�0.TheparameterscandsinTheorem4dependonlyonthelossf.Ingeneral,iffissecondorderdifferentiable,thenonecantakes=2.ExamplesofthevaluesofcandsaregiveninSection3.3.Thebound(4)allowsonetoworkdirectlywiththefunctionA()ratherthanwiththelesswieldy01lossL().WeareinterestedinboundingthelossL(f)oftheempiricalestimatorˆftb;mobtainedaftertstepsofthestagewisegreedyalgorithmdescribedinFigure1.Substituteˆftb;min(4),andconsiderboundingther.h.s.asfollows(ignoringthe1=sexponentforthemoment):A(ˆftb;m)A(fopt)=hA(ˆftb;m)ˆA(ˆftb;m)i+hˆA(ˆftb;m)ˆA(fb)i+hˆA(fb)A(fb)i+hA(fb)A(fopt)i:(5)Next,weboundeachofthetermsseparately.ThersttermcanbeboundedusingTheorem2.Inparticular,sinceA(f)=Ef(Yf(X)),wherefisassumedtobeconvex,andsinceˆftb;m2bCO(H)thenf(x)2[b;b]foreveryx.Itfollowsthatonits(bounded)domaintheLipschitzconstantoffisniteandcanbewrittenaskb(seeexplicitexamplesinSection3.3).FromTheorem2wehavethatwithprobabilityatleast1d,A(ˆftb;m)ˆA(ˆftb;m)4bkbRm(H)+fbrlog(1=d)2m;721 MANNOR,MEIRANDZHANGwherefb4=supf2[b;b]f(f).Recallthatˆftb;m2bCO(H),andnotethatwehaveusedthefactthatRm(bCO(H))=bRm(H)(e.g.,BartlettandMendelson,2002).Thethirdtermonther.h.s.of(5)canbeestimateddirectlyfromtheChernoffbound.Wehavewithprobabilityatleast1d:ˆA(fb)A(fb)fbrlog(1=d)2m:Notethatfisxed(independentofthesample),andthereforeasimpleChernoffboundsufceshere.Inordertoboundthesecondtermin(5)weassumethatsupv2[b;b]f00(v)Mb¥;(6)wheref00(u)isthesecondderivativeoff(u).FromTheorem4.2byZhang(2003)weknowthatforaxedsampleˆA(ˆftb;m)ˆA(fb)8b2Mbt:(7)Thisresultholdsforeveryconvexfandxedb.Thefourthtermin(5)isapurelyapproximationtheoreticterm.AnappropriateassumptionwillneedtobemadeconcerningtheBayesboundaryforthistermtovanish.Insummary,foreveryt,withprobabilityatleast12d,A(ˆftb;m)A(fopt)4bkbRm(H)+8b2Mbt+fbr2log(1=d)m+(A(fb)A(fopt)):(8)Thenaltermin(8)canbeboundedusingtheLipschitzpropertyoff.Inparticular,A(fb)A(fopt)=EXnh(X)f(fb(X))+(1h(X))f(fb(X))oEXh(X)f(fopt(X))+(1h(X))f(fopt(X)) =EXnh(X)[f(fb(X))f(fopt(X))]o+EXn(1h(X))[f(fb(X))f(fopt(X))]okbEXnh(X)jfb(X)fopt(X)j+(1h(X))jfb(X)fopt(X)jokbEXjfb(X)fb;opt(X)j+Db;(9)wheretheLipschitzpropertyandthetriangleinequalitywereusedinthenaltwosteps.Herefb;opt(X)=max(b;min(b;fopt(X)))istheprojectionoffoptonto[b;b],andDb4=suph2[1=2;1]fI(fG(h)�b)[G(h;b)G(h;fG(h))]g:NotethatDb!0whenb!¥sinceDbrepresentsthetailbehaviorG(h;b).SeveralexamplesareprovidedinSection3.3.722 
3.3 Examples for φ

We consider three commonly used choices for the convex function φ. Other examples are presented by Zhang (2002).

  exp(−x)                     Exponential loss
  log(1 + exp(−x)) / log(2)   Logistic loss
  (x − 1)²                    Squared loss

It is easy to see that all losses are non-negative and upper bound the 0-1 loss I(x ≤ 0), where I(·) is the indicator function. The exponential loss function was previously shown to lead to the AdaBoost algorithm (Schapire and Singer, 1999), while the other losses were proposed by Friedman et al. (2000), and shown to lead to other interesting stagewise algorithms. The essential differences between the loss functions relate to their behavior for x → −∞. In this paper, the natural logarithm is used in the definition of the logistic loss. The division by log(2) sets the scale so that the loss function equals 1 at x = 0. For each one of these cases we provide in Table 1 the values of the constants M_β, φ_β, κ_β, and Δ_β defined above. We also include the values of c and s from Theorem 4, as well as the optimal minimizer f*_G(η). Note that the values of Δ_β and κ_β listed in Table 1 are upper bounds (see Zhang, 2002).

              Exponential            Logistic                      Squared
  φ(x)        exp(−x)                log(1 + exp(−x)) / log(2)     (x − 1)²
  M_β         exp(β)                 1 / (4 log(2))                2
  φ_β         exp(β)                 log(1 + exp(β)) / log(2)      (β + 1)²
  κ_β         exp(β)                 1 / log(2)                    2β + 2
  Δ_β         exp(−β)                exp(−β) / log(2)              max(0, 1 − β)²
  f*_G(η)     (1/2) log(η/(1−η))     log(η/(1−η))                  2η − 1
  c           1/√2                   √(log(2)/2)                   1/2
  s           2                      2                             2

Table 1: Parameter values for several popular choices of φ.

3.4 Universal Consistency

We assume that h ∈ H implies −h ∈ H, which in turn implies that 0 ∈ CO(H). This implies that β_1 CO(H) ⊆ β_2 CO(H) when β_1 ≤ β_2. Therefore, using a larger value of β implies searching within a larger space. We define SPAN(H) = ∪_{β>0} βCO(H), which is the largest function class that can be reached in the greedy algorithm by increasing β. In order to establish universal consistency, we may assume initially that the class of functions SPAN(H) is dense in C(K), the class of continuous functions over a domain K ⊆ R^d under the uniform norm topology. From Theorem 4.1 by Zhang (2002), we know that for all φ considered in this paper, and all Borel measures, inf_{f∈SPAN(H)} A(f) = A(f_opt). Since SPAN(H) = ∪_{β>0} βCO(H), we obtain lim_{β→∞} A(f_β) − A(f_opt) = 0, leading to the vanishing of the final term in (8) when β → ∞. Using this observation we are able to establish sufficient conditions for consistency.

Theorem 5 Assume that the class of functions SPAN(H) is dense in C(K) over a domain K ⊆ R^d. Assume further that φ is convex and Lipschitz and that (6) holds. Choose β = β(m) such that, as
m → ∞, we have β → ∞, φ_β² log m / m → 0, and β κ_β R_m(H) → 0. Then the greedy algorithm of Figure 1, applied for t steps where (β² M_β)/t → 0 as m → ∞, is strongly universally consistent.

Proof The basic idea of the proof is the selection of β = β(m) in such a way that it balances the estimation and approximation error terms. In particular, β should increase to infinity so that the approximation error vanishes. However, the rate at which β increases should be sufficiently slow to guarantee convergence of the estimation error to zero as m → ∞. Let δ_m = 1/m². It follows from (8) that with probability smaller than 2δ_m,

$$ A(\hat f^{\,t_m}_{\beta_m, m}) - A(f_{\mathrm{opt}}) > 4 \beta_m \kappa_{\beta_m} R_m(H) + \frac{8 \beta_m^2 M_{\beta_m}}{t_m} + 2 \phi_{\beta_m} \sqrt{\frac{\log m}{m}} + \Delta A_{\beta_m}, $$

where ΔA_β = A(f_β) − A(f_opt) → 0 as β → ∞. Using the Borel-Cantelli Lemma this happens finitely many times, so there is a (random) number of samples m_1 after which the above inequality is always reversed. Since all terms in (8) converge to 0, it follows that for every ε > 0, from some time on, A(f̂^t_{β,m}) − A(f_opt) ≤ ε with probability 1. Using (4) concludes the proof.

As a simple example for the choice of β = β(m), consider the logistic loss. From Table 1 we conclude that selecting β = o(√(m/log m)) suffices to guarantee consistency.

Unfortunately, no convergence rate can be established in the general setting of universal consistency. Convergence rates for particular functional classes can be derived by applying appropriate assumptions on the class H and the posterior probability η(x). We note that elsewhere (Mannor et al., 2002a) we used (8) in order to establish convergence rates for the three loss functions described above, when certain smoothness conditions were assumed concerning the class conditional distribution η(x). The procedure described in Mannor et al. (2002a) also established appropriate (non-adaptive) choices for β as a function of the sample size m. In the next section we use a different approach for the squared loss in order to derive faster, nearly optimal, convergence rates.

4. Rates of Convergence and Adaptivity – the Case of Squared Loss

We have shown that under reasonable conditions on the function φ, universal consistency can be established as long as the base class H is sufficiently rich. We now move on to discuss rates of convergence and the issue of adaptivity, as described in Section 2. In this section we focus on the squared loss, as particularly tight bounds are available for this case, using techniques from the empirical process literature (e.g., van de Geer, 2000). This allows us to demonstrate nearly minimax rates of convergence. Since we are concerned with establishing convergence rates in a nonparametric setting, we will not be concerned with constants which do not affect rates of convergence. We will denote generic constants by c, c', c_1, c_0, etc. We begin by bounding the difference between A(f) and A(f_opt) in the non-adaptive setting, where we consider the case of a fixed value of β which defines the class βCO(H). In Section 4.2 we use the multiple testing Lemma to derive an adaptive procedure that leads to a uniform bound over SPAN(H). We finally apply those results for attaining bounds on the classification (0-1) loss in Section 4.3. Observe that from the results of Section 3, for each fixed value of β, we may take the number of boosting iterations t to infinity. We assume throughout this section that this procedure has been adhered to.
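The entries of Table 1 in Section 3.3 are easy to probe numerically. The sketch below (again an illustration, not code from the paper) checks that each of the three surrogates upper-bounds the 0-1 loss, is normalized so that φ(0) = 1, and has the population minimizer f*_G(η) listed in the table.

```python
# Illustrative check of the three surrogate losses of Table 1 (not from the paper):
# each phi is convex, upper bounds the 0-1 loss I[u <= 0], and is scaled so phi(0) = 1.
import numpy as np

losses = {
    "exponential": lambda u: np.exp(-u),
    "logistic":    lambda u: np.log1p(np.exp(-u)) / np.log(2.0),
    "squared":     lambda u: (u - 1.0) ** 2,
}

# Population minimizers f_G*(eta) as listed in Table 1.
minimizers = {
    "exponential": lambda eta: 0.5 * np.log(eta / (1.0 - eta)),
    "logistic":    lambda eta: np.log(eta / (1.0 - eta)),
    "squared":     lambda eta: 2.0 * eta - 1.0,
}

u = np.linspace(-3.0, 3.0, 601)
zero_one = (u <= 0).astype(float)
for name, phi in losses.items():
    assert np.all(phi(u) >= zero_one - 1e-12), name    # upper bound on the 0-1 loss
    assert abs(phi(np.array([0.0]))[0] - 1.0) < 1e-12  # normalized so phi(0) = 1

# G(eta, f) = eta*phi(f) + (1-eta)*phi(-f); compare its numerical minimizer to Table 1.
eta = 0.8
f_grid = np.linspace(-5.0, 5.0, 20001)
for name, phi in losses.items():
    G = eta * phi(f_grid) + (1.0 - eta) * phi(-f_grid)
    print(name, round(float(f_grid[np.argmin(G)]), 3),
          round(float(minimizers[name](eta)), 3))
```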
GREEDYALGORITHMSFORCLASSIFICATION–CONSISTENCY,RATES,ANDADAPTIVITY4.1EmpiricalRatioBoundsfortheSquaredLossInthissectionwerestrictattentiontothesquaredlossfunction,A(f)=E(f(X)Y)2:Sinceinthiscasefopt(x)=E(Yjx);wehavethefollowingidentityforanyfunctionf:EYjx(f(x)Y)2EYjx(fopt(x)Y)2=(f(x)fopt(x))2:ThereforeA(f)A(fopt)=E(f(X)fopt(X))2:(10)WeassumethatfbelongstosomefunctionclassF,butwedonotassumethatfoptbelongstoF.Furthermore,sinceforanyrealnumbersa;b;c,wehavethat(ab)2(cb)2=(ac)2+2(ac)(cb)thefollowingistrue:ˆA(f)ˆA(fopt)=1mmåi=1(f(xi)yi)21mmåi=1(fopt(xi)yi)2=2mmåi=1(fopt(xi)yi)(f(xi)fopt(xi))+1mmåi=1(f(xi)fopt(xi))2:(11)Ourgoalatthispointistoassesstheexpecteddeviationof[A(f)A(fopt)]fromitsempiricalcounterpart,[ˆA(f)ˆA(fopt)].Inparticular,wewanttoshowthatwithprobabilityatleast1d,d2(0;1),forallf2FwehaveA(f)A(fopt)c(ˆA(f)ˆA(fopt))+rm(d);forappropriatelychosencandrm(d).ForanyfitwillbeconvenienttousethenotationˆEf4=1måm=1f(xi).WenowrelatetheexpectedandempiricalvaluesofthedeviationtermsA(f)A(fopt).Thefollowingresultisbasedonthesymmetrizationtechniqueandtheso-calledpeelingmethodinstatis-tics(e.g.,Section5.3invandeGeer,2000).Thepeelingmethodisageneralmethodforboundingsupremaofstochasticprocessesoversomeclassoffunctions.Thebasicideaistotransformthetaskintoasequenceofsimplerbounds,eachdenedoveranelementinanestedsequenceofsubsetsoftheclass(see(5.17)invandeGeer,2000).Sincetheproofisrathertechnical,itispresentedintheappendix.6LetFbeaclassofuniformlyboundedfunctions,andletX=fX1;:::;XmgbeasetofpointsdrawnindependentlyatrandomaccordingtosomelawP.Assumethatforallf2F,supxjf(x)fopt(x)jM.Thenthereexistsapositiveconstantcsuchthatforallqc,withprobabilityatleast1exp(q),forallf2FEf(X)fopt(X)24ˆEf(X)fopt(X)2+100qM2m+D26;725 MANNOR,MEIRANDZHANGwhereDmisanynumbersuchthatmD232M2max(H(Dm;F;m);1):(12)ObservethatDmiswell-denedsincethel.h.s.ismonotonicallyincreasingandunbounded,whilether.h.s.ismonotonicallydecreasing.WeusethefollowingboundfromvandeGeer(2000):Lemma7(vandeGeer,2000,Lemma8.4)LetFbeaclassoffunctionssuchthatforallpositived,H(d;F;m)Kd2x,forsomeconstants0x1andK.LetX;Yberandomvariablesdenedoversomedomain.LetW(x;y)beareal-valuedfunctionsuchthatjW(x;y)jMforallx;y,andEYjxW(x;Y)=0forallx.Thenthereexistsaconstantc,dependingonx;KandMonly,suchthatforallec=pm:P(supg2FjˆEfW(X;Y)g(X)gj(ˆEg(X)2)(1x)=2e)cexp(me2=c2):Inordertoapplythisbound,itisusefultointroducethefollowingassumption.Assumption1Assumethat9M1suchthatsupxjf(x)fopt(x)jMforallf2F.Moreover,forallpositivee,H(e;F;m)K(e=M)2xwhere0x1.WewillnowrewriteLemma7inasomewhatdifferentformusingthenotationofthissection.Lemma8LetAssumption1hold.Thenthereexistpositiveconstantsc0andc1thatdependonxandKonly,suchthat8qc0,withprobabilityatleast1exp(q),forallf2F ˆE(fopt(X)Y)(f(X)fopt(X)) 1x2ˆE(ffopt)2+c1M2qm1=(1+x):ProofLetW(X;Y)=(fopt(X)Y)=M;g(X)=(f(X)fopt(X))=M:UsingLemma7wendthatthereexistconstantscandc0thatdependonxandKonly,suchthat8ec=pmP9g2G:jˆEfW(X;Y)g(X)gj�1x2ˆEg(X)2+1+x2e2=(1+x)(a)Pn9g2G:jˆEfW(X;Y)g(X)gj�(ˆEg(X)2)(1x)=2eo=P(supg2GjˆEfW(X;Y)g(X)gj(ˆEg(X)2)(1x)=2�e)(b)cexp(me2=c2):where(a)usedtheinequalityjabj1x2jaj2=(1x)+1+x2jbj2=(1+x),and(b)followsusingLemma7.Theclaimfollowsbysettinge=pq=mandchoosingc0andc1appropriately.CombiningLemma6andLemma8,weobtainthemainresultofthissection.726 
GREEDYALGORITHMSFORCLASSIFICATION–CONSISTENCY,RATES,ANDADAPTIVITYTheorem9SupposeAssumption1holds.Thenthereexistconstantsc0;c1�0thatdependonxandKonly,suchthat8qc0,withprobabilityatleast1exp(q),forallf2FA(f)A(fopt)4xˆA(f)ˆA(fopt)+c1M2xqm1=(1+x):ProofBy(10)itfollowsthatA(f)A(fopt)=E(ffopt)2.Thereexistsaconstantc0dependingonKonlysuchthatinLemma6,wecanletD2=c0M2m1=(1+x)toobtainA(f)A(fopt)=E(ffopt)24ˆE(ffopt)2+M2(100q=m+c0m1=(1+x))(13)withprobabilityatleast1exp(q)whereq1.By(11)wehavethat[ˆA(f)ˆA(fopt)]=2ˆE(fopt(X)Y)(f(X)fopt(X)) +ˆE(f(X)fopt(X))2:UsingLemma8wehavethatthereexistconstantsc01andc02thatdependonKandxonly,suchthatforallqc01,withprobabilityatleast1exp(q): ˆE(fopt(X)Y)(f(X)fopt(X)) 1x2ˆE(ffopt)2+c0M2qm1=(1+x):Combiningtheseresultswehavethatwithprobabilityatleast1eq:[ˆA(f)ˆA(fopt)]=2ˆE(fopt(X)Y)(f(X)fopt(X)) +ˆE(f(X)fopt(X))2xˆE(ffopt)22c0M2qm1=(1+x):(14)From(13)and(14)weobtainwithprobabilityatleast12exp(q):A(f)A(fopt)4x[ˆA(f)ˆA(fopt)]+8xc0M2qm1=(1+x)+M2(100q=m+c0m1=(1+x)):NotethatAssumption1wasusedwheninvokingLemma6.Thetheoremfollowsfromthisinequal-itywithappropriatelychosenc0andc1.4.2AdaptivityInthissectionweletfbechosenfrombCO(H)bF,wherebwillbedeterminedadaptivelybasedonthedatainordertoachieveanoptimalbalancebetweenapproximationandestimationerrors.Inthiscase,supxjf(x)jbMwhereh2Hareassumedtoobeysupxjh(x)jM.Werstneedtodeterminethepreciseb-dependenceoftheboundofTheorem9.WebeginwithadenitionfollowedbyasimpleLemma,theso-calledmultipletestingLemma(e.g.,Lemma4.14inHerbrich,2002).10AtestGisamappingfromthesampleSandacondenceleveldtothelogicalvaluesfTrue;Falseg.WedenotethelogicalvalueofactivatingatestGonasampleSwithcondencedbyG(S;d).727 MANNOR,MEIRANDZHANGLemma11SupposewearegivenasetoftestsG=fG1;:::;Grg.AssumefurtherthatadiscreteprobabilitymeasureP=fpigr=1overGisgiven.Ifforeveryi2f1;2;:::;rgandd2(0;1),PfGi(S;d)g1d,thenPfG1(S;dp1)^^Gr(S;dpr)g1d:WeuseLemma11inordertoextendTheorem9sothatitholdsforallb.Theproofagainreliesonthepeelingtechnique.Theorem12LetAssumption1hold.Thenthereexistconstantsc0;c1�0thatdependonxandKonly,suchthat8qc0,withprobabilityatleast1exp(q),forallb1andforallf2bFwehaveA(f)A(fopt)4xˆA(f)ˆA(fopt)+c1b2M2q+loglog(3b)m1=(1+x):ProofForalls=1;2;3;:::,letFs=2sF.LetusdenethetestGs(S;d)tobeTRUEifA(f)A(fopt)4xˆA(f)ˆA(fopt)+cx22sM2log(1=d)m11+xforallf2FsandFALSEotherwise.UsingTheorem9wehavethatP(Gs(S;d))1d.Letps=1s(s+1),notingthatå¥=1ps=1andbyLemma11wehavethatPfGs(S;dps)forallsg1d:Considerf2bFforsomeb1.Lets=blog2bc+1,wehavethatPn8s:Gs(S;ds(s+1))o1dsothatwithprobabilityatleast1dwehavethat:A(f)A(fopt)4xˆA(f)ˆA(fopt)+c0x22sM2 log(s2+sd)m!1=(1+x)4xˆA(f)ˆA(fopt)+c1xb2M2loglog(3b)+qm1=(1+x);wherewesetq=log(1=d)andusedthefactthat2s1b2s.Theorem12boundsA(f)A(fopt)intermsofˆA(f)ˆA(fopt).However,inordertodetermineoverallconvergenceratesofA(f)toA(fopt)weneedtoeliminatetheempiricaltermˆA(f)ˆA(fopt).Todoso,werstrecallasimpleversionoftheBernsteininequality(e.g.,Devroyeetal.,1996)togetherwithastraightforwardconsequence.Lemma13LetfX1;X2;:::;Xmgbereal-valuedi.i.d.randomvariablessuchthatjXijbwithprob-abilityone.Lets2=Var[X1].Then,foranye�0P(1mmåi=1XiE[X1]�e)expme22s2+2be=3:728 
GREEDYALGORITHMSFORCLASSIFICATION–CONSISTENCY,RATES,ANDADAPTIVITYMoreover,ifs2c0bE[X1],thenforallpositiveq,thereexistsaconstantcthatdependsonlyonc0suchthatwithprobabilityatleast1exp(q)1mmåi=1XicE[X1]+bqm;wherecisindependentofb.ProofTherstpartoftheLemmaisjusttheBernsteininequality(e.g.,Devroyeetal.,1996).Toshowthesecondpartweneedtoboundfromabovetheprobabilitythat(1=m)åm=1Xi�cE[X1]+bq=m.Sete=(c1)E[X1]+bq=m.UsingBernstein'sinequalitywehavethatP(1mmåi=1XiE[X1]�(c1)E[X1]+bqm)expme22s2+2be=3expme22c0bE[X1]+2be=3(a)expmebexp(q);where(a)followsbychoosingclargeenoughsothat2c013(c1),implyingthat2c0bE[X1]be=3.Next,weuseBernstein'sinequalityinordertoboundˆA(f)ˆA(fopt).Lemma14LetAssumption1hold.Givenanyb1andf2bF,thereexistsaconstantc0�0(independentofb)suchthat8q,withprobabilityatleast1exp(q):ˆA(f)ˆA(fopt)c0(A(f)A(fopt))+(bM)2qm:ProofFixf2bF.WewilluseLemma13toboundtheprobabilityofalargedifferencebe-tweenˆA(f)andˆA(fopt).InsteadofworkingwithˆA(f)wewilluseZ4=2[(fopt(X)Y)(f(X)fopt(X))]+(f(X)fopt(X))2.Accordingto(11),ˆA(f)ˆA(fopt)=ˆE[Z].TheexpectationofZsatisesthatE[Z]=E[ˆA(f)ˆA(fopt)],sousing(10)wehavethatE[Z]=A(f)A(fopt)=E(f(X)fopt(X))2.BoundingthevarianceweobtainthatVar[Z]E[Z2]E4(fopt(X)Y)2(f(X)fopt(X))2+4(fopt(X)Y)(f(X)fopt(X))3+(f(X)fopt(X))4supx;y 4(fopt(x)y)2+4(fopt(x)y)(f(x)fopt(x))+(f(x)fopt(x))2 E[Z]:(15)ByAssumption1foreveryf2Fwehavethatsupxjf(x)fopt(x)jM,whichimpliesthatforf2bFwehavethatsupxjf(x)fopt(x)j=supxjf(x)bfopt(x)+(b1)fopt(x)jbM+(b1):729 MANNOR,MEIRANDZHANGWeconcludethatsupxjf(x)fopt(x)j2bM.Recallthatfopt(x)=E(Yjx),Y2f1;1g,sowecanboundj(fopt(X)Y)j2.Sinceb1andbytheassumptiononMwehavethatjfopt(X)Yj2bM.Pluggingtheseupperboundsinto(15)weobtainVar[Z]c0b2M2E[Z],withc0=36.AsimilarargumentshowsthatjZjisnotlargerthanc00bM(withprobability1,andc00=12).TheclaimthenfollowsfromadirectapplicationofLemma13.Wenowconsideraprocedurefordeterminingbadaptivelyfromthedata.Deneapenaltytermgq(b)=b2M2loglog(3b)+qm1=(1+x);whichpenalizeslargevaluesofb,correspondingtolargeclasseswithgoodapproximationproper-ties.Theprocedurethenistondˆbqandˆfq2ˆbqFsuchthatˆA(ˆfq)+gq(ˆbq)infb1inff2bFˆA(f)+2gq(b):(16)Thisprocedureissimilartotheso-calledstructuralriskminimizationmethod(Vapnik,1982),exceptthattheminimizationisperformedoverthecontinuousparameterbratherthanadiscretehypothesisclasscounter.Observethatˆbqandˆfqarenonunique,butthisposesnoproblem.Wecannowestablishaboundonthelossincurredbythisprocedure.Theorem15LetAssumption1hold.Chooseq0�0andassumewecomputeˆfq0using(16).Thenthereexistconstantsc0;q0�0thatdependonxandKonly,suchthat8mqmax(q0;c0),withprobabilityatleast1exp(q),A(ˆfq0)A(fopt)+c1qq01=(1+x)infb1inff2bFA(f)A(fopt)+gq(b):Notethatsinceforanyq,gq(b)=O((1=m)1=(1+x)),Theorem15providesratesofconvergenceintermsofthesamplesizem.ObservealsothatthemaindistinctionbetweenTheorem15andTheorem12isthatthelatterprovidesadata-dependentbound,whiletheformerestablishesaso-calledoracleinequality,whichcomparestheperformanceoftheempiricalestimatorˆfq0tothatoftheoptimalestimatorwithina(continuouslyparameterized)hierarchyofclasses.Thisoptimalestimatorcannotbecomputedsincetheunderlyingprobabilitydistributionisunknown,butservesasaperformanceyard-stick.Proof(ofTheorem15)Considerbq1andfq2bqFsuchthatA(fq)A(fopt)+2gq(bq)infb1inff2bFA(f)A(fopt)+4gq(b):(17)Notethatbqandfqdeterminedby(17),asopposedtoˆbqandˆfqin(16),areindependentofthedata.UsingLemma14,weknowthatthereexistsaconstantc0suchthatwithprobabilityatleast1exp(q):ˆA(fq)ˆA(fopt)+2gq(bq)c0infb1inff2bFA(f)A(fopt)+gq(b):(18)730 
GREEDYALGORITHMSFORCLASSIFICATION–CONSISTENCY,RATES,ANDADAPTIVITYFrom(16)wehaveˆA(ˆfq0)ˆA(fopt)+gq0(ˆbq0)(a)ˆA(fq)ˆA(fopt)+2gq0(bq)(b)ˆA(fq)ˆA(fopt)+2gq(bq)(c)c0infb1inff2bFA(f)A(fopt)+gq(b):(19)Here(a)resultsfromthedenitionofˆfq0,(b)usesqq0,and(c)isbasedon(18).Wethenconcludethatthereexistconstantsc0;c01�0thatdependonxandKonly,suchthat8qc00,withprobabilityatleast1exp(q):A(ˆfq0)A(fopt)(a)c0[ˆA(ˆfq0)ˆA(fopt)+gq(ˆbq0)](b)c01qq01=(1+x)[ˆA(ˆfq0)ˆA(fopt)+gq0(ˆbq0)](c)c0c02qq01=(1+x)infb1[inff2bFA(f)A(fopt)+gq(b)]:Here(a)isbasedonTheorem12,(b)followsfromthedenitionofgq(b)and(c)followsfrom(19).4.3ClassicationErrorBoundsTheorem15establishedratesofconvergenceofA(ˆf)toA(fopt).However,forbinaryclassicationproblems,themainfocusofthiswork,wewishtodeterminetherateatwhichL(ˆf)convergestotheBayeserrorL.However,fromtheworkofZhang(2002),reproducedasTheorem4above,weimmediatelyobtainaboundontheclassicationerror.Corollary16LetAssumption1holds.Thenthereexistconstantsc0;c1�0thatdependonxandKonly,suchthat8mqmax(q0;c0),withprobabilityatleast1exp(q),L(ˆfq0)L+c0qq01=2(1+x)infb1inff2bFA(f)A(fopt)+gq(b)1=2:(20)Moreover,iftheconditionalprobabilityh(x)isuniformlyboundedawayfrom0.5,namelyjh(x)1=2jd�0forallx,thenwithprobabilityatleast1exp(q),L(ˆfq0)L+c1qq01=(1+x)infb1inff2bF(A(f)A(fopt))+gq(b):ProofTherstinequalityfollowsdirectlyfromTheorems4and15,noticingthes=2fortheleastsquaresloss.ThesecondinequalityfollowsfromCorollary2.1ofZhang(2002).AccordingtothiscorollaryL(ˆfq0)L+2cinfd�0Ejh(x)12jd(ˆfq0fopt)21=2+c01d(A(ˆfq0)A(fopt)):731 MANNOR,MEIRANDZHANGTheclaimfollowssincebyassumptiontherstterminsidetheinmumonther.h.s.vanishes.Inordertoproceedtothederivationofcompleteconvergenceratesweneedtoassesstheparameterxandtheapproximationtheoreticterminff2bFA(f)A(fopt),whereweassumethatF=CO(H).Inordertodosowemakethefollowingassumption.Assumption2Forallh2H,supxjh(x)jM.Moreover,N2(e;H;m)C(M=e)V,forsomecon-stantsCandV.NotethatAssumption2holdsforVCclasses(e.g.,vanderVaartandWellner,1996).TheentropyoftheclassbCO(H)canbeestimatedusingthefollowingresult.Lemma17(vanderVaartandWellner,1996,Theorem2.6.9)LetAssumption2holdforH.ThenthereexistsaconstantKthatdependsonCandVonlysuchthatlogN2(e;bCO(H);m)KbMe2VV+2:(21)WeuseLemma17toestablishpreciseconvergenceratesfortheclassicationerror.Inparticu-lar,Lemma17impliesthatxinAssumption1isequaltoV=(V+2),andindeedobeystherequiredconditions.Weconsidertwosituations,namelythenon-adaptiveandtheadaptivesettings.First,assumethatfopt2bF=bCO(H)whereb¥isknown.Inthiscase,inff2bFA(f)A(fopt)=0,sothatfrom(20)wendthatforsufcientlylargem,withhighprobabilityL(ˆfq0)LOm(V+2)=(4V+4):whereweselectedˆfq0basedon(16)withq=q0.Ingeneral,weassumethatfopt2BCO(H)forsomeunknownbutniteB.Inviewofthedis-cussioninSection2,thisisarathergenericsituationforsufcientlyrichbaseclassesH(e.g.,non-polynomialridgefunctions).Considertheadaptiveprocedure(16).Inthiscasewemaysimplyreplacetheinmumoverbin(20)bythechoiceb=B.Theapproximationerrorterminff2bFA(f)A(fopt)vanishes,andweareleftwiththetermg(B),whichyieldstherateL(ˆfq0)LOm(V+2)=(4V+4):(22)Wethusconcludethattheadaptiveproceduredescribedaboveyieldsthesameratesofconvergenceasthenon-adaptivecase,whichusespriorknowledgeaboutthevalueofb.Inordertoassessthequalityoftheratesobtained,weneedtoconsiderspecicclassesoffunctionsH.Foranyfunctionf(x),denoteby˜f(w)itsFouriertransform.ConsidertheclassoffunctionsintroducedbyBarron(1993)anddenedas,N(B)=f:ZRdkwk1j˜f(w)jdwB;consistingofallfunctionswithaFouriertransformwhichdecayssufcientlyrapidly.Denetheapproximatingclasscomposedofneuralnetworkswithasinglehiddenlayer,Hn=(f:f(x
)=c0+nåi=1cif(v�x+bi);jc0j+nåi=1jcijB);732 GREEDYALGORITHMSFORCLASSIFICATION–CONSISTENCY,RATES,ANDADAPTIVITYwherefisa(non-polynomial)sigmoidalLipschitzfunction.Barron(1993)showedthattheclassHnisdenseinN(B).FortheclassN(B)wehavethefollowingworstcaselowerboundfromYang(1999)infˆfmsuph2N(B)EL(ˆf)LW(m(d+2)=(4d+4));(23)whereˆfmisanyestimatorbasedonasampleofsizem,andbywritingh(m)=W(g(m))wemeanthatthereexistm0andCsuchthath(m)Cg(m)formm0.AsaspecicexampleforaclassH,assumeHiscomposedofmonotonicallyincreasingsigmoidalridgefunctions.Inthiscaseonecanshow(e.g.,AnthonyandBartlett,1999)thatV=2(d+1).Substitutingin(22)wendarateoftheorderO(m(d+2)=(4d+6)),whichisslightlyworsethantheminimaxlowerbound(23).PreviouslyMannoretal.(2002a),wealsoestablishedconvergenceratesfortheclassicationerror.FortheparticularcaseofthesquaredlossandtheclassN(B)weobtainedthe(non-adaptive)rateofconvergenceO(m1=4),whichdoesnotdependonthedimensiond,asisrequiredintheminimaxbound.Thenecessarydependenceonthedimensionthatcomesoutoftheanalysisinthepresentsection,hingesontheutilizationofthemorerenedboundingtechniquesusedhere.5.NumericalExperimentsThealgorithmpresentedinFigure1wasimplementedandtestedforanarticialdataset.Thealgorithmandthescriptsthatwereusedtogeneratethegraphsthatappearinthispaperareprovidedintheonlineappendix(Mannoretal.,2002b)forcompleteness.5.1AlgorithmicDetailsTheoptimizationstepinthealgorithmofFigure1iscomputationallyexpensive.Unfortunately,whilethecostfunctionA((1a)f+ah)isconvexinaforaxedh,itisnotnecessarilyconvexintheparametersthatdeneh.TheweaklearnersweusedweresigmoidalH=fh(x)=s(q�x+q0)g.GivenachoiceofhitshouldbenotedthatˆA((1a)ˆft1b;m+ab0h)isconvexina.Wethereforeusedacoordinatesearchapproachwherewesearchonaandhalternately.Thesearchoverawasperformedusingahighlyefcientlinesearchalgorithmbasedontheconvexity.ThesearchovertheparametersofhwasperformedusingtheMatlaboptimizationtoolboxfunctionfminsearch,whichimplementstheNedler-Meadalgorithm(NedlerandMead,1965).Duetotheoccurrenceoflocalminima(ThenumberofminimamaybeexponentiallylargeasshowninAueretal.,1996),weranseveralinstancesuntilconvergence,startingeachrunwithdifferentinitialconditions.Thebestsolutionwasthenselected.5.2ExperimentalResultsThetwo-dimensionaldatasetthatwasusedfortheexperimentswasgeneratedrandomlyinthefollowingway.Pointswithpositivelabelsweregeneratedatrandominatheunitcircle(theradiusandanglewerechosenuniformlyfrom[0;1]and[0;2p],respectively.)Pointswithnegativelabelsweregeneratedatrandominthering(insphericalcoordinates)f(r;q):2r3;0q2pg(theradiusandanglewerechosenuniformlyfrom[2;3]and[0;2p],respectively.)Thesignofeachpoint733 MANNOR,MEIRANDZHANGab-3-2-10123-3-2-10123data set-1.5-1-0.500.511.52-2-1.5-1-0.50log10 Perr train as a function of logblog10 blog10 Perr-1.5-1-0.500.511.52-1-0.8-0.6-0.4-0.20log10 blog10 square losslog of square loss as a function of 
logbFigure2:(a)Anarticialdataset,(b)Squarelossanderrorprobabilityforthearticialdataset.wasippedwithprobability0:05.AsampledatasetisplottedFigure2a.TheBayeserrorofthisdatasetis0.05(log10(0:05)1:3).Inordertoinvestigateoverttingasafunctionoftheregularizationparameterb,weranthefollowingexperiment.Wexedthenumberofsamplesm=400andvariedboverawiderange.Weranthegreedyalgorithmwiththesquaredloss.Asexpected,thesquaredlosspersampledecreasesasbincreases.Itcanalsobeseenthattheempiricalclassicationtrainingerrordecreaseswhenbincreases,ascanbeseeninFigure2b.Everyexperimentwasrepeatedfteentimesandtheerrorbarsrepresentonestandarddeviationofthesample.ThegeneralizationerrorisplottedinFigure3a.Itseemsthatforbthatistoosmalltheapproxi-mationpowerdoesnotsufce,whileforlargevaluesofboverttingoccurs.Wenotethatforlargevaluesofbtheoptimizationprocessmayfailwithnonnegligibleprobability.InspiteoftheoverttingphenomenonobservedinFigure3awenotethatforagivenvalueofbtheperformanceimproveswithincreasingsamplesize.Foraxedvalueofb=1wevariedmfrom10to1000andranthealgorithm.ThegeneralizationerrorisplottedinFigure3b(resultsareaveragedover15runsandtheerrorbarsrepresentonestandarddeviationofthesample).Wenotethatcomparablebehaviorwasobservedforotherdatasets.Specically,similarresultswereobtainedforpointsinanoisyXORcongurationintwodimensions.WealsoranexperimentsontheIndianPimadataset.Theresultswerecomparabletostate-of-the-artalgorithms(theerrorforthePimadatasetusing15foldcrossvalidationwas29%3%).Theresultsareprovidedindetail,alongwithimplementationsources,inanonlineappendix(Mannoretal.,2002b).Thephenomenonthatwewouldliketoemphasizeisthefactthatforaxedvalueofb,largervaluesofmleadtobetterprediction,asexpected.However,theeffectofregularizationisrevealedwhenmisxedandbvaries.Inthiscasechoosingavalueofbwhichistoosmallleadstoinsufcientapproximationpower,whilechoosingbtoolargeleadstoovertting.734 GREEDYALGORITHMSFORCLASSIFICATION–CONSISTENCY,RATES,ANDADAPTIVITYab-1.5-1-0.500.511.52-1.6-1.4-1.2-1-0.8-0.6-0.4-0.2logPerr test as a function of logblog10 blog10 Perr0.511.522.533.500.10.20.30.40.50.60.70.8log10 mGeneralization probabilityError probability as a function of m, 
b=1Figure3:(a)Generalizationerror:asafunctionofb,sampledusing400points;(b)Generalizationerror:plottedasafunctionofmforaxedb=1,sampledusing10-1000points.6.DiscussionInthispaperwehavestudiedaclassofgreedyalgorithmsforbinaryclassication,basedonmin-imizinganupperboundonthe01loss.Theapproachfollowedbearsstrongafnitiestoboost-ingalgorithmsintroducedintheeldofmachinelearning,andadditivemodelsstudieswithinthestatisticscommunity.Whileboostingalgorithmswereoriginallyincorrectlybelievedtoeludetheproblemofovertting,itisonlyrecentlythatcarefulstatisticalstudieshavebeenperformedinanat-tempttounderstandtheirstatisticalproperties.TheworkofJiang(2000b,a),motivatedbyBreiman(2000),wasthersttoaddressthestatisticalconsistencyofboosting,focusingmainlyontheques-tionofwhetherboostingshouldbeiteratedindenitely,ashadbeensuggestedinearlierstudies,orwhethersomeearlystoppingcriterionshouldbeintroduced.LugosiandVayatis(2001)andZhang(2002)thendevelopedaframeworkfortheanalysisofalgorithmsbasedonminimizingacontinu-ousconvexupperboundonthe01loss,andestablisheduniversalconsistencyunderappropriateconditions.Theearlierversionofthiswork(Mannoretal.,2002a)consideredastagewisegreedyalgorithm,thusextendingtheproofofuniversalconsistencytothisclassofalgorithms,andshowingthatconsistencycanbeachievedbyboostingforever,aslongassomeregularizationisperformedbylimitingthesizeofacertainparameter.InMannoretal.(2002a)werequiredpriorknowledgeofasmoothnessparameterinordertoderiveconvergencerates.Moreover,theconvergencerateswereworsethantheminimaxrates.Inthecurrentversion,wehavefocusedontheestablishmentofratesofconvergenceandthedevelopmentofadaptiveprocedures,whichassumenothingaboutthedata,andyetconvergetotheoptimalsolutionatnearlytheminimaxrate,whichassumesknowledgeofsomesmoothnessproperties.735 MANNOR,MEIRANDZHANGWhilewehaveestablishednearlyminimaxratesofconvergenceandadaptivityforacertainclassofbaselearners(namelyridgefunctions)andtargetdistributions,theseresultshavebeenre-strictedtothecaseofthesquaredlosswhereparticularlytightratesofconvergenceareavailable.Inmanypracticalapplicationsotherlossfunctionsareused,suchasthelogisticloss,whichseemtoleadtoexcellentpracticalperformance.Itwouldbeinterestingtoseewhethertheratesofcon-vergenceestablishedforthesquaredlossapplytoabroaderclassoflossfunctions.Moreover,wehaveestablishedminimaxityandadaptivityforarathersimpleclassoftargetfunctions.Infu-tureworkitshouldbepossibletoextendtheseresultstomorestandardsmoothnessclasses(e.g.,SobolevandBesovspaces).Someinitialresultsalongtheselineswereprovidedinapreviouspaper(Mannoretal.,2002a),althoughtheratesestablishedinthatworkarenotminimax.Anotherissuewhichwarrantsfurtherinvestigationistheextensionoftheseresultstomulti-categoryclassicationproblems.Finally,wecommentontheoptimalityoftheproceduresdiscussedinthispaper.AspointedoutinSection4,nearoptimalityfortheadaptiveschemeintroducedinthatsectionwasestablished.Ontheotherhand,itiswellknownthatunderveryreasonableconditionsBayesianprocedures(e.g.,Robert,2001)areoptimalfromaminimaxpointofview.Infact,itcanbeshownthatBayesestimatorsareessentiallytheonlyestimatorswhichcanachieveoptimalityintheminimaxsense(Robert,2001).ThisoptimalityfeatureprovidesstrongmotivationforthestudyofBayesiantypeapproachesinafrequentistsetting(MeirandZhang,2003).InmanycasesBayesianprocedurescanbeexpressedasamixtureofestimators,wherethemixtureisweightedbyanappropriatepriordistribution.Theproceduredescribedinthispaper,asmanyothersintheboostingliterature,alsogeneratesanestimatorwhichisformedasamixtureofbaseestimators.Aninterestingopenques-tionistorelate
thesetypesofalgorithmstoformalBayesprocedures,withtheirknownoptimalityproperties.AcknowledgmentsWethankthethreeanonymousreviewersfortheirveryhelpfulsuggestions.TheworkofR.M.waspartiallysupportedbytheTechnionV.P.R.fundforthepromotionofspon-soredresearch.SupportfromtheOllendorffcenterofthedepartmentofElectricalEngineeringattheTechnionisalsoacknowledged.TheworkofS.M.waspartiallysupportedbytheFulbrightpostdoctoralgrantandbytheAROundergrantDAAD10-00-1-0466.AppendixAProofofLemma6Inthefollowing,weusethenotationg(x)=(f(x)fopt(x))2,andletG=fg:g(x)=(f(x)fopt(x))2;f2Fg.Consideranyg2G.Supposeweindependentlysamplempointstwice.WedenotetheempiricalexpectationwithrespecttotherstmpointsbyˆEandtheempiricalexpectationwithrespecttothesecondmpointsbyˆE0.Wenotethatthetwosetsofrandomvariablesareindependent.WehavefromChebyshevinequality(8g2(0;1)):P ˆE0g(X)Eg(X) gEg(X)+M2gmVarg(X)m1(gEg(X)+M2gm)2:Rearrangingandtakingthecomplementonegetsthat:PˆE0g(X)(1g)Eg(X)M2gm1Varg(X)m(gEg(X)+M2gm)2:736 GREEDYALGORITHMSFORCLASSIFICATION–CONSISTENCY,RATES,ANDADAPTIVITYSince0g(X)M2itfollowsthatVarg(X)Eg(X)2M2Eg(X)sothat:PˆE0g(X)(1g)Eg(X)M2gm1Eg(X)M2m(gEg(X)+M2gm)2:Observethatforallpositivenumbersa;b;m;gonehasthatabm(ga+bgm)2=12+mg2ab+bg2ma14;wheretheinequalityfollowssincea+1a2foreverypositivenumbera.WethushavePˆE0g(X)(1g)Eg(X)M2gm34:Itfollows(bysettingg=1=4)that8e�8D2:34P9g2G:Eg(X)�4ˆEg(X)+e+16M23m(a)P9g2G:Eg(X)�4ˆEg(X)+e+16M23m&ˆE0g(X)34Eg(X)4M2mP9g2G:ˆE0g(X)�3ˆEg(X)+3e4P9g2G:2jˆE0g(X)ˆEg(X)j�ˆEg(X)+ˆE0g(X)+3e4;where(a)followsbytheindependenceofˆEandˆE0(NotethatˆEandˆE0arerandomvariablesratherthanexpectations).Letfsigm=1denoteasetofindependentidenticallydistributed1-valuedrandomvariablesuchPfsi=1g=1=2foralli.WeabusenotationsomewhatandletˆEsg(X)=(1=m)åm=1sig(Xi),andsimilarlyforˆE0.Itfollowsthat34P9g2G:Eg(X)�4ˆEg(X)+e+16M23mP9g2G:2jˆE0sg(X)ˆEsg(X)j�ˆEg(X)+ˆE0g(X)+3e4P9g2G:2(jˆE0sg(X)j+jˆEsg(X)j)�ˆEg(X)+ˆE0g(X)+3e4(a)2P9g2G:2jˆEsg(X)j�ˆEg(X)+3e8;where(a)usestheunionboundandtheobservationthatˆEandˆE0satisfythesameprobabilitylaw.ForaxedsampleXletˆGs4=g2G:2s1D2ˆEg(X)2sD2m :737 MANNOR,MEIRANDZHANGWedenetheclasssˆGs=ff:f(Xi)=sig(Xi);g2ˆGs;i=1;2;:::;mg.LetˆGs;e=2beane=2-coverofˆGs,withrespecttothe`mnorm,suchthatˆEg2sD2forallg2ˆGs;e=2.ItistheneasytoseethatsˆGs;e=2isalsoane=2-coveroftheclasssˆGs.ForeachswehavePX;s9g2ˆGs:jˆEsg(X)j�e =EXPs9g2ˆGs:jˆEsg(X)j�eEXPs9g2ˆGs;e=2:jˆEsg(X)j�e=2(a)2EX ˆGs;e=2 expme22ˆEg22EN1(e=2;ˆGs;`m)expme22ˆEg22EN1(e=2;ˆGs;`m)expme2M22s+1D2;(24)whereinstep(a)weusedtheunionboundandChernoff'sinequalityP(jˆE(sg)je)2exp(2me2=ˆEg2).Usingtheunionboundandnotingthate�8D2,wehavethat:34P9g2G:Eg(x)�4ˆEg(x)+e+16M23m2¥ås=1P9g2ˆGs:2jˆEsg(X)j�2s1D2+3e8(a)4¥ås=1EN1(e=11+2s3D2;ˆGs;`m1)exp m(2s2D2+3e16)22s+1D2M2!4¥ås=1EN1(e=11+2s3D2;ˆGs;`m1)expm2sD232M2me32M2:Inequality(a)followsfrom(24).Wenowrelatethe`2coveringnumberofFtothe`1coveringnumberofG.SupposethatˆEjf1f2j2e2.Using(11)thisimpliesthatˆEj(f1fopt)2(f2fopt)2je2+2ˆEj(f2fopt)(f1f2)j(a)e2+2qˆE(f2fopt)2qˆE(f1f2)2(b)e2+8e2+18ˆE(f2fopt)2;where(a)followsfromtheCauchy-Schwartzinequality,and(b)followsfromtheinequality4a2+b16apb(whichholdsforeveryaandb).Recallingthatforf22ˆGs,ˆE(f2fopt)22sD2,weconcludethatforallpositivee,N1(9e+2s3D2;ˆGs;`m1)eH(pe;F;m):738 
GREEDYALGORITHMSFORCLASSIFICATION–CONSISTENCY,RATES,ANDADAPTIVITYNotethatwecanchoose`2-coversofˆGssothattheirelementsgsatisfyˆEg2sD2.Combiningtheabovewehavethat8e�D2:34P9g2G:Eg(x)�4ˆEg(x)+100e+16M23m4¥ås=1eH(Dm;F;m)expm2sD232M2100me32M2(a)4¥ås=1expmD232M2expm2sD232M23meM2=4¥ås=1expmD232M2(12s)exp3meM24¥ås=1expmD22s132M2exp3meM2(b)4¥ås=1exp2s1exp3meM24e11e1e3me=M23e3me=M2:Hereweused(12)insteps(a)and(b).Setq=2+3me=M2,itfollowsthatwithprobabilityatleast1exp(q)forallg2GEg(x)4ˆEg(x)+100e+16M23m:By(12),D2616M23m.Byourdenitionofqitfollowsthatifq3thenq+23qsothateqM2=m.Weconcludethatwithprobabilityatleast1exp(q)forallg2GEg(x)4ˆEg(x)+100qM2m+D26:ReferencesG.G.AgarwalandW.J.Studden.Asymptoticintegratedmeansquareerrorusingleastsquaresandbiasminimizingsplines.TheAnnalsofStatistics,8:1307–1325,1980.M.AnthonyandP.L.Bartlett.NeuralNetworkLearning;TheoreticalFoundations.CambridgeUniversityPress,1999.A.Antos,B.K´egl,T.Lindet,andG.Lugosi.Date-dependentmargin-basedboundsforclassica-tion.JournalofMachineLearningResearch,3:73–98,2002.P.Auer,M.Herbster,andM.Warmuth.Exponentiallymanylocalminimaforsingleneurons.InD.S.Touretzky,M.C.Mozer,andM.E.Hasselmo,editors,AdvancesinNeuralInformationProcessingSystems8,pages316–322.MITPress,1996.739 MANNOR,MEIRANDZHANGA.R.Barron.Neuralnetapproximation.InProceedingsoftheSeventhYaleWorkshoponAdaptiveandLearningSystems,1992.A.R.Barron.Universalapproximationboundforsuperpositionsofasigmoidalfunction.IEEETrans.Inf.Th.,39:930–945,1993.A.R.Barron,L.Birg´e,andP.Massart.RiskBoundsforModelSelectionviaPenalization.Proba-bilityTheoryandRelatedFields,113(3):301–413,1999.P.L.Bartlett,S.Boucheron,andG.Lugosi.Modelselectionanderrorestimation.MachineLearn-ing,48:85–113,2002.P.L.BartlettandS.Mendelson.RademacherandGaussiancomplexities:riskboundsandstructuralresults.JournalofMachineLearningResearch,3:463–482,2002.O.BousquetandA.Chapelle.Stabilityandgeneralization.JournalofMachineLearningResearch,2:499–526,2002.L.Breiman.Arcingclassiers.TheAnnalsofStatistics,26(3):801–824,1998.L.Breiman.Someinnitytheoryforpredictorensembles.TechnicalReport577,Berkeley,August2000.P.B¨uhlmannandB.Yu.BoostingwiththeL2loss:regressionandclassication.J.Amer.Statist.Assoc.,98:324–339,2003.L.Devroye,L.Gy¨or,andG.Lugosi.AProbabilisticTheoryofPatternRecognition.SpringerVerlag,NewYork,1996.T.G.Dietterich.Anexperimentalcomparisonofthreemethodsforconstructingensemblesofdeci-siontrees:Bagging,boosting,andrandomization.MachineLearning,40(2):139–157,1999.Y.FreundandR.E.Schapire.Adecisiontheoreticgeneralizationofon-linelearningandapplicationtoboosting.Comput.Syst.Sci.,55(1):119–139,1997.J.Friedman,T.Hastie,andR.Tibshirani.Additivelogisticregression:astatisticalviewofboosting.TheAnnalsofStatistics,38(2):337–374,2000.T.HastieandR.Tibshirani.GeneralizedAdditiveModels,volume43ofMonographsonStatisticsandAppliedProbability.Chapman&Hall,London,1990.T.Hastie,R.Tibshirani,andJ.Friedman.TheElementsofStatisticalLearning.SpringerVerlag,Berlin,2001.R.Herbrich.LearningKernelClassiers:TheoryandAlgorithms.MITPress,Boston,2002.W.Jiang.Doesboostingovert:Viewsfromanexactsolution.TechnicalReport00-03,DepartmentofStatistics,NorthwesternUniversity,2000a.W.Jiang.Processconsistencyforadaboost.TechnicalReport00-05,DepartmentofStatistics,NorthwesternUniversity,2000b.740 
GREEDYALGORITHMSFORCLASSIFICATION–CONSISTENCY,RATES,ANDADAPTIVITYM.J.KearnsandU.V.Vazirani.AnIntroductiontoComputationalLearningTheory.MITPress,1994.V.KoltchinksiiandD.Panchenko.Empiricalmargindistributionsandboundingthegeneralizationerrorofcombinedclassiers.Ann.Statis.,30(1),2002.M.Leshno,V.Lin,A.Pinkus,andS.Schocken.MultilayerFeedforwardNetworkswithaNon-polynomialActivationFunctionCanApproximateanyFunction.NeuralNetworks,6:861–867,1993.G.LugosiandN.Vayatis.OntheBayes-riskconsistencyofboostingmethods.Technicalreport,PompeuFabraUniversity,2001.G.LugosiandN.Vayatis.Aconsistentstrategyforboostingalgorithms.InProceedingsoftheFifteenthAnnualConferenceonComputationalLearningTheory,volume2375ofLNAI,pages303–318.Springer,2002.S.MallatandZ.Zhang.Matchingpursuitwithtime-frequencydictionaries.IEEETrans.SignalProcessing,41(12):3397–3415,December1993.S.MannorandR.Meir.Geometricboundsforgenerlizationinboosting.InProceedingsoftheFourteenthAnnualConferenceonComputationalLearningTheory,pages461–472,2001.S.MannorandR.Meir.Ontheexistenceofweaklearnersandapplicationstoboosting.MachineLearning,48:219–251,2002.S.Mannor,R.Meir,andT.Zhang.Theconsistencyofgreedyalgorithmsforclassication.InProceedingsofthefteenthAnnualconferenceonComputationallearningtheory,volume2375ofLNAI,pages319–333,Sydney,2002a.Springer.S.Mannor,R.Meir,andT.Zhang.On-lineappendix,2002b.Availablefromhttp://www-ee.technion.ac.il/˜rmeir/adaptivityonlineappendix.zip.L.Mason,P.L.Bartlett,J.Baxter,andM.Frean.Functionalgradienttechniquesforcombininghypotheses.InB.Sch¨olkopfA.Smola,P.L.BartlettandD.Schuurmans,editors,AdvancesinLargeMarginClassiers.MITPress,2000.R.MeirandG.R¨atsch.Anintroductiontoboostingandleveraging.InS.MendelsonandA.Smola,editors,AdvancedLecturesonMachineLearning,LNCS,pages119–184.Springer,2003.R.MeirandT.Zhang.Data-dependentboundsforBayesianmixturemethods.InS.ThrunS.BeckerandK.Obermayer,editors,AdvancesinNeuralInformationProcessingSystems15,pages319–326.MITPress,Cambridge,MA,2003.J.A.NedlerandR.Mead.Asimplexmethodforfunctionminimization.ComputerJournal,7:308–313,1965.D.Pollard.ConvergenceofEmpiricalProcesses.SpringerVerlag,NewYork,1984.C.P.Robert.TheBayesianChoice:ADecisionTheoreticMotivation.SpringerVerlag,NewYork,secondedition,2001.741 MANNOR,MEIRANDZHANGR.Schaback.Auniedtheoryofradialbasisfunctions.J.ofComputationalandAppliedMathe-matics,121:165–177,2000.R.E.Schapire.Thestrengthofweaklearnability.MachineLearning,5(2):197–227,1990.R.E.Schapire,Y.Freund,P.L.Bartlett,andW.S.Lee.Boostingthemargin:anewexplanationfortheeffectivenessofvotingmethods.TheAnnalsofStatistics,26(5):1651–1686,1998.R.E.SchapireandY.Singer.Improvedboostingalgorithmsusingcondence-ratedpredictions.MachineLearning,37(3):297–336,1999.S.vandeGeer.EmpiricalProcessesinM-Estimation.CambridgeUniversityPress,Cambridge,U.K.,2000.A.W.vanderVaartandJ.A.Wellner.WeakConvergenceandEmpiricalProcesses.SpringerVerlag,NewYork,1996.V.N.Vapnik.EstimationofDependencesBasedonEmpiricalData.SpringerVerlag,NewYork,1982.V.N.Vapnik.StatisticalLearningTheory.WileyInterscience,NewYork,1998.Y.Yang.Minimaxnonparametricclassication-partI:ratesofconvergence.IEEETrans.Inf.Theory,45(7):2271–2284,1999.T.Zhang.Statisticalbehaviorandconsistencyofclassicationmethodsbasedonconvexriskmini-mization.Ann.Statis.,2002.Acceptedforpublication.T.Zhang.Sequentialgreedyapproximationforcertainconvexoptimizationproblems.IEEETran.Inf.Theory,49(3):682–691,2003.742