
Machine Learning: Proceedings of the Thirteenth International Conference, 1996.

Experiments with a New Boosting Algorithm

Yoav Freund and Robert E. Schapire
AT&T Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974-0636
{yoav, schapire}@research.att.com

Abstract. In an earlier paper, we introduced a new "boosting" algorithm called AdaBoost which, theoretically, can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing. We also introduced the related notion of a "pseudo-loss" which is a method for forcing a learning algorithm of multi-label concepts to concentrate on the labels that are hardest to discriminate. In this paper, we describe experiments we carried out to assess how well AdaBoost, with and without pseudo-loss, performs on real learning problems. We performed two sets of experiments. The first set compared boosting to Breiman's "bagging" method when used to aggregate various classifiers (including decision trees and single attribute-value tests). We compared the performance of the two methods on a collection of machine-learning benchmarks. In the second set of experiments, we studied in more detail the performance of boosting using a nearest-neighbor classifier on an OCR problem.

1 INTRODUCTION

"Boosting" is a general method for improving the performance of any learning algorithm. In theory, boosting can be used to significantly reduce the error of any "weak" learning algorithm that consistently generates classifiers which need only be a little bit better than random guessing. Despite the potential benefits of boosting promised by the theoretical results, the true practical value of boosting can only be assessed by testing the method on real machine learning problems. In this paper, we present such an experimental assessment of a new boosting algorithm called AdaBoost.

Boosting works by repeatedly running a given weak(1) learning algorithm on various distributions over the training data, and then combining the classifiers produced by the weak learner into a single composite classifier. The first provably effective boosting algorithms were presented by Schapire [20] and Freund [9]. More recently, we described and analyzed AdaBoost, and we argued that this new boosting algorithm has certain properties which make it more practical and easier to implement than its predecessors [10]. This algorithm, which we used in all our experiments, is described in detail in Section 2.

* Homepage: "http://www.research.att.com/orgs/ssr/people/uid". Expected to change to "http://www.research.att.com/~uid" sometime in the near future (for uid in {yoav, schapire}).

(1) We use the term "weak" learning algorithm, even though, in practice, boosting might be combined with a quite strong learning algorithm such as C4.5.

This paper describes two distinct sets of experiments. In the first set of experiments, described in Section 3, we compared boosting to "bagging," a method described by Breiman [1] which works in the same general fashion (i.e., by repeatedly rerunning a given weak learning algorithm, and combining the computed classifiers), but which constructs each distribution in a simpler manner. (Details given below.) We compared boosting with bagging because both methods work by combining many classifiers. This comparison allows us to separate out the effect of modifying the distribution on each round (which is done differently by each algorithm) from the effect of voting multiple classifiers (which is done the same by each).

In our experiments, we compared boosting to bagging using a number of different weak learning algorithms of varying levels of sophistication. These include: (1) an algorithm that searches for very simple prediction rules which test on a single attribute (similar to Holte's very simple classification rules [14]); (2) an algorithm that searches for a single good decision rule that tests on a conjunction of attribute tests (similar in flavor to the rule-formation part of Cohen's RIPPER algorithm [3] and Fürnkranz and Widmer's IREP algorithm [11]); and (3) Quinlan's C4.5 decision-tree algorithm [18].
We tested these algorithms on a collection of 27 benchmark learning problems taken from the UCI repository.

The main conclusion of our experiments is that boosting performs significantly and uniformly better than bagging when the weak learning algorithm generates fairly simple classifiers (algorithms (1) and (2) above). When combined with C4.5, boosting still seems to outperform bagging slightly, but the results are less compelling.

We also found that boosting can be used with very simple rules (algorithm (1)) to construct classifiers that are quite good relative, say, to C4.5. Kearns and Mansour [16] argue that C4.5 can itself be viewed as a kind of boosting algorithm, so a comparison of AdaBoost and C4.5 can be seen as a comparison of two competing boosting algorithms. See Dietterich, Kearns and Mansour's paper [4] for more detail on this point.

In the second set of experiments, we test the performance of boosting on a nearest neighbor classifier for handwritten digit recognition. In this case the weak learning algorithm is very simple, and this lets us gain some insight into the interaction between the boosting algorithm and the nearest neighbor classifier. We show that the boosting algorithm is an effective way for finding a small subset of prototypes that performs almost as well as the complete set. We also show that it compares favorably to the standard method of Condensed Nearest Neighbor [13] in terms of its test error.

There seem to be two separate reasons for the improvement in performance that is achieved by boosting. The first and better understood effect of boosting is that it generates a hypothesis whose error on the training set is small by combining many hypotheses whose error may be large (but still better than random guessing). It seems that boosting may be helpful on learning problems having either of the following two properties. The first property, which holds for many real-world problems, is that the observed examples tend to have varying degrees of hardness. For such problems, the boosting algorithm tends to generate distributions that concentrate on the harder examples, thus challenging the weak learning algorithm to perform well on these harder parts of the sample space. The second property is that the learning algorithm be sensitive to changes in the training examples so that significantly different hypotheses are generated for different training sets. In this sense, boosting is similar to Breiman's bagging [1] which performs best when the weak learner exhibits such "unstable" behavior. However, unlike bagging, boosting tries actively to force the weak learning algorithm to change its hypotheses by changing the distribution over the training examples as a function of the errors made by previously generated hypotheses.

The second effect of boosting has to do with variance reduction. Intuitively, taking a weighted majority over many hypotheses, all of which were trained on different samples taken out of the same training set, has the effect of reducing the random variability of the combined hypothesis. Thus, like bagging, boosting may have the effect of producing a combined hypothesis whose variance is significantly lower than those produced by the weak learner. However, unlike bagging, boosting may also reduce the bias of the learning algorithm, as discussed above. (See Kong and Dietterich [17] for further discussion of the bias and variance reducing effects of voting multiple hypotheses, as well as Breiman's [2] very recent work comparing boosting and bagging in terms of their effects on bias and variance.) In our first set of experiments, we compare boosting and bagging, and try to use that comparison to separate between the bias and variance reducing effects of boosting.

Previous work. Drucker, Schapire and Simard [8, 7] performed the first experiments using a boosting algorithm. They used Schapire's [20] original boosting algorithm combined with a neural net for an OCR problem. Follow-up comparisons to other ensemble methods were done by Drucker et al. [6]. More recently, Drucker and Cortes [5] used AdaBoost with a decision-tree algorithm for an OCR task. Jackson and Craven [15] used AdaBoost to learn classifiers represented by sparse perceptrons, and tested the algorithm on a set of benchmarks.
Finally, Quinlan [19] recently conducted an independent comparison of boosting and bagging combined with C4.5 on a collection of UCI benchmarks.

Algorithm AdaBoost.M1
Input: sequence of N examples <(x_1, y_1), ..., (x_N, y_N)> with labels y_i in Y = {1, ..., k};
       weak learning algorithm WeakLearn;
       integer T specifying number of iterations.
Initialize D_1(i) = 1/N for all i.
Do for t = 1, 2, ..., T:
  1. Call WeakLearn, providing it with the distribution D_t.
  2. Get back a hypothesis h_t : X -> Y.
  3. Calculate the error of h_t: e_t = sum of D_t(i) over all i with h_t(x_i) != y_i.
     If e_t > 1/2, then set T = t - 1 and abort loop.
  4. Set b_t = e_t / (1 - e_t).
  5. Update distribution D_t:
     D_{t+1}(i) = (D_t(i) / Z_t) * (b_t if h_t(x_i) = y_i, and 1 otherwise),
     where Z_t is a normalization constant (chosen so that D_{t+1} will be a distribution).
Output the final hypothesis:
     h_fin(x) = argmax over y in Y of the sum of log(1/b_t) over all t with h_t(x) = y.

Figure 1: The algorithm AdaBoost.M1.

2 THE BOOSTING ALGORITHM

In this section, we describe our boosting algorithm, called AdaBoost. See our earlier paper [10] for more details about the algorithm and its theoretical properties.

We describe two versions of the algorithm which we denote AdaBoost.M1 and AdaBoost.M2. The two versions are equivalent for binary classification problems and differ only in their handling of problems with more than two classes.

2.1 ADABOOST.M1

We begin with the simpler version, AdaBoost.M1. The boosting algorithm takes as input a training set of N examples S = <(x_1, y_1), ..., (x_N, y_N)> where x_i is an instance drawn from some space X and represented in some manner (typically, a vector of attribute values), and y_i in Y is the class label associated with x_i. In this paper, we always assume that the set of possible labels Y is of finite cardinality k.

In addition, the boosting algorithm has access to another unspecified learning algorithm, called the weak learning algorithm, which is denoted generically as WeakLearn. The boosting algorithm calls WeakLearn repeatedly in a series of rounds. On round t, the booster provides WeakLearn with a distribution D_t over the training set S. In response, WeakLearn computes a classifier or hypothesis h_t : X -> Y which should correctly classify a fraction of the training set that has large probability with respect to D_t. That is, the weak learner's goal is to find a hypothesis h_t which minimizes the (training) error e_t = Pr_{i ~ D_t}[h_t(x_i) != y_i]. Note that this error is measured with respect to the distribution D_t that was provided to the weak learner. This process continues for T rounds, and, at last, the booster combines the weak hypotheses h_1, ..., h_T into a single final hypothesis h_fin.

Algorithm AdaBoost.M2
Input: sequence of N examples <(x_1, y_1), ..., (x_N, y_N)> with labels y_i in Y = {1, ..., k};
       weak learning algorithm WeakLearn;
       integer T specifying number of iterations.
Let B = {(i, y) : i in {1, ..., N}, y != y_i}.
Initialize D_1(i, y) = 1/|B| for all (i, y) in B.
Do for t = 1, 2, ..., T:
  1. Call WeakLearn, providing it with the mislabel distribution D_t.
  2. Get back a hypothesis h_t : X x Y -> [0, 1].
  3. Calculate the pseudo-loss of h_t:
     e_t = (1/2) * sum over (i, y) in B of D_t(i, y) * (1 - h_t(x_i, y_i) + h_t(x_i, y)).
  4. Set b_t = e_t / (1 - e_t).
  5. Update D_t:
     D_{t+1}(i, y) = (D_t(i, y) / Z_t) * b_t^((1/2)(1 + h_t(x_i, y_i) - h_t(x_i, y))),
     where Z_t is a normalization constant (chosen so that D_{t+1} will be a distribution).
Output the final hypothesis:
     h_fin(x) = argmax over y in Y of the sum over t = 1, ..., T of log(1/b_t) * h_t(x, y).
Figure 2: The algorithm AdaBoost.M2.

Still unspecified are: (1) the manner in which D_t is computed on each round, and (2) how h_fin is computed. Different boosting schemes answer these two questions in different ways. AdaBoost.M1 uses the simple rule shown in Figure 1. The initial distribution D_1 is uniform over S, so D_1(i) = 1/N for all i. To compute distribution D_{t+1} from D_t and the last weak hypothesis h_t, we multiply the weight of example i by some number b_t in [0, 1) if h_t classifies x_i correctly, and otherwise the weight is left unchanged. The weights are then renormalized by dividing by the normalization constant Z_t. Effectively, "easy" examples that are correctly classified by many of the previous weak hypotheses get lower weight, and "hard" examples which tend often to be misclassified get higher weight. Thus, AdaBoost focuses the most weight on the examples which seem to be hardest for WeakLearn.

The number b_t is computed as shown in the figure as a function of e_t. The final hypothesis h_fin is a weighted vote (i.e., a weighted linear threshold) of the weak hypotheses. That is, for a given instance x, h_fin outputs the label y that maximizes the sum of the weights of the weak hypotheses predicting that label. The weight of hypothesis h_t is defined to be log(1/b_t) so that greater weight is given to hypotheses with lower error.

The important theoretical property about AdaBoost.M1 is stated in the following theorem. This theorem shows that if the weak hypotheses consistently have error only slightly better than 1/2, then the training error of the final hypothesis h_fin drops to zero exponentially fast. For binary classification problems, this means that the weak hypotheses need be only slightly better than random.

Theorem 1 ([10]) Suppose the weak learning algorithm WeakLearn, when called by AdaBoost.M1, generates hypotheses with errors e_1, ..., e_T, where e_t is as defined in Figure 1. Assume each e_t <= 1/2, and let g_t = 1/2 - e_t. Then the following upper bound holds on the error of the final hypothesis h_fin:

  |{i : h_fin(x_i) != y_i}| / N  <=  prod over t = 1, ..., T of sqrt(1 - 4 g_t^2)  <=  exp(-2 * sum over t = 1, ..., T of g_t^2).

Theorem 1 implies that the training error of the final hypothesis generated by AdaBoost.M1 is small. This does not necessarily imply that the test error is small. However, if the weak hypotheses are "simple" and T is "not too large," then the difference between the training and test errors can also be theoretically bounded (see our earlier paper [10] for more on this subject).

The experiments in this paper indicate that the theoretical bound on the training error is often weak, but generally correct qualitatively. However, the test error tends to be much better than the theory would suggest, indicating a clear defect in our theoretical understanding.

The main disadvantage of AdaBoost.M1 is that it is unable to handle weak hypotheses with error greater than 1/2. The expected error of a hypothesis which randomly guesses the label is 1 - 1/k, where k is the number of possible labels. Thus, for k = 2, the weak hypotheses need to be just slightly better than random guessing, but when k > 2, the requirement that the error be less than 1/2 is quite strong and may often be hard to meet.
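To make the update rule and final vote of Figure 1 concrete, here is a minimal Python sketch of AdaBoost.M1. It is an illustration, not the authors' implementation; in particular, the interface of weak_learn, which is assumed to return a hypothesis callable trained against the supplied weights, is a convention invented for this example.

import numpy as np

def adaboost_m1(X, y, weak_learn, T):
    """Minimal sketch of AdaBoost.M1 (Figure 1).
    weak_learn(X, y, dist) is assumed to return a callable h with h(x) -> label,
    trained to have small error weighted by dist."""
    N = len(X)
    dist = np.full(N, 1.0 / N)                      # D_1(i) = 1/N
    hypotheses, betas = [], []
    for t in range(T):
        h = weak_learn(X, y, dist)                  # steps 1-2
        miss = np.array([h(x) != label for x, label in zip(X, y)])
        eps = float(dist[miss].sum())               # step 3: weighted error
        if eps > 0.5:                               # abort if no better than 1/2
            break
        beta = max(eps, 1e-10) / (1.0 - eps)        # step 4 (guard against eps = 0)
        dist = dist * np.where(miss, 1.0, beta)     # step 5: shrink weight of correct examples
        dist = dist / dist.sum()                    # renormalize by Z_t
        hypotheses.append(h)
        betas.append(beta)

    def h_fin(x):
        votes = {}
        for h, beta in zip(hypotheses, betas):
            votes[h(x)] = votes.get(h(x), 0.0) + np.log(1.0 / beta)
        return max(votes, key=votes.get)            # weighted vote over the weak hypotheses
    return h_fin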
2.2 ADABOOST.M2

The second version of AdaBoost attempts to overcome this difficulty by extending the communication between the boosting algorithm and the weak learner. First, we allow the weak learner to generate more expressive hypotheses, which, rather than identifying a single label in Y, instead choose a set of "plausible" labels. This may often be easier than choosing just one label. For instance, in an OCR setting, it may be hard to tell if a particular image is a "7" or a "9", but easy to eliminate all of the other possibilities. In this case, rather than choosing between 7 and 9, the hypothesis may output the set {7, 9} indicating that both labels are plausible.

We also allow the weak learner to indicate a "degree of plausibility." Thus, each weak hypothesis outputs a vector in [0, 1]^k, where the components with values close to 1 or 0 correspond to those labels considered to be plausible or implausible, respectively. Note that this vector of values is not a probability vector, i.e., the components need not sum to one.(2)

While we give the weak learning algorithm more expressive power, we also place a more complex requirement on the performance of the weak hypotheses. Rather than using the usual prediction error, we ask that the weak hypotheses do well with respect to a more sophisticated error measure that we call the pseudo-loss. Unlike ordinary error which is computed with respect to a distribution over examples, pseudo-loss is computed with respect to a distribution over the set of all pairs of examples and incorrect labels. By manipulating this distribution, the boosting algorithm can focus the weak learner not only on hard-to-classify examples, but more specifically, on the incorrect labels that are hardest to discriminate. We will see that the boosting algorithm AdaBoost.M2, which is based on these ideas, achieves boosting if each weak hypothesis has pseudo-loss slightly better than random guessing.

(2) We deliberately use the term "plausible" rather than "probable" to emphasize the fact that these numbers should not be interpreted as the probability of a given label.

More formally, a mislabel is a pair (i, y) where i is the index of a training example and y is an incorrect label associated with example i. Let B be the set of all mislabels: B = {(i, y) : i in {1, ..., N}, y != y_i}. A mislabel distribution is a distribution defined over the set B of all mislabels.

On each round t of boosting, AdaBoost.M2 (Figure 2) supplies the weak learner with a mislabel distribution D_t. In response, the weak learner computes a hypothesis h_t of the form h_t : X x Y -> [0, 1]. There is no restriction on h_t(x_i, y_i). In particular, the prediction vector does not have to define a probability distribution.

Intuitively, we interpret each mislabel (i, y) as representing a binary question of the form: "Do you predict that the label associated with example x_i is y_i (the correct label) or y (one of the incorrect labels)?" With this interpretation, the weight D_t(i, y) assigned to this mislabel represents the importance of distinguishing incorrect label y on example x_i.

A weak hypothesis h_t is then interpreted in the following manner. If h_t(x_i, y_i) = 1 and h_t(x_i, y) = 0, then h_t has (correctly) predicted that x_i's label is y_i, not y (since h_t deems y_i to be "plausible" and y "implausible"). Similarly, if h_t(x_i, y_i) = 0 and h_t(x_i, y) = 1, then h_t has (incorrectly) made the opposite prediction. If h_t(x_i, y_i) = h_t(x_i, y), then h_t's prediction is taken to be a random guess. (Values for h_t in (0, 1) are interpreted probabilistically.)

This interpretation leads us to define the pseudo-loss of hypothesis h_t with respect to mislabel distribution D_t by the formula

  e_t = (1/2) * sum over (i, y) in B of D_t(i, y) * (1 - h_t(x_i, y_i) + h_t(x_i, y)).

Space limitations prevent us from giving a complete derivation of this formula which is explained in detail in our earlier paper [10]. It can be verified though that the pseudo-loss is minimized when correct labels y_i are assigned the value 1 and incorrect labels y != y_i are assigned the value 0. Further, note that pseudo-loss 1/2 is trivially achieved by any constant-valued hypothesis h_t.

The weak learner's goal is to find a weak hypothesis h_t with small pseudo-loss. Thus, standard "off-the-shelf" learning algorithms may need some modification to be used in this manner, although this modification is often straightforward. After receiving h_t, the mislabel distribution is updated using a rule similar to the one used in AdaBoost.M1. The final hypothesis h_fin outputs, for a given instance x, the label y that maximizes a weighted average of the weak hypothesis values h_t(x, y).
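As a concrete reading of the pseudo-loss formula above, the following Python sketch computes it for a hypothesis h(x, label) returning a plausibility in [0, 1]; the dictionary representation of the mislabel distribution is an assumption made purely for illustration.

def pseudo_loss(h, X, y, mislabel_dist):
    """Pseudo-loss of h with respect to a mislabel distribution.
    mislabel_dist maps (i, wrong_label) -> weight, where wrong_label != y[i]
    and the weights sum to one; h(x, label) returns a plausibility in [0, 1]."""
    total = 0.0
    for (i, wrong), w in mislabel_dist.items():
        # Each term is 0 if h fully favors the correct label, 1 if it fully
        # favors the incorrect one, and exactly 1/2 if h is indifferent.
        total += 0.5 * w * (1.0 - h(X[i], y[i]) + h(X[i], wrong))
    return total

In particular, a constant-valued hypothesis makes every term equal to w/2, so its pseudo-loss is exactly 1/2, matching the remark above.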
The following theorem gives a bound on the training error of the final hypothesis. Note that this theorem requires only that the weak hypotheses have pseudo-loss less than 1/2, i.e., only slightly better than a trivial (constant-valued) hypothesis, regardless of the number of classes. Also, although the weak hypotheses h_t are evaluated with respect to the pseudo-loss, we of course evaluate the final hypothesis h_fin using the ordinary error measure.

Theorem 2 ([10]) Suppose the weak learning algorithm WeakLearn, when called by AdaBoost.M2, generates hypotheses with pseudo-losses e_1, ..., e_T, where e_t is as defined in Figure 2. Let g_t = 1/2 - e_t. Then the following upper bound holds on the error of the final hypothesis h_fin:

  |{i : h_fin(x_i) != y_i}| / N  <=  (k - 1) * prod over t = 1, ..., T of sqrt(1 - 4 g_t^2)  <=  (k - 1) * exp(-2 * sum over t = 1, ..., T of g_t^2),

where k is the number of classes.

3 BOOSTING AND BAGGING

In this section, we describe our experiments comparing boosting and bagging on the UCI benchmarks.

We first mention briefly a small implementation issue: Many learning algorithms can be modified to handle examples that are weighted by a distribution such as the one created by the boosting algorithm. When this is possible, the booster's distribution D_t is supplied directly to the weak learning algorithm, a method we call boosting by reweighting. However, some learning algorithms require an unweighted set of examples. For such a weak learning algorithm, we instead choose a set of examples from S independently at random according to the distribution D_t with replacement. The number of examples to be chosen on each round is a matter of discretion; in our experiments, we chose N examples on each round, where N is the size of the original training set S. We refer to this method as boosting by resampling.

Boosting by resampling is also possible when using the pseudo-loss. In this case, a set of mislabels are chosen from the set B of all mislabels with replacement according to the given distribution D_t. Such a procedure is consistent with the interpretation of mislabels discussed in Section 2.2. In our experiments, we chose a sample of size |B| = N(k - 1) on each round when using the resampling method.
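The resampling step just described can be sketched as follows in Python (illustrative only; the helper name and interface are assumptions, while the sample size follows the text).

import numpy as np

def resample(X, y, dist, seed=0):
    """Boosting by resampling: draw an unweighted sample of N examples, with
    replacement, according to the booster's distribution D_t over the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=len(X), replace=True, p=dist)
    return [X[i] for i in idx], [y[i] for i in idx]

For the pseudo-loss case, the same idea is applied to mislabels rather than examples: N(k - 1) pairs are drawn from B according to D_t.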
3.1 THE WEAK LEARNING ALGORITHMS

As mentioned in the introduction, we used three weak learning algorithms in these experiments. In all cases, the examples are described by a vector of values which corresponds to a fixed set of features or attributes. These values may be discrete or continuous. Some of the examples may have missing values. All three of the weak learners build hypotheses which classify examples by repeatedly testing the values of chosen attributes.

Table 1: The benchmark machine learning problems used in the experiments.

  name              # train  # test  # classes  # disc. attr  # cont. attr  missing values
  soybean-small          47       -          4            35             -         -
  labor                  57       -          2             8             8       yes
  promoters             106       -          2            57             -         -
  iris                  150       -          3             -             4         -
  hepatitis             155       -          2            13             6       yes
  sonar                 208       -          2             -            60         -
  glass                 214       -          7             -             9         -
  audiology.stand       226       -         24            69             -       yes
  cleve                 303       -          2             7             6       yes
  soybean-large         307     376         19            35             -       yes
  ionosphere            351       -          2             -            34         -
  house-votes-84        435       -          2            16             -       yes
  votes1                435       -          2            15             -       yes
  crx                   690       -          2             9             6       yes
  breast-cancer-w       699       -          2             -             9       yes
  pima-indians-di       768       -          2             -             8         -
  vehicle               846       -          4             -            18         -
  vowel                 528     462         11             -            10         -
  german               1000       -          2            13             7         -
  segmentation         2310       -          7             -            19         -
  hypothyroid          3163       -          2            18             7       yes
  sick-euthyroid       3163       -          2            18             7       yes
  splice               3190       -          3            60             -         -
  kr-vs-kp             3196       -          2            36             -         -
  satimage             4435    2000          6             -            36         -
  agaricus-lepiot      8124       -          2            22             -         -
  letter-recognit     16000    4000         26             -            16         -

The first and simplest weak learner, which we call FindAttrTest, searches for the single attribute-value test with minimum error (or pseudo-loss) on the training set. More precisely, FindAttrTest computes a classifier which is defined by an attribute a, a value v and three predictions p_0, p_1 and p_?. This classifier classifies a new example x as follows: if the value of attribute a is missing on x, then predict p_?; if attribute a is discrete and its value on example x is equal to v, or if attribute a is continuous and its value on x is at most v, then predict p_0; otherwise predict p_1. If using ordinary error (AdaBoost.M1), these "predictions" p_0, p_1, p_? would be simple classifications; for pseudo-loss, the "predictions" would be vectors in [0, 1]^k (where k is the number of classes).

The algorithm FindAttrTest searches exhaustively for the classifier of the form given above with minimum error or pseudo-loss with respect to the distribution provided by the booster. In other words, all possible values of a, v, p_0, p_1 and p_? are considered. With some preprocessing, this search can be carried out for the error-based implementation in O(AN) time, where A is the number of attributes and N the number of examples. As is typical, the pseudo-loss implementation adds a factor of O(k), where k is the number of class labels. For this algorithm, we used boosting with reweighting.
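To illustrate the exhaustive search performed by FindAttrTest, here is a simplified Python sketch for the error-based case with continuous attributes only; handling of discrete attributes, missing values (the p_? prediction), pseudo-loss, and the O(AN) preprocessing mentioned above are omitted, and all names are illustrative rather than the authors' implementation.

import numpy as np

def weighted_majority(labels, weights):
    """Return the label carrying the largest total weight."""
    totals = {}
    for lab, w in zip(labels, weights):
        totals[lab] = totals.get(lab, 0.0) + w
    return max(totals, key=totals.get)

def find_attr_test(X, y, dist):
    """Naive FindAttrTest sketch: scan every (attribute, threshold) pair and
    keep the single test with minimum weighted error under the booster's
    distribution. Returns a classifier h(x) -> label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dist = np.asarray(dist, dtype=float)
    best_err, best = float("inf"), None
    for a in range(X.shape[1]):
        for v in np.unique(X[:, a]):
            left = X[:, a] <= v
            if left.all() or not left.any():
                continue                                   # degenerate split
            p0 = weighted_majority(y[left], dist[left])    # prediction when x[a] <= v
            p1 = weighted_majority(y[~left], dist[~left])  # prediction otherwise
            preds = np.where(left, p0, p1)
            err = float(dist[preds != y].sum())
            if err < best_err:
                best_err, best = err, (a, v, p0, p1)
    a, v, p0, p1 = best
    return lambda x: p0 if x[a] <= v else p1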
The second weak learner does a somewhat more sophisticated search for a decision rule that tests on a conjunction of attribute-value tests. We sketch the main ideas of this algorithm, which we call FindDecRule, but omit some of the finer details for lack of space. These details will be provided in the full paper.

First, the algorithm requires an unweighted training set, so we use the resampling version of boosting. The given training set is randomly divided into a growing set using 70% of the data, and a pruning set with the remaining 30%.

Figure 3: Comparison of using pseudo-loss versus ordinary error on multi-class problems for boosting and bagging. (Four scatter plots, for boosting FindAttrTest, boosting FindDecRule, bagging FindAttrTest and bagging FindDecRule, plotting error using pseudo-loss against error using ordinary error.)

In the first phase, the growing set is used to grow a list of attribute-value tests. Each test compares an attribute to a value, similar to the tests used by FindAttrTest. We use an entropy-based potential function to guide the growth of the list of tests. The list is initially empty, and one test is added at a time, each time choosing the test that will cause the greatest drop in potential. After the test is chosen, only one branch is expanded, namely, the branch with the highest remaining potential. The list continues to be grown in this fashion until no test remains which will further reduce the potential.

In the second phase, the list is pruned by selecting the prefix of the list with minimum error (or pseudo-loss) on the pruning set.

The third weak learner is Quinlan's C4.5 decision-tree algorithm [18]. We used all the default options with pruning turned on. Since C4.5 expects an unweighted training sample, we used resampling. Also, we did not attempt to use AdaBoost.M2 since C4.5 is designed to minimize error, not pseudo-loss. Furthermore, we did not expect pseudo-loss to be helpful when using a weak learning algorithm as strong as C4.5, since such an algorithm will usually be able to find a hypothesis with error less than 1/2.

3.2 BAGGING

We compared boosting to Breiman's [1] "bootstrap aggregating" or "bagging" method for training and combining multiple copies of a learning algorithm. Briefly, the method works by training each copy of the algorithm on a bootstrap sample, i.e., a sample of size N chosen uniformly at random with replacement from the original training set S (of size N). The multiple hypotheses that are computed are then combined using simple voting; that is, the final composite hypothesis classifies an example x to the class most often assigned by the underlying "weak" hypotheses. See his paper for more details. The method can be quite effective, especially, according to Breiman, for "unstable" learning algorithms for which a small change in the data effects a large change in the computed hypothesis.
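A minimal Python sketch of the bagging procedure just described (an illustration of the method as summarized here, not Breiman's implementation; here weak_learn is assumed to take an unweighted sample).

import numpy as np
from collections import Counter

def bagging(X, y, weak_learn, T, seed=0):
    """Train T copies of a learner on bootstrap samples of size N and combine
    them by simple, unweighted voting."""
    rng = np.random.default_rng(seed)
    N = len(X)
    hypotheses = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)        # bootstrap sample, with replacement
        hypotheses.append(weak_learn([X[i] for i in idx], [y[i] for i in idx]))

    def h_fin(x):
        # plurality vote over the T hypotheses
        return Counter(h(x) for h in hypotheses).most_common(1)[0][0]
    return h_fin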
In order to compare AdaBoost.M2, which uses pseudo-loss, to bagging, we also extended bagging in a natural way for use with a weak learning algorithm that minimizes pseudo-loss rather than ordinary error. As described in Section 2.2, such a weak learning algorithm expects to be provided with a distribution over the set B of all mislabels. On each round of bagging, we construct this distribution using the bootstrap method; that is, we select |B| mislabels from B (chosen uniformly at random with replacement), and assign each mislabel weight 1/|B| times the number of times it was chosen. The hypotheses h_t computed in this manner are then combined using voting in a natural manner; namely, given x, the combined hypothesis outputs the label y which maximizes the sum over t of h_t(x, y).

For either error or pseudo-loss, the differences between bagging and boosting can be summarized as follows: (1) bagging always uses resampling rather than reweighting; (2) bagging does not modify the distribution over examples or mislabels, but instead always uses the uniform distribution; and (3) in forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses.

Figure 4: Comparison of boosting and bagging for each of the weak learners. (Three scatter plots, one for FindAttrTest, FindDecRule and C4.5, plotting the error of bagging against the error of boosting.)

3.3 THE EXPERIMENTS

We conducted our experiments on a collection of machine learning datasets available from the repository at University of California at Irvine.(3) A summary of some of the properties of these datasets is given in Table 1. Some datasets are provided with a test set. For these, we reran each algorithm 20 times (since some of the algorithms are randomized), and averaged the results. For datasets with no provided test set, we used 10-fold cross validation, and averaged the results over 10 runs (for a total of 100 runs of each algorithm on each dataset). In all our experiments, we set the number of rounds of boosting or bagging to be T = 100.

(3) URL "http://www.ics.uci.edu/~mlearn/MLRepository.html"

3.4 RESULTS AND DISCUSSION

The results of our experiments are shown in Table 2. The figures indicate test error rate averaged over multiple runs of each algorithm. Columns indicate which weak learning algorithm was used, and whether pseudo-loss (AdaBoost.M2) or error (AdaBoost.M1) was used. Note that pseudo-loss was not used on any two-class problems since the resulting algorithm would be identical to the corresponding error-based algorithm. Columns labeled "-" indicate that the weak learning algorithm was used by itself (with no boosting or bagging). Columns using boosting or bagging are marked "boost" and "bag," respectively.

One of our goals in carrying out these experiments was to determine if boosting using pseudo-loss (rather than error) is worthwhile. Figure 3 shows how the different algorithms performed on each of the many-class (k > 2) problems using pseudo-loss versus error. Each point in the scatterplot represents the error achieved by the two competing algorithms on a given benchmark, so there is one point for each benchmark.
Figure 5: Comparison of C4.5 versus various other boosting and bagging methods. (Four scatter plots comparing the error of C4.5 with that of boosting FindAttrTest, boosting FindDecRule, boosting C4.5, and bagging C4.5.)

These experiments indicate that boosting using pseudo-loss clearly outperforms boosting using error. Using pseudo-loss did dramatically better than error on every non-binary problem (except it did slightly worse on "iris" with three classes). Because AdaBoost.M2 did so much better than AdaBoost.M1, we will only discuss AdaBoost.M2 henceforth.

As the figure shows, using pseudo-loss with bagging gave mixed results in comparison to ordinary error. Overall, pseudo-loss gave better results, but occasionally, using pseudo-loss hurt considerably.

Figure 4 shows similar scatterplots comparing the performance of boosting and bagging for all the benchmarks and all three weak learners. For boosting, we plotted the error rate achieved using pseudo-loss. To present bagging in the best possible light, we used the error rate achieved using either error or pseudo-loss, whichever gave the better result on that particular benchmark. (For the binary problems, and experiments with C4.5, only error was used.)

For the simpler weak learning algorithms (FindAttrTest and FindDecRule), boosting did significantly and uniformly better than bagging. The boosting error rate was worse than the bagging error rate (using either pseudo-loss or error) on a very small number of benchmark problems, and on these, the difference in performance was quite small. On average, for FindAttrTest, boosting improved the error rate over using FindAttrTest alone by 55.2%, compared to bagging which gave an improvement of only 11.0% using pseudo-loss or 8.4% using error. For FindDecRule, boosting improved the error rate by 53.0%, bagging by only 18.8% using pseudo-loss, 13.1% using error.

When using C4.5 as the weak learning algorithm, boosting and bagging seem more evenly matched, although boosting still seems to have a slight advantage. On average, boosting improved the error rate by 24.8%, bagging by 20.0%. Boosting beat bagging by more than 2% on 6 of the benchmarks, while bagging did not beat boosting by this amount on any benchmark. For the remaining 20 benchmarks, the difference in performance was less than 2%.

Figure 5 shows in a similar manner how C4.5 performed compared to bagging with C4.5, and compared to boosting with each of the weak learners (using pseudo-loss for the non-binary problems). As the figure shows, using boosting with FindAttrTest does quite well as a learning algorithm in its own right, in comparison to C4.5. This algorithm beat C4.5 on 10 of the benchmarks (by at least 2%), tied on 14, and lost on 3. As mentioned above, its average performance relative to using FindAttrTest by itself was 55.2%. In comparison, C4.5's improvement in performance over FindAttrTest was 49.3%.
Table 2: Test error rates of various algorithms on benchmark problems.

                         FindAttrTest                       FindDecRule                        C4.5
                         error          pseudo-loss         error          pseudo-loss
  name              -     boost  bag    boost  bag     -     boost  bag    boost  bag     -     boost  bag
  soybean-small    57.6   56.4   48.7    0.2   20.5   51.8   56.0   45.7    0.4    2.9    2.2    3.4    2.2
  labor            25.1    8.8   19.1     -      -    24.0    7.3   14.6     -      -    15.8   13.1   11.3
  promoters        29.7    8.9   16.6     -      -    25.9    8.3   13.7     -      -    22.0    5.0   12.7
  iris             35.2    4.7   28.4    4.8    7.1   38.3    4.3   18.8    4.8    5.5    5.9    5.0    5.0
  hepatitis        19.7   18.6   16.8     -      -    21.6   18.0   20.1     -      -    21.2   16.3   17.5
  sonar            25.9   16.5   25.9     -      -    31.4   16.2   26.1     -      -    28.9   19.0   24.3
  glass            51.5   51.1   50.9   29.4   54.2   49.7   48.5   47.2   25.0   52.0   31.7   22.7   25.7
  audiology.stand  53.5   53.5   53.5   23.6   65.7   53.5   53.5   53.5   19.9   65.7   23.1   16.2   20.1
  cleve            27.8   18.8   22.4     -      -    27.4   19.7   20.3     -      -    26.6   21.7   20.9
  soybean-large    64.8   64.5   59.0    9.8   74.2   73.6   73.6   73.6    7.2   66.0   13.3    6.8   12.2
  ionosphere       17.8    8.5   17.3     -      -    10.3    6.6    9.3     -      -     8.9    5.8    6.2
  house-votes-84    4.4    3.7    4.4     -      -     5.0    4.4    4.4     -      -     3.5    5.1    3.6
  votes1           12.7    8.9   12.7     -      -    13.2    9.4   11.2     -      -    10.3   10.4    9.2
  crx              14.5   14.4   14.5     -      -    14.5   13.5   14.5     -      -    15.8   13.8   13.6
  breast-cancer-w   8.4    4.4    6.7     -      -     8.1    4.1    5.3     -      -     5.0    3.3    3.2
  pima-indians-di  26.1   24.4   26.1     -      -    27.8   25.3   26.4     -      -    28.4   25.7   24.4
  vehicle          64.3   64.4   57.6   26.1   56.1   61.3   61.2   61.0   25.0   54.3   29.9   22.6   26.1
  vowel            81.8   81.8   76.8   18.2   74.7   82.0   72.7   71.6    6.5   63.2    2.2    0.0    0.0
  german           30.0   24.9   30.4     -      -    30.0   25.4   29.6     -      -    29.4   25.0   24.6
  segmentation     75.8   75.8   54.5    4.2   72.5   73.7   53.3   54.3    2.4   58.0    3.6    1.4    2.7
  hypothyroid       2.2    1.0    2.2     -      -     0.8    1.0    0.7     -      -     0.8    1.0    0.8
  sick-euthyroid    5.6    3.0    5.6     -      -     2.4    2.4    2.2     -      -     2.2    2.1    2.1
  splice           37.0    9.2   35.6    4.4   33.4   29.5    8.0   29.5    4.0   29.5    5.8    4.9    5.2
  kr-vs-kp         32.8    4.4   30.7     -      -    24.6    0.7   20.8     -      -     0.5    0.3    0.6
  satimage         58.3   58.3   58.3   14.9   41.6   57.6   56.5   56.7   13.1   30.0   14.8    8.9   10.6
  agaricus-lepiot  11.3    0.0   11.3     -      -     8.2    0.0    8.2     -      -     0.0    0.0    0.0
  letter-recognit  92.9   92.9   91.9   34.1   93.7   92.3   91.8   91.8   30.4   93.7   13.8    3.3    6.8

Using boosting with FindDecRule did somewhat better. The win-tie-lose numbers for this algorithm (compared to C4.5) were 13-12-2, and its average improvement over FindAttrTest was 58.1%.

4 BOOSTING A NEAREST-NEIGHBOR CLASSIFIER

In this section we study the performance of a learning algorithm which combines AdaBoost and a variant of the nearest-neighbor classifier. We test the combined algorithm on the problem of recognizing handwritten digits. Our goal is not to improve on the accuracy of the nearest neighbor classifier, but rather to speed it up. Speed-up is achieved by reducing the number of prototypes in the hypothesis (and thus the required number of distance calculations) without increasing the error rate. It is a similar approach to that of nearest-neighbor editing [12, 13] in which one tries to find the minimal set of prototypes that is sufficient to label all the training set correctly.

The dataset comes from the US Postal Service (USPS) and consists of 9709 training examples and 2007 test examples. The training and test examples are evidently drawn from rather different distributions as there is a very significant improvement in the performance if the partition of the data into training and testing is done at random (rather than using the given partition). We report results both on the original partitioning and on a training set and a test set of the same sizes that were generated by randomly partitioning the union of the original training and test sets.

Each image is represented by a 16 x 16 matrix of 8-bit pixels. The metric that we use for identifying the nearest neighbor, and hence for classifying an instance, is the standard Euclidean distance between the images (viewed as vectors in R^256). This is a very naive metric, but it gives reasonably good performance. A nearest-neighbor classifier which uses all the training examples as prototypes achieves a test error of 5.7% (2.3% on randomly partitioned data). Using the more sophisticated tangent distance [21] is in our future plans.

Each weak hypothesis is defined by a subset P of the training examples, and a mapping pi : P -> [0, 1]^k. Given a new test point x, such a weak hypothesis predicts the vector pi(x'), where x' in P is the closest point to x.
On each round of boosting, a weak hypothesis is generated by adding one prototype at a time to the set until the set reaches a prespecified size. Given any set P, we always choose the mapping pi which minimizes the pseudo-loss of the resulting weak hypothesis (with respect to the given mislabel distribution). Initially, the set of prototypes is empty. Next, ten candidate prototypes are selected at random according to the current (marginal) distribution over the training examples. Of these candidates, the one that causes the largest decrease in the pseudo-loss is added to the set, and the process is repeated. The boosting process thus influences the weak learning algorithm in two ways: first, by changing the way the ten random examples are selected, and second by changing the calculation of the pseudo-loss.

It often happens that, on the following round of boosting, the same set P will have pseudo-loss significantly less than 1/2 with respect to the new mislabel distribution (but possibly using a different mapping pi). In this case, rather than choosing a new set of prototypes, we reuse the same set in additional boosting steps until the advantage that can be gained from the given partition is exhausted (details omitted).
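The greedy loop just described can be sketched as follows in Python (illustrative only). Two simplifications relative to the paper are worth flagging: the mapping pi is fixed to the one-hot vector of each prototype's own label rather than being chosen to minimize pseudo-loss, and the pseudo_loss helper mirrors the sketch given in Section 2.2; all names and interfaces are assumptions.

import numpy as np

def grow_prototype_set(X, y, marginal_dist, mislabel_dist, target_size,
                       candidates_per_step=10, seed=0):
    """Greedily grow a prototype set for a nearest-neighbor weak hypothesis.
    marginal_dist: distribution over training examples used to draw candidates.
    mislabel_dist: maps (i, wrong_label) -> weight, as in AdaBoost.M2."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    def hypothesis(protos):
        proto_x, proto_y = X[protos], [y[i] for i in protos]
        def h(x, label):
            j = int(np.argmin(np.linalg.norm(proto_x - x, axis=1)))  # nearest prototype
            return 1.0 if proto_y[j] == label else 0.0               # one-hot mapping pi
        return h

    def pseudo_loss(h):
        return sum(0.5 * w * (1.0 - h(X[i], y[i]) + h(X[i], wrong))
                   for (i, wrong), w in mislabel_dist.items())

    protos = []
    while len(protos) < target_size:
        cands = rng.choice(len(X), size=candidates_per_step, p=marginal_dist)
        # keep the candidate whose addition gives the largest drop in pseudo-loss
        best = min(cands, key=lambda c: pseudo_loss(hypothesis(protos + [int(c)])))
        protos.append(int(best))
    return protos, hypothesis(protos)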
Figure 6: A sample of the examples that have the largest weight after 3 of the 30 boosting iterations. The first line is after iteration 4, the second after iteration 12 and the third after iteration 25. Underneath each image we have a line of the form d : l1/w1, l2/w2, where d is the label of the example, l1 and l2 are the labels that get the highest and second highest vote from the combined hypothesis at that point in the run of the algorithm, and w1, w2 are the corresponding normalized votes.

We ran 30 iterations of the boosting algorithm, and the number of prototypes we used were 10 for the first weak hypothesis, 20 for the second, 40 for the third, 80 for the next five, and 100 for the remaining twenty-two weak hypotheses. These sizes were chosen so that the errors of all of the weak hypotheses are approximately equal.

We compared the performance of our algorithm to a strawman algorithm which uses a single set of prototypes. Similar to our algorithm, the prototype set is generated incrementally, comparing ten prototype candidates at each step, and always choosing the one that minimizes the empirical error. We compared the performance of the boosting algorithm to that of the strawman hypothesis that uses the same number of prototypes. We also compared our performance to that of the condensed nearest neighbor rule (CNN) [13], a greedy method for finding a small set of prototypes which correctly classify the entire training set.

4.1 RESULTS AND DISCUSSION

The results of our experiments are summarized in Table 3 and Figure 7. Table 3 describes the results from experiments with AdaBoost (each experiment was repeated 10 times using different random seeds), the strawman algorithm (each repeated 7 times), and CNN (7 times). We compare the results using a random partition of the data into training and testing and using the partition that was defined by USPS. We see that in both cases, after more than 970 examples, the training error of AdaBoost is much better than that of the strawman algorithm. The performance on the test set is similar, with a slight advantage to AdaBoost when the hypotheses include more than 1670 examples, but a slight advantage to strawman if fewer rounds of boosting are used. After 2670 examples, the error of AdaBoost on the random partition is (on average) 2.7%, while the error achieved by using the whole training set is 2.3%. On the USPS partition, the final error is 6.4%, while the error using the whole training set is 5.7%.

Figure 7: Graphs of the performance of the boosting algorithm on a randomly partitioned USPS dataset. The horizontal axis indicates the total number of prototypes that were added to the combined hypothesis, and the vertical axis indicates error. The topmost jagged line indicates the error of the weak hypothesis that is trained at this point on the weighted training set. The bold curve is the bound on the training error calculated using Theorem 2. The lowest thin curve and the medium-bold curve show the performance of the combined hypothesis on the training set and test set, respectively.

Comparing to CNN, we see that both the strawman algorithm and AdaBoost perform better than CNN even when they use about 900 examples in their hypotheses. Larger hypotheses generated by AdaBoost or strawman are much better than that generated by CNN. The main problem with CNN seems to be its tendency to overfit the training data. AdaBoost and the strawman algorithm seem to suffer less from overfitting.

Figure 7 shows a typical run of AdaBoost. The uppermost jagged line is a concatenation of the errors of the weak hypotheses with respect to the mislabel distribution. Each peak followed by a valley corresponds to the beginning and end errors of a weak hypothesis as it is being constructed, one prototype at a time. The weighted error always starts around 50% at the beginning of a boosting iteration and drops to around 20-30%. The heaviest line describes the upper bound on the training error that is guaranteed by Theorem 2, and the two bottom lines describe the training and test error of the final combined hypothesis.

It is interesting that the performance of the boosting algorithm on the test set improved significantly after the error on the training set had already become zero. This is surprising because an "Occam's razor" argument would predict that increasing the complexity of the hypothesis after the error has been reduced to zero is likely to degrade the performance on the test set.

Figure 6 shows a sample of the examples that are given large weights by the boosting algorithm on a typical run. There seem to be two types of "hard" examples. First are examples which are very atypical or wrongly labeled (such as example 2 on the first line and examples 3 and 4 on the second line). The second type, which tends to dominate on later iterations, consists of examples that are very similar to each other but have different labels (such as examples 3 versus 4 on the third line). Although the algorithm at this point was correct on all training examples, it is clear from the votes it assigned to different labels for these example pairs that it was still trying to improve the discrimination between similar examples. This agrees with our intuition that the pseudo-loss is a mechanism that causes the boosting algorithm to concentrate on the hard to discriminate labels of hard examples.
Table 3: Average error rates on training and test sets, in percent. For columns labeled "random partition," a random partition of the union of the training and test sets was used; "USPS partition" means the USPS-provided partition into training and test sets was used. Columns labeled "theory" give theoretical upper bounds on training error calculated using Theorem 2. "Size" indicates number of prototypes defining the final hypothesis.

                      random partition                               USPS partition
               AdaBoost              Strawman    CNN         AdaBoost              Strawman    CNN
  rnd   size   theory  train  test   train test  test (size) theory  train  test   train test  test (size)
    1     10    524.6   45.9  46.1    37.9  38.3               536.3   42.5  43.1    36.1  37.6
    2     30     86.4    6.3   8.5     4.9   6.2                83.0    5.1  12.3     4.2  10.6
   10    670     16.0    0.4   4.6     2.0   4.3                10.9    0.1   8.6     1.4   8.3
   13    970      4.5    0.0   3.9     1.5   3.8   4.4 (990)     3.3    0.0   8.1     1.0   7.7   8.6 (865)
   15   1170      2.4    0.0   3.6     1.3   3.6                 1.5    0.0   7.7     0.8   7.5
   20   1670      0.4    0.0   3.1     0.9   3.3                 0.2    0.0   7.0     0.6   7.1
   25   2170      0.1    0.0   2.9     0.7   3.0                 0.0    0.0   6.7     0.4   6.9
   30   2670      0.0    0.0   2.7     0.5   2.8                 0.0    0.0   6.4     0.3   6.8

5 CONCLUSIONS

We have demonstrated that AdaBoost can be used in many settings to improve the performance of a learning algorithm. When starting with relatively simple classifiers, the improvement can be especially dramatic, and can often lead to a composite classifier that outperforms more complex "one-shot" learning algorithms like C4.5. This improvement is far greater than can be achieved with bagging. Note, however, that for non-binary classification problems, boosting simple classifiers can only be done effectively if the more sophisticated pseudo-loss is used.

When starting with a complex algorithm like C4.5, boosting can also be used to improve performance, but does not have such a compelling advantage over bagging. Boosting combined with a complex algorithm may give the greatest improvement in performance when there is a reasonably large amount of data available (note, for instance, boosting's performance on the "letter-recognition" problem with 16,000 training examples). Naturally, one needs to consider whether the improvement in error is worth the additional computation time. Although we used 100 rounds of boosting, Quinlan [19] got good results using only 10 rounds.

Boosting may have other applications, besides reducing the error of a classifier. For instance, we saw in Section 4 that boosting can be used to find a small set of prototypes for a nearest neighbor classifier.

As described in the introduction, boosting combines two effects. It reduces the bias of the weak learner by forcing the weak learner to concentrate on different parts of the instance space, and it also reduces the variance of the weak learner by averaging several hypotheses that were generated from different subsamples of the training set. While there is good theory to explain the bias reducing effects, there is need for a better theory of the variance reduction.

Acknowledgements. Thanks to Jason Catlett and William Cohen for extensive advice on the design of our experiments. Thanks to Ross Quinlan for first suggesting a comparison of boosting and bagging. Thanks also to Leo Breiman, Corinna Cortes, Harris Drucker, Jeff Jackson, Michael Kearns, Ofer Matan, Partha Niyogi, Warren Smith, David Wolpert and the anonymous ICML reviewers for helpful comments, suggestions and criticisms. Finally, thanks to all who contributed to the datasets used in this paper.

References

[1] Leo Breiman. Bagging predictors. Technical Report 421, Department of Statistics, University of California at Berkeley, 1994.
[2] Leo Breiman. Bias, variance, and arcing classifiers. Unpublished manuscript, 1996.
[3] William Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123, 1995.
[4] Tom Dietterich, Michael Kearns, and Yishay Mansour. Applying the weak learning framework to understand and improve C4.5. In Machine Learning: Proceedings of the Thirteenth International Conference, 1996.
[5] Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems 8, 1996.
[6] Harris Drucker, Corinna Cortes, L. D. Jackel, Yann LeCun, and Vladimir Vapnik. Boosting and other ensemble methods. Neural Computation, 6(6):1289-1301, 1994.
[7] Harris Drucker, Robert Schapire, and Patrice Simard. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):705-719, 1993.
[8] Harris Drucker, Robert Schapire, and Patrice Simard. Improving performance in neural networks using a boosting algorithm. In Advances in Neural Information Processing Systems 5, pages 42-49, 1993.
[9] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.
[10] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Unpublished manuscript available electronically (on our web pages, or by email request). An extended abstract appeared in Computational Learning Theory: Second European Conference, EuroCOLT '95, pages 23-37, Springer-Verlag, 1995.
[11] Johannes Fürnkranz and Gerhard Widmer. Incremental reduced error pruning. In Machine Learning: Proceedings of the Eleventh International Conference, pages 70-77, 1994.
[12] Geoffrey W. Gates. The reduced nearest neighbor rule. IEEE Transactions on Information Theory, pages 431-433, 1972.
[13] Peter E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14:515-516, May 1968.
[14] Robert C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63-91, 1993.
[15] Jeffrey C. Jackson and Mark W. Craven. Learning sparse perceptrons. In Advances in Neural Information Processing Systems 8, 1996.
[16] Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, 1996.
[17] Eun Bae Kong and Thomas G. Dietterich. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning, pages 313-321, 1995.
[18] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[19] J. Ross Quinlan. Bagging, boosting, and C4.5. In Proceedings, Fourteenth National Conference on Artificial Intelligence, 1996.
[20] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.
[21] Patrice Simard, Yann LeCun, and John Denker. Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems, volume 5, pages 50-58, 1993.