Journal of Artificial Intelligence Research 11 (1999) 169-198. Submitted 1/99; published 8/99.

Popular Ensemble Methods: An Empirical Study

David Opitz (opitz@cs.umt.edu)
Department of Computer Science, University of Montana, Missoula, MT 59812 USA

Richard Maclin (rmaclin@d.umn.edu)
Computer Science Department, University of Minnesota, Duluth, MN 55812 USA

Abstract

An ensemble consists of a set of individually trained classifiers (such as neural networks or decision trees) whose predictions are combined when classifying novel instances. Previous research has shown that an ensemble is often more accurate than any of the single classifiers in the ensemble. Bagging (Breiman, 1996c) and Boosting (Freund & Schapire, 1996; Schapire, 1990) are two relatively new but popular methods for producing ensembles. In this paper we evaluate these methods on 23 data sets using both neural networks and decision trees as our classification algorithm. Our results clearly indicate a number of conclusions. First, while Bagging is almost always more accurate than a single classifier, it is sometimes much less accurate than Boosting. On the other hand, Boosting can create ensembles that are less accurate than a single classifier, especially when using neural networks. Analysis indicates that the performance of the Boosting methods is dependent on the characteristics of the data set being examined. In fact, further results show that Boosting ensembles may overfit noisy data sets, thus decreasing its performance. Finally, consistent with previous studies, our work suggests that most of the gain in an ensemble's performance comes in the first few classifiers combined; however, relatively large gains can be seen up to 25 classifiers when Boosting decision trees.

1. Introduction

Many researchers have investigated the technique of combining the predictions of multiple classifiers to produce a single classifier (Breiman, 1996c; Clemen, 1989; Perrone, 1993; Wolpert, 1992). The resulting classifier (hereafter referred to as an ensemble) is generally more accurate than any of the individual classifiers making up the ensemble. Both theoretical (Hansen & Salamon, 1990; Krogh & Vedelsby, 1995) and empirical (Hashem, 1997; Opitz & Shavlik, 1996a, 1996b) research has demonstrated that a good ensemble is one where the individual classifiers in the ensemble are both accurate and make their errors on different parts of the input space. Two popular methods for creating accurate ensembles are Bagging (Breiman, 1996c) and Boosting (Freund & Schapire, 1996; Schapire, 1990). These methods rely on "resampling" techniques to obtain different training sets for each of the classifiers. In this paper we present a comprehensive evaluation of both Bagging and Boosting on 23 data sets using two basic classification methods: decision trees and neural networks.

© 1999 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
Previous work has demonstrated that Bagging and Boosting are very effective for decision trees (Bauer & Kohavi, 1999; Drucker & Cortes, 1996; Breiman, 1996c, 1996b; Freund & Schapire, 1996; Quinlan, 1996); however, there has been little empirical testing with neural networks (especially with the new Boosting algorithm). Discussions with previous researchers reveal that many authors concentrated on decision trees due to their fast training speed and well-established default parameter settings. Neural networks present difficulties for testing, both in terms of the significant processing time required and in selecting training parameters; however, we feel there are distinct advantages to including neural networks in our study. First, previous empirical studies have demonstrated that individual neural networks produce highly accurate classifiers that are sometimes more accurate than corresponding decision trees (Fisher & McKusick, 1989; Mooney, Shavlik, Towell, & Gove, 1989). Second, neural networks have been extensively applied across numerous domains (Arbib, 1995). Finally, by studying neural networks in addition to decision trees we can examine how Bagging and Boosting are influenced by the learning algorithm, giving further insight into the general characteristics of these approaches. Bauer and Kohavi (1999) also study Bagging and Boosting applied to two learning methods, in their case decision trees using a variant of C4.5 and naive-Bayes classifiers, but their study mainly concentrated on the decision tree results.

Our neural network and decision tree results led us to a number of interesting conclusions. The first is that a Bagging ensemble generally produces a classifier that is more accurate than a standard classifier. Thus one should feel comfortable always Bagging their decision trees or neural networks. For Boosting, however, we note more widely varying results. For a few data sets Boosting produced dramatic reductions in error (even compared to Bagging), but for other data sets it actually increased the error over a single classifier (particularly with neural networks). In further tests we examined the effects of noise and support Freund and Schapire's (1996) conjecture that Boosting's sensitivity to noise may be partly responsible for its occasional increase in error.

An alternate baseline approach we investigated was the creation of a simple neural network ensemble where each network used the full training set and differed only in its random initial weight settings. Our results indicate that this ensemble technique is surprisingly effective, often producing results as good as Bagging. Research by Ali and Pazzani (1996) demonstrated similar results using randomized decision tree algorithms. Our results also show that the ensemble methods are generally consistent (in terms of their effect on accuracy) when applied either to neural networks or to decision trees; however, there is little inter-correlation between neural networks and decision trees except for the Boosting methods. This suggests that some of the increases produced by Boosting are dependent on the particular characteristics of the data set rather than on the component classifier. In further tests we demonstrate that Bagging is more resilient to noise than Boosting.

Finally, we investigated the question of how many component classifiers should be used in an ensemble. Consistent with previous research (Freund & Schapire, 1996; Quinlan, 1996), our results show that most of the reduction in error for ensemble methods occurs with the first few additional classifiers. With Boosting decision trees, however, relatively large gains may be seen up until about 25 classifiers.
This paper is organized as follows. In the next section we present an overview of classifier ensembles and discuss Bagging and Boosting in detail. Next we present an extensive empirical analysis of Bagging and Boosting. Following that we present future research and additional related work before concluding.

2. Classifier Ensembles

Figure 1 illustrates the basic framework for a classifier ensemble. In this example, neural networks are the basic classification method, though conceptually any classification method (e.g., decision trees) can be substituted in place of the networks. Each network in Figure 1's ensemble (network 1 through network N in this case) is trained using the training instances for that network. Then, for each example, the predicted output of each of these networks (o_i in Figure 1) is combined to produce the output of the ensemble (ô in Figure 1). Many researchers (Alpaydin, 1993; Breiman, 1996c; Krogh & Vedelsby, 1995; Lincoln & Skrzypek, 1989) have demonstrated that an effective combining scheme is to simply average the predictions of the networks.

Figure 1: A classifier ensemble of neural networks. (The figure shows an input fed to networks 1 through N, whose individual outputs o_1 through o_N are combined into the ensemble output ô.)

Combining the output of several classifiers is useful only if there is disagreement among them. Obviously, combining several identical classifiers produces no gain. Hansen and Salamon (1990) proved that if the average error rate for an example is less than 50% and the component classifiers in the ensemble are independent in the production of their errors, the expected error for that example can be reduced to zero as the number of classifiers combined goes to infinity; however, such assumptions rarely hold in practice. Krogh and Vedelsby (1995) later proved that the ensemble error can be divided into a term measuring the average generalization error of each individual classifier and a term measuring the disagreement among the classifiers. What they formally showed was that an ideal ensemble consists of highly correct classifiers that disagree as much as possible. Opitz and Shavlik (1996a, 1996b) empirically verified that such ensembles generalize well. As a result, methods for creating ensembles center around producing classifiers that disagree on their predictions.
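As a concrete illustration of this combining scheme, the following is a minimal Python sketch of Figure 1's averaging step. It assumes each trained component exposes a predict_proba-style method returning one probability per output class; the interface names are ours (following a common convention), not the paper's.

    import numpy as np

    def ensemble_predict(classifiers, example):
        # o_i in Figure 1: each component's vector of class probabilities.
        outputs = np.array([clf.predict_proba(example) for clf in classifiers])
        # o-hat in Figure 1: average the o_i, then pick the highest-scoring class.
        return outputs.mean(axis=0).argmax()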
Generally, these methods focus on altering the training process in the hope that the resulting classifiers will produce different predictions. For example, neural network techniques that have been employed include methods for training with different topologies, different initial weights, different parameters, and training only on a portion of the training set (Alpaydin, 1993; Drucker, Cortes, Jackel, LeCun, & Vapnik, 1994; Hansen & Salamon, 1990; Maclin & Shavlik, 1995). In this paper we concentrate on two popular methods (Bagging and Boosting) that try to generate disagreement among the classifiers by altering the training set each classifier sees.

2.1 Bagging Classifiers

Bagging (Breiman, 1996c) is a "bootstrap" (Efron & Tibshirani, 1993) ensemble method that creates individuals for its ensemble by training each classifier on a random redistribution of the training set. Each classifier's training set is generated by randomly drawing, with replacement, N examples, where N is the size of the original training set; many of the original examples may be repeated in the resulting training set while others may be left out. Each individual classifier in the ensemble is generated with a different random sampling of the training set. Figure 2 gives a sample of how Bagging might work on an imaginary set of data. Since Bagging resamples the training set with replacement, some instances are represented multiple times while others are left out. So Bagging's training-set-1 might contain examples 3 and 7 twice, but does not contain either example 4 or 5. As a result, the classifier trained on training-set-1 might obtain a higher test-set error than the classifier using all of the data. In fact, all four of Bagging's component classifiers could result in higher test-set error; however, when combined, these four classifiers can (and often do) produce test-set error lower than that of the single classifier (the diversity among these classifiers generally compensates for the increase in error rate of any individual classifier).

Breiman (1996c) showed that Bagging is effective on "unstable" learning algorithms where small changes in the training set result in large changes in predictions. Breiman (1996c) claimed that neural networks and decision trees are examples of unstable learning algorithms. We study the effectiveness of Bagging on both these learning methods in this article.
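A bootstrap replicate of the kind Bagging uses is simple to generate. The sketch below is our illustration (not the paper's code) of drawing the N-with-replacement samples described above, one per component classifier.

    import random

    def bagging_training_sets(examples, num_classifiers):
        # Each replicate draws N = len(examples) items with replacement,
        # so some examples repeat and others are left out (Section 2.1).
        n = len(examples)
        return [[random.choice(examples) for _ in range(n)]
                for _ in range(num_classifiers)]

    # For Figure 2's imaginary data set of eight examples,
    # bagging_training_sets(list(range(1, 9)), 4) might yield something like
    # [[2, 7, 8, 3, 7, 6, 3, 1], ...]: one resampled set per classifier.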
2.2 Boosting Classifiers

Boosting (Freund & Schapire, 1996; Schapire, 1990) encompasses a family of methods. The focus of these methods is to produce a series of classifiers. The training set used for each member of the series is chosen based on the performance of the earlier classifier(s) in the series. In Boosting, examples that are incorrectly predicted by previous classifiers in the series are chosen more often than examples that were correctly predicted. Thus Boosting attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. (Note that in Bagging, the resampling of the training set is not dependent on the performance of the earlier classifiers.)

In this work we examine two new and powerful forms of Boosting: Arcing (Breiman, 1996b) and Ada-Boosting (Freund & Schapire, 1996). Like Bagging, Arcing chooses a training set of size N for classifier K+1 by probabilistically selecting (with replacement) examples from the original N training examples. Unlike Bagging, however, the probability of selecting an example is not equal across the training set. This probability depends on how often that example was misclassified by the previous K classifiers.

A sample of a single classifier on an imaginary set of data.
(Original) Training Set
Training-set-1: 1, 2, 3, 4, 5, 6, 7, 8

A sample of Bagging on the same data.
(Resampled) Training Set
Training-set-1: 2, 7, 8, 3, 7, 6, 3, 1
Training-set-2: 7, 8, 5, 6, 4, 2, 7, 1
Training-set-3: 3, 6, 2, 7, 5, 6, 2, 2
Training-set-4: 4, 5, 1, 4, 6, 4, 3, 8

A sample of Boosting on the same data.
(Resampled) Training Set
Training-set-1: 2, 7, 8, 3, 7, 6, 3, 1
Training-set-2: 1, 4, 5, 4, 1, 5, 6, 4
Training-set-3: 7, 1, 5, 8, 1, 8, 1, 4
Training-set-4: 1, 1, 6, 1, 1, 3, 1, 5

Figure 2: Hypothetical runs of Bagging and Boosting. Assume there are eight training examples. Assume example 1 is an "outlier" and is hard for the component learning algorithm to classify correctly. With Bagging, each training set is an independent sample of the data; thus, some examples are missing and others occur multiple times. The Boosting training sets are also samples of the original data set, but the "hard" example (example 1) occurs more in later training sets since Boosting concentrates on correctly predicting it.

Ada-Boosting can use the approach of (a) selecting a set of examples based on the probabilities of the examples, or (b) simply using all of the examples and weighting the error of each example by the probability for that example (i.e., examples with higher probabilities have more effect on the error). This latter approach has the clear advantage that each example is incorporated (at least in part) in the training set. Furthermore, Friedman et al. (1998) have demonstrated that this form of Ada-Boosting can be viewed as a form of additive modeling for optimizing a logistic loss function. In this work, however, we have chosen to use the approach of subsampling the data to ensure a fair empirical comparison (in part due to the restarting reason discussed below).

Both Arcing and Ada-Boosting initially set the probability of picking each example to be 1/N. These methods then recalculate these probabilities after each trained classifier is added to the ensemble. For Ada-Boosting, let ε_k be the sum of the probabilities of the misclassified instances for the currently trained classifier C_k. The probabilities for the next trial are generated by multiplying the probabilities of C_k's incorrectly classified instances by the factor β_k = (1 − ε_k)/ε_k and then renormalizing all probabilities so that their sum equals 1.
Ada-Boosting combines the classifiers C_1, ..., C_k using weighted voting where C_k has weight log(β_k). These weights allow Ada-Boosting to discount the predictions of classifiers that are not very accurate on the overall problem. Friedman et al. (1998) have also suggested an alternative mechanism that fits together the predictions of the classifiers as an additive model using a maximum likelihood criterion. In this work, we use the revision described by Breiman (1996b) where we reset all the weights to be equal and restart if either ε_k is not less than 0.5 or ε_k becomes 0.¹ By resetting the weights we do not disadvantage the Ada-Boosting learner in those cases where it reaches these values of ε_k; the Ada-Boosting learner always incorporates the same number of classifiers as other methods we tested. To make this feasible, we are forced to use the approach of selecting a data set probabilistically rather than weighting the examples; otherwise a deterministic method such as C4.5 would cycle and generate duplicate members of the ensemble. That is, resetting the weights to 1/N would cause the learner to repeat the decision tree learned as the first member of the ensemble, and this would lead to reweighting the data set the same as for the second member of the ensemble, and so on. Randomly selecting examples for the data set based on the example probabilities alleviates this problem.

¹ For those few cases where ε_k becomes 0 (less than 0.12% of our results) we simply use a large positive value, log(β_k) = 3.0, to weight these networks. For the more likely cases where ε_k is larger than 0.5 (approximately 5% of our results) we chose to weight the predictions by a very small positive value (0.001) rather than using a negative or 0 weight factor (this produced slightly better results than the alternate approaches in pilot studies).

Arcing-x4 (Breiman, 1996b) (which we will refer to simply as Arcing) started out as a simple mechanism for evaluating the effect of Boosting methods where the resulting classifiers were combined without weighting the votes. Arcing uses a simple mechanism for determining the probabilities of including examples in the training set. For the ith example in the training set, the value m_i refers to the number of times that example was misclassified by the previous K classifiers. The probability p_i for selecting example i to be part of classifier K+1's training set is defined as

    p_i = (1 + m_i^4) / Σ_{j=1}^{N} (1 + m_j^4)        (1)

Breiman chose the value of the power (4) empirically after trying several different values (Breiman, 1996b). Although this mechanism does not have the weighted voting of Ada-Boosting, it still produces accurate ensembles and is simple to implement; thus we include this method (along with Ada-Boosting) in our empirical evaluation.

Figure 2 shows a hypothetical run of Boosting. Note that the first training set would be the same as Bagging; however, later training sets accentuate examples that were misclassified by the earlier members of the ensembles. In this figure, example 1 is a "hard" example that previous classifiers tend to misclassify. With the second training set, example 1 occurs multiple times, as do examples 4 and 5 since they were left out of the first training set and, in this case, misclassified by the first learner. For the final training set, example 1 becomes the predominant example chosen (whereas no single example is accentuated with Bagging); thus, the overall test-set error for this classifier might become very high. Despite this, however, Boosting will probably obtain a lower error rate when it combines the output of these four classifiers since it focuses on correctly predicting previously misclassified examples and weights the predictions of the different classifiers based on their accuracy for the training set. But Boosting can also overfit in the presence of noise (as we empirically show in Section 3).
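Putting the two update rules side by side, the following Python sketch reflects our reading of the text and of Equation (1); it is not the authors' implementation. It computes the Ada-Boosting probability update with Breiman's restart revision, and the Arcing-x4 selection probabilities. Here probs is assumed to be a NumPy array over the N training examples.

    import math
    import numpy as np

    def ada_boost_update(probs, misclassified):
        # One Ada-Boosting reweighting step. `misclassified` is a boolean
        # mask of examples the newly trained classifier C_k got wrong.
        # Returns (updated probabilities, voting weight log(beta_k) or None).
        eps_k = probs[misclassified].sum()       # epsilon_k
        if eps_k == 0.0 or eps_k >= 0.5:
            # Breiman's (1996b) revision: reset to uniform and restart.
            return np.full(len(probs), 1.0 / len(probs)), None
        beta_k = (1.0 - eps_k) / eps_k
        updated = probs.copy()
        updated[misclassified] *= beta_k         # accentuate the hard examples
        return updated / updated.sum(), math.log(beta_k)

    def arcing_probs(miss_counts):
        # Equation (1): p_i = (1 + m_i^4) / sum_j (1 + m_j^4), where m_i is
        # how often example i was misclassified by the previous K classifiers.
        m = np.asarray(miss_counts, dtype=float)
        w = 1.0 + m ** 4
        return w / w.sum()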
2.3 The Bias plus Variance Decomposition

Recently, several authors (Breiman, 1996b; Friedman, 1996; Kohavi & Wolpert, 1996; Kong & Dietterich, 1995) have proposed theories for the effectiveness of Bagging and Boosting based on Geman et al.'s (1992) bias plus variance decomposition of classification error. In this decomposition we can view the expected error of a learning algorithm on a particular target function and training set size as having three components:

1. A bias term measuring how close the average classifier produced by the learning algorithm will be to the target function;
2. A variance term measuring how much each of the learning algorithm's guesses will vary with respect to each other (how often they disagree); and
3. A term measuring the minimum classification error associated with the Bayes optimal classifier for the target function (this term is sometimes referred to as the intrinsic target noise).
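Written compactly, and using our own notation in the regression-style form of Geman et al. (1992) (the precise definitions of the terms for zero-one loss differ among the authors cited above), the decomposition of a learner L's expected error on target f is:

    \mathbb{E}\big[\mathrm{error}(L, f)\big] \;=\;
    \underbrace{\mathrm{bias}^{2}(L, f)}_{\text{term 1}} \;+\;
    \underbrace{\mathrm{variance}(L)}_{\text{term 2}} \;+\;
    \underbrace{\sigma^{2}_{\mathrm{noise}}(f)}_{\text{term 3 (Bayes error)}}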
Using this framework it has been suggested (Breiman, 1996b) that both Bagging and Boosting reduce error by reducing the variance term. Freund and Schapire (1996) argue that Boosting also attempts to reduce the error in the bias term since it focuses on misclassified examples. Such a focus may cause the learner to produce an ensemble function that differs significantly from the single learning algorithm. In fact, Boosting may construct a function that is not even producible by its component learning algorithm (e.g., changing linear predictions into a classifier that contains non-linear predictions). It is this capability that makes Boosting an appropriate algorithm for combining the predictions of "weak" learning algorithms (i.e., algorithms that have a simple learning bias). In their recent paper, Bauer and Kohavi (1999) demonstrated that Boosting does indeed seem to reduce bias for certain real-world problems. More surprisingly, they also showed that Bagging can also reduce the bias portion of the error, often for the same data sets for which Boosting reduces the bias.

Though the bias-variance decomposition is interesting, there are certain limitations to applying it to real-world data sets. To be able to estimate the bias, variance, and target noise for a particular problem, we need to know the actual function being learned. This is unavailable for most real-world problems. To deal with this problem Kohavi and Wolpert (1996) suggest holding out some of the data, the approach used by Bauer and Kohavi (1999) in their study. The main problem with this technique is that the training set size is greatly reduced in order to get good estimates of the bias and variance terms. We have chosen to focus strictly on generalization accuracy in our study, in part because Bauer and Kohavi's work has answered the question about whether Boosting and Bagging reduce the bias for real-world problems (they both do), and because their experiments demonstrate that while this decomposition gives some insight into ensemble methods, it is only a small part of the equation. For different data sets they observe cases where Boosting and Bagging both decrease mostly the variance portion of the error, and other cases where Boosting and Bagging both reduce the bias and variance of the error. Their tests also seem to indicate that Boosting's generalization error increases on the domains where Boosting increases the variance portion of the error; but it is difficult to determine what aspects of the data sets led to these results.

3. Results

This section describes our empirical study of Bagging, Ada-Boosting, and Arcing. Each of these three methods was tested with both decision trees and neural networks.

3.1 Data Sets

To evaluate the performance of Bagging and Boosting, we obtained a number of data sets from the University of Wisconsin Machine Learning repository as well as the UCI data set repository (Murphy & Aha, 1994). These data sets were hand selected such that they (a) came from real-world problems, (b) varied in characteristics, and (c) were deemed useful by previous researchers. Table 1 gives the characteristics of our data sets. The data sets chosen vary across a number of dimensions including: the type of the features in the data set (i.e., continuous, discrete, or a mix of the two); the number of output classes; and the number of examples in the data set. Table 1 also shows the architecture and training parameters used in our neural network experiments.

3.2 Methodology

Results, unless otherwise noted, are averaged over five standard 10-fold cross-validation experiments. For each 10-fold cross-validation the data set is first partitioned into 10 equal-sized sets, then each set is in turn used as the test set while the classifier trains on the other nine sets. For each fold an ensemble of 25 classifiers is created. Cross-validation folds were performed independently for each algorithm.

We trained the neural networks using standard backpropagation learning (Rumelhart, Hinton, & Williams, 1986). Parameter settings for the neural networks include a learning rate of 0.15, a momentum term of 0.9, and weights initialized randomly between -0.5 and 0.5. The number of hidden units and epochs used for training are given in the next section. We chose the number of hidden units based on the number of input and output units. This choice was based on the criteria of having at least one hidden unit per output, at least one hidden unit for every ten inputs, and five hidden units being a minimum. The number of epochs was based both on the number of examples and the number of parameters (i.e., topology) of the network. Specifically, we used 60 to 80 epochs for small problems involving fewer than 250 examples; 40 epochs for the mid-sized problems containing between 250 and 500 examples; and 20 to 40 epochs for larger problems. For the decision trees we used the C4.5 tool (Quinlan, 1993) and pruned trees (which empirically produce better performance) as suggested in Quinlan's work.
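These architecture choices can be summarized as simple rules. The sketch below encodes them as we read Section 3.2; note that the stated criteria are lower bounds, and Table 1 occasionally uses more hidden units than they require, so this is an approximation of the authors' procedure, not a reproduction of it.

    import math

    def hidden_units(num_inputs, num_outputs):
        # Section 3.2 criteria: at least one hidden unit per output, at
        # least one per ten inputs, and never fewer than five.
        return max(5, num_outputs, math.ceil(num_inputs / 10))

    def training_epochs(num_examples):
        # Epoch ranges reported in Section 3.2; the paper gives ranges,
        # not a formula, so this returns (low, high) bounds.
        if num_examples < 250:
            return (60, 80)
        if num_examples <= 500:
            return (40, 40)
        return (20, 40)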
                            Features        Neural Network
Data Set          Cases  Class  Cont  Disc  Inputs  Outputs  Hiddens  Epochs
breast-cancer-w     699    2      9     -      9       1        5       20
credit-a            690    2      6     9     47       1       10       35
credit-g           1000    2      7    13     63       1       10       30
diabetes            768    2      8     -      8       1        5       30
glass               214    6      9     -      9       6       10       80
heart-cleveland     303    2      8     5     13       1        5       40
hepatitis           155    2      6    13     32       1       10       60
house-votes-84      435    2      -    16     16       1        5       40
hypo               3772    5      7    22     55       5       15       40
ionosphere          351    2     34     -     34       1       10       40
iris                159    3      4     -      4       3        5       80
kr-vs-kp           3196    2      -    36     74       1       15       20
labor                57    2      8     8     29       1       10       80
letter            20000   26     16     -     16      26       40       30
promoters-936       936    2      -    57    228       1       20       30
ribosome-bind      1877    2      -    49    196       1       20       35
satellite          6435    6     36     -     36       6       15       30
segmentation       2310    7     19     -     19       7       15       20
sick               3772    2      7    22     55       1       10       40
sonar               208    2     60     -     60       1       10       60
soybean             683   19      -    35    134      19       25       40
splice             3190    3      -    60    240       2       25       30
vehicle             846    4     18     -     18       4       10       40

Table 1: Summary of the data sets used in this paper. Shown are the number of examples in the data set; the number of output classes; the number of continuous and discrete input features; the number of input, output, and hidden units used in the neural networks tested; and how many epochs each neural network was trained.

3.3 Data Set Error Rates

Table 2 shows test-set error rates for the data sets described in Table 1 for five neural network methods and four decision tree methods. (In Tables 4 and 5 we show these error rates as well as the standard deviation for each of these values.) Along with the test-set errors for Bagging, Arcing, and Ada-Boosting, we include the test-set error rate for a single neural-network and a single decision-tree classifier. We also report results for a simple (baseline) neural-network ensemble approach: creating an ensemble of networks where each network varies only by randomly initializing the weights of the network. We include these results in certain comparisons to demonstrate their similarity to Bagging.

                         Neural Network                    C4.5
Data Set          Stan  Simp   Bag   Arc   Ada   Stan   Bag   Arc   Ada
breast-cancer-w    3.4   3.5   3.4   3.8   4.0    5.0   3.7   3.5   3.5
credit-a          14.8  13.7  13.8  15.8  15.7   14.9  13.4  14.0  13.7
credit-g          27.9  24.7  24.2  25.2  25.3   29.6  25.2  25.9  26.7
diabetes          23.9  23.0  22.8  24.4  23.3   27.8  24.4  26.0  25.7
glass             38.6  35.2  33.1  32.0  31.1   31.3  25.8  25.5  23.3
heart-cleveland   18.6  17.4  17.0  20.7  21.1   24.3  19.5  21.5  20.8
hepatitis         20.1  19.5  17.8  19.0  19.7   21.2  17.3  16.9  17.2
house-votes-84     4.9   4.8   4.1   5.1   5.3    3.6   3.6   5.0   4.8
hypo               6.4   6.2   6.2   6.2   6.2    0.5   0.4   0.4   0.4
ionosphere         9.7   7.5   9.2   7.6   8.3    8.1   6.4   6.0   6.1
iris               4.3   3.9   4.0   3.7   3.9    5.2   4.9   5.1   5.6
kr-vs-kp           2.3   0.8   0.8   0.4   0.3    0.6   0.6   0.3   0.4
labor              6.1   3.2   4.2   3.2   3.2   16.5  13.7  13.0  11.6
letter            18.0  12.8  10.5   5.7   4.6   14.0   7.0   4.1   3.9
promoters-936      5.3   4.8   4.0   4.5   4.6   12.8  10.6   6.8   6.4
ribosome-bind      9.3   8.5   8.4   8.1   8.2   11.2  10.2   9.3   9.6
satellite         13.0  10.9  10.6   9.9  10.0   13.8   9.9   8.6   8.4
segmentation       6.6   5.3   5.4   3.5   3.3    3.7   3.0   1.7   1.5
sick               5.9   5.7   5.7   4.7   4.5    1.3   1.2   1.1   1.0
sonar             16.6  15.9  16.8  12.9  13.0   29.7  25.3  21.5  21.7
soybean            9.2   6.7   6.9   6.7   6.3    8.0   7.9   7.2   6.7
splice             4.7   4.0   3.9   4.0   4.2    5.9   5.4   5.1   5.3
vehicle           24.9  21.2  20.7  19.1  19.7   29.4  27.1  22.5  22.9

Table 2: Test-set error rates for the data sets using (1) a single neural network classifier; (2) an ensemble where each individual network is trained using the original training set and thus only differs from the other networks in the ensemble by its random initial weights; (3) an ensemble where the networks are trained using randomly resampled training sets (Bagging); an ensemble where the networks are trained using weighted resampled training sets (Boosting) where the resampling is based on the (4) Arcing method and (5) Ada method; (6) a single decision tree classifier; (7) a Bagging ensemble of decision trees; and (8) Arcing and (9) Ada-Boosting ensembles of decision trees.

One obvious conclusion drawn from the results is that each ensemble method appears to reduce the error rate for almost all of the data sets, and in many cases this reduction is large. In fact, the two-tailed sign test indicates that every ensemble method is significantly better than its single component classifier at the 95% confidence level; however, none of the ensemble methods is significantly better than any other ensemble approach at the 95% confidence level. To better analyze Table 2's results, Figures 3 and 4 plot the percentage reduction in error for the Ada-Boosting, Arcing, and Bagging methods as a function of the original error rate.
Examining these figures we note that many of the gains produced by the ensemble methods are much larger than the standard deviation values. In terms of comparisons of different methods, it is apparent from both figures that the Boosting methods (Ada-Boosting and Arcing) are similar in their results, both for neural networks and decision trees. Furthermore, the Ada-Boosting and Arcing methods produce some of the largest reductions in error. On the other hand, while the Bagging method consistently produces reductions in error for almost all of the cases, with neural networks the Boosting methods can sometimes result in an increase in error.

Figure 3: Reduction in error for Ada-Boosting, Arcing, and Bagging neural network ensembles as a percentage of the original error rate (i.e., a reduction from an error rate of 2.5% to 1.25% would be a 50% reduction in error rate, just as a reduction from 10.0% to 5.0% would also be a 50% reduction). Also shown (white portion of each bar) is one standard deviation for these results. The standard deviation is shown as an addition to the error reduction. (One bar group per data set, from breast-cancer-w through kr-vs-kp.)

Figure 4: Reduction in error for Ada-Boosting, Arcing, and Bagging decision tree ensembles as a percentage of the original error rate. Also shown (white portion of each bar) is one standard deviation for these results.

Looking at the ordering of the data sets in the two figures (the results are sorted by the percentage of reduction using the Ada-Boosting method), we note that the data sets for which the ensemble methods seem to work well are somewhat consistent across both neural networks and decision trees. For the few domains which see increases in error, it is difficult to reach strong conclusions since the ensemble methods seem to do well for a large number of domains. One domain on which the Boosting methods do uniformly poorly is the house-votes-84 domain. As we discuss later, there may be noise in this domain's examples that causes the Boosting methods significant problems.

3.4 Ensemble Size

Early work (Hansen & Salamon, 1990) on ensembles suggested that ensembles with as few as ten members were adequate to sufficiently reduce test-set error. While this claim may be true for the earlier proposed ensembles, the Boosting literature (Schapire, Freund, Bartlett, & Lee, 1997) has recently suggested (based on a few data sets with decision trees) that it is possible to further reduce test-set error even after ten members have been added to an ensemble (and they note that this result also applies to Bagging). In this section, we perform additional experiments to further investigate the appropriate size of an ensemble.

Figure 5 shows the composite error rate over all of our data sets for neural network and decision tree ensembles using up to 100 classifiers. Our experiments indicate that most of the methods produce similarly shaped curves. As expected, much of the reduction in error due to adding classifiers to an ensemble comes with the first few classifiers; however, there is some variation with respect to where the error reduction finally asymptotes. For both Bagging and Boosting applied to neural networks, much of the reduction in error appears to have occurred after ten to fifteen classifiers. A similar conclusion can be reached for Bagging and decision trees, which is consistent with Breiman (1996a). But Ada-Boosting and Arcing continue to measurably improve their test-set error until around 25 classifiers for decision trees.
At 25 classifiers the error reduction for both methods appears to have nearly asymptoted to a plateau. Therefore, the results reported in this paper are for an ensemble size of 25 (i.e., a sufficient yet manageable size for qualitative analysis). It was traditionally believed (Freund & Schapire, 1996) that small reductions in test-set error may continue indefinitely for Boosting; however, Grove and Schuurmans (1998) demonstrate that Ada-Boosting can indeed begin to overfit with very large ensemble sizes (10,000 or more members).

Figure 5: Average test-set error over all 23 data sets used in our studies for ensembles incorporating from one to 100 decision trees or neural networks (curves DT-Ada, DT-Arc, DT-Bag, NN-Ada, NN-Arc, NN-Bag). The error rate graphed is simply the average of the error rates of the 23 data sets. The alternative of averaging the error over all data points (i.e., weighting a data set's error rate by its sample size) produces similarly shaped curves.

3.5 Correlation Among Methods

As suggested above, it appears that the performance of many of the ensemble methods is highly correlated with one another. To help identify these consistencies, Table 3 presents the correlation coefficients of the performance of all seven ensemble methods. For each data set, performance is measured as the ensemble error rate divided by the single-classifier error rate. Thus a high correlation (i.e., one near 1.0) suggests that two methods are consistent in the domains in which they have the greatest impact on test-set error reduction.

Table 3 provides numerous interesting insights. The first is that the neural-network ensemble methods are strongly correlated with one another and the decision-tree ensemble methods are strongly correlated with one another; however, there is less correlation between any neural-network ensemble method and any decision-tree ensemble method. Not surprisingly, Ada-Boosting and Arcing are strongly correlated, even across different component learning algorithms. This suggests that Boosting's effectiveness depends more on the data set than on whether the component learning algorithm is a neural network or decision tree. Bagging, on the other hand, is not correlated across component learning algorithms. These results are consistent with our later claim that while Boosting is a powerful ensemble method, it is more susceptible to a noisy data set than Bagging.
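For concreteness, an entry of Table 3 can be reproduced along these lines. This is our illustrative sketch with toy numbers, not the paper's analysis script.

    import numpy as np

    def performance_ratio(ensemble_err, single_err):
        # Section 3.5's per-data-set performance measure: the ensemble
        # method's test-set error divided by the single classifier's error.
        return np.asarray(ensemble_err, float) / np.asarray(single_err, float)

    # A Table 3 entry is the Pearson correlation between two such ratio
    # vectors, one element per data set (the numbers below are toy values).
    single = [3.4, 14.8, 27.9]
    r = np.corrcoef(performance_ratio([3.4, 13.8, 24.2], single),
                    performance_ratio([4.0, 15.7, 25.3], single))[0, 1]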
                       Neural Network                  Decision Tree
             Simple  Bagging  Arcing   Ada    Bagging  Arcing   Ada
Simple-NN     1.00    0.88     0.87    0.85    -0.10    0.38    0.37
Bagging-NN    0.88    1.00     0.78    0.78    -0.11    0.35    0.35
Arcing-NN     0.87    0.78     1.00    0.99     0.14    0.61    0.60
Ada-NN        0.85    0.78     0.99    1.00     0.17    0.62    0.63
Bagging-DT   -0.10   -0.11     0.14    0.17     1.00    0.68    0.69
Arcing-DT     0.38    0.35     0.61    0.62     0.68    1.00    0.96
Ada-DT        0.37    0.35     0.60    0.63     0.69    0.96    1.00

Table 3: Performance correlation coefficients across ensemble learning methods. Performance is measured by the ratio of the ensemble method's test-set error divided by the single component classifier's test-set error.

3.6 Bagging versus Simple Network Ensembles

Figure 6 shows the Bagging and Simple network ensemble results from Table 2. These results indicate that often a Simple Ensemble approach will produce results that are as accurate as Bagging (correlation results from Table 3 also support this statement). This suggests that any mechanism which causes a learning method to produce some randomness in the formation of its classifiers can be used to form accurate ensembles; indeed, Ali and Pazzani (1996) have demonstrated similar results for randomized decision trees.

Figure 6: Reduction in error for Bagging and Simple neural network ensembles as a percentage of the original error rate. Also shown (white portion of each bar) is one standard deviation for these results.

3.7 Neural Networks versus Decision Trees

Another interesting question is how effective the different methods are for neural networks and decision trees. Figures 7, 8, and 9 compare the error rates and reduction in error values for Ada-Boosting, Arcing, and Bagging respectively. Note that we graph error rate rather than percent reduction in error rate because the baseline for each method (decision trees for Ada-Boosting on decision trees versus neural networks for Ada-Boosting on neural networks) may partially explain the differences in percent reduction. For example, in the promoters-936 problem using Ada-Boosting, the much larger reduction in error for the decision tree approach may be due to the fact that decision trees do not seem to be as effective for this problem, and Ada-Boosting therefore produces a larger percent reduction in the error for decision trees.

The results show that in many cases if a single decision tree had lower (or higher) error than a single neural network on a data set, then the decision-tree ensemble methods also had lower (or higher) error than their neural network counterparts. The exceptions to this rule generally happened on the same data sets for all three ensemble methods (e.g., hepatitis, soybean, satellite, credit-a, and heart-cleveland). These results suggest that (a) the performance of the ensemble methods is dependent on both the data set and classifier method, and (b) ensembles can, at least in some cases, overcome the inductive bias of their component learning algorithm.

Figure 7: Error rates for Ada-Boosting ensembles. The white portion shows the reduction in error of Ada-Boosting compared to a single classifier while increases in error are shown in black. The data sets are sorted by the ratio of reduction in ensemble error to overall error for neural networks.
Figure 8: Error rates for Arcing ensembles. The white portion shows the reduction in error of Arcing compared to a single classifier while increases in error are shown in black. The data sets are sorted by the ratio of reduction in ensemble error to overall error for neural networks.

Figure 9: Error rates for Bagging ensembles. The white portion shows the reduction in error of Bagging compared to a single classifier while increases in error are shown in black. The data sets are sorted by the ratio of reduction in ensemble error to overall error for neural networks.

3.8 Boosting and Noise

Freund and Schapire (1996) suggested that the sometimes poor performance of Boosting results from overfitting the training set, since later training sets may be over-emphasizing examples that are noise (thus creating extremely poor classifiers). This argument seems especially pertinent to Boosting for two reasons. The first and most obvious reason is that their method for updating the probabilities may be over-emphasizing noisy examples. The second reason is that the classifiers are combined using weighted voting. Previous work (Sollich & Krogh, 1996) has shown that optimizing the combining weights can lead to overfitting, while an unweighted voting scheme is generally resilient to overfitting. Friedman et al. (1998) hypothesize that Boosting methods, as additive models, may see increases in error in those situations where the bias of the base classifier is appropriate for the problem being learned. We test this hypothesis in our second set of results presented in this section.

To evaluate the hypothesis that Boosting may be prone to overfitting we performed a set of experiments using the four ensemble neural network methods. We introduced 5%, 10%, 20%, and 30% noise² into four different data sets. At each level we created five different noisy data sets, performed a 10-fold cross-validation on each, then averaged over the five results. In Figure 10 we show the reduction in error rate for each of the ensemble methods compared to using a single neural network classifier. These results demonstrate that as the noise level grows, the efficacy of the Simple and Bagging ensembles generally increases while the Arcing and Ada-Boosting ensembles' gains in performance are much smaller (or may actually decrease). Note that this effect is more extreme for Ada-Boosting, which supports our hypothesis that Ada-Boosting is more affected by noise. This suggests that Boosting's poor performance for certain data sets may be partially explained by overfitting noise.

To further demonstrate the effect of noise on Boosting we created several sets of artificial data specifically designed to mislead Boosting methods. For each data set we created a simple hyperplane concept based on a set of the features (and also included some irrelevant features). A set of random points were then generated and labeled based on which side of the hyperplane they fell. Then a certain percentage of the points on one side of the hyperplane were mislabeled as being part of the other class. For the experiments shown below we generated five data sets where the concept was based on two linear features, had four irrelevant features, and 20% of the data was mislabeled. We trained five ensembles of neural networks (perceptrons) for each data set and averaged the ensembles' predictions. Thus these experiments involve learning in situations where the original bias of the learner (a single hyperplane produced by a perceptron) is appropriate for the problem, and as Friedman et al. (1998) suggest, using an additive model may harm performance. Figure 11 shows the resulting error rates for Ada-Boosting, Arcing, and Bagging by the number of networks being combined in the ensemble. These results indicate clearly that in cases where there is noise, Bagging's error rate will not increase as the ensemble size increases, whereas the error rate of the Boosting methods may indeed increase as ensemble size increases.
² X% noise indicates that each feature of the training examples, both input and output features, had an X% chance of being randomly perturbed to another feature value for that feature (for continuous features, the set of possible other values was chosen by examining all of the training examples).

Figure 10: Simple, Bagging, and Boosting (Arcing and Ada) neural network ensemble reduction in error as compared to using a single neural network, for the diabetes, soybean-large, promoters-936, and segmentation data sets at noise rates from 0% to 30%. Graphed is the percentage point reduction in error (e.g., for 5% noise in the segmentation data set, if the single network method had an error rate of 15.9% and the Bagging method had an error rate of 14.7%, then this is graphed as a 1.2 percentage point reduction in the error rate).

Additional tests (not shown here) show that Ada-Boosting's error rate becomes worse when restarting is not employed. This conclusion dovetails nicely with Schapire et al.'s (1997) recent discussion where they note that the effectiveness of a voting method can be measured by examining the margins of the examples. (The margin is the difference between the number of correct and incorrect votes for an example.) In a simple resampling method such as Bagging, each resulting classifier focuses on increasing the margin for as many of the examples as possible. But in a Boosting method, later classifiers focus on increasing the margins for examples with poor current margins. As Schapire et al. (1997) note, this is a very effective strategy if the overall accuracy of the resulting classifier does not drop significantly. For a problem with noise, focusing on misclassified examples may cause a classifier to focus on boosting the margins of (noisy) examples that would in fact be misleading in overall classification.

Figure 11: Error rates by the size of ensemble for Ada-Boosting, Arcing, and Bagging ensembles for five different artificial data sets containing one-sided noise (see text for description).
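The noise model of footnote 2 is easy to state in code. The following sketch is our interpretation of that footnote (the helper names are ours), with feature_values built by scanning the training examples as the footnote describes.

    import random

    def add_noise(examples, noise_rate, feature_values):
        # Footnote 2: every feature of every example (inputs and the output
        # alike) is, with probability noise_rate, replaced by a different
        # value that the feature takes elsewhere in the training data;
        # feature_values[f] is that list of observed values for feature f.
        noisy = []
        for ex in examples:
            ex = list(ex)
            for f, val in enumerate(ex):
                if random.random() < noise_rate:
                    alternatives = [v for v in feature_values[f] if v != val]
                    if alternatives:
                        ex[f] = random.choice(alternatives)
            noisy.append(ex)
        return noisy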
4. Future Work

One interesting question we plan to investigate is how effective a single classifier approach might be if it was allowed to use the time it takes the ensemble method to train multiple classifiers to explore its concept space. For example, a neural network approach could perform pilot studies using the training set to select appropriate values of parameters such as hidden units, learning rate, etc.

We plan to compare Bagging and Boosting methods to other methods introduced recently. In particular we intend to examine the use of Stacking (Wolpert, 1992) as a method of training a combining function, so as to avoid the effect of having to weight classifiers. We also plan to compare Bagging and Boosting to other methods such as Opitz and Shavlik's (1996b) approach to creating an ensemble. This approach uses genetic search to find classifiers that are accurate and differ in their predictions.

Finally, since the Boosting methods are extremely successful in many domains, we plan to investigate novel approaches that will retain the benefits of Boosting. The goal will be to create a learner where you can essentially push a start button and let it run. To do this we would try to preserve the benefits of Boosting while preventing overfitting on noisy data sets. One possible approach would be to use a holdout training set (a tuning set) to evaluate the performance of the Boosting ensemble to determine when the accuracy is no longer increasing. Another approach would be to use pilot studies to determine an "optimal" number of classifiers to use in an ensemble.

5. Additional Related Work

As mentioned before, the idea of using an ensemble of classifiers rather than the single best classifier has been proposed by several people. In Section 2, we presented a framework for these systems, some theories of what makes an effective ensemble, an extensive covering of the Bagging and Boosting algorithms, and a discussion of the bias plus variance decomposition. Section 3 referred to empirical studies similar to ours; those methods differ from ours in that they were limited to decision trees, generally with fewer data sets. We cover additional related work in this section.

Lincoln and Skrzypek (1989), Mani (1991), and the forecasting literature (Clemen, 1989; Granger, 1989) indicate that a simple averaging of the predictors generates a very good composite model; however, many later researchers (Alpaydin, 1993; Asker & Maclin, 1997a, 1997b; Breiman, 1996c; Hashem, 1997; Maclin, 1998; Perrone, 1992; Wolpert, 1992; Zhang, Mesirov, & Waltz, 1992) have further improved generalization with voting schemes that are complex combinations of each predictor's output. One must be careful in this case, since optimizing the combining weights can easily lead to the problem of overfitting, which simple averaging seems to avoid (Sollich & Krogh, 1996).

Most approaches only indirectly try to generate highly correct classifiers that disagree as much as possible. These methods try to create diverse classifiers by training classifiers with dissimilar learning parameters (Alpaydin, 1993), different classifier architectures (Hashem, 1997), various initial neural-network weight settings (Maclin & Opitz, 1997; Maclin & Shavlik, 1995), or separate partitions of the training set (Breiman, 1996a; Krogh & Vedelsby, 1995). Boosting, on the other hand, is active in trying to generate highly correct networks since it accentuates examples currently classified incorrectly by previous members of the ensemble.
Addemup (Opitz & Shavlik, 1996a, 1996b) is another example of an approach that directly tries to create a diverse ensemble. Addemup uses genetic algorithms to search explicitly for a highly diverse set of accurate trained networks. Addemup works by first creating an initial population, then uses genetic operators to create new networks continually, keeping the set of networks that are highly accurate while disagreeing with each other as much as possible. Addemup is also effective at incorporating prior knowledge, if available, to improve the quality of its ensemble.

An alternate approach to the ensemble framework is to train individual networks on a subtask, and to then combine these predictions with a "gating" function that depends on the input. Jacobs et al.'s (1991) adaptive mixtures of local experts, Baxt's (1992) method for identifying myocardial infarction, and Nowlan and Sejnowski's (1992) visual model all train networks to learn specific subtasks. The key idea of these techniques is that a decomposition of the problem into specific subtasks might lead to more efficient representations and training (Hampshire & Waibel, 1989). Once a problem is broken into subtasks, the resulting solutions need to be combined. Jacobs et al. (1991) propose having the gating function be a network that learns how to allocate examples to the experts. Thus the gating network allocates each example to one or more experts, and the backpropagated errors and resulting weight changes are then restricted to these networks (and the gating function). Tresp and Taniguchi (1995) propose a method for determining the gating function after the problem has been decomposed and the experts trained. Their gating function is an input-dependent, linear-weighting function that is determined by a combination of the networks' diversity on the current input with the likelihood that these networks have seen data "near" that input.

Although the mixtures-of-experts and ensemble paradigms seem very similar, they are in fact quite distinct from a statistical point of view. The mixtures-of-experts model makes the assumption that a single expert is responsible for each example. In this case, each expert is a model of a region of the input space, and the job of the gating function is to decide from which model the data point originates. Since each network in the ensemble approach learns the whole task rather than just some subtask, and thus makes no such mutual exclusivity assumption, ensembles are appropriate when no one model is highly likely to be correct for any one point in the input space.

6. Conclusions

This paper presents a comprehensive empirical evaluation of Bagging and Boosting for neural networks and decision trees. Our results demonstrate that a Bagging ensemble nearly always outperforms a single classifier. Our results also show that a Boosting ensemble can greatly outperform both Bagging and a single classifier. However, for some data sets Boosting may show zero gain or even a decrease in performance from a single classifier. Further tests indicate that Boosting may suffer from overfitting in the presence of noise, which may explain some of the decreases in performance for Boosting. We also found that a simple ensemble approach of using neural networks that differ only in their random initial weight settings performed surprisingly well, often doing as well as Bagging.
Analysis of our results suggests that the performance of both Boosting methods (Ada-Boosting and Arcing) is at least partly dependent on the data set being examined, where Bagging shows much less correlation. The strong correlations for Boosting may be partially explained by its sensitivity to noise, a claim supported by additional tests. Finally, we show that much of the performance enhancement for an ensemble comes with the first few classifiers combined, but that Boosting decision trees may continue to further improve with larger ensemble sizes. In conclusion, as a general technique for decision trees and neural networks, Bagging is probably appropriate for most problems, but when appropriate, Boosting (either Arcing or Ada) may produce larger gains in accuracy.

Acknowledgments

This research was partially supported by University of Minnesota Grants-in-Aid to both authors. Dave Opitz was also supported by National Science Foundation grant IRI-9734419, the Montana DOE/EPSCoR Petroleum Reservoir Characterization Project, a MONTS grant supported by the University of Montana, and a Montana Science Technology Alliance grant. This is an extended version of a paper published in the Fourteenth National Conference on Artificial Intelligence.

References

Ali, K., & Pazzani, M. (1996). Error reduction through learning multiple descriptions. Machine Learning, 24, 173-202.

Alpaydin, E. (1993). Multiple networks for function learning. In Proceedings of the 1993 IEEE International Conference on Neural Networks, Vol. I, pp. 27-32, San Francisco.

Arbib, M. (Ed.). (1995). The Handbook of Brain Theory and Neural Networks. MIT Press.

Asker, L., & Maclin, R. (1997a). Ensembles as a sequence of classifiers. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pp. 860-865, Nagoya, Japan.

Asker, L., & Maclin, R. (1997b). Feature engineering and classifier selection: A case study in Venusian volcano detection. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 3-11, Nashville, TN.

Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105-139.

Baxt, W. (1992). Improving the accuracy of an artificial neural network using multiple differently trained networks. Neural Computation, 4, 772-780.

Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2), 123-140.

Breiman, L. (1996b). Bias, variance, and arcing classifiers. Tech. rep. 460, UC-Berkeley, Berkeley, CA.
Breiman, L. (1996c). Stacked regressions. Machine Learning, 24(1), 49-64.

Clemen, R. (1989). Combining forecasts: A review and annotated bibliography. Journal of Forecasting, 5, 559-583.

Drucker, H., & Cortes, C. (1996). Boosting decision trees. In Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Advances in Neural Information Processing Systems, Vol. 8, pp. 479-485. MIT Press, Cambridge, MA.

Drucker, H., Cortes, C., Jackel, L., LeCun, Y., & Vapnik, V. (1994). Boosting and other machine learning algorithms. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 53-61, New Brunswick, NJ.

Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.

Fisher, D., & McKusick, K. (1989). An empirical comparison of ID3 and back-propagation. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 788-793, Detroit, MI.

Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148-156, Bari, Italy.

Friedman, J. (1996). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Journal of Data Mining and Knowledge Discovery, 1.

Friedman, J., Hastie, T., & Tibshirani, R. (1998). Additive logistic regression: A statistical view of boosting. (http://www-stat.stanford.edu/~jhf).

Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.

Granger, C. (1989). Combining forecasts: Twenty years later. Journal of Forecasting, 8, 167-173.

Grove, A., & Schuurmans, D. (1998). Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 692-699, Madison, WI.

Hampshire, J., & Waibel, A. (1989). The meta-pi network: Building distributed knowledge representations for robust pattern recognition. Tech. rep. CMU-CS-89-166, CMU, Pittsburgh, PA.

Hansen, L., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993-1001.

Hashem, S. (1997). Optimal linear combinations of neural networks. Neural Networks, 10(4), 599-614.
Jacobs, R., Jordan, M., Nowlan, S., & Hinton, G. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79-87.

Kohavi, R., & Wolpert, D. (1996). Bias plus variance decomposition for zero-one loss functions. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 275-283, Bari, Italy.

Kong, E., & Dietterich, T. (1995). Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 313-321, Tahoe City, CA.

Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. In Tesauro, G., Touretzky, D., & Leen, T. (Eds.), Advances in Neural Information Processing Systems, Vol. 7, pp. 231-238. MIT Press, Cambridge, MA.

Lincoln, W., & Skrzypek, J. (1989). Synergy of clustering multiple backpropagation networks. In Touretzky, D. (Ed.), Advances in Neural Information Processing Systems, Vol. 2, pp. 650-659. Morgan Kaufmann, San Mateo, CA.

Maclin, R. (1998). Boosting classifiers regionally. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 700-705, Madison, WI.

Maclin, R., & Opitz, D. (1997). An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 546-551, Providence, RI.

Maclin, R., & Shavlik, J. (1995). Combining the predictions of multiple classifiers: Using competitive learning to initialize neural networks. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 524-530, Montreal, Canada.

Mani, G. (1991). Lowering variance of decisions by using artificial neural network portfolios. Neural Computation, 3, 484-486.

Mooney, R., Shavlik, J., Towell, G., & Gove, A. (1989). An experimental comparison of symbolic and connectionist learning algorithms. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 775-780, Detroit, MI.

Murphy, P. M., & Aha, D. W. (1994). UCI repository of machine learning databases (machine-readable data repository). University of California-Irvine, Department of Information and Computer Science.

Nowlan, S., & Sejnowski, T. (1992). Filter selection model for generating visual motion signals. In Hanson, S., Cowan, J., & Giles, C. (Eds.), Advances in Neural Information Processing Systems, Vol. 5, pp. 369-376. Morgan Kaufmann, San Mateo, CA.

Opitz, D., & Shavlik, J. (1996a). Actively searching for an effective neural-network ensemble. Connection Science, 8(3/4), 337-353.
Opitz, D., & Shavlik, J. (1996b). Generating accurate and diverse members of a neural-network ensemble. In Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Advances in Neural Information Processing Systems, Vol. 8, pp. 535-541. MIT Press, Cambridge, MA.

Perrone, M. (1992). A soft-competitive splitting rule for adaptive tree-structured neural networks. In Proceedings of the International Joint Conference on Neural Networks, pp. 689-693, Baltimore, MD.

Perrone, M. (1993). Improving Regression Estimation: Averaging Methods for Variance Reduction with Extension to General Convex Measure Optimization. Ph.D. thesis, Brown University, Providence, RI.

Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Quinlan, J. R. (1996). Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 725-730, Portland, OR.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In Rumelhart, D., & McClelland, J. (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, pp. 318-363. MIT Press, Cambridge, MA.

Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5(2), 197-227.

Schapire, R., Freund, Y., Bartlett, P., & Lee, W. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 322-330, Nashville, TN.

Sollich, P., & Krogh, A. (1996). Learning with ensembles: How over-fitting can be useful. In Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Advances in Neural Information Processing Systems, Vol. 8, pp. 190-196. MIT Press, Cambridge, MA.

Tresp, V., & Taniguchi, M. (1995). Combining estimators using non-constant weighting functions. In Tesauro, G., Touretzky, D., & Leen, T. (Eds.), Advances in Neural Information Processing Systems, Vol. 7, pp. 419-426. MIT Press, Cambridge, MA.

Wolpert, D. (1992). Stacked generalization. Neural Networks, 5, 241-259.

Zhang, X., Mesirov, J., & Waltz, D. (1992). Hybrid system for protein secondary structure prediction. Journal of Molecular Biology, 225, 1049-1063.

Appendix

Tables 4 and 5 show the complete results for the first set of experiments used in this paper.

                    Single          Simple      Bagging      Arcing      Boosting
Data Set          Err   SD  Best   Err   SD    Err   SD    Err   SD    Err   SD
breast-cancer-w   3.4  0.3   2.9   3.5  0.2    3.4  0.2    3.8  0.4    4.0  0.4
credit-a         14.8  0.7  13.6  13.7  0.5   13.8  0.6   15.8  0.6   15.7  0.6
credit-g         27.9  0.8  26.2  24.7  0.2   24.2  0.5   25.2  0.8   25.3  0.1
diabetes         23.9  0.9  22.6  23.0  0.5   22.8  0.4   24.4  0.2   23.3  1.2
glass            38.6  1.5  36.9  35.2  1.1   33.1  1.9   32.0  2.4   31.1  0.9
heart-cleveland  18.6  1.0  16.8  17.4  1.1   17.0  0.6   20.7  1.6   21.1  0.9
hepatitis        20.1  1.6  19.1  19.5  1.7   17.8  0.7   19.0  1.3   19.7  0.9
house-votes-84    4.9  0.6   4.1   4.8  0.2    4.1  0.2    5.1  0.5    5.3  0.5
hypo              6.4  0.2   6.2   6.2  0.1    6.2  0.1    6.2  0.1    6.2  0.1
ionosphere        9.7  1.3   7.4   7.5  0.5    9.2  1.2    7.6  0.6    8.3  0.5
iris              4.3  1.7   2.0   3.9  0.3    4.0  0.5    3.7  0.6    3.9  1.0
kr-vs-kp          2.3  0.7   1.5   0.8  0.1    0.8  0.2    0.4  0.1    0.3  0.1
labor             6.1  1.5   3.5   3.2  0.8    4.2  1.0    3.2  0.8    3.2  0.8
letter           18.0  0.3  17.6  12.8  0.2   10.5  0.3    5.7  0.4    4.6  0.1
promoters-936     5.3  0.6   4.5   4.8  0.3    4.0  0.3    4.5  0.2    4.6  0.3
ribosome-bind     9.3  0.4   8.9   8.5  0.3    8.4  0.4    8.1  0.2    8.2  0.3
satellite        13.0  0.3  12.6  10.9  0.2   10.6  0.3    9.9  0.2   10.0  0.3
segmentation      6.6  0.7   5.7   5.3  0.3    5.4  0.2    3.5  0.2    3.3  0.2
sick              5.9  0.5   5.2   5.7  0.2    5.7  0.1    4.7  0.2    4.5  0.3
sonar            16.6  1.5  14.9  15.9  1.2   16.8  1.1   12.9  1.5   13.0  1.5
soybean           9.2  1.1   7.0   6.7  0.5    6.9  0.4    6.7  0.5    6.3  0.6
splice            4.7  0.2   4.5   4.0  0.2    3.9  0.1    4.0  0.1    4.2  0.1
vehicle          24.9  1.2  22.9  21.2  0.8   20.7  0.6   19.1  1.0   19.7  1.0

Table 4: Neural network test-set error rates and standard deviation values for those error rates for (1) a single neural network classifier; (2) a simple neural network ensemble; (3) a Bagging ensemble; (4) an Arcing ensemble; and (5) an Ada-Boosting ensemble. Also shown (results column 3) is the "best" result produced from all of the single network results run using all of the training data.
                    Single          Bagging      Arcing      Boosting
Data Set          Err   SD  Best   Err   SD    Err   SD    Err   SD
breast-cancer-w   5.0  0.7   4.0   3.7  0.5    3.5  0.6    3.5  0.3
credit-a         14.9  0.8  14.2  13.4  0.5   14.0  0.9   13.7  0.5
credit-g         29.6  1.0  28.7  25.2  0.7   25.9  1.0   26.7  0.4
diabetes         27.8  1.0  26.7  24.4  0.8   26.0  0.6   25.7  0.6
glass            31.3  2.1  28.5  25.8  0.7   25.5  1.4   23.3  1.3
heart-cleveland  24.3  1.3  22.7  19.5  0.7   21.5  1.6   20.8  1.0
hepatitis        21.2  1.2  20.0  17.3  2.0   16.9  1.1   17.2  1.3
house-votes-84    3.6  0.3   3.2   3.6  0.2    5.0  1.1    4.8  1.0
hypo              0.5  0.1   0.4   0.4  0.0    0.4  0.1    0.4  0.0
ionosphere        8.1  0.7   7.1   6.4  0.6    6.0  0.5    6.1  0.5
iris              5.2  0.7   5.3   4.9  0.8    5.1  0.6    5.6  1.1
kr-vs-kp          0.6  0.1   0.5   0.6  0.1    0.3  0.1    0.4  0.0
labor            16.5  3.4  12.7  13.7  0.8   13.0  2.9   11.6  2.0
letter           14.0  0.8  12.2   7.0  0.1    4.1  0.1    3.9  0.1
promoters-936    12.8  0.4  12.5  10.6  0.6    6.8  0.5    6.4  0.3
ribosome-bind    11.2  0.6  10.8  10.2  0.1    9.3  0.2    9.6  0.5
satellite        13.8  0.4  13.5   9.9  0.2    8.6  0.1    8.4  0.2
segmentation      3.7  0.2   3.4   3.0  0.2    1.7  0.2    1.5  0.2
sick              1.3  0.9   1.1   1.2  0.1    1.1  0.1    1.0  0.1
sonar            29.7  1.9  26.9  25.3  1.3   21.5  3.0   21.7  2.8
soybean           8.0  0.5   7.5   7.9  0.5    7.2  0.2    6.7  0.9
splice            5.9  0.3   5.7   5.4  0.2    5.1  0.1    5.3  0.2
vehicle          29.4  0.7  28.6  27.1  0.9   22.5  0.8   22.9  1.9

Table 5: Decision tree test-set error rates and standard deviation values for those error rates for (1) a single decision tree classifier; (2) a Bagging ensemble; (3) an Arcing ensemble; and (4) an Ada-Boosting ensemble. Also shown (results column 3) is the "best" result produced from all of the single tree results run using all of the training data.