Journal of Artificial Intelligence Research 11 (1999) 169-198. Submitted 1/99; published 8/99.

Popular Ensemble Methods: An Empirical Study

David Opitz (opitz@cs.umt.edu), Department of Computer Science, University of Montana, Missoula, MT 59812 USA
Richard Maclin (rmaclin@d.umn.edu), Computer Science Department, University of Minnesota, Duluth, MN 55812 USA

Abstract

An ensemble consists of a set of individually trained classifiers (such as neural networks or decision trees) whose predictions are combined when classifying novel instances. Previous research has shown that an ensemble is often more accurate than any of the single classifiers in the ensemble. Bagging (Breiman, 1996c) and Boosting (Freund & Schapire, 1996; Schapire, 1990) are two relatively new but popular methods for producing ensembles. In this paper we evaluate these methods on 23 data sets using both neural networks and decision trees as our classification algorithm. Our results clearly indicate a number of conclusions. First, while Bagging is almost always more accurate than a single classifier, it is sometimes much less accurate than Boosting. On the other hand, Boosting can create ensembles that are less accurate than a single classifier, especially when using neural networks. Analysis indicates that the performance of the Boosting methods is dependent on the characteristics of the data set being examined. In fact, further results show that Boosting ensembles may overfit noisy data sets, thus decreasing its performance. Finally, consistent with previous studies, our work suggests that most of the gain in an ensemble's performance comes in the first few classifiers combined; however, relatively large gains can be seen up to 25 classifiers when Boosting decision trees.

1. Introduction

Many researchers have investigated the technique of combining the predictions of multiple classifiers to produce a single classifier (Breiman, 1996c; Clemen, 1989; Perrone, 1993; Wolpert, 1992). The resulting classifier (hereafter referred to as an ensemble) is generally more accurate than any of the individual classifiers making up the ensemble. Both theoretical (Hansen & Salamon, 1990; Krogh & Vedelsby, 1995) and empirical (Hashem, 1997; Opitz & Shavlik, 1996a, 1996b) research has demonstrated that a good ensemble is one where the individual classifiers in the ensemble are both accurate and make their errors on different parts of the input space. Two popular methods for creating accurate ensembles are Bagging (Breiman, 1996c) and Boosting (Freund & Schapire, 1996; Schapire, 1990). These methods rely on "resampling" techniques to obtain different training sets for each of the classifiers. In this paper we present a comprehensive evaluation of both Bagging and Boosting on 23 data sets using two basic classification methods: decision trees and neural networks.

(c) 1999 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
Previous work has demonstrated that Bagging and Boosting are very effective for decision trees (Bauer & Kohavi, 1999; Drucker & Cortes, 1996; Breiman, 1996c, 1996b; Freund & Schapire, 1996; Quinlan, 1996); however, there has been little empirical testing with neural networks (especially with the new Boosting algorithm). Discussions with previous researchers reveal that many authors concentrated on decision trees due to their fast training speed and well-established default parameter settings. Neural networks present difficulties for testing both in terms of the significant processing time required and in selecting training parameters; however, we feel there are distinct advantages to including neural networks in our study. First, previous empirical studies have demonstrated that individual neural networks produce highly accurate classifiers that are sometimes more accurate than corresponding decision trees (Fisher & McKusick, 1989; Mooney, Shavlik, Towell, & Gove, 1989). Second, neural networks have been extensively applied across numerous domains (Arbib, 1995). Finally, by studying neural networks in addition to decision trees we can examine how Bagging and Boosting are influenced by the learning algorithm, giving further insight into the general characteristics of these approaches. Bauer and Kohavi (1999) also study Bagging and Boosting applied to two learning methods, in their case decision trees using a variant of C4.5 and naive-Bayes classifiers, but their study mainly concentrated on the decision tree results.

Our neural network and decision tree results led us to a number of interesting conclusions. The first is that a Bagging ensemble generally produces a classifier that is more accurate than a standard classifier. Thus one should feel comfortable always Bagging their decision trees or neural networks. For Boosting, however, we note more widely varying results. For a few data sets Boosting produced dramatic reductions in error (even compared to Bagging), but for other data sets it actually increases in error over a single classifier (particularly with neural networks). In further tests we examined the effects of noise and support Freund and Schapire's (1996) conjecture that Boosting's sensitivity to noise may be partly responsible for its occasional increase in error.

An alternate baseline approach we investigated was the creation of a simple neural-network ensemble where each network used the full training set and differed only in its random initial weight settings. Our results indicate that this ensemble technique is surprisingly effective, often producing results as good as Bagging. Research by Ali and Pazzani (1996) demonstrated similar results using randomized decision tree algorithms.

Our results also show that the ensemble methods are generally consistent (in terms of their effect on accuracy) when applied either to neural networks or to decision trees; however, there is little inter-correlation between neural networks and decision trees except for the Boosting methods. This suggests that some of the increases produced by Boosting are dependent on the particular characteristics of the data set rather than on the component classifier. In further tests we demonstrate that Bagging is more resilient to noise than Boosting.

Finally, we investigated the question of how many component classifiers should be used in an ensemble. Consistent with previous research (Freund & Schapire, 1996; Quinlan, 1996), our results show that most of the reduction in error for ensemble methods occurs with the first few additional classifiers. With Boosting decision trees, however, relatively large gains may be seen up until about 25 classifiers.

This paper is organized as follows. In the next section we present an overview of classifier ensembles and discuss Bagging and Boosting in detail. Next we present an extensive empirical analysis of Bagging and Boosting. Following that we present future research and additional related work before concluding.
2. Classifier Ensembles

Figure 1 illustrates the basic framework for a classifier ensemble. In this example, neural networks are the basic classification method, though conceptually any classification method (e.g., decision trees) can be substituted in place of the networks. Each network in Figure 1's ensemble (network 1 through network N in this case) is trained using the training instances for that network. Then, for each example, the predicted output of each of these networks (o_i in Figure 1) is combined to produce the output of the ensemble (o-hat in Figure 1). Many researchers (Alpaydin, 1993; Breiman, 1996c; Krogh & Vedelsby, 1995; Lincoln & Skrzypek, 1989) have demonstrated that an effective combining scheme is to simply average the predictions of the networks.

[Figure 1: A classifier ensemble of neural networks. An input is passed to networks 1 through N; their individual outputs o_1, o_2, ..., o_N are combined to produce the ensemble output o-hat.]

Combining the output of several classifiers is useful only if there is disagreement among them. Obviously, combining several identical classifiers produces no gain. Hansen and Salamon (1990) proved that if the average error rate for an example is less than 50% and the component classifiers in the ensemble are independent in the production of their errors, the expected error for that example can be reduced to zero as the number of classifiers combined goes to infinity; however, such assumptions rarely hold in practice. Krogh and Vedelsby (1995) later proved that the ensemble error can be divided into a term measuring the average generalization error of each individual classifier and a term measuring the disagreement among the classifiers. What they formally showed was that an ideal ensemble consists of highly correct classifiers that disagree as much as possible. Opitz and Shavlik (1996a, 1996b) empirically verified that such ensembles generalize well.

As a result, methods for creating ensembles center around producing classifiers that disagree on their predictions. Generally, these methods focus on altering the training process in the hope that the resulting classifiers will produce different predictions. For example, neural network techniques that have been employed include methods for training with different topologies, different initial weights, different parameters, and training only on a portion of the training set (Alpaydin, 1993; Drucker, Cortes, Jackel, LeCun, & Vapnik, 1994; Hansen & Salamon, 1990; Maclin & Shavlik, 1995). In this paper we concentrate on two popular methods (Bagging and Boosting) that try to generate disagreement among the classifiers by altering the training set each classifier sees.

2.1 Bagging Classifiers

Bagging (Breiman, 1996c) is a "bootstrap" (Efron & Tibshirani, 1993) ensemble method that creates individuals for its ensemble by training each classifier on a random redistribution of the training set. Each classifier's training set is generated by randomly drawing, with replacement, N examples, where N is the size of the original training set; many of the original examples may be repeated in the resulting training set while others may be left out. Each individual classifier in the ensemble is generated with a different random sampling of the training set. Figure 2 gives a sample of how Bagging might work on an imaginary set of data. Since Bagging resamples the training set with replacement, some instances are represented multiple times while others are left out. So Bagging's training-set-1 might contain examples 3 and 7 twice, but does not contain either example 4 or 5. As a result, the classifier trained on training-set-1 might obtain a higher test-set error than the classifier using all of the data. In fact, all four of Bagging's component classifiers could result in higher test-set error; however, when combined, these four classifiers can (and often do) produce test-set error lower than that of the single classifier (the diversity among these classifiers generally compensates for the increase in error rate of any individual classifier).
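As an illustrative aside (a minimal sketch, not the paper's implementation; `train_fn`, `predict_proba`, and the other names here are assumptions introduced for illustration), the bootstrap resampling and prediction-averaging scheme described above can be written in a few lines of Python:

```python
import numpy as np

def bagging_ensemble(train_fn, X, y, n_classifiers=25, seed=0):
    """Train n_classifiers models, each on a bootstrap resample of (X, y).

    train_fn(X, y) is assumed to return a fitted model exposing a
    predict_proba(X) method; this is a sketch, not the authors' code.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_classifiers):
        # Draw N indices with replacement: some examples repeat, others are left out.
        idx = rng.integers(0, n, size=n)
        models.append(train_fn(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Combine by simple averaging of the component predictions,
    # then choose the class with the highest average score.
    avg = np.mean([m.predict_proba(X) for m in models], axis=0)
    return avg.argmax(axis=1)
```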
Breiman (1996c) showed that Bagging is effective on "unstable" learning algorithms where small changes in the training set result in large changes in predictions. Breiman (1996c) claimed that neural networks and decision trees are examples of unstable learning algorithms. We study the effectiveness of Bagging on both these learning methods in this article.

2.2 Boosting Classifiers

Boosting (Freund & Schapire, 1996; Schapire, 1990) encompasses a family of methods. The focus of these methods is to produce a series of classifiers. The training set used for each member of the series is chosen based on the performance of the earlier classifier(s) in the series. In Boosting, examples that are incorrectly predicted by previous classifiers in the series are chosen more often than examples that were correctly predicted. Thus Boosting attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. (Note that in Bagging, the resampling of the training set is not dependent on the performance of the earlier classifiers.) In this work we examine two new and powerful forms of Boosting: Arcing (Breiman, 1996b) and Ada-Boosting (Freund & Schapire, 1996). Like Bagging, Arcing chooses a training set of size N for classifier K+1 by probabilistically selecting (with replacement) examples from the original N training examples. Unlike Bagging, however, the probability of selecting an example is not equal across the training set. This probability depends on how often that example was misclassified by the previous K classifiers.

A sample of a single classifier on an imaginary set of data.
(Original) Training Set
Training-set-1: 1, 2, 3, 4, 5, 6, 7, 8

A sample of Bagging on the same data.
(Resampled) Training Set
Training-set-1: 2, 7, 8, 3, 7, 6, 3, 1
Training-set-2: 7, 8, 5, 6, 4, 2, 7, 1
Training-set-3: 3, 6, 2, 7, 5, 6, 2, 2
Training-set-4: 4, 5, 1, 4, 6, 4, 3, 8

A sample of Boosting on the same data.
(Resampled) Training Set
Training-set-1: 2, 7, 8, 3, 7, 6, 3, 1
Training-set-2: 1, 4, 5, 4, 1, 5, 6, 4
Training-set-3: 7, 1, 5, 8, 1, 8, 1, 4
Training-set-4: 1, 1, 6, 1, 1, 3, 1, 5

Figure 2: Hypothetical runs of Bagging and Boosting. Assume there are eight training examples. Assume example 1 is an "outlier" and is hard for the component learning algorithm to classify correctly. With Bagging, each training set is an independent sample of the data; thus, some examples are missing and others occur multiple times. The Boosting training sets are also samples of the original data set, but the "hard" example (example 1) occurs more in later training sets since Boosting concentrates on correctly predicting it.

Ada-Boosting can use the approach of (a) selecting a set of examples based on the probabilities of the examples, or (b) simply using all of the examples and weight the error of each example by the probability for that example (i.e., examples with higher probabilities have more effect on the error). This latter approach has the clear advantage that each example is incorporated (at least in part) in the training set. Furthermore, Friedman et al. (1998) have demonstrated that this form of Ada-Boosting can be viewed as a form of additive modeling for optimizing a logistic loss function. In this work, however, we have chosen to use the approach of subsampling the data to ensure a fair empirical comparison (in part due to the restarting reason discussed below).
Both Arcing and Ada-Boosting initially set the probability of picking each example to be 1/N. These methods then recalculate these probabilities after each trained classifier is added to the ensemble. For Ada-Boosting, let epsilon_k be the sum of the probabilities of the misclassified instances for the currently trained classifier C_k. The probabilities for the next trial are generated by multiplying the probabilities of C_k's incorrectly classified instances by the factor beta_k = (1 - epsilon_k) / epsilon_k and then renormalizing all probabilities so that their sum equals 1. Ada-Boosting combines the classifiers C_1, ..., C_k using weighted voting where C_k has weight log(beta_k). These weights allow Ada-Boosting to discount the predictions of classifiers that are not very accurate on the overall problem. Friedman et al. (1998) have also suggested an alternative mechanism that fits together the predictions of the classifiers as an additive model using a maximum likelihood criterion.

In this work, we use the revision described by Breiman (1996b) where we reset all the weights to be equal and restart if either epsilon_k is not less than 0.5 or epsilon_k becomes 0.[1] By resetting the weights we do not disadvantage the Ada-Boosting learner in those cases where it reaches these values of epsilon_k; the Ada-Boosting learner always incorporates the same number of classifiers as other methods we tested. To make this feasible, we are forced to use the approach of selecting a data set probabilistically rather than weighting the examples, otherwise a deterministic method such as C4.5 would cycle and generate duplicate members of the ensemble. That is, resetting the weights to 1/N would cause the learner to repeat the decision tree learned as the first member of the ensemble, and this would lead to reweighting the data set the same as for the second member of the ensemble, and so on. Randomly selecting examples for the data set based on the example probabilities alleviates this problem.

[1] For those few cases where epsilon_k becomes 0 (less than 0.12% of our results) we simply use a large positive value, log(beta_k) = 3.0, to weight these networks. For the more likely cases where epsilon_k is larger than 0.5 (approximately 5% of our results) we chose to weight the predictions by a very small positive value (0.001) rather than using a negative or 0 weight factor (this produced slightly better results than the alternate approaches in pilot studies).

Arcing-x4 (Breiman, 1996b) (which we will refer to simply as Arcing) started out as a simple mechanism for evaluating the effect of Boosting methods where the resulting classifiers were combined without weighting the votes. Arcing uses a simple mechanism for determining the probabilities of including examples in the training set. For the ith example in the training set, the value m_i refers to the number of times that example was misclassified by the previous K classifiers. The probability p_i for selecting example i to be part of classifier K+1's training set is defined as

    p_i = (1 + m_i^4) / \sum_{j=1}^{N} (1 + m_j^4)        (1)

Breiman chose the value of the power (4) empirically after trying several different values (Breiman, 1996b). Although this mechanism does not have the weighted voting of Ada-Boosting it still produces accurate ensembles and is simple to implement; thus we include this method (along with Ada-Boosting) in our empirical evaluation.

Figure 2 shows a hypothetical run of Boosting. Note that the first training set would be the same as Bagging; however, later training sets accentuate examples that were misclassified by the earlier members of the ensemble. In this figure, example 1 is a "hard" example that previous classifiers tend to misclassify. With the second training set, example 1 occurs multiple times, as do examples 4 and 5 since they were left out of the first training set and, in this case, misclassified by the first learner. For the final training set, example 1 becomes the predominant example chosen (whereas no single example is accentuated with Bagging); thus, the overall test-set error for this classifier might become very high. Despite this, however, Boosting will probably obtain a lower error rate when it combines the output of these four classifiers since it focuses on correctly predicting previously misclassified examples and weights the predictions of the different classifiers based on their accuracy for the training set. But Boosting can also overfit in the presence of noise (as we empirically show in Section 3).
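To make the two update rules concrete, here is a hedged Python sketch of the per-example selection probabilities (a rough rendering of the formulas above, not the authors' code; the restart rule for epsilon_k >= 0.5 or epsilon_k = 0 is omitted for brevity):

```python
import numpy as np

def ada_boost_update(p, misclassified):
    """One Ada-Boosting reweighting step.

    p: current selection probabilities over the N training examples.
    misclassified: boolean array, True where classifier C_k erred.
    Returns the new probabilities and C_k's voting weight log(beta_k).
    """
    eps_k = p[misclassified].sum()        # weighted error of C_k
    beta_k = (1.0 - eps_k) / eps_k        # factor (1 - eps_k) / eps_k
    p = p.copy()
    p[misclassified] *= beta_k            # up-weight the misclassified examples
    p /= p.sum()                          # renormalize so the sum equals 1
    return p, np.log(beta_k)              # C_k votes with weight log(beta_k)

def arcing_probabilities(m):
    """Arcing-x4: m[i] counts how often example i was misclassified so far (Equation 1)."""
    w = 1.0 + np.asarray(m, dtype=float) ** 4
    return w / w.sum()
```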
2.3 The Bias plus Variance Decomposition

Recently, several authors (Breiman, 1996b; Friedman, 1996; Kohavi & Wolpert, 1996; Kong & Dietterich, 1995) have proposed theories for the effectiveness of Bagging and Boosting based on Geman et al.'s (1992) bias plus variance decomposition of classification error. In this decomposition we can view the expected error of a learning algorithm on a particular target function and training set size as having three components:

1. A bias term measuring how close the average classifier produced by the learning algorithm will be to the target function;

2. A variance term measuring how much each of the learning algorithm's guesses will vary with respect to each other (how often they disagree); and

3. A term measuring the minimum classification error associated with the Bayes optimal classifier for the target function (this term is sometimes referred to as the intrinsic target noise).
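As a point of reference not spelled out in the text above, the squared-error form of this decomposition (Geman et al., 1992) can be written as follows; the zero-one-loss variants cited above (e.g., Kohavi & Wolpert, 1996) are analogous in spirit but differ in their exact definitions:

```latex
% y = f(x) + \epsilon with E[\epsilon] = 0 and Var(\epsilon) = \sigma^2;
% \hat{f}_D is the classifier learned from a training set D of fixed size.
\mathbb{E}_{D,\epsilon}\big[(y - \hat{f}_D(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{intrinsic noise}}
```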
Using this framework it has been suggested (Breiman, 1996b) that both Bagging and Boosting reduce error by reducing the variance term. Freund and Schapire (1996) argue that Boosting also attempts to reduce the error in the bias term since it focuses on misclassified examples. Such a focus may cause the learner to produce an ensemble function that differs significantly from the single learning algorithm. In fact, Boosting may construct a function that is not even producible by its component learning algorithm (e.g., changing linear predictions into a classifier that contains non-linear predictions). It is this capability that makes Boosting an appropriate algorithm for combining the predictions of "weak" learning algorithms (i.e., algorithms that have a simple learning bias). In their recent paper, Bauer and Kohavi (1999) demonstrated that Boosting does indeed seem to reduce bias for certain real world problems. More surprisingly, they also showed that Bagging can also reduce the bias portion of the error, often for the same data sets for which Boosting reduces the bias.

Though the bias-variance decomposition is interesting, there are certain limitations to applying it to real-world data sets. To be able to estimate the bias, variance, and target noise for a particular problem, we need to know the actual function being learned. This is unavailable for most real-world problems. To deal with this problem Kohavi and Wolpert (1996) suggest holding out some of the data, the approach used by Bauer and Kohavi (1999) in their study. The main problem with this technique is that the training set size is greatly reduced in order to get good estimates of the bias and variance terms. We have chosen to strictly focus on generalization accuracy in our study, in part because Bauer and Kohavi's work has answered the question about whether Boosting and Bagging reduce the bias for real world problems (they both do), and because their experiments demonstrate that while this decomposition gives some insight into ensemble methods, it is only a small part of the equation. For different data sets they observe cases where Boosting and Bagging both decrease mostly the variance portion of the error, and other cases where Boosting and Bagging both reduce the bias and variance of the error. Their tests also seem to indicate that Boosting's generalization error increases on the domains where Boosting increases the variance portion of the error; but, it is difficult to determine what aspects of the data sets led to these results.

3. Results

This section describes our empirical study of Bagging, Ada-Boosting, and Arcing. Each of these three methods was tested with both decision trees and neural networks.

3.1 Data Sets

To evaluate the performance of Bagging and Boosting, we obtained a number of data sets from the University of Wisconsin Machine Learning repository as well as the UCI data set repository (Murphy & Aha, 1994). These data sets were hand selected such that they (a) came from real-world problems, (b) varied in characteristics, and (c) were deemed useful by previous researchers. Table 1 gives the characteristics of our data sets. The data sets chosen vary across a number of dimensions including: the type of the features in the data set (i.e., continuous, discrete, or a mix of the two); the number of output classes; and the number of examples in the data set. Table 1 also shows the architecture and training parameters used in our neural networks experiments.

                                  Features        Neural Network
Data Set          Cases  Class   Cont   Disc   Inputs  Outputs  Hiddens  Epochs
breast-cancer-w     699      2      9      -        9        1        5      20
credit-a            690      2      6      9       47        1       10      35
credit-g           1000      2      7     13       63        1       10      30
diabetes            768      2      8      -        8        1        5      30
glass               214      6      9      -        9        6       10      80
heart-cleveland     303      2      8      5       13        1        5      40
hepatitis           155      2      6     13       32        1       10      60
house-votes-84      435      2      -     16       16        1        5      40
hypo               3772      5      7     22       55        5       15      40
ionosphere          351      2     34      -       34        1       10      40
iris                159      3      4      -        4        3        5      80
kr-vs-kp           3196      2      -     36       74        1       15      20
labor                57      2      8      8       29        1       10      80
letter            20000     26     16      -       16       26       40      30
promoters-936       936      2      -     57      228        1       20      30
ribosome-bind      1877      2      -     49      196        1       20      35
satellite          6435      6     36      -       36        6       15      30
segmentation       2310      7     19      -       19        7       15      20
sick               3772      2      7     22       55        1       10      40
sonar               208      2     60      -       60        1       10      60
soybean             683     19      -     35      134       19       25      40
splice             3190      3      -     60      240        2       25      30
vehicle             846      4     18      -       18        4       10      40

Table 1: Summary of the data sets used in this paper. Shown are the number of examples in the data set; the number of output classes; the number of continuous and discrete input features; the number of input, output, and hidden units used in the neural networks tested; and how many epochs each neural network was trained.

3.2 Methodology

Results, unless otherwise noted, are averaged over five standard 10-fold cross validation experiments. For each 10-fold cross validation the data set is first partitioned into 10 equal-sized sets, then each set is in turn used as the test set while the classifier trains on the other nine sets. For each fold an ensemble of 25 classifiers is created. Cross validation folds were performed independently for each algorithm.

We trained the neural networks using standard backpropagation learning (Rumelhart, Hinton, & Williams, 1986). Parameter settings for the neural networks include a learning rate of 0.15, a momentum term of 0.9, and weights are initialized randomly to be between -0.5 and 0.5. The number of hidden units and epochs used for training are given in the next section. We chose the number of hidden units based on the number of input and output units. This choice was based on the criteria of having at least one hidden unit per output, at least one hidden unit for every ten inputs, and five hidden units being a minimum. The number of epochs was based both on the number of examples and the number of parameters (i.e., topology) of the network. Specifically, we used 60 to 80 epochs for small problems involving fewer than 250 examples; 40 epochs for the mid-sized problems containing between 250 to 500 examples; and 20 to 40 epochs for larger problems. For the decision trees we used the C4.5 tool (Quinlan, 1993) and pruned trees (which empirically produce better performance) as suggested in Quinlan's work.
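The evaluation protocol is straightforward to reproduce. The sketch below is a rough modern analogue using scikit-learn rather than the authors' C4.5 and backpropagation setup (so the base learners and their defaults differ from the paper's); it runs five repetitions of 10-fold cross validation with 25-member ensembles:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

def cv_error(X, y, make_model, n_repeats=5, n_folds=10, seed=0):
    """Average test-set error over n_repeats runs of n_folds-fold cross validation."""
    errors = []
    for r in range(n_repeats):
        folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in folds.split(X, y):
            model = make_model().fit(X[train_idx], y[train_idx])
            errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
    return float(np.mean(errors))

# 25-member ensembles, mirroring the ensemble size used in the paper
# (base-learner defaults here differ from the paper's C4.5 trees).
single  = lambda: DecisionTreeClassifier()
bagged  = lambda: BaggingClassifier(n_estimators=25)
boosted = lambda: AdaBoostClassifier(n_estimators=25)
# Example usage: cv_error(X, y, bagged) for each data set, then compare error rates.
```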
3.3 Data Set Error Rates

Table 2 shows test-set error rates for the data sets described in Table 1 for five neural network methods and four decision tree methods. (In Tables 4 and 5 we show these error rates as well as the standard deviation for each of these values.) Along with the test-set errors for Bagging, Arcing, and Ada-Boosting, we include the test-set error rate for a single neural-network and a single decision-tree classifier. We also report results for a simple (baseline) neural-network ensemble approach, creating an ensemble of networks where each network varies only by randomly initializing the weights of the network. We include these results in certain comparisons to demonstrate their similarity to Bagging. One obvious conclusion drawn from the results is that each ensemble method appears to reduce the error rate for almost all of the data sets, and in many cases this reduction is large. In fact, the two-tailed sign test indicates that every ensemble method is significantly better than its single component classifier at the 95% confidence level; however, none of the ensemble methods are significantly better than any other ensemble approach at the 95% confidence level.

                          Neural Network                     C4.5
Data Set           Stan  Simp   Bag   Arc   Ada      Stan   Bag   Arc   Ada
breast-cancer-w     3.4   3.5   3.4   3.8   4.0       5.0   3.7   3.5   3.5
credit-a           14.8  13.7  13.8  15.8  15.7      14.9  13.4  14.0  13.7
credit-g           27.9  24.7  24.2  25.2  25.3      29.6  25.2  25.9  26.7
diabetes           23.9  23.0  22.8  24.4  23.3      27.8  24.4  26.0  25.7
glass              38.6  35.2  33.1  32.0  31.1      31.3  25.8  25.5  23.3
heart-cleveland    18.6  17.4  17.0  20.7  21.1      24.3  19.5  21.5  20.8
hepatitis          20.1  19.5  17.8  19.0  19.7      21.2  17.3  16.9  17.2
house-votes-84      4.9   4.8   4.1   5.1   5.3       3.6   3.6   5.0   4.8
hypo                6.4   6.2   6.2   6.2   6.2       0.5   0.4   0.4   0.4
ionosphere          9.7   7.5   9.2   7.6   8.3       8.1   6.4   6.0   6.1
iris                4.3   3.9   4.0   3.7   3.9       5.2   4.9   5.1   5.6
kr-vs-kp            2.3   0.8   0.8   0.4   0.3       0.6   0.6   0.3   0.4
labor               6.1   3.2   4.2   3.2   3.2      16.5  13.7  13.0  11.6
letter             18.0  12.8  10.5   5.7   4.6      14.0   7.0   4.1   3.9
promoters-936       5.3   4.8   4.0   4.5   4.6      12.8  10.6   6.8   6.4
ribosome-bind       9.3   8.5   8.4   8.1   8.2      11.2  10.2   9.3   9.6
satellite          13.0  10.9  10.6   9.9  10.0      13.8   9.9   8.6   8.4
segmentation        6.6   5.3   5.4   3.5   3.3       3.7   3.0   1.7   1.5
sick                5.9   5.7   5.7   4.7   4.5       1.3   1.2   1.1   1.0
sonar              16.6  15.9  16.8  12.9  13.0      29.7  25.3  21.5  21.7
soybean             9.2   6.7   6.9   6.7   6.3       8.0   7.9   7.2   6.7
splice              4.7   4.0   3.9   4.0   4.2       5.9   5.4   5.1   5.3
vehicle            24.9  21.2  20.7  19.1  19.7      29.4  27.1  22.5  22.9

Table 2: Test set error rates for the data sets using (1) a single neural network classifier; (2) an ensemble where each individual network is trained using the original training set and thus only differs from the other networks in the ensemble by its random initial weights; (3) an ensemble where the networks are trained using randomly resampled training sets (Bagging); an ensemble where the networks are trained using weighted resampled training sets (Boosting) where the resampling is based on the (4) Arcing method and (5) Ada method; (6) a single decision tree classifier; (7) a Bagging ensemble of decision trees; and (8) Arcing and (9) Ada-Boosting ensembles of decision trees.

[Figure 3: Reduction in error for Ada-Boosting, Arcing, and Bagging neural network ensembles as a percentage of the original error rate (i.e., a reduction from an error rate of 2.5% to 1.25% would be a 50% reduction in error rate, just as a reduction from 10.0% to 5.0% would also be a 50% reduction). Also shown (white portion of each bar) is one standard deviation for these results. The standard deviation is shown as an addition to the error reduction.]

[Figure 4: Reduction in error for Ada-Boosting, Arcing, and Bagging decision tree ensembles as a percentage of the original error rate. Also shown (white portion of each bar) is one standard deviation for these results.]

To better analyze Table 2's results, Figures 3 and 4 plot the percentage reduction in error for the Ada-Boosting, Arcing, and Bagging methods as a function of the original error rate. Examining these figures we note that many of the gains produced by the ensemble methods are much larger than the standard deviation values. In terms of comparisons of different methods, it is apparent from both figures that the Boosting methods (Ada-
PopularEnsembleMethodsBoostingandArcing)aresimilarintheirresults,bothforneuralnetworksanddecisiontrees.Furthermore,theAda-BoostingandArcingmethodsproducesomeofthelargestreductionsinerror.Ontheotherhand,whiletheBaggingmethodconsistentlyproducesreductionsinerrorforalmostallofthecases,withneuralnetworkstheBoostingmethodscansometimesresultinanincreaseinerror.Lookingattheorderingofthedatasetsinthetwo gures(theresultsaresortedbythepercentageofreductionusingtheAda-Boostingmethod),wenotethatthedatasetsforwhichtheensemblemethodsseemtoworkwellaresomewhatconsistentacrossbothneuralnetworksanddecisiontrees.Forthefewdomainswhichseeincreasesinerror,itisdiculttoreachstrongconclusionssincetheensemblemethodsseemtodowellforalargenumberofdomains.OnedomainonwhichtheBoostingmethodsdouniformlypoorlyisthehouse-votes-84domain.Aswediscusslater,theremaynoiseinthisdomain'sexamplesthatcausestheBoostingmethodssigni cantproblems.3.4EnsembleSizeEarlywork(Hansen&Salamon,1990)onensemblessuggestedthatensembleswithasfewastenmemberswereadequatetosucientlyreducetest-seterror.Whilethisclaimmaybetruefortheearlierproposedensembles,theBoostingliterature(Schapire,Freund,Bartlett,&Lee,1997)hasrecentlysuggested(basedonafewdatasetswithdecisiontrees)thatitispossibletofurtherreducetest-seterrorevenaftertenmembershavebeenaddedtoanensemble(andtheynotethatthisresultalsoappliestoBagging).Inthissection,weperformadditionalexperimentstofurtherinvestigatetheappropriatesizeofanensemble.Figure5showsthecompositeerrorrateoverallofourdatasetsforneuralnetworkanddecisiontreeensemblesusingupto100classi ers.Ourexperimentsindicatethatmostofthemethodsproducesimilarlyshapedcurves.Asexpected,muchofthereductioninerrorduetoaddingclassi erstoanensemblecomeswiththe rstfewclassi ers;however,thereissomevariationwithrespecttowheretheerrorreduction nallyasymptotes.ForbothBaggingandBoostingappliedtoneuralnetworks,muchofthereductioninerrorappearstohaveoccurredaftertento fteenclassi ers.AsimilarconclusioncanbereachedforBagginganddecisiontrees,whichisconsistentwithBreiman(1996a).ButAda-boostingandArcingcontinuetomeasurablyimprovetheirtest-seterroruntilaround25classi ersfordecisiontrees.At25classi erstheerrorreductionforbothmethodsappearstohavenearlyasymptotedtoaplateau.Therefore,theresultsreportedinthispaperareofanensemblesizeof25(i.e.,asucientyetmanageablesizeforqualitativeanalysis).Itwastraditionallybelieved(Freund&Schapire,1996)thatsmallreductionsintest-seterrormaycontinueinde nitelyforboosting;however,GroveandSchuurmans(1998)demonstratethatAda-boostingcanindeedbegintoover twithverylargeensemblesizes(10,000ormoremembers).3.5CorrelationAmongMethodsAssuggestedabove,itappearsthattheperformanceofmanyoftheensemblemethodsarehighlycorrelatedwithoneanother.Tohelpidentifytheseconsistencies,Table3presentsthecorrelationcoecientsoftheperformanceofallsevenensemblemethods.Foreachdataset,performanceismeasuredastheensembleerrorratedividedbythesingle-classi ererror181 Opitz&Maclin0.100.120.140.160.180102030405060708090100Number of Networks in EnsembleComposite Error 
RateDT-AdaDT-ArcDT-BagNN-AdaNN-ArcNN-BagFigure5:Averagetest-seterroroverall23datasetsusedinourstudiesforensemblesincorporatingfromoneto100decisiontreesorneuralnetworks.Theerrorrategraphedissimplytheaverageoftheerrorratesofthe23datasets.Thealternativeofaveragingtheerroroveralldatapoints(i.e.,weightingadataset'serrorratebyitssamplesize)producessimilarlyshapedcurves.rate.Thusahighcorrelation(i.e.,onenear1.0)suggeststhattwomethodsareconsistentinthedomainsinwhichtheyhavethegreatestimpactontest-seterrorreduction.Table3providesnumerousinterestinginsights.The rstisthattheneural-networkensemblemethodsarestronglycorrelatedwithoneanotherandthedecision-treeensemblemethodsarestronglycorrelatedwithoneanother;however,thereislesscorrelationbe-tweenanyneural-networkensemblemethodandanydecision-treeensemblemethod.Notsurprisingly,Ada-boostingandArcingarestronglycorrelated,evenacrossdi erentcompo-nentlearningalgorithms.ThissuggeststhatBoosting'se ectivenessdependsmoreonthedatasetthanwhetherthecomponentlearningalgorithmisaneuralnetworkordecisiontree.Baggingontheotherhand,isnotcorrelatedacrosscomponentlearningalgorithms.TheseresultsareconsistentwithourlaterclaimthatwhileBoostingisapowerfulensemblemethod,itismoresusceptibletoanoisydatasetthanBagging.182 PopularEnsembleMethodsNeuralNetworkDecisionTreeSimpleBaggingArcingAdaBaggingArcingAdaSimple-NN1.000.880.870.85-0.100.380.37Bagging-NN0.881.000.780.78-0.110.350.35Arcing-NN0.870.781.000.990.140.610.60Ada-NN0.850.780.991.000.170.620.63Bagging-DT-0.10-0.110.140.171.000.680.69Arcing-DT0.380.350.610.620.681.000.96Ada-DT0.370.350.600.630.690.961.00Table3:Performancecorrelationcoecientsacrossensemblelearningmethods.Perfor-manceismeasuredbytheratiooftheensemblemethod'stest-seterrordividedbythesinglecomponentclassi er'stest-seterror.3.6BaggingversusSimplenetworkensemblesFigure6showstheBaggingandSimplenetworkensembleresultsfromTable2.TheseresultsindicatethatoftenaSimpleEnsembleapproachwillproduceresultsthatareasaccurateasBagging(correlationresultsfromTable3alsosupportthisstatement).Thissuggeststhatanymechanismwhichcausesalearningmethodtoproducesomerandomnessintheformationofitsclassi erscanbeusedtoformaccurateensembles,andindeed,AliandPazzani(1996)havedemonstratedsimilarresultsforrandomizeddecisiontrees.3.7NeuralNetworksversusDecisionTreesAnotherinterestingquestionishowe ectivethedi erentmethodsareforneuralnetworksanddecisiontrees.Figures7,8,and9comparetheerrorratesandreductioninerrorvaluesforAda-Boosting,Arcing,andBaggingrespectively.Notethatwegrapherrorrateratherthanpercentreductioninerrorratebecausethebaselineforeachmethod(decisiontreesforAda-BoostingondecisiontreesversusneuralnetworksforAda-Boostingonneuralnetworks)maypartiallyexplainthedi erencesinpercentreduction.Forexample,inthepromoters-936problemusingAda-Boosting,themuchlargerreductioninerrorforthedecisiontreeapproachmaybeduetothefactthatdecisiontreesdonotseemtobease ectiveforthisproblem,andAda-Boostingthereforeproducesalargerpercentreductionintheerrorfordecisiontrees.Theresultsshowthatinmanycasesifasingledecisiontreehadlower(orhigher)errorthanasingleneuralnetworkonadataset,thenthedecision-treeensemblemethodsalsohadlower(orhigher)errorthantheirneuralnetworkcounterpart.Theexceptionstothisrulegenerallyhappenedonthesamedatasetforallthreeensemblemethods(e.g.,hepatitis,soybean,satellite,credit-a,andheart-cleveland).Theseresultssuggestthat(a)theperformanceoftheensemblemethodsisdependentonboththedatasetandclassi ermethod,and(b)ensemblescan,atleastinsomecases,overcometheinductivebiasofitscomponentlearningalgorithm.183 
Opitz&Maclinsonarbreast-cancer-wsickhypodiabetesionospherecredit-airisheart-clevelandribosome-bindhepatitiscredit-gglasshouse-votes-84vehiclesplicesatellitesegmentationpromoters-936soybeanlaborletterkr-vs-kp-20020406080Percent Reduction in ErrorBaggingSimpleFigure6:ReductioninerrorforBaggingandSimpleneuralnetworkensemblesasaper-centageoftheoriginalerrorrate.Alsoshown(whiteportionofeachbar)isonestandarddeviationfortheseresults.184 PopularEnsembleMethodsbreast-cancer-wheart-clevelandhouse-votes-84credit-ahepatitishypodiabetescredit-gspliceirisribosome-bindpromoters-936ionosphereglassvehiclesonarsicksatellitesoybeanlaborsegmentationletterkr-vs-kp010203040Error (%)Neural NetworkDecision TreeFigure7:ErrorratesforAda-Boostingensembles.ThewhiteportionshowsthereductioninerrorofAda-Boostingcomparedtoasingleclassi erwhileincreasesinerrorareshowninblack.Thedatasetsaresortedbytheratioofreductioninensembleerrortooverallerrorforneuralnetworks.185 Opitz&Maclinbreast-cancer-wheart-clevelandcredit-ahouse-votes-84diabeteshypohepatitiscredit-gribosome-bindspliceirispromoters-936glasssickionospheresonarvehiclesatellitesoybeansegmentationlaborletterkr-vs-kp010203040Error (%)Neural NetworkDecision TreeFigure8:ErrorratesforArcingensembles.ThewhiteportionshowsthereductioninerrorofArcingcomparedtoasingleclassi erwhileincreasesinerrorareshowninblack.Thedatasetsaresortedbytheratioofreductioninensembleerrortooverallerrorforneuralnetworks.186 PopularEnsembleMethodssonarbreast-cancer-wsickhypodiabetesionospherecredit-airisheart-clevelandribosome-bindhepatitiscredit-gglasshouse-votes-84vehiclesplicesatellitesegmentationpromoters-936soybeanlaborletterkr-vs-kp010203040Error (%)Neural NetworkDecision TreeFigure9:ErrorratesforBaggingensembles.ThewhiteportionshowsthereductioninerrorofBaggingcomparedtoasingleclassi erwhileincreasesinerrorareshowninblack.Thedatasetsaresortedbytheratioofreductioninensembleerrortooverallerrorforneuralnetworks.187 Opitz&Maclin3.8BoostingandNoiseFreundandShapire(1996)suggestedthatthesometimespoorperformanceofBoostingresultsfromover ttingthetrainingsetsincelatertrainingsetsmaybeover-emphasizingexamplesthatarenoise(thuscreatingextremelypoorclassi ers).ThisargumentseemsespeciallypertinenttoBoostingfortworeasons.The rstandmostobviousreasonisthattheirmethodforupdatingtheprobabilitiesmaybeover-emphasizingnoisyexamples.Thesecondreasonisthattheclassi ersarecombinedusingweightedvoting.Previouswork(Sollich&Krogh,1996)hasshownthatoptimizingthecombiningweightscanleadtoover ttingwhileanunweightedvotingschemeisgenerallyresilienttoover tting.Friedmanetal.(1998)hypothesizethatBoostingmethods,asadditivemodels,mayseeincreasesinerrorinthosesituationswherethebiasofthebaseclassi erisappropriatefortheproblembeinglearned.Wetestthishypothesisinoursecondsetofresultspresentedinthissection.ToevaluatethehypothesisthatBoostingmaybepronetoover ttingweperformedasetofexperimentsusingthefourensembleneuralnetworkmethods.Weintroduced5%,10%,20%,and30%noise2intofourdi erentdatasets.Ateachlevelwecreated vedi erentnoisydatasets,performeda10-foldcrossvalidationoneach,thenaveragedoverthe veresults.InFigure10weshowthereductioninerrorrateforeachoftheensemblemethodscomparedtousingasingleneuralnetworkclassi er.Theseresultsdemonstratethatasthenoiselevelgrows,theecacyoftheSimpleandBaggingensemblesgenerallyincreaseswhiletheArcingandAda-Boostingensemblesgainsinperformancearemuchsmaller(ormayactuallydecrease).Notethatthise ectismoreextremeforAda-BoostingwhichsupportsourhypothesisthatAda-Boostingismorea 
ectedbynoise.ThissuggeststhatBoosting'spoorperformanceforcertaindatasetsmaybepartiallyexplainedbyover ttingnoise.Tofurtherdemonstratethee ectofnoiseonBoostingwecreatedseveralsetsofarti cialdataspeci callydesignedtomisleadBoostingmethods.Foreachdatasetwecreatedasimplehyperplaneconceptbasedonasetofthefeatures(andalsoincludedsomeirrelevantfeatures).Asetofrandompointswerethengeneratedandlabeledbasedonwhichsideofthehyperplanetheyfell.Thenacertainpercentageofthepointsononesideofthehyperplaneweremislabeledasbeingpartoftheotherclass.Fortheexperimentsshownbelowwegenerated vedatasetswheretheconceptwasbasedontwolinearfeatures,hadfourirrelevantfeatures,and20%ofthedatawasmislabeled.Wetrained veensemblesofneuralnetworks(perceptrons)foreachdatasetandaveragedtheensembles'predictions.Thustheseexperimentsinvolvelearninginsituationswheretheoriginalbiasofthelearner(asinglehyperplaneproducedbyaperceptron)isappropriatefortheproblem,andasFriedmanetal.(1998)suggest,usinganadditivemodelmayharmperformance.Figure11showstheresultingerrorratesforAda-Boosting,Arcing,andBaggingbythenumberofnetworksbeingcombinedintheensemble.TheseresultsindicateclearlythatincaseswherethereisnoiseBagging'serrorratewillnotincreaseastheensemblesizeincreaseswhereastheerrorrateoftheBoostingmethodsmayindeedincreaseasensemblesizeincreases.2.X%noiseindicatesthateachfeatureofthetrainingexamples,bothinputandoutputfeatures,hadX%chanceofbeingrandomlyperturbedtoanotherfeaturevalueforthatfeature(forcontinuousfeatures,thesetofpossibleothervalueswaschosenbyexaminingallofthetrainingexamples).188 PopularEnsembleMethods01234051015202530Reduction in error rate (% pts)diabetes0369051015202530soybean-large01234051015202530Noise rate (%)Reduction in error rate (% pts)promoters-93601234051015202530Noise rate (%)segmentationBagging EnsembleBoosting (Arcing) EnsembleBoosting (Ada) EnsembleFigure10:Simple,Bagging,andBoosting(ArcingandAda)neuralnetworkensemblere-ductioninerrorascomparedtousingasingleneuralnetwork.Graphedisthepercentagepointreductioninerror(e.g.,for5%noiseinthesegmentationdataset,ifthesinglenetworkmethodhadanerrorrateof15.9%andtheBaggingmethodhadanerrorrateof14.7%,thenthisisgraphedasa1.2percentagepointreductionintheerrorrate).Additionaltests(notshownhere)showthatAda-Boosting'serrorratebecomesworsewhenrestartingisnotemployed.ThisconclusiondovetailsnicelywithSchapireetal.'s(1997)recentdiscussionwheretheynotethatthee ectivenessofavotingmethodcanbemeasuredbyexaminingthemarginsoftheexamples.(Themarginisthedi erencebetweenthenumberofcorrectandincorrectvotesforanexample.)InasimpleresamplingmethodsuchasBagging,eachresultingclassi erfocusesonincreasingthemarginforasmanyoftheexamplesaspossible.ButinaBoostingmethod,laterclassi ersfocusonincreasingthemarginsforexampleswithpoorcurrentmargins.AsSchapireetal.(1997)note,thisisaverye ectivestrategyiftheoverallaccuracyoftheresultingclassi erdoesnotdropsigni cantly.Foraproblemwithnoise,focusingonmisclassi edexamplesmaycauseaclassi ertofocusonboostingthemarginsof(noisy)examplesthatwouldinfactbemisleadinginoverallclassi cation.189 Opitz&Maclin101214161820051015202530Error rateAdaArcBag101214161820051015202530Error rateAdaArcBag101214161820051015202530Error rateAdaArcBag101214161820051015202530Error rateAdaArcBag101214161820051015202530Networks in EnsembleError rateAdaArcBagFigure11:ErrorratesbythesizeofensembleforAda-Boosting,Arcing,andBaggingen-semblesfor vedi erentarti cialdatasetscontainingone-sidednoise(seetextfordescription).190 PopularEnsembleMethods4.FutureWorkOneinterestingquestionweplantoinvestigateishowe 
ectiveasingleclassi erapproachmightbeifitwasallowedtousethetimeittakestheensemblemethodtotrainmultipleclassi erstoexploreitsconceptspace.Forexample,aneuralnetworkapproachcouldperformpilotstudiesusingthetrainingsettoselectappropriatevaluesofparameterssuchashiddenunits,learningrate,etc.WeplantocompareBaggingandBoostingmethodstoothermethodsintroducedre-cently.InparticularweintendtoexaminetheuseofStacking(Wolpert,1992)asamethodoftrainingacombiningfunction,soastoavoidthee ectofhavingtoweightclassi ers.WealsoplantocompareBaggingandBoostingtoothermethodssuchasOpitzandShav-lik's(1996b)approachtocreatinganensemble.Thisapproachusesgeneticsearchto ndclassi ersthatareaccurateanddi erintheirpredictions.Finally,sincetheBoostingmethodsareextremelysuccessfulinmanydomains,weplantoinvestigatenovelapproachesthatwillretainthebene tsofBoosting.Thegoalwillbetocreatealearnerwhereyoucanessentiallypushastartbuttonandletitrun.Todothiswewouldtrytopreservethebene tsofBoostingwhilepreventingover ttingonnoisydatasets.Onepossibleapproachwouldbetouseaholdouttrainingset(atuningset)toevaluatetheperformanceoftheBoostingensembletodeterminewhentheaccuracyisnolongerincreasing.Anotherapproachwouldbetousepilotstudiestodeterminean\optimal"numberofclassi erstouseinanensemble.5.AdditionalRelatedWorkAsmentionedbefore,theideaofusinganensembleofclassi ersratherthanthesinglebestclassi erhasbeenproposedbyseveralpeople.InSection2,wepresentaframeworkforthesesystems,sometheoriesofwhatmakesane ectiveensemble,anextensivecoveringoftheBaggingandBoostingalgorithms,andadiscussiononthebiasplusvariancedecomposition.Section3referredtoempiricalstudiessimilartoours;thesemethodsdi erfromoursinthattheywerelimitedtodecisiontrees,generallywithfewerdatasets.Wecoveradditionalrelatedworkinthissection.LincolnandSkrzypek(1989),Mani(1991)andtheforecastingliterature(Clemen,1989;Granger,1989)indicatethatasimpleaveragingofthepredictorsgeneratesaverygoodcompositemodel;however,manylaterresearchers(Alpaydin,1993;Asker&Maclin,1997a,1997b;Breiman,1996c;Hashem,1997;Maclin,1998;Perrone,1992;Wolpert,1992;Zhang,Mesirov,&Waltz,1992)havefurtherimprovedgeneralizationwithvotingschemesthatarecomplexcombinationsofeachpredictor'soutput.Onemustbecarefulinthiscase,sinceoptimizingthecombiningweightscaneasilyleadtotheproblemofover ttingwhichsimpleaveragingseemstoavoid(Sollich&Krogh,1996).Mostapproachesonlyindirectlytrytogeneratehighlycorrectclassi ersthatdisagreeasmuchaspossible.Thesemethodstrytocreatediverseclassi ersbytrainingclassi erswithdissimilarlearningparameters(Alpaydin,1993),di erentclassi erarchitectures(Hashem,1997),variousinitialneural-networkweightsettings(Maclin&Opitz,1997;Maclin&Shav-lik,1995),orseparatepartitionsofthetrainingset(Breiman,1996a;Krogh&Vedelsby,1995).Boostingontheotherhandisactiveintryingtogeneratehighlycorrectnetworks191 Opitz&Maclinsinceitaccentuatesexamplescurrentlyclassi edincorrectlybypreviousmembersoftheensemble.Addemup(Opitz&Shavlik,1996a,1996b)isanotherexampleofanapproachthatdirectlytriestocreateadiverseensemble.Addemupusesgeneticalgorithmstosearchexplicitlyforahighlydiversesetofaccuratetrainednetworks.Addemupworksby rstcreatinganinitialpopulation,thenusesgeneticoperatorstocreatenewnetworkscon-tinually,keepingthesetofnetworksthatarehighlyaccuratewhiledisagreeingwitheachotherasmuchaspossible.Addemupisalsoe 
ectiveatincorporatingpriorknowledge,ifavailable,toimprovethequalityofitsensemble.Analternateapproachtotheensembleframeworkistotrainindividualnetworksonasubtask,andtothencombinethesepredictionswitha\gating"functionthatdependsontheinput.Jacobsetal.'s(1991)adaptivemixturesoflocalexperts,Baxt's(1992)methodforidentifyingmyocardialinfarction,andNowlanandSejnowski's(1992)visualmodelalltrainnetworkstolearnspeci csubtasks.Thekeyideaofthesetechniquesisthatadecompositionoftheproblemintospeci csubtasksmightleadtomoreecientrepresentationsandtraining(Hampshire&Waibel,1989).Onceaproblemisbrokenintosubtasks,theresultingsolutionsneedtobecombined.Jacobsetal.(1991)proposehavingthegatingfunctionbeanetworkthatlearnshowtoallocateexamplestotheexperts.Thusthegatingnetworkallocateseachexampletooneormoreexperts,andthebackpropagatederrorsandresultingweightchangesarethenrestrictedtothesenetworks(andthegatingfunction).TrespandTaniguchi(1995)proposeamethodfordeterminingthegatingfunctionaftertheproblemhasbeendecomposedandtheexpertstrained.Theirgatingfunctionisaninput-dependent,linear-weightingfunctionthatisdeterminedbyacombinationofthenetworks'diversityonthecurrentinputwiththelikelihoodthatthesenetworkshaveseendata\near"thatinput.Althoughthemixturesofexpertsandensembleparadigmsseemverysimilar,theyareinfactquitedistinctfromastatisticalpointofview.Themixtures-of-expertsmodelmakestheassumptionthatasingleexpertisresponsibleforeachexample.Inthiscase,eachexpertisamodelofaregionoftheinputspace,andthejobofthegatingfunctionistodecidefromwhichmodelthedatapointoriginates.Sinceeachnetworkintheensembleapproachlearnsthewholetaskratherthanjustsomesubtaskandthusmakesnosuchmutualexclusivityassumption,ensemblesareappropriatewhennoonemodelishighlylikelytobecorrectforanyonepointintheinputspace.6.ConclusionsThispaperpresentsacomprehensiveempiricalevaluationofBaggingandBoostingforneuralnetworksanddecisiontrees.OurresultsdemonstratethataBaggingensemblenearlyalwaysoutperformsasingleclassi er.OurresultsalsoshowthataBoostingensemblecangreatlyoutperformbothBaggingandasingleclassi er.However,forsomedatasetsBoostingmayshowzerogainorevenadecreaseinperformancefromasingleclassi er.FurthertestsindicatethatBoostingmaysu erfromover ttinginthepresenceofnoisewhichmayexplainsomeofthedecreasesinperformanceforBoosting.Wealsofoundthatasimpleensembleapproachofusingneuralnetworksthatdi eronlyintheirrandominitialweightsettingsperformedsurprisinglywell,oftendoingaswellastheBagging.192 PopularEnsembleMethodsAnalysisofourresultssuggeststhattheperformanceofbothBoostingmethods(Ada-BoostingandArcing)isatleastpartlydependentonthedatasetbeingexamined,whereBaggingshowsmuchlesscorrelation.ThestrongcorrelationsforBoostingmaybepartiallyexplainedbyitssensitivitytonoise,aclaimsupportedbyadditionaltests.Finally,weshowthatmuchoftheperformanceenhancementforanensemblecomeswiththe rstfewclassi erscombined,butthatBoostingdecisiontreesmaycontinuetofurtherimprovewithlargerensemblesizes.Inconclusion,asageneraltechniquefordecisiontreesandneuralnetworks,Baggingisprobablyappropriateformostproblems,butwhenappropriate,Boosting(eitherArcingorAda)mayproducelargergainsinaccuracy.AcknowledgmentsThisresearchwaspartiallysupportedbyUniversityofMinnesotaGrants-in-Aidtobothau-thors.DaveOpitzwasalsosupportedbyNationalScienceFoundationgrantIRI-9734419,theMontanaDOE/EPSCoRPetroleumReservoirCharacterizationProject,aMONTSgrantsupportedbytheUniversityofMontana,andaMontanaScienceTechnologyAl-liancegrant.ThisisanextendedversionofapaperpublishedintheFourteenthNationalConferenceonArti 
cialIntelligence.ReferencesK.,&Pazzani,M.(1996).Errorreductionthroughlearningmultipledescriptions.MachineLearning,24,173{202.Alpaydin,E.(1993).Multiplenetworksforfunctionlearning.InProceedingsofthe1993IEEEInternationalConferenceonNeuralNetworks,Vol.I,pp.27{32SanFrancisco.Arbib,M.(Ed.).(1995).TheHandbookofBrainTheoryandNeuralNetworks.MITPress.Asker,L.,&Maclin,R.(1997a).Ensemblesasasequenceofclassi ers.InProceedingsoftheFifteenthInternationalJointConferenceonArti cialIntelligence,pp.860{865Nagoya,Japan.Asker,L.,&Maclin,R.(1997b).Featureengineeringandclassi erselection:AcasestudyinVenusianvolcanodetection.InProceedingsoftheFourteenthInternationalConferenceonMachineLearning,pp.3{11Nashville,TN.Bauer,E.,&Kohavi,R.(1999).Anempiricalcomparisonofvotingclassi cationalgorithms:Bagging,boosting,andvariants.MachineLearning,36,105-139.Baxt,W.(1992).Improvingtheaccuracyofanarti cialneuralnetworkusingmultipledi erentlytrainednetworks.NeuralComputation,4,772{780.Breiman,L.(1996a).Baggingpredictors.MachineLearning,24(2),123{140.Breiman,L.(1996b).Bias,variance,andarcingclassi ers.Tech.rep.460,UC-Berkeley,Berkeley,CA.193 Opitz&MaclinBreiman,L.(1996c).Stackedregressions.MachineLearning,24(1),49{64.Clemen,R.(1989).Combiningforecasts:Areviewandannotatedbibliography.JournalofForecasting,5,559{583.Drucker,H.,&Cortes,C.(1996).Boostingdecisiontrees.InTouretsky,D.,Mozer,M.,&Hasselmo,M.(Eds.),AdvancesinNeuralInformationProcessingSystems,Vol.8,pp.479{485Cambridge,MA.MITPress.Drucker,H.,Cortes,C.,Jackel,L.,LeCun,Y.,&Vapnik,V.(1994).Boostingandothermachinelearningalgorithms.InProceedingsoftheEleventhInternationalConferenceonMachineLearning,pp.53{61NewBrunswick,NJ.Efron,B.,&Tibshirani,R.(1993).AnIntroductiontotheBootstrap.ChapmanandHall,NewYork.Fisher,D.,&McKusick,K.(1989).AnempiricalcomparisonofID3andback-propagation.InProceedingsoftheEleventhInternationalJointConferenceonArti cialIntelli-gence,pp.788{793Detroit,MI.Freund,Y.,&Schapire,R.(1996).Experimentswithanewboostingalgorithm.InPro-ceedingsoftheThirteenthInternationalConferenceonMachineLearning,pp.148{156Bari,Italy.Friedman,J.(1996).Onbias,variance,0/1-loss,andthecurse-of-dimensionality.JournalofDataMiningandKnowledgeDiscovery,1.Friedman,J.,Hastie,T.,&Tibshirani,R.(1998).Additivelogisticregression:Astatisticalviewofboosting.(http://www-stat.stanford.edu/~jhf).Geman,S.,Bienenstock,E.,&Doursat,R.(1992).Neuralnetworksandthebias/variancedilemma.NeuralComputation,4,1{58.Granger,C.(1989).Combiningforecasts:Twentyyearslater.JournalofForecasting,8,167{173.Grove,A.,&Schuurmans,D.(1998).Boostinginthelimit:Maximizingthemarginoflearnedensembles.InProceedingsoftheFifteenthNationalConferenceonArti cialIntelligence,pp.692{699Madison,WI.Hampshire,J.,&Waibel,A.(1989).Themeta-pinetwork:Buildingdistributedknowledgerepresentationsforrobustpatternrecognition.Tech.rep.CMU-CS-89-166,CMU,Pittsburgh,PA.Hansen,L.,&Salamon,P.(1990).Neuralnetworkensembles.IEEETransactionsonPatternAnalysisandMachineIntelligence,12,993{1001.Hashem,S.(1997).Optimallinearcombinationsofneuralnetworks.NeuralNetworks,10(4),599{614.194 
PopularEnsembleMethodsJacobs,R.,Jordan,M.,Nowlan,S.,&Hinton,G.(1991).Adaptivemixturesoflocalexperts.NeuralComputation,3,79{87.Kohavi,R.,&Wolpert,D.(1996).Biasplusvariancedecompositionforzero-onelossfunctions.InProceedingsoftheThirteenthInternationalConferenceonMachineLearning,pp.275{283Bari,Italy.Kong,E.,&Dietterich,T.(1995).Error-correctingoutputcodingcorrectsbiasandvariance.InProceedingsoftheTwelfthInternationalConferenceonMachineLearning,pp.313{321TahoeCity,CA.Krogh,A.,&Vedelsby,J.(1995).Neuralnetworkensembles,crossvalidation,andactivelearning.InTesauro,G.,Touretzky,D.,&Leen,T.(Eds.),AdvancesinNeuralInformationProcessingSystems,Vol.7,pp.231{238Cambridge,MA.MITPress.Lincoln,W.,&Skrzypek,J.(1989).Synergyofclusteringmultiplebackpropagationnet-works.InTouretzky,D.(Ed.),AdvancesinNeuralInformationProcessingSystems,Vol.2,pp.650{659SanMateo,CA.MorganKaufmann.Maclin,R.(1998).Boostingclassi ersregionally.InProceedingsoftheFifteenthNationalConferenceonArti cialIntelligence,pp.700{705Madison,WI.Maclin,R.,&Opitz,D.(1997).Anempiricalevaluationofbaggingandboosting.InProceedingsoftheFourteenthNationalConferenceonArti cialIntelligence,pp.546{551Providence,RI.Maclin,R.,&Shavlik,J.(1995).Combiningthepredictionsofmultipleclassi ers:Usingcompetitivelearningtoinitializeneuralnetworks.InProceedingsoftheFourteenthIn-ternationalJointConferenceonArti cialIntelligence,pp.524{530Montreal,Canada.Mani,G.(1991).Loweringvarianceofdecisionsbyusingarti cialneuralnetworkportfolios.NeuralComputation,3,484{486.Mooney,R.,Shavlik,J.,Towell,G.,&Gove,A.(1989).Anexperimentalcomparisonofsymbolicandconnectionistlearningalgorithms.InProceedingsoftheEleventhInternationalJointConferenceonArti cialIntelligence,pp.775{780Detroit,MI.Murphy,P.M.,&Aha,D.W.(1994).UCIrepositoryofmachinelearningdatabases(machine-readabledatarepository).UniversityofCalifornia-Irvine,DepartmentofInformationandComputerScience.Nowlan,S.,&Sejnowski,T.(1992).Filterselectionmodelforgeneratingvisualmotionsignals.InHanson,S.,Cowan,J.,&Giles,C.(Eds.),AdvancesinNeuralInformationProcessingSystems,Vol.5,pp.369{376SanMateo,CA.MorganKaufmann.Opitz,D.,&Shavlik,J.(1996a).Activelysearchingforane ectiveneural-networkensemble.ConnectionScience,8(3/4),337{353.195 Opitz&MaclinOpitz,D.,&Shavlik,J.(1996b).Generatingaccurateanddiversemembersofaneural-networkensemble.InTouretsky,D.,Mozer,M.,&Hasselmo,M.(Eds.),AdvancesinNeuralInformationProcessingSystems,Vol.8,pp.535{541Cambridge,MA.MITPress.Perrone,M.(1992).Asoft-competitivesplittingruleforadaptivetree-structuredneuralnetworks.InProceedingsoftheInternationalJointConferenceonNeuralNetworks,pp.689{693Baltimore,MD.Perrone,M.(1993).ImprovingRegressionEstimation:AveragingMethodsforVarianceReductionwithExtensiontoGeneralConvexMeasureOptimization.Ph.D.thesis,BrownUniversity,Providence,RI.Quinlan,J.(1993).C4.5:ProgramsforMachineLearning.MorganKaufmann,SanMateo,CA.Quinlan,J.R.(1996).Bagging,boosting,andc4.5.InProceedingsoftheThirteenthNationalConferenceonArti cialIntelligence,pp.725{730.Portland,OR.Rumelhart,D.,Hinton,G.,&Williams,R.(1986).Learninginternalrepresentationsbyerrorpropagation.InRumelhart,D.,&McClelland,J.(Eds.),ParallelDistributedProcessing:Explorationsinthemicrostructureofcognition.Volume1:Foundations,pp.318{363.MITPress,Cambridge,MA.Schapire,R.(1990).Thestrengthofweaklearnability.MachineLearning,5(2),197{227.Schapire,R.,Freund,Y.,Bartlett,P.,&Lee,W.(1997).Boostingthemargin:Anewexplanationforthee 
ectivenessofvotingmethods.InProceedingsoftheFourteenthInternationalConferenceonMachineLearning,pp.322{330Nashville,TN.Sollich,P.,&Krogh,A.(1996).Learningwithensembles:Howover- ttingcanbeuseful.InTouretsky,D.,Mozer,M.,&Hasselmo,M.(Eds.),AdvancesinNeuralInformationProcessingSystems,Vol.8,pp.190{196Cambridge,MA.MITPress.Tresp,V.,&Taniguchi,M.(1995).Combiningestimatorsusingnon-constantweightingfunctions.InTesauro,G.,Touretzky,D.,&Leen,T.(Eds.),AdvancesinNeuralInformationProcessingSystems,Vol.7,pp.419{426Cambridge,MA.MITPress.Wolpert,D.(1992).Stackedgeneralization.NeuralNetworks,5,241{259.Zhang,X.,Mesirov,J.,&Waltz,D.(1992).Hybridsystemforproteinsecondarystructureprediction.JournalofMolecularBiology,225,1049{1063.196 PopularEnsembleMethodsAppendixTables4and5showthecompleteresultsforthe rstsetofexperimentsusedinthispaper.SingleSimpleBaggingArcingBoostingDataSetErrSDBestErrSDErrSDErrSDErrSDbreast-cancer-w3.40.32.93.50.23.40.23.80.44.00.4credit-a14.80.713.613.70.513.80.615.80.615.70.6credit-g27.90.826.224.70.224.20.525.20.825.30.1diabetes23.90.922.623.00.522.80.424.40.223.31.2glass38.61.536.935.21.133.11.932.02.431.10.9heart-cleveland18.61.016.817.41.117.00.620.71.621.10.9hepatitis20.11.619.119.51.717.80.719.01.319.70.9house-votes-844.90.64.14.80.24.10.25.10.55.30.5hypo6.40.26.26.20.16.20.16.20.16.20.1ionosphere9.71.37.47.50.59.21.27.60.68.30.5iris4.31.72.03.90.34.00.53.70.63.91.0kr-vs-kp2.30.71.50.80.10.80.20.40.10.30.1labor6.11.53.53.20.84.21.03.20.83.20.8letter18.00.317.612.80.210.50.35.70.44.60.1promoters-9365.30.64.54.80.34.00.34.50.24.60.3ribosome-bind9.30.48.98.50.38.40.48.10.28.20.3satellite13.00.312.610.90.210.60.39.90.210.00.3segmentation6.60.75.75.30.35.40.23.50.23.30.2sick5.90.55.25.70.25.70.14.70.24.50.3sonar16.61.514.915.91.216.81.112.91.513.01.5soybean9.21.17.06.70.56.90.46.70.56.30.6splice4.70.24.54.00.23.90.14.00.14.20.1vehicle24.91.222.921.20.820.70.619.11.019.71.0Table4:Neuralnetworktestseterrorratesandstandarddeviationvaluesforthoseerrorratesfor(1)asingleneuralnetworkclassi er;(2)asimpleneuralnetworkensem-ble;(3)aBaggingensemble;(4)anArcingensemble;and(5)andAda-Boostingensemble.Alsoshown(resultscolumn3)isthe\best"resultproducedfromallofthesinglenetworkresultsrunusingallofthetrainingdata.197 Opitz&MaclinSingleBaggingArcingBoostingDataSetErrSDBestErrSDErrSDErrSDbreast-cancer-w5.00.74.03.70.53.50.63.50.3credit-a14.90.814.213.40.514.00.913.70.5credit-g29.61.028.725.20.725.91.026.70.4diabetes27.81.026.724.40.826.00.625.70.6glass31.32.128.525.80.725.51.423.31.3heart-cleveland24.31.322.719.50.721.51.620.81.0hepatitis21.21.220.017.32.016.91.117.21.3house-votes-843.60.33.23.60.25.01.14.81.0hypo0.50.10.40.40.00.40.10.40.0ionosphere8.10.77.16.40.66.00.56.10.5iris5.20.75.34.90.85.10.65.61.1kr-vs-kp0.60.10.50.60.10.30.10.40.0labor16.53.412.713.70.813.02.911.62.0letter14.00.812.27.00.14.10.13.90.1promoters-93612.80.412.510.60.66.80.56.40.3ribosome-bind11.20.610.810.20.19.30.29.60.5satellite13.80.413.59.90.28.60.18.40.2segmentation3.70.23.43.00.21.70.21.50.2sick1.30.91.11.20.11.10.11.00.1sonar29.71.926.925.31.321.53.021.72.8soybean8.00.57.57.90.57.20.26.70.9splice5.90.35.75.40.25.10.15.30.2vehicle29.40.728.627.10.922.50.822.91.9Table5:Decisiontreetestseterrorratesandstandarddeviationvaluesforthoseerrorratesfor(1)asingledecisiontreeclassi er;(2)aBaggingensemble;(3)anArcingensemble;and(4)andAda-Boostingensemble.Alsoshown(resultscolumn3)isthe\best"resultproducedfromallofthesingletreeresultsrunusingallofthetrainingdata.198