
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS

[95] C. Rudin, I. Daubechies, and R. E. Schapire, "The dynamics of AdaBoost: Cyclic behavior and convergence of margins," J. Mach. Learn. Res., vol. 5, pp. 1557–1595, 2004.
[96] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Mach. Learn., vol. 37, pp. 297–336, 1999.
[97] M. Joshi, V. Kumar, and R. Agarwal, "Evaluating boosting algorithms to classify rare classes: Comparison and improvements," in Proc. IEEE Int. Conf. Data Mining, 2001, pp. 257–264.
[98] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach," SIGKDD Expl. Newsl., vol. 6, pp. 30–39, 2004.
[99] R. Barandela, R. M. Valdovinos, and J. S. Sánchez, "New applications of ensembles of classifiers," Pattern Anal. App., vol. 6, pp. 245–256, 2003.
[100] E. Chang, B. Li, G. Wu, and K. Goh, "Statistical learning for effective visual information retrieval," in Proc. Int. Conf. Image Process., 2003, vol. 3, no. 2, pp. 609–612.
[101] D. Tao, X. Tang, X. Li, and X. Wu, "Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 7, pp. 1088–1099, Jul. 2006.
[102] S. Hido, H. Kashima, and Y. Takahashi, "Roughly balanced bagging for imbalanced data," Stat. Anal. Data Min., vol. 2, pp. 412–426, 2009.
[103] P. K. Chan and S. J. Stolfo, "Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection," in Proc. 4th Int. Conf. Knowl. Discov. Data Mining (KDD-98), 1998, pp. 164–168.
[104] R. Yan, Y. Liu, R. Jin, and A. Hauptmann, "On predicting rare classes with SVM ensembles in scene classification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2003, vol. 3, pp. 21–24.
[105] C. Li, "Classifying imbalanced data using a bagging ensemble variation (BEV)," in Proc. 45th Annual Southeast Regional Conference (Association of Computing Machinery South East Series 45). New York: ACM, 2007, pp. 203–208.
[106] D. A. Cieslak and N. V. Chawla, "Learning decision trees for unbalanced data," in Machine Learning and Knowledge Discovery in Databases (Lecture Notes in Computer Science Series 5211), W. Daelemans, B. Goethals, and K. Morik, Eds., 2008, pp. 241–256.
[107] F. Provost and P. Domingos, "Tree induction for probability-based ranking," Mach. Learn., vol. 52, pp. 199–215, 2003.
[108] S. García, A. Fernández, J. Luengo, and F. Herrera, "A study of statistical techniques and performance measures for genetics-based machine learning: Accuracy and interpretability," Soft Comp., vol. 13, no. 10, pp. 959–977, 2009.
[109] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bull., vol. 1, no. 6, pp. 80–83, 1945.
[110] D. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, 2nd ed. London, U.K.: Chapman & Hall/CRC, 2006.
[111] S. Holm, "A simple sequentially rejective multiple test procedure," Scand. J. Stat., vol. 6, pp. 65–70, 1979.
[112] J. P. Shaffer, "Modified sequentially rejective multiple test procedures," J. Am. Stat. Assoc., vol. 81, no. 395, pp. 826–831, 1986.

Mikel Galar received the M.Sc. degree in computer sciences from the Public University of Navarra, Pamplona, Spain, in 2009. He is working toward the Ph.D. degree with the Department of Automatics and Computation, Universidad Pública de Navarra, Navarra, Spain.
He is currently a Teaching Assistant in the Department of Automatics and Computation. His research interests include data mining, classification, multi-classification, ensemble learning, evolutionary algorithms, and fuzzy systems.

Alberto Fernández received the M.Sc. and Ph.D. degrees in computer science in 2005 and 2010, both from the University of Granada, Granada, Spain.
He is currently an Assistant Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. His research interests include data mining, classification in imbalanced domains, fuzzy rule learning, evolutionary algorithms, and multi-classification problems.

Edurne Barrenechea received the M.Sc. degree in computer science at the Pais Vasco University, San Sebastian, Spain, in 1990. She obtained the Ph.D. degree in computer science from the Public University of Navarra, Navarra, Spain, in 2005, on the topic of interval-valued fuzzy sets applied to image processing.
She is an Assistant Lecturer at the Department of Automatics and Computation, Public University of Navarra. She worked in a private company (Bombas Itur) as an Analyst Programmer from 1990 to 2001, and then she joined the Public University of Navarra as an Associate Lecturer. Her publications comprise more than 20 papers in international journals and about 15 book chapters. Her research interests include fuzzy techniques for image processing, fuzzy sets theory, interval type-2 fuzzy sets theory and applications, decision making, and medical and industrial applications of soft computing techniques.
Dr. Barrenechea is a Member of the board of the European Society for Fuzzy Logic and Technology.

Humberto Bustince (M'08) received the Ph.D. degree in mathematics from the Public University of Navarra, Navarra, Spain, in 1994.
He is a Full Professor at the Department of Automatics and Computation, Public University of Navarra. His research interests include fuzzy logic theory, extensions of fuzzy sets (type-2 fuzzy sets, interval-valued fuzzy sets, Atanassov's intuitionistic fuzzy sets), fuzzy measures, aggregation functions, and fuzzy techniques for image processing. He is author of more than 65 published original articles and is involved in teaching artificial intelligence for students of computer sciences.

Francisco Herrera received the M.Sc. degree in mathematics, in 1988, and the Ph.D. degree in mathematics, in 1991, both from the University of Granada, Granada, Spain.
He is currently a Professor with the Department of Computer Science and Artificial Intelligence at the University of Granada. He acts as an Associate Editor of the journals IEEE TRANSACTIONS ON FUZZY SYSTEMS, Information Sciences, Mathware and Soft Computing, Advances in Fuzzy Systems, Advances in Computational Sciences and Technology, and International Journal of Applied Metaheuristics Computing. He currently serves as an Area Editor of the Journal Soft Computing (area of genetic algorithms and genetic fuzzy systems), and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, and Memetic Computation. He has published more than 150 papers in international journals. He is the coauthor of the book "Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases" (World Scientific, 2001). As editing activities, he has co-edited five international books and co-edited 20 special issues in international journals on different soft computing topics. His current research interests include computing with words and decision making, data mining, data preparation, instance selection, fuzzy-rule-based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms, and genetic algorithms.

[45] C. Seiffert, T. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 40, no. 1, pp. 185–197, Jan. 2010.
[46] J. Błaszczyński, M. Deckert, J. Stefanowski, and S. Wilk, "Integrating selective pre-processing of imbalanced data with Ivotes ensemble," in Rough Sets and Current Trends in Computing (Lecture Notes in Computer Science Series 6086), M. Szczuka, M. Kryszkiewicz, S. Ramanna, R. Jensen, and Q. Hu, Eds. Berlin/Heidelberg, Germany: Springer-Verlag, 2010, pp. 148–157.
[47] X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 2, pp. 539–550, 2009.
[48] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan, "AdaCost: Misclassification cost-sensitive boosting," presented at the 6th Int. Conf. Mach. Learning, San Francisco, CA, 1999, pp. 97–105.
[49] K. M. Ting, "A comparative study of cost-sensitive boosting algorithms," in Proc. 17th Int. Conf. Mach. Learning, Stanford, CA, 2000, pp. 983–990.
[50] Y. Sun, M. S. Kamel, A. K. Wong, and Y. Wang, "Cost-sensitive boosting for classification of imbalanced data," Pattern Recog., vol. 40, no. 12, pp. 3358–3378, 2007.
[51] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Comput. Intell., vol. 20, no. 1, pp. 18–36, 2004.
[52] J. Stefanowski and S. Wilk, "Selective pre-processing of imbalanced data for improving classification performance," in Data Warehousing and Knowledge Discovery (Lecture Notes in Computer Science Series 5182), I.-Y. Song, J. Eder, and T. Nguyen, Eds., 2008, pp. 283–292.
[53] A. Fernández, S. García, M. J. del Jesus, and F. Herrera, "A study of the behaviour of linguistic fuzzy-rule-based classification systems in the framework of imbalanced data-sets," Fuzzy Sets Syst., vol. 159, no. 18, pp. 2378–2398, 2008.
[54] A. Orriols-Puig and E. Bernadó-Mansilla, "Evolutionary rule-based systems for imbalanced data sets," Soft Comp., vol. 13, pp. 213–225, 2009.
[55] S. Wang and X. Yao, "Diversity analysis on imbalanced data sets by using ensemble models," in Proc. IEEE Symp. Comput. Intell. Data Mining, 2009, pp. 324–331.
[56] J. Alcalá-Fdez, L. Sánchez, S. García, M. J. del Jesus, S. Ventura, J. M. Garrell, J. Otero, C. Romero, J. Bacardit, V. M. Rivas, J. Fernández, and F. Herrera, "KEEL: A software tool to assess evolutionary algorithms for data mining problems," Soft Comp., vol. 13, no. 3, pp. 307–318, 2008.
[57] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, "KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework," J. Mult.-Valued Logic Soft Comput., vol. 17, no. 2–3, pp. 255–287, 2011.
[58] J. R. Quinlan, C4.5: Programs for Machine Learning, 1st ed. San Mateo, CA: Morgan Kaufmann Publishers, 1993.
[59] C.-T. Su and Y.-H. Hsiao, "An evaluation of the robustness of MTS for imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 19, no. 10, pp. 1321–1332, Oct. 2007.
[60] D. Drown, T. Khoshgoftaar, and N. Seliya, "Evolutionary sampling and software quality modeling of high-assurance systems," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 39, no. 5, pp. 1097–1107, Sep. 2009.
[61] S. García, A. Fernández, and F. Herrera, "Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems," Appl. Soft Comput., vol. 9, no. 4, pp. 1304–1314, 2009.
[62] J. Van Hulse, T. Khoshgoftaar, and A. Napolitano, "An empirical comparison of repetitive undersampling techniques," in Proc. IEEE Int. Conf. Inf. Reuse Integr., 2009, pp. 29–34.
[63] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1–30, 2006.
[64] S. García and F. Herrera, "An extension on 'Statistical comparisons of classifiers over multiple data sets' for all pairwise comparisons," J. Mach. Learn. Res., vol. 9, pp. 2677–2694, 2008.
[65] S. García, A. Fernández, J. Luengo, and F. Herrera, "Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power," Inf. Sci., vol. 180, pp. 2044–2064, 2010.
[66] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recog., vol. 30, no. 7, pp. 1145–1159, 1997.
[67] J. Huang and C. X. Ling, "Using AUC and accuracy in evaluating learning algorithms," IEEE Trans. Knowl. Data Eng., vol. 17, no. 3, pp. 299–310, Mar. 2005.
[68] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.
[69] D. Williams, V. Myers, and M. Silvious, "Mine classification with imbalanced data," IEEE Geosci. Remote Sens. Lett., vol. 6, no. 3, pp. 528–532, Jul. 2009.
[70] W.-Z. Lu and D. Wang, "Ground-level ozone prediction by support vector machine approach with a cost-sensitive classification scheme," Sci. Total. Enviro., vol. 395, no. 2–3, pp. 109–116, 2008.
[71] Y.-M. Huang, C.-M. Hung, and H. C. Jiau, "Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem," Nonlinear Anal. R. World Appl., vol. 7, no. 4, pp. 720–747, 2006.
[72] D. Cieslak, N. Chawla, and A. Striegel, "Combating imbalance in network intrusion datasets," in Proc. IEEE Int. Conf. Granular Comput., 2006, pp. 732–737.
[73] K. Kiliç, Ö. Uncu, and I. B. Türksen, "Comparison of different strategies of utilizing fuzzy clustering in structure identification," Inf. Sci., vol. 177, no. 23, pp. 5153–5162, 2007.
[74] M. E. Celebi, H. A. Kingravi, B. Uddin, H. Iyatomi, Y. A. Aslandogan, W. V. Stoecker, and R. H. Moss, "A methodological approach to the classification of dermoscopy images," Comput. Med. Imag. Grap., vol. 31, no. 6, pp. 362–373, 2007.
[75] X. Peng and I. King, "Robust BMPM training based on second-order cone programming and its application in medical diagnosis," Neural Netw., vol. 21, no. 2–3, pp. 450–457, 2008.
[76] B. Liu, Y. Ma, and C. Wong, "Improving an association rule based classifier," in Principles of Data Mining and Knowledge Discovery (Lecture Notes in Computer Science Series 1910), D. Zighed, J. Komorowski, and J. Zytkow, Eds., 2000, pp. 293–317.
[77] Y. Lin, Y. Lee, and G. Wahba, "Support vector machines for classification in nonstandard situations," Mach. Learn., vol. 46, pp. 191–202, 2002.
[78] R. Barandela, J. S. Sánchez, V. García, and E. Rangel, "Strategies for learning in class imbalance problems," Pattern Recog., vol. 36, no. 3, pp. 849–851, 2003.
[79] K. Napierała, J. Stefanowski, and S. Wilk, "Learning from imbalanced data in presence of noisy and borderline examples," in Proc. Rough Sets Curr. Trends Comput., 2010, pp. 158–167.
[80] C. Ling, V. Sheng, and Q. Yang, "Test strategies for cost-sensitive decision trees," IEEE Trans. Knowl. Data Eng., vol. 18, no. 8, pp. 1055–1067, 2006.
[81] S. Zhang, L. Liu, X. Zhu, and C. Zhang, "A strategy for attributes selection in cost-sensitive decision trees induction," in Proc. IEEE 8th Int. Conf. Comput. Inf. Technol. Workshops, 2008, pp. 8–13.
[82] S. Hu, Y. Liang, L. Ma, and Y. He, "MSMOTE: Improving classification performance when training data is imbalanced," in Proc. 2nd Int. Workshop Comput. Sci. Eng., 2009, vol. 2, pp. 13–17.
[83] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[84] S. Geman, E. Bienenstock, and R. Doursat, "Neural networks and the bias/variance dilemma," Neural Comput., vol. 4, pp. 1–58, 1992.
[85] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting," Ann. Statist., vol. 28, pp. 337–407, 1998.
[86] E. B. Kong and T. G. Dietterich, "Error-correcting output coding corrects bias and variance," in Proc. 12th Int. Conf. Mach. Learning, 1995, pp. 313–321.
[87] R. Kohavi and D. H. Wolpert, "Bias plus variance decomposition for zero-one loss functions," in Proc. 13th Int. Conf. Mach. Learning, 1996.
[88] L. Breiman, "Bias, variance, and arcing classifiers," University of California, Berkeley, CA, Tech. Rep. 460, 1996.
[89] R. Tibshirani, "Bias, variance and prediction error for classification rules," University of Toronto, Toronto, Canada, Dept. of Statistics, Tech. Rep. 9602, 1996.
[90] J. H. Friedman, "On bias, variance, 0/1-loss, and the curse-of-dimensionality," Data Min. Knowl. Disc., vol. 1, pp. 55–77, 1997.
[91] G. M. James, "Variance and bias for general loss functions," Mach. Learning, vol. 51, pp. 115–135, 2003.
[92] L. I. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy," Mach. Learning, vol. 51, pp. 181–207, 2003.
[93] L. Breiman, "Pasting small votes for classification in large databases and on-line," Mach. Learn., vol. 36, pp. 85–103, 1999.
[94] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowl. Inf. Syst., vol. 14, pp. 1–37, 2007.

APPENDIX
DETAILED RESULTS TABLE

In this appendix, we present the AUC test results for all the algorithms in all data-sets. Table XIX shows the results for nonensembles and classic ensembles. In Table XX we show the test results for cost-sensitive boosting, boosting-based, and hybrid ensembles, whereas Table XXI shows the test results for bagging-based ones. The results are shown in ascending order of the IR. The last row in each table shows the average result of each algorithm. We stress with bold-face the best results among all algorithms in each data-set.

ACKNOWLEDGMENT

The authors would like to thank the reviewers for their valuable comments and suggestions that contributed to the improvement of this work.

REFERENCES

[1] Y. Sun, A. C. Wong, and M. S. Kamel, "Classification of imbalanced data: A review," Int. J. Pattern Recogn., vol. 23, no. 4, pp. 687–719, 2009.
[2] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[3] N. V. Chawla, "Data mining for imbalanced datasets: An overview," in Data Mining and Knowledge Discovery Handbook, 2010, pp. 875–886.
[4] V. García, R. Mollineda, and J. Sánchez, "On the k-NN performance in a challenging scenario of imbalance and overlapping," Pattern Anal. App., vol. 11, pp. 269–280, 2008.
[5] G. M. Weiss and F. Provost, "Learning when training data are costly: The effect of class distribution on tree induction," J. Artif. Intell. Res., vol. 19, pp. 315–354, 2003.
[6] N. Japkowicz and S. Stephen, "The class imbalance problem: A systematic study," Intell. Data Anal., vol. 6, pp. 429–449, 2002.
[7] D. A. Cieslak and N. V. Chawla, "Start globally, optimize locally, predict globally: Improving performance on imbalanced data," in Proc. 8th IEEE Int. Conf. Data Mining, 2009, pp. 143–152.
[8] Q. Yang and X. Wu, "10 challenging problems in data mining research," Int. J. Inf. Tech. Decis., vol. 5, no. 4, pp. 597–604, 2006.
[9] Z. Yang, W. Tang, A. Shintemirov, and Q. Wu, "Association rule mining-based dissolved gas analysis for fault diagnosis of power transformers," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 39, no. 6, pp. 597–610, 2009.
[10] Z.-B. Zhu and Z.-H. Song, "Fault diagnosis based on imbalance modified kernel Fisher discriminant analysis," Chem. Eng. Res. Des., vol. 88, no. 8, pp. 936–951, 2010.
[11] W. Khreich, E. Granger, A. Miri, and R. Sabourin, "Iterative Boolean combination of classifiers in the ROC space: An application to anomaly detection with HMMs," Pattern Recogn., vol. 43, no. 8, pp. 2732–2752, 2010.
[12] M. Tavallaee, N. Stakhanova, and A. Ghorbani, "Toward credible evaluation of anomaly-based intrusion-detection methods," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 5, pp. 516–524, Sep. 2010.
[13] M. A. Mazurowski, P. A. Habas, J. M. Zurada, J. Y. Lo, J. A. Baker, and G. D. Tourassi, "Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance," Neural Netw., vol. 21, no. 2–3, pp. 427–436, 2008.
[14] P. Bermejo, J. A. Gámez, and J. M. Puerta, "Improving the performance of naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets," Expert Syst. Appl., vol. 38, no. 3, pp. 2072–2080, 2011.
[15] Y.-H. Liu and Y.-T. Chen, "Total margin-based adaptive fuzzy support vector machines for multiview face recognition," in Proc. IEEE Int. Conf. Syst., Man Cybern., 2005, vol. 2, pp. 1704–1711.
[16] M. Kubat, R. C. Holte, and S. Matwin, "Machine learning for the detection of oil spills in satellite radar images," Mach. Learn., vol. 30, pp. 195–215, 1998.
[17] J. R. Quinlan, "Improved estimates for the accuracy of small disjuncts," Mach. Learn., vol. 6, pp. 93–98, 1991.
[18] B. Zadrozny and C. Elkan, "Learning and making decisions when costs and probabilities are both unknown," in Proc. 7th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, New York, 2001, pp. 204–213.
[19] G. Wu and E. Chang, "KBA: Kernel boundary alignment considering imbalanced data distribution," IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 786–795, Jun. 2005.
[20] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," SIGKDD Expl. Newslett., vol. 6, pp. 20–29, 2004.
[21] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
[22] N. V. Chawla, N. Japkowicz, and A. Kolcz, Eds., Special Issue on Learning from Imbalanced Datasets, SIGKDD Explor. Newsl., vol. 6, no. 1, 2004.
[23] N. Chawla, D. Cieslak, L. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," Data Min. Knowl. Discov., vol. 17, pp. 225–252, 2008.
[24] A. Freitas, A. Costa-Pereira, and P. Brazdil, "Cost-sensitive decision trees applied to medical data," in Data Warehousing Knowl. Discov. (Lecture Notes Series in Computer Science), I. Song, J. Eder, and T. Nguyen, Eds., Berlin/Heidelberg, Germany: Springer, 2007, vol. 4654, pp. 303–312.
[25] R. Polikar, "Ensemble based systems in decision making," IEEE Circuits Syst. Mag., vol. 6, no. 3, pp. 21–45, 2006.
[26] L. Rokach, "Ensemble-based classifiers," Artif. Intell. Rev., vol. 33, pp. 1–39, 2010.
[27] N. C. Oza and K. Tumer, "Classifier ensembles: Select real-world applications," Inf. Fusion, vol. 9, no. 1, pp. 4–20, 2008.
[28] C. Silva, U. Lotric, B. Ribeiro, and A. Dobnikar, "Distributed text classification with an ensemble kernel-based learning approach," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 3, pp. 287–297, May 2010.
[29] Y. Yang and K. Chen, "Time series clustering via RPCL network ensemble with different representations," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 41, no. 2, pp. 190–199, Mar. 2011.
[30] Y. Xu, X. Cao, and H. Qiao, "An efficient tree classifier ensemble-based approach for pedestrian detection," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 107–117, Feb. 2011.
[31] T. K. Ho, J. J. Hull, and S. N. Srihari, "Decision combination in multiple classifier systems," IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 1, pp. 66–75, Jan. 1994.
[32] T. K. Ho, "Multiple classifier combination: Lessons and next steps," in Hybrid Methods in Pattern Recognition, Kandel and Bunke, Eds. Singapore: World Scientific, 2002, pp. 171–198.
[33] N. Ueda and R. Nakano, "Generalization error of ensemble estimators," in Proc. IEEE Int. Conf. Neural Netw., 1996, vol. 1, pp. 90–95.
[34] A. Krogh and J. Vedelsby, "Neural network ensembles, cross validation, and active learning," in Proc. Adv. Neural Inf. Process. Syst., 1995, vol. 7, pp. 231–238.
[35] G. Brown, J. Wyatt, R. Harris, and X. Yao, "Diversity creation methods: A survey and categorization," Inf. Fusion, vol. 6, no. 1, pp. 5–20, 2005 (diversity in multiple classifier systems).
[36] K. Tumer and J. Ghosh, "Error correlation and error reduction in ensemble classifiers," Connect. Sci., vol. 8, no. 3–4, pp. 385–404, 1996.
[37] X. Hu, "Using rough sets theory and database operations to construct a good ensemble of classifiers for data mining applications," in Proc. IEEE Int. Conf. Data Mining, 2001, pp. 233–240.
[38] L. I. Kuncheva, "Diversity in multiple classifier systems," Inf. Fusion, vol. 6, no. 1, pp. 3–4, 2005 (diversity in multiple classifier systems).
[39] L. Rokach, "Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography," Comput. Stat. Data An., vol. 53, no. 12, pp. 4046–4072, 2009.
[40] R. E. Schapire, "The strength of weak learnability," Mach. Learn., vol. 5, pp. 197–227, 1990.
[41] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.
[42] L. Breiman, "Bagging predictors," Mach. Learn., vol. 24, pp. 123–140, 1996.
[43] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. New York: Wiley-Interscience, 2004.
[44] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, "SMOTEBoost: Improving prediction of the minority class in boosting," in Proc. Knowl. Discov. Databases, 2003, pp. 107–119.

Table XXI. Detailed test results table for bagging-based algorithms.

The key issue of these methods resides in properly exploiting the diversity when each bootstrap replica is formed.

4) Clearly, the trade-off between complexity and performance of ensemble learning algorithms adapted to handle class imbalance is positive, since the results are significantly improved. They are more appropriate than the mere use of classic ensembles or data preprocessing techniques. In addition, extending the results of the last part of the experimental study, the base classifier's results are outperformed.

5) Regarding the computational complexity, even though our analysis is mainly devoted to the algorithms' performance, we should highlight that RUSBoost competes against SMOTEBagging and UnderBagging with only ten classifiers (since it achieves better performance with fewer classifiers). The reader might also note that RUSBoost's classifiers are much faster to build, since fewer instances are used to construct each classifier (due to the undersampling process); besides, the ensemble is more comprehensible, containing only ten smaller trees. On the other hand, SMOTEBagging constructs larger trees (due to the oversampling mechanism). Likewise, UnderBagging is computationally harder than RUSBoost: in spite of obtaining trees of comparable size, it uses four times more classifiers.

VI. CONCLUDING REMARKS

In this paper, the state of the art on ensemble methodologies to deal with the class imbalance problem has been reviewed. This issue hinders the performance of standard classifier learning algorithms that assume relatively balanced class distributions, and classic ensemble learning algorithms are not an exception. In recent years, several methodologies integrating solutions to enhance the induced classifiers in the presence of class imbalance by the usage of ensemble learning algorithms have been presented. However, there was a lack of a framework where each one of them could be classified; for this reason, a taxonomy where they can be placed has been presented. We divided these methods into four families depending on their base ensemble learning algorithm and the way in which they address the class imbalance problem.
Once the new taxonomy has been presented, a thorough study of the performance of these methods in a large number of real-world imbalanced problems has been performed, and these approaches have been compared with classic ensemble approaches and nonensemble approaches. We have performed this study by developing a hierarchical analysis over the taxonomy proposed, which was guided by nonparametric statistical tests.

Finally, we have concluded that ensemble-based algorithms are worthwhile, improving the results that are obtained by the usage of data preprocessing techniques and training a single classifier. The use of more classifiers makes them more complex, but this growth is justified by the better results that can be assessed. We have to remark the good performance of approaches such as RUSBoost or UnderBagging, which, despite being simple approaches, achieve higher performances than many other more complex algorithms. Moreover, we have shown the positive synergy between sampling techniques (e.g., undersampling or SMOTE) and the Bagging ensemble learning algorithm. Particularly noteworthy is the performance of RUSBoost, which is the computationally least complex among the best performers.

Fig. 10. Global analysis scheme. A gray tone (color) represents each algorithm. Rankings and p-values are shown for Wilcoxon tests, whereas only rankings and the hypothesis results are shown for Iman–Davenport and Holm tests.

Taking the computational complexity into account, RUSBoost excels as the most appropriate ensemble method (Item 5 extends this issue).

2) More complex methods do not perform better than simpler ones. It must be pointed out that two of the simplest approaches (RUSBoost and UnderBagging), with the usage of a random and easy-to-develop strategy, achieve better results than many other approaches. The positive synergy between random undersampling and ensemble techniques has stood out in the experimental analysis. This sampling technique eliminates different majority class examples in each iteration; this way, the distribution of the class overlapping differs in all data-sets, and this causes the diversity to be boosted. In addition, in contrast with the mere use of an undersampling process before learning a nonensemble classifier, carrying it out in every iteration when constructing the ensemble allows the consideration of most of the important majority patterns that can be defined by concrete instances, which, using a unique classifier, could be lost.

Table XIX. Detailed test results table of nonensemble methods and classic ensemble algorithms.
Table XX. Detailed test results table for cost-sensitive boosting, boosting-based, and hybrid ensembles.

3) Bagging techniques are not only easy to develop, but also powerful when dealing with class imbalance if they are properly combined. Their hybridization with data preprocessing techniques has shown competitive results, the key issue being to properly exploit the diversity when each bootstrap replica is formed.

Fig. 8. Box-plot of AUC results of the families' representatives.
Fig. 9. Average rankings of the representatives of each family.
Table XVI. Holm table for best interfamily analysis.
Table XVII. Wilcoxon tests to show differences between SBAG4 and RUS1.

There exist differences among some of these algorithms, and we continue with the Holm post-hoc test. The results of this test are shown in Table XVI.

The Holm test brings out the dominance of SBAG4 and RUS1 over the rest of the methods. SBAG4 significantly outperforms all algorithms except RUS1. We have two methods which behave similarly with respect to the rest, SBAG4 and RUS1; therefore, we will get them into a pairwise comparison via a Wilcoxon test. In such a way, our aim is to obtain a better insight into the behavior of this pair of methods (in Table XVII, we show the result).
The Wilcoxon test does not indicate the existence of statistical differences either; moreover, both algorithms are similar in terms of ranks. SBAG4 has an advantage and hence, apparently, a better overall behavior, but we cannot support this fact with this test. Therefore, SBAG4 is the winner of the hierarchical analysis in terms of ranks, but it is closely followed by RUS1 and UB4 (as we have shown in Section V-B4). However, despite SBAG4 winning in terms of ranks, since there does not exist any statistical difference, we may also pay attention to the computational complexity of each algorithm in order to establish a preference. In this sense, RUS1 undoubtedly stands out with respect to both SBAG4 and UB4. The building time of RUS1's and UB4's classifiers is lower than that of SBAG4's classifiers; this is due to the undersampling process they develop instead of the oversampling that is carried out by SBAG4, in such a way that the classifiers are trained with much fewer instances. Moreover, RUS1 only uses ten classifiers against the 40 classifiers that are used by SBAG4 and UB4, which, apart from resulting in a less complex and more comprehensible ensemble, needs four times less time than UB4 to be constructed.

Table XVIII. Shaffer tests for interfamily comparison.

To end and complete the statistical study, we carry out another post-hoc test for the interfamily comparison in order to show the relation between all representatives, that is, an n × n comparison. To do so, we execute the Shaffer post-hoc test and we show the results in Table XVIII. In this table, a "+" symbol implies that the algorithm in the row is statistically better than the one in the column, whereas "−" implies the contrary; "=" means that the two algorithms that are compared have no significant differences. In brackets, the adjusted p-value that is associated with each comparison is shown. In this table, we can also observe the superiority of SBAG4 and RUS1 against the remaining algorithms and, besides, the similarity (almost equivalence) between both approaches.

D. Discussion: Summary of the Results

In order to summarize the whole hierarchical analysis developed in this section, we include a scheme showing the global analysis in Fig. 10. Each algorithm is represented by a gray tone (color). For Wilcoxon tests, we show the ranks and the p-value returned; for Iman–Davenport tests, we show the rankings and whether the hypothesis has been rejected or not by the usage of the Holm post-hoc test. This way, the evolution of the analysis can be easily followed.

Summarizing the results of the hierarchical analysis, we point out the main conclusions that we have extracted from the experimental study:

1) The methods with the best (the most robust) behavior are SMOTEBagging, RUSBoost, and UnderBagging. Among them, in terms of ranks, SMOTEBagging stands out, obtaining slightly better results. Anyway, this triple of algorithms statistically outperforms the others considered in this study, but they are statistically equivalent; for this reason, we should take the computational complexity into account.

Fig. 6. Average rankings of IIVotes-based ensembles.
Fig. 7. Average rankings of bagging-based ensembles.
Table XII. Holm table for best bagging-based methods.

Regarding IIVotes methods, we start the multiple comparisons by executing the Iman–Davenport test, which returns a p-value of 0.1208. Therefore, the hypothesis of equivalence is not rejected. However, as Fig. 6 shows, the rankings obtained by SPr are higher than the ones of the other two methods. Following these results, we will only take into account SPr in the following phase.

Once we have reduced the number of Bagging-based algorithms, we can develop the proper comparison among the remaining methods. The Iman–Davenport test executed for this group of algorithms returns a p-value of 0.00172, which means that there exist significant differences (in Fig. 7, we show the average rankings).
Hence, we apply the Holm post-hoc procedure to compare SBAG4 (the one with the best ranking) with the rest of the Bagging-based methods. Observing the results shown in Table XII, SBAG4 clearly outperforms the other methods (except for UB4) with significant differences.

Regarding UB4, and given its similar behavior to SBAG4 with respect to the rest, we carry out a Wilcoxon test (Table XIII) in order to check whether there are any significant differences between them. From this test we conclude that, when both algorithms are confronted one versus the other, they are equivalent. On the contrary to the rankings computed among the group of algorithms, the ranks in this case are nearly the same. This occurs because SBAG4 has a good overall behavior among more data-sets, whereas UB4 stands out more in some of them and less in others. As a consequence, when they are put together with other methods, UB4's ranking decreases, whereas SBAG4 excels, in spite of UB4's mean test result being slightly higher than SBAG4's. Knowing that both algorithms achieve similar performances, we will use SBAG4 as representative because its overall behavior has been better when the comparison has included more methods.

Table XIII. Wilcoxon tests to show differences between SBAG4 and UB4.
Table XIV. Wilcoxon tests for hybrid ensembles.
Table XV. Representative methods selected for each family.

5) Hybrid Ensembles: This last family only has two methods; hence, we execute the Wilcoxon signed-rank test to find out possible differences. Table XIV shows the result of the test: both methods are quite similar, but EASY attains higher ranks. This result is in accordance with previous studies [47], where the advantage of BAL is its efficiency when dealing with large data-sets without highly decreasing the performance with respect to EASY. Following the same methodology as in previous families, we will use EASY as representative.

C. Interfamily Comparison

We have selected a representative for every family, so now we can proceed with the global study of the performance. First, we recall the selected methods from the intrafamily comparison in Table XV.

We have summarized the results for the test partitions of these methods in Fig. 8 using the box plot as representation scheme. Boxplots have proved a most valuable tool in data reporting, since they allow the graphical representation of the performance of the algorithms, indicating important features such as the median, extreme values, and spread of values about the median in the form of quartiles. We can observe that the RUS1 box is compact, as well as the SBAG4 box; both methods have similar results (superior to the rest), but the RUS1 median value is better. On the other hand, SMT seems to be inferior to the other approaches with the exception of M14, whose variance is the highest.

Starting with the comparison itself, we use the Iman–Davenport test to find out significant differences among these methods. The rankings computed to carry out the test are depicted in Fig. 9. The p-value returned by the test is very low (1.27E−09); hence, there exist differences among some of these algorithms.

Table VIII. Wilcoxon tests for nonensemble methods.
Fig. 4. Average rankings of classic ensembles.

We compare the two nonensemble techniques we are considering, C45 and SMT, that is, the C4.5 decision tree alone and C4.5 trained over preprocessed data-sets (using SMOTE). The result of the test is shown in Table VIII. We observe that the performance of C45 is affected by the presence of class imbalance. The Wilcoxon test shows, in concordance with previous studies [20], [53], that making use of SMOTE as a preprocessing technique significantly outperforms the C4.5 algorithm alone. The overall performance of SMT is better, achieving higher ranks and rejecting the null hypothesis of equivalence with a p-value of 0.00039. For this reason, SMT will be the algorithm representing the family of nonensembles.
2) Classic Ensembles: Regarding classic ensembles, Boosting (AdaBoost, AdaBoost.M1, and AdaBoost.M2) and Bagging, we carry out the Iman–Davenport test to find out whether they are statistically different in the imbalance framework. Fig. 4 shows the average rankings of the algorithms computed for the Iman–Davenport test.

We observe that the ranking of BAG4 is higher than the rest, which means that it is the worst performer, whereas the rankings of the Boosting algorithms are similar, which is understandable because of their common idea. However, the absolute differences of ranks are really low; this is confirmed by the Iman–Davenport test, which obtains a p-value of 0.49681. Hence, we select M14 as representative of the family for having the lowest average rank, but notice that, in spite of selecting M14, there are no significant differences in this family.

3) Boosting-based Ensembles: This kind of ensembles includes the RUSBoost, SMOTEBoost, and MSMOTEBoost approaches. We show the rankings computed to carry out the test in Fig. 5. In this case, the Iman–Davenport test rejects the null hypothesis with a p-value of 2.97E−04. Hence, we execute the Holm post-hoc test with RUS1 as control algorithm since it has the lowest ranking.

The Holm test shows that RUS1 is significantly better than MBO4, whereas the same significance is not reached with respect to SBO1 (the results are shown in Table IX).

We want to better analyze the relation between RUS1 and SBO1, so we execute the Wilcoxon test for this pair. The result is shown in Table X: RUS1 has a better overall behavior as expected, and the p-value returned by the comparison is low, but despite this situation, no significant differences are attained either.

Fig. 5. Average rankings of boosting-based ensembles.
Table IX. Holm table for boosting-based methods.
Table X. Wilcoxon tests to show differences between SBO1 and RUS1.
Table XI. Wilcoxon tests for bagging-based ensembles reduction.

RUS1 will represent this family in the next phase due to its better general performance.

4) Bagging-based Ensembles: Because of the number of Bagging-based approaches, we start making a preselection before the comparison between the family members. On the one hand, we will make a reduction between similar approaches such as UB/UB2, OB/OB2, and SBAG/MBAG. On the other hand, we will select the best IIVotes ensemble, comparing the three ways to develop the SPIDER preprocessing inside the IVotes iterations.

To get on with the first part, we use the Wilcoxon test to investigate which one of each pair of approaches is more adequate. The results of these tests are shown in Table XI. Between UnderBagging approaches, UB4 (which always uses all the minority class examples without resampling) obtains higher ranks. This result stresses that the diversity is no longer exploited when minority class examples are also bootstrapped; this can be because not using all the minority class instances could make it more difficult to learn the positive concept in some of the classifiers of the ensemble. In the case of OverBagging, the use of resampling of the majority class (OB2) clearly outperforms OB; this makes sense since the diversity of OB2 is a priori higher than that of OB. In addition, between synthetic oversampling approaches, the original SMOTEBagging is significantly better than its modification with MSMOTE, which seems not to work as well as the original. Therefore, only UB4, OB24, and SBAG4 are selected for the next phase.
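Since the Holm post-hoc procedure drives the 1 × n comparisons throughout this section, a minimal sketch of its step-down logic follows. It assumes a list of p-values obtained from comparing a control algorithm against the others; the values used here are placeholders, not the ones reported in the paper's tables.

    import numpy as np

    def holm_stepdown(p_values, alpha=0.05):
        """Holm's step-down procedure: compare the ordered p-values against
        alpha/k, alpha/(k-1), ... and stop at the first non-rejection."""
        p = np.asarray(p_values, dtype=float)
        order = np.argsort(p)
        k = len(p)
        reject = np.zeros(k, dtype=bool)
        for step, idx in enumerate(order):
            if p[idx] <= alpha / (k - step):
                reject[idx] = True
            else:
                break  # every remaining hypothesis is retained
        return reject

    # Placeholder p-values for a control algorithm against four others.
    print(holm_stepdown([0.001, 0.010, 0.040, 0.300]))
    # -> [ True  True False False]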
Table VI. Mean AUC train and test results for all the algorithms in the experimental study (± for standard deviation).

Since we compare pairs of result sets, we use the Wilcoxon signed-rank test to find out whether there are significant differences between the usage of one or another configuration, and if not, to select the set-up which reaches the highest amount of ranks. This does not mean that the method is significantly better, but that it has an overall better behavior among all the data-sets, so we will use it in further comparisons.

Table VII shows the outputs of the Wilcoxon tests. We append a "1" to the algorithm abbreviation to indicate that it uses ten classifiers, and we do the same with a "4" whenever it uses 40 classifiers. We show the ranks for each method and whether the hypothesis is rejected with a significance value of α = 0.05, but also the p-value, which gives us important information about the differences. The last column shows the configuration that we have selected for the next phase depending on the rejection of the hypothesis or, if it is not rejected, depending on the ranks.

Table VII. Wilcoxon tests to decide the number of classifiers.

Looking at Table VII, we observe that classic boosting methods have different behaviors; ADAB and M1 have better performance with 40 classifiers, whereas M2 is slightly better with 10. Classic bagging, as well as most of the bagging-based approaches (except UOB), has significantly better results using 40 base classifiers. The cost-sensitive boosting approach obtains a low p-value (close to 0.05) in favor of the configuration of 40 classifiers; hence, that strategy benefits it. With respect to boosting-based ensembles, RUS performance clearly stands out when only ten classifiers are used; on the other hand, the configuration of SBO and MBO is quite indifferent. As in the case of cost-sensitive boosting, for both SBAG and MBAG the p-value is quite low and the sum of ranks stresses the goodness of the selection of 40 classifiers in these ensemble algorithms. Bagging-based approaches that use random oversampling (OB, OB2, and UOB) do not show such big differences, but UOB is the unique one that works globally better with the low number of classifiers.

B. Intrafamily Comparison

In this subsection, we develop the comparisons in order to select the best representatives of the families. When we only have a pair of algorithms in a family, we use the Wilcoxon signed-rank test; otherwise, we use the Iman–Davenport test and we follow with the Holm post-hoc test if it is necessary.

We divided this subsection into five parts, one for the analysis of each family. We have to recall that we do not analyze cost-sensitive Boosting approaches since we are only considering the AdaC2 approach; hence, it will be their representative in the last phase. Therefore, first we get on with nonensemble and classic ensemble techniques and then we go through the remaining three families of ensembles especially designed for imbalanced problems.

1) Nonensemble Techniques: Firstly, we execute the Wilcoxon test between the results of the two nonensemble techniques we are considering (C45 and SMT).

For each data-set, Table V shows the number of examples (#Ex.), the number of attributes (#Atts.), the class name of each class (minority and majority), the percentage of examples of each class, and the IR. The table is ordered according to this last column in ascending order.

We have obtained the AUC metric estimates by means of a 5-fold cross-validation. That is, the data-set was split into five folds, each one containing 20% of the patterns of the data-set. For each fold, the algorithm is trained with the examples contained in the remaining folds and then tested with the current fold. The data partitions used in this paper can be found in the KEEL data-set repository [57] so that any interested researcher can reproduce the experimental study.
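A minimal sketch of this evaluation protocol follows: AUC estimated by 5-fold cross-validation with an entropy-based decision tree standing in for C4.5. The synthetic data-set, the use of scikit-learn, and the stratification of the folds are illustrative assumptions; the paper itself uses the KEEL partitions.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    # Synthetic imbalanced data standing in for a KEEL data-set (IR = 9).
    X, y = make_classification(n_samples=500, weights=[0.9], random_state=1)

    aucs = []
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    for train_idx, test_idx in folds.split(X, y):
        # An entropy-based tree is the closest scikit-learn analogue of C4.5.
        clf = DecisionTreeClassifier(criterion="entropy", random_state=1)
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))

    print("Mean AUC: %.4f (+- %.4f)" % (np.mean(aucs), np.std(aucs)))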
C. Statistical Tests

In order to compare different algorithms and to show whether there exist significant differences among them, we have to give the comparison a statistical support [108]. To do so, we use nonparametric tests according to the recommendations made in [63]–[65], [108], where a set of proper nonparametric tests for statistical comparisons of classifiers is presented. We need to use nonparametric tests because the initial conditions that guarantee the reliability of the parametric tests may not be satisfied, causing the statistical analysis to lose its credibility [63].

In this paper, we use two types of comparisons: pairwise (between a pair of algorithms) and multiple (among a group of algorithms).

1) Pairwise comparisons: We use the Wilcoxon paired signed-rank test [109] to find out whether there exist significant differences between a pair of algorithms.

2) Multiple comparisons: We first use the Iman–Davenport test [110] to detect statistical differences among a group of results. Then, if we want to check whether a control algorithm (usually the best one) is significantly better than the rest (1 × n comparison), we use the Holm post-hoc test [111], whereas, when we want to find out which algorithms are distinctive in an n × n comparison, we use the Shaffer post-hoc test [112]. The post-hoc procedures allow us to know whether a hypothesis of comparison of means could be rejected at a specified level of significance α (i.e., whether there exist significant differences). Besides, we compute the p-value associated with each comparison, which represents the lowest level of significance of a hypothesis that results in a rejection. In this manner, we can also know how different two algorithms are.

These tests are suggested in different studies [63]–[65], where their use in the field of machine learning is highly recommended. Any interested reader can find additional information on the Web site http://sci2s.ugr.es/sicidm/, together with the software for applying the statistical tests.

Complementing the statistical analysis, we also consider the average ranking of the algorithms in order to show at first glance how good a method is with respect to the rest in the comparison. The rankings are computed by first assigning a rank position to each algorithm in every data-set, which consists in assigning the first rank in a data-set (value 1) to the best performing algorithm, the second rank (value 2) to the second best algorithm, and so forth. Finally, the average ranking of a method is computed by the mean value of its ranks among all data-sets.
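A minimal sketch of these comparisons follows, assuming a results matrix auc of shape (n_datasets, n_algorithms) filled with placeholder values. SciPy provides the Wilcoxon and plain Friedman tests; the Iman–Davenport statistic is then obtained through its standard F transform of the Friedman statistic. The Holm and Shaffer post-hoc procedures are not in SciPy and are omitted here (a Holm sketch appears later in the experimental study).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    auc = rng.uniform(0.7, 1.0, size=(44, 4))   # placeholder AUC results

    # Pairwise comparison: Wilcoxon paired signed-rank test between two methods.
    stat, p = stats.wilcoxon(auc[:, 0], auc[:, 1])
    print("Wilcoxon p-value: %.4f" % p)

    # Average rankings: rank algorithms within each data-set (1 = best AUC).
    ranks = np.mean([stats.rankdata(-row) for row in auc], axis=0)
    print("Average ranks:", ranks)

    # Friedman statistic, then the Iman-Davenport F correction.
    n, k = auc.shape
    chi2, _ = stats.friedmanchisquare(*auc.T)
    ff = (n - 1) * chi2 / (n * (k - 1) - chi2)
    p_id = stats.f.sf(ff, k - 1, (k - 1) * (n - 1))
    print("Iman-Davenport p-value: %.4f" % p_id)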
V. EXPERIMENTAL STUDY

In this section, we carry out the empirical comparison of the algorithms that we have reviewed. Our aim is to answer several questions about the reviewed ensemble learning algorithms in the scenario of two-class imbalanced problems.

1) In the first place, we want to analyze which one of the approaches is able to better handle a large amount of imbalanced data-sets with different IR, i.e., to show which one is the most robust method.

2) We also want to investigate their improvement with respect to classic ensembles and to look into the appropriateness of their use instead of applying a unique preprocessing step and training a single classifier. That is, whether the trade-off between complexity increment and performance enhancement is justified or not.

Given the amount of methods in the comparison, we cannot afford it directly. On this account, we develop a hierarchical analysis (a tournament among algorithms). This methodology allows us to obtain a better insight into the results by discarding those algorithms which are not the best in a comparison. We divided the study into three phases, all of them guided by the nonparametric tests presented in Section IV-C:

1) Number of classifiers: In the first phase, we analyze which configuration of how many classifiers is the best for the algorithms that are configurable to be executed with both 10 and 40 classifiers. As we explained in Section IV-A, this phase allows us to give all of them the same opportunities.

2) Intrafamily comparison: The second phase consists in analyzing each family separately. We investigate which of their components has the best (or only a better) behavior. Those methods will then be considered to take part in the final phase (representatives of the families).

3) Interfamily comparison: In the last phase, we develop a comparison among the representatives of each family. In such a way, our objective is to analyze which algorithm stands out from all of them, as well as to study the behavior of ensemble-based methods to address the class imbalance problem with respect to the rest of the approaches considered.

Following this methodology, at the end we will be able to account for the questions that we have set out. We divide this section into three subsections according to each one of the goals of the study, and a final one (Subsection V-D) where we discuss and sum up the results obtained in this study.

Before starting with the analysis, we show the overall train and test AUC results (± for standard deviation) in Table VI. The detailed test results of all methods and data-sets are presented in the Appendix.

A. Number of Classifiers

We start investigating the configuration of the number of classifiers. This parameter is configurable in all except nonensembles, hybrids, and IIVotes methods.

Table III. Algorithms used in the experimental study.

Table IV shows the parameters we have used in the experiments, which are the parameters recommended by their authors. All experiments have been developed using the KEEL software (http://www.keel.es) [56], [57].

B. Data-sets

In the study, we have considered 44 binary data-sets from the KEEL data-set repository [56], [57], which are publicly available on the corresponding web-page (http://www.keel.es/dataset.php), which includes general information about them. Multiclass data-sets were modified to obtain two-class imbalanced problems, so that the union of one or more classes became the positive class and the union of one or more of the remaining classes was labeled as the negative class. This way, we have different IRs: from low imbalance to highly imbalanced data-sets. Table V summarizes the properties of the selected data-sets.

Table IV. Configuration parameters for the algorithms used in the experimental study.
Table V. Summary description of the imbalanced data-sets used in the experimental study.
Table II. Parameter specification for C4.5.

The C4.5 learning algorithm constructs the decision tree top-down by the usage of the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is the one used to make the decision. In Table II, we show the configuration parameters that we have used to run C4.5. We acknowledge that we could consider the use of a classification tree algorithm, such as the Hellinger distance tree [106], that is specifically designed for the solution of imbalanced problems. However, in [106], the authors show that it often experiences a reduction in performance when sampling techniques are applied, which is the base of the majority of the studied techniques; moreover, being more robust (less weak) than C4.5 in the imbalance scenario, the diversity of the ensembles could be hindered.

Besides the ensemble-based methods that we consider, we include another nonensemble technique to be able to analyze whether the use of ensembles is beneficial, not only with respect to the original base classifier, but also to outperform the results of the classifier trained over preprocessed data-sets. To do so, before learning the decision trees, we use the SMOTE preprocessing algorithm to rebalance the data-sets before the learning stage (see Section II-D). Previous works have shown the positive synergy of this combination, leading to significant improvements [20], [53].
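A minimal sketch of this SMOTE-plus-tree baseline (the SMT configuration) follows. imbalanced-learn's SMOTE and scikit-learn's entropy tree are stand-ins for the KEEL implementations of SMOTE and C4.5; the data-set and parameters are illustrative.

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    X, y = make_classification(n_samples=500, weights=[0.9], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

    # Rebalance the training partition only; the test partition stays untouched.
    X_bal, y_bal = SMOTE(k_neighbors=5, random_state=1).fit_resample(X_tr, y_tr)

    tree = DecisionTreeClassifier(criterion="entropy", random_state=1)
    tree.fit(X_bal, y_bal)
    print("AUC:", roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]))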
Regarding ensemble learning algorithms, on the one hand, we include classic ensembles (which are not specifically developed for imbalanced domains) such as Bagging, AdaBoost, AdaBoost.M1, and AdaBoost.M2. On the other hand, we include the algorithms that are designed to deal with skewed class distributions in the data-sets, which, following the taxonomy proposed in Section III-B, are distinguished into four families: cost-sensitive Boosting, Boosting-based, Bagging-based, and Hybrid ensembles.

Concerning the cost-sensitive boosting framework, a thorough empirical study was presented in [50]. To avoid the repetition of similar experiments, we will follow those results, where the AdaC2 algorithm stands out with respect to the others. Hence, we will empirically study this algorithm among the ones from this family in the experimental study.

Note that in our experiments we want to analyze which is the most robust method among ensemble approaches, that is, given a large variety of problems, which one is more capable of assessing an overall good (better) performance in all the problems. The robustness concept also has an implicit meaning of generality: algorithms whose configuration parameters have to be tuned depending on the data-set are less robust, since changes in the data can easily worsen their results; hence, they have more difficulties to be adapted to new problems.

Recall from Section II-C that the weakness of cost-sensitive approaches is the need of a cost definition. These costs are not usually presented in classification data-sets, and on this account, they are usually set ad hoc or found by conducting a search in the space of possible costs. Therefore, in order to execute AdaC2, we set the costs depending on the IR of each data-set. In other words, we set up an adaptive cost strategy, where the cost of misclassifying a minority class instance is always C_min = 1, whereas that of misclassifying a majority class instance is inversely proportional to the IR of the data-set (C_maj = 1/IR); a small sketch of this cost assignment is given at the end of this subsection.

The Boosting-based ensembles that are considered in our study are RUSBoost, SMOTEBoost, and MSMOTEBoost. As we have explained, the DataBoost-IM approach is not capable of dealing with some of the data-sets that are used in the study (more details in Subsection IV-B).

With respect to Bagging-based ensembles, we include from the OverBagging group OverBagging (which uses random oversampling) and SMOTEBagging, due to the great difference in their way to perform the oversampling to create each bag. In the same manner that we use MSMOTEBoost, in this case we have also developed an MSMOTEBagging algorithm, whose unique difference with SMOTEBagging is the use of MSMOTE instead of SMOTE. Hence, we are able to analyze the suitability of their integration in both Boosting and Bagging. Among UnderBagging methods, we consider the random undersampling method to create each balanced bag. We discard the rest of the approaches (e.g., roughly balanced bagging or partitioning) given their similarity; hence, we only develop the more general version. For UnderBagging and OverBagging, we incorporate both possible variations (resampling of both classes in each bag and resampling of only one of them), in such a way that we can analyze their influence on the diversity of the ensemble. The set of Bagging-based ensembles ends with UnderOverBagging and the combination of SPIDER with IVotes; for the IIVotes algorithm we have tested the three configurations of SPIDER.

Finally, we consider both hybrid approaches, EasyEnsemble and BalanceCascade.

For the sake of clarity for the reader, Table III summarizes the whole list of algorithms grouped by families; we also show the abbreviations that we will use along the experimental study and a short description.
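The adaptive cost strategy described above (C_min = 1, C_maj = 1/IR) reduces to a few lines; here is a small illustrative sketch, where the function name and the minority-label convention are assumptions of this example.

    import numpy as np

    def adac2_costs(y, minority_label=1):
        """Per-instance costs C_i with C_min = 1 and C_maj = 1 / IR."""
        n_min = np.sum(y == minority_label)
        n_maj = len(y) - n_min
        ir = n_maj / n_min                       # imbalance ratio
        return np.where(y == minority_label, 1.0, 1.0 / ir)

    y = np.array([0] * 90 + [1] * 10)            # IR = 9
    costs = adac2_costs(y)
    print(costs[0], costs[-1])                   # majority: 1/9, minority: 1.0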
In our experiments, we want all methods to have the same opportunities to achieve their best results, but always without fine-tuning their parameters depending on the data-set. Generally, the higher the number of base classifiers, the better the results we achieve; however, this does not occur in every method (i.e., more classifiers without spreading diversity could worsen the results and they could also produce overfitting). Most of the reviewed approaches employ ten base classifiers by default, but others such as EasyEnsemble and BalanceCascade need more classifiers to make sense (since they train each bag with AdaBoost). In that case, the authors use a total of 40 classifiers (four bagging iterations and ten AdaBoost iterations per bag). On this account, we will first study which configuration is more appropriate for each ensemble method and then we will follow with the intrafamily and interfamily comparisons. Table IV shows the rest of the parameters required by the algorithms we have used in the experiments.

Oversampling of the minority class instances can also be carried out by the usage of the SMOTE preprocessing algorithm. SMOTEBagging [55] differs from the use of random oversampling not only because of the different preprocessing mechanism; the way it creates each bag is significantly different. As well as in OverBagging, in this method both classes contribute to each bag with N_maj instances. But a SMOTE resampling rate (a%) is set in each iteration (ranging from 10% in the first iteration to 100% in the last, always being a multiple of 10), and this ratio defines the number of positive instances (a% · N_maj) randomly resampled (with replacement) from the original data-set in each iteration. The rest of the positive instances are generated by the SMOTE algorithm. Besides, the set of negative instances is bootstrapped in each iteration in order to form a more diverse ensemble.

b) UnderBagging: On the contrary to OverBagging, the UnderBagging procedure uses undersampling instead of oversampling. However, in the same manner as OverBagging, it can be developed in at least two ways. The undersampling procedure is usually only applied to the majority class; however, a resampling with replacement of the minority class can also be applied in order to obtain a priori more diverse ensembles (a minimal sketch of the balanced bag construction is given after this list). Point out that in UnderBagging it is more probable to ignore some useful negative instances, but each bag has fewer instances than the original data-set (on the contrary to OverBagging).

On the one hand, the UnderBagging method has been used with different names, but maintaining the same functional structure, e.g., Asymmetric Bagging [101] and QuasiBagging [100]. On the other hand, roughly balanced Bagging [102] is quite similar to UnderBagging, but it does not bootstrap a totally balanced bag. The number of positive examples is kept fixed (by the usage of all of them or resampling them), whereas the number of negative examples drawn in each iteration varies slightly, following a negative binomial distribution (with q = 0.5 and n = N_min). Partitioning [103], [104] (also called Bagging Ensemble Variation [105]) is another way to develop the undersampling; in this case, the instances of the majority class are divided into IR disjoint data-sets and each classifier is trained with one of those bootstraps (mixed with the minority class examples).

c) UnderOverBagging: UnderBagging to OverBagging follows a different methodology from OverBagging and UnderBagging, but similar to SMOTEBagging, to create each bag. It makes use of both oversampling and undersampling techniques; a resampling rate (a%) is set in each iteration (ranging from 10% to 100%, always being a multiple of 10); this ratio defines the number of instances taken from each class (a% · N_maj instances). Hence, the first classifiers are trained with a lower number of instances than the last ones. This way, the diversity is boosted.

d) IIVotes: Imbalanced IVotes is based on the same combination idea, but it integrates the SPIDER data preprocessing technique with IVotes (a preprocessing phase is applied in each iteration before Step 13 of Algorithm 2). This method has the advantage of not needing to define the number of bags, since the algorithm stops when the out-of-bag error estimation no longer decreases.
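As a minimal sketch of the balanced bag construction shared by these UnderBagging-style methods, the following draws one bootstrap replica that keeps all minority instances and randomly undersamples the majority class. The function name and label convention are illustrative assumptions.

    import numpy as np

    def underbagging_bag(y, rng, minority_label=1):
        """Return the indices of one balanced bootstrap replica."""
        min_idx = np.flatnonzero(y == minority_label)
        maj_idx = np.flatnonzero(y != minority_label)
        # Undersample the majority class without replacement; keep all minority.
        maj_sample = rng.choice(maj_idx, size=len(min_idx), replace=False)
        bag = np.concatenate([min_idx, maj_sample])
        rng.shuffle(bag)
        return bag

    rng = np.random.default_rng(0)
    y = np.array([0] * 90 + [1] * 10)
    print(len(underbagging_bag(y, rng)))   # 20 instances, 10 per class

Resampling the minority class with replacement (instead of keeping all of it) is the variation discussed above that trades some minority coverage for additional diversity.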
4) Hybrid Ensembles: The main difference of the algorithms in this category with respect to the previous ones is that they carry out a double ensemble learning, that is, they combine both bagging and boosting (also with a preprocessing technique). Both algorithms that use this hybridization were proposed in [47], and were referred to as exploratory undersampling techniques. EasyEnsemble and BalanceCascade use Bagging as the main ensemble learning method, but instead of training a single classifier for each new bag, they train each bag using AdaBoost. Hence, the final classifier is an ensemble of ensembles.

In the same manner as UnderBagging, each balanced bag is constructed by randomly undersampling instances from the majority class and by the usage of all the instances from the minority class. The difference between these methods is the way in which they treat the negative instances after each iteration, as explained in the following.

a) EasyEnsemble: This approach does not perform any operation with the instances from the original data-set after each AdaBoost iteration. Hence, all the classifiers can be trained in parallel. Note that EasyEnsemble can be seen as an UnderBagging where the base learner is AdaBoost; if we fix the number of classifiers, EasyEnsemble will train fewer bags than UnderBagging, but more classifiers will be assigned to learn each single bag (see the sketch after this list).

b) BalanceCascade: BalanceCascade works in a supervised manner, and therefore the classifiers have to be trained sequentially. In each bagging iteration, after learning the AdaBoost classifier, the majority class examples that are correctly classified with higher confidences by the currently trained classifiers are removed from the data-set, and they are not taken into account in further iterations.
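A minimal sketch of the EasyEnsemble idea follows: several balanced bags, each learned with AdaBoost, with the outputs averaged. scikit-learn's AdaBoostClassifier is a stand-in for the original formulation, and the bag/iteration counts are the illustrative defaults mentioned above (four bags, ten boosting rounds each).

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.datasets import make_classification

    def easy_ensemble(X, y, n_bags=4, n_boost=10, seed=0, minority_label=1):
        rng = np.random.default_rng(seed)
        min_idx = np.flatnonzero(y == minority_label)
        maj_idx = np.flatnonzero(y != minority_label)
        models = []
        for _ in range(n_bags):
            # Balanced bag: all minority instances plus an undersampled majority.
            bag = np.concatenate(
                [min_idx,
                 rng.choice(maj_idx, size=len(min_idx), replace=False)])
            models.append(
                AdaBoostClassifier(n_estimators=n_boost).fit(X[bag], y[bag]))
        # Ensemble of ensembles: average the AdaBoost probability outputs.
        return lambda X_new: np.mean(
            [m.predict_proba(X_new)[:, 1] for m in models], axis=0)

    X, y = make_classification(n_samples=500, weights=[0.9], random_state=1)
    predict = easy_ensemble(X, y)
    print(predict(X[:5]))

Because the bags are independent, the inner AdaBoost models could be trained in parallel, which is exactly the property that distinguishes EasyEnsemble from BalanceCascade.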
2) Boosting-based Ensembles: In this family, we have included the algorithms that embed techniques for data preprocessing into boosting algorithms. In such a manner, these methods alter and bias the weight distribution used to train the next classifier toward the minority class in every iteration. In this family, we include the SMOTEBoost [44], MSMOTEBoost [82], RUSBoost [45], and DataBoost-IM [98] algorithms.

a) SMOTEBoost and MSMOTEBoost: Both methods introduce synthetic instances just before Step 4 of AdaBoost.M2 (Algorithm 5), using the SMOTE and MSMOTE data preprocessing algorithms, respectively. The weights of the new instances are proportional to the total number of instances in the new data-set. Hence, their weights are always the same (in all iterations and for all new instances), whereas the weights of the original data-set's instances are normalized in such a way that they form a distribution together with the new instances. After training a classifier, the weights of the original data-set instances are updated; then another sampling phase is applied (again modifying the weight distribution). The repetition of this process also brings along more diversity in the training data, which generally benefits the ensemble learning.

b) RUSBoost: In other respects, RUSBoost performs similarly to SMOTEBoost, but it removes instances from the majority class by randomly undersampling the data-set in each iteration. In this case, it is not necessary to assign new weights to the instances; it is enough to simply normalize the weights of the remaining instances in the new data-set with respect to their total sum of weights. The rest of the procedure is the same as in SMOTEBoost.

c) DataBoost-IM: This approach is slightly different from the previous ones, although its initial idea is similar: it combines the AdaBoost.M1 algorithm with a data generation strategy. Its major difference is that it first identifies hard examples (seeds) and then carries out a rebalancing process, always for both classes. At the beginning, the N_s instances (as many as there are instances misclassified by the current classifier) with the largest weights are taken as seeds. Considering that N_min and N_maj are the numbers of instances of the minority and majority class, respectively, whereas N_smin and N_smaj are the numbers of seed instances of each class, M_L = min(N_maj / N_min, N_smaj) and M_S = min((N_maj · M_L) / N_min, N_smin) minority and majority class instances are used as final seeds. Each seed produces N_maj or N_min new examples, depending on its class label. Nominal attributes' values are copied from the seed, and the values of continuous attributes are randomly generated following a normal distribution with the mean and variance of the class instances. Those instances are added to the original data-set with a weight proportional to the weight of the seed. Finally, the sums of weights of the instances belonging to each class are rebalanced, in such a way that both classes' sums are equal. The major drawback of this approach is its incapability to deal with highly imbalanced data-sets, because it generates an excessive amount of instances, which is not manageable for the base classifier (e.g., with N_maj = 3000 and N_min = 29 and Err = 15%, there will be 100 seed instances, where 71 have to be from the majority class, and at least 71 · 3000 = 213000 new majority instances are generated in each iteration). For this reason, we will not analyze it in the experimental study.
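As an illustration of how random undersampling can be embedded in a boosting loop, the following is a minimal two-class sketch in the spirit of RUSBoost. The published algorithm builds on AdaBoost.M2 and its pseudo-loss, so this simplified AdaBoost-style loop should be read as an approximation under our own assumptions, not as the authors' method; labels are assumed to be in {−1, +1}, with +1 the minority class.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def rusboost(X, y, T=10, seed=0):
        rng = np.random.default_rng(seed)
        n = len(y)
        D = np.full(n, 1.0 / n)                  # instance weights
        minority = np.where(y == 1)[0]
        majority = np.where(y == -1)[0]
        ensemble = []
        for _ in range(T):
            # Random undersampling: keep all minority instances plus an
            # equally sized random subset of the majority class.
            keep = np.concatenate(
                [minority,
                 rng.choice(majority, size=minority.size, replace=False)])
            w = D[keep] / D[keep].sum()          # renormalized weights
            clf = DecisionTreeClassifier(max_depth=3)
            clf.fit(X[keep], y[keep], sample_weight=w)
            pred = clf.predict(X)
            err = D[pred != y].sum()             # weighted error, full set
            if err == 0 or err >= 0.5:           # degenerate round: stop
                break
            alpha = 0.5 * np.log((1 - err) / err)
            D *= np.exp(-alpha * y * pred)       # AdaBoost-style update
            D /= D.sum()
            ensemble.append((alpha, clf))
        return ensemble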
3) Bagging-based Ensembles: Many approaches have been developed using bagging ensembles to deal with class imbalance problems, due to bagging's simplicity and good generalization ability. The hybridization of bagging and data preprocessing techniques is usually simpler than their integration in boosting. A bagging algorithm does not require recomputing any kind of weights; therefore, it is not necessary to adapt the weight update formula or to change computations in the algorithm. In these methods, the key factor is the way each bootstrap replica is collected (Step 2 of Algorithm 1), that is, how the class imbalance problem is dealt with in order to obtain a useful classifier in each iteration, without forgetting the importance of diversity.

We distinguish four main algorithms in this family: OverBagging [55], UnderBagging [99], UnderOverBagging [55], and IIVotes [46]. Note that we have grouped several approaches into OverBagging and UnderBagging due to their similarity, as we explain hereafter.

a) OverBagging: An easy way to overcome the class imbalance problem in each bag is to take into account the classes of the instances when they are randomly drawn from the original data-set. Hence, instead of performing a random sampling of the whole data-set, an oversampling process can be carried out before training each classifier (OverBagging). This procedure can be developed in at least two ways. Oversampling consists of increasing the number of minority class instances by their replication; all majority class instances can be included in the new bootstrap, but another option is to resample them, trying to increase the diversity. Note that in OverBagging all instances will probably take part in at least one bag, but each bootstrapped replica will contain many more instances than the original data-set.

On the other hand, another manner to oversample minority class instances is the usage of the SMOTE preprocessing algorithm, which leads to SMOTEBagging [55], described earlier.

Fig. 3. Proposed taxonomy for ensembles to address the class imbalance problem.

In AdaCost, being C_i the cost of misclassifying the ith example, the authors provide their recommended function as β_+ = −0.5 C_i + 0.5 and β_− = 0.5 C_i + 0.5. The weighting function and the computation of α_t are replaced by the following formulas:

D_{t+1}(i) = D_t(i) · e^{−α_t y_i h_t(x_i) β_{sign(h_t(x_i), y_i)}}    (3)

α_t = (1/2) · ln( (1 + Σ_i D_t(i) · e^{−α_t y_i h_t(x_i) β_{sign(h_t(x_i), y_i)}}) / (1 − Σ_i D_t(i) · e^{−α_t y_i h_t(x_i) β_{sign(h_t(x_i), y_i)}}) )    (4)

2) CSB: Neither CSB1 nor CSB2 uses an adjustment function. Moreover, these approaches only consider the costs in the weight update formula; that is, neither of them changes the computation of α_t: CSB1 because it does not use α_t anymore (α_t = 1), and CSB2 because it uses the same α_t computed by AdaBoost. In these cases, the weight update is replaced by

D_{t+1}(i) = D_t(i) · C_{sign(h_t(x_i), y_i)} · e^{−α_t y_i h_t(x_i)}    (5)

where C_+ = 1 and C_− = C_i ≥ 1 are the costs of misclassifying a positive and a negative example, respectively.

3) RareBoost: This modification of AdaBoost tries to tackle the class imbalance problem by simply changing the computation of α_t (Algorithm 3, line 9), making use of the confusion matrix in each iteration. Moreover, it computes two different α_t values in each iteration. This way, false positives (FP_t is the weights' sum of FP in the tth iteration) are scaled in proportion to how well they are distinguished from true positives (TP_t), whereas false negatives (FN_t) are scaled in proportion to how well they are distinguished from true negatives (TN_t). On the one hand, α_t^p = TP_t / FP_t is computed for examples predicted as positives. On the other hand, α_t^n = TN_t / FN_t is computed for the ones predicted as negatives. Finally, the weight update is done separately by the usage of both factors, depending on the predicted class of each instance.
Note that, although we have included RareBoost in the cost-sensitive boosting family, it does not directly make use of costs, which can be an advantage; however, it modifies the AdaBoost algorithm in a similar way to the approaches in this family, and because of this fact we have classified it into this group. This algorithm has a handicap, though: TP_t and TN_t are reduced, and FP_t and FN_t are increased, only if TP_t > FP_t and TN_t > FN_t, which is equivalent to requiring an accuracy of the positive class greater than 50%:

TP_t / (TP_t + FP_t) > 0.5.    (6)

This constraint is not trivial when dealing with the class imbalance problem; moreover, it is a strong condition. Without satisfying this condition, the algorithm will collapse. Therefore, we will not include it in our empirical study.

4) AdaC1: This algorithm is one of the three modifications of AdaBoost proposed in [50]. The authors proposed different ways in which the costs can be embedded into the weight update formula (Algorithm 3, line 10). They derive different computations of α_t depending on where they introduce the costs. In this case, the cost factors are introduced within the exponent part of the formula:

D_{t+1}(i) = D_t(i) · e^{−α_t C_i h_t(x_i) y_i}    (7)

where C_i ∈ [0, +∞). Hence, the computation of the classifiers' weight is done as follows:

α_t = (1/2) · ln( (1 + Σ_{i: y_i = h_t(x_i)} C_i D_t(i) − Σ_{i: y_i ≠ h_t(x_i)} C_i D_t(i)) / (1 − Σ_{i: y_i = h_t(x_i)} C_i D_t(i) + Σ_{i: y_i ≠ h_t(x_i)} C_i D_t(i)) ).    (8)

Note that AdaCost is a variation of AdaC1 where there is a cost adjustment function instead of a cost item inside the exponent. However, in the case of AdaCost, it does not reduce to the AdaBoost algorithm when both classes are equally weighted (contrary to AdaC1).

5) AdaC2: Like AdaC1, AdaC2 integrates the costs in the weight update formula, but the procedure is different: the costs are introduced outside the exponent part, as shown in (9) and (10) above.

B. Addressing the Class Imbalance Problem With Classifier Ensembles

As we have stated, in recent years ensembles of classifiers have arisen as a possible solution to the class imbalance problem, attracting great interest among researchers [45], [47], [50], [62]. In this section, our aim is to review the application of ensemble learning methods to deal with this problem, as well as to present a taxonomy where these techniques can be categorized. Furthermore, we have selected several significant approaches from each family of our taxonomy for the exhaustive experimental study that we will carry out in Section V.

To start with the description of the taxonomy, we show our proposal in Fig. 3, where we categorize the different approaches. Mainly, we distinguish four different families among ensemble approaches for imbalanced learning. On the one hand, there are cost-sensitive boosting approaches, which are similar to cost-sensitive methods, but where the cost minimization is guided by the boosting algorithm. On the other hand, we distinguish three more families that have a characteristic in common: all of them consist of embedding a data preprocessing technique in an ensemble learning algorithm. We categorize these three families depending on the ensemble learning algorithm they use. Therefore, we consider boosting- and bagging-based ensembles; the last family is formed by hybrid ensembles, that is, ensemble methods that, apart from combining an ensemble learning algorithm and a preprocessing technique, make use of both boosting and bagging, one inside the other, together with a preprocessing technique.

Next, we look over these families, reviewing the existing works and focusing on the most significant proposals that we use in the experimental analysis.
1) Cost-sensitive Boosting: AdaBoost is an accuracy-oriented algorithm: when the class distribution is uneven, this strategy biases the learning (the weights) toward the majority class, since it contributes more to the overall accuracy. For this reason, there have been different proposals that modify the weight update of AdaBoost (Algorithm 3, line 10 and, as a consequence, line 9). In such a way, examples from different classes are not equally treated. To reach this unequal treatment, cost-sensitive approaches keep the general learning framework of AdaBoost, but at the same time introduce cost items into the weight update formula. These proposals usually differ in the way they modify the weight update rule; among this family, AdaCost [48], CSB1, CSB2 [49], RareBoost [97], AdaC1, AdaC2, and AdaC3 [50] are the most representative approaches.

1) AdaCost: In this algorithm, the weight update is modified by adding a cost adjustment function φ. This function, for an instance with a higher cost factor, increases its weight "more" if the instance is misclassified, but decreases its weight "less" otherwise, being C_i the cost of misclassifying the ith example.

AdaBoost and Bagging provide a way in which the classifiers are strategically generated to reach the needed diversity, by manipulating the training set before learning each classifier.

From this point, we briefly recall the Bagging (including the modification called pasting small votes with importance sampling) and Boosting (AdaBoost and its variants AdaBoost.M1 and AdaBoost.M2) ensemble learning algorithms, which have then been integrated with the previously explained preprocessing techniques in order to deal with the class imbalance problem.

1) Bagging: Breiman [42] introduced the concept of bootstrap aggregating to construct ensembles. It consists of training different classifiers with bootstrapped replicas of the original training data-set. That is, a new data-set is formed to train each classifier by randomly drawing (with replacement) instances from the original data-set (usually, maintaining the original data-set size). Hence, diversity is obtained with the resampling procedure by the usage of different data subsets. Finally, when an unknown instance is presented to each individual classifier, a majority or weighted vote is used to infer the class. Algorithm 1 shows the pseudocode for Bagging.

Pasting small votes is a variation of Bagging originally designed for large data-sets [93]. Large data-sets are partitioned into smaller subsets, which are used to train different classifiers. There exist two variants: Rvotes, which creates the data subsets at random, and Ivotes, which creates consecutive data-sets based on the importance of the instances; important instances are those that improve diversity. The way used to create the data-sets consists of the usage of a balanced distribution of easy and difficult instances. Difficult instances are detected by out-of-bag classifiers [42]; that is, an instance is considered difficult when it is misclassified by the ensemble classifier formed of those classifiers which did not use the instance for training. These difficult instances are always added to the next data subset, whereas easy instances have a low chance of being included. We show the pseudocode for Ivotes in Algorithm 2.
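Since Algorithm 1 is not reproduced here, the following minimal sketch (an illustration under our own choices of base learner and label encoding, not the paper's pseudocode) shows the two essential steps of Bagging: bootstrap training and majority voting, with labels in {−1, +1}.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_classifiers=10, seed=0):
        # Train each classifier on a bootstrap replica of the original
        # data-set (drawn with replacement, original size kept).
        rng = np.random.default_rng(seed)
        n = len(y)
        models = []
        for _ in range(n_classifiers):
            idx = rng.integers(0, n, size=n)     # bootstrap replica
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        # Aggregate the individual decisions by majority vote
        # (a tie yields 0 here; break ties as desired).
        votes = np.stack([m.predict(X) for m in models])
        return np.sign(votes.sum(axis=0))

For the imbalance-oriented variants reviewed above, the bootstrap draw (Step 2 of Algorithm 1) is precisely the place where over- or undersampling is plugged in.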
2) Boosting: Boosting (also known as ARCing, adaptive resampling and combining) was introduced by Schapire in 1990 [40]. Schapire proved that a weak learner (which is slightly better than random guessing) can be turned into a strong learner in the sense of the probably approximately correct (PAC) learning framework. AdaBoost [41] is the most representative algorithm in this family; it was the first applicable approach of Boosting, and it has been named one of the top ten data mining algorithms [94]. AdaBoost is known to reduce bias (besides variance) [85] and, similarly to support vector machines (SVMs), it boosts the margins [95]. AdaBoost uses the whole data-set to train each classifier serially, but after each round it gives more focus to difficult instances, with the goal of correctly classifying in the next iteration the examples that were incorrectly classified during the current iteration. Hence, it gives more focus to examples that are harder to classify; the quantity of focus is measured by a weight, which initially is equal for all instances. After each iteration, the weights of misclassified instances are increased; on the contrary, the weights of correctly classified instances are decreased. Furthermore, another weight is assigned to each individual classifier depending on its overall accuracy, which is then used in the test phase; more confidence is given to more accurate classifiers. Finally, when a new instance is submitted, each classifier gives a weighted vote, and the class label is selected by majority.

In this work, we will use the original two-class AdaBoost (Algorithm 3) and two of its very well-known modifications [41], [96] that have been employed in imbalanced domains: AdaBoost.M1 and AdaBoost.M2. The former is the first extension to multiclass classification, with a different weight changing mechanism (Algorithm 4); the latter is the second extension to multiclass, in this case making use of the base classifiers' confidence rates (Algorithm 5). Note that neither of these algorithms by itself deals with the imbalance problem directly; both have to be changed or combined with another technique, since they focus their attention on difficult examples without differentiating their class. In an imbalanced data-set, majority class examples contribute more to the accuracy (they are more probably the difficult examples); hence, rather than trying to improve the true positives, it is easier to improve the true negatives, also increasing the false negatives, which is not a desired characteristic.

Several authors agree that random oversampling can increase the likelihood of overfitting, since it makes exact copies of existing instances.

3) Synthetic minority oversampling technique (SMOTE) [21]: It is an oversampling method whose main idea is to create new minority class examples by interpolating several minority class instances that lie together. SMOTE creates instances by randomly selecting one (or more, depending on the oversampling ratio) of the k nearest neighbors (kNN) of a minority class instance and generating the new instance values from a random interpolation of both instances. Thus, the overfitting problem is avoided and the decision boundaries for the minority class are spread further into the majority class space.

4) Modified synthetic minority oversampling technique (MSMOTE) [82]: It is a modified version of SMOTE. This algorithm divides the instances of the minority class into three groups (safe, border, and latent noise instances) by the calculation of the distances among all examples. When MSMOTE generates new examples, the strategy for selecting the nearest neighbors is changed with respect to SMOTE, depending on the group previously assigned to the instance. For safe instances, the algorithm randomly selects a data point from the kNN (the same way as SMOTE); for border instances, it only selects the nearest neighbor; finally, for latent noise instances, it does nothing.

5) Selective preprocessing of imbalanced data (SPIDER) [52]: It combines local oversampling of the minority class with filtering of difficult examples from the majority class. It consists of two phases, identification and preprocessing. The first one identifies which instances are flagged as noisy (misclassified) by kNN. The second phase depends on the option established (weak, relabel, or strong): when the weak option is set, it amplifies minority class instances; for relabel, it amplifies minority class examples and relabels majority class instances (i.e., changes their class label); finally, using the strong option, it strongly amplifies minority class instances. After carrying out these operations, the remaining noisy examples from the majority class are removed from the data-set.
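The interpolation step of SMOTE can be sketched in a few lines. This is a minimal illustration (real implementations also handle nominal attributes, the oversampling ratio, and ties), where X_min holds only the minority class instances:

    import numpy as np

    def smote(X_min, n_new, k=5, seed=0):
        # For each synthetic example: pick a random minority instance,
        # one of its k nearest minority neighbors, and interpolate at a
        # random point on the segment joining them.
        rng = np.random.default_rng(seed)
        d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)                # exclude self-distance
        nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbors
        base = rng.integers(0, len(X_min), size=n_new)
        neigh = nn[base, rng.integers(0, k, size=n_new)]
        gap = rng.random((n_new, 1))               # interpolation factor
        return X_min[base] + gap * (X_min[neigh] - X_min[base])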
III. STATE OF THE ART ON ENSEMBLE TECHNIQUES FOR IMBALANCED DATA-SETS

In this section, we propose a new taxonomy for ensemble-based techniques to deal with imbalanced data-sets, and we review the state of the art on these solutions. With this aim, we start by recalling several classical learning algorithms for constructing sets of classifiers whose members properly complement each other, and then we get on with the ensemble-based solutions that address the class imbalance problem.

A. Learning Ensembles of Classifiers: Description and Representative Techniques

The main objective of ensemble methodology is to try to improve the performance of single classifiers by inducing several classifiers and combining them to obtain a new classifier that outperforms every one of them. Hence, the basic idea is to construct several classifiers from the original data and then aggregate their predictions when unknown instances are presented. This idea follows the natural human behavior of seeking several opinions before making any important decision. The main motivation for the combination of classifiers in redundant ensembles is to improve their generalization ability: each classifier is known to make errors, but since they are different (e.g., they have been trained on different data-sets or they have different behaviors over different parts of the input space), misclassified examples are not necessarily the same [83]. Ensemble-based classifiers usually refer to combinations of classifiers that are minor variants of the same base classifier, which can be categorized in the broader concept of multiple classifier systems [25], [31], [32]. In this paper, we focus only on ensembles whose classifiers are constructed by manipulating the original data.

In the literature, the need for diverse classifiers to compose an ensemble is studied in terms of the statistical concepts of the bias-variance decomposition [33], [84] and the related ambiguity decomposition [34]. The bias can be characterized as a measure of the classifier's ability to generalize correctly to a test set, whereas the variance can be similarly characterized as a measure of the extent to which the classifier's prediction is sensitive to the data on which it was trained. Hence, variance is associated with overfitting; the performance improvement in ensembles is usually due to a reduction in variance, because the usual effect of ensemble averaging is to reduce the variance of a set of classifiers (some ensemble learning algorithms are also known to reduce bias [85]). On the other hand, the ambiguity decomposition shows that taking the combination of several predictors is better, on average over several patterns, than a method selecting one of the predictors at random. Anyway, these concepts are clearly stated in regression problems, where the output is real-valued and the mean squared error is used as the loss function. However, in the context of classification, those terms are still ill-defined [35], [38], since different authors provide different assumptions [86]–[90] and there is no agreement on their definition for generalized loss functions [91].
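The variance-reduction effect of averaging is easy to verify numerically: for predictors whose errors are independent with equal variance, averaging k of them divides the variance by k. A small synthetic demonstration (illustrative numbers only, not results from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    # 10 predictors of a target value 0.0, each with independent
    # unit-variance errors, evaluated over 100000 trials.
    preds = rng.normal(size=(100000, 10))
    print(preds[:, 0].var())         # ~1.0: a single predictor
    print(preds.mean(axis=1).var())  # ~0.1: the average of 10 predictors

With correlated (i.e., insufficiently diverse) predictors the reduction is smaller, which is one intuition behind the diversity requirement discussed next.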
Nevertheless, despite not being theoretically well defined, diversity among classifiers is crucial (though alone it is not enough) to form an ensemble, as shown by several authors [36]–[38]. Note also that the relation between measured diversity and accuracy has not been demonstrated [43], [92], but this is probably due to the measures of diversity rather than to the nonexistence of that relation. There are different ways to reach the required diversity, that is, different ensemble learning mechanisms. An important point is that the base classifiers should be weak learners; a classifier learning algorithm is said to be weak when low changes in the data produce big changes in the induced model; this is why the most commonly used base classifiers are tree induction algorithms. Considering a weak learning algorithm, different techniques can be used to construct an ensemble. The most widely used ensemble learning algorithms are AdaBoost [41] and Bagging [42], whose applications in several classification problems have led to significant improvements [27].

Fig. 2. Example of an ROC plot. Two classifiers' curves are depicted: the dashed line represents a random classifier, whereas the solid line is a classifier which is better than the random classifier.

The area under the ROC curve (AUC) [67] corresponds to the probability of correctly identifying which one of two stimuli is noise and which one is signal plus noise. The AUC provides a single measure of a classifier's performance for evaluating which model is better on average. Fig. 2 shows how to build the ROC space, plotting on a two-dimensional chart the TP_rate (Y-axis) against the FP_rate (X-axis). The points (0, 0) and (1, 1) are trivial classifiers for which the predicted class is always the negative and the positive one, respectively. On the contrary, the point (0, 1) represents perfect classification. The AUC measure is computed just by obtaining the area of the graphic:

AUC = (1 + TP_rate − FP_rate) / 2.    (2)

C. Dealing With the Class Imbalance Problem

On account of the importance of the imbalanced data-sets problem, a large number of techniques have been developed to address it. As stated in the introduction, these approaches can be categorized into three groups, depending on how they deal with the problem.

1) Algorithm level approaches (also called internal) try to adapt existing classifier learning algorithms to bias the learning toward the minority class [76]–[78]. These methods require special knowledge of both the corresponding classifier and the application domain, comprehending why the classifier fails when the class distribution is uneven.

2) Data level (or external) approaches rebalance the class distribution by resampling the data space [20], [52], [53], [79]. This way, they avoid the modification of the learning algorithm by trying to decrease the effect caused by imbalance with a preprocessing step. Therefore, they are independent of the classifier used, and for this reason they are usually more versatile.

3) The cost-sensitive learning framework falls between data and algorithm level approaches. It incorporates both data level transformations (by adding costs to instances) and algorithm level modifications (by modifying the learning process to accept costs) [23], [80], [81]. It biases the classifier toward the minority class under the assumption of higher misclassification costs for this class, seeking to minimize the total cost errors of both classes. The major drawback of these approaches is the need to define misclassification costs, which are not usually available in the data-sets.
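To make the cost-sensitive idea concrete, the following minimal sketch shows a decision rule that minimizes the expected cost under a hypothetical cost matrix (the 10:1 cost ratio is an assumption chosen for illustration, not a value from the paper):

    import numpy as np

    # Hypothetical cost matrix, indexed as [true class, predicted class];
    # missing a positive (minority) example is assumed ten times as
    # costly as the opposite error, and correct decisions cost nothing.
    COST = np.array([[0.0, 1.0],     # true negative: (TN, FP)
                     [10.0, 0.0]])   # true positive: (FN, TP)

    def min_cost_class(p_positive):
        # Predict the class with the lowest expected misclassification
        # cost, given an estimated probability of the positive class.
        p = np.array([1.0 - p_positive, p_positive])
        return int(np.argmin(p @ COST))  # 0 = negative, 1 = positive

With this matrix, the positive class is predicted whenever p_positive > 1/11, which moves the decision threshold in favor of the minority class.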
In this work, we study approaches that are based on ensemble techniques to deal with the class imbalance problem. Aside from those three categories, ensemble-based methods can be classified into a new category. These techniques usually consist of a combination between an ensemble learning algorithm and one of the techniques above, specifically, data level and cost-sensitive ones. With the addition of a data level approach to the ensemble learning algorithm, the new hybrid method usually preprocesses the data before training each classifier. On the other hand, cost-sensitive ensembles, instead of modifying the base classifier in order to accept costs in the learning process, guide the cost minimization via the ensemble learning algorithm. This way, the modification of the base learner is avoided, but the major drawback (i.e., the definition of costs) is still present.

D. Data Preprocessing Methods

As pointed out, preprocessing techniques can be easily embedded in ensemble learning algorithms. Hereafter, we recall several data preprocessing techniques that have been used together with ensembles, which we will analyze in the following sections.

In the specialized literature, we can find some papers about resampling techniques that study the effect of changing the class distribution in order to deal with imbalanced data-sets. There, it has been empirically proved that the application of a preprocessing step in order to balance the class distribution is usually a positive solution [20], [53]. The main advantage of these techniques, as previously pointed out, is that they are independent of the underlying classifier.

Resampling techniques can be categorized into three groups: undersampling methods, which create a subset of the original data-set by eliminating instances (usually majority class instances); oversampling methods, which create a superset of the original data-set by replicating some instances or creating new instances from existing ones; and, finally, hybrid methods, which combine both sampling approaches. Among these categories, there exist several different proposals; from this point, we only center our attention on those that have been used in combination with ensemble learning algorithms.

1) Random undersampling: It is a nonheuristic method that aims to balance the class distribution through the random elimination of majority class examples. Its major drawback is that it can discard potentially useful data, which could be important for the induction process.

2) Random oversampling: In the same way as random undersampling, it tries to balance the class distribution, but in this case by randomly replicating minority class instances.

Fig. 1. Example of difficulties in imbalanced data-sets. (a) Class overlapping. (b) Small disjuncts.

In such a way, the system that is generated by the learning algorithm is a mapping function that is defined over the patterns, A^i → C, and it is called a classifier.

A. The Problem of Imbalanced Data-sets

In classification, a data-set is said to be imbalanced when the number of instances that represent one class is smaller than the ones from the other classes. Furthermore, the class with the lowest number of instances is usually the class of interest from the point of view of the learning task [22]. This problem is of great interest because it turns up in many real-world classification problems, such as remote sensing [69], pollution detection [70], risk management [71], fraud detection [72], and, especially, medical diagnosis [13], [24], [73]–[75].
In these cases, standard classifier learning algorithms have a bias toward the classes with a greater number of instances, since the rules that correctly predict those instances are positively weighted in favor of the accuracy metric, whereas specific rules that predict examples from the minority class are usually ignored (treated as noise), because more general rules are preferred. In such a way, minority class instances are more often misclassified than those from the other classes. Anyway, a skewed data distribution does not hinder the learning task by itself [1], [2]; the issue is that a series of difficulties related to this problem usually turn up.

1) Small sample size: Generally, imbalanced data-sets do not have enough minority class examples. In [6], the authors reported that the error rate caused by an imbalanced class distribution decreases when the number of examples of the minority class is representative (fixing the ratio of imbalance). This way, patterns defined by positive instances can be better learned despite the uneven class distribution. However, this is usually unreachable in real-world problems.

2) Overlapping or class separability [see Fig. 1(a)]: When it occurs, discriminative rules are hard to induce. As a consequence, more general rules are induced that misclassify a low number of instances (minority class instances) [4]. If there were no overlapping between classes, any simple classifier could learn an appropriate classifier regardless of the class distribution.

3) Small disjuncts [see Fig. 1(b)]: The presence of small disjuncts in a data-set occurs when the concept represented by the minority class is formed of subconcepts [5]. Besides, small disjuncts are implicit in most of the problems. The existence of subconcepts also increases the complexity of the problem, because the amount of instances among them is not usually balanced.

TABLE I
CONFUSION MATRIX FOR A TWO-CLASS PROBLEM

                    Positive prediction      Negative prediction
  Positive class    True positive (TP)       False negative (FN)
  Negative class    False positive (FP)      True negative (TN)

In this paper, we focus on two-class imbalanced data-sets, where there is a positive (minority) class, with the lowest number of instances, and a negative (majority) class, with the highest number of instances. We also consider the imbalance ratio (IR) [54], defined as the number of negative class examples divided by the number of positive class examples, to organize the different data-sets.

B. Performance Evaluation in Imbalanced Domains

The evaluation criterion is a key factor both in the assessment of the classification performance and in the guidance of the classifier modeling. In a two-class problem, the confusion matrix (shown in Table I) records the results of correctly and incorrectly recognized examples of each class.

Traditionally, the accuracy rate (1) has been the most commonly used empirical measure. However, in the framework of imbalanced data-sets, accuracy is no longer a proper measure, since it does not distinguish between the numbers of correctly classified examples of different classes. Hence, it may lead to erroneous conclusions; e.g., a classifier that achieves an accuracy of 90% in a data-set with an IR value of 9 is not accurate if it classifies all examples as negatives.

Acc = (TP + TN) / (TP + FN + FP + TN).    (1)

For this reason, when working in imbalanced domains, there are more appropriate metrics to be considered instead of accuracy. Specifically, we can obtain four metrics from Table I to measure the classification performance of both the positive and negative classes independently.

1) True positive rate: TP_rate = TP / (TP + FN) is the percentage of positive instances correctly classified.
2) True negative rate: TN_rate = TN / (FP + TN) is the percentage of negative instances correctly classified.
3) False positive rate: FP_rate = FP / (FP + TN) is the percentage of negative instances misclassified.
4) False negative rate: FN_rate = FN / (TP + FN) is the percentage of positive instances misclassified.
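These quantities are straightforward to compute; the sketch below (positive class encoded as +1 and negative as −1, an encoding of our choosing) also evaluates the accuracy of (1) and the single-point AUC of (2), which makes the 90%-accuracy example above easy to reproduce.

    import numpy as np

    def imbalance_metrics(y_true, y_pred):
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == -1))
        fp = np.sum((y_true == -1) & (y_pred == 1))
        tn = np.sum((y_true == -1) & (y_pred == -1))
        tp_rate = tp / (tp + fn)                 # recall on the positives
        fp_rate = fp / (fp + tn)
        acc = (tp + tn) / (tp + fn + fp + tn)    # Eq. (1)
        auc = (1 + tp_rate - fp_rate) / 2        # Eq. (2), one classifier
        return tp_rate, fp_rate, acc, auc

For a data-set with IR = 9 and a classifier that labels everything as negative, this returns tp_rate = 0, fp_rate = 0, acc = 0.9, and auc = 0.5, exposing the uselessness that accuracy alone hides.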
Clearly, since classification intends to achieve good quality results for both classes, none of these measures alone is adequate by itself. One way to combine these measures and produce an evaluation criterion is to use the receiver operating characteristic (ROC) graphic [66]. This graphic allows the visualization of the trade-off between the benefits (TP_rate) and the costs (FP_rate); thus, it evidences that no classifier can increase the number of true positives without also increasing the false positives.

In the literature, the term "ensemble methods" usually refers to those collections of classifiers that are minor variants of the same classifier, whereas "multiple classifier systems" is a broader category that also includes those combinations that consider the hybridization of different models [31], [32], which are not covered in this paper. When forming ensembles, creating diverse classifiers (but maintaining their consistency with the training set) is a key factor in making them accurate. Diversity in ensembles has a thorough theoretical background in regression problems (where it is studied in terms of the bias-variance [33] and ambiguity [34] decompositions); however, in classification, the concept of diversity is still formally ill-defined [35]. Even so, diversity is necessary [36]–[38], and there exist several different ways to achieve it [39]. In this paper, we focus on data variation-based ensembles, which consist of the manipulation of the training examples in such a way that each classifier is trained with a different training set. AdaBoost [40], [41] and Bagging [42] are the most common ensemble learning algorithms among them, but there exist many variants and other different approaches [43].

Because of their accuracy-oriented design, ensemble learning algorithms that are directly applied to imbalanced data-sets do not solve by themselves the problem that underlies in the base classifier. However, their combination with other techniques to tackle the class imbalance problem has led to several proposals in the literature, with positive results. These hybrid approaches are in some sense algorithm level approaches (since they slightly modify the ensemble learning algorithm), but they do not need to change the base classifier, which is one of their advantages. The modification of the ensemble learning algorithm usually includes data level approaches to preprocess the data before learning each classifier [44]–[47]. However, other proposals consider the embedding of the cost-sensitive framework in the ensemble learning process [48]–[50].
Becauseofthesereasons,ouraimistoreviewthestateof theartonensembletechniquestoaddressatwo-classimbal- anceddata-setsproblemandtoproposeataxonomythatdenes ageneralframeworkwithineachalgorithmcanbeplaced.We considerdifferentfamiliesofalgorithmsdependingonwhich ensemblelearningalgorithmtheyarebased,andwhattypeof techniquestheyusedtodealwiththeimbalanceproblem.Over thistaxonomy,wecarryoutathoroughempiricalcomparison oftheperformanceofensembleapproacheswithatwofoldob- jective.Therstoneistoanalyzewhichoneoffersthebest behavioramongthem.Thesecondoneistoobservethesuit- abilityofincreasingclassiers’complexitywiththeuseofen- semblesinsteadoftheconsiderationofauniquestageofdata preprocessingandtrainingasingleclassier. Wehavedesignedtheexperimentalframeworkinsuchaway thatwecanextractwell-foundedconclusions.Weuseasetof 44two-classreal-worldproblems,whichsufferfromtheclass imbalanceproblem,fromtheKEELdata-setrepository[56], [57](http://www.keel.es/dataset.php).WeconsiderC4.5[58] asbaseclassierforourexperimentssinceithasbeenwidely usedinimbalanceddomains[20],[59]–[61];besides,mostof theproposalswearestudyingweretestedwithC4.5bytheir authors(e.g.,[45],[50],[62]).Weperformthecomparisonby thedevelopmentofahierarchicalanalysisofensemblemethods thatisdirectedbynonparametricstatisticaltestsassuggested intheliterature[63]–[65].Todoso,accordingtotheimbalance framework,weusetheareaundertheROCcurve(AUC)[66], [67]astheevaluationcriterion. Therestofthispaperisorganizedasfollows.InSectionII,we presenttheimbalanceddata-setsproblemthatdescribesseveral techniqueswhichhavebeencombinedwithensembles,anddis- cussingtheevaluationmetrics.InSectionIII,werecalldifferent ensemblelearningalgorithms,describeournewtaxonomy,and reviewthestateoftheartonensemble-basedtechniquesfor imbalanceddata-sets.Next,SectionIVintroducestheexperi- mentalframework,thatis,thealgorithmsthatareincludedin thestudywiththeircorrespondingparameters,thedata-sets,and thestatisticalteststhatweusealongtheexperimentalstudy.In SectionV,wecarryouttheexperimentalanalysisoverthemost signicantalgorithmsofthetaxonomy.Finally,inSectionVI, wemakeourconcludingremarks. II.I NTRODUCTIONTO C LASS I MBALANCE P ROBLEM IN C LASSIFICATION Inthissection,werstintroducetheproblemofimbalanced data-setsinclassication.Then,wepresenthowtoevaluate theperformanceoftheclassiersinimbalanceddomains.Fi- nally,werecallseveraltechniquestoaddresstheclassimbalance problem,specically,thedatalevelapproachesthathavebeen combinedwithensemblelearningalgorithmsinpreviousworks. Priortotheintroductionoftheproblemofclassimbalance, weshouldformallystatetheconceptofsupervisedclassica- tion[68].Inmachinelearning,theaimofclassicationistolearn asystemcapableofthepredictionoftheunknownoutputclassof apreviouslyunseeninstancewithagoodgeneralizationability. 
A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches

Mikel Galar, Alberto Fernández, Edurne Barrenechea, Humberto Bustince, Member, IEEE, and Francisco Herrera, Member, IEEE

Abstract—Classifier learning with data-sets that suffer from imbalanced class distributions is a challenging problem in the data mining community. This issue occurs when the number of examples that represent one class is much lower than that of the other classes. Its presence in many real-world applications has brought along a growth of attention from researchers. In machine learning, ensembles of classifiers are known to increase the accuracy of single classifiers by combining several of them, but neither of these learning techniques alone solves the class imbalance problem; to deal with this issue, the ensemble learning algorithms have to be designed specifically. In this paper, our aim is to review the state of the art on ensemble techniques in the framework of imbalanced data-sets, with focus on two-class problems. We propose a taxonomy for ensemble-based methods to address the class imbalance, where each proposal can be categorized depending on the inner ensemble methodology on which it is based. In addition, we develop a thorough empirical comparison by the consideration of the most significant published approaches within the families of the proposed taxonomy, to show whether any of them makes a difference. This comparison has shown the good behavior of the simplest approaches, which combine random undersampling techniques with bagging or boosting ensembles. In addition, the positive synergy between sampling techniques and bagging has stood out. Furthermore, our results show empirically that ensemble-based algorithms are worthwhile, since they outperform the mere use of preprocessing techniques before learning the classifier, therefore justifying the increase of complexity by means of a significant enhancement of the results.

Index Terms—Bagging, boosting, class distribution, classification, ensembles, imbalanced data-sets, multiple classifier systems.

Manuscript received January 12, 2011; revised April 28, 2011 and June 7, 2011; accepted June 23, 2011. This work was supported in part by the Spanish Ministry of Science and Technology under projects TIN2008-06681-C06-01 and TIN2010-15055. This paper was recommended by Associate Editor M. Last.

M. Galar, E. Barrenechea, and H. Bustince are with the Department of Automática y Computación, Universidad Pública de Navarra, 31006 Navarra, Spain (e-mail: mikel.galar@unavarra.es; edurne.barrenechea@unavarra.es).

A. Fernández is with the Department of Computer Science, University of Jaén, 23071 Jaén, Spain (e-mail: alberto.fernandez@ujaen.es).

F. Herrera is with the Department of Computer Science and Artificial Intelligence, University of Granada, 18071 Granada, Spain (e-mail: herrera@decsai.ugr.es).

Digital Object Identifier 10.1109/TSMCC.2011.2161285

I. INTRODUCTION

CLASS distribution, i.e., the proportion of instances belonging to each class in a data-set, plays a key role in classification. The imbalanced data-sets problem occurs when one class, usually the one that refers to the concept of interest (the positive or minority class), is underrepresented in the data-set; in other words, the number of negative (majority) instances outnumbers the amount of positive class instances. Anyway, neither uniform distributions nor skewed distributions have to imply additional difficulties to the classifier learning task by themselves [1]–[3].
However, data-sets with skewed class distributions usually tend to suffer from class overlapping, small sample size, or small disjuncts, which hinder classifier learning [4]–[7]. Furthermore, the evaluation criterion, which guides the learning procedure, can lead to ignoring minority class examples (treating them as noise), and hence the induced classifier might lose its classification ability in this scenario. As a usual example, let us consider a data-set whose imbalance ratio is 1:100 (i.e., for each example of the positive class, there are 100 negative class examples). A classifier that tries to maximize the accuracy of its classification rule may obtain an accuracy of 99% just by ignoring the positive examples and classifying all instances as negatives.

In recent years, the class imbalance problem has emerged as one of the challenges in the data mining community [8]. This situation is significant since it is present in many real-world classification problems. For instance, some applications known to suffer from this problem are fault diagnosis [9], [10], anomaly detection [11], [12], medical diagnosis [13], e-mail foldering [14], face recognition [15], or detection of oil spills [16], among others. On account of the importance of this issue, a large number of techniques have been developed trying to address the problem. These proposals can be categorized into three groups, which depend on how they deal with class imbalance. The algorithm level (internal) approaches create or modify existing algorithms to take into account the significance of positive examples [17]–[19]. Data level (external) techniques add a preprocessing step where the data distribution is rebalanced in order to decrease the effect of the skewed class distribution in the learning process [20]–[22]. Finally, cost-sensitive methods combine both algorithm and data level approaches to incorporate different misclassification costs for each class in the learning phase [23], [24].

In addition to these approaches, another group of techniques emerges when the use of ensembles of classifiers is considered. Ensembles [25], [26] are designed to increase the accuracy of a single classifier by training several different classifiers and combining their decisions to output a single class label. Ensemble methods are well known in machine learning, and their application range covers a large number of problems [27]–[30].