Ensembles [Position Paper]

Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
charu@us.ibm.com

Latest results on outlier ensembles are available at http://www.charuaggarwal.net/theory.pdf

ABSTRACT

Ensemble analysis is a widely used meta-algorithm for many data mining problems such as classification and clustering. Numerous ensemble-based algorithms have been proposed in the literature for these problems. Compared to the clustering and classification problems, ensemble analysis has been studied in a limited way in the outlier detection literature. In some cases, ensemble analysis techniques have been implicitly used by outlier analysis algorithms, but not formally recognized as ensembles.

1. INTRODUCTION

Most sub-topics (e.g., bagging, boosting, etc.) in the ensemble analysis area are very well formalized. This is far from true for outlier analysis, in which the work on ensemble analysis is rather patchy, sporadic, and not so well formalized. In many cases, useful meta-algorithms are buried deep inside the algorithm, and not formally recognized as ensembles. Perhaps one of the reasons why ensemble analysis has not been well explored in outlier analysis is that meta-algorithms require crisp evaluation criteria in order to show their relative merits over the base algorithm. Furthermore, evaluation criteria are often used in the intermediate steps of an ensemble algorithm (e.g., boosting or stacking), in order to make future decisions about the precise construction of the ensemble. Among all core data mining problems, outlier analysis is the hardest to evaluate (especially on real data sets) because of a combination of its small sample space and unsupervised nature. The small sample space issue refers to the fact that a given data set may contain only a small number of outliers, and therefore the correctness of an approach is often hard to quantify in a statistically robust way. This is also a problem for making robust decisions about future steps of the algorithm, without causing overfitting. The unsupervised nature of the problem refers to the fact that no ground truth is available in order to evaluate the quality of a component in the ensemble. This necessitates the construction of simpler ensembles with fewer qualitative decisions about the choice of the components in the ensemble. These factors have been a significant impediment in the development of effective
meta-algorithms. On the other hand, since the classification problem has the most crisply defined criteria for evaluation, it also has the richest meta-algorithm literature among all data mining problems. This is because the problem of model evaluation is closely related to quality-driven meta-algorithm development.

Nevertheless, a number of examples do exist in the literature for outlier ensembles. These cases show that when ensemble analysis is used properly, the potential for algorithmic improvement is significant. Ensemble analysis has been used particularly effectively in high-dimensional outlier detection [18; 24; 28; 30; 32; 33], in which multiple subspaces of the data are often explored in order to discover outliers. In fact, the earliest formal description [28] of outlier ensemble analysis finds its origins in high-dimensional outlier detection, though informal methods for ensemble analysis were proposed much earlier than this work. The high-dimensional scenario is an important one for ensemble analysis, because the outlier behavior of a data point in high-dimensional space is often described by a subset of dimensions, which is rather hard to discover in real settings. In fact, most methods for localizing the subsets of dimensions can be considered weak guesses at the true subsets of dimensions which are relevant for outlier analysis. The use of multiple models (corresponding to different subsets of dimensions) reduces the uncertainty arising from an inherently difficult subspace selection process, and provides greater robustness for the approach. The feature bagging work discussed in [28] may be considered a first description of outlier ensemble analysis in a real setting. However, as we will see in this article, numerous methods were proposed earlier which could be considered ensembles, but were never formally recognized as such in the literature. As noted in [18], even the first high-dimensional outlier detection approach [3] may be considered an ensemble method, though it was not formally presented as such in the original paper. It should also be pointed out that while high-dimensional data is an important case for ensemble analysis, the potential of ensemble analysis is much broader, and is likely to apply to any scenario in which outliers are defined from varying causes of rarity. Furthermore,
many types of ensembles, such as sequential ensembles, can be used in order to successively refine data-centric insights. This paper will discuss the different methods for outlier ensemble analysis in the literature. We will provide a classification of the different kinds of ensembles, and of the key parts of the algorithmic design of ensembles. The specific importance of different parts of algorithmic design will also be discussed. Ensemble algorithms can be categorized in two different ways:

- Categorization by Component Independence: Are the different components of the ensemble independent of one another, or do they depend on one another? To provide an analogy with the classification problem, boosting can be considered a case in which the different components of the ensemble are not independent of one another. This is because the execution of a specific component depends upon the results from previous executions. On the other hand, many forms of classification ensembles, such as bagging, are those in which the classification models are independent of one another.

- Categorization by Component Type: Each component of an ensemble can be defined on the basis of either data choice or model choice. The idea in the former is to carefully pick a subset of the data or of the data dimensions (e.g., boosting/bagging in classification), and in the latter to pick a specific algorithm (e.g., stacking or model ensembles). The categorization by component type is related to categorization by component independence, because data-centered ensembles are often sequential, whereas model-centered ensembles are often independent. However, this is not always the case. For example, independent data-centered bagging methods [28] are often used in outlier analysis.

It should be pointed out that the aforementioned categorizations of different kinds of ensembles are inherently incomplete, and it is impossible to fully describe every possibility. For example, it is possible for the different components to be heterogeneous, and defined on the basis of different aspects of the data and models [34]. However, such models are less frequent in the outlier analysis literature, because of the complexity of reasonably evaluating the importance of different ensemble components. A typical outlier ensemble contains a number of different components, which are used to construct the final result:

- Model Creation:
This is the individual methodology or algorithm which is used to create the corresponding component of the ensemble. In some cases, the methodology may be simply that of random subspace sampling.

- Normalization: Different methods may create outlier scores which are on very different scales. In some cases, the scores may be in ascending order, whereas in others, they may be in descending order. In such cases, normalization is important in being able to combine the scores meaningfully, so that the outlier scores from different components are roughly comparable.

- Model Combination: This refers to the final combination function, which is used in order to create the outlier score.

This paper is organized as follows. The next section will discuss the categorization of ensembles on the basis of component independence. Section 3 will discuss the categorization of ensembles on the basis of model type. Section 4 will study the role of the combination function in different kinds of ensemble analysis. Section 5 discusses meta-algorithms for other data mining problems in the literature, and whether such ideas can be adapted to the outlier analysis scenario. Section 6 contains the conclusions and summary.

2. CATEGORIZATION BY COMPONENT INDEPENDENCE

This categorization examines whether the components are developed independently, or whether they depend on one another. There are two primary kinds of ensembles which can be used in order to improve the quality of outlier detection algorithms:

- In sequential ensembles, a given algorithm or set of algorithms is applied sequentially, so that future applications of the algorithms are impacted by previous applications, in terms of either modifications of the base data for analysis or of the specific choices of the algorithms. The final result is either a weighted combination of, or the final result of the last application of, an outlier analysis algorithm. For example, in the context of the classification problem, boosting methods may be considered examples of sequential ensembles.

- In independent ensembles, different algorithms, or different instantiations of the same algorithm, are applied to either the complete data or portions of the data. The choices made about the data and algorithms are independent of the results obtained from these different algorithmic executions. The results from the different algorithm executions are combined
together in order to obtain more robust outliers.

In this section, both kinds of ensembles will be studied in detail.

2.1 Sequential Ensembles

In sequential ensembles, one or more outlier detection algorithms are applied sequentially to either all or portions of the data. The core principle of the approach is that each application of the algorithms provides a better understanding of the data, so as to enable a more refined execution with either a modified algorithm or data set. Thus, depending upon the approach, either the data set or the algorithm may be changed in sequential executions. If desired, this approach can either be applied for a fixed number of times, or be used in order to converge to a more robust solution. The broad framework of a sequential-ensemble algorithm is provided in Figure 1.

Algorithm SequentialEnsemble(Data Set: D, Base Algorithms: A1 ... Ar)
begin
  j = 1;
  repeat
    Pick an algorithm Aj based on results from past executions;
    Create a new data set fj(D) from D based on results from past executions;
    Apply Aj to fj(D);
    j = j + 1;
  until (termination);
  report outliers based on combinations of results from previous executions;
end

Figure 1: Sequential Ensemble Framework

In each iteration, a successively refined algorithm may be used on refined data, based on the results from previous executions. The function fj(D) is used to create a refinement of the data, which could correspond to data subset selection, attribute-subset selection, or generic data transformation methods. The description above is provided in a very general form, and many special cases can be instantiated from this framework. For example, in practice, only a single algorithm may be used on successive modifications of the data, as the data is refined over time. Furthermore, the sequential ensemble may be applied for only a small constant number of passes, rather than with the generic convergence-based approach presented above. The broad principle of sequential ensembles is that a greater knowledge of the data with successive algorithmic executions helps focus on the techniques and the portions of the data which can provide fresh insights.

Sequential ensembles have not been sufficiently explored in the outlier analysis literature as general-purpose meta-algorithms. However, many specific techniques in the outlier literature use methods which can be recognized as special cases of sequential ensembles.
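As a concrete illustration, the loop of Figure 1 can be sketched in a few lines of Python. This is a toy sketch, not an algorithm from the literature: the base detector (distance from the median) and the refinement function fj (dropping the top-scoring fraction of points before the next pass) are illustrative assumptions.

```python
import statistics

def distance_score(points):
    """Toy base detector: each point's score is its distance from the median."""
    med = statistics.median(points)
    return {x: abs(x - med) for x in points}

def sequential_ensemble(points, base_detector, n_rounds=3, trim_fraction=0.2):
    """Sketch of the Figure 1 loop. Each round applies the detector to the
    current (refined) data; the refinement function fj used here simply
    drops the top-scoring points before the next round."""
    current = list(points)
    scores = {}
    for _ in range(n_rounds):
        round_scores = base_detector(current)
        scores.update(round_scores)            # keep the latest score per point
        ranked = sorted(round_scores.values(), reverse=True)
        k = max(1, int(len(ranked) * trim_fraction))
        threshold = ranked[k - 1]
        current = [x for x in current if round_scores[x] < threshold]
        if len(current) < 3:                   # not enough data to refine further
            break
    return scores
```

On a small sample with one gross outlier, the successive refinements leave the final scores dominated by the genuinely deviant point.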
A classic example of this is the use of two-phase algorithms for building a model of the normal data. In the first phase, an outlier detection algorithm is used in order to remove the obvious outliers. In the second phase, a more robust model of the normal data is constructed after removing these obvious outliers. Thus, the outlier analysis in the second stage is much more refined and accurate. Such approaches are commonly used for cluster-based outlier analysis (for constructing more robust clusters in later stages) [6], or for more robust histogram construction and density estimation. However, most of these methods are presented in the outlier analysis literature as specific optimizations of particular algorithms, rather than as general meta-algorithms which can improve the effectiveness of an arbitrary outlier detection algorithm. There is significant scope for further research in the outlier analysis literature in recognizing these methods as general-purpose ensembles, and using them to improve the effectiveness of outlier detection. In these models, the goal of the sequential ensemble is data refinement. Therefore, the score returned by the last stage of the ensemble is the most relevant outlier score.
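A minimal sketch of such a two-phase scheme, assuming one-dimensional data and a simple Gaussian normal model; the crude first-phase screen and the function names are illustrative, not taken from [6]:

```python
import statistics

def two_phase_scores(points, trim=2):
    """Phase 1: a crude screen drops the `trim` points farthest from the median.
    Phase 2: a normal model (mean/std) is fit on the cleaned data only, and
    every original point is scored against that more robust model."""
    med = statistics.median(points)
    ranked = sorted(points, key=lambda x: abs(x - med), reverse=True)
    cleaned = ranked[trim:]                    # obvious outliers removed
    mu = statistics.mean(cleaned)
    sigma = statistics.stdev(cleaned)
    return {x: abs(x - mu) / sigma for x in points}   # z-score vs. clean model
```

Because the mean and standard deviation are estimated without the obvious outliers, the second-phase scores are not diluted by the outliers' own influence on the model.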
Algorithm IndependentEnsemble(Data Set: D, Base Algorithms: A1 ... Ar)
begin
  j = 1;
  repeat
    Pick an algorithm Aj;
    Create a new data set fj(D) from D;
    Apply Aj to fj(D);
    j = j + 1;
  until (termination);
  report outliers based on combinations of results from previous executions;
end

Figure 2: Independent Ensemble Framework

Another example of a sequential ensemble is proposed in [30], in which different subspaces of the data are recursively explored, on the basis of their discriminative behavior. A subspace is explored only if one of its predecessors (a predecessor is defined as a subspace with one dimension removed) is also sufficiently discriminative. Thus, this approach is sequential, since the construction of future models of the ensemble depends on the previous models. Here, the goal of the sequential ensemble is the discovery of other related subspaces which are also discriminative. Nevertheless, since the sequential approach is combined with enumerative exploration of different subspace extensions, the combination function in this case needs to include the scores from the different subspaces in order to create an outlier score. The work in [30] uses the product of the outlier scores of the discriminative subspaces as the final result. This is equivalent to using an aggregate on the logarithmic function of the outlier score.

2.2 Independent Ensembles

In independent ensembles, different instantiations of the algorithm, or different portions of the data, are used for outlier analysis. Alternatively, the same algorithm may be applied with a different initialization, parameter set, or even random seed in the case of a randomized algorithm. The results from these different algorithm executions can be combined in order to obtain a more robust outlier score. A general-purpose description of independent ensemble algorithms is provided in the pseudo-code description of Figure 2.

The broad principle of independent ensembles is that different ways of looking at the same problem provide more robust results which are not dependent on specific artifacts of a particular algorithm or data set. Independent ensembles have been explored much more widely and formally in the outlier analysis literature than sequential ensembles. Independent ensembles are particularly popular for outlier analysis in high-dimensional data sets, because they enable the exploration of different subspaces of the data in which different kinds of deviants may be found.
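The independent template of Figure 2 can be sketched as follows. Here the illustrative instantiations differ only in their random subsample of the data, the detector (a distance-from-median score) is a stand-in for any base method, and the component scores are averaged:

```python
import random
import statistics

def sample_score(sample, x):
    """Toy detector: distance of x from the median of a data sample."""
    return abs(x - statistics.median(sample))

def independent_ensemble(points, n_components=10, sample_frac=0.5, seed=0):
    """Each component draws its own random subsample of the data (an
    independent instantiation of the same algorithm), scores every point
    against it, and the per-component scores are averaged at the end."""
    rng = random.Random(seed)
    totals = {x: 0.0 for x in points}
    for _ in range(n_components):
        sample = rng.sample(points, max(2, int(len(points) * sample_frac)))
        for x in points:
            totals[x] += sample_score(sample, x)
    return {x: s / n_components for x, s in totals.items()}
```

Averaging over components whose subsamples were drawn independently damps the artifacts of any single unlucky sample.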
Examples exist both of picking different algorithms and of picking different data sets, in order to combine the results from the different executions. For example, the methods in [28; 29] sample subspaces from the underlying data in order to determine outliers from each of these executions independently. Then, the results from these different executions are combined in order to determine the outliers. The idea in these methods is that the results from different subsets of sampled features may be bagged in order to provide more robust results. Some of the recent methods for subspace outlier ranking and outlier evaluation can also be considered independent ensembles which combine the outliers discovered in different subspaces in order to provide more robust insights.

3. CATEGORIZATION BY CONSTITUENT COMPONENTS

In general, a particular component of the ensemble may use both a different model and a different subset or subspace of the data [34]. However, this is rarely done in practice. Typically, each component of the ensemble is defined either by a specific model, or by a specific part of the data. The former type of ensemble is referred to as model-centered, whereas the latter type is referred to as data-centered. Each of these specific types will be discussed in detail in this section.

3.1 Model-centered Ensembles

Model-centered ensembles attempt to combine the outlier scores from different models built on the same data set. The major challenge of this type of ensemble is that the scores from different models are often not directly comparable to one another. For example, the outlier score from a k-nearest neighbor approach is very different from the outlier score provided by a PCA-based detection model. This causes issues in combining the scores from these different outlier models. Therefore, it is critical to be able to convert the different outlier scores into normalized values which are directly comparable, and preferably also interpretable, such as a probability [17]. This issue will be discussed in the next section on defining combination functions for outlier analysis. Another key challenge is the specific definition of the combination function for outliers. Should we use model averaging, best fit or worst fit? This problem is of course not specific to model-centered ensembles.

A particular form of model-centered ensembles which is commonly used in outlier analysis, but
not formally recognized as such, is the use of the same model over different choices of the underlying model parameters, with the resulting scores then being combined. This is done quite frequently in many classical outlier analysis algorithms such as LOCI [35] and LOF [12]. However, since the approach is interpreted as a question of parameter tuning, it is not formally recognized as an ensemble. In reality, any systematic approach for parameter tuning which is dependent on the output scores, and which directly combines or uses the outputs of the different executions, should be interpreted as an ensemblar approach. This is the case with the LOF and LOCI methods. Specifically, the following ensemblar approach is used in the two methods:

- In the LOF method, the model is run over a range of values of k, which defines the neighborhood of the data points. The work in [12] examines the use of different combination functions, such as the minimum, average or maximum of the LOF values, as the outlier score. It is argued in [12] that the appropriate combination function is the maximum value, in order to prevent dilution of the outlier scores by inappropriate parameter choices in the model. In other words, the specific model which best enhances the outlier behavior of a data point is used.

- The LOCI method uses a multi-granularity approach, which uses a sampling neighborhood in order to determine the level of granularity at which to compute the outlier score. Different sampling neighborhoods are used, and a point is declared an outlier based on the neighborhood in which its outlier behavior is best enhanced. It is interesting to note that the LOCI method uses a very similar combination function to the LOF method, in terms of picking the component of the ensemble which enhances the outlier behavior.

It should be pointed out that when the different components of the ensemble create comparable scores (e.g., different runs of a particular algorithm such as LOF or LOCI), the combination process is greatly simplified, since the scores across different components are comparable. However, this is not the case when the different components create scores which are not directly comparable to one another. This issue will be discussed in a later section on defining combination functions.
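The parameter-ensemble interpretation can be illustrated with a simpler stand-in for LOF: a k-nearest-neighbor distance score, run over several values of k, with the per-point maximum retained (the combination argued for in [12]). This sketch assumes one-dimensional data, and the stand-in detector is an illustrative substitute, not LOF itself:

```python
def knn_distance_score(points, x, k):
    """Score of x = distance to its k-th nearest other point (assumes
    distinct values, so `y != x` excludes only x itself)."""
    dists = sorted(abs(x - y) for y in points if y != x)
    return dists[min(k, len(dists)) - 1]

def max_over_k(points, k_values=(1, 2, 3)):
    """Parameter ensemble in the style of LOF tuning: run the same detector
    for several neighborhood sizes k and keep, per point, the maximum
    score -- the parameter choice that best enhances its outlier behavior."""
    return {x: max(knn_distance_score(points, x, k) for k in k_values)
            for x in points}
```

Taking the per-point maximum means a point buried at one granularity can still surface at another, instead of having its score averaged away.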
3.2 Data-centered Ensembles

In data-centered ensembles, different parts, samples or functions of the data are explored in order to perform the analysis. It should be pointed out that a function of the data could include either a sample of the data (horizontal sample) or a relevant subspace (vertical sample). More general functions of the data are also possible, though these have rarely been explored in the literature. The core idea is that each part of the data provides a specific kind of insight, and by using an ensemble over different portions of the data, it is possible to obtain different insights.

One of the earliest data-centered ensembles was discussed in [28]. In this approach, random subspaces of the data are sampled, and the outliers are determined in these projected subspaces. The final outliers are declared as a combination function of the outliers from the different subspaces. This technique is also referred to as the feature bagging or subspace ensemble method. The core algorithm discussed in [28] is as follows:

Algorithm FeatureBagging(Data Set: D)
begin
  repeat
    Sample a subspace of between d/2 and d - 1 dimensions;
    Compute the LOF score of each point in the projected representation;
  until n iterations;
  Report combined scores from the different subspaces;
end

Two different methods are used for combining scores. The first uses the best rank of a data point in any projection in order to create the ordering, where a variety of methods can be used for tie-breaking. The second method averages the scores over the different executions. Another method, discussed in [17], converts the outlier scores into probabilities before performing the bagging. This normalizes the scores, and improves the quality of the final combination.

A number of techniques have also been proposed for statistical selection of relevant subspaces for ensemble analysis [24; 30]. The work in [30] determines the subspaces which are relevant to each data point. For the discriminative subspaces found by the method, the approach uses the product of (or the addition of the logarithms of) the outlier scores in the different discriminative subspaces. This can be viewed as a combination of model averaging and selection of the most discriminative subspaces, when the scores are scaled by the logarithmic function. The work in [24] is much closer to the feature bagging method of [28], except that statistical selection of relevant subspaces is used for the outlier analysis process. The final score is computed as the average of the scores over the different components of the ensemble.
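The feature-bagging loop can be sketched as below. To keep the sketch self-contained, the LOF scoring inside each subspace is replaced by an illustrative L1 distance-from-median detector, and only the score-averaging combination is shown (the best-rank variant is omitted):

```python
import random
import statistics

def subspace_scores(data, dims):
    """Stand-in detector for a projected subspace: L1 distance from the
    per-dimension median, restricted to the sampled dimensions."""
    med = {d: statistics.median(row[d] for row in data) for d in dims}
    return [sum(abs(row[d] - med[d]) for d in dims) for row in data]

def feature_bagging(data, n_rounds=10, seed=0):
    """Sketch of the loop of [28]: each round samples a random subspace of
    between d/2 and d-1 dimensions, scores every point in that projection,
    and the per-round scores are averaged into the final result."""
    d = len(data[0])
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    for _ in range(n_rounds):
        r = rng.randint(d // 2, d - 1)          # subspace dimensionality
        dims = rng.sample(range(d), r)          # the sampled dimensions
        for i, s in enumerate(subspace_scores(data, dims)):
            totals[i] += s
    return [t / n_rounds for t in totals]
```

A point that deviates in several subspaces accumulates a high average score even when any single projection would be a weak guess at the relevant dimensions.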
Recently, a method called OutRank [33] has been proposed, which can combine the results of multiple rankings based on the relationship of data points to their nearest subspace clusters. It has been shown that even traditional subspace clustering algorithms [4] can provide good results for outlier analysis when the ensemble method is used. Thus, the work in [33] has conclusively shown the power of ensemble analysis for high-dimensional data.

A different data-centered ensemble which is commonly used in the literature, but often not recognized as an ensemblar approach, is the use of initial phases of removing outliers from a data set, in order to create a more refined model for outlier analysis. An example of such an approach in the context of intrusion detection is discussed in [6]. In these cases, the combination function can simply be defined as the result from the very last step of the execution. This is because the data quality is improved significantly by the early components of the ensemble, and the results in the last phase reflect the outliers most accurately. Note that this is also a sequential ensemble with the specific goal of data refinement.

It should be pointed out that the distinction drawn in this section between model-centered and data-centered ensembles is a somewhat semantic one, since a data-centered ensemble can also be considered a specific type of model-centered ensemble. Nevertheless, this categorization is useful, because the exploration of different segments of the data requires inherently different kinds of techniques than the exploration of different data-independent models. The choices in picking different functions of the data for exploration require data-centric insights, which are analogous to classification methods such as boosting, especially in the sequential case. Therefore, we view this categorization as a convenient way to stimulate different lines of research on the topic.

3.3 Discussion of Categorization Schemes

The two different categorization schemes are clearly not exhaustive, though they represent a significant fraction of the ensemble methods used in the literature. In fact, these two categorization schemes can be combined in order to create four different possibilities. This is summarized in Table 1. We have also illustrated how many of the current ensemblar schemes map to these different
possibilities.

            Data Centered                              Model Centered
  Indep.    Feature Bagging [28]                       LOF Tuning [12]
            HiCS [24]                                  LOCI Tuning [35]
            Isolation Forest [29]                      Nguyen et al. [34]
            OutRank [33]                               Converting scores into
            Calibrated Bagging [17]                      probabilities [25]
  Seq.      Intrusion Bootstrap [6]                    Open
            OUTRES [30]

  Table 1: Categorization of Ensemble Techniques

Interestingly, we were unable to find an example of a sequential model-based ensemble in the literature, though it is possible that the results from the execution of a particular model can provide hints about future directions of model construction for an outlier analysis algorithm. Therefore, this combination has been classified as an open problem in our categorization, and would be an interesting avenue for future exploration. The work by Nguyen et al. [34] cannot be classified as either a data-centered or a model-centered scheme, since it uses aspects of both. Furthermore, the works in [17; 25] convert outlier scores into probabilities as a general pre-processing method for normalization, and are not dependent on whether the individual components are data-centered or model-centered. The issue of model combination is a critically tricky one, both in terms of how the individual scores are normalized, and in terms of how they are combined. This issue will be discussed in detail in the next section.

4. DEFINING COMBINATION FUNCTIONS

A crucial issue in outlier analysis is the definition of combination functions which can combine the outlier scores from different models. There are several challenges which arise in the combination process:

- Normalization Issues: The different models may output scores which are not easily comparable with one another. For example, a k-nearest neighbor detector may output a distance score, which is different from an LOF score, and the latter is also quite different from the MDEF score returned by the LOCI method. Even a feature bagging approach, which is defined with the use of the same base algorithm (LOF) on different feature subsets, may sometimes have calibration issues in the scores [17]. Therefore, if a combination function such as the average or the maximum is applied to the constituent scores, then one or more of the models may be inadvertently favored.

- Combination Issues: The second issue is the choice of the combination function. Given a set of normalized outlier scores, how do we decide the specific choice of the
combination function to be used? Should the minimum of the scores be used, the average of the scores, or the maximum of the scores? It turns out that the answer to this question may sometimes depend on the specific constituent components of the model, though some choices are more common than others in the literature.

In the following sections, we will discuss some of these issues in detail.

4.1 Normalization Issues

The major factor in normalization is that the different algorithms do not use the same scales of reference, and therefore their scores cannot be reasonably compared with one another. In fact, in some cases, high outlier scores may correspond to larger outlier tendency, whereas in other cases, low scores may correspond to greater outlier tendency. This causes problems during the combination process, since one or more components may be inadvertently favored. One simple approach for performing the normalization is to use the ranks from the different outlier analysis algorithms, from greatest outlier tendency to least outlier tendency. These ranks can then be combined in order to create a unified outlier score. One of the earliest methods for feature bagging [28] uses such an approach in one of its combination functions. The major issue with such an approach is that it loses a lot of information about the relative differences between the outlier scores. For example, consider the case where each component uses (some variation of) the LOF algorithm, and in one component the top three outlier scores are almost equivalent, whereas in another component the top outlier score is far larger than all of the others. A ranking approach will not distinguish between these scenarios, and will provide the points the same rank values. Clearly, this loss of information is not desirable for creating an effective combination from the different scores.

The previous example suggests that it is important to examine both the ordering of the values and the distribution of the values during the normalization process. Ideally, it is desirable to somehow convert the outlier scores into probabilities, so that they can be reasonably used in an effective way.
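The information loss from rank-based normalization is easy to see in code. Both normalizers below are generic sketches (larger score = more outlying is assumed, and min-max scaling stands in for any distribution-preserving normalization):

```python
def rank_normalize(scores):
    """Rank-based normalization: the highest score gets rank 1. This makes
    components comparable but discards the gaps between scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def unit_normalize(scores):
    """Min-max scaling to [0, 1]: keeps the relative gaps, so a clearly
    dominant outlier stays clearly dominant (assumes not all scores equal)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]
```

Two components with very different score distributions produce identical rank vectors, while the scaled scores still separate the "dominant top outlier" case from the "near tie" case.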
An approach was proposed in [17] which uses mixture modeling in conjunction with the EM framework in order to convert the scores into probabilities. Two methods are proposed in this work, both of which use parametric modeling. The first method assumes that the posterior probabilities follow a logistic sigmoid function, whose underlying parameters are learned with the EM framework from the distribution of the outlier scores. The second approach recognizes the fact that the outlier scores of data points in the outlier component of the mixture are likely to show a different distribution (a Gaussian distribution) than the scores of data points in the normal class (an exponential distribution). Therefore, this approach models the overall score distribution as a mixture of exponential and Gaussian probability functions. As before, the parameters are learned with the use of the EM framework, and the posterior probabilities are calculated with the use of the Bayes rule. This approach has been shown to be effective in improving the quality of the ensemble approach proposed in [28]. A second method has also been proposed recently [25], which improves upon this base method for converting the outlier scores into probabilities.
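A compact sketch of the second calibration idea (exponential scores for the normal class, Gaussian scores for the outlier class, parameters fit by EM, posteriors by the Bayes rule). The initialization heuristics and the fixed iteration count are arbitrary choices for illustration, not the exact procedure of [17]:

```python
import math

def exp_pdf(x, lam):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def norm_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def scores_to_probabilities(scores, n_iter=50):
    """EM fit of a two-component mixture: exponential for the (many) normal
    points, Gaussian for the (few) outliers. Returns P(outlier | score)."""
    pi_out = 0.1                                  # crude prior: ~10% outliers
    lam = 1.0 / (sum(scores) / len(scores))
    mu, sigma = max(scores), max(scores) / 4 + 1e-9
    for _ in range(n_iter):
        # E-step: posterior responsibility of the outlier component
        post = []
        for s in scores:
            p_out = pi_out * norm_pdf(s, mu, sigma)
            p_in = (1 - pi_out) * exp_pdf(s, lam)
            post.append(p_out / (p_out + p_in + 1e-300))
        # M-step: re-estimate the mixture parameters from responsibilities
        w = sum(post)
        pi_out = w / len(scores)
        mu = sum(p * s for p, s in zip(post, scores)) / (w + 1e-300)
        var = sum(p * (s - mu) ** 2 for p, s in zip(post, scores)) / (w + 1e-300)
        sigma = math.sqrt(var) + 1e-9
        lam = (len(scores) - w) / (sum((1 - p) * s for p, s in zip(post, scores)) + 1e-300)
    return post
```

On a score list with a cluster of small values and a couple of large ones, the posteriors for the large scores converge toward 1 and those of the small scores toward 0, giving directly comparable probabilities.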
4.2 Combining Scores from Different Models

The second issue is the choice of the function which is used in order to combine the scores. Given a set of (normalized) outlier scores Score_i(X) for the data point X, should we use the model average, maximum, or minimum? For ease of discussion in this section, we will assume the convention, without loss of generality, that greater outlier scores correspond to greater outlier tendency. Therefore, the maximum function picks the worst fit, whereas the minimum function picks the best fit.

The earliest work on ensemble-based outlier analysis (not recognized as ensemble analysis) was performed in the context of model parameter tuning [12; 35]. Most outlier analysis methods typically have a parameter which controls the granularity of the underlying model, and the outliers may often be visible to the algorithm only at a specific level of granularity. For example, the value of k in the nearest-neighbor or LOF approaches, the sampling neighborhood size in the LOCI approach, and the number of clusters in a clustering approach all control the granularity of the analysis. What is the optimal granularity to be used? While this is often viewed as an issue of parameter tuning, it can also be viewed as an issue of ensemble analysis, when addressed in a certain way. In particular, the methods in [12; 35] run the algorithms over a range of values of the granularity parameter, and pick the parameter choice which best enhances the outlier score (the maximum function, for our convention on score ordering) for a given data point. In other words, we have:

  Ensemble(X) = MAX_i Score_i(X)

The reason for this has been discussed in some detail in the original LOF paper. In particular, it has been suggested that the use of other combination functions, such as the average or the minimum, leads to a dilution of the outlier scores by the irrelevant models. This seems to be a reasonable choice, at least from an intuitive perspective. Some other common functions which are used in the literature are as follows:

- Maximum Function: This is one of the most common functions used for combining ensemblar scores, both in implicit (LOF and LOCI parameter tuning) and explicit ensemblar models. One variation on this model is to use the ranks instead of the scores in the combination process. Such an approach was also used in feature bagging [28]. An important aspect of the process is that the different data points need to have the same number of components in the ensemble in order to be compared meaningfully.

- Averaging Function: In this case, the model scores are averaged over the different components of the ensemble. The risk factor here is that if the individual components of the ensemble are poorly derived models, then the irrelevant scores from many different components will dilute the overall outlier score. Nevertheless, such an approach has been used extensively. Examples of methods which use it are one of the models in feature bagging [28], the HiCS
method [30], and a recent approach described in [25].

- Damped Averaging: In this model, a damping function is applied to the outlier scores before averaging, in order to prevent the result from being dominated by a few components. Examples of damping functions include the square root and the logarithm. It should be pointed out that the use of the product of the outlier scores (or geometric averaging) can be interpreted as averaging of the logarithms of the outlier scores.

- Pruned Averaging and Aggregates: In this case, the low scores are pruned, and the outlier scores are either averaged or aggregated (summed up) over the relevant ensemble components. The goal here is to prune the irrelevant models for each data point before computing the combination score. The pruning can be performed either by using an absolute threshold on the outlier score, or by picking the top models for each data point and averaging them. The risk factor in using absolute thresholds is the normalization issue which arises from different data points having different numbers of ensemblar components: both the average and aggregate scores can no longer be meaningfully compared across different data points. Aggregates are more appropriate than averages, since they implicitly count the number of ensemble components in which a data point is relevant. A data point will be relevant in a greater number of ensemble components when it has a greater tendency to be an outlier.

- Result from Last Component Executed: This approach is sometimes used in sequential ensembles [6], in which each component of the ensemble successively refines the data set and removes the obvious outliers. As a result, the normal model is constructed on a data set from which outliers have been removed, and the model is more robust. In such cases, the goal of each component of the sequential ensemble is to successively refine the data set. Therefore, the score from the last component is the most appropriate one to be used.
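The alternatives above can be collected into a single combination routine. This sketch assumes that all component scores are already normalized and that larger means more outlying; the specific damping function (log1p) and pruning depth (top 2) are arbitrary illustrations:

```python
import math

def combine(score_lists, method="max"):
    """Combine per-component score lists (one list per ensemble component,
    each giving one normalized score per data point) into a single outlier
    score per point, using the combination functions discussed above."""
    per_point = list(zip(*score_lists))   # each entry: one point's scores
    if method == "max":
        return [max(s) for s in per_point]
    if method == "avg":
        return [sum(s) / len(s) for s in per_point]
    if method == "damped_avg":            # log-damping before averaging
        return [sum(math.log1p(x) for x in s) / len(s) for s in per_point]
    if method == "pruned_agg":            # keep only the top-2 components
        return [sum(sorted(s, reverse=True)[:2]) for s in per_point]
    raise ValueError(method)
```

With three components in which the second point is flagged by every component and the third point by only one, the maximum treats the two points almost alike, while averaging separates the consensus outlier from the single-model one.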
Which combination function provides the best insights for ensemble analysis? Clearly, the combination function may depend on the structure of the ensemble in the general case, especially if the function of each component of the ensemble is either to refine the data set, or to understand the behavior of only a very local segment of the data set. However, for the general case, in which the function of each component of the ensemble is to provide a reasonable and comparable outlier score for each data point, the two most commonly used functions are the Maximum and the Averaging functions. While pruned averaging combines these aspects, it is rarely used in ensemble analysis. Which combination function is best? Are there any other combination functions which could conceivably provide better results? These are open questions, the answers to which are not completely known because of the sparse literature on outlier ensemble analysis. It is this author's personal opinion that the intuitive argument provided in the LOF paper [12] on using the maximum function for avoiding dilution from irrelevant models is the correct one in many scenarios. However, the issue is certainly not settled in the general case, and many variants such as pruned averaging may also provide robust results, while avoiding most of the irrelevant models.

5. POSSIBLE AVENUES OF EXPLORATION: LEARNING FROM OTHER DATA MINING PROBLEMS

The area of outlier ensemble analysis is still in its infancy, though it is rapidly emerging as an important area of research in its own right. Currently, the diversity of algorithms available for outlier ensemble analysis is limited, and is nowhere close to that of many other data mining problems such as clustering and classification. Therefore, it may be instructive to examine some of the key techniques for other data mining problems such as clustering or classification, and
In particular, one may ask whether it makes sense, or is even feasible, to design analogous methods for the outlier analysis problem. As discussed earlier, a major challenge for ensemble development in unsupervised problems is that the evaluation process is highly subjective, and therefore the quality of the intermediate results cannot be fully evaluated in many scenarios. One of the constraints is that the intermediate decisions must be made with the use of outlier scores only, rather than with the use of concrete evaluation criteria on hold-out sets (as in the case of the classification problem). Therefore, in this context, we believe that the major similarities and differences between supervised and unsupervised methods are as follows:

Intermediate Evaluation: In unsupervised methods, ground truth is typically not available. While one can use measures such as classification accuracy in supervised methods, this is not possible with unsupervised methods. Intermediate evaluation is particularly important for sequential methods. This is one of the reasons that sequential methods are much rarer than independent methods in outlier ensemble analysis.

Diversity and Consensus Issues: Both supervised and unsupervised methods seek greater diversity through the use of an ensemble, in terms of the methodology used for creating the model. In many cases, this is done by selecting models which are different from one another. For example, in clustering, diversity is achieved either by using randomized clustering or by explicitly picking orthogonal clusterings [2]. However, in the case of supervised methods, the level of consensus is also measured at the end in terms of the ground truth. This is not the case in unsupervised methods, since no ground truth is available.

Some of the properties of supervised learning (e.g., the presence of class labels) obviously cannot be transferred to outlier analysis. In other cases, analogous methods can be designed for the problem of outlier analysis. Below, we discuss some common methods used for different supervised and unsupervised problems, and whether they can be transferred to the problem of outlier analysis:

Boosting: Boosting [16] is a common technique used in classification. The idea is to focus on successively difficult portions of the data set in order to create models which can classify the data points in these portions more accurately, and then use the ensemble scores over all the components. A hold-out approach is used in order to determine the incorrectly classified instances for each portion of the data set. Such an approach clearly does not seem to be applicable to the unsupervised version of the problem, because of the difficulty in computing the accuracy of the model on different data points in the absence of ground truth. On the other hand, since the supervised version of the problem (rare class detection) is a skewed classification problem, the boosting approach is applicable almost directly. A number of learners [13; 23] have been proposed for the supervised version of the outlier analysis problem. These classifiers have been shown to achieve significantly superior results because of the use of boosting. However, it is unlikely that an analogue of this method can be developed for the unsupervised outlier detection problem.

Bagging: Bagging [9] is an approach which works with samples of the data, and combines the results from the different samples. The well-known feature bagging approach for outlier analysis [17; 28] performs this step in a different way, by bagging the features rather than the points. Nevertheless, the approach is also applicable to the use of samples of data points (rather than dimensions) in order to perform the prediction. The key challenge in this case may arise from different data points being a part of different numbers of ensemble components in the different cases. For many ensemble scores, such as the Maximum, this can cause inadvertent bias in favor of data points which are sampled a larger number of times. This bias cannot be corrected unless specific kinds of combination functions, such as the Average, are used.

Random Forests: Random forests [8] are a method which uses sets of decision trees on the training data, and computes the score as a function of these different components. While decision trees were not originally designed for the outlier analysis problem, it has been shown in [29] that the broad concept of decision trees can also be extended to outlier analysis by examining those paths with unusually short length, since the outlier regions tend to get isolated rather quickly. An ensemble of such trees is referred to as an isolation forest [29], and has been used effectively for making robust predictions about outliers.

Model Averaging and Combination: This is one of the most common models used in ensemble analysis, and is used for both the clustering and classification problems. In fact, the random forest method discussed above is a special case of this idea. In the context of the classification problem, many Bayesian methods [15] exist for the model combination process. Many of the recent models [25; 34] have focussed on creating a bucket of models from which the scores are combined through either averaging or using the maximum value. Even the parameter tuning methods used in many outlier analysis algorithms, such as LOF and LOCI, can be viewed as being drawn from this category. A related model is stacking [14; 38], in which the combination is performed in conjunction with model evaluation. This can sometimes be more difficult for unsupervised problems such as outlier analysis. Nevertheless, since stacking has been used for some unsupervised problems such as density estimation [37], it is possible that some of these techniques may be generalizable to outlier analysis, as long as an appropriate model for quantifying performance can be found.

Bucket of Models: In this approach [39], a "hold-out" portion of the data set is used in order to decide the most appropriate model. The most appropriate model is the one which achieves the highest accuracy on the held-out data set. In essence, this approach can be viewed as a competition or bake-off contest between the different models. While this is easy to perform in supervised problems such as classification, it is much more difficult for small-sample and unsupervised problems, since no ground truth is available for evaluation. It is unlikely that a precise analogue of this method can be created for outlier analysis, since exact ground truth is not available for the evaluation process.

To summarize, we create a table of the different methods and their characteristics, such as the type of ensemble, the combination technique, and whether normalization is present. This is provided in Table 2.

Table 2: Characteristics of Outlier Ensemble Methods

Method | Model- or Data-Centered | Sequential or Independent | Combination Function | Normalization
Tuning (LOF) [12] | Model | Independent | Max | Not Needed
Tuning (LOCI) [35] | Model | Independent | Max | Not Needed
Feature Bagging [28] | Data | Independent | Max/Avg | No
HiCS [24] | Data | Independent | Selective Avg | No
Calibrated Bagging [17] | Both | Independent | Max/Avg | Yes
OutRank [33] | Data | Independent | Harmonic Mean | No
Proclus [33] | Data | Independent | Harmonic Mean | No
Converting scores into probabilities [25] | Both | Independent | Max/Avg | Yes
Intrusion Bootstrap [6] | Data | Sequential | Last Component | Not Needed
OUTRES [30] | Data | Sequential | Product | No
Nguyen et al. [34] | Both | Independent | Weighted Avg. | No
Isolation Forest [29] | Model | Independent | Expon. Avg. | Yes

6. CONCLUSIONS AND DISCUSSION

This paper provides an overview of the emerging area of outlier ensemble analysis, which has seen increasing attention in the literature in recent years. Many ensemble analysis methods in the outlier analysis literature are not formally recognized as such. This paper provides an understanding of how these methods relate to other techniques used explicitly as ensembles in the literature. We provided different ways of categorizing the outlier analysis problems in the literature, such as independent or sequential ensembles, and data- or model-centered ensembles. We discussed the impact of different kinds of combination functions, and how these combination functions relate to different kinds of ensembles. The issue of choosing the right combination function is an important one, though it may depend upon the structure of the ensemble in the general case. We also provided a mapping of many current techniques in the literature to different kinds of ensembles. Finally, a discussion was provided on the feasibility of adapting ensemblar techniques from other data mining problems to outlier analysis.

The area of ensemble analysis is poorly developed in the context of the outlier detection problem, as compared to other data mining problems such as clustering and classification. The reason for this is rooted in the greater difficulty of judging the quality of a component of the ensemble, as compared to other data mining problems such as classification. Many models in other data mining problems, such as stacking and boosting, require a crisply defined judgement of the different ensemblar components on hold-out sets, which is not readily available in data mining problems such as outlier analysis. The outlier analysis problem suffers from the problem of a small sample space as
well as a lack of ground truth (as in all unsupervised problems). The lack of ground truth implies that it is necessary to use the intermediate outputs of the algorithm (rather than concrete quality measures on hold-out sets) for making the combination decisions and ensemblar choices. These intermediate outputs may sometimes represent poor estimations of the outlier scores. When combination decisions and ensemblar choices are made in an unsupervised way on an inherently small-sample-space problem such as outlier analysis, the likelihood and consequences of inappropriate choices can be high, as compared to another unsupervised problem such as clustering, which does not have the small-sample-space issue.

While outlier detection is a challenging problem for ensemble analysis, these problems are not insurmountable. It has become clear from the results of numerous recent ensemble methods that such methods can lead to significant qualitative improvements. Therefore, ensemble analysis seems to be an emerging area, which can be a fruitful research direction for improving the quality of outlier detection algorithms.

7. REFERENCES

[1] C. C. Aggarwal. Outlier Analysis, Springer, 2013.
[2] C. C. Aggarwal, C. Reddy. Data Clustering: Algorithms and Applications, CRC Press, 2013.
[3] C. C. Aggarwal and P. S. Yu. Outlier Detection in High Dimensional Data, ACM SIGMOD Conference, 2001.
[4] C. C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park. Fast Algorithms for Projected Clustering, ACM SIGMOD Conference, 1999.
[5] F. Angiulli, C. Pizzuti. Fast outlier detection in high dimensional spaces, PKDD Conference, 2002.
[6] D. Barbara, Y. Li, J. Couto, J.-L. Lin, and S. Jajodia. Bootstrapping a Data Mining Intrusion Detection System, ACM Symposium on Applied Computing, 2003.
[7] S. Bickel, T. Scheffer. Multi-view Clustering. ICDM Conference, 2004.
[8] L. Breiman. Random Forests. Machine Learning, 45(1), pp. 5-32, 2001.
[9] L. Breiman. Bagging Predictors. Machine Learning, 24(2), pp. 123-140, 1996.
[10] V. Chandola, A. Banerjee, V. Kumar. Anomaly Detection: A Survey, ACM Computing Surveys, 2009.
[11] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule, ACM KDD Conference, 2003.
[12] M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-based Local Outliers, ACM SIGMOD Conference, 2000.
[13] N. Chawla, A. Lazarevic, L. Hall, and K. Bowyer. SMOTEBoost: Improving prediction of the minority class in boosting, PKDD Conference, pp. 107-119, 2003.
[14] B. Clarke. Bayes Model Averaging and Stacking when Model Approximation Error cannot be Ignored, Journal of Machine Learning Research, pp. 683-712, 2003.
[15] P. Domingos. Bayesian Averaging of Classifiers and the Overfitting Problem. ICML Conference, 2000.
[16] Y. Freund, R. Schapire. A Decision-theoretic Generalization of Online Learning and Application to Boosting, Computational Learning Theory, 1995.
[17] J. Gao, P.-N. Tan. Converting output scores from outlier detection algorithms into probability estimates. ICDM Conference, 2006.
[18] Z. He, S. Deng and X. Xu. A Unified Subspace Outlier Ensemble Framework for Outlier Detection, Advances in Web Age Information Management, 2005.
[19] D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.
[20] A. Hinneburg, D. Keim, and M. Wawryniuk. HD-Eye: Visual mining of high-dimensional data. IEEE Computer Graphics and Applications, 19:22-31, 1999.
[21] W. Jin, A. Tung, and J. Han. Mining top-n local outliers in large databases, ACM KDD Conference, 2001.
[22] T. Johnson, I. Kwok, and R. Ng. Fast computation of 2-dimensional depth contours. ACM KDD Conference, 1998.
[23] M. Joshi, V. Kumar, and R. Agarwal. Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements. ICDM Conference, pp. 257-264, 2001.
[24] F. Keller, E. Muller, K. Bohm. HiCS: High-Contrast Subspaces for Density-based Outlier Ranking, ICDE Conference, 2012.
[25] H. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and Unifying Outlier Scores. SDM Conference, 2011.
[26] E. Knorr and R. Ng. Algorithms for Mining Distance-based Outliers in Large Datasets. VLDB Conference, 1998.
[27] E. Knorr and R. Ng. Finding Intensional Knowledge of Distance-Based Outliers. VLDB Conference, 1999.
[28] A. Lazarevic and V. Kumar. Feature Bagging for Outlier Detection, ACM KDD Conference, 2005.
[29] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation Forest. ICDM Conference, 2008.
[30] E. Muller, M. Schiffer, and T. Seidl. Statistical Selection of Relevant Subspace Projections for Outlier Ranking. ICDE Conference, pp. 434-445, 2011.
[31] E. Muller, S. Gunnemann, I. Farber, and T. Seidl. Discovering multiple clustering solutions: Grouping objects in different views of the data, ICDM Conference.
[32] E. Muller, S. Gunnemann, T. Seidl, and I. Farber. Tutorial: Discovering Multiple Clustering Solutions: Grouping Objects in Different Views of the Data. ICDE Conference, 2012.
[33] E. Muller, I. Assent, P. Iglesias, Y. Mulle, and K. Bohm. Outlier Ranking via Subspace Analysis in Multiple Views of the Data, ICDM Conference, 2012.
[34] H. Nguyen, H. Ang, and V. Gopalakrishnan. Mining ensembles of heterogeneous detectors on random subspaces, DASFAA, 2010.
[35] S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral, ICDE Conference, 2003.
[36] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient Algorithms for Mining Outliers from Large Data Sets. ACM SIGMOD Conference, pp. 427-438, 2000.
[37] P. Smyth and D. Wolpert. Linearly Combining Density Estimators via Stacking, Machine Learning Journal, 36, pp. 59-83, 1999.
[38] D. Wolpert. Stacked Generalization, Neural Networks, 5(2), pp. 241-259, 1992.
[39] B. Zenko. Is Combining Classifiers Better than Selecting the Best One? Machine Learning, pp. 255-273, 2004.

Latest results on outlier ensembles are available at http://www.charuaggarwal.net/theory.pdf