Outlier Ensembles [Position Paper]

Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
charu@us.ibm.com

(Latest results on outlier ensembles are available at http://www.charuaggarwal.net/theory.pdf.)

ABSTRACT

Ensemble analysis is a widely used meta-algorithm for many data mining problems such as classification and clustering. Numerous ensemble-based algorithms have been proposed in the literature for these problems. Compared to the clustering and classification problems, ensemble analysis has been studied in a limited way in the outlier detection literature. In some cases, ensemble analysis techniques have been im…

…sub-topics (e.g., bagging, boosting, etc.) in the ensemble analysis area are very well formalized. This is far from true for outlier analysis, in which the work on ensemble analysis is rather patchy, sporadic, and not so well formalized. In many cases, useful meta-algorithms are buried deep inside the algorithm, and are not formally recognized as ensembles. Perhaps one of the reasons why ensemble analysis has not been well explored in outlier analysis is that meta-algorithms require crisp evaluation criteria in order to show their relative merits over the base algorithm. Furthermore, evaluation criteria are often used in the intermediate steps of an ensemble algorithm (e.g., boosting or stacking), in order to make future decisions about the precise construction of the ensemble. Among all core data mining problems, outlier analysis is the hardest to evaluate (especially on real data sets) because of a combination of its small sample space and its unsupervised nature. The small sample space issue refers to the fact that a given data set may contain only a small number of outliers, and therefore the correctness of an approach is often hard to quantify in a statistically robust way. This is also a problem for making robust decisions about future steps of the algorithm without causing over-fitting. The unsupervised nature of the problem refers to the fact that no ground truth is available in order to evaluate the quality of a component in the ensemble. This necessitates the construction of simpler ensembles with fewer qualitative decisions about the choice of the components in the ensemble. These factors have been a significant impediment in the development of effective meta-algorithms.

On the other hand, since the classification problem has the most crisply defined criteria for evaluation, it also has the richest meta-algorithm literature among all data mining problems. This is because the problem of model evaluation is closely related to quality-driven meta-algorithm development. Nevertheless, a number of examples do exist in the literature for outlier ensembles. These cases show that when ensemble analysis is used properly, the potential for algorithmic improvement is significant. Ensemble analysis has been used particularly effectively in high-dimensional outlier detection [18; 24; 28; 30; 32; 33], in which multiple subspaces of the data are often explored in order to discover outliers. In fact, the earliest formal work [28] on outlier ensemble analysis finds its origins in high-dimensional outlier detection, though informal methods for ensemble analysis were proposed much earlier. The high-dimensional scenario is an important one for ensemble analysis, because the outlier behavior of a data point in high-dimensional space is often described by a subset of dimensions, which is rather hard to discover in real settings. In fact, most methods for localizing the relevant subsets of dimensions can be considered weak guesses at the true subsets of dimensions which are relevant for outlier analysis. The use of multiple models (corresponding to different subsets of dimensions) reduces the uncertainty arising from an inherently difficult subspace selection process, and provides greater robustness for the approach.

The feature bagging work discussed in [28] may be considered a first description of outlier ensemble analysis in a real setting. However, as we will see in this article, numerous methods proposed earlier could be considered ensembles, but were never formally recognized as such in the literature. As noted in [18], even the first high-dimensional outlier detection approach [3] may be considered an ensemble method, though it was not formally presented as one in the original paper. It should also be pointed out that while high-dimensional data is an important case for ensemble analysis, the potential of ensemble analysis is much broader, and is likely to apply to any scenario in which outliers are defined from varying causes of rarity. Furthermore, many types of ensembles, such as sequential ensembles, can be used in order to successively refine data-centric insights. This paper will discuss the different methods for outlier ensemble analysis in the literature. We will provide a classification of the different kinds of ensembles and the key parts of the algorithmic design of ensembles. The specific importance of the different parts of algorithmic design will also be discussed. Ensemble algorithms can be categorized in two different ways:

- Categorization by Component Independence: Are the different components of the ensemble independent of one another, or do they depend on one another? To provide an analogy with the classification problem, boosting can be considered a problem in which the different components of the ensemble are not independent of one another. This is because the execution of a specific component depends upon the results of previous executions. On the other hand, many forms of classification ensembles, such as bagging, are those in which the classification models are independent of one another.

- Categorization by Component Type: Each component of an ensemble can be defined on the basis of either data choice or model choice. The idea in the former is to carefully pick a subset of the data or the data dimensions (e.g., boosting/bagging in classification), and in the latter to pick a specific algorithm (e.g., stacking or model-centered ensembles). The categorization by component type is related to the categorization by component independence, because data-centered ensembles are often sequential, whereas model-centered ensembles are often independent. However, this is not always the case. For example, independent data-centered bagging methods [28] are often used in outlier analysis.

It should be pointed out that the aforementioned categorizations of the different kinds of ensembles are inherently incomplete, and it is impossible to fully describe every possibility. For example, it is possible for the different components to be heterogeneous, and defined on the basis of different aspects of the data and models [34]. However, such models are less frequent in the outlier analysis literature, because of the complexity of reasonably evaluating the importance of the different ensemble components. A typical outlier ensemble contains a number of different components, which are used to construct the final result:

- Model Creation: This is the individual methodology or algorithm which is used to create the corresponding component of the ensemble. In some cases, the methodology may be simply that of random subspace sampling.

- Normalization: Different methods may create outlier scores which are on very different scales. In some cases, the scores may be in ascending order, whereas in others they may be in descending order. In such cases, normalization is important in being able to combine the scores meaningfully, so that the outlier scores from the different components are roughly comparable.

- Model Combination: This refers to the final combination function, which is used in order to create the outlier score.

This paper is organized as follows. The next section will discuss the categorization of ensembles on the basis of component independence. Section 3 will discuss the categorization of ensembles on the basis of model type. Section 4 will study the role of the combination function in different kinds of ensemble analysis. Section 5 discusses meta-algorithms for other data mining problems in the literature, and whether such ideas can be adapted to the outlier analysis scenario. Section 6 contains the conclusions and summary.

2. CATEGORIZATION BY COMPONENT INDEPENDENCE

This categorization examines whether the components are developed independently, or whether they depend on one another. There are two primary kinds of ensembles which can be used in order to improve the quality of outlier detection algorithms:

- In sequential ensembles, a given algorithm or set of algorithms is applied sequentially, so that future applications of the algorithms are impacted by previous applications, in terms of either modifications of the base data for analysis or the specific choices of the algorithms. The final result is either a weighted combination of, or the final result of the last application of, an outlier analysis algorithm. For example, in the context of the classification problem, boosting methods may be considered examples of sequential ensembles.

- In independent ensembles, different algorithms, or different instantiations of the same algorithm, are applied to either the complete data or portions of the data. The choices made about the data and the algorithms applied are independent of the results obtained from these different algorithmic executions. The results from the different algorithm executions are combined together in order to obtain more robust outliers.

In this section, both kinds of ensembles will be studied in detail.

2.1 Sequential Ensembles

In sequential ensembles, one or more outlier detection algorithms are applied sequentially to either all or portions of the data. The core principle of the approach is that each application of the algorithms provides a better understanding of the data, so as to enable a more refined execution with either a modified algorithm or data set. Thus, depending upon the approach, either the data set or the algorithm may be changed in sequential executions. If desired, this approach can either be applied for a fixed number of times, or be used in order to converge to a more robust solution. The broad framework of a sequential-ensemble algorithm is provided in Figure 1.

  Algorithm SequentialEnsemble(Data Set: D, Base Algorithms: A1 ... Ar)
  begin
    j = 1;
    repeat
      Pick an algorithm Aj based on results from past executions;
      Create a new data set fj(D) from D based on results from past executions;
      Apply Aj to fj(D);
      j = j + 1;
    until(termination);
    report outliers based on combinations of results from previous executions;
  end

  Figure 1: Sequential Ensemble Framework

In each iteration, a successively refined algorithm may be used on refined data, based on the results from previous executions. The function fj(.) is used to create a refinement of the data, which could correspond to data subset selection, attribute-subset selection, or generic data transformation methods. The description above is provided in a very general form, and many special cases can be instantiated from this general framework. For example, in practice only a single algorithm may be used on successive modifications of the data, as the data is refined over time. Furthermore, the sequential ensemble may be applied for only a small constant number of passes, rather than the generic convergence-based approach presented above. The broad principle of sequential ensembles is that a greater knowledge of the data with successive algorithmic executions helps focus on the techniques and portions of the data which can provide fresh insights. Sequential ensembles have not been sufficiently explored in the outlier analysis literature as general-purpose meta-algorithms. However, many specific techniques in the outlier literature use methods which can be recognized as special cases of sequential ensembles. A classic example of this is the use of two-phase algorithms for building a model of the normal data. In the first phase, an outlier detection algorithm is used in order to remove the obvious outliers. In the second phase, a more robust normal model is constructed after removing these obvious outliers. Thus, the outlier analysis in the second stage is much more refined and accurate. Such approaches are commonly used for cluster-based outlier analysis (for constructing more robust clusters in later stages) [6], or for more robust histogram construction and density estimation. However, most of these methods are presented in the outlier analysis literature as specific optimizations of particular algorithms, rather than as general meta-algorithms which can improve the effectiveness of an arbitrary outlier detection algorithm. There is significant scope for further research in the outlier analysis literature in recognizing these methods as general-purpose ensembles, and using them to improve the effectiveness of outlier detection. In these models, the goal of the sequential ensemble is data refinement. Therefore, the score returned by the last stages of the ensemble is the most relevant outlier score.
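To make the two-phase idea concrete, here is a minimal sketch (not from the paper; the base detector, function names, and trim fraction are illustrative assumptions): the distance to the k-th nearest neighbor serves as the base outlier score, the top-scoring points are removed in the first phase, and every point is re-scored against the cleaned data in the second phase.

```python
import numpy as np

def knn_distance_scores(X, k=3):
    # Outlier score = distance to the k-th nearest neighbor; column 0 of
    # each sorted row is the point's zero self-distance.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    return np.sort(d, axis=1)[:, k]

def two_phase_sequential_ensemble(X, k=3, trim_fraction=0.1):
    # Phase 1: provisional scores on the full data; drop the obvious outliers.
    provisional = knn_distance_scores(X, k=k)
    n_trim = max(1, int(trim_fraction * len(X)))
    keep = np.argsort(provisional)[:-n_trim]
    clean = X[keep]
    # Phase 2: score every original point against the cleaned "normal"
    # data only.  (Retained points include their zero self-distance in
    # the sorted row, which is acceptable for a sketch.)
    d = np.sqrt(((X[:, None, :] - clean[None, :, :]) ** 2).sum(axis=-1))
    return np.sort(d, axis=1)[:, k]
```

On a tight cluster plus one far-away point, the far point receives by far the largest phase-two score, while the phase-one trimming keeps it from contaminating the normal model; the final (last-phase) score is the one reported, as discussed above.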
Another example of a sequential ensemble is proposed in [30], in which different subspaces of the data are recursively explored, on the basis of their discriminative behavior. A subspace is explored only if one of its predecessors (a predecessor being defined as a subspace with one dimension removed) is also sufficiently discriminative. Thus, this approach is sequential, since the construction of future models of the ensemble is dependent on the previous models. Here, the goal of the sequential ensemble is the discovery of other related subspaces which are also discriminative. Nevertheless, since the sequential approach is combined with an enumerative exploration of the different subspace extensions, the combination function in this case needs to include the scores from the different subspaces in order to create an outlier score. The work in [30] uses the product of the outlier scores of the discriminative subspaces as the final result. This is equivalent to using an aggregate of the logarithms of the outlier scores.

2.2 Independent Ensembles

In independent ensembles, different instantiations of the algorithm, or different portions of the data, are used for outlier analysis. Alternatively, the same algorithm may be applied with a different initialization, parameter set, or even random seed in the case of a randomized algorithm. The results from these different algorithm executions can be combined in order to obtain a more robust outlier score. A general-purpose description of independent ensemble algorithms is provided in the pseudo-code description of Figure 2.

  Algorithm IndependentEnsemble(Data Set: D, Base Algorithms: A1 ... Ar)
  begin
    j = 1;
    repeat
      Pick an algorithm Aj;
      Create a new data set fj(D) from D;
      Apply Aj to fj(D);
      j = j + 1;
    until(termination);
    report outliers based on combinations of results from previous executions;
  end

  Figure 2: Independent Ensemble Framework

The broad principle of independent ensembles is that different ways of looking at the same problem provide more robust results which are not dependent on specific artifacts of a particular algorithm or data set. Independent ensembles have been explored much more widely and formally in the outlier analysis literature than sequential ensembles. Independent ensembles are particularly popular for outlier analysis in high-dimensional data sets, because they enable the exploration of different subspaces of the data in which different kinds of deviants may be found. Examples exist both of picking different algorithms and of picking different data sets, in order to combine the results from the different executions. For example, the methods in [28; 29] sample subspaces from the underlying data in order to determine outliers from each of these executions independently. Then, the results from these different executions are combined in order to determine the outliers. The idea in these methods is that the results from different subsets of sampled features may be bagged in order to provide more robust results. Some of the recent methods for subspace outlier ranking and outlier evaluation can be considered independent ensembles which combine the outliers discovered in different subspaces in order to provide more robust insights.

3. CATEGORIZATION BY CONSTITUENT COMPONENTS

In general, a particular component of the model may use both a different model and a different subset or subspace of the data [34]. However, this is rarely done in practice. Typically, each component of the model is defined either as a specific model, or on a specific part of the data. The former type of ensemble is referred to as model-centered, whereas the latter type is referred to as data-centered. Each of these specific types will be discussed in detail in this section.

3.1 Model-centered Ensembles

Model-centered ensembles attempt to combine the outlier scores from different models built on the same data set. The major challenge here is that the scores from different models are often not directly comparable to one another. For example, the outlier score from a k-nearest neighbor approach is very different from the outlier score provided by a PCA-based detection model. This causes issues in combining the scores from these different outlier models. Therefore, it is critical to be able to convert the different outlier scores into normalized values which are directly comparable, and also preferably interpretable, such as a probability [17]. This issue will be discussed in the next section on defining combination functions for outlier analysis. Another key challenge is the specific definition of the combination function for outliers. Should we use model averaging, best fit, or worst fit? This problem is of course not specific to model-centered ensembles.

A particular form of model-centered ensemble which is commonly used in outlier analysis, but not formally recognized as an ensemble, is the use of the same model over different choices of the underlying model parameters, with the scores then combined. This is done quite frequently in many classical outlier analysis algorithms such as LOCI [35] and LOF [12]. However, since the approach is interpreted as a question of parameter tuning, it is not formally recognized as an ensemble. In reality, any systematic approach for parameter tuning which is dependent on the output scores, and which directly combines or uses the outputs of the different executions, should be interpreted as an ensemblar approach. This is the case with the LOF and LOCI methods. Specifically, the following ensemblar approach is used in the two methods:

- In the LOF method, the model is run over a range of values of k, which defines the neighborhood of the data points. The work in [12] examines the use of different combination functions, such as the minimum, average, or maximum of the LOF values, as the outlier score. It is argued in [12] that the appropriate combination function is to use the maximum value, in order to prevent dilution of the outlier scores by inappropriate parameter choices in the model. In other words, the specific model which best enhances the outlier behavior for a data point is used.

- The LOCI method uses a multi-granularity approach, which uses a sampling neighborhood in order to determine the level of granularity at which to compute the outlier score. Different sampling neighborhoods are used, and a point is declared an outlier based on the neighborhood in which its outlier behavior is enhanced. It is interesting to note that the LOCI method uses a very similar combination function to the LOF method, in terms of picking the component of the ensemble which enhances the outlier behavior.

It should be pointed out that when the different components of the ensemble create comparable scores (e.g., different runs of a particular algorithm such as LOF or LOCI), the combination process is greatly simplified, since the scores across the different components are comparable. However, this is not the case when the different components create scores which are not directly comparable to one another. This issue will be discussed in a later section on defining combination functions.

3.2 Data-centered Ensembles

In data-centered ensembles, different parts, samples, or functions of the data are explored in order to perform the analysis. It should be pointed out that a function of the data could include either a sample of the data (horizontal sample) or a relevant subspace (vertical sample). More general functions of the data are also possible, though these have rarely been explored in the literature. The core idea is that each part of the data provides a specific kind of insight, and by using an ensemble over different portions of the data, it is possible to obtain different insights. One of the earliest data-centered ensembles was discussed in [28]. In this approach, random subspaces of the data are sampled, and the outliers are determined in these projected subspaces. The final outliers are declared as a combination function of the outlier scores from the different subspaces. This technique is also referred to as the feature bagging or subspace ensemble method. The core algorithm discussed in [28] is as follows:

  Algorithm FeatureBagging(Data Set: D)
  begin
    repeat
      Sample a subspace with dimensionality between d/2 and d - 1;
      Compute the LOF score for each point in the projected representation;
    until(termination);
    Report combined scores from the different subspaces;
  end

Two different methods are used for combining the scores. The first uses the best rank of a data point in any projection in order to create the ordering. A variety of methods can be used for tie-breaking. The second method averages the scores over the different executions. Another method, discussed in [17], converts the outlier scores into probabilities before performing the bagging. This normalizes the scores, and improves the quality of the final combination.

A number of techniques have also been proposed for the statistical selection of relevant subspaces for ensemble analysis [24; 30]. The work in [30] determines subspaces which are relevant to each data point. The approach is designed in such a way that, for the discriminative subspaces found by the method, the product of (or the sum of the logarithms of) the outlier scores in the different discriminative subspaces is used. This can be viewed as a combination of model averaging and selection of the most discriminative subspaces, when the scores are scaled by the logarithmic function. The work in [24] is much closer to the feature bagging method of [28], except that statistical selection of the relevant subspaces is used for the outlier analysis process. The final score is computed as the average of the scores over the different components of the ensemble. Recently, a method called OutRank [33] has been proposed, which can combine the results of multiple rankings based on the relationship of data points to their nearest subspace clusters. It has been shown that even traditional subspace clustering algorithms [4] can provide good results for outlier analysis when the ensemble method is used. Thus, the work in [33] has conclusively shown the power of ensemble analysis for high-dimensional outlier analysis.

A different data-centered ensemble which is commonly used in the literature, but often not recognized as an ensemblar approach, is the use of initial phases of removing outliers from a data set, in order to create a more refined model for outlier analysis. An example of such an approach in the context of intrusion detection is discussed in [6]. In these cases, the combination function can simply be defined as the result from the very last step of the execution. This is because the data quality is improved significantly by the early components of the ensemble, and the results in the last phase therefore reflect the outliers most accurately. In effect, this is also a sequential ensemble with the specific goal of data refinement.

It should be pointed out that the distinction in this section between model-centered and data-centered ensembles is a somewhat semantic one, since a data-centered ensemble can also be considered a specific type of model-centered ensemble. Nevertheless, this categorization is useful, because the exploration of different segments of the data requires inherently different kinds of techniques than the exploration of different models which are data-independent. The choices in picking different functions of the data for exploration require data-centric insights, which are analogous to classification methods such as boosting, especially in the sequential case. Therefore, we view this categorization as a convenient way to stimulate different lines of research on the topic.

3.3 Discussion of Categorization Schemes

The two different categorization schemes are clearly not exhaustive, though they represent a significant fraction of the ensemble functions used in the literature. In fact, the two categorization schemes can be combined in order to create four different possibilities. This is summarized in Table 1. We have also illustrated how many of the current ensemblar schemes map to these different possibilities.

           Data Centered               Model Centered
  Indep.   Feature Bagging [28]        Tuning [12]
           HiCS [24]                   Tuning [35]
           Isolation For. [29]         Nguyen et al [34]
           OutRank [33]                Converting scores into
           Bagging [17]                probabilities [25]
  Seq.     Intrusion Bootstrap [6]     Open
           OUTRES [30]

  Table 1: Categorization of Ensemble Techniques

Interestingly, we were unable to find an example of a sequential model-based ensemble in the literature, though it is possible that the results from the execution of a particular model can provide hints about future directions of model construction for an outlier analysis algorithm. Therefore, it has been classified as an open problem in our categorization, and would be an interesting avenue for future exploration. The work by Nguyen et al [34] cannot be classified as either a data-centered or a model-centered scheme, since it uses some aspects of both. Furthermore, the works in [17; 25] convert outlier scores into probabilities as a general pre-processing method for normalization, and are not dependent on whether the individual components are data-centered or model-centered. The issue of model combination is a critically tricky one, both in terms of how the individual scores are normalized and in terms of how they are combined. This issue will be discussed in detail in the next section.

4. DEFINING COMBINATION FUNCTIONS

A crucial issue in outlier analysis is the definition of combination functions which can combine the outlier scores from different models. There are several challenges which arise in the combination process:

- Normalization Issues: The different models may output scores which are not easily comparable with one another. For example, a k-nearest neighbor classifier may output a distance score, which is different from an LOF score, and the latter is also quite different from the MDEF score returned by the LOCI method. Even a feature bagging approach, which is defined with the use of the same base algorithm (LOF) on different feature subsets, may sometimes have calibration issues in the scores [17]. Therefore, if a combination function such as the average or the maximum is applied to the constituent scores, then one or more of the models may be inadvertently favored.

- Combination Issues: The second issue is the choice of the combination function. Given a set of normalized outlier scores, how do we decide the specific choice of the combination function to be used? Should the minimum of the scores be used, the average, or the maximum? It turns out that the answer to this question may sometimes depend on the specific constituent components of the model, though some choices are clearly more common than others in the literature.

In the following sections, we will discuss some of these issues in detail.

4.1 Normalization Issues

The major factor in normalization is that the different algorithms do not use the same scales of reference, and cannot be reasonably compared with one another. In fact, in some cases, high outlier scores may correspond to greater outlier tendency, whereas in other cases low scores may correspond to greater outlier tendency. This causes problems during the combination process, since one or more components may be inadvertently favored. One simple approach for performing the normalization is to use the ranks from the different outlier analysis algorithms, ordered from greatest outlier tendency to least outlier tendency. These ranks can then be combined in order to create a unified outlier score. One of the earliest methods for feature bagging [28] uses such an approach in one of its combination functions. The major issue with such an approach is that it loses a lot of information about the relative differences between the outlier scores. For example, consider the case where the top outlier scores for two components of the ensemble are (say) {1.71, 1.71, 1.70, ...} and {1.72, 1.03, 1.01, ...} respectively, and each component uses (some variation of) the LOF algorithm. It is clear that in the first component the top three outlier scores are almost equivalent, whereas in the second component the top outlier score is clearly the most relevant one. However, a ranking approach will not distinguish between these scenarios, and will provide them the same rank values. Clearly, this loss of information is not desirable for creating an effective combination from the different scores.

The previous example suggests that it is important to examine both the ordering of the values and the distribution of the values during the normalization process. Ideally, it is desirable to somehow convert the outlier scores into probabilities, so that they can be reasonably used in an effective way. An approach was proposed in [17] which uses mixture modeling in conjunction with the EM framework in order to convert the scores into probabilities. Two methods are proposed in this work, both of which use parametric modeling. The first method assumes that the posterior probabilities follow a logistic sigmoid function; the underlying parameters are learned within the EM framework from the distribution of the outlier scores. The second approach recognizes the fact that the outlier scores of data points in the outlier component of the mixture are likely to show a different distribution (a Gaussian distribution) than the scores of data points in the normal class (an exponential distribution). Therefore, this approach models the score distribution as a mixture of exponential and Gaussian probability functions. As before, the parameters are learned with the use of the EM framework, and the posterior probabilities are calculated with the use of the Bayes rule. This approach has been shown to be effective in improving the quality of the ensemble approach proposed in [28]. A second method has also been proposed recently [25], which improves upon this base method for converting the outlier scores into probabilities.

4.2 Combining Scores from Different Models

The second issue is the choice of the function which is used in order to combine the scores. Given a set of (normalized) outlier scores Score_i(X) for the data point X, should we use the model average, maximum, or minimum? For ease of discussion in this section, we will assume the convention, without loss of generality, that greater outlier scores correspond to greater outlier tendency. Therefore, the maximum function picks the worst fit, whereas the minimum function picks the best fit.
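These conventions can be illustrated with a small sketch (hypothetical code, not from the paper): each component's score vector is rank-normalized so that rank n denotes the greatest outlier tendency (one simple normalization option; probability calibration as in [17] is the more principled alternative), and the maximum, average, or minimum is then applied across components.

```python
import numpy as np

def combine_scores(score_matrix, how="max"):
    # score_matrix: one row per ensemble component, one column per data
    # point; larger raw score = greater outlier tendency (our convention).
    # Rank-normalize each component so the components become comparable:
    # the least outlying point gets rank 1, the most outlying gets rank n.
    ranks = score_matrix.argsort(axis=1).argsort(axis=1) + 1
    if how == "max":   # worst fit: the component that most enhances outlierness
        return ranks.max(axis=0)
    if how == "avg":   # model averaging
        return ranks.mean(axis=0)
    if how == "min":   # best fit
        return ranks.min(axis=0)
    raise ValueError("unknown combination function: " + how)
```

For the scores {0.1, 0.2, 5.0} and {0.3, 0.2, 0.1} from two components, the maximum yields [3, 2, 3], the average [2, 2, 2], and the minimum [1, 2, 1]: averaging and minimization dilute the third point, which is extreme in the first component only. Note also that rank normalization discards the score magnitudes, which is exactly the information-loss issue raised for ranking approaches above.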
recognizedasensembleanalysis)wasperformedinthecontextofmodelparametertuning[12;35].Mostoutlieranalysismethodstypicallyhaveaparameter,whichcontrolsgranularityoftheunderlyingmodel.Theoutliersmayoftenbevisibletothealgorithmonlyataspeci\fclevelofgranularity.Forexample,thevalueofintheneighborapproachorLOFapproach,thesamplingneigh-borhoodsizeintheLOCIapproach,thenumberofclus-tersinaclusteringapproachallcontrolthegranularityoftheanalysis.Whatistheoptimalgranularitytobeused?Whilethisisoftenviewedasanissueofparametertuning,itcanalsobeviewedasanissueofensembleanalysis,whenaddressedinacertainway.Inparticular,themethodsin[12;35]runthealgorithmsoverarangeofvaluesofthegranularityparameter,andpicktheparameterchoicewhichbestenhancestheoutlierscore(maximumfunctionforourconventiononscoreordering)foragivendatapoint.Inotherwords,wehave:Ensemble =MAXScore reasonforthishasbeendiscussedinsomedetailintheoriginalLOFpaper.Inparticular,ithasbeensuggestedthattheuseofothercombinationfunctionsuchastheaverageortheminimumleadstoadilutionintheoutlierscoresfromtheirrelevantmodels.Thisseemstobeareasonablechoiceatleastfromanintuitiveperspective.Someothercommonfunctionswhichareusedinthelitera-tureareasfollows:MaximumFunction:Thisisoneofthemostcommonfunctionsusedforcombiningensemblarscoresbothinimplicit(LOFandLOCIparametertuning)andex-plicitensemblarmodels.Onevariationonthismodelistousetheranksinsteadofthescoresinthecom-binationprocess.Suchanapproachwasalsousedinfeaturebagging[28].Animportantaspectofthepro-cessisthatthedi erentdatapointsneedtohavethesamenumberofcomponentsintheensembleinordertobecomparedmeaningfully.AveragingFunction:Inthiscase,themodelscoresareaveragedoverthedi erentcomponentsoftheensem-ble.Theriskfactorhereisthatiftheindividualcom-ponentsoftheensemblearepoorlyderivedmodels,thentheirrelevantscoresfrommanydi 
erentcompo-nentswilldilutetheoveralloutlierscore.Nevertheless,suchanapproachhasbeenusedextensively.Examplesofmethodswhichusethismethodareoneofthemod-elsinfeaturebagging[28],theHICSmethod[30],andarecentapproachdescribedin[25].DampedAveraging:Inthismodel,adampingfunc-tionisappliedtotheoutlierscoresbeforeaveraging,inordertopreventitfrombeingdominatedbyafewcomponents.Examplesofadampingfunctioncouldbethesquarerootorthelogarithm.Itshouldbepointedoutthattheuseoftheproductoftheoutlierscoresorgeometricaveragingcouldbeinterpretedastheaver-agingofthelogarithmofoutlierscores.PrunedAveragingandAggregates:Inthiscase,thelowscoresareprunedandtheoutlierscoresareeitherav-eragedoraggregated(summedup)overtherelevantensembles.Thegoalhereistoprunetheirrelevantmodelsforeachdatapointbeforecomputingthecom-binationscore.Thepruningcanbeperformedbyei-therusinganabsolutethresholdontheoutlierscore,orbypickingthetopmodelsforeachdatapoint,andaveragingthem.Theriskfactorinusingabso-lutethresholdsarethenormalizationissueswhicharisefromdi erentdatapointshavingdi 
erentensemblarcomponents.Boththeaverageandaggregatescorescannolongerbemeaningfullycomparedacrossdif-ferentdatapoints.Aggregatesaremoreappropriatethanaverages,sincetheyimplicitlycountthenumberofensemblecomponentsinwhichadatapointisrele-vant.Adatapointwillbemorerelevantinagreaternumberofensemblecomponents,whenithasagreatertendencytobeanoutlier.ResultfromLastComponentExecuted:Thisapproachissometimesusedinsequentialensembles[6],inwhicheachcomponentoftheensemblesuccessivelyre\fnesthedataset,andremovestheobviousoutliers.Asaresult,thenormalmodelisconstructedonadatasetfromwhichoutliersareremovedandthemodelismorerobust.Insuchcases,thegoalofeachcomponentofthesequentialensembleistosuccessivelyre\fnethedataset.Therefore,thescorefromthelastcomponentisthemostappropriateonetobeused.Whichcombinationfunctionprovidesthebestinsightsforensembleanalysis?Clearly,thecombinationfunctionmaybedependentonthestructureoftheensembleinthegeneralcase,especiallyifthefunctionofeachcomponentoftheensembleistoeitherre\fnethedataset,orunderstandthebehaviorofonlyaverylocalsegmentofthedataset.However,forthegeneralcase,inwhichthefunctionofeachcomponentoftheensembleistoprovideareasonableandcomparableoutlierscoreforeachdatapoint,thetwomostcommonlyusedfunctionsaretheandtheAv-eragingfunctions.Whileprunedaveragingcombinestheseaspects,itisrarelyusedinensembleanalysis.Whichcom-binationfunctionisbest?Arethereanyothercombinationfunctionswhichcouldconceivablyprovidebetterresults?Theseareopenquestions,theanswertowhichisnotcom-pletelyknownbecauseofthesparseliteratureonoutlieren-sembleanalysis.Itisthisauthor'spersonalopinion,thattheintuitiveargumentprovidedintheLOFpaper[12]onusingthemaximumfunctionforavoidingdilutionfromirrelevantmodelsisthecorrectoneinmanyscenarios.However,theissueiscertainlynotsettledinthegeneralcase,andmanyvariantssuchasprunedaveragingmayalsoproviderobustresults,whileavoidingmostoftheirrelevantmodels.5.POSSIBLEAVENUESOFEXPLORATION:LEARNINGFROMOTHERDATAMIN­INGPROBLEMSTheareaofoutlierensembleanalysisi
5. POSSIBLE AVENUES OF EXPLORATION: LEARNING FROM OTHER DATA MINING PROBLEMS

The area of outlier ensemble analysis is still in its infancy, though it is rapidly emerging as an important area of research in its own right. Currently, the diversity of algorithms available for outlier ensemble analysis is limited, and is nowhere close to that available for many other data mining problems such as clustering and classification. Therefore, it may be instructive to examine some of the key techniques for other data mining problems such as clustering or classification, and whether it makes sense or is even feasible to design analogous methods for the outlier analysis problem.

As discussed earlier, a major challenge for ensemble development in unsupervised problems is that the evaluation process is highly subjective, and therefore the quality of the intermediate results cannot be fully evaluated in many scenarios. One of the constraints is that the intermediate decisions must be made with the use of outlier scores only, rather than with the use of concrete evaluation criteria on hold-out sets (as in the case of the classification problem). Therefore, in this context, we believe that the major similarities and differences between supervised and unsupervised methods are as follows:

Intermediate Evaluation: In unsupervised methods, ground truth is typically not available. While one can use measures such as classification accuracy in supervised methods, this is not the case with unsupervised methods. Intermediate evaluation is particularly important for sequential methods. This is one of the reasons that sequential methods are much rarer than independent methods in outlier ensemble analysis.

Diversity and Consensus Issues: Both supervised and unsupervised methods seek greater diversity with the use of an ensemble in terms of the methodology used for creating the model. In many cases, this is done by selecting models which are
different from one another. For example, in clustering, diversity is achieved by using either randomized clustering or by explicitly picking orthogonal clusterings [2]. However, in the case of supervised methods, the level of consensus is also measured at the end in terms of the ground truth. This is not the case in unsupervised methods, since no ground truth is available.

Some of the properties in supervised learning (e.g., the presence of class labels) cannot obviously be transferred to outlier analysis. In other cases, analogous methods can be designed for the problem of outlier analysis. In the following, we discuss some common methods used for different supervised and unsupervised problems, and whether they can be transferred to the problem of outlier analysis:

Boosting: Boosting [16] is a common technique used in classification. The idea is to focus on successively difficult portions of the data set in order to create models which can classify the data points in these portions more accurately, and then use the ensemble scores over all the components. A hold-out approach is used in order to determine the incorrectly classified instances for each portion of the data set. Such an approach clearly does not seem to be applicable to the unsupervised version of the problem because of the difficulty in computing the accuracy of the model on different data points in the absence of ground truth. On the other hand, since the supervised version of the problem (rare class detection) is a skewed classification problem, the boosting approach is applicable almost directly. A number of learners [13; 23] have been proposed for the supervised version of the outlier analysis problem. These classifiers have been shown to achieve significantly superior results because of the use of boosting. However, it is unlikely that an analogue of this method can be developed for an unsupervised problem such as outlier detection.

Bagging: Bagging [9] is an approach which works with samples of the data, and combines the results from the different samples. The well-known feature bagging approach for outlier analysis [17; 28] performs this step in a different way, by bagging the features rather than bagging the points. Nevertheless, the approach is also applicable to the use of samples of data points (rather than dimensions) in order to perform the prediction. The key challenge in this case may arise from different data points being a part of different numbers of ensemble components in the different cases. For many ensemble scores such as the aggregate, this can cause inadvertent bias in favoring data points which are sampled a larger number of times. This bias cannot be corrected unless specific kinds of combination functions such as the Average are used.
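A sketch of how such a scheme might look is given below. It applies a simple average k-nearest-neighbor-distance detector (a stand-in for LOF, which the feature bagging work of [28] actually uses) to randomly chosen feature subsets and combines the resulting scores; the data set, the detector choice, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_score(X, k=5):
    # Stand-in base detector: average distance to the k nearest
    # neighbors (larger = more outlying). Feature bagging [28] uses LOF.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)
    return d[:, 1:k + 1].mean(axis=1)   # column 0 is the self-distance

def feature_bagging(X, rounds=10, k=5):
    n, dim = X.shape
    per_round = []
    for _ in range(rounds):
        # Sample between d/2 and d-1 features per round, as in [28].
        # Note that scores computed on subsets of different
        # dimensionality may need normalization before combination.
        size = rng.integers(dim // 2, dim)
        feats = rng.choice(dim, size=size, replace=False)
        per_round.append(knn_score(X[:, feats], k=k))
    all_scores = np.array(per_round)
    return all_scores.mean(axis=0), all_scores.max(axis=0)  # Avg and Max

# Illustrative data: 50 inliers around the origin plus one far-away point.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 6)), np.full((1, 6), 8.0)])
avg_score, max_score = feature_bagging(X)
```

Because the injected point is extreme in every dimension, both the Avg and Max combinations rank it highest here; the interesting cases in practice are outliers visible only in some subspaces, which is precisely what feature bagging is designed to catch.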
Random Forests: Random forests [8] are a method which uses sets of decision trees on the training data, and computes the score as a function of these different components. While decision trees were not originally designed for the outlier analysis problem, it has been shown in [29] that the broad concept of decision trees can also be extended to outlier analysis by examining those paths with unusually short length, since the outlier regions tend to get isolated rather quickly. An ensemble of such trees is referred to as an isolation forest [29], and has been used effectively for making robust predictions about outliers.

Model Averaging and Combination: This is one of the most common models used in ensemble analysis, and is used both for the clustering and classification problems. In fact, the random forest method discussed above is a special case of this idea. In the context of the classification problem, many Bayesian methods [15] exist for the model combination process. Many of the recent models [25; 34] have focussed on creating a bucket of models from which the scores are combined through either averaging or using the maximum value. Even the parameter tuning methods used in many outlier analysis algorithms such as LOF and LOCI can be viewed as being drawn from this category. A related model is stacking [14; 38], in which the combination is performed in conjunction with model evaluation. This can sometimes be more difficult for unsupervised problems such as outlier analysis. Nevertheless, since stacking has been used for some unsupervised problems such as density estimation [37], it is possible that some of the techniques may be generalizable to outlier analysis, as long as an appropriate model for quantifying performance can be found.

Bucket of Models: In this approach [39], a "hold-out" portion of the data set is used in order to decide the most appropriate model. The most appropriate model is the one in which the highest accuracy is achieved on the held-out data set. In essence, this approach can be viewed as a competition or bake-off contest between the different models. While this is easy to perform in supervised problems such as classification, it is much more difficult for small-sample and unsupervised problems, in which no ground truth is available for evaluation. It is unlikely that a precise analogue of this method can be created for outlier analysis, since exact ground truth is not available for the evaluation process.
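The isolation idea described under Random Forests above can be sketched as follows. This is a deliberately simplified single-point recursion rather than the actual isolation forest algorithm of [29], and the data set and parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def path_length(x, X, depth=0, max_depth=10):
    # Number of random axis-parallel splits needed to isolate x within X;
    # outliers tend to be isolated after unusually few splits.
    if len(X) <= 1 or depth >= max_depth:
        return depth
    f = rng.integers(X.shape[1])
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    keep = X[:, f] < split if x[f] < split else X[:, f] >= split
    return path_length(x, X[keep], depth + 1, max_depth)

def isolation_scores(X, trees=50):
    # A shorter average path length means a point is easier to isolate,
    # i.e., more outlying; negate so that larger scores flag outliers.
    return np.array([-np.mean([path_length(x, X) for _ in range(trees)])
                     for x in X])

# Illustrative data: 60 clustered inliers plus one isolated point.
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)), [[9.0, 9.0]]])
scores = isolation_scores(X)
```

The isolated point is separated from the cluster after only one or two random splits on average, whereas points inside the cluster require many more, which is exactly the short-path signal that [29] exploits.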
To summarize, we create a table of the different methods and their characteristics, such as the type of ensemble, the combination technique, and whether normalization is present. This is provided in Table 2.

Table 2: Characteristics of Outlier Ensemble Methods

Method                               | Model/Data-Centered | Independent/Sequential | Combination Function | Normalization
LOF Tuning [12]                      | Model               | Independent            | Max                  | Not Needed
LOCI Tuning [35]                     | Model               | Independent            | Max                  | Not Needed
Feature Bagging [28]                 | Data                | Independent            | Max/Avg              | No
HiCS [24]                            | Data                | Independent            | Selective Avg        | No
Calibrated Bagging [17]              | Both                | Independent            | Max/Avg              | Yes
OutRank [33]                         | Data                | Independent            | Harmonic Mean        | No
Multiple-Proclus [33]                | Data                | Independent            | Harmonic Mean        | No
Converting Scores to Probabilities [25] | Both             | Independent            | Max/Avg              | Yes
Intrusion Bootstrap [6]              | Data                | Sequential             | Last Component       | Not Needed
OUTRES [30]                          | Data                | Sequential             | Product              | No
Nguyen et al. [34]                   | Both                | Independent            | Weighted Avg.        | No
Isolation Forest [29]                | Model               | Independent            | Expon. Avg.          | Yes
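Since several of the methods in Table 2 require normalization before their scores can be combined, a minimal sketch of one common approach is shown below: standardizing each detector's scores, loosely in the spirit of the score unification framework of [25]. The two raw score vectors are hypothetical.

```python
import numpy as np

def z_normalize(s):
    # Standardize one detector's scores so that detectors reporting on
    # very different scales can be meaningfully combined.
    mu, sigma = s.mean(), s.std()
    return (s - mu) / sigma if sigma > 0 else np.zeros_like(s)

# Hypothetical raw scores from two detectors on the same five points:
# one reports small density ratios, the other raw distances.
ratio_scores = np.array([1.1, 1.0, 1.2, 3.5, 1.1])
dist_scores = np.array([10.0, 12.0, 11.0, 55.0, 80.0])

# Without normalization, a Max combination would be dominated by the
# raw distances; after standardization, both detectors contribute.
combined = np.maximum(z_normalize(ratio_scores), z_normalize(dist_scores))
```

After standardization, the point flagged only by the ratio-based detector and the point flagged only by the distance-based detector both surface with high combined scores, which is the purpose of the "Normalization" column in Table 2.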
6. CONCLUSIONS AND DISCUSSION

This paper provides an overview of the emerging area of outlier ensemble analysis, which has seen increasing attention in the literature in recent years. Many ensemble analysis methods in the outlier analysis literature are not recognized as such in a formal way. This paper provides an understanding of how these methods relate to other techniques used explicitly as ensembles in the literature. We provided different ways of categorizing the outlier analysis problems in the literature, such as independent or sequential ensembles, and data- or model-centered ensembles. We discussed the impact of different kinds of combination functions, and how these combination functions relate to different kinds of ensembles. The issue of choosing the right combination function is an important one, though it may depend upon the structure of the ensemble in the general case. We also provided a mapping of many current techniques in the literature to different kinds of ensembles. Finally, a discussion was provided on the feasibility of adapting ensemblar techniques from other data mining problems to outlier analysis.

The area of ensemble analysis is poorly developed in the context of the outlier detection problem, as compared to other data mining problems such as clustering and classification. The reason for this is rooted in the greater difficulty of judging the quality of a component of the ensemble, as compared to a problem such as classification. Many models such as stacking and boosting in other data mining problems require a crisply defined judgement of different ensemblar components on hold-out sets, which is not readily available in problems such as outlier analysis. The outlier analysis problem suffers from the problem of a small sample space as well as a lack of ground truth (as in all unsupervised problems). The lack of ground truth implies that it is necessary to use the intermediate outputs of the algorithm (rather than concrete quality measures on hold-out sets) for making the combination decisions and ensemblar choices. These intermediate outputs may sometimes represent poor estimations of outlier scores. When combination decisions and ensemblar choices are made in an unsupervised way on an inherently small sample space problem such as outlier analysis, the likelihood and consequences of inappropriate choices can be high, as compared to another unsupervised problem such as clustering, which does not have the small sample space issue.

While outlier detection is a challenging problem for ensemble analysis, the problems are not insurmountable. It has become clear from the results of numerous recent ensemble methods that such methods can lead to significant qualitative improvements. Therefore, ensemble analysis seems to be an emerging area, which can be a fruitful research direction for improving the quality of outlier detection algorithms.

7. REFERENCES

[1] C. C. Aggarwal. Outlier Analysis, Springer, 2013.
[2] C. C. Aggarwal, C. Reddy. Data Clustering: Algorithms and Applications, CRC Press, 2013.
[3] C. C. Aggarwal and P. S. Yu. Outlier Detection in High Dimensional Data, ACM SIGMOD Conference, 2001.
[4] C. C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park. Fast Algorithms for Projected Clustering, ACM SIGMOD Conference, 1999.
[5] F. Angiulli, C. Pizzuti. Fast outlier detection in high dimensional spaces, PKDD Conference, 2002.
[6] D. Barbara, Y. Li, J. Couto, J.-L. Lin, and S. Jajodia. Bootstrapping a Data Mining Intrusion Detection System, ACM Symposium on Applied Computing, 2003.
[7] S. Bickel, T. Scheffer. Multi-view clustering. ICDM Conference, 2004.
[8] L. Breiman. Random Forests. Machine Learning, 45(1), pp. 5-32, 2001.
[9] L. Breiman. Bagging Predictors. Machine Learning, 24(2), pp. 123-140, 1996.
[10] V. Chandola, A. Banerjee, V. Kumar. Anomaly Detection: A Survey, ACM Computing Surveys, 2009.
[11] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule, KDD Conference, 2003.
[12] M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-based Local Outliers, ACM SIGMOD Conference, 2000.
[13] N. Chawla, A. Lazarevic, L. Hall, and K. Bowyer. SMOTEBoost: Improving prediction of the minority class in boosting, PKDD Conference, pp. 107-119, 2003.
[14] B. Clarke. Bayes Model Averaging and Stacking when Model Approximation Error cannot be Ignored, Journal of Machine Learning Research, pp. 683-712, 2003.
[15] P. Domingos. Bayesian Averaging of Classifiers and the Overfitting Problem. ICML Conference, 2000.
[16] Y. Freund, R. Schapire. A Decision-theoretic Generalization of Online Learning and Application to Boosting, Computational Learning Theory, 1995.
[17] J. Gao, P.-N. Tan. Converting output scores from outlier detection algorithms into probability estimates. ICDM Conference, 2006.
[18] Z. He, S. Deng and X. Xu. A Unified Subspace Outlier Ensemble Framework for Outlier Detection, Advances in Web Age Information Management, 2005.
[19] D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.
[20] A. Hinneburg, D. Keim, and M. Wawryniuk. Hd-eye: Visual mining of high-dimensional data. IEEE Computer Graphics and Applications, 19:22-31, 1999.
[21] W. Jin, A. Tung, and J. Han. Mining top-n local outliers in large databases, ACM KDD Conference, 2001.
[22] T. Johnson, I. Kwok, and R. Ng. Fast computation of 2-dimensional depth contours. ACM KDD Conference, 1998.
[23] M. Joshi, V. Kumar, and R. Agarwal. Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements. ICDM Conference, pp. 257-264, 2001.
[24] F. Keller, E. Muller, K. Bohm. HiCS: High-Contrast Subspaces for Density-based Outlier Ranking, ICDE Conference, 2012.
[25] H. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and Unifying Outlier Scores. SDM Conference, 2011.
[26] E. Knorr and R. Ng. Algorithms for Mining Distance-based Outliers in Large Datasets. VLDB Conference, 1998.
[27] E. Knorr and R. Ng. Finding Intensional Knowledge of Distance-Based Outliers. VLDB Conference, 1999.
[28] A. Lazarevic and V. Kumar. Feature Bagging for Outlier Detection, ACM KDD Conference, 2005.
[29] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation Forest. ICDM Conference, 2008.
[30] E. Muller, M. Schiffer, and T. Seidl. Statistical Selection of Relevant Subspace Projections for Outlier Ranking. ICDE Conference, pp. 434-445, 2011.
[31] E. Muller, S. Gunnemann, I. Farber, and T. Seidl. Discovering multiple clustering solutions: Grouping objects in different views of the data, ICDM Conference.
[32] E. Muller, S. Gunnemann, T. Seidl, and I. Farber. Tutorial: Discovering Multiple Clustering Solutions: Grouping Objects in Different Views of the Data. ICDE Conference, 2012.
[33] E. Muller, I. Assent, P. Iglesias, Y. Mulle, and K. Bohm. Outlier Ranking via Subspace Analysis in Multiple Views of the Data, ICDM Conference, 2012.
[34] H. Nguyen, H. Ang, and V. Gopalakrishnan. Mining ensembles of heterogeneous detectors on random subspaces, DASFAA, 2010.
[35] S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral, ICDE Conference, 2003.
[36] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient Algorithms for Mining Outliers from Large Data Sets. ACM SIGMOD Conference, pp. 427-438, 2000.
[37] P. Smyth and D. Wolpert. Linearly Combining Density Estimators via Stacking, Machine Learning Journal, 36, pp. 59-83, 1999.
[38] D. Wolpert. Stacked Generalization, Neural Networks, 5(2), pp. 241-259, 1992.
[39] B. Zenko. Is Combining Classifiers Better than Selecting the Best One, Machine Learning, pp. 255-273, 2004.