Addressing Concept-Evolution in Concept-Drifting Data Streams

Mohammad M. Masud, Qing Chen, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han and Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas

... outside the decision boundary. This space is controlled by a threshold, and the threshold is adapted continuously to reduce the risk of false alarms and missed novel classes. Second, we apply a probabilistic approach to detect novel class instances using the discrete Gini Coefficient. With this approach, we are able to distinguish different causes for the appearance of the outliers, namely, noise, concept-drift, or concept-evolution. Finally, we apply a graph-based approach to detect the appearance of more than one novel class simultaneously, and separate the instances of one novel class from the others. To the best of our knowledge, this is the first work that proposes these advanced techniques for novel class detection and classification in data streams. We apply our technique on a number of benchmark data streams including Twitter messages, and outperform the state-of-the-art classification and novel class detection techniques.

The rest of the paper is organized as follows. Section II discusses the related works in data stream classification and novel class detection. Section III describes the proposed technique. Section IV then reports the datasets and experimental results, and Section V concludes with directions for future work.

II. RELATED WORK

Most of the existing data stream classification techniques are designed to handle the efficiency and concept-drift aspects of the classification process [1], [3]–[5], [9]–[11]. Each of these techniques follows some sort of incremental learning approach to tackle the infinite-length and concept-drift problems. There are two variations of this incremental approach. The first is a single-model incremental approach, where a single model is dynamically maintained with the new data. For example, [4] incrementally updates a decision tree with incoming data, and the method in [1] incrementally updates micro-clusters in the model with the new data. The other approach is a hybrid batch-incremental approach, in which each model is built using a batch learning technique. However, older models are replaced by newer models when the older models become obsolete ([2], [5], [9], [10]). Some of these hybrid approaches use a single model to classify the unlabeled data (e.g., [10]), whereas others use an ensemble of models (e.g., [5], [9]). The advantage of the hybrid approaches over the single-model incremental approach is that the hybrid approaches require much simpler operations to update a model.

The other category of data stream classification techniques deals with concept-evolution, in addition to addressing infinite length and concept-drift. Spinosa et al. [7] apply a cluster-based technique to detect novel classes in data streams. However, this approach assumes only one "normal" class, and considers all other classes as "novel". Therefore, it is not directly applicable to multi-class data stream classification, since it corresponds to a "one-class" classifier. Furthermore, this technique assumes that the topological shape of the normal class instances in the feature space is convex. This may not be true in real data. Our previous work [6] proposes a classification and novel class detection technique that is a multi-class classifier and does not require classes to have a convex shape. In this paper, we extend this work by proposing a flexible and dynamically adaptive decision boundary for outlier detection, as well as methods for distinguishing more than one novel class. Experiments with real datasets prove the effectiveness of our approach.

III. NOVEL CLASS DETECTION: PROPOSED APPROACH

In this paper, we propose three different improvements over the existing novel class detection technique, i.e.: i) outlier detection using an adaptive threshold, ii) novel class detection using the Gini Coefficient, and iii) simultaneous multiple novel class detection. These are discussed in the following subsections. First, we briefly discuss the existing novel class detection technique.
Our stream classifier is an ensemble M of L classification models, M = {M_1, ..., M_L}. A class c is defined as a novel class if none of the classifiers M_i has been trained with c. Otherwise, if one of the classifiers M_i has been trained with c, then it is an existing class. The data stream is divided into equal sized chunks. We train a k-NN based classifier with each labeled chunk. Here, K clusters are built using a semi-supervised K-means clustering, and the cluster summaries (mentioned as pseudopoints) of each cluster are saved. The summary contains the centroid, the radius, and the frequencies of data points belonging to each class. The radius of a pseudopoint is equal to the distance between the centroid and the farthest data point in the cluster. The raw data points are discarded after creating the summary. These pseudopoints constitute the classification model. This classifier replaces one of the existing classifiers (usually the highest-error classifier) in the ensemble. Besides, each pseudopoint corresponds to a hypersphere having its center at the centroid and radius equal to the radius of the pseudopoint. The union of the hyperspheres constitutes the decision boundary for the classifier. If a test instance x is inside the decision boundary of any model in the ensemble, then it is classified as an existing class instance using majority voting. Otherwise, if x is outside the decision boundary of all the models, it is considered an outlier, and it is temporarily stored in a buffer Buf. When there are enough outliers in Buf, we invoke a novel class detection procedure to check whether the outliers actually belong to a novel class. If a novel class is found, the outliers are tagged as novel class instances.
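For concreteness, the following is a minimal sketch (not the authors' implementation) of how the pseudopoint summaries and the ensemble decision boundary described above could be represented. The class and function names are illustrative, and a plain k-means loop stands in for the semi-supervised K-means clustering used in the paper.

```python
import numpy as np

class Pseudopoint:
    """Cluster summary: centroid, radius, and per-class frequencies."""
    def __init__(self, centroid, radius, class_freq):
        self.centroid = np.asarray(centroid, dtype=float)
        self.radius = float(radius)
        self.class_freq = dict(class_freq)  # class label -> count

def build_pseudopoints(points, labels, k):
    """Summarize one labeled chunk into k pseudopoints (plain k-means stand-in)."""
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    rng = np.random.default_rng(0)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(10):  # a few Lloyd iterations
        assign = np.argmin(((points[:, None, :] - centroids) ** 2).sum(axis=-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = points[assign == j].mean(axis=0)
    pseudopoints = []
    for j in range(k):
        members = points[assign == j]
        if len(members) == 0:
            continue
        radius = float(np.max(np.linalg.norm(members - centroids[j], axis=1)))
        freq = {c: int(np.sum((assign == j) & (labels == c))) for c in np.unique(labels)}
        pseudopoints.append(Pseudopoint(centroids[j], radius, freq))
    return pseudopoints  # the raw data points can now be discarded

def inside_decision_boundary(x, model):
    """x is inside a model's decision boundary if it falls inside (or on) any hypersphere."""
    return any(np.linalg.norm(x - p.centroid) <= p.radius for p in model)

def classify_or_buffer(x, ensemble, buf):
    """Classify x as an existing class instance, or buffer it as an outlier."""
    if any(inside_decision_boundary(x, model) for model in ensemble):
        return "existing"   # majority voting among the models would assign the label here
    buf.append(x)           # outside all decision boundaries: a potential novel class instance
    return "outlier"
```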
The main assumption in novel class detection is that any class of data follows the property that "a data point should be closer to the data points of its own class (cohesion) and farther apart from the data points of other classes (separation)" [6]. The novel class detection procedure measures the cohesion among the outliers in Buf, and the separation of the outliers from the existing class instances, by computing a unified measure of cohesion and separation, which we call the q-Neighborhood Silhouette Coefficient, or q-NSC for short, as follows:

q-NSC(x) = (Dmin,q(x) - Dout,q(x)) / max(Dmin,q(x), Dout,q(x)),

where Dout,q(x) is the mean distance from outlier x to its q nearest outlier instances, and Dmin,q(x) is the mean distance from x to its q nearest existing class instances. The expression q-NSC yields a value between -1 and +1. A positive value indicates that x is closer to the outlier instances (more cohesion) and farther away from the existing class instances (more separation), and vice versa. The q-NSC(x) value of an outlier x is computed separately for each classifier M_i in the ensemble. A new class is declared if there are at least q outliers having positive q-NSC for all classifiers M_i.
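A minimal sketch of the q-NSC computation defined above, assuming the buffered outliers and the existing class instances are available as coordinate arrays; the function names are illustrative, and a single model is shown, whereas the procedure above computes q-NSC per classifier in the ensemble.

```python
import numpy as np

def q_nsc(x, other_outliers, existing, q):
    """q-Neighborhood Silhouette Coefficient of outlier x.

    other_outliers: coordinates of the remaining outliers in Buf
    existing: coordinates of existing class instances (or their representatives)
    """
    d_out = np.sort(np.linalg.norm(other_outliers - x, axis=1))[:q].mean()  # cohesion term
    d_min = np.sort(np.linalg.norm(existing - x, axis=1))[:q].mean()        # separation term
    return (d_min - d_out) / max(d_min, d_out)

def positive_nsc_count(buf, existing, q):
    """Count buffered outliers with positive q-NSC; a novel class needs at least q of them,
    and the condition must hold for every model in the ensemble."""
    buf, existing = np.asarray(buf, dtype=float), np.asarray(existing, dtype=float)
    scores = [q_nsc(x, np.delete(buf, i, axis=0), existing, q) for i, x in enumerate(buf)]
    return sum(s > 0 for s in scores)
```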
A. Outlier detection using adaptive threshold

A test instance is identified as an outlier if it is outside the radius of all the pseudopoints in the ensemble of models. Therefore, if a test instance is outside the hypersphere of a pseudopoint, but very close to its surface, it will still be an outlier. However, this case might be frequent due to concept-drift or noise, i.e., existing class instances may be outside the surface of the hypersphere. As a result, the false alarm rate (i.e., detecting existing classes as novel) would be high. In order to solve this problem, we follow an adaptive approach for detecting the outliers. We allow a slack space beyond the surface of each hypersphere. If any test instance falls within this slack space, it is considered as an existing class instance. This slack space is defined by a threshold, OUTTH. We apply an adaptive technique to adjust the threshold. First, we explain how the threshold is used.

Using OUTTH: Let x be a test instance, and let h be the nearest pseudopoint of x in model M_i, with radius r. Let d be the distance from x to the centroid of h. We define weight(x) as follows: weight(x) = e^(r - d). If d <= r, then x is inside (or on) the hypersphere and weight(x) >= 1. Otherwise, x is outside the hypersphere, and weight(x) is within the range [0,1). The main reason for using this exponential function is that, for points outside the hypersphere, it produces values within the range [0,1), which provides a convenient normalized value. The value of OUTTH is also within [0,1). Now, if weight(x) >= OUTTH, then we consider x as an existing class instance; otherwise, x is considered an outlier for that model. If x is identified as an outlier for all models M_i in the ensemble, then x is considered an outlier.

Adjusting OUTTH: Initially, OUTTH is set to the value OUTTH_INIT. We set OUTTH_INIT to 0.7 in our experiments. To adjust OUTTH, we examine the latest labeled instance x. If x had been a false-novel instance (i.e., an existing class instance misclassified as novel class), then it must have been an outlier. Therefore, weight(x) < OUTTH. If the difference OUTTH - weight(x) is less than a small constant epsilon, then we consider x a marginal false-novel instance. If x is a marginal false-novel instance, then we increase the slack space so that future instances like this do not fall outside the decision boundary. Therefore, OUTTH is decreased by a small value (epsilon), which effectively increases the slack space. Conversely, if x is a marginal false-existing instance, then x is a novel class instance that was falsely identified as an existing class instance by a narrow margin. Therefore, we need to decrease the slack space (increase OUTTH). This is done by increasing OUTTH by epsilon. The marginal constraint is imposed to avoid drastic changes in the OUTTH value. Figure 1 illustrates the concept of OUTTH, marginal false-novel, and marginal false-existing instances.

Figure 1. Illustration of the slack space outside the decision boundary.
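The slack-space test and the OUTTH adjustment can be sketched as follows, reusing the pseudopoint representation from the earlier sketch. OUTTH_INIT = 0.7 comes from the text above; the step size epsilon is a hypothetical placeholder, since its value is not stated here.

```python
import math

OUTTH_INIT = 0.7   # initial threshold value, as stated in the text
EPSILON = 0.01     # hypothetical small constant for the marginal test and the adjustment step

def weight(x, model):
    """weight(x) = e^(r - d), computed for the nearest pseudopoint of x in this model."""
    h = min(model, key=lambda p: math.dist(x, p.centroid))
    d = math.dist(x, h.centroid)
    return math.exp(h.radius - d)   # >= 1 inside (or on) the hypersphere, in [0, 1) outside

def is_outlier(x, ensemble, outth):
    """x is an outlier only if weight(x) < OUTTH for every model in the ensemble."""
    return all(weight(x, model) < outth for model in ensemble)

def adjust_outth(outth, x, ensemble, false_novel, false_existing):
    """Nudge OUTTH using the latest labeled instance x; only marginal cases trigger a change."""
    w = max(weight(x, model) for model in ensemble)
    if false_novel and 0 <= outth - w < EPSILON:
        outth -= EPSILON   # marginal false-novel: enlarge the slack space
    elif false_existing and 0 <= w - outth < EPSILON:
        outth += EPSILON   # marginal false-existing: shrink the slack space
    return outth
```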
B. Novel class detection using Gini Coefficient

The outliers detected during the outlier detection phase may occur because of one or more of three different reasons: noise, concept-drift, or concept-evolution. In order to distinguish the outliers that occur because of concept-evolution only, we compute a metric called the discrete Gini Coefficient of the outlier instances. We show that if the Gini Coefficient is higher than a particular threshold, then we can be confident of the concept-evolution scenario. After detecting the outlier instances using the OUTTH value discussed in the previous section, we compute the q-NSC(x) value for each outlier x. If the q-NSC(x) value is negative, we remove x from consideration, i.e., x is regarded as an existing class instance. For the remaining outliers, q-NSC(.) is within the range [0,1]. Now, we compute a compound measure for each outlier x, called the Novelty score or Nscore, as follows:

Nscore(x) = ((1 - weight(x)) / (1 - minweight)) * q-NSC(x),

where weight(x) is defined in Section III-A, and minweight is the minimum weight among all outliers having positive q-NSC. Nscore contains two parts: the first part measures how far the outlier is away from its nearest existing class pseudopoint (a higher value means a greater distance); the second part measures the cohesion of the outlier with the other outliers, and the separation of the outlier from the existing class instances. Note that the value of Nscore is within [0,1]. A higher value indicates a greater likelihood of being a novel class instance. The distribution of Nscore can be characterized by the actual class of the outlier instances. In other words, by examining the distribution of Nscore, we can decide about the novelty of the outlier instances, as follows. We discretize the Nscore values into n equal intervals (or bins), and construct a cumulative distribution function (CDF) of Nscore. Let y_i be the value of the CDF for the i-th interval. We compute the discrete Gini Coefficient G(s) for a random sample s of Nscore values, as follows:

G(s) = (1/n) * (n + 1 - 2 * (sum_{i=1..n} (n + 1 - i) * y_i) / (sum_{i=1..n} y_i)).

Let us consider three different cases and examine the behavior of G(s) in each case.
Case 1: All Nscore values are very low and fall in the first interval. Therefore, y_i = 1 for all i, and G(s) becomes (after simplification) 0. Note that this case occurs when all outliers actually belong to the existing classes.
Case 2: All Nscore values are very high and fall in the last interval. Therefore, y_n = 1 and y_i = 0 for all i < n, and G(s) becomes (after simplification) (n - 1)/n. Note that this case occurs when all outliers actually belong to the novel class.
Case 3: Nscore is evenly distributed across all the intervals. In this case y_i = i/n for all i, and G(s) becomes (after simplification) (n - 1)/(3n). Note that this case occurs if the distribution is mixed, i.e., noise, concept-drift, and possibly some novel class instances.

By examining the three cases, we can come up with a threshold for the Gini Coefficient to identify a novel class. If G(s) > (n - 1)/(3n), we declare a novel class and tag the outliers as novel class instances. If G(s) = 0, we classify the outliers as existing class instances. If 0 < G(s) <= (n - 1)/(3n), we filter out the outliers falling in the first interval, and consider the rest of the outliers as novel class. Note that as n grows, (n - 1)/(3n) approaches 1/3, but for any finite value of n, (n - 1)/(3n) < 1/3. For example, if n = 10, then (n - 1)/(3n) = 0.3. We use n = 10 in our experiments.
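A minimal sketch of the Nscore and discrete Gini Coefficient computation described above; it assumes the weights and q-NSC values of the outliers have already been computed as in the previous sketches, and uses n = 10 bins as in the text.

```python
import numpy as np

def novelty_scores(weights, nsc):
    """Nscore(x) = ((1 - weight(x)) / (1 - minweight)) * q-NSC(x), for outliers with positive q-NSC."""
    weights, nsc = np.asarray(weights, dtype=float), np.asarray(nsc, dtype=float)
    keep = nsc > 0                       # outliers with negative q-NSC are treated as existing class
    minweight = weights[keep].min()
    return (1.0 - weights[keep]) / (1.0 - minweight) * nsc[keep]

def discrete_gini(nscores, n=10):
    """Discrete Gini Coefficient of the CDF y_i of Nscore over n equal bins."""
    counts, _ = np.histogram(nscores, bins=n, range=(0.0, 1.0))
    y = np.cumsum(counts) / counts.sum()          # y_i: CDF value of the i-th bin
    i = np.arange(1, n + 1)
    return (n + 1 - 2.0 * np.sum((n + 1 - i) * y) / np.sum(y)) / n

def decide_novelty(nscores, n=10):
    """Apply the Gini-based decision rule from the text."""
    g = discrete_gini(nscores, n)
    threshold = (n - 1) / (3.0 * n)               # 0.3 for n = 10
    if g > threshold:
        return "novel class"
    if g == 0:
        return "existing classes"
    return "novel class after filtering the outliers in the first bin"
```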
C. Simultaneous multiple novel class detection

It is possible that more than one novel class may arrive at the same time (in the same chunk). This is a common scenario in text streams, such as Twitter messages. Note that determining whether there is more than one novel class is a challenging problem, since we must do it in an unsupervised fashion. In order to detect multiple novel classes, we construct a graph and identify the connected components in the graph. The number of connected components determines the number of novel classes. The basic assumption in determining the multiple novel classes follows from the separation property. For example, if there are two novel classes, then the separation among the different novel class instances should be higher than the cohesion among the same-class instances.

At first we use N_list, the collection of novel class instances detected using the novel class detection technique, to create K_v pseudopoints using K-means clustering and summarizing the clusters. Here K_v = (K * |N_list|) / S (S being the chunk size). Then we build a graph G = (V, E). Each pseudopoint is considered a vertex of G. For each pseudopoint h, we find its nearest neighbor h.nn based on centroid distances, and compute the silhouette coefficient s(h) of h using the following formula:

s(h) = (dist(h, h.nn) - mu_d(h)) / max(dist(h, h.nn), mu_d(h)),

where dist(h, h.nn) is the centroid distance between h and h.nn, and mu_d(h) is the mean distance from the centroid of h to all instances belonging to h. If s(h) is high (close to 1), it indicates that h is a tight cluster and that it is far from its nearest cluster. On the other hand, if s(h) is low, then h is not so tight and is close to its nearest cluster. We add an edge (h, h.nn) to E if s(h) is less than a threshold th_sc, i.e., if the clusters are not so separable. We use a fixed value of th_sc in all experiments. Once we have the graph G, we find the connected components and mark each pseudopoint with the corresponding component number. For each connected component g_i, we first compute its global centroid C(g_i), i.e., the center of gravity of all pseudopoints in g_i, and mu_d(g_i), i.e., the mean distance of all the pseudopoints in g_i from C(g_i). For each pair of components (g_1, g_2), we merge them if mu_d(g_1) + mu_d(g_2) is greater than twice the distance between C(g_1) and C(g_2). In other words, two components are merged if the mean intra-component distance is higher than the inter-component distance, i.e., the components are less dense and less separable from each other. Finally, we assign a class label to each novel class instance, which is equal to the component number to which the instance belongs.
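The graph-based separation of simultaneous novel classes can be sketched as below, assuming the novel class instances in N_list have already been clustered into K_v pseudopoints summarized by a centroid and a mean intra-cluster distance. The threshold th_sc is a placeholder value, since the value used in the experiments is not shown above, and the final component-merging step is omitted for brevity.

```python
import numpy as np

def multiple_novel_classes(centroids, mean_dists, th_sc=0.8):
    """Return one component id per pseudopoint; distinct ids correspond to distinct novel classes.

    centroids: (K_v, d) array of pseudopoint centroids built from N_list
    mean_dists: mean distance from each centroid to the instances of its own cluster
    th_sc: silhouette threshold for adding an edge (placeholder value)
    """
    centroids = np.asarray(centroids, dtype=float)
    mean_dists = np.asarray(mean_dists, dtype=float)
    kv = len(centroids)
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)

    parent = list(range(kv))            # union-find over the vertices of G = (V, E)
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    def union(a, b):
        parent[find(a)] = find(b)

    for h in range(kv):
        nn = int(np.argmin(dist[h]))    # nearest neighbor by centroid distance
        s = (dist[h, nn] - mean_dists[h]) / max(dist[h, nn], mean_dists[h])
        if s < th_sc:                   # low silhouette: clusters not separable, add edge (h, nn)
            union(h, nn)

    return [find(h) for h in range(kv)]  # connected component of each pseudopoint
```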
IV. EXPERIMENTS

A. Dataset
We have done extensive experiments on the Twitter, Forest Cover, KDD, and synthetic datasets. Due to space limitation, we report only on the Twitter and Forest Cover datasets. The descriptions of these datasets may be found in [8].

B. Experimental setup
Baseline techniques: MineClass: This is the existing approach proposed in [6]. MCM: This is the proposed approach, which stands for Multi Class Miner in Data Streams. OW: This is the combination of two approaches, namely OLINDDA [7] and the weighted classifier ensemble (WCE) [9]. OLINDDA works as a novel class detector, and WCE performs the classification. A similar baseline has been used in [6], with two variations, parallel and single. Here we use only the parallel model since it was the better of the two. In all experiments, the ensemble size and chunk size are kept the same for both these techniques. Besides, the same base learner (i.e., k-NN) is used for all three methods.
Parameter settings: Feature set size = 30 for the Twitter dataset. For the other datasets, all the numeric features are used. K (number of pseudopoints per chunk) = 50, S (chunk size) = 1000, L (ensemble size) = 6, q (minimum number of outliers required to declare a novel class) = 50. For OLINDDA, we use the default parameter values [7].

C. Evaluation
1) Overall novel class detection:
Evaluation approach: We use the following performance metrics for evaluation: Mnew = % of novel class instances misclassified as existing class, Fnew = % of existing class instances falsely identified as novel class, ERR = total misclassification error (%) (including Mnew and Fnew). We build the initial models in each method with the first InitNumber chunks. From the InitNumber+1st chunk onward, we first evaluate the performance of each method on that chunk, then use that chunk to update the existing models. We use InitNumber = 3 for all experiments. The performance metrics for each chunk are saved and aggregated for producing the summary result.

Figures 2(a) and 2(b) show the ERR rates for each approach throughout the stream in the Twitter and Forest datasets, respectively. For example, in figure 2(a) at X axis = 200, the Y values show the average ERR of each approach from the beginning of the stream to chunk 200 in the Twitter dataset. At this point, the ERR of MineClass, MCM, and OW are 17.2%, 1.3%, and 3.3%, respectively. Figures 2(d) and 2(e) show the total number of novel instances missed for each of the baseline approaches for the Twitter and Forest datasets, respectively. For example, in figure 2(d), at the same value of the X axis (=200), the Y values show the total novel instances missed (i.e., misclassified as existing class) by each approach from the beginning of the stream to chunk 200 in the Twitter dataset. At this point, the number of novel instances missed by MineClass, MCM, and OW are 929, 0, and 3533, respectively. The ROC curves for the Twitter and Forest datasets are generated by plotting the false novel class detection rate (false positive rate if we consider the novel class as the positive class and the existing classes as the negative class) against the true novel class detection rate (true positive rate). Figure 2(c) shows the ROC curves for the Twitter dataset.

Figure 2. ERR rates in (a) Twitter and (b) Forest datasets; (c) ROC curves in the Twitter dataset; novel classes missed in (d) Twitter and (e) Forest datasets.

Table I. Summary of the results

Dataset  Method     ERR   Mnew  Fnew  AUC
Twitter  MineClass  17.0  24.3  15.1  0.88
Twitter  MCM        1.8   0.7   0.6
Twitter  OW         3.1   100   1.4   0.56
Forest   MineClass  3.6   8.4   1.3
Forest   MCM        3.1   4.0   0.68
Forest   OW         5.9   20.6  1.1

Table I summarizes the results of overall classification and novel class detection error, i.e., the error in classification and in detecting novel classes only (not distinguishing multiple novel classes). For example, the column headed by Mnew reports the Mnew rates of each approach in different datasets for the entire stream. In the Twitter dataset, the Mnew rates are 24.3%, 0.7%, and 100% for MineClass, MCM, and OW, respectively. The column AUC reports the area under the ROC curves for each dataset. To summarize the results, MCM outperforms MineClass and OW in the ERR, Mnew, and Fnew rates. This is because of the enhanced mechanism of MCM in detecting novel classes. Recall that MCM applies an adaptive threshold for outlier detection, and also employs a probabilistic approach in recognizing the novel class instances. The net effect is that the overall Mnew and Fnew rates drop significantly, and the ERR rate also drops.

2) Multi novel class detection: Table II reports the multiple novel class detection results. There are 4 and 2 occurrences of two novel classes in the Twitter and Forest datasets, respectively. In other words, two novel classes appear simultaneously in 4 different data chunks in the Twitter dataset, and two novel classes appear simultaneously in 2 different data chunks in the Forest dataset. For each occurrence of multiple novel classes, we report the confusion matrix in a single column. The entries in the rows headed by 'Type1 as Type1' report the number of type 1 novel class instances correctly detected as type 1, the rows headed by 'Type1 as Type2' report the number of type 1 novel class instances incorrectly detected as type 2, and so on. For example, in the Twitter dataset, in the first occurrence of two novel classes (under column '1'), all of the 360 instances of the type 1 novel class are identified correctly as type 1; none of the type 1 novel class instances are incorrectly identified as type 2; 518 of the type 2 novel class instances are correctly identified as type 2; and 35 of the type 2 novel class instances are incorrectly identified as type 1. Note that the numbering of type 1 and 2 is relative. We also summarize our findings by reporting the precision, recall, and F-measure for each occurrence for each dataset, based on the misclassification of type 1 novel class instances into the other kind. For example, the table cell corresponding to the column headed by '1' and the row headed by 'Twitter F-measure' reports the F-measure of multiple novel class detection on the first occurrence of two novel classes in the Twitter dataset, which is 0.95. The F-measure is computed by considering type 1 instances as the positive class and the other as the negative class.

Table II. Summary of multi novel class detection results

Dataset  Row              1     2     3     4     Total
Twitter  Type1 as Type1   360   394   508   447   1709
Twitter  Type1 as Type2   0     0     0     0     0
Twitter  Type2 as Type2   518   568   453   500   2039
Twitter  Type2 as Type1   35    0     55    19    109
Twitter  Precision        1.0   1.0   1.0   1.0   1.0
Twitter  Recall           0.91  1.0   0.9   0.96  0.94
Twitter  F-measure        0.95  1.0   0.94  0.98  0.97
Forest   Type1 as Type1   371   583   -     -     954
Forest   Type1 as Type2   183   444   -     -     627
Forest   Type2 as Type2   300   550   -     -     850
Forest   Type2 as Type1   113   411   -     -     524
Forest   Precision        0.67  0.57  -     -     0.60
Forest   Recall           0.77  0.59  -     -     0.64
Forest   F-measure        0.71  0.58  -     -     0.62

Considering the fact that we apply an unsupervised approach, the results are very promising, especially in the Twitter dataset, where the overall F-measure is 0.97. For the Forest dataset, the F-measure is lower because the novel classes in the Twitter dataset are relatively better separated than those in the Forest dataset.
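As a worked check on Table II, the per-occurrence scores can be recomputed directly from the confusion-matrix rows. The sketch below is illustrative only; the mapping of the two ratios onto the 'Precision' and 'Recall' rows is inferred from the reported numbers, and it reproduces 1.0, 0.91, and 0.95 for the first Twitter occurrence.

```python
def multi_novel_scores(t1_as_t1, t1_as_t2, t2_as_t2, t2_as_t1):
    """Recompute one column of Table II; t2_as_t2 is not used by these three scores."""
    p = t1_as_t1 / (t1_as_t1 + t1_as_t2)   # fraction of type 1 instances kept as type 1
    r = t1_as_t1 / (t1_as_t1 + t2_as_t1)   # fraction of type 1 detections that are truly type 1
    f = 2 * p * r / (p + r)                # F-measure with type 1 as the positive class
    return p, r, f

# First occurrence of two novel classes in the Twitter dataset (Table II, column '1'):
print(multi_novel_scores(360, 0, 518, 35))   # -> (1.0, 0.911..., 0.953...)
```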
V. CONCLUSION

We propose several improvements over the existing classification and novel class detection technique. First, we propose an improved technique for outlier detection by defining a dynamic slack space outside the decision boundary of each classification model. Second, we propose a better alternative for identifying novel class instances using the discrete Gini Coefficient. Finally, we propose a graph-based approach for distinguishing among multiple novel classes. We apply our technique on several real data streams that experience concept-drift and concept-evolution, and achieve significant performance improvements over the existing techniques. In the future, we would like to extend our technique to text and multi-label stream classification problems.

ACKNOWLEDGEMENT

Research was sponsored in part by AFOSR MURI award FA9550-0810265, NASA grant NNX08AC35A, and the Army Research Laboratory (ARL) under Cooperative Agreement No. W911NF-0920053 (NS-CTA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARL or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

REFERENCES

[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Trans. on Knowl. and Data Engg. (TKDE), 18(5):577–589, 2006.
[2] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavalda. New ensemble methods for evolving data streams. In KDD, pages 139–148, 2009.
[3] S. Hashemi, Y. Yang, Z. Mirzamomen, and M. Kangavari. Adapted one-versus-all decision trees for data stream classification. IEEE TKDE, 21(5):624–637, 2009.
[4] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In KDD, pages 97–106, 2001.
[5] J. Kolter and M. Maloof. Using additive expert ensembles to cope with concept drift. In ICML, pages 449–456, 2005.
[6] M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. Integrating novel class detection with classification for concept-drifting data streams. In ECML PKDD, volume 2, pages 79–94, 2009.
[7] E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In ACM SAC, pages 976–980, 2008.
[8] University of Texas at Dallas Data Mining Tools Repository.
[9] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In KDD, pages 226–235, 2003.
[10] Y. Yang, X. Wu, and X. Zhu. Combining proactive and reactive predictions for data streams. In KDD, pages 710–715, 2005.
[11] P. Zhang, X. Zhu, and L. Guo. Mining data streams with labeled and unlabeled training examples. In ICDM, pages 627–636, 2009.