/
JournalofMachineLearningResearch11(2010)2973-3009Submitted7/09;Revised JournalofMachineLearningResearch11(2010)2973-3009Submitted7/09;Revised

JournalofMachineLearningResearch11(2010)2973-3009Submitted7/09;Revised - PDF document

yoshiko-marsland
yoshiko-marsland . @yoshiko-marsland
Follow
391 views
Uploaded On 2015-09-12

JournalofMachineLearningResearch11(2010)2973-3009Submitted7/09;Revised - PPT Presentation

ApreliminaryversionofthisworkappearedatAISTATSScottandBlanchard2009cr2010GillesBlanchardGyeminLeeandClaytonScott BLANCHARDLEEANDSCOTTcontrolledexperimentoranexpertLabeledexamplesofnovelties ID: 127276

.ApreliminaryversionofthisworkappearedatAISTATS(ScottandBlanchard 2009).c\r2010GillesBlanchard GyeminLeeandClaytonScott. BLANCHARD LEEANDSCOTTcontrolledexperimentoranexpert.Labeledexamplesofnovelties

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "JournalofMachineLearningResearch11(2010)..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

JournalofMachineLearningResearch11(2010)2973-3009Submitted7/09;Revised5/10;Published11/10Semi-SupervisedNoveltyDetectionGillesBlanchardBLANCHARD@WIAS-BERLIN.DEUniversit¨atPotsdam,Intitutf¨urMathematikAmNeuenPalais1014469Potsdam,GermanyGyeminLeeGYEMIN@EECS.UMICH.EDUClaytonScottCSCOTT@EECS.UMICH.EDUDepartmentofElectricalEngineeringandComputerScienceUniversityofMichigan1301BealAvenueAnnArbor,MI48109-2122,USAEditor:IngoSteinwartAbstractAcommonsettingfornoveltydetectionassumesthatlabeledexamplesfromthenominalclassareavailable,butthatlabeledexamplesofnoveltiesareunavailable.Thestandard(inductive)ap-proachistodeclarenoveltieswherethenominaldensityislow,whichreducestheproblemtodensitylevelsetestimation.Inthispaper,weconsiderthesettingwhereanunlabeledandpos-siblycontaminatedsampleisalsoavailableatlearningtime.Wearguethatnoveltydetectioninthissemi-supervisedsettingisnaturallysolvedbyageneralreductiontoabinaryclassicationproblem.Inparticular,adetectorwithadesiredfalsepositiveratecanbeachievedthroughare-ductiontoNeyman-Pearsonclassication.Unliketheinductiveapproach,semi-supervisednoveltydetection(SSND)yieldsdetectorsthatareoptimal(e.g.,statisticallyconsistent)regardlessofthedistributiononnovelties.Therefore,innoveltydetection,unlabeleddatahaveasubstantialimpactonthetheoreticalpropertiesofthedecisionrule.WevalidatethepracticalutilityofSSNDwithanextensiveexperimentalstudy.WealsoshowthatSSNDprovidesdistribution-free,learning-theoreticsolutionstotwowellknownproblemsinhypothesistesting.First,ourresultsprovideageneralsolutiontothegeneraltwo-sampleproblem,thatis,theproblemofdeterminingwhethertworandomsamplesarisefromthesamedistribution.Second,aspecializationofSSNDcoincideswiththestandardp-valueap-proachtomultipletestingundertheso-calledrandomeffectsmodel.Unlikestandardrejectionregionsbasedonthresholdedp-values,thegeneralSSNDframeworkallowsforadaptationtoarbi-traryalternativedistributionsinmultipledimensions.Keywords:semi-supervisedlearning,noveltydetection,Neyman-Pearsonclassication,learningreduction,two-sampleproblem,multipletesting1.IntroductionSeveralrecentworksinthemachinelearningliteraturehaveaddressedtheissueofnoveltydetec-tion.Thebasictaskistobuildadecisionrulethatdistinguishesnominalfromnovelpatterns.Thelearnerisgivenarandomsamplex1;:::;xm2ofnominalpatterns,obtained,forexample,froma .ApreliminaryversionofthisworkappearedatAISTATS(ScottandBlanchard,2009).c\r2010GillesBlanchard,GyeminLeeandClaytonScott. BLANCHARD,LEEANDSCOTTcontrolledexperimentoranexpert.Labeledexamplesofnovelties,however,arenotavailable.Thestandardapproachhasbeentoestimatealevelsetofthenominaldensity(Sch¨olkopfetal.,2001;Steinwartetal.,2005;ScottandNowak,2006;VertandVert,2006;El-YanivandNisenson,2007;Hero,2007),andtodeclaretestpointsoutsidetheestimatedlevelsettobenovelties.Werefertothisapproachasinductivenoveltydetection.Inthispaperweincorporateunlabeleddataintonoveltydetection,andarguethatthisframe-workofferssubstantialadvantagesovertheinductiveapproach.Inparticular,weassumethatinadditiontothenominaldata,wealsohaveaccesstoanunlabeledsamplexm+1;:::;xm+nconsistingpotentiallyofbothnominalandnoveldata.Weassumethateachxi=m+1;:::;m+nispairedwithanunobservedlabelyi2f0;1gindicatingitsstatusasnominal(yi=0)ornovel(yi=1),andthat(xm+1;ym+1);:::;(xm+n;ym+n)arerealizationsoftherandompair(X;Y)withjointdistributionPXY.ThemarginaldistributionofanunlabeledpatternXisthecontaminationmodelXPX=(1p)P0+pP1;wherePyy=0;1,istheconditionaldistributionofXjY=y,andp=PXY(Y=1)istheaprioriprobabilityofanovelty.Similarly,weassumex1;:::;xmarerealizationsofP0.WeassumenoknowledgeofPXP0P1,orp,althoughinSection6(wherewewanttoestimatetheproportionp)wedoimposeanaturalconditiononP1thatensuresidentiabilityofpWetakeasourobjectivetobuildadecisionrulewithasmallfalsenegativeratesubjecttoaxedconstraintaonthefalsepositiverate.Ouremphasishereisonsemi-supervisednoveltydetection(SSND),wherethegoalistoconstructageneraldetectorthatcouldclassifyanarbitrarytestpoint.Thisgeneraldetectorcanofcoursebeappliedinthetransductivesetting,wherethegoalistopredictthelabelsym+1;:::;ym+nassociatedwiththeunlabeleddata.Ourresultsextendinanaturalwaytothissetting.OurbasiccontributionistodevelopageneralsolutiontoSSNDbyasurrogateproblemrelatedtoNeyman-Pearson(NP)classication,whichistheproblemofbinaryclassicationsubjecttoauser-speciedconstraintaonthefalsepositiverate.Inparticular,wearguethatSSNDcanbeaddressedbyapplyingaNPclassicationalgorithm,treatingthenominalandunlabeledsamplesasthetwoclasses.EventhoughasamplefromP1isnotavailable,wearguethatourapproachcaneffectivelyadapttoanynoveltydistributionP1,incontrasttotheinductiveapproachwhichisonlyoptimalincertainextremelyunlikelyscenarios.Thatis,bysolvingthesurrogateproblem,weobtainaclassiersuchthat,uptoatolerancethatshrinksassamplesizesincrease,P1((X)=0)isminimal,whileP0((X)=1)aOurlearningreductionallowsustoimportexistingstatisticalperformanceguaranteesforNeyman-Pearsonclassication(Cannonetal.,2002;ScottandNowak,2005)andtherebydeducegeneraliza-tionerrorbounds,consistency,andratesofconvergencefornoveltydetection.Inadditiontothesetheoreticalproperties,thereductiontoNPclassicationhaspracticaladvantages,inthatitallowsessentiallyanyalgorithmforNPclassicationtobeappliedtoSSND.SSNDisparticularlysuitedtosituationswherethenoveltiesoccupyregionswherethenominaldensityishigh.Ifasinglenoveltyliesinaregionofhighnominaldensity,itwillappearnominal.However,ifmanysuchnoveltiesarepresent,theunlabeleddatawillbemoreconcentratedthanonewouldexpectfromjustthenominalcomponent,andthepresenceofnoveltiescanbedetected.SSNDmayalsobethoughtofassemi-supervisedclassicationinthesettingwherelabelsfromoneclassaredifculttoobtain(seediscussionofLPUEbelow).Weemphasizethatwedonotassume2974 SEMI-SUPERVISEDNOVELTYDETECTIONthatnoveltiesarerare,thatis,thatpisverysmall,asinanomalydetection.However,SSNDisapplicabletoanomalydetectionprovidednissufcientlylarge.Wealsodiscussestimationofpandthespecialcaseofp=0,whichisnottreatedinourinitialanalysis.Wepresentahybridapproachthatautomaticallyrevertstotheinductiveapproachwhenp=0,whilepreservingthebenetsoftheNPreductionwhenp�0.Inaddition,wedescribeadistribution-freeone-sidedcondenceintervalforp,consistentestimationofp,andtestingforp=0,whichamountstoageneralversionofthetwo-sampleprobleminstatistics.Wealsodiscussconnectionstomultipletesting,whereweshowthatSSNDgeneralizesastandardapproachtomul-tipletesting,basedonthresholdingp-values,underthecommon“randomeffects”model.Whereasthep-valueapproachisoptimalonlyunderstrongassumptionsonthealternativedistribution,SSNDcanoptimallyadapttoarbitraryalternatives.Thepaperisstructuredasfollows.Afterreviewingrelatedworkinthenextsection,wepresentthegenerallearningreductiontoNPclassicationinSection3,andapplythisreductioninSection4todeducestatisticalperformanceguaranteesforSSND.Section5presentsourhybridapproach,whileSection6applieslearning-theoreticprinciplestoinferenceonp.ConnectionstomultipletestingaredevelopedinSection7.ExperimentsarepresentedinSection8,whileconclusionsarediscussedinthenalsection.Shorterproofsarepresentedinthemaintext,andlongerproofsappearintherstappendix.2.RelatedWorkInductivenoveltydetection:Describedintheintroduction,thisproblemisalsoknownasone-classclassication(Sch¨olkopfetal.,2001)orlearningfromonlypositive(oronlynegative)examples.Thestandardapproachhasbeentoassumethatnoveltiesareoutlierswithrespecttothenominaldistribution,andtobuildanoveltydetectorbyestimatingalevelsetofthenominaldensity(ScottandNowak,2006;VertandVert,2006;El-YanivandNisenson,2007;Hero,2007).Aswediscussbelow,densitylevelsetestimationisequivalenttoassumingthatnoveltiesareuniformlydistributedonthesupportofP0.Thereforethesemethodscanperformarbitrarilypoorly(whenP1isfarfromuniform,andstillhassignicantoverlapwithP0).InSteinwartetal.(2005),inductivenoveltydetectionisreducedtoclassicationofP0againstP1,whereinP1canbearbitrary.Howeverani.i.d.samplefromP1isassumedtobeavailableinadditiontothenominaldata.Incontrast,thesemi-supervisedapproachoptimallyadaptstoP1,whereonlyanunlabeledcontaminatedsampleisavailablebesidesthenominaldata.Inaddition,weaddressestimationandtestingoftheproportionofnovelties.Classicationwithunlabeleddata:Intransductiveandsemi-supervisedclassication,labeledtrainingdataf(xi;yi)gmi=1frombothclassesaregiven.Thesettingproposedhereisaspecialcasewheretrainingdatafromonlyoneclassareavailable.Intwo-classproblems,unlabeleddatatyp-icallyhaveatbestaslighteffectonconstants,nitesamplebounds,andrates(Rigollet,2007;LaffertyandWasserman,2008;Ben-Davidetal.,2008;Singhetal.,2009),andarenotneededforconsistency.Incontrast,wearguethatfornoveltydetection,unlabeleddataareessentialforthesedesirabletheoreticalpropertiestohold.Learningfrompositiveandunlabeledexamples:Classicationofanunlabeledsamplegivendatafromoneclasshasbeenaddressedpreviously,butwithcertainkeydifferencesfromourwork.Thisbodyofworkisoftentermedlearningfrom“positive”andunlabeledexamples(LPUE),al-2975 BLANCHARD,LEEANDSCOTTthoughinourcontextwetendtothinkofnominalexamplesasnegative.Terminologyaside,anumberofalgorithmshavebeendeveloped,whichwenowrelatetothepresentwork.Oneclassofalgorithmsproceedsroughlyasfollows:First,identifyunlabeledpointsforwhichitseemshighlylikelythatyi=1.Second,learnaclassierfromtheknownpositiveexamplesandthesupposednegativeexamples.Useitontheunlabeleddatatoupdatethegroupofcandidatesforthenegativeclassandrepeatuntilastablelabelingisreached.SeveralsuchalgorithmsarereviewedinZhangandLee(2005)andZhangandZuo(2008),buttheytendtobeheuristicinnatureandsensitivetotheinitialchoiceofnegativeexamples.AtheoreticalanalysisofLPUEisprovidedbyDenis(1998);Denisetal.(2005)fromtheview-pointofprobablyapproximatelycorrect(PAC)learnableclasses.InPAClearnability,theobjectiveistondspecicclassesofclassierssuchthattheoptimalclassierinthatclasscanbeapprox-imatedarbitrarilywell,andwherethenumberofsamplesrequiredispolynomialintheinverseoftheerrortolerance.Whilesomeideasarecommonwiththepresentwork(suchasclassifyingthenominalsampleagainstthecontaminatedsampleasaproxyfortheultimategoal),ourpointofviewisrelativelydifferentandbasedonstatisticallearningtheory.Inparticular,ourinputspacecanbenon-discreteandweassumethedistributionsP0andP1canoverlap,whichleadsustousetheNPclassicationsettingandstudyuniversalconsistencyproperties.Severalotherapproacheshavebeendevelopedwhich,eitherexplicitlyorimplicitly,relyonareductiontoaclassicationproblem.SteinbergandCardell(1992)andWardetal.(2009)proposeframeworksbasedonlogisticregression,butbothassumethatpisknown.ElkanandNoto(2008)assumeaparticularsamplingschemewheremandnarerelatedinsuchawaythatpcanbereadilyestimated.Unfortunately,thissamplingassumptionisnotvalidinmanyapplicationsofinterest.Allthreeoftheseworksderivetheiralgorithmsbyaconsiderationofposteriorprobabilities,andconsequentlytheyrequirethatpisknownorcanbeestimated.Incontrast,ourapproachadoptsthe(non-Bayesian)Neyman-PearsoncriterionandinnowaydependsontheabilitytoknoworestimatepTheideaofreducingLPUEtoabinaryclassicationproblemhasalsobeentreatedbyZhangandLee(2005),Liuetal.(2002),LeeandLiu(2003)andLiuetal.(2003).Mostnotably,Liuetal.(2002)providesamplecomplexityboundsforVCclassesforthelearningrulethatminimizesthenumberoffalsenegativeswhilecontrollingtheproportionoffalsepositivesatacertainlevel.Ourapproachextendstheirsinseveralrespects.First,Liuetal.(2002)doesnotconsiderapproximationerrororconsistency,nordotheboundsestablishedthereimplyconsistency.Incontrast,wepresentageneralreductionthatisnotspecictoanyparticularlearningalgorithm,andcanbeusedtodeduceconsistencyorratesofconvergence.OurworkalsomakesseveralcontributionsnotaddressedpreviouslyintheLPUEliterature,includingourresultsrelatingtothecasep=0,totheestimationofp,andtomultipletesting.WealsonoterecentworkbySmolaetal.(2009)describedasrelativenoveltydetection.Thisworkispresentedasanextensionofstandardone-classclassicationtoasettingwhereareferencemeasure(indicatingregionswherenoveltiesaremorelikely)isknownthroughasample.Inpractice,theauthorstakethissampletobeacontaminatedsampleconsistingofbothnominalandnovelmeasurements,sothesettingisthesameasours.Theemphasisinthisworkisprimarilyonanewkernelmethod,whereasourworkfeaturesagenerallearningreductionandlearningtheoreticanalysis.Multipletesting:Themultipletestingproblemisalsoconcernedwiththesimultaneousdetectionofmanypotentiallyabnormalmeasurements(viewedasrejectednullhypotheses).InSection7,we2976 SEMI-SUPERVISEDNOVELTYDETECTIONdiscussindetailtherelationofourcontaminationmodeltotherandomeffectsmodel,astandardmodelinmultipletesting.WeshowhowSSNDis,inseveralrespects,ageneralizationofthatmodel,andincludesseveraldifferentextensionsproposedintherecentmultipletestingliterature.TheSSNDmodel,andtheresultspresentedinthispaper,arethusrelevanttomultipletestingaswell,andsuggestaninterestingpointofviewtothisdomain.Inparticular,throughareductiontoclassication,weintroducebroadconnectionstostatisticallearningtheory3.TheFundamentalReductionTobegin,werstconsiderthepopulationversionoftheproblem,wherethedistributionsareknowncompletely.RecallthatPX=(1p)P0+pP1isthedistributionofunlabeledtestpoints.Adoptingahypothesistestingperspective,wearguethattheoptimaltestsforH0XP0vs.H1XP1areidenticaltotheoptimaltestsforH0XP0vs.HXXPX.Theformerarethetestswewouldliketohave,andthelatteraretestswecanestimatebytreatingthenominalandunlabeledsamplesaslabeledtrainingdataforabinaryclassicationproblem.Tooffersomeintuition,werstassumethatPyhasdensityhyy=0;1.AccordingtotheNeyman-PearsonLemma(Lehmann,1986),theoptimaltestwithsize(falsepositiverate)aforH0XP0vs.H1XP1isgivenbythresholdingthelikelihoodratioh1(x)=h0(x)atanappropriatevalue.Similarly,lettinghX=(1p)h0+ph1denotethedensityofPX,theoptimaltestsforH0XP0vs.HXXPXaregivenbythresholdinghX(x)=h0(x).NownoticehX(x) h0(x)=(1p)+ph1(x) h0(x):Thus,thelikelihoodratiosarerelatedbyasimplemonotonetransformation,providedp�0.Fur-thermore,thetwoproblemshavethesamenullhypothesis.Therefore,bythetheoryofuniformlymostpowerfultests(Lehmann,1986),theoptimaltestofsizeaforoneproblemisalsooptimal,withthesamesizea,fortheotherproblem.Inotherwords,wecandiscriminateP0fromP1bydiscriminatingbetweenthenominalandunlabeleddistributions.Notetheaboveargumentdoesnotrequireknowledgeofpotherthanp�0.Thehypothesistestingperspectivealsoshedslightontheinductiveapproach.Inparticular,estimatingthenominallevelsetfxh0(x)lgisequivalenttothresholding1=h0(x)at1=l.Thus,thedensitylevelsetisanoptimaldecisionruleprovidedh1isconstantonthesupportofh0.ThisassumptionthatP1isuniformonthesupportofP0isthereforeimplicitlyadoptedbyamajorityofworksonnoveltydetection.WenowdroptherequirementthatP0andP1havedensities.LetRd!f0;1gdenoteaclassier.Fory=0;1,letRy()=Py((X)=y)denotethefalsepositiverate(FPR)andfalsenegativerate(FNR)of,respectively.Forgreatergenerality,supposewerestrictourattentiontosomexedsetofclassiersF(possiblythesetofallclassiers).TheoptimalFNRforaclassieroftheclassFwithFPRa,0a1,isR1;a(F)=inff2FR1()(1)s.t.R0()a:2977 BLANCHARD,LEEANDSCOTTSimilarly,introduceRX()=PX((X)=0)=pR1()+(1p)(1R0())andletRX;a(F)=inff2FRX()(2)s.t.R0()a:Inthispaperwewillalwaysassumethefollowingproperty(involvingF;P0andP1)holds:(A)Foranya2(0;1),thereexists2FsuchthatR0()=aandR1()=R1;a(F)Remark.ThisassumptionisinparticularsatisediftheclassFissuchthatforanyf2FwithR0()a,wecanndanotherclassierf02FwithR0(0)=aandf0f(sothatR1(0)R1()).WhenP0isabsolutelycontinuouswithrespecttoLebesquemeasure,thispropertycanbeeasilyveriedformanycommonclassiersets,forexamplelinearclassiers,decisiontreesorradialbasisfunctionclassiers.Evenwithoutanyassumptionsonthedistribution,itispossibletoensurethat(A)issatisedprovidedoneextendstheclassFtoalargerclasscontainingrandomizedclassiersobtainedbyconvexcombinationofclassiersoftheoriginalclass.Thisconstructionisstandardinthereceiveroperatingcharacteristic(ROC)literature.SomebasicresultsonthistopicarerecalledinAppendixBinrelationtotheaboveassumption.Bythefollowingresult,theoptimalclassiersforproblems(1)and(2)arethesame.Further-more,onedirectionofthisequivalencealsoholdsinanapproximatesense.Inparticular,approx-imatesolutionstoXP0vs.XPXtranslatetoapproximatesolutionsforXP0vs.XP1ThefollowingtheoremconstitutesourmainlearningreductioninthesenseofBeygelzimeretal.(2005):Theorem1Assumeproperty(A)issatised.Consideranya0a1,andassumep&#x-0.9;畵0.Thenforanyf2Fthetwofollowingstatementsareequivalent:(i)RX()=RX;a(F)andR0()a(ii)R1()=R1;a(F)andR0()=aMoregenerally,letL1;a(;F)=R1()R1;a(F)andLX;a(;F)=RX()RX;a(F)denotetheexcesslosses(regrets)forthetwoproblems,andassumep&#x-0.9;畵0.IfR0()a+efore0,thenL1;a(;F)p1(LX;a(;F)+(1p)e):Proof.Foranyclassier,wehavetherelationRX()=(1p)(1R0())+pR1().Westartwithproving())().Consider2FsuchthatR1()=R1;a(F)andR0()=a,butassumeRX()&#x-0.9;畵RX(F).Let02FsuchthatRX(0)RX()andR0(0)a.Thensincep&#x-0.9;畵0,R1(0)=p1RX(0)(1p)(1R0(0))p1(RX()(1p)(1a))=R1();2978 SEMI-SUPERVISEDNOVELTYDETECTIONcontradictingminimalityofR1()Toestablishtheconverseimplication,consider2FsuchthatRX()=RX;a(F)andR0()a,butassumeR1()�R1;a(F)orR0()a.Let0besuchthatR0(0)=aandR1(0)=R1(F)whoseexistenceisensuredbyassumption(A).ThenRX(0)=(1p)(1a)+pR1(0)(1p)(1R0())+pR1()=RX();contradictingminimalityofRX().Toprovethenalstatement,rstnotethatweestablishedRX;a(F)=pR1;a(F)+(1p)(1a);bytherstpartofthetheorem.BysubtractionwehaveL1;a(;F)=p1(LX;a(;F)+(1p)(R0()a))p1(LX;a(;F)+(1p)e): Theorem1suggeststhatwemayestimatethesolutionto(1)bysolvingasurrogatebinaryclassicationproblem,treatingx1;:::;xmasoneclassandxm+1;:::;xm+nastheother.Intherestofthepaper,weexploretheconsequencesofthisreductionfromatheoreticalaswellaspracticalperspective.Inthenextsection,weillustrateonthetheoreticalside,inthecaseofanempiricalriskminimization(ERM)typealgorithm,howanitesampleboundforNPclassicationtranslatestoanitesampleboundforSSNDandleadstodesirablepropertiessuchasconsistency.Ontheotherhand,algorithmswecananalyze(suchasERM)oftendonothavethebestperformanceonactualdata,andmaybecomputationallyinfeasible(asituationthatisnotspecictoSSND).ThusintheexperimentalSection8weimplementadifferentmethod,namelysimplebuteffectiveschemesbasedonkerneldensityestimates.ItisimportanttoobservethatTheorem1stillappliestothesemethodssinceitjustcomparestwoobjectivefunctionsandisagnostictothemethodused.4.StatisticalPerformanceGuaranteesWenowillustratehowTheorem1leadstoperformanceguaranteesforSSND.WeconsiderthecaseofaxedsetofclassiersFhavingniteVC-dimension(Vapnik,1998),andtheNPclassicationalgorithmbt=argminf2FbRX()s.t.bR0()a+t;basedon(constrained)empiricalriskminimization,wherebRX()=1 nm+nåi=m+11ff(xi)=1g;bR0()=1 mmåi=11ff(xi)=0g:ThisrulewasanalyzedinCannonetal.(2002)andScottandNowak(2005).DenetheprecisionofaclassierforclassasQi()=PXY(Y=j(X)=)(thehighertheprecision,thebetterthe2979 BLANCHARD,LEEANDSCOTTperformance).ThenwehavethefollowingresultboundingthedifferenceofthequantitiesRiandQitotheiroptimalvaluesoverFTheorem2Assumex1;:::;xmandxm+1;:::;xm+narei.i.d.realizationsofP0andPX,respectively,andthatthetwosamplesareindependentofeachother.Assumep�0.LetFbeasetofclassiersofVC-dimensionV.Assumeproperty(A)issatisedanddenotebyfanoptimalclassierinFwithrespecttothecriterionin(1).Fixingd�0,deneek=q Vlogklogd k.Thereexistabsoluteconstantsc;c0suchthat,ifwechooset=cen,thefollowingboundsholdwithprobability1d:R0(bt)ac0en(3)R1(bt)R1()c0p1(en+em);(4)Qi()Qi(bt)c0 P((X)=)(en+em);=0;1:(5)TheproofisgiveninAppendixA.TheprimarytechnicalingredientsintheproofareTheorem3ofScottandNowak(2005)andthelearningreductionofTheorem1above.TheabovetheoremshowsthattheprocedureisconsistentinsidetheclassFforallcriteriaconsidered,thatis,thesequantitiesdecrease(resp.increase)asymptoticallytotheirvalueat.Thisisincontrasttothestatisticallearningboundspreviouslyobtained(Liuetal.,2002,Thm.2),whichdonotimplyconsistency.FollowingScottandNowak(2005),byextendingsuitablytheargumentandthemethodinthespiritofstructuralriskminimizationoverasequenceofclassesFkhavingtheuniversalapproxima-tionproperty,wecanconcludethatthismethodisuniversallyconsistent(thatis,relevantquantitiesconvergetotheirvalueat,whereisthesolutionof(1)overthesetofallpossibleclassi-ers).Therefore,althoughtechnicallysimple,thereductionresultofTheorem1allowsustodeducestrongerresultsthantheexistingonesconcerningthisproblem.Thiscanbeparalleledwiththere-sultthatinductivenoveltydetectioncanbereducedtoclassicationagainstuniformdata(Steinwartetal.,2005),whichmadethestatisticallearningstudyofthatproblemsignicantlysimpler.ItisinterestingtonotethatthemultiplicativeconstantinfrontoftherateofconvergenceoftheprecisioncriteriaisPX((X)=)1ratherthanp1forR1.InparticularPX((X)=0)(1p)(1a),sothattheconvergencerateforclass0precisionisnotsignicantlyaffectedasp!0.SimilarlyPX((X)=1)(1p)a,sotheconvergencerateforclass1precisiondependsmorecruciallyonthe(known)athanonpForcompleteness,webrieydiscusstheoptimalityofQi()in(5)inthesenseofthecriterionQiitself.Underanadditionalminorcondition,itispossibletoshow(thedetailsaregivenattheendofAppendixB)thatundertheconstraintR0()a,thebestattainableprecisionforclass0inthesetFisattainedby=.Therefore,in(5)(=0),wearereallycomparingtheprecisionofbtagainstthebestpossibleclass0precisiongiventheFPRconstraint.Ontheotherhand,itdoesnotmakesensetoconsiderthebestattainableclass1precisionunderanupperconstraintonR0,sincewecanhavebothR0!0andQ1!1byonlyrejectingavanishinglysmallproportionofverysurenovelties.Butitcaneasilybeseenthatrealizesthebestattainableclass1precisionundertheequalityconstraintR0()=aWeemphasizethattheaboveresultisbutoneofmanypossibletheoremsthatcouldbededucedfromthelearningreduction;otherresultsfromNeyman-Pearsonclassicationcouldalsobeapplied.Wealsoremarkthat,althoughtheprevioustheoremcorrespondstothesemi-supervisedsetting,an2980 SEMI-SUPERVISEDNOVELTYDETECTIONanalogoustransductiveresultiseasilyobtainedbyincorporatinganadditionaluniformdeviationboundrelatingtheempiricalerrorratesontheunlabeleddatatothetrueerrorrates.5.TheCasep=0andaHybridMethodTheprecedingreductionofSSNDtoNPclassicationisonlyjustiedwhenp�0.Asidefromtheanalysisbreakingdown,thiscanbeseenasfollows.TheunlabeledsampleisadrawfromPX=(1p)P0+pP1.Whenp=0,theunlabeledsampleisadrawfromP0.ThereforeitcontainsnoinformationaboutP1.WerewetosolvethesurrogateNPproblem,wewouldbeattemptingtoclassifybetweentwoidenticaldistributions,andthebestwecoulddowouldberandomguessing.ThisisconrmedinTable2(casep=0)wheretheAUCvaluesforSSNDarenearonehalf.Ourgoalinthissectionistodevelopalearningreduction,andaparallelresulttoTheorem1inSection3,butwhichhandlesthecasep=0moresensibly.Whenp=0,wehavenoinformationaboutP1ineithersample.Therefore,theonlywaytogetanytractionontheproblemistomakesomeassumptionaboutP1.Theinductivemethodmakessuchanassumption(asnotedpreviouslyinthepaper),namely,thatP1isuniformonthesupportofP0.Sinceuniformityisthestandardassumptionwithoutanyadditionalpriorknowledge,weaimtodevelopamethodthatperformsatleastaswellastheinductivemethodwhenp=0.Thereforeweaskthefollowingquestion:Canwedeviseamethodwhich,havingnoknowledgeofp,sharesthepropertiesofthelearningreductionofSection3whenp�0,andtheinductiveapproachotherwise?Ouranswertothequestionis“yes”underfairlygeneralconditions.Theintuitionbehindourapproachisthefollowing.Theinductiveapproachtonoveltydetectionperformsdensitylevelsetestimation.Furthermore,aswesawinSection3,densitylevelsetsareoptimaldecisionregionsfortestingthenominaldistributionagainstauniformdistribution.There-fore,levelsetestimationcanbeachievedbygeneratinganarticialuniformsampleandperformingweightedbinaryclassicationagainstthenominaldata(thisideahasbeendevelopedinmoredetailbySteinwartetal.,2005).Ourapproachistosprinkleavanishinglysmallproportionofuniformlydistributeddataamongtheunlabeleddata,andthenimplementSSNDusingNPclassicationonthismodieddata.Whenp=0,theuniformpointswillinuencethenaldecisionruletoperformlevelsetestimation.Whenp�0,theuniformpointswillbeswampedbytheactualnovelties,andtheoptimaldetectorwillbeestimated.Toformalizethisapproach,let0pn1beasequencetendingtozero.AssumethatSisacompactsetwhichisknowntocontainthesupportofP0(obtained,e.g.,throughsupportestimationorthroughaprioriinformationontheproblem),andletP2betheuniformdistributiononS.Considerthefollowingprocedure:Letkbinom(n;pn).DrawkindependentrealizationsfromP2,andredenexm+1;:::;xm+ktobethesevalues.(Inpractice,theuniformdatawouldsimplybeappendedtotheunlabeleddata,sothatinformationisnoterased.Thepresentprocedure,however,isslightlysimplertoanalyze.)TheideanowistoapplytheSSNDlearningreductionfrombeforetothismodiedunlabeleddata.Towardthisend,weintroducethefollowingnotations.Forsimplicity,wedonotexplicitlyindicatetheunderlyingclassF.WerefertoanydatapointthatwasdrawnfromeitherP1orP2asanoperativenovelty.Theproportionofoperativenoveltiesinthemodiedunlabeledsampleis˜p=p(1pn)+pn.Thedistributionofoperativenoveltiesis˜P1=p(1pn) ˜pP1+pn ˜pP2,andtheoveralldistributionofthemodiedunlabeleddatais˜PX=˜p˜P1+(1˜p)P0.LetR2;R2;a;˜R1;˜R1;a;˜RX,and2981 BLANCHARD,LEEANDSCOTT˜RX;abedenedintermsofP2;˜P1,and˜PX,respectively,inanalogytothedenitionsinSection3.AlsodenoteL2;a()=R2()R2;a˜L1;a()=˜R1()˜R1;a,and˜LX;a=˜RX()˜RX;aByapplyingTheorem1tothemodieddata,weimmediatelyconcludethatifR0()a+ethen˜L1;a()1 ˜p(˜LX;a()+(1˜p)e)=1 ˜p(˜LX;a()+(1p)(1pn)e):(6)BypreviouslycitedresultsonNeyman-Pearsonclassication,thequantitiesontheright-handsidecanbemadearbitrarilysmallasmandngrow.Thefollowingresulttranslatesthisboundtothekindofguaranteeweareseeking.Theorem3Assume(A)holds.LetfbeaclassierwithR0()a+e.Ifp=0,thenL2;a()p1n(˜LX;a()+(1pn)e):Ifp�0,thenL1;a()1 p(1pn)(˜LX;a()+(1p)(1pn)e+pn):Tointerprettherststatement,notethatL2;a()istheinductiveregret.TheboundimpliesthatL2;a()!0aslongasbothe=R0()aand˜LX;a()tendtozerofasterthanpn.Thissuggeststakingpntobeasequencetendingtozeroslowly.ThesecondstatementissimilartotheearlierresultinTheorem1,butwithadditionalfactorsofpn.Thesefactorssuggestchoosingpntendingtozerorapidly,incontrasttotherststatement,soinpracticesomebalanceshouldbestruck.ProofIfp=0,then˜L1;a=L2;aandtherststatementfollowstriviallyfrom(6).Toprovethesecondstatement,denotebn=p(1pn) ˜p,andobservethat˜R1;a=infR0(f)a˜R1()=infR0(f)a[bnR1()+(1bn)R2()]bnR1;a+(1bn):Therefore˜L1;a()=˜R1()˜R1;abnR1()+(1bn)R2()bnR1;a(1bn)bn(R1()R1;a)(1bn)=bnL1;a()(1bn)andweconclude,stillusing(6),L1;a()1 bn˜L1;a+1bn bn1 p(1pn)(˜LX;a()+(1p)(1pn)e+pn): 2982 SEMI-SUPERVISEDNOVELTYDETECTIONLikeTheorem1,Theorem3isquitegeneral,andhasboththeoreticalandpracticalimplications.Theoretically,itcouldbecombinedwithspecic,analyzablealgorithmsforNeyman-Pearsonclas-sicationtoyieldnoveltydetectorswithperformanceguarantees,aswasillustratedinSection4.Wedonotdevelopthistheoreticaldirectionhere.Practically,anyalgorithmforNeyman-Pearsonclas-sicationthatgenerallyworkswellinpracticecanbeappliedinthehybridframeworktoproducenoveltydetectorsthatperformwellforvaluesofpthatarezeroorverynearzero.Weimplementthisideaintheexperimentalsectionbelow.Wealsoremarkthatthishybridprocedurecouldbeappliedwithanypriordistributiononnov-eltiesbesidesuniform.Inaddition,thehybridapproachcouldalsobepracticallyusefulwhennissmall,assumingthearticialpointsareappendedtotheunlabeledsample.6.EstimatingpandTestingforp=0Intheprevioussections,ourmaingoalwastondagoodclassierfunctionforthepurposeofnoveltydetection.Besidesthedetectoritself,itisoftenrelevanttotheusertohaveanestimateorboundontheproportionpofnoveltiesinthecontaminateddistributionPX.Estimationofpallowsforestimatingandoptimizingthemisclassicationrateontheunlabeleddata,whichisoftenofinterestintheLPUEliterature(seeSec.2).Estimationofpisalsousefulforestimatingtheprecision(asdenedinSection4);thistopicwillberevisitedinthenextsectioninthecontextofmultipletesting.Itmayalsobeusefultotestwhethertherearenoveltiesatall;inotherwords,sincethelearntdetectorbisallowedacertainproportionoffalsepositives,itisimportanttoassesswhetherthereportednoveltiesareastatisticallysignicantindicationofthepresenceoftruenovelties,oriftheyarelikelytobeallfalsepositives.Wefocusontheseissuesinthepresentsection.Itshouldrstbenotedthatwithoutadditionalassumptions,pisnotanidentiableparameterinourmodel.Toseethis,considertheidealizedcasewherewehaveaninniteamountofnominalandcontaminateddata,sothatwehaveperfectknowledgeofP0andPX.AssumingthedecompositionPX=(1p)P0+pP1holds,notethatanyalternatedecompositionoftheformPX=(1pg)P0+(p+g)P01,withP01=(p+g)1(pP1+gP0),andg2[0;1p],isequallyvalid.BecausethemostimportantfeatureofthemodelisthatwehavenodirectknowledgeofP1,wecannotdecidewhichrepresentationisthe“correct”one;wecouldnotevenexcludeapriorithecasewherep=1andP1=PX(whileproducingtheexactsameobserveddata).ThepreviousresultsestablishedinTheorems1-3arevalidforwhateverunderlyingrepresentationisassumedtobecorrect.Fortheestimationoftheproportionofnovelties,however,itmakessensetodenepastheminimalproportionofnoveltiesthatcanexplainthedifferencebetweenP0andPX.Firstweintroducethefollowingdenition:Denition4AssumeP0,P1areprobabilitydistributionsonthemeasurespace(;S).WecallP1apropernoveltydistributionwithrespecttoP0ifthereexistsnodecompositionoftheformP1=(1g)Q+gP0whereQissomeprobabilitydistributionand0g1ThisdenesapropernoveltydistributionP1asonethatcannotbeconfoundedwithP0;itcannotberepresentedasa(nontrivial)mixtureofP0withanotherdistribution.Thenextresultestablishesacanonicaldecompositionofthecontaminateddistributionintoamixtureofnominaldataandpropernovelties.Asaconsequencetheproportionpofpropernov-elties,andthereforethepropernoveltydistributionP1itself,arewell-dened(thatis,identiable)2983 BLANCHARD,LEEANDSCOTTgiventheknowledgeofthenominalandcontaminateddistributions(exceptforthespecialcaseP0=PX,whereofcoursethenoveltydistributionisnotdened).Proposition5AssumeP0,PXareprobabilitydistributionsonthemeasurespace(;S).IfPX=P0,thereisauniquep2(0;1]andP1suchthatthedecompositionPX=(1p)P0+pP1holds,andsuchthatP1isapropernoveltydistributionwithrespecttoP0.Ifweadditionallydenep=0whenPX=P0,theninallcases,p=minfa2[0;1]9Qprobabilitydistribution:PX=(1a)P0+aQg:(7)TheproofisgiveninAppendixA.FortherestofthissectionweassumeforsimplicityofnotationthatpandP1aretheproportionanddistributionofpropernoveltiesofPXwithrespecttoP0.Theresultstocomearealsoinformativeforimpropernoveltydistributions,inthefollowingsense:ifP1isnotapropernoveltydistributionandthedecompositionPX=(1p)P0+pP1holds,then(7)entailsthatp�p.Itfollowsthatalowerboundonp(eitherdeterministicorvalidwithacertaincondence),aswillbederivedinthecomingsections,isalwaysalsoavalidlowercondenceboundonpwhennon-propernoveltiesareconsidered.AlowerboundiseffectivelythebestwecanhopeforpifP1isnotassumedtobeproper.6.1PopulationCaseWenowwanttorelatetheestimationofptoquantitiespreviouslyintroducedandproblem(1).Wersttreatthepopulationcaseandoptimalnoveltydetectionoverthesetofallpossibleclassiers.Theorem6Foranyclassierf,wehavetheinequalityp1RX() 1R0():OptimizingthisboundoverasetofclassiersFundertheFPRconstraintR0()ayieldsforany0a1:p1RX;a(F) 1a:(8)Furthermore,ifFisthesetofalldeterministicclassiers,p=1infa2[0;1)RX;a(F) 1a:(9)Proof.Fortherstpart,justwriteforanyclassier1RX()=PX((X)=1)=(1p)P0((X)=1)+pP1((X)=1)(1p)R0()+p;resultingintherstinequalityinthetheorem.UndertheconstraintR0()a,thisinequalitythenyieldsp1RX() 1R0()1RX() 1a2984 SEMI-SUPERVISEDNOVELTYDETECTIONoptimizingtheboundundertheconstraintyieldsthesecondinequality.WeestablishinLemma13inAppendixAthatforanye�0thereexistsaclassieresuchthatR0(e)1andR1(e)=(1R0(e))e.Putae=R0(e);wethenhaveRX;ae(F)RX(e)=(1p)(1ae)+pR1(e);implyingp1infa2[0;1)RX;a(F) 1a1RX;ae(F) 1aep1R1(e) 1R0(e)p(1e);whichestablishesthelastclaimofthetheorem. 6.2Distribution-freeLowerCondenceBoundsonpInthelastpartofTheorem6,ifweassumethatthefunctiona7!RX;a(F)=(1a)isnonincreasing(acommonregularityassumption;seeAppendixBforadiscussionofhowthisconditioncanalwaysbeensuredbyconsideringpossiblyrandomizedclassiers),thena7!RX;a(F)isleftdifferentiableata=1and(8)isoptimizedbytakinga!1,thatis,p1+dRX;a(F) da a=1;(10)while(9)entailsthattheaboveinequalityisanequalityifFcontainsalldeterministicclassiers.ThissuggestsobtainingalowerboundonpbyestimatingtheslopeofRX;a(F)atitsrightendpoint.Thefollowingresultadoptsthisapproachwhileaccountingfortheuncertaintyinherentinempiricalperformancemeasures.Theorem7ConsideraclassiersetFforwhichweassumeauniformerrorboundofthefollowingformisavailable:foranydistributionQon,withprobabilityatleast1doverthedrawofani.i.d.sampleofsizenaccordingtoQ,wehave82F Q((X)=1)bQ((X)=1) en(F;d);(11)wherebQdenotestheempiricaldistributionbuiltonthesample.Thenthefollowingquantityisalowerboundonpwithprobabilityatleast(1d)212d(overthedrawofthenominalandunlabeledsamples):bp(F;d)=1inff2FbRX()+en(F;d) (1bR0()em(F;d))+;(12)wheretheratioisformallydenedtobe1wheneverthedenominatoris0Notethatifwedeneba=argminf2FbRX()undertheconstraintbR0()a,thiscanberewrittenbp(F;d)=1infa2[0;1]bRX(ba)+en(F;d) (1bR0(ba)em(F;d))+:2985 BLANCHARD,LEEANDSCOTTTherearetwobalancingforcesatplayhere.Fromthepopulationversion(10)(validunderamildregularityassumption),weknowthatwewouldliketohaveaascloseaspossibleto1forestimatingthederivativeofRX;a(F)ata=1.Thisisbalancedbytheestimationerrorwhichmakesestimationsclosetoa=1unreliablebecauseofthedenominator.Takingtheinmumalongthecurvetakesinasensethebestavailablebias-estimationtradeoff.Proof.Tosimplifynotationwedenoteen(F;d)simplybyen.Asintheproofofthepreviousresult,writeforanyclassierPX((X)=1)(1p)P0((X)=1)+p;fromwhichwededuceafterapplyingtheuniformbound1bRX()en=bPX((X)=1)en(1p)(bR0()+em)+p;whichcanbesolvedwhenever1bR0()em�0. Thefollowingresultshowsthatbp(F;d),whensuitablyappliedusingasequenceofclassiersetsF1;F2;:::thathaveauniversalapproximationproperty,yieldsastronglyuniversallyconsistentestimateoftheproportionpofpropernovelties.TheproofisgiveninAppendixAandreliesonTheorem7inconjunctionwiththeBorel-Cantellilemma.Theorem8ConsiderasequenceF1;F2;:::ofclassiersetshavingthefollowinguniversalap-proximationproperty:foranymeasurablefunctionf!f0;1g,andanydistributionQ,wehaveliminfk!¥inff2FkQ((X)=(X))=0:SupposealsothateachclassFkhasniteVC-dimensionVk,sothatforeachFkwehaveauniformcondenceboundoftheform(11)foren(Fk;d)=3q Vklog(n+1)logd=2 n.Denebp(d)=supkbpFk;dk2:Ifd=(mn)2,thenbpconvergestopalmostsurelyasmin(m;n)!¥6.3ThereareNoDistribution-freeUpperBoundsonpThelowercondenceboundsbp(F;d)andbp(d)aredistribution-freeinthesensethattheyholdregardlessofP0;P1andp.Wenowarguethatdistribution-freeuppercondenceboundsdonotgenerallyexist.Wedeneadistribution-freeuppercondenceboundbp+(d)tobeafunctionoftheobserveddatasuchthat,foranyP0,anypropernoveltydistributionP1,andanynoveltyproportionp1,wehavebp+(d)pwithprobability1doverthedrawofthetwosamples.Wewillshowthatsuchauniversalupperbounddoesnotexistunlessitistrivial.Thereasonisthatthenoveldistributioncanbearbitrarilyhardtodistinguishfromthenominaldistribution.Itispossibletodetectwithsomecertaintythatthereisanon-zeroproportionofnoveltiesinthecontaminateddata(seeCorollary11below),butwecanneverbesurethattherearenonovelties.2986 SEMI-SUPERVISEDNOVELTYDETECTIONThissituationissimilartothephilosophyofsignicancetesting:onecanneveracceptthenullhypothesis,butonlyhaveinsufcientevidencetorejectit.WewillsaythatthenominaldistributionP0isweaklydiffuseifforanyg�0thereexistsasetAsuchthat0P0(A)g.Wesayanuppercondenceboundbp+(d)isnon-trivialifthereexistsaweaklydiffusenominaldistributionP0,apropernoveltydistributionP1,anoveltyproportionp1,andaconstantd&#x-0.9;畵0suchthatP(bp+(d)1)&#x-0.9;畵d;wheretheprobabilityisoverthejointdrawofnominalandcontaminatedsamples.Thisassumptiondemandsthatthereisatleastaspecicsettingwheretheupperboundbp+(d)issignicantlydifferentfromthetrivialbound1,meaningthatitisboundedawayfrom1withlargerprobabilitythanitsallowedprobabilityoferrordTheorem9Thereexistsnodistribution-free,non-trivialuppercondenceboundonpTheproofappearsinAppendixA.Thenon-trivialityassumptionisquiteweakandrelativelyintuitive.TheonlynotdirectlyintuitiveassumptionisthatP0shouldbeweaklydiffuse,whichissatisedforalldistributionshavingacontinuouspart.Thisassumptioneffectivelyexcludesnitestatespaces,whichisanimportantcondition:ifisnite,itisactuallypossibletoobtainanon-trivialuppercondenceboundonpThefollowingcorollaryestablishesthatforanynitesamplesize,anyestimatorofp(andinparticulartheuniversallyconsistentestimatorconsideredintheprevioussection)canhaveanaverageerrorboundedfrombelowbyaconstantindependentofthesamplesize.Corollary10Assumeisaninnitesetandletm;nbexed.Foranyestimatorbpofp,basedonajointsampleofsize(m;n),andanyxedrealp&#x-0.9;畵0:supP2P(m;n)Ejbppjpc(p)&#x-0.9;畵0;whereP(m;n)denotesthesetofallgeneratingdistributionsof(m;n)-samplesfollowingtheSSNDmodel(thatis,oftheformP=P\nm0\nP\nnXforarbitraryP0;PX),andc(p)isaconstantindependentof(m;n)ThisresultessentiallyprecludestheexistenceofuniversalconvergenceratesintheestimationofpInotherwords,toachievesomeprescribedrateofconvergence,someassumptionsonthegeneratingdistributionsmustbemade.ThisparallelstheestimationoftheBayesriskinclassication(Devroye,1982).6.4Testingforp=0Thelowercondenceboundonpcanalsobeusedasatestforp=0,thatis,atestforwhetherthereareanynoveltiesinthetestdata:Corollary11LetFbeasetofclassiers.Ifbp(F;d)&#x-0.9;畵0,thenwemayconclude,withcondence1d,thattheunlabeledsamplecontainsnovelties.2987 BLANCHARD,LEEANDSCOTTItisworthnotingthattestingthishypothesisisequivalenttotestingifP0andPXarethesamedistribution,whichistheclassicaltwo-sampleprobleminanarbitraryinputspace.Thisproblemhasrecentlygeneratedattentioninthemachinelearningcommunity(Grettonetal.,2007),andtheapproachproposedhere,usingarbitraryclassiers,seemstobenew.Ourcondenceboundcouldofcoursealsobeusedtotestthemoregeneralhypothesispp0foraprescribedp0,0p01.Notethat,bydenitionofbp(F;d),testingthehypothesisp=0usingtheabovelowercon-denceboundforpisequivalenttosearchingtheclassierspaceFforaclassiersuchthattheproportionsofpredictionsof0and1bydifferonthetwosamplesinastatisticallysignicantmanner.Namely,foraclassierbelongingtoaclassFforwhichwehaveauniformboundoftheform(11),wehavethelowerboundPX((X)=1)bPX((X)=1)enandtheupperboundP0((X)=1)bP0((X)=1)+em(bothboundsvalidsimultaneouslywithprobabilityatleast1d).IfthedifferenceoftheboundsispositiveweconcludethatwemusthavePX=P0,hencep&#x-0.9;畵0.Thisdifferenceispreciselywhatappearsinthenumeratorofbp(F;d)in(12).Further-more,ifthisnumeratorispositivethensoisthedenominator,sinceitisalwayslarger.Intheend,bp(F;d)&#x-0.9;畵0isequivalenttosupf2F(bPX((X)=1)en)(bP0((X)=1)+em)&#x-0.9;畵0:7.RelationshipBetweenSSNDandMultipleTestingInthissection,weshowhowSSNDofferspowerfulgeneralizationsofthestandardp-valueap-proachtomultipletestingunderthewidelyused“randomeffects”model,asconsideredforexamplebyEfronetal.(2001).7.1MultipleTestingUndertheRandomEffectsModelInthemultipletestingframework,anitefamily(H1;:::;HK)ofnullhypothesestotestisxed;fromtheobservationofsomedataX,adecisionD(Hi;X)2f0;1gmustbetakenforeachhypothe-sis,namelywhether(giventhedata)hypothesisHiisdeemedtobefalse(D(Hi;X)=1,hypothesisrejected)ortrue(D(Hi;X)=0,hypothesisnotrejected).Atypicalapplicationdomainisthatofmicroarraydataanalysis,whereeachnullhypothesisHicorrespondstotheabsenceofadifferenceinexpressionlevelsofgeneinacomparisonbetweentwoexperimentalsituations.Arejectednullhypothesisthenindicatessuchadifferentialexpressionforaspecicgene,andiscalledadiscovery(sincedifferentiallyexpressedgenesarethoseofinterest).However,thenumberofnullhypothesestotestisverylarge,forexampleK'4:104inthegeneexpressionanalysis,andtheprobabilityofrejectingbychanceanullhypothesismustbestrictlycontrolled.Inthestandardsettingformultipletesting,itisassumedthatatestingstatisticZi(X)2RhasbeenxedforeachnullhypothesisHi,andthatitsmarginaldistributionisknownwhenHiistrue.Thisstatisticcanthenbenormalized(bysuitablemonotonetransform)totaketheformofap-value.Ap-valueisafunctionpi(X)ofthedatasuchthat,ifthecorrespondingnullhypothesisHiistrue,thenpi(X)hasauniformmarginaldistributionon[0;1].Inthissetting,itisexpectedthattherejectiondecisionsD(Hi;X)aretakenbasedontheobservedp-values(p1(X);:::;pK(X))ratherthanontherawdata.Infact,inmostcasesitisassumedthatthedecisionstaketheformD(Hi;X)=1fpi(X)bTg,wherebTisadata-dependentthreshold.Further,simplifyingdistributionalassumptionsonthefamilyofp-valuesareoftenposited.Acommondistributionmodelcalled2988 SEMI-SUPERVISEDNOVELTYDETECTIONrandomeffectsabstractsthep-valuesfromtheoriginaldataXandassumesthattheveracityofhypothesisHiisgovernedbyanunderlyinglatentvariablehiasfollows:thevariableshi2f0;1g,1Karei.i.d.Bernoulliwithparameterpthevariablespiareindependent,andconditionallyto(h1;:::;hK)havedistributionpi(Uniform[0;1];ifhi=0P1;ifhi=1:Undertherandomeffectsmodel,thep-valuesthusfollowamixturedistribution(1p)U[0;1]+pP1ontheinterval[0;1]andcanbeseenasacontaminatedsample,whilethevariableshiplaytheroleoftheunknownlabels.ItshouldnowbeclearthattheabovemodelisinfactaspecicationoftheSSNDmodel,withthefollowingadditionalassumptions:1.Theobservationspaceistheinterval[0;1]2.ThenominaldistributionP0isknowntobeexactlyuniformon[0;1](equivalently,thenominaldistributionisuniformandthenominalsamplehasinnitesize);3.Theclassofnoveltydetectorsconsideredisthesetofintervalsoftheform[0;];2[0;1]Therefore,theresultsdevelopedinthispapercanapplytothemorerestrictedsettingofmultipletestingundertherandomeffectsmodelaswell.Inparticular,theestimatorbp(F;d)developedinSection6,whenspeciedundertheaboveadditionalconditions,recoversthemethodologyofnon-asymptoticestimationof1pwhichwasdevelopedbyGenoveseandWasserman(2004),Section3,andournotionofpropernoveltydistributionrecoverstheirnotionofpurityinthatsetting(andhassomewhatmoregenerality,sincetheyassumedP1tohaveadensity).ThereareseveralinterestingbenetsinconsideringforthepurposeofmultipletestingthemoregeneralSSNDmodeldevelopedhere.First,itcanbeunrealisticinpracticetoassumethatthedis-tributionofthep-valuesisknownexactlyundereachoneofthenullhypotheses.Instead,onlyassumingtheknowledgeofareferencesampleundercontrolledexperimentalconditionsasintheSSNDmodelisoftenmorerealistic.Thisproblemwasrecentlymotivatedbyproblemsingenomics(GhoshandChinnaiyan,2009)andproteomics(Ghosh,2009),whereinthelatterreferenceasymp-toticanalysiswasalsopresented.Secondly,therestrictiontodecisionsetsoftheformfpigcanalsobequestionable.Forasingletest,decisionregionsofthisformareoptimal(intheNeyman-Pearsonsense)onlyifthelikelihoodratioofthealternativetothenullisdecreasing,whichamountstoassumingthatthealternativedistributionP1hasadecreasingdensity.Thisassumptionhasbeencriticizedinsomerecentwork.Asimpleexampleofasituationwherethisassumptionfailsisintheframeworkofzor-tests,thatis,thenulldistributionofthestatistic(beforerescalingintop-values)isastandardGaussianoraStudent-distribution,andthecorrespondingp-valuefunctionistheusualone-ortwo-sidedp-value.IfthealternativedistributionP1isamixtureofGaussians(resp.ofnoncentraldistributions),optimalrejectionregionsfortheoriginalstatisticareingeneralaniteunionofdisjointintervalsanddonotcorrespondtolevelsetsofthep-values.Inordertocounterthistypeofproblem,SunandCai(2007)suggesttoestimatefromthedatathealternatedensityandtheproportionoftruenullhypotheses,andusetheseestimatesdirectlyinaplug-inlikelihoodratio2989 BLANCHARD,LEEANDSCOTTbasedtest.Chi(2007)developsaprocedurebasedongrowingrejectionintervalsaroundanitenumberofxedcontrolpointsin[0;1].Inbothcases,anasymptotictheoryisdeveloped.Bothoftheseproceduresaremoreexiblethanusingonlyrejectionintervalsoftheform[0;]andaimatadaptivitywithrespecttothealternativedistributionP1Finally,theremainingrestrictionthateffectiveobservations(thep-values)belongtotheunitintervalwasalsoputintoquestionbyChi(2008),whoconsideredasettingofmultidimensionalp-valuesbelongingto[0;1]d.Thedistributionwasstillassumedtobeuniformunderthecorre-spondingnullhypothesis,althoughthisseemsanevenlessrealisticassumptionthanindimensionone.Inthisframework,theuseofareference“nominal”sampleunderthenulldistributionseemsevenmorerelevant.TheframeworkdevelopedinthepresentpaperallowstocoveratoncethesedifferenttypesofextensionsrathernaturallybyjustconsideringaricherclassFofcandidateclassiers(orequiva-lentlyinthissetting,rejectionregions),andprovidesanon-asymptoticalanalysisoftheirbehaviorusingclassicallearningtheoreticaltoolssuchasVCinequalities.Furthermore,suchnon-asymptoticinequalitiescanalsogiverisetoadaptiveandconsistentmodelselectionforthesetofclassiersus-ingthestructuralriskminimizationprinciple,atopicthatwasnotaddressedpreviouslyfortheextensionsmentionedabove.7.2SSNDwithControlledFDROneremainingimportantdifferencebetweentheSSNDsettingstudiedhereandthatofmultipletestingisthatourmainoptimizationproblem(1)isunderafalsepositiverateconstraintR0()awhilemostrecentworkonmultipletestinggenerallyimposesaconstraintonthefalsediscoveryrate(FDR)instead.IfwedenotePos()=bPX((X)=1)=1bRX()=1 nnåi=11ff(xm+i)=1gtheproportionofreportednovelties,andFP()=bPXY((X)=1;Y=0)=1 nnåi=11ff(xm+i)=1;ym+i=0gthe(unavailabletotheuser)proportionoffalsediscoveriesonthecontaminatedsample,thenthefalsediscoveryproportion(FDP)isdenedasFDP()=FP()=Pos()(takentobezeroifthedenominatorvanishes),andtheFDRisdenedasFDR()=E[FDP()].SomeclassicalvariationsofthisquantityarethepositiveFDR,pFDR()=E[FDP()jPos()�0]andthemarginalFDR,mFDR()=E[FP()]=E[Pos()].Underthemixturecontaminationmodel,itcanbecheckedthatpFDR()=mFDR()=PXY(Y=0j(X)=1)(Storey,2003),hencealsoequaltooneminustheprecisionforclass1(asdenedearlierinSection4).Thefollowingresultstatesexplicitempiricalboundsonthesequantities:Proposition12ConsideraclassiersetFforwhichweassumeauniformerrorboundofthefollowingformisavailable:foranydistributionQonf0;1g,withprobabilityatleast1doverthedrawofani.i.d.sampleofsizenaccordingtoQ,both82F Q((X)=1)bQ((X)=1) en(F;d);(13)2990 SEMI-SUPERVISEDNOVELTYDETECTIONand82F Q((X)=1;Y=0)bQ((X)=1;Y=0) en(F;d);(14)hold,wherebQdenotestheempiricaldistributionbuiltonthesample.Thenthefollowinginequalitiesholdwithprobabilityatleast(1d)212d(overthedrawofthenominalandunlabeledsamples):82FmFDR()=PX(Y=0jX=1)(bR0()+em)(1bp(F;d)) (1bRX()en)+;and82FFDP()(bR0()+em)(1bp(F;d))+en (1bRX());wherebp(F;d)isdenedin(12)NotethatEquations(13)and(14)holdasbeforewithen(F;d)=cq Vlognlogd nwhenFhasVCdimensionV.Intheinterestofsimplicity,weusethesameboundenforbothuniformerrorassumptions.Separateboundscouldalsobeadopted,allowing(13)tobeslightlytighter.WealsoremarkthatsinceFDPisanempiricalquantitybasedonthecontaminatedsample,thesecondboundisinfactatransductiveboundratherthansemi-supervised.Proof.ThemFDRcanberewrittenasmFDR()=P0((X)=1jY=0)PXY(Y=0)=PX((X)=1)=R0()(1p)=(1RX()).InthisexpressionwecanpluginthelowerboundforpofThe-orem7anduniformboundsforR0()andRX()comingfromassumption(13).TheFDPcanbewrittenasFDP()=bPXY((X)=1;Y=0)=(1bRX()).Usingassumption(14),thenumeratorcanbeupperboundedbyPXY((X)=1;Y=0)+en=R0()(1p)+en,andwecanthenusethesamereasoningasfortherstpart. SimilarlytowhatwasproposedinSection4underthefalsepositiverateconstraint,wecaninthiscontextconsidertomaximizebRX()over2Fsubjecttotheconstraintthattheaboveempir-icalboundonthemFDRorFDPislessthana.ThiscanthenbesuitablyextendedtoasequenceofclassesFk.Whileafullstudyoftheresultingprocedureisoutofthescopeofthepresentpa-per,wewanttopointouttheimportantdifferencethatthemFDRisnecessarilylowerboundedbyinfx2XPXY(Y=0jX=x)whichisgenerallystrictlypositive.Hence,therequiredconstraintmaynotberealizableifaissmallerthanthislowerbound,inwhichcasetheempiricalprocedureshouldreturnafailurestatementwithprobabilityoneasn!¥8.ExperimentsDespitepreviousworkonlearningwithpositiveandunlabeledexamples(LPUE),asdiscussedinSection2,theefcacyofourproposedlearningreduction,comparedtothemethodofinduc-tivenoveltydetection,hasnotbeenempiricallydemonstrated.Inaddition,weevaluateourpro-posedhybridmethod.Toassesstheimpactofunlabeleddataonnoveltydetection,weappliedourframeworktosomedatasetswhicharecommonbenchmarksforbinaryclassication.Therst13datasets(M¨ulleretal.,2001)arefromhttp://www.fml.tuebingen.mpg.de/Members/2991 BLANCHARD,LEEANDSCOTTDataSet dimNtrainNtestpbase banana 240049000.45breast-cancer 9200770.29diabetes 84683000.35are-solar 96664000.55german 207003000.30heart 131701000.44ringnorm 2040070000.50thyroid 5140750.30titanic 315020510.32twonorm 2040070000.50waveform 2140046000.33image 18130010100.57splice 60100021750.48ionosphere 342511000.64mushrooms 112412440000.48sonar 601081000.47adult 123300030000.24web 300300030000.03Table1:Descriptionofdatasets.dimisthenumberoffeatures,andNtrainandNtestarethenumbersoftrainingandtestexamples.pbaseistheproportionofpositiveexamples(novelties)inthecombinedtrainingandtestdata.Thus,theaverage(acrosspermutations)nominalsamplesizemis(1pbase)Ntrainraetsch/benchmarkandthelastvedatasets(ChangandLin,2001)arefromhttp://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets/Eachdatasetconsistsofbothpositiveandnegativeexamples.Furthermore,eachdatasetisreplicated100times(exceptforimageandsplice,whicharereplicated20times),witheachcopycorrespondingtoadifferentrandompartitioningintotrainingandtestexamples.Allnumericalresultsforadatasetwereobtainedbyaveragingacrossallpartitions.Thenegativeexamplesfromthetrainingsetweretakentoformthenominalsample,andthepositivetrainingexampleswerenotusedatallintheexperiments.ThedatasetsaresummarizedinTable1.HereNtrainandNtestarethesizes1ofthetrainingandtestsets,respectively,whilepbaseistheproportionofpositiveexamplesinthecombinedtrainingandtestdata.Thus,theaverage(acrosspermutations)nominalsamplesizemis(1pbase)NtrainWeemphasizethatintheseexperimentswedonotimplementtheempiricalriskminimization(ERM)algorithmfromSec.4.ThereductiontoNeyman-PearsonclassicationisgeneralandcanbyappliedinconjunctionwithanyNPclassicationalgorithm,whetherthatalgorithmhasassociatedperformanceguaranteesornot.Wehereelecttoapplythereductionusingaplug-inkerneldensityestimate(KDE)classier.ERMiscomputationallyinfeasible,andtheboundstend 1.Thewebandadultdatasetsweresubsampledowingtotheirlargesize.2992 SEMI-SUPERVISEDNOVELTYDETECTIONtobetoolooseinpracticetobeeffective.TheKDEplug-inrulecanbeimplementedefciently,andthereisanaturalinductivecounterpartforcomparison,thethresholdedKDEbasedonthenominalsample.8.1ExperimentalSetupWeevaluatedourmethodologyintwolearningparadigms,comparingvelearningmethodsacrossseveralvaluesofp.Thetwolearningparadigmsaresemi-supervisedandtransductive.Forsemi-supervisedlearning,thetestdataweredividedintotwohalves.Thersthalfwasusedasthecontaminated,unlabeleddata.Thesecondhalfwasusedasanindependentsampleofcontaminateddata,notusedinthelearningstage,butonlyforevaluationofclassiersreturnedbyeachmethod.Inparticular,thesecondhalfofthetestdatawasusedtoestimatetheareaundertheROC(AUC)ofeachmethod.Here,theROCistheonewhichviewsP0asthenulldistributionandP1asthealternative.Fortransductivelearning,theentiretestsetwastreatedastheunlabeleddata,andwasalsousedforevaluatingtheAUC.Thelearningmethodsaretheinductiveapproach,ourproposedlearningreduction,andthreeversionsofthehybridapproach.Thethreehybridscorrespondtopn=1:0,0:5,and0:1,inwhichauniformsampleofsize100pn%oftheunlabeledsamplesizeisappendedtotheunlabeleddata.Weemphasizethateachalgorithmwasimplementedinthesamewayinthetwolearningparadigms;theonlydifferencesarethesizeofthecontaminatedsample,andhowtheyareevaluated.Weimplementedtheinductivenoveltydetectorusingathresholdedkerneldensityestimate(KDE)withGaussiankernel,andSSNDusingaplug-inKDEclassier.Toalleviateconcernsthatourinductiveimplementationisinadequate,wealsotestedtheone-classsupportvectorma-chine(Sch¨olkopfetal.,2001)inseveralexperimentalsettings,andfounditsperformancetobeverysimilar.LettingkdenoteaGaussiankernelwithbandwidths,theinductivenoveltydetectoratdensitylevellis(x)=1if1 måmi=1k0(x;xi)�l0otherwise;andtheSSNDclassieratdensityratiolis(x)=1if(1 nåm+ni=m+1kX(x;xi))=(1 måmi=1k0(x;xi))�l0otherwise:Thehybridmethodisimplementedsimilarly.TheROCsofthesemethodsareobtainedbyvaryingthelevel/thresholdlForeachclass,asinglekernelbandwidthparameterwasemployed,andoptimizedbymaximiz-ingacross-validationestimateoftheAUC.NotethatthisROCisdifferentfromtheoneusedtoevaluatethemethods(seeabove).Inparticular,itstillviewsP0asthenulldistribution,butnowthealternativedistributionistakentobetheuniformdistributionP2fortheinductivedetector(seeSection5;effectivelyweuseauniformrandomsampleofsizeninplaceoftheunlabeleddata),PXforSSND,andtheappropriate˜PXforthehybridmethods(seeSection5).Thus,thetestlabelinformationwasnotusedatanystage(priortovalidation)byanyofthemethods.Wealsocomparedthelearningmethodsforseveralvaluesofp.Forsemi-supervisedlearning,weexaminedp=0:5;p=0:2;p=0:1,andp=0:0.Fortransductivelearning,weexaminedp=0:5;p=0:2,andp=0:1.Thecasep=0:0cannotbeevaluatedinthetransductiveparadigmbecausetherearenopositiveexamplesintheunlabeleddata.Foreachvalueofp,wediscarded2993 BLANCHARD,LEEANDSCOTTjustenoughexamples(eithernegativeorpositive)sothatthedesiredproportionwasachievedinthecontaminateddata.Notethatthenumberofpositiveexamples(novelties)inthecontaminatedsamplecouldbeverysmall.Forthesmallestdatasets,inthesemi-supervisedsettingandwhenp=0:1,thisnumberislessthan10.WeremarkthattheAUCisonlyonepossibleperformancemeasuretoassessouralgorithms.Analternativechoice,andonethatismoredirectlyconnectedtoourtheory,wouldbetoselectseveraldifferentvaluesofa,andcomparethefalsenegativeandfalsepositiverates,foreachvalueofa,acrossalldifferentdatasetsandalgorithms.Itwouldbestraightforwardinourexperimentalsetuptocalibratethethresholds,usingcross-validationforexample,toachievethedesiredfalsepositiverate.WehaveadoptedtheAUCasamoreconcisealternativethatinasenseaveragestheperformanceforNeyman-Pearsonclassicationacrossthecompleterangeofa8.2StatisticalSummariesandMethodologyThecompleteresultsaresummarizedinTables2through5.Tables2and3showtheaverageAUCforeachdatasetandexperimentalsetting,forthesemi-supervisedandtransductiveparadigmsrespectively.TheinductivemethodislabeledInd.OurlearningreductionislabeledSSNDorTNDdependingonthesetting.ThehybridmethodsarelabeledH(pn)inTables2-3,andHybrid(pn)inTables4-5.WefollowedthemethodologyofDemsar(2006)forcomparingalgorithmsacrossmultipledatasets.Foreachdatasetandeachexperimentalsetting,thealgorithmswereranked1(best)through5(worst)basedonAUC.TheFriedmantestwasusedtodetermine,foreachexperimentalsetting,whethertherewasasignicantdifferenceintheaverageranksofthevealgorithmsacrossthedatasets.Theaverageranksandp-valuesarereportedinTables4and5.Theresultsindicatethatthereisasignicantdifferenceamongthealgorithmsatthe0.1signicancelevelforallsettings,withtheexceptionofthetransductivesettingwhenp=0:1.WhentheFriedmantestresultedinsignicantdifferences,wethenperformedapost-hocNe-menyitesttoassesswhentherewasasignicantdifferencebetweenindividualalgorithms.Foravealgorithmexperimenton18datasets,withasignicancelevelof0:1,thecriticaldifferencefortheNemenyitestis1.30.Thatis,whentheaverageranksoftwoalgorithmsdifferbymorethan1.30,theirperformanceisdeemedtobesignicantlydifferent.8.3AnalysisofResultsFromtheresultspresentedinTables2-5,wedrawthefollowingconclusions.1.TheaverageranksinTables4-5conformtoourexpectationsinmanyrespects.SSND/TNDoutranktheinductiveapproachwhenp=0:5,andinductiveoutranksSSNDwhenp=0:0.Attheintermediatevaluesp=0:1and0:2,hybridmethodsachievethebestranking.2.Theaverageranksalsorevealthattheperformanceofthehybridmethodsvaryaccordingtothevalueofp.Aspincreases,thebestperforminghybridhasacorrespondinglysmalleramountofauxiliaryuniformdataappendedtotheunlabeledsample.Thisalsoconformstoourexpectations.2994 SEMI-SUPERVISEDNOVELTYDETECTION dataset =0.5 =0.2 Ind.SSNDH(1.0)H(0.5)H(0.1) Ind.SSNDH(1.0)H(0.5)H(0.1) banana 0.9240.9390.9310.9330.936 0.9240.9150.9240.9230.921breast-cancer 0.6540.6430.6750.6690.667 0.6540.5570.6570.6480.621diabetes 0.7440.7820.7700.7720.776 0.7440.6840.7240.7270.717are-solar 0.6740.6610.6640.6600.662 0.6740.6290.6410.6430.642german 0.6280.7030.6930.6960.704 0.6280.5820.6330.6320.636heart 0.7930.8540.8450.8530.851 0.7930.6900.8050.7890.745ringnorm 0.9990.9970.9960.9960.996 0.9990.9920.9900.9910.983thyroid 0.9850.9660.9640.9670.955 0.9850.8890.9290.9400.943titanic 0.6280.6430.6360.6440.643 0.6280.6120.6360.6340.628twonorm 0.9150.9930.9890.9890.990 0.9150.9400.9610.9580.953waveform 0.7610.9580.9520.9450.956 0.7610.8390.8480.8960.901image 0.8180.9390.9290.9350.939 0.8180.8920.8740.8790.875splice 0.4150.9350.9050.9210.932 0.4150.7020.6130.7640.785ionosphere 0.2560.9260.8390.9210.922 0.2560.6950.4750.6070.704mushrooms 0.9451.0001.0001.0001.000 0.9450.9990.9990.9990.999sonar 0.6880.7520.7570.7640.764 0.6880.5950.6820.6830.646adult 0.6050.8720.8720.8640.835 0.6050.7050.7200.8290.720web 0.4620.7780.7490.6970.788 0.4620.6160.6310.5850.674 dataset =0.1 =0.0 Ind.SSNDH(1.0)H(0.5)H(0.1) Ind.SSNDH(1.0)H(0.5)H(0.1) banana 0.9240.8910.9220.9190.913 0.9240.5400.9190.9050.785breast-cancer 0.6540.5150.6430.6330.575 0.6540.5560.6400.6280.568diabetes 0.7440.6050.6990.7000.692 0.7440.4940.6890.6690.657are-solar 0.6740.5710.6240.6290.626 0.6740.4710.6130.6030.611german 0.6280.5480.6230.6240.602 0.6280.5220.5950.6080.592heart 0.7930.5930.7780.7760.688 0.7930.5060.7590.7500.620ringnorm 0.9990.9840.9810.9860.991 0.9990.4780.9580.9780.985thyroid 0.9850.7860.8840.9060.895 0.9850.5900.8520.8690.795titanic 0.6280.5910.6320.6340.621 0.6280.4430.6300.6280.572twonorm 0.9150.9310.9450.9340.923 0.9150.4800.8940.8790.860waveform 0.7610.8010.8150.8220.806 0.7610.4870.7360.7270.705image 0.8180.7690.8240.8360.851 0.8180.4310.6340.6960.780splice 0.4150.6300.5180.5840.625 0.4150.5230.4470.4930.493ionosphere 0.2560.6180.4380.4880.575 0.2560.5200.3920.4310.486mushrooms 0.9450.9950.9920.9980.996 0.9450.5660.9720.9800.982sonar 0.6880.5560.6580.6520.615 0.6880.5100.6280.6430.587adult 0.6050.6270.6590.6660.626 0.6050.5050.5580.5560.572web 0.4620.5540.5840.5440.611 0.4620.5570.5530.5230.564 Table2:AUCvaluesforvenoveltydetectionalgorithmsinthesemi-supervisedsetting.`H'indi-catesahybridmethod.2995 BLANCHARD,LEEANDSCOTT dataset =0.5 =0.2 Ind.TNDH(1.0)H(0.5)H(0.1) Ind.TNDH(1.0)H(0.5)H(0.1) banana 0.9240.9380.9310.9320.935 0.9240.9150.9230.9230.919breast-cancer 0.6630.6730.6620.6620.670 0.6630.6150.6490.6590.630diabetes 0.7420.7840.7760.7790.788 0.7420.7080.7280.7250.727are-solar 0.6730.6860.6830.6840.684 0.6730.6610.6580.6620.666german 0.6330.7390.7090.7110.714 0.6330.6170.6320.6370.636heart 0.7960.8690.8560.8560.864 0.7960.7160.8110.7940.788ringnorm 0.9990.9970.9960.9960.996 0.9990.9930.9890.9910.983thyroid 0.9840.9760.9780.9790.974 0.9840.9570.9620.9550.962titanic 0.6290.6670.6460.6580.661 0.6290.6420.6410.6580.645twonorm 0.9150.9930.9900.9900.990 0.9150.9400.9610.9610.956waveform 0.7710.9600.9530.9470.957 0.7710.8470.8500.9000.905image 0.8450.9550.9490.9490.953 0.8450.8970.8890.8910.901splice 0.4160.9410.9130.9300.939 0.4160.7160.6230.7690.820ionosphere 0.2540.9530.8440.9310.952 0.2540.7140.4130.6330.746mushrooms 0.9451.0001.0001.0001.000 0.9450.9990.9990.9990.999sonar 0.6830.7570.7670.7780.781 0.6830.6150.6780.6830.662adult 0.6060.8750.8730.8650.835 0.6060.6870.7360.8470.739web 0.4640.8100.7580.7270.788 0.4640.6440.6390.5900.667 dataset =0.1 Ind.TNDH(1.0)H(0.5)H(0.1) banana 0.9240.8960.9210.9200.910 breast-cancer 0.6630.5640.6870.6420.598 diabetes 0.7420.6580.7200.7090.693 are-solar 0.6730.6150.6550.6430.659 german 0.6330.5560.6150.6160.615 heart 0.7960.6260.7920.7840.729 ringnorm 0.9990.9850.9730.9860.992 thyroid 0.9840.9100.9700.9550.932 titanic 0.6290.6030.6430.6420.626 twonorm 0.9150.9330.9430.9370.923 waveform 0.7710.8130.8210.8230.808 image 0.8450.8880.8700.8710.880 splice 0.4160.6300.5540.5530.640 ionosphere 0.2540.5890.3490.4430.552 mushrooms 0.9450.9960.9940.9970.997 sonar 0.6830.5140.6460.6550.592 adult 0.6060.6580.6810.6840.629 web 0.4640.5670.5730.5380.604 Table3:AUCvaluesforvenoveltydetectionalgorithmsinthetransductivesetting.`H'indicatesahybridmethod.2996 SEMI-SUPERVISEDNOVELTYDETECTIONp InductiveSSNDHybrid(1.0)Hybrid(0.5)Hybrid(0.1) p-value 0.0 1.894.392.722.893.11 0.0000.1 2.834.002.832.283.06 0.0230.2 3.283.832.612.562.72 0.0710.5 4.281.943.443.002.33 0.000Table4:Thecomparisonofaverageranksofthevealgorithmsinthesemi-supervisedsetting,bytheFriedmantest.Thecriticaldifferenceofthepost-hocNemenyitestis1:30atasignicancelevelof0:1.p InductiveTNDHybrid(1.0)Hybrid(0.5)Hybrid(0.1) p-value 0.1 2.943.782.562.673.06 0.1570.2 3.173.783.062.502.50 0.0850.5 4.441.443.563.172.39 0.000Table5:Thecomparisonofaverageranksofthevealgorithmsinthetransductivesetting,bytheFriedmantest.Thecriticaldifferenceofthepost-hocNemenyitestis1:30atasignicancelevelof0:1.3.Alltablesindicatethattheproposedmethodologyperformsbetterinthetransductivesettingthanthesemi-supervisedsetting.Alikelyreasonisthat,inourexperimentalsetup,TNDseestwiceasmuchunlabeleddataasSSND.4.Whenp=0:0inthesemi-supervisedexperiments,SSNDtypicallyhasanAUCaround0.5,whichcorrespondstorandomguessing.Thismakessense,becauseitisessentiallytryingtoclassifybetweentworealizationsofthenominaldistribution.FromTables2and4weseethatthehybridmethodsclearlyimproveuponSSNDwhenp=0:0.5.Forsomedatasets(splice,ionosphere,web),theinductivemethoddoesworsethanrandomguessing,butourmethodsdonot.Ineachcase,ourmethodsyielddramaticincreasesinAUC.6.Thebenetsofunlabeleddataincreasewithdimension.Inparticular,SSNDandTNDtendtoperformmuchbetterrelativetotheinductiveapproachondatasetsofdimensionatleast18.Thisisespeciallyevidentinthesecondhalfofthedatasets,whichevenshowsignicantgainsforp=0:1.Thistrendsuggeststhatasdimensionincreases,theassumptionimplicitintheinductiveapproach(thatnoveltiesareuniformwheretheyoverlapthesupportofthenominaldistribution)breaksdown.Figure1depictsasamplingofresultscomparingtheinductiveandsemi-supervisedmethods,andhighlightstheimpactofdimension.ThetopgraphshowsROCsforatwo-dimensionaldatasetwherethetwoclassesarefairlywellseparated,meaningthenoveltieslieinthetailsofthenominalclass,andp=0:5.Notsurprisingly,theinductivemethodisclosetothesemi-supervisedmethod.Themiddlegraphrepresentsthe60-dimensionalsplicedataset,wheretheinductivemethoddoesworsethanrandomguessing,yetSSNDdoesquitewell.ThebottomgraphinFigure1showstheresultsforthe21-dimensionalwaveformdataforp=0:1.Heretheassumptionsoftheinductiveapproacharealsoevidentlyviolatedtosomedegree.2997 BLANCHARD,LEEANDSCOTT 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False PositiveTrue Positivebanana (p = 0.5) Semi-supervised (AUC = 0.9331) Inductive (AUC = 0.9174) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False PositiveTrue Positivesplice (p = 0.5) Semi-supervised (AUC = 0.9456) Inductive (AUC = 0.4246) 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False PositiveTrue Positivewaveform (p = 0.1) Semi-supervised (AUC = 0.8932) Inductive (AUC = 0.8238) Figure1:Illustrativeresultsfromthesemi-supervisedsetting.Top:Inthe2-dimensionalbananadata,thetwoclassesarewellseparated,andtheinductiveapproachfareswell.Middle:Inthe60-dimensionalsplicedata,theinductiveapproachdoesworsethanrandomguessing.Bottom:Inthe21-dimensionalwaveformdata,unlabeleddatastilloffergainswhenpissmall(here0:1).2998 SEMI-SUPERVISEDNOVELTYDETECTION9.ConclusionsWehaveshownthatsemi-supervisednoveltydetectionreducestoNeyman-Pearsonclassication.ThisallowsustoleverageknownperformanceguaranteesforNPclassicationalgorithms,andtoimportpracticalalgorithms.Wehaveappliedtechniquesfromstatisticallearningtheory,suchasuniformdeviationinequalities,toestablishdistributionfreeperformanceguaranteesforSSND,aswellasadistributionfreelowerboundanduniversallyconsistentestimatorforp,andtestforp=0.Ourapproachoptimallyadaptstotheunknownnoveltydistribution,unlikeinductiveapproaches,whichoperateasifnoveltiesareuniformlydistributed.WealsointroducedahybridmethodthathasthepropertiesofSSNDwhenp�0,andeffectivelyrevertstotheinductivemethodwhenp=0.Ouranalysisstronglysuggeststhatinnoveltydetection,unliketraditionalbinaryclassication,unlabeleddataareessentialforattainingoptimalperformanceintermsoftightbounds,consistency,andratesofconvergence.Inanextensiveexperimentalstudy,wefoundthattheadvantagesofourapproacharemostpronouncedforhighdimensionaldata.OuranalysisandexperimentsconrmsomechallengesthatseemtobeintrinsictotheSSNDproblem.Inparticular,SSNDismoredifcultforsmallerp.Furthermore,estimatingthenoveltyproportionpcanbecomearbitrarilydifcultasthenominalandnoveldistributionsbecomeincreasinglysimilar.Ourmethodologyalsoprovidesgeneralsolutionstotwowell-studiedproblemsinhypothesistesting.First,ourlowerboundonptranslatesimmediatelytoatestforp=0,whichamountstoadistribution-freesolutiontothetwo-sampleproblem.Second,wealsoshowthatSSNDprovidesapowerfulgeneralizationofstandardmultipletesting.Importantproblemsforfutureworkwillincludedevelopingpracticalmethodologiesfortheseproblemsbasedonourtheoreticalframework.AcknowledgmentsC.ScottandG.LeeweresupportedinpartbyNSFAwardNo.0830490.G.Blanchardwasare-searcherattheWeierstrassInstitut(WIAS),Berlin,whiletakingpartinthiswork,andacknowledgessupportofthePASCAL2networkofexcellenceoftheEuropeancommunity,FP7ICT-216886.AppendixA.ProofsTheremainingproofsarenowpresented.A.1ProofofTheorem2Forthersttwoclaimsofthetheorem,wedirectlyapplyTheorem3ofScottandNowak(2005)totheproblemofNPclassicationofP0versusPX,andobtainthatforasuitablechoiceofconstantsc;c0wehavewithprobabilityatleast1dR0(bt)ac0enRX(bt)RX()c0em:Fromthis,wededuce(3)-(4)byapplicationofTheorem1.Forthesecondclaim,byapplicationofBayes'rulewehaveforanyclassierQ0()=(1p)(1R0()) PX((X)=0)=(1p)(1R0()) pR1()+(1p)(1R0())2999 BLANCHARD,LEEANDSCOTTandQ1()=p(1R1()) PX((X)=1)=p(1R1()) (1p)R0()+p(1R1()):Observethatfora;b�0thefunctionz(x;y)=a(1x) by+a(1x)isdecreasinginy2R+forx2(¥;1]anddecreasinginx2[0;1+by=a)fory2R+.Hence,using(3)-(4)andthefactthatRi()2[0;1]wederivealowerboundonQ0(bt)asfollows:Q0(bt)=(1p)(1R0(bt)) pR1(bt)+(1p)(1R0(bt))(1p)(1ac0en) p(R1()+c0p1(en+em))+(1p)(1R0()c0en)=(1p)(1ac0en) PX((X)=0)+c0(em+pen)(1p)(1a) PX((X)=0)+c0(em+pen)c0(1p)en PX((X)=0)(1p)(1a)c0(1p)en PX((X)=0)(1p)(1a)c0(em+pen) PX((X)=0)2Q0()c0(en+em) PX((X)=0):Therstinequalitycomesfromthemonotonicitypropertiesofz(x;y)(appliedrstwithrespecttoy,thenx).Thesecondiselementary.Inthethirdinequalityweusedthefactthatthefunctiongd7!g(d)=A B+disconvexforA;B;dpositiveandhasderivativeA=B2inzero,sothatg(d)A BdA B2withA=(1p)(1a);B=PX((X)=0);d=c0(em+pen).Inthelastinequalityweused(withthesamedenitionforA;B)thatA B=Q0()1.ThetreatmentforQ1issimilar.Supposerstthatc0(en+em)PX((X)=1):(15)WethenhaveQ1(bt)=p(1R1(bt)) (1p)R0(bt)+p(1R1(bt))p(1R1()c0p1(en+em)) (1p)(a+c0en)+p(1R1()c0p1(en+em))=p(1R1())c0(en+em) PX((X)=1)c0(pen+em)Q1()c0(en+em) PX((X)=1)c0(en+em);whereweusedagainthemonotonicitypropertiesofz(x;y).Notethatassumption(15)ensuresthatalldenominatorsintheabovechainofinequalitiesarepositive,whichisrequiredforthese3000 SEMI-SUPERVISEDNOVELTYDETECTIONinequalitiestohold.Now,sinceQ1(bt)0andQ1()1,theaboveimpliesQ1(bt)Q1()min1;c0(en+em) PX((X)=1)c0(en+em)Q1()2c0(en+em) PX((X)=1)inthelastinequalityweusedthefactthatmin(1;x=(1x))2xforx2[0;1)(withx=c0(en+em)=PX((X)=1)).If(15)isnotsatised,thenthelastdisplaystilltriviallyholdssinceitsRHSisnonpositive.Thisestablishes(5)for=1.A.2ProofofProposition5Letpbedenedasp=inffa2[0;1]9Qprobabilitydistribution:PX=(1a)P0+aQg:Wewanttoestablishthatpsatisestheclaimsofthetheorem(andinparticularthattheaboveinmumisaminimum).InthecaseP0=PX,obviouslywehavep=0andthisisaminimum.WenowassumefortheremainderoftheproofthatP0=PX.ConsidertheLebesguedecompositionPX=P0X+P?XwithP0XP0(thatis,P0XisabsolutelycontinuouswithrespecttoP0)andP?X?P0X(thatis,P?XandP0aremutuallysingular).Let=dP0X=dP0andabetheessentialinmumofwrt.P0.Weclaimthatp=1a.ObserverstthataEXP0[(X)]=P0X()PX()=1:Inparticular,a=1mustimplythattheaboveinequalitiesareequalities,hencethatEXP0[]=aThelattercanonlybevalidif=a=1P0-a.s.,implyingthatP0=P0X,andfurtherP0=PX,whichweexcludedbefore.Thereforeitholdsthata1.CertainlywethenhavethevaliddecompositionPX=aP0+(1a)P1;P1=(1a)1(a)P0+P?X;sothatp1aBydenitionofsingularmeasuresthereexistsameasurablesetDsuchthatP0(D)=1andP?X(D)=0.Fixe&#x-0.9;畵0;bydenitionoftheessentialinmumthereexistsameasurablesetCsuchthatP0(C)&#x-0.9;畵0anda+eP0-a.s.onC.PutA=C\D.ThenP0(A)=P0(C)&#x-0.9;畵0.FurthermoreP1(A) P0(A)=EXP0(1a)1(a)1fX2Ag P0(A)e=(1a):ExistenceofadecompositionoftheformP1=(1g)Q+gP0impliesthatforanymeasurablesetAP1(A)gP0(A).Hencetheaboveimpliesthatg=0foranysuchdecomposition,thatis,P1mustbeapropernoveltydistributionwrt.P0.Italsoimpliesthatforanye�0thereexistsameasurablesetAwithP0(A)�0andPX(A)=P0(A)a+e.HenceforanydecompositionPX=(1a)P0+aQitmustholdthat(1a)a,sothatp1a.Wethusestablishedp=1aandtheexistenceofthedecomposition.Concerningtheunicity,thedecompositionestablishedaboveimpliesthatforanyapPX=(1a)P0+aQholdswithQ=(1p a)P0+p aP1.NotethatforanyxedaexistenceofadecompositionPX=(1a)P0+aQuniquelydeterminesQ.Hencefora�pthecorrespondingQisnotapropernoveltydistribution,andtheonlyvaliddecompositionofPXintoP0andapropernoveltydistributionistheoneestablishedpreviously.3001 BLANCHARD,LEEANDSCOTTA.3LemmaUsedinProofofTheorem6FortheproofofTheorem6wemadeuseofthefollowingauxiliaryresult:Lemma13AssumeP1isapropernoveltydistributionwrt.P0.Thenforanye�0thereexistsa(deterministic)classierfsuchthatR0()1andR1() 1R0()e:Proof.SinceP1isapropernoveltydistributionwrt.P0,reiteratingthereasoningintheproofofProposition5showsthatthereexistsameasurablesetAwithP0(A)�0andP1(A)=P0(A)e.Puta=1P0(A)1.Considertheclassier(x)=1fx2Acg.ThenR0()=P0(=1)=a,while0R1()=P1(=0)=P1(A)e(1a):(16)Thisleadstothedesiredconclusion. A.4ProofofTheorem8ByapplicationofLemma13,foranye�0thereexistsaclassiersuchthatR1(f) 1R0(f)e.ThenwehaveasintheproofofTheorem6:1RX() 1R0()=p1R1() 1R0()p(1e):Fixg�0anddeneeP=1 2(P0+P1).Usingtheassumptionofuniversalapproximation,pickksuchthatthereexistsk2FkwitheP(k(X)=(X))g.SinceeP1 2P0andeP1 2P1thisimpliesalsoP0(k(X)=(X))2gaswellasPX(k(X)=(X))2gFromnowweonlyworkintheclassFkandsoweomittheparametersinthenotationeiei(Fk;dk2).Bytheunionbound,theuniformcontroloftheform(11)isvalidsimultaneouslyforallFk,withprobability1cd(withc=p2=6).Hencewithprobability1cd=1c(mn)2,wehavebR0(k)R0(k)+emR0()+2g+em;andalsobRX(k)RX(k)+enRX()+2g+en:Fromthiswededucethatwithprobability1c(mn)2bp(d)bp(Fk;(mn)2k2)1RX()+2g+2en 1R0()2g2em:Sinceen;emgotozeroasmin(m;n)goestoinnitywededucethata.s.(usingtheBorel-Cantellilemma,andthefactthattheerrorprobabilitiesaresummableover(m;n)2N2)liminfmin(m;n)!¥bp(d)1RX()+2g 1R0()2gp(1e)1R0() 1R0()2g4g 1R0()2g:Takingthelimitoftheaboveasg!0(forxedeand),thenase!0,leadstotheconclusion.3002 SEMI-SUPERVISEDNOVELTYDETECTIONA.5ProofofTheorem9Theprincipleoftheargumentisrelativelystandardandcanbesketchedasfollows.Assumethereexistsanon-trivialuppercondenceboundbp+(d)ontheproportionofanomalies.Fromthenon-trivialityassumption,thereexistsageneratingdistributionPandasetofsamplesBsuchthatbp+(d)1forallsamplesinBwhileP(B)&#x-0.9;畵d.Wewillconstructbelowanalternategenerat-ingdistributionePandsetofsampleseBBwhichareveryclosetoPandB,inparticularsatisfyingeP(eB)&#x-0.9;畵d;howevertheproportionofanomaliesforePis1,contradictingtheuniversalityofbp+sincebp+(d)1forallsamplesineBLetP0;P1;d;pbegivenbythenon-trivialityassumptionandP=P\nm0\nP\nnXdenotecorrespond-inglythejointdistributionofnominalandcontaminateddata.Fixsomeg&#x-0.9;畵0andasetDsuchthat0P0(D)g(suchasetalwaysexistsbytheassumptionthatP0isweaklydiffuse).PutA=Dcsothat1gP0(A)1.ConsiderthedistributionP0conditionaltobelongingtoA,givenby1x2A P0(A)P0.SinceithasitssupportstrictlyincludedinthesupportofP0,itisapropernoveltydistribu-tionwithrespecttoP0.Therefore,sinceP1isalsoapropernoveltydistributionwithrespecttoP0soiseP1=(1p)1x2A P0(A)P0+pP1NowconsiderthenoveltydetectionproblemwithnominaldistributionP0,noveltydistributioneP1,andnoveltyproportionep=1,sothatePX=(1ep)P0+epeP1=eP1.Finally,denethemodiedjointdistributiononnominalandcontaminateddataeP=P\nm0\neP\nnXBythenon-trivialityassumption,thereexistsasetBof(m;n)samplessuchthatbp+(d)1onthesetBandP(B)=d0&#x-0.9;畵d.DenoteA=mAn.Byassumption,P(A)(1g)n;furthermorefromtheconstructionofePitcanbeveriedstraightforwardlythatforanysetCAeP(C)P(C)DenenoweB=B\A;wehaveeP(eB)P(eB)d0(1(1g)n):Henceforgsmallenough,wehaveeP(eB)&#x-0.9;畵dwhichcontradictsthefactthatbp+(d)isa1dcondenceupperbound,sinceoneBwehavebp+(d)1=epA.6ProofofCorollary10Putg=supP2P(m;n)Ejbppjp:ByMarkov'sinequality,itmeansthatforanygeneratingdistributionPPp&#x-0.9;畵bp+g d1 pd:Hencebp+(d)=bp+ d1 pisadistribution-freeuppercondenceboundonpBytheorem(9),thisboundmustbetrivial.Weinvestigatetheconsequencesofthisfact.LetusconsideranygeneratingdistributionP2P(m;n)forwhichp=0,andthenominaldistributionP0isweaklydiffuse(itispossibletondsuchaP0sinceisinnite).Assumed1 2andxsomed02(d;1d).Markov'sinequalityagainimpliesthatforthisdistribution,P bpg d01 p!d0;3003 BLANCHARD,LEEANDSCOTTsothatPbp+(d)2g d1 pd01d:Choosingd0=1=2,d=1=3,theaboveimpliesthatbp+(d)wouldbenon-trivialifg2p=3.Thuswemustconcludethatgc(p)=2p=3.AppendixB.RandomizedclassiersandROCsInthisappendixwereviewsomepropertiesofROCsthatarerelevanttooursetting.Theseshouldbeconsideredwell-known(seeforexamplethepaperofFawcett,2006,andreferencestherein,foranoverviewofROCanalysisoversetsofclassiers).Forthesakeofcompleteness,wegiveherestatementsandshortproofsthataretailoredtocorrespondpreciselytothecontextconsideredinthemainbodyofthepaper.LetFbeaxedsetofclassiers,andrecalltheNeyman-Pearsonclassicationoptimizationproblem(1),restatedhereforconvenience:R1;a(F)=inff2FR1()(1)s.t.R0()a:WecalloptimalROCofP1versusP0forsetFthefunctiona2[0;1]7!1R1;a(F)2[0;1]Forreference,rstconsideravery“regular”casewhereFisthesetofallpossibledeterministicclassiers,andoneassumesthatbothclassprobabilitiesP0;P1havedensitiesh0;h1withrespecttosomereferencemeasure,suchthatthelikelihoodratioF(x)=h1(x) h0(x)iscontinuouswithinfF=0andsupF=+¥.Thentheoptimalsolutionsaof(1)areindicatorsofsetsoftheformCl=x2h1(x) h0(x)l;withl(a)suchthatP0(Cl)=a.InthiscaseR1;a(F)=P1(Cl(a))anditcanbeshownthattheROCiscontinuous,nondecreasingandconcavebetweenthepoints(0;0)and(1;1).InparticularinthiscaseitholdsthatR0(a)=aWhensomeoftheaboveassumptionsarenotsatised,forexampleifweconsideranarbi-trarysubsetFofclassiers(whichisofcoursenecessaryforadequateestimationerrorcontrol),ortheprobabilitydistributionsP0andP1haveatoms(whichisthecaseinpracticeforempiricaldistributions),someofthesepropertiesmayfailtohold.WhileitisclearthattheoptimalROCisalwaysanondecreasingfunction,itmightfailtobeconcave,andtheoptimalsolutionmighthaveR0(a)a.ThisisforexampleobviouslythecaseifFisanitesetofclassiers,inwhichcasetheROCisastepfunctionandR0()canonlytakenitelymanyvalues.WeareinterestedinthefollowingregularitypropertiesdependingonF;P0andP1(A')Foranya2(0;1),thereexistsasequencen2FsuchthatR0(n)=aandR1(n)!R1;a(F)(B)Thefunctiona7!R1;a(F)=(1a)isnondecreasingon[0;1]Notethatforsimplicityofexposition,inthemainbodyofthepaperwesimpliedproperty(A')into(A),wherethesequencenisreplacedbyitslimit,assumedtobelongtotheconsideredsetofclassiers.Ourresultsstillholdunder(A')withstraightforwardmodicationsoftheproofs.3004 SEMI-SUPERVISEDNOVELTYDETECTIONCondition(B)statesthattheslopeofthelinejoiningthepointoftheoptimalROCataandthepoint(1;1)isnonincreasingina;thisassumptionisweakerthanconcavityoftheROC.Itisrelevantforthediscussioninthenalparagraphbelow,relatedtoourresultonprecision.ToensureregularitypropertiesoftheROC,astandarddeviceistoextendtheclassFandcon-siderrandomizedclassiers,whoseoutputisnotadeterministicfunction,butaBernoullivariablewithprobabilitydependingonthepointx.Formallythisamountstoallowingaclassiertotakerealvaluesintheinterval[0;1];nowforagivenxthenaldecisionD(;x)isaBernoullivari-able(independentofeverythingelse)takingvalue1withprobability(x)and0withprobability1(x).Inthissettingtheerrorprobabilitiesbecomefory=0;1:Ry()=Py(D(;X)=y)=Ey[j(X)yj];whereEyisashortcutforEXPy.Althoughthefunctionitselfremainsstrictlyspeakingdetermin-istic,inviewoftheaboveinterpretationwewillcallwithsomeabuseadeterministicclassierifittakesvaluesinf0;1gandrandomizedclassieriftakesvaluesin[0;1]Weconsidertwotypesofextensionsofa(usuallydeterministic)classF,therstoneistheconvexhullofF,orfullrandomization F=(g g=Nåi=1liiN2N;i2F;li0for1N;Nåi=1li=1):ThesecondisgivenbyF+=fgjg=l+(1l);2F;l2[0;1]g;wheretherandomizationislimitedtoconvexinterpolationbetweenoneclassierofthebaseclassandtheconstantclassierequalto1.ThefollowingstandardlemmasummarizesthepropertiesoftheoptimalROCcurvefortheseextendedclasses:Lemma14LetFbeasetofdeterministicclassierscontainingtheconstantclassierequaltozero,andletP0;P1bearbitrarydistributionson.Thenassumptions(A')and(B)aremetwhenconsideringoptimizationproblem(1)overeither ForF+.TheoptimalROCfortheset Fisconcave.Proof.ThefactthattheconstantzeroclassierbelongstoFensuresthattheinmumin(1)(overeitheroftheclassesFF+or F)isnottakenoveranemptysetandexists.ThedenitionofaninmumensuresthatthereexistsasequencegnofelementsofF+suchthatR0(gn)aandR1(gn)&R1;a(F+).Thenputtingln=(1a)=(1R0(gn)),thesequencen=lngn+(1ln)belongstoF+,andissuchthatR0(n)=E0[n]=(1a) (1R0(gn))R0(gn)+1(1a) (1R0(gn))=awhileR1;a(F+)R1(n)=1E1[n]=lnR1(gn)R1(gn)&R1;a(F+);andthusensures(A').Thesamereasoningappliesto F3005 BLANCHARD,LEEANDSCOTTForproperty(B),considerasequence(n)fromproperty(A'),anumberb2[a;1]andhn=(1z)n+zwherez=(1b)=(1a)2[0;1].Thenhn2F+R0(hn)=bandR1(hn)=(1z)R1(n)+z.LettingngrowtoinnityweobtainR1;b(F+)(1z)R1;a(F+)+zwhichinturnimplies(B)Inthecaseof F,similarlyconsidersequencesn;1;n;2likeabovefora=a1;resp.a=a2witha2a1;foranyb2[a1;a2],writeb=la1+(1l)a2;correspondinglythesequenceln;1+(1l)n;2belongsto FandensuresthatR1;b( F)lR1;a1( F)+(1l)R1;a2( F)thatis,theoptimalROCfor Fisconcave. Concerningestimationerrorcontrolfortheextendedclasses,auniformcontroloftheerroronthebaseclassissufcient,sinceitcarriesovertotheextendedclassesbyconvexcombination.Tobemorespecic,letusconsiderforexampletheestimationofR0()uniformlyover2 F.TheempiricalcounterpartofR0()isgivenbybR0()=bP0(D(;X)=0)=bE0[(X)];wherebE0denotestheempiricalexpectationonthenominalsample.Bydenitionof Fcanbewritten=åNi=1liiwithåNi=1li=1andli0;i2Ffor1N,andthustheestimationerroriscontrolledasfollows: R0()bR0() = E0[(X)]bE0[(X)] Nåi=1jlij P0(i(X)=1)bP0(i(X)=1) supf2F P0((X)=1)bP0((X)=1) ;Therefore,ifanerrorcontroloftheform(11)holdsoverthebaseclassF(forexampleifitisaVC-class),thenthesametypeofboundholdsforquantitiesofinterestovertheextendedclassesF+and FForpracticalpurposes,itmightbesignicantlymoredifculttondthesolutionofthe(empir-icalversionof)(1)forrandomizedclassesandinparticularforthefullyrandomizedextension FAnadvantageofthemorelimitedformofrandomizationisthatoptimizationproblem(1)overF+canberewrittenequivalentlyasanoptimizationproblemovertheoriginalclass,namelyasinfh2FR1(h) 1R0(h)s.t.R0(h)a:(17)Toseewhy,assumeforsimplicityofexpositionthat(A)ratherthan(A')issatised.Thentheoptimizationproblem(1)overF+isattainedforsomerandomizedclassier;byconstructionisoftheform=lh+(1l)forsomel2[0;1]andh2F.Byproperty(A)wecanassumeR0()=a,whichentailsl=(1a)=(1R0(h))andR1()=(1a)R1(h)=(1R0(h))hencetheequivalencewith(17)(withtheaboverelationbetweenandh).Finally,ingeneralwecaninterprettheoptimizationproblem(17)asamaximizationoftheclass0precision,Q0()=PXY(Y=0j(X)=0)=(1p)(1R0()) (1p)(1R0())+pR1()=(1p) (1p)+pR1(f) 1R0(f);3006 SEMI-SUPERVISEDNOVELTYDETECTIONundertheconstraintR0()a,sincetheabovedisplayshowsthatQ0()isadecreasingfunctionoftheratioR1()=(1R0()).Ingeneralifproperties(A)and(B)aresatisedfortheconsideredclass,thenitiseasytoseethatthesolutionsto(1)and(17)coincide,sothatthesameclassierachievestheminimumFNRandclass0precisionundertheconstraintontheFPR.ReferencesS.Ben-David,T.Lu,andD.P´al.Doesunlabeleddataprovablyhelp?Worst-caseanalysisofthesamplecomplexityofsemi-supervisedlearning.InR.ServedioandT.Zhang,editors,Proc.21stAnnualConferenceonLearningTheory(COLT),pages33–44,Helsinki,2008.A.Beygelzimer,V.Dani,T.Hayes,J.Langford,andB.Zadrozny.Error-limitingreductionsbetweenclassicationtasks.InL.DeRaedtandS.Wrobel,editors,Proceedingsofthe22ndInternationalMachineLearningConference(ICML).ACMPress,2005.A.Cannon,J.Howse,D.Hush,andC.Scovel.LearningwiththeNeyman-Pearsonandmin-maxcriteria.TechnicalReportLA-UR02-2951,LosAlamosNationalLaboratory,2002.C.C.ChangandC.J.Lin.LIBSVM:Alibraryforsupportvectormachines.http://www.csie.ntu.edu.tw/cjlin/libsvm,2001.Z.Chi.OntheperformanceofFDRcontrol:Constraintsandapartialsolution.Ann.Stat.,35(4):1409–1431,2007.Z.Chi.Falsediscoveryratecontrolwithmultivariatep-values.ElectronicJournalofStatistics,2:368–411,2008.J.Demsar.Statisticalcomparisonsofclassiersovermultipledatasets.J.MachineLearningResearch,7:1–30,2006.F.Denis.PAClearningfrompositivestatisticalqueries.InProc.9thInt.Conf.onAlgorithmicLearningTheory(ALT),pages112–126,Otzenhausen,Germany,1998.F.Denis,R.Gilleron,andF.Letouzey.Learningfrompositiveandunlabeledexamples.TheoreticalComputerScience,348(1):70–83,2005.L.Devroye.Anydiscriminationrulecanhaveanarbitrarilybadprobabilityoferrorfornitesamplesize.IEEETrans.Patt.Anal.Mach.Intell.,4:154–157,1982.B.Efron,R.Tibshirani,J.D.Storey,andV.Tusher.EmpiricalBayesanalysisofamicroarrayexperiment.JournaloftheAmericanStatisticalAssociation,96:1151–1160,2001.R.El-YanivandM.Nisenson.Optimalsingle-classclassicationstrategies.InB.Sch¨olkopf,J.Platt,andT.Hoffman,editors,Adv.inNeuralInform.Proc.Systems19.MITPress,Cambridge,MA,2007.C.ElkanandK.Noto.Learningclassiersfromonlypositiveandunlabeleddata.InProceedingofthe14thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining(KDD08),pages213–220,2008.3007 BLANCHARD,LEEANDSCOTTT.Fawcett.AnintroductiontoROCanalysis.PatternRecognitionLetters,27:861–874,2006.C.GenoveseandL.Wasserman.Astochasticprocessapproachtofalsediscoverycontrol.AnnalsofStatistics,32(3):1035–1061,2004.D.Ghosh.Assessingsignicanceofpeptidespectrummatchesinproteomics:Amultipletestingapproach.StatisticsinBiosciences,1:199–213,2009.D.GhoshandA.M.Chinnaiyan.Genomicoutlierproleanalysis:Mixturemodels,nullhypothe-ses,andnonparametricestimation.Biostatistics,10:60–69,2009.A.Gretton,K.M.Borgwardt,M.Rasch,B.Sch¨olkopf,andA.J.Smola.Akernelmethodforthetwo-sampleproblem.InB.Sch¨olkopf,J.Platt,andT.Hoffman,editors,AdvancesinNeuralInformationProcessingSystems19,pages513–520.MITPress,Cambridge,MA,2007.A.Hero.Geometricentropyminimizationforanomalydetectionandlocalization.InB.Sch¨olkopf,J.Platt,andT.Hoffman,editors,Adv.inNeuralInform.Proc.Systems19.MITPress,Cambridge,MA,2007.J.LaffertyandL.Wasserman.Statisticalanalysisofsemi-supervisedregression.InJ.C.Platt,D.Koller,Y.Singer,andS.Roweis,editors,AdvancesinNeuralInformationProcessingSystems20,pages801–808.MITPress,Cambridge,MA,2008.W.S.LeeandB.Liu.Learningwithpositiveandunlabeledexamplesusingweightedlogisticregression.InProc.20thInt.Conf.onMachineLearning(ICML),pages448–455,Washington,DC,2003.E.Lehmann.TestingStatisticalHypotheses.Wiley,NewYork,1986.B.Liu,W.S.Lee,P.S.Yu,andX.Li.Partiallysupervisedclassicationoftextdocuments.InProc.19thInt.Conf.MachineLearning(ICML),pages387–394,Sydney,Australia,2002.B.Liu,Y.Dai,X.Li,W.S.Lee,andP.S.Yu.Buildingtextclassiersusingpositiveandunlabeledexamples.InProc.3rdIEEEInt.Conf.onDataMining(ICDM),pages179–188,Melbourne,FL,2003.K.-R.M¨uller,S.Mika,G.R¨atsch,K.Tsuda,andB.Sch¨olkopf.Anintroductiontokernel-basedlearningalgorithms.IEEETransactionsonNeuralNetworks,12:181–201,2001.P.Rigollet.Generalizationerrorboundsinsemi-supervisedclassicationundertheclusterassump-tion.J.MachineLearningResearch,8:1369–1392,2007.B.Sch¨olkopf,J.Platt,J.Shawe-Taylor,A.Smola,andR.Williamson.Estimatingthesupportofahigh-dimensionaldistribution.NeuralComputation,13(7):1443–1472,2001.C.ScottandG.Blanchard.Noveltydetection:Unlabeleddatadenitelyhelp.InD.vanDykandM.Welling,editors,ProceedingsoftheTwelfthInternationalConferenceonArticialIntelli-genceandStatistics(AISTATS)2009,pages464–471,ClearwaterBeach,Florida,2009.JMLR:W&CP5.3008 SEMI-SUPERVISEDNOVELTYDETECTIONC.ScottandR.Nowak.ANeyman-Pearsonapproachtostatisticallearning.IEEETrans.Inform.Theory,51(11):3806–3819,2005.C.ScottandR.Nowak.Learningminimumvolumesets.J.MachineLearningRes.,7:665–704,2006.A.Singh,R.Nowak,andX.Zhu.Unlabeleddata:Nowithelps,nowitdoesn't.Proc.NeuralInformationProcessingSystems21–NIPS'08,2009.A.Smola,L.Song,andC.H.Teo.Relativenoveltydetection.InD.vanDykandM.Welling,edi-tors,ProceedingsoftheTwelfthInternationalConferenceonArticialIntelligenceandStatistics(AISTATS)2009,pages536–543,ClearwaterBeach,Florida,2009.JMLR:W&CP5.D.SteinbergandN.S.Cardell.Estimatinglogisticregressionmodelswhenthedependentvariablehasnovariance.CommunicationsinStatistics-TheoryandMethods,21:423–450,1992.I.Steinwart,D.Hush,andC.Scovel.Aclassicationframeworkforanomalydetection.J.MachineLearningResearch,6:211–232,2005.J.D.Storey.Thepositivefalsediscoveryrate:ABayesianinterpretationandtheq-value.AnnalsofStatistics,31:6:2013–2035,2003.W.SunandT.Cai.Oracleandadaptivecompounddecisionrulesforfalsediscoveryratecontrol.J.Amer.Statist.Assoc.,102(479):901–912,2007.V.Vapnik.StatisticalLearningTheory.Wiley,NewYork,1998.R.VertandJ.-P.Vert.Consistencyandconvergenceratesofone-classSVMandrelatedalgorithms.J.MachineLearningResearch,pages817–854,2006.G.Ward,T.Hastie,S.Barry,J.Elith,andJ.R.Leathwick.Presence-onlydataandtheEMalgorithm.Biometrics,65:554–564,2009.B.ZhangandW.Zuo.Learningfrompositiveandunlabeledexamples:Asurvey.InProceedingoftheIEEEInternationalSymposiumonInformationProcessing(ISIP),pages650–654,2008.D.ZhangandW.S.Lee.Asimpleprobabilisticapproachtolearningfrompositiveandunlabeledexamples.InProc.5thAnnualUKWorkshoponComp.Intell.(UKCI),London,UK,2005.3009