Journal of Machine Learning Research 13 (2012) 27-66. Submitted 12/10; Revised 6/11; Published 1/12.

Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection

Gavin Brown (GAVIN.BROWN@CS.MANCHESTER.AC.UK)
Adam Pocock (ADAM.POCOCK@CS.MANCHESTER.AC.UK)
Ming-Jie Zhao (MING-JIE.ZHAO@CS.MANCHESTER.AC.UK)
Mikel Luján (MIKEL.LUJAN@CS.MANCHESTER.AC.UK)
School of Computer Science, University of Manchester, Manchester M13 9PL, UK

Editor: Isabelle Guyon

Abstract

We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?". To answer this, we adopt a different strategy than is usual in the feature selection literature: instead of trying to define a criterion, we derive one, directly from a clearly specified objective function, the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimize a definition of feature 'relevancy' and 'redundancy', our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy/redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples.

Keywords: feature selection, mutual information, conditional likelihood

1. Introduction

High dimensional data sets are a significant challenge for Machine Learning. Some of the most practically relevant and high-impact applications, such as gene expression data, may easily have more than 10,000 features. Many of these features may be completely irrelevant to the task at hand, or redundant in the context of others. Learning in this situation raises important issues, for example, over-fitting to irrelevant aspects of the data, and the computational burden of processing many similar features that provide redundant information. It is therefore an important research direction to automatically identify meaningful smaller subsets of these variables, that is, feature selection.

Feature selection techniques can be broadly grouped into approaches that are classifier-dependent ('wrapper' and 'embedded' methods), and classifier-independent ('filter' methods). Wrapper methods search the space of feature subsets, using the training/validation accuracy of a particular classifier as the measure of utility for a candidate subset. This may deliver significant advantages in generalisation, though has the disadvantage of a considerable computational expense, and may produce subsets that are overly specific to the classifier used. As a result, any change in the learning model is likely to render the feature set suboptimal. Embedded methods (Guyon et al., 2006, Chapter 3) exploit the structure of specific classes of learning models to guide the feature selection process. While the defining component of a wrapper method is simply the search procedure, the defining component of an embedded method is a criterion derived through fundamental knowledge of a specific class of functions. An example is the method introduced by Weston et al. (2001), selecting features to minimize a generalisation bound that holds for Support Vector Machines. These methods are less computationally expensive, and less prone to overfitting than wrappers, but still use quite strict model structure assumptions. In contrast, filter methods (Duch, 2006) separate the classification and feature selection components, and define a heuristic scoring criterion to act as a proxy measure of the classification accuracy. Filters evaluate statistics of the data independently of any particular classifier, thereby extracting features that are generic, having incorporated few assumptions.

Each of these three approaches has its advantages and disadvantages, the primary distinguishing factors being speed of computation, and the chance of overfitting. In general, in terms of speed, filters are faster than embedded methods, which are in turn faster than wrappers. In terms of overfitting, wrappers have higher learning capacity so are more likely to overfit than embedded methods, which in turn are more likely to overfit than filter methods. All of this of course changes with extremes of data/feature availability: for example, embedded methods will likely outperform filter methods in generalisation error as the number of data points increases, and wrappers become more computationally unfeasible as the number of features increases. A primary advantage of filters is that they are relatively cheap in terms of computational expense, and are generally more amenable to a theoretical analysis of their design. Such theoretical analysis is the focus of this article.

The defining component of a filter method is the relevance index (also known as a selection/scoring criterion), quantifying the 'utility' of including a particular feature in the set. Numerous hand-designed heuristics have been suggested (Duch, 2006), all attempting to maximise feature 'relevancy' and minimise 'redundancy'. However, few of these are motivated from a solid theoretical foundation. It is preferable to start from a more principled perspective; the desired approach is outlined eloquently by Guyon:

"It is important to start with a clean mathematical statement of the problem addressed [...] It should be made clear how optimally the chosen approach addresses the problem stated. Finally, the eventual approximations made by the algorithm to solve the optimisation problem stated should be explained. An interesting topic of research would be to 'retrofit' successful heuristic algorithms in a theoretical framework." (Guyon et al., 2006, pg. 21)

In this work we adopt this approach: instead of trying to define feature relevance indices, we derive them starting from a clearly specified objective function. The objective we choose is a well accepted statistical principle, the conditional likelihood of the class labels given the features. As a result we are able to provide deeper insight into the feature selection problem, and achieve precisely the goal above, to retrofit numerous hand-designed heuristics into a theoretical framework.
2. Background

In this section we give a brief introduction to information theoretic concepts, followed by a summary of how they have been used to tackle the feature selection problem.

2.1 Entropy and Mutual Information

The fundamental unit of information is the entropy of a random variable, discussed in several standard texts, most prominently (Cover and Thomas, 1991). The entropy, denoted H(X), quantifies the uncertainty present in the distribution of X. It is defined as,

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),

where the lower case x denotes a possible value that the variable X can adopt from the alphabet \mathcal{X}. To compute¹ this, we need an estimate of the distribution p(X). When X is discrete this can be estimated by frequency counts from data, that is \hat{p}(x) = \frac{\#x}{N}, the fraction of observations taking on value x from the total N. We provide more discussion on this issue in Section 3.3. If the distribution is highly biased toward one particular event x \in \mathcal{X}, that is, little uncertainty over the outcome, then the entropy is low. If all events are equally likely, that is, maximum uncertainty over the outcome, then H(X) is maximal.²

Following the standard rules of probability theory, entropy can be conditioned on other events. The conditional entropy of X given Y is denoted,

H(X|Y) = -\sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x|y) \log p(x|y).

This can be thought of as the amount of uncertainty remaining in X after we learn the outcome of Y. We can now define the Mutual Information (Shannon, 1948) between X and Y, that is, the amount of information shared by X and Y, as follows:

I(X;Y) = H(X) - H(X|Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(xy) \log \frac{p(xy)}{p(x)p(y)}.

This is the difference of two entropies: the uncertainty before Y is known, H(X), and the uncertainty after Y is known, H(X|Y). This can also be interpreted as the amount of uncertainty in X which is removed by knowing Y, thus following the intuitive meaning of mutual information as the amount of information that one variable provides about another. It should be noted that the Mutual Information is symmetric, that is, I(X;Y) = I(Y;X), and is zero if and only if the variables are statistically independent, that is p(xy) = p(x)p(y). The relation between these quantities can be seen in Figure 1. The Mutual Information can also be conditioned; the conditional information is,

I(X;Y|Z) = H(X|Z) - H(X|YZ) = \sum_{z \in \mathcal{Z}} p(z) \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(xy|z) \log \frac{p(xy|z)}{p(x|z)p(y|z)}.

1. The base of the logarithm is arbitrary, but decides the 'units' of the entropy. When using base 2, the units are 'bits'; when using base e, the units are 'nats'.
2. In general, 0 ≤ H(X) ≤ log(|\mathcal{X}|).
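The quantities above are computed throughout this paper from plug-in (maximum likelihood) frequency estimates, as discussed further in Section 3.3. The following is a minimal sketch of such estimators for discrete samples; it is our own illustration in Python (assuming NumPy is available), not code from the paper, and the function names are ours. Later sketches in this document reuse these three helpers.

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Plug-in estimate of H(X) in bits from a sequence of discrete outcomes."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(list(x)) + entropy(list(y)) - entropy(list(zip(x, y)))

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(list(zip(x, y, z))) - entropy(list(z)))
```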
Figure1:Illustrationofvariousinformationtheoreticquantities.ThiscanbethoughtofastheinformationstillsharedbetweenXandYafterthevalueofathirdvariable,Z,isrevealed.Theconditionalmutualinformationwillemergeasaparticularlyimportantpropertyinunderstandingtheresultsofthiswork.Thissectionhasbrieycoveredtheprinciplesofinformationtheory;inthefollowingsectionwediscussmotivationsforusingittosolvethefeatureselectionproblem.2.2FilterCriteriaBasedonMutualInformationFiltermethodsaredenedbyacriterionJ,alsoreferredtoasa`relevanceindex'or`scoring'criterion(Duch,2006),whichisintendedtomeasurehowpotentiallyusefulafeatureorfeaturesubsetmaybewhenusedinaclassier.AnintuitiveJwouldbesomemeasureofcorrelationbetweenthefeatureandtheclasslabel—theintuitionbeingthatastrongercorrelationbetweentheseshouldimplyagreaterpredictiveabilitywhenusingthefeature.ForaclasslabelY,themutualinformationscoreforafeatureXkisJmim(Xk)=I(XkY):(1)Thisheuristic,whichconsidersascoreforeachfeatureindependentlyofothers,hasbeenusedmanytimesintheliterature,forexample,Lewis(1992).Werefertothisfeaturescoringcriterionas`MIM',standingforMutualInformationMaximisation.TousethismeasurewesimplyrankthefeaturesinorderoftheirMIMscore,andselectthetopKfeatures,whereKisdecidedbysomepredenedneedforacertainnumberoffeaturesorsomeotherstoppingcriterion(Duch,2006).AcommonlycitedjusticationforthismeasureisthatthemutualinformationcanbeusedtowritebothanupperandlowerboundontheBayeserrorrate(Fano,1961;HellmanandRaviv,1970).Animportantlimitationisthatthisassumesthateachfeatureisindependentofallotherfeatures—andeffectivelyranksthefeaturesindescendingorderoftheirindividualmutualinformationcontent.However,wherefeaturesmaybeinterdependent,thisisknowntobesuboptimal.Ingeneral,itiswidelyacceptedthatausefulandparsimonioussetoffeaturesshouldnotonlybeindividuallyrelevant,butalsoshouldnotberedundantwithrespecttoeachother—featuresshouldnotbehighlycorrelated.Thereaderiswarnedthatwhilethisstatementseemsappealinglyintuitive,itisnotstrictlycorrect,aswillbeexpandeduponinlatersections.Inspiteofthis,severalcriteriahave30 
FEATURESELECTIONVIACONDITIONALLIKELIHOODbeenproposedthatattempttopursuethis`relevancy-redundancy'goal.Forexample,Battiti(1994)presentstheMutualInformationFeatureSelection(MIFS)criterion:Jmifs(Xk)=I(XkY)båXj2SI(XkXj);whereSisthesetofcurrentlyselectedfeatures.ThisincludestheI(XkY)termtoensurefeaturerelevance,butintroducesapenaltytoenforcelowcorrelationswithfeaturesalreadyselectedinS.Notethatthisassumesweareselectingfeaturessequentially,iterativelyconstructingournalfeaturesubset.Forasurveyofothersearchmethodsthansimplesequentialselection,thereaderisreferredtoDuch(2006);howeveritshouldbenotedthatalltheoreticalresultspresentedinthispaperwillbegenerallyapplicabletoanysearchprocedure,andbasedsolelyonpropertiesofthecriteriathemselves.ThebintheMIFScriterionisacongurableparameter,whichmustbesetexperimentally.Usingb=0wouldbeequivalenttoJmim(Xk),selectingfeaturesindependently,whilealargervaluewillplacemoreemphasisonreducinginter-featuredependencies.Inexperi-ments,Battitifoundthatb=1isoftenoptimal,thoughwithnostrongtheorytoexplainwhy.TheMIFScriterionfocusesonreducingredundancy;analternativeapproachwasproposedbyYangandMoody(1999),andalsolaterbyMeyeretal.(2008)usingtheJointMutualInformation(JMI),tofocusonincreasingcomplementaryinformationbetweenfeatures.TheJMIscoreforfeatureXkisJjmi(Xk)=åXj2SI(XkXjY):ThisistheinformationbetweenthetargetsandajointrandomvariableXkXj,denedbypair-ingthecandidateXkwitheachfeaturepreviouslyselected.Theideaisifthecandidatefeatureis`complementary'withexistingfeatures,weshouldincludeit.TheMIFSandJMIschemesweretherstofmanycriteriathatattemptedtomanagetherelevance-redundancytradeoffwithvariousheuristicterms,howeveritiscleartheyhaveverydif-ferentmotivations.Thecriteriaidentiedintheliterature1992-2011arelistedinTable1.Thepracticeinthisresearchproblemhasbeentohand-designcriteria,piecingcriteriatogetherasajig-sawofinformationtheoreticterms—theoverallaimtomanagetherelevance-redundancytrade-off,witheachnewcriterionmotivatedfromadifferentdirection.Severalquestionsarisehere:Whichcriterionshouldwebelieve?Whatdotheyassumeaboutthedata?Arethereotherusefulcriteria,asyetundiscovered?Inthefollowingsectionweofferanovelperspectiveonthisproblem.3.ANovelApproachInthefollowingsectionsweformulatethefeatureselectiontaskasaconditionallikelihoodproblem.Wewilldemonstratethatpreciselinkscanbedrawnbetweenthewell-acceptedstatisticalframeworkoflikelihoodfunctions,andthecurrentfeatureselectionheuristicsofmutualinformationcriteria.3.1AConditionalLikelihoodProblemWeassumeanunderlyingi.i.d.processpX!Y,fromwhichwehaveasampleofNobservations.Eachobservationisapair(x;y),consistingofad-dimensionalfeaturevectorx=[x1;:::;xd]T,andatargetclassy,drawnfromtheunderlyingrandomvariablesX=fX1;:::;XdgandY.Furthermore,weassumethatp(yjx)isdenedbyasubsetofthedfeaturesinx,whiletheremainingfeaturesare31 BROWN,POCOCK,ZHAOANDLUJ´ANCriterion Fullname Authors MIM MutualInformationMaximisation Lewis(1992)MIFS MutualInformationFeatureSelection Battiti(1994)KS Koller-Sahamimetric KollerandSahami(1996)JMI JointMutualInformation YangandMoody(1999)MIFS-U MIFS-`Uniform' KwakandChoi(2002)IF InformativeFragments Vidal-NaquetandUllman(2003)FCBF FastCorrelationBasedFilter YuandLiu(2004)AMIFS AdaptiveMIFS TesmerandEstevez(2004)CMIM ConditionalMutualInfoMaximisation Fleuret(2004)MRMR Max-RelevanceMin-Redundancy Pengetal.(2005)ICAP InteractionCapping Jakulin(2005)CIFE ConditionalInfomaxFeatureExtraction LinandTang(2006)DISR DoubleInputSymmetricalRelevance MeyerandBontempi(2006)MINRED MinimumRedundancy Duch(2006)IGFS InteractionGainFeatureSelection 
ElAkadietal.(2008)SOA SecondOrderApproximation GuoandNixon(2009)CMIFS ConditionalMIFS Chengetal.(2011)Table1:Variousinformation-basedcriteriafromtheliterature.Sections3and4willshowhowthesecanallbeinterpretedinasingletheoreticalframework.irrelevant.Ourmodelingtaskisthereforetwo-fold:rstlytoidentifythefeaturesthatplayafunc-tionalrole,andsecondlytousethesefeaturestoperformpredictions.Inthisworkweconcentrateontherststage,thatofselectingtherelevantfeatures.Weadoptad-dimensionalbinaryvectorq:a1indicatingthefeatureisselected,a0indicatingitisdiscarded.Notationxqindicatesthevectorofselectedfeatures,thatis,thefullvectorxprojectedontothedimensionsspeciedbyq.Notationxeqisthecomplement,thatis,theunselectedfeatures.Thefullfeaturevectorcanthereforebeexpressedasx=fxq;xeqg.Asmentioned,weassumetheprocesspisdenedbyasubsetofthefeatures,soforsomeunknownoptimalvectorq,wehavethatp(yjx)=p(yjxq).Weapproximatepusingahypotheticalpredictivemodelq,withtwolayersofparameters:qrepresentingwhichfeaturesareselected,andtrepresentingparametersusedtopredicty.Ourproblemstatementistoidentifytheminimalsubsetoffeaturessuchthatwemaximizetheconditionallikelihoodofthetraininglabels,withrespecttotheseparameters.Fori.i.d.dataD=f(xi;yi)=1::Ngtheconditionallikelihoodofthelabelsgivenparametersfq;tgisL(q;tjD)=NÕi=1q(yijxiq;t):The(scaled)conditionallog-likelihoodis`=1 NNåi=1logq(yijxiq;t):(2)Thisistheerrorfunctionwewishtooptimizewithrespecttotheparametersft;qg;thescalingtermhasnoeffectontheoptima,butsimpliesexpositionlater.Usingconditionallikelihoodhas32 FEATURESELECTIONVIACONDITIONALLIKELIHOODbecomepopularinso-calleddiscriminativemodellingapplications,whereweareinterestedonlyintheclassicationperformance;forexampleGrossmanandDomingos(2004)usedittolearnBayesianNetworkclassiers.WewillexpanduponthislinktodiscriminativemodelsinSection9.3.MaximisingconditionallikelihoodcorrespondstominimisingKL-divergencebetweenthetrueandpredictedclassposteriorprobabilities—forclassication,weoftenonlyrequirethecorrectclass,andnotpreciseestimatesoftheposteriors,henceEquation(2)isaproxylowerboundforclassicationaccuracy.Wenowintroducethequantityp(yjxq):thisisthetruedistributionoftheclasslabelsgiventheselectedfeaturesxq.Itisimportanttonotethedistinctionfromp(yjx),thetruedistributiongivenallfeatures.Multiplyinganddividingqbyp(yjxq),wecanre-writetheaboveas,`=1 NNåi=1logq(yijxiq;t) p(yijxiq)+1 NNåi=1logp(yijxiq):(3)Thesecondtermin(3)canbesimilarlyexpanded,introducingtheprobabilityp(yjx)`=1 NNåi=1logq(yijxiq;t) p(yijxiq)+1 NNåi=1logp(yijxiq) p(yijxi)+1 NNåi=1logp(yijxi):Thesearenitesampleapproximations,drawingdatapointsi.i.d.withrespecttothedistributionp(xy).WeuseExyfgtodenotestatisticalexpectation,andforconveniencewenegatetheaboveturningourmaximisationproblemintoaminimisation.Thisgivesus,`Exynlogp(yjxq) q(yjxq;t)o+Exynlogp(yjx) p(yjxq)oExynlogp(yjx)o:(4)Thesethreetermshaveinterestingpropertieswhichtogetherdenethefeatureselectionprob-lem.ItisparticularlyinterestingtonotethatthesecondtermispreciselythatintroducedbyKollerandSahami(1996)intheirdenitionsofoptimalfeatureselection.Intheirwork,thetermwasadoptedad-hocasasensibleobjectivetofollow—herewehaveshownittobeadirectandnat-uralconsequenceofadoptingtheconditionallikelihoodasanobjectivefunction.Rememberingx=fxq;xeqg,thissecondtermcanbedeveloped:DKS=Exynlogp(yjx) p(yjxq)o=åxyp(xy)logp(yjxqxeq) p(yjxq)=åxyp(xy)logp(yjxqxeq) p(yjxq)p(xeqjxq) p(xeqjxq)=åxyp(xy)logp(xeqyjxq) 
p(xeqjxq)p(yjxq)=I(XeqYjXq):(5)Thisistheconditionalmutualinformationbetweentheclasslabelandtheremainingfeatures,giventheselectedfeatures.Wecannotealsothatthethirdtermin(4)isanotherinformationtheoretic33 BROWN,POCOCK,ZHAOANDLUJ´ANquantity,theconditionalentropyH(YjX).Insummary,weseethatourobjectivefunctioncanbedecomposedintothreedistinctterms,eachwithitsowninterpretation:N!¥`=Exynlogp(yjxq) q(yjxq;t)o+I(XeqYjXq)+H(YjX):(6)Thersttermisalikelihoodratiobetweenthetrueandthepredictedclassdistributionsgiventheselectedfeatures,averagedovertheinputspace.Thesizeofthistermwilldependonhowwellthemodelqcanapproximatep,giventhesuppliedfeatures.3Whenqtakesonthetruevalueq(orconsistsofasupersetofq)thisbecomesaKL-divergencepjjq.ThesecondtermisI(XeqYjXq)theconditionalmutualinformationbetweentheclasslabelandtheunselectedfeatures,giventheselectedfeatures.Thesizeofthistermdependssolelyonthechoiceoffeatures,andwilldecreaseastheselectedfeaturesetXqexplainsmoreaboutY,untileventuallybecomingzerowhentheremainingfeaturesXeqcontainnoadditionalinformationaboutYinthecontextofXq.Itcanbenotedthatduetothechainrule,wehaveI(XY)=I(XqY)+I(XeqYjXq);henceminimizingI(XeqYjXq)isequivalenttomaximisingI(XqY).ThenaltermisH(YjX),theconditionalentropyofthelabelsgivenallfeatures.Thistermquantiestheuncertaintystillremain-inginthelabelevenwhenweknowallpossiblefeatures;itisanirreducibleconstant,independentofallparameters,andinfactformsaboundontheBayeserror(Fano,1961).Thesethreetermsmakeexplicittheeffectofthefeatureselectionparametersq,separatingthemfromtheeffectoftheparameterstinthemodelthatusesthosefeatures.Ifwesomehowhadtheoptimalfeaturesubsetq,whichperfectlycapturedtheunderlyingprocessp,thenI(XeqYjXq)wouldbezero.Theremaining(reducible)erroristhendowntotheKLdivergencepjjq,expressinghowwellthepredictivemodelqcanmakeuseoftheprovidedfeatures.Ofcourse,differentmodelsqwillhavedifferentpredictiveability:agoodfeaturesubsetwillnotnecessarilybeputtogooduseifthemodelistoosimpletoexpresstheunderlyingfunction.ThisperspectivewasalsoconsideredbyTsamardinosandAliferis(2003),andearlierbyKohaviandJohn(1997)—theaboveresultsplacetheseinthecontextofapreciseobjectivefunction,theconditionallikelihood.Fortheremainderofthepaperwewillusethesameassumptionasthatmadeimplicitlybyalllterselectionmethods.Forcompleteness,herewemaketheassumptionexplicit:Denition1:FilterassumptionGivenanobjectivefunctionforaclassier,wecanaddresstheproblemsofoptimizingthefeaturesetandoptimizingtheclassierintwostages:rstpickinggoodfeatures,thenbuildingtheclassiertousethem.Thisimpliesthatthesecondtermin(6)canbeoptimizedindependentlyoftherst.Inthissectionwehaveformulatedthefeatureselectiontaskasaconditionallikelihoodproblem.Inthefollowing,weconsiderhowthisproblemstatementrelatestotheexistingliterature,anddiscusshowtosolveitinpractice:includinghowtooptimizethefeatureselectionparameters,andtheestimationofthenecessarydistributions. 
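The chain rule identity I(X;Y) = I(X_θ;Y) + I(X_~θ;Y|X_θ) invoked above can be checked numerically on any discrete sample, because the plug-in estimates satisfy it exactly. A small sketch, reusing the estimator helpers from Section 2.1; the toy variables here are invented purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Toy discrete data: X1 drives Y, X2 is partially redundant with X1, X3 is noise.
x1 = rng.integers(0, 2, n)
x2 = (x1 + rng.integers(0, 2, n)) % 2
x3 = rng.integers(0, 2, n)
flip = (rng.random(n) < 0.1).astype(int)
y = (x1 + flip) % 2                      # Y is a noisy copy of X1

x_selected = x1                          # X_theta
x_unselected = list(zip(x2, x3))         # X_~theta, treated as one joint variable

lhs = mutual_information(list(zip(x1, x2, x3)), y)
rhs = (mutual_information(x_selected, y)
       + conditional_mutual_information(x_unselected, y, x_selected))

# The empirical distribution obeys the chain rule exactly (up to float error),
# so minimising I(X_~theta;Y|X_theta) is the same as maximising I(X_theta;Y).
assert abs(lhs - rhs) < 1e-9
```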
3.Infact,ifqisaconsistentestimator,thistermwillapproachzerowithlargeN.34 FEATURESELECTIONVIACONDITIONALLIKELIHOOD3.2OptimizingtheFeatureSelectionParametersUnderthelterassumptioninDenition1,Equation(6)demonstratesthattheoptimaofthecondi-tionallikelihoodcoincidewiththatoftheconditionalmutualinformation:argmaxqL(qjD)=argminqI(XeqYjXq):(7)Theremayofcoursebemultipleglobaloptima,inadditiontothetrivialminimumofselectingallfeatures.Withthisinmind,wecanintroduceaminimalityconstraintonthesizeofthefeatureset,anddeneourproblem:q=argminq0fjq0jq0=argminqI(X˜qYjXq)g:(8)ThisisthesmallestfeaturesetXq,suchthatthemutualinformationI(XeqYjXq)isminimal,andthustheconditionallikelihoodismaximal.Itshouldberememberedthatthelikelihoodisonlyourproxyforclassicationerror,andtheminimalfeaturesetintermsofclassicationcouldbesmallerthanthatwhichoptimiseslikelihood.Inthefollowingparagraphs,weconsiderhowthisproblemisimplicitlytackledbymethodsalreadyintheliterature.Acommonheuristicapproachisasequentialsearchconsideringfeaturesone-by-oneforad-dition/removal;thisisusedforexampleinMarkovBlanketlearningalgorithmssuchasIAMB(Tsamardinosetal.,2003).WewillnowdemonstratethatthissequentialsearchheuristicisinfactequivalenttoagreedyiterativeoptimisationofEquation(8).Tounderstandthiswemusttime-indexthefeaturesets.NotationXqt=Xeqtindicatestheselectedandunselectedfeaturesetsattimestep—withaslightabuseofnotationtreatingtheseinterchangeablyassetsandrandomvariables.Denition2:ForwardSelectionStepwithMutualInformationTheforwardselectionstepaddsthefeaturewiththemaximummutualinformationinthecontextofthecurrentlyselectedsetXqt.Theoperationsperformedare:Xk=argmaxXk2XetI(XkYjXqt);Xqt+1 Xqt[Xk;Xeqt+1 XeqtnXk:Asubtle(butimportant)implementationpointforthisselectionheuristicisthatitshouldnotaddanotherfeatureif8Xk;I(XkYjXq)=0.Thisensureswewillnotunnecessarilyincreasethesizeofthefeatureset.Theorem3Theforwardselectionmutualinformationheuristicaddsthefeaturethatgeneratesthelargestpossibleincreaseintheconditionallikelihood—agreedyiterativemaximisation.ProofWiththedenitionsaboveandthechainruleofmutualinformation,wehavethat:I(Xeqt+1YjXqt+1)=I(XeqtYjXqt)I(XkYjXqt):ThefeatureXkthatmaximisesI(XkYjXqt)isthesamethatminimizesI(Xeqt+1YjXqt+1);thereforetheforwardstepisagreedyminimizationofourobjectiveI(XeqYjXq),andthereforemaximisestheconditionallikelihood. 35 BROWN,POCOCK,ZHAOANDLUJ´ANDenition4:BackwardEliminationStepwithMutualInformationInabackwardstep,afeatureisremoved—theutilityofafeatureXkisconsideredasitsmutualinformationwiththetarget,conditionedonallotherelementsoftheselectedsetwithoutXk.Theoperationsperformedare:Xk=argminXk2XtI(XkYjfXqtnXkg):Xqt+1 XqtnXkXeqt+1 Xeqt[XkTheorem5Thebackwardeliminationmutualinformationheuristicremovesthefeaturethatcausestheminimumpossibledecreaseintheconditionallikelihood.ProofWiththesedenitionsandthechainruleofmutualinformation,wehavethat:I(Xeqt+1YjXqt+1)=I(XeqtYjXqt)+I(XkYjXqt+1):ThefeatureXkthatminimizesI(XkYjXqt+1)isthatwhichkeepsI(Xeqt+1YjXqt+1)ascloseaspossi-bletoI(XeqtYjXqt);thereforethebackwardeliminationstepremovesafeaturewhileattemptingtomaintainthelikelihoodascloseaspossibletoitscurrentvalue. 
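Definitions 2 and 4 translate directly into a greedy loop. Below is a minimal sketch of the forward selection step, scoring each candidate by the estimated I(X_k;Y|X_θ) and stopping when no candidate adds information; it reuses the estimator helpers sketched in Section 2.1 and is our own illustration rather than the authors' implementation.

```python
def forward_select(X, y, max_features, tol=1e-6):
    """Greedy forward selection: at each step add the candidate X_k with the
    largest estimated I(X_k;Y|X_selected), stopping when no candidate scores
    above `tol` (Definition 2). X is an (N, d) NumPy array of discrete values."""
    d = X.shape[1]
    selected = []
    while len(selected) < max_features:
        # Represent the currently selected set as a single joint variable.
        context = [tuple(row) for row in X[:, selected]]
        scores = {}
        for k in range(d):
            if k in selected:
                continue
            if selected:
                scores[k] = conditional_mutual_information(X[:, k], y, context)
            else:
                scores[k] = mutual_information(X[:, k], y)  # empty context: plain relevancy
        if not scores:
            break
        best = max(scores, key=scores.get)
        if scores[best] <= tol:    # no remaining feature adds information
            break
        selected.append(best)
    return selected
```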
Tostrictlyachieveouroptimizationgoal,abackwardstepshouldonlyremoveafeatureifI(XkYjfXqtnXkg)=0.Inpractice,workingwithrealdata,therewilllikelybeestimationerrors(seethefollowingsection)andthusveryrarelythestrictzerowillbeobserved.ThisbringsustoaninterestingcorollaryregardingIAMB(TsamardinosandAliferis,2003).Corollary6SincetheIAMBalgorithmusespreciselytheseforward/backwardselectionheuristics,itisagreedyiterativemaximisationoftheconditionallikelihood.InIAMB,abackwardeliminationstepisonlyacceptedifI(XkYjfXqtnXkg)0,andotherwisetheprocedureterminates.InTsamardinosandAliferis(2003)itisshownthatIAMBreturnstheMarkovBlanketofanytargetnodeinaBayesiannetwork,andthatthissetcoincideswiththestronglyrelevantfeaturesinthedenitionsfromKohaviandJohn(1997).ThepreciselinkstothisliteratureareexploredfurtherinSection7.TheIAMBfamilyofalgorithmsadoptacommonassumption,thatthedataisfaithfultosomeunknownBayesianNetwork.Inthecaseswherethisassumptionholds,theprocedurewasproventoidentifytheuniqueMarkovBlanket.SinceIAMBusespreciselytheforward/backwardstepswehavederived,wecanconcludethattheMarkovBlanketcoincideswiththe(unique)maxi-mumoftheconditionallikelihoodfunction.AmorerecentvariationoftheIAMBalgorithm,calledMMMB(Min-MaxMarkovBlanket)usesaseriesofoptimisationstomitigatetherequirementofexponentialamountsofdatatoestimatetherelevantstatisticalquantities.Theseoptimisationsdonotchangetheunderlyingbehaviourofthealgorithm,asitstillmaximisestheconditionallikelihoodfortheselectedfeatureset,howevertheydoslightlyobscurethestronglinktoourframework.36 FEATURESELECTIONVIACONDITIONALLIKELIHOOD3.3EstimationoftheMutualInformationTermsInconsideringtheforward/backwardheuristics,wemusttakeaccountofthefactthatwedonothaveperfectknowledgeofthemutualinformation.Thisisbecausewehaveimplicitlyassumedwehaveaccesstothetruedistributionsp(xy);p(yjxq),etc.Inpracticewehavetoestimatethesefromdata.Theproblemcalculatingmutualinformationreducestothatofentropyestimation,andisfundamentalinstatistics(Paninski,2003).Themutualinformationisdenedastheexpectedlogarithmofaratio:I(XY)=Exynlogp(xy) p(x)p(y)o:Wecanestimatethis,sincetheStrongLawofLargeNumbersassuresusthatthesampleestimateusingˆpconvergesalmostsurelytotheexpectedvalue—foradatasetofNi.i.d.observations(xi;yi)I(XY)ˆI(XY)=1 NNåi=1logˆp(xiyi) ˆp(xi)ˆp(yi):Inordertocalculatethisweneedtheestimateddistributionsˆp(xy),ˆp(x),andˆp(y).Thecomputationofentropiesforcontinuousorordinaldataishighlynon-trivial,andrequiresanassumedmodeloftheunderlyingdistributions—tosimplifyexperimentsthroughoutthisarticle,weusediscretedata,andestimatedistributionswithhistogramestimatorsusingxed-widthbins.Theprobabilityofanyparticulareventp(X=x)isestimatedbymaximumlikelihood,thefrequencyofoccurrenceoftheeventX=xdividedbythetotalnumberofevents(i.e.,datapoints).Formoreinformationonalternativeentropyestimationprocedures,wereferthereadertoPaninski(2003).AtthispointwemustnotethattheapproximationaboveholdsonlyifNislargerelativetothedimensionofthedistributionsoverxandy.Forexampleifx;yarebinary,N100shouldbemorethansufcienttogetreliableestimates;howeverifx;yaremultinomial,thiswilllikelybeinsufcient.Inthecontextofthesequentialselectionheuristicswehavediscussed,weareapproximatingI(XkYjXq)as,I(XkYjXq)ˆI(XkYjXq)=1 NNåi=1logˆp(xikyijxiq) 
ˆp(xikjxiq)ˆp(yijxiq):(9)AsthedimensionofthevariableXqgrows(i.e.,asweaddmorefeatures)thenthenecessaryprobabilitydistributionsbecomemorehighdimensional,andhenceourestimateofthemutualinformationbecomeslessreliable.Thisinturncausesincreasinglypoorjudgementsforthein-clusion/exclusionoffeatures.Forpreciselythisreason,theresearchcommunityhavedevelopedvariouslow-dimensionalapproximationsto(9).Inthefollowingsections,wewillinvestigatetheimplicitstatisticalassumptionsandempiricaleffectsoftheseapproximations.Intheremainderofthispaper,weuseI(XY)todenotetheidealcaseofbeingabletocomputethemutualinformation,thoughinpracticeonrealdataweusethenitesampleestimateˆI(XY)3.4SummaryInthesesectionswehaveineffectreverse-engineeredamutualinformation-basedselectionscheme,startingfromaclearlydenedconditionallikelihoodproblem,anddiscussedestimationofthevar-iousquantitiesinvolved.Inthefollowingsectionswewillshowthatwecanretrotnumerousexistingrelevancy-redundancyheuristicsfromthefeatureselectionliteratureintothisprobabilisticframework.37 BROWN,POCOCK,ZHAOANDLUJ´AN4.RetrottingSuccessfulHeuristicsIntheprevioussection,startingfromaclearlydenedconditionallikelihoodproblem,wederivedagreedyoptimizationprocesswhichassessesfeaturesbasedonasimplescoringcriterionontheutilityofincludingafeatureXk2Xeq.ThescoreforafeatureXkis,Jcmi(Xk)=I(XkYjS);(10)wherecmistandsforconditionalmutualinformation,andfornotationalbrevitywenowuseS=Xqforthecurrentlyselectedset.Animportantquestionis,howdoes(10)relatetoexistingheuristicsintheliterature,suchasMIFS?WewillseethatMIFS,andcertainothercriteria,canbephrasedcleanlyaslinearcombinationsofShannonentropyterms,whilesomearenon-linearcombinations,involvingmaxorminoperations.4.1CriteriaasLinearCombinationsofShannonInformationTermsRepeatingtheMIFScriterionforclarity,Jmifs(Xk)=I(XkY)båXj2SI(XkXj):(11)Wecanseethatwerstneedtorearrange(10)intotheformofasimplerelevancytermbetweenXkandY,plussomeadditionalterms,beforewecancompareittoMIFS.UsingtheidentityI(ABjC)I(AB)=I(ACjB)I(AC),wecanre-express(10)as,Jcmi(Xk)=I(XkYjS)=I(XkY)I(XkS)+I(XkSjY):(12)Itisinterestingtoseetermsinthisexpressioncorrespondingtotheconceptsof`relevancy'and`redundancy',thatis,I(XkY)andI(XkS).ThescorewillbeincreasediftherelevancyofXkislargeandtheredundancywithexistingfeaturesissmall.Thisisinaccordancewithacommonviewinthefeatureselectionliterature,observingthatwewishtoavoidredundantvariables.However,wecanalsoseeanimportantadditionaltermI(XkSjY),whichisnottraditionallyaccountedforinthefeatureselectionliterature—wecallthistheconditionalredundancy.ThistermhastheoppositesigntotheredundancyI(XkS),henceJcmiwillbeincreasedwhenthisislarge,thatis,astrongclass-conditionaldependenceofXkwiththeexistingsetS.Thus,wecometotheimportantconclusionthattheinclusionofcorrelatedfeaturescanbeuseful,providedthecorrelationwithinclassesisstrongerthantheoverallcorrelation.WenotethatthisisasimilarobservationtothatofGuyonetal.(2006),that“correlationdoesnotimplyredundancy”—Equation(12)effectivelyembodiesthisstatementininformationtheoreticterms.Thesumofthelasttwotermsin(12)representsthethree-wayinteractionbetweentheexistingfeaturesetS,thetargetY,andthecandidatefeatureXkbeingconsideredforinclusioninS.Tofurtherunderstandthis,wecannotethefollowingproperty:I(XkSY)=I(SY)+I(XkYjS)=I(SY)+I(XkY)I(XkS)+I(XkSjY):WeseethatifI(XkS)�I(XkSjY),thenthetotalutilitywhenincludingXk,thatisI(XkSY),islessthanthesumoftheindividualrelevanciesI(SY)+I(XkY).ThiscanbeinterpretedasXkhavingunnecessaryduplicatedinformation.Intheoppositecase,whenI(XkS)I(XkSjY),thenXkand38 
FEATURESELECTIONVIACONDITIONALLIKELIHOODScombinewellandprovidemoreinformationtogetherthanbythesumoftheirparts,I(SY),andI(XkY)Theimportantpointtotakeawayfromthisexpressionisthatthetermsareinatrade-off—wedonotrequireafeaturewithlowredundancyforitsownsake,butinsteadrequireafeaturethatbesttradesoffthethreetermssoastomaximisethescoreoverall.Muchlikethebias-variancedilemma,attemptingtodecreaseonetermislikelytoincreaseanother.Therelationof(10)and(11)canbeseenwithassumptionsontheunderlyingdistributionp(xy)Writingthelattertwotermsof(12)asentropies:Jcmi(Xk)=I(XkY)H(S)+H(SjXk)+H(SjY)H(SjXkY):(13)Todevelopthisfurther,werequireanassumption.Assumption1ForallunselectedfeaturesXk2Xeq,assumethefollowing,p(xqjxk)=Õj2Sp(xjjxk)p(xqjxky)=Õj2Sp(xjjxky):ThisstatesthattheselectedfeaturesXqareindependentandclass-conditionallyindependentgiventheunselectedfeatureXkunderconsideration.Usingthis,Equation(13)becomes,J0cmi(Xk)=I(XkY)H(S)+åj2SH(XjjXk)+H(SjY)åj2SH(XjjXkY):wheretheprimeonJindicateswearemakingassumptionsonthedistribution.Now,ifweintroduceåj2SH(Xj)åj2SH(Xj),andåj2SH(XjjY)åj2SH(XjjY),werecovermutualinformationterms,betweenthecandidatefeatureandeachmemberofthesetS,plussomeadditionalterms,J0cmi(Xk)=I(XkY)åj2SI(XjXk)+åj2SH(Xj)H(S)+åj2SI(XjXkjY)åj2SH(XjjY)+H(SjY):(14)Severalofthetermsin(14)areconstantwithrespecttoXk—assuch,removingthemwillhavenoeffectonthechoiceoffeature.Removingtheseterms,wehaveanequivalentcriterion,J0cmi(Xk)=I(XkY)åj2SI(XjXk)+åj2SI(XjXkjY):(15)39 BROWN,POCOCK,ZHAOANDLUJ´ANThishasinfactalreadyappearedintheliteratureasaltercriterion,originallyproposedbyLinandTang(2006),asConditionalInfomaxFeatureExtraction(CIFE),thoughithasbeenrepeatedlyrediscoveredbyotherauthors(ElAkadietal.,2008;GuoandNixon,2009).Itisparticularlyinterestingasitrepresentsasortof`root'criterion,fromwhichseveralotherscanbederived.Forexample,thelinktoMIFScanbeseenwithonefurtherassumption,thatthefeaturesarepairwiseclass-conditionallyindependent.Assumption2Forallfeaturesi;j,assumep(xixjjy)=p(xijy)p(xjjy).Thisstatesthatthefeaturesarepairwiseclass-conditionallyindependent.Withthisassumption,thetermåI(XjXkjY)willbezero,and(15)becomes(11),theMIFScriterion,withb=1.ThebparameterinMIFScanbeinterpretedasencodingastrengthofbeliefinanotherassumption,thatofunconditionalindependence.Assumption3Forallfeaturesi;j,assumep(xixj)=p(xi)p(xj).Thisstatesthatthefeaturesarepairwiseindependent.Abclosetozeroimpliesverystrongbeliefintheindependencestatement,indicatingthatanymeasuredassociationI(XjXk)isinfactspurious,possiblyduetonoiseinthedata.Abvaluecloserto1impliesalesserbelief,thatanymeasureddependencyI(XjXk)shouldbeincorporatedintothefeaturescoreexactlyasobserved.SinceMIMisproducedbysettingb=0,wecanseethatMIMalsoadoptsAssumption3.ThesamelineofreasoningcanbeappliedtoaverysimilarcriterionproposedbyPengetal.(2005),theMinimum-RedundancyMaximum-Relevancecriterion,Jmrmr(Xk)=I(XkY)1 
jSjåj2SI(XkXj):SincemRMRomitstheconditionalredundancytermentirely,itisimplicitlyusingAssumption2.Thebcoefcienthasbeensetinverselyproportionaltothesizeofthecurrentfeatureset.IfwehavealargesetS,thenbwillbeextremelysmall.TheinterpretationisthenthatasthesetSgrows,mRMRadoptsastrongerbeliefinAssumption3.Intheoriginalpaper,(Pengetal.,2005,Section2.3)itwasclaimedthatmRMRisequivalentto(10).Inthissection,throughmakingexplicittheintrinsicassumptionsofthecriterion,wehaveclearlyillustratedthatthisclaimisincorrect.BalaganiandPhoha(2010)presentananalysisofthethreecriteriamRMR,MIFSandCIFE,arrivingatsimilarresultstoourown:thatthesecriteriamakehighlyrestrictiveassumptionsontheunderlyingdatadistributions.Thoughtheconclusionsaresimilar,ourapproachincludestheirresultsasaspecialcase,andmakesexplicitthelinktoalikelihoodfunction.TherelationoftheMIFS/mRMRtoEquation(15)isrelativelystraightforward.Itismorechal-lengingtoconsiderhowcloselyothercriteriamightbere-expressedinthisform.YangandMoody(1999)proposeusingJointMutualInformation(JMI),Jjmi(Xk)=åj2SI(XkXjY):(16)Usingsomerelativelysimplemanipulations(seeappendix)thiscanbere-writtenas,Jjmi(Xk)=I(XkY)1 jSjåj2ShI(XkXj)I(XkXjjY)i:(17)40 FEATURESELECTIONVIACONDITIONALLIKELIHOODThiscriterion(17)returnsexactlythesamesetoffeaturesastheJMIcriterion(16);howeverinthisform,wecanseetherelationtoourproposedframework.TheJMIcriterion,likemRMR,hasastrongerbeliefinthepairwiseindependenceassumptionsasthefeatureseSgrows.SimilaritiescanofcoursebeobservedbetweenJMI,MIFSandmRMR—thedifferencesbeingthescalingfactorandtheconditionalterm—andtheirsubsequentrelationtoEquation(15).Itisinfactpossibletoidentifynumerouscriteriafromtheliteraturethatcanallbere-writtenintoacommonform,correspondingtovariationsupon(15).Aspaceofpotentialcriteriacanbeimagined,whereweparameterizecriterion(15)asso:J0cmi=I(XkY)båj2SI(XjXk)+gåj2SI(XjXkjY):(18)Figure2showshowthecriteriawehavediscussedsofarcanallbettedinsidethisunitsquarecorrespondingtob=gparameters.MIFSsitsonthelefthandaxisofthesquare—withg=0andb2[0;1].TheMIMcriterion,Equation(1),whichsimplyassesseseachfeatureindividuallywithoutanyregardofothers,sitsatthebottomleft,withg=0;b=0.Thetoprightofthesquarecorrespondstog=1;b=1,whichistheCIFEcriterion(LinandTang,2006),alsosuggestedbyElAkadietal.(2008)andGuoandNixon(2009).Averysimilarcriterion,usinganassumptiontoapproximatetheterms,wasproposedbyChengetal.(2011).TheJMIandmRMRcriteriaareuniqueinthattheymovelinearlywithinthespaceasthefeaturesetSgrows.AsthesizeofthesetSincreasestheymoveclosertowardstheoriginandtheMIMcriterion.TheparticularlyinterestingpointaboutthispropertyisthattherelativemagnitudeoftherelevancytermtotheredundancytermsstaysapproximatelyconstantasSgrows,whereaswithMIFS,theredundancytermwillingeneralbejSjtimesbiggerthantherelevancyterm.Theconse-quencesofthiswillbeexploredintheexperimentalsectionofthispaper.AnycriterionexpressibleintheunitsquarehasmadeindependenceAssumption1.Inaddition,anycriteriathatsitatpointsotherthanb=1;g=1haveadoptedvaryingdegreesofbeliefinAssumptions2and3.Afurtherinterestingpointaboutthissquareissimplythatitissparselypopulated.Anobviousunexploredregionisthebottomright,thecornercorrespondingtob=0;g=1;thoughthereisnoclearintuitivejusticationforthispoint,forcompletenessintheexperimentalsectionwewillevaluateit,astheconditionalredundancyor`condred'criterion.Inpreviouswork(Brown,2009)weexploredthisunitsquare,thoughderivedfromanexpansionofthemutualinformationfunctionratherthandirectlyfromtheconditionallikelihood.Whilethisresultedinanidenticalexpressionto(18),theprobabilisticframeworkwepresenther
eisfarmoreexpressive,allowingexactspeci-cationoftheunderlyingassumptions.TheunitsquareofFigure2describeslinearcriteria,namedassosincetheyarelinearcombi-nationsoftherelevance/redundancyterms.Thereexistothercriteriathatfollowasimilarform,butinvolvingotheroperations,makingthemnon-linear4.2CriteriaasNon-LinearCombinationsofShannonInformationTermsFleuret(2004)proposedtheConditionalMutualInformationMaximizationcriterion,Jcmim(Xk)=minXj2ShI(XkYjXj)i:Thiscanbere-written,Jcmim(Xk)=I(XkY)maxXj2ShI(XkXj)I(XkXjjY)i:(19)41 BROWN,POCOCK,ZHAOANDLUJ´AN Figure2:Thefullspaceoflinearltercriteria,describingseveralexamplesfromTable1.NotethatallcriteriainthisspaceadoptAssumption1.Additionally,thegandbaxesrepresentthecriteriabeliefinAssumptions2and3,respectively.ThelefthandaxisiswherethemRMRandMIFSalgorithmssit.Thebottomleftcorner,MIM,istheassumptionofcompletelyindependentfeatures,usingjustmarginalmutualinformation.NotethatsomecriteriaareequivalentatparticularsizesofthecurrentfeaturesetjSjTheproofisagainavailableintheappendix.Duetothemaxoperator,theprobabilisticinterpretationisalittlelessstraightforward.ItisclearhoweverthatCMIMadoptsAssumption1,sinceitevaluatesonlypairwisefeaturestatistics.Vidal-NaquetandUllman(2003)proposeanothercriterionusedinComputerVision,whichwerefertoasInformativeFragmentsJif(Xk)=minXj2ShI(XkXjY)I(XjY)i:TheauthorsmotivatethiscriterionbynotingthatitmeasuresthegainofcombininganewfeatureXkwitheachexistingfeatureXj,oversimplyusingXjbyitself.TheXjwiththeleast`gain'frombeingpairedwithXkistakenasthescoreforXk.Interestingly,usingthechainruleI(XkXjY)=I(XjY)+I(XkYjXj),thereforeIFisequivalenttoCMIM,thatis,Jif(Xk)=Jcmim(Xk),makingthesameassumptions.Jakulin(2005)proposedthecriterion,Jicap(Xk)=I(XkY)åXj2Smaxh0;fI(XkXj)I(XkXjjY)gi:Again,thisadoptsAssumption1,usingthesameredundancyandconditionalredundancyterms,yettheexactprobabilisticinterpretationisunclear.Aninterestingclassofcriteriauseanormalisationtermonthemutualinformationtooffsettheinherentbiastowardhigharityfeatures(Duch,2006).AnexampleofthisisDoubleInput42 FEATURESELECTIONVIACONDITIONALLIKELIHOODSymmetricalRelevance(MeyerandBontempi,2006),amodicationoftheJMIcriterion:Jdisr(Xk)=åXj2SI(XkXjY) 
H(XkXjY):Theinclusionofthisnormalisationtermbreaksthestrongtheoreticallinktoalikelihoodfunction,butagainforcompletenesswewillincludethisinourempiricalinvestigations.Whilethecriteriaintheunitsquarecanhavetheirprobabilisticassumptionsmadeexplicit,thenonlinearityintheCMIM,ICAPandDISRcriteriamakesuchaninterpretationfarmoredifcult.4.3SummaryofTheoreticalFindingsInthissectionwehaveshownthatnumerouscriteriapublishedoverthepasttwodecadesofresearchcanbe`retro-tted'intotheframeworkwehaveproposed—thecriteriaareapproximationsto(10),eachmakingdifferentassumptionsontheunderlyingdistributions.Sinceintheprevioussectionwesawthatacceptingthetoprankedfeatureaccordingto(10)providesthemaximumpossibleincreaseinthelikelihood,weseenowthatthecriteriaareapproximatemaximisersofthelikelihood.Whetherornottheyindeedprovidethemaximumincreaseateachstepwilldependonhowwelltheimplicitassumptionsonthedatacanbetrusted.Also,itshouldberememberedthatevenifweused(10),itisnotguaranteedtondtheglobaloptimumofthelikelihood,since(a)itisagreedysearch,and(b)nitedatawillmeandistributionscannotbeaccuratelymodelled.Inthiscase,wehavereachedthelimitofwhatatheoreticalanalysiscantellusaboutthecriteria,andwemustclosetheremaining`gaps'inourunderstandingwithanexperimentalstudy.5.ExperimentsInthissectionweempiricallyevaluatesomeofthecriteriaintheliteratureagainstoneanother.Notethatwearenotpursuinganexhaustiveanalysis,attemptingtoidentifythe`winning'criterionthatprovidesbestperformanceoverall4—rather,weprimarilyobservehowthetheoreticalpropertiesofcriteriarelatetothesimilarityofthereturnedfeaturesets.Whilethesepropertiesareinterest-ing,weofcoursemustacknowledgethatclassicationperformanceistheultimateevaluationofacriterion—hencewealsoincludehereclassicationresultsonUCIdatasetsandinSection6onthewell-knownbenchmarkNIPSFeatureSelectionChallenge.Inthefollowingsections,weaskthequestions:“howstableisacriteriontosmallchangesinthetrainingdataset?”,“howsimilararethecriteriatoeachother?”,“howdothedifferentcriteriabehaveinlimitedandextremesmall-samplesituations?”,andnally,“whatistherelationbetweenstabilityandaccuracy?”.Toaddressthesequestions,weusethe15datasetsdetailedinTable2.Thesearechosentohaveawidevarietyofexample-featureratios,andarangeofmulti-classproblems.Thefeatureswithineachdatasethaveavarietyofcharacteristics—somebinary/discrete,andsomecontinuous.Con-tinuousfeatureswerediscretized,usinganequal-widthstrategyinto5bins,whilefeaturesalreadywithacategoricalrangewereleftuntouched.The`ratio'statisticquotedinthenalcolumnisanindicatorofthedifcultyofthefeatureselectionforeachdataset.Thisusesthenumberofdata-points(N),themedianarityofthefeatures(m),andthenumberofclasses(c)—theratioquotedin 4.Inanycase,theNoFreeLunchTheoremappliesherealso(TsamardinosandAliferis,2003).43 BROWN,POCOCK,ZHAOANDLUJ´ANthetableforeachdatasetisN mc,henceasmallervalueindicatesamorechallengingfeatureselectionproblem.Akeypointofthisworkistounderstandthestatisticalassumptionsonthedataimposedbythefeatureselectioncriteria—ifourclassicationmodelweretomakeevenmoreassumptions,thisislikelytoobscuretheexperimentalobservationsrelatingperformancetotheoreticalproperties.Forthisreason,inallexperimentsweuseasimplenearestneighbourclassier(k=3),thisischosenasitmakesfew(ifany)assumptionsaboutthedata,andweavoidtheneedforparametertuning.Forthefeatureselectionsearchprocedure,theltercriteriaareappliedusingasimpleforwardselection,toselectaxednumberoffeatures,speciedineachexperiment,beforebeingusedwiththeclassier.DataFeaturesExamplesClassesRatio 
breast30569257congress16435272heart13270234ionosphere34351235krvskp3631962799landsat3664356214lungcancer563234parkinsons22195220semeion25615931080sonar60208221soybeansmall354746spect22267267splice6031753265waveform4050003333wine13178312Table2:Datasetsusedinexperiments.Thenalcolumnindicatesthedifcultyofthedatainfeatureselection,asmallervalueindicatingamorechallengingproblem.5.1HowStablearetheCriteriatoSmallChangesintheData?Thesetoffeaturesselectedbyanyprocedurewillofcoursedependonthedataprovided.Itisaplausiblecomplaintifthesetofreturnedfeaturesvarieswildlywithonlyslightvariationsinthesupplieddata.Thisisanissuereminiscentofthebias-variancedilemma,wherethesensitivityofaclassiertoitsinitialconditionscauseshighvarianceresponses.However,whilethebias-variancedecompositioniswell-denedandunderstood,thecorrespondingissueforfeatureselection,the`stability',hasonlyrecentlybeenstudied.Thestabilityofafeatureselectioncriterionrequiresameasuretoquantifythe`similarity'betweentwoselectedfeaturesets.ThiswasrstdiscussedbyKalousisetal.(2007),whoinvestigatedseveralmeasures,withthenalrecommendationbeingtheTanimotodistancebetweensets.Suchset-intersectionmeasuresseemappropriate,buthavelimitations;forexample,iftwocriteriaselectedidenticalfeaturesetsofsize10,wemightbelesssurprisedifweknewtheoverallpooloffeatureswasofsize12,thanifitwassize12,000.Toaccount44 FEATURESELECTIONVIACONDITIONALLIKELIHOODforthis,Kuncheva(2007)presentsaconsistencyindex,basedonthehypergeometricdistributionwithacorrectionforchance.Denition7TheconsistencyfortwosubsetsA;BX,suchthatjAj=jBj=k,andr=jA\Bjwhere0kjXj=n,isC(A;B)=rnk2 k(nk):Theconsistencytakesvaluesintherange[1;+1],withapositivevalueindicatingsimilarsets,azerovalueindicatingapurelyrandomrelation,andanegativevalueindicatingastronganti-correlationbetweenthefeaturessets.Oneproblemwiththeconsistencyindexisthatitdoesnottakefeatureredundancyintoaccount.Thatis,twoprocedurescouldselectfeatureswhichhavedifferentarrayindices,soareidentiedas`different',butinfactaresohighlycorrelatedthattheyareeffectivelyidentical.AmethodtodealwiththissituationwasproposedbyYuetal.(2008).Thismethodconstructsaweightedcompletebipartitegraph,wherethetwonodesetscorrespondtotwodifferentfeaturesets,andweightsareassignedtothearcsarethenormalizedmutualinformationbetweenthefeaturesatthenodes,alsosometimesreferredtoasthesymmetricaluncertainty.TheweightbetweennodeinsetA,andnodeinsetB,isw(A();B())=I(XA(i)XB(j)) 
H(XA(i))+H(XB(j)):TheHungarianalgorithmisthenappliedtoidentifythemaximumweightedmatchingbetweenthetwonodesets,andtheoverallsimilaritybetweensetsAandBisthenalmatchingcost.Thisistheinformationconsistencyofthetwosets.Formoredetails,werefertoYuetal.(2008).Wenowcomparethesetwomeasuresonthecriteriafromtheprevioussections.Foreachdataset,wetakeabootstrapsampleandselectasetoffeaturesusingeachfeatureselectioncriterion.The(information)stabilityofasinglecriterionisquantiedastheaveragepairwise(information)consistencyacross50bootstrapsfromthetrainingdata.Figure3showsKuncheva'sstabilitymeasureonaverageover15datasets,selectingfeaturesetsofsize10;notethatthecriteriahavebeendisplayedorderedleft-to-rightbytheirmedianvalueofstabilityoverthe15datasets.Themarginalmutualinformation,MIM,isasexpectedthemoststable,giventhatithasthelowestdimensionaldistributiontoapproximate.ThenextmoststableisJMIwhichincludestherelevancy/redundancyterms,butaveragesoverthecurrentfeatureset;thisaveragingprocessmightthereforebeinterpretedempiricallyasaformof`smoothing',enablingthecriteriaoveralltoberesistanttopoorestimationofprobabilitydistributions.ItcanbenotedthatthefarrightofFigure3consistsoftheMIFS,ICAPandCIFEcriteria,allofwhichdonotattempttoaveragetheredundancyterms.Figure4showsthesamedatasets,butinsteadtheinformationstabilityiscomputed;asmen-tioned,thisshouldtakeintoaccountthefactthatsomefeaturesarehighlycorrelated.Interestingly,thetwobox-plotsshowbroadlysimilarresults.MIMisthemoststable,andCIFEistheleaststable,thoughhereweseethatJMI,DISR,andMRMRareactuallymorestablethanKuncheva'sstabilityindexcanreect.Aninterestinglineoffutureresearchmightbetocombinethebestofthesetwostabilitymeasures—onethatcantakeintoaccountbothfeatureredundancyandacorrectionforrandomchance.45 BROWN,POCOCK,ZHAOANDLUJ´AN Figure3:Kuncheva'sStabilityIndexacross15datasets.Theboxindicatestheupper/lowerquar-tiles,thehorizontallinewithineachshowsthemedianvalue,whilethedottedcrossbarsindicatethemaximum/minimumvalues.Forconvenienceofinterpretation,criteriaonthex-axisareorderedbytheirmedianvalue. Figure4:Yuetal'sInformationStabilityIndexacross15datasets.Forcomparison,criteriaonthex-axisareorderedidenticallytoFigure3.Thegeneralpictureemergessimilarly,thoughtheinformationstabilityindexisabletotakefeatureredundancyintoaccount,showingthatsomecriteriaareslightlymorestablethanexpected.46 FEATURESELECTIONVIACONDITIONALLIKELIHOOD (a)Kuncheva'sConsistencyIndex. 
(b)Yuetal'sInformationStabilityIndex.Figure5:Relationsbetweenfeaturesetsgeneratedbydifferentcriteria,onaverageover15datasets.2-Dvisualisationgeneratedbyclassicalmulti-dimensionalscaling.5.2HowSimilararetheCriteria?Twocriteriacanbedirectlycomparedwiththesamemethodology:bymeasuringtheconsistencyandinformationconsistencybetweenselectedfeaturesubsetsonacommonsetofdata.Wecalculatethemeanconsistenciesbetweentwofeaturesetsofsize10,repeatedlyselectedover50bootstrapsfromtheoriginaldata.Thisisthenarrangedinasimilaritymatrix,andweuseclassicalmulti-dimensionalscalingtovisualisethisasa2-dmap,showninFigures5aand5b.Noteagainthatwhiletheindicesmayreturndifferentabsolutevalues(oneisanormalizedmeanofahypergeometricdistributionandtheotherisapairwisesumofmutualinformationterms)theyshowverysimilarrelative`distances'betweencriteria.Bothdiagramsshowaclusterofseveralcriteria,and4clearoutliers:MIFS,CIFE,ICAPandCondRed.The5criteriaclusteringintheupperleftofthespaceappeartoreturnrelativelysimilarfeaturesets.The4outliersappeartoreturnquitesignicantlydifferentfeaturesets,bothfromtheclusteredset,andfromeachother.Acommoncharacteristicofthese4outliersisthattheydonotscaletheredundancyorconditionalredundancyinformationterms.Inthesecriteria,theupperboundontheredundancytermåj2SI(XkXj)growslinearlywiththenumberofselectedfeatures,whilsttheupperboundontherelevancytermI(XkY)remainsconstant.Whenthishappenstherelevancytermisoverwhelmedbytheredundancytermandthusthecriterionselectsfeatureswithminimalredundancy,ratherthantradingoffbetweenthetwoterms.Thisleadstostronglydivergentfeaturesetsbeingselected,whichisreectedinthestabilityofthecriteria.Eachoftheoutliersaredifferentfromeachotherastheyhavedifferentcombinationsofredundancyandconditionalredundancy.Wewillseethis`balance'betweenrelevancyandredundancyemergeasacommonthemeintheexperimentsoverthenextfewsections.47 BROWN,POCOCK,ZHAOANDLUJ´AN5.3HowdoCriteriaBehaveinLimitedandExtremeSmall-sampleSituations?Toassesshowcriteriabehaveindatapoorsituations,wevarythenumberofdatapointssuppliedtoperformthefeatureselection.Theprocedurewastorandomlyselect140datapoints,thenusetheremainingdataasahold-outset.Fromthis140,thenumberprovidedtoeachcriterionwasincreasedinstepsof10,fromaminimalsetofsize20.Toallowareasonabletestingsetsize,welimitedthisassessmenttoonlydatasetswithatleast200datapointstotal;thisgivesus11datasetsfromthe15,omittinglungcancerparkinsonssoybeansmall,andwine.Foreachdatasetweselect10featuresandapplythe3-nnclassier,recordingtherank-orderofthecriteriaintermsoftheirgeneralisationerror.Thisprocesswasrepeatedandaveragedover50trials,givingtheresultsinFigure6.ToaidinterpretationwelabelMIMwithasimplepointmarker,MIFS,CIFE,CondRed,andICAPwithacircle,andtheremainingcriteria(DISR,JMI,mRMRandCMIM)withastar.Thecriterialabelledwithastarbalancetherelativemagnitudeoftherelevancyandredundancyterms,thosewithacircledonotattempttobalancethem,andMIMcontainsnoredundancyterm.Thereisaclearseparationbetweenthosecriteriawithastaroutperformingthosewithacircle,andMIMvaryinginperformancebetweenthetwogroupsasweallowmoretrainingdatapoints.NoticethatthehighestrankedcriteriacoincidewiththoseintheclusteratthetopleftofFigures5aand5b.WesuggestthattherelativedifferenceinperformanceisduetothesamereasonnotedinSection5.2,thattheredundancytermgrowswiththesizeoftheselectedfeatureset.Inthiscase,theredundancytermeventuallygrowstooutweightherelevancybyalargedegree,andthenewfeaturesareselectedsolelyonthebasisofredundancy,ignoringtherelevance,thusleadingtopoorclassicationperformance. 
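Most of the criteria compared in these experiments (MIM, MIFS, CondRed, CIFE, mRMR and JMI) are instances of the linear form of Equation (18), J(X_k) = I(X_k;Y) - β Σ_{j∈S} I(X_j;X_k) + γ Σ_{j∈S} I(X_j;X_k|Y), so a single parameterised scorer covers all of them. The sketch below is our own illustration, reusing the estimator helpers from Section 2.1.

```python
def linear_criterion_score(k, selected, X, y, beta, gamma):
    """J(X_k) = I(X_k;Y) - beta * sum_j I(X_j;X_k) + gamma * sum_j I(X_j;X_k|Y),
    the linear form of Equation (18), scored for candidate column k given the
    already selected columns in `selected`."""
    relevancy = mutual_information(X[:, k], y)
    redundancy = sum(mutual_information(X[:, j], X[:, k]) for j in selected)
    cond_redundancy = sum(conditional_mutual_information(X[:, j], X[:, k], y)
                          for j in selected)
    return relevancy - beta * redundancy + gamma * cond_redundancy

# Settings corresponding to Figure 2 (for a non-empty selected set S):
#   MIM:     beta = 0,        gamma = 0
#   MIFS:    beta in [0, 1],  gamma = 0
#   CondRed: beta = 0,        gamma = 1
#   CIFE:    beta = 1,        gamma = 1
#   mRMR:    beta = 1/|S|,    gamma = 0
#   JMI:     beta = 1/|S|,    gamma = 1/|S|
```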
20 40 60 80 100 120 140 1 2 3 4 5 6 7 8 9 Training pointsRank mim mifs condred cife icap mrmr jmi disr cmim Figure6:Averageranksofcriteriaintermsoftesterror,selecting10features,across11datasets.Notethecleardominanceofcriteriawhichdonotallowtheredundancytermtoover-whelmtherelevancyterm(unlledmarkers)overthosethatallowredundancytogrowwiththesizeofthefeatureset(lledmarkers).48 FEATURESELECTIONVIACONDITIONALLIKELIHOODDataFeaturesExamplesClasses Colon2000622Leukemia7070722Lung325737Lymph4026969NCI99712609Table3:DatasetsfromPengetal.(2005),usedinexperiments.5.4ExtremeSmall-SampleExperimentsIntheprevioussectionswediscussedtwotheoreticalpropertiesofinformation-basedfeaturese-lectioncriteria:whetheritbalancestherelativemagnitudeofrelevancyagainstredundancy,andwhetheritincludesaclass-conditionalredundancyterm.EmpiricallyontheUCIdatasets,weseethatthebalancingisfarmoreimportantthantheinclusionoftheconditionalredundancyterm—forexample,MRMRsucceedsinmanycases,whileMIFSperformspoorly.Now,weconsiderwhethersamepropertymayholdinextremesmall-samplesituations,whenthenumberofexamplesissolowthatreliableestimationofdistributionsbecomesextremelydifcult.WeusedatasourcedfromPengetal.(2005),detailedinTable3.ResultsareshowninFigure7,selecting50featuresfromeachdatasetandplottingleave-one-outclassicationerror.Itshouldofcourseberememberedthatonsuchsmalldatasets,makingjustoneadditionaldatapointerrorcanresultinseeminglylargechangesinaccuracy.Forexample,thedifferencebetweenthebestandworstcriteriaonLeukemiawasjust3datapoints.IncontrasttotheUCIresults,thepictureislessclear.OnColon,thecriteriaallperformsimilarly;thisistheleastcomplexofallthedatasets,havingthesmallestnumberofclasseswitha(relatively)smallnumberoffeatures.Aswemovethroughthedatasetswithin-creasingnumbersoffeatures/classes,weseethatMIFS,CONDRED,CIFEandICAPstarttobreakaway,performingpoorlycomparedtotheothers.Again,wenotethatthesedonotattempttobal-ancerelevancy/redundancy.ThisdifferenceisclearestontheNCI9data,themostcomplexwith9classesand9712features.However,aswemayexpectwithsuchhighdimensionalandchallengingproblems,therearesomeexceptions—theColondataasmentioned,andalsotheLungdatawhereICAP/MIFSperformwell.5.5WhatistheRelationBetweenStabilityandAccuracy?Animportantquestioniswhetherwecanndagoodbalancebetweenthestabilityofacriterionandtheclassicationaccuracy.ThiswasconsideredbyGulgezenetal.(2009),whostudiedthesta-bility/accuracytrade-offfortheMRMRcriterion.Inthefollowing,weconsiderthistrade-offinthecontextofPareto-optimality,acrossthe9criteria,andthe15datasetsfromTable2.Experimentalprotocolwastotake50bootstrapsfromthedataset,eachtimecalculatingtheout-of-bagerrorusingthe3-nn.ThestabilitymeasurewasKuncheva'sstabilityindexcalculatedfromthe50featuresets,andtheaccuracywasthemeanout-of-bagaccuracyacrossthe50bootstraps.TheexperimentswerealsorepeatedusingtheInformationStabilitymeasure,revealingalmostidenticalresults.ResultsusingKuncheva'sstabilityindexareshowninFigure8.ThePareto-optimalsetisdenedasthesetofcriteriaforwhichnoothercriterionhasbothahigheraccuracyandahigherstability,hencethemembersofthePareto-optimalsetaresaidtobenon-dominated(FonsecaandFleming,1996).Thus,eachofthesubguresofFigure8,criteria49 BROWN,POCOCK,ZHAOANDLUJ´AN 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 ColonNumber of features selectedLOO number of mistakes mim mifs condred cife icap mrmr jmi disr cmim 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 LeukemiaNumber of features selectedLOO number of mistakes 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 50 LungNumber of features 
selectedLOO number of mistakes 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 50 LymphomaNumber of features selectedLOO number of mistakes 0 5 10 15 20 25 30 35 40 45 50 20 25 30 35 40 45 50 55 NCI9Number of features selectedLOO number of mistakes Figure7:LOOresultsonPeng'sdatasets:Colon,Lymphoma,Leukemia,Lung,NCI9.50 FEATURESELECTIONVIACONDITIONALLIKELIHOOD Figure8:Stabilty(y-axes)versusAccuracy(x-axes)over50bootstrapsforeachoftheUCIdatasets.Thepareto-optimalrankingsaresummarisedinTable4.51 BROWN,POCOCK,ZHAOANDLUJ´AN Accuracy=Stability(Yu) Accuracy=Stability(Kuncheva) Accuracy JMI(1:6) JMI(1:5) JMI(2:6) DISR(2:3) DISR(2:2) MRMR(3:6) MIM(2:4) MIM(2:3) DISR(3:7) MRMR(2:5) MRMR(2:5) CMIM(4:5) CMIM(3:3) CONDRED(3:2) ICAP(5:3) ICAP(3:6) CMIM(3:4) MIM(5:4) CONDRED(3:7) ICAP(4:3) CIFE(5:9) CIFE(4:3) CIFE(4:8) MIFS(6:5) MIFS(4:5) MIFS(4:9) CONDRED(7:4) Table4:Column1:Non-dominatedRankofdifferentcriteriaforthetrade-offofaccuracy/stability.Criteriawithahigherrank(closerto1:0)provideabettertradeoffthanthosewithalowerrank.Column2:Ascolumn1butusingKuncheva'sStabilityIndex.Column3:Averageranksforaccuracyalone.thatappearfurthertothetop-rightofthespacedominatethosetowardthebottomleft—insuchasituationthereisnoreasontochoosethoseatthebottomleft,sincetheyaredominatedonbothobjectivesbyothercriteria.Asummary(forbothstabilityandinformationstability)isprovidedinthersttwocolumnsofTable4,showingthenon-dominatedrankofthedifferentcriteria.Thisiscomputedperdatasetasthenumberofothercriteriawhichdominateagivencriterion,inthePareto-optimalsense,thenaveragedoverthe15datasets.Wecanseethattheserankingsaresimilartotheresultsearlier,withMIFS,ICAP,CIFEandCondRedperformingpoorly.WenotethatJMI,(whichbothbalancestherelevancyandredundancytermsandincludestheconditionalredundancy)outperformsallothercriteria.Wepresenttheaverageaccuracyranksacrossthe50bootstrapsincolumn3.ThesearesimilartotheresultsfromFigure6butuseabootstrapofthefulldataset,ratherthanasmallsamplefromit.FollowingDemsar(2006)weanalysedtheseranksusingaFriedmantesttodeterminewhichcriteriaarestatisticallysignicantlydifferentfromeachother.WethenusedaNemenyipost-hoctesttodeterminewhichcriteriadiffered,withstatisticalsignicancesat90%,95%,and99%condences.ThesegiveapartialorderingforthecriteriawhichwepresentinFigure9,showingaSignicantDominancePartialOrderdiagram.NotethatthisstyleofdiagramencapsulatesthesameinformationasaCriticalDifferencediagram(Demsar,2006),butallowsustodisplaymultiplelevelsofstatisticalsignicance.Aboldlineconnectingtwocriteriasigniesadifferenceatthe99%condencelevel,adashedlineatthe95%level,andadottedlineatthe90%level.Absenceofalinksigniesthatwedonothavethestatisticalpowertodeterminethedifferenceonewayoranother.ReadingFigure9,weseethatwith99%condenceJMIissignicantlysuperiortoCondRed,andMIFS,butnotstatisticallysignicantlydifferentfromtheothercriteria.Aswelowerourcondencelevel,moredifferencesappear,forexampleMRMRandMIFSareonlysignicantlydifferentatthe90%condencelevel.52 FEATURESELECTIONVIACONDITIONALLIKELIHOOD       \n \n  \r     
Following Demsar (2006), we analysed these ranks using a Friedman test to determine whether the criteria are statistically significantly different from each other. We then used a Nemenyi post-hoc test to determine which criteria differed, with statistical significance assessed at the 90%, 95%, and 99% confidence levels. These give a partial ordering for the criteria, which we present in Figure 9 as a significant-dominance partial-order diagram. Note that this style of diagram encapsulates the same information as a Critical Difference diagram (Demsar, 2006), but allows us to display multiple levels of statistical significance. A bold line connecting two criteria signifies a difference at the 99% confidence level, a dashed line at the 95% level, and a dotted line at the 90% level. Absence of a link signifies that we do not have the statistical power to determine the difference one way or the other. Reading Figure 9, we see that with 99% confidence JMI is significantly superior to CondRed and MIFS, but not statistically significantly different from the other criteria. As we lower our confidence level, more differences appear; for example, MRMR and MIFS are only significantly different at the 90% confidence level.

Figure 9: Significant dominance partial-order diagram. Criteria are placed top to bottom in the diagram by their rank taken from column 3 of Table 4. A link joining two criteria means a statistically significant difference is observed with a Nemenyi post-hoc test at the specified confidence level. For example, JMI is significantly superior to MIFS (beta = 1) at the 99% confidence level. Note that the absence of a link does not signify the lack of a statistically significant difference, but that the Nemenyi test does not have sufficient power (in terms of the number of data sets) to determine the outcome (Demsar, 2006).

It is interesting to note that the four bottom-ranked criteria correspond to the corners of the unit square in Figure 2, while the top three (JMI/MRMR/DISR) are all very similar, scaling the redundancy terms by the size of the feature set. The middle ranks belong to CMIM/ICAP, which are similar in that they use the min/max strategy instead of a linear combination of terms.

5.6 Summary of Empirical Findings

From the experiments in this section, we conclude that the balance of the relevancy/redundancy terms is extremely important, while the inclusion of a class-conditional term seems to matter less. We find that some criteria are inherently more stable than others, and that the trade-off between accuracy (using a simple k-nn classifier) and stability of the feature sets differs between criteria. The best overall trade-off for accuracy/stability was found in the JMI and MRMR criteria. In the following section we re-assess these findings in the context of two problems posed for the NIPS Feature Selection Challenge.

6. Performance on the NIPS Feature Selection Challenge

In this section we investigate the performance of the criteria on data sets taken from the NIPS Feature Selection Challenge (Guyon, 2003).

6.1 Experimental Protocols

We present results using GISETTE (a handwriting recognition task) and MADELON (an artificially generated data set).

Data      Features   Examples (Tr/Val)   Classes
GISETTE   5000       6000/1000           2
MADELON   500        2000/600            2

Table 5: Data sets from the NIPS challenge, used in experiments.

To apply the mutual information criteria, we estimate the necessary distributions using histogram estimators: features were discretized independently into 10 equal-width bins, with bin boundaries determined from the training data. After the feature selection process, the original (undiscretised) data sets were used to classify the validation data. Each criterion was used to generate a ranking for the top 200 features in each data set. We show results using the full top 200 for GISETTE, but only the top 20 for MADELON, as after this point all criteria demonstrated severe overfitting. We use the Balanced Error Rate (BER), for fair comparison with previously published work on the NIPS data sets. We accept that this does not necessarily share the same optima as the classification error (to which the conditional likelihood relates), and leave investigation of this to future work.

Validation data results are presented in Figure 10 (GISETTE) and Figure 11 (MADELON). The minimum of the validation error was used to select the best performing feature set size, the training data alone was used to classify the testing data, and finally test labels were submitted to the challenge website. Test results are provided in Table 6 for GISETTE and Table 7 for MADELON. (We do not provide classification confidences, as we used a nearest neighbour classifier, and thus the AUC is equal to 1 - BER.)

Unlike in Section 5, the data sets we have used from the NIPS Feature Selection Challenge have a greater number of data points (GISETTE has 6000 training examples, MADELON has 2000), and thus we can present results using a direct implementation of Equation (10) as a criterion. We refer to this criterion as CMI, as it uses the conditional mutual information to score features. Unfortunately, there are still estimation errors in this calculation when selecting a large number of features, even given the large number of data points, and so the criterion fails to select features after a certain point, as each feature appears equally irrelevant. In GISETTE, CMI selected 13 features, and so the top 10 features were used and thus one result is shown. In MADELON, CMI selected 7 features and so 7 results are shown.
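To make the estimation protocol concrete, the following sketch (ours, with invented helper names, not the released toolbox code) shows the equal-width discretisation, a plug-in estimate of mutual information, and the direct conditional mutual information score used by CMI, in which the joint configuration of the selected set S is collapsed into a single discrete variable. As noted above, this last estimate degrades quickly as S grows, since the conditioning partitions the data into ever smaller cells.

import numpy as np

def discretise(column, n_bins=10):
    """Cut one feature into equal-width bins, with boundaries taken from the data."""
    edges = np.linspace(column.min(), column.max(), n_bins + 1)[1:-1]
    return np.digitize(column, edges)

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y), in nats, for two discrete 1-D arrays."""
    n = len(x)
    joint = {}
    for xi, yi in zip(x, y):
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
    px = {v: np.mean(x == v) for v in set(x)}
    py = {v: np.mean(y == v) for v in set(y)}
    return sum((c / n) * np.log((c / n) / (px[xi] * py[yi]))
               for (xi, yi), c in joint.items())

def conditional_mi(x, y, s):
    """I(X;Y|S) as the p(s)-weighted sum of I(X;Y | S=s), with S a 2-D array of selected columns."""
    _, s_label = np.unique(s, axis=0, return_inverse=True)   # joint configuration of S
    n, total = len(x), 0.0
    for v in np.unique(s_label):
        mask = s_label == v
        total += (mask.sum() / n) * mutual_information(x[mask], y[mask])
    return total

def forward_step(X_disc, y, selected, candidates):
    """One forward step of the direct CMI criterion on pre-discretised data X_disc."""
    if not selected:
        return max(candidates, key=lambda k: mutual_information(X_disc[:, k], y))
    S = X_disc[:, selected]
    return max(candidates, key=lambda k: conditional_mi(X_disc[:, k], y, S))

Here X_disc would be built by applying discretise to each training column; the undiscretised data are then used, as in the protocol above, for the final classification step.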
Figure 10: Validation error curve using GISETTE (x-axis: number of features; y-axis: validation error).

Figure 11: Validation error curve using MADELON (x-axis: number of features; y-axis: validation error).

6.2 Results on Test Data

In Table 6 there are several distinctions between the criteria, the most striking of which is the failure of MIFS to select an informative feature set. The importance of balancing the magnitude of the relevancy and the redundancy can be seen by looking at the other criteria in this test: those criteria which balance the magnitudes (CMIM, JMI, and mRMR) perform better than those which do not (ICAP, CIFE). The DISR criterion forms an outlier here, as it performs poorly when compared to JMI. The only difference between these two criteria is the normalization in DISR; as such, the likely cause of the observed poor performance is the extra variance introduced by estimating the normalization term H(X_k X_j Y). We can also see how important the low-dimensional approximation is: even with 6000 training examples, CMI cannot estimate the required joint distribution accurately enough to avoid selecting probes, despite being a direct iterative maximisation of the conditional likelihood in the limit of data points.

Criterion                 BER     AUC     Features (%)   Probes (%)
MIM                       4.18    95.82   4.00           0.00
MIFS                      42.00   58.00   4.00           58.50
CIFE                      6.85    93.15   2.00           0.00
ICAP                      4.17    95.83   1.60           0.00
CMIM                      2.86    97.14   2.80           0.00
CMI                       8.06    91.94   0.20           20.00
mRMR                      2.94    97.06   3.20           0.00
JMI                       3.51    96.49   4.00           0.00
DISR                      8.03    91.97   4.00           0.00
Winning Challenge Entry   1.35    98.71   18.3           0.0

Table 6: NIPS FS Challenge Results: GISETTE.

The MADELON results (Table 7) show a particularly interesting point: the top performers (in terms of BER) are JMI and CIFE. Both these criteria include the class-conditional redundancy term, but CIFE does not balance the influence of relevancy against redundancy. In this case, it appears the 'balancing' issue, so important in our previous experiments, has little importance; instead, the presence of the conditional redundancy term is the differentiating factor between criteria (note the poor performance of MIFS/MRMR). This is perhaps not surprising given the nature of the MADELON data, constructed precisely to require features to be evaluated jointly. It is interesting to note that the challenge organisers benchmarked a 3-NN using the optimal feature set, achieving a 10% test error (Guyon, 2003). Many of the criteria managed to select feature sets which achieved a similar error rate using a 3-NN, and it is likely that a more sophisticated classifier is required to further improve performance.

Criterion                 BER     AUC     Features (%)   Probes (%)
MIM                       10.78   89.22   2.20           0.00
MIFS                      46.06   53.94   2.60           92.31
CIFE                      9.50    90.50   3.80           0.00
ICAP                      11.11   88.89   1.60           0.00
CMIM                      11.83   88.17   2.20           0.00
CMI                       21.39   78.61   0.80           0.00
mRMR                      35.83   64.17   3.40           82.35
JMI                       9.50    90.50   3.20           0.00
DISR                      9.56    90.44   3.40           0.00
Winning Challenge Entry   7.11    96.95   1.6            0.0

Table 7: NIPS FS Challenge Results: MADELON.

This concludes our experimental study; in the following, we make further links to the literature for the theoretical framework, and discuss implications for future work.
7. Related Work: Strong and Weak Relevance

Kohavi and John (1997) proposed definitions of strong and weak feature relevance. The definitions are formed from statements about the conditional probability distributions of the variables involved. We can re-state the definitions of Kohavi and John (hereafter KJ) in terms of mutual information, and see how they fit into our conditional likelihood maximisation framework. In the notation below, X_i indicates the i-th feature in the overall set X, and X_{\neg i} indicates the set {X \ X_i}, that is, all features except the i-th.

Definition 8: Strongly Relevant Feature (Kohavi and John, 1997). Feature X_i is strongly relevant to Y iff there exists an assignment of values x_i, y, x_{\neg i} for which p(X_i = x_i, X_{\neg i} = x_{\neg i}) > 0 and p(Y = y | X_i = x_i, X_{\neg i} = x_{\neg i}) != p(Y = y | X_{\neg i} = x_{\neg i}).

Corollary 9: A feature X_i is strongly relevant iff I(X_i; Y | X_{\neg i}) > 0.

Proof. The KL divergence D_{KL}( p(y|xz) || p(y|z) ) > 0 iff p(y|xz) != p(y|z) for some assignment of values x, y, z. A simple re-application of the manipulations leading to Equation (5) demonstrates that the expected KL-divergence E_{xz}{ D_{KL}( p(y|xz) || p(y|z) ) } is equal to the mutual information I(X;Y|Z). In the definition of strong relevance, if there exists a single assignment of values x_i, y, x_{\neg i} that satisfies the inequality, then E_{x}{ D_{KL}( p(y|x_i x_{\neg i}) || p(y|x_{\neg i}) ) } > 0, and therefore I(X_i; Y | X_{\neg i}) > 0.

Given the framework we have presented, we can note that this strong relevance comes from a combination of three terms:

    I(X_i; Y | X_{\neg i}) = I(X_i; Y) - I(X_i; X_{\neg i}) + I(X_i; X_{\neg i} | Y).

This view of strong relevance demonstrates explicitly that a feature may be individually irrelevant (i.e., p(y|x_i) = p(y) and thus I(X_i;Y) = 0), but still strongly relevant if I(X_i; X_{\neg i} | Y) - I(X_i; X_{\neg i}) > 0.

Definition 10: Weakly Relevant Feature (Kohavi and John, 1997). Feature X_i is weakly relevant to Y iff it is not strongly relevant and there exists a subset Z of X_{\neg i}, and an assignment of values x_i, y, z for which p(X_i = x_i, Z = z) > 0, such that p(Y = y | X_i = x_i, Z = z) != p(Y = y | Z = z).

Corollary 11: A feature X_i is weakly relevant to Y iff it is not strongly relevant and I(X_i; Y | Z) > 0 for some subset Z of X_{\neg i}.

Proof. This follows immediately from the proof for strong relevance above.
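A toy illustration of this last point, using our own plug-in estimates rather than anything from the original study: in a two-feature XOR problem each feature alone carries no information about the class, so I(X_i;Y) = 0 up to sampling noise, yet conditioning on the other feature reveals it to be strongly relevant.

import numpy as np
from collections import Counter

def mi(x, y):
    """Plug-in I(X;Y), in bits, for discrete sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 10000)
x2 = rng.integers(0, 2, 10000)
y = x1 ^ x2                                  # the class is the XOR of the two features

print(round(mi(x1, y), 3))                   # ~0.0 : individually irrelevant
cond = sum(np.mean(x2 == v) * mi(x1[x2 == v], y[x2 == v]) for v in (0, 1))
print(round(cond, 3))                        # ~1.0 : strongly relevant given X2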
It is interesting, and somewhat non-intuitive, that there can be cases where there are no strongly relevant features, yet all are weakly relevant. This will occur, for example, in a data set where all features have exact duplicates: we have 2M features and, for all i, X_{M+i} = X_i. In this case, for any X_k (with k <= M) we will have I(X_k; Y | X_{\neg k}) = 0, since its duplicate feature X_{M+k} carries the same information. Thus any feature X_k (with k <= M) that is strongly relevant in the data set {X_1, ..., X_M} is only weakly relevant in the data set {X_1, ..., X_{2M}}.

This issue can be dealt with by refining our definition of relevance with respect to a subset of the full feature space. A particular subset about which we have some information is the currently selected set S. We can relate our framework to KJ's definitions in this context. Following KJ's formulations:

Definition 12: Relevance with respect to the current set S. Feature X_i is relevant to Y with respect to S iff there exists an assignment of values x_i, y, s for which p(X_i = x_i, S = s) > 0 and p(Y = y | X_i = x_i, S = s) != p(Y = y | S = s).

Corollary 13: Feature X_i is relevant to Y with respect to S iff I(X_i; Y | S) > 0.

A feature that is relevant with respect to S is either strongly or weakly relevant (in the KJ sense), but it is not possible to determine in which class it lies, as we have not conditioned on X_{\neg i}. Notice that the definition coincides exactly with the forward selection heuristic (Definition 2), which we have shown is a hill-climber on the conditional likelihood. As a result, we see that hill-climbing on the conditional likelihood corresponds to adding the most relevant feature with respect to the current set S. Again we re-emphasize that the resultant gain in the likelihood comes from a combination of three sources:

    I(X_i; Y | S) = I(X_i; Y) - I(X_i; S) + I(X_i; S | Y).

It could easily be the case that I(X_i;Y) = 0, that is, a feature is entirely irrelevant when considered on its own, but the sum of the two redundancy terms results in a positive value for I(X_i;Y|S). We see that if a criterion does not attempt to model both of the redundancy terms, even if only using low-dimensional approximations, it runs the risk of evaluating the relevance of X_i incorrectly.

Definition 14: Irrelevance with respect to the current set S. Feature X_i is irrelevant to Y with respect to S iff, for all x_i, y, s for which p(X_i = x_i, S = s) > 0, we have p(Y = y | X_i = x_i, S = s) = p(Y = y | S = s).

Corollary 15: Feature X_i is irrelevant to Y with respect to S iff I(X_i; Y | S) = 0.

In a forward step, if a feature X_i is irrelevant with respect to S, adding it alone to S will not increase the conditional likelihood. However, there may be further additions to S in the future, giving us a selected set S'; we may then find that X_i is relevant with respect to S'. In a backward step, we check whether a feature is irrelevant with respect to {S \ X_i}, using the test I(X_i; Y | {S \ X_i}) = 0. In this case, removing the feature will not decrease the conditional likelihood.
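These two checks can be sketched as follows; this is our own rendering, written against an arbitrary estimator cmi_hat of I(X_k;Y|S) (for example, the plug-in estimate sketched in Section 6), and because estimated information is never exactly zero, a small tolerance eps stands in for the exact zero test of Definition 14. The same forward/backward shape underlies the Markov blanket procedures discussed in the next section.

def forward_step(X, y, selected, candidates, cmi_hat, eps=1e-3):
    """Add the feature most relevant with respect to S, if any candidate is relevant."""
    scores = {k: cmi_hat(X, y, selected, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > eps else None    # None: every candidate is irrelevant w.r.t. S

def backward_step(X, y, selected, cmi_hat, eps=1e-3):
    """Remove any feature that is irrelevant with respect to S minus that feature."""
    for k in list(selected):
        rest = [j for j in selected if j != k]
        if cmi_hat(X, y, rest, k) <= eps:
            selected.remove(k)     # removal does not decrease the conditional likelihood
    return selected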
8. Related Work: Structure Learning in Bayesian Networks

The framework we have described also serves to highlight a number of important links to the literature on structure learning of directed acyclic graphical (DAG) models (Korb, 2011). The problem of DAG learning from observed data is known to be NP-hard (Chickering et al., 2004), and as such there exist two main families of approximate algorithms. Metric or score-and-search learners construct a graph by searching the space of DAGs directly, assigning a score to each based on properties of the graph in relation to the observed data; probably the most well-known score is the BIC measure (Korb, 2011). However, the space of DAGs is super-exponential in the number of variables, and hence an exhaustive search rapidly becomes computationally infeasible. Grossman and Domingos (2004) proposed a greedy hill-climbing search over structures, using conditional likelihood as a scoring criterion. Their work found a significant advantage from using this 'discriminative' learning objective, as opposed to the traditional 'generative' joint likelihood. The potential of this discriminative model perspective is expanded upon in Section 9.3.

Constraint learners approach the problem from a constructivist point of view, adding and removing arcs from a single DAG according to conditional independence tests given the data. When the candidate DAG passes all conditional independence statements observed in the data, it is considered to be a good model. In the current paper, for a feature to be eligible for inclusion, we required that I(X_k; Y | S) > 0; this is equivalent to a test of the conditional independence statement X_k _||_ Y | S, with the feature eligible when the statement fails. One well-known problem with constraint learners is that if a test gives an incorrect result, the error can 'cascade', causing the algorithm to draw further incorrect conclusions about the network structure. The same problem affects the popular greedy-search heuristics that we have described in this work.

In Section 3.2, we showed that Markov blanket algorithms (Tsamardinos et al., 2003) are an example of the framework we propose. Specifically, the solution to Equation (7) is a (possibly non-unique) Markov blanket, and the solution to Equation (8) is exactly the Markov boundary, that is, a minimal, unique blanket. It is interesting to note that these algorithms, which are a restricted class of structure learners, assume faithfulness of the data distribution. We can see straightforwardly that all criteria we have considered, when combined with a greedy forward selection, also make this assumption.

9. Conclusion

This work has presented a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic scoring criteria under a single theoretical interpretation. This is achieved via a novel interpretation of information theoretic feature selection as an optimization of the conditional likelihood; this is in contrast to the current view of mutual information as a heuristic measure of feature 'relevancy'.

9.1 Summary of Contributions

In Section 3 we showed how to decompose the conditional likelihood into three terms, each with their own interpretation in relation to the feature selection problem. One of these emerges as a conditional mutual information. This observation allows us to answer the following question.

What are the implicit statistical assumptions of mutual information criteria? The investigations have revealed that the various criteria published over the past two decades are all approximate iterative maximisers of the conditional likelihood. The approximations are due to implicit assumptions on the data distribution: some are more restrictive than others, and they are detailed in Section 4.
The approximations, while heuristic, are necessary due to the need to estimate high-dimensional probability distributions. The popular Markov blanket learning algorithm IAMB is included in this class of procedures, and hence can also be seen as an iterative maximiser of the conditional likelihood.

The main differences between criteria are whether they include a class-conditional term, and whether they provide a mechanism to balance the relative size of the redundancy terms against the relevancy term. To ascertain how these differences affect the criteria in practice, we conducted an empirical study of 9 different heuristic mutual information criteria across 22 data sets. We analyzed how the criteria behave in large/small sample situations, how the stability of the returned feature sets varies between criteria, and how similar the criteria are in the feature sets they return. In particular, the following questions were investigated.

How do the theoretical properties translate to classifier accuracy? Summarising the performance of the criteria under the above conditions, including the class-conditional term is not always necessary: various criteria, for example MRMR, are successful without this term. However, without it, criteria are blind to certain classes of problems, for example the MADELON data set, and will perform poorly in those cases. Balancing the relevancy and redundancy terms is, however, extremely important: criteria like MIFS or CIFE, which allow redundancy to swamp relevancy, are ranked lowest for accuracy in almost all experiments. In addition, this imbalance tends to cause large instability in the returned feature sets, making them highly sensitive to the supplied data.

How stable are the criteria to small changes in the data? Several criteria return wildly different feature sets with just small changes in the data, while others return similar sets each time, and hence are 'stable' procedures. The most stable was the univariate mutual information, followed closely by JMI (Yang and Moody, 1999; Meyer et al., 2008), while among the least stable are MIFS (Battiti, 1994) and ICAP (Jakulin, 2005). As visualised by multi-dimensional scaling in Figure 5, several criteria appear to return quite similar sets, while there are some outliers.

How do criteria behave in limited and extreme small-sample situations? In extreme small-sample situations, it appears the above rules (regarding the conditional term and the balancing of relevancy against redundancy) can be broken: the poor estimation of distributions means the theoretical properties do not translate immediately into performance.

9.2 Advice for the Practitioner

From our investigations we have identified three desirable characteristics of an information-based selection criterion. The first is whether it includes reference to a conditional redundancy term; criteria that do not incorporate it are effectively blind to an entire class of problems, those with strong class-conditional dependencies. The second is whether it keeps the relative size of the redundancy term from swamping the relevancy term. We find this to be essential: without this control, the relevancy of the k-th feature can easily be ignored in the selection process due to the k-1 redundancy terms. The third is simply whether the criterion is a low-dimensional approximation, making it usable with small sample sizes. On GISETTE, with 6000 examples, we were unable to select more than 13 features with any kind of reliability; therefore low-dimensional approximations, the focus of this article, are essential.

A summary of the criteria is shown in Table 8. Overall we find only 3 criteria that satisfy these properties: CMIM, JMI and DISR. We recommend the JMI criterion, as from empirical investigations it has the best trade-off (in the Pareto-optimal sense) of accuracy and stability.
DISR is a normalised variant of JMI; in practice we found little need for this normalisation and the extra computation involved. If higher stability is required, the MIM criterion, as expected, displayed the highest stability with respect to variations in the data; therefore in extreme data-poor situations we would recommend this as a first step. If speed is required, the CMIM criterion admits a fast exact implementation, giving orders of magnitude speed-up over a straightforward implementation; refer to Fleuret (2004) for details. To aid replicability of this work, implementations of all criteria we have discussed are provided at http://www.cs.man.ac.uk/gbrown/fstoolbox/

                          MIM   mRMR   MIFS   CMIM   JMI   DISR   ICAP   CIFE   CMI
Cond. redundancy term?    no    no     no     yes    yes   yes    yes    yes    yes
Balances rel/red?         yes   yes    no     yes    yes   yes    no     no     yes
Estimable?                yes   yes    yes    yes    yes   yes    yes    yes    no

Table 8: Summary of criteria, arranged left to right in order of ascending estimation difficulty. Cond. redundancy term: does it include the conditional redundancy term? Balances rel/red: does it balance the relevance and redundancy terms? Estimable: does it use a low-dimensional approximation, making it usable with small samples?

9.3 Future Work

While advice on the suitability of existing criteria is of course useful, perhaps a more interesting result of this work is the perspective it brings to the feature selection problem. We were able to explicitly state an objective function, and derive an appropriate information-based criterion to maximise it. This raises the question: what selection criteria would result from different objective functions? Dmochowski et al. (2010) study a weighted conditional likelihood and its suitability for cost-sensitive problems; it is possible (though outside the scope of this paper) to derive information-based criteria in this context. The reverse question is equally interesting: what objective functions are implied by other existing criteria, such as the Gini index? The KL-divergence (which defines the mutual information) is a special case of a wider family of divergence measures; could we obtain similarly efficient criteria that pursue these measures, and what overall objectives do they imply?

In this work we explored criteria that use pairwise (i.e., I(X_k;X_j)) approximations to the derived objective. These approximations are commonly used as they provide a reasonable heuristic while still being (relatively) simple to estimate. There has been work which suggests relaxing this pairwise approximation, and thus increasing the number of terms (Brown, 2009; Meyer et al., 2008), but there is little exploration of how much data is required to estimate these multivariate information terms. A theoretical analysis of the trade-off between estimation accuracy and the additional information provided by these more complex terms could provide interesting directions for improving the power of filter feature selection techniques.
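As a concrete instance of such a pairwise approximation, here is a minimal sketch (ours, not the toolbox implementation) of the JMI forward step: each unselected feature X_k is scored by the sum of joint mutual informations I(X_k X_j; Y) over the already-selected X_j, where mi_hat stands for any plug-in estimator of mutual information between discrete columns.

import numpy as np

def joint(a, b):
    """Pair two discrete columns into a single discrete variable."""
    _, paired = np.unique(np.column_stack([a, b]), axis=0, return_inverse=True)
    return paired

def jmi_step(X, y, selected, candidates, mi_hat):
    """Return the candidate feature index with the highest JMI score."""
    if not selected:                     # first feature: plain relevancy I(Xk;Y)
        return max(candidates, key=lambda k: mi_hat(X[:, k], y))
    def score(k):
        return sum(mi_hat(joint(X[:, k], X[:, j]), y) for j in selected)
    return max(candidates, key=score)

By the rearrangement in Appendix A.1, this score is equivalent (up to terms constant in X_k) to the relevancy minus the averaged difference of the two redundancy terms, which is precisely the balancing property recommended above.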
A very interesting direction concerns the motivation behind the conditional likelihood as an objective. It can be noted that the conditional likelihood, though a well-accepted objective function in its own right, can be derived from a probabilistic discriminative model, as follows. We approximate the true distribution p with our model q, with three distinct parameter sets: \theta for feature selection, \tau for classification, and \lambda modelling the input distribution p(x). Following Minka (2005), in the construction of a discriminative model, our joint likelihood is

    L(D; \theta, \tau, \lambda) = p(\theta, \tau) p(\lambda) \prod_{i=1}^{N} q(y^i | x^i, \theta, \tau) q(x^i | \lambda).

In this type of model, we wish to maximize L with respect to \theta (our feature selection parameters) and \tau (our model parameters), and we are not concerned with the generative parameters \lambda. Excluding the generative terms gives

    L(D; \theta, \tau, \lambda) \propto p(\theta, \tau) \prod_{i=1}^{N} q(y^i | x^i, \theta, \tau).

When we have no particular bias or prior knowledge over which subsets of features or parameters are more likely (i.e., a flat prior p(\theta, \tau)), this reduces to the conditional likelihood,

    L(D; \theta, \tau, \lambda) \propto \prod_{i=1}^{N} q(y^i | x^i, \theta, \tau),

which was exactly our starting point for the current paper. An obvious extension here is to take a non-uniform prior over features. An important direction for machine learning is to incorporate domain knowledge; a non-uniform prior would mean influencing the search procedure to incorporate our background knowledge of the features. This is applicable, for example, in gene expression data, where we may have information about the metabolic pathways in which genes participate, and therefore which genes are likely to influence certain biological functions. This is outside the scope of this paper, but is the focus of our current research.

Acknowledgments

This research was conducted with support from the UK Engineering and Physical Sciences Research Council, on grants EP/G000662/1 and EP/F023855/1. Mikel Luján is supported by a Royal Society University Research Fellowship. Gavin Brown would like to thank James Neil, Sohan Seth, and Fabio Roli for invaluable commentary on drafts of this work.

Appendix A.

The following proofs make use of the identity

    I(A;B|C) - I(A;B) = I(A;C|B) - I(A;C).

A.1 Proof of Equation (17)

The Joint Mutual Information criterion (Yang and Moody, 1999) can be written

    J_{jmi}(X_k) = \sum_{X_j \in S} I(X_k X_j; Y) = \sum_{X_j \in S} [ I(X_j;Y) + I(X_k;Y|X_j) ].

The term \sum_{X_j \in S} I(X_j;Y) in the above is constant with respect to the X_k argument that we are interested in, so it can be omitted. The criterion therefore reduces to (17) as follows:

    J_{jmi}(X_k) = \sum_{X_j \in S} I(X_k;Y|X_j)
                 = \sum_{X_j \in S} [ I(X_k;Y) - I(X_k;X_j) + I(X_k;X_j|Y) ]
                 = |S| I(X_k;Y) - \sum_{X_j \in S} [ I(X_k;X_j) - I(X_k;X_j|Y) ]
                 \propto I(X_k;Y) - (1/|S|) \sum_{X_j \in S} [ I(X_k;X_j) - I(X_k;X_j|Y) ].

A.2 Proof of Equation (19)

The rearrangement of the Conditional Mutual Information criterion (Fleuret, 2004) follows a very similar procedure. The original and its rewriting are

    J_{cmim}(X_k) = \min_{X_j \in S} I(X_k;Y|X_j)
                  = \min_{X_j \in S} [ I(X_k;Y) - I(X_k;X_j) + I(X_k;X_j|Y) ]
                  = I(X_k;Y) + \min_{X_j \in S} [ I(X_k;X_j|Y) - I(X_k;X_j) ]
                  = I(X_k;Y) - \max_{X_j \in S} [ I(X_k;X_j) - I(X_k;X_j|Y) ],

which is exactly Equation (19).

References

K. S. Balagani and V. V. Phoha. On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7):1342–1343, 2010.

R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, 1994.

G. Brown. A new perspective for information theoretic feature selection. In International Conference on Artificial Intelligence and Statistics, volume 5, pages 49–56, 2009.

H. Cheng, Z. Qin, C. Feng, Y. Wang, and F. Li. Conditional mutual information-based feature selection analyzing for synergy and redundancy. Electronics and Telecommunications Research Institute (ETRI) Journal, 33(2), 2011.

D. M. Chickering, D. Heckerman, and C. Meek. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5:1287–1330, 2004.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, 1991.
J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

J. P. Dmochowski, P. Sajda, and L. C. Parra. Maximum likelihood in cost-sensitive learning: model specification, approximations, and upper bounds. Journal of Machine Learning Research, 11:3313–3332, 2010.

W. Duch. Feature Extraction: Foundations and Applications, chapter 3, pages 89–117. Studies in Fuzziness & Soft Computing. Springer, 2006.

A. El Akadi, A. El Ouardighi, and D. Aboutajdine. A powerful feature selection approach based on mutual information. International Journal of Computer Science and Network Security, 8(4):116, 2008.

R. M. Fano. Transmission of Information: Statistical Theory of Communications. Wiley, New York, 1961.

F. Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5:1531–1555, 2004.

C. Fonseca and P. Fleming. On the performance assessment and comparison of stochastic multiobjective optimizers. Parallel Problem Solving from Nature, pages 584–593, 1996.

D. Grossman and P. Domingos. Learning Bayesian network classifiers by maximizing conditional likelihood. In International Conference on Machine Learning. ACM, 2004.

G. Gulgezen, Z. Cataltepe, and L. Yu. Stable and accurate feature selection. Machine Learning and Knowledge Discovery in Databases, pages 455–468, 2009.

B. Guo and M. S. Nixon. Gait feature subset selection by mutual information. IEEE Transactions on Systems, Man and Cybernetics, 39(1):36–46, January 2009.

I. Guyon. Design of experiments for the NIPS 2003 variable selection benchmark. http://www.nipsfsc.ecs.soton.ac.uk/papers/NIPS2003-Datasets.pdf, 2003.

I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors. Feature Extraction: Foundations and Applications. Springer, 2006.

M. Hellman and J. Raviv. Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory, 16(4):368–372, 1970.

A. Jakulin. Machine Learning Based on Attribute Interactions. PhD thesis, University of Ljubljana, Slovenia, 2005.

A. Kalousis, J. Prados, and M. Hilario. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems, 12(1):95–116, 2007.

R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.
D. Koller and M. Sahami. Toward optimal feature selection. In International Conference on Machine Learning, 1996.

K. Korb. Encyclopedia of Machine Learning, chapter Learning Graphical Models, page 584. Springer, 2011.

L. I. Kuncheva. A stability index for feature selection. In IASTED International Multi-Conference: Artificial Intelligence and Applications, pages 390–395, 2007.

N. Kwak and C. H. Choi. Input feature selection for classification problems. IEEE Transactions on Neural Networks, 13(1):143–159, 2002.

D. D. Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the Workshop on Speech and Natural Language, pages 212–217. Association for Computational Linguistics, Morristown, NJ, USA, 1992.

D. Lin and X. Tang. Conditional infomax learning: An integrated framework for feature extraction and fusion. In European Conference on Computer Vision, 2006.

P. Meyer and G. Bontempi. On the use of variable complementarity for feature selection in cancer classification. In Evolutionary Computation and Machine Learning in Bioinformatics, pages 91–102, 2006.

P. E. Meyer, C. Schretter, and G. Bontempi. Information-theoretic feature selection in microarray data using variable complementarity. IEEE Journal of Selected Topics in Signal Processing, 2(3):261–274, 2008.

T. Minka. Discriminative models, not discriminative training. Microsoft Research Cambridge, Tech. Rep. TR-2005-144, 2005.

L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.

C. E. Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27(3):379–423, 1948.

M. Tesmer and P. A. Estevez. AMIFS: Adaptive feature selection by using mutual information. In IEEE International Joint Conference on Neural Networks, volume 1, 2004.

I. Tsamardinos and C. F. Aliferis. Towards principled feature selection: Relevancy, filters and wrappers. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics (AISTATS), 2003.

I. Tsamardinos, C. F. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In 16th International FLAIRS Conference, volume 103, 2003.

M. Vidal-Naquet and S. Ullman. Object recognition with informative features and linear classification. IEEE Conference on Computer Vision and Pattern Recognition, 2003.

J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. Advances in Neural Information Processing Systems, pages 668–674, 2001.

H. Yang and J. Moody. Data visualization and feature selection: New algorithms for non-Gaussian data. Advances in Neural Information Processing Systems, 12, 1999.

L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5:1205–1224, 2004.

L. Yu, C. Ding, and S. Loscalzo. Stable feature selection via dense feature groups. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 803–811, 2008.