BROWNPOCOCKZHAOANDLUJ ID: 377078
Download Pdf The PPT/PDF document "JournalofMachineLearningResearch13(2012)..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
JournalofMachineLearningResearch13(2012)27-66Submitted12/10;Revised6/11;Published1/12ConditionalLikelihoodMaximisation:AUnifyingFrameworkforInformationTheoreticFeatureSelectionGavinBrownGAVIN.BROWN@CS.MANCHESTER.AC.UKAdamPocockADAM.POCOCK@CS.MANCHESTER.AC.UKMing-JieZhaoMING-JIE.ZHAO@CS.MANCHESTER.AC.UKMikelLuj´anMIKEL.LUJAN@CS.MANCHESTER.AC.UKSchoolofComputerScienceUniversityofManchesterManchesterM139PL,UKEditor:IsabelleGuyonAbstractWepresentaunifyingframeworkforinformationtheoreticfeatureselection,bringingalmosttwodecadesofresearchonheuristicltercriteriaunderasingletheoreticalinterpretation.Thisisinresponsetothequestion:whataretheimplicitstatisticalassumptionsoffeatureselectioncriteriabasedonmutualinformation?.Toanswerthis,weadoptadifferentstrategythanisusualinthefeatureselectionliteratureinsteadoftryingtodeneacriterion,wederiveone,directlyfromaclearlyspeciedobjectivefunction:theconditionallikelihoodofthetraininglabels.Whilemanyhand-designedheuristiccriteriatrytooptimizeadenitionoffeature`relevancy'and`redundancy',ourapproachleadstoaprobabilisticframeworkwhichnaturallyincorporatestheseconcepts.Asaresultwecanunifythenumerouscriteriapublishedoverthelasttwodecades,andshowthemtobelow-orderapproximationstotheexact(butintractable)optimisationproblem.Theprimarycontributionistoshowthatcommonheuristicsforinformationbasedfeatureselection(includingMarkovBlanketalgorithmsasaspecialcase)areapproximateiterativemaximisersofthecon-ditionallikelihood.Alargeempiricalstudyprovidesstrongevidencetofavourcertainclassesofcriteria,inparticularthosethatbalancetherelativesizeoftherelevancy/redundancyterms.OverallweconcludethattheJMIcriterion(YangandMoody,1999;Meyeretal.,2008)providesthebesttradeoffintermsofaccuracy,stability,andexibilitywithsmalldatasamples.Keywords:featureselection,mutualinformation,conditionallikelihood1.IntroductionHighdimensionaldatasetsareasignicantchallengeforMachineLearning.Someofthemostpracticallyrelevantandhigh-impactapplications,suchasgeneexpressiondata,mayeasilyhavemorethan10,000features.Manyofthesefeaturesmaybecompletelyirrelevanttothetaskathand,orredundantinthecontextofothers.Learninginthissituationraisesimportantissues,forexample,over-ttingtoirrelevantaspectsofthedata,andthecomputationalburdenofprocessingmanysimilarfeaturesthatprovideredundantinformation.Itisthereforeanimportantresearchdirectiontoautomaticallyidentifymeaningfulsmallersubsetsofthesevariables,thatis,featureselectionFeatureselectiontechniquescanbebroadlygroupedintoapproachesthatareclassier-dependent(`wrapper'and`embedded'methods),andclassier-independent(`lter'methods).Wrappermeth-c\r2012GavinBrown,AdamPocock,Ming-jieZhaoandMikelLuj´an. BROWN,POCOCK,ZHAOANDLUJ´ANodssearchthespaceoffeaturesubsets,usingthetraining/validationaccuracyofaparticularclassi-erasthemeasureofutilityforacandidatesubset.Thismaydeliversignicantadvantagesingen-eralisation,thoughhasthedisadvantageofaconsiderablecomputationalexpense,andmayproducesubsetsthatareoverlyspecictotheclassierused.Asaresult,anychangeinthelearningmodelislikelytorenderthefeaturesetsuboptimal.Embeddedmethods(Guyonetal.,2006,Chapter3)ex-ploitthestructureofspecicclassesoflearningmodelstoguidethefeatureselectionprocess.Whilethedeningcomponentofawrappermethodissimplythesearchprocedure,thedeningcompo-nentofanembeddedmethodisacriterionderivedthroughfundamentalknowledgeofaspecicclassoffunctions.AnexampleisthemethodintroducedbyWestonetal.(2001),selectingfeaturestominimizeageneralisationboundthatholdsforSupportVectorMachines.Thesemethodsarelesscomputationallyexpensive,andlesspronetooverttingthanwrappers,butstillusequitestrictmodelstructureassumptions.Incontrast,ltermethods(Duch,2006)separatetheclassicationandfeatureselectioncomponents,anddeneaheuristicscoringcriteriontoactasaproxymeasureoftheclassicationaccuracy.Filtersevaluatestatisticsofthedataindependentlyofanyparticularclassier,therebyextractingfeaturesthataregeneric,havingincorporatedfewassumptions.Eachofthesethreeapproacheshasitsadvantagesanddisadvantages,theprimarydistinguish-ingfactorsbeingspeedofcomputation,andthechanceofovertting.Ingeneral,intermsofspeed,ltersarefasterthanembeddedmethodswhichareinturnfasterthanwrappers.Intermsofovert-ting,wrappershavehigherlearningcapacitysoaremorelikelytoovertthanembeddedmethods,whichinturnaremorelikelytoovertthanltermethods.Allofthisofcoursechangeswithex-tremesofdata/featureavailabilityforexample,embeddedmethodswilllikelyoutperformltermethodsingeneralisationerrorasthenumberofdatapointsincreases,andwrappersbecomemorecomputationallyunfeasibleasthenumberoffeaturesincreases.Aprimaryadvantageofltersisthattheyarerelativelycheapintermsofcomputationalexpense,andaregenerallymoreamenabletoatheoreticalanalysisoftheirdesign.Suchtheoreticalanalysisisthefocusofthisarticle.Thedeningcomponentofaltermethodistherelevanceindex(alsoknownasaselec-tion/scoringcriterion),quantifyingthe`utility'ofincludingaparticularfeatureintheset.Nu-meroushand-designedheuristicshavebeensuggested(Duch,2006),allattemptingtomaximisefeature`relevancy'andminimise`redundancy'.However,fewofthesearemotivatedfromasolidtheoreticalfoundation.ItispreferabletostartfromamoreprincipledperspectivethedesiredapproachisoutlinedeloquentlybyGuyon:Itisimportanttostartwithacleanmathematicalstatementoftheproblemaddressed[...]Itshouldbemadeclearhowoptimallythechosenapproachaddressestheproblemstated.Finally,theeventualapproximationsmadebythealgorithmtosolvetheoptimi-sationproblemstatedshouldbeexplained.Aninterestingtopicofresearchwouldbeto`retrot'successfulheuristicalgorithmsinatheoreticalframework.(Guyonetal.,2006,pg.21)Inthisworkweadoptthisapproachinsteadoftryingtodenefeaturerelevanceindices,wederivethemstartingfromaclearlyspeciedobjectivefunction.Theobjectivewechooseisawellacceptedstatisticalprinciple,theconditionallikelihoodoftheclasslabelsgiventhefeatures.Asaresultweareabletoprovidedeeperinsightintothefeatureselectionproblem,andachievepreciselythegoalabove,toretrotnumeroushand-designedheuristicsintoatheoreticalframework.28 FEATURESELECTIONVIACONDITIONALLIKELIHOOD2.BackgroundInthissectionwegiveabriefintroductiontoinformationtheoreticconcepts,followedbyasummaryofhowtheyhavebeenusedtotacklethefeatureselectionproblem.2.1EntropyandMutualInformationThefundamentalunitofinformationistheentropyofarandomvariable,discussedinseveralstan-dardtexts,mostprominently(CoverandThomas,1991).Theentropy,denotedH(X),quantiestheuncertaintypresentinthedistributionofX.Itisdenedas,H(X)= åx2Xp(x)logp(x);wherethelowercasexdenotesapossiblevaluethatthevariableXcanadoptfromthealphabet.Tocompute1this,weneedanestimateofthedistributionp(X).WhenXisdiscretethiscanbeestimatedbyfrequencycountsfromdata,thatisp(x)=#x N,thefractionofobservationstakingonvaluexfromthetotalN.WeprovidemorediscussiononthisissueinSection3.3.Ifthedistributionishighlybiasedtowardoneparticulareventx2,thatis,littleuncertaintyovertheoutcome,thentheentropyislow.Ifalleventsareequallylikely,thatis,maximumuncertaintyovertheoutcome,thenH(X)ismaximal.2Followingthestandardrulesofprobabilitytheory,entropycanbeconditionedonotherevents.TheconditionalentropyofXgivenYisdenoted,H(XjY)= åy2Yp(y)åx2Xp(xjy)logp(xjy):ThiscanbethoughtofastheamountofuncertaintyremaininginXafterwelearntheoutcomeofYWecannowdenetheMutualInformation(Shannon,1948)betweenXandY,thatis,theamountofinformationsharedbyXandY,asfollows:I(XY)=H(X) H(XjY)=åx2Xåy2Yp(xy)logp(xy) p(x)p(y):ThisisthedifferenceoftwoentropiestheuncertaintybeforeYisknown,H(X),andtheuncer-taintyafterYisknown,H(XjY).ThiscanalsobeinterpretedastheamountofuncertaintyinXwhichisremovedbyknowingY,thusfollowingtheintuitivemeaningofmutualinformationastheamountofinformationthatonevariableprovidesaboutanother.ItshouldbenotedthattheMutualInformationissymmetric,thatis,I(XY)=I(YX),andiszeroifandonlyifthevariablesaresta-tisticallyindependent,thatisp(xy)=p(x)p(y).TherelationbetweenthesequantitiescanbeseeninFigure1.TheMutualInformationcanalsobeconditionedtheconditionalinformationis,I(XYjZ)=H(XjZ) H(XjYZ)=åz2Zp(z)åx2Xåy2Yp(xyjz)logp(xyjz) p(xjz)p(yjz): 1.Thebaseofthelogarithmisarbitrary,butdecidesthe`units'oftheentropy.Whenusingbase2,theunitsare`bits',whenusingbasee,theunitsare`nats.'2.Ingeneral,0H(X)log(jXj).29 BROWN,POCOCK,ZHAOANDLUJ´AN Figure1:Illustrationofvariousinformationtheoreticquantities.ThiscanbethoughtofastheinformationstillsharedbetweenXandYafterthevalueofathirdvariable,Z,isrevealed.Theconditionalmutualinformationwillemergeasaparticularlyimportantpropertyinunderstandingtheresultsofthiswork.Thissectionhasbrieycoveredtheprinciplesofinformationtheory;inthefollowingsectionwediscussmotivationsforusingittosolvethefeatureselectionproblem.2.2FilterCriteriaBasedonMutualInformationFiltermethodsaredenedbyacriterionJ,alsoreferredtoasa`relevanceindex'or`scoring'criterion(Duch,2006),whichisintendedtomeasurehowpotentiallyusefulafeatureorfeaturesubsetmaybewhenusedinaclassier.AnintuitiveJwouldbesomemeasureofcorrelationbetweenthefeatureandtheclasslabeltheintuitionbeingthatastrongercorrelationbetweentheseshouldimplyagreaterpredictiveabilitywhenusingthefeature.ForaclasslabelY,themutualinformationscoreforafeatureXkisJmim(Xk)=I(XkY):(1)Thisheuristic,whichconsidersascoreforeachfeatureindependentlyofothers,hasbeenusedmanytimesintheliterature,forexample,Lewis(1992).Werefertothisfeaturescoringcriterionas`MIM',standingforMutualInformationMaximisation.TousethismeasurewesimplyrankthefeaturesinorderoftheirMIMscore,andselectthetopKfeatures,whereKisdecidedbysomepredenedneedforacertainnumberoffeaturesorsomeotherstoppingcriterion(Duch,2006).AcommonlycitedjusticationforthismeasureisthatthemutualinformationcanbeusedtowritebothanupperandlowerboundontheBayeserrorrate(Fano,1961;HellmanandRaviv,1970).Animportantlimitationisthatthisassumesthateachfeatureisindependentofallotherfeaturesandeffectivelyranksthefeaturesindescendingorderoftheirindividualmutualinformationcontent.However,wherefeaturesmaybeinterdependent,thisisknowntobesuboptimal.Ingeneral,itiswidelyacceptedthatausefulandparsimonioussetoffeaturesshouldnotonlybeindividuallyrelevant,butalsoshouldnotberedundantwithrespecttoeachotherfeaturesshouldnotbehighlycorrelated.Thereaderiswarnedthatwhilethisstatementseemsappealinglyintuitive,itisnotstrictlycorrect,aswillbeexpandeduponinlatersections.Inspiteofthis,severalcriteriahave30 FEATURESELECTIONVIACONDITIONALLIKELIHOODbeenproposedthatattempttopursuethis`relevancy-redundancy'goal.Forexample,Battiti(1994)presentstheMutualInformationFeatureSelection(MIFS)criterion:Jmifs(Xk)=I(XkY) båXj2SI(XkXj);whereSisthesetofcurrentlyselectedfeatures.ThisincludestheI(XkY)termtoensurefeaturerelevance,butintroducesapenaltytoenforcelowcorrelationswithfeaturesalreadyselectedinS.Notethatthisassumesweareselectingfeaturessequentially,iterativelyconstructingournalfeaturesubset.Forasurveyofothersearchmethodsthansimplesequentialselection,thereaderisreferredtoDuch(2006);howeveritshouldbenotedthatalltheoreticalresultspresentedinthispaperwillbegenerallyapplicabletoanysearchprocedure,andbasedsolelyonpropertiesofthecriteriathemselves.ThebintheMIFScriterionisacongurableparameter,whichmustbesetexperimentally.Usingb=0wouldbeequivalenttoJmim(Xk),selectingfeaturesindependently,whilealargervaluewillplacemoreemphasisonreducinginter-featuredependencies.Inexperi-ments,Battitifoundthatb=1isoftenoptimal,thoughwithnostrongtheorytoexplainwhy.TheMIFScriterionfocusesonreducingredundancy;analternativeapproachwasproposedbyYangandMoody(1999),andalsolaterbyMeyeretal.(2008)usingtheJointMutualInformation(JMI),tofocusonincreasingcomplementaryinformationbetweenfeatures.TheJMIscoreforfeatureXkisJjmi(Xk)=åXj2SI(XkXjY):ThisistheinformationbetweenthetargetsandajointrandomvariableXkXj,denedbypair-ingthecandidateXkwitheachfeaturepreviouslyselected.Theideaisifthecandidatefeatureis`complementary'withexistingfeatures,weshouldincludeit.TheMIFSandJMIschemesweretherstofmanycriteriathatattemptedtomanagetherelevance-redundancytradeoffwithvariousheuristicterms,howeveritiscleartheyhaveverydif-ferentmotivations.Thecriteriaidentiedintheliterature1992-2011arelistedinTable1.Thepracticeinthisresearchproblemhasbeentohand-designcriteria,piecingcriteriatogetherasajig-sawofinformationtheoretictermstheoverallaimtomanagetherelevance-redundancytrade-off,witheachnewcriterionmotivatedfromadifferentdirection.Severalquestionsarisehere:Whichcriterionshouldwebelieve?Whatdotheyassumeaboutthedata?Arethereotherusefulcriteria,asyetundiscovered?Inthefollowingsectionweofferanovelperspectiveonthisproblem.3.ANovelApproachInthefollowingsectionsweformulatethefeatureselectiontaskasaconditionallikelihoodproblem.Wewilldemonstratethatpreciselinkscanbedrawnbetweenthewell-acceptedstatisticalframeworkoflikelihoodfunctions,andthecurrentfeatureselectionheuristicsofmutualinformationcriteria.3.1AConditionalLikelihoodProblemWeassumeanunderlyingi.i.d.processpX!Y,fromwhichwehaveasampleofNobservations.Eachobservationisapair(x;y),consistingofad-dimensionalfeaturevectorx=[x1;:::;xd]T,andatargetclassy,drawnfromtheunderlyingrandomvariablesX=fX1;:::;XdgandY.Furthermore,weassumethatp(yjx)isdenedbyasubsetofthedfeaturesinx,whiletheremainingfeaturesare31 BROWN,POCOCK,ZHAOANDLUJ´ANCriterion Fullname Authors MIM MutualInformationMaximisation Lewis(1992)MIFS MutualInformationFeatureSelection Battiti(1994)KS Koller-Sahamimetric KollerandSahami(1996)JMI JointMutualInformation YangandMoody(1999)MIFS-U MIFS-`Uniform' KwakandChoi(2002)IF InformativeFragments Vidal-NaquetandUllman(2003)FCBF FastCorrelationBasedFilter YuandLiu(2004)AMIFS AdaptiveMIFS TesmerandEstevez(2004)CMIM ConditionalMutualInfoMaximisation Fleuret(2004)MRMR Max-RelevanceMin-Redundancy Pengetal.(2005)ICAP InteractionCapping Jakulin(2005)CIFE ConditionalInfomaxFeatureExtraction LinandTang(2006)DISR DoubleInputSymmetricalRelevance MeyerandBontempi(2006)MINRED MinimumRedundancy Duch(2006)IGFS InteractionGainFeatureSelection ElAkadietal.(2008)SOA SecondOrderApproximation GuoandNixon(2009)CMIFS ConditionalMIFS Chengetal.(2011)Table1:Variousinformation-basedcriteriafromtheliterature.Sections3and4willshowhowthesecanallbeinterpretedinasingletheoreticalframework.irrelevant.Ourmodelingtaskisthereforetwo-fold:rstlytoidentifythefeaturesthatplayafunc-tionalrole,andsecondlytousethesefeaturestoperformpredictions.Inthisworkweconcentrateontherststage,thatofselectingtherelevantfeatures.Weadoptad-dimensionalbinaryvectorq:a1indicatingthefeatureisselected,a0indicatingitisdiscarded.Notationxqindicatesthevectorofselectedfeatures,thatis,thefullvectorxprojectedontothedimensionsspeciedbyq.Notationxeqisthecomplement,thatis,theunselectedfeatures.Thefullfeaturevectorcanthereforebeexpressedasx=fxq;xeqg.Asmentioned,weassumetheprocesspisdenedbyasubsetofthefeatures,soforsomeunknownoptimalvectorq,wehavethatp(yjx)=p(yjxq).Weapproximatepusingahypotheticalpredictivemodelq,withtwolayersofparameters:qrepresentingwhichfeaturesareselected,andtrepresentingparametersusedtopredicty.Ourproblemstatementistoidentifytheminimalsubsetoffeaturessuchthatwemaximizetheconditionallikelihoodofthetraininglabels,withrespecttotheseparameters.Fori.i.d.dataD=f(xi;yi)=1::Ngtheconditionallikelihoodofthelabelsgivenparametersfq;tgisL(q;tjD)=NÕi=1q(yijxiq;t):The(scaled)conditionallog-likelihoodis`=1 NNåi=1logq(yijxiq;t):(2)Thisistheerrorfunctionwewishtooptimizewithrespecttotheparametersft;qg;thescalingtermhasnoeffectontheoptima,butsimpliesexpositionlater.Usingconditionallikelihoodhas32 FEATURESELECTIONVIACONDITIONALLIKELIHOODbecomepopularinso-calleddiscriminativemodellingapplications,whereweareinterestedonlyintheclassicationperformance;forexampleGrossmanandDomingos(2004)usedittolearnBayesianNetworkclassiers.WewillexpanduponthislinktodiscriminativemodelsinSection9.3.MaximisingconditionallikelihoodcorrespondstominimisingKL-divergencebetweenthetrueandpredictedclassposteriorprobabilitiesforclassication,weoftenonlyrequirethecorrectclass,andnotpreciseestimatesoftheposteriors,henceEquation(2)isaproxylowerboundforclassicationaccuracy.Wenowintroducethequantityp(yjxq):thisisthetruedistributionoftheclasslabelsgiventheselectedfeaturesxq.Itisimportanttonotethedistinctionfromp(yjx),thetruedistributiongivenallfeatures.Multiplyinganddividingqbyp(yjxq),wecanre-writetheaboveas,`=1 NNåi=1logq(yijxiq;t) p(yijxiq)+1 NNåi=1logp(yijxiq):(3)Thesecondtermin(3)canbesimilarlyexpanded,introducingtheprobabilityp(yjx)`=1 NNåi=1logq(yijxiq;t) p(yijxiq)+1 NNåi=1logp(yijxiq) p(yijxi)+1 NNåi=1logp(yijxi):Thesearenitesampleapproximations,drawingdatapointsi.i.d.withrespecttothedistributionp(xy).WeuseExyfgtodenotestatisticalexpectation,andforconveniencewenegatetheaboveturningourmaximisationproblemintoaminimisation.Thisgivesus, `Exynlogp(yjxq) q(yjxq;t)o+Exynlogp(yjx) p(yjxq)o Exynlogp(yjx)o:(4)Thesethreetermshaveinterestingpropertieswhichtogetherdenethefeatureselectionprob-lem.ItisparticularlyinterestingtonotethatthesecondtermispreciselythatintroducedbyKollerandSahami(1996)intheirdenitionsofoptimalfeatureselection.Intheirwork,thetermwasadoptedad-hocasasensibleobjectivetofollowherewehaveshownittobeadirectandnat-uralconsequenceofadoptingtheconditionallikelihoodasanobjectivefunction.Rememberingx=fxq;xeqg,thissecondtermcanbedeveloped:DKS=Exynlogp(yjx) p(yjxq)o=åxyp(xy)logp(yjxqxeq) p(yjxq)=åxyp(xy)logp(yjxqxeq) p(yjxq)p(xeqjxq) p(xeqjxq)=åxyp(xy)logp(xeqyjxq) p(xeqjxq)p(yjxq)=I(XeqYjXq):(5)Thisistheconditionalmutualinformationbetweentheclasslabelandtheremainingfeatures,giventheselectedfeatures.Wecannotealsothatthethirdtermin(4)isanotherinformationtheoretic33 BROWN,POCOCK,ZHAOANDLUJ´ANquantity,theconditionalentropyH(YjX).Insummary,weseethatourobjectivefunctioncanbedecomposedintothreedistinctterms,eachwithitsowninterpretation:N!¥ `=Exynlogp(yjxq) q(yjxq;t)o+I(XeqYjXq)+H(YjX):(6)Thersttermisalikelihoodratiobetweenthetrueandthepredictedclassdistributionsgiventheselectedfeatures,averagedovertheinputspace.Thesizeofthistermwilldependonhowwellthemodelqcanapproximatep,giventhesuppliedfeatures.3Whenqtakesonthetruevalueq(orconsistsofasupersetofq)thisbecomesaKL-divergencepjjq.ThesecondtermisI(XeqYjXq)theconditionalmutualinformationbetweentheclasslabelandtheunselectedfeatures,giventheselectedfeatures.Thesizeofthistermdependssolelyonthechoiceoffeatures,andwilldecreaseastheselectedfeaturesetXqexplainsmoreaboutY,untileventuallybecomingzerowhentheremainingfeaturesXeqcontainnoadditionalinformationaboutYinthecontextofXq.Itcanbenotedthatduetothechainrule,wehaveI(XY)=I(XqY)+I(XeqYjXq);henceminimizingI(XeqYjXq)isequivalenttomaximisingI(XqY).ThenaltermisH(YjX),theconditionalentropyofthelabelsgivenallfeatures.Thistermquantiestheuncertaintystillremain-inginthelabelevenwhenweknowallpossiblefeatures;itisanirreducibleconstant,independentofallparameters,andinfactformsaboundontheBayeserror(Fano,1961).Thesethreetermsmakeexplicittheeffectofthefeatureselectionparametersq,separatingthemfromtheeffectoftheparameterstinthemodelthatusesthosefeatures.Ifwesomehowhadtheoptimalfeaturesubsetq,whichperfectlycapturedtheunderlyingprocessp,thenI(XeqYjXq)wouldbezero.Theremaining(reducible)erroristhendowntotheKLdivergencepjjq,expressinghowwellthepredictivemodelqcanmakeuseoftheprovidedfeatures.Ofcourse,differentmodelsqwillhavedifferentpredictiveability:agoodfeaturesubsetwillnotnecessarilybeputtogooduseifthemodelistoosimpletoexpresstheunderlyingfunction.ThisperspectivewasalsoconsideredbyTsamardinosandAliferis(2003),andearlierbyKohaviandJohn(1997)theaboveresultsplacetheseinthecontextofapreciseobjectivefunction,theconditionallikelihood.Fortheremainderofthepaperwewillusethesameassumptionasthatmadeimplicitlybyalllterselectionmethods.Forcompleteness,herewemaketheassumptionexplicit:Denition1:FilterassumptionGivenanobjectivefunctionforaclassier,wecanaddresstheproblemsofoptimizingthefeaturesetandoptimizingtheclassierintwostages:rstpickinggoodfeatures,thenbuildingtheclassiertousethem.Thisimpliesthatthesecondtermin(6)canbeoptimizedindependentlyoftherst.Inthissectionwehaveformulatedthefeatureselectiontaskasaconditionallikelihoodproblem.Inthefollowing,weconsiderhowthisproblemstatementrelatestotheexistingliterature,anddiscusshowtosolveitinpractice:includinghowtooptimizethefeatureselectionparameters,andtheestimationofthenecessarydistributions. 3.Infact,ifqisaconsistentestimator,thistermwillapproachzerowithlargeN.34 FEATURESELECTIONVIACONDITIONALLIKELIHOOD3.2OptimizingtheFeatureSelectionParametersUnderthelterassumptioninDenition1,Equation(6)demonstratesthattheoptimaofthecondi-tionallikelihoodcoincidewiththatoftheconditionalmutualinformation:argmaxqL(qjD)=argminqI(XeqYjXq):(7)Theremayofcoursebemultipleglobaloptima,inadditiontothetrivialminimumofselectingallfeatures.Withthisinmind,wecanintroduceaminimalityconstraintonthesizeofthefeatureset,anddeneourproblem:q=argminq0fjq0jq0=argminqI(XqYjXq)g:(8)ThisisthesmallestfeaturesetXq,suchthatthemutualinformationI(XeqYjXq)isminimal,andthustheconditionallikelihoodismaximal.Itshouldberememberedthatthelikelihoodisonlyourproxyforclassicationerror,andtheminimalfeaturesetintermsofclassicationcouldbesmallerthanthatwhichoptimiseslikelihood.Inthefollowingparagraphs,weconsiderhowthisproblemisimplicitlytackledbymethodsalreadyintheliterature.Acommonheuristicapproachisasequentialsearchconsideringfeaturesone-by-oneforad-dition/removal;thisisusedforexampleinMarkovBlanketlearningalgorithmssuchasIAMB(Tsamardinosetal.,2003).WewillnowdemonstratethatthissequentialsearchheuristicisinfactequivalenttoagreedyiterativeoptimisationofEquation(8).Tounderstandthiswemusttime-indexthefeaturesets.NotationXqt=Xeqtindicatestheselectedandunselectedfeaturesetsattimestepwithaslightabuseofnotationtreatingtheseinterchangeablyassetsandrandomvariables.Denition2:ForwardSelectionStepwithMutualInformationTheforwardselectionstepaddsthefeaturewiththemaximummutualinformationinthecontextofthecurrentlyselectedsetXqt.Theoperationsperformedare:Xk=argmaxXk2XetI(XkYjXqt);Xqt+1 Xqt[Xk;Xeqt+1 XeqtnXk:Asubtle(butimportant)implementationpointforthisselectionheuristicisthatitshouldnotaddanotherfeatureif8Xk;I(XkYjXq)=0.Thisensureswewillnotunnecessarilyincreasethesizeofthefeatureset.Theorem3Theforwardselectionmutualinformationheuristicaddsthefeaturethatgeneratesthelargestpossibleincreaseintheconditionallikelihoodagreedyiterativemaximisation.ProofWiththedenitionsaboveandthechainruleofmutualinformation,wehavethat:I(Xeqt+1YjXqt+1)=I(XeqtYjXqt) I(XkYjXqt):ThefeatureXkthatmaximisesI(XkYjXqt)isthesamethatminimizesI(Xeqt+1YjXqt+1);thereforetheforwardstepisagreedyminimizationofourobjectiveI(XeqYjXq),andthereforemaximisestheconditionallikelihood. 35 BROWN,POCOCK,ZHAOANDLUJ´ANDenition4:BackwardEliminationStepwithMutualInformationInabackwardstep,afeatureisremovedtheutilityofafeatureXkisconsideredasitsmutualinformationwiththetarget,conditionedonallotherelementsoftheselectedsetwithoutXk.Theoperationsperformedare:Xk=argminXk2XtI(XkYjfXqtnXkg):Xqt+1 XqtnXkXeqt+1 Xeqt[XkTheorem5Thebackwardeliminationmutualinformationheuristicremovesthefeaturethatcausestheminimumpossibledecreaseintheconditionallikelihood.ProofWiththesedenitionsandthechainruleofmutualinformation,wehavethat:I(Xeqt+1YjXqt+1)=I(XeqtYjXqt)+I(XkYjXqt+1):ThefeatureXkthatminimizesI(XkYjXqt+1)isthatwhichkeepsI(Xeqt+1YjXqt+1)ascloseaspossi-bletoI(XeqtYjXqt);thereforethebackwardeliminationstepremovesafeaturewhileattemptingtomaintainthelikelihoodascloseaspossibletoitscurrentvalue. Tostrictlyachieveouroptimizationgoal,abackwardstepshouldonlyremoveafeatureifI(XkYjfXqtnXkg)=0.Inpractice,workingwithrealdata,therewilllikelybeestimationerrors(seethefollowingsection)andthusveryrarelythestrictzerowillbeobserved.ThisbringsustoaninterestingcorollaryregardingIAMB(TsamardinosandAliferis,2003).Corollary6SincetheIAMBalgorithmusespreciselytheseforward/backwardselectionheuristics,itisagreedyiterativemaximisationoftheconditionallikelihood.InIAMB,abackwardeliminationstepisonlyacceptedifI(XkYjfXqtnXkg)0,andotherwisetheprocedureterminates.InTsamardinosandAliferis(2003)itisshownthatIAMBreturnstheMarkovBlanketofanytargetnodeinaBayesiannetwork,andthatthissetcoincideswiththestronglyrelevantfeaturesinthedenitionsfromKohaviandJohn(1997).ThepreciselinkstothisliteratureareexploredfurtherinSection7.TheIAMBfamilyofalgorithmsadoptacommonassumption,thatthedataisfaithfultosomeunknownBayesianNetwork.Inthecaseswherethisassumptionholds,theprocedurewasproventoidentifytheuniqueMarkovBlanket.SinceIAMBusespreciselytheforward/backwardstepswehavederived,wecanconcludethattheMarkovBlanketcoincideswiththe(unique)maxi-mumoftheconditionallikelihoodfunction.AmorerecentvariationoftheIAMBalgorithm,calledMMMB(Min-MaxMarkovBlanket)usesaseriesofoptimisationstomitigatetherequirementofexponentialamountsofdatatoestimatetherelevantstatisticalquantities.Theseoptimisationsdonotchangetheunderlyingbehaviourofthealgorithm,asitstillmaximisestheconditionallikelihoodfortheselectedfeatureset,howevertheydoslightlyobscurethestronglinktoourframework.36 FEATURESELECTIONVIACONDITIONALLIKELIHOOD3.3EstimationoftheMutualInformationTermsInconsideringtheforward/backwardheuristics,wemusttakeaccountofthefactthatwedonothaveperfectknowledgeofthemutualinformation.Thisisbecausewehaveimplicitlyassumedwehaveaccesstothetruedistributionsp(xy);p(yjxq),etc.Inpracticewehavetoestimatethesefromdata.Theproblemcalculatingmutualinformationreducestothatofentropyestimation,andisfundamentalinstatistics(Paninski,2003).Themutualinformationisdenedastheexpectedlogarithmofaratio:I(XY)=Exynlogp(xy) p(x)p(y)o:Wecanestimatethis,sincetheStrongLawofLargeNumbersassuresusthatthesampleestimateusingpconvergesalmostsurelytotheexpectedvalueforadatasetofNi.i.d.observations(xi;yi)I(XY)I(XY)=1 NNåi=1logp(xiyi) p(xi)p(yi):Inordertocalculatethisweneedtheestimateddistributionsp(xy),p(x),andp(y).Thecomputationofentropiesforcontinuousorordinaldataishighlynon-trivial,andrequiresanassumedmodeloftheunderlyingdistributionstosimplifyexperimentsthroughoutthisarticle,weusediscretedata,andestimatedistributionswithhistogramestimatorsusingxed-widthbins.Theprobabilityofanyparticulareventp(X=x)isestimatedbymaximumlikelihood,thefrequencyofoccurrenceoftheeventX=xdividedbythetotalnumberofevents(i.e.,datapoints).Formoreinformationonalternativeentropyestimationprocedures,wereferthereadertoPaninski(2003).AtthispointwemustnotethattheapproximationaboveholdsonlyifNislargerelativetothedimensionofthedistributionsoverxandy.Forexampleifx;yarebinary,N100shouldbemorethansufcienttogetreliableestimates;howeverifx;yaremultinomial,thiswilllikelybeinsufcient.Inthecontextofthesequentialselectionheuristicswehavediscussed,weareapproximatingI(XkYjXq)as,I(XkYjXq)I(XkYjXq)=1 NNåi=1logp(xikyijxiq) p(xikjxiq)p(yijxiq):(9)AsthedimensionofthevariableXqgrows(i.e.,asweaddmorefeatures)thenthenecessaryprobabilitydistributionsbecomemorehighdimensional,andhenceourestimateofthemutualinformationbecomeslessreliable.Thisinturncausesincreasinglypoorjudgementsforthein-clusion/exclusionoffeatures.Forpreciselythisreason,theresearchcommunityhavedevelopedvariouslow-dimensionalapproximationsto(9).Inthefollowingsections,wewillinvestigatetheimplicitstatisticalassumptionsandempiricaleffectsoftheseapproximations.Intheremainderofthispaper,weuseI(XY)todenotetheidealcaseofbeingabletocomputethemutualinformation,thoughinpracticeonrealdataweusethenitesampleestimateI(XY)3.4SummaryInthesesectionswehaveineffectreverse-engineeredamutualinformation-basedselectionscheme,startingfromaclearlydenedconditionallikelihoodproblem,anddiscussedestimationofthevar-iousquantitiesinvolved.Inthefollowingsectionswewillshowthatwecanretrotnumerousexistingrelevancy-redundancyheuristicsfromthefeatureselectionliteratureintothisprobabilisticframework.37 BROWN,POCOCK,ZHAOANDLUJ´AN4.RetrottingSuccessfulHeuristicsIntheprevioussection,startingfromaclearlydenedconditionallikelihoodproblem,wederivedagreedyoptimizationprocesswhichassessesfeaturesbasedonasimplescoringcriterionontheutilityofincludingafeatureXk2Xeq.ThescoreforafeatureXkis,Jcmi(Xk)=I(XkYjS);(10)wherecmistandsforconditionalmutualinformation,andfornotationalbrevitywenowuseS=Xqforthecurrentlyselectedset.Animportantquestionis,howdoes(10)relatetoexistingheuristicsintheliterature,suchasMIFS?WewillseethatMIFS,andcertainothercriteria,canbephrasedcleanlyaslinearcombinationsofShannonentropyterms,whilesomearenon-linearcombinations,involvingmaxorminoperations.4.1CriteriaasLinearCombinationsofShannonInformationTermsRepeatingtheMIFScriterionforclarity,Jmifs(Xk)=I(XkY) båXj2SI(XkXj):(11)Wecanseethatwerstneedtorearrange(10)intotheformofasimplerelevancytermbetweenXkandY,plussomeadditionalterms,beforewecancompareittoMIFS.UsingtheidentityI(ABjC) I(AB)=I(ACjB) I(AC),wecanre-express(10)as,Jcmi(Xk)=I(XkYjS)=I(XkY) I(XkS)+I(XkSjY):(12)Itisinterestingtoseetermsinthisexpressioncorrespondingtotheconceptsof`relevancy'and`redundancy',thatis,I(XkY)andI(XkS).ThescorewillbeincreasediftherelevancyofXkislargeandtheredundancywithexistingfeaturesissmall.Thisisinaccordancewithacommonviewinthefeatureselectionliterature,observingthatwewishtoavoidredundantvariables.However,wecanalsoseeanimportantadditionaltermI(XkSjY),whichisnottraditionallyaccountedforinthefeatureselectionliteraturewecallthistheconditionalredundancy.ThistermhastheoppositesigntotheredundancyI(XkS),henceJcmiwillbeincreasedwhenthisislarge,thatis,astrongclass-conditionaldependenceofXkwiththeexistingsetS.Thus,wecometotheimportantconclusionthattheinclusionofcorrelatedfeaturescanbeuseful,providedthecorrelationwithinclassesisstrongerthantheoverallcorrelation.WenotethatthisisasimilarobservationtothatofGuyonetal.(2006),thatcorrelationdoesnotimplyredundancyEquation(12)effectivelyembodiesthisstatementininformationtheoreticterms.Thesumofthelasttwotermsin(12)representsthethree-wayinteractionbetweentheexistingfeaturesetS,thetargetY,andthecandidatefeatureXkbeingconsideredforinclusioninS.Tofurtherunderstandthis,wecannotethefollowingproperty:I(XkSY)=I(SY)+I(XkYjS)=I(SY)+I(XkY) I(XkS)+I(XkSjY):WeseethatifI(XkS)I(XkSjY),thenthetotalutilitywhenincludingXk,thatisI(XkSY),islessthanthesumoftheindividualrelevanciesI(SY)+I(XkY).ThiscanbeinterpretedasXkhavingunnecessaryduplicatedinformation.Intheoppositecase,whenI(XkS)I(XkSjY),thenXkand38 FEATURESELECTIONVIACONDITIONALLIKELIHOODScombinewellandprovidemoreinformationtogetherthanbythesumoftheirparts,I(SY),andI(XkY)Theimportantpointtotakeawayfromthisexpressionisthatthetermsareinatrade-offwedonotrequireafeaturewithlowredundancyforitsownsake,butinsteadrequireafeaturethatbesttradesoffthethreetermssoastomaximisethescoreoverall.Muchlikethebias-variancedilemma,attemptingtodecreaseonetermislikelytoincreaseanother.Therelationof(10)and(11)canbeseenwithassumptionsontheunderlyingdistributionp(xy)Writingthelattertwotermsof(12)asentropies:Jcmi(Xk)=I(XkY) H(S)+H(SjXk)+H(SjY) H(SjXkY):(13)Todevelopthisfurther,werequireanassumption.Assumption1ForallunselectedfeaturesXk2Xeq,assumethefollowing,p(xqjxk)=Õj2Sp(xjjxk)p(xqjxky)=Õj2Sp(xjjxky):ThisstatesthattheselectedfeaturesXqareindependentandclass-conditionallyindependentgiventheunselectedfeatureXkunderconsideration.Usingthis,Equation(13)becomes,J0cmi(Xk)=I(XkY) H(S)+åj2SH(XjjXk)+H(SjY) åj2SH(XjjXkY):wheretheprimeonJindicateswearemakingassumptionsonthedistribution.Now,ifweintroduceåj2SH(Xj) åj2SH(Xj),andåj2SH(XjjY) åj2SH(XjjY),werecovermutualinformationterms,betweenthecandidatefeatureandeachmemberofthesetS,plussomeadditionalterms,J0cmi(Xk)=I(XkY) åj2SI(XjXk)+åj2SH(Xj) H(S)+åj2SI(XjXkjY) åj2SH(XjjY)+H(SjY):(14)Severalofthetermsin(14)areconstantwithrespecttoXkassuch,removingthemwillhavenoeffectonthechoiceoffeature.Removingtheseterms,wehaveanequivalentcriterion,J0cmi(Xk)=I(XkY) åj2SI(XjXk)+åj2SI(XjXkjY):(15)39 BROWN,POCOCK,ZHAOANDLUJ´ANThishasinfactalreadyappearedintheliteratureasaltercriterion,originallyproposedbyLinandTang(2006),asConditionalInfomaxFeatureExtraction(CIFE),thoughithasbeenrepeatedlyrediscoveredbyotherauthors(ElAkadietal.,2008;GuoandNixon,2009).Itisparticularlyinterestingasitrepresentsasortof`root'criterion,fromwhichseveralotherscanbederived.Forexample,thelinktoMIFScanbeseenwithonefurtherassumption,thatthefeaturesarepairwiseclass-conditionallyindependent.Assumption2Forallfeaturesi;j,assumep(xixjjy)=p(xijy)p(xjjy).Thisstatesthatthefeaturesarepairwiseclass-conditionallyindependent.Withthisassumption,thetermåI(XjXkjY)willbezero,and(15)becomes(11),theMIFScriterion,withb=1.ThebparameterinMIFScanbeinterpretedasencodingastrengthofbeliefinanotherassumption,thatofunconditionalindependence.Assumption3Forallfeaturesi;j,assumep(xixj)=p(xi)p(xj).Thisstatesthatthefeaturesarepairwiseindependent.Abclosetozeroimpliesverystrongbeliefintheindependencestatement,indicatingthatanymeasuredassociationI(XjXk)isinfactspurious,possiblyduetonoiseinthedata.Abvaluecloserto1impliesalesserbelief,thatanymeasureddependencyI(XjXk)shouldbeincorporatedintothefeaturescoreexactlyasobserved.SinceMIMisproducedbysettingb=0,wecanseethatMIMalsoadoptsAssumption3.ThesamelineofreasoningcanbeappliedtoaverysimilarcriterionproposedbyPengetal.(2005),theMinimum-RedundancyMaximum-Relevancecriterion,Jmrmr(Xk)=I(XkY) 1 jSjåj2SI(XkXj):SincemRMRomitstheconditionalredundancytermentirely,itisimplicitlyusingAssumption2.Thebcoefcienthasbeensetinverselyproportionaltothesizeofthecurrentfeatureset.IfwehavealargesetS,thenbwillbeextremelysmall.TheinterpretationisthenthatasthesetSgrows,mRMRadoptsastrongerbeliefinAssumption3.Intheoriginalpaper,(Pengetal.,2005,Section2.3)itwasclaimedthatmRMRisequivalentto(10).Inthissection,throughmakingexplicittheintrinsicassumptionsofthecriterion,wehaveclearlyillustratedthatthisclaimisincorrect.BalaganiandPhoha(2010)presentananalysisofthethreecriteriamRMR,MIFSandCIFE,arrivingatsimilarresultstoourown:thatthesecriteriamakehighlyrestrictiveassumptionsontheunderlyingdatadistributions.Thoughtheconclusionsaresimilar,ourapproachincludestheirresultsasaspecialcase,andmakesexplicitthelinktoalikelihoodfunction.TherelationoftheMIFS/mRMRtoEquation(15)isrelativelystraightforward.Itismorechal-lengingtoconsiderhowcloselyothercriteriamightbere-expressedinthisform.YangandMoody(1999)proposeusingJointMutualInformation(JMI),Jjmi(Xk)=åj2SI(XkXjY):(16)Usingsomerelativelysimplemanipulations(seeappendix)thiscanbere-writtenas,Jjmi(Xk)=I(XkY) 1 jSjåj2ShI(XkXj) I(XkXjjY)i:(17)40 FEATURESELECTIONVIACONDITIONALLIKELIHOODThiscriterion(17)returnsexactlythesamesetoffeaturesastheJMIcriterion(16);howeverinthisform,wecanseetherelationtoourproposedframework.TheJMIcriterion,likemRMR,hasastrongerbeliefinthepairwiseindependenceassumptionsasthefeatureseSgrows.SimilaritiescanofcoursebeobservedbetweenJMI,MIFSandmRMRthedifferencesbeingthescalingfactorandtheconditionaltermandtheirsubsequentrelationtoEquation(15).Itisinfactpossibletoidentifynumerouscriteriafromtheliteraturethatcanallbere-writtenintoacommonform,correspondingtovariationsupon(15).Aspaceofpotentialcriteriacanbeimagined,whereweparameterizecriterion(15)asso:J0cmi=I(XkY) båj2SI(XjXk)+gåj2SI(XjXkjY):(18)Figure2showshowthecriteriawehavediscussedsofarcanallbettedinsidethisunitsquarecorrespondingtob=gparameters.MIFSsitsonthelefthandaxisofthesquarewithg=0andb2[0;1].TheMIMcriterion,Equation(1),whichsimplyassesseseachfeatureindividuallywithoutanyregardofothers,sitsatthebottomleft,withg=0;b=0.Thetoprightofthesquarecorrespondstog=1;b=1,whichistheCIFEcriterion(LinandTang,2006),alsosuggestedbyElAkadietal.(2008)andGuoandNixon(2009).Averysimilarcriterion,usinganassumptiontoapproximatetheterms,wasproposedbyChengetal.(2011).TheJMIandmRMRcriteriaareuniqueinthattheymovelinearlywithinthespaceasthefeaturesetSgrows.AsthesizeofthesetSincreasestheymoveclosertowardstheoriginandtheMIMcriterion.TheparticularlyinterestingpointaboutthispropertyisthattherelativemagnitudeoftherelevancytermtotheredundancytermsstaysapproximatelyconstantasSgrows,whereaswithMIFS,theredundancytermwillingeneralbejSjtimesbiggerthantherelevancyterm.Theconse-quencesofthiswillbeexploredintheexperimentalsectionofthispaper.AnycriterionexpressibleintheunitsquarehasmadeindependenceAssumption1.Inaddition,anycriteriathatsitatpointsotherthanb=1;g=1haveadoptedvaryingdegreesofbeliefinAssumptions2and3.Afurtherinterestingpointaboutthissquareissimplythatitissparselypopulated.Anobviousunexploredregionisthebottomright,thecornercorrespondingtob=0;g=1;thoughthereisnoclearintuitivejusticationforthispoint,forcompletenessintheexperimentalsectionwewillevaluateit,astheconditionalredundancyor`condred'criterion.Inpreviouswork(Brown,2009)weexploredthisunitsquare,thoughderivedfromanexpansionofthemutualinformationfunctionratherthandirectlyfromtheconditionallikelihood.Whilethisresultedinanidenticalexpressionto(18),theprobabilisticframeworkwepresenthereisfarmoreexpressive,allowingexactspeci-cationoftheunderlyingassumptions.TheunitsquareofFigure2describeslinearcriteria,namedassosincetheyarelinearcombi-nationsoftherelevance/redundancyterms.Thereexistothercriteriathatfollowasimilarform,butinvolvingotheroperations,makingthemnon-linear4.2CriteriaasNon-LinearCombinationsofShannonInformationTermsFleuret(2004)proposedtheConditionalMutualInformationMaximizationcriterion,Jcmim(Xk)=minXj2ShI(XkYjXj)i:Thiscanbere-written,Jcmim(Xk)=I(XkY) maxXj2ShI(XkXj) I(XkXjjY)i:(19)41 BROWN,POCOCK,ZHAOANDLUJ´AN Figure2:Thefullspaceoflinearltercriteria,describingseveralexamplesfromTable1.NotethatallcriteriainthisspaceadoptAssumption1.Additionally,thegandbaxesrepresentthecriteriabeliefinAssumptions2and3,respectively.ThelefthandaxisiswherethemRMRandMIFSalgorithmssit.Thebottomleftcorner,MIM,istheassumptionofcompletelyindependentfeatures,usingjustmarginalmutualinformation.NotethatsomecriteriaareequivalentatparticularsizesofthecurrentfeaturesetjSjTheproofisagainavailableintheappendix.Duetothemaxoperator,theprobabilisticinterpretationisalittlelessstraightforward.ItisclearhoweverthatCMIMadoptsAssumption1,sinceitevaluatesonlypairwisefeaturestatistics.Vidal-NaquetandUllman(2003)proposeanothercriterionusedinComputerVision,whichwerefertoasInformativeFragmentsJif(Xk)=minXj2ShI(XkXjY) I(XjY)i:TheauthorsmotivatethiscriterionbynotingthatitmeasuresthegainofcombininganewfeatureXkwitheachexistingfeatureXj,oversimplyusingXjbyitself.TheXjwiththeleast`gain'frombeingpairedwithXkistakenasthescoreforXk.Interestingly,usingthechainruleI(XkXjY)=I(XjY)+I(XkYjXj),thereforeIFisequivalenttoCMIM,thatis,Jif(Xk)=Jcmim(Xk),makingthesameassumptions.Jakulin(2005)proposedthecriterion,Jicap(Xk)=I(XkY) åXj2Smaxh0;fI(XkXj) I(XkXjjY)gi:Again,thisadoptsAssumption1,usingthesameredundancyandconditionalredundancyterms,yettheexactprobabilisticinterpretationisunclear.Aninterestingclassofcriteriauseanormalisationtermonthemutualinformationtooffsettheinherentbiastowardhigharityfeatures(Duch,2006).AnexampleofthisisDoubleInput42 FEATURESELECTIONVIACONDITIONALLIKELIHOODSymmetricalRelevance(MeyerandBontempi,2006),amodicationoftheJMIcriterion:Jdisr(Xk)=åXj2SI(XkXjY) H(XkXjY):Theinclusionofthisnormalisationtermbreaksthestrongtheoreticallinktoalikelihoodfunction,butagainforcompletenesswewillincludethisinourempiricalinvestigations.Whilethecriteriaintheunitsquarecanhavetheirprobabilisticassumptionsmadeexplicit,thenonlinearityintheCMIM,ICAPandDISRcriteriamakesuchaninterpretationfarmoredifcult.4.3SummaryofTheoreticalFindingsInthissectionwehaveshownthatnumerouscriteriapublishedoverthepasttwodecadesofresearchcanbe`retro-tted'intotheframeworkwehaveproposedthecriteriaareapproximationsto(10),eachmakingdifferentassumptionsontheunderlyingdistributions.Sinceintheprevioussectionwesawthatacceptingthetoprankedfeatureaccordingto(10)providesthemaximumpossibleincreaseinthelikelihood,weseenowthatthecriteriaareapproximatemaximisersofthelikelihood.Whetherornottheyindeedprovidethemaximumincreaseateachstepwilldependonhowwelltheimplicitassumptionsonthedatacanbetrusted.Also,itshouldberememberedthatevenifweused(10),itisnotguaranteedtondtheglobaloptimumofthelikelihood,since(a)itisagreedysearch,and(b)nitedatawillmeandistributionscannotbeaccuratelymodelled.Inthiscase,wehavereachedthelimitofwhatatheoreticalanalysiscantellusaboutthecriteria,andwemustclosetheremaining`gaps'inourunderstandingwithanexperimentalstudy.5.ExperimentsInthissectionweempiricallyevaluatesomeofthecriteriaintheliteratureagainstoneanother.Notethatwearenotpursuinganexhaustiveanalysis,attemptingtoidentifythe`winning'criterionthatprovidesbestperformanceoverall4rather,weprimarilyobservehowthetheoreticalpropertiesofcriteriarelatetothesimilarityofthereturnedfeaturesets.Whilethesepropertiesareinterest-ing,weofcoursemustacknowledgethatclassicationperformanceistheultimateevaluationofacriterionhencewealsoincludehereclassicationresultsonUCIdatasetsandinSection6onthewell-knownbenchmarkNIPSFeatureSelectionChallenge.Inthefollowingsections,weaskthequestions:howstableisacriteriontosmallchangesinthetrainingdataset?,howsimilararethecriteriatoeachother?,howdothedifferentcriteriabehaveinlimitedandextremesmall-samplesituations?,andnally,whatistherelationbetweenstabilityandaccuracy?.Toaddressthesequestions,weusethe15datasetsdetailedinTable2.Thesearechosentohaveawidevarietyofexample-featureratios,andarangeofmulti-classproblems.Thefeatureswithineachdatasethaveavarietyofcharacteristicssomebinary/discrete,andsomecontinuous.Con-tinuousfeatureswerediscretized,usinganequal-widthstrategyinto5bins,whilefeaturesalreadywithacategoricalrangewereleftuntouched.The`ratio'statisticquotedinthenalcolumnisanindicatorofthedifcultyofthefeatureselectionforeachdataset.Thisusesthenumberofdata-points(N),themedianarityofthefeatures(m),andthenumberofclasses(c)theratioquotedin 4.Inanycase,theNoFreeLunchTheoremappliesherealso(TsamardinosandAliferis,2003).43 BROWN,POCOCK,ZHAOANDLUJ´ANthetableforeachdatasetisN mc,henceasmallervalueindicatesamorechallengingfeatureselectionproblem.Akeypointofthisworkistounderstandthestatisticalassumptionsonthedataimposedbythefeatureselectioncriteriaifourclassicationmodelweretomakeevenmoreassumptions,thisislikelytoobscuretheexperimentalobservationsrelatingperformancetotheoreticalproperties.Forthisreason,inallexperimentsweuseasimplenearestneighbourclassier(k=3),thisischosenasitmakesfew(ifany)assumptionsaboutthedata,andweavoidtheneedforparametertuning.Forthefeatureselectionsearchprocedure,theltercriteriaareappliedusingasimpleforwardselection,toselectaxednumberoffeatures,speciedineachexperiment,beforebeingusedwiththeclassier.DataFeaturesExamplesClassesRatio breast30569257congress16435272heart13270234ionosphere34351235krvskp3631962799landsat3664356214lungcancer563234parkinsons22195220semeion25615931080sonar60208221soybeansmall354746spect22267267splice6031753265waveform4050003333wine13178312Table2:Datasetsusedinexperiments.Thenalcolumnindicatesthedifcultyofthedatainfeatureselection,asmallervalueindicatingamorechallengingproblem.5.1HowStablearetheCriteriatoSmallChangesintheData?Thesetoffeaturesselectedbyanyprocedurewillofcoursedependonthedataprovided.Itisaplausiblecomplaintifthesetofreturnedfeaturesvarieswildlywithonlyslightvariationsinthesupplieddata.Thisisanissuereminiscentofthebias-variancedilemma,wherethesensitivityofaclassiertoitsinitialconditionscauseshighvarianceresponses.However,whilethebias-variancedecompositioniswell-denedandunderstood,thecorrespondingissueforfeatureselection,the`stability',hasonlyrecentlybeenstudied.Thestabilityofafeatureselectioncriterionrequiresameasuretoquantifythe`similarity'betweentwoselectedfeaturesets.ThiswasrstdiscussedbyKalousisetal.(2007),whoinvestigatedseveralmeasures,withthenalrecommendationbeingtheTanimotodistancebetweensets.Suchset-intersectionmeasuresseemappropriate,buthavelimitations;forexample,iftwocriteriaselectedidenticalfeaturesetsofsize10,wemightbelesssurprisedifweknewtheoverallpooloffeatureswasofsize12,thanifitwassize12,000.Toaccount44 FEATURESELECTIONVIACONDITIONALLIKELIHOODforthis,Kuncheva(2007)presentsaconsistencyindex,basedonthehypergeometricdistributionwithacorrectionforchance.Denition7TheconsistencyfortwosubsetsA;BX,suchthatjAj=jBj=k,andr=jA\Bjwhere0kjXj=n,isC(A;B)=rn k2 k(n k):Theconsistencytakesvaluesintherange[ 1;+1],withapositivevalueindicatingsimilarsets,azerovalueindicatingapurelyrandomrelation,andanegativevalueindicatingastronganti-correlationbetweenthefeaturessets.Oneproblemwiththeconsistencyindexisthatitdoesnottakefeatureredundancyintoaccount.Thatis,twoprocedurescouldselectfeatureswhichhavedifferentarrayindices,soareidentiedas`different',butinfactaresohighlycorrelatedthattheyareeffectivelyidentical.AmethodtodealwiththissituationwasproposedbyYuetal.(2008).Thismethodconstructsaweightedcompletebipartitegraph,wherethetwonodesetscorrespondtotwodifferentfeaturesets,andweightsareassignedtothearcsarethenormalizedmutualinformationbetweenthefeaturesatthenodes,alsosometimesreferredtoasthesymmetricaluncertainty.TheweightbetweennodeinsetA,andnodeinsetB,isw(A();B())=I(XA(i)XB(j)) H(XA(i))+H(XB(j)):TheHungarianalgorithmisthenappliedtoidentifythemaximumweightedmatchingbetweenthetwonodesets,andtheoverallsimilaritybetweensetsAandBisthenalmatchingcost.Thisistheinformationconsistencyofthetwosets.Formoredetails,werefertoYuetal.(2008).Wenowcomparethesetwomeasuresonthecriteriafromtheprevioussections.Foreachdataset,wetakeabootstrapsampleandselectasetoffeaturesusingeachfeatureselectioncriterion.The(information)stabilityofasinglecriterionisquantiedastheaveragepairwise(information)consistencyacross50bootstrapsfromthetrainingdata.Figure3showsKuncheva'sstabilitymeasureonaverageover15datasets,selectingfeaturesetsofsize10;notethatthecriteriahavebeendisplayedorderedleft-to-rightbytheirmedianvalueofstabilityoverthe15datasets.Themarginalmutualinformation,MIM,isasexpectedthemoststable,giventhatithasthelowestdimensionaldistributiontoapproximate.ThenextmoststableisJMIwhichincludestherelevancy/redundancyterms,butaveragesoverthecurrentfeatureset;thisaveragingprocessmightthereforebeinterpretedempiricallyasaformof`smoothing',enablingthecriteriaoveralltoberesistanttopoorestimationofprobabilitydistributions.ItcanbenotedthatthefarrightofFigure3consistsoftheMIFS,ICAPandCIFEcriteria,allofwhichdonotattempttoaveragetheredundancyterms.Figure4showsthesamedatasets,butinsteadtheinformationstabilityiscomputed;asmen-tioned,thisshouldtakeintoaccountthefactthatsomefeaturesarehighlycorrelated.Interestingly,thetwobox-plotsshowbroadlysimilarresults.MIMisthemoststable,andCIFEistheleaststable,thoughhereweseethatJMI,DISR,andMRMRareactuallymorestablethanKuncheva'sstabilityindexcanreect.Aninterestinglineoffutureresearchmightbetocombinethebestofthesetwostabilitymeasuresonethatcantakeintoaccountbothfeatureredundancyandacorrectionforrandomchance.45 BROWN,POCOCK,ZHAOANDLUJ´AN Figure3:Kuncheva'sStabilityIndexacross15datasets.Theboxindicatestheupper/lowerquar-tiles,thehorizontallinewithineachshowsthemedianvalue,whilethedottedcrossbarsindicatethemaximum/minimumvalues.Forconvenienceofinterpretation,criteriaonthex-axisareorderedbytheirmedianvalue. Figure4:Yuetal'sInformationStabilityIndexacross15datasets.Forcomparison,criteriaonthex-axisareorderedidenticallytoFigure3.Thegeneralpictureemergessimilarly,thoughtheinformationstabilityindexisabletotakefeatureredundancyintoaccount,showingthatsomecriteriaareslightlymorestablethanexpected.46 FEATURESELECTIONVIACONDITIONALLIKELIHOOD (a)Kuncheva'sConsistencyIndex. (b)Yuetal'sInformationStabilityIndex.Figure5:Relationsbetweenfeaturesetsgeneratedbydifferentcriteria,onaverageover15datasets.2-Dvisualisationgeneratedbyclassicalmulti-dimensionalscaling.5.2HowSimilararetheCriteria?Twocriteriacanbedirectlycomparedwiththesamemethodology:bymeasuringtheconsistencyandinformationconsistencybetweenselectedfeaturesubsetsonacommonsetofdata.Wecalculatethemeanconsistenciesbetweentwofeaturesetsofsize10,repeatedlyselectedover50bootstrapsfromtheoriginaldata.Thisisthenarrangedinasimilaritymatrix,andweuseclassicalmulti-dimensionalscalingtovisualisethisasa2-dmap,showninFigures5aand5b.Noteagainthatwhiletheindicesmayreturndifferentabsolutevalues(oneisanormalizedmeanofahypergeometricdistributionandtheotherisapairwisesumofmutualinformationterms)theyshowverysimilarrelative`distances'betweencriteria.Bothdiagramsshowaclusterofseveralcriteria,and4clearoutliers:MIFS,CIFE,ICAPandCondRed.The5criteriaclusteringintheupperleftofthespaceappeartoreturnrelativelysimilarfeaturesets.The4outliersappeartoreturnquitesignicantlydifferentfeaturesets,bothfromtheclusteredset,andfromeachother.Acommoncharacteristicofthese4outliersisthattheydonotscaletheredundancyorconditionalredundancyinformationterms.Inthesecriteria,theupperboundontheredundancytermåj2SI(XkXj)growslinearlywiththenumberofselectedfeatures,whilsttheupperboundontherelevancytermI(XkY)remainsconstant.Whenthishappenstherelevancytermisoverwhelmedbytheredundancytermandthusthecriterionselectsfeatureswithminimalredundancy,ratherthantradingoffbetweenthetwoterms.Thisleadstostronglydivergentfeaturesetsbeingselected,whichisreectedinthestabilityofthecriteria.Eachoftheoutliersaredifferentfromeachotherastheyhavedifferentcombinationsofredundancyandconditionalredundancy.Wewillseethis`balance'betweenrelevancyandredundancyemergeasacommonthemeintheexperimentsoverthenextfewsections.47 BROWN,POCOCK,ZHAOANDLUJ´AN5.3HowdoCriteriaBehaveinLimitedandExtremeSmall-sampleSituations?Toassesshowcriteriabehaveindatapoorsituations,wevarythenumberofdatapointssuppliedtoperformthefeatureselection.Theprocedurewastorandomlyselect140datapoints,thenusetheremainingdataasahold-outset.Fromthis140,thenumberprovidedtoeachcriterionwasincreasedinstepsof10,fromaminimalsetofsize20.Toallowareasonabletestingsetsize,welimitedthisassessmenttoonlydatasetswithatleast200datapointstotal;thisgivesus11datasetsfromthe15,omittinglungcancerparkinsonssoybeansmall,andwine.Foreachdatasetweselect10featuresandapplythe3-nnclassier,recordingtherank-orderofthecriteriaintermsoftheirgeneralisationerror.Thisprocesswasrepeatedandaveragedover50trials,givingtheresultsinFigure6.ToaidinterpretationwelabelMIMwithasimplepointmarker,MIFS,CIFE,CondRed,andICAPwithacircle,andtheremainingcriteria(DISR,JMI,mRMRandCMIM)withastar.Thecriterialabelledwithastarbalancetherelativemagnitudeoftherelevancyandredundancyterms,thosewithacircledonotattempttobalancethem,andMIMcontainsnoredundancyterm.Thereisaclearseparationbetweenthosecriteriawithastaroutperformingthosewithacircle,andMIMvaryinginperformancebetweenthetwogroupsasweallowmoretrainingdatapoints.NoticethatthehighestrankedcriteriacoincidewiththoseintheclusteratthetopleftofFigures5aand5b.WesuggestthattherelativedifferenceinperformanceisduetothesamereasonnotedinSection5.2,thattheredundancytermgrowswiththesizeoftheselectedfeatureset.Inthiscase,theredundancytermeventuallygrowstooutweightherelevancybyalargedegree,andthenewfeaturesareselectedsolelyonthebasisofredundancy,ignoringtherelevance,thusleadingtopoorclassicationperformance. 20 40 60 80 100 120 140 1 2 3 4 5 6 7 8 9 Training pointsRank mim mifs condred cife icap mrmr jmi disr cmim Figure6:Averageranksofcriteriaintermsoftesterror,selecting10features,across11datasets.Notethecleardominanceofcriteriawhichdonotallowtheredundancytermtoover-whelmtherelevancyterm(unlledmarkers)overthosethatallowredundancytogrowwiththesizeofthefeatureset(lledmarkers).48 FEATURESELECTIONVIACONDITIONALLIKELIHOODDataFeaturesExamplesClasses Colon2000622Leukemia7070722Lung325737Lymph4026969NCI99712609Table3:DatasetsfromPengetal.(2005),usedinexperiments.5.4ExtremeSmall-SampleExperimentsIntheprevioussectionswediscussedtwotheoreticalpropertiesofinformation-basedfeaturese-lectioncriteria:whetheritbalancestherelativemagnitudeofrelevancyagainstredundancy,andwhetheritincludesaclass-conditionalredundancyterm.EmpiricallyontheUCIdatasets,weseethatthebalancingisfarmoreimportantthantheinclusionoftheconditionalredundancytermforexample,MRMRsucceedsinmanycases,whileMIFSperformspoorly.Now,weconsiderwhethersamepropertymayholdinextremesmall-samplesituations,whenthenumberofexamplesissolowthatreliableestimationofdistributionsbecomesextremelydifcult.WeusedatasourcedfromPengetal.(2005),detailedinTable3.ResultsareshowninFigure7,selecting50featuresfromeachdatasetandplottingleave-one-outclassicationerror.Itshouldofcourseberememberedthatonsuchsmalldatasets,makingjustoneadditionaldatapointerrorcanresultinseeminglylargechangesinaccuracy.Forexample,thedifferencebetweenthebestandworstcriteriaonLeukemiawasjust3datapoints.IncontrasttotheUCIresults,thepictureislessclear.OnColon,thecriteriaallperformsimilarly;thisistheleastcomplexofallthedatasets,havingthesmallestnumberofclasseswitha(relatively)smallnumberoffeatures.Aswemovethroughthedatasetswithin-creasingnumbersoffeatures/classes,weseethatMIFS,CONDRED,CIFEandICAPstarttobreakaway,performingpoorlycomparedtotheothers.Again,wenotethatthesedonotattempttobal-ancerelevancy/redundancy.ThisdifferenceisclearestontheNCI9data,themostcomplexwith9classesand9712features.However,aswemayexpectwithsuchhighdimensionalandchallengingproblems,therearesomeexceptionstheColondataasmentioned,andalsotheLungdatawhereICAP/MIFSperformwell.5.5WhatistheRelationBetweenStabilityandAccuracy?Animportantquestioniswhetherwecanndagoodbalancebetweenthestabilityofacriterionandtheclassicationaccuracy.ThiswasconsideredbyGulgezenetal.(2009),whostudiedthesta-bility/accuracytrade-offfortheMRMRcriterion.Inthefollowing,weconsiderthistrade-offinthecontextofPareto-optimality,acrossthe9criteria,andthe15datasetsfromTable2.Experimentalprotocolwastotake50bootstrapsfromthedataset,eachtimecalculatingtheout-of-bagerrorusingthe3-nn.ThestabilitymeasurewasKuncheva'sstabilityindexcalculatedfromthe50featuresets,andtheaccuracywasthemeanout-of-bagaccuracyacrossthe50bootstraps.TheexperimentswerealsorepeatedusingtheInformationStabilitymeasure,revealingalmostidenticalresults.ResultsusingKuncheva'sstabilityindexareshowninFigure8.ThePareto-optimalsetisdenedasthesetofcriteriaforwhichnoothercriterionhasbothahigheraccuracyandahigherstability,hencethemembersofthePareto-optimalsetaresaidtobenon-dominated(FonsecaandFleming,1996).Thus,eachofthesubguresofFigure8,criteria49 BROWN,POCOCK,ZHAOANDLUJ´AN 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 ColonNumber of features selectedLOO number of mistakes mim mifs condred cife icap mrmr jmi disr cmim 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 LeukemiaNumber of features selectedLOO number of mistakes 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 50 LungNumber of features selectedLOO number of mistakes 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 50 LymphomaNumber of features selectedLOO number of mistakes 0 5 10 15 20 25 30 35 40 45 50 20 25 30 35 40 45 50 55 NCI9Number of features selectedLOO number of mistakes Figure7:LOOresultsonPeng'sdatasets:Colon,Lymphoma,Leukemia,Lung,NCI9.50 FEATURESELECTIONVIACONDITIONALLIKELIHOOD Figure8:Stabilty(y-axes)versusAccuracy(x-axes)over50bootstrapsforeachoftheUCIdatasets.Thepareto-optimalrankingsaresummarisedinTable4.51 BROWN,POCOCK,ZHAOANDLUJ´AN Accuracy=Stability(Yu) Accuracy=Stability(Kuncheva) Accuracy JMI(1:6) JMI(1:5) JMI(2:6) DISR(2:3) DISR(2:2) MRMR(3:6) MIM(2:4) MIM(2:3) DISR(3:7) MRMR(2:5) MRMR(2:5) CMIM(4:5) CMIM(3:3) CONDRED(3:2) ICAP(5:3) ICAP(3:6) CMIM(3:4) MIM(5:4) CONDRED(3:7) ICAP(4:3) CIFE(5:9) CIFE(4:3) CIFE(4:8) MIFS(6:5) MIFS(4:5) MIFS(4:9) CONDRED(7:4) Table4:Column1:Non-dominatedRankofdifferentcriteriaforthetrade-offofaccuracy/stability.Criteriawithahigherrank(closerto1:0)provideabettertradeoffthanthosewithalowerrank.Column2:Ascolumn1butusingKuncheva'sStabilityIndex.Column3:Averageranksforaccuracyalone.thatappearfurthertothetop-rightofthespacedominatethosetowardthebottomleftinsuchasituationthereisnoreasontochoosethoseatthebottomleft,sincetheyaredominatedonbothobjectivesbyothercriteria.Asummary(forbothstabilityandinformationstability)isprovidedinthersttwocolumnsofTable4,showingthenon-dominatedrankofthedifferentcriteria.Thisiscomputedperdatasetasthenumberofothercriteriawhichdominateagivencriterion,inthePareto-optimalsense,thenaveragedoverthe15datasets.Wecanseethattheserankingsaresimilartotheresultsearlier,withMIFS,ICAP,CIFEandCondRedperformingpoorly.WenotethatJMI,(whichbothbalancestherelevancyandredundancytermsandincludestheconditionalredundancy)outperformsallothercriteria.Wepresenttheaverageaccuracyranksacrossthe50bootstrapsincolumn3.ThesearesimilartotheresultsfromFigure6butuseabootstrapofthefulldataset,ratherthanasmallsamplefromit.FollowingDemsar(2006)weanalysedtheseranksusingaFriedmantesttodeterminewhichcriteriaarestatisticallysignicantlydifferentfromeachother.WethenusedaNemenyipost-hoctesttodeterminewhichcriteriadiffered,withstatisticalsignicancesat90%,95%,and99%condences.ThesegiveapartialorderingforthecriteriawhichwepresentinFigure9,showingaSignicantDominancePartialOrderdiagram.NotethatthisstyleofdiagramencapsulatesthesameinformationasaCriticalDifferencediagram(Demsar,2006),butallowsustodisplaymultiplelevelsofstatisticalsignicance.Aboldlineconnectingtwocriteriasigniesadifferenceatthe99%condencelevel,adashedlineatthe95%level,andadottedlineatthe90%level.Absenceofalinksigniesthatwedonothavethestatisticalpowertodeterminethedifferenceonewayoranother.ReadingFigure9,weseethatwith99%condenceJMIissignicantlysuperiortoCondRed,andMIFS,butnotstatisticallysignicantlydifferentfromtheothercriteria.Aswelowerourcondencelevel,moredifferencesappear,forexampleMRMRandMIFSareonlysignicantlydifferentatthe90%condencelevel.52 FEATURESELECTIONVIACONDITIONALLIKELIHOOD \n \n \r Figure9:Signicantdominancepartial-orderdiagram.Criteriaareplacedtoptobottominthedi-agrambytheirranktakenfromcolumn3ofTable4.AlinkjoiningtwocriteriameansastatisticallysignicantdifferenceisobservedwithaNemenyipost-hoctestatthespec-iedcondencelevel.ForexampleJMIissignicantlysuperiortoMIFS(b=1)atthe99%condencelevel.Notethattheabsenceofalinkdoesnotsignifythelackofastatis-ticallysignicantdifference,butthattheNemenyitestdoesnothavesufcientpower(intermsofnumberofdatasets)todeterminetheoutcome(Demsar,2006).ItisinterestingtonotethatthefourbottomrankedcriteriacorrespondtothecornersoftheunitsquareinFigure2;whilethetopthree(JMI/MRMR/DISR)areallverysimilar,scalingthere-dundancytermsbythesizeofthefeatureset.ThemiddleranksbelongtoCMIM/ICAP,whicharesimilarinthattheyusethemin/maxstrategyinsteadofalinearcombinationofterms.53 BROWN,POCOCK,ZHAOANDLUJ´AN5.6SummaryofEmpiricalFindingsFromexperimentsinthissection,weconcludethatthebalanceofrelevancy/redundancytermsisextremelyimportant,whiletheinclusionofaclassconditionaltermseemstomatterless.Wendthatsomecriteriaareinherentlymorestablethanothers,andthatthetrade-offbetweenaccuracy(usingasimplek-nnclassier)andstabilityofthefeaturesetsdiffersbetweencriteria.Thebestoveralltrade-offforaccuracy/stabilitywasfoundintheJMIandMRMRcriteria.Inthefollowingsectionwere-assessthesendings,inthecontextoftwoproblemsposedfortheNIPSFeatureSelectionChallenge.6.PerformanceontheNIPSFeatureSelectionChallengeInthissectionweinvestigateperformanceofthecriteriaondatasetstakenfromtheNIPSFeatureSelectionChallenge(Guyon,2003).6.1ExperimentalProtocolsWepresentresultsusingGISETTE(ahandwritingrecognitiontask),andMADELON(anarticiallygenerateddataset).DataFeaturesExamples(Tr/Val)Classes GISETTE50006000/10002MADELON5002000/6002Table5:DatasetsfromtheNIPSchallenge,usedinexperiments.Toapplythemutualinformationcriteria,weestimatethenecessarydistributionsusinghis-togramestimators:featureswerediscretizedindependentlyinto10equalwidthbins,withbinboundariesdeterminedfromtrainingdata.Afterthefeatureselectionprocesstheoriginal(undis-cretised)datasetswereusedtoclassifythevalidationdata.Eachcriterionwasusedtogeneratearankingforthetop200featuresineachdataset.Weshowresultsusingthefulltop200forGISETTE,butonlythetop20forMADELONasafterthispointallcriteriademonstratedsevereovertting.WeusetheBalancedErrorRate,forfaircomparisonwithpreviouslypublishedworkontheNIPSdatasets.Weacceptthatthisdoesnotnecessarilysharethesameoptimaastheclassi-cationerror(towhichtheconditionallikelihoodrelates),andleaveinvestigationsofthistofuturework.ValidationdataresultsarepresentedinFigure10(GISETTE)andFigure11(MADELON).Theminimumofthevalidationerrorwasusedtoselectthebestperformingfeaturesetsize,thetrainingdataaloneusedtoclassifythetestingdata,andnallytestlabelsweresubmittedtothechallengewebsite.TestresultsareprovidedinTable6forGISETTE,andTable7forMADELON.5UnlikeinSection5,thedatasetswehaveusedfromtheNIPSFeatureSelectionChallengehaveagreaternumberofdatapoints(GISETTEhas6000trainingexamples,MADELONhas2000)andthuswecanpresentresultsusingadirectimplementationofEquation(10)asacriterion.WerefertothiscriterionasCMI,asitisusingtheconditionalmutualinformationtoscorefeatures.Unfortunatelytherearestillestimationerrorsinthiscalculationwhenselectingalargenumberof 5.WedonotprovideclassicationcondencesasweusedanearestneighbourclassierandthustheAUCisequalto1 BER.54 FEATURESELECTIONVIACONDITIONALLIKELIHOOD 0 20 40 60 80 100 120 140 160 180 200 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Number of FeaturesValidation Error cmi mim mifs cife icap mrmr jmi disr cmim Figure10:ValidationErrorcurveusingGISETTE. 0 2 4 6 8 10 12 14 16 18 20 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Number of FeaturesValidation Error cmi mim mifs cife icap mrmr jmi disr cmim Figure11:ValidationErrorcurveusingMADELON.55 BROWN,POCOCK,ZHAOANDLUJ´ANfeatures,evengiventhelargenumberofdatapointsandsothecriterionfailstoselectfeaturesafteracertainpoint,aseachfeatureappearsequallyirrelevant.InGISETTE,CMIselected13features,andsothetop10featureswereusedandthusoneresultisshown.InMADELON,CMIselected7featuresandso7resultsareshown.6.2ResultsonTestDataInTable6thereareseveraldistinctionsbetweenthecriteria,themoststrikingofwhichisthefailureofMIFStoselectaninformativefeatureset.Theimportanceofbalancingthemagnitudeoftherelevancyandtheredundancycanbeseenwhilstlookingattheothercriteriainthistest.Thosecriteriawhichbalancethemagnitudes,(CMIM,JMI,&mRMR)performbetterthanthosewhichdonot(ICAP,CIFE).TheDISRcriterionformsanoutlierhereasitperformspoorlywhencomparedtoJMI.TheonlydifferencebetweenthesetwocriteriaisthenormalizationinDISRassuch,thisisthelikelycauseoftheobservedpoorperformance,theintroductionofmorevariancebyestimatingthenormalizationH(XkXjY)Wecanalsoseehowimportantthelowdimensionalapproximationis,asevenwith6000trainingexamplesCMIcannotestimatetherequiredjointdistributiontoavoidselectingprobes,despitebeingadirectiterativemaximisationoftheconditionallikelihoodinthelimitofdatapoints. Criterion BER AUC Features(%) Probes(%) MIM 4.18 95.82 4.00 0.00 MIFS 42.00 58.00 4.00 58.50 CIFE 6.85 93.15 2.00 0.00 ICAP 4.17 95.83 1.60 0.00 CMIM 2.86 97.14 2.80 0.00 CMI 8.06 91.94 0.20 20.00 mRMR 2.94 97.06 3.20 0.00 JMI 3.51 96.49 4.00 0.00 DISR 8.03 91.97 4.00 0.00 WinningChallengeEntry 1.35 98.71 18.3 0.0 Table6:NIPSFSChallengeResults:GISETTE.TheMADELONresults(Table7)showaparticularlyinterestingpointthetopperformers(intermsofBER)areJMIandCIFE.Boththesecriteriaincludetheclass-conditionalredundancyterm,butCIFEdoesnotbalancetheinuenceofrelevancyagainstredundancy.Inthiscase,itappearsthe`balancing'issue,soimportantinourpreviousexperimentsseemstohavelittleimportanceinstead,thepresenceoftheconditionalredundancytermisthedifferentiatingfactorbetweencriteria(notethepoorperformanceofMIFS/MRMR).ThisisperhapsnotsurprisinggiventhenatureoftheMADELONdata,constructedpreciselytorequirefeaturestobeevaluatedjointly.Itisinterestingtonotethatthechallengeorganisersbenchmarkeda3-NNusingtheoptimalfeatureset,achievinga10%testerror(Guyon,2003).Manyofthecriteriamanagedtoselectfeaturesetswhichachievedasimilarerrorrateusinga3-NN,anditislikelythatamoresophisticatedclassierisrequiredtofurtherimproveperformance.Thisconcludesourexperimentalstudyinthefollowing,wemakefurtherlinkstotheliteratureforthetheoreticalframework,anddiscussimplicationsforfuturework.56 FEATURESELECTIONVIACONDITIONALLIKELIHOOD Criterion BER AUC Features(%) Probes(%) MIM 10.78 89.22 2.20 0.00 MIFS 46.06 53.94 2.60 92.31 CIFE 9.50 90.50 3.80 0.00 ICAP 11.11 88.89 1.60 0.00 CMIM 11.83 88.17 2.20 0.00 CMI 21.39 78.61 0.80 0.00 mRMR 35.83 64.17 3.40 82.35 JMI 9.50 90.50 3.20 0.00 DISR 9.56 90.44 3.40 0.00 WinningChallengeEntry 7.11 96.95 1.6 0.0 Table7:NIPSFSChallengeResults:MADELON.7.RelatedWork:StrongandWeakRelevanceKohaviandJohn(1997)proposeddenitionsofstrongandweakfeaturerelevance.Thedenitionsareformedfromstatementsabouttheconditionalprobabilitydistributionsofthevariablesinvolved.Wecanre-statethedenitionsofKohaviandJohn(hereafterKJ)intermsofmutualinformation,andseehowtheycantintoourconditionallikelihoodmaximisationframework.Inthenotationbelow,notationXiindicatesthethfeatureintheoverallsetX,andnotationXniindicatesthesetfXnXig,allfeaturesexcepttheth.Denition8:StronglyRelevantFeature(KohaviandJohn,1997)FeatureXiisstronglyrelevanttoYiffthereexistsanassignmentofvaluesxi,y,xniforwhichp(Xi=xi;Xni=xni)0andp(Y=yjXi=xi;Xni=xni)=p(Y=yjXni=xni):Corollary9AfeatureXiisstronglyrelevantiffI(XiYjXni)0ProofTheKLdivergenceDKL(p(yjxz)jjp(yjz))0iffp(yjxz)=p(yjz)forsomeassignmentofvaluesx;y;z.Asimplere-applicationofthemanipulationsleadingtoEquation(5)demonstratesthattheexpectedKL-divergenceExzfp(yjxz)jjp(yjz)gisequaltothemutualinformationI(XYjZ).Inthedenitionofstrongrelevance,ifthereexistsasingleassignmentofvaluesxiyxnithatsatisestheinequality,thenExfp(yjxixni)jjp(yjxni)g0andthereforeI(XiYjXni)0. Giventheframeworkwehavepresented,wecannotethatthisstrongrelevancecomesfromacom-binationofthreetermsI(XiYjXni)=I(XiY) I(XiXni)+I(XiXnijY):Thisviewofstrongrelevancedemonstratesexplicitlythatafeaturemaybeindividuallyirrelevant(i.e.,p(yjxi)=p(y)andthusI(XiY)=0),butstillstronglyrelevantifI(XiXnijY) I(XiXni)0.Denition10:WeaklyRelevantFeature(KohaviandJohn,1997)FeatureXiisweaklyrelevanttoYiffitisnotstronglyrelevantandthereexistsasubsetZXni,andanassignmentofvaluesxi,y,zforwhichp(Xi=xi;Z=z)0suchthatp(Y=yjXi=xi;Z=z)=p(Y=yjZ=z)57 BROWN,POCOCK,ZHAOANDLUJ´ANCorollary11AfeatureXiisweaklyrelevanttoYiffitisnotstronglyrelevantandI(XiYjZ)0forsomeZXniProofThisfollowsimmediatelyfromtheproofforthestrongrelevanceabove. Itisinteresting,andsomewhatnon-intuitive,thattherecanbecaseswheretherearenostronglyrelevantfeatures,butallareweaklyrelevant.Thiswilloccurforexampleinadatasetwhereallfeatureshaveexactduplicates:wehave2Mfeaturesand8;XM+i=Xi.Inthiscase,foranyXk(suchthatkM)wewillhaveI(XkYjXni)=0sinceitsduplicatefeatureXM+kwillcarrythesameinformation.Inthiscase,foranyfeatureXk(suchthatkM)thatisstronglyrelevantinthedatasetfX1;:::;XMg,itisweaklyrelevantinthedatasetfX1;:::;X2MgThisissuecanbedealtwithbyreningourdenitionofrelevancewithrespecttoasubsetofthefullfeaturespace.AparticularsubsetaboutwhichwehavesomeinformationisthecurrentlyselectedsetS.WecanrelateourframeworktoKJ'sdenitionsinthiscontext.FollowingKJ'sformulations,Denition12:RelevancewithrespecttothecurrentsetSFeatureXiisrelevanttoYwithrespecttoSiffthereexistsanassignmentofvaluesxi,y,sforwhichp(Xi=xi;S=s)-0.9;çµ0andp(Y=yjXi=xi;S=s)=p(Y=yjS=s):Corollary13FeatureXiisrelevanttoYwithrespecttoS,iffI(XiYjS)-0.9;çµ0AfeaturethatisrelevantwithrespecttoSiseitherstronglyorweaklyrelevant(intheKJsense)butitisnotpossibletodetermineinwhichclassitlies,aswehavenotconditionedonXni.Noticethatthedenitioncoincidesexactlywiththeforwardselectionheuristic(Denition2),whichwehaveshownisahill-climberontheconditionallikelihood.Asaresult,weseethathill-climbingontheconditionallikelihoodcorrespondstoaddingthemostrelevantfeaturewithrespecttothecurrentsetS.Againwere-emphasize,thattheresultantgaininthelikelihoodcomesfromacombinationofthreesourcesI(XiYjS)=I(XiY) I(XiS)+I(XiSjY):ItcouldeasilybethecasethatI(XiY)=0,thatisafeatureisentirelyirrelevantwhenconsideredonitsownbutthesumofthetworedundancytermsresultsinapositivevalueforI(XiYjS).Weseethatifacriteriondoesnotattempttomodelbothoftheredundancyterms,evenifonlyusinglowdimensionalapproximations,itrunstheriskofevaluatingtherelevanceofXiincorrectly.Denition14:IrrelevancewithrespecttothecurrentsetSFeatureXiisirrelevanttoYwithrespecttoSiff8xi,y,sforwhichp(Xi=xi;S=s)-0.9;çµ0andp(Y=yjXi=xi;S=s)=p(Y=yjS=s):Corollary15FeatureXiisirrelevanttoYwithrespecttoS,iffI(XiYjS)=0Inaforwardstep,ifafeatureXiisirrelevantwithrespecttoS,addingitalonetoSwillnotincreasetheconditionallikelihood.However,theremaybefurtheradditionstoSinthefuture,givingusaselectedsetS0;wemaythenndthatXiisthenrelevantwithrespecttoS0.InabackwardstepwecheckwhetherafeatureisirrelevantwithrespecttofSnXig,usingthetestI(XiYjfSnXig)=0.Inthiscase,removingthisfeaturewillnotdecreasetheconditionallikelihood.58 FEATURESELECTIONVIACONDITIONALLIKELIHOOD8.RelatedWork:StructureLearninginBayesianNetworksTheframeworkwehavedescribedalsoservestohighlightanumberofimportantlinkstotheliter-atureonstructurelearningofdirectedacyclicgraphical(DAG)models(Korb,2011).TheproblemofDAGlearningfromobserveddataisknowntobeNP-hard(Chickeringetal.,2004),andassuchthereexisttwomainfamiliesofapproximatealgorithms.MetricorScore-and-Searchlearn-ersconstructagraphbysearchingthespaceofDAGsdirectly,assigningascoretoeachbasedonpropertiesofthegraphinrelationtotheobserveddata;probablythemostwell-knownscoreistheBICmeasure(Korb,2011).However,thespaceofDAGsissuperexponentialinthenumberofvari-ables,andhenceanexhaustivesearchrapidlybecomescomputationallyinfeasible.GrossmanandDomingos(2004)proposedagreedyhill-climbingsearchoverstructures,usingconditionallikeli-hoodasascoringcriterion.Theirworkfoundsignicantadvantagefromusingthis`discriminative'learningobjective,asopposedtothetraditional`generative'jointlikelihood.ThepotentialofthisdiscriminativemodelperspectivewillbeexpandeduponinSection9.3.Constraintlearnersapproachtheproblemfromaconstructivistpointofview,addingandremov-ingarcsfromasingleDAGaccordingtoconditionalindependencetestsgiventhedata.WhenthecandidateDAGpassesallconditionalindependencestatementsobservedinthedata,itisconsideredtobeagoodmodel.Inthecurrentpaper,forafeaturetobeeligibleforinclusion,werequiredthatI(XkYjS)0.ThisisequivalenttoaconditionalindependencetestXk??=YjS.Onewell-knownproblemwithconstraintlearnersisthatifatestgivesanincorrectresult,theerrorcan`cascade',causingthealgorithmtodrawfurtherincorrectconclusionsonthenetworkstructure.Thisproblemisalsotrueofthepopulargreedy-searchheuristicsthatwehavedescribedinthiswork.InSection3.2,weshowedthatMarkovBlanketalgorithms(Tsamardinosetal.,2003)areanexampleoftheframeworkwepropose.Specically,thesolutiontoEquation(7)isa(possiblynon-unique)MarkovBlanket,andthesolutiontoEquation(8)isexactlytheMarkovboundary,thatis,aminimal,uniqueblanket.Itisinterestingtonotethatthesealgorithms,whicharearestrictedclassofstructurelearners,assumefaithfulnessofthedatadistribution.Wecanseestraightforwardlythatallcriteriawehaveconsidered,whencombinedwithagreedyforwardselection,alsomakethisassumption.9.ConclusionThisworkhaspresentedaunifyingframeworkforinformationtheoreticfeatureselection,bringingalmosttwodecadesofresearchonheuristicscoringcriteriaunderasingletheoreticalinterpretation.Thisisachievedviaanovelinterpretationofinformationtheoreticfeatureselectionasanoptimiza-tionoftheconditionallikelihoodthisisincontrasttothecurrentviewofmutualinformation,asaheuristicmeasureoffeaturerelevancy.9.1SummaryofContributionsInSection3weshowedhowtodecomposetheconditionallikelihoodintothreeterms,eachwiththeirowninterpretationinrelationtothefeatureselectionproblem.Oneoftheseemergesasaconditionalmutualinformation.Thisobservationallowsustoanswerthefollowingquestion:Whataretheimplicitstatisticalassumptionsofmutualinformationcriteria?Theinvestigationshaverevealedthatthevariouscriteriapublishedoverthepasttwodecadesareallapproximateiter-ativemaximisersoftheconditionallikelihood.Theapproximationsareduetoimplicitassumptions59 BROWN,POCOCK,ZHAOANDLUJ´ANonthedatadistribution:somearemorerestrictivethanothers,andaredetailedinSection4.Theapproximations,whileheuristic,arenecessaryduetotheneedtoestimatehighdimensionalproba-bilitydistributions.ThepopularMarkovBlanketlearningalgorithmIAMBisincludedinthisclassofprocedures,hencecanalsobeeseenasaniterativemaximiseroftheconditionallikelihood.Themaindifferencesbetweencriteriaarewhethertheyincludeaclass-conditionalterm,andwhethertheyprovideamechanismtobalancetherelativesizeoftheredundancytermsagainsttherelevancyterm.Toascertainhowthesedifferencesimpactthecriteriainpractice,weconductedanempiricalstudyof9differentheuristicmutualinformationcriteriaacross22datasets.Weanalyzedhowthecriteriabehaveinlarge/smallsamplesituations,howthestabilityofreturnedfeaturesetsvariesbetweencriteria,andhowsimilarcriteriaareinthefeaturesetstheyreturn.Inparticular,thefollowingquestionswereinvestigated:Howdothetheoreticalpropertiestranslatetoclassieraccuracy?Summarisingtheperfor-manceofthecriteriaundertheaboveconditions,includingtheclass-conditionaltermisnotalwaysnecessary.Variouscriteria,forexampleMRMR,aresuccessfulwithoutthisterm.However,with-outthistermcriteriaareblindtocertainclassesofproblems,forexample,theMADELONdataset,andwillperformpoorlyinthesecases.BalancingtherelevancyandredundancytermsishoweverextremelyimportantcriterialikeMIFS,orCIFE,thatallowredundancytoswamprelevancy,arerankedlowestforaccuracyinalmostallexperiments.Inaddition,thisimbalancetendstocauselargeinstabilityinthereturnedfeaturesetsbeinghighlysensitivetothesupplieddata.Howstablearethecriteriatosmallchangesinthedata?Severalcriteriareturnwildlydifferentfeaturesetswithjustsmallchangesinthedata,whileothersreturnsimilarsetseachtime,henceare`stable'procedures.Themoststablewastheunivariatemutualinformation,followedcloselybyJMI(YangandMoody,1999;Meyeretal.,2008);whileamongtheleaststableareMIFS(Battiti,1994)andICAP(Jakulin,2005).Asvisualisedbymulti-dimensionalscalinginFigure5,severalcriteriaappeartoreturnquitesimilarsets,whiletherearesomeoutliers.Howdocriteriabehaveinlimitedandextremesmall-samplesituations?Inextremesmall-samplesituations,itappearstheaboverules(regardingtheconditionaltermandthebalancingofrelevancy-redundancy)canbebrokenthepoorestimationofdistributionsmeansthetheoreticalpropertiesdonottranslateimmediatelytoperformance.9.2AdviceforthePractitionerFromourinvestigationswehaveidentiedthreedesirablecharacteristicsofaninformationbasedselectioncriterion.Therstiswhetheritincludesreferencetoaconditionalredundancytermcriteriathatdonotincorporateitareeffectivelyblindtoanentireclassofproblems,thosewithstrongclass-conditionaldependencies.Thesecondiswhetheritkeepstherelativesizeoftheredundancytermfromswampingtherelevancyterm.Wendthistobeessentialwithoutthiscontrol,therelevancyofthekthfeaturecaneasilybeignoredintheselectionprocessduetothek 1redundancyterms.Thethirdissimplywhetherthecriterionisalow-dimensionalapproximation,hencemakingitusablewithsmallsamplesizes.OnGISETTEwith6000examples,wewereunabletoselectmorethan13featureswithanykindofreliability.Therefore,lowdimensionalapproximations,thefocusofthisarticle,areessential.AsummaryofthecriteriaisshowninTable8.Overallwendonly3criteriathatsatisfytheseproperties:CMIM,JMIandDISR.WerecommendtheJMIcriterion,asfromempiricalinvesti-gationsithasthebesttrade-off(inthePareto-optimalsense)ofaccuracyandstability.DISRis60 FEATURESELECTIONVIACONDITIONALLIKELIHOODanormalisedvariantofJMIinpracticewefoundlittleneedforthisnormalisationandtheextracomputationinvolved.IfhigherstabilityisrequiredtheMIMcriterion,asexpected,displayedthehigheststabilitywithrespecttovariationsinthedatathereforeinextremedata-poorsituationswewouldrecommendthisasarststep.Ifspeedisrequired,theCMIMcriterionadmitsanfastexactimplementationgivingordersofmagnitudespeed-upoverastraightforwardimplementationrefertoFleuret(2004)fordetails.Toaidreplicabilityofthiswork,implementationsofallcriteriawehavediscussedareprovidedat:http://www.cs.man.ac.uk/gbrown/fstoolbox/ MIMmRMRMIFSCMIMJMIDISRICAPCIFECMI CondRedundterm? 777444444 Balancesrel/red? 447444774 Estimable? 444444447 Table8:Summaryofcriteria.Theyhavebeenarrangedlefttorightinorderofascendingestimationdifculty.CondRedundterm:doesitincludetheconditionalredundancyterm?Balancesrel/red:doesitbalancetherelevanceandredundancyterms?Estimable:doesitusealowdimensionalapproximation,makingitusablewithsmallsamples?9.3FutureWorkWhileadviceonthesuitabilityofexistingcriteriaisofcourseuseful,perhapsamoreinterest-ingresultofthisworkistheperspectiveitbringstothefeatureselectionproblem.Wewereabletoexplicitlystateanobjectivefunction,andderiveanappropriateinformation-basedcriteriontomaximiseit.Thisbegsthequestion,whatselectioncriteriawouldresultfromdifferentobjectivefunctions?Dmochowskietal.(2010)studyaweightedconditionallikelihood,anditssuitabil-ityforcost-sensitiveproblemsitispossible(thoughoutsidethescopeofthispaper)toderiveinformation-basedcriteriainthiscontext.Thereversequestionisequallyinteresting,whatobjec-tivefunctionsareimpliedbyotherexistingcriteria,suchastheGiniIndex?TheKL-divergence(whichdenesthemutualinformation)isaspecialcaseofawiderfamilyofmeasures,basedonthe-divergencecouldweobtainsimilarefcientcriteriathatpursuethesemeasures,andwhatoverallobjectivesdotheyimply?Inthisworkweexploredcriteriathatusepairwise(i.e.,I(XkXj))approximationstothederivedobjective.Theseapproximationsarecommonlyusedastheyprovideareasonableheuristicwhilestillbeing(relatively)simpletoestimate.Therehasbeenworkwhichsuggestsrelaxingthispairwiseapproximation,andthusincreasingthenumberofterms(Brown,2009;Meyeretal.,2008),butthereislittleexplorationofhowmuchdataisrequiredtoestimatethesemultivariateinformationterms.Atheoreticalanalysisofthetradeoffbetweenestimationaccuracyandadditionalinformationprovidedbythesemorecomplextermscouldprovideinterestingdirectionsforimprovingthepoweroflterfeatureselectiontechniques.Averyinterestingdirectionconcernsthemotivationbehindtheconditionallikelihoodasanob-jective.Itcanbenotedthattheconditionallikelihood,thoughawell-acceptedobjectivefunctioninitsownright,canbederivedfromaprobabilisticdiscriminativemodel,asfollows.Weapproximatethetruedistributionpwithourmodelq,withthreedistinctparametersets:qforfeatureselection,61 BROWN,POCOCK,ZHAOANDLUJ´ANtforclassication,andlmodellingtheinputdistributionp(x).FollowingMinka(2005),intheconstructionofadiscriminativemodel,ourjointlikelihoodisL(D;q;t;l)=p(q;t)p(l)NÕi=1q(yijxi;q;t)q(xijl):Inthistypeofmodel,wewishtomaximizeLwithrespecttoq(ourfeatureselectionparameters)andt(ourmodelparameters),andarenotconcernedwiththegenerativeparametersl.ExcludingthegenerativetermsgivesL(D;q;t;l)µp(q;t)NÕi=1q(yijxi;q;t):Whenwehavenoparticularbiasorpriorknowledgeoverwhichsubsetoffeaturesorparametersaremorelikely(i.e.,aatpriorp(q;t)),thisreducestotheconditionallikelihood:L(D;q;t;l)µNÕi=1q(yijxi;q;t);whichwasexactlyourstartingpointforthecurrentpaper.Anobviousextensionhereistotakeanon-uniformprioroverfeatures.Animportantdirectionformachinelearningistoincorporatedomainknowledge.Anon-uniformpriorwouldmeaninuencingthesearchproceduretoincorpo-rateourbackgroundknowledgeofthefeatures.Thisisapplicableforexampleingeneexpressiondata,whenwemayhaveinformationaboutthemetabolicpathwaysinwhichgenesparticipate,andthereforewhichgenesarelikelytoinuencecertainbiologicalfunctions.Thisisoutsidethescopeofthispaperbutisthefocusofourcurrentresearch.AcknowledgmentsThisresearchwasconductedwithsupportfromtheUKEngineeringandPhysicalSciencesResearchCouncil,ongrantsEP/G000662/1andEP/F023855/1.MikelLuj´anissupportedbyaRoyalSocietyUniversityResearchFellowship.GavinBrownwouldliketothankJamesNeil,SohanSeth,andFabioRoliforinvaluablecommentaryondraftsofthiswork.AppendixA.Thefollowingproofsmakeuseoftheidentity,I(ABjC) I(AB)=I(ACjB) I(AC)A.1ProofofEquation(17)TheJointMutualInformationcriterion(YangandMoody,1999)canbewritten,Jjmi(Xk)=åXj2SI(XkXjY);=åXj2ShI(XjY)+I(XkYjXj)i:62 FEATURESELECTIONVIACONDITIONALLIKELIHOODThetermåXj2SI(XjY)intheaboveisconstantwithrespecttotheXkargumentthatweareinter-estedin,socanbeomitted.Thecriterionthereforereducesto(17)asfollows,Jjmi(Xk)=åXj2ShI(XkYjXj)i=åXj2ShI(XkY) I(XkXj)+I(XkXjjY)i=jSjI(XkY) åXj2ShI(XkXj) I(XkXjjY)iµI(XkY) 1 jSjåXj2ShI(XkXj) I(XkXjjY)i:A.2ProofofEquation(19)TherearrangementoftheConditionalMutualInformationcriterion(Fleuret,2004)followsaverysimilarprocedure.Theoriginal,anditsrewritingare,Jcmim(Xk)=minXj2ShI(XkYjXj)i=minXj2ShI(XkY) I(XkXj)+I(XkXjjY)i=I(XkY)+minXj2ShI(XkXjjY) I(XkXj)i=I(XkY) maxXj2ShI(XkXj) I(XkXjjY)i;whichisexactlyEquation(19).ReferencesK.S.BalaganiandV.V.Phoha.Onthefeatureselectioncriterionbasedonanapproximationofmultidimensionalmutualinformation.IEEETransactionsonPatternAnalysisandMachineIntelligence,32(7):13421343,2010.ISSN0162-8828.R.Battiti.Usingmutualinformationforselectingfeaturesinsupervisedneuralnetlearning.IEEETransactionsonNeuralNetworks,5(4):537550,1994.G.Brown.Anewperspectiveforinformationtheoreticfeatureselection.InInternationalConfer-enceonArticialIntelligenceandStatistics,volume5,pages4956,2009.H.Cheng,Z.Qin,C.Feng,Y.Wang,andF.Li.Conditionalmutualinformation-basedfeatureselectionanalyzingforsynergyandredundancy.ElectronicsandTelecommunicationsResearchInstitute(ETRI)Journal,33(2),2011.D.M.Chickering,D.Heckerman,andC.Meek.Large-samplelearningofbayesiannetworksisnp-hard.JournalofMachineLearningResearch,5:12871330,2004.T.M.CoverandJ.A.Thomas.ElementsofInformationTheory.Wiley-InterscienceNewYork,1991.63 BROWN,POCOCK,ZHAOANDLUJ´ANJ.Demsar.Statisticalcomparisonsofclassiersovermultipledatasets.JournalofMachineLearn-ingResearch,7:130,2006.J.P.Dmochowski,P.Sajda,andL.C.Parra.Maximumlikelihoodincost-sensitivelearning:modelspecication,approximations,andupperbounds.JournalofMachineLearningResearch,11:33133332,2010.W.Duch.FeatureExtraction:FoundationsandApplications,chapter3,pages89117.StudiesinFuzziness&SoftComputing.Springer,2006.ISBN3-540-35487-5.A.ElAkadi,A.ElOuardighi,andD.Aboutajdine.Apowerfulfeatureselectionapproachbasedonmutualinformation.InternationalJournalofComputerScienceandNetworkSecurity,8(4):116,2008.R.M.Fano.TransmissionofInformation:StatisticalTheoryofCommunications.NewYork:Wiley,1961.F.Fleuret.Fastbinaryfeatureselectionwithconditionalmutualinformation.JournalofMachineLearningResearch,5:15311555,2004.C.FonsecaandP.Fleming.Ontheperformanceassessmentandcomparisonofstochasticmultiob-jectiveoptimizers.ParallelProblemSolvingfromNature,pages584593,1996.D.GrossmanandP.Domingos.Learningbayesiannetworkclassiersbymaximizingconditionallikelihood.InInternationalConferenceonMachineLearning.ACM,2004.G.Gulgezen,Z.Cataltepe,andL.Yu.Stableandaccuratefeatureselection.MachineLearningandKnowledgeDiscoveryinDatabases,pages455468,2009.B.GuoandM.S.Nixon.GaitfeaturesubsetselectionbymutualinformationIEEETransSystems,ManandCybernetics,39(1):3646,January2009.I.Guyon.DesignofexperimentsfortheNIPS2003variableselectionbenchmarkhttp://www.nipsfsc.ecs.soton.ac.uk/papers/NIPS2003-Datasets.pdf,2003I.Guyon,S.Gunn,M.Nikravesh,andL.Zadeh,editors.FeatureExtraction:FoundationsandApplications.Springer,2006.ISBN3-540-35487-5.M.HellmanandJ.Raviv.Probabilityoferror,equivocation,andthechernoffbound.IEEETrans-actionsonInformationTheory,16(4):368372,1970.A.Jakulin.MachineLearningBasedonAttributeInteractions.PhDthesis,UniversityofLjubljana,Slovenia,2005.A.Kalousis,J.Prados,andM.Hilario.Stabilityoffeatureselectionalgorithms:astudyonhigh-dimensionalspaces.Knowledgeandinformationsystems,12(1):95116,2007.ISSN0219-1377.R.KohaviandG.H.John.Wrappersforfeaturesubsetselection.Articialintelligence,97(1-2):273324,1997.ISSN0004-3702.64 FEATURESELECTIONVIACONDITIONALLIKELIHOODD.KollerandM.Sahami.Towardoptimalfeatureselection.InInternationalConferenceonMa-chineLearning,1996.K.Korb.EncyclopediaofMachineLearning,chapterLearningGraphicalModels,page584.Springer,2011.L.I.Kuncheva.Astabilityindexforfeatureselection.InIASTEDInternationalMulti-Conference:ArticialIntelligenceandApplications,pages390395,2007.N.KwakandC.H.Choi.Inputfeatureselectionforclassicationproblems.IEEETransactionsonNeuralNetworks,13(1):143159,2002.D.D.Lewis.Featureselectionandfeatureextractionfortextcategorization.InProceedingsoftheworkshoponSpeechandNaturalLanguage,pages212217.AssociationforComputationalLinguisticsMorristown,NJ,USA,1992.D.LinandX.Tang.Conditionalinfomaxlearning:Anintegratedframeworkforfeatureextractionandfusion.InEuropeanConferenceonComputerVision,2006.P.MeyerandG.Bontempi.Ontheuseofvariablecomplementarityforfeatureselectionincancerclassication.InEvolutionaryComputationandMachineLearninginBioinformatics,pages91102,2006.P.E.Meyer,C.Schretter,andG.Bontempi.Information-theoreticfeatureselectioninmicroarraydatausingvariablecomplementarity.IEEEJournalofSelectedTopicsinSignalProcessing,2(3):261274,2008.T.Minka.Discriminativemodels,notdiscriminativetraining.MicrosoftResearchCambridge,Tech.Rep.TR-2005-144,2005.L.Paninski.Estimationofentropyandmutualinformation.NeuralComputation,15(6):11911253,2003.ISSN0899-7667.H.Peng,F.Long,andC.Ding.Featureselectionbasedonmutualinformation:Criteriaofmax-dependency,max-relevance,andmin-redundancy.IEEETransactionsonPatternAnalysisandMachineIntelligence,27(8):12261238,2005.C.E.Shannon.Amathematicaltheoryofcommunication.BellSystemsTechnicalJournal,27(3):379423,1948.M.TesmerandP.A.Estevez.Amifs:Adaptivefeatureselectionbyusingmutualinformation.InIEEEInternationalJointConferenceonNeuralNetworks,volume1,2004.I.TsamardinosandC.F.Aliferis.Towardsprincipledfeatureselection:Relevancy,ltersandwrappers.InProceedingsoftheNinthInternationalWorkshoponArticialIntelligenceandStatistics(AISTATS),2003.I.Tsamardinos,C.F.Aliferis,andA.Statnikov.Algorithmsforlargescalemarkovblanketdiscov-ery.In16thInternationalFLAIRSConference,volume103,2003.65 BROWN,POCOCK,ZHAOANDLUJ´ANM.Vidal-NaquetandS.Ullman.Objectrecognitionwithinformativefeaturesandlinearclassica-tion.IEEEConferenceonComputerVisionandPatternRecognition,2003.J.Weston,S.Mukherjee,O.Chapelle,M.Pontil,T.Poggio,andV.Vapnik.Featureselectionforsvms.AdvancesinNeuralInformationProcessingSystems,pages668674,2001.ISSN1049-5258.H.YangandJ.Moody.Datavisualizationandfeatureselection:Newalgorithmsfornon-gaussiandata.AdvancesinNeuralInformationProcessingSystems,12,1999.L.YuandH.Liu.Efcientfeatureselectionviaanalysisofrelevanceandredundancy.JournalofMachineLearningResearch,5:12051224,2004.L.Yu,C.Ding,andS.Loscalzo.Stablefeatureselectionviadensefeaturegroups.InProceedingofthe14thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMiningpages803811,2008.66