
Journal of Machine Learning Research 3 (2003) 1157-1182    Submitted 11/02; Published 3/03

An Introduction to Variable and Feature Selection

Isabelle Guyon    ISABELLE@CLOPINET.COM
Clopinet, 955 Creston Road, Berkeley, CA 94708-1501, USA

André Elisseeff    ANDRE@TUEBINGEN.MPG.DE
Empirical Inference for Machine Learning and Perception Department
Max Planck Institute for Biological Cybernetics
Spemannstrasse 38, 72076 Tübingen, Germany

Editor: Leslie Pack Kaelbling

Abstract

Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

Keywords: Variable selection, feature selection, space dimensionality reduction, pattern discovery, filters, wrappers, clustering, information theory, support vector machines, model selection, statistical testing, bioinformatics, computational biology, gene expression, microarray, genomics, proteomics, QSAR, text classification, information retrieval.

1 Introduction

As of 1997, when a special issue on relevance including several papers on variable and feature selection was published (Blum and Langley, 1997, Kohavi and John, 1997), few domains explored used more than 40 features. The situation has changed considerably in the past few years and, in this special issue, most papers explore domains with hundreds to tens of thousands of variables or features.1 New techniques are proposed to address these challenging tasks involving many irrelevant and redundant variables and often comparably few training examples. Two examples are typical of the new application domains and serve us as illustration throughout this introduction. One is gene selection from microarray data and the other is text categorization.

1. We call "variable" the "raw" input variables and "features" variables constructed from the input variables. We use without distinction the terms "variable" and "feature" when there is no impact on the selection algorithms, e.g., when features resulting from a pre-processing of input variables are explicitly computed. The distinction is necessary in the case of kernel methods for which features are not explicitly computed (see Section 5.3).

©2003 Isabelle Guyon and André Elisseeff.

In the gene selection problem, the variables are gene expression coefficients corresponding to the abundance of mRNA in a sample (e.g. tissue biopsy), for a number of patients. A typical classification task is to separate healthy patients from cancer patients, based on their gene expression "profile". Usually fewer than 100 examples (patients) are available altogether for training and testing. But, the number of variables in the raw data ranges from 6000 to 60,000. Some initial filtering usually brings the number of variables to a few thousand. Because the abundance of mRNA varies by several orders of magnitude depending on the gene, the variables are usually standardized. In the text classification problem, the documents are represented by a "bag-of-words", that is a vector of dimension the size of the vocabulary containing word frequency counts (proper normalization of the variables also applies). Vocabularies of hundreds of thousands of words are common, but an initial pruning of the most and least frequent words may reduce the effective number of words to 15,000. Large document collections of 5000 to 800,000 documents are available for research. Typical tasks include the automatic sorting of URLs into a web directory and the detection of unsolicited email (spam). For a list of publicly available datasets used in this issue, see Table 1 at the end of the paper.

There are many potential benefits of variable and feature selection: facilitating data visualization and data understanding, reducing the measurement and storage requirements, reducing training and utilization times, defying the curse of dimensionality to improve prediction performance. Some methods put more emphasis on one aspect than another, and this is another point of distinction between this special issue and previous work. The papers in this issue focus mainly on constructing and selecting subsets of features that are useful to build a good predictor. This contrasts with the problem of finding or ranking all potentially relevant variables. Selecting the most relevant variables is usually suboptimal for building a predictor, particularly if the variables are redundant. Conversely, a subset of useful variables may exclude many redundant, but relevant, variables. For a discussion of relevance vs. usefulness and definitions of the various notions of relevance, see the review articles of Kohavi and John (1997) and Blum and Langley (1997).

This introduction surveys the papers presented in this special issue. The depth of treatment of various subjects reflects the proportion of papers covering them: the problem of supervised learning is treated more extensively than that of unsupervised learning; classification problems serve more often as illustration than regression problems, and only vectorial input data is considered. Complexity is progressively introduced throughout the sections: The first section starts by describing filters that select variables by ranking them with correlation coefficients (Section 2). Limitations of such approaches are illustrated by a set of constructed examples (Section 3). Subset selection methods are then introduced (Section 4). These include wrapper methods that assess subsets of variables according to their usefulness to a given predictor. We show how some embedded methods implement the same idea, but proceed more efficiently by directly optimizing a two-part objective function with a goodness-of-fit term and a penalty for a large number of variables. We then turn to the problem of feature construction, whose goals include increasing the predictor performance and building more compact feature subsets (Section 5). All of the previous steps benefit from reliably assessing the statistical significance of the relevance of features. We briefly review model selection methods and statistical tests used to that effect (Section 6). Finally, we conclude the paper with a discussion section in which we go over more advanced issues (Section 7). Because the organization of our paper does not follow the workflow of building a machine learning application, we summarize the steps that may be
taken to solve a feature selection problem in a checklist:2

2. We caution the reader that this checklist is heuristic. The only recommendation that is almost surely valid is to try the simplest things first.

1. Do you have domain knowledge? If yes, construct a better set of "ad hoc" features.

2. Are your features commensurate? If no, consider normalizing them.

3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you (see example of use in Section 4.4).

4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features (e.g. by clustering or matrix factorization, see Section 5).

5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method (Section 2 and Section 7.2); else, do it anyway to get baseline results.

6. Do you need a predictor? If no, stop.

7. Do you suspect your data is "dirty" (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.

8. Do you know what to try first? If no, use a linear predictor.3 Use a forward selection method (Section 4.2) with the "probe" method as a stopping criterion (Section 6) or use the 0-norm embedded method (Section 4.3). For comparison, following the ranking of step 5, construct a sequence of predictors of same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.

9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods (Section 4). Use linear and non-linear predictors. Select the best approach with model selection (Section 6).

10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several "bootstraps" (Section 7.1).

2 Variable Ranking

Many variable selection algorithms include variable ranking as a principal or auxiliary selection mechanism because of its simplicity, scalability, and good empirical success. Several papers in this issue use variable ranking as a baseline method (see, e.g., Bekkerman et al., 2003, Caruana and de Sa, 2003, Forman, 2003, Weston et al., 2003). Variable ranking is not necessarily used to build predictors. One of its common uses in the microarray analysis domain is to discover a set of drug leads (see, e.g., Golub et al., 1999): A ranking criterion is used to find genes that discriminate between healthy and disease patients; such genes may code for "drugable" proteins, or proteins that may themselves be used as drugs. Validating drug leads is a labor intensive problem in biology that is outside of the scope of machine learning, so we focus here on building predictors.

3. By "linear predictor" we mean linear in the parameters. Feature construction may render the predictor non-linear in the input variables.

We consider in this section ranking criteria defined for individual variables, independently of the context of others. Correlation methods belong to that category. We also limit ourselves to supervised learning criteria. We refer the reader to Section 7.2 for a discussion of other techniques.

2.1 Principle of the Method and Notations

Consider a set of m examples {x_k, y_k} (k = 1, ..., m) consisting of n input variables x_{k,i} (i = 1, ..., n) and one output variable y_k. Variable ranking makes use of a scoring function S(i) computed from the values x_{k,i} and y_k, k = 1, ..., m. By convention, we assume that a high score is indicative of a valuable variable and that we sort variables in decreasing order of S(i). To use variable ranking to build predictors, nested subsets incorporating progressively more and more variables of decreasing relevance are defined. We postpone until Section 6 the discussion of selecting an optimum subset size.

Following the classification of Kohavi and John (1997), variable ranking is a filter method: it is a preprocessing step, independent of the choice of the predictor. Still, under certain independence or orthogonality assumptions, it may be optimal with respect to a given predictor. For instance, using Fisher's criterion4 to rank variables in a classification problem where the covariance matrix is diagonal is optimum for Fisher's linear discriminant classifier (Duda et al., 2001). Even when variable ranking is not optimal, it may be preferable to other variable subset selection methods because of its computational and statistical scalability: Computationally, it is efficient since it requires only the computation of n scores and sorting the scores; Statistically, it is robust against overfitting because it introduces bias but it may have considerably less variance (Hastie et al., 2001).5

We introduce some additional notation: If the input vector x can be interpreted as the realization of a random vector drawn from an underlying unknown distribution, we denote by X_i the random variable corresponding to the i-th component of x. Similarly, Y will be the random variable of which the outcome y is a realization. We further denote by x_i the m-dimensional vector containing all the realizations of the i-th variable for the training examples, and by y the m-dimensional vector containing all the target values.

2.2 Correlation Criteria

Let us consider first the prediction of a continuous outcome y. The Pearson correlation coefficient is defined as:

    R(i) = cov(X_i, Y) / sqrt( var(X_i) var(Y) ),    (1)

where cov designates the covariance and var the variance. The estimate of R(i) is given by:

    R(i) = Σ_{k=1}^m (x_{k,i} − x̄_i)(y_k − ȳ) / sqrt( Σ_{k=1}^m (x_{k,i} − x̄_i)²  Σ_{k=1}^m (y_k − ȳ)² ),    (2)

4. The ratio of the between-class variance to the within-class variance.
5. The similarity of variable ranking to the ORDERED-FS algorithm (Ng, 1998) indicates that its sample complexity may be logarithmic in the number of irrelevant features, compared to a power law for "wrapper" subset selection methods. This would mean that variable ranking can tolerate a number of irrelevant variables exponential in the number of training examples.
where the bar notation stands for an average over the index k. This coefficient is also the cosine between vectors x_i and y, after they have been centered (their mean subtracted). Although the estimate of Equation 2 is derived from the definition of Equation 1, it may be used without assuming that the input values are realizations of a random variable.

In linear regression, the coefficient of determination, which is the square of R(i), represents the fraction of the total variance around the mean value ȳ that is explained by the linear relation between x_i and y. Therefore, using R(i)² as a variable ranking criterion enforces a ranking according to goodness of linear fit of individual variables.6

The use of R(i)² can be extended to the case of two-class classification, for which each class label is mapped to a given value of y, e.g., ±1. R(i)² can then be shown to be closely related to Fisher's criterion (Furey et al., 2000), to the T-test criterion, and other similar criteria (see, e.g., Golub et al., 1999, Tusher et al., 2001, Hastie et al., 2001). As further developed in Section 6, the link to the T-test shows that the score R(i) may be used as a test statistic to assess the significance of a variable.

Correlation criteria such as R(i) can only detect linear dependencies between variable and target. A simple way of lifting this restriction is to make a non-linear fit of the target with single variables and rank according to the goodness of fit. Because of the risk of overfitting, one can alternatively consider using non-linear preprocessing (e.g., squaring, taking the square root, the log, the inverse, etc.) and then using a simple correlation coefficient. Correlation criteria are often used for microarray data analysis, as illustrated in this issue by Weston et al. (2003).

2.3 Single Variable Classifiers

As already mentioned, using R(i)² as a ranking criterion for regression enforces a ranking according to goodness of linear fit of individual variables. One can extend to the classification case the idea of selecting variables according to their individual predictive power, using as criterion the performance of a classifier built with a single variable. For example, the value of the variable itself (or its negative, to account for class polarity) can be used as discriminant function. A classifier is obtained by setting a threshold θ on the value of the variable (e.g., at the mid-point between the centers of gravity of the two classes).

The predictive power of the variable can be measured in terms of error rate. But, various other criteria can be defined that involve the false positive classification rate fpr and the false negative classification rate fnr. The tradeoff between fpr and fnr is monitored in our simple example by varying the threshold θ. ROC curves that plot the "hit" rate (1 − fnr) as a function of the "false alarm" rate fpr are instrumental in defining criteria such as: the "Break Even Point" (the hit rate for a threshold value corresponding to fpr = fnr) and the "Area Under Curve" (the area under the ROC curve).

In the case where there is a large number of variables that separate the data perfectly, ranking criteria based on classification success rate cannot distinguish between the top ranking variables. One will then prefer to use a correlation coefficient or another statistic like the margin (the distance between the examples of opposite classes that are closest to one another for a given variable).

6. A variant of this idea is to use the mean-squared-error, but, if the variables are not on comparable scales, a comparison between mean-squared-errors is meaningless. Another variant is to use R(i) to rank variables, not R(i)². Positively correlated variables are then top ranked and negatively correlated variables bottom ranked. With this method, one can choose a subset of variables with a given proportion of positively and negatively correlated variables.
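The two ranking criteria discussed so far, the squared Pearson correlation of Equation 2 and the AUC of a single-variable threshold classifier, can be sketched as follows. This is an illustrative numpy implementation, not code from the paper; the AUC is computed here via the equivalent pairwise-comparison definition (the probability that a random positive example outranks a random negative one).

```python
import numpy as np

def correlation_scores(X, y):
    """Squared Pearson correlation R(i)^2 of each column of X with y (Eq. 2)."""
    Xc = X - X.mean(axis=0)          # center each variable
    yc = y - y.mean()                # center the target
    num = Xc.T @ yc                  # per-variable cross-covariance sums
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return (num / den) ** 2

def single_variable_auc(x, y):
    """AUC of the single-variable classifier that thresholds x; labels y in {0, 1}.
    Equals the fraction of (positive, negative) pairs ranked correctly."""
    pos, neg = x[y == 1], x[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).sum()     # ties count 1/2
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

Ranking by either score is then just a sort, e.g. `np.argsort(-correlation_scores(X, y))` to list variables in decreasing order of S(i).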
The criteria described in this section extend to the case of binary variables. Forman (2003) presents in this issue an extensive study of such criteria for binary variables with applications in text classification.

2.4 Information Theoretic Ranking Criteria

Several approaches to the variable selection problem using information theoretic criteria have been proposed (as reviewed in this issue by Bekkerman et al., 2003, Dhillon et al., 2003, Forman, 2003, Torkkola, 2003). Many rely on empirical estimates of the mutual information between each variable and the target:

    I(i) = ∫∫ p(x_i, y) log [ p(x_i, y) / (p(x_i) p(y)) ] dx dy,    (3)

where p(x_i) and p(y) are the probability densities of x_i and y, and p(x_i, y) is the joint density. The criterion I(i) is a measure of dependency between the density of variable x_i and the density of the target y.

The difficulty is that the densities p(x_i), p(y) and p(x_i, y) are all unknown and are hard to estimate from data. The case of discrete or nominal variables is probably easiest because the integral becomes a sum:

    I(i) = Σ_{x_i} Σ_y P(X = x_i, Y = y) log [ P(X = x_i, Y = y) / (P(X = x_i) P(Y = y)) ].    (4)

The probabilities are then estimated from frequency counts. For example, in a three-class problem, if a variable takes 4 values, P(Y = y) represents the class prior probabilities (3 frequency counts), P(X = x_i) represents the distribution of the input variable (4 frequency counts), and P(X = x_i, Y = y) is the probability of the joint observations (12 frequency counts). The estimation obviously becomes harder with larger numbers of classes and variable values.

The case of continuous variables (and possibly continuous targets) is the hardest. One can consider discretizing the variables or approximating their densities with a non-parametric method such as Parzen windows (see, e.g., Torkkola, 2003). Using the normal distribution to estimate densities would bring us back to estimating the covariance between X_i and Y, thus giving us a criterion similar to a correlation coefficient.

3 Small but Revealing Examples

We present a series of small examples that outline the usefulness and the limitations of variable ranking techniques and present several situations in which the variable dependencies cannot be ignored.

3.1 Can Presumably Redundant Variables Help Each Other?

One common criticism of variable ranking is that it leads to the selection of a redundant subset. The same performance could possibly be achieved with a smaller subset of complementary variables. Still, one may wonder whether adding presumably redundant variables can result in a performance gain.

Figure 1: Information gain from presumably redundant variables. (a) A two class problem with independently and identically distributed (i.i.d.) variables. Each class has a Gaussian distribution with no covariance. (b) The same example after a 45 degree rotation showing that a combination of the two variables yields a separation improvement by a factor √2. I.i.d. variables are not truly redundant.

Consider the classification problem of Figure 1. For each class, we drew at random m = 100 examples, each of the two variables being drawn independently according to a normal distribution of standard deviation 1. The class centers are placed at coordinates (-1, -1) and (1, 1). Figure 1.a shows the scatter plot in the two-dimensional space of the input variables. We also show on the same figure histograms of the projections of the examples onto the axes. To facilitate its reading, the scatter plot is shown twice with an axis exchange. Figure 1.b shows the same scatter plots after a forty-five degree rotation. In this representation, the x-axis projection provides a better separation of the two classes: the standard deviation of both classes is the same, but the distance between centers in projection is now 2√2 instead of 2. Equivalently, if we rescale the x-axis by dividing by √2 to obtain a feature that is the average of the two input variables, the distance between centers is still 2, but the within-class standard deviation is reduced by a factor √2. This is not so surprising, since by averaging n i.i.d. random variables we will obtain a reduction of standard deviation by a factor of √n. Noise reduction and consequently better class separation may be obtained by adding variables that are presumably redundant. Variables that are independently and identically distributed are not truly redundant.

3.2 How Does Correlation Impact Variable Redundancy?

Another notion of redundancy is correlation. In the previous example, in spite of the fact that the examples are i.i.d. with respect to the class conditional distributions, the variables are correlated because of the separation of the class center positions. One may wonder how variable redundancy is affected by adding within-class variable correlation. In Figure 2, the class centers are positioned similarly as in the previous example at coordinates (-1, -1) and (1, 1), but we have added some variable covariance. We consider two cases:

Figure 2: Intra-class covariance. In projection onto the axes, the distributions of the two variables are the same as in the previous example. (a) The class conditional distributions have a high covariance in the direction of the line of the two class centers. There is no significant gain in separation by using two variables instead of just one. (b) The class conditional distributions have a high covariance in the direction perpendicular to the line of the two class centers. An important separation gain is obtained by using two variables instead of one.

In Figure 2.a, in the direction of the class center line, the standard deviation of the class conditional distributions is √2, while in the perpendicular direction it is a small value (ε = 1/10). With this construction, as ε goes to zero, the input variables have the same separation power as in the case of the example of Figure 1, with a standard deviation of the class distributions of one and a distance of the class centers of 2. But the feature constructed as the sum of the input variables has no better separation power: a standard deviation of √2 and a class center separation of 2√2 (a simple scaling that does not change the separation power). Therefore, in the limit of perfect variable correlation (zero variance in the direction perpendicular to the class center line), single variables provide the same separation as the sum of the two variables. Perfectly correlated variables are truly redundant in the sense that no additional information is gained by adding them.

In contrast, in the example of Figure 2.b, the first principal direction of the covariance matrices of the class conditional densities is perpendicular to the class center line. In this case, more is gained by adding the two variables than in the example of Figure 1. One notices that in spite of their great complementarity (in the sense that a perfect separation can be achieved in the two-dimensional space spanned by the two variables), the two variables are (anti-)correlated. More anti-correlation is obtained by
making the class centers closer and increasing the ratio of the variances of the class conditional distributions. Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity.

The examples of Figures 1 and 2 all have variables with the same distribution of examples (in projection onto the axes). Therefore, methods that score variables individually and independently of each other are at a loss to determine which combination of variables would give best performance.

3.3 Can a Variable that is Useless by Itself be Useful with Others?

One concern about multivariate methods is that they are prone to overfitting. The problem is aggravated when the number of variables to select from is large compared to the number of examples. It is tempting to use a variable ranking method to filter out the least promising variables before using a multivariate method. Still one may wonder whether one could potentially lose some valuable variables through that filtering process.

We constructed an example in Figure 3.a. In this example, the two class conditional distributions have identical covariance matrices, and the principal directions are oriented diagonally. The class centers are separated on one axis, but not on the other. By itself one variable is "useless". Still, the two-dimensional separation is better than the separation using the "useful" variable alone. Therefore, a variable that is completely useless by itself can provide a significant performance improvement when taken with others.

The next question is whether two variables that are useless by themselves can provide a good separation when taken together. We constructed an example of such a case, inspired by the famous XOR problem.7 In Figure 3.b, we drew examples for two classes using four Gaussians placed on the corners of a square at coordinates (0, 0), (0, 1), (1, 0), and (1, 1). The class labels of these four "clumps" are attributed according to the truth table of the logical XOR function: f(0, 0) = 0, f(0, 1) = 1, f(1, 0) = 1, f(1, 1) = 0. We notice that the projections onto the axes provide no class separation. Yet, in the two-dimensional space the classes can easily be separated (albeit not with a linear decision function).8 Two variables that are useless by themselves can be useful together.

7. The XOR problem is sometimes referred to as the two-bit parity problem and is generalizable to more than two dimensions (n-bit parity problem). A related problem is the chessboard problem in which the two classes pave the space with squares of uniformly distributed examples with alternating class labels. The latter problem is also generalizable to the multi-dimensional case. Similar examples are used in several papers in this issue (Perkins et al., 2003, Stoppiglia et al., 2003).
8. Incidentally, the two variables are also uncorrelated with one another.

Figure 3: A variable useless by itself can be useful together with others. (a) One variable has completely overlapping class conditional densities. Still, using it jointly with the other variable improves class separability compared to using the other variable alone. (b) XOR-like or chessboard-like problems. The classes consist of disjoint clumps such that in projection onto the axes the class conditional densities overlap perfectly. Therefore, individual variables have no separation power. Still, taken together, the variables provide good class separability.

4 Variable Subset Selection

In the previous section, we presented examples that illustrate the usefulness of selecting subsets of variables that together have good predictive power, as opposed to ranking variables according to their individual predictive power. We now turn to this problem and outline the main directions that have been taken to tackle it. They essentially divide into wrappers, filters, and embedded methods. Wrappers utilize the learning machine of interest as a black box to score subsets of variables according to their predictive power. Filters select subsets of variables as a pre-processing step, independently of the chosen predictor. Embedded methods perform variable selection in the process of training and are usually specific to given learning machines.

4.1 Wrappers and Embedded Methods

The wrapper methodology, recently popularized by Kohavi and John (1997), offers a simple and powerful way to address the problem of variable selection, regardless of the chosen learning machine. In fact, the learning machine is considered a perfect black box and the method lends itself to the use of off-the-shelf machine learning software packages. In its most general formulation, the wrapper methodology consists in using the prediction performance of a given learning machine to assess the relative usefulness of subsets of variables. In practice, one needs to define: (i) how to search the space of all possible variable subsets; (ii) how to assess the prediction performance of a learning machine to guide the search and halt it; and (iii) which predictor to use. An exhaustive search can conceivably be performed, if the number of variables is not too large. But, the problem is known to be NP-hard (Amaldi and Kann, 1998) and the search becomes quickly computationally intractable. A wide range of search strategies can be used, including best-first, branch-and-bound, simulated annealing, genetic algorithms (see Kohavi and John, 1997, for a review). Performance assessments are usually done using a validation set or by cross-validation (see Section 6). As illustrated in this special issue, popular predictors include decision trees, naïve Bayes, least-square linear predictors, and support vector machines.

Wrappers are often criticized because they seem to be a "brute force" method requiring massive amounts of computation, but it is not necessarily so. Efficient search strategies may be devised. Using such strategies does not necessarily mean sacrificing prediction performance. In fact, it appears to be the converse in some cases: coarse search strategies may alleviate the problem of overfitting, as illustrated for instance in this issue by the work of Reunanen (2003). Greedy search strategies seem to be particularly computationally advantageous and robust against overfitting. They come in two flavors: forward selection and backward elimination. In forward selection, variables are progressively incorporated into larger and larger subsets, whereas in backward elimination one starts with the set of all variables and progressively eliminates the least promising ones.9 Both methods yield nested subsets of variables.

By using the learning machine as a black box, wrappers are remarkably universal and simple. But embedded methods that incorporate variable selection as part of the training process may be more efficient in several respects: they make better use of the available data by not needing to split the training data into a training and validation set; they reach a solution faster by avoiding retraining a predictor from scratch for every variable subset investigated. Embedded methods are not new: decision trees such as CART, for instance, have a built-in mechanism to perform variable selection (Breiman et al., 1984). The next two sections are devoted to two families of embedded methods illustrated by algorithms published in this issue.

9. The name greedy comes from the fact that one never revisits former decisions to include (or exclude) variables in light of new decisions.

4.2 Nested Subset Methods

Some embedded methods guide their search by estimating changes in the objective function value incurred by making moves in variable subset space. Combined with greedy search strategies (backward elimination or forward selection) they yield nested subsets of variables.10

Let us call s the number of variables selected at a given algorithm step and J(s) the value of the objective function of the trained learning machine using such a variable subset. Predicting the change in the objective function is obtained by:

1. Finite difference calculation: The difference between J(s) and J(s+1) or J(s−1) is computed for the variables that are candidates for addition or removal.

2. Quadratic approximation of the cost function: This method was originally proposed to prune weights in neural networks (LeCun et al., 1990). It can be used for backward elimination of variables, via the pruning of the input variable weights w_i. A second order Taylor expansion of J is made. At the optimum of J, the first-order term can be neglected, yielding for variable i the variation ΔJ_i = (1/2) (∂²J/∂w_i²) (Δw_i)². The change in weight Δw_i = w_i corresponds to removing variable i.

3. Sensitivity of the objective function calculation: The absolute value or the square of the derivative of J with respect to x_i (or with respect to w_i) is used.

10. The algorithms presented in this section and in the following generally benefit from variable normalization, except if they have an internal normalization mechanism like the Gram-Schmidt orthogonalization procedure.

Some training algorithms lend themselves to using finite differences (method 1) because exact differences can be computed efficiently, without retraining new models for each candidate variable. Such is the case for the linear least-square model: The Gram-Schmidt orthogonalization procedure permits the performance of forward variable selection by adding at each step the variable that most decreases the mean-squared-error. Two papers in this issue are devoted to this technique (Stoppiglia et al., 2003, Rivals and Personnaz, 2003). For other algorithms like kernel methods, approximations of the difference can be computed efficiently. Kernel methods are learning machines of the form f(x) = Σ_{k=1}^m α_k K(x, x_k), where K is the kernel function, which measures the similarity between x and x_k (Schoelkopf and Smola, 2002). The variation in J(s) is computed by keeping the α_k values constant. This procedure originally proposed for SVMs (Guyon et al., 2002) is used in this issue as a baseline method (Rakotomamonjy, 2003, Weston et al., 2003).

The "optimum brain damage" (OBD) procedure (method 2) is mentioned in this issue in the paper of Rivals and Personnaz (2003). The case of linear predictors f(x) = w·x + b is particularly simple. The authors of the OBD algorithm advocate using ΔJ_i instead of the magnitude of the weights |w_i| as pruning criterion. However, for linear predictors trained with an objective function J that is quadratic in w_i these two criteria are equivalent. This is the case, for instance, for the linear least square model using J = Σ_{k=1}^m (w·x_k + b − y_k)² and for the linear SVM or optimum margin classifier, which minimizes J = (1/2) ||w||², under constraints (Vapnik, 1982). Interestingly, for linear SVMs the finite difference method (method 1) and the sensitivity method (method 3) also boil down to selecting the variable with smallest |w_i| for elimination at each step (Rakotomamonjy, 2003).

The sensitivity of the objective function (method 3) is used to devise a forward selection procedure in one paper presented in this issue (Perkins et al., 2003). Applications of this procedure to a linear model with a cross-entropy objective function are presented. In the formulation proposed, the criterion is the absolute value of ∂J/∂w_i = Σ_{k=1}^m (∂J/∂ρ_k)(∂ρ_k/∂w_i), where ρ_k = y_k f(x_k). In the case of the linear model f(x) = w·x + b, the criterion has a simple geometrical interpretation: it is the dot product between the gradient of the objective function with respect to the margin values and the vector [∂ρ_k/∂w_i = x_{k,i} y_k], k = 1, ..., m. For the cross-entropy loss function, we have: ∂J/∂ρ_k = −1/(1 + e^{ρ_k}).

An interesting variant of the sensitivity analysis method is obtained by replacing the objective function by the leave-one-out cross-validation error. For some learning machines and some objective functions, approximate or exact analytical formulas of the leave-one-out error are known. In this issue, the case of the linear least-square model (Rivals and Personnaz, 2003) and SVMs (Rakotomamonjy, 2003) are treated. Approximations for non-linear least-squares have also been computed elsewhere (Monari and Dreyfus, 2000). The proposal of Rakotomamonjy (2003) is to train non-linear SVMs (Boser et al., 1992, Vapnik, 1998) with a regular training procedure and select features with backward elimination like in RFE (Guyon et al., 2002). The variable ranking criterion however is not computed using the sensitivity of the objective function J, but that of a leave-one-out bound.
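The backward-elimination loop shared by these nested subset methods, train a linear model, drop the variable with the smallest weight magnitude, retrain, can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation: a ridge-regularized least-squares fit stands in for the SVM training step, `lam` is an arbitrary illustrative constant, and the bias term is omitted (inputs are assumed centered and normalized, as the section recommends).

```python
import numpy as np

def rfe_linear(X, y, n_keep, lam=1e-3):
    """RFE-style backward elimination: repeatedly fit a linear model and
    eliminate the variable with smallest |w_i| until n_keep remain."""
    active = list(range(X.shape[1]))   # indices of surviving variables
    while len(active) > n_keep:
        Xa = X[:, active]
        # Ridge solution w = (Xa'Xa + lam*I)^{-1} Xa'y as a stand-in
        # for training the linear predictor f(x) = w.x.
        w = np.linalg.solve(Xa.T @ Xa + lam * np.eye(len(active)), Xa.T @ y)
        active.pop(int(np.argmin(np.abs(w))))   # drop smallest-|w_i| variable
    return active
```

Note that each pass retrains only on the surviving variables, which is what makes the resulting subsets nested.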
4.3 Direct Objective Optimization

A lot of progress has been made in this issue to formalize the objective function of variable selection and to find algorithms to optimize it. Generally, the objective function consists of two terms that compete with each other: (1) the goodness-of-fit (to be maximized), and (2) the number of variables (to be minimized). This approach bears similarity with two-part objective functions consisting of a goodness-of-fit term and a regularization term, particularly when the effect of the regularization term is to "shrink" parameter space. This correspondence is formally established in the paper of Weston et al. (2003) for the particular case of classification with linear predictors f(x) = w·x + b in the SVM framework (Boser et al., 1992, Vapnik, 1998). Shrinking regularizers of the type ||w||p = (Σi=1..n |wi|^p)^(1/p) (p-norm) are used. In the limit as p → 0, the p-norm is just the number of weights, i.e., the number of variables. Weston et al. proceed with showing that the 0-norm formulation of SVMs can be solved approximately with a simple modification of the vanilla SVM algorithm:

1. Train a regular linear SVM (using 1-norm or 2-norm regularization).

2. Re-scale the input variables by multiplying them by the absolute values of the components of the weight vector w obtained.

3. Iterate the first two steps until convergence.

The method is reminiscent of backward elimination procedures based on the smallest |wi|. Variable normalization is important for such a method to work properly. Weston et al. note that, although their algorithm only approximately minimizes the 0-norm, in practice it may generalize better than an algorithm that really did minimize the 0-norm, because the latter would not provide sufficient regularization (a lot of variance remains because the optimization problem has multiple solutions).

The need for additional regularization is also stressed in the paper of Perkins et al. (2003). The authors use a three-part objective function that includes goodness-of-fit, a regularization term (1-norm or 2-norm), and a penalty for large numbers of variables (0-norm). The authors propose a computationally efficient forward selection method to optimize such an objective. Another paper in the issue, by Bi et al. (2003), uses 1-norm SVMs, without iterative multiplicative updates.
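A minimal sketch of the three-step scheme above, assuming NumPy and using regularized least squares as a hypothetical stand-in for the linear SVM (so this illustrates the multiplicative update, not Weston et al.'s implementation):

```python
import numpy as np

def ridge_w(X, y, lam=0.1):
    # Stand-in for step 1: train a regular linear predictor.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def approx_zero_norm_scaling(X, y, n_iter=20):
    """Iterative multiplicative update: train, re-scale inputs by |w_i|,
    repeat. Variables whose cumulative scale collapses toward zero are
    effectively eliminated, approximating 0-norm minimization."""
    scale = np.ones(X.shape[1])
    for _ in range(n_iter):
        w = np.abs(ridge_w(X * scale, y))   # step 1 on re-scaled inputs
        scale = scale * w                   # step 2: multiplicative update
        scale = scale / scale.max()         # renormalize for stability
    return scale
```

The renormalization is a numerical convenience; what matters is the ratio of scales, which separates surviving variables from eliminated ones.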
The authors find that, for their application, the 1-norm minimization suffices to drive enough weights to zero. This approach was also taken in the context of least-square regression by other authors (Tibshirani, 1994). The number of variables can be further reduced by backward elimination.

To our knowledge, no algorithm has been proposed to directly minimize the number of variables for non-linear predictors. Instead, several authors have substituted for the problem of variable selection that of variable scaling (Jebara and Jaakkola, 2000, Weston et al., 2000, Grandvalet and Canu, 2002). The variable scaling factors are "hyper-parameters" adjusted by model selection. The scaling factors obtained are used to assess variable relevance. A variant of the method consists of adjusting the scaling factors by gradient descent on a bound of the leave-one-out error (Weston et al., 2000). This method is used as a baseline method in the paper of Weston et al. (2003) in this issue.

4.4 Filters for Subset Selection

Several justifications for the use of filters for subset selection have been put forward in this special issue and elsewhere. It is argued that, compared to wrappers, filters are faster. Still, recently proposed efficient embedded methods are competitive in that respect. Another argument is that some filters (e.g. those based on mutual information criteria) provide a generic selection of variables, not tuned for/by a given learning machine. Another compelling justification is that filtering can be used as a preprocessing step to reduce space dimensionality and overcome overfitting.

In that respect, it seems reasonable to use a wrapper (or embedded method) with a linear predictor as a filter and then train a more complex non-linear predictor on the resulting variables. An example of this approach is found in the paper of Bi et al. (2003): a linear 1-norm SVM is used for variable selection, but a non-linear 1-norm SVM is used for prediction. The complexity of linear filters can be ramped up by adding to the selection process products of input variables (monomials of a polynomial) and retaining the variables that are part of any selected monomial. Another predictor, e.g., a neural network, is eventually substituted for the polynomial to perform predictions using the selected variables (Rivals and Personnaz, 2003, Stoppiglia et al., 2003). In some cases, however, one may on the contrary want to reduce the complexity of linear filters to overcome overfitting problems. When the number of examples is small compared to the number of variables (in the case of microarray data, for instance), one may need to resort to selecting variables with correlation coefficients (see Section 2.2).

Information theoretic filtering methods such as Markov blanket algorithms (Koller and Sahami, 1996) constitute another broad family.(11) The justification for classification problems is that the measure of mutual information does not rely on any prediction process, but provides a bound on the error rate using any prediction scheme for the given distribution. We do not have any illustration of such methods in this issue for the problem of variable subset selection. We refer the interested reader to Koller and Sahami (1996) and references therein. However, the use of mutual information criteria for individual variable ranking was covered in Section 2 and applications to feature construction and selection are illustrated in Section 5.

11. The Markov blanket of a given variable xi is a set of variables not including xi that render xi "unnecessary". Once a Markov blanket is found, xi can safely be eliminated. Furthermore, in a backward elimination procedure, it will remain unnecessary at later stages.

5 Feature Construction and Space Dimensionality Reduction

In some applications, reducing the dimensionality of the data by selecting a subset of the original variables may be advantageous for reasons including the expense of making, storing and processing measurements. If these considerations are not of concern, other means of space dimensionality reduction should also be considered.

The art of machine learning starts with the design of appropriate data representations. Better performance is often achieved using features derived from the original input. Building a feature representation is an opportunity to incorporate domain knowledge into the data and can be very application specific. Nonetheless, there are a number of generic feature construction methods, including: clustering; basic linear transforms of the input variables (PCA/SVD, LDA); more sophisticated linear transforms like spectral transforms (Fourier, Hadamard), wavelet transforms or convolutions of kernels; and applying simple functions to subsets of variables, like products to create monomials.

Two distinct goals may be pursued for feature construction: achieving the best reconstruction of the data, or being most efficient for making predictions. The first problem is an unsupervised learning problem. It is closely related to that of data compression, and a lot of algorithms are used across both fields. The second problem is supervised. Are there reasons to select features in an unsupervised manner when the problem is supervised? Yes, possibly several: some problems, e.g., in text processing applications, come with more unlabelled data than labelled data. Also, unsupervised feature selection is less prone to overfitting.

In this issue, four papers address the problem of feature construction. All of them take an information theoretic approach to the problem. Two of them illustrate the use of clustering to construct features (Bekkerman et al., 2003, Dhillon et al., 2003), one provides a new matrix factorization algorithm (Globerson and Tishby, 2003), and one provides a supervised means of learning features from a variety of models (Torkkola, 2003). In addition, two papers whose main focus is directed to variable selection also address the selection of monomials of a polynomial model and the hidden units of a neural network (Rivals and Personnaz, 2003, Stoppiglia et al., 2003), and one paper addresses the implicit feature selection in non-linear kernel methods for polynomial kernels (Weston et al., 2003).

5.1 Clustering

Clustering has long been used for feature construction. The idea is to replace a group of "similar" variables by a cluster centroid, which becomes a feature. The most popular algorithms include K-means and hierarchical clustering. For a review, see, e.g., the textbook of Duda et al. (2001).

Clustering is usually associated with the idea of unsupervised learning. It can be useful to introduce some supervision in the clustering procedure to obtain more discriminant features. This is the idea of distributional clustering (Pereira et al., 1993), which is developed in two papers of this issue. Distributional clustering is rooted in the information bottleneck (IB) theory of Tishby et al. (1999). If we call X̃ the random variable representing the constructed features, the IB method seeks to minimize the mutual information I(X; X̃), while preserving the mutual information I(X̃; Y). A global objective function is built by introducing a Lagrange multiplier β: J = I(X; X̃) − β I(X̃; Y). So, the method searches for the solution that achieves the largest possible compression, while retaining the essential information about the target.

Text processing applications are usual targets for such techniques. Patterns are full documents and variables come from a bag-of-words representation: each variable is associated to a word and is proportional to the fraction of documents in which that word appears. In application to feature construction, clustering methods group words, not documents. In text categorization tasks, the supervision comes from the knowledge of document categories. It is introduced by replacing variable vectors containing document frequency counts by shorter variable vectors containing document category frequency counts, i.e., the words are represented as distributions over document categories. The simplest implementation of this idea is presented in the paper of Dhillon et al. (2003) in this issue. It uses K-means clustering on variables represented by a vector of document category frequency counts. The (non-symmetric) similarity measure used is derived from the Kullback-Leibler divergence: K(xj, x̃i) = exp(−β Σk xkj ln(xkj / x̃ki)). In the sum, the index k runs over document categories. A more elaborate approach is taken by Bekkerman et al. (2003), who use a "soft" version of K-means (allowing words to belong to several clusters) and who progressively divide clusters by varying the Lagrange multiplier β, monitoring the tradeoff between I(X; X̃) and I(X̃; Y).
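A rough sketch of such a distributional K-means step, assuming NumPy: words are represented as distributions over document categories and each word is assigned to the centroid with the smallest KL divergence. This is a toy illustration of the idea, not the implementation of Dhillon et al. (2003).

```python
import numpy as np

def kl_kmeans(word_dists, n_clusters, n_iter=50, seed=0):
    """K-means on word/category distributions, using the (non-symmetric)
    KL divergence as the distortion measure."""
    rng = np.random.RandomState(seed)
    P = word_dists / word_dists.sum(axis=1, keepdims=True)
    centroids = P[rng.choice(len(P), n_clusters, replace=False)]
    eps = 1e-12  # avoid log(0)
    for _ in range(n_iter):
        # KL(p || c) for every word distribution p and centroid c
        kl = (P[:, None, :] * (np.log(P[:, None, :] + eps)
                               - np.log(centroids[None] + eps))).sum(-1)
        labels = kl.argmin(axis=1)
        for j in range(n_clusters):
            members = P[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```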
In this way, documents are represented as a distribution over word centroids. Both methods perform well. Bekkerman et al. mention that few words end up belonging to several clusters, hinting that "hard" cluster assignment may be sufficient.

5.2 Matrix Factorization

Another widely used method of feature construction is singular value decomposition (SVD). The goal of SVD is to form a set of features that are linear combinations of the original variables, which provide the best possible reconstruction of the original data in the least square sense (Duda et al., 2001). It is an unsupervised method of feature construction. In this issue, the paper of Globerson and Tishby (2003) presents an information theoretic unsupervised feature construction method: sufficient dimensionality reduction (SDR). The most informative features are extracted by solving an optimization problem that monitors the tradeoff between data reconstruction and data compression, similar to the information bottleneck of Tishby et al. (1999); the features are found as Lagrange multipliers of the objective optimized. Non-negative matrices P of dimension (m, n) representing the joint distribution of two random variables (for instance the co-occurrence of words in documents) are considered. The features are extracted by information theoretic I-projections, yielding a reconstructed matrix of special exponential form P̃ = (1/Z) exp(ΦΨ). For a set of d features, Φ is a (m, d+2) matrix whose (d+1)th column is ones, Ψ is a (d+2, n) matrix whose (d+2)th row is ones, and Z is a normalization coefficient. Similarly to SVD, the solution shows the symmetry of the problem with respect to patterns and variables.

5.3 Supervised Feature Selection

We review three approaches for selecting features in cases where features should be distinguished from variables because both appear simultaneously in the same system:

Nested subset methods. A number of learning machines extract features as part of the learning process. These include neural networks whose internal nodes are feature extractors. Thus, node pruning techniques such as OBD (LeCun et al., 1990) are feature selection algorithms. Gram-Schmidt orthogonalization is presented in this issue as an alternative to OBD (Stoppiglia et al., 2003).

Filters. Torkkola (2003) proposes a filter method for constructing features using a mutual information criterion. The author maximizes I(f, y) for m-dimensional feature vectors f and target vectors y.(12) Modelling the feature density function with Parzen windows allows him to compute derivatives ∂I/∂fi that are transform independent. Combining them with the transform-dependent derivatives ∂fi/∂w, he devises a gradient descent algorithm to optimize the parameters w of the transform (that need not be linear):

w(t+1) = w(t) + η ∂I/∂w = w(t) + η (∂I/∂fi)(∂fi/∂w).   (5)

12. In fact, the author uses a quadratic measure of divergence instead of the usual mutual information.

Direct objective optimization. Kernel methods possess an implicit feature space revealed by the kernel expansion: k(x, x′) = φ(x)·φ(x′), where φ(x) is a feature vector of possibly infinite dimension. Selecting these implicit features may improve generalization, but does not change the running time or help interpreting the prediction function. In this issue, Weston et al. (2003) propose a method for selecting implicit kernel features in the case of the polynomial kernel, using their framework of minimization of the 0-norm.

6 Validation Methods

We group in this section all the issues related to out-of-sample performance prediction (generalization prediction) and model selection. These are involved in various aspects of variable and feature selection: to determine the number of variables that are "significant", to guide and halt the search for good variable subsets, to choose hyperparameters, and to evaluate the final performance of the system.

One should first distinguish the problem of model selection from that of evaluating the final performance of the predictor. For that last purpose, it is important to set aside an independent test set. The remaining data is used both for training and performing model selection. Additional experimental sophistication can be added by repeating the entire experiment for several drawings of the test set.(13)

13. In the limit, the test set can have only one example and leave-one-out can be carried out as an "outer loop", outside the feature/variable selection process, to estimate the final performance of the predictor. This computationally expensive procedure is used in cases where data is extremely scarce.

To perform model selection (including variable/feature selection and hyperparameter optimization), the data not used for testing may be further split between fixed training and validation sets, or various methods of cross-validation can be used. The problem is then brought back to that of estimating the significance of differences in validation errors. For a fixed validation set, statistical tests can be used, but their validity is doubtful for cross-validation because independence assumptions are violated. For a discussion of these issues, see for instance the work of Dietterich (1998) and Nadeau and Bengio (2001). If there are sufficiently many examples, it may not be necessary to split the training data: comparisons of training errors with statistical tests can be used (see Rivals and Personnaz, 2003, in this issue). Cross-validation can be extended to time-series data and, while i.i.d. assumptions do not hold anymore, it is still possible to estimate generalization error confidence intervals (see Bengio and Chapados, 2003, in this issue).

Choosing what fraction of the data should be used for training and for validation is an open problem. Many authors resort to using the leave-one-out cross-validation procedure, even though it is known to be a high variance estimator of generalization error (Vapnik, 1982) and to give overly optimistic results, particularly when data are not properly independently and identically sampled from the "true" distribution. The leave-one-out procedure consists of removing one example from the training set, constructing the predictor on the basis only of the remaining training data, then testing on the removed example. In this fashion one tests all examples of the training data and averages the results. As previously mentioned, there exist exact or approximate formulas of the leave-one-out error for a number of learning machines (Monari and Dreyfus, 2000, Rivals and Personnaz, 2003, Rakotomamonjy, 2003). Leave-one-out formulas can be viewed as corrected values of the training error. Many other types of penalization of the training error have been proposed in the literature (see, e.g., Vapnik, 1998, Hastie et al., 2001). Recently, a new family of such methods called "metric-based methods" has been proposed (Schuurmans, 1997).
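The leave-one-out procedure described above is straightforward to write down explicitly. Below is a brute-force sketch assuming NumPy, with a hypothetical regularized least-squares learner plugged in; the analytical formulas cited above exist precisely to avoid this m-fold retraining.

```python
import numpy as np

def leave_one_out_error(X, y, fit, predict):
    """Hold out each example in turn, train on the rest,
    test on the held-out point, and average the error."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        model = fit(X[mask], y[mask])
        errors += predict(model, X[i:i + 1])[0] != y[i]
    return errors / len(X)

def fit_rls(X, y, lam=1e-3):
    # Hypothetical plug-in learner: regularized least squares on +/-1 labels.
    Xb = np.hstack([X, np.ones((len(X), 1))])  # add bias column
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)

def predict_rls(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)
```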
The paper of Bengio and Chapados (2003) in this issue illustrates their application to variable selection. The authors make use of unlabelled data, which are readily available in the application considered, time series prediction with a horizon. Consider two models A and B trained with nested subsets of variables, A ⊂ B. We call d(A, B) the discrepancy of the two models. The criterion involves the ratio dU(A, B)/dT(A, B), where dU(A, B) is computed with unlabelled data and dT(A, B) is computed with training data. A ratio significantly larger than one sheds doubt on the usefulness of the variables in subset B that are not in A.

For variable ranking or nested subset ranking methods (Sections 2 and 4.2), another statistical approach can be taken. The idea is to introduce a probe in the data that is a random variable. Roughly speaking, variables that have a relevance smaller or equal to that of the probe should be discarded. Bi et al. (2003) consider a very simple implementation of that idea: they introduce in their data three additional "fake variables", drawn randomly from a Gaussian distribution, and submit them to their variable selection process with the other "true variables". Subsequently, they discard all the variables that are less relevant than one of the three fake variables (according to their weight magnitude criterion). Stoppiglia et al. (2003) propose a more sophisticated method for the Gram-Schmidt forward selection method. For a Gaussian distributed probe, they provide an analytical formula to compute the rank of the probe associated with a given risk of accepting an irrelevant variable. A non-parametric variant of the probe method consists in creating "fake variables" by randomly shuffling real variable vectors. In a forward selection process, the introduction of fake variables does not disturb the selection because fake variables can be discarded when they are encountered. At a given step in the forward selection process, let us call t the fraction of true variables selected so far (among all true variables) and f the fraction of fake variables encountered (among all fake variables). As a halting criterion, one can place a threshold on the ratio f/t, which is an upper bound on the fraction of falsely relevant variables in the subset selected so far. The latter method has been used for variable ranking (Tusher et al., 2001). Its parametric version for Gaussian distributions using the T statistic as ranking criterion is nothing but the T-test.

7 Advanced Topics and Open Problems

7.1 Variance of Variable Subset Selection

Many methods of variable subset selection are sensitive to small perturbations of the experimental conditions. If the data has redundant variables, different subsets of variables with identical predictive power may be obtained according to initial conditions of the algorithm, removal or addition of a few variables or training examples, or addition of noise. For some applications, one might want to purposely generate alternative subsets that can be presented to a subsequent stage of processing. Still, one might find this variance undesirable because (i) variance is often the symptom of a "bad" model that does not generalize well; (ii) results are not reproducible; and (iii) one subset fails to capture the "whole picture".

One method to "stabilize" variable selection explored in this issue is to use several "bootstraps" (Bi et al., 2003). The variable selection process is repeated with sub-samples of the training data. The union of the subsets of variables selected in the various bootstraps is taken as the final "stable" subset. This joint subset may be at least as predictive as the best bootstrap subset. Analyzing the behavior of the variables across the various bootstraps also provides further insight, as described in the paper. In particular, an index of relevance of individual variables can be created considering how frequently they appear in the bootstraps.
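The non-parametric probe idea can be sketched as follows, assuming NumPy: shuffled copies of real variables serve as fake probes, and real variables scoring below the best probe are discarded. The use of |correlation with y| as the relevance score here is an illustrative assumption (Bi et al. use their weight magnitude criterion).

```python
import numpy as np

def probe_ranking_selection(X, y, n_fake=3, seed=0):
    """Random-probe sketch: append shuffled copies of real variables as
    probes, score everything, keep real variables beating the best probe."""
    rng = np.random.RandomState(seed)
    fakes = X[:, rng.choice(X.shape[1], n_fake)].copy()
    for j in range(n_fake):
        rng.shuffle(fakes[:, j])  # destroy any relation to y
    Z = np.hstack([X, fakes])
    scores = np.abs([np.corrcoef(Z[:, j], y)[0, 1] for j in range(Z.shape[1])])
    best_fake = scores[X.shape[1]:].max()
    return np.where(scores[:X.shape[1]] > best_fake)[0]
```

With only three probes the estimated null level is crude; using more probes (or the analytical rank formula of Stoppiglia et al.) tightens the threshold.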
Related ideas have been described elsewhere in the context of Bayesian variable selection (Jebara and Jaakkola, 2000, Ng and Jordan, 2001, Vehtari and Lampinen, 2002). A distribution over a population of models using various variable subsets is estimated. Variables are then ranked according to the marginal distribution, reflecting how often they appear in important subsets (i.e., associated with the most probable models).

7.2 Variable Ranking in the Context of Others

In Section 2, we limited ourselves to presenting variable ranking methods using a criterion computed from single variables, ignoring the context of others. In Section 4.2, we introduced nested subset methods that provide a useful ranking of subsets, not of individual variables: some variables may have a low rank because they are redundant and yet be highly relevant. Bootstrap and Bayesian methods presented in Section 7.1 may be instrumental in producing a good variable ranking incorporating the context of others.

The Relief algorithm uses another approach based on the nearest-neighbor algorithm (Kira and Rendell, 1992). For each example, the closest example of the same class (nearest hit) and the closest example of a different class (nearest miss) are selected. The score S(i) of the i-th variable is computed as the average, over all examples, of the magnitude of the difference between the distance to the nearest hit and the distance to the nearest miss, in projection on the i-th variable.

7.3 Unsupervised Variable Selection

Sometimes, no target y is provided, but one still would want to select a set of most significant variables with respect to a defined criterion. Obviously, there are as many criteria as problems can be stated. Still, a number of variable ranking criteria are useful across applications, including saliency, entropy, smoothness, density and reliability. A variable is salient if it has a high variance or a large range, compared to others. A variable has a high entropy if the distribution of examples is uniform. In a time series, a variable is smooth if on average its local curvature is moderate. A variable is in a high-density region if it is highly correlated with many other variables. A variable is reliable if the measurement error bars computed by repeating measurements are small compared to the variability of the variable values (as quantified, e.g., by an ANOVA statistic).

Several authors have also attempted to perform variable or feature selection for clustering applications (see, e.g., Xing and Karp, 2001, Ben-Hur and Guyon, 2003, and references therein).

7.4 Forward vs. Backward Selection

It is often argued that forward selection is computationally more efficient than backward elimination to generate nested subsets of variables. However, the defenders of backward elimination argue that weaker subsets are found by forward selection because the importance of variables is not assessed in the context of other variables not included yet. We illustrate this latter argument by the example of Figure 4. In that example, one variable separates the two classes better by itself than either of the two other ones taken alone and will therefore be selected first by forward selection. At the next step, when it is complemented by either of the two other variables, the resulting class separation in two dimensions will not be as good as the one obtained jointly by the two variables that were discarded at the first step.
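The Relief score described in Section 7.2 above can be sketched as follows. This is a minimal version assuming NumPy, using L1 distances and a single nearest hit and miss, without the sampling and normalization refinements of the full algorithm of Kira and Rendell.

```python
import numpy as np

def relief_scores(X, y):
    """For each example, find its nearest hit (same class) and nearest
    miss (other class); score each variable by the average
    |diff to miss| - |diff to hit| in projection on that variable."""
    n, d = X.shape
    scores = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)  # L1 distances to example i
        dist[i] = np.inf                     # exclude the example itself
        same, other = (y == y[i]), (y != y[i])
        hit = np.where(same, dist, np.inf).argmin()
        miss = np.where(other, dist, np.inf).argmin()
        scores += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return scores / n
```

A large positive score means the variable keeps same-class neighbors close and other-class neighbors far, i.e., it is relevant in the context of the full input space.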
A backward selection method may outsmart forward selection by eliminating at the first step the variable that by itself provides the best separation, to retain the two variables that together perform best. Still, if for some reason we need to get down to a single variable, backward elimination will have gotten rid of the variable that works best on its own.

Figure 4: Forward or backward selection? Of the three variables of this example, the third one separates the two classes best by itself (bottom right histogram). It is therefore the best candidate in a forward selection process. Still, the two other variables are better taken together than any subset of two including it. A backward selection method may perform better in this case.

7.5 The Multi-class Problem

Some variable selection methods treat the multi-class case directly rather than decomposing it into several two-class problems: all the methods based on mutual information criteria extend naturally to the multi-class case (see in this issue Bekkerman et al., 2003, Dhillon et al., 2003, Torkkola, 2003). Multi-class variable ranking criteria include Fisher's criterion (the ratio of the between-class variance to the within-class variance). It is closely related to the F statistic used in the ANOVA test, which is one way of implementing the probe method (Section 6) for the multi-class case. Wrappers or embedded methods depend upon the capability of the classifier used to handle the multi-class case. Examples of such classifiers include linear discriminant analysis (LDA), a multi-class version of Fisher's linear discriminant (Duda et al., 2001), and multi-class SVMs (see, e.g., Weston et al., 2003).

One may wonder whether it is advantageous to use multi-class methods for variable selection. On one hand, contrary to what is generally admitted for classification, the multi-class setting is in some sense easier for variable selection than the two-class case.
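Fisher's criterion mentioned above can be computed per variable as follows (a sketch assuming NumPy; up to degree-of-freedom factors this is the F statistic of a one-way ANOVA):

```python
import numpy as np

def fisher_criterion(X, y):
    """Per-variable ratio of between-class variance to within-class
    variance; larger values indicate more discriminant variables."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / within
```

Because it handles any number of classes directly, this criterion needs no decomposition into two-class problems.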
This is because the larger the number of classes, the less likely a "random" set of features provides a good separation. To illustrate this point, consider a simple example where all features are drawn independently from the same distribution P and the first of them is the target y. Assume that all these features correspond to rolling a die with Q faces n times (n is the number of samples). The probability that one fixed feature (except the first one) is exactly y is then (1/Q)^n. Therefore, finding the feature that corresponds to the target y when it is embedded in a sea of noisy features is easier when Q is large. On the other hand, Forman (2003) points out in this issue that in the case of uneven distributions across classes, multi-class methods may over-represent abundant or easily separable classes. A possible alternative is to mix ranked lists of several two-class problems. Weston et al. (2003) propose one such mixing strategy.

7.6 Selection of Examples

The dual problems of feature selection/construction are those of pattern selection/construction. The symmetry of the two problems is made explicit in the paper of Globerson and Tishby (2003) in this issue. Likewise, both Stoppiglia et al. (2003) and Weston et al. (2003) point out that their algorithms also apply to the selection of examples in kernel methods. Others have already pointed out the similarity and complementarity of the two problems (Blum and Langley, 1997). In particular, mislabeled examples may induce the choice of wrong variables. Conversely, if the labeling is highly reliable, selecting wrong variables associated with a confounding factor may be avoided by focusing on informative patterns that are close to the decision boundary (Guyon et al., 2002).

7.7 Inverse Problems

Most of the special issue concentrates on the problem of finding a (small) subset of variables useful to build a good predictor. In some applications, particularly in bioinformatics, this is not necessarily the only goal of variable selection. In diagnosis problems, for instance, it is important to identify the factors that triggered a particular disease or to unravel the chain of events from the causes to the symptoms. But reverse engineering the system that produced the data is a more challenging task than building a predictor.
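The die example above is easy to check numerically: with d independent noisy features, the chance that at least one reproduces the target column exactly is 1 − (1 − Q^(−n))^d, which drops sharply as Q grows. The function name below is illustrative.

```python
def prob_false_match(Q, n, d):
    """Probability that at least one of d independent Q-ary noise
    features matches the target exactly over n samples."""
    return 1 - (1 - Q ** (-n)) ** d

p_two_classes = prob_false_match(Q=2, n=10, d=1000)   # two classes
p_six_classes = prob_false_match(Q=6, n=10, d=1000)   # six classes
```

Even with a thousand noise features and only ten samples, an accidental exact match is far less likely for six classes than for two.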
The readers interested in these issues can consult the literature on gene networks in the conference proceedings of the Pacific Symposium on Biocomputing (PSB) or the Intelligent Systems for Molecular Biology (ISMB) conference, and the causality inference literature (see, e.g., Pearl, 2000). At the heart of this problem is the distinction between correlation and causality. Observational data, such as the data available to machine learning researchers, allow us only to observe correlations. For example, observations can be made about correlations between expression profiles of given genes or between profiles and symptoms, but a leap of faith is made when deciding which gene activated which other one and in turn triggered the symptom.

In this issue, the paper of Caruana and de Sa (2003) presents interesting ideas about using variables discarded by variable selection as additional outputs of a neural network. They show improved performance on synthetic and real data. Their analysis supports the idea that some variables are more efficiently used as outputs than as inputs. This could be a step toward distinguishing causes from consequences.

Dataset              Description            patterns    variables    classes   References
Linear (a,b)         Artificial linear      10-1200     100-240      reg, 2    S, W, Be
Multi-cluster (c)    Artificial non-linear  1000-1300   100-500      2         P, S
QSAR (d)             Chemistry              30-300      500-700      reg       Bt
UCI (e)              ML repository          8-60        500-16000    2-30      Re, Bn, To, P, C
LVQ-PAK (f)          Phoneme data           1900        20           20        T
Raetsch bench. (g)   UCI/Delve/Statlog      200-7000    8-20         2         Ra
Microarray (a)       Cancer classif.        6-100       2000-4000    2         W, Ra
Microarray (a)       Gene classification    200         80           5         W
Aston Univ. (h)      Pipeline transport     1000        12           3         T
NIPS 2000 (i)        Unlabeled data         200-400     5-800        reg       Ri
20 Newsgroups (j,o)  News postings          20000       300-15000    2-20      G, Bk, D
Text filtering (k)   TREC/OHSUMED           200-2500    3000-30000   6-17      F
IR datasets (l)      MED/CRAN/CISI          1000        5000         30-225    G
Reuters-21578 (m,o)  Newswire docs.         21578       300-15000    114       Bk, F
Open Dir. Proj. (n)  Web directory          5000        14500        50        D

Table 1: Publicly available datasets used in the special issue. Approximate numbers or ranges of patterns, variables, and classes effectively used are provided. The "classes" column indicates "reg" for regression problems, or the number of queries for Information Retrieval (IR) problems. For artificial datasets, the fraction of variables that are relevant ranges from 2 to 10. The initials of the first author are provided as reference: Bk = Bekkerman, Bn = Bengio, Bt = Bennett, C = Caruana
, D = Dhillon, F = Forman, G = Globerson, P = Perkins, Re = Reunanen, Ra = Rakotomamonjy, Ri = Rivals, S = Stoppiglia, T = Torkkola, W = Weston. Please also check the JMLR web site for later additions and preprocessed data.

a: http://www.kyb.tuebingen.mpg.de/bs/people/weston/l0 (l0: lowercase L and zero, not 10)
b: http://www.clopinet.com/isabelle/Projects/NIPS2001/Artificial.zip
c: http://nis-www.lanl.gov/~simes/data/jmlr02/
d: http://www.rpi.edu/~bij2/featsele.html
e: http://www.ics.uci.edu/~mlearn/MLRepository.html
f: http://www.cis.hut.fi/research/software.shtml
g: http://ida.first.gmd.de/~raetsch/data/benchmarks.htm
h: http://www.ncrg.aston.ac.uk/GTM/3PhaseData.html
i: http://q.cis.uoguelph.ca/~skremer/NIPS2000/
j: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
k: http://trec.nist.gov/data.html (Filtering Track Collection)
l: http://www.cs.utk.edu/~lsi/
m: http://www.daviddlewis.com/resources/testcollections/reuters21578/
n: http://dmoz.org/ and http://www.cs.utexas.edu/users/manyam/dmoz.txt
o: http://www.cs.technion.ac.il/~ronb/thesis.html

8 Conclusion

The recent developments in variable and feature selection have addressed the problem from the pragmatic point of view of improving the performance of predictors. They have met the challenge of operating on input spaces of several thousand variables. Sophisticated wrapper or embedded methods improve predictor performance compared to simpler variable ranking methods like correlation methods, but the improvements are not always significant: domains with large numbers of input variables suffer from the curse of dimensionality and multivariate methods may overfit the data. For some domains, applying first a method of automatic feature construction yields improved performance and a more compact set of features. The methods proposed in this special issue have been tested on a wide variety of datasets (see Table 1), which limits the possibility of making comparisons across papers. Further work includes the organization of a benchmark. The approaches are very diverse and motivated by various theoretical arguments, but a unifying theoretical framework is lacking. Because of these shortcomings, it is important when starting with a new problem to have a few baseline performance values. To that end, we recommend using a linear predictor of your choice (e.g. a linear SVM) and selecting variables in two alternate ways: (1) with a variable ranking method using a correlation coefficient or mutual information; (2) with a nested subset selection method performing forward or backward selection or with multiplicative updates. Further down the road, connections need to be made between the problems of variable and feature selection and those of experimental design and active learning, in an effort to move away from observational data toward experimental data, and to address problems of causality inference.

References

E. Amaldi and V. Kann. On the approximation of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209:237–260, 1998.

R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. Distributional word clusters vs. words for text categorization. JMLR, 3:1183–1208 (this issue), 2003.

A. Ben-Hur and I. Guyon. Detecting stable clusters using principal component analysis. In M. J. Brownstein and A. Kohodursky, editors, Methods in Molecular Biology, pages 159–182. Humana Press, 2003.

Y. Bengio and N. Chapados. Extensions to metric-based model selection. JMLR, 3:1209–1227 (this issue), 2003.

J. Bi, K. Bennett, M. Embrechts, C. Breneman, and M. Song. Dimensionality reduction via sparse support vector machines. JMLR, 3:1229–1243 (this issue), 2003.

A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, December 1997.

B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, 1992. ACM.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.
R. Caruana and V. de Sa. Benefitting from the variables that variable selection discards. JMLR, 3:1245–1264 (this issue), 2003.

I. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic feature clustering algorithm for text classification. JMLR, 3:1265–1287 (this issue), 2003.

T. G. Dietterich. Approximate statistical test for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1924, 1998.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, USA, 2nd edition, 2001.

T. R. Golub et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

G. Forman. An extensive empirical study of feature selection metrics for text classification. JMLR, 3:1289–1306 (this issue), 2003.

T. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16:906–914, 2000.

A. Globerson and N. Tishby. Sufficient dimensionality reduction. JMLR, 3:1307–1331 (this issue), 2003.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. In NIPS 15, 2002.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer series in statistics. Springer, New York, 2001.

T. Jebara and T. Jaakkola. Feature selection and dualities in maximum entropy discrimination. In 16th Annual Conference on Uncertainty in Artificial Intelligence, 2000.

K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, International Conference on Machine Learning, pages 368–377, Aberdeen, July 1992. Morgan Kaufmann.

R. Kohavi and G. John. Wrappers for feature selection. Artificial Intelligence, 97(1-2):273–324, December 1997.

D. Koller and M. Sahami. Toward optimal feature selection. In 13th International Conference on Machine Learning, pages 284–292, July 1996.

Y. LeCun, J. Denker, S. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems II, San Mateo, CA, 1990. Morgan Kaufmann.

G. Monari and G. Dreyfus. Withdrawing an example from the training set: an analytic estimation of its effect on a non-linear parameterized model. Neurocomputing Letters, 35:195–201, 2000.

C. Nadeau and Y. Bengio. Inference for the generalization error. Machine Learning (to appear), 2001.

A. Y. Ng. On feature selection: learning with exponentially many irrelevant features as training examples. In 15th International Conference on Machine Learning, pages 404–412. Morgan Kaufmann, San Francisco, CA, 1998.

A. Y. Ng and M. Jordan. Convergence rates of the voting Gibbs classifier, with application to Bayesian feature selection. In 18th International Conference on Machine Learning, 2001.

J. Pearl. Causality. Cambridge University Press, 2000.

F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In Proc. Meeting of the Association for Computational Linguistics, pages 183–190, 1993.

S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. JMLR, 3:1333–1356 (this issue), 2003.

A. Rakotomamonjy. Variable selection using SVM-based criteria. JMLR, 3:1357–1370 (this issue), 2003.

J. Reunanen. Overfitting in making comparisons between variable selection methods. JMLR, 3:1371–1382 (this issue), 2003.

I. Rivals and L. Personnaz. MLPs (mono-layer polynomials and multi-layer perceptrons) for non-linear modeling. JMLR, 3:1383–1398 (this issue), 2003.

B. Schoelkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

D. Schuurmans. A new metric-based approach to model selection. In 9th Innovative Applications of Artificial Intelligence Conference, pages 552–558, 1997.

H. Stoppiglia, G. Dreyfus, R. Dubois, and Y. Oussar. Ranking a random feature for variable and feature selection. JMLR, 3:1399–1414 (this issue), 2003.

R. Tibshirani. Regression selection and shrinkage via the lasso. Technical report, Stanford University, Palo Alto, CA, June 1994.

N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.

K. Torkkola. Feature extraction by non-parametric mutual information maximization.
ation.JMLR,3:1415–1438(thisissue),2003.V.G.Tusher,R.Tibshirani,andG.Chu.Signicanceanalysisofmicroarraysappliedtotheionizingradiationresponse.PNAS,98:5116–5121,April2001.V.Vapnik.Estimationofdependenciesbasedonempiricaldata.Springerseriesinstatistics.Springer,1982.1181 GUYONANDELISSEEFFV.Vapnik.StatisticalLearningTheory.JohnWiley&Sons,N.Y.,1998.A.VehtariandJ.Lampinen.Bayesianinputvariableselectionusingposteriorprobabilitiesandexpectedutilities.ReportB31,2002.J.Weston,A.Elisseff,B.Schoelkopf,andM.Tipping.Useofthezeronormwithlinearmodelsandkernelmethods.JMLR,3:1439–1461(thisissue),2003.J.Weston,S.Mukherjee,O.Chapelle,M.Pontil,T.Poggio,andV.Vapnik.FeatureselectionforSVMs.InNIPS13,2000.E.P.XingandR.M.Karp.Cliff:Clusteringofhigh-dimensionalmicroarraydataviaiterativefea-turelteringusingnormalizedcuts.In9thInternationalConferenceonIntelligenceSystemsforMolecularBiology,2001.1182