The Optimality of Naive Bayes

Harry Zhang
Faculty of Computer Science, University of New Brunswick
Fredericton, New Brunswick, Canada E3B 5A3
email: hzhang@unb.ca

Abstract

Naive Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining. Its competitive performance in classification is surprising, because the conditional independence assumption on which it is based is rarely true in real-world applications. An open question is: what is the true reason for the surprisingly good performance of naive Bayes in classification? In this paper, we propose a novel explanation of the superb classification performance of naive Bayes. We show that, essentially, the dependence distribution, i.e., how the local dependence of a node distributes in each class, evenly or unevenly, and how the local dependencies of all nodes work together, consistently or inconsistently, plays a crucial role in the classification.

For an example $E = (x_1, \ldots, x_n)$, where $x_i$ is the value of attribute $X_i$, and a class variable $C$ taking the two values $+$ and $-$, the classifier

  $f_b(E) = \frac{p(C=+|E)}{p(C=-|E)}$   (1)

is called a Bayesian classifier. Assume that all attributes are independent given the value of the class variable; that is,

  $p(E|c) = p(x_1, \ldots, x_n|c) = \prod_{i=1}^{n} p(x_i|c)$.

The resulting classifier

  $f_{nb}(E) = \frac{p(C=+)}{p(C=-)} \prod_{i=1}^{n} \frac{p(x_i|C=+)}{p(x_i|C=-)}$   (2)

is called a naive Bayesian classifier, or simply naive Bayes.

A natural way to relax the conditional independence assumption is to extend the structure of naive Bayes to represent explicitly the dependencies among attributes. An augmented naive Bayesian network, or simply augmented naive Bayes (ANB), is an extended naive Bayes in which the class node directly points to all attribute nodes, and there exist links among the attribute nodes. Figure 2 shows an example of an ANB. From the view of probability, an ANB represents the joint probability distribution

  $p(x_1, \ldots, x_n, c) = p(c) \prod_{i=1}^{n} p(x_i|pa(x_i), c)$,   (3)

where $pa(x_i)$ denotes an assignment of values to the parents of $x_i$ (we use $pa(X_i)$ to denote the parents of $X_i$). An ANB is a special form of Bayesian network. It has been shown that any Bayesian network can be represented by an ANB (Zhang & Ling 2001); therefore, any joint probability distribution can be represented by an ANB.

[Figure 2: An example of ANB.]

When we apply a logarithm to $f_b(E)$ in Equation 1, the resulting classifier $\log f_b(E)$ is the same as $f_b(E)$, in the sense that an example $E$ belongs to the positive class if and only if $\log f_b(E) \ge 0$; the situation for $f_{nb}(E)$ in Equation 2 is similar. In this paper, we assume that, given a classifier $f$, an example $E$ belongs to the positive class if and only if $f(E) \ge 1$.

Related Work

Many empirical comparisons between naive Bayes and modern decision tree algorithms such as C4.5 (Quinlan 1993) showed that naive Bayes predicts equally well as C4.5 (Langley, Iba, & Thomas 1992; Kononenko 1990; Pazzani 1996). The good performance of naive Bayes is surprising because it makes an assumption that is almost always violated in real-world applications: given the class value, all attributes are independent.

An open question is: what is the true reason for the surprisingly good performance of naive Bayes on most classification tasks? Intuitively, since the conditional independence assumption on which it is based almost never holds, its performance might be expected to be poor. It has been observed, however, that its classification accuracy does not depend on the dependencies; i.e., naive Bayes may still have high accuracy on data sets in which strong dependencies exist among attributes (Domingos & Pazzani 1997).

Domingos and Pazzani (1997) present an explanation that naive Bayes owes its good performance to the zero-one loss function. This function defines the error as the number of incorrect classifications (Friedman 1996). Unlike other loss functions, such as the squared error, the zero-one loss function does not penalize inaccurate probability estimation as long as the maximum probability is assigned to the correct class. This means that naive Bayes may change the posterior probabilities of each class, but the class with the maximum posterior probability is often unchanged. Thus, the classification is still correct, although the probability estimation is poor. For example, suppose that the true posterior probabilities $p(+|E)$ and $p(-|E)$ strongly favor the positive class, while the estimates produced by naive Bayes are much closer to one half, though still larger for the positive class. Obviously, the probability estimates are poor, but the classification (positive) is not affected.

Domingos and Pazzani's explanation is verified by the work of Frank et al. (2000), which shows that the performance of naive Bayes is much worse when it is used for regression (predicting a continuous value). Moreover, evidence has been found that naive Bayes produces poor probability estimates (Bennett 2000; Monti & Cooper 1999).
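To make the zero-one loss argument concrete, the following small Python sketch (constructed for this text, not an example from Domingos and Pazzani; all probabilities are made up) uses the strongest possible dependence, an attribute that is an exact copy of another. Naive Bayes double-counts the shared evidence, so its posterior estimates are poorly calibrated, yet the class with maximum posterior, the only thing zero-one loss sees, never changes:

```python
# x2 is an exact copy of x1, so naive Bayes double-counts the shared
# evidence; its posterior estimates are poor, but the argmax class is
# unchanged on every example. All probabilities below are made up.
P_POS = 0.5                  # prior p(C=+)
P1 = {"+": 0.8, "-": 0.3}    # p(x1=1 | c)

def true_posterior(x1):
    """Exact p(+|E) for E = (x1, x2) with x2 == x1 by construction."""
    lik_pos = P1["+"] if x1 == 1 else 1.0 - P1["+"]
    lik_neg = P1["-"] if x1 == 1 else 1.0 - P1["-"]
    return P_POS * lik_pos / (P_POS * lik_pos + (1.0 - P_POS) * lik_neg)

def nb_posterior(x1, x2):
    """Naive Bayes estimate of p(+|E): multiplies per-attribute likelihoods."""
    lik = {c: (P1[c] if x1 == 1 else 1.0 - P1[c])
              * (P1[c] if x2 == 1 else 1.0 - P1[c]) for c in ("+", "-")}
    return P_POS * lik["+"] / (P_POS * lik["+"] + (1.0 - P_POS) * lik["-"])

for x1 in (0, 1):
    t, e = true_posterior(x1), nb_posterior(x1, x1)
    print(f"E=({x1},{x1}): true p(+|E)={t:.3f}  NB estimate={e:.3f}  "
          f"same class: {(t >= 0.5) == (e >= 0.5)}")
```

With these numbers, the estimates drift substantially (e.g., 0.727 becomes 0.877 for $E=(1,1)$), but both classifiers assign the same class to every example.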
In our opinion, however, Domingos and Pazzani's (1997) explanation is still superficial, as it does not uncover why the strong dependencies among attributes cannot flip the classification. For the example above, why can the dependencies not drive the probability produced by naive Bayes below one half? The key point is that we need to know how the dependencies affect the classification, and under what conditions they do not affect it. There has been some work exploring the optimality of naive Bayes (Rachlin, Kasif, & Aha 1994; Garg & Roth 2001; Roth 1999; Hand & Yu 2001), but none of it gives an explicit condition for the optimality of naive Bayes.

In this paper, we propose a new explanation: the classification of naive Bayes is essentially affected by the dependence distribution, not by the dependencies among attributes themselves. In addition, we present a sufficient condition for the optimality of naive Bayes under the Gaussian distribution, and show theoretically when naive Bayes works well.

A New Explanation of the Superb Classification Performance of Naive Bayes

In this section, we propose a new explanation for the surprisingly good classification performance of naive Bayes. The basic idea comes from the following observation. In a given data set, two attributes may depend on each other, but the dependence may distribute evenly in each class. Clearly, in this case, the conditional independence assumption is violated, but naive Bayes is still the optimal classifier. Further, what eventually affects the classification is the combination of the dependencies among all attributes. If we just look at two attributes, there may exist strong dependence between them that affects the classification; when the dependencies among all attributes work together, however, they may cancel each other out and no longer affect the classification.

To make this precise, define for each node $x_i$ of an ANB $G$ the local dependence derivatives in the two classes, $dd^+_G(x_i|pa(x_i)) = p(x_i|pa(x_i),+)/p(x_i|+)$ and $dd^-_G(x_i|pa(x_i)) = p(x_i|pa(x_i),-)/p(x_i|-)$, and the local dependence derivative ratio $ddr_G(x_i|pa(x_i)) = dd^+_G(x_i|pa(x_i))/dd^-_G(x_i|pa(x_i))$. Given an example $E$, two classifiers $f_1$ and $f_2$ are said to be equal under zero-one loss on $E$, written $f_1(E) \doteq f_2(E)$, if $f_1(E) \ge 1$ exactly when $f_2(E) \ge 1$ (Definition 1). Theorem 1 then relates an ANB to its correspondent naive Bayes:

  $f_b(E) = f_{nb}(E) \prod_{i=1}^{n} ddr_G(x_i|pa(x_i))$.

From Theorem 1, we know that it is in fact the dependence distribution factor $DF_G(E) = \prod_{i=1}^{n} ddr_G(x_i|pa(x_i))$ that determines the difference between an ANB and its correspondent naive Bayes in classification. Further, $DF_G(E)$ is the product of the local dependence derivative ratios of all nodes; therefore, it reflects the global dependence distribution (how each local dependence distributes in each class, and how all local dependencies work together). For example, when $DF_G(E) = 1$, $f_b(E)$ has the same classification as $f_{nb}(E)$. In fact, it is not necessary to require $DF_G(E) = 1$ in order for an ANB to have the same classification as its correspondent naive Bayes, as shown in the theorem below.

Theorem 2. Given an example $E = (x_1, \ldots, x_n)$, an ANB $G$ is equal to its correspondent naive Bayes under zero-one loss, i.e., $f_b(E) \doteq f_{nb}(E)$ (Definition 1), if and only if $DF_G(E) \le f_b(E)$ when $f_b(E) \ge 1$, or $DF_G(E) > f_b(E)$ when $f_b(E) < 1$.

Proof: The proof is straightforward by applying Definition 1 and Theorem 1.

From Theorem 2, if the distribution of the dependences among attributes satisfies certain conditions, then naive Bayes classifies exactly the same as the underlying ANB, even though there may exist strong dependencies among attributes. Moreover, we have the following results:

1. When $DF_G(E) = 1$, the dependencies in the ANB have no influence on the classification; that is, the classification is exactly the same as that of its correspondent naive Bayes. There exist three cases for $DF_G(E) = 1$: (a) no dependence exists among the attributes; (b) $ddr_G(x_i|pa(x_i)) = 1$ for each attribute $x_i$, that is, the local dependence of each node distributes evenly in both classes; or (c) the influence by which some local dependencies support classifying $E$ into one class is canceled out by the influence by which other local dependencies support classifying $E$ into the other class.

2. Classifying $E$ exactly as its correspondent naive Bayes does not require that $DF_G(E) = 1$; the precise condition is given by Theorem 2. That explains why naive Bayes still produces accurate classification even on data sets with strong dependencies among attributes (Domingos & Pazzani 1997).

3. The dependencies in an ANB flip (change) the classification of its correspondent naive Bayes only if the condition given by Theorem 2 is no longer true.

Theorem 2 gives a sufficient and necessary condition for the optimality of naive Bayes on a single example. If $f_b(E) \doteq f_{nb}(E)$ for each example $E$ in the example space, then naive Bayes is globally optimal.
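Theorem 1 is easy to check numerically. The following Python sketch (a toy construction written for this text, not code from the paper; the distribution numbers are made up) builds a two-attribute ANB with the extra link x1 -> x2 and verifies that $f_b(E) = f_{nb}(E) \cdot DF_G(E)$ on every example:

```python
# ANB with class C and attributes x1, x2, plus the dependence link x1 -> x2.
# All distribution numbers below are arbitrary.
P_C = {"+": 0.5, "-": 0.5}                                   # p(c)
P_X1 = {"+": {1: 0.7, 0: 0.3}, "-": {1: 0.4, 0: 0.6}}        # p(x1 | c)
P_X2_GIVEN = {"+": {1: 0.9, 0: 0.2}, "-": {1: 0.6, 0: 0.5}}  # p(x2=1 | x1, c)

def p_x2(x2, x1, c):
    """p(x2 | x1, c) in the ANB."""
    p_one = P_X2_GIVEN[c][x1]
    return p_one if x2 == 1 else 1.0 - p_one

def p_x2_nb(x2, c):
    """p(x2 | c): the parent x1 marginalized out, as naive Bayes uses it."""
    return sum(p_x2(x2, x1, c) * P_X1[c][x1] for x1 in (0, 1))

for E in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x1, x2 = E
    # f_b(E): Bayes ratio computed from the full ANB joint (Equation 3).
    joint = {c: P_C[c] * P_X1[c][x1] * p_x2(x2, x1, c) for c in ("+", "-")}
    f_b = joint["+"] / joint["-"]
    # f_nb(E): the correspondent naive Bayes ratio (Equation 2).
    f_nb = (P_C["+"] / P_C["-"]) * (P_X1["+"][x1] / P_X1["-"][x1]) \
        * (p_x2_nb(x2, "+") / p_x2_nb(x2, "-"))
    # ddr for x1 is 1 (its only parent is the class); ddr for x2 compares
    # its local dependence derivatives in the two classes.
    ddr_x2 = (p_x2(x2, x1, "+") / p_x2_nb(x2, "+")) \
        / (p_x2(x2, x1, "-") / p_x2_nb(x2, "-"))
    df = 1.0 * ddr_x2  # dependence distribution factor DF_G(E)
    print(f"E={E}: f_b={f_b:.4f}  f_nb*DF={f_nb * df:.4f}  DF={df:.4f}")
```

The printed $f_b$ and $f_{nb} \cdot DF_G$ columns coincide on every example, as Theorem 1 requires; changing $p(x_2|x_1,c)$ so that the local dependence distributes evenly in both classes drives $ddr_G(x_2|pa(x_2))$ to 1, the case $DF_G(E)=1$ discussed above.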
Conditions for the Optimality of Naive Bayes

In the preceding section, we proposed that naive Bayes is optimal if the dependences among attributes cancel each other out; that is, under that circumstance, naive Bayes is still optimal even though the dependences do exist. In this section, we investigate naive Bayes under the multivariate Gaussian distribution and prove a sufficient condition for the optimality of naive Bayes, assuming that dependences among attributes do exist. That provides us with theoretical evidence that the dependences among attributes may cancel each other out.

Let us restrict our discussion to two attributes $x_1$ and $x_2$, and assume that the class density is a multivariate Gaussian in both the positive and negative classes. That is,

  $p(x_1, x_2|+) = \frac{1}{2\pi|\Sigma^+|^{1/2}} e^{-\frac{1}{2}(x-\mu^+)^T (\Sigma^+)^{-1} (x-\mu^+)}$,

  $p(x_1, x_2|-) = \frac{1}{2\pi|\Sigma^-|^{1/2}} e^{-\frac{1}{2}(x-\mu^-)^T (\Sigma^-)^{-1} (x-\mu^-)}$,

where $x = (x_1, x_2)^T$; $\Sigma^+$ and $\Sigma^-$ are the covariance matrices in the positive and negative classes respectively; $|\Sigma^+|$ and $|\Sigma^-|$ are their determinants; $(\Sigma^+)^{-1}$ and $(\Sigma^-)^{-1}$ are their inverses; $\mu^+$ and $\mu^-$ are the means of the attributes in the positive and negative classes respectively; and $(x-\mu^+)^T$ and $(x-\mu^-)^T$ are the transposes of $(x-\mu^+)$ and $(x-\mu^-)$.

We assume that the two classes have a common covariance matrix $\Sigma^+ = \Sigma^- = \Sigma$, and that $x_1$ and $x_2$ have the same variance $\sigma$ in both classes. Then, applying a logarithm to the Bayesian classifier defined in Equation 1 and dropping the constant prior term $\log(p(C=+)/p(C=-))$, which does not depend on the attributes, we obtain the classifier below:

  $f_{lb}(x_1, x_2) = \log\frac{p(x_1,x_2|+)}{p(x_1,x_2|-)} = (\mu^+ - \mu^-)^T \Sigma^{-1} x - \frac{1}{2}(\mu^+ + \mu^-)^T \Sigma^{-1} (\mu^+ - \mu^-)$.

Then, because of the conditional independence assumption, we have the correspondent naive Bayesian classifier:

  $f_{lnb}(x_1, x_2) = \frac{1}{\sigma}(\mu_1^+ - \mu_1^-)x_1 + \frac{1}{\sigma}(\mu_2^+ - \mu_2^-)x_2 - \frac{1}{2\sigma}\big((\mu_1^+)^2 - (\mu_1^-)^2 + (\mu_2^+)^2 - (\mu_2^-)^2\big)$,

where $\mu_i^+$ and $\mu_i^-$ are the means of attribute $x_i$ in the positive and negative classes. Assume that

  $\Sigma = \begin{pmatrix} \sigma & \sigma_{12} \\ \sigma_{12} & \sigma \end{pmatrix}$,

where $\sigma_{12}$ is the covariance of $x_1$ and $x_2$; note that $x_1$ and $x_2$ are independent if $\sigma_{12} = 0$. If $\sigma_{12} \ne 0$, we have

  $\Sigma^{-1} = \frac{1}{\sigma^2 - \sigma_{12}^2} \begin{pmatrix} \sigma & -\sigma_{12} \\ -\sigma_{12} & \sigma \end{pmatrix}$,

and $f_{lb}$ can be simplified as below. If $\mu_1^+ - \mu_1^- = \mu_2^+ - \mu_2^-$, then

  $f_{lb}(x_1, x_2) = \frac{\mu_1^+ - \mu_1^-}{\sigma + \sigma_{12}}(x_1 + x_2) - \frac{(\mu_1^+ - \mu_1^-)(\mu_1^+ + \mu_1^- + \mu_2^+ + \mu_2^-)}{2(\sigma + \sigma_{12})}$.   (15)

Similarly, if $\mu_1^+ - \mu_1^- = -(\mu_2^+ - \mu_2^-)$, we have

  $f_{lb}(x_1, x_2) = \frac{\mu_1^+ - \mu_1^-}{\sigma - \sigma_{12}}(x_1 - x_2) - \frac{(\mu_1^+ - \mu_1^-)(\mu_1^+ + \mu_1^- - \mu_2^+ - \mu_2^-)}{2(\sigma - \sigma_{12})}$.   (16)

From Equations 15 and 16, we see how $f_{lb}$ is affected by the covariance $\sigma_{12}$: the difference between the coefficients of $f_{lb}$ and those of $f_{lnb}$ increases as $|\sigma_{12}|$ increases. That means the absolute ratio of the distances between the two class means on the two attributes significantly affects the performance of naive Bayes; more precisely, the smaller this absolute ratio, the better the performance of naive Bayes.
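The canceling effect behind Equation 15 can be observed numerically. The following sketch (our own check using numpy, not code from the paper; $\sigma$ denotes the common variance as above, and all numeric values are arbitrary) picks class means with equal differences on the two attributes and a nonzero covariance $\sigma_{12}$, and confirms that $f_{lb}$ and $f_{lnb}$ classify every probe point identically, because under this condition $f_{lnb}$ is a positive multiple of $f_{lb}$:

```python
# Numerical check of the case in Equation 15; all numeric values arbitrary.
# Class priors are taken equal, so the prior term of Equation 1 vanishes.
import numpy as np

sigma, sigma12 = 1.0, 0.6      # |sigma12| < sigma keeps Sigma positive definite
Sigma = np.array([[sigma, sigma12], [sigma12, sigma]])
mu_pos = np.array([1.0, 2.0])  # mu+ - mu- = (1, 1): equal mean differences
mu_neg = np.array([0.0, 1.0])

# f_lb: linear discriminant from the full covariance (the Bayes classifier).
Sigma_inv = np.linalg.inv(Sigma)
w_b = Sigma_inv @ (mu_pos - mu_neg)
b_b = -0.5 * (mu_pos + mu_neg) @ Sigma_inv @ (mu_pos - mu_neg)

# f_lnb: the correspondent naive Bayes discriminant (diagonal covariance).
w_nb = (mu_pos - mu_neg) / sigma
b_nb = -0.5 * (mu_pos @ mu_pos - mu_neg @ mu_neg) / sigma

# Probe points spread over the plane; the two classifiers should agree on all.
rng = np.random.default_rng(0)
X = 3.0 * rng.normal(size=(10000, 2))
f_lb, f_lnb = X @ w_b + b_b, X @ w_nb + b_nb

print("same classification on all probes:",
      bool(np.all((f_lb >= 0) == (f_lnb >= 0))))
print("f_lnb == ((sigma+sigma12)/sigma) * f_lb:",
      bool(np.allclose(f_lnb, (sigma + sigma12) / sigma * f_lb)))
```

Both checks print True: the nonzero covariance rescales the discriminant but never flips its sign, so naive Bayes remains optimal despite the dependence.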
Conclusions

In this paper, we propose a new explanation of the classification performance of naive Bayes. We show that, essentially, the dependence distribution, i.e., how the local dependence of a node distributes in each class, evenly or unevenly, and how the local dependencies of all nodes work together, consistently (supporting a certain classification) or inconsistently (canceling each other out), plays a crucial role in the classification. We explain why naive Bayes still works well even with strong dependencies: when those dependencies cancel each other out, there is no influence on the classification, and in that case naive Bayes is still the optimal classifier. In addition, we investigated the optimality of naive Bayes under the Gaussian distribution, and presented an explicit sufficient condition under which naive Bayes is optimal, even though the conditional independence assumption is violated.

References

Bennett, P. N. 2000. Assessing the calibration of Naive Bayes' posterior estimates. Technical Report No. CMU-.

Domingos, P., and Pazzani, M. 1997. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Machine Learning.

Frank, E.; Trigg, L.; Holmes, G.; and Witten, I. H. 2000. Naive Bayes for regression. Machine Learning.

Friedman, J. 1996. On bias, variance, 0/1-loss, and the curse of dimensionality. Data Mining and Knowledge Discovery.

Garg, A., and Roth, D. 2001. Understanding probabilistic classifiers. In Raedt, L. D., and Flach, P., eds., Proceedings of the 12th European Conference on Machine Learning. Springer. 179-191.

Hand, D. J., and Yu, Y. 2001. Idiot's Bayes - not so stupid after all? International Statistical Review.

Kononenko, I. 1990. Comparison of inductive and naive Bayesian learning approaches to automatic knowledge acquisition. In Wielinga, B., ed., Current Trends in Knowledge Acquisition. IOS Press.

Langley, P.; Iba, W.; and Thomas, K. 1992. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence. AAAI Press. 223-.

Monti, S., and Cooper, G. F. 1999. A Bayesian network classifier that combines a finite mixture model and a Naive Bayes model. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann.

Pazzani, M. J. 1996. Search for dependencies in Bayesian classifiers. In Fisher, D., and Lenz, H. J., eds., Learning from Data: Artificial Intelligence and Statistics V. Springer Verlag.

Quinlan, J. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann: San Mateo, CA.

Rachlin, J. R.; Kasif, S.; and Aha, D. W. 1994. Toward a better understanding of memory-based reasoning systems. In Proceedings of the Eleventh International Machine Learning Conference. Morgan Kaufmann. 242-250.

Roth, D. 1999. Learning in natural language. In Proceedings of IJCAI'99. Morgan Kaufmann. 898-904.

Zhang, H., and Ling, C. X. 2001. Learnability of augmented Naive Bayes in nominal domains. In Brodley, C. E., and Danyluk, A. P., eds., Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann. 617-623.