Discretization of Continuous Attributes for Learning Classification Rules

Aijun An and Nick Cercone
Department of Computer Science, University of Waterloo
Waterloo, Ontario N2L 3G1 Canada

Abstract. We present a comparison of three entropy-based discretization methods in a context of learning classification rules. We compare the binary recursive discretization with a stopping criterion based on the Minimum Description Length Principle (MDLP) [3] against a non-recursive method that simply chooses a maximal number of entropy-lowest cut-points. The MDLP method is reported as a successful method for discretization in the decision tree learning and Naive-Bayes learning environments [2, 6]. However, little research has been done to investigate whether the method works well with other rule induction methods. We report experiments showing that the MDLP method fails to find sufficient useful cut-points, especially on small data sets. The experiments also discover that the other method tends to select cut-points from a small local area of the entire value space, especially on large data sets. To overcome these problems, we introduce a new entropy-based discretization method that selects cut-points based on both information entropy and the distribution of potential cut-points. Our conclusion is that MDLP does not give the best results on most of the tested data sets. The proposed method performs better than MDLP in the ELEM2 learning environment.
2 The MDLP Discretization Method

Given a set $S$ of instances, an attribute $A$, and a cut-point $T$, the class information entropy of the partition induced by $T$, denoted as $E(A, T; S)$, is defined as

$$E(A, T; S) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2),$$

where $S_1$ and $S_2$ are the two subsets of $S$ induced by $T$ and $Ent(S_i)$ is the class entropy of the subset $S_i$, defined as

$$Ent(S_i) = -\sum_{j=1}^{k} P(C_j, S_i) \log_2 P(C_j, S_i),$$

where there are $k$ classes $C_1, \ldots, C_k$ and $P(C_j, S_i)$ is the proportion of examples in $S_i$ that have class $C_j$. For an attribute $A$, the MDLP method selects a cut-point $T_A$ for which $E(A, T_A; S)$ is minimal among all the boundary points (1). The training set is then split into two subsets by the cut-point. Subsequent cut-points are selected by recursively applying the same binary discretization method to each of the newly generated subsets until the following condition is achieved:

$$Gain(A, T; S) \le \frac{\log_2(N - 1)}{N} + \frac{\Delta(A, T; S)}{N},$$

where $N$ is the number of examples in $S$, $Gain(A, T; S) = Ent(S) - E(A, T; S)$, $\Delta(A, T; S) = \log_2(3^k - 2) - [k\,Ent(S) - k_1 Ent(S_1) - k_2 Ent(S_2)]$, and $k_1$, $k_2$ are the number of classes represented in the sets $S_1$, $S_2$, respectively.

Empirical results, presented in [3], show that the MDLP stopping criterion leads to construction of better decision trees. Dougherty et al. [2] also show that a global variant of the MDLP method significantly improved a Naive-Bayes classifier and that it also performs best among several discretization methods in the context of C4.5 decision tree learning.

(1) Fayyad and Irani proved that the value $T_A$ that minimizes the class entropy $E(A, T_A; S)$ must always be a value between two examples of different classes in the sequence of sorted examples. These kinds of values are called boundary points.

3 Experiments with MDLP Discretization and ELEM2

We conducted experiments with two versions of ELEM2. Both versions employ the entropy-based discretization method, but with different stopping criteria. One version uses the global variant of the MDLP discretization method, i.e., it discretizes continuous attributes using the recursive entropy-based method with the MDLP stopping criterion applied before rule induction begins. The other version uses the same entropy criterion for selecting cut-points before rule induction, but it simply chooses a maximal number of $m$ entropy-lowest cut-points without recursive application of the method. The number $m$ is set by a logarithmic formula in $l$ and $k$, where $l$ is the number of distinct observed values for the attribute being discretized and $k$ is the number of classes. We refer to this method as Max-m. Both versions first sort the examples according to their values of the attribute and then evaluate only the boundary points in their search for cut-points.

We first conduct the experiments on an artificial data set. Each example in the data set has two continuous attributes and a symbolic attribute. Of the two continuous attributes, named $A_1$ and $A_2$, $A_1$ takes values in [0, 90] and $A_2$ takes values up to 5. The symbolic attribute, color, takes one of the four values: red, blue, yellow and green. An example belongs to class "1" if the following condition holds: (30 [...]) [...] (color = blue or green); otherwise, it belongs to class "0". The data set has a total of 9384 examples. We randomly chose 6 training sets from these examples. The sizes of the training sets range from 47 examples (0.5%) to 4692 examples (50%). We run the two versions of ELEM2 on each of the 6 training sets to generate a set of decision rules. The rules are then tested on the original data set of 9384 examples. Table 1 depicts, for all the training sets, the predictive accuracy, the total number of cut-points selected for both continuous attributes, the total number of rules generated for both classes, and the number of boundary points for both continuous attributes.

| Training set size | Predictive accuracy (MDLP) | Predictive accuracy (Max-m) | No. of cut-points (MDLP) | No. of cut-points (Max-m) | No. of rules (MDLP) | No. of rules (Max-m) | No. of boundary points |
|---|---|---|---|---|---|---|---|
| 47 | 56.71% | 95.20% | 0 | 14 | 3 | 6 | 58 |
| 188 | 90.41% | 100% | 2 | 21 | 4 | 6 | 96 |
| 470 | 100% | 100% | 5 | 22 | 6 | 6 | 97 |
| 1877 | 100% | 100% | 29 | 22 | 6 | 6 | 97 |
| 4692 | 100% | 100% | 73 | 22 | 6 | 6 | 97 |

Table 1. Results on the Artificial Domain.

The results indicate that, when the number of training examples is small, the MDLP method stops too early and fails to find enough useful cut-points, which causes ELEM2 to generate rules that have poor predictive performance on the testing set. When the size of the training set increases, MDLP generates more cut-points and its predictive performance improves. For the middle-sized training set (470 examples), MDLP works perfectly because it finds only 5 cut-points from 97 boundary points, and these include all four of the right cut-points that the learning system needs to generate correct rules. However, when the training set becomes larger, the number of cut-points MDLP finds increases greatly. In the last training set (4692 examples), it selects 73 cut-points out of 97 potential points, which slows down the learning system. In contrast, the Max-m method is more stable. The number of cut-points it produces ranges from 14 to 22 and its predictive performance is better than that of MDLP when the training set is small.

We also run the two versions of ELEM2 on a number of actual data sets obtained from the UCI repository [4], each of which has at least one continuous attribute. Table 2 reports the ten-fold evaluation results on 6 of these data sets. The empirical results presented above indicate that MDLP
is not superior to Max-m on most of the tested data sets. One possible reason is that, when the training set is small, the examples are not sufficient to make the MDLP criterion valid and meaningful, so that the criterion causes the discretization process to stop too early.

EDA-DB selects a maximal number of $m$ cut-points, as in the Max-m method. However, rather than taking the first $m$ entropy-lowest cut-points, EDA-DB divides the value range of the attribute into intervals and selects in each interval a number of cut-points based on the entropy calculated over the entire instance space. This number is determined by estimating the probability distribution of the boundary points over the instance space. The EDA-DB discretization algorithm is described as follows. Let $l$ be the number of distinct observed values for a continuous attribute $A$, $b$ be the total number of boundary points for $A$, and $k$ be the number of classes in the data set. To discretize $A$:

1. Calculate $m$, as in the Max-m method.
2. Estimate the probability distribution of boundary points:
   (a) Divide the value range of $A$ into $d$ intervals, where $d$ is likewise set by a logarithmic formula.
   (b) Calculate the number $b_i$ of boundary points in each interval $i$, where $i = 1, \ldots, d$ and $\sum_{i=1}^{d} b_i = b$.
   (c) Estimate the probability of boundary points in each interval $i$ as $p_i = b_i / b$.
3. Calculate the quota $q_i$ of cut-points for each interval $i$ ($i = 1, \ldots, d$) according to $m$ and the distribution of boundary points as follows: $q_i = p_i \times m$.
4. Rank the boundary points in each interval $i$ by increasing order of the class information entropy of the partition induced by the boundary point. The entropy for each point is calculated globally over the entire instance space.
5. For each interval $i$ ($i = 1, \ldots, d$), select the first $q_i$ points in the above ordered sequence. A total of $m$ cut-points are selected.

Experiments with EDA-DB

We conducted experiments with EDA-DB coupled with ELEM2. We first conducted ten-fold evaluation on the [...] data set to see whether EDA-DB improves over Max-m on this data set, which has a large number of boundary points for several attributes. The result is that the predictive accuracy is increased to 95.11% and the average number of rules drops to 69. Figure 1 shows the ten-fold evaluation results on 14 UCI data sets. In the figure, the solid line represents the difference between EDA-DB's predictive accuracy and Max-m's, and the dashed line represents the accuracy difference between EDA-DB and MDLP. The results indicate that EDA-DB outperforms both Max-m and MDLP on most of the tested data sets.

[Fig. 1. Ten-fold Evaluation Results on Actual Data Sets. Data sets on the horizontal axis: abalone, australian, breast cancer, bupa, diabetes, ecoli, german, glass, heart, ionosphere, iris, segment, yeast.]

We have presented an empirical comparison of three entropy-based discretization methods in a context of learning decision rules. We found that the MDLP method stops too early when the number of training examples is small and thus it fails to detect sufficient cut-points on small data sets. Our empirical results also indicate that Max-m and EDA-DB are better discretization methods for ELEM2 on most of the tested data sets. We conjecture that the recursive nature of the MDLP method may cause most of the cut-points to be selected based on small samples of the instance space, which leads to generation of unreliable cut-points. The experiment with Max-m on the [...] data set reveals that the strategy of simply selecting the first $m$ entropy-lowest cut-points does not work well on large data sets with a large number of boundary points. The reason for this is that entropy-lowest cut-points tend to concentrate on a small region of the instance space, which leads to the algorithm failing to pick up useful cut-points in other regions. Our proposed EDA-DB method alleviates Max-m's problem by considering the distribution of boundary points over the instance space. Our test of EDA-DB on the [...] data set shows that EDA-DB improves over Max-m on both the predictive accuracy and the number of rules generated. The experiments with EDA-DB on other tested data sets also confirm that EDA-DB is a better method than both Max-m and MDLP.

References

1. An, A. and Cercone, N. 1998. ELEM2: A Learning System for More Accurate Classifications. Lecture Notes in Artificial Intelligence 1418.
2. Dougherty, J., Kohavi, R. and Sahami, M. 1995. Supervised and Unsupervised Discretization of Continuous Features. Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, CA.
3. Fayyad, U.M. and Irani, K.B. 1993. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. IJCAI-93. pp. 1022-1027.
4. Murphy, P.M. and Aha, D.W. 1994. UCI Repository of Machine Learning Databases. URL: http://www.ics.uci.edu/AI/ML/MLDBRepository
5. Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers. San Mateo, CA.
6. Rabaseda-Loudcher, S., Sebban, M. and Rakotomalala, R. 1995. Discretization of Continuous Attributes: a Survey of Methods. Proceedings of the 2nd Annual Joint Conference on Information Sciences, pp. 164-166.
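As an illustration of the three selection strategies compared in this paper, the following is a minimal Python sketch, not the ELEM2 implementation. All function names (`ent`, `split`, `partition_entropy`, `boundary_points`, `mdlp_accepts`, `max_m`, `eda_db`) are hypothetical. Several details are assumptions made only for the sketch: cut values are taken as midpoints between adjacent distinct attribute values, the intervals in `eda_db` are equal-width, each quota is rounded to the nearest integer, and the budgets `m` and `d` are passed in by the caller rather than computed from a formula.

```python
import math
from collections import Counter


def ent(labels):
    """Class entropy Ent(S), in bits, of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def split(values, labels, t):
    """Class labels of the two subsets S1 (value <= t) and S2 (value > t)."""
    left = [y for v, y in zip(values, labels) if v <= t]
    right = [y for v, y in zip(values, labels) if v > t]
    return left, right


def partition_entropy(values, labels, t):
    """E(A, T; S): size-weighted class entropy of the partition induced by t."""
    left, right = split(values, labels, t)
    n = len(labels)
    return len(left) / n * ent(left) + len(right) / n * ent(right)


def boundary_points(values, labels):
    """Midpoints between adjacent distinct values that separate different classes."""
    classes_at = {}
    for v, y in zip(values, labels):
        classes_at.setdefault(v, set()).add(y)
    vs = sorted(classes_at)
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])
            if len(classes_at[a] | classes_at[b]) > 1]


def mdlp_accepts(values, labels, t):
    """MDLP test: keep cut t only if Gain > (log2(N - 1) + Delta) / N (needs N >= 2)."""
    left, right = split(values, labels, t)
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    gain = ent(labels) - partition_entropy(values, labels, t)
    delta = math.log2(3 ** k - 2) - (k * ent(labels) - k1 * ent(left) - k2 * ent(right))
    return gain > (math.log2(n - 1) + delta) / n


def max_m(values, labels, m):
    """Max-m: the m boundary points with the lowest partition entropy."""
    pts = boundary_points(values, labels)
    pts.sort(key=lambda t: partition_entropy(values, labels, t))
    return sorted(pts[:m])


def eda_db(values, labels, m, d):
    """EDA-DB sketch: spread about m cut-points over d intervals by boundary-point share."""
    pts = boundary_points(values, labels)
    if not pts:
        return []
    lo = min(values)
    width = (max(values) - lo) / d  # equal-width intervals (assumption)
    buckets = [[] for _ in range(d)]
    for t in pts:
        buckets[min(int((t - lo) / width), d - 1)].append(t)
    cuts = []
    for bucket in buckets:
        # q_i = p_i * m with p_i = b_i / b; rounding to an integer is an assumption.
        quota = round(m * len(bucket) / len(pts))
        # Entropy is computed globally over all examples, not within the interval.
        bucket.sort(key=lambda t: partition_entropy(values, labels, t))
        cuts.extend(bucket[:quota])
    return sorted(cuts)


# Toy data: 14 examples with values 0..13 and two classes.
values = list(range(14))
labels = ["a"] + ["b"] * 6 + ["a"] * 4 + ["b"] * 3
print(boundary_points(values, labels))
print(max_m(values, labels, m=2))
print(eda_db(values, labels, m=2, d=2))
```

On this toy data, `max_m` takes both of its cut-points from the upper half of the value range, whereas the per-interval quota makes `eda_db` keep one cut-point in the lower half. Because of the rounding, the number of cut-points returned by `eda_db` can differ slightly from `m`.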