


Discretization of Continuous Attributes for Learning Classification Rules

Aijun An and Nick Cercone
Department of Computer Science, University of Waterloo
Waterloo, Ontario N2L 3G1 Canada

Abstract. We present a comparison of three entropy-based discretization methods in a context of learning classification rules. We compare the binary recursive discretization with a stopping criterion based on the Minimum Description Length Principle (MDLP) [3] [...]

The MDLP method is reported as a successful method for discretization in the decision tree learning and Naive-Bayes learning environments [2, 6]. However, little research has been done to investigate whether the method works well with other rule induction methods. We report [...] that the MDLP method fails to find sufficient useful cut-points, especially on small data sets. The experiments also discover that the other method tends to select cut-points from a small local area of the entire value space, especially on large data sets. To overcome these problems, we introduce a new entropy-based discretization method that selects cut-points based on both information entropy and distribution of potential cut-points. Our conclusion is that MDLP does not give the best results in most tested data sets. The proposed method performs better than MDLP in the ELEM2 learning environment.
2 The MDLP Discretization Method

Given a set $S$ of instances, an attribute $A$, and a cut-point $T$, the class information entropy of the partition induced by $T$, denoted as $E(A,T;S)$, is defined as

$$E(A,T;S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2),$$

where $\mathrm{Ent}(S_i)$ is the class entropy of the subset $S_i$, defined as

$$\mathrm{Ent}(S_i) = -\sum_{j=1}^{k} P(C_j,S_i)\,\log_2 P(C_j,S_i),$$

where there are $k$ classes $C_1,\dots,C_k$ and $P(C_j,S_i)$ is the proportion of examples in $S_i$ that have class $C_j$. For an attribute $A$, the MDLP method selects a cut-point $T_A$ for which $E(A,T_A;S)$ is minimal among all the boundary points¹. The training set is then split into two subsets by the cut-point. Subsequent cut-points are selected by recursively applying the same binary discretization method to each of the newly generated subsets until the following condition is achieved:

$$\mathrm{Gain}(A,T;S) \le \frac{\log_2(N-1)}{N} + \frac{\Delta(A,T;S)}{N},$$

where $N$ is the number of examples in $S$, $\mathrm{Gain}(A,T;S) = \mathrm{Ent}(S) - E(A,T;S)$, $\Delta(A,T;S) = \log_2(3^k - 2) - [k\,\mathrm{Ent}(S) - k_1\,\mathrm{Ent}(S_1) - k_2\,\mathrm{Ent}(S_2)]$, and $k_1, k_2$ are the number of classes represented in the sets $S_1, S_2$, respectively.

¹ Fayyad and Irani proved that the value $T_A$ that minimizes the class entropy $E(A,T_A;S)$ must always be a value between two examples of different classes in the sequence of sorted examples. These kinds of values are called boundary points.

Empirical results, presented in [3], show that the MDLP stopping criterion leads to construction of better decision trees. Dougherty et al. [2] also show that a global variant of the MDLP method significantly improved a Naive-Bayes classifier and that it also performs best among several discretization methods in the context of C4.5 decision tree learning.

3 Experiments with MDLP Discretization and ELEM2

We conducted experiments with two versions of ELEM2. Both versions employ the entropy-based discretization method, but with different stopping criteria. One version uses the global variant of the MDLP discretization method, i.e., it discretizes continuous attributes using the recursive entropy-based method with the MDLP stopping criterion applied before rule induction begins. The other version uses the same entropy criterion for selecting cut-points before rule induction, but it simply chooses a maximal number of $m$ entropy-lowest cut-points without recursive application of the method. $m$ is set to be $\max\{2, k \cdot \log_2(l)\}$, where $l$ is the number of distinct observed values for the attribute being discretized and $k$ is the number of classes. We refer to this method as Max-m. Both versions first sort the examples according to their values of the attribute and then evaluate only the boundary points in their search for cut-points.

We first conduct the experiments on an artificial data set. Each example in the data set has two continuous attributes and a symbolic attribute. The two continuous attributes, named A1 and A2, have value ranges of [0, 90] and [..., 5], respectively. The symbolic attribute, color, takes one of the four values: red, blue, yellow and green. An example belongs to class "1" if the following condition holds: (30 [...]) [...] (color = blue or green); otherwise, it belongs to class "0". The data set has a total of 9384 examples. We randomly chose 6 training sets from these examples. The sizes of the training sets range from 47 examples (0.5%) to 4692 examples (50%). We run the two versions of ELEM2 on each of the 6 training sets to generate a set of decision rules. The rules are then tested on the original data set of 9384 examples. Table 1 depicts, for all the training sets, the predictive accuracy, the total number of cut-points selected for both continuous attributes, the total number of rules generated for both classes, and the number of boundary points for both continuous attributes.

Table 1. Results on the Artificial Domain.

| Training set size | Predictive accuracy (MDLP) | Predictive accuracy (Max-m) | No. of cut-points (MDLP) | No. of cut-points (Max-m) | No. of rules (MDLP) | No. of rules (Max-m) | No. of boundary points |
|---|---|---|---|---|---|---|---|
| 47 | 56.71% | 95.20% | 0 | 14 | 3 | 6 | 58 |
| 188 | 90.41% | 100% | 2 | 21 | 4 | 6 | 96 |
| 470 | 100% | 100% | 5 | 22 | 6 | 6 | 97 |
| 1877 | 100% | 100% | 29 | 22 | 6 | 6 | 97 |
| 4692 | 100% | 100% | 73 | 22 | 6 | 6 | 97 |

The results indicate that, when the number of training examples is small, the MDLP method stops too early and fails to find enough useful cut-points, which causes ELEM2 to generate rules that have poor predictive performance on the testing set. When the size of the training set increases, MDLP generates more cut-points and its predictive performance improves. For the middle-sized training set (470 examples), MDLP works perfectly because it finds only 5 cut-points from 97 boundary points, which include all of the four right cut-points that the learning system needs to generate correct rules. However, when the training set becomes larger, the number of cut-points MDLP finds increases greatly. In the last training set (4692 examples), it selects 73 cut-points out of 97 potential points, which slows down the learning system. In contrast, the Max-m method is more stable. The number of cut-points it produces ranges from 14 to 22, and its predictive performance is better than MDLP's when the training set is small.

We also run the two versions of ELEM2 on a number of actual data sets obtained from the UCI repository [4], each of which has at least one continuous attribute. Table 2 reports the ten-fold evaluation results on 6 of these data sets.

The empirical results presented above indicate that MDLP is not superior to Max-m in most of the tested data sets. One possible reason is that, when the training set is small, the examples are not sufficient to make the MDLP criterion valid and meaningful, so that the criterion causes the discretization process to stop too early. [...]

[...] as in the Max-m method. However, rather than taking the first $m$ cut-points, EDA-DB divides the value range of the attribute into intervals and selects in each interval a number of cut-points based on the entropy calculated over the entire instance space. This number is determined by estimating the probability distribution of the boundary points over the instance space. The EDA-DB discretization algorithm is described as follows. Let $l$ be the number of distinct observed values for a continuous attribute $A$, $b$ be the total number of boundary points for $A$, and $k$ be the number of classes in the data set. To discretize $A$:

1. Calculate $m = \max\{2, k \cdot \log_2(l)\}$.
2. Estimate the probability distribution of boundary points:
   (a) Divide the value range of $A$ into $d$ intervals, where $d = \max\{1, \log(l)\}$.
   (b) Calculate the number $b_i$ of boundary points in each interval $iv_i$, where $i = 1, 2, \dots, d$ and $\sum_{i=1}^{d} b_i = b$.
   (c) Estimate the probability of boundary points in each interval $iv_i$ ($i = 1, 2, \dots, d$) as $p_i = b_i / b$.
3. Calculate the quota $q_i$ of cut-points for each interval $iv_i$ ($i = 1, 2, \dots, d$) according to $m$ and the distribution of boundary points as follows: $q_i = p_i \cdot m$.
4. Rank the boundary points in each interval $iv_i$ by increasing order of the class information entropy of the partition induced by the boundary point. The entropy for each point is calculated globally over the entire instance space.
5. For each interval $iv_i$ ($i = 1, 2, \dots, d$), select the first $q_i$ points in the above ordered sequence. A total of $m$ cut-points are selected.

Experiments with EDA-DB

We conducted experiments with EDA-DB coupled with ELEM2. We first conducted ten-fold evaluation on the [...] data set to see whether EDA-DB improves over Max-m on this data set, which has a large number of boundary points for several attributes. The result is that the predictive accuracy is increased to 95.11% and the average number of rules drops to 69. Figure 1 shows the ten-fold evaluation results on 14 UCI data sets. In the figure, the solid line represents the difference between EDA-DB's predictive accuracy and Max-m's, and the dashed line represents the accuracy difference between EDA-DB and MDLP. The results indicate that EDA-DB outperforms both Max-m and MDLP on most of the tested data sets.

Fig. 1. Ten-fold Evaluation Results on Actual Data Sets. (Data sets recoverable from the horizontal axis: abalone, australian, breast cancer, bupa, diabetes, ecoli, german, glass, heart, ionosphere, iris, segment, yeast.)

We have presented an empirical comparison of three entropy-based discretization methods in a context of learning decision rules. We found that the MDLP method stops too early when the number of training examples is small and thus fails to detect sufficient cut-points on small data sets. Our empirical results also indicate that Max-m and EDA-DB are better discretization methods for ELEM2 on most of the tested data sets. We conjecture that the recursive nature of the MDLP method may cause most of the cut-points to be selected based on small samples of the instance space, which leads to generation of unreliable cut-points. The experiment with Max-m on the [...] data set reveals that the strategy of simply selecting the first $m$ entropy-lowest cut-points does not work well on large data sets with a large number of boundary points. The reason for this is that entropy-lowest cut-points tend to concentrate on a small region of the instance space, which leads to the algorithm failing to pick up useful cut-points in other regions. Our proposed EDA-DB method alleviates Max-m's problem by considering the distribution of boundary points over the instance space. Our test of EDA-DB on the [...] data set shows that EDA-DB improves over Max-m on both the predictive accuracy and the number of rules generated. The experiments with EDA-DB on other tested data sets also confirm that EDA-DB is a better method than both Max-m and MDLP.

References

1. An, A. and Cercone, N. 1998. ELEM2: A Learning System for More Accurate Classifications. Lecture Notes in Artificial Intelligence 1418.
2. Dougherty, J., Kohavi, R. and Sahami, M. 1995. Supervised and Unsupervised Discretization of Continuous Features. Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, CA.
3. Fayyad, U.M. and Irani, K.B. 1993. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. IJCAI-93. pp. 1022-1027.
4. Murphy, P.M. and Aha, D.W. 1994. UCI Repository of Machine Learning Databases. URL: http://www.ics.uci.edu/AI/ML/MLDBRepository
5. Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers. San Mateo, CA.
6. Rabaseda-Loudcher, S., Sebban, M. and Rakotomalala, R. 1995. Discretization of Continuous Attributes: a Survey of Methods. Proceedings of the 2nd Annual Joint Conference on Information Sciences, pp. 164-166.
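
The three cut-point selection strategies compared in this paper (recursive selection with the MDLP stopping criterion, Max-m, and EDA-DB) can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions and not the ELEM2 implementation: all function names are hypothetical, boundary points are placed at midpoints between adjacent distinct values, `eda_db_cuts` assumes equal-width intervals and rounds each quota to the nearest integer, and `m` and `d` are passed in explicitly rather than computed.

```python
import math
from collections import Counter


def ent(labels):
    """Class entropy Ent(S) in bits for a sequence of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def boundary_points(values, labels):
    """Midpoints between adjacent distinct values that separate examples of
    different classes (midpoint placement is an assumption of this sketch)."""
    classes = {}
    for v, y in zip(values, labels):
        classes.setdefault(v, set()).add(y)
    vs = sorted(classes)
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])
            if len(classes[a] | classes[b]) > 1]


def split(values, labels, t):
    """Labels of the two subsets S1 (value <= t) and S2 (value > t)."""
    s1 = [y for v, y in zip(values, labels) if v <= t]
    s2 = [y for v, y in zip(values, labels) if v > t]
    return s1, s2


def partition_entropy(values, labels, t):
    """Class information entropy E(A,T;S) of the partition induced by t."""
    s1, s2 = split(values, labels, t)
    n = len(labels)
    return len(s1) / n * ent(s1) + len(s2) / n * ent(s2)


def mdlp_accepts(values, labels, t):
    """True if the gain of cut t exceeds the MDLP threshold."""
    s1, s2 = split(values, labels, t)
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(s1)), len(set(s2))
    gain = ent(labels) - partition_entropy(values, labels, t)
    delta = math.log2(3 ** k - 2) - (k * ent(labels) - k1 * ent(s1) - k2 * ent(s2))
    return gain > (math.log2(n - 1) + delta) / n


def mdlp_cuts(values, labels):
    """Recursive binary discretization with the MDLP stopping criterion."""
    cands = boundary_points(values, labels)
    if not cands:
        return []
    t = min(cands, key=lambda c: partition_entropy(values, labels, c))
    if not mdlp_accepts(values, labels, t):
        return []
    lo = [(v, y) for v, y in zip(values, labels) if v <= t]
    hi = [(v, y) for v, y in zip(values, labels) if v > t]
    return sorted(mdlp_cuts(*zip(*lo)) + [t] + mdlp_cuts(*zip(*hi)))


def max_m_cuts(values, labels, m):
    """Max-m: the m boundary points with the lowest partition entropy."""
    cands = boundary_points(values, labels)
    ranked = sorted(cands, key=lambda c: partition_entropy(values, labels, c))
    return sorted(ranked[:m])


def eda_db_cuts(values, labels, m, d):
    """EDA-DB-style selection: per-interval quotas proportional to the share
    of boundary points, with entropy computed over the whole data set.
    Equal-width intervals and quota rounding are assumptions of this sketch."""
    cands = boundary_points(values, labels)
    if not cands:
        return []
    lo, hi = min(values), max(values)
    width = (hi - lo) / d
    buckets = [[] for _ in range(d)]
    for c in cands:
        buckets[min(int((c - lo) / width), d - 1)].append(c)
    cuts = []
    for bucket in buckets:
        quota = round(m * len(bucket) / len(cands))
        bucket.sort(key=lambda c: partition_entropy(values, labels, c))
        cuts += bucket[:quota]
    return sorted(cuts)


if __name__ == "__main__":
    values = list(range(1, 13))
    labels = list("aabbcccccacc")
    print("MDLP  :", mdlp_cuts(values, labels))
    print("Max-m :", max_m_cuts(values, labels, m=2))
    print("EDA-DB:", eda_db_cuts(values, labels, m=2, d=2))
```

On the toy data in the demo, the two entropy-lowest boundary points both lie in the lower half of the value range, so `max_m_cuts` with `m=2` selects only from that region, whereas `eda_db_cuts` with the same budget takes one cut-point from each half. This mirrors, on a very small scale, the concentration effect the paper reports for Max-m.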