Journal of Machine Learning Research 5 (2004) 153-188. Submitted 12/02; Published 2/04.
Subgroup Discovery with CN2-SD

Nada Lavrac (NADA.LAVRAC@IJS.SI)
Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia, and Nova Gorica Polytechnic, Vipavska 13, 5100 Nova Gorica, Slovenia

Branko Kavsek (BRANKO.KAVSEK@IJS.SI)
Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia

Peter Flach (PETER.FLACH@BRISTOL.AC.UK)
University of Bristol, Woodland Road, Bristol BS8 1UB, United Kingdom

Ljupco Todorovski (LJUPCO.TODOROVSKI@IJS.SI)
Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia

Editor: Stefan Wrobel

Abstract

This paper investigates how to adapt standard classification rule learning approaches to subgroup discovery. The goal of subgroup discovery is to find rules describing subsets of the population that are sufficiently large and statistically unusual. The paper presents a subgroup discovery algorithm, CN2-SD, developed by modifying parts of the CN2 classification rule learner: its covering algorithm, search heuristic, probabilistic classification of instances, and evaluation measures. Experimental evaluation of CN2-SD on 23 UCI data sets shows substantial reduction of the number of induced rules, increased rule coverage and rule significance, as well as slight improvements in terms of the area under ROC curve, when compared with the CN2 algorithm. Application of CN2-SD to a large traffic accident data set confirms these findings.

Keywords: Rule Learning, Subgroup Discovery, UCI Data Sets, Traffic Accident Data Analysis

© 2004 Nada Lavrac, Branko Kavsek, Peter Flach, and Ljupco Todorovski.

1. Introduction

Rule learning is most frequently used in the context of classification rule learning (Michalski et al., 1986, Clark and Niblett, 1989, Cohen, 1995) and association rule learning (Agrawal et al., 1996). While classification rule learning is an approach to predictive induction (or supervised learning), aimed at constructing a set of rules to be used for classification and/or prediction, association rule
learning is a form of descriptive induction (non-classificatory induction or unsupervised learning), aimed at the discovery of individual rules which define interesting patterns in data.

Descriptive induction has recently gained much attention of the rule learning research community. Besides mining of association rules (e.g., the APRIORI association rule learning algorithm (Agrawal et al., 1996)), other approaches have been developed, including clausal discovery as in the CLAUDIEN system (Raedt and Dehaspe, 1997, Raedt et al., 2001), and database dependency discovery (Flach and Savnik, 1999).

1.1 Subgroup Discovery: A Task at the Intersection of Predictive and Descriptive Induction

This paper shows how classification rule learning can be adapted to subgroup discovery, a task at the intersection of predictive and descriptive induction that has first been formulated by Klösgen (1996) and Wrobel (1997, 2001), and addressed by the rule learning algorithms EXPLORA (Klösgen, 1996) and MIDOS (Wrobel, 1997, 2001). In the work of Klösgen (1996) and Wrobel (1997, 2001), the problem of subgroup discovery has been defined as follows: Given a population of individuals and a property of those individuals we are interested in, find population subgroups that are statistically 'most interesting', e.g., are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest.

In subgroup discovery, rules have the form Class ← Cond, where the property of interest for subgroup discovery is the class value Class that appears in the rule consequent, and the rule antecedent Cond is a conjunction of features (attribute-value pairs) selected from the features describing the training instances. As rules are induced from labeled training instances (labeled positive if the property of interest holds, and negative otherwise), the process of subgroup discovery is targeted at uncovering properties of a selected target population of individuals with the given property of interest. In this sense, subgroup discovery is a form of supervised learning. However, in many respects subgroup discovery is a form of descriptive induction, as the task is to uncover individual interesting patterns in data. The standard assumptions made by classification rule learning algorithms (especially the ones that take the covering approach), such as 'induced rules should be as accurate as possible' or 'induced rules should be as distinct as possible, covering different parts of the population', need to be relaxed. In our approach, the first assumption, implemented in classification rule learners by heuristics which aim at optimizing predictive accuracy, is relaxed by implementing new heuristics for subgroup discovery which aim at finding 'best' subgroups in terms of rule coverage and distributional unusualness. The relaxation of the second assumption enables the discovery of overlapping subgroups, describing some population segments in a multiplicity of ways. Induced subgroup descriptions may be redundant, if viewed from a classifier perspective, but very valuable in terms of their descriptive power, uncovering genuine properties of subpopulations from different viewpoints.

Let us emphasize the difference between subgroup discovery (as a task at the intersection of predictive and descriptive induction) and classification rule learning (as a form of predictive induction). The goal of standard rule learning is to generate models, one for each class, consisting of rule sets describing class characteristics in terms of properties occurring in the descriptions of training examples. In contrast, subgroup discovery aims at discovering individual rules or 'patterns' of interest, which must be represented in explicit symbolic form and which must be relatively simple in order to be recognized as actionable by potential users. Moreover, standard classification rule learning algorithms cannot appropriately address the task of subgroup discovery as they use the covering algorithm for rule set construction which - as will be seen in this paper - hinders the applicability of classification rule induction approaches in subgroup discovery.

Subgroup discovery is usually seen as different from classification, as it addresses different goals (discovery of interesting population subgroups instead of maximizing classification accuracy of the induced rule set). This is manifested also by the fact that in subgroup discovery one can often tolerate many more false positives (negative examples incorrectly classified as positives) than in a classification task. However, both tasks, subgroup discovery and classification rule learning, can be unified under the umbrella of cost-sensitive classification. This is because when deciding which classifiers are optimal in a given context it does not matter whether we penalize false negatives, as is the case in classification, or reward true positives, as in subgroup discovery.

1.2 Overview of the CN2-SD Approach to Subgroup Discovery

This paper investigates how to adapt standard classification rule learning approaches to subgroup discovery. The proposed modifications of classification rule learners can, in principle, be used to modify any rule learner using the covering algorithm for rule set construction. In this paper, we illustrate the approach by modifying the well-known CN2 rule learning algorithm (Clark and Niblett, 1989, Clark and Boswell, 1991). Alternatively, we could have modified RL (Lee et al., 1998), RIPPER (Cohen, 1995), SLIPPER (Cohen and Singer, 1999) or other more sophisticated classification rule learners. The reason for modifying CN2 is that other more sophisticated learners include advanced techniques that make them more effective in classification tasks, improving their classification accuracy. Improved classification accuracy is, however, not of ultimate interest for subgroup discovery, whose main goal is to find interesting population subgroups.

We have implemented the new subgroup discovery algorithm CN2-SD by modifying CN2 (Clark and Niblett, 1989, Clark and Boswell, 1991). The proposed approach performs subgroup discovery through the following modifications of CN2: (a) replacing the accuracy-based search heuristic with a new weighted relative accuracy heuristic that trades off generality and accuracy of the rule, (b) incorporating example weights into the covering algorithm, (c) incorporating example weights into the weighted relative accuracy search heuristic, and (d) using probabilistic classification based on the class distribution of covered examples by individual rules, both in the case of unordered rule sets and ordered decision lists. In addition, we have extended the ROC analysis framework to subgroup discovery and propose a set of measures appropriate for evaluating the quality of induced subgroups.

This paper presents the CN2-SD subgroup discovery algorithm, together with its experimental evaluation on 23 data sets of the UCI Repository of Machine Learning Databases (Murphy and Aha, 1994), as well as its application to a real world problem of traffic accident analysis. The experimental comparison with CN2 demonstrates that the subgroup discovery algorithm CN2-SD produces substantially smaller rule sets, where individual rules have higher coverage and significance. These three factors are important for subgroup discovery: smaller size enables better understanding, higher coverage means larger support, and higher significance means that rules describe discovered subgroups that have significantly different distributional characteristics compared to the entire population. The appropriateness for subgroup discovery is confirmed also by slight improvements in terms of the area under ROC curve, without decreasing predictive accuracy.

The paper is organized as follows. Section 2 introduces the background of this work, which includes the description of the CN2 rule learning algorithm, the weighted relative accuracy heuristic, and probabilistic classification of new examples. Section 3 presents the subgroup discovery algorithm CN2-SD by describing the necessary modifications of CN2. In Section 4 we discuss subgroup discovery from the perspective of ROC analysis. Section 5 presents a range of metrics used in the experimental evaluation of CN2-SD. Section 6 presents the results of experiments on selected UCI data sets as well as an application of CN2-SD on a real-life traffic accident data set. Related work is discussed in Section 7. Section 8 concludes by summarizing the main contributions and proposing directions for further work.

2. Background

This section presents the background of our work: the classical CN2 rule induction algorithm, including the covering algorithm for rule set construction, the standard CN2 heuristic, the weighted relative accuracy heuristic, and the probabilistic classification technique used in CN2.

2.1 The CN2 Rule Induction Algorithm

CN2 is an algorithm for inducing propositional classification rules (Clark and Niblett, 1989, Clark and Boswell, 1991). Induced rules have the form "if Cond then Class", where Cond is a conjunction of features (pairs of attributes and their values) and Class is the class value. In this paper we use the notation Class ← Cond.

CN2 consists of two main procedures: the bottom-level search procedure that performs beam search in order to find a single rule, and the top-level control procedure that repeatedly executes the bottom-level search to induce a rule set. The bottom-level performs beam search [1] using classification accuracy of the rule as a heuristic function. The accuracy of a propositional classification rule of the form Class ← Cond is equal to the conditional probability of class Class, given that condition Cond is satisfied:

Acc(Class ← Cond) = p(Class|Cond) = p(Class.Cond) / p(Cond).

Usually, this probability is estimated by the relative frequency n(Class.Cond) / n(Cond). [2] Different probability estimates, like the Laplace (Clark and Boswell, 1991) or the m-estimate (Cestnik, 1990, Dzeroski et al., 1993), can be used in CN2 for estimating the above probability. The standard CN2 algorithm used in this work uses the Laplace estimate, which is computed as (n(Class.Cond) + 1) / (n(Cond) + k), where k is the number of classes (for a two-class problem, k = 2).

CN2 can also apply a significance test to an induced rule. A rule is considered to be significant if it expresses a regularity unlikely to have occurred by chance. To test significance, CN2 uses the likelihood ratio statistic (Clark and Niblett, 1989) that measures the difference between the class probability distribution in the set of training examples covered by the rule and the class probability distribution in the set of all training examples (see Equation 2 in Section 5). The empirical evaluation in the work of Clark and Boswell (1991) shows that applying the significance test reduces the number of induced rules at a cost of slightly decreased predictive accuracy.

[1] CN2 constructs rules in a general-to-specific fashion, specializing only the rules in the beam (the best rules) by iteratively adding features to condition Cond. This procedure stops when no specialized rule can be added to the beam, because none of the specializations is more accurate than the rules in the beam.
[2] Here we use the following notation: n(Cond) stands for the number of instances covered by rule Class ← Cond, n(Class) stands for the number of examples of class Class, and n(Class.Cond) stands for the number of correctly classified examples (true positives). We use p(...) for the corresponding probabilities.
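The Laplace accuracy estimate and the greedy rule selection it drives can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it scores single-feature conditions only (real CN2 beam-searches conjunctions of features), and the toy data set and function names are invented for the example.

```python
def laplace(tp, covered, k):
    # Laplace estimate: (n(Class.Cond) + 1) / (n(Cond) + k), k = number of classes
    return (tp + 1) / (covered + k)

def best_single_condition(examples, target, k):
    # Score every attribute=value condition by Laplace accuracy for the target
    # class and return the best one (a stand-in for CN2's beam search).
    conditions = {(a, v) for ex, _ in examples for a, v in ex.items()}
    best, best_score = None, -1.0
    for attr, val in sorted(conditions):  # sorted for deterministic tie-breaking
        covered = [(ex, cls) for ex, cls in examples if ex.get(attr) == val]
        if not covered:
            continue
        tp = sum(1 for _, cls in covered if cls == target)
        score = laplace(tp, len(covered), k)
        if score > best_score:
            best, best_score = (attr, val), score
    return best, best_score

# Toy training set: (attribute dict, class) pairs.
animals = [({"beak": "yes", "legs": "2"}, "bird"),
           ({"beak": "yes", "legs": "2"}, "bird"),
           ({"beak": "no",  "legs": "4"}, "elephant")]
rule, acc = best_single_condition(animals, "bird", k=2)
print(rule, acc)  # ('beak', 'yes') with Laplace accuracy (2+1)/(2+2) = 0.75
```

Note how the Laplace correction pulls the estimate of a rule covering few examples toward 1/k, penalizing overly specific rules relative to the raw relative frequency.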
Two different top-level control procedures can be used in CN2. The first induces an ordered list of rules and the second an unordered set of rules. Both procedures add a default rule (providing for majority class assignment) as the final rule in the induced rule set. When inducing an ordered list of rules, the search procedure looks for the most accurate rule in the current set of training examples. The rule predicts the most frequent class in the set of covered examples. In order to prevent CN2 finding the same rule again, all the covered examples are removed before a new iteration is started at the top-level. The control procedure invokes a new search until all the examples are covered or no significant rule can be found. In the unordered case, the control procedure is iterated, inducing rules for each class in turn. For each induced rule, only covered examples belonging to that class are removed, instead of removing all covered examples as in the ordered case. The negative training examples (i.e., examples that belong to other classes) remain.

2.2 The Weighted Relative Accuracy Heuristic

Weighted relative accuracy (Lavrac et al., 1999, Todorovski et al., 2000) is a variant of rule accuracy that can be applied both in the descriptive and predictive induction framework; in this paper this heuristic is applied for subgroup discovery. Weighted relative accuracy, a reformulation of one of the heuristics used in EXPLORA (Klösgen, 1996) and MIDOS (Wrobel, 1997), is defined as follows:

WRAcc(Class ← Cond) = p(Cond) · (p(Class|Cond) − p(Class)).   (1)

Like most other heuristics used in subgroup discovery systems, weighted relative accuracy consists of two components, providing a tradeoff between rule generality (or the relative size of a subgroup, p(Cond)) and distributional unusualness or relative accuracy (the difference between rule accuracy p(Class|Cond) and default accuracy p(Class)). This difference can also be seen as the accuracy gain relative to the fixed rule Class ← true, which predicts that all instances belong to Class: a rule is interesting only if it improves upon this 'default' accuracy. Another aspect of relative accuracy is that it measures the difference between true positives and the expected true positives (expected under the assumption of independence of the left and right hand-side of a rule), i.e., the utility of connecting rule body Cond with a given rule head Class. However, it is easy to obtain high relative accuracy with highly specific rules, i.e., rules with low generality p(Cond). To this end, generality is used as a 'weight', so that weighted relative accuracy trades off generality of the rule (p(Cond), i.e., rule coverage) and relative accuracy (p(Class|Cond) − p(Class)).

In the work of Klösgen (1996), these quantities are referred to as g (generality), p (rule accuracy or precision) and p0 (default rule accuracy), and different tradeoffs between rule generality and precision in the so-called p-g (precision-generality) space are proposed. In addition to the function g(p − p0), which is equivalent to our weighted relative accuracy heuristic, other tradeoffs that reduce the influence of generality are proposed, e.g., √g · (p − p0) or √(g/(1 − g)) · (p − p0). Here, we favor the weighted relative accuracy heuristic, because it has an intuitive interpretation in ROC space, discussed in Section 4.

2.3 Probabilistic Classification

The induced rules can be ordered or unordered. Ordered rules are interpreted as a decision list (Rivest, 1987) in a straightforward manner: when classifying a new example, the rules are sequentially tried and the first rule that covers the example is used for prediction.
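The WRAcc heuristic of Equation (1), estimated by relative frequencies, is straightforward to compute from counts. The sketch below uses invented counts for illustration; it also checks the identity WRAcc = (TPr − FPr) · p(Class) · p(¬Class), which underlies the ROC-space interpretation discussed in Section 4.

```python
def wracc(n_cond, n_class_cond, n_class, N):
    # Equation (1): WRAcc = p(Cond) * (p(Class|Cond) - p(Class)),
    # with all probabilities estimated by relative frequencies.
    p_cond = n_cond / N
    return p_cond * (n_class_cond / n_cond - n_class / N)

# Invented example: N = 100 examples, 40 of them positive; a rule covers
# 20 examples, 16 of which are positive.
w = wracc(n_cond=20, n_class_cond=16, n_class=40, N=100)
print(round(w, 4))  # 0.2 * (0.8 - 0.4) = 0.08

# WRAcc is proportional to the vertical distance to the ROC diagonal:
# WRAcc = (TPr - FPr) * p(Class) * p(notClass)
tpr, fpr = 16 / 40, 4 / 60
assert abs(w - (tpr - fpr) * 0.4 * 0.6) < 1e-12
```

A highly specific rule covering 2 examples, both positive, has a larger relative accuracy (1.0 − 0.4 = 0.6) but a far smaller WRAcc (0.02 · 0.6 = 0.012), illustrating how the generality weight p(Cond) suppresses overly specific rules.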
if legs = 2 & feathers = yes then class = bird [13,0]
if beak = yes then class = bird [20,0]
if size = large & flies = no then class = elephant [2,10]

Table 1: A rule set consisting of two rules for class 'bird' and one rule for class 'elephant'.

In the case of unordered rule sets, the distribution of covered training examples among classes is attached to each rule. Rules of the form

if Cond then Class [ClassDistribution]

are induced, where the numbers in the ClassDistribution list denote, for each individual class, how many training examples of this class are covered by the rule. When classifying a new example, all rules are tried and those covering the example are collected. If a clash occurs (several rules with different class predictions cover the example), a voting mechanism is used to obtain the final prediction: the class distributions attached to the rules are summed to determine the most probable class. If no rule fires, the default rule is invoked to predict the majority class of training instances not covered by the other rules in the list.

Probabilistic classification is illustrated on a sample classification task, taken from Clark and Boswell (1991). Suppose we need to classify an animal which is two-legged, feathered, large, non-flying and has a beak [3], and the classification is based on the rule set listed in Table 1, formed of three probabilistic rules with the [bird, elephant] class distribution assigned to each rule (for simplicity, the rule set does not include the default rule). All rules fire for the animal to be classified, resulting in a [35,10] class distribution. As a result, the animal is classified as a bird.

[3] The animal being classified is a weka.

3. The CN2-SD Subgroup Discovery Algorithm

The main modifications of the CN2 algorithm, making it appropriate for subgroup discovery, involve the implementation of the weighted covering algorithm, incorporation of example weights into the weighted relative accuracy heuristic, probabilistic classification also in the case of the 'ordered' induction algorithm, and the area under ROC curve rule set evaluation. This section describes the CN2 modifications, while ROC analysis and a novel interpretation of the weighted relative accuracy heuristic in ROC space are given in Section 4.

3.1 Weighted Covering Algorithm

If used for subgroup discovery, one of the problems of standard rule learners, such as CN2 and RIPPER, is the use of the covering algorithm for rule set construction. The main deficiency of the covering algorithm is that only the first few induced rules may be of interest as subgroup descriptions with sufficient coverage and significance. In the subsequent iterations of the covering algorithm, rules are induced from biased example subsets, i.e., subsets including only positive examples that are not covered by previously induced rules, which inappropriately biases the subgroup discovery process.

As a remedy to this problem we propose the use of a weighted covering algorithm (Gamberger and Lavrac, 2002), in which the subsequently induced rules (i.e., rules induced in the later stages) also represent interesting and sufficiently large subgroups of the population. The weighted covering algorithm modifies the classical covering algorithm in such a way that covered positive examples are not deleted from the current training set. Instead, in each run of the covering loop, the algorithm stores with each example a count indicating how often (with how many rules) the example has been covered so far. Weights derived from these example counts then appear in the computation of WRAcc.

Initial weights of all positive examples ej equal 1, w(ej, 0) = 1. The initial example weight w(ej, 0) = 1 means that the example has not been covered by any rule, meaning 'please cover this example, since it has not been covered before', while lower weights, 0 ≤ w < 1, mean 'do not try too hard on this example'. Consequently, the examples already covered by one or more constructed rules decrease their weights, while the uncovered target class examples whose weights have not been decreased will have a greater chance to be covered in the following iterations of the algorithm. For a weighted covering algorithm to be used, we have to specify the weighting scheme, i.e., how the weight of each example decreases with the increasing number of covering rules. We have implemented the two weighting schemes described below.

3.1.1 Multiplicative Weights

In the first scheme, weights decrease multiplicatively. For a given parameter 0 < g < 1, weights of covered positive examples decrease as follows: w(ej, i) = g^i, where w(ej, i) is the weight of example ej being covered by i rules. Note that the weighted covering algorithm with g = 1 would result in finding the same rule over and over again, whereas with g = 0 the algorithm would perform the same as the standard CN2 covering algorithm.

3.1.2 Additive Weights

In the second scheme, weights of covered positive examples decrease according to the formula w(ej, i) = 1/(i + 1). In the first iteration all target class examples contribute the same weight w(ej, 0) = 1, while in the following iterations the contributions of examples are inversely proportional to their coverage by previously induced rules.

3.2 Modified WRAcc Heuristic with Example Weights

The modification of CN2 reported in the work of Todorovski et al. (2000) affected only the heuristic function: weighted relative accuracy was used as a search heuristic, instead of the accuracy heuristic of the original CN2, while everything else remained the same. In this work, the heuristic function is further modified to handle example weights, which provide the means to consider different parts of the example space in each iteration of the weighted covering algorithm.

In the WRAcc computation (Equation 1) all probabilities are computed by relative frequencies. An example weight measures how important it is to cover this example in the next iteration. The modified WRAcc measure is then defined as follows:

WRAcc(Class ← Cond) = (n'(Cond)/N') · (n'(Class.Cond)/n'(Cond) − n'(Class)/N').
In this equation, N' is the sum of the weights of all examples, n'(Cond) is the sum of the weights of all covered examples, and n'(Class.Cond) is the sum of the weights of all correctly covered examples. To add a rule to the generated rule set, the rule with the maximum WRAcc measure is chosen out of those rules in the search space which are not yet present in the rule set produced so far (all rules in the final rule set are thus distinct, without duplicates).

3.3 Probabilistic Classification

Each CN2 rule returns a class distribution in terms of the number of examples covered, as distributed over classes. The CN2 algorithm uses the class distribution in classifying unseen instances only in the case of unordered rule sets, where rules are induced separately for each class. In the case of ordered decision lists, the first rule that fires provides the classification. In our modified CN2-SD algorithm, also in the ordered case all applicable rules are taken into account, hence probabilistic classification is used in both classifiers. This means that the terminology 'ordered' and 'unordered', which in CN2 distinguished between decision list and rule set induction, has a different meaning in our setting: the 'unordered' algorithm refers to learning classes one by one, while the 'ordered' algorithm refers to finding best rule conditions and assigning the majority class in the rule head.

Note that CN2-SD does not use the same probabilistic classification scheme as CN2. Unlike CN2, where the rule class distribution is computed in terms of the numbers of examples covered, CN2-SD treats the class distribution in terms of probabilities (computed by the relative frequency estimate).

if legs = 2 & feathers = yes then class = bird [1,0]
if beak = yes then class = bird [1,0]
if size = large & flies = no then class = elephant [0.17,0.83]

Table 2: The rule set of Table 1 as treated by CN2-SD.

Table 2 presents the three rules of Table 1 with the class distribution expressed with probabilities. A two-legged, feathered, large, non-flying animal with a beak is again classified as a bird, but now the probabilities are averaged (instead of summing the numbers of examples), resulting in the final probability distribution [0.72, 0.28]. By using this voting scheme the subgroups covering a small number of examples are not so heavily penalized (as is the case in CN2) when classifying a new example.

3.4 CN2-SD Implementation

Two variants of CN2-SD have been implemented. The CN2-SD subgroup discovery algorithm used in the experiments in this paper is implemented in C and runs on a number of UNIX platforms. Its predecessor, used in the experiments reported by Lavrac et al. (2002), is implemented in Java and incorporated in the WEKA data mining environment (Witten and Frank, 1999). The C implementation is more efficient and less restrictive than the Java implementation, which is limited to binary class problems and to discrete attributes.

Figure 1: The ROC space with FPr on the X axis and TPr on the Y axis. The solid line connecting seven optimal subgroups is the ROC convex hull. B1 and B2 denote suboptimal subgroups (marked by x). The dotted line, the diagonal connecting points (0,0) and (100,100), indicates positions of insignificant rules with zero relative accuracy.

4. ROC Analysis for Subgroup Discovery

In this section we describe how ROC (Receiver Operating Characteristic) analysis (Provost and Fawcett, 2001) can be used to understand subgroup discovery and to visualize and evaluate discovered subgroups. A point in ROC space shows classifier performance in terms of false alarm or false positive rate FPr = FP/(TN + FP) (plotted on the X-axis), and sensitivity or true positive rate TPr = TP/(TP + FN) (plotted on the Y-axis). In terms of the expressions introduced in Sections 2.1 and 2.2, TP (true positives), FP (false positives), TN (true negatives) and FN (false negatives) can be expressed as TP = n(Class.Cond), FP = n(¬Class.Cond), TN = n(¬Class.¬Cond) and FN = n(Class.¬Cond), where ¬Class and ¬Cond stand for the negations of Class and Cond, respectively.

The ROC space is appropriate for measuring the success of subgroup discovery, since rules/subgroups whose TPr/FPr tradeoff is close to the diagonal can be discarded as insignificant. Conversely, significant rules/subgroups are those sufficiently distant from the diagonal. Significant rules define the points in ROC space from which a convex hull can be constructed. The best rules define the ROC convex hull. Figure 1 shows seven rules on the convex hull, while two rules B1 and B2 below the convex hull (marked by x) are of lower quality.
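The weighting schemes of Section 3.1 and the weighted WRAcc of Section 3.2 can be sketched as follows. This is an illustration with an invented toy population, not the authors' C implementation; in the real algorithm the rule search is re-run with the updated weights after each induced rule.

```python
def multiplicative_weight(i, gamma=0.9):
    # Multiplicative scheme: w(e_j, i) = gamma^i for an example covered by
    # i rules, with parameter 0 < gamma < 1.
    return gamma ** i

def additive_weight(i):
    # Additive scheme: w(e_j, i) = 1 / (i + 1).
    return 1.0 / (i + 1)

def weighted_wracc(weights, covered, positives):
    # Modified WRAcc: the counts n'(.) and N' are sums of example weights.
    N1 = sum(weights.values())
    n_cond = sum(weights[e] for e in covered)
    n_class_cond = sum(weights[e] for e in covered & positives)
    n_class = sum(weights[e] for e in positives)
    return (n_cond / N1) * (n_class_cond / n_cond - n_class / N1)

# Invented population: examples 0..99, the first 40 positive; a rule covers
# positives 0..15 and negatives 40..43.
positives = set(range(40))
covered = set(range(16)) | set(range(40, 44))
uniform = {e: 1.0 for e in range(100)}
print(weighted_wracc(uniform, covered, positives))  # ~0.08, the unweighted WRAcc

# After one rule has covered positives 0..15, their weight drops, so the
# same rule scores lower and the search is steered to uncovered examples.
reweighted = dict(uniform)
for e in range(16):
    reweighted[e] = multiplicative_weight(1, gamma=0.5)
print(weighted_wracc(reweighted, covered, positives) < 0.08)  # True
```

With uniform weights the modified measure reduces to Equation (1), which is why the same search procedure can be reused unchanged.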
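Returning to probabilistic classification: the CN2 count-summing vote (Table 1) and the CN2-SD frequency-averaging vote (Table 2) can be reproduced in a few lines (a sketch; the helper names are ours, not part of either system).

```python
def cn2_vote(distributions):
    # CN2: sum the raw covered-example counts of all firing rules.
    return [sum(col) for col in zip(*distributions)]

def cn2sd_vote(distributions):
    # CN2-SD: convert each rule's counts to relative frequencies, then average,
    # so small subgroups are not drowned out by large ones.
    freqs = [[c / sum(dist) for c in dist] for dist in distributions]
    n = len(freqs)
    return [sum(col) / n for col in zip(*freqs)]

# The three rules of Table 1 that fire for the weka, as [bird, elephant] counts.
firing = [[13, 0], [20, 0], [2, 10]]
print(cn2_vote(firing))                           # [35, 10] -> bird
print([round(p, 2) for p in cn2sd_vote(firing)])  # [0.72, 0.28] -> still bird
```

Note that the elephant rule contributes 10 of 45 votes under CN2 but only a third of its 0.83 frequency under CN2-SD, which is exactly the reduced penalty on small subgroups described in Section 3.3.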
LAVRACETAL.4.1TheInterpretationofWeightedRelativeAccuracyinROCSpaceWeightedrelativeaccuracyisappropriateformeasuringthequalityofasinglesubgroup,becauseitisproportionaltothedistancefromthediagonalinROCspace.4Toseethatthisholds,noterstthatruleaccuracyp(ClassjCond)isproportionaltotheanglebetweentheX-axisandthelineconnectingtheoriginwiththepointdepictingtheruleintermsofitsTPr=FPrtradeoffinROCspace.So,forinstance,theX-axishasalwaysruleaccuracy0(thesearepurelynegativesubgroups),theY-axishasalwaysruleaccuracy1(purelypositivesubgroups),andthediagonalrepresentssubgroupswithruleaccuracyp(Class),thepriorprobabilityofthepositiveclass.Consequently,allpointonthediagonalrepresentinsignicantsubgroups.Relativeaccuracy,p(ClassjCond)p(Class),re-normalizesthissuchthatallpointsonthediagonalhaverelativeaccuracy0,allpointsontheY-axishaverelativeaccuracy1p(Class)=p(Class)(thepriorprobabilityofthenegativeclass),andallpointsontheX-axishaverelativeaccuracyp(Class).NoticethatallpointsonthediagonalalsohaveWRAcc=0.Intermsofsubgroupdiscovery,thediagonalrepresentsallsubgroupswiththesametargetdistributionasthewholepopulation;onlythegeneralityofthese`average'subgroupsincreaseswhenmovingfromlefttorightalongthediagonal.Thisinterpretationisslightlydifferentinclassierlearning,wherethediagonalrepresentsrandomclassiersthatcanbeconstructedwithoutanytraining.Moregenerally,WRAccisometricslieonstraightlinesparalleltothediagonal(Flach,2003,F¨urnkranzandFlach,2003).Consequently,apointonthelineTPr=FPr+a,whereaistheverticaldistanceofthelinetothediagonal,hasWRAcc=a:p(Class)p(Class).Thus,givenaxedclassdistribution,WRAccisproportionaltotheverticaldistanceatothediagonal.Infact,thequantityTPrFPrwouldbeanalternativequalitymeasureforsubgroups,withtheadditionaladvantagethatitallowsforcomparisonofsubgroupsfrompopulationswithdifferentclassdistributions.4.2MethodsforConstructingROCCurvesandAUCEvaluationSubgroupsobtainedbyCN2-SDcanbeevaluatedinROCspaceintwodifferentways.4.2.1AUC-METHOD-1Therstmethodtreatseachruleasaseparates
ubgroupwhichisplottedinROCspaceintermsofitstrueandfalsepositiverates(TPrandFPr).Wethengeneratetheconvexhullofthissetofpoints,selectingthesubgroupswhichperformoptimallyunderaparticularrangeofoperatingcharacteristics.TheareaunderthisROCconvexhull(AUC)indicatesthecombinedqualityoftheoptimalsubgroups,inthesensethatitdoesevaluatewhetheraparticularsubgrouphasanythingtoaddinthecontextofalltheothersubgroups.However,thismethoddoesnottakeaccountofanyoverlapbetweensubgroups,andsubgroupsnotontheconvexhullaresimplyignored.Figure2presentstwoROCcurves,showingtheperformanceofCN2andCN2-SDalgorithmsontheAustralianUCIdataset.4.2.2AUC-METHOD-2Thesecondmethodemploysthecombinedprobabilisticclassicationsofallsubgroups,asindi-catedbelow.Ifwealwayschoosethemostlikelypredictedclass,thiscorrespondstosettingaxedthreshold0.5onthepositiveprobability(theprobabilityofthetargetclass):ifthepositiveprobabil-4.SomeofthereasoningsupportingthisclaimisfurtherdiscussedinthelasttwoparagraphsofSection5.1.162 SUBGROUPDISCOVERYWITHCN2-SDFigure2:ExampleROCcurves(AUC-Method-1)ontheAustralianUCIdataset:thesolidcurveforthestandardCN2classicationrulelearner,andthedottedcurveforCN2-SDityislargerthanthisthresholdwepredictpositive,elsenegative.TheROCcurvecanbeconstructedbyvaryingthisthresholdfrom1(allpredictionsnegative,correspondingto(0,0)inROCspace)to0(allpredictionspositive,correspondingto(1,1)inROCspace).Thisresultsinn+1pointsinROCspace,wherenisthetotalnumberofclassiedexamples(testinstances).Equivalently,wecanorderalltheclassiedexamplesbydecreasingpredictedprobabilityofbeingpositive,andtracingtheROCcurvebystartingin(0,0),steppingupwhentheexampleisactuallypositiveandsteppingtotherightwhenitisnegative,untilwereach(1,1).5Eachpointonthiscurvecorrespondstoaclassierdenedbyapossibleprobabilitythreshold,asopposedtoAUC-Method-1,whereapointontheROCcurvecorrespondstooneoftheoptimalsubgroups.TheROCcurvedepictsasetofclassiers,whereastheareaunderthisROCcurveindicatesthecombinedqualityofallsubgroups(i.e.,thequalityoftheentireru
le set). This method can be used with a test set or in cross-validation, but the resulting curve is not necessarily convex.6

Figure 3 presents two ROC curves, showing the performance of the CN2 and CN2-SD algorithms on the Australian UCI data set. It is apparent from this figure that CN2 is badly overfitting on this data set, because almost all of its ROC curve is below the diagonal. This is because CN2 has learned many overly specific rules, which bias the predicted probabilities. These overly specific rules are visible in Figure 2 as points close to the origin.

5. In the case of ties, we make the appropriate number of steps up and to the right at once, drawing a diagonal line segment.
6. A description of this method applied to decision tree induction can be found in the paper by Ferri-Ramírez et al. (2002).

Figure 3: Example ROC curves (AUC-Method-2) on the Australian UCI data set: the solid curve for the standard CN2 classification rule learner, and the dotted curve for CN2-SD.

4.2.3 COMPARISON OF THE TWO AUC METHODS

Which of the two methods is more appropriate for subgroup discovery is open for debate. The second method seems more appropriate if the discovered subgroups are intended to be applied also in the predictive setting, as a rule set (a model) used for classification. Its advantage is also that it is easier to apply cross-validation. In the experimental evaluation in Section 6 we use AUC-Method-2 in the comparison of the predictive performance of rule learners.

An argument in favor of using AUC-Method-1 for subgroup evaluation is based on the observation that AUC-Method-1 suggests eliminating, from the induced set of subgroup descriptions, those rules which are not on the ROC convex hull. This seems appropriate, as the 'best' subgroups according to the WRAcc evaluation measure are the subgroups most distant from the ROC diagonal. However, disjoint subgroups, either on or close to the convex hull, should not be eliminated, as (due to disjoint coverage and possibly different symbolic descriptions) they may represent interesting subgroups, regardless of the fact that there is another 'better' subgroup on the ROC convex hull with a similar TPr/FPr tradeoff.

Notice that the area under the ROC curve (AUC-Method-1) cannot be used as a predictive quality measure when comparing different subgroup miners, because it does not take into account the overlapping structure of subgroups. An argument against the use of this measure is here elaborated through a simple example.7 Consider for instance two subgroup mining results, of say 3 subgroups in each resulting rule set. The first result set consists of three disjoint subgroups of equal size that together cover all the examples of the selected Class value and have 100% accuracy. Thus these three subgroups are a perfect classifier for the Class value. In ROC space the three subgroups collapse at the point (0, 1/3). The second result set consists of three equal subgroups (having a maximum overlap: with different descriptions, but equal extensions), also with 100% accuracy and covering one third of the class examples. Clearly the first result is better, but the representation of the results in ROC space (and the area under the ROC curve) is the same for both cases.

7. We are grateful to the anonymous reviewer who provided this illustrative example.

5. Subgroup Evaluation Measures

In this section we distinguish between predictive and descriptive evaluation measures, which is in line with the distinction between predictive induction and descriptive induction made in Section 1. Descriptive measures are used to evaluate the quality of individual rules (individual patterns). These quality measures are the most appropriate for subgroup discovery, as the task of subgroup discovery is to induce individual patterns of interest. Predictive measures are used in addition to descriptive measures just to show that the CN2-SD subgroup discovery mechanisms perform well also in the predictive induction setting, where the goal is to induce a classifier.

5.1 Descriptive Measures of Rule Interestingness

Descriptive measures of rule interestingness evaluate each individual subgroup and are thus appropriate for evaluating the success of subgroup discovery. The proposed quality measures compute the average over the induced set of subgroup descriptions, which enables the comparison between different algorithms.

Coverage. The average coverage measures the percentage of examples covered on average by one rule of the induced rule set. Coverage of a single rule R_i is defined as

    Cov(R_i) = Cov(Class <- Cond_i) = p(Cond_i) = n(Cond_i)/N.

The average coverage of a rule set is computed as

    COV = (1/n_R) * sum_{i=1..n_R} Cov(R_i),

where n_R is the number of induced rules.

Support. For subgroup discovery it is interesting to compute the overall support (the target coverage) as the percentage of target examples (positives) covered by the rules, computed as the true positive rate for the union of subgroups. Support of a rule is defined as the frequency of correctly classified covered examples:

    Sup(R_i) = Sup(Class <- Cond_i) = p(Class . Cond_i) = n(Class . Cond_i)/N.

The overall support of a rule set is computed as

    SUP = (1/N) * sum_{Class_j} n(Class_j . (OR_{Class_j <- Cond_i} Cond_i)),

where the examples covered by several rules are counted only once (hence the disjunction of rule conditions of rules with the same Class_j value in the rule head).

Size. Size is a measure of complexity (the syntactical complexity of induced rules). The rule set size is computed as the number of rules in the induced rule set (including the default rule):

    SIZE = n_R.

In addition to rule set size used in this paper, complexity could be measured also by the average number of rules/subgroups per class, and the average number of features per rule.

Significance. Average rule significance is computed in terms of the likelihood ratio of a rule, normalized with the likelihood ratio of the significance threshold (99%); the average is computed over all the rules. Significance (or evidence, in the terminology of Klösgen, 1996) indicates how significant a finding is, if measured by this statistical criterion. In the CN2 algorithm (Clark and Niblett, 1989), significance is measured in terms of the likelihood ratio statistic of a rule as follows:

    Sig(R_i) = Sig(Class <- Cond_i)
             = 2 * sum_j n(Class_j . Cond_i) * log( n(Class_j . Cond_i) / (n(Class_j) * p(Cond_i)) )    (2)

where for each class Class_j, n(Class_j . Cond_i) denotes the number of instances of Class_j in the set where the rule body holds true, n(Class_j) is the number of Class_j instances, and p(Cond_i) (i.e., rule coverage, computed as n(Cond_i)/N) plays the role of a normalizing factor. Note that although for each generated subgroup description one class is selected as the target class, the significance criterion measures the distributional unusualness unbiased to any particular class; as such, it measures the significance of the rule condition only. The average significance of a rule set is computed as

    SIG = (1/n_R) * sum_{i=1..n_R} Sig(R_i).

Unusualness. Average rule unusualness is computed as the average WRAcc over all the rules:

    WRACC = (1/n_R) * sum_{i=1..n_R} WRAcc(R_i).

As discussed in Section 4.1, WRAcc is appropriate for measuring the unusualness of separate subgroups, because it is proportional to the vertical distance from the diagonal in ROC space (see the underlying reasoning in Section 4.1). As WRAcc is proportional to the distance to the diagonal in ROC space, WRAcc also reflects rule significance: the larger the WRAcc, the more significant the rule, and vice versa. As both WRAcc and rule significance measure the distributional unusualness of a subgroup, they are the most important quality measures for subgroup discovery. However, while significance only measures distributional unusualness, WRAcc also takes rule coverage into account; therefore we consider unusualness, computed by the average WRAcc, to be the most appropriate measure for subgroup quality evaluation. As pointed out in Section 4.1, the quantity TPr - FPr could be an alternative quality measure for subgroups, with the additional advantage that we can use it to compare subgroups from populations with different class distributions. However, in this paper we are only concerned with comparing subgroups from the same population, and we prefer WRAcc because of its 'p-g' (precision-generality) interpretation, which is particularly suitable for subgroup discovery.
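The descriptive measures above can be made concrete with a short sketch. The following Python code (illustrative only, not part of the original paper; the function and variable names are our own) computes coverage, support, WRAcc and the likelihood-ratio statistic of Equation (2) for a single rule represented as a boolean coverage mask over the training examples:

```python
from math import log

def rule_measures(cond, y, cls, n_classes):
    """Descriptive measures for one rule 'IF cond THEN cls'.

    cond: list of booleans, True where the rule body holds
    y:    list of integer class labels, one label per example
    Returns (coverage, support, WRAcc, likelihood-ratio significance)."""
    N = len(y)
    covered = sum(cond)
    cov = covered / N                                # Cov(R) = n(Cond)/N
    hits = sum(1 for c, yi in zip(cond, y) if c and yi == cls)
    sup = hits / N                                   # Sup(R) = n(Class.Cond)/N
    p_class = sum(1 for yi in y if yi == cls) / N    # prior p(Class)
    p_cls_cond = hits / covered if covered else 0.0  # p(Class|Cond)
    # WRAcc(R) = p(Cond) * (p(Class|Cond) - p(Class))
    wracc = cov * (p_cls_cond - p_class)
    sig = 0.0                                        # likelihood ratio, Equation (2)
    for j in range(n_classes):
        n_j = sum(1 for c, yi in zip(cond, y) if c and yi == j)
        e_j = sum(1 for yi in y if yi == j) * cov    # expected count n(Class_j) * p(Cond)
        if n_j:
            sig += n_j * log(n_j / e_j)
    return cov, sup, wracc, 2.0 * sig
```

Averaging these values over all rules of an induced rule set yields COV, WRACC and SIG as defined above; SUP additionally requires counting each example only once across overlapping rules.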
5.2 Predictive Measures of Rule Set Classification Performance

Predictive measures evaluate a rule set, interpreting a set of subgroup descriptions as a predictive model. Despite the fact that optimizing accuracy is not the intended goal of subgroup discovery algorithms, these measures can be used in order to provide a comparison of CN2-SD with standard classification rule learners.

Predictive accuracy. The percentage of correctly predicted instances. For a binary classification problem, rule set accuracy is computed as follows:

    ACC = (TP + TN) / (TP + TN + FP + FN).

Note that ACC measures the accuracy of the whole rule set on both positive and negative examples, while rule accuracy (defined as Acc(Class <- Cond) = p(Class|Cond)) measures the accuracy of a single rule on positives only.

Area under ROC curve. The AUC-Method-2, described in Section 4.2, applicable to rule sets, is selected as the evaluation measure. It interprets a rule set as a probabilistic model, given all the different probability thresholds as defined through the probabilistic classification of test instances.

6. Experimental Evaluation

For subgroup discovery, expert evaluation of results is of ultimate interest. Nevertheless, before applying the proposed approach to a particular problem of interest, we wanted to verify our claims that the mechanisms implemented in the CN2-SD algorithm are indeed appropriate for subgroup discovery. For this purpose we tested it on selected UCI data sets. In this paper we use the same data sets as in the work of Todorovski et al. (2000). We have applied CN2-SD also to a real-life problem of traffic accident analysis; these results were evaluated also by the expert.

6.1 The Experimental Setting

To test the applicability of CN2-SD to the subgroup discovery task, we compare its performance with the performance of the standard CN2 classification rule learning algorithm (referred to as CN2-standard, and described in the work of Clark and Boswell, 1991) as well as with the CN2 algorithm using WRAcc (CN2-WRAcc, described by Todorovski et al., 2000). In this comparative study all the parameters of the CN2 algorithm are set to their default values (beam-size = 5, significance-threshold = 99%). The results of the CN2-SD algorithm are computed using both multiplicative weights (with γ = 0.5, 0.7, 0.9)8 and additive weights. We estimate the
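Since AUC-Method-2 scores a rule set through the ranking that its probabilistic classifications induce on the test instances, the area can equivalently be computed as the probability that a randomly drawn positive is ranked above a randomly drawn negative, with ties counting one half. The following sketch (illustrative code, not from the paper; it assumes the probabilistic classifications are already available as per-instance scores) computes the AUC that way:

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve for a binary problem.

    scores: predicted probability of the positive class per test instance
    labels: 1 for positive, 0 for negative
    Counts, over all positive/negative pairs, how often the positive is
    ranked higher; a tie contributes 1/2 (a diagonal ROC segment)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

This pairwise formulation gives the same area as stepping through the instances ordered by decreasing predicted probability and tracing the curve, which is how AUC-Method-2 is described in Section 4.2.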
performance of the algorithms using stratified 10-fold cross-validation. The obtained estimates are presented in terms of their average values and standard deviations. Statistical significance of the difference in performance compared to CN2-standard is tested using the paired t-test (exactly the same folds are used in all comparisons) with a significance level of 95%: bold font and ↑ to the right of a result in all the tables means that the algorithm is significantly better than CN2-standard, while ↓ means it is significantly worse. The same paired t-test is used to compare the different versions of our algorithm with CN2-standard over all the data sets.

8. Results obtained with γ = 0.7 are presented in the tables of Appendix A but not in the main part of the paper.

     Dataset       #Att.  #D.att.  #C.att.  #Class  #Ex.   Maj.Class(%)
 1   australian      14      8        6       2       690      56
 2   breast-w         9      9        0       2       699      66
 3   bridges-td       7      4        3       2       102      85
 4   chess           36     36        0       2      3196      52
 5   diabetes         8      0        8       2       768      65
 6   echo             6      1        5       2       131      67
 7   german          20     13        7       2      1000      70
 8   heart           13      6        7       2       270      56
 9   hepatitis       19     13        6       2       155      79
10   hypothyroid     25     18        7       2      3163      95
11   ionosphere      34      0       34       2       351      64
12   iris             4      0        4       3       150      33
13   mutagen         59     57        2       2       188      66
14   mutagen-f       57     57        0       2       188      66
15   tic-tac-toe      9      9        0       2       958      65
16   vote            16     16        0       2       435      61
17   balance          4      0        4       3       625      46
18   car              6      6        0       4      1728      70
19   glass            9      0        9       6       214      36
20   image           19      0       19       7      2310      14
21   soya            35     35        0      19       683      13
22   waveform        21      0       21       3      5000      34
23   wine            13      0       13       3       178      40

Table 3: Properties of the UCI data sets.

6.2 Experiments on UCI Data Sets

We experimentally evaluate our approach on 23 data sets from the UCI Repository of Machine Learning Databases (Murphy and Aha, 1994). Table 3 gives an overview of the selected data sets in terms of the number of attributes (total, discrete, continuous), the number of classes, the number of examples, and the percentage of examples of the majority class. These data sets have been widely used in other comparative studies (Todorovski et al., 2000). We have divided the data sets in two groups (Table 3), those with two classes (binary data sets 1-16) and those with more than two classes (multi-class data sets 17-23). This distinction is made as ROC analysis is applied only on binary data sets.9

6.2.1 RESULTS OF THE UNORDERED CN2-SD

Tables 4 and 5 present summary results of the UCI experiments, while details can be found in Tables 14-20 in Appendix A. For each performance measure, the summary table shows the average value over all the data sets, the significance of the results compared to CN2-standard (p-value), win/loss/draw in terms of the number of data sets in which the results are better/worse/equal compared with CN2-standard, as well as the number of significant wins and losses.

9. This is a simplification (as multi-class AUC could also be computed as the average of AUCs computed by comparing all pairs of classes (Hand and Till, 2001)) that still provides sufficient evidence to support the claims of this paper.

Performance Measure     Data Sets  CN2-standard  CN2-WRAcc    CN2-SD(γ=0.5)  CN2-SD(γ=0.9)  CN2-SD(add.)  Detailed Results
Coverage (COV)             23      0.131±0.14    0.311±0.17   0.403±0.23     0.450±0.26     0.486±0.30    Table 14
  significance (p-value)                 0.000        0.000          0.000          0.000
  win/loss/draw                          22/1/0       22/1/0         23/0/0         22/1/0
  sig. win/sig. loss                     21/1         22/0           22/0           21/1
Support (SUP)              23      0.84±0.03     0.85±0.03    0.90±0.06      0.92±0.06      0.91±0.06     Table 15
  significance (p-value)                 0.637        0.000          0.000          0.001
  win/loss/draw                          13/10/0      18/5/0         20/3/0         16/7/0
  sig. win/sig. loss                     5/4          13/1           18/0           13/1
Size (SIZE)                23      18.18±21.77   6.15±4.49    6.25±4.42      6.49±4.57      6.35±4.58     Table 16
  significance (p-value)                 0.006        0.007          0.007          0.007
  win/loss/draw                          22/1/0       22/1/0         20/3/0         23/0/0
  sig. win/sig. loss                     22/0         21/0           19/2           18/0
Significance (SIG)         23      2.11±0.46     8.97±4.66    15.57±6.05     16.92±8.90     18.47±9.00    Table 17
  significance (p-value)                 0.000        0.000          0.000          0.000
  win/loss/draw                          22/1/0       23/0/0         22/1/0         23/0/0
  sig. win/sig. loss                     21/0         23/0           21/0           23/0
Unusualness (WRACC)        23      0.017±0.02    0.056±0.05   0.079±0.06     0.088±0.06     0.092±0.07    Table 18
  significance (p-value)                 0.001        0.000          0.000          0.000
  win/loss/draw                          20/1/2       22/1/0         22/1/0         22/1/0
  sig. win/sig. loss                     19/1         21/1           21/1           21/1

Table 4: Summary of the experimental results on the UCI data sets (descriptive evaluation measures) for different variants of the unordered algorithm using 10-fold stratified cross-validation. The best results are shown in boldface.

The analysis shows that if multiplicative weights are used, most results improve with an increased value of the γ parameter. As in most cases the best CN2-SD variants are CN2-SD with γ = 0.9 and with additive weights, and as using additive weights is the simpler method, the additive weights setting is recommended as the default for experimental use.

The summary of results in terms of descriptive measures of interestingness is as follows.

- In terms of the average coverage per rule, CN2-SD produces rules with significantly higher coverage (the higher the coverage the better the rule) than both CN2-WRAcc and CN2-standard. The coverage is increased by increasing the γ parameter, and the best results are achieved by γ = 0.9 and by additive weights.

- CN2-SD induces rule sets with significantly larger overall support than CN2-standard, meaning that it covers a higher percentage of target examples (positives), thus leaving a smaller number of examples unclassified.10

- CN2-WRAcc and CN2-SD induce rule sets that are significantly smaller than CN2-standard (smaller rule sets are better), while the rule sets of CN2-WRAcc and CN2-SD are comparable, despite the fact that CN2-SD uses weights to 'recycle' examples and thus produces overlapping rules.

10. CN2 handles the unclassified examples by classifying them using the default rule, the rule predicting the majority class.

Performance Measure     Data Sets  CN2-standard  CN2-WRAcc    CN2-SD(γ=0.5)  CN2-SD(γ=0.9)  CN2-SD(add.)  Detailed Results
Accuracy (ACC)             23      81.61±11.66   78.12±16.28  80.92±16.04    81.07±15.78    79.36±16.24   Table 19
  significance (p-value)                 0.150        0.771          0.818          0.344
  win/loss/draw                          10/12/1      17/6/0         19/4/0         15/8/0
  sig. win/sig. loss                     3/5          9/4            10/4           7/4
AUC-Method-2 (AUC)         16      82.16±16.81   84.37±9.87   86.75±8.95     86.39±10.32    86.33±8.60    Table 20
  significance (p-value)                 0.563        0.175          0.236          0.236
  win/loss/draw                          6/9/1        10/6/0         9/7/0          10/6/0
  sig. win/sig. loss                     5/5          6/2            7/4            6/3

Table 5: Summary of the experimental results on the UCI data sets (predictive evaluation measures) for different variants of the unordered algorithm using 10-fold stratified cross-validation. The best results are shown in boldface.

- CN2-SD induces significantly better rules in terms of rule significance (rules with higher significance are better), computed by the average likelihood ratio: while the ratios achieved by CN2-standard are already significant at the 99% level, this is further pushed up by CN2-SD, with maximum values achieved by additive weights. An interesting question, to be verified in further experiments, is whether the weighted versions of the CN2 algorithm improve the significance of the induced subgroups also in the case when CN2 rules are induced without applying the significance test.

- In terms of rule unusualness, which is of ultimate interest to the subgroup discovery task, CN2-SD produces rules with significantly higher average weighted relative accuracy than CN2-standard. Like in the case of average coverage per rule, the unusualness is increased by increasing the γ parameter, and the best results are achieved by γ = 0.9 and by additive weights. Note that the unusualness of a rule, computed by its WRAcc, is a combination of rule accuracy, coverage and the prior probability of the target class.

In terms of predictive measures of classification performance, the results can be summarized as follows.

- CN2-SD improves the accuracy in comparison with CN2-WRAcc and performs comparably to CN2-standard (the difference is insignificant). Notice however that while optimizing predictive accuracy is the ultimate goal of CN2, for CN2-SD the goal is to optimize the coverage/relative-accuracy tradeoff.

- In the computation of the area under ROC curve (AUC-Method-2), due to the restriction of this method to binary class data sets, only 16 binary data sets are used in the comparisons. Notice that CN2-SD improves the area under ROC curve compared to CN2-WRAcc and compared to CN2-standard, but the differences are not significant. The area under ROC curve, however, seems not to be affected by the parameter γ or by the weighting approach of CN2-SD. AUC performance is also illustrated by means of the results on the Australian UCI data set in Figures 2 and 3 of Section 4.2. The solid lines in these graphs indicate ROC curves obtained by CN2-standard, while the dotted lines represent ROC curves for CN2-SD with additive weights.
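The multiplicative and additive schemes compared above assign each training example a weight that decays with the number k of already induced rules covering it: γ^k in the multiplicative case and 1/(k+1) in the additive case, as defined in the description of the weighted covering algorithm earlier in the paper (outside this excerpt). A minimal sketch of the two decay schedules (illustrative code; the function name is our own):

```python
def example_weight(k, scheme="additive", gamma=0.9):
    """Weight of a training example already covered by k induced rules.

    Under weighted covering, covered examples are not removed (as in the
    classical covering algorithm) but down-weighted, so later rules may
    overlap earlier ones while still being steered toward examples that
    no rule covers yet."""
    if scheme == "multiplicative":
        return gamma ** k          # w(e, k) = gamma^k, with 0 < gamma < 1
    return 1.0 / (k + 1)           # w(e, k) = 1 / (k + 1)

# Weight decay after 0..4 covering rules, additive vs multiplicative (gamma = 0.5):
additive = [example_weight(k) for k in range(5)]
multiplicative = [example_weight(k, "multiplicative", 0.5) for k in range(5)]
```

With γ = 1 the multiplicative scheme would never down-weight examples, while small γ quickly suppresses already covered examples; the additive scheme decays more slowly, which matches the observation above that additive weights tend to give the highest coverage.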
6.2.2 RESULTS OF THE ORDERED CN2-SD

For completeness, the results for different versions of the ordered algorithm are summarized in Tables 6 and 7, without giving the results for individual data sets in Appendix A. In our view, the unordered CN2-SD algorithm is more appropriate for subgroup discovery than the ordered variant, as it induces a set of rules for each target class in turn.

Performance Measure     Data Sets  CN2-standard  CN2-WRAcc    CN2-SD(γ=0.5)  CN2-SD(γ=0.9)  CN2-SD(add.)
Coverage (COV)             23      0.174±0.18    0.351±0.18   0.439±0.25     0.420±0.23     0.527±0.32
  significance (p-value)                 0.000        0.000          0.000          0.000
  win/loss/draw                          21/2/0       23/0/0         23/0/0         22/1/0
  sig. win/sig. loss                     20/1         22/0           22/0           22/1
Support (SUP)              23      0.85±0.03     0.85±0.03    0.87±0.05      0.91±0.05      0.90±0.06
  significance (p-value)                 0.694        0.026          0.000          0.000
  win/loss/draw                          12/11/0      14/9/0         18/5/0         19/4/0
  sig. win/sig. loss                     4/4          11/2           16/1           14/0
Size (SIZE)                23      17.87±28.10   4.13±2.73    4.30±2.58      4.61±2.64      4.27±2.79
  significance (p-value)                 0.025        0.026          0.030          0.025
  win/loss/draw                          21/1/1       21/2/0         20/3/0         21/2/0
  sig. win/sig. loss                     21/0         20/1           19/1           20/0
Significance (SIG)         23      1.87±0.47     8.86±4.81    12.70±7.11     14.80±8.31     18.11±9.84
  significance (p-value)                 0.000        0.000          0.000          0.000
  win/loss/draw                          22/1/0       22/1/0         23/0/0         22/1/0
  sig. win/sig. loss                     22/0         22/0           22/0           21/0
Unusualness (WRACC)        23      0.024±0.02    0.060±0.05   0.080±0.06     0.082±0.06     0.100±0.07
  significance (p-value)                 0.001        0.000          0.000          0.000
  win/loss/draw                          18/5/0       21/2/0         21/2/0         22/1/0
  sig. win/sig. loss                     17/2         20/1           20/2           21/1

Table 6: Summary of the experimental results on the UCI data sets (descriptive evaluation measures) for different variants of the ordered algorithm using 10-fold stratified cross-validation. The best results are shown in boldface.

Performance Measure     Data Sets  CN2-standard  CN2-WRAcc    CN2-SD(γ=0.5)  CN2-SD(γ=0.9)  CN2-SD(add.)
Accuracy (ACC)             23      83.00±10.30   78.34±16.52  79.50±16.68    81.10±16.53    80.79±16.61
  significance (p-value)                 0.155        0.286          0.556          0.494
  win/loss/draw                          8/15/0       14/9/0         15/8/0         15/8/0
  sig. win/sig. loss                     3/6          15/4           8/4            7/3
AUC-Method-2 (AUC)         16      81.89±10.07   82.28±10.11  84.37±9.19     84.70±8.53     83.79±9.64
  significance (p-value)                 0.721        0.026          0.005          0.049
  win/loss/draw                          9/6/1        10/6/0         12/4/0         10/6/0
  sig. win/sig. loss                     6/5          6/3            8/4            6/4

Table 7: Summary of the experimental results on the UCI data sets (predictive evaluation measures) for different variants of the ordered algorithm using 10-fold stratified cross-validation. The best results are shown in boldface.

6.3 Experiments in Traffic Accident Data Analysis

We have evaluated the CN2-SD algorithm also on a traffic accident data set. This is a large real-world database (1.5 GB) containing 21 years of police traffic accident reports (1979-1999). The analysis of this database is not straightforward because of the volume of the data, the amounts of noise and missing data, and the fact that there is no clearly defined data mining target. As described below, some preprocessing was needed before running the subgroup discovery experiments. Results of the experiments were shown to the domain expert, whose comments are included.

6.3.1 THE TRAFFIC ACCIDENT DATA SET

The traffic accident database contains data about traffic accidents and the vehicles and casualties involved. The data is organized in three linked tables: the ACCIDENT table, the VEHICLE table and the CASUALTY table. The ACCIDENT table consists of the records of all accidents that happened over the given period of time (1979-1999), the VEHICLE table contains data about the vehicles involved in those accidents, and the CASUALTY table contains data about the casualties involved in the accidents. Consider the following example: "Two vehicles crashed in a traffic accident and three people were seriously injured in the crash". In terms of the traffic data set this is recorded as one record in the ACCIDENT table, two records in the VEHICLE table and three records in the CASUALTY table. The three tables are described in more detail below.

- The ACCIDENT table contains one record for each accident. The 30 attributes describing an accident can be divided in three groups: date and time of the accident, description of the road where the accident has occurred, and conditions under which the accident has occurred (such as weather conditions, light and junction details). In the ACCIDENT table there are more than 5 million records.

- The VEHICLE table contains one record for each vehicle involved in an accident from the ACCIDENT table. There can be one or many vehicles involved in a single accident. The VEHICLE table attributes describe the type of the vehicle, maneuver and direction of the vehicle (from and to), vehicle location on the road, junction location at impact, sex and age of the driver, alcohol test results, damage resulting from the
accident, and the object that the vehicle hit on and off the carriageway. There are 24 attributes in the VEHICLE table, which contains almost 9 million records.

- The CASUALTY table contains records about casualties for each of the vehicles in the VEHICLE table. There can be one or more casualties per vehicle. The CASUALTY table contains 16 attributes describing sex and age of the casualty, type of casualty (e.g., pedestrian, cyclist, car occupant etc.), severity of casualty, and, if the casualty type is pedestrian, what were his/her characteristics (location, movement, direction). This table contains almost 7 million records.

6.3.2 DATA PREPROCESSING

The large volume of data in the traffic data set makes it practically impossible to run any data mining algorithm on the whole set. Therefore we have taken samples of the data set and performed the experiments on these samples.

PFC   Number of Examples   Percentage of Accidents Sampled   Class distribution (%) fatal/serious/slight
 1          2555                       0.3                        1.76 / 24.85 / 73.39
 2          2523                       1.9                        2.53 / 30.87 / 66.60
 3          2501                       4.8                        0.56 / 12.35 / 87.09
 4          2499                       1.9                        2.16 / 27.21 / 70.63
 5          2522                       9.2                        1.90 / 23.39 / 74.71
 6          2548                       2.0                        1.41 / 13.69 / 84.90
 7          2788                       1.4                        0.97 / 16.25 / 82.78

Table 8: Properties of the traffic data set.

We focused on the ACCIDENT table and examined only the accidents that happened in 7 districts (called Police Force Codes, or PFCs) across the UK.11 The 7 PFCs were chosen by the domain expert and represent typical PFCs from clusters of PFCs with the same accident dynamics, analyzed by Ljubic et al. (2002). In this way we obtained 7 data sets (one for each PFC) with some hundred thousands of examples each. We further sampled this data to obtain approximately 2500 examples per data set. The sample percentages are listed in Table 8 together with the other characteristics of these 7 sampled data sets. Among the 26 attributes describing each of the 7 data sets we chose the attribute 'accident severity' to be the class attribute. The task that we have addressed was therefore to find subgroups of accidents of a certain severity ('slight', 'serious' or 'fatal') and characterize them in terms of attributes describing the accident, such as 'road class', 'speed limit', 'light condition', etc.

6.3.3 RESULTS OF EXPERIMENTS

We want to investigate if, by running CN2-SD on the data sets described in Table 8, we are able to get some rules that are typical and different for distinct PFCs. We used the same methodology to perform the experiments as in the case of the UCI data sets of Section 6.2. The only difference is that here we do not perform the area under ROC curve analysis, because the data sets are not two-class. The results presented in Tables 9-13 show the same advantages of CN2-SD over CN2-WRAcc and CN2-standard as shown by the results of the experiments on the UCI data sets.12 In particular, CN2-SD produces substantially smaller rule sets, where individual rules have higher coverage and significance. It should be noticed that these data sets have a very unbalanced class distribution (most accidents are 'slight' and only few are 'fatal', see Table 8). In terms of rule set accuracy, all algorithms achieved roughly default performance, which is obtained by always predicting the majority class. Since classification was not the main interest of this experiment, we omit the results.

11. For the sake of anonymity, the code numbers 1 through 7 do not correspond to the PFCs 1 through 7 used for Police Force Codes in the actual traffic accident database.
12. Like in the UCI case, only the results of the unordered versions of the algorithm are presented here, although the experiments were done with both unordered and ordered variants of the algorithms.
 #   CN2-standard  CN2-WRAcc     CN2-SD(γ=0.5)  CN2-SD(γ=0.7)  CN2-SD(γ=0.9)  CN2-SD(add.)
     COV±sd        COV±sd        COV±sd         COV±sd         COV±sd         COV±sd
 1   0.056±0.01    0.108↑±0.00   0.111↑±0.03    0.111↑±0.03    0.123↑±0.03    0.110↑±0.03
 2   0.050±0.10    0.113↑±0.04   0.127↑±0.05    0.127↑±0.04    0.129↑±0.05    0.151↑±0.04
 3   0.140±0.03    0.118±0.03    0.126±0.02     0.119±0.02     0.118±0.01     0.154±0.02
 4   0.052±0.01    0.105↑±0.03   0.105↑±0.04    0.120↑±0.04    0.122↑±0.04    0.116↑±0.04
 5   0.075±0.08    0.108↑±0.04   0.115↑±0.06    0.121↑±0.05    0.110↑±0.05    0.127↑±0.04
 6   0.078±0.06    0.118↑±0.03   0.134↑±0.05    0.122↑±0.06    0.124↑±0.06    0.120↑±0.05
 7   0.116±0.08    0.110±0.11    0.118±0.14     0.124±0.13     0.122±0.13     0.143↑±0.12
Avg  0.081±0.03    0.111±0.01    0.120±0.01     0.121±0.00     0.121±0.01     0.132±0.02
  significance (p-value)  0.047      0.021          0.023          0.029          0.003
  win/loss/draw           5/2/0      6/1/0          6/1/0          6/1/0          7/0/0
  sig. win/sig. loss      5/0        5/0            5/0            5/0            6/0

Table 9: Experimental results on the traffic accident data sets. Average coverage per rule with standard deviation (COV±sd) for different variants of the unordered algorithm.

 #   CN2-standard  CN2-WRAcc     CN2-SD(γ=0.5)  CN2-SD(γ=0.7)  CN2-SD(γ=0.9)  CN2-SD(add.)
     SUP±sd        SUP±sd        SUP±sd         SUP±sd         SUP±sd         SUP±sd
 1   0.86±0.03     0.89±0.02     0.83±0.06      0.93↑±0.04     0.96↑±0.02     0.95↑±0.03
 2   0.84±0.02     0.85±0.09     0.85±0.02      0.92↑±0.04     0.93↑±0.00     0.84±0.08
 3   0.81±0.06     0.82±0.04     0.93↑±0.02     0.90↑±0.05     0.97↑±0.01     0.85±0.06
 4   0.80±0.04     0.87↑±0.05    0.82±0.05      0.83±0.00      0.91↑±0.03     0.81±0.10
 5   0.87±0.08     0.85±0.03     0.80↓±0.03     0.83±0.06      0.94↑±0.02     0.83±0.08
 6   0.84±0.06     0.88↑±0.07    0.81±0.09      0.91↑±0.06     0.88↑±0.07     0.98↑±0.01
 7   0.81±0.08     0.83±0.05     0.90↑±0.01     0.81±0.01      0.95↑±0.02     0.99↑±0.00
Avg  0.83±0.03     0.85±0.02     0.85±0.05      0.88±0.05      0.93±0.03      0.89±0.08
  significance (p-value)  0.056      0.548          0.053          0.001          0.092
  win/loss/draw           6/1/0      4/3/0          6/1/0          7/0/0          6/1/0
  sig. win/sig. loss      2/0        2/1            4/0            7/0            3/0

Table 10: Experimental results on the traffic accident data sets. Overall support of rule sets with standard deviation (SUP±sd) for different variants of the unordered algorithm.

6.3.4 EVALUATION BY THE DOMAIN EXPERT

We have further examined the rules induced by the CN2-SD algorithm (using additive weights). We focused on rules with high coverage and rules that cover a high percentage of the predicted class, as these are the rules that are likely to reflect some regularity in the data. One of the most interesting results concerned the following. One might expect that the number of people injured would increase with the severity
of the accident (up to the total number of occupants in the vehicles). Furthermore, common sense would dictate that the number of vehicles involved would also increase with accident severity.

 #   CN2-standard  CN2-WRAcc     CN2-SD(γ=0.5)  CN2-SD(γ=0.7)  CN2-SD(γ=0.9)  CN2-SD(add.)
     SIZE±sd       SIZE±sd       SIZE±sd        SIZE±sd        SIZE±sd        SIZE±sd
 1   16.7±0.60     9.3↑±0.99     10.0↑±0.51     10.6↑±0.46     10.6↑±0.73     9.5↑±0.25
 2   18.7±1.28     9.2↑±0.33     10.0↑±0.20     10.3↑±0.21     10.3↑±0.56     11.1↑±0.23
 3   7.0±0.30      8.6±0.95      9.2±0.19       10.2↓±0.14     9.5±0.35       9.8↓±0.19
 4   18.0±1.39     9.9↑±0.59     10.4↑±0.31     11.2↑±0.64     11.2↑±0.24     10.3↑±0.56
 5   12.8±1.44     9.6↑±0.19     10.1↑±0.51     11.2±0.84      11.6±0.96      9.7↑±0.21
 6   12.5±0.31     8.5↑±0.35     9.3↑±0.51      8.7↑±0.91      9.4↑±0.60      8.5↑±0.39
 7   8.6±1.41      9.3±0.41      9.9±0.90       10.8↓±0.73     11.1↓±0.13     10.4±0.59
Avg  13.47±4.57    9.20±0.50     9.84±0.44      10.42±0.86     10.53±0.84     9.90±0.80
  significance (p-value)  0.040      0.066          0.123          0.127          0.075
  win/loss/draw           5/2/0      5/2/0          5/2/0          5/2/0          5/2/0
  sig. win/sig. loss      5/0        5/0            4/2            4/1            5/1

Table 11: Experimental results on the traffic accident data sets. Sizes of rule sets with standard deviation (SIZE±sd) for different variants of the unordered algorithm.

 #   CN2-standard  CN2-WRAcc     CN2-SD(γ=0.5)  CN2-SD(γ=0.7)  CN2-SD(γ=0.9)  CN2-SD(add.)
     SIG±sd        SIG±sd        SIG±sd         SIG±sd         SIG±sd         SIG±sd
 1   1.9±0.82      7.0↑±0.31     8.7↑±0.41      9.7↑±0.59      9.4↑±0.30      9.6↑±0.45
 2   1.9±0.34      6.2↑±0.25     9.9↑±0.26      9.8↑±0.20      9.5↑±0.81      9.8↑±0.36
 3   1.3±0.27      6.6↑±0.61     8.4↑±0.52      9.2↑±0.54      11.5↑±0.75     9.3↑±0.17
 4   1.6±0.10      7.6↑±0.14     8.5↑±0.79      11.0↑±0.84     9.4↑±0.80      11.1↑±0.24
 5   1.6±0.75      6.0↑±0.23     10.6↑±0.70     9.6↑±0.76      12.5↑±0.43     9.1↑±0.74
 6   1.5±0.87      8.5↑±0.41     8.3↑±0.54      9.8↑±0.24      9.9↑±0.51      12.5↑±0.35
 7   1.7±0.49      6.8↑±0.75     8.7↑±0.20      9.9↑±0.63      9.2↑±0.73      9.7↑±0.40
Avg  1.64±0.20     6.95±0.86     9.01±0.89      9.85±0.56      10.20±1.28     10.16±1.21
  significance (p-value)  0.000      0.000          0.000          0.000          0.000
  win/loss/draw           7/0/0      7/0/0          7/0/0          7/0/0          7/0/0
  sig. win/sig. loss      7/0        7/0            7/0            7/0            7/0

Table 12: Experimental results on the traffic accident data sets. Average significance per rule with standard deviation (SIG±sd) for different variants of the unordered algorithm.

Contrary to these expectations we found rules of the following two kinds:

- Rules that cover more than the average proportion of 'fatal' or 'serious' accidents when just one vehicle is involved in the accident. Examples of such rules are:

    IF nv <= 1.500 THEN sev = "1" [152801024]13
    IF nv <= 1.500 THEN sev = "2" [22252890]

13. The rules in the
examplearegivenintheCN2-SDoutputformatwherenvstandsfor`numberofvehicles',ncisthe`numberofcasualties'and"1","2",and"3"denotetheclassvalues`fatal',`serious'and`slight'respectively.175 LAVRACETAL.CN2CN2CN2-SDCN2-SDCN2-SDCN2-SD#standardWRAcc(g=0:5)(g=0:7)(g=0:9)(add.)WRACCsdWRACCsdWRACCsdWRACCsdWRACCsdWRACCsd10.0130.020.025"0.050.025"0.100.026"0.020.028"0.030.025"0.0920.0090.070.018"0.050.021"0.000.021"0.040.021"0.020.025"0.0430.0520.010.0430.000.0460.070.0430.030.0430.050.0560.0240.0100.090.021"0.060.021"0.050.024"0.090.024"0.000.023"0.0750.0190.040.026"0.060.027"0.070.029"0.080.027"0.010.030"0.0760.0270.030.041"0.060.047"0.050.042"0.050.043"0.070.042"0.0770.0380.030.0350.010.0380.040.0400.000.0390.080.0460.04Average0.0240.020.0300.010.0320.010.0320.010.0320.010.0350.01signicance–pvalue0.0960.0420.0410.0480.000win/loss/draw5/2/05/2/06/1/06/1/07/0/0sig.win/sig.loss5/05/05/05/05/0Table13:Experimentalresultsonthetrafcaccidentdatasets.Unusualnessofrulesetswithstan-darddeviation(WRACCsd)fordifferentvariantsoftheunorderedalgorithm.Rulesthatcovermorethantheaverageproportionof`slight'accidentswhentwoormorevehiclesareinvolvedandtherearefewcasualties.Anexampleofsucharuleis:IFnv�1.500ANDnc2.500THENsev="3"[81401190]Havingshowntheinducedresultstothedomainexpert,hepointedoutthefollowingaspectsofdatacollectionforthedataintheACCIDENTtable.14TheseveritycodeintheACCIDENTtablerelatestothemostsevereinjuryamongthosereportedforthataccident.Thereforeamultiplevehicleaccidentwith1fataland20slightinjurieswouldbeclassiedasfatalasonefatalityoccurred,whileeachindividualcasualtyinjuryseverityiscodedintheCASUALTYtable.Some(slight)injuriesmaybeunreportedattheaccidentscene:ifthepolicemancompiled/revisedthereportaftertheevent,newcasualty/injurydetailscanbereported(injuriesthatcametolightaftertheeventorreportedforreasonsrelatingtoinjury/insuranceclaims).However,thesechangesarenotreectedintheACCIDENTtable.Thendingsrevealedbytherulesweresurprisingtothedomainexpertandneedfurtherinvesti-gation.Theanalysissh
owsthatexaminingtheACCIDENTtableisnotsufcientandthatfurtherexaminationoftheVEHICLEandCASUALTYtablesisneededinfurtherwork7.RelatedWorkOthersystemshaveaddressedthetaskofsubgroupdiscovery,thebestknownbeingEXPLORA(Kl¨osgen,1996)andMIDOS(Wrobel,1997,2001).EXPLORAtreatsthelearningtaskasasin-glerelationproblem,i.e.,allthedataareassumedtobeavailableinonetable(relation),whereas14.WehavealsoshowntheCN2-standardandCN2-WRAccresultstotheexpertbuthedidnotconsideranyoftherulestobeinteresting.176 SUBGROUPDISCOVERYWITHCN2-SDMIDOSextendsthistasktomulti-relationaldatabases.Otherapproachesdealwithmulti-relationaldatabasesusingpropositionalisationandaggregatefunctionscanbefoundintheworkofKnobbeetal.(2001,2002).Anotherapproachtondingsymbolicdescriptionsofgroupsofinstancesissymboliccluster-ing,whichhasbeenpopularformanyyears(Michalski,1980,GowdaandDiday,1992).Moreover,learningofconcepthierarchiesalsoaimsatdiscoveringgroupsofinstances,whichcanbeinducedinasupervisedorunsupervisedmanner:decisiontreeinductionalgorithmsperformsupervisedsymboliclearningofconcepthierarchies(Langley,1996,RaedtandBlockeel,1997),whereashi-erarchicalclusteringalgorithms(SokalandSneath,1963,Gordon,1982)areunsupervisedanddonotresultinsymbolicdescriptions.Notethatindecisiontreelearning,theruleswhichcanbeformedfrompathsleadingfromtherootnodetoclasslabelsintheleavesrepresentdiscriminantdescriptions,formedfrompropertiesthatbestdiscriminatebetweentheclasses.Asrulesformedfromdecisiontreepathsformdiscriminantdescriptions,theyareinappropriateforsolvingsubgroupdiscoverytaskswhichaimatdescribingsubgroupsbytheircharacteristicproperties.Instanceweightsplayanimportantroleinboosting(FreundandShapire,1996)andalternatingdecisiontrees(SchapireandSinger,1998).InstanceweightshavebeenusedalsoinvariantsofthecoveringalgorithmimplementedinrulelearningapproachessuchasSLIPPER(CohenandSinger,1999),RL(Leeetal.,1998)andDAIRY(Hsuetal.,1998).AvariantoftheweightedcoveringalgorithmhasbeenusedinthesubgroupdiscoveryalgorithmSDforrulesubsetselection(Ga
mbergerandLavrac,2002).Avarietyofruleevaluationmeasuresandheuristicshavebeenstudiedforsubgroupdiscovery(Kl¨osgen,1996,Wrobel,1997,2001),aimedatbalancingthesizeofagroup(referredtoasfac-torg)withitsdistributionalunusualness(referredtoasfactorp).Thepropertiesoffunctionsthatcombinethesetwofactorshavebeenextensivelystudied(theso-called`p-g-space'Kl¨osgen,1996).Analternativemeasureq=TPFP+parwasproposedintheSDalgorithmforexpert-guidedsubgroupdiscovery(GambergerandLavrac,2002),aimedatminimizingthenumberoffalsepositivesFPandmaximizingtruepositivesTP,balancedbygeneralizationparameterpar.Besidessuch`ob-jective'measuresofinterestingness,some`subjective'measureofinterestingnessofadiscoveredpatterncanbetakenintoaccount,suchasactionability(`apatternisinterestingiftheusercandosomethingwithittohisorheradvantage')andunexpectedness(`apatternisinterestingtotheuserifitissurprisingtotheuser')(SilberschatzandTuzhilin,1995).Notethatsomeapproachestoassociationruleinductioncanalsobeusedforsubgroupdiscovery.Forinstance,theAPRIORI-Calgorithm(JovanoskiandLavrac,2001),whichappliesassociationruleinductiontoclassicationruleinduction,outputsclassicationruleswithguaranteedsupportandcondencewithrespecttoatargetclass.Ifarulesatisesalsoauser-denedsignicancethreshold,aninducedAPRIORI-Crulecanbeviewedasanindependent`chunk'ofknowledgeaboutthetargetclass(selectedpropertyofinterestforsubgroupdiscovery),whichcanbeviewedasasubgroupdescriptionwithguaranteedsignicance,supportandcondence.ThisobservationledtothedevelopmentofanovelsubgroupdiscoveryalgorithmAPRIORI-SD(Kavseketal.,2003).Itshouldbenoticedthatintheterminology`patientvs.greedy'ofFriedmanandFisher(1999),WRAccisa`patient'rulequalitymeasure,favoringmoregeneralsubgroupsthanthosefoundbyusing`greedy'qualitymeasures.AsshownbyourexperimentsinTodorovskietal.(2000),WRAccheuristicimprovesrulecoveragecomparedtothestandardCN2heuristic.Thisobservationiscon-rmedalsointheexperimentalevaluationinSection6ofthispaper.FurtherevidenceconrmingthisclaimisprovidedbyKavseketal
.(2003),providingexperimentalcomparisonofresultsofCN2-177 LAVRACETAL.SDandournovelsubgroupdiscoveryalgorithmAPRIORI-SDwithrulelearnersCN2,RIPPERandAPRIORI-C.8.ConclusionsandFurtherWorkWehavepresentedanovelapproachtoadaptingstandardclassicationrulelearningtosubgroupdiscovery.Tothisendwehaveappropriatelyadaptedthecoveringalgorithm,thesearchheuristic,theprobabilisticclassicationandtheareaundertheROCcurve(AUC)performancemeasure.Wehavealsoproposedasetofmetricsappropriateforevaluatingthequalityofinducedsubgroupdescriptions.Theexperimentalresultson23UCIdatasetsdemonstratethatCN2-SDproducessubstantiallysmallerrulesets,whereindividualruleshavehighercoverageandsignicance.Thesethreefactorsareimportantforsubgroupdiscovery:smallersizeenablesbetterunderstanding,highercoveragemeanslargersupport,andhighersignicancemeansthatrulesdescribediscoveredsubgroupsthataresignicantlydifferentfromtheentirepopulation.WehaveevaluatedtheresultsofCN2-SDalsointermsofAUCandshownasmall(insignicant)increaseintermsoftheareaunderROCcurve.WehaveappliedCN2-SDalsotoareal-lifeproblemoftrafcaccidentanalysis.Theexper-imentalresultsconrmthendingsintheUCIdatasets.Themostinterestingndingsareduetointerpretationbythedomainexpert.Whatwasconrmedinthiscasestudywasthattheresultofadataminingprocessdependsnotonlyontheappropriatenessoftheselectedmethodandthedatathatisathandbutalsoonhowthedatahasbeencollected.InthetrafcaccidentexperimentsexaminingtheACCIDENTtablewasnotsufcient,andfurtherexaminationoftheVEHICLEandCASUALTYtablesisneeded.ThiswillbeperformedusingtheRSDrelationalsubgroupdiscoveryalgorithm(Lavracetal.,2003),arecentupgradeoftheCN2-SDalgorithmwhichenablesrelationalsubgroupdiscovery.InfurtherworkweplantocomparetheresultswiththeMIDOSsubgroupdiscoveryalgorithm.WeplantoinvestigatethebehaviorofCN2-SDintermsofAUCinmulti-classproblems(HandandTill,2001).Aninterestingquestion,tobeveriedinfurtherexperiments,iswhethertheweightedversionsoftheCN2algorithmimprovethesignicanceoftheinducedsubgroupsalsointhecasewhenCN2rulesareind
uced without applying the significance test. An important aspect of subgroup discovery performance, which is neglected in our study, is the degree of overlap of the induced subgroups. The challenge of our further research is to propose extensions of the weighted relative accuracy heuristic and ROC space evaluation metrics that will take into account the overlap of subgroups. We are now moving the focus of our research in subgroup discovery from heuristic search toward exhaustive search of the space of patterns. An attempt of this kind is described by Kavsek et al. (2003), where the well-known APRIORI association rule learner was adapted to the task of subgroup discovery.

Acknowledgments

Thanks to Dragan Gamberger for joint work on the weighted covering algorithm, and José Hernández-Orallo and Cesar Ferri-Ramírez for joint work on AUC. Thanks to Peter Ljubic and Damjan Demsar for the help in upgrading the C code of the original CN2 algorithm. We are grateful to John Bullas for the evaluation of the results of traffic accident data analysis. Thanks are also due to the anonymous reviewers for their insightful comments. The work reported in this paper was supported by the Slovenian Ministry of Education, Science and Sport, the IST-1999-11495 project Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise, and the British Council project Partnership in Science PSP-18.

Appendix A. Tables with Detailed Results for Different Variants of the Unordered Algorithm in UCI Data Sets

The tables in this appendix show detailed results of the performance of different variants of the unordered algorithm. The comparisons are made on the 23 UCI data sets listed in Table 3. The results shown in Tables 14-18 of Appendix A are summarized in the paper in Table 4, and the results of Tables 19-20 in Table 5.

 #  | CN2 standard | CN2 WRAcc   | CN2-SD (g=0.5) | CN2-SD (g=0.7) | CN2-SD (g=0.9) | CN2-SD (add.)
 1  | 0.071 0.01 | 0.416↑ 0.00 | 0.473↑ 0.03 | 0.492↑ 0.03 | 0.480↑ 0.03 | 0.424↑ 0.03
 2  | 0.079 0.10 | 0.150↑ 0.04 | 0.208↑ 0.05 | 0.174↑ 0.04 | 0.218↑ 0.05 | 0.260↑ 0.04
 3  | 0.625 0.03 | 0.322↓ 0.03 | 0.612 0.02 | 0.617 0.02 | 0.721 0.01 | 0.330↓ 0.02
 4  | 0.048 0.01 | 0.496↑ 0.03 | 0.504↑ 0.04 | 0.513↑ 0.04 | 0.504↑ 0.04 | 0.507↑ 0.04
 5  | 0.057 0.08 | 0.275↑ 0.04 | 0.296↑ 0.06 | 0.344↑ 0.05 | 0.299↑ 0.05 | 0.381↑ 0.04
 6  | 0.312 0.06 | 0.576↑ 0.03 | 0.936↑ 0.05 | 1.039↑ 0.06 | 1.006↑ 0.06 | 1.295↑ 0.05
 7  | 0.053 0.08 | 0.092↑ 0.11 | 0.141↑ 0.14 | 0.153↑ 0.13 | 0.138↑ 0.13 | 0.151↑ 0.12
 8  | 0.107 0.09 | 0.240↑ 0.07 | 0.419↑ 0.09 | 0.376↑ 0.12 | 0.366↑ 0.11 | 0.435↑ 0.09
 9  | 0.207 0.04 | 0.430↑ 0.06 | 0.637↑ 0.04 | 0.829↑ 0.04 | 0.826↑ 0.04 | 0.686↑ 0.03
 10 | 0.093 0.00 | 0.495↑ 0.00 | 0.509↑ 0.00 | 0.509↑ 0.00 | 0.516↑ 0.00 | 0.513↑ 0.00
 11 | 0.099 0.05 | 0.168↑ 0.08 | 0.229↑ 0.05 | 0.234↑ 0.04 | 0.246↑ 0.04 | 0.354↑ 0.06
 12 | 0.378 0.01 | 0.386 0.01 | 0.619↑ 0.00 | 0.444 0.00 | 0.768↑ 0.00 | 0.668↑ 0.01
 13 | 0.160 0.11 | 0.408↑ 0.09 | 0.639↑ 0.15 | 0.467↑ 0.16 | 0.424↑ 0.18 | 0.621↑ 0.17
 14 | 0.142 0.01 | 0.356↑ 0.07 | 0.461↑ 0.02 | 0.668↑ 0.03 | 0.569↑ 0.03 | 0.720↑ 0.03
 15 | 0.030 0.01 | 0.113↑ 0.07 | 0.129↑ 0.02 | 0.146↑ 0.03 | 0.182↑ 0.03 | 0.117↑ 0.03
 16 | 0.129 0.01 | 0.650↑ 0.07 | 0.703↑ 0.02 | 0.711↑ 0.03 | 0.674↑ 0.03 | 0.831↑ 0.03
 17 | 0.021 0.00 | 0.216↑ 0.00 | 0.225↑ 0.00 | 0.270↑ 0.00 | 0.307↑ 0.00 | 0.324↑ 0.00
 18 | 0.022 0.05 | 0.146↑ 0.08 | 0.155↑ 0.05 | 0.157↑ 0.04 | 0.166↑ 0.04 | 0.200↑ 0.06
 19 | 0.066 0.01 | 0.331↑ 0.01 | 0.357↑ 0.00 | 0.628↑ 0.00 | 0.616↑ 0.00 | 0.759↑ 0.01
 20 | 0.039 0.11 | 0.139↑ 0.09 | 0.151↑ 0.15 | 0.159↑ 0.16 | 0.149↑ 0.18 | 0.169↑ 0.17
 21 | 0.040 0.01 | 0.076↑ 0.07 | 0.115↑ 0.02 | 0.177↑ 0.03 | 0.172↑ 0.03 | 0.216↑ 0.03
 22 | 0.004 0.01 | 0.185↑ 0.07 | 0.194↑ 0.02 | 0.185↑ 0.03 | 0.188↑ 0.03 | 0.191↑ 0.03
 23 | 0.231 0.01 | 0.477↑ 0.07 | 0.552↑ 0.02 | 0.715↑ 0.03 | 0.818↑ 0.03 | 1.022↑ 0.03
 Average            | 0.131 0.14 | 0.311 0.17 | 0.403 0.23 | 0.435 0.25 | 0.450 0.26 | 0.486 0.30
 significance (p)   | -          | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
 win/loss/draw      | -          | 22/1/0 | 22/1/0 | 22/1/0 | 23/0/0 | 22/1/0
 sig. win/sig. loss | -          | 21/1 | 22/0 | 21/0 | 22/0 | 22/1

Table 14: Relative average coverage per rule with standard deviation (columns COV, sd) for different variants of the unordered algorithm using 10-fold stratified cross-validation. ↑/↓ mark statistically significant wins/losses against standard CN2.
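The COV quantity reported above can be computed directly from a rule set: a rule's relative coverage is the fraction of examples satisfying its condition, averaged over all induced rules. A minimal sketch with illustrative helper names (not the authors' implementation), representing rules as boolean predicates:

```python
def coverage(rule, examples):
    """Fraction of `examples` covered by `rule` (a boolean predicate)."""
    return sum(1 for x in examples if rule(x)) / len(examples)

def avg_coverage(rules, examples):
    """Average relative coverage per rule, the COV quantity of Table 14."""
    return sum(coverage(r, examples) for r in rules) / len(rules)

# Toy data: examples are dicts, rules are predicates over them.
examples = [{"a": 1}, {"a": 2}, {"a": 3}, {"a": 4}]
rules = [lambda x: x["a"] > 2, lambda x: x["a"] == 1]
print(avg_coverage(rules, examples))  # (0.5 + 0.25) / 2 = 0.375
```

Values above 1.0 in Table 14 are possible for the relative figures reported there because a single example may be counted by several overlapping rules.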
 #  | CN2 standard | CN2 WRAcc  | CN2-SD (g=0.5) | CN2-SD (g=0.7) | CN2-SD (g=0.9) | CN2-SD (add.)
 1  | 0.81 0.09 | 0.89↑ 0.02 | 0.87↑ 0.00 | 0.97↑ 0.01 | 0.84↑ 0.00 | 0.89↑ 0.04
 2  | 0.88 0.01 | 0.90 0.02 | 0.89 0.09 | 0.84 0.04 | 0.93↑ 0.02 | 0.86 0.05
 3  | 0.87 0.05 | 0.87 0.09 | 0.84 0.05 | 0.93↑ 0.02 | 0.84 0.07 | 0.95↑ 0.01
 4  | 0.87 0.06 | 0.81↓ 0.09 | 0.90 0.02 | 0.81↓ 0.04 | 0.97↑ 0.00 | 0.93↑ 0.02
 5  | 0.80 0.01 | 0.82 0.03 | 0.92↑ 0.06 | 0.85 0.01 | 0.95↑ 0.01 | 0.87↑ 0.05
 6  | 0.90 0.03 | 0.81↓ 0.01 | 0.95↑ 0.01 | 0.85 0.03 | 0.98↑ 0.00 | 0.82↓ 0.02
 7  | 0.89 0.03 | 0.88 0.03 | 0.90 0.02 | 0.81↓ 0.07 | 0.97↑ 0.01 | 0.96↑ 0.01
 8  | 0.84 0.03 | 0.87 0.04 | 0.94↑ 0.01 | 0.83 0.03 | 0.89 0.09 | 0.98↑ 0.00
 9  | 0.87 0.10 | 0.81↓ 0.02 | 0.85 0.10 | 0.94↑ 0.00 | 0.90 0.02 | 0.99↑ 0.00
 10 | 0.84 0.01 | 0.83 0.08 | 0.82 0.07 | 1.00↑ 0.00 | 0.90↑ 0.02 | 0.95↑ 0.02
 11 | 0.83 0.03 | 0.85 0.07 | 0.96↑ 0.01 | 0.95↑ 0.01 | 0.89↑ 0.09 | 0.98↑ 0.01
 12 | 0.82 0.04 | 0.89↑ 0.00 | 0.83 0.10 | 0.91↑ 0.01 | 0.88↑ 0.03 | 0.95↑ 0.01
 13 | 0.87 0.10 | 0.90 0.06 | 0.81↓ 0.02 | 0.80↓ 0.09 | 0.85 0.04 | 0.85 0.03
 14 | 0.84 0.05 | 0.85 0.07 | 0.83 0.06 | 0.89↑ 0.06 | 0.93↑ 0.02 | 0.86 0.05
 15 | 0.83 0.04 | 0.80 0.07 | 0.96↑ 0.01 | 0.86 0.09 | 0.80 0.08 | 0.81 0.00
 16 | 0.85 0.07 | 0.82 0.02 | 1.00↑ 0.00 | 0.84 0.06 | 0.96↑ 0.01 | 0.85 0.10
 17 | 0.86 0.08 | 0.90↑ 0.03 | 0.86 0.07 | 0.82 0.06 | 1.00↑ 0.00 | 0.85 0.06
 18 | 0.81 0.06 | 0.85↑ 0.07 | 0.96↑ 0.01 | 0.89↑ 0.05 | 0.95↑ 0.01 | 0.97↑ 0.00
 19 | 0.83 0.01 | 0.85 0.05 | 0.92↑ 0.04 | 0.95↑ 0.01 | 0.90↑ 0.02 | 0.84 0.05
 20 | 0.90 0.06 | 0.82↓ 0.07 | 0.99↑ 0.00 | 0.90 0.03 | 0.99↑ 0.00 | 0.90 0.04
 21 | 0.81 0.05 | 0.80 0.04 | 0.87↑ 0.08 | 0.90↑ 0.04 | 0.93↑ 0.02 | 0.82 0.06
 22 | 0.81 0.02 | 0.89↑ 0.06 | 0.94↑ 0.02 | 0.96↑ 0.01 | 1.00↑ 0.00 | 0.96↑ 0.01
 23 | 0.82 0.05 | 0.82 0.04 | 0.94↑ 0.03 | 0.87↑ 0.07 | 0.99↑ 0.00 | 0.99↑ 0.00
 Average            | 0.84 0.03 | 0.85 0.03 | 0.90 0.06 | 0.89 0.06 | 0.92 0.06 | 0.91 0.06
 significance (p)   | -         | 0.637 | 0.000 | 0.017 | 0.000 | 0.001
 win/loss/draw      | -         | 13/10/0 | 18/5/0 | 14/9/0 | 20/3/0 | 16/7/0
 sig. win/sig. loss | -         | 5/4 | 13/1 | 11/3 | 18/0 | 13/1

Table 15: Overall rule set support with standard deviation (columns SUP, sd) for different variants of the unordered algorithm using 10-fold stratified cross-validation. ↑/↓ mark statistically significant wins/losses against standard CN2.
 #  | CN2 standard | CN2 WRAcc  | CN2-SD (g=0.5) | CN2-SD (g=0.7) | CN2-SD (g=0.9) | CN2-SD (add.)
 1  | 12.4 1.95 | 2.0↑ 0.75 | 2.7↑ 0.02 | 2.6↑ 0.87 | 2.2↑ 0.85 | 3.5↑ 0.79
 2  | 12.6 1.04 | 8.8↑ 0.95 | 7.9↑ 0.50 | 8.5↑ 1.75 | 9.0↑ 0.24 | 9.2↑ 1.24
 3  | 1.8 0.10 | 2.0 0.41 | 2.0 0.70 | 2.7↓ 0.44 | 1.9 0.27 | 1.8 0.29
 4  | 14.6 1.81 | 7.9↑ 1.78 | 8.1↑ 1.02 | 7.9↑ 0.97 | 8.5↑ 0.47 | 8.5↑ 0.41
 5  | 12.8 1.56 | 5.2↑ 0.79 | 6.0↑ 0.68 | 5.6↑ 1.35 | 5.4↑ 0.30 | 4.6↑ 0.86
 6  | 3.7 1.37 | 2.5↑ 0.79 | 3.1 0.72 | 3.8 1.61 | 4.7↓ 1.22 | 3.4 0.02
 7  | 15.1 1.89 | 7.8↑ 1.49 | 8.4↑ 1.32 | 8.7↑ 0.46 | 9.1↑ 1.26 | 8.8↑ 1.13
 8  | 6.4 1.53 | 3.0↑ 1.20 | 2.9↑ 0.98 | 2.7↑ 0.67 | 2.7↑ 0.90 | 1.8↑ 0.38
 9  | 3.0 0.29 | 2.1↑ 0.50 | 1.7↑ 0.93 | 2.7 0.53 | 3.6↓ 1.83 | 2.7 0.00
 10 | 10.1 1.02 | 3.9↑ 0.31 | 3.9↑ 0.85 | 3.4↑ 1.10 | 3.3↑ 1.90 | 2.5↑ 0.54
 11 | 7.6 1.01 | 3.0↑ 1.78 | 3.9↑ 1.84 | 4.0↑ 0.18 | 3.6↑ 0.87 | 4.2↑ 0.41
 12 | 3.8 1.24 | 3.0↑ 1.24 | 3.2↑ 0.42 | 3.4↑ 0.39 | 2.9↑ 0.05 | 3.6 0.69
 13 | 4.7 1.30 | 3.1↑ 1.15 | 3.4↑ 0.54 | 3.9↑ 0.98 | 4.6 1.19 | 4.5 0.71
 14 | 5.2 0.90 | 2.7↑ 0.91 | 2.1↑ 0.95 | 1.9↑ 0.10 | 1.7↑ 1.73 | 2.1↑ 0.78
 15 | 21.2 3.48 | 10.5↑ 1.85 | 11.2↑ 1.12 | 10.3↑ 1.99 | 9.6↑ 1.32 | 10.2↑ 1.30
 16 | 7.1 1.59 | 2.0↑ 0.81 | 2.4↑ 0.56 | 2.4↑ 0.75 | 2.9↑ 0.56 | 1.8↑ 0.45
 17 | 28.7 3.89 | 9.9↑ 1.22 | 9.4↑ 1.61 | 8.9↑ 1.80 | 9.5↑ 1.03 | 8.3↑ 1.17
 18 | 83.8 5.37 | 10.9↑ 2.37 | 11.3↑ 2.78 | 11.8↑ 1.45 | 11.7↑ 1.67 | 12.8↑ 1.74
 19 | 12.9 1.68 | 7.7↑ 1.00 | 8.6↑ 1.21 | 9.1↑ 1.85 | 8.4↑ 1.09 | 10.1↑ 1.83
 20 | 32.8 2.64 | 8.7↑ 1.82 | 8.9↑ 1.48 | 9.8↑ 1.01 | 10.5↑ 1.37 | 9.2↑ 1.49
 21 | 35.1 3.54 | 19.6↑ 1.80 | 19.3↑ 2.91 | 19.7↑ 2.99 | 19.8↑ 2.58 | 19.2↑ 2.90
 22 | 77.3 4.07 | 12.2↑ 1.79 | 11.4↑ 2.87 | 12.4↑ 2.29 | 12.4↑ 2.09 | 11.7↑ 2.81
 23 | 5.5 1.26 | 3.0↑ 0.36 | 2.1↑ 0.70 | 2.1↑ 0.57 | 1.2↑ 0.73 | 1.4↑ 0.90
 Average            | 18.18 21.77 | 6.15 4.49 | 6.25 4.42 | 6.45 4.48 | 6.49 4.57 | 6.35 4.58
 significance (p)   | -           | 0.006 | 0.007 | 0.007 | 0.007 | 0.007
 win/loss/draw      | -           | 22/1/0 | 22/1/0 | 21/2/0 | 20/3/0 | 23/0/0
 sig. win/sig. loss | -           | 22/0 | 21/0 | 20/1 | 19/2 | 18/0

Table 16: Average rule set sizes with standard deviation (columns SIZE, sd) for different variants of the unordered algorithm using 10-fold stratified cross-validation. ↑/↓ mark statistically significant wins/losses against standard CN2.
 #  | CN2 standard | CN2 WRAcc  | CN2-SD (g=0.5) | CN2-SD (g=0.7) | CN2-SD (g=0.9) | CN2-SD (add.)
 1  | 2.0 0.05 | 7.8↑ 1.49 | 14.6↑ 1.05 | 24.0↑ 1.01 | 15.6↑ 1.54 | 4.6↑ 0.52
 2  | 2.7 0.10 | 13.3↑ 1.69 | 27.1↑ 3.37 | 2.1 0.02 | 20.5↑ 2.45 | 26.6↑ 3.43
 3  | 2.1 0.01 | 7.8↑ 0.64 | 13.3↑ 1.39 | 2.5 0.01 | 21.2↑ 2.55 | 22.9↑ 2.43
 4  | 2.4 0.06 | 9.1↑ 0.58 | 14.1↑ 1.72 | 16.9↑ 1.28 | 22.5↑ 2.49 | 30.2↑ 3.98
 5  | 2.0 0.01 | 15.8↑ 1.07 | 14.9↑ 1.95 | 11.0↑ 1.43 | 15.2↑ 1.85 | 2.1 0.01
 6  | 1.9 0.03 | 10.0↑ 1.63 | 11.0↑ 1.12 | 30.5↑ 2.12 | 30.1↑ 2.27 | 23.1↑ 2.97
 7  | 2.0 0.02 | 2.7 0.83 | 19.8↑ 1.21 | 17.7↑ 1.63 | 11.1↑ 1.03 | 16.3↑ 1.49
 8  | 1.9 0.09 | 4.6↑ 0.59 | 23.2↑ 1.82 | 5.3↑ 0.36 | 4.0↑ 0.03 | 30.6↑ 2.96
 9  | 2.7 0.03 | 9.7↑ 0.86 | 12.3↑ 1.00 | 9.3↑ 0.65 | 8.5↑ 0.89 | 25.0↑ 2.60
 10 | 1.4 0.04 | 3.6↑ 0.74 | 5.8↑ 0.48 | 28.3↑ 2.27 | 24.9↑ 2.27 | 13.5↑ 1.84
 11 | 2.0 0.04 | 1.8 0.07 | 16.7↑ 1.42 | 23.9↑ 2.41 | 30.9↑ 2.18 | 14.9↑ 1.52
 12 | 1.9 0.03 | 7.1↑ 0.07 | 17.0↑ 1.61 | 1.3 0.09 | 17.6↑ 1.45 | 4.0↑ 0.00
 13 | 2.1 0.00 | 15.1↑ 1.80 | 19.4↑ 1.77 | 21.9↑ 2.38 | 21.4↑ 2.39 | 9.7↑ 0.61
 14 | 2.5 0.08 | 14.9↑ 1.93 | 18.0↑ 1.57 | 13.9↑ 1.28 | 3.0 0.09 | 18.1↑ 1.73
 15 | 2.5 0.05 | 4.2↑ 0.42 | 17.5↑ 1.79 | 5.7↑ 0.46 | 21.9↑ 2.83 | 26.5↑ 2.22
 16 | 2.6 0.04 | 11.7↑ 1.90 | 9.6↑ 0.56 | 22.7↑ 2.59 | 2.3 0.08 | 6.0↑ 0.00
 17 | 2.7 0.03 | 4.8↑ 0.53 | 11.7↑ 1.67 | 21.8↑ 2.55 | 15.0↑ 1.82 | 24.3↑ 2.26
 18 | 1.5 0.00 | 14.1↑ 1.11 | 6.0↑ 0.93 | 26.8↑ 2.53 | 12.6↑ 1.35 | 19.3↑ 1.09
 19 | 1.0 0.07 | 2.4↑ 0.01 | 22.0↑ 1.20 | 17.0↑ 1.78 | 16.4↑ 1.74 | 9.1↑ 0.02
 20 | 1.5 0.00 | 16.0↑ 2.52 | 24.3↑ 1.52 | 11.4↑ 1.25 | 29.9↑ 3.25 | 21.7↑ 2.88
 21 | 2.4 0.02 | 6.8↑ 0.88 | 15.6↑ 1.98 | 12.9↑ 1.47 | 8.2↑ 0.06 | 30.6↑ 2.39
 22 | 2.6 0.04 | 9.7↑ 1.56 | 3.4↑ 0.09 | 14.2↑ 1.20 | 7.1↑ 0.47 | 20.2↑ 2.71
 23 | 2.0 0.07 | 13.5↑ 1.57 | 20.7↑ 1.93 | 2.7 0.02 | 29.4↑ 3.51 | 25.7↑ 2.48
 Average            | 2.11 0.46 | 8.97 4.66 | 15.57 6.05 | 14.95 9.02 | 16.92 8.90 | 18.47 9.00
 significance (p)   | -         | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
 win/loss/draw      | -         | 22/1/0 | 23/0/0 | 21/2/0 | 22/1/0 | 23/0/0
 sig. win/sig. loss | -         | 21/0 | 23/0 | 20/0 | 21/0 | 22/0

Table 17: Average rule significance with standard deviation (columns SIG, sd) for different variants of the unordered algorithm using 10-fold stratified cross-validation. ↑/↓ mark statistically significant wins/losses against standard CN2.
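The significance values reported above are based on CN2's likelihood ratio statistic (Clark and Niblett, 1989), which compares the class distribution among the examples covered by a rule with the prior class distribution; it is distributed approximately as chi-squared with (number of classes - 1) degrees of freedom. A minimal sketch of that statistic, assuming dict-based counts (an illustration, not the upgraded C implementation used in the experiments):

```python
from math import log

def likelihood_ratio(covered_counts, prior_probs):
    """CN2-style likelihood ratio statistic 2 * sum_c f_c * log(f_c / e_c),
    where f_c is the observed count of class c among covered examples and
    e_c = n * prior_probs[c] is the count expected from the class prior."""
    n = sum(covered_counts.values())
    stat = 0.0
    for c, f in covered_counts.items():
        if f == 0:
            continue  # the term f * log(f / e) is taken as 0 when f == 0
        e = n * prior_probs[c]
        stat += f * log(f / e)
    return 2.0 * stat

# A rule covering 30 examples, 25 positive and 5 negative, with uniform priors.
stat = likelihood_ratio({"pos": 25, "neg": 5}, {"pos": 0.5, "neg": 0.5})
print(round(stat, 2))  # ≈ 14.56, well above the 99% chi-squared cutoff (6.63, 1 d.f.)
```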
 #  | CN2 standard | CN2 WRAcc    | CN2-SD (g=0.5) | CN2-SD (g=0.7) | CN2-SD (g=0.9) | CN2-SD (add.)
 1  | 0.022 0.09 | 0.148↑ 0.03 | 0.186↑ 0.09 | 0.185↑ 0.04 | 0.181↑ 0.07 | 0.162↑ 0.01
 2  | 0.034 0.04 | 0.063↑ 0.04 | 0.095↑ 0.02 | 0.079↑ 0.01 | 0.093↑ 0.07 | 0.111↑ 0.04
 3  | -0.016 0.08 | -0.012 0.01 | -0.005 0.03 | -0.006 0.09 | -0.001 0.02 | -0.012 0.01
 4  | 0.020 0.04 | 0.210↑ 0.02 | 0.228↑ 0.02 | 0.233↑ 0.04 | 0.224↑ 0.03 | 0.224↑ 0.10
 5  | 0.013 0.06 | 0.065↑ 0.06 | 0.085↑ 0.07 | 0.099↑ 0.04 | 0.086↑ 0.07 | 0.092↑ 0.03
 6  | 0.058 0.07 | 0.099↑ 0.10 | 0.174↑ 0.05 | 0.208↑ 0.00 | 0.213↑ 0.01 | 0.243↑ 0.10
 7  | 0.012 0.02 | 0.020↑ 0.01 | 0.034↑ 0.00 | 0.040↑ 0.05 | 0.034↑ 0.08 | 0.034↑ 0.08
 8  | 0.026 0.04 | 0.065↑ 0.04 | 0.124↑ 0.02 | 0.104↑ 0.06 | 0.104↑ 0.09 | 0.122↑ 0.03
 9  | 0.004 0.07 | 0.018↑ 0.04 | 0.057↑ 0.10 | 0.073↑ 0.09 | 0.066↑ 0.04 | 0.049↑ 0.02
 10 | 0.013 0.04 | 0.067↑ 0.02 | 0.076↑ 0.01 | 0.073↑ 0.09 | 0.076↑ 0.04 | 0.072↑ 0.07
 11 | 0.041 0.02 | 0.065↑ 0.03 | 0.099↑ 0.04 | 0.095↑ 0.05 | 0.104↑ 0.10 | 0.145↑ 0.00
 12 | 0.024 0.04 | 0.024 0.05 | 0.062↑ 0.02 | 0.042↑ 0.02 | 0.052↑ 0.03 | 0.045↑ 0.06
 13 | 0.024 0.03 | 0.056↑ 0.03 | 0.114↑ 0.10 | 0.085↑ 0.04 | 0.065↑ 0.07 | 0.092↑ 0.03
 14 | 0.009 0.10 | 0.038↑ 0.10 | 0.053↑ 0.03 | 0.082↑ 0.10 | 0.082↑ 0.02 | 0.085↑ 0.08
 15 | 0.015 0.07 | 0.030↑ 0.07 | 0.036↑ 0.09 | 0.041↑ 0.03 | 0.055↑ 0.08 | 0.032↑ 0.06
 16 | 0.017 0.00 | 0.095↑ 0.10 | 0.117↑ 0.04 | 0.129↑ 0.04 | 0.127↑ 0.06 | 0.138↑ 0.02
 17 | 0.005 0.03 | 0.048↑ 0.07 | 0.051↑ 0.02 | 0.073↑ 0.08 | 0.083↑ 0.02 | 0.073↑ 0.09
 18 | 0.009 0.06 | 0.030↑ 0.00 | 0.037↑ 0.01 | 0.032↑ 0.00 | 0.034↑ 0.07 | 0.045↑ 0.03
 19 | 0.007 0.07 | 0.060↑ 0.00 | 0.081↑ 0.08 | 0.133↑ 0.05 | 0.132↑ 0.03 | 0.147↑ 0.04
 20 | 0.004 0.01 | -0.045↓ 0.10 | -0.042↓ 0.04 | -0.048↓ 0.02 | -0.042↓ 0.03 | -0.051↓ 0.06
 21 | 0.015 0.08 | 0.015 0.03 | 0.024↑ 0.04 | 0.039↑ 0.08 | 0.042↑ 0.06 | 0.045↑ 0.05
 22 | 0.001 0.03 | 0.045↑ 0.06 | 0.054↑ 0.05 | 0.054↑ 0.09 | 0.054↑ 0.05 | 0.049↑ 0.05
 23 | 0.033 0.01 | 0.076↑ 0.05 | 0.089↑ 0.03 | 0.144↑ 0.05 | 0.149↑ 0.06 | 0.167↑ 0.01
 Average            | 0.017 0.02 | 0.056 0.05 | 0.079 0.06 | 0.086 0.07 | 0.088 0.06 | 0.092 0.07
 significance (p)   | -          | 0.001 | 0.000 | 0.000 | 0.000 | 0.000
 win/loss/draw      | -          | 20/1/2 | 22/1/0 | 22/1/0 | 22/1/0 | 22/1/0
 sig. win/sig. loss | -          | 19/1 | 21/1 | 21/1 | 21/1 | 21/1

Table 18: Average rule unusualness with standard deviation (columns WRACC, sd) for different variants of the unordered algorithm using 10-fold stratified cross-validation. ↑/↓ mark statistically significant wins/losses against standard CN2.
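Rule unusualness is measured by the weighted relative accuracy heuristic, WRAcc(Class ← Cond) = p(Cond) · (p(Class|Cond) − p(Class)), trading off rule generality against distributional unusualness. Estimated from counts by relative frequencies, it can be sketched as follows (plain frequencies only; the example weights and probability estimates used inside CN2-SD's covering loop are omitted):

```python
def wracc(n_cov, n_cov_pos, n_total, n_pos):
    """Weighted relative accuracy of a rule Class <- Cond:
    WRAcc = p(Cond) * (p(Class|Cond) - p(Class)), with probabilities
    estimated by relative frequencies from the four counts."""
    p_cond = n_cov / n_total              # generality of the rule
    p_class_given_cond = n_cov_pos / n_cov  # precision on covered examples
    p_class = n_pos / n_total             # class prior
    return p_cond * (p_class_given_cond - p_class)

# A rule covering 40 of 200 examples, 30 of them in the target class,
# where the target class has 80 examples overall.
print(round(wracc(40, 30, 200, 80), 3))  # 0.2 * (0.75 - 0.4) = 0.07
```

A rule whose covered class distribution matches the prior scores zero, which is why near-default rules (e.g. row 3 above) can yield slightly negative values.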
 #  | CN2 standard | CN2 WRAcc   | CN2-SD (g=0.5) | CN2-SD (g=0.7) | CN2-SD (g=0.9) | CN2-SD (add.)
 1  | 81.62 3.55 | 85.53↑ 0.14 | 89.27↑ 8.04 | 87.61↑ 8.71 | 87.81↑ 6.54 | 88.35↑ 8.60
 2  | 92.28 1.07 | 92.13 5.95 | 95.80 1.21 | 95.56 3.15 | 92.53 1.52 | 92.60 2.28
 3  | 82.45 3.89 | 81.36 1.30 | 84.13 8.88 | 84.07 6.11 | 84.81 1.06 | 81.46 2.24
 4  | 94.18 3.71 | 94.34 2.25 | 97.19 0.35 | 97.37 0.42 | 96.54 1.77 | 96.08 1.22
 5  | 72.77 9.33 | 73.81 0.91 | 78.66↑ 8.65 | 78.80↑ 0.04 | 78.81↑ 5.55 | 74.12 9.97
 6  | 68.71 1.79 | 67.12 6.55 | 68.62 0.96 | 70.08 6.28 | 71.20↑ 9.94 | 68.75 5.32
 7  | 72.40 7.60 | 71.40 7.57 | 73.73 0.72 | 75.82↑ 8.07 | 74.67 5.85 | 72.40 7.36
 8  | 74.10 4.15 | 77.06 7.06 | 79.64↑ 6.98 | 77.53 4.77 | 78.48↑ 3.16 | 78.03↑ 2.70
 9  | 80.74 7.59 | 83.26 0.83 | 87.87↑ 2.29 | 87.75↑ 0.36 | 86.97↑ 6.88 | 86.14↑ 1.99
 10 | 98.58 0.60 | 98.54 0.11 | 99.86 0.03 | 99.37 0.06 | 99.77 0.02 | 99.10 0.40
 11 | 91.44 6.62 | 88.87↓ 7.26 | 93.25 2.89 | 90.53 1.44 | 92.41 4.96 | 91.10 3.76
 12 | 91.33 2.04 | 91.33 7.02 | 95.08↑ 2.08 | 94.40 0.94 | 91.77 6.33 | 91.75 2.28
 13 | 80.87 1.32 | 79.74 1.74 | 83.81 6.59 | 84.23↑ 7.59 | 81.41 0.76 | 80.86 7.26
 14 | 72.28 2.81 | 76.60 3.10 | 77.59↑ 2.84 | 78.35↑ 5.11 | 80.40↑ 3.31 | 77.74↑ 1.69
 15 | 98.01 0.60 | 76.40↓ 3.75 | 77.59↓ 1.81 | 77.94↓ 0.63 | 80.26↓ 7.99 | 77.38↓ 4.97
 16 | 94.24 0.39 | 95.63 1.83 | 97.67↑ 1.62 | 99.09↑ 0.14 | 99.85↑ 0.04 | 97.62↑ 1.05
 17 | 74.71 8.62 | 72.49 0.48 | 72.55 9.85 | 77.08↑ 8.89 | 76.90 0.86 | 72.51 5.30
 18 | 89.82 5.33 | 70.33↓ 7.94 | 74.21↓ 5.66 | 70.37↓ 7.81 | 70.56↓ 7.49 | 72.48↓ 1.62
 19 | 60.60 1.83 | 68.13↑ 3.76 | 72.70↑ 8.05 | 71.12↑ 5.40 | 71.46↑ 7.62 | 69.32↑ 0.08
 20 | 58.88 5.70 | 17.84↓ 2.33 | 22.47↓ 1.06 | 19.84↓ 1.48 | 21.98↓ 1.86 | 19.49↓ 1.18
 21 | 88.73 3.01 | 69.68↓ 4.14 | 70.71↓ 9.94 | 72.29↓ 8.70 | 74.23↓ 1.22 | 71.04↓ 7.45
 22 | 69.18 8.92 | 74.26↑ 1.32 | 77.71↑ 9.31 | 79.11↑ 1.26 | 78.56↑ 9.60 | 75.70↑ 7.67
 23 | 89.16 1.33 | 90.90 1.18 | 91.08 5.23 | 95.12↑ 1.01 | 93.26↑ 0.67 | 91.32 2.97
 Average            | 81.61 11.66 | 78.12 16.28 | 80.92 16.04 | 81.02 16.44 | 81.07 15.78 | 79.36 16.24
 significance (p)   | -           | 0.150 | 0.771 | 0.812 | 0.818 | 0.344
 win/loss/draw      | -           | 10/12/1 | 17/6/0 | 18/5/0 | 19/4/0 | 15/8/0
 sig. win/sig. loss | -           | 3/5 | 9/4 | 11/4 | 10/4 | 7/4

Table 19: Average rule set accuracy with standard deviation (columns ACC, sd) for different variants of the unordered algorithm using 10-fold stratified cross-validation. ↑/↓ mark statistically significant wins/losses against standard CN2.
 #  | CN2 standard | CN2 WRAcc   | CN2-SD (g=0.5) | CN2-SD (g=0.7) | CN2-SD (g=0.9) | CN2-SD (add.)
 1  | 33.39 5.61 | 86.12↑ 0.05 | 83.31↑ 2.01 | 84.27↑ 9.44 | 84.47↑ 6.05 | 85.12↑ 5.16
 2  | 90.74 3.57 | 89.52 7.26 | 94.37↑ 2.29 | 96.28↑ 1.47 | 97.33↑ 0.98 | 94.52↑ 1.67
 3  | 84.51 0.15 | 80.11↓ 9.84 | 82.58 5.60 | 80.98↓ 8.12 | 78.38↓ 7.44 | 83.03 1.88
 4  | 96.22 2.55 | 93.59 2.26 | 97.19 0.76 | 92.37↓ 2.33 | 96.54 1.90 | 92.87↓ 2.66
 5  | 71.33 7.86 | 80.75↑ 0.51 | 80.52↑ 1.82 | 80.56↑ 8.17 | 80.76↑ 5.02 | 80.06↑ 3.49
 6  | 70.53 5.99 | 64.42↓ 3.29 | 68.09 7.34 | 68.63 2.44 | 64.02↓ 8.71 | 70.61 2.46
 7  | 71.99 5.76 | 74.00 7.19 | 73.99 7.63 | 73.92 6.01 | 75.29↑ 7.70 | 72.73 3.84
 8  | 74.17 5.35 | 73.98 0.90 | 83.82↑ 9.76 | 84.69↑ 0.63 | 87.02↑ 9.80 | 85.62↑ 1.84
 9  | 78.81 4.64 | 85.65↑ 0.33 | 84.82↑ 2.78 | 82.80↑ 5.19 | 78.66 6.12 | 81.29↑ 0.23
 10 | 96.22 2.31 | 98.59↑ 0.10 | 97.13 0.78 | 96.54 0.13 | 99.65↑ 0.04 | 97.42 0.24
 11 | 94.46 1.52 | 90.86↓ 0.32 | 93.17 2.68 | 93.99 2.83 | 94.30 2.10 | 93.87 1.07
 12 | 99.17 0.23 | 99.17 0.16 | 99.96 0.01 | 99.38 0.15 | 99.92 0.03 | 99.46 0.06
 13 | 83.20 8.68 | 78.38↓ 2.33 | 82.11 1.04 | 84.74 4.51 | 80.12↓ 4.12 | 83.06 6.97
 14 | 75.06 6.13 | 79.41↑ 5.12 | 81.62↑ 7.61 | 79.97↑ 1.29 | 80.12↑ 5.34 | 78.51↑ 1.15
 15 | 97.90 0.36 | 78.90↓ 6.95 | 91.88↓ 2.73 | 91.28↓ 2.63 | 90.87↓ 2.01 | 89.15↓ 4.32
 16 | 96.88 1.67 | 96.41 1.63 | 93.44↓ 2.97 | 95.35 0.18 | 94.82 1.06 | 93.95↓ 2.06
 Average            | 82.16 16.81 | 84.37 9.87 | 86.75 8.95 | 86.61 8.81 | 86.39 10.32 | 86.33 8.60
 significance (p)   | -           | 0.563 | 0.175 | 0.198 | 0.236 | 0.236
 win/loss/draw      | -           | 6/9/1 | 10/6/0 | 10/6/0 | 9/7/0 | 10/6/0
 sig. win/sig. loss | -           | 5/5 | 6/2 | 6/3 | 7/4 | 6/3

Table 20: Area under the ROC curve (AUC-Method-2) with standard deviation (columns AUC, sd) for different variants of the unordered algorithm using 10-fold stratified cross-validation. ↑/↓ mark statistically significant wins/losses against standard CN2.
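The AUC values above are obtained with the paper's AUC-Method-2, which evaluates the probabilistic classifications produced by the rule sets. The two-class AUC underlying such methods is the probability that a randomly drawn positive example receives a higher score than a randomly drawn negative one (the Mann-Whitney statistic, cf. Hand and Till, 2001). A minimal sketch of that core computation, not the paper's exact procedure:

```python
def auc(pos_scores, neg_scores):
    """Two-class AUC as the probability that a random positive example is
    ranked above a random negative one; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Scores are e.g. the predicted probability of the positive class.
print(round(auc([0.9, 0.8, 0.4], [0.7, 0.3]), 3))  # 5 of 6 pairs ranked correctly: 0.833
```

The quadratic pairwise loop is fine for illustration; a sorting-based O(n log n) formulation is preferable on large data sets.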
References

Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining, AAAI Press: 307-328, 1996.

Bojan Cestnik. Estimating probabilities: A crucial task in machine learning. In Proceedings of the Ninth European Conference on Artificial Intelligence, pages 147-149, Pitman, 1990.

Peter Clark and Robin Boswell. Rule induction with CN2: Some recent improvements. In Proceedings of the Fifth European Working Session on Learning, pages 151-163, Springer, 1991.

Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261-283, 1989.

William W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123, Morgan Kaufmann, 1995.

William W. Cohen and Yoram Singer. A simple, fast, and effective rule learner. In Proceedings of AAAI/IAAI, pages 335-342, AAAI Press, 1999.

Saso Dzeroski, Bojan Cestnik, and Igor Petrovski. Using the m-estimate in rule induction. Journal of Computing and Information Technology, 1(1):37-46, 1993.

Cesar Ferri-Ramírez, Peter A. Flach, and Jose Hernandez-Orallo. Learning decision trees using the area under the ROC curve. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 139-146, Morgan Kaufmann, 2002.

Peter A. Flach. The geometry of ROC space: Understanding machine learning metrics through ROC isometrics. In Proceedings of the Twentieth International Conference on Machine Learning, pages 194-201, AAAI Press, 2003.

Peter A. Flach and Iztok Savnik. Database dependency discovery: A machine learning approach. AI Communications, 12(3):139-160, 1999.

Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148-156, Morgan Kaufmann, 1996.

Jerome H. Friedman and Nicholas I. Fisher. Bump hunting in high-dimensional data. Statistics and Computing, 9:123-143, 1999.

Johannes Fürnkranz and Peter A. Flach. An analysis of rule evaluation metrics. In Proceedings of the Twentieth International Conference on Machine Learning, pages 202-209, AAAI Press, 2003.

Dragan Gamberger and Nada Lavrac. Expert guided subgroup discovery: Methodology and application. Journal of Artificial Intelligence Research, 17:501-527, 2002.

A. D. Gordon. Classification. Chapman and Hall, London, 1982.

K. Chidananda Gowda and Edwin Diday. Symbolic clustering using a new dissimilarity measure. IEEE Transactions on Systems, Man, and Cybernetics, 22(2):567-578, 1992.

David J. Hand and Robert J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45:171-186, 2001.

David Hsu, Oren Etzioni, and Stephen Soderland. A redundant covering algorithm applied to text classification. In Proceedings of the AAAI Workshop on Learning from Text Categorization, AAAI Press, 1998.

Viktor Jovanoski and Nada Lavrac. Classification rule learning with APRIORI-C. In Progress in Artificial Intelligence: Proceedings of the Tenth Portuguese Conference on Artificial Intelligence, pages 44-51, Springer, 2001.

Branko Kavsek, Nada Lavrac, and Viktor Jovanoski. APRIORI-SD: Adapting association rule learning to subgroup discovery. In Proceedings of the Fifth International Symposium on Intelligent Data Analysis, pages 230-241, Springer, 2003.

Willi Klösgen. Explora: A multipattern and multistrategy discovery assistant. Advances in Knowledge Discovery and Data Mining, MIT Press: 249-271, 1996.

Arno J. Knobbe, Marc de Haas, and Arno Siebes. Propositionalisation and aggregates. In Proceedings of the Fifth European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 277-288, Springer, 2001.

Arno J. Knobbe, Arno Siebes, and Bart Marseille. Involving aggregate functions in multi-relational search. In Proceedings of the Sixth European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 287-298, Springer, 2002.

Pat Langley. Elements of Machine Learning. Morgan Kaufmann, 1996.

Nada Lavrac, Peter A. Flach, Branko Kavsek, and Ljupco Todorovski. Adapting classification rule induction to subgroup discovery. In Proceedings of the Second IEEE International Conference on Data Mining, pages 266-273, IEEE Computer Society, 2002.

Nada Lavrac, Peter A. Flach, and Blaz Zupan. Rule evaluation measures: A unifying view. In Proceedings of the Ninth International Workshop on Inductive Logic Programming, pages 174-185, Springer, 1999.

Nada Lavrac, Filip Zelezný, and Peter A. Flach. RSD: Relational subgroup discovery through first-order feature construction. In Proceedings of the Twelfth International Conference on Inductive Logic Programming, pages 149-165, Springer, 2003.

Yongwon Lee, Bruce G. Buchanan, and John M. Aronis. Knowledge-based learning in exploratory science: Learning rules to predict rodent carcinogenicity. Machine Learning, 30:217-240, 1998.

Peter Ljubic, Ljupco Todorovski, Nada Lavrac, and John C. Bullas. Time-series analysis of UK traffic accident data. In Proceedings of the Fifth International Multi-conference Information Society, pages 131-134, 2002.

Ryszard S. Michalski. Pattern recognition as rule-guided inductive inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(4):349-361, 1980.

Ryszard S. Michalski, Igor Mozetic, Jiarong Hong, and N. Lavrac. The multi-purpose incremental learning system AQ15 and its testing application on three medical domains. In Proceedings of the Fifth National Conference on Artificial Intelligence, pages 1041-1045, Morgan Kaufmann, 1986.

Patrick M. Murphy and David W. Aha. UCI repository of machine learning databases. Available electronically at http://www.ics.uci.edu/mlearn/MLRepository.html, 1994.

Foster J. Provost and Tom Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203-231, 2001.

Luc De Raedt and Hendrik Blockeel. Using logical decision trees for clustering. In Proceedings of the Seventh International Workshop on Inductive Logic Programming, pages 133-140, Springer, 1997.

Luc De Raedt, Hendrik Blockeel, Luc Dehaspe, and Wim Van Laer. Three companions for data mining in first order logic. Relational Data Mining, Springer: 106-139, 2001.

Luc De Raedt and Luc Dehaspe. Clausal discovery. Machine Learning, 26:99-146, 1997.

Ronald L. Rivest. Learning decision lists. Machine Learning, 2(3):229-246, 1987.

Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Conference on Computational Learning Theory, pages 80-91, ACM Press, 1998.

Avi Silberschatz and Alexander Tuzhilin. On subjective measures of interestingness in knowledge discovery. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 275-281, 1995.

R. R. Sokal and Peter H. A. Sneath. Principles of Numerical Taxonomy. Freeman, San Francisco, 1963.

Ljupco Todorovski, Peter A. Flach, and Nada Lavrac. Predictive performance of weighted relative accuracy. In Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery, pages 255-264, Springer, 2000.

Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.

Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In Proceedings of the First European Conference on Principles of Data Mining and Knowledge Discovery, pages 78-87, Springer, 1997.

Stefan Wrobel. Inductive logic programming for knowledge discovery in databases. Relational Data Mining, Springer: 74-101, 2001.