Learning Classifiers Using Hierarchically Structured Class Taxonomies

Feihong Wu, Jun Zhang, and Vasant Honavar

Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University, Ames, Iowa 50011-1040, USA
{wuflyh, jzhang, honavar}@cs.iastate.edu

Abstract. We consider classification problems in which the class labels are organized into an abstraction hierarchy in the form of a class taxonomy. We define a structured label classification problem. We explore two approaches for learning classifiers in such a setting. We also develop a performance measure for evaluating such classifiers and present preliminary results that demonstrate the promise of the proposed approaches.

1 Introduction

Machine learning algorithms for the design of pattern classifiers have been well studied in the literature. Most such algorithms operate under the assumption that class labels are mutually exclusive. However, many applications present more complex classification scenarios. For instance, in a computer vision application, a natural scene containing multiple objects can be assigned to multiple categories [3]; in a digital library application, a text document can be assigned to multiple topics organized into a topic hierarchy; in bioinformatics, an ORF can be assigned multiple functions. In such applications, the class labels are naturally organized in the form of a hierarchically structured class taxonomy which defines an abstraction over class labels. Such a classification scenario presents two main challenges: (1) The large number of class label combinations makes it hard to reliably learn accurate classifiers from relatively sparse data sets. (2) Performance measures that assume mutually exclusive class labels are not suitable for evaluation of classifiers in settings where the class labels are organized into a class hierarchy. Despite recent attempts to address some of these problems [1,2,3,4,5,6,7], at present, a general solution is still lacking. Against this background, we explore approaches to learning classifiers from data in the presence of class taxonomies.

Section 2 presents a precise formulation of the single label, multilabel, and structured label classification problems; Section 3 describes two approaches to learning classifiers from data in the presence of class taxonomies; Section 4 explores performance measures for evaluating the resulting classifiers; Section 5 presents experimental results on the Reuters-21578 data and the phenotype data [5]; Section 6 concludes with a summary and discussion.

J.-D. Zucker and L. Saitta (Eds.): SARA 2005, LNAI 3607, pp. 313-320, 2005.
© Springer-Verlag Berlin Heidelberg 2005

2 Preliminaries

Many standard classifier learning algorithms make the basic assumption of single label instances. That is, each instance, represented by an ordered set of attributes A = {A_1, A_2, ..., A_N}, can belong to one and only one class from a set of classes C = {c_1, c_2, ..., c_M}. Therefore, the class labels in C are mutually exclusive.

In multilabel classification settings, class labels are not mutually exclusive. Each instance can be labelled using a subset of labels c_s ⊆ C, where C = {c_1, c_2, ..., c_M} is a finite set of possible classes. If instances can be labelled with arbitrary subsets of C, the total number of possible multilabel combinations is 2^M.

An even more complex classification scenario is one in which the instances to be classified are assigned labels from a hierarchically structured class taxonomy. Here, we define a class taxonomy first and then formalize the resulting structured label classification problem.

Definition 1 (Class Taxonomy). A Class Taxonomy CT is a tree structured regular concept hierarchy defined over a partially ordered set (C_T, ≺), where C_T is a finite set that enumerates all class concepts in the application domain, and the relation ≺ represents the is-a relationship, which is both anti-reflexive and transitive:

- The single greatest element, ANY, is the root of the tree.
- ∀ c_i ∈ C_T, c_i ≺ c_i is false.
- ∀ c_i, c_j, c_k ∈ C_T, c_i ≺ c_j and c_j ≺ c_k imply c_i ≺ c_k.

A tree structured class taxonomy represents class memberships at different levels of abstraction. The root of a class taxonomy is the most general label (i.e., ANY), which is applicable to any instance. The leaves of the class taxonomy denote the most specific labels. The tree structure imposes strict constraints on these class memberships. Therefore, when an instance is assigned a label l from a hierarchically structured class taxonomy, it is implicitly labelled with all the ancestors of l in the class taxonomy.

Definition 2 (Structured Label). Any structured label C_s is represented by a subtree of CT. C_s is a partially ordered set (C_s, ≺) that defines the same is-a relationships as in CT:
- ∀ c_i ∈ C_s, c_i is ANY or c_i ≺ parent(c_i), where parent(c_i) ∈ C_s is the immediate ancestor of c_i in CT.

A class taxonomy imposes constraints on the integrity and validity of structured labels. The integrity constraint states that C_s is a subtree of CT sharing the same root: a structured label is not an arbitrary fragment of the class taxonomy. The validity constraint captures the is-a relationships among class labels within a class taxonomy: a structured label is invalid if it contains a label l but not the parents of l in the given class taxonomy.

3 Methods

3.1 Binarized Structured Label Learning

One simple approach is to build a classifier consisting of a set of binary classifiers (one for each class). However, the drawbacks of this approach are obvious: (1) When making predictions for unlabelled instances, the classification results may violate the integrity and validity constraints. (2) The set of binary classifiers fails to exploit potentially useful constraints provided by the class taxonomy during learning.

To overcome these disadvantages, we build a hierarchically organized collection of classifiers that mirrors the structure of the class taxonomy CT. The resulting classifiers form a partially ordered set (h_CT, ≺), where h_CT = {h_C1, ..., h_CM} is the set of classifiers and ≺ represents the partial order among classifiers. If C_j is a child of C_i in CT, then the respective classifiers satisfy the partial order h_Cj ≺ h_Ci. This partial order on classifiers guides the classification of an instance: if h_Cj ≺ h_Ci, an instance will not be classified using h_Cj if it has already been classified as not belonging to C_i (i.e., the output of h_Ci is 0). We call this method of building hierarchically structured classifiers Binarized Structured Label Learning (BSLL).

Fig. 1. Structured class taxonomy (a tree over nodes A-H; figure not reproduced in this text version)

3.2 Split-Based Structured Label Learning

A second approach to structured label learning is an adaptation of an approach to multi-label learning. We digress briefly to outline approaches to multi-label learning.
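The top-down prediction scheme of BSLL described in Section 3.1 can be sketched in Python as follows. This is a minimal sketch: the `taxonomy`/`classifiers` dictionary encoding, the function name `bsll_predict`, and the `FixedClassifier` stand-in are illustrative assumptions, not the paper's implementation.

```python
class FixedClassifier:
    """Stand-in for a trained per-class binary classifier (illustration only)."""
    def __init__(self, output):
        self.output = output

    def predict(self, x):
        return self.output


def bsll_predict(x, taxonomy, classifiers, root="ANY"):
    """Top-down BSLL prediction: the classifier h_C for a class C is consulted
    only if the classifier of C's parent accepted the instance, so rejected
    subtrees are pruned without running their classifiers."""
    predicted = []
    frontier = list(taxonomy.get(root, []))   # children of the root label ANY
    while frontier:
        c = frontier.pop()
        if classifiers[c].predict(x) == 1:    # instance classified as belonging to c
            predicted.append(c)
            frontier.extend(taxonomy.get(c, []))  # descend into c's children
    return predicted


# Toy two-level taxonomy borrowing label names from Section 5.1.
taxonomy = {"ANY": ["grain", "livestock"], "grain": ["wheat", "corn"]}
classifiers = {"grain": FixedClassifier(1), "livestock": FixedClassifier(0),
               "wheat": FixedClassifier(1), "corn": FixedClassifier(0)}
```

Because a label is appended only after its parent's classifier accepts the instance, every predicted label set contains the ancestors of each of its labels, so the validity constraint holds by construction.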
In real world applications it is very rare that each of the 2^M multilabel combinations appears in the training data; the actual number of multilabels is much smaller than the possible number 2^M. Thus, we may set an upper limit on the number of possible class label combinations: if the number of labels that can occur in a multi-label is limited to 2, we only consider combinations of 2 class labels instead of M class labels. Another option is to consider only the multilabels that appear in the training data. In either case, we cannot apply standard learning algorithms directly to the multi-label classification problem. This is because the multilabels and the individual class labels are not mutually exclusive, and it is not uncommon for some instances to be labelled with a single class label and others with multilabels. Because most standard learning algorithms assume mutually exclusive class labels, we need to generate mutually exclusive classes. For example, consider C = {A, B, C} with instance sets S_A, S_B, S_C respectively. Suppose the only multilabel observed in the training data is {A, B}. Note that S_A ∩ S_B ≠ ∅. The extended class label set is C' = {A, B, C, A&B}, which represents the instance sets S_A - (S_A ∩ S_B), S_B - (S_A ∩ S_B), S_C, and S_A ∩ S_B.

This approach to transforming class labels to obtain mutually exclusive class labels can be applied to the structured label learning problem by building split-based classifiers. We first define a split in a class taxonomy CT, and then show, for each split, how to learn a classifier from instances with extended label sets (as outlined above).

Definition 3 (Split). A split is a one level subtree within a class taxonomy, which includes one parent node, all its children nodes, and the links between the parent node and the children nodes.
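The label-extension step above can be sketched as follows. This is a minimal sketch under our own naming assumptions: the function name `extend_labels` and the convention of naming an extended class by joining its labels with `&` are illustrative, not prescribed by the paper.

```python
def extend_labels(labelled_instances):
    """Replace each instance's label set with a single mutually exclusive
    'extended' class name, so that a standard single-label learner can be
    trained on the result. For example, the label set {A, B} becomes "A&B".
    """
    return [(x, "&".join(sorted(labels))) for x, labels in labelled_instances]


# Mirrors the example in the text: C = {A, B, C}, observed multilabel {A, B}.
data = [("x1", {"A"}), ("x2", {"B"}), ("x3", {"C"}), ("x4", {"A", "B"})]
extended = extend_labels(data)
```

The resulting extended classes A, B, C, and A&B are mutually exclusive: an instance labelled only {A} falls in the extended class A (i.e., S_A minus S_A ∩ S_B), while an instance labelled {A, B} falls in A&B.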
Obviously, the number of splits in the class taxonomy is smaller than the number of nodes. We can therefore build a set of classifiers on the splits to solve the structured label problem with fewer resulting classifiers. Within each split, the structured label problem reduces to a multilabel problem, and we only need to consider the combinatorial extensions of class labels at that particular level. Additionally, the split-based classifiers are also partially ordered according to the given class taxonomy. Any instance to be classified follows this topological order of the split-based classifiers: start from the classifier for the split at the first position, and run a split-based classifier only when the instance is predicted to be 1 by the parent split-based classifier.

4 Performance Measure for Structured Label Classification

In single label classification, a loss function (such as the standard 0-1 loss function) loss(c_p, c_o) can be defined to evaluate the cost of misclassifying an instance with observed class label c_o as the predicted class label c_p. However, this approach is inadequate in a structured label problem, in which there is a need to take into account the relationships between the labels assigned to an instance: in the structured label problem, each label set corresponds to a subtree of the class taxonomy. We define a misclassification cost associated with the label set produced by the classifier relative to the correct label set (the correct structured label).

Definition 4 (Node Distance). The node distance is a value d(c_i, c_j) denoting the difference between labels c_i and c_j. It has the following properties:

- d(c_i, c_j) ≥ 0
- d(c_i, c_j) = d(c_j, c_i)
- d(c_i, c_i) = 0

Definition 5 (Dummy Label). The dummy label ⊥ is an add-on label to the class taxonomy which acts as the predicted value for an instance when a classifier cannot decide the class label and does nothing; it is thus a default label. It has the following properties:

- d(⊥, c_i) = d(⊥, c_j)
- d(c_i, c_j) ≥ d(⊥, c_i)

Definition 6 (Non-Redundant Operation).
A non-redundant operation applied to a label set C_i keeps only the children labels when both children labels and their parent labels are present, so that the label redundancies within a class taxonomy are eliminated.

Definition 7 (Mapping). A mapping f between two label sets C_1, C_2 with the same cardinality is a bijection f : C_1 → C_2.

We calculate the distance d(C_p, C_o) between C_p and C_o, the predicted and actual label sets (respectively) of each classified instance, as follows:

- If the cardinalities of C_p and C_o are equal, find a mapping that minimizes the sum of node distances, and divide by the cardinality of the label sets to obtain the distance.
- If the cardinalities of the two label sets are not equal, add as many dummy labels ⊥ as needed to the label set with fewer elements to make the cardinalities of the two label sets equal, and then calculate the distance between the two label sets as before.

The performance of the classifier on a test set T is obtained by averaging the distances between the predicted and actual labels of the instances in T:

d̄ = (Σ_T d(C_p, C_o)) / |T|.

The lower the value of this measure, the better the classifier (in terms of misclassification cost).
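The distance computation above can be sketched as follows. This is a minimal sketch under stated assumptions: `node_dist` stands for any node-distance function satisfying Definitions 4 and 5, the dummy label is written `"<dummy>"`, and the cost-minimizing mapping is found by exhaustive search over bijections, which is affordable only for small label sets.

```python
from itertools import permutations

DUMMY = "<dummy>"  # stands for the add-on dummy label of Definition 5


def label_set_distance(pred, obs, node_dist):
    """d(C_p, C_o): pad the smaller label set with dummy labels to equalize
    cardinalities, then take the cost-minimizing bijection (Definition 7)
    and divide the total by the common cardinality."""
    pred, obs = list(pred), list(obs)
    while len(pred) < len(obs):
        pred.append(DUMMY)
    while len(obs) < len(pred):
        obs.append(DUMMY)
    n = len(pred)
    best = min(sum(node_dist(pred[i], perm[i]) for i in range(n))
               for perm in permutations(obs))
    return best / n


def toy_node_dist(ci, cj):
    """Trivial 0-1 node distance; it satisfies Definitions 4 and 5."""
    return 0.0 if ci == cj else 1.0
```

For larger label sets, the exhaustive search over permutations would be replaced by a minimum-cost bipartite matching algorithm; the factorial search here is only for clarity of the sketch.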
5 Experimental Results

Given a structured label data set, we need the pairwise node distances between class labels to compute the misclassification cost described above. These distances can be specified by a domain expert. Alternatively, the distances may be estimated from a training set based on the co-occurrence of class labels, as follows. For each level in the class taxonomy, we calculate the occurrence of classes in the training set and divide it by the number of labels at that level of the class taxonomy, yielding a level weight. We then calculate the distance between class labels as follows. We place the add-on label ⊥ at the root node of the class taxonomy tree and set each edge distance to the corresponding level weight. For two nodes, if one is an ancestor of the other, the node distance is the sum of the edge distances along the path that connects them; if neither node is an ancestor of the other, the distance between them is defined as the average distance of the two nodes from their nearest common ancestor. After normalization, any two labels in the top level, together with the add-on label ⊥, are at distance 1, and the maximal node distance equals the sum of all the level weights: 1.268 in the Reuters-21578 data and 1.254 in the phenotype data set.

5.1 Results on the Reuters-21578 Data Set

The Reuters-21578 data [8], originally collected by the Carnegie Group for text categorization, does not have a predefined hierarchical class taxonomy. However, many documents are labelled with multiple topic classes. We extracted 670 documents; more than 72% of the documents in this set have multiple class labels. We created a two-level class taxonomy using the current categories of the documents as follows:

grain (barley, corn, wheat, oat, sorghum)
livestock (l-cattle, hog)

We used a Naive Bayes classifier as the base classifier and estimated the performance of the resulting structured label classifier using 5-fold cross validation.
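The node-distance construction described at the start of this section (before normalization) can be sketched as follows. The `parent`-map tree encoding and the per-depth `level_weight` table are our own illustrative assumptions about how the described procedure could be realized; the weights below are toy values, not the ones estimated in the paper.

```python
def make_node_distance(parent, level_weight):
    """Build a node-distance function from a taxonomy tree.

    parent: maps each node to its parent (the root maps to None).
    level_weight: maps the depth of a child node to the weight of the edge
    connecting it to its parent (depth 1 = children of the root)."""
    def ancestors(c):
        path = []
        while c is not None:
            path.append(c)
            c = parent[c]
        return path  # the node itself first, the root last

    def depth(c):
        return len(ancestors(c)) - 1

    def path_cost(c, a):
        """Sum of edge weights on the path from c up to its ancestor a."""
        cost = 0.0
        while c != a:
            cost += level_weight[depth(c)]
            c = parent[c]
        return cost

    def dist(ci, cj):
        if ci == cj:
            return 0.0
        anc_i, anc_j = ancestors(ci), ancestors(cj)
        nca = next(a for a in anc_i if a in anc_j)  # nearest common ancestor
        if nca == ci:                 # ci is an ancestor of cj: path sum
            return path_cost(cj, ci)
        if nca == cj:                 # cj is an ancestor of ci: path sum
            return path_cost(ci, cj)
        # Neither is an ancestor of the other: average distance to the NCA.
        return (path_cost(ci, nca) + path_cost(cj, nca)) / 2.0

    return dist


# Toy taxonomy: ANY -> {A, B}, A -> {C}; level weights 1.0 and 0.5.
parent = {"ANY": None, "A": "ANY", "B": "ANY", "C": "A"}
level_weight = {1: 1.0, 2: 0.5}
d = make_node_distance(parent, level_weight)
```

Placing the add-on label ⊥ at the root, as the text describes, would presumably make d(⊥, c) the path cost from c up to the root; that step and the final normalization are omitted from this sketch.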
The results in Tables 1 and 2 suggest that binarized structured label learning performs as well as split-based structured label learning in this case. Both have good predictive accuracy for the classes at the first level of the class taxonomy: grain and livestock. The overall performance of the two methods (as measured by the estimated misclassification cost) differs slightly, while the average recall and precision calculated over the entire class hierarchy are very close.

Table 1. Average distance: learning on the Reuters-21578 data set

           binarization learning   split-based learning
d̄          0.217                   0.251

Table 2. Recall & precision: learning on the Reuters-21578 data set

           binarization learning    split-based learning
           recall    precision      recall    precision
grain      0.993     0.964          0.993     0.968
livestock  0.766     0.893          0.752     0.917
barley     0.498     0.440          0.454     0.442
wheat      0.852     0.735          0.859     0.724
corn       0.839     0.721          0.818     0.726
oat        0.270     0.750          0.167     0.750
sorghum    0.408     0.560          0.324     0.591
l-cattle   0.146     0.417          0.167     0.339
hog        0.729     0.786          0.717     0.686

5.2 Results on the Phenotype Data Set

Our second experiment used the phenotype data set introduced by Clare and King [5], whose class taxonomy is a hierarchical tree with 4 levels and 198 labels. We chose the C4.5 decision tree as the base classifier for both binarization learning and split-based learning, evaluated with 5-fold cross validation. Split-based structured label learning shows better performance than binarized structured label learning on this data set; its misclassification cost is 0.79. Split-based structured label learning predicts 1 out of 4 class labels correctly in the first-level branches. Compared to the Reuters-21578 data set, the phenotype data set is much more sparse, which might explain why the results are not as good as those on the Reuters-21578 data set.
We also calculated the accuracy, recall, and precision of each class label. It turns out that the accuracy for each class label is quite high (95%). This is due to the fact that the data set is highly unbalanced, so each classifier has a high true negative rate. Owing to the sparseness of the data set, many class labels do not appear in the test data set; this leads to undefined recall and precision estimates because of division by 0. Hence, only the class labels with available recall and precision estimates are listed in Figure 2. They show that split-based structured label learning performs better in terms of recall and precision, which is consistent with the relative performance of the two methods in terms of misclassification cost.

Table 3. Average distance: learning on the phenotype data set

           binarization learning   split-based learning
d̄          1.171                   0.790

Fig. 2. Recall & precision: learning on the phenotype data set (figure not reproduced in this text version)

6 Summary and Discussion

In this paper, we have:

- Precisely formulated the problem of learning from data using abstractions over class labels (the structured label learning problem) as a generalization of the single label and multilabel problems.
- Described two learning methods, the binarized and split-based approaches to learning structured labels, both of which can be adapted to work with any existing learning algorithm for the single label learning task (e.g., Naive Bayes, decision trees, support vector machines).
- Explored a performance measure for evaluation of the resulting structured label classifiers.

Some directions for future work include:

- Development of algorithms that incorporate techniques for exploiting CT (class taxonomies) to handle partially specified class labels.
- Development of more sophisticated metrics for evaluation of structured label classifiers.

Acknowledgements. This research was supported in part by a grant from the National Institutes of Health (GM066387) to Vasant Honavar.

References

1. A. McCallum. "Multilabel text classification with a mixture model trained by EM". AAAI'99 Workshop on Text Learning, 1999.
2. T. Joachims. "Text categorization with Support Vector Machines: Learning with many relevant features". In Machine Learning: ECML-98, Tenth European Conference on Machine Learning, pp. 137-142, 1998.
3. X. Shen, M. Boutell, J. Luo, and C. Brown. "Multilabel machine learning and its application to semantic scene classification". In Proceedings of the 2004 International Symposium on Electronic Imaging (EI 2004), Jan. 18-22, 2004.
4. H.-P. Kriegel, P. Kroeger, A. Pryakhin, and M. Schubert. "Using Support Vector Machines for classifying large sets of multi-represented objects". In Proc. 4th SIAM Int. Conf. on Data Mining, pp. 102-114, 2004.
5. A. Clare and R. D. King. "Knowledge discovery in multi-label phenotype data". In 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2001), volume 2168 of Lecture Notes in Artificial Intelligence, pp. 42-53, 2001.
6. H. Blockeel, M. Bruynooghe, S. Dzeroski, J. Ramon, and J. Struyf. "Hierarchical multi-classification". In Proceedings of the First SIGKDD Workshop on Multi-Relational Data Mining (MRDM-2002), pp. 21-35, July 2002.
7. K. Wang, S. Zhou, and S. C. Liew. "Building hierarchical classifiers using class proximity". Technical Report, National University of Singapore, 1999.
8. The Reuters-21578, Distribution 1.0 test collection is available from http://www.daviddlewis.com/resources/testcollections/reuters21578