Learning Classifiers Using Hierarchically Structured Class Taxonomies

Feihong Wu, Jun Zhang, and Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science,
Iowa State University, Ames, Iowa 50011-1040, USA
{wuflyh, jzhang, honavar}@cs.iastate.edu

Abstract. We consider classification problems in which the class labels are organized into an abstraction hierarchy in the form of a class taxonomy. We define a structured label classification problem. We explore two approaches for learning classifiers in such a setting. We also develop a performance measure for such classifiers and present preliminary results that demonstrate the promise of the proposed approaches.

1 Introduction

Machine learning algorithms for the design of pattern classifiers have been well studied in the literature. Most such algorithms operate under the assumption that each instance belongs to exactly one of a set of mutually exclusive classes. However, many applications present more complex classification scenarios. For instance, in a computer vision application, a natural scene containing multiple objects can be assigned to multiple categories [3]; in a digital library application, a text document can be assigned to multiple topics organized into a topic hierarchy; in bioinformatics, an ORF can be assigned multiple functions, with class labels naturally organized in the form of a hierarchically structured class taxonomy which defines an abstraction over class labels. Such a classification scenario presents two main challenges: (1) The large number of class label combinations makes it hard to reliably learn accurate classifiers from relatively sparse data sets. (2) Standard performance measures that assume class labels are mutually exclusive are not suitable for evaluation of classifiers in settings where the class labels are organized into a class hierarchy. Despite recent attempts to address some of these problems [1,2,3,4,5,6,7], at present, a general solution is still lacking. Against this background, we explore approaches to learning classifiers in the presence of class taxonomies.

Section 2 presents a precise formulation of the single label, multilabel, and structured label classification problems; Section 3 describes two approaches to learning classifiers from data in the presence of class taxonomies; Section 4 explores performance measures for evaluating the resulting classifiers; Section 5 presents experimental results on the Reuters-21578 data and the phenotype data [5]; Section 6 concludes with a summary and discussion.

J.-D. Zucker and L. Saitta (Eds.): SARA 2005, LNAI 3607, pp. 313–320, 2005.
© Springer-Verlag Berlin Heidelberg 2005

2 Preliminaries

Many standard classifier learning algorithms make the basic assumption of single label instances. That is, each instance, represented by an ordered set of attributes A = {A_1, A_2, ..., A_N}, can belong to one and only one class from a set of classes C = {c_1, c_2, ..., c_M}. Therefore, the class labels in C are mutually exclusive.

In multilabel classification settings, class labels are not mutually exclusive. Each instance can be labelled using a subset of labels C_s ⊆ C, where C = {c_1, c_2, ..., c_M} is a finite set of possible classes. If instances can be labelled with arbitrary subsets of C, the total number of possible multilabel combinations is 2^M.

An even more complex classification scenario is one in which instances to be classified are assigned labels from a hierarchically structured class taxonomy. Here, we first define a class taxonomy and then formalize the resulting structured label classification problem.

Definition 1 (Class Taxonomy). A class taxonomy CT is a tree-structured regular concept hierarchy defined over a partially ordered set (C_T, ≺), where C_T is a finite set that enumerates all class concepts in the application domain, and the relation ≺ represents the is-a relationship that is both anti-reflexive and transitive:

- The only greatest element, "ANY", is the root of the tree.
- ∀ c_i ∈ C, c_i ≺ c_i is false.
- ∀ c_i, c_j, c_k ∈ C, c_i ≺ c_j and c_j ≺ c_k imply c_i ≺ c_k.

A tree-structured class taxonomy represents class memberships at different levels of abstraction. The root of a class taxonomy is the most general label (i.e., "ANY") that is applicable to any instance. The leaves of the class taxonomy indicate the most specific labels. The tree structure imposes strict constraints on these class memberships. Therefore, when an instance is assigned a label l from a hierarchically structured class taxonomy, it is implicitly labelled with all the ancestors of the label l in the class taxonomy.

Definition 2 (Structured Label). Any structured label C_s is represented by a subtree of CT. C_s is a partially ordered set (C_s, ≺) that defines the same is-a relationships as in CT:
∀ c_i ∈ C_s: c_i is ANY, or c_i ≺ parent(c_i), where parent(c_i) ∈ C_s is the immediate ancestor of c_i in CT.

A class taxonomy imposes constraints on the integrity and validity of the structured labels. The integrity constraint states that C_s is a subtree of CT sharing the same root: a structured label is not an arbitrary fragment of the class taxonomy. The validity constraint captures the is-a relationships among class labels within a class taxonomy. A structured label is invalid if it contains a label l but not the parents of l in the given class taxonomy.

3 Methods

3.1 Binarized Structured Label Learning

One simple approach is to build a classifier consisting of a set of binary classifiers (one for each class). However, the drawbacks of this approach are obvious: (1) When making predictions for unlabelled instances, the classification results may violate the integrity and validity constraints. (2) The set of binary classifiers fails to exploit potentially useful constraints provided by the class taxonomy during learning.

To overcome these disadvantages, we build a hierarchically organized collection of classifiers that mirrors the structure of the class taxonomy CT. The resulting classifiers form a partially ordered set (h_CT, ≺), where h_CT = {h_C1, ..., h_CM} is the set of classifiers, and ≺ represents partial orders among classifiers. If C_j is a child of C_i in CT, then the respective classifiers satisfy the partial order h_Cj ≺ h_Ci. This partial order on classifiers guides the classification of an instance. If h_Cj ≺ h_Ci, an instance will not be classified using h_Cj if it has been classified as not belonging to C_i (i.e., the output of h_Ci is 0). We call our method of building such hierarchically structured classifiers "Binarized Structured Label Learning" (BSLL).

Fig. 1. Structured class taxonomy (a tree over nodes A-H; figure not reproduced).

3.2 Split-Based Structured Label Learning

A second approach to structured label learning is an adaptation of an approach to multi-label learning. We digress briefly to outline approaches to multi-label learning.
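Before turning to multi-label learning, the top-down BSLL prediction cascade of Section 3.1 can be sketched in a few lines. The taxonomy below and the stub classifiers are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch of BSLL prediction: one binary classifier per taxonomy
# node, applied top-down so that a child classifier runs only if its
# parent classifier predicted positive (the partial order h_Cj < h_Ci).

taxonomy = {              # parent -> children, tree rooted at "ANY"
    "ANY": ["A", "B"],
    "A": ["C", "D"],
    "B": ["E", "F"],
}

def bsll_predict(instance, taxonomy, classifiers):
    """Return the structured label (set of node names) for `instance`.

    `classifiers[c]` is any callable returning 1 (instance belongs to
    class c) or 0. Children of a node are queried only when the node
    itself was predicted positive.
    """
    predicted = set()
    frontier = list(taxonomy["ANY"])     # the root label ANY applies to all
    while frontier:
        c = frontier.pop()
        if classifiers[c](instance) == 1:
            predicted.add(c)
            frontier.extend(taxonomy.get(c, []))  # descend only on positives
    return predicted

# Example with hypothetical stub classifiers:
clf = {"A": lambda x: 1, "B": lambda x: 0, "C": lambda x: 1,
       "D": lambda x: 0, "E": lambda x: 1, "F": lambda x: 1}
print(sorted(bsll_predict(None, taxonomy, clf)))  # -> ['A', 'C']
```

Note that because a node is added only after its parent tested positive, predictions produced this way automatically satisfy the integrity and validity constraints of Section 2 (E and F are never even queried once B tests negative).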
In real-world applications it is very rare that each of the 2^M multilabel combinations appears in the training data. The actual number of multilabels is much smaller than the possible number 2^M. Thus, we may set an upper limit on the number of possible class label combinations. If the number of labels that can occur in a multi-label is limited to 2, we will only consider the combinations of 2 class labels instead of M class labels. Another option is to consider only the multilabels that appear in the training data. In either case, we cannot apply standard learning algorithms directly to the multi-label classification problem. This is because the multilabels and the individual class labels are not mutually exclusive, and it is not uncommon for some instances to be labelled with a single class label and others with multilabels. Because most standard learning algorithms assume mutually exclusive class labels, we will need to generate mutually exclusive classes. For example, consider C = {A, B, C} with instance sets S_A, S_B, S_C respectively. Suppose the only multilabel observed in the training data is {A, B}. Note that S_A ∩ S_B ≠ ∅. So the extended class label set is C' = {A', B', C', A&B}, which represents the instance sets S_A − S_A ∩ S_B, S_B − S_A ∩ S_B, S_C, and S_A ∩ S_B.

This approach of transforming class labels to obtain mutually exclusive class labels can be applied to the structured label learning problem by building split-based classifiers. We will first define a split in a class taxonomy CT, and then for each split we show how to learn a respective classifier from instances with extended label sets (as outlined above).

Definition 3 (Split). A split is a one-level subtree within a class taxonomy, which includes one parent node, all its children nodes, and the links between the parent node and the children nodes.
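The label-extension step in the example above (C = {A, B, C} with the single observed multilabel {A, B}) can be sketched as follows; the helper name and the toy data are illustrative, not from the paper:

```python
# Sketch of the label-extension step: each distinct label set observed
# in training becomes one mutually exclusive "extended" class, so a
# standard single-label learner can then be applied.

def extend_labels(labelled_instances):
    """labelled_instances: list of (instance, set_of_labels) pairs.
    Returns (training pairs with a single extended label each,
    the set of extended class names)."""
    extended = []
    classes = set()
    for x, labels in labelled_instances:
        name = "&".join(sorted(labels))   # e.g. {"A", "B"} -> "A&B"
        extended.append((x, name))
        classes.add(name)
    return extended, classes

data = [("x1", {"A"}), ("x2", {"A", "B"}), ("x3", {"C"})]
train, classes = extend_labels(data)
print(sorted(classes))  # -> ['A', 'A&B', 'C']
```

Here the instances mapped to "A" are exactly those in S_A − S_A ∩ S_B, matching the paper's extended set C' (no instance in this toy data is labelled only with B, so a B-only extended class does not arise).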
Obviously, the number of splits in the class taxonomy is smaller than the number of nodes. We can build a set of classifiers on the splits to solve the structured label problem, so as to decrease the number of resulting classifiers. Within each split, the structured label problem is reduced to a multilabel problem, and we only need to consider the combinatorial extensions of class labels at that particular level. Additionally, the split-based classifiers are also partially ordered according to the given class taxonomy. Any instance to be classified follows this topological order of the split-based classifiers: start from the classifier for the split at the first position, and continue to run a split-based classifier only when the instance is predicted to be "1" by the parent split-based classifier.

4 Performance Measure for Structured Label Classification

In single label classification, a loss function (like the standard 0-1 loss function) loss(c_p, c_o) can be defined to evaluate the cost of misclassifying an instance with observed class label c_o as the predicted class label c_p. However, this approach is inadequate in a structured label problem, in which there is a need to take into account the relationships between the labels assigned to an instance. Here each label set corresponds to a subtree of the class taxonomy. We define a misclassification cost associated with the label set produced by the classifier relative to the correct label set (the correct structured label).

Definition 4 (Node Distance). Node distance is a value d(c_i, c_j) denoting the difference between labels c_i and c_j. It has the following properties:

- d(c_i, c_j) ≥ 0
- d(c_i, c_j) = d(c_j, c_i)
- d(c_i, c_i) = 0

Definition 5 (Dummy Label). The dummy label λ is an "add-on" label to the class taxonomy which acts as the predicted value for an instance when a classifier cannot decide the class label and does nothing; it is thus a label "by default". It has the following properties:

- d(λ, c_i) = d(λ, c_j)
- d(c_i, c_j) ≤ d(λ, c_i)

Definition 6 (Non-Redundant Operation).
A non-redundant operation applied to a label set C_i keeps the children labels when both children labels and their parent labels are present, so that we eliminate the label redundancies within a class taxonomy.

Definition 7 (Mapping). A mapping f between two label sets C_1, C_2 with the same cardinality is a bijection f: C_1 → C_2.

We calculate the distance d(C_p, C_o) between C_p and C_o, the predicted and actual labels (respectively) of each classified instance, as follows:

- If the cardinalities of C_p and C_o are equal, find a mapping that minimizes the sum of node distances, and divide by the cardinality of the label sets to obtain the distance.
- If the cardinalities of the two label sets are not equal, add as many dummy labels λ as needed to the label set with fewer elements to make the cardinalities of the two label sets equal, and then calculate the distance between the two label sets as before.

The performance of the classifier on a test set T is obtained by averaging the distances between the predicted and actual labels of the instances in T: d̄ = (Σ_T d(C_p, C_o)) / |T|. The lower the value of this measure, the better the classifier (in terms of misclassification cost).
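The two-case distance computation above can be sketched as a brute-force search over the bijections of Definition 7 (feasible because per-instance label sets are small). The function names and the toy node-distance below are illustrative assumptions, not from the paper:

```python
from itertools import permutations

DUMMY = "λ"   # the "add-on" dummy label of Definition 5

def label_set_distance(C_p, C_o, node_dist):
    """Distance d(C_p, C_o) between predicted and observed label sets.

    `node_dist(a, b)` is a pairwise node distance that also handles
    DUMMY. Unequal-size sets are padded with DUMMY labels first; then
    the bijection minimizing the summed node distance is found by
    enumerating permutations, and the sum is averaged over the set size.
    """
    p, o = list(C_p), list(C_o)
    n = max(len(p), len(o))
    p += [DUMMY] * (n - len(p))          # pad the smaller set
    o += [DUMMY] * (n - len(o))
    best = min(sum(node_dist(a, b) for a, b in zip(p, perm))
               for perm in permutations(o))
    return best / n

# Toy node distance (illustrative): identical labels 0, dummy 1.5, else 1.0
# (consistent with d(λ, c_i) = d(λ, c_j) and d(c_i, c_j) <= d(λ, c_i)).
def toy_dist(a, b):
    if a == b:
        return 0.0
    if DUMMY in (a, b):
        return 1.5
    return 1.0

print(label_set_distance({"A"}, {"A", "B"}, toy_dist))  # -> 0.75
```

In the printed example the predicted set is padded to {A, λ}; the best bijection matches A with A (cost 0) and λ with B (cost 1.5), giving 1.5 / 2 = 0.75.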
5 Experimental Results

Given a structured label data set, we need the pairwise node distances between class labels to compute the misclassification cost as described above. These distances can be specified by a domain expert. Alternatively, the distances may be estimated from a training set based on the co-occurrence of class labels as follows: For each level in the class taxonomy, we calculate the occurrence of classes in the training set and divide it by the number of labels at that level of the class taxonomy. We then calculate the distance between class labels as follows: We place the "add-on" label λ at the root node of the class taxonomy tree and set each edge distance to the corresponding level weight. For two nodes, if one is an ancestor of the other, the node distance is the sum of the edge distances along the path that connects them; if neither node is an ancestor of the other, the distance between them is defined as the average distance of the two nodes from their nearest common ancestor. After normalization, we assign distance 1 to any two labels in the top level together with the "add-on" label λ, and the maximal node distance, the sum of all the level weights, equals 1.268 in the Reuters-21578 data and 1.254 in the phenotype data set.

5.1 Results on the Reuters-21578 Data Set

The Reuters-21578 data, originally collected by the Carnegie Group for text categorization, does not have a predefined hierarchical class taxonomy. However, many documents are labelled with multiple topic classes. We extracted 670 documents; in this set, more than 72% of the documents have multiple class labels. We created a two-level class taxonomy using the current categories of the documents as follows:

grain (barley, corn, wheat, oat, sorghum)
livestock (l-cattle, hog)

We used a Naive Bayes classifier as the base classifier and estimated the performance of the resulting structured label classifier using 5-fold cross validation.
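The taxonomy-based node-distance rule described above (λ at the root, edges weighted by level, ancestor paths summed, otherwise the average distance from the nearest common ancestor) might be sketched as follows; the example tree and level weights are illustrative assumptions, and the normalization step is omitted:

```python
# Sketch of the taxonomy-based node distance: edges entering level d
# carry weight level_weights[d]; distance follows the two cases in the
# text (ancestor pair vs. average distance from the common ancestor).

def make_node_dist(parent, level, level_weights):
    """parent: child -> parent map (root maps to None);
    level: node -> depth (root at 0)."""
    def path_up(c):
        path = []
        while c is not None:
            path.append(c)
            c = parent[c]
        return path                        # node, ..., root

    def dist_to_ancestor(c, a):
        d = 0.0
        while c != a:
            d += level_weights[level[c]]   # weight of the edge into c
            c = parent[c]
        return d

    def node_dist(a, b):
        up_a = path_up(a)
        anc = next(n for n in path_up(b) if n in up_a)  # nearest common ancestor
        da, db = dist_to_ancestor(a, anc), dist_to_ancestor(b, anc)
        if anc in (a, b):                  # one node is ancestor of the other
            return da + db                 # sum along the connecting path
        return (da + db) / 2               # average distance from the NCA
    return node_dist

# Illustrative 2-level tree with assumed level weights:
parent = {"ANY": None, "A": "ANY", "B": "ANY", "C": "A"}
level = {"ANY": 0, "A": 1, "B": 1, "C": 2}
nd = make_node_dist(parent, level, {1: 1.0, 2: 0.268})
print(nd("A", "B"))  # -> 1.0 (two top-level labels at distance 1)
```

With these weights the ancestor pair (A, C) is at distance 0.268, while siblings at the top level are at distance 1, mirroring the normalization described in the text.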
The results in Tables 1 and 2 suggest that binarized structured label learning performs as well as split-based structured label learning in this case. Both have good predictive accuracy for the classes that appear in the first level of the class taxonomy: grain and livestock. The overall performance of the two methods (as measured by the estimated misclassification cost) is slightly different, while the average recall and precision calculated over the entire class hierarchy are very close.

Table 1. Average distance: learning on the Reuters-21578 data set

           binarization learning   split-based learning
d̄          0.217                   0.251

Table 2. Recall & precision: learning on the Reuters-21578 data set

           binarization learning   split-based learning
           recall   precision      recall   precision
grain      0.993    0.964          0.993    0.968
livestock  0.766    0.893          0.752    0.917
barley     0.498    0.440          0.454    0.442
wheat      0.852    0.735          0.859    0.724
corn       0.839    0.721          0.818    0.726
oat        0.270    0.75           0.167    0.75
sorghum    0.408    0.560          0.324    0.591
l-cattle   0.146    0.417          0.167    0.339
hog        0.729    0.786          0.717    0.686

5.2 Results on the Phenotype Data Set

Our second experiment used the phenotype data set introduced by Clare and King [5], whose class taxonomy is a hierarchical tree with 4 levels and 198 labels. We chose the C4.5 decision tree as the base classifier to run binarization learning and split-based learning in 5-fold cross validation. Split-based structured label learning shows better performance than binarized structured label learning on this data set. The misclassification cost is 0.79. The split-based structured label learning predicts 1 out of 4 class labels correctly in the 1st-level branches. Compared to the Reuters-21578 data set, the phenotype data set is much more sparse, which might explain the fact that the results are not as good as in the case of the Reuters-21578 data set.
We also calculate the accuracy, recall, and precision of each class label. It turns out that the accuracy of each class label is quite high (95%). This is due to the fact that this data set is highly unbalanced and each classifier has a high true negative rate. Owing to the sparseness of the data set, many class labels do not appear in the test data set. This leads to undefined recall and precision estimates because of division by 0. Hence, only those class labels with recall and precision estimates available are listed in Figure 2. They show that split-based structured label learning performs better in terms of recall and precision, which is consistent with the relative performance of the two methods in terms of misclassification cost.

Table 3. Average distance: learning on the phenotype data set

           binarization learning   split-based learning
d̄          1.171                   0.790

Fig. 2. Recall & precision: learning on the phenotype data set (figure not reproduced).

6 Summary and Discussion

In this paper, we have:

- Precisely formulated the problem of learning from data using abstractions over class labels (the structured label learning problem) as a generalization of the single label and multilabel problems.
- Described two learning methods, the binarized and split-based approaches to learning structured labels, both of which can be adapted to work with any existing learning algorithm for the single label learning task (e.g., Naive Bayes, decision trees, support vector machines, etc.).
- Explored a performance measure for evaluation of the resulting structured label classifiers.

Some directions for future work include:

- Development of algorithms that incorporate techniques for exploiting CT (class taxonomies) to handle partially specified class labels.
- Development of more sophisticated metrics for evaluation of structured label classifiers.
Acknowledgements. This research was supported in part by a grant from the National Institutes of Health (GM066387) to Vasant Honavar.

References

1. A. McCallum. "Multilabel text classification with a mixture model trained by EM". AAAI'99 Workshop on Text Learning, 1999.
2. T. Joachims. "Text categorization with Support Vector Machines: Learning with many relevant features". In Machine Learning: ECML-98, Tenth European Conference on Machine Learning, pp. 137–142, 1998.
3. X. Shen, M. Boutell, J. Luo, and C. Brown. "Multilabel machine learning and its application to semantic scene classification". In Proceedings of the 2004 International Symposium on Electronic Imaging (EI 2004), Jan. 18-22, 2004.
4. H.-P. Kriegel, P. Kroeger, A. Pryakhin, and M. Schubert. "Using Support Vector Machines for Classifying Large Sets of Multi-Represented Objects". In Proc. 4th SIAM Int. Conf. on Data Mining, pp. 102-114, 2004.
5. A. Clare and R. D. King. "Knowledge Discovery in Multilabel Phenotype Data". 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2001), volume 2168 of Lecture Notes in Artificial Intelligence, pages 42-53, 2001.
6. H. Blockeel, M. Bruynooghe, S. Dzeroski, J. Ramon, and J. Struyf. "Hierarchical Multi-Classification". Proceedings of the First SIGKDD Workshop on Multi-Relational Data Mining (MRDM-2002), pages 21–35, July 2002.
7. K. Wang, S. Zhou, and S. C. Liew. "Building hierarchical classifiers using class proximity". Technical Report, National University of Singapore, 1999.
8. The Reuters-21578, Distribution 1.0 test collection is available from http://www.daviddlewis.com/resources/testcollections/reuters21578
