Learning from Dyadic Data

To appear in: Advances in Neural Information Processing Systems 11, MIT Press

Thomas Hofmann, Jan Puzicha, Michael I. Jordan
Center for Biological and Computational Learning, M.I.T., Cambridge, MA; {hofmann,jordan}@ai.mit.edu
Institut für Informatik III, Universität Bonn, Germany; jan@cs.uni-bonn.de

Abstract

Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This type of data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework for learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures. We propose an annealed version of the standard EM algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains.

1 Introduction

Over the past decade learning from data has become a highly active field of research distributed over many disciplines like pattern recognition, neural computation, statistics, machine learning, and data mining. Most domain-independent learning architectures, as well as the underlying theories of learning, have been focusing on a feature-based data representation by vectors in a Euclidean space. For this restricted case substantial progress has been achieved. However, a variety of important problems does not fit into this setting and far fewer advances have been made for data types based on different representations.

In this paper, we present a general framework for unsupervised learning from dyadic data. The notion refers to a domain with two (abstract) sets of objects, X = {x_1, ..., x_N} and Y = {y_1, ..., y_M}, in which observations S are made for dyads (x_i, y_k). In the simplest case, on which we focus, an elementary observation consists just of the dyad (x_i, y_k) itself, i.e., a co-occurrence, while other cases may also provide a scalar value (strength of preference or association). Some exemplary application areas are: (i) Computational linguistics with the corpus-based statistical analysis of word co-occurrences, with applications in language modeling, word clustering, word sense disambiguation, and thesaurus construction. (ii) Text-based information retrieval, where X may correspond to a document collection,
T. Hofmann, J. Puzicha, M. Jordan: Learning from Dyadic Data, NIPS*98

Y to index terms or keywords, and a dyad (x_i, y_k) would represent the occurrence of a term in a document. (iii) Modeling of preferences and consumption behavior, identifying X with individuals and Y with objects or stimuli, as in collaborative filtering. (iv) Computer vision, in particular in the context of image segmentation, where X corresponds to image locations, Y to discretized or categorical feature values, and a dyad represents a feature value observed at a particular image location.

2 Mixture Models for Dyadic Data

Across different domains there are at least two tasks which play a fundamental role in unsupervised learning from dyadic data: (i) probabilistic modeling, i.e., learning a joint or conditional probability model over X × Y, and (ii) structure discovery, e.g., identifying clusters and data hierarchies. The key problem in probabilistic modeling is data sparseness: How can probabilities for rarely observed or even unobserved co-occurrences be reliably estimated? As an answer we propose a model-based approach and formulate latent class mixture models. These have the further advantage of offering a unifying method for probabilistic modeling and structure discovery. There are at least three (four, if both variants in ii. are counted) different ways of defining latent class models:

i. The most direct way is to introduce an (unobserved) mapping c : X × Y → {c_1, ..., c_K} that partitions X × Y into K classes. This type of model is called an aspect model, and a pre-image c⁻¹(c_a) is referred to as an aspect.

ii. Alternatively, a class can be defined as a subset of one of the spaces, i.e., a mapping c : X → {c_1, ..., c_K} (by symmetry, a mapping on Y yields a different model) which induces a unique partitioning of X × Y. This model is referred to as one-sided clustering, and a pre-image c⁻¹(c_a) is called a cluster.

iii. If latent classes are defined for both sets, c_x : X → {c_1^x, ..., c_K^x} and c_y : Y → {c_1^y, ..., c_L^y}, respectively, this induces a mapping (c_x, c_y) which is a partitioning of X × Y. This model is called two-sided clustering.

2.1 Aspect Model for Dyadic Data

In order to specify an aspect model we make the assumption that all co-occurrences in the sample set S are i.i.d. and that x_i and y_k are conditionally independent given the class. With parameters P(x_i|c_a), P(y_k|c_a) for the class-conditional distributions and prior probabilities P(c_a), the complete data probability can be written as

    P(S, c) = ∏_{i,k} [ P(c_{ik}) P(x_i|c_{ik}) P(y_k|c_{ik}) ]^{n(x_i, y_k)},   (1)

where n(x_i, y_k) are the empirical counts for dyads in S and c_{ik} denotes the latent class of the dyad (x_i, y_k). By summing over the latent variables, the usual mixture formulation is obtained. Following the standard Expectation Maximization approach for maximum likelihood estimation [Dempster et al., 1977], the E-step equations for the class posterior probabilities (*) are given by

    P(c_a | x_i, y_k) = P(c_a) P(x_i|c_a) P(y_k|c_a) / Σ_b P(c_b) P(x_i|c_b) P(y_k|c_b).   (2)

(*) In the case of multiple observations of dyads it has been assumed that each observation may have a different latent class. If only one latent class variable is introduced for each dyad, slightly different equations are obtained.

[Figure 1 appears here: tables listing, for each aspect c_a, its prior P(c_a), the words with maximal P(x_i|c_a), and the words with maximal P(y_k|c_a).]
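In code, one EM iteration for the aspect model alternates the posterior computation above with count-weighted re-estimates of P(x_i|c_a), P(y_k|c_a), and P(c_a). The following is a minimal numpy sketch under our own naming and random initialization, not the authors' implementation:

```python
import numpy as np

def aspect_model_em(n, K, n_iter=50, seed=0):
    """EM for the aspect model P(x, y) = sum_a P(c_a) P(x|c_a) P(y|c_a).

    n : (N, M) array of empirical dyad counts n(x_i, y_k)
    K : number of aspects (latent classes)
    Returns the prior p_a and the class-conditionals p_x_a, p_y_a.
    """
    rng = np.random.default_rng(seed)
    N, M = n.shape
    p_a = np.full(K, 1.0 / K)
    p_x_a = rng.random((K, N)); p_x_a /= p_x_a.sum(1, keepdims=True)
    p_y_a = rng.random((K, M)); p_y_a /= p_y_a.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(c_a | x_i, y_k) by Bayes' rule
        joint = p_a[:, None, None] * p_x_a[:, :, None] * p_y_a[:, None, :]
        post = joint / joint.sum(0, keepdims=True)        # shape (K, N, M)
        # M-step: re-estimation weighted by the counts n(x_i, y_k)
        w = n[None, :, :] * post
        p_x_a = w.sum(2); p_x_a /= p_x_a.sum(1, keepdims=True)
        p_y_a = w.sum(1); p_y_a /= p_y_a.sum(1, keepdims=True)
        p_a = w.sum((1, 2)); p_a /= p_a.sum()
    return p_a, p_x_a, p_y_a
```

Each iteration increases the data likelihood; the number of aspects K plays the role of the model complexity varied in the experiments below.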
Figure 1: Some aspects of the bigram data; for each aspect, the most probable words are listed.

It is straightforward to derive the M-step re-estimation formulae, e.g.,

    P(x_i|c_a) ∝ Σ_k n(x_i, y_k) P(c_a|x_i, y_k),   (3)

and an analogous equation for P(y_k|c_a). By re-parameterization the aspect model can also be characterized by a cross-entropy criterion. Moreover, formal equivalence to the aggregate Markov model, independently proposed for language modeling in [Saul, Pereira, 1997], has been established (cf. [Hofmann, Puzicha, 1998] for details).

2.2 One-Sided Clustering Model

The complete data model proposed for the one-sided clustering model is

    P(S, c) = P(c) ∏_i ( ∏_k [ P(x_i) P(y_k|c(x_i)) ]^{n(x_i, y_k)} ),   (5)

where we have made the assumption that all observations involving a particular x_i are conditionally independent given c(x_i). This effectively defines the mixture

    P(S_i) = Σ_a P(c_a) ∏_k [ P(x_i) P(y_k|c_a) ]^{n(x_i, y_k)},   (6)

where S_i denotes all observations involving x_i. Notice that co-occurrences in S_i are not independent (as they are in the aspect model), but get coupled by the (shared) latent variable c(x_i). As before, it is straightforward to derive an EM algorithm with update equations (7). The one-sided clustering model is similar to the distributional clustering model [Pereira et al., 1993]; however, there are two important differences: (i) the number of likelihood contributions in (7) scales with the number of observations, a fact which follows from Bayes' rule, and (ii) mixing proportions are missing in the original distributional clustering model. The one-sided clustering model corresponds to an unsupervised version of the naive Bayes classifier, if we interpret Y as a feature space for the objects x ∈ X. There are also ways to weaken the conditional independence assumption, e.g., by utilizing a mixture of tree dependency models [Meila, Jordan, 1998].

2.3 Two-Sided Clustering Model

The latent variable structure of the two-sided clustering model significantly reduces the degrees of freedom in the specification of the class-conditional distributions.

[Figure 2 appears here.]

Figure 2: Exemplary segmentation results obtained by one-sided clustering.

We propose the following complete data model,

    P(S, c) = ∏_{i,k} [ P(x_i) P(y_k) π_{c(x_i), c(y_k)} ]^{n(x_i, y_k)},

where the π_{ab} are cluster association parameters. In this model the latent variables in the X and Y spaces are coupled by the π parameters. Therefore, there exists no simple mixture model representation for P(S). Skipping some of the technical details (cf. [Hofmann, Puzicha, 1998]), we obtain the E-step posterior probabilities and the M-step equations
    π_{ab} ∝ Σ_i Σ_k P{c(x_i) = c_a^x} P{c(y_k) = c_b^y} n(x_i, y_k),

as well as analogous re-estimates for P(x_i) and P(y_k). To preserve tractability for the remaining problem of computing the posterior probabilities in the E-step, we apply a factorial approximation (mean-field approximation), i.e., the joint posterior of c(x_i) and c(y_k) is approximated by the product of its marginals. This results in coupled approximation equations for the marginal posterior probabilities P{c(x_i) = c_a^x}, and a similar set of equations for P{c(y_k) = c_b^y}. The resulting approximate EM algorithm performs updates according to the sequence (x-posteriors, M-step, y-posteriors, M-step): the (probabilistic) clustering in one set is optimized in alternation for a given clustering in the other space, and vice versa. The two-sided clustering model can also be shown to maximize a mutual information criterion [Hofmann, Puzicha, 1998].

2.4 Aspects and Clusters

To better understand the differences between the presented models it is elucidating to systematically compare the conditional probabilities P(c_a|x_i) and P(c_a|y_k):

    Model           P(c_a|x_i)                  P(c_a|y_k)
    Aspect          P(x_i|c_a)P(c_a)/P(x_i)     P(y_k|c_a)P(c_a)/P(y_k)
    X-clustering    P{c(x_i) = c_a}             P(y_k|c_a)P(c_a)/P(y_k)
    Y-clustering    P(x_i|c_a)P(c_a)/P(x_i)     P{c(y_k) = c_a}

As can be seen from the above table, the probabilities P(c_a|x_i) and P(c_a|y_k) correspond to posterior probabilities of latent variables if clusters are defined in the X- and Y-space, respectively. Otherwise, they are quantities computed from the model. This is a crucial difference since, for example, the posterior probabilities approach Boolean values in the infinite data limit, converging to one of the class-conditional distributions. Yet, in the aspect model, P(c_a|x_i) and P(c_a|y_k) are typically not peaking more sharply with an increasing number of observations. In the aspect model, conditionals P(y_k|x_i) are inherently a weighted sum of the 'prototypes' P(y_k|c_a). Cluster models, in turn, ultimately look for the single 'best' class-conditional, and weights are only indirectly induced by the posterior uncertainty.

[Figure 3 appears here: cluster association matrix and word lists.]

Figure 3: Two-sided clustering of LOB: association matrix and most probable words.

3 The Cluster-Abstraction Model

The models discussed in Section 2 all define a non-hierarchical, 'flat' latent class structure. However, for structure discovery it is important to find hierarchical data organizations. There are well-known architectures like the Hierarchical Mixtures of Experts [Jordan, Jacobs, 1994] which fit hierarchical models. Yet, in the case of dyadic data there is an alternative possibility to define a hierarchical model. The Cluster-Abstraction Model (CAM) is a clustering model (e.g., in X) where the class conditionals P(y_k|c_a) are themselves cluster-specific aspect mixtures with a latent aspect mapping. To obtain a hierarchical organization, clusters are identified with the terminal nodes of a hierarchy (e.g., a complete binary tree) and aspects with inner and terminal nodes. As a compatibility constraint it is imposed that P(a|c_a) = 0 whenever the node corresponding to aspect a is not on the path to the terminal node of cluster c_a. Intuitively, conditioned on a 'horizontal' clustering, all observations for a particular x_i have to be generated from one of the 'vertical' abstraction levels on the path to c(x_i). Since different clusters share aspects according to their topological relation, this favors a meaningful hierarchical organization of clusters. Moreover, aspects at inner nodes do not simply represent averages over the clusters in their subtree, as they are forced to explicitly represent what is common to all subsequent clusters. Skipping the technical details, the E-step computes posteriors of the form

    P{c(x_i) = c_a} ∝ P(c_a) ∏_k [ Σ_{a'} P(a'|c_a, x_i) P(y_k|a') ]^{n(x_i, y_k)},   (12)

and the M-step formulae re-estimate P(c_a), P(y_k|a), and the aspect weights P(a|c_a, x_i).

[Figure 4 appears here: most probable word stems at the top levels of the hierarchy.]
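To make the compatibility constraint concrete, the following small sketch (our own illustrative helpers, not code from the paper) enumerates the aspects admissible for a given cluster in a complete binary tree, namely the nodes on the root-to-leaf path, and forms the corresponding path-constrained mixture for the class-conditional distribution:

```python
import numpy as np

def path_to_leaf(leaf, depth):
    """Node indices (heap numbering, root = 0) on the path from the
    root to the given leaf of a complete binary tree of the given depth."""
    node = leaf + (2 ** depth - 1)     # leaves occupy the last level
    path = []
    while True:
        path.append(node)
        if node == 0:
            break
        node = (node - 1) // 2         # move to the parent
    return path[::-1]                  # root ... leaf

def cam_conditional(p_y_given_a, p_a_given_x, leaf, depth):
    """Class-conditional P(y | x) in the Cluster-Abstraction Model:
    a mixture of only those aspects lying on the path to x's cluster
    (leaf); off-path aspects get zero weight by the compatibility
    constraint P(a|c) = 0."""
    path = path_to_leaf(leaf, depth)
    weights = p_a_given_x[path]
    weights = weights / weights.sum()  # renormalize on the path
    return weights @ p_y_given_a[path]
```

For a depth-2 tree (7 nodes, 4 leaf clusters), leaf 0 draws only on nodes 0, 1, and 3, so siblings share the aspects of their common ancestors, which is exactly what induces the resolution-specific descriptors at inner nodes.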
Figure 4: Parts of the top levels of a hierarchical clustering solution for the Cluster document collection; aspects are represented by their 5 most probable word stems.

4 Annealed Expectation Maximization

Annealed EM is a generalization of EM based on the idea of deterministic annealing [Rose et al., 1990] that has been successfully applied as a heuristic optimization technique to many clustering and mixture problems. Annealing reduces the sensitivity to local maxima, but, even more importantly in this context, it may also improve the generalization performance compared to maximum likelihood estimation. The key idea in annealed EM is to introduce an inverse temperature parameter β and to replace the negative (averaged) complete data log-likelihood by a substitute known as the free energy (both are in fact equivalent at β = 1). This effectively results in a simple modification of the E-step: the likelihood contribution in Bayes' rule is taken to the power of β. In order to determine the optimal value of β we used an additional validation set in a cross-validation procedure.

5 Results and Conclusions

In our experiments we have utilized the following real-world data sets: (i) Cran: a standard test collection from information retrieval (N = 1400, M = 4898); (ii) Penn/LOB: adjective-noun co-occurrences from the Penn Treebank corpus (N = 6931, M = 4995) and the LOB corpus (N = 5448, M = 6052); (iii) Cluster: a document collection with abstracts of journal papers on neural networks (N = 1278, M = 6065); (iv) Bible: word bigrams from the bible edition of the Gutenberg project (N = M = 12858); (v) Aerial: textured aerial images for segmentation (N = 128, M = 192).

In Fig. 1 we have visualized an aspect model fitted to the bigram data. Notice that the roles of the preceding and the subsequent words in bigrams are quite different. Segmentation results obtained by applying the one-sided clustering model are depicted in Fig. 2; a multi-scale Gabor filter bank (3 octaves, 4 orientations) was utilized as the image representation (cf. [Hofmann et al., 1998]). In Fig. 3 a two-sided clustering solution for LOB is shown. Fig. 4 shows the top levels of the hierarchy found by the Cluster-Abstraction Model in the Cluster collection (the tree topology for the CAM is heuristically grown via phase transitions); the inner node distributions provide resolution-specific descriptors for the documents in the corresponding subtree, which can be utilized, e.g., in interactive browsing for information retrieval. Fig. 5 shows typical test set perplexity curves.
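Since annealed EM changes only the E-step, it can be expressed as a drop-in replacement for the posterior computation. The sketch below uses our own function name and works in the log domain for numerical stability; at β = 1 it reduces to the standard EM posterior:

```python
import numpy as np

def tempered_posterior(log_lik, log_prior, beta):
    """Annealed E-step: the likelihood contribution in Bayes' rule is
    raised to the power beta (inverse temperature) before normalizing;
    beta = 1 recovers the standard EM posterior.

    log_lik  : (K, ...) class-conditional log-likelihoods log P(data|a)
    log_prior: (K,)     log prior probabilities log P(a)
    """
    # broadcast the prior over the trailing data dimensions
    z = beta * log_lik + log_prior.reshape((-1,) + (1,) * (log_lik.ndim - 1))
    z -= z.max(0, keepdims=True)       # stabilize before exponentiation
    p = np.exp(z)
    return p / p.sum(0, keepdims=True)
```

At β = 0 the data are ignored and the posterior collapses to the prior; decreasing β from 1 flattens the posteriors, which is what counteracts the overfitting visible in the β = 1 curves of Fig. 5.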
[Figure 5 appears here.]

Figure 5: Perplexity curves for annealed EM (aspect model (a), (b) and one-sided clustering model (c)); curves are shown for several values of K and of the inverse temperature β.

[Table 1 appears here.]
Table 1: Perplexity results for different models on the Cran data (predicting words conditioned on documents) and the Penn data (predicting nouns conditioned on adjectives).

The curves in Fig. 5 were obtained with the annealed EM algorithm for the aspect and clustering models; perplexity is the exponential of the negative per-observation log-likelihood. At β = 1 (standard EM) overfitting is clearly visible, an effect that vanishes with decreasing β. Annealed learning also performs better than standard EM with early stopping. Tab. 1 systematically summarizes perplexity results for different models and data sets.

In conclusion, mixture models for dyadic data have shown a broad application potential. Annealing yields a substantial improvement in generalization performance compared to standard EM, in particular for the clustering models, and also outperforms a complexity control via K. In terms of perplexity, the aspect model has the best performance. Detailed performance studies and comparisons with other state-of-the-art techniques will appear in forthcoming papers.

References

[Dempster et al., 1977] Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B, 39, 1-38.

[Hofmann, Puzicha, 1998] Hofmann, T., Puzicha, J. (1998). Statistical models for co-occurrence data. Tech. rept., Artificial Intelligence Laboratory Memo 1625, M.I.T.

[Hofmann et al., 1998] Hofmann, T., Puzicha, J., Buhmann, J.M. (1998). Unsupervised texture segmentation in a deterministic annealing framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 803-818.

[Jordan, Jacobs, 1994] Jordan, M.I., Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181-214.

[Meila, Jordan, 1998] Meila, M., Jordan, M.I. (1998). Estimating dependency structure as a hidden variable. In: Advances in Neural Information Processing Systems 10.

[Pereira et al., 1993] Pereira, F.C.N., Tishby, N.Z., Lee, L. (1993). Distributional clustering of English words. Pages 183-190 of: Proceedings of the ACL.

[Rose et al., 1990] Rose, K., Gurewitz, E., Fox, G. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65(8), 945-948.

[Saul, Pereira, 1997] Saul, L., Pereira, F. (1997). Aggregate and mixed-order Markov models for statistical language processing. In: Proceedings of the 2nd International Conference on Empirical Methods in Natural Language Processing.