Analogical Reasoning with Relational Bayesian Sets

Ricardo Silva
Gatsby Unit, University College London
rbas@gatsby.ucl.ac.uk

Katherine A. Heller
Gatsby Unit, University College London
heller@gatsby.ucl.ac.uk

Zoubin Ghahramani
Department of Engineering, University of Cambridge
zoubin@eng.cam.ac.uk

Abstract

Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. There are many ways in which objects can be related, making automated analogical reasoning very challenging. Here we develop an approach which, given a set of pairs of related objects $S = \{A^1{:}B^1, A^2{:}B^2, \ldots, A^N{:}B^N\}$, measures how well other pairs A:B fit in with the set S. This addresses the question: is the relation between objects A and B analogous to those relations found in S? We recast this classical problem as a problem of Bayesian analysis of relational data. This problem is non-trivial because direct similarity between objects is not a good way of measuring analogies. For instance, the analogy between an electron around the nucleus of an atom and a planet around the Sun is hardly justified by isolated, non-relational, comparisons of an electron to a planet, and a nucleus to the Sun. We develop a generative model for predicting the existence of relationships and extend the framework of Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous a relation is to other relations. This sheds new light on an old problem, which we motivate and illustrate through practical applications in exploratory data analysis.

1 CONTRIBUTION

Consider the following illustrative problems in exploratory data analysis:

Example 1 A researcher has a large collection of papers. She plans to use such a database in order to write a comprehensive article about the evolution of her field through the past two decades. In particular, she has a collection organized as pairs of papers, where one cites the other, i.e., a collection of pairs A:B meaning A cites B. There are several reasons why a paper might cite another: A is a big bibliographic survey, or B was written by the advisor of the author of A, or B was given a best paper award, or the authors were geographically close, or a combination of several such features. Such combinations define a (potentially very large) variety of subpopulations of pairs of papers. While there might be relevant information about A and B in the database, which subpopulation a citation A:B belongs to is never explicitly indicated in the data. Yet the researcher is not completely in the dark: she already has an idea of important subgroups of citations which are representative of the most interesting subpopulations, although it might be difficult to characterize any such set with a simple description. She would like to know which other pairs of papers might belong to such subgroups. Instead of worrying about writing some simple query rules that explain the common properties of such subgroups, she would rather have an intelligent information retrieval system that is able to identify which other pairs in the database are linked in an analogous way to those pairs in her selected sets.

Example 2 A scientist is investigating a population of proteins, within which some pairs are known to interact, while the remaining pairs are known not to interact. In this study, it is known that recorded gene expression profiles of the respective genes can be used as a reasonable predictor of the existence or not of an interaction. The current state of knowledge is still limited regarding which subpopulations (i.e., classes) of interactions exist, although a partial hierarchy of such classes for some proteins is available. Given a selected set of interacting proteins that are believed to belong to a particular level of the class hierarchy, the researcher would like to query her database to discover other plausible pairs of proteins whose mechanism of linkage is of the same nature as in the selected set, i.e.,
to query for analogous relations. Ideally, she would like to do it without being required to write down query rules that explicitly describe the selected set.

Such are problems of analogical reasoning, instantiated as practical problems of information retrieval for exploratory data analysis. Even under the absence of clear class labels for links such as paper citations and protein interactions (what is recorded is which pairs are linked and which are not), one might have at hand a subpopulation of interest that is best described by a sample of linked objects. The question to be asked is of an exploratory nature: which other objects in my relational database are linked in a similar way?

In both examples, one has a relational database, and it is possible to create models for predicting the existence or lack of a relationship using features such as paper attributes and gene expression profiles. In both examples, it is not fully known how to explicitly describe classes of relations that are believed to exist (and it is a nuisance to select negative examples by hand to learn a classifier).

We propose a method for retrieving relations based on the Bayesian scoring function as proposed in the Bayesian sets method (Ghahramani and Heller, 2005): given a set of related items that are postulated to come from a subpopulation of interest, the goal is to rank existing links according to a measure of similarity with respect to this set. We interpret this problem as the classical problem of analogical reasoning. That is, suppose we have a pair (or set of pairs) of objects A:B. Which other pairs of objects in the relational database best reflect a relation analogous to A:B? This paper provides a novel and probabilistically sound solution to this problem. Moreover, this work extends the Bayesian sets method to discriminative models.

We will focus solely on finding pairwise relations. The idea can be extended to more complex relations, but we will not pursue this here.

In Section 2 we discuss related work while describing the difference between analogical reasoning and standard retrieval tasks. The approach is introduced in Section 3 and evaluated in Section 4.

2 RELATED WORK

To define an analogy is to define a measure of similarity between structures of related objects (pairs, in our case). The key aspect is that, typically, we are not interested in how each individual object in a candidate pair is similar to individual objects in the query pairs. As an illustration, consider an analogical reasoning question from a SAT-like exam where for a given pair (say, water:river) we have to choose (out of 5 pairs) the one that best matches the type of relation implicit in such a "query." In this case, it is reasonable to say car:traffic would be a better match than (the somewhat nonsensical) soda:ocean, since cars flow through traffic, and so does water through a river. Notice that if we were to measure the similarity between objects instead of relations, it would seem reasonable to say that soda:ocean is a much closer pair.

In the examples given in the previous section, similarity between pairs of objects is only meaningful to the extent to which such features are useful to predict the existence of the relationships.

There is a large literature on analogical reasoning in artificial intelligence and psychology. We refer to French (2002) for a survey, as well as to some recent machine learning papers on clustering (Marx et al., 2002), prediction (Turney and Littman, 2005) and dimensionality reduction (Memisevic and Hinton, 2005). Here we will use a Bayesian framework for inferring similarity of relations. Given a set of relations, our goal will be to score others as relevant or not. The score is a Bayesian model comparison generalizing the "Bayesian sets" score (Ghahramani and Heller, 2005) to discriminative models over pairs of objects.

The graphical model formulation of Getoor et al. (2002) incorporates models of link existence in relational databases, an idea used explicitly in Section 3 as the first step of our problem formulation. In the clustering literature, the probabilistic approach of Kemp et al. (2006) is motivated by principles similar to those in our formulation: the idea is that there is an infinite mixture of subpopulations that generates the observed relations. Our problem, however, is to retrieve other elements of a subpopulation described by elements of a query set, a goal that is also closer to the classical paradigm of analogical reasoning.

To emphasize once more, our focus here is not on predicting the presence or absence of links, as in, e.g., Popescul and Ungar (2003), but rather on retrieving similar links from among those already known to exist in the relational database. Neither is our focus to provide a fully unsupervised clustering of the whole database of pairs (as in, e.g., Kemp et al., 2006), nor to use relational information to improve classification of other attributes (as in, e.g., Getoor et al., 2002).

3 FUNCTIONS AS ANALOGIES

We now describe the analogical reasoning principle more formally. Let $\mathcal{A}$ and $\mathcal{B}$ represent object spaces. To say that an interaction A:B is analogous to $S = \{A^1{:}B^1, A^2{:}B^2, \ldots, A^N{:}B^N\}$ is to define a measure of similarity between the pair and the set of pairs. However, this similarity is not (directly) given by the information contained in the distribution of objects $\{A^i\} \subset \mathcal{A}$, $\{B^i\} \subset \mathcal{B}$, but by the mappings classifying such pairs as being linked:

Bayesian analogical reasoning formulation: Consider a space of latent functions in $\mathcal{A} \times \mathcal{B} \rightarrow \{0, 1\}$. Assume that A and B are two objects classified as linked by some unknown function f(A, B), i.e., f(A, B) = 1. We want to quantify how similar the function f(A, B) is to the function $g(\cdot, \cdot)$, which classifies all pairs $(A^i, B^i) \in S$ as being linked, i.e., $g(A^i, B^i) = 1$. The similarity should be a function of the observations {S, A, B} and our prior distribution over $f(\cdot, \cdot)$ and $g(\cdot, \cdot)$.

Functions $f(\cdot)$ and $g(\cdot)$ are unobserved, hence the need for a prior that will be used to integrate over the function space. The resulting similarity metric will be defined through a Bayes factor, as explained next.

For simplicity, we will consider a family of latent functions that is parameterized by a finite-dimensional vector: the logistic regression function with multivariate Gaussian priors for its parameters. For a particular pair ($A^i \in \mathcal{A}$, $B^j \in \mathcal{B}$), let $X^{ij} = [\Phi_1(A^i, B^j)\ \Phi_2(A^i, B^j)\ \ldots\ \Phi_K(A^i, B^j)]^T$ be a point on a feature space defined by the mapping $\Phi: \mathcal{A} \times \mathcal{B} \rightarrow \Re^K$. Let $C^{ij} \in \{0, 1\}$ be an indicator of the existence of a link between $A^i$ and $B^j$ in the database. Let $\Theta = [\theta_1, \ldots, \theta_K]^T$ be the parameter vector for our logistic regression model

$$P(C^{ij} = 1 \mid X^{ij}, \Theta) = \mathrm{logistic}(\Theta^T X^{ij}) \qquad (1)$$

where $\mathrm{logistic}(x) = (1 + e^{-x})^{-1}$. Our measure of similarity for a pair $(A^i, B^j)$ with respect to a query set S is the probabilistic similarity measure of Bayesian sets (Ghahramani and Heller, 2005) on a log-scale:

$$\mathrm{score}(A^i, B^j) = \log P(C^{ij} = 1 \mid X^{ij}, S, C^S = 1) - \log P(C^{ij} = 1 \mid X^{ij}) \qquad (2)$$

where $C^S$ is the vector of link indicators for S: i.e., $C^1 = 1, C^2 = 1, \ldots, C^N = 1$ indicates that all pairs in S are linked.

The general framework is as follows. We are given a relational database $(D_A, D_B, L_{AB})$, where the first two components of this database are sampled respectively from $\mathcal{A}$ and $\mathcal{B}$. Relationship table $L_{AB}$ is a binary matrix assumed to be generated by a logistic regression model of link existence. A query proceeds according to the following steps:

1. the user selects a set of pairs S that are linked in the database;

2. the system performs Bayesian inference to obtain the corresponding posterior distribution for $\Theta$, $P(\Theta \mid S, C^S)$, given a Gaussian prior $P(\Theta)$;

3. the system iterates through all linked pairs, computing the following for each pair

$$P(C^{ij} = 1 \mid X^{ij}, S, C^S = 1) = \int P(C^{ij} = 1 \mid X^{ij}, \Theta)\, P(\Theta \mid S, C^S = 1)\, d\Theta \qquad (3)$$

as well as $P(C^{ij} = 1 \mid X^{ij})$ by integration over $P(\Theta)$ [1], and then sorts them according to the score in Equation (2).

[1] Since the integral used in the Bayesian logistic function does not have a closed formula, in all of these expressions we use the Bayesian variational approximation by Jaakkola and Jordan (2000). A short summary of this approach is given by Silva et al. (2007).

Figure 1: (a) Graphical plate representation for the relational Bayesian logistic regression, where $N_A$, $N_B$ and $N_C$ are the number of objects of each class. (b) Extra dependencies induced by further conditioning on C are represented by undirected edges.

The corresponding plate model is illustrated in Figure 1(a). Latent parameter vector $\Theta = \{\theta_1, \theta_2, \ldots, \theta_K\}^T$ and objects A and B are ancestors of link indicator C. By conditioning on C = 1, elements of $\Theta$ will be connected to and share information from input data {A, B}, as in Figure 1(b). This information can be passed forward to evaluate other points. The suggested setup scales as $O(K^3)$ due to the matrix inversions necessary for (variational) Bayesian logistic regression (Jaakkola and Jordan, 2000). If necessary, a further approximation for $P(\Theta \mid S, C^S)$ might be imposed if the dimensionality of $\Theta$ is too high.

3.1 Choice of features and relational discrimination

Our setup assumes that the feature space provides a reasonable classifier to predict the existence of links. It is evident that the proposed framework could be
used for non-relational problems with arbitrary classification functions. However, our analogical reasoning formulation is a relational model to the extent that it models presence and absence of interactions between objects: by conditioning on the link indicators, the similarity score between A:B and C:D is always a function of pairs (A, B) and (C, D) that is not in general decomposable as similarities between A and C, B and D. Again, this is illustrated by Figure 1. One might also have records of pairwise information on other relational tables $D_{AB}$ besides the targeted one. For instance, one might have a measure in a protein database that is a binary indicator of both proteins being produced in the same area in the cell or not, or the number of common proteins that interact with the pair.

Our method then learns to rank similarity of relations based on features extracted for a relational database, and such features have a role similar to the latent variables in block-models, as discussed in the sequel. Useful predictive features can also be generated automatically with a variety of algorithms (e.g., the "structural logistic regression" of Popescul and Ungar, 2003). See also Dzeroski and Lavrac (2001). Jensen and Neville (2002) discuss shortcomings of automated methods for automated feature selection in relational classification.

Our analogical reasoning formulation also assumes all subpopulations of interest are measured on the same feature space. This allows for comparisons between, e.g., cells from different species, or web pages from different web domains, as long as $\Phi(\cdot, \cdot)$ is the same. The most general analogical reasoning formulation would not have this requirement, but for the problem to be well-defined, features from the different spaces must be related somehow. A hierarchical Bayesian formulation for linking different feature spaces is one possibility which might be treated in a future work.

3.2 Priors

The choice of prior is based on the observed data, in a way that is analogous to the choice of priors used in the original formulation of Bayesian sets (Ghahramani and Heller, 2005). Let $\hat{\Theta}$ be the maximum likelihood estimator of $\Theta$. Since the number of possible pairs grows at a quadratic rate with the number of objects, we do not use the whole database for maximum likelihood estimation. Also, since most databases have a sparse link matrix, to
get $\hat{\Theta}$ we use all linked pairs as members of the positive class (C = 1), and sample unlinked pairs as members of the negative class (C = 0) [2].

[2] In our experiments, we sample 10 "negative" pairs for each "positive" one, and weight them to reflect the proportion in the database (e.g., if we sample 10 negatives for each positive, while in the database there are 200 negatives for each positive, we count each negative case as being 20 cases replicated.)

We use the prior $P(\Theta) = N(\hat{\Theta}, (c\hat{T})^{-1})$, where N(m, V) is a normal of mean m and variance V. Matrix $\hat{T}$ is the empirical second moments matrix of the linked object features in X, a measure of their variability. Constant c is a smoothing parameter set by the user. In our experiments, we selected it to be twice the total number of links.

Empirical priors are a sensible choice, since this is a retrieval, not a predictive, task. Basically, the entire data set is the population. A data-dependent prior based on the population is quite important for an approach like Bayesian sets, since deviations from the "average" behaviour in the data are useful to discriminate between subpopulations.

Silva et al. (2007) present several illustrations of the score function behavior under different choices of priors and query sets.

3.3 Connections to Bayesian sets and block models

The model in Figure 1(a) is a typical conditional relational model, i.e., conditioned on objects and parameters the resulting relations are i.i.d. Under the original Bayesian sets formulation, the score function can be described by (the logarithm of) the Bayes factor comparing the models in Figure 2.

In contrast, consider the following direct modification of the Bayesian sets formulation to this problem: flatten the data, creating for each pair $(A^i, B^j)$ a row in the database with an extra binary indicator of relationship existence. Create a joint model for pairs by using the marginal models for A and B and treating different rows as being independent. This ignores the fact that the resulting flattened data points are not really i.i.d. under such a model, because the same object might appear in multiple relations (Dzeroski and Lavrac, 2001). The model also fails to capture the dependency between $A^i$ and $B^j$ that arises from conditioning on $C^{ij}$, even if $A^i$ and $B^j$ are marginally independent. Nevertheless, heuristically this approach can sometimes produce some good results, and for several types of probability families it is very computationally efficient.

A different approach for modeling relational data is the block-model used for years by statisticians and sociologists for modeling social networks (Kemp et al., 2006; Airoldi et al., 2006). The basic idea is to use hidden variables [3] in place of our feature vector $X^{ij}$: this is partially motivated by the fact that typically, in social network analysis, there are no easily available features of the population that are recorded. To compute quantities such as the marginal likelihood of the model one has to integrate out a large number of hidden variables.

[3] Such hidden variables are usually discrete indicators of some latent cluster membership for objects. The model typically requires a cross-clustering for object membership and relation membership.

Figure 2: The score of a new data point {A, B, C} is given by the Bayes factor that compares models (a) and (b). Node $\alpha$ represents the hyperparameters for $\Theta$. In (a), the generative model is the same for both the new point and the query set represented in the rectangle. Notice that our conditioning set S of pairs in $\{A^i\} \times \{B^j\}$ might contain repeated instances of a same point, i.e., some $A^i$ or $B^j$ might appear multiple times in different relations, as illustrated by nodes $A^i$ with multiple outgoing edges. In (b), the new point and the query set do not share the same parameters.

For a moderate number of objects, straight evaluation of all pairs might be computationally infeasible in the block-model setup. Our discriminative model only needs the unlinked pairs when setting a prior, which is accomplished by sampling when the total number of unlinked pairs is too large. Throwing away part of the information from unlinked pairs is arguably less harmful to our goals than to the clustering procedure performed with block-models.

Our model assumes link indicators are independent given object features, which might not be the case for particular choices of feature space. In theory, block-models sidestep this issue by learning all the necessary latent features that account for link dependence. An important future extension of our work would consist of tractably accounting for residual link association that is not intermediated by our observed features.

3.4 On continuous relations

Although we focus on measuring similarity of qualitative relationships, the same idea could be extended to continuous measures of relationship. For instance, Turney and Littman (2005) measure relations between words by their co-occurrences in texts next to a set of joining terms, such as the two words being connected by a specific preposition. Several similarity metrics can be defined on this vector of continuous relationship (e.g., cosine distance). However, given data on word features and a predictive model for such quantitative relations, one can directly adapt our model [4].

[4] Notice that our approach would still not be directly comparable to the one by Turney and Littman, since unlike them we would make use of some external data source for the features of the words.

4 EXPERIMENTS

We now describe two experiments on analogical retrieval using the proposed model. Evaluation of the significance of retrieved items often relies on subjective assessments (Ghahramani and Heller, 2005). To simplify our study, we will focus on particular setups where objective measures of success can be derived.

Our main standard of comparison will be a "flattened Bayesian sets" algorithm (which we will call "standard Bayesian sets," SBSets, in contrast to the relational model, RBSets). Using a multivariate independent Bernoulli model as in the original paper (Ghahramani and Heller, 2005), we join linked pairs into single rows, and then apply the original algorithm directly on this joined data. This algorithm serves the purpose of measuring both the loss of not treating relational data as such, and the limitations of treating similarity of pairs through the generative models of A and B instead of the generative model for the latent predictive function $g(\cdot, \cdot)$.

In both experiments, objects are of the same type, and therefore, dimensionality. The feature vector $X^{ij}$ for each pair of objects $\{A^i, B^j\}$ consists of the V features for object $A^i$, the V features of object $B^j$, and measures $\{Z_1, \ldots, Z_V\}$, where $Z_v = (A^i_v \times B^j_v) / (\|A^i\| \times \|B^j\|)$, $\|A^i\|$ being the Euclidean norm of the V-dimensional representation of $A^i$. We also use a constant value (1) as part of the feature set as an intercept term for the logistic regression. The Z features are exactly the ones used in the cosine distance measure, a common and practical measure widely used in information retrieval (Manning et al., 2007). They also have the important advantage of scaling well with the number of variables in the database. Moreover, adopting such features will make our comparisons in the next sections more fair, since we evaluate how well cosine distance performs in our task. Notice $X^{ij}$ represents asymmetric relationships as required in our applications. For symmetric relationships, features such as $|A^i_v - B^j_v|$ could be used instead.

Figure 3: Precision/recall curves for four different random queries of size 10 (panels: first, second, third and fourth trial) for the three algorithms: relational Bayesian sets (RBSets), regular Bayesian sets with Bernoulli model (SBSets-B) and cosine distance.

4.1 Synthetic experiment

We first discuss a synthetic experiment where there is a known ground truth. We generate data from a simulated model with six classes of relations represented by six different instantiations of $\Theta$, $\{\Theta_0, \Theta_1, \ldots, \Theta_5\}$. This simplified setup defines a multiclass logistic softmax classifier that outputs a class label out of {0, 1, ..., 5}. Object spaces $\mathcal{A}$ and $\mathcal{B}$ are the same, and defined by a multivariate Bernoulli distribution of 20 dimensions, where each attribute has independently a probability 1/2 of being 1. We generate 500 objects, and consider all $500^2$ pairs to generate 250,000 feature vectors X. For each X we evaluate our logistic classifier to generate a class label. If this class is zero, we label the corresponding pair as "unlinked." Otherwise, we label it as "linked." The intercept parameter for parameter vector $\Theta_0$ was set manually to make class 0 appear in at least 99% of the data [5], thus corresponding to the usual sparse matrices found in relational data.

[5] Values for vectors $\Theta_1, \Theta_2, \ldots, \Theta_5$ were otherwise generated by independent multivariate Gaussian distributions with zero mean and standard deviation of 10.

The algorithms we evaluate do not know which of the 5 classes the linked pairs originally corresponded to. However, since the labels are known through simulation, we are able to tell how well points of a particular class are ranked given a query of pairs from the same class. Our evaluation is as follows. We generate precision/recall curves for three algorithms: our relational Bayesian sets RBSets, "flattened" standard Bayesian sets with Bernoulli model (SBSets) and cosine distance (summing over all elements in the query). For each query, we randomly sampled 10 elements out of the pool of elements of the least frequent class (about 1% of the total number of links), and ranked the remaining 2320 linked pairs. We counted an element as a hit if it was originally from the selected class.

RBSets gives consistently better results for the top 50% retrievals. As an illustration, we depicted four random queries of 10 items in Figure 3. Notice that sometimes SBSets can do reasonably, often achieving better precision at the bottom 40% recalls: by the virtue of having few objects in the space of elements of this class, a few of them will appear in pairs both in the query and outside of it, facilitating matching by object similarity since half of the pair is already given as input. We conjecture this explains the seemingly strong results of feature-based approaches on the bottom 40%. However, when this does not happen the problem can get much harder, making SBSets much more sensitive to the query than RBSets, as illustrated in some of the runs in Figure 3.

4.2 The WebKB experiment

The WebKB data is a collection of web pages from several universities, where relations are directed and given by hyperlinks (Craven et al., 1998). Web pages are classified as being of type course, department, faculty, project, staff, student and other. Documents from four universities (cornell, texas, washington and wisconsin) are also labeled as such. Binary data was generated from this database using the same methods of Ghahramani and Heller (2005). A total of 19,450 binary variables per object are generated. To avoid introducing extra approximations into RBSets, we reduced dimensionality in the original representation using singular value decomposition, obtaining 25 measures per object. This also improved the results of our algorithm and cosine distance. For SBSets, this is a way of creating correlations in the original feature space.

To evaluate the gain of our model over competitors, we will use the following setup. In the first query, we are given the pairs of web pages of the type student → course from three of the labeled universities, and evaluate how relations are ranked in the fourth university. Because we know class labels (while the algorithm does not), we can use the class of the returned pairs to label a hit as being "relevant" or "irrelevant." We label a pair $(A^i, B^j)$ as relevant if and only if $A^i$ is of type student and $B^j$ is of type course, and $A^i$ links into $B^j$. This is a very stringent criterion, since other types of relations could also be valid (e.g., staff → course appears to be a reasonable match). However, this facilitates objective comparisons of algorithms. Also, the other class contains many types of pages, which allows for possibilities such as a student → "hobby" pair. Such pairs might be hard to evaluate (e.g., is that particular hobby incrementally demanding in a way that course work is? Is it as fun as taking a machine learning course?) As a compromise, we omit all pages from the category other in order to better clarify differences between algorithms [6].

Figure 4: Results for student → course relationships (precision/recall panels for cornell, texas, washington and wisconsin, comparing RBSets, SBSets1, SBSets2, Cosine1 and Cosine2).

Figure 5: Results for faculty → project relationships (precision/recall panels for cornell, texas, washington and wisconsin, comparing RBSets, SBSets1, SBSets2, Cosine1 and Cosine2).

Precision/recall curves for the student → course queries are shown in Figure 4. There are four queries, each corresponding to a search over a specific university given all valid student → course pairs from the other three. There are five algorithms on each evaluation: the standard Bayesian sets with the original 19,450 binary variables for each object, plus another
19,450 binary variables, each corresponding to the product of the respective variables in the original pair of objects (SBSets1); the standard Bayesian sets with the original binary variables only (SBSets2); a standard cosine distance measure over the 25-dimensional representation (Cosine1); a cosine distance measure using the 19,450-dimensional text data with TF-IDF weights (Cosine2); our approach, RBSets.

[6] As an extreme example, querying student → course pairs from the wisconsin university returned student → other pairs at the top four. However, these other pages were for some reason course pages, such as http://www.cs.wisc.edu/markhill/cs752.html

In Figure 4, RBSets demonstrates consistently superior or equal precision-recall. Although SBSets performs well when asked to retrieve only student items or only course items, it falls short of detecting what features of student and course are relevant to predict a link. The discriminative model within RBSets conveys this information through the parameters.

We also did an experiment with a query of type faculty → project, shown in Figure 5. This time results between algorithms were closer. To make differences more evident, we adopt a slightly different measure of success: we count as a 1 hit if the pair retrieved is a faculty → project pair, and count as a 0.5 hit for pairs of type student → project and staff → project. Notice this is a much harder query. For instance, the structure of the project web pages in the texas group was quite distinct from the other universities: they are mostly very short, basically containing links for members of the project and other project web pages.

Although the precision/recall curves convey a global picture of performance for each algorithm, they might not be a completely clear way of ranking approaches for cases where curves intersect on several points. In order to summarize individual performances with a single statistic, we computed the area under each precision/recall curve (with linear interpolation between points). Results are given in Table 1. An asterisk marks the algorithm(s) with the highest area for each query. The dominance of RBSets should be clear.

Table 1: Area under the precision/recall curve for each algorithm and query (C1, C2: Cosine1, Cosine2; RB: RBSets; SB1, SB2: SBSets1, SBSets2).

              student → course                    faculty → project
              C1     C2     RB     SB1    SB2     C1     C2     RB     SB1    SB2
cornell       0.87*  0.61   0.87*  0.84   0.80    0.19   0.04   0.24*  0.18   0.18
texas         0.55   0.54   0.77*  0.62   0.48    0.24   0.07   0.29*  0.07   0.12
washington    0.67   0.64   0.76*  0.69   0.44    0.40   0.11   0.48*  0.29   0.18
wisconsin     0.75   0.73   0.88*  0.77   0.55    0.28*  0.07   0.27   0.20   0.21

Silva et al. (2007) describe another application of RBSets, in this case for symmetric protein-protein interactions. In this application, there are no individual object features on which Cosine and SBSets can rely (every $X^{ij}$ measures a pairwise feature), and RBSets performs substantially better.

5 CONCLUSION

We have emphasized the process of analogical reasoning as a retrieval of similar relationships, and presented a probabilistically sound approach for this problem. There is of course much more to analogical reasoning than calculating the similarity of complex relational structures. For instance, there is the issue of judging how significant the similarity is. Considering that the retrieved objects might be of a very different nature than those in the query set, one might also want to explain why the relations are judged to be similar. Ultimately, in case-based reasoning and planning problems (Kolodner, 1993), one might have to adapt the similar structures to solve a new case or plan.

One should see the contribution of this paper as a step towards a formal measure of analogical similarity. Much remains to be done to create a complete analogical reasoning system, but the described approach has immediate applications to information retrieval and exploratory data analysis.

References

E. Airoldi, D. Blei, E. Xing, and S. Fienberg. Mixed membership stochastic block models for relational data with application to protein-protein interactions. Proceedings of the International Biometrics Society Annual Meetings, 2006.

M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI'98, pages 509-516, 1998.

S. Dzeroski and N. Lavrac. Relational Data Mining. Springer, 2001.

R. French. The computational modeling of analogy-making. Trends in Cognitive Sciences, 6, 2002.

L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. JMLR, 3:679-707, 2002.

Z. Ghahramani and K. Heller. Bayesian sets. 18th NIPS, 2005.

T. Jaakkola and M. Jordan. Bayesian parameter estimation via variational bounds. Statistics and Computing, 10:25-37, 2000.

D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. Proceedings of ICML, 2002.

C. Kemp, J. Tenenbaum, T. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. Proceedings of AAAI'06, 2006.

J. Kolodner. Case-Based Reasoning. Morgan Kaufmann, 1993.

C. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. In press, 2007.

Z. Marx, I. Dagan, J. Buhmann, and E. Shamir. Coupled clustering: a method for detecting structural correspondence. JMLR, 3:747-780, 2002.

R. Memisevic and G. Hinton. Multiple relational embedding. 18th NIPS, 2005.

A. Popescul and L. H. Ungar. Structural logistic regression for link analysis. Multi-Relational Data Mining Workshop at KDD-2003, 2003.

R. Silva, E. Airoldi, and K. Heller. Small sets of interacting proteins suggest latent linkage mechanisms through analogical reasoning. Gatsby Technical Report, GCNU TR 2007-001, 2007.

P. Turney and M. Littman. Corpus-based learning of analogies and semantic relations. Machine Learning Journal, 60:251-278, 2005.
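
Illustrative sketch of the scoring procedure

The query steps of Section 3 (Equations 1 to 3) and the pair features of Section 4 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the paper approximates the integrals with the variational method of Jaakkola and Jordan (2000), whereas the code below swaps in a Laplace (Gaussian) approximation to the posterior and the standard probit-style approximation to the logistic-Gaussian integral. All function names (`pair_features`, `laplace_posterior`, `log_predictive`, `rbsets_score`) and the toy numbers are hypothetical, and the prior is passed in as a mean and a precision matrix rather than estimated from data as in Section 3.2.

```python
import numpy as np


def logistic(x):
    """logistic(x) = 1 / (1 + exp(-x)), as in Equation (1)."""
    return 1.0 / (1.0 + np.exp(-x))


def pair_features(a, b):
    """Pair features of Section 4: [1, a, b, Z], Z_v = a_v * b_v / (||a|| * ||b||)."""
    z = a * b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.concatenate(([1.0], a, b, z))


def laplace_posterior(X, c, prior_mean, prior_prec, n_iter=25):
    """Gaussian approximation to P(Theta | X, c) under a N(prior_mean, prior_prec^-1) prior.

    Assumption: Newton iterations to the posterior mode (Laplace approximation),
    used here in place of the paper's variational approximation.
    """
    theta = np.array(prior_mean, dtype=float)
    for _ in range(n_iter):
        p = logistic(X @ theta)
        grad = X.T @ (c - p) - prior_prec @ (theta - prior_mean)
        neg_hess = (X.T * (p * (1.0 - p))) @ X + prior_prec
        theta = theta + np.linalg.solve(neg_hess, grad)
    p = logistic(X @ theta)
    neg_hess = (X.T * (p * (1.0 - p))) @ X + prior_prec
    return theta, np.linalg.inv(neg_hess)


def log_predictive(x, mean, cov):
    """Approximate log of the integral of logistic(Theta^T x) under N(mean, cov).

    Assumption: probit-style approximation logistic(mu / sqrt(1 + pi * s2 / 8)).
    """
    mu = mean @ x
    s2 = x @ cov @ x
    return np.log(logistic(mu / np.sqrt(1.0 + np.pi * s2 / 8.0)))


def rbsets_score(x, X_query, prior_mean, prior_prec):
    """Equation (2): log posterior predictive minus log prior predictive of a link."""
    c_query = np.ones(X_query.shape[0])  # every pair in the query set S is linked
    post_mean, post_cov = laplace_posterior(X_query, c_query, prior_mean, prior_prec)
    prior_cov = np.linalg.inv(prior_prec)
    return (log_predictive(x, post_mean, post_cov)
            - log_predictive(x, prior_mean, prior_cov))


# Toy query set of three linked pairs, already mapped to feature vectors
# (intercept first). The numbers are made up for illustration only.
X_query = np.array([[1.0, 2.0, 0.0], [1.0, 1.5, 0.1], [1.0, 2.5, -0.1]])
prior_mean, prior_prec = np.zeros(3), np.eye(3)

score_like = rbsets_score(np.array([1.0, 2.0, 0.0]), X_query, prior_mean, prior_prec)
score_unlike = rbsets_score(np.array([1.0, -2.0, 0.0]), X_query, prior_mean, prior_prec)
print(score_like, score_unlike)
```

In this toy setting the candidate whose features resemble those of the query pairs receives a positive score and the opposed candidate a negative one, which is the ranking behaviour Equation (2) is meant to produce. Ranking a database then amounts to computing `rbsets_score` for every linked pair and sorting, with the posterior computed once per query rather than once per candidate in any serious implementation.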