Analogical Reasoning with Relational Bayesian Sets

Ricardo Silva, Gatsby Unit, University College London, rbas@gatsby.ucl.ac.uk
Katherine A. Heller, Gatsby Unit, University College London, heller@gatsby.ucl.ac.uk
Zoubin Ghahramani, Department of Engineering, University of Cambridge, zoubin@eng.cam.ac.uk

Abstract

Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. There are many ways in which objects can be related, making automated analogical reasoning very challenging. Here we develop an approach which, given a set of pairs of related objects $S = \{A_1{:}B_1, A_2{:}B_2, \ldots, A_N{:}B_N\}$, measures how well other pairs A:B fit in with the set S. This addresses the question: is the relation between objects A and B analogous to those relations found in S? We recast this classical problem as a problem of Bayesian analysis of relational data. This problem is non-trivial because direct similarity between objects is not a good way of measuring analogies. For instance, the analogy between an electron around the nucleus of an atom and a planet around the Sun is hardly justified by isolated, non-relational comparisons of an electron to a planet, and a nucleus to the Sun. We develop a generative model for predicting the existence of relationships and extend the framework of Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous a relation is to other relations. This sheds new light on an old problem, which we motivate and illustrate through practical applications in exploratory data analysis.

1 CONTRIBUTION

Consider the following illustrative problems in exploratory data analysis:

Example 1 A researcher has a large collection of papers. She plans to use such a database in order to write a comprehensive article about the evolution of her field through the past two decades. In particular, she has a collection organized as pairs of papers, where one cites the other, i.e., a collection of pairs A:B meaning A cites B. There are several reasons why a paper might cite another: A is a big bibliographic survey, or B was written by the advisor of the author of A, or B was given a best paper award, or the authors were geographically close, or a combination of several such features. Such combinations define a (potentially very large) variety of subpopulations of pairs of papers. While there might be relevant information about A and B in the database, which subpopulation a citation A:B belongs to is never explicitly indicated in the data. Yet the researcher is not completely in the dark: she already has an idea of important subgroups of citations which are representative of the most interesting subpopulations, although it might be difficult to characterize any such set with a simple description. She would like to know which other pairs of papers might belong to such subgroups. Instead of worrying about writing some simple query rules that explain the common properties of such subgroups, she would rather have an intelligent information retrieval system that is able to identify which other pairs in the database are linked in an analogous way to those pairs in her selected sets.

Example 2 A scientist is investigating a population of proteins, within which some pairs are known to interact, while the remaining pairs are known not to interact. In this study, it is known that recorded gene expression profiles of the respective genes can be used as a reasonable predictor of the existence or not of an interaction. The current state of knowledge is still limited regarding which subpopulations (i.e., classes) of interactions exist, although a partial hierarchy of such classes for some proteins is available. Given a selected set of interacting proteins that are believed to belong to a particular level of the class hierarchy, the researcher would like to query her database to discover other plausible pairs of proteins whose mechanism of linkage is of the same nature as in the selected set, i.e., to query for analogous relations. Ideally, she would like to do it without being required to write down query rules that explicitly describe the selected set.

Such are problems of analogical reasoning, instantiated as practical problems of information retrieval for exploratory data analysis. Even in the absence of clear class labels for links such as paper citations and protein interactions (what is recorded is which pairs are linked and which are not), one might have at hand a subpopulation of interest that is best described by a sample of linked objects. The question to be asked is of an exploratory nature: which other objects in my relational database are linked in a similar way?

In both examples, one has a relational database, and it is possible to create models for predicting the existence or lack of a relationship using features such as paper attributes and gene expression profiles. In both examples, it is not fully known how to explicitly describe classes of relations that are believed to exist (and it is a nuisance to select negative examples by hand to learn a classifier).

We propose a method for retrieving relations based on the Bayesian scoring function as proposed in the Bayesian sets method (Ghahramani and Heller, 2005): given a set of related items that are postulated to come from a subpopulation of interest, the goal is to rank existing links according to a measure of similarity with respect to this set. We interpret this problem as the classical problem of analogical reasoning. That is, suppose we have a pair (or set of pairs) of objects A:B. Which other pairs of objects in a relational database best reflect a relation analogous to A:B? This paper provides a novel and probabilistically sound solution to this problem. Moreover, this work extends the Bayesian sets method to discriminative models.

We will focus solely on finding pairwise relations. The idea can be extended to more complex relations, but we will not pursue this here. In Section 2 we discuss related work while describing the difference between analogical reasoning and standard retrieval tasks. The approach is introduced in Section 3 and evaluated in Section 4.

2 RELATED WORK

To define an analogy is to define a measure of similarity between structures of related objects (pairs, in our case). The key aspect is that, typically, we are not interested in how each individual object in a candidate pair is similar to individual objects in the query pairs. As an illustration, consider an analogical reasoning question from a SAT-like exam where for a given pair (say, water:river) we have to choose (out of 5 pairs) the one that best matches the type of relation implicit in such a "query." In this case, it is reasonable to say car:traffic would be a better match than (the somewhat nonsensical) soda:ocean, since cars flow through traffic, and so does water through a river. Notice that if we were to measure the similarity between objects instead of relations, it now seems reasonable to say that soda:ocean in this case would be a much closer pair. In the examples given in the previous section, similarity between pairs of objects is only meaningful to the extent to which such features are useful to predict the existence of the relationships.

There is a large literature on analogical reasoning in artificial intelligence and psychology. We refer to French (2002) for a survey, as well as to some recent machine learning papers on clustering (Marx et al., 2002), prediction (Turney and Littman, 2005) and dimensionality reduction (Memisevic and Hinton, 2005). Here we will use a Bayesian framework for inferring similarity of relations. Given a set of relations, our goal will be to score others as relevant or not. The score is a Bayesian model comparison generalizing the "Bayesian sets" score (Ghahramani and Heller, 2005) to discriminative models over pairs of objects.

The graphical model formulation of Getoor et al. (2002) incorporates models of link existence in relational databases, an idea used explicitly in Section 3 as the first step of our problem formulation. In the clustering literature, the probabilistic approach of Kemp et al. (2006) is motivated by principles similar to those in our formulation: the idea is that there is an infinite mixture of subpopulations that generates the observed relations. Our problem, however, is to retrieve other elements of a subpopulation described by elements of a query set, a goal that is also closer to the classical paradigm of analogical reasoning.

To emphasize once more, our focus here is not on predicting the presence or absence of links, as in, e.g., (Popescul and Ungar, 2003), but rather on retrieving similar links from among those already known to exist in the relational database. Neither is our focus to provide a fully unsupervised clustering of the whole database of pairs (as in, e.g., Kemp et al., 2006), nor to use relational information to improve classification of other attributes (as in, e.g., Getoor et al., 2002).

3 FUNCTIONS AS ANALOGIES

We now describe the analogical reasoning principle more formally. Let $\mathcal{A}$ and $\mathcal{B}$ represent object spaces. To say that an interaction A:B is analogous to $S = \{A_1{:}B_1, A_2{:}B_2, \ldots, A_N{:}B_N\}$ is to define a measure of similarity between the pair and the set of pairs. However, this similarity is not (directly) given by the information contained in the distribution of objects $\{A_i\} \subset \mathcal{A}$, $\{B_i\} \subset \mathcal{B}$, but by the mappings classifying such pairs as being linked:

Bayesian analogical reasoning formulation: Consider a space of latent functions in $\mathcal{A} \times \mathcal{B} \to \{0, 1\}$. Assume that A and B are two objects classified as linked by some unknown function $f(A, B)$, i.e., $f(A, B) = 1$. We want to quantify how similar the function $f(A, B)$ is to the function $g(\cdot, \cdot)$, which classifies all pairs $(A_i, B_i) \in S$ as being linked, i.e., $g(A_i, B_i) = 1$. The similarity should be a function of the observations $\{S, A, B\}$ and our prior distribution over $f(\cdot, \cdot)$ and $g(\cdot, \cdot)$.
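One way to read the formulation above is as a comparison between two hypotheses: either the candidate pair and the query set are classified by the same latent function, or by two functions drawn independently from the prior. The following is a sketch in notation introduced here purely for illustration (the hypothesis labels $H_1$ and $H_0$ and the shorthand $g(S) = 1$ are assumptions, not symbols from the paper); it uses only the product rule of probability.

```latex
% H_1: f = g, a single function drawn from the prior.
% H_0: f and g are drawn independently from the same prior.
% g(S) = 1 abbreviates g(A_i, B_i) = 1 for every pair in S.
\begin{align*}
\log \frac{P(f(A,B)=1,\ g(S)=1 \mid H_1)}{P(f(A,B)=1,\ g(S)=1 \mid H_0)}
  &= \log \frac{P(f(A,B)=1,\ g(S)=1 \mid H_1)}{P(f(A,B)=1)\; P(g(S)=1)} \\
  &= \log P(f(A,B)=1 \mid g(S)=1,\ H_1) \;-\; \log P(f(A,B)=1)
\end{align*}
```

The second line holds because the marginal probability of $g(S) = 1$ is the same under both hypotheses. Read this way, a pair scores highly when conditioning on the query set raises the probability that it is linked, relative to the prior alone.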
Functions $f(\cdot)$ and $g(\cdot)$ are unobserved, hence the need for a prior that will be used to integrate over the function space. The resulting similarity metric will be defined through a Bayes factor, as explained next.

For simplicity, we will consider a family of latent functions that is parameterized by a finite-dimensional vector: the logistic regression function with multivariate Gaussian priors for its parameters. For a particular pair $(A_i \in \mathcal{A}, B_j \in \mathcal{B})$, let $X_{ij} = [\Phi_1(A_i, B_j)\ \Phi_2(A_i, B_j)\ \ldots\ \Phi_K(A_i, B_j)]^T$ be a point on a feature space defined by the mapping $\Phi: \mathcal{A} \times \mathcal{B} \to \Re^K$. Let $C_{ij} \in \{0, 1\}$ be an indicator of the existence of a link between $A_i$ and $B_j$ in the database. Let $\Theta = [\theta_1, \ldots, \theta_K]^T$ be the parameter vector for our logistic regression model

$$P(C_{ij} = 1 \mid X_{ij}, \Theta) = \mathrm{logistic}(\Theta^T X_{ij}) \qquad (1)$$

where $\mathrm{logistic}(x) = (1 + e^{-x})^{-1}$. Our measure of similarity for a pair $(A_i, B_j)$ with respect to a query set S is the probabilistic similarity measure of Bayesian sets (Ghahramani and Heller, 2005) on a log-scale:

$$\mathrm{score}(A_i, B_j) = \log P(C_{ij} = 1 \mid X_{ij}, S, C_S = 1) - \log P(C_{ij} = 1 \mid X_{ij}) \qquad (2)$$

where $C_S$ is the vector of link indicators for S; $C_S = 1$, i.e., $C_1 = 1, C_2 = 1, \ldots, C_N = 1$, indicates that all pairs in S are linked.

The general framework is as follows. We are given a relational database $(D_A, D_B, L_{AB})$, where the first two components of this database are sampled respectively from $\mathcal{A}$ and $\mathcal{B}$. Relationship table $L_{AB}$ is a binary matrix assumed to be generated by a logistic regression model of link existence. A query proceeds according to the following steps:

1. the user selects a set of pairs S that are linked in the database;
2. the system performs Bayesian inference to obtain the corresponding posterior distribution for $\Theta$, $P(\Theta \mid S, C_S)$, given a Gaussian prior $P(\Theta)$;
3. the system iterates through all linked pairs, computing the following for each pair

$$P(C_{ij} = 1 \mid X_{ij}, S, C_S = 1) = \int P(C_{ij} = 1 \mid X_{ij}, \Theta)\, P(\Theta \mid S, C_S = 1)\, d\Theta \qquad (3)$$

as well as $P(C_{ij} = 1 \mid X_{ij})$ by integration over $P(\Theta)$¹, and then sorts them according to the score in Equation (2).

¹ Since the integral used in the Bayesian logistic function does not have a closed formula, in all of these expressions we use the Bayesian variational approximation by Jaakkola and Jordan (2000). A short summary of this approach is given by Silva et al. (2007).

Figure 1: (a) Graphical plate representation for the relational Bayesian logistic regression, where $N_A$, $N_B$ and $N_C$ are the number of objects of each class. (b) Extra dependencies induced by further conditioning on C are represented by undirected edges.

The corresponding plate model is illustrated in Figure 1(a). Latent parameter vector $\Theta = [\theta_1, \theta_2, \ldots, \theta_K]^T$ and objects A and B are ancestors of link indicator C. By conditioning on $C = 1$, elements of $\Theta$ become connected to, and share information from, the input data $\{A, B\}$, as in Figure 1(b). This information can be passed forward to evaluate other points. The suggested setup scales as $O(K^3)$ due to the matrix inversions necessary for (variational) Bayesian logistic regression (Jaakkola and Jordan, 2000). If necessary, a further approximation for $P(\Theta \mid S, C_S)$ might be imposed if the dimensionality of $\Theta$ is too high.

3.1 Choice of features and relational discrimination

Our setup assumes that the feature space $\Phi$ provides a reasonable classifier to predict the existence of links. It is evident that the proposed framework could be used for non-relational problems with arbitrary classification functions. However, our analogical reasoning formulation is a relational model to the extent that it models presence and absence of interactions between objects: by conditioning on the link indicators, the similarity score between A:B and C:D is always a function of pairs (A, B) and (C, D) that is not in general decomposable as similarities between A and C, B and D. Again, this is illustrated by Figure 1. One might also have records of pairwise information on other relational tables $D_{AB}$ besides the targeted one. For instance, one might have a measure in a protein database that is a binary indicator of both proteins being produced in the same area in the cell or not, or the number of common proteins that interact with the pair.

Our method then learns to rank similarity of relations based on features extracted from a relational database, and such features have a role similar to the latent variables in block-models, as discussed in the sequel. Useful predictive features can also be generated automatically with a variety of algorithms (e.g., the "structural logistic regression" of Popescul and Ungar, 2003). See also Džeroski and Lavrač (2001). Jensen and Neville (2002) discuss shortcomings of automated methods for feature selection in relational classification.

Our analogical reasoning formulation also assumes all subpopulations of interest are measured on the same feature space. This allows for comparisons between, e.g., cells from different species, or web pages from different web domains, as long as $\Phi(\cdot, \cdot)$ is the same. The most general analogical reasoning formulation would not have this requirement, but for the problem to be well-defined, features from the different spaces must be related somehow. A hierarchical Bayesian formulation for linking different feature spaces is one possibility which might be treated in future work.

3.2 Priors

The choice of prior is based on the observed data, in a way that is analogous to the choice of priors used in the original formulation of Bayesian sets (Ghahramani and Heller, 2005). Let $\hat{\Theta}$ be the maximum likelihood estimator of $\Theta$. Since the number of possible pairs grows at a quadratic rate with the number of objects, we do not use the whole database for maximum likelihood estimation. Also, since most databases have a sparse link matrix, to get $\hat{\Theta}$ we use all linked pairs as members of the positive class ($C = 1$), and sample unlinked pairs as members of the negative class ($C = 0$)².

² In our experiments, we sample 10 "negative" pairs for each "positive" one, and weight them to reflect the proportion in the database (e.g., if we sample 10 negatives for each positive, while in the database there are 200 negatives for each positive, we count each negative case as being 20 cases replicated).

We use the prior $P(\Theta) = N(\hat{\Theta}, (c\hat{T})^{-1})$, where $N(m, V)$ is a normal of mean m and variance V. Matrix $\hat{T}$ is the empirical second moments matrix of the linked object features in X, a measure of their variability. Constant c is a smoothing parameter set by the user. In our experiments, we selected it to be twice the total number of links.

Empirical priors are a sensible choice, since this is a retrieval, not a predictive, task. Basically, the entire dataset is the population. A data-dependent prior based on the population is quite important for an approach like Bayesian sets, since deviations from the "average" behaviour in the data are useful to discriminate between subpopulations.

Silva et al. (2007) present several illustrations of the score function behavior under different choices of priors and query sets.

3.3 Connections to Bayesian sets and block models

The model in Figure 1(a) is a typical conditional relational model, i.e., conditioned on objects and parameters the resulting relations are i.i.d. Under the original Bayesian sets formulation, the score function can be described by (the logarithm of) the Bayes factor comparing the models in Figure 2.

Figure 2: The score of a new data point $\{A, B, C\}$ is given by the Bayes factor that compares models (a) and (b). Node $\alpha$ represents the hyperparameters for $\Theta$. In (a), the generative model is the same for both the new point and the query set represented in the rectangle. Notice that our conditioning set S of pairs in $\{A_i\} \times \{B_j\}$ might contain repeated instances of a same point, i.e., some $A_i$ or $B_j$ might appear multiple times in different relations, as illustrated by nodes $A_i$ with multiple outgoing edges. In (b), the new point and the query set do not share the same parameters.

In contrast, consider the following direct modification of the Bayesian sets formulation to this problem: flatten the data, creating for each pair $(A_i, B_j)$ a row in the database with an extra binary indicator of relationship existence. Create a joint model for pairs by using the marginal models for A and B and treating different rows as being independent. This ignores the fact that the resulting flattened data points are not really i.i.d. under such a model, because the same object might appear in multiple relations (Džeroski and Lavrač, 2001). The model also fails to capture the dependency between $A_i$ and $B_j$ that arises from conditioning on $C_{ij}$, even if $A_i$ and $B_j$ are marginally independent. Nevertheless, heuristically this approach can sometimes produce some good results, and for several types of probability families it is very computationally efficient.

A different approach for modeling relational data is the block-model used for years by statisticians and sociologists for modeling social networks (Kemp et al., 2006; Airoldi et al., 2006). The basic idea is to use hidden variables³ in place of our feature vector $X_{ij}$: this is partially motivated by the fact that typically, in social network analysis, there are no easily available features of the population that are recorded. To compute quantities such as the marginal likelihood of the model one has to integrate out a large number of hidden variables.

³ Such hidden variables are usually discrete indicators of some latent cluster membership for objects. The model typically requires a cross-clustering for object membership and relation membership.

Even for a moderate number of objects, straight evaluation of all pairs might be computationally infeasible in the block-model setup. Our discriminative model only needs the unlinked pairs when setting a prior, which is accomplished by sampling when the total number of unlinked pairs is too large. Throwing away part of the information from unlinked pairs is arguably less harmful to our goals than to the clustering procedure performed with block-models.

Our model assumes link indicators are independent given object features, which might not be the case for particular choices of feature space. In theory, block-models sidestep this issue by learning all the necessary latent features that account for link dependence. An important future extension of our work would consist of tractably accounting for residual link association that is not intermediated by our observed features.

3.4 On continuous relations

Although we focus on measuring similarity of qualitative relationships, the same idea could be extended to continuous measures of relationship. For instance, Turney and Littman (2005) measure relations between words by their co-occurrences in texts next to a set of joining terms, such as the two words being connected by a specific preposition. Several similarity metrics can be defined on this vector of continuous relationship (e.g., cosine distance). However, given data on word features and a predictive model for such quantitative relations, one can directly adapt our model⁴.
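The query procedure of Section 3 (Equations 1 to 3, with a Gaussian prior as in Section 3.2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `jj_lambda`, `jj_posterior`, `log_predictive` and `rbsets_score` are hypothetical. The posterior uses the standard fixed-point updates of the Jaakkola and Jordan variational bound for a Gaussian prior. The predictive integral is approximated with a probit-style correction, which is a substitution made here for brevity and may differ from the approximation used in the paper. The prior mean and precision are passed in by the caller, so the estimates of Section 3.2 would have to be computed separately.

```python
import numpy as np


def jj_lambda(xi):
    """lambda(xi) = tanh(xi / 2) / (4 xi) from the Jaakkola-Jordan bound."""
    xi = np.maximum(xi, 1e-8)  # avoid 0/0; the limit at xi -> 0 is 1/8
    return np.tanh(xi / 2.0) / (4.0 * xi)


def jj_posterior(X_s, prior_mean, prior_prec, n_iter=50):
    """Gaussian approximation to P(Theta | S, C_S = 1).

    X_s holds one feature vector per query pair; every label is 1.
    """
    xi = np.ones(X_s.shape[0])  # one variational parameter per query pair
    for _ in range(n_iter):
        prec = prior_prec + 2.0 * (X_s.T * jj_lambda(xi)) @ X_s
        cov = np.linalg.inv(prec)  # the O(K^3) step
        # (C - 1/2) equals 1/2 for every query pair because C_S = 1.
        mean = cov @ (prior_prec @ prior_mean + 0.5 * X_s.sum(axis=0))
        second_moment = cov + np.outer(mean, mean)
        xi = np.sqrt(np.einsum("nk,kl,nl->n", X_s, second_moment, X_s))
    return mean, cov


def log_predictive(X, mean, cov):
    """Approximate log of E[logistic(Theta^T x)] under N(mean, cov).

    Assumption: probit-style correction, not the variational bound.
    """
    mu = X @ mean
    var = np.einsum("nk,kl,nl->n", X, cov, X)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return -np.logaddexp(0.0, -kappa * mu)  # log logistic(kappa * mu)


def rbsets_score(X_cand, X_s, prior_mean, prior_prec):
    """Equation (2): log posterior predictive minus log prior predictive."""
    post_mean, post_cov = jj_posterior(X_s, prior_mean, prior_prec)
    prior_cov = np.linalg.inv(prior_prec)
    return (log_predictive(X_cand, post_mean, post_cov)
            - log_predictive(X_cand, prior_mean, prior_cov))


# Toy usage with made-up numbers (K = 3 features, zero-mean unit prior).
x = np.array([1.0, 2.0, 0.0])
query = np.tile(x, (5, 1))        # five query pairs with identical features
candidates = np.vstack([x, -x])   # one matching pair, one mirrored pair
scores = rbsets_score(candidates, query, np.zeros(3), np.eye(3))
ranking = np.argsort(-scores)     # best-scoring candidate first
```

With a zero prior mean the prior predictive is 0.5 for every pair, so in this toy run the candidate matching the query scores above zero and the mirrored candidate scores below zero. An empty query set leaves the posterior equal to the prior and gives a score of zero for every candidate.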
⁴ Notice that our approach would still not be directly comparable to the one by Turney and Littman, since unlike them we would make use of some external data source for the features of the words.

4 EXPERIMENTS

We now describe two experiments on analogical retrieval using the proposed model. Evaluation of the significance of retrieved items often relies on subjective assessments (Ghahramani and Heller, 2005). To simplify our study, we will focus on particular setups where objective measures of success can be derived.

Our main standard of comparison will be a "flattened Bayesian sets" algorithm (which we will call "standard Bayesian sets," SBSets, in contrast to the relational model, RBSets). Using a multivariate independent Bernoulli model as in the original paper (Ghahramani and Heller, 2005), we join linked pairs into single rows, and then apply the original algorithm directly on this joined data. This algorithm serves the purpose of measuring both the loss of not treating relational data as such and the limitations of treating similarity of pairs through the generative models of A and B instead of the generative model for the latent predictive function $g(\cdot, \cdot)$.

In both experiments, objects are of the same type, and therefore of the same dimensionality. The feature vector $X_{ij}$ for each pair of objects $\{A_i, B_j\}$ consists of the V features for object $A_i$, the V features of object $B_j$, and measures $\{Z_1, \ldots, Z_V\}$, where $Z_v = (A_{iv} \times B_{jv}) / (\|A_i\| \times \|B_j\|)$, $\|A_i\|$ being the Euclidean norm of the V-dimensional representation of $A_i$. We also use a constant value (1) as part of the feature set as an intercept term for the logistic regression. The Z features are exactly the ones used in the cosine distance measure, a common and practical measure widely used in information retrieval (Manning et al., 2007). They also have the important advantage of scaling well with the number of variables in the database. Moreover, adopting such features will make our comparisons in the next sections more fair, since we evaluate how well cosine distance performs in our task. Notice $X_{ij}$ represents asymmetric relationships as required in our applications. For symmetric relationships, features such as $|A_{iv} - B_{jv}|$ could be used instead.

Figure 3: Precision/recall curves (precision against recall; panels: first, second, third and fourth trial) for four different random queries of size 10 for the three algorithms: relational Bayesian sets (RBSets), regular Bayesian sets with Bernoulli model (SBSets-B) and cosine distance.

4.1 Synthetic experiment

We first discuss a synthetic experiment where there is a known ground truth. We generate data from a simulated model with six classes of relations represented by six different instantiations of $\Theta$, $\{\Theta_0, \Theta_1, \ldots, \Theta_5\}$. This simplified setup defines a multiclass logistic softmax classifier that outputs a class label out of $\{0, 1, \ldots, 5\}$. Object spaces $\mathcal{A}$ and $\mathcal{B}$ are the same, and defined by a multivariate Bernoulli distribution of 20 dimensions, where each attribute has independently a probability 1/2 of being 1. We generate 500 objects, and consider all $500^2$ pairs to generate 250,000 feature vectors X. For each X we evaluate our logistic classifier to generate a class label. If this class is zero, we label the corresponding pair as "unlinked." Otherwise, we label it as "linked." The intercept parameter for parameter vector $\Theta_0$ was set manually to make class 0 appear in at least 99% of the data⁵, thus corresponding to the usual sparse matrices found in relational data.

⁵ Values for vectors $\Theta_1, \Theta_2, \ldots, \Theta_5$ were otherwise generated by independent multivariate Gaussian distributions with zero mean and standard deviation of 10.

The algorithms we evaluated do not know which of the 5 classes the linked pairs originally corresponded to. However, since the labels are known through simulation, we are able to tell how well points of a particular class are ranked given a query of pairs from the same class. Our evaluation is as follows. We generate precision/recall curves for three algorithms: our relational Bayesian sets RBSets, "flattened" standard Bayesian sets with Bernoulli model (SBSets) and cosine distance (summing over all elements in the query). For each query, we randomly sampled 10 elements out of the pool of elements of the least frequent class (about 1% of the total number of links), and ranked the remaining 2320 linked pairs. We counted an element as a hit if it was originally from the selected class.

RBSets gives consistently better results for the top 50% retrievals. As an illustration, we depicted four random queries of 10 items in Figure 3. Notice that sometimes SBSets can do reasonably well, often achieving better precision at the bottom 40% recalls: by virtue of having few objects in the space of elements of this class, a few of them will appear in pairs both in the query and outside of it, facilitating matching by object similarity since half of the pair is already given as input. We conjecture this explains the seemingly strong results of feature-based approaches on the bottom 40%. However, when this does not happen the problem can get much harder, making SBSets much more sensitive to the query than RBSets, as illustrated in some of the runs in Figure 3.

4.2 The WebKB experiment

The WebKB data is a collection of web pages from several universities, where relations are directed and given by hyperlinks (Craven et al., 1998). Web pages are classified as being of type course, department, faculty, project, staff, student and other. Documents from four universities (cornell, texas, washington and wisconsin) are also labeled as such. Binary data was generated from this database using the same methods as Ghahramani and Heller (2005). A total of 19,450 binary variables per object are generated. To avoid introducing extra approximations into RBSets, we reduced dimensionality in the original representation using singular value decomposition, obtaining 25 measures per object. This also improved the results of our algorithm and cosine distance. For SBSets, this is a way of creating correlations in the original feature space.

To evaluate the gain of our model over competitors, we will use the following setup. In the first query, we are given the pairs of web pages of the type student → course from three of the labeled universities, and evaluate how relations are ranked in the fourth university. Because we know class labels (while the algorithm does not), we can use the class of the returned pairs to label a hit as being "relevant" or "irrelevant." We label a pair $(A_i, B_j)$ as relevant if and only if $A_i$ is of type student and $B_j$ is of type course, and $A_i$ links into $B_j$.

This is a very stringent criterion, since other types of relations could also be valid (e.g., staff → course appears to be a reasonable match). However, this facilitates objective comparisons of algorithms. Also, the other class contains many types of pages, which allows for possibilities such as a student → "hobby" pair. Such pairs might be hard to evaluate (e.g., is that particular hobby incrementally demanding in a way that coursework is? Is it as fun as taking a machine learning course?). As a compromise, we omit all pages from the category other in order to better clarify differences between algorithms⁶.

⁶ As an extreme example, querying student → course pairs from the wisconsin university returned student → other pairs at the top four. However, these other pages were for some reason course pages, such as http://www.cs.wisc.edu/markhill/cs752.html

Figure 4: Results for student → course relationships (precision against recall; panels: cornell, texas, washington, wisconsin; curves: RBSets, SBSets1, SBSets2, Cosine1, Cosine2).

Figure 5: Results for faculty → project relationships (precision against recall; panels: cornell, texas, washington, wisconsin; curves: RBSets, SBSets1, SBSets2, Cosine1, Cosine2).

Precision/recall curves for the student → course queries are shown in Figure 4. There are four queries, each corresponding to a search over a specific university given all valid student → course pairs from the other three. Five algorithms are compared in each evaluation: the standard Bayesian sets with the original 19,450 binary variables for each object, plus another 19,450 binary variables, each corresponding to the product of the respective variables in the original pair of objects (SBSets1); the standard Bayesian sets with the original binary variables only (SBSets2); a standard cosine distance measure over the 25-dimensional representation (Cosine1); a cosine distance measure using the 19,450-dimensional text data with TF-IDF weights (Cosine2); and our approach, RBSets.

In Figure 4, RBSets demonstrates consistently superior or equal precision-recall. Although SBSets performs well when asked to retrieve only student items or only course items, it falls short of detecting what features of student and course are relevant to predict a link. The discriminative model within RBSets conveys this information through the parameters.

We also did an experiment with a query of type faculty → project, shown in Figure 5. This time results between algorithms were closer. To make differences more evident, we adopt a slightly different measure of success: we count a hit as 1 if the pair retrieved is a faculty → project pair, and as 0.5 for pairs of type student → project and staff → project. Notice this is a much harder query. For instance, the structure of the project web pages in the texas group was quite distinct from the other universities: they are mostly very short, basically containing links to members of the project and other project web pages.

Although the precision/recall curves convey a global picture of performance for each algorithm, they might not be a completely clear way of ranking approaches for cases where curves intersect at several points. In order to summarize individual performances with a single statistic, we computed the area under each precision/recall curve (with linear interpolation between points). Results are given in Table 1. Numbers in bold indicate the algorithm with the highest area. The dominance of RBSets should be clear.

Table 1: Area under the precision/recall curve for each algorithm and query (C1 = Cosine1, C2 = Cosine2, RB = RBSets, SB1 = SBSets1, SB2 = SBSets2).

              student → course                faculty → project
              C1    C2    RB    SB1   SB2     C1    C2    RB    SB1   SB2
cornell       0.87  0.61  0.87  0.84  0.80    0.19  0.04  0.24  0.18  0.18
texas         0.55  0.54  0.77  0.62  0.48    0.24  0.07  0.29  0.07  0.12
washington    0.67  0.64  0.76  0.69  0.44    0.40  0.11  0.48  0.29  0.18
wisconsin     0.75  0.73  0.88  0.77  0.55    0.28  0.07  0.27  0.20  0.21

Silva et al. (2007) describe another application of RBSets, in this case for symmetric protein-protein interactions. In this application, there are no individual object features on which Cosine and SBSets can rely (every $X_{ij}$ measures a pairwise feature), and RBSets performs substantially better.

5 CONCLUSION

We have emphasized the process of analogical reasoning as a retrieval of similar relationships, and presented a probabilistically sound approach for this problem. There is of course much more to analogical reasoning than calculating the similarity of complex relational structures. For instance, there is the issue of judging how significant the similarity is. Considering that the retrieved objects might be of a very different nature than those in the query set, one might also want to explain why the relations are judged to be similar. Ultimately, in case-based reasoning and planning problems (Kolodner, 1993), one might have to adapt the similar structures to solve a new case or plan.

One should see the contribution of this paper as a step towards a formal measure of analogical similarity. Much remains to be done to create a complete analogical reasoning system, but the described approach has immediate applications to information retrieval and exploratory data analysis.

References

E. Airoldi, D. Blei, E. Xing, and S. Fienberg. Mixed membership stochastic block models for relational data with application to protein-protein interactions. Proceedings of the International Biometrics Society Annual Meetings, 2006.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI'98, pages 509-516, 1998.
S. Džeroski and N. Lavrač. Relational Data Mining. Springer, 2001.
R. French. The computational modeling of analogy-making. Trends in Cognitive Sciences, 6, 2002.
L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. JMLR, 3:679-707, 2002.
Z. Ghahramani and K. Heller. Bayesian sets. 18th NIPS, 2005.
T. Jaakkola and M. Jordan. Bayesian parameter estimation via variational bounds. Statistics and Computing, 10:25-37, 2000.
D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. Proceedings of ICML, 2002.
C. Kemp, J. Tenenbaum, T. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. Proceedings of AAAI'06, 2006.
J. Kolodner. Case-Based Reasoning. Morgan Kaufmann, 1993.
C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. In press, 2007.
Z. Marx, I. Dagan, J. Buhmann, and E. Shamir. Coupled clustering: a method for detecting structural correspondence. JMLR, 3:747-780, 2002.
R. Memisevic and G. Hinton. Multiple relational embedding. 18th NIPS, 2005.
A. Popescul and L. H. Ungar. Structural logistic regression for link analysis. Multi-Relational Data Mining Workshop at KDD-2003, 2003.
R. Silva, E. Airoldi, and K. Heller. Small sets of interacting proteins suggest latent linkage mechanisms through analogical reasoning. Gatsby Technical Report, GCNU TR 2007-001, 2007.
P. Turney and M. Littman. Corpus-based learning of analogies and semantic relations. Machine Learning Journal, 60:251-278, 2005.