Gatsby Computational Neuroscience Unit
17 Queen Square, London
University College London
WC1N 3AR, United Kingdom
+44 20 7679 1176

Funded in part by the Gatsby Charitable Foundation.

May 5, 2005
GCNU TR 2005-001

Infinite Latent Feature Models and the Indian Buffet Process

Thomas L. Griffiths, Cognitive and Linguistic Sciences, Brown University
Zoubin Ghahramani, Gatsby Unit

Abstract

We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We derive the distribution by taking the limit of a distribution over $N \times K$ binary matrices as $K \to \infty$, a strategy inspired by the derivation of the Chinese restaurant process (Aldous, 1985; Pitman, 2002) as the limit of a Dirichlet-multinomial model. This strategy preserves the exchangeability of the rows of matrices. We define several simple generative processes that result in the same distribution over equivalence classes of binary matrices, one of which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying this algorithm to an artificial dataset.

1 Introduction

Unsupervised learning aims to recover the latent structure responsible for generating the observed properties of a set of objects. The statistical models typically used in unsupervised learning draw upon a relatively small repertoire of representations for this latent structure. The simplest representation, used in mixture models, associates each object with a single latent class. This approach is appropriate when objects can be partitioned into relatively homogeneous subsets. However, the properties of many objects are better captured by representing each object as possessing multiple latent features. For example, when describing a friend, we might characterize him as married, a Democrat, and a Red Sox fan. Each of these features may be useful in explaining aspects of his behavior, and is not necessarily directly observable.

Several methods exist for representing objects in terms of latent features. One approach is to associate each object with a probability distribution over features. This approach has proven successful in modeling the content of documents, where each feature indicates one of the topics that appears in the document (e.g., Blei, Ng, & Jordan, 2003). However, using a probability distribution over features introduces a conservation constraint: the more an object expresses one feature, the less it can express others. This constraint is inappropriate in many settings – in the example above, it would imply that the more our friend appreciates the Red Sox, the less he would be married – and is not imposed by other feature-based representation schemes. For instance, we could choose to represent each object as a binary vector, with entries indicating the presence or absence of each feature (e.g., Ueda & Saito, 2003), allow each feature to take on a continuous value, representing objects with points in a latent space (e.g., Jolliffe, 1986), or define a factorial model, in which each feature takes on one of a discrete set of values (e.g., Zemel & Hinton, 1994; Ghahramani, 1995).

Regardless of the form the representation takes, a critical question in all of these approaches is the dimensionality of that representation: how many classes or features are needed to express the latent structure responsible for the observed data. Often, this is treated as a model selection problem, choosing the model with the dimensionality that results in the best performance. This treatment of the problem assumes that there is a single, finite-dimensional representation that correctly characterizes the properties of the observed objects. An alternative is to assume that the number of classes or features is actually potentially unbounded, and that the observed objects only manifest a sparse subset of those classes or features (Rasmussen & Ghahramani, 2001). This assumption seems appropriate when describing our friend the Red Sox fan: it is possible to imagine an arbitrarily large set of features that could be used to describe people, and which subset of features we actually use will depend upon the properties we want to explain.

The assumption that the observed objects manifest a sparse subset of an unbounded number of latent classes is often used in nonparametric Bayesian statistics. In particular, this assumption is made in Dirichlet process mixture models, which are used for nonparametric density estimation (Antoniak, 1974; Escobar & West, 1995; Ferguson, 1983; Neal, 2000). Under one interpretation of a Dirichlet process mixture model, each datapoint is assigned to a latent class, and each class is associated with a distribution over observable properties. The prior distribution over assignments of datapoints to classes is specified in such a way that the number of classes used by the model is bounded only by the number of objects, making Dirichlet process mixture models "infinite" mixture models (Rasmussen, 2000). Recent work has extended these methods to models in which each object is represented by a distribution over features (Blei, Griffiths, Jordan, & Tenenbaum, 2004; Teh, Jordan, Beal, & Blei, 2004). However, there are no equivalent methods for dealing with other feature-based representations, be they binary vectors, factorial structures, or vectors of continuous feature values.

In this paper, we take the idea of defining priors over infinite combinatorial structures from nonparametric Bayesian statistics, and use it to develop methods for unsupervised learning in which each object is represented by a sparse subset of an unbounded number of features. These features can be binary, take on multiple discrete values, or have continuous weights. In all of these representations, the difficult problem is deciding which features an object should possess. The set of features possessed by a set of objects can be expressed in the form of a binary matrix, where each row is an object, each column is a feature, and an entry of 1 indicates that a particular object possesses a particular feature. We thus focus on the problem of defining a distribution on infinite sparse binary matrices. This distribution can be used to define probabilistic models that represent objects with infinitely many binary features, and can be combined with priors on feature values to produce factorial and continuous representations.

The plan of the paper is as follows. Section 2 reviews the principles behind infinite mixture models, focusing on the prior on class assignments assumed in these models, which can be defined in terms of a simple stochastic process – the Chinese restaurant process. Section 3 discusses the role of a prior on infinite binary matrices in defining infinite latent feature models. Section 4 describes such a prior, corresponding to a stochastic process we call the Indian buffet process. Section 5 illustrates how this prior can be used, defining an infinite-dimensional linear-Gaussian model, deriving a sampling algorithm for inference in this model, and applying it to a simple dataset. Section 6 discusses conclusions and future work.

2 Latent class models

Assume we have $N$ objects, with the $i$th object having $D$ observable properties represented by a row vector $\mathbf{x}_i$. In a latent class model, such as a mixture model, each object is assumed to belong to a single class, $c_i$, and the properties $\mathbf{x}_i$ are generated from a distribution determined by that class. Using the matrix $X = [\mathbf{x}_1^T\ \mathbf{x}_2^T\ \cdots\ \mathbf{x}_N^T]^T$ to indicate the properties of all $N$ objects, and the vector $\mathbf{c} = [c_1\ c_2\ \cdots\ c_N]^T$ to indicate their class assignments, the model is specified by a prior over assignment vectors $P(\mathbf{c})$, and a distribution over property matrices conditioned on those assignments, $p(X \mid \mathbf{c})$.[1] These two distributions can be dealt with separately: $P(\mathbf{c})$ specifies the number of classes and their relative probability, while $p(X \mid \mathbf{c})$ determines how these classes relate to the properties of objects. In this section, we will focus on the prior over assignment vectors, $P(\mathbf{c})$,
showing how such a prior can be defined without placing an upper bound on the number of classes.

[Footnote 1: We will use $P(\cdot)$ to indicate probability mass functions, and $p(\cdot)$ to indicate probability density functions. We will assume that $\mathbf{x}_i \in \mathbb{R}^D$, and $p(X \mid \mathbf{c})$ is thus a density.]

2.1 Finite mixture models

Mixture models assume that the assignment of an object to a class is independent of the assignments of all other objects. If there are $K$ classes, we have

$$P(\mathbf{c} \mid \theta) = \prod_{i=1}^{N} P(c_i \mid \theta) = \prod_{i=1}^{N} \theta_{c_i}, \qquad (1)$$

where $\theta$ is a multinomial distribution over those classes, and $\theta_k$ is the probability of class $k$ under that distribution. Under this assumption, the probability of the properties of all $N$ objects $X$ can be written as

$$p(X \mid \theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} p(\mathbf{x}_i \mid c_i = k)\,\theta_k. \qquad (2)$$

The distribution from which each $\mathbf{x}_i$ is generated is thus a mixture of the $K$ class distributions $p(\mathbf{x}_i \mid c_i = k)$, with $\theta_k$ determining the weight of class $k$.

The mixture weights $\theta$ can either be treated as a parameter to be estimated, or a variable with prior distribution $p(\theta)$. In Bayesian approaches to mixture modeling, a standard choice for $p(\theta)$ is a symmetric Dirichlet distribution. The Dirichlet distribution on multinomials over $K$ classes has parameters $\alpha_1, \alpha_2, \ldots, \alpha_K$, and is conjugate to the multinomial (e.g., Bernardo & Smith, 1994). The probability of any multinomial distribution $\theta$ is given by

$$p(\theta) = \frac{\prod_{k=1}^{K} \theta_k^{\alpha_k - 1}}{D(\alpha_1, \alpha_2, \ldots, \alpha_K)}, \qquad (3)$$

in which $D(\alpha_1, \alpha_2, \ldots, \alpha_K)$ is the Dirichlet normalizing constant

$$D(\alpha_1, \alpha_2, \ldots, \alpha_K) = \int_{\Delta_K} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\, d\theta \qquad (4)$$
$$= \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma(\sum_{k=1}^{K} \alpha_k)}, \qquad (5)$$

where $\Delta_K$ is the simplex of multinomials over $K$ classes, and $\Gamma(\cdot)$ is the generalized factorial function, with $\Gamma(m) = (m-1)!$ for any non-negative integer $m$.

In a symmetric Dirichlet distribution, all $\alpha_k$ are equal. For example, we could take $\alpha_k = \frac{\alpha}{K}$ for all $k$. In this case, Equation 5 becomes

$$D\!\left(\tfrac{\alpha}{K}, \tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K}\right) = \frac{\Gamma(\frac{\alpha}{K})^K}{\Gamma(\alpha)}, \qquad (6)$$

and the mean of $\theta$ is the multinomial that is uniform over all classes. The probability model that we have defined is

$$\theta \mid \alpha \sim \text{Dirichlet}\!\left(\tfrac{\alpha}{K}, \tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K}\right), \qquad c_i \mid \theta \sim \text{Discrete}(\theta),$$

where $\text{Discrete}(\theta)$ is the multiple-outcome analogue of a Bernoulli event, where the probabilities of the outcomes are specified by $\theta$ (i.e., $c_i \mid \theta \sim \text{Multinomial}(\theta, 1)$). The dependencies among variables in this model are shown in Figure 1(a). Having defined a prior on $\theta$, we can simplify this model by integrating over all values of $\theta$ rather than representing them explicitly.

Figure 1: Graphical models for different priors. Nodes are variables, arrows indicate dependencies, and plates (Buntine, 1994) indicate replicated structures. (a) The Dirichlet-multinomial model used in defining the Chinese restaurant process. (b) The beta-binomial model used in defining the Indian buffet process.

The marginal probability of an assignment vector $\mathbf{c}$, integrating over all values of $\theta$, is

$$P(\mathbf{c}) = \int_{\Delta_K} \prod_{i=1}^{N} P(c_i \mid \theta)\, p(\theta)\, d\theta \qquad (7)$$
$$= \int_{\Delta_K} \frac{\prod_{k=1}^{K} \theta_k^{m_k + \frac{\alpha}{K} - 1}}{D(\frac{\alpha}{K}, \frac{\alpha}{K}, \ldots, \frac{\alpha}{K})}\, d\theta \qquad (8)$$
$$= \frac{D(m_1 + \frac{\alpha}{K}, m_2 + \frac{\alpha}{K}, \ldots, m_K + \frac{\alpha}{K})}{D(\frac{\alpha}{K}, \frac{\alpha}{K}, \ldots, \frac{\alpha}{K})} \qquad (9)$$
$$= \left(\prod_{k=1}^{K} \frac{\Gamma(m_k + \frac{\alpha}{K})}{\Gamma(\frac{\alpha}{K})}\right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}, \qquad (10)$$

where $m_k = \sum_{i=1}^{N} \delta(c_i = k)$ is the number of objects assigned to class $k$. The tractability of this integral is a result of the fact that the Dirichlet is conjugate to the multinomial.

Equation 10 defines a probability distribution over the class assignments $\mathbf{c}$ as an ensemble. Individual class assignments are no longer independent. Rather, they are exchangeable (Bernardo & Smith, 1994), with the probability of an assignment vector remaining the same when the indices of the objects are permuted. Exchangeability is a desirable property in a distribution over class assignments, because the indices labelling objects are typically arbitrary. However, the distribution on assignment vectors defined by Equation 10 assumes an upper bound on the number of classes of objects, since it only allows assignments of objects to up to $K$ classes.
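Equation 10 is straightforward to evaluate numerically in log space. The following sketch (ours, not part of the original report; it assumes NumPy and SciPy are available) computes $\log P(\mathbf{c})$ from the class counts and illustrates exchangeability on a small example.

```python
import numpy as np
from scipy.special import gammaln

def log_prob_assignment(c, K, alpha):
    """Log of Equation 10: marginal probability of an assignment vector c
    under a symmetric Dirichlet(alpha/K) prior on the mixture weights."""
    c = np.asarray(c)
    N = len(c)
    m = np.bincount(c, minlength=K)  # m_k: number of objects in each class
    return (np.sum(gammaln(m + alpha / K) - gammaln(alpha / K))
            + gammaln(alpha) - gammaln(N + alpha))

# Exchangeability: permuting the objects leaves log P(c) unchanged.
c = np.array([0, 0, 1, 2, 1, 0])
print(log_prob_assignment(c, K=5, alpha=1.0))
print(log_prob_assignment(np.random.permutation(c), K=5, alpha=1.0))
```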
2.2 Infinite mixture models

Intuitively, defining an infinite mixture model means that we want to specify the probability of $X$ in terms of infinitely many classes, modifying Equation 2 to become

$$p(X \mid \theta) = \prod_{i=1}^{N} \sum_{k=1}^{\infty} p(\mathbf{x}_i \mid c_i = k)\,\theta_k, \qquad (11)$$

where $\theta$ is an infinite-dimensional multinomial distribution. In order to repeat the argument above, we would need to define a prior, $p(\theta)$, on infinite-dimensional multinomials, and compute the probability of $\mathbf{c}$ by integrating over $\theta$. This is essentially the strategy that is taken in deriving infinite mixture models from the Dirichlet process (Antoniak, 1974; Ferguson, 1983; Ishwaran & James, 2001; Sethuraman, 1994). Instead, we will work directly with the distribution over assignment vectors given in Equation 10, considering its limit as the number of classes approaches infinity (cf. Green & Richardson, 2001; Neal, 1992, 2000).

Expanding the gamma functions in Equation 10 using the recursion $\Gamma(x) = (x-1)\Gamma(x-1)$ and cancelling terms produces the following expression for the probability of an assignment vector $\mathbf{c}$:

$$P(\mathbf{c}) = \left(\frac{\alpha}{K}\right)^{K_+} \left(\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K}\right)\right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}, \qquad (12)$$

where $K_+$ is the number of classes for which $m_k > 0$, and we have re-ordered the indices such that $m_k > 0$ for all $k \le K_+$. There are $K^N$ possible values for $\mathbf{c}$, which diverges as $K \to \infty$. As this happens, the probability of any single set of class assignments goes to 0. Since $K_+ \le N$ and $N$ is finite, it is clear that $P(\mathbf{c}) \to 0$ as $K \to \infty$, since $\frac{1}{K^{K_+}} \to 0$. Consequently, we will define a distribution over equivalence classes of assignment vectors, rather than the vectors themselves.

Specifically, we will define a distribution on partitions of objects. In our setting, a partition is a division of the set of $N$ objects into subsets, where each object belongs to a single subset and the ordering of the subsets does not matter. Two assignment vectors that result in the same division of objects correspond to the same partition. For example, if we had three objects, the class assignments $\{c_1, c_2, c_3\} = \{1, 1, 2\}$ would correspond to the same partition as $\{2, 2, 1\}$, since all that differs between these two cases is the labels of the classes. A partition thus defines an equivalence class of assignment vectors, which we denote $[\mathbf{c}]$, with two assignment vectors belonging to the same equivalence class if they correspond to the same partition. A distribution over partitions is sufficient to allow us to define an infinite mixture model, since these equivalence classes of class assignments are the same as those induced by identifiability: $p(X \mid \mathbf{c})$ is the same for all assignment vectors $\mathbf{c}$ that correspond to the same partition, so we can apply statistical inference at the level of partitions rather than the level of assignment vectors.

Assume we have a partition of $N$ objects into $K_+$ subsets, and we have $K = K_0 + K_+$ class labels that can be applied to those subsets. Then there are $\frac{K!}{K_0!}$ assignment vectors $\mathbf{c}$ that belong to the equivalence class defined by that partition, $[\mathbf{c}]$. We can define a probability distribution over partitions by summing over all class assignments that belong to the equivalence class defined by each partition. The probability of each of those class assignments is equal under the distribution specified by Equation 12, so we obtain

$$P([\mathbf{c}]) = \sum_{\mathbf{c} \in [\mathbf{c}]} P(\mathbf{c}) \qquad (13)$$
$$= \frac{K!}{K_0!} \left(\frac{\alpha}{K}\right)^{K_+} \left(\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K}\right)\right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}. \qquad (14)$$

Rearranging the first two terms, we can compute the limit of the probability of a partition as $K \to \infty$, which is

$$\lim_{K \to \infty} \alpha^{K_+} \cdot \frac{K!}{K_0!\, K^{K_+}} \cdot \left(\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K}\right)\right) \cdot \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} = \alpha^{K_+} \left(\prod_{k=1}^{K_+} (m_k - 1)!\right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}. \qquad (15)$$
Figure 2: A partition induced by the Chinese restaurant process. Numbers indicate customers (objects), circles indicate tables (classes).

The details of the steps taken in computing this limit are given in the Appendix. These limiting probabilities define a valid distribution over partitions, and thus over equivalence classes of class assignments, providing a prior over class assignments for an infinite mixture model. Objects are exchangeable under this distribution, just as in the finite case: the probability of a partition is not affected by the ordering of the objects, since it depends only on the counts $m_k$.

As noted above, the distribution over partitions specified by Equation 15 can be derived in a variety of ways – by taking limits (Green & Richardson, 2001; Neal, 1992, 2000), from the Dirichlet process (Blackwell & MacQueen, 1973), or from other equivalent stochastic processes (Ishwaran & James, 2001; Sethuraman, 1994). We will briefly discuss a simple process that produces the same distribution over partitions: the Chinese restaurant process.

2.3 The Chinese restaurant process

The Chinese restaurant process (CRP) was named by Jim Pitman and Lester Dubins, based upon a metaphor in which the objects are customers in a restaurant, and the classes are the tables at which they sit (the process first appears in Aldous, 1985, where it is attributed to Pitman).[2] Imagine a restaurant with an infinite number of tables, each with an infinite number of seats. The customers enter the restaurant one after another, and each chooses a table at random. In the CRP with parameter $\alpha$, each customer chooses an occupied table with probability proportional to the number of occupants, and chooses the next vacant table with probability proportional to $\alpha$. For example, Figure 2 shows the state of a restaurant after 10 customers have chosen tables using this procedure. The first customer chooses the first table with probability $\frac{\alpha}{\alpha} = 1$. The second customer chooses the first table with probability $\frac{1}{1+\alpha}$, and the second table with probability $\frac{\alpha}{1+\alpha}$. After the second customer chooses the second table, the third customer chooses the first table with probability $\frac{1}{2+\alpha}$, the second table with probability $\frac{1}{2+\alpha}$, and the third table with probability $\frac{\alpha}{2+\alpha}$. This process continues until all customers have seats, defining a distribution over allocations of people to tables, and, more generally, objects to classes. Extensions of the CRP and connections to other stochastic processes are pursued in depth by Pitman (2002).

[Footnote 2: Pitman and Dubins, both statisticians at UC Berkeley, were inspired by the apparently infinite capacity of Chinese restaurants in San Francisco when they named the process.]

The distribution over partitions induced by the CRP is the same as that given in Equation 15. If we assume an ordering on our $N$ objects, then we can assign them to classes sequentially using the method specified by the CRP, letting objects play the role of customers and classes play the role of tables. The $i$th object would be assigned to the $k$th class with probability

$$P(c_i = k \mid c_1, c_2, \ldots, c_{i-1}) = \begin{cases} \dfrac{m_k}{i - 1 + \alpha} & k \le K_+ \\[4pt] \dfrac{\alpha}{i - 1 + \alpha} & k = K_+ + 1 \end{cases} \qquad (16)$$

where $m_k$ is the number of objects currently assigned to class $k$, and $K_+$ is the number of classes for which $m_k > 0$. If all $N$ objects are assigned to classes via this process, the probability of a partition of objects $\mathbf{c}$ is that given in Equation 15.
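The sequential scheme in Equation 16 translates directly into code. The sketch below (our illustration, assuming NumPy; not from the report) draws class assignments for $N$ objects from the CRP with parameter $\alpha$.

```python
import numpy as np

def sample_crp(N, alpha, rng=None):
    """Draw class assignments for N objects from the CRP (Equation 16)."""
    rng = np.random.default_rng(rng)
    assignments = [0]          # first customer sits at the first table
    counts = [1]               # m_k: occupants of each table
    for i in range(1, N):
        # occupied table k with prob m_k/(i+alpha); new table with prob alpha/(i+alpha)
        probs = np.append(counts, alpha) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)   # the next vacant table is opened
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

print(sample_crp(10, alpha=1.0, rng=0))
```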
The CRP thus provides an intuitive means of specifying a prior for infinite mixture models, as well as revealing that there is a simple sequential process by which exchangeable class assignments can be generated.

2.4 Inference by Gibbs sampling

Inference in an infinite mixture model is only slightly more complicated than inference in a mixture model with a finite, fixed number of classes. The standard algorithm used for inference in infinite mixture models is Gibbs sampling (Escobar & West, 1995; Neal, 2000). Gibbs sampling is a Markov chain Monte Carlo (MCMC) method, in which variables are successively sampled from their distributions when conditioned on the current values of all other variables (Geman & Geman, 1984). This process defines a Markov chain, which ultimately converges to the distribution of interest (see Gilks, Richardson, & Spiegelhalter, 1996).

Implementing a Gibbs sampler requires deriving the full conditional distribution for all variables to be sampled. In a mixture model, these variables are the class assignments $\mathbf{c}$. The relevant full conditional distribution is $P(c_i \mid \mathbf{c}_{-i}, X)$, the probability distribution over $c_i$ conditioned on the class assignments of all other objects, $\mathbf{c}_{-i}$, and the data, $X$. By applying Bayes' rule, this distribution can be expressed as

$$P(c_i = k \mid \mathbf{c}_{-i}, X) \propto p(X \mid \mathbf{c})\, P(c_i = k \mid \mathbf{c}_{-i}), \qquad (17)$$

where only the second term on the right hand side depends upon the distribution over class assignments, $P(\mathbf{c})$.

In a finite mixture model with $P(\mathbf{c})$ defined as in Equation 10, we can compute $P(c_i = k \mid \mathbf{c}_{-i})$ by integrating over $\theta$, obtaining

$$P(c_i = k \mid \mathbf{c}_{-i}) = \int P(c_i = k \mid \theta)\, p(\theta \mid \mathbf{c}_{-i})\, d\theta = \frac{m_{-i,k} + \frac{\alpha}{K}}{N - 1 + \alpha}, \qquad (18)$$

where $m_{-i,k}$ is the number of objects assigned to class $k$, not including object $i$. This is the posterior predictive distribution for a multinomial distribution with a Dirichlet prior.

In an infinite mixture model with a distribution over class assignments defined as in Equation 15, we can use exchangeability to find the full conditional distribution. Since it is exchangeable, $P([\mathbf{c}])$ is unaffected by the ordering of objects. Thus, we can choose an ordering in which the $i$th object is the last to be assigned to a class. It follows directly from the definition of the Chinese restaurant process that

$$P(c_i = k \mid \mathbf{c}_{-i}) = \begin{cases} \dfrac{m_{-i,k}}{N - 1 + \alpha} & m_{-i,k} > 0 \\[4pt] \dfrac{\alpha}{N - 1 + \alpha} & k = K_{-i,+} + 1 \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (19)$$

where $K_{-i,+}$ is the number of classes for which $m_{-i,k} > 0$. The same result can be found by taking the limit of the full conditional distribution in the finite model, given by Equation 18 (Neal, 2000).

When combined with some choice of $p(X \mid \mathbf{c})$, Equations 18 and 19 are sufficient to define Gibbs samplers for finite and infinite mixture models respectively. Demonstrations of Gibbs sampling in infinite mixture models are provided by Neal (2000) and Rasmussen (2000). Similar MCMC algorithms are presented in Bush and MacEachern (1996), West, Muller, and Escobar (1994), Escobar and West (1995) and Ishwaran and James (2001). Algorithms that go beyond the local changes in class assignments allowed by a Gibbs sampler are given by Jain and Neal (2004) and Dahl (2003).
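For concreteness, the prior part of the full conditional in Equation 19 can be written as a small helper; combined with any likelihood $p(X \mid \mathbf{c})$, it yields one step of a Gibbs sweep. This is our sketch (assuming NumPy), not code from the report.

```python
import numpy as np

def crp_conditional(c_minus_i, alpha):
    """P(c_i = k | c_-i) from Equation 19: one probability per occupied
    class, plus a final entry for a new class. c_minus_i lists the other
    N-1 assignments as integers 0..K-1."""
    counts = np.bincount(c_minus_i)      # m_{-i,k}
    occupied = counts[counts > 0]        # classes with at least one member
    denom = len(c_minus_i) + alpha       # N - 1 + alpha
    return np.append(occupied, alpha) / denom
```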
Figure 3: Feature matrices. A binary matrix Z, as shown in (a), can be used as the basis for sparse infinite latent feature models, indicating which features take non-zero values. Elementwise multiplication of Z by a matrix V of continuous values gives a representation like that shown in (b). If V contains discrete values, we obtain a representation like that shown in (c).

2.5 Summary

Our review of infinite mixture models serves three purposes: it shows that infinite statistical models can be defined by specifying priors over infinite combinatorial objects; it illustrates how these priors can be derived by taking the limit of priors for finite models; and it demonstrates that inference in these models can remain possible, despite the large hypothesis spaces they imply. However, infinite mixture models are still fundamentally limited in their representation of objects, assuming that each object can only belong to a single class. In the remainder of the paper, we use the insights underlying infinite mixture models to derive methods for representing objects in terms of infinitely many latent features.

3 Latent feature models

In a latent feature model, each object is represented by a vector of latent feature values $\mathbf{f}_i$, and the properties $\mathbf{x}_i$ are generated from a distribution determined by those latent feature values. Latent feature values can be continuous, as in principal component analysis (PCA; Jolliffe, 1986), or discrete, as in cooperative vector quantization (CVQ; Zemel & Hinton, 1994; Ghahramani, 1995). In the remainder of this section, we will assume that feature values are continuous. Using the matrix $F = [\mathbf{f}_1^T\ \mathbf{f}_2^T\ \cdots\ \mathbf{f}_N^T]^T$ to indicate the latent feature values for all $N$ objects, the model is specified by a prior over features, $p(F)$, and a distribution over observed property matrices conditioned on those features, $p(X \mid F)$. As with latent class models, these distributions can be dealt with separately: $p(F)$ specifies the number of features, their probability, and the distribution over values associated with each feature, while $p(X \mid F)$ determines how these features relate to the properties of objects. Our focus will be on $p(F)$, showing how such a prior can be defined without placing an upper bound on the number of features.

We can break the matrix $F$ into two components: a binary matrix $Z$ indicating which features are possessed by each object, with $z_{ik} = 1$ if object $i$ has feature $k$ and 0 otherwise, and a second matrix $V$ indicating the value of each feature for each object. $F$ can be expressed as the elementwise (Hadamard) product of $Z$ and $V$, $F = Z \otimes V$, as illustrated in Figure 3. In many latent feature models, such as PCA and CVQ, objects have non-zero values on every feature, and every entry of $Z$ is 1. In sparse latent feature models (e.g., sparse PCA; d'Aspremont, Ghaoui, Jordan, & Lanckriet, 2004; Jolliffe & Uddin, 2003; Zou, Hastie, & Tibshirani, in press) only a subset of features take on non-zero values for each object, and $Z$ picks out these subsets.

A prior on $F$ can be defined by specifying priors for $Z$ and $V$ separately, with $p(F) = P(Z)\, p(V)$. We will focus on defining a prior on $Z$, since the effective dimensionality of a latent feature model is determined by $Z$. Assuming that $Z$ is sparse, we can define a prior for infinite latent feature models by defining a distribution over infinite binary matrices.
Our analysis of latent class models provides two desiderata for such a distribution: objects should be exchangeable, and inference should be tractable. It also suggests a method by which these desiderata can be satisfied: start with a model that assumes a finite number of features, and consider the limit as the number of features approaches infinity.

4 A distribution on infinite binary matrices

In this section, we derive a distribution on infinite binary matrices by starting with a simple model that assumes $K$ features, and then taking the limit as $K \to \infty$. The resulting distribution corresponds to a simple generative process, which we term the Indian buffet process.

4.1 A finite feature model

We have $N$ objects and $K$ features, and the possession of feature $k$ by object $i$ is indicated by a binary variable $z_{ik}$. Each object can possess multiple features. The $z_{ik}$ thus form a binary $N \times K$ feature matrix, $Z$. We will assume that each object possesses feature $k$ with probability $\pi_k$, and that the features are generated independently. In contrast to the class models discussed above, for which $\sum_k \theta_k = 1$, the probabilities $\pi_k$ can each take on any value in $[0, 1]$. Under this model, the probability of a matrix $Z$ given $\pi = \{\pi_1, \pi_2, \ldots, \pi_K\}$ is

$$P(Z \mid \pi) = \prod_{k=1}^{K} \prod_{i=1}^{N} P(z_{ik} \mid \pi_k) = \prod_{k=1}^{K} \pi_k^{m_k} (1 - \pi_k)^{N - m_k}, \qquad (20)$$

where $m_k = \sum_{i=1}^{N} z_{ik}$ is the number of objects possessing feature $k$.

We can define a prior on $\pi$ by assuming that each $\pi_k$ follows a beta distribution. The beta distribution has parameters $r$ and $s$, and is conjugate to the binomial. The probability of any $\pi_k$ under the $\text{Beta}(r, s)$ distribution is given by

$$p(\pi_k) = \frac{\pi_k^{r-1}(1 - \pi_k)^{s-1}}{B(r, s)}, \qquad (21)$$

where $B(r, s)$ is the beta function,

$$B(r, s) = \int_0^1 \pi_k^{r-1}(1 - \pi_k)^{s-1}\, d\pi_k \qquad (22)$$
$$= \frac{\Gamma(r)\Gamma(s)}{\Gamma(r + s)}. \qquad (23)$$

We will take $r = \frac{\alpha}{K}$ and $s = 1$, so Equation 23 becomes

$$B\!\left(\tfrac{\alpha}{K}, 1\right) = \frac{\Gamma(\frac{\alpha}{K})}{\Gamma(1 + \frac{\alpha}{K})} = \frac{K}{\alpha}, \qquad (24)$$

exploiting the recursive definition of the gamma function. The probability model we have defined is

$$\pi_k \mid \alpha \sim \text{Beta}\!\left(\tfrac{\alpha}{K}, 1\right), \qquad z_{ik} \mid \pi_k \sim \text{Bernoulli}(\pi_k).$$

Each $z_{ik}$ is independent of all other assignments, conditioned on $\pi_k$, and the $\pi_k$ are generated independently. A graphical model illustrating the dependencies among these variables is shown in Figure 1(b). Having defined a prior on $\pi$, we can simplify this model by integrating over all values of $\pi$ rather than representing them explicitly. The marginal probability of a binary matrix $Z$ is

$$P(Z) = \prod_{k=1}^{K} \int \left(\prod_{i=1}^{N} P(z_{ik} \mid \pi_k)\right) p(\pi_k)\, d\pi_k \qquad (25)$$
$$= \prod_{k=1}^{K} \frac{B(m_k + \frac{\alpha}{K}, N - m_k + 1)}{B(\frac{\alpha}{K}, 1)} \qquad (26)$$
$$= \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\,\Gamma(m_k + \frac{\alpha}{K})\,\Gamma(N - m_k + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})}. \qquad (27)$$

Again, the result follows from conjugacy, this time between the binomial and beta distributions. This distribution is exchangeable, depending only on the counts $m_k$.

This model has the important property that the expectation of the number of non-zero entries in the matrix $Z$, $E[\mathbf{1}^T Z \mathbf{1}] = E[\sum_{ik} z_{ik}]$, has an upper bound for any $K$. Since each column of $Z$ is independent, the expectation is $K$ times the expectation of the sum of a single column, $E[\mathbf{1}^T \mathbf{z}_k]$. This expectation is easily computed,

$$E[\mathbf{1}^T \mathbf{z}_k] = \sum_{i=1}^{N} E(z_{ik}) = \sum_{i=1}^{N} \int_0^1 \pi_k\, p(\pi_k)\, d\pi_k = N \frac{\frac{\alpha}{K}}{1 + \frac{\alpha}{K}}, \qquad (28)$$

where the result follows from the fact that the expectation of a $\text{Beta}(r, s)$ random variable is $\frac{r}{r+s}$. Consequently, $E[\mathbf{1}^T Z \mathbf{1}] = K\, E[\mathbf{1}^T \mathbf{z}_k] = \frac{N\alpha}{1 + \frac{\alpha}{K}}$. For finite $K$, the expectation of the number of entries in $Z$ is bounded above by $N\alpha$.
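Like Equation 10, Equation 27 is a product of beta functions and is conveniently evaluated with log-beta functions. A minimal sketch (ours, assuming SciPy; not from the report):

```python
import numpy as np
from scipy.special import betaln

def log_prob_Z_finite(Z, alpha):
    """Log of Equations 26-27: marginal probability of a binary N x K matrix
    under pi_k ~ Beta(alpha/K, 1), z_ik ~ Bernoulli(pi_k)."""
    N, K = Z.shape
    m = Z.sum(axis=0)   # column sums m_k
    # Each column contributes B(m_k + alpha/K, N - m_k + 1) / B(alpha/K, 1).
    return np.sum(betaln(m + alpha / K, N - m + 1) - betaln(alpha / K, 1))
```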
4.2 Equivalence classes

In order to find the limit of the distribution specified by Equation 27 as $K \to \infty$, we need to define equivalence classes of binary matrices – the analogue of partitions for assignment vectors. Our equivalence classes will be defined with respect to a function on binary matrices, $lof(\cdot)$. This function maps binary matrices to left-ordered binary matrices. $lof(Z)$ is obtained by ordering the columns of the binary matrix $Z$ from left to right by the magnitude of the binary number expressed by that column, taking the first row as the most significant bit. The left-ordering of a binary matrix is shown in Figure 4. In the first row of the left-ordered matrix, the columns for which $z_{1k} = 1$ are grouped at the left. In the second row, the columns for which $z_{2k} = 1$ are grouped at the left of the sets for which $z_{1k} = 1$. This grouping structure persists throughout the matrix.

Figure 4: Binary matrices and the left-ordered form. The binary matrix on the left is transformed into the left-ordered binary matrix on the right by the function $lof(\cdot)$. This left-ordered matrix was generated from the exchangeable Indian buffet process with $\alpha = 10$. Empty columns are omitted from both matrices.

The history of feature $k$ at object $i$ is defined to be $(z_{1k}, \ldots, z_{(i-1)k})$. Where no object is specified, we will use history to refer to the full history of feature $k$, $(z_{1k}, \ldots, z_{Nk})$. We will individuate the histories of features using the decimal equivalent of the binary numbers corresponding to the column entries. For example, at object 3, features can have one of four histories: 0, corresponding to a feature with no previous assignments; 1, being a feature for which $z_{2k} = 1$ but $z_{1k} = 0$; 2, being a feature for which $z_{1k} = 1$ but $z_{2k} = 0$; and 3, being a feature possessed by both previous objects. $K_h$ will denote the number of features possessing the history $h$, with $K_0$ being the number of features for which $m_k = 0$ and $K_+ = \sum_{h=1}^{2^N - 1} K_h$ being the number of features for which $m_k > 0$, so $K = K_0 + K_+$. This method of denoting histories also facilitates the process of placing a binary matrix in left-ordered form, as it is used in the definition of $lof(\cdot)$.

$lof(\cdot)$ is a many-to-one function: many binary matrices reduce to the same left-ordered form, and there is a unique left-ordered form for every binary matrix. We can thus use $lof(\cdot)$ to define a set of equivalence classes. Any two binary matrices $Y$ and $Z$ are $lof$-equivalent if $lof(Y) = lof(Z)$, that is, if $Y$ and $Z$ map to the same left-ordered form. The $lof$-equivalence class of a binary matrix $Z$, denoted $[Z]$, is the set of binary matrices that are $lof$-equivalent to $Z$. $lof$-equivalence classes are preserved through permutation of either the rows or the columns of a matrix, provided the same permutations are applied to the other members of the equivalence class. Performing inference at the level of $lof$-equivalence classes is appropriate in models where feature order is not identifiable, with $p(X \mid F)$ being unaffected by the order of the columns of $F$. Any model in which the probability of $X$ is specified in terms of a linear function of $F$, such as PCA or CVQ, has this property.

We need to evaluate the cardinality of $[Z]$, being the number of matrices that map to the same left-ordered form. The columns of a binary matrix are not guaranteed to be unique: since an object can possess multiple features, it is possible for two features to be possessed by exactly the same set of objects. The number of matrices in $[Z]$ is reduced if $Z$ contains identical columns, since some re-orderings of the columns of $Z$ result in exactly the same matrix. Taking this into account, the cardinality of $[Z]$ is $\binom{K}{K_0\ K_1\ \cdots\ K_{2^N-1}} = \frac{K!}{\prod_{h=0}^{2^N-1} K_h!}$, where $K_h$ is the count of the number of columns with full history $h$.
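The $lof(\cdot)$ map is easy to implement by sorting columns as binary numbers, most significant bit in the first row. A sketch (ours, assuming NumPy):

```python
import numpy as np

def lof(Z):
    """Left-ordered form: sort columns of a binary matrix in decreasing
    order of the binary number they encode (first row = most significant bit)."""
    N = Z.shape[0]
    weights = 2.0 ** np.arange(N - 1, -1, -1)    # row i has weight 2^(N-1-i)
    order = np.argsort(-(weights @ Z), kind="stable")
    return Z[:, order]

Z = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1]])
print(lof(Z))   # columns reordered to encode histories 3, 2, 1, 1
```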
$lof$-equivalence classes play the same role for binary matrices as partitions do for assignment vectors: they collapse together all binary matrices (assignment vectors) that differ only in column ordering (class labels). This relationship can be made precise by examining the $lof$-equivalence classes of binary matrices constructed from assignment vectors. Define the class matrix generated by an assignment vector $\mathbf{c}$ to be a binary matrix $Z$ where $z_{ik} = 1$ if and only if $c_i = k$. It is straightforward to show that the class matrices generated by two assignment vectors that correspond to the same partition belong to the same $lof$-equivalence class, and vice versa.

4.3 Taking the infinite limit

Under the distribution defined by Equation 27, the probability of a particular $lof$-equivalence class of binary matrices, $[Z]$, is

$$P([Z]) = \sum_{Z \in [Z]} P(Z) \qquad (29)$$
$$= \frac{K!}{\prod_{h=0}^{2^N-1} K_h!} \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\,\Gamma(m_k + \frac{\alpha}{K})\,\Gamma(N - m_k + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})}. \qquad (30)$$

In order to take the limit of this expression as $K \to \infty$, we will divide the columns of $Z$ into two subsets, corresponding to the features for which $m_k = 0$ and the features for which $m_k > 0$. Re-ordering the columns such that $m_k > 0$ if $k \le K_+$, and $m_k = 0$ otherwise, we can break the product in Equation 30 into two parts, corresponding to these two subsets. The product thus becomes

$$\prod_{k=1}^{K} \frac{\frac{\alpha}{K}\,\Gamma(m_k + \frac{\alpha}{K})\,\Gamma(N - m_k + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})} = \left(\frac{\frac{\alpha}{K}\,\Gamma(\frac{\alpha}{K})\,\Gamma(N + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})}\right)^{K - K_+} \prod_{k=1}^{K_+} \frac{\frac{\alpha}{K}\,\Gamma(m_k + \frac{\alpha}{K})\,\Gamma(N - m_k + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})} \qquad (31)$$
$$= \left(\frac{\frac{\alpha}{K}\,\Gamma(\frac{\alpha}{K})\,\Gamma(N + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})}\right)^{K} \prod_{k=1}^{K_+} \frac{\Gamma(m_k + \frac{\alpha}{K})\,\Gamma(N - m_k + 1)}{\Gamma(\frac{\alpha}{K})\,\Gamma(N + 1)} \qquad (32)$$
$$= \left(\frac{N!}{\prod_{j=1}^{N}(j + \frac{\alpha}{K})}\right)^{K} \left(\frac{\alpha}{K}\right)^{K_+} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,\prod_{j=1}^{m_k - 1}(j + \frac{\alpha}{K})}{N!}. \qquad (33)$$

Substituting Equation 33 into Equation 30 and rearranging terms, we can compute our limit

$$\lim_{K \to \infty} \frac{\alpha^{K_+}}{\prod_{h=1}^{2^N-1} K_h!} \cdot \frac{K!}{K_0!\, K^{K_+}} \cdot \left(\frac{N!}{\prod_{j=1}^{N}(j + \frac{\alpha}{K})}\right)^{K} \cdot \prod_{k=1}^{K_+} \frac{(N - m_k)!\,\prod_{j=1}^{m_k - 1}(j + \frac{\alpha}{K})}{N!} = \frac{\alpha^{K_+}}{\prod_{h=1}^{2^N-1} K_h!} \exp\{-\alpha H_N\} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}, \qquad (34)$$

where $H_N$ is the $N$th harmonic number, $H_N = \sum_{j=1}^{N} \frac{1}{j}$. The details of the steps taken in computing this limit are given in the Appendix. Again, this distribution is exchangeable: neither the number of identical columns nor the column sums are affected by the ordering on objects.

4.4 The Indian buffet process

The probability distribution defined in Equation 34 can be derived from a simple stochastic process. As with the CRP, this process assumes an ordering on the objects, generating the matrix sequentially using this ordering. We will also use a culinary metaphor in defining our stochastic process, appropriately adjusted for geography. Many Indian restaurants in London offer lunchtime buffets with an apparently infinite number of dishes. We can define a distribution over infinite binary matrices by specifying a procedure by which customers (objects) choose dishes (features).

In our Indian buffet process (IBP), $N$ customers enter a restaurant one after another. Each customer encounters a buffet consisting of infinitely many dishes arranged in a line. The first customer starts at the left of the buffet and takes a serving from each dish, stopping after a $\text{Poisson}(\alpha)$ number of dishes as his plate becomes overburdened. The $i$th customer moves along the buffet, sampling dishes in proportion to their popularity, serving himself with probability $\frac{m_k}{i}$, where $m_k$ is the number of previous customers who have sampled a dish. Having reached the end of all previous sampled dishes, the $i$th customer then tries a $\text{Poisson}(\frac{\alpha}{i})$ number of new dishes.

We can indicate which customers chose which dishes using a binary matrix $Z$ with $N$ rows and infinitely many columns, where $z_{ik} = 1$ if the $i$th customer sampled the $k$th dish. Figure 5 shows a matrix generated using the IBP with $\alpha = 10$. The first customer tried 17 dishes. The second customer tried 7 of those dishes, and then tried 3 new dishes. The third customer tried 3 dishes tried by both previous customers, 5 dishes tried by only the first customer, and 2 new dishes. Vertically concatenating the choices of the customers produces the binary matrix shown in the figure.

Figure 5: A binary matrix generated by the Indian buffet process with $\alpha = 10$.

Using $K_1^{(i)}$ to indicate the number of new dishes sampled by the $i$th customer, the probability of any particular matrix being produced by this process is

$$P(Z) = \frac{\alpha^{K_+}}{\prod_{i=1}^{N} K_1^{(i)}!} \exp\{-\alpha H_N\} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}. \qquad (35)$$
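The generative description above translates directly into a sampler. The following sketch (ours, assuming NumPy; not from the report) draws a binary matrix from the IBP with parameter $\alpha$; columns appear in order of first use, as in Figure 5.

```python
import numpy as np

def sample_ibp(N, alpha, rng=None):
    """Draw a binary feature matrix from the Indian buffet process."""
    rng = np.random.default_rng(rng)
    columns = []                       # one list of row indices per dish
    m = []                             # m_k: customers who sampled dish k
    for i in range(1, N + 1):
        for k in range(len(columns)):  # old dishes, each with prob m_k / i
            if rng.random() < m[k] / i:
                columns[k].append(i - 1)
                m[k] += 1
        for _ in range(rng.poisson(alpha / i)):   # Poisson(alpha/i) new dishes
            columns.append([i - 1])
            m.append(1)
    Z = np.zeros((N, len(columns)), dtype=int)
    for k, rows in enumerate(columns):
        Z[rows, k] = 1
    return Z

Z = sample_ibp(20, alpha=10.0, rng=1)
print(Z.shape, Z.sum(axis=1))   # each row averages about alpha features
```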
As can be seen from Figure 5, the matrices produced by this process are generally not in left-ordered form. However, these matrices are also not ordered arbitrarily, because the Poisson draws always result in choices of new dishes that are to the right of the previously sampled dishes. Customers are not exchangeable under this distribution, as the number of dishes counted as $K_1^{(i)}$ depends upon the order in which the customers make their choices. However, if we only pay attention to the $lof$-equivalence classes of the matrices generated by this process, we obtain the exchangeable distribution $P([Z])$ given by Equation 34: $\frac{\prod_{i=1}^{N} K_1^{(i)}!}{\prod_{h=1}^{2^N-1} K_h!}$ matrices generated via this process map to the same left-ordered form, and $P([Z])$ is obtained by multiplying $P(Z)$ from Equation 35 by this quantity.

It is possible to define a similar sequential process that directly produces a distribution on left-ordered binary matrices in which customers are exchangeable, but this requires more effort on the part of the customers. In the exchangeable Indian buffet process, the first customer samples a $\text{Poisson}(\alpha)$ number of dishes, moving from left to right. The $i$th customer moves along the buffet, and makes a single decision for each set of dishes with the same history. If there are $K_h$ dishes with history $h$, under which $m_h$ previous customers have sampled each of those dishes, then the customer samples a $\text{Binomial}(\frac{m_h}{i}, K_h)$ number of those dishes, starting at the left. Having reached the end of all previous sampled dishes, the $i$th customer then tries a $\text{Poisson}(\frac{\alpha}{i})$ number of new dishes. Attending to the history of the dishes and always sampling from the left guarantees that the resulting matrix is in left-ordered form, and it is easy to show that the matrices produced by this process have the same probability as the corresponding $lof$-equivalence classes under Equation 34.

4.5 A distribution over collections of histories

In Section 4.2, we noted that $lof$-equivalence classes of binary matrices generated from assignment vectors correspond to partitions. Likewise, $lof$-equivalence classes of general binary matrices correspond to simple combinatorial structures: vectors of non-negative integers. Fixing some ordering of $N$ objects, a collection of feature histories on those objects can be represented by a frequency vector $\mathbf{K} = (K_1, \ldots, K_{2^N-1})$, indicating the number of times each history appears in the collection.[3] A collection of feature histories can be translated into a left-ordered binary matrix by horizontally concatenating an appropriate number of copies of the binary vector representing each history into a matrix. A left-ordered binary matrix can be translated into a collection of feature histories by counting the number of times each history appears in that matrix. Since partitions are a subset of all collections of histories – namely those collections in which each object appears in only one history – this process is strictly more general than the CRP.

[Footnote 3: While $\mathbf{K}$ is technically a vector of non-negative integers, it is not a particularly generic example of such a vector, as most of its entries will be 0 or 1.]

This connection between $lof$-equivalence classes of feature matrices and collections of feature histories suggests another means of deriving the distribution specified by Equation 34, operating directly on the frequencies of these histories. We can define a distribution on vectors of non-negative integers $\mathbf{K}$ by assuming that each $K_h$ is generated independently from a Poisson distribution with parameter $\alpha B(m_h, N - m_h + 1) = \alpha \frac{(m_h - 1)!\,(N - m_h)!}{N!}$, where $m_h$ is the number of non-zero elements in the history $h$. This gives

$$P(\mathbf{K}) = \prod_{h=1}^{2^N-1} \frac{\left(\alpha \frac{(m_h - 1)!\,(N - m_h)!}{N!}\right)^{K_h}}{K_h!} \exp\left\{-\alpha \frac{(m_h - 1)!\,(N - m_h)!}{N!}\right\} \qquad (36)$$
$$= \frac{\alpha^{\sum_{h=1}^{2^N-1} K_h}}{\prod_{h=1}^{2^N-1} K_h!} \exp\{-\alpha H_N\} \prod_{h=1}^{2^N-1} \left(\frac{(m_h - 1)!\,(N - m_h)!}{N!}\right)^{K_h}, \qquad (37)$$

which is easily seen to be the same as $P([Z])$ in Equation 34. The harmonic number in the exponential term is obtained by summing $\alpha \frac{(m_h - 1)!\,(N - m_h)!}{N!}$ over all histories $h$. There are $\binom{N}{j}$ histories for which $m_h = j$, so we have

$$\sum_{h=1}^{2^N-1} \frac{(m_h - 1)!\,(N - m_h)!}{N!} = \sum_{j=1}^{N} \binom{N}{j} \frac{(j - 1)!\,(N - j)!}{N!} = \sum_{j=1}^{N} \frac{1}{j} = H_N. \qquad (38)$$
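This construction can be simulated directly: draw each $K_h$ independently from its Poisson distribution. A sketch (ours, assuming NumPy and SciPy) that also confirms numerically that the total $\sum_h K_h$ has mean $\alpha H_N$, anticipating the first property derived in the next section:

```python
import numpy as np
from scipy.special import betaln

def sample_history_counts(N, alpha, rng=None):
    """Draw K_h ~ Poisson(alpha * B(m_h, N - m_h + 1)) for every non-empty
    history h in 1..2^N - 1 (Equation 36)."""
    rng = np.random.default_rng(rng)
    h = np.arange(1, 2 ** N)
    m = np.array([bin(x).count("1") for x in h])   # m_h: ones in history h
    rates = alpha * np.exp(betaln(m, N - m + 1))
    return rng.poisson(rates)

N, alpha = 8, 2.0
totals = [sample_history_counts(N, alpha, rng=s).sum() for s in range(2000)]
H_N = sum(1.0 / j for j in range(1, N + 1))
print(np.mean(totals), alpha * H_N)   # both close to alpha * H_N
```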
4.6 Some properties of this distribution

These different views of the distribution specified by Equation 34 make it straightforward to derive some of its properties. First, the effective dimension of the model, $K_+$, follows a $\text{Poisson}(\alpha H_N)$ distribution. This is most easily shown using the generative process described in Section 4.5: $K_+ = \sum_{h=1}^{2^N-1} K_h$, and under this process is thus the sum of a set of Poisson distributions. The sum of a set of Poisson distributions is a Poisson distribution with parameter equal to the sum of the parameters of its components. Using Equation 38, this is $\alpha H_N$.

A second property of this distribution is that the number of features possessed by each object follows a $\text{Poisson}(\alpha)$ distribution. This follows from the definition of the exchangeable IBP. The first customer chooses a $\text{Poisson}(\alpha)$ number of dishes. By exchangeability, all other customers must also choose a $\text{Poisson}(\alpha)$ number of dishes, since we can always specify an ordering on customers which begins with a particular customer.

Finally, it is possible to show that $Z$ remains sparse as $K \to \infty$. The simplest way to do this is to exploit the previous result: if the number of features possessed by each object follows a $\text{Poisson}(\alpha)$ distribution, then the expected number of entries in $Z$ is $N\alpha$. This is consistent with the quantity obtained by taking the limit of this expectation in the finite model, which is given in Equation 28: $\lim_{K \to \infty} E[\mathbf{1}^T Z \mathbf{1}] = \lim_{K \to \infty} \frac{N\alpha}{1 + \frac{\alpha}{K}} = N\alpha$. More generally, we can use the property of sums of Poisson random variables described above to show that $\mathbf{1}^T Z \mathbf{1}$ will follow a $\text{Poisson}(N\alpha)$ distribution. Consequently, the probability of values higher than the mean decreases exponentially.

4.7 Inference by Gibbs sampling

We have defined a distribution over infinite binary matrices that satisfies one of our desiderata – objects (the rows of the matrix) are exchangeable under this distribution. It remains to be shown that inference in infinite latent feature models is tractable, as was the case for infinite mixture models. We will derive a Gibbs sampler for latent feature models in which the exchangeable IBP is used as a prior. The critical quantity needed to define the sampling algorithm is the full conditional distribution

$$P(z_{ik} = 1 \mid Z_{-(ik)}, X) \propto p(X \mid Z)\, P(z_{ik} = 1 \mid Z_{-(ik)}), \qquad (39)$$

where $Z_{-(ik)}$ denotes the entries of $Z$ other than $z_{ik}$, and we are leaving aside the issue of the feature values $V$ for the moment. The prior on $Z$ contributes to this probability by specifying $P(z_{ik} = 1 \mid Z_{-(ik)})$.

In the finite model, where $P(Z)$ is given by Equation 27, it is straightforward to compute the full conditional distribution for any $z_{ik}$. Integrating over $\pi_k$ gives

$$P(z_{ik} = 1 \mid \mathbf{z}_{-i,k}) = \int_0^1 P(z_{ik} \mid \pi_k)\, p(\pi_k \mid \mathbf{z}_{-i,k})\, d\pi_k = \frac{m_{-i,k} + \frac{\alpha}{K}}{N + \frac{\alpha}{K}}, \qquad (40)$$

where $\mathbf{z}_{-i,k}$ is the set of assignments of other objects, not including $i$, for feature $k$, and $m_{-i,k}$ is the number of objects possessing feature $k$, not including $i$. We need only condition on $\mathbf{z}_{-i,k}$ rather than $Z_{-(ik)}$ because the columns of the matrix are generated independently under this prior.

In the infinite case, we can derive the conditional distribution from the exchangeable IBP. Choosing an ordering on objects such that the $i$th object corresponds to the last customer to visit the buffet, we obtain

$$P(z_{ik} = 1 \mid \mathbf{z}_{-i,k}) = \frac{m_{-i,k}}{N}, \qquad (41)$$

for any $k$ such that $m_{-i,k} > 0$. The same result can be obtained by taking the limit of Equation 40 as $K \to \infty$. Similarly, the number of new features associated with object $i$ should be drawn from a $\text{Poisson}(\frac{\alpha}{N})$ distribution. This can also be derived from Equation 40, using the same kind of limiting argument as that presented above to obtain the terms of the Poisson.
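Equations 39-41, together with the $\text{Poisson}(\frac{\alpha}{N})$ draw for new features, give the prior side of a Gibbs sweep. A schematic sketch follows (ours, not from the report; `log_likelihood` is a placeholder for whatever $p(X \mid Z)$ the model supplies, e.g. Equation 58 below, and the handling of singleton and new columns is deliberately omitted):

```python
import numpy as np

def gibbs_sweep_prior_side(Z, alpha, log_likelihood, rng=None):
    """One Gibbs sweep over existing entries of Z using Equation 41.
    log_likelihood(Z) -> float must be supplied by the model."""
    rng = np.random.default_rng(rng)
    N = Z.shape[0]
    for i in range(N):
        for k in range(Z.shape[1]):
            m = Z[:, k].sum() - Z[i, k]        # m_{-i,k}
            if m == 0:
                continue                       # singleton columns handled separately
            log_odds = np.log(m / (N - m))     # prior odds m/(N - m) from Eq. 41
            Z[i, k] = 1
            ll1 = log_likelihood(Z)
            Z[i, k] = 0
            ll0 = log_likelihood(Z)
            p1 = 1.0 / (1.0 + np.exp(-(log_odds + ll1 - ll0)))
            Z[i, k] = int(rng.random() < p1)
        # New features for object i have prior Poisson(alpha / N), combined
        # with the likelihood by truncation as described in the text (omitted).
    return Z
```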
5 A latent feature model with binary features

We have derived a prior for infinite sparse binary matrices, and indicated how statistical inference can be done in models defined using this prior. In this section, we will show how this prior can be put to use in models for unsupervised learning, illustrating some of the issues that can arise in this process. We will describe a simple linear-Gaussian latent feature model, in which the features are binary. As above, we will start with a finite model and then consider the infinite limit.

Figure 6: Graphical model for the linear-Gaussian model with binary features.

5.1 A finite linear-Gaussian model

In our finite model, the $D$-dimensional vector of properties of an object $i$, $\mathbf{x}_i$, is generated from a Gaussian distribution with mean $\mathbf{z}_i A$ and covariance matrix $\Sigma_X = \sigma_X^2 I$, where $\mathbf{z}_i$ is a $K$-dimensional binary vector, and $A$ is a $K \times D$ matrix of weights. In matrix notation, $E[X] = ZA$. If $Z$ is a feature matrix, this is a form of binary factor analysis. The distribution of $X$ given $Z$, $A$, and $\sigma_X$ is matrix Gaussian:

$$p(X \mid Z, A, \sigma_X) = \frac{1}{(2\pi\sigma_X^2)^{ND/2}} \exp\left\{-\frac{1}{2\sigma_X^2} \mathrm{tr}\!\left((X - ZA)^T (X - ZA)\right)\right\}, \qquad (42)$$

where $\mathrm{tr}(\cdot)$ is the trace of a matrix. This makes it easy to integrate out the model parameters $A$. To do so, we need to define a prior on $A$, which we also take to be matrix Gaussian:

$$p(A \mid \sigma_A) = \frac{1}{(2\pi\sigma_A^2)^{KD/2}} \exp\left\{-\frac{1}{2\sigma_A^2} \mathrm{tr}(A^T A)\right\}, \qquad (43)$$

where $\sigma_A$ is a parameter setting the diffuseness of the prior. The dependencies among the variables in this model are shown in Figure 6.

Combining Equations 42 and 43 results in an exponentiated expression involving the trace of

$$\frac{1}{\sigma_X^2}(X - ZA)^T(X - ZA) + \frac{1}{\sigma_A^2}A^T A = \frac{1}{\sigma_X^2}X^T X - \frac{1}{\sigma_X^2}X^T ZA - \frac{1}{\sigma_X^2}A^T Z^T X + A^T\!\left(\frac{1}{\sigma_X^2}Z^T Z + \frac{1}{\sigma_A^2}I\right)\! A \qquad (44)$$
$$= \frac{1}{\sigma_X^2}\left(X^T(I - ZMZ^T)X\right) + (MZ^T X - A)^T (\sigma_X^2 M)^{-1} (MZ^T X - A), \qquad (45)$$

where $I$ is the identity matrix, $M = (Z^T Z + \frac{\sigma_X^2}{\sigma_A^2} I)^{-1}$, and the last line is obtained by completing the square for the quadratic term in $A$ in the second line. We can then integrate out $A$ to obtain

$$p(X \mid Z, \sigma_X, \sigma_A) = \int p(X \mid Z, A, \sigma_X)\, p(A \mid \sigma_A)\, dA \qquad (46)$$
$$= \frac{1}{(2\pi)^{(N+K)D/2}\sigma_X^{ND}\sigma_A^{KD}} \exp\left\{-\frac{1}{2\sigma_X^2}\mathrm{tr}\!\left(X^T(I - ZMZ^T)X\right)\right\} \int \exp\left\{-\frac{1}{2}\mathrm{tr}\!\left((MZ^T X - A)^T (\sigma_X^2 M)^{-1} (MZ^T X - A)\right)\right\} dA \qquad (47)$$
$$= \frac{|2\pi\sigma_X^2 M|^{D/2}}{(2\pi)^{ND/2}\sigma_X^{ND}\sigma_A^{KD}} \exp\left\{-\frac{1}{2\sigma_X^2}\mathrm{tr}\!\left(X^T(I - ZMZ^T)X\right)\right\} \qquad (48)$$
$$= \frac{1}{(2\pi)^{ND/2}\sigma_X^{(N-K)D}\sigma_A^{KD}\left|Z^T Z + \frac{\sigma_X^2}{\sigma_A^2}I\right|^{D/2}} \exp\left\{-\frac{1}{2\sigma_X^2}\mathrm{tr}\!\left(X^T\!\left(I - Z\!\left(Z^T Z + \frac{\sigma_X^2}{\sigma_A^2}I\right)^{-1}\! Z^T\right)\! X\right)\right\}. \qquad (49)$$

This result is intuitive: the exponentiated term is the difference between the inner product matrix of the raw values of $X$ and their projections onto the space spanned by $Z$, regularized to an extent determined by the ratio of the variance of the noise in $X$ to the variance of the prior on $A$.

We can use this derivation of $p(X \mid Z, \sigma_X, \sigma_A)$ to infer $Z$ from a set of observations $X$, provided we have a prior on $Z$. The finite feature model discussed as a prelude to the IBP is such a prior. The full conditional distribution for $z_{ik}$ is given by:

$$P(z_{ik} \mid X, Z_{-(i,k)}, \sigma_X, \sigma_A) \propto p(X \mid Z, \sigma_X, \sigma_A)\, P(z_{ik} \mid \mathbf{z}_{-i,k}). \qquad (50)$$

While evaluating $p(X \mid Z, \sigma_X, \sigma_A)$ always involves matrix multiplication, it need not always involve a matrix inverse. $Z^T Z$ can be rewritten as $\sum_i \mathbf{z}_i^T \mathbf{z}_i$, allowing us to use rank one updates to efficiently compute the inverse when only one $\mathbf{z}_i$ is modified. Defining $M_{-i} = (\sum_{j \ne i} \mathbf{z}_j^T \mathbf{z}_j + \frac{\sigma_X^2}{\sigma_A^2} I)^{-1}$, we have

$$M_{-i} = (M^{-1} - \mathbf{z}_i^T \mathbf{z}_i)^{-1} = M - \frac{M \mathbf{z}_i^T \mathbf{z}_i M}{\mathbf{z}_i M \mathbf{z}_i^T - 1} \qquad (51, 52)$$
$$M = (M_{-i}^{-1} + \mathbf{z}_i^T \mathbf{z}_i)^{-1} = M_{-i} - \frac{M_{-i} \mathbf{z}_i^T \mathbf{z}_i M_{-i}}{\mathbf{z}_i M_{-i} \mathbf{z}_i^T + 1}. \qquad (53, 54)$$

Iteratively applying these updates allows $p(X \mid Z, \sigma_X, \sigma_A)$ to be computed via Equation 49 for different values of $z_{ik}$ without requiring an excessive number of inverses, although a full rank update should be made occasionally to avoid accumulating numerical errors. The second part of Equation 50, $P(z_{ik} \mid \mathbf{z}_{-i,k})$, can be evaluated using Equation 40.
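Equation 49 amounts to a few lines of linear algebra; we sketch it below (our code, assuming NumPy) using `slogdet` for numerical stability. Equation 58 below is the same computation applied to the non-zero columns $Z_+$.

```python
import numpy as np

def log_likelihood_collapsed(X, Z, sigma_x, sigma_a):
    """Log of Equation 49: p(X | Z, sigma_X, sigma_A) with A integrated out."""
    N, D = X.shape
    K = Z.shape[1]
    ratio = sigma_x ** 2 / sigma_a ** 2
    M_inv = Z.T @ Z + ratio * np.eye(K)          # this is M^{-1} in the text
    _, logdet = np.linalg.slogdet(M_inv)
    # X^T (I - Z M Z^T) X, with M Z^T computed by a linear solve
    quad = X.T @ (np.eye(N) - Z @ np.linalg.solve(M_inv, Z.T)) @ X
    return (-N * D / 2 * np.log(2 * np.pi)
            - (N - K) * D * np.log(sigma_x)
            - K * D * np.log(sigma_a)
            - D / 2 * logdet
            - np.trace(quad) / (2 * sigma_x ** 2))
```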
5.2 Taking the infinite limit

To make sure that we can define an infinite version of this model, we need to check that $p(X \mid Z, \sigma_X, \sigma_A)$ remains well-defined if $Z$ has an unbounded number of columns. $Z$ appears in two places in Equation 49: in $|Z^T Z + \frac{\sigma_X^2}{\sigma_A^2}I|$ and in $Z(Z^T Z + \frac{\sigma_X^2}{\sigma_A^2}I)^{-1}Z^T$. We will examine how these behave as $K \to \infty$.

If $Z$ is in left-ordered form, we can write it as $[Z_+\ Z_0]$, where $Z_+$ consists of $K_+$ columns with sums $m_k > 0$, and $Z_0$ consists of $K_0$ columns with sums $m_k = 0$. It follows that the first of the two expressions we are concerned with reduces to

$$\left|Z^T Z + \frac{\sigma_X^2}{\sigma_A^2}I_K\right| = \left|\begin{bmatrix} Z_+^T Z_+ & 0 \\ 0 & 0 \end{bmatrix} + \frac{\sigma_X^2}{\sigma_A^2}I_K\right| \qquad (55)$$
$$= \left(\frac{\sigma_X^2}{\sigma_A^2}\right)^{K_0} \left|Z_+^T Z_+ + \frac{\sigma_X^2}{\sigma_A^2}I_{K_+}\right|. \qquad (56)$$

The appearance of $K_0$ in this expression is not a problem, as we will see shortly. The abundance of zeros in $Z$ leads to a direct reduction of the second expression to

$$Z\left(Z^T Z + \frac{\sigma_X^2}{\sigma_A^2}I\right)^{-1} Z^T = Z_+\left(Z_+^T Z_+ + \frac{\sigma_X^2}{\sigma_A^2}I_{K_+}\right)^{-1} Z_+^T, \qquad (57)$$

which only uses the finite portion of $Z$. Combining these results yields the likelihood for the infinite model

$$p(X \mid Z, \sigma_X, \sigma_A) = \frac{1}{(2\pi)^{ND/2}\sigma_X^{(N-K_+)D}\sigma_A^{K_+ D}\left|Z_+^T Z_+ + \frac{\sigma_X^2}{\sigma_A^2}I_{K_+}\right|^{D/2}} \exp\left\{-\frac{1}{2\sigma_X^2}\mathrm{tr}\!\left(X^T\!\left(I - Z_+\!\left(Z_+^T Z_+ + \frac{\sigma_X^2}{\sigma_A^2}I_{K_+}\right)^{-1}\! Z_+^T\right)\! X\right)\right\}. \qquad (58)$$

The $K_+$ in the exponents of $\sigma_A$ and $\sigma_X$ appears as a result of introducing $D/2$ multiples of the factor of $(\frac{\sigma_X^2}{\sigma_A^2})^{K_0}$ from Equation 56. The likelihood for the infinite model is thus just the likelihood for the finite model defined on the first $K_+$ columns of $Z$.

The Gibbs sampler for this model is now straightforward. Assignments to classes for which $m_{-i,k} > 0$ are drawn in the same way as for the finite model, via Equation 50, using Equation 58 to obtain $p(X \mid Z, \sigma_X, \sigma_A)$ and Equation 41 for $P(z_{ik} \mid \mathbf{z}_{-i,k})$. As in the finite case, Equations 52 and 54 can be used to compute inverses efficiently. The distribution over the number of new features can be approximated by truncation, computing probabilities for a range of values of $K_1^{(i)}$ up to some reasonable upper bound. For each value, $p(X \mid Z, \sigma_X, \sigma_A)$ can be computed from Equation 58, and the prior on the number of new classes is $\text{Poisson}(\frac{\alpha}{N})$.

5.3 Demonstration

We applied the Gibbs sampler for the infinite binary linear-Gaussian model to a simulated dataset, $X$, consisting of 100 6×6 images. Each image, $\mathbf{x}_i$, was represented as a 36-dimensional vector of pixel intensity values. The images were generated from a representation with four latent features, corresponding to the image elements shown in Figure 7(a). These image elements correspond to the rows of the matrix $A$ in the model introduced in Section 5.1, specifying the pixel intensity values associated with each binary feature. The non-zero elements of $A$ were set to 1.0, and are indicated with white pixels in the figure. A feature vector, $\mathbf{z}_i$, for each image was sampled from a distribution under which each feature was present with probability 0.5. Each image was then generated from a Gaussian distribution with mean $\mathbf{z}_i A$ and covariance $\sigma_X^2 I$, where $\sigma_X = 0.5$. Some of these images are shown in Figure 7(b), together with the feature vectors, $\mathbf{z}_i$, that were used to generate them.

The Gibbs sampler was initialized with $K_+ = 1$, choosing the feature assignments for the first column by setting $z_{i1} = 1$ with probability 0.5. $\sigma_A$, $\sigma_X$, and $\alpha$ were initially set to 1.0 and then sampled by adding Metropolis steps to the MCMC algorithm (see Gilks et al., 1996). Figure 7 shows trace plots for the first 1000 iterations of MCMC for the log joint probability of the data and the latent features, $\log p(X, Z)$, the number of features used by at least one object, $K_+$, and the model parameters $\sigma_A$, $\sigma_X$, and $\alpha$.

Figure 7: Stimuli and results for the demonstration of the infinite binary linear-Gaussian model. (a) Image elements corresponding to the four latent features used to generate the data. (b) Sample images from the dataset. (c) Image elements corresponding to the four features possessed by the most objects in the 1000th iteration of MCMC. (d) Reconstructions of the images in (b) using the output of the algorithm. The lower portion of the figure shows trace plots for the MCMC simulation, which are described in more detail in the text.
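For reference, a dataset of this kind can be generated in a few lines. The sketch below is ours, assuming NumPy; the specific image elements are placeholders standing in for the four shapes in Figure 7(a), not the exact ones used in the report.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K, sigma_x = 100, 36, 4, 0.5

# Four latent "image elements": rows of A, each a 6x6 image flattened to 36 pixels.
A = np.zeros((K, D))
A[0, [0, 1, 6, 7]] = 1.0          # placeholder 2x2 blocks in different
A[1, [4, 5, 10, 11]] = 1.0        # corners of the 6x6 grid; the report
A[2, [24, 25, 30, 31]] = 1.0      # uses similar small shapes
A[3, [28, 29, 34, 35]] = 1.0

Z = (rng.random((N, K)) < 0.5).astype(int)     # each feature present w.p. 0.5
X = Z @ A + sigma_x * rng.standard_normal((N, D))
```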
The algorithm reached relatively stable values for all of these quantities after approximately 100 iterations, and our remaining analyses will use only samples taken from that point forward.

The latent feature representation discovered by the model was extremely consistent with that used to generate the data. Figure 8(a) shows the distribution over $K_+$ computed from the samples. While the mode of the distribution is around six, samples tended to include four features that were used by a large number of objects, and then a few features used by only one or two objects. Figure 8(b) shows the mean frequency with which objects tended to possess the different features, ordering features by these frequencies in each sample. The first four features averaged around 40 objects, while the remainder averaged less than five. Figure 8(c) shows the distribution of the number of features possessed by each object. Most objects had one or two features, but no objects had more than six. The model thus tended to use a latent feature representation dominated by four features, consistent with the representation used to generate the data. Figure 8(d) and (e) show the same quantities for the feature matrix that was actually used to generate the data, illustrating the close correspondence between the posterior distribution and the true representation.

The posterior mean of the feature weights, $A$, given $X$ and $Z$ is

$$E[A \mid X, Z] = \left(Z^T Z + \frac{\sigma_X^2}{\sigma_A^2}I\right)^{-1} Z^T X. \qquad (59)$$

Figure 7(c) shows the posterior mean of $\mathbf{a}_k$ for the four most frequent features in the 1000th sample produced by the algorithm, ordered to match the features shown in Figure 7(a). These features pick out the image elements used in generating the data. Figure 7(d) shows the feature vectors $\mathbf{z}_i$ from this sample for the four images in Figure 7(b), together with the posterior means of the reconstructions of these images for this sample, $E[\mathbf{z}_i A \mid X, Z]$. Similar reconstructions are obtained by averaging over all values of $Z$ produced by the Markov chain. The reconstructions provided by the model clearly pick out the relevant features, despite the high level of noise in the original images.

Figure 8: Statistics derived from the MCMC simulation, compared with the representation used to generate the data. (a) Posterior distribution over $K_+$, the number of features possessed by at least one object. (b) Mean frequencies with which objects were assigned to features, ordered from highest to lowest in each sample. (c) Distribution over number of features possessed by each object. (d)-(e) show the same statistics as (b)-(c), but computed from the representation that was actually used to generate the data.
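Given a sample of $Z$, the posterior mean in Equation 59 and the corresponding reconstructions reduce to one linear solve. A sketch (ours, assuming NumPy and the `X`, `Z`, `sigma_x`, `sigma_a` of the model above):

```python
import numpy as np

def posterior_mean_A(X, Z, sigma_x, sigma_a):
    """Equation 59: E[A | X, Z] = (Z^T Z + (sigma_X^2/sigma_A^2) I)^{-1} Z^T X."""
    K = Z.shape[1]
    ratio = sigma_x ** 2 / sigma_a ** 2
    return np.linalg.solve(Z.T @ Z + ratio * np.eye(K), Z.T @ X)

# Reconstructions E[z_i A | X, Z] for all objects at once:
# X_hat = Z @ posterior_mean_A(X, Z, sigma_x, sigma_a)
```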
6 Conclusions and future work

We have shown that the methods that have been used to define infinite latent class models can be extended to models in which objects are represented in terms of a set of latent features, deriving a distribution on infinite binary matrices that can be used as a prior for such models. While we derived this prior as the infinite limit of a simple distribution on finite binary matrices, we have shown that the same distribution can be specified in terms of a simple stochastic process – the Indian buffet process. This distribution satisfies our two desiderata for a prior for infinite latent feature models: objects are exchangeable, and inference remains tractable.

There are a number of directions in which this work can be extended. First, while we have focussed on the distribution over the binary matrix $Z$ indicating the features possessed by different objects, our intent is that this be combined with a prior over feature values $V$ to define richer infinite latent feature models, as discussed in Section 3. We anticipate that MCMC algorithms similar to that described above in Section 4.7 can be applied in such models, and have developed such an algorithm for a simple model using discrete feature values – an infinite version of cooperative vector quantization (Zemel & Hinton, 1994). However, introducing feature values into the model raises some significant technical issues: in models where feature values have to be represented explicitly, and the structure of the model does not permit the use of conjugate priors, care has to be taken to ensure that posterior distributions remain proper and inference algorithms are well-defined. Similar issues arise in infinite mixture models, and are discussed by Neal (2000).

A second direction in which this work can be extended is in considering other models in which such priors can be used. In particular, infinite latent feature models in which the relationship between data and features is non-linear may be useful in manifold-learning problems. There are also a number of applications beyond infinite latent feature models in which distributions on binary matrices with $N$ rows and infinitely many columns are useful. For example, such matrices can be used to represent the relations that hold between two classes of entities, where one class contains a known number of entities, and the other class contains an unknown number. Cases like this arise in causal learning, where the dependencies among a fixed set of observable variables might be explained by the relationships between those variables and an unknown number of hidden causes. The distribution defined in this paper can be used as a prior over graph structures for causal learning problems of this kind.

Large-scale applications of these models will require developing more sophisticated inference algorithms. The Gibbs sampling algorithms discussed in this paper rely on local changes to class or feature assignments to move through the space of representations. These methods are slow to converge on large problems, and tend to get stuck in local maxima of the posterior distribution – while the sampler used in our demonstration in Section 5.3 stabilized rapidly, it only explored one of the modes of the posterior, never switching the order of the features in $Z$. Inference in infinite mixture models can be improved by supplementing the local changes produced by the Gibbs sampler with a Metropolis-Hastings step that occasionally produces global changes (Dahl, 2003; Jain & Neal, 2004). Similar algorithms may be beneficial for inference in infinite latent feature models.

Finally, there is the question of whether the methods used in this paper can lead to other priors on infinite combinatorial structures. One obvious extension of the current work is to explore distributions on infinite binary matrices produced by making different assumptions about the generation of $\pi_k$, such as a two-parameter model in which $\pi_k$ is generated from a $\text{Beta}(\frac{\alpha\beta}{K}, \beta)$ distribution. However, there are a range of other possibilities. Our success in transferring the strategy of taking the limit of a finite model from latent classes to latent features suggests that the same strategy might be fruitfully applied with other representations, broadening the kinds of latent structure that can be recovered through unsupervised learning.

References

Aldous, D. (1985). Exchangeability and related topics. In École d'été de probabilités de Saint-Flour, XIII—1983 (pp. 1-198). Berlin: Springer.

Antoniak, C. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2, 1152-1174.

Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York: Wiley.

Blackwell, D., & MacQueen, J. (1973). Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1, 353-355.

Blei, D., Griffiths, T., Jordan, M., & Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Bush, C. A., & MacEachern, S. N. (1996). A semi-parametric Bayesian model for randomized block designs. Biometrika, 83, 275-286.
Dahl, D. B. (2003). An improved merge-split sampler for conjugate Dirichlet process mixture models (Tech. Rep. No. 1086). Department of Statistics, University of Wisconsin.

d'Aspremont, A., Ghaoui, L. E., Jordan, M. I., & Lanckriet, G. R. G. (2004). A direct formulation for sparse PCA using semidefinite programming (Tech. Rep. No. UCB/CSD-04-1330). Computer Science Division, University of California, Berkeley.

Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577-588.

Ferguson, T. S. (1983). Bayesian density estimation by mixtures of normal distributions. In M. Rizvi, J. Rustagi, & D. Siegmund (Eds.), Recent advances in statistics (pp. 287-302). New York: Academic Press.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Ghahramani, Z. (1995). Factorial learning and the EM algorithm. In Advances in Neural Information Processing Systems 7. San Francisco, CA: Morgan Kaufmann.

Gilks, W., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain Monte Carlo in practice. Suffolk: Chapman and Hall.

Green, P., & Richardson, S. (2001). Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, 28, 355-377.

Ishwaran, H., & James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96, 1316-1332.

Jain, S., & Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13, 158-182.

Jolliffe, I. T. (1986). Principal component analysis. New York: Springer.

Jolliffe, I. T., & Uddin, M. (2003). A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12, 531-547.

Neal, R. M. (1992). Bayesian mixture modeling. In Maximum Entropy and Bayesian Methods: Proceedings of the 11th International Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis (pp. 197-211). Dordrecht: Kluwer.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9, 249-265.

Pitman, J. (2002). Combinatorial stochastic processes. (Notes for Saint Flour Summer School.)

Rasmussen, C. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems 12. Cambridge, MA: MIT Press.

Rasmussen, C. E., & Ghahramani, Z. (2001). Occam's razor. In Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650.

Teh, Y., Jordan, M., Beal, M., & Blei, D. (2004). Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 17. Cambridge, MA: MIT Press.

Ueda, N., & Saito, K. (2003). Parametric mixture models for multi-labeled text. In Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.

West, M., Muller, P., & Escobar, M. (1994). Hierarchical priors and mixture models, with application in regression and density estimation. In P. Freeman & A. Smith (Eds.), Aspects of uncertainty (pp. 363-386). New York: Wiley.

Zemel, R. S., & Hinton, G. E. (1994). Developing population codes by minimizing description length. In Advances in Neural Information Processing Systems 6. San Francisco, CA: Morgan Kaufmann.

Zou, H., Hastie, T., & Tibshirani, R. (in press). Sparse principal component analysis. Journal of Computational and Graphical Statistics.

Appendix: Details of limits

This Appendix contains the details of the limits of three expressions that appear in Equations 15 and 34.

The first expression is

$$\frac{K!}{K_0!\,K^{K_+}} = \frac{\prod_{k=1}^{K_+}(K - k + 1)}{K^{K_+}} \qquad (60)$$
$$= \frac{K^{K_+} - \frac{K_+(K_+ - 1)}{2}K^{K_+ - 1} + \cdots + (-1)^{K_+ - 1}(K_+ - 1)!\,K}{K^{K_+}} \qquad (61)$$
$$= 1 - \frac{K_+(K_+ - 1)}{2K} + \cdots + \frac{(-1)^{K_+ - 1}(K_+ - 1)!}{K^{K_+ - 1}}. \qquad (62)$$

For finite $K_+$, all terms except the first go to zero as $K \to \infty$.

The second expression is

$$\prod_{j=1}^{m_k - 1}\left(j + \frac{\alpha}{K}\right) = (m_k - 1)! + \frac{\alpha}{K}\sum_{j=1}^{m_k - 1}\frac{(m_k - 1)!}{j} + \cdots + \left(\frac{\alpha}{K}\right)^{m_k - 1}. \qquad (63)$$

For finite $m_k$ and $\alpha$, all terms except the first go to zero as $K \to \infty$.
The third expression is

$$\left(\frac{N!}{\prod_{j=1}^{N}(j + \frac{\alpha}{K})}\right)^{K} = \left(\frac{\prod_{j=1}^{N} j}{\prod_{j=1}^{N}(j + \frac{\alpha}{K})}\right)^{K} \qquad (64)$$
$$= \left(\prod_{j=1}^{N}\frac{j}{j + \frac{\alpha}{K}}\right)^{K} \qquad (65)$$
$$= \prod_{j=1}^{N}\left(\frac{1}{1 + \frac{\alpha}{jK}}\right)^{K}. \qquad (66)$$

We can now use the fact that

$$\lim_{K \to \infty}\left(\frac{1}{1 + \frac{x}{K}}\right)^{K} = \exp\{-x\} \qquad (67)$$

to compute the limit of Equation 66 as $K \to \infty$, obtaining

$$\lim_{K \to \infty}\prod_{j=1}^{N}\left(\frac{1}{1 + \frac{\alpha}{jK}}\right)^{K} = \prod_{j=1}^{N}\exp\left\{-\frac{\alpha}{j}\right\} \qquad (68)$$
$$= \exp\left\{-\alpha\sum_{j=1}^{N}\frac{1}{j}\right\} \qquad (69)$$
$$= \exp\{-\alpha H_N\}, \qquad (70)$$

as desired.