Machine Learning, 37, 183–233 (1999). © 1999 Kluwer Academic Publishers. Manufactured in The Netherlands.

An Introduction to Variational Methods for Graphical Models

MICHAEL I. JORDAN (jordan@cs.berkeley.edu)
Department of Electrical Engineering and Computer Sciences and Department of Statistics, University of California, Berkeley, CA 94720, USA

ZOUBIN GHAHRAMANI (zoubin@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, University College London, WC1N 3AR, UK

TOMMI S. JAAKKOLA (tommi@ai.mit.edu)
Artificial Intelligence Laboratory, MIT, Cambridge, MA 02139, USA

LAWRENCE K. SAUL (lsaul@research.att.edu)
AT&T Labs-Research, Florham Park, NJ 07932, USA

Editor: David Heckerman

Abstract. This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov random fields). We present a number of examples of graphical models, including the QMR-DT database, the sigmoid belief network, the Boltzmann machine, and several variants of hidden Markov models, in which it is infeasible to run exact inference algorithms. We then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. Inference in the simplified model provides bounds on probabilities of interest in the original model. We describe a general framework for generating variational transformations based on convex duality. Finally we return to the examples and demonstrate how variational algorithms can be formulated in each case.

Keywords: graphical models, Bayesian networks, belief networks, probabilistic inference, approximate inference, variational methods, mean field methods, hidden Markov models, Boltzmann machines, neural networks

1. Introduction

The problem of probabilistic inference in graphical models is the problem of computing a conditional probability distribution over the values of some of the nodes (the "hidden" or "unobserved" nodes), given the values of other nodes (the "evidence" or "observed" nodes). Thus, letting H represent the set of hidden nodes and letting E represent the set of evidence nodes, we wish to calculate P(H | E):

    P(H | E) = P(H, E) / P(E).   (1)
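Eq. (1) can be evaluated by brute-force enumeration when the graph is small. A minimal sketch on a hypothetical three-node chain A -> B -> C with made-up probability tables (not a model from the paper): the posterior over A given evidence on C is the joint, summed over the hidden node B and normalized by the likelihood of the evidence.

```python
# Brute-force inference on a toy directed model A -> B -> C (hypothetical
# numbers), illustrating P(H | E) = P(H, E) / P(E) from Eq. (1).

p_a = {0: 0.6, 1: 0.4}                                  # P(A)
p_b_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}      # p_b_a[a][b] = P(B=b | A=a)
p_c_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}      # p_c_b[b][c] = P(C=c | B=b)

def joint(a, b, c):
    """Joint probability as the product of local conditionals."""
    return p_a[a] * p_b_a[a][b] * p_c_b[b][c]

def posterior_a_given_c(c_obs):
    """P(A | C = c_obs): numerator and denominator of Eq. (1)."""
    numer = {a: sum(joint(a, b, c_obs) for b in (0, 1)) for a in (0, 1)}
    likelihood = sum(numer.values())   # P(E), produced as a by-product
    return {a: numer[a] / likelihood for a in (0, 1)}

post = posterior_a_given_c(1)
```

Exact algorithms such as the junction tree method compute the same quantities without enumerating all configurations; note that the likelihood P(E) appears here as the normalizing sum, exactly as the text describes it arising as a by-product of inference.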
General exact inference algorithms have been developed to perform this calculation (Jensen, 1996; Shachter, Andersen, & Szolovits, 1994; Shenoy, 1992); these algorithms take systematic advantage of the conditional independencies present in the joint distribution as inferred from the pattern of missing edges in the graph.

We often also wish to calculate marginal probabilities in graphical models, in particular the probability of the observed evidence, P(E). Viewed as a function of the parameters of the graphical model, for fixed E, P(E) is an important quantity known as the likelihood. As is suggested by Eq. (1), the evaluation of the likelihood is closely related to the calculation of P(H | E). Indeed, although inference algorithms do not simply compute the numerator and denominator of Eq. (1) and divide, they in fact generally produce the likelihood as a by-product of the calculation of P(H | E). Moreover, algorithms that maximize likelihood (and related quantities) generally make use of the calculation of P(H | E) as a subroutine.

Although there are many cases in which the exact algorithms provide a satisfactory solution to inference and learning problems, there are other cases, several of which we discuss in this paper, in which the time or space complexity of the exact calculation is unacceptable and it is necessary to have recourse to approximation procedures. Within the context of the junction tree construction, for example, the time complexity is exponential in the size of the maximal clique in the junction tree. As we will see, there are natural architectural assumptions that necessarily lead to large cliques.

Even in cases in which the complexity of the exact algorithms is manageable, there can be reason to consider approximation procedures. Note in particular that the exact algorithms make no use of the numerical representation of the joint probability distribution associated with a graphical model; put another way, the algorithms have the same complexity regardless of the particular probability distribution under consideration within the family of distributions that is consistent with the conditional independencies implied by the graph. There may be situations in which nodes or clusters of nodes are "nearly" conditionally independent, situations in which node probabilities are well determined by a subset of the neighbors of the node, or situations in which small subsets of configurations of variables contain most of the probability mass. In such cases the exactitude achieved by an exact algorithm may not be worth the computational cost. A variety of approximation procedures have been developed that attempt to identify and exploit such situations. Examples include the pruning algorithms of Kjærulff (1994), the "bounded conditioning" method of Horvitz, Suermondt, and Cooper (1989), search-based methods (e.g., Henrion, 1991), and the "localized partial evaluation" method of Draper and Hanks (1994). A virtue of all of these methods is that they are closely tied to the exact methods and thus are able to take full advantage of conditional independencies. This virtue can also be a vice, however, given the exponential growth in complexity of the exact algorithms.

A related approach to approximate inference has arisen in applications of graphical model inference to error-control decoding (McEliece, MacKay, & Cheng, 1998). In particular, Kim and Pearl's algorithm for singly-connected graphical models (Pearl, 1988) has been used successfully as an iterative approximate method for inference in non-singly-connected graphs.

Another approach to the design of approximation algorithms involves making use of Monte Carlo methods. A variety of Monte Carlo algorithms have been developed (see Neal,
1993) and applied to the inference problem in graphical models (Dagum & Luby, 1993; Fung & Favero, 1994; Gilks, Thomas, & Spiegelhalter, 1994; Jensen, Kong, & Kjærulff, 1995; Pearl, 1988). Advantages of these algorithms include their simplicity of implementation and theoretical guarantees of convergence. The disadvantages of the Monte Carlo approach are that the algorithms can be slow to converge and it can be hard to diagnose their convergence.

In this paper we discuss variational methods, which provide yet another approach to the design of approximate inference algorithms. Variational methodology yields deterministic approximation procedures that generally provide bounds on probabilities of interest. The basic intuition underlying variational methods is that complex graphs can be probabilistically simple; in particular, in graphs with dense connectivity there are averaging phenomena that can come into play, rendering nodes relatively insensitive to particular settings of values of their neighbors. Taking advantage of these averaging phenomena can lead to simple, accurate approximation procedures.

It is important to emphasize that the various approaches to inference that we have outlined are by no means mutually exclusive; indeed they exploit complementary features of the graphical model formalism. The best solution to any given problem may well involve an algorithm that combines aspects of the different methods. In this vein, we will present variational methods in a way that emphasizes their links to exact methods. Indeed, as we will see, exact methods often appear as subroutines within an overall variational approximation (cf. Jaakkola & Jordan, 1996; Saul & Jordan, 1996).

It should be acknowledged at the outset that there is as much "art" as there is "science" in our current understanding of how variational methods can be applied to probabilistic inference. Variational transformations form a large, open-ended class of approximations, and although there is a general mathematical picture of how these transformations can be exploited to yield bounds on probabilities in graphical models, there is not as yet a systematic algebra that allows particular variational transformations to be matched optimally to particular graphical models. We will provide illustrative examples of general families of graphical models to which variational methods have been applied successfully, and we will provide a general mathematical framework which encompasses all of these particular examples, but we are not as yet able to provide assurance that the framework will transfer easily to other examples.

We begin in Section 2 with a brief overview of exact inference in graphical models, basing the discussion on the junction tree algorithm. Section 3 presents several examples of graphical models, both to provide motivation for variational methodology and to provide examples that we return to and develop in detail as we proceed through the paper. The core material on variational approximation is presented in Section 4. Sections 5 and 6 fill in some of the details, focusing on sequential methods and block methods, respectively. In these latter two sections, we also return to the examples and work out variational approximations in each case. Finally, Section 7 presents conclusions and directions for future research.

2. Exact inference

In this section we provide a brief overview of exact inference for graphical models, as represented by the junction tree

algorithm (for relationships between the junction tree algorithm and other exact inference algorithms, see Shachter, Andersen, and Szolovits (1994); see also Dechter (1999) and Shenoy (1992) for recent developments in exact inference). Our intention here is not to provide a complete description of the junction tree algorithm, but rather to introduce the "moralization" and "triangulation" steps of the algorithm. An understanding of these steps, which create data structures that determine the run time of the inference algorithm, will suffice for our purposes. For a comprehensive introduction to the junction tree algorithm see Jensen (1996).

Figure 1. A directed graph is parameterized by associating a local conditional probability with each node. The joint probability is the product of the local probabilities.

Graphical models come in two basic flavors: directed graphical models and undirected graphical models. A directed graphical model (also known as a "Bayesian network") is specified numerically by associating local conditional probabilities with each of the nodes in an acyclic directed graph. These conditional probabilities specify the probability of node S_i given the values of its parents, i.e., P(S_i | S_{π(i)}), where π(i) represents the set of indices of the parents of node S_i and S_{π(i)} represents the corresponding set of parent nodes (see figure 1). To obtain the joint probability distribution for all of the N nodes in the graph, we take the product over the local node probabilities:

    P(S) = ∏_{i=1}^{N} P(S_i | S_{π(i)}).   (2)

Inference involves the calculation of conditional probabilities under this joint distribution.

An undirected graphical model (also known as a "Markov random field") is specified numerically by associating "potentials" with the cliques of the graph. A potential is a function on the set of configurations of a clique (that is, a setting of values for all of the nodes in the clique) that associates a positive real number with each configuration. Thus, for every subset of nodes C_i that forms a clique, we have an associated potential φ_i(C_i) (see figure 2). The joint probability distribution for all of the nodes in the graph is obtained by taking the product over the clique potentials:

    P(S) = (1/Z) ∏_{i=1}^{M} φ_i(C_i),   (3)

Figure 2. An undirected graph is parameterized by associating a potential with each clique in the graph. A potential assigns a positive real number to each configuration of the corresponding clique. The joint probability is the normalized product of the clique potentials.

where M is the total number of cliques and where the normalization factor Z is obtained by summing the numerator over all configurations:

    Z = ∑_S ∏_{i=1}^{M} φ_i(C_i).   (4)

In keeping with statistical mechanical terminology we will refer to this sum as a "partition function."

The junction tree algorithm compiles directed graphical models into undirected graphical models; subsequent inferential calculation is carried out in the undirected formalism. The step that converts the directed graph into an undirected graph is called "moralization." (If the initial graph is already undirected, then we simply skip the moralization step.) To understand moralization, we note that in both the directed and the undirected cases, the joint probability distribution is obtained as a product of local functions. In the directed case, these functions are the node conditional probabilities P(S_i | S_{π(i)}). In fact, this probability nearly qualifies as a potential function; it is certainly a real-valued function on the configurations of the set of variables {S_i, S_{π(i)}}. The problem is that these variables do not always appear together within a clique. That is, the parents of a common child are not necessarily linked. To be able to utilize node conditional probabilities as potential functions, we "marry" the parents of all of the nodes with undirected edges. Moreover we drop the arrows on the other edges in the graph. The result is a "moral graph," which can be used to represent the probability distribution on the original directed graph within the undirected formalism.

The second phase of the junction tree algorithm is somewhat more complex. This phase, known as "triangulation," takes a moral graph as input and produces as output an undirected graph in which additional edges have (possibly) been added. This latter graph has a special property that allows recursive calculation of probabilities to take place. In particular, in a triangulated graph, it is possible to build up a joint distribution by proceeding sequentially through the graph, conditioning blocks of interconnected nodes only on predecessor blocks in the sequence.

Figure 3. (a) The simplest non-triangulated graph. The graph has a 4-cycle without a chord. (b) Adding a chord between two of the nodes renders the graph triangulated.

The simplest graph in which this is not possible is the "4-cycle," the cycle of four nodes shown in figure 3(a). If we try to write the joint probability sequentially as, for example, P(A)P(B | A)P(C | B)P(D | C), we see that we have a problem. In particular, D depends on A, and we are unable to write the joint probability as a sequence of conditionals. A graph is not triangulated if there are 4-cycles which do not have a chord, where a chord is an edge between non-neighboring nodes. Thus the graph in figure 3(a) is not triangulated; it can be triangulated by adding a chord as in figure 3(b). In the latter graph, with a chord between A and C, we can write the joint probability sequentially as P(A)P(B | A)P(C | A, B)P(D | A, C).

More generally, once a graph has been triangulated it is possible to arrange the cliques of the graph into a data structure known as a junction tree. A junction tree has the running intersection property: if a node appears in any two cliques in the tree, it appears in all cliques that lie on the path between the two cliques. This property has the important consequence that a general algorithm for probabilistic inference can be based on achieving local consistency between cliques. (That is, the cliques assign the same marginal probability to the nodes that they have in common.) In a junction tree, because of the running intersection property, local consistency implies global consistency.

The probabilistic calculations that are performed on the junction tree involve marginalizing and rescaling the clique potentials so as to achieve local consistency between neighboring cliques. The time complexity of performing this calculation depends on the size of the cliques; in particular for discrete data the number of values required to represent the potential is exponential in the number of nodes in the clique. For efficient inference, it is therefore critical to obtain small cliques.

In the remainder of this paper, we will investigate specific graphical models and consider the computational costs of exact inference for these models. In all of these cases we will either be able to display the "obvious" triangulation, or we will be able to lower bound the size of cliques in a triangulated graph by considering the cliques in the moral graph.
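The moralization step just described is mechanical enough to sketch in a few lines. This is an illustrative fragment, not code from the paper; the graph representation (a dict mapping each node to its parent list) is our own choice.

```python
# Moralization: "marry" the parents of every node with undirected edges,
# then drop the arrows on all remaining edges.
from itertools import combinations

def moralize(parents):
    """parents maps node -> list of parents. Returns the undirected edge
    set of the moral graph as frozensets of node pairs."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:                       # original edges, undirected
            edges.add(frozenset((p, child)))
        for u, v in combinations(ps, 2):   # marry co-parents
            edges.add(frozenset((u, v)))
    return edges

# A v-structure A -> C <- B: moralization adds the undirected edge A-B.
moral = moralize({"A": [], "B": [], "C": ["A", "B"]})
```

Triangulation is the harder step; finding an optimal triangulation is NP-hard in general, which is why the text only lower-bounds clique sizes via the moral graph.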
Thus we will not need to consider specific algorithms for triangulation (for discussion of triangulation algorithms, see, e.g., Kjærulff, 1990).

3. Examples

In this section we present examples of graphical models in which exact inference is generally infeasible. Our first example involves a diagnostic system in which a fixed graphical model is used to answer queries. The remaining examples involve estimation problems in which a graphical model is fit to data and subsequently used for prediction or diagnosis.

3.1. The QMR-DT database

The QMR-DT database is a large-scale probabilistic database that is intended to be used as a diagnostic aid in the domain of internal medicine. We provide a brief overview of the QMR-DT database here; for further details see Shwe et al. (1991).

The QMR-DT database is a bipartite graphical model in which the upper layer of nodes represents diseases and the lower layer of nodes represents symptoms (see figure 4). There are approximately 600 disease nodes and 4000 symptom nodes in the database.

The evidence is a set of observed symptoms; henceforth we refer to observed symptoms as "findings" and represent the vector of findings with the symbol f. The symbol d denotes the vector of diseases. All nodes are binary; thus the components f_i and d_j are binary random variables. Making use of the conditional independencies implied by the bipartite form of the graph, and marginalizing over the unobserved symptom nodes, we obtain the following joint probability over diseases and findings:

    P(f, d) = P(f | d) P(d)   (5)
            = [∏_i P(f_i | d)] [∏_j P(d_j)].   (6)

Figure 4. The structure of the QMR-DT graphical model. The shaded nodes represent evidence nodes and are referred to as "findings."
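A hypothetical miniature of this bipartite factorization (two diseases, two findings, made-up numbers; the real database has roughly 600 diseases and 4000 findings) shows how the joint is a product of disease priors and per-finding conditionals:

```python
# Toy bipartite model: P(f, d) = [prod_i P(f_i | d)] [prod_j P(d_j)].
# All numbers are invented for illustration.
from itertools import product

priors = [0.01, 0.1]                       # P(d_j = 1) for each disease

def p_finding(i, f_i, d):
    """Tabulated toy conditional P(f_i | d); presence probability grows
    with the number of active parent diseases."""
    p_present = min([0.05, 0.2][i] + 0.3 * sum(d), 1.0)
    return p_present if f_i == 1 else 1.0 - p_present

def joint(f, d):
    p = 1.0
    for d_j, prior in zip(d, priors):      # prod_j P(d_j)
        p *= prior if d_j else 1.0 - prior
    for i, f_i in enumerate(f):            # prod_i P(f_i | d)
        p *= p_finding(i, f_i, d)
    return p

# Sanity check: the joint sums to one over all (f, d) configurations.
total = sum(joint(f, d) for f in product((0, 1), repeat=2)
            for d in product((0, 1), repeat=2))
```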
The prior probabilities of the diseases, P(d_j), were obtained by Shwe et al. from archival data. The conditional probabilities of the findings given the diseases, P(f_i | d), were obtained from expert assessments under a "noisy-OR" model. That is, the conditional probability that the ith symptom is absent, P(f_i = 0 | d), is expressed as follows:

    P(f_i = 0 | d) = (1 − q_{i0}) ∏_{j ∈ π(i)} (1 − q_{ij})^{d_j},   (7)

where the q_{ij} are parameters obtained from the expert assessments. Considering the case in which all diseases are absent, we see that the parameter q_{i0} can be interpreted as the probability that the ith finding is present even though no disease is present. The effect of each additional disease, if present, is to contribute an additional factor of 1 − q_{ij} to the probability that the ith finding is absent.

We will find it useful to rewrite the noisy-OR model in an exponential form:

    P(f_i = 0 | d) = exp{−θ_{i0} − ∑_{j ∈ π(i)} θ_{ij} d_j},   (8)

where θ_{ij} ≡ −ln(1 − q_{ij}) are the transformed parameters. Note also that the probability of a positive finding is given as follows:

    P(f_i = 1 | d) = 1 − exp{−θ_{i0} − ∑_{j ∈ π(i)} θ_{ij} d_j}.   (9)

These forms express the noisy-OR model as a generalized linear model.

If we now form the joint probability distribution by taking products of the local probabilities P(f_i | d) as in Eq. (6), we see that negative findings are benign with respect to the inference problem. In particular, a product of exponential factors that are linear in the diseases (cf. Eq. (8)) yields a joint probability that is also the exponential of an expression linear in the diseases. That is, each negative finding can be incorporated into the joint probability in a linear number of operations.

Products of the probabilities of positive findings, on the other hand, yield cross product terms that are problematic for exact inference. These cross product terms couple the diseases (they are responsible for the "explaining away" phenomena that arise for the noisy-OR model; see Pearl, 1988). Unfortunately, these coupling terms can lead to an exponential growth in inferential complexity. Considering a set of standard diagnostic cases (the "CPC cases"; see Shwe et al., 1991), Jaakkola and Jordan (1999b) found that the median size of the maximal clique of the moralized QMR-DT graph is 151.5 nodes. Thus even without considering the triangulation step, we see that diagnostic calculation under the QMR-DT model is generally infeasible.
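The equivalence between the two forms of the noisy-OR conditional, and hence the claim that negative findings enter the joint through a factor linear in the diseases, can be checked numerically. The leak and q parameters below are hypothetical:

```python
# Noisy-OR: P(finding absent | d) as a product, and the same quantity in
# exponential (generalized-linear) form with theta = -ln(1 - q).
import math

q0 = 0.02                         # leak: finding present with no disease
q = [0.4, 0.7, 0.1]               # per-disease parameters (made up)
theta0 = -math.log(1.0 - q0)
theta = [-math.log(1.0 - qj) for qj in q]

def p_absent_product(d):
    """(1 - q0) * prod_j (1 - q_j)^(d_j)."""
    p = 1.0 - q0
    for d_j, q_j in zip(d, q):
        p *= (1.0 - q_j) ** d_j
    return p

def p_absent_exponential(d):
    """exp(-theta0 - sum_j theta_j * d_j): linear in d inside the exp."""
    return math.exp(-theta0 - sum(t * d_j for t, d_j in zip(theta, d)))
```

Because the exponent is linear in d, a product of such factors over all negative findings remains a single exponential of a linear expression; this is the "benign" case described above. Positive findings contribute one minus this exponential, and products of those terms do not collapse the same way.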
Figure 5. The layered graphical structure of a neural network. The input nodes and output nodes comprise the set of evidence nodes.

3.2. Neural networks as graphical models

Neural networks are layered graphs endowed with a nonlinear "activation" function at each node (see figure 5). Let us consider activation functions that are bounded between zero and one, such as those obtained from the logistic function. We can treat such a neural network as a graphical model by associating a binary variable S_i with each node and interpreting the activation of the node as the probability that the associated binary variable takes one of its two values. For example, using the logistic function, we write:

    P(S_i = 1 | S_{π(i)}) = 1 / (1 + exp{−∑_{j ∈ π(i)} θ_{ij} S_j − θ_{i0}}),   (10)

where θ_{ij} are the parameters associated with the edges between parent nodes j and node i, and θ_{i0} is the "bias" parameter associated with node i. This is the "sigmoid belief network" introduced by Neal (1992). The advantages of treating a neural network in this manner include the ability to perform diagnostic calculations, to handle missing data, and to treat unsupervised learning on the same footing as supervised learning. Realizing these benefits, however, requires that the inference problem be solved in an efficient way.

In fact, it is easy to see that exact inference is infeasible in general layered neural network models. A node in a neural network generally has as parents all of the nodes in the preceding layer. Thus the moralized neural network graph has links between all of the nodes in this layer (see figure 6). That these links are necessary for exact inference in general is clear; in particular, during training of a neural network the output nodes are evidence nodes, thus the hidden units in the penultimate layer become probabilistically dependent, as do their ancestors in the preceding hidden layers. Thus if there are N hidden units in a particular hidden layer, the time complexity of inference is at least O(2^N), ignoring the additional growth in clique size due to triangulation. Given that neural networks with dozens or even hundreds of hidden units are commonplace, we see that training a neural network using exact inference is not generally feasible.

Figure 6. Moralization of a neural network. The output nodes are evidence nodes during training. This creates probabilistic dependencies between the hidden nodes which are captured by the edges added by the moralization.
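The conditional of Eq. (10) is easy to sketch. The two-parent topology and the weights below are hypothetical; the point is only that each local conditional is a logistic function of the parent values, and the joint is a product of such conditionals.

```python
# Sigmoid belief network: P(S_i = 1 | parents) = logistic(sum_j w_j S_j + b).
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_on(parent_values, weights, bias):
    return logistic(sum(w * s for w, s in zip(weights, parent_values)) + bias)

def joint(s1, s2, s3, w=(1.5, -2.0), b=(0.0, 0.0, 0.3)):
    """Joint for two root nodes feeding one child node (toy parameters)."""
    p1 = logistic(b[0]) if s1 else 1.0 - logistic(b[0])
    p2 = logistic(b[1]) if s2 else 1.0 - logistic(b[1])
    p3 = p_on((s1, s2), w, b[2])
    return p1 * p2 * (p3 if s3 else 1.0 - p3)
```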
Figure 7. A Boltzmann machine. An edge between nodes S_i and S_j is associated with a factor exp{θ_{ij} S_i S_j} that contributes multiplicatively to the potential of one of the cliques containing the edge. Each node also contributes a factor exp{θ_{i0} S_i} to one and only one potential.

3.3. Boltzmann machines

A Boltzmann machine is an undirected graphical model with binary-valued nodes and a restricted set of potential functions (see figure 7). In particular, the clique potentials are formed by taking products of "Boltzmann factors": exponentials of terms that are at most quadratic in the S_i (Hinton & Sejnowski, 1986). Thus each clique potential is a product of factors exp{θ_{ij} S_i S_j} and factors exp{θ_{i0} S_i}, where S_i ∈ {0, 1}.

A given pair of nodes S_i and S_j can appear in multiple, overlapping cliques. For each such pair we assume that the expression exp{θ_{ij} S_i S_j} appears as a factor in one and only one clique potential. Similarly, the factors exp{θ_{i0} S_i} are assumed to appear in one and only one clique potential. Taking the product over all such clique potentials (cf. Eq. (3)), we have:

    P(S) = exp{∑_{i<j} θ_{ij} S_i S_j + ∑_i θ_{i0} S_i} / Z,   (11)

where we have set θ_{ij} = 0 for nodes S_i and S_j that are not neighbors in the graph; this convention allows us to sum indiscriminately over all pairs S_i and S_j and still respect the clique boundaries. We refer to the negative of the exponent in Eq. (11) as the energy. With this definition the joint probability in Eq. (11) has the general form of a Boltzmann distribution.

Saul and Jordan (1994) pointed out that exact inference for certain special cases of Boltzmann machine, such as trees, chains, and pairs of coupled chains, is tractable and they proposed a decimation algorithm for this purpose. For more general Boltzmann machines, however, decimation is not immune to the exponential time complexity that plagues other exact methods. Indeed, despite the fact that the Boltzmann machine is a special class of undirected graphical model, it is a special class only by virtue of its parameterization, not by virtue of its conditional independence structure. Thus, exact algorithms such as decimation and the junction tree algorithm, which are based solely on the graphical structure of the Boltzmann machine, are no more efficient for Boltzmann machines than they are for general graphical models. In particular, when we triangulate generic Boltzmann machines, including the layered Boltzmann machines and grid-like Boltzmann machines, we obtain intractably large cliques.

Sampling algorithms have traditionally been used to attempt to cope with the intractability of the Boltzmann machine (Hinton & Sejnowski, 1986). The sampling algorithms are overly slow, however, and more recent work has considered the faster "mean field" approximation (Peterson & Anderson, 1987). We will describe the mean field approximation for Boltzmann machines later in the paper; it is a special form of the variational approximation approach that provides lower bounds on marginal probabilities. We will also discuss a more general variational algorithm that provides upper and lower bounds on probabilities (marginals and conditionals) for Boltzmann machines (Jaakkola & Jordan).

3.4. Hidden Markov models

In this section, we briefly review hidden Markov models. The hidden Markov model (HMM) is an example of a graphical model in which exact inference is tractable; our purpose in discussing HMMs here is to lay the groundwork for the discussion of intractable variations on HMMs in the following sections. See Smyth, Heckerman, and Jordan (1997) for a fuller discussion of the HMM as a graphical model.

An HMM is a graphical model in the form of a chain (see figure 8). Consider a sequence of multinomial "state" nodes X_i and assume that the conditional probability of node X_i, given its immediate predecessor X_{i−1}, is independent of all other preceding variables. (The index i can be thought of as a time index.) The chain is assumed to be homogeneous; that is, the matrix of transition probabilities, A = P(X_i | X_{i−1}), is invariant across time. We also require a probability distribution π for the initial state X_1. The HMM model also involves a set of "output" nodes Y_i and an emission probability law B = P(Y_i | X_i), again assumed time-invariant.

Figure 8. An HMM represented as a graphical model. The left-to-right spatial dimension represents time. The output nodes are evidence nodes during the training process and the state nodes are hidden.

An HMM is trained by treating the output nodes as evidence nodes and the state nodes as hidden nodes. An expectation-maximization (EM) algorithm (Baum et al., 1970; Dempster, Laird, & Rubin, 1977) is generally used to update the parameters; this algorithm involves a simple iterative procedure having two alternating steps: (1) run an inference algorithm to calculate the conditional probabilities of the hidden states given the outputs; (2) update the parameters via weighted maximum likelihood where the weights are given by the conditional probabilities calculated in step (1).

It is easy to see that exact inference is tractable for HMMs. The moralization and triangulation steps are vacuous for the HMM; thus the time complexity can be read off from figure 8 directly. We see that the maximal clique is of size N², where N is the dimensionality of a state node. Inference therefore scales as O(N²T), where T is the length of the time series.

3.5. Factorial hidden Markov models

In many problem domains it is natural to make additional structural assumptions about the state space and the transition probabilities that are not available within the simple HMM framework. A number of structured variations on HMMs have been considered in recent years (see Smyth et al., 1997); generically these variations can be viewed as "dynamic belief networks" (Dean & Kanazawa, 1989; Kanazawa, Koller, & Russell, 1995). Here we consider a particularly simple variation on the HMM theme known as the "factorial hidden Markov model" (Ghahramani & Jordan, 1997; Williams & Hinton, 1991).

The graphical model for a factorial HMM (FHMM) is shown in figure 9. The system is composed of a set of N chains indexed by n. Let the state node for the nth chain at time i be represented by X_i^(n) and let the transition matrix for the nth chain be represented by A^(n). We can view the effective state space for the FHMM as the Cartesian product of the state spaces associated with the individual chains. The overall transition probability for the system is obtained by taking the product across the intra-chain transition probabilities:

    P(X_i | X_{i−1}) = ∏_{n=1}^{N} P(X_i^(n) | X_{i−1}^(n)),   (12)

where the symbol X_i stands for the collection of state nodes {X_i^(1), ..., X_i^(N)}.

Figure 9. A factorial HMM with three chains. The transition matrices are A^(1), A^(2), and A^(3), associated with the horizontal edges, and the output probabilities are determined by matrices B^(1), B^(2), and B^(3), associated with the vertical edges.

Ghahramani and Jordan utilized a linear-Gaussian distribution for the emission probabilities of the FHMM. In particular, they assumed:

    P(Y_i | X_i) = N(∑_{n=1}^{N} B^(n) X_i^(n), Σ),   (13)

where the B^(n) and Σ are matrices of parameters.

The FHMM is a natural model for systems in which the hidden state is realized via the joint configuration of an uncoupled set of dynamical systems. Moreover, an FHMM is able to represent a large effective state space with a much smaller number of parameters than a single unstructured Cartesian product HMM. For example, if we have 5 chains and in each chain the nodes have 10 states, the effective state space is of size 100,000, while the transition probabilities are represented compactly with only 500 parameters. A single unstructured HMM would require 10^10 parameters for the transition matrix in this case.

The fact that the output is a function of the states of all of the chains implies that the states become stochastically coupled when the outputs are observed. Let us investigate the implications of this fact for the time complexity of exact inference in the FHMM. Figure 10 shows a triangulation for the case of two chains (in fact this is an optimal triangulation).

Figure 10. A triangulation of an FHMM with two component chains. The moralization step links states at a single time step. The triangulation step links states diagonally between neighboring time steps.

Figure 11. A triangulation of the state nodes of a three-chain FHMM. (The observation nodes have been omitted in the interest of simplicity.)
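The parameter comparison quoted above is simple arithmetic and can be reproduced directly (the function names are ours):

```python
# N chains of M states: effective state space of size M**N, but only
# N * M * M transition parameters; one unstructured HMM on the product
# space would need (M**N)**2 transition parameters.
def fhmm_transition_params(n_chains, m_states):
    return n_chains * m_states * m_states

def unstructured_transition_params(n_chains, m_states):
    return (m_states ** n_chains) ** 2

effective_states = 10 ** 5   # the 5-chain, 10-state example from the text
```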
Figure 12. This graph is not a triangulation of a three-chain FHMM.

The cliques for the hidden states in the two-chain triangulation are of size M³; thus the time complexity of exact inference is O(M³T), where M is the number of states in each chain (we assume that each chain has the same number of states, for simplicity). Figure 11 shows the case of a triangulation of three chains; here the triangulation (again optimal) creates cliques of size M⁴. (Note in particular that the graph in figure 12, with cliques of size three, is not a triangulation; there are 4-cycles without a chord.) In the general case, it is not difficult to see that cliques of size M^{N+1} are created, where N is the number of chains; thus the complexity of exact inference for the FHMM scales as O(M^{N+1} T). For a single unstructured Cartesian product HMM having the same number of states as the FHMM, i.e., M^N states, the complexity scales as O(M^{2N} T); thus exact inference for the FHMM is somewhat less costly, but the exponential growth in complexity in either case shows that exact inference is infeasible for general FHMMs.

3.6. Higher-order hidden Markov models

A related variation on HMMs considers a higher-order Markov model in which each state depends on the previous K states instead of the single previous state. In this case it is again readily shown that the time complexity is exponential in K. We will not discuss the higher-order HMM further in this paper; for a variational algorithm for the higher-order HMM see Saul and Jordan (1996).

3.7. Hidden Markov decision trees

Finally, we consider a model in which a decision tree is endowed with Markovian dynamics (Jordan, Ghahramani, & Saul, 1997). A decision tree can be viewed as a graphical model by modeling the decisions in the tree as multinomial random variables, one for each level of the decision tree.

Figure 13. A hidden Markov decision tree. The shaded nodes represent a time series in which each element is an (input, output) pair. Linking the inputs and outputs are a sequence of decision nodes which correspond to branches in a decision tree. These decisions are linked horizontally to represent Markovian temporal dependencies.

Referring to figure 13, and focusing on a particular time slice, the shaded node at the top of the diagram represents the input vector. The unshaded nodes below the input nodes are the decision nodes. Each of the decision nodes is conditioned on the input
JORDANETAL.andontheentiresequenceofprecedingdecisions(theverticalarrowsinthediagram).Intermsofatraditionaldecisiontreediagram,thisdependenceprovidesanindica

15 tionofthepathfollowedbythedatapointasitd
tion of the path followed by the data point as it drops through the decision tree. The node at the bottom of the diagram is the output variable.

If we now make the decisions in the decision tree conditional not only on the current data point, but also on the decisions at the previous moment in time, we obtain a hidden Markov decision tree (HMDT). In figure 13, the horizontal edges represent this Markovian temporal dependence. Note in particular that the dependency is assumed to be level-specific: the probability of a decision depends only on the previous decision at the same level of the decision tree.

Given a sequence of input vectors and a corresponding sequence of output vectors, the inference problem is to compute the conditional probability distribution over the hidden states. This problem is intractable for general HMDTs, as can be seen by noting that the HMDT includes the FHMM as a special case.

4. Basics of variational methodology

Variational methods are used as approximation methods in a wide variety of settings, including finite element analysis (Bathe, 1996), quantum mechanics (Sakurai, 1985), statistical mechanics (Parisi, 1988), and statistics (Rustagi, 1976). In each of these cases the application of variational methods converts a complex problem into a simpler problem, where the simpler problem is generally characterized by a decoupling of the degrees of freedom in the original problem. This decoupling is achieved via an expansion of the problem to include additional parameters, known as variational parameters, that must be fit to the problem at hand.

The terminology comes from the roots of the techniques in the calculus of variations. We will not start systematically from the calculus of variations; instead, we will jump off from an intermediate point that emphasizes the important role of convexity in variational approximation. This point of view turns out to be particularly well suited to the development of variational methods for graphical models.

4.1. Examples

Let us begin by considering a simple example. In particular, let us express the logarithm function variationally:

    ln(x) = min_λ {λx − ln λ − 1}.    (14)

In this expression λ is the variational parameter, and we are required to perform the minimization for each value of x. The expression is readily verified by taking the derivative with respect to λ, solving, and substituting. The situation is perhaps best appreciated geometrically, as we show in figure 14.

Figure 14. Variational transformation of the logarithm function. The linear functions λx − ln λ − 1 form a family of upper bounds for the logarithm, each of which is exact for a particular value of x.

Note that the expression in braces in Eq. (14) is linear in x with slope λ. Clearly, given the concavity of the logarithm, for each line having slope λ there is a value of the intercept such that the line touches the logarithm at a single point. Indeed, −ln λ − 1 in Eq. (14) is precisely this intercept. Moreover, if we range across λ, the family of such lines forms an upper envelope of the logarithm function. That is, for any given x, we have:

    ln(x) ≤ λx − ln λ − 1,

for all λ. Thus the variational transformation provides a family of upper bounds on the logarithm. The minimum over these bounds is the exact value of the logarithm.

The pragmatic justification for such a transformation is that we have converted a nonlinear function into a linear function. The cost is that we have obtained a free parameter λ that must be set, once for each x. For any value of λ we obtain an upper bound on the logarithm; if we set λ well we can obtain a good bound. Indeed we can recover the exact value of the logarithm for the optimal choice of λ.

Let us now consider a second example that is more directly relevant to graphical models. For binary-valued nodes it is common to represent the probability that the node takes one of its values via a monotonic nonlinearity that is a simple function, e.g., a linear function, of the values of the parents of the node. An example is the logistic regression model:

    g(x) = 1 / (1 + e^{−x}),

which we have seen previously in Eq. (10). Here x is the weighted sum of the values of the parents of a node.
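As a quick numerical sanity check (a sketch, not part of the original paper; the test values of x and λ are arbitrary), the transformation in Eq. (14) can be verified in a few lines of Python:

```python
import math

def log_upper_bound(x, lam):
    # Linear-in-x upper bound on ln(x): lambda*x - ln(lambda) - 1.
    return lam * x - math.log(lam) - 1.0

x = 3.0
# The bound holds for every lambda > 0 ...
for lam in (0.1, 1.0 / 3.0, 0.5, 1.0, 2.0):
    assert log_upper_bound(x, lam) >= math.log(x) - 1e-12
# ... and it is tight at the minimizing value lambda = 1/x.
assert abs(log_upper_bound(x, 1.0 / x) - math.log(x)) < 1e-12
```

Setting the derivative with respect to λ to zero gives λ = 1/x, at which point the bound collapses to the exact logarithm, as the assertions confirm.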
The logistic function is neither convex nor concave, so a simple linear bound will not work. However, the logistic function is log concave. That is, the function

    f(x) = ln g(x) = −ln(1 + e^{−x})

is a concave function of x (as can readily be verified by calculating the second derivative). Thus we can bound the log logistic function with linear functions and thereby bound the logistic function by the exponential. In particular, we can write:

    ln g(x) = min_λ {λx − H(λ)},    (18)

where H(λ) is the binary entropy function, H(λ) = −λ ln λ − (1 − λ) ln(1 − λ). (We will explain how the binary entropy function arises below; for now it suffices to think of it simply as the appropriate intercept term for the log logistic function). We now take the exponential of both sides, noting that the minimum and the exponential function commute:

    g(x) = min_λ exp(λx − H(λ)).    (20)

This is a variational transformation for the logistic function; examples are plotted in figure 15. Finally, we note once again that for any value of λ we obtain an upper bound of the logistic function for all values of x. Good choices for λ provide better bounds.

Figure 15. Variational transformation of the logistic function.

The advantages of the transformation in Eq. (20) are significant in the context of graphical models. In particular, to obtain the joint probability in a graphical model we are required to take a product over the local conditional probabilities (cf. Eq. (2)). For conditional probabilities represented with logistic regression, we obtain products of functions of the form g(x) = 1/(1 + e^{−x}). Such a product is not in a simple form. If instead we augment our network representation by including variational parameters, i.e. representing each logistic function variationally as in Eq. (20), we see that a bound on the joint probability is obtained by taking products of exponentials. This is tractable computationally, particularly so given that the exponents are linear in x.

4.2. Convex duality

Can we find variational transformations more systematically? Indeed, many of the variational transformations that have been utilized in the literature on graphical models are examples of the general principle of convex duality. It is a general fact of convex analysis (Rockafellar, 1972) that a concave function f(x) can be represented via a conjugate function as follows:

    f(x) = min_λ {λᵀx − f*(λ)},    (21)

where we now allow x and λ to be vectors. The conjugate function f*(λ) can be obtained from the following dual expression:

    f*(λ) = min_x {λᵀx − f(x)}.    (22)

This relationship is easily understood geometrically, as shown in figure 16. Here we plot f(x) and the linear function λx for a particular value of λ. The short vertical segments

Figure 16. The conjugate function f*(λ) is obtained by minimizing across the deviations (represented as dashed lines) between λx and f(x).
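The bound in Eq. (20) can be checked numerically in the same spirit. In this sketch (not from the paper), the choice of x and the grid of λ values are arbitrary; the tightness condition λ = g(−x) follows from matching the slope of ln g at x and is an assumption derived here, not a formula stated in the text above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_entropy(lam):
    return -lam * math.log(lam) - (1 - lam) * math.log(1 - lam)

def logistic_upper_bound(x, lam):
    # exp(lam*x - H(lam)) upper-bounds the logistic function for 0 < lam < 1.
    return math.exp(lam * x - binary_entropy(lam))

x = 1.5
for lam in (0.05, 0.2, 0.5, 0.8, 0.95):
    assert logistic_upper_bound(x, lam) >= sigmoid(x) - 1e-12
# The bound is tight when lam equals the slope of ln g at x, i.e. sigmoid(-x).
lam_star = sigmoid(-x)
assert abs(logistic_upper_bound(x, lam_star) - sigmoid(x)) < 1e-10
```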
represent values λx − f(x). It is clear from the figure that we need to shift the linear function λx vertically by an amount which is the minimum of the values λx − f(x) in order to obtain an upper bounding line with slope λ that touches f(x) at a single point. This observation both justifies the form of the conjugate function, as a minimum over differences λx − f(x), and explains why the conjugate function appears as the intercept in Eq. (21).

It is an easy exercise to verify that the conjugate function for the logarithm is f*(λ) = ln λ + 1, and the conjugate function for the log logistic function is the binary entropy H(λ).

Although we have focused on upper bounds in this section, the framework of convex duality applies equally well to lower bounds; in particular for convex f(x) we have:

    f(x) = max_λ {λᵀx − f*(λ)},    (23)

where f*(λ) is again the conjugate function, now defined with a maximization over x.

We have focused on linear bounds in this section, but convex duality is not restricted to linear bounds. More general bounds can be obtained by transforming the argument of the function of interest rather than the value of the function (Jaakkola & Jordan, 1997a). For example, if f(x) is concave as a function of x², we can write:

    f(x) = min_λ {λx² − f̄*(λ)},

where f̄*(λ) is the conjugate function of f̄(x) ≡ f(√x). Thus the transformation yields a quadratic bound on f(x). It is also worth noting that such transformations can be combined with the logarithmic transformation utilized earlier to obtain Gaussian representations for the upper bounds. This can be useful in obtaining variational approximations for posterior distributions (Jaakkola & Jordan, 1997b).

To summarize, the general methodology suggested by convex duality is the following. We wish to obtain upper or lower bounds on a function of interest. If the function is already convex or concave then we simply calculate the conjugate function. If the function is not convex or concave, then we look for an invertible transformation that renders the function convex or concave. We may also consider transformations of the argument of the function. We then calculate the conjugate function in the transformed space and transform back. For this approach to be useful we need to find a transform, such as the logarithm, whose inverse has useful algebraic properties.

4.3. Approximations for joint probabilities and conditional probabilities

The discussion thus far has focused on approximations for the local probability distributions at the nodes of a graphical model. How do these approximations translate into approximations for the global probabilities of interest, in particular for the conditional distribution
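The dual expression in Eq. (22) can be checked by brute force for the logarithm, approximating the minimization over x by a grid search and comparing against the closed form f*(λ) = ln λ + 1 stated above (this is a sketch; the grid and test values are arbitrary):

```python
import math

def conjugate_of_log(lam, xs):
    # f*(lam) = min_x { lam*x - ln(x) }, approximated on a grid of x values.
    return min(lam * x - math.log(x) for x in xs)

# Grid over (0, 1000); the true minimizer x = 1/lam lies inside this range
# for the lam values tested below.
xs = [0.01 * i for i in range(1, 100000)]
for lam in (0.5, 1.0, 2.0):
    numeric = conjugate_of_log(lam, xs)
    closed_form = math.log(lam) + 1.0
    assert abs(numeric - closed_form) < 1e-3
```

Plugging f*(λ) = ln λ + 1 back into Eq. (21) recovers exactly the variational form of the logarithm in Eq. (14).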
P(H | E) that is our interest in the inference problem, and the marginal probability P(E) that is our interest in learning problems? Let us focus on directed graphs for concreteness. Suppose that we have a lower bound and an upper bound for each of the local conditional probabilities P(S_i | S_{π(i)}). That is, assume that we have forms P^U(S_i | S_{π(i)}, λ_i^U) and P^L(S_i | S_{π(i)}, λ_i^L), providing upper and lower bounds, respectively, where λ_i^U and λ_i^L are (generally different) variational parameterizations appropriate for the upper and lower bounds. Consider first the upper bounds. Given that the product of upper bounds is an upper bound, we have:

    P(H, E) = ∏_i P(S_i | S_{π(i)}) ≤ ∏_i P^U(S_i | S_{π(i)}, λ_i^U).    (26)

This inequality holds for arbitrary settings of values of the variational parameters λ_i^U. Moreover, Eq. (26) must hold for any subset of S whenever some other subset is held fixed; this implies that upper bounds on marginal probabilities can be obtained by taking sums over the variational form on the right-hand side of the equation. For example, letting {H, E} be a disjoint partition of S, we have:

    P(E) = Σ_{H} P(H, E) ≤ Σ_{H} ∏_i P^U(S_i | S_{π(i)}, λ_i^U),    (27)

where, as we will see in the examples to be discussed below, we choose the variational forms P^U so that the summation over H can be carried out efficiently (this is the key step in developing a variational method).

In either Eq. (26) or Eq. (27), given that these upper bounds hold for any settings of values of the variational parameters, they hold in particular for optimizing settings of the parameters. That is, we can treat the right-hand side of Eq. (26) or the right-hand side of Eq. (27) as a function to be minimized with respect to λ_i^U. In the latter case, this optimization process will induce interdependencies between the parameters λ_i^U. These interdependencies are desirable; indeed they are critical for obtaining a good variational bound on the marginal probability of interest. In particular, the best global bounds are obtained when the probabilistic dependencies in the distribution are reflected in dependencies in the approximation.

To clarify the nature of variational bounds, note that there is an important distinction to be made between joint probabilities (Eq. (26)) and marginal probabilities (Eq. (27)). In Eq. (26), if we allow the variational parameters to be set optimally for each value of the argument S, then it is possible (in principle) to find optimizing settings of the variational parameters that recover the exact value of the joint probability. (Here we assume that the local probabilities can be represented exactly via a variational transformation, as in the examples discussed in Section 4.1). In Eq. (27), on the other hand, we are generally unable to recover exact values of the marginal by optimizing over variational parameters that depend only on the argument E. Consider, for example, the case of a node S_i ∈ E that has parents in H.
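A minimal illustration of Eq. (27), as a sketch rather than anything from the paper: take a hypothetical two-node network S1 → S2 with logistic conditionals (the weights b1, b2, w below are invented), replace the local conditional by its variational form from Eq. (20), and sum out the hidden parent. The result upper-bounds the true marginal for every setting of λ:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_entropy(lam):
    return -lam * math.log(lam) - (1 - lam) * math.log(1 - lam)

# Hypothetical weights for a two-node sigmoid network S1 -> S2.
b1, b2, w = 0.3, -0.5, 1.2

def p_s1(s1):
    return sigmoid(b1) if s1 == 1 else 1.0 - sigmoid(b1)

def exact_marginal():
    # P(S2 = 1) obtained by summing over the hidden parent S1.
    return sum(p_s1(s1) * sigmoid(b2 + w * s1) for s1 in (0, 1))

def marginal_upper_bound(lam):
    # Replace P(S2 = 1 | S1) by exp(lam*z - H(lam)) and sum out S1,
    # mirroring the structure of Eq. (27).
    return sum(p_s1(s1) * math.exp(lam * (b2 + w * s1) - binary_entropy(lam))
               for s1 in (0, 1))

for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    assert marginal_upper_bound(lam) >= exact_marginal()
```

Because a single λ must serve both summands (s1 = 0 and s1 = 1), no single setting makes both local bounds exact at once, which is exactly the looseness discussed in the text.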
As we range across the summation over H, there will be summands on the right-hand side of Eq. (27) that will involve evaluating the local probability P(S_i | S_{π(i)}) for different values of the parents S_{π(i)}. If the variational parameter λ_i^U depends only on E, we cannot in general expect to obtain an exact representation for P(S_i | S_{π(i)}) in each summand. Thus, some of the summands in Eq. (27) are necessarily bounds and not exact values.

This observation provides a bit of insight into reasons why a variational bound might be expected to be tight in some circumstances and loose in others. In particular, if P(S_i | S_{π(i)}) is nearly constant as we range across S_{π(i)}, or if we are operating at a point where the variational representation is fairly insensitive to the setting of λ_i^U (for example the right-hand side of the logarithm in figure 14), then the bounds may be expected to be tight. On the other hand, if these conditions are not present one might expect that the bound would be loose. However the situation is complicated by the interdependencies between the λ_i^U that are induced during the optimization process. We will return to these issues in the discussion.

Although we have discussed upper bounds, similar comments apply to lower bounds, and to marginal probabilities obtained from lower bounds on the joint distribution.

The conditional distribution P(H | E), on the other hand, is the ratio of two marginal distributions; i.e., P(H | E) = P(H, E)/P(E). To obtain upper and lower bounds on the conditional distribution, we must have upper and lower bounds on both the numerator and the denominator. Generally speaking, however, if we can obtain upper and lower bounds on the denominator, then our labor is essentially finished, because the numerator involves fewer sums. Indeed, in the case in which S = H ∪ E, the numerator involves no sums and is simply a function evaluation.

Finally, it is worth noting that variational methods can also be of interest simply as tractable approximations rather than as methods that provide strict bounds (much as sampling methods are used). One way to do this is to obtain a variational approximation that is a bound for a marginal probability, and to substitute the variational parameters thus obtained into the conditional probability distribution. Thus, for example, we might obtain a lower bound on the likelihood P(E) by fitting variational parameters. We can substitute these parameters into the parameterized variational form for P(H, E) and then utilize this variational form as an efficient inference engine in calculating an approximation to P(H | E).

In the following sections we will illustrate the general variational framework as it has been

applied in a number of worked-out examples. All of these examples involve architectures of practical interest and provide concrete examples of variational methodology. To a certain degree the examples also serve as case histories that can be generalized to related architectures. It is important to emphasize, however, that it is not necessarily straightforward to develop a variational approximation for a new architecture. The ease and the utility of applying the methods outlined in this section depend on architectural details, including the choice of node probability functions, the graph topology and the particular parameter regime in which the model is operated. In particular, certain choices of node conditional probability functions lend themselves more readily than others to variational transformations that have useful algebraic properties. Also, certain architectures simplify more readily under variational transformation than others; in particular, the marginal bounds in Eq. (27) are simple functions in some cases and complex in others. These issues are currently not well understood and the development of effective variational approximations can in some cases require substantial creativity.

4.4. Sequential and block methods

Let us now consider in somewhat more detail how variational methods can be applied to probabilistic inference problems. The basic idea is that suggested above: we wish to simplify the joint probability distribution by transforming the local probability functions. By an appropriate choice of variational transformation, we can simplify the form of the joint probability distribution and thereby simplify the inference problem. We can transform some or all of the nodes. The cost of performing such transformations is that we obtain bounds or approximations to the probabilities rather than exact results.

The option of transforming only some of the nodes is important; it implies a role for the exact methods as subroutines within a variational approximation. In particular, partial transformations of the graph may leave some of the original graphical structure intact and/or introduce new graphical structure to which exact methods can be fruitfully applied. In general, we wish to use variational approximations in a limited way, transforming the graph into a simplified graph to which exact methods can be applied. This will in general yield tighter bounds than an algorithm that transforms the entire graph without regard for computationally tractable substructure.

The majority of variational algorithms proposed in the literature to date can be divided into two main classes: sequential and block. In the sequential approach, nodes are transformed in an order that is determined during the inference process. This approach has the advantage of flexibility and generality, allowing the particular pattern of evidence to determine the best choices of nodes to transform. In some cases, however, particularly when there are obvious substructures in a graph which are amenable to exact methods, it can be advantageous to designate in advance the nodes to be transformed. We will see that this block approach is particularly natural in the setting of parameter estimation.

5. The sequential approach

The sequential approach introduces variational transformations for the nodes in a particular order. The goal is to transform the network until the resulting transformed network is amenable to exact methods. As we will see in the examples below, certain variational transformations can be understood graphically as a sparsification in which nodes are removed from the graph. If a sufficient number of variational transformations are introduced the resulting graph becomes sufficiently sparse such that an exact method becomes applicable. An operational definition of sparseness can be obtained by running a greedy triangulation algorithm; this upper bounds the run time of the junction tree inference algorithm.

There are basically two ways to implement the sequential approach: one can begin with the untransformed graph and introduce variational transformations one node at a time, or one can begin with a completely transformed graph and reintroduce exact conditional probabilities one node at a time. An advantage of the latter approach is that the graph remains tractable at all times; thus it is feasible to directly calculate the quantitative effect
of transforming or reintroducing a given node. In the former approach the graph is intractable throughout the search, and the only way to assess a transformation is via its qualitative effect on graphical sparseness.

The sequential approach is perhaps best presented in the context of a specific example. In the following section we return to the QMR-DT network and show how a sequential variational approach can be used for inference in this network.

5.1. The QMR-DT network

Jaakkola and Jordan (1999b) present an application of sequential variational methods to the QMR-DT network. As we have seen, the QMR-DT network is a bipartite graph in which the conditional probabilities for the findings are based on the noisy-OR model (Eq. (8) for the negative findings and Eq. (9) for the positive findings).

Note that symptom nodes that are not findings, i.e., symptoms that are not observed, can simply be marginalized out of the joint distribution by omission and therefore they have no impact on inference. Moreover, as we have discussed, the negative findings present no difficulties for inference: given the exponential form of the probability in Eq. (8), the effects of negative findings on the disease probabilities can be handled in linear time. Let us therefore assume that the updates associated with the negative findings have already been made and focus on the problem of performing inference when there are positive findings.

Repeating Eq. (9) for convenience, we have the following representation for the probability of a positive finding:

    P(f_i = 1 | d) = 1 − exp(−θ_{i0} − Σ_j θ_{ij} d_j).    (28)

The function 1 − e^{−x} is log concave; thus, as in the case of the logistic function, we are able to express the variational upper bound in terms of the exponential of a linear function. In particular:

    1 − e^{−x} ≤ exp(λx − f*(λ)),    (29)

where the conjugate function f*(λ) is as follows:

    f*(λ) = −λ ln λ + (λ + 1) ln(λ + 1).

Plugging the argument of Eq. (28) into Eq. (29), and noting that we need a different variational parameter λ_i for each transformed node, we obtain:

    P(f_i = 1 | d) ≤ exp(λ_i (θ_{i0} + Σ_j θ_{ij} d_j) − f*(λ_i))
                  = exp(λ_i θ_{i0} − f*(λ_i)) ∏_j exp(λ_i θ_{ij} d_j).    (32)
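The noisy-OR bound in Eq. (29) can be verified directly. In this sketch (not from the paper), the conjugate f*(λ) is the one stated in the text, while the minimizing value λ = 1/(e^z − 1) is obtained by the usual slope-matching argument and is an assumption added here:

```python
import math

def conj(lam):
    # Conjugate of f(z) = ln(1 - exp(-z)): f*(lam) = -lam ln(lam) + (lam+1) ln(lam+1).
    return -lam * math.log(lam) + (lam + 1) * math.log(lam + 1)

def noisy_or_prob(z):
    # P(f_i = 1 | d) with z = theta_i0 + sum_j theta_ij d_j.
    return 1.0 - math.exp(-z)

def upper_bound(z, lam):
    return math.exp(lam * z - conj(lam))

for z in (0.2, 0.7, 1.5):
    # The exponential-of-linear form upper-bounds the noisy-OR for any lam > 0 ...
    for lam in (0.1, 0.5, 1.0, 3.0):
        assert upper_bound(z, lam) >= noisy_or_prob(z) - 1e-12
    # ... and is exact at the slope-matching value lam = 1/(e^z - 1).
    lam_star = 1.0 / (math.exp(z) - 1.0)
    assert abs(upper_bound(z, lam_star) - noisy_or_prob(z)) < 1e-9
```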
Figure 17. The QMR-DT graph after the lightly shaded finding has been subjected to a variational transformation. The effect is equivalent to delinking the node from the graph.

The final equation displays the effect of the variational transformation. The exponential factor outside of the product is simply a constant. The product is taken over all nodes in the parent set for node f_i, but unlike the case in which the graph is moralized for exact computation, the contributions associated with the d_j nodes are uncoupled. That is, each factor exp(λ_i θ_{ij} d_j) is simply a constant that can be multiplied into the probability that was previously associated with node d_j (i.e., P(d_j = 1)). There is no coupling of the d_j as there would be if we had taken products of the untransformed noisy-OR. Thus the graphical effect of the variational transformation is as shown in figure 17; the variational transformation delinks the i-th finding from the graph. In our particular example, the graph is now rendered singly connected and an exact inference algorithm can be invoked. (Recall that marginalizing over the unobserved symptoms simply removes them from the graph).

The sequential methodology utilized by Jaakkola and Jordan begins with a completely transformed graph and then reinstates exact conditional probabilities at selected nodes. To choose the ordering in which to reinstate nodes, Jaakkola and Jordan make use of a heuristic, basing the choice on the effect on the likelihood bound of reinstating each node individually starting from the completely transformed state. Despite the suboptimality of this heuristic, they found that it yielded an approximation that was orders of magnitude more accurate than that of an algorithm that used a random ordering. Given the ordering the algorithm then proceeds as follows: (1) Choose the next node in the ordering, and consider the effect of reintroducing the links associated with the node into the current graph. (2) If the resulting graph is still amenable to exact methods, reinstate the node and iterate. Otherwise stop and run an exact method. Finally, (3) we must also choose the parameters λ_i so as to make the approximation as tight as possible. It is not difficult to verify that products of the expression in Eq. (32) yield an overall bound that is a convex function of the λ_i parameters (Jaakkola & Jordan, 1999b). Thus standard optimization algorithms can be used to find good choices for the λ_i.

Figure 18 shows results from Jaakkola and Jordan (1999b) for approximate inference on four of the "CPC cases" that were mentioned earlier. For these four cases there were a sufficiently small number of positive findings that an exact algorithm could be run to provide a gold standard for comparison. The leftmost figure shows upper and lower bounds
Figure 18. (a) Exact values and variational upper and lower bounds on the log-likelihood for the four tractable CPC cases. (b) The mean correlation between the approximate and exact posterior marginals as a function of the execution time (seconds). Solid line: variational estimates; dashed line: likelihood-weighting sampling. The lines above and below the sampling result represent standard errors of the mean based on the ten independent runs of the sampler.

on the log-likelihood for these cases. Jaakkola and Jordan also calculated approximate posterior marginals for the diseases. The correlations of these marginals with the gold standard are shown in the rightmost figure. This figure plots accuracy against run time, for runs in which 8, 12, and 16 positive findings were treated exactly. Note that accurate values were obtained in less than a second. The figure also shows results from a state-of-the-art sampling algorithm (the likelihood-weighted sampler of Shwe and Cooper, 1991). The sampler required significantly more computer time than the variational method to obtain roughly comparable accuracy.

Jaakkola and Jordan (1999b) also presented results for the entire corpus of CPC cases. They again found that the variational method yielded reasonably accurate estimates of the posterior probabilities of the diseases (using lengthy runs of the sampler as a basis for comparison) within less than a second of computer time.

5.2. The Boltzmann machine

Let us now consider a rather different example. As we have discussed, the Boltzmann machine is a special subset of the class of undirected graphical models in which the potential functions are composed of products of quadratic and linear "Boltzmann factors." Jaakkola and Jordan (1997a) introduced a sequential variational algorithm for approximate inference in the Boltzmann machine. Their method, which we discuss in this section, yields both upper and lower bounds on marginal and conditional probabilities of interest.

Recall the form of the joint probability distribution for the Boltzmann machine:

    P(S) = exp(Σ_{i<j} θ_{ij} S_i S_j + Σ_i θ_{i0} S_i) / Z.

To obtain marginal probabilities such as P(E) under this joint distribution, we must calculate sums over exponentials of quadratic energy functions. Moreover, to obtain conditional
probabilities such as P(H | E), we take ratios of such sums, where the numerator requires fewer sums than the denominator. The most general such sum is the partition function itself, which is a sum over all configurations of S. Let us therefore focus on upper and lower bounds for the partition function as the general case; this allows us to calculate bounds on any other marginals or conditionals of interest.

Our approach is to perform the sums one sum at a time, introducing variational transformations to ensure that the resulting expression stays computationally tractable. In fact, at every step of the process that we describe, the transformed potentials involve no more than quadratic Boltzmann factors. (Exact methods can be viewed as creating increasingly higher-order terms when the marginalizing sums are performed). Thus the transformed Boltzmann machine remains a Boltzmann machine.

Let us first consider lower bounds. We write the partition function as follows:

    Z = Σ_{{S}\S_i} Σ_{S_i} exp(Σ_{j<k} θ_{jk} S_j S_k + Σ_j θ_{j0} S_j),

and attempt to find a tractable lower bound on the inner summand over S_i on the right-hand side. It is not difficult to show that this expression is log convex. Thus we bound its logarithm variationally:

    ln Σ_{S_i} exp(Σ_{j<k} θ_{jk} S_j S_k + Σ_j θ_{j0} S_j)
      = Σ_{j<k: j,k≠i} θ_{jk} S_j S_k + Σ_{j≠i} θ_{j0} S_j + ln Σ_{S_i} exp(S_i (θ_{i0} + Σ_j θ_{ij} S_j))
      = Σ_{j<k: j,k≠i} θ_{jk} S_j S_k + Σ_{j≠i} θ_{j0} S_j + ln(1 + exp(θ_{i0} + Σ_j θ_{ij} S_j))    (35)
      ≥ Σ_{j<k: j,k≠i} θ_{jk} S_j S_k + Σ_{j≠i} θ_{j0} S_j + λ_i (θ_{i0} + Σ_j θ_{ij} S_j) + H(λ_i),    (36)

where the sum in the first term on the right-hand side is a sum over all pairs (j, k) such that neither j nor k is equal to i, where H(λ_i) is as before the binary entropy function, and where λ_i is the variational parameter associated with node S_i. In the first line we have simply pulled outside of the sum all of those terms not involving S_i, and in the second line we have performed the sum over the two values of S_i. Finally, to lower bound the expression in Eq. (35) we need only lower bound the term ln(1 + exp(θ_{i0} + Σ_j θ_{ij} S_j)) on the right-hand side. But we have already found variational bounds for a related expression in treating the logistic function; recall Eq. (18). The upper bound in that case translates into the lower bound in the current context:

    ln(1 + e^z) ≥ λz + H(λ).
Figure 19. The transformation of the Boltzmann machine under the approximate marginalization over node S_i, for the case of lower bounds. (a) The Boltzmann machine before the transformation. (b) The Boltzmann machine after the transformation, where S_i has become delinked. All of the pairwise parameters, θ_{jk}, for j and k not equal to i, have remained unaltered. As suggested by the wavy lines, the linear coefficients have changed for those nodes that were neighbors of S_i.

This is the bound that we have utilized in Eq. (36).

Let us consider the graphical consequences of the bound in Eq. (36) (see figure 19). Note that for all nodes in the graph other than node S_i and its neighbors, the Boltzmann factors are unaltered (see the first two terms in the bound). Thus the graph is unaltered for such nodes. From the term in parentheses we see that the neighbors of node S_i have been endowed with new linear terms; importantly, however, these nodes have not become linked (as they would have become if we had done the exact marginalization). Neighbors that were linked previously remain linked with the same parameter. Node S_i is absent from the transformed partition function and thus absent from the graph, but it has left its trace via the new linear Boltzmann factors associated with its neighbors. We can summarize the effects of the transformation by noting that the transformed graph is a new Boltzmann machine with one fewer node and the following parameters:

    θ̃_{jk} = θ_{jk},
    θ̃_{j0} = θ_{j0} + λ_i θ_{ij}.

Note finally that we also have a constant term, λ_i θ_{i0} + H(λ_i), to keep track of. This term will have an interesting interpretation when we return to the Boltzmann machine later in the context of block methods.

Upper bounds are obtained in a similar way. We again break the partition function into a sum over a particular node S_i and a sum over the configurations of the remaining nodes S\S_i. Moreover, the first three lines of the ensuing derivation leading to Eq. (35) are identical. To complete the derivation we now find an upper bound on ln(1 + exp(θ_{i0} + Σ_j θ_{ij} S_j)). Jaakkola and Jordan (1997a) proposed using quadratic bounds for this purpose. In particular, they noted that:

    ln(1 + e^z) = z/2 + ln(e^{z/2} + e^{−z/2}),

and that ln(e^{z/2} + e^{−z/2}) is a concave function of z² (as can be verified by taking the second derivative with respect to z²). This implies that ln(1 + e^z) must have a quadratic upper bound of the following form:

    ln(1 + e^z) ≤ z/2 + λz² − f̄*(λ),

where f̄*(λ) is an appropriately defined conjugate function. Using these upper bounds in Eq. (35) we obtain:

    Z ≤ Σ_{{S}\S_i} exp(Σ_{j<k: j,k≠i} θ_{jk} S_j S_k + Σ_{j≠i} θ_{j0} S_j + z_i/2 + λ_i z_i² − f̄*(λ_i)),    (40)

where z_i = θ_{i0} + Σ_j θ_{ij} S_j and λ_i is the variational parameter associated with node S_i.

The graphical consequences of this transformation are somewhat different than those of the lower bounds (see figure 20).

Figure 20. The transformation of the Boltzmann machine under the approximate marginalization over node S_i, for the case of upper bounds. (a) The Boltzmann machine before the transformation. (b) The Boltzmann machine after the transformation, where S_i has become delinked. As the dashed edges suggest, all of the neighbors of S_i have become linked and those that were formerly linked have new parameter values. As suggested by the wavy lines, the neighbors of S_i also have new linear coefficients. All other edges and parameters are unaltered.

Considering the first two terms in the bound, we see that
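The lower-bound delinking transformation can be exercised end to end on a tiny example. The following sketch (not from the paper; the weights are invented) delinks node 0 of a 3-node Boltzmann machine using ln(1 + e^z) ≥ λz + H(λ), and confirms by brute-force enumeration that the transformed 2-node machine, times the constant factor, never exceeds the true partition function:

```python
import math
from itertools import product

def binary_entropy(lam):
    return -lam * math.log(lam) - (1 - lam) * math.log(1 - lam)

# A made-up 3-node Boltzmann machine: pairwise weights and biases.
theta = {(0, 1): 0.8, (0, 2): -0.4, (1, 2): 0.5}
bias = [0.2, -0.3, 0.6]

Z_exact = sum(
    math.exp(sum(w * s[i] * s[j] for (i, j), w in theta.items())
             + sum(bias[i] * s[i] for i in range(3)))
    for s in product((0, 1), repeat=3))

def lower_bound_Z(lam):
    # Delinking node 0 shifts the biases of its neighbors by lam * theta_0j
    # and contributes the constant exp(lam * theta_00 + H(lam)); the (1,2)
    # coupling is untouched, exactly as in the transformed parameters above.
    new_b1 = bias[1] + lam * theta[(0, 1)]
    new_b2 = bias[2] + lam * theta[(0, 2)]
    const = lam * bias[0] + binary_entropy(lam)
    return math.exp(const) * sum(
        math.exp(theta[(1, 2)] * s1 * s2 + new_b1 * s1 + new_b2 * s2)
        for s1, s2 in product((0, 1), repeat=2))

for lam in (0.2, 0.4, 0.6, 0.8):
    assert lower_bound_Z(lam) <= Z_exact
```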
it is still the case that the graph is unaltered for all nodes in the graph other than node S_i and its neighbors, and moreover neighbors of S_i that were previously linked remain linked. The quadratic term λ_i z_i², however, gives rise to new links between the previously unlinked neighbors of node S_i and alters the parameters between previously linked neighbors. Each of these nodes also acquires a new linear term. Expanding Eq. (40) and collecting terms (using S_j² = S_j for binary nodes), we see that the approximate marginalization has yielded a Boltzmann machine with the following parameters:

    θ̃_{jk} = θ_{jk} + 2λ_i θ_{ij} θ_{ik},
    θ̃_{j0} = θ_{j0} + θ_{ij}/2 + 2λ_i θ_{i0} θ_{ij} + λ_i θ_{ij}².

Finally, the constant term is given by θ_{i0}/2 + λ_i θ_{i0}² − f̄*(λ_i).

The graphical consequences of the lower and upper bound transformations also have computational consequences. In particular, given that the lower bound transformation introduces no additional links when nodes are delinked, it is somewhat more natural to combine these transformations with exact methods. In particular, the algorithm simply delinks nodes until a tractable structure (such as a tree) is revealed; at this point an exact algorithm is called as a subroutine. The upper bound transformation, on the other hand, by introducing links between the neighbors of a delinked node, does not reveal tractable structure as readily. This seeming disadvantage is mitigated by the fact that the upper bound is a tighter bound (Jaakkola & Jordan, 1997a).

6. The block approach

An alternative approach to variational inference is to designate in advance a set of nodes that are to be transformed. We can in principle view this "block approach" as an off-line application of the sequential approach. In the case of lower bounds, however, there are advantages to be gained by developing a methodology that is specific to block transformation. In this section, we show that a natural global measure of approximation accuracy can be obtained for lower bounds via a block version of the variational formalism. The method meshes readily with exact methods in cases in which tractable substructure can be identified in the graph. This approach was first presented by Saul and Jordan (1996), as a refined version of mean field theory for Markov random fields, and has been developed further in a number of recent studies (e.g., Ghahramani & Jordan, 1997; Ghahramani & Hinton, 1996; Jordan et al., 1997).

In the block approach, we begin by identifying a substructure in the graph of interest that we know is amenable to exact inference methods (or, more generally, to efficient approximate inference methods). For example, we might pick out a tree or a set of chains in the original graph. We wish to use this simplified structure to approximate the probability distribution on the original graph. To do
so, we consider a family of probability distributions that are obtained from the simplified graph via the introduction of variational parameters. We choose a particular approximating distribution from the simplifying family by making a particular choice for the variational parameters. As in the sequential approach, a new choice of variational parameters must be made each time new evidence is available.

More formally, let P(S) represent the joint distribution on the graphical model of interest, where as before S represents all of the nodes of the graph and H and E are disjoint subsets of S representing the hidden nodes and the evidence nodes, respectively. We wish to approximate the conditional probability P(H | E). We introduce an approximating family of conditional probability distributions, Q(H | E; λ), where λ are variational parameters. The graph representing Q is not generally the same as the graph representing P; generally it is a sub-graph. From the family of approximating distributions, we choose a particular distribution by minimizing the Kullback-Leibler (KL) divergence, D(Q ‖ P), with respect to the variational parameters:

    λ* = argmin_λ D(Q(H | E; λ) ‖ P(H | E)),    (41)

where for any probability distributions Q and P the KL divergence is defined as follows:

    D(Q ‖ P) = Σ_{H} Q(H | E) ln [Q(H | E) / P(H | E)].

The minimizing values of the variational parameters, λ*, define a particular distribution, Q(H | E; λ*), that we treat as the best approximation of P(H | E) in the family Q(H | E; λ).

One simple justification for using the KL divergence as a measure of approximation accuracy is that it yields the best lower bound on the probability of the evidence P(E) (i.e., the likelihood) in the family of approximations Q(H | E; λ). Indeed, we bound the logarithm of P(E) using Jensen's inequality as follows:

    ln P(E) = ln Σ_{H} P(H, E)
            = ln Σ_{H} Q(H | E) [P(H, E) / Q(H | E)]
            ≥ Σ_{H} Q(H | E) ln [P(H, E) / Q(H | E)].    (43)

The difference between the left and right hand sides of this equation is easily seen to be the KL divergence D(Q ‖ P). Thus, by the positivity of the KL divergence (Cover & Thomas, 1991), the right-hand side of Eq. (43) is a lower bound on ln P(E). Moreover, by choosing λ according to Eq. (41), we obtain the tightest lower bound.

6.1. Convex duality and the KL divergence

We can also justify the choice of KL divergence by making an appeal to convex duality theory, thereby linking the block approach with the sequential approach (Jaakkola, 1997). Consider, for simplicity, the case of discrete-valued nodes H. The distribution Q(H | E; λ)
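Jensen's bound in Eq. (43), and its tightness at the true posterior, are easy to confirm on a toy discrete model (a sketch; the joint probabilities below are invented for illustration):

```python
import math

# Toy joint over hidden H in {0, 1, 2} with the evidence E held fixed.
p_joint = [0.15, 0.05, 0.10]          # P(H = h, E = e_obs), made-up values
p_evidence = sum(p_joint)             # P(E)
posterior = [p / p_evidence for p in p_joint]

def lower_bound(q):
    # Jensen bound: ln P(E) >= sum_H Q(H|E) ln [ P(H,E) / Q(H|E) ].
    return sum(qi * math.log(pi / qi)
               for qi, pi in zip(q, p_joint) if qi > 0)

# Any distribution Q gives a lower bound ...
for q in ([1.0 / 3] * 3, [0.6, 0.2, 0.2], [0.98, 0.01, 0.01]):
    assert lower_bound(q) <= math.log(p_evidence) + 1e-12
# ... and the gap (the KL divergence) vanishes exactly at the true posterior.
assert abs(lower_bound(posterior) - math.log(p_evidence)) < 1e-12
```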
can be viewed as a vector of real numbers, one for each configuration of the variables H. Treat this vector as the vector-valued variational parameter "λ" in Eq. (23). Moreover, the log probability ln P(H, E) can also be viewed as a vector of real numbers, defined on the set of configurations of H. Treat this vector as the variable "x" in Eq. (23). Finally, define f(x) to be ln P(E). It can be verified that the following expression for ln P(E):

    ln P(E) = ln Σ_{H} exp(ln P(H, E))

is indeed convex in the values ln P(H, E). Moreover, by direct substitution in Eq. (22):

    f*(Q) = max_{ln P(H,E)} { Σ_{H} Q(H | E) ln P(H, E) − ln Σ_{H} exp(ln P(H, E)) },

and optimizing with respect to ln P(H, E) (the expression in braces is concave in ln P(H, E), so the stationary point is a maximum), the conjugate function f*(Q) is seen to be the negative entropy function Σ_{H} Q(H | E) ln Q(H | E). Thus, using Eq. (23), we can lower bound the log likelihood as follows:

    ln P(E) ≥ Σ_{H} Q(H | E) ln P(H, E) − Σ_{H} Q(H | E) ln Q(H | E).    (46)

This is identical to Eq. (43). Moreover, we see that we could in principle recover the exact log likelihood if we were allowed to range over all probability distributions Q(H | E). By ranging over a parameterized family Q(H | E; λ), we obtain the tightest lower bound that is available within the family.

6.2. Parameter estimation via variational methods

Neal and Hinton (1999) have pointed out that the lower bound in Eq. (46) has a useful role to play in the context of maximum likelihood parameter estimation. In particular, they make a link between this lower bound and parameter estimation via the EM algorithm.

Let us augment our notation to include parameters θ in the specification of the joint probability distribution P(H, E | θ). As before, we designate a subset of the nodes E as the observed evidence. The marginal probability P(E | θ), thought of as a function of θ, is known as the likelihood. The EM algorithm is a method for maximum likelihood parameter estimation that hill-climbs in the log likelihood. It does so by making use of the convexity relationship between ln P(H, E | θ) and ln P(E | θ) described in the previous section.

In Section 6 we showed that the function

    L(Q, θ) = Σ_{H} Q(H | E) ln [P(H, E | θ) / Q(H | E)]    (47)

is a lower bound on the log likelihood for any probability distribution Q(H | E). Moreover, we showed that the difference between ln P(E | θ) and the bound L(Q, θ) is the KL divergence between Q(H | E) and P(H | E, θ). Suppose now that we allow Q(H | E) to
INTRODUCTIONTOVARIATIONALMETHODSrangeoverallpossibleprobabilitydistributionsonandminimizetheKLdivergence.Itisastandardresult(cf.Cover&Thomas,1991)thattheKLdivergenceisminimizedby;µ/,andthattheminimalvalueiszero.ThisisveriÞedby;µ/intotheright-handsideofEq.(47)andrecoveringlnThissuggeststhefollowingalgorithm.Startingfromaninitialparametervector,weiteratethefollowingtwosteps,knownastheÒE(expe

ctation) step" and the "M (maximization) step." First, we maximize the bound L(q, θ) with respect to the probability distribution q. Second, we fix q and maximize the bound L(q, θ) with respect to the parameters θ. More formally, we have:

    (E step):  q^(t+1)(H | E) = argmax_q L(q, θ^(t))    (48)

    (M step):  θ^(t+1) = argmax_θ L(q^(t+1), θ)    (49)

which is coordinate ascent in L(q, θ).

This can be related to the traditional presentation of the EM algorithm (Dempster, Laird, & Rubin, 1977) by noting that for fixed q, the right-hand side of Eq. (47) is a function of θ only through the ln P(H, E | θ) term. Thus maximizing L(q, θ) with respect to θ in the M step is equivalent to maximizing the following function:

    Σ_H q^(t+1)(H | E) ln P(H, E | θ)    (50)

Maximization of this function, known as the "complete log likelihood" in the EM literature, defines the M step in the traditional presentation of EM.

Let us now return to the situation in which we are unable to compute the full conditional distribution P(H | E, θ). In such cases variational methodology suggests that we consider a family of approximating distributions. Although we are no longer able to perform a true EM iteration given that we cannot avail ourselves of P(H | E, θ), we can still perform coordinate ascent in the lower bound L(q, θ). Indeed, the variational strategy of minimizing the KL divergence with respect to the variational parameters that define the approximating family is exactly a restricted form of coordinate ascent in the first argument of L(q, θ). We then follow this step by an "M step" that increases the lower bound with respect to the parameters θ.

This point of view, which can be viewed as a computationally tractable approximation to the EM algorithm, has been exploited in a number of recent architectures, including the sigmoid belief network, factorial hidden Markov model and hidden Markov decision tree architectures that we discuss in the following sections, as well as the "Helmholtz machine" of Dayan et al. (1995) and Hinton et al. (1995).

6.3. Examples

We now return to the problem of picking a tractable variational parameterization for a given graphical model. We wish to pick a simplified graph which is both rich enough to provide distributions that are close to the true distribution, and simple enough so that an exact algorithm can be utilized efficiently for calculations under the approximate distribution. Similar considerations hold for the variational parameterization: the variational parameterization must be representationally rich so that good approximations are available and yet simple enough so that a procedure that minimizes the KL divergence has some hope of finding good parameters and not getting stuck in a local minimum. It is not necessarily possible to realize all of these desiderata simultaneously; however, in a number of cases it has been found that relatively simple variational approximations can yield reasonably accurate solutions. In this section we discuss several such examples.

6.3.1. Mean field Boltzmann machine. In Section 5.2 we discussed a sequential variational algorithm that yielded upper and lower bounds for the Boltzmann machine. We now revisit the Boltzmann machine within the context of the block approach and discuss lower bounds. We also relate the two approaches.

Recall that the joint probability for the Boltzmann machine can be written as follows:

    P(S | θ) = exp{ Σ_{i<j} θ_{ij} S_i S_j + Σ_i θ_{i0} S_i } / Z    (51)

where θ_{ij} = 0 for nodes S_i and S_j that are not neighbors in the graph. Consider now the representation of the conditional distribution P(H | E, θ) in a Boltzmann machine. For S_i ∈ E and S_j ∈ E, the contribution θ_{ij} S_i S_j reduces to a constant, which vanishes when we normalize. If S_i ∈ H and S_j ∈ E, the quadratic contribution becomes a linear contribution that we associate with node S_i. Finally, linear terms associated with nodes S_i ∈ E also become constants and vanish. In summary, we can express the conditional distribution P(H | E, θ) as follows:

    P(H | E, θ) = exp{ Σ_{i<j} θ_{ij} S_i S_j + Σ_i θ_{i0}^c S_i } / Z_c    (52)

where the sums are restricted to range over nodes in H and the updated parameters θ_{i0}^c include contributions associated with the evidence nodes:

    θ_{i0}^c = θ_{i0} + Σ_{j∈E} θ_{ij} S_j    (53)

The updated partition function Z_c is given as follows:

    Z_c = Σ_H exp{ Σ_{i<j} θ_{ij} S_i S_j + Σ_i θ_{i0}^c S_i }    (54)

In sum, we have a Boltzmann machine on the subset H.

The "mean field" approximation (Peterson & Anderson, 1987) for Boltzmann machines is a particular form of variational approximation in which a completely factorized distribution
Figure 21. (a) A node S_i in a Boltzmann machine with its Markov blanket. (b) The approximating mean field distribution is based on a graph with no edges. The mean field equations yield a deterministic relationship, represented in the figure with the dotted lines, between the variational parameters μ_i and μ_j for nodes S_j in the Markov blanket of node S_i.

is used to approximate P(H | E, θ). That is, we consider the simplest possible approximating distribution; one that is obtained by dropping all of the edges in the Boltzmann graph (see figure 21). For this choice of q(H | E, μ) (where we now use μ to represent the variational parameters), we have little choice as to the variational parameterization: to represent as large an approximating family as possible we endow each degree of freedom S_i with its own variational parameter μ_i. Thus q can be written as follows:

    q(H | E, μ) = Π_{i∈H} μ_i^{S_i} (1 − μ_i)^{1−S_i}    (55)

where the product is taken over the hidden nodes H.

Forming the KL divergence between the fully factorized q distribution and the true conditional distribution in Eq. (52), we obtain:

    KL(q ‖ P) = Σ_i [μ_i ln μ_i + (1 − μ_i) ln(1 − μ_i)] − Σ_{i<j} θ_{ij} μ_i μ_j − Σ_i θ_{i0}^c μ_i + ln Z_c    (56)

where the sums range across nodes in H. In deriving this result we have used the fact that, under the q distribution, S_i and S_j are independent random variables with mean values μ_i and μ_j.

We now take derivatives of the KL divergence with respect to μ_i (noting that Z_c is independent of μ_i) and set the derivative to zero to obtain the following equations:

    μ_i = σ( Σ_j θ_{ij} μ_j + θ_{i0}^c )    (57)

where σ(z) = 1/(1 + e^{−z}) is the logistic function and we define θ_{ji} equal to θ_{ij}. Equation (57) defines a set of coupled equations known as the "mean field equations." These equations are solved iteratively for a fixed point solution. Note that each variational parameter μ_i updates its value based on a sum across the variational parameters in its Markov blanket (cf. figure 21(b)). This can be viewed as a variational form of a local message passing algorithm.

Peterson and Anderson (1987) compared the mean field approximation to Gibbs sampling on a set of test cases and found that it ran 10–30 times faster, while yielding a roughly equivalent level of accuracy. There are cases, however, in which the mean field approximation is known to break down. These cases include sparse Boltzmann machines and Boltzmann machines with "frustrated" interactions; these are networks whose potential functions embody constraints between neighboring nodes that cannot be simultaneously satisfied (see also Galland, 1993). In the case of sparse networks, exact algorithms can provide help; indeed, this observation led to the use of exact algorithms as subroutines within the "structured mean field" approach pursued by Saul and Jordan (1996).

Let us now consider the parameter estimation problem for Boltzmann machines. Writing out the lower bound in Eq. (47) for this case, we have:

    ln P(E | θ) ≥ Σ_{i<j} θ_{ij} μ_i μ_j + Σ_i θ_{i0}^c μ_i − ln Z − Σ_i [μ_i ln μ_i + (1 − μ_i) ln(1 − μ_i)]    (58)

Taking the derivative with respect to θ_{ij} yields a gradient which has a simple "Hebbian" term μ_i μ_j as well as a contribution from the derivative of ln Z with respect to θ_{ij}. It is not hard to show that this derivative is ⟨S_i S_j⟩, where the brackets signify an average with respect to the unconditional distribution P(S | θ). Thus we have the following gradient algorithm for performing an approximate M step:

    Δθ_{ij} ∝ μ_i μ_j − ⟨S_i S_j⟩    (59)

Unfortunately, however, given our assumption that calculations under the Boltzmann distribution are intractable for the graph under consideration, it is intractable to compute the unconditional average ⟨S_i S_j⟩. We can once again appeal to mean field theory and compute an approximation to ⟨S_i S_j⟩, where we now use a factorized distribution on all of the nodes; however, the M step is now a difference of gradients of two different bounds and is therefore no longer guaranteed to increase the lower bound. There is a more serious problem, moreover, which is particularly salient in unsupervised learning problems. If the data set of interest is a heterogeneous collection of sub-populations, such as in unsupervised classification problems, the unconditional distribution will generally be required to have multiple modes. Unfortunately the factorized mean field approximation is unimodal and is a poor approximation for a multi-modal distribution. One approach to this problem is to utilize multi-modal q distributions within the mean-field framework; for example, Jaakkola and Jordan (1999a) discuss the use of mixture models as approximating distributions. These issues find a more satisfactory treatment in the context of directed graphs, as we see in the following section. In particular, the gradient for a directed graph (cf. the gradient discussed in Section 6.3.2) does not require averages under the unconditional distribution.
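To make the iteration concrete, here is a minimal sketch (our own illustrative code, not from the paper) that absorbs the evidence into the biases and then iterates the mean field equations to a fixed point; when a hidden node's only neighbors are observed, the fixed point reproduces the exact conditional marginal, while in general it is only an approximation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_boltzmann(theta, theta0, evidence, n_iter=200):
    """Naive mean field for a Boltzmann machine with 0/1 units.

    theta    : (N, N) symmetric weight matrix with zero diagonal
               (theta[i, j] = 0 for non-neighbors)
    theta0   : (N,) biases
    evidence : dict mapping observed node index -> 0/1 value

    Evidence is first absorbed into the biases,
        theta0_c[i] = theta0[i] + sum_{j in E} theta[i, j] * S_j,
    and then the mean field equations
        mu_i = sigmoid(sum_j theta[i, j] * mu_j + theta0_c[i])
    are iterated to a fixed point over the hidden nodes.
    """
    n = len(theta0)
    hidden = [i for i in range(n) if i not in evidence]
    s = np.zeros(n)
    for j, v in evidence.items():
        s[j] = v
    theta0_c = theta0 + theta @ s   # absorb evidence into the linear terms
    mu = np.zeros(n)                # evidence entries stay 0: their effect
    mu[hidden] = 0.5                # is already inside theta0_c
    for _ in range(n_iter):
        for i in hidden:            # asynchronous (node-by-node) updates
            mu[i] = sigmoid(theta[i] @ mu + theta0_c[i])
    return {i: float(mu[i]) for i in hidden}
```

Each pass updates μ_i from the current values of the parameters in its Markov blanket, which is exactly the local message-passing interpretation described above; different initializations can land in different local minima of the KL divergence.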
Finally, let us consider the relationship between the mean field approximation and the lower bounds that we obtained via a sequential algorithm in Section 5.2. In fact, if we run the latter algorithm until all nodes are eliminated from the graph, we obtain a bound that is identical to the mean field bound (Jaakkola, 1997). To see this, note that for a Boltzmann machine in which all of the nodes have been eliminated there are no quadratic and linear terms; only the constant terms remain. Recall from Section 5.2 that the constant that arises when node S_i is removed is θ̃_{i0} μ_i − [μ_i ln μ_i + (1 − μ_i) ln(1 − μ_i)], where θ̃_{i0} refers to the value of θ_{i0} after it has been updated to absorb the linear terms from previously eliminated nodes. (Recall that the latter update is given by θ_{j0} ← θ_{j0} + θ_{ij} μ_i for the removal of a particular node S_i that is a neighbor of S_j.) Collecting together such updates for θ̃_{i0}, and summing across all nodes in H, we find that the resulting constant term is given as follows:

    Σ_{i<j} θ_{ij} μ_i μ_j + Σ_i θ_{i0}^c μ_i − Σ_i [μ_i ln μ_i + (1 − μ_i) ln(1 − μ_i)]    (60)

This differs from the lower bound in Eq. (58) only by the term ln Z, which plays no role when we maximize with respect to μ.

6.3.2. Neural networks. As discussed in Section 3, the "sigmoid belief network" is essentially a (directed) neural network with graphical model semantics. We utilize the logistic function as the node probability function:

    P(S_i = 1 | S_{π(i)}) = 1 / (1 + exp{ −Σ_j θ_{ij} S_j − θ_{i0} })    (61)

where we assume that θ_{ij} = 0 unless S_j is a parent of S_i. (In particular, θ_{ij} and θ_{ji} cannot both be nonzero, since the graph is acyclic.) Noting that the probabilities for both the S_i = 1 case and the S_i = 0 case can be written in a single expression as follows:

    P(S_i | S_{π(i)}) = exp{ S_i z_i } / (1 + exp{ z_i }),  where z_i = Σ_j θ_{ij} S_j + θ_{i0}    (62)

we obtain the following representation for the joint distribution:

    P(S | θ) = Π_i [ exp{ S_i z_i } / (1 + exp{ z_i }) ]    (63)

We wish to calculate conditional probabilities under this joint distribution. As we have seen (cf. figure 6), inference for general sigmoid belief networks is intractable, and thus it is sensible to consider variational approximations. Saul, Jaakkola, and Jordan (1996) and Saul and Jordan (1999) have explored the viability of the simple completely factorized distribution. Thus once again we set:

    q(H | E, μ) = Π_{i∈H} μ_i^{S_i} (1 − μ_i)^{1−S_i}    (64)

and attempt to find the best such approximation by varying the parameters μ_i.

The computation of the KL divergence proceeds much as it does in the case of the mean field Boltzmann machine. The entropy term (⟨ln q⟩) is the same as before. The energy term (⟨ln P⟩) is found by taking the logarithm of Eq. (63) and averaging with respect to q. Putting these results together, we obtain:

    ln P(E | θ) ≥ Σ_i [ Σ_j θ_{ij} μ_i μ_j + θ_{i0} μ_i − ⟨ln(1 + e^{z_i})⟩ ] − Σ_{i∈H} [μ_i ln μ_i + (1 − μ_i) ln(1 − μ_i)]    (65)

where ⟨·⟩ denotes an average with respect to the q distribution, and where we have abused notation by defining values μ_i for the evidence nodes; these are set to the instantiated values of S_i for i ∈ E.

Note that, despite the fact that q is factorized, we are unable to calculate the average of ln[1 + e^{z_i}], where z_i = Σ_j θ_{ij} S_j + θ_{i0}. This is an important term which arises directly from the directed nature of the sigmoid belief network (it arises from the denominator of the sigmoid, a factor which is necessary to define the sigmoid as a local conditional probability). To deal with this term, Saul et al. (1996) introduced an additional variational transformation, due to Seung (1995), that can be viewed as a refined form of Jensen's inequality. In particular:

    ⟨ln[1 + e^{z_i}]⟩ = ⟨ln[ e^{ξ_i z_i} e^{−ξ_i z_i} (1 + e^{z_i}) ]⟩
                     = ξ_i ⟨z_i⟩ + ⟨ln[ e^{−ξ_i z_i} + e^{(1−ξ_i) z_i} ]⟩
                     ≤ ξ_i ⟨z_i⟩ + ln ⟨ e^{−ξ_i z_i} + e^{(1−ξ_i) z_i} ⟩    (66)

where ξ_i is a variational parameter. (Note that the inequality reduces to the standard Jensen inequality for ξ_i = 0.) The final result can be utilized directly in Eq. (65) to provide a tractable lower bound on the log likelihood, and the variational parameter ξ_i can be optimized along with the other variational parameters. Saul and Jordan (1999) show that in the limiting case of networks in which each hidden node has a large number of parents, so that a central limit theorem can be invoked, the parameter ξ_i has a probabilistic interpretation as the approximate expectation of σ(z_i), where σ(·) is again the logistic function.

Figure 22. (a) A node S_i in a sigmoid belief network with its Markov blanket. (b) The mean field equations yield a deterministic relationship, represented in the figure with the dotted lines, between the variational parameters μ_i and μ_j for nodes S_j in the Markov blanket of node S_i.

For fixed values of the parameters θ, by differentiating the KL divergence with respect to the variational parameters μ_i, we obtain the following consistency equations:

    μ_i = σ( Σ_j θ_{ij} μ_j + θ_{i0} + Σ_j [ θ_{ji} (μ_j − ξ_j) − K_{ji} ] )    (67)

where K_{ji} is the derivative of ln⟨ e^{−ξ_j z_j} + e^{(1−ξ_j) z_j} ⟩ with respect to μ_i. As Saul et al. show, this term depends on node S_i, its child S_j, and the other parents (the "co-parents") of S_j. Given that the first term is a sum over contributions from the parents of node S_i and the second term is a sum over contributions from the children of node S_i, we see that the consistency equation for a given node again involves contributions from the Markov blanket of the node (see figure 22). Thus, as in the case of the Boltzmann machine, we find that the variational parameters are linked via their Markov blankets and the consistency equation (Eq. (67)) can be interpreted as a local message-passing algorithm.

Saul, Jaakkola, and Jordan (1996) and Saul and Jordan (1999) also show how to update the variational parameters ξ_i. The two papers utilize these parameters in slightly different ways and obtain different update equations. (Yet another related variational approximation for the sigmoid belief network, including both upper and lower bounds, is presented in Jaakkola and Jordan, 1996.)

Finally, we can compute the gradient with respect to the parameters θ_{ij} for fixed variational parameters μ and ξ; Saul and Jordan (1999) give the resulting expression, which involves only averages under the factorized q distribution. Note that there is no need to calculate variational parameters under the unconditional distribution P(S | θ), as in the case of the Boltzmann machine (a fact first noted by Neal, 1992).

Figure 23. The leftmost figure shows examples of the images used by Saul and Jordan (1999) in training their handwritten digit classifier. The rightmost figure shows examples of images whose bottom halves were inferred from their top halves via a variational inference algorithm.

Note also the interesting appearance of a regularization term: the gradient contains a "weight decay" term that is maximal for non-extreme values of the variational parameters (both μ and ξ are bounded between zero and one). Thus, this computationally-motivated approximation to maximum likelihood estimation is in fact a form of penalized maximum likelihood estimation.

Saul and Jordan (1999) tested the sigmoid belief network on a handwritten digit classification problem, obtaining results that were competitive with other supervised learning systems. Examples of the digits that Saul and Jordan used are shown in figure 23. Figure 23 also illustrates the ability of a sigmoid belief network to "fill in" missing data. All of the pixels in the bottom halves of these images were treated as missing and their values were inferred via the variational inference equations. As a result of this capacity for filling in missing data, the degradation in classification performance with missing pixels is slight; indeed, Saul and Jordan reported that the classification error went from 5 percent to 12 percent when half of the pixels were missing.

For further comparative empirical work on sigmoid belief networks and related architectures, including comparisons with Gibbs sampling, see Frey, Hinton, and Dayan (1996).

6.3.3. Factorial hidden Markov models. The factorial hidden Markov model (FHMM) is a multiple chain structure (see figure 24(a)). Using the notation developed earlier (see Section 3.5), the joint probability distribution for the FHMM is given by:

    P({X_t^(m)}, {Y_t} | θ) = Π_{m=1}^M [ π^(m)(X_1^(m)) Π_{t=2}^T A^(m)(X_t^(m) | X_{t−1}^(m)) ] Π_{t=1}^T P(Y_t | X_t^(1), …, X_t^(M))    (69)

Computation under this probability distribution is generally infeasible because, as we saw earlier, the clique size becomes unmanageably large when the FHMM chain structure is moralized and triangulated. Thus it is necessary to consider approximations.
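The clique blowup can be made concrete: because the chains are a priori independent, the combined state of an FHMM evolves as a single Markov chain on the product state space, and its transition matrix is the Kronecker product of the per-chain transition matrices. A small sketch (the function name is ours, for illustration only):

```python
import numpy as np

def combined_transition(As):
    """Transition matrix of the single Markov chain on the product state
    space that is equivalent to an FHMM's M independent component chains.
    For M chains with N states each, the result is N**M x N**M, which is
    why naive exact inference becomes infeasible as M grows."""
    A = np.array([[1.0]])
    for Am in As:
        A = np.kron(A, np.asarray(Am, dtype=float))
    return A
```

For M chains of N states this matrix has N^M rows, so even modest M and N rule out running a standard HMM algorithm on the combined chain.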
Figure 24. (a) The FHMM. (b) A variational approximation for the FHMM can be obtained by picking out a tractable substructure in the FHMM graph. Parameterizing this graph leads to a family of tractable approximating distributions.

For the FHMM there is a natural substructure on which to base a variational algorithm. In particular, the chains that compose the FHMM are individually tractable. Therefore, rather than removing all of the edges, as in the naive mean field approximation discussed in the previous two sections, it would seem more reasonable to remove only as many edges as are necessary to decouple the chains. In particular, we remove the edges that link the state nodes to the output nodes (see figure 24(b)). Without these edges the moralization process no longer links the state nodes and no longer creates large cliques. In fact, the moralization process on the delinked graph in figure 24(b) is vacuous, as is the triangulation. Thus the cliques on the delinked graph are of size N, where N is the number of states for a single chain. One iteration of approximate inference runs in time O(M T N²), where M is the number of chains and T is the length of the time series.

Let us now consider how to express a variational approximation using the delinked graph of figure 24(b) as an approximation. The idea is to introduce one free parameter into the approximating probability distribution q for each edge that we have dropped. These free parameters, which we denote as λ_t^(m), essentially serve as surrogates for the effect of the observation Y_t at time t on state component X_t^(m). When we optimize the KL divergence with respect to these parameters they become interdependent; this (deterministic) interdependence can be viewed as an approximation to the probabilistic dependence that is captured in an exact algorithm via the moralization process.

Referring to figure 24(b), we write the approximating distribution in the following factorized form:

    q({X_t^(m)} | E, θ, λ) = (1/Z_q) Π_{m=1}^M π̃^(m)(X_1^(m)) Π_{t=2}^T Ã^(m)(X_t^(m) | X_{t−1}^(m))    (70)

where λ is the vector of variational parameters. We define the transition matrix Ã^(m) to be the product of the exact transition matrix and the variational parameter:

    Ã^(m)(X_t^(m) | X_{t−1}^(m)) = A^(m)(X_t^(m) | X_{t−1}^(m)) λ_t^(m)(X_t^(m))

and similarly for the initial state probabilities:

    π̃^(m)(X_1^(m)) = π^(m)(X_1^(m)) λ_1^(m)(X_1^(m))
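Under the approximating family of Eq. (70) each chain is an ordinary Markov chain whose initial-state and transition probabilities have been reweighted by the λ potentials, so posterior expectations can be computed chain-by-chain with the forward-backward recursions. A minimal sketch for a single decoupled chain (the helper name and the (T, N) array layout for λ are our own assumptions):

```python
import numpy as np

def chain_marginals(pi, A, lam):
    """Posterior marginals q(X_t) for one decoupled FHMM chain under
        q(x_1..x_T) proportional to pi[x_1]*lam[0, x_1] *
                                    prod_{t>1} A[x_{t-1}, x_t]*lam[t, x_t],
    where lam[t] is the vector of variational potentials standing in for
    the deleted observation edge at time t. Standard forward-backward
    recursions on the reweighted chain."""
    pi = np.asarray(pi, dtype=float)
    A = np.asarray(A, dtype=float)
    lam = np.asarray(lam, dtype=float)
    T, N = lam.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * lam[0]
    for t in range(1, T):                  # forward pass
        alpha[t] = (alpha[t - 1] @ A) * lam[t]
    for t in range(T - 2, -1, -1):         # backward pass
        beta[t] = A @ (lam[t + 1] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

In the full algorithm a routine like this would be called once per chain in the first phase of each iteration, with the λ updates of the second phase coupling the chains.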
This family of distributions respects the conditional independence statements of the approximate graph in figure 24, and provides additional degrees of freedom via the variational parameters λ_t^(m).

Ghahramani and Jordan (1997) present the equations that result from minimizing the KL divergence between the approximating probability distribution (Eq. (70)) and the true probability distribution (Eq. (69)). The result can be summarized as follows. As in the other architectures that we have discussed, the equation for a variational parameter (λ_t^(m)) is a function of terms that are in the Markov blanket of the corresponding delinked node (i.e., Y_t). In particular, the update for λ_t^(m) depends on the parameters λ_t^(n), for n ≠ m, thus linking the variational parameters at time t. Moreover, the update for λ_t^(m) depends on the expected value of the states, where the expectation is taken under the distribution q. Given that the chains are decoupled under q, expectations are found by running one of the exact algorithms (for example, the forward-backward algorithm for HMMs), separately for each chain. These expectations of course depend on the current values of the parameters λ (cf. Eq. (70)), and it is this dependence that effectively couples the chains.

To summarize, fitting the variational parameters for a FHMM is an iterative, two-phase procedure. In the first phase, an exact algorithm is run as a subroutine to calculate expectations for the hidden states. This is done independently for each of the M chains, making reference to the current values of the parameters λ_t^(m). In the second phase, the parameters λ_t^(m) are updated based on the expectations computed in the first phase. The procedure then returns to the first phase and iterates.

Ghahramani and Jordan (1997) reported results on fitting an FHMM to the Bach chorale data set (Merz & Murphy, 1996). They showed that significantly larger effective state spaces could be fit with the FHMM than with an unstructured HMM, and that performance in terms of probability of the test set was an order of magnitude larger for the FHMM. Moreover, evidence of overfitting was seen for the HMM for 35 states or more; no evidence of overfitting for the FHMM was seen for up to 1000 states.

6.3.4. Hidden Markov decision trees. As a final example we return to the hidden Markov decision tree (HMDT) described in the introduction and briefly discuss variational approximation for this architecture. As we have discussed, a HMDT is essentially a Markov time series model, where the probability model at each time step is a (probabilistic) decision tree with hidden decision nodes. The Markovian dependence is obtained via separate transition matrices at the different levels of the decision tree, giving the model a factorized structure.

The variational approach to fitting a HMDT is closely related to that of fitting a FHMM
; however, there are additional choices as to the variational approximation. In particular, we have two substructures worth considering in the HMDT: (1) Dropping the vertical edges, we recover a decoupled set of chains. As in the FHMM, these chains can each be handled by the forward-backward algorithm. (2) Dropping the horizontal edges, we recover a decoupled set of decision trees. We can calculate probabilities in these trees using the posterior propagation algorithm described in Jordan (1994).

The first approach, which we refer to as the "forest of chains approximation," is shown in figure 25. As in the FHMM, we write a variational approximation for the forest of chains approximation by respecting the conditional independencies in the approximating graph

Figure 25. The "forest of chains approximation" for the HMDT. Parameterizing this graph leads to an approximating family of q distributions.

Figure 26. The "forest of trees approximation" for the HMDT. Parameterizing this graph leads to an approximating family of q distributions.

and incorporating variational parameters to obtain extra degrees of freedom (see Jordan et al., 1997, for the details).

We can also consider a "forest of trees approximation" in which the horizontal links are eliminated (see figure 26). Given that the decision tree is a fully connected graph, this is essentially a naive mean field approximation on a hypergraph.

Finally, it is also possible to develop a variational algorithm for the HMDT that is analogous to the Viterbi algorithm for HMMs. In particular, we utilize an approximation q that assigns probability one to a single path in the state space. The KL divergence for this q distribution is particularly easy to evaluate, given that the entropy contribution to the KL divergence (i.e., the ⟨ln q⟩ term) is zero. Moreover, the evaluation of the energy (i.e., the ⟨ln P⟩ term) reduces to substituting the states along the chosen path into the P distribution. The resulting algorithm involves a subroutine in which a standard Viterbi algorithm is run on a single chain, with the other chains held fixed. This subroutine is run on each chain in turn.

Jordan et al. (1997) found that performance of the HMDT on the Bach chorales was essentially the same as that of the FHMM. The advantage of the HMDT was its greater interpretability; most of the runs resulted in a coarse-to-fine ordering of the temporal scales of the Markov processes from the top to the bottom of the tree.

7. Discussion

We have described a variety of applications of variational methods to problems of inference and learning in graphical models. We hope to have convinced the reader that variational methods can provide a powerful and elegant tool for graphical models, and that the algorithms that result are simple and intuitively appealing. It is important to emphasize, however, that research on variational methods for graphical models is of quite recent origin, and there are many open problems and unresolved issues. In this section we discuss a number of these issues. We also broaden the scope of the presentation and discuss a number of related strands of research.

7.1. Related research

The methods that we have discussed all involve deterministic, iterative approximation algorithms. It is of interest to discuss related approximation schemes that are either non-deterministic or non-iterative.

7.1.1. Recognition models and the Helmholtz machine. All of the algorithms that we have presented have at their core a nonlinear optimization problem. In particular, after having introduced the variational parameters, whether sequentially or as a block, we are left with a bound such as that in Eq. (27) that must be optimized. Optimization of this bound is generally achieved via a fixed-point iteration or a gradient-based algorithm. This iterative optimization process induces interdependencies between the variational parameters which give us a "best" approximation to the marginal or conditional probability of interest.

Consider in particular a problem in which a directed graphical model is used for unsupervised learning. A common approach in unsupervised learning is to consider graphical models that are oriented in the "generative" direction; that is, they point from hidden variables to observables. In this case the "predictive" calculation of P(E | H) is elementary. The calculation of P(H | E), on the other hand, is a "diagnostic" calculation that proceeds backwards in the graph. Diagnostic calculations are generally non-trivial and require the full power of an inference algorithm.

An alternative approach to solving iteratively for an approximation to the diagnostic calculation is to learn both a generative model and a "recognition" model that approximates the diagnostic distribution P(H | E). Thus we associate different parameters with the generative model and the recognition model and rely on the parameter estimation process to bring these parameterizations into register. This is the basic idea behind the "Helmholtz machine" (Dayan et al., 1995; Hinton et al., 1995). The key advantage of the recognition-model approach is that the calculation of P(H | E) is reduced to an elementary feedforward calculation that can be performed quickly.

There are some disadvantages to the approach as well. In particular, the lack of an iterative algorithm makes the Helmholtz machine unable to deal naturally with missing data, and with phenomena such as "explaining-away," in which the couplings between hidden variables change as a function of the conditioning variables. Moreover, although in some cases there is a clear natural parameterization for the recognition model that is induced from the generative model (in particular for linear models such as factor analysis), in general it is difficult to insure that the models are matched appropriately. Some of these problems might be addressed by combining the recognition-model approach with the iterative variational approach; essentially treating the recognition-model as a "cache" for storing good initializations for the variational parameters.

7.1.2. Sampling methods. In this section we make a few remarks on the relationships between variational methods and stochastic methods, in particular the Gibbs sampler. In the setting of graphical models, both classes of methods rely on extensive message-passing. In Gibbs sampling, the message-passing is particularly simple: each node learns the current instantiation of its Markov blanket. With enough samples the node can estimate the distribution over its Markov blanket and (roughly speaking) determine its own statistics. The advantage of this scheme is that in the limit of very many samples, it is guaranteed to converge to the correct statistics. The disadvantage is that very many samples may be required.

The message-passing in variational methods is quite different. Its purpose is to couple the variational parameters of one node to those of its Markov blanket. The messages do not come in the form of samples, but rather in the form of approximate statistics (as summarized by the variational parameters). For example, in a network of binary nodes, while the Gibbs sampler is circulating messages of binary vectors that correspond to instantiations of Markov blankets, the variational methods are circulating real-valued numbers that correspond to statistics of Markov blankets. This may be one reason why variational methods often converge faster than Gibbs sampling. Of course, the disadvantage of these schemes is that they do not necessarily converge to the correct statistics. On the other hand, they can provide bounds on marginal probabilities that are quite difficult to estimate by sampling. Indeed, sampling-based methods, while well-suited to estimating the statistics of individual hidden nodes, are ill-equipped to compute marginal probabilities such as P(E) = Σ_H P(H, E).

An interesting direction for future research is to consider combinations of sampling methods and variational methods. Some initial work in this direction has been done by Hinton, Sallans, and Ghahramani (1999), who discuss brief Gibbs sampling from the point of view of variational approximation.

7.1.3. Bayesian methods. Variational inference can be applied to the general problem of Bayesian parameter estimation. Indeed we can quite generally treat parameters as additional nodes in a graphical model (cf. Heckerman, 1999) and thereby treat Bayesian inference on the same footing as generic probabilistic inference in a graphical model. This probabilistic inference problem is often intractable, and variational approximations can be useful.

A variational method known as "ensemble learning" was originally introduced as a way of fitting an "ensemble" of neural networks to data, where each setting of the parameters can be thought of as a different member of the ensemble (Hinton & van Camp, 1993). Let q(θ) represent a variational approximation to the posterior distribution P(θ | E). The ensemble is fit by minimizing the appropriate KL divergence:

    KL(q ‖ P) = ∫ q(θ) ln [ q(θ) / P(θ | E) ] dθ    (71)
Following the same line of argument as in Section 6, we know that this minimization must be equivalent to the maximization of a lower bound. In particular, copying the argument from Section 6, we find that minimizing the KL divergence yields the best lower bound on the following quantity:

    ln P(E) = ln ∫ P(E | θ) P(θ) dθ    (72)

which is the logarithm of the marginal likelihood; a key quantity in Bayesian model selection and model averaging.

More recently, the ensemble learning approach has been applied to mixture of experts architectures (Waterhouse, MacKay, & Robinson, 1996) and hidden Markov models (MacKay, 1997). One interesting aspect of these applications is that they do not assume any particular parametric family for q, rather they make the nonparametric assumption that q factorizes in a specific way. The variational minimization itself determines the best family given this factorization and the prior on θ.

Jaakkola and Jordan (1997b) have also developed variational methods for Bayesian inference, using a variational approach to find an analytically tractable approximation for logistic regression with a Gaussian prior on the parameters.

7.1.4. Perspective and prospectives. Perhaps the key issue that faces developers of variational methods is the issue of approximation accuracy. One can develop an intuition for when variational methods perform well and when they perform poorly by examining their properties in certain well-studied cases. In the case of fully factorized approximations for undirected graphs, a good starting point is the statistical mechanics literature where this approximation can give not only good, but indeed exact, results. Such cases include densely connected graphs with uniformly weak (but non-negative) couplings between neighboring nodes (Parisi, 1988). The mean field equations for these networks have a unique solution that determines the statistics of individual nodes in the limit of very large graphs.

Kearns and Saul (1998) have utilized large deviation methods to study the approximation accuracy of bounds on the likelihood for dense directed graphs. Characterizing the accuracy in terms of the number of parents N for each node (assumed constant) in a layered graph, they have shown that the gap between variational upper and lower bounds converges at a rate of O(√(ln N / N)). Their approach utilizes a rather general form of upper and lower bounds for the local conditional probabilities that does not depend on convexity properties. Thus their result should be expected to be rather robust across the general family of variational methods; moreover, faster rates may be obtainable for the convexity-based variational approximations discussed in the current paper.

In more general graphical models the conditions for convergence of fully factorized variational approximations may not be so favorable. In general some nodes may have a small Markov blanket or may be strongly dependent on particular neighbors, and variational transformations of such nodes would yield poor bounds. More globally, if there are strong probabilistic dependencies in the model the posterior can have multiple modes (indeed, in the limiting case of deterministic relationships one can readily create switching automata that have multiple modes). Fully factorized approximations, which are necessarily unimodal, will fail in such cases. Handling such cases requires making use of methods that transform only a subset of the nodes, incorporating exact inferential procedures as subroutines to handle the untransformed structure. The exact methods can capture the strong dependencies, leaving the weaker dependencies for the variational transformations. In practice, achieving this kind of division of labor either requires that the strong dependencies can be identified in advance by inspection of the graph, or can be identified in the context of a sequential variational method via simple greedy calculations.

An alternative approach to handling multiple modes is to utilize mixture models as approximating distributions (the q distributions in block variational methods). See Jaakkola and Jordan (1999a) and Bishop et al. (1998) for discussion of this approach.

Finally, it is also important to stress the difference between joint distributions and conditional distributions. In many situations, such as classification problems in which E represents a category label, the joint distribution is multi-modal, but the conditional distribution is not. Thus factorized approximations may make sense for inference problems even when they would be poor overall approximations to the joint probability model.

Another key issue has to do with broadening the scope of variational methods. In this paper we have presented a restricted set of variational techniques, those based on convexity transformations. For these techniques to be applicable the appropriate convexity properties need to be identified. While it is relatively easy to characterize small classes of models where these properties lead to simple approximation algorithms, such as the case in which the local conditional probabilities are log-concave generalized linear models, it is not generally easy to develop variational algorithms for other kinds of graphical models. A broader characterization of variational approximations is needed and a more systematic algebra is needed to match the approximations to models.

Other open problems include: (1) the problem of combining variational methods with sampling methods and with search-based methods, (2) the problem of making more informed choices of node ordering in the case of sequential methods, (3) the development of upper bounds within the block framework, (4) the combination of multiple variational approximations for the same model, and (5) the development of variational methods for architectures that combine continuous and discrete random variables. Similar open problems exist for sampling methods and for methods based on incomplete or pruned versions of exact methods. The difficulty in providing solid theoretical foundations in all of these cases lies in the fact that accuracy is contingent to a large degree on the actual conditional probability values of the underlying probability model rather than on the discrete properties of the graph.

Appendix

In this section, we calculate the conjugate functions for the logarithm function and the log logistic function.

For f(x) = ln x, we have:

    f*(λ) = min_x { λx − ln x }    (A.1)

Taking the derivative with respect to x and setting to zero yields x = 1/λ. Substituting back in Eq. (A.1) yields:

    f*(λ) = ln λ + 1    (A.2)

which justifies the representation of the logarithm given in Eq. (14).

For the log logistic function f(x) = −ln(1 + e^{−x}), we have:

    f*(λ) = min_x { λx + ln(1 + e^{−x}) }    (A.3)

Taking the derivative with respect to x and setting to zero yields:

    λ = e^{−x} / (1 + e^{−x})    (A.4)

from which we obtain:

    e^{−x} = λ / (1 − λ)    (A.5)

and

    ln(1 + e^{−x}) = ln [ 1 / (1 − λ) ]    (A.6)

Plugging these expressions back into Eq. (A.3) yields:

    f*(λ) = λ ln [ (1 − λ)/λ ] − ln(1 − λ) = −λ ln λ − (1 − λ) ln(1 − λ)    (A.7)

which is the binary entropy function H(λ). This justifies the representation of the logistic function given in Eq. (19).

Acknowledgments

We wish to thank Brendan Frey, David Heckerman, Uffe Kjærulff, and (as always) Peter Dayan for helpful comments on the manuscript.

Notes

1. Our presentation will take the point of view that moralization and triangulation, when combined with a local message-passing algorithm, are sufficient for exact inference. It is also possible to show that, under certain conditions, these steps are necessary for exact inference. See Jensen and Jensen (1994).
2. Here and elsewhere we identify the ith node with the random variable S_i associated with the node.
3. We define a clique to be a subset of nodes which are fully connected and maximal; i.e., no additional node can be added to the subset so that the subset remains fully connected.
4. Note in particular that figure 2 is the moralization of figure 1.
5. The acronym "QMR-DT" refers to the "Decision Theoretic" version of the "Quick Medical Reference."
6. In particular, the pattern of missing edges in the graph implies that (a) the diseases are marginally independent, and (b) given the diseases, the symptoms are conditionally independent.
7. Jaakkola and Jordan (1999b) also calculated the median of the pairwise cutset size. This value was found to be 106.5, which also rules out exact cutset methods for inference for the QMR-DT.
8. It is also possible to consider more general Boltzmann machines with multivalued nodes, and potentials that are exponentials of arbitrary functions on the cliques. Such models are essentially equivalent to the general undirected graphical model of Eq. (3) (although the latter can represent zero probabilities while the former cannot).
9. Note that we treat P(E) in general as a marginal probability; that is, we do not necessarily assume that H and E jointly exhaust the set of nodes S.
10. The particular recognition model utilized in the Helmholtz machine is a layered graph, which makes weak conditional independence assumptions and thus makes it possible, in principle, to capture fairly general dependencies.

References

Bathe, K. J. (1996). Finite element procedures. Englewood Cliffs, NJ: Prentice-Hall.
Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41, 164–171.
Bishop, C. M., Lawrence, N., Jaakkola, T. S., & Jordan, M. I. (1998). Approximating posterior distributions in belief networks using mixtures. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems 10. Cambridge, MA: MIT Press.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: John Wiley.
Dagum, P., & Luby, M. (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60, 141–153.
Dayan, P., Hinton, G. E., Neal, R., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7.
Dean, T., & Kanazawa, K. (1989). A model for reasoning about causality and persistence. Computational Intelligence, 5, 142–150.
Dechter, R. (1999). Bucket elimination: A unifying framework for probabilistic inference. In M. I. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39, 1–38.
Draper, D. L., & Hanks, S. (1994). Localized partial evaluation of belief networks. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
Frey, B., Hinton, G. E., & Dayan, P. (1996). Does the wake-sleep algorithm learn good density estimators? In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Fung, R., & Favero, B. D. (1994). Backward simulation in Bayesian networks. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
Galland, C. (1993). The limitations of deterministic Boltzmann machine learning. Network, 4, 355–379.
Ghahramani, Z., & Hinton, G. E. (1996). Switching state-space models (Technical Report CRG-TR-96-3). Toronto: Department of Computer Science, University of Toronto.
Ghahramani, Z., & Jordan, M. I. (1997). Factorial hidden Markov models. Machine Learning, 29, 245–273.
Gilks, W., Thomas, A., & Spiegelhalter, D. (1994). A language and a program for complex Bayesian modelling. The Statistician, 43, 169–178.
Heckerman, D. (1999). A tutorial on learning with Bayesian networks. In M. I. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.
Henrion, M. (1991). Search-based methods to bound diagnostic probabilities in very large belief nets. Uncertainty and Artificial Intelligence: Proceedings of the Seventh Conference. San Mateo, CA: Morgan Kaufmann.
Hinton, G. E., & Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
Hinton, G. E., & van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. Proceedings of the 6th Annual Workshop on Computational Learning Theory. New York, NY: ACM Press.
Hinton, G. E., Dayan, P., Frey, B., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton, G. E., Sallans, B., & Ghahramani, Z. (1999). A hierarchical community of experts. In M. I. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.
Horvitz, E. J., Suermondt, H. J., & Cooper, G. F. (1989). Bounded conditioning: Flexible inference for decisions under scarce resources. Conference on Uncertainty in Artificial Intelligence: Proceedings of the Fifth Conference. Mountain View, CA: Association for UAI.
Jaakkola, T. S., & Jordan, M. I. (1996). Computing upper and lower bounds on likelihoods in intractable networks. Uncertainty and Artificial Intelligence: Proceedings of the Twelfth Conference. San Mateo, CA: Morgan Kaufmann.
Jaakkola, T. S. (1997). Variational methods for inference and estimation in graphical models. Unpublished doctoral dissertation, Massachusetts Institute of Technology, Cambridge, MA.
Jaakkola, T. S., & Jordan, M. I. (1997a). Recursive algorithms for approximating probabilities in graphical models. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems 9. Cambridge, MA: MIT Press.
Jaakkola, T. S., & Jordan, M. I. (1997b). Bayesian logistic regression: A variational approach. In D. Madigan & P. Smyth (Eds.), Proceedings of the 1997 Conference on Artificial Intelligence and Statistics. Ft. Lauderdale, FL.
Jaakkola, T. S., & Jordan, M. I. (1999a). Improving the mean field approximation via the use of mixture distributions. In M. I. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.
Jaakkola, T. S., & Jordan, M. I. (1999b). Variational methods and the QMR-DT database. Journal of Artificial Intelligence Research, 10, 291–322.
Jensen, C. S., Kong, A., & Kjærulff, U. (1995). Blocking-Gibbs sampling in very large probabilistic expert systems. International Journal of Human-Computer Studies, 42, 647–666.
Jensen, F. V., & Jensen, F. (1994). Optimal junction trees. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
Jensen, F. V. (1996). An introduction to Bayesian networks. London: UCL Press.
Jordan, M. I. (1994). A statistical approach to decision tree modeling. In M. Warmuth (Ed.), Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory. New York: ACM Press.
Jordan, M. I., Ghahramani, Z., & Saul, L. K. (1997). Hidden Markov decision trees. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems 9. Cambridge, MA: MIT Press.
Kanazawa, K., Koller, D., & Russell, S. (1995). Stochastic simulation algorithms for dynamic probabilistic networks. Uncertainty and Artificial Intelligence: Proceedings of the Eleventh Conference. San Mateo, CA: Morgan Kaufmann.
Kjærulff, U. (1990). Triangulation of graphs: Algorithms giving small total state space. (Research Report R-90-09). Department of Mathematics and Computer Science, Aalborg University, Denmark.
Kjærulff, U. (1994). Reduction of computational complexity in Bayesian networks through removal of weak dependences. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
MacKay, D. J. C. (1997). Ensemble learning for hidden Markov models. Unpublished manuscript. Cambridge: Department of Physics, University of Cambridge.
McEliece, R. J., MacKay, D. J. C., & Cheng, J.-F. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16, 140–152.
Merz, C. J., & Murphy, P. M. (1996). UCI repository of machine learning databases. Irvine, CA: Department of Information and Computer Science, University of California.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71–113.
Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods. (Technical Report CRG-TR-93-1). Toronto: Department of Computer Science, University of Toronto.
Neal, R., & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.
Parisi, G. (1988). Statistical field theory. Redwood City, CA: Addison-Wesley.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Peterson, C., & Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 995–1019.
Rockafellar, R. (1972). Convex analysis. Princeton University Press.
Rustagi, J. (1976). Variational methods in statistics. New York: Academic Press.
Sakurai, J. (1985). Modern quantum mechanics. Redwood City, CA: Addison-Wesley.
Saul, L. K., & Jordan, M. I. (1994). Learning in Boltzmann trees. Neural Computation, 6, 1173–1183.
Saul, L. K., Jaakkola, T. S., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Saul, L. K., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8. Cambridge, MA: MIT Press.
Saul, L. K., & Jordan, M. I. (1999). A mean field learning algorithm for unsupervised neural networks. In M. I. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.
Seung, S. (1995). Annealed theories of learning. In J.-H. Oh, C. Kwon, & S. Cho (Eds.), Neural networks: The statistical mechanics perspectives. Singapore: World Scientific.
Shachter, R. D., Andersen, S. K., & Szolovits, P. (1994). Global conditioning for probabilistic inference in belief networks. Uncertainty and Artificial Intelligence: Proceedings of the Tenth Conference. San Mateo, CA: Morgan Kaufmann.
Shenoy, P. P. (1992). Valuation-based systems for Bayesian decision analysis. Operations Research, 40, 463–484.
Shwe, M. A., & Cooper, G. F. (1991). An empirical analysis of likelihood-weighting simulation on a large, multiply connected medical belief network. Computers and Biomedical Research, 24, 453–475.
Shwe, M. A., Middleton, B., Heckerman, D. E., Henrion, M., Horvitz, E. J., Lehmann, H. P., & Cooper, G. F. (1991). Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Meth. Inform. Med., 30, 241–255.
Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9, 227–270.
Waterhouse, S., MacKay, D. J. C., & Robinson, T. (1996). Bayesian methods for mixtures of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8. Cambridge, MA: MIT Press.
Williams, C. K. I., & Hinton, G. E. (1991). Mean field networks that learn to discriminate temporally distorted strings. In D. S. Touretzky, J. Elman, T. Sejnowski, & G. E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann.

Received December 29, 1997. Accepted January
