Presentation Transcript

Slide 1

Undirected Probabilistic Graphical Models (Markov Nets)
(Slides from Sam Roweis)

Slides 2-5: (image-only slides; no transcript text)

Slide 6

Connection to MCMC:

MCMC requires sampling a node given its Markov blanket, i.e., we need P(x | MB(x)).

For Bayes nets, MB(x) contains more nodes than are mentioned in the local distribution CPT(x).

For Markov nets, MB(x) is exactly the set of neighbors of x, so P(x | MB(x)) can be computed directly from the potentials that mention x.
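As a concrete illustration (not from the slides; variable names and potential values are invented), in a pairwise Markov net P(x | MB(x)) can be computed from just the potentials that mention x; this is exactly the quantity Gibbs sampling resamples (slide 20):

```python
# A small binary Markov net A - B - C with made-up pairwise potentials.
potentials = {
    ("A", "B"): {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},
    ("B", "C"): {(0, 0): 1.0, (0, 1): 5.0, (1, 0): 5.0, (1, 1): 1.0},
}

def conditional(var, assignment):
    """P(var | MB(var)): only the potentials that mention `var` matter."""
    scores = {}
    for val in (0, 1):
        s = 1.0
        for (u, v), table in potentials.items():
            if var not in (u, v):
                continue                  # factors not touching var cancel out
            uval = val if u == var else assignment[u]
            vval = val if v == var else assignment[v]
            s *= table[(uval, vval)]
        scores[val] = s
    z = scores[0] + scores[1]
    return {val: s / z for val, s in scores.items()}

# Distribution of B given its Markov blanket {A: 0, C: 1}.
print(conditional("B", {"A": 0, "C": 1}))     # {0: ~0.91, 1: ~0.09}
```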

Slide 7

(Example: a ring of four binary variables A, B, C, D with pairwise potentials.)

Qn: What is the most likely configuration of A & B?
The factor says a = b = 0, but the marginal says a = 0, b = 1!
Moral: factors are not marginals!

Although A and B would like to agree, B & C need to agree, C & D need to disagree, and D & A need to agree, and the latter three have higher weights!

Okay, you convinced me that given any potentials we will have a consistent joint. But given any joint, will there be potentials I can provide? Hammersley-Clifford theorem…

We can have potentials on any cliques, not just the maximal ones. So, for example, we can have a potential on A alone in addition to the other four pairwise potentials.
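The slide's actual potential values live only in the image, so the numbers below are invented, but a brute-force computation with potentials of that qualitative shape (Ф_AB mildly prefers agreement at 0, while the B-C agree, C-D disagree, and D-A agree potentials are much stronger) shows the point:

```python
import itertools
from collections import defaultdict

# Brute-force check that factors are not marginals. The potential values are
# made up, but follow the slide's story: phi_AB mildly prefers A = B (= 0),
# while the B-C "agree", C-D "disagree", and D-A "agree" potentials carry
# much higher weights.
phi_AB = {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}
phi_BC = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}   # agree
phi_CD = {(0, 0): 1.0, (0, 1): 100.0, (1, 0): 100.0, (1, 1): 1.0}   # disagree
phi_DA = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 20.0}    # agree

# Unnormalized joint over the ring A-B-C-D-A, then normalize by Z.
joint = {}
for a, b, c, d in itertools.product((0, 1), repeat=4):
    joint[(a, b, c, d)] = (phi_AB[(a, b)] * phi_BC[(b, c)] *
                           phi_CD[(c, d)] * phi_DA[(d, a)])
Z = sum(joint.values())

marg_AB = defaultdict(float)
for (a, b, c, d), w in joint.items():
    marg_AB[(a, b)] += w / Z

print("argmax of factor phi_AB  :", max(phi_AB, key=phi_AB.get))    # (0, 0)
print("argmax of marginal P(A,B):", max(marg_AB, key=marg_AB.get))  # (0, 1)
```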
Slide 8: (image-only slide; no transcript text)

Slide 9

Markov Networks

Undirected graphical models (example network over the variables Cancer, Cough, Asthma, Smoking).

Potential functions defined over cliques, e.g.:

  Smoking  Cancer   Ф(S,C)
  False    False    4.5
  False    True     4.5
  True     False    2.7
  True     True     4.5
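As a minimal illustration of P(x) = (1/Z) ∏_c Ф_c(x_c) using only the table above (the network's other potentials are not in the transcript, so the "joint" here is just over Smoking and Cancer):

```python
# Minimal illustration of P(x) = (1/Z) * prod_c Phi_c(x_c), using only the
# Phi(S, C) table from the slide.
phi_SC = {
    (False, False): 4.5,
    (False, True):  4.5,
    (True,  False): 2.7,
    (True,  True):  4.5,
}

Z = sum(phi_SC.values())                       # 16.2
P = {sc: v / Z for sc, v in phi_SC.items()}

print("Z =", Z)
print("P(Smoking=True, Cancer=False) =", P[(True, False)])   # 2.7 / 16.2 = 1/6
print("P(Cancer=True) =", sum(p for (s, c), p in P.items() if c))
```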
Slide 10: (image-only slide; no transcript text)

Slide 11

Log-Linear Models for Markov Nets

(Example network: A, B, C, D.)

Factors are "functions" over their domains.

A log-linear model consists of:
  Features f_i(D_i)  (functions over domains)
  Weights w_i for the features

such that

  P(x) = (1/Z) exp( Σ_i w_i f_i(D_i) )

Any Markov net with positive potentials can be put in this form (e.g., indicator features with w_i = log Ф_i), so this is without loss of generality!
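A small sketch of why the log-linear form loses no generality, assuming one indicator feature per table entry with weight log Ф (the table values are the Ф(S,C) numbers from slide 9):

```python
import math
import itertools

# Any table of positive potentials can be rewritten in log-linear form by using
# one indicator feature per table entry with weight w = log(phi).
phi_SC = {(False, False): 4.5, (False, True): 4.5,
          (True, False): 2.7, (True, True): 4.5}

# One indicator feature f_{s,c}(S, C) = 1 iff (S, C) == (s, c), weight log phi.
weights = {sc: math.log(v) for sc, v in phi_SC.items()}

def unnormalized(s, c):
    # exp(sum_i w_i f_i): exactly one indicator fires for each (s, c).
    return math.exp(sum(w for sc, w in weights.items() if sc == (s, c)))

for s, c in itertools.product((False, True), repeat=2):
    assert abs(unnormalized(s, c) - phi_SC[(s, c)]) < 1e-9
print("log-linear form reproduces the potential table exactly")
```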

Slide 12

Markov Networks

Undirected graphical models (Cancer, Cough, Asthma, Smoking, as before).

Log-linear model:

  P(x) = (1/Z) exp( Σ_i w_i f_i(x) )

where w_i is the weight of feature i and f_i is feature i.
Slides 13-17: (image-only slides; no transcript text)

Slide 18

Markov Nets vs. Bayes Nets

  Property         Markov Nets         Bayes Nets
  Form             Prod. potentials    Prod. potentials
  Potentials       Arbitrary           Cond. probabilities
  Cycles           Allowed             Forbidden
  Partition func.  Z = ? (global)      Z = 1 (local)
  Indep. check     Graph separation    D-separation
  Indep. props.    Some                Some
  Inference        MCMC, BP, etc.      Convert to Markov

Slide 19

Inference in Markov Networks

Goal: compute marginals & conditionals of

  P(X) = (1/Z) exp( Σ_i w_i f_i(X) ),   Z = Σ_x exp( Σ_i w_i f_i(x) )

Exact inference is #P-complete.

Most BN inference approaches work for MNs too.
Variable Elimination used factor multiplication, and so should work without change.

Conditioning on the Markov blanket is easy; only the features that mention x are needed:

  P(x | MB(x)) = exp( Σ_i w_i f_i(x) ) / [ exp( Σ_i w_i f_i([x = 0]) ) + exp( Σ_i w_i f_i([x = 1]) ) ]

Gibbs sampling exploits this.

Slide 20

MCMC: Gibbs Sampling

  state ← random truth assignment
  for i ← 1 to num-samples do
      for each variable x
          sample x according to P(x | neighbors(x))
          state ← state with new value of x
  P(F) ← fraction of states in which F is true
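A minimal Python sketch of this sampler for a pairwise binary Markov net, reusing the invented ring potentials from the slide 7 example (no burn-in, purely for illustration):

```python
import random

# Minimal Gibbs sampler for a pairwise binary Markov net, following the
# slide's pseudocode. The network and potential values are made up
# (the ring A-B-C-D from the earlier example).
potentials = {
    ("A", "B"): {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0},
    ("B", "C"): {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0},
    ("C", "D"): {(0, 0): 1.0, (0, 1): 100.0, (1, 0): 100.0, (1, 1): 1.0},
    ("D", "A"): {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 20.0},
}
variables = ["A", "B", "C", "D"]

def conditional(x, state):
    """P(x = 1 | neighbors(x)): product of the potentials mentioning x."""
    score = {0: 1.0, 1: 1.0}
    for val in (0, 1):
        for (u, v), table in potentials.items():
            if x == u:
                score[val] *= table[(val, state[v])]
            elif x == v:
                score[val] *= table[(state[u], val)]
    return score[1] / (score[0] + score[1])

def gibbs(num_samples=10000, query=lambda s: s["A"] == 0 and s["B"] == 1):
    state = {v: random.randint(0, 1) for v in variables}   # random assignment
    hits = 0
    for _ in range(num_samples):
        for x in variables:
            state[x] = 1 if random.random() < conditional(x, state) else 0
        hits += query(state)
    return hits / num_samples    # fraction of samples in which the query holds

print("Estimated P(A=0, B=1) ≈", gibbs())
```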
Slides 21-22: (image-only slides; no transcript text)

Slide 23

Other Inference Methods

Many variations of MCMC
Belief propagation (sum-product)
Variational approximation
Exact methods

Slide 24

Overview

Motivation
Foundational areas
  Probabilistic inference
  Statistical learning
  Logical inference
  Inductive logic programming
Putting the pieces together
Applications

Slide 25

Learning Markov Networks

Learning parameters (weights)
  Generatively
  Discriminatively
Learning structure (features)

Easy case: assume complete data.
(If not: EM versions of the algorithms.)

Slide 26

Entanglement in log likelihood…

(Example with variables a, b, c; the figure is only in the slide image.)

Slide 27

Learning for the Log-Linear Formulation

Use gradient ascent on the log-likelihood:

  ∂/∂w_i log P_w(D) = Σ_{x∈D} f_i(x) − N · E_w[f_i]     (N = number of instances in D)

The objective is unimodal, because the Hessian is (minus) the covariance matrix of the features, so the log-likelihood is concave.

The second term asks: what is the expected value of the feature given the current parameterization of the network? Answering that requires inference, at every gradient iteration (sort of like EM).
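To make the expensive term concrete, here is a minimal sketch (with invented variables, features, and weights) of computing E_w[f_i] by brute-force enumeration; in a realistic network this enumeration is the inference that has to be redone, at least approximately, at every iteration:

```python
import math
import itertools

# Expected feature values E_w[f_i] under the current weights, computed by
# brute-force enumeration over a tiny set of binary variables.
variables = ["A", "B", "C"]                      # hypothetical tiny model
features = [                                     # hypothetical features f_i(x)
    lambda x: float(x["A"] == x["B"]),
    lambda x: float(x["B"] != x["C"]),
]
weights = [0.5, 1.2]                             # current parameterization

def expected_features(weights):
    states = [dict(zip(variables, vals))
              for vals in itertools.product((0, 1), repeat=len(variables))]
    scores = [math.exp(sum(w * f(x) for w, f in zip(weights, features)))
              for x in states]
    Z = sum(scores)
    return [sum(s / Z * f(x) for s, x in zip(scores, states)) for f in features]

print(expected_features(weights))    # E_w[f_1], E_w[f_2]
```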

Slide 28

Why should we spend so much time computing the gradient?

Given that the gradient is used only to take a gradient-ascent step, it might look as if we should be able to approximate it in any old way; after all, we are going to take a step with some arbitrary step size anyway.

But the thing to keep in mind is that the gradient is a vector: we are talking not just about magnitude but direction. A mistake in magnitude can change the direction of the vector and push the search in a completely wrong direction…

Slide 29

Generative Weight Learning

Maximize likelihood or posterior probability.
Numerical optimization (gradient or 2nd order).
No local maxima.

  ∂/∂w_i log P_w(x) = n_i(x) − E_w[n_i(x)]

where n_i(x) is the no. of times feature i is true in the data and E_w[n_i(x)] is the expected no. of times feature i is true according to the model.

Requires inference at each step (slow!)
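A compact sketch of this loop on a toy model (variables, features, and data invented): the gradient is empirical feature counts minus expected feature counts, with the expectations computed here by brute-force enumeration, which is exactly the part a real implementation replaces with approximate inference:

```python
import math
import itertools

# Toy generative weight learning: maximize average log-likelihood by gradient
# ascent, with gradient = (empirical feature counts) - (expected feature counts).
variables = ["A", "B"]
features = [lambda x: float(x["A"] == x["B"]),      # f_1
            lambda x: float(x["A"] == 1)]           # f_2
data = [{"A": 1, "B": 1}, {"A": 0, "B": 0}, {"A": 1, "B": 0}]

def model_expectations(w):
    states = [dict(zip(variables, v))
              for v in itertools.product((0, 1), repeat=len(variables))]
    scores = [math.exp(sum(wi * f(x) for wi, f in zip(w, features)))
              for x in states]
    Z = sum(scores)
    return [sum(s / Z * f(x) for s, x in zip(scores, states)) for f in features]

def train(steps=500, lr=0.1):
    w = [0.0] * len(features)
    empirical = [sum(f(x) for x in data) / len(data) for f in features]
    for _ in range(steps):
        expected = model_expectations(w)        # the slow inference step
        w = [wi + lr * (e - m) for wi, e, m in zip(w, empirical, expected)]
    return w

print("learned weights:", train())
```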

Slide 30

Alternative Objectives to Maximize…

Since the log-likelihood requires network inference to compute its derivative, we might want to focus on other objectives whose gradients are easier to compute (and which also, hopefully, have optima at the same parameter values).

Two options:
  Pseudo-likelihood
  Contrastive divergence

Given a single data instance x, the log-likelihood is

  log P_θ(x) = log φ_θ(x) − log Σ_{x'} φ_θ(x')

where φ_θ(x) = exp( Σ_i w_i f_i(x) ) is the unnormalized probability: the (unnormalized) log probability of the data minus the log of the summed unnormalized probabilities of all the possible data instances (w.r.t. the current θ). Maximizing the likelihood maximizes the distance between the two terms ("increase the divergence").

Contrastive divergence: pick a sample of typical other instances (we need to sample from P_θ: run MCMC initialized with the data..).

Pseudo-likelihood: compute the likelihood of each possible data instance just using the Markov blanket (an approximate chain rule).
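A rough sketch of the contrastive-divergence idea on a toy log-linear model (all names, features, and data invented): the expected feature counts are approximated by samples obtained from a few Gibbs steps initialized at the data, rather than by running inference to convergence:

```python
import math
import random

# Contrastive-divergence-style weight learning for a tiny log-linear model.
variables = ["A", "B", "C"]
features = [lambda x: float(x["A"] == x["B"]),
            lambda x: float(x["B"] != x["C"])]
data = [{"A": 1, "B": 1, "C": 0}, {"A": 0, "B": 0, "C": 1},
        {"A": 1, "B": 0, "C": 0}]

def score(x, w):
    return math.exp(sum(wi * f(x) for wi, f in zip(w, features)))

def gibbs_step(x, w):
    x = dict(x)
    for v in variables:                       # resample each variable in turn
        x0, x1 = dict(x, **{v: 0}), dict(x, **{v: 1})
        p1 = score(x1, w) / (score(x0, w) + score(x1, w))
        x[v] = 1 if random.random() < p1 else 0
    return x

def cd_train(steps=200, lr=0.1, k=1):
    w = [0.0] * len(features)
    for _ in range(steps):
        grad = [0.0] * len(features)
        for x in data:
            sample = x
            for _ in range(k):                # k Gibbs steps from the data
                sample = gibbs_step(sample, w)
            for i, f in enumerate(features):
                grad[i] += f(x) - f(sample)   # data counts - "fantasy" counts
        w = [wi + lr * g / len(data) for wi, g in zip(w, grad)]
    return w

print("CD-learned weights:", cd_train())
```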

Slide 31

Pseudo-Likelihood

  PL(x) = Π_i P(x_i | neighbors(x_i))

Likelihood of each variable given its neighbors in the data.
Does not require inference at each step.
Consistent estimator.
Widely used in vision, spatial statistics, etc.
But PL parameters may not work well for long inference chains (which can lead to disastrous results).
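A minimal sketch of evaluating the pseudo-log-likelihood for a toy log-linear model (variables, features, weights, and data invented). Each conditional only needs the unnormalized score at the two values of x_i, so the partition function never appears:

```python
import math

# Pseudo-log-likelihood for a tiny log-linear model over binary variables:
# sum over data instances and variables of log P(x_i | rest).
variables = ["A", "B", "C"]
features = [lambda x: float(x["A"] == x["B"]),
            lambda x: float(x["B"] != x["C"])]
weights = [0.7, 1.3]
data = [{"A": 1, "B": 1, "C": 0}, {"A": 0, "B": 0, "C": 1}]

def unnorm_logp(x):
    return sum(w * f(x) for w, f in zip(weights, features))

def pseudo_log_likelihood(data):
    total = 0.0
    for x in data:
        for v in variables:
            s0 = unnorm_logp(dict(x, **{v: 0}))
            s1 = unnorm_logp(dict(x, **{v: 1}))
            s_obs = s1 if x[v] == 1 else s0
            # log P(x_v | rest) = s_obs - log(exp(s0) + exp(s1)); no global Z.
            total += s_obs - math.log(math.exp(s0) + math.exp(s1))
    return total

print("pseudo-log-likelihood:", pseudo_log_likelihood(data))
```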

Slide 32

Discriminative Weight Learning

Maximize the conditional likelihood of the query (y) given the evidence (x):

  ∂/∂w_i log P_w(y | x) = n_i(x, y) − E_w[n_i(x, y)]

where n_i(x, y) is the no. of true groundings of clause i in the data and E_w[n_i(x, y)] is the expected no. of true groundings according to the model.

Approximate the expected counts by the counts in the MAP state of y given x.
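A sketch of the MAP-count approximation (in the spirit of the voted perceptron) on a toy conditional log-linear model. The evidence/query variables, features, and data are invented, and MAP inference is done by brute force where a real system would use an approximate MAP procedure:

```python
import itertools

# Discriminative weight learning sketch: gradient of the conditional
# log-likelihood approximated by (feature counts at the observed y) minus
# (feature counts at the MAP y given x).
y_vars = ["Y1", "Y2"]
features = [lambda x, y: float(y["Y1"] == x["X"]),
            lambda x, y: float(y["Y1"] == y["Y2"])]
data = [({"X": 1}, {"Y1": 1, "Y2": 1}),
        ({"X": 0}, {"Y1": 0, "Y2": 0}),
        ({"X": 1}, {"Y1": 1, "Y2": 0})]

def score(x, y, w):
    return sum(wi * f(x, y) for wi, f in zip(w, features))

def map_y(x, w):
    candidates = [dict(zip(y_vars, v))
                  for v in itertools.product((0, 1), repeat=len(y_vars))]
    return max(candidates, key=lambda y: score(x, y, w))   # brute-force MAP

def train(steps=100, lr=0.1):
    w = [0.0] * len(features)
    for _ in range(steps):
        for x, y_true in data:
            y_hat = map_y(x, w)
            w = [wi + lr * (f(x, y_true) - f(x, y_hat))
                 for wi, f in zip(w, features)]
    return w

print("discriminatively learned weights:", train())
```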

Slide 33

Structure Learning

How do we learn the structure of a Markov network?

…not too different from learning structure for a Bayes network: a discrete search through the space of possible graphs, trying to maximize the probability of the data….
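The transcript stops here, but as a hedged sketch of what such a discrete search can look like: a greedy loop that adds one edge at a time while a score of the data improves. The variables, data, and the mutual-information-plus-penalty score below are all invented stand-ins for a real likelihood-based score:

```python
import math
import itertools
from collections import Counter

# Skeleton of greedy structure search for a pairwise Markov net: repeatedly add
# the edge that most improves a score of the data, stopping when no edge helps.
# Each edge is scored by empirical mutual information minus a per-edge penalty,
# as a cheap stand-in for a likelihood-based score.
variables = ["A", "B", "C"]
data = [{"A": 1, "B": 1, "C": 0}, {"A": 0, "B": 0, "C": 1},
        {"A": 1, "B": 1, "C": 1}, {"A": 0, "B": 0, "C": 0}]
PENALTY = 0.05

def mutual_information(u, v):
    n = len(data)
    pu = Counter(x[u] for x in data)
    pv = Counter(x[v] for x in data)
    puv = Counter((x[u], x[v]) for x in data)
    mi = 0.0
    for (a, b), c in puv.items():
        mi += (c / n) * math.log((c / n) / ((pu[a] / n) * (pv[b] / n)))
    return mi

def greedy_structure():
    edges, candidates = set(), set(itertools.combinations(variables, 2))
    while candidates:
        best = max(candidates, key=lambda e: mutual_information(*e))
        if mutual_information(*best) - PENALTY <= 0:
            break                       # no remaining edge improves the score
        edges.add(best)
        candidates.remove(best)
    return edges

print("learned edges:", greedy_structure())
```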