Undirected Probabilistic Graphical Models (Markov Nets)
(Slides from Sam Roweis)
Connection to MCMC:
MCMC requires sampling a node given its Markov blanket
Need to use P(x | MB(x))
For Bayes nets, MB(x) contains more nodes than are mentioned in the local distribution CPT(x)
For Markov nets, MB(x) is just the neighbors of x in the graph, so P(x | MB(x)) involves only the potentials that mention x
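A minimal way to write this conditional, assuming the clique-potential notation used in the later slides (potentials φ_k over scopes D_k); only the factors whose scope contains x are needed, everything else cancels:

    P(x \mid MB(x)) \;=\;
    \frac{\prod_{k:\, x \in D_k} \phi_k(x,\, MB(x))}
         {\sum_{x'} \prod_{k:\, x \in D_k} \phi_k(x',\, MB(x))}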
Example: a four-node loop A - B - C - D with one pairwise potential per edge

Qn: What is the most likely configuration of A & B?
The factor φ(A,B) says a = b = 0
But the marginal P(A,B) says a = 0, b = 1!
Moral: Factors are not marginals!

Although A,B would like to agree, B&C need to agree, C&D need to disagree, and D&A need to agree, and the latter three have higher weights!

Okay, you convinced me that given any potentials we will have a consistent joint. But given any joint, will there be potentials I can provide? Hammersley-Clifford theorem…

We can have potentials on any cliques, not just the maximal ones. So, for example, we can have a potential on A alone in addition to the other four pairwise potentials.
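For a concrete check, here is a brute-force computation in Python. The potential values are the ones from the well-known "misconception" example (Koller & Friedman), which produce exactly the behaviour described above; they are an assumption, not numbers copied from the slide.

    from itertools import product

    # Pairwise potentials on the loop A-B-C-D (assumed illustrative values).
    phi_AB = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}
    phi_BC = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # B,C want to agree
    phi_CD = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}   # C,D want to disagree
    phi_DA = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # D,A want to agree

    def unnorm(a, b, c, d):
        return phi_AB[a, b] * phi_BC[b, c] * phi_CD[c, d] * phi_DA[d, a]

    Z = sum(unnorm(*x) for x in product([0, 1], repeat=4))

    # Marginal P(A, B): sum the joint over C and D.
    marg = {(a, b): sum(unnorm(a, b, c, d) for c, d in product([0, 1], repeat=2)) / Z
            for a, b in product([0, 1], repeat=2)}

    print(max(phi_AB, key=phi_AB.get))   # (0, 0)  <- the factor's favourite
    print(max(marg, key=marg.get))       # (0, 1)  <- the marginal's favourite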
Markov Networks
Undirected graphical models
(Example network over Smoking, Cancer, Asthma, Cough)
Potential functions defined over cliques

    Smoking   Cancer   Φ(S,C)
    False     False    4.5
    False     True     4.5
    True      False    2.7
    True      True     4.5
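A minimal statement of the joint distribution these clique potentials define, with the usual global normalization (the standard formula, not quoted from the slide):

    P(x) \;=\; \frac{1}{Z}\prod_k \phi_k\big(x_{\{k\}}\big),
    \qquad
    Z \;=\; \sum_{x}\prod_k \phi_k\big(x_{\{k\}}\big)

where x_{\{k\}} is the assignment x restricted to clique k, and Z is the partition function that makes the product sum to 1.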
Log-Linear models for Markov Nets
(same four-variable example A, B, C, D)
Factors are "functions" over their domains
Log-linear model consists of
  Features fi(Di) (functions over domains)
  Weights wi for the features, s.t. each factor can be written as φi(Di) = exp(wi · fi(Di)). Without loss of generality!
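A small sketch of the "without loss of generality" point: any strictly positive potential table can be rewritten as exp(weight × feature) by taking logs. The table below reuses the Smoking/Cancer potential above; the indicator features are illustrative, not from the slides.

    import math
    from itertools import product

    # A strictly positive pairwise potential (the Smoking/Cancer table above).
    phi = {(False, False): 4.5, (False, True): 4.5, (True, False): 2.7, (True, True): 4.5}

    # Log-linear rewrite: one indicator feature per cell, with weight w = log(phi).
    # Then phi(s, c) == exp(sum_i w_i * f_i(s, c)) exactly.
    features = []   # each entry: (weight, indicator function)
    for cell, value in phi.items():
        features.append((math.log(value),
                         lambda s, c, cell=cell: 1.0 if (s, c) == cell else 0.0))

    for s, c in product([False, True], repeat=2):
        rebuilt = math.exp(sum(w * f(s, c) for w, f in features))
        assert abs(rebuilt - phi[s, c]) < 1e-9
    print("potential table reproduced exactly from weights and indicator features")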
Markov Networks
Undirected graphical models
Log-linear model:

    P(x) = (1/Z) exp( Σi wi fi(x) )

where wi is the weight of feature i and fi is feature i
(same Smoking, Cancer, Asthma, Cough example)
Markov Nets vs. Bayes Nets

    Property         Markov Nets          Bayes Nets
    Form             Prod. potentials     Prod. potentials
    Potentials       Arbitrary            Cond. probabilities
    Cycles           Allowed              Forbidden
    Partition func.  Z = ? (global)       Z = 1 (local)
    Indep. check     Graph separation     D-separation
    Indep. props.    Some                 Some
    Inference        MCMC, BP, etc.       Convert to Markov
Inference in Markov Networks

Goal: Compute marginals & conditionals of the joint P(x)
Exact inference is #P-complete
Most BN inference approaches work for MNs too
Variable elimination already used factor multiplication, so it should work without change
Conditioning on the Markov blanket is easy: P(x | MB(x)) depends only on the factors that mention x
Gibbs sampling exploits this
MCMC: Gibbs Sampling

    state ← random truth assignment
    for i ← 1 to num-samples do
        for each variable x
            sample x according to P(x | neighbors(x))
            state ← state with new value of x
    P(F) ← fraction of states in which F is true
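A runnable sketch of this loop in Python for the pairwise A-B-C-D example, assuming the same illustrative potential values as in the earlier brute-force check; resampling a variable uses only the factors that touch it (its Markov blanket).

    import random
    from itertools import product

    random.seed(0)

    # Pairwise potentials on the loop A-B-C-D (assumed illustrative values).
    potentials = {
        ("A", "B"): {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10},
        ("B", "C"): {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100},
        ("C", "D"): {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1},
        ("D", "A"): {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100},
    }
    variables = ["A", "B", "C", "D"]

    def local_score(state, var, value):
        """Product of the factors that mention `var` (its Markov blanket)."""
        trial = dict(state, **{var: value})
        score = 1.0
        for (u, v), table in potentials.items():
            if var in (u, v):
                score *= table[trial[u], trial[v]]
        return score

    def gibbs(num_samples=20000, burn_in=1000):
        state = {v: random.randint(0, 1) for v in variables}   # random assignment
        counts = {}
        for i in range(num_samples + burn_in):
            for var in variables:
                # P(var | MB(var)) needs only the factors touching var.
                w0, w1 = local_score(state, var, 0), local_score(state, var, 1)
                state[var] = 1 if random.random() < w1 / (w0 + w1) else 0
            if i >= burn_in:
                key = (state["A"], state["B"])
                counts[key] = counts.get(key, 0) + 1
        return counts

    counts = gibbs()
    # P(F) = fraction of sampled states in which F holds, e.g. F = (A=0 and B=1):
    print(max(counts, key=counts.get))                    # most frequent (A, B): expect (0, 1)
    print(counts.get((0, 1), 0) / sum(counts.values()))   # ≈ marginal P(A=0, B=1)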
Other Inference Methods
Many variations of MCMC
Belief propagation (sum-product)
Variational approximation
Exact methods
Overview
Motivation
Foundational areas
Probabilistic inference
Statistical learning
Logical inference
Inductive logic programming
Putting the pieces together
Applications
Learning Markov Networks
Learning parameters (weights)
Generatively
Discriminatively
Learning structure (features)
Easy Case:
Assume complete data
(If not: EM versions of algorithms)
Entanglement in log likelihood…
(illustrated on a small network over three variables a, b, c)
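A hedged one-line version of the point, in the log-linear notation from earlier: the normalizer couples all the weights, so the likelihood does not decompose per parameter.

    \log P_w(x) \;=\; \sum_i w_i f_i(x) \;-\; \log Z(w),
    \qquad
    Z(w) \;=\; \sum_{x'} \exp\Big(\sum_i w_i f_i(x')\Big)

The first term separates over the weights, but log Z(w) involves every wi at once; even for a small network over a, b, c, changing one weight changes Z and hence the gradient for all the others.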
Learning for log-linear formulation

Use gradient ascent
Unimodal, because the Hessian is the covariance matrix over the features
What is the expected value of the feature, given the current parameterization of the network?
Requires inference to answer (inference at every iteration, sort of like EM)
Why should we spend so much time computing the gradient?
Given that the gradient is used only to take a gradient-ascent step, it might look as if we should be able to approximate it any which way; after all, we are going to take a step with some arbitrary step size anyway..
..But the thing to keep in mind is that the gradient is a vector. We are talking not just of magnitude but of direction. A mistake in the individual components can change the direction of the vector and push the search in a completely wrong direction…
Generative Weight Learning

Maximize likelihood or posterior probability
Numerical optimization (gradient or 2nd order)
No local maxima
Requires inference at each step (slow!)

    ∂/∂wi log Pw(x) = ni(x) − Ew[ni(x)]

ni(x): no. of times feature i is true in the data
Ew[ni(x)]: expected no. of times feature i is true according to the model
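A small sketch of this gradient for a toy log-linear model over binary variables, computing the expected feature counts exactly by enumeration (feasible only for tiny models; the variables and features below are illustrative, not from the slides).

    import math
    from itertools import product

    # Toy log-linear model over three binary variables (x0, x1, x2).
    # Illustrative features: "x0 agrees with x1" and "x1 agrees with x2".
    features = [
        lambda x: 1.0 if x[0] == x[1] else 0.0,
        lambda x: 1.0 if x[1] == x[2] else 0.0,
    ]
    states = list(product([0, 1], repeat=3))

    def log_unnorm(x, w):
        return sum(wi * fi(x) for wi, fi in zip(w, features))

    def model_expectations(w):
        """E_w[f_i], computed exactly by summing over all 2^3 states."""
        weights = [math.exp(log_unnorm(x, w)) for x in states]
        Z = sum(weights)
        return [sum(p * fi(x) for p, x in zip(weights, states)) / Z for fi in features]

    def gradient(data, w):
        """d/dw_i (avg log-likelihood) = empirical count of f_i minus expected count of f_i."""
        empirical = [sum(fi(x) for x in data) / len(data) for fi in features]
        expected = model_expectations(w)
        return [e - m for e, m in zip(empirical, expected)]

    # A few made-up training instances, then plain gradient ascent.
    data = [(0, 0, 0), (1, 1, 1), (1, 1, 0), (0, 0, 0)]
    w = [0.0, 0.0]
    for _ in range(200):
        g = gradient(data, w)
        w = [wi + 0.5 * gi for wi, gi in zip(w, g)]
    print(w)   # weights move toward making agreement more likely, matching the data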
Alternative objectives to maximize..

Since log-likelihood requires network inference to compute the derivative, we might want to focus on other objectives whose gradients are easier to compute (and which also, hopefully, have optima at the same parameter values).
Two options: Pseudo-Likelihood and Contrastive Divergence

Given a single data instance x, the log-likelihood is

    log Pw(x) = Σi wi fi(x) − log Σx' exp( Σi wi fi(x') )

The first term is the (unnormalized) log prob of the data; the second is the log of the summed (unnormalized) prob of all other possible data instances (w.r.t. the current parameters w).

Contrastive divergence: maximize the distance between the two terms ("increase the divergence"), but approximate the second term with a sample of typical other instances (we need to sample from Pw, so run MCMC initialized with the data..)
Pseudo-likelihood: compute the likelihood of each variable of the data instance just using its Markov blanket (approximate chain rule)
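A hedged sketch of the contrastive-divergence idea for the same illustrative toy model: approximate the expected feature counts with "negative" samples obtained by a single Gibbs sweep started from the data (CD-1), instead of exact inference.

    import math, random
    from itertools import product

    random.seed(0)

    # Same illustrative toy model: three binary variables, two agreement features.
    features = [
        lambda x: 1.0 if x[0] == x[1] else 0.0,
        lambda x: 1.0 if x[1] == x[2] else 0.0,
    ]

    def log_unnorm(x, w):
        return sum(wi * fi(x) for wi, fi in zip(w, features))

    def gibbs_sweep(x, w):
        """One pass of Gibbs sampling over all variables, started from x."""
        x = list(x)
        for i in range(len(x)):
            scores = []
            for v in (0, 1):
                x[i] = v
                scores.append(math.exp(log_unnorm(x, w)))
            x[i] = 1 if random.random() < scores[1] / (scores[0] + scores[1]) else 0
        return tuple(x)

    def cd1_gradient(data, w):
        """Empirical feature counts minus counts on samples from one Gibbs sweep."""
        positives = [sum(fi(x) for x in data) / len(data) for fi in features]
        negs = [gibbs_sweep(x, w) for x in data]     # MCMC initialized with the data
        negatives = [sum(fi(x) for x in negs) / len(negs) for fi in features]
        return [p - n for p, n in zip(positives, negatives)]

    data = [(0, 0, 0), (1, 1, 1), (1, 1, 0), (0, 0, 0)]
    w = [0.0, 0.0]
    for _ in range(200):
        w = [wi + 0.1 * gi for wi, gi in zip(w, cd1_gradient(data, w))]
    print(w)   # noisier than the exact gradient, but pushed in a similar direction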
Pseudo-Likelihood

    PL(x) = Πi P(xi | neighbors(xi))

Likelihood of each variable given its neighbors in the data
Does not require inference at each step
Consistent estimator
Widely used in vision, spatial statistics, etc.
But PL parameters may not work well for long inference chains
[Which can lead to disastrous results]
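A minimal sketch of pseudo-likelihood for the same illustrative toy model: each per-variable conditional normalizes over that one variable only, so no partition function over the full joint is required.

    import math
    from itertools import product

    # Same illustrative toy model: three binary variables, two agreement features.
    features = [
        lambda x: 1.0 if x[0] == x[1] else 0.0,
        lambda x: 1.0 if x[1] == x[2] else 0.0,
    ]

    def log_unnorm(x, w):
        return sum(wi * fi(x) for wi, fi in zip(w, features))

    def log_pseudo_likelihood(x, w):
        """sum_i log P(x_i | rest of x); each term normalizes over one variable only."""
        total = 0.0
        for i in range(len(x)):
            scores = []
            for v in (0, 1):
                flipped = list(x)
                flipped[i] = v
                scores.append(math.exp(log_unnorm(tuple(flipped), w)))
            total += math.log(scores[x[i]] / sum(scores))
        return total

    data = [(0, 0, 0), (1, 1, 1), (1, 1, 0), (0, 0, 0)]
    for w in ([0.0, 0.0], [1.0, 1.0]):
        print(w, sum(log_pseudo_likelihood(x, w) for x in data))
    # The agreement weights [1.0, 1.0] score higher than [0.0, 0.0] on this data.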
Discriminative Weight Learning

Maximize conditional likelihood of query (y) given evidence (x)

    ∂/∂wi log Pw(y | x) = ni(x, y) − Ew[ni(x, y)]

ni(x, y): no. of true groundings of clause i in the data
Ew[ni(x, y)]: expected no. of true groundings according to the model
Approximate the expected counts by counts in the MAP state of y given x
Structure Learning
How to learn the structure of a Markov network?
… not too different from learning structure for a Bayes network: discrete search through space of possible graphs, trying to maximize data probability….