Slide 1
Directed Graphical Models
William W. Cohen
Machine Learning 10-601

Slide 2
Motivation for Graphical Models

Slide 3
Recap: A paradox of induction
A black crow seems to support the hypothesis “all crows are black”.
A pink highlighter supports the hypothesis “all non-black things are non-crows”
Thus, a pink highlighter supports the hypothesis “all crows are black”.
whut?

Slide 4
whut?
B = black; C = crow
A 2×2 table (crows vs. non-crows, black vs. not black): collect statistics for P(B=b|C=c).

Slide 5
Logical reasoning versus common-sense reasoning
CROW(jim)
BLACK(jim)
FLY(jim)
BIRD(jim)
EATS(jim, carrion)
…

Slide 6
Another difficult problem: common-sense reasoning
Tweety is a bird.
Most birds can fly.
Opus is a penguin.
Penguins are birds. Penguins cannot fly.
We'd like to be able to conclude: Opus cannot fly, and Tweety can.
default reasoning ✔ ✖
Logically…

Slide 7
Another difficult problem: common-sense reasoning
Tweety is a bird.
Most birds can fly.
Opus is a penguin.
Penguins are birds. Penguins cannot fly.
We'd like to be able to conclude: Opus cannot fly, and Tweety can.
default reasoning ? ✔
NO: F(tweety) is only provable if he's provably NOT a penguin.

Slide 8

Slide 9
Another difficult problem: common-sense reasoning
Tweety is a bird.
Most birds can fly.
Opus is a penguin.
Penguins are birds. Penguins cannot fly.
We'd like to be able to conclude: Opus cannot fly, and Tweety can.
default reasoning ? ✔
NO: F(tweety) is only provable if he's provably not a penguin and not dead and …

Slide 10
Recap: The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C

A  B  C | Prob
0  0  0 | 0.30
0  0  1 | 0.05
0  1  0 | 0.10
0  1  1 | 0.05
1  0  0 | 0.05
1  0  1 | 0.10
1  1  0 | 0.25
1  1  1 | 0.10
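Every query against this distribution is just a sum over rows of the table. A minimal Python sketch (mine, not the slides') that answers marginal and conditional queries this way:

# The joint table above, keyed by (a, b, c).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """Sum the joint over all rows (a, b, c) where `event` holds."""
    return sum(p for row, p in joint.items() if event(*row))

print(prob(lambda a, b, c: a == 1))                 # P(A=1) = 0.5
print(prob(lambda a, b, c: a == 1 and b == 1)
      / prob(lambda a, b, c: b == 1))               # P(A=1|B=1) = 0.35/0.50 = 0.7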
Slide 11

Another difficult problem: common-sense reasoning

Slide 12
Another difficult problem: common-sense reasoning

Slide 13
Tweety is a bird.
Most non-penguin birds can fly.
Opus is a penguin.
Penguins are birds.
Penguins cannot fly.

Pr(Pg=0) = 0.90, Pr(Pg=1) = 0.10

Pg | Pr(B=0|Pg)  Pr(B=1|Pg)
0  | 0.90        0.10
1  | 0           1

Pg  B | Pr(F=0|Pg,B)  Pr(F=1|Pg,B)
0   0 | 0.90          0.10
0   1 | 0.01          0.99
1   0 | 0.5           0.5
1   1 | 0.9999        0.0001

The joint for an experiment: I pick an object, say in Frick Park, and measure: can it fly, is it a bird, is it a penguin.

Slide 14
Tweety is a bird.
Most birds can fly.
Opus is a penguin.
Penguins are birds.
Penguins cannot fly.
Can Opus fly? (For Opus: Pg = 1, B = 1.)

Pg  B | Pr(F=0|Pg,B)  Pr(F=1|Pg,B)
0   0 | 0.90          0.10
0   1 | 0.01          0.99
1   0 | 0.5           0.5
1   1 | 0.9999        0.0001

Unlikely: Pr(F=1|B=1,Pg=1) = 0.0001

Slide 15
Tweety is a bird.
Most birds can fly.
Opus is a penguin.
Penguins are birds.
Penguins cannot fly.
Can Tweety fly? Either:
- Tweety is a flying penguin, or
- Tweety is a flying non-penguin bird.
If flying penguins are rare, it depends:
- do non-penguin birds fly? Pr(F|B=1,Pg=0)
- are all or most birds non-penguins? Pr(B=1|Pg=0)
- are non-penguins common? Pr(Pg=0)
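As a concrete check, here is a short sketch (mine, not the slides') that builds the tiny joint P(Pg,B,F) = P(Pg) P(B|Pg) P(F|Pg,B) from the CPTs two slides back and then conditions on B=1:

# CPTs from the penguin slides; 0/1 encode false/true.
p_pg = {0: 0.90, 1: 0.10}                                  # P(Pg)
p_b = {0: {0: 0.90, 1: 0.10}, 1: {0: 0.0, 1: 1.0}}         # P(B|Pg)
p_f = {(0, 0): {0: 0.90, 1: 0.10}, (0, 1): {0: 0.01, 1: 0.99},
       (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.9999, 1: 0.0001}}  # P(F|Pg,B)

def joint(pg, b, f):
    return p_pg[pg] * p_b[pg][b] * p_f[(pg, b)][f]

p_b1 = sum(joint(pg, 1, f) for pg in (0, 1) for f in (0, 1))   # P(B=1) = 0.19
p_f1b1 = sum(joint(pg, 1, 1) for pg in (0, 1))                 # P(F=1, B=1) = 0.08911
print(p_f1b1 / p_b1)                                           # Pr(F=1|B=1) ≈ 0.469

Under these numbers penguins are over half the birds (Pr(Pg=1|B=1) ≈ 0.53), so "Tweety is a bird" alone leaves flying roughly a coin flip.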
Slide 16

Quiz:
https://piazza.com/class/ij382zqa2572hc?cid=421
https://piazza.com/class/ij382zqa2572hc?cid=420

Slide 17
Another difficult problem: common-sense reasoning
Have we solved the common-sense reasoning problem?
No: how do we (1) choose the conditional probabilities we need to model a task and (2) use them algorithmically to answer questions?
No: how do we invent numbers for all the rows of the CPTs?

Slide 18
Another difficult problem: common-sense reasoning
Have we solved the common-sense reasoning problem?
Yes: we use directed graphical models.
- Semantics: how to specify them
- Inference: how to use them
- Learning: how to find their parameters

Slide 19
Probabilities and probabilistic inference
Why is logic attractive? There are well-understood algorithms for reasoning with a logical theory. E.g.: we can use a computer to determine whether B(x) ⇒ F(x).
What about probabilities?
- We can do some math manually and answer many questions. Not really satisfying.
- We can answer questions algorithmically with the joint. E.g.: we can compute Pr(F=1|B=1). But this is not tractable for large models.
How can we answer questions algorithmically and efficiently? Answer: graphical models.

Slide 20
Probabilities and probabilistic inference
Directed graphical models:
- Today: examples and semantics
- Wednesday: inference algorithms
- Next week: learning in graphical models; using graphical models to specify learning algorithms (Naïve Bayes, LDA, HMMs, …)

Slide 21
Graphical Models: SEMANTICS and definitions

Slide 22
Example: practical problem 1, made easy
I have 3 standard d20 dice, 1 loaded die.
Experiment: (A) pick a d20 uniformly at random, then (B) roll it. Let A = "the d20 picked is fair" and B = "rolled 19 or 20 with that die". What is P(B)?
Network: A → B, with annotations:
P(A=fair)=0.75, P(A=loaded)=0.25
P(B=critical | A=fair)=0.1, P(B=noncritical | A=fair)=0.9
P(B=critical | A=loaded)=0.5, P(B=noncritical | A=loaded)=0.5

A      | P(A)
fair   | 0.75
loaded | 0.25

A       B           | P(B|A)
fair    critical    | 0.1
fair    noncritical | 0.9
loaded  critical    | 0.5
loaded  noncritical | 0.5

P(A,B) = P(B|A) P(A)

Slide 23
Example: practical problem 1, made easy
I have 3 standard d20 dice, 1 loaded die.
Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "rolled 19 or 20 with that die". What is P(B)?
(Same network and CPTs as the previous slide: A → B, P(A,B)=P(B|A)P(A).)
What is Pr(A=1|B=1)? We have the information we need to answer other questions as well: an example of inference.
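In the slide's 0/1 shorthand, A=1 means "fair" and B=1 means "critical". A sketch (mine, not the slides') of both computations: marginalizing for P(B) and applying Bayes' rule for Pr(A=1|B=1):

p_a = {"fair": 0.75, "loaded": 0.25}                       # P(A)
p_b_given_a = {("fair", "critical"): 0.1, ("fair", "noncritical"): 0.9,
               ("loaded", "critical"): 0.5, ("loaded", "noncritical"): 0.5}

# P(B=critical) = sum_a P(B=critical|A=a) P(A=a) = 0.075 + 0.125 = 0.2
p_crit = sum(p_a[a] * p_b_given_a[(a, "critical")] for a in p_a)

# Bayes' rule: P(A=fair|B=critical) = P(B=critical|A=fair) P(A=fair) / P(B=critical)
print(p_crit, p_a["fair"] * p_b_given_a[("fair", "critical")] / p_crit)  # 0.2 0.375

A critical roll shifts the probability that the die is fair from 0.75 down to 0.375.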
Slide 24

In general: any chain-rule decomposition gives a DGM G:
- G has one node per random variable.
- If P(X|Y1,…,Yk) is a factor in the decomposition, then G has edges Y1 → X, …, Yk → X.
- X is annotated with a conditional probability table (CPT) encoding P(X=x|Y1=y1,…,Yk=yk) for each tuple (x, y1,…,yk).
Example: practical problem 1, made easy (same network and CPTs as above). Applying the chain rule gives P(A,B)=P(B|A)P(A), hence the network A → B.
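One way (a sketch, not the course's code) to store such a DGM: each node keeps its parent list and a CPT keyed by the parents' values, and the probability of a full assignment is the product of one CPT lookup per node:

dgm = {
    "A": {"parents": [], "cpt": {(): {"fair": 0.75, "loaded": 0.25}}},
    "B": {"parents": ["A"],
          "cpt": {("fair",):   {"critical": 0.1, "noncritical": 0.9},
                  ("loaded",): {"critical": 0.5, "noncritical": 0.5}}},
}

def joint_prob(assignment, dgm):
    """P(assignment) = product over nodes X of P(X = x | values of Parents(X))."""
    p = 1.0
    for x, node in dgm.items():
        parent_vals = tuple(assignment[y] for y in node["parents"])
        p *= node["cpt"][parent_vals][assignment[x]]
    return p

print(joint_prob({"A": "fair", "B": "critical"}, dgm))   # P(B|A)P(A) = 0.075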
Slide 25

There's more than one network for any distribution.
I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "rolled 19 or 20 with that die".
The same joint can also be written as P(A,B)=P(A|B)P(B), giving the network B → A with CPTs:

B           | P(B)
critical    | …
noncritical | …

B            A      | P(A|B)
critical     fair   | …
noncritical  fair   | …
critical     loaded | …
noncritical  loaded | …

Slide 26
Slide 26

There's more than one network for any distribution.
(Same experiment; the two networks A → B and B → A.)
The moral: we have two things here:
- a "generative story" ("causal model", …)
- a joint probability distribution, e.g. P(A,B)
One decomposition is P(A,B)=P(B|A)P(A); another decomposition is P(A,B)=P(A|B)P(B). Both are totally valid! It's usually cleaner to pick the one that fits a "generative story".

Slide 27
There's more than one network for any distribution.
There are lots of decompositions of a model with N variables. They are all correct. Some are better than others…

Slide 28
There's more than one network for any distribution.
Suppose there are some conditional independencies, e.g.
P(A|B,C,D)=P(A|B)
P(B|C,D)=P(B|C)
Then the first decomposition can be simplified and compressed; the second can't.

Slide 29
The (highly practical) Monty Hall problem
You're in a game show. Behind one door is a prize. Behind the others, goats.
You pick one of three doors, say #1. The host, Monty Hall, opens one of the other doors, say #3, revealing… a goat!
You now can either:
- stick with your guess
- always change doors
- flip a coin and pick a new door randomly according to the coin

Slide 30
Example: practical problem 2

Slide 31
The (highly practical) Monty Hall problem
You're in a game show. Behind one door is a prize. Behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing… a goat! You now can either stick with your guess or change doors.
Variables:
A: first guess
B: the money
C: the revealed goat
D: stick, or swap?
E: second guess

A | P(A)
1 | 0.33
2 | 0.33
3 | 0.33

B | P(B)
1 | 0.33
2 | 0.33
3 | 0.33

D     | P(D)
stick | 0.5
swap  | 0.5

A  B  C | P(C|A,B)
1  1  2 | 0.5
1  1  3 | 0.5
1  2  3 | 1.0
1  3  2 | 1.0
…  …  … | …

Slide 32
The (highly practical) Monty Hall problem
(Same variables and CPTs as the previous slide, plus a CPT for the second guess E:)

A  C  D | P(E|A,C,D)
…  …  … | …

If you stick: you win if your first guess was right.
If you swap: you win if your first guess was wrong.

Slide 33
The (highly practical) Monty Hall problem
(Same variables and CPTs as above.)
…again by the chain rule:
P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)
We could construct the joint and compute P(E=B|D=swap).
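Doing exactly that, a sketch (mine, not the slides') that enumerates the joint using the factored form above, with the deterministic CPTs for C and E filled in:

import itertools

DOORS = (1, 2, 3)

def p_c(c, a, b):
    """P(C=c|A=a,B=b): Monty never opens the guessed door or the money door."""
    if c == a or c == b:
        return 0.0
    return 0.5 if a == b else 1.0   # two legal doors iff guess == money

def p_e(e, a, c, d):
    """P(E=e|A=a,C=c,D=d): stick keeps a; swap takes the door that is neither a nor c."""
    target = a if d == "stick" else next(x for x in DOORS if x not in (a, c))
    return 1.0 if e == target else 0.0

def p_win(d):
    """P(E=B | D=d) by summing the factored joint over A, B, C, E."""
    num = den = 0.0
    for a, b, c, e in itertools.product(DOORS, repeat=4):
        p = (1/3) * (1/3) * p_c(c, a, b) * 0.5 * p_e(e, a, c, d)
        den += p
        num += p if e == b else 0.0
    return num / den

print(p_win("stick"), p_win("swap"))   # 1/3 vs 2/3: swapping doubles your chances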
Slide 34

The (highly practical) Monty Hall problem
(Same variables and CPTs as above.)
…again by the chain rule, this time with no independencies assumed:
P(A,B,C,D,E) = P(E|A,B,C,D) * P(D|A,B,C) * P(C|A,B) * P(B|A) * P(A)
We could construct the joint and compute P(E=B|D=swap).
Slide 35

The (highly practical) Monty Hall problem
(Same variables and CPTs as above.)
The joint table has…?
3*3*3*2*3 = 162 rows
The conditional probability tables (CPTs) shown have…?
3 + 3 + 2 + 3*3*3 + 3*3*2*3 = 89 rows < 162 rows
Big questions:
- why are the CPTs smaller?
- how much smaller are the CPTs than the joint?
- can we compute the answers to queries like P(E=B|d) without building the joint probability tables, just using the CPTs?

Slide 36
The (highly practical) Monty Hall problem
(Same variables and CPTs as above.)
Why is the CPT representation smaller? Follow the money! (B)
E is conditionally independent of B given A, C, D.

Slide 37
Conditional Independence formalized
Definition: R and L are conditionally independent given M if, for all x, y, z in {T, F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally: let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1's assignments | S2's assignments ^ S3's assignments) = P(S1's assignments | S3's assignments)
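The definition can be checked mechanically against a joint distribution. A sketch (mine, not the slides') for three binary variables R, M, L, testing whether R and L are conditionally independent given M:

import itertools

def cond_indep(joint, tol=1e-9):
    """Check P(R=x | M=y ^ L=z) == P(R=x | M=y) for all x, y, z.
    `joint[(r, m, l)]` is the probability of that assignment."""
    for y, z in itertools.product((0, 1), repeat=2):
        p_mz = sum(joint[(r, y, z)] for r in (0, 1))                 # P(M=y, L=z)
        p_m = sum(joint[(r, y, l)] for r in (0, 1) for l in (0, 1))  # P(M=y)
        if p_mz == 0 or p_m == 0:
            continue                     # conditioning event has probability zero
        for x in (0, 1):
            lhs = joint[(x, y, z)] / p_mz                            # P(R=x|M=y,L=z)
            rhs = sum(joint[(x, y, l)] for l in (0, 1)) / p_m        # P(R=x|M=y)
            if abs(lhs - rhs) > tol:
                return False
    return True

# A joint built as P(m) P(r|m) P(l|m) is conditionally independent by construction:
p = {(r, m, l): (0.6 if m else 0.4) * (0.9 if r == m else 0.1) * (0.3 if l else 0.7)
     for r, m, l in itertools.product((0, 1), repeat=3)}
print(cond_indep(p))   # True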
Slide 38

The (highly practical) Monty Hall problem
(A: first guess, B: the money, C: the revealed goat, D: stick or swap?, E: second guess)
What are the conditional independencies?
I<A, {B}, C> ?
I<A, {C}, B> ?
I<E, {A,C}, B> ?
I<D, {E}, B> ?
…

Slide 39

Slide 40
Recap: Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:
- V is a set of vertices.
- E is a set of directed edges joining vertices. No loops of any length are allowed.
Each vertex in V contains the following information:
- the name of a random variable
- a probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parent values.

Slide 41
Building a Bayes Net
1. Choose a set of relevant variables.
2. Choose an ordering for them; call them X1..Xm (where X1 is first in the ordering, etc.).
3. For i = 1 to m:
   a. Add the Xi node to the network.
   b. Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that Xi is conditionally independent of all other members of {X1…Xi-1} given Parents(Xi).
   c. Define the probability table for P(Xi=k | assignments of Parents(Xi)).

Slide 42
The general case
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X1=x1)
= …
= the product over i = 1..n of P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1),
which, using the network's conditional independencies, equals the product over i of P(Xi=xi | assignments of Parents(Xi)).
So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.
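The same two claims in code: a sketch (mine, not the course's) that computes any joint entry as a product of CPT lookups and any conditional probability by enumeration, using the penguin network from earlier:

import itertools

nodes = ["Pg", "B", "F"]
parents = {"Pg": [], "B": ["Pg"], "F": ["Pg", "B"]}
cpt = {"Pg": {(): [0.90, 0.10]},                   # lists indexed by the node's value
       "B":  {(0,): [0.90, 0.10], (1,): [0.0, 1.0]},
       "F":  {(0, 0): [0.90, 0.10], (0, 1): [0.01, 0.99],
              (1, 0): [0.5, 0.5], (1, 1): [0.9999, 0.0001]}}

def joint(assign):
    """One entry of the joint: the product of P(Xi=xi | Parents(Xi))."""
    p = 1.0
    for x in nodes:
        pa = tuple(assign[y] for y in parents[x])
        p *= cpt[x][pa][assign[x]]
    return p

def query(var, val, evidence):
    """P(var=val | evidence), by enumerating all assignments (exponential, but exact)."""
    num = den = 0.0
    for vals in itertools.product((0, 1), repeat=len(nodes)):
        assign = dict(zip(nodes, vals))
        if any(assign[e] != v for e, v in evidence.items()):
            continue
        p = joint(assign)
        den += p
        num += p if assign[var] == val else 0.0
    return num / den

print(query("F", 1, {"B": 1}))   # ~0.469, matching the earlier hand computation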
Slide 43

Question: given a network, can I find a chain-rule decomposition of the joint?

Slide 44
Graphical Models: Determining Conditional Independencies

Slide 45
What Independencies does a Bayes Net Model?
In order for a Bayesian network to model a probability distribution, the following must be true: each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.
This follows from the chain-rule construction of the network. But what else does it imply?

Slide 46
What Independencies does a Bayes Net Model?
Example: the chain Z → Y → X.
Given Y, does learning the value of Z tell us nothing new about X? I.e., is P(X|Y,Z) equal to P(X|Y)?
Yes. Since we know the value of all of X's parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z.
Also, since independence is symmetric, P(Z|Y,X) = P(Z|Y).

Slide 47
Quick proof that independence is symmetric
Assume: P(X|Y,Z) = P(X|Y). Then P(Z|X,Y) = P(Z|Y), by Bayes' rule, the chain rule, and the assumption; a reconstruction of the derivation follows.
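The slide's algebra was rendered as an image; here is a reconstruction (mine) of a derivation matching the step labels:

\begin{align*}
P(Z \mid X, Y) &= \frac{P(X, Z \mid Y)}{P(X \mid Y)}              && \text{(Bayes' rule, within the context } Y\text{)} \\
               &= \frac{P(X \mid Y, Z)\, P(Z \mid Y)}{P(X \mid Y)} && \text{(chain rule)} \\
               &= \frac{P(X \mid Y)\, P(Z \mid Y)}{P(X \mid Y)}    && \text{(by assumption)} \\
               &= P(Z \mid Y).
\end{align*}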
Slide 48

What Independencies does a Bayes Net Model?
Let I<X,Y,Z> represent "X and Z are conditionally independent given Y".
(Here the graph is the common cause X ← Y → Z.)
I<X,Y,Z>? Yes, just as in the previous example: all of X's parents are given, and Z is not a descendant of X.

Slide 49
What Independencies does a Bayes Net Model?
(A larger graph over X, U, V, and Z.)
I<X, {U}, Z>? No.
I<X, {U,V}, Z>? Yes.
Maybe I<X, S, Z> iff S acts as a cutset between X and Z in an undirected version of the graph…?

Slide 50
Things get a little more confusing
(Graph: the collider X → Y ← Z.)
X has no parents, so we know all its parents' values trivially. Z is not a descendant of X. So I<X,{},Z>, even though there's an undirected path from X to Z through an unknown variable Y.
What if we do know the value of Y, though? Or one of its descendants?

Slide 51
The "Burglar Alarm" example
(Graph: Burglar → Alarm ← Earthquake, Alarm → Phone Call.)
Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn't care whether your house is currently being burgled.
While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!

Slide 52
Things get a lot more confusing
(Same network: Burglar → Alarm ← Earthquake, Alarm → Phone Call.)
But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all. Earthquake "explains away" the hypothetical burglar.
But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!

Slide 53

Slide 54
d-separation to the rescue
Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation.
Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is "blocked", where a path is "blocked" iff one or more of the following conditions is true: ...
I.e., X and Z can be dependent only if there exists an unblocked path.

Slide 55
A path is "blocked" when...
There exists a variable Y on the path such that
- it is in the evidence set E, and
- the arcs putting Y in the path are "tail-to-tail" (← Y →).
Or, there exists a variable Y on the path such that
- it is in the evidence set E, and
- the arcs putting Y in the path are "tail-to-head" (→ Y →).
Or, ...
Unknown "common causes" of X and Z impose dependency; unknown "causal chains" connecting X and Z impose dependency.

Slide 56
A path is "blocked" when… (the funky case)
… Or, there exists a variable Y on the path such that
- it is NOT in the evidence set E,
- neither are any of its descendants, and
- the arcs putting Y on the path are "head-to-head" (→ Y ←).
Known "common symptoms" of X and Z impose dependencies… X may "explain away" Z.

Slide 57
d-separation to the rescue, cont'd
Theorem [Verma & Pearl, 1988]: if a set of evidence variables E d-separates X and Z in a Bayesian network's graph, then I<X, E, Z>.
d-separation can be computed in linear time using a depth-first-search-like algorithm.
Be careful: d-separation finds what must be conditionally independent; variables may actually be independent even when they are not d-separated, depending on the actual probabilities involved.
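A sketch of one such algorithm (mine, not the course's code), in the style of the "Bayes ball" reachability procedure: it collects every node reachable from X along unblocked trails given evidence E; any node it cannot reach is d-separated from X.

from collections import deque

def d_separated(x, z, evidence, children):
    """True iff x and z are d-separated given `evidence`.
    `children` maps every node to the list of its children."""
    parents = {n: set() for n in children}
    for n, cs in children.items():
        for c in cs:
            parents[c].add(n)
    # Phase 1: evidence nodes and all their ancestors (these activate colliders).
    anc, stack = set(), list(evidence)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents[n])
    # Phase 2: BFS over (node, direction): "up" = arrived from a child,
    # "down" = arrived from a parent.
    visited, reachable = set(), set()
    queue = deque([(x, "up")])
    while queue:
        n, d = queue.popleft()
        if (n, d) in visited:
            continue
        visited.add((n, d))
        if n not in evidence:
            reachable.add(n)
        if d == "up" and n not in evidence:
            queue.extend((p, "up") for p in parents[n])     # chain / common cause
            queue.extend((c, "down") for c in children[n])
        elif d == "down":
            if n not in evidence:                           # chain continues downward
                queue.extend((c, "down") for c in children[n])
            if n in anc:                                    # head-to-head, activated
                queue.extend((p, "up") for p in parents[n])
    return z not in reachable

# The burglar-alarm network from the earlier slides:
g = {"Burglar": ["Alarm"], "Earthquake": ["Alarm"],
     "Alarm": ["PhoneCall"], "PhoneCall": []}
print(d_separated("Burglar", "Earthquake", set(), g))          # True
print(d_separated("Burglar", "Earthquake", {"PhoneCall"}, g))  # False: explaining away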
Slide 58

d-separation example
(A graph over nodes A, B, C, D, E, F, G, H, I, J; edges as drawn on the slide.)
I<C, {}, D>?
I<C, {A}, D>?
I<C, {A, B}, D>?
I<C, {A, B, J}, D>?
I<C, {A, B, E, J}, D>?