Directed Graphical Models (aka. Bayesian Networks)
Matt Gormley
Lecture 21, November 9, 2016
School of Computer Science
10-601B Introduction to Machine Learning

Readings:
Bishop 8.1 and 8.2.2
Mitchell 6.11
Murphy 10
Reminders
Homework 6: due Mon., Nov. 21
Final Exam: in-class, Wed., Dec. 7
Outline
  Motivation
    Structured Prediction
  Background
    Conditional Independence
    Chain Rule of Probability
  Directed Graphical Models
    Bayesian Network definition
    Qualitative Specification
    Quantitative Specification
    Familiar Models as Bayes Nets
    Example: The Monty Hall Problem
  Conditional Independence in Bayes Nets
    Three case studies
    D-separation
    Markov blanket

Motivation
Structured Prediction
Most of the models we've seen so far were for classification:
  Given observations x = (x_1, x_2, …, x_K), predict a (binary) label y.
Many real-world problems require structured prediction:
  Given observations x = (x_1, x_2, …, x_K), predict a structure y = (y_1, y_2, …, y_J).
Some classification problems benefit from latent structure.
Structured Prediction Examples
Examples of structured prediction:
  Part-of-speech (POS) tagging
  Handwriting recognition
  Speech recognition
  Word alignment
  Congressional voting
Examples of latent structure:
  Object recognition
Dataset for Supervised Part-of-Speech (POS) Tagging
Data: sentences x^(i) paired with tag sequences y^(i), for example:
  Sample 1: "time flies like an arrow" with tags n v p d n
  Sample 2: "time flies like an arrow" with tags n n v d n
  Sample 3: "flies fly with their wings" with tags n v p n n
  Sample 4: "with time you will see" with tags p n n v v
Dataset for Supervised Handwriting Recognition
Data: images of handwritten words x^(i) paired with character sequences y^(i) (three samples shown).
Figures from (Chatzis & Demiris, 2013)
Dataset for Supervised Phoneme (Speech) Recognition
Data: speech signals x^(i) paired with phoneme sequences y^(i) (e.g. h#, ih, w, z, iy; two samples shown).
Figures from (Jansen & Niyogi, 2013)
Application: Word Alignment / Phrase Extraction
Variables (boolean): for each (Chinese phrase, English phrase) pair, are they linked?
Interactions:
  Word fertilities
  Few "jumps" (discontinuities)
  Syntactic reorderings
  "ITG constraint" on alignment
  Phrases are disjoint (?)
(Burkett & Klein, 2012)
Application: Congressional Voting
(Stoyanov & Eisner, 2012)
Variables:
  Representative's vote
  Text of all speeches of a representative
  Local contexts of references between two representatives
Interactions:
  Words used by a representative and their vote
  Pairs of representatives and their local context
Structured Prediction Examples (revisited)
Examples of structured prediction: part-of-speech (POS) tagging, handwriting recognition, speech recognition, word alignment, congressional voting.
Examples of latent structure: object recognition.
Case Study: Object Recognition
Data consists of images x and labels y.
(Four image/label pairs shown, with labels: pigeon, leopard, llama, rhinoceros.)
Case Study: Object Recognition
Data consists of images x and labels y (e.g. "leopard").
  Preprocess data into "patches"
  Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)
  Define a graphical model with these latent variables in mind
  z is not observed at train or test time
Case Study: Object Recognition (continued)
Same setup, now drawn as a graphical model: one latent part variable Z_i per patch X_i (here X_1, …, X_7 and Z_1, …, Z_7), all connected to the image label Y.
Case Study: Object Recognition (continued)
The same model with potentials ψ attached: factors relate each patch X_i to its part label Z_i, neighboring part labels to one another, and the part labels to the image label Y.
Structured Prediction
Preview of challenges to come…
Consider the task of finding the most probable assignment to the output, for classification versus structured prediction.
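Roughly, the contrast the slide is driving at (my notation, not necessarily the slide's):

\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} \; p(y \mid x)

For classification, the label set \mathcal{Y} is small, so the argmax is a cheap enumeration. For structured prediction, y = (y_1, \dots, y_J), so |\mathcal{Y}| grows exponentially in J and brute-force enumeration is infeasible.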
Machine Learning
The data inspires the structures we want to predict; it also tells us what to optimize.
Our model defines a score for each structure.
Learning tunes the parameters of the model.
Inference finds {best structure, marginals, partition function} for a new observation.
(Diagram labels: Domain Knowledge, Mathematical Modeling, Optimization, Combinatorial Optimization, ML.)
(Inference is usually called as a subroutine in learning.)
Machine Learning (diagram)
Data, Model, Objective, Learning, Inference, illustrated with an example graphical model over X_1, …, X_5.
(Inference is usually called as a subroutine in learning.)
Background
Background: Chain Rule of Probability
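For reference, the chain rule is a standard identity (stated here because the slide content itself is not included): any joint distribution factors as

P(X_1, X_2, \dots, X_K) = \prod_{k=1}^{K} P(X_k \mid X_1, \dots, X_{k-1}),

with the product taken in any fixed ordering of the variables.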
Background: Conditional Independence
Later we will also write: I<A, {C}, B>
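For reference, the standard definition behind this notation (not copied from the slides): A is conditionally independent of B given C, written I<A, {C}, B> or A ⊥ B | C, when

P(A, B \mid C) = P(A \mid C)\, P(B \mid C),

or equivalently P(A \mid B, C) = P(A \mid C) whenever P(B, C) > 0.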
Directed Graphical Models (Bayesian Networks)
Whiteboard: Writing Joint Distributions
  Strawman: Giant table
  Alternate #1: Rewrite using the chain rule
  Alternate #2: Assume full independence
  Alternate #3: Drop variables from the RHS of conditionals
(A parameter-counting sketch of these options appears below.)
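A minimal sketch (my own illustration, not from the lecture) of why these options differ, counting free parameters for K binary variables. The parents dict is an invented example graph.

# Sketch: count free parameters needed to specify a joint over K binary variables.

def giant_table(K):
    # One probability per joint assignment, minus 1 for the sum-to-one constraint.
    return 2 ** K - 1

def chain_rule(K):
    # P(X1) P(X2|X1) ... P(XK|X1..XK-1): same total count as the giant table.
    return sum(2 ** k for k in range(K))

def full_independence(K):
    # P(X1) P(X2) ... P(XK): one parameter per variable.
    return K

def bayes_net(parents):
    # Each CPT P(Xi | parents(Xi)) needs one free parameter per parent configuration.
    return sum(2 ** len(pa) for pa in parents.values())

K = 5
# Hypothetical DAG over X1..X5 (just an illustration).
parents = {"X1": [], "X2": ["X1"], "X3": ["X1"], "X4": ["X2", "X3"], "X5": ["X3"]}
print(giant_table(K), chain_rule(K), full_independence(K), bayes_net(parents))
# -> 31 31 5 11

Dropping variables from the right-hand side of the conditionals (Alternate #3) is exactly what a Bayes net does, which is why its count sits between full independence and the giant table.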
Bayesian Network
Definition (illustrated with an example graph over X_1, …, X_5).
Bayesian Network
Definition: A Bayesian network is a directed graphical model. It consists of a graph G and the conditional probabilities P. These two parts fully specify the distribution:
  Qualitative specification: G
  Quantitative specification: P
(Example graph over X_1, …, X_5. A small code sketch of this factorization follows.)
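A minimal sketch (mine, not the lecture's) of how G and P together define a joint distribution: the graph gives each node's parents, the CPTs give P(X_i | parents), and the joint is their product. All names and numbers below are made up.

# A tiny Bayesian network over binary variables, specified as (parents, CPTs).
# The joint probability is the product of each node's CPT entry given its parents.

parents = {"A": [], "B": ["A"], "C": ["A", "B"]}

# CPTs: map (parent assignment tuple) -> P(node = 1 | parents). Values are invented.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.8, (1,): 0.4},
    "C": {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.9},
}

def joint(assignment):
    """P(assignment) = prod_i P(x_i | parents(x_i)) under the CPTs above."""
    p = 1.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[v] for v in pa)
        p1 = cpt[var][pa_vals]
        p *= p1 if assignment[var] == 1 else 1.0 - p1
    return p

print(joint({"A": 1, "B": 0, "C": 1}))  # 0.3 * 0.6 * 0.5 = 0.09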
Qualitative Specification
Where does the qualitative specification come from?
  Prior knowledge of causal relationships
  Prior knowledge of modular relationships
  Assessment from experts
  Learning from data
  We simply like a certain architecture (e.g. a layered graph)
  …
© Eric Xing @ CMU, 2006-2011
Whiteboard (if time): Example: 2016 Presidential Election
Towards a quantitative specification of the probability distribution
Separation properties in the graph imply independence properties about the associated variables.
For the graph to be useful, any conditional independence properties we can derive from the graph should hold for the probability distribution that the graph represents.
The Equivalence Theorem: For a graph G, let D1 denote the family of all distributions that satisfy I(G), and let D2 denote the family of all distributions that factor according to G. Then D1 ≡ D2.
© Eric Xing @ CMU, 2006-2011
Quantitative Specification
Example graph over A, B, C; p(A,B,C) = …
© Eric Xing @ CMU, 2006-2011
Conditional probability tables (CPTs)
Graph: A and B are parents of C; C is the parent of D.
Factorization: P(a,b,c,d) = P(a)P(b)P(c|a,b)P(d|c)

P(A):  a0  0.75    a1  0.25
P(B):  b0  0.33    b1  0.67

P(C | A,B):
        a0,b0   a0,b1   a1,b0   a1,b1
  c0    0.45    1       0.9     0.7
  c1    0.55    0       0.1     0.3

P(D | C):
        c0     c1
  d0    0.3    0.5
  d1    0.7    0.5

(A CPT-based joint calculation is sketched below.)
© Eric Xing @ CMU, 2006-2011
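A small sketch (mine, not Xing's) that encodes exactly these CPTs and multiplies them out; it evaluates one joint entry and checks that the full joint sums to one.

# CPTs copied from the tables above (indexing: P_c[(a, b)][c], P_d[c][d]).
P_a = {0: 0.75, 1: 0.25}
P_b = {0: 0.33, 1: 0.67}
P_c = {(0, 0): {0: 0.45, 1: 0.55},
       (0, 1): {0: 1.0,  1: 0.0},
       (1, 0): {0: 0.9,  1: 0.1},
       (1, 1): {0: 0.7,  1: 0.3}}
P_d = {0: {0: 0.3, 1: 0.7},
       1: {0: 0.5, 1: 0.5}}

def joint(a, b, c, d):
    # P(a,b,c,d) = P(a) P(b) P(c|a,b) P(d|c)
    return P_a[a] * P_b[b] * P_c[(a, b)][c] * P_d[c][d]

# One entry of the joint:
print(joint(a=1, b=0, c=0, d=1))  # 0.25 * 0.33 * 0.9 * 0.7 = 0.051975

# Sanity check: the full joint sums to 1.
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
print(round(total, 10))  # 1.0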
Conditional probability density functions (CPDs)
Same graph (A and B are parents of C; C is the parent of D), now with continuous variables:
  P(a,b,c,d) = P(a)P(b)P(c|a,b)P(d|c)
  A ~ N(μ_a, Σ_a)
  B ~ N(μ_b, Σ_b)
  C ~ N(A + B, Σ_c)
  D ~ N(μ_a + C, Σ_a)
(The accompanying plot shows P(D | C) as a function of C and D.)
© Eric Xing @ CMU, 2006-2011
Conditional Independencies
A label Y whose children are the features X_1, X_2, …, X_{n-1}, X_n.
What is this model
  when Y is observed?
  when Y is unobserved?
© Eric Xing @ CMU, 2006-2011
Conditionally Independent Observations
Model parameters θ; data = {y_1, …, y_n}.
Each observation is a child of θ, so the observations are conditionally independent given θ.
© Eric Xing @ CMU, 2006-2011
"Plate" Notation
Model parameters θ; data = {x_1, …, x_n}, drawn as a node X_i inside a plate labeled i = 1:n.
A plate is a rectangle in a graphical model: variables within a plate are replicated in a conditionally independent manner.
© Eric Xing @ CMU, 2006-2011
Example: Gaussian Model
Plate model: x_i for i = 1:n, with parameters μ and σ.
Generative model:
  p(x_1, …, x_n | μ, σ) = ∏_i p(x_i | μ, σ) = p(data | parameters) = p(D | θ), where θ = {μ, σ}
Likelihood = p(data | parameters) = p(D | θ) = L(θ)
Likelihood tells us how likely the observed data are conditioned on a particular setting of the parameters.
It is often easier to work with the log-likelihood, log L(θ).
(A numerical sketch of this likelihood follows.)
© Eric Xing @ CMU, 2006-2011
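A minimal sketch (mine), assuming a univariate Gaussian and made-up data, of evaluating log L(θ) for two parameter settings.

import math

# Log-likelihood of i.i.d. data under a univariate Gaussian N(mu, sigma^2).
def log_likelihood(data, mu, sigma):
    # log L(theta) = sum_i log p(x_i | mu, sigma)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2)
               for x in data)

D = [2.1, 1.7, 2.5, 1.9, 2.2]   # invented observations
print(log_likelihood(D, mu=2.0, sigma=0.5))   # higher (better fit) ...
print(log_likelihood(D, mu=0.0, sigma=0.5))   # ... than a poor parameter setting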
Bayesian models
Same plate model (x_i for i = 1:n), now treating the parameters θ as a random variable with a prior.
© Eric Xing @ CMU, 2006-2011
More examples
  Density estimation: parametric and nonparametric methods
  Regression: linear, conditional mixture, nonparametric
  Classification: generative and discriminative approach
(Small graphical models for each are shown, e.g. θ → X and X → Y.)
© Eric Xing @ CMU, 2006-2011
Example: The Monty Hall Problem
Slide from William Cohen (extra slides from last semester)
The (highly practical) Monty Hall problem
You're in a game show. Behind one door is a prize. Behind the others, goats.
You pick one of three doors, say #1.
The host, Monty Hall, opens one door, revealing… a goat!
You now can either:
  stick with your guess,
  always change doors, or
  flip a coin and pick a new door randomly according to the coin.
Slide from William Cohen (extra slides from last semester)
The (highly practical) Monty Hall problem
You're in a game show. Behind one door is a prize. Behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing… a goat! You now can either stick with your guess or change doors.
Variables: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess.

P(A):  A=1  0.33   A=2  0.33   A=3  0.33
P(B):  B=1  0.33   B=2  0.33   B=3  0.33
P(D):  Stick  0.5   Swap  0.5

P(C | A,B) (first few rows):
  A  B  C  P(C|A,B)
  1  1  2  0.5
  1  1  3  0.5
  1  2  3  1.0
  1  3  2  1.0
  …  …  …  …

Slide from William Cohen (extra slides from last semester)
The (highly practical) Monty Hall problem (continued)
Variables: A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess.
The CPTs for A, B, and C are as on the previous slide; there is also a table for P(E | A,C,D).
If you stick: you win if your first guess was right.
If you swap: you win if your first guess was wrong.
Slide from William Cohen (extra slides from last semester)
The (highly practical) Monty Hall problem (continued)
(Same variables and CPTs as above.)
…again by the chain rule:
  P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)
We could construct the joint and compute P(E=B | D=swap). (A brute-force version of this computation is sketched below.)
Slide from William Cohen (extra slides from last semester)
The (highly practical) Monty Hall problem (continued)
(Same variables and CPTs as above.)
…again by the chain rule:
  P(A,B,C,D,E) = P(E|A,B,C,D) * P(D|A,B,C) * P(C|A,B) * P(B|A) * P(A)
We could construct the joint and compute P(E=B | D=swap).
Slide from William Cohen (extra slides from last semester)
The (highly practical) Monty Hall problem (continued)
(Same variables and CPTs as above.)
The joint table has…? 3*3*3*2*3 = 162 rows.
The conditional probability tables (CPTs) shown have…? 3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows.
Big questions:
  Why are the CPTs smaller?
  How much smaller are the CPTs than the joint?
  Can we compute the answers to queries like P(E=B | d) without building the joint probability tables, just using the CPTs?
Slide from William Cohen (extra slides from last semester)
The (highly practical) Monty Hall problem (continued)
(Same variables and CPTs as above.)
Why is the CPT representation smaller? Follow the money! (B)
E is conditionally independent of B given A, C, D.
Slide from William Cohen (extra slides from last semester)
The (highly practical) Monty Hall problem (continued)
(Variables: A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess.)
What are the conditional independencies?
  I<A, {B}, C> ?
  I<A, {C}, B> ?
  I<E, {A,C}, B> ?
  I<D, {E}, B> ?
  …
Slide from William Cohen (extra slides from last semester)
Graphical Models: Determining Conditional Independencies
Slide from William Cohen
What Independencies does a Bayes Net Model?
In order for a Bayesian network to model a probability distribution, the following must be true: each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.
This follows from the factorization P(X_1, …, X_n) = ∏_i P(X_i | parents(X_i)).
But what else does it imply?
Slide from William Cohen
What Independencies does a Bayes Net Model?
Three cases of interest…
  Cascade: X → Y → Z
  Common Parent: X ← Y → Z
  V-Structure: X → Y ← Z
What Independencies does a Bayes Net Model?
Three cases of interest…
  Cascade (X → Y → Z): knowing Y decouples X and Z.
  Common Parent (X ← Y → Z): knowing Y decouples X and Z.
  V-Structure (X → Y ← Z): knowing Y couples X and Z.
(A small numerical check of the common-parent case appears below.)
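A quick numerical sketch (my own, with invented CPTs) of the common-parent case: X and Z are correlated marginally, but become independent once Y is observed.

from itertools import product

# Common parent: Y -> X, Y -> Z, with made-up binary CPTs.
P_y = {0: 0.6, 1: 0.4}
P_x_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(X=x | Y=y)
P_z_given_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}   # P(Z=z | Y=y)

def p_xyz(x, y, z):
    return P_y[y] * P_x_given_y[y][x] * P_z_given_y[y][z]

# Marginally: P(X=1, Z=1) != P(X=1) * P(Z=1)  ->  X and Z are dependent.
p_x1z1 = sum(p_xyz(1, y, 1) for y in (0, 1))
p_x1 = sum(p_xyz(1, y, z) for y, z in product((0, 1), repeat=2))
p_z1 = sum(p_xyz(x, y, 1) for x, y in product((0, 1), repeat=2))
print(p_x1z1, p_x1 * p_z1)          # 0.306 vs. 0.2052: not equal

# Given Y=y: P(X=1, Z=1 | Y=y) == P(X=1 | Y=y) * P(Z=1 | Y=y)  ->  decoupled.
for y in (0, 1):
    p_x1z1_given_y = p_xyz(1, y, 1) / P_y[y]
    print(p_x1z1_given_y, P_x_given_y[y][1] * P_z_given_y[y][1])   # equal in both rows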
Whiteboard: proof of conditional independence for the Common Parent case (X ← Y → Z).
(The other two cases can be shown just as easily.)
The "Burglar Alarm" example
Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes.
Earth arguably doesn't care whether your house is currently being burgled.
While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!
Graph: Burglar → Alarm ← Earthquake, Alarm → Phone Call.
Slide from William Cohen
Quiz: True or False?
The "Burglar Alarm" example
But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all.
Earthquake "explains away" the hypothetical burglar.
But then it must not be the case that Burglar ⊥ Earthquake | Phone Call, even though Burglar ⊥ Earthquake.
Slide from William Cohen
D-Separation (Definition #1)
Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation.
Definition: variables X and Z are d-separated (conditionally independent) given a set of evidence variables E iff every undirected path from X to Z is "blocked", where a path is "blocked" iff one or more of the following conditions is true: …
i.e., X and Z are dependent iff there exists an unblocked path.
Slide from William Cohen
D-Separation (Definition #1)
A path is "blocked" when…
  There exists a variable Y on the path such that
    it is in the evidence set E, and
    the arcs putting Y in the path are "tail-to-tail".
  (Unknown "common causes" of X and Z impose dependency.)
  Or, there exists a variable Y on the path such that
    it is in the evidence set E, and
    the arcs putting Y in the path are "tail-to-head".
  (Unknown "causal chains" connecting X and Z impose dependency.)
  Or, …
Slide from William Cohen
D-Separation (Definition #1)
A path is "blocked" when…
  … Or, there exists a variable Y on the path such that
    it is NOT in the evidence set E,
    neither are any of its descendants, and
    the arcs putting Y on the path are "head-to-head".
  (Known "common symptoms" of X and Z impose dependencies… X may "explain away" Z.)
Slide from William Cohen
D-Separation (Definition #2)
D-separation criterion for Bayesian networks (D for Directed edges):
Definition: variables X and Y are D-separated (conditionally independent) given Z if they are separated in the moralized ancestral graph.
(Example shown on the slide; a code sketch of this criterion follows.)
© Eric Xing @ CMU, 2006-2011
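A sketch (mine, not from the slides) of Definition #2: restrict to the ancestral graph of X ∪ Y ∪ Z, moralize (marry co-parents and drop edge directions), delete the evidence nodes Z, and check whether X and Y are still connected. The example DAG at the end is the burglar-alarm graph from earlier.

from itertools import combinations

def d_separated(dag, xs, ys, zs):
    """Definition #2: X and Y are d-separated given Z iff they are
    separated in the moralized ancestral graph of X ∪ Y ∪ Z.
    dag maps each node to the set of its parents; xs, ys, zs are sets."""
    # 1. Ancestral graph: keep only X ∪ Y ∪ Z and all of their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(dag[node])
    # 2. Moralize: undirected edges between each node and its parents,
    #    plus edges between every pair of co-parents.
    adj = {n: set() for n in keep}
    for n in keep:
        parents = dag[n] & keep
        for p in parents:
            adj[n].add(p)
            adj[p].add(n)
        for p, q in combinations(parents, 2):
            adj[p].add(q)
            adj[q].add(p)
    # 3. Remove evidence nodes, then check reachability from X to Y.
    frontier, seen = list(xs - zs), set(xs - zs)
    while frontier:
        n = frontier.pop()
        if n in ys:
            return False               # still connected -> not d-separated
        for m in adj[n] - zs:
            if m not in seen:
                seen.add(m)
                frontier.append(m)
    return True

# Burglar -> Alarm <- Earthquake, Alarm -> Call.
dag = {"B": set(), "E": set(), "A": {"B", "E"}, "C": {"A"}}
print(d_separated(dag, {"B"}, {"E"}, set()))   # True: marginally independent
print(d_separated(dag, {"B"}, {"E"}, {"C"}))   # False: observing the call couples them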
D-Separation
Theorem [Verma & Pearl, 1998]: If a set of evidence variables E d-separates X and Z in a Bayesian network's graph, then I<X, E, Z>.
d-separation can be computed in linear time using a depth-first-search-like algorithm.
Be careful: d-separation finds what must be conditionally independent.
"Might": variables may actually be independent when they're not d-separated, depending on the actual probabilities involved.
Slide from William Cohen
"Bayes-ball" and D-Separation
X is d-separated (directed-separated) from Z given Y if we can't send a ball from any node in X to any node in Z using the "Bayes-ball" algorithm illustrated below (plus some boundary conditions).
Defn: I(G) = all independence properties that correspond to d-separation.
D-separation is sound and complete.
© Eric Xing @ CMU, 2006-2011
Markov Blanket
A node is conditionally independent of every other node in the network outside its Markov blanket.
(The figure labels a node X together with its Parent, Child, Co-parents, and more distant Ancestor and Descendent nodes; the Markov blanket consists of the parents, children, and co-parents.)
(A small code sketch for reading the blanket off a DAG follows.)
© Eric Xing @ CMU, 2006-2011
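A minimal sketch (mine): given a DAG as a parents map, the Markov blanket of a node is its parents, its children, and its children's other parents (co-parents). The example graph mirrors the figure's labels and is invented.

def markov_blanket(dag, node):
    """dag maps each node to the set of its parents.
    Markov blanket = parents ∪ children ∪ co-parents (children's other parents)."""
    parents = set(dag[node])
    children = {n for n, pa in dag.items() if node in pa}
    coparents = {p for c in children for p in dag[c]} - {node}
    return parents | children | coparents

# Invented example DAG matching the figure's node labels.
dag = {
    "Ancestor": set(),
    "Parent": {"Ancestor"},
    "X": {"Parent"},
    "CoParent": set(),
    "Child": {"X", "CoParent"},
    "Descendent": {"Child"},
}
print(sorted(markov_blanket(dag, "X")))  # ['Child', 'CoParent', 'Parent']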
Summary: Bayesian Networks
Structure: DAG
Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket.
Local conditional distributions (CPDs) and the DAG completely determine the joint distribution.
They give causality relationships and facilitate a generative process.
© Eric Xing @ CMU, 2006-2011