Presentation Transcript

Slide1

Directed Graphical Models (aka Bayesian Networks)

1

Matt Gormley
Lecture 21
November 9, 2016

School of Computer Science

Readings:
Bishop 8.1 and 8.2.2
Mitchell 6.11
Murphy 10

10-601B Introduction to Machine Learning

Slide2

Reminders

Homework 6 due Mon., Nov. 21

Final Exam in-class Wed., Dec. 7

2

Slide3

Outline

Motivation
Structured Prediction
Background
Conditional Independence
Chain Rule of Probability
Directed Graphical Models
Bayesian Network definition
Qualitative Specification
Quantitative Specification
Familiar Models as Bayes Nets
Example: The Monty Hall Problem
Conditional Independence in Bayes Nets
Three case studies
D-separation
Markov blanket

3

Slide4

Motivation

4

Slide5

Structured Prediction

Most of the models we've seen so far were for classification:
Given observations x = (x_1, x_2, …, x_K), predict a (binary) label y.

Many real-world problems require structured prediction:
Given observations x = (x_1, x_2, …, x_K), predict a structure y = (y_1, y_2, …, y_J).

Some classification problems benefit from latent structure.

5

Slide6

Structured Prediction Examples

Examples of structured prediction:
Part-of-speech (POS) tagging
Handwriting recognition
Speech recognition
Word alignment
Congressional voting

Examples of latent structure:
Object recognition

6

Slide7

Dataset for Supervised Part-of-Speech (POS) Tagging

7

Data: pairs of sentences x^(i) and tag sequences y^(i), for example:

Sample 1: time/n flies/v like/p an/d arrow/n
Sample 2: time/n flies/n like/v an/d arrow/n
Sample 3: flies/n fly/v with/p their/n wings/n
Sample 4: with/p time/n you/n will/v see/v

Slide8

Dataset for Supervised Handwriting Recognition

8

Data: images of handwritten words x^(i) paired with their character label sequences y^(i).

Figures from (Chatzis & Demiris, 2013)

Slide9

Dataset for Supervised Phoneme (Speech) Recognition

9

Data: speech signals x^(i) paired with phoneme label sequences y^(i) (e.g. h#, ih, w, z, iy, dh, s, uh, f, r, ao, ah).

Figures from (Jansen & Niyogi, 2013)

Slide10

Application: Word Alignment / Phrase Extraction

Variables (boolean): for each (Chinese phrase, English phrase) pair, are they linked?

Interactions:
Word fertilities
Few "jumps" (discontinuities)
Syntactic reorderings
"ITG constraint" on alignment
Phrases are disjoint (?)

10

(Burkett & Klein, 2012)

Slide11

Application: Congressional Voting

11

(Stoyanov & Eisner, 2012)

Variables:
Representative's vote
Text of all speeches of a representative
Local contexts of references between two representatives

Interactions:
Words used by a representative and their vote
Pairs of representatives and their local context

Slide12

Structured Prediction Examples

Examples of structured prediction:
Part-of-speech (POS) tagging
Handwriting recognition
Speech recognition
Word alignment
Congressional voting

Examples of latent structure:
Object recognition

12

Slide13

Case Study: Object Recognition

Data consists of images x and labels y.

13

Data: images x^(i) with labels y^(i), e.g. pigeon, leopard, llama, rhinoceros.

Slide14

Case Study: Object Recognition

Data consists of images x and labels y.

14

Preprocess data into "patches"

Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)

Define graphical model with these latent variables in mind

z is not observed at train or test time

Example image label: leopard

Slide15

Case Study: Object Recognition

Data consists of images x and labels y.

15

Preprocess data into "patches"

Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)

Define graphical model with these latent variables in mind

z is not observed at train or test time

Example image label: leopard

[Figure: graphical model with observed patch variables X1–X7, latent part labels Z1–Z7, and image label Y]

Slide16

Case Study: Object Recognition

Data consists of images x and labels y.

16

Preprocess data into "patches"

Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass)

Define graphical model with these latent variables in mind

z is not observed at train or test time

Example image label: leopard

[Figure: factor graph over the patch variables X1–X7, latent part labels Z1–Z7, and image label Y, with potentials ψ connecting neighboring variables]

Slide17

Structured Prediction

17

Preview of challenges to come…

Consider the task of finding the most probable assignment to the output, for classification vs. structured prediction:
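The contrast can be written out explicitly; the exact equations shown on the slide were not transcribed, so the following is a standard rendering under the binary-label assumption from the earlier slide:

```latex
\text{Classification:}\quad \hat{y} = \operatorname*{argmax}_{y \in \{+1,-1\}} p(y \mid \boldsymbol{x})
\qquad
\text{Structured prediction:}\quad \hat{\boldsymbol{y}} = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}(\boldsymbol{x})} p(\boldsymbol{y} \mid \boldsymbol{x})
```

In the structured case the search space Y(x) is typically exponential in the output length J, which is the main challenge previewed here.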

Slide18

Machine Learning

18

The data inspires the structures we want to predict. It also tells us what to optimize. (Domain Knowledge)

Our model defines a score for each structure. (Mathematical Modeling)

Learning tunes the parameters of the model. (Optimization)

Inference finds {best structure, marginals, partition function} for a new observation. (Combinatorial Optimization)

(Inference is usually called as a subroutine in learning)

Slide19

Machine Learning

19

Data, Model, Objective, Learning, Inference

(Inference is usually called as a subroutine in learning)

[Figure: example graphical model over variables X1–X5]

Slide20

Background

20

Slide21

Background: Chain Rule of Probability
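For reference, the identity this background slide reviews (the whiteboard derivation itself was not transcribed):

```latex
p(x_1, x_2, \dots, x_n) \;=\; \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
```

which holds for any ordering of the variables.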

21

Slide22

Background: Conditional Independence

22
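The definition being reviewed, in symbols (a standard statement; the whiteboard content itself was not transcribed): A is conditionally independent of B given C iff

```latex
A \perp B \mid C \;\iff\; P(A, B \mid C) = P(A \mid C)\, P(B \mid C)
```

which is the property that the I<A, {C}, B> notation below abbreviates.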

Later we will also write: I<A, {C}, B>

Slide23

Directed Graphical Models (Bayesian Networks)

23

Slide24

Whiteboard: Writing Joint Distributions

Strawman: Giant Table
Alternate #1: Rewrite using chain rule
Alternate #2: Assume full independence
Alternate #3: Drop variables from RHS of conditionals
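A worked comparison of these options for K binary variables may help; this is a sketch of the usual parameter-counting argument, not a transcription of the whiteboard:

```latex
\begin{aligned}
\text{Giant table: } & p(x_1,\dots,x_K) \text{ stored entry by entry} && \Rightarrow\; 2^K - 1 \text{ free parameters}\\
\text{Alt. 1 (chain rule): } & p(x_1,\dots,x_K) = \textstyle\prod_{i=1}^{K} p(x_i \mid x_1,\dots,x_{i-1}) && \text{exact, but still } 2^K - 1 \text{ parameters}\\
\text{Alt. 2 (full independence): } & p(x_1,\dots,x_K) \approx \textstyle\prod_{i=1}^{K} p(x_i) && \Rightarrow\; K \text{ parameters}\\
\text{Alt. 3 (drop variables from RHS): } & p(x_1,\dots,x_K) \approx \textstyle\prod_{i=1}^{K} p(x_i \mid \text{a chosen subset of } x_1,\dots,x_{i-1}) && \text{(this is a Bayesian network)}
\end{aligned}
```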

24

Slide25

Bayesian Network

25

[Figure: example directed graph over variables X1–X5]

Definition:

Slide26

Bayesian Network

26

[Figure: example directed graph over variables X1–X5]

Definition:

Slide27

Bayesian Network

A Bayesian Network is a directed graphical model. It consists of a graph G and the conditional probabilities P. These two parts fully specify the distribution:

Qualitative Specification: G
Quantitative Specification: P

27

[Figure: example directed graph over variables X1–X5]

Definition:
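The definition the slide is building up to is the usual factorization; the edge set of the pictured X1–X5 graph is not recoverable from the transcript, so the five-variable example below assumes one illustrative structure.

```latex
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{parents}(X_i)\bigr),
\qquad \text{e.g.} \quad
P(X_1,\dots,X_5) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1)\,P(X_4 \mid X_2, X_3)\,P(X_5 \mid X_3)
```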

Slide28

Qualitative Specification

Where does the qualitative specification come from?
Prior knowledge of causal relationships
Prior knowledge of modular relationships
Assessment from experts
Learning from data
We simply like a certain architecture (e.g. a layered graph)
…

© Eric Xing @ CMU, 2006-2011

28

Slide29

Whiteboard

If time… Example: 2016 Presidential Election

29

Slide30

Towards quantitative specification of probability distributions

Separation properties in the graph imply independence properties about the associated variables. For the graph to be useful, any conditional independence properties we can derive from the graph should hold for the probability distribution that the graph represents.

The Equivalence Theorem: For a graph G, let D1 denote the family of all distributions that satisfy I(G), and let D2 denote the family of all distributions that factor according to G. Then D1 ≡ D2.

30

© Eric Xing @ CMU, 2006-2011

Slide31

Quantitative Specification

[Figure: example graph over A, B, C]

p(A,B,C) = …

31

© Eric Xing @ CMU, 2006-2011

Slide32

P(a,b,c,d) = P(a)P(b)P(c|a,b)P(d|c)

[Figure: graph with A → C, B → C, C → D]

Conditional probability tables (CPTs):

P(A):  a0 = 0.75,  a1 = 0.25
P(B):  b0 = 0.33,  b1 = 0.67

P(C|A,B):
        a0,b0   a0,b1   a1,b0   a1,b1
  c0    0.45    1.0     0.9     0.7
  c1    0.55    0.0     0.1     0.3

P(D|C):
        c0      c1
  d0    0.3     0.5
  d1    0.7     0.5
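As a concrete illustration of how these CPTs determine the full joint, here is a minimal Python sketch (not part of the slides) that stores the tables above as dictionaries and multiplies them out:

```python
# Minimal sketch: computing joint probabilities from the CPTs above.
# Table values are the ones shown on the slide; variable names are illustrative.

P_A = {0: 0.75, 1: 0.25}
P_B = {0: 0.33, 1: 0.67}
# P(C | A, B), indexed as (a, b) -> {c: prob}
P_C_given_AB = {
    (0, 0): {0: 0.45, 1: 0.55},
    (0, 1): {0: 1.0,  1: 0.0},
    (1, 0): {0: 0.9,  1: 0.1},
    (1, 1): {0: 0.7,  1: 0.3},
}
# P(D | C), indexed as c -> {d: prob}
P_D_given_C = {
    0: {0: 0.3, 1: 0.7},
    1: {0: 0.5, 1: 0.5},
}

def joint(a, b, c, d):
    """P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c)."""
    return P_A[a] * P_B[b] * P_C_given_AB[(a, b)][c] * P_D_given_C[c][d]

# Sanity check: the 16 joint entries should sum to 1.
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
print(total)  # ≈ 1.0
```

Each joint entry is a product of one entry from each CPT, which is exactly why the CPTs can be so much smaller than the joint table.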

32

© Eric Xing @ CMU, 2006-2011

Slide33

[Figure: graph with A → C, B → C, C → D]

P(a,b,c,d) = P(a)P(b)P(c|a,b)P(d|c)

Conditional probability density functions (CPDs):

A ~ N(μ_a, Σ_a)
B ~ N(μ_b, Σ_b)
C ~ N(A + B, Σ_c)
D ~ N(μ_a + C, Σ_a)

[Figure: plot of the conditional density P(D | C)]
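To make the generative reading of these CPDs concrete, here is a small Python sketch (not from the slides) that draws ancestral samples from the network; the scalar means and variances are made-up stand-ins for the μ's and Σ's above:

```python
import numpy as np

# Ancestral-sampling sketch for the linear-Gaussian network above.
# The scalar means/variances are illustrative stand-ins for the slide's mu's and Sigma's.
rng = np.random.default_rng(0)
mu_a, var_a = 0.0, 1.0
mu_b, var_b = 1.0, 2.0
var_c = 0.5

def sample_once():
    a = rng.normal(mu_a, np.sqrt(var_a))      # A ~ N(mu_a, var_a)
    b = rng.normal(mu_b, np.sqrt(var_b))      # B ~ N(mu_b, var_b)
    c = rng.normal(a + b, np.sqrt(var_c))     # C ~ N(A + B, var_c)
    d = rng.normal(mu_a + c, np.sqrt(var_a))  # D ~ N(mu_a + C, var_a), as written on the slide
    return a, b, c, d

samples = np.array([sample_once() for _ in range(10000)])
print(samples.mean(axis=0))  # empirical means of (A, B, C, D)
```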

33

© Eric Xing @ CMU, 2006-2011

Slide34

Conditional Independencies

[Figure: label Y with children (features) X1, X2, …, Xn-1, Xn]

What is this model when Y is observed? When Y is unobserved?

34

© Eric Xing @ CMU, 2006-2011

Slide35

Conditionally Independent Observations

θ: model parameters
Data = {y1, …, yn}

[Figure: parameter node θ with children X1, X2, …, Xn-1, Xn]

35

© Eric Xing @ CMU, 2006-2011

Slide36

"Plate" Notation

θ: model parameters
Data = {x1, …, xn}

[Figure: node Xi inside a plate labeled i=1:n, with parent θ]

Plate = rectangle in a graphical model; variables within a plate are replicated in a conditionally independent manner.

36

© Eric Xing @ CMU, 2006-2011

Slide37

Example: Gaussian Model

[Figure: node x_i inside a plate labeled i=1:n, with parents μ and σ]

Generative model:
p(x1, …, xn | μ, σ) = ∏_i p(x_i | μ, σ) = p(data | parameters) = p(D | θ), where θ = {μ, σ}

Likelihood = p(data | parameters) = p(D | θ) = L(θ)

Likelihood tells us how likely the observed data are conditioned on a particular setting of the parameters.

Often easier to work with log L(θ).
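Writing the log-likelihood out for this model (a standard identity, stated here for reference):

```latex
\log L(\theta) \;=\; \sum_{i=1}^{n} \log \mathcal{N}(x_i \mid \mu, \sigma^2)
\;=\; -\frac{n}{2}\log\bigl(2\pi\sigma^2\bigr) \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2
```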

37

© Eric Xing @ CMU, 2006-2011

Slide38

Bayesian models

[Figure: node x_i inside a plate labeled i=1:n, with parent θ; here θ itself is a random variable with a prior]
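The transcript preserves only the plate diagram for this slide, so the standard formulas it corresponds to are spelled out here: the parameters θ get a prior and we reason about their posterior,

```latex
p(\theta, x_{1:n}) = p(\theta)\prod_{i=1}^{n} p(x_i \mid \theta),
\qquad
p(\theta \mid x_{1:n}) = \frac{p(\theta)\prod_{i=1}^{n} p(x_i \mid \theta)}{\int p(\theta')\prod_{i=1}^{n} p(x_i \mid \theta')\, d\theta'}
```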

38

© Eric Xing @ CMU, 2006-2011

Slide39

More examples

Density estimation: parametric and nonparametric methods
Regression: linear, conditional mixture, nonparametric
Classification: generative and discriminative approach

[Figure: small graphical models for each task, e.g. Q → X for density estimation and X → Y for regression/classification]

39

© Eric Xing @ CMU, 2006-2011

Slide40

Example: The Monty Hall Problem

40

Slide from William Cohen (extra slides from last semester)

Slide41

The (highly practical) Monty Hall problem

You're in a game show. Behind one door is a prize. Behind the others, goats.
You pick one of three doors, say #1.
The host, Monty Hall, opens one door, revealing… a goat!

You now can either
stick with your guess
always change doors
flip a coin and pick a new door randomly according to the coin

Slide from William Cohen (extra slides from last semester)

Slide42

The (highly practical) Monty Hall problem

You're in a game show. Behind one door is a prize. Behind the others, goats.
You pick one of three doors, say #1.
The host, Monty Hall, opens one door, revealing… a goat!
You now can either stick with your guess or change doors.

Variables:
A: First guess
B: The money
C: The revealed goat
D: Stick, or swap?
E: Second guess

P(A):  A=1: 0.33,  A=2: 0.33,  A=3: 0.33
P(B):  B=1: 0.33,  B=2: 0.33,  B=3: 0.33
P(D):  Stick: 0.5,  Swap: 0.5

P(C|A,B) (rows shown for A=1):
  A=1, B=1, C=2:  0.5
  A=1, B=1, C=3:  0.5
  A=1, B=2, C=3:  1.0
  A=1, B=3, C=2:  1.0

Slide from William Cohen (extra slides from last semester)

Slide43

The (highly practical) Monty Hall problem

Variables:
A: First guess
B: The money
C: The goat
D: Stick or swap?
E: Second guess

(CPTs for A, B, and C|A,B as on the previous slide.)

P(E|A,C,D):
If you stick: you win if your first guess was right.
If you swap: you win if your first guess was wrong.

Slide from William Cohen (extra slides from last semester)

Slide44

The (highly practical) Monty Hall problem

Variables: A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess)

(CPTs for A, B, and C|A,B as on the previous slide; E has CPT P(E|A,C,D).)

…again by the chain rule:
P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)

We could construct the joint and compute P(E=B | D=swap).

Slide from William Cohen (extra slides from last semester)

Slide45

The (highly practical) Monty Hall problem

Variables: A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess)

(CPTs for A, B, and C|A,B as on the previous slide; E has CPT P(E|A,C,D).)

…again by the chain rule:
P(A,B,C,D,E) = P(E|A,B,C,D) * P(D|A,B,C) * P(C|A,B) * P(B|A) * P(A)

We could construct the joint and compute P(E=B | D=swap).

Slide from William Cohen (extra slides from last semester)

Slide46

The (highly practical) Monty Hall problem

Variables: A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess)

(CPTs for A, B, C|A,B, and E|A,C,D as before.)

The joint table has…?
3*3*3*2*3 = 162 rows

The conditional probability tables (CPTs) shown have…?
3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows

Big questions:
why are the CPTs smaller?
how much smaller are the CPTs than the joint?
can we compute the answers to queries like P(E=B | d) without building the joint probability tables, just using the CPTs?
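To ground the last question, here is a small Python sketch (not from the slides) that answers P(E=B | D=swap) by brute-force enumeration of the joint. The slides show only a fragment of P(C|A,B) and describe P(E|A,C,D) in words, so those CPTs are filled in below using the usual rules of the game; treat them as an assumption of this sketch.

```python
from itertools import product

doors = (1, 2, 3)

def p_c_given_ab(c, a, b):
    # Monty opens a door that hides a goat and is not the contestant's pick.
    valid = [door for door in doors if door != a and door != b]
    return 1.0 / len(valid) if c in valid else 0.0

def p_e_given_acd(e, a, c, d):
    if d == "stick":
        return 1.0 if e == a else 0.0
    # Swap: move to a door that is neither the first guess nor the opened door.
    remaining = [door for door in doors if door != a and door != c]
    return 1.0 / len(remaining) if e in remaining else 0.0

num = 0.0  # P(E = B, D = swap): swap and end up on the money
den = 0.0  # P(D = swap)
for a, b, c, d, e in product(doors, doors, doors, ("stick", "swap"), doors):
    p = (1 / 3) * (1 / 3) * p_c_given_ab(c, a, b) * 0.5 * p_e_given_acd(e, a, c, d)
    if d == "swap":
        den += p
        if e == b:
            num += p

print(num / den)  # ≈ 0.667: swapping wins about 2/3 of the time
```

Answering such queries without materializing all 162 joint entries is the job of the inference algorithms the question above is pointing toward.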

Slide from William Cohen

Extra slides from last semester

Slide47

The (highly practical) Monty Hall problem

Variables: A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess)

(CPTs for A, B, C|A,B, and E|A,C,D as before.)

Why is the CPT representation smaller? Follow the money! (B)

E is conditionally independent of B given A, D, C.

Slide from William Cohen (extra slides from last semester)

Slide48

The (highly practical) Monty Hall problem

Variables:
A: First guess
B: The money
C: The goat
D: Stick or swap?
E: Second guess

What are the conditional independencies?
I<A, {B}, C> ?
I<A, {C}, B> ?
I<E, {A,C}, B> ?
I<D, {E}, B> ?

Slide from William Cohen (extra slides from last semester)

Slide49

Graphical Models: Determining Conditional Independencies

Slide from William Cohen

Slide50

What Independencies does a Bayes Net Model?

In order for a Bayesian network to model a probability distribution, the following must be true: Each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.

This follows from the factorization of the joint distribution.

But what else does it imply?

Slide from William Cohen

Slide51

What Independencies does a Bayes Net Model?

51

Three cases of interest…

Common Parent:  X ← Y → Z
Cascade:        X → Y → Z
V-Structure:    X → Y ← Z

Slide52

What Independencies does a Bayes Net Model?

52

Three cases of interest…

Common Parent (X ← Y → Z) and Cascade (X → Y → Z): knowing Y decouples X and Z.
V-Structure (X → Y ← Z): knowing Y couples X and Z.
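The corresponding factorizations make these claims precise (standard identities; the slide itself shows only the pictures):

```latex
\begin{aligned}
\text{Common parent } (X \leftarrow Y \rightarrow Z):\;& p(x,y,z) = p(y)\,p(x \mid y)\,p(z \mid y) && \Rightarrow\; X \perp Z \mid Y\\
\text{Cascade } (X \rightarrow Y \rightarrow Z):\;& p(x,y,z) = p(x)\,p(y \mid x)\,p(z \mid y) && \Rightarrow\; X \perp Z \mid Y\\
\text{V-structure } (X \rightarrow Y \leftarrow Z):\;& p(x,y,z) = p(x)\,p(z)\,p(y \mid x, z) && \Rightarrow\; X \perp Z \text{ marginally, but not given } Y
\end{aligned}
```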

Slide53

Whiteboard: Proof of conditional independence for the Common Parent case (X ← Y → Z).
(The other two cases can be shown just as easily.)

53
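The whiteboard proof for the common-parent case is short enough to record here (a standard derivation, reconstructed rather than transcribed):

```latex
p(x, z \mid y) \;=\; \frac{p(x, y, z)}{p(y)} \;=\; \frac{p(y)\,p(x \mid y)\,p(z \mid y)}{p(y)} \;=\; p(x \mid y)\,p(z \mid y)
\quad\Longrightarrow\quad X \perp Z \mid Y
```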

Slide54

The Burglar Alarm example

Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes.
Earth arguably doesn't care whether your house is currently being burgled.
While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!

[Figure: Burglar → Alarm ← Earthquake, Alarm → Phone Call]

Slide from William Cohen

Quiz: True or False?

Slide55

The Burglar Alarm example

But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all.

Earthquake "explains away" the hypothetical burglar. But then it must not be the case that Burglar and Earthquake are conditionally independent given the phone call, even though they are marginally independent.

[Figure: Burglar → Alarm ← Earthquake, Alarm → Phone Call]

Slide from William Cohen

Slide56

D-Separation (Definition #1)

Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation.

Definition: variables X and Z are d-separated (conditionally independent) given a set of evidence variables E iff every undirected path from X to Z is "blocked", where a path is "blocked" iff one or more of the following conditions is true: ...

i.e. X and Z are dependent iff there exists an unblocked path.

Slide from William Cohen

Slide57

D-Separation (Definition #1)

A path is blocked when...

There exists a variable Y on the path such that
it is in the evidence set E
the arcs putting Y in the path are "tail-to-tail"

Or, there exists a variable Y on the path such that
it is in the evidence set E
the arcs putting Y in the path are "tail-to-head"

Or, ...

(Unknown common causes of X and Z impose dependency; unknown causal chains connecting X and Z impose dependency.)

Slide from William Cohen

Slide58

D-Separation (Definition #1)

A path is blocked when...

… Or, there exists a variable Y on the path such that
it is NOT in the evidence set E
neither are any of its descendants
the arcs putting Y on the path are "head-to-head"

(Known common symptoms of X and Z impose dependencies… X may "explain away" Z.)

Slide from William Cohen

Slide59

D-Separation (Definition #2)

D-separation criterion for Bayesian networks (D for directed edges):

Definition: variables X and Y are D-separated (conditionally independent) given Z if they are separated in the moralized ancestral graph.

Example:
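As a sketch of how Definition #2 can be checked mechanically (illustrative code, not from the lecture): keep only X, Y, Z and their ancestors, moralize that subgraph by marrying co-parents and dropping edge directions, delete Z, and test whether X and Y are still connected.

```python
from collections import deque

def d_separated(edges, xs, ys, zs):
    """Moralized-ancestral-graph test: are node sets xs and ys d-separated given zs?
    edges: list of (parent, child) pairs defining the DAG."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
        parents.setdefault(p, set())

    # 1. Ancestral subgraph: keep X, Y, Z and all of their ancestors.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moralize: undirected parent-child edges, plus edges between co-parents.
    adj = {n: set() for n in relevant}
    for c in relevant:
        ps = [p for p in parents.get(c, ()) if p in relevant]
        for p in ps:
            adj[p].add(c); adj[c].add(p)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])

    # 3. Remove the conditioning set Z and test connectivity from X to Y.
    blocked, seen = set(zs), set()
    queue = deque(x for x in xs if x not in blocked)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if node in ys:
            return False          # an active path survives: not d-separated
        queue.extend(n for n in adj[node] if n not in blocked and n not in seen)
    return True                   # no path survives: d-separated

# Example with the burglar-alarm network from the earlier slides:
edges = [("Burglar", "Alarm"), ("Earthquake", "Alarm"), ("Alarm", "PhoneCall")]
print(d_separated(edges, {"Burglar"}, {"Earthquake"}, set()))           # True
print(d_separated(edges, {"Burglar"}, {"Earthquake"}, {"PhoneCall"}))   # False
```

The two queries at the end reproduce the burglar-alarm intuition: Burglar and Earthquake are independent a priori but become dependent once the phone call is observed.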

© Eric Xing @ CMU, 2006-2011

59

Slide60

D-Separation

Theorem [Verma & Pearl, 1998]: If a set of evidence variables E d-separates X and Z in a Bayesian network's graph, then I<X, E, Z>.

d-separation can be computed in linear time using a depth-first-search-like algorithm.

Be careful: d-separation finds what must be conditionally independent. Variables may actually be independent even when they're not d-separated, depending on the actual probabilities involved.

Slide from William Cohen

Slide61

"Bayes-ball" and D-Separation

X is d-separated (directed-separated) from Z given Y if we can't send a ball from any node in X to any node in Z using the "Bayes-ball" algorithm illustrated below (plus some boundary conditions).

Defn: I(G) = all independence properties that correspond to d-separation.
D-separation is sound and complete.

61

© Eric Xing @ CMU, 2006-2011

Slide62

Markov Blanket

A node is conditionally independent of every other node in the network outside its Markov blanket.

[Figure: node X with its parents, children, and co-parents highlighted; ancestors and descendants lie outside the blanket]
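Spelling out the definition behind the picture (standard, stated here for reference):

```latex
\mathrm{MB}(X) \;=\; \mathrm{parents}(X) \,\cup\, \mathrm{children}(X) \,\cup\, \mathrm{co\text{-}parents}(X),
\qquad
P\bigl(X \mid \text{all other variables}\bigr) \;=\; P\bigl(X \mid \mathrm{MB}(X)\bigr)
```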

62

© Eric Xing @ CMU, 2006-2011

Slide63

Summary: Bayesian Networks

Structure: DAG

Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket.

Local conditional distributions (CPDs) and the DAG completely determine the joint distribution.

Give causality relationships, and facilitate a generative process.

[Figure: small example network]

63

© Eric Xing @ CMU, 2006-2011