Presentation Transcript

Slide1

Directed Graphical Models

William W. Cohen

Machine Learning 10-601

Slide2

Motivation for Graphical Models

Slide3

Recap: A paradox of induction

A black crow seems to support the hypothesis “all crows are black”.

A pink highlighter supports the hypothesis “all non-black things are non-crows”

Thus, a pink highlighter supports the hypothesis “all crows are black”.

whut?

Slide4

whut?

B = black
C = crow

[2×2 grid: crows / non-crows × black / not black]

collect statistics for P(B=b | C=c)

Slide5

Logical reasoning versus common-sense reasoning

CROW(jim)
BLACK(jim)
FLY(jim)
BIRD(jim)
EATS(jim, carrion)
…

Slide6

Another difficult problem: common-sense reasoning

Tweety is a bird.

Most birds can fly.

Opus is a penguin.

Penguins are birds. Penguins cannot fly.

We’d like to be able to conclude: Opus cannot fly, and Tweety can

default reasoning

Logically…

Slide7

Another difficult problem: common-sense reasoning

Tweety is a bird.

Most birds can fly.

Opus is a penguin.

Penguins are birds. Penguins cannot fly.

We’d like to be able to conclude: Opus cannot fly, and Tweety can

default reasoning

?

NO: F(tweety) is only provable if he's provably NOT a penguin

Slide8
Slide9

Another difficult problem: common-sense reasoning

Tweety is a bird.

Most birds can fly.

Opus is a penguin.

Penguins are birds. Penguins cannot fly.

We’d like to be able to conclude: Opus cannot fly, and Tweety can

default reasoning

?

NO: F(tweety) is only provable if he's provably not a penguin and not dead and …

Slide10

Recap: The Joint Distribution

Recipe for making a joint distribution of M variables:

Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).

For each combination of values, say how probable it is.

Example: Boolean variables A, B, C

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10
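As a quick illustration of working with the joint directly, here is a minimal Python sketch (the variable names and layout are ours) that stores the eight rows above and answers queries by brute-force summation:

# Joint distribution over Boolean variables (A, B, C), one entry per row above.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    # Sum the rows matching a partial assignment, e.g. {'A': 1}.
    idx = {'A': 0, 'B': 1, 'C': 2}
    return sum(p for row, p in joint.items()
               if all(row[idx[v]] == val for v, val in event.items()))

print(prob({'A': 1}))                           # P(A=1) = 0.5
print(prob({'A': 1, 'B': 1}) / prob({'A': 1}))  # P(B=1 | A=1) = 0.7

Slide11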

Another difficult problem: common-sense reasoning

Slide12

Another difficult problem: common-sense reasoning

Slide13

Tweety is a bird.
Most non-penguin birds can fly.
Opus is a penguin.
Penguins are birds.
Penguins cannot fly.

Pg  B  Pr(F=0|Pg,B)  Pr(F=1|Pg,B)
0   0  0.90          0.10
0   1  0.01          0.99
1   0  0.5           0.5
1   1  0.9999        0.0001

Pg  Pr(B=0|Pg)  Pr(B=1|Pg)
0   0.90        0.10
1   0           1

Pr(Pg=0)  Pr(Pg=1)
0.90      0.10

The joint for an experiment: I pick an object, say in Frick Park, and measure: can it fly, is it a bird, is it a penguin.

Slide14

Tweety is a bird.
Most birds can fly.
Opus is a penguin.
Penguins are birds.
Penguins cannot fly.

Can Opus fly?

Pg  B  Pr(F=0|Pg,B)  Pr(F=1|Pg,B)
0   0  0.90          0.10
0   1  0.01          0.99
1   0  0.5           0.5
1   1  0.9999        0.0001

Unlikely: Pr(F=1|B=1,Pg=1) = 0.0001

Slide15

Tweety is a bird.
Most birds can fly.
Opus is a penguin.
Penguins are birds.
Penguins cannot fly.

Can Tweety fly?

Either Tweety is a flying penguin, or Tweety is a flying non-penguin bird. If flying penguins are rare, it depends:

do non-penguin birds fly? Pr(F|B=1,Pg=0)
are all or most birds non-penguins? Pr(B=1|Pg=0)
are non-penguins common? Pr(Pg=0)
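Putting those three quantities together, a brute-force Python sketch (the encoding of the CPTs is ours) computes Pr(F=1|B=1) by summing out Pg:

# CPTs from the slides above (Pg: penguin, B: bird, F: flies).
p_pg = {0: 0.90, 1: 0.10}                                        # Pr(Pg)
p_b = {(0, 0): 0.90, (1, 0): 0.10, (0, 1): 0.0, (1, 1): 1.0}     # Pr(B=b|Pg), keyed (b, pg)
p_f = {(0, 0): 0.10, (0, 1): 0.99, (1, 0): 0.5, (1, 1): 0.0001}  # Pr(F=1|Pg,B), keyed (pg, b)

# Pr(F=1 | B=1) = sum_pg Pr(F=1|pg,B=1) Pr(B=1|pg) Pr(pg) / Pr(B=1)
num = sum(p_f[(pg, 1)] * p_b[(1, pg)] * p_pg[pg] for pg in (0, 1))
den = sum(p_b[(1, pg)] * p_pg[pg] for pg in (0, 1))
print(num / den)  # ~0.47: with these numbers, over half of the park's birds are penguins

Slide16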

Quiz…

https://piazza.com/class/ij382zqa2572hc?cid=421
https://piazza.com/class/ij382zqa2572hc?cid=420

Slide17

Another difficult problem: common-sense reasoning

Have we solved the common-sense reasoning problem?

No: how do we (1) choose the conditional probabilities needed to model a task and (2) use them algorithmically to answer questions?

No: how do we invent numbers for all the rows of the CPTs?

Slide18

Another difficult problem: common-sense reasoning

Have we solved the common-sense reasoning problem?

Yes: we use directed graphical models.

Semantics: how to specify them
Inference: how to use them
Learning: how to find parameters

Slide19

Probabilities and probabilistic inference

Why is logic attractive?

There are well-understood algorithms for reasoning with a logical theory. E.g.: we can use a computer to determine if B(x) ⇒ F(x).

What about probabilities?

We can do some math manually and answer many questions. Not really satisfying.
We can answer questions algorithmically with the joint. E.g.: we can compute Pr(F=1|B=1). But: this is not tractable for large models.
How can we answer questions algorithmically and efficiently? Answer: graphical models.

Slide20

Probabilities and probabilistic inference

Directed graphical models

Today: examples and semantics
Wednesday: inference algorithms
Next week: learning in graphical models; using graphical models to specify learning algorithms: Naïve Bayes, LDA, HMMs, …

Slide21

Graphical Models:

SEMANTICS and definitions

Slide22

I have 3 standard d20 dice and 1 loaded die.

Experiment: (A) pick a d20 uniformly at random, then (B) roll it. Let A = "the d20 picked is fair" and B = "rolled 19 or 20 with that die". What is P(B)?

Network: A → B

P(A=fair)=0.75
P(A=loaded)=0.25
P(B=critical | A=fair)=0.1
P(B=noncritical | A=fair)=0.9
P(B=critical | A=loaded)=0.5
P(B=noncritical | A=loaded)=0.5

A       P(A)
Fair    0.75
Loaded  0.25

A       B            P(B|A)
Fair    Critical     0.1
Fair    Noncritical  0.9
Loaded  Critical     0.5
Loaded  Noncritical  0.5

Example: practical problem 1, made easy

P(A,B)=P(B|A)P(A)
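To see why this is "made easy", a minimal Python sketch (our encoding of the two tables) computes P(B=critical) by marginalizing out A:

# CPTs for the dice network A -> B.
p_a = {'fair': 0.75, 'loaded': 0.25}
p_b_given_a = {('critical', 'fair'): 0.1, ('noncritical', 'fair'): 0.9,
               ('critical', 'loaded'): 0.5, ('noncritical', 'loaded'): 0.5}

# P(B=critical) = sum_a P(B=critical | A=a) P(A=a)
p_b_critical = sum(p_b_given_a[('critical', a)] * p_a[a] for a in p_a)
print(p_b_critical)  # 0.1*0.75 + 0.5*0.25 = 0.2

Slide23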

I have 3 standard d20 dice and 1 loaded die.

Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "rolled 19 or 20 with that die". What is P(B)?

Network: A → B

P(A=fair)=0.75
P(A=loaded)=0.25
P(B=critical | A=fair)=0.1
P(B=noncritical | A=fair)=0.9
P(B=critical | A=loaded)=0.5
P(B=noncritical | A=loaded)=0.5

A       P(A)
Fair    0.75
Loaded  0.25

A       B            P(B|A)
Fair    Critical     0.1
Fair    Noncritical  0.9
Loaded  Critical     0.5
Loaded  Noncritical  0.5

Example: practical problem 1, made easy

What is Pr(A=1|B=1)?

P(A,B)=P(B|A)P(A)

We have the information we need to answer other questions as well.

Example of inference
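Reading A=1 as "fair" and B=1 as "critical" (that labeling is our assumption), the inference is one application of Bayes' rule over the same two tables from the previous sketch:

# P(A=fair | B=critical) = P(B=critical | A=fair) P(A=fair) / P(B=critical)
num = p_b_given_a[('critical', 'fair')] * p_a['fair']          # 0.075
den = sum(p_b_given_a[('critical', a)] * p_a[a] for a in p_a)  # 0.2
print(num / den)  # 0.375: a critical roll lowers the chance the die was fair

Slide24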

In general: any chain-rule decomposition gives a DGM G:

G has one node per random variable.
If P(X|Y1,…,Yk) is a factor in the decomposition, then G has edges Y1 → X, …, Yk → X.
X is annotated with a conditional probability table (CPT) encoding P(X=x|Y1=y1,…,Yk=yk) for each tuple (x, y1, …, yk).

Network: A → B, with the same P(A) and P(B|A) CPTs as on the previous slides.

Example: practical problem 1, made easy

P(A,B)=P(B|A)P(A)

Applied chain rule
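Operationally (a sketch, with names of our own choosing): a DGM is just a map from each node to its parent list and CPT, and the joint is a product of one CPT lookup per node:

# A DGM as {node: (parents, CPT)}; CPT keys are (x, parent values...).
dgm = {
    'A': ((), {('fair',): 0.75, ('loaded',): 0.25}),
    'B': (('A',), {('critical', 'fair'): 0.1, ('noncritical', 'fair'): 0.9,
                   ('critical', 'loaded'): 0.5, ('noncritical', 'loaded'): 0.5}),
}

def joint(assignment):
    # P(assignment) = product over nodes of P(x | parent values).
    p = 1.0
    for node, (parents, cpt) in dgm.items():
        key = (assignment[node],) + tuple(assignment[q] for q in parents)
        p *= cpt[key]
    return p

print(joint({'A': 'loaded', 'B': 'critical'}))  # P(B|A)P(A) = 0.5*0.25 = 0.125

Slide25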

I have 3 standard d20 dice and 1 loaded die.

Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "rolled 19 or 20 with that die".

Network: B → A

B             P(B)
critical      ?
non-critical  ?

B            A       P(A|B)
Critical     Fair    ?
Noncritical  Fair    ?
Critical     Loaded  ?
Noncritical  Loaded  ?

There's more than one network for any distribution.

P(A,B)=P(A|B)P(B)

Slide26

I have 3 standard d20 dice and 1 loaded die.

Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "rolled 19 or 20 with that die". What is P(B)?

There's more than one network for any distribution: B → A or A → B.

The moral: we have two things here:

a "generative story", "causal model", …
a joint probability distribution, e.g. P(A,B)

a decomposition: P(A,B)=P(B|A)P(A)
another decomposition: P(A,B)=P(A|B)P(B) -- totally valid!

It's usually cleaner to pick one that fits a "generative story".

Slide27

There’s more than one network for any distribution

There are

lots

of decompositions of an model with N variables

They are all

correct

Some are better than others….Slide28

There’s more than one network for any distribution

Suppose there are some

conditional independencies

P(A|B,C,D)=P(A|B)

P

(B|C,D)=P(B|C)

Then the first decomposition can be simplified and compressed, the second can’tSlide29
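To spell out "simplified" with one concrete pair of orderings (our worked example): the decomposition P(A,B,C,D) = P(A|B,C,D) P(B|C,D) P(C|D) P(D) collapses under the independencies above to P(A|B) P(B|C) P(C|D) P(D), so for Boolean variables the CPTs for A and B shrink from 8 and 4 parent combinations to 2 each. The reverse-order decomposition P(D|A,B,C) P(C|A,B) P(B|A) P(A) contains no factor that the given independencies apply to, so nothing simplifies.

Slide29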

The (highly practical) Monty Hall problem

You're in a game show. Behind one door is a prize. Behind the others, goats.

You pick one of three doors, say #1. The host, Monty Hall, opens one of the other doors, say #3, revealing… a goat!

You now can either:
stick with your guess
always change doors
flip a coin and pick a new door randomly according to the coin

Slide30

Example: practical problem 2

Slide31

The (highly practical) Monty Hall problem

You're in a game show. Behind one door is a prize. Behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing… a goat! You now can either stick with your guess or change doors.

Variables: A = first guess, B = the money, C = the revealed goat, D = stick, or swap?, E = second guess.

Network edges: A → C, B → C, A → E, C → E, D → E.

A  P(A)
1  0.33
2  0.33
3  0.33

B  P(B)
1  0.33
2  0.33
3  0.33

D      P(D)
Stick  0.5
Swap   0.5

A  B  C  P(C|A,B)
1  1  2  0.5
1  1  3  0.5
1  2  3  1.0
1  3  2  1.0
(rows for A=2,3 are analogous by symmetry)

Slide32

The (highly practical) Monty Hall problem

Variables: A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess (CPTs for A, B, and C as on the previous slide).

A  C  D  P(E|A,C,D):
If you stick: you win if your first guess was right.
If you swap: you win if your first guess was wrong.

Slide33

The (highly practical) Monty Hall problem

Variables: A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess (CPTs as before).

…again by the chain rule:

P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)

We could construct the joint and compute P(E=B|D=swap).
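Here is a brute-force sketch of exactly that computation. The CPT rows shown on the slides cover only A=1; extending them to all doors by symmetry (Monty never opens the guessed door or the money door) is our assumption:

import itertools

doors = (1, 2, 3)

def p_c(c, a, b):
    # Monty opens door c: never the guess a, never the money b,
    # uniformly among the valid choices (matches the A=1 rows above).
    valid = [d for d in doors if d != a and d != b]
    return 1.0 / len(valid) if c in valid else 0.0

def p_e(e, a, c, d):
    # Stick keeps a; swap takes the one door that is neither a nor c.
    target = a if d == 'stick' else next(x for x in doors if x not in (a, c))
    return 1.0 if e == target else 0.0

def joint(a, b, c, d, e):
    # P(A,B,C,D,E) = P(E|A,C,D) P(D) P(C|A,B) P(B) P(A)
    return p_e(e, a, c, d) * 0.5 * p_c(c, a, b) * (1 / 3) * (1 / 3)

win = sum(joint(a, b, c, 'swap', e)
          for a, b, c, e in itertools.product(doors, repeat=4) if e == b)
swap = sum(joint(a, b, c, 'swap', e)
           for a, b, c, e in itertools.product(doors, repeat=4))
print(win / swap)  # P(E=B | D=swap) = 2/3, so always swap

Slide34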

The (highly practical) Monty Hall problem

Variables: A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess (CPTs as before).

…again by the chain rule, this time with no conditional independencies exploited:

P(A,B,C,D,E) = P(E|A,B,C,D) * P(D|A,B,C) * P(C|A,B) * P(B|A) * P(A)

We could construct the joint and compute P(E=B|D=swap).

Slide35

The (highly practical) Monty Hall problem

Variables: A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess (CPTs as before).

The joint table has…? 3*3*3*2*3 = 162 rows.

The conditional probability tables (CPTs) shown have…? 3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows.

Big questions:

why are the CPTs smaller?
how much smaller are the CPTs than the joint?
can we compute the answers to queries like P(E=B|d) without building the joint probability tables, just using the CPTs?

Slide36

The (highly practical) Monty Hall problem

Variables: A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess (CPTs as before).

Why is the CPT representation smaller? Follow the money! (B)

E is conditionally independent of B given A, D, C.

Slide37

Conditional Independence formalized

Definition: R and L are conditionally independent given M if, for all x, y, z in {T, F}:

P(R=x | M=y ^ L=z) = P(R=x | M=y)

More generally: let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,

P(S1's assignments | S2's assignments ^ S3's assignments) = P(S1's assignments | S3's assignments)
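The definition can be tested numerically against any joint table. A small sketch (the joint below is fabricated purely for illustration, so that R is independent of L given M by construction):

import itertools

def marg(joint, fixed):
    # Sum joint entries consistent with a partial assignment {position: value}.
    return sum(q for xs, q in joint.items()
               if all(xs[pos] == v for pos, v in fixed.items()))

def cond_indep(joint, r, m, l, tol=1e-9):
    # Test P(R=x | M=y ^ L=z) == P(R=x | M=y) for all x, y, z in {0, 1}.
    for x, y, z in itertools.product((0, 1), repeat=3):
        d1, d2 = marg(joint, {m: y, l: z}), marg(joint, {m: y})
        if d1 == 0 or d2 == 0:
            continue  # conditioning event has probability zero
        lhs = marg(joint, {r: x, m: y, l: z}) / d1
        rhs = marg(joint, {r: x, m: y}) / d2
        if abs(lhs - rhs) > tol:
            return False
    return True

# Tuples are (r, m, l); R and L each depend on M only.
joint = {(r, m, l): 0.5 * (0.8 if l == m else 0.2) * (0.9 if r == m else 0.1)
         for r in (0, 1) for m in (0, 1) for l in (0, 1)}
print(cond_indep(joint, r=0, m=1, l=2))  # True

Slide38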

The (highly practical) Monty Hall problem

Variables: A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess.

What are the conditional independencies?

I<A, {B}, C> ?
I<A, {C}, B> ?
I<E, {A,C}, B> ?
I<D, {E}, B> ?
…

Slide39
Slide40

Recap: Bayes Nets Formalized

A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair V, E where:

V is a set of vertices.
E is a set of directed edges joining vertices. No loops of any length are allowed.

Each vertex in V contains the following information:

the name of a random variable
a probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parent values.

Slide41

Building a Bayes Net

Choose a set of relevant variables.
Choose an ordering for them. Assume they're called X1..Xm (where X1 is first in the ordering, etc.).
For i = 1 to m:
  Add the Xi node to the network.
  Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that we have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi).
  Define the probability table of P(Xi=k | Assignments of Parents(Xi)).

Slide42

The general case

P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1)
= :
= ∏_{i=1..n} P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1)
= ∏_{i=1..n} P(Xi=xi | assignments of Parents(Xi))

So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.
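Continuing the two-node sketch from earlier, that closing claim can be made concrete: any conditional probability is a ratio of two sums of joint entries (ignoring, for now, that this enumeration is exponential in the number of variables):

import itertools

# Reuses `dgm` and `joint` from the earlier sketch.
domains = {'A': ['fair', 'loaded'], 'B': ['critical', 'noncritical']}

def query(target, evidence):
    # P(target | evidence) by summing joint entries.
    names = list(domains)
    total = match = 0.0
    for values in itertools.product(*(domains[v] for v in names)):
        asg = dict(zip(names, values))
        if any(asg[v] != val for v, val in evidence.items()):
            continue
        p = joint(asg)
        total += p
        if all(asg[v] == val for v, val in target.items()):
            match += p
    return match / total

print(query({'A': 'fair'}, {'B': 'critical'}))  # 0.375, as computed before

Slide43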

Question: given a network, can I find a chain-rule decomposition of the joint?

Slide44

Graphical Models:

Determining Conditional Independencies

Slide45

What Independencies does a Bayes Net Model?

In order for a Bayesian network to model a probability distribution, the following must be true:

Each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.

This follows from the factorization P(X1,…,Xn) = ∏_i P(Xi | Parents(Xi)).

But what else does it imply?

Slide46

What Independencies does a Bayes Net Model?

Example: Z → Y → X

Given Y, does learning the value of Z tell us nothing new about X? I.e., is P(X | Y, Z) equal to P(X | Y)?

Yes. Since we know the value of all of X's parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z.

Also, since independence is symmetric, P(Z | Y, X) = P(Z | Y).

Slide47

Quick proof that independence is symmetric

Assume: P(X|Y, Z) = P(X|Y). Then:

P(Z|X, Y) = P(X, Y, Z) / P(X, Y)                         (Bayes's Rule)
          = P(X|Y, Z) P(Z|Y) P(Y) / P(X, Y)              (Chain Rule)
          = P(X|Y) P(Z|Y) P(Y) / P(X, Y)                 (By Assumption)
          = P(X|Y) P(Z|Y) P(Y) / (P(X|Y) P(Y)) = P(Z|Y)  (Bayes's Rule)

Slide48

What Independencies does a Bayes Net Model?

Let I<X, Y, Z> represent X and Z being conditionally independent given Y.

I<X, Y, Z>? Yes, just as in the previous example: all of X's parents are given, and Z is not a descendant.

[network: X ← Y → Z]

Slide49

What Independencies does a Bayes Net Model?

I<X, {U}, Z>? No.
I<X, {U,V}, Z>? Yes.
Maybe I<X, S, Z> iff S acts as a cutset between X and Z in an undirected version of the graph…?

[network with nodes X, U, V, Z]

Slide50

Things get a little more confusing

X has no parents, so we know all its parents' values trivially.
Z is not a descendant of X.
So, I<X, {}, Z>, even though there's an undirected path from X to Z through an unknown variable Y.

What if we do know the value of Y, though? Or one of its descendants?

[network: X → Y ← Z]

Slide51

The Burglar Alarm example

Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn't care whether your house is currently being burgled.

While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!

[network: Burglar → Alarm ← Earthquake; Alarm → Phone Call]

Slide52

Things get a lot more confusing

But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all. Earthquake "explains away" the hypothetical burglar.

But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!

[network: Burglar → Alarm ← Earthquake; Alarm → Phone Call]

Slide53
Slide54

d-separation to the rescue

Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation.

Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is "blocked", where a path is "blocked" iff one or more of the following conditions is true: ...

I.e., X and Z are dependent iff there exists an unblocked path.

Slide55

A path is blocked when...

There exists a variable Y on the path such that:
it is in the evidence set E
the arcs putting Y in the path are "tail-to-tail"
(unknown common causes of X and Z impose dependency)

Or, there exists a variable Y on the path such that:
it is in the evidence set E
the arcs putting Y in the path are "tail-to-head"
(unknown causal chains connecting X and Z impose dependency)

Or, ...

Slide56

A path is blocked when… (the funky case)

… Or, there exists a variable Y on the path such that:
it is NOT in the evidence set E
neither are any of its descendants
the arcs putting Y on the path are "head-to-head"
(known common symptoms of X and Z impose dependencies… X may "explain away" Z)
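The three conditions translate almost directly into code. A simplified sketch (our own, not the lecture's) that checks whether one given undirected path is blocked by an evidence set:

def path_blocked(path, edges, evidence, descendants):
    # path: node sequence [X, ..., Z]; edges: set of directed (u, v) pairs;
    # evidence: set of observed nodes; descendants: dict node -> set of descendants.
    for i in range(1, len(path) - 1):
        prev, y, nxt = path[i - 1], path[i], path[i + 1]
        if (prev, y) in edges and (nxt, y) in edges:
            # head-to-head: blocked unless Y or one of its descendants is observed
            if y not in evidence and not (descendants[y] & evidence):
                return True
        elif y in evidence:
            # tail-to-tail (common cause) or tail-to-head (chain): blocked if Y observed
            return True
    return False

# Burglar-alarm network: Burglar -> Alarm <- Earthquake, Alarm -> PhoneCall
edges = {('B', 'A'), ('E', 'A'), ('A', 'P')}
desc = {'A': {'P'}, 'B': {'A', 'P'}, 'E': {'A', 'P'}, 'P': set()}
print(path_blocked(['B', 'A', 'E'], edges, set(), desc))  # True: I<B, {}, E>
print(path_blocked(['B', 'A', 'E'], edges, {'P'}, desc))  # False: the call unblocks

Slide57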

d-separation to the rescue, cont'd

Theorem [Verma & Pearl, 1998]: if a set of evidence variables E d-separates X and Z in a Bayesian network's graph, then I<X, E, Z>.

d-separation can be computed in linear time using a depth-first-search-like algorithm.

Be careful: d-separation finds what must be conditionally independent. "Might": variables may actually be independent even when they're not d-separated, depending on the actual probabilities involved.

Slide58

d-separation example

[example network with nodes A, B, C, D, E, F, G, H, I, J]

I<C, {}, D>?
I<C, {A}, D>?
I<C, {A, B}, D>?
I<C, {A, B, J}, D>?
I<C, {A, B, E, J}, D>?