Presentation Transcript

Slide 1

CS b553: Algorithms for Optimization and Learning
Bayesian Networks

Slide 2

Agenda
- Bayesian networks
- Chain rule for Bayes nets
- Naïve Bayes models
- Independence declarations
- D-separation
- Probabilistic inference queries

Slide 3

Purposes of Bayesian Networks
- Efficient and intuitive modeling of complex causal interactions
- Compact representation of joint distributions: O(n) rather than O(2^n)
- Algorithms for efficient inference with given evidence (more on this next time)

Slide 4

Independence of random variables
Two random variables A and B are independent if
  P(A,B) = P(A) P(B)
hence P(A|B) = P(A).
Knowing B doesn't give you any information about A.
[This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent.]

Slide 5

Significance of independence
If A and B are independent, then
  P(A,B) = P(A) P(B)
=> The joint distribution over A and B can be defined as the product of the distribution of A and the distribution of B
=> Store two much smaller probability tables rather than one large probability table over all combinations of A and B
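To make the storage point concrete, here is a small sketch (an editor's addition, with made-up numbers) showing that two marginal tables recover the full joint when the variables are independent:

```python
# Hypothetical marginals for two independent variables A and B (illustrative numbers only).
p_a = {'a1': 0.2, 'a2': 0.5, 'a3': 0.3}
p_b = {'b1': 0.6, 'b2': 0.1, 'b3': 0.3}

# Under independence, the joint is just the product of the marginals: P(A=a, B=b) = P(a) P(b).
joint = {(a, b): p_a[a] * p_b[b] for a in p_a for b in p_b}

print(len(p_a) + len(p_b), "stored numbers instead of", len(joint))   # 6 instead of 9
print(round(sum(joint.values()), 10))                                 # 1.0: still a valid distribution
```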

Slide 6

Conditional Independence
Two random variables A and B are conditionally independent given C, if
  P(A,B|C) = P(A|C) P(B|C)
hence P(A|B,C) = P(A|C).
Once you know C, learning B doesn't give you any information about A.
[Again, this has to hold for all combinations of values that A, B, C can take on.]

Slide 7

Significance of Conditional independence
Consider Grade(CS101), Intelligence, and SAT.
Ostensibly, the grade in a course doesn't have a direct relationship with SAT scores...
...but good students are more likely to get good SAT scores, so they are not independent.
It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence.

Slide 8

Bayesian Network
Explicitly represent independence among propositions.
Notice that Intelligence is the "cause" of both Grade and SAT, and the causality is represented explicitly.

[Network diagram: Intel. -> Grade, Intel. -> SAT]

6 probabilities, instead of 11:
  P(I,G,S) = P(G,S|I) P(I) = P(G|I) P(S|I) P(I)

P(I=x):
  high  0.3
  low   0.7

P(G=x|I):     I=low   I=high
  'a'         0.2     0.74
  'b'         0.34    0.17
  'c'         0.46    0.09

P(S=x|I):     I=low   I=high
  low         0.95    0.05
  high        0.2     0.8
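As a quick check (an editor's sketch, not part of the slides), the three CPTs above can be coded directly and multiplied according to P(I,G,S) = P(G|I) P(S|I) P(I):

```python
# CPTs copied from the slide.
p_i = {'high': 0.3, 'low': 0.7}
p_g_given_i = {('a', 'low'): 0.2,  ('a', 'high'): 0.74,      # P(G=g | I=i)
               ('b', 'low'): 0.34, ('b', 'high'): 0.17,
               ('c', 'low'): 0.46, ('c', 'high'): 0.09}
p_s_given_i = {('low', 'low'): 0.95, ('low', 'high'): 0.05,  # P(S=s | I=i)
               ('high', 'low'): 0.2, ('high', 'high'): 0.8}

def joint(i, g, s):
    """P(I=i, G=g, S=s) = P(G=g|I=i) * P(S=s|I=i) * P(I=i)."""
    return p_g_given_i[(g, i)] * p_s_given_i[(s, i)] * p_i[i]

print(joint('high', 'a', 'high'))   # 0.74 * 0.8 * 0.3 = 0.1776

# Sanity check: the factored joint sums to 1 over all 12 assignments.
print(round(sum(joint(i, g, s) for i in p_i for g in 'abc' for s in ('low', 'high')), 10))
```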

Slide 9

Definition: Bayesian network
- Set of random variables X = {X1, …, Xn} with domains Val(X1), …, Val(Xn)
- Each node has a set of parents PaX
- Graph must be a DAG
- Each node also maintains a conditional probability distribution (often, a table) P(X|PaX)
  - 2^k - 1 entries for binary-valued variables
  - Overall: O(n 2^k) storage for binary variables
- Encodes the joint probability over X1, …, Xn
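A tiny sketch (editor's addition) of the parameter count: with binary variables, a node with k parents needs on the order of 2^k stored numbers, so the network needs O(n 2^k) storage instead of the 2^n - 1 free parameters of the full joint.

```python
def bn_parameter_count(parents):
    """Free parameters of a BN over binary variables.
    parents: dict mapping each variable to the list of its parents."""
    # One free number per configuration of a node's parents.
    return sum(2 ** len(ps) for ps in parents.values())

# The burglary network used on the following slides.
alarm_net = {'Burglary': [], 'Earthquake': [],
             'Alarm': ['Burglary', 'Earthquake'],
             'JohnCalls': ['Alarm'], 'MaryCalls': ['Alarm']}
print(bn_parameter_count(alarm_net))   # 1 + 1 + 4 + 2 + 2 = 10
print(2 ** len(alarm_net) - 1)         # 31 free parameters for the full joint over 5 variables
```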

Slide 10

Calculation of joint Probability

[Network diagram: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, Alarm -> MaryCalls]

P(b) = 0.001        P(e) = 0.002

P(a|B,E):
  B=T, E=T: 0.95
  B=T, E=F: 0.94
  B=F, E=T: 0.29
  B=F, E=F: 0.001

P(j|A):   A=T: 0.90   A=F: 0.05
P(m|A):   A=T: 0.70   A=F: 0.01

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = ??

Slide 11

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
  = P(j ∧ m | a, ¬b, ¬e) P(a ∧ ¬b ∧ ¬e)
  = P(j | a, ¬b, ¬e) P(m | a, ¬b, ¬e) P(a ∧ ¬b ∧ ¬e)
    (J and M are independent given A)
P(j | a, ¬b, ¬e) = P(j|a)
    (J and B, and J and E, are independent given A)
P(m | a, ¬b, ¬e) = P(m|a)
P(a ∧ ¬b ∧ ¬e) = P(a | ¬b, ¬e) P(¬b | ¬e) P(¬e) = P(a | ¬b, ¬e) P(¬b) P(¬e)
    (B and E are independent)
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)

[Network diagram: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, Alarm -> MaryCalls]

Slide 12

Calculation of joint Probability

[Network diagram: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, Alarm -> MaryCalls]

P(b) = 0.001        P(e) = 0.002

P(a|B,E):
  B=T, E=T: 0.95
  B=T, E=F: 0.94
  B=F, E=T: 0.29
  B=F, E=F: 0.001

P(j|A):   A=T: 0.90   A=F: 0.05
P(m|A):   A=T: 0.70   A=F: 0.01

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
  = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
  = 0.9 x 0.7 x 0.001 x 0.999 x 0.998
  = 0.00062
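A one-line check of the arithmetic (editor's addition):

```python
p = 0.9 * 0.7 * 0.001 * 0.999 * 0.998   # P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
print(p)                                # 0.000628..., which the slide reports as 0.00062
```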

Slide 13

Calculation of joint Probability

[Network diagram: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, Alarm -> MaryCalls]

P(b) = 0.001        P(e) = 0.002

P(a|B,E):
  B=T, E=T: 0.95
  B=T, E=F: 0.94
  B=F, E=T: 0.29
  B=F, E=F: 0.001

P(j|A):   A=T: 0.90   A=F: 0.05
P(m|A):   A=T: 0.70   A=F: 0.01

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
  = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
  = 0.9 x 0.7 x 0.001 x 0.999 x 0.998
  = 0.00062

Full joint distribution:
  P(x1 ∧ x2 ∧ … ∧ xn) = Π_{i=1,…,n} P(xi | pa_Xi)

Slide 14

Chain Rule for Bayes Nets
The joint distribution is the product of all CPTs:
  P(X1, X2, …, Xn) = Π_{i=1,…,n} P(Xi | PaXi)
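The chain rule translates directly into code. The following sketch (an editor's addition; the encoding and names are illustrative) evaluates the burglary network's joint by multiplying one CPT entry per variable:

```python
# Parents and CPTs for the burglary network; each CPT gives P(X = True | parent values).
parents = {'B': (), 'E': (), 'A': ('B', 'E'), 'J': ('A',), 'M': ('A',)}
cpt = {'B': {(): 0.001},
       'E': {(): 0.002},
       'A': {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001},
       'J': {(True,): 0.90, (False,): 0.05},
       'M': {(True,): 0.70, (False,): 0.01}}

def joint_probability(assignment):
    """Chain rule for Bayes nets: P(x1, ..., xn) = prod_i P(xi | pa_Xi)."""
    p = 1.0
    for var, val in assignment.items():
        p_true = cpt[var][tuple(assignment[pa] for pa in parents[var])]
        p *= p_true if val else 1.0 - p_true
    return p

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) from the previous slides, ≈ 0.00062:
print(joint_probability({'B': False, 'E': False, 'A': True, 'J': True, 'M': True}))
```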

Slide 15

Example: Naïve Bayes models
  P(Cause, Effect1, …, Effectn) = P(Cause) Π_i P(Effecti | Cause)

[Network diagram: Cause -> Effect1, Cause -> Effect2, …, Cause -> Effectn]

Slide 16

Advantages of Bayes Nets (and other graphical models)
- More manageable number of parameters to set and store
- Incremental modeling
- Explicit encoding of independence assumptions
- Efficient inference techniques

Slide 17

Arcs do not necessarily encode causality

[Three network diagrams over the nodes A, B, C]

Two BNs with the same expressive power, and a third with greater power (exercise).

Slide 18

Reading off independence relationships

[Network diagram: A -> B -> C]

Given B, does the value of A affect the probability of C? That is, is P(C|B,A) = P(C|B)?
No, A does not affect C: C's parent (B) is given, and so C is independent of its non-descendants (A).
Independence is symmetric: C ⊥ A | B  =>  A ⊥ C | B
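A numeric illustration (editor's addition, with made-up CPTs for a chain A -> B -> C) that once B is observed, also observing A leaves P(C | ...) unchanged:

```python
from itertools import product

# Hypothetical CPTs for the chain A -> B -> C (illustrative numbers only).
p_a = {True: 0.4, False: 0.6}
p_b_given_a = {True: 0.9, False: 0.2}   # P(B=True | A=a)
p_c_given_b = {True: 0.7, False: 0.1}   # P(C=True | B=b)

def joint(a, b, c):
    pa = p_a[a]
    pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
    pc = p_c_given_b[b] if c else 1 - p_c_given_b[b]
    return pa * pb * pc

def cond_c(c, given):
    """P(C=c | given), where `given` fixes values for a subset of {'A', 'B'}."""
    num = den = 0.0
    for a, b, c2 in product([True, False], repeat=3):
        world = {'A': a, 'B': b, 'C': c2}
        if all(world[k] == v for k, v in given.items()):
            p = joint(a, b, c2)
            den += p
            if c2 == c:
                num += p
    return num / den

print(cond_c(True, {'B': True}))             # 0.7
print(cond_c(True, {'B': True, 'A': True}))  # 0.7: A adds no information once B is known
```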

Slide 19

Basic Rule
A node is independent of its non-descendants given its parents (and given nothing else).

Slide 20

What does the BN encode?

[Network diagram: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, Alarm -> MaryCalls]

Burglary ⊥ Earthquake
JohnCalls ⊥ MaryCalls | Alarm
JohnCalls ⊥ Burglary | Alarm
JohnCalls ⊥ Earthquake | Alarm
MaryCalls ⊥ Burglary | Alarm
MaryCalls ⊥ Earthquake | Alarm

A node is independent of its non-descendants, given its parents.

Slide 21

Reading off independence relationships
How about Burglary ⊥ Earthquake | Alarm?
No! Why?

[Network diagram: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, Alarm -> MaryCalls]

Slide 22

Reading off independence relationships
How about Burglary ⊥ Earthquake | Alarm?
No! Why?
  P(B ∧ E | A) = P(A|B,E) P(B ∧ E) / P(A) = 0.00075
  P(B|A) P(E|A) = 0.086
These are not equal, so Burglary and Earthquake are not independent given Alarm.

[Network diagram: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, Alarm -> MaryCalls]
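The two numbers can be reproduced by brute-force summation over the joint defined by the CPTs from the earlier slides (an editor's sketch, using the same encoding as the sketch after Slide 14):

```python
from itertools import product

parents = {'B': (), 'E': (), 'A': ('B', 'E'), 'J': ('A',), 'M': ('A',)}
cpt = {'B': {(): 0.001}, 'E': {(): 0.002},              # P(X = True | parent values)
       'A': {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001},
       'J': {(True,): 0.90, (False,): 0.05},
       'M': {(True,): 0.70, (False,): 0.01}}
order = ['B', 'E', 'A', 'J', 'M']

def joint(w):
    p = 1.0
    for var in order:
        p_true = cpt[var][tuple(w[pa] for pa in parents[var])]
        p *= p_true if w[var] else 1.0 - p_true
    return p

def prob(query, evidence):
    """P(query | evidence); both are partial assignments given as dicts."""
    num = den = 0.0
    for values in product([True, False], repeat=len(order)):
        w = dict(zip(order, values))
        if all(w[k] == v for k, v in evidence.items()):
            p = joint(w)
            den += p
            if all(w[k] == v for k, v in query.items()):
                num += p
    return num / den

print(prob({'B': True, 'E': True}, {'A': True}))                        # ≈ 0.00075
print(prob({'B': True}, {'A': True}) * prob({'E': True}, {'A': True}))  # ≈ 0.086
```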

Slide 23

Reading off independence relationships
How about Burglary ⊥ Earthquake | JohnCalls?
No! Why?
Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent.

[Network diagram: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, Alarm -> MaryCalls]

Slide 24

Independence relationships
For polytrees, there exists a unique undirected path between A and B. For each node on the path:
- Evidence on a directed chain X -> E -> Y or X <- E <- Y makes X and Y independent
- Evidence on a common cause X <- E -> Y makes the descendants X and Y independent
- At a "V" node (X -> E <- Y, or X -> W <- Y with W -> … -> E): evidence on the "V" node, or below it, makes X and Y dependent (otherwise they are independent)

Slide 25

General case
Formal property in the general case is D-separation: the above properties hold for all (acyclic) paths between A and B.
D-separation => independence
That is, we can't read off any more independence relationships from the graph than those that are encoded in D-separation.
The CPTs may indeed encode additional independences.
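The following is a path-based d-separation checker, added here as a sketch (names and structure are the editor's): a path is blocked if some non-collider on it is in the evidence, or if some collider has neither itself nor any descendant in the evidence.

```python
def d_separated(parents, x, y, z):
    """True if x and y are d-separated given evidence set z.
    parents: dict mapping each node to the set of its parents (a DAG)."""
    children = {n: set() for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].add(n)
    z = set(z)

    def descendants(node):
        seen, stack = set(), [node]
        while stack:
            for c in children[stack.pop()]:
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        return seen

    def paths(cur, visited):
        if cur == y:
            yield [cur]
            return
        for nxt in (set(parents[cur]) | children[cur]) - visited:
            for rest in paths(nxt, visited | {nxt}):
                yield [cur] + rest

    for path in paths(x, {x}):
        blocked = False
        for a, b, c in zip(path, path[1:], path[2:]):
            collider = a in parents[b] and c in parents[b]   # a -> b <- c
            if collider:
                if b not in z and not (descendants(b) & z):
                    blocked = True
                    break
            elif b in z:
                blocked = True
                break
        if not blocked:
            return False        # found an active path
    return True

# The burglary network from the earlier slides:
net = {'B': set(), 'E': set(), 'A': {'B', 'E'}, 'J': {'A'}, 'M': {'A'}}
print(d_separated(net, 'B', 'E', set()))    # True: B ⊥ E
print(d_separated(net, 'B', 'E', {'A'}))    # False: not B ⊥ E | A
print(d_separated(net, 'B', 'E', {'J'}))    # False: evidence below the collider
print(d_separated(net, 'J', 'M', {'A'}))    # True: J ⊥ M | A
```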

Slide 26

Probability Queries
Given: some probabilistic model over variables X
Find: distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X / Y
  P(Y | E = e)
Inference problem

Slide 27

Answering Inference Problems with the Joint Distribution
Easiest case: Y = X / E
  P(Y | E=e) = P(Y, e) / P(e)
  The denominator makes the probabilities sum to 1.
  Determine P(e) by marginalizing: P(e) = Σ_y P(Y=y, e)
Otherwise, let W = X / (E ∪ Y)
  P(Y | E=e) = Σ_w P(Y, W=w, e) / P(e)
  P(e) = Σ_y Σ_w P(Y=y, W=w, e)
Inference with the joint distribution: O(2^|X/E|) for binary variables
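A sketch (editor's addition) of this joint-distribution approach: given an explicitly stored joint table, answer P(Y | E=e) by summing matching entries, which costs time exponential in the number of unobserved variables.

```python
def query_from_joint(joint, order, query_vars, evidence):
    """P(query_vars | evidence) from a full joint table.
    joint: dict mapping full assignments (tuples of values, in `order`) to probabilities."""
    dist, p_e = {}, 0.0
    for values, p in joint.items():
        w = dict(zip(order, values))
        if any(w[k] != v for k, v in evidence.items()):
            continue                        # inconsistent with the evidence e
        p_e += p                            # accumulates P(e)
        key = tuple(w[v] for v in query_vars)
        dist[key] = dist.get(key, 0.0) + p  # accumulates P(Y=y, e)
    return {k: v / p_e for k, v in dist.items()}

# Tiny hypothetical joint over (A, B), illustrative numbers only.
order = ('A', 'B')
joint = {(True, True): 0.08, (True, False): 0.12,
         (False, True): 0.32, (False, False): 0.48}
print(query_from_joint(joint, order, ('A',), {'B': True}))   # {(True,): 0.2, (False,): 0.8}
```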

Slide 28

Naïve Bayes Classifier
  P(Class, Feature1, …, Featuren) = P(Class) Π_i P(Featurei | Class)

[Network diagram: Class -> Feature1, Class -> Feature2, …, Class -> Featuren]

Given features, what class?
  P(C | F1, …, Fn) = P(C, F1, …, Fn) / P(F1, …, Fn) = 1/Z P(C) Π_i P(Fi|C)

Example classes: Spam / Not Spam; English / French / Latin / …
Example features: word occurrences
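A minimal naïve Bayes spam classifier along these lines (an editor's sketch; the words, probabilities, and priors are invented):

```python
import math

# Hypothetical class prior and P(word occurs | class) tables (illustrative numbers only).
p_class = {'spam': 0.3, 'ham': 0.7}
p_word_given_class = {
    'spam': {'free': 0.60, 'meeting': 0.05, 'viagra': 0.30},
    'ham':  {'free': 0.10, 'meeting': 0.40, 'viagra': 0.001},
}

def classify(present_words):
    """Return P(class | observed word-occurrence features) as a dict."""
    scores = {}
    for c, prior in p_class.items():
        log_p = math.log(prior)
        for word, p_w in p_word_given_class[c].items():
            # Each feature is a binary "word occurs in the document" variable.
            log_p += math.log(p_w if word in present_words else 1.0 - p_w)
        scores[c] = log_p
    # Normalize: the 1/Z step from the slide.
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

print(classify({'free', 'viagra'}))    # mostly 'spam'
print(classify({'meeting'}))           # mostly 'ham'
```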

Slide 29

Naïve Bayes Classifier
  P(Class, Feature1, …, Featuren) = P(Class) Π_i P(Featurei | Class)

Given some features, what is the distribution over class?
  P(C | F1, …, Fk)
    = 1/Z P(C, F1, …, Fk)
    = 1/Z Σ_{fk+1 … fn} P(C, F1, …, Fk, fk+1, …, fn)
    = 1/Z P(C) Σ_{fk+1 … fn} Π_{i=1…k} P(Fi|C) Π_{j=k+1…n} P(fj|C)
    = 1/Z P(C) Π_{i=1…k} P(Fi|C) Π_{j=k+1…n} Σ_{fj} P(fj|C)
    = 1/Z P(C) Π_{i=1…k} P(Fi|C)
(Each Σ_{fj} P(fj|C) = 1, so the unobserved features simply drop out of the product.)
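A small demonstration (editor's addition, hypothetical numbers) that summing out an unobserved feature gives the same posterior as simply dropping it, as the derivation above shows:

```python
# Hypothetical two-feature naïve Bayes model with binary features (illustrative numbers only).
p_c = {'spam': 0.3, 'ham': 0.7}
p_f1 = {'spam': 0.6, 'ham': 0.1}      # P(F1 = True | C)
p_f2 = {'spam': 0.3, 'ham': 0.4}      # P(F2 = True | C)

def posterior_drop_f2(f1):
    """P(C | F1=f1) using only the observed feature."""
    s = {c: p_c[c] * (p_f1[c] if f1 else 1 - p_f1[c]) for c in p_c}
    z = sum(s.values())
    return {c: v / z for c, v in s.items()}

def posterior_sum_out_f2(f1):
    """P(C | F1=f1) by explicitly summing over both values of the unobserved F2."""
    s = {}
    for c in p_c:
        s[c] = sum(p_c[c] * (p_f1[c] if f1 else 1 - p_f1[c])
                   * (p_f2[c] if f2 else 1 - p_f2[c])
                   for f2 in (True, False))
    z = sum(s.values())
    return {c: v / z for c, v in s.items()}

print(posterior_drop_f2(True))       # {'spam': 0.72, 'ham': 0.28}
print(posterior_sum_out_f2(True))    # identical, as the derivation predicts
```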

Slide 30

For General Queries
For BNs and queries in general, it's not that simple… more in later lectures.
Next class: skim 5.1-3, begin reading 9.1-4