CS b553: Algorithms for Optimization and Learning

Presentation Transcript

Slide 1: CS b553: Algorithms for Optimization and Learning

Bayesian Networks

Slide 2: Agenda

- Probabilistic inference queries
- Top-down inference
- Variable elimination

Slide 3: Probability Queries

Given: some probabilistic model over variables X.
Find: the distribution over Y given evidence E = e, for some subset E ⊆ X \ Y:

  P(Y | E = e)

This is the inference problem.

Slide 4: Answering Inference Problems with the Joint Distribution

Easiest case: Y = X \ E. Then

  P(Y | E = e) = P(Y, e) / P(e)

The denominator makes the probabilities sum to 1. Determine P(e) by marginalizing:

  P(e) = Σy P(Y = y, e)

Otherwise, let W = X \ (E ∪ Y). Then

  P(Y | E = e) = Σw P(Y, W = w, e) / P(e)
  P(e) = Σy Σw P(Y = y, W = w, e)

Inference with the joint distribution costs O(2^|X \ E|) for binary variables.
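Where the slides leave the brute-force approach abstract, here is a minimal sketch of inference from an explicit joint table, assuming binary variables stored in a numpy array; the joint numbers and the roles of the variables are made up for illustration.

```python
import numpy as np

# Hypothetical joint P(X1, X2, X3) over three binary variables,
# stored as a 2x2x2 array indexed by (x1, x2, x3) with 0 = False, 1 = True.
# The numbers are invented; any non-negative table summing to 1 works.
joint = np.array([[[0.30, 0.05], [0.10, 0.05]],
                  [[0.20, 0.10], [0.15, 0.05]]])

def query(joint, query_vars, evidence):
    """Brute-force P(query_vars | evidence) from an explicit joint table.

    query_vars: list of axis indices for Y; evidence: dict {axis: value} for E = e.
    Sums out the remaining variables W = X \\ (E u Y), then normalizes by P(e)."""
    # Select the slice consistent with the evidence.
    index = [slice(None)] * joint.ndim
    for axis, value in evidence.items():
        index[axis] = value
    table = joint[tuple(index)]
    # Axes of the sliced table that correspond to the query variables.
    remaining = [a for a in range(joint.ndim) if a not in evidence]
    keep = [remaining.index(a) for a in query_vars]
    # Marginalize out all non-query, non-evidence axes.
    sum_axes = tuple(a for a in range(table.ndim) if a not in keep)
    unnormalized = table.sum(axis=sum_axes)
    return unnormalized / unnormalized.sum()   # divide by P(e)

# P(X1 | X3 = True): query variable is axis 0, evidence on axis 2.
print(query(joint, query_vars=[0], evidence={2: 1}))
```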

Slide 5: Answering Inference Problems with the Joint Distribution

Another common case: Y = {Q} (a single query variable).

Can we do better than brute-force marginalization of the joint distribution?

Slide 6: Top-Down Inference

Alarm network: Burglary -> Alarm <- Earthquake; Alarm -> JohnCalls, Alarm -> MaryCalls.

  P(B) = 0.001          P(E) = 0.002

  B  E  P(A|B,E)        A  P(J|A)        A  P(M|A)
  T  T  0.95            T  0.90          T  0.70
  T  F  0.94            F  0.05          F  0.01
  F  T  0.29
  F  F  0.001

Suppose we want to compute P(Alarm).

Slide 7: Top-Down Inference

[Alarm network and CPTs as on Slide 6]

Suppose we want to compute P(Alarm):

  P(Alarm) = Σb,e P(A, b, e)
  P(Alarm) = Σb,e P(A|b, e) P(b) P(e)

Slide 8: Top-Down Inference

[Alarm network and CPTs as on Slide 6]

Suppose we want to compute P(Alarm):

  P(Alarm) = Σb,e P(A, b, e)
  P(Alarm) = Σb,e P(A|b, e) P(b) P(e)
  P(Alarm) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E)
           + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)

Slide 9: Top-Down Inference

[Alarm network and CPTs as on Slide 6]

Suppose we want to compute P(Alarm):

  P(A) = Σb,e P(A, b, e)
  P(A) = Σb,e P(A|b, e) P(b) P(e)
  P(A) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E)
       + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
  P(A) = 0.95*0.001*0.002 + 0.94*0.001*0.998 + 0.29*0.999*0.002 + 0.001*0.999*0.998
       = 0.00252
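The same computation in a few lines of Python, using the CPT values from the slide; the True/False dictionary keys are just a convention chosen for this sketch.

```python
# Top-down computation of P(Alarm = T) from the slide's CPTs.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}

# P(A) = sum over b, e of P(A|b,e) P(b) P(e)
p_alarm = sum(P_A_given_BE[(b, e)] * P_B[b] * P_E[e]
              for b in (True, False) for e in (True, False))
print(round(p_alarm, 5))   # ~0.00252, matching the slide
```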

Slide 10: Top-Down Inference

[Alarm network and CPTs as on Slide 6]

Now, suppose we want to compute P(MaryCalls).

Slide 11: Top-Down Inference

[Alarm network and CPTs as on Slide 6]

Now, suppose we want to compute P(MaryCalls):

  P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)

Slide 12: Top-Down Inference

[Alarm network and CPTs as on Slide 6]

Now, suppose we want to compute P(MaryCalls):

  P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)
  P(M) = 0.70*0.00252 + 0.01*(1 - 0.00252)
       = 0.0117
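Continuing the earlier sketch, P(MaryCalls) follows by conditioning on Alarm; p_alarm is taken from the previous slide's result.

```python
# P(MaryCalls = T) by conditioning on Alarm.
P_M_given_A = {True: 0.70, False: 0.01}
p_alarm = 0.00252                       # P(Alarm = T) from the previous slide

# P(M) = P(M|A) P(A) + P(M|not A) P(not A)
p_mary = P_M_given_A[True] * p_alarm + P_M_given_A[False] * (1.0 - p_alarm)
print(round(p_mary, 4))                 # ~0.0117, matching the slide
```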

Slide 13: Top-Down Inference with Evidence

[Alarm network and CPTs as on Slide 6]

Suppose we want to compute P(Alarm | Earthquake).

Slide 14: Top-Down Inference

[Alarm network and CPTs as on Slide 6]

Suppose we want to compute P(A|e):

  P(A|e) = Σb P(A, b | e)
  P(A|e) = Σb P(A|b, e) P(b)

Slide 15: Top-Down Inference

[Alarm network and CPTs as on Slide 6]

Suppose we want to compute P(A|e):

  P(A|e) = Σb P(A, b | e)
  P(A|e) = Σb P(A|b, e) P(b)
  P(A|e) = 0.95*0.001 + 0.29*0.999
         = 0.29066
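The evidence case in the same style, reusing the CPT numbers above; only the Earthquake = True entries of P(A|B,E) are needed.

```python
# P(Alarm = T | Earthquake = T) = sum over b of P(A | b, e) P(b).
P_B = {True: 0.001, False: 0.999}
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}

p_alarm_given_quake = sum(P_A_given_BE[(b, True)] * P_B[b] for b in (True, False))
print(round(p_alarm_given_quake, 5))   # ~0.29066, matching the slide
```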

Slide 16: Top-Down Inference

Only works if:
- the graph of ancestors is a polytree, and
- evidence is given on ancestor(s) of Q.

Efficient: O(d) time, where d is the number of ancestors of the query variable, with |PaX| assumed constant.

Evidence on an ancestor cuts off the influence of the portion of the graph above the evidence node.

Slide 17: Naïve Bayes Classifier

Network: Class -> Feature1, Feature2, ..., Featuren.

  P(Class, Feature1, ..., Featuren) = P(Class) Πi P(Featurei | Class)

Given the features, what is the class?

  P(C | F1, ..., Fn) = P(C, F1, ..., Fn) / P(F1, ..., Fn) = 1/Z P(C) Πi P(Fi | C)

Examples: classes Spam / Not Spam, or English / French / Latin ...; features are word occurrences.

Slide 18: Normalization Factors

  P(C | F1, ..., Fn) = P(C, F1, ..., Fn) / P(F1, ..., Fn) = 1/Z P(C) Πi P(Fi | C)

The 1/Z term is a normalization factor so that P(C | F1, ..., Fn) sums to 1:

  Z = Σc P(C = c) Πi P(Fi | C = c)

Z is different for each value of F1, ..., Fn and is often left implicit.

Usual implementation: first compute the unnormalized distribution P(C) Πi P(Fi = fi | C) for all values of C, then perform a normalization step in O(|Val(C)|) time.
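A minimal sketch of the "compute unnormalized, then normalize" implementation just described; the priors, per-class feature probabilities, and the spam/not-spam labels are invented for illustration.

```python
import numpy as np

# Illustrative naive Bayes numbers (not from the slides): two classes,
# three binary features observed as f = (f1, f2, f3).
classes = ["spam", "not_spam"]
prior = np.array([0.3, 0.7])                         # P(C)
p_feature_given_class = np.array([[0.8, 0.6, 0.1],   # P(Fi = T | spam)
                                  [0.2, 0.3, 0.4]])  # P(Fi = T | not_spam)
f = np.array([1, 0, 1])                              # observed feature values

# Unnormalized P(C) * prod_i P(Fi = fi | C) for every class ...
likelihood = np.where(f == 1, p_feature_given_class, 1.0 - p_feature_given_class)
unnormalized = prior * likelihood.prod(axis=1)
# ... then normalize in O(|Val(C)|) time; the sum over classes is Z.
posterior = unnormalized / unnormalized.sum()
print(dict(zip(classes, posterior.round(3))))
```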

Slide 19: Note: Numerical Issues in Implementation

Suppose P(fi | c) is very small for all i, e.g., the probability that a given uncommon word fi appears in a document. The product P(C) Πi P(fi | C) with large n will be exceedingly small and might underflow.

More numerically stable solution:
- Compute log P(C) + Σi log P(fi | C) for all values of C
- Compute b = maxc [log P(c) + Σi log P(fi | c)]
- P(C | f1, ..., fn) = exp(log P(C) + Σi log P(fi | C) - b) / Z'
- with Z' a normalization factor

This is a common trick when dealing with products of many small numbers.
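The same posterior computed in log space following the recipe above, with the same invented numbers as the previous sketch; subtracting the maximum score b keeps the exponentials in a safe range.

```python
import numpy as np

# Same illustrative numbers as before, now handled in log space.
prior = np.array([0.3, 0.7])
p_feature_given_class = np.array([[0.8, 0.6, 0.1],
                                  [0.2, 0.3, 0.4]])
f = np.array([1, 0, 1])

log_prior = np.log(prior)
log_likelihood = np.log(np.where(f == 1, p_feature_given_class,
                                 1.0 - p_feature_given_class))

scores = log_prior + log_likelihood.sum(axis=1)   # log P(C) + sum_i log P(fi | C)
b = scores.max()                                  # b = max_c [log P(c) + sum_i log P(fi | c)]
unnormalized = np.exp(scores - b)                 # largest term becomes exp(0) = 1: no underflow
posterior = unnormalized / unnormalized.sum()     # divide by Z'
print(posterior.round(3))                         # same answer as the direct computation
```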

Slide 20: Naïve Bayes Classifier

  P(Class, Feature1, ..., Featuren) = P(Class) Πi P(Featurei | Class)

Given only some of the features (F1, ..., Fk observed), what is the distribution over the class?

  P(C | F1, ..., Fk) = 1/Z P(C, F1, ..., Fk)
                     = 1/Z Σfk+1...fn P(C, F1, ..., Fk, fk+1, ..., fn)
                     = 1/Z P(C) Σfk+1...fn Πi=1...k P(Fi | C) Πj=k+1...n P(fj | C)
                     = 1/Z P(C) Πi=1...k P(Fi | C) Πj=k+1...n Σfj P(fj | C)
                     = 1/Z P(C) Πi=1...k P(Fi | C)

Slide 21: For General Bayes Nets

Exact inference: variable elimination
- Efficient for polytrees and certain "simple" graphs
- NP-hard in general

Approximate inference:
- Monte-Carlo sampling techniques
- Belief propagation (exact in polytrees)

Slide 22: Sum-Product Formulation

[Alarm network and CPTs as on Slide 6]

Suppose we want to compute P(A):

  P(A) = Σb,e P(A|b, e) P(b) P(e)

Slide 23: Sum-Product Formulation

Here Burglary, Earthquake, and Alarm have been combined into a single factor τ(A,B,E) = P(A|B,E) P(B) P(E); JohnCalls and MaryCalls still depend on A through P(J|A) and P(M|A) as on Slide 6.

  A  B  E  τ(A,B,E)
  T  T  T  1.9e-6
  T  T  F  0.000938
  T  F  T  0.000579
  T  F  F  0.000997
  F  T  T  1e-7
  F  T  F  5.988e-5
  F  F  T  0.00141858
  F  F  F  0.996

Suppose we want to compute P(A):

  P(A) = Σb,e τ(A, b, e)    (product step)

Slide 24: Sum-Product Formulation

[Factor τ(A,B,E) table as on Slide 23; CPTs P(J|A) and P(M|A) as on Slide 6]

Suppose we want to compute P(A):

  P(A) = Σb,e τ(A, b, e)

The product step forms τ(A, b, e) = P(A|b, e) P(b) P(e); the sum step then adds the entries with A = T and with A = F, giving P(A = T) and P(A = F).
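A sketch of the product and sum steps with numpy arrays, built from the slide's CPTs; the index convention 0 = True, 1 = False is an assumption of this sketch.

```python
import numpy as np

# CPT arrays indexed with 0 = True, 1 = False.
P_B = np.array([0.001, 0.999])
P_E = np.array([0.002, 0.998])
P_A_given_BE = np.array([[[0.95, 0.05],      # b=T, e=T: P(A=T), P(A=F)
                          [0.94, 0.06]],     # b=T, e=F
                         [[0.29, 0.71],      # b=F, e=T
                          [0.001, 0.999]]])  # b=F, e=F

# Product step: tau[a, b, e] = P(a | b, e) P(b) P(e).
tau = P_A_given_BE.transpose(2, 0, 1) * P_B[None, :, None] * P_E[None, None, :]

# Sum step: marginalize out b and e to get P(A).
P_A = tau.sum(axis=(1, 2))
print(P_A)   # ~[0.00252, 0.99748]
```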

Slide 25: Probability Queries

Computing P(Y, E) in a BN is a sum-product operation:

  P(Y, E) = Σw P(Y, W = w, E) = Σw τ(Y, E, W = w)

with τ(X) = ΠX P(X | PaX), the product of all the CPTs.

Idea of variable elimination: rearrange the order of the sum-products to obtain a recursive set of smaller sum-products.

Slide 26: Variable Elimination

Consider the linear network X1 -> X2 -> X3:

  P(X) = P(X1) P(X2|X1) P(X3|X2)
  P(X3) = Σx1 Σx2 P(x1) P(x2|x1) P(X3|x2)

Slide 27: Variable Elimination

Consider the linear network X1 -> X2 -> X3:

  P(X) = P(X1) P(X2|X1) P(X3|X2)
  P(X3) = Σx1 Σx2 P(x1) P(x2|x1) P(X3|x2)
        = Σx2 P(X3|x2) Σx1 P(x1) P(x2|x1)

Rearrange the equation ...

Slide 28: Variable Elimination

Consider the linear network X1 -> X2 -> X3:

  P(X) = P(X1) P(X2|X1) P(X3|X2)
  P(X3) = Σx1 Σx2 P(x1) P(x2|x1) P(X3|x2)
        = Σx2 P(X3|x2) Σx1 P(x1) P(x2|x1)
        = Σx2 P(X3|x2) τ(x2)

The factor τ(x2) is computed over each value of X2. Cache τ(x2) and use it for both values of X3!

Slide 29: Variable Elimination

Consider the linear network X1 -> X2 -> X3:

  P(X) = P(X1) P(X2|X1) P(X3|X2)
  P(X3) = Σx1 Σx2 P(x1) P(x2|x1) P(X3|x2)
        = Σx2 P(X3|x2) Σx1 P(x1) P(x2|x1)
        = Σx2 P(X3|x2) τ(x2)

τ(x2) is computed once for each value of X2. How many multiplications (*) and additions (+) are saved?

  *: 2*4*2 = 16 vs. 4 + 4 = 8
  +: 2*3 = 6 vs. 2 + 1 = 3

This can lead to huge gains in larger networks.
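A tiny illustration of the rearrangement on the chain, with made-up CPTs; the cached factor τ(x2) is computed once and reused for both values of X3.

```python
import numpy as np

# Illustrative CPTs for the chain X1 -> X2 -> X3 (binary, index 0 = T, 1 = F).
P_X1 = np.array([0.6, 0.4])
P_X2_given_X1 = np.array([[0.7, 0.3],    # row: x1; columns: P(X2=T|x1), P(X2=F|x1)
                          [0.2, 0.8]])
P_X3_given_X2 = np.array([[0.9, 0.1],
                          [0.5, 0.5]])

# Naive: P(X3) = sum_x1 sum_x2 P(x1) P(x2|x1) P(X3|x2), nested sums re-done for every X3 value.
naive = sum(P_X1[x1] * P_X2_given_X1[x1, x2] * P_X3_given_X2[x2]
            for x1 in range(2) for x2 in range(2))

# Variable elimination: tau(x2) = sum_x1 P(x1) P(x2|x1), computed once and cached.
tau = P_X1 @ P_X2_given_X1                 # shape (2,), one entry per value of X2
ve = tau @ P_X3_given_X2                   # P(X3) = sum_x2 tau(x2) P(X3|x2)

print(naive, ve)                           # identical distributions over X3
```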

Slide 30: VE in the Alarm Example

  P(E|j, m) = P(E, j, m) / P(j, m)

  P(E, j, m) = Σa Σb P(E) P(b) P(a|E, b) P(j|a) P(m|a)

Slide 31: VE in the Alarm Example

  P(E|j, m) = P(E, j, m) / P(j, m)

  P(E, j, m) = Σa Σb P(E) P(b) P(a|E, b) P(j|a) P(m|a)
             = P(E) Σb P(b) Σa P(a|E, b) P(j|a) P(m|a)

Slide 32: VE in the Alarm Example

  P(E|j, m) = P(E, j, m) / P(j, m)

  P(E, j, m) = Σa Σb P(E) P(b) P(a|E, b) P(j|a) P(m|a)
             = P(E) Σb P(b) Σa P(a|E, b) P(j|a) P(m|a)
             = P(E) Σb P(b) τ(j, m, E, b)

The factor τ(j, m, E, b) is computed over all values of E and b. Note: τ(j, m, E, b) = P(j, m | E, b).

Slide 33: VE in the Alarm Example

  P(E|j, m) = P(E, j, m) / P(j, m)

  P(E, j, m) = Σa Σb P(E) P(b) P(a|E, b) P(j|a) P(m|a)
             = P(E) Σb P(b) Σa P(a|E, b) P(j|a) P(m|a)
             = P(E) Σb P(b) τ(j, m, E, b)
             = P(E) τ(j, m, E)

Computed for all values of E. Note: τ(j, m, E) = P(j, m | E).
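A sketch of this elimination order in code, using the alarm CPTs and the 0 = True, 1 = False convention from the earlier sketch, with evidence j = m = True; normalizing at the end recovers P(E | j, m).

```python
import numpy as np

# Alarm CPTs, indexed with 0 = True, 1 = False.
P_B = np.array([0.001, 0.999])
P_E = np.array([0.002, 0.998])
P_A_given_BE = np.array([[[0.95, 0.05], [0.94, 0.06]],
                         [[0.29, 0.71], [0.001, 0.999]]])   # axes [b, e, a]
P_J_given_A = np.array([0.90, 0.05])   # P(J = T | a) for a = T, F
P_M_given_A = np.array([0.70, 0.01])   # P(M = T | a) for a = T, F

# Evidence: JohnCalls = True, MaryCalls = True.
# Eliminate A first: tau1[b, e] = sum_a P(a|b,e) P(j|a) P(m|a)  (the slide's tau(j,m,E,b)).
tau1 = np.einsum('bea,a,a->be', P_A_given_BE, P_J_given_A, P_M_given_A)

# Eliminate B next: tau2[e] = sum_b P(b) tau1[b, e]  (the slide's tau(j,m,E) = P(j,m|E)).
tau2 = P_B @ tau1

# P(E, j, m) = P(E) tau2(E); normalize by P(j, m) to get the posterior.
joint_E = P_E * tau2
posterior_E = joint_E / joint_E.sum()
print(posterior_E)   # [P(E=T | j, m), P(E=F | j, m)]
```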

Slide 34: What Order to Perform VE?

For tree-like BNs (polytrees), order the eliminations so that parents come before children. The size of each intermediate probability table is then 2^k, where k is the number of parents of a node.

If the number of parents of each node is bounded, then VE runs in linear time!

Other networks: intermediate factors may become large.

Slide 35: Non-Polytree Networks

Network: A -> B, A -> C, B -> D, C -> D.

  P(D) = Σa Σb Σc P(a) P(b|a) P(c|a) P(D|b, c)
       = Σb Σc P(D|b, c) Σa P(a) P(b|a) P(c|a)

No more simplifications: the inner sum over a yields a factor over both b and c.
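A sketch showing why eliminating A first still leaves a factor over both b and c; the CPT numbers here are invented, only the structure matters.

```python
import numpy as np

# Illustrative CPTs (not from the slides) for A -> B, A -> C, B -> D <- C; binary, index 0 = T.
P_A = np.array([0.3, 0.7])
P_B_given_A = np.array([[0.8, 0.2], [0.1, 0.9]])          # axes [a, b]
P_C_given_A = np.array([[0.6, 0.4], [0.3, 0.7]])          # axes [a, c]
P_D_given_BC = np.array([[[0.99, 0.01], [0.7, 0.3]],
                         [[0.6, 0.4], [0.05, 0.95]]])     # axes [b, c, d]

# Eliminating A first leaves a factor over BOTH b and c; it does not decompose further.
tau_bc = np.einsum('a,ab,ac->bc', P_A, P_B_given_A, P_C_given_A)

# P(D) = sum_b sum_c P(D|b,c) tau(b,c)
P_D = np.einsum('bcd,bc->d', P_D_given_BC, tau_bc)
print(P_D)   # a valid distribution over D
```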

Slide 36: Do τ-Factors Correspond to Conditional Distributions?

Sometimes, but not necessarily.

[Network A -> B, A -> C, B -> D, C -> D, as on Slide 35]

Slide 37: Implementation Notes

- How do we implement multidimensional factors?
- How do we efficiently implement the sum-product operation?
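One possible answer to both questions, sketched with numpy: represent a factor as a table plus a list of variable names, implement the product by broadcasting over the union of the variables, and implement elimination by summing out an axis. This is an illustrative design, not the course's reference implementation.

```python
import numpy as np

class Factor:
    """A multidimensional factor: a table over named discrete variables."""
    def __init__(self, variables, table):
        self.variables = list(variables)             # axis i corresponds to variables[i]
        self.table = np.asarray(table, dtype=float)

    def multiply(self, other):
        """Sum-product building block: pointwise product with axis alignment."""
        variables = self.variables + [v for v in other.variables if v not in self.variables]
        return Factor(variables, self._expand(variables) * other._expand(variables))

    def marginalize(self, variable):
        """Sum out (eliminate) one variable."""
        axis = self.variables.index(variable)
        rest = self.variables[:axis] + self.variables[axis + 1:]
        return Factor(rest, self.table.sum(axis=axis))

    def _expand(self, variables):
        # Add singleton axes for missing variables, then order axes to match `variables`.
        t, order = self.table, list(self.variables)
        for v in variables:
            if v not in order:
                t = t[..., np.newaxis]
                order.append(v)
        return np.transpose(t, [order.index(v) for v in variables])

# Example: P(A) in the alarm network via sum-product (index 0 = True, 1 = False).
fB = Factor(['B'], [0.001, 0.999])
fE = Factor(['E'], [0.002, 0.998])
fA = Factor(['B', 'E', 'A'], [[[0.95, 0.05], [0.94, 0.06]],
                              [[0.29, 0.71], [0.001, 0.999]]])
pA = fA.multiply(fB).multiply(fE).marginalize('B').marginalize('E')
print(pA.variables, pA.table)   # ['A'] [~0.00252, ~0.99748]
```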