Slide 1: CS b553: Algorithms for Optimization and Learning
Bayesian Networks

Slide 2: Agenda
- Probabilistic inference queries
- Top-down inference
- Variable elimination

Slide 3: Probability Queries
Given: some probabilistic model over variables X
Find: the distribution over Y ⊆ X given evidence E = e, for some subset E ⊆ X \ Y
P(Y | E = e): the inference problem
Slide 4: Answering Inference Problems with the Joint Distribution
Easiest case: Y = X \ E
P(Y | E = e) = P(Y, e) / P(e)
- The denominator makes the probabilities sum to 1
- Determine P(e) by marginalizing: P(e) = Σ_y P(Y = y, e)
Otherwise, let W = X \ (E ∪ Y)
P(Y | E = e) = Σ_w P(Y, W = w, e) / P(e)
P(e) = Σ_y Σ_w P(Y = y, W = w, e)
Inference with the joint distribution is O(2^|X \ E|) for binary variables

Slide 5: Answering Inference Problems with the Joint Distribution
Another common case: Y = {Q} (a single query variable)
Can we do better than brute-force marginalization of the joint distribution?
Slide 6: Top-Down Inference
Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls

P(B) = 0.001    P(E) = 0.002

B E | P(A|B,E)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

A | P(J|A)
T | 0.90
F | 0.05

A | P(M|A)
T | 0.70
F | 0.01

Suppose we want to compute P(Alarm)
Slide 7: Top-Down Inference
Suppose we want to compute P(Alarm)
P(Alarm) = Σ_{b,e} P(A, b, e)
P(Alarm) = Σ_{b,e} P(A | b, e) P(b) P(e)
Slide 8: Top-Down Inference
Suppose we want to compute P(Alarm)
P(Alarm) = Σ_{b,e} P(A, b, e)
P(Alarm) = Σ_{b,e} P(A | b, e) P(b) P(e)
P(Alarm) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
Slide 9: Top-Down Inference
Suppose we want to compute P(Alarm)
P(A) = Σ_{b,e} P(A, b, e)
P(A) = Σ_{b,e} P(A | b, e) P(b) P(e)
P(A) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
P(A) = 0.95*0.001*0.002 + 0.94*0.001*0.998 + 0.29*0.999*0.002 + 0.001*0.999*0.998
     = 0.00252
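As a check on the arithmetic, here is the same top-down computation of P(Alarm) as a minimal Python sketch, using the CPT values from the slides:

```python
# CPTs from the burglary network above.
P_B, P_E = 0.001, 0.002
P_A_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}

def p(value, prob_true):
    """P(binary variable = value), given P(variable = True)."""
    return prob_true if value else 1.0 - prob_true

# P(A) = sum_{b,e} P(A | b, e) P(b) P(e)
p_alarm = sum(P_A_given[(b, e)] * p(b, P_B) * p(e, P_E)
              for b in (True, False) for e in (True, False))
print(round(p_alarm, 5))  # 0.00252
```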
Slide 10: Top-Down Inference
Now, suppose we want to compute P(MaryCalls)
Slide 11: Top-Down Inference
Now, suppose we want to compute P(MaryCalls)
P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)
Slide 12: Top-Down Inference
Now, suppose we want to compute P(MaryCalls)
P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)
P(M) = 0.70*0.00252 + 0.01*(1 - 0.00252)
     = 0.0117
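The two-stage computation (first P(A) top-down, then P(M) from it) can be sketched as:

```python
# CPTs from the burglary network above.
P_B, P_E = 0.001, 0.002
P_A_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}
P_M_given = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

def p(value, prob_true):
    """P(binary variable = value), given P(variable = True)."""
    return prob_true if value else 1.0 - prob_true

# Stage 1: P(A) = sum_{b,e} P(A | b, e) P(b) P(e)
p_a = sum(P_A_given[(b, e)] * p(b, P_B) * p(e, P_E)
          for b in (True, False) for e in (True, False))

# Stage 2: P(M) = P(M|A) P(A) + P(M|~A) P(~A)
p_m = P_M_given[True] * p_a + P_M_given[False] * (1.0 - p_a)
print(round(p_m, 4))  # 0.0117
```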
Slide 13: Top-Down Inference with Evidence
Suppose we want to compute P(Alarm | Earthquake)
Slide 14: Top-Down Inference with Evidence
Suppose we want to compute P(A | e)
P(A|e) = Σ_b P(A, b | e)
P(A|e) = Σ_b P(A | b, e) P(b)
Slide 15: Top-Down Inference with Evidence
Suppose we want to compute P(A | e)
P(A|e) = Σ_b P(A, b | e)
P(A|e) = Σ_b P(A | b, e) P(b)
P(A|e) = 0.95*0.001 + 0.29*0.999
       = 0.29066
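With evidence E = T, the sum runs only over b. A sketch using the same CPTs:

```python
# CPTs from the burglary network above.
P_B = 0.001
P_A_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}

# P(A | e) = sum_b P(A | b, e) P(b), with the evidence fixed at E = True
e = True
p_a_given_e = sum(P_A_given[(b, e)] * (P_B if b else 1.0 - P_B)
                  for b in (True, False))
print(round(p_a_given_e, 5))  # 0.29066
```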
Slide 16: Top-Down Inference
Only works if:
- the graph of ancestors of Q is a polytree
- evidence is given on ancestor(s) of Q
Efficient: O(d) time, where d is the number of ancestors of the query variable (assuming |Pa_X| is bounded by a constant)
Evidence on an ancestor cuts off the influence of the portion of the graph above the evidence node
Slide 17: Naïve Bayes Classifier
Network: Class → Feature_1, Feature_2, …, Feature_n
P(Class, Feature_1, …, Feature_n) = P(Class) Π_i P(Feature_i | Class)
P(C | F_1, …, F_n) = P(C, F_1, …, F_n) / P(F_1, …, F_n) = (1/Z) P(C) Π_i P(F_i | C)
Given the features, what is the class?
Examples: spam / not spam; English / French / Latin / …; features are word occurrences
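A minimal naïve Bayes classifier along these lines. The two-class, two-word spam model (priors and word probabilities) is entirely made up for illustration, not taken from the slides:

```python
from math import prod

# Hypothetical model: class prior P(C) and per-word conditionals P(word | C).
prior = {"spam": 0.3, "ham": 0.7}
p_word = {"spam": {"offer": 0.8, "meeting": 0.1},
          "ham":  {"offer": 0.1, "meeting": 0.6}}

def classify(words):
    """P(C | features) = (1/Z) P(C) * prod_i P(F_i | C)."""
    unnorm = {c: prior[c] * prod(p_word[c][w] for w in words) for c in prior}
    z = sum(unnorm.values())            # normalization factor Z
    return {c: u / z for c, u in unnorm.items()}

print(classify(["offer"]))  # posterior: spam ~ 0.774 with these made-up numbers
```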
Slide 18: Normalization Factors
P(C | F_1, …, F_n) = P(C, F_1, …, F_n) / P(F_1, …, F_n) = (1/Z) P(C) Π_i P(F_i | C)
The 1/Z term is a normalization factor so that P(C | F_1, …, F_n) sums to 1
Z = Σ_c P(C = c) Π_i P(F_i | C = c)
- Different for each value of F_1, …, F_n
- Often left implicit
Usual implementation: first compute the unnormalized distribution P(C) Π_i P(F_i = f_i | C) for all values of C, then perform a normalization step in O(|Val(C)|) time
Slide 19: Note: Numerical Issues in Implementation
Suppose P(f_i | c) is very small for all i, e.g., the probability that a given uncommon word f_i appears in a document
The product P(C) Π_i P(f_i | C) with large n will be exceedingly small and might underflow
More numerically stable solution:
- Compute log P(c) + Σ_i log P(f_i | c) for all values of c
- Compute b = max_c [log P(c) + Σ_i log P(f_i | c)]
- P(C | f_1, …, f_n) = exp(log P(C) + Σ_i log P(f_i | C) - b) / Z', with Z' a normalization factor
A common trick when dealing with products of many small numbers
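The stabilized computation described above, sketched with made-up log-probabilities. With 1000 features at probability around 1e-4 each, the direct product is around 1e-4000 and underflows to 0.0 as a float, while the log-space version still returns a sensible posterior:

```python
import math

# Log-space naive Bayes posterior, following the recipe on the slide.
def posterior_log_space(log_prior, log_like):
    """log_prior: {c: log P(c)}; log_like: {c: [log P(f_i | c) for each feature]}."""
    scores = {c: log_prior[c] + sum(log_like[c]) for c in log_prior}
    b = max(scores.values())                 # b = max_c [log P(c) + sum_i log P(f_i|c)]
    unnorm = {c: math.exp(s - b) for c, s in scores.items()}
    z = sum(unnorm.values())                 # Z', the normalization factor
    return {c: u / z for c, u in unnorm.items()}

# Hypothetical document with 1000 uncommon words.
log_prior = {"spam": math.log(0.5), "ham": math.log(0.5)}
log_like = {"spam": [math.log(1e-4)] * 1000,
            "ham":  [math.log(2e-4)] * 1000}
print(posterior_log_space(log_prior, log_like))  # ham dominates here
```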
Slide 20: Naïve Bayes Classifier
P(Class, Feature_1, …, Feature_n) = P(Class) Π_i P(Feature_i | Class)
Given only some features F_1, …, F_k (k < n), what is the distribution over the class?
P(C | F_1, …, F_k) = (1/Z) P(C, F_1, …, F_k)
= (1/Z) Σ_{f_{k+1} … f_n} P(C, F_1, …, F_k, f_{k+1}, …, f_n)
= (1/Z) P(C) Σ_{f_{k+1} … f_n} Π_{i=1…k} P(F_i | C) Π_{j=k+1…n} P(f_j | C)
= (1/Z) P(C) Π_{i=1…k} P(F_i | C) Π_{j=k+1…n} Σ_{f_j} P(f_j | C)
= (1/Z) P(C) Π_{i=1…k} P(F_i | C)
(The unobserved features drop out, since each Σ_{f_j} P(f_j | C) = 1.)
Slide 21: For General Bayes Nets
Exact inference: variable elimination
- Efficient for polytrees and certain "simple" graphs
- NP-hard in general
Approximate inference:
- Monte Carlo sampling techniques
- Belief propagation (exact in polytrees)
Slide 22: Sum-Product Formulation
Suppose we want to compute P(A)
P(A) = Σ_{b,e} P(A | b, e) P(b) P(e)
Slide 23: Sum-Product Formulation
Suppose we want to compute P(A)
P(A) = Σ_{b,e} τ(A, b, e), where τ(A, B, E) = P(A|B,E) P(B) P(E) (the product step):

A B E | τ(A,B,E)
T T T | 1.9e-6
T T F | 0.000938
T F T | 0.000579
T F F | 0.000997
F T T | 1e-7
F T F | 0.0000599
F F T | 0.00141858
F F F | 0.996
Slide 24: Sum-Product Formulation
Suppose we want to compute P(A)
P(A) = Σ_{b,e} τ(A, b, e) (the sum step):
P(A=T) = sum of the τ entries with A = T = 0.00252
P(A=F) = sum of the τ entries with A = F = 0.99748
Slide 25: Probability Queries
Computing P(Y, E) in a BN is a sum-product operation:
P(Y, E) = Σ_w P(Y, W = w, E) = Σ_w τ(Y, E, W = w)
with τ(x) = Π_{X ∈ X} P(X | Pa_X)
Idea of variable elimination: rearrange the order of the sums and products into a recursive set of smaller sum-products
Slide 26: Variable Elimination
Consider the linear network X1 → X2 → X3
P(X) = P(X1) P(X2 | X1) P(X3 | X2)
P(X3) = Σ_{x1} Σ_{x2} P(x1) P(x2 | x1) P(X3 | x2)
Slide 27: Variable Elimination
Consider the linear network X1 → X2 → X3
P(X) = P(X1) P(X2 | X1) P(X3 | X2)
P(X3) = Σ_{x1} Σ_{x2} P(x1) P(x2 | x1) P(X3 | x2)
      = Σ_{x2} P(X3 | x2) Σ_{x1} P(x1) P(x2 | x1)
Rearrange the equation…
Slide 28: Variable Elimination
Consider the linear network X1 → X2 → X3
P(X) = P(X1) P(X2 | X1) P(X3 | X2)
P(X3) = Σ_{x1} Σ_{x2} P(x1) P(x2 | x1) P(X3 | x2)
      = Σ_{x2} P(X3 | x2) Σ_{x1} P(x1) P(x2 | x1)
      = Σ_{x2} P(X3 | x2) τ(x2)
The factor τ(x2) is computed for each value of X2
Cache τ(x2), use it for both values of X3!
Slide 29: Variable Elimination
Consider the linear network X1 → X2 → X3
P(X) = P(X1) P(X2 | X1) P(X3 | X2)
P(X3) = Σ_{x1} Σ_{x2} P(x1) P(x2 | x1) P(X3 | x2)
      = Σ_{x2} P(X3 | x2) Σ_{x1} P(x1) P(x2 | x1)
      = Σ_{x2} P(X3 | x2) τ(x2), with τ(x2) computed for each value of X2
How many * and + saved?
*: 2*4*2 = 16 vs. 4 + 4 = 8
+: 2*3 = 6 vs. 2 + 1 = 3
Can lead to huge gains in larger networks
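The savings come from caching τ(x2). A runnable sketch of this elimination for the chain X1 → X2 → X3, with hypothetical binary CPTs:

```python
# Hypothetical CPTs for the chain X1 -> X2 -> X3 (all variables binary).
p_x1 = {0: 0.6, 1: 0.4}                                      # P(X1)
p_x2 = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}  # P(X2 | X1), keyed (x1, x2)
p_x3 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}  # P(X3 | X2), keyed (x2, x3)

# Eliminate X1 first: tau(x2) = sum_x1 P(x1) P(x2 | x1).
# Computed once per value of X2, then reused for every value of X3.
tau = {x2: sum(p_x1[x1] * p_x2[(x1, x2)] for x1 in (0, 1)) for x2 in (0, 1)}

# P(X3) = sum_x2 P(X3 | x2) tau(x2)
p_x3_marginal = {x3: sum(p_x3[(x2, x3)] * tau[x2] for x2 in (0, 1)) for x3 in (0, 1)}
print(p_x3_marginal)  # approximately {0: 0.7, 1: 0.3}
```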
Slide 30: VE in Alarm Example
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
Slide 31: VE in Alarm Example
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
           = P(E) Σ_b P(b) Σ_a P(a | E, b) P(j | a) P(m | a)
Slide 32: VE in Alarm Example
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
           = P(E) Σ_b P(b) Σ_a P(a | E, b) P(j | a) P(m | a)
           = P(E) Σ_b P(b) τ(j, m, E, b)
The factor is computed over all values of E, b
Note: τ(j, m, E, b) = P(j, m | E, b)
Slide 33: VE in Alarm Example
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
           = P(E) Σ_b P(b) Σ_a P(a | E, b) P(j | a) P(m | a)
           = P(E) Σ_b P(b) τ(j, m, E, b)
           = P(E) τ(j, m, E)
Computed for all values of E
Note: τ(j, m, E) = P(j, m | E)
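The full elimination for P(E | j, m) in the alarm network, eliminating A and then B exactly as in the derivation above. CPT values are from the slides; booleans are encoded as 0/1:

```python
# Burglary-network CPTs; P_A is keyed (b, e) and gives P(A=1 | b, e).
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J = {1: 0.90, 0: 0.05}   # P(J=1 | a)
P_M = {1: 0.70, 0: 0.01}   # P(M=1 | a)

def p(v, q):
    """Probability that a binary variable equals v, given P(var=1) = q."""
    return q if v else 1 - q

# Eliminate A: tau(j, m, E, b) = sum_a P(a | E, b) P(j | a) P(m | a), with j = m = 1.
tau_eb = {(e, b): sum(p(a, P_A[(b, e)]) * P_J[a] * P_M[a] for a in (0, 1))
          for e in (0, 1) for b in (0, 1)}

# Eliminate B: P(E, j, m) = P(E) sum_b P(b) tau(j, m, E, b)
p_e_jm = {e: p(e, P_E) * sum(p(b, P_B) * tau_eb[(e, b)] for b in (0, 1))
          for e in (0, 1)}

z = p_e_jm[0] + p_e_jm[1]                       # P(j, m)
posterior = {e: v / z for e, v in p_e_jm.items()}
print(posterior)  # P(E=1 | j, m) is approximately 0.176
```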
Slide 34: What Order to Perform VE?
For tree-like BNs (polytrees), order so that parents come before children
The size of each intermediate probability table is 2^k, where k is the number of parents of a node
If the number of parents of each node is bounded, then VE runs in linear time!
Other networks: intermediate factors may become large
Slide 35: Non-Polytree Networks
Network: A → B, A → C, B → D, C → D
P(D) = Σ_a Σ_b Σ_c P(a) P(b | a) P(c | a) P(D | b, c)
     = Σ_b Σ_c P(D | b, c) Σ_a P(a) P(b | a) P(c | a)
No more simplifications…
Slide 36: Do τ-Factors Correspond to Conditional Distributions?
Sometimes, but not necessarily
(Diagram: the same non-polytree network A → B, A → C, B → D, C → D)
Slide 37: Implementation Notes
How to implement multidimensional factors?
How to efficiently implement the sum-product operation?
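One common answer, sketched here under the assumption that all variables are binary: store each factor as a NumPy array with one axis per variable. Factor product then becomes broadcasting, and summing out a variable becomes an axis sum. The `multiply` and `sum_out` helpers below are illustrative, not from the slides:

```python
import numpy as np

# A factor = (list of variable names, ndarray with one axis per variable).
def multiply(f1, f2):
    """Pointwise product of two factors, aligning shared variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    def expand(vs, t):
        # Reorder t's axes to match out_vars, then insert singleton axes
        # for the variables t does not mention, so the tables broadcast.
        t = np.transpose(t, [vs.index(v) for v in out_vars if v in vs])
        return t.reshape([2 if v in vs else 1 for v in out_vars])
    return out_vars, expand(vars1, t1) * expand(vars2, t2)

def sum_out(f, var):
    """Marginalize a variable out of a factor."""
    vs, t = f
    return [v for v in vs if v != var], t.sum(axis=vs.index(var))

# Chain X1 -> X2: P(X2) = sum_x1 P(x1) P(x2 | x1)
f1 = (["X1"], np.array([0.6, 0.4]))                      # P(X1)
f2 = (["X1", "X2"], np.array([[0.7, 0.3], [0.2, 0.8]]))  # P(X2 | X1), axis 0 is X1
vars_, table = sum_out(multiply(f1, f2), "X1")
print(vars_, table)  # ['X2'] and approximately [0.5 0.5]
```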