Slide 1: CS b553: Algorithms for Optimization and Learning

Bayesian Networks
Slide 2: Agenda

Bayesian networks
Chain rule for Bayes nets
Naïve Bayes models
Independence declarations
D-separation
Probabilistic inference queries
Slide 3: Purposes of Bayesian Networks

Efficient and intuitive modeling of complex causal interactions
Compact representation of joint distributions: O(n) rather than O(2^n)
Algorithms for efficient inference with given evidence (more on this next time)
Slide 4: Independence of Random Variables

Two random variables A and B are independent if
  P(A,B) = P(A) P(B)
  hence P(A|B) = P(A)
Knowing B doesn't give you any information about A.
[This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent.]
Slide 5: Significance of Independence

If A and B are independent, then
  P(A,B) = P(A) P(B)
=> The joint distribution over A and B can be defined as a product of the distribution of A and the distribution of B
=> Store two much smaller probability tables rather than one large probability table over all combinations of A and B
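A quick sketch of this savings (the marginal values here are made-up): storing two 2-entry tables is enough to reconstruct the full 4-entry joint on demand.

```python
# Two independent binary variables: store only their marginals (made-up values).
p_a = {True: 0.3, False: 0.7}   # P(A)
p_b = {True: 0.6, False: 0.4}   # P(B)

# Under independence, P(A,B) = P(A) P(B): reconstruct the joint as a product.
joint = {(a, b): p_a[a] * p_b[b] for a in p_a for b in p_b}
total = sum(joint.values())     # a valid joint distribution sums to 1
```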
Slide 6: Conditional Independence

Two random variables A and B are conditionally independent given C if
  P(A,B|C) = P(A|C) P(B|C)
  hence P(A|B,C) = P(A|C)
Once you know C, learning B doesn't give you any information about A.
[Again, this has to hold for all combinations of values that A, B, and C can take on.]
Slide 7: Significance of Conditional Independence

Consider Grade(CS101), Intelligence, and SAT.
Ostensibly, the grade in a course doesn't have a direct relationship with SAT scores, but good students are more likely to get good SAT scores, so they are not independent.
It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence.
Slide 8: Bayesian Network

Explicitly represent independence among propositions.
Notice that Intelligence is the "cause" of both Grade and SAT, and the causality is represented explicitly:
  Intel. -> Grade, Intel. -> SAT

P(I,G,S) = P(G,S|I) P(I) = P(G|I) P(S|I) P(I)
6 probabilities, instead of 11.

P(I=x):
  x      P
  high   0.3
  low    0.7

P(G=x|I):
  x     I=low   I=high
  'a'   0.2     0.74
  'b'   0.34    0.17
  'c'   0.46    0.09

P(S=x|I):
  x      I=low   I=high
  low    0.95    0.2
  high   0.05    0.8
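The factorization on this slide can be checked directly; a minimal sketch that rebuilds the full joint from the three CPTs above and verifies it is a proper distribution:

```python
# CPTs from the Intelligence/Grade/SAT example on this slide.
p_i = {'high': 0.3, 'low': 0.7}
p_g = {('a', 'low'): 0.2,  ('a', 'high'): 0.74,
       ('b', 'low'): 0.34, ('b', 'high'): 0.17,
       ('c', 'low'): 0.46, ('c', 'high'): 0.09}
p_s = {('low', 'low'): 0.95, ('low', 'high'): 0.2,
       ('high', 'low'): 0.05, ('high', 'high'): 0.8}

# Chain rule for this network: P(I,G,S) = P(G|I) P(S|I) P(I)
joint = {(i, g, s): p_g[(g, i)] * p_s[(s, i)] * p_i[i]
         for i in p_i for g in 'abc' for s in ('low', 'high')}
total = sum(joint.values())   # should be exactly 1
```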
Slide 9: Definition: Bayesian Network

Set of random variables X = {X1, ..., Xn} with domains Val(X1), ..., Val(Xn).
Each node has a set of parents Pa_X.
The graph must be a DAG.
Each node also maintains a conditional probability distribution (often, a table) P(X | Pa_X):
  2^k - 1 entries for binary-valued variables
  Overall: O(n 2^k) storage for binary variables
Encodes the joint probability over X1, ..., Xn.
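To make the storage claim concrete, a small sketch (n and k are chosen arbitrarily for illustration):

```python
# n binary variables, each with at most k parents (illustrative numbers).
n, k = 30, 3
full_joint_entries = 2 ** n - 1   # free entries in the full joint distribution
bn_entries = n * 2 ** k           # O(n 2^k) CPT entries in the Bayes net
```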
Slide 10: Calculation of Joint Probability

Network: Burglary -> Alarm <- Earthquake; Alarm -> JohnCalls, Alarm -> MaryCalls

P(b) = 0.001        P(e) = 0.002

P(a|B,E):
  B  E    P(a|...)
  T  T    0.95
  T  F    0.94
  F  T    0.29
  F  F    0.001

P(j|A):
  A    P(j|...)
  T    0.90
  F    0.05

P(m|A):
  A    P(m|...)
  T    0.70
  F    0.01

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = ??
Slide 11: Calculation of Joint Probability

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
  = P(j ∧ m | a, ¬b, ¬e) P(a ∧ ¬b ∧ ¬e)
  = P(j | a, ¬b, ¬e) P(m | a, ¬b, ¬e) P(a ∧ ¬b ∧ ¬e)    (J and M are independent given A)
P(j | a, ¬b, ¬e) = P(j | a)    (J and B, and J and E, are independent given A)
P(m | a, ¬b, ¬e) = P(m | a)
P(a ∧ ¬b ∧ ¬e) = P(a | ¬b, ¬e) P(¬b | ¬e) P(¬e) = P(a | ¬b, ¬e) P(¬b) P(¬e)    (B and E are independent)
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
Slide 12: Calculation of Joint Probability

Using the CPTs above:
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
  = 0.9 × 0.7 × 0.001 × 0.999 × 0.998
  = 0.00062
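The arithmetic on this slide, spelled out as a short check:

```python
# CPT entries from the burglary network.
p_b, p_e = 0.001, 0.002        # P(b), P(e)
p_a_nb_ne = 0.001              # P(a | ¬b, ¬e)
p_j_a, p_m_a = 0.90, 0.70      # P(j | a), P(m | a)

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p = p_j_a * p_m_a * p_a_nb_ne * (1 - p_b) * (1 - p_e)   # ≈ 0.00062
```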
Slide 13: Calculation of Joint Probability

In general:
  P(x1 ∧ x2 ∧ ... ∧ xn) = Π_{i=1,...,n} P(xi | pa_Xi)
the full joint distribution.
Slide 14: Chain Rule for Bayes Nets

The joint distribution is a product of all CPTs:
  P(X1, X2, ..., Xn) = Π_{i=1,...,n} P(Xi | Pa_Xi)
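The chain rule lends itself to a direct implementation; a minimal sketch using the burglary network's structure and CPTs (the single-letter variable names are abbreviations of my own):

```python
# Each node's parent tuple, and P(X = true | parent assignment) per node.
parents = {'B': (), 'E': (), 'A': ('B', 'E'), 'J': ('A',), 'M': ('A',)}
cpt = {
    'B': {(): 0.001},
    'E': {(): 0.002},
    'A': {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    'J': {(True,): 0.90, (False,): 0.05},
    'M': {(True,): 0.70, (False,): 0.01},
}

def joint_probability(x):
    """Chain rule: P(x1,...,xn) = product over i of P(xi | pa_xi)."""
    p = 1.0
    for var, pa in parents.items():
        pt = cpt[var][tuple(x[q] for q in pa)]   # P(var = true | parents)
        p *= pt if x[var] else 1.0 - pt
    return p

p = joint_probability({'B': False, 'E': False, 'A': True, 'J': True, 'M': True})
```

This reproduces the 0.00062 figure from the worked example.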
Slide 15: Example: Naïve Bayes Models

P(Cause, Effect1, ..., Effectn) = P(Cause) Π_i P(Effecti | Cause)

Cause -> Effect1, Effect2, ..., Effectn
Slide 16: Advantages of Bayes Nets (and Other Graphical Models)

More manageable number of parameters to set and store
Incremental modeling
Explicit encoding of independence assumptions
Efficient inference techniques
Slide 17: Arcs Do Not Necessarily Encode Causality

Consider three networks over A, B, C, e.g.:
  A -> B -> C
  C -> B -> A
Two BNs with the same expressive power, and a third structure over the same nodes with greater power (exercise).
Slide 18: Reading Off Independence Relationships

Network: A -> B -> C
Given B, does the value of A affect the probability of C? Is P(C|B,A) = P(C|B)?
No, it doesn't: C's parent (B) is given, so C is independent of its non-descendants (A).
Independence is symmetric: C ⊥ A | B  =>  A ⊥ C | B
Slide 19: Basic Rule

A node is independent of its non-descendants given its parents (and given nothing else).
Slide 20: What Does the BN Encode?

Network: Burglary -> Alarm <- Earthquake; Alarm -> JohnCalls, Alarm -> MaryCalls

JohnCalls ⊥ MaryCalls | Alarm
JohnCalls ⊥ Burglary | Alarm
JohnCalls ⊥ Earthquake | Alarm
MaryCalls ⊥ Burglary | Alarm
MaryCalls ⊥ Earthquake | Alarm

A node is independent of its non-descendants, given its parents.
Slide 21: Reading Off Independence Relationships

How about Burglary ⊥ Earthquake | Alarm?
No! Why?
Slide 22: Reading Off Independence Relationships

How about Burglary ⊥ Earthquake | Alarm?
No! Why?
  P(B ∧ E | A) = P(A|B,E) P(B ∧ E) / P(A) ≈ 0.00075
  P(B|A) P(E|A) ≈ 0.086
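The two numbers on this slide can be reproduced by brute-force enumeration of the (B, E, A) joint; a sketch using the network's CPTs:

```python
from itertools import product

p_b, p_e = 0.001, 0.002
cpt_a = {(True, True): 0.95, (True, False): 0.94,
         (False, True): 0.29, (False, False): 0.001}

def p_bea(b, e, a):
    """P(B=b, E=e, A=a) via the chain rule."""
    pa = cpt_a[(b, e)]
    return ((p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)
            * (pa if a else 1 - pa))

p_a = sum(p_bea(b, e, True) for b, e in product([True, False], repeat=2))
p_be_given_a = p_bea(True, True, True) / p_a                       # ≈ 0.00075
p_b_given_a = sum(p_bea(True, e, True) for e in (True, False)) / p_a
p_e_given_a = sum(p_bea(b, True, True) for b in (True, False)) / p_a
# p_b_given_a * p_e_given_a ≈ 0.086, so B and E are NOT independent given A
```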
Slide 23: Reading Off Independence Relationships

How about Burglary ⊥ Earthquake | JohnCalls?
No! Why?
Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent.
Slide 24: Independence Relationships

For polytrees, there exists a unique undirected path between A and B. For each node E on the path:
- Evidence on a directed chain X -> E -> Y or X <- E <- Y makes X and Y independent.
- Evidence on a common cause X <- E -> Y makes X and Y independent.
- Evidence on a "V" node, or below the V (X -> E <- Y, or X -> W <- Y where W has a directed path to the evidence E), makes X and Y dependent (otherwise they are independent).
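The three per-triple cases above can be captured in a small helper; a sketch with an encoding of my own choosing (the function name and arguments are not from the slides):

```python
def path_blocked(kind, evidence_on_e, evidence_on_descendant=False):
    """Is the triple X-E-Y blocked (i.e., X and Y made independent)?

    kind: 'chain'    for X -> E -> Y or X <- E <- Y,
          'fork'     for the common cause X <- E -> Y,
          'collider' for the V-structure X -> E <- Y.
    """
    if kind in ('chain', 'fork'):
        return evidence_on_e              # observing E blocks the path
    if kind == 'collider':
        # observing E, or any descendant of E, *activates* the path
        return not (evidence_on_e or evidence_on_descendant)
    raise ValueError(f"unknown triple kind: {kind}")
```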
Slide 25: General Case

Formal property in the general case:
D-separation: the above properties hold for all (acyclic) paths between A and B.
D-separation => independence.
That is, we can't read off any more independence relationships from the graph than those that are encoded in D-separation.
The CPTs may indeed encode additional independences.
Slide 26: Probability Queries

Given: some probabilistic model over variables X.
Find: the distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y:
  P(Y | E = e)
This is the inference problem.
Slide 27: Answering Inference Problems with the Joint Distribution

Easiest case: Y = X \ E
  P(Y | E = e) = P(Y, e) / P(e)
The denominator makes the probabilities sum to 1.
Determine P(e) by marginalizing: P(e) = Σ_y P(Y = y, e)

Otherwise, let W = X \ (E ∪ Y):
  P(Y | E = e) = Σ_w P(Y, W = w, e) / P(e)
  P(e) = Σ_y Σ_w P(Y = y, W = w, e)

Inference with the joint distribution: O(2^|X\E|) for binary variables.
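These sums can be written down directly; a sketch of inference by enumeration over a stored joint (the two-variable joint below is a made-up example for illustration):

```python
def infer(joint, variables, query_var, evidence):
    """P(query_var | evidence), where joint maps value tuples
    (ordered as in `variables`) to probabilities."""
    qi = variables.index(query_var)
    scores = {}
    for values, p in joint.items():       # sums out W = X \ (E ∪ Y)
        row = dict(zip(variables, values))
        if all(row[v] == val for v, val in evidence.items()):
            scores[values[qi]] = scores.get(values[qi], 0.0) + p
    z = sum(scores.values())              # = P(e), the normalizer
    return {y: s / z for y, s in scores.items()}

# Toy joint over (A, B) built from P(A=T)=0.4, P(B=T|A=T)=0.9, P(B=T|A=F)=0.2.
joint = {(True, True): 0.36, (True, False): 0.04,
         (False, True): 0.12, (False, False): 0.48}
posterior = infer(joint, ['A', 'B'], 'A', {'B': True})
```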
Slide 28: Naïve Bayes Classifier

P(Class, Feature1, ..., Featuren) = P(Class) Π_i P(Featurei | Class)

Class -> Feature1, Feature2, ..., Featuren

Given features, what class?
  P(C | F1, ..., Fn) = P(C, F1, ..., Fn) / P(F1, ..., Fn) = (1/Z) P(C) Π_i P(Fi | C)

Examples: Spam / Not Spam, or English / French / Latin ..., from word occurrences.
Slide 29: Naïve Bayes Classifier

P(Class, Feature1, ..., Featuren) = P(Class) Π_i P(Featurei | Class)

Given some features, what is the distribution over the class?
P(C | F1, ..., Fk)
  = (1/Z) P(C, F1, ..., Fk)
  = (1/Z) Σ_{f_{k+1}, ..., f_n} P(C, F1, ..., Fk, f_{k+1}, ..., f_n)
  = (1/Z) P(C) Σ_{f_{k+1}, ..., f_n} Π_{i=1..k} P(Fi | C) Π_{j=k+1..n} P(f_j | C)
  = (1/Z) P(C) Π_{i=1..k} P(Fi | C) Π_{j=k+1..n} Σ_{f_j} P(f_j | C)
  = (1/Z) P(C) Π_{i=1..k} P(Fi | C)
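The end result above says that unobserved features simply drop out of the product. A sketch of a word-based classifier (the classes, words, and probabilities are invented for illustration):

```python
p_class = {'spam': 0.3, 'ham': 0.7}
p_word = {                       # P(word occurs | class), invented numbers
    ('offer', 'spam'): 0.6,   ('offer', 'ham'): 0.05,
    ('meeting', 'spam'): 0.05, ('meeting', 'ham'): 0.4,
}

def classify(observed_words):
    """P(C | observed features) = (1/Z) P(C) * product over observed P(Fi|C)."""
    scores = dict(p_class)
    for w in observed_words:     # unobserved words contribute nothing
        for c in scores:
            scores[c] *= p_word[(w, c)]
    z = sum(scores.values())     # the 1/Z normalization
    return {c: s / z for c, s in scores.items()}

posterior = classify(['offer'])
```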
Slide 30: For General Queries

For BNs and queries in general, it's not that simple... more in later lectures.
Next class: skim 5.1-3, begin reading 9.1-4.