A Not-So-Quick Overview of Probability
William W. Cohen
Machine Learning 10-605
Warmup: Zeno's paradox
Lance Armstrong and the tortoise have a race
Lance is 10x faster
Tortoise has a 1m head start at time 0
So, when Lance gets to 1m the tortoise is at 1.1m
So, when Lance gets to 1.1m the tortoise is at 1.11m …
So, when Lance gets to 1.11m the tortoise is at 1.111m … and Lance will never catch up -?
1+0.1+0.01+0.001+0.0001+… = ?
unresolved until calculus was invented
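The series on this slide can be checked numerically; a minimal sketch (the function name is mine):

```python
def geometric_sum(r, n_terms):
    """Partial sum 1 + r + r**2 + ... + r**(n_terms - 1)."""
    return sum(r ** k for k in range(n_terms))

# Zeno's distances form a geometric series with ratio r = 0.1;
# the limit 1/(1 - r) = 10/9 is where Lance draws level.
print(1 / (1 - 0.1))            # 10/9 = 1.111...
print(geometric_sum(0.1, 50))   # the partial sums converge to the limit
```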
The prosecution calls Gottfried Leibniz.
The Problem of Induction
David Hume (1711-1776) pointed out:
(1) Empirically, induction seems to work.
(2) Statement (1) is an application of induction.
This stumped people for about 200 years
Of the Different Species of Philosophy
Of the Origin of Ideas
Of the Association of Ideas
Sceptical Doubts Concerning the Operations of the Understanding
Sceptical Solution of These Doubts
Of Probability
Of the Idea of Necessary Connexion
Of Liberty and Necessity
Of the Reason of Animals
Of Miracles
Of A Particular Providence and of A Future State
Of the Academical Or Sceptical Philosophy
A Second Problem of Induction
A black crow seems to support the hypothesis "all crows are black".
A pink highlighter supports the hypothesis "all non-black things are non-crows".
Thus, a pink highlighter supports the hypothesis "all crows are black".
Probability Theory
Events
discrete random variables, boolean random variables, compound events
Axioms of probability
What defines a reasonable theory of uncertainty
Compound events
Independent events
Conditional probabilities
Bayes rule and beliefs
Joint probability distribution
Discrete Random Variables
A is a Boolean-valued random variable if
A denotes an event, a possible outcome of an "experiment",
there is uncertainty as to whether A occurs, i.e. the experiment is not deterministic.
Define P(A) as "the fraction of experiments in which A is true"
We're assuming all possible outcomes are equiprobable
Visualizing A
Event space of all possible worlds; its area is 1
Worlds in which A is false
Worlds in which A is true
P(A) = area of reddish oval
Discrete Random Variables
A is a Boolean-valued random variable if
A denotes an event, a possible outcome of an "experiment",
there is uncertainty as to whether A occurs, i.e. the experiment is not deterministic.
Define P(A) as "the fraction of experiments in which A is true"
We're assuming all possible outcomes are equiprobable
Examples
You roll two 6-sided dice (the experiment) and get doubles (A=doubles, the outcome)
I pick two students in the class (the experiment) and they have the same birthday (A=same birthday, the outcome)
A = I have Ebola
A = The US president in 2023 will be male
A = You wake up tomorrow with a headache
A = the 1,000,000,000,000th digit of π is 7
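The doubles example lends itself to a quick Monte Carlo check; a sketch assuming two fair d6 (names and trial count are mine):

```python
import random

def estimate_p_doubles(n_trials, seed=0):
    """Monte Carlo estimate of P(doubles) for two fair 6-sided dice."""
    rng = random.Random(seed)
    hits = sum(rng.randint(1, 6) == rng.randint(1, 6) for _ in range(n_trials))
    return hits / n_trials

print(estimate_p_doubles(100_000))   # close to the exact 1/6
```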
The Axioms of Probability
0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
(Diagram: "dice" / "experiments" mapped to events, random variables, …, probabilities)
The Axioms Of Probability
(This is Andrew's joke)
These Axioms are Not to be Trifled With
There have been many many other approaches to understanding “uncertainty”:
Fuzzy Logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, …
25 years ago people in AI argued about these; now they mostly don’t
Any scheme for combining uncertain information, uncertain "beliefs", etc., really should obey these axioms
If you gamble based on "uncertain beliefs", then [you can be exploited by an opponent] iff [your uncertainty formalism violates the axioms] - de Finetti 1931 (the "Dutch book argument")
Interpreting the axioms
0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
The area of A can't get any smaller than 0
And a zero area would mean no world could ever have A true
Interpreting the axioms
0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
The area of A can't get any bigger than 1
And an area of 1 would mean all worlds will have A true
Interpreting the axioms
0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
(Venn diagram: overlapping events A and B)
Interpreting the axioms
0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
(Venn diagram: P(A or B) is the union of A and B; P(A and B) is the overlap)
Simple addition and subtraction
Theorems from the Axioms
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
P(not A) = P(~A) = 1 - P(A)
Proof: P(A or ~A) = P(A) + P(~A) - P(A and ~A)
1 = P(A) + P(~A) - 0
since P(A or ~A) = 1 and P(A and ~A) = 0
Elementary Probability in Pictures
P(~A) + P(A) = 1
(Diagram: event space split into A and ~A)
Side Note
I am inflicting these proofs on you for two reasons:
These kinds of manipulations will need to be second nature to you if you use probabilistic analytics in depth
Suffering is good for you
(This is also Andrew's joke)
Another important theorem
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
P(A) = P(A ^ B) + P(A ^ ~B)
Proof: A = A and (B or ~B) = (A and B) or (A and ~B)
P(A) = P(A and B) + P(A and ~B) – P((A and B) and (A and ~B))
P(A) = P(A and B) + P(A and ~B) – P(A and A and B and ~B)
Elementary Probability in Pictures
P(A) = P(A ^ B) + P(A ^ ~B)
(Diagram: event space split into B and ~B; A straddles the boundary, giving regions A ^ B and A ^ ~B)
The LAWS of Probability
Laws of probability:
Axioms …
Monty Hall Problem proviso
The Monty Hall Problem
You’re in a game show. Behind one door is a prize. Behind the others, goats.
You pick one of three doors, say #1
The host, Monty Hall, opens one door, revealing…a goat!
You now can either:
stick with your guess
always change doors
flip a coin and pick a new door randomly according to the coin
The Monty Hall Problem
Case 1: you don't swap.
W = you win. Pre-goat: P(W)=1/3. Post-goat: P(W)=1/3
Case 2: you swap.
W1 = you picked the cash initially. W2 = you win.
Pre-goat: P(W1)=1/3
Post-goat: W2 = ~W1, so Pr(W2) = 1 - P(W1) = 2/3
Moral: ?
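The two cases can be checked by simulation; a sketch (all names mine), with Monty always opening a goat door that isn't the contestant's pick:

```python
import random

def monty_hall(n_trials, swap, seed=0):
    """Estimate the win probability of the stick/swap strategies."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        prize = rng.randrange(3)
        pick = rng.randrange(3)
        # Monty opens a door that is neither the pick nor the prize
        opened = next(d for d in range(3) if d != pick and d != prize)
        if swap:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == prize)
    return wins / n_trials

print(monty_hall(100_000, swap=False))   # close to 1/3
print(monty_hall(100_000, swap=True))    # close to 2/3
```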
The Extreme Monty Hall/Survivor Problem
You’re in a game show. There are 10,000 doors. Only one of them has a prize.
You pick a door.
Over the remaining 13 weeks, the host eliminates 9,998 of the remaining doors.
For the season finale:
Do you switch, or not?
…
Some practical problems
You’re the DM in a D&D game.
Joe brings his own d20 and throws 4 critical hits in a row to start off
DM=dungeon master
D20 = 20-sided die
“Critical hit” = 19 or 20
Is Joe cheating?
What is P(A), A=four critical hits?
A is a compound event: A = C1 and C2 and C3 and C4
Independent Events
Definition: two events A and B are independent if Pr(A and B) = Pr(A)*Pr(B).
Intuition: the outcome of A has no effect on the outcome of B (and vice versa).
We need to assume the different rolls are independent to solve the problem.
You frequently need to assume the independence of something to solve any learning problem.
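Under the independence assumption the compound event factors into a product; a sketch (names mine), assuming a fair d20 with P(crit) = 2/20:

```python
import random

P_CRIT = 2 / 20   # "critical hit" = 19 or 20 on a fair d20

# Under independence, P(C1 and C2 and C3 and C4) = P(C)**4
exact = P_CRIT ** 4   # = 0.0001

def simulate(n_trials, seed=0):
    """Fraction of trials in which all four d20 rolls are 19 or 20."""
    rng = random.Random(seed)
    hits = sum(all(rng.randint(1, 20) >= 19 for _ in range(4))
               for _ in range(n_trials))
    return hits / n_trials

print(exact, simulate(1_000_000))
```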
Some practical problems
You’re the DM in a D&D game.
Joe brings his own d20 and throws 4 critical hits in a row to start off
DM=dungeon master
D20 = 20-sided die
“Critical hit” = 19 or 20
What are the odds of that happening with a fair die?
Ci = critical hit on trial i, i=1,2,3,4
P(C1 and C2 … and C4) = P(C1)*…*P(C4) = (1/10)^4
Followup: D = pick an ace or king out of a deck three times in a row: D = D1 ^ D2 ^ D3
Some practical problems
The specs for the loaded d20 say that it has 20 outcomes, X where
P(X=20) = 0.25
P(X=19) = 0.25
for i=1,…,18, P(X=i) = Z * 1/18
What is Z?
Multivalued Discrete Random Variables
Suppose A can take on more than 2 values
A is a random variable with arity k if it can take on exactly one value out of {v1, v2, …, vk}
Example: V = {aaliyah, aardvark, …, zymurge, zynga}
Example: V = {aaliyah_aardvark, …, zynga_zymgurgy}
Thus…
Terms: Binomials and Multinomials
Suppose A can take on more than 2 values
A is a random variable with arity k if it can take on exactly one value out of {v1, v2, …, vk}
Example: V = {aaliyah, aardvark, …, zymurge, zynga}
Example: V = {aaliyah_aardvark, …, zynga_zymgurgy}
The distribution Pr(A) is a multinomial
For k=2 the distribution is a binomial
More about Multivalued Random Variables
Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
And assuming that A obeys P(A=vi and A=vj) = 0 for i ≠ j, and P(A=v1 or A=v2 or … or A=vk) = 1,
it's easy to prove that P(A=v1 or … or A=vi) = P(A=v1) + … + P(A=vi)
And thus we can prove P(A=v1) + P(A=v2) + … + P(A=vk) = 1
Elementary Probability in Pictures
(Diagram: event space partitioned into A=1, A=2, A=3, A=4, A=5)
Elementary Probability in Pictures
(Diagram: event space partitioned into A=aardvark, A=aaliyah, …, A=zynga, …)
Some practical problems
The specs for the loaded d20 say that it has 20 outcomes, X
P(X=20) = P(X=19) = 0.25
for i=1,…,18, P(X=i) = z … and what is z?
Some practical problems
You (probably) have 8 neighbors and 5 close neighbors.
What is Pr(A), A = one or more of your neighbors has the same sign as you? What's the experiment?
What is Pr(B), B = you and your close neighbors all have different signs?
What about neighbors?
(Diagram: a grid of neighbors n and close neighbors c surrounding you, *)
Moral: ?
Some practical problems
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
P(X=20) = P(X=19) = 0.25
for i=1,…,18, P(X=i) = 0.5 * 1/18
Some practical problems
I have 3 standard d20 dice, 1 loaded die.
Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A = d20 picked is fair and B = roll 19 or 20 with that die. What is P(B)?
P(B) = P(B and A) + P(B and ~A)
     = 0.1*0.75 + 0.5*0.25 = 0.2
using Andrew's "important theorem" P(A) = P(A ^ B) + P(A ^ ~B)
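The calculation can be sanity-checked by simulating the two-stage experiment; a sketch (names and trial count are mine):

```python
import random

def p_crit(n_trials, seed=0):
    """Estimate P(B): pick one of 3 fair + 1 loaded d20 uniformly, roll it.
    Fair die: P(19 or 20) = 0.1; loaded die: 0.5 (the slide's numbers)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        fair = rng.randrange(4) < 3                  # P(A) = 3/4
        hits += rng.random() < (0.1 if fair else 0.5)
    return hits / n_trials

exact = 0.1 * 0.75 + 0.5 * 0.25                      # P(B) = 0.2
print(exact, p_crit(200_000))
```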
Elementary Probability in Pictures
P(A) = P(A ^ B) + P(A ^ ~B)
(Diagram: B and ~B split the space; A covers regions A ^ B and A ^ ~B)
Followup: What if I change the ratio of fair to loaded dice in the experiment?
Some practical problems
I have lots of standard d20 dice, lots of loaded dice, all identical.
Experiment is the same: (1) pick a d20 uniformly at random then (2) roll it. Can I mix the dice together so that P(B)=0.137?
P(B) = P(B and A) + P(B and ~A)
     = 0.1*λ + 0.5*(1 - λ) = 0.137
λ = (0.5 - 0.137)/0.4 = 0.9075
a "mixture model"
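Solving the mixture equation for λ; a sketch (the function name is mine):

```python
def fair_fraction(target, p_fair=0.1, p_loaded=0.5):
    """Solve target = p_fair*lam + p_loaded*(1 - lam) for lam,
    the fraction of fair dice in the mixture."""
    return (p_loaded - target) / (p_loaded - p_fair)

lam = fair_fraction(0.137)
print(lam)   # (0.5 - 0.137) / 0.4 = 0.9075, matching the slide
```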
Another picture for this problem
(Diagram: the space split into A (fair die) and ~A (loaded); B carves out regions A and B, ~A and B)
It's more convenient to say
"if you've picked a fair die then …", i.e. Pr(critical hit | fair die) = 0.1
"if you've picked the loaded die then …", Pr(critical hit | loaded die) = 0.5
Conditional probability: Pr(B|A) = P(B ^ A) / P(A)
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)
Some practical problems
I have 3 standard d20 dice, 1 loaded die.
Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A = d20 picked is fair and B = roll 19 or 20 with that die. What is P(B)?
P(B) = P(B|A) P(A) + P(B|~A) P(~A)
     = 0.1*0.75 + 0.5*0.25 = 0.2
"marginalizing out" A
(Diagram: A (fair die) with P(A), ~A (loaded) with P(~A); regions A and B, ~A and B, labeled P(B|A) and P(B|~A))
P(B) = P(B|A)P(A) + P(B|~A)P(~A)
Some practical problems
I have 3 standard d20 dice, 1 loaded die.
Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die.
Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair? i.e. what is P(A|B)?
(Diagram: A (fair die) with P(A), ~A (loaded) with P(~A); regions A and B, ~A and B, labeled P(B|A) and P(B|~A))
P(A and B) = P(B|A) * P(A)
P(A and B) = P(A|B) * P(B)
so P(A|B) * P(B) = P(B|A) * P(A)
P(A|B) = P(B|A) * P(A) / P(B)
P(A|B) = ?
P(A|B) = P(B|A) * P(A) / P(B)
P(B|A) = P(A|B) * P(B) / P(A)
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…
Bayes' rule: P(A) is the prior, P(A|B) is the posterior
More General Forms of Bayes Rule
Useful Easy-to-prove facts
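The equation images on these slides did not survive extraction; for reference, the standard statements (my reconstruction):

```latex
% Bayes rule conditioned on background evidence X:
P(A \mid B, X) = \frac{P(B \mid A, X)\, P(A \mid X)}{P(B \mid X)}
% Bayes rule for a multivalued A with values v_1, \dots, v_k:
P(A{=}v_i \mid B) = \frac{P(B \mid A{=}v_i)\, P(A{=}v_i)}
                         {\sum_{j=1}^{k} P(B \mid A{=}v_j)\, P(A{=}v_j)}
% Easy-to-prove facts:
P(A \mid B) + P(\neg A \mid B) = 1, \qquad \sum_{j=1}^{k} P(A{=}v_j \mid B) = 1
```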
More about Bayes rule
An Intuitive Explanation of Bayesian Reasoning: Bayes' Theorem for the curious and bewildered; an excruciatingly gentle introduction - Eliezer Yudkowsky
Problem: Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the bin contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl?
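Working the egg problem with Bayes' rule; a sketch (variable names are mine):

```python
# 40% pearl, 60% empty; P(blue | pearl) = 0.30, P(blue | empty) = 0.10
p_pearl, p_empty = 0.40, 0.60
p_blue_given_pearl, p_blue_given_empty = 0.30, 0.10

# marginalize to get P(blue), then apply Bayes' rule
p_blue = p_blue_given_pearl * p_pearl + p_blue_given_empty * p_empty
p_pearl_given_blue = p_blue_given_pearl * p_pearl / p_blue
print(p_pearl_given_blue)   # 0.12 / 0.18 = 2/3
```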
Some practical problems
Joe throws 4 critical hits in a row, is Joe cheating?
A = Joe using cheater’s die
C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1
B = C1 and C2 and C3 and C4
Pr(B|A) = 0.0625  P(B|~A) = 0.0001
What’s the experiment and outcome here?
Outcome A: Joe is cheating
Experiment:
Joe picked a die uniformly at random from a bag containing 10,000 fair dice and one bad one.
Joe is a D&D player picked uniformly at random from a set of 1,000,000 people and n of them cheat with probability p>0.
I have no idea, but I don't like his looks. Call it P(A)=0.1
Remember: Don’t Mess with The Axioms
A subjective belief can be treated, mathematically, like a probability
Use those axioms!
There have been many many other approaches to understanding “uncertainty”:
Fuzzy Logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, …
25 years ago people in AI argued about these; now they mostly don’t
Any scheme for combining uncertain information, uncertain "beliefs", etc., really should obey these axioms
If you gamble based on "uncertain beliefs", then [you can be exploited by an opponent] iff [your uncertainty formalism violates the axioms] - de Finetti 1931 (the "Dutch book argument")
Some practical problems
Joe throws 4 critical hits in a row, is Joe cheating?
A = Joe using cheater’s die
C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1
B = C1 and C2 and C3 and C4
Pr(B|A) = 0.0625  P(B|~A) = 0.0001
Moral: with enough evidence the prior P(A) doesn't really matter.
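The moral can be made concrete by computing the posterior under several priors; a sketch (function name mine):

```python
def p_cheating(prior):
    """Posterior P(cheating | 4 crits): P(B|A)=0.5**4, P(B|~A)=0.1**4."""
    like_cheat, like_fair = 0.5 ** 4, 0.1 ** 4
    num = like_cheat * prior
    return num / (num + like_fair * (1 - prior))

for prior in (0.5, 0.1, 0.01):
    print(prior, p_cheating(prior))
# the posterior sits far above the prior: the evidence dominates
```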
Some practical problems
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
1. Collect some data (20 rolls)
2. Estimate Pr(i) = C(rolls of i) / C(any roll)
One solution
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
P(1)=0
P(2)=0
P(3)=0
P(4)=0.1
…
P(19)=0.25
P(20)=0.2
MLE = maximum likelihood estimate
But: Do I really think it's impossible to roll a 1, 2 or 3?
Would you bet your house on it?
A better solution
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
1. Collect some data (20 rolls)
2. Estimate Pr(i) = C(rolls of i) / C(any roll)
0. Imagine some data (20 rolls, each i shows up 1x)
A better solution
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
P(1)=1/40
P(2)=1/40
P(3)=1/40
P(4)=(2+1)/40
…
P(19)=(5+1)/40
P(20)=(4+1)/40=1/8
0.25 vs. 0.125 – really different! Maybe I should "imagine" less data?
A better solution?
Q: What if I used m rolls with a probability of q=1/20 of rolling any i?
Pr(i) = (C(i) + m*q) / (C(ANY) + m)
I can use this formula with m>20, or even with m<20 … say with m=1
A better solution
Q: What if I used m rolls with a probability of q=1/20 of rolling any i?
If m >> C(ANY) then your imagination q rules
If m << C(ANY) then your data rules BUT you never ever ever end up with Pr(i)=0
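The imagined-data estimator can be written directly; a sketch (the example rolls are made up, but chosen so the counts for faces 4, 19 and 20 match the slide):

```python
from collections import Counter

def smoothed_probs(rolls, m=20, q=1/20, sides=20):
    """Pr(i) = (C(i) + m*q) / (C(ANY) + m): observed counts blended
    with m imaginary rolls spread uniformly over the faces."""
    c = Counter(rolls)
    total = len(rolls)
    return {i: (c[i] + m * q) / (total + m) for i in range(1, sides + 1)}

# made-up 20 rolls: counts of 2, 5, 4 for faces 4, 19, 20 as on the slide
rolls = [20] * 4 + [19] * 5 + [4] * 2 + [18] * 9
probs = smoothed_probs(rolls, m=20)
print(probs[20])   # (4 + 1) / 40 = 0.125
print(probs[1])    # (0 + 1) / 40 = 0.025: never exactly zero
```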
Terminology
This is called a uniform Dirichlet prior
C(i), C(ANY) are sufficient statistics
It's equivalent to doing this:
p = some probability distribution over 1…20
Pr(P=p | Prior) = f(…)
Pr(P=p | Prior, data) = g(…)
Pr(I=i | Prior, data) = …
MLE = maximum likelihood estimate
MAP = maximum a posteriori estimate
f and g are the same function, diff't args
f is a conjugate prior
Terminology – more later
This is called a uniform Dirichlet prior
C(i), C(ANY) are sufficient statistics
MLE = maximum likelihood estimate
MAP = maximum a posteriori estimate
Conjugate Priors and The Dirichlet
[discuss]
Some practical problems
I have 1 standard d6 die, 2 loaded d6 dice.
Loaded high: P(X=6)=0.50  Loaded low: P(X=1)=0.50
Experiment: pick two of the three d6 uniformly at random (A) and roll them. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL
P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)
     = P(D | A=HL)*P(A=HL) + P(D | A=HF)*P(A=HF) + P(D | A=FL)*P(A=FL)
Elementary Probability in Pictures
(Diagram: space partitioned into A=HL, A=HF, A=FL; B carves out regions B ^ A=HL, B ^ A=HF, B ^ A=FL)
Elementary Probability in Pictures
(Diagram: space partitioned into A=1, …, A=5; B carves out regions B ^ A=1, …, B ^ A=5)
Elementary Probability in Pictures
(Diagram: space partitioned into A=1, …, A=5; the A=1 slice scaled by P(A=1))
Think of multiplying by P(A=1) as "squeezing" an area by some fraction
Elementary Probability in Pictures
(Diagram: the A=1 slice scaled by P(A=1), then by P(B|A=1), giving the area P(A and B))
Some practical problems
I have 1 standard d6 die, 2 loaded d6 dice.
Loaded high: P(X=6)=0.50  Loaded low: P(X=1)=0.50
Experiment: pick two of the three d6 uniformly at random (A) and roll them. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL
P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)
     = P(D | A=HL)*P(A=HL) + P(D | A=HF)*P(A=HF) + P(D | A=FL)*P(A=FL)
Some practical problems
I have 1 standard d6 die, 2 loaded d6 dice.
Loaded high: P(X=6)=0.50  Loaded low: P(X=1)=0.50
Experiment: pick two of the three d6 uniformly at random (A) and roll them. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL

Roll 1 \ Roll 2:  1  2  3  4  5  6
1                 D  .  .  .  .  7
2                 .  D  .  .  7  .
3                 .  .  D  7  .  .
4                 .  .  7  D  .  .
5                 .  7  .  .  D  .
6                 7  .  .  .  .  D
(D = doubles, 7 = total of seven)
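The comparison can be computed exactly by enumeration; a sketch assuming the loaded dice put probability 0.1 on each of their five non-favored faces (the slides only give the 0.5 entries) and that one of the pairs HL, HF, FL is picked uniformly:

```python
import itertools

FAIR = {i: 1/6 for i in range(1, 7)}
HIGH = {i: (0.5 if i == 6 else 0.1) for i in range(1, 7)}  # loaded high
LOW  = {i: (0.5 if i == 1 else 0.1) for i in range(1, 7)}  # loaded low

def p_event(pred):
    """Exact P(event) when a pair (HL, HF, or FL) is picked with
    probability 1/3 each and both dice are rolled."""
    total = 0.0
    for d1, d2 in [(HIGH, LOW), (HIGH, FAIR), (FAIR, LOW)]:
        for r1, r2 in itertools.product(range(1, 7), repeat=2):
            if pred(r1, r2):
                total += (1/3) * d1[r1] * d2[r2]
    return total

p_seven   = p_event(lambda a, b: a + b == 7)
p_doubles = p_event(lambda a, b: a == b)
print(p_seven, p_doubles)   # seven comes out more likely under these assumptions
```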
A brute-force solution
A   Roll 1  Roll 2  P                  Comment
FL  1       1       1/3 * 1/6 * 1/2    doubles
FL  1       2       1/3 * 1/6 * 1/10
FL  1       …       …
FL  1       6       …                  seven
FL  2       1       …
FL  2       …       …
…   …       …       …
FL  6       6       …                  doubles
HL  1       1       …                  doubles
HL  1       2       …
…   …       …       …
HF  1       1       …                  doubles
…

A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk
With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g.
P(doubles)
P(seven or eleven)
P(total is higher than 5)
….
The Joint Distribution
Recipe for making a joint distribution of M variables:
Example: Boolean variables A, B, C
The Joint Distribution
Recipe for making a joint distribution of M variables:
Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
Example: Boolean variables A, B, C

A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
The Joint Distribution
Recipe for making a joint distribution of M variables:
Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
For each combination of values, say how probable it is.
Example: Boolean variables A, B, C

A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10
The Joint Distribution
Recipe for making a joint distribution of M variables:
Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
For each combination of values, say how probable it is.
If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C

A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10

(Venn diagram of A, B, C labeled with the same eight probabilities)
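Queries against the joint table on this slide can be written as one-liners; a sketch (names mine):

```python
# The slide's joint distribution over Boolean variables A, B, C
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(pred):
    """P(expr) = sum of the rows where the Boolean expression holds."""
    return sum(p for (a, b, c), p in joint.items() if pred(a, b, c))

print(prob(lambda a, b, c: True))     # rows sum to 1, per the axioms
print(prob(lambda a, b, c: a == 1))   # marginal P(A)
print(prob(lambda a, b, c: a or b))   # P(A or B)
```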
Using the Joint
Once you have the JD you can ask for the probability of any logical expression involving your attributes
Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset. [Kohavi, 1996]
Number of Instances: 48,842
Number of Attributes: 14 (in UCI's copy of the dataset); 3 (here)
Using the Joint
P(Poor Male) = 0.4654
Using the Joint
P(Poor) = 0.7604
Inference with the Joint
Inference with the Joint
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
Estimating the joint distribution
Collect some data points
Estimate the probability P(E1=e1 ^ … ^ En=en) as #(that row appears) / #(any row appears)
….

Gender  Hours  Wealth
g1      h1     w1
g2      h2     w2
…       …      …
gN      hN     wN
Inference is a big deal
I’ve got this evidence. What’s the chance that this conclusion is true?
I’ve got a sore neck: how likely am I to have meningitis?
I see my lights are out and it's 9pm. What's the chance my spouse is already asleep?
Estimating the joint distribution
For each combination of values r:
  Total = C[r] = 0
For each data row ri:
  C[ri]++; Total++
Pr(ri) = C[ri] / Total, e.g. ri is "female, 40.5+, poor"

Gender  Hours  Wealth
g1      h1     w1
g2      h2     w2
…       …      …
gN      hN     wN

Complexity? O(n), n = total size of input data
Complexity? O(2^d), d = #attributes (all binary)
Estimating the joint distribution
For each combination of values r:
  Total = C[r] = 0
For each data row ri:
  C[ri]++; Total++

Complexity? O(n), n = total size of input data
Complexity? O(k1 * k2 * … * kd), ki = arity of attribute i
Estimating the joint distribution
For each data row ri:
  If ri not in hash tables C, Total: insert C[ri] = 0
  C[ri]++; Total++

Gender  Hours  Wealth
g1      h1     w1
g2      h2     w2
…       …      …
gN      hN     wN

Complexity? O(n), n = total size of input data
Complexity? O(m), m = size of the model
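The hash-based counting loop in Python; a sketch (the data rows are made up):

```python
from collections import Counter

def estimate_joint(rows):
    """Hash-based estimate: store only rows that actually occur, so space
    is O(m) in the number of distinct rows, not the product of arities."""
    counts = Counter(rows)     # C[ri]++ for each data row ri
    total = len(rows)          # Total++
    return {r: c / total for r, c in counts.items()}

# made-up rows in the slide's Gender / Hours / Wealth format
data = [("female", "40.5+", "poor"), ("male", "<40.5", "rich"),
        ("female", "40.5+", "poor"), ("female", "<40.5", "poor")]
joint = estimate_joint(data)
print(joint[("female", "40.5+", "poor")])   # 2/4 = 0.5
```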
Another example….
Big ML c. 2001 (Banko & Brill, "Scaling to Very Very Large…", ACL 2001)
Task: distinguish pairs of easily-confused words ("affect" vs "effect") in context
Big ML c. 2001 (Banko & Brill, "Scaling to Very Very Large…", ACL 2001)
AN EXAMPLE OF THE JOINT

A     B       C       D    E          p
is    the     effect  of   the        0.00036
is    the     effect  of   a          0.00034
.     The     effect  of   this       0.00034
to    this    effect  :    "          0.00034
be    the     effect  of   the        …
…     …       …       …    …          …
not   the     effect  of   any        0.00024
…     …       …       …    …          …
does  not     affect  the  general    0.00020
does  not     affect  the  question   0.00020
any   manner  affect  the  principle  0.00018
The Joint Distribution for a "Big Data" task….
Some of the Joint Distribution
A     B       C       D    E          p
is    the     effect  of   the        0.00036
is    the     effect  of   a          0.00034
.     The     effect  of   this       0.00034
to    this    effect  :    "          0.00034
be    the     effect  of   the        …
…     …       …       …    …          …
not   the     effect  of   any        0.00024
…     …       …       …    …          …
does  not     affect  the  general    0.00020
does  not     affect  the  question   0.00020
any   manner  affect  the  principle  0.00018
An experiment
Starting point: Google books 5-gram data
All 5-grams that appear >= 40 times in a corpus of 1M English books
approx 80B words
5-grams: 30Gb compressed, 250-300Gb uncompressed
Each 5-gram contains a frequency distribution over years
Extract all 5-grams from books published before 2000 that contain 'effect' or 'affect' in the middle position
about 20 "disk hours"; approx 100M occurrences; approx 50k distinct n-grams --- not big
Wrote code to compute Pr(A,B,C,D,E | C=affect or C=effect) and Pr(any subset of A,…,E | any other subset, C=affect V effect)
Another experiment
Extracted all affect/effect 5-grams from the old (small) Reuters corpus
about 20k documents
about 723 n-grams, 661 distinct
Financial news, not novels or textbooks
Tried to predict center word with:
Pr(C | A=a, B=b, D=d, E=e)
then P(C | A,B,D, C=effect V affect)
then P(C | B,D, C=effect V affect)
then P(C | B, C=effect V affect)
then P(C, C=effect V affect)
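The backoff chain can be sketched as follows (a simplified reconstruction, not the original code; the counts dict is made up):

```python
from collections import Counter

def backoff_predict(counts, context):
    """Predict the center word C of a 5-gram (A,B,C,D,E) given context
    (a,b,d,e), backing off P(C|A,B,D,E) -> P(C|A,B,D) -> P(C|B,D)
    -> P(C|B) -> P(C) until some context has been seen."""
    a, b, d, e = context
    ctx_of = {4: lambda A, B, D, E: (A, B, D, E),
              3: lambda A, B, D, E: (A, B, D),
              2: lambda A, B, D, E: (B, D),
              1: lambda A, B, D, E: (B,),
              0: lambda A, B, D, E: ()}
    for keep in [(a, b, d, e), (a, b, d), (b, d), (b,), ()]:
        cand = Counter()
        for (A, B, C, D, E), n in counts.items():
            if ctx_of[len(keep)](A, B, D, E) == keep:
                cand[C] += n
        if cand:
            word, n = cand.most_common(1)[0]
            return word, n / sum(cand.values())
    return None, 0.0

# made-up toy counts
counts = {("the", "cumulative", "effect", "of", "the"): 3,
          ("go", "into", "effect", "on", "January"): 1}
print(backoff_predict(counts, ("the", "cumulative", "of", "the")))
```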
EXAMPLES
"The cumulative _ of the" → effect (1.0)
"Go into _ on January" → effect (1.0)
"From cumulative _ of accounting" not present
Nor is "From cumulative _ of _"
But "_ cumulative _ of _" → effect (1.0)
"Would not _ Finance Minister" not present
But "_ not _ _ _" → affect (0.9625)
Performance summary
Pattern        Used  Errors
P(C|A,B,D,E)   101   1
P(C|A,B,D)     157   6
P(C|B,D)       163   13
P(C|B)         244   78
P(C)           58    31