A Quick Overview of Probability - Presentation Transcript

Slide1

A Quick Overview of Probability

William W. Cohen

Machine Learning 10-605

Jan 19, 2012

Slide2

Probabilistic and Bayesian Analytics

Andrew W. Moore

School of Computer Science

Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © Andrew W. Moore

[Some material pilfered from

http://www.cs.cmu.edu/~awm/tutorials

]

Slide3

Tuesday’s Lecture - Review

Intro

Who, Where, When - administrivia
Why - motivations
What/How - assignments, grading, …
Review - how to count and what to count
Big-O and Omega notation, example, …
Costs of I/O vs. computation
What sort of computations do we want to do in (large-scale) machine learning programs?
Probability

Slide4

Today

Motivation: why the last 15 years have been awesome

What is probability and what can you do with it?
Variables, events, axioms of probability
Compound events
Conditional probabilities, chain rule, independent events, Bayes rule

Slide5

Warmup: Zeno's paradox

Lance Armstrong and the tortoise have a race

Lance is 10x faster.
The tortoise has a 1m head start at time 0.

[number line: Lance at 0m, tortoise at 1m]

So, when Lance gets to 1m the tortoise is at 1.1m

So, when Lance gets to 1.1m the tortoise is at 1.11m …

So, when Lance gets to 1.11m the tortoise is at 1.111m … and Lance will never catch up - ?

1 + 0.1 + 0.01 + 0.001 + 0.0001 + … = ?

The sum is a geometric series: 1/(1 - 0.1) = 10/9 ≈ 1.111m, exactly where Lance draws level. But the puzzle was unresolved until calculus was invented.

Slide6

The Problem of Induction

David Hume (1711-1776) pointed out:

(1) Empirically, induction seems to work.
(2) Statement (1) is itself an application of induction.

This stumped people for about 200 years.

Of the Different Species of Philosophy.

Of the Origin of Ideas

Of the Association of Ideas

Sceptical Doubts Concerning the Operations of the Understanding

Sceptical Solution of These Doubts

Of Probability


Of the Idea of Necessary Connexion

Of Liberty and Necessity

Of the Reason of Animals

Of Miracles

Of A Particular Providence and of A Future State

Of the Academical Or Sceptical Philosophy

Slide7

A Second Problem of Induction

A black crow seems to support the hypothesis "all crows are black".
A pink highlighter supports the hypothesis "all non-black things are non-crows".
But the two hypotheses are logically equivalent: thus, a pink highlighter supports the hypothesis "all crows are black".

Slide8

A Third Problem of Induction

You have much less than 200 years to figure it all out.

Slide9

Probability Theory

Events: discrete random variables, Boolean random variables, compound events
Axioms of probability: what defines a reasonable theory of uncertainty
Compound events
Independent events
Conditional probabilities
Bayes rule and beliefs
Joint probability distribution

Slide10

Discrete Random Variables

A is a Boolean-valued random variable if A denotes an event, and there is uncertainty as to whether A occurs.

Examples:
A = the US president in 2023 will be male
A = you wake up tomorrow with a headache
A = you have Ebola
A = the 1,000,000,000,000th digit of π is 7

Define P(A) as "the fraction of possible worlds in which A is true". We're assuming all possible worlds are equally probable.

Slide11

Discrete Random Variables

A is a Boolean-valued random variable if A denotes an event, and there is uncertainty as to whether A occurs.

Define P(A) as "the fraction of experiments in which A is true". We're assuming all possible outcomes are equiprobable.

Examples:
You roll two 6-sided dice (the experiment) and get doubles (A = doubles, the outcome).
I pick two students in the class (the experiment) and they have the same birthday (A = same birthday, the outcome).

A is a possible outcome of an "experiment"; the experiment is not deterministic.

Slide12

Visualizing A

Event space of all possible worlds. Its area is 1.
Worlds in which A is false; worlds in which A is true.
P(A) = area of the reddish oval.

Slide13

The Axioms of Probability

0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

Events, random variables, …, probabilities
"Dice"
"Experiments"
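Not from the slides: the axioms can be sanity-checked by brute-force enumeration of a small dice "experiment". A minimal Python sketch; the events A and B below are made up for illustration:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 rolls of two fair d6

def prob(event):
    # "the fraction of equiprobable outcomes in which the event is true"
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] % 2 == 0      # first die is even
B = lambda o: o[0] + o[1] > 7    # sum exceeds 7

assert prob(lambda o: True) == 1     # P(True) = 1
assert prob(lambda o: False) == 0    # P(False) = 0
assert prob(lambda o: A(o) or B(o)) == prob(A) + prob(B) - prob(lambda o: A(o) and B(o))
```

Slide14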

The Axioms Of Probability

(This is Andrew's joke)

Slide15
Slide16

These Axioms are Not to be Trifled With

There have been many, many other approaches to understanding "uncertainty": fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, …
25 years ago people in AI argued about these; now they mostly don't.
Any scheme for combining uncertain information, uncertain "beliefs", etc. really should obey these axioms.
If you gamble based on "uncertain beliefs", then [you can be exploited by an opponent] if and only if [your uncertainty formalism violates the axioms] - de Finetti 1931 (the "Dutch book argument")

Slide17

Interpreting the axioms

0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

The area of A can't get any smaller than 0, and a zero area would mean no world could ever have A true.

Slide18

Interpreting the axioms

0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

The area of A can't get any bigger than 1, and an area of 1 would mean all worlds will have A true.

Slide19

Interpreting the axioms

0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

[Venn diagram of overlapping events A and B]

Slide20

Interpreting the axioms

0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

[Venn diagram: P(A or B) is the total area covered by A and B; P(A and B) is the overlap]

Simple addition and subtraction.

Slide21

Theorems from the Axioms

0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

P(not A) = P(~A) = 1 - P(A)

Proof: P(A or ~A) = 1 and P(A and ~A) = 0, so
P(A or ~A) = P(A) + P(~A) - P(A and ~A)
1 = P(A) + P(~A) - 0

Slide22

Elementary Probability in Pictures

P(~A) + P(A) = 1

[Venn diagram: the event space split into A and ~A]

Slide23

Side Note

I am inflicting these proofs on you for two reasons:
1. These kinds of manipulations will need to be second nature to you if you use probabilistic analytics in depth.
2. Suffering is good for you.

(This is also Andrew's joke)

Slide24

Another important theorem

0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

P(A) = P(A ^ B) + P(A ^ ~B)

Proof: A = A and (B or ~B) = (A and B) or (A and ~B), so
P(A) = P(A and B) + P(A and ~B) - P((A and B) and (A and ~B))
     = P(A and B) + P(A and ~B) - P(A and A and B and ~B)
     = P(A and B) + P(A and ~B) - P(False) = P(A and B) + P(A and ~B)

Slide25

Elementary Probability in Pictures

P(A) = P(A ^ B) + P(A ^ ~B)

[Venn diagram: A split by B into the pieces A ^ B and A ^ ~B]

Slide26

The LAWS of Probability

Laws of probability:
Axioms …
Monty Hall Problem proviso

Slide27

The Monty Hall Problem

You're in a game show. Behind one door is a prize. Behind the others, goats.
You pick one of three doors, say #1.
The host, Monty Hall, opens one of the other doors, say #3, revealing… a goat!
You now can either:
stick with your guess
always change doors
flip a coin and pick a new door randomly according to the coin

Slide28

The Monty Hall Problem

Case 1: you don't swap. W = you win.
Pre-goat: P(W) = 1/3. Post-goat: P(W) = 1/3.

Case 2: you swap. W1 = you picked the cash initially, W2 = you win.
Pre-goat: P(W1) = 1/3. Post-goat: W2 = ~W1, so P(W2) = 1 - P(W1) = 2/3.

Moral: ?
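Not from the slides: both cases are easy to check by simulation. A minimal Monte Carlo sketch in Python, assuming the standard rules (Monty always opens a goat door you didn't pick):

```python
import random

def trial(swap: bool) -> bool:
    prize = random.randrange(3)
    pick = random.randrange(3)
    # Monty opens a goat door; with three doors, swapping wins
    # exactly when the first pick was wrong.
    return (pick != prize) if swap else (pick == prize)

n = 100_000
print("stick:", sum(trial(False) for _ in range(n)) / n)  # ~1/3
print("swap: ", sum(trial(True)  for _ in range(n)) / n)  # ~2/3
```

Slide29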

The Extreme Monty Hall/Survivor Problem

You're in a game show. There are 10,000 doors. Only one of them has a prize.
You pick a door.
Over the remaining 13 weeks, the host eliminates 9,998 of the remaining doors.
For the season finale: do you switch, or not?

Slide30

Some practical problems

You're the DM in a D&D game.
Joe brings his own d20 and throws 4 critical hits in a row to start off.
DM = dungeon master; d20 = 20-sided die; "critical hit" = roll of 19 or 20.
Is Joe cheating?
What is P(A), A = four critical hits? A is a compound event: A = C1 and C2 and C3 and C4.

Slide31

Independent Events

Definition: two events A and B are independent if Pr(A and B) = Pr(A)*Pr(B).
Intuition: the outcome of A has no effect on the outcome of B (and vice versa).
We need to assume the different rolls are independent to solve the problem.
You frequently need to assume the independence of something to solve any learning problem.

Slide32

Some practical problems

You're the DM in a D&D game.
Joe brings his own d20 and throws 4 critical hits in a row to start off.
DM = dungeon master; d20 = 20-sided die; "critical hit" = roll of 19 or 20.
What are the odds of that happening with a fair die?
Ci = critical hit on trial i, i = 1, 2, 3, 4.
P(C1 and C2 and C3 and C4) = P(C1)*…*P(C4) = (1/10)^4

Followup: D = pick an ace or king out of a deck three times in a row: D = D1 ^ D2 ^ D3
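A one-line check of the independence computation, plus a sketch of the followup; the followup assumes draws without replacement, so the events are dependent and the chain rule applies instead:

```python
from fractions import Fraction

p_crit = Fraction(2, 20)   # 19 or 20 on a fair d20
print(p_crit ** 4)         # independent rolls multiply: (1/10)^4 = 1/10000

# Followup: ace-or-king three times, drawing without replacement.
# The draws are NOT independent; chain the conditionals instead:
p_D = Fraction(8, 52) * Fraction(7, 51) * Fraction(6, 50)
print(p_D)                 # P(D1) * P(D2|D1) * P(D3|D1 ^ D2)
```

Slide33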

Some practical problems

The specs for the loaded d20 say that it has 20 outcomes, X, where:
P(X=20) = 0.25
P(X=19) = 0.25
for i = 1, …, 18, P(X=i) = Z * 1/18

What is Z?

Slide34

Multivalued Discrete Random Variables

Suppose A can take on more than 2 values.
A is a random variable with arity k if it can take on exactly one value out of {v1, v2, …, vk}.
Example: V = {aaliyah, aardvark, …, zymurge, zynga}
Example: V = {aaliyah_aardvark, …, zynga_zymgurgy}
Thus…

Slide35

Terms: Binomials and Multinomials

Suppose A can take on more than 2 values.
A is a random variable with arity k if it can take on exactly one value out of {v1, v2, …, vk}.
Example: V = {aaliyah, aardvark, …, zymurge, zynga}
Example: V = {aaliyah_aardvark, …, zynga_zymgurgy}

The distribution Pr(A) is a multinomial; for k=2 the distribution is a binomial.

Slide36

More about Multivalued Random Variables

Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

And assuming that A obeys:
P(A=vi and A=vj) = 0 if i ≠ j
P(A=v1 or A=v2 or … or A=vk) = 1

It's easy to prove that:
P(A=v1 or A=v2 or … or A=vi) = P(A=v1) + P(A=v2) + … + P(A=vi)

Slide37

More about Multivalued Random Variables

Using the axioms of probability and assuming that A obeys the conditions above, it's easy to prove that
P(A=v1 or A=v2 or … or A=vi) = P(A=v1) + … + P(A=vi)
and thus we can prove
P(A=v1) + P(A=v2) + … + P(A=vk) = 1

Slide38

Elementary Probability in Pictures

[Venn diagram: the event space partitioned into A=1, A=2, A=3, A=4, A=5]

Slide39

Elementary Probability in Pictures

[Venn diagram: the event space partitioned into A=aardvark, A=aaliyah, …, A=zynga, …]

Slide40

Some practical problems

The specs for the loaded d20 say that it has 20 outcomes, X:
P(X=20) = P(X=19) = 0.25
for i = 1, …, 18, P(X=i) = z
… and what is z?

Slide41

Some practical problems

You (probably) have 8 neighbors and 5 close neighbors.
What is Pr(A), A = one or more of your neighbors has the same sign as you? What's the experiment?
What is Pr(B), B = you and your close neighbors all have different signs? What about all neighbors?

[seating diagram: you (*) surrounded by close neighbors (c) and corner neighbors (n)]
n c n
c * c
n c n

Moral: ?

Slide42

Some practical problems

I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?

P(X=20) = P(X=19) = 0.25
for i = 1, …, 18, P(X=i) = 0.5 * 1/18

Slide43

Some practical problems

I have 3 standard d20 dice, 1 loaded die.

Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = the d20 picked is fair and B = roll 19 or 20 with that die. What is P(B)?

P(B) = P(B and A) + P(B and ~A)
     = 0.1*0.75 + 0.5*0.25 = 0.2

using Andrew's "important theorem": P(A) = P(A ^ B) + P(A ^ ~B)

Slide44

Elementary Probability in Pictures

P(A) = P(A ^ B) + P(A ^ ~B)

[Venn diagram: A split by B into A ^ B and A ^ ~B]

Followup: what if I change the ratio of fair to loaded dice in the experiment?

Slide45

Some practical problems

I have lots of standard d20 dice, lots of loaded dice, all identical.

Experiment is the same: (1) pick a d20 uniformly at random, then (2) roll it. Can I mix the dice together so that P(B) = 0.137?

P(B) = P(B and A) + P(B and ~A)
     = 0.1*λ + 0.5*(1 - λ) = 0.137
λ = (0.5 - 0.137)/0.4 = 0.9075

a "mixture model"
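A two-line check of the algebra (not from the slides): solve for the fraction λ of fair dice and confirm it reproduces the target P(B):

```python
lam = (0.5 - 0.137) / (0.5 - 0.1)    # from 0.1*lam + 0.5*(1 - lam) = 0.137
print(lam)                           # 0.9075
print(0.1 * lam + 0.5 * (1 - lam))   # 0.137, as required
```

Slide46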

Another picture for this problem

[diagram: the space split into A (fair die) and ~A (loaded), with the regions A and B, ~A and B]

It's more convenient to say:
"if you've picked a fair die then …", i.e. Pr(critical hit | fair die) = 0.1
"if you've picked the loaded die then …", i.e. Pr(critical hit | loaded die) = 0.5

Conditional probability: Pr(B|A) = P(B ^ A) / P(A)

Slide47

Definition of Conditional Probability

P(A|B) = P(A ^ B) / P(B)

Corollary (the Chain Rule): P(A ^ B) = P(A|B) P(B)

Slide48

Some practical problems

I have 3 standard d20 dice, 1 loaded die.

Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = the d20 picked is fair and B = roll 19 or 20 with that die. What is P(B)?

P(B) = P(B|A) P(A) + P(B|~A) P(~A)
     = 0.1*0.75 + 0.5*0.25 = 0.2

"marginalizing out" A

Slide49

[diagram: A (fair die) vs ~A (loaded), with P(A), P(~A), P(B|A), P(B|~A) labeling the regions]

P(B) = P(B|A)P(A) + P(B|~A)P(~A)

Slide50

Some practical problems

I have 3 standard d20 dice, 1 loaded die.

Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = the d20 picked is fair and B = roll 19 or 20 with that die.

Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair? i.e., what is P(A|B)?

Slide51

[diagram: A (fair die) vs ~A (loaded), with the regions A and B, ~A and B, and labels P(A), P(~A), P(B|A), P(B|~A)]

P(A and B) = P(B|A) * P(A)
P(A and B) = P(A|B) * P(B)
so P(A|B) * P(B) = P(B|A) * P(A), i.e.

P(A|B) = P(B|A) * P(A) / P(B)

P(A|B) = ?

Slide52

Slide53

Bayes' rule:
P(A|B) = P(B|A) * P(A) / P(B)
P(B|A) = P(A|B) * P(B) / P(A)

P(A) is the prior; P(A|B) is the posterior.

Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

"…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…"
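Answering the "P(A|B) = ?" left open above, a short check (not from the slides) for the 3-fair-1-loaded experiment:

```python
p_A = 0.75              # prior: 3 of the 4 dice are fair
p_B_given_A = 0.1       # crit (19 or 20) with a fair d20
p_B_given_notA = 0.5    # crit with the loaded d20

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # 0.2, as before
print(p_B_given_A * p_A / p_B)   # P(A|B) = 0.375: one crit shifts belief toward "loaded"
```

Slide54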

More General Forms of Bayes Rule

Slide55

More General Forms of Bayes Rule

Slide56

Useful Easy-to-prove facts

Slide57

More about Bayes rule

"An Intuitive Explanation of Bayesian Reasoning: Bayes' Theorem for the curious and bewildered; an excruciatingly gentle introduction" - Eliezer Yudkowsky

Problem: Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the bin contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl?
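Working the egg problem with Bayes' rule (a sketch, not part of the original deck):

```python
p_pearl = 0.4
p_blue_given_pearl = 0.3
p_blue_given_empty = 0.1

p_blue = p_blue_given_pearl * p_pearl + p_blue_given_empty * (1 - p_pearl)  # 0.18
print(p_blue_given_pearl * p_pearl / p_blue)  # P(pearl | blue) = 0.12/0.18 = 2/3
```

Slide58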

Some practical problems

Joe throws 4 critical hits in a row; is Joe cheating?

A = Joe is using a cheater's die
C = roll of 19 or 20; P(C|A) = 0.5, P(C|~A) = 0.1
B = C1 and C2 and C3 and C4
Pr(B|A) = 0.5^4 = 0.0625; P(B|~A) = 0.1^4 = 0.0001

Slide59

What’s the experiment and outcome here?

Outcome A: Joe is cheating.

Possible experiments:
1. Joe picked a die uniformly at random from a bag containing 10,000 fair dice and one bad one.
2. Joe is a D&D player picked uniformly at random from a set of 1,000,000 people, and n of them cheat with probability p > 0.
3. I have no idea, but I don't like his looks. Call it P(A) = 0.1.

Slide60

Remember: Don’t Mess with The Axioms

A subjective belief can be treated, mathematically, like a probability. Use those axioms!

There have been many, many other approaches to understanding "uncertainty": fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, …
25 years ago people in AI argued about these; now they mostly don't.
Any scheme for combining uncertain information, uncertain "beliefs", etc. really should obey these axioms.
If you gamble based on "uncertain beliefs", then [you can be exploited by an opponent] if and only if [your uncertainty formalism violates the axioms] - de Finetti 1931 (the "Dutch book argument")

Slide61

Some practical problems

Joe throws 4 critical hits in a row; is Joe cheating?

A = Joe is using a cheater's die
C = roll of 19 or 20; P(C|A) = 0.5, P(C|~A) = 0.1
B = C1 and C2 and C3 and C4
Pr(B|A) = 0.0625; P(B|~A) = 0.0001

Moral: with enough evidence the prior P(A) doesn't really matter.
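A quick illustration of the moral (not from the slides): the posterior P(A|B) under the three candidate priors from the earlier slide.

```python
p_B_given_A = 0.5 ** 4      # 0.0625
p_B_given_notA = 0.1 ** 4   # 0.0001

for prior in (0.1, 0.5, 1 / 10001):   # "don't like his looks", agnostic, 1-in-10001 bag
    p_B = p_B_given_A * prior + p_B_given_notA * (1 - prior)
    print(prior, p_B_given_A * prior / p_B)
```

With priors 0.1 and 0.5 the posterior is already above 0.98; only an extreme prior like 1/10001 keeps it low.

Slide62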

Some practical problems

I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?

1. Collect some data (20 rolls).
2. Estimate Pr(i) = C(rolls of i) / C(any roll).
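A sketch of this estimate in Python; the 20 rolls below are invented, chosen to match the numbers on the next slide:

```python
from collections import Counter

rolls = [19, 20, 4, 19, 7, 20, 19, 5, 12, 19,
         20, 9, 4, 17, 19, 6, 20, 14, 10, 18]   # hypothetical data
counts = Counter(rolls)
mle = {i: counts[i] / len(rolls) for i in range(1, 21)}  # C(rolls of i)/C(any roll)
print(mle[19], mle[20], mle[1])   # 0.25, 0.2, and 0.0: unseen faces get probability 0
```

Slide63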

One solution

I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?

P(1)=0, P(2)=0, P(3)=0, P(4)=0.1, …, P(19)=0.25, P(20)=0.2

MLE = maximum likelihood estimate

But: Do I really think it's impossible to roll a 1, 2 or 3? Would you bet your house on it?

Slide64

A better solution

I bought a loaded d20 on eBay… but it didn't come with any specs. How can I find out how it behaves?

0. Imagine some data (20 rolls, each i shows up 1x).
1. Collect some data (20 rolls).
2. Estimate Pr(i) = C(rolls of i) / C(any roll).

Slide65

A better solution

I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?

P(1

)=1/40

P(2

)=1/40P(3

)=1/40

P(4

)=(2+1)/40

P(19

)=(5+1)/40

P(20

)=(4+1)/40=1/8

0.25

vs.

0.125 – really different! Maybe I should “imagine” less data?Slide66

A better solution?

P(1)=1/40, P(2)=1/40, P(3)=1/40, P(4)=(2+1)/40, …, P(19)=(5+1)/40, P(20)=(4+1)/40=1/8

0.25 vs. 0.125 - really different! Maybe I should "imagine" less data?

Slide67

A better solution?

Q: What if I used m rolls with a probability of q = 1/20 of rolling any i?

Pr(i) = (C(rolls of i) + m*q) / (C(any roll) + m)

I can use this formula with m > 20, or even with m < 20 … say with m = 1.

Slide68

A better solution

Q: What if I used m rolls with a probability of q = 1/20 of rolling any i?

Pr(i) = (C(rolls of i) + m*q) / (C(any roll) + m)

If m >> C(ANY), then your imagined data q rules.
If m << C(ANY), then your real data rules, BUT you never ever ever end up with Pr(i) = 0.
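A sketch of the smoothed estimator, assuming the (C(i) + m*q)/(C(ANY) + m) form reconstructed above; with m=20, q=1/20 it reproduces the "+1/40" numbers, and with m=1 the data dominates but no face ever gets probability 0:

```python
from collections import Counter

rolls = [19, 20, 4, 19, 7, 20, 19, 5, 12, 19,
         20, 9, 4, 17, 19, 6, 20, 14, 10, 18]   # same hypothetical data
counts = Counter(rolls)

def smoothed(m, q=1/20):
    total = len(rolls)                  # C(ANY)
    return {i: (counts[i] + m * q) / (total + m) for i in range(1, 21)}

print(smoothed(20)[20])   # (4+1)/40 = 0.125
print(smoothed(1)[20])    # (4+0.05)/21 ≈ 0.193, close to the MLE 0.2
print(smoothed(1)[1])     # 0.05/21 > 0: never exactly zero
```

Slide69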

Terminology

This is called a uniform Dirichlet prior.
C(i), C(ANY) are sufficient statistics.

It's equivalent to doing this:
p = some probability distribution over 1…20
Pr(P=p | prior) = f(…)
Pr(P=p | prior, data) = g(…)
Pr(I=i | prior, data) = …

MLE = maximum likelihood estimate
MAP = maximum a posteriori estimate

f and g are the same function with different arguments → f is a conjugate prior.

Slide70

Terminology – more later

This is called a uniform Dirichlet prior.
C(i), C(ANY) are sufficient statistics.

MLE = maximum likelihood estimate
MAP = maximum a posteriori estimate

Slide71

Conjugate Priors and the Dirichlet

[discuss]

Slide72

Some practical problems

I have 1 standard d6 die, 2 loaded d6 dice.
Loaded high: P(X=6) = 0.50. Loaded low: P(X=1) = 0.50.

Experiment: pick one d6 uniformly at random (A) and roll it. What is more likely - rolling a seven or rolling doubles?

Three combinations: HL, HF, FL

P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)
     = P(D|A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(D|A=FL)*P(A=FL)

Slide73

Elementary Probability in Pictures

[diagram: the space partitioned into A=HL, A=HF, A=FL, with regions B ^ A=HL, B ^ A=HF, B ^ A=FL]

Slide74

Elementary Probability in Pictures

[diagram: the space partitioned into A=1, …, A=5, with regions B ^ A=1, …, B ^ A=5]

Slide75

Elementary Probability in Pictures

[diagram: the partition A=1, …, A=5, with the A=1 slice scaled by P(A=1)]

Think of multiplying by P(A=1) as "squeezing" an area by some fraction.

Slide76

Elementary Probability in Pictures

[diagram: the A=1 slice scaled by P(A=1), then by P(B|A=1), giving the area P(A and B)]

Slide77

Some practical problems

I have 1 standard d6 die, 2 loaded d6 dice.
Loaded high: P(X=6) = 0.50. Loaded low: P(X=1) = 0.50.

Experiment: pick one d6 uniformly at random (A) and roll it. What is more likely - rolling a seven or rolling doubles?

Three combinations: HL, HF, FL

P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)
     = P(D|A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(D|A=FL)*P(A=FL)

Slide78

Some practical problems

I have 1 standard d6 die, 2 loaded d6 dice.
Loaded high: P(X=6) = 0.50. Loaded low: P(X=1) = 0.50.

Experiment: pick one d6 uniformly at random (A) and roll it. What is more likely - rolling a seven or rolling doubles?

Three combinations: HL, HF, FL

Outcomes by (Roll 1, Roll 2); D = doubles, 7 = sum is seven:

         Roll 2:  1  2  3  4  5  6
Roll 1 = 1        D  .  .  .  .  7
Roll 1 = 2        .  D  .  .  7  .
Roll 1 = 3        .  .  D  7  .  .
Roll 1 = 4        .  .  7  D  .  .
Roll 1 = 5        .  7  .  .  D  .
Roll 1 = 6        7  .  .  .  .  D

Slide79

A brute-force solution

A partial table of the joint outcomes:

A    Roll 1   Roll 2   P                   Comment
FL   1        1        1/3 * 1/6 * 1/2     doubles
FL   1        2        1/3 * 1/6 * 1/10
FL   1        6        …                   seven
FL   2        1        …
FL   2        …        …
FL   6        6        …                   doubles
HL   1        1        …                   doubles
HL   1        2        …
HF   1        1        …                   doubles
…

A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk. With this you can compute any P(A) where A is any Boolean combination of the primitive events (Xi=xi), e.g.
P(doubles)
P(seven or eleven)
P(total is higher than 5)
…

Slide80

The Joint Distribution

Recipe for making a joint distribution of M variables.

Example: Boolean variables A, B, C

Slide81

The Joint Distribution

Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).

Example: Boolean variables A, B, C

A  B  C
0  0  0
0  0  1
0  1  0
0  1  1
1  0  0
1  0  1
1  1  0
1  1  1

Slide82

The Joint Distribution

Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.

Example: Boolean variables A, B, C

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

Slide83

The Joint Distribution

Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: Boolean variables A, B, C

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

[Venn diagram of A, B, C with the eight probabilities above as region areas]

Slide84

Using the Joint

Once you have the JD you can ask for the probability of any logical expression involving your attributes.

Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset. [Kohavi, 1996]
Number of instances: 48,842. Number of attributes: 14 (in UCI's copy of the dataset); 3 (here).

Slide85

Using the Joint

P(Poor Male) = 0.4654

Slide86

Using the Joint

P(Poor) = 0.7604

Slide87

Inference with the Joint

Slide88

Inference with the Joint

P(Male | Poor) = P(Male and Poor) / P(Poor) = 0.4654 / 0.7604 = 0.612
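The same kind of query, run against the small A/B/C joint from the earlier slides (a sketch; the census table itself isn't reproduced in this transcript):

```python
joint = {  # (A, B, C) -> prob, the 8-row table above
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def P(pred):
    # probability of any logical expression: sum the rows where it holds
    return sum(p for row, p in joint.items() if pred(*row))

p_A = P(lambda a, b, c: a == 1)        # 0.50
p_AB = P(lambda a, b, c: a and b)      # 0.35
print(p_AB / p_A)                      # P(B|A) = 0.70
```

Slide89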

Estimating the joint distribution

Collect some data points.
Estimate the probability P(E1=e1 ^ … ^ En=en) as #(that row appears) / #(any row appears).

Gender  Hours  Wealth
g1      h1     w1
g2      h2     w2
…
gN      hN     wN

Slide90

Inference is a big deal

I've got this evidence. What's the chance that this conclusion is true?
I've got a sore neck: how likely am I to have meningitis?
I see my lights are out and it's 9pm. What's the chance my spouse is already asleep?

Slide91

Estimating the joint distribution

For each combination of values r:
  Total = C[r] = 0
For each data row ri:
  C[ri]++; Total++
Estimate: Pr(ri) ≈ C[ri] / Total
(e.g., ri is "female, 40.5+ hours, poor")

Gender  Hours  Wealth
g1      h1     w1
g2      h2     w2
…
gN      hN     wN

Complexity? O(n), n = total size of the input data.
Complexity? O(2^d) space, d = #attributes (all binary).

Slide92

Estimating the joint distribution

For each combination of values r:
  Total = C[r] = 0
For each data row ri:
  C[ri]++; Total++

Gender  Hours  Wealth
g1      h1     w1
g2      h2     w2
…
gN      hN     wN

Complexity? O(n), n = total size of the input data.
Complexity? O(∏i ki) space, ki = arity of attribute i.

Slide93

Estimating the joint distribution

Gender  Hours  Wealth
g1      h1     w1
g2      h2     w2
…
gN      hN     wN

For each combination of values r:
  Total = C[r] = 0
For each data row ri:
  C[ri]++; Total++

Complexity? O(n), n = total size of the input data.
Complexity? O(∏i ki) space, ki = arity of attribute i.

Slide94

Estimating the joint distribution

For each data row ri:
  If ri is not in hash tables C, Total: insert C[ri] = 0
  C[ri]++; Total++

Gender  Hours  Wealth
g1      h1     w1
g2      h2     w2
…
gN      hN     wN

Complexity? O(n), n = total size of the input data.
Complexity? O(m) space, m = size of the model (number of distinct rows actually seen).
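A minimal sketch of the hash-table version, with made-up rows:

```python
from collections import Counter

rows = [("F", "40.5+", "poor"), ("M", "<40.5", "rich"),
        ("F", "40.5+", "poor"), ("M", "40.5+", "poor")]   # hypothetical data

C = Counter(rows)    # one cell per combination actually seen: O(m) space
total = len(rows)    # one pass over the data: O(n) time

estimates = {r: c / total for r, c in C.items()}
print(estimates[("F", "40.5+", "poor")])   # 0.5
```

Slide95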

Another example…

Slide96

Big ML c. 2001 (Banko & Brill, "Scaling to Very Very Large…", ACL 2001)

Task: distinguish pairs of easily-confused words ("affect" vs "effect") in context.

Slide97

Big ML c. 2001 (Banko & Brill, "Scaling to Very Very Large…", ACL 2001)

Slide98

AN EXAMPLE OF THE JOINT

A     B       C       D    E          p
is    the     effect  of   the        0.00036
is    the     effect  of   a          0.00034
.     The     effect  of   this       0.00034
to    this    effect  :               0.00034
be    the     effect  of   the        …
…     …       …       …    …          …
not   the     effect  of   any        0.00024
…     …       …       …    …          …
does  not     affect  the  general    0.00020
does  not     affect  the  question   0.00020
any   manner  affect  the  principle  0.00018

Slide99

The Joint Distribution for a "Big Data" task…

Slide100

Big ML c. 2001 (Banko & Brill, "Scaling to Very Very Large…", ACL 2001)

Task: distinguish pairs of easily-confused words ("affect" vs "effect") in context.

Slide101

Big ML c. 2001 (Banko & Brill, "Scaling to Very Very Large…", ACL 2001)

Slide102

Some of the Joint Distribution

(the same 5-gram table shown above)

Slide103

An experiment

Starting point: Google Books 5-gram data.
All 5-grams that appear >= 40 times in a corpus of 1M English books, approx 80B words.
5-grams: 30Gb compressed, 250-300Gb uncompressed.
Each 5-gram contains a frequency distribution over years.
Extract all 5-grams from books published before 2000 that contain 'effect' or 'affect' in the middle position:
about 20 "disk hours"
approx 100M occurrences
approx 50k distinct n-grams --- not big
Wrote code to compute Pr(A,B,C,D,E | C=affect or C=effect) and Pr(any subset of A,…,E | any other subset, C=affect V effect).

Slide104

Another experiment

Extracted all affect/effect 5-grams from the old (small) Reuters corpus:
about 20k documents
about 723 n-grams, 661 distinct
Financial news, not novels or textbooks.
Tried to predict the center word with:
Pr(C | A=a, B=b, D=d, E=e)
then P(C | A, B, D, C=effect V affect)
then P(C | B, D, C=effect V affect)
then P(C | B, C=effect V affect)
then P(C, C=effect V affect)
The backoff scheme is sketched below.
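A minimal sketch of that backoff idea (not the authors' code; the 5-gram counts below are invented):

```python
from collections import Counter, defaultdict

def train(five_grams):
    # five_grams: {(a, b, c, d, e): count}; c is 'affect' or 'effect'.
    # Index the middle word under progressively simpler contexts.
    tables = defaultdict(Counter)
    for (a, b, c, d, e), n in five_grams.items():
        for key in ((a, b, d, e), (b, d), (b,), ()):
            tables[key][c] += n
    return dict(tables)

def predict(tables, a, b, d, e):
    # Back off to the most specific context seen in training.
    for key in ((a, b, d, e), (b, d), (b,), ()):
        if key in tables:
            return tables[key].most_common(1)[0][0]

five_grams = {("the", "cumulative", "effect", "of", "the"): 5,
              ("does", "not", "affect", "the", "question"): 3}
tables = train(five_grams)
print(predict(tables, "the", "cumulative", "of", "the"))       # 'effect'
print(predict(tables, "would", "not", "finance", "minister"))  # backs off to ('not',): 'affect'
```

Slide105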

EXAMPLES

"The cumulative _ of the" → effect (1.0)
"Go into _ on January" → effect (1.0)
"From cumulative _ of accounting": not present
Nor is "From cumulative _ of _"
But "_ cumulative _ of _" → effect (1.0)
"Would not _ Finance Minister": not present
But "_ not _ _ _" → affect (0.9625)

Slide106

Performance summary

Pattern         Used  Errors
P(C|A,B,D,E)    101   1
P(C|A,B,D)      157   6
P(C|B,D)        163   13
P(C|B)          244   78
P(C)            58    31

Slide107

Is this good performance?