Slide 1: Announcements
CS Ice Cream Social
9/5, 3:30-4:30, ECCR 265
Includes poster session and student group presentations
Slide 2: Concept Learning
Examples:
Word meanings
Edible foods
Abstract structures (e.g., irony)
[Figure: example objects labeled "glorch", "glorch", "not glorch", "not glorch"]
Slide 3: Supervised Approach to Concept Learning
Both positive and negative examples provided
Typical models (both in ML and Cog Sci) circa 2000 required both positive and negative examples
Slide 4: Contrast With Human Learning Abilities
Learning from positive examples only
Learning from a small number of examples
E.g., word meanings
E.g., learning appropriate social behavior
E.g., instruction on some skill
What would it mean to learn from a small number of positive examples?
[Figure: three positive examples marked "+"]
Slide 5: Tenenbaum (1999)
Two-dimensional continuous feature space
Concepts defined by axis-parallel rectangles
E.g., feature dimensions: cholesterol level, insulin level
E.g., concept: healthy
Slide 6: Learning Problem
Given a set of n examples, X = {x_1, x_2, x_3, ..., x_n}, which are instances of the concept...
Will some unknown example y also be an instance of the concept?
This is the problem of generalization.
[Figure: three positive examples in the feature space]
Slide 7: Hypothesis (Model) Space
H: all rectangles on the plane, parameterized by (l_1, l_2, s_1, s_2)
h: one particular hypothesis
Note: |H| = ∞
Consider all hypotheses in parallel, in contrast to the non-Bayesian approach of maintaining only the best hypothesis at any point in time.
Slide 8: Prediction Via Model Averaging
Will some unknown input y be in the concept, given examples X = {x_1, x_2, x_3, ..., x_n}?
Q: y is a positive example of the concept (T, F)
Marginalization: P(Q | X) = ∫ p(Q, h | X) dh
Chain rule: p(Q, h | X) = P(Q | h, X) p(h | X)
Conditional independence and deterministic concepts: P(Q | h, X) = P(Q | h) = 1 if y is in h, 0 otherwise
Bayes rule: p(h | X) ∝ P(X | h) p(h) (likelihood × prior)
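Chaining these four steps gives a single expression for the predictive probability; this consolidated form is implied by the slide's derivation but not written out there:

$$P(Q \mid X) = \frac{\int_{h \,\supseteq\, X \cup \{y\}} P(X \mid h)\, p(h)\, dh}{\int_{h \,\supseteq\, X} P(X \mid h)\, p(h)\, dh}$$

i.e., the fraction of posterior mass belonging to hypotheses that contain the query point y in addition to the observed examples.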
Slide 9: Priors and Likelihood Functions
Priors, p(h):
Location invariant
Uninformative prior (prior depends only on area of rectangle)
Expected-size prior
Likelihood function, p(X | h):
X = set of n examples
Size principle
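A runnable sketch combining Slides 8-9: grid-approximate the integral over rectangle hypotheses, score each with the size-principle likelihood p(X | h) = |h|^(-n) and an uninformative area-based prior, then average to predict whether a query point is in the concept. The example data, grid resolution, and exact prior form are illustrative assumptions, not values from the lecture (Tenenbaum's paper uses a related Erlang form for the expected-size prior):

```python
import itertools
import numpy as np

# Positive examples of the concept (e.g., cholesterol level, insulin level).
X = np.array([[0.40, 0.50], [0.45, 0.55], [0.50, 0.45]])
y_query = np.array([0.47, 0.52])   # unknown point to generalize to

# Hypothesis space: axis-parallel rectangles with lower corner (l1, l2) and
# side lengths (s1, s2), discretized on a coarse grid to approximate the
# integral over h from Slide 8.
grid = np.linspace(0.0, 1.0, 21)
num = den = 0.0
for l1, l2, s1, s2 in itertools.product(grid, grid, grid, grid):
    if s1 == 0.0 or s2 == 0.0:
        continue
    def inside(p):
        return l1 <= p[0] <= l1 + s1 and l2 <= p[1] <= l2 + s2
    if not all(inside(x) for x in X):
        continue                        # p(X | h) = 0: h excludes an example
    area = s1 * s2
    prior = 1.0 / area                  # uninformative, area-based prior (assumed form)
    lik = (1.0 / area) ** len(X)        # size principle: p(X | h) = |h|^(-n)
    den += prior * lik
    num += prior * lik * inside(y_query)

print("P(y in concept | X) ~", num / den)   # model-averaged prediction
```

Because small consistent rectangles get both higher prior and much higher likelihood, the predicted concept hugs the observed examples, and it tightens as n grows.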
Slide 11: Generalization Gradients
MIN: smallest hypothesis consistent with the data
Weak Bayes: instead of using the size principle, assumes examples are produced by a process independent of the true class
Dark line = 50% prob.
Slide 12: Experimental Design
Subjects shown n dots on screen that are "randomly chosen examples from some rectangle of healthy levels"
n drawn from {2, 3, 4, 6, 10, 50}
Dots varied in horizontal and vertical range r, drawn from {.25, .5, 1, 2, 4, 8} units in a 24-unit window
Task: draw the 'true' rectangle around the dots
Slide 13: Experimental Results
Slide 14: Number Game
Experimenter picks an integer arithmetic concept C
E.g., prime numbers
E.g., numbers between 10 and 20
E.g., multiples of 5
Experimenter presents positive examples drawn at random from C, say, in range [1, 100]
Participant asked whether some new test case belongs in C
Slide 15: Empirical Predictive Distributions
Slide 16: Hypothesis Space
Even numbers
Odd numbers
Squares
Multiples of n
Ends in n
Powers of n
All numbers
Intervals [n, m] for n > 0, m < 101
Powers of 2, plus 37
Powers of 2, except for 32
Slide 17: Observation = 16
Likelihood function: size principle
Prior: intuition
Slide 18: Observations = 16, 8, 2, 64
Likelihood function: size principle
Prior: intuition
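A minimal sketch of the number-game computation, illustrating how the size principle increasingly favors "powers of 2" over broader hypotheses like "even numbers" as more examples arrive. The particular hypotheses included and the uniform prior are illustrative assumptions, not the values from the lecture:

```python
# Number game: posterior over hypotheses given positive examples in [1, 100].
# Likelihood uses the size principle: p(X | h) = (1 / |h|)^n if X is a subset of h.

hypotheses = {
    "even numbers":   {n for n in range(1, 101) if n % 2 == 0},
    "odd numbers":    {n for n in range(1, 101) if n % 2 == 1},
    "squares":        {n * n for n in range(1, 11)},
    "multiples of 4": {n for n in range(4, 101, 4)},
    "powers of 2":    {2 ** k for k in range(1, 7)},   # 2 .. 64
}

# Illustrative prior: uniform over this small hypothesis space.
prior = {name: 1.0 / len(hypotheses) for name in hypotheses}

def posterior(examples):
    """Posterior p(h | X) via size-principle likelihood and Bayes rule."""
    scores = {}
    for name, h in hypotheses.items():
        if all(x in h for x in examples):
            scores[name] = prior[name] * (1.0 / len(h)) ** len(examples)
        else:
            scores[name] = 0.0          # hypothesis inconsistent with the data
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

print(posterior([16]))              # mass spread over consistent hypotheses
print(posterior([16, 8, 2, 64]))    # "powers of 2" dominates
```

After 16 alone, several hypotheses remain plausible; after {16, 8, 2, 64}, the small-extension hypothesis "powers of 2" takes nearly all the posterior mass, because (1/6)^4 dwarfs (1/50)^4.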
Slide 19: Posterior Distribution After Observing 16
Slide 20: Model vs. Human Data
[Figure: model predictions vs. human data]
Slide 21: Summary of Tenenbaum (1999)
Method:
Pick prior distribution (includes hypothesis space)
Pick likelihood function (size principle)
This leads to predictions for generalization as a function of r (range) and n (number of examples)
Claims:
People generalize optimally given assumptions about priors and likelihood
Bayesian approach provides the best description of how people generalize on the rectangle task
Explains how people can learn from a small number of examples, and from only positive examples
Slide 22: Important Ideas in Bayesian Models
Generative models
Likelihood function
Consideration of multiple models in parallel
Potentially infinite model space
Inference:
Prediction via model averaging
Role of priors diminishes with amount of evidence
Learning:
Trade-off between model simplicity and fit to data (Bayesian Occam's Razor)
Slide 23: Ockham's Razor
If two hypotheses are equally consistent with the data, prefer the simpler one.
Simplicity:
Can accommodate fewer observations
Smoother
Fewer parameters
Restricts predictions more ("sharper" predictions)
Examples:
1st vs. 4th order polynomial
Small rectangle vs. large rectangle in Tenenbaum model
Ockham: medieval philosopher and monk
Razor: tool for cutting (metaphorical)
Slide 24: Motivating Ockham's Razor
Aesthetic considerations: a theory with mathematical beauty is more likely to be right (or believed) than an ugly one, given that both fit the same data
Past empirical success of the principle
Coherent inference, as embodied by Bayesian reasoning, automatically incorporates Ockham's razor
Two theories, H1 and H2:
[Figure: priors and likelihoods for H1 and H2]
Slide 25: Ockham's Razor with Priors
Jeffreys (1939) probability text: more complex hypotheses should have lower priors
Requires a numerical rule for assessing complexity
E.g., number of free parameters
E.g., Vapnik-Chervonenkis (VC) dimension
Slide 26: Subjective vs. Objective Priors
Subjective or informative prior: specific, definite information about a random variable
Objective or uninformative prior: vague, general information
Philosophical arguments for certain priors as uninformative:
Maximum entropy / least commitment
E.g., interval [a, b]: uniform
E.g., interval [0, ∞) with mean 1/λ: exponential distribution
E.g., mean μ and std deviation σ: Gaussian
Independence of measurement scale
E.g., Jeffreys's prior 1/(θ(1-θ)) for θ in [0, 1] expresses the same belief whether we talk about θ or log θ
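For concreteness, the maximum-entropy densities named above are the standard textbook results (the explicit forms are not on the slide):

$$p(x) = \frac{1}{b-a} \quad \text{(uniform on } [a,b])$$
$$p(x) = \lambda e^{-\lambda x} \quad \text{(exponential on } [0,\infty) \text{ with mean } 1/\lambda)$$
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)} \quad \text{(Gaussian with mean } \mu \text{ and std } \sigma)$$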
Slide 27: Ockham's Razor Via Likelihoods
Coin flipping example:
H1: coin has two heads
H2: coin has a head and a tail
Consider 5 flips producing HHHHH
H1 could produce only this sequence
H2 could produce HHHHH, but also HHHHT, HHHTH, ..., TTTTT
P(HHHHH | H1) = 1, P(HHHHH | H2) = 1/32
H2 pays the price of a lower likelihood for being able to accommodate a greater range of observations
H1 is more readily rejected by observations
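Worked out explicitly, the arithmetic the slide implies:

$$\frac{P(\mathrm{HHHHH} \mid H_1)}{P(\mathrm{HHHHH} \mid H_2)} = \frac{1}{(1/2)^5} = 32$$

so five heads in a row shift the odds toward the two-headed coin by a factor of 32, whatever the prior odds were.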
Slide 28: Simple and Complex Hypotheses
[Figure: hypotheses H1 and H2]
Slide 29: Bayes Factor
Bayes factor: ratio of the marginal likelihoods of two hypotheses
A.k.a. likelihood ratio
BIC is an approximation to the Bayes factor
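The slide's formulas are images in the original; the standard definitions they refer to (stated here as an assumption about the slide's content) are

$$BF_{12} = \frac{P(D \mid H_1)}{P(D \mid H_2)}, \qquad P(D \mid H_i) = \int P(D \mid w, H_i)\, p(w \mid H_i)\, dw$$

and the BIC approximation, with $k_i$ free parameters, $n$ data points, and maximized likelihood $\hat{L}_i$:

$$\mathrm{BIC}_i = k_i \ln n - 2 \ln \hat{L}_i, \qquad BF_{12} \approx e^{(\mathrm{BIC}_2 - \mathrm{BIC}_1)/2}$$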
Slide 30: Hypothesis Classes Varying in Complexity
E.g., 1st, 2nd, and 3rd order polynomials
Each hypothesis class is parameterized by a weight vector w
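A small illustration of hypothesis classes of increasing complexity, assuming the polynomial-regression reading of the slide; the data and noise level are made up for the demo:

```python
import numpy as np

# Noisy samples from an underlying linear trend (made-up data for the demo).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)

# Hypothesis classes of increasing complexity: 1st, 2nd, 3rd order polynomials,
# each parameterized by its weight vector w.
for order in (1, 2, 3):
    w = np.polyfit(x, y, order)               # best-fit member of the class
    sse = np.sum((y - np.polyval(w, x)) ** 2)
    print(f"order {order}: SSE = {sse:.4f}")  # training error never increases
# A higher-order class always fits the training data at least as well, which is
# why fit alone cannot choose among the classes -- the point of the razor.
```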
Slide 31: Rissanen (1976) Minimum Description Length
Prefer models that can communicate the data in the smallest number of bits.
The preferred hypothesis H for explaining data D minimizes:
(1) the length of the description of the hypothesis, L(H)
(2) the length of the description of the data with the help of the chosen theory, L(D | H)
Slide 32: MDL & Bayes
L: some measure of length (complexity)
MDL: prefer the hypothesis that minimizes L(H) + L(D | H)
Bayes rule implies the MDL principle:
P(H | D) = P(D | H) P(H) / P(D)
-log P(H | D) = -log P(D | H) - log P(H) + log P(D)
             = L(D | H) + L(H) + const
E.g., a hypothesis with prior probability 1/8 costs -log2(1/8) = 3 bits to encode, so higher-prior (simpler) hypotheses get shorter codes.
Slide 33: [figure only]
Slide 34: Relativity Example
Explain the deviation in Mercury's orbit at perihelion with respect to the prevailing theory
E: Einstein's theory
F: fudged Newtonian theory
α = true deviation, a = observed deviation
Slide 35: Relativity Example (Continued)
Subjective Ockham's razor: the result depends on one's belief about P(α | F)
Objective Ockham's razor: for the Mercury example, the RHS is 15.04
Applies to the generic situation