Slide 1

Margin Learning, Online Learning, and The Voted Perceptron
SPLODD ~= AE* – 3, 2011
* Autumnal Equinox

Slide 2
Review

Computer science is full of equivalences:
- SQL = relational algebra
- YFCL = optimizing … on the training data
- gcc –O4 foo.c = gcc foo.c

Also full of relationships between sets:
- Finding smallest error-free decision tree >> 3-SAT
- DataLog >> relational algebra
- CFL >> Det FSMs = RegEx

Slide 3
Review

Bayes nets describe a (family of) joint distribution(s) between random variables.
- They are an operational description (a program) for how data can be generated.
- They are a declarative description (a definition) of the joint distribution, and from this we can derive algorithms for doing stuff other than generation.
- There is a close connection between Naïve Bayes and loglinear models.

Slide 4
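To make the "operational description" reading concrete, here is a minimal sketch of a Naïve Bayes net run forward as a generative program. All parameters (the class prior and per-class word distributions) are made up for illustration:

```python
import random

# A toy Naive Bayes model, read as a generative program:
# first sample a class label, then sample each word
# conditionally independently given that label.
# (All parameters below are invented for illustration.)
p_class = {"sports": 0.5, "politics": 0.5}
p_word_given_class = {
    "sports":   {"ball": 0.6, "vote": 0.1, "game": 0.3},
    "politics": {"ball": 0.1, "vote": 0.6, "game": 0.3},
}

def sample_doc(n_words, rng=random):
    """Generate (label, words) by running the Bayes net forward."""
    y = rng.choices(list(p_class), weights=p_class.values())[0]
    dist = p_word_given_class[y]
    words = rng.choices(list(dist), weights=dist.values(), k=n_words)
    return y, words

y, words = sample_doc(5)
```

The same model, read declaratively, defines Pr(y, w1…wn) = Pr(y) ∏i Pr(wi | y); the program above is just one way of drawing from that joint.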
NB vs loglinear models

[Table recovered only partially from the slide: a grid whose columns are "Loglinear classif.", "NB classif.", and "Multinomial? classif.", with cells marked by the smoothing/objective used (SymDir(100), AbsDisc(0.01), Max CL(y|x) + G(0,1.0)) and variants labeled NB-JL, NB-CL, and NB-CL*.]

Slide 5
NB vs loglinear models

[Table recovered only partially from the slide: the same grid, here with SymDir(100) and Max CL(y|x) + G(0,1.0) marked, a graphical-model fragment with nodes Y and Wj, and the annotation "Optimal if".]

Slide 6
Similarly for sequences…

- An HMM is a Bayes net. It implies a set of independence assumptions; ML parameter setting and Viterbi are optimal if these hold.
- A CRF is a Markov field. It implies a set of independence assumptions; these, plus the goal of maximizing Pr(y|x), give us a learning algorithm.
- You can construct features so that any HMM can be emulated by a CRF with those features.

Slide 7
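For reference, here is a minimal Viterbi decoder for a toy HMM, the dynamic program that is optimal exactly when the HMM's independence assumptions hold. The states, transition table, and emission table are invented for illustration:

```python
import math

# Minimal Viterbi decoding for a toy HMM (all parameters illustrative).
# Finds the most probable state sequence for an observation sequence.
states = ["B", "I", "O"]
start = {"B": 0.5, "I": 0.1, "O": 0.4}
trans = {
    "B": {"B": 0.1, "I": 0.6, "O": 0.3},
    "I": {"B": 0.2, "I": 0.5, "O": 0.3},
    "O": {"B": 0.4, "I": 0.1, "O": 0.5},
}
emit = {
    "B": {"when": 0.1, "Cohen": 0.8, "notes": 0.1},
    "I": {"when": 0.1, "Cohen": 0.3, "notes": 0.6},
    "O": {"when": 0.6, "Cohen": 0.1, "notes": 0.3},
}

def viterbi(obs):
    """Return the max-probability state path via dynamic programming."""
    # delta[s] = best log-prob of any path ending in state s
    delta = {s: math.log(start[s] * emit[s][obs[0]]) for s in states}
    back = []  # backpointers, one dict per time step after the first
    for o in obs[1:]:
        prev = delta
        delta, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            delta[s] = prev[best] + math.log(trans[best][s] * emit[s][o])
            ptr[s] = best
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    last = max(states, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Each step keeps, for every state, the best log-probability of any path ending there, then recovers the argmax path through the backpointers.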
In sequence space…

[Table recovered only partially from the slide: the analogous grid for sequence models, with columns "CRF/loglinear models", "HMMs", and "Multinomial? models", cells marked SymDir(100), AbsDisc(0.01), and Max CL(y|x) + G(0,1.0), and variants labeled JL, CL, and CL*.]

Slide 8
Review: CRFs/Markov Random Fields

Example sentence: "When will prof Cohen post the notes", with one tag variable per word, Y1, Y2, …, Y7, connected in a chain.

Semantics of a Markov random field (chain case):
- What's independent: Pr(Yi | other Y's) = Pr(Yi | Yi-1, Yi+1)
- Probability distribution: Pr(Y1, …, Y7) = (1/Z) ∏i φi(Yi, Yi+1), a product of potential functions over the edges of the chain, normalized by Z.

Slide 9
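The chain-MRF semantics can be checked by brute force on a tiny example. This sketch (with made-up, agreement-favoring potentials) normalizes the product of edge potentials into a distribution and verifies the stated Markov property, that Pr(Yi | all other Y's) depends only on Yi's neighbors:

```python
import itertools

# Brute-force semantics of a tiny chain MRF with binary variables.
# Pr(y) = (1/Z) * prod_i phi(y_i, y_{i+1}); potentials are illustrative.
n = 5
phi = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}  # favors agreement

def unnorm(y):
    """Unnormalized score: product of edge potentials along the chain."""
    p = 1.0
    for a, b in zip(y, y[1:]):
        p *= phi[(a, b)]
    return p

assignments = list(itertools.product([0, 1], repeat=n))
Z = sum(unnorm(y) for y in assignments)          # partition function
pr = {y: unnorm(y) / Z for y in assignments}     # the full joint

def cond(i, y):
    """Pr(Y_i = y_i | Y_j = y_j for all j != i), computed from the joint."""
    den = sum(pr[y[:i] + (v,) + y[i + 1:]] for v in (0, 1))
    return pr[y] / den
```

Two full assignments that agree on Y1 and Y3 but differ elsewhere give the same conditional for Y2, which is exactly the chain independence claim above.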
Review: CRFs/Markov Random Fields

[Figure recovered only partially from the slide: a tagging lattice over "When will prof Cohen post the notes …", with each word's tag variable ranging over {B, I, O}.]

Slide 10
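The B/I/O labels in the lattice are the usual begin/inside/outside chunking tags. A quick sketch of how a tag sequence maps back to chunks (the tag assignment below is illustrative, not any model's output):

```python
def bio_to_spans(tags):
    """Convert a B/I/O tag sequence into (start, end) chunk spans."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == "B":              # a new chunk begins here
            if start is not None:
                spans.append((start, i))
            start = i
        elif t == "O":            # outside any chunk
            if start is not None:
                spans.append((start, i))
            start = None
        # "I" continues the current chunk
    if start is not None:
        spans.append((start, len(tags)))
    return spans

words = "When will prof Cohen post the notes".split()
tags = ["O", "O", "B", "I", "O", "B", "I"]  # illustrative tags
chunks = [" ".join(words[s:e]) for s, e in bio_to_spans(tags)]
# chunks == ["prof Cohen", "the notes"]
```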
Review: CRFs/Markov Random Fields

Example sentence: "When will prof Cohen post the notes", with tag variables Y1, Y2, …, Y7.

Semantics of a Markov random field (general case):
- What's independent: Pr(Yi | other Y's) = Pr(Yi | neighbors of Yi)
- Probability distribution: Pr(Y) = (1/Z) ∏f φf(Yf), a product of potential functions, one per clique f, normalized by Z.

Slide 11
Pseudo-likelihood and dependency networks

- Any Markov field defines a (family of) probability distribution(s) D, but not a simple program for generation/sampling. We can use MCMC in the general case.
- If you have, for each node i, PD(Xi | Pai), that's a dependency net. Still no simple program for generation/sampling (but you can use Gibbs sampling).
- You can learn these from data using YFCL. Equivalently: learning this maximizes pseudo-likelihood, just as HMM learning maximizes (real) likelihood on a sequence.
- A weirdness: every MRF has an equivalent dependency net, but not every dependency net (set of local conditionals) has an equivalent MRF.

Slide 12
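A minimal sketch of the Gibbs sampler for a dependency net, here over just two binary variables. The local conditionals are made up for illustration; the sampler simply resamples each variable in turn from its local conditional:

```python
import random

# Gibbs sampling from a toy dependency net over two binary variables.
# Each node supplies a local conditional P(X_i | the other variables);
# the conditionals below are invented for illustration.
def p_x0_given(x1):
    return 0.8 if x1 == 1 else 0.3   # P(X0 = 1 | X1)

def p_x1_given(x0):
    return 0.7 if x0 == 1 else 0.2   # P(X1 = 1 | X0)

def gibbs(n_samples, rng=random, burn_in=1000):
    """Resample each variable from its local conditional, in turn."""
    x = [0, 0]
    samples = []
    for t in range(n_samples + burn_in):
        x[0] = 1 if rng.random() < p_x0_given(x[1]) else 0
        x[1] = 1 if rng.random() < p_x1_given(x[0]) else 0
        if t >= burn_in:                 # discard early, unmixed samples
            samples.append(tuple(x))
    return samples
```

When the local conditionals are mutually consistent (as they are for any conditionals derived from an MRF), the chain's long-run sample frequencies approximate the joint; this is the "can use Gibbs" escape hatch from the lack of a simple forward-sampling program.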
And now for …