Learning for Structured Prediction: Overview of the Material
Outline
Type of structures considered
Generative vs discriminative
Global discriminative vs local discriminative
Decoding: at testing vs at learning; methods for decoding
Predefined features vs latent features
I will use red italics to illustrate methods, and will oversimplify some points.
Types of Structures
Sequences: chain CRFs, HMMs, (chain-type) M3Ns, ...
Trees:
  Constituency trees: weighted CFGs (including LA-PCFGs), left-corner/shift-reduce parsers (the MaxEnt parser, the ISBN parser, ...)
  Dependency structures: the MST parser, Nivre's shift-reduce parser, ...
Rankings: PRank (today)
Not considered: DAGs (e.g., some semantic representations), bipartite graphs (machine translation), or more general graphs ...
Generative vs Discriminative
Discriminative: CRFs, MEMMs, the structured perceptron, Max-Margin Markov Networks (M3Ns), ...
Learn a mapping from $x$ to $y$ so that the expected error is minimal.
Pros:
  model what you actually care about
  complex features of $x$ are easy to integrate
  different errors can be considered
  fewer assumptions (and therefore better asymptotic performance)
Generative: HMMs, PCFGs (including LA-PCFGs), ...
Score how likely the combination of input and output is: the joint $p(x, y)$.
Pros:
  easier to learn (if everything is observable, the ML parameters are normalized counts; illustrated below)
  "cleaner" semi-supervised learning
  $p(y \mid x) \propto p(x, y)$: select the $y$ that maximizes $p(x, y)$
  often better with small datasets
  some approaches care about $p(x \mid y)$ itself (speech recognition, statistical machine translation, ...)
  arguably preferable with latent variables
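To make the "ML parameters are normalized counts" point concrete, here is a minimal sketch (not from the slides) of a fully observed generative HMM: estimation is just counting and normalizing, and the model scores the joint $p(x, y)$ of a word/tag sequence. The function names and the <s>/</s> sentinel tags are assumptions made for illustration.

```python
from collections import Counter

def train_hmm(tagged_sentences):
    """tagged_sentences: list of [(word, tag), ...]; returns count-based MLE tables."""
    trans, emit, tag_tot = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"                              # assumed start-of-sentence tag
        for word, tag in sent:
            trans[(prev, tag)] += 1               # transition counts
            emit[(tag, word)] += 1                # emission counts
            tag_tot[tag] += 1
            prev = tag
        trans[(prev, "</s>")] += 1                # assumed end-of-sentence tag
    prev_tot = Counter()
    for (p, _), c in trans.items():
        prev_tot[p] += c
    p_trans = {k: c / prev_tot[k[0]] for k, c in trans.items()}  # p(tag | previous tag)
    p_emit = {k: c / tag_tot[k[0]] for k, c in emit.items()}     # p(word | tag)
    return p_trans, p_emit

def joint_prob(words, tags, p_trans, p_emit):
    """Generative score: the joint probability p(x, y) of a word/tag sequence."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= p_trans.get((prev, t), 0.0) * p_emit.get((t, w), 0.0)
        prev = t
    return p * p_trans.get((prev, "</s>"), 0.0)
```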
Global Discr. vs Local Discr.
Local (distributions over small decisions): MEMMs, SVM decision classifiers in Nivre's shift-reduce parser
Pros:
  no real decoding at training time (cheap learning)
  complex features of $y$ can be integrated easily (about training! still need to decode at testing)
Cons:
  mismatch between test and train modes: rely on true features in training and on predicted ones in testing (see the sketch below)
  label bias (cannot down-weight an unlikely transition if the number of outgoing states is not sufficiently large)
Global (distributions over the entire sequences): structured perceptron, CRFs, M3Ns (model: the MST parser)
Pros:
  theoretically much cleaner, and works better in practice
Cons:
  decoding at training time (plus the partition function for CRFs); but approximate learning methods exist
  learning can be very problematic if complex features of $y$ are used
Both models require decoding at testing. Decoding does not really depend on the training criterion but on the features of $y$.
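A minimal sketch of the local training regime and its train/test mismatch, assuming scikit-learn's LogisticRegression as the local classifier (the feature template and function names are hypothetical): each tag decision is trained as a separate classification conditioned on the gold previous tag, while greedy decoding at test time has to condition on predicted previous tags.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def local_features(words, i, prev_tag):
    # a deliberately tiny feature template, for illustration only
    return {"word": words[i], "prev_tag": prev_tag}

def train_local(tagged_sentences):
    feats, labels = [], []
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        prev = "<s>"
        for i, (_, tag) in enumerate(sent):
            feats.append(local_features(words, i, prev))   # gold history at training
            labels.append(tag)
            prev = tag
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(feats), labels)
    return vec, clf

def greedy_decode(words, vec, clf):
    tags, prev = [], "<s>"
    for i in range(len(words)):
        x = vec.transform([local_features(words, i, prev)])
        prev = clf.predict(x)[0]                            # predicted history at testing
        tags.append(prev)
    return tags
```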
Specific learning criteria
CRFs: maximize the conditional likelihood of the training data, $\prod_i p(y^{(i)} \mid x^{(i)})$.
Perceptron: ensure separability on the training set (with a large margin in some variations, e.g., ALMA): rank the correct structure above incorrect ones (sketched below).
Max-Margin Markov Networks (M3Ns): separate the training set with a maximal margin (sensitive to the error). For every labeled example $(x^{(i)}, y^{(i)})$,
  $w \cdot f(x^{(i)}, y^{(i)}) \ge w \cdot f(x^{(i)}, y) + L(y^{(i)}, y)$ for all $y$,
where $y$ is any structure and $L$ is some loss function (e.g., Hamming distance for sequences, measuring how many labels do not match). "Wrong sequences with small errors should be penalized less than those with more errors."
SVM-Struct, Boosting, ....
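The perceptron criterion has a particularly compact learning rule. Here is a minimal sketch (not from the slides), assuming the task supplies a decode (argmax) routine and a feature-count function, both hypothetical names: whenever the current weights do not rank the correct structure above the predicted one, move the weights toward the gold features and away from the predicted ones.

```python
def perceptron_epoch(data, weights, decode, features):
    """data: list of (x, gold_y) pairs; weights: dict mapping feature name -> float.
    decode(x, weights) returns argmax_y of w . f(x, y); features(x, y) returns a
    dict of feature counts. Both are assumed to be supplied by the task."""
    for x, gold_y in data:
        pred_y = decode(x, weights)
        if pred_y != gold_y:                      # correct structure not ranked on top
            for f, v in features(x, gold_y).items():
                weights[f] = weights.get(f, 0.0) + v
            for f, v in features(x, pred_y).items():
                weights[f] = weights.get(f, 0.0) - v
    return weights
```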
Decoding at training vs testing: examples
MEMM (local discr.): at training, "no" (multiclass classifiers are trained); at testing, approximate search if there is a complex decomposition over $y$, Viterbi otherwise.
"Standard" chain CRF (global discr.): at training, full (plus the partition function); at testing, full (Viterbi).
HMM (generative): at training, no; at testing, full (Viterbi).
Incremental perceptron (global discr.): at training, approximate; at testing, approximate (less approximate).
Searn (local discr.): at training, approximate (more than that); at testing, approximate.
Different combinations are possible ....
Inference (argmax)
Simple dependencies in $y$:
  Viterbi to find the most likely sequence (or Chu-Liu/Edmonds for the MST parser); a sketch follows below
  or marginal decoding to find the most likely label for every "position"
Complex dependencies:
  beam or greedy search (or some smarter search methods)
  reformulate the inference problem as an integer linear program and use methods known from ILP
(We do not care here when the inference is used: at training, at testing, or at both.)
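For the simple-dependencies case, Viterbi is dynamic programming over positions. A minimal sketch, assuming the model's local factors are exposed through a score(i, prev_label, label) function (an assumption made for illustration):

```python
def viterbi(n_positions, labels, score):
    """score(i, prev_label, label) -> local score; prev_label is None at position 0.
    Returns the highest-scoring label sequence under a chain decomposition."""
    best = {lab: score(0, None, lab) for lab in labels}   # best score ending in lab
    backpointers = []
    for i in range(1, n_positions):
        new_best, ptr = {}, {}
        for lab in labels:
            prev = max(labels, key=lambda p: best[p] + score(i, p, lab))
            new_best[lab] = best[prev] + score(i, prev, lab)
            ptr[lab] = prev
        best = new_best
        backpointers.append(ptr)
    last = max(labels, key=lambda lab: best[lab])         # best final label
    seq = [last]
    for ptr in reversed(backpointers):                    # follow backpointers
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))
```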
Latent Variables vs Explicit Features
Explicit features (most of the models we considered: CRFs, MEMMs, etc.):
Pros:
  mostly convex optimization (no local minima)
  cheaper to learn
Cons:
  the model is only as good as its features: extensive feature engineering is needed
  non-local dependencies in $y$ are often necessary
Latent variable models (LA-PCFGs, ISBNs):
Pros:
  learn how to propagate relevant information (learn complex features from simple ones)
  can learn a model with simple decompositions over an extended $y$, giving efficient decoding
  the latent representation (e.g., extended parsing states or an extended grammar) can potentially be useful in other tasks (multi-task learning)
Cons:
  non-convex optimization: need to avoid local minima (tricky)
  more expensive to train
Last bits
Term paper: due Mar 31, but send me ideas, outlines, and drafts well before the deadline (soon!)
Feedback on the content would be very much appreciated (as I am preparing a lecture class with a similar set of topics).
Thanks for participating!!!