Learning for Structured Prediction



Presentation Transcript


Learning for Structured Prediction: Overview of the Material


Outline

- Types of structures considered
- Generative vs. discriminative
- Global discriminative vs. local discriminative
- Decoding: at testing vs. at learning; methods for decoding
- Predefined features vs. latent features

I will use red italics to illustrate methods, oversimplifying some points.


Types of Structures

- Sequences: chain CRFs, HMMs, (chain-type) M3Ns, ...
- Trees:
  - Constituency trees: weighted CFGs (including LA-PCFGs), left-corner/shift-reduce parsers (the MaxEnt parser, the ISBN parser, ...)
  - Dependency structures: the MST parser, Nivre's shift-reduce parser, ...
- Rankings: PRank (today)

Not considered: DAGs (e.g., some semantic representations), bipartite graphs (machine translation), or more general graphs ...

Generative vs Discriminative

Discriminative: CRFs, MEMMs, the structured perceptron, max-margin Markov networks (M3Ns), ...

- Learn a mapping from inputs x to outputs y so that the expected error is minimal.
- Pros:
  - you model what you actually care about
  - complex features of x are easy to integrate
  - different error functions can be considered
  - fewer assumptions (and therefore better asymptotic performance)

Generative: HMMs, PCFGs (including LA-PCFGs), ...

- Score how likely the combination of input and output is; at prediction time, select the output that maximizes this score.
- Pros:
  - easier to learn (if everything is observable, the maximum-likelihood parameters are normalized counts)
  - "cleaner" semi-supervised learning
  - often better with small datasets
  - some approaches care about modelling the input as well (speech recognition, statistical machine translation, ...)
  - arguably preferable with latent variables
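The slide's own formulas were lost in extraction; as a compact reference (standard textbook notation, not necessarily the slide's), the contrast can be written with input x, output structure y, parameters theta or w, and feature function f:

```latex
% Generative: model the joint distribution; decode through Bayes' rule.
\[ \hat{y} \;=\; \arg\max_{y}\, p_{\theta}(y \mid x) \;=\; \arg\max_{y}\, p_{\theta}(x, y) \]
% Discriminative: model the conditional (or just a score) directly.
\[ p_{w}(y \mid x) \;\propto\; \exp\big(w^{\top} f(x, y)\big),
   \qquad \hat{y} \;=\; \arg\max_{y}\, w^{\top} f(x, y) \]
```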

Global Discr. vs Local Discr.

Local (distributions over small decisions): MEMMs, the SVM decision classifiers in Nivre's shift-reduce parser, ...

- Pros:
  - no real decoding at training time (cheap learning)
  - complex features of x can be integrated easily (for training! you still need to decode at testing)
- Cons:
  - mismatch between train and test modes: the model relies on true features in training and on predicted ones at testing
  - label bias (an unlikely transition cannot be down-weighted if its source state has too few outgoing transitions)

Global (distributions over entire sequences): the structured perceptron, CRFs, M3Ns (model: the MST parser), ...

- Pros:
  - theoretically much cleaner, and works better in practice
- Cons:
  - decoding at training time (plus the partition function for CRFs); but approximate learning methods exist
  - learning can be very problematic if complex features of y are used

Both kinds of models require decoding at testing. Decoding does not really depend on the training criterion but on how the features decompose over y.
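To make the local/global distinction concrete for chains (standard notation, not the slide's): a locally normalized MEMM renormalizes at every position, while a globally normalized chain CRF uses a single partition function Z(x) over whole sequences, which is what removes the label-bias problem:

```latex
% Locally normalized (MEMM-style): one normalized decision per position.
\[ p(y \mid x) \;=\; \prod_{t=1}^{T}
   \frac{\exp\big(w^{\top} f(y_{t-1}, y_t, x, t)\big)}
        {\sum_{y'} \exp\big(w^{\top} f(y_{t-1}, y', x, t)\big)} \]
% Globally normalized (chain CRF): one partition function over all sequences.
\[ p(y \mid x) \;=\; \frac{\exp\Big(\sum_{t=1}^{T} w^{\top} f(y_{t-1}, y_t, x, t)\Big)}{Z(x)},
   \qquad
   Z(x) \;=\; \sum_{y'} \exp\Big(\sum_{t=1}^{T} w^{\top} f(y'_{t-1}, y'_t, x, t)\Big) \]
```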

Specific learning criteria

CRFs:
- Maximize the conditional likelihood of the correct structures given the inputs.

Perceptron:
- Ensure separability on the training set (with a large margin in some variations, e.g., ALMA): rank the correct structure above incorrect ones.

Max-margin Markov networks (M3Ns):
- Separate the training set with the maximal margin, sensitive to the error: for every labeled example, the correct structure must beat any other structure by a margin scaled by a loss function (e.g., Hamming distance for sequences, measuring how many labels do not match). "Wrong sequences with small errors should be penalized less than those with more errors."

Related: SVM-Struct, boosting, ...
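The criteria's formulas did not survive extraction either; for reference, their standard textbook forms, with training pairs (x_i, y_i), score w^T f(x, y), loss Delta, and slack variables xi_i, are roughly:

```latex
% CRF: maximize the conditional log-likelihood (usually with regularization).
\[ \max_{w} \;\sum_{i} \Big( w^{\top} f(x_i, y_i) - \log Z_{w}(x_i) \Big) \]
% Structured perceptron: require the correct structure to outscore all others,
% for every i and every y \neq y_i:
\[ w^{\top} f(x_i, y_i) \;>\; w^{\top} f(x_i, y) \]
% M3N / structured SVM: margin scaled by the loss \Delta, so structures with
% small errors are penalized less:
\[ \min_{w,\,\xi \geq 0} \; \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_i
   \quad \text{s.t.} \quad
   w^{\top} f(x_i, y_i) \;\geq\; w^{\top} f(x_i, y) + \Delta(y_i, y) - \xi_i
   \;\;\; \forall i,\; \forall y \]
```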

Decoding at training vs testing: examples

Model                                  | Decoding at training                      | Decoding at testing
MEMM (local discr.)                    | "No" (multiclass classifiers are trained) | Approximate search if the decomposition over y is complex; Viterbi otherwise
"Standard" chain CRF (global discr.)   | Full (+ partition function)               | Full (Viterbi)
HMM (generative)                       | No                                        | Full (Viterbi)
Incremental perceptron (global discr.) | Approximate                               | Approximate (less approximate)
Searn (local discr.)                   | Approximate (more than that)              | Approximate

Different combinations are possible ...
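To make the "decoding at training time" column concrete, here is a minimal, hypothetical sketch of a global method (a structured perceptron) in Python; the feature map and brute-force decoder are illustrative placeholders, not any particular parser's. Note the argmax call inside the training loop, which is exactly the cost that local methods such as MEMMs avoid.

```python
from collections import defaultdict
from itertools import product

# Toy feature map for a tag sequence: emission and transition indicator features.
def features(words, tags):
    feats = defaultdict(float)
    prev = "<s>"
    for word, tag in zip(words, tags):
        feats[("emit", word, tag)] += 1.0
        feats[("trans", prev, tag)] += 1.0
        prev = tag
    return feats

def score(weights, feats):
    return sum(weights[k] * v for k, v in feats.items())

# Brute-force argmax over all tag sequences (exponential; fine for a toy tag set,
# in practice this is where Viterbi / beam search / ILP would go).
def decode(weights, words, tagset):
    best, best_score = None, float("-inf")
    for tags in product(tagset, repeat=len(words)):
        s = score(weights, features(words, tags))
        if s > best_score:
            best, best_score = list(tags), s
    return best

# Structured perceptron: every training step decodes the current best structure
# and updates towards the gold one -- decoding happens *at training time*.
def train_perceptron(data, tagset, epochs=5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = decode(weights, words, tagset)
            if pred != gold:
                for k, v in features(words, gold).items():
                    weights[k] += v
                for k, v in features(words, pred).items():
                    weights[k] -= v
    return weights

if __name__ == "__main__":
    data = [(["the", "dog", "barks"], ["DET", "NOUN", "VERB"]),
            (["a", "cat", "sleeps"], ["DET", "NOUN", "VERB"])]
    w = train_perceptron(data, tagset=["DET", "NOUN", "VERB"])
    print(decode(w, ["the", "cat", "barks"], ["DET", "NOUN", "VERB"]))
```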

Inference (argmax)

Simple dependencies in y:
- Viterbi to find the most likely sequence (or Chu-Liu-Edmonds for the MST parser)
- Or marginal decoding to find the most likely label for every "position"

Complex dependencies:
- Beam or greedy search (or some smarter search methods)
- Reformulate the inference problem as an integer linear program and use methods known in ILP

(We do not care here when the inference is used: at training, at testing, or both.)
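For the simple-dependencies case, here is a minimal Viterbi sketch for a first-order chain model, assuming scores that decompose into per-position emission and transition terms (the names and data layout are illustrative, not tied to any of the toolkits above):

```python
import math

def viterbi(emission, transition, tags):
    """Most likely tag sequence for one sentence.

    emission[t][tag]      -- score of `tag` at position t
    transition[prev][tag] -- score of moving from `prev` to `tag`
    Scores are in the log domain, so they are added along a path.
    """
    n = len(emission)
    # best[t][tag] = best score of any path ending in `tag` at position t
    best = [{tag: emission[0][tag] for tag in tags}]
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for tag in tags:
            prev_tag, prev_score = max(
                ((p, best[t - 1][p] + transition[p][tag]) for p in tags),
                key=lambda x: x[1],
            )
            best[t][tag] = prev_score + emission[t][tag]
            back[t][tag] = prev_tag
    # Backtrack from the best final tag.
    last = max(tags, key=lambda tag: best[n - 1][tag])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

if __name__ == "__main__":
    tags = ["N", "V"]
    emission = [{"N": math.log(0.7), "V": math.log(0.3)},
                {"N": math.log(0.4), "V": math.log(0.6)}]
    transition = {"N": {"N": math.log(0.3), "V": math.log(0.7)},
                  "V": {"N": math.log(0.6), "V": math.log(0.4)}}
    print(viterbi(emission, transition, tags))  # -> ['N', 'V']
```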

Latent Variables vs Explicit Features

Explicit features (most of the models we considered: CRFs, MEMMs, etc.):

- Pros:
  - mostly convex optimization (no local minima)
  - cheaper to learn
- Cons:
  - the model is only as good as its features: extensive feature engineering is needed
  - non-local dependencies in y are often necessary

Latent variable models (LA-PCFGs, ISBNs):

- Pros:
  - learn how to propagate relevant information (complex features are learned from simple ones)
  - can learn a model with simple decompositions over an extended y, which gives efficient decoding
  - the latent representation (e.g., extended parsing states or an extended grammar) can potentially be useful in other tasks (multi-task learning)
- Cons:
  - non-convex optimization: local minima need to be avoided (tricky)
  - more expensive to train

Last bits

Term paper: due Mar 31, but send me ideas, outlines, and drafts well before the deadline (soon!)

Feedback on the content would be very much appreciated (as I am preparing a lecture class with a similar set of topics).

Thanks for participating!!!