Online Max-Margin Weight Learning for Markov Logic Networks
Presentation Transcript

Slide 1

Online Max-Margin Weight Learning for Markov Logic Networks

Tuyen N. Huynh and Raymond J. Mooney

Machine Learning Group, Department of Computer Science, The University of Texas at Austin

SDM 2011, April 29, 2011

Slide 2

Motivation

D. McDermott and J. Doyle. Non-monotonic Reasoning I. Artificial Intelligence, 13: 41-72, 1980.

[A0 He] [AM-MOD would] [AM-NEG n’t] [V accept] [A1 anything of value] from [A2 those he was writing about]

Citation segmentation

Semantic role labeling

Slide 3

Motivation (cont.)

Markov Logic Networks (MLNs) [Richardson & Domingos, 2006] are an elegant and powerful formalism for handling such complex structured data.

Existing weight learning methods for MLNs work in the batch setting:
They need to run inference over all the training examples in each iteration
They usually take a few hundred iterations to converge
All the training examples may not fit in main memory
→ they do not scale to problems with a large number of examples

Previous work applied an existing online algorithm to learn weights for MLNs but did not compare it to other algorithms.

This work introduces a new online weight learning algorithm and extensively compares it to existing methods.

Slide 4

Outline

Motivation
Background
  Markov Logic Networks
  Primal-dual framework for online learning
New online learning algorithm for max-margin structured prediction
Experimental evaluation
Summary

Slide 5

Markov Logic Networks [Richardson & Domingos, 2006]

A set of weighted first-order formulas
A larger weight indicates a stronger belief that the formula should hold
The formulas are called the structure of the MLN
MLNs are templates for constructing Markov networks for a given set of constants

MLN Example: Friends & Smokers

*Slide from [Domingos, 2007]
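
The weighted formulas shown on the original slide are not captured in this transcript. As a hedged reconstruction, the standard Friends & Smokers example from [Domingos, 2007] uses two rules of roughly this shape (the weights here are illustrative, not the values on the slide):

```latex
% Illustrative weights; the exact numbers on the original slide are not in the transcript.
\begin{align*}
1.5 &\quad \forall x\;\; \mathrm{Smokes}(x) \Rightarrow \mathrm{Cancer}(x)\\
1.1 &\quad \forall x, y\;\; \mathrm{Friends}(x, y) \Rightarrow \bigl(\mathrm{Smokes}(x) \Leftrightarrow \mathrm{Smokes}(y)\bigr)
\end{align*}
```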

Slide 6

Example: Friends & Smokers

Two constants: Anna (A) and Bob (B)

*Slide from [Domingos, 2007]

Slides 7-9

Example: Friends & Smokers

Two constants: Anna (A) and Bob (B)

[Figure: the ground Markov network constructed over the ground atoms Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), built up incrementally over the three slides]

*Slide from [Domingos, 2007]

Slide 10

Probability of a possible world x

P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) )

where w_i is the weight of formula i, n_i(x) is the number of true groundings of formula i in the possible world x, and Z is the normalization constant.

A possible world becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.
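
As a toy illustration of this formula, the sketch below hand-grounds a two-formula Friends & Smokers MLN for constants A and B and evaluates the probability of one possible world by brute force; the formulas, weights, and atom names are illustrative assumptions, not taken from the slides:

```python
import math
from itertools import product

# A possible world is a dict mapping each ground atom name to True/False.
def n_smokes_implies_cancer(world):
    # true groundings of Smokes(p) => Cancer(p)
    return sum(1 for p in "AB" if (not world[f"Smokes({p})"]) or world[f"Cancer({p})"])

def n_friends_smoke_alike(world):
    # true groundings of Friends(p,q) => (Smokes(p) <=> Smokes(q))
    return sum(1 for p, q in product("AB", repeat=2)
               if (not world[f"Friends({p},{q})"]) or (world[f"Smokes({p})"] == world[f"Smokes({q})"]))

weights = [1.5, 1.1]                                        # w_i (illustrative values)
counts = [n_smokes_implies_cancer, n_friends_smoke_alike]   # n_i(x)

def score(world):
    # sum_i w_i * n_i(x), the exponent in P(X = x) = (1/Z) exp(sum_i w_i n_i(x))
    return sum(w * n(world) for w, n in zip(weights, counts))

atoms = ([f"Smokes({p})" for p in "AB"] + [f"Cancer({p})" for p in "AB"]
         + [f"Friends({p},{q})" for p, q in product("AB", repeat=2)])

# Z sums over all 2^8 possible worlds -- only feasible because the domain is tiny.
Z = sum(math.exp(score(dict(zip(atoms, vals))))
        for vals in product([False, True], repeat=len(atoms)))

world = {a: False for a in atoms}
world["Smokes(A)"] = True
print(math.exp(score(world)) / Z)  # probability of this particular possible world
```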

Slide 11

Max-margin weight learning for MLNs [Huynh & Mooney, 2009]

Maximize the separation margin: the log of the ratio between the probability of the correct label and the probability of the closest incorrect one
Formulated as a 1-slack Structural SVM [Joachims et al., 2009]
Solved with the cutting plane method [Tsochantaridis et al., 2004] and an approximate inference algorithm based on linear programming
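
The margin formula on the original slide is not in the transcript; based on the description (log of the probability ratio, in which the normalization constant Z cancels), a hedged reconstruction is:

```latex
\[
\gamma(x, y; \mathbf{w})
 = \log\frac{P(y \mid x)}{P(\hat{y} \mid x)}
 = \mathbf{w}^{\top}\bigl(\mathbf{n}(x, y) - \mathbf{n}(x, \hat{y})\bigr),
\qquad
\hat{y} = \operatorname*{argmax}_{y' \neq y} P(y' \mid x)
\]
```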

Slide 12

Online learning

For t = 1 to T:
  Receive an example x_t
  The learner chooses a weight vector w_t and uses it to predict a label
  Receive the correct label y_t
  Suffer a loss l_t(w_t)

Goal: minimize the regret

R(T) = Σ_{t=1..T} l_t(w_t) − min_w Σ_{t=1..T} l_t(w)

The first sum is the accumulative loss of the online learner; the second is the accumulative loss of the best batch learner.
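
A minimal Python sketch of this protocol; the learner object with its predict/update methods and the per-step loss function are hypothetical placeholders, not anything defined on the slides:

```python
def online_learning(examples, learner, loss_fn):
    """Generic online protocol from the slide: predict, observe the true label, suffer a loss.

    `examples` is a stream of (x, y) pairs; `learner` exposes hypothetical
    predict(x) and update(x, y) methods; `loss_fn(y_true, y_pred)` is the per-step loss.
    """
    total_loss = 0.0
    for x, y in examples:                 # for t = 1..T
        y_pred = learner.predict(x)       # the learner commits to a prediction
        total_loss += loss_fn(y, y_pred)  # the true label is revealed and a loss is suffered
        learner.update(x, y)              # the learner may then adjust its weight vector
    # Regret compares total_loss with the loss of the best fixed (batch) weight vector in hindsight.
    return total_loss
```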

Slide 13

Primal-dual framework for online learning [Shalev-Shwartz et al., 2006]

A general and recent framework for deriving low-regret online algorithms:
Rewrite the regret bound as an optimization problem (called the primal problem), then consider the dual of that primal problem
Derive a condition that guarantees an increase in the dual objective at each step
→ Incremental-Dual-Ascent (IDA) algorithms, for example subgradient methods [Zinkevich, 2003]

Slide 14

Primal-dual framework for online learning (cont.)

Proposes a new class of IDA algorithms called Coordinate-Dual-Ascent (CDA) algorithms:
The CDA update rule only optimizes the dual w.r.t. the last dual variable (the current example)
The CDA update rule has a closed-form solution
→ CDA algorithms have the same computational cost as subgradient methods but increase the dual objective more in each step → better accuracy

Slide 15

Steps for deriving a new CDA algorithm

1. Define the regularization and loss functions
2. Find the conjugate functions
3. Derive a closed-form solution for the CDA update rule

CDA algorithm for max-margin structured prediction

Slide 16

Max-margin structured prediction

The output y belongs to some structured space Y
Joint feature function Φ(x, y): X × Y → R^n
Learn a discriminant function f(x, y) = w · Φ(x, y)
Prediction for a new input x: ŷ = argmax over y in Y of w · Φ(x, y)
Max-margin criterion: the correct label should score higher than every incorrect label by a margin (a hedged reconstruction follows below)
For MLNs: Φ(x, y) = n(x, y), the vector of true-grounding counts of the clauses
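
The max-margin criterion formula itself did not survive the transcript; a hedged reconstruction in the common margin-rescaled form, with ρ the label loss function used on the following slides, is:

```latex
\[
\mathbf{w}^{\top}\Phi(x_t, y_t) \;\ge\; \max_{y \neq y_t}\Bigl[\mathbf{w}^{\top}\Phi(x_t, y) + \rho(y_t, y)\Bigr]
\]
```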

Slide 17

1. Define the regularization and loss functions

Regularization function: f(w) = (1/2)||w||_2^2

Loss functions:
Prediction-based loss (PL): the loss incurred by using the predicted label at each step, where ρ(·,·) is the label loss function (see the hedged sketch below)
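
The PL loss formula is garbled in the transcript; a hedged reconstruction consistent with the slide text (the label loss of the predicted label plus the margin violation, with ŷ_t the label predicted by the current weights) is:

```latex
\[
\ell_{\mathrm{PL}}\bigl(\mathbf{w}_t; (x_t, y_t)\bigr)
 = \Bigl[\rho(y_t, \hat{y}_t) + \mathbf{w}_t^{\top}\Phi(x_t, \hat{y}_t) - \mathbf{w}_t^{\top}\Phi(x_t, y_t)\Bigr]_{+},
\qquad
\hat{y}_t = \operatorname*{argmax}_{y} \mathbf{w}_t^{\top}\Phi(x_t, y)
\]
```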

Slide 18

1. Define the regularization and loss functions (cont.)

Loss functions:
Maximal loss (ML): the maximum loss an online learner could suffer at each step (see the hedged sketch below)

Upper bound on the PL loss → more aggressive updates → better predictive accuracy on clean datasets
The ML loss depends on the label loss function → it can only be used with some label loss functions
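
The ML loss formula is likewise missing; a hedged reconstruction as the usual structured hinge loss, which takes the maximum over all labels and therefore upper-bounds the PL loss, is:

```latex
\[
\ell_{\mathrm{ML}}\bigl(\mathbf{w}_t; (x_t, y_t)\bigr)
 = \max_{y \in \mathcal{Y}}\Bigl[\rho(y_t, y) + \mathbf{w}_t^{\top}\Phi(x_t, y) - \mathbf{w}_t^{\top}\Phi(x_t, y_t)\Bigr]
\]
```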

Slide 19

2. Find the conjugate functions

Conjugate function: f*(θ) = sup_w [⟨w, θ⟩ − f(w)]
In one dimension, f*(θ) is the negative of the y-intercept of the tangent line to the graph of f that has slope θ

Slide 20

2. Find the conjugate functions (cont.)

Conjugate function of the regularization function f(w):
f(w) = (1/2)||w||_2^2  ⇒  f*(µ) = (1/2)||µ||_2^2
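
As a quick sanity check, the supremum in the definition of the conjugate is attained at w = µ, which gives:

```latex
\[
f^{*}(\boldsymbol{\mu})
 = \sup_{\mathbf{w}}\Bigl[\langle \mathbf{w}, \boldsymbol{\mu}\rangle - \tfrac{1}{2}\lVert\mathbf{w}\rVert_2^2\Bigr]
 = \langle \boldsymbol{\mu}, \boldsymbol{\mu}\rangle - \tfrac{1}{2}\lVert\boldsymbol{\mu}\rVert_2^2
 = \tfrac{1}{2}\lVert\boldsymbol{\mu}\rVert_2^2
\]
```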

Slide 21

2. Find the conjugate functions (cont.)

Conjugate functions of the loss functions:
The PL and ML losses are similar to the hinge loss
Conjugate function of the hinge loss: [Shalev-Shwartz & Singer, 2007] (see the sketch below)
Conjugate functions of the PL and ML losses follow the same pattern
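
The conjugate formulas for the PL and ML losses on the original slide are not in the transcript. For reference, the conjugate of the plain hinge loss g(w) = [γ − ⟨w, x⟩]₊, the result cited from [Shalev-Shwartz & Singer, 2007], takes the following form (stated here from memory of that reference, so treat it as a hedged reminder rather than the slide's exact content):

```latex
\[
g(\mathbf{w}) = \bigl[\gamma - \langle \mathbf{w}, \mathbf{x}\rangle\bigr]_{+}
\quad\Longrightarrow\quad
g^{*}(\boldsymbol{\theta}) =
\begin{cases}
-\gamma\alpha & \text{if } \boldsymbol{\theta} = -\alpha\,\mathbf{x} \text{ for some } \alpha \in [0, 1],\\[2pt]
+\infty & \text{otherwise.}
\end{cases}
\]
```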

CDA’s update formula:

Compare with the update formula of the simple update,

subgradient

method

[Ratliff et al., 2007]

:

 

22

CDA’s learning rate combines the learning rate of the

subgradient

method with the loss incurred at each step

3. Closed-form solution for the CDA update ruleSlide23
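
The closed-form CDA update on the original slide is not preserved. The following Python sketch only illustrates the idea stated in the slide text, a learning rate that combines the subgradient method's 1/(λt) rate with a loss-dependent, passive-aggressive-style rate; it is an assumption, not the paper's exact update rule, and the function name and arguments are hypothetical:

```python
import numpy as np

def cda_style_update(w, phi_true, phi_pred, label_loss, lam, t):
    """Hedged sketch of a CDA-style weight update for max-margin structured prediction.

    Consistent with the slide's description (learning rate = subgradient rate capped
    by a loss-dependent rate), but NOT necessarily the paper's exact closed form.
    phi_true / phi_pred: feature vectors n(x, y) of the true and predicted labels.
    """
    delta = phi_true - phi_pred                       # update direction
    loss = max(0.0, label_loss + w.dot(phi_pred) - w.dot(phi_true))  # hinge-style step loss
    if loss == 0.0 or not delta.any():
        return w                                      # no violation -> no update
    subgrad_rate = 1.0 / (lam * t)                    # rate used by the plain subgradient method
    pa_rate = loss / delta.dot(delta)                 # passive-aggressive-style, loss-aware rate
    eta = min(subgrad_rate, pa_rate)                  # combined, capped learning rate
    return w + eta * delta
```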

Slide 23

Experiments

Slide 24

Experimental Evaluation

Citation segmentation on the CiteSeer dataset
Search query disambiguation on a dataset obtained from Microsoft
Semantic role labeling on the noisy CoNLL 2005 dataset

Slide 25

Citation segmentation

CiteSeer dataset [Lawrence et al., 1999], [Poon & Domingos, 2007]
1,563 citations, divided into 4 research topics
Task: segment each citation into 3 fields: Author, Title, Venue
Used the MLN for the isolated segmentation model in [Poon & Domingos, 2007]

Slide 26

Experimental setup

4-fold cross-validation
Systems compared:
MM: the max-margin weight learner for MLNs in the batch setting [Huynh & Mooney, 2009]
1-best MIRA [Crammer et al., 2005]
Subgradient
CDA: CDA-PL and CDA-ML
Metric: F1, the harmonic mean of precision and recall

Slide 27

Average F1 on CiteSeer

Slide 28

Average training time in minutes

Slide 29

Search query disambiguation

Used the dataset created by Mihalkova & Mooney [2009]
Thousands of search sessions in which ambiguous queries were issued: 4,618 sessions for training, 11,234 sessions for testing
Goal: disambiguate a search query based on previous related search sessions
Noisy dataset, since the true labels are based on which results users clicked
Used the 3 MLNs proposed in [Mihalkova & Mooney, 2009]

Slide 30

Experimental setup

Systems compared:
Contrastive Divergence (CD) [Hinton, 2002], used in [Mihalkova & Mooney, 2009]
1-best MIRA
Subgradient
CDA: CDA-PL and CDA-ML
Metric: Mean Average Precision (MAP), measuring how close the relevant results are to the top of the rankings

Slide 31

MAP scores on Microsoft query search

Slide 32

Semantic role labeling

CoNLL 2005 shared task dataset [Carreras & Màrquez, 2005]
Task: for each target verb in a sentence, find and label all of its semantic components
90,750 training examples; 5,267 test examples
Noisy-label experiment:
Motivated by noisy labeled data obtained from crowdsourcing services such as Amazon Mechanical Turk
Simple noise model: at p percent noise, each argument of a verb has probability p of being swapped with another argument of that verb
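
A small Python sketch of this noise model, assuming (purely for illustration) that each example is a list of (label, span) argument pairs for a single target verb:

```python
import random

def add_argument_noise(arguments, p, rng=random):
    """Swap-based noise from the slide: each argument of the verb is swapped with
    another argument of the same verb with probability p.

    `arguments` is a hypothetical list of (label, span) pairs for one target verb.
    """
    noisy = list(arguments)
    for i in range(len(noisy)):
        if len(noisy) > 1 and rng.random() < p:
            j = rng.randrange(len(noisy))
            while j == i:
                j = rng.randrange(len(noisy))
            # swap the labels of the two arguments, keeping their spans
            (li, si), (lj, sj) = noisy[i], noisy[j]
            noisy[i], noisy[j] = (lj, si), (li, sj)
    return noisy
```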

Slide 33

Experimental setup

Used the MLN developed in [Riedel, 2007]
Systems compared:
1-best MIRA
Subgradient
CDA-ML
Metric: F1 of the predicted arguments [Carreras & Màrquez, 2005]

Slide 34

F1 scores on CoNLL 2005

Slide 35

Summary

Derived CDA algorithms for max-margin structured prediction
They have the same computational cost as existing online algorithms but increase the dual objective more
Experimental results on several real-world problems show that the new algorithms generally achieve better accuracy and also have more consistent performance

Slide 36

Thank you!

Questions?