Frank-Wolfe optimization insights in machine learning


Presentation Transcript

Slide1

Frank-Wolfe optimization insights in machine learning

Simon Lacoste-Julien
INRIA / École Normale Supérieure, SIERRA Project Team

SMILE – November 4th, 2013
Slide2

Outline

Frank-Wolfe optimization
Frank-Wolfe for structured prediction
- links with previous algorithms
- block-coordinate extension
- results for sequence prediction
Herding as Frank-Wolfe optimization
- extension: weighted Herding
- simulations for quadrature
Slide3

Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient)

Algorithm for constrained optimization: min_{x ∈ M} f(x), where f is convex & continuously differentiable and M is convex & compact.

FW algorithm – repeat:
1) Find a good feasible direction by minimizing the linearization of f at x_t:
   s_t = argmin_{s ∈ M} ⟨s, ∇f(x_t)⟩
2) Take a convex step in that direction:
   x_{t+1} = (1 − γ_t) x_t + γ_t s_t,   e.g. γ_t = 2/(t+2)

Properties:
- O(1/T) rate
- sparse iterates
- get duality gap for free
- affine invariant
- rate holds even if the linear subproblem is solved only approximately
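As an illustration of the two steps above (not taken from the talk), here is a minimal Frank-Wolfe sketch for a quadratic objective over the probability simplex, where the linear minimization oracle reduces to a coordinate argmin; the objective, dimensions, and tolerance are illustrative choices.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, max_iter=1000, tol=1e-6):
    """Minimal Frank-Wolfe sketch over the probability simplex.

    grad: function returning the gradient of a convex objective f at x.
    The linear subproblem argmin_{s in simplex} <s, grad> is solved by
    putting all mass on the coordinate with the smallest gradient entry.
    """
    x = x0.copy()
    gap = np.inf
    for t in range(max_iter):
        g = grad(x)
        i = int(np.argmin(g))              # 1) linear minimization oracle
        s = np.zeros_like(x)
        s[i] = 1.0
        gap = float(g @ (x - s))           # duality gap certificate, "for free"
        if gap <= tol:
            break
        gamma = 2.0 / (t + 2.0)            # 2) convex step with the standard step size
        x = (1 - gamma) * x + gamma * s
    return x, gap

# Illustrative use: f(x) = 0.5 * ||A x - b||^2 restricted to the simplex.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad_f = lambda x: A.T @ (A @ x - b)
x_star, gap = frank_wolfe_simplex(grad_f, np.ones(5) / 5)
print(x_star, gap)
```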

Slide4

Frank-Wolfe: properties

- convex steps => iterates are sparse convex combinations
- get a duality gap certificate for free (a special case of the Fenchel duality gap), and it also converges as O(1/T)!
- only need to solve the linear subproblem *approximately* (additive/multiplicative bound)
- affine invariant!

[see Jaggi ICML 2013]
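For reference, the gap certificate mentioned above can be written out explicitly (a standard Frank-Wolfe fact, stated here in my notation rather than copied from the slide):

```latex
g(x_t) \;=\; \max_{s \in \mathcal{M}} \langle x_t - s, \, \nabla f(x_t) \rangle
       \;=\; \langle x_t - s_t, \, \nabla f(x_t) \rangle
       \;\ge\; f(x_t) - f(x^*)
```

Since the linearization minorizes f by convexity, this quantity is already computed in step 1 and upper-bounds the suboptimality at no extra cost.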

Slide5

[ICML 2013]

Block-Coordinate Frank-Wolfe Optimization for Structured SVMs

Simon Lacoste-Julien
Martin Jaggi
Patrick Pletscher
Mark Schmidt
Slide6

Structured SVM optimization

structured prediction: learn a classifier whose prediction maximizes a score over structured outputs -> decoding

structured SVM primal: regularized structured hinge loss (vs. the binary hinge loss); evaluating the structured hinge loss is a loss-augmented decoding problem

structured SVM dual: a quadratic program with an exponential number of variables!

primal-dual pair: the dual variables determine the primal weight vector
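The equations on this slide did not survive extraction; the following is my reconstruction of the standard n-slack formulation used in the BCFW paper (the notation is mine and may differ from the original slide).

```latex
% Decoding and the structured hinge loss (evaluated by loss-augmented decoding):
h_w(x_i) = \arg\max_{y \in \mathcal{Y}_i} \langle w, \phi(x_i, y) \rangle, \qquad
\tilde{H}_i(w) = \max_{y \in \mathcal{Y}_i} \, L_i(y) - \langle w, \psi_i(y) \rangle,
\quad \psi_i(y) := \phi(x_i, y_i) - \phi(x_i, y)

% Primal:
\min_{w} \;\; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \tilde{H}_i(w)

% Dual (one simplex block per example, exponentially many coordinates in total):
\max_{\alpha \ge 0, \; \sum_{y} \alpha_i(y) = 1 \; \forall i} \;\;
b^\top \alpha - \frac{\lambda}{2} \|A \alpha\|^2,
\qquad A_{\cdot,(i,y)} = \tfrac{1}{\lambda n} \psi_i(y), \quad b_{(i,y)} = \tfrac{1}{n} L_i(y)

% Primal-dual pair:
w = A\alpha
```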

Slide7

Structured SVM optimization (2)

popular approaches:
- stochastic subgradient method [Ratliff et al. 07, Shalev-Shwartz et al. 10]
  pros: online!
  cons: sensitive to the step-size; don't know when to stop
- cutting plane method (SVMstruct) [Tsochantaridis et al. 05, Joachims et al. 09]
  pros: automatic step-size; duality gap
  cons: batch! -> slow for large n

our approach: block-coordinate Frank-Wolfe on the dual
-> combines the best of both worlds:
- online!
- automatic step-size via analytic line search
- duality gap
- rates also hold for approximate oracles
- rate: O(1/K) error after K passes through the data
Slide8

Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient)

(recap of the earlier slide, shown again before applying FW to the structured SVM dual: the same two steps – minimize the linearization, then take a convex step – and the same properties: O(1/T) rate, sparse iterates, duality gap for free, affine invariance, and the rate holds even with an approximately solved linear subproblem)
Slide9

Frank-Wolfe for structured SVM

Apply the FW algorithm to the structured SVM dual – repeat:
1) find a good feasible direction by minimizing the linearization of the dual objective
2) take a convex step in that direction

key insight: through the primal-dual link w = Aα, the linear subproblem amounts to loss-augmented decoding on each example i, and the FW update becomes a batch subgradient step on the primal

the step size is chosen by analytic line search on the quadratic dual

link between FW and the subgradient method: see [Bach 12]
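Spelling this correspondence out (my reconstruction, following the notation of the BCFW paper rather than the original slide):

```latex
% FW corner of the dual = loss-augmented decoding on every example i:
y_i^* = \arg\max_{y \in \mathcal{Y}_i} \; L_i(y) - \langle w_t, \psi_i(y) \rangle
\;\;\Rightarrow\;\;
w_s = \frac{1}{\lambda n} \sum_{i=1}^{n} \psi_i(y_i^*), \qquad
\ell_s = \frac{1}{n} \sum_{i=1}^{n} L_i(y_i^*)

% FW step expressed in the primal variables (a damped batch subgradient step):
w_{t+1} = (1 - \gamma)\, w_t + \gamma\, w_s, \qquad
\ell_{t+1} = (1 - \gamma)\, \ell_t + \gamma\, \ell_s

% analytic line search on the quadratic dual:
\gamma = \mathrm{clip}_{[0,1]}\!\left(
  \frac{\lambda \langle w_t - w_s,\, w_t \rangle - \ell_t + \ell_s}
       {\lambda \, \| w_t - w_s \|^2} \right)
```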

Slide10

FW for structured SVM: properties

- running FW on the dual corresponds to batch subgradient on the primal, but with an adaptive step-size from the analytic line search and a duality gap stopping criterion
- 'fully corrective' FW on the dual corresponds to the cutting plane algorithm (SVMstruct): still an O(1/T) rate, but this gives a simpler proof of SVMstruct convergence + guarantees for approximate oracles (it was not faster than simple FW in our experiments)
- BUT: still batch => slow for large n...
Slide11

Block-Coordinate Frank-Wolfe (new!)

for constrained optimization over a compact product domain M = M_1 × ... × M_n:
pick i at random; update only block i with a FW step (the block update is written out below)

we proved the same O(1/T) rate as batch FW
-> but each step is n times cheaper
-> and the constant can be the same (e.g. for the SVM)

Properties:
- O(1/T) rate
- sparse iterates
- duality gap guarantees
- affine invariant
- rate holds even if the linear subproblem is solved only approximately
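A minimal write-up of that block update (a standard block-coordinate FW step; the notation and the default step size are my rendering of the paper's):

```latex
% pick i uniformly at random, then take a FW step on block i only:
s_{(i)} = \arg\min_{s \in \mathcal{M}_i} \big\langle s, \, \nabla_{(i)} f(x^{(k)}) \big\rangle, \qquad
x^{(k+1)}_{(i)} = (1 - \gamma)\, x^{(k)}_{(i)} + \gamma\, s_{(i)}, \qquad
x^{(k+1)}_{(j)} = x^{(k)}_{(j)} \;\; (j \ne i),
\quad \gamma = \frac{2n}{k + 2n} \;\text{(or by line search)}
```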

Slide12

Block-Coordinate Frank-Wolfe (new!)

(same algorithm as on the previous slide, now annotated for the structured SVM: the FW step on block i is exactly a loss-augmented decoding problem on example i)

we proved the same O(1/T) rate as batch FW
-> each step is n times cheaper
-> the constant can be the same (e.g. for the SVM)
Slide13

BCFW for structured SVM: properties

- each update requires only 1 oracle call (vs. n for SVMstruct)
- advantages over stochastic subgradient:
  - step-sizes by line-search -> more robust
  - duality gap certificate -> know when to stop
  - guarantees hold for approximate oracles
- so get an O(1/K) error after K effective passes through the data (vs. SVMstruct, which needs a full pass over the data per iteration)
- implementation: https://github.com/ppletscher/BCFWstruct
  - almost as simple as the stochastic subgradient method
  - caveat: need to store one parameter vector per example (or store the dual variables)
- for the binary SVM -> reduces to the DCA method [Hsieh et al. 08]
- interesting link with prox-SDCA [Shalev-Shwartz et al. 12]
(a minimal sketch of the update is given after this slide)
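To make the update concrete, here is a minimal, self-contained sketch of the BCFW update in the primal representation, under my own simplifying assumptions: a toy multiclass problem stands in for general structured outputs, so `loss_augmented_decode`, the 0-1 task loss, and the joint feature map `phi` are illustrative stand-ins rather than the reference implementation (see the BCFWstruct repository for that).

```python
import numpy as np

def phi(x, y, n_classes):
    """Joint feature map for a toy multiclass problem: x copied into block y."""
    out = np.zeros(n_classes * x.shape[0])
    out[y * x.shape[0]:(y + 1) * x.shape[0]] = x
    return out

def loss_augmented_decode(w, x, y_true, n_classes):
    """argmax_y  L_i(y) + <w, phi(x, y)>  with a 0-1 task loss (illustrative oracle)."""
    scores = np.array([float(y != y_true) + w @ phi(x, y, n_classes)
                       for y in range(n_classes)])
    return int(np.argmax(scores))

def bcfw_multiclass(X, Y, n_classes, lam=0.01, n_passes=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(n_classes * d)                 # primal vector, sum of the block vectors
    w_blocks = np.zeros((n, n_classes * d))     # one parameter vector per example
    ell, ell_blocks = 0.0, np.zeros(n)
    for _ in range(n_passes):
        for _ in range(n):
            i = int(rng.integers(n))            # pick a block (example) at random
            y_star = loss_augmented_decode(w, X[i], Y[i], n_classes)
            psi = phi(X[i], Y[i], n_classes) - phi(X[i], y_star, n_classes)
            w_s = psi / (lam * n)               # FW corner for block i
            ell_s = float(y_star != Y[i]) / n
            diff = w_blocks[i] - w_s
            denom = lam * (diff @ diff)
            gamma = 0.0 if denom == 0.0 else float(np.clip(
                (lam * (diff @ w) - ell_blocks[i] + ell_s) / denom, 0.0, 1.0))
            w_new = (1 - gamma) * w_blocks[i] + gamma * w_s   # analytic line-search step
            ell_new = (1 - gamma) * ell_blocks[i] + gamma * ell_s
            w += w_new - w_blocks[i]
            ell += ell_new - ell_blocks[i]
            w_blocks[i], ell_blocks[i] = w_new, ell_new
    return w

# Illustrative run on random data.
rng = np.random.default_rng(1)
X, Y = rng.standard_normal((60, 5)), rng.integers(0, 3, size=60)
w = bcfw_multiclass(X, Y, n_classes=3)
```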

Slide14

More info about constants...

batch FW rate: governed by the "curvature" constant of the objective
BCFW rate: governed by the "product curvature" (the sum of the per-block curvatures) plus an initial-error term
-> the extra term can be removed with line-search

comparing the constants:
- for the structured SVM – the constants are the same
- identity Hessian + cube constraint: the constants are also the same (no speed-up)
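For the record, the rates being compared look as follows (my reconstruction of the statements in the BCFW paper; the constants may be stated slightly differently there):

```latex
% batch Frank-Wolfe, after k iterations (each iteration = n oracle calls):
f(x^{(k)}) - f(x^*) \;\le\; \frac{2\, C_f}{k + 2}

% block-coordinate Frank-Wolfe, after k iterations (each iteration = 1 oracle call):
\mathbb{E}\big[ f(x^{(k)}) \big] - f(x^*) \;\le\; \frac{2n \,\big( C_f^{\otimes} + h_0 \big)}{k + 2n},
\qquad h_0 := f(x^{(0)}) - f(x^*)

% where C_f is the "curvature" and C_f^{\otimes} = \sum_i C_f^{(i)} \le C_f the "product curvature".
```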

Slide15

Sidenote: weighted averaging

it is standard to average the iterates of the stochastic subgradient method:
uniform averaging vs. t-weighted averaging [L.-J. et al. 12], [Shamir & Zhang 13]

the weighted average improves the duality gap for BCFW
it also makes a big difference in test error!
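The two averaging schemes referred to above (standard definitions; the exact normalization used in the talk may differ):

```latex
\bar{w}_T^{\mathrm{unif}} = \frac{1}{T} \sum_{t=1}^{T} w_t
\qquad \text{vs.} \qquad
\bar{w}_T^{\mathrm{weighted}} = \frac{2}{T(T+1)} \sum_{t=1}^{T} t\, w_t
```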

Slide16

Experiments

OCR dataset
CoNLL dataset
Slide17

CoNLL dataset

Surprising test error though!
[plots: optimization error and test error for the different methods – the picture is flipped between the two!]
Slide18

Conclusions for 1st part

applying FW on the dual of the structured SVM:
- unified previous algorithms
- provided a line-search version of batch subgradient

new block-coordinate variant of the Frank-Wolfe algorithm:
- same convergence rate but with a cheaper iteration cost
- yields a robust & fast algorithm for the structured SVM

future work:
- caching tricks
- non-uniform sampling
- regularization path
- explain the weighted-averaging test error mystery
Slide19

On the Equivalence between Herding and Conditional Gradient Algorithms

[ICML 2012]

Simon Lacoste-Julien
Francis Bach
Guillaume Obozinski
Slide20

A motivation: quadrature

Approximating integrals:
- random sampling yields O(1/√T) error
- herding [Welling 2009] yields O(1/T) error! [Chen et al. 2010] (like quasi-Monte Carlo)

This part:
- links herding with an optimization algorithm (conditional gradient / Frank-Wolfe)
- suggests extensions, e.g. a weighted version
- BUT the extensions are worse for learning???
-> yields interesting insights on the properties of herding...
Slide21

Outline

Background:
- herding
- [conditional gradient algorithm]
Equivalence between herding & cond. gradient
Extensions
New rates & theorems
Simulations:
- approximation of integrals with cond. gradient variants
- learned distribution vs. max entropy
Slide22

Review of herding [Welling ICML 2009]

Motivation: learning in an MRF with feature map Φ
[diagram: data -> parameters -> samples]
- learning: (approximate) maximum likelihood / max. entropy <-> moment matching
- (approximate) inference: sampling
- herding: goes directly from the data moments to (pseudo-)samples
Slide23

Herding updates

herding performs subgradient ascent updates on the zero-temperature limit of the log-likelihood (the 'Tipi' function – thanks to Max Welling for the picture):

  x_{t+1} = argmax_{x ∈ X} ⟨w_t, Φ(x)⟩
  w_{t+1} = w_t + μ − Φ(x_{t+1})

Properties:
1) weakly chaotic -> entropy?
2) moment matching: the empirical moments (1/T) Σ_t Φ(x_t) approach μ -> our focus
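A minimal runnable sketch of these two update lines on a toy discrete domain (the domain, feature map, and target moments below are illustrative choices, not from the talk):

```python
import numpy as np

def herding(mu, features, T=200):
    """Herding on a finite domain.

    mu       : target moment vector E_p[phi(x)]          (shape d)
    features : matrix with one row phi(x) per state x    (shape K x d)
    Returns the selected states and the running moment-matching error.
    """
    w = mu.copy()                        # a common initialization: w_0 = mu
    picks, running_sum, errors = [], np.zeros_like(mu), []
    for t in range(1, T + 1):
        x = int(np.argmax(features @ w))       # x_{t+1} = argmax_x <w_t, phi(x)>
        w = w + mu - features[x]               # w_{t+1} = w_t + mu - phi(x_{t+1})
        picks.append(x)
        running_sum += features[x]
        errors.append(np.linalg.norm(running_sum / t - mu))  # ||(1/T) sum phi(x_t) - mu||
    return picks, errors

# Toy example: match the moments of a distribution over 5 states.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((5, 3))              # phi(x) for x = 0..4
p = rng.dirichlet(np.ones(5))                  # target distribution
mu = p @ Phi                                   # target moments
picks, errors = herding(mu, Phi, T=500)
print(errors[-1])                              # the empirical moments approach mu
```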

Slide24

Approx. integrals in RKHS

- reproducing property of the RKHS
- define the mean map μ = E_p[Φ(x)]
- want to approximate integrals of the form E_p[f(x)] for f in the RKHS
- use a weighted sum of point evaluations, i.e. an approximated mean
- the approximation error is then bounded by the moment discrepancy:
  controlling the moment discrepancy is enough to control the error of integrals in the RKHS
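Written out (the standard RKHS quadrature bound, in my notation):

```latex
f(x) = \langle f, \Phi(x) \rangle_{\mathcal{H}}, \qquad
\mu := \mathbb{E}_{p}[\Phi(x)], \qquad
\hat{\mu} := \sum_{t=1}^{T} v_t \, \Phi(x_t)

\Big| \, \mathbb{E}_{p}[f(x)] - \sum_{t=1}^{T} v_t f(x_t) \, \Big|
 = \big| \langle f, \; \mu - \hat{\mu} \rangle_{\mathcal{H}} \big|
 \;\le\; \| f \|_{\mathcal{H}} \, \| \mu - \hat{\mu} \|_{\mathcal{H}}
```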

Slide25

Conditional gradient algorithm (aka Frank-Wolfe)

algorithm to optimize min_{g ∈ M} J(g), with J convex & (twice) continuously differentiable and M convex & compact

repeat:
1) find a good feasible direction by minimizing the linearization of J
2) take a convex step in that direction
-> converges in O(1/T) in general
Slide26

Herding & cond. grad. are equivalent

trick: look at conditional gradient on the dummy objective J(g) = ½ ||g − μ||² over the marginal polytope M = conv{Φ(x)}:
- the herding updates and the cond. grad. updates coincide after a change of variable, with step-size ρ_t = 1/(t+1)
- more generally, subgradient ascent and conditional gradient are Fenchel duals of each other! (see also [Bach 2012])
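The change of variable sketched above can be written out as follows (my reconstruction; it is easy to verify directly):

```latex
% conditional gradient on J(g) = (1/2)||g - \mu||^2 over M = conv{\Phi(x)}, step \rho_t = 1/(t+1):
x_{t+1} = \arg\max_{x} \, \langle \mu - g_t, \, \Phi(x) \rangle, \qquad
g_{t+1} = (1 - \rho_t)\, g_t + \rho_t \, \Phi(x_{t+1})

% the change of variable w_t := t\,(\mu - g_t) recovers exactly the herding updates:
x_{t+1} = \arg\max_{x} \, \langle w_t, \Phi(x) \rangle, \qquad
w_{t+1} = w_t + \mu - \Phi(x_{t+1})
```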

Slide27

Extensions of herding

more general step-sizes -> give a weighted sum of samples

two extensions:
1) line search for the step-size (a closed form is given below)
2) min-norm point algorithm (minimize J(g) over the convex hull of the previously visited points)
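For the quadratic J(g) = ½ ||g − μ||², the line search has a closed form (straightforward to derive; it is not stated explicitly in the extracted text):

```latex
\rho_t^{\star} \;=\; \mathrm{clip}_{[0,1]}\!\left(
  \frac{\langle g_t - \mu, \; g_t - \Phi(x_{t+1}) \rangle}
       {\| g_t - \Phi(x_{t+1}) \|^2} \right),
\qquad
g_{t+1} = (1 - \rho_t^{\star})\, g_t + \rho_t^{\star}\, \Phi(x_{t+1})
```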

Slide28

Rates of convergence & theorems

- with no assumption: conditional gradient yields* an O(1/√T) error on the moments
- if μ is assumed to lie in the relative interior of M with radius r > 0: [Chen et al. 2010] yields an O(1/T) error for herding, whereas the line search version yields an even faster rate [Guélat & Marcotte 1986, Beck & Teboulle 2004]
- Propositions 1) & 2): in the infinite-dimensional kernel setting, μ does not lie in the relative interior of M (i.e. the assumption behind [Chen et al. 2010] doesn't hold!)
Slide29

Simulation 1: approx. integrals

kernel herding, using an RKHS with the Bernoulli polynomial kernel (infinite-dimensional; closed form)
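As a toy stand-in for this kind of experiment (not the setup from the paper: a discretized domain and an RBF kernel replace the Bernoulli polynomial kernel), the sketch below runs kernel herding and compares its quadrature error against i.i.d. sampling.

```python
import numpy as np

def rbf_kernel(a, b, gamma=10.0):
    """RBF kernel matrix between 1-D point sets a and b (illustrative kernel choice)."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def kernel_herding(candidates, p, T=50, gamma=10.0):
    """Greedy kernel herding on a finite candidate set.

    candidates: 1-D array of candidate points; p: target probabilities over them.
    Picks x_{t+1} = argmax_x  mu(x) - (1/t) * sum_{s<=t} k(x, x_s),
    i.e. conditional gradient with step 1/(t+1) on J(g) = 0.5*||g - mu||^2.
    """
    K = rbf_kernel(candidates, candidates, gamma)
    mu = K @ p                                   # mean embedding evaluated at the candidates
    picks, k_sum = [], np.zeros_like(mu)         # k_sum holds sum_s k(., x_s)
    for t in range(T):
        score = mu - (k_sum / t if t > 0 else 0.0)
        i = int(np.argmax(score))
        picks.append(i)
        k_sum += K[:, i]
    return candidates[picks]

# Compare the quadrature error of herding vs. iid sampling for a smooth test function.
grid = np.linspace(0.0, 1.0, 401)
p = np.full(grid.shape, 1.0 / grid.size)         # discretized uniform target
f = lambda x: np.sin(2 * np.pi * x) + x ** 2
true_integral = f(grid) @ p
herd = kernel_herding(grid, p, T=100)
iid = np.random.default_rng(0).choice(grid, size=100, p=p)
print(abs(f(herd).mean() - true_integral), abs(f(iid).mean() - true_integral))
```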

Slide30

Simulation 2: max entropy?

learning independent bits
[plots: error on the moments and error on the distribution]
Slide31

Conclusions for 2nd part

equivalence of herding and conditional gradient:
-> yields better algorithms for quadrature based on moments
-> but highlights the max entropy / moment matching tradeoff!

other interesting points:
- setting up fake optimization problems -> harvest the properties of known algorithms
- the conditional gradient algorithm is useful to know...
- the duality between subgradient and conditional gradient is more general

recent related work:
- link with Bayesian quadrature [Huszár & Duvenaud UAI 2012]
- herded Gibbs sampling [Bornn et al. ICLR 2013]
Slide32

Thank you!