Frank-Wolfe optimization insights in machine learning
Simon Lacoste-Julien
INRIA / École Normale Supérieure, SIERRA Project Team
SMILE – November 4, 2013

Outline
- Frank-Wolfe optimization
- Frank-Wolfe for structured prediction
  - links with previous algorithms
  - block-coordinate extension
  - results for sequence prediction
- Herding as Frank-Wolfe optimization
  - extension: weighted herding
  - simulations for quadrature

Frank-Wolfe algorithm [Frank, Wolfe 1956]
(aka conditional gradient)

Algorithm for constrained optimization:
  $\min_{x \in D} f(x)$, where $f$ is convex & cts. differentiable and $D$ is convex & compact.

FW algorithm – repeat:
1) Find a good feasible direction by minimizing the linearization of $f$ at $x_t$:
   $s_t \in \arg\min_{s \in D} \langle s, \nabla f(x_t) \rangle$
2) Take a convex step in that direction:
   $x_{t+1} = (1 - \gamma_t)\, x_t + \gamma_t\, s_t$, e.g. with $\gamma_t = \frac{2}{t+2}$

Properties:
- O(1/T) rate
- sparse iterates
- get a duality gap for free
- affine invariant
- rate holds even if the linear subproblem is only solved approximately

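A minimal numerical sketch of the algorithm above (my own illustration, not code from the talk; the toy objective and the simplex oracle are assumptions for the example):

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, T=1000, tol=1e-6):
    """Frank-Wolfe / conditional gradient (generic sketch)."""
    x = x0
    for t in range(T):
        g = grad(x)
        s = lmo(g)                         # 1) linear subproblem: argmin_{s in D} <s, g>
        gap = g @ (x - s)                  # duality gap certificate, free by-product
        if gap < tol:
            break
        gamma = 2.0 / (t + 2.0)            # standard step size
        x = (1.0 - gamma) * x + gamma * s  # 2) convex combination step
    return x

# Example: minimize ||x - b||^2 over the probability simplex;
# the simplex LMO just returns the vertex (coordinate vector) with the smallest gradient entry.
b = np.array([0.1, 0.7, 0.2])
lmo = lambda g: np.eye(len(g))[np.argmin(g)]
x_star = frank_wolfe(lambda x: 2.0 * (x - b), lmo, np.ones(3) / 3)
```
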
Frank-Wolfe: properties

- convex steps => iterates are sparse convex combinations of the visited corners
- get a duality gap certificate for free (special case of the Fenchel duality gap), and it also converges as O(1/T)!
- only need to solve the linear subproblem *approximately* (additive/multiplicative bound)
- affine invariant!

[see Jaggi ICML 2013]

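Why the gap comes for free (a standard one-line argument, not specific to the talk): by convexity,
$$ g(x_t) := \max_{s \in D} \langle \nabla f(x_t), x_t - s \rangle \;\ge\; \langle \nabla f(x_t), x_t - x^\star \rangle \;\ge\; f(x_t) - f(x^\star), $$
and the maximizing $s$ is exactly the corner already computed in step 1, so $g(x_t)$ costs nothing extra.
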
Block-Coordinate Frank-Wolfe Optimization for Structured SVMs
[ICML 2013]
Simon Lacoste-Julien, Martin Jaggi, Patrick Pletscher, Mark Schmidt

Structured SVM optimization

structured prediction: learn a classifier $h_w(x) = \arg\max_{y \in \mathcal{Y}} \langle w, \phi(x, y) \rangle$  -> decoding

structured hinge loss (vs. binary hinge loss):
  $H_i(w) := \max_{y \in \mathcal{Y}_i} \; L_i(y) - \langle w, \psi_i(y) \rangle$, with $\psi_i(y) := \phi(x_i, y_i) - \phi(x_i, y)$
  -> loss-augmented decoding

structured SVM primal:
  $\min_w \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^n H_i(w)$

structured SVM dual:
  $\min_{\alpha \ge 0,\; \sum_y \alpha_i(y) = 1} \; \frac{\lambda}{2} \|A\alpha\|^2 - b^\top \alpha$  -> exponential number of variables!

primal-dual pair: $w = A\alpha$ (columns of $A$ are $\frac{1}{\lambda n}\psi_i(y)$, entries of $b$ are $\frac{1}{n}L_i(y)$)

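To make loss-augmented decoding concrete, here is a toy sketch (my illustration, with a hypothetical small label set enumerated by brute force; real structured problems replace the enumeration with dynamic programming over $\mathcal{Y}$):

```python
import numpy as np

def structured_hinge(w, phi, x_i, y_i, labels, loss):
    """H_i(w) = max_y L(y_i, y) - <w, psi_i(y)>, with psi_i(y) = phi(x_i, y_i) - phi(x_i, y)."""
    scores = [loss(y_i, y) - w @ (phi(x_i, y_i) - phi(x_i, y)) for y in labels]
    k = int(np.argmax(scores))       # loss-augmented decoding
    return scores[k], labels[k]      # loss value and the maximizer y*

# Toy multiclass instance: phi(x, y) places x in block y, with 0/1 task loss.
num_classes, dim = 3, 4
phi = lambda x, y: np.concatenate([x if c == y else np.zeros(dim) for c in range(num_classes)])
loss = lambda y_true, y: float(y != y_true)
w = np.zeros(num_classes * dim)
H_i, y_star = structured_hinge(w, phi, np.random.randn(dim), 0, list(range(num_classes)), loss)
```
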
Structured SVM optimization (2)

popular approaches:
- stochastic subgradient method [Ratliff et al. 07, Shalev-Shwartz et al. 10]
  - pros: online!
  - cons: sensitive to the step-size; don't know when to stop
- cutting plane method (SVMstruct) [Tsochantaridis et al. 05, Joachims et al. 09]
  - pros: automatic step-size; duality gap
  - cons: batch! -> slow for large n

our approach: block-coordinate Frank-Wolfe on the dual -> combines the best of both worlds:
- online!
- automatic step-size via analytic line search
- duality gap
- rates also hold for approximate oracles
- rate: O(1/K) error after K passes through the data

(Recap of the Frank-Wolfe algorithm slide from above, shown again before specializing it to the structured SVM dual.)

Frank-Wolfe for structured SVM

Run the FW algorithm on the structured SVM dual:
  $\min_{\alpha} \; \frac{\lambda}{2} \|A\alpha\|^2 - b^\top \alpha$ over the product of simplices.

1) Finding a good feasible direction by minimizing the linearization of the dual objective decouples over the examples; using the primal-dual link $w = A\alpha$,
   key insight: the linear subproblem = loss-augmented decoding on each example $i$.

2) Taking the convex step in that direction becomes a batch subgradient step on the primal (with step size $\gamma/\lambda$).

choose $\gamma$ by analytic line search on the quadratic dual.

link between FW and the subgradient method: see [Bach 12]

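Spelling out the correspondence (my reconstruction, following the notation of the ICML 2013 paper as I recall it):
$$ w_{t+1} = (1-\gamma_t)\, w_t + \gamma_t\, w_s, \qquad w_s = \frac{1}{\lambda n} \sum_{i=1}^n \psi_i(y_i^\star), \qquad y_i^\star \in \arg\max_{y} \; L_i(y) - \langle w_t, \psi_i(y) \rangle; $$
since $\lambda w_t - \frac{1}{n}\sum_i \psi_i(y_i^\star)$ is a subgradient of the primal objective at $w_t$, this FW step equals a primal subgradient step with step size $\gamma_t/\lambda$.
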
FW for structured SVM: properties

running FW on the dual ≈ batch subgradient on the primal,
- but with an adaptive step-size from the analytic line-search
- and a duality gap stopping criterion

'fully corrective' FW on the dual ≈ cutting plane algorithm (SVMstruct)
- still an O(1/T) rate; but provides a simpler proof of SVMstruct convergence + guarantees for approximate oracles
- not faster than simple FW in our experiments

BUT: still batch => slow for large n...

Block-Coordinate Frank-Wolfe (new!)

for constrained optimization over a compact product domain $D = D_1 \times \dots \times D_n$:
pick $i$ at random; update only block $i$ with a FW step:
  $s_i \in \arg\min_{s \in D_i} \langle s, \nabla_i f(x) \rangle$, then $x_i := (1-\gamma)\, x_i + \gamma\, s_i$

we proved the same O(1/T) rate as batch FW
-> but each step is n times cheaper
-> and the constant can be the same (e.g. for the structured SVM)

Properties:
- O(1/T) rate
- sparse iterates
- duality gap guarantees
- affine invariant
- rate holds even if the linear subproblem is only solved approximately

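A generic sketch of the block-coordinate variant (my illustration; the interface and the $\gamma_k = \frac{2n}{k+2n}$ step size follow the paper's analysis as I recall it):

```python
import numpy as np

def block_coordinate_fw(grad_block, lmo_block, x_blocks, T=10000, seed=0):
    """Block-coordinate Frank-Wolfe over a product domain D_1 x ... x D_n (sketch)."""
    rng = np.random.default_rng(seed)
    x = [np.array(b, dtype=float) for b in x_blocks]
    n = len(x)
    for k in range(T):
        i = int(rng.integers(n))              # pick a block uniformly at random
        g_i = grad_block(x, i)                # partial gradient w.r.t. block i
        s_i = lmo_block(g_i, i)               # cheap linear subproblem over D_i only
        gamma = 2.0 * n / (k + 2.0 * n)       # step size from the BCFW analysis
        x[i] = (1.0 - gamma) * x[i] + gamma * s_i
    return x
```
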
(Same BCFW slide shown again, now specialized to the structured SVM dual: the block-$i$ linear subproblem is exactly loss-augmented decoding on example $i$.)

BCFW for structured SVM: properties

each update requires only 1 oracle call (vs. n for SVMstruct)
so get an $\varepsilon$-accurate solution after $O(1/\varepsilon)$ oracle calls, i.e. n times fewer than SVMstruct

advantages over the stochastic subgradient method:
- step-sizes by line-search -> more robust
- duality gap certificate -> know when to stop
- guarantees hold for approximate oracles

implementation: https://github.com/ppletscher/BCFWstruct
- almost as simple as the stochastic subgradient method
- caveat: need to store one parameter vector per example (or store the dual variables)

for the binary SVM -> reduces to the DCA method [Hsieh et al. 08]; interesting link with prox SDCA [Shalev-Shwartz et al. 12]

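A compact sketch of BCFW for the structured SVM (my reconstruction of the paper's primal-variable algorithm from memory; the oracle interface and the line-search formula are stated as I recall them, so treat the details as assumptions):

```python
import numpy as np

def bcfw_ssvm(psi, loss_aug_decode, loss, n, lam, d, T, seed=0):
    """BCFW for the structured SVM (sketch).

    psi(i, y): joint feature difference phi(x_i, y_i) - phi(x_i, y)
    loss_aug_decode(w, i): returns y* maximizing L_i(y) - <w, psi_i(y)>  (the oracle)
    loss(i, y): task loss L(y_i, y)
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(d); ell = 0.0                    # global primal vector and loss term
    w_i = np.zeros((n, d)); ell_i = np.zeros(n)   # per-example copies (the memory caveat)
    for _ in range(T):
        i = int(rng.integers(n))
        y_star = loss_aug_decode(w, i)            # single oracle call per update
        w_s = psi(i, y_star) / (lam * n)          # FW corner for block i
        ell_s = loss(i, y_star) / n
        # analytic line search on the quadratic dual, clipped to [0, 1]
        num = lam * (w_i[i] - w_s) @ w - ell_i[i] + ell_s
        den = lam * np.dot(w_i[i] - w_s, w_i[i] - w_s)
        gamma = min(max(num / (den + 1e-12), 0.0), 1.0)
        w_new = (1.0 - gamma) * w_i[i] + gamma * w_s
        ell_new = (1.0 - gamma) * ell_i[i] + gamma * ell_s
        w += w_new - w_i[i]; ell += ell_new - ell_i[i]   # maintain the global sums
        w_i[i], ell_i[i] = w_new, ell_new
    return w
```
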
More info about constants...

batch FW rate:  $f(x^{(K)}) - f(x^\star) \le \frac{2 C_f}{K+2}$
BCFW rate:  $\mathbb{E}\big[f(x^{(k)})\big] - f(x^\star) \le \frac{2n}{k+2n} \big(C_f^\otimes + h_0\big)$, where $h_0$ is the initial suboptimality
  -> the $h_0$ term can be removed with line-search

$C_f$: the "curvature"; $C_f^\otimes = \sum_i C_f^{(i)}$: the "product curvature"

comparing constants: $C_f^\otimes \le C_f$, so per oracle call BCFW is never worse than batch FW
for the structured SVM – same constants: $n\,C_f^\otimes$ and $C_f$ obey the same bound $4R^2/\lambda$, so the n-times-cheaper steps give a genuine n-times speed-up
identity Hessian + cube constraint: $C_f^\otimes = C_f$ (no speed-up)

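For reference, the curvature constant is the standard one from [Jaggi ICML 2013] (quoted from memory):
$$ C_f \;=\; \sup_{\substack{x, s \in D,\ \gamma \in [0,1] \\ y = x + \gamma (s - x)}} \; \frac{2}{\gamma^2} \Big( f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \Big), $$
and $C_f^{(i)}$ is defined the same way but with $s$ allowed to differ from $x$ only in block $i$; the product curvature is $C_f^\otimes = \sum_{i=1}^n C_f^{(i)}$.
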
Sidenote: weighted averaging

it is standard to average the iterates of the stochastic subgradient method:

uniform averaging: $\bar{w}_T = \frac{1}{T} \sum_{t=1}^T w_t$
vs. t-weighted averaging: $\bar{w}_T = \frac{2}{T(T+1)} \sum_{t=1}^T t \, w_t$  [L.-J. et al. 12], [Shamir & Zhang 13]

weighted averaging improves the duality gap for BCFW
it also makes a big difference in test error!

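One convenient online form of the t-weighted average (simple algebra, how it is usually implemented):
$$ \bar{w}_{k} = (1 - \rho_k)\, \bar{w}_{k-1} + \rho_k\, w_k, \qquad \rho_k = \frac{2}{k+1}, $$
so the weighted average can be maintained at no extra cost alongside the iterates.
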
Experiments

(plots comparing the algorithms on the OCR and CoNLL sequence-labeling datasets)

CoNLL dataset

Surprising test error though: the ordering of the methods on optimization error is flipped on test error!
(plots: optimization error and test error vs. effective passes)

Conclusions for 1st part

applying FW on the dual of the structured SVM:
- unified previous algorithms
- provided a line-search version of batch subgradient

new block-coordinate variant of the Frank-Wolfe algorithm:
- same convergence rate but with a cheaper iteration cost
- yields a robust & fast algorithm for the structured SVM

future work:
- caching tricks
- non-uniform sampling
- regularization path
- explain the weighted-averaging test error mystery

On the Equivalence between Herding and Conditional Gradient Algorithms
[ICML 2012]
Simon Lacoste-Julien, Francis Bach, Guillaume Obozinski

A motivation: quadrature

Approximating integrals:
- random sampling yields $O(1/\sqrt{T})$ error
- herding [Welling 2009] yields $O(1/T)$ error! [Chen et al. 2010] (like quasi-Monte Carlo)

This part:
- links herding with an optimization algorithm (conditional gradient / Frank-Wolfe)
- suggests extensions, e.g. a weighted version with faster convergence
- BUT the extensions are worse for learning??? -> yields interesting insights on the properties of herding...

Outline

- Background: herding; [conditional gradient algorithm]
- Equivalence between herding & conditional gradient
- Extensions
- New rates & theorems
- Simulations:
  - approximation of integrals with conditional gradient variants
  - learned distribution vs. max entropy

Review of herding [Welling ICML 2009]

Setting: learning in an MRF with feature map $\Phi$.
Motivation, as a pipeline:
- data -> learning ((approx.) ML / max. entropy) -> parameters
- parameters -> (approx.) inference (sampling) -> samples
- herding: go directly from data to (pseudo-)samples by moment matching, skipping the parameters

Herding updates

Take the zero-temperature limit of the log-likelihood: this gives the 'Tipi' function
  $F(w) = \langle w, \mu \rangle - \max_{x} \langle w, \Phi(x) \rangle$
(thanks to Max Welling for the picture)

Herding updates = subgradient ascent updates on $F$:
  $x_{t+1} \in \arg\max_x \langle w_t, \Phi(x) \rangle$
  $w_{t+1} = w_t + \mu - \Phi(x_{t+1})$

Properties:
1) weakly chaotic -> entropy?
2) moment matching: $\frac{1}{T}\sum_{t=1}^T \Phi(x_t) \to \mu$  -> our focus

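A tiny sketch of these updates for a finite state space (my illustration; representing the feature map as an explicit matrix `phi_all` is an assumption for the example):

```python
import numpy as np

def herding(phi_all, mu, T):
    """Herding updates over a finite state space (sketch).

    phi_all: (num_states, d) matrix whose rows are the feature vectors Phi(x)
    mu:      target moment vector E_p[Phi(x)]
    """
    w = mu.copy()                          # common initialization w_0 = mu
    samples = []
    for _ in range(T):
        x = int(np.argmax(phi_all @ w))    # x_{t+1} = argmax_x <w_t, Phi(x)>
        samples.append(x)
        w += mu - phi_all[x]               # w_{t+1} = w_t + mu - Phi(x_{t+1})
    return samples
```
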
Approximating integrals in an RKHS

Reproducing property: $f(x) = \langle f, \Phi(x) \rangle$ for $f \in \mathcal{H}$
Define the mean map: $\mu = \mathbb{E}_{p}[\Phi(x)]$
Want to approximate integrals of the form $\mathbb{E}_p[f(x)] = \langle f, \mu \rangle$
Use a weighted sum to get an approximate mean: $\hat{\mu}_T = \sum_{t=1}^T w_t \, \Phi(x_t)$
The approximation error is then bounded by:
  $\big| \mathbb{E}_p[f] - \sum_t w_t f(x_t) \big| = \big| \langle f, \mu - \hat{\mu}_T \rangle \big| \le \|f\|_{\mathcal{H}} \, \|\mu - \hat{\mu}_T\|_{\mathcal{H}}$

-> controlling the moment discrepancy $\|\mu - \hat{\mu}_T\|$ is enough to control the error of integrals in $\mathcal{H}$.

Conditional gradient algorithm (aka Frank-Wolfe)

Algorithm to optimize $\min_{g \in \mathcal{M}} J(g)$, with $J$ convex & (twice) cts. differentiable and $\mathcal{M}$ convex & compact.

Repeat:
1) Find a good feasible direction by minimizing the linearization of $J$:
   $\bar{g}_{t+1} \in \arg\min_{g \in \mathcal{M}} \langle J'(g_t), g \rangle$
2) Take a convex step in that direction:
   $g_{t+1} = (1 - \rho_{t+1})\, g_t + \rho_{t+1}\, \bar{g}_{t+1}$

-> converges in O(1/T) in general

Herding & conditional gradient are equivalent

Trick: look at conditional gradient on the dummy objective $J(g) = \frac{1}{2}\|g - \mu\|^2$ over the marginal polytope $\mathcal{M} = \mathrm{conv}\{\Phi(x)\}$.

cond. grad. updates (with step-size $\rho_{t+1} = \frac{1}{t+1}$):
  $\bar{g}_{t+1} \in \arg\min_{g \in \mathcal{M}} \langle g_t - \mu, g \rangle$
  $g_{t+1} = (1 - \rho_{t+1})\, g_t + \rho_{t+1}\, \bar{g}_{t+1}$

herding updates:
  $x_{t+1} \in \arg\max_x \langle w_t, \Phi(x) \rangle$
  $w_{t+1} = w_t + \mu - \Phi(x_{t+1})$

+ do the change of variable $w_t = t\,(\mu - g_t)$: the two update rules coincide!

Subgradient ascent and conditional gradient are Fenchel duals of each other! (see also [Bach 2012])

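Checking the change of variable (simple algebra): with $\rho_{t+1} = \frac{1}{t+1}$, the convex step gives $(t+1)\,g_{t+1} = t\,g_t + \Phi(x_{t+1})$, so substituting $w_t = t(\mu - g_t)$ yields
$$ w_{t+1} = (t+1)\mu - (t+1)\,g_{t+1} = t\mu - t\,g_t + \mu - \Phi(x_{t+1}) = w_t + \mu - \Phi(x_{t+1}), $$
exactly the herding weight update; and the linear subproblems coincide because $\arg\min_{g \in \mathcal{M}} \langle g_t - \mu, g \rangle$ is attained at $\Phi(x)$ with $x \in \arg\max_x \langle w_t, \Phi(x) \rangle$.
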
Extensions of herding

More general step-sizes $\rho_t$ give a weighted sum: $\hat{\mu}_T = \sum_{t=1}^T w_t \, \Phi(x_t)$

Two extensions:
1) line search for $\rho_t$
2) min-norm point algorithm (minimize $J(g)$ on the convex hull of the previously visited points)

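A sketch of the line-search extension (my illustration; the closed-form step comes from exactly minimizing the quadratic $J$ along the segment, and the finite-state feature matrix is an assumption as before):

```python
import numpy as np

def weighted_herding(phi_all, mu, T, line_search=True):
    """Herding with FW line search on J(g) = 0.5 * ||g - mu||^2 (sketch).

    Returns sample indices and the quadrature weights of the resulting weighted sum.
    """
    g = phi_all[0].copy()                        # start from an arbitrary vertex
    idx, wts = [0], np.array([1.0])
    for t in range(1, T):
        x = int(np.argmax(phi_all @ (mu - g)))   # FW vertex: argmin_g <g_t - mu, g>
        v = phi_all[x]
        if line_search:                          # exact minimizer of ||(1-c) g + c v - mu||^2
            c = float(np.clip((g - mu) @ (g - v) / ((g - v) @ (g - v) + 1e-12), 0.0, 1.0))
        else:
            c = 1.0 / (t + 1.0)                  # classic herding -> uniform weights
        g = (1.0 - c) * g + c * v
        wts = np.append(wts * (1.0 - c), c)      # update the convex-combination weights
        idx.append(x)
    return idx, wts
```
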
Rates of convergence & theorems

No assumption: conditional gradient yields $\|\hat{\mu}_T - \mu\| = O(1/\sqrt{T})$
If we assume $\mu$ lies in the relative interior of $\mathcal{M}$ with radius $r > 0$:
- [Chen et al. 2010] yields $O(1/T)$ for herding
- whereas the line-search version yields a faster (linear) rate [Guélat & Marcotte 1986, Beck & Teboulle 2004]

Propositions:
1) in an infinite-dimensional RKHS, $\mu$ cannot lie in the relative interior of $\mathcal{M}$ with radius $r > 0$
2) so the assumption behind the fast herding rate fails there (i.e. the $O(1/T)$ rate of [Chen et al. 2010] doesn't hold!)

Simulation 1: approximating integrals

Kernel herding, using an RKHS given by the Bernoulli polynomial kernel (infinite-dimensional), for which the relevant quantities are available in closed form.
(plots: quadrature error vs. number of samples for the different variants)

Simulation 2: max entropy?

Learning independent bits.
(plots: error on the moments and error on the distribution)

Conclusions for 2nd part

Equivalence of herding and conditional gradient:
-> yields better algorithms for quadrature based on moments
-> but highlights the max entropy / moment matching tradeoff!

Other interesting points:
- setting up fake optimization problems -> harvest the properties of known algorithms
- the conditional gradient algorithm is useful to know...
- the duality between subgradient and conditional gradient methods is more general

Recent related work:
- link with Bayesian quadrature [Huszár & Duvenaud UAI 2012]
- herded Gibbs sampling [Bornn et al. ICLR 2013]

Thank you!