Semi-Stochastic Gradient Descent Methods
Jakub Konečný (joint work with Peter Richtárik)
University of Edinburgh
Introduction
Large scale problem setting
Problems are often structured, frequently arising in machine learning
Structure – a sum of functions: minimize $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$
$n$ is BIG
Examples
Linear regression (least squares): $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2$
Logistic regression (classification): $f_i(x) = \log\big(1 + \exp(-b_i\, a_i^\top x)\big)$
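A minimal Python sketch of these component losses and their gradients (NumPy-based; the helper names and the data $(a_i, b_i)$ are illustrative, not taken from the slides):

```python
import numpy as np

def least_squares_fi(x, a_i, b_i):
    """Component f_i(x) = 0.5 * (a_i^T x - b_i)^2 and its gradient."""
    r = a_i @ x - b_i
    return 0.5 * r ** 2, r * a_i

def logistic_fi(x, a_i, b_i):
    """Component f_i(x) = log(1 + exp(-b_i * a_i^T x)) and its gradient, with label b_i in {-1, +1}."""
    z = -b_i * (a_i @ x)
    loss = np.logaddexp(0.0, z)              # log(1 + exp(z)), numerically stable
    grad = (-b_i / (1.0 + np.exp(-z))) * a_i
    return loss, grad
```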
Assumptions
Lipschitz continuity of the derivative of each $f_i$
Strong convexity of $f$
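Spelled out in a standard form (with $L$ the Lipschitz constant and $\mu$ the strong-convexity parameter; the exact notation on the slide may differ slightly), these assumptions read:

```latex
% L-Lipschitz continuity of the gradient of each component f_i:
\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\| \qquad \forall\, x, y
% mu-strong convexity of the average f:
f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\,\|y - x\|^2 \qquad \forall\, x, y
```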
Gradient Descent (GD)
Update rule: $x_{k+1} = x_k - h\, \nabla f(x_k)$
Fast convergence rate – linear (the error shrinks by a constant factor per iteration)
Alternatively, for accuracy $\varepsilon$ we need $\mathcal{O}\big(\kappa \log(1/\varepsilon)\big)$ iterations, where $\kappa = L/\mu$
Complexity of a single iteration – $n$ (measured in gradient evaluations)
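A minimal Python sketch of this update (the oracle name grad_f and the loop structure are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, x0, h, n_iters):
    """GD update x_{k+1} = x_k - h * grad f(x_k).

    grad_f(x) evaluates the full gradient, so each call touches all n
    components; one iteration therefore costs n gradient evaluations.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        x = x - h * grad_f(x)
    return x
```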
Stochastic Gradient Descent (SGD)
Update rule: $x_{k+1} = x_k - h_k\, \nabla f_{i_k}(x_k)$, with $i_k$ chosen uniformly at random from $\{1,\dots,n\}$ and $h_k$ a step-size parameter
Why it works: the stochastic gradient is unbiased, $\mathbb{E}\big[\nabla f_i(x)\big] = \nabla f(x)$
Slow convergence – for accuracy $\varepsilon$ we need $\mathcal{O}(1/\varepsilon)$ iterations
Complexity of a single iteration – $1$ (measured in gradient evaluations)
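A matching Python sketch of SGD under the same conventions (the interface grad_fi(i, x), returning the gradient of the i-th component, is an assumed convenience, not from the slides):

```python
import numpy as np

def sgd(grad_fi, n, x0, stepsize, n_iters, seed=0):
    """SGD update x_{k+1} = x_k - h_k * grad f_i(x_k), i uniform in {0, ..., n-1}.

    Each iteration evaluates a single component gradient; stepsize(k)
    returns the (typically decreasing) step size h_k.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(n_iters):
        i = rng.integers(n)                  # uniform index => unbiased estimate of grad f(x)
        x = x - stepsize(k) * grad_fi(i, x)
    return x

# A common choice for mu-strongly convex problems: stepsize = lambda k: 1.0 / (mu * (k + 1))
```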
Goal
GD: fast convergence, but $n$ gradient evaluations in each iteration
SGD: slow convergence, but the complexity of each iteration is independent of $n$
Combine the two in a single algorithm
Semi-Stochastic Gradient Descent (S2GD)
Intuition
The gradient does not change drastically
We could reuse the information from an “old” gradient
Modifying the “old” gradient
Imagine someone gives us a “good” point $y$ and the full gradient $\nabla f(y)$
The gradient at a point $x$, near $y$, can be expressed as
$\nabla f(x) = \nabla f(y) + \big(\nabla f(x) - \nabla f(y)\big)$
Approximation of the gradient: $\nabla f(x) \approx \nabla f(y) + \nabla f_i(x) - \nabla f_i(y)$ for a randomly chosen index $i$
Already computed gradient: $\nabla f(y)$; gradient change: $\nabla f(x) - \nabla f(y)$, which we can try to estimate cheaply by $\nabla f_i(x) - \nabla f_i(y)$
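A one-line sanity check (not on the slide) that this estimate is unbiased when $i$ is drawn uniformly from $\{1,\dots,n\}$:

```latex
\mathbb{E}_i\big[\nabla f(y) + \nabla f_i(x) - \nabla f_i(y)\big]
  \;=\; \nabla f(y) + \nabla f(x) - \nabla f(y)
  \;=\; \nabla f(x)
```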
The S2GD Algorithm
Simplification: the size of the inner loop is random, following a geometric rule
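The pseudocode itself did not survive extraction; below is a hedged Python sketch of the method as described in the Konečný–Richtárik paper (the interface grad_fi(i, x) and the parameter names h, m, nu, n_epochs are illustrative). Each epoch computes one full gradient at a reference point, then runs a random number of cheap, variance-reduced inner steps.

```python
import numpy as np

def s2gd(grad_fi, n, x0, h, m, n_epochs, nu=0.0, seed=0):
    """Semi-Stochastic Gradient Descent (sketch).

    grad_fi(i, x): gradient of the i-th component f_i at x.
    h: stepsize; m: maximum inner-loop length; nu: lower bound on the
    strong-convexity parameter (nu = 0 gives a uniform inner-loop length).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_epochs):
        # Full gradient at the reference point (costs n component gradients).
        g = sum(grad_fi(i, x) for i in range(n)) / n
        # Random inner-loop length t in {1, ..., m}, P(t) ~ (1 - nu*h)^(m - t).
        probs = (1.0 - nu * h) ** (m - np.arange(1, m + 1))
        t_max = rng.choice(np.arange(1, m + 1), p=probs / probs.sum())
        y = x.copy()
        for _ in range(t_max):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient (cheap: 2 component evaluations).
            y = y - h * (g + grad_fi(i, y) - grad_fi(i, x))
        x = y
    return x
```

With nu = 0 the inner-loop length is uniform on {1, ..., m}, which essentially recovers an SVRG-like special case (mentioned later under related methods).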
Theorem
Convergence rate
How to set the parameters?
One term of the rate can be made arbitrarily small by decreasing the stepsize $h$
For any fixed $h$, the other term can be made arbitrarily small by increasing the inner-loop size $m$
Setting the parameters
Fix target accuracy $\varepsilon$
The accuracy is achieved by setting the three parameters appropriately: the # of epochs $k$, the stepsize $h$, and the # of inner iterations $m$
Total complexity (in gradient evaluations): # of epochs $\times$ (one full gradient evaluation $+$ the cheap inner iterations), roughly $k\,(n + 2m)$, since each cheap iteration costs two component-gradient evaluations
Complexity
S2GD complexity: $\mathcal{O}\big((n + \kappa)\log(1/\varepsilon)\big)$ gradient evaluations in total
GD complexity: $\mathcal{O}\big(\kappa \log(1/\varepsilon)\big)$ iterations $\times\ n$ (complexity of a single iteration) $= \mathcal{O}\big(n\kappa\log(1/\varepsilon)\big)$ in total
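As an illustrative back-of-the-envelope comparison (the numbers $n = 10^6$ and $\kappa = 10^3$ are made up, not from the slides):

```latex
\text{GD: } \; n\,\kappa\,\log(1/\varepsilon) \approx 10^{9}\,\log(1/\varepsilon)
\qquad\text{vs.}\qquad
\text{S2GD: } \; (n + \kappa)\,\log(1/\varepsilon) \approx 10^{6}\,\log(1/\varepsilon)
```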
Related Methods
SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
Refreshes a single stochastic gradient in each iteration
Needs to store $n$ gradients
Similar convergence rate
Cumbersome analysis
MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
Similar to SAG, slightly worse performance
Elegant analysis
Related Methods
SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
Arises as a special case of S2GD
Prox-SVRG (Tong Zhang, Lin Xiao, 2014)
Extended to the proximal setting
EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
Handles simple constraints; worse convergence rate
Experiment
Example problem, with