Zheng Qu: Randomized dual coordinate ascent with arbitrary sampling (PowerPoint presentation)

University of Edinburgh. Optimization & Big Data Workshop, Edinburgh, 6th to 8th May 2015. Joint work with Peter Richtárik (Edinburgh) & Tong Zhang (Rutgers & Baidu).



Presentation Transcript

Slide1

Zheng Qu

University of Edinburgh

Optimization & Big Data Workshop Edinburgh, 6th to 8th May, 2015

Randomized dual coordinate ascent with arbitrary sampling

Joint work with Peter Richtárik (Edinburgh) & Tong Zhang (Rutgers & Baidu)

Slide2

Supervised Statistical Learning

Input (e.g., image, text, clinical measurements, …)
Label (e.g., spam/no spam, stock price)
Predicted label
True label

GOAL

\[A_i \in \mathbb{R}^d, \enspace y_i \in \mathbb{R}\]
\[\mathrm{Find}\enspace w \in \mathbb{R}^d:\]

Training set of data
Predictor
Data
Algorithm
Predictor

Slide3

Supervised Statistical Learning

Input, label, predicted label, true label

GOAL

\[A_i \in \mathbb{R}^d, \enspace y_i \in \mathbb{R}\]
\[\mathrm{Find}\enspace w \in \mathbb{R}^d:\]

Training set of data
Predictor
Data
Algorithm
Predictor

Slide4

Empirical Risk Minimization

Input, label, predicted label, true label

GOAL

\[A_i \in \mathbb{R}^d, \enspace y_i \in \mathbb{R}\]
\[\mathrm{Find}\enspace w \in \mathbb{R}^d:\]

Training set of data
Predictor
Data
Algorithm
Predictor

Empirical risk
Regularization
n = # samples (big!)

Slide5

ERM problem:

\[\min_{w\in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \mathrm{loss}(A_i^\top w, y_i)\]
\[(A_1,y_1), (A_2,y_2), \dots, (A_n,y_n)\sim \emph{Distribution}\]

n = # samples (big!)
Empirical loss
Regularization
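As a hedged, concrete reading of the ERM problem above, the following sketch evaluates the objective for one specific choice of loss and regularizer (squared loss and g(w) = ||w||²/2 are illustrative assumptions here, not the slide's choice):

```python
import numpy as np

# Hypothetical instance of the ERM objective
#   min_w (1/n) sum_i loss(A_i^T w, y_i) + lambda * g(w),
# with squared loss and g(w) = ||w||^2 / 2 assumed for concreteness.
def erm_objective(A, y, w, lam):
    """A is d x n; its columns A_i are the examples, y holds the labels."""
    residuals = A.T @ w - y                      # predicted minus true labels
    empirical_risk = 0.5 * np.mean(residuals ** 2)
    regularization = lam * 0.5 * np.dot(w, w)
    return empirical_risk + regularization

rng = np.random.default_rng(0)
d, n = 5, 100
A = rng.standard_normal((d, n))
y = rng.standard_normal(n)
# With w = 0 the objective reduces to the empirical risk alone.
print(erm_objective(A, y, np.zeros(d), lam=0.1))
```

The columns-as-examples convention matches the A_i notation used throughout the deck.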

Empirical Risk Minimization

Slide6

Algorithm: QUARTZ

Z. Q., P. Richtárik (UoE) and T. Zhang (Rutgers & Baidu Big Data Lab, Beijing)
Randomized dual coordinate ascent with arbitrary sampling
arXiv:1411.5873, 2014

Slide7

Primal-Dual Formulation

ERM problem:
\[\min_{w\in \mathbb{R}^d}\;\; \left[ P(w) \equiv \frac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w)\right]\]

Fenchel conjugates:

Dual problem

Slide8

Intuition behind QUARTZ

Fenchel’s inequality

weak duality
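Weak duality can be checked numerically for the primal-dual pair above. The sketch below assumes, for concreteness, squared loss φ_i(z) = (z − y_i)²/2 and g(w) = ||w||²/2, whose Fenchel conjugates are known in closed form: φ_i*(u) = u²/2 + u·y_i and g*(s) = ||s||²/2 (these specific choices are this example's assumption, not the slide's):

```python
import numpy as np

# Numeric sanity check of weak duality, P(w) >= D(alpha) for ALL (w, alpha),
# which follows from Fenchel's inequality applied term by term.
def primal(A, y, w, lam):
    return np.mean(0.5 * (A.T @ w - y) ** 2) + lam * 0.5 * (w @ w)

def dual(A, y, alpha, lam):
    n = len(y)
    s = A @ alpha / (lam * n)                    # dual-to-primal mapping
    # phi_i*(-alpha_i) = alpha_i^2/2 - alpha_i * y_i for squared loss
    return -np.mean(0.5 * alpha ** 2 - alpha * y) - lam * 0.5 * (s @ s)

rng = np.random.default_rng(1)
d, n, lam = 4, 50, 0.05
A = rng.standard_normal((d, n))
y = rng.standard_normal(n)
w, alpha = rng.standard_normal(d), rng.standard_normal(n)
print(primal(A, y, w, lam) - dual(A, y, alpha, lam))  # duality gap, nonnegative
```

At the optimum the gap closes, which is exactly the optimality condition the next slides exploit.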

Optimality conditions

Slide9

The Primal-Dual Update

STEP 1: PRIMAL UPDATE
STEP 2: DUAL UPDATE

Optimality conditions

Slide10

STEP 1: Primal update

STEP 2: Dual update

Just maintaining

Slide11
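The two-step update of slides 9–10 can be sketched in the serial, uniform-sampling case. The sketch below is the classical SDCA specialization (Quartz proper uses a convex-combination primal update), and it assumes squared loss with an L2 regularizer so the dual coordinate step has a closed form; "STEP 1" is just maintaining w = Aα/(λn):

```python
import numpy as np

# Serial dual coordinate ascent for ridge regression (illustrative
# special case: squared loss, g(w) = ||w||^2 / 2, uniform sampling).
def sdca_ridge(A, y, lam, epochs=300, seed=0):
    d, n = A.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                       # maintained as w = A @ alpha / (lam * n)
    rng = np.random.default_rng(seed)
    for _ in range(epochs * n):
        i = rng.integers(n)               # serial uniform sampling
        # STEP 2: closed-form maximization of the dual over coordinate i.
        delta = (y[i] - alpha[i] - A[:, i] @ w) / (1 + A[:, i] @ A[:, i] / (lam * n))
        alpha[i] += delta
        # STEP 1: just maintain the primal point incrementally.
        w += delta * A[:, i] / (lam * n)
    return w, alpha

rng = np.random.default_rng(2)
d, n, lam = 4, 30, 0.1
A = rng.standard_normal((d, n))
y = rng.standard_normal(n)
w, alpha = sdca_ridge(A, y, lam)
# The exact ridge solution solves (A A^T / n + lam I) w = A y / n.
w_star = np.linalg.solve(A @ A.T / n + lam * np.eye(d), A @ y / n)
print(np.linalg.norm(w - w_star))         # should be small
```

The closed-form delta comes from setting the derivative of the dual with respect to α_i to zero after the step, which is what makes this loss convenient for illustration.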

Randomized Primal-Dual Methods

SDCA: S. Shalev-Shwartz & T. Zhang, 09/2012
mSDCA: M. Takáč, A. Bijral, P. Richtárik & N. Srebro, 03/2013
ASDCA: S. Shalev-Shwartz & T. Zhang, 05/2013
AccProx-SDCA: S. Shalev-Shwartz & T. Zhang, 10/2013
DisDCA: T. Yang, 2013
Iprox-SDCA: P. Zhao & T. Zhang, 01/2014
APCG: Q. Lin, Z. Lu & L. Xiao, 07/2014
SPDC: Y. Zhang & L. Xiao, 09/2014
QUARTZ: Z. Q., P. Richtárik & T. Zhang, 11/2014

Slide12

Convergence Theorem

ESO Assumption: Expected Separable Overapproximation
Convex combination constant

Slide13

Iteration Complexity Result (*)

Slide14

Complexity Results for Serial Sampling

Slide15

Experiment: Quartz vs SDCA, uniform vs optimal sampling

Slide16

QUARTZ with Standard Mini-Batching

Slide17

Data Sparsity

A normalized measure of average sparsity of the data:
"Fully sparse data"
"Fully dense data"

Slide18
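The slide's exact formula lives in an image, so the following is only a hedged stand-in for what a normalized average-sparsity measure could look like: count, for each feature row j of the d × n data matrix, the number of examples ω_j that use it, and normalize by n so fully dense data scores 1:

```python
import numpy as np

# Hypothetical normalized average-sparsity measure (an assumption, not
# the slide's formula): mean over feature rows of omega_j / n, where
# omega_j counts examples with a nonzero in feature j.
def normalized_average_sparsity(A):
    d, n = A.shape
    omega = np.count_nonzero(A, axis=1)   # nonzeros per feature row
    return omega.mean() / n               # 1 for fully dense data

dense = np.ones((3, 4))
sparse = np.eye(4)[:3]                    # one nonzero per feature row
print(normalized_average_sparsity(dense))   # 1.0
print(normalized_average_sparsity(sparse))  # 0.25
```

The two extreme inputs mirror the slide's "fully dense" and "fully sparse" endpoints.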

Iteration Complexity Results

Slide19

Iteration Complexity Results

Slide20

Theoretical Speedup Factor

Linear speedup up to a certain data-independent mini-batch size
Further data-dependent speedup

Slide21

Plots of Theoretical Speedup Factor

Linear speedup up to a certain data-independent mini-batch size
Further data-dependent speedup

Slide22

Theoretical vs Practical Speedup

astro_ph; sparsity: 0.08%; n = 29,882
cov1; sparsity: 22.22%; n = 522,911

Slide23

Comparison with Accelerated Mini-Batch P-D Methods

Slide24

Distribution of Data

n = # dual variables
Data matrix

Slide25

Distributed Sampling

Random set of dual variables

Slide26

Distributed Sampling & Distributed Coordinate Descent

Peter Richtárik and Martin Takáč
Distributed coordinate descent for learning with big data
arXiv:1310.2059, 2013

Previously studied (not in the primal-dual setup):

Olivier Fercoq, Z. Q., Peter Richtárik and Martin Takáč
Fast distributed coordinate descent for minimizing non-strongly convex losses
2014 IEEE Int. Workshop on Machine Learning for Signal Processing, 2014

Jakub Mareček, Peter Richtárik and Martin Takáč
Fast distributed coordinate descent for minimizing partially separable functions
arXiv:1406.0238, 2014

Strongly convex & smooth
Convex & smooth

Slide27

Complexity of Distributed QUARTZ

\[\frac{n}{c\tau} + \max_i\frac{\lambda_{\max}\left( \sum_{j=1}^d \left(1+\frac{(\tau-1)(\omega_j-1)}{\max\{n/c-1,1\}}+ \left(\frac{\tau c}{n} - \frac{\tau-1}{\max\{n/c-1,1\}}\right) \frac{\omega_j'-1}{\omega_j'}\omega_j\right) A_{ji}^\top A_{ji}\right)}{\lambda\gamma c\tau}\]

Slide28
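The bound on slide 27 can be transcribed directly into code. The sketch below assumes, as a simplification, that each A_{ji} is the scalar (j, i) entry of the data matrix, so the λ_max of the inner sum reduces to the sum itself; the sparsity vectors ω, ω′ and the constant γ are taken as inputs rather than derived from the data:

```python
import numpy as np

# Literal (and simplified) transcription of the distributed-QUARTZ bound:
#   n/(c*tau) + max_i [ sum_j beta_j * A_ji^2 ] / (lam * gamma * c * tau),
# with beta_j the bracketed factor from the slide.
def distributed_quartz_bound(A, omega, omega_prime, lam, gamma, c, tau):
    d, n = A.shape
    m = max(n / c - 1, 1)
    beta = (1 + (tau - 1) * (omega - 1) / m
            + (tau * c / n - (tau - 1) / m) * (omega_prime - 1) / omega_prime * omega)
    worst = ((beta[:, None] * A ** 2).sum(axis=0)).max()   # max over examples i
    return n / (c * tau) + worst / (lam * gamma * c * tau)

# Fully sparse data (omega_j = omega_j' = 1) collapses beta to 1,
# so the value is easy to verify by hand: 4/2 + 1/2 = 2.5.
A = np.eye(4)
ones = np.ones(4)
print(distributed_quartz_bound(A, ones, ones, lam=1.0, gamma=1.0, c=1, tau=2))  # 2.5
```

Evaluating the function over a grid of τ shows the trade-off the next slides plot: the n/(cτ) term shrinks with the mini-batch size while β inflates with it.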

Reallocating Load: Theoretical Speedup

Slide29

Theoretical vs Practical Speedup

Slide30

More on ESO

ESO: second-order / curvature information
Local second-order / curvature information (lost / get)

Slide31

Computation of ESO Parameters

Lemma (QR'14b):
\[ \mathbf{E} \left\| \sum_{i\in \hat{S}} A_i \alpha_i\right\|^2 \;\;\leq \;\; \sum_{i=1}^n {\color{blue} p_i} {\color{red} v_i}\|\alpha_i\|^2 \]
\[\Updownarrow\]
\[ P \circ A^\top A \preceq Diag({\color{blue}p}\circ {\color{red}v})\]

\[A = [A_1,A_2,\dots,A_n]\]

Sampling
Data

Slide32
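The matrix form of the lemma on slide 31 can be verified numerically for a concrete sampling. The sketch below uses τ-nice sampling, for which P_ii = τ/n and P_ij = τ(τ−1)/(n(n−1)), and plugs in the standard τ-nice ESO parameter v_i = Σ_j (1 + (τ−1)(ω_j−1)/(n−1)) A_ji² (quoted here as an assumption from the Qu–Richtárik ESO line of work); the check is that Diag(p∘v) − P∘AᵀA is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, tau = 5, 8, 3
A = rng.standard_normal((d, n)) * (rng.random((d, n)) < 0.5)  # sparse-ish data

# omega_j = number of examples touching feature j (clamped to avoid 0/0;
# all-zero rows contribute nothing to v anyway).
omega = np.maximum(np.count_nonzero(A, axis=1), 1)
v = ((1 + (tau - 1) * (omega - 1) / (n - 1))[:, None] * A ** 2).sum(axis=0)
p = np.full(n, tau / n)

# Probability matrix of tau-nice sampling: P_ij = Prob(i and j both sampled).
P = np.full((n, n), tau * (tau - 1) / (n * (n - 1)))
np.fill_diagonal(P, tau / n)

# ESO holds iff this gap matrix is positive semidefinite.
gap = np.diag(p * v) - P * (A.T @ A)
print(np.linalg.eigvalsh(gap).min())   # nonnegative up to rounding
```

The smallest eigenvalue coming out nonnegative is exactly the Loewner inequality P∘AᵀA ⪯ Diag(p∘v) from the lemma.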

Conclusion

QUARTZ (randomized coordinate ascent method with arbitrary sampling)
Direct primal-dual analysis (for arbitrary sampling):
  optimal serial sampling
  tau-nice sampling (mini-batch)
  distributed sampling
Theoretical speedup factor which:
  is a very good predictor of the practical speedup factor
  depends on both the sparsity and the condition number
  shows a weak dependence on how data is distributed

Accelerated QUARTZ?
Randomized fixed point algorithm with relaxation?
…?