Near-Optimal Algorithms for Online Matrix Prediction
Presentation Transcript

Slide 1

Near-Optimal Algorithms for Online Matrix Prediction

Elad Hazan (Technion)
Satyen Kale (Yahoo! Labs)
Shai Shalev-Shwartz (Hebrew University)

Slide 2

Three Prediction Problems:
I. Online Collaborative Filtering

Users: {1, 2, …, m}
Movies: {1, 2, …, n}

On round t:
- User iₜ arrives and is interested in movie jₜ
- Output predicted rating pₜ in [-1, 1]
- User responds with actual rating rₜ in [-1, 1]
- Loss = (pₜ – rₜ)²

Comparison class: all m x n matrices with entries in [-1, 1] of trace norm ≤ τ (the trace norm is the sum of the singular values). For each such matrix W, the predicted rating is W(iₜ, jₜ).

Regret = loss of alg – loss of best bounded trace-norm matrix

If no entries are repeated, [Cesa-Bianchi, Shamir '11] give O(n^{3/2}) regret for τ = O(n).
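To make the protocol concrete, here is a minimal numpy sketch (toy numbers and variable names of my own, not the paper's algorithm) of one round against a fixed trace-norm-bounded comparator:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, tau = 5, 4, 3.0

# A candidate comparator: entries in [-1, 1] and trace norm (sum of
# singular values, i.e. the nuclear norm) at most tau.
W = rng.uniform(-1, 1, size=(m, n))
if np.linalg.norm(W, ord='nuc') > tau:
    W *= tau / np.linalg.norm(W, ord='nuc')   # scale into the class

# One round of the protocol.
i_t, j_t = 2, 3                  # user 2 asks about movie 3
p_t = W[i_t, j_t]                # the rating this comparator predicts
r_t = 0.5                        # actual rating in [-1, 1]
loss = (p_t - r_t) ** 2          # squared loss
print(np.linalg.norm(W, ord='nuc'), loss)
```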

Slide 3

Three Prediction Problems:
II. Online Max Cut

2 political parties. Voters: {1, 2, …, n}

On round t:
- Voters iₜ, jₜ arrive
- Output prediction: votes agree or disagree
- Loss = 1 if incorrect prediction, 0 otherwise

Comparison class: all possible bipartitions.
Bipartition prediction = "agree" if iₜ, jₜ are in the same partition, "disagree" otherwise.

Regret = loss of alg – loss of best bipartition

Put weight #(disagree) – #(agree) on each pair {i, j}; then the best bipartition = Max Cut!

Inefficient alg using the 2ⁿ bipartitions as experts: regret = O(√(T log 2ⁿ)) = O(√(nT)).
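For intuition about why the experts approach is information-theoretically fine but inefficient, here is a minimal Hedge sketch over all 2ⁿ bipartitions (feasible only for tiny n; the data and names are illustrative, not from the paper):

```python
import itertools
import numpy as np

n, T, eta = 6, 500, 0.3
rng = np.random.default_rng(1)

experts = list(itertools.product([0, 1], repeat=n))    # all 2^n bipartitions
cum_loss = np.zeros(len(experts))                      # cumulative 0/1 losses
alg_loss = 0

for t in range(T):
    i, j = rng.choice(n, size=2, replace=False)        # voter pair arrives
    preds = np.array([s[i] == s[j] for s in experts])  # each expert: "agree"?
    w = np.exp(-eta * cum_loss)                        # Hedge weights
    predict_agree = (w @ preds) / w.sum() >= 0.5       # weighted majority
    agree = bool(rng.integers(2))                      # true outcome (toy data)
    alg_loss += int(predict_agree != agree)
    cum_loss += (preds != agree)                       # experts' 0/1 losses

print(alg_loss - cum_loss.min())  # regret; Hedge guarantees O(sqrt(T log 2^n))
```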

Slide 4

Three Prediction Problems:
III. Online Gambling
[Abernethy '10; Kleinberg, Niculescu-Mizil, Sharma '10]

Teams: {1, 2, …, n}

In round t:
- Teams iₜ, jₜ compete
- Output: prediction of which team will win
- Loss = 1 if incorrect prediction, 0 otherwise

Comparison class: all possible permutations π.
Permutation π's prediction = iₜ if π(iₜ) ≤ π(jₜ); jₜ otherwise.

Regret = loss of alg – loss of best permutation

Put weight #(i wins) – #(j wins) on each pair (i, j); then the best permutation = Min Feedback Arc Set!

Inefficient alg using the n! permutations as experts: regret = O(√(T log n!)) = O(√(T n log n)).

The trivial regret bound was considered hard to improve efficiently (e.g. [Kanade, Steinke '12]).
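To illustrate the "best permutation = min feedback arc set" claim, here is a brute-force sketch for a toy tournament (a hypothetical helper of my own; only usable for tiny n, since it enumerates all n! rankings):

```python
import itertools
import numpy as np

def best_permutation_loss(wins):
    """wins[i][j] = number of rounds in which team i beat team j.
    A permutation pi (team -> rank) predicts i over j when
    pi[i] < pi[j]; its loss counts the rounds it got wrong.
    Minimizing over all n! permutations is exactly the weighted
    minimum feedback arc set problem."""
    n = len(wins)
    best = float('inf')
    for perm in itertools.permutations(range(n)):
        pi = {team: rank for rank, team in enumerate(perm)}
        loss = sum(wins[j][i]                  # j beat i, but we picked i
                   for i in range(n) for j in range(n)
                   if pi[i] < pi[j])
        best = min(best, loss)
    return best

wins = np.array([[0, 3, 1],
                 [1, 0, 2],
                 [2, 1, 0]])
print(best_permutation_loss(wins))   # -> 4 (e.g. the ranking 0 > 1 > 2)
```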

Slide 5

Results

Problem                           Upper Bound              Lower Bound
Online Collaborative Filtering    Õ(√(√(m+n) · τ · T))     matching up to log factors
Online Max Cut                    Õ(√(nT))                 Ω(√(nT))
Online Gambling                   Õ(√(nT))                 Ω(√(nT))

Notes: the collaborative-filtering lower bound holds even in the stochastic setting and solves the $50 open problem of [Shamir, Srebro '11]; the gambling lower bound is by [Kleinberg, Niculescu-Mizil, Sharma '10]. The upper bounds are instances of the Õ(√(βτT)) theorem of Slide 11 with each class's decomposability constants (Õ hides logarithmic factors).

Slide 6

One meta-problem to rule them all:
Online Matrix Prediction (OMP)

In round t:
- Receive pair (iₜ, jₜ) in [m] x [n]
- Output prediction pₜ in [-1, 1]
- Receive true value yₜ in [-1, 1]
- Suffer loss L(pₜ, yₜ)

Comparison class: a set W of m x n matrices with entries in [-1, 1].
Prediction for matrix W: entry W(iₜ, jₜ).

Regret = loss of alg – loss of best comparison matrix
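A skeletal numpy rendering of the OMP protocol and the regret it measures (the learner here is a trivial placeholder that always predicts 0, and the comparison class is a single toy matrix; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 4, 5, 100

def L(p, y):                          # any Lipschitz loss; squared loss here
    return (p - y) ** 2

W_star = rng.uniform(-1, 1, (m, n))   # one comparison matrix
alg_loss = comp_loss = 0.0
for t in range(T):
    i_t, j_t = rng.integers(m), rng.integers(n)    # pair revealed by nature
    p_t = 0.0                                      # placeholder prediction
    y_t = float(np.clip(W_star[i_t, j_t] + 0.1 * rng.standard_normal(), -1, 1))
    alg_loss += L(p_t, y_t)
    comp_loss += L(W_star[i_t, j_t], y_t)

print(alg_loss - comp_loss)           # regret against this comparison matrix
```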

Slide 7

Online Collaborative Filtering as OMP

Users: {1, 2, …, m}. Movies: {1, 2, …, n}

On round t:
- User iₜ arrives and is interested in movie jₜ
- Output predicted rating pₜ in [-1, 1]
- User responds with actual rating rₜ
- Loss = (pₜ – rₜ)²

Comparison class: W = all m x n matrices with entries in [-1, 1] of trace norm ≤ τ.
For each such matrix W, predicted rating = W(iₜ, jₜ).

Slide 8

Online Max Cut as OMP

2 political parties. Voters: {1, 2, …, n}

On round t:
- Voters iₜ, jₜ arrive
- Output prediction: votes agree or disagree
- Loss = 1 if incorrect prediction, 0 otherwise

Comparison class: all possible bipartitions.
Bipartition prediction = "agree" if iₜ, jₜ are in the same partition, "disagree" otherwise.

W = all 2ⁿ "cut matrices" W_S corresponding to subsets S of [n]:
W_S(i, j) = 0 if both i, j are in S or both are in [n] \ S, and 1 otherwise.

Example with S = the first two voters (rows/columns labeled S or [n] \ S):

             S  S  [n]\S ...
      S    [ 0  0  1  1  1 ]
      S    [ 0  0  1  1  1 ]
    [n]\S  [ 1  1  0  0  0 ]
    [n]\S  [ 1  1  0  0  0 ]
    [n]\S  [ 1  1  0  0  0 ]
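A one-line numpy check (0-indexed voters, taking S = {0, 1}) that the cut-matrix definition reproduces the slide's example:

```python
import numpy as np

def cut_matrix(S, n):
    """W_S(i, j) = 0 if i, j are on the same side of the bipartition
    (both in S or both in [n] \\ S), and 1 otherwise."""
    side = np.array([i in S for i in range(n)])
    return (side[:, None] != side[None, :]).astype(int)

print(cut_matrix({0, 1}, 5))
# [[0 0 1 1 1]
#  [0 0 1 1 1]
#  [1 1 0 0 0]
#  [1 1 0 0 0]
#  [1 1 0 0 0]]
```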

Slide 9

Online Gambling as OMP

Teams: {1, 2, …, n}

In round t:
- Teams iₜ, jₜ compete
- Output: prediction of which team will win
- Loss = 1 if incorrect prediction, 0 otherwise

Comparison class: all possible permutations π.
Permutation π's prediction = iₜ if π(iₜ) ≤ π(jₜ); jₜ otherwise.

W = all n! "permutation matrices" W_π corresponding to permutations π:
W_π(i, j) = 1 if π(i) ≤ π(j), and 0 otherwise.

In the team ordering 1, 2, …, n, a W_π looks like, e.g.:

         1  2  :  :  n
    1  [ 1  0  1  1  0 ]
    2  [ 1  1  1  1  1 ]
    :  [ 0  0  1  1  0 ]
    :  [ 0  0  0  1  0 ]
    n  [ 1  0  1  1  1 ]

Reordering rows and columns by rank π(1), π(2), …, π(n) gives the all-1's upper triangular matrix:

           π(1) π(2)  :   :  π(n)
    π(1) [  1    1    1   1   1  ]
    π(2) [  0    1    1   1   1  ]
    :    [  0    0    1   1   1  ]
    :    [  0    0    0   1   1  ]
    π(n) [  0    0    0   0   1  ]
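A quick numpy sketch of W_π and the reordering claim (the permutation below is an arbitrary illustration, not the slide's example):

```python
import numpy as np

def permutation_matrix(pi):
    """W_pi(i, j) = 1 if pi(i) <= pi(j), else 0: team i is predicted
    to beat team j exactly when i is ranked ahead of j."""
    pi = np.asarray(pi)
    return (pi[:, None] <= pi[None, :]).astype(int)

pi = [2, 0, 3, 1, 4]            # pi[i] = rank of team i
W = permutation_matrix(pi)

# Reordering rows and columns by rank recovers the all-1's
# upper-triangular matrix from the slide.
order = np.argsort(pi)
print(W[np.ix_(order, order)])  # upper triangular, ones on the diagonal
```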

Slide 10

Decomposability

W is (β, τ)-decomposable if

    [ 0    W ]
    [ Wᵀ   0 ]  =  P − N

(a symmetric square matrix of order m + n), where:
- P, N are positive semidefinite
- Diagonal entries Pᵢᵢ, Nᵢᵢ ≤ β
- Sum of traces: Tr(P) + Tr(N) ≤ τ

Class W is (β, τ)-decomposable if every W in W is.
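A small checker (naming is my own) for whether a given pair (P, N) witnesses (β, τ)-decomposability of W, directly transcribing the three conditions above:

```python
import numpy as np

def is_decomposable_witness(W, P, N, beta, tau, tol=1e-8):
    """Check that (P, N) witnesses (beta, tau)-decomposability of W:
    sym(W) = P - N with P, N PSD, diagonals <= beta, Tr(P)+Tr(N) <= tau."""
    m, n = W.shape
    sym = np.zeros((m + n, m + n))
    sym[:m, m:], sym[m:, :m] = W, W.T            # [[0, W], [W^T, 0]]
    psd = lambda A: np.linalg.eigvalsh(A).min() >= -tol
    return (np.allclose(sym, P - N, atol=tol)
            and psd(P) and psd(N)
            and max(P.diagonal().max(), N.diagonal().max()) <= beta + tol
            and np.trace(P) + np.trace(N) <= tau + tol)
```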

Slide 11

Main Result for (β, τ)-decomposable OMP

An efficient algorithm for OMP with (β, τ)-decomposable W and Lipschitz losses, with regret bound Õ(√(β τ T)) (Õ hides logarithmic factors).

Slide 12

The Technology

Online learning problem: in round t,
- Learner chooses a density (i.e. PSD, trace-1) matrix Xₜ
- Nature reveals a loss matrix Mₜ with eigenvalues in [-1, 1]
- Learner suffers loss Tr(MₜXₜ)

Goal: minimize regret = loss of learner – loss of best fixed density matrix.

Theorem: the Matrix Exponentiated Gradient [Tsuda, Rätsch, Warmuth '06] / Matrix Multiplicative Weights [Arora, Kale '07] algorithm achieves regret O(√(T log d)) on this problem, for d × d matrices.
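A compact numpy sketch of the MMW update (the learner plays the matrix exponential of the negative cumulative loss, normalized to trace 1); the loss matrices below are toy data, not the OMP reduction itself:

```python
import numpy as np

def matrix_mw(loss_matrices, eta):
    """Matrix Multiplicative Weights: play the density matrix
    X_t proportional to exp(-eta * sum of past loss matrices),
    then observe M_t and suffer Tr(M_t X_t)."""
    d = loss_matrices[0].shape[0]
    cum = np.zeros((d, d))
    for M in loss_matrices:
        vals, vecs = np.linalg.eigh(-eta * cum)
        vals -= vals.max()                        # numerical stability
        expA = (vecs * np.exp(vals)) @ vecs.T     # matrix exponential
        X = expA / np.trace(expA)                 # density: PSD, trace 1
        yield float(np.trace(M @ X))
        cum += M

rng = np.random.default_rng(0)
Ms = []
for _ in range(50):
    A = rng.uniform(-0.25, 0.25, (4, 4))
    Ms.append((A + A.T) / 2)        # symmetric, eigenvalues in [-1, 1]

alg_loss = sum(matrix_mw(Ms, eta=0.2))
best_loss = np.linalg.eigvalsh(sum(Ms)).min()   # best fixed density matrix
print(alg_loss - best_loss)                     # regret, O(sqrt(T log d))
```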

Slide 13

Overview of Algorithm for OMP

K = all square symmetric X of order 2(m + n) such that:
- X is positive semidefinite
- Diagonals Xᵢᵢ ≤ β
- Trace Tr(X) ≤ τ

Each W in W maps into K via its decomposition:

    [ 0    W ]                         [ P  0 ]
    [ Wᵀ   0 ]  =  P − N    ↦    X  =  [ 0  N ]

Algorithm: Matrix MW + Bregman projections into K.

Slide 14

Decomposability Theorems

- Online Collaborative Filtering: trace norm ≤ τ matrices are (√(m + n), 2τ)-decomposable.
- Online Max Cut: cut matrices W_S are (½, 2n)-decomposable.
- Online Gambling: permutation matrices W_π are (O(log n), O(n log n))-decomposable.


Slide 16

Decomposability for OCF

Thm: Any symmetric matrix M of order n with entries in [-1, 1] and trace norm τ is (√n, τ)-decomposable.

Proof: Take the eigenvalue decomposition M = Σᵢ λᵢ vᵢvᵢᵀ. Define

    P = Σ_{i: λᵢ > 0} λᵢ vᵢvᵢᵀ   and   N = Σ_{i: λᵢ < 0} |λᵢ| vᵢvᵢᵀ.

Then M = P − N, and clearly Tr(P) + Tr(N) = trace-norm(M) = τ.
The diagonals of (P + N)² = M² are bounded by n, since each entry of M lies in [-1, 1]. Because P + N is PSD, (P + N)ᵢᵢ ≤ √(((P + N)²)ᵢᵢ), so the diagonals of (P + N) are bounded by √n. Hence the diagonals of P and N are bounded by √n.
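The proof is easy to check numerically; a minimal sketch on a random symmetric matrix with entries in [-1, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
M = rng.uniform(-1, 1, (n, n))
M = np.triu(M) + np.triu(M, 1).T          # symmetric, entries in [-1, 1]

vals, vecs = np.linalg.eigh(M)            # M = sum_i lambda_i v_i v_i^T
P = (vecs * np.maximum(vals, 0)) @ vecs.T     # positive-eigenvalue part
N = (vecs * np.maximum(-vals, 0)) @ vecs.T    # negative-eigenvalue part

assert np.allclose(M, P - N)
tau = np.abs(vals).sum()                  # trace norm of M
assert np.isclose(np.trace(P) + np.trace(N), tau)
# Diagonals of P and N are bounded by sqrt(n), as the theorem claims.
print(P.diagonal().max(), N.diagonal().max(), np.sqrt(n))
```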


Slide 18

Decomposability for Online Gambling

Thm: The all-1's upper triangular matrix T(n) of order n is (O(log n), O(n log n))-decomposable.

Recursive structure: T(n) is one rank-1 matrix (the all-1's block coupling the two halves) plus two non-overlapping copies of T(n/2) on the diagonal. Letting B(n) denote the diagonal bound β needed at size n:

    B(n) = 1 + B(n/2), so B(n) = O(log n).
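A sketch of the recursion (counting levels rather than building the actual PSD decomposition, which the paper constructs from these blocks; the function name is my own):

```python
import numpy as np

def upper_triangular_recursion(n):
    """Build T(n) recursively: two copies of T(n/2) on the diagonal
    plus a rank-1 all-ones block coupling the two halves. Returns
    (matrix, depth), where depth traces the B(n) = 1 + B(n/2) recurrence."""
    if n == 1:
        return np.ones((1, 1)), 1
    h = n // 2
    T_top, d = upper_triangular_recursion(h)
    T_bot, _ = upper_triangular_recursion(n - h)
    T = np.zeros((n, n))
    T[:h, :h] = T_top
    T[h:, h:] = T_bot
    T[:h, h:] = 1.0               # the rank-1 all-ones coupling block
    return T, d + 1

T, depth = upper_triangular_recursion(16)
assert np.array_equal(T, np.triu(np.ones((16, 16))))
print(depth)   # 1 + log2(16) = 5 levels, matching B(n) = O(log n)
```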

Slide 19

Concluding Remarks

- Gave near-optimal algorithms for various online matrix prediction problems
- Exploited the spectral structure of the comparison matrices to get near-tight convex relaxations
- Solved 2 COLT open problems, from [Abernethy '10] and [Shamir, Srebro '11]
- Open problem: get rid of the logarithmic gap between upper and lower bounds. The decompositions in the paper are optimal up to constant factors, so a fundamentally different algorithm seems necessary.

Slide 20

Thanks!