Near-Optimal Algorithms for Online Matrix Prediction
Elad Hazan (Technion)
Satyen Kale (Yahoo! Labs)
Shai Shalev-Shwartz (Hebrew University)
Three Prediction Problems:
I. Online Collaborative Filtering
Users: {1, 2, …, m}
Movies: {1, 2, …, n}
On round t:
  User i_t arrives and is interested in movie j_t
  Output predicted rating p_t in [-1, 1]
  User responds with actual rating r_t in [-1, 1]
  Loss = (p_t – r_t)^2
Comparison class: all m x n matrices with entries in [-1, 1] of trace norm ≤ τ (trace norm = sum of singular values)
For each such matrix W, predicted rating = W(i_t, j_t)
Regret = loss of alg – loss of best bounded trace-norm matrix
If no entries are repeated, [Cesa-Bianchi, Shamir ’11]: O(n^{3/2}) regret for τ = O(n)
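A quick NumPy sanity check that the trace norm is the sum of singular values (illustrative, not from the talk):

```python
import numpy as np

# Trace norm (nuclear norm) = sum of singular values.
W = np.random.uniform(-1.0, 1.0, size=(4, 6))
singular_values = np.linalg.svd(W, compute_uv=False)
assert np.isclose(singular_values.sum(), np.linalg.norm(W, ord='nuc'))
```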
Three Prediction Problems:
II. Online Max Cut
2 political parties
Voters: {1, 2, …, n}
On round t:
  Voters i_t, j_t arrive
  Output prediction: votes agree or disagree
  Loss = 1 if incorrect prediction, 0 o.w.
Comparison class: all possible bipartitions
Bipartition prediction = "agree" if i_t, j_t in same partition, "disagree" o.w.
Regret = loss of alg – loss of best bipartition
Weight of edge (i, j) = #(disagree) – #(agree)
Best bipartition = Max Cut!
Inefficient alg using the 2^n bipartitions as experts: regret = O(√(T log 2^n)) = O(√(nT))
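A minimal sketch of this experts construction (mine, for intuition; it takes exponential time):

```python
import itertools, math, random

def online_maxcut_experts(rounds, n, eta=0.1):
    """Multiplicative weights over all 2^n bipartitions (exponential time,
    for intuition only). `rounds` yields ((i, j), agree) with agree a bool."""
    experts = list(itertools.product([0, 1], repeat=n))   # a side for each voter
    w = [1.0] * len(experts)
    mistakes = 0
    for (i, j), agree in rounds:
        preds = [S[i] == S[j] for S in experts]           # "agree" iff same side
        p_agree = sum(wk for wk, pk in zip(w, preds) if pk) / sum(w)
        pred = random.random() < p_agree                  # randomized prediction
        mistakes += int(pred != agree)
        w = [wk * math.exp(-eta) if pk != agree else wk   # penalize wrong experts
             for wk, pk in zip(w, preds)]
    return mistakes
```

Setting eta ≈ √(n/T) recovers the O(√(nT)) bound via the standard experts analysis; the same construction with the n! permutations as experts gives the O(√(T n log n)) bound for online gambling below.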
Three Prediction Problems:
III. Online Gambling
[Abernethy ’10; Kleinberg, Niculescu-Mizil, Sharma ’10]
Teams: {1, 2, …, n}
In round t:
  Teams i_t, j_t compete
  Output: prediction of which team will win
  Loss = 1 if incorrect prediction, 0 o.w.
Comparison class: all possible permutations π
Permutation π prediction = i_t if π(i_t) ≤ π(j_t); j_t o.w.
Regret = loss of alg – loss of best permutation
Weight of edge i → j = #(i wins) – #(j wins)
Best permutation = Min Feedback Arc Set!
Inefficient alg using the n! permutations as experts: regret = O(√(T log n!)) = O(√(T n log n))
This trivial bound was considered hard to achieve efficiently (e.g. [Kanade, Steinke ’12])
Results

Problem                          Upper Bound           Lower Bound
Online Collaborative Filtering   Õ(√(τ √(m+n) T))      Ω(√(τ √n T))  (stochastic; solves $50 open problem of [Srebro, Shamir ’11])
Online Max Cut                   Õ(√(nT))              Ω(√(nT))
Online Gambling                  Õ(√(nT))              Ω(√(T n log n))  (by [Kleinberg, Niculescu-Mizil, Sharma ’10])

(Õ hides logarithmic factors.)
One meta-problem to rule them all:
Online Matrix Prediction (OMP)

In round t:
  Receive pair (i_t, j_t) in [m] x [n]
  Output prediction p_t in [-1, 1]
  Receive true value y_t in [-1, 1]
  Suffer loss L(p_t, y_t)
Comparison class: set W of m x n matrices with entries in [-1, 1]
Prediction for matrix W: entry W(i_t, j_t)
Regret = loss of alg – loss of best comparison matrix
(Figure: an m x n matrix, rows indexed 1..m, columns 1..n.)
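A minimal sketch of the OMP round structure (the `alg` interface here is hypothetical, just to fix the protocol):

```python
def run_omp(alg, rounds, loss):
    """`rounds` yields ((i_t, j_t), y_t); `alg` exposes predict(i, j) -> [-1, 1]
    and update(i, j, y)."""
    total = 0.0
    for (i, j), y in rounds:
        p = alg.predict(i, j)   # p_t in [-1, 1]
        total += loss(p, y)     # e.g. squared loss for collaborative filtering
        alg.update(i, j, y)
    return total
```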
Online Collaborative Filtering as OMP

Users: {1, 2, …, m}
Movies: {1, 2, …, n}
On round t:
  User i_t arrives and is interested in movie j_t
  Output predicted rating p_t in [-1, 1]
  User responds with actual rating r_t
  Loss = (p_t – r_t)^2
Comparison class: W = all m x n matrices with entries in [-1, 1] of trace norm ≤ τ
For each such matrix W, predicted rating = W(i_t, j_t)
Online Max Cut as OMP

2 political parties
Voters: {1, 2, …, n}
On round t:
  Voters i_t, j_t arrive
  Output prediction: votes agree or disagree
  Loss = 1 if incorrect prediction, 0 o.w.
Comparison class: all possible bipartitions
Bipartition prediction = "agree" if i_t, j_t in same partition, "disagree" o.w.
W = all 2^n "cut matrices" W_S corresponding to subsets S of [n]
W_S(i, j) = 0 if both i, j in S or both in [n] \ S; = 1 o.w.
Example cut matrix (n = 5, S = {1, 2}):

            S        [n] \ S
   S     [ 0  0  |  1  1  1 ]
         [ 0  0  |  1  1  1 ]
 [n]\S   [ 1  1  |  0  0  0 ]
         [ 1  1  |  0  0  0 ]
         [ 1  1  |  0  0  0 ]
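Cut matrices in NumPy (illustrative; 0-indexed):

```python
import numpy as np

def cut_matrix(S, n):
    """W_S(i, j) = 0 if i, j lie on the same side of the cut, 1 otherwise."""
    side = np.array([i in S for i in range(n)])
    return (side[:, None] != side[None, :]).astype(int)

print(cut_matrix({0, 1}, 5))   # reproduces the 5x5 example above
```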
Online Gambling as OMP

Teams: {1, 2, …, n}
In round t:
  Teams i_t, j_t compete
  Output: prediction of which team will win
  Loss = 1 if incorrect prediction, 0 o.w.
Comparison class: all possible permutations π
Permutation π prediction = i_t if π(i_t) ≤ π(j_t); j_t o.w.
W = all n! "permutation matrices" W_π corresponding to permutations π
W_π(i, j) = 1 if π(i) ≤ π(j); = 0 o.w.

With rows and columns ordered by π, W_π is the all-1's upper triangular matrix:

          π(1) π(2)   :    :  π(n)
  π(1)  [  1    1    1    1    1 ]
  π(2)  [  0    1    1    1    1 ]
   :    [  0    0    1    1    1 ]
   :    [  0    0    0    1    1 ]
  π(n)  [  0    0    0    0    1 ]

In the original team order 1, 2, …, n it is a simultaneous row/column permutation of that matrix, e.g. for n = 5 and π = (3, 1, 4, 5, 2):

         1    2    :    :    n
  1   [  1    0    1    1    0 ]
  2   [  1    1    1    1    1 ]
  :   [  0    0    1    1    0 ]
  :   [  0    0    0    1    0 ]
  n   [  1    0    1    1    1 ]
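The comparator matrix of a permutation, in NumPy (illustrative; π given as an array of ranks over 0-indexed teams):

```python
import numpy as np

def permutation_matrix(pi):
    """W_pi(i, j) = 1 if pi(i) <= pi(j), else 0."""
    pi = np.asarray(pi)
    return (pi[:, None] <= pi[None, :]).astype(int)

print(permutation_matrix([3, 1, 4, 5, 2]))   # reproduces the 5x5 example above
```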
Decomposability

W is (β, τ)-decomposable if

  [ 0    W ]
  [ W^T  0 ]  =  P − N

(a symmetric square matrix of order m + n), where:
  P, N are positive semidefinite
  Diagonal entries P_ii, N_ii ≤ β
  Sum of traces Tr(P) + Tr(N) ≤ τ

Class W is (β, τ)-decomposable if every W in W is.
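A direct checker for the definition (illustrative NumPy; the function name is mine):

```python
import numpy as np

def is_decomposition(W, P, N, beta, tau, tol=1e-8):
    """Verify that (P, N) witnesses the (beta, tau)-decomposability of W."""
    m, n = W.shape
    S = np.block([[np.zeros((m, m)), W], [W.T, np.zeros((n, n))]])
    return (np.allclose(S, P - N)                                       # S = P - N
            and np.linalg.eigvalsh(P).min() >= -tol                     # P psd
            and np.linalg.eigvalsh(N).min() >= -tol                     # N psd
            and max(np.diag(P).max(), np.diag(N).max()) <= beta + tol   # diagonal bound
            and np.trace(P) + np.trace(N) <= tau + tol)                 # trace bound
```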
Main Result for (β, τ)-decomposable OMP

An efficient algorithm for OMP with (β, τ)-decomposable W and Lipschitz losses, with regret bound

  Õ(√(β τ T))

(up to the Lipschitz constant and logarithmic factors).
The Technology

Matrix Exponentiated Gradient [Tsuda, Rätsch, Warmuth ’06] / Matrix Multiplicative Weights [Arora, K. ’07]
Online learning problem: in round t,
  Learner chooses density (i.e. psd, trace 1) matrix X_t
  Nature reveals loss matrix M_t with eigenvalues in [-1, 1]
  Learner suffers loss Tr(M_t X_t)
Goal: minimize regret = loss of learner – loss of best density matrix
Theorem: the Matrix MW algorithm attains regret O(√(T log d)) for d x d matrices.
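A minimal sketch of the Matrix MW update (X_t proportional to the matrix exponential of the negated cumulative loss; no projections here):

```python
import numpy as np

def matrix_mw_loss(loss_matrices, eta=0.1):
    """Play X_t ∝ exp(-eta * sum_{s<t} M_s); return the total loss."""
    d = loss_matrices[0].shape[0]
    cumulative = np.zeros((d, d))
    total = 0.0
    for M in loss_matrices:
        # matrix exponential of a symmetric matrix via eigendecomposition,
        # normalized to trace 1 so that X_t is a density matrix
        w, V = np.linalg.eigh(cumulative)
        e = np.exp(-eta * (w - w.min()))   # shift for numerical stability
        X = (V * (e / e.sum())) @ V.T
        total += np.trace(M @ X)
        cumulative += M
    return total
```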
Overview of Algorithm for OMP

Embed each W in W as the block-diagonal matrix
  [ P  0 ]
  [ 0  N ]
of order 2(m + n), where
  [ 0    W ]
  [ W^T  0 ]  =  P − N.

All such embeddings lie in the convex set K of all square symmetric X of order 2(m + n) s.t.
  X is positive semidefinite
  Diagonals X_ii ≤ β
  Trace Tr(X) ≤ τ

Algorithm: Matrix MW + Bregman projections onto K (prediction extraction sketched below).
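How a prediction could be read off an iterate X in K (a sketch under the embedding above; the index arithmetic and clipping are my additions):

```python
import numpy as np

def predict_from_iterate(X, i, j, m, n):
    """X is 2(m+n) x 2(m+n), approximately block-diag(P, N);
    recover W(i, j) = (P - N)(i, m + j)."""
    d = m + n
    p = X[i, m + j] - X[d + i, d + m + j]
    return float(np.clip(p, -1.0, 1.0))
```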
Decomposability Theorems

Online Collaborative Filtering: trace norm ≤ τ matrices are (√(m + n), 2τ)-decomposable.
Online Max Cut: cut matrices W_S are (½, 2n)-decomposable.
Online Gambling: permutation matrices W_π are (O(log n), O(n log n))-decomposable.
Decomposability for OCF

Thm: Any symmetric matrix M of order n with entries in [-1, 1] and trace norm τ is (√n, τ)-decomposable.

Proof sketch: take the eigenvalue decomposition M = Σ_i λ_i v_i v_i^T and define
  P = Σ_{i: λ_i ≥ 0} λ_i v_i v_i^T   and   N = −Σ_{i: λ_i < 0} λ_i v_i v_i^T.
Clearly Tr(P) + Tr(N) = Σ_i |λ_i| = trace-norm(M) = τ.
Diagonals of (P + N)^2 = M^2 are bounded by n (each is a sum of n squared entries from [-1, 1]).
So diagonals of the psd matrix P + N are bounded by √n, and hence so are the diagonals of P and N.
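The construction in NumPy (illustrative):

```python
import numpy as np

def pos_neg_parts(M):
    """Split a symmetric M into psd P, N with M = P - N, via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    P = (V * np.maximum(w, 0.0)) @ V.T    # positive eigenvalue part
    N = (V * np.maximum(-w, 0.0)) @ V.T   # negative eigenvalue part (negated)
    return P, N

# Tr(P) + Tr(N) = sum of |eigenvalues| = trace norm of M, and for entries
# in [-1, 1] the diagonals of P and N are at most sqrt(n).
```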
Decomposability for Online Gambling

Thm: The all-1's upper triangular matrix T(n) of order n is (O(log n), O(n log n))-decomposable.
(This suffices: each W_π is a simultaneous row/column permutation of T(n), and decomposability is invariant under such permutations.)

Recursive structure:

  T(n) = [ T(n/2)   1's    ]
         [ 0        T(n/2) ]

i.e., one rank-1 all-1's block plus two non-overlapping copies of T(n/2).
Since the two T(n/2) blocks live on disjoint index sets, the diagonal bound satisfies
  B(n) = 1 + B(n/2), so B(n) = O(log n),
and each recursion level adds O(n) to the total trace, giving O(n log n).
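A concrete version of the recursion (my own implementation of the divide-and-conquer idea; constants are not tuned to the paper): in the symmetric embedding, each all-1's block 1_A 1_B^T contributes ½(a+b)(a+b)^T − ½(a−b)(a−b)^T, where a, b are the indicator vectors of the row/column index sets, so every recursion level adds at most ½ to each diagonal entry.

```python
import numpy as np

def decompose_triangular(n):
    """(P, N) with [[0, T], [T^T, 0]] = P - N, T = all-1's upper triangular."""
    P = np.zeros((2 * n, 2 * n))
    N = np.zeros((2 * n, 2 * n))

    def add_ones_block(rows, cols):
        # symmetric embedding of the all-1's block rows x cols,
        # written as a difference of two rank-1 psd matrices
        a = np.zeros(2 * n); a[list(rows)] = 1.0             # row side
        b = np.zeros(2 * n); b[[n + c for c in cols]] = 1.0  # column side
        P[:, :] += 0.5 * np.outer(a + b, a + b)
        N[:, :] += 0.5 * np.outer(a - b, a - b)

    def rec(idx):
        if len(idx) == 1:
            add_ones_block(idx, idx)          # the diagonal 1
            return
        h = len(idx) // 2
        add_ones_block(idx[:h], idx[h:])      # the rank-1 all-1's block
        rec(idx[:h]); rec(idx[h:])            # two non-overlapping T(n/2)'s

    rec(list(range(n)))
    return P, N

n = 8
P, N = decompose_triangular(n)
T = np.triu(np.ones((n, n)))
S = np.block([[np.zeros((n, n)), T], [T.T, np.zeros((n, n))]])
assert np.allclose(S, P - N)
print(np.diag(P).max(), np.trace(P) + np.trace(N))  # ~O(log n), ~O(n log n)
```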
Concluding Remarks

Gave near-optimal algorithms for various online matrix prediction problems.
Exploited spectral structure of the comparison matrices to get near-tight convex relaxations.
Solved 2 COLT open problems, from [Abernethy ’10] and [Shamir, Srebro ’11].
Open problem: get rid of the logarithmic gap between upper and lower bounds.
The decompositions in the paper are optimal up to constant factors, so a fundamentally different algorithm seems necessary.
Thanks!