Slide 1: Matrix Factorization
Slide 2: Recovering latent factors in a matrix

[Figure: an n x m matrix V, with n user rows and m movie columns, entries v11 ... vij ... vnm]

V[i,j] = user i's rating of movie j
Slide 3: Recovering latent factors in a matrix

[Figure: the n x m matrix V (entries v11 ... vij ... vnm) is approximated by the product of an n x 2 user-factor matrix with rows (x1,y1), (x2,y2), ..., (xn,yn) and a 2 x m movie-factor matrix with rows a1 ... am and b1 ... bm]

V[i,j] = user i's rating of movie j
Slide 4: talk pilfered from ….. KDD 2011

Slide 5: [figure]
Slide 6: Recovering latent factors in a matrix

[Figure: V ~ W H, where V is n x m (entries v11 ... vij ... vnm), W is n x r with rows (x1,y1), ..., (xn,yn), H is r x m with rows a1 ... am and b1 ... bm, and r is the rank of the factorization]

V[i,j] = user i's rating of movie j

Slides 7-9: [figures]
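As a concrete illustration of the approximation V ~ W H above, here is a minimal pure-Python sketch. Only the names V, W, H, and r come from the slides; the toy factor values are made up, and H is stored with one row per movie (the transpose of the r x m matrix in the figure) for convenience:

```python
r = 2  # rank of the factorization

# Hypothetical factors for 2 users and 3 movies (illustrative values only).
W = [[1.0, 0.5],   # user 0
     [0.2, 1.5]]   # user 1
H = [[4.0, 1.0],   # movie 0
     [0.0, 2.0],   # movie 1
     [3.0, 0.0]]   # movie 2

def predict(i, j):
    """Reconstructed entry V[i,j] ~ sum_k W[i][k] * H[j][k]."""
    return sum(W[i][k] * H[j][k] for k in range(r))

# Reconstruct the full n x m matrix of predicted ratings.
V_hat = [[predict(i, j) for j in range(len(H))] for i in range(len(W))]
```

The point of the low-rank form is storage and generalization: n*r + r*m numbers stand in for all n*m ratings.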
Slide 10: for image denoising
Slide 11: Matrix factorization as SGD

[Figure; surviving label: "step size"]

Slides 12-13: [figures]
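Only the "step size" label survives from the update-rule figure, but the standard SGD step for squared-error matrix factorization can be sketched as follows. The names (eta for the step size, sgd_step) and the toy numbers are my assumptions, not from the slides:

```python
eta = 0.1  # step size

def sgd_step(w_i, h_j, v_ij, eta):
    """One SGD step on a single observed rating v_ij,
    assuming squared error loss (v_ij - w_i . h_j)^2."""
    pred = sum(a * b for a, b in zip(w_i, h_j))
    err = v_ij - pred
    # Update each factor using the *old* value of the other.
    new_w = [a + eta * err * b for a, b in zip(w_i, h_j)]
    new_h = [b + eta * err * a for a, b in zip(w_i, h_j)]
    return new_w, new_h

w, h = [0.5, 0.5], [0.5, 0.5]
for _ in range(100):
    w, h = sgd_step(w, h, 3.0, eta)
# After enough steps, w . h approaches the observed rating 3.0.
```

Each observed entry touches only one user vector and one movie vector, which is what makes the parallelization scheme later in the talk possible.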
Slide 14: Matrix factorization as SGD - why does this work?

[Figure; surviving label: "step size"]
Slide 15: Matrix factorization as SGD - why does this work? Here's the key claim:
Slide 16: Checking the claim

Think of SGD for logistic regression: the LR loss compares y and ŷ = dot(w,x). This is similar, but now we update both w (the user weights) and x (the movie weights).
Slide 17: What loss functions are possible?

N1, N2 - diagonal matrices, sort of like IDF factors for the users/movies

"generalized" KL-divergence
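Assuming the usual NMF-style definition, the "generalized" KL-divergence between an observed nonnegative matrix V and its reconstruction WH can be computed like this (the function name and the toy matrices are illustrative, and all entries are assumed positive):

```python
import math

def generalized_kl(V, V_hat):
    """D(V || V_hat) = sum_ij [ v_ij * log(v_ij / vh_ij) - v_ij + vh_ij ].
    Zero when the reconstruction is exact, positive otherwise."""
    total = 0.0
    for row, row_hat in zip(V, V_hat):
        for v, vh in zip(row, row_hat):
            total += v * math.log(v / vh) - v + vh
    return total

V     = [[1.0, 2.0], [3.0, 4.0]]
V_hat = [[1.0, 2.0], [3.0, 4.0]]
# An exact reconstruction gives divergence 0.
```

Unlike plain KL-divergence, the extra (- v + vh) terms let it apply to matrices whose entries do not sum to one.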
Slide 18: What loss functions are possible? [figure]
Slide 19: What loss functions are possible? [figure]
Slide 20: ALS = alternating least squares
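The slides only name ALS, but the idea can be sketched on a rank-1, fully observed toy matrix, where each alternating least-squares solve has a closed form: fixing h, the best w[i] is (sum_j V[i][j]*h[j]) / (sum_j h[j]^2), and symmetrically for h. All names and data here are illustrative:

```python
V = [[2.0, 4.0],
     [1.0, 2.0]]   # exactly rank 1: row 1 is half of row 0

w = [1.0, 1.0]  # one factor per user
h = [1.0, 1.0]  # one factor per movie

for _ in range(20):
    # Fix h, solve the least-squares problem for each w[i] in closed form.
    denom = sum(x * x for x in h)
    w = [sum(V[i][j] * h[j] for j in range(len(h))) / denom
         for i in range(len(w))]
    # Fix w, solve for each h[j].
    denom = sum(x * x for x in w)
    h = [sum(V[i][j] * w[i] for i in range(len(w))) / denom
         for j in range(len(h))]

# Because V is exactly rank 1, w[i] * h[j] recovers V[i][j].
```

Each half-step is a convex least-squares problem, which is why ALS is easy to parallelize even though the joint problem is non-convex.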
Slide 21: talk pilfered from ….. KDD 2011

Slides 22-24: [figures]
Slide 25: Similar to McDonnell et al with perceptron learning
Slide 26: Slow convergence…..

Slides 27-32: [figures]
Slide 33: More detail….

Randomly permute the rows/cols of the matrix.
Chop V, W, H into blocks of size d x d: n/d blocks in W, m/d blocks in H.
Group the data: pick a set of blocks with no overlapping rows or columns (a stratum); repeat until all blocks in V are covered.
Train the SGD: process strata in series; process the blocks within a stratum in parallel.
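The blocking scheme above can be sketched as follows: with a d x d grid of blocks, shifting the block diagonal yields d strata whose blocks share no row or column groups (the helper name strata and the value of d are mine):

```python
d = 3  # number of row/column block groups (toy value)

def strata(d):
    """Stratum s contains blocks (i, (i + s) % d) for each row group i,
    so no two blocks in a stratum share a row group or a column group."""
    return [[(i, (i + s) % d) for i in range(d)] for s in range(d)]

all_strata = strata(d)
# Blocks within a stratum touch disjoint pieces of W and H, so they can
# be trained in parallel; the d strata run in series and together cover
# every block of V exactly once.
```

This is what makes the SGD updates safe to run in parallel: updates in different blocks of a stratum never touch the same user vector or movie vector.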
Slide 34: More detail…. (Z was V)
Slide 35: More detail….

Initialize W, H randomly, not at zero.
Choose a random ordering (random sort) of the points in a stratum in each "sub-epoch".
Pick the strata sequence by permuting the rows and columns of M, using M'[k,i] as the column index of row i in sub-epoch k.
Use "bold driver" to set the step size: increase the step size when the loss decreases (in an epoch); decrease the step size when the loss increases.
Implemented in Hadoop and R/Snowfall.

M = [matrix figure]
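The "bold driver" heuristic described above can be sketched like this. The growth and decay factors (1.05 and 0.5) are common choices, not values given on the slides:

```python
def bold_driver(step, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """After each epoch: grow the step size slightly if the loss went
    down, cut it sharply if the loss went up."""
    if curr_loss < prev_loss:
        return step * grow    # loss decreased: speed up a little
    return step * shrink      # loss increased: back off hard

step = 0.1
step = bold_driver(step, prev_loss=10.0, curr_loss=8.0)  # loss fell
step = bold_driver(step, prev_loss=8.0, curr_loss=9.0)   # loss rose
# step grew to ~0.105, then was halved to ~0.0525.
```

The asymmetry (gentle growth, sharp shrink) keeps the optimizer from diverging after a bad epoch while still accelerating on easy stretches.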
Slide 36: [figure]

Slide 37: Wall Clock Time

[Plot: 8 nodes, 64 cores, R/snow]

Slides 38-41: [figures]
Slide 42: Number of Epochs [plot]

Slides 43-46: [figures]
Slide 47: Varying rank

[Plot: 100 epochs for all]
Slide 48: Hadoop scalability

Hadoop process setup time starts to dominate.
Slide 49: Hadoop scalability [figure]