Slide 1
Factorbird: a Parameter Server Approach to Distributed Matrix Factorization
Sebastian Schelter, Venu Satuluri, Reza Zadeh
Distributed Machine Learning and Matrix Computations workshop, in conjunction with NIPS 2014

Slide 2
Latent Factor Models
- Given a sparse n x m matrix M
- Returns factors U and V of rank k
- Applications: dimensionality reduction, recommendation, inference

Slide 3
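The factorization can be sketched in a few lines of NumPy (the dimensions and variable names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 6, 5, 2            # a small n x m matrix, rank-k factors
U = rng.normal(size=(n, k))  # one k-dimensional latent vector per row
V = rng.normal(size=(m, k))  # one k-dimensional latent vector per column

# The model approximates M by the rank-k product U @ V.T:
# each entry m_ij is estimated as the dot product u_i . v_j.
M_hat = U @ V.T
assert M_hat.shape == (n, m)
assert np.linalg.matrix_rank(M_hat) <= k
```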
Seem familiar?
So why not just use SVD?
SVD!

Slide 4
Problems with SVD
(Feb 24, 2015 edition)

Slide 5
Revamped loss function
- g – global bias term
- bU_i – user-specific bias term for user i
- bV_j – item-specific bias term for item j
- Prediction function: p(i, j) = g + bU_i + bV_j + u_i^T v_j
- a(i, j) – analogous to SVD's m_ij (ground truth)
- New loss function (weighted squared error over the observed entries, plus L2 regularization):

  Σ_{(i,j) observed} w(i, j) (p(i, j) − a(i, j))^2 + λ (||u_i||^2 + ||v_j||^2 + bU_i^2 + bV_j^2)

Slide 6
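A minimal sketch of the biased prediction and the regularized, weighted loss described above (Python/NumPy for illustration; Factorbird itself is written in Scala):

```python
import numpy as np

def predict(i, j, g, bU, bV, U, V):
    """Prediction p(i, j) = g + bU_i + bV_j + u_i^T v_j."""
    return g + bU[i] + bV[j] + U[i] @ V[j]

def loss(entries, w, g, bU, bV, U, V, lam):
    """Weighted squared error over observed entries, plus L2 regularization."""
    err = sum(w(i, j) * (predict(i, j, g, bU, bV, U, V) - a) ** 2
              for (i, j), a in entries.items())
    reg = lam * (np.sum(U ** 2) + np.sum(V ** 2)
                 + np.sum(bU ** 2) + np.sum(bV ** 2))
    return err + reg

# With every parameter zero, the prediction is 0 for all (i, j),
# so the loss on one observed entry a(0, 1) = 1 is w * (0 - 1)^2 = 1.
bU, bV = np.zeros(3), np.zeros(4)
U, V = np.zeros((3, 2)), np.zeros((4, 2))
assert loss({(0, 1): 1.0}, lambda i, j: 1.0, 0.0, bU, bV, U, V, 0.0) == 1.0
```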
Algorithm

Slide 7
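The algorithm is SGD on that loss. One update step can be sketched as follows (a toy single-machine version, not Factorbird's implementation; the global bias g is held fixed here, and all names are illustrative):

```python
import numpy as np

def sgd_step(i, j, a_ij, g, bU, bV, U, V, eta, lam, w_ij=1.0):
    """One SGD update on a single observed entry (i, j).
    g (the global bias) is treated as a fixed constant here."""
    err = (g + bU[i] + bV[j] + U[i] @ V[j]) - a_ij  # p(i, j) - a(i, j)
    ui = U[i].copy()                                # save pre-update u_i
    U[i]  -= eta * (w_ij * err * V[j] + lam * U[i])
    V[j]  -= eta * (w_ij * err * ui   + lam * V[j])
    bU[i] -= eta * (w_ij * err + lam * bU[i])
    bV[j] -= eta * (w_ij * err + lam * bV[j])
    return err

# Repeatedly stepping on one entry drives its prediction error to zero.
bU, bV = np.zeros(1), np.zeros(1)
U, V = np.zeros((1, 2)), np.zeros((1, 2))
for _ in range(100):
    err = sgd_step(0, 0, 1.0, 0.0, bU, bV, U, V, eta=0.1, lam=0.0)
assert abs(err) < 1e-3
```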
Problems
- The resulting U and V, for graphs with millions of vertices, still amount to hundreds of gigabytes of floating-point values.
- SGD is inherently sequential; either locking or multiple passes are required to synchronize.

Slide 8
Problem 1: size of parameters
Solution: Parameter Server architecture

Slide 9
Problem 2: simultaneous writes
Solution: Hogwild-style lock-free SGD
…so what?

Slide 10
Lock-free concurrent updates?
Assumptions:
- f is Lipschitz continuously differentiable
- f is strongly convex
- Ω (size of the hypergraph) is small
- Δ (fraction of edges that intersect any variable) is small
- ρ (sparsity of the hypergraph) is small

Slide 11
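The Hogwild idea (lock-free SGD whose concurrent updates rarely collide when the problem is sparse) can be illustrated with a toy sketch. Python threads serialize on the GIL, so this shows the lock-free access pattern rather than real parallel speedup:

```python
import threading
import numpy as np

rng = np.random.default_rng(1)
d = 100
target = rng.normal(size=d)
x = np.zeros(d)  # shared parameter vector, updated with no locks

def worker(order, eta=0.05):
    # Coordinate-wise gradient steps for f(x) = sum_i (x_i - target_i)^2.
    # Each update touches a single coordinate, so writes from different
    # threads rarely collide: the sparsity that Hogwild relies on.
    for i in order:
        x[i] -= eta * 2.0 * (x[i] - target[i])

# Pre-draw each thread's coordinate sequence to avoid sharing the RNG.
orders = [rng.integers(0, d, size=5000) for _ in range(4)]
start_loss = float(np.sum((x - target) ** 2))
threads = [threading.Thread(target=worker, args=(o,)) for o in orders]
for t in threads:
    t.start()
for t in threads:
    t.join()
end_loss = float(np.sum((x - target) ** 2))
assert end_loss < 0.1 * start_loss  # converged despite lock-free updates
```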
Factorbird Architecture

Slide 12
Parameter server architecture
Open source! http://parameterserver.org/

Slide 13
Factorbird Machinery
- memcached – distributed memory object caching system
- finagle – Twitter's RPC system
- HDFS – persistent file store for data
- Scalding – Scala front-end for Hadoop MapReduce jobs
- Mesos – resource manager for learner machines

Slide 14
Factorbird stubs

Slide 15
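The stubs hide remote parameter storage behind a simple read/write interface. A hypothetical in-process stand-in (the class and method names are invented for illustration, not Factorbird's actual API):

```python
import numpy as np

class ParameterServerStub:
    """Toy in-process stand-in for a remote parameter server shard.
    (In Factorbird the store is memcached, reached via finagle RPC;
    this class and its method names are invented for illustration.)"""

    def __init__(self, k, seed=0):
        self.k = k
        self.rng = np.random.default_rng(seed)
        self.table = {}

    def get(self, key):
        # Lazily initialize unseen factor vectors, as a learner would
        # on first touch.
        if key not in self.table:
            self.table[key] = self.rng.normal(scale=0.01, size=self.k)
        return self.table[key]

    def update(self, key, delta):
        self.table[key] = self.get(key) + delta

ps = ParameterServerStub(k=4)
u = ps.get(("U", 42))           # fetch user 42's factor vector
ps.update(("U", 42), -0.1 * u)  # push back an SGD-style delta
assert ps.get(("U", 42)).shape == (4,)
```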
Model assessment
- Matrix factorization assessed using RMSE (root-mean-square error)
- SGD performance is often a function of hyperparameters:
  - λ: regularization
  - η: learning rate
  - k: number of latent factors

Slide 16
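RMSE is straightforward to compute; a small sketch:

```python
import math

def rmse(pairs):
    """Root-mean-square error over (prediction, ground truth) pairs."""
    se = sum((p - a) ** 2 for p, a in pairs)
    return math.sqrt(se / len(pairs))

# One exact and one off-by-two prediction: sqrt((0 + 4) / 2) = sqrt(2).
assert rmse([(1.0, 1.0), (0.0, 2.0)]) == math.sqrt(2.0)
```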
[Hyper]Parameter grid search
- aka "parameter scans": finding the optimal combination of hyperparameters
- Parallelize!

Slide 17
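Grid search enumerates the Cartesian product of hyperparameter values and evaluates each combination independently, which is why it parallelizes trivially. A sketch with an invented toy validation surface (the grid values and scoring function are illustrative):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical scoring function: validation RMSE for one combination.
# (A made-up smooth surface with its minimum at lam=0.01, eta=0.1, k=2.)
def validation_rmse(lam, eta, k):
    return (lam - 0.01) ** 2 + (eta - 0.1) ** 2 + 0.001 * k

grid = {
    "lam": [0.001, 0.01, 0.1],   # regularization
    "eta": [0.01, 0.1, 1.0],     # learning rate
    "k":   [2, 5, 10],           # number of latent factors
}
combos = list(itertools.product(*grid.values()))

# Every combination is independent, so the evaluations run in parallel.
with ThreadPoolExecutor(max_workers=4) as ex:
    scores = list(ex.map(lambda c: validation_rmse(*c), combos))

best = combos[min(range(len(combos)), key=scores.__getitem__)]
assert best == (0.01, 0.1, 2)
```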
Experiments
"RealGraph"
- Not a dataset; a framework for creating a graph of user–user interactions on Twitter
- Kamath, Krishna, et al. "RealGraph: User Interaction Prediction at Twitter." User Engagement Optimization Workshop @ KDD, 2014.

Slide 18
Experiments
Data: binarized adjacency matrix of a subset of the Twitter follower graph
- a(i, j) = 1 if user i interacted with user j, 0 otherwise
- All prediction errors weighted equally (w(i, j) = 1)
- 100 million interactions
- 440,000 [popular] users

Slide 19
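Building the binarized, uniformly weighted training matrix can be sketched as follows (the interaction pairs are made up for illustration):

```python
# Hypothetical interaction log: (user i, user j) pairs, duplicates allowed.
interactions = [(0, 1), (0, 2), (1, 2), (0, 1)]

# Binarized adjacency: a(i, j) = 1 if i interacted with j; absent pairs
# are implicitly 0. Duplicate interactions collapse to a single 1.
a = {(i, j): 1.0 for i, j in interactions}

# All prediction errors weighted equally.
w = lambda i, j: 1.0

assert a[(0, 1)] == 1.0 and (2, 0) not in a
assert len(a) == 3
```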
Experiments
- 80% training, 10% validation, 10% testing

Slide 20
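The 80/10/10 split over observed entries can be sketched as (a toy grid of cells stands in for the real interaction data):

```python
import random

rng = random.Random(0)
entries = [(i, j) for i in range(20) for j in range(20)]  # observed cells
rng.shuffle(entries)  # shuffle before splitting so the sets are unbiased

n = len(entries)
train = entries[: int(0.8 * n)]              # 80% training
val   = entries[int(0.8 * n): int(0.9 * n)]  # 10% validation
test  = entries[int(0.9 * n):]               # 10% testing

assert len(train) == 320 and len(val) == 40 and len(test) == 40
```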
Experiments
- k = 2
- Homophily

Slide 21
Experiments
Scalability of Factorbird
- Large RealGraph subset: 229M x 195M (44.6 quadrillion cells)
- 38.5 billion non-zero entries
- Single SGD pass through the training set: ~2.5 hours
- ~40 billion parameters

Slide 22
Important to note
As with most (if not all) distributed platforms:

Slide 23
Future work
- Support streaming (user follows)
- Simultaneous factorization
- Fault tolerance
- Reduce network traffic: s/memcached/custom application/g
- Load balancing

Slide 24
Strengths
- Excellent extension of prior work (Hogwild, RealGraph)
- Current and [mostly] open technology: Hadoop, Scalding, Mesos, memcached
- Clear problem, clear solution, clear validation

Slide 25
Weaknesses
- Lack of detail, lack of detail, lack of detail
- How does the number of machines affect runtime?
- What were the performance metrics on the large RealGraph subset?
- What were some of the properties of the dataset (when was it collected, how were edges determined, what does "popular" mean, etc.)?
- How did other factorization methods perform by comparison?

Slide 26
Questions?