Slide 1: Factorbird: a Parameter Server Approach to Distributed Matrix Factorization
Sebastian Schelter, Venu Satuluri, Reza Zadeh
Distributed Machine Learning and Matrix Computations workshop, in conjunction with NIPS 2014
Slide 2: Latent Factor Models
- Given a sparse n x m matrix M
- Returns U and V of rank k
- Applications: dimensionality reduction, recommendation, inference
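As a toy illustration of the setup above (shapes and data are made up, not from the talk): if M is built as a rank-k product, its effective rank is exactly k, which is what a factorization into U and V exploits.

```python
import numpy as np

# Construct a rank-k matrix M = U_true @ V_true.T and confirm its rank.
rng = np.random.default_rng(0)
n, m, k = 6, 5, 2
U_true = rng.normal(size=(n, k))
V_true = rng.normal(size=(m, k))
M = U_true @ V_true.T                     # rank k by construction

# Count singular values above numerical noise: the effective rank.
s = np.linalg.svd(M, compute_uv=False)
print(int(np.sum(s > 1e-8)))              # prints 2
```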
Slide 3: Seem familiar?
So why not just use SVD?
SVD!
Slide 4: Problems with SVD
(Feb 24, 2015 edition)
Slide 5: Revamped loss function
- g – global bias term
- bUi – user-specific bias term for user i
- bVj – item-specific bias term for item j
- prediction function: p(i, j) = g + bUi + bVj + ui^T vj
- a(i, j) – analogous to SVD's mij (ground truth)
- New loss function: weighted squared error, summed over the observed entries, w(i, j) (p(i, j) - a(i, j))^2, plus λ regularization
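The prediction function and loss can be sketched directly from the definitions above. The exact regularizer on the slide is an image, so the plain L2 terms below are an assumption; `entries` is a hypothetical list of observed cells.

```python
import numpy as np

def predict(g, b_u, b_v, U, V, i, j):
    # p(i, j) = g + bUi + bVj + ui^T vj
    return g + b_u[i] + b_v[j] + U[i] @ V[j]

def loss(g, b_u, b_v, U, V, entries, lam):
    # entries: iterable of (i, j, a_ij, w_ij) observed cells.
    err = sum(w * (predict(g, b_u, b_v, U, V, i, j) - a) ** 2
              for i, j, a, w in entries)
    # L2 regularization on factors and biases (assumed form).
    reg = lam * (np.sum(U ** 2) + np.sum(V ** 2)
                 + np.sum(b_u ** 2) + np.sum(b_v ** 2))
    return err + reg
```

With zero biases, U = [[1, 0]], V = [[2, 0]] and one observed cell a(0, 0) = 1 with weight 1, the prediction is 2 and the unregularized loss is (2 - 1)^2 = 1.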
Slide 6: Algorithm
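The algorithm figure does not survive extraction; below is a hedged sketch of a single SGD step for the biased model, using the standard biased-MF gradients (an assumption, since the slide's pseudocode is an image). eta is the learning rate, lam the regularization weight; the bias and factor arrays are mutated in place, and g is returned because Python scalars are immutable.

```python
import numpy as np

def sgd_step(g, b_u, b_v, U, V, i, j, a, w, eta, lam):
    # Weighted residual for the observed cell (i, j).
    e = w * (g + b_u[i] + b_v[j] + U[i] @ V[j] - a)
    g -= eta * e
    b_u[i] -= eta * (e + lam * b_u[i])
    b_v[j] -= eta * (e + lam * b_v[j])
    # Tuple RHS is evaluated before assignment, so both updates
    # use the pre-step values of U[i] and V[j].
    U[i], V[j] = (U[i] - eta * (e * V[j] + lam * U[i]),
                  V[j] - eta * (e * U[i] + lam * V[j]))
    return g
```

Repeatedly applying the step to one observed cell drives its squared error down, which is the behavior the convergence argument on the next slides relies on.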
Slide 7: Problems
- The resulting U and V, for graphs with millions of vertices, still amount to hundreds of gigabytes of floating-point values.
- SGD is inherently sequential; either locking or multiple passes are required to synchronize.
Slide 8: Problem 1: size of parameters
Solution: Parameter Server architecture
Slide 9: Problem 2: simultaneous writes
Solution: lock-free concurrent updates
…so what?
Slide 10: Lock-free concurrent updates?
Assumptions:
- f is Lipschitz continuously differentiable
- f is strongly convex
- Ω (size of hypergraph) is small
- Δ (fraction of edges that intersect any variable) is small
- ρ (sparsity of hypergraph) is small
Slide 11: Hogwild! Lock-free updates
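A toy Hogwild!-style run, with made-up data and shapes: four threads apply SGD updates to shared factor matrices with no locks at all, and under the sparsity assumptions above the occasional stale read or clobbered write does not stop the training error from dropping.

```python
import threading
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 20, 20, 2
U = rng.normal(scale=0.1, size=(n, k))
V = rng.normal(scale=0.1, size=(m, k))
# Sparse synthetic observations: a(i, j) = 1 on ~1/3 of the cells.
entries = [(i, j, 1.0) for i in range(n) for j in range(m)
           if (i + j) % 3 == 0]

def sq_error():
    return sum((U[i] @ V[j] - a) ** 2 for i, j, a in entries)

def worker(shard):
    # Plain SGD on one shard of the observations; no locks anywhere.
    for _ in range(30):
        for i, j, a in shard:
            e = U[i] @ V[j] - a
            U[i], V[j] = U[i] - 0.05 * e * V[j], V[j] - 0.05 * e * U[i]

before = sq_error()
threads = [threading.Thread(target=worker, args=(entries[t::4],))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
after = sq_error()
```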
Slide 12: Factorbird Architecture
Slide 13: Parameter server architecture
Open source: http://parameterserver.org/
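A minimal sketch of the pull/push pattern a parameter server exposes (illustrative only; this is not the parameterserver.org API): learners pull the parameter rows they need, compute updates locally, and push deltas back, while the server owns the authoritative copy and aggregates.

```python
import numpy as np

class ParameterServer:
    """Toy in-process stand-in for a sharded remote parameter table."""

    def __init__(self, n_rows, k):
        self.table = np.zeros((n_rows, k))

    def pull(self, row):
        # Learner fetches a (possibly stale) copy of one row.
        return self.table[row].copy()

    def push(self, row, delta):
        # Server-side aggregation of a learner's update.
        self.table[row] += delta

ps = ParameterServer(n_rows=4, k=2)
ps.push(0, np.array([0.5, -0.5]))   # update from learner A
ps.push(0, np.array([0.5, 0.5]))    # update from learner B
print(ps.pull(0))                   # prints [1. 0.]
```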
Slide 14: Factorbird Machinery
- memcached – distributed memory object caching system
- Finagle – Twitter's RPC system
- HDFS – persistent file store for data
- Scalding – Scala front-end for Hadoop MapReduce jobs
- Mesos – resource manager for learner machines
Slide 15: Factorbird stubs
Slide 16: Model assessment
- Matrix factorization assessed using RMSE (root-mean-square error)
- SGD performance is often a function of hyperparameters:
  - λ: regularization
  - η: learning rate
  - k: number of latent factors
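RMSE over held-out (prediction, ground-truth) pairs is a one-liner:

```python
import math

def rmse(pairs):
    # pairs: iterable of (prediction, ground_truth)
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

print(rmse([(1.0, 1.0), (0.0, 2.0)]))   # sqrt(4 / 2) ≈ 1.414
```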
Slide 17: [Hyper]parameter grid search
- aka "parameter scans": finding the optimal combination of hyperparameters
- Parallelize!
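A parameter scan is just the Cartesian product of the candidate values (the grid below is an assumed example, not the one from the talk); each (λ, η, k) combination trains independently of the others, which is why the scan parallelizes trivially across learner machines.

```python
from itertools import product

lambdas = [0.01, 0.1]      # λ: regularization candidates (assumed)
etas = [0.001, 0.01]       # η: learning-rate candidates (assumed)
ks = [2, 5]                # k: latent-factor counts (assumed)

# Every combination is an independent training run.
grid = list(product(lambdas, etas, ks))
print(len(grid))           # prints 8
```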
Slide 18: Experiments
- "RealGraph": not a dataset, but a framework for creating a graph of user-user interactions on Twitter
- Kamath, Krishna, et al. "RealGraph: User Interaction Prediction at Twitter." User Engagement Optimization Workshop @ KDD. 2014.
Slide 19: Experiments
Data: binarized adjacency matrix of a subset of the Twitter follower graph
- a(i, j) = 1 if user i interacted with user j, 0 otherwise
- All prediction errors weighted equally (w(i, j) = 1)
- 100 million interactions
- 440,000 [popular] users
Slide 20: Experiments
80% training, 10% validation, 10% testing
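The 80/10/10 split can be sketched as a shuffled slice; the slide gives only the proportions, so the shuffling procedure and the stand-in data below are assumptions.

```python
import random

entries = list(range(1000))   # stand-in for the observed interactions
random.seed(0)
random.shuffle(entries)

n = len(entries)
train = entries[:int(0.8 * n)]           # 80% training
val = entries[int(0.8 * n):int(0.9 * n)] # 10% validation
test = entries[int(0.9 * n):]            # 10% testing
print(len(train), len(val), len(test))   # prints 800 100 100
```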
Slide 21: Experiments
- k = 2
- Homophily
Slide 22: Experiments
Scalability of Factorbird on a large RealGraph subset:
- 229M x 195M matrix (44.6 quadrillion cells)
- 38.5 billion non-zero entries
- Single SGD pass through the training set: ~2.5 hours
- ~40 billion parameters
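A back-of-envelope check of the parameter count: the factors alone contribute k values per row on each side of the matrix. The rank used for this run is not stated on the slide, so k = 100 below is an assumption chosen to show the figures are mutually consistent.

```python
rows = 229e6 + 195e6   # row count of U plus row count of V
k = 100                # assumed rank; not stated on the slide
params = rows * k      # ignores the comparatively tiny bias vectors
print(params / 1e9)    # ≈ 42.4, consistent with "~40 billion parameters"
```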
Slide 23: Important to note
As with most (if not all) distributed platforms:
Slide 24: Future work
- Support streaming (user follows)
- Simultaneous factorization
- Fault tolerance
- Reduce network traffic (s/memcached/custom application/g)
- Load balancing
Slide 25: Strengths
- Excellent extension of prior work: Hogwild!, RealGraph
- Current and [mostly] open technology: Hadoop, Scalding, Mesos, memcached
- Clear problem, clear solution, clear validation
Slide 26: Weaknesses
- Lack of detail, lack of detail, lack of detail
- How does the number of machines affect runtime?
- What were the performance metrics of the large RealGraph subset?
- What were some of the properties of the dataset (when was it collected, how were edges determined, what does "popular" mean, etc.)?
- How did other factorization methods perform by comparison?
Slide 27: Questions?
Slide 28: Assignment 1
Code: 65 pts
- 20: NBTrain (counting)
- 20: message passing and sorting
- 20: NBTest (scanning model, accuracy)
- 5: How to run
Q1-Q5: 35 pts
Slide 29: Assignment 2
Code: 70 pts
- 20: MR for counting words
- 15: MR for counting labels
- 20: MR for joining model + test data
- 15: MR for classification
- 5: How to run
Q1-Q2: 30 pts
Slide 30: Assignment 3
Code: 50 pts
- 10: Compute TF
- 10: Compute IDF
- 25: K-means iterations
- 5: How to run
Q1-Q4: 50 pts