Slide 1
Ricardo: Integrating R and Hadoop

Sudipto Das (1), Yannis Sismanis (2), Kevin S. Beyer (2), Rainer Gemulla (2), Peter J. Haas (2), John McPherson (2)
(1) UC Santa Barbara  (2) IBM Almaden Research Center

Presented by: Luyuang Zhang, Yuguan Li
Slide 2
Outline
- Motivation & Background
- Architecture & Components
- Trading with Ricardo
  - Simple Trading
  - Complex Trading
- Evaluation
- Conclusion
Slide 3
Deep Analytics on Big Data
- Enterprises collect huge amounts of data: Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, ...
  - User interaction data and history
  - Click and transaction logs
- Deep analysis is critical for a competitive edge
  - Understanding/modeling the data
  - Recommendations to users
  - Ad placement
- Challenge: enable deep analysis and understanding over massive data volumes, exploiting the data to its full potential
Slide 4
Motivating Examples
- Data exploration / model evaluation / outlier detection
- Personalized recommendations
  - For each individual customer/product
  - Many applications: Netflix, Amazon, eBay, iTunes, ...
  - Difficulty: discerning particular customer preferences; sampling loses the competitive advantage
- Application scenario: movie recommendations
  - Millions of customers
  - Hundreds of thousands of movies
  - Billions of movie ratings
Slide 5
Analyst's Workflow
- Data exploration: deal with the raw data
- Data modeling: deal with the processed data; use the chosen method to build a model that fits the data
- Model evaluation: deal with the built model; use data to test the model's accuracy
Slide 6
Big Data and Deep Analytics – The Gap
- R, SPSS, SAS – a statistician's toolbox
  - Rich statistical, modeling, and visualization functionality
  - Thousands of sophisticated add-on packages, developed by hundreds of statistical experts and available through CRAN
  - Operate on small amounts of data, entirely in memory, on a single server
  - Extensions for data handling are cumbersome
- Hadoop – scalable data management systems
  - Scalable, fault-tolerant, elastic, ...
  - "Magnetic": easy to store data
  - Limited deep analytics: mostly descriptive analytics
Slide 7
Filling the Gap: Existing Approaches
- Reducing data size by sampling
  - Approximations might result in losing the competitive advantage
  - Loses important features in the long tail of the data distribution [Cohen et al., VLDB 2009]
- Scaling out R
  - Efforts from the statistics community on parallel and distributed variants [SNOW, Rmpi]
  - Main-memory based in most cases
  - Re-implements DBMS and distributed-processing functionality
- Deep analysis within a DBMS
  - Ports statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
  - Not sustainable – misses out on R's community development and rich libraries
Slide 8
Ricardo: Bridging the Gap
- David Ricardo, famous 19th-century economist: "comparative advantage"
- Deep analytics is decomposable into a "large part" and a "small part" [Chu et al., NIPS '06]
  - Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
  - Recommender systems / latent factorization [our paper]
- A key requirement for Ricardo is that the amount of data that must be communicated between the two systems be sufficiently small
- Large part: joins, group-bys, distributive aggregations
  - Hadoop + Jaql: excellent scalability for large-scale data management
- Small part: matrix/vector operations
  - R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
- Ricardo establishes "trade" between R and Hadoop/Jaql
Slide 9
Ricardo: Bridging the Gap
- The trade:
  - R sends aggregation-processing queries (written in Jaql) to Hadoop
  - Hadoop sends the aggregated data to R for advanced statistical processing
Slide 10
R in a Nutshell
Slide 11
R in a Nutshell
- R supports rich statistical functionality
Slide 12
Jaql in a Nutshell
- Scalable descriptive analysis using Hadoop
- Jaql: a representative declarative interface
- JSON view of the data
- Jaql example
Slide 13
Ricardo: The Trading Architecture
- Complexity of the trade between R and Hadoop:
  - Simple trading: data exploration
  - Complex trading: data modeling
Slide 14
Simple Trading: Exploratory Analytics
- Gain insights about the data
- Example: top-k outliers for a model
  - Identify the data items on which the model performed most poorly
  - Helpful for improving the model's accuracy
- The trade:
  - Build complex statistical models using R's rich functionality
  - Parallelize the processing over the entire data using Hadoop/Jaql
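The top-k outlier trade can be sketched in plain Python (the deck itself uses R and Jaql; the `ratings` data and the stub `predict` model here are invented purely for illustration):

```python
import heapq

def top_k_outliers(ratings, predict, k):
    """Return the k (customer, movie, rating) triples on which the
    model's prediction is worst, ranked by squared error."""
    return heapq.nlargest(
        k, ratings, key=lambda t: (predict(t[0], t[1]) - t[2]) ** 2
    )

# Toy data: (customer i, movie j, rating r_ij); the "model" is a stub.
ratings = [(1, 1, 5.0), (1, 2, 1.0), (2, 1, 4.0), (2, 2, 2.5)]
predict = lambda i, j: 3.0
worst = top_k_outliers(ratings, predict, 2)
print(worst)
```

In Ricardo, the scan over all ratings would run in parallel on Hadoop, and only the k worst items would be shipped back to R for inspection.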
Slide 15
Complex Trading: Latent Factors
- SVD-like matrix factorization
- Minimize the squared error: Σ_{i,j} (p_i q_j − r_ij)²
- The trade:
  - Build complex statistical models in R
  - Parallelize the aggregate computations using Hadoop/Jaql
Slide 16
Complex Trading: Latent Factors
- However, in the real world, we need a vector of factors for each customer and item!
Slide 17
Latent Factor Models with Ricardo
- Goal: minimize the squared error e = Σ_{i,j} (p_i q_j − r_ij)²
  - Numerical methods are needed (large, sparse matrix)
- Pseudocode:
  1. Start with an initial guess of the parameters p_i and q_j.
  2. Compute the error and the gradient, e.g. ∂e/∂p_i = Σ_j 2 q_j (p_i q_j − r_ij).
  3. Update the parameters (R implements many different optimization algorithms).
  4. Repeat steps 2 and 3 until convergence.
- Data intensive, but parallelizable!
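As a sanity check on the gradient in step 2, here is a small Python sketch (toy data, not from the slides) comparing the analytic formula ∂e/∂p_i = Σ_j 2 q_j (p_i q_j − r_ij) against a finite-difference estimate:

```python
def error(p, q, ratings):
    """Squared error e = sum over observed (i, j, r) of (p_i*q_j - r)^2."""
    return sum((p[i] * q[j] - r) ** 2 for (i, j, r) in ratings)

def grad_p(p, q, ratings, i):
    """Analytic gradient de/dp_i = sum_j 2*q_j*(p_i*q_j - r_ij)."""
    return sum(2 * q[j] * (p[i] * q[j] - r)
               for (i2, j, r) in ratings if i2 == i)

# Made-up factors and sparse ratings.
p = {1: 0.5, 2: 1.2}
q = {1: 0.8, 2: 1.1}
ratings = [(1, 1, 1.0), (1, 2, 2.0), (2, 1, 3.0)]

# Finite-difference approximation of de/dp_1 agrees with the formula.
h = 1e-6
p_plus = dict(p)
p_plus[1] += h
numeric = (error(p_plus, q, ratings) - error(p, q, ratings)) / h
print(abs(numeric - grad_p(p, q, ratings, 1)) < 1e-4)  # True
```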
Slide 18
The R Component
- Parameters:
  - e: the squared error
  - de: the gradients
  - pq: the concatenation of the latent factors for users and items
- R code: optim( c(p,q), fe, fde, method="L-BFGS-B" )
- Goal: keep updating pq until it reaches convergence
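A rough Python analogue of the optim call (toy data invented here; plain gradient descent stands in for L-BFGS-B so the sketch needs no libraries). The point is the interface: the optimizer sees a single flat vector pq = c(p, q), and the objective fe and gradient fde unpack it:

```python
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0)]  # (i, j, r_ij)
n_users, n_items = 2, 2

def fe(pq):
    """Objective: squared error over the observed ratings."""
    p, q = pq[:n_users], pq[n_users:]
    return sum((p[i] * q[j] - r) ** 2 for (i, j, r) in ratings)

def fde(pq):
    """Gradient of fe with respect to the flat parameter vector."""
    p, q = pq[:n_users], pq[n_users:]
    g = [0.0] * len(pq)
    for (i, j, r) in ratings:
        d = 2 * (p[i] * q[j] - r)
        g[i] += d * q[j]            # de/dp_i
        g[n_users + j] += d * p[i]  # de/dq_j
    return g

pq = [0.1] * (n_users + n_items)    # initial guess
for _ in range(5000):               # crude stand-in for optim's updates
    pq = [x - 0.01 * gx for x, gx in zip(pq, fde(pq))]
print(fe(pq) < 1e-2)
```

In Ricardo, fe and fde would not touch the raw data at all: each call triggers the Jaql aggregation on the cluster, and only the scalar e and the (comparatively small) gradient vector flow back into R.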
Slide 19
The Hadoop and Jaql Component
- Dataset
- Goal
Slide 20
The Hadoop and Jaql Component
- Calculate the squared errors
- Calculate the gradients
Slide 21
Computing the Model
- Inputs: movie ratings r_ij, customer parameters p_i, movie parameters q_j
- 3-way join to match r_ij, p_i, and q_j, then aggregate: e = Σ_{i,j} (p_i q_j − r_ij)²
- Similarly compute the gradients
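Conceptually, the join-then-aggregate step looks like this tiny Python sketch, with dictionaries of made-up values standing in for the Hadoop-side relations:

```python
# (i, j, r_ij): movie ratings; p: customer parameters; q: movie parameters.
ratings = [(1, 1, 5.0), (1, 2, 1.0), (2, 1, 4.0)]
p = {1: 2.0, 2: 1.5}
q = {1: 2.0, 2: 0.5}

# Match each r_ij with its p_i and q_j, then aggregate the squared error.
e = sum((p[i] * q[j] - r) ** 2 for (i, j, r) in ratings)
print(e)  # (4-5)^2 + (1-1)^2 + (3-4)^2 = 2.0
```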
Slide 22
Aggregation in Jaql/Hadoop

res = jaqlTable(channel, "
  ratings
  -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
  -> hashJoin( fn(r) r.i, custPars,  fn(c) c.i, fn(r, c) { r.*, c.p } )
  -> transform { $.*, diff: $.rating - $.p * $.q }
  -> expand [ { value: pow($.diff, 2.0) },
              { $.i, value: -2.0 * $.diff * $.q },
              { $.j, value: -2.0 * $.diff * $.p } ]
  -> group by g = { $.i, $.j }
     into { g.*, gradient: sum($[*].value) }
")

Result in R:

   i     j   gradient
----  ----  ---------
null  null     325235   (the total squared error e)
   1  null         21   (de/dp_1)
   2  null        357
 ...
null     1          9   (de/dq_1)
null     2         64
 ...
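The same pipeline can be imitated in a few lines of Python (toy values invented here): each joined rating "expands" into one squared-error row and two gradient rows, and the group-by sums per key, so group (null, null) holds the total error e, (i, null) holds ∂e/∂p_i, and (null, j) holds ∂e/∂q_j:

```python
from collections import defaultdict

ratings = [(1, 1, 5.0), (1, 2, 1.0), (2, 1, 4.0)]  # (i, j, rating)
p = {1: 2.0, 2: 1.5}                               # custPars
q = {1: 2.0, 2: 0.5}                               # moviePars

groups = defaultdict(float)                        # group by (i, j)
for (i, j, r) in ratings:
    diff = r - p[i] * q[j]                   # transform step
    groups[(None, None)] += diff ** 2        # expand: squared-error row
    groups[(i, None)] += -2.0 * diff * q[j]  # expand: de/dp_i contribution
    groups[(None, j)] += -2.0 * diff * p[i]  # expand: de/dq_j contribution

for key in sorted(groups, key=str):
    print(key, groups[key])
```

Note the sign convention: with diff = r − p·q, the contribution −2·diff·q_j is the same as 2·q_j·(p_i q_j − r_ij) from the gradient formula.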
Slide 23
Integrating the Components
- Remember: we would be running optim( c(p,q), fe, fde, method="L-BFGS-B" ) in the R process.
Slide 24
Experimental Evaluation
- 50 nodes at EC2; each node: 8 cores, 7 GB memory, 320 GB disk
- Total: 400 cores, 320 GB memory, 70 TB disk space

Number of rating tuples   Data size (GB)
500 million                104.33
1 billion                  208.68
3 billion                  625.99
5 billion                 1043.23
Slide 25
Leveraging Hadoop's Scalability
Slide 26
Leveraging R's Rich Functionality

optim( c(p,q), fe, fde, method="CG" )
optim( c(p,q), fe, fde, method="L-BFGS-B" )
Slide 27
Conclusion
- Scaled latent factor models to terabytes of data
- Provided a bridge: other algorithms with "summation form" can be mapped and scaled the same way
  - Many algorithms have summation form and decompose into a "large part" and a "small part"
  - [Chu et al., NIPS '06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural networks, PCA, ICA, EM, SVM
- Future & current work
  - Tighter language integration
  - More algorithms
  - Performance tuning
Slide 28
Questions? Comments?