Ricardo: Integrating R and Hadoop – PPT Presentation

Sudipto Das¹, Yannis Sismanis², Kevin S. Beyer², Rainer Gemulla², Peter J. Haas², John McPherson²
¹UC Santa Barbara  ²IBM Almaden Research Center

Uploaded by lois-ondreau on 2017-05-08
Presentation Transcript


Ricardo: Integrating R and Hadoop

Sudipto Das¹, Yannis Sismanis², Kevin S. Beyer², Rainer Gemulla², Peter J. Haas², John McPherson²
¹UC Santa Barbara  ²IBM Almaden Research Center

Presented by: Luyuang Zhang, Yuguan Li

Outline

- Motivation & Background
- Architecture & Components
- Trading with Ricardo
  - Simple Trading
  - Complex Trading
- Evaluation
- Conclusion

Deep Analytics on Big Data

- Enterprises collect huge amounts of data: Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
  - User interaction data and history: click and transaction logs
- Deep analysis is critical for a competitive edge: understanding/modeling data, recommendations to users, ad placement
- Challenge: enable deep analysis and understanding over massive data volumes, exploiting the data to its full potential

Motivating Examples

- Data exploration / model evaluation / outlier detection
- Personalized recommendations for each individual customer/product
  - Many applications: Netflix, Amazon, eBay, iTunes, …
  - Difficulty: discerning particular customer preferences; sampling loses the competitive advantage
- Application scenario: movie recommendations
  - Millions of customers, hundreds of thousands of movies, billions of movie ratings

Analyst’s Workflow

- Data exploration: deal with the raw data
- Data modeling: deal with the processed data; use the chosen method to build a model that fits the data
- Model evaluation: deal with the built model; use data to test the model’s accuracy

Big Data and Deep Analytics – The Gap

- R, SPSS, SAS – a statistician’s toolbox
  - Rich statistical, modeling, and visualization functionality
  - Thousands of sophisticated add-on packages developed by hundreds of statistical experts and available through CRAN
  - Operate on small amounts of data, entirely in memory, on a single server; extensions for data handling are cumbersome
- Hadoop – scalable data management systems
  - Scalable, fault-tolerant, elastic, …
  - “Magnetic”: easy to store data
  - Limited deep analytics: mostly descriptive analytics

Filling the Gap: Existing Approaches

- Reducing data size by sampling
  - Approximation might cost the competitive advantage
  - Loses important features in the long tail of the data distribution [Cohen et al., VLDB 2009]
- Scaling out R
  - Parallel and distributed variants from the statistics community [SNOW, Rmpi]
  - Mostly main-memory based; re-implements DBMS and distributed-processing functionality
- Deep analysis within a DBMS
  - Ports statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
  - Not sustainable – misses out on R’s community development and rich libraries

Ricardo: Bridging the Gap

- David Ricardo, famous 19th-century economist: “comparative advantage”
- Deep analytics decompose into a “large part” and a “small part” [Chu et al., NIPS ’06]
  - Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
  - Recommender systems / latent factorization [our paper]
- A key requirement for Ricardo is that the amount of data communicated between the two systems be sufficiently small
- The large part includes joins, group-bys, and distributive aggregations
  - Hadoop + Jaql: excellent scalability for large-scale data management
- The small part includes matrix/vector operations
  - R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
- Ricardo establishes “trade” between R and Hadoop/Jaql

Ricardo: Bridging the Gap

The trade:
- R sends aggregation-processing queries (written in Jaql) to Hadoop
- Hadoop sends the aggregated data to R for advanced statistical processing

R in a Nutshell


R in a Nutshell

R supports rich statistical functionality.

Jaql in a Nutshell

- Scalable descriptive analysis using Hadoop
- Jaql is a representative declarative interface
- JSON view of the data and a Jaql example (shown on slide)

Ricardo: The Trading Architecture

The complexity of the trade between R and Hadoop varies:
- Simple trading: data exploration
- Complex trading: data modeling

Simple Trading: Exploratory Analytics

- Gain insights about the data
- Example: top-k outliers for a model
  - Identify the data items on which the model performed most poorly
  - Helpful for improving the model’s accuracy
- The trade:
  - Build complex statistical models using rich R functionality
  - Parallelize the processing over the entire data using Hadoop/Jaql
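As a toy illustration of this trade, the ranking step can be sketched in Python (our own stand-in, not the authors' code; in Ricardo the residuals over the full data set would be computed in parallel by Hadoop/Jaql before the top-k items reach the analyst in R):

```python
import heapq

def top_k_outliers(ratings, predict, k):
    """Return the k (customer, movie, rating) triples on which the model
    performs worst, ranked by squared residual."""
    return heapq.nlargest(
        k, ratings, key=lambda t: (t[2] - predict(t[0], t[1])) ** 2)

# Toy data: (customer_id, movie_id, rating); the "model" predicts a constant 3.0.
ratings = [(1, 10, 3.0), (1, 11, 5.0), (2, 10, 1.0), (2, 12, 3.5)]
outliers = top_k_outliers(ratings, lambda i, j: 3.0, 2)
# The two worst-predicted ratings are 5.0 and 1.0 (squared residual 4.0 each).
```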

Complex Trading: Latent Factors

- SVD-like matrix factorization
- Minimize the squared error: e = Σ_{i,j} (p_i q_j - r_ij)²
- The trade:
  - Build complex statistical models in R
  - Parallelize the aggregate computations using Hadoop/Jaql

Complex Trading: Latent Factors

However, in the real world… a vector of factors for each customer and item!

Latent Factor Models with Ricardo

Goal: minimize the squared error e = Σ_{i,j} (p_i q_j - r_ij)²
- Numerical methods needed (large, sparse matrix)

Pseudocode:
1. Start with an initial guess of the parameters p_i and q_j.
2. Compute the error and gradient, e.g. ∂e/∂p_i = Σ_j 2 q_j (p_i q_j - r_ij). Data-intensive, but parallelizable!
3. Update the parameters. (R implements many different optimization algorithms.)
4. Repeat steps 2 and 3 until convergence.
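Step 2 can be sketched in a few lines of Python (our own illustration for scalar factors; in Ricardo this data-intensive sum runs inside Hadoop/Jaql, not in a single process):

```python
def error_and_gradient(ratings, p, q):
    """ratings: iterable of (i, j, r_ij); p, q: dicts of scalar latent factors.
    Returns (e, de_dp, de_dq) with e = sum (p_i*q_j - r_ij)^2 and
    de_dp[i] = sum_j 2*q_j*(p_i*q_j - r_ij), symmetrically for de_dq."""
    e = 0.0
    de_dp = {i: 0.0 for i in p}
    de_dq = {j: 0.0 for j in q}
    for i, j, r in ratings:
        diff = p[i] * q[j] - r
        e += diff * diff
        de_dp[i] += 2.0 * q[j] * diff
        de_dq[j] += 2.0 * p[i] * diff
    return e, de_dp, de_dq

ratings = [(1, 1, 4.0), (1, 2, 2.0), (2, 1, 5.0)]
p = {1: 1.0, 2: 2.0}
q = {1: 3.0, 2: 1.5}
e, de_dp, de_dq = error_and_gradient(ratings, p, q)
# e = (3-4)^2 + (1.5-2)^2 + (6-5)^2 = 2.25
```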

The R Component

Parameters:
- e: the squared error
- de: the gradients
- pq: the concatenation of the latent factors for users and items

R code:

    optim( c(p,q), fe, fde, method="L-BFGS-B" )

Goal: keep updating pq until convergence.
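The shape of this driver loop can be sketched in Python (our own stand-in: R's `optim` with L-BFGS-B is replaced by plain gradient descent, and the error/gradient evaluation that Hadoop/Jaql would perform is inlined as a local function):

```python
def error_and_grad(ratings, p, q):
    # Stand-in for fe/fde: in Ricardo these would forward to Hadoop/Jaql.
    e, dp, dq = 0.0, {i: 0.0 for i in p}, {j: 0.0 for j in q}
    for i, j, r in ratings:
        d = p[i] * q[j] - r
        e += d * d
        dp[i] += 2.0 * q[j] * d
        dq[j] += 2.0 * p[i] * d
    return e, dp, dq

def fit(ratings, p, q, lr=0.01, iters=5000):
    """Repeatedly evaluate error+gradient and update the factors in place,
    mirroring how optim(c(p,q), fe, fde, ...) drives the loop from R."""
    for _ in range(iters):
        _, dp, dq = error_and_grad(ratings, p, q)
        for i in p:
            p[i] -= lr * dp[i]
        for j in q:
            q[j] -= lr * dq[j]
    return error_and_grad(ratings, p, q)[0]

# Rank-1 toy ratings matrix [[4, 2], [2, 1]] = outer([2, 1], [2, 1]),
# so the squared error can be driven close to zero.
ratings = [(1, 1, 4.0), (1, 2, 2.0), (2, 1, 2.0), (2, 2, 1.0)]
p, q = {1: 1.0, 2: 0.5}, {1: 1.0, 2: 1.0}
final_e = fit(ratings, p, q)
```

Gradient descent is used here only because it is self-contained; the point of the slide is that a quasi-Newton method such as L-BFGS-B converges in far fewer (expensive, Hadoop-backed) error/gradient evaluations.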

The Hadoop and Jaql Component

Dataset and goal (shown on slide).

The Hadoop and Jaql Component

- Calculate the squared errors
- Calculate the gradients

Computing the Model

- Movie ratings: tuples (i, j, r_ij); customer parameters: (i, p_i); movie parameters: (j, q_j)
- A 3-way join matches each r_ij with p_i and q_j; then aggregate e = Σ_{i,j} (p_i q_j - r_ij)²
- The gradients are computed similarly

Aggregation in Jaql/Hadoop

res = jaqlTable(channel, "
  ratings
  -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
  -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
  -> transform { $.*, diff: $.rating - $.p * $.q }
  -> expand [ { value: pow($.diff, 2.0) },
              { $.i, value: -2.0 * $.diff * $.q },
              { $.j, value: -2.0 * $.diff * $.p } ]
  -> group by g = { $.i, $.j }
       into { g.*, gradient: sum($[*].value) }
")

Result in R:

   i    j    gradient
---- ----    --------
null null      325235
   1 null          21
   2 null         357
null    1           9
null    2          64

The (null, null) row carries the aggregated squared error e; rows keyed only by i are the gradients with respect to p_i, and rows keyed only by j the gradients with respect to q_j.
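Read procedurally, the query is a single join-and-aggregate pass. A Python rendering of the same computation (our own illustration; the real query runs as parallel MapReduce jobs under Jaql) groups the error under key (None, None) and the gradients under (i, None) and (None, j):

```python
from collections import defaultdict

def aggregate(ratings, cust_pars, movie_pars):
    """One pass over the ratings: join each (i, j, r) with its customer
    factor p_i and movie factor q_j, then sum per group key, mirroring
    the Jaql expand + group by."""
    sums = defaultdict(float)
    for i, j, r in ratings:
        diff = r - cust_pars[i] * movie_pars[j]         # $.rating - $.p*$.q
        sums[(None, None)] += diff ** 2                 # squared error e
        sums[(i, None)] += -2.0 * diff * movie_pars[j]  # gradient w.r.t. p_i
        sums[(None, j)] += -2.0 * diff * cust_pars[i]   # gradient w.r.t. q_j
    return dict(sums)

ratings = [(1, 1, 4.0), (1, 2, 2.0), (2, 1, 5.0)]
p = {1: 1.0, 2: 2.0}   # customer factors
q = {1: 3.0, 2: 1.5}   # movie factors
res = aggregate(ratings, p, q)
# res[(None, None)] is e = 2.25; res[(1, None)] is the p_1 gradient, -7.5
```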

Integrating the Components

Remember: we would be running

    optim( c(p,q), fe, fde, method="L-BFGS-B" )

in the R process.

Experimental Evaluation

- 50 nodes at EC2
- Each node: 8 cores, 7 GB memory, 320 GB disk
- Total: 400 cores, 320 GB memory, 70 TB disk space

Number of Rating Tuples    Data Size in GB
500 Million                 104.33
1 Billion                   208.68
3 Billion                   625.99
5 Billion                  1043.23

Leveraging Hadoop’s Scalability

Leveraging R’s Rich Functionality

    optim( c(p,q), fe, fde, method="CG" )
    optim( c(p,q), fe, fde, method="L-BFGS-B" )

Conclusion

- Scaled latent factor models to terabytes of data
- Provided a bridge: other algorithms with summation form can be mapped and scaled the same way
- Many algorithms have summation form and decompose into a “large part” and a “small part”
  - [Chu et al., NIPS ’06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural networks, PCA, ICA, EM, SVM

Future & Current Work

- Tighter language integration
- More algorithms
- Performance tuning

Questions?

Comments?