Demographics and Weblog - PowerPoint Presentation

Uploaded by calandra-battersby on 2017-05-20

Presentation Transcript

Slide1

Demographics and Weblog Hackathon – Case Study

5.3% of Motley Fool visitors are subscribers. Design a classification model for insight into which variables are important for strategies to increase the subscription rate.

Learn by Doing

Slide2

http://www.meetup.com/HandsOnProgrammingEvents/

Slide3

Data Mining Hackathon

Slide4

Funded by Rapleaf, with Motley Fool’s data

App note for Rapleaf/Motley Fool

Template for other hackathons

Did not use AWS. R on individual PCs

Logistics: Rapleaf funded prizes and food for 2 weekends for ~20-50 people. Venue was free.

Slide5

Getting more subscribers

Slide6

Headline Data, Weblog

Slide7

Demographics

Slide8

Cleaning Data

training.csv (201,000), headlines.tsv (811MB), entry.tsv (100k), demographics.tsv

Feature Engineering

Github:

Slide9
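As an illustration of the cleaning step, the labeled training file can be joined to the demographics file on a shared user id. This is a hedged sketch: the column names ("uid", "age") and the use of a tab delimiter for the .tsv file are assumptions, not the contest's actual data dictionary.

```python
import csv

def load_demographics(path):
    # Index the demographics .tsv by user id (column names are hypothetical).
    with open(path, newline="") as f:
        return {row["uid"]: row for row in csv.DictReader(f, delimiter="\t")}

def join_training(training_path, demo_by_uid):
    # Left-join training rows with whatever demographics exist for that uid;
    # users missing from demographics keep only their weblog/label columns.
    joined = []
    with open(training_path, newline="") as f:
        for row in csv.DictReader(f):
            joined.append({**row, **demo_by_uid.get(row["uid"], {})})
    return joined
```

A left join (rather than inner) matters here because, as the later slides note, demographics are sparse: dropping users without demographic rows would discard most of the behavioral signal.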

Ensemble Methods

Bagging, Boosting, randomForests

Overfitting

Stability (small changes make large prediction changes)

Previously none of these worked at scale

Small-scale results using R; large-scale versions exist in proprietary implementations (Google, Amazon, etc.)

Slide10
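As a minimal sketch of the bagging idea above (not the hackathon's R code): bootstrap-resample the data, fit a one-split decision stump per resample, and predict by majority vote. Averaging many high-variance learners is what gives bagging its stability against the small-change/large-prediction-change problem.

```python
import random

def train_stump(points):
    # Fit a one-split decision stump on (x, label) pairs, label in {0, 1}.
    xs = sorted({x for x, _ in points})
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or xs
    best = None
    for t in thresholds:
        for d in (1, -1):  # d=1: predict 1 when x >= t; d=-1: the reverse
            errors = sum(((x >= t) == (d == 1)) != (y == 1) for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, t, d)
    _, t, d = best
    return lambda x: int((x >= t) == (d == 1))

def bagged_classifier(points, n_models=15, seed=0):
    # Bagging: each stump sees a bootstrap resample; predict by majority vote.
    rng = random.Random(seed)
    stumps = [train_stump([rng.choice(points) for _ in points])
              for _ in range(n_models)]
    return lambda x: int(sum(s(x) for s in stumps) * 2 > n_models)
```

Boosting differs by reweighting points toward previous errors instead of resampling uniformly; random forests additionally subsample the candidate features at each split.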

ROC Curves

Binary Classifier Only!

Slide11

Paid Subscriber ROC curve, ~61%

Slide12

Boosted Regression Trees Performance

training data ROC score = 0.745

cv ROC score = 0.737; se = 0.002

5.5% lower than the winning score, without doing any data processing

Random is 50%, or 0.50. At 0.737 we are 0.237 above random, i.e. 23.7 points better.

Slide13
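The ROC scores quoted here are areas under the ROC curve. AUC has a direct probabilistic reading: the probability that a randomly chosen positive (subscriber) is scored above a randomly chosen negative, with ties counting half. A minimal sketch:

```python
def roc_auc(labels, scores):
    # AUC = P(random positive scored above random negative); ties count 0.5.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this reading, 0.50 is chance and the cv score of 0.737 means the model ranks about 73.7% of subscriber/non-subscriber pairs correctly. (This all-pairs loop is O(|pos|·|neg|); production code uses a rank-based formula instead.)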

Contribution of predictor variables

Slide14

Predictive Importance

Friedman: the number of times a variable is selected for splitting, weighted by the squared-error improvement to the model. A measure of sparsity in the data.

Fit plots remove averages of model variables.

 1  pageV      74.0567852
 2  loc        11.0801383
 3  income      4.1565597
 4  age         3.1426519
 5  residlen    3.0813927
 6  home        2.3308287
 7  marital     0.6560258
 8  sex         0.6476549
 9  prop        0.3817017
10  child       0.2632598
11  own         0.2030012

Slide15
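The eleven scores in the table sum to exactly 100, consistent with relative-influence normalization: each variable's raw split-improvement total is scaled to a percentage of the whole. A sketch of that normalization, with made-up raw values (the hackathon's actual raw improvements are not in the slides):

```python
def relative_influence(raw_improvement):
    # Scale raw per-variable split improvements so they sum to 100,
    # as in gbm-style relative influence tables.
    total = sum(raw_improvement.values())
    return {var: 100.0 * v / total for var, v in raw_improvement.items()}
```

For example, `relative_influence({"pageV": 3.0, "loc": 1.0})` yields `{"pageV": 75.0, "loc": 25.0}`. Note the scores are relative, so pageV at 74 says page views carry most of the signal the model found, not that the model is 74% accurate.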

Behavioral vs. Demographics

Demographics are sparse

Behavioral weblogs are the best source. Most sites aren’t using this information correctly. There is no single correct answer. Trial and Error on features. The features are more important than the algorithm

Linear vs. Nonlinear

Slide16

Fitted Values (Crappy)

Slide17

Fitted Values Better

Slide18

Predictor Variable Interaction

Adjusting variable interactions

Slide19

Variable Interactions

Slide20

Plot Interactions: age, loc

Slide21

Trees vs. other methods

Can see multiple levels, good for trees. Do other variables match this? Simplify the model or add more features. Iterate to a better model.

No Math. Analyst

Slide22

Number of Trees

Slide23

Data Set Number of Trees

Slide24

Hackathon Results

Slide25

Weblogs only: 68.15%, 18% better than random

Slide26

Demographics add 1%

Slide27

AWS Advantages

Running multiple instances with different algorithms and parameters using R

Add tutorial, install Screen, R GUI bugs

http://amazonlabs.pbworks.com/w/page/28036646/FrontPage

Slide28

Conclusion

Data Mining at scale requires more development in visualization, MR algorithms, and MR data preprocessing.

Tuning using visualization. Tune 3 parameters: tc, lr, #trees. Didn't cover 2/3.

This isn't reproducible in Hadoop/Mahout or any open source code I know of.
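The three parameters named above are presumably the boosted-tree settings from the R gbm/dismo workflow: tc (tree complexity, the depth of interactions each tree can model), lr (learning rate/shrinkage), and the number of trees. A common tuning approach is a cross-validated grid search; the sketch below uses a toy stand-in scoring function in place of an actual gbm fit, purely to show the search loop:

```python
import itertools

def cv_roc_score(tc, lr, n_trees):
    # Stand-in for a cross-validated ROC score of a boosted-tree fit.
    # A real run would train the model here; this toy surface simply
    # peaks at (tc=3, lr=0.01, n_trees=2000).
    return (0.75
            - 0.01 * abs(tc - 3)
            - 0.5 * abs(lr - 0.01)
            - 1e-5 * abs(n_trees - 2000))

grid = itertools.product([1, 3, 5], [0.001, 0.01, 0.1], [500, 1000, 2000])
best = max(grid, key=lambda params: cv_roc_score(*params))
```

The tc/lr pair trades off against tree count: smaller learning rates need more trees to reach the same fit, which is why tuning only one of the three, as the slide admits, leaves performance on the table.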

Other use cases, e.g. predicting which item will sell (eBay), search engine ranking.

Careful with MR paradigms: Hadoop MR != Couchbase MR