Slide 1

Demographics and Weblog Hackathon – Case Study
5.3% of Motley Fool visitors are subscribers. Design a classification model for insight into which variables are important for strategies to increase the subscription rate.
Learn by Doing

Slide 2
http://www.meetup.com/HandsOnProgrammingEvents/

Slide 3
Data Mining Hackathon

Slide 4
Funded by Rapleaf, with Motley Fool's data
App note for Rapleaf/Motley Fool
Template for other hackathons
Did not use AWS; R on individual PCs
Logistics: Rapleaf funded prizes and food for 2 weekends for ~20-50 people. The venue was free.

Slide 5
Getting more subscribers

Slide 6
Headline Data, Weblog

Slide 7
Demographics

Slide 8
Cleaning Data
training.csv (201,000), headlines.tsv (811 MB), entry.tsv (100k), demographics.tsv
Feature Engineering
Github:
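A minimal R sketch of pulling these files in and joining them; the join key (uid) and the cleaning steps are assumptions, since the slides don't show the actual schema.

    # Minimal sketch, assuming a shared join key "uid"; the real
    # schema isn't shown on the slides.
    train <- read.csv("training.csv", stringsAsFactors = FALSE)
    demo  <- read.delim("demographics.tsv", stringsAsFactors = FALSE)
    entry <- read.delim("entry.tsv", stringsAsFactors = FALSE)
    # headlines.tsv is ~811 MB; consider a chunked or faster reader.

    # Treat empty demographic strings as missing, then left-join
    # onto the training set.
    demo[demo == ""] <- NA
    train <- merge(train, demo, by = "uid", all.x = TRUE)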
Slide 9

Ensemble Methods
Bagging, Boosting, Random Forests
Overfitting
Stability (small changes in the data can make large changes in predictions)
Previously, none of these worked at scale. Small-scale results use R; large-scale versions exist only in proprietary implementations (Google, Amazon, etc.).
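A small-scale R sketch of the three ensemble methods named above; the data frame train and the 0/1 target paid are hypothetical stand-ins for the competition data.

    library(randomForest)
    library(ipred)   # bagging
    library(gbm)     # boosting

    # Random forests and bagged trees want a factor response;
    # gbm's bernoulli family wants 0/1.
    rf  <- randomForest(factor(paid) ~ ., data = train, ntree = 500)
    bag <- bagging(factor(paid) ~ ., data = train, nbagg = 100)
    bst <- gbm(paid ~ ., data = train, distribution = "bernoulli",
               n.trees = 2000, interaction.depth = 3, shrinkage = 0.01)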
Slide 10

ROC Curves
Binary classifiers only!
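A hedged sketch of drawing an ROC curve with the ROCR package, where p is a vector of predicted probabilities and y the 0/1 labels:

    library(ROCR)
    pred <- prediction(p, y)                 # p: scores, y: 0/1 labels
    perf <- performance(pred, "tpr", "fpr")  # TPR vs. FPR across thresholds
    plot(perf)                               # the ROC curve
    auc <- performance(pred, "auc")@y.values[[1]]  # area under the curve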
Slide 11

Paid Subscriber ROC curve, ~61%

Slide 12
Boosted Regression Trees Performance
Training data ROC score = 0.745
CV ROC score = 0.737; SE = 0.002
5.5% less than the winning score, without doing any data processing.
Random is 50%, or 0.50; 0.737 - 0.50 = 0.237, so we are 23.7 points better than random.
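Scores like these come from cross-validation during boosting; a sketch of how they can be computed with gbm in R (the formula and settings are illustrative, not the actual winning configuration):

    library(gbm)
    set.seed(1)
    fit <- gbm(paid ~ ., data = train, distribution = "bernoulli",
               n.trees = 3000, shrinkage = 0.01, interaction.depth = 3,
               cv.folds = 10)
    best <- gbm.perf(fit, method = "cv")   # tree count chosen by CV error
    p <- predict(fit, newdata = train, n.trees = best, type = "response")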
Slide 13

Contribution of predictor variables

Slide 14
Predictive Importance
Friedman's measure: the number of times a variable is selected for splitting, weighted by the squared-error improvement to the model. A measure of sparsity in the data.
Fit plots average out the other model variables.

Rank  Variable  Relative influence
1     pageV     74.0567852
2     loc       11.0801383
3     income     4.1565597
4     age        3.1426519
5     residlen   3.0813927
6     home       2.3308287
7     marital    0.6560258
8     sex        0.6476549
9     prop       0.3817017
10    child      0.2632598
11    own        0.2030012
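A table like this is what gbm's summary() reports, using the same Friedman relative-influence measure; fit and best continue the earlier sketch:

    # Relative influence of each predictor (sums to 100), the same
    # split-improvement measure shown in the table above.
    ri <- summary(fit, n.trees = best, plotit = FALSE)
    print(ri)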
Slide 15

Behavioral vs. Demographics
Demographics are sparse.
Behavioral weblogs are the best source, and most sites aren't using this information correctly. There is no single correct answer: it is trial and error on features, and the features are more important than the algorithm.
Linear vs. Nonlinear

Slide 16
Fitted Values (Crappy)

Slide 17
Fitted Values, Better
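These fitted-value plots are partial-dependence plots: the marginal effect of one predictor with the others averaged out, per the Slide 14 note. A sketch using gbm's plot method on the earlier model:

    # Marginal effect of a single predictor, others averaged out.
    plot(fit, i.var = "pageV", n.trees = best)
    plot(fit, i.var = "age",   n.trees = best)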
Slide 18

Predictor Variable Interaction
Adjusting variable interactions

Slide 19
Variable Interactions

Slide 20
Plot Interactions: age, loc
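A sketch of quantifying and plotting pairwise interactions such as age/loc with the dismo package; this assumes a model brt fitted with dismo::gbm.step (see the tuning sketch under the Conclusion):

    library(dismo)
    ints <- gbm.interactions(brt)   # interaction strength for all pairs
    ints$rank.list                  # strongest pairs first
    gbm.perspec(brt, x = 1, y = 2)  # 3-D surface for predictors 1 and 2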
Slide 21

Trees vs. Other Methods
You can see multiple levels in the plots, which is good for trees. Do the other variables match this? Simplify the model or add more features, and iterate to a better model.
No math required; this is analyst work.

Slide 22
Number of Trees

Slide 23
Data Set: Number of Trees
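These plots compare training error, which keeps falling as trees are added (overfitting), with cross-validated error, which bottoms out; gbm.perf draws that picture and returns the CV-chosen tree count:

    # Training error keeps dropping as trees are added; the CV
    # curve picks the stopping point.
    best <- gbm.perf(fit, method = "cv", plot.it = TRUE)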
Slide 24

Hackathon Results

Slide 25
Weblogs only: 68.15%, 18 points better than random

Slide 26
Demographics add 1%

Slide 27
AWS Advantages
Running multiple instances with different algorithms and parameters using R
Add tutorial, install Screen, R GUI bugs
http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
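What multiple instances buy you is sweeping settings in parallel instead of serially; a hypothetical single-machine sketch with R's parallel package, where on AWS each grid row would map to an instance rather than a core:

    library(parallel)
    library(gbm)
    grid <- expand.grid(shrinkage = c(0.1, 0.01, 0.005),
                        depth     = c(1, 3, 5))
    # Fit one model per parameter combination, one per core.
    fits <- mclapply(seq_len(nrow(grid)), function(i) {
      gbm(paid ~ ., data = train, distribution = "bernoulli",
          n.trees = 2000, shrinkage = grid$shrinkage[i],
          interaction.depth = grid$depth[i], cv.folds = 5)
    }, mc.cores = 4)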
Slide 28

Conclusion
Data mining at scale requires more development in visualization, MapReduce (MR) algorithms, and MR data preprocessing.
Tuning is done using visualization. There are 3 parameters to tune: tree complexity (tc), learning rate (lr), and the number of trees. We didn't cover two of the three.
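tc and lr are the abbreviations dismo's gbm.step uses for tree complexity and learning rate, which fits the R toolchain described here; gbm.step picks the number of trees by cross-validation, so only two of the three parameters need hand-tuning. A sketch with placeholder column indices:

    library(dismo)
    brt <- gbm.step(data = train,
                    gbm.x = 2:12,            # predictor columns (placeholder)
                    gbm.y = 1,               # 0/1 response column (placeholder)
                    family = "bernoulli",
                    tree.complexity = 3,     # tc
                    learning.rate = 0.01,    # lr
                    bag.fraction = 0.5)      # tree count chosen internally by CV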
This isn't reproducible in Hadoop/Mahout or any open-source code I know of.
Other use cases: predicting which item will sell (eBay), search-engine ranking.
Be careful with MR paradigms: Hadoop MR != Couchbase MR.