Slide1
Class Competition: Netflix data
Statistical Learning Course, Prof. Saharon Rosset, January 2015
Keren Levinstein Hallak
Slide2
Overall:
Linear regression with Ridge regularization
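As an illustration of the overall model, here is a minimal sketch of ridge-regularized linear regression in closed form. It is not the code used in the competition: the function names, the synthetic feature matrix and the value of `lam` are all assumptions made for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - b0 - X w||^2 + lam * ||w||^2 (the intercept b0 is not penalized)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # centering removes the intercept from the penalized problem
    yc = y - y_mean
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    b0 = y_mean - x_mean @ w
    return w, b0

def ridge_predict(X, w, b0):
    return X @ w + b0

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data standing in for the real feature matrix (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 3.5 + 0.1 * rng.normal(size=200)

w, b0 = ridge_fit(X, y, lam=1.0)
print("training RMSE:", rmse(y, ridge_predict(X, w, b0)))
```

In practice `lam` would be chosen by validation on held-out ratings rather than fixed in advance.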
Main steps:
Matrix completion
Dates insight
Small steps:
Additional parameters
Slide3
Matrix completion
Used Matlab code available online for matrix completion via soft thresholding.
Solves: min nuclear-norm(X) subject to [constraint not recoverable from the source]
Nuclear norm: ||X||_* = the sum of the singular values of X
Data completion was performed on the training and testing data together
RMSE = 0.766796 (with some additional parameters)
Slide4
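For the matrix completion step, the following is a rough Python sketch of completion by iterative soft thresholding of singular values. The slides do not say which Matlab package was used, so this is a Soft-Impute-style iteration chosen for illustration, not the original code. It assumes the penalized form of the problem (squared error on the observed entries plus `lam` times the nuclear norm); the names `soft_impute`, `mask` and `lam`, and the synthetic low-rank matrix, are hypothetical.

```python
import numpy as np

def soft_threshold_svd(A, lam):
    """Shrink every singular value of A by lam, clipping at zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s = np.maximum(s - lam, 0.0)
    return (U * s) @ Vt

def soft_impute(M, observed, lam, n_iters=200, tol=1e-6):
    """Fill the unobserved entries of M with a low nuclear-norm estimate.

    M        : 2-D array; values where observed is False are ignored
    observed : boolean array, True where a rating is known
    """
    Z = np.zeros_like(M, dtype=float)
    for _ in range(n_iters):
        # Keep the known entries, use the current estimate elsewhere.
        filled = np.where(observed, M, Z)
        Z_new = soft_threshold_svd(filled, lam)
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z

# Synthetic rank-2 matrix with about 60% of the entries observed (assumption).
rng = np.random.default_rng(0)
M_true = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
mask = rng.random(M_true.shape) < 0.6
lam = 0.5

Z_hat = soft_impute(M_true, mask, lam)
held_out_rmse = np.sqrt(np.mean((M_true - Z_hat)[~mask] ** 2))
print("RMSE on unobserved entries:", held_out_rmse)
```

Stacking the training and testing users into one matrix before calling such a routine corresponds to completing the training and testing data together.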
Dates insight
Slide5
Slide6
Dates Insight
Users rate many movies on the same day
About 93% of the users rated other movies on the day they rated Miss Congeniality, in both the training and the testing set
For each user, the mean, median, variance and number of ratings given on the day Miss Congeniality was rated are useful parameters
Slide7
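The same-day parameters from the dates insight could be computed along the following lines. This is only a sketch under assumed data layout: `ratings` and `dates` are users-by-movies arrays, a rating of 0 means the movie was not rated, dates are integer day numbers, and `target_dates` holds the day each user rated Miss Congeniality. Filling the features with zeros when a user has no same-day ratings is also an assumption.

```python
import numpy as np

def same_day_features(ratings, dates, target_dates):
    """Per user: mean, median, variance and count of ratings given on the target day."""
    n_users = ratings.shape[0]
    feats = np.zeros((n_users, 4))
    for u in range(n_users):
        # Movies this user actually rated on the day the target movie was rated.
        same_day = (dates[u] == target_dates[u]) & (ratings[u] > 0)
        r = ratings[u, same_day]
        if r.size > 0:
            feats[u] = [r.mean(), np.median(r), r.var(), r.size]
    return feats

# Tiny hypothetical example: 3 users, 4 movies.
ratings = np.array([[5, 3, 0, 4],
                    [2, 0, 4, 1],
                    [1, 2, 3, 4]])
dates = np.array([[10, 10, 10, 12],
                  [7, 8, 9, 9],
                  [1, 2, 3, 4]])
target_dates = np.array([10, 9, 9])

feats = same_day_features(ratings, dates, target_dates)
print(feats)
```

The resulting columns can be appended to the regression design matrix alongside the completed ratings.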
Additional parameters:
Considering only the ‘true’ ratings and not the ones given by matrix completion:
Variance, skewness and quartiles for each user
Number of zeros (unwatched movies)
Percentage of [1,2,3,4,5] ratings out of the number of watched movies
Miss Congeniality dates
85 indicator parameters indicating missing values for movies 15:99 (the first 14 movies were rated by all users)
Slide8
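A possible way to build most of the additional parameters is sketched below. It assumes a users-by-movies array in which 0 marks an unrated movie and every user has at least one true rating; the function name and the `n_always_rated` argument are hypothetical, and the Miss Congeniality date column is left out because it is simply copied from the data.

```python
import numpy as np

def user_features(ratings, n_always_rated=14):
    """Per-user summary statistics of the true ratings plus missing-value indicators."""
    ratings = np.asarray(ratings)
    rows = []
    for r in ratings:
        seen = r[r > 0].astype(float)            # true ratings only
        centered = seen - seen.mean()
        std = seen.std()
        skew = np.mean(centered ** 3) / std ** 3 if std > 0 else 0.0
        q25, q50, q75 = np.percentile(seen, [25, 50, 75])
        # Share of each rating value among the watched movies.
        shares = [np.mean(seen == k) for k in (1, 2, 3, 4, 5)]
        rows.append([seen.var(), skew, q25, q50, q75, np.sum(r == 0), *shares])
    summary = np.array(rows, dtype=float)
    # One indicator per movie that is not rated by everyone (85 columns for 99 movies).
    missing = (ratings[:, n_always_rated:] == 0).astype(float)
    return np.hstack([summary, missing])

# Small hypothetical example: 2 users, 6 movies, the first 3 rated by everyone.
example = np.array([[1, 2, 3, 4, 5, 0],
                    [5, 5, 4, 0, 0, 3]])
F = user_features(example, n_always_rated=3)
print(F.shape)
```

With the 99-movie data and the default `n_always_rated=14`, the indicator block has 99 - 14 = 85 columns, matching the count on the slide.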
Some points to ponder
One should be very careful evaluating the RMSE when the data is divided into subgroups and a different model is built for each subgroup
The preprocessing phase (choosing parameters, dealing with missing values) seems to be the most important one
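On the first point, one pitfall that may be meant (this is an assumption, the slide does not spell it out) is averaging the per-subgroup RMSE values instead of pooling the squared errors. The sketch below uses synthetic errors and hypothetical group sizes to show how the overall RMSE is recovered from subgroup RMSEs.

```python
import numpy as np

def rmse_from_errors(errors):
    return float(np.sqrt(np.mean(np.asarray(errors, dtype=float) ** 2)))

def pooled_rmse(group_rmses, group_sizes):
    """Overall RMSE from per-group RMSEs, weighting squared errors by group size."""
    group_rmses = np.asarray(group_rmses, dtype=float)
    group_sizes = np.asarray(group_sizes, dtype=float)
    return float(np.sqrt(np.sum(group_sizes * group_rmses ** 2) / np.sum(group_sizes)))

# Synthetic prediction errors for two subgroups of very different size.
rng = np.random.default_rng(0)
err_a = 0.7 * rng.normal(size=900)
err_b = 1.2 * rng.normal(size=100)

rmse_a = rmse_from_errors(err_a)
rmse_b = rmse_from_errors(err_b)
overall = rmse_from_errors(np.concatenate([err_a, err_b]))
pooled = pooled_rmse([rmse_a, rmse_b], [err_a.size, err_b.size])
naive = (rmse_a + rmse_b) / 2   # unweighted average of subgroup RMSEs

print("overall:", overall, "pooled:", pooled, "naive average:", naive)
```

The pooled value equals the RMSE computed on all predictions at once, while the unweighted average generally does not.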
Good to know: Weka, a free data mining software package written in Java
Slide9
Thank you for listening!