Slide 1
CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky, Stanford University
Recommender Systems & Collaborative Filtering
Slides adapted from Jure Leskovec
Slide 2: Recommender Systems
- Customer X buys a CD of Mozart, then buys a CD of Haydn
- Customer Y does a search on Mozart; the recommender system suggests Haydn, from the data collected about customer X
2/22/17. Slides adapted from Jure Leskovec, Stanford C246: Mining Massive Datasets.
Slide 3: Recommendations
Examples of items: products, web sites, blogs, news items, …
(Diagram: items reach users through two channels, search and recommendations.)
Slide 4: From Scarcity to Abundance
- Shelf space is a scarce commodity for traditional retailers (also: TV networks, movie theaters, …)
- The web enables near-zero-cost dissemination of information about products
- From scarcity to abundance: more choice necessitates better filters, i.e., recommendation engines
- How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html
Slide 5: Sidenote: The Long Tail
Source: Chris Anderson (2004)
Slide 6: Physical vs. Online
Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!
Slide 7: Types of Recommendations
- Editorial and hand-curated: lists of favorites, lists of "essential" items
- Simple aggregates: Top 10, Most Popular, Recent Uploads
- Tailored to individual users: Amazon, Netflix, … (today's class)
Slide 8: Formal Model
- X = set of Customers
- S = set of Items
- Utility function u: X × S → R
  - R = set of ratings; R is a totally ordered set
  - e.g., 0-5 stars, or a real number in [0, 1]
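Since most utility values are unknown, the model maps naturally onto a sparse mapping rather than a dense matrix. A minimal sketch (all names mine):

```python
# Sparse utility matrix: maps (customer, item) pairs to ratings in R = 0..5 stars.
# Absent pairs mean "unknown utility", not a rating of zero.
utility = {
    ("Alice", "Avatar"): 4,
    ("Alice", "LOTR"): 5,
    ("Bob", "Matrix"): 3,
}

def u(x, s):
    """The utility function u: X x S -> R; returns None when unknown."""
    return utility.get((x, s))
```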
Slide 9: Utility Matrix
(Example matrix: rows are users Alice, Bob, Carol, and David; columns are the movies Avatar, LOTR, Matrix, and Pirates.)
Slide 10: Key Problems
(1) Gathering "known" ratings for the matrix: how to collect the data in the utility matrix
(2) Extrapolating unknown ratings from known ones: we are mainly interested in high unknown ratings; we don't need to know what you don't like, only what you like
(3) Evaluating extrapolation methods: how to measure the success/performance of recommendation methods
Slide 11: (1) Gathering Ratings
- Explicit: ask people to rate items
  - Doesn't work well in practice: people can't be bothered
  - Crowdsourcing: pay people to label items
- Implicit: learn ratings from user actions
  - E.g., a purchase implies a high rating
  - What about low ratings?
Slide 12: (2) Extrapolating Utilities
- Key problem: the utility matrix U is sparse
  - Most people have not rated most items
  - Cold start: new items have no ratings; new users have no history
- Three approaches to recommender systems:
  1. Content-based (this lecture!)
  2. Collaborative filtering (this lecture!)
  3. Latent factor based
Slide 13: Content-based Recommender Systems
Slide 14: Content-based Recommendations
- Main idea: recommend to customer x items similar to previous items rated highly by x
- Examples:
  - Movie recommendations: recommend movies with the same actor(s), director, genre, …
  - Websites, blogs, news: recommend other sites with "similar" content
Slide 15: Plan of Action
(Diagram: from the items a user likes, build item profiles (features such as red, circles, triangles); aggregate them into a user profile; then match the user profile against item profiles to recommend new items.)
Slide 16: Item Profiles
- For each item, create an item profile
- A profile is a set (vector) of features
  - Movies: author, genre, director, actors, year, …
  - Text: set of "important" words in the document
- How to pick important features? TF-IDF (term frequency × inverse document frequency)
  - Term ↔ feature, document ↔ item
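A minimal TF-IDF sketch for building text item profiles; the function and the two toy documents are mine, using the plain tf × log(N/df) weighting named on the slide:

```python
import math

def tf_idf(docs):
    """Build item profiles from text. docs maps item -> list of words.
    Returns item -> {word: tf-idf weight}."""
    n = len(docs)
    df = {}                                  # document frequency of each word
    for words in docs.values():
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    profiles = {}
    for item, words in docs.items():
        profiles[item] = {w: (words.count(w) / len(words)) * math.log(n / df[w])
                          for w in set(words)}
    return profiles

# Hypothetical two-document collection: a word that occurs in every
# document ("ship") gets idf = log(1) = 0, so it carries no weight.
profiles = tf_idf({"d1": ["pirate", "ship", "ship"], "d2": ["spy", "ship"]})
```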
Slide 17: Content-based Item Profiles
If everything is 1 or 0 (indicator features):

          Melissa McCarthy  Johnny Depp  Actor A  Actor B  ...  Pirate genre  Spy genre  Comic genre
Movie X          0               1          1        0      1        1            0           1
Movie Y          1               1          0        1      0        1            1           0

But what if we want to have real or ordinal features too?
Slide 18: Content-based Item Profiles
Maybe we want a scaling factor α between binary and numeric features:

          Melissa McCarthy  Johnny Depp  Actor A  Actor B  ...  Pirate genre  Spy genre  Comic genre  AvgRating
Movie X          0               1          1        0      1        1            0           1           3
Movie Y          1               1          0        1      0        1            1           0           4
Slide 19: Content-based Item Profiles
Maybe there is a scaling factor α between binary and numeric features, or maybe α = 1:

          Melissa McCarthy  Johnny Depp  Actor A  Actor B  ...  Pirate genre  Spy genre  Comic genre  AvgRating
Movie X          0               1          1        0      1        1            0           1          3α
Movie Y          1               1          0        1      0        1            1           0          4α

Then compute Cosine(Movie X, Movie Y) on these profiles.
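A sketch of what α does, using bit vectors consistent with the Movie X / Movie Y profiles on this slide (the split of the slide's digits into two 8-bit rows is my reading):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Indicator features for Movie X and Movie Y, plus an average-rating
# feature (3 and 4) scaled by alpha.
x_bits = [0, 1, 1, 0, 1, 1, 0, 1]
y_bits = [1, 1, 0, 1, 0, 1, 1, 0]

def sim(alpha):
    return cosine(x_bits + [alpha * 3], y_bits + [alpha * 4])

# alpha = 0 ignores the ratings entirely; as alpha grows, the (similar)
# average ratings dominate and the cosine similarity rises.
```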
Slide 20: User Profiles
- We want a vector with the same components/dimensions as the items
  - Components could be 1s representing user purchases
  - Or arbitrary numbers from a rating
- The user profile is an aggregate of the item profiles: the (possibly weighted) average of the rated items' profiles
Slide 21: Sample User Profile
- Items are movies; the utility matrix has a 1 if the user has seen the movie
- 20% of the movies user U has seen feature Melissa McCarthy
- So U["Melissa McCarthy"] = 0.2

          Melissa McCarthy  Actor A  Actor B  ...
User U          0.2           .005      0      0 …
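A sketch of profile-building as an average of item profiles; the five toy movies with two features are mine, chosen so the Melissa McCarthy component comes out to 0.2 as on the slide:

```python
def user_profile(seen_item_profiles):
    """User profile = componentwise average of the profiles of the
    items the user has seen."""
    n = len(seen_item_profiles)
    dims = len(seen_item_profiles[0])
    return [sum(p[d] for p in seen_item_profiles) / n for d in range(dims)]

# Hypothetical: user U has seen 5 movies; feature 0 = "stars Melissa McCarthy".
# She appears in 1 of the 5, so U's profile has 1/5 = 0.2 in that component.
seen = [[1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
profile = user_profile(seen)
```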
Slide 22: Prediction
- User and item vectors have the same components/dimensions!
- So just recommend the items whose vectors are most similar to the user vector
- Given user profile x and item profile i, estimate u(x, i) = cos(x, i) = (x · i) / (||x|| ||i||)
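A sketch of prediction by cosine ranking; the toy feature space and item names are mine:

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def recommend(user_vec, item_vecs, k=1):
    """Return the k items whose profiles are most cosine-similar to the user."""
    return sorted(item_vecs, key=lambda i: cosine(user_vec, item_vecs[i]),
                  reverse=True)[:k]

# Hypothetical profiles over features [pirate, comic, spy]
user = [0.1, 0.8, 0.0]
items = {"comedy": [0, 1, 0], "pirate_movie": [1, 0, 0], "spy_movie": [0, 0, 1]}
```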
Slide 23: Pros: Content-based Approach
+ No need for data on other users: no cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new and unpopular items: no first-rater problem
+ Able to provide explanations: can list the content features that caused an item to be recommended
Slide 24: Cons: Content-based Approach
- Finding the appropriate features is hard (e.g., for images, movies, music)
- Recommendations for new users: how to build a user profile?
- Overspecialization:
  - Never recommends items outside the user's content profile
  - People might have multiple interests
  - Unable to exploit the quality judgments of other users
Slide 25: Collaborative Filtering
Harnessing the quality judgments of other users
Slide 26: Collaborative Filtering, Version 1: "User-User" Collaborative Filtering
- Consider user x
- Find a set N of other users whose ratings are "similar" to x's ratings
- Estimate x's ratings based on the ratings of the users in N
Slide 27: Finding Similar Users
- Let rx be the vector of user x's ratings
- Example ratings: rx = [*, _, _, *, ***], ry = [*, _, **, **, _]
- Jaccard similarity: treat rx, ry as sets of rated items
  - rx = {1, 4, 5}, ry = {1, 3, 4}
  - Problem: ignores the values of the ratings
- Cosine similarity: treat rx, ry as points
  - rx = [1, 0, 0, 1, 3], ry = [1, 0, 2, 2, 0]
  - sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| ||ry||)
  - Problem: treats missing ratings as "negative"
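Both measures on the slide's example vectors; the helper names are mine:

```python
import math

# The slide's example ratings as points (0 = not rated):
rx = [1, 0, 0, 1, 3]   # *   _   _   *   ***
ry = [1, 0, 2, 2, 0]   # *   _   **  **  _

def jaccard(a, b):
    """Compares the *sets* of rated items; the rating values are ignored."""
    sa = {i for i, v in enumerate(a) if v > 0}
    sb = {i for i, v in enumerate(b) if v > 0}
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Uses the values, but treats missing (0) entries like low ratings."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))
```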
Slide 28: Utility Matrix
Intuitively we want: sim(A, B) > sim(A, C)
- Jaccard similarity: 1/5 < 2/4 (the wrong order)
- Cosine similarity: 0.386 > 0.322 (the right order, but it considers missing ratings as "negative")
Slide 29: Utility Matrix
- Problem with cosine: a 0 acts like a negative review
  - C really loves SW; A hates SW; B just hasn't seen it
- Another problem: we'd like to normalize for raters
  - D rated everything the same; not very useful
Slide 30: Modified Utility Matrix: Subtract the Mean of Each Row
- Now a 0 means no information
- And negative ratings mean that viewers with opposite tastes will have vectors pointing in opposite directions!
Slide 31: Modified Utility Matrix: Subtract the Mean of Each Row
- Compute Cos(A, B) and Cos(A, C) on the mean-centered rows
- Now A and C are (correctly) much further apart than A and B
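The utility matrix behind slides 28-31 isn't recoverable from this transcript, so this sketch uses the matching example from the Mining of Massive Datasets textbook (users A-D rating Harry Potter, Twilight, and Star Wars titles), which reproduces the Jaccard values (1/5 vs. 2/4) and the cosine value 0.322 quoted above. After mean-centering, A and C land on opposite sides of zero:

```python
import math

# Users A-D rating HP1..HP3, TW, SW1..SW3 (0 = not rated).
ratings = {
    "A": [4, 0, 0, 5, 1, 0, 0],
    "B": [5, 5, 4, 0, 0, 0, 0],
    "C": [0, 0, 0, 2, 4, 5, 0],
    "D": [0, 3, 0, 3, 0, 0, 3],
}

def center(row):
    """Subtract the row's mean over *rated* entries; unrated entries stay 0."""
    rated = [v for v in row if v != 0]
    mean = sum(rated) / len(rated)
    return [v - mean if v != 0 else 0.0 for v in row]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

a, b, c = center(ratings["A"]), center(ratings["B"]), center(ratings["C"])
sim_ab = cosine(a, b)   # small positive: A and B barely overlap
sim_ac = cosine(a, c)   # negative: A and C have opposite tastes
```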
Slide 32: Cosine After Subtracting the Mean
- This turns out to be the same as the Pearson correlation coefficient! Cosine similarity is correlation when the data is centered at 0.
- Terminological note: subtracting the mean is zero-centering, not normalizing (normalizing is dividing by a norm), but the textbook (and common usage) sometimes overloads the term "normalize".
Slide 33: Finding Similar Users
- Let rx be the vector of user x's ratings
- Example: rx = [*, _, _, *, ***] = [1, 0, 0, 1, 3]; ry = [*, _, **, **, _] = [1, 0, 2, 2, 0]
- Cosine similarity: sim(x, y) = cos(rx, ry) = (rx · ry) / (||rx|| ||ry||)
  - Problem: treats missing ratings as "negative"
- Pearson correlation coefficient:
  - Sxy = items rated by both users x and y; r̄x, r̄y = average rating of x, of y
  - sim(x, y) = Σ_{s∈Sxy} (r_xs − r̄x)(r_ys − r̄y) / [ √(Σ_{s∈Sxy} (r_xs − r̄x)²) · √(Σ_{s∈Sxy} (r_ys − r̄y)²) ]
Slide 34: Rating Predictions
From the similarity metric to recommendations:
- Let rx be the vector of user x's ratings
- Let N be the set of the k users most similar to x who have rated item i
- Prediction for item i of user x (shorthand: s_xy = sim(x, y)):
  r_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy
- Many other tricks possible…
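The similarity-weighted prediction in code; the two-user neighborhood is a made-up illustration:

```python
def predict(sims, neighbor_ratings):
    """User-user prediction: similarity-weighted average of the ratings
    that the neighborhood N gave to item i.
    sims: {user y: s_xy}; neighbor_ratings: {user y: r_yi}."""
    num = sum(s * neighbor_ratings[y] for y, s in sims.items())
    den = sum(sims.values())
    return num / den

# Hypothetical 2-user neighborhood: the very similar user's rating dominates.
pred = predict({"y1": 0.9, "y2": 0.1}, {"y1": 5, "y2": 1})  # = 4.6
```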
Slide 35: Collaborative Filtering, Version 2: Item-Item Collaborative Filtering
- So far: user-user collaborative filtering
- Alternate view that often works better: item-item
  - For item i, find other similar items
  - Estimate the rating for item i based on the ratings for similar items
  - Can use the same similarity metrics and prediction functions as in the user-user model
- r_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
  - s_ij … similarity of items i and j
  - r_xj … rating of user x on item j
  - N(i;x) … set of items rated by x similar to i
Slide 36: Item-Item CF (|N| = 2)
Movies (rows) × users 1-12 (columns); . = unknown rating, numbers = ratings 1 to 5:

          u1  u2  u3  u4  u5  u6  u7  u8  u9  u10 u11 u12
movie 1    1   .   3   .   .   5   .   .   5   .   4   .
movie 2    .   .   5   4   .   .   4   .   .   2   1   3
movie 3    2   4   .   1   2   .   3   .   4   3   5   .
movie 4    .   2   4   .   5   .   .   4   .   .   2   .
movie 5    .   .   4   3   4   2   .   .   .   .   2   5
movie 6    1   .   3   .   3   .   .   2   .   .   4   .
Slide 37: Item-Item CF (|N| = 2)
Estimate the rating of movie 1 by user 5: the entry (movie 1, user 5) of the matrix above is replaced by "?".
Slide 38: Item-Item CF (|N| = 2)
- Neighbor selection: identify movies similar to movie 1 that were rated by user 5
- Here we use Pearson correlation as the similarity:
  1. Subtract the mean rating m_i from each movie i's row:
     m_1 = (1 + 3 + 5 + 5 + 4) / 5 = 3.6
     row 1 becomes: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
  2. Compute cosine similarities between rows
- Resulting similarities sim(1, m) for m = 1…6: 1.00, -0.18, 0.41, -0.10, -0.31, 0.59
Slide 39: Item-Item CF (|N| = 2)
Compute the similarity weights for the two nearest neighbors of movie 1 that user 5 rated: s_1,3 = 0.41, s_1,6 = 0.59
Slide 40: Item-Item CF (|N| = 2)
Predict by taking the weighted average:
r_1,5 = (0.41 · 2 + 0.59 · 3) / (0.41 + 0.59) = 2.6
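The whole worked example (slides 36-40) fits in a short script. The matrix is my reconstruction of the slide's 6 × 12 utility matrix, and the helper names are mine:

```python
import math

# The 6-movie x 12-user utility matrix from the slides (0 = unknown).
M = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],   # movie 1 (user 5's rating withheld)
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],   # movie 2
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],   # movie 3
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],   # movie 4
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],   # movie 5
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],   # movie 6
]

def center(row):
    """Subtract the row mean over rated entries; unknown entries stay 0."""
    rated = [v for v in row if v]
    mean = sum(rated) / len(rated)
    return [v - mean if v else 0.0 for v in row]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

centered = [center(row) for row in M]
user, target = 4, 0      # user 5 and movie 1, 0-indexed

# Pearson-style similarity of movie 1 to each movie the user actually rated
sims = [(cosine(centered[target], centered[j]), j)
        for j in range(len(M)) if j != target and M[j][user]]
top2 = sorted(sims, reverse=True)[:2]        # the |N| = 2 nearest neighbors

# Weighted average of the user's ratings on those neighbors
pred = sum(s * M[j][user] for s, j in top2) / sum(s for s, _ in top2)
```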
Slide 41: Item-Item vs. User-User
- In practice, item-item often works better than user-user
- Why? Items are simpler; users have multiple tastes
Slide 42: Simplified Item-Item for Our Homework
First, assume you've converted all the values to +1 (like), 0 (no rating), or -1 (dislike).
(Starting point: the 6-movie × 12-user utility matrix from the item-item example above.)
Slide 43: Simplified Item-Item for Our Homework
The same matrix with all values converted to +1 (like), 0 (no rating), or -1 (dislike):

          u1  u2  u3  u4  u5  u6  u7  u8  u9  u10 u11 u12
movie 1   -1   .  +1   .   .  +1   .   .  +1   .  +1   .
movie 2    .   .  +1  +1   .   .  +1   .   .  -1  -1  +1
movie 3   -1  +1   .  -1  -1   .  +1   .  +1  +1  +1   .
movie 4    .  -1  +1   .  +1   .   .  +1   .   .  -1   .
movie 5    .   .  +1  +1  +1  -1   .   .   .   .  -1  +1
movie 6   -1   .  +1   .  +1   .   .  -1   .   .  +1   .
Slide 44: Simplified Item-Item for Our Tiny PA6 Dataset
- Assume you've binarized, i.e. converted all the values to +1 (like), 0 (no rating), or -1 (dislike)
- For this binary case, some tricks that the TAs recommend:
  - Don't mean-center users; just keep the raw +1, 0, -1
  - Don't normalize (i.e., don't divide the dot product by the sum):
    instead of  r_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij
    just do     r_xi = Σ_{j∈N(i;x)} s_ij · r_xj
  - Don't use Pearson correlation to compute s_ij; just use cosine
- Notation: s_ij … similarity of items i and j; r_xj … rating of user x on item j; N(i;x) … set of items rated by x
Slide 45: Simplified Item-Item for Our Tiny PA6 Dataset
1. Binarize, i.e. convert all values to +1 (like), 0 (no rating), or -1 (dislike)
2. The user x gives you (say) ratings for 2 movies, m1 and m2
3. For each movie i in the dataset, compute r_xi = Σ_j s_ij · r_xj, where s_ij is the cosine between the vectors for movies i and j
4. Recommend the movie i with the maximum r_xi
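A sketch of this recipe (binarize, cosine similarities, unnormalized dot-product scores); the function names and the tiny 3-movie example are mine:

```python
import math

def binarize(row, threshold=3):
    """Map raw ratings to +1 (>= threshold), -1 (below), 0 (no rating)."""
    return [0 if v == 0 else (1 if v >= threshold else -1) for v in row]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def scores(user_ratings, movies):
    """user_ratings: {movie index: +1 or -1}; movies: binarized movie rows.
    r_xi = sum_j s_ij * r_xj  (unnormalized, per the TAs' advice)."""
    out = {}
    for i in range(len(movies)):
        if i in user_ratings:
            continue                      # only score unseen movies
        out[i] = sum(cosine(movies[i], movies[j]) * r
                     for j, r in user_ratings.items())
    return out

# Hypothetical tiny dataset: movies 0 and 1 are liked by the same users,
# movie 2 by the opposite users. The user liked movie 1.
movies = [binarize(r) for r in [[5, 4, 0], [4, 5, 0], [1, 2, 0]]]
sc = scores({1: 1}, movies)
```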
Slide 46: Pros/Cons of Collaborative Filtering
+ Works for any kind of item: no feature selection needed
- Cold start: need enough users in the system to find a match
- Sparsity: the user/ratings matrix is sparse; hard to find users who have rated the same items
- First rater: cannot recommend an item that has not been previously rated (new items, esoteric items)
- Popularity bias: cannot recommend items to someone with unique taste; tends to recommend popular items
Slide 47: Hybrid Methods
- Implement two or more different recommenders and combine their predictions, perhaps using a linear model
- Add content-based methods to collaborative filtering:
  - Item profiles to deal with the new-item problem
  - Demographics to deal with the new-user problem
Slide 48: Evaluation
(A small users × movies utility matrix of known ratings.)
Slide 49: Evaluation
(The same matrix with some ratings replaced by "?": these withheld entries form the test data set.)
Slide 50: Evaluating Predictions
- Compare predictions with known ratings
- Root-mean-square error (RMSE) over the test pairs T: √( (1/|T|) Σ_{(x,i)∈T} (r̂_xi − r_xi)² ), where r̂_xi is the predicted rating and r_xi is the true rating of x on i
- Rank correlation: Spearman's correlation between the system's and the user's complete rankings
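RMSE in code; the two held-out ratings are a made-up illustration:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error over the (user, item) pairs in the test set.
    predicted, actual: {(user, item): rating}."""
    errs = [(predicted[k] - actual[k]) ** 2 for k in actual]
    return math.sqrt(sum(errs) / len(errs))

# Hypothetical test set of two held-out ratings
pred  = {("u1", "m1"): 4.5, ("u1", "m2"): 2.0}
truth = {("u1", "m1"): 5.0, ("u1", "m2"): 1.0}
```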
Slide 51: Problems with Error Measures
- A narrow focus on accuracy sometimes misses the point:
  - Prediction diversity
  - Prediction context
  - Order of predictions
- In practice, we mainly care about predicting high ratings: RMSE might penalize a method that does well on high ratings and badly on the others
Slide 52: There's No Data like Mo' Data
- Leverage all the data: simple methods on large data do best
- Add more data (e.g., add IMDB data on genres)
- More data beats better algorithms
Slide 53: Famous Historical Example: The Netflix Prize
- Training data: 100 million ratings, 480,000 users, 17,770 movies; 6 years of data (2000-2005)
- Test data: the last few ratings of each user (2.8 million)
- Evaluation criterion: root-mean-square error (RMSE); Netflix's Cinematch RMSE: 0.9514
- A dumb baseline does really well: for user u and movie m, take the average of
  - the average rating given by u on all rated movies, and
  - the average of the ratings for movie m by all users who rated that movie
- Competition: 2700+ teams; $1 million prize for a 10% improvement on Cinematch
- The BellKor system won in 2009. It combined many factors:
  - Overall deviations of users/movies
  - Regional effects
  - Local collaborative filtering patterns
  - Temporal biases
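The "dumb baseline" in code; the three-rating dataset is a made-up illustration:

```python
def baseline(u, m, ratings):
    """Netflix 'dumb baseline': average the user's mean rating and the
    movie's mean rating. ratings: {(user, movie): rating}."""
    u_vals = [r for (x, _), r in ratings.items() if x == u]
    m_vals = [r for (_, y), r in ratings.items() if y == m]
    return (sum(u_vals) / len(u_vals) + sum(m_vals) / len(m_vals)) / 2

# Hypothetical ratings; predict u2's rating of m1
R = {("u1", "m1"): 4, ("u1", "m2"): 2, ("u2", "m2"): 4}
prediction = baseline("u2", "m1", R)
```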
Slide 54: Summary on Recommendation Systems
- The Long Tail
- Content-based systems
- Collaborative filtering
- Latent factors