Slide1
Netflix and Beyond
Tuning Solr for great results.
Walter Underwood
http://wunderwood.org/most_casual_observer/
Slide2
Typical Web Query Mix
informational
navigational (known-site)
transactional (known-item)
(Andrei Broder, AltaVista, 2002)
Slide3
“talking rat movie”
Slide4
Top Queries October 2006
finding neverland
bridget jones
closer
the incredibles
incredibles
ladder 49
fat albert
being julia
ray
national treasure
alfie
spanglish
star wars
meet the fockers
final cut
hotel rwanda
neverland
after the sunset
million dollar baby
hitch
Slide5
Netflix Queries
92% movie titles
5% genres and categories
3% people
Known-item queries (movie titles plus people) make up 95% of Netflix traffic.
Slide6
Slide7
Problematic User Behavior
One or two words?
Partial words
Misspellings
Slide8
One or Two Words?
Slide9
Partial Words
People don’t like to make mistakes:
  rat, rata, ratat
  apoc
  koyaanisq
Phonetic encoding (soundex) assumes complete words
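To make the last point concrete, here is a minimal sketch of classic American Soundex in Python. It is not the encoder used at Netflix or shipped with Solr; the function name `soundex` and the sample words are chosen for illustration. It shows that a partial query such as "rat" gets a different code than the full title "ratatouille", so a phonetic index built from complete words cannot match it.

```python
# Letter groups for classic Soundex; vowels, h, w, and y have no digit.
_CODES = {}
for _digit, _letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1
):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word):
    """Return the 4-character Soundex code for a word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first.upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for query in ["rat", "rata", "ratat", "ratatouille"]:
        print(query, soundex(query))
```

The partial forms only converge on the full code once the user has typed most of the consonants, which is exactly when help is least needed.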
Slide10
Autocomplete Finishes Words
Load movie titles and popular people
10% improvement in search quality (MRR)
10X as much traffic as search queries
Dedicated Solr with RAMDirectory
Front-end HTTP cache, 1 hour lifetime, 80% hit rate
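The idea of finishing words from a loaded list of titles and people can be sketched in a few lines of Python. This is only an illustration of prefix completion, not the Solr configuration described on the slide: the `Autocomplete` class, the popularity scores, and the sample entries are all hypothetical.

```python
import bisect


class Autocomplete:
    """In-memory prefix completion over (text, popularity) pairs."""

    def __init__(self, entries):
        # Sort lowercased entries so every prefix maps to one contiguous slice.
        self._items = sorted((text.lower(), pop) for text, pop in entries)
        self._keys = [text for text, _ in self._items]

    def complete(self, prefix, limit=5):
        prefix = prefix.lower()
        lo = bisect.bisect_left(self._keys, prefix)
        # "\uffff" sorts after any ordinary character, closing the range.
        hi = bisect.bisect_left(self._keys, prefix + "\uffff")
        hits = sorted(self._items[lo:hi], key=lambda item: -item[1])
        return [text for text, _ in hits[:limit]]


if __name__ == "__main__":
    # Popularity numbers are invented for the example.
    ac = Autocomplete([
        ("Ratatouille", 90),
        ("Rat Race", 40),
        ("Ray", 50),
        ("Apocalypto", 30),
        ("Koyaanisqatsi", 10),
    ])
    print(ac.complete("rat"))
    print(ac.complete("koyaanisq"))
```

Because every keystroke produces a request, responses for short prefixes repeat heavily across users, which is consistent with the high front-end cache hit rate reported above.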
Slide11
Some Misspellings
shakespear
the incredables
seven samarai
breakfast at tiffiney
blazing sadles
selen scorupko
taeku
christopher walkin
return to lonsom dove
teh matrix
comdy tv
pirhana
dungens and dragons
pufi yami
al pachino
incredables
gundan seed mobile suit
chatterluy
white fany
to the rsecue
meet the faulkers
brigette joes diary
oh brother where are thou?
pirartes of the carr
Slide12
Switch from Phonetic to Fuzzy
Tested a dozen algorithms with users
250K queries per test cell
JaroWinkler slightly better than Levenshtein
JaroWinkler with 0.7 is a very, very broad match
  “koyaanisqatsi” matches “koy” (yuck!)
  but “1048” matches “1408”
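The two examples above can be checked with a small Jaro-Winkler sketch in Python. This follows the textbook definition (match window of half the longer length minus one, prefix bonus over at most four characters with scale 0.1) and is not Lucene's implementation. Treating 0.7 as a minimum-similarity cutoff is an assumption about how the slide's setting was applied.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


if __name__ == "__main__":
    CUTOFF = 0.7  # assumed to act as a minimum-similarity threshold
    for a, b in [("koy", "koyaanisqatsi"), ("1048", "1408")]:
        score = jaro_winkler(a, b)
        print(a, b, round(score, 3), score >= CUTOFF)
```

Both pairs clear the 0.7 cutoff: the three-letter prefix scores well against the full title because every one of its characters matches, while the transposed digits in "1048" and "1408" cost only a single transposition.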
Slide13
Problematic Corpus Behavior
Missing movies
  Ollie Hopnoodle’s Haven of Bliss
  CJ7
Hard-to-spell names
  Ratatouille
  Coraline
  Inglourious Basterds
Hard-to-remember names
  Click
  Apocalypto
  Seven Up Plus Seven
Slide14
Slide15
Metrics: MRR
Mean Reciprocal Rank
Weighted clickthrough, measured on site traffic
  #1 is a full click
  #2 is a half click
  #3 is one third click
  etc.
Daily, weekly, and seasonal variations
Overall customer satisfaction
Good for A/B tests, weak for finding bugs
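The click weighting above amounts to averaging 1/rank over searches. A minimal Python sketch follows; the function name and the choice to count a search with no click as zero are assumptions, since the slide does not say how abandoned searches are handled.

```python
def mean_reciprocal_rank(clicked_ranks):
    """Average of 1/rank over searches.

    clicked_ranks holds one entry per search: the 1-based rank of the
    clicked result, or None when nothing was clicked (assumed to score 0).
    """
    ranks = list(clicked_ranks)
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


if __name__ == "__main__":
    # One click each at ranks 1, 2, and 3, plus one search with no click.
    print(mean_reciprocal_rank([1, 2, 3, None]))
```

With this weighting, a click on #1 contributes 1, a click on #2 contributes 1/2, and a click on #3 contributes 1/3, so the sample above averages (1 + 1/2 + 1/3 + 0) / 4.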
Slide16
Per-query Metrics
Useful for finding problems
MRR
Clickthrough percent
Most-clicked rank (#1 is good)
Percentage of clicks on most-clicked
  known-item queries are over 80%
  categories are under 50%
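The four per-query metrics can be computed from a simple click log. The sketch below assumes a hypothetical log format of (query, clicked rank or None) pairs; the function name, dictionary keys, and sample data are illustrative and not taken from the talk.

```python
from collections import Counter, defaultdict


def per_query_metrics(log):
    """Summarise a click log given as (query, clicked_rank_or_None) pairs."""
    by_query = defaultdict(list)
    for query, rank in log:
        by_query[query].append(rank)

    report = {}
    for query, ranks in by_query.items():
        clicks = [r for r in ranks if r is not None]
        counts = Counter(clicks)
        if counts:
            # Most-clicked rank; ties go to the better (lower) rank.
            top_rank, top_count = min(
                counts.items(), key=lambda kv: (-kv[1], kv[0])
            )
        else:
            top_rank, top_count = None, 0
        report[query] = {
            "mrr": sum(1.0 / r for r in clicks) / len(ranks),
            "clickthrough_pct": 100.0 * len(clicks) / len(ranks),
            "most_clicked_rank": top_rank,
            "pct_clicks_on_most_clicked": (
                100.0 * top_count / len(clicks) if clicks else 0.0
            ),
        }
    return report


if __name__ == "__main__":
    # Invented sample: a known-item query and a category query.
    log = [("ratatouille", 1)] * 9 + [("ratatouille", 2)]
    log += [("comedy", 1), ("comedy", 3), ("comedy", 5), ("comedy", None)]
    for query, metrics in per_query_metrics(log).items():
        print(query, metrics)
```

In the invented sample the title query concentrates its clicks on one result while the category query spreads them out, the same pattern as the 80% and 50% figures above. A known-item query whose most-clicked rank is not #1, or whose clicks are scattered, is a candidate for investigation.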
Slide17
Success Looks Like …
MRR consistently over 0.5
85% of clicks on #1
Slide18
Questions?