Spelling Correction for Advertising: How “Noise” Can Help. Silviu Cucerzan, Microsoft Research (Text Mining, Search and Navigation). NISS Workshop on Computational Advertising, November 2009.
Slide1
Spelling Correction for Advertising: How “Noise” Can Help
Silviu Cucerzan
Microsoft Research
Text Mining, Search and Navigation
NISS Workshop on Computational Advertising, November 2009
Slide2
Buying Cheap(er) on eBay
Cannon 30d
Canon 30d
Not good for the sellers.
Not good for most buyers.
Not good for the middle man.
Slide3
Good Ads for Bad Queries
espresso machines
cingular wireless
Slide4
Is a Trusted Dictionary Enough?
Search: max payne chats and codes; new humwee pics
Music: selin dion color of my love; cristina aquillara
Shopping: pansonic dvd reorders; brita water filer
Help and Support: printer divers for window vista; insert flash flies into power point
Corrections: cheats, celine, colour, panasonic, recorders, filter, drivers, windows, files, powerpoint, christina aguilera
Slide5
Web Query Logs as Corpora
Web Search: over to 1 billion queries per day!
10-15% of the queries contain spelling errors
highly dynamic domain: many new names and concepts become popular every day
extremely difficult to maintain a high-coverage lexicon
difficult to define what a valid web query is
e.g.: divx, ecard, ipod, korn, xbox, zune, naboo, nimh, nsync, shrek, 5dmkii, tsx
The problem
The solution
Slide6
Problems To Be Handled
Concatenate and split
cheese cake factory → cheesecake factory
chat inspanich → chat in spanish
Recognize out-of-lexicon valid words
amd processors → amd processors
Change in-lexicon words to out-of-lexicon words
gun dam fighter → gundam fighter
Context-sensitive correction of out-of-lexicon words
power crd → power cord
video crd → video card
Context-sensitive correction of in-lexicon words
chicken sop → chicken soup
sop opera → soap opera
Slide7
An HMM Architecture for Spelling Correction
brita water filer
brita, brit, brit., brits, bria, trita
water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
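A lattice of this kind can be decoded with a standard Viterbi pass. The following is a minimal sketch rather than the system's actual model: the candidate lists are a small subset of the states on the slide, and the bigram counts in `BIGRAM_COUNTS`, the scoring function `bigram_score`, and the per-edit penalty are hypothetical values chosen only for illustration.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Unweighted Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

# Hypothetical query-log bigram counts (illustrative only, not real data).
BIGRAM_COUNTS = {
    ("brita", "water"): 120,
    ("water", "filter"): 900,
    ("water", "file"): 40,
    ("water", "filer"): 2,
    ("brit", "waiter"): 3,
    ("waiter", "file"): 1,
}

def bigram_score(prev_word: str, word: str) -> float:
    """Unnormalized transition score: log of an add-one smoothed count."""
    return math.log(BIGRAM_COUNTS.get((prev_word, word), 0) + 1)

def viterbi(words, candidates, edit_penalty=1.0):
    """Pick one candidate spelling per input word, maximizing the total score."""
    def emit(i, cand):
        # Assumed emission score: each edit away from the typed word costs edit_penalty.
        return -edit_penalty * edit_distance(words[i], cand)

    best = {c: emit(0, c) for c in candidates[0]}
    backpointers = []
    for i in range(1, len(words)):
        new_best, pointers = {}, {}
        for cand in candidates[i]:
            prev_word, score = max(
                ((p, s + bigram_score(p, cand)) for p, s in best.items()),
                key=lambda pair: pair[1],
            )
            new_best[cand] = score + emit(i, cand)
            pointers[cand] = prev_word
        best = new_best
        backpointers.append(pointers)

    # Trace the best final state back to the first word.
    path = [max(best, key=best.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

query = ["brita", "water", "filer"]
states = [
    ["brita", "brit", "brits"],
    ["water", "waiter", "wafer"],
    ["filer", "file", "filter", "filler"],
]
print(" ".join(viterbi(query, states)))  # prints: brita water filter
```

With these toy counts the strong "water filter" bigram outweighs the one-edit penalty for changing "filer", which is the context-sensitive behavior the slide illustrates.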
input query: brita water filer
states: all alternative spellings from the query log
Slide8
What about terrible misspellings?
input: arnol shwartzeggar
desired output: arnold schwarzenegger
unweighted edit distance: 5
Slide9
An Iterative Approach
Misspelled query: arnol shwartzeggar
First iteration: arnold schwartzneggar
Second iteration: arnold schwartzenegger
Third iteration: arnold schwarzenegger
Fourth iteration: arnold schwarzenegger (no more changes)
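The iteration above amounts to re-running a one-step speller on its own output until nothing changes. A minimal sketch follows; the function names and the lookup table `SLIDE_STEPS` are hypothetical. The table stands in for a real one-step corrector and simply replays the forms from this slide, assuming the third iteration already yields the final form, which the fourth leaves unchanged.

```python
from typing import Callable, List

def correct_iteratively(query: str,
                        correct_once: Callable[[str], str],
                        max_iterations: int = 10) -> List[str]:
    """Apply a one-step corrector until its output stops changing."""
    history = [query]
    for _ in range(max_iterations):
        suggestion = correct_once(history[-1])
        if suggestion == history[-1]:  # no more changes: fixed point reached
            break
        history.append(suggestion)
    return history

# Hypothetical stand-in for a real speller: a table replaying the slide's iterations.
SLIDE_STEPS = {
    "arnol shwartzeggar": "arnold schwartzneggar",
    "arnold schwartzneggar": "arnold schwartzenegger",
    "arnold schwartzenegger": "arnold schwarzenegger",
}

def toy_corrector(q: str) -> str:
    # Unknown queries are returned unchanged, which ends the loop.
    return SLIDE_STEPS.get(q, q)

history = correct_iteratively("arnol shwartzeggar", toy_corrector)
print(history[-1])  # prints: arnold schwarzenegger
```

The `max_iterations` cap is a safety assumption of this sketch, guarding against a corrector that oscillates between two forms.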
Speller output: arnold schwarzenegger
Slide10
hunny moon
honemoon 8
honemoons 3
honeybeemon 3
honeymonn 14
honeymoon 19019
honeymoon's 12
honeymooner 3
honeymooner's 6
honeymooners 771
honeymooning 29
honeymoonitis 6
honeymoons 5259
honneymoon 6
honneymoons 9
honnymoon 4
honoeymoon 3
honymoon 19
huneymoon 10
honey moon 333
honey moon's 5
honey mooners 34
honey moons 136
honney moon 6
hony moon 4
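One way to read these statistics is as a graph in which each correction step moves to a nearby, more frequent form. The sketch below uses the counts from this slide; the helper `best_within` and the radius of two edits per step are assumptions made for illustration, and the intermediate form it passes through comes from this toy procedure, not from the slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Unweighted Levenshtein distance; the space counts as an ordinary character."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

# Query-log counts as listed on the slide.
LOG_COUNTS = {
    "honemoon": 8, "honemoons": 3, "honeybeemon": 3, "honeymonn": 14,
    "honeymoon": 19019, "honeymoon's": 12, "honeymooner": 3,
    "honeymooner's": 6, "honeymooners": 771, "honeymooning": 29,
    "honeymoonitis": 6, "honeymoons": 5259, "honneymoon": 6,
    "honneymoons": 9, "honnymoon": 4, "honoeymoon": 3, "honymoon": 19,
    "huneymoon": 10, "honey moon": 333, "honey moon's": 5,
    "honey mooners": 34, "honey moons": 136, "honney moon": 6,
    "hony moon": 4,
}

def best_within(form: str, counts: dict, radius: int = 2) -> str:
    """Most frequent logged form within `radius` edits of `form`, else `form` itself."""
    best, best_count = form, counts.get(form, 0)
    for other, count in counts.items():
        if count > best_count and edit_distance(form, other) <= radius:
            best, best_count = other, count
    return best

step1 = best_within("hunny moon", LOG_COUNTS)
step2 = best_within(step1, LOG_COUNTS)
print(step1, "->", step2)  # prints: honey moon -> honeymoon
```

Because "honeymoon" is three edits from "hunny moon", a two-edit radius cannot reach it directly; a second step from the more frequent "honey moon" does.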
Iterative spelling correction process: hunny moon → honeymoon
Search Query Log Statistics
Some Intuition
Slide11
Basic Assumptions about the “Noise”
query logs contain a lot of different misspellings for most words
the better spelled a word form, the more frequent it is
the correct forms are much more frequent than their misspellings
Slide12
Another Example
albert einstein 4834
albert einstien 525
albert einstine 149
albert einsten 27
albert einsteins 25
albert einstain 11
albert einstin 10
albert eintein 9
albeart einstein 6
aolbert einstein 6
alber einstein 4
albert einseint 3
albert einsteirn 3
albert einsterin 3
albert eintien 3
alberto einstein 3
albrecht einstein 3
alvert einstein 3
Slide13
Concatenation and Splitting
Store word unigrams and bigrams in the same searchable trie structure.
Find alternative spellings for the input words in this common structure.
Slide14
Avoid Changing the User’s Intent
brita water filer
brita, brit, brit., brits, bria, trita
water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
brit waiter file
Slide15
Modified Viterbi Search – Fringes
e.g.: water filer → waiter file
k1
k2
k1+k2 paths
in-lexicon words
Slide16
Modified Viterbi Search – Stop words
e.g.: lord of teh rigs → lord of the rings
Slide17
Evaluation
                    All queries   Valid   Misspelled
Nr. queries         1044          864     180
Full system         81.8          84.8    67.2
No lexicon          70.3          72.2    61.1
No query log        77.0          82.1    52.8
All edits equal     80.4          83.3    66.1
Unigrams only       54.7          57.4    41.7
1 iteration only    80.9          88.0    47.2
2 iterations only   81.3          84.4    66.7
No fringes          80.6          83.3    67.2
Slide18
A Closer Look at the Results
81.8% overall agreement with the annotators
Errors:
alternative queries for valid queries
many false positives are reasonable suggestions
e.g. cowboy robes → cowboy ropes
alternative queries for misspelled queries
some suggestions could be valid (user’s intent not known)
e.g. massanger → massager / messenger
annotator inter-agreement rate: 91.3%
Slide19
Evaluation – When we “know” user’s intent
368 queries
(audio flie, audio file) → audio file
(bueavista, buena vista) → buena vista
(carrabean nooms, carrabean rooms) → caribbean rooms
Full system         73.1
No lexicon          59.2
No query log        44.9
All edits equal     69.9
Unigrams only       43.0
1 iteration only    45.5
2 iterations only   68.2
No fringes          71.0
Slide20
Learning Curve
Silviu Cucerzan and Eric Brill – “Spelling correction as an iterative process that exploits the collective knowledge of web users”, EMNLP 2004