Spelling - PowerPoint Presentation

Uploaded On 2015-10-04

Spelling Correction for Advertising: How “Noise” Can Help. Silviu Cucerzan, Microsoft Research, Text Mining, Search and Navigation. NISS Workshop on Computational Advertising, November 2009.

Presentation Transcript

Slide1

Spelling Correction for Advertising: How “Noise” Can Help

Silviu Cucerzan
Microsoft Research
Text Mining, Search and Navigation

NISS Workshop on Computational Advertising, November 2009

Slide2

Buying Cheap(er) on eBay

Cannon 30d

Canon 30d

Not good for the sellers.
Not good for most buyers.
Not good for the middle man.

Slide3

Good Ads for Bad Queries

espresso machines

cingular wireless

Slide4

Is a Trusted Dictionary Enough?

Search: max payne chats and codes; new humwee pics
Music: selin dion color of my love; cristina aquillara
Shopping: pansonic dvd reorders; brita water filer
Help and Support: printer divers for window vista; insert flash flies into power point

Corrections: cheats, celine, colour, panasonic, recorders, filter, drivers, windows, files, powerpoint, christina aguilera

Slide5

Web Query Logs as Corpora

Web Search: over 1 billion queries per day!

10-15% of the queries contain spelling errors
highly dynamic domain: many new names and concepts become popular every day
extremely difficult to maintain a high-coverage lexicon
difficult to define what a valid web query is
e.g.: divx, ecard, ipod, korn, xbox, zune, naboo, nimh, nsync, shrek, 5dmkii, tsx

The problem
The solution

Slide6

Problems To Be Handled

Concatenate and split
cheese cake factory → cheesecake factory
chat inspanich → chat in spanish

Recognize out-of-lexicon valid words
amd processors → amd processors

Change in-lexicon words to out-of-lexicon words
gun dam fighter → gundam fighter

Context-sensitive correction of out-of-lexicon words
power crd → power cord
video crd → video card

Context-sensitive correction of in-lexicon words
chicken sop → chicken soup
sop opera → soap opera

Slide7

An HMM Architecture for Spelling Correction

input query: brita water filer

states: all alternative spellings from the query log

brita: brita, brit, brit., brits, bria, trita
water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier

Slide8
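A minimal sketch of the idea on the preceding slide (An HMM Architecture for Spelling Correction), assuming a plain first-order Viterbi decoder: each input word contributes a column of candidate spellings (the states), and the decoder returns the best-scoring path through the columns. The candidate lists are abbreviated from the slide. The scoring functions `emit` and `trans`, the `BIGRAMS` table, and every number in it are hypothetical stand-ins, not the models used in the talk.

```python
import math

# Candidate states per input word (abbreviated from the slide).
CANDIDATES = [
    ["brita", "brit", "brits"],
    ["water", "waiter", "wafer"],
    ["filer", "filter", "file"],
]

# Hypothetical bigram counts standing in for query-log statistics.
BIGRAMS = {("brita", "water"): 50, ("water", "filter"): 40, ("water", "filer"): 1}


def emit(observed, candidate):
    """Hypothetical channel score: keeping the typed word is cheaper than changing it."""
    return 0.0 if observed == candidate else -1.0


def trans(prev, cur):
    """Hypothetical transition score: log of an add-one smoothed bigram count."""
    return math.log(BIGRAMS.get((prev, cur), 0) + 1)


def viterbi(query, candidates):
    # best[c] = (score, path) of the best path that ends in candidate c
    best = {c: (emit(query[0], c), [c]) for c in candidates[0]}
    for word, column in zip(query[1:], candidates[1:]):
        nxt = {}
        for cur in column:
            score, path = max(
                (s + trans(prev, cur), p) for prev, (s, p) in best.items()
            )
            nxt[cur] = (score + emit(word, cur), path + [cur])
        best = nxt
    return max(best.values())[1]


print(" ".join(viterbi(["brita", "water", "filer"], CANDIDATES)))
```

With these toy scores the strong "water filter" bigram outweighs the penalty for changing "filer", so the decoder changes only the last word.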

What about terrible misspellings?

input: arnol shwartzeggar
desired output: arnold schwarzenegger

unweighted edit distance: 5

Slide9
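For reference, a short sketch of an unweighted edit distance, assuming the term on the preceding slide means standard Levenshtein distance (insertions, deletions, and substitutions each cost 1). The function name `edit_distance` is illustrative. Under that assumption the surname alone is five edits from its correct form.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


print(edit_distance("arnol", "arnold"))
print(edit_distance("shwartzeggar", "schwarzenegger"))
```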

An Iterative Approach

Misspelled query: arnol shwartzeggar
First iteration: arnold schwartzneggar
Second iteration: arnold schwartzenegger
Third iteration: arnold schwarzenegger
Fourth iteration: arnold schwarzenegger (no more changes)

Speller output: arnold schwarzenegger

Slide10
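The iterative approach on the preceding slide can be sketched as a fixed-point loop: apply a single-pass corrector repeatedly and stop when its output no longer changes. The lookup table `ONE_STEP` below is a toy stand-in for a real single-pass corrector, filled with the intermediate forms from the slide; the names `correct_once` and `iterate` and the `max_iters` cap are assumptions made for illustration.

```python
# Toy single-pass corrector: one small correction per call (forms from the slide).
ONE_STEP = {
    "arnol shwartzeggar": "arnold schwartzneggar",
    "arnold schwartzneggar": "arnold schwartzenegger",
    "arnold schwartzenegger": "arnold schwarzenegger",
}


def correct_once(query: str) -> str:
    """Return the corrected query, or the query itself if nothing changes."""
    return ONE_STEP.get(query, query)


def iterate(query: str, max_iters: int = 10) -> list:
    """Apply correct_once until a fixed point; return every intermediate form."""
    history = [query]
    for _ in range(max_iters):
        nxt = correct_once(history[-1])
        if nxt == history[-1]:  # no more changes: this is the speller output
            break
        history.append(nxt)
    return history


for step, form in enumerate(iterate("arnol shwartzeggar")):
    print(step, form)
```

Each pass only needs to make a small, well-supported change, which is how a query that is many edits from its target can still be reached.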

Some Intuition

hunny moon → (iterative spelling correction process) → honeymoon

Search Query Log Statistics

honemoon          8
honemoons         3
honeybeemon       3
honeymonn         14
honeymoon         19019
honeymoon's       12
honeymooner       3
honeymooner's     6
honeymooners      771
honeymooning      29
honeymoonitis     6
honeymoons        5259
honneymoon        6
honneymoons       9
honnymoon         4
honoeymoon        3
honymoon          19
huneymoon         10
honey moon        333
honey moon's      5
honey mooners     34
honey moons       136
honney moon       6
hony moon         4

Slide11

Basic Assumptions about the “Noise”

query logs contain a lot of different misspellings for most words
the better spelled a word form, the more frequent it is
the correct forms are much more frequent than their misspellings

Slide12

Another Example

albert einstein     4834
albert einstien     525
albert einstine     149
albert einsten      27
albert einsteins    25
albert einstain     11
albert einstin      10
albert eintein      9
albeart einstein    6
aolbert einstein    6
alber einstein      4
albert einseint     3
albert einsteirn    3
albert einsterin    3
albert eintien      3
alberto einstein    3
albrecht einstein   3
alvert einstein     3

Slide13
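The assumptions about the noise can be checked against the counts on the preceding slide. The sketch below copies the first eight rows and ranks the forms by frequency; the variable names are illustrative and nothing here is taken from the speaker's implementation.

```python
# Query-log counts copied from the first eight rows of the slide.
COUNTS = {
    "albert einstein": 4834,
    "albert einstien": 525,
    "albert einstine": 149,
    "albert einsten": 27,
    "albert einsteins": 25,
    "albert einstain": 11,
    "albert einstin": 10,
    "albert eintein": 9,
}

# Rank the observed forms from most to least frequent.
ranked = sorted(COUNTS, key=COUNTS.get, reverse=True)
top, runner_up = ranked[0], ranked[1]

# How dominant is the most frequent form?
ratio = COUNTS[top] / COUNTS[runner_up]
share = COUNTS[top] / sum(COUNTS.values())

print(f"most frequent form: {top}")
print(f"ratio to the next form: {ratio:.1f}")
print(f"share of these eight forms: {share:.1%}")
```

The most frequent form is the correctly spelled one, and it is roughly nine times as frequent as the closest misspelling in this sample.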

Concatenation and Splitting

Store word unigrams and bigrams in the same searchable trie structure.
Find alternative spellings for the input words in this common structure.

Slide14
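A simplified sketch of the shared structure described on the preceding slide: unigrams and bigrams (stored with their internal space) live in one trie, so a concatenation or a split of the input is found by the same lookup. This version does exact matching only, whereas the slide describes searching for alternative spellings; the class and function names and all counts are hypothetical.

```python
class Trie:
    def __init__(self):
        self.children = {}
        self.count = 0  # a count above zero marks the end of a stored entry

    def add(self, entry, count):
        node = self
        for ch in entry:
            node = node.children.setdefault(ch, Trie())
        node.count = count

    def lookup(self, entry):
        node = self
        for ch in entry:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count


def build(unigrams, bigrams):
    """Store unigrams and space-separated bigrams in the same trie."""
    trie = Trie()
    for table in (unigrams, bigrams):
        for entry, count in table.items():
            trie.add(entry, count)
    return trie


def splits(trie, word):
    """Every way to split one word into a stored bigram."""
    candidates = (word[:i] + " " + word[i:] for i in range(1, len(word)))
    return [c for c in candidates if trie.lookup(c)]


def joins(trie, left, right):
    """The concatenation of two adjacent words, if it is a stored unigram."""
    joined = left + right
    return [joined] if trie.lookup(joined) else []


# Hypothetical counts, for illustration only.
trie = build({"cheesecake": 120, "chat": 300}, {"in spanish": 80})
print(joins(trie, "cheese", "cake"))
print(splits(trie, "inspanish"))
```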

Avoid Changing the User’s Intent

brita water filer

brita: brita, brit, brit., brits, bria, trita
water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier

brit waiter file

Slide15

Modified Viterbi Search – Fringes

e.g.: water filer → waiter file

k1·k2 → k1+k2 paths

in-lexicon words

Slide16
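One possible reading of the k1·k2 → k1+k2 remark on the preceding slide, stated here as an assumption rather than as the speaker's definition: if two adjacent in-lexicon words may not both change in the same pass, then of the k1·k2 candidate pairs only those that keep at least one typed word remain. The sketch enumerates both sets for toy candidate lists. Under this reading the exact count is k1+k2−1, which grows linearly as on the slide.

```python
from itertools import product


def all_pairs(left, right):
    """Unrestricted search: every candidate for one word with every candidate for the next."""
    return list(product(left, right))


def fringe_pairs(left, right):
    """Restricted search (assumed reading): at least one word keeps its typed form.

    By convention here, index 0 of each list is the form the user typed.
    """
    typed_left, typed_right = left[0], right[0]
    return [
        (a, b)
        for a, b in product(left, right)
        if a == typed_left or b == typed_right
    ]


water = ["water", "waiter", "wafer", "wager"]          # k1 = 4
filer = ["filer", "file", "filter", "fiber", "finer"]  # k2 = 5

print(len(all_pairs(water, filer)), len(fringe_pairs(water, filer)))
print(("waiter", "file") in fringe_pairs(water, filer))
```

In this sketch the pair (waiter, file) is excluded because both words would change at once, while (water, filter) is still reachable.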

Modified Viterbi Search – Stop words

e.g.: lord of teh rigs → lord of the rings

Slide17

Evaluation

                     All queries   Valid   Misspelled
Nr. queries          1044          864     180
Full system          81.8          84.8    67.2
No lexicon           70.3          72.2    61.1
No query log         77.0          82.1    52.8
All edits equal      80.4          83.3    66.1
Unigrams only        54.7          57.4    41.7
1 iteration only     80.9          88.0    47.2
2 iterations only    81.3          84.4    66.7
No fringes           80.6          83.3    67.2

Slide18
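As a consistency check on the evaluation table on the preceding slide, the "All queries" column should equal the average of the "Valid" and "Misspelled" columns weighted by their query counts (864 and 180). The sketch below recomputes it for every row; the figures are copied from the table and the helper name `weighted` is illustrative.

```python
N_VALID, N_MISSPELLED = 864, 180

# (all queries, valid, misspelled) agreement percentages from the table.
ROWS = {
    "Full system": (81.8, 84.8, 67.2),
    "No lexicon": (70.3, 72.2, 61.1),
    "No query log": (77.0, 82.1, 52.8),
    "All edits equal": (80.4, 83.3, 66.1),
    "Unigrams only": (54.7, 57.4, 41.7),
    "1 iteration only": (80.9, 88.0, 47.2),
    "2 iterations only": (81.3, 84.4, 66.7),
    "No fringes": (80.6, 83.3, 67.2),
}


def weighted(valid, misspelled):
    """Average of the two subset scores, weighted by subset size."""
    total = N_VALID + N_MISSPELLED
    return (N_VALID * valid + N_MISSPELLED * misspelled) / total


for name, (overall, valid, misspelled) in ROWS.items():
    print(f"{name:18s} reported {overall:5.1f}  recomputed {weighted(valid, misspelled):6.2f}")
```

Every recomputed value agrees with the reported one to within 0.1 percentage points, which is consistent with rounding in the per-subset figures.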

A Closer Look at the Results

81.8% overall agreement with the annotators

Errors:
alternative queries for valid queries: many false positives are reasonable suggestions
e.g. cowboy robes → cowboy ropes
alternative queries for misspelled queries: some suggestions could be valid (user’s intent not known)
e.g. massanger → massager / messenger

annotator inter-agreement rate: 91.3%

Slide19

Evaluation – When we “know” user’s intent

368 queries

Full system          73.1
No lexicon           59.2
No query log         44.9
All edits equal      69.9
Unigrams only        43.0
1 iteration only     45.5
2 iterations only    68.2
No fringes           71.0

(audio flie, audio file) → audio file
(bueavista, buena vista) → buena vista
(carrabean nooms, carrabean rooms) → caribbean rooms

Slide20

Learning Curve

Silviu Cucerzan and Eric Brill – “Spelling correction as an iterative process that exploits the collective knowledge of web users”, EMNLP 2004