Web Scale NLP: A Case Study on URL Word Breaking

Presentation Transcript

Web Scale NLP: A Case Study on URL Word Breaking
Kuansan Wang, Chris Thrasher, Bo-June (Paul) Hsu
Microsoft Research, Redmond, USA
WWW 2011, March 31, 2011

More Data > Complex Model
Banko and Brill. Mitigating the Paucity-of-Data Problem. HLT '01

More Data > Complex Model
CIKM '08: There is no data like more data?

NLP for the Web
- Scale of the Web: avoid manual intervention; efficient implementations
- Dynamic nature of the Web: fast adaptation
- Global reach of the Web: need rudimentary multi-lingual capabilities
- Diverse language styles of Web contents: multi-style language models
Simple models with matched data!

Outline: Web-Scale NLP, Word Breaking, Models, Evaluation, Conclusion

Word Breaking
Large Data + Simple Model (Norvig, CIKM 2008)
- Use a unigram model to rank all possible segmentations
- Pretty good, but with occasional embarrassing outcomes
- More data does not help!
- Extension to trigram alleviates the problem
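
A minimal sketch of this unigram ranking idea, in Python. The probability table and OOV penalty are toy placeholders invented for illustration; a real system would back `word_logprob` with web-scale counts.

```python
# Toy sketch of unigram-based word breaking in the spirit of Norvig (CIKM 2008).
# UNIGRAM_LOGPROB and OOV_LOGPROB are illustrative placeholders, not real counts.
import functools

UNIGRAM_LOGPROB = {"24": -6.0, "7": -5.5, "moms": -8.0}
OOV_LOGPROB = -20.0  # flat penalty for tokens outside the toy vocabulary

def word_logprob(word):
    return UNIGRAM_LOGPROB.get(word, OOV_LOGPROB)

@functools.lru_cache(maxsize=None)
def segment(text):
    """Return (log-probability, best segmentation) of `text` under the unigram model."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_lp, tail_seg = segment(tail)
        candidates.append((word_logprob(head) + tail_lp, (head,) + tail_seg))
    return max(candidates)

print(segment("247moms"))  # (-19.5, ('24', '7', 'moms'))
```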

Word Breaking for the Web
Web URLs exhibit a variety of language styles… …and in different languages.
Matched data is crucial to accuracy!

Outline: Web-Scale NLP, Word Breaking, Models, Evaluation, Conclusion

MAP Decision Rule
Special case of Bayesian minimum risk; used in speech, MT, parsing, tagging, information retrieval, …
Problem: given the observation y, find x* = argmax_x P(x | y) = argmax_x P(y | x) P(x)
- P(y | x): transformation model
- P(x): prior
Noisy-channel view: signal x passes through a distorting channel and is seen as observation y.

MAP for Word Breaker
- Observation y: Twitter hashtag or URL domain name, e.g. 247moms, w84um8
- Signal x: what the user meant to say, e.g. 24_7_moms, w8_4_u_m8 (wait for you mate)
The channel transforms the intended signal x into the observed output y.
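
To make the transformation model concrete, here is a hedged Python sketch of scoring one candidate interpretation with log P(x) + log P(y | x), where the channel maps intended words onto texting-style surface tokens. The probability tables and the word-by-word alignment are illustrative assumptions, not the paper's models.

```python
import math

# Toy prior over intended words: P(x) as per-word log-probabilities.
PRIOR_LOGPROB = {"wait": -7.0, "for": -4.0, "you": -5.0, "mate": -9.0}

# Toy channel P(y | x): how likely a word is rendered as a given surface token.
CHANNEL_LOGPROB = {
    ("wait", "w8"): math.log(0.3),
    ("for", "4"): math.log(0.2),
    ("you", "u"): math.log(0.4),
    ("mate", "m8"): math.log(0.3),
}

def map_score(candidate_words, surface_tokens):
    """log P(x) + log P(y | x) for a word-by-word alignment of x to y."""
    score = 0.0
    for word, token in zip(candidate_words, surface_tokens):
        score += PRIOR_LOGPROB.get(word, -20.0)
        if word != token:
            score += CHANNEL_LOGPROB.get((word, token), -20.0)
    return score

# One candidate reading of the observation "w84um8", split as w8 | 4 | u | m8:
print(map_score(["wait", "for", "you", "mate"], ["w8", "4", "u", "m8"]))
```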

Plug-in MAP Problem
- MAP decision rule is optimal only if P(y | x) and P(x) are the "correct" underlying distributions
- Adjustments are needed when the estimated models have unknown errors
- Simple logarithmic interpolation: x* = argmax_x [ λ1 log P(y | x) + λ2 log P(x) ]
- "Random field"/machine learning approaches
- Bayesian: point estimation is outdated; assume parameters are drawn from "some" distribution
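
A small sketch of the logarithmic interpolation idea: weight the two estimated models instead of trusting their plain product. The weight below is arbitrary and purely illustrative; in practice it would be tuned on held-out data.

```python
def interpolated_score(channel_logprob, prior_logprob, lam=0.7):
    """Weighted combination: lam * log P^(y|x) + (1 - lam) * log P^(x)."""
    return lam * channel_logprob + (1.0 - lam) * prior_logprob

# Two candidate segmentations of the same observation, scored under the weights:
print(interpolated_score(-3.2, -18.0))  # candidate A
print(interpolated_score(-1.0, -25.0))  # candidate B
```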

Baseline Methods
- GM: Geometric Mean (Koehn and Knight, 2003); widely used, especially in MT systems
- BI: Binomial Model (Venkataraman, 2001)
- WL: Word Length Normalization (Khaitan et al., 2009)
All are special cases/variations of MAP.
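
For intuition, here are Python sketches of two of these scoring flavors: a geometric-mean score over per-word unigram probabilities and a word-length-normalized score. These are simplified illustrations of the general ideas, not the exact formulas of the cited papers.

```python
def gm_score(word_logprobs):
    """Geometric mean of word probabilities = arithmetic mean of the log-probs."""
    return sum(word_logprobs) / len(word_logprobs)

def wl_score(word_logprobs, words):
    """Length-normalized score: weight each word's log-prob by its share of characters."""
    total_chars = sum(len(w) for w in words)
    return sum(lp * len(w) / total_chars for lp, w in zip(word_logprobs, words))

words = ["24", "7", "moms"]
logprobs = [-6.0, -5.5, -8.0]
print(gm_score(logprobs), wl_score(logprobs, words))
```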

Proposed Method
ME: Maximum Entropy Principle Model; special case of BI and WL (uniform), using Microsoft Web N-gram
Microsoft Web N-gram (http://web-ngram.research.microsoft.com)
- Web documents/Bing queries (EN-US market)
- Rudimentary multilingual (NAACL 10)
- Frequent updates (ICASSP 09)
- Multi-style language model (WWW 10, SIGIR 10)

          Body     Title    Anchor   Query
1-gram    1.2 B    60 M     150 M    252 M
5-gram    237 B    3.8 B    8.9 B    -
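
A hedged sketch of how the style choice plugs in: the scoring machinery stays the same and only the source of P(x) changes. The per-style lookup table below is a made-up stand-in for the Body/Title/Anchor/Query streams of the Web N-gram service; the service's actual API is not shown here.

```python
# Hypothetical per-style unigram log-probabilities standing in for the
# multi-style language model streams; all numbers are invented for illustration.
STYLE_LOGPROB = {
    "title": {"24": -5.0, "7": -4.8, "moms": -7.2},
    "body":  {"24": -6.0, "7": -5.5, "moms": -9.5},
}

def sequence_logprob(words, style="title", oov=-20.0):
    """Sum per-word log-probs under the chosen style's table."""
    table = STYLE_LOGPROB[style]
    return sum(table.get(w, oov) for w in words)

print(sequence_logprob(["24", "7", "moms"], style="title"))
print(sequence_logprob(["24", "7", "moms"], style="body"))
```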

Outline: Web-Scale NLP, Word Breaking, Models, Evaluation, Conclusion

Data Set
- 100K randomly sampled URLs indexed by Bing
- Simple tokenization: 266K unique tokens, mostly ASCII characters
- Metric: Precision@3
- Manually labeled word breaks; multiple answers are allowed
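
A minimal sketch of the Precision@3 metric as described above: a URL counts as correct when any of its (possibly multiple) accepted word breaks appears among the system's top three hypotheses. The data structures are illustrative, not the actual evaluation harness.

```python
def precision_at_3(system_top3, gold_answers):
    """system_top3: {url: [seg1, seg2, seg3]}; gold_answers: {url: {accepted segs}}."""
    hits = sum(
        1
        for url, top3 in system_top3.items()
        if any(seg in gold_answers.get(url, set()) for seg in top3[:3])
    )
    return hits / len(system_top3)

system = {"247moms.com": ["24 7 moms", "247 moms", "2 47 moms"]}
gold = {"247moms.com": {"24 7 moms", "24/7 moms"}}
print(precision_at_3(system, gold))  # 1.0
```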

Language Model Style
- Title is best although Body is 100x larger
- Navigational queries often word-split URLs, but Query is worse than Title
- Matched style is crucial to precision!

Model Complexity
- With mismatched data, model choice is crucial
- With matched data, complex models do not help
- A simple model is sufficient with matched data!

Outline: Web-Scale NLP, Word Breaking, Models, Evaluation, Conclusion

Best = Right Data + Smart Model
- Style of language trumps size of data: there is no data like more data… provided it's matched data!
- Right data alleviates the plug-in MAP problem: complicated machine learning artillery is not required; simple methods suffice
- Smart model gives us: rudimentary multi-lingual capability, fast inclusion of new words/phrases, and no need for human labor in data labeling
http://research.microsoft.com/en-us/um/people/kuansanw/wordbreaker/

Backup Slides

Note: BI, WL are oracle results.

GM        1-gram    2-gram    3-gram
Body      59.01%    44.68%    44.78%
Title     61.55%    60.31%    58.70%
Anchor    60.46%    55.25%    54.84%
Query     54.83%    54.27%    54.83%