Evidence from Content
INST 734
Module 2
Doug Oard
Agenda
Character sets
Terms as units of meaning
Boolean retrieval
Building an index
Segmentation
Retrieval is (often) a search for concepts
But what we actually search are character strings
What strings best represent concepts?
In English, words are often a good choice
But well-chosen phrases might also be helpful
In German, compounds may need to be split
Otherwise queries using constituent words would fail
In Chinese, word boundaries are not marked
Thissegmentationproblemissimilartothatofspeech
Tokenization
Words (from linguistics): morphemes are the units of meaning, combined to make words
Anti- (disestablishmentarian) -ism
Tokens (from computer science):
Doug ’s running late !
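The token view above (splitting off clitics like “’s” and trailing punctuation) can be sketched with a minimal regex tokenizer; this is a toy sketch, and real tokenizers handle hyphens, abbreviations, URLs, and many more cases:

```python
import re

def tokenize(text):
    # Split into word tokens, clitics like 's, and punctuation marks.
    # A minimal sketch; production tokenizers use many more rules.
    return re.findall(r"\w+|'\w+|[^\w\s]", text)

tokenize("Doug's running late!")  # -> ['Doug', "'s", 'running', 'late', '!']
```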
Morphology
Inflectional morphology
Preserves part of speech
Destructions = Destruction + PLURAL
Destroyed = Destroy + PAST
Derivational morphology
Relates parts of speech
Destructor = AGENTIVE(destroy)
Stemming
Conflates words, usually preserving meaning
Rule-based suffix-stripping helps for English
{destroy, destroyed, destruction}: destr
Prefix-stripping is needed in some languages
Arabic: {alselam}: selam [Root: SLM (peace)]
Imperfect, but goal is to usually be helpful
Overstemming: {centennial, century, center}: cent
Understemming: {acquire, acquiring, acquired}: acquir, but {acquisition}: acquis
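Rule-based suffix stripping can be sketched in a few lines. The suffix list below is contrived purely to reproduce the conflation and overstemming examples above; real stemmers such as Porter’s apply dozens of ordered, conditioned rules:

```python
def simple_stem(word):
    # Toy suffix stripper: remove the first (longest-listed) matching
    # suffix. The suffix list is invented for these examples only.
    for suffix in ("uction", "ennial", "oyed", "ury", "er", "oy"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Conflation:   {destroy, destroyed, destruction} -> destr
# Overstemming: {centennial, century, center}     -> cent
```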
Segmentation Option 1: Longest Substring Segmentation
Greedy algorithm based on a lexicon
Start with a list of every possible term
For each unsegmented string:
Remove the longest single substring in the list
Repeat until no substrings are found in the list
Longest Substring Example
Possible German compound term:
washington
List of German words:
ach, hin, hing, sei, ton, was, wasch
Longest substring segmentation
was-hing-ton
Roughly translates as “What tone is attached?”
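The greedy algorithm can be sketched recursively: find the longest lexicon word occurring anywhere in the string, remove it, and recurse on the fragments left on either side. Run on the German word list above, it reproduces the was-hing-ton example:

```python
def longest_substring_segment(s, lexicon):
    # Greedy: remove the longest lexicon substring, then recurse on
    # the fragments to its left and right.
    best = None
    for word in sorted(lexicon, key=len, reverse=True):
        pos = s.find(word)
        if pos != -1:
            best = (pos, word)
            break
    if best is None:
        return [s] if s else []  # no lexicon word found; keep fragment
    pos, word = best
    return (longest_substring_segment(s[:pos], lexicon)
            + [word]
            + longest_substring_segment(s[pos + len(word):], lexicon))

lexicon = {"ach", "hin", "hing", "sei", "ton", "was", "wasch"}
longest_substring_segment("washington", lexicon)  # -> ['was', 'hing', 'ton']
```

“wasch” (5 letters) does not occur in “washington”, so the greedy step first removes “hing”, leaving “was” and “ton” as fragments.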
Segmentation Option 2: Most Likely Segmentation
Given an input word c1 c2 c3 … cn
Try all possible partitions into w1 w2 w3 …:
c1 | c2 c3 … cn
c1 c2 | c3 … cn
c1 c2 c3 | … cn
…
Choose the highest probability partition
Based on word combinations seen in segmented text (we will call this a “language model” next week)
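A minimal sketch of this idea, using invented unigram log-probabilities (a toy stand-in for a language model learned from segmented text) and memoized recursion so that all partitions are scored without exponential recomputation:

```python
import math
from functools import lru_cache

# Toy unigram log-probabilities; in practice these would be estimated
# from a corpus of segmented text. All values here are invented.
WORD_LOGPROB = {
    "was": math.log(0.02), "hing": math.log(0.001),
    "ton": math.log(0.005), "wash": math.log(0.01),
    "washing": math.log(0.008), "ington": math.log(0.0001),
}

def most_likely_segmentation(s):
    # Try all partitions of s into known words; return the partition
    # whose words have the highest total log-probability.
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(s):
            return (0.0, [])
        candidates = []
        for j in range(i + 1, len(s) + 1):
            w = s[i:j]
            if w in WORD_LOGPROB:
                score, rest = best(j)
                if rest is not None:
                    candidates.append((WORD_LOGPROB[w] + score, [w] + rest))
        if not candidates:
            return (float("-inf"), None)  # no valid segmentation from i
        return max(candidates, key=lambda c: c[0])
    return best(0)[1]
```

With these toy probabilities, `most_likely_segmentation("washington")` prefers “washing” + “ton” over the greedy “was” + “hing” + “ton”, illustrating how the probability-based choice can differ from longest-substring matching.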
Generic Segmentation: N-gram Indexing
Consider a Chinese document c1 c2 c3 … cn
Don’t commit to any single segmentation
Instead, treat every character bigram as a term:
c1c2, c2c3, c3c4, …, cn-1cn
Break up queries the same way
Bad matches will happen, but rarely
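Extracting every overlapping character bigram is a one-liner; the Chinese example string below is illustrative:

```python
def character_bigrams(text):
    # Every overlapping character bigram: c1c2, c2c3, ..., c(n-1)cn.
    return [text[i:i + 2] for i in range(len(text) - 1)]

character_bigrams("信息检索")  # -> ['信息', '息检', '检索']
```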
Relating Words and Concepts
Homonymy: bank (river) vs. bank (financial)
Different words are written the same way
We’d like to work with word senses rather than words
Polysemy: fly (pilot) vs. fly (passenger)
A word can have different “shades of meaning”
Not bad for IR: often helps more than it hurts
Synonymy: class vs. course
Causes search failures … we’ll address this next week!
Phrases
Phrases can yield more precise queries
“University of Maryland”, “solar eclipse”
We would not usually index only phrases
Because people might search for parts of a phrase
Positional indexes allow query-time phrase matching
At the cost of a larger index and slower queries
IR systems are good at evidence combination
Better evidence combination means less help from phrases
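Query-time phrase matching with a positional index can be sketched as follows: the index maps each term to per-document position lists, and a phrase matches where its terms occur at consecutive positions. The documents and doc ids are illustrative:

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id: [positions]}; positions enable phrase matching.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, phrase):
    # Return doc ids where the phrase's terms appear consecutively.
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, positions in index[terms[0]].items():
        for p in positions:
            if all(doc_id in index[t] and p + k in index[t][doc_id]
                   for k, t in enumerate(terms[1:], start=1)):
                hits.add(doc_id)
                break
    return hits

docs = {1: "University of Maryland offers courses",
        2: "Maryland University of commerce"}
index = build_positional_index(docs)
phrase_match(index, "University of Maryland")  # -> {1}
```

Note the trade-off from the slide: the position lists make the index larger, and phrase queries do extra position checks at query time.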
“Named Entity” Tagging
Automatically assign “types” to words or phrases
Person, organization, location, date, money, …
More rapid and robust than parsing
Best algorithms use supervised learning:
Annotate a corpus, identifying entities and types
Train a probabilistic model
Apply the model to new text
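The annotate/train/apply pipeline can be illustrated with a deliberately simple baseline: remember each token’s most frequent entity type in the annotated corpus. This toy “model” stands in for the probabilistic models (e.g., sequence models) used in practice, and the annotated examples are invented:

```python
from collections import Counter, defaultdict

def train_entity_tagger(annotated):
    # Toy supervised model: most frequent tag per token in the
    # annotated corpus. Real NER uses probabilistic sequence models.
    counts = defaultdict(Counter)
    for token, tag in annotated:
        counts[token.lower()][tag] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def tag(model, text):
    # Apply the model to new text; unseen tokens get "O" (no entity).
    return [(t, model.get(t.lower(), "O")) for t in text.split()]

annotated = [("Maryland", "LOCATION"), ("Doug", "PERSON"),
             ("Oard", "PERSON"), ("teaches", "O"), ("in", "O"),
             ("Maryland", "LOCATION")]
model = train_entity_tagger(annotated)
```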
A “Term” is Whatever You Index
Word sense
Token
Word
Stem
Character n-gram
Phrase
Summary
The key is to index the right kind of terms
Start by finding fundamental features
So far all we have talked about are character codes
Same ideas apply to handwriting, OCR, and speech
Combine them into easily recognized units
Words where possible, character n-grams otherwise
Apply further processing to optimize the system
Stemming is the most commonly used technique
Some “good ideas” don’t pan out in practice
Agenda
Character sets
Terms as units of meaning
Boolean Retrieval
Building an index