
Evidence from Content - PowerPoint Presentation

giovanna-bartolotta
Uploaded On 2016-09-19



Presentation Transcript

Slide1

Evidence from Content

INST 734

Module 2

Doug Oard

Slide2

Agenda

Character sets

Terms as units of meaning

Boolean retrieval

Building an index

Slide3

Segmentation

Retrieval is (often) a search for concepts

But what we actually search are character strings

What strings best represent concepts?

In English, words are often a good choice

But well-chosen phrases might also be helpful

In German, compounds may need to be split

Otherwise queries using constituent words would fail

In Chinese, word boundaries are not marked

Thissegmentationproblemissimilartothatofspeech

Slide4

Tokenization

Words (from linguistics):

Morphemes are the units of meaning

Combined to make words

Anti (disestablishmentarian) ism

Tokens (from computer science)

Doug ’s running late !

Slide5
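As a rough illustration of the computer-science notion of a token on the Tokenization slide, the sketch below splits the example sentence with a regular expression. The name `tokenize` and the pattern itself are assumptions made for illustration; the slides do not prescribe a particular tokenizer.

```python
import re

# Hypothetical pattern: a run of word characters, a clitic such as ’s,
# or a single punctuation mark. Whitespace is skipped.
TOKEN_PATTERN = re.compile(r"\w+|[’']\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Doug’s running late!"))
```

The clitic and the exclamation mark come out as separate tokens, matching the spacing shown in the slide's example.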

Morphology

Inflectional morphology

Preserves part of speech

Destructions = Destruction+PLURAL

Destroyed = Destroy+PAST

Derivational morphology

Relates parts of speech

Destructor = AGENTIVE(destroy)

Slide6

Stemming

Conflates words, usually preserving meaning

Rule-based suffix-stripping helps for English

{destroy, destroyed, destruction}: destr

Prefix-stripping is needed in some languages

Arabic: {alselam}: selam [Root: SLM (peace)]

Imperfect, but goal is to usually be helpful

Overstemming: {centennial, century, center}: cent

Understemming: {acquire, acquiring, acquired}: acquir; {acquisition}: acquis

Slide7
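The understemming example on the Stemming slide can be reproduced with a very small rule-based suffix stripper. This is only a sketch: the suffix list, the `min_stem` threshold, and the name `strip_suffix` are assumptions chosen to illustrate the idea, not the rules of any particular stemmer.

```python
# Hypothetical suffix list for illustration; real stemmers use many more rules.
SUFFIXES = ("ition", "ing", "ed", "e")

def strip_suffix(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("acquire", "acquiring", "acquired", "acquisition"):
    print(w, "->", strip_suffix(w))
```

With these rules the three verb forms conflate to "acquir" while "acquisition" becomes "acquis", so a query on one group would miss the other.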

Segmentation Option 1:

Longest Substring Segmentation

Greedy algorithm based on a lexicon

Start with a list of every possible term

For each unsegmented string

Remove the longest single substring in the list

Repeat until no substrings are found in the list

Slide8

Longest Substring Example

Possible German compound term: washington

List of German words:

ach, hin, hing, sei, ton, was, wasch

Longest substring segmentation

was-hing-ton

Roughly translates as “What tone is attached?”

Slide9
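The greedy longest-substring procedure (Segmentation Option 1) can be sketched as a short recursive function. The name `longest_substring_segment` and the tie-breaking rule (prefer the leftmost match among equally long terms) are assumptions; the slides only say to remove the longest substring found in the list and repeat.

```python
def longest_substring_segment(text, lexicon):
    """Greedily split text around the longest lexicon term it contains."""
    if not text:
        return []
    candidates = [term for term in lexicon if term and term in text]
    if not candidates:
        # Nothing in the lexicon matches, so keep the leftover piece whole.
        return [text]
    # Longest term wins; among equal lengths, the leftmost occurrence (assumed rule).
    best = max(candidates, key=lambda term: (len(term), -text.find(term)))
    start = text.find(best)
    left = longest_substring_segment(text[:start], lexicon)
    right = longest_substring_segment(text[start + len(best):], lexicon)
    return left + [best] + right

german_words = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
print("-".join(longest_substring_segment("washington", german_words)))
```

On the Longest Substring Example the longest lexicon entry found in "washington" is "hing", which leaves "was" and "ton" on either side, so the sketch reproduces the was-hing-ton segmentation.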

Segmentation Option 2: Most Likely Segmentation

Given an input word c1 c2 c3 … cn

Try all possible partitions into w1 w2 w3 …

[Figure: example partitions of c1 c2 c3 … cn]

Choose the highest probability partition

Based on word combinations seen in segmented text (we will call this a “language model” next week)

Slide10
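A minimal sketch of Most Likely Segmentation (Segmentation Option 2) enumerates every partition of the input and scores each one. The scoring used here, a product of independent per-word probabilities with a small floor for unseen words, is a simplifying assumption; the slides say only that probabilities come from word combinations seen in segmented text. The probability values and the names `partitions` and `most_likely_segmentation` are hypothetical.

```python
from math import prod

def partitions(text):
    """Yield every way to cut text into contiguous, non-empty pieces."""
    if not text:
        yield []
        return
    for cut in range(1, len(text) + 1):
        for rest in partitions(text[cut:]):
            yield [text[:cut]] + rest

def most_likely_segmentation(text, word_prob, unseen=1e-9):
    """Pick the partition with the highest product of word probabilities."""
    return max(
        partitions(text),
        key=lambda words: prod(word_prob.get(w, unseen) for w in words),
    )

# Toy probabilities, invented for illustration only.
toy_prob = {"washington": 0.01, "was": 0.02, "hing": 0.001, "ton": 0.005}
print(most_likely_segmentation("washington", toy_prob))
```

With these toy numbers the unsplit word scores higher than was-hing-ton, so it is left whole. Exhaustive enumeration grows as 2^(n-1) partitions for n characters, so a practical system would use dynamic programming instead.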

Generic Segmentation: N-gram Indexing

Consider a Chinese document c1 c2 c3 … cn

Don’t commit to any single segmentation

Instead, treat every character bigram as a term

c1 c2, c2 c3, c3 c4, …, cn-1 cn

Break up queries the same way

Bad matches will happen, but rarely

Slide11
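Character bigram indexing, as described on the N-gram Indexing slide, can be sketched in a few lines. The function names and the Chinese example strings are assumptions added for illustration, and the match rule (every query bigram must appear in the document) is one simple choice among several.

```python
def char_bigrams(text):
    """Return overlapping character bigrams: c1c2, c2c3, ..., c(n-1)c(n)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def matches(query, document):
    """True if the query has bigrams and all of them occur in the document."""
    query_terms = char_bigrams(query)
    document_terms = set(char_bigrams(document))
    return bool(query_terms) and all(t in document_terms for t in query_terms)

document = "马里兰大学图书馆"   # illustrative string, not from the slides
print(char_bigrams(document))
print(matches("大学", document))
```

The query is broken up the same way as the document, so no segmentation decision is needed on either side.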

Relating Words and Concepts

Homonymy: bank (river) vs. bank (financial)

Different words are written the same way

We’d like to work with word senses rather than words

Polysemy: fly (pilot) vs. fly (passenger)

A word can have different “shades of meaning”

Not bad for IR: often helps more than it hurts

Synonymy: class vs. course

Causes search failures … we’ll address this next week!

Slide12

Phrases

Phrases can yield more precise queries

“University of Maryland”, “solar eclipse”

We would not usually index only phrases

Because people might search for parts of a phrase

Positional indexes allow query-time phrase match

At the cost of a larger index and slower queries

IR systems are good at evidence combination

Better evidence combination → less help from phrases

Slide13
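Query-time phrase matching with a positional index, as mentioned on the Phrases slide, might look like the sketch below. The index layout (term to document id to list of positions), whitespace tokenization, lowercasing, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} using whitespace tokens."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

def phrase_match(index, phrase):
    """Return sorted doc ids where the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Term k of the phrase must sit exactly k positions after the first.
            if all(start + k in index[t].get(doc_id, ()) for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = {
    1: "University of Maryland libraries",
    2: "Maryland has a university of note",
}
index = build_positional_index(docs)
print(phrase_match(index, "University of Maryland"))
print(phrase_match(index, "Maryland"))
```

Because single words are indexed with their positions, a search for part of a phrase still works, while the phrase query matches only the document where the words are adjacent. Storing a position for every occurrence is what makes the index larger.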

“Named Entity” Tagging

Automatically assign “types” to words or phrases

Person, organization, location, date, money, …

More rapid and robust than parsing

Best algorithms use supervised learning

Annotate a corpus identifying entities and types

Train a probabilistic model

Apply the model to new text

Slide14

A “Term” is Whatever You Index

Word sense

Token

Word

Stem

Character n-gram

Phrase

Slide15

Summary

The key is to index the right kind of terms

Start by finding fundamental features

So far all we have talked about are character codes

Same ideas apply to handwriting, OCR, and speech

Combine them into easily recognized units

Words where possible, character n-grams otherwise

Apply further processing to optimize the system

Stemming is the most commonly used technique

Some “good ideas” don’t pan out that way

Slide16

Agenda

Character sets

Terms as units of meaning

Boolean retrieval

Building an index