Incremental Indexing


2016-07-01


Slide1

Incremental Indexing

Dr. Susan Gauch

Slide2

Indexing

Current indexing algorithms are essentially batch processing

They start from scratch every time

What happens if we have already indexed a million documents and add 1 document to the collection

Do not want to index 1,000,001 documents from scratch

Web search engines have spiders/crawlers/robots continually collecting new content

Need a way to add a new document to existing inverted files

Slide3

Adding a document

This can cause two types of changes

Add a new word

Add an occurrence of an existing word

Slide4

Adding a New Word

This is the easiest type of change

Fill in a new entry in the dict file for the word

Append to the end of the post file

If the dict file is 1/3 full after indexing phase

We can add many words before dict file blank records are used up

Over time, probability of a collision increases, slowing down retrieval

When dict file is > 2/3 full, rehash on disk

Essentially, create new dict file twice as big

Rehash all dict file records to new location

Lots of I/O, but can be done in background or on separate computer
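
The rehash step above can be sketched as follows. This is a minimal illustration, assuming an in-memory list stands in for the fixed-size dict file and that collisions are resolved by linear probing (the slides do not name a collision strategy). The names DictFile, insert, and lookup are hypothetical.

```python
class DictFile:
    """In-memory stand-in for a fixed-size hashed dict file (hypothetical)."""

    def __init__(self, size=8):
        self.slots = [None] * size   # None plays the role of a blank record
        self.used = 0

    def _probe(self, slots, token):
        # Linear probing is an assumption; the slides do not name a strategy.
        i = hash(token) % len(slots)
        while slots[i] is not None and slots[i][0] != token:
            i = (i + 1) % len(slots)
        return i

    def insert(self, token, record):
        i = self._probe(self.slots, token)
        if self.slots[i] is None:
            self.used += 1
        self.slots[i] = (token, record)
        # Rehash once the file is more than 2/3 full.
        if 3 * self.used > 2 * len(self.slots):
            self._rehash()

    def _rehash(self):
        new_slots = [None] * (2 * len(self.slots))   # new file twice as big
        for entry in self.slots:
            if entry is not None:
                new_slots[self._probe(new_slots, entry[0])] = entry
        self.slots = new_slots

    def lookup(self, token):
        entry = self.slots[self._probe(self.slots, token)]
        return None if entry is None else entry[1]
```

Starting from 8 slots, the sixth distinct word pushes the load above 2/3 and triggers a rebuild into 16 slots. On disk the same rebuild would write a second file and swap it in, which is why it can run in the background.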

Slide5

Adding a New Occurrence

Change to dict file is trivial

Just increment numdocs

Change to post file is catastrophic

Need to add a new posting record but cannot insert a new record in the middle of a file

The idf for the word is different (log(N/numdocs); but numdocs just changed)

All postings for that word have the wrong term weights

Slide6

Adding Posting Records Option 1: Blank records

Write blank records after existing postings

Number of blank records should be proportional to the number of existing postings

E.g., if “dog” has 3 postings, write scale_factor * 3 blank records after the 3 real postings; if “many” has 100 postings, write scale_factor * 100 blank records after the 100 real ones

Allows for scale_factor expansion

First word to have more than numdocs * scale_factor new postings causes entire postings file to be rewritten with new blanks inserted
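
Option 1 can be sketched as follows, assuming a Python list stands in for the post file, None marks a blank record, and scale_factor is a whole number. The function names and the "capacity" field are hypothetical additions for illustration.

```python
def layout_with_blanks(postings, scale_factor):
    """Build a post file that reserves blank slots after each word's postings.

    postings maps token -> list of posting records.
    """
    post_file = []
    dict_entries = {}
    for token, plist in postings.items():
        start = len(post_file)
        blanks = scale_factor * len(plist)   # proportional to existing postings
        post_file.extend(plist)
        post_file.extend([None] * blanks)
        dict_entries[token] = {
            "start": start,
            "numdocs": len(plist),
            "capacity": len(plist) + blanks,  # hypothetical bookkeeping field
        }
    return post_file, dict_entries


def add_posting(post_file, dict_entries, token, posting):
    """Fill the word's next blank slot; return False when none are left."""
    entry = dict_entries[token]
    if entry["numdocs"] == entry["capacity"]:
        return False   # caller must rewrite the whole post file with new blanks
    post_file[entry["start"] + entry["numdocs"]] = posting
    entry["numdocs"] += 1
    return True
```

With scale_factor = 1, a word with 3 postings accepts 3 more in place; the fourth returns False, which corresponds to the full rewrite described above.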

Slide7

Adding Posting Records Option 2: Move Postings

Copy existing postings for the word to end of file

Append new posting there

Update dict to have “start” index to new location

Causes a lot of data movement

Post file becomes fragmented
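
A minimal sketch of Option 2, assuming a Python list stands in for the post file and each dict entry holds "start" and "numdocs". The function name is hypothetical, and marking the abandoned slots as None is only there to make the fragmentation visible.

```python
def move_and_append(post_file, dict_entries, token, posting):
    """Copy a word's postings to the end of the file, then append the new one."""
    entry = dict_entries[token]
    old_start, count = entry["start"], entry["numdocs"]

    new_start = len(post_file)
    post_file.extend(post_file[old_start:old_start + count])  # data movement
    post_file.append(posting)

    # The old region is now dead space: this is the fragmentation.
    for i in range(old_start, old_start + count):
        post_file[i] = None

    entry["start"] = new_start
    entry["numdocs"] = count + 1
```

Every insertion copies the word's whole posting list, so a frequent word with many postings pays the highest cost.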

Slide8

Adding Posting Records Option 3: Overflow pointer

Change post record format to have an overflow pointer (record number/block address)

Add new posting to end of post file or in separate overflow file

While processing post records:

    Loop over numdocs records in post
        If overflow is null
            Next = i + 1
        Else
            Next = overflow_location
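
The traversal above can be sketched as follows, assuming each post record is a (docid, wt, overflow) tuple in a Python list, with None as the null overflow pointer and new postings appended to the end of the same file. The function names are hypothetical.

```python
def read_postings(post_file, start, numdocs):
    """Follow a word's postings, jumping wherever an overflow pointer is set."""
    result = []
    i = start
    for _ in range(numdocs):
        docid, wt, overflow = post_file[i]
        result.append((docid, wt))
        i = i + 1 if overflow is None else overflow
    return result


def append_posting(post_file, entry, docid, wt):
    """Add a posting at the end of the file and link it from the last record."""
    i = entry["start"]
    for _ in range(entry["numdocs"] - 1):      # walk to the word's last record
        overflow = post_file[i][2]
        i = i + 1 if overflow is None else overflow
    last_docid, last_wt, _ = post_file[i]
    post_file.append((docid, wt, None))
    post_file[i] = (last_docid, last_wt, len(post_file) - 1)
    entry["numdocs"] += 1
```

Only the last existing record of the word is rewritten, so no other word's postings move.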

Slide9

Adding Posting Records Option 4: Next pointers

Variation of Option 3

While processing post records:

    Seek to start
    Read >> docid >> wt >> next
    While next != -1
        Seek to next
        Read >> docid >> wt >> next

Allows infinite expandability

Can degenerate into equivalent of a linked list on disk with one seek per post record
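
A minimal sketch of the next-pointer chain, assuming fixed-size binary records of (docid, wt, next) where next is a byte offset and -1 ends the chain. The record layout, the helper names, and the use of an in-memory io.BytesIO in place of a real post file are all assumptions for illustration.

```python
import io
import struct

# docid (int32), wt (float32), next (int32 byte offset, -1 = end of chain)
RECORD = struct.Struct("<ifi")


def append_record(f, docid, wt, next_offset=-1):
    """Write a record at the end of the file and return its byte offset."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(RECORD.pack(docid, wt, next_offset))
    return offset


def link(f, tail_offset, new_offset):
    """Overwrite the next field of the record at tail_offset."""
    f.seek(tail_offset + 8)          # next is the third 4-byte field
    f.write(struct.pack("<i", new_offset))


def read_chain(f, start):
    """Seek to start, then follow next pointers until -1."""
    postings = []
    offset = start
    while offset != -1:
        f.seek(offset)               # one seek per post record
        docid, wt, offset = RECORD.unpack(f.read(RECORD.size))
        postings.append((docid, wt))
    return postings
```

For example, after `a = append_record(f, 1, 0.5)`, `c = append_record(f, 4, 0.75)`, and `link(f, a, c)`, calling `read_chain(f, a)` visits both records. The seek inside the loop is the cost the last bullet warns about.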

Slide10

Handling idf

Updating numdocs changes idf, which in turn changes wt for all postings for the term

Read all postings for the term, change wt, rewrite postings

If doing proper document length normalization,

All documents containing this term now have new lengths

Must recalculate norm factor and rewrite the postings for all terms in those documents

Infeasible: we don’t have a way to find all postings for a document without reading whole file or adding a new file that maps docid -> postings (doubling the inverted index size)

Slide11

Better idea

Calculate term weights on the fly

Store rtf in posting record

Prenormalized by document length

Loop over postings
    Acc[docid] += wt

Becomes

Calc idf from current value of numdocs
Loop over postings
    Acc[docid] += rtf * idf
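
The query-time loop can be sketched as follows, assuming the index is a dictionary mapping each token to a list of (docid, rtf) postings and that idf is the natural log of N/numdocs (the slides do not state the log base). The function name score and the total_docs parameter are hypothetical.

```python
import math
from collections import defaultdict


def score(query_terms, index, total_docs):
    """Accumulate rtf * idf per document, computing idf at query time."""
    acc = defaultdict(float)
    for token in query_terms:
        postings = index.get(token, [])
        if not postings:
            continue
        numdocs = len(postings)                 # current value, never stale
        idf = math.log(total_docs / numdocs)
        for docid, rtf in postings:
            acc[docid] += rtf * idf
    return dict(acc)
```

Because only rtf is stored, appending a posting for a new document changes the scores of existing documents at the next query without rewriting any of their postings.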

Slide12

Scalability (or how Google does it)

Create overflow areas that hold more than 1 posting

Make them variable sizes

Store a few postings in dict file

Dict record becomes

Token, numdocs, idf, P postings, Next

Pick P so that dict record is size of 0.5 or 1 block (e.g., 100)

Create Small, Medium, and Large overflow files

Slide13

Variable Overflows

If have > P postings

Allocate a record in the “Small” overflow file

Record format: S postings, Next

Pick S so that record fits in 1 block

Or Pick S so that 50% of all tokens can be processed without going to Medium overflow file

If have > P + S postings

Allocate a record in “Medium” overflow file

Record format: M postings, Next

Or pick M so that 90% of all tokens can be processed without going to Large overflow file

Slide14

Variable Overflows

If have > P + S + M postings

Allocate a record in the “Large” overflow file

Record format: L postings, Next

Pick L so that 99% of all tokens can be processed without going to second overflow file

If have > P + S + M + L postings

Allocate another record at end of Large file

Next pointer just points to the next Large record
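
The tiered allocation across these two slides can be sketched as follows. This is an illustration only: it computes which records a token with n postings would occupy, given capacities P, S, M, and L. The function name and the capacities in the usage note (other than P = 100, which the slides give as an example) are hypothetical.

```python
def allocate(n, P, S, M, L):
    """Return the (file, postings_held) records used by a token with n postings."""
    held = min(n, P)
    records = [("dict", held)]           # first P postings live in the dict record
    remaining = n - held

    # At most one Small record and one Medium record per token.
    for name, capacity in (("small", S), ("medium", M)):
        if remaining == 0:
            return records
        held = min(remaining, capacity)
        records.append((name, held))
        remaining -= held

    # Anything beyond P + S + M chains through as many Large records as needed.
    while remaining > 0:
        held = min(remaining, L)
        records.append(("large", held))
        remaining -= held
    return records
```

With P = 100, S = 200, M = 1000, and L = 10000, a token with 150 postings uses the dict record plus one Small record, while a token with 11,305 postings uses one record from each tier plus a second Large record holding the last 5.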

