http://www.flickr.com/photos/56685562@N00/565216/ - PowerPoint Presentation

Uploaded On 2016-03-26


Document Image Retrieval. David Kauchak, cs160, Fall 2009. Adapted from David Doermann, http://terpconnect.umd.edu/~oard/teaching/796/spring04/slides/11/796s0411.ppt



Presentation Transcript

Slide1

http://www.flickr.com/photos/56685562@N00/565216/

Slide2

Document Image Retrieval
David Kauchak
cs160, Fall 2009
Adapted from: David Doermann
http://terpconnect.umd.edu/~oard/teaching/796/spring04/slides/11/796s0411.ppt

Slide3

Assign 4 writeups
Overall, I was very happy
See how big a difference the modifications make!
Some general comments
  explain data set and characteristics
  explain your evaluation measure(s)
  think about the points you’re trying to make, then use the data to make that point
  comment on anything abnormal or surprising in the data
  dig deeper if you need to
  if you have multiple evaluation measures, use them to explain/understand different behavior
  try and explain why you got the results you obtained

Slide4

Information retrieval systems
Spend 15 minutes playing with three different image retrieval systems
  http://en.wikipedia.org/wiki/Image_retrieval has a number
What works well?
What doesn’t work well?
Anything interesting you noticed?
You won’t hand anything in, but we’ll start class on Monday with a discussion of the systems

Slide5

Image Retrieval
http://infolab.stanford.edu/~wangz/project/imsearch/review/JOUR/datta_TR.pdf

Slide6

Image Retrieval Problems
http://infolab.stanford.edu/~wangz/project/imsearch/review/JOUR/datta_TR.pdf

Slide7

Different Systems
http://infolab.stanford.edu/~wangz/project/imsearch/review/JOUR/datta_TR.pdf

Slide8

Information retrieval: data

amount of data
  Text retrieval: trillions of web pages; within an order of magnitude in “private” data
  Audio retrieval: order of a few billion? (last fm has 150M songs)
  Image retrieval: somewhere in between

data characteristics
  Text retrieval: user generated; some semi-structured; link structure
  Audio retrieval: mostly professionally generated; co-occurrence statistics
  Image retrieval: user generated; becoming more prevalent; some tagging; incorporated into web pages (context)

Slide9

Information retrieval: challenges

challenges
  Text retrieval: scale; ambiguity of language; link structure; spam
  Audio retrieval: query language; user interface; features/pre-processing
  Image retrieval: query language; user interface; features/pre-processing; ambiguity of pictures

other dimensions?

Slide10

What’s in a document?
I give you a file I downloaded
You know it has text in it
What are the challenges in determining what characters are in the document?
  File format: http://www.google.com/help/faq_filetypes.html

Slide11

What is a document?

Slide12

Document Images
A document image is a document that is represented as an image, rather than some predefined format
Like normal images, contain pixels
  often binary-valued (black, white)
  But greyscale or color sometimes
300 dots per inch (dpi) gives the best results
  But images are quite large (1 MB per page)
  Faxes are normally 72 dpi
Usually stored in TIFF or PDF format
Want to be able to process them like text files

Slide13
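A rough check on the page-size figures from the Document Images slide. This is only a sketch under stated assumptions: US Letter paper (8.5 x 11 in), no compression, and a hypothetical helper name raw_page_bytes. None of these come from the slides.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumption: US Letter paper; the slide does not name a paper size.
scan_300 = raw_page_bytes(8.5, 11, 300)                    # binary, 300 dpi
fax_72 = raw_page_bytes(8.5, 11, 72)                       # binary, 72 dpi
grey_300 = raw_page_bytes(8.5, 11, 300, bits_per_pixel=8)  # greyscale, 300 dpi

print(scan_300, fax_72, grey_300)
```

Under these assumptions a binary 300 dpi page comes to about 1 MB uncompressed, which is in line with the figure on the slide. An 8-bit greyscale page is eight times larger.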

Sources of document images
Web
  http://dli.iiit.ac.in/
  Arabic news stories are often GIF images
  Google Books, Project Gutenberg (though these are a bit different)
Library archives
Other
  Tobacco Litigation Documents
    49 million page images

Slide14

Slide15

Document Image Database
Collection of scanned images
Need to be available for indexing and retrieval, abstracting, routing, editing, dissemination, interpretation
NOTE: more needs than just searching!
[Figure labels: DOCUMENT, DATABASE, IMAGE]

Slide16

What are the challenges?
What are the sub-problems?

Slide17

Document images
So far, we’ve only been interested in documents as strings of text
Document images contain additional information
  embedded images
  formatting
  handwritten annotations
  figures/diagrams/tables
  classes of documents
    memo
    newspaper article
    book page

Slide18

Challenges
They’re an image
Quality
  scan orientation
  noise
  contrast
Hand-written text
Hand-written diagrams

Slide19

Sub-problems
Classification - what type of document image is this?
Page segmentation
  structure
  identify images
  identify text
  identify handwritten text
  diagram identification
Meta-data identification
  title, author
  language
OCR
Reading order
Indexing

Slide20

Problems we’ll discuss today…
Preprocessing issues
Page Layer Segmentation
OCR
Reading order
IR issues

Slide21

Problem: Page Layer Segmentation
A document consists of many layers, such as handwriting, machine printed text, background patterns, tables, figures, noise, etc.

Slide22

Step 1 - segmentation

Slide23

Segmentation

Slide24

Segmentation

Slide25
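The segmentation slides do not name a method. As a rough illustration only, the sketch below uses connected-component labeling on a binary image, which is one common starting point and an assumption here. The function name connected_components and the toy page are hypothetical.

```python
from collections import deque


def connected_components(grid):
    """Find 4-connected regions of ink (1) pixels and return their bounding boxes.

    Each box is (top, left, bottom, right), inclusive, in raster-scan order.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy binary "page": 1 = ink, 0 = background.
page = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]
segments = connected_components(page)
print(segments)
```

Each bounding box is a candidate segment. Its size, shape and position relative to the other boxes are the kind of features a later classification step could use.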

Step 2 – classify the segments

Printed text

Handwriting

Noise

We can use features of the “segment” as well as positional information about the other segments

Slide26

Segmentation Classification

Before enhancement

After enhancement

Slide27

Problem: OCR
One of the more successful applications of computer vision
How does this happen?

Slide28

OCR: One solution
Pattern-matching approach
  Standard approach in commercial systems
  Segment individual characters
  Recognize using a neural network classifier

Slide29

OCR
Ideas?

Slide30

Optical Character Recognition
Hidden Markov model approach
  Experimental approach developed at BBN
  Segment into sub-character slices
  Limited lookahead to find best character choice
Determining character segmentation is difficult!
Uniform slices
View as a sequential prediction problem

Slide31

OCR Accuracy Problems
Character segmentation errors
  In English, segmentation often changes “m” to “rn”
Character confusion
  Characters with similar shapes often confounded
OCR on copies is much worse than on originals
  Pixel bloom, character splitting, binding bend
Uncommon fonts can cause problems
  If not used to train a neural network

Slide32

Improving OCR Accuracy
Image preprocessing
  Mathematical morphology for bloom and splitting
  Particularly important for degraded images
“Voting” between several OCR engines helps
  Individual systems depend on specific training data
Linguistic analysis can correct some errors
  Use confusion statistics, word lists, syntax, …
  But more harmful errors might be introduced

Slide33
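The “voting” idea from the Improving OCR Accuracy slide can be sketched as a character-by-character majority vote. This is a minimal illustration, assuming the engine outputs are already aligned and of equal length. Real systems need an alignment step first, since segmentation errors change the length. The engine outputs below are made up.

```python
from collections import Counter


def vote(outputs):
    """Character-by-character majority vote over equal-length OCR outputs."""
    if len({len(text) for text in outputs}) != 1:
        raise ValueError("this sketch assumes aligned, equal-length outputs")
    voted = []
    for column in zip(*outputs):
        # Keep the character that most engines agree on at this position.
        voted.append(Counter(column).most_common(1)[0][0])
    return "".join(voted)


# Hypothetical outputs from three engines reading the word "document".
engine_outputs = ["document", "docunent", "d0cument"]
consensus = vote(engine_outputs)
print(consensus)
```

Each engine makes a different mistake, so the majority at each position recovers the intended word. The cost is running every engine on every page.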

OCR Speed
Neural networks take about 10 seconds a page
  Hidden Markov models are slower
Voting can improve accuracy
  But at a substantial speed penalty
Easy to speed things up with several machines
  For example, by batch processing - using desktop computers at night
Challenge with OCR is that there is often a trade-off between speed and accuracy

Slide34

Problem: Reading Order
What is the sequence of words from this document?
Ideas?

Slide35

Logical Page Analysis
Can be hard to guess in some cases
  Newspaper columns, figure captions, appendices, …
Sometimes there are explicit guides
  “Continued on page 4” (but page 4 may be big!)
Structural cues can help
  Column 1 might continue to column 2
Content analysis is also useful
  Word co-occurrence statistics, syntax analysis

Slide36
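The structural cue from the Logical Page Analysis slide ("column 1 might continue to column 2") can be sketched as a sort over text blocks. This is only an illustration: it assumes the number of equal-width columns is already known, and the block format (text, left, top) and the function name reading_order are hypothetical.

```python
def reading_order(blocks, page_width, n_columns):
    """Order text blocks column by column, top to bottom within a column.

    Each block is (text, left, top) in pixels, origin at the top-left corner.
    """
    column_width = page_width / n_columns

    def key(block):
        _, left, top = block
        # Clamp so a block at the right edge still lands in the last column.
        column = min(int(left // column_width), n_columns - 1)
        return (column, top)

    return [text for text, _, _ in sorted(blocks, key=key)]


# Made-up blocks from a two-column page, listed in arbitrary order.
blocks = [
    ("col 2, para 1", 320, 50),
    ("col 1, para 2", 20, 400),
    ("col 1, para 1", 20, 50),
    ("col 2, para 2", 320, 400),
]
ordered = reading_order(blocks, page_width=600, n_columns=2)
print(ordered)
```

Figure captions and "continued on page 4" jumps break this simple rule, which is why the slide also lists explicit guides and content analysis.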

Traditional Approach
[Diagram labels: Document, Scanner, Page Image, Page Decomposition, Text Regions, Optical Character Recognition, Structure/images/etc]

Slide37

Remember our goal
Create an IR system over image documents
Challenge: OCR is not perfect
  Success for high quality OCR (Croft et al 1994, Taghva 1994)
  Limited success for poor quality OCR (1996 TREC, UNLV)
Ideas?

Slide38

Proposed Solutions
Improve OCR
  Again, speed is always a concern
  Similar to spelling correction
Automatic Correction
Character N-grams
  Statistically robust to small numbers of errors
  Rapid indexing and retrieval
  Works from 70%-85% character accuracy where traditional IR fails

Slide39
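Why character n-grams are robust to a small number of OCR errors: a single wrong character breaks exact word matching but leaves many n-grams intact. The sketch below uses trigrams and the Dice coefficient. Both are choices made for this illustration, not something the slides specify.

```python
def char_ngrams(text, n=3):
    """Set of overlapping character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice(a, b, n=3):
    """Dice coefficient between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


clean = "retrieval"
noisy = "retrleval"  # one hypothetical OCR error: "i" read as "l"

exact_match = clean == noisy     # word-level matching fails outright
similarity = dice(clean, noisy)  # n-gram matching degrades gradually
print(exact_match, similarity)
```

Here 4 of the 7 trigrams on each side survive the error, so the query still matches partially instead of missing the document.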

Matching with OCR errors
with confidence
  > 80%: Keep base system answer
  75% - 80%: Character n-grams
  < 75%: More intensive image techniques (e.g. shape codes)

Slide40
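The confidence bands from the Matching with OCR errors slide amount to a simple routing rule. A minimal sketch follows, assuming confidence is a fraction in [0, 1] and that the 75% and 80% boundaries both fall in the middle band. The function name choose_matcher is hypothetical.

```python
def choose_matcher(confidence):
    """Pick a matching strategy from an OCR confidence in [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > 0.80:
        return "base system answer"
    if confidence >= 0.75:
        return "character n-grams"
    return "image techniques (e.g. shape codes)"


for c in (0.90, 0.78, 0.60):
    print(c, "->", choose_matcher(c))
```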

Conversion to Text?
Full Conversion often required
Conversion is difficult!
  Noisy data
  Complex Layouts
  Non-text components
Points to Ponder
  Do we really need to convert?
  Can we expect to fully describe documents without assumptions?

Slide41

Idea: do processing on images
Characteristics
  Does not require expensive OCR/Conversion
  Applicable to filtering applications
  May be more robust to noise
Possible Disadvantages
  Application domain may be very limited
Indexing?

Slide42

Shape Coding
Approach
  Use of Generic Character Descriptors
  Map Character based on Shape features including ascenders, descenders, punctuation and characters with holes

Slide43

Shape Codes
Group all characters that have similar shapes
  {a, c, e, n, o, r, s, u, v, x, z}
  {d, h, k, }
  {f, }
  { }
  {1, I}
  { }
Shape codes whether a subset of an image belongs to a given character set
Sub-process later based on linguistic and/or OCR

Slide44
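A small sketch of shape coding, using only the two character groups from the Shape Codes slide that survive in this transcript. The class labels "x" and "A" and the "?" fallback are invented for the illustration. The remaining groups are left out rather than guessed.

```python
# Only the groups legible in the transcript; the labels are hypothetical.
SHAPE_GROUPS = {
    "x": "acenorsuvxz",  # first group on the slide
    "A": "dhk",          # second group on the slide, as far as it survives
}
CHAR_TO_CODE = {ch: code for code, chars in SHAPE_GROUPS.items() for ch in chars}


def shape_code(word):
    """Map each character to its shape class; '?' marks characters not covered."""
    return "".join(CHAR_TO_CODE.get(ch, "?") for ch in word)


def build_index(words):
    """Index words by shape code; words with the same code share one entry."""
    index = {}
    for word in words:
        index.setdefault(shape_code(word), []).append(word)
    return index


index = build_index(["hand", "dark", "scan", "near"])
print(index)
```

"hand" and "dark" collapse to the same code, as do "scan" and "near". A lookup by shape code therefore never misses the right word but also returns look-alikes, so recall is kept at the cost of precision.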

Why Use Shape Codes?
Can recognize shapes faster than characters
  Seconds per page, and very accurate
Preserves recall, but with lower precision
  Useful as a first pass in any system
Easily extracted from JPEG-2 images
  Because JPEG-2 uses object-based compression

Slide45

Evaluation
The usual approach: Model-based evaluation
  Apply confusion statistics to an existing collection
A bit better: Print-scan evaluation
  Scanning is slow, but availability is no problem
Best: Scan-only evaluation
  Few existing IR collections have printed materials

Slide46
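Model-based evaluation, as described on the Evaluation slide, corrupts clean text with simulated OCR errors and then runs retrieval on the result. A minimal sketch follows, assuming a single per-character error rate and a tiny made-up confusion table. Real confusion statistics would be measured from an OCR engine.

```python
import random


def degrade(text, confusions, error_rate, seed=0):
    """Simulate OCR noise by applying confusions at a fixed per-character rate."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    out = []
    for ch in text:
        if ch in confusions and rng.random() < error_rate:
            out.append(confusions[ch])
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical confusion table; "m" -> "rn" is the example from the
# OCR Accuracy Problems slide, the other entries are illustrative.
CONFUSIONS = {"m": "rn", "l": "1", "O": "0"}

clean = "document image"
untouched = degrade(clean, CONFUSIONS, error_rate=0.0)
worst_case = degrade(clean, CONFUSIONS, error_rate=1.0)
print(untouched, "|", worst_case)
```

Running the same queries against the clean and degraded collections then shows how retrieval quality falls as the error rate rises.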

Summary
Many applications benefit from image based indexing
  Less discriminatory features
  Features may therefore be easier to compute
  More robust to noise
  Often computationally more efficient
Many classical IR techniques have application for DIR
Structure as well as content are important for indexing
Preservation of structure is essential for in-depth understanding

Slide47

Closing thoughts….
What else is useful?
  Document Metadata? – Logos? Signatures?
Where is research heading?
  Cameras to capture Documents?
What massive collections are out there?
  Google Books
  Other Digital Libraries

Slide48

Additional Reading
A. Balasubramanian, et al. Retrieval from Document Image Collections, Document Analysis Systems VII, pages 1-12, 2006.
D. Doermann. The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding, 70(3), pages 287-298, 1998.

Slide49

Fun Stuff http://www.sr.se/P1/src/sing/#