Document Image Retrieval
David Kauchak, cs160, Fall 2009 (adapted from David Doermann)
Slide1
http://www.flickr.com/photos/56685562@N00/565216/
Slide2
Document Image Retrieval
David Kauchak
cs160, Fall 2009
adapted from: David Doermann
http://terpconnect.umd.edu/~oard/teaching/796/spring04/slides/11/796s0411.ppt
Slide3
Assign 4 writeups
Overall, I was very happy
See how big a difference the modifications make!
Some general comments
  explain data set and characteristics
  explain your evaluation measure(s)
  think about the points you’re trying to make, then use the data to make that point
  comment on anything abnormal or surprising in the data
  dig deeper if you need to
  if you have multiple evaluation measures, use them to explain/understand different behavior
  try and explain why you got the results you obtained
Slide4
Information retrieval systems
Spend 15 minutes playing with three different image retrieval systems
  http://en.wikipedia.org/wiki/Image_retrieval lists a number of them
What works well?
What doesn’t work well?
Anything interesting you noticed?
You won’t hand anything in, but we’ll start class on Monday with a discussion of the systems
Slide5
Image Retrieval
http://infolab.stanford.edu/~wangz/project/imsearch/review/JOUR/datta_TR.pdf
Slide6
Image Retrieval Problems
http://infolab.stanford.edu/~wangz/project/imsearch/review/JOUR/datta_TR.pdf
Slide7
Different Systems
http://infolab.stanford.edu/~wangz/project/imsearch/review/JOUR/datta_TR.pdf
Slide8
Information retrieval: data
Text retrieval
  amount of data
    trillions of web pages
    within an order of magnitude in “private” data
  data characteristics
    user generated
    some semi-structured
    link structure
Audio retrieval
  amount of data
    order of a few billion?
    Last.fm has 150M songs
  data characteristics
    mostly professionally generated
    co-occurrence statistics
Image retrieval
  amount of data
    somewhere in between
  data characteristics
    user generated becoming more prevalent
    some tagging
    incorporated into web pages (context)
Slide9
Information retrieval: challenges
Text retrieval
  scale
  ambiguity of language
  link structure
  spam
Audio retrieval
  query language
  user interface
  features/pre-processing
Image retrieval
  query language
  user interface
  features/pre-processing
  ambiguity of pictures
other dimensions?
Slide10
What’s in a document?
I give you a file I downloaded
You know it has text in it
What are the challenges in determining what characters are in the document?
  File format:
  http://www.google.com/help/faq_filetypes.html
Slide11
What is a document?
Slide12
Document Images
A document image is a document that is represented as an image, rather than some predefined format
Like normal images, contain pixels
  often binary-valued (black, white)
  But greyscale or color sometimes
300 dots per inch (dpi) gives the best results
  But images are quite large (1 MB per page)
  Faxes are normally 72 dpi
Usually stored in TIFF or PDF format
Want to be able to process them like text files
Slide13
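A note on the Document Images slide: the "1 MB per page" figure can be checked with a little arithmetic. This is only a sketch; the US Letter page size (8.5 x 11 inches) and the helper name `page_bytes` are assumptions, not from the slides.

```python
def page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Assumed page size: US Letter, 8.5 x 11 inches (not stated on the slide).
binary_300 = page_bytes(8.5, 11, 300, 1)  # black/white, 1 bit per pixel
grey_300 = page_bytes(8.5, 11, 300, 8)    # 8-bit greyscale
binary_72 = page_bytes(8.5, 11, 72, 1)    # black/white at 72 dpi

print(binary_300, grey_300, binary_72)
```

Under these assumptions a binary page at 300 dpi comes to about 1.05 million bytes before compression, which is consistent with the slide; greyscale is eight times larger.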
Sources of document images
Web
  http://dli.iiit.ac.in/
  Arabic news stories are often GIF images
  Google Books, Project Gutenberg (though these are a bit different)
Library archives
Other
  Tobacco Litigation Documents
    49 million page images
Slide14
Slide15
Document Image Database
Collection of scanned images
Need to be available for indexing and retrieval, abstracting, routing, editing, dissemination, interpretation
NOTE: more needs than just searching!
(diagram labels: DOCUMENT, DATABASE, IMAGE)
Slide16
What are the challenges?
What are the sub-problems?
Slide17
Document images
So far, we’ve only been interested in documents as strings of text
Document images contain additional information
  embedded images
  formatting
  handwritten annotations
  figures/diagrams/tables
  classes of documents
    memo
    newspaper article
    book page
Slide18
Challenges
They’re an image
Quality
  scan orientation
  noise
  contrast
Hand-written text
Hand-written diagrams
Slide19
Sub-problems
Classification - what type of document image is this?
Page segmentation
  structure
  identify images
  identify text
  identify handwritten text
  diagram identification
Meta-data identification
  title, author
  language
OCR
Reading order
Indexing
Slide20
Problems we’ll discuss today…
Preprocessing issues
  Page Layer Segmentation
  OCR
  Reading order
IR issues
Slide21
Problem: Page Layer Segmentation
A document consists of many layers, such as handwriting, machine printed text, background patterns, tables, figures, noise, etc.
Slide22
Step 1 - segmentation
Slide23
Segmentation
Slide24
Segmentation
Slide25
Step 2 – classify the segments
Printed text
Handwriting
Noise
We can use features of the “segment” as well as positional information about the other segments
Slide26
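A note on Step 2 (classifying segments): one simple way to combine segment features with positional information is a nearest-centroid classifier. This is a sketch only. The three features, the centroid values, and the name `classify_segment` are all made up for illustration; the slides do not say which features or classifier were used.

```python
import math

# Hypothetical per-class feature centroids, as if learned from labelled segments.
# Features: (stroke height variation, ink density, fraction of neighbouring
# segments that look like printed text). All values are invented.
CENTROIDS = {
    "printed text": (0.1, 0.30, 0.9),
    "handwriting": (0.6, 0.20, 0.3),
    "noise": (0.9, 0.05, 0.1),
}

def classify_segment(features):
    """Return the class whose centroid is closest (Euclidean distance)."""
    return min(CENTROIDS, key=lambda label: math.dist(features, CENTROIDS[label]))

print(classify_segment((0.15, 0.28, 0.8)))
print(classify_segment((0.7, 0.18, 0.2)))
```

The third feature is where the "other segments" come in: a segment surrounded by printed text is more likely to be printed text itself.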
Segmentation Classification
Before enhancement
After enhancement
Slide27
Problem: OCR
One of the more successful applications of computer vision
How does this happen?
Slide28
OCR: One solution
Pattern-matching approach
  Standard approach in commercial systems
  Segment individual characters
  Recognize using a neural network classifier
Slide29
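A note on the pattern-matching approach: the sketch below recognizes an already-segmented character by comparing it with stored templates. It uses plain nearest-template matching (Hamming distance) as a stand-in for the neural network classifier named on the slide, and the 3x5 bitmaps are invented for illustration.

```python
# Tiny 3x5 binary templates ('#' = ink), invented for illustration.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

def hamming(a, b):
    """Number of pixels that differ between two same-sized glyph bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def recognize(glyph):
    """Label an already-segmented glyph with the nearest template."""
    return min(TEMPLATES, key=lambda ch: hamming(glyph, TEMPLATES[ch]))

noisy_T = ["###", ".#.", ".#.", "...", ".#."]  # a "T" with one pixel dropped
print(recognize(noisy_T))
```

Note that this only works if character segmentation has already succeeded, which is exactly the hard part.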
OCR
Ideas?
Slide30
Optical Character Recognition
Hidden Markov model approach
  Experimental approach developed at BBN
  Segment into sub-character slices
  Limited lookahead to find best character choice
Determining character segmentation is difficult!
Uniform slices
View as a sequential prediction problem
Slide31
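A note on the hidden Markov model view: treating OCR as sequential prediction over uniform slices can be sketched with standard Viterbi decoding. This uses full Viterbi rather than the limited lookahead mentioned on the slide, and it is not the BBN system. The two-character alphabet, the slice labels ("tall", "round"), and all probabilities are invented for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for the observed slice sequence."""
    # best[s] = (log-probability, path) of the best path ending in state s
    best = {
        s: (math.log(start_p[s] * emit_p[s][observations[0]]), [s]) for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            score, path = max(
                (best[p][0] + math.log(trans_p[p][s]), best[p][1]) for p in states
            )
            new_best[s] = (score + math.log(emit_p[s][obs]), path + [s])
        best = new_best
    return max(best.values())[1]

# Toy model: each slice belongs to an "l" or an "o"; all numbers are made up.
states = ["l", "o"]
start_p = {"l": 0.5, "o": 0.5}
trans_p = {"l": {"l": 0.7, "o": 0.3}, "o": {"l": 0.3, "o": 0.7}}
emit_p = {"l": {"tall": 0.9, "round": 0.1}, "o": {"tall": 0.2, "round": 0.8}}

decoded = viterbi(["tall", "tall", "round", "round", "round"],
                  states, start_p, trans_p, emit_p)
print(decoded)
```

The decoder never commits to character boundaries up front; the boundary between "l" and "o" falls out of the best path.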
OCR Accuracy Problems
Character segmentation errors
  In English, segmentation often changes “m” to “rn”
Character confusion
  Characters with similar shapes often confounded
OCR on copies is much worse than on originals
  Pixel bloom, character splitting, binding bend
Uncommon fonts can cause problems
  If not used to train a neural network
Slide32
Improving OCR Accuracy
Image preprocessing
  Mathematical morphology for bloom and splitting
  Particularly important for degraded images
“Voting” between several OCR engines helps
  Individual systems depend on specific training data
Linguistic analysis can correct some errors
  Use confusion statistics, word lists, syntax, …
  But more harmful errors might be introduced
Slide33
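A note on "voting" between OCR engines: the simplest version is a character-by-character majority vote. This sketch assumes the engine outputs are already aligned and equally long, which real outputs often are not (an "m" read as "rn" changes the length); the function name `vote` and the sample strings are invented.

```python
from collections import Counter

def vote(outputs):
    """Character-by-character majority vote over equally long OCR outputs."""
    if len({len(o) for o in outputs}) != 1:
        raise ValueError("this sketch assumes the outputs are already aligned")
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*outputs)
    )

# Three hypothetical engines, each making a different single-character error.
print(vote(["c1ose", "close", "clo5e"]))
```

Because each engine was trained on different data, their errors tend to fall in different places, which is why the vote helps.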
OCR Speed
Neural networks take about 10 seconds a page
  Hidden Markov models are slower
Voting can improve accuracy
  But at a substantial speed penalty
Easy to speed things up with several machines
  For example, by batch processing - using desktop computers at night
Challenge with OCR is there is often a trade-off between speed and accuracy
Slide34
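A note on OCR speed: a rough back-of-the-envelope calculation shows why several machines matter. It combines the 10 seconds per page from the OCR Speed slide with the 49 million page images of the tobacco litigation collection mentioned earlier. The 100 machines and the 12 hours per night are hypothetical numbers chosen for illustration.

```python
SECONDS_PER_PAGE = 10   # neural-network OCR, from the OCR Speed slide
PAGES = 49_000_000      # tobacco litigation page images, from an earlier slide

def days_needed(pages, seconds_per_page, machines=1, hours_per_day=24):
    """Calendar days to OCR a collection, assuming perfect parallelism."""
    total_seconds = pages * seconds_per_page / machines
    return total_seconds / (hours_per_day * 3600)

one_machine = days_needed(PAGES, SECONDS_PER_PAGE)
overnight_pool = days_needed(PAGES, SECONDS_PER_PAGE, machines=100, hours_per_day=12)
print(round(one_machine), round(overnight_pool))
```

Under these assumptions a single machine running around the clock needs about 5,671 days (over 15 years), while 100 desktops running 12 hours a night need about 113 days.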
Problem: Reading Order
What is the sequence of words from this document?
Ideas?
Slide35
Logical Page Analysis
Can be hard to guess in some cases
  Newspaper columns, figure captions, appendices, …
Sometimes there are explicit guides
  “Continued on page 4” (but page 4 may be big!)
Structural cues can help
  Column 1 might continue to column 2
Content analysis is also useful
  Word co-occurrence statistics, syntax analysis
Slide36
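A note on structural cues: the "column 1 continues to column 2" heuristic can be sketched as a sort over text blocks, first by column and then from top to bottom. The block format, the equal-width column assumption, and the name `reading_order` are all assumptions for illustration; real layouts (captions, sidebars, continued articles) need more than this.

```python
def reading_order(blocks, page_width, columns=2):
    """Order text blocks column by column, then top to bottom.

    blocks: list of (x_left, y_top, text) tuples; y grows downward.
    Assumes equal-width columns, which is a simplification.
    """
    column_width = page_width / columns

    def key(block):
        x, y, _ = block
        column = min(int(x // column_width), columns - 1)
        return (column, y)

    return [text for _, _, text in sorted(blocks, key=key)]

blocks = [
    (320, 50, "col2 top"),
    (20, 400, "col1 bottom"),
    (20, 50, "col1 top"),
    (320, 400, "col2 bottom"),
]
print(reading_order(blocks, page_width=600))
```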
Traditional Approach
(pipeline diagram; labels: Document, Scanner, Page Image, Page Decomposition, Text Regions, Optical Character Recognition, Text; Structure, images, etc)
Slide37
Remember our goal
Create an IR system over image documents
Challenge: OCR is not perfect
  Success for high quality OCR (Croft et al 1994, Taghva 1994)
  Limited success for poor quality OCR (1996 TREC, UNLV)
Ideas?
Slide38
Proposed Solutions
Improve OCR
  Again, speed is always a concern
Automatic Correction
  Similar to spelling correction
Character n-grams
  Statistically robust to small numbers of errors
  Rapid indexing and retrieval
  Works from 70%-85% character accuracy where traditional IR fails
Slide39
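A note on character n-grams: the reason they are robust to a few OCR errors is that one wrong character only damages the n-grams that contain it. The sketch below uses trigrams and the Dice coefficient; both choices, and the function names, are assumptions for illustration rather than the method of any particular system.

```python
def char_ngrams(text, n=3):
    """Set of overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(query, document_term, n=3):
    """Dice coefficient between the n-gram sets of two strings."""
    q, d = char_ngrams(query, n), char_ngrams(document_term, n)
    if not q or not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))

print(ngram_overlap("retrieval", "retrieval"))  # exact match
print(ngram_overlap("retrieval", "retrieva1"))  # one OCR error ("l" read as "1")
print(ngram_overlap("retrieval", "document"))   # unrelated word
```

Exact word matching would miss "retrieva1" entirely, while six of its seven trigrams still match the query.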
Matching with OCR errors
5 with confidence X%
  > 80%: Keep base system answer
  75% - 80%: Character n-grams
  < 75%: More intensive image techniques (e.g. shape codes)
Slide40
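A note on matching with OCR errors: the confidence bands can be read as a simple routing rule. This is a sketch; the function name `choose_strategy` is hypothetical, and treating exactly 75% and exactly 80% as part of the middle band is an assumption, since the slide does not say which side the boundaries fall on.

```python
def choose_strategy(confidence_pct):
    """Pick a matching strategy from the confidence bands on the slide."""
    if confidence_pct > 80:
        return "base system answer"
    if confidence_pct >= 75:  # boundary handling is an assumption
        return "character n-grams"
    return "image techniques (e.g. shape codes)"

for confidence in (92, 78, 60):
    print(confidence, choose_strategy(confidence))
```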
Conversion to Text?
Full Conversion often required
Conversion is difficult!
  Noisy data
  Complex Layouts
  Non-text components
Points to Ponder
  Do we really need to convert?
  Can we expect to fully describe documents without assumptions?
Slide41
Idea: do processing on images
Characteristics
  Does not require expensive OCR/Conversion
  Applicable to filtering applications
  May be more robust to noise
Possible Disadvantages
  Application domain may be very limited
Indexing?
Slide42
Shape Coding
Approach
  Use of generic character descriptors
  Map characters based on shape features, including ascenders, descenders, punctuation, and characters with holes
Slide43
Shape Codes
Group all characters that have similar shapes
  {a, c, e, n, o, r, s, u, v, x, z}
  {b, d, h, k, }
  {f, t}
  {g, p, q, y}
  {i, j, l, 1, I}
  {m, w}
Shape codes indicate whether a subset of an image belongs to a given character set
Later sub-processing based on linguistic analysis and/or OCR
Slide44
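A note on the shape-code groups listed above: the sketch below maps words to shape codes and indexes them. It works on known text rather than on image shapes, so it only illustrates what the codes look like and why distinct words can collide. The one-letter class names are made up, and the second group uses only the four characters that survive on the slide.

```python
# Shape classes from the slide; the class names ("x", "A", ...) are invented.
SHAPE_CLASSES = {
    "x": "acenorsuvxz",
    "A": "bdhk",  # the slide's list for this group appears truncated
    "f": "ft",
    "g": "gpqy",
    "i": "ijl1I",
    "m": "mw",
}
CHAR_TO_CODE = {ch: code for code, chars in SHAPE_CLASSES.items() for ch in chars}

def shape_code(word):
    """Map each character to its shape-class code ('?' if not in any class)."""
    return "".join(CHAR_TO_CODE.get(ch, "?") for ch in word)

def build_index(words):
    """Index words by shape code; one code may map to several words."""
    index = {}
    for word in words:
        index.setdefault(shape_code(word), set()).add(word)
    return index

index = build_index(["cat", "eat", "dog", "image"])
print(shape_code("cat"), sorted(index[shape_code("cat")]))
```

Here "cat" and "eat" share a code, so a lookup for "cat" also returns "eat": nothing is missed, but a later pass has to filter the extra candidates.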
Why Use Shape Codes?
Can recognize shapes faster than characters
  Seconds per page, and very accurate
Preserves recall, but with lower precision
  Useful as a first pass in any system
Easily extracted from JBIG2 images
  Because JBIG2 uses object-based compression
Slide45
Evaluation
The usual approach: Model-based evaluation
  Apply confusion statistics to an existing collection
A bit better: Print-scan evaluation
  Scanning is slow, but availability is no problem
Best: Scan-only evaluation
  Few existing IR collections have printed materials
Slide46
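A note on model-based evaluation: "applying confusion statistics to an existing collection" means corrupting clean text the way an OCR engine would, then running the usual IR evaluation on the corrupted copy. The sketch below shows the corruption step only. The confusion table and its probabilities are invented; real statistics would be measured from OCR output.

```python
import random

# Hypothetical confusion statistics: P(OCR output | true character).
CONFUSIONS = {
    "m": [("m", 0.9), ("rn", 0.1)],
    "l": [("l", 0.9), ("1", 0.1)],
    "o": [("o", 0.95), ("0", 0.05)],
}

def degrade(text, rng):
    """Simulate OCR output by sampling each character from its confusion row."""
    out = []
    for ch in text:
        if ch in CONFUSIONS:
            choices, weights = zip(*CONFUSIONS[ch])
            out.append(rng.choices(choices, weights=weights)[0])
        else:
            out.append(ch)  # characters without statistics pass through unchanged
    return "".join(out)

rng = random.Random(0)  # fixed seed so runs are repeatable
print(degrade("modern information retrieval models", rng))
```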
Summary
Many applications benefit from image based indexing
  Less discriminatory features
  Features may therefore be easier to compute
  More robust to noise
  Often computationally more efficient
Many classical IR techniques have application for DIR
Structure as well as content are important for indexing
Preservation of structure is essential for in-depth understanding
Slide47
Closing thoughts….
What else is useful?
  Document Metadata? – Logos? Signatures?
Where is research heading?
  Cameras to capture Documents?
What massive collections are out there?
  Google Books
  Other Digital Libraries
Slide48
Additional Reading
A. Balasubramanian, et al. Retrieval from Document Image Collections, Document Analysis Systems VII, pages 1-12, 2006.
D. Doermann. The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding, 70(3), pages 287-298, 1998.
Slide49
Fun Stuff http://www.sr.se/P1/src/sing/#