Pre-SWOT Report.

Uploaded On 2016-05-23

Presentation Transcript

Slide1

Pre-SWOT Report: Printed Arabic OCR

Dr. Mohamed El-Mahallawy

Eng. Hesham Osman

Eng. Rana Abdou

Dr. Mohamed Waleed Fakhr

Dr. Mohsen Rashwan

Slide2

1- Introduction and challenges

Offline OCR systems recognize text that has been previously written or printed on a page and then optically converted into a bit image. Offline devices include optical scanners of the flatbed, paper-fed, and handheld types.

Printed Arabic script is more difficult to recognize than Latin script for the following reasons:

Slide3

Challenges

Connectivity problem: segmentation and recognition

Dotting problem

Multiple Grapheme shapes depending on the position.

Ligatures: To make things even more complex, certain compounds of characters at certain positions of the Arabic word segments are represented by single atomic graphemes called ligatures.

Overlapping problem

Diacritics problem

Font families and size variations: نسخ، كوفى، رقعة (Naskh, Kufi, Ruq'ah). Each has sub-families (Mac versus Windows).

Finally, the font size problem: Arabic graphemes do not have a fixed height or a fixed width. Moreover, the nominal sizes of a given font do not scale linearly with their actual line heights, and different fonts with the same nominal size do not share a fixed line height.

Slide4
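The position-dependent shapes and ligatures listed among the challenges can be illustrated with Unicode's Arabic Presentation Forms-B block. This is only a minimal sketch: it assumes Unicode presentation forms are an acceptable stand-in for the graphemes an OCR engine would see, which the report does not state, and the names `BEH_FORMS`, `LAM_ALEF_LIGATURE`, and `to_base_letters` are hypothetical.

```python
import unicodedata

# Four positional shapes of the single letter BEH (Arabic Presentation Forms-B).
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

# One atomic glyph standing for the two-letter compound LAM + ALEF.
LAM_ALEF_LIGATURE = "\uFEFB"


def to_base_letters(text: str) -> str:
    """Map positional shapes and ligatures back to base letters using NFKC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for position, shape in BEH_FORMS.items():
        base = to_base_letters(shape)
        print(position, unicodedata.name(shape), "->", unicodedata.name(base))
    print([unicodedata.name(c) for c in to_base_letters(LAM_ALEF_LIGATURE)])
```

All four shapes collapse to the same base letter, and the ligature expands to two letters. A recognizer working at the shape level has to tell apart several times more classes than the alphabet alone suggests.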

2- Applications

Digitizing billions of books for digital library storage, archival, retrieval, and classification.

Digitizing historical documents.

Slide5

3- State of the art in products (Latin script)

OCR is a highly mature technology for Latin script with excellent performance.

The main challenges are in the pre-processing, page segmentation, speed of batch processing and post-processing.

OmniPage-17 by Nuance is an example of such a product, with less than 1% WER: http://www.nuance.com/imaging/omnipage/omnipage-professional.asp

Slide6
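The report quotes WER figures without defining the metric. Assuming WER means the usual word error rate (word-level edit distance divided by the number of reference words), it could be computed roughly as follows. The function names are illustrative and are not taken from any of the products mentioned.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance normalized by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Under this definition, a 1% WER corresponds to roughly one wrong, missing, or extra word per hundred words of ground-truth text.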

4- State of the art in products (Arabic script)

1- Sakhr: 1% WER for good-quality documents, but accuracy may drop significantly with poor-quality documents. (Best speed and best output layout.)

2- VERUS: slightly lower accuracy than Sakhr for good-quality documents, but significantly better for poor-quality ones. (Bibliotheca Alexandrina uses both engines for its digitization project.)

3- Readiris: lower performance than the other two.

Slide7

5- State of the art in Research and Competitions

Research focuses mainly on producing a true Omni OCR for different font families and font sizes (especially the large ones), on document pre-processing and framing, on noise robustness, and on batch-mode speed.

Significant recent efforts: most recent research employs HMMs and fusion between multiple OCR systems, targeting omni-font performance.

Slide8

6- Required Modules

ScanFix pre-processing tool (or similar): $15 per license.

Nuance document analysis tool (framing tools) (or similar): $30 per license.

Word-based language model

Character-based language model

Grapheme-to-ligature and ligature-to-grapheme converter: need to build a tool.

Statistical training tools: HTK, SRI, Matlab, and many neural network tools.

Error analysis tools: need to be implemented.

Diacritic preprocessing tool

Language recognition tool

Slide9

7- Required Resources

Word-annotated corpus (estimated 5000 pages of different qualities, resolutions, and font styles).

Character/ligature-annotated corpus for initial models (estimated 8 pages covering all shapes, with about 25 instances per shape).

Character-based language models (use digital resources).

Word-based language models (use digital resources).

Dictionaries with transcriptions

Slide10

8- Available Resources and Gaps

We need some tools to be made available (error analysis, grapheme to character/ligature conversion, pre-processing).

No database is available, so we need to start data collection very soon.

Character/ligature-based language models have to be trained and made available to researchers.

Slide11
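In its simplest form, a character-based language model of the kind called for here could be a smoothed bigram model trained on digital text. The sketch below is an assumption-laden illustration, not the model the report proposes: the class name, the boundary markers, and the choice of add-one smoothing are all hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical word-boundary markers


class CharBigramModel:
    """Character bigram model with add-one (Laplace) smoothing."""

    def __init__(self, words):
        self.bigrams = Counter()   # counts of (previous, current) symbol pairs
        self.contexts = Counter()  # counts of each symbol used as a context
        self.vocab = {EOS}         # symbols that can be predicted
        for word in words:
            symbols = [BOS, *word, EOS]
            self.vocab.update(word)
            for prev, curr in zip(symbols, symbols[1:]):
                self.bigrams[(prev, curr)] += 1
                self.contexts[prev] += 1

    def prob(self, prev, curr):
        """P(curr | prev), smoothed over the observed vocabulary."""
        return (self.bigrams[(prev, curr)] + 1) / (self.contexts[prev] + len(self.vocab))

    def log_prob(self, word):
        """Log-probability of a whole word, including its end marker."""
        symbols = [BOS, *word, EOS]
        return sum(math.log(self.prob(p, c)) for p, c in zip(symbols, symbols[1:]))
```

A model like this can rank competing recognizer hypotheses for the same word image. The same structure would apply to ligature units if the training text were first tokenized into ligatures rather than characters.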

9- LR proposed by ALTEC: Training

We need to focus on the Naskh font family. Within Naskh, there may be about 6 families. Each would have 6 different font sizes (8, 10, 12, 14, 16, 18). The rule is to have about 25 instances of each shape in each case. We assume about 300 different shapes (characters and ligatures), so we need 300 * 25 = 7500 instances. This is about 8 pages.

This should be done for each font family and for each font size as follows: 8 pages * 6 font families * 6 font sizes = around 300 pages total (clean = excellent quality).

Slide12

LR proposed by ALTEC (cont.)

These pages (clean, high-quality training data) will be generated artificially, balancing the data to cover all 300 shapes.

Then, to generate lower quality training data:

a- The 300 pages will be output from a fax machine (once).

b- The 300 pages will be copied once (one output), then copied twice (second output).

c- The same process will be done at 600, 300, and 200 dpi.

This gives 3600 pages: (300 clean + 300 from fax + 300 copied once + 300 copied twice) multiplied by 3 for the 3 different resolutions.
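The page counts in the training plan can be checked with a few lines of arithmetic. The sketch below only restates the report's own numbers. The constant and function names are illustrative, and the figure of 8 pages per font configuration is the report's estimate, not something the code derives.

```python
SHAPES = 300                 # distinct characters and ligatures assumed in the report
INSTANCES_PER_SHAPE = 25     # target number of instances of each shape
PAGES_PER_CONFIGURATION = 8  # report's estimate of pages needed for 7500 instances
FONT_FAMILIES = 6            # families within Naskh
FONT_SIZES = (8, 10, 12, 14, 16, 18)
CLEAN_PAGES_ROUNDED = 300    # the report rounds the exact product to "around 300"
CONDITIONS = ("clean", "fax", "copied once", "copied twice")
RESOLUTIONS_DPI = (600, 300, 200)


def instances_per_configuration() -> int:
    """Shape instances needed for one font family at one font size."""
    return SHAPES * INSTANCES_PER_SHAPE


def exact_clean_pages() -> int:
    """Clean pages before rounding: pages * families * sizes."""
    return PAGES_PER_CONFIGURATION * FONT_FAMILIES * len(FONT_SIZES)


def total_training_pages(clean_pages: int = CLEAN_PAGES_ROUNDED) -> int:
    """Clean pages replicated over every quality condition and resolution."""
    return clean_pages * len(CONDITIONS) * len(RESOLUTIONS_DPI)


if __name__ == "__main__":
    print(instances_per_configuration())  # 7500
    print(exact_clean_pages())            # 288
    print(total_training_pages())         # 3600
```

The exact product for the clean set is 288 pages, which the plan rounds to about 300. The 3600 total follows from that rounded figure.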

We will also obtain 2000 transcribed pages from Bibliotheca Alexandrina (low-quality old books, etc.).

Slide13

LR proposed by ALTEC (Benchmarking)

The recommended benchmarking must be twofold. The first test measures the robustness and reliability of the product (software) and requires 40,000 documents in one batch. These should include simple and complex documents, different qualities, etc.

The second test, for accuracy, should include at least 600 pages (200 high quality, 200 medium quality, and 200 poor quality) coming from books, newspapers, fax outputs, typewriters, etc.

It is highly recommended to have an OCR competition co-organized by ALTEC.

Slide14

10- Preliminary SWOT analysis

Strengths:

Expertise in DSP, pattern recognition, image processing, NLP, and stochastic methods.

Potential to have huge amounts of annotated data.

Weaknesses:

The tight time and budget for the intended required products.

No benchmarking available for printed Arabic OCR.

No training database available to the research community for Arabic OCR.

Slide15

Opportunities:

Truly reliable and robust Arabic Omni OCR systems are a much-needed, essential technology for the Arabic language to be fully launched into the digital age.

No existing product is yet satisfactory enough.

The Arabic language has a huge heritage to be digitized.

A large market for such a technology: over 300 million native speakers, plus numerous other interested parties (for reasons such as security, commerce, cultural interaction, etc.).

Threats:

Backlash against Arabic OCR technologies in the perception of customers, due to a long history of unsatisfactory performance of past and current Arabic OCR/ICR products.

Other R&D groups all over the world (especially in the US) are working hard and racing for a radical solution to the problem.

Slide16

11- Survey

Specify the application that OCR will be used for.

What data is used/intended to train the system?

What benchmark do you test your system on?

Would you be interested in contributing to the data collection? In what capacity?

Would you be interested in buying annotated Arabic OCR data?

Would you be interested in contributing to a competition?

How many people in your team work in this area? What are their qualifications?

What platforms are supported/targeted by your application?

What market share is anticipated for your application?

Would your application support any other languages? Explain.

Slide17

List of Survey Targets

Sakhr

RDI

ImagiNet

Orange - Cairo

IBM - Cairo

Cairo University

Ain Shams University

Arab Academy (AAST)

AUC

GUC

Nile University

Azhar University

Helwan University

Assiut University

Other centers outside Egypt

Other companies that are users of the technology

Slide18

12- Key Figures in this Field

NovoDynamics (VERUS) research team: Dr. Steve Schlosser et al.

Dr. John Makhoul (BBN)

Dr. Hazem AbdelAzeem (Egypt)