Pre-SWOT Report.


Slide1

Pre-SWOT Report.

Printed Arabic OCR

Dr. Mohamed El-Mahallawy

Eng. Hesham Osman

Eng. Rana Abdou

Dr. Mohamed Waleed Fakhr

Dr. Mohsen Rashwan

Slide2

1-Introduction and challenges

Offline OCR systems recognize text that has been previously written or printed on a page and then optically converted into a bit image. Offline devices include flatbed, paper-fed, and handheld optical scanners.

Printed Arabic script is more difficult to recognize than Latin script for the following reasons:

Slide3

Challenges

Connectivity problem: segmentation and recognition.

Dotting problem.

Multiple grapheme shapes depending on the position within the word.

Ligatures: to make things even more complex, certain compounds of characters at certain positions in Arabic word segments are represented by single atomic graphemes called ligatures.

Overlapping problem.

Diacritics problem.

Font families and size variations: Naskh (نسخ), Kufi (كوفى), Ruq'ah (رقعة). Each has sub-families (Mac versus Windows).

Font size problem: Arabic graphemes do not have a fixed height or a fixed width. Moreover, the nominal sizes of a single font do not scale linearly with their actual line heights, and different fonts with the same nominal size do not share a fixed line height.
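The positional-shape and ligature challenges listed above can be illustrated with Unicode's Arabic Presentation Forms-B block, which encodes the isolated, final, initial, and medial forms of a letter as separate code points. The following is a minimal sketch using only Python's standard `unicodedata` module. The choice of the letter beh and of the lam-alef ligature, and the helper name `base_letters`, are illustrative assumptions and do not come from the slides.

```python
import unicodedata

# Four positional presentation forms of the letter beh (U+0628):
# isolated, final, initial, medial.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

# Lam-alef ligature (isolated form): one code point standing for two letters.
LAM_ALEF = "\uFEFB"


def base_letters(text: str) -> str:
    """Map presentation forms and ligatures back to base Arabic letters.

    Hypothetical helper: it relies on Unicode compatibility normalization
    (NFKC), which replaces each presentation form by its base letter(s).
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for form in BEH_FORMS:
        base = base_letters(form)
        print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(base):04X}")
    # The ligature expands to two base letters: lam followed by alef.
    print([f"U+{ord(c):04X}" for c in base_letters(LAM_ALEF)])
```

A recognizer that outputs shape-level labels needs some mapping of this kind to produce plain text; real fonts contain many more ligatures than Unicode encodes, so this only covers the encoded subset.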

Slide4

2- Applications

Digitizing billions of books for digital library storage, archival, retrieval, and classification.

Digitizing historical documents

Slide5

3- State of the art in products (Latin script)

OCR is a highly mature technology for Latin script with excellent performance.

The main challenges are in the pre-processing, page segmentation, speed of batch processing and post-processing.

OmniPage-17 by Nuance is an example of such a product, with less than 1% WER:

http://www.nuance.com/imaging/omnipage/omnipage-professional.asp
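The slides quote WER figures without defining the metric. Assuming WER here means word error rate in its usual sense (word-level edit distance divided by the number of reference words), a minimal sketch of the computation could look like the following. The function name and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace. Real evaluations may
    normalize punctuation or diacritics before scoring.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a WER below 1% means fewer than one word-level error per hundred reference words.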

Slide6

4- State of the art in products (Arabic script)

1- Sakhr: 1% WER for good quality documents, but performance may drop significantly with poor quality documents. (Best speed and best output layout.)

2- VERUS: a little lower than Sakhr for good quality documents, but significantly better for poor quality ones. (Bibliotheca Alexandrina uses both engines for its digitization project.)

3- Readiris: lower performance than the other two.

Slide7

5- State of the art in Research and Competitions

Research focuses mainly on producing a true Omni OCR for different font families and font sizes (especially large ones), document pre-processing and framing, noise robustness, and batch-mode speed.

Significant recent efforts: most recent research employs HMMs and fusion between multiple OCR systems, targeting omni-font performance.

Slide8

6- Required Modules

ScanFix pre-processing tool (or similar): $15 per license.

Nuance document analysis tool (framing tools) (or similar): $30 per license.

Word-based language model

Character-based language model

Grapheme-to-ligature and ligature-to-grapheme converter: need to build a tool.

Statistical training tools: HTK, SRI, Matlab, and many neural network tools.

Error analysis tools: need to be implemented.

Diacritic preprocessing tool

Language recognition tool

Slide9

7- Required Resources

Word annotated corpus (estimated 5000 pages of different qualities, resolutions, and font styles).

Character/ligature annotated corpus for initial models (estimated 8 pages covering all shapes, with about 25 instances per shape).

Character-based language models (use digital resources).

Word-based language models (use digital resources).

Dictionaries with transcriptions
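As an illustration of what a character-based language model from the list above might look like, here is a minimal character-bigram sketch with add-one smoothing. The slides do not specify a model order, smoothing method, or toolkit, so the class name, the three-word toy corpus, and the smoothing choice are all assumptions. A production model would be trained on large digital text resources and would likely use a stronger smoothing scheme.

```python
import math
from collections import Counter


class CharBigramModel:
    """Character bigram model with add-one (Laplace) smoothing.

    Hypothetical sketch, not the model used by any product named in
    the slides.
    """

    def __init__(self, corpus_lines):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = set()
        for line in corpus_lines:
            chars = ["<s>"] + list(line) + ["</s>"]
            self.vocab.update(chars)
            for prev, curr in zip(chars, chars[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, curr)] += 1

    def log_prob(self, text: str) -> float:
        """Natural-log probability of `text` under the smoothed model."""
        chars = ["<s>"] + list(text) + ["</s>"]
        vocab_size = len(self.vocab)
        total = 0.0
        for prev, curr in zip(chars, chars[1:]):
            numerator = self.bigram_counts[(prev, curr)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


if __name__ == "__main__":
    # Toy corpus of three Arabic words (assumption, for illustration only).
    model = CharBigramModel(["كتاب", "كاتب", "مكتب"])
    # A plausible character sequence scores higher than its reversal.
    print(model.log_prob("كتب") > model.log_prob("بتك"))
```

In an OCR pipeline, such scores could be used to rank competing character hypotheses for the same image region.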

Slide10

8- Available Resources and Gaps

We need some tools to be available (error analysis, grapheme to character/ligature, pre-processing).

No database is available, so we need to start data collection very soon.

Character/ligature-based language models have to be trained and made available to researchers.

Slide11

9- LR proposed by ALTEC: Training

We need to focus on the Naskh font family. Within Naskh, there may be about 6 families. Each would have 6 different font sizes (8, 10, 12, 14, 16, 18). The rule is to have about 25 instances of each shape in each case.

We assume about 300 different shapes (characters and ligatures), so we need 300*25=7500 instances. This is about 8 pages.

This should be done for each font family and for each font size as follows:

8 pages * 6 font families * 6 font sizes = around 300 pages total (clean = excellent quality).
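The sizing arithmetic on this slide can be checked with a few lines of Python. All constants are taken from the slide; the variable names are illustrative. Note that the exact product is 288 pages, which the slide rounds to about 300, and that the 8-page estimate implies roughly 940 shape instances per page.

```python
# Figures quoted on the slide.
SHAPES = 300                 # characters and ligatures
INSTANCES_PER_SHAPE = 25
PAGES_PER_CASE = 8           # slide's estimate for one family/size pair
FONT_FAMILIES = 6
FONT_SIZES = [8, 10, 12, 14, 16, 18]

# Shape instances needed for one font family at one font size.
instances_per_case = SHAPES * INSTANCES_PER_SHAPE

# Page density implied by the 8-page estimate.
instances_per_page = instances_per_case / PAGES_PER_CASE

# Clean pages needed across all family/size combinations.
clean_pages = PAGES_PER_CASE * FONT_FAMILIES * len(FONT_SIZES)

if __name__ == "__main__":
    print(instances_per_case, instances_per_page, clean_pages)
```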

Slide12

LR proposed by ALTEC (cont.)

These pages (clean, high quality training data) will be generated artificially, balancing the data to cover all 300 shapes.

Then, to generate lower quality training data:

a- The 300 pages will be output from a fax machine (once).

b- The 300 pages will be copied once (first output), then twice (second output).

c- The same process will be done for 600, 300, and 200 dpi.

This gives 3600 pages: (300 clean + 300 from fax + 300 copied once + 300 copied twice) multiplied by 3 for the 3 different resolutions.

We will also obtain 2000 transcribed pages from Bibliotheca Alexandrina (low quality old books, etc.).
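The 3600-page figure follows from crossing four degradation conditions with three scan resolutions. The sketch below enumerates that grid as a simple collection plan. The condition labels and the dictionary layout are illustrative assumptions, while the page counts and resolutions come from the slide.

```python
from itertools import product

CLEAN_PAGES = 300  # rounded figure from the training slide
CONDITIONS = ["clean", "fax", "copied_once", "copied_twice"]
RESOLUTIONS_DPI = [600, 300, 200]

# One entry per (condition, resolution) pair, each holding a full
# 300-page set rendered or scanned under that pair.
page_sets = {
    (condition, dpi): CLEAN_PAGES
    for condition, dpi in product(CONDITIONS, RESOLUTIONS_DPI)
}

synthetic_total = sum(page_sets.values())

if __name__ == "__main__":
    for (condition, dpi), pages in page_sets.items():
        print(f"{condition:>12} @ {dpi} dpi: {pages} pages")
    print("total:", synthetic_total)
```

Keeping the grid explicit makes it easy to track which of the twelve page sets have been produced and scanned.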

Slide13

LR proposed by ALTEC (Benchmarking)

The recommended benchmarking is twofold. The first test measures the robustness and reliability of the product (software) and requires 40,000 documents in one batch. These should include simple and complex documents, different qualities, etc.

The second test, for accuracy, should include at least 600 pages (200 high quality, 200 medium, and 200 poor quality) coming from books, newspapers, fax outputs, typewriters, etc.

It is highly recommended to have an OCR competition co-organized by ALTEC.

Slide14

10- Preliminary SWOT analysis

Strengths:

Expertise in DSP, pattern recognition, image processing, NLP, and stochastic methods.

Potential to have huge amounts of annotated data.

Weaknesses:

The tight time and budget for the intended products.

No benchmark available for printed Arabic OCR.

No training database available to the research community for Arabic OCR.

Slide15

Opportunities:

Truly reliable and robust Arabic Omni OCR systems are a much-needed, essential technology for the Arabic language to be fully launched into the digital age.

No existing product is yet satisfactory.

The Arabic language has a huge heritage to be digitized.

Large market for such a technology: over 300 million native speakers, plus numerous other interested parties (for reasons such as security, commerce, cultural interaction, etc.).

Threats:

Backlash against Arabic OCR technologies in customers' perception, due to a long history of unsatisfactory performance of past and current Arabic OCR/ICR products.

Other R&D groups all over the world (esp. in the US) are working hard and racing for a radical solution to the problem.

Slide16

11- Survey

Specify the application that OCR will be used for.

What data is used, or intended to be used, to train the system?

What benchmark do you test your system on?

Would you be interested in contributing to the data collection? In what capacity?

Would you be interested in buying annotated Arabic OCR data?

Would you be interested in contributing to a competition?

How many people on your team work in this area? What are their qualifications?

What platforms are supported/targeted by your application?

What market share do you anticipate for your application?

Would your application support any other languages? Explain.

Slide17

List of Survey Targets

Sakhr

RDI

ImagiNet

Orange- Cairo

IBM- Cairo

Cairo University

Ain Shams University

Arab Academy (AAST)

AUC

GUC

Nile University

Azhar University

Helwan University

Assiut University

Other Centers outside Egypt

Other companies that are users of the technology

Slide18

12- Key Figures in this Field

NovoDynamics (VERUS) research team: Dr. Steve Schlosser et al.

Dr. John Makhoul (BBN)

Dr. Hazem AbdelAzeem (Egypt)