2009 10th International Conference on Document Analysis and Recognition

978-0-7695-3725-2/09 $25.00 © 2009 IEEE
DOI 10.1109/ICDAR.2009.176

Italic or Roman: Word Style Recognition Without A Priori Knowledge for Old Printed Documents

Loris Eynard, Hubert Emptoz
Université de Lyon, CNRS
INSA-Lyon, LIRIS, UMR5205, F-69621, France
{loris.eynard, hubert.emptoz}@insa-lyon.fr

Abstract

This paper presents an Italic/Roman word type recognition system that requires no a priori knowledge of the characters' font. The method is aimed at old documents in which character segmentation is not trivial. Our approach therefore segments the document into words and analyses the text word by word. To define the word style, we combine three criteria based on the visual differences between a word and a slanted version of the same word. These criteria are defined from features computed on the vertical projection profile of the word. Because we do not assume a specific slant angle, we compute these measures over a whole range of possible slant angles and then sum the obtained scores. Our results show a recognition rate of 100% for Italic words and 97.2% for Roman words.

1. Introduction

Our work was designed to find Italic words in historical newspapers of the 18th century, more specifically in the images of the Gazette de Leyde dataset.

Commercial character recognition software is not effective on old documents. On the Gazette de Leyde dataset we obtain a character recognition rate of 88% (with ABBYY FineReader 8 Pro), whereas a rate of 99% is considered efficient. On well-preserved parts of the documents the results are very good, but on damaged parts the rate decreases sharply. Italic words, because of the stroke thickness of Italic characters, are more sensitive to damage over time than Roman words, and so they are poorly recognized by commercial OCR.

These Italic words are proper nouns (patronymics or place names) and are therefore of great interest to researchers in the human sciences. The final aim is to extract the words typeset in Italic from paragraphs typeset in Roman across several years of the Gazette, which amounts to thousands of pages.

Figure 1. Italic style nouns in the middle of a Roman style paragraph.

Because of this large number of documents, our method must decide very precisely whether a word is in Italic or Roman style. The difficulties we encountered stem from the state of conservation of the old documents and from their digitization, which result in several types of degradation, e.g. ink bleed-through, holes and ink fading. These artefacts create links between characters (even more so between Italic characters, which are closer together than Roman ones), which makes them harder to segment. For this reason we propose a method that requires no character segmentation. We rely on the visual characteristics of the characters of a word, which can be interpreted by analysing the vertical projection profile of the word image. This analysis, based on three visual features, gives us scores for deciding whether a word is in Italic or Roman style. As we do not assume a specific slant angle for the Italic type, which may vary significantly across the documents we process, we test a range of angles rather than a single slant angle.

This paper is organized as follows: first we recall the state of the art on this problem; then we describe our method and conclude with the results.

[...] left) seems visually correct.

In the next three sections we describe the three criteria we use to define the Italic style. They are based on the differences between a word in Italic style and the same word in Roman style.

3.2. First criterion: Vertical black column

The main difference between the Italic and Roman versions of a word is the presence of long vertical strokes in the Roman version, which appear as peaks in the vertical projection profile.

Figure 3. Highest black column of the histogram: the maximum of the histogram differs between the original and slanted images.

Slanting a vertical stroke in the image flattens the corresponding peak in the projection profile (see Figure 3). The difference between a Roman style word and an Italic style version of it is obtained by comparing the maxima of their respective projection profiles.

We obtain two values, $MH_o$ for the original word and $MH_s$ for the slanted one. The first criterion, called $C_1$, takes the value 1 if $MH_s < MH_o$ and 0 otherwise.
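
The first criterion can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `vertical_projection`, `shear` and `criterion_1`, the binary image layout (a list of rows with 1 for black pixels), the nearest-pixel shear and the slant direction (positive angles move the top of the word to the left of its baseline) are all assumptions made for illustration.

```python
import math


def vertical_projection(image):
    """Count the black pixels (value 1) in each column of a binary image."""
    if not image:
        return []
    return [sum(row[x] for row in image) for x in range(len(image[0]))]


def shear(image, angle_deg):
    """Slant a binary image by shifting each row horizontally.

    Assumption: the bottom row is the baseline and a positive angle moves
    the upper rows to the left (nearest-pixel shifts, no interpolation).
    """
    height, width = len(image), len(image[0])
    t = math.tan(math.radians(angle_deg))
    offsets = [-round(t * (height - 1 - y)) for y in range(height)]
    base = min(offsets)
    offsets = [o - base for o in offsets]  # make every shift non-negative
    out = [[0] * (width + max(offsets)) for _ in range(height)]
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel:
                out[y][x + offsets[y]] = 1
    return out


def criterion_1(image, angle_deg):
    """Return 1 if slanting lowers the highest black column, else 0."""
    mh_o = max(vertical_projection(image))
    mh_s = max(vertical_projection(shear(image, angle_deg)))
    return 1 if mh_s < mh_o else 0


# A 5-pixel vertical stroke (Roman-like): slanting flattens its peak.
stroke = [[0, 1, 0] for _ in range(5)]
print(vertical_projection(stroke))             # [0, 5, 0]
print(vertical_projection(shear(stroke, 20)))  # [0, 3, 2, 0]
print(criterion_1(stroke, 20))                 # 1
```

On this toy stroke the peak drops from 5 to 3 after a 20 degree slant, so the sketch returns 1; a stroke that already leans to the right is straightened by the same shear and yields 0.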

3.3. Second criterion: Overlapping of the characters

In an Italic style word, we observe that the top of a slanted character often overlaps the bottom of the following one. The same observation holds for an inverted-Italic style word.

Even when there is no real overlap, the white space between the two characters is noticeably reduced. This criterion captures the difference between the white spaces in the Roman version and in the Italic version of a word (see Figure 4).

Figure 4. Overlapping of characters in a slanted word (white spaces are marked by dark vertical lines for the slanted word and by light vertical lines for the original word).

This feature is measured in the vertical projection profile by analysing the white space between black sections. Each black section represents a character (or several if there is overlapping) and the white space represents the space between characters. In the projection profile, we search for all the white spaces lying between two black pixels. It follows that the more white-between-black pixels there are, the more Roman the word is. Let $W_o$ be the total width of white space in the original image and $W_s$ the total width in the slanted word. The second criterion $C_2$ takes the value 1 if $W_o < W_s$ and 0 otherwise.

3.4. Third criterion: Variation of the slopes of the vertical histogram

The last criterion we use is the variation of slope in the vertical projection profile of the word. For each word, the characters appear as peaks of black pixels in the vertical projection profile. If the word is in Roman style, these peaks have noticeably higher slope values than for a slanted word. Moreover, these slopes vary much more suddenly than for an Italic word. This is because an Italic style character appears more spread out in the vertical projection profile than an upright Roman one. These differences of slope reflect the greater horizontal concentration of black pixels for a non-slanted character.
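
Returning to the white-space measure of Section 3.3, the following Python sketch shows one way to compute it from two projection profiles. The names `inner_white_width` and `criterion_2` are hypothetical, and treating a column as white when its count is exactly zero is an assumption; a noisy scan would probably need a small tolerance.

```python
def inner_white_width(profile):
    """Total number of empty columns lying between the first and the last
    non-empty column of a vertical projection profile."""
    black = [i for i, count in enumerate(profile) if count > 0]
    if not black:
        return 0
    first, last = black[0], black[-1]
    return sum(1 for count in profile[first:last + 1] if count == 0)


def criterion_2(profile_original, profile_slanted):
    """Return 1 if the slanted word shows more white space between
    characters than the original word, else 0."""
    w_o = inner_white_width(profile_original)
    w_s = inner_white_width(profile_slanted)
    return 1 if w_o < w_s else 0


# Margins are ignored; only gaps between black sections are counted.
print(inner_white_width([0, 3, 0, 0, 4, 1, 0, 2, 0]))  # 3
# Characters touch in the original and separate after slanting.
print(criterion_2([2, 3, 1, 4], [2, 0, 3, 0, 4]))      # 1
```

The functions take plain lists of column counts, so they can be fed the projection profile of any binary word image.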

To compute the variation of slope of the projection profile, we compute the second derivative of the vertical projection profile. Unlike the first criterion, where a single maximum is compared, it would make no sense here to consider only the largest of these variations. That is why we compute the average of the ten largest variations of slope of the projection profile. Let $VS_o$ be this average for the original word and $VS_s$ the average for the slanted image. The criterion $C_3$ then takes the value 1 if $VS_o < VS_s$ and 0 otherwise.

3.5. Final Style Decision

We obtain three binary criteria $C_1$, $C_2$ and $C_3$ that give indications of the style of the word. Using these binary criteria alone would lead to a decision on the word's style that is too arbitrary. To refine this decision, we define a weight for each criterion according to the ground truth. Combining each criterion with its associated weight gives a score for deciding the word's style. These weights represent the ratio of words verifying each criterion for a given style. For example, $w^{ro}_1$ represents the proportion of Roman style words verifying the first criterion. These proportions are computed on only 40 words of each style. The computed weight values are shown in Table 1.

Criterion   Italic   Roman
C_1         0.15     0.8
C_2         1        0.5
C_3         1        0.4

Table 1. Weight values

With the criteria and their associated weights we can define a decision term for each style. We compute these decision terms for each angle $\alpha$ as follows:

$$T^{\alpha}_{Roman} = \sum_{i=1}^{3} \left( w^{ro}_i \, C_i + (1 - w^{ro}_i)(1 - C_i) \right)$$

$$T^{\alpha}_{Italic} = \sum_{i=1}^{3} \left( w^{it}_i \, C_i + (1 - w^{it}_i)(1 - C_i) \right)$$

$T^{\alpha}_{Roman}$ is the Roman style decision term for a word at the slant angle $\alpha$.
$T^{\alpha}_{Italic}$ is the Italic style decision term of the same word for the same slant angle $\alpha$. We compare the values of these terms to decide whether the word is in Italic or Roman style for the specific slant angle $\alpha$. If $T^{\alpha}_{Roman} > T^{\alpha}_{Italic}$, the word is characterized as Roman style for the slant angle $\alpha$, and vice versa. Accordingly, we call $D_{\alpha}$ the decision for the angle $\alpha$, defined as follows:

$$D_{\alpha} = \begin{cases} Roman & \text{if } T^{\alpha}_{Roman} > T^{\alpha}_{Italic} \\ Italic & \text{otherwise} \end{cases}$$

In old documents, the slant angle of the Italic style is not fixed as it is in recent ones. We cannot assume a specific slant angle, but we suppose the Italic style to be slanted between 5 and 20 degrees. We test all these angles before choosing the word's style. We assume that if a word is characterized as Roman style for most of the 15 angles that we test, then it is fixed as Roman style. We call $D$ the final decision for a word's style. To define this decision we adapt the Kronecker delta, noted $\delta_{s,D_{\alpha}}$, as follows:

$$\delta_{s,D_{\alpha}} = \begin{cases} 1 & \text{if } s = D_{\alpha} \\ 0 & \text{otherwise} \end{cases}$$

The final decision $D$ is the style for which the sum of this adapted Kronecker delta over all angles is maximal:

$$D = \underset{s \in \{Roman,\, Italic\}}{\arg\max} \; \sum_{\alpha=5}^{20} \delta_{s,D_{\alpha}}$$

If we call $D_{\alpha}$ the intermediate decision for the angle $\alpha$, the result of $D$ is the style that receives the largest number of intermediate decisions when these are summed over all fifteen possible slant angles. $D$ is therefore a two-way decision: the word is either Roman or Italic style.

4. Results and Discussion

Our approach was designed to detect Italic style words such as proper nouns in the Gazette de Leyde dataset. These nouns are expected to be patronymics or toponyms, which are mostly long words of 4 letters or more. For this reason we do not expect good results on short words (words containing fewer than 3 letters). Moreover, our method takes overlapping characters into account. For a two-letter word there is only one possible overlap, and there are two for a three-letter word. This significantly decreases the influence of the criterion $C_2$ on the final decision. Table 2 shows the word-length histogram of the 1358-word dataset used in our experiments.
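
As a recap of Sections 3.4 and 3.5, the third criterion and the weighted vote can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the second derivative is approximated by absolute second differences, and a tied final vote falls back to Roman. The weights are those of Table 1. The text mentions both a range of 5 to 20 degrees and fifteen tested angles, so the sketch simply votes over whatever list of per-angle criteria it receives.

```python
from collections import Counter

# Weights from Table 1, in the order (C1, C2, C3).
WEIGHTS = {
    "Italic": (0.15, 1.0, 1.0),
    "Roman": (0.8, 0.5, 0.4),
}


def slope_variation(profile, top=10):
    """Average of the `top` largest absolute second differences of a
    projection profile (assumed discrete stand-in for the second derivative)."""
    second = [abs(profile[i - 1] - 2 * profile[i] + profile[i + 1])
              for i in range(1, len(profile) - 1)]
    largest = sorted(second, reverse=True)[:top]
    return sum(largest) / len(largest) if largest else 0.0


def criterion_3(profile_original, profile_slanted):
    """Return 1 if the slanted profile varies more sharply than the original."""
    vs_o = slope_variation(profile_original)
    vs_s = slope_variation(profile_slanted)
    return 1 if vs_o < vs_s else 0


def decision_term(criteria, weights):
    """Sum over the three criteria of w*C + (1 - w)*(1 - C)."""
    return sum(w * c + (1 - w) * (1 - c) for w, c in zip(weights, criteria))


def decide_for_angle(criteria):
    """Intermediate decision for one angle: Roman only if its term is larger."""
    t_roman = decision_term(criteria, WEIGHTS["Roman"])
    t_italic = decision_term(criteria, WEIGHTS["Italic"])
    return "Roman" if t_roman > t_italic else "Italic"


def final_decision(criteria_per_angle):
    """Majority vote over the per-angle decisions (ties go to Roman here)."""
    votes = Counter(decide_for_angle(c) for c in criteria_per_angle)
    return max(("Roman", "Italic"), key=lambda style: votes[style])


# Only C1 holds: T_Roman = 0.8 + 0.5 + 0.6 = 1.9, T_Italic = 0.15.
print(decide_for_angle((1, 0, 0)))                           # Roman
# 9 Roman-looking angles against 6 Italic-looking ones.
print(final_decision([(1, 0, 0)] * 9 + [(0, 1, 1)] * 6))     # Roman
```

With the Table 1 weights, the per-angle decision is Italic whenever $C_3 = 1$ and $C_1 = 0$, or when $C_2 = 1$ and $C_1 = C_3$; every other combination gives Roman.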

Word size (in letters)    2      3      > 3    Total
Italic                    15     14     176    205
Roman                     269    145    739    1153

Table 2. Number of test words

This distribution arose naturally from the dataset and has not been influenced by us. The small number of Italic words is related to the specific typesetting of the documents.

Word size (in letters)    2      3      > 3     Total
Italic                    100    100    100     100
Roman                     89.5   97.2   99.99   97.2

Table 3. Recognition rates (%)

We obtain high recognition rates when deciding whether a word is in Italic or Roman style: 100% for Italic words and 97.2% overall for Roman words. As expected, our results are lower for short Roman words (89.5% for 2 letters and 97.2% for 3 letters) but remain very good. As this method requires no character segmentation, it can recognize word styles in old or blurred documents, as well as in documents containing words with touching characters. We expect this method to work for any class of documents and with any character style.

4.1. Acknowledgments

This work is part of the Cluster Culture, Patrimoine et Création, a regional research cluster of the Région Rhône-Alpes, France.