Parsing Expression Grammars: A Recognition-Based Syntactic Foundation

Bryan Ford
Massachusetts Institute of Technology
Cambridge, MA
baford@mit.edu

Abstract

For decades we have been using Chomsky's generative system of grammars, particularly context-free grammars (CFGs) and regular expressions (REs), to express the syntax of programming languages and protocols. The power of generative grammars to express ambiguity is crucial to their original purpose of modelling natural languages, but this very power makes it unnecessarily difficult both to express and to parse machine-oriented languages using CFGs. Parsing Expression Grammars (PEGs) provide an alternative, recognition-based formal foundation for describing machine-oriented syntax, which solves the ambiguity problem by not introducing ambiguity in the first place. Where CFGs express nondeterministic choice between alternatives, PEGs instead use prioritized choice. PEGs address frequently felt expressiveness limitations of CFGs and REs, simplifying syntax definitions and making it unnecessary to separate their lexical and hierarchical components. A linear-time parser can be built for any PEG, avoiding both the complexity and fickleness of LR parsers and the inefficiency of generalized CFG parsing. While PEGs provide a rich set of operators for constructing grammars, they are reducible to two minimal recognition schemas developed around 1970, TS/TDPL and gTS/GTDPL, which are here proven equivalent in effective recognition power.

Categories and Subject Descriptors
F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems - Grammar types; D.3.1 [Programming Languages]: Formal Definitions and Theory - Syntax; D.3.4 [Programming Languages]: Processors - Parsing

General Terms
Languages, Algorithms, Design, Theory

Keywords
Context-free grammars, regular expressions, parsing expression grammars, BNF, lexical analysis, unified grammars, scannerless parsing, packrat parsing, syntactic predicates, TDPL, GTDPL

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
POPL'04, January 14-16, 2004, Venice, Italy.
Copyright 2004 ACM 1-58113-729-X/04/0001 ...$5.00

1 Introduction

Most language syntax theory and practice is based on generative systems, such as regular expressions and context-free grammars, in which a language is defined formally by a set of rules applied recursively to generate strings of the language. A recognition-based system, in contrast, defines a language in terms of rules or predicates that decide whether or not a given string is in the language. Simple languages can be expressed easily in either paradigm. For example, $\{s \in a^* \mid s = (aa)^n\}$ is a generative definition of a trivial language over a unary character set, whose strings are "constructed" by concatenating pairs of a's. In contrast, $\{s \in a^* \mid (|s| \bmod 2 = 0)\}$ is a recognition-based definition of the same language, in which a string of a's is "accepted" if its length is even.

While most language theory adopts the generative paradigm, most practical language applications in computer science involve the recognition and structural decomposition, or parsing, of strings. Bridging the gap from generative definitions to practical recognizers is the purpose of our ever-expanding library of parsing algorithms with diverse capabilities and trade-offs [9].

Chomsky's generative system of grammars, from which the ubiquitous context-free grammars (CFGs) and regular expressions (REs) arise, was originally designed as a formal tool for modelling and analyzing natural (human) languages. Due to their elegance and expressive power, computer scientists adopted generative grammars for describing machine-oriented languages as well. The ability of a CFG to express ambiguous syntax is an important and powerful tool for natural languages. Unfortunately, this power gets in the way when we use CFGs for machine-oriented languages that are intended to be precise and unambiguous. Ambiguity in CFGs is difficult to avoid even when we want to, and it makes general CFG parsing an inherently super-linear-time problem [14, 23].

This paper develops an alternative, recognition-based formal foundation for language syntax, Parsing Expression Grammars or PEGs. PEGs are stylistically similar to CFGs with RE-like features added, much like Extended Backus-Naur Form (EBNF) notation [30, 19]. A key difference is that in place of the unordered choice operator '|' used to indicate alternative expansions for a nonterminal in EBNF, PEGs use a prioritized choice operator '/'. This operator lists alternative patterns to be tested in order, unconditionally using the first successful match. The EBNF rules $A \rightarrow a\,b \mid a$ and $A \rightarrow a \mid a\,b$ are equivalent in a CFG, but the PEG rules $A \leftarrow a\,b \,/\, a$ and $A \leftarrow a \,/\, a\,b$ are different. The second alternative in the latter PEG rule will never succeed because the first choice is always taken if the input string to be recognized begins with 'a'.

A PEG may be viewed as a formal description of a top-down parser. Two closely related prior systems upon which this work is based were developed primarily for the purpose of studying top-down parsers [4, 5]. PEGs have far more syntactic expressiveness than the LL(k) language class typically associated with top-down parsers, however, and can express all deterministic LR(k) languages and many others, including some non-context-free languages. Despite their considerable expressive power, all PEGs can be parsed in linear time using a tabular or memoizing parser [8]. These properties strongly suggest that CFGs and PEGs define incomparable language classes, although a formal proof that there are context-free languages not expressible via PEGs appears surprisingly elusive.

Besides developing PEGs as a formal system, this paper presents pragmatic examples that demonstrate their suitability for describing realistic machine-oriented languages. Since these languages are generally designed to be unambiguous and linearly readable in the first place, the recognition-oriented nature of PEGs creates a natural affinity in terms of syntactic expressiveness and parsing efficiency.

The primary contribution of this work is to provide language and protocol designers with a new tool for describing syntax that is both practical and rigorously formalized. A secondary contribution is to render this formalism more amenable to further analysis by proving its equivalence to two simpler formal systems, originally named TS ("TMG recognition scheme") and gTS ("generalized TS") by Alexander Birman [4, 5], in reference to an early syntax-directed compiler-compiler. These systems were later called TDPL ("Top-Down Parsing Language") and GTDPL ("Generalized TDPL") respectively by Aho and Ullman [3]. By extension we prove that with minor caveats TS/TDPL and gTS/GTDPL are equivalent in recognition power, an unexpected result contrary to prior conjectures [5].

The rest of this paper is organized as follows. Section 2 first defines PEGs informally and presents examples of their usefulness for describing practical machine-oriented languages. Section 3 then defines PEGs formally and proves some of their important properties. Section 4 presents useful transformations on PEGs and proves the main result regarding the reducibility of PEGs to TDPL and GTDPL. Section 5 outlines open problems for future study, Section 6 describes related work, and Section 7 concludes.

2 Parsing Expression Grammars

Figure 1 shows an example PEG, which precisely specifies a practical syntax for PEGs using the ASCII character set. The example PEG describes its own complete syntax including all lexical characteristics. Most elements of the grammar should be immediately recognizable to anyone familiar with CFGs and regular expressions.

The grammar consists of a set of definitions of the form 'A <- e', where A is a nonterminal and e is a parsing expression. The operators for constructing parsing expressions are summarized in Table 1.

Single or double quotes delimit string literals, and square brackets indicate character classes. Literals and character classes can contain C-like escape codes, and character classes can include ranges such as 'a-z'. The constant '.' matches any single character.

The sequence expression 'e1 e2' looks for a match of e1 immediately followed by a match of e2, backtracking to the starting point if either pattern fails. The choice expression 'e1 / e2' first attempts pattern e1, then attempts e2 from the same starting point if e1 fails.

    # Hierarchical syntax
    Grammar    <- Spacing Definition+ EndOfFile
    Definition <- Identifier LEFTARROW Expression
    Expression <- Sequence (SLASH Sequence)*
    Sequence   <- Prefix*
    Prefix     <- (AND / NOT)? Suffix
    Suffix     <- Primary (QUESTION / STAR / PLUS)?
    Primary    <- Identifier !LEFTARROW
                / OPEN Expression CLOSE
                / Literal / Class / DOT

    # Lexical syntax
    Identifier <- IdentStart IdentCont* Spacing
    IdentStart <- [a-zA-Z_]
    IdentCont  <- IdentStart / [0-9]
    Literal    <- ['] (!['] Char)* ['] Spacing
                / ["] (!["] Char)* ["] Spacing
    Class      <- '[' (!']' Range)* ']' Spacing
    Range      <- Char '-' Char / Char
    Char       <- '\\' [nrt'"\[\]\\]
                / '\\' [0-2][0-7][0-7]
                / '\\' [0-7][0-7]?
                / !'\\' .
    LEFTARROW  <- '<-' Spacing
    SLASH      <- '/' Spacing
    AND        <- '&' Spacing
    NOT        <- '!' Spacing
    QUESTION   <- '?' Spacing
    STAR       <- '*' Spacing
    PLUS       <- '+' Spacing
    OPEN       <- '(' Spacing
    CLOSE      <- ')' Spacing
    DOT        <- '.' Spacing
    Spacing    <- (Space / Comment)*
    Comment    <- '#' (!EndOfLine .)* EndOfLine
    Space      <- ' ' / '\t' / EndOfLine
    EndOfLine  <- '\r\n' / '\n' / '\r'
    EndOfFile  <- !.

Figure 1. PEG formally describing its own ASCII syntax

The ?, *, and + operators behave as in common regular expression syntax, except that they are "greedy" rather than nondeterministic. The option expression 'e?' unconditionally "consumes" the text matched by e if e succeeds, and the repetition expressions 'e*' and 'e+' always consume as many successive matches of e as possible. The expression 'a* a' for example can never match any string. Longest-match parsing is almost always the desired behavior where options or repetition occur in practical machine-oriented languages. Many forms of non-greedy behavior are still available in PEGs when desired, however, through the use of predicates.

The operators & and ! denote syntactic predicates [20], which provide much of the practical expressive power of PEGs. The expression '&e' attempts to match pattern e, then unconditionally backtracks to the starting point, preserving only the knowledge of whether e succeeded or failed to match. Conversely, the expression '!e' fails if e succeeds, but succeeds if e fails. For example, the subexpression '!EndOfLine .' in the definition for Comment in Figure 1 matches any single character as long as the nonterminal EndOfLine does not match starting at the same position. The expression 'Identifier !LEFTARROW' in the definition for Primary, in contrast, matches any Identifier that is not followed by a LEFTARROW. This latter predicate prevents the right-hand-side Expression at the beginning of one Definition from consuming the left-hand-side Identifier of the next Definition, eliminating the need for an explicit delimiter. Predicates can involve arbitrary parsing expressions requiring any amount of "lookahead."

    Operator   Type          Precedence  Description
    ' '        primary       5           Literal string
    " "        primary       5           Literal string
    [ ]        primary       5           Character class
    .          primary       5           Any character
    (e)        primary       5           Grouping
    e?         unary suffix  4           Optional
    e*         unary suffix  4           Zero-or-more
    e+         unary suffix  4           One-or-more
    &e         unary prefix  3           And-predicate
    !e         unary prefix  3           Not-predicate
    e1 e2      binary        2           Sequence
    e1 / e2    binary        1           Prioritized Choice

Table 1. Operators for Constructing Parsing Expressions

2.1 Unified Language Definitions

Most conventional syntax descriptions are split into two parts: a CFG to specify the hierarchical portion, and a set of regular expressions defining the lexical elements to serve as terminals for the CFG. CFGs are unsuitable for lexical syntax because they cannot directly express many common idioms, such as the greedy rule that usually applies to identifiers and numbers, or "negative" syntax such as the Literal rule above, in which quoted string literals may contain any character except the quote character. Regular expressions cannot describe recursive syntax, however, such as large expressions constructed inductively from smaller expressions.

Neither of these difficulties exists with PEGs, as demonstrated by the unified example grammar. The greedy nature of the repetition operator ensures that a sequence of letters can only be interpreted as a single Identifier and not as two or more immediately adjacent, shorter ones. Not-predicates describe the appropriate negative constraints on the elements that can appear in literals, character classes, and comments. The last "!'\\' ." alternative in the definition of Char ensures that the backslash cannot be used in a literal or character class except as part of an escape sequence.

Each definition in the example grammar that represents a distinct lexical "token," such as Identifier, Literal, or LEFTARROW, uses the Spacing nonterminal to
"consume" any whitespace and/or comments immediately following the token. The definition of Grammar also starts with Spacing in order to allow whitespace at the beginning of the file. Associating whitespace with each immediately preceding token is a convenient convention for PEGs, but whitespace could just as easily be associated with the following token by referring to Spacing at the beginning of each token definition. Whitespace could even be treated as a separate kind of token, consistent with lexical traditions, but doing so in a unified grammar such as this one would require many explicit references to Spacing throughout the hierarchical portion of the syntax.

2.2 New Syntax Design Choices

Besides being able to express many existing machine-oriented languages in a concise and unified grammar, PEGs also create new possibilities for language syntax design. Consider for example a well-known problem with C++ syntax involving nested template type expressions:

    vector<vector<float> > MyMatrix;

The space between the two right angle brackets is required because the C++ scanner is oblivious to the language's hierarchical syntax, and would otherwise interpret the '>>' incorrectly as a right shift operator. In a language described by a unified PEG, however, it is easy to define the language to permit a '>>' sequence to be interpreted as either one token or two depending on its context:

    TemplType <- PrimType (LANGLE TemplType RANGLE)?
    ShiftExpr <- PrimExpr (ShiftOper PrimExpr)*
    ShiftOper <- LSHIFT / RSHIFT
    LANGLE    <- '<' Spacing
    RANGLE    <- '>' Spacing
    LSHIFT    <- '<<' Spacing
    RSHIFT    <- '>>' Spacing

Such permissiveness can create unexpected syntactic subtleties, of course, and caution and good taste are in order: a powerful syntax description paradigm also means more rope for the careless language designer to hang himself with. The traditional behavior for operator tokens is still easily expressible if desired, as follows:

    LANGLE <- !LSHIFT '<' Spacing
    RANGLE <- !RSHIFT '>' Spacing
    LSHIFT <- '<<' Spacing
    RSHIFT <- '>>' Spacing

Freeing lexical syntax from the restrictions of regular expressions also enables tokens to have hierarchical characteristics, or even to refer back to the hierarchical portion of the language. Pascal-like nestable comments, for example, cannot be described by a regular expression but are easily expressed in a PEG:

    Comment <- '(*' (Comment / !'*)' .)* '*)'

Character and string literals in most programming languages permit escape sequences of some kind, to express either special characters or dynamic string substitutions. These escapes usually have a highly restrictive syntax, however. A language described by a unified PEG could permit the use of arbitrary expressions in such escapes, taking advantage of the full power of the language's expression syntax:

    Expression <- ...
    Primary    <- Literal / ...
    Literal    <- ["] (!["] Char)* ["]
    Char       <- '\\(' Expression ')'
                / !'\\' .

In place of the Java string literal "\u2200" containing the Unicode math symbol '∀', for example, the literal could be written "\(0x2200)", "\(8704)", or even "\(Unicode.FOR_ALL)", where FOR_ALL is a constant defined in a class named Unicode.

2.3 Priorities, Not Ambiguities

The specification flexibility provided by PEGs, and the new syntax design choices they create, are not limited to the lexical portions of a language. Many sensible syntactic constructs are inherently ambiguous when expressed in a CFG, commonly leading language designers to abandon syntactic formality and rely on informal meta-rules to solve these problems. The ubiquitous "dangling ELSE" problem is a classic example, traditionally requiring either an informal meta-rule or severe expansion and obfuscation of the CFG. The correct behavior is easily expressed with the prioritized choice operator in a PEG:

    Statement <- IF Cond THEN Statement ELSE Statement
               / IF Cond THEN Statement
               / ...

The syntax of C++ contains ambiguities that cannot be resolved with any amount of CFG rewriting, in which certain token sequences can be interpreted as either a statement or a definition. The language specification [25] resolves this problem with the informal meta-rule that such a sequence is always interpreted as a definition if possible. Similarly, the syntax of lambda abstractions, let expressions, and conditionals in Haskell [11] is unresolvably ambiguous in the CFG paradigm, and is handled in the Haskell specification with an informal "longest match" meta-rule. PEGs provide the necessary tools (prioritized choice, greedy repetition, and syntactic predicates) to define precisely how to resolve such ambiguities.

These tools do not make language syntax design easy, of course. In place of having to determine whether two possible alternatives in a CFG are ambiguous, PEGs present language designers with the analogous challenge of determining whether two alternatives in a '/' expression can be reordered without affecting the language. This question is often obvious, but sometimes is not, and is undecidable in general. As with discovering ambiguity in CFGs, however, we have the hope of finding automatic algorithms to identify order sensitivity or insensitivity conservatively in common situations.

2.4 Quirks and Limitations

If the definition of Grammar in Figure 1 did not reference EndOfFile at the end, then any ASCII file starting with at least one correct Definition would be interpreted as a "correct" grammar, even if the file has unreadable garbage at the end. This peculiarity arises from the fact that a parsing expression in a PEG can "succeed" without consuming all input text. We address this minor issue with the EndOfFile nonterminal, defined by the predicate expression '!.', which "matches" the end-of-file by failing if any character is available and succeeding otherwise.

Both left and right recursion are permissible in CFGs, but as with top-down parsing in general, left recursion is unavailable in PEGs because it represents a degenerate loop. For example, the CFG rules $A \rightarrow a\,A \mid a$ and $A \rightarrow A\,a \mid a$ represent a series of 'a's in a CFG, but the PEG rule $A \leftarrow A\,a \,/\, a$ is degenerate because it indicates that in order to recognize nonterminal A, a parser must first recognize nonterminal A... This restriction applies not only to direct left recursion as in this example, but also to indirect or mutual left recursion involving several nonterminals. Since both left and right recursion in a CFG merely represent repetition, however, and repetition is easier to express in a PEG using repetition operators, this limitation is not a serious problem in practice.

Like a CFG, a PEG is a purely syntactic formalism, not by itself capable of expressing languages whose syntax depends on semantic predicates [20]. Although the Java language can be described as a single unified PEG [7], C and C++ parsers require an incrementally constructed symbol table to distinguish between ordinary identifiers and typedef-defined type identifiers. Haskell uses a special stage in the "syntactic pipeline," inserted between the scanner and parser, to implement the language's layout-sensitive features.

3 Formal Development of PEGs

In this section we define PEGs formally and explore key properties. Many of these properties and their proofs were inspired by those of the closely related TS/TDPL and gTS/GTDPL systems [4, 5, 3], although the formulation of PEGs is substantially different.

3.1 Definition of a PEG

In Figure 1 we used a "concrete" ASCII-based syntax for PEGs to illustrate the characteristics of PEGs for practical language description purposes. For formal analysis, however, it is more convenient to use an abstract syntax for PEGs that represents only its essential structure. We begin therefore by defining this abstract syntax.

Definition: A parsing expression grammar (PEG) is a 4-tuple $G = (V_N, V_T, R, e_S)$, where $V_N$ is a finite set of nonterminal symbols, $V_T$ is a finite set of terminal symbols, $R$ is a finite set of rules, $e_S$ is a parsing expression termed the start expression, and $V_N \cap V_T = \emptyset$. Each rule $r \in R$ is a pair $(A, e)$, which we write $A \leftarrow e$, where $A \in V_N$ and $e$ is a parsing expression. For any nonterminal $A$, there is exactly one $e$ such that $A \leftarrow e \in R$. $R$ is therefore a function from nonterminals to expressions, and we write $R(A)$ to denote the unique expression $e$ such that $A \leftarrow e \in R$.

We define parsing expressions inductively as follows. If $e$, $e_1$, and $e_2$ are parsing expressions, then so is:

1. $\varepsilon$, the empty string.
2. $a$, any terminal, where $a \in V_T$.
3. $A$, any nonterminal, where $A \in V_N$.
4. $e_1 e_2$, a sequence.
5. $e_1 / e_2$, prioritized choice.
6. $e^*$, zero-or-more repetitions.
7. $!e$, a not-predicate.

All subsequent use of the unqualified term "grammar" refers specifically to parsing expression grammars as defined here, and the unqualified term "expression" refers to parsing expressions. We use the variables $a, b, c, d$ to represent terminals, $A, B, C, D$ for nonterminals, $x, y, z$ for strings of terminals, and $e$ for parsing expressions. The structural requirement that $R$ be a function, mapping each nonterminal in $V_N$ to a unique parsing expression, precludes the possibility of expressions in the grammar containing "undefined references," or subroutine failures [5].

The expression set $E(G)$ of $G$ is the set containing the start expression $e_S$, the expressions used in all grammar rules, and all subexpressions of those expressions.

A repetition-free grammar is a grammar whose expression set contains only expressions constructed without using rule 6 above. A predicate-free grammar is one whose expression set contains only expressions constructed without using rule 7.

3.2 Desugaring the Concrete Syntax

The abstract syntax does not include character classes, the "any character" constant '.', the option operator '?', the one-or-more-repetitions operator '+', or the and-predicate operator '&', all of which appear in the concrete syntax. We treat these features of the concrete syntax as "syntactic sugar," reducing them to abstract parsing expressions using local substitutions as follows:

- We consider the '.' expression in the concrete syntax to be a character class containing all of the terminals in $V_T$.
- If $a_1, a_2, \ldots, a_n$ are all of the terminals listed in a character class expression in the concrete syntax, after expanding any ranges, then we desugar this character class expression to the abstract syntax expression $a_1 / a_2 / \ldots / a_n$.
- We desugar an option expression $e?$ in the concrete syntax to $e_d / \varepsilon$, where $e_d$ is the desugaring of $e$.
- We desugar a one-or-more-repetitions expression $e+$ to $e_d\, e_d^*$, where $e_d$ is the desugaring of $e$.
- We desugar an and-predicate $\&e$ to $!(!e_d)$, where $e_d$ is the desugaring of $e$.

3.3 Interpretation of a Grammar

Definition: To formalize the syntactic meaning of a grammar $G = (V_N, V_T, R, e_S)$, we define a relation $\Rightarrow_G$ from pairs of the form $(e, x)$ to pairs of the form $(n, o)$, where $e$ is a parsing expression, $x \in V_T^*$ is an input string to be recognized, $n \ge 0$ serves as a "step counter," and $o \in V_T^* \cup \{f\}$ indicates the result of a recognition attempt. The "output" $o$ of a successful match is the portion of the input string recognized and "consumed," while a distinguished symbol $f \notin V_T$ indicates failure. For $((e, x), (n, o)) \in\ \Rightarrow_G$ we will write $(e, x) \Rightarrow (n, o)$, with the reference to $G$ being implied. We define $\Rightarrow_G$ inductively as follows:

1. Empty: $(\varepsilon, x) \Rightarrow (1, \varepsilon)$ for any $x \in V_T^*$.
2. Terminal (success case): $(a, ax) \Rightarrow (1, a)$ if $a \in V_T$, $x \in V_T^*$.
3. Terminal (failure case): $(a, bx) \Rightarrow (1, f)$ if $a \ne b$, and $(a, \varepsilon) \Rightarrow (1, f)$.
4. Nonterminal: $(A, x) \Rightarrow (n + 1, o)$ if $A \leftarrow e \in R$ and $(e, x) \Rightarrow (n, o)$.
5. Sequence (success case): If $(e_1, x_1 x_2 y) \Rightarrow (n_1, x_1)$ and $(e_2, x_2 y) \Rightarrow (n_2, x_2)$, then $(e_1 e_2, x_1 x_2 y) \Rightarrow (n_1 + n_2 + 1, x_1 x_2)$. Expressions $e_1$ and $e_2$ are matched in sequence, and if each succeeds and consumes input portions $x_1$ and $x_2$ respectively, then the sequence succeeds and consumes the string $x_1 x_2$.
6. Sequence (failure case 1): If $(e_1, x) \Rightarrow (n_1, f)$, then $(e_1 e_2, x) \Rightarrow (n_1 + 1, f)$. If $e_1$ is tested and fails, then the sequence $e_1 e_2$ fails without attempting $e_2$.
7. Sequence (failure case 2): If $(e_1, x_1 y) \Rightarrow (n_1, x_1)$ and $(e_2, y) \Rightarrow (n_2, f)$, then $(e_1 e_2, x_1 y) \Rightarrow (n_1 + n_2 + 1, f)$. If $e_1$ succeeds but $e_2$ fails, then the sequence expression fails.
8. Alternation (case 1): If $(e_1, xy) \Rightarrow (n_1, x)$, then $(e_1 / e_2, xy) \Rightarrow (n_1 + 1, x)$. Alternative $e_1$ is first tested, and if it succeeds, the expression $e_1 / e_2$ succeeds without testing $e_2$.
9. Alternation (case 2): If $(e_1, x) \Rightarrow (n_1, f)$ and $(e_2, x) \Rightarrow (n_2, o)$, then $(e_1 / e_2, x) \Rightarrow (n_1 + n_2 + 1, o)$. If $e_1$ fails, then $e_2$ is tested and its result is used instead.
10. Zero-or-more repetitions (repetition case): If $(e, x_1 x_2 y) \Rightarrow (n_1, x_1)$ and $(e^*, x_2 y) \Rightarrow (n_2, x_2)$, then $(e^*, x_1 x_2 y) \Rightarrow (n_1 + n_2 + 1, x_1 x_2)$.
11. Zero-or-more repetitions (termination case): If $(e, x) \Rightarrow (n_1, f)$, then $(e^*, x) \Rightarrow (n_1 + 1, \varepsilon)$.
12. Not-predicate (case 1): If $(e, xy) \Rightarrow (n, x)$, then $(!e, xy) \Rightarrow (n + 1, f)$. If expression $e$ succeeds consuming input $x$, then the syntactic predicate $!e$ fails.
13. Not-predicate (case 2): If $(e, x) \Rightarrow (n, f)$, then $(!e, x) \Rightarrow (n + 1, \varepsilon)$. If $e$ fails, then $!e$ succeeds but consumes nothing.

We define a relation $\Rightarrow^+_G$ from pairs $(e, x)$ to outcomes $o$, such that $(e, x) \Rightarrow^+ o$ iff an $n$ exists such that $(e, x) \Rightarrow (n, o)$. If $(e, x) \Rightarrow^+ y$ for $y \in V_T^*$, we say that $e$ matches $x$ in
$G$. If $(e, x) \Rightarrow^+ f$, we say that $e$ fails on $x$ in $G$. The match set $M_G(e)$ of expression $e$ in $G$ is the set of inputs $x$ such that $e$ matches $x$ in $G$.

An expression $e$ handles a string $x \in V_T^*$ if it either matches or fails on $x$ in $G$. A grammar $G$ handles string $x$ if its start expression $e_S$ handles $x$. $G$ is complete if it handles all strings $x \in V_T^*$.

Two expressions $e_1$ and $e_2$ are equivalent, written $e_1 \simeq e_2$, if $(e_1, x) \Rightarrow^+ o$ implies $(e_2, x) \Rightarrow^+ o$ and vice versa. The resulting step counts need not be the same.

Theorem: If $(e, x) \Rightarrow (n, y)$, then $y$ is a prefix of $x$: $\exists z\,(x = yz)$.

Proof: By induction on an integer variable $m \ge 0$, using as the induction hypothesis the proposition that the desired property holds for all $e$, $x$, $n \le m$, and $y$.

Theorem: If $(e, x) \Rightarrow (n_1, o_1)$ and $(e, x) \Rightarrow (n_2, o_2)$, then $n_1 = n_2$ and $o_1 = o_2$. That is, the relation $\Rightarrow_G$ is a function.

Proof: By induction on a variable $m \ge 0$, using the induction hypothesis that the proposition holds for all $e$, $x$, $n_1 \le m$, $n_2 \le m$, $o_1$, and $o_2$. This induction technique will subsequently be referred to simply as induction on step counts of $\Rightarrow_G$.

Theorem: A repetition expression $e^*$ does not handle any input string $x$ on which $e$ succeeds without consuming input: for any $x \in V_T^*$, if $(e, x) \Rightarrow (n_1, \varepsilon)$, then $(e^*, x) \not\Rightarrow (n_2, o_2)$ for any $n_2$, $o_2$. We call this the *-loop condition.

Proof: By induction on step counts.

3.4 Language Properties

This section describes properties of parsing expression languages (PELs), the class of languages that can be expressed by PEGs. PELs are closed under union, intersection, and complement. It is undecidable in general whether a PEG represents a nonempty language, or whether two PEGs represent the same language.
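Before developing these language properties, it may help to see the recognition relation of Section 3.3 in executable form. The following is a minimal sketch in Python, not an implementation taken from the paper: the class names (`Empty`, `Term`, `NonTerm`, `Seq`, `Choice`, `Star`, `Not`), the function `recognize`, and the helpers `and_pred` and `any_of` are all hypothetical names chosen for illustration. The sketch omits the step counter, reports the number of terminals consumed rather than the consumed string, and models the *-loop condition by raising an exception, which is an assumption made here to keep the function total on that case. Left-recursive rules are not detected and would simply exhaust Python's recursion limit.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class Expr:
    """Base class for abstract parsing expressions (illustrative names)."""


@dataclass(frozen=True)
class Empty(Expr):      # the empty string
    pass


@dataclass(frozen=True)
class Term(Expr):       # a single terminal a
    ch: str


@dataclass(frozen=True)
class NonTerm(Expr):    # a nonterminal A, looked up in the rule set R
    name: str


@dataclass(frozen=True)
class Seq(Expr):        # e1 e2
    first: Expr
    second: Expr


@dataclass(frozen=True)
class Choice(Expr):     # e1 / e2, prioritized
    first: Expr
    second: Expr


@dataclass(frozen=True)
class Star(Expr):       # e*
    body: Expr


@dataclass(frozen=True)
class Not(Expr):        # !e
    body: Expr


def recognize(e: Expr, x: str, pos: int, rules: Dict[str, Expr]) -> Optional[int]:
    """Return how many terminals e consumes from x[pos:], or None on failure."""
    if isinstance(e, Empty):
        return 0
    if isinstance(e, Term):
        return 1 if x.startswith(e.ch, pos) else None
    if isinstance(e, NonTerm):
        return recognize(rules[e.name], x, pos, rules)
    if isinstance(e, Seq):
        n1 = recognize(e.first, x, pos, rules)
        if n1 is None:
            return None                       # e2 is never attempted
        n2 = recognize(e.second, x, pos + n1, rules)
        return None if n2 is None else n1 + n2
    if isinstance(e, Choice):
        n1 = recognize(e.first, x, pos, rules)
        return n1 if n1 is not None else recognize(e.second, x, pos, rules)
    if isinstance(e, Star):
        total = 0
        while True:                           # greedy: stop only when the body fails
            n = recognize(e.body, x, pos + total, rules)
            if n is None:
                return total
            if n == 0:                        # the *-loop condition: input not handled
                raise ValueError("repetition body succeeded without consuming input")
            total += n
    if isinstance(e, Not):
        return 0 if recognize(e.body, x, pos, rules) is None else None
    raise TypeError(f"unknown expression: {e!r}")


def and_pred(e: Expr) -> Expr:
    """Desugar &e to !(!e), as in Section 3.2."""
    return Not(Not(e))


def any_of(chars: str) -> Expr:
    """Desugar a character class over the given terminals to a1 / a2 / ... / an."""
    e: Expr = Term(chars[-1])
    for c in reversed(chars[:-1]):
        e = Choice(Term(c), e)
    return e


ab_first = {"A": Choice(Seq(Term("a"), Term("b")), Term("a"))}   # A <- a b / a
a_first = {"A": Choice(Term("a"), Seq(Term("a"), Term("b")))}    # A <- a / a b
print(recognize(NonTerm("A"), "ab", 0, ab_first))                # 2
print(recognize(NonTerm("A"), "ab", 0, a_first))                 # 1
print(recognize(Seq(Star(Term("a")), Term("a")), "aaa", 0, {}))  # None
```

The three calls at the end replay two observations made earlier in the paper: the rules 'A <- a b / a' and 'A <- a / a b' behave differently on the input "ab", and 'a* a' never matches because the greedy repetition leaves nothing for the trailing 'a'. A success result smaller than the input length corresponds to an expression that succeeds without consuming all of its input.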

Definition: The language $L(G)$ of a PEG $G = (V_N, V_T, R, e_S)$ is the set of strings $x \in V_T^*$ for which the start expression $e_S$ matches $x$.

Note that the start expression $e_S$ only needs to succeed on input string $x$ for $x$ to be included in $L(G)$; $e_S$ need not consume all of string $x$. For example, the trivial grammar $(\{\}, V_T, \{\}, \varepsilon)$ recognizes the language $V_T^*$ and not just the empty string, because the start expression $\varepsilon$ always succeeds even though it does not examine or consume any input. This definition contrasts with TS and gTS, in which partially consumed input strings are excluded from the language and classified as partial-acceptance failures [5].

Definition: A language $L$ over an alphabet $V_T$ is a parsing expression language (PEL) iff there exists a parsing expression grammar $G$ whose language is $L$.

Theorem: The class of parsing expression languages is closed under union, intersection, and complement.

Proof: Suppose we have two grammars $G_1 = (V_N^1, V_T, R^1, e_S^1)$ and $G_2 = (V_N^2, V_T, R^2, e_S^2)$ respectively, describing languages $L(G_1)$ and $L(G_2)$. Assume without loss of generality that $V_N^1 \cap V_N^2 = \emptyset$, by renaming nonterminals if necessary. We can form a new grammar $G' = (V_N^1 \cup V_N^2, V_T, R^1 \cup R^2, e'_S)$, where $e'_S$ is one of the following:

- If $e'_S = e_S^1 / e_S^2$, then $L(G') = L(G_1) \cup L(G_2)$.
- If $e'_S = \&e_S^1\, e_S^2$, then $L(G') = L(G_1) \cap L(G_2)$.
- If $e'_S =\ !e_S^1$, then $L(G') = V_T^* - L(G_1)$.

Theorem: The class of PELs includes non-context-free languages.

Proof: The classic example language $a^n b^n c^n$ is not context-free, but we can recognize it with a PEG $G = (\{A, B, D\}, \{a, b, c\}, R, D)$, where $R$ contains the following definitions:

    A <- a A b / ε
    B <- b B c / ε
    D <- &(A !b) a* B !.

Theorem: It is undecidable in general whether the language $L(G)$ of an arbitrary parsing expression grammar $G$ is empty.

Proof: We first prove in the same way as for CFGs [3] that it is undecidable whether the intersection of the languages of two PEGs is empty. Since PELs are closed under intersection, an algorithm to test the emptiness of the language $L(G)$ of any $G$ could be used to test whether $L(G_1) \cap L(G_2)$ is empty, implying that emptiness is undecidable as well.

Given an instance $C = (x_1, y_1), \ldots, (x_n, y_n)$ of Post's correspondence problem over an alphabet $\Sigma$, it is known to be undecidable whether there is a non-empty string $w$ that can be built from elements of $C$ such that $w = x_{i_1} x_{i_2} \ldots x_{i_m} = y_{i_1} y_{i_2} \ldots y_{i_m}$, where $1 \le i_j \le n$ for each $1 \le j \le m$.

We build a grammar $G = (V_N, V_T, R, D)$ where $V_N = \{A, B, D\}$ and $V_T = \Sigma \cup \{a_1, \ldots, a_n\}$. The $a_i$ in $V_T$ are distinct terminals not in $\Sigma$, which will serve as markers associated with the elements of $C$. $R$ contains the following three rules:

    A <- x1 A a1 / x2 A a2 / ... / xn A an / ε
    B <- y1 B a1 / y2 B a2 / ... / yn B an / ε
    D <- &. &(A !.) B !.

Nonterminal $A$ matches strings of the form $x_{i_1} x_{i_2} \ldots x_{i_m} a_{i_m} \ldots a_{i_2} a_{i_1}$, while $B$ matches strings of the form $y_{i_1} y_{i_2} \ldots y_{i_m} a_{i_m} \ldots a_{i_2} a_{i_1}$. The nonterminal $D$ uses the and-predicate operator to match only strings matching both $A$ and $B$, representing solutions to the correspondence problem. The '&.' at the beginning of the definition of $D$ (desugared appropriately) ensures that empty solutions are not allowed, and the '!.' after the references to $A$ and $B$ ensure that the complete input is consumed in each case. An algorithm to decide whether $L(G)$ is nonempty could therefore be used to solve the correspondence problem $C$, yielding the desired result.

Definition: Two PEGs $G_1$ and $G_2$ are equivalent if they recognize the same language: $L(G_1) = L(G_2)$.

Theorem: The equivalence of two arbitrary PEGs is undecidable.

Proof: An algorithm to decide the equivalence of two PEGs could also be used to decide the non-emptiness problem above, simply by comparing the grammar to be tested against a trivial grammar for the empty language.

3.5 Analysis of Grammars

We often would like to analyze the behavior of a particular grammar over arbitrary input strings. While many interesting properties of PEGs are undecidable in general, conservative analysis proves useful and adequate for many practical purposes.

Theorem: It is undecidable whether an arbitrary grammar is complete: that is, whether it either succeeds or fails on all input strings.

Proof: Suppose we have an arbitrary grammar $G = (V_N, V_T, R, e_S)$, and we define a new grammar $G' = (V'_N, V_T, R', e'_S)$, where $V'_N = V_N \cup \{A\}$, $A \notin V_N$, $R' = R \cup \{A \leftarrow \&e_S\, A\}$, and $e'_S = A$. If $G$'s start expression $e_S$ succeeds on any input string $x$, then this input will cause a degenerate loop in $G'$ via nonterminal $A$, so $G'$ is incomplete. If $L(G)$ is empty, however, then $G'$ is complete and also fails on all inputs. An algorithm to decide whether $G'$ is complete would therefore allow us to decide whether $G$ is empty, which has already been shown undecidable.

Definition: We define a relation $\rightharpoonup_G$ consisting of pairs of the form $(e, o)$, where $e$ is an expression and $o \in \{0, 1, f\}$. We will write $e \rightharpoonup o$ for $(e, o) \in\ \rightharpoonup_G$, with the reference to $G$ being implied. This relation represents an abstract simulation of the $\Rightarrow_G$ relation. If $e \rightharpoonup 0$, then $e$ might succeed on some input string while consuming no input. If $e \rightharpoonup 1$, then $e$ might succeed while consuming at least one terminal. If $e \rightharpoonup f$, then $e$ might fail on some input. We will use the variable $s$ to represent a $\rightharpoonup_G$ outcome of either 0 or 1. We define the simulation relation $\rightharpoonup_G$ inductively as follows:

1. $\varepsilon \rightharpoonup 0$.
2. $a \rightharpoonup 1$.
3. $a \rightharpoonup f$.
4. $A \rightharpoonup o$ if $R_G(A) \rightharpoonup o$.
5. $e_1 e_2 \rightharpoonup 0$ if $e_1 \rightharpoonup 0$ and $e_2 \rightharpoonup 0$. $e_1 e_2 \rightharpoonup 1$ if $e_1 \rightharpoonup 1$ and $e_2 \rightharpoonup s$. $e_1 e_2 \rightharpoonup 1$ if $e_1 \rightharpoonup s$ and $e_2 \rightharpoonup 1$.
6. $e_1 e_2 \rightharpoonup f$ if $e_1 \rightharpoonup f$.
7. $e_1 e_2 \rightharpoonup f$ if $e_1 \rightharpoonup s$ and $e_2 \rightharpoonup f$.
8. $e_1 / e_2 \rightharpoonup s$ if $e_1 \rightharpoonup s$.
9. $e_1 / e_2 \rightharpoonup o$ if $e_1 \rightharpoonup f$ and $e_2 \rightharpoonup o$.
10. $e^* \rightharpoonup 1$ if $e \rightharpoonup 1$.
11. $e^* \rightharpoonup 0$ if $e \rightharpoonup f$.
12. $!e \rightharpoonup f$ if $e \rightharpoonup s$.
13. $!e \rightharpoonup 0$ if $e \rightharpoonup f$.

Because this relation does not depend on the input string, and there are a finite number of relevant expressions in a grammar, we can compute this relation over any grammar by applying the above rules iteratively until we reach a fixed point.

Theorem: The relation $\rightharpoonup_G$ summarizes $\Rightarrow_G$ as follows:

- If $(e, x) \Rightarrow_G (n, \varepsilon)$, then $e \rightharpoonup 0$.
- If $(e, x) \Rightarrow_G (n, y)$ and $|y| > 0$, then $e \rightharpoonup 1$.
- If $(e, x) \Rightarrow_G (n, f)$, then $e \rightharpoonup f$.

Proof: By induction over the step counts of the relation $\Rightarrow_G$. The definition rules for $\rightharpoonup_G$ above correspond one-to-one to the rules for $\Rightarrow_G$. The conclusion in each case follows immediately from the inductive hypothesis, except in the cases for the repetition operator, which require the *-loop condition theorem from Section 3.3.

3.6 Well-Formed Grammars

A well-formed grammar is a grammar that contains no directly or mutually left-recursive rules, such as $A \leftarrow A\,a \,/\, a$, which could prevent the grammar from handling any input string. This checkable structural property implies completeness, while being permissive enough for most purposes. A grammar can have left-recursive rules but still be complete if its degenerate loops are actually unreachable, but we have little need for such grammars in practice.

Definition: We define an inductive set $WF_G$ as follows. We write $WF(e)$ for $e \in WF_G$, with the reference to $G$ being implied, to mean that expression $e$ is well-formed in $G$.

1. $WF(\varepsilon)$.
2. $WF(a)$.
3. $WF(A)$ if $WF(R_G(A))$.
4. $WF(e_1 e_2)$ if $WF(e_1)$ and $e_1 \rightharpoonup 0$ implies $WF(e_2)$.
5. $WF(e_1 / e_2)$ if $WF(e_1)$ and $WF(e_2)$.
6. $WF(e^*)$ if $WF(e)$ and $e \not\rightharpoonup 0$.
7. $WF(!e)$ if $WF(e)$.

A grammar $G$ is well-formed if all of the expressions in its expression set $E(G)$ are well-formed in $G$. As with the $\rightharpoonup_G$ relation, the $WF_G$ set can be computed by iteration to a fixed point.

Lemma: Assume that grammar $G$ is well-formed, and that all expressions in $E(G)$ handle all strings $x \in V_T^*$ of length $n$ or less. Then the expressions in $E(G)$ also handle all strings of length $n + 1$.

Proof: By induction over the step counts of $\Rightarrow_G$. The interesting cases are as follows:

- For a nonterminal $A$, the induction hypothesis allows us to assume that $R_G(A)$ handles all strings of length $n + 1$; therefore so does $A$ by the definition of $\Rightarrow_G$.
- For a sequence $e_1 e_2$, we can assume that $e_1$ handles all strings $x$ of length $n + 1$. If $(e_1, x) \Rightarrow (n, \varepsilon)$, then $e_1 \rightharpoonup 0$, so $WF(e_2)$ applies and $e_2$ also handles $x$. If $(e_1, x) \Rightarrow (n, y)$ for $|y| > 0$, then $e_2$ only needs to handle strings of length $n$ or less, which is given. If $(e_1, x) \Rightarrow (n, f)$, then $e_2$ is not used.
- For $e^*$, the $WF(e)$ condition ensures that $e$ handles inputs of length $n + 1$, and the $e \not\rightharpoonup 0$ condition ensures that the recursive dependency on $e^*$ in the success case only needs to handle strings of length $n$ or less.

Theorem: A well-formed grammar $G$ is complete.

Proof: By induction over the length of input strings, each expression in $E(G)$ handles every input string. Since $G$'s start expression $e_S$ is in $E(G)$, the conclusion follows.

3.7 Grammar Identities

A number of important identities allow PEGs to be transformed without changing the language they represent. We will use these identities in subsequent results.

Theorem: The sequence and alternation operators are associative under expression equivalence: $e_1 (e_2 e_3) \simeq (e_1 e_2) e_3$ and $e_1 / (e_2 / e_3) \simeq (e_1 / e_2) / e_3$.

Proof: Trivial, from the definition of $\Rightarrow_G$.

Theorem: Sequence operators can be distributed into choice operators on the left but not on the right: $e_1 (e_2 / e_3) \simeq (e_1 e_2) / (e_1 e_3)$, but $(e_1 / e_2) e_3 \not\simeq (e_1 e_3) / (e_2 e_3)$.

Proof: In the left-side case, the expression $(e_1 e_2) / (e_1 e_3)$ invokes $e_1$ twice from the same starting point (on the same input string), making its result the same as the factored $e_1 (e_2 / e_3)$ expression. In the right-side case, however, suppose that $e_1$ succeeds but $e_3$ fails. In the expression $(e_1 / e_2) e_3$, the failure of $e_3$ causes the whole expression to fail. In $(e_1 e_3) / (e_2 e_3)$, however, the first instance of $e_3$ only causes the first alternative to fail; the second alternative will then be tried, in which the $e_3$ might succeed if $e_2$ consumes a different amount of input than $e_1$ did.

Theorem: Predicates can be moved left within sequences distributively as follows: $e_1\, !e_2 \simeq\ !(e_1 e_2)\, e_1$.

Proof: If $e_1$ succeeds, then $e_2$ is tested starting at the same point in each case, resulting in the same overall behavior; the second case merely invokes $e_1$ twice at the same position. If $e_1$ fails, then the predicate in $e_1\, !e_2$ is not tested at all. The predicate in $!(e_1 e_2)\, e_1$ is tested, and succeeds because of the first $e_1$'s failure, but the overall result is still failure due to the second instance of $e_1$.

Definition: Two expressions $e_1$ and $e_2$ are disjoint if they succeed on disjoint sets of input strings: $M_G(e_1) \cap M_G(e_2) = \emptyset$.

Theorem: A choice expression $e_1 / e_2$ is commutative if its subexpressions are disjoint.

Proof: If either $e_1$ or $e_2$ fails on a string $x$, it does not matter which is tested first. The only way the language can be affected by changing their order is if $e_1$ and $e_2$ both succeed on $x$ and consume different amounts of input. Disjointness precludes this possibility.

Although the results from Section 3.4 imply that disjointness is undecidable in general, it is easy to "force" a choice expression to be disjoint via the following simple transformation:

Theorem: $e_1 / e_2 \simeq e_1 \,/\, !e_1\, e_2 \simeq\ !e_1\, e_2 \,/\, e_1$, and the latter two equivalent choice expressions are disjoint.

Proof: Trivial, by case analysis.

4 Reductions on PEGs

In this section we present methods of reducing PEGs to simpler forms that may be more useful for implementation or easier to reason about formally. First we describe how to eliminate repetition and predicate operators, then we show how PEGs can be mapped into the much more restrictive TS/TDPL and gTS/GTDPL systems.

4.1 Eliminating Repetition Operators

As in CFGs, repetition expressions can be eliminated from a PEG by converting them into recursive nonterminals. Unlike in CFGs, the substitute nonterminal in a PEG must be right-recursive.

Theorem: Any repetition expression $e^*$ can be eliminated by replacing it with a new nonterminal $A$ with the definition $A \leftarrow e\,A \,/\, \varepsilon$.

Proof: By induction on the length of the input string.

Theorem: For any PEG $G$, an equivalent repetition-free grammar $G'$ can be created.

Proof: Simply eliminate all repetition expressions throughout $G$'s nonterminal definitions and start expression.

4.2 Eliminating Predicates

In this section we show how to eliminate all predicate operators from any well-formed grammar whose language does not include the empty string. The restriction to grammars that do not accept the empty string is a minor but unavoidable problem: we will show later that it is impossible for a predicate-free grammar to accept the empty string without accepting all input strings.

Given a well-formed, repetition-free grammar $G = (V_N, V_T, R, e_S)$ where $\varepsilon \notin L(G)$, we will create an equivalent grammar $G' = (V'_N, V_T, R', e'_S)$ that is well-formed, repetition-free, and predicate-free. This process occurs in three normalization stages. In the first stage, we rewrite the grammar so that sequence and predicate expressions only contain nonterminals and choice expressions are disjoint. In the second stage, we further rewrite the grammar so that nonterminals never succeed without consuming any input. In the third stage we finally eliminate predicates.

4.2.1 Stage 1

In this stage we rewrite the existing definitions in $R$ and the original start expression $e_S$, adding some new nonterminals and corresponding definitions in the process, to produce $V'_N$, $R_1$, and $e_{S1}$.

We first add three special nonterminals, $T$, $Z$, and $F$, with corresponding rules as follows. The nonterminal $T$ matches any single terminal, and has the definition 'T <- .' in concrete PEG syntax, before desugaring. The nonterminal $Z$ matches and consumes any input string; to avoid introducing repetition operators, we define it $Z \leftarrow T\,Z \,/\, \varepsilon$. The nonterminal $F$ always fails; to avoid using predicates we define it $F \leftarrow Z\,T$.

We define a function $f$ recursively as follows, to convert expressions in our original grammar $G$ into our first normal form:

1. $f(e) = e$ if $e \in \{\varepsilon\} \cup V_N \cup V_T$.
2. $f(e_1 e_2) = A\,B$, adding $A \leftarrow f(e_1)$ and $B \leftarrow f(e_2)$ to $R_1$.
3. $f(e_1 / e_2) = A \,/\, !A\, f(e_2)$, adding $A \leftarrow f(e_1)$ to $R_1$.
4. $f(!e) =\ !A$, adding $A \leftarrow f(e)$ to $R_1$.

Definition: The stage 1 grammar $G_1$ of $G$ is $(V'_N, V_T, R_1, e_{S1})$, where $e_{S1} = f(e_S)$, $R_1 = \{A \leftarrow f(e) \mid A \leftarrow$
e2Rg[fnewdenitionsresultingfromapplicationoffg,andV0N=VN[fnewnonterminalsresultingfromapplicationoffg.Lemma:Foranyexpressione,f(e)G1e.Proof:Bystructuralinductionovere.Theonlyinterestingcaseisforchoiceexpressions,whichusestheidentitye1=e2e1=!e1e2.Theorem:G1G,allsequenceandpredicateexpressionsintheexpressionsetofG1containonlynonterminalsastheirsubexpres-sions,andallchoiceexpressionsaredisjoint.Proof:Directfromtheconstructionoff.4.2.2Stage2Wenowrewritethestage1grammarG1intoanotherequiva-lentgrammarG2=(V0N;VT;R2;eS2),inwhichallnonterminalsei-thersucceedandconsumeanonemptyinputprex,orfail:8A2V0N(A6*G20).Thistransformationisanalogoustoe-reductiononCFGs,thoughthedetailsaredifferentduetopredicates.Weusetwofunctionsg0andg1,to“split”expressionsintoe-onlyande-freeparts,respectively.Thee-onlypartg0(e)ofanexpres-sioneisanexpressionthatyieldsthesameresultaseonallinputstringsforwhichesucceedswithoutconsuminganyinput,andfailsotherwise.Thee-freepartg1(e)ofelikewiseyieldsthesameresultaseonallinputsforwhichesucceedsandconsumesatleastoneterminal,andfailsotherwise.Werstdeneg0recursivelyasfollows:1.g0(e)=e.2.g0(a)=F.3.g0(A)=g0(RG(A)).4.g0(AB)=g0(A)g0(B)ifA*0,otherwiseg0(AB)=F.5.g0(e1=e2)=g0(e1)=g0(e2).6.g0(!A)=!(A=g0(A)).Lemma:Thefunctiong0terminatesifGiswell-formed.Proof:BystructuralinductionovertheWFGrelation.Terminationreliesong0(AB)notrecursivelyinvokingg0(B)ifA6*0. 
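The behavior claimed for the repetition-free definition $A \leftarrow e A / \varepsilon$ in Section 4.1, and for the special nonterminals $T$, $Z$, and $F$ of Stage 1, can be checked on small inputs with a toy recognizer. The following is only a minimal sketch, not the paper's formal $\Rightarrow_G$ relation: the tuple encoding of expressions, the helper names, and the `match` function are all assumptions made for illustration, and it assumes a well-formed grammar (no left recursion).

```python
# Expressions are encoded as tuples; this encoding and every helper name
# below are illustrative assumptions, not notation from the paper.
EPS = ("eps",)   # the empty expression
ANY = ("any",)   # '.' in concrete PEG syntax: any single terminal


def term(c):
    return ("term", c)


def nt(name):
    return ("nt", name)


def seq(e1, e2):
    return ("seq", e1, e2)


def choice(e1, e2):
    return ("choice", e1, e2)


def star(e):
    return ("star", e)


def match(grammar, expr, s, pos=0):
    """Return the position after expr succeeds at pos, or None if it fails."""
    kind = expr[0]
    if kind == "eps":
        return pos
    if kind == "any":
        return pos + 1 if pos < len(s) else None
    if kind == "term":
        return pos + 1 if pos < len(s) and s[pos] == expr[1] else None
    if kind == "nt":
        return match(grammar, grammar[expr[1]], s, pos)
    if kind == "seq":
        mid = match(grammar, expr[1], s, pos)
        return None if mid is None else match(grammar, expr[2], s, mid)
    if kind == "choice":
        # Prioritized choice: the second alternative runs only if the first fails.
        first = match(grammar, expr[1], s, pos)
        return first if first is not None else match(grammar, expr[2], s, pos)
    if kind == "star":
        while True:
            nxt = match(grammar, expr[1], s, pos)
            if nxt is None or nxt == pos:  # stop on failure or lack of progress
                return pos
            pos = nxt
    raise ValueError(f"unknown expression kind: {kind}")


GRAMMAR = {
    "T": ANY,                                   # T <- .
    "Z": choice(seq(nt("T"), nt("Z")), EPS),    # Z <- T Z / eps
    "F": seq(nt("Z"), nt("T")),                 # F <- Z T
    "A": choice(seq(term("a"), nt("A")), EPS),  # A <- a A / eps, replacing a*
}
SAMPLES = ["", "a", "aaa", "aab", "baa"]
```

On each string in `SAMPLES`, `Z` should consume the whole input, `F` should fail, and `A` should stop at the same position as `star(term("a"))`. Agreement on a handful of strings is of course only a sanity check, not a substitute for the inductive proofs.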
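We now define the function $g_1$ primitive-recursively as follows:

1. $g_1(\varepsilon) = F$.
2. $g_1(a) = a$.
3. $g_1(A) = A$.
4. $g_1(A B) = g_0(A)\, B / A\, g_0(B) / A B$.
5. $g_1(e_1 / e_2) = g_1(e_1) / g_1(e_2)$.
6. $g_1(!e) = F$.

Definition: The stage 2 grammar $G_2$ is $(V'_N, V_T, R_2, e_{S2})$, where $R_2 = \{A \leftarrow g_1(e) \mid A \leftarrow e \in R_1\}$, and $e_{S2} = g_1(e_{S1}) / g_0(e_{S1})$.

We effectively split all of the nonterminal definitions in $R_1$, retaining only the $\varepsilon$-free parts in the definitions of $R_2$, while substituting the corresponding $\varepsilon$-only parts at the points where these nonterminals are referenced in order to preserve the original behavior. There are only two such points: case 6 of $g_0$, where we rewrite the operands of predicate expressions, and case 4 of $g_1$, for $\varepsilon$-free sequences.

We say that the splitting invariant holds if the following is true:

If $(e, x) \Rightarrow^{+}_{G_1} \varepsilon$, then $(g_0(e), x) \Rightarrow^{+}_{G_2} \varepsilon$ and $(g_1(e), x) \Rightarrow^{+}_{G_2} f$.
If $(e, x) \Rightarrow^{+}_{G_1} y$ for $|y| > 0$, then $(g_0(e), x) \Rightarrow^{+}_{G_2} f$ and $(g_1(e), x) \Rightarrow^{+}_{G_2} y$.
If $(e, x) \Rightarrow^{+}_{G_1} f$, then $(g_0(e), x) \Rightarrow^{+}_{G_2} f$ and $(g_1(e), x) \Rightarrow^{+}_{G_2} f$.

Lemma: Assume that the splitting invariant holds for all input strings of length $n$ or less. Then the splitting invariant holds for strings of length $n+1$.

Proof: By induction over the step counts of $\Rightarrow_{G_1}$ and $\Rightarrow_{G_2}$.

Theorem: $G_2$ is well-formed and equivalent to $G$, and for all nonterminals $A \in V_N$, $A \not\rightharpoonup_{G_2} 0$.

Proof: A direct consequence of the splitting invariant and the fact that $G$ is well-formed.

4.2.3 Stage 3

Finally we rewrite $G_2$ into the final grammar $G' = (V'_N, V_T, R', e'_S)$.

Definition: We define a function $d$, such that $d(A, e)$ "distributes" a nonterminal $A$ into an $\varepsilon$-only expression $e$ resulting from the stage 2 function $g_0$:

1. $d(A, e) = e$, if $e \in \{\varepsilon, F\}$.
2. $d(A, e_1 e_2) = d(A, e_1)\, d(A, e_2)$.
3. $d(A, e_1 / e_2) = d(A, e_1) / d(A, e_2)$.
4. $d(A, !e) =\ !(A e)$.

Lemma: If $e = g_0(e')$ and $e' \in E(G_2)$, then $A e \simeq d(A, e)\, A$. That is, we can use $d(A, e)$ to move $e$ leftward across a nonterminal reference in a sequence expression.

Proof: Structural induction on $e$ and the identities in Section 3.7.

Now define a function $n(e, C) = (e\, (Z / \varepsilon) / \varepsilon)\, C$.

Lemma: If $e$ is an $\varepsilon$-only expression in $G_2$, then $!e\, C \simeq n(e, C)$.

Proof: If $e$ succeeds, then the $(Z / \varepsilon)$ also succeeds and consumes the entire remaining input. (The nonterminal $Z$ alone is not sufficient because it was rewritten in stage 2 to be $\varepsilon$-free.) Since any nonterminal $C$ is $\varepsilon$-free, the overall expression will therefore fail. If $e$ fails, however, then $(e\, (Z / \varepsilon) / \varepsilon)$ succeeds without consuming anything, making the overall expression behave according to $C$.

We now define a function $h_0$ to eliminate predicates from $\varepsilon$-producing expressions resulting from the $g_0$ or $d$ functions:

1. $h_0(\varepsilon, C) = C$.
2. $h_0(F, C) = F$.
3. $h_0(e_1 e_2, C) = n(n(h_0(e_1, C), C) / n(h_0(e_2, C), C), C)$.
4. $h_0(e_1 / e_2, C) = h_0(e_1, C) / h_0(e_2, C)$.
5. $h_0(!(B / e), C) = n(B / h_0(e, C), C)$.
6. $h_0(!(A (B / e)), C) = n(A (B / h_0(e, C)), C)$.

Lemma: If $e = g_0(e')$ or $e = d(A, g_0(e'))$ and $e' \in E(G_2)$, then $h_0(e, C)$ is a predicate-free expression equivalent to $e\, C$.

Proof: By structural induction over $e$. Case 5 handles predicates resulting directly from $g_0$, which always have the form $!(B / e_1)$, where $e_1$ is likewise an expression resulting from $g_0$. Case 6 similarly handles the situation $e = d(A, g_0(e'))$. Case 3 rewrites a sequence $e_1 e_2$ using the not-predicate analog of De Morgan's Law: if $e_1$ and $e_2$ are $\varepsilon$-only expressions, then $e_1 e_2 \simeq\ !(!e_1 /\ !e_2)$. We cannot simply use $h_0(e_1 e_2, C) = h_0(e_1, C)\, h_0(e_2, C)$ because $h_0(e_1, C)$ consumes input if $C$ succeeds, which would cause the $h_0(e_2, C)$ part to start at the wrong position.

We now define a corresponding function $h_1$ to eliminate predicates from the $\varepsilon$-free expressions generated by the stage 2 function $g_1$:

1. $h_1(e) = e$, if $e \in \{a, A\}$.
2. $h_1(A B) = A B$.
3. $h_1(e_1 B) = h_0(e_1, B)$, if $e_1$ is not a nonterminal.
4. $h_1(A e_2) = h_0(d(A, e_2), A)$, if $e_2$ is not a nonterminal.
5. $h_1(e_1 / e_2) = h_1(e_1) / h_1(e_2)$.

Lemma: If $e = g_1(e')$ and $e' \in E(G_2)$, then $h_1(e)$ is a predicate-free expression equivalent to $e$ in $G_2$.

Proof: By structural induction over $e$. In case 3, we know from the definition of $g_1$ that $e_1$ is an $\varepsilon$-only expression resulting from $g_0$, so we use the function $h_0$ to combine it with the subsequent ($\varepsilon$-free) nonterminal and eliminate predicates from $e_1$. Case 4 is similar, except that we must first move $e_2$ to the left of $A$ using the $d$ function before applying the predicate transformation.

Definition: The predicate-reduced grammar $G'$ of $G$ is $(V'_N, V_T, R', e'_S)$, where $V'_N$ is the set of nonterminals produced in stage 1, $R' = \{A \leftarrow h_1(e) \mid A \leftarrow e \in R_2\}$, and $e'_S = h_1(g_1(e_{S1})) / h_0(g_0(e_{S1}), T)$.

Theorem: $G'$ is well-formed, repetition-free, predicate-free, and equivalent to $G$.

Proof: $G'$ is repetition-free because $G$ is repetition-free and we never introduced any repetition operators. From the previous result, each nonterminal $A \in V'_N$ is equivalent to the corresponding nonterminal in the stage 2 grammar. By the same result, the $h_1(g_1(e_{S1}))$ part of the new start expression $e'_S$ is equivalent to the $\varepsilon$-free part of the original start expression $e_S$. The $h_0(g_0(e_{S1}), T)$ in the new start expression succeeds and consumes exactly one terminal whenever the input string is nonempty and the $\varepsilon$-only part of the original start expression $e_S$ succeeds. Finally, since we made the assumption at the start that the original grammar does not accept the empty string, the transformed grammar behaves identically for this degenerate case. Since the acceptance of a string into the language of a grammar only depends on the success or failure of the start expression, and not on how much of the input the start expression consumes, the new grammar $G'$ accepts exactly the same strings as $G$.

4.2.4 The Empty String Limitation

To show that we have no hope of avoiding the restriction that the original grammar cannot accept the empty input string, we prove that any predicate-free grammar cannot accept the empty input string without accepting all input strings.

Lemma: Assume that $G$ is a predicate-free grammar, and that for any expression $e$ and input $x$ of length $n$ or less, $(e, \varepsilon) \Rightarrow^{+} \varepsilon$ iff $(e, x) \Rightarrow^{+} \varepsilon$. Then the same holds for input strings of length $n+1$.

Proof: By induction over step counts in $\Rightarrow_G$.

Theorem: In a repetition-free grammar $G$, an expression $e$ matches the empty string iff it matches all input strings and produces only $\varepsilon$ results. In consequence, $\varepsilon \in L(G)$ implies $L(G) = V_T^*$.

Proof: By induction over string length.

We could work around the empty string limitation by defining PEGs to require all recognized strings to include a designated end marker terminal, as Birman does in the original TS and gTS systems [5].

4.3 Reduction to TS/TDPL

We can reduce any predicate-free PEG to an instance of Birman's TS system [4, 5], renamed "Top-Down Parsing Language" (TDPL) by Aho and Ullman [3]. We will use the latter term for its descriptiveness. TDPL uses a set of grammar-like definitions, but these definitions have only a few fixed forms in place of open-ended hierarchical parsing expressions. We can view TDPL as the PEG analog of Chomsky Normal Form (CNF) for context-free grammars. Instead of defining TDPL "from the ground up" as Birman does, we simply define it as a restricted form of PEG.

Definition: A TDPL grammar is a PEG $G = (V_N, V_T, R, S)$ in which $S$ is a nonterminal in $V_N$ and all of the definitions in $R$ have one of the following forms:

1. $A \leftarrow \varepsilon$.
2. $A \leftarrow a$, where $a \in V_T$.
3. $A \leftarrow f$, where $f \equiv\ !\varepsilon$.
4. $A \leftarrow B C / D$, where $B, C, D \in V_N$.

The third form, $A \leftarrow f$, representing unconditional failure, is considered "primitive" in TDPL, although we define it here in terms of the parsing expression $!\varepsilon$. The fourth form, $A \leftarrow B C / D$, combines the functions of nonterminals, sequencing, and choice. A TDPL grammar $G$ is interpreted according to the usual $\Rightarrow_G$ relation.

Theorem: Any predicate-free PEG $G = (V_N, V_T, R, e_S)$ can be reduced to an equivalent TDPL grammar $G' = (V'_N, V_T, R', S)$.

Proof: First we add a new nonterminal $S$ with definition $S \leftarrow e_S$, representing the original start expression. We then add two nonterminals $E$ and $F$ with definitions $E \leftarrow \varepsilon$ and $F \leftarrow f$ respectively. Finally, we rewrite each definition that does not conform to one of the TDPL forms above using the following rules:

$A \leftarrow B \ \mapsto\ A \leftarrow B E / F$
$A \leftarrow e_1 e_2 \ \mapsto\ A \leftarrow B C / F,\quad B \leftarrow e_1,\quad C \leftarrow e_2$
$A \leftarrow e_1 / e_2 \ \mapsto\ A \leftarrow B E / C,\quad B \leftarrow e_1,\quad C \leftarrow e_2$
$A \leftarrow e^* \ \mapsto\ A \leftarrow B A / E,\quad B \leftarrow e$

Aho and Ullman define an "extended TDPL" notation [3] equivalent in expressiveness to repetition-free, predicate-free PEGs, with reduction rules almost identical to those above.

4.4 Reduction to gTS/GTDPL

Birman's "generalized TS" (gTS) system, named "generalized TDPL" (GTDPL) by Aho and Ullman, is similar to TDPL, but uses slightly different basic rule forms that effectively provide the functionality of predicates in PEGs.

Definition: A GTDPL grammar is a PEG $G = (V_N, V_T, R, S)$ in which $S$ is a nonterminal and all of the definitions in $R$ have one of the following forms:

1. $A \leftarrow \varepsilon$.
2. $A \leftarrow a$, where $a \in V_T$.
3. $A \leftarrow f$, where $f \equiv\ !\varepsilon$.
4. $A \leftarrow B[C, D]$, where $B[C, D] \equiv B C /\ !B\, D$, and $B, C, D \in V_N$.

Theorem: Any PEG $G = (V_N, V_T, R, e_S)$ can be reduced to an equivalent GTDPL grammar $G' = (V'_N, V_T, R', S)$.

Proof: First we add the definitions $S \leftarrow e_S$, $E \leftarrow \varepsilon$, and $F \leftarrow f$, as above for TDPL. Then we rewrite all non-conforming definitions using the following transformations:

$A \leftarrow B \ \mapsto\ A \leftarrow B[E, F]$
$A \leftarrow e_1 e_2 \ \mapsto\ A \leftarrow B[C, F],\quad B \leftarrow e_1,\quad C \leftarrow e_2$
$A \leftarrow e_1 / e_2 \ \mapsto\ A \leftarrow B[E, C],\quad B \leftarrow e_1,\quad C \leftarrow e_2$
$A \leftarrow e^* \ \mapsto\ A \leftarrow B[A, E],\quad B \leftarrow e$
$A \leftarrow\ !e \ \mapsto\ A \leftarrow B[F, E],\quad B \leftarrow e$

4.4.1 Parsing PEGs

Corollary: It is possible to construct a linear-time parser for any PEG on a reasonable random-access memory machine.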
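As an informal illustration of this corollary, the four GTDPL rule forms can be interpreted by a recognizer that caches one result per (nonterminal, input position) pair, so the work is bounded by the number of nonterminals times the input length (assuming constant-time dictionary access). This is only a sketch under stated assumptions: it is a top-down memoizing recognizer, not the tabular construction of Aho and Ullman [3]; the rule encoding, the `recognize` function, and the example grammar are hypothetical; and it assumes a well-formed grammar with no circular (left-recursive) dependencies.

```python
def recognize(rules, start, text):
    """Return the position reached by `start` at position 0, or None on failure.

    `rules` maps nonterminal names to one of four hypothetical tuple forms:
      ("eps",)            A <- epsilon
      ("term", c)         A <- a
      ("fail",)           A <- f
      ("cond", B, C, D)   A <- B[C, D]: if B succeeds run C after it, else run D
    """
    memo = {}  # (nonterminal, position) -> end position or None

    def parse(name, pos):
        key = (name, pos)
        if key not in memo:
            rule = rules[name]
            kind = rule[0]
            if kind == "eps":
                result = pos
            elif kind == "term":
                ok = pos < len(text) and text[pos] == rule[1]
                result = pos + 1 if ok else None
            elif kind == "fail":
                result = None
            else:  # "cond"
                _, b, c, d = rule
                after_b = parse(b, pos)
                # On failure of B, D restarts from the original position.
                result = parse(c, after_b) if after_b is not None else parse(d, pos)
            memo[key] = result
        return memo[key]

    return parse(start, 0)


# Example grammar (an assumption for illustration): S <- a S b / a b,
# reduced by hand to the GTDPL forms.
RULES = {
    "E": ("eps",),
    "F": ("fail",),
    "A": ("term", "a"),
    "B": ("term", "b"),
    "S": ("cond", "P", "E", "Q"),    # S <- P[E, Q]
    "P": ("cond", "A", "P2", "F"),   # P <- a (S b)
    "P2": ("cond", "S", "B", "F"),   # P2 <- S b
    "Q": ("cond", "A", "B", "F"),    # Q <- a b
}
```

With these rules, `recognize(RULES, "S", "aabb")` returns the end position 4, while `recognize(RULES, "S", "aab")` returns `None`. The memo table is what keeps repeated backtracking over the same position from being recomputed.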
Proof: Reduce the PEG to a GTDPL grammar and then use the tabular parsing technique described by Aho and Ullman [3].

In practice it is not necessary to reduce a PEG all the way to TDPL or GTDPL form, though it is typically necessary at least to eliminate repetition operators. Practical methods for constructing such linear-time parsers both manually and automatically, particularly using modern functional programming languages such as Haskell [11], are discussed in prior work [8, 7].

4.4.2 Equivalence of TDPL and GTDPL

Theorem: Any well-formed GTDPL grammar that does not accept the empty string can be reduced to an equivalent TDPL grammar.

Proof: Treating the original GTDPL grammar as a repetition-free PEG, first eliminate predicates (Section 4.2), then reduce the resulting predicate-free grammar to TDPL (Section 4.3).

5 Open Problems

This section briefly outlines some promising directions for future work on PEGs and related syntactic formalisms.

Birman defined a transformation on gTS that converts loop failures caused by grammar circularities into ordinary recognition failures [5]. By extension it is possible to convert any PEG into a complete PEG. It is probably possible to transform any PEG into an equivalent well-formed PEG, but this conjecture is unverified; Birman did not define a structural well-formedness property for gTS. Such a transformation on PEGs is conceivable despite the undecidability of a grammar's completeness, since the transformation works essentially by building "run-time" circularity checks into the grammar instead of trying to decide statically at "compile-time" whether any circular conditions are reachable.

Perhaps of more practical interest, we would like a useful conservative algorithm to determine if a choice expression $e_1 / e_2$ in a grammar is definitely disjoint, and therefore commutative. Such an algorithm would enable us to extend PEG syntax with an unordered choice operator `|` analogous to the choice operator used in EBNF syntax for CFGs. The `|` operator would be semantically identical to `/`, but would express the language designer's assertion that the alternatives are disjoint and therefore order-independent, and tools such as PEG analyzers and PEG-based parser generators [7] could verify these assertions automatically.

A final open problem is the relationship and inter-convertibility of CFGs and PEGs. Birman proved that TS and gTS can simulate any deterministic pushdown automaton (DPDA) [5], implying that PEGs can express any deterministic LR-class context-free language. There is informal evidence, however, that a much larger class of CFGs might be recognizable with PEGs, including many CFGs for which no conventional linear-time parsing algorithm is known [7]. It is not even proven yet that CFLs exist that cannot be recognized by a PEG, though recent work in lower bounds on the complexity of general CFG parsing [14] and matrix product [23] shows at least that general CFG parsing is inherently super-linear.

6 Related Work

This work is inspired by and heavily based on Birman's TS/TDPL and gTS/GTDPL systems [4, 5, 3]. The $\Rightarrow_G$ relation and the basic properties in Sections 3.3 and 3.4 are direct adaptations of Birman's work. The major new features of the present work are the extension to support general parsing expressions with repetition and predicate operators, the structural analysis and identity results in Sections 3.5 through 3.7, and the predicate elimination procedure in Section 4.2.

While parsing expressions could conceivably be treated merely as "syntactic sugar" for GTDPL grammars, it is not clear that the predicate elimination transformation, and hence the reduction from GTDPL to TDPL, could be accomplished without the use of more general expression-like forms in the intermediate stages. For this reason it appears that PEGs represent a useful formal notation in its own right, complementary to the minimalist TDPL and GTDPL systems.

It appears TDPL and GTDPL have not seen much practical use, perhaps in large measure because they were originally developed and presented as formal models for certain types of top-down parsers, rather than as a useful syntactic foundation in its own right. Adams [1] used TDPL in a modular language prototyping framework, however. In addition, many practical top-down parsing libraries and toolkits, including the popular ANTLR [21] and the PARSEC combinator library for Haskell [15], provide backtracking capabilities that conform to this model in practice, if perhaps unintentionally. These existing systems generally use "naive" backtracking methods that risk exponential runtime in worst-case scenarios, but the same features can be implemented in strictly linear time using a memoizing "packrat parser" [8, 7].

The positive form of syntactic predicate (the "and-predicate") was introduced by Parr [20] for use in ANTLR [21], and later incorporated into JavaCC under the name "syntactic lookahead" [16]. The metafront system includes a limited, fixed-lookahead form of syntactic predicates under the terms "attractors" and "traps" [6]. The negative form of syntactic predicate (the "not-predicate") appears to be new, but its effect can be achieved in practical parsing systems such as ANTLR and JavaCC using semantic predicates [17].

Many extensions and variations of context-free grammars have been developed, such as indexed grammars [2], W-grammars [28], affix grammars [13], tree-adjoining grammars [12], minimalist grammars [24], and conjunctive grammars [18]. Most of these extensions are motivated by the requirements of expressing natural languages, and all are at least as difficult to parse as CFGs.

Since machine-oriented language translators often need to process large inputs in linear or near-linear time, and there appears to be no hope of general CFG parsing in much better than $O(n^3)$ time [14], most parsing algorithms for machine-oriented languages focus on handling subclasses of the CFGs. Classic deterministic top-down and bottom-up techniques [3] are widely used, but their limitations are frequently felt by language designers and implementors.

The syntax definition formalism SDF increases the expressiveness of CFGs with explicit disambiguation rules, and supports unified language descriptions by combining lexical and context-free syntax definitions into a "two-level" formalism [10]. The nondeterministic linear-time NSLR(1) parsing algorithm [26] is powerful enough to generate "scannerless" parsers from unified syntax definitions without treating lexical analysis separately [22], but the algorithm severely restricts the form in which such CFGs can be written. Other machine-oriented syntax formalisms and tools use CFGs extended with explicit disambiguation rules to express both lexical and hierarchical syntax, supporting unified syntax definitions more cleanly while giving up strictly linear-time parsing [21, 29, 27]. These systems graft recognition-based functionality onto generative CFGs, resulting in a "hybrid" generative/recognition-based syntactic model. PEGs provide similar features in a simpler syntactic foundation by adopting the recognition paradigm from the start.

7 Conclusion

Parsing expression grammars provide a powerful, formally rigorous, and efficiently implementable foundation for expressing the syntax of machine-oriented languages that are designed to be unambiguous. Because of their implicit longest-match recognition capability coupled with explicit predicates, PEGs allow both the lexical and hierarchical syntax of a language to be described in one concise grammar. The expressiveness of PEGs also introduces new syntax design choices for future languages. Birman's GTDPL system serves as a natural "normal form" to which any PEG can easily be reduced. With minor restrictions, PEGs can be rewritten to eliminate predicates and reduced to TDPL, an even more minimalist form. In consequence, we have shown TDPL and GTDPL to be essentially equivalent in recognition power. Finally, despite their ability to express language constructs requiring unlimited lookahead and backtracking, all PEGs are parseable in linear time with a suitable tabular or memoizing algorithm.

Acknowledgments

I would like to thank my advisor Frans Kaashoek, as well as François Pottier, Robert Grimm, Terence Parr, Arnar Birgisson, and the POPL reviewers, for valuable feedback and discussion and for pointing out several errors in the original draft.

8 References

[1] Stephen Robert Adams. Modular Grammars for Programming Language Prototyping. PhD thesis, University of Southampton, 1991.
[2] Alfred V. Aho. Indexed grammars - an extension of context-free grammars. Journal of the ACM, 15(4):647–671, October 1968.
[3] Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation and Compiling - Vol. I: Parsing. Prentice Hall, Englewood Cliffs, N.J., 1972.
[4] Alexander Birman. The TMG Recognition Schema. PhD thesis, Princeton University, February 1970.
[5] Alexander Birman and Jeffrey D. Ullman. Parsing algorithms with backtrack. Information and Control, 23(1):1–34, August 1973.
[6] Claus Brabrand, Michael I. Schwartzbach, and Mads Vanggaard. The metafront system: Extensible parsing and transformation. In Third Workshop on Language Descriptions, Tools and Applications, Warsaw, Poland, April 2003.
[7] Bryan Ford. Packrat parsing: a practical linear-time algorithm with backtracking. Master's thesis, Massachusetts Institute of Technology, Sep 2002.
[8] Bryan Ford. Packrat parsing: Simple, powerful, lazy, linear time. In Proceedings of the 2002 International Conference on Functional Programming, Oct 2002.
[9] Dick Grune and Ceriel J.H. Jacobs. Parsing Techniques - A Practical Guide. Ellis Horwood, Chichester, England, 1990.
[10] J. Heering, P.R.H. Hendriks, P. Klint, and J. Rekers. The syntax definition formalism SDF - reference manual. SIGPLAN Notices, 24(11):43–75, 1989.
[11] Simon Peyton Jones and John Hughes (editors). Haskell 98 Report, 1998. http://www.haskell.org.
[12] Aravind K. Joshi and Yves Schabes. Tree-adjoining grammars. Handbook of Formal Languages, 3:69–124, 1997.
[13] C.H.A. Koster. Affix grammars. In J.E.L. Peck, editor, ALGOL 68 Implementation, pages 95–109, Amsterdam, 1971. North-Holland Publ. Co.
[14] Lillian Lee. Fast context-free grammar parsing requires fast boolean matrix multiplication. Journal of the ACM, 49(1):1–15, 2002.
[15] Daan Leijen. Parsec, a fast combinator parser. http://www.cs.uu.nl/~daan.
[16] Sun Microsystems. Java compiler compiler (JavaCC). https://javacc.dev.java.net/.
[17] Sun Microsystems. JavaCC: LOOKAHEAD minitutorial. https://javacc.dev.java.net/doc/lookahead.html.
[18] Alexander Okhotin. Conjunctive grammars. Journal of Automata, Languages and Combinatorics, 6(4):519–535, 2001.
[19] International Standards Organization. Syntactic metalanguage - Extended BNF, 1996. ISO/IEC 14977.
[20] Terence J. Parr and Russell W. Quong. Adding semantic and syntactic predicates to LL(k) - pred-LL(k). In Proceedings of the International Conference on Compiler Construction, Edinburgh, Scotland, April 1994.
[21] Terence J. Parr and Russell W. Quong. ANTLR: A Predicated-LL(k) parser generator. Software Practice and Experience, 25(7):789–810, 1995.
[22] Daniel J. Salomon and Gordon V. Cormack. Scannerless NSLR(1) parsing of programming languages. In Proceedings of the ACM SIGPLAN '89 Conference on Programming Language Design and Implementation (PLDI), pages 170–178, Jul 1989.
[23] Amir Shpilka. Lower bounds for matrix product. In IEEE Symposium on Foundations of Computer Science, pages 358–367, 2001.
[24] Edward Stabler. Derivational minimalism. Logical Aspects of Computational Linguistics, pages 68–95, 1997.
[25] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley, 3rd edition, June 1997.
[26] Kuo-Chung Tai. Noncanonical SLR(1) grammars. ACM Transactions on Programming Languages and Systems, 1(2):295–320, Oct 1979.
[27] M.G.J. van den Brand, J. Scheerder, J.J. Vinju, and E. Visser. Disambiguation filters for scannerless generalized LR parsers. In Compiler Construction, 2002.
[28] A. van Wijngaarden, B.J. Mailloux, J.E.L. Peck, C.H.A. Koster, M. Sintzoff, C.H. Lindsey, L.G.L.T. Meertens, and R.G. Fisker. Report on the algorithmic language ALGOL 68. Numer. Math., 14:79–218, 1969.
[29] Eelco Visser. A family of syntax definition formalisms. Technical Report P9706, Programming Research Group, University of Amsterdam, 1997.
[30] Niklaus Wirth. What can we do about the unnecessary diversity of notation for syntactic descriptions. Communications of the ACM, 20(11):822–823, November 1977.
