
IEEE TRANSACTIONS ON INFORMATION THEORY
The Smallest Grammar Problem
Moses Charikar, Eric Lehman, April Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, abhi shelat

Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, worst-case behavior) to es



... naturally correspond to nonterminals in a compact grammar. In fact, an original and continuing motivation for work on the problem was to identify regularities in DNA sequences [6], [8]. (Interestingly, [8] espouses the goal of determining the entropy of DNA. This amounts to upper-bounding the Kolmogorov complexity of a human being.) In addition, smallest grammar algorithms have been used to highlight patterns in musical scores [13] and uncover properties of language from example texts [9].

All this is possible because a string represented by a context-free grammar remains relatively comprehensible. This comprehensibility is an important attraction of grammar-based compression relative to otherwise competitive compression schemes. For example, the best pattern matching algorithm that operates on a string compressed as a grammar is asymptotically faster than the equivalent for the well-known LZ77 compression format [14].

D. Hierarchical Approximation

Finally, work on the smallest grammar problem qualitatively extends the study of approximation algorithms. Prior work on approximation algorithms has focused on "flat" objects such as graphs, CNF formulas, bins and weights, etc. In contrast, context-free grammars as well as many real-world problems such as circuit design and image compression have a hierarchical nature. Moreover, standard approximation techniques such as linear and semidefinite programming are not easily transferred to this new domain.

II. PREVIOUS WORK

The smallest grammar problem was articulated explicitly by two groups of authors at about the same time. Nevill-Manning and Witten stated the problem and proposed the SEQUITUR algorithm as a solution [6], [13]. Their main focus was on extracting patterns from DNA sequences, musical scores, and even the Church of Latter-Day Saints genealogical database, although they evaluated SEQUITUR as a compression algorithm as well.

The other group, consisting of Kieffer, Yang, Nelson, and Cosman, approached the smallest grammar problem from a traditional data compression perspective [5], [4], [3]. First, they presented some deep theoretical results on the impossibility of having a "best" compressor under a certain type of grammar compression model for infinite length strings [15]. Then, they presented a host of practical algorithms including BISECTION, MPM, and LONGEST MATCH. Furthermore, they gave an algorithm, which we refer to as SEQUENTIAL, in the same spirit as SEQUITUR, but with significant defects removed. All of these algorithms are described and analyzed in Section VI. Interestingly, on inputs with power-of-two lengths, the BISECTION algorithm of Nelson, Kieffer, and Cosman [16] gives essentially the same representation as a binary decision diagram (BDD) [17]. BDDs have been used widely in digital circuit analysis since the 1980's and also recently exploited for more general compression tasks [18], [19].

While these two lines of research led to the first clear articulation of the smallest grammar problem, its roots go back to much earlier work in the 1970's. In particular, Lempel and Ziv approached the problem from the direction of Kolmogorov complexity [20]. Over time, however, their work evolved toward data compression, beginning with a seminal paper [21] proposing the LZ77 compression algorithm. This procedure does not represent a string by a grammar. Nevertheless, we show in Section VII that LZ77 is deeply entwined with grammar-based compression. Lempel and Ziv soon produced another algorithm, LZ78, which did implicitly represent a string with a grammar [1]. We describe and analyze LZ78 in detail in Section VI. In 1984, Welch increased the efficiency of LZ78 with a new procedure, now known as LZW [2]. In practice, LZW is much preferred over LZ78, but for our purposes the difference is small.

Also in the 1970's, Storer and Szymanski explored a wide range of "macro-based" compression schemes [22], [23], [24]. They defined a collection of attributes that such a compressor might have, such as "recursive", "restricted", "overlapping", etc. Each combination of these adjectives described a different scheme, many of which they considered in detail and proved to be NP-hard.

Recently, the smallest grammar problem has received increasing interest in a broad range of communities. For example, de Marcken's thesis [9] investigated whether the structure of the smallest grammar generating a large, given body of English text could lead to insight about the structure of the language itself. Lanctot, Li, and Yang [8] proposed using the LONGEST MATCH algorithm for the smallest grammar problem to estimate the entropy of DNA sequences. Apostolico and Lonardi [11], [25], [10] suggested a scheme that we call GREEDY and applied it to the same problem. Larsson and Moffat proposed RE-PAIR [7] as a general, grammar-based algorithm. Most of these procedures are described and analyzed in Section VI.

There has also been an effort to develop algorithms that manipulate strings that are in compressed form. For example, Kida [14] and Shibata, et al. [26] have proposed pattern matching algorithms that run in time related not to the length of the searched string, but rather to the size of the grammar representing it. The relatively good performance of such algorithms represents a significant advantage of grammar-based compression over other compression techniques such as LZ77.

In short, the smallest grammar problem has been considered by many authors in many disciplines for many reasons over a span of decades. Given this level of interest, it is remarkable that the problem has not attracted greater attention in the general algorithms community.

III. SUMMARY OF OUR CONTRIBUTIONS

This paper makes four main contributions, enumerated below. Throughout, we use $n$ to denote the length of an input string, and $m$ to denote the size of the smallest grammar generating that same input string.

1) We show that the smallest grammar generating a given string is hard to approximate to within a small constant factor. Furthermore, we show that an $o(\log n / \log\log n)$ approximation would require progress on a well-studied problem in computational algebra.
... the goal of establishing provable guarantees on performance, and therefore establishing a fair basis for comparing algorithms. In addition, the worst-case analysis addresses an inherent problem with characterizing compression performance on low-entropy strings. Kosaraju and Manzini [27] point out that the standard notions of universality and redundancy are not meaningful measures of a compressor's performance on low-entropy strings. Our approximation ratio measure handles all cases and therefore sidesteps this issue.

C. Basic Lemmas

In this subsection, we give some easy lemmas that highlight basic points about the smallest grammar problem. In proofs here and elsewhere, we ignore the possibility of degeneracies where they raise no substantive issue, e.g. a nonterminal with an empty definition or a secondary nonterminal that never appears in a definition.

Lemma 1: The smallest grammar for a string of length $n$ has size $\Omega(\log n)$.

Proof: Let $G$ be an arbitrary grammar of size $m$. We show that $G$ generates a string of length $O(3^{m/3})$, which implies the claim. Define a sequence of nonterminals recursively as follows. Let $T_1$ be the start symbol of grammar $G$. Let $T_{i+1}$ be the nonterminal in the definition of $T_i$ that has the longest expansion. (Break ties arbitrarily.) The sequence ends when a nonterminal $T_n$, defined only in terms of terminals, is reached. Note that the nonterminals in this sequence are distinct, since the grammar is acyclic. Let $k_i$ denote the length of the definition of $T_i$. Then the length of the expansion of $T_i$ is upper-bounded by $k_i$ times the length of the expansion of $T_{i+1}$. By an inductive argument, we find:

$$[T_1] \le k_1 \cdot k_2 \cdots k_n$$

On the other hand, we know that the sum of the sizes of the definitions of $T_1, \ldots, T_n$ is at most the size of the entire grammar:

$$k_1 + k_2 + \cdots + k_n \le m$$

It is well known that a set of positive integers with sum at most $m$ has product at most $3^{\lceil m/3 \rceil}$. Thus the length of the string generated by $G$ is $O(3^{m/3})$ as claimed.
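The final step of the proof of Lemma 1 uses the fact that positive integers with sum at most m have product at most 3^⌈m/3⌉. The following Python sketch checks that bound numerically for small m by dynamic programming. The function name `max_product_with_sum_at_most` is an illustrative choice and does not come from the paper.

```python
def max_product_with_sum_at_most(m):
    """Largest product of positive integers whose sum is at most m."""
    # best[s] = largest product over multisets of positive integers that
    # sum to exactly s; the empty multiset gives best[0] = 1.
    best = [1] * (m + 1)
    for s in range(1, m + 1):
        best[s] = max(part * best[s - part] for part in range(1, s + 1))
    return max(best)


def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


# Check the bound used in Lemma 1 for a range of small grammar sizes.
for m in range(1, 40):
    assert max_product_with_sum_at_most(m) <= 3 ** ceil_div(m, 3)
```

The bound is tight whenever m is a multiple of three, which is why the base of the exponent in Lemma 1 cannot be improved by this argument.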
Next we show that certain highly structured strings are generated by small grammars.

Lemma 2: Let $\alpha$ be the string generated by grammar $G_\alpha$, and let $\beta$ be the string generated by grammar $G_\beta$. Then:
1) There exists a grammar of size $|G_\alpha| + |G_\beta| + 2$ that generates the string $\alpha\beta$.
2) There exists a grammar of size $|G_\alpha| + O(\log k)$ that generates the string $\alpha^k$.

Proof: To establish (1), create a grammar containing all rules in $G_\alpha$, all rules in $G_\beta$, and the start rule $S \to S_\alpha S_\beta$ where $S_\alpha$ is the start symbol of $G_\alpha$ and $S_\beta$ is the start symbol of $G_\beta$. For (2), begin with the grammar $G_\alpha$, and call the start symbol $A_1$. We extend this grammar by defining nonterminals $A_i$ with expansion $\alpha^i$ for various $i$. The start rule of the new grammar is $A_k$. If $k$ is even (say, $k = 2j$), define $A_k \to A_j A_j$ and define $A_j$ recursively. If $k$ is odd (say, $k = 2j + 1$), define $A_k \to A_j A_j A_1$ and again define $A_j$ recursively. When $k = 1$, we are done. With each recursive call, the nonterminal subscript drops by a factor of at least two and at most three symbols are added to the grammar. Therefore, the total grammar size is $|G_\alpha| + O(\log k)$.

Lemma 2 is helpful in lower-bounding the approximation ratios of certain algorithms when it is necessary to show that there exist small grammars for strings such as $a^{k(k+1)/2}(ba^k)^{(k+1)^2}$.

The following lemma is used extensively in our analysis of previously-proposed algorithms. Roughly, it upper-bounds the complexity of a string generated by a small grammar.

Lemma 3 (mk Lemma): If a string $\sigma$ is generated by a grammar of size $m$, then $\sigma$ contains at most $mk$ distinct substrings of length $k$.

Proof: Let $G$ be a grammar for $\sigma$ of size $m$. For each rule $T \to \alpha$ in $G$, we upper-bound the number of length-$k$ substrings of $\langle T \rangle$ that are not substrings of the expansion of a nonterminal in $\alpha$. Each such substring either begins at a terminal in $\alpha$, or else begins with between $1$ and $k-1$ terminals from the expansion of a nonterminal in $\alpha$. Therefore, the number of such strings is at most $|\alpha| \cdot k$. Summing over all rules in the grammar gives the upper bound $mk$.

All that remains is to show that all substrings are accounted for in this calculation. To that end, let $\tau$ be an arbitrary length-$k$ substring of $\sigma$. Find the rule $T \to \alpha$ such that $\tau$ is a substring of $\langle T \rangle$, and $\langle T \rangle$ is as short as possible. Thus, $\tau$ is a substring of $\langle T \rangle$ and is not a substring of the expansion of a nonterminal in $\alpha$. Therefore, $\tau$ was indeed accounted for above.
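As a small illustration of Lemma 2(2) and the mk Lemma, the sketch below builds the repeated-squaring grammar for α^k and then counts distinct substrings of the generated string. The dictionary representation of a grammar, the helper names, and the naming scheme `S^j` for the nonterminal A_j are all assumptions made for this example, not notation from the paper.

```python
def expand(rules, symbol):
    """Return the string generated by `symbol`; symbols without a rule are terminals."""
    if symbol not in rules:
        return symbol
    return "".join(expand(rules, s) for s in rules[symbol])


def grammar_size(rules):
    """Grammar size: total number of symbols on all right-hand sides."""
    return sum(len(body) for body in rules.values())


def power_grammar(rules, start, k):
    """Extend a grammar for alpha into a grammar for alpha^k (Lemma 2, part 2)."""
    new_rules = dict(rules)

    def name(j):
        # A_1 is the original start symbol; names like "S^6" are assumed unused.
        return start if j == 1 else f"{start}^{j}"

    j = k
    while j > 1:
        half = j // 2
        body = [name(half), name(half)]   # A_j -> A_half A_half
        if j % 2 == 1:
            body.append(start)            # odd case: A_j -> A_half A_half A_1
        new_rules[name(j)] = body
        j = half
    return new_rules, name(k)


def distinct_substrings(text, k):
    """Set of distinct length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


base = {"S": ["T", "T", "c"], "T": ["a", "b"]}   # generates "ababc", size 5
rules, start = power_grammar(base, "S", 13)
text = expand(rules, start)
m = grammar_size(rules)

# mk Lemma: at most m*k distinct substrings of each length k.
for k in range(1, 10):
    assert len(distinct_substrings(text, k)) <= m * k
```

For k = 13 the loop adds three rules (for A_13, A_6 and A_3) with eight symbols in total, so the grammar grows from size 5 to size 13 while the generated string grows to 65 characters.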
V. HARDNESS

We establish the hardness of the smallest grammar problem in two ways. First, we show that approximating the size of the smallest grammar to within a small constant factor is NP-hard. Second, we show that approximating the size to within $o(\log n / \log\log n)$ would require progress on an apparently difficult computational algebra problem. These two hardness arguments are curiously complementary, as we discuss in Section V-C.

A. NP-Hardness

Theorem 1: There is no polynomial-time algorithm for the smallest grammar problem with approximation ratio less than $8569/8568$ unless P = NP.

Proof: We use a reduction from a restricted form of vertex cover based closely on arguments by Storer and Szymanski [23], [24]. Let $H = (V, E)$ be a graph with maximum degree three and $|E| \ge |V|$. We can map the graph $H$ to a string $\sigma$ over an alphabet that includes a distinct terminal (denoted $v_i$) corresponding to each vertex $v_i \in V$ as follows:

$$\sigma = \prod_{v_i \in V} (\# v_i \mid v_i \# \mid)^2 \; \prod_{v_i \in V} (\# v_i \# \mid) \; \prod_{(v_i, v_j) \in E} (\# v_i \# v_j \# \mid)$$

...

Thus, in the example, we would underline as follows:

$B \to xxx$, $A \to BBB$, $S \to A \mid AABxx$

Each underlined symbol generates one term in the addition chain as follows. Starting from the underlined symbol, work leftward until the start of the definition or a unique symbol is encountered. This span of symbols defines a substring which ends with the underlined symbol. The length of the expansion of this substring is a term in the addition chain. In the example, we would obtain the substrings

$x, xx, xxx, BB, BBB, AA, AAB, AABx, AABxx$

and the addition chain $1, 2, 3, 6, 9, 18, 21, 22, 23$.

Intuitively, the terms in the addition chain produced above are the lengths of the expansions of the secondary nonterminals in the grammar. But these alone do not quite suffice. To see why, note that the rule $T \to ABC$ implies that $[T] = [A] + [B] + [C]$. If we ensure that the addition chain contains $[A]$, $[B]$, and $[C]$, then we still cannot immediately add $[T]$ because $[T]$ is the sum of three preceding terms, instead of two. Thus, we must also include, say, the term $[AB]$, which is itself the sum of $[A]$ and $[B]$. The creation of such extra terms is what the elaborate underlining procedure accomplishes. With this in mind, it is easy to verify that the construction detailed above gives an addition chain of length at most $m$ that contains $T$.

All that remains is to establish the second inequality, $m \le 4l$. We do this by translating an addition chain of length $l$ into a grammar for the string $\sigma$ of size at most $4l$. As before, we carry along an example. Let $T = \{9, 23\}$. The shortest addition chain containing $T$ has length $l = 7$: $1, 2, 4, 5, 9, 18, 23$. We associate the symbol $x$ with the first term of the sequence and a distinct nonterminal with each subsequent term. Each nonterminal is defined using the symbols associated with two preceding terms, just as each term in the addition sequence is the sum of two predecessors. The start rule consists of the nonterminals corresponding to the terms in $T$, separated by uniques. In the example, this gives the following grammar:

$T_2 \to xx$, $T_4 \to T_2 T_2$, $T_5 \to T_4 x$, $T_9 \to T_5 T_4$, $T_{18} \to T_9 T_9$, $T_{23} \to T_{18} T_5$, $S \to T_9 \mid T_{23}$

The start rule has length $2|T| - 1 \le 2l$, and the $l - 1$ secondary rules each have exactly two symbols on the right. Thus, the total size of the grammar is at most $4l$.

Addition chains have been studied extensively for decades (see surveys in Knuth [29] and Thurber [30]). In order to find the shortest addition chain containing a single, specified integer $n$, a subtle algorithm known as the M-ary method gives a $1 + O(1/\log\log n)$ approximation. (This is apparently folklore.) One writes $n$ in a base $M$, which is a power of 2:

$$n = d_0 M^k + d_1 M^{k-1} + d_2 M^{k-2} + \cdots + d_{k-1} M + d_k$$

The addition chain begins $1, 2, 3, \ldots, M-1$. Then one puts $d_0$, doubles it $\log M$ times, adds $d_1$ to the result, doubles that $\log M$ times, adds $d_2$ to the result, etc. The total length of the addition chain produced is at most:

$$(M - 1) + \log n + \frac{\log n}{\log M} = \log n + O(\log n / \log\log n)$$

In the expression on the left, the first term counts the first $M - 1$ terms of the addition chain, the second counts the doublings, and the third counts the increments of $d_i$. The equality follows by choosing $M$ to be the smallest power of two which is at least $\log n / \log\log n$.

The M-ary method is very nearly the best possible. Erdős [31] showed that, in a certain sense, the shortest addition chain containing $n$ has length at least $\log n + \log n / \log\log n$ for almost all $n$. Even if exponentially more time is allowed, no exact algorithm (and apparently even no better approximation algorithm) is known.

The general addition chain problem, which consists of finding the shortest addition chain containing a specified set of integers $k_1, \ldots, k_p$, is known to be NP-hard if the integers $k_i$ are given in binary [32]. There is an easy $O(\log(\sum k_i))$ approximation algorithm. First, generate all powers of two less than or equal to the maximum of the input integers $k_i$. Then form each $k_i$ independently by summing a subset of these powers corresponding to 1's in the binary representation of $k_i$. In 1976, Yao [33] pointed out that the second step could be tweaked in the spirit of the M-ary method. Specifically, he groups the bits of $k_i$ into blocks of size $\log\log k_i - 2\log\log\log k_i$ and tackles all blocks with the same bit pattern at the same time. This improves the approximation ratio slightly to $O(\log n / \log\log n)$.

Yao's method retains a frustrating aspect of the naive algorithm: there is no attempt to exploit special relationships between the integers $k_i$; each one is treated independently. For example, suppose $k_i = 3^i$ for $i = 1$ to $p$. Then there exists a short addition chain containing all of the $k_i$: $1, 2, 3, 6, 9, 18, 27, \ldots$. But Yao's algorithm effectively attempts to represent powers of three in base two. However, even if the $k_i$ are written in unary, apparently no polynomial time algorithm with a better approximation ratio than Yao's is known. Since Theorem 2 links addition chains and small grammars, finding an approximation algorithm for the smallest grammar problem with ratio $o(\log n / \log\log n)$ would require improving upon Yao's method.

C. An Observation on Hardness

We have demonstrated that the smallest grammar problem is hard to approximate through reductions from two different problems. Interestingly, there is also a marked difference in the types of strings involved. Specifically, Theorem 1 maps graphs to strings with large alphabets and few repeated substrings. In such strings, the use of hierarchy does not seem to be much of an advantage. Thus, we show the NP-completeness of the smallest grammar problem by analyzing a class of input strings that specifically avoids the most interesting aspect of the problem: hierarchy.

...

... of the start rule contains all the nonterminals associated with pairs. For example, the grammar associated with the example sequence is as follows:

$S \to X_1 X_2 X_3 X_4 X_5 X_6$
$X_1 \to a$, $X_2 \to X_1 b$, $X_3 \to b$, $X_4 \to X_2 a$, $X_5 \to X_3 a$, $X_6 \to X_2 b$

Given this easy mapping, hereafter we simply regard the output of LZ78 as a grammar rather than as a sequence of pairs.

Note that the grammars produced by LZ78 are of a restricted form in which the right side of each secondary rule contains at most two symbols and at most one nonterminal. Subject to these restrictions, the smallest grammar for even the string $x^n$ has size $\Omega(\sqrt{n})$. (On the other hand, grammars with such a regular form can be more efficiently encoded into bits.)

The next two theorems provide nearly-matching upper and lower bounds on the approximation ratio of LZ78 when it is interpreted as an approximation algorithm for the smallest grammar problem.

Theorem 3: The approximation ratio of LZ78 is $\Omega(n^{2/3} / \log n)$.

Proof: The lower bound follows by analyzing the behavior of LZ78 on input strings of the form

$$\sigma_k = a^{k(k+1)/2} (b a^k)^{(k+1)^2}$$

where $k > 0$. The length of this string is $n = \Theta(k^3)$. Repeated application of Lemma 2 implies that there exists a grammar for $\sigma_k$ of size $O(\log k) = O(\log n)$.

The string $\sigma_k$ is processed by LZ78 in two stages. During the first, the $k(k+1)/2$ leading a's are consumed and nonterminals with expansions $a, aa, aaa, \ldots, a^k$ are created. During the second stage, the remainder of the string is consumed and a nonterminal with expansion $a^i b a^j$ is created for all $i$ and $j$ between $0$ and $k$. For example, $\sigma_4$ is represented by nonterminals with expansions as indicated below:

a aa aaa aaaa b aaaab aaaaba aaab aaaabaa aab aaaabaaa ab aaaabaaaa ba aaaba aaabaa aaba aaabaaa aba aaabaaaa baa aabaa aabaaa abaa aabaaaa baaa abaaa abaaaa baaaa

The pattern illustrated above can be shown to occur in general with an induction argument. As a result, the grammar produced by LZ78 has size $\Omega(k^2) = \Omega(n^{2/3})$. Dividing by our upper bound on the size of the smallest grammar proves the claim.

Theorem 4: The approximation ratio of LZ78 is $O((n / \log n)^{2/3})$.

Our techniques in the following proof of Theorem 4 form the basis for two other upper bounds presented in this section. The core idea is that nonterminals must expand to distinct substrings of the input. By the mk Lemma, however, there are very few short distinct substrings of the input. Thus most nonterminals expand to long substrings. However, the total expansion length of all nonterminals must be equal to the size of the input. As a result, there cannot be too many nonterminals in the grammar.

Proof: Suppose that the input to LZ78 is a string $\sigma$ of length $n$, and that the smallest grammar generating $\sigma$ has size $m$. Let $S \to X_1 \ldots X_p$ be the start rule generated by LZ78. First observe that the size of the LZ78 grammar is at most $3p$, since each nonterminal $X_i$ is used once in the start rule and is defined using at most two symbols. Therefore, it suffices to upper-bound $p$, the number of nonterminals in the start rule.

To that end, list the nonterminals of the grammar in order of increasing expansion length. Group the first $m$ of these nonterminals, the next $2m$, the next $3m$, and so forth. Let $g$ be the number of complete groups of nonterminals that can be formed in this way. By this definition of $g$, we have

$$m + 2m + \cdots + gm + (g+1)m > p$$

and so $p = O(g^2 m)$.

On the other hand, the definition of LZ78 guarantees that each nonterminal $X_i$ expands to a distinct substring of $\sigma$. Moreover, Lemma 3 states that $\sigma$ contains at most $mk$ distinct substrings of length $k$. Thus, there can be at most $m$ nonterminals which have expansion length 1, and at most $2m$ nonterminals which have expansion length 2, and so on. It follows that each nonterminal in the $j$-th group must expand to a string of length at least $j$. Therefore, we have

$$n = [X_1] + \cdots + [X_p] \ge 1^2 m + 2^2 m + 3^2 m + \cdots + g^2 m$$

and so $g = O((n/m)^{1/3})$. The inequality follows since we are ignoring the incomplete $(g+1)$-th group. Substituting this bound on $g$ into the upper bound on $p$ obtained previously gives:

$$p = O\left(\left(\frac{n}{m}\right)^{2/3} m\right) = O\left(\left(\frac{n}{\log n}\right)^{2/3} m\right)$$

The second equality follows from Lemma 1, which says that the smallest grammar for a string of length $n$ has size $\Omega(\log n)$.

2) LZW: Some practical improvements on LZ78 are embodied in a later algorithm, LZW [2]. The grammars implicitly generated by the two procedures are not substantively different, but LZW is more widely used in practice. For example, it is used to encode images in the popular gif format. Interestingly, the bad strings introduced in Theorem 3 have a natural graphical interpretation. Below, $\sigma_4$ is written in a $15 \times 9$ grid pattern using one cell symbol for a and another for b.

[Figure: $\sigma_4$ laid out row by row in a $15 \times 9$ grid; the b cells line up in vertical stripes.]

Thus, an image with colors in this simple vertical stripe pattern yields a worst-case string in terms of approximation ratio. This ...
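... nonterminal occurs only once after this substitution, replace it by its definition, and delete the corresponding rule.

Example 2. As an example, consider the input string $\sigma = xxxxx \diamondsuit xxxxx \heartsuit$. After three steps, the grammar is: $S \to xxx$. When the next $x$ is appended to the start rule, there are two copies of the substring $xx$. Therefore a new rule, $R_1 \to xx$, is added to the grammar and both occurrences of $xx$ are replaced by $R_1$ to produce the following intermediate grammar:

$S \to R_1 R_1$, $R_1 \to xx$

During the next two steps, the start rule expands to $S \to R_1 R_1 x \diamondsuit$. At this point, the expansion of $R_1$ is a prefix of the unprocessed part of $\sigma$, so the next two steps consume $xx$ and append $R_1$ to $S$ twice. Now the pair $R_1 R_1$ appears twice in $S$, and so a new rule $R_2 \to R_1 R_1$ is added and applied:

$S \to R_2 x \diamondsuit R_2$, $R_1 \to xx$, $R_2 \to R_1 R_1$

In the next step, $x$ is consumed and now $R_2 x$ appears twice. A new rule $R_3 \to R_2 x$ is created and substituted into $S$. Notice that the rule $R_2$ only appears once after this substitution. Therefore, the occurrence of $R_2$ in the definition of $R_3$ is replaced with $R_1 R_1$, and $R_2$ is removed from the grammar. After the next step, we have the following final output:

$S \to R_3 \diamondsuit R_3 \heartsuit$, $R_1 \to xx$, $R_3 \to R_1 R_1 x$

2) Bounds: The next two theorems bound the approximation ratio of SEQUENTIAL. Both the upper and lower bounds are considerably more complex than the analysis for LZ78 and BISECTION.

Theorem 7: The approximation ratio of SEQUENTIAL is $\Omega(n^{1/3})$.

Proof: We analyze the behavior of SEQUENTIAL on strings $\sigma_k$ for $k > 0$, defined below, over an alphabet consisting of four symbols: $\{x, y, \diamondsuit, \heartsuit\}$.

$$\sigma_k = \alpha \diamondsuit \beta^{k/2}$$
$$\alpha = x^{k+1} \diamondsuit x^{k+1} \heartsuit \delta_0 \diamondsuit \delta_0 \heartsuit \delta_1 \diamondsuit \delta_1 \heartsuit \cdots \delta_k \diamondsuit \delta_k \heartsuit$$
$$\beta = \delta_k \delta_k \; \delta_k \delta_{k-1} \; \delta_k \delta_{k-2} \; \delta_k \delta_{k-3} \cdots \delta_k \delta_{k/2} \; x^k$$
$$\delta_i = x^i y x^{k-i}$$

As SEQUENTIAL processes the prefix $\alpha$, it creates nonterminals for the strings $x^{2^i}$ for each $i$ from $1$ to $\log(k+1)$, a nonterminal with expansion $x^{k+1}$, a nonterminal with expansion $\delta_i$ for each $i$ from $0$ to $k$, and some nonterminals with shorter expansions that are not relevant here. With regard to the third assertion, note that SEQUENTIAL parses the first occurrence of the string $\delta_i$ in some particular way. It then consumes the $\diamondsuit$, and proceeds to consume the second occurrence of $\delta_i$ in exactly the same way as the first one. This process generates a nonterminal with expansion $\delta_i$. Notice that the $\diamondsuit$ and $\heartsuit$ symbols are never added to a secondary rule.

The remainder of the input, the string $\beta^{k/2}$, is consumed in segments of length $k+1$. This is because, at each step, the leading $k+1$ symbols of the unprocessed portion of the input string are of the form $x^{k+1}$ or $\delta_i$ for some $i$. Consequently, the corresponding nonterminal is appended to the start rule at each step.

At a high level, this is the inefficiency that we exploit. The length of $\beta$ is not a multiple of $k+1$. As a result, each copy of $\beta$ is represented by a different sequence of nonterminals. Now we describe the parsing of $\beta^{k/2}$ in more detail. The first copy of $\beta$ is parsed almost as it is written above. The only difference is that the final $x^k$ at the end of this first copy is combined with the leading zero in the second copy of $\beta$ and read as a single nonterminal. Thus, nonterminals with the following expansions are appended to the start rule as the first copy of $\beta$ is processed:

$$\delta_k \delta_k \; \delta_k \delta_{k-1} \; \delta_k \delta_{k-2} \; \delta_k \delta_{k-3} \cdots \delta_k \delta_{k/2} \; x^{k+1}$$

SEQUENTIAL parses the second copy of $\beta$ differently, since the leading zero of this second copy has already been processed. Furthermore, the final $x^{k-1}$ in the second copy of $\beta$ is combined with the two leading zeroes in the third copy and read as a single nonterminal:

$$\delta_{k-1} \delta_{k-1} \; \delta_{k-1} \delta_{k-2} \; \delta_{k-1} \delta_{k-3} \cdots \delta_{k-1} \delta_{k/2-1} \; x^{k+1}$$

With two leading zeros already processed, the third copy of $\beta$ is parsed yet another way. In general, an induction argument shows that the $j$-th copy (indexed from 0) is read as:

$$\delta_{k-j} \delta_{k-j} \; \delta_{k-j} \delta_{k-j-1} \; \delta_{k-j} \delta_{k-j-2} \cdots \delta_{k-j} \delta_{k/2-j} \; x^{k+1}$$

No consecutive pair of nonterminals ever appears twice in this entire process, and so no new rules are created. Since the input string contains $k/2$ copies of $\beta$ and each is represented by about $k$ nonterminals, the grammar generated by SEQUENTIAL has size $\Omega(k^2)$.

On the other hand, there exists a grammar for $\sigma_k$ of size $O(k)$. First, create a nonterminal $X_i$ with expansion $x^i$ for each $i$ up to $k+1$. Each such nonterminal can be defined in terms of its predecessors using only two symbols: $X_i \to x X_{i-1}$. Next, define a nonterminal $D_i$ with expansion $\delta_i$ for each $i$ using three symbols: $D_i \to X_i y X_{k-i}$. Now it is straightforward to define nonterminals $A$ and $B$ which expand to $\alpha$ and $\beta$ respectively. Finally, using Lemma 2, $O(\log k)$ additional symbols suffice to define a start symbol with expansion $\alpha \diamondsuit \beta^{k/2}$. In total this grammar has size $O(k)$. Therefore the approximation ratio of SEQUENTIAL is $\Omega(k) = \Omega(n^{1/3})$.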
3) Irreducible Grammars: Our upper bound on the approximation ratio of SEQUENTIAL relies on a property of the output. In particular, Kieffer and Yang [4] show that SEQUENTIAL produces an irreducible grammar; that is, one which has the following three properties:

...

(M3) No strictly longer string appears at least as many times on the right side without overlap.

After a maximal string $\gamma$ is selected, a new rule $T \to \gamma$ is added to the grammar. This rule is then applied by working left-to-right through the right side of every other rule, replacing each occurrence of $\gamma$ by the symbol $T$. The algorithm terminates when no more maximal strings exist.

Example 3. An example illustrates the range of moves available to a global algorithm. (Throughout this section, we will use the input string $\sigma = abcabcabcabcaba$ for our examples.) We initially create the grammar

$S \to abc\ abc\ abc\ abc\ aba$

where spaces are added for clarity. The maximal strings are $ab$, $abc$, and $abcabc$. Suppose that we select the maximal string $ab$, and introduce the rule $T \to ab$. The grammar becomes:

$S \to Tc\ Tc\ Tc\ Tc\ Ta$, $T \to ab$

Now the maximal strings are $Tc$ and $TcTc$. Suppose that we select $TcTc$. Then we add the rule $U \to TcTc$, and the definition of $S$ becomes $S \to UUTa$. Now the only maximal string is $Tc$. Adding the rule $V \to Tc$ yields the final grammar:

$S \to UUTa$, $T \to ab$, $U \to VV$, $V \to Tc$

2) Upper Bound: The approximation ratio of every global algorithm is $O((n / \log n)^{2/3})$. This follows from the fact that grammars produced by global algorithms are particularly well-conditioned; not only are they irreducible, but they possess an additional property described in the lemma below.

Lemma 6: Every grammar produced by a global algorithm has the following property. Let $\alpha$ and $\beta$ be strings of length at least two on the right side. If $\langle \alpha \rangle = \langle \beta \rangle$, then $\alpha = \beta$.

Proof: We show that this is actually an invariant property of the grammar maintained throughout the execution of a global algorithm. The invariant holds trivially for the initial grammar $S \to \sigma$. So suppose that the invariant holds for grammar $G$, and then grammar $G'$ is generated from $G$ by introducing a new rule $T \to \gamma$. Let $\alpha'$ and $\beta'$ be strings of length at least two on the right side of $G'$ such that $\langle \alpha' \rangle = \langle \beta' \rangle$. We must show that $\alpha' = \beta'$. There are two cases to consider.

First, suppose that neither $\alpha'$ nor $\beta'$ appears in $\gamma$. Then $\alpha'$ and $\beta'$ must be obtained from non-overlapping strings $\alpha$ and $\beta$ in $G$ such that $\langle \alpha \rangle = \langle \alpha' \rangle$ and $\langle \beta \rangle = \langle \beta' \rangle$. Since the invariant holds for $G$, we have $\alpha = \beta$. But then $\alpha$ and $\beta$ are transformed the same way when the rule $T \to \gamma$ is added; that is, corresponding instances of the string $\gamma$ within $\alpha$ and $\beta$ are replaced by the nonterminal $T$. Therefore, $\alpha' = \beta'$. Otherwise, suppose that at least one of $\alpha'$ or $\beta'$ appears in $\gamma$. Then neither $\alpha'$ nor $\beta'$ can contain $T$. Therefore, both $\alpha'$ and $\beta'$ appear in grammar $G$, where the invariant holds, and so $\alpha' = \beta'$ again.

Lemma 7: Every grammar produced by a global algorithm is irreducible.

Proof: We must show that a grammar produced by a global algorithm satisfies the three properties of an irreducible grammar.

(I1). First, note that all non-overlapping pairs of adjacent symbols on the right side are distinct since a global algorithm does not terminate until this condition holds.

(I2). We must show that every secondary nonterminal appears at least twice on the right side of the grammar. This property is also an invariant maintained during the execution of a global algorithm. The property holds vacuously for the initial grammar $S \to \sigma$. Suppose that the property holds for a grammar $G$ which has been generated by a global algorithm, and then we obtain a new grammar $G'$ by introducing a new rule $T \to \gamma$ where $\gamma$ is a maximal string. By the definition of maximal string, the nonterminal $T$ must appear at least twice on the right side of $G'$. If $\gamma$ contains only terminals or nonterminals which appear twice on the right side of $G'$, then the invariant clearly holds for $G'$. Suppose, by contradiction, that $\gamma$ contains a nonterminal $V$ which appears only once on the right side of $G'$. Let $V \to \beta$ be the definition of $V$ in $G'$. This implies that $V$ only appears in the definition of $T$, and therefore the string $\gamma$ occurs exactly as many times as $\beta$ in $G$. Since $\gamma$ is maximal, it must have length at least two, and therefore $[\gamma] > [\beta]$. In particular, this implies that during the step in which the rule for $V$ was introduced, the intermediate grammar at that point contained a strictly longer string which appeared exactly the same number of times, which contradicts the assumption that $G$ has been produced by a global algorithm.

(I3). Finally, we must show that distinct symbols have distinct expansions, unless the start symbol expands to a terminal. Once again, we use an invariant argument. The following invariants hold for every secondary rule $U \to \gamma$ in the grammar maintained during the execution of a global algorithm:
1) The string $\gamma$ appears nowhere else in the grammar.
2) The length of $\gamma$ is at least two.

Both invariants hold trivially for the initial grammar $S \to \sigma$. Suppose that the invariants hold for every rule in a grammar $G$, and then we obtain a new grammar $G'$ by introducing the rule $T \to \delta$. First, we check that the invariants hold for the new rule. The string $\delta$ cannot appear elsewhere in the grammar; such an instance would have been replaced by the nonterminal $T$. Furthermore, the length of $\delta$ is at least two, since $\delta$ is a maximal string. Next, we check that the invariants hold for each rule $U \to \gamma'$ in $G'$ that corresponds to a rule $U \to \gamma$ in $G$. If $\gamma'$ does not contain $T$, then both invariants carry over from $G$. Suppose that $\gamma'$ does contain $T$. The first invariant still carries over from $G$. The second invariant holds unless $\delta = \gamma$. However, since $\delta$ is a maximal string, that would imply that $\gamma$ appeared at least twice in $G$, violating the first invariant.

The third property of an irreducible grammar follows from these two invariants. No secondary nonterminal can expand to a terminal, because the second invariant implies that each secondary nonterminal has an expansion of length at least two. No ...

...

We analyze how LONGEST MATCH processes this string. Observe that in the example, the longest match is $[6,9] = x^{64} y^{128} x^{256} y^{512}$. In general, the longest match in $\sigma_k$ is always the second largest segment of the form $[a, k-1]$. After this rule is added and the grammar rewritten, the next longest match is the third longest segment of the form $[a', k-1]$ ($[8,9]$ in our example) which is wholly contained in the first longest match. In the next round, the longest match is the fourth longest segment, and so forth. After $\log k$ rounds of this type, the next two longest matches are $x^{2^{k-1}}$ and $y^{2^{k-1}}$. At this point, the grammar is as follows (abbreviations introduced above are used for clarity):

$S \to [0,7] \mid [1,8] \mid [2,5]\,T_1 \mid [3,6] \mid [4,7] \mid [5,8] \mid T_1 \mid [7,8] \mid T_2 \mid T_3 \mid T_4 T_4 \mid T_3 T_3$
$T_1 \to x^{64} y^{128} T_2$, $T_2 \to x^{256} T_3$, $T_3 \to y^{512}$, $T_4 \to x^{512}$

and after consolidating the grammar, we obtain

$S_2 \to [0,7] \mid [1,8] \mid [2,5] \mid [3,6] \mid [4,7] \mid [5,8] \mid [6,7] \mid [7,8] \mid [8,8] \mid x^{512} \mid y^{512}$

The critical observation is that the consolidated grammar is the same as the initial grammar for input string $\sigma_9$. After another succession of rounds and a consolidation, the definition of the start rule becomes $\sigma_8$, and then $\sigma_7$, and so forth. Reducing the right side of the start rule from $\sigma_i$ to $\sigma_{i-1}$ entails the creation of at least $\lfloor \log i \rfloor$ nonterminals. Since nonterminals created by LONGEST MATCH are never eliminated, we can lower bound the total size of the grammar produced on this input by:

$$\sum_{i=1}^{k} \lfloor \log i \rfloor = \Omega(k \log k)$$

On the other hand, there exists a grammar of size $O(k)$ that generates $\sigma_k$. What follows is a sketch of the construction. First, we create nonterminals $X_{2^i}$ and $Y_{2^i}$ with expansions $x^{2^i}$ and $y^{2^i}$ respectively for all $i$ up to $k$. We can define each such nonterminal using two symbols, and so only $O(k)$ symbols are required in total.

Then we define a nonterminal corresponding to each segment of $\sigma_k$. We define these nonterminals in batches, where a batch consists of all nonterminals corresponding to segments of $\sigma_k$ that contain the same number of terms. Rather than describe the general procedure, we illustrate it with an example. Suppose that we want to define nonterminals corresponding to the following batch of segments in $\sigma_{10}$:

$[3,6] \mid [4,7] \mid [5,8] \mid [6,9] \mid$

This can be done by defining the following auxiliary nonterminals which expand to prefixes and suffixes of the string $[3,9] = y^8 x^{16} y^{32} x^{64} y^{128} x^{256} y^{512}$:

$P_1 \to X_{64}$, $P_2 \to Y_{32} P_1$, $P_3 \to X_{16} P_2$, $P_4 \to Y_8 P_3$
$S_1 \to Y_{128}$, $S_2 \to S_1 X_{256}$, $S_3 \to S_2 Y_{512}$

Now we can define nonterminals corresponding to the desired segments $[i,j]$ in terms of these "prefix" and "suffix" nonterminals as follows:

$G_{[3,6]} \to P_4$, $G_{[4,7]} \to P_3 S_1$, $G_{[5,8]} \to P_2 S_2$, $G_{[6,9]} \to P_1 S_3$

In this way, each nonterminal corresponding to a segment $[i,j]$ in $\sigma_k$ is defined using a constant number of symbols. Therefore, defining all $k$ such nonterminals requires $O(k)$ symbols. We complete the grammar for $\sigma_k$ by defining a start rule containing another $O(k)$ symbols. Thus, the total size of the grammar is $O(k)$. Therefore, the approximation ratio for LONGEST MATCH is $\Omega(\log k)$. Since the length of $\sigma_k$ is $n = \Theta(k 2^k)$, this ratio is $\Omega(\log\log n)$ as claimed.
F. GREEDY

Apostolico and Lonardi [11], [25], [10] proposed a variety of greedy algorithms for grammar-based data compression. The central idea, which we analyze here, is to select the maximal string that reduces the size of the grammar as much as possible. For example, on our usual starting grammar, the first rule added is $T \to abc$, since this decreases the size of the grammar by 5 symbols, which is the best possible.

Theorem 11: The approximation ratio of GREEDY is at least $\frac{5 \log 3}{3 \log 5} = 1.137\ldots$.

Proof: We consider the behavior of GREEDY on an input string of the form $\sigma_k = x^n$, where $n = 5^{2^k}$. GREEDY begins with the grammar $S \to \sigma_k$. The first rule added must be of the form $T \to x^t$. The size of the grammar after this rule is added is then $t + \lfloor n/t \rfloor + (n \bmod t)$, where the first term reflects the cost of defining $T$, the second accounts for the instances of $T$ itself, and the third represents extraneous $x$'s. This sum is minimized when $t = n^{1/2}$. The resulting grammar is:

$$\begin{array}{l}
S \to T^{5^{2^{k-1}}} \\
T \to x^{5^{2^{k-1}}}
\end{array}$$

Since the definitions of $S$ and $T$ contain no common symbols, we can analyze the behavior of GREEDY on each independently. Notice that both subproblems are of the same form as the original, but have size $k-1$ instead of $k$. Continuing in this way, we reach a grammar with $2^k$ nonterminals, each defined by five copies of another symbol. Each such rule is transformed as shown below in a final step that does not alter the size of the grammar.

$$X \to YYYYY \quad \Longrightarrow \quad \begin{array}{l} X \to X'X'Y \\ X' \to YY \end{array}$$

Therefore, GREEDY generates a grammar for $\sigma_k$ of size $5 \cdot 2^k$.

[...] can be defined using only two symbols: the nonterminal for the next shorter suffix or prefix together with one symbol $x_i$. Repeat this construction recursively on the two halves of the original string, $x_1 \ldots x_k$ and $x_{k+1} \ldots x_p$. The recursion terminates when a string of length one is obtained. This recursion has $\log p$ levels, and $p$ nonterminals are defined at each level. Since each definition contains at most two symbols, the total cost of the construction is at most $2p \log p$.

Now we show that every substring $\alpha = x_i \ldots x_j$ of $\sigma$ is equal to $\langle AB \rangle$, where $A$ and $B$ are nonterminals defined in the construction. There are two cases to consider. If $\alpha$ appears entirely within the left half of $\sigma$ or entirely within the right half, then we can obtain $A$ and $B$ from the recursive construction on $x_1 \ldots x_k$ or $x_{k+1} \ldots x_p$. Otherwise, let $k = \lceil p/2 \rceil$ as before, and let $A$ be the nonterminal for $x_i \ldots x_k$, and let $B$ be the nonterminal for $x_{k+1} \ldots x_j$.

For example, the substring construction for the string $\sigma = abcdefgh$ is given below:

$$\begin{array}{llll}
C_1 \to d & D_1 \to e & E_1 \to b & F_1 \to c \\
C_2 \to c\,C_1 & D_2 \to D_1 f & E_2 \to a\,E_1 & F_2 \to F_1 d \\
C_3 \to b\,C_2 & D_3 \to D_2\, g & G_1 \to f & H_1 \to g \\
C_4 \to a\,C_3 & D_4 \to D_3 h & G_2 \to e\,G_1 & H_2 \to H_1 h
\end{array}$$

With these rules defined, each substring of $abcdefgh$ is expressible with at most two symbols. For example, $defg = \langle C_1 D_3 \rangle$.

In the next lemma, we present a variation of Lemma 3 needed for the new algorithm.

Lemma 8: Let $\sigma$ be a string generated by a grammar of size $m$. Then there exists a string $\beta_k$ of length at most $2mk$ that contains every length-$k$ substring of $\sigma$.

Proof: We can construct $\beta_k$ by concatenating strings obtained from the rules of the grammar of size $m$. For each rule, $T \to \alpha$, do the following:

1) For each terminal in $\alpha$, take the length-$k$ substring of $\langle T \rangle$ beginning at that terminal.

2) For each nonterminal in $\alpha$, take the length-$(2k-1)$ substring of $\langle T \rangle$ consisting of the last character in the expansion of that nonterminal, the preceding $k-1$ characters, and the following $k-1$ characters.

In both cases, we permit the substrings to be shorter if they are truncated by the start or end of $\langle T \rangle$.

Now we establish the correctness of this construction. First, note that the string $\beta_k$ is a concatenation of at most $m$ strings of length at most $2k$, giving a total length of at most $2mk$ as claimed. Next, let $\gamma$ be a length-$k$ substring of $\sigma$. Consider the rule $T \to \alpha$ such that $\langle T \rangle$ contains $\gamma$ and $\langle T \rangle$ is as short as possible. Either $\gamma$ begins at a terminal of $\alpha$, in which case it is a string of type 1, or else it begins inside the expansion of a nonterminal in $\alpha$ and ends beyond, in which case it is contained in a string of type 2. (Note that $\gamma$ cannot be wholly contained in the expansion of a nonterminal in $\alpha$; otherwise, we would have selected that nonterminal for consideration instead of $T$.) In either case, $\gamma$ is a substring of $\beta_k$ as desired.
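The construction in the proof of Lemma 8 is short enough to run directly. The following is a minimal sketch under stated assumptions: the function name `beta` and the dictionary representation of a grammar are hypothetical, terminals are assumed to be single characters, and the sample grammar is made up for illustration.

```python
def beta(rules, k):
    """Concatenate the type-1 and type-2 strings of Lemma 8 over all rules.

    rules maps each nonterminal to its right-hand side (a list of symbols);
    any symbol that is not a key of rules is treated as a one-character terminal.
    """
    memo = {}

    def expand(sym):
        if sym not in rules:                      # terminal
            return sym
        if sym not in memo:
            memo[sym] = "".join(expand(s) for s in rules[sym])
        return memo[sym]

    pieces = []
    for T, rhs in rules.items():
        full = expand(T)
        pos = 0
        for sym in rhs:
            end = pos + len(expand(sym))
            if sym in rules:
                # Type 2: last character of the nonterminal's expansion,
                # plus k-1 characters on each side (truncated at the ends).
                pieces.append(full[max(0, end - k):end + k - 1])
            else:
                # Type 1: length-k substring starting at the terminal.
                pieces.append(full[pos:pos + k])
            pos = end
    return "".join(pieces)

# Hypothetical example grammar generating "ababcbababcab".
example = {"S": ["A", "b", "A", "B"], "A": ["B", "B", "c"], "B": ["a", "b"]}
sigma = "ababcbababcab"
m = sum(len(rhs) for rhs in example.values())     # grammar size
for k in range(1, 7):
    b = beta(example, k)
    assert len(b) <= 2 * m * k
    assert all(sigma[i:i + k] in b for i in range(len(sigma) - k + 1))
```

The sketch produces one piece per right-hand-side symbol, so the number of pieces equals the grammar size, matching the length bound in the lemma.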
Our approximation algorithm for the smallest grammar problem makes use of Blum, Jiang, Li, Tromp, and Yannakakis' 4-approximation for the shortest superstring problem [35]. In this procedure, we are given a collection of strings and want to find the shortest superstring; that is, the shortest string that contains each string in the collection as a substring. The procedure works greedily. At each step, find the two strings in the collection with largest overlap. Merge these two into a single string. (For example, $abaa$ and $aaac$ have overlap $aa$ and thus can be merged to form $abaaac$.) Repeat this process until only one string remains. This is the desired superstring, and Blum et al. proved that it is at most four times longer than the shortest superstring.

B. The Algorithm

In this algorithm, the focus is on certain sequences of substrings of $\sigma$. In particular, we construct $\log n$ sequences $C_n, C_{n/2}, C_{n/4}, \ldots, C_2$, where the sequence $C_k$ consists of some substrings of $\sigma$ that have length at most $k$. These sequences are defined as follows. The sequence $C_n$ is initialized to consist of only the string $\sigma$ itself. In general, the sequence $C_k$ generates the sequence $C_{k/2}$ via the following operations, which are illustrated in the figure that follows.

1) Use Blum's greedy 4-approximation algorithm to form a superstring $\rho_k$ containing all the distinct strings in $C_k$.

2) Cut the superstring $\rho_k$ into small pieces. First, determine where each string in $C_k$ ended up inside $\rho_k$, and then cut $\rho_k$ at the left endpoints of those strings.

3) Cut each piece of $\rho_k$ that has length greater than $k/2$ at the midpoint. During the analysis, we shall refer to the cuts made during this step as extra cuts.

The sequence $C_{k/2}$ is defined to be the sequence of pieces of $\rho_k$ generated by this three-step process. By the nature of Blum's algorithm, no piece of $\rho_k$ can have length greater than $k$ after step 2, and so no piece can have length greater than $k/2$ after step 3. Thus, $C_{k/2}$ is a sequence of substrings of $\sigma$ that have length at most $k/2$ as desired.
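The greedy merge and the three-step halving round can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the names `overlap`, `greedy_superstring`, and `halve` are hypothetical, ties between equal overlaps are broken arbitrarily, the input is assumed non-empty, and $k$ is assumed to be a power of two. The sketch records where each string is placed during merging so that step 2 cuts at actual left endpoints.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_superstring(strings):
    """Greedy merge; returns (superstring, offsets) with a start index per string."""
    distinct = sorted(set(strings))
    # Strings contained in another input string need not take part in merging.
    maximal = [s for s in distinct if not any(s != t and s in t for t in distinct)]
    pool = [(s, {s: 0}) for s in maximal]
    while len(pool) > 1:
        n, i, j = max((overlap(a[0], b[0]), i, j)
                      for i, a in enumerate(pool)
                      for j, b in enumerate(pool) if i != j)
        (a, off_a), (b, off_b) = pool[i], pool[j]
        shift = len(a) - n
        merged = (a + b[n:], {**off_a, **{s: o + shift for s, o in off_b.items()}})
        pool = [p for idx, p in enumerate(pool) if idx not in (i, j)] + [merged]
    text, offsets = pool[0]
    for s in distinct:                       # locate the contained strings
        offsets.setdefault(s, text.find(s))
    return text, offsets

def halve(C, k):
    """One round C_k -> C_{k/2}: superstring, cuts at left endpoints, extra cuts."""
    rho, offsets = greedy_superstring(C)                     # step 1
    cuts = sorted(set(offsets.values()) | {len(rho)})        # step 2
    pieces = [rho[a:b] for a, b in zip(cuts, cuts[1:])]
    out = []
    for p in pieces:                                         # step 3
        if len(p) > k // 2:
            mid = len(p) // 2
            out += [p[:mid], p[mid:]]
        else:
            out.append(p)
    return out

text, offsets = greedy_superstring(["abaa", "aaac"])
assert text == "abaaac"      # the merge example from the text
```

Repeatedly calling `halve` starting from a single string and $k = n$ yields the sequences $C_{n/2}, C_{n/4}, \ldots, C_2$ of the description above.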
Now we translate these sequences of strings into a grammar. To begin, associate a nonterminal with each string in each sequence $C_k$. In particular, the nonterminal associated with the single string in $C_n$ (which is $\sigma$ itself) is the start symbol of the grammar. All that remains is to define these nonterminals. In doing so, the following observation is key: each string in $C_k$ is the concatenation of several consecutive strings in $C_{k/2}$ together with a prefix of the next string in $C_{k/2}$. This is illustrated in the figure above, where the fate of one string in $C_k$ (shaded and marked $T$) is traced through the construction of $C_{k/2}$. In this case, $T$ is the concatenation of $V$, $W$, $X$, and a prefix of $Y$. Similarly, the prefix of $Y$ is itself the concatenation of consecutive strings in $C_{k/4}$ together with a prefix of the next string in $C_{k/4}$. This prefix is in turn the concatenation of consecutive strings in $C_{k/8}$ together with a prefix of the next string in $C_{k/8}$, etc. As a result, we can define the nonterminal corresponding to a string in $C_k$ as a sequence of consecutive nonterminals from $C_{k/2}$, followed by consecutive nonterminals from $C_{k/4}$, followed by consecutive nonterminals from $C_{k/8}$, etc. For example, the definition of $T$ would begin $T \to VWX\ldots$ and then contain sequences of consecutive nonterminals from $C_{k/4}$, $C_{k/8}$, etc. As a special case, the nonterminals corresponding to strings in $C_2$ can be defined in terms of terminals.

We can use the substring construction to make these definitions shorter and hence the overall size of the grammar smaller. In particular, for each sequence of strings $C_k$, we apply the substring construction on the corresponding sequence of nonterminals. This enables us to express any sequence of consecutive nonterminals using just two symbols. As a result, we can define each nonterminal corresponding to a string in $C_k$ using only two symbols that represent a sequence of consecutive nonterminals from $C_{k/2}$, two more that represent a sequence of consecutive nonterminals from $C_{k/4}$, etc. Thus, every nonterminal can now be defined with $O(\log n)$ symbols on the right.

Theorem 13: The procedure described above is an $O(\log^3 n)$-approximation algorithm for the smallest grammar problem.

Proof: We must determine the size of the grammar generated by the above procedure. In order to do this, we must first upper-bound the number of strings in each sequence $C_k$. To this end, note that the number of strings in $C_{k/2}$ is equal to the number of strings in $C_k$ plus the number of extra cuts made in step 3. Thus, given that $C_n$ contains a single string, we can upper-bound the number of strings in $C_k$ by upper-bounding the number of extra cuts made at each stage.

Suppose that the smallest grammar generating $\sigma$ has size $m^*$. Then Lemma 8 implies that there exists a superstring containing all the strings in $C_k$ with length $2m^*k$. Since we are using a 4-approximation, the length of $\rho_k$ is at most $8m^*k$. Therefore, there can be at most $16m^*$ pieces of $\rho_k$ with length greater than $k/2$ after step 2. This upper-bounds the number of extra cuts made in the formation of $C_{k/2}$, since extra cuts are only made into pieces with length greater than $k/2$. It follows that every sequence of strings $C_k$ has length $O(m^* \log n)$, since step 2 is repeated only $\log n$ times over the course of the algorithm.

On one hand, there are $\log n$ sequences $C_k$, each containing $O(m^* \log n)$ strings. Each such string corresponds to a nonterminal with a definition of length $O(\log n)$. This gives $O(m^* \log^3 n)$ symbols in total. On the other hand, for each sequence of strings $C_k$, we apply the substring construction on the corresponding sequence of nonterminals. Recall that this construction generates $2p \log p$ symbols when applied to a sequence of length $p$. This creates an additional $O((\log n) \cdot (m^* \log n) \log(m^* \log n)) = O(m^* \log^3 n)$ symbols. Therefore, the total size of the grammar generated by this algorithm is $O(m^* \log^3 n)$, which proves the claim.
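The two contributions to the bound can be written side by side. The following is a worked restatement of the counting argument above rather than a new result. Here $m^*$ denotes the size of the smallest grammar and $p = O(m^* \log n)$ the number of strings in one sequence $C_k$. The only step not spelled out in the proof is the observation that $m^* \le n$ (the grammar $S \to \sigma$ has size $n$), so $\log(m^* \log n) = O(\log n)$.

```latex
\begin{align*}
\text{definitions:}\quad
  & \underbrace{\log n}_{\text{sequences}}
    \cdot \underbrace{O(m^* \log n)}_{\text{strings per sequence}}
    \cdot \underbrace{O(\log n)}_{\text{symbols per definition}}
    = O(m^* \log^3 n) \\
\text{substring construction:}\quad
  & \underbrace{\log n}_{\text{sequences}} \cdot 2p \log p
    = O\bigl(\log n \cdot m^* \log n \cdot \log(m^* \log n)\bigr)
    = O(m^* \log^3 n)
\end{align*}
```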
C. An $O(\log n/m^*)$-Approximation Algorithm

We now present a more complex solution to the smallest grammar problem with approximation ratio $O(\log n/m^*)$. The description is divided into three sections. First, we introduce a variant of the well-known LZ77 compression scheme. This serves two purposes: it gives a new lower bound on the size of the smallest grammar for a string and is the starting point for our construction of a small grammar. Second, we introduce balanced binary grammars, the variety of well-behaved grammars that our procedure employs. In the same section, we also introduce three basic operations on balanced binary grammars. Finally, we present the main algorithm, which translates a string compressed using our LZ77 variant into a grammar at most $O(\log n/m^*)$ times larger than the smallest.

D. An LZ77 Variant

We begin by describing a flavor of LZ77 compression [21]. We use this both to obtain a lower bound on the size of the smallest grammar for a string and as the basis for generating a small grammar. In this scheme, a string is represented by a sequence of characters and pairs of integers. For example, one possible sequence is:

$$a \; b \; (1,2) \; (2,3) \; c \; (1,5)$$

An LZ77 representation can be decoded into a string by working left-to-right through the sequence according to the following rules:

If a character $c$ is encountered in the sequence, then the next character in the string is $c$.

Otherwise, if a pair $(x, y)$ is encountered in the sequence, then the next $y$ characters of the string are the same as the $y$ characters beginning at position $x$ of the string. (We require that the $y$ characters beginning at position $x$ be represented by earlier items in the sequence.)

The example sequence can be decoded as follows:

Index:   1  2  3  4  5  6  7  8  9 10 11 12 13
LZ77:    a  b  (1,2) (2,3)     c  (1,5)
String:  a  b  a  b  b  a  b  c  a  b  a  b  b

The shortest LZ77 sequence for a given string can be found in polynomial time. Make a left-to-right pass through the string. If the next character in the unprocessed portion of the string has not appeared before, output it. Otherwise, find the longest prefix of the unprocessed portion that appears in the processed [...]

[...] Preexisting rules are indicated with shaded lines, and new rules with dark lines.

At the start of the $i$-th step, the active rule is $Z_i \to A_i B_i$. Our goal is to create a new active rule that defines $Z_{i+1}$ while maintaining the three invariants. There are three cases to consider.

Case 1
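: If $Z_i$ and $Y_{i+1}$ are in balance, then we create a new rule:

$$Z_{i+1} \to Z_i Y_{i+1}$$

This becomes the active rule. It is easy to check that the three invariants are maintained.

If $Z_i$ and $Y_{i+1}$ are not in balance, this implies that

$$\frac{\alpha}{1-\alpha} \le \frac{[Y_{i+1}]}{[Z_i]} \le \frac{1-\alpha}{\alpha}$$

does not hold. Since the right inequality is (2), the left inequality must be violated. Thus, hereafter we can assume:

$$\frac{\alpha}{1-\alpha} > \frac{[Y_{i+1}]}{[Z_i]} \qquad (3)$$

Case 2: Otherwise, if $A_i$ is in balance with $B_i Y_{i+1}$, then we create two new rules:

$$\begin{array}{l}
Z_{i+1} \to A_i T_i \\
T_i \to B_i Y_{i+1}
\end{array}$$

The first of these becomes the active rule. It is easy to check that the first two invariants are maintained. In order to check that all new rules are balanced, first note that the rule $Z_{i+1} \to A_i T_i$ is balanced by the case assumption. For the rule $T_i \to B_i Y_{i+1}$ to be balanced, we must show:

$$\frac{\alpha}{1-\alpha} \le \frac{[Y_{i+1}]}{[B_i]} \le \frac{1-\alpha}{\alpha}$$

The left inequality is (1). For the right inequality, begin with (3):

$$[Y_{i+1}] < \frac{\alpha}{1-\alpha}[Z_i] = \frac{\alpha}{1-\alpha}\bigl([A_i] + [B_i]\bigr) \le \frac{\alpha}{1-\alpha}\left(\frac{1-\alpha}{\alpha}[B_i] + [B_i]\right) \le \frac{1-\alpha}{\alpha}[B_i]$$

The equality follows from the definition of $Z_i$ by the rule $Z_i \to A_i B_i$. The subsequent inequality uses the fact that this rule is balanced, according to invariant (B3). The last inequality uses only algebra and holds for all $\alpha \le 0.381$.

If case 2 is bypassed, then $A_i$ and $B_i Y_{i+1}$ are not in balance, which implies that

$$\frac{\alpha}{1-\alpha} \le \frac{[A_i]}{[B_i Y_{i+1}]} \le \frac{1-\alpha}{\alpha}$$

does not hold. Since $A_i$ is in balance with $B_i$ alone by invariant (B3), the right inequality holds. Therefore, the left inequality must not; hereafter, we can assume:

$$\frac{\alpha}{1-\alpha} > \frac{[A_i]}{[B_i Y_{i+1}]} \qquad (4)$$

Combining inequalities (3) and (4), one can use algebraic manipulation to establish the following bounds, which hold hereafter:

$$\frac{[A_i]}{[B_i]} \le \frac{\alpha}{1-2\alpha} \qquad (5)$$

$$\frac{[Y_{i+1}]}{[B_i]} \le \frac{\alpha}{1-2\alpha} \qquad (6)$$

Case 3: Otherwise, suppose that $B_i$ is defined by the rule $B_i \to UV$. We create three new rules:

$$\begin{array}{l}
Z_{i+1} \to P_i Q_i \\
P_i \to A_i U \\
Q_i \to V Y_{i+1}
\end{array}$$

The first of these becomes the active rule. We must check that all of the new rules are in balance. We begin with $P_i \to A_i U$. In one direction, we have:

$$\frac{[A_i]}{[U]} \ge \frac{[A_i]}{(1-\alpha)[B_i]} \ge \frac{[A_i]}{[B_i]} \ge \frac{\alpha}{1-\alpha}$$

The first inequality uses the fact that $B_i \to UV$ is balanced. The second inequality follows because $1-\alpha \le 1$. The final inequality uses the fact that $A_i$ and $B_i$ are in balance. In the other direction, we have:

$$\frac{[A_i]}{[U]} \le \frac{[A_i]}{\alpha[B_i]} \le \frac{1}{1-2\alpha} \le \frac{1-\alpha}{\alpha}$$

The first inequality uses the fact that $B_i \to UV$ is balanced, and the second follows from (5). The last inequality holds for all $\alpha \le 0.293$. The argument to show that $Q_i \to V Y_{i+1}$ is balanced is similar.

Finally, we must check that $Z_{i+1} \to P_i Q_i$ is in balance. In one direction, we have:

$$\frac{[P_i]}{[Q_i]} = \frac{[A_i U]}{[V Y_{i+1}]} \le \frac{[A_i] + (1-\alpha)[B_i]}{\alpha[B_i] + [Y_{i+1}]} = \frac{\frac{[A_i]}{[B_i]} + (1-\alpha)}{\alpha + \frac{[Y_{i+1}]}{[B_i]}} \le \frac{\frac{\alpha}{1-2\alpha} + (1-\alpha)}{\alpha + \frac{\alpha}{1-\alpha}} \le \frac{1-\alpha}{\alpha}$$

The equality follows from the definitions of $P_i$ and $Q_i$. The first inequality uses the fact that the rule $B_i \to UV$ is balanced. The subsequent equality follows by dividing the top and bottom by $[B_i]$. In the next step, we use (5) on the top, and (1) on the bottom. The final inequality holds for all $\alpha \le \frac{1}{3}$.

In the other direction, we have:

$$\frac{[P_i]}{[Q_i]} = \frac{[A_i U]}{[V Y_{i+1}]} \ge \frac{[A_i] + \alpha[B_i]}{(1-\alpha)[B_i] + [Y_{i+1}]} = \frac{\frac{[A_i]}{[B_i]} + \alpha}{(1-\alpha) + \frac{[Y_{i+1}]}{[B_i]}} \ge \frac{\frac{\alpha}{1-\alpha} + \alpha}{(1-\alpha) + \frac{\alpha}{1-2\alpha}} \ge \frac{\alpha}{1-\alpha}$$

As before, the first equality uses the definitions of $P_i$ and $Q_i$. Then we use the fact that $B_i \to UV$ is balanced. We obtain the second equality by dividing the top and bottom by $[B_i]$. The subsequent inequality uses the fact that $A_i$ and $B_i$ are in balance on the top and (6) on the bottom. The final inequality holds for all $\alpha \le \frac{1}{3}$.

All that remains is to upper-bound the number of rules created during the AddPair operation. At most three rules are added in each of the $t$ steps of the second phase. Therefore, it suffices to upper bound $t$. This quantity is determined during the first phase, where $Y$ is decomposed into a string of symbols. In each step of the first phase, the length of the expansion of the first symbol in this string decreases by a factor of at least $1-\alpha$. When the first symbol is in balance with $X$, the process stops. Therefore, the number of steps is $O(\log [Y]/[X])$. Since the string initially contains one symbol, $t$ is $O(1 + \log [Y]/[X])$. Therefore, the number of new rules is:

$$O\left(1 + \left|\log \frac{[X]}{[Y]}\right|\right)$$

Because we take the absolute value, this bound holds regardless of whether $[X]$ or $[Y]$ is larger.

2) The AddSequence Operation: The AddSequence operation is a generalization of AddPair. Given a balanced grammar with symbols $X_1 \ldots X_t$, the operation creates a balanced grammar containing a nonterminal with expansion $\langle X_1 \ldots X_t \rangle$. The number of rules added is:

$$O\left(t\left(1 + \log \frac{[X_1 \ldots X_t]}{t}\right)\right)$$

The idea is to place the $X_i$ at the leaves of a balanced binary tree. (To simplify the analysis, assume that $t$ is a power of two.) We create a nonterminal for each internal node by combining the nonterminals at the child nodes using AddPair. Recall that the number of rules that AddPair creates when combining nonterminals $X$ and $Y$ is $O\left(1 + \left|\log \frac{[X]}{[Y]}\right|\right) = O(\log [X] + \log [Y])$. Let $c$ denote the hidden constant on the right, and let $s$ equal $[X_1 \ldots X_t]$. Creating all the nonterminals on the bottom level of the tree generates at most

$$c \sum_{i=1}^{t} \log [X_i] \le c\,t \log \frac{s}{t}$$

rules. (The inequality follows from the concavity of log.) Similarly, the number of rules created on the second level of the tree is at most $c(t/2) \log \frac{s}{t/2}$, because we pair $t/2$ nonterminals, but the sum of their expansion lengths is still $s$. In general, on the $i$-th level, we create at most

$$c(t/2^i) \log \frac{s}{t/2^i} = c(t/2^i) \log \frac{s}{t} + c\,t\,i/2^i$$

new rules. Summing $i$ from 0 to $\log t$, we find that the total number of rules created is

$$\sum_{i=0}^{\log t} \left( c(t/2^i) \log \frac{s}{t} + c\,t\,i/2^i \right) = O\left(t\left(1 + \log \frac{[X_1 \ldots X_t]}{t}\right)\right)$$

as claimed.

3) The AddSubstring Operation: This operation takes a balanced grammar containing a nonterminal with $\beta$ as a substring and produces a balanced grammar containing a nonterminal with expansion exactly $\beta$ while adding $O(\log |\beta|)$ new rules.

Let $T$ be the nonterminal with the shortest expansion such that its expansion contains $\beta$ as a substring. Let $T \to XY$ be its definition. Then we can write $\beta = \beta_p \beta_s$, where the prefix $\beta_p$ lies in $\langle X \rangle$ and the suffix $\beta_s$ lies in $\langle Y \rangle$. (Note, $\beta_p$ is actually a suffix of $\langle X \rangle$, and $\beta_s$ is a prefix of $\langle Y \rangle$.) We generate a nonterminal that expands to the prefix $\beta_p$, another that expands to the suffix $\beta_s$, and then merge the two with AddPair. The last step generates only $O(\log |\beta|)$ new rules. So all that remains is to generate a nonterminal that expands to the prefix, $\beta_p$; the suffix is handled symmetrically. This task is divided into two phases.

In the first phase, we find a sequence of symbols $X_1 \ldots X_t$ with expansion equal to $\beta_p$. To do this, we begin with an empty sequence and employ a recursive procedure. At each step, we have a desired suffix (initially $\beta_p$) of some current symbol (initially $X$). During each step, we consider the definition of the current symbol, say $X \to AB$. There are two cases:

1) If the desired suffix wholly contains $\langle B \rangle$, then we prepend $B$ to the nonterminal sequence. The desired suffix becomes the portion of the old suffix that overlaps $\langle A \rangle$, and the current nonterminal becomes $A$.

2) Otherwise, we keep the same desired suffix, but the current symbol becomes $B$.

A nonterminal is only added to the sequence in case 1. But in that case, the length of the desired suffix is scaled down by at least a factor $1-\alpha$. Therefore the length of the resulting nonterminal sequence is $t = O(\log |\beta|)$.

This construction implies the following inequality, which we use later:

$$\frac{[X_1 \ldots X_i]}{[X_{i+1}]} \le \frac{1-\alpha}{\alpha}$$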
(7)

This inequality holds because $\langle X_1 \ldots X_i \rangle$ is a suffix of the expansion of a nonterminal in balance with $X_{i+1}$. Consequently, $X_1 \ldots X_i$ is not too long to be in balance with $X_{i+1}$.

In the second phase, we merge the nonterminals in the sequence $X_1 \ldots X_t$ to obtain the nonterminal with expansion $\beta_p$. The process goes from left to right. Initially, we set $R_1 = X_1$. Thereafter, at the start of the $i$-th step, we have a nonterminal $R_i$ with expansion $\langle X_1 \ldots X_i \rangle$ and seek to merge [...]

G. Grammar-Based Compression versus LZ77

We have now shown that a grammar of size $m$ can be translated into an LZ77 sequence of length at most $m$. In the reverse direction, we have shown that an LZ77 sequence of length $p$ can be translated to a grammar of size $O(p \log n/p)$. Furthermore, the latter result is nearly the best possible. Consider strings of the form

$$\sigma = x^{k_1} \mid x^{k_2} \mid \ldots \mid x^{k_q}$$

where $k_1$ is the largest of the $k_i$. This string can be represented by an LZ77 sequence of length $O(q + \log k_1)$:

$$x \; (1,1) \; (1,2) \; (1,4) \; (1,8) \; \ldots \; (1, k_1 - 2^j) \mid (1, k_2) \mid \ldots \mid (1, k_q)$$

Here, $2^j$ is the largest power of 2 less than $k_1$. If we set $q = \Theta(\log k_1)$, then the sequence has length $O(\log k_1)$. On the other hand, Theorem 2 states that the smallest grammar for $\sigma$ is within a constant factor of the shortest addition chain containing $k_1, \ldots, k_q$. Pippenger [37] has shown, via a counting argument, that there exist integers $k_1, \ldots, k_q$ such that the shortest addition chain containing them all has length:

$$\Omega\left(\log k_1 + q \cdot \frac{\log k_1}{\log \log k_1 + \log q}\right)$$

If we choose $q = \Theta(\log k_1)$ as before, then the above expression boils down to:

$$\Omega\left(\frac{\log^2 k_1}{\log \log k_1}\right)$$

Putting this all together, we have a string of length $n = O(k_1 \log k_1)$ for which there exists an LZ77 sequence of length $O(\log k_1)$, but for which the smallest grammar has size $\Omega\left(\frac{\log^2 k_1}{\log \log k_1}\right)$. The ratio between the grammar size and the length of the LZ77 sequence is therefore:

$$\Omega\left(\frac{\log k_1}{\log \log k_1}\right) = \Omega\left(\frac{\log n}{\log \log n}\right)$$

Thus our algorithm for transforming a sequence of LZ77 pairs into a grammar is almost optimal.

The analysis in this section brings to light the relationship between the best grammar-based compressors and LZ77. One would expect the two to achieve roughly comparable compression performance since the two representations are quite similar. Which approach achieves superior compression (over all cases) in practice depends on many considerations beyond the scope of our theoretical analysis. For example, one must bear in mind that a grammar symbol can be represented by fewer bits than an LZ77 pair. In particular, each LZ77 pair requires about $2 \log n$ bits to encode, although this may be somewhat reduced by representing the integers in each pair with a variable-length code. On the other hand, each grammar symbol can be naively encoded using about $\log m$ bits, which could be as small as $\log \log n$. This can be further improved via an optimized arithmetic encoding as suggested in [4]. Thus, the fact that grammars can be somewhat larger than LZ77 sequences may be roughly offset by the fact that grammars can also translate into fewer bits. Empirical comparisons in [4] suggest precisely this scenario, but they do not yet seem definitive one way or the other [4], [9], [6], [7], [10], [11], especially in the low-entropy case.

The procedures presented here are not ready for immediate use as practical compression algorithms. The numerous hacks and optimizations needed in practice are lacking. Our algorithms are designed not for practical performance, but for good, analyzable performance. In practice, the best grammar-based compression algorithm may yet prove to be a simple scheme like RE-PAIR, which we do not yet know how to analyze.

VIII. FUTURE DIRECTIONS

A. Analysis of Global Algorithms

Our analysis of previously-proposed algorithms for the smallest grammar problem leaves a large gap of understanding surrounding the global algorithms, GREEDY, LONGEST MATCH, and RE-PAIR. In each case, we upper-bound the approximation ratio by $O((n/\log n)^{2/3})$ and lower bound it by some expression that is $o(\log n)$. Elimination of this gap would be significant for several reasons. First, these algorithms are important; they are simple enough to be practical for applications such as compression and DNA entropy estimation. Second, there are natural analogues to
these global algorithms for other hierarchically-structured problems. Third, all of our lower bounds on the approximation ratio for these algorithms are well below the $\Omega(\log n/\log \log n)$ hardness implied by the reduction from the addition chain problem. Either there exist worse examples for these algorithms or else a tight analysis will yield progress on the addition chain problem.

B. Algebraic Extraction

The need for a better understanding of hierarchical approximation problems beyond the smallest grammar problem is captured in the smallest AND-circuit problem. Consider a digital circuit which has several input signals and several output signals. The function of each output is a specified sum-of-products over the input signals. How many two-input AND gates must the circuit contain to satisfy the specification?

This problem has been studied extensively in the context of automated circuit design. Interestingly, the best known algorithms for this problem are closely analogous to the GREEDY and RE-PAIR algorithms for the smallest grammar problem. (For details on these analogues, see [38], [39] and [40] respectively.) No approximation guarantees are known.

C. String Complexity in Other Natural Models

One motivation for studying the smallest grammar problem was to shed light on a computable and approximable variant of Kolmogorov complexity. This raises a natural follow-on question: can the complexity of a string be approximated in other natural models? For example, the grammar model could be extended to allow a nonterminal to take a parameter. One could then write a rule such as $T(P) \to PP$, and write the string $xxyzyz$ as $T(x)T(yz)$. Presumably as model power increases, approximability decays to incomputability. Good approximation algorithms for strong string-representation models could be applied wherever the smallest grammar problem has arisen.
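The parameterized rule mentioned above can be illustrated with a few lines of code. This is only a sketch of one possible reading of the extended model: the function `expand_param` and the rule representation are hypothetical, parameters are bound to plain strings, and nonterminals appearing inside a rule body are assumed to take no arguments.

```python
def expand_param(rules, symbol, args=()):
    """Expand a symbol whose rule may name string-valued parameters.

    rules maps a nonterminal to (parameter_names, body); any symbol
    that is neither a rule nor a bound parameter is a terminal.
    """
    if symbol not in rules:
        return symbol
    params, body = rules[symbol]
    env = dict(zip(params, args))        # bind parameters to arguments
    return "".join(env[s] if s in env else expand_param(rules, s)
                   for s in body)

# The rule T(P) -> PP from the text, and the start sequence T(x) T(yz).
param_rules = {"T": (("P",), ["P", "P"])}
start = [("T", ("x",)), ("T", ("yz",))]
result = "".join(expand_param(param_rules, s, a) for s, a in start)
assert result == "xxyzyz"
```

In this toy encoding the representation of $xxyzyz$ uses one rule of size two plus a short start sequence, which hints at why such models can be more compact than plain context-free grammars.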