The MULI Project: Annotation and Analysis of Information Structure in German and English

Stefan Baumann, Caren Brinckmann, Silvia Hansen-Schirra, Geert-Jan Kruijff, Ivana Kruijff-Korbayová, Stella Neumann, Erich Steiner, Elke Teich, Hans Uszkoreit

Saarland University, Saarbrücken, Germany

Abstract

The goal of the MULI (MUltiLingual Information structure) project is to empirically analyse information structure in German and English newspaper texts. In contrast to other projects in which information structure is annotated and investigated (e.g. in the Prague Dependency Treebank, which mirrors the basic information about the topic-focus articulation of the sentence), we do not annotate theory-biased categories like topic-focus or theme-rheme. Trying to be as theory-independent as possible, we annotate those features which are relevant to information structure and on the basis of which typical patterns, co-occurrences or correlations can be determined. We distinguish between three annotation levels: syntax, discourse and prosody. The data is based on the TIGER Corpus for German and the Penn Treebank for English, since the existing information on part-of-speech and syntactic structure can be re-used for our purposes. The actual annotation of an English example sequence illustrates our choice of categories on each level. Their combination offers the possibility to investigate how information structure is realised and can be interpreted.

1. Introduction

MULI (MUltiLingual Information structure) is a pilot study for enriching treebanks with features relevant for investigating the distribution of information in texts. As its name suggests, the project looks from a contrastive angle on this investigation and incorporates different linguistic levels. Thus, MULI is a step to enhance existing linguistically interpreted language resources like the Tiger Treebank for German or the Penn Treebank for English with information on the interface between syntax, (discourse) semantics and prosody. The multilingual design of the study allows us to identify language-specific realisations and preferences of indicators of information structure. The annotation is restricted to a relatively small amount of data, since the experimental design of the study requires testing of tools as well as manual annotation.

We are particularly interested in the correlations and co-occurrences of features on different linguistic levels that can be interpreted as indicators of information structure. For this purpose, the annotation scheme has to be as theory independent as possible. We refrain from annotating abstract categories of information structure like topic/focus or theme/rheme, concentrating instead on more concrete linguistic phenomena that have been described as indicators of these abstract categories on the different levels.

In this paper we focus on the description of the annotation scheme and how this annotation can serve to enrich the existing treebank resources (§2). We illustrate the annotation in the discussion of an example from the corpus in §3. §4 concludes with the perspectives opened by this study.

2. Annotation

2.1. Corpus design

The MULI corpus consists of extracts from the Tiger Treebank for German (Brants et al., to appear; http://www.coli.uni-sb.de/cl/projects/tiger/) and the Penn Treebank for English (Marcus et al., 1994; http://www.cis.upenn.edu/~treebank/home.html). As the Tiger Corpus contains articles from the general newspaper Frankfurter Rundschau, we only select texts from the economics section in order to make the corpus as comparable as possible to the Wall Street Journal texts which make up the Penn Treebank. Our corpus comprises 250 sentences in German (app. 3,500 tokens) and 320 sentences in English (app. 7,000 tokens). Part-of-speech information and syntactic structure in the treebanks help with interpreting the distribution of information in the texts.

2.2. Syntax

As the Tiger and Penn Treebank already contain syntactic information, the annotation on the syntactic level concentrates on those features which are specifically relevant for information structure. The Tiger Corpus encodes information on syntactic functions in the edges and phrase categories in the nodes. It also takes account of part-of-speech tags as well as morphology which is very important for an inflectional language like German. As a mixture of phrase structure and dependency analysis, the annotation combines the advantages of both grammars. Initiatives covering other linguistic phenomena on the basis of this annotation include the extraction of topological information (Becker and Frank, 2002) which can
be used for the analysis of information structure on the syntactic level. The annotation of the Penn Treebank consists of a phrase structure analysis enriched by part-of-speech information. It differentiates between a number of adjuncts (e.g. temporal and local). Furthermore, the most frequent syntactic functions are included in the annotation.

Beyond these types of syntactic information, the MULI annotation scheme covers noncanonical word order and other syntactic structures that serve to put the focus on certain elements. It draws on accounts of the analysed features as described in (Eisenberg, 1994) and (Weinrich, 1993) for German and in (Quirk et al., 1985) and (Biber et al., 1999) for English. The annotation scheme comprises cleft, pseudo-cleft, reversed pseudo-cleft, extraposition, fronting, expletives, as well as active, medio-passive and passive. Where necessary, the annotation guidelines specify language-specific realisations of these features. This is particularly the case for the expletive es in German and its English equivalent there-insertion. The unit under investigation on the syntactic level is the clause, i.e. prior to the analysis the corpus was segmented into clauses.

2.3. Discourse

Information structure (IS) theories describe the phenomena at hand at a surface level, at a semantic level, or at both levels simultaneously, i.e., an expression belongs to some IS partition, in virtue of some information status of the corresponding discourse entity. For the investigation of IS at the semantic level, we need more information about the character of the discourse entities introduced by linguistic expressions. We therefore annotate expressions with their discourse referents and their following properties: Type (intensional or extensional object, property, eventuality or textuality) and more fine-grained Semantic Sort; referential properties of Delimitation (unique, existential, variable, non-denotational use (Hlavsa, 1975)) and Quantification (uncountable, unspecific non-singular, specific non-singular or specific singular); the Form of an expression (although it does not necessarily belong to this level, but there are correlations with the other features); Information Status (new, unused, inferable, evoked) (Prince, 1981). Coding information status is motivated by the fact that IS theories often employ some notion of information status as one dimension of the partitioning on its own, or as the basis for deriving a higher level of partitioning. We use Prince's familiarity taxonomy, which clearly addresses the status of discourse entities as such, not other referential properties.

Besides the properties of individual discourse referents, we annotate anaphoric links between expressions. We distinguish between coreference and bridging, where there exists an associative relationship between the referents of
the anaphor and the antecedent, such as set-containment, part-whole composition, property-attribution, possession, causality or lexical-argument-filling. The relation between anaphoricity and IS is not a straightforward one, and needs further investigation, enabled by an annotation like ours.

Our annotation scheme follows the Text Encoding Initiative recommendations (http://www.tei-c.org/) and the Discourse Resource Initiative guidelines (Carletta et al., 1997). In line with these standards, we define what expressions are markables, what attributes they have and what links can hold between them. At the discourse level, markables are “nominal-like” (Passoneau, 1996) linguistic expressions that introduce or access discourse entities (i.e., discourse referents in the sense used in DRT and alike). We build on and extend the reference annotation schemes for MUC-6 and MUC-7 (MUC Coreference Specification), DRAMA (Passoneau, 1996), the MATE project (Poesio et al., 1999; http://mate.mip.ou.dk), the DRI guidelines (Carletta et al., 1997), (Poesio and Vieira, 1998) and (Müller and Strube, 2001). The corpus has been annotated by two annotators (one of the developers and one only instructed by the annotation guidelines), using the MMAX annotation tool (http://www.eml.villa-bosch.de/english/Research/NLP/Downloads).

2.4. Prosody

In spoken language, prosody (intonation, phrasing, stress, rhythm) is often used to realise the information structure of a text, e.g. the pragmatic structure (focus/background) or the degree of cognitive activation of individual discourse referents or propositions (given/new). Accent placement and phrasing are the primary means to mark information structural concepts, but pitch range, rhythm, and speech rate also play an important role.

In order to carry out the prosodic annotation, we recorded one German and one English native speaker reading aloud the texts of the MULI corpus.¹ Since individual speaking preferences may vary from speaker to speaker, our results are not generalisable, reflecting the experimental character of the study. The recordings were digitised and annotated on six different levels using the EMU Speech Database System (Cassidy and Harrington, 2001; http://emu.sourceforge.net/): (1) word boundaries and pauses, (2) punctuation of the written
texts, (3) position and type of pitch accents and boundary tones, (4) position and strength of phrase breaks, (5) rhythmic phenomena, including non-canonical word stress, (6) comments. The annotation of level 3 and 4 follows the conventions of ToBI (Tones and Break Indices (Beckmann and Hirschberg, 1994)) for English and GToBI (Grice et al., in press; http://www.coli.uni-sb.de/phonetik/projects/Tobi/gtobi.html) for German. They can be regarded as standards for describing the intonation of these languages within the framework of
autosegmental-metrical phonology, in which pitch contours are decomposed into high and low tonal targets (symbolised by H and L). Diacritics are listed in Table 1; the tonal and break index inventories are summarised in Table 2.

    *    target on the accented syllable
    +    target before or after the accented syllable
    -    boundary tone of an intermediate phrase (ip)
    %    boundary tone of an intonation phrase (IP)
    !    downstep of an H tone
    ^    upstep of an H tone

Table 1: (G)ToBI diacritics

                      ToBI                   GToBI
    pitch accents     H*, L*, L+H*,          H*, L*, L+H*,
                      L*+H, H+!H*            L*+H, H+!H*, H+L*
    force accents     (none)                 H(*), L(*)
    boundary tones    L-, H-, L-L%,          L-, H-, L-%,
                      H-L%, H-H%,            H-%, H-^H%,
                      L-H%, %H               L-H%, %H
    break indices     0, 1, 2, 3, 4          2r, 2t, 3, 4

Table 2: (G)ToBI inventories of tones and break indices

¹ Since prosodic annotation is very time-consuming, we had to concentrate on one language. Thus, we analysed all German texts and restricted ourselves to some English examples.

3. Example

We illustrate the different levels of annotation and analysis with an example sequence taken from our English corpus (Figure 1). We consider the syntactic annotation a suitable starting point for the analysis. Where relevant features are detected, we compare the annotation to other levels.

    (1) In the 1987 crash, remember, the market was shaken by a Danny Rostenkowski proposal to tax takeovers out of existence. (2) Even more important, in our view, was the Treasury's threat to thrash the dollar. (3) The Treasury is doing the same thing today; (4) thankfully, the dollar is not under 1987-style pressure.

Figure 1: Example sequence from the English corpus

The example sequence was segmented into four clauses. Of all four clauses, three show noncanonical word orders. In (1), the temporal adjunct is fronted, followed by the predicate "remember" (in imperative mood). Similarly, in (4), an adjunct (marking stance) is fronted. In (2), subject complement and adjunct (again marking stance) are fronted. Additionally, (1) contains a passive construction bringing the patient into subject position.

The discourse entity (DE) introduced in the fronted temporal phrase "the 1987 crash" in (1) is extensional, abstract, unique, specific singular, and has the information status of unused (also indicated by "remember"). The DE introduced in the unmarked subject position is extensional, abstract, unique, specific singular, but has the status of inferable: "the market" can be seen as a bridging anaphor to "the crash", by means of an argument filling (crash of the market). The DEs introduced by the sentence-final expressions in (1) and (2) are also extensional, abstract, unique, specific singular, and both have the information status of new.² What appears sentence-final in (1) and (2) are thus two negative things that happened during the 1987 crash. The evaluation-ascribing adjective phrase in (2) is not annotated as a DE. The DEs in the unmarked subject positions in (3) and (4) both have the information status of textually evoked, as both expressions are coreferential anaphors to parts of "the Treasury's threat to thrash the dollar". While the DE referred to by "the Treasury" is an extensional, office, unique, specific singular, that of "the dollar" is intensional, abstract, unique, uncountable. The expression "the same thing" in (3) is anaphoric to "the Treasury's threat..." in (2), but it introduces a new DE of the same type; its information status is that of inferable. Finally, the DE introduced in the sentence-final expression "1987-style pressure" in (4) is intensional, abstract, existential, uncountable, and also has the information status of inferable; it is however hard to code it as a bridging anaphor, because it is not clear what relation it would have to what antecedent: if anything, then "a Danny Rostenkowski proposal..." in (1) (according to one of the annotators).

² We assume a layman reader. For an economy expert, these entities may have the status of unused.

The prosodic analysis shows that the fronted phrase in (2) is not only syntactically but also prosodically prominent (cf. Figure 2): Two peak accents on "even" and "more" highlight these words (with the more pronounced accent on "more" expressing a contrast), whereas the word "important" is deaccented, since the concept of 'importance' is inferable from the context. Furthermore, the adjective construction forms a phrase of its own, delimited by an intonation phrase boundary, which is in turn signalled by a falling-rising contour plus a short pause. The following parenthesis "in our view" also constitutes a single intonation phrase. Here again, "our" is assigned a contrastive accent, while "view" is unaccented due to givenness.

All remaining content words of the clause receive accents. However, the most 'newsworthy' word, "threat", is the only one marked by a rising pitch accent (L+H*), indicating its higher degree of importance for the speaker. This interpretation is further supported by the insertion of a phrase break directly after this word. Finally, the high-downstepped nuclear accent (H+!H*) on "dollar" marks this item as being accessible by speaker and hearer (cf. (Pierrehumbert and Hirschberg, 1990)). This means it can neither count as brand new (which normally requires an H* peak accent), nor as immediately given, since it is not deaccented (as is the case with the word "important" above).

4. Conclusions

First experiences with our multilingual multi-layer annotation lead to conclusions with respect to how to automate the annotation process using statistical methods and learning procedures. On the grammatical level, the syntactic annotation of the Tiger Corpus and the Penn Treebank, for example, can be used to determine passive constructions. In connection with the discourse annotation, for instance, the existing part-of-speech tags can be used in order to identify pronominal co-reference.

We are also working on robust methods for identifying information structure, following the annotation scheme (which focuses on the information status of individual markables) as well as investigating the idea of informativity zoning, i.e. the division of clauses/sentences into parts that are more or less informative in the given context.

Furthermore, conclusions can be drawn from the co-occurrence of particular categories on different levels of annotation in MULI, indicating how these different levels are deployed in order to mark information structure.

However, our initial investigation also reveals where additional annotation would be needed. For instance, the text example discussed above constitutes a concession scheme, which we cannot identify without annotating discourse/rhetorical relations.

Using our findings as a tertium comparationis, theory-dependent information structure annotation and thus existing theories on information structure can be compared to and validated against our theory-neutral approach and vice versa.

Finally, using our findings on the co-occurrence of the annotated categories, it is possible to compare how different languages use different grammatical, discursive and prosodic means to structure information.
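
The first conclusion above, that existing treebank annotation can be reused to find passive constructions and candidates for pronominal co-reference, can be illustrated with a small Python sketch. It is not the procedure used in MULI. It assumes clauses given as (token, tag) pairs with Penn Treebank part-of-speech tags; the function names, the word list of "be" forms, and the hand-assigned tags on the fragments of example clauses (1), (2) and (4) are all assumptions made for illustration. The passive test is a shallow heuristic (a form of "be" followed by a past participle, with adverbs allowed in between) and would miss or over-generate in cases that a full syntactic analysis handles.

```python
# Illustrative sketch only: a shallow heuristic over Penn Treebank POS tags,
# not the MULI project's method. All names and tag assignments are assumptions.

BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
ADVERB_TAGS = {"RB", "RBR", "RBS"}
PRONOUN_TAGS = {"PRP", "PRP$"}


def has_passive(tagged_clause):
    """Return True if a form of 'be' is followed by a past participle (VBN).

    Adverbs may intervene ("was not shaken"); any other tag resets the search.
    """
    awaiting_participle = False
    for token, tag in tagged_clause:
        if token.lower() in BE_FORMS:
            awaiting_participle = True
        elif awaiting_participle and tag == "VBN":
            return True
        elif awaiting_participle and tag not in ADVERB_TAGS:
            awaiting_participle = False
    return False


def pronoun_candidates(tagged_clause):
    """Return (position, token) pairs for personal and possessive pronouns."""
    return [
        (index, token)
        for index, (token, tag) in enumerate(tagged_clause)
        if tag in PRONOUN_TAGS
    ]


# Fragments of the example clauses, tagged by hand for this sketch.
CLAUSE_1 = [("the", "DT"), ("market", "NN"), ("was", "VBD"), ("shaken", "VBN"),
            ("by", "IN"), ("a", "DT"), ("Danny", "NNP"),
            ("Rostenkowski", "NNP"), ("proposal", "NN")]
CLAUSE_2 = [("in", "IN"), ("our", "PRP$"), ("view", "NN"), (",", ","),
            ("was", "VBD"), ("the", "DT"), ("Treasury", "NNP"),
            ("'s", "POS"), ("threat", "NN")]
CLAUSE_4 = [("the", "DT"), ("dollar", "NN"), ("is", "VBZ"), ("not", "RB"),
            ("under", "IN"), ("1987-style", "JJ"), ("pressure", "NN")]

if __name__ == "__main__":
    for name, clause in [("(1)", CLAUSE_1), ("(2)", CLAUSE_2), ("(4)", CLAUSE_4)]:
        print(name, "passive:", has_passive(clause),
              "pronouns:", pronoun_candidates(clause))
```

On these fragments the heuristic flags only clause (1) as passive and returns "our" in clause (2) as the single pronoun candidate. Deciding what a pronoun actually refers to is a separate step that this sketch does not attempt.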
Figure 2: Prosodic annotation of example sentence (2) in EMU

5. References

Becker, Markus and Anette Frank, 2002. A stochastic topological parser of German. In Proceedings of COLING 2002. Taipei, Taiwan.

Beckmann, Mary E. and Julia Hirschberg, 1994. The ToBI annotation conventions. Ms. and accompanying speech materials, Ohio State University.

Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan, 1999. The Longman Grammar of Spoken and Written English. Harlow: Longman.

Brants, Sabine, Stefanie Dipper, Peter Eisenberg, Silvia Hansen, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit, to appear. TIGER: Linguistic interpretation of a German corpus. Journal of Language and Computation (JLAC), Special Issue.

Carletta, Jean, Nils Dahlbäck, Norbert Reithinger, and Marylin A. Walker, 1997. Standards for dialogue coding in natural language processing. Report on the Dagstuhl seminar, Discourse Resource Initiative.

Cassidy, Steve and Jonathan Harrington, 2001. Multi-level annotation in the EMU speech database management system. Speech Communication, 33(1-2):61–78.

Eisenberg, Peter, 1994. Grundriss der deutschen Grammatik, 3. Aufl. Stuttgart, Weimar: Metzler.

Grice, Martine, Stefan Baumann, and Ralf Benzmüller, in press. German intonation in autosegmental-metrical phonology. In Sun-Ah Jun (ed.), Prosodic Typology: Through Intonational Phonology and Transcription. OUP.

Hlavsa, Zdenek, 1975. Denotace objektu a její prostředky v současné češtině [Denotating of objects and its means in contemporary Czech], volume 10 of Studie a práce lingvistické [Linguistic studies and works]. Academia.

Marcus, Mitchell, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger, 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Human Language Technology Workshop. San Francisco, Morgan Kaufmann.

Müller, Christoph and Michael Strube, 2001. Annotating anaphoric and bridging relations with MMAX. In Proceedings of the 2nd SIGdial Workshop on Discourse and Dialogue. Aalborg, Denmark.

Passoneau, Rebecca, 1996. Instructions for applying discourse reference annotation for multiple applications (DRAMA). Draft.

Pierrehumbert, Janet and Julia Hirschberg, 1990. The meaning of intonational contours in the interpretation of discourse. In P.R. Cohen, J. Morgan, and M.E. Pollack (eds.), Intentions in Communication. MIT Press, pages 271–311.

Poesio, Massimo, Florence Bruneseaux, Sarah Davies, and Laurent Romary, 1999. The MATE meta-scheme for coreference in dialogue in multiple languages. In Marylin Walker (ed.), Proceedings of the workshops on "Towards Standards and Tools for Discourse Tagging" at the 37th Annual Meeting of the Association for Computational Linguistics (ACL). University of Maryland.

Poesio, Massimo and Renata Vieira, 1998. A corpus-based investigation of definite description use. Computational Linguistics, 24(2):183–216.

Prince, Ellen, 1981. Toward a taxonomy of given-new information. In Peter Cole (ed.), Radical Pragmatics. Academic Press, pages 223–256.

Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech, and Jan Svartik, 1985. A comprehensive grammar of the English language. London: Longman.

Weinrich, Harald, 1993. Textgrammatik der deutschen Sprache. Mannheim u.a.: Dudenverlag.
