From Non Word to New Word: Automatically Identifying Neologisms in French Newspapers

Ingrid Falk, Delphine Bernhard, Christophe Gérard
LiLPa - Linguistique, Langues, Parole
EA 1339, Université de Strasbourg
{ifalk,dbernhard,christophegerard}@unistra.fr

Abstract

In this paper we present a statistical machine learning approach to formal neologism detection going some way beyond the use of exclusion lists. We explore the impact of three groups of features: form related, morpho-lexical and thematic features. The latter type of features has not yet been used in this kind of application and represents a way to access the semantic context of new words. The results suggest that form related features are helpful at the overall classification task, while morpho-lexical and thematic features better single out true neologisms.

Keywords: neologisms, semi-automatic neologism detection, classification, French, topic modeling

1. Introduction

The work we present in this paper is concerned with the semi-automatic detection of French neologisms in online newspaper articles. The system we develop, called Logoscope, retrieves newspaper articles from several RSS feeds in French on a daily basis. Using exclusion lists it identifies unknown words, which are then presented to a linguist to decide which of these are valid neologisms. For example, Table 1 shows the most frequent unknown words, collected on July 12, 2013 and resulting from this procedure. The table also illustrates a major drawback of this method. Clearly most of these forms are not interesting neologism candidates: in many cases they are not even valid words and a linguist expert would have to tediously scan a large part of the list before finding interesting candidates.

lmd (18)            twitter/widgets (7)    india-mahdavi (3)
pic(this (18)       garde-à (6)            kilomètresc (2)
lazy-retina (9)     ex-PPR (4)             geniculatus (2)
onload (9)          pro-Morsi (4)          margin-bottom (2)
onerror (9)         tuparkan (4)           politique» (2)
amp;euro (7)        candiudature (3)       ...

Table 1: The most frequent unknown words collected on 2013-07-12. Word frequency is shown in parentheses.

In this study we investigate methods to select among the detected unknown forms those which most probably represent interesting neologism candidates. We address this task by casting it into a supervised classification problem. The classification is based on different types of features extracted from the collected newspaper articles. A further objective of this paper is to explore which kinds of features are most helpful at detecting the most interesting neologism candidates.

The paper is organized as follows. We first briefly present previous work and relate our study to these earlier efforts. We then detail the resources and methods our experiments are based on and finally present their results.

2. Previous Work

Methods for the automatic or semi-automatic identification of neologisms mainly target the coinage of new words or changes in part-of-speech (grammatical neologisms), while semantic neologisms are only seldom dealt with.

2.1. Identification of Grammatical Neologisms

Grammatical neologisms correspond to attested word forms which are used with a new part-of-speech. This is a rather rare phenomenon, which may nevertheless cause problems for POS taggers. Janssen (2012) describes NeoTag, a language independent tagger and lemmatizer, which takes grammatical neologisms explicitly into account. After tagging, the tagged corpus is compared with a lexicographic resource, in order to detect words whose categories are different in the corpus and in the resource.

Given that grammatical neologisms are scarce, we do not address this phenomenon for the time being, but rather focus on the detection of new word forms.

2.2. Detection of New Word Forms

For this task, two different types of methods may be distinguished:

- Methods based on lists containing known words in the target language, which are used to identify unknown words. These lists are usually called exclusion lists;
- Methods relying on various statistical measures or machine learning applied to diachronic corpora.

The use of exclusion lists is by far the most common method. For French, the POMPAMO tool (Ollinger and Valette, 2010) uses an exclusion list made of the French lexicon MORPHALOU 2.0 (Romary et al., 2004), a list of named entities and lexicons provided by the user. In addition to the lexicons, the tool uses filters which detect non-alphanumeric characters, numbers and composed word forms. Issac (2011) also describes several filters, aimed at eliminating unwanted neologism candidates not found in
the exclusion list. The first filter eliminates non-words by looking for bigrams and trigrams of characters which are not found in French. The second filter targets words which are concatenated due to missing spaces and looks for all the possible combinations in a dictionary. Finally, spelling errors are identified by finding corrections with the Levenshtein distance.

Systems relying on exclusion lists have also been developed for languages other than French. CENIT (Corpus-based English Neologism Identifier Tool) (Roche and Bowker, 1999) uses additional filters which aim at detecting proper nouns. For German, the Wortwarte platform [1] collects neologisms on a daily basis (Lemnitzer, 2012). All these methods rely on simple heuristics and require that the candidates be manually validated by an expert.

[1] http://www.wortwarte.de

Statistical measures can be employed when neologisms are studied from a diachronic point of view. Garcia-Fernandez et al. (2011) use the Google Books Ngrams to identify the date of coinage of new words, based on cumulative frequencies for a time period going from 1700 to 2008. The cumulative frequency curve of neologisms is exponential. New words are identified when this cumulative frequency exceeds a given threshold. Cabré and Nazar (2011) also exploit this property to identify neologisms in Spanish newspaper texts.

Finally, machine learning can be employed to automatically classify neologism candidates. Stenetorp (2010) presents a method to rank words extracted from a temporally annotated corpus. The features correspond to the number of occurrences of the word in the corpus, the distribution of its occurrences over time, the points in time when the word is observed, lexical cues (quotes, enclosing in specific HTML tags), the presence in dictionaries, and spell-checking.

The main limitation of diachronic methods is that they necessitate a corpus covering a large time span. Moreover, neologisms can only be detected some time after their first occurrence.

The work we present here is an attempt to combine the two main approaches. We use exclusion lists as filters but also apply machine learning techniques to select the most probable neologism candidates. Since we expect the users of our system to be interested in the creation of new words for different reasons and to have different views on the phenomenon, we did not rely on a very specific definition of neologisms. Moreover, our goal is to detect neologisms on their first occurrence, on a daily basis. As a consequence, we do not include diachronic features in our study.

2.3. Detection of Novel Word Senses

Here, the aim is to detect new word senses for attested word forms. It is only recently that computational (automatic or semi-automatic) approaches have been proposed for this task. One reason may be its difficulty: in many cases, new word senses may be rather infrequent and thus challenging to detect with corpus-based methods (Cook and Hirst, 2011; Lau et al., 2012). Several different strategies may be employed: (i) study compound word forms, (ii) rely on local lexico-syntactic contexts, or (iii) analyse the wider themes of the text.

For German, Lemnitzer (2012) proposes analyzing novel compound word forms to detect new senses of the base word forms. However, this idea has not been automated yet.

Cook and Hirst (2011) compare lexico-syntactic representations of word contexts in corpora corresponding to two time periods. They then use synthetic examples built from the Senseval dataset to evaluate their system for the identification of semantic change.

Contextual information is also used by Boussidan and Ploux (2011) who propose several clues for the detection of semantic change, relying on a co-occurrence based geometrical model. Their model is able to represent thematic clusters but the approach is not fully automatic. Following the same idea, Lau et al. (2012) apply topic modeling in order to perform word sense induction, and thus detect novel word senses. In this case, topic modeling is not applied to whole documents but to contexts representing target words. For the identification of novel word senses, two corpora are used: one reference corpus, to represent standard uses and known word senses, and a newer corpus with unknown word senses. Word senses are induced on both corpora and then words are tagged with their most likely sense. A novelty score is then computed, based on the senses observed in the reference corpus and in the newer corpus.

In our work, we do not focus on the detection of novel word senses, as we only target the identification of new word forms. Our work is nevertheless related to previous work in this domain in that we use topic modeling in order to represent the thematic context of unknown
word forms, under the assumption that some thematic contexts are more prone to the coinage of words than others.

3. Experimental Setting

3.1. The Data

For our experiments we collected a corpus of French RSS feeds, from the newspapers and on the dates shown in Table 2.

Total number of articles:         2,723
Newspapers:                       Le Monde (659), Libération (504), l'Équipe (594), Les Echos (956)
Dates:                            July 12, 16, 19, 23, 25, 30, August 1 2013
Total number of forms (tokens):   51,000
Unknown forms (types):            692
Valid forms (types):              265
Invalid forms (types):            427
True neologisms (types):          81

Table 2: Newspaper corpus collected for our experiments (in parentheses: the number of articles collected for each newspaper).

We preprocessed the corpus and only kept the journalistic content. The articles were then segmented in sentences and tokenized using the TinyCC tools. [2] Finally the tokens were filtered using an exclusion list based mainly on Morphalou (Romary et al., 2004) and Wortschatz (Biemann et al., 2004). We also detect Named Entities using a slightly modified version of the CasEN NER utility (Maurel et al., 2011).

Finally, we end up with a list of 692 unknown words after filtering. The list of unknown words was further manually categorized into the following classes:

- Plausible words, whose form is correct in French.
- Non words, which mainly correspond to spelling errors (e.g. candiudature), foreign words (e.g. lazy-retina) and source code which was not stripped from the HTML file (e.g. onload).
- Neologisms, which are the candidates our system should eventually extract.

Plausible words which are not neologisms correspond mainly to words which should be included in the exclusion lists. [3]

We performed two series of (supervised) classification experiments. For both, the items to be classified are the unknown words not found in the exclusion list (cf. Figure 1).

In one set of experiments the positive examples used to train the classifier are the plausible words (the negative examples being the remaining unknown words). This setting is called Plausible words in the following.

In the second set of experiments (called Neologisms) the positive examples used in training are the validated neologisms and the negative examples the remaining unknown words.

[2] http://wortschatz.uni-leipzig.de/~cbiemann/software/TinyCC2.html
[3] However, since it is not always clear whether a word is a neologism, arguably valid unknown words are worth keeping and observing.

3.2. Features

We explored the effect on the classification of three groups of features: form features, morpho-lexical features and thematic features, an overview of which is presented in Table 3.

The form features are the most obvious features to be used in such a classification task. They are related to the form or construction of the string at hand, and are language independent. Table 3a shows some examples of these features.

Table 3b gives an overview of the main morpho-lexical features. First, these features check whether particular prefixes and suffixes are present and whether some characters indicate particular languages. [4] Second, we assess the probability that the form might be a spelling error by using the aspell tool. [5] This tool suggests a list of known spellings close to the tested form. The feature we use is the Levenshtein distance to the best guess (or 100 if there is no suggestion). Based on the observation that unknown forms often arise from missing white space we use a further group of morpho-lexical features to check whether other known forms are possibly included in the form at hand. This group of features is derived from the results of the Aho-Corasick string matching algorithm (Aho and Corasick, 1975) [6] which suggests a list of known forms present in the unknown form at hand. Obviously these features depend on morpho-lexical characteristics of French.

Since one of the goals of the Logoscope project is to provide means for observing the creation of new words in an enlarged textual context we also focus on the influence of thematic features on the automatic identification of neologisms (Table 3c). Our hypothesis is that these features supply interesting additional information not provided by form related and morpho-lexical features. Besides the obvious Newspaper feature, we attempt to capture the thematic context using topic modeling (Steyvers and Griffiths, 2007). Topic models are based on the idea that documents are mixtures of topics, where a topic is a probability distribution over words. Given a corpus of documents, standard statistical techniques are used to invert this process and infer the topics (in terms of lists of words) that were responsible for generating this particular collection of documents. The learned topic model can then be applied to an unseen document and we can thus estimate the thematic content of this document in terms of the inferred topics.

In our experimental setting we use topic modeling as follows. We first assemble a set of general journalistic themes from a large collection of newspaper articles. Based on these topics we then estimate the thematic content of the larger textual context of the unknown words we investigate.

Several studies (Blei and Lafferty, 2009; Hoffman et al., 2010; Hoffman et al., 2013) show that in general tens or hundreds of thousands of documents are needed for a thorough thematic analysis of this kind and that the number of extracted topics lies between 100 and 300. However, for the preliminary study presented here we collected 4,755 articles from the newspapers shown in Table 3c and restricted the number of extracted topics to 10.

Each unknown form is associated with 20 topic features (2 features for each of the extracted 10 topics). The first set of 10 features is obtained by first concatenating the sentences containing the unknown form. We then apply the obtained topic model to this text and obtain topic proportions estimating its thematic content. The unknown form is then associated with these topic proportions, representing the weight of each topic in its phrasal contexts. To compute the values of the second set of 10 topic features we apply our topic model to each article containing the unknown form. For each topic, the unknown form is then associated with the maximal proportion for this topic found in the articles containing the form. Thus, these features represent the predominant topics of the articles containing the unknown forms.

[4] We used the Lingua::Identify perl script to this end: http://search.cpan.org/~ambs/Lingua-Identify-0.56/lib/Lingua/Identify.pm.
[5] http://aspell.net/
[6] We used the Perl implementation available at the url http://search.cpan.org/~vbar/Algorithm-AhoCorasick-0.03/lib/Algorithm/AhoCorasick.pm

(a) Examples of form related features.
    Length: number of characters.
    Whether the form contains particular signs, digits, whether it is capitalised, ...
    Relative and absolute frequency with respect to documents and sentences.

(b) Morpho-lexical features.
    Language:       Whether the form contains characters indicating a particular language (French, English, German or Spanish, 5 features).
    Prefix/suffix:  0 or 1 depending on whether a particular prefix or suffix is present.
                    Prefixes: ultra-, super-, dé-, ré-, ... (69 in total)
                    Suffixes: -iste, -ation, -isme, -itude, ... (30 in total)
    Spelling:       Spell-checker (aspell): Is it a spelling error? (2 features)
                    Aho-Corasick algorithm (Aho and Corasick, 1975): Does the form contain other known forms? (4 features)

(c) Thematic features.
    Topics:         10 topics extracted from 4,755 newspaper articles:
                    Le Monde (898), Libération (269), La Libre (1,784), Presse Europe (690), Le Journal du dimanche (206), Rue89 (212), Slate (74), L'Équipe (892)
                    Topic features: 10 features, proportion of topic in document
                        Documents                                     Feature value
                        articles containing form                      max. topic proportion
                        concatenation of sentences containing form    topic proportion
    Newspaper:      The newspaper(s) the form appeared in.

Table 3: Features

3.3. Methodology

The most straightforward way to cast our neologism detection problem into a supervised classification task is to consider the 81 validated neologisms as the positive examples and the remaining 611 unknown words as the negative examples in the training data. This is one of the classification scenarios (called Neologisms) we investigate using the three groups of features presented in Section 3.2.

We saw two reasons to also experiment with a second classification scenario. First, as mentioned earlier, it is not always clear when exactly an unknown word turns into a neologism. Second, the strongly imbalanced data suggests that the resulting classification model may possibly be inaccurate. In this second scenario (called Plausible words) we consider the 265 plausible words as positive training examples (and the remaining 427 as negative examples) and explore the impact of the different feature sets on the resulting classifications.

The two classification scenarios and the involved data items are illustrated in Figure 1.

Figure 1: Classification scenarios

3.4. Classifier

We used an SVM classifier as implemented in LibSVM (Chang and Lin, 2011) and the Weka toolkit (Hall et al., 2009).

In the Neologisms classification setting we accounted for the strongly imbalanced data by oversampling the positive class. [7] We then used the LibSVM classifier implemented in Weka with the default settings. [8] In the Plausible words classification scenario the class distribution is less imbalanced, so we used the resampling technique proposed in Weka. We also used a radial basis kernel and performed a grid search for the optimal cost and γ parameters. [9]

We perform a 10-fold cross-validation and report precision, recall and F-measure for each class separately and for both classes taken together. We also report the number of detected true neologisms.

When applying the Neologisms classification technique, the detected true neologisms are a direct result of the classification process. In contrast, the Plausible words classification scenario results in those unknown forms which are considered to be plausible words, but which are not necessarily true neologisms. Since ultimately we are interested in these, we also rank the found positive items using the probability estimates output by the LibSVM tool. We then evaluate how many of the true neologisms were among the 81 [10] words with highest probability.

[7] Oversampling is a classification technique which helps to deal with imbalanced data. The instances of the minority class are duplicated in order to obtain approximately as many instances as in the majority class. We used the Weka cost sensitive classifiers to achieve this.
[8] An exponential kernel with cost 1 and γ = 0.
[9] We used the scripts provided with the LibSVM tool (Chang and Lin, 2011) for this.
[10] Recall that there are 81 validated neologisms to be detected.

4. Results and Discussion

Results. Table 6 shows the results obtained in the two classification scenarios: Plausible words on the left hand side of Table 6 and Neologisms on the right hand side. We performed the classifications with combinations of the groups of features described in Section 3.2.: form related features, morpho-lexical features and thematic features. The results are given in terms of precision, recall and F-measure for each class and for the overall classification obtained through 10-fold cross validation. We also present the correctly identified neologisms: in the Neologisms scenario these are simply the items classified correctly as positive, whereas in the case of the Plausible words setting these are
the validated neologisms ranking in one of the top 81 positions.

First the results show that using the machine learning techniques presented here, the unknown words filtered by our system can be reordered and presented to a human expert in a more meaningful way. Table 4 shows a possible reordering produced by our system.

agroécologiste (0.961798)          Etat-départements (0.921507)
anti-alcoolisme (0.959942)         nationalistes-révolutionnaires (0.904493)
anti-salazariste (0.953199)        démission-suprise (0.891366)
non-audition (0.939236)            auto-obscurcissant (0.87521)
multiactivité (0.92963)            constructeur-carrossier (0.870512)
restaurant-snack-bar (0.925892)    ultra-présent (0.868557)

Table 4: Top entries in the reordered unknown word list obtained in the Plausible word setting with the lex feature set, the configuration which resulted in the largest number of true neologisms. In parentheses the probability that the unknown form is a plausible word.

All classifiers show a reasonable performance with respect to the measure which is most relevant for our application: the recall for the positive class, which reflects the number of identified true neologisms. [11]

Regarding the overall F-measure, the behaviour of both sets of classifiers was similar. The best results were achieved with attribute sets involving the form features (form and thematic features for the Plausible words classification and form attributes only for the Neologisms classification). For both scenarios the classifications with the best overall F-measure identified the fewest validated neologisms. The settings which provide the best balance between global F-measure and the number of identified neologisms are form+lex for Plausible words and form+lex+theme for Neologisms.

The results suggest that the features used in our experiments were sufficiently powerful to support the Neologism classification scenario, outweighing the unbalanced data and the difficulty of the classification task.

[11] While the overall F-measure in the Plausible words classification scenario is acceptable, it is lower in the case of the Neologism classification scenario. These results could possibly be improved by also learning the kernel parameters for the oversampled training data. However the absolute F-measure is less important in our setting since we are more interested in the impact of the attribute sets.

Groups of features. The morpho-lexical features proved to be helpful for neologism detection in both classification scenarios. Interestingly, while in the Plausible words setting form features also had a strong impact, in the Neologisms scenario it was the thematic features which played a more important role. First, this confirms our intuition that the thematic context is more helpful at the detection of new words than at the detection of plausible words: plausible words may appear in any context whereas this outcome suggests that thematic context is to some extent linked with word creation. This finding is in line with an important line of work in (Textual) Linguistics where word creation is found to correlate with certain discourse types and textual genres (see for example Cabré et al., 2003; Elsen, 2004; Elsen and Dzikowicz, 2005; Ollinger and Valette, 2010).

Second, this highlights the benefit of using topic modeling for a more compact representation of thematic content insofar as this representation gives some access to the semantic context of unknown forms going beyond the more limited co-occurrence windows. This aspect, as shown in Section 2, is rarely taken into account, theoretically or practically, by recent neologism detection utilities.

Qualitative discussion. For a qualitative analysis of the effect of the various types of features on the selection of valid new words we applied the classification models based on the form, lex, theme and form-lex-theme feature groups on our data and examined the number of correctly identified neologisms and, for each group, the five best scoring unknown words and the five best scoring validated neologisms. Table 5 shows the results of these experiments.

First we observe that the form features help identify particularly long neologisms, and those containing a hyphen (first line). The second line shows that the words scoring best with the lex features are mainly compositions (with or without hyphen) or contain a prefix (anti, non). With respect to the contextual features (theme) we observe on the positive side that they permit the detection of neologisms with no prominent property (agnélise, retricoté), but on the negative side these thematic features (theme) seem to favour the selection of words which are not plausible considering traditional formation rules for French word forms. A closer
look at the neologisms detected through the theme features but not via lex and form features confirmed their ability to select neologisms with less characteristic forms. Some neologisms identified by the theme classifier, but not by the lex and form classifiers, are: accrobranches, caricatureurs, conflicté, frenemies, instinctivores....

Features: form (# neos: 37)
    top 5 valid neologisms: supermédiateur, doublevédoublevédoublevé, auto-diagnostiqués, néo-célibataires, sur-monétisation
    top 5 (neologism?):     styliste-couturière (no), E-DÉTOURNEMENTS (yes), supermédiateur (yes), garde-à (no), doublevédoublevédoublevé (yes)
Features: lex (# neos: 48)
    top 5 valid neologisms: agroécologiste, multiactivité, auto-obscurcissant, neo-retraité, macrostabilité
    top 5 (neologism?):     agroécologiste (yes), anti-alcoolisme (no), anti-salazariste (no), non-audition (no), multiactivité (yes)
Features: theme (# neos: 48)
    top 5 valid neologisms: e-détournements, partenadversaires, hollandisme, retricoté, agnélise
    top 5 (neologism?):     tuitte (no), e-détournements (yes), schlopp (no), gesagt (no), schloppa (no)
Features: form-lex-theme (# neos: 60)
    top 5 valid neologisms: pagano-satanisme, auto-diagnostiqués, neo-retraité, agroécologiste, e-détournements
    top 5 (neologism?):     ultra-présent (no), Etat-départements (no), anti-alcoolisme (no), pagano-satanisme (yes), watts-étalons (no)

Table 5: Top 5 predictions when applying the model.

5. Conclusion and Future Work

In this paper we presented a supervised statistical machine learning method which helps at the identification of new words in online newspaper publications. First we identified words in online newspaper articles which were not present in an exclusion list. Using support vector machines these unknown words were then ranked according to the probability of their being valid neologism candidates, based on different features extracted from the newspaper articles.

Despite the relatively small amount of data, our experiments showed that in this way the unknown words could be rearranged in a more meaningful way (than randomly or by frequencies), thus facilitating an analysis by a linguistic expert or lexicographer.

We examined the impact on the classification outcome of three groups of features: form features, related to the construction of the word and language-independent; morpho-lexical features, based on lexical characteristics and therefore language dependent; and finally theme features, which are meant to reflect the larger textual context of the unknown word and have not yet been used to this end (to our knowledge).

The best F-measure for the global classification task (results for positive and negative classes combined) was achieved based on the form related features in conjunction with the theme features, showing the relevance of this type of attributes for neologism detection. However, the configuration with the best overall F-measure produced the fewest validated neologisms. [12] Most neologisms were identified using attribute sets involving lexical and thematic features. This highlights the importance of these features and suggests that the classification technique could be further improved by their better integration.

Since we found the "thematic" features to be of great importance for the neologism detection and documentation we plan to expand our work on detecting and documenting general purpose, journalistic themes, using the topic modeling techniques described in this paper. However, as mentioned in Section 3.2., for a thorough thematic analysis and documentation based on general journalistic themes we need to apply topic modeling on a much larger and more "representative" corpus of newspaper articles. In addition further investigation is needed to determine the influence of other parameters, as for example the vocabulary (which words are most meaningful in our setting and therefore are most relevant in the topic selection process) or what number of topics to choose. Further, for documenting the unknown words an interpretation (or labeling) of the topics would be important.

In future work we also plan to exploit other "textual" features which have been found in linguistic studies to play an important role in word creation. Some of these features are:

- The position of the unknown word in the text (Loiseau, 2012).
- The journalistic genre: our hypothesis is that some journalistic genres are more favourable to word creations than others. For this however first a (systematic) categorisation of genres and second a method is needed to better identify the journalistic genre of a newspaper text.

A further interesting research direction is to explore more fine-grained measures in order to better assess the influence of the different features on the classification result. In (Lamirel et al., 2013) the authors propose versatile feature analysis and selection methods which make it possible to improve the classification results independently of the classification method and to accurately investigate the influence of each feature on the classification process.

[12] While the number of identified validated neologisms corresponds to the recall in the case of the Neologism experiments, this is not the case for the Plausible words experiments.

Acknowledgements

We thank Romain Potier-Ferry for his contributions. This work was financed by an IDEX 2012 grant from the Université de Strasbourg.

6. References

Aho, Alfred V. and Corasick, Margaret J. (1975). Efficient string matching: an aid to bibliographic search. Commun. ACM, 18(6):333–340, June.

Biemann, Christian, Bordag, Stefan, Heyer, Gerhard, Quasthoff, Uwe, and Wolff, Christian. (2004). Language-Independent Methods for Compiling Monolingual Lexical Data. In Goos, Gerhard, Hartmanis, Juris, Leeuwen, Jan, and Gelbukh, Alexander, editors, Computational Linguistics and Intelligent Text Processing, volume 2945, pages 217–228. Springer Berlin Heidelberg.

Blei, David M. and Lafferty, J. (2009). Topic models. Text mining: classification, clustering, and applications, 10:71.

Boussidan, Armelle and Ploux, Sabine. (2011). Using topic salience and connotational drifts to detect candidates to semantic change. In Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011), Oxford.

Cabré, Teresa and Nazar, Rogelio. (2011). Towards a new approach to the study of neology. In Neology and Specialised Translation 4th Joint Seminar Organised by the CVC and Termisti.

Cabré, M. T., Domènech, M., Estorpà, R., Freixa, J., and Solé, E. (2003). L'observatoire de néologie: conception, méthodologie, résultats et nouveaux travaux. L'innovation lexicale, pages 125–147.
                            Plausible words                     Neologisms
Feature set        Class    Prec   Rec    F      corr. neos     Prec   Rec    F      corr. neos

form, lex, theme   pos      0.848  0.618  0.715                 0.181  0.827  0.297
                   neg      0.800  0.932  0.861                 0.958  0.512  0.667
                   both     0.818  0.813  0.806  24             0.868  0.548  0.625  67

form, lex          pos      0.822  0.565  0.670                 0.192  0.778  0.308
                   neg      0.776  0.925  0.844                 0.952  0.573  0.716
                   both     0.794  0.788  0.778  30             0.864  0.597  0.669  63

form, theme        pos      0.814  0.687  0.745                 0.160  0.531  0.346
                   neg      0.825  0.904  0.863                 0.912  0.638  0.751
                   both     0.821  0.822  0.818  22             0.826  0.625  0.693  43

form               pos      0.635  0.405  0.494                 0.190  0.481  0.273
                   neg      0.702  0.857  0.772                 0.915  0.733  0.814
                   both     0.676  0.686  0.666  20             0.832  0.704  0.752  39

lex                pos      0.777  0.466  0.582                 0.132  0.827  0.227
                   neg      0.737  0.918  0.818                 0.927  0.288  0.440
                   both     0.752  0.746  0.728  34             0.836  0.350  0.415  67

theme              pos      0.850  0.389  0.534                 0.129  0.889  0.225
                   neg      0.719  0.958  0.822                 0.938  0.217  0.353
                   both     0.769  0.742  0.712  22             0.844  0.295  0.338  72

lex, theme         pos      0.844  0.557  0.671                 0.136  0.877  0.236
                   neg      0.776  0.937  0.849                 0.945  0.275  0.426
                   both     0.802  0.793  0.781  29             0.851  0.345  0.404  71
Table 6: Classification results

Chang, Chih-Chung and Lin, Chih-Jen. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cook, Paul and Hirst, Graeme. (2011). Automatic identification of words with novel but infrequent senses. In PACLIC, page 265–274.

Elsen, Hilke and Dzikowicz, Edyta. (2005). Neologismen in der Zeitungssprache. Deutsch als Fremdsprache: Zeitschrift zur Theorie und Praxis des Deutschunterrichts für Ausländer, (2):80–85.

Elsen, Hilke. (2004). Neologismen: Formen und Funktionen neuer Wörter in verschiedenen Varietäten des Deutschen. Gunter Narr Verlag.

Garcia-Fernandez, Anne, Ligozat, Anne-Laure, Dinarelli, Marco, and Bernhard, Delphine. (2011). Méthodes pour l'archéologie linguistique: datation par combinaison d'indices temporels. Actes du septième DÉfi Fouille de Textes.

Hall, Mark, Frank, Eibe, Holmes, Geoffrey, Pfahringer, Bernhard, Reutemann, Peter, and Witten, Ian H. (2009). The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, November.

Hoffman, Matthew D., Blei, David M., and Bach, Francis R. (2010). Online Learning for Latent Dirichlet Allocation. In NIPS, pages 856–864.

Hoffman, Matthew D., Blei, David M., Wang, Chong, and Paisley, John. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347.

Issac, Fabrice. (2011). Cybernéologisme: Quelques outils informatiques pour l'identification et le traitement des néologismes sur le web. Langages, 183:89–104.
Janssen, Maarten. (2012). NeoTag: a POS Tagger for Grammatical Neologism Detection. In LREC, page 2118–2124.

Lamirel, Jean-Charles, Cuxac, Pascal, Hajlaoui, Kafil, and Chivukula, Aneesh Sreevallabh. (2013). A new feature selection and feature contrasting approach based on quality metric: application to efficient classification of complex textual data. In International Workshop on Quality Issues, Measures of Interestingness and Evaluation of Data Mining Models (QIMIE), Australie, April.

Lau, Jey Han, Cook, Paul, McCarthy, Diana, Newman, David, and Baldwin, Timothy. (2012). Word sense induction for novel sense detection. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, page 591–601.

Lemnitzer, Lothar. (2012). Mots nouveaux et nouvelles significations – que nous apprennent les mots composés? Cahiers de lexicologie. Néologie sémantique et analyse de corpus, 1(100):105–116.

Loiseau, Sylvain. (2012). Un observable pour décrire les changements sémantiques dans les traditions discursives: la tactique sémantique. Cahiers de lexicologie, 1(100):185–199.

Maurel, Denis, Friburger, Nathalie, Antoine, Jean-Yves, Eshkol, Iris, and Nouvel, Damien. (2011). Cascades de transducteurs autour de la reconnaissance des entités nommées. Traitement Automatique des Langues, 52(1):69–96.

Ollinger, Sandrine and Valette, Mathieu. (2010). La créativité lexicale: des pratiques sociales aux textes. In Actes del I Congrés Internacional de Neologia de les llengües romaniques, volume Publicacions de l'Institut Universitari de Lingüística Aplicada (IULA) de la Universitat Pompeu Fabra (UPF), pages 965–876, Barcelona, Spain.

Roche, Sorcha and Bowker, Lynne. (1999). Cenit: Système de détection semi-automatique des néologismes. Terminologies nouvelles, (20):12–16.

Romary, Laurent, Salmon-Alt, Susanne, and Francopoulo, Gil. (2004). Standards going concrete: from LMF to Morphalou. In Workshop Enhancing and Using Electronic Dictionaries, Geneva, Switzerland.

Stenetorp, Pontus. (2010). Automated Extraction of Swedish Neologisms using a Temporally Annotated Corpus. Master's thesis, Royal Institute of Technology (KTH), Stockholm, Sweden, March.

Steyvers, Mark and Griffiths, Tom, (2007). Probabilistic Topic Models. Lawrence Erlbaum Associates.
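As a closing illustration, the two evaluation measures behind Table 6 can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the paper does not spell out its F-measure formula, so the balanced F1 score is assumed here, and the function names (`f_measure`, `rank_by_probability`, `neologisms_in_top_k`) are hypothetical helpers, not part of LibSVM or Weka. The toy scores are the first five entries of Table 4, and the set of validated neologisms follows the yes/no labels of the lex row in Table 5.

```python
from typing import Dict, List, Set, Tuple


def f_measure(precision: float, recall: float) -> float:
    """Balanced F1 score (assumed): harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def rank_by_probability(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order unknown forms by decreasing plausible-word probability."""
    # Ties are broken alphabetically so that the ranking is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def neologisms_in_top_k(scores: Dict[str, float], validated: Set[str], k: int) -> int:
    """Count validated neologisms among the k highest-ranked forms."""
    top_k = rank_by_probability(scores)[:k]
    return sum(1 for form, _ in top_k if form in validated)


if __name__ == "__main__":
    # Positive class with form+lex+theme features (precision/recall from Table 6).
    print(round(f_measure(0.848, 0.618), 3))  # Plausible words: 0.715
    print(round(f_measure(0.181, 0.827), 3))  # Neologisms: 0.297

    # In the Neologisms scenario, recall times the 81 validated neologisms
    # gives the number of correctly identified neologisms.
    print(round(0.827 * 81))  # 67

    # Plausible words scenario: rank by probability, then count validated
    # neologisms near the top (the paper uses k = 81 on the full list).
    scores = {
        "agroécologiste": 0.961798,
        "anti-alcoolisme": 0.959942,
        "anti-salazariste": 0.953199,
        "non-audition": 0.939236,
        "multiactivité": 0.92963,
    }
    validated = {"agroécologiste", "multiactivité"}
    print(neologisms_in_top_k(scores, validated, k=3))  # 1
```

The ranking step is what separates the two scenarios: in the Neologisms setting the count of correct neologisms follows directly from recall, whereas in the Plausible words setting it depends on where the validated neologisms fall in the probability ordering.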
