Johnson, A. and Wright, D. - Identifying idiolect in forensic authorship attribution


Language and Law / Linguagem e Direito, Vol. 1(1), 2014, p. 37-69


[...] from the theoretical assumption that every native speaker of a given language has their own distinct and individualised version of the language [...], their own idiolect" (Coulthard, 2004: 31). However, given the difficulty of sustaining a theory of idiolect empirically, there is growing concern in the field that the concept is too abstract to be of practical use (Kredens, 2002; Grant, 2010; Turell, 2010). Stylistic, corpus and computational approaches to text, however, make it possible to identify repeated collocational patterns, or n-grams, linguistic fragments of between two and six words, similar to the familiar notion of soundbites: small segments of just a few seconds of speech that journalists are able to identify as having news value and that characterise important moments of talk. The soundbite offers a fascinating parallel for authorship attribution studies, raising the following question: looking at any set of texts by a given author, is it possible to identify "n-gram textbites", small segments of text that characterise the author's writing, providing DNA-like segments of identifying material? Drawing on a corpus of 63,000 emails and 2.5 million words written by 176 employees of the former American energy company Enron, we carry out a case study which shows, first, by means of a stylistic analysis, that one Enron employee repeatedly produces the same stylistic patterns of politely encoded requests, in a way that can be considered habitual. Next, a statistical experiment using the same case study author reveals that word n-grams can attribute anonymised email samples to that author with success rates in the order of 100%. This article argues that, when sufficiently distinctive, these textbites are able to identify authors by reducing a massive volume of data to key segments that move us closer to the elusive concept of idiolect.

Keywords: Authorship attribution, email, Enron, idiolect, Jaccard, n-grams, style, textbites.

Introduction

Journalists listening to live speech are able to single out soundbites, small segments of no more than a few seconds of speech that they recognise as having news value and which characterise the important moments of talk. Though 'soundbite syndrome' (Mazzoleni and Schulz, 1999: 251) is seen as a reductionist trend representing a "tendency towards shorter and more sensational texts or 'soundbite news'" (Knox, 2007: 28), it offers an intriguing parallel, thinking laterally, for authorship attribution studies. The following question arises: looking at any set of texts by any author, is it possible to identify 'n-gram textbites', small two to six word portions of text that characterise the writing of that author, providing DNA-like chunks of identifying material? Using the DNA metaphor, if textual chunks or 'textbites' (Ancu, 2011) are sufficiently distinctive, they may be able to identify authors by reducing the mass of words to meaningful and key segments that move us closer to the abstract and elusive concept of a 'linguistic fingerprint', which Coulthard (2004: 432) describes as "the linguistic 'impressions' created by a given speaker/writer [which] should be usable, just like a signature, to identify them". Coulthard (2004: 432) eschews the DNA metaphor as "unhelpful" and "impractical" because of the need for "massive databanks [...]
[...] aim relies on the corpus as a large-scale reference corpus representing a population of writers against which an individual's style can be compared, while the computational attribution experiment in the third research aim uses the Enron corpus as a large pool of 176 candidate authors. As a result of the triangulated approach, we are able to compare the results of the different methods to see whether they are complementary and confirmatory, to evaluate the accuracy and reliability of both approaches and underline the effectiveness of n-grams as important markers of authorship. Finally, we can consider the implications for authorship studies and methods for use in evidential cases.

N-grams, idiolect and email

The co-occurrence of lexical items within a short space of each other is known in corpus linguistics as 'collocation', and was first introduced by Firth (1957), later developed by Sinclair (1991), Stubbs (2001) and Hoey (2005) amongst others, and continues to be at the forefront of corpus linguistic research (Gries, 2013). The associations that people build between words and the ways in which they produce them in combinations are a psycholinguistic phenomenon, and have been analysed in terms of 'lexical phrases' (Nattinger and DeCarrico, 1992), 'formulaic sequences' (Wray, 2002, 2008), 'lexical priming' (Hoey, 2005; Pace-Sigge, 2013), and usage-based theories of lexico-grammar such as exemplar theory (Barlow, 2013). One factor which these different approaches have in common is that they all emphasise the personal or 'idiolectal' nature of preferences for certain word combinations and collocational patterns. Schmitt et al. (2004: 138), discussing formulaic sequences, argue that "it seems reasonable to assume that they [people] will also have their own unique store of formulaic sequences based on their own experience and language exposure". Similarly, Hoey (2005: 8–15), in his argument that collocation can only be accounted for if we assume that every word is primed for co-occurrence with other words, grammatical categories or pragmatic functions, claims that "an inherent quality of lexical priming is that it is personal" and that "words are never primed per se; they are only primed for someone". He goes on to explain that:

Everybody's language is unique, because all our lexical items are inevitably primed differently as a result of different encounters, spoken and written. We have different parents and different friends, live in different places, read different books, get into different arguments and have different colleagues. (Hoey, 2005: 181)

Collocations and sequential strings of lexical words are referred to in linguistics by a wide range of different names, such as 'concgrams', 'flexigrams', 'lexical bundles', 'multi-word expressions', 'prefabricated phrases', 'skipgrams' (Nerlich et al., 2012: 50). In authorship attribution, Juola (2008: 265) refers to them simply as word n-grams (lexical strings of n words) and describes them as a means by which to take advantage of vocabulary and syntactic information in texts and an effective way of capturing words in context. Collocations and lexical strings in this study are referred to as n-grams and we also consider the extent to which they can be called 'n-gram textbites', small portions of text that characterise the writing of a particular author.

The idiolectal nature of collocations has been investigated, to a limited extent, in corpus linguistics. Mollin (2009) used a corpus-based statistical approach to analysing idiolectal collocations in the text of former UK Prime Minister, Tony Blair, and Barlow (2010, 2013) examined the relative frequencies of two- and three-word sequences used by [...]
[...] in which individual authors express this particular speech act, and how this speech act gives rise to author-unique collocational preferences and identifiable n-grams or n-gram textbites. Since requests and orders are difficult to differentiate, given that both have the function of getting someone to do something, we follow Bax's (1986: 676) distinction, where in a request "the requesting person benefits from the future act" and there is a "reciprocal social relation" between the interactants, whereas in a directive "the person does not necessarily have to benefit from [the act]" and the addressee is in an "inferior social relation".

Data and method

This section introduces the Enron email corpus, the particular version we created as appropriate for authorship research, and the case study of one employee, James Derrick. We evaluate the combined case study, corpus comparison, and experimental evaluation approach and we introduce the statistical (Jaccard) and computational (Jangle) tools used and created for this research.

The Enron email corpus

The corpus used for this study is a dataset of 63,369 emails and 2,462,151 tokens written and sent between 1998 and 2002 by 176 employees of the former American energy company Enron. The Enron email data was first made publicly available online as part of the Federal Energy Regulatory Commission's legal investigation into the company's accounting malpractices (FERC, 2013), which led to the ultimate bankruptcy and demise of the company in the early 2000s. Many versions of the data have emerged across the web. The source of the database used in this study is the version collected and prepared by Carnegie Mellon University (CMU) (Cohen, 2009), as part of its 'Cognitive Assistant that Learns and Organises' (CALO) project. The data have subsequently been extracted, cleaned-up, and prepared specifically for the purposes of authorship attribution by Woolls (2012). The extraction process mined all of the sent emails from all of the various 'sent' folders for each of the authors, retaining only the newest email material in each thread, and removing any previous email conversation, to ensure that only the material written by the sender was included and
available for analysis. This process was vital, to create a corpus suitable for authorship research, because, as it stands, the Enron corpus (Cohen, 2009) is unsuitable for this purpose. Each email in the corpus is accompanied by a range of metadata: the date and time the email was sent, along with the 'From:', 'To:', 'Subject:', 'Cc:' and 'Bcc:' fields, and the subject line (Example 1). In the cleaned-up corpus metadata is contained in angle brackets so that it is not considered by the computational tools used to analyse the authors' textual choices in this study.

The data were further cleaned by Wright, in particular where the authors' sent folders contained emails sent by their assistants or secretaries. Such emails were relatively easy to identify and remove and, in cases in which there was more than one email sent from the same assistant, these emails were extracted and saved as belonging to this assistant, creating a separate set of files for this individual and treating them as an additional author. Finally, blank emails and emails containing only forwarded or copied and pasted material were also removed. The resulting 176 author, 63,369 email and 2,462,151 word corpus is a gold-standard corpus, making it particularly useful for authorship studies. It contains only naturally occurring email language data about which we can be sure of the 'executive author' (Love, 2002: 43), that is, the individual responsible for "formulating the expression of ideas and ma[king] word selections to produce the text" (Grant, 2008: 218). Since digital texts such as text messages, instant messages, and emails are becoming increasingly prominent in forensic casework, including email cases containing threatening, abusive, or defamatory material (e.g. Coulthard et al., 2011: 538), this corpus represents a unique opportunity for empirical research with implications for authorship attribution casework.

James Derrick

The analyses focus on a case study of one Enron employee, James Derrick. Case studies are a beneficial method when "'how' or 'why' questions are being posed" and when particular "phenomena" are being studied (Yin, 2009: 2) and, in this case, we want to know how an individual makes directives. The case study approach is complemented by a corpus linguistic approach, which allows us to examine the uniqueness of his choices against the larger Enron population. Derrick was an in-house lawyer at Enron (Creamer et al., 2009; Priebe et al., 2005), in fact Enron's chief lawyer and General Counsel, or chief legal officer, with a staff of "200 in-house lawyers" and able to call on "more than 100 outside law firms from around the world" between 1991 and 2002 (Ahrens, 2006). He was an Adjunct Professor of Law at the University of Texas Law School between 1984 and 1990, and is currently a managing partner in a US law firm. He is represented in the corpus by 470 emails, a total of 5,902 tokens and 911 types. He was chosen as a case study because of the relatively small amount of data in his sent box. Although he has slightly more emails than the mean per author for the corpus (mean = 360), he has far fewer than the mean tokens per author (mean = 13,989). Derrick is therefore a curious case in that his emails are much shorter than average, but this is perhaps unsurprising given his status as chief lawyer and awareness of legal discovery [1]. He has a mean of 12.9 words per email, while the mean for the corpus is 41.74. Further, 155 of his 470 emails (32.9%) contain only one word. These single-word emails contain two types: 'FYI' (for your information, 22.5%) and 'FYR' (for your reference/review/records, 10.4%). These most minimal of messages are sent chiefly to one addressee for each type. 67% of the 'FYI' emails are to j.harris@enron, one of the lawyers in Derrick's team, and all of the 'FYR' single-word messages are sent to c.williams@enron. The relatively small amount of data available for Derrick, in theory, makes any analysis of style and any subsequent attribution tasks more difficult than it would be with an author with more data (Grant, 2007; Koppel et al., 2013; Luyckx and Daelemans, 2011). The advantage of using a case study for authorship research is twofold. First, the small amount of data presents a similar challenge to that in real cases where data is often limited. And second, while a case study allows [...]
[1] Legal discovery (called disclosure in the UK) requires the defendant (in a criminal case) or adverse party (in a civil case) to disclose anything that is asked for by the other side, which is needed in the preparation of the case prior to trial. This can include electronic documents such as email.

[...] word n-grams ranging from single words through to six words in length. In a way which captures semantic as well as lexical and grammatical information, the n-grams used are sequential strings of words which co-occur within punctuated sentence boundaries, as opposed to a 'bag-of-words' approach, which does not take into account word order or punctuation. Whereas Derrick's samples are controlled for size, the comparison sets of the other Enron employees are not, and they range from as small as two emails and twenty tokens (Gretel Smith) to as large as 3,465 emails (Sara Shackleton) and 170,316 tokens (Jeff Dasovich). There is a total pool of 176 candidate authors used in this experiment (including Derrick), and this is relatively large in stylometric terms, with others using three (Grant, 2007), six (Juola, 2013), 10 (Rico-Sulayes, 2011: 58–9), 20 (Zheng et al., 2006: 387), 40 (Grieve, 2007: 258) and 145 (Luyckx and Daelemans, 2011: 42). There are exceptions such as Koppel et al. (2006, 2011), however, which use open candidate sets of thousands of potential authors. Overall though, the combination of small sample sizes and a large number of candidate authors makes the attribution task in this study a relatively difficult one.

Case study of James Derrick

This case study employs two different approaches to analysing James Derrick's use of n-grams. The first approach (in the subsection Identifying Derrick's professional authorial style) is a corpus stylistic one, which examines Derrick's professional authorial style, in particular through the distinctive ways in which he constructs directives (Example 2 below). He habitually uses email-initial 'please' as well as a final 'thank you' in appreciation of anticipated compliance, making these mitigated directives.
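
The sentence-bounded word n-grams described in the experimental setup above can be sketched in a few lines of Python. This is a minimal illustration and not the Jangle implementation: the function name `sentence_ngrams`, the tokenising regular expression, and the choice of full stops, question marks, exclamation marks and line breaks as sentence boundaries are all assumptions made for the example.

```python
import re
from typing import Set, Tuple


def sentence_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams that lie wholly inside one sentence."""
    grams: Set[Tuple[str, ...]] = set()
    # Assumption: ., !, ? and line breaks mark sentence boundaries.
    for sentence in re.split(r"[.!?\n]+", text.lower()):
        # Assumption: a word is a run of letters, digits or apostrophes.
        words = re.findall(r"[a-z0-9']+", sentence)
        # Slide a window of n words; it never crosses a sentence boundary.
        for i in range(len(words) - n + 1):
            grams.add(tuple(words[i:i + n]))
    return grams


# Hypothetical email text, used only to show the behaviour.
bigrams = sentence_ngrams("Please print the attachment. Thank you.", 2)
print(sorted(bigrams))
```

Because the window is applied sentence by sentence, the pair ("attachment", "thank") is never produced, which is the difference from a bag-of-words or unbounded sliding-window approach.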
The second part of the case study (in the Attribution task subsection) uses a statistical and quantitative approach in the form of an attribution experiment using the Jaccard measure described above (in the Experimental setup section). Finally, the section on 'Derrick's discriminating n-grams' compares the results of these two approaches in terms of their reliability, complementarity, and compatibility.

Identifying Derrick's professional authorial style

A starting point for any corpus analysis of authorial style is to create a frequency word list and then a keyword list (Hoover, 2009), to identify those words "whose frequency is unusually high [in a particular dataset] in comparison with some norm", because "keywords provide a useful way to characterise a text or genre" (Scott, 2010: 156). First, the Enron dataset is compared with the 450-million-word Corpus of Contemporary American English (COCA) (Davies, 2012), to identify which words are key in the Enron corpus. Second, Derrick's data are then compared with the whole Enron corpus, to identify which words are key in his emails, when tested against the Enron population from which he is drawn. Wordsmith Tools (Scott, 2008) was used to create a word list. In any word list, function words are the most common, and that is the case in the Enron corpus, as we see in Table 2, column 1 ('the', 'to', 'I', 'and', 'a', 'you', 'of', 'for', 'is', and 'in' make up the top 10). The most common lexical words, therefore, more usefully characterise a text than function words. In the Enron corpus the top two lexical words are 'thanks' and 'please' (Table 2, column 1), giving a clear indication that the Enron dataset is representative of a community of practice in which interlocutors are generally linguistically polite to one another, and also suggesting that requests or mitigated directives are this community's principal speech act. Indeed, 'please' and 'thanks' are found in 165 and 164 of the 176 employees' emails respectively. This is made even clearer by a keyword analysis, which finds that 'thanks' and 'please' are the top two keywords in the entire Enron corpus (Table 2, column 2). In turn, a second keyword analysis comparing Derrick's emails with the emails of the other 175 authors (Table 2, column 3), finds that 'thank (you)' is the most significant keyword and 'please' is the seventh most important in his emails. More importantly, in terms of the proportion of Derrick's vocabulary, 'thank' accounts for 2.24% of his vocabulary, whereas it accounts for 0.05% of the Enron vocabulary, and 'please' accounts for 1.85% against 0.46%. These two words, therefore, account for a total of 4.1% of Derrick's vocabulary, compared to 0.5% for the Enron authors in general, indicating that even within this corpus, which is generally indicative of very polite discourse, Derrick appears exceptionally linguistically polite.

Table 2. Enron top 25 words and keywords, compared with Derrick's top 25 keywords.

Figure 1. Rate per 1,000 words of 'please' in the Enron corpus (top 5 users), along with dispersion.

This is further supported by Figure 1, which shows the top five users of 'please' in the corpus. Derrick is the fifth most frequent user of 'please' among the 176 employees, but looking at the dispersion rates in Figure 1 (the rate at which 'please' is dispersed across the author's text) (Figure 1, column 6), and the number of words in each file (column 3), we can see that employees one to four do not really count, since their file size is well below 500 words and dispersion across their emails is generally low (apart from in the file for Akin: Akin-l-Dedup[Edited].txt). This makes Derrick arguably the most polite employee in the corpus, or at least the person who makes most politely mitigated directives.

Furthermore, the keyword analysis of the whole Enron corpus shows that 'thanks' (top ranked keyword in Table 2, column 2) is far more popular than 'thank' (ranked 444th keyword, with Derrick alone accounting for 10% of its frequency). In contrast, Derrick's usage is entirely the reverse of this; his first keyword is 'thank' (which occurs with 'you' in all 132 instances), while 'thanks' is used only ten times and is a negative keyword in his dataset, meaning that he uses it significantly less than all of the other authors in the Enron corpus. To put this another way, in the Enron corpus 'thanks' is more than 11 times more frequent than 'thank you', whereas for Derrick 'thank you' is 15 times more frequent than 'thanks'. Examples 3 and 4 are typical of the
way that Derrick uses 'please' and 'thank you', using pre-posed 'please' to serve a politeness function and mitigate the face threat in the directive (Bax, 1986: 688). He follows up with 'thank you', assuming compliance, punctuated with a full stop and followed by his name. As such, not only is Derrick a linguistically polite communicant, but these quantitative and preliminary qualitative results indicate that he is polite in such a way that is very distinctive in this large population of Enron writers. In addition, his use of 'Thank you' + full stop, to follow up his directive, indicates his status and authority to make these orders.

Given the obvious frequency, saliency and significance of linguistic politeness in the use of 'please' and 'thank(s)' in the Enron corpus, and in particular in Derrick's emails, we might say that this marks the construction of politeness and politeness strategies as an important aspect of his professional authorial style within the Enron corporation, a phenomenon which we deal with in detail below.

Please and other politeness phenomena

The most effective way of analysing Derrick's distinctive construction of politely mitigated directives is through taking a collocational approach. Wordsmith Tools was used to identify the collocates that Derrick co-selects with 'please' (Figure 2). The first point to note is that of Derrick's 109 'please' occurrences, 69 (63%) appear in message-initial position (after the greeting or no greeting). This is a marked use of 'please' in relation to the Enron corpus; of the 10,952 instances of 'please' that are found in the emails of the other 175 authors, only 2,092 (19%) are message-initial across 118 of the authors (e.g. Example 5), though this pattern is apparent in Table 3.
Figure 2. Concordance of 'I would appreciate your verb-ing' in the Enron corpus.

Instead, as well as using 'please' within the body of the email, the other authors in the Enron corpus often modalise the directive with 'can', 'could', 'would', or 'will' (see the L2 collocates in Table 3: 'can you please call'; 'could you please send'; 'will you please review'). Derrick never uses 'can', 'could', or 'will' to modalise directives, though he once uses 'would' (Example 6). In this case, though, he grammatically marks the sentence as a question, signalling his lack of confidence that it can be carried out. Therefore, the use of unmodalised 'please' directives [...]

[...] emails, as a proportion of all 109 instances of 'please' in his dataset. The greyed out cells in the Enron column show that, in addition to 'please handle. Thank you.', six additional tri- and four-grams are unique to Derrick: 'please print the message(s)', 'please see the proposed', 'please format and (print)', 'please format the attachment', and 'please proceed with your'. All the rest are distinctive of him, apart from 'please let me' and 'please let me know' (discussed above), having much higher frequencies for Derrick than the reference corpus generally. To illustrate Derrick's distinctiveness, for example, in the case of 'please print the', which he uses more than 100 times more than the other authors (32.11% versus 0.29%), other authors use 'this', 'and', and 'attachment' more as collocates of 'please print'.

Table 5. Derrick's recurring three and four word n-grams starting with 'please'.

Although the stylistic analysis has clearly highlighted distinctive patterns of encoding polite directives in Derrick's emails, a further examination is even more revealing. For example, in 29 of the 32 instances of 'please print the attachment(s)', he is entirely consistent in following this with 'thank you' (Example 13). In contrast, none of the seven instances in the rest of the Enron data is followed by 'thank you' or indeed any sign-off at all, with 'for me' and 'thanks' being the most recurrent patterns (Examples 14 and 15). As such, as with 'please handle' above, 'please print the attachment(s)' + 'thank you' is a pattern unique to Derrick in this corpus.
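
The proportions discussed around Table 5 express each n-gram beginning with 'please' as a share of all occurrences of 'please' in an author's emails. A rough Python sketch of that calculation follows. The function name `head_ngram_shares`, the tokenisation, and the three sample emails are hypothetical and serve only to illustrate the arithmetic; they are not taken from the Enron data or from Wordsmith Tools.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple


def head_ngram_shares(
    emails: Iterable[str], n: int, head: str = "please"
) -> Dict[Tuple[str, ...], float]:
    """Share of all `head` occurrences accounted for by each n-gram starting with `head`."""
    counts: Counter = Counter()
    head_total = 0
    for email in emails:
        # Assumption: n-grams do not cross sentence-final punctuation or line breaks.
        for sentence in re.split(r"[.!?\n]+", email.lower()):
            words = re.findall(r"[a-z0-9']+", sentence)
            for i, word in enumerate(words):
                if word != head:
                    continue
                head_total += 1  # every occurrence of the head word counts
                gram = tuple(words[i:i + n])
                if len(gram) == n:  # skip windows cut short by the sentence end
                    counts[gram] += 1
    if head_total == 0:
        return {}
    return {gram: count / head_total for gram, count in counts.items()}


# Invented messages for illustration only.
emails = [
    "Please print the attachment. Thank you.",
    "Please handle. Thank you.",
    "Please print the message.",
]
shares = head_ngram_shares(emails, 3)
print(shares[("please", "print", "the")])
```

Dividing by the total number of 'please' occurrences, rather than by the number of trigrams, keeps the figures comparable between an author with little data and the much larger reference corpus.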
Similarly, all seven of his uses of 'please see the message(s)' are followed by 'below' (Examples 16 and 17). However, although this four-gram appears four times in the remaining Enron data, only three of them are followed by 'below', each in a different author (Example 18).

[...] The significance of this is that although all 176 authors in the Enron corpus share the same email mode of communication, the explicit references to such shared features that Derrick makes are framed in such a way that is distinctive of his style. However, most importantly, the Enron corpus is a very polite one; 'please' was the second most common keyword in the corpus (Table 2), being used 11,061 times by 165 of the 176 employees. Despite this, the collocational analysis has found politeness structures that are particularly distinctive of Derrick's style, such as the indirect directive with the gerund ('I would appreciate your verb+ing'), when compared against this very polite corpus, a population of writers whom one may assume would be most stylistically similar to him. Moreover, Derrick's choices become individuating very quickly; even some bigrams are distinctive of his emails, and, by the time the lexical string is three or four words long, most of the patterns are unique to him. This is considerably shorter than the required ten word strings postulated by Coulthard (2004) to be unique, and even shorter than the six reported by Culwin and Child (2010). Although these two studies used the web as corpus (a far larger comparison corpus), it can be argued that the specificity and relevance of the Enron corpus used here makes the comparison equally valid. Regardless, using a stylistic approach, and using the Enron corpus as a reference dataset, these results provide evidence to suggest that bigrams, trigrams, and four-grams have strong discriminatory potential for authorship analysis, particularly when focused on a particular function of language use, in this case the polite encoding of directives. The implication of this is that n-grams may offer a powerful means of attributing the authorship of disputed texts, even when applied in a pool of candidate writers writing within the same mode of communication and within the same community of
practice. The remainder of the case study which follows sets out to test this hypothesis in an authorship attribution task.

Attribution task

This section reports the results of the attribution experiment outlined in the Experimental setup section. Ten random samples of 20%, 15%, 10%, 5% and 2% of Derrick's emails (ranging between 1,220 and 55 tokens in size) were extracted from his set and compared with the remainder of his emails and all of the emails of the other 175 authors in the corpus, giving a total of 176 candidate authors. Similarity between the sample sets and the comparison sets is measured by Jaccard's similarity coefficient and in terms of all of the unigrams, bigrams, trigrams, four-grams, five-grams and six-grams that are found in both the sample and comparison texts, as a proportion of all n-grams of that size in the two sets combined. The expectation is that because Derrick authored the samples, the remainder of his emails should be most similar to them, and should therefore obtain the highest Jaccard similarity score of all 176 comparison sets and candidate authors in every test. In the analyses that follow, the accuracy, success and reliability of the different n-grams in attributing the various sample sets to Derrick are evaluated in two ways:

i. 'Raw attribution accuracy'. The most straightforward way of assessing success is by the number of correct attributions each n-gram achieves when applied to the different sample sizes. In any given test, if Derrick's remaining emails achieve the highest Jaccard score as compared to the samples of the other 175 authors, then attribution has been successful; if they do not, attribution has been unsuccessful.

ii. 'Mean Jaccard score'. The second way involves considering the mean Jaccard scores obtained by all 176 candidate authors over the ten tests for each sample size using the different n-gram types. This is an important measure given that, although Derrick [...]
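
The Jaccard comparison described in this section (n-grams found in both sets as a proportion of all n-grams in the two sets combined, that is |A ∩ B| / |A ∪ B|) and the highest-score attribution rule can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Jangle program: the function names `jaccard` and `attribute`, the author labels, and the tiny n-gram sets are all hypothetical.

```python
from typing import Dict, Hashable, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard coefficient: shared items divided by all distinct items in both sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as having no similarity
    return len(a & b) / len(union)


def attribute(sample: Set[Hashable], candidates: Dict[str, Set[Hashable]]) -> str:
    """Return the candidate whose n-gram set is most similar to the sample."""
    return max(candidates, key=lambda author: jaccard(sample, candidates[author]))


# Hypothetical bigram sets standing in for a disputed sample and two candidates.
sample = {("please", "print"), ("print", "the"), ("thank", "you")}
candidates = {
    "author_a": {("please", "print"), ("print", "the"), ("thank", "you"), ("the", "attachment")},
    "author_b": {("can", "you"), ("you", "please"), ("please", "print")},
}
print(jaccard(sample, candidates["author_a"]))  # 3 shared of 4 distinct -> 0.75
print(attribute(sample, candidates))
```

Because the coefficient works on sets of n-gram types rather than token counts, a phrase repeated many times in one author's emails counts only once, and a large comparison set enlarges the union and so tends to lower the score.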
[...] bigrams, trigrams, four-grams and five-grams when applied to 10% samples, and the results for all of the six n-gram measures when tested on the 15% and 20% samples. What is most encouraging about the results in Figure 4 is that the number of accurate attributions beyond the 70% mark is not inconsiderable for the larger sample sizes. Despite not reaching the 70% threshold for the 2% and 5% samples, accuracy rates of 30% and 60% can be considered successful to some extent, particularly when the small sample sizes are taken into consideration (55–398 tokens). In fact, the three 2% samples successfully attributed by trigrams and four-grams are only 77, 84 and 109 tokens in size (see Table 1), which is remarkably small for computational authorship techniques. In a forensic case involving email, the analyst may have a very small number of disputed emails which they are asked to attribute, a few emails or maybe even only one. These small samples of between 77 and 109 tokens that have been successfully attributed to Derrick represent around nine of his emails. However, as noted above, he has a much lower words-per-email mean (12.9 tokens) than the rest of the authors in the corpus (41.74 tokens). Samples of these sizes would therefore comprise around two to three emails for the average Enron employee, which goes some way towards demonstrating the potential effectiveness of this method in forensic applications, especially as there are likely to be fewer candidate authors in any given case.

Based on the evidence provided by these results, the most accurate and reliable n-gram measure in identifying Derrick as the author of the samples is four-grams. Each n-gram measure underwent 50 tests, one for each sample (five sample sizes, ten samples each). Out of their 50 tests, four-grams correctly attributed 38 samples to Derrick (76%), closely followed by trigrams which successfully attributed 36 of the 50 samples (72%) and five-grams which attributed 35 (70%). On the other hand, unigrams and bigrams are the worst performers, successfully identifying Derrick as the author of only 22 (44%) and 28 (56%) samples respectively. While longer n-grams generally perform better than shorter ones, accuracy does not continue to rise with the length of the n-gram, with four-grams outperforming five- and six-grams. Although longer word strings are more likely to be distinctive of individual authors, it may be that these strings are less likely to be repeated by authors, while four-grams and even trigrams are more pervasive and consistent in a person's authorial style.

What these results do not take into account, though, is how closely Derrick comes to being identified as the author of the samples when he does not achieve the highest Jaccard score. Although he may not rank as being most similar to the sample in any one test, he may be the second or third most similar, or he may be in one-hundred-and-third place. Considering mean Jaccard scores of all 176 candidate authors across all ten tests of each sample size helps to circumvent this issue (Table 6). The table presents the authors with the three highest mean Jaccard scores for all ten tests for every sample size, for all six n-gram measures. With the 2% samples, Derrick does not achieve the highest mean Jaccard score over the ten samples using any of the six measures. He does, however, rank second using six-grams, and while unigrams perform very badly, ranking him at 60th of all 176 candidates, bigrams through to five-grams rank him between 15th and 18th. In a pool of 176 possible authors, n-grams between two and six words in length consistently rank Derrick as being in the top 10% (18/176) of authors. Given the very small size of these 2% samples, this is a promising result, and indicative of a better performance of the method than the raw attribution results (Figure 4) would suggest. With the 5% [...]
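
The two evaluation measures discussed above, raw attribution accuracy and rank by mean Jaccard score, can be contrasted in a short Python sketch. The function names and the score values below are hypothetical and chosen only to show how an author can lose an individual test yet still rank first on the mean; they are not results from the study. Counting a tie for the top score as a success is likewise an assumption.

```python
from statistics import mean
from typing import Dict, List, Tuple


def raw_accuracy(target: str, scores: Dict[str, List[float]]) -> float:
    """Fraction of tests in which the target author has the highest score."""
    n_tests = len(scores[target])
    wins = 0
    for i in range(n_tests):
        best = max(values[i] for values in scores.values())
        if scores[target][i] == best:  # assumption: ties count as success
            wins += 1
    return wins / n_tests


def rank_by_mean_score(scores: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Candidates ordered from highest to lowest mean score across all tests."""
    means = [(author, mean(values)) for author, values in scores.items()]
    return sorted(means, key=lambda pair: pair[1], reverse=True)


def rank_of(target: str, scores: Dict[str, List[float]]) -> int:
    """1-based position of the target author in the mean-score ranking."""
    ranking = [author for author, _ in rank_by_mean_score(scores)]
    return ranking.index(target) + 1


# Invented Jaccard scores for three candidates over three tests.
scores = {
    "derrick": [0.10, 0.02, 0.09],
    "author_a": [0.08, 0.07, 0.03],
    "author_b": [0.01, 0.03, 0.02],
}
print(raw_accuracy("derrick", scores))
print(rank_of("derrick", scores))
```

In this invented data the target wins two of three tests but still has the highest mean, which mirrors the point made in the text that the mean ranking shows how close an author comes even when individual attributions fail.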
a corpus-based stylistic approach to demonstrate how distinctive Derrick's collocation patterns are. Following from this, the results of this attribution experiment complement and confirm the findings of the stylistic analysis, in that they have shown word n-grams between one and six words in length to be accurate and reliable in identifying Derrick as the author of the samples. In order to fully bridge the gap between the stylistic and computational approaches, the third and final part of the case study examines the actual n-grams that are responsible for the accurate attribution of authorship in these experiments.

Derrick's discriminating n-grams

One of the main advantages of Jangle is that, as well as running comparisons and calculating Jaccard scores, the program also displays the actual n-grams operating behind the statistics and accounting for the Jaccard scores. This kind of information about which specific linguistic choices were most important in attribution tasks is often not made available in stylometric studies. In this particular attribution case, this facility allows us to pinpoint the n-grams that were most useful in identifying Derrick as the author of the disputed samples, to observe any recurrent patterns (textbites), and to compare these with those identified in the corpus stylistic analysis.

The examination here focuses on the bigrams, trigrams, four-grams and five-grams which contributed to the accurate attribution of the 5% sample sets. Although bigrams were not successful in attributing the individual 5% samples, they were successful in scoring Derrick the highest mean Jaccard score across all ten tests with this sample size, and for this reason the shared bigrams in all ten 5% samples are examined here. In contrast, the trigrams, four-grams and five-grams were successful in attributing six of the ten individual 5% samples, and the n-grams considered here are those shared between the samples and the remainder of Derrick's emails in these successful tests. In total there were 311 different bigrams that were found in both the sample set and the remainder of Derrick's emails across the ten 5% sample tests, and there were 122 trigrams, 64 four-grams and 28 five-grams that were found in both sets of emails in the six or seven successful tests.

One major criticism of using content words or combinations of content words in authorship analysis is that they are indicative of topic rather than of authorship. Indeed, there is a risk that Derrick's samples have been attributed to him on the basis that the emails in the two different sets made reference to the same topics, such as people, place or company names. However, only 57 (18%) of the 311 bigrams could be considered to refer to topic-specific entities, for example "jim derrick", "steph please", "enron letterhead", "southwestern legal", and "the litigation". Similarly, 24 of the 122 trigrams (19.7%), 19 of the 64 four-grams (29.7%) and 11 of the 28 five-grams (39.3%) can be considered as being topic-dependent in this way. Such n-grams have been removed from the analysis that follows. Thus, although the proportion of topic-dependent n-grams gradually increases as the length of the n-gram is extended, the majority (61%-82%) of the useful n-grams are topic independent; that is, they are not shared between the samples and the remainder of Derrick's emails simply because Derrick is talking about the same things or about/to the same people.

Table 7 presents bigrams which appeared in both the sample sets and Derrick's remaining emails in five or more tests, and the tri-, four-, and five-grams which were shared between Derrick's sample and his remaining emails in three or more successful tests. The n-grams displayed in the table collectively represent a pool of the most characteristic and discriminatory n-gram textbites repeatedly used by Derrick. They have all contributed to the successful attribution of his 5% sample sets which, varying between 156 and 398 words
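The comparison described in this section, a Jaccard score together with the shared n-grams behind it, can be illustrated with a minimal sketch. It is not Jangle's implementation, whose internals are not described here. It assumes that the Jaccard score is the number of distinct word n-grams shared by two texts divided by the number of distinct n-grams found in either, and that tokenisation is a simple lowercase split that lets n-grams run across sentence boundaries. The function names and the two toy texts are invented for illustration and are not taken from the Enron corpus.

```python
import re


def word_ngrams(text, n):
    """Return the set of distinct word n-grams (as tuples) in a text."""
    # Assumed tokenisation: lowercase runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare(sample, reference, n):
    """Return the Jaccard score and the sorted shared n-grams behind it."""
    s, r = word_ngrams(sample, n), word_ngrams(reference, n)
    return jaccard(s, r), sorted(" ".join(gram) for gram in s & r)


# Invented toy texts standing in for a disputed sample and an author's known emails.
sample = "Please print the attachment. I am OK with your proposal."
reference = "Please print the memo. I am OK with your proposal, thanks."

score, shared = compare(sample, reference, 3)
print(round(score, 3))
for gram in shared:
    print(gram)
```

In this toy case five of the twelve distinct trigrams are shared. Listing the shared n-grams alongside the score is what allows an analyst to inspect which sequences drive an attribution and to set aside topic-dependent ones, as is done above.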
him in the Enron corpus. In turn, this section has found that it is exactly these n-grams that are most influential in the statistical attribution experiment.

Using these results, we can return again to a corpus approach to identify how common or rare these textbites are in the rest of the Enron corpus. In other words, are these n-grams found in the emails of the other 175 Enron authors? Those highlighted in green in Table 7 are those that are unique to Derrick; when searched for using Wordsmith Tools, the only results returned are from his emails. Taking results from across the four lengths of n-gram, we can identify five n-gram textbites that are unique to him: "please format and print the attachment", "I will support your recommendation", "please proceed with your", "am OK with your proposal" and "see the message below from". In addition, those in yellow are those that are shared with other authors in the corpus, but for which Derrick is the top frequency user per 1,000 words of text. For example, "please print the" is used by twelve authors in the corpus, but Derrick uses it with a greater relative frequency than the other eleven. With the green and yellow n-grams combined, we have a pool of textbites that are distinctive of Derrick and characterise his authorial style when compared with the other authors in the Enron population. These findings show the value of returning to the corpus approach in validating the n-grams further, and provide an explanation for why these particular n-grams were useful in the successful attribution of Derrick's samples.

The methodological implications of this successful triangulation of approaches are considerable. On the one hand, the statistical attribution task confirms and supports the reliability of the stylistic results in terms of the distinctiveness and usefulness of Derrick's please-initial n-grams. At the same time, the stylistic analysis in the first half of the case study offers a clear linguistic explanation for the statistical and computational results in terms of why n-grams are an accurate and useful linguistic feature for distinguishing an individual's writing style and attributing authorship. Throughout his life, Derrick has amassed a unique series of linguistic and professional experiences culminating in being a legal educator and practitioner, and word collocations and patterns are stored and primed in his mind based on these unique experiences. As such, when Derrick finds himself within the relevant communicative context with regard to purpose and recipient of the email, he accesses and produces stored and frequently used collocation patterns which, in turn, prove to be distinctive of his email style. Moreover, he produces such patterns regularly and repeatedly, reinforcing the priming and making them increasingly habitual. The result is that a word n-gram approach can be used not only to identify and isolate a number of n-gram textbites that distinguish his professional email style from that of other employees in the same company, but also to successfully identify him as the author of text samples, including some as small as 77, 84 and 109 tokens.

Conclusion

Turell (2010) notes the important role that research studies in authorship attribution play in benefiting the performance of expert witnesses in cases of disputed authorship. In the 21st century, forensic linguistic casework increasingly involves texts created in online modes, including email. This research, therefore, provides much-needed empirical results that might inform expert witness reports in the future. We now know that focusing on one style-marker, please, and one speech act, politely encoded directives, can produce both stylistically and statistically revealing results, even though we know that please is one of the most frequent lexical items in email and that directives are one of the most important speech acts in emails. Even so, it is possible to find distinctive uses of these
Argamon, S. and Levitan, S. (2005). Measuring the usefulness of function words for authorship attribution. In Proceedings of ACH/ALLC Conference, 1–3: University of Victoria, BC, Association for Computing and the Humanities.
Austin, J. L. (1962). How to Do Things with Words. Oxford: Clarendon Press.
Barlow, M. (2010). Individual usage: A corpus-based study of idiolects. Paper presented at 34th International LAUD Symposium. Landau, Germany.
Barlow, M. (2013). Exemplar theory and patterns of production. Paper presented at Corpus Linguistics 2013. Lancaster, UK.
Baron, N. S. (1998). Letters by phone or speech by other means: The linguistics of email. Language and Communication, 18(2), 133–170.
Bax, P. (1986). How to assign work in an office. A comparison of spoken and written directives in American English. Journal of Pragmatics, 10, 673–692.
Bennell, C. and Jones, N. J. (2005). Between a ROC and a hard place: A method for linking serial burglaries by modus operandi. Journal of Investigative Psychology and Offender Profiling, 2(1), 23–41.
Bou-Franch, P. (2011). Openings and closings in Spanish email conversations. Journal of Pragmatics, 43(6), 1772–1785.
Bou-Franch, P. and Lorenzo-Dus, N. (2008). Natural versus elicited data in cross-cultural speech act realisation: The case of requests in Peninsular Spanish and British English. Spanish in Context, 5(2), 246–277.
Burrows, J. (2002). "Delta:" A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
Burrows, J. (2005). Andrew Marvell and the "Painter Satires": A computational approach to their authorship. Modern Language Review, 100(2), 281–297.
Chaski, C. E. (2001). Empirical evaluations of language-based author identification techniques. Forensic Linguistics: The International Journal of Speech, Language and the Law, 8(1), 1–65.
Chaski, C. E. (2007). Multilingual forensic author identification through n-gram analysis. Paper presented at the 8th Biennial Conference on Forensic Linguistics/Language and Law. Seattle, WA.
Chejnová, P. (2014). Expressing politeness in the institutional e-mail communications of university students in the Czech Republic. Journal of Pragmatics, 60, 175–192.
Cohen, W. W. (2009). Enron email dataset. [online] http://www.cs.cmu.edu/enron/ (Accessed November 2010).
Cotterill, J. (2010). How to use corpus linguistics in forensic linguistics. In A. O'Keefe and M. McCarthy, Eds., The Routledge Handbook of Corpus Linguistics, 578–590. London: Routledge.
Coulthard, M. (1994). On the use of corpora in the analysis of forensic texts. Forensic Linguistics: The International Journal of Speech, Language and the Law, 1(1), 27–43.
Coulthard, M. (2004). Author identification, idiolect, and linguistic uniqueness. Applied Linguistics, 25(4), 431–447.
Coulthard, M. (2013). On admissible linguistic evidence. Journal of Law and Policy, 21(2), 441–466.
Coulthard, M., Grant, T. and Kredens, K. (2011). Forensic linguistics. In R. Wodak, B. Johnstone and P. Kerswill, Eds., The SAGE Handbook of Sociolinguistics, 531–544. London: Sage.

Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next .... International Journal of Corpus Linguistics, 18(1), 137–165.
Grieve, J. (2007). Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22(3), 251–270.
Hirst, G. and Feiguina, O. (2007). Bigrams of syntactic labels for authorship discrimination of short texts. Literary and Linguistic Computing, 22(4), 405–417.
Hoey, M. (2005). Lexical Priming: A new theory of words and language. London: Routledge.
Hoover, D. L. (2002). Frequent word sequences and statistical stylistics. Literary and Linguistic Computing, 17(2), 157–180.
Hoover, D. L. (2003). Frequent collocations and authorial style. Literary and Linguistic Computing, 18(3), 261–286.
Hoover, D. L. (2004). Testing Burrows's Delta. Literary and Linguistic Computing, 19(4), 453–475.
Hoover, D. L. (2009). Word frequency, statistical stylistics and authorship attribution. In D. Archer, Ed., What's in a word list? Investigating word frequency and keyword extraction, 35–51. Surrey: Ashgate.
Hoover, D. L. (2010). Authorial style. In D. McIntyre and B. Busse, Eds., Language and Style, 250–271. Basingstoke: Palgrave MacMillan.
Izsak, C. and Price, A. R. (2001). Measuring beta-diversity using a taxonomic similarity index, and its relation to spatial scale. Marine Ecology Progress Series, 215, 69–77.
Jaccard, P. (1912). The distribution of the flora in the alpine zone. The New Phytologist, 11(2), 37–50.
Johnson, A. and Woolls, D. (2009). Who wrote this? The linguist as detective. In S. Hunston and D. Oakey, Eds., Introducing Applied Linguistics: Concepts and Skills, 111–118. London: Routledge.
Juola, P. (2008). Authorship Attribution. Foundations and Trends in Information Retrieval. Delft: NOW Publishing.
Juola, P. (2013). Stylometry and immigration: A case study. Journal of Law and Policy, 21(2), 287–298.
Knox, J. (2007). Visual-verbal communication on online newspaper home pages. Visual Communication, 6(1), 19–53.
Koppel, M., Schler, J. and Argamon, S. (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1), 9–26.
Koppel, M., Schler, J. and Argamon, S. (2011). Authorship attribution in the wild. Language Resources and Evaluation, 45(1), 83–94.
Koppel, M., Schler, J. and Argamon, S. (2013). Authorship attribution: What's easy and what's hard? Journal of Law and Policy, 21(2), 317–332.
Koppel, M., Schler, J., Argamon, S. and Messeri, E. (2006). Authorship attribution with thousands of candidate authors. In Proceedings of the 29th ACM SIGIR Conference on Research and Development on Information Retrieval, Seattle, Washington.
Kredens, K. (2002). Towards a corpus-based methodology of forensic authorship attribution: A comparative study of two idiolects. In B. Lewandowska-Tomaszczyk, Ed., PALC'01: Practical Applications in Language Corpora, 405–437. Frankfurt am Main: Peter Lang.
Kredens, K. and Coulthard, M. (2012). Corpus linguistics in authorship identification. In P. Tiersma and L. Solan, Eds., The Oxford Handbook of Language and Law, 504–516. Oxford: Oxford University Press.
models varies along environmental gradients. Global Ecology and Biogeography, 22(1), 52–63.
Priebe, C. E., Conroy, J. M., Marchette, D. J. and Park, Y. (2005). Scan Statistics on Enron Graphs. Computational and Mathematical Organization Theory, 11(3), 229–247.
Queralt, S. and Turell, M. T. (2012). Testing the discriminatory potential of sequences of linguistic categories (n-grams) in Spanish, Catalan and English corpora. Paper presented at the Regional Conference of the International Association of Forensic Linguists 2012, Kuala Lumpur (Malaysia): University of Malaya.
Rajaraman, A. and Ullman, J. D. (2011). Mining of Massive Datasets. Cambridge: Cambridge University Press.
Rico-Sulayes, A. (2011). Statistical authorship attribution of Mexican drug trafficking online forum posts. The International Journal of Speech, Language and the Law, 18(1), 53–74.
Sanderson, C. and Guenter, S. (2006). Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the International Conference on Empirical Methods in Natural Language Engineering, 482–491, Morristown, NJ: Association for Computational Linguistics.
Savoy, J. (2012). Authorship attribution: A comparative study of three text corpora and three languages. Journal of Quantitative Linguistics, 19(2), 132–161.
Schmitt, N., Grandage, S. and Adolphs, S. (2004). Are corpus-derived recurrent clusters psycholinguistically valid? In N. Schmitt, Ed., Formulaic Sequences: Acquisition, Processing and Use, 127–151. Amsterdam: John Benjamins Publishing Company.
Scott, M. (2008). Wordsmith Tools version 5. Liverpool: Lexical Analysis Software.
Scott, M. (2010). Wordsmith Tools Help. Liverpool: Lexical Analysis Software.
Searle, J. R. (1969). Speech Acts: An Essay in the Philosophy of Language. London: Cambridge University Press.
Sherblom, J. (1988). Direction, function and signature in electronic mail. The Journal of Business Communication, 25(4), 39–54.
Sinclair, J. M. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Solan, L. (2013). Intuition versus algorithm: The case for forensic authorship attribution. Journal of Law and Policy, 21(2), 551–576.
Solan, L. and Tiersma, P. M. (2004). Author identification in American Courts. Applied Linguistics, 25(4), 448–465.
Stamatatos, E. (2008). Author identification: Using text sampling to handle the class imbalance problem. Information Processing and Management, 44(2), 790–799.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538–556.
Stamatatos, E. (2013). On the robustness of authorship attribution based on character n-gram features. Journal of Law and Policy, 21(2), 421–440.
Stubbs, M. (2001). Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Blackwell.
Tang, Z., Fang, J., Chi, X., Feng, J., Liu, Y., Shen, Z., Wang, X., Wang, Z., Wu, X., Zheng, C. and Gaston, K. J. (2013). Patterns of plant beta-diversity along elevational and latitudinal gradients in mountain forests of China. Ecography, 35(12), 1083–1091.
Turell, M. T. (2010). The use of textual, grammatical and sociolinguistic evidence in forensic text comparison. The International Journal of Speech, Language and the Law, 17(2), 211–250.