Natural Language Processing and Textual Analysis in Finance and Accounting

olivia-moreira
Uploaded On 2019-11-22

Tim Loughran and Bill McDonald, University of Notre Dame. ID: 766939

Presentation Transcript

Natural Language Processing and Textual Analysis in Finance and Accounting
Tim Loughran and Bill McDonald
University of Notre Dame

Outline: Overview, Data/Programs, Sample App, Stemming, Word Lists, Resources

"… 'Cause you know sometimes words have two meanings."

What do we call this?
- Textual analysis
- Natural language processing
- Sentiment analysis
- Content analysis
- Computational linguistics

Increased interest attributable to:
- Bigger, faster computers
- Availability of large quantities of text
- New technologies derived from search engines

Examples of data sources:
- EDGAR (1994-2011, 22.7 million filings)
- WSJ News Archive (XML encapsulated, 2000 onward)
- Audio transcripts (e.g., conference calls)
- Web sites
- Google searches
- Twitter / Stocktwits

Programs
- Black boxes (Wordstat, Lexalytics, Diction …)
- Two critical components:
  - Ability to download data and convert it into a string/character variable
  - Ability to parse large quantities of text

Most modern languages provide for both of these functions:
- Perl
- Python
- SAS Text Miner
- VB.net

Parsing large quantities of text: REGEX
Regular expressions example: a regex that attempts to identify sentences:
(?<=^|[\.!\?]\s+|\n{2,})[A-Z][^\.!\?\n]{20,}(?=([\.!\?](\s|$)))
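The sentence pattern on this slide uses a variable-length lookbehind, which Python's built-in re module does not accept. The sketch below is one possible adaptation, not the authors' code: it splits the lookbehind into fixed-width alternatives, so it approximates the slide's pattern rather than reproducing it exactly. The function name find_sentences and the sample text are made up for illustration.

```python
import re

# Adaptation of the slide's pattern for Python's re module (an assumption,
# not the original): the variable-length lookbehind is replaced by
# fixed-width alternatives.
SENTENCE_RE = re.compile(
    r"(?:^|(?<=[.!?]\s)|(?<=\n\n))"  # start of text, after a terminator, or after a blank line
    r"[A-Z][^.!?\n]{20,}"            # capital letter plus at least 20 non-terminator characters
    r"(?=[.!?](?:\s|$))"             # must be followed by a terminator and whitespace or end
)

def find_sentences(text):
    """Return candidate sentences (without their closing punctuation)."""
    return SENTENCE_RE.findall(text)

sample = ("The company reported a large loss this year. It was bad. "
          "Management expects improvement next quarter!")
sentences = find_sentences(sample)
```

Note that the short sentence "It was bad" is skipped because of the 20-character minimum, which illustrates why the slide describes this as a regex that "attempts" to identify sentences.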

Summary of technical literature: "Natural languages are messy and difficult to parse with computers."
Current Issues in Parsing Technology, Masaru Tomita, Kluwer Academic Publishers, 1991, p. 1

Tripwires: some examples
- Parsing out 10-K segments
- "May" (e.g., the month versus the modal verb)
- Disambiguation of abbreviations
- Older files are less structured

Download 10-X
- Download master files for each year/qtr: "ftp://ftp.sec.gov/edgar/full-index/YYYY/QTR#"

- Identify target forms from master file
- Download forms: http://www.sec.gov/Archives/target file name
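A minimal sketch of the download steps above, under stated assumptions. The two base URLs come from the slides. The pipe-delimited layout (CIK, company name, form type, date filed, file name) is an assumption about the master file that the slides do not spell out, the rows in SAMPLE_MASTER are invented, and the helper names are hypothetical. No network access is attempted.

```python
INDEX_ROOT = "ftp://ftp.sec.gov/edgar/full-index/"  # from the slide
ARCHIVE_ROOT = "http://www.sec.gov/Archives/"       # from the slide

def index_location(year, quarter):
    """Directory holding the master file for one year/quarter (YYYY/QTR#)."""
    return f"{INDEX_ROOT}{year}/QTR{quarter}"

def target_urls(master_text, form_types=("10-K",)):
    """Build download URLs for the target form types listed in a master file.

    Assumes five pipe-delimited fields per record; other lines are skipped.
    """
    urls = []
    for line in master_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 5:
            continue  # header text, separators, blank lines
        _cik, _company, form_type, _date_filed, filename = fields
        if form_type in form_types:
            urls.append(ARCHIVE_ROOT + filename)
    return urls

# Invented records, for illustration only.
SAMPLE_MASTER = (
    "CIK|Company Name|Form Type|Date Filed|Filename\n"
    "1234567|EXAMPLE CORP|10-K|2011-03-01|edgar/data/1234567/0000000000-11-000001.txt\n"
    "7654321|SAMPLE INC|8-K|2011-03-02|edgar/data/7654321/0000000000-11-000002.txt\n"
)
urls = target_urls(SAMPLE_MASTER)
```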

Iterate through forms:
- Clean up text file
  - Remove ASCII-encoded segments (e.g., graphics, PDFs, etc.)
  - Remove XBRL
  - Remove tables (<TABLE>.*?</TABLE>)
  - Remove all remaining markup tags (HTML)
  - Re-encode character entity references (e.g., &amp; becomes &)
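The clean-up steps can be sketched roughly as follows. This is an illustration, not the authors' program: it covers only table removal, tag removal, and entity decoding, and leaves out the ASCII-encoded segments and XBRL steps. The table pattern is slightly generalized from the slide's so that opening tags with attributes also match, and clean_filing is a hypothetical name.

```python
import html
import re

# Slide's table pattern, generalized to allow attributes in the opening tag.
TABLE_RE = re.compile(r"<TABLE[^>]*>.*?</TABLE>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")  # any remaining markup tag

def clean_filing(raw):
    """Strip tables and tags, decode entities, and collapse whitespace."""
    text = TABLE_RE.sub(" ", raw)
    text = TAG_RE.sub(" ", text)
    # Decode entities last so that &lt; / &gt; cannot be mistaken for tags.
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()

raw = ("<P>Profit &amp; loss</P>"
       "<TABLE><TR><TD>1</TD></TR></TABLE>"
       "<B>Risk</B> factors")
cleaned = clean_filing(raw)
```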

Iterate through forms (continued):
- Parse form into tokens. Regex: (?i:\b[-A-Z]{2,}\b)
- Iterate through each token to see if it matches an entry in a master dictionary
- Tabulate words
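A short sketch of the tokenize, match, and tabulate loop. The token pattern follows the slide, with the case-insensitive flag passed to re.compile instead of written inline. The tiny master dictionary and the function name tabulate are placeholders for illustration only.

```python
import re
from collections import Counter

# Slide's token pattern: runs of two or more letters or hyphens.
TOKEN_RE = re.compile(r"\b[-A-Z]{2,}\b", re.IGNORECASE)

def tabulate(text, master_dictionary):
    """Count tokens that appear in the master dictionary (upper-cased)."""
    counts = Counter()
    for token in TOKEN_RE.findall(text):
        word = token.upper()
        if word in master_dictionary:
            counts[word] += 1
    return counts

# Placeholder dictionary; the real master dictionary is far larger.
master = {"LOSS", "MAY", "WRITE-OFF"}
counts = tabulate("The loss was large; the Loss may grow. A write-off is likely.", master)
```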

When creating word lists, should we list root words (lexemes) and stem, or expand all root words to include inflections?

Stemming
- Programmatically collapse words down to root lexeme: expensive, expensed, expensing => expense

Inflection
- depreciate => depreciated / depreciates / depreciating / depreciation
- Avoids morphologies like: blind / blinds; odd / odds; bitter / bitters
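The contrast between the two approaches can be illustrated with a toy example. The naive_stem function below is a deliberately crude stand-in (it only strips a trailing "s") and is not Porter's or any other real stemming algorithm. The inflection table holds just the one example from the slide.

```python
def naive_stem(word):
    """Toy stemmer for illustration: strip a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

# Explicit inflections, limited to the example on the slide.
INFLECTIONS = {
    "depreciate": ["depreciated", "depreciates", "depreciating", "depreciation"],
}

def expand(roots, inflections):
    """Expand root words into an explicit list of accepted word forms."""
    words = set()
    for root in roots:
        words.add(root)
        words.update(inflections.get(root, []))
    return words

# Stemming conflates words with different meanings.
conflated = [naive_stem(w) for w in ("odds", "blinds", "bitters")]

# An explicit list contains only the forms that were deliberately added.
word_list = expand(["depreciate", "odd"], INFLECTIONS)
```

Here "odds" is collapsed to "odd" by the stemmer, whereas the expanded list never contains "odds" unless someone chooses to add it.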

The text processing literature shows that stemming does not in general improve performance. Essentially, stemming does not work for morphologically rich languages.

Loughran/McDonald JF 2011 word lists
- Create a dictionary of all words occurring in 10-Ks from 1994-2007.
- Classify words occurring in 5% or more of the documents.
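One way to read the 5% rule is as a document-frequency cutoff: a word qualifies if it appears at least once in 5% or more of the documents. The sketch below implements that reading. The function name and the toy documents are invented, and the test data uses a 50% cutoff only because four documents are too few for a 5% threshold to be informative.

```python
from collections import Counter

def frequent_words(documents, min_share=0.05):
    """Return words that occur in at least min_share of the documents.

    documents: a list of token lists, one per document.
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    cutoff = min_share * len(documents)
    return {word for word, n in doc_freq.items() if n >= cutoff}

# Toy documents, for illustration only.
docs = [
    ["loss", "loss", "gain"],
    ["loss"],
    ["risk", "loss"],
    ["gain"],
]
candidates = frequent_words(docs, min_share=0.5)
```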

Loughran/McDonald JF 2011 word lists
- Fin-Neg: negative words (e.g., loss, bankruptcy, indebtedness, felony, misstated, discontinued, expire, unable). N = 2,349
- Fin-Pos: positive words (e.g., beneficial, excellent, innovative). N = 354

Notice that in financial reporting it is unlikely that negative words will be negated (e.g., "not terrible earnings"), whereas positive words are easily qualified or compromised. Although you can easily account for simple negation, typical forms of negation are difficult to detect.
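As a rough illustration of "simple negation", the sketch below counts a positive word only when no negator appears in the few words before it. The negator set and the three-word window are assumptions chosen for illustration, not details taken from the slides, and the positive list is limited to two of the slide's examples.

```python
# Illustrative negators and window size (assumptions, not from the slides).
NEGATORS = {"no", "not", "none", "neither", "never", "nobody"}

def count_positive(tokens, positive_words, window=3):
    """Count positive words that are not preceded by a negator within `window` tokens."""
    count = 0
    for i, token in enumerate(tokens):
        if token in positive_words:
            preceding = tokens[max(0, i - window):i]
            if not NEGATORS.intersection(preceding):
                count += 1
    return count

positive = {"excellent", "beneficial"}  # two examples from the Fin-Pos slide
tokens = "results were excellent but the outlook is not beneficial".split()
n_positive = count_positive(tokens, positive)
```

A rule like this catches "not beneficial" but misses qualified phrasing such as "beneficial only if", which is the kind of negation the slide calls difficult to detect.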

Loughran/McDonald JF 2011 word lists
- Fin-Unc: uncertainty words. Note that the emphasis here is more on uncertainty than on risk (e.g., ambiguity, approximate, assume, risk). N = 291
- Fin-Lit: litigious words (e.g., admission, breach, defendant, plaintiff, remand, testimony). N = 871

Loughran/McDonald JF 2011 word lists
- Modal Strong: e.g., always, best, definitely, highest, lowest, will. N = 19
- Modal Weak: e.g., could, depending, may, possibly, sometimes. N = 27
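Once the lists exist, a common use is to express each category as a share of a document's total words. The sketch below assumes that simple proportional measure. SAMPLE_LISTS holds only a few of the example words shown on the slides, not the full lists, and category_shares is a hypothetical name.

```python
# A handful of example words from the slides; the full lists are much larger.
SAMPLE_LISTS = {
    "Fin-Neg": {"loss", "bankruptcy", "felony", "unable"},
    "Fin-Unc": {"ambiguity", "approximate", "assume", "risk"},
    "Modal Weak": {"could", "may", "possibly"},
}

def category_shares(tokens, word_lists):
    """Return, for each list, the fraction of tokens that belong to it."""
    total = len(tokens)
    shares = {}
    for name, words in word_lists.items():
        hits = sum(1 for token in tokens if token in words)
        shares[name] = hits / total if total else 0.0
    return shares

tokens = "the company may be unable to avoid a loss".split()
shares = category_shares(tokens, SAMPLE_LISTS)
```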

Use of word lists:
"Content analysis stands or falls by its categories. Particular studies have been productive to the extent that the categories were clearly formulated and well adapted to the problem." Berelson (1952, p. 92)

Zipf's law: the most frequent word will appear twice as often as the second most frequent word, three times as often as the third, and so on. This is much like the distribution of market cap in finance.

Always look at the words driving your counts.
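Because word frequencies are so skewed, a few words can dominate a category total. A quick check, sketched below with invented counts, is to list the top words in a category along with the share of the total each one contributes. The function name top_drivers is hypothetical.

```python
from collections import Counter

def top_drivers(category_counts, n=5):
    """Return the n most frequent words with their share of the category total."""
    total = sum(category_counts.values())
    if total == 0:
        return []
    return [(word, count / total) for word, count in category_counts.most_common(n)]

# Invented counts, for illustration only.
negative_counts = Counter({"loss": 60, "impairment": 30, "felony": 10})
drivers = top_drivers(negative_counts, n=2)
```

In this made-up example, one word accounts for 60% of the category count, so any result built on the total is largely a result about that word.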

Resources: www.nd.edu/~mcdonald/Word_Lists.html
- Sentiment dictionaries
- Master dictionary
- Lists of stop words
- 1994-2011 10-X file summaries spreadsheet