Slide 1
Big Data: Text Mining
The Linguistics Department Presents: The Kucera Server
Slide 2
What we are doing today
Introducing the corpora
Searching the data
Sorting your data
Saving & Extracting
Manipulating your search data
Slide 3
Available Corpora
A simple overview
Slides 4–6
Kucera: Available Resources
The possibilities are endless. The resources available are:
The Brown Corpus
This was the first million-word electronic corpus of English, created in 1961 at Brown University. It spans about fifteen different categories of text.
The Penn Treebank
Manually-corrected phrase structure trees for English, including 1.2 million words of newspaper text from the Wall Street Journal.
COCA: Corpus Of Contemporary American English
531 million tokens of American English sampled from 1990–2017 across categories such as Academic, Fiction, Magazine, Newspaper, and Spoken. This version is annotated with lemmas and parts of speech. Provided courtesy of the UGA Library.
Slide 7
Kucera: Available Resources

COHA: Corpus Of Historical American English
400+ million words from the period 1810–2008 facilitate diachronic investigation. Provided courtesy of the UGA Library.

British National Corpus
100 million words of British text annotated with PoS and lemmas, as well as speaker age, social class, and geographical region. 91% was published between 1985 and 1993.

AudioBNC
Audio and all available transcriptions of the 7.5 million words of the spoken portion of the British National Corpus.

SpokenBNC 2014
11.4 million tokens, orthographically transcribed from smartphone recordings made between 2012 and 2016. Substantial speaker metadata is included, with PoS and semantic tags.
Slide 8
Kucera: Available Resources

Arabic Treebank
Approximately 800 thousand words of newswire text from Agence France-Presse annotated with parts of speech, morphology, and phrase structure.

DEFT Spanish Treebank
About 100 thousand words from both Spanish newswire and discussion forums, with extensive morphological and syntactic annotations.

CETEMPúblico
180 million words from the Portuguese newspaper “Público” (1991–1998) with morphological and syntactic annotations.

French Treebank
This corpus is drawn from the newspaper Le Monde (1989–1994), annotated with syntactic constituents, syntactic categories, lemmas, and compounds; it totals about 650 thousand words.
Slide 9
Kucera: Available Resources
SPMRL2014
Dependency, constituency and morphology annotations for Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish.
NEGRA corpus
355 thousand tokens of the German newspaper Frankfurter Rundschau annotated with syntactic structures.
EuroParl
About 40 million words of European Parliamentary proceedings aligned across translations into English, German, Spanish, French, Italian and Dutch.
CALLHOME
This corpus consists of 5–10 minute snippets from 120 phone calls, each 30 minutes in length.
Slide 10
Kucera: Available Resources
The Buckeye Speech Corpus
This corpus comprises recordings of 40 speakers from Columbus, Ohio, and totals more than 300,000 words of speech.
CELEX2
Orthography, phonology, morphology and attestation frequency information for words in English, German and Dutch.
Concretely Annotated New York Times
About 1.3 billion words from articles that appeared in the New York Times (1982–2007), with automatically assigned lemmas and part-of-speech tags.
WaCky corpora
Between 1.2 and 1.9 billion tokens each of French, German, and Italian as crawled from the world wide web. Also includes about 800 million tokens of English Wikipedia as it was in 2009. These corpora are annotated with lemmas and parts of speech.
Slide 11
Conducting Searches
A simple overview
Slide 12
Available Corpora

CQP corpora
This is a sub-grouping of all available corpora. These are searched using the CQP interface, with regular expressions, PoS, lemma, and other tags.

Non-CQP corpora
These are searched with Linux commands and bash scripting.
Slide 13
CQP Corpora

Single Words
Individual word, e.g. “judge”

String of words
e.g. “kick” “the” “bucket”

Wild Cards
“.”, “?”, “*”, “+”, “( )”, “|”, “[ ]”
e.g. “come” “(for|because)” [ ]* “stay(.+)?”

Tags
[ pos = “vvd” ], [ lemma = “eat” ]
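Put together, these elements combine into queries. The following is a minimal sketch of a CQP session; the corpus name BROWN is an invented example, and the exact interactive syntax may differ on your server:

```
BROWN;                                    # activate a corpus by name
"judge";                                  # single word
"kick" "the" "bucket";                    # string of words
"come" "(for|because)" [ ]* "stay(.+)?";  # wildcards: [ ]* allows any words between
[pos = "vvd"] [lemma = "eat"];            # tag-based query
```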
Slide 16
Other Corpora

Linux Commands
grep + regular expressions
Wild characters
Exact matches, non-matches
egrep + regular expressions

Optional flags
-n: which lines, -c: how many lines, -i: ignore case, -v: invert match, and more.
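These flags can be tried directly in a shell. A minimal sketch, using a throwaway sample file (the filename and contents are invented for illustration):

```shell
# Build a tiny three-line sample "corpus" (hypothetical data).
printf 'The judge ruled.\nkick the bucket\nKICK the habit\n' > sample.txt

# -c: how many lines match (case-sensitive, so only the lowercase line here)
grep -c 'kick' sample.txt

# -i + -n: ignore case and show which lines matched
grep -in 'kick' sample.txt

# -v: invert the match (lines NOT containing "kick")
grep -v 'kick' sample.txt

# egrep (grep -E): extended regex, e.g. alternation
egrep 'judge|bucket' sample.txt
```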
Slide 17
Other Corpora

Linux Commands
Regex and Bash scripting are both well documented and supported:
Intro webpages
YouTube videos
Lynda.com
Books
Slide 19
Sorting your data
A simple overview
Slide 20
Counting your data

Counting commands
Use the command “count” to count your results in various ways.

> count by (attribute)

Attributes include word, lemma, pos, etc.
+ cut (number) – cuts to only the number included
+ descending – reverses the order
+ reverse – sorts the matches by suffix
+ %cd – normalizes for case and/or diacritics
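A counting session might look like the sketch below; the corpus name BROWN and the query are assumptions, not a prescribed workflow:

```
BROWN;                      # activate a corpus
[lemma = "eat"];            # find all forms of "eat"
count by word;              # tally each distinct matching word form
count by word %cd cut 10;   # fold case/diacritics, keep only the top 10
```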
Slide 21
Sorting your data

Sorting commands
Use the command “sort” to put the results back in the order they appeared in the corpus; additional keywords modify the sort.

> sort by (attribute)

Attributes include word, lemma, pos, etc.
+ randomize – shuffles the results so you don’t see only the top
+ descending – reverses the order
+ reverse – sorts the matches by suffix
+ %cd – normalizes for case and/or diacritics
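A sorting session might look like the sketch below; the corpus name COCA and the query are assumptions:

```
COCA;
[pos = "vvd"];              # an example tag query
sort by word;               # alphabetical by the matched word
sort by word descending;    # reverse the order
sort randomize;             # shuffle so you don't see only the top
```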
Slide 22
Saving & Exporting
A simple overview
Slide 23
Saving Searches

Naming Searches (CQP)
Each search is stored under the name “Last”.
You can rename the last search to call on it later by using the “cat” (concatenate) command with the “>” (write out) or “>>” (append) operator.

> cat Last >> “FileName.txt” (adds to the bottom of the named file)
> cat Last > “FileName.txt” (creates a file of that name, or saves over that file if it already exists)
Slide 24
Saving Searches

Naming Searches (Bash)
Know your directory and pathways.
Use the “>” (write out) or “>>” (append) operators, but you must put the output file in your home folder.
There is no automatic saving of last searches.
Slide 26
Saving Searches
Scripting
Both avenues allow for scripts. You can test and improve a set of commands, adding complexity, until you like it.
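As a sketch of how those pieces combine on the Bash side (all filenames here are placeholders, invented for illustration):

```shell
# Hypothetical workflow: build up a small search pipeline step by step.
printf 'kick the bucket\nkick the ball\nfill the bucket\n' > corpus.txt

grep 'kick' corpus.txt > results.txt     # ">" writes out (creates or overwrites)
grep 'bucket' corpus.txt >> results.txt  # ">>" appends to the same file

wc -l < results.txt                      # 4 saved lines in total
```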
Slide 27
Exporting

Best Way
WinSCP (various options for iOS users)

Other Ways
FTP (MobaXterm, Linux command line)

Format
These will be “.txt” files, so using Notepad or Notepad++ is an easy way to see what you have.
Slide 28
Manipulating Data
A simple overview
Slide 29
Manipulating Data

Excel – simple, familiar, short learning curve
Python, Perl, etc. – steeper learning curve, more powerful, very flexible
R – also a steeper learning curve; also a powerful stats tool
Also, Linux tools on a local machine: Bash, vi, vim, Atom, etc.
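Even before reaching for Excel or Python, classic Linux tools can handle simple manipulation of an exported results file. A minimal sketch, with invented data and filename:

```shell
# Hypothetical exported results, one matched word form per line.
printf 'ate\nate\neats\nate\neating\n' > matches.txt

# Frequency table: sort, collapse duplicates with counts, rank by count.
sort matches.txt | uniq -c | sort -rn
```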
Slide 30
Summation
What next?
Slide 31
Fall 2019: LING 4886/6886

An excellent opportunity to learn the how and, just as importantly, the why.
There will be significant digital humanities content.
Counts toward the DH Certificate.
Slide 32
Slide 33
The End