Big Data: Text Mining
The Linguistics Department Presents: The Kucera Server

Uploaded by greemeet on 2020-08-27



Presentation Transcript

Slide1

Big Data: Text Mining

The Linguistics Department Presents:

The Kucera Server

Slide2

What we are doing today

2

Introducing the corpora

Searching the data

Sorting your data

Saving & Exporting

Manipulating your search data

Slide3

3

Available Corpora

A simple overview

Slide4

Slide5

Slide6

Kucera: Available Resources

The possibilities are endless. The resources available are:

6

The Brown Corpus

This was the first million-word electronic corpus of English, created in 1961 at Brown University. It spans about fifteen different categories of text.

The Penn Treebank

Manually-corrected phrase structure trees for English, including 1.2 million words of newspaper text from the Wall Street Journal. 

COCA: Corpus Of Contemporary American English

531 million tokens of American English sampled from 1990--2017 across categories such as Academic, Fiction, Magazine, Newspaper, and Spoken. This version is annotated with lemmas and parts of speech. Provided courtesy of the UGA Library.

Slide7

COHA: Corpus Of Historical American English

400+ million words from the period 1810--2008 facilitate diachronic investigation. Provided courtesy of the UGA Library.

British National Corpus

100 million words of British text annotated with PoS and lemmas as well as speaker age, social class, and geographical region. 91% was published between 1985 and 1993.

AudioBNC

Audio and all available transcriptions of the 7.5 million words of the spoken portion of the British National Corpus.

SpokenBNC 2014

11.4 million tokens, orthographically transcribed from smartphone recordings made between 2012 and 2016. Substantial speaker metadata is included, with PoS and semantic tags.

7

Kucera: Available Resources

Slide8

Arabic Treebank

Approximately 800 thousand words of newswire text from Agence France-Presse annotated with parts of speech, morphology, and phrase structure.

DEFT Spanish Treebank

About 100 thousand words from both Spanish newswire and discussion forums, with extensive morphological and syntactic annotations.

CETEMPúblico

180 million words from the Portuguese newspaper “Público” 1991--1998 with morphological and syntactic annotations.

French Treebank

This corpus is drawn from the newspaper Le Monde 1989--1994, annotated with syntactic constituents, syntactic categories, lemmas, and compounds, and totals about 650 thousand words.

8

Kucera: Available Resources

Slide9

9

Kucera: Available Resources

SPMRL2014

Dependency, constituency and morphology annotations for Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish.

NEGRA corpus

355 thousand tokens of the German newspaper Frankfurter Rundschau annotated with syntactic structures.

EuroParl

About 40 million words of European Parliamentary proceedings aligned across translations into English, German, Spanish, French, Italian and Dutch.

CALLHOME

This corpus consists of 5-10 minute snippets from 120 phone calls, each 30 minutes in length.

Slide10

10

Kucera: Available Resources

The Buckeye Speech Corpus

This corpus comprises speech from 40 speakers from Columbus, Ohio, and totals more than 300,000 words.

CELEX2

Orthography, phonology, morphology and attestation frequency information for words in English, German and Dutch.

Concretely Annotated New York Times

About 1.3 billion words from articles that appeared in the New York Times 1982--2007 with automatically-assigned lemmas and part-of-speech tags.

WaCky corpora

Between 1.2 and 1.9 billion tokens each of French, German and Italian as crawled from the world wide web. Also includes about 800 million tokens of English Wikipedia as it was in 2009. These corpora are annotated with lemmas and parts of speech.

Slide11

11

Conducting Searches

A simple overview

Slide12

Available Corpora

CQP corpora

This is a sub-grouping of all available corpora.

Searching using the CQP interface

These are searched with regular expressions, PoS, lemma, and other tags.

Non-CQP corpora

These are searched with Linux commands and bash scripting.

12
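To make the non-CQP route concrete, here is a minimal sketch of searching a plain-text corpus with grep and a regular expression. The file name and its contents are invented for illustration; on the server you would point grep at a real corpus file.

```shell
# Invented sample file standing in for a plain-text (non-CQP) corpus.
printf 'the judge ruled\nthe jury agreed\nJudge Smith presided\n' > corpus.txt

# Search it with grep and a regular expression:
grep -n 'judge' corpus.txt    # case-sensitive, with line numbers
grep -in 'judge' corpus.txt   # -i: also matches "Judge"
```

The same pipeline scales from one file to a whole directory of corpus text.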

Slide13

CQP Corpora

Single Words

Individual word, ex. “judge”

String of words, ex. “kick” “the” “bucket”

Wild Cards

“.”, “?”, “*”, “+”, “( )”, “|”, “[ ]”

ex. “come” “(for|because)” [ ]* “stay(.+)?”

Tags

[ pos = “vvd” ], [ lemma = “eat” ]

13
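The wildcard pattern above has a rough plain-text analogue in egrep (grep -E), which is useful for the non-CQP corpora. A sketch over an invented sample file; the character-class stand-in for CQP's [ ] (any token) is an assumption for illustration:

```shell
printf 'come for a while and stay\ncome because you want to stay\ngo away\n' > sample.txt

# Roughly parallels the CQP query "come" "(for|because)" [ ]* "stay(.+)?":
grep -E 'come (for|because)( [a-z]+)* stay' sample.txt
```

This matches the first two lines and skips the third, just as the CQP query would match "come for ... stay" and "come because ... stay" with any tokens in between.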

Slide14

Slide15

15

Slide16

Other Corpora

Linux Commands

Grep + regular expressions

Wild characters

Exact matches, non-matches

Egrep + regular expressions

Optional commands

-n: Which lines, -c: How many lines, -i: Ignore case, -v: Invert match, and more.

16
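The optional flags above can be tried out in a few lines. A minimal sketch with an invented three-line file:

```shell
printf 'The cat sat\nA dog ran\nthe CAT slept\n' > lines.txt

grep -c 'cat' lines.txt      # -c: count matching lines (case-sensitive)
grep -ic 'cat' lines.txt     # -i: ignore case, so "CAT" matches too
grep -vc 'cat' lines.txt     # -v: count lines that do NOT match
grep -n 'dog' lines.txt      # -n: show the matching line with its number
```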

Slide17

Slide18

Other Corpora

Linux Commands

Regex and Bash scripting are both well documented and supported:

Intro webpages

YouTube videos

Lynda.com

Books

18

Slide19

19

Sorting your data

A simple overview

Slide20

20

Counting your data

Counting commands

Use the command “count” to count your results in various ways.

> count by (attribute)

Attributes include word, lemma, pos, etc.

+ cut (number) – cuts the list to only the number included.

+ descending – reverses the order.

+ reverse – sorts the matches by suffix.

+ %cd – normalizes for case and/or diacritics.
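The count command runs inside the CQP console; for the non-CQP corpora a similar frequency table can be built with standard Linux tools. A sketch over an invented one-word-per-line file:

```shell
printf 'the\ncat\nthe\ndog\nthe\n' > words.txt

# Tally identical lines and list the most frequent first,
# much like "count by word" with a descending order:
sort words.txt | uniq -c | sort -rn
```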

Slide21

21

Sorting your data

Sorting commands

Use the command “sort” to order your results; by default they are put back in the order they appeared in the corpus. Additional commands modify the sort.

> sort by (attribute)

Attributes include word, lemma, pos, etc.

+ randomize – shuffles the results so you don’t see only the top.

+ descending – reverses the order.

+ reverse – sorts the matches by suffix.

+ %cd – normalizes for case and/or diacritics.
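The reverse (sort-by-suffix) option has a bash counterpart using rev (from util-linux): reverse each line, sort, then reverse back. A sketch on an invented list of word forms:

```shell
printf 'walking\ntalked\nwalked\n' > forms.txt

# Reversing before sorting groups words that share a suffix:
rev forms.txt | sort | rev
```

Here the two -ed forms end up adjacent, ahead of the -ing form.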

Slide22

22

Saving & Exporting

A simple overview

Slide23

23

Saving Searches

Naming Searches (CQP)

Each search is stored as the named “Last” search.

You can save the last search to call on it later by using the “cat” (concatenate) command with the “>” (write out) or “>>” (append) operator.

> cat Last >> “FileName.txt” (adds to the bottom of the named file)

> cat Last > “FileName.txt” (creates a file of that name, or saves over that file if it already exists)
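The > and >> operators behave the same way anywhere in the shell. A minimal sketch, with printf standing in for search output:

```shell
printf 'first search\n' > saved.txt    # ">" creates (or overwrites) the file
printf 'second search\n' >> saved.txt  # ">>" adds to the bottom
cat saved.txt                          # shows both lines, in order
```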

Slide24

24

Saving Searches

Naming Searches (Bash)

Know your directory and pathways.

Use the “>” (write out) or “>>” (append) operators, but you must write the file to your home folder.

You do not have automatic saving of last searches.

Slide25

Slide26

26

Saving Searches


Scripting

Both avenues allow for scripts. You can test and improve a set of commands, adding complexity, until you like it.
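A minimal sketch of such a script; the file names, pattern, and script name are all invented for illustration:

```shell
printf 'the judge spoke\nthe jury listened\n' > mini.txt

# freq.sh: count occurrences of a pattern in a file.
# Start simple like this, then add complexity as needed.
cat > freq.sh <<'EOF'
#!/bin/sh
# Usage: sh freq.sh PATTERN FILE
grep -o "$1" "$2" | wc -l
EOF

sh freq.sh 'the' mini.txt
```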

Slide27

27

Exporting

Best Way

WinSCP (various options for iOS users)

Other Ways

- FTP (MobaXterm, Linux command line)

Format

- These will be “.txt” files, so using Notepad or Notepad++ is an easy way to see what you have.

Slide28

28

Manipulating Data

A simple overview

Slide29

Excel – simple, familiar, short learning curve.

Python, Perl, etc. – steeper learning curve, more powerful, very flexible.

R – also has a steeper learning curve, also a powerful stats tool.

Also, Linux tools on a local machine: Bash, vi, vim, atom, etc.

29

Manipulating Data
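Even before reaching for Excel or Python, the Linux tools go a long way on an exported results file. A sketch assuming an invented two-column export of word and PoS tag:

```shell
# Hypothetical exported results: one match per line, word then PoS tag.
printf 'eat VVI\nate VVD\neat VVI\n' > export.txt

# Tally the PoS tags in column 2 with awk:
awk '{n[$2]++} END {for (t in n) print t, n[t]}' export.txt | sort
```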

Slide30

30

Summation

What

next?

Slide31

Fall 2019: LING 4886/6886

An excellent opportunity to learn the how and, just as importantly, the why. There will be significant digital humanities content.

Counts toward the DH Certificate.

31

Slide32

32

Slide33

The End

33