
Methodology - PowerPoint Presentation

Uploaded On 2016-10-12


Presentation Transcript

Slide1

Methodology of linguistic research

Corpus linguistics

Frans Gregersen, 25th of January

Slide2

History of corpus linguistics

I

Slide3

To understand the data revolution we have to look at data in general

Four types of data:
  The historical record; early records scant, all records written
  Opinions about and observations of language use
  Actual, contemporary language use, whether written or spoken: behavioral data
  Intuitions or introspective evidence of competence: introspective or judgmental data

Slide4

Major technological breakthroughs

The invention of writing systems, in particular the invention of alphabetic writing systems
The invention of the printing press
The invention of the telegraph
The invention of the radio
The invention of sound films
The invention of the tape recorder
The invention of TV
The invention of the Xerox machine
The invention of the hard disk and the possibility of storing vast amounts of data

Slide5

What is corpus linguistics?

Corpus linguistics in the strict sense is the use of stored and retrievable data in linguistic work
Any collection of data which is structured so that it may be used for various analytical purposes may be seen as a corpus and used as such
The new possibilities inherent in large-scale corpora lie in the automation of hitherto tedious and time-consuming work such as building up a corpus or counting instances

Slide6

The lexicographical bias

The development of lexicography, how to make a dictionary
The database for lexicographical work
  excerpts: from literature, mundane prose, etc.?
  questionnaires
  older lexicographers, the power of tradition
Selection of lemmas, lemmatization

Slide7

Lexicography as science

Relationships between semantics and lexicography
  semantic analyses of lemmas
  the notion of semantic primitives
The dubious nature of the word or, even worse, of the lemma and the lemma traditions
  As against the morpheme, root or the lexeme
  As against the fixed expressions, collocations etc. [headway]; kæreste, vb. at kæreste med nogen
  As against the utterance, turn or sentence

Slide8

What corpora did to lexicography and vice versa

The real text and the lexicographical tradition
  the many values of context
  the systematization of evidence
  lemmas and constructions; make headway; take advantage of, in reference to yours of,
The two most useful notions of frequency and concordance, i.e. word forms in context
Corpora are in general WORD based, with extensions into collocations and with the possibility of getting concordances

Slide9

data driven vs. theory driven 1

Inductive approaches, deductive approaches
Problems of inductivism:
  how can we know what is worth looking at, if we do not have a theory?
  fishing for a significant result might make you catch something, but what is it really?
  or: how can we interpret the result if we do not have a framework?

Slide10

data driven vs. theory driven 2

Problems of deductivism:
  How do we administer the meeting of theory and data; which types of data are relevant for this particular theory?
  How do we evaluate competing claims; what will count as counterevidence and what will be discarded offhand as irrelevant to the theory?
  Can a theory be disproved by empirical work?
  Falsification or falsifiability as the hallmark of the scientific approach/stance

Slide11

A view from the sociology of science

Maybe what has happened to the humanities with the advent of corpus linguistics/literary analysis/textual analysis is that the humanities have become more like the natural sciences?

Slide12

Robert K. Merton 1942

The sociology of science:
The need to scrutinize the ethos of science became pressing in 1942 in the face of the Nazi denial of rationalism

Slide13

The CUDOS norms

Communism: the common ownership of scientific discoveries, according to which scientists give up intellectual property in exchange for recognition and esteem
Universalism: claims to truth are evaluated in terms of universal or impersonal criteria, and not on the basis of race, class, gender, religion, or nationality
Disinterestedness: scientists are rewarded for acting in ways that outwardly appear to be selfless
Organized skepticism: all ideas must be tested and are subject to rigorous, structured community scrutiny

Slide14

The natural sciences and the CUDOS norms

Communism: the need for collective work and division of labour, the praxis of big science
Universalism: Natural sciences are more universal and less bound to local languages, traditions and culture than the human sciences
Disinterestedness: sharing results
Organized skepticism: double blind peer review systems, evaluation procedures

Slide15

The human sciences and CUDOS

Communism: More individual researchers than groups; prototypically little science
Universalism: Human sciences less universal and more bound to local languages, traditions and cultures than the natural sciences
Disinterestedness: Often the individual is tied to the method and results
Organized skepticism: More skepticism than organization

Slide16

The corpus revolution: accountability

We have to make choices which involve creating codes for the various analytical categories, so the work always has a somewhat hermeneutical basis. Even so, if the rest of the research work, i.e. the data analysis, can be made independent of subjective reasoning, we may approach the ideal of a science which is disinterested and universal.
This hinges on the notion of accountability.

Slide17

Accountability

Which unit or item was coded as belonging to this or that category, and where it is located in the data, is uniquely retrievable
Thus the data may always be inspected or even re-analyzed by others
PROBLEM: confidentiality
Thus all data, if stored permanently as part of the project, are in principle available to others: the norms of communism and organized skepticism may be applied

Slide18
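One way to picture the accountability requirement from the Accountability slide is a record that stores, for every coding decision, the category and the exact location in the data. The sketch below is only an illustration: the class name, the field names, the recording identifier and the example category are all invented here and do not come from the lecture or from any particular corpus tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedItem:
    """One analytical decision: which stretch of data received which category."""
    recording_id: str  # hypothetical identifier of the recording or text
    token_start: int   # index of the first token of the coded unit
    token_end: int     # index one past the last token of the coded unit
    category: str      # the analytical code that was assigned
    coder: str         # who made the decision


def retrieve(item: CodedItem, corpus: dict[str, list[str]]) -> str:
    """Look the coded unit up again in the stored data."""
    return " ".join(corpus[item.recording_id][item.token_start:item.token_end])


# Invented toy data: one "recording" stored as a list of tokens.
corpus = {"rec001": "so we went home and then it rained".split()}
item = CodedItem("rec001", 4, 6, "discourse marker", "coder A")
print(item.category, "->", retrieve(item, corpus))
```

Because every coded item points back to a position in the stored data, a second analyst can inspect the same stretch and agree or disagree with the code, which is the point of the slide.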
Slide19

pair work 1

Turn towards the fellow researcher next to you and give him or her your assessment of his or her project as being predominantly theory driven or predominantly data driven.
Obviously, this must be based on the brief descriptions you have received.
Use only 2 minutes and give him or her some minutes to respond
Repeat the exercise for the other member of the pair!

Slide20

What corpora?

What corpora can be good for

II

Slide21

Corpora of written language A: Synchronic

National corpora (tend to be huge, multifaceted and balanced in some way)
  Problem: What kind of language user is viewed as the exemplary individual?
Specialized corpora
  by text type or genre: newspaper text types, manuals, parliamentary debates etc.
  by producer: authors, language learners, etc.
  by speech event: consultations with the GP, hospital counselling, radio discussions
  by language: saving languages close to extinction

Slide22

Corpora of written language B: Diachronic

What periodization?
What text types?
What kind of representation (normalization, multi-level representations)?
Errors and quality checks against the originals (or in other words, who made the transcriptions and who proofread them)?

Slide23

Variation as a problem

Variation is ubiquitous:
  In historical records before there was any norm (enforcement)
  In synchronic records because of
    difference between known norms (American vs. British English)
    unintentional variation (errors)
    identity-bearing semiotic variation

Slide24

Spoken language corpora

Types of recordings
The role of the recording device
Since you cannot (as yet) search directly in the sound, you normally transcribe, BUT this is an interpretation: ’only what is there and exactly as it is’ is an ideal and unbelievably hard to attain
Transcription as theory (Ochs)
Comparability as an issue; and the shortcomings of orthography as another
Alignment of sound and transcription (at what level?)

Slide25

What are the normal types of annotation

The distinction between data and metadata
Systems for securing comparable metadata
  At what level again?
  Where are the metadata?
Linguistic annotations
  PoS tagging
  Parsing
  Lemmatization and multi-level representations

Slide26

What types of results? 1

Frequencies
  Frequencies of what?
  TTRs (type-token ratios; lexical diversity)
Tendencies:
  Zipf’s law:

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. For example, in the Brown Corpus "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.

Slide27
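The frequency notions from Slide 26 (word frequencies, the type-token ratio and Zipf's rank-frequency relation) can be tried out with a few lines of Python. This is a minimal sketch under stated assumptions: the tokenizer is deliberately crude (lowercase ASCII letters and apostrophes only, so it would mangle Danish æ, ø, å), the sample sentence is invented, and the function names are made up for this illustration. The three Brown Corpus counts are the ones quoted on the slide.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (crude, ASCII only)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens: list[str]) -> float:
    """Distinct word forms (types) divided by running words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def rank_frequency(tokens: list[str]) -> list[tuple[int, str, int]]:
    """Return (rank, word, frequency) rows, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


# Invented toy text, far too small to show Zipf's law itself.
sample = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(sample)
print(f"tokens={len(tokens)} types={len(set(tokens))} "
      f"TTR={type_token_ratio(tokens):.2f}")

# Counts quoted on the slide; under Zipf's law rank * frequency
# should stay roughly constant.
brown = [("the", 69971), ("of", 36411), ("and", 28852)]
for rank, (word, freq) in enumerate(brown, start=1):
    print(rank, word, freq, rank * freq)
```

For the three Brown Corpus words the products are 69,971, 72,822 and 86,556, so the inverse proportionality holds only approximately. Note also that the type-token ratio depends heavily on text length, so it is only comparable across samples of similar size.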

What types of results? 2

Frequencies 2:
Combining a structural analysis and a mathematical one:
Halliday’s conjecture: There are two types of grammatical systems, equi and skewed, and for the skewed, the relationship between the marked and the unmarked categories is approximately 1 to 9

Slide28
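Halliday's conjecture from Slide 27 can be made concrete by counting the instances of the two terms of a binary system and checking whether the marked term accounts for about half or about one tenth of them. The following is a hedged sketch: the function names, the tolerance of 0.05 and the counts are all assumptions made for this illustration, not part of the conjecture itself.

```python
def marked_share(unmarked: int, marked: int) -> float:
    """Proportion of all instances that realize the marked term."""
    total = unmarked + marked
    if total == 0:
        raise ValueError("no instances counted")
    return marked / total


def classify(unmarked: int, marked: int, tolerance: float = 0.05) -> str:
    """Label a two-term system 'equi' (about 0.5/0.5), 'skewed' (about 0.9/0.1)
    or 'other'. The tolerance is an arbitrary choice made for this sketch."""
    share = marked_share(unmarked, marked)
    if abs(share - 0.5) <= tolerance:
        return "equi"
    if abs(share - 0.1) <= tolerance:
        return "skewed"
    return "other"


# Invented counts for two hypothetical two-term systems.
print(classify(900, 100))  # marked term is one tenth of the instances
print(classify(510, 490))  # the two terms are about equally frequent
```

A real test of the conjecture would need counts from an annotated corpus and a principled decision about how far from 0.5 or 0.1 a system may fall before it counts as neither type.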

What types of results? 3

Concordances
What words occur with which others?
(Compare the original interest in systematizing words in context (KWIC index, i.e. keyword in context) for constructing dictionaries)
Constructions, formulae or just context?
The theoretical affinity to usage-based and exemplar theories (as against rule-based and innate theories)

Slide29
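A concordance in the KWIC format mentioned on Slide 28 lists every occurrence of a search word together with a fixed window of context. The function below is a minimal sketch, assuming whitespace tokenization and a context window measured in words; the name `kwic`, its parameters and the sample sentence are invented for this illustration.

```python
def kwic(tokens: list[str], keyword: str, width: int = 3) -> list[tuple[int, str]]:
    """Return (position, line) for every occurrence of keyword,
    with `width` words of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append((i, f"{left:>25} [{token}] {right}"))
    return lines


# Invented sample sentence.
text = ("We hope to make headway soon but to make progress "
        "we must take advantage of the data")
for position, line in kwic(text.split(), "make"):
    print(position, line)
```

Keeping the token position with each line means that every hit can be traced back to its place in the data, and sorting the lines by the word to the right of the keyword is a simple way to bring collocations such as "make headway" together.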

Corpora and subcorpora

If a corpus consists of clearly delimited subcorpora, e.g. representing different text types or genres, it is feasible to profile the various subcorpora as against each other
This may lead to a characterization of different ’styles’ in terms of PoS, e.g. a noun-intensive style, a pronoun-intensive style and a verb-intensive style
or it may directly lead to a characterization of the genre or text type
The special case of literary stylistics, e.g. Shakespearean corpus linguistics

Slide30
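The profiling of subcorpora described on Slide 29 amounts to comparing relative PoS frequencies across the subcorpora. The sketch below assumes the data are already PoS-tagged; the two tiny subcorpora, the tag names and the function names are invented for illustration, and naming a style after the single most frequent tag is a simplification.

```python
from collections import Counter

# Hypothetical, hand-tagged toy subcorpora: lists of (word, PoS tag) pairs.
subcorpora = {
    "news": [("government", "NOUN"), ("announces", "VERB"),
             ("budget", "NOUN"), ("cuts", "NOUN")],
    "conversation": [("I", "PRON"), ("think", "VERB"), ("you", "PRON"),
                     ("know", "VERB"), ("it", "PRON")],
}


def pos_profile(tagged: list[tuple[str, str]]) -> dict[str, float]:
    """Relative frequency of each PoS tag in one subcorpus."""
    counts = Counter(tag for _, tag in tagged)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def dominant_style(profile: dict[str, float]) -> str:
    """Name the style after the most frequent PoS tag, e.g. 'NOUN-intensive'."""
    return max(profile, key=profile.get) + "-intensive"


for name, tagged in subcorpora.items():
    profile = pos_profile(tagged)
    print(name, dominant_style(profile),
          {tag: round(p, 2) for tag, p in profile.items()})
```

With real data the profiles would be compared across subcorpora of very different sizes, which is why relative rather than absolute frequencies are used here.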

A note on statistics

Statistics for linguistic purposes is based on probability
Probability or likelihood has to be modified by context when it comes to linguistics, and that is why the Shannon-Weaver approach to information (information theory) was abandoned
What is the relationship between context, likelihood and perception of significance? The answer is: We do not know (and nobody seems to care much)

Slide31

The problem of non-occurrence or rare occurrences

What can we conclude from a non-occurrence?
What can we conclude from rare occurrences?
The case of ’plus at’ in Danish

Slide32
Slide33
Slide34
Slide35

How to make such figures and how to interpret them

Searches
Grouping the data
  Speaker variables (metadata)
  Internal variables: roles: interviewer vs. informant
Interpreting the figures:
  Innovation
  Core - periphery
  Spoken - written

Slide36

pair work

On the basis of your reading of the material for this afternoon, formulate three questions that you would like the plenum to address!

Slide37

Danish corpora

The Danish Clarin and the European vision
The DSL corpora: ODS and DDO and the parole corpus; websites with historical material
The dialect dictionaries: Cordiale and Jysk Ordbog
The LANCHART corpus and the LANCHART CLARIN
The Odense child language corpora
The Danish Talkbank and Childes/Clan
New corpora at CIP and Calpiu

Slide38

Names

The veterans: John McH. Sinclair; M.A.K. Halliday
The discipline owners: Tony McEnery, Edward Finegan, Douglas Biber, and their associates
The theoreticians and new hopes: Stefan Gries, Svenja Adolphs, Tyler Kendall