Slide1
Methodology of linguistic research
Corpus linguistics
Frans Gregersen, 25th of January
Slide2
History of corpus linguistics I
Slide3
To understand the data revolution
… we have to look at data in general
Four types of data:
The historical record; early records scant, all records written
Opinions about and observations of language use
Actual, contemporary language use, whether written or spoken: behavioral data
Intuitions or introspective evidence of competence: introspective or judgmental data
Slide4
Major technological breakthroughs
The invention of writing systems, in particular the invention of alphabetic writing systems
The invention of the printing press
The invention of the telegraph
The invention of the radio
The invention of sound films
The invention of the tape recorder
The invention of TV
The invention of the Xerox machine
The invention of the hard disk and the possibility of storing vast amounts of data
Slide5
What is corpus linguistics?
Corpus linguistics in the strict sense is the use of stored and retrievable data in linguistic work
Any collection of data which is structured so that it may be used for various analytical purposes may be seen as a corpus and used as such
The new possibilities inherent in large-scale corpora lie in the automation of hitherto tedious and time-consuming work such as building up a corpus or counting instances etc.
Slide6
The lexicographical bias
The development of lexicography, how to make a dictionary
The database for lexicographical work
excerpts; from literature, mundane prose, etc.?
questionnaires
older lexicographers, the power of tradition
Selection of lemmas, lemmatization
Slide7
Lexicography as science
Relationships between semantics and lexicography
semantic analyses of lemmas
the notion of semantic primitives
the dubious nature of the word or, even worse, of the lemma and the lemma traditions
As against the morpheme, root or the lexeme
As against the fixed expressions, collocations etc. [headway]; kæreste, vb. at kæreste med nogen
As against the utterance, turn or sentence
Slide8
What corpora did to lexicography and vice versa
The real text and the lexicographical tradition
the many values of context
the systematization of evidence
lemmas and constructions; make headway; take advantage of, in reference to yours of,
The two most useful notions of frequency and concordance, i.e. word forms in context
Corpora are in general WORD based with extensions into collocations and with the possibility of getting concordances
Slide9
Data driven vs. theory driven 1
Inductive approaches, deductive approaches
problems of inductivism:
how can we know what is worth looking at, if we do not have a theory?
fishing for a significant result might make you catch something, but what is it really?
or: how can we interpret the result if we do not have a framework?
Slide10
Data driven vs. theory driven 2
problems of deductivism:
How do we administer the meeting of theory and data; which types of data are relevant for this particular theory?
How do we evaluate competing claims; what will count as counterevidence and what will be discarded offhand as irrelevant to the theory?
Can a theory be disproved by empirical work?
Falsification or falsifiability as the hallmark of the scientific approach/stance
Slide11
A view from the sociology of science
Maybe what has happened to the humanities with the advent of corpus linguistics/literary analysis/textual analysis is that the humanities have become more like the natural sciences?
Slide12
Robert K. Merton 1942
The sociology of science:
The need to scrutinize the ethos of science became pressing in 1942 in the face of the Nazi denial of rationalism
Slide13
The CUDOS norms
Communism – the common ownership of scientific discoveries, according to which scientists give up intellectual property in exchange for recognition and esteem;
Universalism – according to which claims to truth are evaluated in terms of universal or impersonal criteria, and not on the basis of race, class, gender, religion, or nationality;
Disinterestedness – according to which scientists are rewarded for acting in ways that outwardly appear to be selfless;
Organized skepticism – all ideas must be tested and are subject to rigorous, structured community scrutiny
Slide14
The natural sciences and the CUDOS norms
Communism: the need for collective work and division of labour, the praxis of big science
Universalism: Natural sciences are more universal and less bound to local languages, traditions and culture than the human sciences
Disinterestedness: sharing results
Organized skepticism: double blind peer review systems, evaluation procedures
Slide15
The human sciences and CUDOS
Communism: More individual researchers than groups; prototypically little science
Universalism: Human sciences less universal and more bound to local languages, traditions and cultures than the natural sciences
Disinterestedness: Often the individual is tied to the method and results
Organized skepticism: More skepticism than organization
Slide16
The corpus revolution: accountability
Even though we have to make choices which involve creating codes for the various analytical categories, and thus always have a somewhat hermeneutical basis, if the rest of the research work, i.e. the data analysis, may be made independent of subjective reasoning, we may approach an ideal of a science which is disinterested and universal.
This hinges on the notion of accountability.
Slide17
Accountability
It is uniquely retrievable which unit or item was coded as belonging to this or that category and where it is located in the data
Thus the data may always be inspected or even re-analyzed by others
PROBLEM: confidentiality
Thus all data are in principle – if stored permanently as part of the project – available for others – the norms of communism and organized skepticism may be applied
Slide18
Slide19
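The accountability requirement described above (every coded item can be traced to a category and to a unique location in the data) can be pictured as a small record type. The following is only a sketch under assumptions: the field names, the `retrieve` helper and the toy interview are hypothetical and are not taken from any actual project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedItem:
    """One coding decision, with enough information to find the item again."""
    doc_id: str    # which recording or text the item comes from (hypothetical id)
    start: int     # index of the first token of the coded item
    end: int       # index one past the last token of the coded item
    category: str  # analytical category assigned by the coder
    coder: str     # who made the coding decision


def retrieve(item, corpus):
    """Return the exact tokens that were coded, so that others can inspect them."""
    return corpus[item.doc_id][item.start:item.end]


def items_in_category(items, category):
    """All coding decisions for one category, e.g. for re-analysis by others."""
    return [it for it in items if it.category == category]


# Hypothetical toy data: one tokenized interview and one coding decision.
corpus = {"interview_01": "jeg tror plus at det er fint".split()}
item = CodedItem("interview_01", 2, 4, "plus_at", "coder_A")
print(retrieve(item, corpus))
```

Because each record stores its own location, a second researcher can go from any reported count back to the individual items behind it, which is the point of the accountability principle. The confidentiality problem is not addressed by this sketch.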
Pair work 1
Turn towards the fellow researcher next to you and give him or her your assessment of his or her project as being predominantly theory driven or predominantly data driven.
Obviously, this must be based on the brief descriptions you have received.
Use only 2 minutes and give him or her some minutes to respond
Repeat the exercise for the other member of the pair!
Slide20
What corpora?
What corpora can be good for II
Slide21
Corpora of written language
A: Synchronic
National corpora (tend to be huge, multifaceted and balanced in some way)
Problem: What kind of language user is viewed as the exemplary individual?
Specialized corpora
by text type or genre: newspaper text types, manuals, parliamentary debates etc.
by producer: authors, language learners, etc.
by speech event: consultations with the GP, hospital counselling, radio discussions
by language: saving languages close to extinction
Slide22
Corpora of written language
B: Diachronic
What periodization?
What text types?
What kind of representation (normalization, multi-level representations)?
Errors and quality checks against the originals (or in other words, who made the transcriptions and who proofread them)?
Slide23
Variation as a problem
Variation is ubiquitous:
In historical records before there was any norm (enforcement)
In synchronic records because of
difference between known norms (American vs. British English)
unintentional variation (errors)
identity-bearing semiotic variation
Slide24
Spoken language corpora
Types of recordings
The role of the recording device
Since you cannot (as yet) search directly in the sound, you normally transcribe BUT this is an interpretation: ‘only what is there and exactly as it is’ is an ideal and unbelievably hard to attain
Transcription as theory (Ochs)
Comparability as an issue; and the shortcomings of orthography as another
Alignment of sound and transcription (at what level?)
Slide25
What are the normal types of annotation?
The distinction between data and metadata
Systems for securing comparable metadata
At what level again?
Where are the metadata?
Linguistic annotations
PoS tagging
Parsing
Lemmatization and multi-level representations
Slide26
What types of results? 1
Frequencies
Frequencies of what?
TTRs (type-token ratios, a measure of lexical diversity)
Tendencies: Zipf's law:
Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. For example, in the Brown Corpus "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
Slide27
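The two frequency measures mentioned above can be sketched in a few lines. This is a minimal illustration, assuming a corpus that is already tokenized into a list of word forms; the function names `type_token_ratio` and `zipf_table` are made up for the example. The last lines compare the three Brown Corpus counts quoted above with the values a strict reading of Zipf's law would predict from the top count.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Distinct word forms (types) divided by running words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def zipf_table(tokens, top=5):
    """Rows of (rank, word, observed frequency, frequency expected under Zipf).

    The expected value is the frequency of the rank-1 word divided by the rank.
    """
    ranked = Counter(tokens).most_common(top)
    if not ranked:
        return []
    top_freq = ranked[0][1]
    return [(rank, word, freq, top_freq / rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]


# The Brown Corpus counts quoted in the text above.
brown = [("the", 69971), ("of", 36411), ("and", 28852)]
for rank, (word, freq) in enumerate(brown, start=1):
    expected = brown[0][1] / rank
    # Ratio near 1.0 means the observed count is close to the Zipf prediction.
    print(rank, word, freq, round(expected), round(freq / expected, 2))
```

Note that the type-token ratio depends heavily on text length, so it is only comparable across samples of similar size.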
What types of results? 2
Frequencies 2:
Combining a structural analysis and a mathematical one:
Halliday’s conjecture:
There are two types of grammatical systems: ‘equi’ and ‘skewed’, and for the skewed, the relationship between the marked and the unmarked categories is approximately 1 to 9
Slide28
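Halliday's conjecture as stated above can be turned into a simple check on counts from a two-term system. The sketch below is an assumption-laden illustration: the ideal shares 0.5 (equi) and 0.1 (skewed, 1 to 9) follow from the statement above, but the rule of picking whichever ideal is closer, and the counts in the example, are hypothetical choices made only for this example.

```python
def marked_share(unmarked, marked):
    """Proportion of the marked term in a two-term grammatical system."""
    total = unmarked + marked
    if total == 0:
        raise ValueError("no observations")
    return marked / total


def nearest_type(unmarked, marked):
    """Label a system 'equi' or 'skewed' by whichever ideal share is closer.

    Ideal shares: 0.5 for equi, 0.1 for skewed (1 to 9).
    The nearest-ideal rule is an illustrative assumption, not part of the conjecture.
    """
    share = marked_share(unmarked, marked)
    return "equi" if abs(share - 0.5) < abs(share - 0.1) else "skewed"


# Hypothetical counts for two systems.
print(nearest_type(unmarked=900, marked=100))
print(nearest_type(unmarked=520, marked=480))
```

A real test of the conjecture would also need a statement of how much deviation from 0.5 or 0.1 is tolerated, which this sketch leaves open.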
What types of results? 3
Concordances
What words occur with which others?
(Compare the original interest in systematizing words in context (KWIC-index) for constructing dictionaries)
Constructions, formulae or just context?
The theoretical affinity to usage-based and exemplar theories (as against rule-based and innate theories)
Slide29
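A keyword-in-context (KWIC) concordance of the kind referred to above can be sketched as follows. The function name `kwic`, the window size and the example sentence are assumptions for illustration; real concordancers add sorting, lemma search and alignment of the output columns.

```python
def kwic(tokens, keyword, window=3):
    """Return one (left context, keyword, right context) triple per hit.

    Matching is case-insensitive on the word form; `window` is the number of
    tokens shown on each side of the hit.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical example sentence, tokenized by whitespace.
tokens = "we make headway and they make progress".split()
for left, hit, right in kwic(tokens, "make", window=2):
    print(f"{left:>12} | {hit} | {right}")
```

Reading down the right-hand column of such a listing is what makes recurrent combinations such as "make headway" visible.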
Corpora and subcorpora
If a corpus consists of clearly delimited subcorpora, e.g. representing different text types or genres, it is feasible to profile the various subcorpora as against each other
This may lead to a characterization of different ‘styles’, in terms of PoS e.g. a noun-intensive style, a pronoun-intensive style and a verb-intensive style, or it may directly lead to a characterization of the genre or text type
The special case of literary stylistics, e.g. Shakespearean corpus linguistics
Slide30
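Profiling subcorpora against each other in terms of PoS, as described above, amounts to comparing relative tag frequencies. The sketch below assumes that each subcorpus is available as a list of (word, tag) pairs; the tag names NOUN, PRON and VERB, the function names and the two tiny subcorpora are hypothetical and stand in for the output of whatever tagger a project uses.

```python
from collections import Counter


def pos_profile(tagged_tokens, tags=("NOUN", "PRON", "VERB")):
    """Share of each selected PoS tag among all tokens of one subcorpus."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: (counts[tag] / total if total else 0.0) for tag in tags}


def dominant_style(profile):
    """Name of the tag with the highest share, e.g. 'NOUN' for a noun-intensive style."""
    return max(profile, key=profile.get)


# Hypothetical, hand-tagged toy subcorpora.
subcorpora = {
    "news": [("minister", "NOUN"), ("presents", "VERB"), ("budget", "NOUN"), ("today", "ADV")],
    "chat": [("I", "PRON"), ("saw", "VERB"), ("you", "PRON"), ("there", "ADV")],
}
for name, tagged in subcorpora.items():
    profile = pos_profile(tagged)
    print(name, profile, dominant_style(profile))
```

Shares are used instead of raw counts so that subcorpora of different sizes can be compared.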
A note on statistics
Statistics for linguistic purposes is based on probability
Probability or likelihood has to be modified by context when it comes to linguistics and that is why the Shannon-Weaver approach to information (information theory) was abandoned
What is the relationship between context, likelihood and perception of significance? The answer is: We do not know (and nobody seems to care much)
Slide31
The problem of non-occurrence or rare occurrences
What can we conclude from a non-occurrence?
What can we conclude from rare occurrences?
The case of ‘plus at’ in Danish
Slide32
Slide33
Slide34
Slide35
How to make such figures and how to interpret them
Searches
Grouping the data
Speaker variables (metadata)
Internal variables: roles: interviewer vs. informant
Interpreting the figures:
Innovation
Core – periphery
Spoken – written
Slide36
Pair work
On the basis of your reading of the material for this afternoon, formulate three questions that you would like the plenum to address!
Slide37
Danish corpora
The Danish Clarin and the European vision
The DSL corpora: ODS and DDO and the parole corpus; websites with historical material
The dialect dictionaries: Cordiale and Jysk Ordbog
The LANCHART corpus and the LANCHART CLARIN
The Odense child language corpora
The Danish Talkbank and Childes/Clan
New corpora at CIP and Calpiu
Slide38
Names
The veterans: John McH Sinclair; MAK Halliday
The discipline owners: Tony McEnery, Edward Finegan, Douglas Biber, and their associates
The theoreticians and new hopes: Stefan Gries, Svenja Adolphs, Tyler Kendall