Slide1
VACNET: Extracting and analyzing non-trivial linguistic structures at scale
Matthew Brook O’Donnell, Nick C. Ellis, Ute Römer & Gin Corden
English Language Institute
mbod@umich.edu
The 2nd University of Michigan Workshop on Data, Text, Web, and Social Network Mining
April 22, 2011
Slide2
Challenge of natural language for data mining
Much work in NLP, IR and text classification relies upon frequency analysis of
  single words
  n-grams (contiguous word sequences of various lengths)
Units are computationally trivial to retrieve
  Map-Reduce ‘Hello World’!
Techniques tend to use a ‘bag of words’ approach, disregarding structure
Frequency and statistical measures highlight distinctive items and document ‘aboutness’
But this is a weak proxy for meaning, which remains somewhat elusive!
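To illustrate why single words and n-grams count as computationally trivial units, here is a minimal sketch of frequency counting in Python. The tokenizer, the function names and the sample text are assumptions made for illustration; they are not part of the VACNET tooling.

```python
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t]

def ngram_counts(tokens, n):
    """Count contiguous n-token sequences (n=1 gives plain word counts)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented sample text, used only to show the counting step.
text = "She talked about her book. He talked about his work."
tokens = tokenize(text)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print(unigrams.most_common(3))
print(bigrams[("talked", "about")])
```

Note that the bigram counter happily pairs words across the sentence boundary ("book", "he"), which is the kind of structure a bag-of-words approach disregards.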
Typical NLP Pipeline
text → NLP tools: Sentence splitting → Word Tokenization → POS tagging → Chunking/Parsing → Named-entity recognition → meaning???
Can linguistic theory help?...
Slide3
Challenge of natural language for data mining
Analyzing natural language data is, in my opinion, the problem of the next 2-3 decades. It's an incredibly difficult issue […] It's imperative to have a sufficiently sophisticated and rigorous enough approach that relevant context can be taken into account.
Matthew Russell, Author
Can linguistic theory help?... What is relevant context?
Slide4
Learning meaning in language
How are we able to learn what novel words mean?
She moogles about her book
each word contributes individual meaning
verb meaning central; yet verbs are highly polysemous
larger configuration of words carries meaning
these we call CONSTRUCTIONS: V about n
‘recurrent patterns of linguistic elements that serve some well-defined linguistic function’ (Ellis 2003)
moogle inherits its interpretation from the echoes of the verbs that occupy the V about n Verb Argument Construction (VAC), words like:
talk, think, know, write, hear, speak, worry … fuss, shout, mutter, gossip
Slide5
VACNET
Collaborative project to build an inventory of a large number of English verb argument constructions (VACs) using:
  the COBUILD Verb Grammar Patterns descriptions
  tools from computational and corpus linguistics
  techniques from data mining, machine learning and network analysis
The project has two components:
  a computational analysis of corpora to retrieve instances and verb distributions for the full range of VACs
  psycholinguistic experiments to measure speaker knowledge of these VACs through the verbs selected.
Slide6
V about n – some examples
He grumbled incessantly about the ‘disgusting’ provincial life we had to lead on the island
You should try to think ahead about your financial situation
He worried persistently about the poverty of his social life
She would keep banging on about her son
He wondered briefly about the effects of prolonged exposure to solar radiation
The housekeeper left the room, muttering about ingratitude
I do not want to carp about the work of the Committee
‘Any views expressed about Master Matthew?’
There are several other valid justifications for teaching explicitly about language
Those who gossip about him tend to meet with nasty accidents.
Slide7
VACNET: Language engineering challenge
TASK
  retrieval of 700+ verb argument constructions from a 100 million word corpus with minimal intervention but requirement for high precision and high recall
Multidisciplinary TEAM
  linguists, psychologists, information scientists
  undergraduate/graduate student RAs, faculty
TOOLS
  dependency parsed corpus in GraphML format
  web-based precision analysis tool
  processing pipeline
Slide8
Architecture: Large scale extraction of constructions
Components shown in the architecture diagram:
  CORPUS: BNC 100 mill. words
  POS tagging & Dependency Parsing
  CouchDB document database
  COBUILD Verb Patterns
  Construction Descriptions
  Word Sense Disambiguation
  Statistical analysis of distributions
  Web application
  WordNet
  Network Analysis & Visualization
  DISCO
Slide9
Method: Collaborative semi-automatic extraction
Slide10
DEFINE search graph
ENCODE in XML
CONVERT to Python code
SEARCH corpus and RECORD matches
ERROR CODE
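The DEFINE and SEARCH steps above can be sketched as a small graph match over one dependency-parsed sentence. This is only an illustration under stated assumptions: the token dictionaries, the `prep`/`pobj` relation labels (a Stanford-style scheme) and the function name `find_v_prep_n` are hypothetical, and the slides do not show the project's actual XML encoding or generated Python code.

```python
def find_v_prep_n(sentence, prep="about"):
    """Return (verb lemma, preposition, noun lemma) triples for one parsed sentence."""
    by_id = {tok["id"]: tok for tok in sentence}
    matches = []
    for tok in sentence:
        # Node 1: the preposition itself, attached with a 'prep' relation.
        if tok["lemma"] != prep or tok["deprel"] != "prep":
            continue
        # Node 2: the preposition's head must be a verb.
        head = by_id.get(tok["head"])
        if head is None or not head["pos"].startswith("V"):
            continue
        # Node 3: the preposition must govern a noun object.
        for child in sentence:
            if (child["head"] == tok["id"] and child["deprel"] == "pobj"
                    and child["pos"].startswith("N")):
                matches.append((head["lemma"], prep, child["lemma"]))
    return matches

# Hand-annotated parse of "She worried about the book" (assumed annotation).
sentence = [
    {"id": 1, "lemma": "she", "pos": "PRP", "head": 2, "deprel": "nsubj"},
    {"id": 2, "lemma": "worry", "pos": "VBD", "head": 0, "deprel": "root"},
    {"id": 3, "lemma": "about", "pos": "IN", "head": 2, "deprel": "prep"},
    {"id": 4, "lemma": "the", "pos": "DT", "head": 5, "deprel": "det"},
    {"id": 5, "lemma": "book", "pos": "NN", "head": 3, "deprel": "pobj"},
]
matches = find_v_prep_n(sentence)
print(matches)
```

Matching on the parse rather than on word order is what lets a pattern like this skip intervening adverbs ("worried persistently about") and reject noun-headed uses ("a book about language").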
Method: Collaborative semi-automatic extraction
Slide11
Precision analysis interface
Slide12
Recall analysis
Slide13
Results: V about n
Verb      VAC freq
talk      2232
think     1810
know      879
hear      349
worry     347
forget    322
write     299
ask       298
say       281
care      250
go        203
complain  192
speak     181
find      148
learn     143
be        124
feel      118
look      115
wonder    102
read      101
Types (list of different verbs occurring in VAC)
Frequency (Zipfian?)
Contingency (attraction between verb and construction)
Semantics: prototypicality of meaning & radial structure (Zipfian?)
Slide14
Results: V about n
Verb        VAC freq  Corpus freq  Faithfulness
reminisce   12        98           0.1224
moon        5         51           0.098
talk        2232      24566        0.0909
brag        5         69           0.0725
carp        5         72           0.0694
worry       347       5027         0.069
generalize  15        244          0.0615
generalise  10        176          0.0568
enthuse     13        236          0.0551
complain    192       3947         0.0486
grumble     18        407          0.0442
rave        9         205          0.0439
fret        10        265          0.0377
fuss        9         246          0.0366
care        250       7064         0.0354
speculate   26        771          0.0337
gossip      9         270          0.0333
forget      322       10240        0.0314
enquire     38        1341         0.0283
prowl       5         179          0.0279
Slide15
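The Faithfulness values in the V about n table are consistent with a simple ratio: a verb's frequency inside the VAC divided by its overall corpus frequency. The sketch below recomputes a few rows under that reading; the function name and the choice of rows are illustrative assumptions, and the slides themselves do not spell out the formula.

```python
def faithfulness(vac_freq, corpus_freq):
    """Share of a verb's corpus occurrences that fall inside the VAC."""
    return vac_freq / corpus_freq

# (VAC freq, corpus freq) pairs copied from the V about n table.
counts = {
    "reminisce": (12, 98),
    "talk": (2232, 24566),
    "worry": (347, 5027),
    "forget": (322, 10240),
}
scores = {verb: round(faithfulness(v, c), 4) for verb, (v, c) in counts.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

print(scores)
print(ranked)
```

Ranking by this ratio is why a rare verb such as reminisce (12 of 98 occurrences) outranks talk, even though talk is by far the most frequent verb in the construction.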
VAC             Types  Tokens  TTR (%)  Lead verb    Token*Faith  MI cw
V about n       365    3519    10.37    talk         talk         brag
V across n      799    4889    16.34    come         spread       scud
V after n       1168   7528    15.52    look         look         lust
V among pl-n    417    1228    33.96    find         divide       nestle
V around n      761    3801    20.02    look         revolve      traipse
V as adj        235    1012    23.22    know         regard       class
V as n          1702   34383   4.95     know         act          masquerade
V at n          1302   9700    13.42    look         look         officiate
V between pl-n  669    3572    18.73    distinguish  distinguish  sandwich
V for n         2779   79894   3.48     look         wait         vie
V in n          2671   37766   7.07     find         result       couch
V into n        1873   46488   4.03     go           divide       delve
V like n        548    1972    27.79    look         look         glitter
V n n           663    9183    7.22     give         give         rename
V of n          1222   25155   4.86     think        consist      partake
V over n        1312   9269    14.15    go           preside      pore
V through n     842    4936    17.06    go           riffle       riffle
V to n          707    7823    9.04     go           listen       randomize
V towards n     190    732     25.96    move         bias         gravitate
V under n       1243   8514    14.6     come         come         wilt
V way prep      365    2896    12.6     make         wend         wend
V with n        1942   24932   7.79     deal         deal         pepper
Slide16
Initial Findings
The frequency distributions for the types occupying each VAC are Zipfian
The most frequent verb for each VAC is much more frequent than the other members, taking the lion’s share of the distribution
The most frequent verb in each VAC is prototypical of that construction’s functional interpretation
  generic in its action semantics
VACs are selective in their verb form family occupancy:
  Individual verbs select particular constructions
  Particular constructions select particular verbs
  There is greater contingency between verb types and constructions
VACs are coherent in their semantics.
Slide17
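One rough way to probe the Zipfian finding is to fit a straight line to log frequency against log rank; a slope near -1 is the classic Zipf signature. The sketch below does this with ordinary least squares on the twenty V about n verb frequencies reported in the results. It is an illustration only: the function name is an assumption, a truncated top-20 list gives a crude estimate, and the slides do not say how the project tested for a Zipfian distribution.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# VAC frequencies of the twenty most frequent verbs in V about n.
vac_freqs = [2232, 1810, 879, 349, 347, 322, 299, 298, 281, 250,
             203, 192, 181, 148, 143, 124, 118, 115, 102, 101]
slope = loglog_slope(vac_freqs)
print(round(slope, 2))
```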
What do speakers know about verbs in VACs?
s/he/it _____ about the …
Two Experiments
276 Native & 276 L1 German speakers of English
Asked to fill the gap with the first word that comes to mind given the prompt
Slide18
But what about meaning?...
We want to quantify the semantic coherence or ‘clumpiness’ of the verbs extracted in the previous steps
  {think, know, hear, worry, care,…} ABOUT
Construction patterns are productive units in language and subject to polysemy just like words. Can we separate meaning groups within verb distributions?
  Communication: {talk, write, ask, say, argue,…} ABOUT
  Cognition: {think, know, hear, worry, care,…} ABOUT
  Motion: {move, walk, run, fall, wander,…} ABOUT
The semantic sources must not be based on localized distributional language analysis
Use WordNet and Roget’s
  Pedersen et al. (2004) WordNet similarity measures
  Kennedy, A. (2009). The Open Roget's Project: Electronic lexical knowledge base
Slide19
Building a semantic network
Use semantic similarity scores for pairs of verbs (from WordNet, Roget, DISCO, etc.) to create network
  nodes = lemma forms from VAC/CEC distribution
  edges = link between nodes for top n similarity scores for a pair of verbs
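The node and edge definitions above can be sketched with plain dictionaries. The similarity scores below are invented for illustration (they do not come from WordNet, Roget or DISCO), `build_network` is a hypothetical name, and "top n" is read here as the n highest-scoring pairs overall, which is one possible interpretation of the slide.

```python
def build_network(similarity, top_n):
    """Undirected graph keeping only the top_n highest-scoring verb pairs as edges."""
    nodes = sorted({verb for pair in similarity for verb in pair})
    ranked_pairs = sorted(similarity.items(), key=lambda item: item[1], reverse=True)
    adjacency = {verb: set() for verb in nodes}
    for (a, b), _score in ranked_pairs[:top_n]:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

# Made-up pairwise similarity scores between verb lemmas.
similarity = {
    ("talk", "speak"): 0.9,
    ("think", "know"): 0.85,
    ("talk", "say"): 0.8,
    ("think", "wonder"): 0.7,
    ("talk", "think"): 0.3,
    ("say", "know"): 0.2,
}
network = build_network(similarity, top_n=4)
for verb, neighbours in network.items():
    print(verb, sorted(neighbours))
```

With these toy scores the weak cross-group pairs fall below the cut-off, so the communication verbs and the cognition verbs end up in separate clumps.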
Clusters labelled in the network figure: Cognition, Communication
Slide20
Community detection
top 100 verbs in VAC V about n
Slide21
Semantic Networks
Exploring community detection algorithms
  Edge Betweenness (Girvan and Newman, 2002)
  Fast Greedy (Clauset, Newman and Moore, 2004)
  Label Propagation (Raghavan, Albert and Kumara, 2007)
  Leading Eigenvector (Newman 2006)
  Spinglass (Reichardt and Bornholdt, 2006)
  Walktrap (Pons and Latapy, 2005)
  Louvain (Blondel, Guillaume, Lambiotte and Lefebvre, 2008)
Slide22
Slide23
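To give a feel for what the community detection algorithms listed for the semantic networks do, here is a minimal pure-Python sketch of the label propagation idea: every node starts with its own label and repeatedly adopts the most common label among its neighbours. This is a simplified illustration, not the implementation used in the project (the slides do not name the software), and the toy graph is an assumption.

```python
import random

def label_propagation(adjacency, seed=0, max_rounds=100):
    """Asynchronous label propagation over an undirected adjacency dict."""
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # each node starts alone
    nodes = sorted(adjacency)
    for _ in range(max_rounds):
        rng.shuffle(nodes)  # visit nodes in random order each round
        changed = False
        for node in nodes:
            if not adjacency[node]:
                continue  # isolated nodes keep their own label
            counts = {}
            for neighbour in adjacency[node]:
                counts[labels[neighbour]] = counts.get(labels[neighbour], 0) + 1
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[node] in candidates:
                continue  # current label is already among the most common
            labels[node] = rng.choice(candidates)  # break ties at random
            changed = True
        if not changed:
            break  # converged: no node wants to switch
    return labels

# Toy graph: two separate triangles of verbs (assumed for illustration).
graph = {
    "talk": {"speak", "say"}, "speak": {"talk", "say"}, "say": {"talk", "speak"},
    "think": {"know", "wonder"}, "know": {"think", "wonder"}, "wonder": {"think", "know"},
}
communities = label_propagation(graph)
print(communities)
```

Labels only spread along edges, so verbs in unconnected parts of the network can never end up in the same community; on real, connected networks the result depends on the random visiting order.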
VACNET Summary
Challenge of natural language for data mining
Project investigates usage of VACs at scale
  constructions = meaning through patterns
  IR challenge: retrieving non-trivial structures at scale
Corpus analysis examines the distributions of verbs in VACs
  frequency distribution
  contingency
  semantics
Psycholinguistic experiments explore the psychological reality of VACs
VACNET structured inventory
  verb to construction and construction to verb
  valuable for NLP and DM tasks
Future explorations:
  Train classifiers on our datasets
  Tackle ‘big data’ sets
Thank you!
mbod@umich.edu