Slide1

VACNET: Extracting and analyzing non-trivial linguistic structures at scale

Matthew Brook O’Donnell, Nick C. Ellis, Ute Römer & Gin Corden
English Language Institute
mbod@umich.edu

The 2nd University of Michigan Workshop on Data, Text, Web, and Social Network Mining
April 22, 2011

Slide2

Challenge of natural language for data mining

Much work in NLP, IR and text classification relies upon frequency analysis of
- single words
- n-grams (contiguous word sequences of various lengths)

Units are computationally trivial to retrieve
- Map-Reduce ‘Hello World’!

Techniques tend to use a ‘bag of words’ approach, disregarding structure

Frequency and statistical measures highlight distinctive items and document ‘aboutness’
- But this is a weak proxy for meaning, which remains somewhat elusive!
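A minimal sketch of the word and n-gram counting described above, assuming naive lower-casing and whitespace tokenization (the function names and example sentences are illustrative, not taken from the project). It shows why a bag of words disregards structure: two sentences with different word order share the same unigram counts and differ only once longer n-grams are counted.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, n):
    """Count n-grams after naive lower-casing and whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))

# Two made-up sentences with the same words but a different structure
a = "she talked about her book"
b = "her book talked about she"

unigrams_a = ngram_counts(a, 1)
unigrams_b = ngram_counts(b, 1)
bigrams_a = ngram_counts(a, 2)
bigrams_b = ngram_counts(b, 2)

# The bag-of-words view cannot tell the two sentences apart
same_bag = unigrams_a == unigrams_b
# The bigram view can, but only through local word order
same_bigrams = bigrams_a == bigrams_b
```

Here `same_bag` is true while `same_bigrams` is false; neither view says anything about which word is the argument of which verb.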

Typical NLP Pipeline

text → Sentence splitting → Word Tokenization → POS tagging → Chunking/Parsing → Named-entity recognition → meaning???

Can linguistic theory help?... NLP tools:

Slide3

Challenge of natural language for data mining

Analyzing natural language data is, in my opinion, the problem of the next 2-3 decades. It's an incredibly difficult issue […] It's imperative to have a sufficiently sophisticated and rigorous enough approach that relevant context can be taken into account.

Matthew Russell, Author

Can linguistic theory help?... What is relevant context?

Slide4

Learning meaning in language

How are we able to learn what novel words mean?

She moogles about her book

each word contributes individual meaning
verb meaning central; yet verbs are highly polysemous
larger configuration of words carries meaning
these we call CONSTRUCTIONS

V about n

moogle inherits its interpretation from the echoes of the verbs that occupy the V about n Verb Argument Construction (VAC), words like:
talk, think, know, write, hear, speak, worry … fuss, shout, mutter, gossip

‘recurrent patterns of linguistic elements that serve some well-defined linguistic function’ (Ellis 2003)

Slide5

VACNET

Collaborative project to build an inventory of a large number of English verb argument constructions (VACs) using:
- the COBUILD Verb Grammar Patterns descriptions
- tools from computational and corpus linguistics
- techniques from data mining, machine learning and network analysis

The project has two components:
- a computational analysis of corpora to retrieve instances and verb distributions for the full range of VACs
- psycholinguistic experiments to measure speaker knowledge of these VACs through the verbs selected

Slide6

V about n – some examples

He grumbled incessantly about the ‘disgusting’ provincial life we had to lead on the island
You should try to think ahead about your financial situation
He worried persistently about the poverty of his social life
She would keep banging on about her son
He wondered briefly about the effects of prolonged exposure to solar radiation
The housekeeper left the room, muttering about ingratitude
I do not want to carp about the work of the Committee
‘Any views expressed about Master Matthew?’
There are several other valid justifications for teaching explicitly about language
Those who gossip about him tend to meet with nasty accidents.

Slide7

VACNET: Language engineering challenge

TASK
- retrieval of 700+ verb argument constructions from a 100 million word corpus with minimal intervention but requirement for high precision and high recall

Multidisciplinary TEAM
- linguists, psychologists, information scientists
- undergraduate/graduate student RAs, faculty

TOOLS
- dependency parsed corpus in GraphML format
- web-based precision analysis tool
- processing pipeline

Slide8

Architecture: Large scale extraction of constructions

[Architecture diagram; component labels: CORPUS (BNC 100 mill. words), POS tagging & Dependency Parsing, CouchDB document database, COBUILD Verb Patterns, Construction Descriptions, Word Sense Disambiguation, Statistical analysis of distributions, Web application, WordNet, Network Analysis & Visualization, DISCO]

Slide9

Method: Collaborative semi-automatic extraction

Slide10

DEFINE search graph

ENCODE in XML

CONVERT to Python code

SEARCH corpus and RECORD matches

ERROR CODE
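The DEFINE, ENCODE, CONVERT and SEARCH steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the XML schema, the token fields and the dependency labels (`prep`, `pobj`) are all hypothetical, and the "corpus" is a single hand-parsed sentence. The sketch also assumes each pattern node is listed after the node it depends on.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of the search graph for "V about n"
PATTERN_XML = """
<vac name="V about n">
  <node id="v" pos="VERB"/>
  <node id="p" lemma="about" head="v" deprel="prep"/>
  <node id="n" pos="NOUN" head="p" deprel="pobj"/>
</vac>
"""

def load_pattern(xml_text):
    """Turn the XML description into a list of node-constraint dicts."""
    root = ET.fromstring(xml_text)
    return root.get("name"), [dict(node.attrib) for node in root.findall("node")]

def node_ok(token, spec, bound):
    """Check one token against one pattern node, given earlier bindings."""
    for key in ("pos", "lemma", "deprel"):
        if key in spec and token[key] != spec[key]:
            return False
    if "head" in spec and token["head"] != bound[spec["head"]]["id"]:
        return False
    return True

def search(sentence, specs, bound=None):
    """Yield every assignment of tokens to pattern nodes (depth-first)."""
    bound = bound or {}
    if len(bound) == len(specs):
        yield dict(bound)
        return
    spec = specs[len(bound)]
    for token in sentence:
        if node_ok(token, spec, bound):
            yield from search(sentence, specs, {**bound, spec["id"]: token})

# Toy dependency parse of "He worried persistently about the poverty"
sentence = [
    {"id": 1, "lemma": "he", "pos": "PRON", "head": 2, "deprel": "nsubj"},
    {"id": 2, "lemma": "worry", "pos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "lemma": "persistently", "pos": "ADV", "head": 2, "deprel": "advmod"},
    {"id": 4, "lemma": "about", "pos": "ADP", "head": 2, "deprel": "prep"},
    {"id": 5, "lemma": "the", "pos": "DET", "head": 6, "deprel": "det"},
    {"id": 6, "lemma": "poverty", "pos": "NOUN", "head": 4, "deprel": "pobj"},
]

name, specs = load_pattern(PATTERN_XML)
# RECORD matches as (verb lemma, noun lemma) pairs
recorded = [(m["v"]["lemma"], m["n"]["lemma"]) for m in search(sentence, specs)]
```

Because the match runs over head links rather than adjacent words, the intervening adverb does not block it, which a contiguous "worried about" n-gram search would miss.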

Method: Collaborative semi-automatic extraction

Slide11

Precision analysis interface

Slide12

Recall analysis

Slide13

Results: V about n

| Verb | VAC freq |
|---|---|
| talk | 2232 |
| think | 1810 |
| know | 879 |
| hear | 349 |
| worry | 347 |
| forget | 322 |
| write | 299 |
| ask | 298 |
| say | 281 |
| care | 250 |
| go | 203 |
| complain | 192 |
| speak | 181 |
| find | 148 |
| learn | 143 |
| be | 124 |
| feel | 118 |
| look | 115 |
| wonder | 102 |
| read | 101 |

Types (list of different verbs occurring in VAC)
Frequency (Zipfian?)
Contingency (attraction of verb and construction)
Semantics: prototypicality of meaning & radial structure (Zipfian?)

Slide14


Results: V about n

| Verb | VAC freq | Corpus freq | Faithfulness |
|---|---|---|---|
| reminisce | 12 | 98 | 0.1224 |
| moon | 5 | 51 | 0.098 |
| talk | 2232 | 24566 | 0.0909 |
| brag | 5 | 69 | 0.0725 |
| carp | 5 | 72 | 0.0694 |
| worry | 347 | 5027 | 0.069 |
| generalize | 15 | 244 | 0.0615 |
| generalise | 10 | 176 | 0.0568 |
| enthuse | 13 | 236 | 0.0551 |
| complain | 192 | 3947 | 0.0486 |
| grumble | 18 | 407 | 0.0442 |
| rave | 9 | 205 | 0.0439 |
| fret | 10 | 265 | 0.0377 |
| fuss | 9 | 246 | 0.0366 |
| care | 250 | 7064 | 0.0354 |
| speculate | 26 | 771 | 0.0337 |
| gossip | 9 | 270 | 0.0333 |
| forget | 322 | 10240 | 0.0314 |
| enquire | 38 | 1341 | 0.0283 |
| prowl | 5 | 179 | 0.0279 |

Slide15
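The faithfulness and TTR figures in these tables look like simple ratios. The slides do not spell out the formulas, so the definitions below are inferred from the numbers: VAC freq divided by corpus freq reproduces the faithfulness column, and types per 100 tokens reproduces the TTR column. A small sketch with hypothetical function names:

```python
def faithfulness(vac_freq, corpus_freq):
    """Proportion of a verb's corpus occurrences that fall inside the VAC."""
    return vac_freq / corpus_freq

def type_token_ratio(types, tokens):
    """Number of distinct verb types per 100 tokens of the VAC."""
    return 100.0 * types / tokens

# (VAC freq, corpus freq) pairs copied from the V about n table
verbs = {
    "reminisce": (12, 98),
    "talk": (2232, 24566),
    "worry": (347, 5027),
    "care": (250, 7064),
}

# Rank verbs by faithfulness, highest first
ranked = sorted(verbs, key=lambda v: faithfulness(*verbs[v]), reverse=True)

reminisce_faith = round(faithfulness(12, 98), 4)        # table value: 0.1224
about_ttr = round(type_token_ratio(365, 3519), 2)       # table value: 10.37
```

Ranking by faithfulness puts a rare verb such as reminisce above talk, even though talk is by far the most frequent verb in the construction.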

| VAC | Types | Tokens | TTR | Lead verb | Token*Faith | MI cw |
|---|---|---|---|---|---|---|
| V about n | 365 | 3519 | 10.37 | talk | talk | brag |
| V across n | 799 | 4889 | 16.34 | come | spread | scud |
| V after n | 1168 | 7528 | 15.52 | look | look | lust |
| V among pl-n | 417 | 1228 | 33.96 | find | divide | nestle |
| V around n | 761 | 3801 | 20.02 | look | revolve | traipse |
| V as adj | 235 | 1012 | 23.22 | know | regard | class |
| V as n | 1702 | 34383 | 4.95 | know | act | masquerade |
| V at n | 1302 | 9700 | 13.42 | look | look | officiate |
| V between pl-n | 669 | 3572 | 18.73 | distinguish | distinguish | sandwich |
| V for n | 2779 | 79894 | 3.48 | look | wait | vie |
| V in n | 2671 | 37766 | 7.07 | find | result | couch |
| V into n | 1873 | 46488 | 4.03 | go | divide | delve |
| V like n | 548 | 1972 | 27.79 | look | look | glitter |
| V n n | 663 | 9183 | 7.22 | give | give | rename |
| V of n | 1222 | 25155 | 4.86 | think | consist | partake |
| V over n | 1312 | 9269 | 14.15 | go | preside | pore |
| V through n | 842 | 4936 | 17.06 | go | riffle | riffle |
| V to n | 707 | 7823 | 9.04 | go | listen | randomize |
| V towards n | 190 | 732 | 25.96 | move | bias | gravitate |
| V under n | 1243 | 8514 | 14.6 | come | come | wilt |
| V way prep | 365 | 2896 | 12.6 | make | wend | wend |
| V with n | 1942 | 24932 | 7.79 | deal | deal | pepper |

Slide16

The frequency distributions for the types occupying each VAC are Zipfian
- The most frequent verb for each VAC is much more frequent than the other members, taking the lion’s share of the distribution

The most frequent verb in each VAC is prototypical of that construction’s functional interpretation
- generic in its action semantics

VACs are selective in their verb form family occupancy:
- Individual verbs select particular constructions
- Particular constructions select particular verbs
- There is greater contingency between verb types and constructions

VACs are coherent in their semantics.
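One rough way to check the Zipfian claim is to fit a straight line to log frequency against log rank; Zipf's law predicts a slope near -1. The sketch below uses only the top 20 V about n verb frequencies listed on the results slide, so it illustrates the check rather than reproducing the project's analysis over all verb types. The function name is hypothetical.

```python
import math

# Top 20 verb frequencies for V about n, copied from the results slide
freqs = [2232, 1810, 879, 349, 347, 322, 299, 298, 281, 250,
         203, 192, 181, 148, 143, 124, 118, 115, 102, 101]

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

slope = loglog_slope(freqs)

# Share of the listed top-20 occurrences taken by the lead verb (talk)
lead_share = freqs[0] / sum(freqs)
```

A fuller check would use every verb type in the construction and a proper goodness-of-fit test rather than a simple regression slope.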

Initial Findings

Slide17

What do speakers know about verbs in VACs?

s/he/it _____ about the …

Two Experiments
- 276 Native & 276 L1 German speakers of English
- Asked to fill the gap with the first word that comes to mind given the prompt

Slide18

But what about meaning?...

We want to quantify the semantic coherence or ‘clumpiness’ of the verbs extracted in the previous steps
{think, know, hear, worry, care,…} ABOUT

Construction patterns are productive units in language and subject to polysemy just like words. Can we separate meaning groups within verb distributions?
- Communication: {talk, write, ask, say, argue,…} ABOUT
- Cognition: {think, know, hear, worry, care,…} ABOUT
- Motion: {move, walk, run, fall, wander,…} ABOUT

The semantic sources must not be based on localized distributional language analysis
- Use WordNet and Roget’s

Pedersen et al. (2004) WordNet similarity measures
Kennedy, A. (2009). The Open Roget's Project: Electronic lexical knowledge base

Slide19

Building a semantic network

Use semantic similarity scores for pairs of verbs (from WordNet, Roget, DISCO, etc.) to create network
- nodes = lemma forms from VAC/CEC distribution
- edges = link between nodes for top n similarity scores for a pair of verbs
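A minimal sketch of the network construction just described, assuming "top n" means the n highest-scoring verb pairs overall (it could equally mean the top n neighbours per verb). The similarity scores are invented for illustration and are not values from WordNet, Roget or DISCO; `build_network` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative similarity scores only; not real WordNet/Roget/DISCO values
similarity = {
    ("talk", "say"): 0.9, ("talk", "ask"): 0.7, ("say", "ask"): 0.8,
    ("think", "know"): 0.9, ("think", "worry"): 0.6, ("know", "worry"): 0.5,
    ("talk", "think"): 0.3, ("say", "know"): 0.2, ("ask", "worry"): 0.1,
}

def build_network(scores, top_n):
    """Keep the top_n highest-scoring verb pairs as undirected edges."""
    best = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
    graph = defaultdict(set)
    for (a, b), _score in best:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

# nodes = verb lemmas, edges = the six strongest similarity links
network = build_network(similarity, top_n=6)
```

With these made-up scores the six strongest links fall entirely within {talk, say, ask} and {think, know, worry}, so the two groups are not connected. A verb with no link among the top n would not appear as a node at all under this reading.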

Cognition

Communication

Slide20

Community detection

top 100 verbs in VAC V about n

Slide21

Semantic Networks

Exploring community detection algorithms
- Edge Betweenness (Girvan and Newman, 2002)
- Fast Greedy (Clauset, Newman and Moore, 2004)
- Label Propagation (Raghavan, Albert and Kumara, 2007)
- Leading Eigenvector (Newman 2006)
- Spinglass (Reichardt and Bornholdt, 2006)
- Walktrap (Pons and Latapy, 2005)
- Louvain (Blondel, Guillaume, Lambiotte and Lefebvre, 2008)

Slide22
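Several of the algorithms listed (Fast Greedy, Leading Eigenvector, Louvain) work by maximising modularity, a score for how well a partition separates densely linked groups. The sketch below only computes that score for a given partition of a toy verb network; it is not an implementation of any of the listed algorithms, and the edges and community labels are made up for illustration.

```python
def modularity(edges, communities):
    """Newman's modularity Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs; communities: dict mapping node -> label.
    """
    m = len(edges)
    inside = {}   # number of edges with both ends in the community
    degree = {}   # summed degree of the community's nodes
    for u, v in edges:
        cu, cv = communities[u], communities[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy network: two triangles of verbs joined by a single bridging edge
edges = [
    ("talk", "say"), ("say", "ask"), ("talk", "ask"),
    ("think", "know"), ("know", "worry"), ("think", "worry"),
    ("ask", "think"),
]

split = {"talk": "communication", "say": "communication", "ask": "communication",
         "think": "cognition", "know": "cognition", "worry": "cognition"}
lumped = {verb: "all" for verb in split}

q_split = modularity(edges, split)    # two communities
q_lumped = modularity(edges, lumped)  # everything in one community
```

The two-community partition scores higher than putting every verb in one group, which is the kind of comparison a modularity-based algorithm makes while searching over partitions.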
Slide23

VACNET Summary

Challenge of natural language for data mining

Project investigates usage of VACs at scale
- constructions = meaning through patterns
- IR challenge: retrieving non-trivial structures at scale

Corpus analysis examines the distributions of verbs in VACs
- frequency distribution
- contingency
- semantics

Psycholinguistic experiments explore the psychological reality of VACs

VACNET structured inventory
- verb to construction and construction to verb
- valuable for NLP and DM tasks

Future explorations:
- Train classifiers on our datasets
- Tackle ‘big data’ sets

Thank you!

mbod@umich.edu