Uploaded On 2016-10-10

Presentation Transcript

Slide1

600.465 Connecting the dots - I (NLP in Practice)

Delip Rao

delip@jhu.edu

Slide2

Slide3

What is “Text”?

Slide4

What is “Text”?

Slide5

What is “Text”?

Slide6

“Real” World

Tons of data on the web

A lot of it is text

In many languages

In many genres

Language by itself is complex.

The Web further complicates language.

Slide7

But we have 600.465

Adapted from: Jason Eisner

We can study anything about language ...

1. Formalize some insights

2. Study the formalism mathematically

3. Develop & implement algorithms

4. Test on real data

feature functions!

f(w_i = off, w_{i+1} = the)

f(w_i = obama, y_i = NP)
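The two feature functions above are binary indicators evaluated at a position i of a sentence. A minimal Python sketch, assuming tokens and labels are plain lists; the function names, the example sentence, and its labels are illustrative and not from the course:

```python
def f_off_the(words, i):
    """f(w_i = off, w_{i+1} = the): fires when 'off' is followed by 'the'."""
    return int(i + 1 < len(words)
               and words[i].lower() == "off"
               and words[i + 1].lower() == "the")


def f_obama_np(words, labels, i):
    """f(w_i = obama, y_i = NP): fires when 'obama' carries the label NP."""
    return int(words[i].lower() == "obama" and labels[i] == "NP")


# Toy sentence and labels, made up for illustration only.
words = ["Obama", "took", "off", "the", "jacket"]
labels = ["NP", "VP", "VP", "NP", "NP"]

# Evaluate each feature function at every position of the sentence.
off_the_vector = [f_off_the(words, i) for i in range(len(words))]
obama_np_vector = [f_obama_np(words, labels, i) for i in range(len(words))]
```

The first function looks only at the input words, while the second also looks at the label, which is the usual split between observation features and label-dependent features.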

Forward-Backward, Gradient Descent, L-BFGS, Simulated Annealing, Contrastive Estimation, …

Slide8

NLP for fun and profit

Making NLP more accessible

Provide APIs for common NLP tasks

var text = document.get(…);

var entities = agent.markNE(text);

Big $$$$

Backend to intelligent processing of text

Slide9

Desideratum: Multilinguality

Except for feature extraction, systems should be language-agnostic

Slide10

In this lecture

Understand how to solve and ace NLP tasks

Learn general methodology or approaches

End-to-End development using an example task

Overview of (un)common NLP tasks

Slide11

Case study: Named Entity Recognition

Slide12

Case study: Named Entity Recognition

Demo: http://viewer.opencalais.com

How do we build something like this?

How do we find out how well we are doing?

How can we improve?

Slide13

Case study: Named Entity Recognition

Define the problem

Say, PERSON, LOCATION, ORGANIZATION

The UN secretary general met president Obama at Hague.

The [UN]ORG secretary general met president [Obama]PER at [Hague]LOC.

Slide14

Case study: Named Entity Recognition

Collect data to learn from

Sentences with words marked as PER, ORG, LOC, NONE

How do we get this data?

Slide15

Pay the experts

Slide16

Wisdom of the crowds

Slide17

Getting the data: Annotation

Time consuming

Costs

$$$

Need for quality control

Inter-annotator agreement

Kappa score (Krippendorff, 1980)
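As a rough illustration of the quality-control step, agreement between two annotators can be corrected for chance. The sketch below assumes the two-annotator case and uses Cohen's kappa; the slide says only "Kappa score", so the choice of variant and the toy labels are assumptions:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Toy annotations, made up for illustration only.
annotator_1 = ["PER", "O", "LOC", "O", "ORG", "O"]
annotator_2 = ["PER", "O", "LOC", "O", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 5 of 6 tokens, but because both use O so often, part of that agreement is expected by chance and kappa comes out lower than the raw agreement.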

Smarter ways to annotate

Get fewer annotations: Active Learning

Rationales (Zaidan, Eisner & Piatko, 2007)

Slide18

Only France and Great Britain backed Fischler 's proposal .

Only        O
France      B-LOC
and         O
Great       B-LOC
Britain     I-LOC
backed      O
Fischler    B-PER
's          O
proposal    O
.           O
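The per-token tags above follow a B/I/O scheme: B- opens an entity, I- continues it, and O marks everything else. A minimal sketch of producing such tags from entity spans, assuming spans are given as (start, end, type) with an exclusive end index; the span representation and function name are illustrative:

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) entity spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, ent_type in spans:
        tags[start] = "B-" + ent_type          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + ent_type          # remaining tokens of the entity
    return tags


tokens = ["Only", "France", "and", "Great", "Britain",
          "backed", "Fischler", "'s", "proposal", "."]
spans = [(1, 2, "LOC"), (3, 5, "LOC"), (6, 7, "PER")]
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The B/I distinction is what lets adjacent entities of the same type stay separate, which a plain LOC/PER/O tagging could not express.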

Input x: the tokens. Labels y: one tag per token.

Slide19

1. Formalize some insights

2. Study the formalism mathematically

3. Develop & implement algorithms

4. Test on real data

Our recipe …

Slide20

NER: Designing features

Not as trivial as you think

The original text itself might be in ugly HTML

Cleaneval!

Need to segment sentences

Tokenize the sentences

Only | France | and | Great | Britain | backed | Fischler | 's | proposal | .
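A minimal sketch of the two preprocessing steps, sentence segmentation and tokenization, using only regular expressions. Both patterns are deliberately naive assumptions (abbreviations such as "Dr." or contractions such as "don't" are not handled), and the second sentence in the example is made up:

```python
import re

# A possessive 's, a run of word characters, or a single punctuation mark.
TOKEN_RE = re.compile(r"'s|\w+|[^\w\s]")


def split_sentences(text):
    """Naively split after ., ! or ? when followed by space and a capital."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Split a sentence into word, possessive and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


text = "Only France and Great Britain backed Fischler's proposal. Others objected."
sentences = split_sentences(text)
tokens = tokenize(sentences[0])
```

In practice this step is usually preceded by HTML cleanup and handled by a trained segmenter or tokenizer; the sketch only shows where the token sequence used by the later feature tables comes from.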

Preprocessing

Slide21

NER: Designing features

Only        IS_CAPITALIZED
France      IS_CAPITALIZED
and
Great       IS_CAPITALIZED
Britain     IS_CAPITALIZED
backed
Fischler    IS_CAPITALIZED
's
proposal
.

Slide22

NER: Designing features

Only        IS_CAPITALIZED  IS_SENT_START
France      IS_CAPITALIZED
and
Great       IS_CAPITALIZED
Britain     IS_CAPITALIZED
backed
Fischler    IS_CAPITALIZED
's
proposal
.

Slide23

Slide24

NER: Designing features

Only        IS_CAPITALIZED  IS_SENT_START
France      IS_CAPITALIZED  IN_LEXICON_LOC
and
Great       IS_CAPITALIZED
Britain     IS_CAPITALIZED  IN_LEXICON_LOC
backed
Fischler    IS_CAPITALIZED
's
proposal
.

Slide25

NER: Designing features

Only        POS=RB   IS_CAPITALIZED  IS_SENT_START
France      POS=NNP  IS_CAPITALIZED  IN_LEXICON_LOC
and         POS=CC
Great       POS=NNP  IS_CAPITALIZED
Britain     POS=NNP  IS_CAPITALIZED  IN_LEXICON_LOC
backed      POS=VBD
Fischler    POS=NNP  IS_CAPITALIZED
's          POS=XX
proposal    POS=NN
.           POS=.
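A minimal sketch of how per-token features like those in the table might be computed. It assumes the POS tags are supplied by an upstream tagger and that the location lexicon is a small hand-made set; `LOC_LEXICON` and `extract_features` are illustrative names, not part of the lecture:

```python
LOC_LEXICON = {"france", "britain"}  # toy gazetteer, assumed for illustration


def extract_features(tokens, pos_tags, i):
    """Return the active feature strings for the token at position i."""
    word = tokens[i]
    feats = ["POS=" + pos_tags[i]]        # POS tag comes from preprocessing
    if word[:1].isupper():
        feats.append("IS_CAPITALIZED")
    if i == 0:
        feats.append("IS_SENT_START")
    if word.lower() in LOC_LEXICON:
        feats.append("IN_LEXICON_LOC")
    return feats


tokens = ["Only", "France", "and", "Great", "Britain",
          "backed", "Fischler", "'s", "proposal", "."]
pos_tags = ["RB", "NNP", "CC", "NNP", "NNP", "VBD", "NNP", "XX", "NN", "."]

feature_table = [extract_features(tokens, pos_tags, i)
                 for i in range(len(tokens))]
```

Only this function touches the surface form of the words, which is in line with the earlier desideratum that everything except feature extraction stay language-agnostic.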

These are extracted during preprocessing!

Slide26

NER: Designing features

Only        POS=RB   IS_CAPITALIZED  PREV_WORD=_NONE_
France      POS=NNP  IS_CAPITALIZED  PREV_WORD=only
and         POS=CC   PREV_WORD=france
Great       POS=NNP  IS_CAPITALIZED  PREV_WORD=and
Britain     POS=NNP  IS_CAPITALIZED  PREV_WORD=great
backed      POS=VBD  …  PREV_WORD=britain
Fischler    POS=NNP  IS_CAPITALIZED  PREV_WORD=backed
's          POS=XX   PREV_WORD=fischler
proposal    POS=NN   PREV_WORD='s
.           POS=.    PREV_WORD=proposal

Slide27

NER: Designing features

…

Slide28

NER: Designing features

Can you think of other features?

HAS_DIGITS

IS_HYPHENATED

IS_ALLCAPS

FREQ_WORD

RARE_WORD

USEFUL_UNIGRAM_PER

USEFUL_BIGRAM_PER

USEFUL_UNIGRAM_LOC

USEFUL_BIGRAM_LOC

USEFUL_UNIGRAM_ORG

USEFUL_BIGRAM_ORG

USEFUL_SUFFIX_PER

USEFUL_SUFFIX_LOC

USEFUL_SUFFIX_ORG

WORD

PREV_WORD

NEXT_WORD

PREV_BIGRAM

NEXT_BIGRAM

POS

PREV_POS

NEXT_POS

PREV_POS_BIGRAM

NEXT_POS_BIGRAM

IN_LEXICON_PER

IN_LEXICON_LOC

IN_LEXICON_ORG

IS_CAPITALIZED

Slide29

Case: Named Entity Recognition

Evaluation Metrics

Token accuracy: What percent of the tokens got labeled correctly

Problem with accuracy

Precision-Recall-F

Model     F-Score
HMM       74.6
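Token accuracy can look reasonable even when an entity is wrong. The sketch below compares token accuracy with entity-level precision, recall and F for the slide's tagging president/O Barack/B-PER Obama/O, treated here as a system output. The gold tags, the exact-span-match criterion, and the use of balanced F (F1) are assumptions:

```python
def bio_to_spans(tags):
    """Collect (start, end, type) spans, end exclusive, from BIO tags.

    An I- tag that does not continue an open span of the same type is ignored.
    """
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag.startswith("I-") and start is not None and tag[2:] == ent_type:
            continue                         # still inside the current entity
        if start is not None:
            spans.add((start, i, ent_type))  # close the open entity
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]     # open a new entity
    return spans


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def entity_prf(gold, pred):
    """Entity-level precision, recall and F1 under exact span match."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["O", "B-PER", "I-PER"]  # assumed gold tags: president Barack Obama
pred = ["O", "B-PER", "O"]      # the tagging shown on the slide
accuracy = token_accuracy(gold, pred)
precision, recall, f1 = entity_prf(gold, pred)
```

Under these assumptions two of three tokens are correct, yet the only predicted entity ("Barack") does not match the gold entity ("Barack Obama"), so entity-level precision, recall and F are all zero.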

president    O
Barack       B-PER
Obama        O

Slide30

NER: How can we improve?

Engineer better features

Design better models

Conditional Random Fields

[Figure: inputs x1, x2, x3, x4 with labels Y1, Y2, Y3, Y4]

Model     F-Score
HMM       74.6
TBL       81.2
Maxent    85.6
CRF       91.7
…

Slide31

NER: How else can we improve?

Unlabeled data!

example from Jerry Zhu

Slide32

NER: Challenges

Domain transfer

WSJ → NYT

WSJ → Blogs ??

WSJ → Twitter ??!?

Tough nut: Organizations

Non textual data?

Entity Extraction is a Boring Solved Problem – or is it? (Vilain, Su and Lubar, 2007)

Slide33

NER: Related application

Extracting real estate information from Craigslist Ads

Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-in kitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door.

The same ad, with the separately formatted spans in brackets:

Our oversized one, [two and three bedroom apartment] homes with floor plans featuring [1 and 2 baths] offer space unlike any competition. Relax and enjoy the views from your own [private] [balcony] or [patio], or feel free to entertain, with plenty of space in your [large] [living room], [dining area] and [eat-in] kitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door.

Slide34

NER: Related Application

BioNLP: Annotation of chemical entities

Corbett, Batchelor & Teufel, 2007

Slide35

Shared Tasks: NLP in practice

Shared Task

Everybody works on a (mostly) common dataset

Evaluation measures are defined

Participants get ranked on the evaluation measures

Advance the state of the art

Set benchmarks

Tasks involve common hard problems or new interesting problems