600.465 Connecting the dots - I (NLP in Practice)
Delip Rao
delip@jhu.edu
What is “Text”?
“Real” World
Tons of data on the web
A lot of it is text
In many languages
In many genres
Language by itself is complex.
The Web further complicates language.
But we have 600.465
Adapted from: Jason Eisner
We can study anything about language ...
1. Formalize some insights
2. Study the formalism mathematically
3. Develop & implement algorithms
4. Test on real data
Feature functions!
f(w_i = "off", w_i+1 = "the")
f(w_i = "obama", y_i = NP)
Forward-Backward, Gradient Descent, L-BFGS, Simulated Annealing, Contrastive Estimation, …
NLP for fun and profit
Making NLP more accessible
Provide APIs for common NLP tasks
var text = document.get(…);
var entities = agent.markNE(text);
Big $$$$
Backend to intelligent processing of text
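The snippet above is illustrative pseudocode for such an API. As a sketch of the same idea in Python, here is a toy `mark_ne` in which a crude capitalization regex stands in for a real trained tagger (every name here is invented):

```python
import re

def mark_ne(text):
    """Toy NE marker: wrap maximal runs of Capitalized words in <NE> tags."""
    # A real service would run a trained tagger; this regex is only a placeholder.
    pattern = r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b"
    return re.sub(pattern, lambda m: "<NE>" + m.group(0) + "</NE>", text)

print(mark_ne("The UN secretary general met president Obama at Hague."))
# → <NE>The</NE> UN secretary general met president <NE>Obama</NE> at <NE>Hague</NE>.
```

Note how the heuristic wrongly tags sentence-initial "The" and misses the all-caps "UN"; this is exactly why the backends behind such APIs use learned models with richer features.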
Desideratum: Multilinguality
Except for feature extraction, systems should be language agnostic
In this lecture
Understand how to solve and ace NLP tasks
Learn general methodology and approaches
End-to-end development using an example task
Overview of (un)common NLP tasks
Case study: Named Entity Recognition
Case study: Named Entity Recognition
Demo: http://viewer.opencalais.com
How do we build something like this?
How do we find out how well we are doing?
How can we improve?
Case study: Named Entity Recognition
Define the problem
Say, PERSON, LOCATION, ORGANIZATION
The UN secretary general met president Obama at Hague.
The [UN]ORG secretary general met president [Obama]PER at [Hague]LOC .
Case study: Named Entity Recognition
Collect data to learn from
Sentences with words marked as PER, ORG, LOC, NONE
How do we get this data?
Pay the experts
Wisdom of the crowds
Getting the data: Annotation
Time consuming
Costs $$$
Need for quality control
Inter-annotator agreement
Kappa score (Krippendorff, 1980)
Smarter ways to annotate
Get fewer annotations: Active Learning
Rationales (Zaidan, Eisner & Piatko, 2007)
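As a concrete sketch of chance-corrected agreement, here is the two-annotator (Cohen-style) kappa; the PER/LOC/ORG/NONE annotations below are invented for illustration:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items the annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a = ["PER", "LOC", "NONE", "NONE", "ORG", "NONE"]
b = ["PER", "LOC", "NONE", "ORG", "ORG", "NONE"]
print(round(cohens_kappa(a, b), 3))  # → 0.769
```

Raw agreement here is 5/6, but kappa discounts the agreement two annotators would reach by labeling at random from their own label distributions.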
Input x: Only France and Great Britain backed Fischler 's proposal .

Labels y:
Only      O
France    B-LOC
and       O
Great     B-LOC
Britain   I-LOC
backed    O
Fischler  B-PER
's        O
proposal  O
.         O
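Once a sentence carries BIO labels like these, the entity spans can be read back off the tag sequence; a minimal sketch (function name invented):

```python
def bio_to_spans(tokens, tags):
    """Read (label, text) entity spans off a BIO tag sequence."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if current:
                spans.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)          # the current entity continues
        else:                               # an O tag ends any open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

tokens = "Only France and Great Britain backed Fischler 's proposal .".split()
tags = ["O", "B-LOC", "O", "B-LOC", "I-LOC", "O", "B-PER", "O", "O", "O"]
print(bio_to_spans(tokens, tags))
# → [('LOC', 'France'), ('LOC', 'Great Britain'), ('PER', 'Fischler')]
```

The B-/I- distinction is what lets adjacent entities of the same type ("France" next to "Great Britain") stay separate spans.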
1. Formalize some insights
2. Study the formalism mathematically
3. Develop & implement algorithms
4. Test on real data
Our recipe …
NER: Designing features
Not as trivial as you think
Original text itself might be in ugly HTML
CleanEval!
Need to segment sentences
Tokenize the sentences

Preprocessing:
Only France and Great Britain backed Fischler 's proposal .
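The tokenization step can be sketched with a crude regex, just to make the idea concrete; real systems use trained or rule-rich tokenizers, and this pattern only handles ASCII words, 's clitics, and single punctuation marks:

```python
import re

def tokenize(sentence):
    """Crude tokenizer: split off 's clitics and punctuation as their own tokens."""
    return re.findall(r"'s|\w+|[^\w\s]", sentence)

print(tokenize("Only France and Great Britain backed Fischler's proposal."))
# → ['Only', 'France', 'and', 'Great', 'Britain', 'backed', 'Fischler', "'s", 'proposal', '.']
```

Note how "Fischler's" becomes two tokens, matching the tokenized sentence above.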
NER: Designing features
Only      IS_CAPITALIZED
France    IS_CAPITALIZED
and
Great     IS_CAPITALIZED
Britain   IS_CAPITALIZED
backed
Fischler  IS_CAPITALIZED
's
proposal
.
NER: Designing features
Only      IS_CAPITALIZED IS_SENT_START
France    IS_CAPITALIZED
and
Great     IS_CAPITALIZED
Britain   IS_CAPITALIZED
backed
Fischler  IS_CAPITALIZED
's
proposal
.
NER: Designing features
Only      IS_CAPITALIZED IS_SENT_START
France    IS_CAPITALIZED IN_LEXICON_LOC
and
Great     IS_CAPITALIZED
Britain   IS_CAPITALIZED IN_LEXICON_LOC
backed
Fischler  IS_CAPITALIZED
's
proposal
.
NER: Designing features
Only      POS=RB   IS_CAPITALIZED IS_SENT_START
France    POS=NNP  IS_CAPITALIZED IN_LEXICON_LOC
and       POS=CC
Great     POS=NNP  IS_CAPITALIZED
Britain   POS=NNP  IS_CAPITALIZED IN_LEXICON_LOC
backed    POS=VBD
Fischler  POS=NNP  IS_CAPITALIZED
's        POS=XX
proposal  POS=NN
.         POS=.
These are extracted during preprocessing!
NER: Designing features
Only      POS=RB   IS_CAPITALIZED … PREV_WORD=_NONE_ …
France    POS=NNP  IS_CAPITALIZED … PREV_WORD=only …
and       POS=CC   … PREV_WORD=france …
Great     POS=NNP  IS_CAPITALIZED … PREV_WORD=and …
Britain   POS=NNP  IS_CAPITALIZED … PREV_WORD=great …
backed    POS=VBD  … PREV_WORD=britain …
Fischler  POS=NNP  IS_CAPITALIZED … PREV_WORD=backed …
's        POS=XX   … PREV_WORD=fischler …
proposal  POS=NN   … PREV_WORD='s …
.         POS=.    … PREV_WORD=proposal …
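The per-token feature tables above can be produced by a small extractor. A sketch with invented names, where the location lexicon is assumed to be a plain set of lowercased strings:

```python
def token_features(tokens, pos_tags, i, loc_lexicon):
    """Feature dict for token i, mirroring the slides' feature names."""
    tok = tokens[i]
    feats = {
        "WORD": tok.lower(),
        "POS": pos_tags[i],
        "PREV_WORD": tokens[i - 1].lower() if i > 0 else "_NONE_",
        "NEXT_WORD": tokens[i + 1].lower() if i + 1 < len(tokens) else "_NONE_",
    }
    if tok[0].isupper():
        feats["IS_CAPITALIZED"] = True
    if i == 0:
        feats["IS_SENT_START"] = True
    if tok.lower() in loc_lexicon:
        feats["IN_LEXICON_LOC"] = True
    return feats

tokens = "Only France and Great Britain backed Fischler 's proposal .".split()
pos = ["RB", "NNP", "CC", "NNP", "NNP", "VBD", "NNP", "XX", "NN", "."]
print(token_features(tokens, pos, 1, {"france", "britain"}))
```

Each token becomes a bag of named features, which is exactly the form a log-linear tagger consumes.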
NER: Designing features
Can you think of other features?
HAS_DIGITS
IS_HYPHENATED
IS_ALLCAPS
FREQ_WORD
RARE_WORD
USEFUL_UNIGRAM_PER
USEFUL_BIGRAM_PER
USEFUL_UNIGRAM_LOC
USEFUL_BIGRAM_LOC
USEFUL_UNIGRAM_ORG
USEFUL_BIGRAM_ORG
USEFUL_SUFFIX_PER
USEFUL_SUFFIX_LOC
USEFUL_SUFFIX_ORG
WORD
PREV_WORD
NEXT_WORD
PREV_BIGRAM
NEXT_BIGRAM
POS
PREV_POS
NEXT_POS
PREV_POS_BIGRAM
NEXT_POS_BIGRAM
IN_LEXICON_PER
IN_LEXICON_LOC
IN_LEXICON_ORG
IS_CAPITALIZED
Case study: Named Entity Recognition
Evaluation Metrics
Token accuracy: what percent of the tokens got labeled correctly
Problem with accuracy: the entity below is only half recognized, yet 2 of 3 tokens count as "correct":
president  O
Barack     B-PER
Obama      O
Precision-Recall-F

Model  F-Score
HMM    74.6
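Entity-level Precision-Recall-F can be sketched as set overlap between gold and predicted (label, start, end) spans; under this metric a half-tagged entity like "Barack" alone counts as both a false positive and a false negative. A minimal sketch with invented span data:

```python
def precision_recall_f1(gold_spans, pred_spans):
    """Span-level scores: a prediction counts only if label and boundaries both match."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                   # exact matches only
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("PER", 1, 3), ("LOC", 5, 6)}       # (label, start, end)
pred = {("PER", 1, 3), ("ORG", 5, 6)}       # second entity mislabeled
print(precision_recall_f1(gold, pred))      # → (0.5, 0.5, 0.5)
```

This is why NER leaderboards report F-score over spans rather than token accuracy.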
NER: How can we improve?
Engineer better features
Design better models
Conditional Random Fields

Observations x1 … x4 with labels Y1 … Y4 in a linear chain

Model   F-Score
HMM     74.6
TBL     81.2
Maxent  85.6
CRF     91.7
…       …
NER: How else can we improve?
Unlabeled data!
Example from Jerry Zhu
NER: Challenges
Domain transfer
WSJ → NYT
WSJ → Blogs ??
WSJ → Twitter ??!?
Tough nut: Organizations
Non-textual data?
Entity Extraction is a Boring Solved Problem – or is it? (Vilain, Su and Lubar, 2007)
NER: Related application
Extracting real estate information from Craigslist ads
Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-in kitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door.
Marked spans: "two and three bedroom apartment", "1 and 2 baths", "private balcony", "patio", "large living room", "dining area", "eat-in kitchen"
NER: Related Application
BioNLP: Annotation of chemical entities
(Corbett, Batchelor & Teufel, 2007)
Shared Tasks: NLP in practice
Shared Task
Everybody works on a (mostly) common dataset
Evaluation measures are defined
Participants get ranked on the evaluation measures
Advance the state of the art
Set benchmarks
Tasks involve common hard problems or new interesting problems