High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data
Preeti Bhargava, Nemanja Spasojevic, Guoning Hu
Applied Data Science, Lithium Technologies
Email: team-relevance@klout.com

Problem
Applications
- Tweets & other user generated text
- User profiles (interests & expertise)
- URL recommendations
- Content personalization
Challenges
- Ambiguity
- Multi-lingual content
- High throughput and lightweight approach: 0.5B documents daily (~1-2 ms per tweet) on commodity hardware (REST API, MapReduce)
- Shallow NLP approach (no POS tagging)
- Dense annotations (efficient information retrieval)
Knowledge Base
- Freebase entities (top 1 million by importance)*
- Balances coverage and relevance with respect to common social media text
- 2 special entities: NIL ('the' -> NIL) and MISC ('1979 USA Basketball Team' -> MISC)

* Prantik Bhattacharyya and Nemanja Spasojevic. Global entity ranking across multiple languages. WWW 2017 Companion (poster).
Data Set
- Internally developed open data set: Densely Annotated Wikipedia Text (DAWT)
- High precision and dense link coverage: on average 4.8 times more links than the original Wiki articles
- 6 languages
- https://github.com/klout/opendata/tree/master/wiki_annotation

Nemanja Spasojevic, Preeti Bhargava, and Guoning Hu. 2017. DAWT: Densely Annotated Wikipedia Texts across multiple languages. WWW 2017 Wiki workshop (Wiki'17).
Text Processing Pipeline
Entity Extraction
- Candidate mention dictionary
- Consider n-gram (n ∈ [1,6]) phrases
- Choose the longest phrase within the candidate dictionary
Entity Extraction Example
Google CEO Eric Schmidt said that competition between Apple and Google ...

Candidates:
- Google -> {045c7b}
- Google CEO -> {}
- Google CEO Eric -> {}
- CEO -> {0dq_5}
- CEO Eric -> {}
- CEO Eric Schmidt -> {}
- Eric -> {03f078w, 0q9nx}
- Eric Schmidt -> {03f078w}
- Eric Schmidt said -> {}
- and so on ...
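The extraction step above can be sketched as a greedy longest-match scan over the candidate mention dictionary; this is a minimal illustration, not the production implementation. The dictionary entries and MIDs come from the slide example (the `0k8z` entry for Apple and the function name are assumptions).

```python
# Candidate mention dictionary: surface form -> set of Freebase MIDs
# (entries taken from the slide example; forms mapping to {} omitted)
CANDIDATES = {
    "Google": {"045c7b"},
    "CEO": {"0dq_5"},
    "Eric": {"03f078w", "0q9nx"},
    "Eric Schmidt": {"03f078w"},
    "Apple": {"0k8z"},  # assumed entry, for illustration
}

MAX_N = 6  # consider n-grams with n in [1, 6]

def extract_mentions(text):
    """Greedily pick the longest n-gram phrase found in the dictionary."""
    tokens = text.split()
    mentions = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest phrase first, so 'Eric Schmidt' wins over 'Eric'.
        for n in range(min(MAX_N, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in CANDIDATES:
                match = (phrase, CANDIDATES[phrase])
                i += n
                break
        if match:
            mentions.append(match)
        else:
            i += 1  # no dictionary phrase starts here; advance one token
    return mentions

mentions = extract_mentions(
    "Google CEO Eric Schmidt said that competition between Apple and Google"
)
# 'Eric Schmidt' survives as one mention; the shorter 'Eric' does not.
```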
Entity Disambiguation
Two-pass algorithm:
- disambiguates and links a set of easy mentions
- leverages these easy entities and several features to disambiguate and link the remaining hard mentions
First Pass
Use Mention-Entity Co-occurrence prior probability. A mention is considered easy if:
- there is only one candidate entity, or
- one candidate has a high prior probability given the mention (> 0.9), or
- there are two candidate entities, one being NIL/MISC, and one has a high prior probability given the mention (> 0.75)

Example Mention-Entity Co-occurrence prior probabilities:
- Dielectrics: 0b7kg:0.4863, _nil_:0.3836, _misc_:0.1301
- lost village: _nil_:0.7826, 05gxzw:0.2029, _misc_:0.0145
- Tesla: _nil_:0.3621, 05d1y:0.327, 0dr90d:0.1601, 036wfx:0.0805, 03rhvb:0.0303
- tesla: _nil_:0.5345, 03rhvb:0.4655
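The easy-mention test above can be sketched as follows. The prior tables are taken from the slide; the single-candidate `Google` entry and the function name are assumptions added for illustration.

```python
# Prior probabilities P(entity | mention), from the slide example.
PRIORS = {
    "Tesla": {"_nil_": 0.3621, "05d1y": 0.327, "0dr90d": 0.1601,
              "036wfx": 0.0805, "03rhvb": 0.0303},
    "lost village": {"_nil_": 0.7826, "05gxzw": 0.2029, "_misc_": 0.0145},
    "tesla": {"_nil_": 0.5345, "03rhvb": 0.4655},
    "Google": {"045c7b": 1.0},  # assumed single-candidate entry
}

SPECIAL = {"_nil_", "_misc_"}

def easy_entity(mention):
    """Return the linked entity if the mention is 'easy', else None."""
    priors = PRIORS.get(mention, {})
    if not priors:
        return None
    best, p = max(priors.items(), key=lambda kv: kv[1])
    if len(priors) == 1:      # only one candidate entity
        return best
    if p > 0.9:               # dominant prior given the mention
        return best
    # two candidates, one of them NIL/MISC, with a strong prior
    if len(priors) == 2 and SPECIAL & priors.keys() and p > 0.75:
        return best
    return None
```

Under these rules, ambiguous mentions such as 'Tesla' and 'tesla' are left for the second pass.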
Second Pass
Build context:
- Document: easy entities
- Entity: position, easy entities within a window

Build feature set:
- Context independent: Mention-Entity Co-occurrence, Mention-Entity Jaccard, Entity Importance
- Context dependent: Entity-Entity Co-occurrence, Entity-Entity Topic Similarity
Mention-Entity Co-occurrence
Example Mention-Entity Co-occurrence prior probabilities:
- dielectrics: 0b7kg:0.4863, _none_:0.3836, _misc_:0.1301
- lost village: _none_:0.7826, 05gxzw:0.2029, _misc_:0.0145
- Tesla: _none_:0.3621, 05d1y:0.327, 0dr90d:0.1601, 036wfx:0.0805, 03rhvb:0.0303
- tesla: _none_:0.5345, 03rhvb:0.4655

Example: P(05d1y | 'Tesla') = 0.327
Mention-Entity Jaccard
Captures the alignment of the representative entity mention to the observed mention.
Example: 'Tesla' vs 'Tesla Motors' => 0.5
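One plausible reading of this feature is Jaccard similarity over the token sets of the two mentions, which reproduces the slide's 'Tesla' vs 'Tesla Motors' = 0.5 example (the lowercasing and whitespace tokenization are assumptions):

```python
def mention_jaccard(observed, representative):
    """Jaccard similarity between the token sets of two mentions."""
    a = set(observed.lower().split())
    b = set(representative.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# 'Tesla' vs 'Tesla Motors': |{tesla}| / |{tesla, motors}| = 1/2 = 0.5
mention_jaccard("Tesla", "Tesla Motors")
```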
Entity Importance
Captures the global importance of an entity as perceived by casual observers.
Entity-Entity Co-occurrence
Average co-occurrence of a candidate entity with the disambiguated easy entities in the context window.
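A minimal sketch of this feature, assuming a precomputed pairwise co-occurrence table; the table values and entity pairs below are invented purely for illustration:

```python
# Hypothetical pairwise co-occurrence scores between entity MIDs.
COOCCUR = {
    frozenset({"05d1y", "0dr90d"}): 0.8,   # illustrative value
    frozenset({"05d1y", "03f078w"}): 0.1,  # illustrative value
}

def entity_entity_cooccurrence(candidate, easy_entities):
    """Average co-occurrence of a candidate with the easy entities in the window."""
    if not easy_entities:
        return 0.0
    scores = [COOCCUR.get(frozenset({candidate, e}), 0.0) for e in easy_entities]
    return sum(scores) / len(scores)
```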
Entity-Entity Topic Semantic Similarity
Inverse of the minimum semantic distance between the candidate entity's topics and the entities from the easy entity window.
Examples:
- sim('Apple', 'Google') = 1 / 4 = 0.25
- sim('Apple', 'Food') = 1 / 5 = 0.2
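The inverse-distance computation can be sketched as below. The distance table stands in for a real topic hierarchy; its values (4 and 5) just reproduce the slide's examples, and the zero-distance convention is an assumption:

```python
# Assumed minimum semantic distances in a topic hierarchy
# (values chosen to reproduce the slide's examples).
MIN_TOPIC_DISTANCE = {
    ("Apple", "Google"): 4,
    ("Apple", "Food"): 5,
}

def topic_similarity(candidate, easy_entity):
    """Inverse of the minimum semantic distance between the entities' topics."""
    d = MIN_TOPIC_DISTANCE.get((candidate, easy_entity))
    if d is None:
        return 0.0                       # no known path between topics
    return 1.0 / d if d > 0 else 1.0     # distance 0: identical topic
```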
Disambiguation
Use an ensemble of two classifiers:
- A Decision Tree classifier labels each feature vector as 'True' or 'False'
- Final scores are generated using weights from a Logistic Regression classifier

Final disambiguation:
- Only one candidate entity labeled 'True': it is chosen
- Multiple candidate entities labeled 'True': the highest scoring one wins
- All candidate entities labeled 'False': use the highest scoring one only if its score margin over the next one is large
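The final selection rule above can be sketched as follows; the `margin` threshold value is an assumption (the slides do not state it), and the labels and scores are assumed to come from the decision-tree and logistic-regression stages respectively:

```python
def pick_entity(candidates, margin=0.5):
    """Select an entity from (entity, label, score) triples.

    label: the Decision Tree's True/False verdict on the feature vector.
    score: the Logistic Regression weighted score.
    margin: assumed score-gap threshold for the all-'False' case.
    """
    if not candidates:
        return None
    true_cands = [(e, s) for e, lab, s in candidates if lab]
    if len(true_cands) == 1:                 # only one 'True' candidate
        return true_cands[0][0]
    if len(true_cands) > 1:                  # highest scoring 'True' wins
        return max(true_cands, key=lambda es: es[1])[0]
    # All labeled 'False': accept the top score only with a large margin.
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    if len(ranked) == 1 or ranked[0][2] - ranked[1][2] >= margin:
        return ranked[0][0]
    return None
```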
Disambiguation Examples
(The example slides walk through the first-pass easy-mention rules and the final disambiguation rules on sample text.)
Evaluation
- Ground truth test set: 20 English Wikipedia articles (18,773 mentions)
- Measured Precision, Recall, F-score, and Accuracy
Evaluation
- Mention-Entity Co-occurrence based features have the biggest impact
- Context helps (especially for longer texts)
Evaluation – Per Language

Language Coverage Comparisons

Language    Lithium EDL   Google Cloud NL API   Open Calais   AIDA
English     Y             Y                     Y             Y
Arabic      Y             -                     -             Y
Spanish     Y             Y                     Y             -
French      Y             -                     Y             -
German      Y             -                     -             -
Japanese    Y             Y                     -             -
Coverage Comparisons
- Lithium EDL linked 75% more entities than Google NL (precision-adjusted lower bound)
- Lithium EDL linked 104% more entities than Open Calais (precision-adjusted lower bound)
Example Comparisons
Runtime Comparisons
- The text preprocessing stage of the Lithium pipeline is about 30,000-50,000 times faster than AIDA's
- Disambiguation runtime per unique extracted entity is about 3.5 times faster in the Lithium pipeline than in AIDA
- AIDA extracts 2.8 times fewer entities per 50 KB of text
Conclusion
- Presented an EDL algorithm that uses several context-dependent and context-independent features
- The Lithium EDL system recognizes several types of entities (professional titles, sports, activities, etc.) in addition to named entities (people, places, organizations, etc.)
- Links 75% more entities than state-of-the-art systems
- The EDL algorithm is language-agnostic, currently supports 6 languages, and is applicable to real-world data
- High throughput and lightweight: 3.5 times faster than state-of-the-art systems such as AIDA
Questions?
E-mail: team-relevance@klout.com