Presentation Transcript

Slide 1

High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data

Preeti Bhargava, Nemanja Spasojevic, Guoning Hu
Applied Data Science, Lithium Technologies
Email: team-relevance@klout.com

Slide 2

Problem

Slide 3

Applications

Tweets & other user-generated text
User profiles (interests & expertise)
URL recommendations
Content personalization

Slide 4

Challenges

Ambiguity
Multi-lingual content
High-throughput and lightweight approach:
  0.5B documents daily (~1-2 ms per tweet)
  commodity hardware (REST API, MR)
  shallow NLP approach (no POS tagging)
Dense annotations (efficient information retrieval)

Slide 5

Knowledge Base

Freebase entities (top 1 million by importance)*
Balance coverage and relevance with respect to common social media text
2 special entities: NIL ('the' -> NIL) and MISC ('1979 USA Basketball Team' -> MISC)

* Prantik Bhattacharyya and Nemanja Spasojevic. Global entity ranking across multiple languages. Poster, WWW 2017 Companion.

Slide 6

Internally Developed Open Data Set

Densely Annotated Wikipedia Texts (DAWT)*:
  high precision and dense link coverage
  on average 4.8 times more links than the original Wiki articles
  6 languages

Data set: https://github.com/klout/opendata/tree/master/wiki_annotation

* Nemanja Spasojevic, Preeti Bhargava, and Guoning Hu. 2017. DAWT: Densely Annotated Wikipedia Texts across multiple languages. WWW 2017 Wiki Workshop (Wiki'17).

Slide 7

Text Processing Pipeline

Slide 8

Text Processing Pipeline

Slide 9

Entity Extraction

Candidate mention dictionary
Consider n-grams (n ∈ [1,6]) as candidate phrases
Choose the longest phrase found in the candidate dictionary (a sketch follows the example below)

Slides 10-19

Entity Extraction (example, built up incrementally across the slides)

Google CEO Eric Schmidt said that competition between Apple and Google …

Candidates:
  Google -> {045c7b}
  Google CEO -> {}
  Google CEO Eric -> {}
  CEO -> {0dq_5}
  CEO Eric -> {}
  CEO Eric Schmidt -> {}
  Eric -> {03f078w, 0q9nx}
  Eric Schmidt -> {03f078w}
  Eric Schmidt said -> {}
  and so on
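A minimal Python sketch of the dictionary-based candidate lookup illustrated above. The dictionary entries are taken from the example, and the greedy longest-match scan is one plausible realization of the "choose the longest phrase" rule, not the authors' exact implementation.

# Minimal sketch of dictionary-based candidate extraction.
# The candidate dictionary maps lower-cased surface phrases to sets of
# Freebase machine IDs (only the entries shown in the example above).
candidate_dict = {
    "google": {"045c7b"},
    "ceo": {"0dq_5"},
    "eric": {"03f078w", "0q9nx"},
    "eric schmidt": {"03f078w"},
}

def extract_candidates(tokens, max_n=6):
    """Scan left to right over n-grams (n in [1, 6]) and keep the longest
    phrase found in the candidate dictionary at each position."""
    mentions = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest n-gram first so 'Eric Schmidt' beats 'Eric'.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            candidates = candidate_dict.get(phrase)
            if candidates:
                match = (" ".join(tokens[i:i + n]), candidates)
                i += n          # jump past the matched phrase
                break
        if match:
            mentions.append(match)
        else:
            i += 1
    return mentions

text = "Google CEO Eric Schmidt said that competition between Apple and Google"
for mention, candidates in extract_candidates(text.split()):
    print(mention, "->", candidates)
# Google -> {'045c7b'}
# CEO -> {'0dq_5'}
# Eric Schmidt -> {'03f078w'}
# Google -> {'045c7b'}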

Slide 20

Entity Disambiguation

Two-pass algorithm:
  the first pass disambiguates and links a set of easy mentions
  the second pass leverages these easy entities and several features to disambiguate and link the remaining hard mentions

Slide 21

First Pass

Use the Mention-Entity Co-occurrence prior probability. A mention is resolved in the first pass if (see the sketch after the examples below):
  it has only one candidate entity, or
  a candidate has a high prior probability given the mention (> 0.9), or
  it has two candidate entities, one being NIL/MISC, with a high prior probability given the mention (> 0.75)

Example of Mention-Entity Co-occurrence prior probabilities:
  Dielectrics   0b7kg:0.4863, _nil_:0.3836, _misc_:0.1301
  lost village  _nil_:0.7826, 05gxzw:0.2029, _misc_:0.0145
  Tesla         _nil_:0.3621, 05d1y:0.327, 0dr90d:0.1601, 036wfx:0.0805, 03rhvb:0.0303
  tesla         _nil_:0.5345, 03rhvb:0.4655
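A minimal sketch of the first-pass rules above, assuming `candidates` maps each mention to its candidate-entity set and `priors` to its prior distribution (as in the examples); the thresholds 0.9 and 0.75 come from the slide, everything else is illustrative.

NIL, MISC = "_nil_", "_misc_"

def first_pass(candidates, priors):
    """Resolve the 'easy' mentions; returns {mention: entity_id}."""
    resolved = {}
    for mention, entity_set in candidates.items():
        dist = priors.get(mention, {})          # e.g. {'05d1y': 0.327, '_nil_': 0.3621, ...}
        if len(entity_set) == 1:                # only one candidate entity
            resolved[mention] = next(iter(entity_set))
            continue
        if not dist:
            continue
        best, best_p = max(dist.items(), key=lambda kv: kv[1])
        if best_p > 0.9:                        # very confident prior
            resolved[mention] = best
        elif (len(entity_set) == 2
              and (NIL in entity_set or MISC in entity_set)
              and best_p > 0.75):               # two candidates, one NIL/MISC
            resolved[mention] = best
    return resolved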

Slide 22

Second Pass

Build context:
  Document – easy entities
  Entity – position, easy entities within a window

Build feature set:
  Context independent:
    Mention-Entity Co-occurrence
    Mention-Entity Jaccard
    Entity Importance
  Context dependent:
    Entity-Entity Co-occurrence
    Entity-Entity Topic Similarity

Slide 23

Mention-Entity Co-occurrence

Example of Mention-Entity Co-occurrence prior probabilities:
  dielectrics   0b7kg:0.4863, _none_:0.3836, _misc_:0.1301
  lost village  _none_:0.7826, 05gxzw:0.2029, _misc_:0.0145
  Tesla         _none_:0.3621, 05d1y:0.327, 0dr90d:0.1601, 036wfx:0.0805, 03rhvb:0.0303
  tesla         _none_:0.5345, 03rhvb:0.4655

Example: P(05d1y | 'Tesla') = 0.327
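A minimal sketch of how such a prior could be estimated as P(entity | mention) from mention-entity co-occurrence counts in an annotated corpus such as DAWT; the counting scheme and the toy counts below are assumptions, not the authors' exact procedure.

from collections import Counter, defaultdict

def estimate_priors(annotations):
    """annotations: iterable of (mention_text, entity_id) pairs observed in a corpus.
    Returns P(entity | mention) as nested dicts."""
    counts = defaultdict(Counter)
    for mention, entity in annotations:
        counts[mention][entity] += 1
    return {m: {e: c / sum(ec.values()) for e, c in ec.items()}
            for m, ec in counts.items()}

# Toy counts, chosen only to roughly echo the 'Tesla' row above.
obs = ([("Tesla", "_none_")] * 36 + [("Tesla", "05d1y")] * 33
       + [("Tesla", "0dr90d")] * 16 + [("Tesla", "036wfx")] * 8
       + [("Tesla", "03rhvb")] * 3)
print(estimate_priors(obs)["Tesla"]["05d1y"])   # ~0.34 with these toy counts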

Slide 24

Mention-Entity Jaccard

Captures the alignment of the representative entity mention with the observed mention.
Example: 'Tesla' vs 'Tesla Motors' => 0.5
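A minimal sketch of a token-level Jaccard similarity that reproduces the 'Tesla' vs 'Tesla Motors' example; whether the feature tokenizes exactly this way is an assumption.

def mention_entity_jaccard(observed: str, representative: str) -> float:
    """Token-level Jaccard similarity between the observed mention and the
    entity's representative mention."""
    a, b = set(observed.lower().split()), set(representative.lower().split())
    return len(a & b) / len(a | b) if (a or b) else 0.0

print(mention_entity_jaccard("Tesla", "Tesla Motors"))   # 0.5, as in the slide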

Slide 25

Entity Importance

Captures the global importance of an entity as perceived by casual observers.

Slide 26

Entity-Entity Co-occurrence

Average co-occurrence of a candidate entity with the disambiguated easy entities in the context window.
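A minimal sketch following the description above; the pairwise co-occurrence lookup `cooc` (e.g., scores precomputed from an annotated corpus) is an assumed data structure.

def entity_entity_cooccurrence(candidate, easy_window, cooc):
    """Average co-occurrence of `candidate` with the already-disambiguated easy
    entities in the context window. `cooc` maps unordered entity-ID pairs
    (frozensets) to a precomputed co-occurrence score."""
    if not easy_window:
        return 0.0
    scores = [cooc.get(frozenset((candidate, e)), 0.0) for e in easy_window]
    return sum(scores) / len(scores)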

Slide 27

Entity-Entity Topic Semantic Similarity

Inverse of the minimum semantic distance between the candidate entity's topics and the entities in the easy-entity window.

Example:
  sim('Apple', 'Google') = 1/4 = 0.25
  sim('Apple', 'Food') = 1/5 = 0.2
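A minimal sketch of the inverse-minimum-distance idea; the per-entity topic sets and the `topic_distance` function (e.g., a hop count in a topic hierarchy) are assumptions, since the slide only gives the resulting values.

def topic_similarity(candidate_topics, easy_window_topics, topic_distance):
    """Inverse of the minimum semantic distance between the candidate entity's
    topics and the topics of entities in the easy-entity window."""
    distances = [topic_distance(t1, t2)
                 for t1 in candidate_topics
                 for t2 in easy_window_topics]
    if not distances:
        return 0.0
    d = min(distances)
    return 1.0 / d if d > 0 else 1.0

# With a topic hierarchy in which the closest 'Apple' and 'Google' topics are
# 4 hops apart, this gives 1/4 = 0.25, matching the slide's example.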

Slide 28

Disambiguation

Use an ensemble of two classifiers:
  a Decision Tree classifier labels each feature vector as 'True' or 'False'
  final scores are generated using weights from the Logistic Regression classifier

Final Disambiguation (see the sketch below):
  only one candidate entity labeled 'True': it is chosen
  multiple candidate entities labeled 'True': the highest-scoring one wins
  all candidate entities labeled 'False': use the highest-scoring one only if it has a large score margin over the next one
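A minimal sketch of the final decision logic, assuming each candidate carries a 'True'/'False' label from the decision tree and a score from the logistic-regression weights; the margin value is a made-up placeholder since the slide only requires a "large" margin.

def final_disambiguation(scored_candidates, margin=0.2):
    """scored_candidates: list of (entity_id, label, score) tuples.
    Returns the chosen entity ID, or None if no candidate is kept."""
    positives = [c for c in scored_candidates if c[1]]
    if len(positives) == 1:
        return positives[0][0]                         # the single 'True' candidate wins
    if len(positives) > 1:
        return max(positives, key=lambda c: c[2])[0]   # highest-scoring 'True' wins
    # All labeled 'False': keep the top candidate only with a clear score margin.
    ranked = sorted(scored_candidates, key=lambda c: c[2], reverse=True)
    if len(ranked) >= 2 and ranked[0][2] - ranked[1][2] >= margin:
        return ranked[0][0]
    return None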

Slide 29

Disambiguation Example

Slide 30

Disambiguation Example

Slide 31

Disambiguation Example

Use the Mention-Entity Co-occurrence prior probability:
  only one candidate entity, or
  a high prior probability given the mention (> 0.9), or
  two candidate entities, one being NIL/MISC, with a high prior probability given the mention (> 0.75)

Slide 32

Disambiguation Example

Final Disambiguation:
  only one candidate entity labeled 'True': it is chosen
  multiple candidate entities labeled 'True': the highest-scoring one wins
  all candidate entities labeled 'False': use the highest-scoring one only if it has a large score margin over the next one

Slide 33

Disambiguation Example

Slide 34

Evaluation

Ground-truth test set: 20 English Wikipedia articles (18,773 mentions)
Measured Precision, Recall, F-score, and Accuracy
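For reference, a minimal sketch of the reported metrics using their standard definitions from confusion counts (the slide itself does not spell out the formulas).

def evaluation_metrics(tp, fp, fn, tn):
    """Standard precision, recall, F-score, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy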

Slide 35

Evaluation

Mention-Entity Co-occurrence based features have the biggest impact.
Context helps (especially for longer texts).

Slide 36

Evaluation – Per Language

Slide 37

Language Coverage Comparisons

Language  | Lithium EDL | Google Cloud NL API | Open Calais | AIDA
English   | Y           | Y                   | Y           | Y
Arabic    | Y           |                     |             | Y
Spanish   | Y           | Y                   | Y           |
French    | Y           |                     | Y           |
German    | Y           |                     |             |
Japanese  | Y           | Y                   |             |

Slide 38

Coverage Comparisons

Lithium EDL linked 75% more entities than Google NL (precision-adjusted lower bound).
Lithium EDL linked 104% more entities than Open Calais (precision-adjusted lower bound).

Slide 39

Example Comparisons

Slide 40

Runtime Comparisons

The text preprocessing stage of the Lithium pipeline is about 30,000-50,000 times faster than AIDA's.
Per unique extracted entity, the disambiguation runtime of the Lithium pipeline is about 3.5 times faster than AIDA's.
AIDA extracts 2.8 times fewer entities per 50 KB of text.

Slide 41

Conclusion

Presented an EDL algorithm that uses several context-dependent and context-independent features.
The Lithium EDL system recognizes several types of entities (professional titles, sports, activities, etc.) in addition to named entities (people, places, organizations, etc.).
It links 75% more entities than state-of-the-art systems.
The EDL algorithm is language-agnostic and currently supports 6 different languages, making it applicable to real-world data.
High throughput and lightweight: 3.5 times faster than state-of-the-art systems such as AIDA.

Slide 42

Questions?

E-mail: team-relevance@klout.com