Slide1
Knowledge Graph and Corpus Driven Segmentation and Answer Inference for Telegraphic Entity-seeking Queries
EMNLP 2014
Mandar Joshi, IBM Research, mandarj90@in.ibm.com
Uma Sawant, IIT Bombay and Yahoo Labs, uma@cse.iitb.ac.in
Soumen Chakrabarti, IIT Bombay, soumen@cse.iitb.ac.in
Slide2
Entity-seeking Telegraphic Queries
- Short
- Unstructured
- Expect answers as entities
Slide3
Challenges
- No reliable syntax clues: free word order, little or no capitalization, quoted phrases are rare
- Ambiguous, with multiple interpretations: in "aamir khan films", Aamir Khan could be the Indian actor or the British boxer, and "films" could mean appeared in, directed by, or about
- Difficult to exploit redundancy
Slide4
Why do we need the corpus?
- KGs are high precision but incomplete, and always a work in progress
- Triples can't represent all information: there is a structured-unstructured gap
- The corpus provides recall
Example: "fastest odi century batsman"
Corpus snippet: "… Anderson hits fastest ODI century in mismatch ... was the first time two batsmen have hit hundreds in under 50 balls in the same ODI."
Slide5
Annotated Web with Knowledge Graph
Annotated document: "… 1998 Redford directed movie The Horse Whisperer, based on the 1995 novel…"
- mentionOf: The_Horse_Whisperer, instanceOf type /film/film
- mentionOf: Robert_Redford, instanceOf type /film/director
- KG relation /film/director/film links Robert_Redford to The_Horse_Whisperer
Slide6
Interpretation via segmentation
Slide7
Signals from the Query
- Queries seek answer entities (e2)
- Contain (query) entities (e1), target types (t2), relations (r), and selectors (s)

query                     | e1           | r                  | t2                 | s
dave navarro first band   | dave navarro | band               | band               | first
                          | dave navarro | -                  | band               | first
spider automobile company | spider       | automobile company | automobile company | -
                          | automobile   | company            | company            | spider
Slide8
Segmentation and Interpretation
- Interpretation = Segmentation + Annotation
- Segmentation of query tokens into 3 partitions: query entity (E1), relation and type (T2/R), selectors (S)
- Each partition may map to multiple annotations
- Multiple interpretations are possible; we keep all of them around
Slide9
Segmentation and Interpretation: example
Query: dave navarro first band
- E1 partition "dave navarro": candidate annotations 1. Dave_Navarro (Artist), 2. Dave_Navarro (Episode)
- T2/R partition "first band": r: memberOf, t2: musical group; or r: null, t2: musical group
Slide10
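The segmentation step described above can be sketched in a few lines. This is a simplified, hypothetical reconstruction (the paper's segmenter also attaches and scores entity, type, and relation annotations for each partition):

```python
from itertools import combinations

def segmentations(tokens):
    """Enumerate splits of query tokens into a contiguous E1 span,
    a T2/R subset, and remaining selectors S (a toy sketch)."""
    results = []
    n = len(tokens)
    for i in range(n):                      # E1 span start
        for j in range(i + 1, n + 1):       # E1 span end
            e1 = tokens[i:j]
            rest = tokens[:i] + tokens[j:]
            for k in range(len(rest) + 1):  # size of the T2/R part
                for idx in combinations(range(len(rest)), k):
                    t2r = [rest[p] for p in idx]
                    s = [rest[p] for p in range(len(rest)) if p not in idx]
                    results.append((e1, t2r, s))
    return results

# One segmentation of "dave navarro first band":
# E1 = [dave, navarro], T2/R = [band], S = [first]
segs = segmentations("dave navarro first band".split())
```

In practice each such segmentation is then expanded into multiple interpretations by annotating E1 with candidate KG entities and T2/R with candidate types and relations.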
Model 1: Graphical Model
Variables: target type, connecting relation, query entity, selectors, segmentation, candidate entity
Potentials:
- Type language model
- Relation language model
- Entity language model
- KG-assisted relation evidence potential
- Entity-type compatibility
- Corpus-assisted entity-relation evidence potential (provides mentions of entity, relation, and type)
Slide11
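As a rough sketch, an interpretation's score under such a graphical model can be treated as a weighted log-linear combination of the potentials listed above. The potential names and weights here are illustrative, not the paper's actual parameterization:

```python
import math

def interpretation_score(potentials, weights):
    """Log-linear combination of model potentials. `potentials` maps a
    potential name (e.g. the type LM or relation LM score) to a positive
    value; a small floor guards against log(0)."""
    return sum(weights[name] * math.log(max(value, 1e-12))
               for name, value in potentials.items())

# Hypothetical potential values for one interpretation:
score = interpretation_score(
    {"type_lm": 0.3, "relation_lm": 0.2, "entity_lm": 0.5},
    {"type_lm": 1.0, "relation_lm": 1.0, "entity_lm": 1.0})
```

With unit weights this reduces to the log of the product of potentials, the standard factor-graph reading; learned weights (as in Model 2 below) let the system trade the evidence sources off against each other.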
Relation Language Model
- Objective: map relation mentions to Freebase relation(s)
- Use annotated ClueWeb09 + Freebase triples
- For the triples of each Freebase relation: locate the endpoints in the corpus and extract the dependency path phrase between the entities
- Maintain co-occurrence counts of <phrase, rel>
- Score using a language model
Slide12
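The counting-and-scoring scheme above can be sketched as follows. The relation names, phrases, and smoothing parameters are illustrative stand-ins for what would be mined from ClueWeb09 and Freebase:

```python
from collections import Counter, defaultdict

# Token counts per relation, built from dependency-path phrases observed
# between entity pairs that match a Freebase triple (toy stand-in).
rel_token_counts = defaultdict(Counter)

def observe(relation, path_phrase):
    """Record one <phrase, rel> co-occurrence."""
    rel_token_counts[relation].update(path_phrase.split())

def relation_lm_score(query_phrase, relation, mu=10.0, vocab=10000):
    """Dirichlet-smoothed unigram LM score of a query phrase under a
    relation's phrase language model (mu and vocab are illustrative)."""
    counts = rel_token_counts[relation]
    total = sum(counts.values())
    score = 1.0
    for tok in query_phrase.split():
        score *= (counts[tok] + mu / vocab) / (total + mu)
    return score

observe("/music/group_membership/member", "member of the band")
observe("/music/artist/origin", "band formed in")
```

A query phrase like "member of" then scores higher under the membership relation than under a relation whose observed phrases never contain those words.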
Type Language Model
- Smoothed Dirichlet language model
- Create a micro-document for each type using: the /common/topic/alias description link, words in the type name, and incoming relation links
- Micro-document for /location/citytown: city, location, headquarters, capital_city, etc.
- Favors specific types over generic ones: "city" should map to /location/citytown, not /location/location
Slide13
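The type LM above can be sketched with Dirichlet smoothing against the collection of micro-documents. The micro-document contents below are illustrative, not the actual Freebase-derived documents:

```python
from collections import Counter

# Micro-documents per type: words from the type name, alias descriptions,
# and incoming relation names (contents here are illustrative).
micro_docs = {
    "/location/citytown": Counter("city town location headquarters capital city".split()),
    "/location/location": Counter("location area place region location".split()),
}
collection = Counter()
for doc in micro_docs.values():
    collection.update(doc)

def type_lm_score(word, type_name, mu=5.0):
    """Dirichlet-smoothed probability of a type-hint word under a type's
    micro-document: (count + mu * p_collection) / (doc_length + mu)."""
    doc = micro_docs[type_name]
    total = sum(doc.values())
    p_coll = collection[word] / sum(collection.values())
    return (doc[word] + mu * p_coll) / (total + mu)
```

Because "city" occurs in the short, focused micro-document of /location/citytown but not in that of /location/location, the specific type wins, as the slide requires.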
Snippet Potential
- Estimates the corpus support for an interpretation
- Snippets scored using RankSVM
- Partial list of features:
  - NumSnippets with distance(e2, e1) < k (k = 5, 10)
  - NumSnippets with distance(e2, r) < k (k = 3, 6)
  - NumSnippets with relation r = ⊥
  - NumSnippets with relation phrases as prepositions
  - NumSnippets covering fraction of query IDF > k (k = 0.2, 0.4, 0.6, 0.8)
Slide14
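The distance-based snippet features above can be sketched as token-position counts. This simplification matches entities as single tokens and omits the IDF-coverage features; the function names are hypothetical:

```python
def min_dist(tokens, a, b):
    """Smallest token distance between any occurrence of a and of b."""
    pa = [i for i, t in enumerate(tokens) if t == a]
    pb = [i for i, t in enumerate(tokens) if t == b]
    return min((abs(i - j) for i in pa for j in pb), default=float("inf"))

def distance_features(snippets, e2, e1, ks=(5, 10)):
    """NumSnippets with distance(e2, e1) < k, for each threshold k.
    Snippets are token lists."""
    return {f"dist_e2_e1_lt_{k}": sum(1 for s in snippets
                                      if min_dist(s, e2, e1) < k)
            for k in ks}

snippets = [
    "anderson hits fastest odi century in mismatch".split(),
    "two batsmen have hit hundreds in the same odi".split(),
]
feats = distance_features(snippets, "anderson", "century")
```

Such counts, aggregated over the retrieved snippets for a candidate answer, become the feature vector the RankSVM scores.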
Inference
- Candidate answer entities (e2) come from: the top k corpus entities by snippet frequency, and KG links that are in the interpretation set
- Each candidate e2 is then scored
Slide15
Model 2: Latent Variable Discriminative Training
- q and e2 are observed; e1, t2, r, and z are latent
- Non-convex formulation
- Constraints are formulated using the best-scoring interpretation
Slide16
Experiments
Slide17
Test Bed
- Freebase entity, type, and relation knowledge graph: ~29 million entities, 14K types, 2K selected relation types
- Annotated corpus: ClueWeb09B Web corpus of 50 million pages
- Google annotations (FACC1), ~13 annotations per page
- Text and entity index
Slide18
Test Bed: Query Sets
- TREC-INEX: 700 entity search queries
- WQT: subset of ~800 queries from the WebQuestions natural language query set, converted to telegraphic form

TREC-INEX                                   | WQT
Has type and/or relation hints              | Has mostly relation hints
Answers from KG and corpus collected by TAs | Answers from KG only, collected by Turkers
More suitable for corpus (+ KG)             | More suitable for KG
Slide19
Synergy between KG and Corpus
The corpus and the knowledge graph help each other to deliver better performance.
Slide20
Query Template Comparison
The entity-relation-type-selector template yields better accuracy than the type-selector template.

Dataset   | Formulation       | MAP  | MRR  | n@10
TREC-INEX | No interpretation | .205 | .215 | .292
TREC-INEX | Type + Selector   | .292 | .306 | .356
TREC-INEX | Unoptimized       | .409 | .419 | .502
TREC-INEX | LVDT              | .419 | .436 | .541
WQT       | No interpretation | .080 | .095 | .131
WQT       | Type + Selector   | .116 | .152 | .201
WQT       | Unoptimized       | .377 | .401 | .474
WQT       | LVDT              | .295 | .323 | .406
Slide21
Comparison with Semantic Parsers

Dataset   | Formulation      | MAP  | MRR  | n@10
TREC-INEX | SEMPRE (Free917) | .154 | .159 | .186
TREC-INEX | SEMPRE (WQ)      | .197 | .208 | .247
TREC-INEX | Unoptimized      | .409 | .419 | .502
TREC-INEX | LVDT             | .419 | .436 | .541
WQT       | SEMPRE (Free917) | .229 | .255 | .285
WQT       | SEMPRE (WQ)      | .374 | .406 | .449
WQT       | Jacana           | .239 | .256 | .329
WQT       | Unoptimized      | .377 | .401 | .474
WQT       | LVDT             | .295 | .323 | .406

We do better than both Jacana and SEMPRE; gains over SEMPRE are slimmer on WQT.
Slide22
NDCG comparison, TREC-INEX dataset
Slide23
NDCG comparison, WQT dataset
Slide24
Summary
- Query interpretation is rewarding, but non-trivial
- Semantic parsers do not work well on syntax-less telegraphic queries; segmentation-based models work better
- The entity-relation-type-selector template is better than the type-selector template
- The knowledge graph and the corpus provide complementary benefits
Slide25
Future Work
- Extend to NL queries: more natural type and relation hints, but more filler words, so better word selection is needed for corpus matching
- Handle relation joins: requires a more complex relation model and a large interpretation space
Slide26
Thank you!
Slide27
Generating Interpretations
- Combine type and relation hints
- Identify entities first
- 3 partitions: E1, T2/R, and S
Slide28
KG vs. Corpus
- KG: better quality content than the corpus, but generally incomplete
- Corpus: better recall but more noise; depends on the annotation algorithm
- Complementary benefits!