Slide 1
Entity Categorization Over Large Document Collections
Presenter: Shu-Ya Li
Authors: Venkatesh Ganti, Arnd Christian König, Rares Vernica
KDD, 2008
Slide 2
Outline
- Motivation
- Objective
- Methodology
- Experiments and Results
- Conclusion
- Comments
Slide 3
Motivation
Going from unstructured data to structured data:
- Extracting entities (people, movies) from documents and identifying their categories (painter, writer, actor).
- Most prior approaches (unary relation extraction) analyzed only the local document context within which entities occur.
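[Figure: prior approaches. In "… Donald Knuth works in research …", the entity "Donald Knuth" and its context "works in research" yield is-a-researcher(Donald_Knuth). But for contexts such as "[Entity] present results" or "[Entity] publish", is-a-researcher(Entity)? The entity could also be one of the companies or newspapers shown.]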
Slide 4
Objectives
In this paper, we improve the accuracy of entity categorization by:
- considering an entity's context across multiple documents
- exploiting existing large lists of related entities
[Figure: contexts from multiple documents, "… [Entity] published …", "… [Entity]'s paper …" and "… [Entity] gave a talk …", give the features ([Entity], 'published'), ([Entity], 'paper') and ([Entity], 'talk'). A Multi-Feature Relation Extractor combines them into ([Entity], is-a-researcher).]
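The multi-feature idea on this slide can be illustrated with a minimal sketch. It assumes contexts arrive as (entity, text) pairs and uses only the three trigger words shown on the slide. The function names and the "at least two distinct features" rule are hypothetical stand-ins for the trained classifier, which the slides do not describe.

```python
import re
from collections import defaultdict

# Trigger words taken from the slide; a real system would learn its features.
TRIGGERS = {"paper", "talk", "published"}


def aggregate_features(contexts):
    """Collect, per entity, the trigger words seen across all of its contexts."""
    features = defaultdict(set)
    for entity, text in contexts:
        # Lowercase alphabetic tokens only; "[Entity]'s" becomes "entity", "s".
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in TRIGGERS:
                features[entity].add(token)
    return dict(features)


def is_a_researcher(feature_set, min_features=2):
    """Hypothetical decision rule: require several distinct features."""
    return len(feature_set) >= min_features


contexts = [
    ("Donald Knuth", "[Entity] published a new volume"),
    ("Donald Knuth", "[Entity]'s paper on algorithms"),
    ("Donald Knuth", "[Entity] gave a talk"),
    ("Some Newspaper", "[Entity] published the story"),
]
features = aggregate_features(contexts)
for entity, feats in features.items():
    print(entity, sorted(feats), is_a_researcher(feats))
```

A single context such as "[Entity] published" is ambiguous, but an entity that also appears with 'paper' and 'talk' in other documents accumulates more evidence, which is the point of aggregating across documents.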
Slide 5
Methodology
(Yao_Ming, is-a-athlete)
Ex: Extraction of the is-a-movie relation
"… Julia Roberts starred in Pretty Woman in 1988 …"
Entity: Pretty Woman
Actor name: Julia Roberts
Actor-List: Alan Alba, Richard Gere, Julia Roberts, …
Feature: co-occurrence between the entity and an actor name in context.
Result: (Pretty Woman, is-a-movie)
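The co-occurrence feature in this example can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name is hypothetical, and plain substring matching is assumed in place of whatever mention detection the paper uses.

```python
# Actor names as they appear on the slide.
ACTOR_LIST = {"Alan Alba", "Richard Gere", "Julia Roberts"}


def cooccurring_members(context, entity, members):
    """Return the list members mentioned in the same context as the entity."""
    if entity not in context:
        return set()
    return {member for member in members if member in context}


context = "… Julia Roberts starred in Pretty Woman in 1988 …"
hits = cooccurring_members(context, "Pretty Woman", ACTOR_LIST)
print(hits)  # {'Julia Roberts'}
```

A non-empty result is the co-occurrence signal: the candidate entity appears next to a known actor, which supports (Pretty Woman, is-a-movie).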
Slide 6
Methodology - Processing large Document Collections
[Figure: processing pipeline. Components shown:
- Document Corpus D (3.2 million documents from Wiki)
- Context Feature Extraction: Rule-based Extraction, Entity-Candidate Context Pairs, Classification (Classifiers C), Entity-Feature Pairs
- Co-Occurrence List-Member Extraction: n-gram Extraction, List-Member Detection (Synopsis of L), Verification (delete false positives), Entity-List Pairs
- List corpus L
- Aggregation
Annotations: the synopsis of L retains the most important list members; a known set of directors (as ε); a list of actors (as ).]
Example data:
Actors list (wiki): Amy Adams, Elizabeth Reaser, Julia Roberts, Tara Reid, Judy Reyes, …
Entities: E1: Pretty Woman, E2: Mystic Pizza, E3: Doubt, E4: Duplicity, E5: Enchanted, …
Slide 7
Methodology - Processing large Document Collections
[Figure: the processing pipeline from Slide 6, repeated.]
Scanning D once:
"… Julia Roberts starred in Pretty Woman in 1988 …"
n-grams: {Julia, Roberts, starred, Pretty, Woman, Julia Roberts, Pretty Woman, …}
Issues:
1. A large amount of data is written.
2. Most of it is not expected to contain a member of a list.
Our Approach: Bloom Filter
{starred, Pretty, Woman, Pretty Woman, …}
(Julia Roberts, starred)
(Julia Roberts, Pretty)
(Julia Roberts, Woman)
(Julia Roberts, Pretty Woman)
followed by Verification.
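The Bloom filter step can be sketched as below, under several assumptions: the filter size, the number of hash functions, the SHA-256 based hashing, and all function names are illustrative choices, not details taken from the paper. The sketch uses the roles from the is-a-movie example (entity "Pretty Woman", list of actors) and shows the two stages named on the slide: a compact synopsis of the list drops most n-grams during the single scan, and an exact verification pass deletes the remaining false positives.

```python
import hashlib


class BloomFilter:
    """Compact set synopsis: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def ngrams(tokens, max_n=2):
    """Yield all word n-grams of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def candidate_pairs(entity, tokens, synopsis):
    """Keep only (entity, n-gram) pairs whose n-gram passes the synopsis."""
    return [(entity, g) for g in ngrams(tokens) if synopsis.might_contain(g)]


def verify(pairs, list_members):
    """Exact check against the list; deletes Bloom filter false positives."""
    return [(e, g) for e, g in pairs if g in list_members]


actors = {"Amy Adams", "Elizabeth Reaser", "Julia Roberts", "Tara Reid", "Judy Reyes"}
synopsis = BloomFilter()
for name in actors:
    synopsis.add(name)

tokens = "Julia Roberts starred in Pretty Woman in 1988".split()
candidates = candidate_pairs("Pretty Woman", tokens, synopsis)
verified = verify(candidates, actors)
print(verified)  # [('Pretty Woman', 'Julia Roberts')]
```

Because a Bloom filter never reports a false negative, every true list member survives the scan, while the small in-memory synopsis means only a fraction of the n-grams is ever written out for verification.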
Slide 8
Experiments
Slide 9
Slide 10
Conclusion
- Studied the effect of aggregate context in relation extraction.
- Proposed efficient processing techniques for large text corpora.
- Both aggregate and co-occurrence features provide a significant increase in extraction accuracy compared to single-context classifiers.
Slide 11
Comments
Advantage: The first half of this paper is clear.
Drawback: But the first half of this paper isn't clear.
Application: Entity categorization