Slide 1: Outline
Introduction
Harvesting Classes
Harvesting Facts
  Goal
  Extraction from text
  Consistency reasoning
  Extraction from Tables
  Open IE
Common Sense Knowledge
Knowledge Consolidation
Web Content Analytics
Wrap-Up
Slide 2: Source-centric IE vs. Yield-centric IE

Source-centric IE
  one source
  1) recall! 2) precision
  Document 1: "Surajit obtained his PhD in CS from Stanford ..."
    instanceOf (Surajit, scientist)
    inField (Surajit, c.science)
    almaMater (Surajit, Stanford U)
    …

Yield-centric IE
  many sources + (optional) targeted relations: worksAt, hasAdvisor
  1) precision! 2) recall

  Student            University
  Surajit Chaudhuri  Stanford U
  Jim Gray           UC Berkeley
  …                  …

  Student            Advisor
  Surajit Chaudhuri  Jeffrey Ullman
  Jim Gray           Mike Harrison
  …                  …
Slide 3: We focus on yield-centric IE

Yield-centric IE
  many sources + (optional) targeted relations: worksAt, hasAdvisor
  1) precision! 2) recall

  Student            University
  Surajit Chaudhuri  Stanford U
  Jim Gray           UC Berkeley
  …                  …

  Student            Advisor
  Surajit Chaudhuri  Jeffrey Ullman
  Jim Gray           Mike Harrison
  …                  …
Slide 4: Goal: Find facts of given binary relations
Given binary relations with type signature
  hasAdvisor: Person × Person
  graduatedAt: Person × University
  bornOn: Person × Date
... find instances of these relations
  hasAdvisor (JimGray, MikeHarrison)
  hasAdvisor (Susan Davidson, Hector Garcia-Molina)
  graduatedAt (JimGray, Berkeley)
  graduatedAt (HectorGarcia-Molina, Stanford)
  bornOn (JohnLennon, 9-Oct-1940)
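A minimal sketch of how such type signatures can filter fact candidates. The signatures are taken from the slide; the entity-to-type dictionary and the names `SIGNATURES`, `ENTITY_TYPES` and `is_well_typed` are illustrative assumptions, not part of the tutorial.

```python
# Type signatures of the target relations, as listed on the slide.
SIGNATURES = {
    "hasAdvisor": ("Person", "Person"),
    "graduatedAt": ("Person", "University"),
    "bornOn": ("Person", "Date"),
}

# Hypothetical type assignments for a handful of entities (assumption).
ENTITY_TYPES = {
    "JimGray": "Person",
    "MikeHarrison": "Person",
    "JohnLennon": "Person",
    "Berkeley": "University",
    "9-Oct-1940": "Date",
}

def is_well_typed(relation, subject, obj):
    """Return True if (subject, obj) fits the relation's type signature."""
    if relation not in SIGNATURES:
        return False
    domain, range_ = SIGNATURES[relation]
    return ENTITY_TYPES.get(subject) == domain and ENTITY_TYPES.get(obj) == range_
```

A candidate such as graduatedAt (JimGray, MikeHarrison) is rejected here because MikeHarrison is typed as Person, not University.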
Slide 5: Facts yield patterns, and vice versa
[Brin@WebDB1998 "DIPRE"; Agichtein@SIGMOD2001 "Snowball"]

Facts & Fact Candidates
  (JimGray, MikeHarrison)
  (BarbaraLiskov, JohnMcCarthy)
  (Surajit, Jeff)
  (Sunita, Mike)
  (Alon, Jeff)
  (Renee, Yannis)
  (Surajit, Microsoft)
  (Sunita, Soumen)
  (Surajit, Moshe)
  (Alon, Larry)
  (Soumen, Sunita)

Patterns
  X and his advisor Y
  X under the guidance of Y
  X and Y in their paper
  X co-authored with Y
  X rarely met his advisor Y
  …

good for recall
noisy, drifting
not robust enough for high precision
Slide 6: Facts yield patterns, and vice versa
Extensions:
1. use statistics to estimate the trustworthiness of patterns
2. use counter examples to "punish" bad patterns [Ravichandran 2002; Suchanek 2006; ...]
3. use deep parsing to generalize patterns [Bunescu 2005, Suchanek 2006, …]
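The fact-pattern loop and the first two extensions can be illustrated with a small sketch. It is written in the spirit of DIPRE and Snowball but is not the algorithm of either paper: the toy sentences, the single-token names, the assumption that each student has exactly one advisor, and the scoring rule positive / (positive + negative) are simplifications chosen for illustration.

```python
import re

def learn_patterns(sentences, facts):
    """Turn the text between a known (x, y) pair into a pattern string."""
    patterns = set()
    for sentence in sentences:
        for x, y in facts:
            start, end = sentence.find(x), sentence.find(y)
            if start != -1 and end > start:
                middle = sentence[start + len(x):end]
                if middle.strip():
                    patterns.add(middle)
    return patterns

def apply_pattern(sentences, middle):
    """Return all (X, Y) pairs that occur as 'X<middle>Y' in the sentences."""
    regex = re.compile(r"(\w+)" + re.escape(middle) + r"(\w+)")
    return [m.groups() for s in sentences for m in regex.finditer(s)]

def confidence(matches, known):
    """Score a pattern: confirmed pairs count for it, contradicted pairs against it."""
    positive = sum(1 for x, y in matches if known.get(x) == y)
    negative = sum(1 for x, y in matches if x in known and known[x] != y)
    total = positive + negative
    return positive / total if total else 0.0

# Toy corpus (assumption); names borrowed from the slide.
SENTENCES = [
    "Surajit under the guidance of Jeff",
    "Alon under the guidance of Jeff",
    "Renee under the guidance of Yannis",
    "Surajit co-authored with Jeff",
    "Surajit co-authored with Moshe",
    "Alon co-authored with Larry",
]
KNOWN = {"Surajit": "Jeff", "Alon": "Jeff"}  # seed facts, assumed correct

patterns = learn_patterns(SENTENCES, KNOWN.items())
scores = {p: confidence(apply_pattern(SENTENCES, p), KNOWN) for p in patterns}
new_facts = {
    pair
    for p in patterns if scores[p] >= 0.5
    for pair in apply_pattern(SENTENCES, p)
    if pair[0] not in KNOWN
}
```

The co-authorship pattern is learned from a seed pair but then contradicted by two counter examples, so its score drops below the threshold and it contributes no new candidates.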
Slide 7: Outline
Introduction
Harvesting Classes
Harvesting Facts
  √ Goal
  √ Extraction from text
  Consistency reasoning
  Extraction from Tables
  Open IE
Common Sense Knowledge
Knowledge Consolidation
Web Content Analytics
Wrap-Up
Slide 8: Reasoning
[Suchanek@WWW2009]
"Einstein died in 1955"
occurs("Elvis","died in",528)
occurs("Einstein","died in",1955)
died(Einstein,1955), born(Elvis,1935)
occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)
occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)
born(X,Y) & died(X,Z) => Z>Y
…
Slide 9: Reasoning
[Suchanek@WWW2009]
occurs("Elvis","died in",528)
occurs("Einstein","died in",1955)
died(Einstein,1955), born(Elvis,1935)
occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)
occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)
born(X,Y) & died(X,Z) => Z>Y
…
Solving a weighted MAX SAT problem at scale
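To make the weighted MAX SAT view concrete, here is a brute-force sketch over three ground atoms derived from the statements above. The clause weights are invented for illustration, and exhaustive enumeration is only feasible for toy inputs; it is not the solver used in the cited work.

```python
from itertools import product

def max_sat(variables, clauses):
    """Brute-force weighted MAX SAT.

    clauses: list of (weight, [(variable, wanted_truth_value), ...]);
    a clause is satisfied if at least one literal has its wanted value.
    """
    best_weight, best_assignment = -1.0, None
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = sum(
            w for w, literals in clauses
            if any(assignment[v] == wanted for v, wanted in literals)
        )
        if weight > best_weight:
            best_weight, best_assignment = weight, assignment
    return best_weight, best_assignment

VARIABLES = ["died(Einstein,1955)", "pattern(died in,died)", "died(Elvis,528)"]
CLAUSES = [
    # Known fact (high weight, assumption).
    (10, [("died(Einstein,1955)", True)]),
    # occurs(Einstein, "died in", 1955) & died(Einstein,1955) => pattern
    (2, [("died(Einstein,1955)", False), ("pattern(died in,died)", True)]),
    # occurs(Elvis, "died in", 528) & pattern => died(Elvis,528)
    (1, [("pattern(died in,died)", False), ("died(Elvis,528)", True)]),
    # born(Elvis,1935) & died(Elvis,Z) => Z > 1935, violated by Z = 528
    (10, [("died(Elvis,528)", False)]),
]

best_weight, best_assignment = max_sat(VARIABLES, CLAUSES)
```

With these weights the best assignment accepts the pattern but rejects died(Elvis,528), because the birth-before-death constraint outweighs the single pattern occurrence.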
Slide 11: Reasoning
[Suchanek@WWW2009]
Extensions:
1. parallelize the reasoning by performing a min cut on the dependency graph [Nakashole@WSDM2011 "Prospera"]
2. use Markov logic networks to represent the entire joint probability distribution [M. Richardson / P. Domingos 2006]
Slide 12: Using Markov Logic Networks
We can model/compute
  the marginal probabilities
  the joint distribution
  the MAP (= maximum a posteriori), i.e. the most likely world
[Figure: possible worlds (World 1, World 2, …) and their probabilities]
Application: Extracting facts at large scale [Zhu@WWW2009 "StatSnowball", "EntityCube"]
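The three quantities can be computed by enumeration on a tiny example. The sketch below uses the standard Markov logic definition, where the probability of a world is proportional to exp of the summed weights of its satisfied ground formulas; the two atoms, the three formulas and their weights are assumptions made up for illustration.

```python
import math
from itertools import product

def world_distribution(atoms, weighted_formulas):
    """Joint distribution over all truth assignments (worlds).

    weighted_formulas: list of (weight, function(world_dict) -> bool).
    P(world) = exp(sum of weights of satisfied formulas) / Z.
    """
    scores = {}
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        total = sum(w for w, formula in weighted_formulas if formula(world))
        scores[values] = math.exp(total)
    z = sum(scores.values())  # normalization constant
    return {values: score / z for values, score in scores.items()}

ATOMS = ["pattern", "died_elvis_528"]
FORMULAS = [
    (2.0, lambda w: w["pattern"]),                               # evidence for the pattern
    (1.0, lambda w: (not w["pattern"]) or w["died_elvis_528"]),  # pattern => fact
    (3.0, lambda w: not w["died_elvis_528"]),                    # consistency constraint
]

joint = world_distribution(ATOMS, FORMULAS)
# Marginal probability that died_elvis_528 holds: sum over worlds where it is true.
marginal = sum(p for values, p in joint.items() if values[1])
# MAP: the single most likely world, as a tuple ordered like ATOMS.
map_world = max(joint, key=joint.get)
```

Enumeration grows exponentially with the number of atoms, so large-scale systems rely on approximate inference instead.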
Slide 13: Outline
Introduction
Harvesting Classes
Harvesting Facts
  √ Goal
  √ Extraction from text
  √ Consistency reasoning
  Extraction from Tables
  Open IE
Common Sense Knowledge
Knowledge Consolidation
Web Content Analytics
Wrap-Up
Slide 14: Web Tables provide relational information
[Cafarella et al: PVLDB 08; Sarawagi et al: PVLDB 09]
Slide 15: Web Tables can be annotated with YAGO
[Limaye, Sarawagi, Chakrabarti: PVLDB 10]
Goal: enable semantic search over Web tables
Idea:
  Map column headers to YAGO classes
  Map cell values to YAGO entities
  Use joint inference in a factor-graph learning model

  Title                    Author
  A short history of time  S Hawkins
  Hitchhiker's guide       D Adams

Annotations: Book, Person (classes); Entity; hasAuthor (relation)
Slide 16: Statistics yield semantics of Web tables
[Venetis, Halevy et al: PVLDB 11]
Idea: Infer classes from co-occurrences; headers are class names
Result from 12 million Web tables:
  1.5 million labeled columns (= classes)
  155 million instances (= values)
but: classes & entities are not canonicalized. Instances may include:
  Google Inc., Google, NASDAQ GOOG, Google search engine, …
  Jet Li, Li Lianjie, Ley Lin Git, Li Yangzhong, Nameless hero, …
Slide 17: ID-Based Extraction
Example: 887128476661
Unique identifiers exist for books (ISBN), products (GTIN), companies (VAT), people (emails*), etc.
Unique identifiers can be found by regular expression + check digit verification
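A minimal sketch of the "regular expression + check digit" idea for GTIN-style codes (8, 12, 13 or 14 digits; ISBN-13 shares the GTIN-13 check digit scheme). The function names and the sample text are illustrative assumptions; VAT numbers and e-mail addresses would need their own rules.

```python
import re

# Candidate identifiers: digit runs of exactly 14, 13, 12 or 8 digits.
CANDIDATE = re.compile(r"(?<!\d)(\d{14}|\d{13}|\d{12}|\d{8})(?!\d)")

def has_valid_check_digit(code):
    """GTIN check: weights 3, 1, 3, ... from the rightmost non-check digit."""
    digits = [int(c) for c in code]
    body, check = digits[:-1], digits[-1]
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check

def find_identifiers(text):
    """Return all digit runs of a GTIN length whose check digit verifies."""
    return [m.group(1) for m in CANDIDATE.finditer(text)
            if has_valid_check_digit(m.group(1))]

found = find_identifiers("Item 887128476661 in stock, old ref 887128476662")
```

The number shown on the slide passes the check, while the variant with a changed last digit is discarded. The check digit removes most random digit strings that the regular expression alone would accept.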
Slide 18: ID-Based Extraction

  id   Name             URL
  123  Puma PowerTech   url1
  123  Please choose    url1
  123  Puma PowerTech   url2
  123  Puma Power Shoe  url2
  124  Puma Slow Cat    url3
  779  Please choose    url3
  779  Canon PowerShot  url3
  …
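One simple way to clean such (id, name, URL) observations is sketched below: drop names that appear with more than one identifier (likely page boilerplate such as "Please choose"), then take a majority vote per identifier. This heuristic is an assumption for illustration and is not claimed to be the method of the IBEX paper.

```python
from collections import Counter, defaultdict

def canonical_names(rows):
    """Pick one name per id from noisy (id, name, url) observations."""
    ids_per_name = defaultdict(set)
    for ident, name, _url in rows:
        ids_per_name[name].add(ident)

    votes = defaultdict(Counter)
    for ident, name, _url in rows:
        # A name seen with several ids is treated as boilerplate and ignored.
        if len(ids_per_name[name]) == 1:
            votes[ident][name] += 1

    return {ident: counter.most_common(1)[0][0] for ident, counter in votes.items()}

ROWS = [
    ("123", "Puma PowerTech", "url1"),
    ("123", "Please choose", "url1"),
    ("123", "Puma PowerTech", "url2"),
    ("123", "Puma Power Shoe", "url2"),
    ("124", "Puma Slow Cat", "url3"),
    ("779", "Please choose", "url3"),
    ("779", "Canon PowerShot", "url3"),
]

names = canonical_names(ROWS)
```

On the rows from the slide, "Please choose" is discarded because it co-occurs with both 123 and 779, and id 123 resolves to the name seen on two different pages.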
Slide 19: ID-Based Extraction
[Talaika@WebDB2015 "IBEX"]
Slide 20: Outline
Introduction
Harvesting Classes
Harvesting Facts
  √ Goal
  √ Extraction from text
  √ Consistency reasoning
  √ Extraction from Tables
  Open IE
Common Sense Knowledge
Knowledge Consolidation
Web Content Analytics
Wrap-Up
Slide 21: Open Information Extraction
So far we assumed given relations with type signatures: <entity1, relation, entity2>
  <CarlaBruni marriedTo NicolasSarkozy>  Person R Person
  <NataliePortman wonAward AcademyAward>  Person R Prize
Open IE aims to discover new entities and new relation types: <name1, phrase, name2>
  "Madame Bruni in her happy marriage with Sarko…"
  <Madame Bruni, her happy marriage with, Sarko>
Slide 22: Open IE with ReVerb
[A. Fader et al. 2011, T. Lin 2012, Mausam 2012]
Idea: Consider all subject-verb-object triples as facts.
Problem 1: uninformative extractions
  "Gold has an atomic weight of 196"  <Gold, has, atomic weight>
  "Faust made a deal with the devil"  <Faust, made, a deal>
Problem 2: over-specific extractions
  "Elvis is the first and greatest rock and roll star of America"  <..., is the first and greatest rock and roll star of, …>
Solutions:
1. enforce regular expressions over POS tags, such as VB (N | ADJ | ADV | PRN | DET)* PREP
2. require that the relation phrase appears with many distinct argument pairs
3. intersect with Freebase
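The POS-tag constraint from the first solution can be sketched as a regular expression over the tag sequence of a sentence. The tag names follow the slide's simplified pattern, the hand-tagged sentences are assumptions, and a real system would obtain the tags from a POS tagger; this is not the ReVerb implementation.

```python
import re

# VB (N | ADJ | ADV | PRN | DET)* PREP over a space-separated tag string.
RELATION = re.compile(r"(?<![^ ])VB(?: (?:N|ADJ|ADV|PRN|DET))* PREP(?= |$)")

def relation_phrase(tagged):
    """Return the words of the first tag span matching the pattern, or None.

    tagged: list of (word, tag) pairs.
    """
    tags = " ".join(tag for _word, tag in tagged)
    match = RELATION.search(tags)
    if match is None:
        return None
    start = tags[:match.start()].count(" ")   # index of the first matched token
    length = match.group(0).count(" ") + 1    # number of matched tokens
    return " ".join(word for word, _tag in tagged[start:start + length])

FAUST = [("Faust", "N"), ("made", "VB"), ("a", "DET"), ("deal", "N"),
         ("with", "PREP"), ("the", "DET"), ("devil", "N")]
GOLD = [("Gold", "N"), ("has", "VB"), ("an", "DET"), ("atomic", "ADJ"),
        ("weight", "N"), ("of", "PREP"), ("196", "NUM")]

faust_relation = relation_phrase(FAUST)
gold_relation = relation_phrase(GOLD)
```

Requiring the phrase to run through to the preposition yields "made a deal with" and "has an atomic weight of" rather than the uninformative "made" and "has".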
Slide 23
http://openie.cs.washington.edu/
Slide 24: Enhanced Patterns
[Nakashole@EMNLP2012 "PATTY"]
Syntactic-Lexical-Ontological (SOL) patterns combine
  ontological types
  lexical surface form
  syntactic properties
Examples:
  Amy Winehouse's cosy voice in her song 'Rehab'
  Jim Morrison's haunting voice and charisma in 'The End'
  Joan Baez's angel-like voice in 'Farewell Angelina'
SOL pattern: <singer> 's ADJECTIVE voice * in <song>
Patterns can subsume each other: "wife of" => "spouse of", …
which means that we can create synsets of patterns and arrange them in a taxonomy.
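A small matcher illustrates how the three ingredients of an SOL pattern interact. The token representation (text, POS tag, entity type), the hand-annotated sentences and the reading of * as "zero or more arbitrary tokens" are assumptions for this sketch; the wildcard semantics in PATTY may differ.

```python
def matches(pattern, tokens):
    """Check whether a token sequence matches an SOL-style pattern.

    pattern elements: "<type>" (ontological type), upper-case word (POS tag),
    "*" (wildcard), anything else (literal surface form).
    tokens: list of (text, pos, entity_type) triples.
    """
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        # Wildcard: try skipping zero or more tokens.
        return any(matches(rest, tokens[i:]) for i in range(len(tokens) + 1))
    if not tokens:
        return False
    text, pos, entity_type = tokens[0]
    if head.startswith("<") and head.endswith(">"):
        ok = entity_type == head[1:-1]      # ontological type
    elif head.isupper():
        ok = pos == head                    # syntactic property
    else:
        ok = text == head                   # lexical surface form
    return ok and matches(rest, tokens[1:])

SOL = ["<singer>", "'s", "ADJECTIVE", "voice", "*", "in", "<song>"]

MORRISON = [("Jim Morrison", "NOUN", "singer"), ("'s", "POS", None),
            ("haunting", "ADJECTIVE", None), ("voice", "NOUN", None),
            ("and", "CONJ", None), ("charisma", "NOUN", None),
            ("in", "PREP", None), ("The End", "NOUN", "song")]
BAEZ = [("Joan Baez", "NOUN", "singer"), ("'s", "POS", None),
        ("angel-like", "ADJECTIVE", None), ("voice", "NOUN", None),
        ("in", "PREP", None), ("Farewell Angelina", "NOUN", "song")]

morrison_ok = matches(SOL, MORRISON)
baez_ok = matches(SOL, BAEZ)
```

Both sentences match the same pattern even though their adjectives and the words covered by the wildcard differ, which is what lets one pattern generalize over many surface forms.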
Slide 25: Enhanced Patterns
[Nakashole@EMNLP2012 "PATTY"]
350 000 SOL patterns with 4 million instances
accessible at: www.mpi-inf.mpg.de/yago-naga/patty
Slide 26: Open Problems and Grand Challenges
Real-time & incremental fact extraction for continuous KB growth & maintenance (life-cycle management over years and decades)
Extensions to ternary & higher-arity relations
  events in context: who did what to/with whom when where why …?
Robust fact extraction with both high precision & recall, as highly automated (self-tuning) as possible
Extend the approaches to other languages