Slide1
Gerhard Weikum, Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~weikum/
From Information to Knowledge
Harvesting Entities and Relationships From Web Sources
Martin Theobald, Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/~mtb/
Slide2
Goal: Turn Web into Knowledge Base
comprehensive DB of human knowledge
everything that Wikipedia knows
everything machine-readable
capturing entities, classes, relationships
Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009
Slide3
Approach: Harvesting Facts from Web

Politician                  Political Party
Angela Merkel               CDU
Karl-Theodor zu Guttenberg  CDU
Christoph Hartmann          FDP
…

Company  CEO
Google   Eric Schmidt
…

Movie       ReportedRevenue
Avatar      $2,718,444,933
The Reader  $108,709,522
…

PoliticalParty  Spokesperson
CDU             Philipp Wachholz
Die Grünen      Claudia Roth
…

Actor            Award
Christoph Waltz  Oscar
Sandra Bullock   Oscar
Sandra Bullock   Golden Raspberry
…

Politician                  Position
Angela Merkel               Chancellor, Germany
Karl-Theodor zu Guttenberg  Minister of Defense, Germany
Christoph Hartmann          Minister of Economy, Saarland
…

Company      AcquiredCompany
Google       YouTube
Yahoo        Overture
Facebook     FriendFeed
Software AG  IDS Scheer
…

YAGO-NAGA, IWP, Cyc, TextRunner, ReadTheWeb
Slide4
Knowledge as Enabling Technology
entity recognition & disambiguation
understanding natural language & speech
knowledge services & reasoning for semantic apps (e.g. deep QA)
semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.)
  Indy 500 winners who are still alive?
  Politicians who are also scientists?
  Enzymes that inhibit HIV?
  Influenza drugs for teens with high blood pressure?
  ...
  US president when Barack Obama was born?
  Relationship between Angela Merkel, Jim Gray, Dalai Lama?
Slide5
Knowledge Search (1)
http://www.wolframalpha.com
Who was US president when Barack Obama was born?
Slide6
Knowledge Search (1)
http://www.wolframalpha.com
Who was mayor of Indianapolis when Barack Obama was born?
not enough facts in KB!
Slide7
Knowledge Search (2)
http://www.google.com/squared/
Indy500 winners?
Slide8
Knowledge Search (2)
http://www.google.com/squared/
Indy500 winners?
Slide9
Knowledge Search (2)
http://www.google.com/squared/
Indy500 winners from Europe?
no types, no inference!
Slide10
Related Work
[Figure: landscape of related projects, arranged around three areas: ontologies (Semantic Web), information extraction (Statistical Web), communities (Social Web)]
Systems shown: Kylin, KOG, Cyc, Freebase, Cimple, DBlife, UIMA, DBpedia, Yago-Naga, StatSnowball, EntityCube, Avatar, System T, Powerset, START, Answers, SWSE, Hakia, TextRunner, TrueKnowledge, WolframAlpha, Text2Onto, sig.ma, kosmix, KnowItAll, ReadTheWeb, GoogleSquared, IWP, WebTables, WorldWideTables, PSOX, EntityRank, Cazoodle
Slide11
Outline...
Framework
Entities and Classes
Relationships
Temporal Knowledge
What and Why
Wrap-up
Slide12
Framework: Types of Knowledge
...
facts / assertions:
  bornIn (JohnDillinger, Indianapolis), hasWon (JimGray, TuringAward), …
taxonomic:
  instanceOf (JohnDillinger, bankRobbers), subclassOf (bankRobbers, criminals), …
lexical / terminology:
  means (“Big Apple”, NewYorkCity), means (“Big Mike”, MichaelStonebraker),
  means (“MS”, Microsoft), means (“MS”, MultipleSclerosis), …
common-sense properties:
  apples are green, red, juicy, sweet, sour … - but not fast, smart …
  balls are round, smooth, slippery … - but not square, funny …
common-sense axioms:
  ∀x: human(x) ⇒ male(x) ∨ female(x)
  ∀x: (male(x) ⇒ ¬female(x)) ∧ (female(x) ⇒ ¬male(x))
  ∀x: animal(x) ⇒ (hasLegs(x) ⇒ isEven(numberOfLegs(x)))
  …
procedural:
  how to fix/install/prepare/remove …
epistemic / beliefs:
  believes (Ptolemy, shape(Earth, disc)), believes (Copernicus, shape(Earth, sphere)), …
Slide13
Framework: Information Extraction (IE)

source-centric IE (one source):
  Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal …
  instanceOf (Surajit, scientist)
  inField (Surajit, computer science)
  hasAdvisor (Surajit, Jeff Ullman)
  almaMater (Surajit, Stanford U)
  workedFor (Surajit, HP)
  friendOf (Surajit, Umesh Dayal)
  …
  1) recall! 2) precision

yield-centric harvesting (many sources):
  hasAdvisor (Student, Advisor)
  almaMater (Student, University)
  1) precision! 2) recall
  near-human quality!

  Student            Advisor
  Surajit Chaudhuri  Jeffrey Ullman
  Alon Halevy        Jeffrey Ullman
  Jim Gray           Mike Harrison
  …                  …

  Student            University
  Surajit Chaudhuri  Stanford U
  Alon Halevy        Stanford U
  Jim Gray           UC Berkeley
  …                  …
Slide14
Framework: Knowledge Representation
...
RDF (Resource Description Framework, W3C): subject-property-object (SPO) triples, binary relations; structure, but no (prescriptive) schema
Relations, frames
Description logics: OWL, DL-lite
Higher-order logics, epistemic logics

facts (RDF triples):
  1: (JimGray, hasAdvisor, MikeHarrison)
  2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
  3: (Madonna, marriedTo, GuyRitchie)
  4: (NicolasSarkozy, marriedTo, CarlaBruni)

facts about facts:
  5: (1, inYear, 1968)
  6: (2, inYear, 2006)
  7: (3, validFrom, 22-Dec-2000)
  8: (3, validUntil, Nov-2008)
  9: (4, validFrom, 2-Feb-2008)
  10: (2, source, SigmodRecord)

temporal & provenance annotations can refer to reified facts via fact identifiers (approx. equiv. to RDF quadruples: “Color” Sub Prop Obj)
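The reification idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the YAGO storage format: the `Fact` class, the `FACTS` dictionary and the `annotations` helper are hypothetical names. A fact whose subject is an integer is read as a statement about the fact with that identifier.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Fact:
    # The subject is either an entity name (str) or the id of another fact (int).
    subject: Union[int, str]
    prop: str
    obj: str


# Fact identifiers and contents follow the slide.
FACTS: Dict[int, Fact] = {
    1: Fact("JimGray", "hasAdvisor", "MikeHarrison"),
    2: Fact("SurajitChaudhuri", "hasAdvisor", "JeffUllman"),
    3: Fact("Madonna", "marriedTo", "GuyRitchie"),
    4: Fact("NicolasSarkozy", "marriedTo", "CarlaBruni"),
    5: Fact(1, "inYear", "1968"),
    6: Fact(2, "inYear", "2006"),
    7: Fact(3, "validFrom", "22-Dec-2000"),
    8: Fact(3, "validUntil", "Nov-2008"),
    9: Fact(4, "validFrom", "2-Feb-2008"),
    10: Fact(2, "source", "SigmodRecord"),
}


def annotations(fact_id: int) -> Dict[str, str]:
    """Collect all facts-about-facts that refer to the given fact id."""
    return {f.prop: f.obj for f in FACTS.values() if f.subject == fact_id}
```

For example, `annotations(3)` gathers the temporal scope of the Madonna marriage fact, and `annotations(2)` gathers the year and the provenance of fact 2.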
Slide15
KB’s: Example YAGO (Suchanek et al.: WWW’07)
http://www.mpi-inf.mpg.de/yago-naga/
[Figure: excerpt of the YAGO graph. Classes (Entity, Person, Scientist, Physicist, Biologist, Politician, Location, City, Country, State, Organization) are connected by subclass edges. Entities (Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Germany, Schleswig-Holstein, Max_Planck Society, Nobel Prize) are attached by instanceOf edges and linked by relations such as bornOn, diedOn, bornIn, hasWon, FatherOf, citizenOf, locatedIn, with date literals Apr 23, 1858, Oct 4, 1947 and Oct 23, 1944. Names (“Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, “Angela Dorothea Merkel”) are linked to entities by means edges, weighted where ambiguous: means(0.9), means(0.1).]
Accuracy 95%
2 million entities, 20 million facts
40 million RDF triples (entity1-relation-entity2, subject-predicate-object)
Slide16
KB’s: Example YAGO (F. Suchanek et al.: WWW’07)
http://www.mpi-inf.mpg.de/yago-naga/
Slide17
KB’s: Example DBpedia (Auer, Bizer, et al.: ISWC’07)
http://www.dbpedia.org
3 million entities, 1 billion facts (RDF triples)
1.5 million entities mapped to hand-crafted taxonomy of 259 classes with 1200 properties
Slide18
Outline...
Framework
Entities and Classes
Relationships
Temporal Knowledge
What and Why
Wrap-up
Slide19
Entities & Classes
...
Which entity types (classes, unary predicates) are there?
  scientists, doctoral students, computer scientists, …
  female humans, male humans, married humans, …
Which subsumptions should hold (subclass/superclass, hyponym/hypernym, inclusion dependencies)?
  subclassOf (computer scientists, scientists),
  subclassOf (scientists, humans), …
Which individual entities belong to which classes?
  instanceOf (Surajit Chaudhuri, computer scientists),
  instanceOf (Barbara Liskov, computer scientists),
  instanceOf (Barbara Liskov, female humans), …
Which names denote which entities?
  means (“Lady Di”, Diana Spencer),
  means (“Diana Frances Mountbatten-Windsor”, Diana Spencer), …
  means (“Madonna”, Madonna Louise Ciccone),
  means (“Madonna”, Madonna (painting by Edvard Munch)), …
Slide20
WordNet Thesaurus [Miller/Fellbaum 1998]
http://wordnet.princeton.edu/
3 concepts / classes & their synonyms (synsets)
Slide21
WordNet Thesaurus [Miller/Fellbaum 1998]
http://wordnet.princeton.edu/
subclasses (hyponyms)
superclasses (hypernyms)
Slide22
WordNet Thesaurus [Miller & Fellbaum 1998]
http://wordnet.princeton.edu/
> 100 000 classes and lexical relations; can be cast into description logics or graph, with weights for relation strengths (derived from co-occurrence statistics)
scientist, man of science (a person with advanced knowledge)
  => cosmographer, cosmographist
  => biologist, life scientist
  => chemist
  => cognitive scientist
  => computer scientist
  ...
  => principal investigator, PI
  …
  HAS INSTANCE => Bacon, Roger Bacon
  …
but: only few individual entities (instances of classes)
Slide23
Tapping on Wikipedia Categories
Slide24
Tapping on Wikipedia Categories
Slide25
Mapping: Wikipedia → WordNet
[Suchanek: WWW’07, Ponzetto & Strube: AAAI’07]
[Figure: the Wikipedia entity Jim Gray (computer specialist) and candidate WordNet classes: Computer Scientist, American, Scientist, Sailor / Crewman, Missing Person, Chemist, Artist]
Slide26
Mapping: Wikipedia → WordNet
[Suchanek: WWW’07, Ponzetto & Strube: AAAI’07]
[Figure: Jim Gray (computer specialist) with his Wikipedia categories (American Computer Scientists, Database Researcher, Fellows of the ACM, People Lost at Sea) and their super-categories (Computer Scientists by Nation, Databases, ACM, Members of Learned Societies, Engineering Societies). Open instanceOf and subclassOf edges, marked “?”, lead to candidate WordNet classes: Computer Scientist, Scientist, American, Database, Fellow (1) / Comrade, Fellow (2) / Colleague, Fellow (3) (of Society), Member (1) / Fellow, Member (2) / Extremity, Sailor / Crewman, Missing Person]
name similarity (edit dist., n-gram overlap)?
context similarity (word/phrase level)?
machine learning?
Slide27
Mapping: Wikipedia → WordNet
[Suchanek: WWW’07, Ponzetto & Strube: AAAI’07]
Analyzing category names with a noun group parser; each name is split into pre-modifier, head, post-modifier:
  American Musicians of Italian Descent
  American Folk Music of the 20th Century
  American Indy 500 Drivers on Pole Positions
Head word is key, should be in plural for instanceOf
Given: entity e in Wikipedia categories c1, …, ck
Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c
Problem: vagueness & ambiguity of names c1, …, ck
Slide28
Mapping Wikipedia Entities to WordNet Classes
[Suchanek: WWW’07, Ponzetto & Strube: AAAI’07]
Given: entity e in Wikipedia categories c1, …, ck
Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c
Problem: vagueness & ambiguity of names c1, …, ck
Heuristic Method:
for each ci do
  if head word w of category name ci is plural {
    1) match w against synsets of WordNet classes
    2) choose best fitting class c and set e ∈ c
    3) expand w by pre-modifier and set ci ⊆ w+ ⊆ c }
tuned conservatively: high precision, reduced recall
can also derive features this way → feed into supervised classifier
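The heuristic above can be sketched as follows. This is a toy illustration under strong assumptions, not the YAGO implementation: the preposition list, the naive plural test and the `TOY_CLASSES` dictionary (a stand-in for WordNet synset matching) are all hypothetical.

```python
PREPOSITIONS = {"of", "in", "by", "on", "from", "at", "for"}

# Hypothetical stand-in for WordNet: singular head noun -> class name.
TOY_CLASSES = {"musician": "musician", "scientist": "scientist", "driver": "driver"}


def split_category(name):
    """Split a category name into (pre-modifier, head, post-modifier).

    Assumption: the head is the last word before the first preposition.
    """
    words = name.split()
    cut = next((i for i, w in enumerate(words) if w.lower() in PREPOSITIONS), len(words))
    if cut == 0:  # name starts with a preposition: treat the last word as head
        cut = len(words)
    return " ".join(words[:cut - 1]), words[cut - 1], " ".join(words[cut:])


def singular(word):
    """Naive plural test: 'Musicians' -> 'musician'; non-plural words -> None."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and not w.endswith("ss") else None


def map_entity(entity, categories):
    """Derive instanceOf / subclassOf facts for an entity from its category names."""
    facts = []
    for cat in categories:
        pre, head, _ = split_category(cat)
        stem = singular(head)
        if stem is None or stem not in TOY_CLASSES:
            continue  # conservative: skip non-plural or unmatched head words
        cls = TOY_CLASSES[stem]
        facts.append(("instanceOf", entity, cls))
        if pre:  # step 3: category ⊆ pre-modifier + head ⊆ class
            expanded = f"{pre} {head}"
            facts.append(("subclassOf", cat, expanded))
            facts.append(("subclassOf", expanded, cls))
        else:
            facts.append(("subclassOf", cat, cls))
    return facts
```

With the categories from the previous slide, "American Musicians of Italian Descent" yields facts, while "American Folk Music of the 20th Century" is skipped because its head word is not plural.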
Slide29
Learning More Mappings [Wu & Weld: WWW’08]
Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using
  YAGO as training data
  advanced ML methods (MLN’s, SVM’s)
  rich features from various sources:
    category/class name similarity measures
    category instances and their infobox templates: template names, attribute names (e.g. knownFor)
    Wikipedia edit history: refinement of categories
    Hearst patterns: C such as X, X and Y and other C’s, …
    other search-engine statistics: co-occurrence frequencies
> 3 million entities
> 1 million w/ infoboxes
> 500 000 categories
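As a small illustration of the Hearst-pattern feature mentioned above, the following sketch matches two such patterns with regular expressions. It is not the KOG feature extractor: the `NAME` expression (runs of capitalized words) and the function name are assumptions made for this example, and real text would need proper noun-phrase chunking.

```python
import re

# Assumption: an instance name is a run of capitalized words.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
SUCH_AS = re.compile(rf"(\w+) such as ({NAME})")      # "C such as X"
AND_OTHER = re.compile(rf"({NAME}) and other (\w+)")  # "X and other C's"


def hearst_candidates(text):
    """Return (instance, class word) hypotheses found by the two patterns."""
    pairs = [(x, c) for c, x in SUCH_AS.findall(text)]
    pairs += [(x, c) for x, c in AND_OTHER.findall(text)]
    return pairs
```

The class word is returned as it appears in the text (typically a plural noun); mapping it to a WordNet class would be a separate step.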
Slide30
Goal: Comprehensive & Consistent!
[Figure: the entities Jim Gray (computer specialist), Madonna (entertainer), Jeffrey Ullman, Bob Dylan, … inside a tangled graph of category, class and infobox-attribute labels. Labels as extracted: American Computer Scientists; Database Researcher; Fellows of the ACM; Databases; Members of Learned Societies; Artist; Singer; Italian American Musician; Born; Award Winner; Scientist; Known For; Alma Mater; Notable Awards; Doctoral Students; Academic; Bell Labs; Princeton Alumni; Knuth Prize Laureate; American People by Occupation; Fellow (1); Fellow (2); World Record Holders; American Songwriters; Athlete; Genres; Years Active; Hall of Fame Inductees; U Michigan Alumni; Also Known As; Website; Guitar Players; Americans of Italian Descent; People by Status; Computer Data; Telecomm. History]
Slide33
Goal: Comprehensive & Consistent!
[Figure: same graph of entities, categories, classes and infobox attributes as on Slide30]
Clean up the mess:
  graph algorithms? (random walk with restart, dense subgraphs, …)
  statistical machine learning?
  logical consistency reasoning?
  gigantic schema integration? ontology merging
Slide34
Long Tail of Class Instances
Slide35
Long Tail of Class Instances
[Etzioni et al. 2004, Cohen et al. 2008, Mitchell et al. 2010]
State-of-the-Art Approach (e.g. SEAL):
  Start with seeds: a few class instances
  Find lists, tables, text snippets (“for example: …”), … that contain one or more seeds
  Extract candidates: noun phrases from vicinity
  Gather co-occurrence stats (seed&cand, cand&className pairs)
  Rank candidates: point-wise mutual information, …, random walk (PR-style) on seed-cand graph
But:
  Precision drops for classes with sparse statistics (DB profs, …)
  Harvested items are names, not entities
  Canonicalization (de-duplication) unsolved
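The ranking step can be illustrated with point-wise mutual information over co-occurrence counts. The sketch below is not SEAL itself; it assumes that each snippet has already been reduced to a list of recognized names, and the function name and the averaging over seeds are choices made only for this example.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(snippets, seeds):
    """Score each non-seed name by its average PMI with the seeds it co-occurs with.

    PMI(s, c) = log( P(s, c) / (P(s) * P(c)) ), with probabilities estimated
    as fractions of snippets.
    """
    n = len(snippets)
    single, pair = Counter(), Counter()
    for items in snippets:
        unique = set(items)
        single.update(unique)
        pair.update(frozenset(p) for p in combinations(sorted(unique), 2))

    scores = {}
    for cand in single:
        if cand in seeds:
            continue
        values = []
        for seed in seeds:
            joint = pair[frozenset((seed, cand))]
            if joint:
                values.append(math.log(joint * n / (single[seed] * single[cand])))
        if values:  # candidates that never co-occur with a seed get no score
            scores[cand] = sum(values) / len(values)
    return scores
```

Sorting the candidates by score gives the ranking. Note that PMI tends to favor rare names, which is one reason why precision drops for classes with sparse statistics.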
Slide36
Individual Entity Disambiguation
Names → Entities ?
  Names: “Penn”, “U Penn”, “Penn State”, “PSU”
  Entities: University of Pennsylvania, Pennsylvania State University, Pennsylvania (US State), Sean Penn, Passenger Service Unit
ill-defined with zero context
known as record linkage for names in record fields
Wikipedia offers rich candidate mappings: disambiguation pages, re-directs, inter-wiki links, anchor texts of href links
Slide37
Collective Entity Disambiguation
[McCallum 2003, Doan 2005, Getoor 2006, Domingos 2007, Chakrabarti 2009, …]
Consider a set of names {n1, n2, …} in same context and sets of candidate entities E1 = {e11, e12, …}, E2 = {e21, e22, …}, …
Define joint objective function (e.g. likelihood for prob. model) that rewards coherence of mappings ni → eij
Solve optimization problem
Names in same context: Stuart Russell, Michael Jordan
  Stuart Russell → Stuart Russell (computer scientist) or Stuart Russell (DJ)
  Michael Jordan → Michael Jordan (computer scientist) or Michael Jordan (NBA)
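A minimal sketch of the joint optimization, assuming a pairwise coherence score between entities is given. The function name, the exhaustive search and the scores used below are illustrative only; real systems use likelihoods of a probabilistic model and approximate inference, since the search space grows exponentially with the number of names.

```python
from itertools import product


def disambiguate(candidates, coherence):
    """Pick one entity per name so that the summed pairwise coherence is maximal.

    candidates: name -> list of candidate entities
    coherence:  frozenset({entity_a, entity_b}) -> score (missing pairs count as 0)
    """
    names = list(candidates)
    best, best_score = None, float("-inf")
    for choice in product(*(candidates[n] for n in names)):
        score = sum(
            coherence.get(frozenset((a, b)), 0.0)
            for i, a in enumerate(choice)
            for b in choice[i + 1:]
        )
        if score > best_score:
            best, best_score = choice, score
    return dict(zip(names, best))
```

With a high (made-up) coherence score between the two computer scientists, both names are mapped to the computer-scientist readings even though each name alone is ambiguous.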
Slide38
Problems and Challenges
Wikipedia categories reloaded
Robust disambiguation
Tags, tables, topics
Long tail of entities

comprehensive & consistent instanceOf and subClassOf across Wikipedia and WordNet (via consistency reasoning?)
tap on other sources: Web2.0, Web tables, directories, etc.
near-real-time mapping of names to entities with near-human quality
discover new entities, detect new names for known entities
beyond Wikipedia: domain-specific entity catalogs
Slide39
Outline...
Framework
Entities and Classes
Relationships
Temporal Knowledge
What and Why
Wrap-up
Slide40
Relationships
Which instances (pairs of individual entities) are there for given binary relations with specific type signatures?
  hasAdvisor (JimGray, MikeHarrison)
  hasAdvisor (HectorGarcia-Molina, Gio Wiederhold)
  hasAdvisor (Susan Davidson, Hector Garcia-Molina)
  graduatedAt (JimGray, Berkeley)
  graduatedAt (HectorGarcia-Molina, Stanford)
  hasWonPrize (JimGray, TuringAward)
  bornOn (JohnLennon, 9Oct1940)
  diedOn (JohnLennon, 8Dec1980)
  marriedTo (JohnLennon, YokoOno)
Which additional & interesting relation types are there between given classes of entities?
  competedWith(x,y), nominatedForPrize(x,y), …
  divorcedFrom(x,y), affairWith(x,y), …
  assassinated(x,y), rescued(x,y), admired(x,y), …
Slide41
Picking Low-Hanging Fruit (First)
Slide42
Deterministic Pattern Matching...
[Kushmerick 97, Califf & Mooney 99, Gottlob 01, …]
Regular expressions matching
Wrapper induction (grammar learning for restricted regular languages)
Well understood
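Deterministic matching with a regular expression can look like the sketch below. The sentence template "X was born on D" and the names `BORN_ON` and `extract_born_on` are assumptions chosen for illustration; a wrapper for a real site would be induced from that site's markup.

```python
import re

# Assumption: names are runs of capitalized words, dates look like "9 October 1940".
BORN_ON = re.compile(
    r"(?P<name>[A-Z]\w+(?: [A-Z]\w+)*) was born on "
    r"(?P<date>\d{1,2} [A-Z][a-z]+ \d{4})"
)


def extract_born_on(text):
    """Return bornOn facts for every sentence that fits the fixed template."""
    return [("bornOn", m["name"], m["date"]) for m in BORN_ON.finditer(text)]
```

Such patterns are precise on regular sources (infoboxes, generated pages) but miss every paraphrase, which is why the following slides turn to pattern harvesting and reasoning.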
Slide43
French Marriage Problem
facts in KB:
  married (Hillary, Bill)
  married (Carla, Nicolas)
  married (Angelina, Brad)
new facts or fact candidates:
  married (Cecilia, Nicolas)
  married (Carla, Benjamin)
  married (Carla, Mick)
  married (Michelle, Barack)
  married (Yoko, John)
  married (Kate, Leonardo)
  married (Carla, Sofie)
  married (Larry, Google)
for recall: pattern-based harvesting
for precision: consistency reasoning
Slide44
Pattern-Based Harvesting
(Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)
Facts:
  (Hillary, Bill)
  (Carla, Nicolas)
Patterns:
  X and her husband Y
  X and Y on their honeymoon
  X and Y and their children
  X has been dating with Y
  X loves Y
  …
Facts & Fact Candidates:
  (Angelina, Brad), (Hillary, Bill), (Victoria, David), (Carla, Nicolas), (Yoko, John), (Carla, Benjamin), (Larry, Google), (Kate, Pete)
good for recall
noisy, drifting
not robust enough for high precision
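The facts-to-patterns-to-facts loop can be sketched as below, assuming that entity mentions are already recognized so that each text occurrence is a triple (X, pattern, Y). The function name and the occurrence data are made up for illustration.

```python
def harvest(occurrences, seeds, rounds=2):
    """Alternate between learning patterns from known facts and facts from patterns."""
    facts, patterns = set(seeds), set()
    for _ in range(rounds):
        # A pattern is accepted if it connects at least one known fact.
        patterns |= {p for x, p, y in occurrences if (x, y) in facts}
        # A pair is accepted if it occurs with at least one accepted pattern.
        facts |= {(x, y) for x, p, y in occurrences if p in patterns}
    return facts, patterns


OCCURRENCES = [
    ("Hillary", "and her husband", "Bill"),
    ("Carla", "and her husband", "Nicolas"),
    ("Angelina", "and her husband", "Brad"),
    ("Angelina", "loves", "Brad"),
    ("Larry", "loves", "Google"),
]
```

Starting from the single seed (Hillary, Bill), one round finds the three couples; a second round accepts the pattern "loves" and with it (Larry, Google). This is the drift that the slide warns about, and the reason why harvesting alone is not robust enough for high precision.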
Slide45
Reasoning about Fact Candidates
Use consistency constraints to prune false candidates
ground atoms:
  spouse(Hillary,Bill)
  spouse(Carla,Nicolas)
  spouse(Cecilia,Nicolas)
  spouse(Carla,Ben)
  spouse(Carla,Mick)
  spouse(Carla,Sofie)
  f(Hillary), f(Carla), f(Cecilia), f(Sofie)
  m(Bill), m(Nicolas), m(Ben), m(Mick)
FOL rules (restricted):
  spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
  spouse(x,y) ∧ diff(w,x) ⇒ ¬spouse(w,y)
  spouse(x,y) ⇒ f(x)
  spouse(x,y) ⇒ m(y)
  spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))
Rules reveal inconsistencies
Find consistent subset(s) of atoms (“possible world(s)”, “the truth”)
Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule)
  → uncertain / probabilistic data
  → compute prob. distr. of subset of atoms being the truth
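Finding a consistent subset of atoms can be sketched by exhaustive search, assuming each candidate carries a confidence weight. The weights below are invented for the example; the constraints are the two functional dependencies and the type rules spouse(x,y) ⇒ f(x), spouse(x,y) ⇒ m(y) from the slide. Exhaustive search is exponential and only serves to make the objective explicit.

```python
from itertools import combinations


def consistent(world, female, male):
    """Check type constraints and both functional dependencies for a set of pairs."""
    if any(x not in female or y not in male for x, y in world):
        return False
    firsts = [x for x, _ in world]
    seconds = [y for _, y in world]
    return len(set(firsts)) == len(firsts) and len(set(seconds)) == len(seconds)


def best_world(candidates, female, male):
    """Return the consistent subset of candidate pairs with maximal total weight."""
    items = list(candidates.items())
    best, best_weight = frozenset(), 0.0
    for r in range(1, len(items) + 1):
        for subset in combinations(items, r):
            world = [pair for pair, _ in subset]
            weight = sum(w for _, w in subset)
            if weight > best_weight and consistent(world, female, male):
                best, best_weight = frozenset(world), weight
    return best, best_weight


# Hypothetical confidence weights for the spouse candidates on the slide.
CANDIDATES = {
    ("Hillary", "Bill"): 0.9,
    ("Carla", "Nicolas"): 0.8,
    ("Cecilia", "Nicolas"): 0.4,
    ("Carla", "Ben"): 0.3,
    ("Carla", "Mick"): 0.2,
    ("Carla", "Sofie"): 0.5,
}
FEMALE = {"Hillary", "Carla", "Cecilia", "Sofie"}
MALE = {"Bill", "Nicolas", "Ben", "Mick"}
```

Under these weights the search keeps (Hillary, Bill) and (Carla, Nicolas) and prunes the rest. Note that (Cecilia, Nicolas) is pruned although it was true at an earlier time; a time-less functional dependency cannot express that, which motivates the temporal part of the tutorial.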
Slide46
Markov Logic Networks (MLN’s) (M. Richardson / P. Domingos 2006)
Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)
Rules:
  s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
  s(x,y) ∧ diff(w,x) ⇒ ¬s(w,y)
  s(x,y) ⇒ f(x)
  s(x,y) ⇒ m(y)
  f(x) ⇒ ¬m(x)
  m(x) ⇒ ¬f(x)
Fact candidates:
  s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
Grounding:
  ¬s(Ca,Nic) ∨ ¬s(Ce,Nic)
  ¬s(Ca,Nic) ∨ ¬s(Ca,Ben)
  ¬s(Ca,Nic) ∨ ¬s(Ca,So)
  ¬s(Ca,Ben) ∨ ¬s(Ca,So)
  ¬s(Ca,Nic) ∨ m(Nic)
  ¬s(Ce,Nic) ∨ m(Nic)
  ¬s(Ca,Ben) ∨ m(Ben)
  ¬s(Ca,So) ∨ m(So)
Literal → Boolean Var
Literal → binary RV
Slide47
Markov Logic Networks (MLN’s) (M. Richardson / P. Domingos 2006)
Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)
Rules:
  s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
  s(x,y) ∧ diff(w,x) ⇒ ¬s(w,y)
  s(x,y) ⇒ f(x)
  s(x,y) ⇒ m(y)
  f(x) ⇒ ¬m(x)
  m(x) ⇒ ¬f(x)
Fact candidates:
  s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
MRF nodes: m(Ben), m(Nic), s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So), m(So)
RVs coupled by MRF edge if they appear in same clause
MRF assumption: P[X_i | X_1 .. X_n] = P[X_i | N(X_i)]
joint distribution has product form over all cliques
Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
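For a handful of ground atoms, the MLN distribution can be computed exactly by enumerating all worlds, which makes the product form concrete: each world gets the unnormalized weight exp(sum of the weights of its satisfied ground clauses). The sketch below is not Alchemy or any of the inference algorithms named above; the clause encoding, the function name and the clause weights are assumptions for illustration.

```python
import math
from itertools import product


def marginals(atoms, clauses):
    """Exact marginal P(atom = True) by enumerating all truth assignments.

    clauses: list of (weight, literals); a literal is (atom, truth_value) and a
    clause is satisfied if at least one of its literals holds in the world.
    """
    totals = dict.fromkeys(atoms, 0.0)
    z = 0.0  # partition function
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        satisfied = sum(w for w, lits in clauses if any(world[a] == v for a, v in lits))
        weight = math.exp(satisfied)
        z += weight
        for a in atoms:
            if world[a]:
                totals[a] += weight
    return {a: totals[a] / z for a in atoms}


ATOMS = ["s(Ca,Nic)", "s(Ce,Nic)"]
CLAUSES = [
    (2.0, [("s(Ca,Nic)", False), ("s(Ce,Nic)", False)]),  # ¬s(Ca,Nic) ∨ ¬s(Ce,Nic)
    (1.5, [("s(Ca,Nic)", True)]),  # hypothetical evidence weight for this candidate
    (0.5, [("s(Ce,Nic)", True)]),  # hypothetical evidence weight for this candidate
]
```

Enumeration is exponential in the number of atoms, which is why the sampling and propagation methods listed on the slide are used in practice.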
Slide48
Related Alternative Probabilistic Models
Constrained Conditional Models [D. Roth et al. 2007]
  log-linear classifiers with constraint-violation penalty
  mapped into Integer Linear Programs
Factor Graphs with Imperative Variable Coordination [A. McCallum et al. 2008]
  RV’s share “factors” (joint feature functions)
  generalizes MRF, BN, CRF, …
  inference via advanced MCMC
  flexible coupling & constraining of RV’s
  [Figure: factor graph over m(Ben), m(Nic), s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So), m(So)]
software tools:
  alchemy.cs.washington.edu
  code.google.com/p/factorie/
  research.microsoft.com/en-us/um/cambridge/projects/infernet/
Slide49
Reasoning for KB Growth: Direct Route
(F. Suchanek et al.: WWW’09)
www.mpi-inf.mpg.de/yago-naga/sofie/
facts in KB:
  married (Hillary, Bill)
  married (Carla, Nicolas)
  married (Angelina, Brad)
new fact candidates:
  married (Cecilia, Nicolas)
  married (Carla, Benjamin)
  married (Carla, Mick)
  married (Carla, Sofie)
  married (Larry, Google)
+ patterns:
  X and her husband Y
  X and Y and their children
  X has been dating with Y
  X loves Y
Direct approach:
  facts are true; fact candidates & patterns → hypotheses
  grounded constraints → clauses with hypotheses as vars
  cast into Weighted Max-Sat with weights from pattern stats
  customized approximation algorithm
  unifies: fact cand consistency, pattern goodness, entity disambig.
Slide50
Facts & Patterns Consistency
(F. Suchanek et al.: WWW’09)
www.mpi-inf.mpg.de/yago-naga/sofie/
constraints to connect facts, fact candidates, patterns
functional dependencies:
  spouse(X,Y): X → Y, Y → X
relation properties:
  asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies:
  spouse ⊆ Person × Person
  capitalOfCountry ⊆ cityOfCountry
domain-specific constraints:
  bornInYear(x) + 10years ≤ graduatedInYear(x)
  hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
pattern-fact duality:
  occurs(p,x,y) ∧ expresses(p,R) ⇒ R(x,y)
  occurs(p,x,y) ∧ R(x,y) ⇒ expresses(p,R)
name(-in-context)-to-entity mapping:
  means(n,e1) means(n,e2) …
Slide51
Soft Rules vs. Hard Constraints
Enforce FD’s (mutual exclusion) as hard constraints:
  hasAdvisor(x,y) ∧ diff(y,z) ⇒ ¬hasAdvisor(x,z)
Generalize to other forms of constraints:
  hard constraint:
    hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
  soft constraint:
    firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(p,y) ∧ inYear(p) > inYear(q) + 5years ⇒ hasAdvisor(x,y)
combine with weighted constraints
no longer MaxSat → constrained MaxSat instead
open issue for arbitrary constraints → rethink reasoning!
Slide52
Problems and Challenges
High precision & high recall at affordable cost
Scale, dynamics, life-cycle
Declarative, self-optimizing workflows
Types and constraints
Open-domain knowledge harvesting: turn names, phrase & table cells into entities & relations

robust pattern analysis & reasoning
incorporate pattern & reasoning steps into IE queries/programs
grow & maintain KB with near-human-quality over long periods
explore & understand different families of constraints
soft rules & hard constraints, rich DL, beyond CWA
parallel processing, lazy / lifted inference, …
Slide53
Outline...
Framework
Entities and Classes
Relationships
Temporal Knowledge
What and Why
Wrap-up
Slide54
Temporal Knowledge
Which facts for given relations hold at what time point or during which time intervals?
  marriedTo (Madonna, Guy) [ 22Dec2000, Dec2008 ]
  capitalOf (Berlin, Germany) [ 1990, now ]
  capitalOf (Bonn, Germany) [ 1949, 1989 ]
  hasWonPrize (JimGray, TuringAward) [ 1998 ]
  graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ]
  graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ]
  hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ]
How can we query & reason on entity-relationship facts in a “time-travel” manner - with uncertain/incomplete KB?
  US president when Barack Obama was born?
  students of Hector Garcia-Molina while he was at Princeton?
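A “time-travel” lookup over facts with validity intervals can be sketched as follows, using the two capitalOf facts from this slide. The tuple layout, the use of `None` for “now” and the function name `holds_at` are assumptions for this example; uncertainty and incompleteness are ignored here.

```python
from typing import List, Optional, Tuple

# (relation, subject, object, valid_from, valid_until); None stands for "now".
TemporalFact = Tuple[str, str, str, int, Optional[int]]

FACTS: List[TemporalFact] = [
    ("capitalOf", "Berlin", "Germany", 1990, None),
    ("capitalOf", "Bonn", "Germany", 1949, 1989),
]


def holds_at(facts: List[TemporalFact], relation: str, obj: str, year: int) -> List[str]:
    """Return all subjects s such that relation(s, obj) is valid in the given year."""
    return [
        s
        for (r, s, o, begin, end) in facts
        if r == relation and o == obj and begin <= year and (end is None or year <= end)
    ]
```

A query such as "US president when Barack Obama was born?" would combine two such lookups: first the time point of the bornOn fact, then the holder of the president relation at that time point.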
Slide55
French Marriage Problem
facts in KB:
  1: married (Hillary, Bill)
  2: married (Carla, Nicolas)
  3: married (Angelina, Brad)
new fact candidates:
  4: married (Cecilia, Nicolas)
  5: married (Carla, Benjamin)
  6: married (Carla, Mick)
  7: divorced (Madonna, Guy)
  8: domPartner (Angelina, Brad)
temporal annotations:
  validFrom (2, 2008)
  validFrom (4, 1996)
  validUntil (4, 2007)
  validFrom (5, 2010)
  validFrom (6, 2006)
  validFrom (7, 2008)
[Figure: time line with months JAN to DEC]
Slide56
Challenge: Temporal Knowledge
for all people in Wikipedia (100,000’s) gather all spouses, incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night
recall: gather temporal scopes for base facts
precision: reason on mutual consistency
consistency constraints are potentially helpful:
  functional dependencies: husband, time → wife
  inclusion dependencies: marriedPerson ⊆ adultPerson
  age/time/gender restrictions: birthdate + Δ < marriage < divorce
Slide57
Difficult Dating
Slide58
(Even More Difficult) Implicit Dating
explicit dates vs. implicit dates relative to other dates
Slide59
(Even More Difficult) Relative Dating
vague dates
relative dates
narrative text
relative order
Slide60
TARSQI: Extracting Time Annotations
(M. Verhagen et al.: ACL’05)
http://www.timeml.org/site/tarsqi/
Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.
extraction errors
Slide61
Representing Time: AI Perspective
[Allen 1984, Allen & Hayes 1989, …]
Instant: durationless piece of time
Period: potentially unbounded continuum of instants
Events: time as a sequence of events E; precedence and overlap relations on E × E
Slide62
Relations between Time Periods
A Before B      B After A
A Meets B       B MetBy A
A Overlaps B    B OverlappedBy A
A Starts B      B StartedBy A
A During B      B Contains A
A Finishes B    B FinishedBy A
A Equal B
[Figure: for each relation, the relative position of periods A and B on the time line]
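The thirteen relations can be decided from the interval end points. The sketch below assumes that a period is given as a pair (begin, end) of comparable values with begin ≤ end; the function name is chosen for this example.

```python
INVERSE = {
    "Before": "After",
    "Meets": "MetBy",
    "Overlaps": "OverlappedBy",
    "Starts": "StartedBy",
    "During": "Contains",
    "Finishes": "FinishedBy",
}


def allen_relation(a, b):
    """Name the relation that holds between period a and period b."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "Before"
    if a2 == b1:
        return "Meets"
    if a1 == b1 and a2 == b2:
        return "Equal"
    if a1 == b1 and a2 < b2:
        return "Starts"
    if a2 == b2 and a1 > b1:
        return "Finishes"
    if a1 > b1 and a2 < b2:
        return "During"
    if a1 < b1 < a2 < b2:
        return "Overlaps"
    # None of the seven base relations holds for (a, b), so one holds for (b, a).
    return INVERSE[allen_relation(b, a)]
```

For instance, the two capitalOf periods [1949, 1989] and [1990, 2010] are in relation Before, while (3, 5) against (1, 4) yields OverlappedBy.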
Slide63
Representing Time: DB Perspective
Time point: smallest time unit of fixed duration/granularity (e.g., a day, a year, a second)
Interval: finite set of time points
State relation: fact holds at every time point within interval
  isCapitalOf (Bonn, Germany) [1949, 1989]
Event relation: fact holds at exactly one time point within interval
  wonCup (United, ChampionsLeague) [1999, 1999]
intervals can also capture uncertainty of time points
Slide64
Uncertainty and Time
Point-probabilities for facts and intervals
  playsFor(Beckham, United) [1990, 2005]: 0.9
  fact valid in interval [tb, te] with prob. p
  fact not valid with prob. 1-p
Continuous distributions
  playsFor(Beckham, United) [1990, 2005]: Gauss(µ=1996, σ²=1)
Histograms
  playsFor(Beckham, United) [1990, 1992): 0.1, [1992, 2004): 0.6, [2004, 2005]: 0.2
[Figure: the three representations plotted over a time axis from ’90 to ’05]
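A histogram of this kind can be stored as a sorted list of bins and queried per time point. The sketch below makes two assumptions that the slide does not state: the value of a bin is read as the probability attached to every time point in that bin, and with yearly granularity the closed last bin [2004, 2005] is stored as the half-open bin [2004, 2006).

```python
from bisect import bisect_right

# (begin, end, probability) with half-open bins [begin, end), sorted by begin.
PLAYS_FOR_BECKHAM_UNITED = [
    (1990, 1992, 0.1),
    (1992, 2004, 0.6),
    (2004, 2006, 0.2),
]


def prob_at(histogram, t):
    """Look up the probability value of the bin that contains time point t."""
    begins = [begin for begin, _, _ in histogram]
    i = bisect_right(begins, t) - 1
    if i >= 0 and t < histogram[i][1]:
        return histogram[i][2]
    return 0.0  # outside all bins
```

The binary search keeps the lookup logarithmic in the number of bins; time points outside the histogram get probability 0.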
Slide65
Possible Worlds in Time
[Figure: the base facts playsFor(Beckham, United) (state) and wonCup(United, ChampionsLeague) (event), each with a histogram over a time axis between ’95 and ’02, are combined into the derived fact hasWon(Beckham, ChampionsLeague). Probabilities shown: 0.3, 0.6, 0.3, 0.54, 0.9, 1.0 and 0.2, 0.5, 0.1, 0.2 for the base facts; 0.12, 0.30, 0.06, 0.06 for the derived fact]
#P-complete per histogram bin
linear in # bins
Slide66
Joint Reasoning on Facts & Time
Facts from KB (with confidence weights):
  marriedTo(Nicolas, Carla) 0.91
  marriedTo(Nicolas, Cecilia) 0.65
  divorcedFrom(Nicolas, Cecilia) 0.78
  bornIn(Nicolas, Paris) 0.77
  bornIn(Cecilia, Boulogne) 0.12
  bornIn(Carla, Turin) 0.43
  marriedTo(Carla, Ben) 0.18
  marriedTo(Carla, Mick) 0.25
Rules:
  marriedTo(a,b,T1) ∧ marriedTo(a,c,T2) ∧ different(b,c) ⇒ disjoint(T1,T2)
  marriedTo(a,b,T1) ∧ divorcedFrom(a,b,T2) ⇒ before(T1,T2)
  marriedTo(a,b,T1) ∧ bornIn(a,c,T2) ⇒ before(T2,T1)
Slide67
Joint Reasoning on Facts & Time
Facts from KB (with confidence weights):
  marriedTo(Nicolas, Carla) 0.91
  marriedTo(Nicolas, Cecilia) 0.65
  divorcedFrom(Nicolas, Cecilia) 0.78
  bornIn(Nicolas, Paris) 0.77
  bornIn(Cecilia, Boulogne) 0.12
  bornIn(Carla, Turin) 0.43
  marriedTo(Carla, Ben) 0.18
  marriedTo(Carla, Mick) 0.25
Rules:
  marriedTo(a,b,T1) ∧ marriedTo(a,c,T2) ∧ different(b,c) ⇒ disjoint(T1,T2)
  marriedTo(a,b,T1) ∧ divorcedFrom(a,b,T2) ⇒ before(T1,T2)
  marriedTo(a,b,T1) ∧ bornIn(a,c,T2) ⇒ before(T2,T1)
+ more soft rules:
  hasChild(a,c) ∧ hasChild(b,c) ∧ different(a,b) ⇒ marriedTo(a,b)
+ recursive rules …
[Figure: time line with the temporal scopes of bornIn(Nicolas, Paris), bornIn(Cecilia, Boulogne), bornIn(Carla, Turin), m(Nicolas, Cecilia), div(Nicolas, Cecilia), m(Nicolas, Carla), m(Carla, Mick), m(Carla, Ben)]
Compute most likely possible world!
Slide68
Problems and Challenges
Temporal Querying (Revived)
Consistency Reasoning
Incomplete and Uncertain Temporal Scopes
Gathering Implicit and Relative Time Annotations

query language (T-SPARQL?), no schema
confidence weights & ranking
incorrect, incomplete, unknown begin/end
vague dating
biographies & news, relative orderings
aggregate & reconcile observations
extended MaxSat, extended Datalog, prob. graph. models, etc. for resolving inconsistencies on uncertain facts & uncertain time
Slide69
Outline...
Framework
Entities and Classes
Relationships
Temporal Knowledge
What and Why
Wrap-up
Slide70
KB Building: Where Do We Stand?
Entities & Classes: strong success story, some problems left:
  large taxonomies of classes with individual entities
  long tail calls for new methods
  entity disambiguation remains grand challenge
Relationships: good progress, but many challenges left:
  recall & precision by patterns & reasoning
  efficiency & scalability
  soft rules, hard constraints, richer logics, …
  open-domain discovery of new relation types
Temporal Knowledge: widely open (fertile) research ground:
  uncertain / incomplete temporal scopes of facts
  joint reasoning on ER facts and time scopes
Slide71
Overall Take-Home
...
Historic opportunity: revive Cyc vision, make it real & large-scale!
  challenging & risky, but high pay-off
Explore & exploit synergies between semantic, statistical, & social Web methods:
  statistical evidence + logical consistency!
For DB researchers (theoreticians & normal ones):
  efficiency & scalability
  constraints & reasoning
  killer app for uncertain data management
  knowledge-base life-cycle: growth & maintenance
Slide72
Thank You!