Presentation Transcript

Slide1

What knowledge bases know (and what they don't)

Simon Razniewski, Max Planck Institute for Informatics

Slide2

My background: Max Planck Institute for Informatics (MPII)

Max Planck Society: foundational research organization in Germany

MPII: 150 members, located in Saarbrücken (next to Paris)

Department 5: headed by Gerhard Weikum, ~25 members
Themes: language, data, knowledge
Notable projects: YAGO, WebChild

2


Slide3

Myself

Senior Researcher at MPI for Informatics, Germany; heading the "Knowledge Base Construction and Quality" area of Department 5 (4 PhD students)

Assistant professor at FU Bozen-Bolzano, Italy, 2014-2017
PhD from FU Bozen-Bolzano, 2014
Research stays at UCSD (2012), AT&T Labs-Research (2013), University of Queensland (2015)

Research interests:
KB construction in fiction (1 slide)
Commonsense knowledge (1 slide)
KB recall assessment (remainder of talk)

3

Slide4

Research interests (1): KB construction in fiction

Fictional texts as archetypes of domain-specific low-resource universes
Lord of the Rings, Marvel Superheroes, Amazon titles and roles, French Army lingo, model railway terminology

Taxonomies as backbones for KBC
Construction from noisy category systems, exploiting WordNet for abstract levels [WWW'19]

Entity types outside typical news/Wikipedia domains
Reference type systems from related universes
Typing by combining supervised, dependency-based and lookup-based modules
Consolidation using type correlation and taxonomical coherence [WSDM'20]

4

Slide5

Research interests (2): Commonsense knowledge

Properties of general world concepts instead of instances
Elephants, submarines, pianos. Not: Seattle, Trump, Amazon

Challenges:
Sparsity
Reporting bias (the web knows as many pink as grey elephants)
Semantics (lions have manes <> lions attack humans)

Our approach:
Comprehensive extraction from question data sources [CIKM'19]
Multifaceted semantics, consolidation via taxonomy-based soft constraints [under review/arXiv'20]

5

Slide6

What knowledge bases know (and what they don't)

Simon Razniewski, Max Planck Institute for Informatics

Slide7

KB construction: Current state

General-world knowledge: an old dream of AI
Large general and domain-specific KBs at most major tech companies

Research progress visible downstream:
IBM Watson beats humans in trivia game in 2011
Entity linking systems competitive with humans on popular news corpora
Systems pass 8th grade science tests in the AllenAI Science challenge in 2016

Intrinsic question: How good are these KBs?

7

Slide8

Intrinsic analysis

Is what they know true?

(precision or correctness)

Do they know what is true?

(recall or completeness)

8

Slide9

Recall awareness: Extrinsic relevance

Resource efficiency: directing extraction efforts towards incomplete regions
Truth consolidation: complete sources as evidence against spurious extractions
Question answering:
Integrity: say when you don't know
Negation and counts rely on completeness

9

Slide10

KB recall: Good?

10

Google Knowledge Graph: 39 out of 48 Tarantino movies
DBpedia: 167 out of 204 Nobel laureates in Physics
Wikidata: 2 out of 2 children of Obama

Slide11

KB recall: Bad?

11

DBpedia: contains 6 out of 35 Dijkstra Prize winners
Google Knowledge Graph: "Points of Interest" – completeness?
Wikidata: knows only 15 employees of Amazon

Slide12

What previous work says

12

"There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know." [Rumsfeld, 2002]

"KB engineers have mainly tried to make KBs bigger. Another point, however, is to understand how much they know." [Marx, 1845]


Slide13

Outline – Assessing KB recall

Logical foundations
Data mining
Information extraction

Comparative coverage

13

Slide14

Outline – Assessing KB recall

Logical foundations
Data mining
Information extraction

Comparative coverage

14

Slide15

Closed and open-world assumption

won

name

award

Brad Pitt

Oscar

Einstein

Nobel Prize

Berners-Lee

Turing

Award

15

won(

BradPitt

, Oscar)?

won(Pitt, Nobel Prize)?

Closed-world

assumption

Open-world

assumption

Databases

traditionally employ

closed-world assumption

KBs (semantic web)

necessarily operate under

open-world assumption

 Yes

 Yes

 No

Maybe
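To make the contrast concrete, here is a minimal sketch in Python; the facts and helper names are illustrative, not from the talk. The same fact set answers the same query differently under the two assumptions.

```python
# Minimal sketch of query answering under the closed- vs. open-world assumption.
# Facts and function names are illustrative only.

facts = {
    ("Brad Pitt", "won", "Oscar"),
    ("Einstein", "won", "Nobel Prize"),
    ("Berners-Lee", "won", "Turing Award"),
}

def ask_cwa(s, p, o):
    # Closed world: anything not stated is false.
    return "Yes" if (s, p, o) in facts else "No"

def ask_owa(s, p, o):
    # Open world: anything not stated is merely unknown.
    return "Yes" if (s, p, o) in facts else "Maybe"

print(ask_cwa("Brad Pitt", "won", "Oscar"))        # Yes
print(ask_cwa("Brad Pitt", "won", "Nobel Prize"))  # No
print(ask_owa("Brad Pitt", "won", "Nobel Prize"))  # Maybe
```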

Slide16

Open-world assumption

Q: Game of Thrones directed by Shakespeare? KB: Maybe
Q: Brad Pitt works at Amazon? KB: Maybe
Q: Trump brother of Kim Jong Un? KB: Maybe

16

World-aware AI?

Practically useful paradigm?

Slide17

The logician's way out

Need the power to express both "maybe" and "no"
= Partial-closed-world assumption

Approach: completeness statements [Motro 1989]
These statements are cool [VLDB'11, CIKM'12, SIGMOD'15]

Completeness statement: wonAward is complete for Nobel Prizes

won:
name | award
Brad Pitt | Oscar
Einstein | Nobel Prize
Berners-Lee | Turing Award

won(Pitt, Oscar)? → Yes
won(Pitt, Nobel)? → No
won(Pitt, Turing)? → Maybe

17
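A rough sketch of how a completeness statement turns selected "maybe" answers into definite "no" answers. Representing a statement as a set of covered award objects is a simplification assumed here for illustration, not the formalism from the cited papers.

```python
# Sketch: partial-closed-world query answering with a completeness statement
# saying that the 'won' relation is complete for Nobel Prizes.

facts = {
    ("Brad Pitt", "won", "Oscar"),
    ("Einstein", "won", "Nobel Prize"),
    ("Berners-Lee", "won", "Turing Award"),
}

# Objects for which the 'won' relation is declared complete (simplified form).
complete_for_objects = {"Nobel Prize"}

def ask_pcwa(s, p, o):
    if (s, p, o) in facts:
        return "Yes"
    # If completeness is declared for this object, absence means "No".
    if p == "won" and o in complete_for_objects:
        return "No"
    return "Maybe"

print(ask_pcwa("Brad Pitt", "won", "Oscar"))         # Yes
print(ask_pcwa("Brad Pitt", "won", "Nobel Prize"))   # No
print(ask_pcwa("Brad Pitt", "won", "Turing Award"))  # Maybe
```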

Slide18

Where would completeness statements come from?

Data creators should pass them along as metadata
Or editors should add them in curation steps

Developed COOL-WD (a completeness tool for Wikidata)

18

Slide19

19

Slide20

But…

Requires human effort
Editors are lazy
Automatically created KBs do not even have editors

Remainder of this talk: how to automatically acquire information about KB completeness/recall

20

Slide21

Outline – Assessing KB recall

Logical foundations
Data mining
Information extraction

Comparative coverage

21

Slide22

Data mining: Idea (1/2)

Certain patterns in data hint at completeness/incompleteness:

People with a death date but no death place are incomplete for death place
People with fewer than two parents are incomplete for parents
Movies with a producer are complete for directors

22

Slide23

Data mining: Idea (2/2)

Examples can be expressed as Horn rules:

dateOfDeath(X, Y) ∧ lessThan1(X, placeOfDeath) ⇒ incomplete(X, placeOfDeath)
lessThan2(X, hasParent) ⇒ incomplete(X, hasParent)
movie(X) ∧ producer(X, Z) ⇒ complete(X, director)

Can such patterns be discovered with association rule mining?

23

Slide24

Rule mining: Implementation

We extended the AMIE association rule mining system with meta-predicates on:
Complete/incomplete: complete(X, director)
Object counts: lessThan2(X, hasParent)

Then mined rules with complete/incomplete in the head for 20 YAGO/Wikidata relations

Result: can predict (in-)completeness with 46-100% F1

24

[WSDM'17]
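As a toy illustration of the mining step (not the actual AMIE extension), a candidate rule such as movie(X) ∧ producer(X, Z) ⇒ complete(X, director) can be scored by its support and confidence over labeled training entities; the data below is made up.

```python
# Toy sketch of scoring a completeness rule of the form
#   movie(X) ∧ producer(X, Z) ⇒ complete(X, director)
# This illustrates the idea only; it is not the AMIE-based system from the talk.

# Hypothetical training data: which movie entities have a producer fact,
# and which are labeled complete for 'director'.
has_producer = {"m1": True, "m2": True, "m3": False, "m4": True, "m5": False}
complete_director = {"m1": True, "m2": True, "m3": False, "m4": False, "m5": False}

def rule_quality(body, head):
    # Support: entities satisfying body and head; confidence: support / body size.
    support = sum(1 for x in body if body[x] and head.get(x, False))
    body_size = sum(1 for x in body if body[x])
    confidence = support / body_size if body_size else 0.0
    return support, confidence

support, confidence = rule_quality(has_producer, complete_director)
print(f"support={support}, confidence={confidence:.2f}")
# A rule is kept if its confidence exceeds a chosen threshold.
```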

Slide25

Data mining: Challenges

Consensus:
human(x) ⇒ Complete(x, graduatedFrom)
schoolteacher(x) ⇒ Incomplete(x, graduatedFrom)
professor(x) ⇒ Complete(x, graduatedFrom)
John ∈ {human, schoolteacher, professor} → Complete(John, graduatedFrom)?

Rare properties require very large training data
E.g., US presidents being complete for education: annotated ~3000 rows at 10 ct/row → 0 US presidents

25

Slide26

Outline – Assessing KB recall

Logical foundations
Data mining
Information extraction

Comparative coverage

26

Slide27

IE idea 1: Count information

27

"Barack and Michelle have two children"

KB has 0 children → recall 0%
KB has 1 child → recall 50%
KB has 2 children → recall 100%
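The recall estimate behind these percentages is simply the ratio of facts in the KB to the cardinality stated in text, capped at 1. A minimal sketch (the function name is only for illustration):

```python
# Sketch: estimating recall of one subject's relation from an extracted
# cardinality. The numbers reproduce the Obama example on the slide.

def estimated_recall(kb_facts: int, text_count: int) -> float:
    """Recall estimate = facts in KB / cardinality stated in text, capped at 1."""
    if text_count <= 0:
        return float("nan")  # no usable count information
    return min(1.0, kb_facts / text_count)

text_count = 2  # "Barack and Michelle have two children"
for kb_facts in (0, 1, 2):
    print(kb_facts, "->", f"{estimated_recall(kb_facts, text_count):.0%}")
# 0 -> 0%, 1 -> 50%, 2 -> 100%
```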

Slide28

Count extraction: Implementation

Developed an LSTM-based classifier for identifying numbers that express relation cardinalities

Works for a variety of topics, such as:
Family relations: "has 2 siblings"
Geopolitics: "is composed of seven boroughs"
Artwork: "consists of three episodes"

Counts are sometimes the rule, not the exception
E.g., 178% more children in counts on Wikipedia than as facts in Wikidata

28

[ACL'17+ISWC'18]
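The classifier itself is an LSTM; as a much simpler sketch of the candidate-generation step, one might collect numeric and spelled-out cardinality candidates from a sentence like this. The patterns and word list are assumptions for illustration, not the system's actual rules.

```python
import re

# Sketch: collecting cardinality candidates from a sentence. The real system
# runs a trained classifier over such candidates; the patterns here are
# illustrative assumptions only.

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def cardinality_candidates(sentence: str):
    candidates = []
    for token in re.findall(r"[A-Za-z]+|\d+", sentence.lower()):
        if token.isdigit():
            candidates.append(int(token))
        elif token in NUMBER_WORDS:
            candidates.append(NUMBER_WORDS[token])
    return candidates

print(cardinality_candidates("She has 2 siblings"))                              # [2]
print(cardinality_candidates("He has 3 children from Ivana and one from Marla")) # [3, 1]
```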

Slide29

Count extraction: Details

Cardinalities are frequently expressed non-numerically:
Nouns: "has twins", "is a trilogy"
Indefinite articles: "They have a daughter"
Negation/adjectives: "have no children", "is childless"
→ Extended candidate set

Often requires reasoning:
"He has 3 children from Ivana and one from Marla"
→ Detecting compositional cues

29

Slide30

Idea 2: Recall estimation during IE

→ Which sentence mentions all districts?

Linguistic theory: quantity and relevance are context-dependent [Grice 1975]
The wording matters!
Preliminary results: context-based coverage estimation is possible [EMNLP'19]

30

Slide31

Outline – Assessing KB recall

Logical foundations
Data mining
Information extraction

Comparative coverage

31

Slide32

Comparative coverage: Idea

Date of birth, author, genre, … are single-valued properties:
Having one value ⇒ the property is complete

No need for external metadata – looking at the data alone suffices!

32

Slide33

What are single-valued properties?

33

year

Extreme case, but…
Multiple citizenships
More parents due to adoption
Several Twitter accounts due to the presidency

Slide34

All hope lost?

Presence of a value is better than nothing

Even better: for multi-valued attributes, data is still frequently added in batches
All clubs Diego Maradona played for
All ministers of a new cabinet
…

Checking data presence is a common heuristic among Wikidata editors

34

Slide35

Value presence heuristic - example

[https://www.wikidata.org/wiki/Wikidata:Wikivoyage/Lists/Embassies]

Slide36

Can we automate data presence assessment?

4.1: Which properties to look at?
4.2: How to quantify data presence?

36

Slide37

4.1: Which properties to look at? (1/2)

Coverage(Wikidata for Putin)?
There are more than 3000 properties one can assign to Putin…
Are at least all relevant properties there?

What do you mean by relevant?

37

Slide38

4.1: Which properties to look at? (2/2)

Crowd-based property relevance task [ADMA'17]:
State of the art (itemset mining) gets 61% of high-agreement triples right – it mistakes frequency for interestingness
Our weakly supervised text model achieves 75%

38

Slide39

4.2: How to quantify data presence?

We have values for 46 out of 77 relevant properties for Putin
→ Hard to interpret

Proposal: quantify based on comparison with other similar entities

Ingredients:
Similarity metric: who is similar to Trump?
Data quantification: how much data is good/bad?

Deployed in Wikidata as the Relative Completeness Indicator (Recoin)

39

[ESWC'17]
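A rough sketch of the comparison idea behind Recoin; the peer group, property sets, and scoring below are illustrative assumptions, not Recoin's actual implementation. The entity is scored by how many of the properties that are frequent among similar entities it actually has.

```python
# Sketch of relative completeness: compare an entity's filled properties
# against properties that are frequent among similar entities (e.g., same
# class). Data and scoring are illustrative only.

from collections import Counter

# Hypothetical peer group: property sets of similar entities.
peers = [
    {"dateOfBirth", "spouse", "educatedAt", "position", "award"},
    {"dateOfBirth", "spouse", "educatedAt", "position"},
    {"dateOfBirth", "educatedAt", "position", "award"},
]

def relative_completeness(entity_props, peers, min_freq=0.5):
    counts = Counter(p for peer in peers for p in peer)
    # "Expected" properties: present on at least min_freq of the peers.
    expected = {p for p, c in counts.items() if c / len(peers) >= min_freq}
    covered = len(expected & entity_props)
    return covered / len(expected) if expected else 1.0

entity = {"dateOfBirth", "position"}
print(f"{relative_completeness(entity, peers):.0%} of expected properties present")
```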

Slide40

40

Slide41

41

Slide42

Outline – Assessing KB recall

Logical foundations
Data mining
Information extraction

Comparative coverage

Summary

42

Slide43

Acknowledgement

43

Slide44

Summary (1/2)

Increasing KB quality can be noticed downstream
Precision easy to evaluate
Recall largely unknown

44

Slide45

Summary (2/2)

Proposal: make recall information a first-class citizen of KBs

Methods for obtaining recall information:
Supervised data mining
Numeric or context-based text extraction
Comparative data presence

45

Questions?

Slide46

Relevance (1/3): IE resource efficiency

Districts(Hong Kong) = Wan Chai, Kowloon City, Yau Tsim Mong
→ Coverage = low → explore more resources

Districts(NY) = Manhattan, Bronx, Queens, Brooklyn, Staten Island
→ Coverage = high → stop further extraction

46
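As a sketch of the decision rule, an extractor can keep consulting new sources only while the estimated coverage stays below a target; the sources, the extractor, and the coverage estimate below are placeholders, not parts of an actual system.

```python
# Sketch: stop pulling in new sources once estimated coverage of a relation
# is high enough. Sources and the coverage estimate are placeholders.

def extract_districts(source: str) -> set:
    # Placeholder for a real extractor run on one source document.
    return {
        "wiki": {"Manhattan", "Bronx", "Queens"},
        "travel-guide": {"Brooklyn", "Staten Island", "Queens"},
        "news": {"Manhattan"},
    }.get(source, set())

def estimated_coverage(found: set, expected_count: int) -> float:
    # expected_count could come from an extracted cardinality ("five boroughs").
    return min(1.0, len(found) / expected_count)

sources = ["wiki", "travel-guide", "news"]
found, target = set(), 0.9
for source in sources:
    found |= extract_districts(source)
    if estimated_coverage(found, expected_count=5) >= target:
        break  # high coverage -> stop further extraction
print(found)
```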

Slide47

Relevance (2/3): Adjust IE thresholds

Extractions with confidences:
District(HK, Wan Chai) – confidence 0.93
District(HK, Kowloon City) – confidence 0.86
District(HK, Yau Tsim Mong) – confidence 0.74
District(HK, Macao) – confidence 0.67
…

Text evidence: "HK consists of the districts Wan Chai, …, …, …, … and …" → coverage 0.98

With this coverage estimate, the accept/reject cut-off can be adjusted: accept the top extractions, reject the spurious District(HK, Macao)
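One way to read the example as code (a sketch; the confidences are the slide's, the acceptance rule is an assumption): when the text suggests near-complete coverage at k values, keep the top-k extractions by confidence and reject the rest, instead of applying one fixed global confidence threshold.

```python
# Sketch: use an expected count derived from text to set the acceptance
# cut-off, rather than a fixed confidence threshold. Confidences are from
# the slide; the acceptance rule itself is an illustrative assumption.

extractions = [
    ("Wan Chai", 0.93),
    ("Kowloon City", 0.86),
    ("Yau Tsim Mong", 0.74),
    ("Macao", 0.67),
]

def accept_with_expected_count(extractions, expected_count):
    ranked = sorted(extractions, key=lambda x: x[1], reverse=True)
    return ranked[:expected_count], ranked[expected_count:]

# Suppose the covering sentence enumerates three districts for this snippet.
accepted, rejected = accept_with_expected_count(extractions, expected_count=3)
print("accept:", [d for d, _ in accepted])  # Wan Chai, Kowloon City, Yau Tsim Mong
print("reject:", [d for d, _ in rejected])  # Macao
```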

47

Slide48

Relevance (3/3): QA negation and completeness

Which US presidents were married only once?

Which countries participated in no UN mission?

For which cities do we know all districts?

Without coverage awareness, QA systems cannot answer these

Focus of our research [SIGMOD'15, WSDM'17, ACL'17, ISWC'18, …]

48